Microsoft Cognitive Services Vision API in R
(This article was first published on RProjects – Stoltzmaniac, and kindly contributed to Rbloggers)
Microsoft Cognitive Services Vision API in RA little while ago I did a brief tutorial of the Google Vision API using RoogleVision created by Mark Edmonson. I couldn’t find anything similar to that in R for the Microsoft Cognitive Services API so I thought I would give it a shot. I whipped this example together quickly to give it a proofofconcept but I could certainly see myself building an R package to support this (unless someone can point to one – and please do if one exists)!
A quick example, sending this image retrieved the location of the human face and created a caption! Here’s my dog lined up next to his doppelganger:
The API is extremely easy to access using RCurl and httr. There are a lot of options which can be accessed. In this example, I’ll just cover the basics of image detection and descriptions.
If you don’t want to spend time writing a bunch of code, you can simply use the “helper_functions.R” file I created and swap out your credentials and API endpoint to get it working.
Getting Started With Microsoft Cognitive ServicesIn order to get started, all you need is an Azure Account which is free if you can keep yourself under certain thresholds and limits. There is even a free trial period (at the time this was written, at least).
Once that is taken care of there are a few things you need to do:
 Login to portal.azure.com
 On the lefthand menu click “Add”
 Click on “AI + Cognitive Services” and then the “Computer Vision API”
 Fill out the information required. You may have “Free Trial” under Subscription. Pay special attention to Location because this will be used in your API script
 In the lefthand menu, click “Keys” underneath “Resource Management” and you will find what you need for credentials. Underneath your Endpoint URL, click on “Show access keys…” – copy your key and use it in your script (do not make this publicly accessible)
 You’re ready to go!
Now, you can write a script to utilize the power of the Microsoft Cognitive Vision API.
What data can you get?There are a lot of features you can request. I’m only asking for: description, tags, categories, and faces. You can also return: image type, color, and adult. There are also details which can be returned such as: landmarks and celebrities.
Here is the setup and API call:
image_url = 'https://imgur.com/rapIn0u.jpg' visualFeatures = "Description,Tags,Categories,Faces" # options = "Categories, Tags, Description, Faces, ImageType, Color, Adult" details = "Landmarks" # options = Landmarks, Celebrities reqURL = paste(api_endpoint_url, "?visualFeatures=", visualFeatures, "&details=", details, sep="") APIresponse = POST(url = reqURL, content_type('application/json'), add_headers(.headers = c('OcpApimSubscriptionKey' = api_key)), body=list(url = image_url), encode = "json") df = content(APIresponse)The dataframe returned looks messy, but isn’t too bad once you dive in. Take a look:
str(df) ## List of 5 ## $ tags :List of 6 ## ..$ :List of 2 ## .. ..$ name : chr "dog" ## .. ..$ confidence: num 0.987 ## ..$ :List of 3 ## .. ..$ name : chr "mammal" ## .. ..$ confidence: num 0.837 ## .. ..$ hint : chr "animal" ## ..$ :List of 2 ## .. ..$ name : chr "looking" ## .. ..$ confidence: num 0.814 ## ..$ :List of 2 ## .. ..$ name : chr "animal" ## .. ..$ confidence: num 0.811 ## ..$ :List of 2 ## .. ..$ name : chr "posing" ## .. ..$ confidence: num 0.54 ## ..$ :List of 2 ## .. ..$ name : chr "staring" ## .. ..$ confidence: num 0.165 ## $ description:List of 2 ## ..$ tags :List of 18 ## .. ..$ : chr "dog" ## .. ..$ : chr "mammal" ## .. ..$ : chr "looking" ## .. ..$ : chr "animal" ## .. ..$ : chr "photo" ## .. ..$ : chr "posing" ## .. ..$ : chr "camera" ## .. ..$ : chr "man" ## .. ..$ : chr "standing" ## .. ..$ : chr "smiling" ## .. ..$ : chr "face" ## .. ..$ : chr "white" ## .. ..$ : chr "holding" ## .. ..$ : chr "close" ## .. ..$ : chr "wearing" ## .. ..$ : chr "laying" ## .. ..$ : chr "head" ## .. ..$ : chr "teeth" ## ..$ captions:List of 1 ## .. ..$ :List of 2 ## .. .. ..$ text : chr "a close up of Albert Einstein and a dog posing for the camera" ## .. .. ..$ confidence: num 0.892 ## $ requestId : chr "2143e23a14c847c497509bfc82381512" ## $ metadata :List of 3 ## ..$ width : int 824 ## ..$ height: int 824 ## ..$ format: chr "Jpeg" ## $ faces :List of 1 ## ..$ :List of 3 ## .. ..$ age : int 73 ## .. ..$ gender : chr "Male" ## .. ..$ faceRectangle:List of 4 ## .. .. ..$ left : int 505 ## .. .. ..$ top : int 241 ## .. .. ..$ width : int 309 ## .. .. ..$ height: int 309 The top 5 descriptions returned are: description_tags = df$description$tags description_tags_tib = tibble(tag = character()) for(tag in description_tags){ for(text in tag){ if(class(tag) != "list"){ ## To remove the extra caption from being included tmp = tibble(tag = tag) description_tags_tib = description_tags_tib %>% bind_rows(tmp) } } } knitr::kable(description_tags_tib[1:5,]) tag dog mammal looking animal photoThe caption returned:
captions = df$description$captions captions_tib = tibble(text = character(), confidence = numeric()) for(caption in captions){ tmp = tibble(text = caption$text, confidence = caption$confidence) captions_tib = captions_tib %>% bind_rows(tmp) } knitr::kable(captions_tib) text confidence a close up of Albert Einstein and a dog posing for the camera 0.891846 The metadata returned: metadata = df$metadata metadata_tib = tibble(width = metadata$width, height = metadata$height, format = metadata$format) knitr::kable(metadata_tib) width height format 824 824 Jpeg The locations of faces returned: faces = df$faces faces_tib = tibble(faceID = numeric(), age = numeric(), gender = character(), x1 = numeric(), x2 = numeric(), y1 = numeric(), y2 = numeric()) n = 0 for(face in faces){ n = n + 1 tmp = tibble(faceID = n, age = face$age, gender = face$gender, x1 = face$faceRectangle$left, y1 = face$faceRectangle$top, x2 = face$faceRectangle$left + face$faceRectangle$width, y2 = face$faceRectangle$top + face$faceRectangle$height) faces_tib = faces_tib %>% bind_rows(tmp) } #faces_tib knitr::kable(faces_tib) faceID age gender x1 x2 y1 y2 1 73 Male 505 814 241 550 A few more examples:As always, you can find the code I used on my GitHub.
Side note: I used a ton of for loops to access the data – not ideal… please let me know if you have better ways of dealing with nested data like this when you have unknown numbers of levels.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: RProjects – Stoltzmaniac. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
13 new R Jobs for R users (20170926) – from all over the world
Just visit this link and post a new R job to the R community.
You can post a job for free (and there are also “featured job” options available for extra exposure).
Current R jobsJob seekers: please follow the links below to learn more and apply for your R job of interest:
Featured Jobs
 FullTime
 Data Scientist
Edelweiss Business Services – Posted by Aakash Gupta  Mumbai
Maharashtra, India  20 Sep 2017

 FullTime
 Solutions Engineer for RStudio @ U.S.A. (or anywhere)
RStudio – Posted by nwstephens  Anywhere
 8 Sep 2017

 Freelance
 R Programmer & Statistician Empiricus – Posted by empiricus
 Anywhere
 25 Sep 2017

 FullTime
 Data Scientist Edelweiss Business Services – Posted by Aakash Gupta
 Mumbai Maharashtra, India
 20 Sep 2017

 FullTime
 Postdoctoral Data Scientist: GIS and mHealth Harvard T.H. Chan School of Public Health – Posted by Christine Choirat
 Boston Massachusetts, United States
 14 Sep 2017

 FullTime
 Data Scientist @ London Hiscox – Posted by alan.south
 London England, United Kingdom
 13 Sep 2017

 Freelance
 Crafty developer: R>C++ SherandoResearch – Posted by feersum_monkey
 Anywhere
 8 Sep 2017

 FullTime
 Solutions Engineer for RStudio @ U.S.A. (or anywhere) RStudio – Posted by nwstephens
 Anywhere
 8 Sep 2017

 FullTime
 Senior Data Scientist @ Boston, Massachusetts, U.S. Colaberry Data Analytics – Posted by Colaberry_DataAnalytics
 Boston Massachusetts, United States
 8 Sep 2017

 PartTime
 CranR Programmierer Praktikum @ Zürich, Switzerland Fisch Asset Management AG – Posted by HR_Fisch
 Zürich Zürich, Switzerland
 4 Sep 2017

 Freelance
 Data analyst at talmix @ London, UK Talmix – Posted by manikauluck
 London England, United Kingdom
 4 Sep 2017

 FullTime
 R Developer and Open Source Community Manager at UCLA @ Los Angeles, California UCLA – Posted by graemeblair
 Los Angeles California, United States
 4 Sep 2017

 FullTime
 Work in Bodrum: Fullstack web developer with R experience @ Muğla, Turkey PranaGEO Ltd. – Posted by acizmeli
 Muğla Turkey
 4 Sep 2017

 FullTime
 Digital Performance & Data Science Section Manager @ Paris, France Nissan Automotive Europe – Posted by Estefania Rendon
 Paris ÎledeFrance, France
 31 Aug 2017

 Freelance
 Quantitative Analyst (Credit) @ Zürich, Switzerland ipub – Posted by gluc
 Zürich Zürich, Switzerland
 29 Aug 2017
In Rusers.com you can see all the R jobs that are currently available.
Rusers ResumesRusers also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.
(you may also look at previous R jobs posts ).
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));R live class  R for Data Science  Oct 45 Milan
(This article was first published on R blog  Quantide  R training & consulting, and kindly contributed to Rbloggers)
R for Data Science is our first course of the autumn term. It takes place in October 45 in a location close to Milano Lima.
If you want to deepen your data analysis knowledge using the most modern R tools, or you want to figure out if R is the right solution for you, this is the your class. This course is for people that are willing to learn about R and would like to get an overview of its capabilities for data science or those that have very little knowledge of R.
R for Data Science Outline– A bit of R history and online resources
– R and RStudio installation and configuration
– Your first R session
– Your first R markdown document
– R objects: data and functions
– Data import from external sources: excel and database connection
– Data manipulation with tidyverse
– Tidy data with tidyr
– Data visualization using ggplot
– An overview of statistical models and data mining with R
The course is for max 6 attendees.
R for Data Science is organized by the R training and consulting company Quantide and is taught in Italian, while all the course materials are in English.
LocationThe course location is 550 mt. (7 minutes on walk) from Milano central station and just 77 mt. (1 minute on walk) from Lima subway station.
RegistrationIf you want to reserve a seat go to: FAQ, detailed program and tickets.
Other R courses  Autumn termYou can find an overview of all our courses here. Next dates will be:
 October 1112: Statistics for Data Science. Develop a wide variety of statistical models with R, from the simplest Linear Regression to the most sophisticated GLM models. Reserve now!
 October 2526: Machine Learning with R. Find patterns in large data sets using the R tools for Dimensionality Reduction, Clustering, Classification and Prediction. Reserve now!
 November 78: Data Visualization and Dashboard with R. Show the story behind your data: create beautiful effective visualizations and interactive Shiny dashboards. Reserve now!
 November 2122: R with Database and Big Data. From databases to distributed infrastructure, master the R techniques to handle and query Big Data. Reserve now!
 November 2930: Professional R Programming. Organise, document and test your code: write efficient functions, improve the code reproducibility and build R packages. Reserve now!
In case you are a group of people interested in more than one class, write us at training[at]quantide[dot]com! We can arrange together a tailormade course, picking all the topics that are interesting for your organization and dropping the rest.
The post R live class  R for Data Science  Oct 45 Milan appeared first on Quantide – R training & consulting.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R blog  Quantide  R training & consulting. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
When are Citi Bikes Faster than Taxis in New York City?
(This article was first published on Category: R  Todd W. Schneider, and kindly contributed to Rbloggers)
Every day in New York City, millions of commuters take part in a giant race to determine transportation supremacy. Cars, bikes, subways, buses, ferries, and more all compete against one another, but we never get much explicit feedback as to who “wins.” I’ve previously written about NYC’s public taxi data and Citi Bike share data, and it occurred to me that these datasets can help identify who’s fastest, at least between cars and bikes. In fact, I’ve built an interactive guide that shows when a Citi Bike is faster than a taxi, depending on the route and the time of day.
The methodology and findings will be explained more below, and all code used in this post is available opensource on GitHub.
Interactive guide to when taxis are faster or slower than Citi BikesPick a starting neighborhood and a time. The map shows whether you’d expect get to each neighborhood faster with a taxi (yellow) or a Citi Bike (dark blue).
Starting neighborhood
Alphabet CityBattery ParkBattery Park CityBloomingdaleCentral ParkChinatownClinton EastClinton WestEast ChelseaEast Harlem SouthEast VillageFinancial District NorthFinancial District SouthFlatironGarment DistrictGramercyGreenwich Village NorthGreenwich Village SouthHudson SqKips BayLenox Hill EastLenox Hill WestLincoln Square EastLincoln Square WestLittle Italy/NoLiTaLower East SideManhattan ValleyMeatpacking/West Village WestMidtown CenterMidtown EastMidtown NorthMidtown SouthMurray HillPenn Station/Madison Sq WestSeaportSoHoStuy Town/Peter Cooper VillageSutton Place/Turtle Bay NorthTimes Sq/Theatre DistrictTriBeCa/Civic CenterTwo Bridges/Seward ParkUN/Turtle Bay SouthUnion SqUpper East Side NorthUpper East Side SouthUpper West Side NorthUpper West Side SouthWest Chelsea/Hudson YardsWest VillageWorld Trade CenterYorkville EastYorkville West
BedfordBoerum HillBrooklyn HeightsBrooklyn Navy YardBushwick SouthCarroll GardensClinton HillCobble HillColumbia StreetCrown Heights NorthDowntown Brooklyn/MetroTechDUMBO/Vinegar HillEast WilliamsburgFort GreeneGowanusGreenpointPark SlopeProspect HeightsProspect ParkRed HookSouth WilliamsburgStuyvesant HeightsSunset Park WestWilliamsburg (North Side)Williamsburg (South Side)
Long Island City/Hunters PointLong Island City/Queens PlazaQueensbridge/RavenswoodSunnyside
8:00 AM–11:00 AM
11:00 AM–4:00 PM
4:00 PM–7:00 PM
7:00 PM–10:00 PM
10:00 PM–8:00 AM From Midtown East, weekdays 8:00 AM–11:00 AM
Taxi vs. Citi Bike travel times to other neighborhoods
Turn on javascript (or click through from RSS) to view the interactive taxi vs. Citi Bike map.
Data via NYC TLC and Citi BikeBased on trips 7/1/2016–6/30/2017
toddwschneider.com
Hover over a neighborhood (tap on mobile) to view travel time stats
40% of weekday taxi trips—over 50% during peak hours—would be faster as Citi Bike ridesI estimate that 40% of weekday taxi trips within the Citi Bike service area would expect to be faster if switched to a Citi Bike, based on data from July 2016 to June 2017. During peak midday hours, more than 50% of taxi trips would expect to be faster as Citi Bike rides.
There are some significant caveats to this estimate. In particular, if many taxi riders simultaneously switched to Citi Bikes, the bike share system would probably hit severe capacity constraints, making it difficult to find available bikes and docks. Increased bike usage might eventually lead to fewer vehicles on the road, which could ease vehicle congestion, and potentially increase bike lane congestion. It’s important to acknowledge that when I say “40% of taxi trips would be faster if they switched to Citi Bikes”, we’re roughly considering the decision of a single ablebodied person, under the assumption that everyone else’s behavior will remain unchanged.
Heading crosstown in Manhattan? Seriously consider taking a bike instead of a car!Crosstown Manhattan trips are generally regarded as more difficult than their northsouth counterparts. There are fewer subways that run crosstown, and if you take a car, the narrower eastwest streets often feel more congested than the broad northsouth avenues with their synchronized traffic lights. Crosstown buses are so notoriously slow that they’ve been known to lose races against tricycles.
I divided Manhattan into the crosstown zones pictured above, then calculated the taxi vs. Citi Bike win rate for trips that started and ended within each zone. Taxis fare especially badly in the Manhattan central business district. If you take a midday taxi that both starts and ends between 42nd and 59th streets, there’s over a 70% chance that the trip would have been faster as a Citi Bike ride.
Keep in mind that’s for all trips between 42nd and 59th streets. For some of the longest crosstown routes, for example, from the United Nations on the far east side to Hell’s Kitchen on the west, Citi Bikes beat taxis 90% of the time during the day. It’s worth noting that taxis made 8 times as many trips as Citi Bikes between 42nd and 59th streets from July 2016 to June 2017—almost certainly there would be less total time spent in transit if some of those taxi riders took bikes instead.
Hourly graphs for all of the crosstown zones are available here, and here’s a summary table for weekday trips between 8:00 AM and 7:00 PM:
Manhattan crosstown zone % taxis lose to Citi Bikes 96th–110th 41% 77th–96th 36% 59th–77th 54% 42nd–59th 69% 14th–42nd 64% Houston–14th 54% Canal–Houston 60% Below Canal 55%A reminder that this analysis restricts to trips that start and end within the same zone, so for example a trip from 23rd St to 57th St would be excluded because it starts and ends in different zones.
Taxis fare better for trips that stay on the east or west sides of Manhattan: 35% of daytime taxi trips that start and end west of 8th Avenue would expect to be faster as Citi Bike trips, along with 38% of taxi trips that start and end east of 3rd Avenue. Taxis also generally beat Citi Bikes on longer trips:
Taxis are losing more to Citi Bikes over timeWhen the Citi Bike program began in July 2013, less than half of weekday daytime taxi trips would have been faster if switched to Citi Bikes. I ran a monthbymonth analysis to see how the taxi vs. Citi Bike calculus has changed over time, and discovered that taxis are getting increasingly slower compared to Citi Bikes:
Note that this monthbymonth analysis restricts to the original Citi Bike service area, before the program expanded in August 2015. The initial expansion was largely into Upper Manhattan and the outer boroughs, where taxis generally fare better than bikes, and so to keep things consistent, I restricted the above graph to areas that have had Citi Bikes since 2013.
Taxis are losing more to Citi Bikes over time because taxi travel times have gotten slower, while Citi Bike travel times have remained roughly unchanged. I ran a pair of linear regressions to model travel times as a function of:
 trip distance
 time of day
 precipitation
 whether the route crosses between Manhattan and the outer boroughs
 month of year
 year
The regression code and output are available on GitHub: taxi, Citi Bike
As usual, I make no claim that this is a perfect model, but it does account for the basics, and if we look at the coefficients by year, it shows that, holding the other variables constant, a taxi trip in 2017 took 17% longer than the same trip in 2009. For example, a weekday morning trip from Midtown East to Union Square that took 10 minutes in 2009 would average 11:45 in 2017.
The same regression applied to Citi Bikes shows no such slowdown over time, in fact Citi Bikes got slightly faster. The regressions also show that:
 Citi Bike travel times are less sensitive to time of day than taxi travel times. A peak midday taxi trip averages 40% longer than the same trip at offpeak hours, while a peak Citi Bike trip averages 15% longer than during offpeak hours.
 Rainy days are associated with 2% faster Citi Bike travel times and 1% slower taxi travel times.
 For taxis, fall months have the slowest travel times, but for Citi Bikes, summer has the slowest travel times. For both, January has the fastest travel times.
It’s one thing to say that 50% of midday taxi trips would be faster as Citi Bike rides, but how much does that vary from day to day? You could imagine there are some days with severe road closures, where more nimble bikes have an advantage getting around traffic, or other days in the dead of summer, when taxis might take advantage of the less crowded roads.
I ran a more granular analysis to measure win/loss rates for individual dates. Here’s a histogram of the taxi loss rate—the % of taxi trips we’d expect to be faster if switched to Citi Bikes—for weekday afternoon trips from July 2016 to June 2017:
Many days see a taxi loss rate of just over 50%, but there are tails on both ends, indicating that some days tilt in favor of either taxis or Citi Bikes. I was curious if we could learn anything from the outliers on each end, so I looked at individual dates to see if there were any obvious patterns.
The dates when taxis were the fastest compared to Citi Bikes look like dates that probably had less traffic than usual. The afternoon with the highest taxi win rate was Monday, October 3, 2016, which was the Jewish holiday of Rosh Hashanah, when many New Yorkers would have been home from work or school. The next 3 best days for taxis were all Mondays in August, when I’d imagine a lot of people were gone from the city on vacation.
The top 4 dates where Citi Bikes did best against taxis were all rainy days in the fall of 2016. I don’t know why rainy days make bikes faster relative to taxis, maybe rain causes traffic on the roads that disproportionately affects cars, but it’s also possible that there’s a selection bias. I’ve written previously about how the weather predicts Citi Bike ridership, and not surprisingly there are fewer riders when it rains. Maybe the folks inclined to ride bikes when it’s raining are more confident cyclists, who also pedal faster when the weather is nice. It’s also possible that rainyday cyclists are particularly motivated to pedal faster so they can get out of the rain. I don’t know if these are really the causes, but they at least sound believable, and would explain the observed phenomenon.
When the President is in town, take a bikeJune 8, 2016 was a particularly good day for Citi Bikes compared to taxis. President Obama came to town that afternoon, and brought the requisite street closures with him. I poked around a bit looking for the routes that appeared to be the most impacted by the President’s visit, and came to afternoon trips from Union Square to Murray Hill. On a typical weekday afternoon, taxis beat Citi Bikes 57% of the time from Union Square to Murray Hill, but on June 8, Citi Bikes won 90% of the time. An even more dramatic way to see the Obama effect is to look at daily median travel times:
A typical afternoon taxi takes 8 minutes, but on June 8, the median was over 21 minutes. The Citi Bike median travel time is almost always 9 minutes, including during President Obama’s visit.
The same graph shows a similar phenomenon on September 19, 2016, when the annual United Nations General Assembly shut down large swathes of Manhattan’s east side, including Murray Hill. Although the impact was not as severe as during President Obama’s visit, the taxi median time doubled on September 19, while the Citi Bike median time again remained unchanged.
The morning of June 15, 2016 offers another example, this time on the west side, when an overturned tractor trailer shut down the Lincoln Tunnel for nearly seven hours. Taxi trips from the Upper West Side to West Chelsea, which normally take 15 minutes, took over 35 minutes. Citi Bikes typically take 18 minutes along the same route, and June 15 was no exception. Taxis would normally expect to beat Citi Bikes 67% of the time on a weekday morning, but on June 15, Citi Bikes won over 92% of the time.
These are of course three handpicked outliers, and it wouldn’t be entirely fair to extrapolate from them to say that Citi Bikes are always more resilient than taxis during extreme circumstances. The broader data shows, though, that taxis are more than twice as likely as Citi Bikes to have days when a route’s median time is at least 5 minutes slower than average, and more than 3.5 times as likely to be at least 10 minutes slower, so it really does seem that Citi Bikes are better at minimizing worstcase outcomes.
Why have taxis gotten slower since 2009?The biggest slowdowns in taxi travel times happened in 2014 and 2015. The data and regression model have nothing to say about why taxis slowed down so much over that period, though it might be interesting to dig deeper into the data to see if there are specific regions where taxis have fared better or worse since 2009.
Uber usage took off in New York starting in 2014, reaching over 10,000 vehicles dispatched per week by the beginning of 2015. There are certainly people who blame Uber—and other ridehailing apps like Lyft and Juno—for increasing traffic, but the city’s own 2016 traffic report did not blame Uber for increased congestion.
It’s undoubtedly very hard to do an accurate study measuring ridehailing’s impact on traffic, and I’m especially wary of people on both sides who have strong interests in blaming or exonerating the ridehailing companies. Nevertheless, if I had to guess the biggest reasons taxis got particularly slower in 2014 and 2015, I would start with the explosive growth of ridehailing apps, since the timing looks to align, and the publicly available data shows that they account for tens of thousands of vehicles on the roads.
On the other hand, if ridehailing were the biggest cause of increased congestion in 2014 and 2015, it doesn’t exactly make sense that taxi travel times have stabilized a bit in 2016 and 2017, because ridehailing has continued to grow, and while taxi usage continues to shrink, the respective rates of growth and shrinkage are not very different in 2016–17 than they were in 2014–15. One explanation could be that starting in 2016 there was a reduction in other types of vehicles—traditional black cars, private vehicles, etc.—to offset ridehailing growth, but I have not seen any data to support (or refute) that idea.
There are also those who blame bike lanes for worsening vehicle traffic. Again, different people have strong interests arguing both sides, but it seems like there are more data points arguing that bike lanes do not cause traffic (e.g. here, here, and here) than vice versa. I wasn’t able to find anything about the timing of NYC bike lane construction to see how closely it aligns with the 2014–15 taxi slowdown.
Lots of other factors could have contributed to worsening traffic: commuteradjusted population growth, subway usage, decaying infrastructure, construction, and presidential residences are just a few that feel like they could be relevant. I don’t know the best way to account for all of them, but it does seem like if you want to get somewhere in New York quickly, it’s increasingly less likely that a car is your best option.
How representative are taxis and Citi Bikes of all cars and bikes?I think it’s not a terrible assumption that taxis are representative of typical car traffic in New York. If anything, maybe taxis are faster than average cars since taxi drivers are likely more experienced—and often aggressive—than average drivers. On the other hand, taxi drivers seem anecdotally less likely to use a trafficenabled GPS, which maybe hurts their travel times.
Citi Bikes are probably slower than privatelyowned bikes. Citi Bikes are designed to be heavy and stable, which maybe makes them safer, but lowers their speeds. Plus, I’d guess that biking enthusiasts, who might be faster riders, are more likely to ride their own higherperformance bikes. Lastly, Citi Bike riders might have to spend extra time at the end of a trip looking for an available dock, whereas privatelyowned bikes have more parking options.
Weighing up these factors, I would guess that if we somehow got the relevant data to analyze the broader question of all cars vs. all bikes, the results would tip a bit in favor of bikes compared to the results of the narrower taxi vs. Citi Bike analysis. It’s also worth noting that both taxis and Citi Bikes have additional time costs that aren’t accounted for in trip durations: you have to hail a taxi, and there might not be a Citi Bike station in the near vicinity of your origin or destination.
What are the implications of all this?One thing to keep in mind is that even though the taxi and Citi Bike datasets are the most conveniently available for analysis, New Yorkers don’t limit their choices to cars and bikes. The subway, despite its poor reputation of late, carries millions of people every day, more than taxis, ridehailing apps, and Citi Bikes combined, so it’s not like “car vs. bike” is always the most relevant question. There are also legitimate reasons to choose a car over a bike—or vice versa—that don’t depend strictly on expected travel time.
Bike usage in New York has increased dramatically over the past decade, probably in large part because people figured out on their own that biking is often the fastest option. Even with this growth, though, the data shows that a lot of people could still save precious time—and minimize their worsecase outcomes—if they switched from cars to bikes. To the extent the city can incentivize that, it strikes me as a good thing.
When Lmageddon comes, take a bikeFor any readers who might be affected by the L train’s planned 2019 closure, if you only remember one thing from this post: Citi Bikes crush taxis when traveling from Williamsburg to just about anywhere in Manhattan during morning rush hour!
GitHubThe code for the taxi vs. Citi Bike analysis is available here as part of the nyctaxidata repo. Note that parts of the analysis also depend on loading the data from the nyccitibikedata repo.
The dataTaxi trip data is available since January 2009, Citi Bike data since July 2013. I filtered each dataset to make the analysis closer to an applestoapples comparison—see the GitHub repo for a more complete description of the filtering—but in short:
 Restrict both datasets to weekday trips only
 Restrict Citi Bike dataset to subscribers only, i.e. no daily pass customers
 Restrict taxi dataset to trips that started and ended in areas with Citi Bike stations, i.e. where taking a Citi Bike would have been a viable option
Starting in July 2016, perhaps owing to privacy concerns, the TLC stopped providing latitude and longitude coordinates for every taxi trip. Instead, the TLC now divides the city into 263 taxi zones (map), and provides the pickup and drop off zones for every trip. The analysis then makes the assumption that taxis and Citi Bikes have the same distribution of trips within a single zone, see GitHub for more.
80% of taxi trips start and end within zones that have Citi Bike stations, and the filtered dataset since July 2013 contains a total of 330 million taxi trips and 27 million Citi Bike trips. From July 1, 2016 to June 30, 2017—the most recent 12 month period of available data—the filtered dataset includes 68 million taxi trips and 9 million Citi Bike trips.
MethodologyI wrote a Monte Carlo simulation in R to calculate the probability that a Citi Bike would be faster than a taxi for a given route. Every trip is assigned to a bucket, where the buckets are picked so that trips within a single bucket are fairly comparable. The bucket definitions are flexible, and I ran many simulations with different bucket definitions, but one sensible choice might be to group trips by:
 Starting zone
 Ending zone
 Hour of day
For example, weekday trips from the West Village to Times Square between 9:00 AM and 10:00 AM would constitute one bucket. The simulation iterates over every bucket that contains at least 5 taxi and 5 Citi Bike trips, and for each bucket, it draws 10,000 random samples, with replacement, for each of taxi and Citi Bike trips. The bucket’s estimated probability that a taxi is faster than a Citi Bike, call it the “taxi win rate”, is the fraction of samples where the taxi duration is shorter than the Citi Bike duration. You can think of this as 10,000 individual headtohead races, with each race pitting a single taxi trip against a single Citi Bike trip.
Different bucketing and filtering schemes allow for different types of analysis. I ran simulations that bucketed by month to see how win rates have evolved over time, simulations that used only days where it rained, and others. There are undoubtedly more schemes to be considered, and the Monte Carlo methodology should be well equipped to handle them.
$(function() { var desktop = !mobileDevice();
var selected_location_click_handler = [];
var tooltip_x_handlers = [ { events: "@taxi_zone_marks:mouseover", update: "x()  100 < 10 ? 10 : (x()  100 > width  240 ? width  240 : x()  100)" }, { events: "@taxi_zone_marks:mouseout", update: "999" } ]
if (desktop) { $(".mapclickinstructions").show();
selected_location_click_handler = [ { events: "@taxi_zone_marks:click", update: "datum.properties.zone" }, { events: "@taxi_zone_base_marks:click", update: "datum.properties.zone" } ];
tooltip_x_handlers.push({ events: "@taxi_zone_marks:click", update: "999" }); }
if (window.screen && window.screen.width < 450) { var map_width = window.screen.width; var map_height = 520; var map_center = [73.854, 40.737]; var map_scale = 155000; $("#nyctaxizonesmap").css({"marginleft": "12px", "width": "100vw"}); } else { var map_width = 450; var map_height = 640; var map_center = [73.889, 40.75]; var map_scale = 192000; } var vega_spec = { $schema: "https://vega.github.io/schema/vega/v3.0.json", width: map_width, height: map_height, autosize: "none", signals: [ { name: "selected_location", bind: { input: "select", options: ['Alphabet City', 'Battery Park', 'Battery Park City', 'Bedford', 'Bloomingdale', 'Boerum Hill', 'Brooklyn Heights', 'Brooklyn Navy Yard', 'Bushwick South', 'Carroll Gardens', 'Central Park', 'Chinatown', 'Clinton East', 'Clinton Hill', 'Clinton West', 'Cobble Hill', 'Columbia Street', 'Crown Heights North', 'Downtown Brooklyn/MetroTech', 'DUMBO/Vinegar Hill', 'East Chelsea', 'East Harlem South', 'East Village', 'East Williamsburg', 'Financial District North', 'Financial District South', 'Flatiron', 'Fort Greene', 'Garment District', 'Gowanus', 'Gramercy', 'Greenpoint', 'Greenwich Village North', 'Greenwich Village South', 'Hudson Sq', 'Kips Bay', 'Lenox Hill East', 'Lenox Hill West', 'Lincoln Square East', 'Lincoln Square West', 'Little Italy/NoLiTa', 'Long Island City/Hunters Point', 'Long Island City/Queens Plaza', 'Lower East Side', 'Manhattan Valley', 'Meatpacking/West Village West', 'Midtown Center', 'Midtown East', 'Midtown North', 'Midtown South', 'Murray Hill', 'Park Slope', 'Penn Station/Madison Sq West', 'Prospect Heights', 'Prospect Park', 'Queensbridge/Ravenswood', 'Red Hook', 'Seaport', 'SoHo', 'South Williamsburg', 'Stuy Town/Peter Cooper Village', 'Stuyvesant Heights', 'Sunnyside', 'Sunset Park West', 'Sutton Place/Turtle Bay North', 'Times Sq/Theatre District', 'TriBeCa/Civic Center', 'Two Bridges/Seward Park', 'UN/Turtle Bay South', 'Union Sq', 'Upper East Side North', 'Upper East Side South', 'Upper West Side North', 'Upper West Side South', 'West Chelsea/Hudson Yards', 'West Village', 'Williamsburg (North Side)', 'Williamsburg (South Side)', 'World Trade Center', 'Yorkville East', 'Yorkville West'] }, value: "Midtown East", on: selected_location_click_handler }, { name: "time_of_day", bind: { input: "radio", options: ["8:00 AM–11:00 AM", "11:00 AM–4:00 PM", "4:00 PM–7:00 PM", "7:00 PM–10:00 PM", "10:00 PM–8:00 AM"] }, value: "8:00 AM–11:00 AM" }, { name: "hover_area", value: null, on: [ { events: "@taxi_zone_marks:mouseover", update: "datum" }, { events: "@taxi_zone_marks:mouseout", update: "null" } ] }, { name: "tooltip_title", value: null, update: "hover_area ? selected_location + ' to ' + hover_area.properties.zone : ''" }, { name: "tooltip_from", value: null, update: "hover_area ? 'From ' + selected_location : ''" }, { name: "tooltip_to", value: null, update: "hover_area ? 'to ' + hover_area.properties.zone : ''" }, { name: "tooltip_time_of_day", value: null, update: "hover_area ? 'Weekdays ' + time_of_day : ''" }, { name: "tooltip_message", value: null, update: "hover_area ? (hover_area.taxi_win_rate > 0.5 ? 'Taxis' : 'Citi Bikes') + ' beat ' + (hover_area.taxi_win_rate > 0.5 ? 'Citi Bikes' : 'taxis') + ' ' + format((hover_area.taxi_win_rate > 0.5 ? hover_area.taxi_win_rate : 1  hover_area.taxi_win_rate), '0.0%') + ' of the time' : ''" }, { name: "tooltip_taxi_median", value: null, update: "hover_area ? 'Taxi median: ' + timeFormat(1000 * hover_area.taxi_median, '%M:%S') : ''" }, { name: "tooltip_citi_median", value: null, update: "hover_area ? 'Citi Bike median: ' + timeFormat(1000 * hover_area.citi_median, '%M:%S') : ''" }, { name: "tooltip_x", value: 999, on: tooltip_x_handlers }, { name: "tooltip_y", on: [ { events: "@taxi_zone_marks:mouseover", update: "y()  145 < 15 ? y() + 50 : y()  145" } ] } ], data: [ { name: "simulation_results", url: "http://cdn.toddwschneider.com/taxi_vs_citibike/simulation_results.csv", format: {type: "csv", parse: "auto"}, transform: [ { type: "filter", expr: "datum.time_bucket == time_of_day && datum.start_zone == selected_location" } ] }, { name: "east_river_bridges", url: "http://cdn.toddwschneider.com/taxi_vs_citibike/east_river_bridges.json", format: {type: "topojson", feature: "bridges"} }, { name: "taxi_zones", url: "http://cdn.toddwschneider.com/taxi_vs_citibike/taxi_zones_bmq.json", format: {type: "topojson", feature: "trimmed_taxi_zones_geojson"}, transform: [ { type: "lookup", from: "simulation_results", key: "end_taxi_zone_id", fields: ["properties.locationid"], values: ["taxi_win_rate", "taxi_median", "citi_median"], as: ["taxi_win_rate", "taxi_median", "citi_median"] } ] }, { name: "taxi_zones_with_data", source: "taxi_zones", transform: [ { type: "filter", expr: "datum.taxi_win_rate" } ] }, { name: "selected_taxi_zone_origin", source: "taxi_zones", transform: [ { type: "filter", expr: "datum.properties.zone == selected_location" } ] } ], projections: [ { name: "projection", type: "mercator", center: map_center, scale: map_scale } ], scales: [ { name: "color", type: "sequential", domain: [0.05, 0.95], range: {scheme: "viridis"} } ], legends: [ { fill: "color", orient: "bottomright", title: "Taxi win %", format: "0%", type: "gradient", encode: { gradient: { update: { width: {value: 150} } }, title: { enter: { fontSize: {value: 16} } }, labels: { enter: { text: {value: ""} } } } } ], marks: [ { type: "shape", name: "taxi_zone_base_marks", from: {data: "taxi_zones"}, encode: { update: { fill: {value: "#f4f4f4"}, fillOpacity: {value: 0.5}, stroke: {value: "#aaa"}, strokeWidth: {value: 0.2}, zindex: {value: 0} }, }, transform: [ {type: "geoshape", projection: "projection"} ] }, { type: "shape", name: "east_river_bridges_marks", from: {data: "east_river_bridges"}, encode: { update: { stroke: {value: "#777"}, strokeWidth: {value: 2} } }, transform: [ {type: "geoshape", projection: "projection"} ] }, { type: "shape", name: "taxi_zone_marks", from: {data: "taxi_zones_with_data"}, encode: { update: { fill: { scale: "color", field: "taxi_win_rate" }, fillOpacity: {value: 1}, stroke: {value: "#777"}, strokeWidth: {value: 0.2}, zindex: {value: 10} }, hover: { fillOpacity: {value: 0.8}, stroke: {value: "#222"}, strokeWidth: {value: 2}, zindex: {value: 100} } }, transform: [ {type: "geoshape", projection: "projection"} ] }, { type: "shape", name: "selected_taxi_zone_marks", from: {data: "selected_taxi_zone_origin"}, encode: { update: { fill: {value: "#f00"}, fillOpacity: {value: 1}, stroke: {value: "#222"}, strokeWidth: {value: 2}, zindex: {value: 1000} } }, transform: [ {type: "geoshape", projection: "projection"} ] }, { type: "rect", interactive: false, encode: { update: { width: {value: 243}, height: {value: 127}, fill: {value: "#fff"}, fillOpacity: {value: 0.9}, stroke: {value: "#777"}, strokeWidth: {value: 2}, cornerRadius: {value: 2}, x: {signal: "tooltip_x  5"}, y: {signal: "tooltip_y  15"} } } }, { type: "text", interactive: false, encode: { update: { x: {signal: "tooltip_x"}, y: {signal: "tooltip_y"}, fontSize: {value: 14}, text: { signal: "tooltip_from" } } } }, { type: "text", interactive: false, encode: { update: { x: {signal: "tooltip_x"}, y: {signal: "tooltip_y + 16"}, fontSize: {value: 14}, text: { signal: "tooltip_to" } } } }, { type: "text", interactive: false, encode: { update: { x: {signal: "tooltip_x"}, y: {signal: "tooltip_y + 40"}, fontSize: {value: 14}, text: { signal: "tooltip_time_of_day" } } } }, { type: "text", interactive: false, encode: { update: { x: {signal: "tooltip_x"}, y: {signal: "tooltip_y + 66"}, fontSize: {value: 14}, text: { signal: "tooltip_taxi_median" } } } }, { type: "text", interactive: false, encode: { update: { x: {signal: "tooltip_x"}, y: {signal: "tooltip_y + 82"}, fontSize: {value: 14}, text: { signal: "tooltip_citi_median" } } } }, { type: "text", interactive: false, encode: { update: { x: {signal: "tooltip_x"}, y: {signal: "tooltip_y + 106"}, fontSize: {value: 14}, text: { signal: "tooltip_message" } } } }, { type: "text", description: "hack to get the legend labels to work", interactive: false, encode: { update: { x: {value: map_width  168}, y: {value: map_height  15}, text: {value: "0%" + vega.repeat(" ", 39) + "100%"} } } } ] }; var vega_opts = { loader: vega.loader(), logLevel: vega.Warn, renderer: 'canvas' }; var view = new vega.View(vega.parse(vega_spec), vega_opts). initialize('#nyctaxizonesmap'). hover(). run(); $("#nyctaxivscitiselectcontainer").show(); $(".maphoverinstructions").show(); $("#nyctaxivscitiselect").on("change", function() { view.signal("selected_location", $(this).val()).run(); set_title(); }); $("input[name='nyctaxivscititime']").on("change", function() { view.signal("time_of_day", $(this).val()).run(); set_title(); }); $(".williamsburglink").on("click", function() { $("#nyctaxivscitiselect").val("Williamsburg (North Side)"); view.signal("selected_location", "Williamsburg (North Side)").run(); set_title(); }); var $title = $(".taxivscitimaptitle"); var set_title = function() { var new_title = "From " + $("#nyctaxivscitiselect").val() + ", weekdays " + $("input[name='nyctaxivscititime']:checked").val(); $title.html(new_title); }; if (desktop) { $("#nyctaxizonesmap").on("click", function() { vega_val = $("select[name='selected_location']").val(); outer_select = $("#nyctaxivscitiselect"); outer_val = outer_select.val(); if (vega_val !== outer_val) { outer_select.val(vega_val); set_title(); } }); } });
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Category: R  Todd W. Schneider. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Data.Table by Example – Part 1
(This article was first published on R – Mathew Analytics, and kindly contributed to Rbloggers)
For many years, I actively avoided the data.table package and preferred to utilize the tools available in either base R or dplyr for data aggregation and exploration. However, over the past year, I have come to realize that this was a mistake. Data tables are incredible and provide R users with a syntatically
concise and efficient data structure for working with small, medium, or large datasets. While the package is well documented, I wanted to put together a series of posts that could be useful for those who want to get introduced to the data.table package in a more task oriented format.
For this series of posts, I will be working with data that comes from the Chicago Police Department’s Citizen Law Enforcement Analysis and Reporting system. This dataset contains information on reported incidents of crime that occured in the city of Chicago from 2001 to present. You can use the wget command in the terminal to export it as a csv file.
$ wget –nocheckcertificate –progress=dot https://data.cityofchicago.org/api/views/ijzpq8t2/rows.csv?accessType=DOWNLOAD > rows.csvThis file can be found in your working directory and was saved as “rows.csv”. We will import the data into R with the fread function and look at the first few rows and structure of the data.
dat = fread("rows.csv") dat[1:3] str(dat)Notice that the each of the string variables in the data set was imported as a character and not a factor. With base R functions like read.csv, we have to set the stringsAsFactors argument to TRUE if we want this result.
names(dat) < gsub(" ", "_", names(dat))Let’s say that we want to see the frequency distribution of several of these variables. This can be done by using .N in conjunction with the by operator.
dat[, .N, by=.(Arrest)]In the code below, you can also see how to chain operations togther. We started by finding the count of each response in the variable, ordered the count in descending order, and then selected only those which occured more than 200,000 times.
dat[, .N, by=.(Primary_Type)][order(N)][N>=200000] dat[, .N, by=.(Location_Description)][order(N)][N>=200000]Let’s say that we want to get a count of prostitution incidents by month. To get the desired results, we will need to modify the date values, filter instances in which the primary type was “prostitution”, and then get a count by each date.
dat[, date2 := paste(substr(as.Date(dat$Date, format="%m/%d/%Y"),1,7), "01", sep="")][ Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)][]If you want to plot the results as a line graph, just add another chain which executes the visualization.
dat[, date2 := paste(substr(as.Date(dat$Date, format="%m/%d/%Y"),1,7), "01", sep="")][ Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)] %>% ggplot(., aes(date2, N)) + geom_line(group=1)I’ve obviously skipped over a lot and some of the code presented here is more verbose than needed. Even so, beginners to R will hopefully find this useful and it will pique your interest in the beauty of the data table package. Future posts will cover more of the goodies available in the data.table package such as get(), set(), {}, and so forth.
PS: As of this past week, I’ve decided to reenter the job market. So if you are looking for a statistical analyst or data scientist with strong experience in programming in R and time series econometrics, please contact me at mathewanalytics@gmail.com or reach me through LinkedIn
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Mathew Analytics. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
How to Create an Interactive Infographic
(This article was first published on R – Displayr, and kindly contributed to Rbloggers)
An interactive infographic can be used to communicate a lot of information in an engaging way. With the right tools, they are also relatively straightforward to create. In this post, I show stepbystep how to create this interactive infographic, using Canva, Displayr and R code. The interactive example is designed so that the user can change the country and have the infographic update automatically.
Tools used to create an interactive infographic: Canva is used to create the base infographic. The calculations, charting, and automatic textwriting are performed using the R language. It is all hooked up with Displayr.
Step 1: Create or download the infographicI start by going to Canva, and choosing the Neat Interactive Gaming Infographic (tip: use Find Templates on the lefthand panel). You could, of course, design your own infographic, either in Canva or elsewhere. I like Canva, but the key thing is to create an infographic image some way. In Canva, I edited the template by deleting the bits that I wanted to replace with interactive charts and visualizations and then I download the infographic as a PNG file (2,000 pixels high by 800 wide).
Step 2: Import the infographic into Displayr
Create an account in Displayr, and then click the button that says + New Document. Set the page size to the same aspect ratio as the PNG file (Home > Page Layout > Layout > Page Size > Custom). For this example, the page should be 20 inches high and 8 inches wide.
Next, insert the infographic into Displayr (Insert > Image), move and resize it to fit the page (tip: you can use the Properties panel on the right to type in pixels 800 x 2000 to reset the correct aspect ratio of the image).
Step 3: Get the data into Displayr
The data that will make the infographic interactive needs to be hooked up in Displayr. The data used to create the infographic in this example is shown to the right. There are lots of ways to import data into Displayr (e.g., importing a raw data file and creating the tables in Displayr). For this example, the data has been pasted into Displayr from Excel using the steps below.
To paste the data into Displayr:
 Insert > Paste Table (Data), click Add data (on the right of the screen).
 Paste in the data. Alternatively, you could type it into this screen. I first just pasted in the Age Distribution data and press OK.
 Properties > GENERAL and type AgeDistribution into the Label field and check the Automatic option (above).
 Drag the table so that it is to the left of the infographic.
 Hide the table (select it, Appearance > Hide). It will stay visible but will be invisible when you share the infographic.
Repeat this process for AverageAge, Ratio, and Multiplayer. It is important that you give each of these tables these names, as we refer to them later in our R code.
Step 4: Add the country selectorNext, I add a control so that the user can change country:
 Insert > Control
 Item list: China, US, Europe
 Move it to the top of the screen and style as desired (font size, color, border)
 Name: Click on the control and select China
 Properties > GENERAL > Name: Country
I then insert a text box (“GAMERS”), and placed it to the left of the control (i.e.: font: Impact, size: 48, color: #ffb600).
Step 5: Create the charts and visualizations in RFinally, create the charts and visualizations in Displayr using the following R code.
The column chartI created the column chart using my colleague Carmen’s nifty wrapperfunction for plotly. Insert an R Output in Displayr (Insert > R Output), and copy and paste the following code, pressing Calculate and resizing moving and resizing the chart.
flipStandardCharts::Chart(AgeDistribution[, Country], type = "Column", background.fill.color = "#212121", charting.area.fill.color = "#212121", colors = "tiel", x.tick.font.color = "white", x.tick.font.size = 20, x.grid.width = 0, y.tick.font.color = "white", y.tick.font.size = 20, y.title = "%", y.title.font.color = "white", y.grid.width = 0) Average ageThe average age was also created by inserting an R Output, using the code below. While I could have written the formatting in R, I instead used the various formatting tools built into Displayr (Properties >LAYOUT and Properties > APPEARANCE).
AverageAge[,Country] The hearts pictographThis was also done using an R Output in Displayr, with the following code (using R GitHub packages built by my colleagues Kyle and Carmen).
women = Ratio["Women", Country] total = sum(Ratio[, Country]) flipPictographs::SinglePicto(women, total, layout = "Number of columns", number.cols = 5, image = "Heart", hide.base.image = FALSE, auto.size = TRUE, fill.icon.color = "red", base.icon.color = "cyan", background.color ="#212121" )The R code used to create the textbox is below (tip: toggle on Wrap text output at the bottom of the Properties panel on the right)
women = Ratio["Women", Country] total = sum(Ratio[, Country]) paste0(women, " IN ", total, " GAMERS ARE WOMEN") The pie chartThis is the R code for the pie chart:
flipStandardCharts::Chart(Multiplayer[, Country], type = "Pie", colors = c(rgb(175/255, 224/255, 170/255), rgb(0/255, 181/255, 180/255)), data.label.font.color = "White", data.label.font.size = 18, data.label.decimals = 0, data.label.suffix = "%")The R code used to create the textbox was:
one.player = Multiplayer["1 player", Country] paste0(one.player, "% OF PEOPLE PLAY VIDEO GAMES ON THEIR OWN") Create the interactive infographic yourselfSign in to Displayr and edit the document used to create the interactive infographic here. In Edit mode you can click on each of the charts, pictographs, and text boxes to see the underlying code. The final document was published (i.e., turned into a Dashboard) using Export > Web Page. Or you can view the interactive infographic created using the instructions above.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Regular Expression Searching within Shiny Selectize Objects
(This article was first published on Jonathan Sidi's R Blog, and kindly contributed to Rbloggers)
regexSelect is a small package that uses Shiny modules to solve a problem in Shiny selectize objects – regular expression (regex) searching. You can quickly filter the values in the selectize object, while being able to add that new regex query to the selectize list.
This is great for long lists, since you can return multiple item simultaneously without needing to endlessly click items in a list!
Install install.packages('regexSelect') #devtools::install_github('yonicd/regexSelect')Below are two examples of using regular expressions to quickly populate multiple items in a ggplot and a datatable.
regexSelect with PlotsThe shiny module works with two main functions:
# server side: callModule(module=regexSelect, id='myId', reactive(<selectizeInput Choices>)) # ui side: regexSelectUI(id = "myId", label = 'myLabel', choices = <selectizeInput Choices>)regexSelect comes with controls placed in a group checkbox below the selectize object. When calling regexSelect you can show or hide the checkbox controls (hidden by default), as to make it look like a normal selectize object, and save valuable UI real estate.
While the shiny app is running regexSelect properties can be manipulated through the checkbox controls giving greater flexibilty to:
 Toggle regexSelect to work as a standard selectize object.
 Retain the regex search as a new value the selectize object.
 Change arguments that are passed to base::grep : ignore.case, perl, fixed, invert.
To leave a comment for the author, please follow the link and comment on their blog: Jonathan Sidi's R Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
News Roundup from Microsoft Ignite
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
It's been a big day for the team here at Microsoft, with a flurry of announcements from the Ignite conference in Orlando. We'll provide more indepth details in the coming days and weeks, but for now here's a brief roundup of the news related to data science:
Microsoft ML Server 9.2 is now available. This is the new name for what used to be called Microsoft R Server, and now also includes support for operationalizing the Python language as well as R. This 2minute video explains how to deploy models with ML Server (and with this release, realtime scoring is now also supported Linux as well). Microsoft R Client 3.4.1, featuring R 3.4.1, was also released today and provides desktop capabilities for R developers with the ability to deploy computations to a production ML Server.
The next generation of Azure Machine Learning is now available. This new service includes the AML Workbench, a crossplatform client for AIpowered data wrangling and experiment management; the AML Experimentation service to help data scientists increase their rate of experimentation with big data and GPUs, and the AML Model Management service to host, version, manage and monitor machine learning models. This TechCrunch article provides an overview.
SQL Server 2017 is now generally available, bringing support for indatabase R and Python to both Windows and Linux.
Azure SQL Database now supports realtime scoring for R and Python, and a preview of general indatabase R services is now available as well.
Microsoft Cognitive Services offers new intelligent APIs for developing AI applications, including general availability of the text analytics API for topic extraction, language detection and sentiment analysis.
Visual Studio Code for AI, an extension for the popular opensource code editor, provides new interfaces for developing with Tensorflow, Cognitive Toolkit (CNTK), Azure ML and more.
Finally, Microsoft announced a new highlevel language for quantum computing to be integrated with Visual Studio, and a simulator for quantum computers up to 32 qubits. This 2minute video provides an overview of Microsoft's vision for Quantum Computing.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Custom Level Coding in vtreat
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
One of the services that the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "onehot encoding"). Level coding can be computationally and statistically preferable to onehot encoding for variables that have an extremely large number of possible levels.
Level coding is like measurement: it summarizes categories of individuals into useful numbers. Source: USGS
By default, vtreat level codes to the difference between the conditional means and the grand mean (catN variables) when the outcome is numeric, and to the difference between the conditional loglikelihood and global loglikelihood of the target class (catB variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the ranger package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for treebased methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by vtreat‘s coding. This often isn’t a problem — but sometimes, it may be.
So the data scientist may want to use a level coding different from what vtreat defaults to. In this article, we will demonstrate how to implement custom level encoders in vtreat. We assume you are familiar with the basics of vtreat: the types of derived variables, how to create and apply a treatment plan, etc.
For our example, we will implement level coders based on partial pooling, or hierarchical/multilevel models (Gelman and Hill, 2007). We’ll leave the details of how partial pooling works to a subsequent article; for now, just think of it as a score that shrinks the estimate of the conditional mean to be closer to the unconditioned mean, and hence possibly closer to the unknown true values, when there are too few measurements to make an accurate estimate.
We’ll implement our partial pooling encoders using the lmer() (multilevel linear regression) and glmer() (multilevel generalized linear regression) functions from the lme4 package. For our example data, we’ll use radon levels by county for the state of Minnesota (Gelman and Hill, 2007. You can find the original data here).
The Data: Radon levels in Minnesota library("vtreat") library("lme4") library("dplyr") library("tidyr") library("ggplot2") # example data srrs = read.table("srrs2.dat", header=TRUE, sep=",", stringsAsFactor=FALSE) # target: log of radon activity (activity) # grouping variable: county radonMN = filter(srrs, state=="MN") %>% select("county", "activity") %>% filter(activity > 0) %>% mutate(activity = log(activity), county = base::trimws(county)) %>% mutate(critical = activity>1.5) str(radonMN) ## 'data.frame': 916 obs. of 3 variables: ## $ county : chr "AITKIN" "AITKIN" "AITKIN" "AITKIN" ... ## $ activity: num 0.788 0.788 1.065 0 1.131 ... ## $ critical: logi FALSE FALSE FALSE FALSE FALSE FALSE ...For this example we have three columns of interest:
 county: 85 possible values
 activity: the log of the radon reading (numerical outcome)
 critical: TRUE when activity > 1.5 (categorical outcome)
The goal is to level code county for either the regression problem (predict the log radon reading) or the categorization problem (predict whether the radon level is "critical").
As the graph shows, the conditional mean of log radon activity by county ranges from nearly zero to about 3, and the conditional expectation of a critical reading ranges from zero to one. On the other hand, the number of readings per county is quite low for many counties — only one or two — though some counties have a large number of readings. That means some of the conditional expectations are quite uncertain.
Implementing Level Coders for Partial PoolingLet’s implement level coders that use partial pooling to compute the level score.
Regression
For regression problems, the custom coder should be a function that takes as input:
 v: a string with the name of the categorical variable
 vcol: the actual categorical column (assumed character)
 y: the numerical outcome column
 weights: a column of row weights
The function should return a column of scores (the level codings). For our example, the function builds a lmer model to predict y as a function of vcol, then returns the predictions on the training data.
# @param v character variable name # @param vcol character, independent or input variable # @param y numeric, dependent or outcome variable to predict # @param weights row/example weights # @return scored training data column ppCoderN < function(v, vcol, y, weights) { # regression case y ~ vcol d < data.frame(x = vcol, y = y, stringsAsFactors = FALSE) m < lmer(y ~ (1  x), data=d, weights=weights) predict(m, newdata=d) }Categorization
For categorization problems, the function should assume that y is a logical column, where TRUE is assumed to be the target outcome. This is because vtreat converts the outcome column to a logical while creating the treatment plan.
# @param v character variable name # @param vcol character, independent or input variable # @param y logical, dependent or outcome variable to predict # @param weights row/example weights # @return scored training data column ppCoderC < function(v, vcol, y, weights) { # classification case y ~ vcol d < data.frame(x = vcol, y = y, stringsAsFactors = FALSE) m = glmer(y ~ (1  x), data=d, weights=weights, family=binomial) predict(m, newdata=d, type='link') }You can then pass the functions in as a named list into either designTreatmentsX or mkCrossFrameXExperiment to build the treatment plan. The format of the key is [nc].levelName[.option]*.
The prefacing picks the model type: numeric or regression starts with ‘n.’ and the categorical encoder starts with ‘c.’. Currently, the only supported option is ‘center,’ which directs vtreat to center the codes with respect to the estimated grand mean. ThecatN and catB level codings are centered in this way.
Our example coders can be passed in as shown below.
customCoders = list('n.poolN.center' = ppCoderN, 'c.poolC.center' = ppCoderC) Using the Custom CodersLet’s build a treatment plan for the regression problem.
# I only want to create the cleaned numeric variables, the isBAD variables, # and the level codings (not the indicator variables or catP, etc.) vartypes_I_want = c('clean', 'isBAD', 'catN', 'poolN') treatplanN = designTreatmentsN(radonMN, varlist = c('county'), outcomename = 'activity', codeRestriction = vartypes_I_want, customCoders = customCoders, verbose=FALSE) scoreFrame = treatplanN$scoreFrame scoreFrame %>% select(varName, sig, origName, code) ## varName sig origName code ## 1 county_poolN 1.343072e16 county poolN ## 2 county_catN 2.050811e16 county catNNote that the treatment plan returned both the catN variable (default level encoding) and the pooled level encoding (poolN). You can restrict to just using one coding or the other using the codeRestriction argument either during treatment plan creation, or in prepare().
Let’s compare the two level encodings.
# create a frame with one row for every county, measframe = data.frame(county = unique(radonMN$county), stringsAsFactors=FALSE) outframe = prepare(treatplanN, measframe) # If we wanted only the new pooled level coding, # (plus any numeric/isBAD variables), we would # use a codeRestriction: # # outframe = prepare(treatplanN, # measframe, # codeRestriction = c('clean', 'isBAD', 'poolN')) gather(outframe, key=scoreType, value=score, county_poolN, county_catN) %>% ggplot(aes(x=score)) + geom_density(adjust=0.5) + geom_rug(sides="b") + facet_wrap(~scoreType, ncol=1, scale="free_y") + ggtitle("Distribution of scores")Notice that the poolN scores are "tucked in" compared to the catN encoding. In a later article, we’ll show that the counties with the most tucking in (or shrinkage) tend to be those with fewer measurements.
We can also code for the categorical problem.
# For categorical problems, coding is catB vartypes_I_want = c('clean', 'isBAD', 'catB', 'poolC') treatplanC = designTreatmentsC(radonMN, varlist = c('county'), outcomename = 'critical', outcometarget= TRUE, codeRestriction = vartypes_I_want, customCoders = customCoders, verbose=FALSE) outframe = prepare(treatplanC, measframe) gather(outframe, key=scoreType, value=linkscore, county_poolC, county_catB) %>% ggplot(aes(x=linkscore)) + geom_density(adjust=0.5) + geom_rug(sides="b") + facet_wrap(~scoreType, ncol=1, scale="free_y") + ggtitle("Distribution of link scores")Notice that the poolC link scores are even more tucked in compared to the catB link scores, and that the catB scores are multimodal. The smaller link scores mean that the pooled model avoids estimates of conditional expectation close to either zero or one, because, again, these estimates come from counties with few readings. Multimodal summaries can be evidence of modeling flaws, including omitted variables and unmodeled mixing of different example classes. Hence, we do not want our inference procedure to suggest such structure until there is a lot of evidence for it. And, as is common in machine learning, there are advantages to lowervariance estimators when they do not cost much in terms of bias.
Other ConsiderationsFor this example, we used the lme4 package to create custom level codings. Once calculated, vtreat stores the coding as a lookup table in the treatment plan. This means lme4 is not needed to prepare new data. In general, using a treatment plan is not dependent on any special packages that might have been used to create it, so it can be shared with other users with no extra dependencies.
When using mkCrossFrameXExperiment, note that the resulting cross frame will have a slightly different distribution of scores than what the treatment plan produces. This is true even for catB and catN variables. This is because the treatment plan is built using all the data, while the cross frame is built using nfold cross validation on the data. See the cross frame vignette for more details.
Thanks to Geoffrey Simmons, Principal Data Scientist at Echo Global Logistics, for suggesting partial pooling based level coding (and testing it for us!), introducing us to the references, and reviewing our articles.
In a followup article, we will go into partial pooling in more detail, and motivate why you might sometimes prefer it to vtreat‘s default coding.
ReferencesGelman, Andrew and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Survival Analysis with R
(This article was first published on R Views, and kindly contributed to Rbloggers)
With roots dating back to at least 1662 when John Graunt, a London merchant, published an extensive set of inferences based on mortality records, survival analysis is one of the oldest subfields of Statistics [1]. Basic lifetable methods, including techniques for dealing with censored data, were discovered before 1700 [2], and in the early eighteenth century, the old masters – de Moivre working on annuities, and Daniel Bernoulli studying competing risks for the analysis of smallpox inoculation – developed the modern foundations of the field [2]. Today, survival analysis models are important in Engineering, Insurance, Marketing, Medicine, and many more application areas. So, it is not surprising that R should be rich in survival analysis functions. CRAN’s Survival Analysis Task View, a curated list of the best relevant R survival analysis packages and functions, is indeed formidable. We all owe a great deal of gratitude to Arthur Allignol and Aurielien Latouche, the task view maintainers.
Looking at the Task View on a small screen, however, is a bit like standing too close to a brick wall – leftright, updown, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As wellorganized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. Newcomers – people either new to R or new to survival analysis or both – must find it overwhelming. So, it is with newcomers in mind that I offer the following narrow trajectory through the task view that relies on just a few packages: survival, ggplot2, ggfortify, and ranger
The survival package is the cornerstone of the entire R survival analysis edifice. Not only is the package itself rich in features, but the object created by the Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R. Dr. Terry Therneau, the package author, began working on the survival package in 1986. The first public release, in late 1989, used the Statlib service hosted by Carnegie Mellon University. Thereafter, the package was incorporated directly into Splus, and subsequently into R.
ggfortify enables producing handsome, oneline survival plots with ggplot2::autoplot.
ranger might be the surprise in my very short list of survival packages. The ranger() function is wellknown for being a fast implementation of the Random Forests algorithm for building ensembles of classification and regression trees. But ranger() also works with survival data. Benchmarks indicate that ranger() is suitable for building timetoevent models with the large, highdimensional data sets important to internet marketing applications. Since ranger() uses standard Surv() survival objects, it’s an ideal tool for getting acquainted with survival analysis in this machinelearning age.
Load the dataThis first block of code loads the required packages, along with the veteran dataset from the survival package that contains data from a twotreatment, randomized trial for lung cancer.
library(survival) library(ranger) library(ggplot2) library(dplyr) library(ggfortify) # data(veteran) head(veteran) ## trt celltype time status karno diagtime age prior ## 1 1 squamous 72 1 60 7 69 0 ## 2 1 squamous 411 1 70 5 64 10 ## 3 1 squamous 228 1 60 3 38 0 ## 4 1 squamous 126 1 60 9 63 10 ## 5 1 squamous 118 1 70 11 65 10 ## 6 1 squamous 10 1 20 5 49 0The variables in veteran are: * trt: 1=standard 2=test * celltype: 1=squamous, 2=small cell, 3=adeno, 4=large * time: survival time in days * status: censoring status * karno: Karnofsky performance score (100=good) * diagtime: months from diagnosis to randomization * age: in years * prior: prior therapy 0=no, 10=yes
Kaplan Meier AnalysisThe first thing to do is to use Surv() to build the standard survival object. The variable time records survival time; status indicates whether the patient’s death was observed (status = 1) or that survival time was censored (status = 0). Note that a “+” after the time in the print out of km indicates censoring.
# Kaplan Meier Survival Curve km < with(veteran, Surv(time, status)) head(km,80) ## [1] 72 411 228 126 118 10 82 110 314 100+ 42 8 144 25+ ## [15] 11 30 384 4 54 13 123+ 97+ 153 59 117 16 151 22 ## [29] 56 21 18 139 20 31 52 287 18 51 122 27 54 7 ## [43] 63 392 10 8 92 35 117 132 12 162 3 95 177 162 ## [57] 216 553 278 12 260 200 156 182+ 143 105 103 250 100 999 ## [71] 112 87+ 231+ 242 991 111 1 587 389 33To begin our analysis, we use the formula Surv(futime, status) ~ 1 and the survfit() function to produce the KaplanMeier estimates of the probability of survival over time. The times parameter of the summary() function gives some control over which times to print. Here, it is set to print the estimates for 1, 30, 60 and 90 days, and then every 90 days thereafter. This is the simplest possible model. It only takes three lines of R code to fit it, and produce numerical and graphical summaries.
km_fit < survfit(Surv(time, status) ~ 1, data=veteran) summary(km_fit, times = c(1,30,60,90*(1:10))) ## Call: survfit(formula = Surv(time, status) ~ 1, data = veteran) ## ## time n.risk n.event survival std.err lower 95% CI upper 95% CI ## 1 137 2 0.985 0.0102 0.96552 1.0000 ## 30 97 39 0.700 0.0392 0.62774 0.7816 ## 60 73 22 0.538 0.0427 0.46070 0.6288 ## 90 62 10 0.464 0.0428 0.38731 0.5560 ## 180 27 30 0.222 0.0369 0.16066 0.3079 ## 270 16 9 0.144 0.0319 0.09338 0.2223 ## 360 10 6 0.090 0.0265 0.05061 0.1602 ## 450 5 5 0.045 0.0194 0.01931 0.1049 ## 540 4 1 0.036 0.0175 0.01389 0.0934 ## 630 2 2 0.018 0.0126 0.00459 0.0707 ## 720 2 0 0.018 0.0126 0.00459 0.0707 ## 810 2 0 0.018 0.0126 0.00459 0.0707 ## 900 2 0 0.018 0.0126 0.00459 0.0707 #plot(km_fit, xlab="Days", main = 'Kaplan Meyer Plot') #base graphics is always ready autoplot(km_fit)Next, we look at survival curves by treatment.
km_trt_fit < survfit(Surv(time, status) ~ trt, data=veteran) autoplot(km_trt_fit)And, to show one more small exploratory plot, I’ll do just a little data munging to look at survival by age. First, I create a new data frame with a categorical variable AG that has values LT60 and GT60, which respectively describe veterans younger and older than sixty. While I am at it, I make trt and prior into factor variables. But note, survfit() and npsurv() worked just fine without this refinement.
vet < mutate(veteran, AG = ifelse((age < 60), "LT60", "OV60"), AG = factor(AG), trt = factor(trt,labels=c("standard","test")), prior = factor(prior,labels=c("N0","Yes"))) km_AG_fit < survfit(Surv(time, status) ~ AG, data=vet) autoplot(km_AG_fit)Although the two curves appear to overlap in the first fifty days, younger patients clearly have a better chance of surviving more than a year.
Cox Proportional Hazards ModelNext, I’ll fit a Cox Proportional Hazards Model that makes use of all of the covariates in the data set.
# Fit Cox Model cox < coxph(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior , data = vet) summary(cox) ## Call: ## coxph(formula = Surv(time, status) ~ trt + celltype + karno + ## diagtime + age + prior, data = vet) ## ## n= 137, number of events= 128 ## ## coef exp(coef) se(coef) z Pr(>z) ## trttest 2.946e01 1.343e+00 2.075e01 1.419 0.15577 ## celltypesmallcell 8.616e01 2.367e+00 2.753e01 3.130 0.00175 ** ## celltypeadeno 1.196e+00 3.307e+00 3.009e01 3.975 7.05e05 *** ## celltypelarge 4.013e01 1.494e+00 2.827e01 1.420 0.15574 ## karno 3.282e02 9.677e01 5.508e03 5.958 2.55e09 *** ## diagtime 8.132e05 1.000e+00 9.136e03 0.009 0.99290 ## age 8.706e03 9.913e01 9.300e03 0.936 0.34920 ## priorYes 7.159e02 1.074e+00 2.323e01 0.308 0.75794 ##  ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## exp(coef) exp(coef) lower .95 upper .95 ## trttest 1.3426 0.7448 0.8939 2.0166 ## celltypesmallcell 2.3669 0.4225 1.3799 4.0597 ## celltypeadeno 3.3071 0.3024 1.8336 5.9647 ## celltypelarge 1.4938 0.6695 0.8583 2.5996 ## karno 0.9677 1.0334 0.9573 0.9782 ## diagtime 1.0001 0.9999 0.9823 1.0182 ## age 0.9913 1.0087 0.9734 1.0096 ## priorYes 1.0742 0.9309 0.6813 1.6937 ## ## Concordance= 0.736 (se = 0.03 ) ## Rsquare= 0.364 (max possible= 0.999 ) ## Likelihood ratio test= 62.1 on 8 df, p=1.799e10 ## Wald test = 62.37 on 8 df, p=1.596e10 ## Score (logrank) test = 66.74 on 8 df, p=2.186e11 cox_fit < survfit(cox) #plot(cox_fit, main = "cph model", xlab="Days") autoplot(cox_fit)Note that the model flags small cell type, adeno cell type and karno as significant. However, some caution needs to be exercised in interpreting these results. While the Cox Proportional Hazard’s model is thought to be “robust”, a careful analysis would check the assumptions underlying the model. For example, the Cox model assumes that the covariates do not vary with time. In a vignette [12] that accompanies the survival package Therneau, Crowson and Atkinson demonstrate that the Karnofsky score (karno) is, in fact, timedependent so the assumptions for the Cox model are not met. The vignette authors go on to present a strategy for dealing with time dependent covariates.
Data scientists who are accustomed to computing ROC curves to assess model performance should be interested in the Concordance statistic. The documentation for the survConcordance() function in the survival package defines concordance as “the probability of agreement for any two randomly chosen observations, where in this case agreement means that the observation with the shorter survival time of the two also has the larger risk score. The predictor (or risk score) will often be the result of a Cox model or other regression” and notes that: “For continuous covariates concordance is equivalent to Kendall’s tau, and for logistic regression is is equivalent to the area under the ROC curve.”
To demonstrate using the survival package, along with ggplot2 and ggfortify, I’ll fit Aalen’s additive regression model for censored data to the veteran data. The documentation states: “The Aalen model assumes that the cumulative hazard H(t) for a subject can be expressed as a(t) + X B(t), where a(t) is a timedependent intercept term, X is the vector of covariates for the subject (possibly timedependent), and B(t) is a timedependent matrix of coefficients.”
The plots show how the effects of the covariates change over time. Notice the steep slope and then abrupt change in slope of karno.
aa_fit <aareg(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior , data = vet) aa_fit ## Call: ## aareg(formula = Surv(time, status) ~ trt + celltype + karno + ## diagtime + age + prior, data = vet) ## ## n= 137 ## 75 out of 97 unique event times used ## ## slope coef se(coef) z p ## Intercept 0.083400 3.81e02 1.09e02 3.490 4.79e04 ## trttest 0.006730 2.49e03 2.58e03 0.967 3.34e01 ## celltypesmallcell 0.015000 7.30e03 3.38e03 2.160 3.09e02 ## celltypeadeno 0.018400 1.03e02 4.20e03 2.450 1.42e02 ## celltypelarge 0.001090 6.21e04 2.71e03 0.229 8.19e01 ## karno 0.001180 4.37e04 8.77e05 4.980 6.28e07 ## diagtime 0.000243 4.92e05 1.64e04 0.300 7.65e01 ## age 0.000246 6.27e05 1.28e04 0.491 6.23e01 ## priorYes 0.003300 1.54e03 2.86e03 0.539 5.90e01 ## ## Chisq=41.62 on 8 df, p=1.6e06; test weights=aalen #summary(aa_fit) # provides a more complete summary of results autoplot(aa_fit) Random Forests ModelAs a final example of what some might perceive as a datasciencelike way to do timetoevent modeling, I’ll use the ranger() function to fit a Random Forests Ensemble model to the data. Note however, that there is nothing new about building tree models of survival data. Terry Therneau also wrote the rpart package, R’s basic treemodeling package, along with Brian Ripley. See section 8.4 for the rpart vignette [14] that contains a survival analysis example.
ranger() builds a model for each observation in the data set. The next block of code builds the model using the same variables used in the Cox model above, and plots twenty random curves, along with a curve that represents the global average for all of the patients. Note that I am using plain old base R graphics here.
# ranger model r_fit < ranger(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior, data = vet, mtry = 4, importance = "permutation", splitrule = "extratrees", verbose = TRUE) # Average the survival models death_times < r_fit$unique.death.times surv_prob < data.frame(r_fit$survival) avg_prob < sapply(surv_prob,mean) # Plot the survival models for each patient plot(r_fit$unique.death.times,r_fit$survival[1,], type = "l", ylim = c(0,1), col = "red", xlab = "Days", ylab = "survival", main = "Patient Survival Curves") # cols < colors() for (n in sample(c(2:dim(vet)[1]), 20)){ lines(r_fit$unique.death.times, r_fit$survival[n,], type = "l", col = cols[n]) } lines(death_times, avg_prob, lwd = 2) legend(500, 0.7, legend = c('Average = black'))The next block of code illustrates how ranger() ranks variable importance.
vi < data.frame(sort(round(r_fit$variable.importance, 4), decreasing = TRUE)) names(vi) < "importance" head(vi) ## importance ## karno 0.0734 ## celltype 0.0338 ## diagtime 0.0003 ## trt 0.0007 ## prior 0.0011 ## age 0.0027Notice that ranger() flags karno and celltype as the two most important; the same variables with the smallest pvalues in the Cox model. Also note that the importance results just give variable names and not level names. This is because ranger and other tree models do not usually create dummy variables.
But ranger() does compute Harrell’s cindex (See [8] p. 370 for the definition), which is similar to the Concordance statistic described above. This is a generalization of the ROC curve, which reduces to the WilcoxonMannWhitney statistic for binary variables, which in turn, is equivalent to computing the area under the ROC curve.
cat("Prediction Error = 1  Harrell's cindex = ", r_fit$prediction.error) ## Prediction Error = 1  Harrell's cindex = 0.308269An ROC value of .68 would normally be pretty good for a first try. But note that the ranger model doesn’t do anything to address the time varying coefficients. This apparently is a challenge. In a 2011 paper [16], Hamad observes:
However, in the context of survival trees, a further difficulty arises when time–varying effects are included. Hence, we feel that the interpretation of covariate effects with tree ensembles in general is still mainly unsolved and should attract future research.
I believe that the major use for treebased models for survival data will be to deal with very large data sets.
Finally, to provide an “eyeball comparison” of the three survival curves, I’ll plot them on the same graph.The following code pulls out the survival data from the three model objects and puts them into a data frame for ggplot().
# Set up for ggplot kmi < rep("KM",length(km_fit$time)) km_df < data.frame(km_fit$time,km_fit$surv,kmi) names(km_df) < c("Time","Surv","Model") coxi < rep("Cox",length(cox_fit$time)) cox_df < data.frame(cox_fit$time,cox_fit$surv,coxi) names(cox_df) < c("Time","Surv","Model") rfi < rep("RF",length(r_fit$unique.death.times)) rf_df < data.frame(r_fit$unique.death.times,avg_prob,rfi) names(rf_df) < c("Time","Surv","Model") plot_df < rbind(km_df,cox_df,rf_df) p < ggplot(plot_df, aes(x = Time, y = Surv, color = Model)) p + geom_line()For this data set, I would put my money on a carefully constructed Cox model that takes into account the time varying coefficients. I suspect that there are neither enough observations nor enough explanatory variables for the ranger() model to do better.
This fourpackage excursion only hints at the Survival Analysis tools that are available in R, but it does illustrate some of the richness of the R platform, which has been under continuous development and improvement for nearly twenty years. The ranger package, which suggests the survival package, and ggfortify, which depends on ggplot2 and also suggests the survival package, illustrate how opensource code allows developers to build on the work of their predecessors. The documentation that accompanies the survival package, the numerous online resources, and the statistics such as concordance and Harrell’s cindex packed into the objects produced by fitting the models gives some idea of the statistical depth that underlies almost everything R.
Some Tutorials and PapersFor a very nice, basic tutorial on survival analysis, have a look at the Survival Analysis in R [5] and the OIsurv package produced by the folks at OpenIntro.
Look here for an exposition of the Cox Proportional Hazard’s Model, and here [11] for an introduction to Aalen’s Additive Regression Model.
For an elementary treatment of evaluating the proportional hazards assumption that uses the veterans data set, see the text by Kleinbaum and Klein [13].
For an exposition of the sort of predictive survival analysis modeling that can be done with ranger, be sure to have a look at Manuel Amunategui’s post and video.
See the 1995 paper [15] by Intrator and Kooperberg for an early review of using classification and regression trees to study survival data.
ReferencesFor convenience, I have collected the references used throughout the post here.
[1] Hacking, Ian. (2006) The Emergence of Probability: A Philosophical Study of Early Ideas about Probability Induction and Statistical Inference. Cambridge University Press, 2nd ed., p. 11
[2] Andersen, P.K., Keiding, N. (1998) Survival analysis Encyclopedia of Biostatistics 6. Wiley, pp. 44524461 [3] Kaplan, E.L. & Meier, P. (1958). Nonparametric estimation from incomplete observations, J American Stats Assn. 53, pp. 457–481, 562–563. [4] Cox, D.R. (1972). Regression models and lifetables (with discussion), Journal of the Royal Statistical Society (B) 34, pp. 187–220.
[5] Diez, David. Survival Analysis in R, OpenIntro
[6] Klein, John P and Moeschberger, Melvin L. Survival Analysis Techniques for Censored and Truncated Data, Springer. (1997)
[7] Wright, Marvin & Ziegler, Andreas. (2017) ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R, JSS Vol 77, Issue 1.
[8] Harrell, Frank, Lee, Kerry & Mark, Daniel. Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Statistics in Medicine, Vol 15 (1996), pp. 361387 [9] Amunategui, Manuel. Survival Ensembles: Survival Plus Classification for Improved TimeBased Predictions in R
[10] NUS Course Notes. Chapter 3 The Cox Proportional Hazards Model
[11] Encyclopedia of Biostatistics, 2nd Edition (2005). Aalen’s Additive Regression Model [12] Therneau et al. Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model
[13] Kleinbaum, D.G. and Klein, M. Survival Analysis, A Self Learning Text Springer (2005) [14] Therneau, T and Atkinson, E. An Introduction to Recursive Partitioning Using RPART Routines
[15] Intrator, O. and Kooperberg, C. Trees and splines in survival analysis Statistical Methods in Medical Research (1995)
[16] BouHamad, I. A review of survival trees Statistics Surveys Vol.5 (2011)
Authors’s note: this post was originally published on April 26, 2017 but was subsequently withdrawn because of an error spotted by Dr. Terry Therneau. He observed that the Cox Portional Hazards Model fitted in that post did not properly account for the time varying covariates. This revised post makes use of a different data set, and points to resources for addressing time varying covariates. Many thanks to Dr. Therneau. Any errors that remain are mine.
_____='https://rviews.rstudio.com/2017/09/25/survivalanalysiswithr/';
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R Views. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Super excited for R promises
(This article was first published on Appsilon Data Science Blog, and kindly contributed to Rbloggers)
We at Appsilon are excited about RStudio introducing promises in R quite soon which is going to be a huge step forward in programming in R (we have already used futures and similar libraries to run code asynchronously, however this is going to be a standard and it looks like it’s going to be very easy to use). They support chaining which is a great way of building clean code by piping computations that take a long time.
We’ve used futures/promises/tasks in programming languages like C#, Scala, Javascript and promises have always had a big impact on the way code is structured and on the overall execution speed.
We recently spoke with Joe Chang, and he mentioned promises at the EARL conference. Here are some links if you’re interested in reading more about promises on github and medium.
What do you think? Let us know in the comments.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Appsilon Data Science Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
eXtremely Boost your machine learning Exercises (Part1)
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
eXtreme Gradient Boosting is a machine learning model which became really popular few years ago after winning several Kaggle competitions. It is very powerful algorithm that use an ensemble of weak learners to obtain a strong learner. Its R implementation is available in xgboost package and it is really worth including into anyone’s machine learning portfolio.
This is the first part of eXtremely Boost your machine learning series. For other parts follow the tag xgboost.
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Load xgboost library and download German Credit dataset. Your goal in this tutorial will be to predict Creditability (the first column in the dataset).
Exercise 2
Convert columns c(2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20) to factors and then encode them as dummy variables. HINT: use model.matrix()
Exercise 3
Split data into training and test set 700:300. Create xgb.DMatrix for both sets with Creditability as label.
Exercise 4
Train xgboost with logistic objective and 30 rounds of training and maximal depth 2.
Exercise 5
To check model performance calculate test set classification error.
Exercise 6
Plot predictors importance.
 Create a machine learning algorithm from a beginner point of view
 Quickly dive into more advanced methods in an accessible pace and with more explanations
 And much more
This course shows a complete workflow start to finish. It is a great introduction and fallback when you have some experience.
Exercise 7
Use xgb.train() instead of xgboost() to add both train and test sets as a watchlist. Train model with same parameters, but 100 rounds to see how it performs during training.
Exercise 8
Train model again adding AUC and Log Loss as evaluation metrices.
Exercise 9
Plot how AUC and Log Loss for train and test sets was changing during training process. Use plotting function/library of your choice.
Exercise 10
Check how setting parameter eta to 0.01 influences the AUC and Log Loss curves.
 Model Evaluation 2
 How to prepare and apply machine learning to your dataset
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part5)
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RcppGSL 0.3.3
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A maintenance update RcppGSL 0.3.3 is now on CRAN. It switched the vignette to the our new pinp package and its twocolumn pdf default.
The RcppGSL package provides an interface from R to the GNU GSL using the Rcpp package.
No userfacing new code or features were added. The NEWS file entries follow below:
Changes in version 0.3.3 (20170924)
We also check for gslconfig at package load.

The vignette now uses the pinp package in twocolumn mode.

Minor other fixes to package and testing infrastructure.
Courtesy of CRANberries, a summary of changes to the most recent release is available.
More information is on the RcppGSL page. Questions, comments etc should go to the issue tickets at the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RcppCNPy 0.2.7
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A new version of the RcppCNPy package arrived on CRAN yesterday.
RcppCNPy provides R with read and write access to NumPy files thanks to the cnpy library by Carl Rogers.
This version updates internals for function registration, but otherwise mostly switches the vignette over to the shiny new pinp twopage template and package.
Changes in version 0.2.7 (20170922)
Vignette updated to Rmd and use of pinp package

File src/init.c added for dynamic registration
CRANberries also provides a diffstat report for the latest release. As always, feedback is welcome and the best place to start a discussion may be the GitHub issue tickets page.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RcppClassic 0.9.8
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A bugfix release RcppClassic 0.9.8 for the very recent 0.9.7 release which fixes a build issue on macOS introduced in 0.9.7. No other changes.
Courtesy of CRANberries, there are changes relative to the previous release.
Questions, comments etc should go to the rcppdevel mailing list off the RForge page.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Upcoming data preparation and modeling article series
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
I am pleased to announce that vtreat version 0.6.0 is now available to R users on CRAN.
vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects.
vtreat handles, in a statistically sound fashion:
 Missing values.
 Encoding of categorical values for regularized inference and machine learning techniques.
 Categorical variables with very many values.
 Novel categorical values (that is values not seen during training).
 Variable pruning.
 yaware scaling.
 Structured crossvalidation.
 Mitigating nested model bias.
In our (biased) opinion opinion vtreat has the best methodology and documentation for these important data cleaning and preparation steps. vtreat‘s current public opensource implementation is for inmemory R analysis (we are considering ports and certifying ports of the package some time in the future, possibly for: data.table, Spark, Python/Pandas, and SQL).
vtreat brings a lot of power, sophistication, and convenience to your analyses, without a lot of trouble.
A new feature of vtreat version 0.6.0 is called “custom coders.” WinVector LLC‘s Dr. Nina Zumel is going to start a short article series to show how this new interface can be used to extend vtreat methodology to include the very powerful method of partial pooled inference (a term she will spend some time clearly defining and explaining). Time permitting, we may continue with articles on other applications of custom coding including: ordinal/faithful coders, monotone coders, unimodal coders, and setvalued coders.
Please help us share and promote this article series, which should start in a couple of days. This should be a fun chance to share very powerful methods with your colleagues.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part9)
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
Statistics are often taught in school by and for people who like Mathematics. As a consequence, in those class emphasis is put on leaning equations, solving calculus problems and creating mathematics models instead of building an intuition for probabilistic problems. But, if you read this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistic problems by writing simple R query and how to think in probabilistic terms.
In this series of article I have tried to help you create an intuition on how probabilities work. To do so, we have been using simulations to see how concrete random situation can unfold and learn simple statistics and probabilistic concepts. In today’s set, I would like to show you some deceptively difficult situations that will challenge the way you understand probability and statistics. By doing so, you will practice the simulation technique we have seen in past set, refined your intuition and, hopefully help you avoid some pitfall when you do your own statistical analysis.
Answers to the exercises are available here.
For other parts of this exercise set follow the tag Hacking stats
Exercise 1
Suppose that there are exactly 365 days in a year and that the distribution of birthday in the population is uniform, meaning that the proportion of birth on any given day is the same throughout the year. In a group of 25 people, what is the probability that at least two individuals share the same birthday? Use a simulation to answer that question, then repeat this process for group of 0,10,20,…,90 and 100 people and plot the results.
Of course, when the group size is of 366 we know that the probability that two people share the same birthday is equal to 1, since there are more people than day in the year and for a group of zero person this probability is equal to 0. What is counterintuitive here is the rate at which the probability of observing this grow. From the graph we can see that with just about 23 people we have a probability of about 50% of observing two people having the same birthday and that a group of about 70 people will have almost 100% chance to see this happening.
Exercise 2
Here’s a problem that can someday save your life. Imagine you are a war prisoner in an Asian Communist country and your jailer is getting bored. So to past the time, they set up a Russian roulette game where you and another inmate play against one another. A jailer takes a sixshooter revolver, put two bullets in two consecutive chamber, spin the chamber and give the gun to your opponent, who place the gun to his temple and pull the trigger. Luckily for him, the chamber was empty and the gun is passed to you. Now you have a choice to make: you can let the chamber as it is and play or you can spin the chamber before playing. Use 10000 simulations of both choices to find which choice give you the highest probability to survive.
The key details in this problem is that the bullet are in consecutive chamber. This mean that if your opponent pulls the trigger on an empty chamber, and that you don’t spin the chamber, it’s impossible that you pull the trigger on the second bullet. You can only have an empty chamber of pull the trigger on the first bullet, which means that you have 25% chance of dying vs 2/6=33% chance of dying if you spin the chamber.
Exercise 3
What is the probability that a mother, whose is pregnant with nonidentical twin, give birth to two boys, if we know that one of the unborn child is a boy, but we cannot identifie which one is the boy?
 Work with about different binomial and logistic regression techniques
 Know how to compare regression models and choose the right fit
 And much more
Exercise 4
Two friends play head or tail to pass the time. To make this game more fun they decide to gamble pennies, so for each coin flip one friend call head or tail and if he calls right, he gets a penny and lose one otherwise. Let’s say that they have 40 and 30 pennies respectively and that they will play until someone has all the pennies.
 Create a function that simulate a complete game and return how many coin flip has been done and who win.
 In average, how many coin flip is needed before someone has all the pennies.
 Plot the histogram of the number of coin flipped during a simulation.
 What is the probability that someone wins a coin flip?
 What is the probability that each friend wins all the pennies? Why is it different than the probability of winning a single coin flip?
When the number of coin flip get high enough, the probability of someone winning often enough to get all the pennies rise to 100%. Maybe they will have to play 24h a day for weeks, but someday, someone will lose often enough to be penniless. In this context, the player who started with the most money have a huge advantage since they can survive a much longer losing streak than their opponent.
In fact, in this context where the probability of winning a single game is equal for each opponent the probability of winning all the money is equal to the proportion of the money they start with. That’s in part why the casino always win since they got more money than each gambler that plays against them, as long they get them to play long enough they will win. The fact that they propose game where they have greater chance to win help them quite a bit too.
Exercise 5
A classic counter intuitive is the Monty Hall problem. Here’s the scenario, if you never heard of it: you are on a game show where you can choose one of three doors and if a prize is hidden behind this door, you win this prize. Here’s the twist: after you choose a door, the game show host open one of the two other doors to show that there’s no prize behind it. At this point, you can choose to look behind the door you choose in the first place to see if there’s a prize or you can choose to switch door and look behind the door you left out.
 Simulate 10 000 games where you choose to look behind the first door you have chosen to estimate the probability of winning if you choose to look behind this door.
 Repeat this process, but this time choose to switch door.
 Why the probabilities are different?
When you pick the first door, you have 1/3 chance to have the right door. When the show host open one of the door you didn’t pick he gives you a huge amount of information on where the price is because he opened a door with no prize behind it. So the second door has more chance to hide the prize than the door you took in the first place. Our simulation tell us that this probability is about 1/2. So, you should always switch door since this gives you a higher probability of winning the prize.
To better understand this, imagine that the Grand Canyon is filled with small capsule with a volume of a cube centimeter. Of all those capsules only one has a piece of paper and if you pick this capsule, you win a 50% discount on a tie. You choose a capsule at random and then all the other trillion capsules are discarded except one, such than the winning capsule is still in play. Assuming you really want this discount, which capsule would you choose?
Exercise 6
This problem is a real life example of a statistical pitfall that can easily be encountered in real life and has been published by Steven A. Julious and Mark A. Mullee. In this dataset, we can see if a a medical treatment for kidney stone has been effective. There are two treatments that can be used: treatment A which include all open surgical procedure and treatment B which include small puncture surgery and the kidney stone are classified in two categories depending on his size, small or large stones.
 Compute the success rate (number of success/total number of cases) of both treatments.
 Which treatment seems the more successful?
 Create a contingency table of the success.
 Compute the success rate of both treatments when treating small kidney stones.
 Compute the success rate of both treatments when treating large kidney stones.
 Which treatment is more successful for small kidney stone? For large kidney stone?
This is an example of the Simpson paradox, which is a situation where an effect appears to be present for the set of all observations, but disappears when the observations are categorized and the analysis is done on each group. It is important to test for this phenomenon since in practice most observations can be classified in sub classes and, as the last example showed, this can change drastically the result of your analysis.
Exercise 7
 Download this dataset and do a linear regression with the variable X and Y. Then, compute the slope of the trend line of the regression.
 Do a scatter plot of the variable X and Y and add the trend line to the graph.
 Repeat this process of each of the three categories.
We can see that the general trend of the data is different from the trends of each of the categories. In other words, the Simpson paradox can also be observed in a regression context. The moral of the story is: make sure that all the variables are included in your analysis or you gonna have a bad time!
Exercise 8
For this problem you must know what’s a true positive, false positive, true negative and false negative in a classification problem. You can look at this page for a quick review of those concepts.
A big data algorithm has been developed to detect potential terrorist by looking at their behavior on the internet, their consummation habit and their traveling. To develop this classification algorithm, the computer scientist used data from a population where there’s a lot of known terrorist since they needed data about the habits of real terrorist to validate their work. In this dataset, you will find observations from this high risk population and observations taken from a low risk population.
 Compute the true positive rate, the false positive rate, the true negative rate and the false negative rate of this algorithm for the population that has a high risk of terrorism.
 Repeat this process for the remaining observations. Is there a difference between those rate?
It is a known fact that false positive rate are a lot higher in lowincidence population and this is known as . Basically, when the incidence of a certain condition in the population is lower than the average false positive rate of a test, using that test on this population will result in a much higher false positive cases than usual. This is in part due to the fact that the diminution of true positive case make the proportion of false positive so much higher. As a consequence: don’t trust to much your classification algorithm!
Exercise 9
 Generate a population of 10000 values from a normal distribution of mean 42 and standard deviation of 10.
 Create a sample of 10 observations and estimate the mean of the population. Repeat this 200 times.
 Compute the variation of the estimation.
 Create a sample of 50 observations and estimate the mean of the population. Repeat this 200 times and compute the variation of these estimations.
 Create a sample of 100 observations and estimate the mean of the population. Repeat this 200 times and compute the variation of these estimations.
 Create a sample of 500 observations and estimate the mean of the population. Repeat this 200 times and compute the variation of these estimations.
 Plot the variance of the estimation of the means done with different sample size.
As you can see, the variance of the estimation of the mean is inversely proportional to the sample size, but this is not a linear relationship. A small sample can create an estimation that is a lot farther to the real value than a sample with more observations. Let’s see why this information is relevant to this set.
Exercise 10
A private school advertise that their small size help their student achieve better grade. In their advertisement, they claim that last year’s students have had an average 5 points higher than the average at the standardize state’s test and since no large school has such a high average, that’s proof that small school help student achieve better results.
Suppose that there is 200000 students in the state, their results at the state test was distributed normally with a mean of 76% and a standard deviation of 15, the school had 100 students and that an average school count 750 student. Does the school claim can be explained statistically?
A school can be seen as a sample of the population of student. A large school, like a large sample, has a lot more chance to be representative of the student’s population and their average score will often be near the population average, while small school can show average a lot more extreme just because they have a smaller body of student. I’m not saying that no school are better than other, but we must look at a lot of results to be sure we are not only in presence of a statistical abnormality.
Related exercise sets: Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part8)
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part6)
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part4)
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
How Random Forests improve simple Regression Trees?
(This article was first published on R – insightR, and kindly contributed to Rbloggers)
By Gabriel Vasconcelos
Regression TreesIn this post I am going to discuss some features of Regression Trees an Random Forests. Regression Trees are know to be very unstable, in other words, a small change in your data may drastically change your model. The Random Forest uses this instability as an advantage through bagging (you can see details about bagging here) resulting on a very stable model.
The first question is how a Regression Tree works. Suppose, fore example, that we have the number of points scored by a set of basketball players and we want to relate it to the player’s weight an height. The Regression Tree will simply split the heightweight space and assign a number of points to each partition. The figure below shows two different representations for a small tree. In the left we have the tree itself and in the right how the space is partitioned (the blue line shows the first partition and the red lines the following partitions). The numbers in the end of the tree (and in the partitions) represent the value of the response variable. Therefore, if a basketball player is higher than 1.85 meters and weights more than 100kg it is expected to score 27 points (I invented this data =] ).
You might be asking how I chose the partitions. In general, in each node the partition is chosen through a simple optimization problem to find the best pair variableobservation based on how much the new partition reduces the model error.
What I want to illustrate here is how unstable a Regression Tree can be. The package tree has some examples that I will follow here with some small modifications. The example uses computer CPUs data and the objective is to build a model for the CPU performance based on some characteristics. The data has 209 CPU observations that will be used to estimate two Regression Trees. Each tree will be estimate from a random resample with replacement. Since the data comes from the same place, it would be desirable to have similar results on both models.
library(ggplot2) library(reshape2) library(tree) library(gridExtra) data(cpus, package = "MASS") # = Load Data # = First Tree set.seed(1) # = Seed for Replication tree1 = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus[sample(1:209, 209, replace = TRUE), ]) plot(tree1); text(tree1) # = Second Tree set.seed(10) tree2 = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus[sample(1:209,209, replace=TRUE), ]) plot(tree2); text(tree2)
As you can see, the two trees are different from the start. We can use some figures to verify. First let us calculate the predictions of each model in the real data (not the resample). The first figure is a scatterplot of both predictions and the second figure shows their boxplots. Although the scatterplot shows some relation between the two predictions, it is far from good.
# = Calculate predictions pred = data.frame(p1 = predict(tree1, cpus) ,p2 = predict(tree2, cpus)) # = Plots g1 = ggplot(data = pred) + geom_point(aes(p1, p2)) g2 = ggplot(data = melt(pred)) + geom_boxplot(aes(variable, value)) grid.arrange(g1, g2, ncol = 2) Random ForestAs mentioned before, the Random Forest solves the instability problem using bagging. We simply estimate the desired Regression Tree on many bootstrap samples (resample the data many times with replacement and reestimate the model) and make the final prediction as the average of the predictions across the trees. There is one small (but important) detail to add. The Random Forest adds a new source of instability to the individual trees. Every time we calculate a new optimal variableobservation point to split the tree, we do not use all variables. Instead, we randomly select 2/3 of the variables. This will make the individual trees even more unstable, but as I mentioned here, bagging benefits from instability.
The question now is: how much improvement do we get from the Random Forest. The following example is a good illustration. I broke the CPUs data into a training sample (first 150 observations) and a test sample (remaining observations) and estimated a Regression Tree and a Random Forest. The performance is compared using the mean squared error.
library(randomForest) # = Regression Tree tree_fs = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus[1:150, ]) # = Random Forest set.seed(1) # = Seed for replication rf = randomForest(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data=cpus[1:150, ], nodesize = 10, importance = TRUE) # = Calculate MSE mse_tree = mean((predict(tree_fs, cpus[c(1:150), ])  log(cpus$perf)[c(1:150)])^2) mse_rf = mean((predict(rf, cpus[c(1:150), ])  log(cpus$perf[c(1:150)]))^2) c(rf = mse_rf, tree = mse_tree) ## rf tree ## 0.2884766 0.5660053As you can see, the Regression Tree has an error twice as big as the Random Forest. The only problem is that by using a combination of trees any kind of interpretation becomes really hard. Fortunately, there are importance measures that allow us to at least know which variables are more relevant in the Random Forest. In our case, both importance measures pointed to the cache size as the most important variable.
importance(rf) ## %IncMSE IncNodePurity ## syct 22.60512 22.373601 ## mmin 19.46153 21.965340 ## mmax 24.84038 27.239772 ## cach 27.92483 33.536185 ## chmin 13.77196 13.352793 ## chmax 17.61297 8.379306Finally, we can see how the model error decreases as we increase the number of trees in the Random Forest with the following code:
plot(rf)If you liked this post, you can find more details on Regression Trees and Random forest in the book Elements of Statistical learning, which can be downloaded direct from the authors page here.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – insightR. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Welcome to R/exams
(This article was first published on R/exams, and kindly contributed to Rbloggers)
Welcome everybody, we are proud to introduce the brand new web page and
blog http://www.Rexams.org/. This provides a central access point for
the opensource software “exams” implemented in the
R system for statistical computing.
R/exams is a oneforall exams generator using (potentially) dynamic
exercises written in R/Markdown or R/LaTeX and it can export a variety of
formats from learning management systems to classical written exams.
This post gives a brief overview of what has happened so far and what
you can expect to find here in the next months.
The package was originally started more than a decade ago to facilitate
classical written exams with randomized exercises for largelecture courses.
Like many other teachers of introductory statistics and mathematics courses,
we were in need of infrastructure for conducting written exams with about 1,0001,500
students. A team effort of colleagues at WU Wirtschaftsuniversität Wien
lead to a large collection of dynamic exercises and the software was eventually
released at https://CRAN.Rproject.org/package=exams.
Over the years learning management systems (like
Moodle,
Blackboard,
OLAT, etc.)
became easily available at virtually every university, creating a desire to
employ the same dynamic exercises also for online tests and quizzes. Hence,
the R/exams infrastructure was completely reimplemented allowing to export
the same exercises not only to written exams (with automatic scanning
and evaluation) but also to learning management systems, the live voting
system ARSnova, as well as customizable standalone
files in PDF, HTML, Docx, and beyond.
Despite (or rather because of) the flexibility of the software, novice R/exams
users often struggled with adopting it because the documentation provided in
the package is either somewhat technical and/or targeted towards more experienced
R users.
Hence, this web page and blog make it easy for new users to explore the possibilities
of R/exams before reading about a lot of technical details. It also provides accessible
guides to common tasks and examples for dynamic exercises with different complexity.
For a first tour you can check out the oneforall approach of
the package based on (potentially) dynamic exercise templates
for generating large numbers of personalized exams/quizzes/tests, either for
elearning or classical written exams (with
automatic evaluation).
Some tutorials already explain the installation of R/exams (with
dependencies like LaTeX or pandoc) and the first steps in writing dynamic exercises
using either Markdown or LaTeX markup along with randomization in R. There is
also a gallery of exercise templates, ranging from basic multiplechoice
questions with simple shuffling to more advanced exercises with random data, graphics,
etc.
For the next months we plan to write more tutorial blog posts that help to accomplish
common tasks, e.g., handson guidance for exporting exercises from R to Moodle or
tips how to write good randomized exercises. If you want to give us feedback or ask
us to cover certain topics please feel free to contact us – be it via email,
discussion forum, or twitter. Also if you want to link R/examsbased materials or share
share experiences of using R/exams in a guest post, please let us know.
Big shout to all contributors that helped R/exams to grow so much
over the last years. A special thank you goes to Patrik Keller, Hanna Krimm, and Reto
Stauffer who established the infrastructure for this web page (using R/Markdown and Jekyll)
and designed graphics/icons/photos/etc. Thanks also to everyone sharing this post,
especially on http://www.Rbloggers.com/ and https://RWeekly.org/.
To leave a comment for the author, please follow the link and comment on their blog: R/exams. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Big Data Analytics with H20 in R Exercises Part 1
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
We have dabbled with RevoScaleR before , In this exercise we will work with H2O , another high performance R library which can handle big data very effectively .It will be a series of exercises with increasing degree of difficulty . So Please do this in sequence .
H2O requires you to have Java installed in your system .So please install Java before trying with H20 .As always check the documentation before trying these exercise set .
Answers to the exercises are available here.
If you want to install the latest release from H20 , install it via this instructions .
Exercise 1
Download the latest stable release from h20 and initialize the cluster
Exercise 2
Check the cluster information via clusterinfo
Exercise 3
You can see how h2o works via the demo function , Check H2O’s glm via demo method .
Exercise 4
down load the loan.csv from H2O’s github repo and import it using H2O .
Exercise 5
Check the type of imported loan data and notice that its not a dataframe , check the summary of the loan data .
Hint use h2o.summary()
Exercise 6
One might want to transfer a dataframe from R environment to H2O , use as.h2o to conver the mtcars dataframe as a H2OFrame
 work with different data import techniques,
 know how to import data and transform it for a specific moddeling or analysis goal,
 and much more.
Exercise 7
Check the dimension of the loan H2Oframe via h2o.dim
Exercise 8
Find the colnames from the H2OFrame of loan data.
Exercise 9
Check the histogram of the loan amount of loan H2Oframe .
Exercise 10
Find the mean of loan amount by each home ownership group from the loan H2OFrame
 Big Data analytics with RevoScaleR Exercises2
 Data wrangling : I/O (Part1)
 DensityBased Clustering Exercises
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...