An AI pitches startup ideas
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
Take a look at this list of 13 hot startups, compiled by Alex Bresler. Perhaps one of them is the next Juicero?
FAR ATHERA: A CLINICAL AI PLATFORM THAT CAN BE ACCESSED ON DEMAND.
ZAPSY: TRY-AT-HOME SERVICE FOR CONSUMER ELECTRONICS.
MADESS: ON-DEMAND ACCESS TO CLEAN WATER.
DEERG: AI RADIOLOGIST IN A HOME
SPER: THE FASTEST, EASIEST WAY TO BUY A HOME WITHOUT THE USER HAVING TO WEAR ANYTHING.
PLILUO: VENMO FOR B2B SAAS.
LANTR: WE HELP DOCTORS COLLECT 2X MORE ON CANDLES AND KEROSENE.
ABS: WE PROVIDE FULL-SERVICE SUPPORT FOR SUBLIME, VIM, AND EMACS.
INSTABLE DUGIT: GITHUB FOR YOUR LOVED ONES.
CREDITAY: BY REPLACING MECHANICAL PARTS WITH AIR, WE ELIMINATE INSTALLATION COMPLEXITY AND MAINTENANCE HEADACHES, LEADING TO SIGNIFICANT EFFICIENCY GAINS IN PRODUCTION COSTS AND HARVESTING TIME.
CREDITANO: WE BUILD SOFTWARE TO ENABLE HIGH FUNCTIONALITY BIONICS FOR TREATING HEALTH CONDITIONS TO BE SOFTWARE ENGINEERS FOR FREE IN EXCHANGE FOR A FULL STACK SOLUTION THAT DELIVERS WORKFLOW AND EMBEDS ANALYTICS WITHIN THE ELECTRONIC MEDICAL RECORD.
PATT: CONNECTED TOYS THAT ENABLE TWO-WAY VOICE MESSAGING BETWEEN CHILDREN AND THEIR LOVED ONES CAN KNOW THEY ARE GONE, SO THEY GET FREE WATER REFILLS FROM OUR POOL OF VETTED ENGINEERING AND PROGRAMMING FOR KIDS 6-12 WITH A SINGLE MESSAGE.
GITS: WE NOW WANT TO BRING BOXING AS AN ALTERNATIVE TO PAGERS.
As you've probably guessed by now, these are not real startups. (Though given Juicero, you'd be forgiven if you didn't.) Alex Bresler created a generative neural network (GNN) from Y Combinator's archive of startup names and descriptions, and the GNN created almost 5000 new ones. (You can download an Excel file with all of them here, to save load on Alex's server.) I picked a few by hand to create the list above, but there was no shortage of funny ones (and plenty of all-too-real ones as well).
Alex used his fundManageR package (available on GitHub) to download the Y Combinator database into R. The generative neural network was trained on these data using the keras package for R, with TensorFlow as the back end. You can find all the details on how the list was generated, including the R code, at Alex's blog linked below.
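Alex's actual model is the keras/TensorFlow network described on his blog. Purely to illustrate the general idea — learn character transitions from a corpus of names, then sample new ones — here is a toy character-level Markov chain in Python. This is not Alex's method, and the training list below is just a handful of the generated names from the list above:

```python
import random
from collections import defaultdict

def train_markov(names, order=2):
    # Map each `order`-character context to the characters observed after it,
    # padding with "^" (start) and "$" (end) markers.
    model = defaultdict(list)
    for name in names:
        padded = "^" * order + name + "$"
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model

def generate(model, order=2, max_len=20, rng=random):
    # Sample characters one at a time, following contexts seen in training.
    context = "^" * order
    out = []
    while len(out) < max_len:
        nxt = rng.choice(model[context])
        if nxt == "$":
            break
        out.append(nxt)
        context = context[1:] + nxt
    return "".join(out)

# A few of the generated names above, lowercased, as a tiny training corpus
startups = ["athera", "zapsy", "madess", "deerg", "sper", "pliluo", "lantr"]
model = train_markov(startups)
random.seed(42)
fake = generate(model)  # a new, possibly nonsensical, startup-like name
```

A real GNN conditions on much longer context and generalizes across names; the Markov chain only recombines fragments it has literally seen, which is why neural models produce the more entertaining output.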
ASBC LLC: Juice, Err, Ohh…
To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: data science, big data, R jobs, visualization (ggplot2, boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, Hadoop, web scraping), statistics (regression, PCA, time series, trading) and more...
R for Business Analysts, NYC Data Science Academy in-person training – November 6 @NYC
(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)
R is a powerful language used widely for business analytics. More companies are seeking business analysts with knowledge of R.
In this R for Business Analysts course, students will learn how data science is done in the wild, with a focus on data acquisition, cleaning, and aggregation, exploratory data analysis and visualization, feature engineering, and model creation and validation. Students use the R statistical programming language to work through real-world examples that illustrate these concepts. Concurrently, students learn some of the statistical and mathematical foundations that power the data-scientific approach to problem solving.
Classes will be given in a lab setting, with student exercises mixed with lectures. Students should bring a laptop to class. There will be a modest amount of homework after each class. Due to the focused nature of this course, there will be no individual class projects but the instructors will be available to help students who are applying R to their own work outside of class.
Designed and taught by Brennan Lodge, Team Lead at Bloomberg. Watch his interview here.
Details
 Dates: November 6th – December 11th, 2017
 Time: Mondays & Wednesdays, 7:00 PM to 9:30 PM EST
 Venue: 500 8th Ave., Suite 905, New York, NY 10018 (5 min from Penn Station)
This course is for anyone with a basic understanding of data analysis techniques, and for business analysts interested in improving their ability to tackle problems involving multidimensional data in a systematic, principled way. Familiarity with the R programming language is helpful, but not necessary if the pre-work for the course is completed (more on that below).
Prerequisites
Students should have some experience with programming and some familiarity with basic statistical and linear algebraic concepts such as mean, median, mode, standard deviation, correlation, and the difference between a vector and a matrix. In R, it will be helpful to know basic data structures such as data frames and how to use RStudio.
Students should complete the following pre-work (approximately 2 hours) before the first day of class:
 R Programming – https://www.rstudio.com/onlinelearning/#R
 R Studio Essentials Programming 1: Writing Code https://www.rstudio.com/resources/webinars/rstudioessentialswebinarseriespart1/
Upon completing the course, students will have:
 An understanding of data science business problems solvable using R and an ability to articulate those business use cases from a statistical perspective.
 The ability to create data visualization output with Rmarkdown files and Shiny Applications.
 Familiarity with the R data science ecosystem and the various tools a business analyst can use to continue developing as a data scientist.
Unit 1: Data Science and R Intro
 Big Data
 Data Science
 Roles in Data Science
 Use Cases
 Data’isms
 Class Format overview
 R Background
 R Intro
 R Studio
Unit 2: Visualize
 Rules of the road with data viz
 Chart junk
 Chart terminology
 Clean chart
 Scaling data
 Data Viz framework
 Code plotting
Unit 3: R Markdown
 Presenting your work
 R markdown file structure
 Code chunks
 Generating a R markdown file
 Rmarkdown Exercise
Unit 4: Shiny
 Shiny structure
 Reactive output
 Widgets
 Rendering Output
 Stock example
 Hands-on challenge
Unit 5: Data Analysis
 How to begin your data journey?
 The human factor
 Business Understanding
 Dplyr
 EDA – Exploratory Data Analysis
 Data Anomalies
 Data Statistics
 Key Business Analysis Takeaways
 Diamond data set exercise
 Hands-on challenge with Bank Marketing
Unit 6: Introduction to Regression
 Regression Definition
 Examples of regression
 Formulate the formula
 Plotting
 Statistical definitions involved
 mtcars regression example
 Business use case with regression
Unit 7: Introduction to Machine Learning
 ML Concept
 Types of ML
 CRISP Model
 Modeling
 Evaluation
 Titanic Example
 Decision Trees
 Feature Engineering
Unit 8: Strategy
 Data Driven Decision Making
 Data Science Strategy
 Strategy Fails
 Macroeconomic strategy
 Adapting
 Data Science Project
 Data Impact
 Project guide
 Opportunities for improvement
 Big Box Store Strategic Exercise
Seats are filling up fast! Sign up now.
If you have any questions about our course or the application process, please do not hesitate to reach out to us via email.
The post R for Business Analysts, NYC Data Science Academy in-person training – November 6 @NYC appeared first on NYC Data Science Academy Blog.
To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.
Uber assignment with lpSolve
(This article was first published on R – insightR, and kindly contributed to R-bloggers)
By Yuri Fonseca
In this post we are going to build an Uber-style assignment simulation and estimate some waiting-time metrics through Monte Carlo simulation.
Setting
Suppose we live in a 100×100-block city where each block takes 1 minute to cross by car. Drivers can pick up passengers only on corners, and passengers must call Uber from corners. The bottom-left corner is (1,1) and the top-right corner is (100,100).
Functions
In order to calculate the average waiting time of a passenger, we need a few functions. The first function calculates the distance between a specific passenger and a specific driver, the second creates the passengers, the third creates the cars, and the last builds the distance matrix between cars and passengers in order to assign them in an optimal way.
Since our city is a perfect square, the distance between a passenger and a car is simply the distance along the x-axis plus the distance along the y-axis; moreover, signs don’t matter. In this simple example we just need to know the initial position of the passenger, the initial position of the car, and the final destination. Each of these positions is a 2×1 vector in the city map.
Observation: distance matrix
We are going to use a linear programming solver to allocate cars to passengers optimally. This assignment problem only works with a square cost matrix, so if we have more cars than passengers, we need to create fictitious passengers (at zero distance) for the solver to converge. Note that we are going to do just a single round of allocation, so the number of cars needs to be greater than or equal to the number of passengers asking for an Uber driver.
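lpSolve's lp.assign handles the optimization in R. To make the square-matrix trick concrete, here is a hypothetical brute-force assignment in Python: three cars, two real passengers, and one zero-cost dummy column padding the matrix to square (all numbers invented for the example):

```python
from itertools import permutations

def assign(cost):
    # Brute-force optimal assignment for a small square cost matrix:
    # row i (car) is matched to column best[i] (passenger), minimizing total cost.
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best)

# 3 cars but only 2 real passengers: pad with a zero-cost dummy column
# (column 2) so the matrix is square, as the assignment solver requires.
cost = [
    [4, 7, 0],  # car 0: distance to passenger 0, passenger 1, dummy
    [2, 3, 0],
    [5, 1, 0],
]
match = assign(cost)  # car 0 -> dummy, car 1 -> passenger 0, car 2 -> passenger 1
```

Brute force is O(n!) and only viable for toy sizes; lp.assign (or the Hungarian algorithm) solves the same problem in polynomial time, which is why the R code below uses it.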
create_passenger = function(id){
  initial.position = sample(50, 2, replace = TRUE)
  final.destination = sample(50, 2, replace = TRUE)
  return(list('number' = id, 'initial' = initial.position, 'final' = final.destination))
}

create_car = function(id){
  initial.position = sample(50, 2, replace = TRUE)
  return(list('number' = id, 'position' = initial.position))
}

distance = function(x, y){
  sum(abs(x - y)) # Manhattan distance: |dx| + |dy|
}

distance.matrix = function(cars, passengers){
  d.matrix = matrix(0, nrow = length(cars), ncol = length(cars))
  for (i in 1:length(cars)){
    for (j in 1:length(passengers)){
      d.matrix[i,j] = distance(cars[[i]]$position, passengers[[j]]$initial)
    }
  }
  return(d.matrix)
}

Example map
Let's check an example of 10 passengers and 10 cars:
library(lpSolve)
library(ggplot2)

set.seed(20)
passengers = lapply(seq(1:10), create_passenger)
cars = lapply(seq(1:10), create_car)
d.matrix = distance.matrix(cars, passengers)
opt.allocation = lp.assign(d.matrix)

passengers.points = sapply(passengers, function(x) x$initial)
cars.points = sapply(cars, function(x) x$position)
points = t(cbind(passengers.points, cars.points))
assignments = apply(opt.allocation$solution, 1, which.max) # the assigned passenger for each car

df1 = data.frame('x.axis' = points[,1], 'y.axis' = points[,2],
                 'id' = c(rep('Passenger',10), rep('Car',10)))

# df.assign = data.frame('x' = cars.points[1,],
#                        'y' = cars.points[2,],
#                        'xend' = passengers.points[1,assignments],
#                        'yend' = passengers.points[2,assignments])
df.assign1 = data.frame('x' = cars.points[1,],
                        'y' = cars.points[2,],
                        'xend' = passengers.points[1,assignments],
                        'yend' = cars.points[2,])
df.assign2 = data.frame('x' = passengers.points[1,assignments],
                        'y' = cars.points[2,],
                        'xend' = passengers.points[1,assignments],
                        'yend' = passengers.points[2,assignments])

ggplot(df1, aes(x.axis, y.axis)) +
  geom_point(aes(color = id, group = id), size = 3) + # cars and passengers
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend), data = df.assign1) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend), data = df.assign2,
               arrow = arrow(length = unit(0.02, "npc"), type = 'closed')) +
  scale_x_continuous(minor_breaks = seq(1, 50, 1)) +
  scale_y_continuous(minor_breaks = seq(1, 50, 1)) +
  ggtitle('Optimal Allocation')

Monte Carlo simulation
Now for the waiting time. Suppose we have 10 passengers asking for cars in the city; we are going to see how the waiting time changes with the number of cars, varying it from 10 up to 30, with 500 Monte Carlo simulations for each setting.
simulations = function(N, MC) {
  ncars = N
  times = matrix(0, nrow = MC, ncol = N)
  for (i in 1:MC){
    passengers = lapply(seq(1:10), create_passenger)
    cars = lapply(seq(1:ncars), create_car)
    d.matrix = distance.matrix(cars, passengers)
    opt.allocation = lp.assign(d.matrix)
    times[i,] = colSums(opt.allocation$solution * opt.allocation$costs) # waiting time for each passenger
  }
  return(times)
}

results = lapply(seq(10, 30, 2), simulations, MC = 500) # MC = 500 just to save some time

Now it is possible to check how the number of cars affects the mean waiting time, along with the 10% and 90% quantiles of the waiting time.
df2 = data.frame('WaitingTime' = sapply(results, mean),
                 'LB' = sapply(results, quantile, probs = 0.10),
                 'UB' = sapply(results, quantile, probs = 0.90),
                 'Cars' = seq(10, 30, 2))

ggplot(df2, aes(x = Cars)) +
  geom_line(aes(y = WaitingTime), lwd = 1.2) +
  geom_ribbon(aes(ymin = LB, ymax = UB), fill = 'blue', alpha = .3)

To leave a comment for the author, please follow the link and comment on their blog: R – insightR.
.rprofile: David Smith
(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)
In this occasional series, we interview someone using a loosely defined set of interview questions for the purpose of “Demystifying the Creative/Development Process of the R Community”. This interview was conducted and prepared by Kelly O’Briant as part of an rOpenSci unconf17 project.
David Smith is a Blogger and Community Lead at Microsoft. I had the chance to interview David last May at rOpenSci unconf17. We spoke about his career, the process of working remote within a team, community development/outreach and his personal methods for discovering great content to share and write about.
KO: What is your name, job title, and how long have you been using R?
DS: My name is David Smith. I work at Microsoft and my self-imposed title is ‘R Community Lead’. I’ve been working with R specifically for about 10 years, but I’d been working with S since the early 90s.
KO: How did you transition into using R?
DS: I was using S for a long, long time, and I worked for the company that commercialized S, where I was a project manager for S-PLUS. And I got out of that company, and then worked for a startup in a different industry for a couple of years. But while I was there, the people that founded Revolution Analytics approached me because they were setting up a company to build a commercial business around R, and reached out to me because of my connections with the S community.
KO: So you came to Microsoft through Revolution?
DS: That’s correct. I was with Revolution Analytics and then Microsoft bought that company, so I’ve been with Microsoft since then.
KO: How has that transition gone, is there a Revolution team inside of Microsoft, or has it become more spread out?
DS: It’s been more spread out. It got split up: the engineering people, the marketing people, the sales people all got distributed around. When I first went to Microsoft I started off in the engineering group doing product management. But I was also doing the community role, and it just wasn’t a very good fit time-wise, between doing community stuff and doing code or product management. So then I switched to a different group called the Ecosystem team, and so now I’m 100% focused on community within a group that’s focused on the ecosystem in general.
KO: What does it mean to be 100% community focused, do you do a lot of training?
DS: I don’t do a lot of training myself, but I work with a lot of other people on the team who do training. We’re focused mainly on building up the ecosystem of people that ultimately add value to the products that Microsoft has. And we’re specifically involved in the products that Microsoft has that now incorporate R by building up the value of the R ecosystem.
KO: What does your daytoday look like, are you in an office, do you work remote?
DS: I work from home. I had moved from Seattle where Microsoft is headquartered to Chicago a couple of months before the acquisition happened, so I wasn’t about to move back to Seattle. But they let me work from home in Chicago, which has worked out great because most of my job is communicating with the community at large. So I do the Revolutions Blog, which I’ve been writing for eight or nine years now, writing stories about people using R, and applications of R packages. All as a way of communicating to the wider world the cool stuff that people can do with R, and also to the R community occasionally, what kind of things you can do with R in the Microsoft environment.
KO: Have you always been a writer or interested in writing and communications?
DS: No, no. I have a mathematics and computer science degree. I’m not trained as a writer. But it’s actually been useful having the perspective of statistics and mathematics and programming, and to bring that to a broader audience through writing. I’ve learned a lot about the whole writing and blogging and journalism process through that, but I’m certainly not trained in that way.
KO: How does your Ecosystems team at Microsoft function and collaborate?
DS: Unlike many teams at Microsoft, our team is very distributed. We have people working remotely from Denver, I’m in Chicago, Seattle, we’re all kind of distributed all around the place. So we meet virtually through Skype, have video meetings once a week and communicate a lot online.
KO: What kind of tools are you using?
DS: Traditionally, as in Microsoft, mainly email and Skype for the meetings. I set up an internal team focused around community more broadly around Microsoft and we use Microsoft Teams for that,
which is a little bit like Slack. But a lot of the stuff that I do is more out in the open, so I use a lot of Twitter and Github for the code that I point to and stuff like that.
KO: How do you manage your Twitter?
DS: Twitter I do manually in real time. I don’t do a lot of scheduling except for @RLangTip
which is a feed of daily R tips. And for that I do scheduling through Tweetdeck on the web.
KO: How many Twitter accounts are you managing?
DS: I run @revodavid which is my personal twitter account, and @RLangTip which is R language tips. I tweet for @R_Forwards which is the diversity community for R, @RConsortium, the R Consortium, so quite a few.
KO: How long has this been a core part of your work day?
DS: The community thing as a focus, maybe five or six years? My career path for a long time was in product management. So I managed S-PLUS as a product for a long time, I managed another product at a different startup, and then I came to Revolution and I did a combination of engineering and product management. But in the last 18 months I’ve been 100% in the community space.
KO: How did you get into product management to begin with?
DS: That’s a good question that I’m not sure I know the answer to. I started off my first job after university — I actually left university specifically to become a support engineer for S-PLUS. When I took on that role, they didn’t really have product management yet at that company, and so when they were looking for somebody to basically steer S-PLUS as a product, it was a good fit for me and an opportunity to move to the States. I took that on and I kind of just learned product management as I did it. I went to a few sort of training/seminar type things, but I didn’t study it.
KO: Sure. It seems like something that people just kind of get saddled with sometimes?
DS: Exactly. It’s a discipline that doesn’t really have discipline. But for the various companies I’ve worked for, mostly startups, they all seem to have very different perspectives on what product management is and what the role of a product manager is.
KO: Yeah, I know what you mean. Are you happy to have sort of moved away from that?
DS: I am in the sense of — it was different being in a startup where being a product manager was more like being the shepherd of an entire product ecosystem, whereas in a big company the product manager is a lot more focused and inherently so, a lot more narrow. I happen to prefer the bigger picture I guess.
KO: What’s your process for deciding what things you talk about and bring to the community?
DS: Honestly, I kind of focus from the point of view of what interests me personally. Which doesn’t sound very community oriented at all… but it’s an exercise in empathy. If I can write about something, or find a topic that I might find is interesting or exciting and I can communicate that with other people, I’m motivated to write about it and I hope that people are then motivated to learn about it. Kind of the antithesis of this is when I worked in marketing for a while; a lot of that style of writing was the bane of my existence because you’re producing these documents that literally are designed for nobody to read, in this language that nobody engages with. I much prefer blogging and tweeting because it’s much more directly for people.
KO: What have some of your most popular or successful engagements been about? Feel free to interpret ‘successful' in any way.
DS: Well, from the point of view of what has been the most rewarding part of my job, is finding under-recognized or just these really cool things that people have done that just haven’t had a lot of exposure. And I’ve got a fairly big audience and a fairly wide reach, and it’s always fun for me to find things that people have done that maybe haven’t been seen. And it’s not my work, but I can — you know — take an eight page document that somebody’s written that has really cool things in it and just pull out various things. There is so much very cool stuff that people have done, half of the battle is getting it out there.
KO: What are some of your favorite sources for discovering cool things on the internet?
DS: There are channels on Reddit that I get a lot of material from, like /r/dataisbeautiful and things like that. It’s hard to say particular accounts on twitter, but I’ve spent a lot of time following people where I’ve read one of their blog posts and I find their twitter account, and they have just a few followers, I’ll follow them, and then over time it amounts to some good stuff. I have twitter open all day, every day. I don’t read everything on my feed every day, but I certainly keep it open.
KO: How much of your day is just spent exploring?
DS: A lot of it. I spend about half of any given day reading. It takes a long time, but every now and then you find this really cool stuff.
KO: Do you have any last nuggets of wisdom for people starting out their careers in R?
DS: For people starting out their careers, I think one of the most important skills to learn is that communication skill. It’s one thing to be able to do really cool stuff in R or any other language, but until you can distill that down into something that other people consume, it’s going to be hard to sell yourself. And it’s also going to be hard to be valuable. A lot of the people I’ve watched evolve in the community are people who have begun very early in their careers, blogging about what they do. The one piece of advice I could give anyone starting out in their careers is – write what you do, write it in public, and make it so that other people can reproduce what you did.
To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.
Practical Machine Learning with R and Python – Part 2
(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)
In this 2nd part of the series “Practical Machine Learning with R and Python”, I continue where I left off in my first post, Practical Machine Learning with R and Python – Part 1. In this post I cover some classification algorithms and cross-validation. Specifically I touch on
Logistic Regression
K Nearest Neighbors (KNN) classification
Leave One Out Cross Validation (LOOCV)
K Fold Cross Validation
in both R and Python.
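Of the topics above, the two cross-validation schemes are the easiest to state precisely: k-fold CV splits the data into k folds and holds each fold out once as the validation set, and LOOCV is simply the special case k = n. A minimal index-splitting sketch in Python (the helper kfold_indices is my own, not from either course):

```python
def kfold_indices(n, k):
    # Split indices 0..n-1 into k contiguous folds; each fold serves once
    # as the validation set, the remaining indices as training data.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, val))
        start += size
    return folds

splits = kfold_indices(10, 5)  # 5-fold CV on 10 observations: folds of size 2
loo = kfold_indices(10, 10)    # LOOCV is just k-fold with k = n: folds of size 1
```

In practice one would shuffle the indices first and let a library (sklearn.model_selection.KFold in Python, caret/boot in R) manage the splitting; the point here is only the train/validation bookkeeping.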
As in my initial post the algorithms are based on the following courses.
 Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Stanford Online
 Applied Machine Learning in Python, Prof Kevyn Collins-Thompson, University of Michigan, Coursera
You can download this R Markdown file along with the data from GitHub. I hope these posts can be used as a quick reference for R, Python and Machine Learning. I have tried to include the coolest parts of either course in this post.
The following classification problem is based on Logistic Regression. The data set is included in scikit-learn; I saved it as a csv and use it for R as well. The fit of a classification Machine Learning model depends on how correctly it classifies the data. There are several measures of a model’s classification performance:
– Accuracy = (TP + TN) / (TP + TN + FP + FN) – fraction of all instances correctly classified
– Precision = TP / (TP + FP) – fraction of true positives among all instances classified as positive
– Recall = TP / (TP + FN) – also known as sensitivity, or True Positive Rate; fraction of the actual positives correctly classified as positive
– F1 = 2 * Precision * Recall / (Precision + Recall)
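As a quick sanity check of these formulas, here is a small Python helper (my own, purely illustrative) that computes all four metrics from raw confusion-matrix counts; the counts used are hypothetical:

```python
def classification_metrics(tp, tn, fp, fn):
    # Compute the four standard classification metrics from the raw
    # confusion-matrix counts: true/false positives and negatives.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # a.k.a. sensitivity / True Positive Rate
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts for a binary classifier on 142 test instances
acc, prec, rec, f1 = classification_metrics(tp=81, tn=52, fp=5, fn=4)
```

Note the parentheses in the accuracy formula: TP + TN must be summed before dividing by the total, and F1 is the harmonic mean of precision and recall, so it is dragged down by whichever of the two is worse.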
The caret and e1071 packages are required for the confusionMatrix call.
source("RFunctions.R") library(dplyr) library(caret) library(e1071) # Read the data (from sklearn) cancer < read.csv("cancer.csv") # Rename the target variable names(cancer) < c(seq(1,30),"output") # Split as training and test sets train_idx < trainTestSplit(cancer,trainPercent=75,seed=5) train < cancer[train_idx, ] test < cancer[train_idx, ] # Fit a generalized linear logistic model, fit=glm(output~.,family=binomial,data=train,control = list(maxit = 50)) # Predict the output from the model a=predict(fit,newdata=train,type="response") # Set response >0.5 as 1 and <=0.5 as 0 b=ifelse(a>0.5,1,0) # Compute the confusion matrix for training data confusionMatrix(b,train$output) ## Confusion Matrix and Statistics ## ## Reference ## Prediction 0 1 ## 0 154 0 ## 1 0 272 ## ## Accuracy : 1 ## 95% CI : (0.9914, 1) ## No Information Rate : 0.6385 ## PValue [Acc > NIR] : < 2.2e16 ## ## Kappa : 1 ## Mcnemar's Test PValue : NA ## ## Sensitivity : 1.0000 ## Specificity : 1.0000 ## Pos Pred Value : 1.0000 ## Neg Pred Value : 1.0000 ## Prevalence : 0.3615 ## Detection Rate : 0.3615 ## Detection Prevalence : 0.3615 ## Balanced Accuracy : 1.0000 ## ## 'Positive' Class : 0 ## m=predict(fit,newdata=test,type="response") n=ifelse(m>0.5,1,0) # Compute the confusion matrix for test output confusionMatrix(n,test$output) ## Confusion Matrix and Statistics ## ## Reference ## Prediction 0 1 ## 0 52 4 ## 1 5 81 ## ## Accuracy : 0.9366 ## 95% CI : (0.8831, 0.9706) ## No Information Rate : 0.5986 ## PValue [Acc > NIR] : <2e16 ## ## Kappa : 0.8677 ## Mcnemar's Test PValue : 1 ## ## Sensitivity : 0.9123 ## Specificity : 0.9529 ## Pos Pred Value : 0.9286 ## Neg Pred Value : 0.9419 ## Prevalence : 0.4014 ## Detection Rate : 0.3662 ## Detection Prevalence : 0.3944 ## Balanced Accuracy : 0.9326 ## ## 'Positive' Class : 0 ## 1b. 
1b. Logistic Regression – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification, make_blobs
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer
os.chdir("C:\\Users\\Ganesh\\RandPython")

# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
fig, subaxes = plt.subplots(1, 1, figsize = (7, 5))
# Fit a Logistic Regression model
clf = LogisticRegression().fit(X_train, y_train)
# Compute and print the accuracy scores
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
      .format(clf.score(X_test, y_test)))
y_predicted = clf.predict(X_test)
# Compute and print the confusion matrix and metrics
confusion = confusion_matrix(y_test, y_predicted)
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_predicted)))
print('F1: {:.2f}'.format(f1_score(y_test, y_predicted)))

## Accuracy of Logistic regression classifier on training set: 0.96
## Accuracy of Logistic regression classifier on test set: 0.96
## Accuracy: 0.96
## Precision: 0.99
## Recall: 0.94
## F1: 0.97

2. Dummy variables
The following R and Python code show how dummy variables are handled in R and Python.
Dummy variables are categorical variables which have to be converted into appropriate numeric values before use in a Machine Learning model. For example, if we had a currency variable with the values ‘dollar’, ‘rupee’ and ‘yen’, then dummy encoding converts it as
dollar 1 0 0
rupee  0 1 0
yen    0 0 1
Pandas has a get_dummies function for handling dummies
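The encoding itself is easy to state without any library. A minimal pure-Python sketch of what get_dummies does for a single column (the one_hot helper is mine, for illustration only):

```python
def one_hot(values):
    # Map each distinct category to its own 0/1 indicator column,
    # mimicking what pandas.get_dummies does for a single column.
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

encoded = one_hot(["dollar", "rupee", "yen", "rupee"])
# encoded == {'dollar': [1, 0, 0, 0], 'rupee': [0, 1, 0, 1], 'yen': [0, 0, 1, 0]}
```

pd.get_dummies(df, columns=[...]) applies the same idea per column and returns a DataFrame with one indicator column per category, which is what the code below relies on.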
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Read data
df = pd.read_csv("adult1.csv", encoding = "ISO-8859-1", na_values = ["", " ", " ?"])
# Drop rows with NA
df1 = df.dropna()
print(df1.shape)

# Select specific columns
adult = df1[['age','occupation','education','education-num','capital-gain','capital-loss',
             'hours-per-week','native-country','salary']]
X = adult[['age','occupation','education','education-num','capital-gain','capital-loss',
           'hours-per-week','native-country']]
# Set appropriate values for dummy variables
X_adult = pd.get_dummies(X, columns = ['occupation','education','native-country'])
y = adult['salary']

X_adult_train, X_adult_test, y_train, y_test = train_test_split(X_adult, y, random_state = 0)
clf = LogisticRegression().fit(X_adult_train, y_train)

# Compute and display accuracy and the confusion matrix
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
      .format(clf.score(X_adult_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
      .format(clf.score(X_adult_test, y_test)))
y_predicted = clf.predict(X_adult_test)
confusion = confusion_matrix(y_test, y_predicted)
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_predicted)))
print('F1: {:.2f}'.format(f1_score(y_test, y_predicted)))

## (30161, 16)
## Accuracy of Logistic regression classifier on training set: 0.82
## Accuracy of Logistic regression classifier on test set: 0.81
## Accuracy: 0.81
## Precision: 0.68
## Recall: 0.41
## F1: 0.51

3a. K Nearest Neighbors Classification – R code
The Adult data set is taken from the UCI Machine Learning Repository.
source("RFunctions.R")
df <- read.csv("adult1.csv", stringsAsFactors = FALSE, na.strings = c(""," "," ?"))
# Remove rows which have NA
df1 <- df[complete.cases(df),]
dim(df1)
## [1] 30161 16
# Select specific columns
adult <- df1 %>% dplyr::select(age, occupation, education, educationNum, capitalGain,
                               capital.loss, hours.per.week, native.country, salary)
# Set dummy variables
adult1 <- dummy.data.frame(adult, sep = ".")
# Split train and test as required by the KNN classification model
train_idx <- trainTestSplit(adult1, trainPercent = 75, seed = 1111)
train <- adult1[train_idx, ]
test <- adult1[-train_idx, ]
train.X <- train[, 1:76]
train.y <- train[, 77]
test.X <- test[, 1:76]
test.y <- test[, 77]
# Fit a model for 1, 3, 5, 10 and 15 neighbors
cMat <- NULL
neighbors <- c(1, 3, 5, 10, 15)
for (i in seq_along(neighbors)) {
  fit = knn(train.X, test.X, train.y, k = neighbors[i])
  table(fit, test.y)
  a <- confusionMatrix(fit, test.y)
  cMat[i] <- a$overall[1]
  print(a$overall[1])
}
## Accuracy
## 0.7835831
## Accuracy
## 0.8162047
## Accuracy
## 0.8089113
## Accuracy
## 0.8209787
## Accuracy
## 0.8184591
# Plot the Accuracy for each of the KNN models
df <- data.frame(neighbors, Accuracy = cMat)
ggplot(df, aes(x = neighbors, y = Accuracy)) + geom_point() + geom_line(color = "blue") +
  xlab("Number of neighbors") + ylab("Accuracy") +
  ggtitle("KNN classification - Accuracy vs Number of Neighbors (Unnormalized)")

3b – K Nearest Neighbors Classification – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Read data
df = pd.read_csv("adult1.csv", encoding="ISO-8859-1", na_values=["", " ", " ?"])
df1 = df.dropna()
print(df1.shape)
# Select specific columns
adult = df1[['age','occupation','education','educationNum','capitalGain','capitalloss',
             'hoursperweek','nativecountry','salary']]
X = adult[['age','occupation','education','educationNum','capitalGain','capitalloss',
           'hoursperweek','nativecountry']]
# Set values for dummy variables
X_adult = pd.get_dummies(X, columns=['occupation','education','nativecountry'])
y = adult['salary']
X_adult_train, X_adult_test, y_train, y_test = train_test_split(X_adult, y, random_state=0)
# KNN classification in Python requires the data to be scaled.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_adult_train)
# Apply scaling to the test set also
X_test_scaled = scaler.transform(X_adult_test)
# Compute the KNN model for 1, 3, 5, 10 & 15 neighbors
accuracy = []
neighbors = [1, 3, 5, 10, 15]
for i in neighbors:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train_scaled, y_train)
    accuracy.append(knn.score(X_test_scaled, y_test))
    print('Accuracy test score: {:.3f}'.format(knn.score(X_test_scaled, y_test)))
# Plot the models with the Accuracy attained for each of these models
fig1 = plt.plot(neighbors, accuracy)
fig1 = plt.title("KNN classification - Accuracy vs Number of neighbors")
fig1 = plt.xlabel("Neighbors")
fig1 = plt.ylabel("Accuracy")
fig1.figure.savefig('foo1.png', bbox_inches='tight')

## (30161, 16)
## Accuracy test score: 0.749
## Accuracy test score: 0.779
## Accuracy test score: 0.793
## Accuracy test score: 0.804
## Accuracy test score: 0.803

Output image:
4 MPG vs Horsepower
The following scatter plot shows the non-linear relation between mpg and horsepower. This will be used as the data input for computing the K Fold Cross Validation error.
4a MPG vs Horsepower scatter plot – R Code

df = read.csv("auto_mpg.csv", stringsAsFactors = FALSE)  # Data from UCI
df1 <- as.data.frame(sapply(df, as.numeric))
df2 <- df1 %>% dplyr::select(cylinder, displacement, horsepower, weight, acceleration, year, mpg)
df3 <- df2[complete.cases(df2), ]
ggplot(df3, aes(x = horsepower, y = mpg)) + geom_point() +
  xlab("Horsepower") + ylab("Miles Per Gallon") +
  ggtitle("Miles per Gallon vs Horsepower")

4b MPG vs Horsepower scatter plot – Python Code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt

autoDF = pd.read_csv("auto_mpg.csv", encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
autoDF1 = autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
autoDF3 = autoDF2.dropna()
autoDF3.shape
#X=autoDF3[['cylinder','displacement','horsepower','weight']]
X = autoDF3[['horsepower']]
y = autoDF3['mpg']
fig11 = plt.scatter(X, y)
fig11 = plt.title("MPG vs Horsepower")
fig11 = plt.xlabel("Horsepower")
fig11 = plt.ylabel("MPG")
fig11.figure.savefig('foo11.png', bbox_inches='tight')

5 K Fold Cross Validation
K Fold Cross Validation is a technique in which the data set is divided into K folds or K partitions. The Machine Learning model is trained on K-1 folds and tested on the K-th fold, i.e.
we will have K-1 folds for training data and 1 fold for testing the ML model. Since the held-out fold can be chosen in K ways, there will be K such partitions. The K Fold Cross Validation estimates the average validation error that we can expect on new, unseen test data.
The formula for K Fold Cross Validation is as follows
CV_{(K)} = \sum_{k=1}^{K} \frac{n_{k}}{N} MSE_{k}
and
MSE_{k} = \frac{1}{n_{k}}\sum_{i \in C_{k}} (y_{i} - \hat{y}_{i})^{2}
and, for equal-sized folds, n_{k} = N/K,
where n_{k} is the number of elements in partition 'k' and N is the total number of elements
Leave One Out Cross Validation (LOOCV) is a special case of K Fold Cross Validation where N-1 data points are used to train the model and 1 data point is used to test it. There are N such partitions of N-1 & 1 that are possible, and the mean error over them is measured. The Cross Validation error for LOOCV is
CV_{(N)} = \frac{1}{N}\sum_{i=1}^{N} \left( \frac{y_{i} - \hat{y}_{i}}{1 - h_{i}} \right)^{2}
where h_{i} is the i-th diagonal element of the hat matrix H = X(X^{T}X)^{-1}X^{T}
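The hat-matrix shortcut can be checked numerically: for ordinary least squares, the one-fit formula above agrees exactly with refitting the model N times, once per held-out point. A small numpy sketch on synthetic data (not the auto-mpg set used later):

```python
# Verify the LOOCV shortcut CV = (1/N) * sum_i ((y_i - yhat_i)/(1 - h_i))^2
# against brute-force leave-one-out refits, on toy linear-regression data.
import numpy as np

rng = np.random.RandomState(0)
N = 30
X = np.column_stack([np.ones(N), rng.rand(N)])   # intercept + one feature
y = 2.0 + 3.0 * X[:, 1] + rng.randn(N) * 0.1

# Shortcut: one fit, reuse residuals and leverages h_i
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)    # hat-matrix diagonal
cv_shortcut = np.mean((resid / (1.0 - h)) ** 2)

# Brute force: N separate fits, each leaving one point out
errs = []
for i in range(N):
    mask = np.arange(N) != i
    b = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ b) ** 2)
cv_brute = np.mean(errs)

print(cv_shortcut, cv_brute)  # the two agree to machine precision
```

This identity holds only for linear smoothers like OLS, which is why the Python code below has to loop over folds explicitly for polynomial models fitted fold by fold.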
see [Statistical Learning]
The above formula is also included in this blog post
It took me a day and a half to implement the K Fold Cross Validation formula. I think it is correct. In any case do let me know if you think it is off
5a. Leave One Out Cross Validation (LOOCV) – R Code
R uses the package 'boot' for computing the Cross Validation error.
library(boot)
library(reshape2)
# Read data
df = read.csv("auto_mpg.csv", stringsAsFactors = FALSE)  # Data from UCI
df1 <- as.data.frame(sapply(df, as.numeric))
# Select complete cases
df2 <- df1 %>% dplyr::select(cylinder, displacement, horsepower, weight, acceleration, year, mpg)
df3 <- df2[complete.cases(df2), ]
set.seed(17)
cv.error = rep(0, 10)
# For polynomials 1, 2, 3 ... 10 fit a LOOCV model
for (i in 1:10) {
  glm.fit = glm(mpg ~ poly(horsepower, i), data = df3)
  cv.error[i] = cv.glm(df3, glm.fit)$delta[1]
}
cv.error
## [1] 24.23151 19.24821 19.33498 19.42443 19.03321 18.97864 18.83305
## [8] 18.96115 19.06863 19.49093
# Create and display a plot
folds <- seq(1, 10)
df <- data.frame(folds, cvError = cv.error)
ggplot(df, aes(x = folds, y = cvError)) + geom_point() + geom_line(color = "blue") +
  xlab("Degree of Polynomial") + ylab("Cross Validation Error") +
  ggtitle("Leave one out Cross Validation - Cross Validation Error vs Degree of Polynomial")

5b. Leave One Out Cross Validation (LOOCV) – Python Code
In Python there is no built-in function to compute the Cross Validation error, so we have to implement the above formula ourselves. I have done this after several hours; I think it is now in reasonable shape. Do let me know if you think otherwise. For LOOCV I use K Fold Cross Validation with K = N.
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Note: in newer scikit-learn these live in sklearn.model_selection
from sklearn.cross_validation import train_test_split, KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Read data
autoDF = pd.read_csv("auto_mpg.csv", encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
autoDF1 = autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
# Remove rows with NAs
autoDF3 = autoDF2.dropna()
autoDF3.shape
X = autoDF3[['horsepower']]
y = autoDF3['mpg']

# Compute the cross validation error for polynomial degrees 1, 2, 3 ... 10
def computeCVError(X, y, folds):
    deg = []
    degree1 = [1,2,3,4,5,6,7,8,9,10]
    # For degree 'j'
    for j in degree1:
        mse = []  # reset per degree (otherwise the mean runs across degrees)
        # Split the data into 'folds' partitions
        kf = KFold(len(X), n_folds=folds)
        for train_index, test_index in kf:
            # Create the appropriate train and test partitions from the fold indices
            X_train, X_test = X.iloc[train_index], X.iloc[test_index]
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]
            # Transform X_train and X_test for the polynomial degree 'j'
            poly = PolynomialFeatures(degree=j)
            X_train_poly = poly.fit_transform(X_train)
            X_test_poly = poly.transform(X_test)
            # Fit a model on the transformed data
            linreg = LinearRegression().fit(X_train_poly, y_train)
            # Compute yhat or ypred
            y_pred = linreg.predict(X_test_poly)
            # Weight the fold MSE by len(X_train)/len(X)
            test_mse = mean_squared_error(y_test, y_pred) * float(len(X_train)) / float(len(X))
            # Add the test_mse for this partition of the data
            mse.append(test_mse)
        # Compute the mean over all folds for degree 'j'
        deg.append(np.mean(mse))
    return deg

print(len(X))
# Call the function once. For LOOCV K = N, hence len(X) is passed as the number of folds
cvError = computeCVError(X, y, len(X))
# Create and plot the LOOCV error
df = pd.DataFrame(cvError)
fig3 = df.plot()
fig3 = plt.title("Leave one out Cross Validation - Cross Validation Error vs Degree of Polynomial")
fig3 = plt.xlabel("Degree of Polynomial")
fig3 = plt.ylabel("Cross validation Error")
fig3.figure.savefig('foo3.png', bbox_inches='tight')

6a K Fold Cross Validation – R code
Here K Fold Cross Validation is done for 4, 5 and 10 folds using the R package 'boot' and the glm function.
library(boot)
library(reshape2)
set.seed(17)
# Read data
df = read.csv("auto_mpg.csv", stringsAsFactors = FALSE)  # Data from UCI
df1 <- as.data.frame(sapply(df, as.numeric))
df2 <- df1 %>% dplyr::select(cylinder, displacement, horsepower, weight, acceleration, year, mpg)
df3 <- df2[complete.cases(df2), ]
a = matrix(rep(0, 30), nrow = 3, ncol = 10)
set.seed(17)
# Set the folds as 4, 5 and 10
folds <- c(4, 5, 10)
for (i in seq_along(folds)) {
  for (j in 1:10) {
    # Fit a generalized linear model
    glm.fit = glm(mpg ~ poly(horsepower, j), data = df3)
    # Compute the K Fold Validation error
    a[i, j] = cv.glm(df3, glm.fit, K = folds[i])$delta[1]
  }
}
# Create and display the K Fold Cross Validation Error
b <- t(a)
df <- data.frame(b)
df1 <- cbind(seq(1, 10), df)
names(df1) <- c("PolynomialDegree", "4-fold", "5-fold", "10-fold")
df2 <- melt(df1, id = "PolynomialDegree")
ggplot(df2) + geom_line(aes(x = PolynomialDegree, y = value, colour = variable), size = 2) +
  xlab("Degree of Polynomial") + ylab("Cross Validation Error") +
  ggtitle("K Fold Cross Validation - Cross Validation Error vs Degree of Polynomial")

6b. K Fold Cross Validation – Python code
K Fold Cross Validation error is not available out of the box in Python, so I have implemented it below. There is a small discrepancy in the shapes of the curves compared with the R plot above. Not sure why!
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Note: in newer scikit-learn these live in sklearn.model_selection
from sklearn.cross_validation import train_test_split, KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Read data
autoDF = pd.read_csv("auto_mpg.csv", encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
autoDF1 = autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
# Drop NA rows
autoDF3 = autoDF2.dropna()
autoDF3.shape
#X=autoDF3[['cylinder','displacement','horsepower','weight']]
X = autoDF3[['horsepower']]
y = autoDF3['mpg']

# Create the Cross Validation function
def computeCVError(X, y, folds):
    deg = []
    # For degree 1, 2, 3 ... 10
    degree1 = [1,2,3,4,5,6,7,8,9,10]
    for j in degree1:
        mse = []  # reset per degree (otherwise the mean runs across degrees)
        # Split the data into 'folds' partitions
        kf = KFold(len(X), n_folds=folds)
        for train_index, test_index in kf:
            # Partition the data according to the fold indices generated
            X_train, X_test = X.iloc[train_index], X.iloc[test_index]
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]
            # Transform X_train and X_test as per the polynomial degree 'j'
            poly = PolynomialFeatures(degree=j)
            X_train_poly = poly.fit_transform(X_train)
            X_test_poly = poly.transform(X_test)
            # Fit a polynomial regression
            linreg = LinearRegression().fit(X_train_poly, y_train)
            # Compute yhat or ypred
            y_pred = linreg.predict(X_test_poly)
            # Weight the fold MSE by len(X_train)/len(X)
            test_mse = mean_squared_error(y_test, y_pred) * float(len(X_train)) / float(len(X))
            # Append to the list for the different folds
            mse.append(test_mse)
        # Compute the mean for polynomial 'j'
        deg.append(np.mean(mse))
    return deg

# Create and display a plot of the K Fold errors
df = pd.DataFrame()
for folds in [4, 5, 10]:
    cvError = computeCVError(X, y, folds)
    df1 = pd.DataFrame(cvError)
    df = pd.concat([df, df1], axis=1)
df.columns = ['4-fold', '5-fold', '10-fold']
df.index = np.arange(1, 11)  # label rows by polynomial degree
df
fig2 = df.plot()
fig2 = plt.title("K Fold Cross Validation - Cross Validation Error vs Degree of Polynomial")
fig2 = plt.xlabel("Degree of Polynomial")
fig2 = plt.ylabel("Cross validation Error")
fig2.figure.savefig('foo2.png', bbox_inches='tight')

This concludes the second part of this series. I will look into model tuning and model selection in R and Python in the coming parts. Comments, suggestions and corrections are welcome!
To be continued….
Watch this space!
Also see
 Design Principles of Scalable, Distributed Systems
 Reintroducing cricketr! : An R package to analyze performances of cricketers
 Spicing up a IBM Bluemix cloud app with MongoDB and NodeExpress
 Using Linear Programming (LP) for optimizing bowling change or batting lineup in T20 cricket
 Simulating an Edge Shape in Android
To see all posts see Index of posts
To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts.
GitHub Streak: Round Four
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
Three years ago I referenced the Seinfeld Streak used in an earlier post of regular updates to the Rcpp Gallery:
This is sometimes called Jerry Seinfeld’s secret to productivity: Just keep at it. Don’t break the streak.
and showed the first chart of GitHub streaking
And two years ago a first follow-up appeared in this post:
And a year ago we had another follow-up:
And as it is October 12 again, here is the new one:
Again, special thanks go to Alessandro Pezzè for the Chrome add-on GithubOriginalStreak.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
The Bold & Beautiful Character Similarities using Word Embeddings
(This article was first published on R – Longhow Lam's Blog, and kindly contributed to Rbloggers)
Introduction
I often see advertisements for The Bold and The Beautiful, though I have never watched a single episode of the series. Still, even as a data scientist you might wonder how these beautiful ladies and gentlemen from the show are related to each other. I do not have the time to watch all these episodes to find out, so I am going to use word embeddings on recaps instead…
Calculating word embeddings
First, we need some data. From the first few Google hits I landed on the site Soap Central, where recaps of the show can be found dating back to 1997. Then, I used a little bit of rvest code to scrape the daily recaps into an R data set.
Word embedding is a technique to transform a word into a vector of numbers; there are several approaches to do this. I have used the so-called Global Vector (GloVe) word embedding. See here for details: it makes use of word co-occurrences that are determined from a (large) collection of documents, and there is a fast implementation in the R text2vec package.
Once words are transformed to vectors, you can calculate distances (similarities) between the words; for a specific word you can calculate, say, the top 10 closest words. Moreover, linguistic regularities can be captured, for example:
amsterdam − netherlands + germany
would result in a vector that would be close to the vector for berlin.
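The mechanics of such an analogy are just vector arithmetic plus cosine similarity. A toy illustration (the 3-d vectors below are hand-made stand-ins, not real GloVe embeddings):

```python
# Toy word-analogy demo: hand-crafted 3-d "embeddings" where the
# dimensions loosely mean (city-ness, dutch-ness, german-ness).
import numpy as np

vecs = {
    "amsterdam":   np.array([1.0, 1.0, 0.0]),
    "netherlands": np.array([0.0, 1.0, 0.0]),
    "germany":     np.array([0.0, 0.0, 1.0]),
    "berlin":      np.array([1.0, 0.0, 1.0]),
    "cheese":      np.array([0.3, 0.9, 0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# amsterdam - netherlands + germany ~ "the German Amsterdam"
query = vecs["amsterdam"] - vecs["netherlands"] + vecs["germany"]
best = max((w for w in vecs if w != "amsterdam"),
           key=lambda w: cosine(query, vecs[w]))
print(best)  # berlin
```

With real learned embeddings, the same query is run against all 10,293 word vectors instead of five toy ones.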
Results for The B&B recaps
It takes about an hour on my laptop to determine the word vectors (length 250) from 3645 B&B recaps (15 seasons). After removing some common stop words, I have 10,293 unique words; text2vec puts the embeddings in a matrix (10,293 by 250).
Let's take the lovely Steffy; the ten closest words are:

     from     to       value
1  steffy steffy   1.0000000
2  steffy   liam   0.8236346
3  steffy   hope   0.7904697
4  steffy   said   0.7846245
5  steffy  wyatt   0.7665321
6  steffy   bill   0.6978901
7  steffy  asked   0.6879022
8  steffy  quinn   0.6781523
9  steffy agreed   0.6563833
10 steffy   rick   0.6506576

Let's take the vector steffy – liam; the closest words we get are

    death   furious    lastly   excused frustration     onset
0.2237339 0.2006695 0.1963466 0.1958089   0.1950601 0.1937230

and for bill – anger we get

     liam     katie     wyatt    steffy     quinn      said
0.5550065 0.4845969 0.4829327 0.4645065 0.4491479 0.4201712

The following figure shows some other B&B characters and their closest matches.
If you want to see the top n matches for other B&B characters, use my little shiny app. The R code for scraping B&B recaps, calculating GloVe word embeddings, and a small shiny app can be found on my GitHub.
ConclusionThis is a Mickey Mouse use case, but it might be handy if you are in the train and hear people next to you talking about the B&B, you can join their conversation. Especially if you have had a look at my B&B shiny app……
Cheers, Longhow
A cRyptic crossword with an R twist
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
Last week's R-themed crossword from R-Ladies DC was popular, so here's another R-related crossword, this time by Barry Rowlingson and published on page 39 of the June 2003 issue of R News (now known as the R Journal). Unlike the last crossword, this one follows the conventions of a British cryptic crossword: the grid is symmetrical, and eschews 4×4 blocks of white or black squares. Most importantly, the clues are in the cryptic style: rather than being a direct definition, cryptic clues pair wordplay (homonyms, anagrams, etc.) with a hidden definition. (Wikipedia has a good introduction to the types of clues you're likely to find.) Cryptic crosswords can be frustrating for the uninitiated, but are fun and rewarding once you get into them.
In fact, if you're unfamiliar with cryptic crosswords, this one is a great place to start. Not only are many (but not all) of the answers related in some way to R, Barry has helpfully provided the answers along with an explanation of how the cryptic clue was formed. There's no shame in peeking, at least for a few, to help you get your legs with the cryptic style.
Again, if you're stuck you can find the solution here.
Celebrating 20 years of CRAN
(This article was first published on R – scottishsnow, and kindly contributed to Rbloggers)
In April I spoke for the Royal Statistical Society (Glasgow branch) at their event celebrating 20 years of CRAN. The other speakers were Charis Chanialidis and Colin Gillespie. Like most people, I find watching myself speak really cringeworthy, but perhaps it's not so bad! The three talks are here:
RStudio v1.1 Released
(This article was first published on RStudio Blog, and kindly contributed to Rbloggers)
We’re excited to announce the general availability of RStudio 1.1. Highlights include:
 A connections tab which makes it easy to connect to, explore, and view data in a variety of databases.
 A terminal tab which provides fluid shell integration with the IDE, xterm emulation, and even support for fullscreen terminal applications.
 An object explorer which can navigate deeply nested R data structures and objects.
 A new, modern dark theme and Retina-quality icons throughout.
 Dozens of other small improvements and bugfixes.
RStudio Server Pro 1.1 is also now available. Some of the new Pro features include:
 Support for floating licenses, which make it easy to run RStudio Server Pro in Docker containers, virtual machines, and cloud computing instances.
 Improved session management, which allows analysts to label, multi-select, force quit and otherwise self-manage their R sessions.
 Tools for administrators, including the ability to send users notifications and automatically clean up unused sessions, freeing disk space and resources.
You can download RStudio v1.1 today, and let us know what you think on the RStudio IDE community forum.
How to sample from multidimensional distributions using Gibbs sampling?
(This article was first published on Appsilon Data Science Blog, and kindly contributed to Rbloggers)
We will show how to perform multivariate random sampling using one of the Markov Chain Monte Carlo (MCMC) algorithms, called the Gibbs sampler.
Random sampling with rabbit on the bed plane
To start, what are MCMC algorithms and what are they based on?
Suppose we are interested in generating a random variable X with distribution \pi over \mathcal{X}.
If we are not able to do this directly, we will be satisfied with generating a sequence of random variables X_0, X_1, \ldots, which in a suitable sense tends to the distribution \pi. The MCMC approach explains how to do so:
Build a Markov chain X_0, X_1, \ldots, on \mathcal{X}, whose stationary distribution is \pi.
If the structure is correct, we should expect the random variables to converge to \pi.
In addition, we can expect that,
for a function f: \mathcal{X} \rightarrow \mathbb{R}:
\mathbb{E}(f) = \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=0}^{n-1}f(X_{i})
with probability 1.
In essence, we want the above equality to hold for an arbitrary starting point X_0.
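The ergodic-average property is easy to see on a chain small enough to solve exactly. A Python sketch (toy two-state chain, not from the post) whose stationary distribution works out to \pi = (0.8, 0.2):

```python
# Two-state Markov chain with transition matrix P; solving pi P = pi
# gives pi = (0.8, 0.2). The running average of f(X_t) should approach
# E_pi[f] = 0.8 * 1 + 0.2 * 5 = 1.8, regardless of the starting state.
import numpy as np

rng = np.random.RandomState(42)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
f = np.array([1.0, 5.0])  # f(0) = 1, f(1) = 5

x, total, n = 0, 0.0, 100000
for _ in range(n):
    x = rng.choice(2, p=P[x])   # one step of the chain
    total += f[x]
print(total / n)  # close to 1.8
```

The same limit is obtained starting from x = 1, which is the "arbitrary X_0" point made above.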
One of the MCMC algorithms that guarantees the above properties is the so-called Gibbs sampler.
Gibbs Sampler – description of the algorithm.
Assumptions:
 \pi is defined on the product space \mathcal{X} = \Pi_{i=1}^{d}\mathcal{X_{i}}
 We are able to draw from the conditional distributions
\pi(X_{i} \mid X_{-i}),
where
X_{-i} = (X_{1}, \ldots, X_{i-1}, X_{i+1}, \ldots, X_{d})
Algorithm steps:
 Select the initial values X_{k}^{(0)}, k = 1, \ldots d
 For t=1,2,\ldots repeat:
For k = 1, \ldots d sample X_{k}^{(t)} from the distribution \pi(X_{k}^{(t)} \rvert X_{1}^{(t)}, \ldots X_{k-1}^{(t)}, X_{k+1}^{(t-1)}, \ldots X_{d}^{(t-1)})
 Repeat step 2 until the distribution of vector (X_{1}^{(t)}, \ldots X_{d}^{(t)}) stabilizes.
 The subsequent iterations of point 2 will provide a randomization of \pi.
How do we understand point 3?
We are most satisfied when average coordinates stabilize to some accuracy.
Intuition in the two-dimensional case:
Source: [3]
Gibbs sampling for randomization with a two-dimensional normal distribution.
We will sample from the distribution of \theta \sim \mathcal{N}(0, [\sigma_{ij}]), where
\sigma_{11} = \sigma_{22} = 1 and \sigma_{12} = \sigma_{21} = \rho.
Knowing the joint density of \theta, it's easy to show that:
\theta_{1} \mid \theta_{2} \sim \mathcal{N}(\rho\theta_{2}, 1-\rho^{2})
\theta_{2} \mid \theta_{1} \sim \mathcal{N}(\rho\theta_{1}, 1-\rho^{2})
R implementation:
gibbs_normal_sampling <- function(n_iteration, init_point, ro) {
  # init_point is a numeric vector of length 2
  theta_1 <- c(init_point[1], numeric(n_iteration))
  theta_2 <- c(init_point[2], numeric(n_iteration))
  for (i in 2:(n_iteration + 1)) {
    theta_1[i] <- rnorm(1, ro * theta_2[i - 1], sqrt(1 - ro^2))
    theta_2[i] <- rnorm(1, ro * theta_1[i], sqrt(1 - ro^2))
  }
  list(theta_1, theta_2)
}
Using the above function, let's see how the first 10 points were drawn with \rho = 0.5:
And for 10000 iterations:
We leave it to you to compare how the stability of the parameters changes depending on the selected \rho parameter.
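One way to make that comparison quantitative: for the systematic-scan Gibbs sampler above, the marginal \theta_1 chain behaves like an AR(1) process with coefficient \rho^{2} (a standard property, not derived in the post), so mixing slows as \rho grows. A Python sketch mirroring the R function:

```python
# Compare lag-1 autocorrelation of the theta_1 Gibbs chain for two
# values of rho; theory predicts roughly rho^2 (0.81 vs 0.09).
import numpy as np

def gibbs_normal(n_iter, rho, seed=0):
    rng = np.random.RandomState(seed)
    t1, t2 = 0.0, 0.0
    s = np.sqrt(1.0 - rho ** 2)
    out = np.empty(n_iter)
    for i in range(n_iter):
        t1 = rng.normal(rho * t2, s)   # draw theta_1 | theta_2
        t2 = rng.normal(rho * t1, s)   # draw theta_2 | theta_1
        out[i] = t1
    return out

def lag1_autocorr(x):
    x = x - x.mean()
    return (x[:-1] @ x[1:]) / (x @ x)

slow = lag1_autocorr(gibbs_normal(20000, rho=0.9))
fast = lag1_autocorr(gibbs_normal(20000, rho=0.3))
print(slow, fast)  # roughly 0.81 vs 0.09
```

The higher the autocorrelation, the more iterations are needed before the running means stabilize, which is exactly what you see when you rerun the plots with different \rho.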
Let’s move on to use the Gibbs sampler to estimate the density parameters.
We will show the use of the Gibbs sampler and bayesian statistics to estimate the mean parameters in the mix of normal distributions.
Assumptions (simplified case):
 iid. sample Y=Y_{1},\ldots Y_{N} comes from a mixture of normal distributions (1-\pi)\mathcal{N}(\mu_{1}, \sigma_{1}^{2})+\pi\mathcal{N}(\mu_{2}, \sigma_{2}^{2}), where \pi, \sigma_{1} and \sigma_{2} are known.
 For i=1,2, \mu_{i} \sim \mathcal{N}(0, 1) (a priori distributions), and \mu_{1} and \mu_{2} are independent.
 \Delta = (\Delta_{1}, \ldots, \Delta_{N}) – the classification vector (unobserved) from which Y is derived (when \Delta_{i} = 0, Y_{i} is drawn from \mathcal{N}(\mu_{1}, \sigma_{1}^{2}); when \Delta_{i} = 1 it is drawn from \mathcal{N}(\mu_{2}, \sigma_{2}^{2})).
 \mathbb{P}(\Delta_{i} = 1) = \pi
With the above assumptions it can be shown that:
 f(\Delta) = (1-\pi)^{N-\sum_{i=1}^{N}\Delta_{i}}\cdot\pi^{\sum_{i=1}^{N}\Delta_{i}}
 (\mu_{1}\rvert\Delta,Y) \sim \mathcal{N}(\frac{\sum_{i=1}^{N}(1-\Delta_{i})y_{i}}{1 + \sum_{i=1}^{N}(1-\Delta_{i})}, \frac{1}{1+\sum_{i=1}^{N}(1-\Delta_{i})})
 (\mu_{2}\rvert\Delta,Y) \sim \mathcal{N}(\frac{\sum_{i=1}^{N}\Delta_{i}y_{i}}{1 + \sum_{i=1}^{N}\Delta_{i}}, \frac{1}{1+\sum_{i=1}^{N}\Delta_{i}})
 \mathbb{P}(\Delta_{i} = 1\rvert\mu_{1}, \mu_{2}, Y) = \frac{\pi \phi_{(\mu_{2}, \sigma_{2})}(y_{i})}{(1-\pi) \phi_{(\mu_{1}, \sigma_{1})}(y_{i})+\pi \phi_{(\mu_{2}, \sigma_{2})}(y_{i})}
The form of the algorithm:
1. Choose a starting point for the means (\mu_{1}^{(0)}, \mu_{2}^{(0)}).
2. In the k-th iteration do:
 With the probability
\frac{\pi \phi_{(\mu_{2}^{(k-1)}, \sigma_{2})}(y_{i})}{(1-\pi) \phi_{(\mu_{1}^{(k-1)}, \sigma_{1})}(y_{i})+\pi \phi_{(\mu_{2}^{(k-1)}, \sigma_{2})}(y_{i})}
draw \Delta_{i}^{(k)} for i = 1, \ldots, N
 Calculate:
\hat{\mu_{1}} = \frac{\sum_{i=1}^{N}(1-\Delta_{i}^{(k)})y_{i}}{1 + \sum_{i=1}^{N}(1-\Delta_{i}^{(k)})}
\hat{\mu_{2}} = \frac{\sum_{i=1}^{N}\Delta_{i}^{(k)}y_{i}}{1 + \sum_{i=1}^{N}\Delta_{i}^{(k)}}
 Generate:
\mu_{1}^{(k)} \sim \mathcal{N}(\hat{\mu_{1}}, \frac{1}{1 + \sum_{i=1}^{N}(1-\Delta_{i}^{(k)})})
\mu_{2}^{(k)} \sim \mathcal{N}(\hat{\mu_{2}}, \frac{1}{1 + \sum_{i=1}^{N}\Delta_{i}^{(k)}})
3. Perform step 2 until the distribution of the vector (\Delta, \mu_{1}, \mu_{2}) stabilizes.
How to do this in R?
At start let’s generate random sample from mixture of normals with parameters (\pi, \mu_1, \mu_2, \sigma_1, \sigma_2) = (0.7, 10, 2, 1, 2).
set.seed(12345) mu_1 < 10 mu_2 < 2 sigma_1 < 1 sigma_2 < 2 pi_known < 0.7 n < 2000 pi_sampled < rbinom(n, 1, pi_known) y_1 < rnorm(n, mu_1, sigma_1) y_2 < rnorm(n, mu_2, sigma_2) y < (1  pi_sampled) * y_1 + pi_sampled * y_2Take a quick look at a histogram of our data:
The following task is to estimate \mu_1 and \mu_2 from the above model.
library(purrr)  # for map_int (the %>% pipe comes from magrittr/dplyr)
mu_init <- c(12, 0)
n_iterations <- 300
delta <- vector("list", n_iterations)
mu_1_vec <- c(mu_init[1], numeric(n_iterations))
mu_2_vec <- c(mu_init[2], numeric(n_iterations))

delta_probability <- function(i, y, mu_1_vec, mu_2_vec, sigma_1, sigma_2, pi_known) {
  pi_known * dnorm(y, mu_2_vec[i - 1], sigma_2) /
    ((1 - pi_known) * dnorm(y, mu_1_vec[i - 1], sigma_1) +
       pi_known * dnorm(y, mu_2_vec[i - 1], sigma_2))
}

mu_1_mean <- function(delta, i, y) {
  sum((1 - delta[[i - 1]]) * y) / (1 + sum(1 - delta[[i - 1]]))
}

mu_2_mean <- function(delta, i, y) {
  sum(delta[[i - 1]] * y) / (1 + sum(delta[[i - 1]]))
}

for (j in 2:(n_iterations + 1)) {
  delta[[j - 1]] <- y %>%
    map_int(~ rbinom(1, 1, delta_probability(j, ., mu_1_vec, mu_2_vec,
                                             sigma_1, sigma_2, pi_known)))
  mu_1_vec[j] <- rnorm(1, mu_1_mean(delta, j, y),
                       sqrt(1 / (1 + sum(1 - delta[[j - 1]]))))
  mu_2_vec[j] <- rnorm(1, mu_2_mean(delta, j, y),
                       sqrt(1 / (1 + sum(delta[[j - 1]]))))
}

Let's see the relation of the sampled data to the known one:
The following plot presents the mean of the \Delta vector at each iteration:
Let’s check how mean of parameters \mu_1 and \mu_2 stabilize at used algorithm:
Note how few iterations were needed to stabilize the parameters.
Finally let’s see estimated density with our initial sample:
Those interested in the topic can refer to [1], where you can find a generalization of the normal-distribution mixture obtained by extending the a priori distributions to the other parameters.
It is also worth comparing the above algorithm with its deterministic counterpart, the Expectation Maximization (EM) algorithm, see [2].
Write your questions and comments below. We’d love to hear what you think.
Resources:
[1] http://gauss.stat.su.se/rr/RR2006_1.pdf
[2] http://web.stanford.edu/~hastie/ElemStatLearn/
[3] http://vcla.stat.ucla.edu/old/MCMC/MCMC_tutorial.htm
To leave a comment for the author, please follow the link and comment on their blog: Appsilon Data Science Blog. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
mea culpa!
(This article was first published on R – Xi'an's Og, and kindly contributed to Rbloggers)
An entry about our Bayesian Essentials book on X validated alerted me to a typo in the derivation of the Gaussian posterior! When deriving the posterior (which was left as an exercise in Bayesian Core), I just forgot the term expressing the divergence between the prior mean and the sample mean. Mea culpa!!!
Filed under: Books, Kids, R, Statistics, University life Tagged: Bayesian Analysis, Bayesian Core, Bayesian Essentials with R, Book, cross validated, Gaussian model, typo
To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.
A step change in managing your calendar, without social media
(This article was first published on DanielPocock.com  rproject, and kindly contributed to Rbloggers)
Have you been to an event recently involving free software or a related topic? How did you find it? Are you organizing an event and don’t want to fall into the trap of using Facebook or Meetup or other services that compete for a share of your community’s attention?
Are you keen to find events in foreign destinations related to your interest areas to coincide with other travel intentions?
Have you been concerned when your GSoC or Outreachy interns lost a week of their project going through the bureaucracy to get a visa for your community’s event? Would you like to make it easier for them to find the best events in the countries that welcome and respect visitors?
In many recent discussions about free software activism, people have struggled to break out of the illusion that social media is the way to cultivate new contacts. Wouldn’t it be great to make more meaningful contacts by attending a more diverse range of events rather than losing time on social media?
Making it happen
There are already a number of tools (for example, Drupal plugins and WordPress plugins) for promoting your events on the web and in iCalendar format. There are also a number of sites like Agenda du Libre and GriCal which aggregate events from multiple communities where people can browse them.
How can we take these concepts further and make a convenient, compelling and global solution?
Can we harvest event data from a wide range of sources and compile it into a large database using something like PostgreSQL or a NoSQL solution or even a distributed solution like OpenDHT?
Can we use big data techniques to mine these datasources and help match people to events without compromising on privacy?
Why not build an automated iCalendar “todo” list of deadlines for events you want to be reminded about, so you never miss the deadlines for travel sponsorship or submitting a talk proposal?
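Concretely (the event name and dates here are invented for illustration), such a feed could emit standard iCalendar VTODO entries that any calendar client already understands:

```
BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//example//event-deadlines//EN
BEGIN:VTODO
UID:cfp-exampleconf-2018@example.org
DTSTAMP:20171015T120000Z
DUE:20180115T235959Z
SUMMARY:Submit talk proposal for ExampleConf 2018
CATEGORIES:CFP,FREE-SOFTWARE
END:VTODO
END:VCALENDAR
```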
I’ve started documenting an architecture for this on the Debian wiki and proposed it as an Outreachy project. It will also be offered as part of GSoC in 2018.
Ways to get involved
If you would like to help this project, please consider introducing yourself on the debian-outreach mailing list and helping to mentor or refer interns for the project. You can also help contribute ideas for the specification through the mailing list or wiki.
Mini DebConf Prishtina 2017
This weekend I’ve been at the MiniDebConf in Prishtina, Kosovo. It has been hosted by the amazing Prishtina hackerspace community.
Watch out for future events in Prishtina. The pizzas are huge, but that didn’t stop them disappearing before we finished the photos:
To leave a comment for the author, please follow the link and comment on their blog: DanielPocock.com – r-project.
Spatial networks – case study St James centre, Edinburgh (1/3)
(This article was first published on R – scottishsnow, and kindly contributed to Rbloggers)
Last year I spent a bit of time learning about routing and network analysis. I created a map of distance from each GB postcode to the nearest railway station. At the time my local council were also looking at changing school catchments and building new schools. This also seemed like an excellent project for routing analysis. I made some progress on this, but never wrote it up.
One aspect I was really interested in was common routes. When you create a shortest path network from many points to a single destination you get a lot of lines which converge on the same place. Can we use this for anything else? Can we count how many lines share the same pathways? This would be particularly useful for prioritising investment in safe routes to school, or walking and cycling infrastructure.
I was spurred to look at this analysis again from the tweet below:
Have to question the merits of people coming from the west driving across the city centre to get to St James’s Ctr.
— Cllr Scott Arthur (@ProfScottThinks) October 6, 2017
For background, the St James’s Centre in Edinburgh is being demolished and a new shopping centre built to replace it. There’s been a lot of (justified) complaint surrounding the new road layout at Picardy Place, because it prioritises motorised traffic over more efficient and effective ways of transporting people in cities (foot, bike, bus and tram). On top of this, the new St James’ centre will have a greatly increased number of spaces for parked cars (1,800, up from 550, with 300 of those for residents).
Scott’s tweet made me think about catchment area for the St James’ centre. How far away are different parts of Edinburgh? Which roads would be most travelled to get there? If one increases the number of parking spaces at a location, surely the number of cars travelling to that location will increase. What will the impact on the road network be?
There are some great resources to help us with this. The data source I’m going to use is Ordnance Survey OpenData. In particular we need Open Roads and CodePoint (postcodes). Software I’ll use for this task are QGIS, GRASS GIS and R. These are open source and require a little practice to use, but the effort pays off!
After reading data into GRASS I calculated the distance network from all postcodes to the St James’ Centre:
# Connect postcodes to streets as layer 2
v.net --overwrite input=roads points=postcodes output=roads_net1 operation=connect thresh=400 arc_layer=1 node_layer=2

# Connect St James to streets as layer 3
v.net --overwrite input=roads_net1 points=stjames output=roads_net2 operation=connect thresh=400 arc_layer=1 node_layer=3

# Shortest paths from postcodes (points in layer 2) to St James (points in layer 3)
v.net.distance in=roads_net2 out=pc_2_stjames from_layer=2 to_layer=3

# Join postcode and distance tables
v.db.join map=postcodes column=cat other_table=pc_2_stjames other_column=cat

# Make a km column
v.db.addcolumn map=postcodes columns="dist_km double precision"
v.db.update map=postcodes column=dist_km qcol="dist/1000"

# Write to shp (for mapping in QGIS)
v.out.ogr -s input=postcodes output=pc_2_stjames format=ESRI_Shapefile output_layer=pc_2_station

We can see the results of this below:
We can also zoom this map on the area inside the bypass and adjust the colouring to give us more resolution for shorter distances.
Now we can use R to investigate how many postcodes are within these distances. R can talk directly to GRASS, or we can use the exported shp file.
library(rgdal)

# Read shp file
# Note OGR wants the full path to the file, you can't use '~'
postcodes = readOGR("/home/me/Projects/network/data/pc_2_stjames.shp")

hist(postcodes$dist_km, breaks=50)

# Rank the distances
x = sort(postcodes$dist_km)

png("~/Cloud/Michael/Projects/network/figures/stjames_postcodedistance.png",
    height=600, width=800)
par(cex=1.5)
plot(x, type="l",
     main="EH postcode: shortest road distances to St James centre",
     xlab="Number of postcodes", ylab="Distance (km)", lwd=3)
points(max(which(x<2)), 2, pch=19, cex=2, col="purple4")
points(max(which(x<5)), 5, pch=19, cex=2, col="darkorange")
legend("topleft",
       c(paste(max(which(x<2)), "postcodes within 2 km (walking distance)"),
         paste(max(which(x<5)), "postcodes within 5 km (cycling distance)")),
       col=c("purple4", "darkorange"), pch=19, pt.cex=2)
dev.off()

# Get percentages
round(100 * max(which(x<2)) / length(x))
round(100 * (max(which(x<5)) - max(which(x<2))) / length(x))

As we can see, 1868 postcodes are within walking distance of the St James’ Centre and 7044 within cycling distance. If we
 Ignore the other shopping centres around Edinburgh,
 Assume that population density is consistent between postcodes,
 Take the complete EH postcode locations as the catchment area,
then 8 % of total customers are within walking distance (2 km), a further 22 % are within cycling distance (2-5 km) and the remaining 70 % would need to travel by other means. I wonder if there will be 330 cycle parking spaces at the St James centre? This would be proportional to the number of car parking spaces. This assumes that those beyond cycling distance need to drive, but much of the population will be near a bus stop.
Additional work:
 Create an allocation network for all Edinburgh shopping centres with car parks. What is St James’ catchment from that?
 Use the census data to compare car ownership with the St James’ allocation network. Car ownership is probably quite low in the city centre!
 Consider travel time, not just distance.
 How many of those outside walking/cycling distance are near a bus stop?
To leave a comment for the author, please follow the link and comment on their blog: R – scottishsnow.
It’s tibbletime v0.0.2: Time-Aware Tibbles, New Functions, Weather Analysis and More
(This article was first published on businessscience.io  Articles, and kindly contributed to Rbloggers)
Today we are introducing tibbletime v0.0.2, and we’ve got a ton of new features in store for you. We have functions for converting to flexible time periods with the ~period formula~ and making/calculating custom rolling functions with rollify() (plus a bunch more new functionality!). We’ll take the new functionality for a spin with some weather data (from the weatherData package). However, the new tools make tibbletime useful in a number of broad applications such as forecasting, financial analysis, business analysis and more! We truly view tibbletime as the next phase of time series analysis in the tidyverse. If you like what we do, please connect with us on social media to stay up on the latest Business Science news, events and information!
Introduction
We are excited to announce the release of tibbletime v0.0.2 on CRAN. Loads of new functionality have been added, including:
- Generic period support: Perform time-based calculations by a number of supported periods using a new ~period formula~.
- Creating series: Use create_series() to quickly create a tbl_time object initialized with a regular time series.
- Rolling calculations: Turn any function into a rolling version of itself with rollify().
- A number of smaller tweaks and helper functions to make life easier.
As we further develop tibbletime, it is becoming clearer that the package
is a tool that should be used in addition to the rest of the tidyverse.
The combination of the two makes time series analysis in the tidyverse much easier to do!
Today we will take a look at weather data for New York and San
Francisco from 2013. It will be an exploratory analysis
to show off some of the new features in tibbletime, but the package
itself has much broader application. As we will see, tibbletime’s time-based functionality can be a valuable data manipulation tool to help with:
- Product and sales forecasting
- Financial analysis with custom rolling functions
- Grouping data into time buckets to analyze change over time, which is great for any part of an organization including sales, marketing, manufacturing, and HR!
The datasets used are from a neat package called weatherData. While weatherData has functionality to pull weather data for a number of cities, we will use the built-in datasets. We encourage you to explore the weatherData API if you’re interested in collecting weather data.
To get started, load the following packages:
- tibbletime: Time-aware tibbles for the tidyverse
- tidyverse: Loads packages including dplyr, tidyr, purrr, and ggplot2
- corrr: Tidy correlations
- weatherData: Slick package for getting weather data
Also, load the datasets from weatherData, “NewYork2013” and “SFO2013”.
# Load libraries
library(tibbletime) # Make sure you have 0.0.2 from CRAN!
library(tidyverse)
library(corrr)
library(weatherData)

# Load weatherData datasets
NYC <- NewYork2013
SFO <- SFO2013

Combine and convert
To tidy up, we first join our data sets together using bind_rows(). Passing a named list of tibbles along with specifying the .id argument allows bind_rows() to create a new City reference column for us.
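For readers unfamiliar with this pattern, here is a minimal sketch of the .id behaviour using toy stand-ins (the temperatures below are made up, not the real datasets):

```r
library(dplyr)

# Two tiny stand-ins for the NYC and SFO tibbles
nyc <- tibble::tibble(Temperature = c(41.0, 39.9))
sfo <- tibble::tibble(Temperature = c(52.0, 51.1))

# .id = "City" turns the list names into a new City column
weather <- bind_rows(list(NYC = nyc, SFO = sfo), .id = "City")
weather$City
# "NYC" "NYC" "SFO" "SFO"
```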
Next, we will convert to tbl_time and group by our City variable. Note that we know this is a tbl_time object by Index: Time that gets printed along with the tibble.
# Convert to tbl_time class
weather <- weather %>%
  mutate(Time = as.POSIXct(Time)) %>%
  as_tbl_time(Time) %>%
  group_by(City)
weather
## # A time tibble: 19,706 x 3
## # Index:  Time
## # Groups: City [2]
##    City Time                Temperature
##  1 NYC  2013-01-01 00:51:00        41.0
##  2 NYC  2013-01-01 01:51:00        39.9
##  3 NYC  2013-01-01 02:51:00        41.0
##  4 NYC  2013-01-01 03:51:00        41.0
##  5 NYC  2013-01-01 04:51:00        41.0
##  6 NYC  2013-01-01 05:51:00        39.9
##  7 NYC  2013-01-01 06:51:00        39.9
##  8 NYC  2013-01-01 07:51:00        39.9
##  9 NYC  2013-01-01 08:51:00        39.9
## 10 NYC  2013-01-01 09:51:00        39.9
## # ... with 19,696 more rows

Period conversion
The first new idea to introduce is the ~period formula~. This tells the tibbletime functions how you want to time-group your data. It is specified as multiple ~ period, with examples being 1~d for “every 1 day,” and 4~m for “every 4 months.”
In our original data, it looks like weather is an hourly dataset, with each new
data point coming in on the 51st minute of the hour for NYC and the 56th minute
for SFO. Unfortunately, a number of points don’t follow this. Check out the following rows:
What we want is 1 row per hour, and in this case we get two rows for NYC hour 12.
We can use as_period() to ensure that we only have 1 row for each hour.
Now that we have our data in an hourly format, we probably don’t care about
which minute it came in on. We can floor the date column using a helper function,
time_floor(). Credit to Hadley Wickham because this is essentially a convenient
wrapper around lubridate::floor_date(). Setting the period to 1~h floors
each row to the beginning of the last hour.
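If you just want to see what hour-flooring does outside of tibbletime, base R’s trunc() for date-times behaves the same way on a single timestamp (the example time is invented):

```r
# Flooring a timestamp to the start of its hour, base-R style
t <- as.POSIXct("2013-01-01 09:51:00", tz = "UTC")
trunc(t, units = "hours")
# "2013-01-01 09:00:00 UTC"
```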
Now that we have cleaned up a bit, let’s visualize the data.
# Yikes: hourly is a bit too much data for the chart
ggplot(weather, aes(x = Time, y = Temperature, color = City)) +
  geom_line() +
  theme_minimal()

Seems like hourly data is a bit overwhelming for this kind of chart. Let’s convert to daily and try again.
That’s better. It looks like NYC has a much wider range of temperatures than
SFO. Both seem to be hotter in summer months.
The dplyr::summarise() function is very useful for taking grouped summaries.
time_summarise() takes this a step further by allowing you to summarise by
period.
Below we take a look at the average and standard deviation of the temperatures
calculated at monthly and bimonthly intervals.
July seemed to be one of the hottest months for NYC, so let’s take a closer look at it.
To just grab July dates, use time_filter(). If you haven’t seen this before, a time formula is used to specify the dates to filter for. The one-sided formula below expands to include dates between 2013-07-01 00:00:00 ~ 2013-07-31 23:59:59.
july <- weather %>% time_filter(~2013-7)
july
## # A time tibble: 1,486 x 3
## # Index:  Time
## # Groups: City [2]
##    City Time                Temperature
##  1 NYC  2013-07-01 00:00:00        73.4
##  2 NYC  2013-07-01 01:00:00        73.9
##  3 NYC  2013-07-01 02:00:00        73.4
##  4 NYC  2013-07-01 03:00:00        73.9
##  5 NYC  2013-07-01 04:00:00        73.9
##  6 NYC  2013-07-01 05:00:00        73.9
##  7 NYC  2013-07-01 06:00:00        75.9
##  8 NYC  2013-07-01 07:00:00        75.9
##  9 NYC  2013-07-01 08:00:00        75.2
## 10 NYC  2013-07-01 09:00:00        77.0
## # ... with 1,476 more rows

To visualize July’s weather, we will make a boxplot of the temperatures.
Specifically, we will slice July into intervals of 2 days, and create a series
of boxplots based on the data inside those intervals. To do this, we will
use time_collapse(), which collapses a column of dates into a column of the same
length, but where every row in a time interval shares the same date. You can use this resulting
column for further grouping or labeling operations.
Let’s visualize to see if we can gain any insights. Wow, San Fran maintained a constant cool average of 60 degrees in the hottest month
of the year!
Finally, we will look at the correlation of temperatures in our two cities in a few different ways.
First, let’s look at the overall correlation. The corrr package provides a nice way to accomplish this with data frames.
weather %>%
  spread(key = City, value = Temperature) %>%
  select(NYC, SFO) %>%
  corrr::correlate()
## # A tibble: 2 x 3
##   rowname       NYC       SFO
## 1 NYC            NA 0.6510299
## 2 SFO     0.6510299        NA

Next, let’s look at monthly correlations. The general idea will be to nest each month into its own data frame, apply correlate() to each nested data frame, and then unnest the results. We will use time_nest() to easily perform the monthly nesting.
For each month, calculate the correlation tibble and then focus() on the NYC column. Then unnest and floor the results.
monthly_nest %>%
  mutate(monthly_cor = map(data, ~corrr::correlate(.x) %>% corrr::focus(NYC))) %>%
  unnest(monthly_cor) %>%
  time_floor(1~d)
## # A time tibble: 12 x 4
## # Index: Time
##    Time       data     rowname        NYC
##  1 2013-01-31 <tibble> SFO     0.10281153
##  2 2013-02-28 <tibble> SFO     0.38288119
##  3 2013-03-31 <tibble> SFO     0.52432022
##  4 2013-04-30 <tibble> SFO     0.34258085
##  5 2013-05-31 <tibble> SFO     0.07814153
##  6 2013-06-30 <tibble> SFO     0.52024900
##  7 2013-07-31 <tibble> SFO     0.29163801
##  8 2013-08-31 <tibble> SFO     0.45479643
##  9 2013-09-30 <tibble> SFO     0.48056194
## 10 2013-10-31 <tibble> SFO     0.59429495
## 11 2013-11-30 <tibble> SFO     0.35513490
## 12 2013-12-31 <tibble> SFO     0.17559596

It seems that summer and fall months tend to have higher correlation than colder months.
And finally we will calculate the rolling correlation of NYC and SFO temperatures. The “width” of our roll will be monthly, or 360 hours since we are in hourly format.
There are a number of ways to do this, but for this example
we introduce rollify(), which takes any function that you give it and creates a rolling version of it. The first argument to rollify() is the function that you want to modify, and it is passed to rollify() in the same way that you would pass a function to purrr::map(). The second argument is the window size. Call the rolling function just as you would call a non-rolling version of cor() from inside mutate().
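To see what rollify() does conceptually, here is a tiny base-R sketch of the same idea (an illustration only, not tibbletime’s actual implementation):

```r
# Turn a two-argument function into a rolling version of itself,
# returning NA until a full window is available (as rollify() does)
rollify_sketch <- function(f, window) {
  function(x, y) {
    out <- rep(NA_real_, length(x))
    for (i in window:length(x)) {
      idx <- (i - window + 1):i  # indices of the current window
      out[i] <- f(x[idx], y[idx])
    }
    out
  }
}

rolling_cor <- rollify_sketch(cor, window = 3)
rolling_cor(c(1, 2, 3, 4, 5), c(2, 4, 6, 8, 10))
# NA NA 1 1 1
```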
It looks like the correlation is definitely not stable throughout the year,
so that initial correlation value of .65 definitely has to be taken
with a grain of salt!
There are a number of ways to do rolling functions, and we recommend the following based on your needs. If you are interested in:
- Flexibility: Use rollify(). You can literally turn any function into a “tidy” rolling function. Think everything from rolling statistics to rolling regressions. Whatever you can dream up, it can do. The speed is fast, but not quite as fast as other Rcpp-based libraries.
- Performance: Use the roll package, which uses RcppParallel as its backend, making it the fastest option available. The only downside is flexibility, since you cannot create custom rolling functions and are bound to those available.
We’ve touched on a few of the new features in tibbletime v0.0.2. Notably:
- rollify() for rolling functions
- as_period() with generic periods
- time_collapse() for collapsing date columns
A full change log can be found in the NEWS file on Github or CRAN.
We are always open to new ideas and encourage you to submit an issue on our
Github repo here.
Have fun with tibbletime!
Final thoughts
Mind you, this is only v0.0.2. We have a lot of work to do, but we couldn’t
wait any longer to share this. Feel free to kick the tires on tibbletime, and let us know your thoughts. Please submit any comments, issues or bug reports to us on GitHub here. Enjoy!
Business Science takes the headache out of data science. We specialize in applying machine learning and data science in business applications. We help businesses that seek to build out this capability but may not have the resources currently to implement predictive analytics. Business Science works with clients ranging from startups to the Fortune 500 and seeks to guide organizations in expanding predictive analytics while executing on ROI-generating projects. Visit the Business Science website or contact us to learn more!
Connect with Business Science
Connect, communicate and collaborate with us! The easiest way to do so is via social media. Connect with us on:
- @bizScienc on Twitter!
- Facebook!
- LinkedIn!
- Sign up for our insights blog to stay updated!
- If you like our software, star our GitHub packages!
To leave a comment for the author, please follow the link and comment on their blog: businessscience.io – Articles.
NFL Series
(This article was first published on StatOfMind, and kindly contributed to Rbloggers)
If you have previously attempted to analyze NFL data, it is likely that you have tried to scrape ESPN or football-reference, which provide a wealth of statistics surrounding game data. However, if you ever wanted to obtain truly in-depth data, then it is likely that you found yourself leveraging the API maintained by NFL.com. Unfortunately, its data is surfaced in a JSON format that leaves a lot to be desired (i.e. it’s a nightmare). Luckily for me, I was recently scrolling through my Twitter feed and came across an interesting mention of a new R package called nflscrapR. After some exploration, I came across this quote from the author of the package:
NFL fans and sports enthusiasts alike, I would like to introduce the nflscrapR package, an American football data aggregator that will scrape, clean, and parse play-by-play data across games, seasons, and careers.

nflscrapR essentially surfaces all play-by-play data for the last 7 seasons, and this has motivated me to start a deep dive on NFL data. I plan to have two main topics, one that focuses on players at specific positions, and another focusing on team dynamics and patterns. To kick this series off, we will begin with an exploration into the performance of the NFL’s best running backs.
1. Prerequisites
In order to reproduce the figures below, you will need to have R (v3.2.3 or above) installed on your machine. There are also a couple of additional libraries that will be required. Details on how to install those are shown in the commands below.
# install.packages('devtools')
library(devtools)
devtools::install_github(repo = "maksimhorowitz/nflscrapR", force = TRUE)
#> Skipping install for github remote, the SHA1 (05815ef8) has not changed since last install.
#> Use `force = TRUE` to force installation

2. Collecting the Data
With the nflscrapR library now installed, you are ready to collect play-by-play data for the 2016-2017 NFL season. Start by loading the library and collecting the data using the commands below:
# Load the packages
library(nflscrapR)
library(ggplot2)
library(dplyr)
library(pheatmap)

# Collect data for 2016 NFL season
pbp_2016 <- season_play_by_play(2016)

Overall, the pbp_2016 dataframe contains 100 fields for each of its 45,737 plays, but for the purpose of this post, we will focus exclusively on fields related to running backs (in future posts, we will explore data relevant to other positions on the football field). In addition, we’ll focus primarily on frequently used running backs, which we empirically define as any player that has had at least 200 rushes over the course of the 2016-2017 season.
# Get all players with at least 200 rushes during the season
min_rush_cnt <- 200
rush_cnt <- pbp_2016 %>%
  filter(PlayType == 'Run') %>%
  group_by(Rusher) %>%
  summarise(rush_cnt = n(),
            total_yards = sum(Yards.Gained),
            mean_yards = round(mean(Yards.Gained), 2)) %>%
  filter(rush_cnt >= min_rush_cnt) %>%
  arrange(desc(rush_cnt))

# Get all rushing data for eligible players
rushing_stats <- pbp_2016 %>%
  filter(PlayType == 'Run' & Rusher %in% rush_cnt$Rusher & Yards.Gained <= 50) %>%
  filter(down != 4 & !is.na(down)) %>%
  filter(!is.na(RunLocation))

Altogether, we find that a total of 19 players rushed over 200 times during the 2016-2017 season. A short summary of their performance is shown below.
When talking about the overall performance of running backs, it is common for people to report the total number of yards rushed for, or the average yards per run. While these are perfectly acceptable numbers to share, I’ve always felt like they did not tell the whole story. For example, a player could have a high average yards per run, only for us to realize that he actually often loses yards on a run but makes up for it with a few very long runs. Therefore, I started by looking at the overall distribution of the number of yards gained/lost on each play, with the hope that this would reveal whether some players were more consistent on a play-by-play basis than others. We can use the ggplot2 library to generate a density plot of yards gained per play for each of our eligible players:
# Compare distribution of rushes for eligible players
library(ggjoy) # for geom_joy() and theme_joy()
ggplot(rushing_stats, aes(x = Yards.Gained, y = Rusher, fill = Rusher)) +
  geom_joy(scale = 3) +
  theme_joy() +
  scale_fill_manual(values = rep(c('gray', 'lightblue'), length(rushing_stats$Rusher) / 2)) +
  scale_y_discrete(expand = c(0.01, 0)) +
  scale_x_continuous(expand = c(0, 0)) +
  theme(legend.position = "none") +
  labs(x = "Yards gained per play", y = "")

Overall, we see that most running backs have a similar distribution of yards gained per run. However, we can see that LeSean McCoy (7th distribution from the top) has a much “flatter” distribution (i.e. more variance in the distribution of yards gained per run), meaning his performance can be a lot more variable/unpredictable on any given run.
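The point about averages hiding variability can be made concrete with a toy example (the yardages below are invented): two “players” with identical average yards per carry but very different consistency.

```r
# Player A: mostly stuffed runs plus one long breakaway
a <- c(rep(2, 9), 30)
# Player B: steady gains on every carry
b <- rep(4.8, 10)

mean(a)  # 4.8
mean(b)  # 4.8
sd(a)    # much larger than sd(b), which is 0
```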
4. When are running backs used?
Another statement that is commonly made is that running backs are primarily used on early downs. To verify whether this is generally true, we can compute the total number of runs that each player made across different downs, and go even further by breaking this down by quarter too. The code chunk below counts the number of runs that each running back made during pairs of downs and quarters.
# Compare when rushers are used
usage_stats <- pbp_2016 %>%
  filter(!is.na(down) & Rusher %in% rush_cnt$Rusher & qtr != 5) %>%
  group_by(Rusher, down, qtr) %>%
  summarise(cnt = n()) %>%
  mutate(qtr_down = paste("Q", qtr, " Down: ", down, sep = ""))

We can then leverage the d3heatmap package to quickly generate a simple heatmap of how often running backs are used during specific downs and quarters.
library(d3heatmap)
library(reshape2)    # for dcast()
library(htmlwidgets) # for saveWidget()

# Pivot dataframe
usage <- usage_stats %>% dcast(Rusher ~ qtr_down, value.var = "cnt")

# Clean data
row.names(usage) <- usage$Rusher
usage <- usage %>% select(-Rusher)
usage[is.na(usage)] <- 0

# Normalize data so each row sums to 1
usage_norm <- t(apply(usage, 1, function(x) x / sum(x)))

# Plot heatmap of proportions of rushes by down and quarter
p <- d3heatmap(usage_norm, colors = "Blues", Colv = FALSE, show_grid = 3)
saveWidget(p, file = "rusher_usage_down_quarter.html")

Proportion of rushes per quarter and down for the NFL’s best running backs
In the plot above, we are essentially plotting the usage of each running back as a function of what stage of the game we are in. As we can see, it is abundantly clear that running backs are primarily used on the first two downs, and rarely during the third and fourth downs. Overall, there do not appear to be significant differences between how players are used. However, this does not answer whether some running backs perform better on certain downs, which is what we will address now.
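As an aside, the row normalization used to build usage_norm above is easy to sanity-check on a toy count matrix (the numbers are invented):

```r
# Two players, three game situations: raw counts
m <- matrix(c(2, 2, 6,
              1, 3, 0), nrow = 2, byrow = TRUE)

# Divide each row by its total so rows become proportions
prop <- t(apply(m, 1, function(x) x / sum(x)))
rowSums(prop)  # both rows now sum to 1
```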
5. Are some running backs better on certain downs?
Another question we can ask ourselves is whether some running backs perform better on later downs. To visualize this, we can again generate a density plot of yards gained per play for each of our eligible players, while also facetting the data by down.
# Compare distribution of rushes by downs
ggplot(rushing_stats, aes(x = Yards.Gained, y = down)) +
  geom_joy(scale = 1, rel_min_height = .03, fill = 'black') +
  scale_y_discrete(expand = c(0.01, 0)) +
  xlab('Value') +
  facet_wrap(~Rusher, scales = 'free', ncol = 3) +
  theme_joy() +
  theme(axis.title.y = element_blank()) +
  labs(x = "Yards gained per play", y = "Down")

Again, we do not see any striking differences between players in the distribution of yards gained by down. However, it is interesting to note that most “long runs” (10 yards or above) tend to occur on first down. Looking closely, we can also see that some rushers, such as DeMarco Murray, do exhibit visual differences in yards gained by down. In this case, the “mass” of yards gained on third down is much closer to zero than the “mass” for first and second downs, which suggests that he may struggle on that down (this could be attributable to many factors: stamina, a weaker offensive line on 3rd downs, etc.)
6. Where do the best running backs run?
It is fairly well accepted that the performance of a running back will be heavily influenced by the strength of the offensive line in front of him. With that in mind, let’s start by looking at the field locations in which different running backs prefer to run. The plot below shows the number of yards gained by each running back based on which side of the field they ran towards (left, middle or right).
ggplot(data = rushing_stats, aes(x = RunLocation, y = Yards.Gained, color = RunLocation)) +
  geom_jitter(position = position_jitter(0.2)) +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "pointrange", color = "black") +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal() +
  facet_wrap(~Rusher)

We can take this further by quantifying where each running back prefers to run. This can be achieved by generating a matrix that contains the proportion of rushes by field location for each player.
# Get proportions of rushes at different field locations
rush_locations <- rushing_stats %>%
  filter(PlayType == 'Run') %>%
  filter(!is.na(RunLocation)) %>%
  group_by(Rusher, RunLocation) %>%
  summarise(rush_cnt = n()) %>%
  mutate(freq = rush_cnt / sum(rush_cnt))

loc_mat <- rush_locations %>%
  dcast(Rusher ~ RunLocation, value.var = "freq")
row.names(loc_mat) <- loc_mat$Rusher
loc_mat <- loc_mat %>% select(-Rusher)

The loc_mat matrix contains the preferred rush locations of each running back, and can be plotted as a clustered heatmap using the pheatmap library in R.
# Plot heatmap of proportions of rushes at different field locations
pheatmap(loc_mat, border = "white", color = brewer.pal(9, "Blues"), cluster_cols = FALSE)

The plot above highlights which running backs are most similar in their run locations. Rushers such as J. Ajayi, E. Elliott and M. Ingram clearly avoid running up the middle of the field. On the flip side, rushers such as M. Gordon, D. Johnson, S. Ware and F. Gore clearly prefer running down the middle rather than to the sides. These patterns could be attributed to the running style of each rusher (speed, mobility, strength, etc.), but also to the strength of the offensive line at particular positions.
7. Who creates the gaps for the running backs?

We can also explore the number of yards gained by each running back based on the offensive line positions that created space for them.
ggplot(data = rushing_stats %>% filter(!is.na(RunGap)), aes(x = RunGap, y = Yards.Gained, color = RunGap)) +
  geom_jitter(position = position_jitter(0.2)) +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "pointrange", color = "black") +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal() +
  facet_wrap(~Rusher)

The proportion of run opportunities that were enabled by each offensive line position can also be summarized in a matrix using the commands below.
# Get proportions of gaps created by different offensive line positions
rush_gaps <- rushing_stats %>%
  filter(!is.na(RunGap)) %>%
  group_by(Rusher, RunGap) %>%
  summarise(rush_cnt = n()) %>%
  mutate(freq = rush_cnt / sum(rush_cnt))

gap_mat <- rush_gaps %>%
  dcast(Rusher ~ RunGap, value.var = "freq")
row.names(gap_mat) <- gap_mat$Rusher
gap_mat <- gap_mat %>% select(-Rusher)

# Plot heatmap of proportions of rushes through different gaps
pheatmap(gap_mat, border = "white", color = brewer.pal(9, "Blues"), cluster_cols = FALSE)

Again, we see many differences among the league’s top running backs. Unsurprisingly, a number of players have most of their run opportunities created by the guard position, but a few players such as F. Gore, L. McCoy and D. Johnson run through gaps created by the tackle position. Finally, S. Ware of the Kansas City Chiefs often runs through gaps created by tight ends, which may be more representative of the team’s formation.
Conclusion

In this introductory post, we have explored the performance and patterns of some of the NFL’s best running backs. Overall, it was a fairly superficial analysis, as it never considered interactions with other components of the team, or temporal patterns, but it does show the depth and power of this data. In the next series, I will dive into the performance and behavior of wide receivers, so stay tuned!
To leave a comment for the author, please follow the link and comment on their blog: StatOfMind. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Linear / Logistic Regression in R: Dealing With Unknown Factor Levels in Test Data
(This article was first published on R – Discovering Python & R, and kindly contributed to Rbloggers)
Let’s say you have data containing a categorical variable with 50 levels. When you divide the data into train and test sets, chances are you don’t have all 50 levels represented in your training set.
This often happens when you split the data into train and test sets according to the distribution of the outcome variable. In doing so, chances are that your explanatory categorical variable is not distributed exactly the same way in the train and test sets – so much so that certain levels of this categorical variable are missing from the training set entirely. The more levels a categorical variable has, the harder it is for that variable to be similarly represented after splitting the data.
Take for instance this example data set (train.csv + test.csv) which contains a categorical variable var_b that takes 349 unique levels. Our train data has 334 of these levels – on which the model is built – and hence 15 levels are excluded from our trained model. If you try making predictions on the test set with this model in R, it throws an error:
factor var_b has new levels 16060, 17300, 17980, 19060, 21420, 21820,
25220, 29340, 30300, 33260, 34100, 38340, 39660, 44300, 45460
If you’ve used R to model generalized linear class of models such as linear, logit or probit models, then chances are you’ve come across this problem – especially when you’re validating your trained model on test data.
The workaround to this problem comes in the form of a function, remove_missing_levels, that I found here, written by pats. You need the magrittr library installed, and it only works on lm, glm, and glmmPQL objects.
Once you’ve sourced the above function in R, you can seamlessly proceed with using your trained model to make predictions on the test set. The code below demonstrates this for the data set shared above. You can find the code in one of my GitHub repos and try it out yourself.
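The underlying idea can be sketched in a few lines. The following is a simplified illustration of the approach, not the actual remove_missing_levels code: it coerces each factor in the test data to the levels seen during training, so that unseen levels become NA and predict() returns NA for those rows instead of throwing an error.

```r
## Simplified sketch of the idea (not the actual remove_missing_levels code):
## coerce test factors to the training levels so unseen levels become NA.
drop_new_levels <- function(fit, newdata) {
  factor_vars <- names(Filter(is.factor, fit$model))
  for (v in intersect(factor_vars, names(newdata))) {
    newdata[[v]] <- factor(newdata[[v]], levels = levels(fit$model[[v]]))
  }
  newdata
}

train <- data.frame(y = c(1, 2, 3, 4), g = factor(c("a", "a", "b", "b")))
test  <- data.frame(g = factor(c("a", "b", "c")))  # level "c" is unseen
fit   <- lm(y ~ g, data = train)
pred  <- predict(fit, newdata = drop_new_levels(fit, test))
pred  # NA for the row with the unseen level, numeric predictions otherwise
```

This keeps the prediction step from erroring; whether NA predictions are acceptable downstream is a separate modeling decision.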
To leave a comment for the author, please follow the link and comment on their blog: R – Discovering Python & R.
Data Visualization Course for First-Year Students
(This article was first published on R Project – CitizenStatistician, and kindly contributed to Rbloggers)
A little over a year ago, we decided to propose a data visualization course at the first-year level. We had been thinking about this for a while, but never had the time to teach it given our scheduling constraints. When one of the other departments on campus was shut down and its faculty merged into other departments, we felt that the time was ripe to make this proposal.
In putting together the proposal, we knew that:
 The course would be primarily composed of social science students. My department, Educational Psychology, attracts students from the College of Education and Human Development (e.g., Child Psychology, Social Work, Family Social Science).
 To attract students, it would be helpful if the course would fulfill the University’s Liberal Education (LE) requirement for Mathematical Thinking.
This led to several challenges and long discussions about the curriculum for this course. For example:
 Should the class focus on producing data visualizations (very exciting for the students) or on understanding/interpreting existing visualizations (useful for most social science students)?
 If we were going to produce data visualizations, which software tool would we use? Could this level of student handle R?
 In order to meet the LE requirement, the curriculum for the course would need to show a rigorous treatment of students actually “doing” mathematics. How could we do this?
 Which types of visualizations would we include in the course?
 Would we use a textbook? How might this inform the content of the course?
After several conversations among the teaching team, with stakeholder departments, and with colleagues teaching data visualization courses at other universities, we eventually proposed that the course:
 Focus both on students’ being able to read and understand existing visualizations and produce a subset of these visualizations, and
 Use R (primary tool) and RAWGraphs for the production of these plots.
The choice to use R was not an immediate one. We initially looked at using Tableau, but the default choices made by the software (e.g., to immediately plot summaries rather than raw data) and the cost to students after finishing the course eventually sealed its fate (we don’t use it). We contemplated using Excel for a minute (gasp!), but we vetoed that even quicker than Tableau. The RAWGraphs website, we felt, held a lot of promise as a software tool for the course. It had an intuitive drag-and-drop interface, and could be used to create many of the plots we wanted students to produce. Unfortunately, we were not able to get the bar graph widget to produce side-by-side bar plots easily (actually, at all). The other drawback was that the drag-and-drop interactions would have made it a harder sell to the LE committee as a method of building students’ computational and mathematical thinking if we used it as the primary tool.
Once we settled on using R, we had to decide between the suite of base plots and ggplot2 (lattice was not in the running). We decided that ggplot made the most sense in terms of extensibility. Its syntax is based on a theoretical foundation for creating and thinking about plots, which also made it a natural choice for a data visualization course. The idea of mapping variables to aesthetics was also consistent with the language used in RAWGraphs, so it helped reinforce core ideas across the tools. Lastly, we felt that the ggplot syntax would also help students transition to other tools (such as ggvis or plotly) more easily.
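As a generic illustration of that idea (not actual course material), mapping variables to aesthetics in ggplot2 looks like this:

```r
library(ggplot2)

## Map engine displacement to x, highway mileage to y, and drive type to color:
p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()
p
```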
One thing that the teaching team completely agreed on (and that was mentioned by almost everyone we talked to who taught data visualization) was that we wanted students to be producing graphs very early in the course, giving them a sense of power and reinforcing that they could be successful. We felt this might be difficult for students with the ggplot syntax. To ameliorate this, we wrote a course-specific R package (epsy1261; available on GitHub) that allows students to create a few simple plots interactively by employing functionality from the manipulate package. (We could have also done this via Shiny, but I am not as well-versed in Shiny and only had a few hours to devote to this over the summer given other responsibilities.)
Course Content

We decided on a three-pronged approach to the course content. The first prong would be based on the production of common statistical plots: bar charts, scatterplots, and maps, and some variations of these (e.g., donut plots, treemaps, bubble charts). The second prong was focused on reading more complex plots (e.g., networks, alluvial plots), but not producing them, except maybe by hand. The third prong was a group project. This would give students a chance to use what they had learned, and also, perhaps, access other plots we had not covered. In addition, we wanted students to consider narrative in the presentation of these plots: to tell a data-driven story.
Along with this, we had hoped to introduce students to computational skills such as data summarization, tidying, and joining data sets. We also wanted to introduce concepts such as smoothing (especially for helping describe trend in scatterplots), color choice, and projection and coordinate systems (in maps). Other things we thought about were using R Markdown and data scraping.
Reality

The reality, as we are finding now that we are over a third of the way through the course, is that this amount of content was overambitious. We grossly underestimated the amount of practice time these students would need, especially when working with R. Two things play a role in this:
 The course attracted far more students than we expected for the first offering (our class size is 44), and there is a lot of heterogeneity in students’ experiences and academic backgrounds. For example, we have graduate students from the School of Design, some first-years, and mostly sophomores and juniors. We also have a variety of majors, including design, the social sciences, and computer science.
 We hypothesize that students are not practicing much outside of class. This means they are essentially only using R twice a week for 75 minutes in class. That amount of practice is too infrequent for students to really learn the syntax.
Most of the students had minimal computational experience prior to taking this course. They are very competent at using point-and-click software (e.g., Google Docs), but have an abundance of trouble when forced to use syntax. The precision required by case sensitivity, commas, and parentheses is outside their wheelhouse.
I would go so far as to say that several of these students are intimidated by the computation, and completely panic when facing an error message. This has led us to really think through and spend time discussing computational workflows and how to “debug” syntax to find errors. All of this has added more time than we anticipated to the actual computing. (While this may add time, it is still educationally useful for these students.)
The teaching team meets weekly for 90 minutes to discuss and reflect on what happened in the course. We also plan what will happen in the upcoming week based on what we observed and what we see in students’ homework. As of now, we clearly see that students need more practice, and we have begun giving students the end result of a plot and asking them to recreate these.
I am still hoping to get to scatterplots and maps in the course. However, some of the other computational ideas (scraping, joining) may have to be relegated to conceptual ideas in a reading. We are also considering scrapping the project, at least for this semester. At the very least, we will change it to a more structured set of plots they need to produce rather than letting them choose the data sets, etc. Live and learn. Next time we offer the course it will be better.
*Technology note: RAWGraphs can be adapted by designing additional chart types, so in theory, if one had time, we could write our own version to be more compatible with the course. We are also considering using the ggplotgui package, which is a Shiny dashboard for creating ggplot plots.
To leave a comment for the author, please follow the link and comment on their blog: R Project – CitizenStatistician.
From Static to Numeric to Single-Choice Exercises in R/exams
(This article was first published on R/exams, and kindly contributed to Rbloggers)
Idea

Our colleagues over at the economics department became interested in using R/exams for generating large-scale exams in their introductory economics courses. However, they face the challenge that so far they had been writing static exercises and modifying them by hand whenever they wanted to reuse them in a different exam in another semester. To let R/exams do this job, it is illustrated below how a static arithmetic exercise can be turned into a dynamic exercise template, either in num format with a numeric solution or in schoice format with a single-choice solution.
The idea for the exercise is a very basic price elasticity of demand task: given an inverse demand function

p(x) = a − b · x

for the price p at demanded quantity x, what is the price elasticity of demand at a given price p?

The natural candidates for “parameterizing” this exercise are the price p and the parameters a and b of the inverse demand function. Based on these, the demanded quantity is simply

x = D(p) = (a − p) / b.

Then, at price p, the price elasticity of demand is

D′(p) / D(p) · p = −1/(b · x) · p.
Overview

Below, various incarnations of this exercise are provided in both R/Markdown (Rmd) and R/LaTeX (Rnw) format. The following table gives a brief overview of all available versions along with a short description of the idea behind each. More detailed explanations are provided in the subsequent sections.

#  Templates                          Dynamic  Type     Description
1  elasticity1.Rmd, elasticity1.Rnw   No       num      Fixed parameters and numeric solution.
2  elasticity2.Rmd, elasticity2.Rnw   No       schoice  As in #1 but with single-choice solution (five answer alternatives).
3  elasticity3.Rmd, elasticity3.Rnw   Yes      num      Randomly drawn parameters with dynamic computation of the correct solution, based on #1.
4  elasticity4.Rmd, elasticity4.Rnw   Yes      schoice  Randomly drawn parameters (as in #3) with dynamically generated single-choice solution (as in #2), computed by num_to_schoice().
5  elasticity5.Rmd, elasticity5.Rnw   Yes      schoice  As in #4 but with the last alternative: None of the above.

Static numeric
The starting point is a completely static exercise, as it had been used in a previous introductory economics exam. The parameters had been set to p = 5, a = 50, and b = 0.5. This implies that x = 90, leading to an elasticity of −0.1111.
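These numbers are easy to check in R from the formulas above:

```r
## Static parameters from the original exercise:
a <- 50
b <- 0.5
p <- 5
x   <- (a - p) / b      # demanded quantity x = D(p): 90
sol <- -1 / b * p / x   # price elasticity of demand: -0.1111...
```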
The corresponding R/exams templates simply hardcode these numbers into the question/solution
and wrap everything into R/Markdown (elasticity1.Rmd)
or R/LaTeX (elasticity1.Rnw).
Note that LaTeX is used in either case for the mathematical notation. In case you are unfamiliar
with the R/exams format, please check out the First Steps tutorial.
The meta-information simply sets extype to num, supplies the exsolution with a precision of three digits, and allows an extol tolerance of 0.01. To see what the result looks like, download the files linked above and run exams2html() and/or exams2pdf(). (The examples below always use the R/Markdown version, but the R/LaTeX version can be used in exactly the same way.)
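Reconstructed from that description (the exact digits and formatting are an assumption, not copied from the template), the meta-information block of elasticity1.Rmd would look roughly like:

```
Meta-information
================
extype: num
exsolution: -0.111
extol: 0.01
```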
Single-choice versions of exercises are often desired for use in written exams
because they can be conveniently scanned and automatically evaluated. Thus, we need to come up
with a number of incorrect alternative solutions (or “distractors”). If desired, these could include
typical wrong solutions or a None of the others alternative.
The R/exams templates
elasticity2.Rmd and
elasticity2.Rnw are
essentially copies of the static numeric exercise above but:
 Question/solution now contain an answerlist (with five alternatives).
 The extype has been changed to schoice.
 The exsolution now contains a binary coding of the correct solution.
 Furthermore, to obtain some basic randomization we have turned on shuffling by setting exshuffle
to TRUE. (Subsampling more than five alternatives would also be possible and would add some further randomization.)
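Putting these changes together, the corresponding meta-information would look roughly like this (a reconstruction; the binary coding shown assumes the correct alternative is listed first before shuffling):

```
Meta-information
================
extype: schoice
exsolution: 10000
exshuffle: TRUE
```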
As above, exams2html() and/or exams2pdf() can be used to display the exercise interactively in R/exams.
Dynamic numeric

Next, the static exercise from above is made dynamic by drawing the parameters from a suitable data-generating process. In this case, the following works well:
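A data-generating process along these lines might look as follows; the concrete ranges are assumptions, and only the use of a common scaling factor fctr, together with drawing x and b and computing a, follows the description:

```r
## Sketch of a data-generating process (concrete ranges are assumptions):
fctr <- sample(c(0.5, 1, 2), 1)       # common scaling factor for "nice" numbers
x    <- sample(60:120, 1) * fctr      # draw the demanded quantity ...
b    <- sample(5:15, 1) / 10 / fctr   # ... and the slope, on the same scale
p    <- sample(2:10, 1)               # price at which elasticity is evaluated
a    <- p + b * x                     # intercept computed so that x = (a - p)/b
sol  <- -1 / b * p / x                # correct solution (price elasticity)
```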
Note that in order to obtain “nice” numbers, a common scaling factor fctr is used for both x and b. Also, while the examinees are presented with parameters a and b and have to compute x, the data-generating process actually draws x and b and computes a from them. Again, this makes it easier to obtain “nice” numbers.
The R/exams templates elasticity3.Rmd and elasticity3.Rnw include the data-generating process above as a code chunk in either R/Markdown or R/LaTeX format. The parameters are then inserted into the question/solution/meta-information using `r a` (R/Markdown) or \Sexpr{a} (R/LaTeX). Sometimes the fmt() function is used for formatting the parameters with a desired number of digits. (See ?fmt for more details.)
As before, exams2html() and/or exams2pdf() can be used to display the random draws from the exercise templates. For checking that the meta-information is included correctly, it is often helpful to run:
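A plausible version of this check, assuming the exams_metainfo() function from the exams package and the template file in the working directory, is:

```r
## Sketch: render one random draw and display its parsed meta-information
## (runs only if the exams package and the file are available):
meta <- if (requireNamespace("exams", quietly = TRUE) &&
            file.exists("elasticity3.Rmd")) {
  exams::exams_metainfo(exams::exams2html("elasticity3.Rmd"))
}
meta
```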
Furthermore, some tweaking is usually required when calibrating the parameter ranges in the data-generating process. The stresstest_exercise() function draws a large number of random replications and thus can help to spot errors or to find solutions that are “extreme” (e.g., much larger or smaller than usual). To clean up your global environment and run the function, use something like this:
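The chunk itself might look like the following sketch; the argument names and the plot types are assumptions based on the stresstest_exercise() interface:

```r
## Clean up the global environment, then stress test the template
## (sketch; runs only if the exams package and the file are available):
rm(list = ls())
if (requireNamespace("exams", quietly = TRUE) && file.exists("elasticity3.Rmd")) {
  s <- exams::stresstest_exercise("elasticity3.Rmd", n = 200)
  plot(s)                      # overview of run times and solutions
  plot(s, type = "solution")   # solution against the generated parameters
}
```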
The latter command plots the correct solution against the (scalar) parameters that were generated in the exercise. This might show patterns for which parameters the solution becomes too large or too small, etc. See ?stresstest_exercise for further features/details.
To go from the dynamic numeric exercise to a dynamic single-choice exercise, we need to extend the data-generating process to also produce a number of wrong alternatives. The function num_to_schoice() helps with this by providing different sampling mechanisms. It allows you to set a range in which the alternatives have to lie, a minimum distance between all alternatives, typical wrong solutions to include, etc. It also shuffles the resulting alternatives and tries to make sure that the correct solution is not a certain order statistic (e.g., almost always the largest or smallest alternative).
The R/exams templates elasticity4.Rmd and elasticity4.Rnw first wrap the data-generating process into a while loop with while(sol > -0.11). This makes sure that there is enough “space” for four wrong alternatives between −0.11 and 0, using a minimum distance of 0.017. Subsequently, the four wrong alternatives are generated by:
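A sketch of what such a num_to_schoice() call could look like follows, with an illustrative sol value and all argument values assumed rather than taken from the original exercise:

```r
## Sketch only: sol, err, and rng are illustrative assumptions.
sol <- -0.1333                  # illustrative correct solution (< -0.11)
err <- c(sol / 2, 1.5 * sol)    # plausible "typical wrong" solutions
rng <- c(-0.25, -0.001)         # range wide enough for five alternatives
if (requireNamespace("exams", quietly = TRUE)) {
  sc <- exams::num_to_schoice(sol, wrong = err, range = rng,
                              delta = 0.017, method = "delta", digits = 4)
  sc$questions   # five formatted alternatives
  sc$solutions   # logical vector marking the correct one
}
```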
This suggests a number of typical wrong solutions err (provided that they are not too small or too large) and makes sure that the range rng is large enough. With these arguments, num_to_schoice() is run; see ?num_to_schoice for the details. The resulting sc list contains suitable $questions and $solutions that can be easily embedded in an answerlist() and in the meta-information. Sometimes it is useful to wrap num_to_schoice() in another while() loop to make sure a valid result is found. See the deriv2 template for an illustration.
As before, exams2html() and/or exams2pdf() can be used to display random draws from this exercise template. And some stress testing could be carried out by:
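Analogously to the earlier stress test, something along these lines could be used (a sketch with assumed arguments):

```r
## Stress test the single-choice template (runs only if prerequisites exist):
if (requireNamespace("exams", quietly = TRUE) && file.exists("elasticity4.Rmd")) {
  s4 <- exams::stresstest_exercise("elasticity4.Rmd", n = 200)
  plot(s4)
}
```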
As a final variation, we could include None of the above as the last alternative. This is very easy because we can simply replace the fifth element of the question list with the corresponding string:
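Concretely, if sc is the list returned by num_to_schoice(), the replacement is a one-liner (the alternative strings below are dummies for illustration):

```r
## Dummy stand-in for the list returned by num_to_schoice():
sc <- list(questions = c("$-0.17$", "$-0.03$", "$-0.13$", "$-0.09$", "$-0.06$"),
           solutions = c(FALSE, FALSE, TRUE, FALSE, FALSE))

## Replace the fifth alternative with "None of the above.":
sc$questions[5] <- "None of the above."
```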
No matter whether this was actually the correct alternative or one of the incorrect alternatives, the corresponding solution will stay the same. The R/exams templates elasticity5.Rmd and elasticity5.Rnw incorporate this additional line of code in the data-generating process.
To leave a comment for the author, please follow the link and comment on their blog: R/exams.
getSymbols and Alpha Vantage
(This article was first published on FOSS Trading, and kindly contributed to Rbloggers)
Thanks to Paul Teetor, getSymbols() can now import data from Alpha Vantage! This feature is part of the quantmod 0.4-11 release, and provides another data source to help avoid any Yahoo Finance API changes*.
Alpha Vantage is a free web service that provides real-time and historical equity data. They provide daily, weekly, and monthly history for both domestic and international markets, with up to 20 years of history. Dividend- and split-adjusted close prices are available for daily data. They also provide near real-time price bars at a resolution of 1 minute or more, for up to 10 recent days. All you need to get started is a one-time registration for an API key.

Alpha Vantage has a clean, documented, public API that returns either JSON-encoded data or a CSV file. The arguments to getSymbols.av() closely follow the native API, so be sure to use their documentation! To get started, install the latest quantmod from CRAN. Then you call:

getSymbols("MSFT", src = "av", api.key = "[your key]")

where you replace "[your key]" with the API key you receive after registration. You can use setDefaults() to set your API key one time, and use it for all getSymbols.av() calls:

setDefaults("getSymbols.av", api.key = "[your key]")

*Speaking of API changes, this release also includes a fix for a Yahoo Finance change (#174).
To leave a comment for the author, please follow the link and comment on their blog: FOSS Trading.