Subscribe to R bloggers feed R bloggers
R news and tutorials contributed by (750) R bloggers
Updated: 10 hours 4 min ago

Getting Your First Data Science Job

Mon, 03/04/2019 - 18:00

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Hugo Bowne-Anderson, the host of DataFramed, the DataCamp podcast, recently interviewed Chris Albon, Data Scientist at Devoted Health.

Here is the podcast link.

Introducing Chris Albon

Hugo: Hi there, Chris, and welcome to DataFramed.

Chris: Hey, how’s it going?

Hugo: It’s great man. How are you?

Chris: I’m good. This is like one of the first podcast I’ve done in a while. This is like, I stopped my podcast and now I’ve gone on a few ones but then I had a kid and I did a move and that kind of stuff. Now I’m back, back on the podcasting circuit. This is it.

Hugo: Fantastic. How long has it been?

Chris: It’s been like, so my kid is two months old. So I think I did my last podcast like two months before that. So it’s been four months without-

Hugo: Right on. Well, congratulations on the new member of your family.

Chris: Thank you. Thank you. She is doing great. All is well.

Hugo: Awesome. And congrats on the new job and the move. There’s been a lot of change, right?

Chris: There’s been a lot of change. There’s definitely been a lot of change. I feel like it’s one of those things where I have a very hard time saying no to DJ So when he does think that you should come do something, you have to think really hard about whether or not you want to say no to him, which I’ve said no to him in the past, I could not say no to him this time, thus I’m with DJ on a crazy adventure.

Hugo: And so this is DJ Patil?

Chris: This is DJ Patil, former chief data scientist of the US, former head of data at LinkedIn, I don’t know his whole resume. But a well known, well known person in the data science world.

Hugo: Very well known, very well respected, doing a lot of interesting work at Devoted and otherwise as well. I mean, his recent series of articles with Mike Loukides and Hillary Mason around kind of forming a conversation around data science ethics is really interesting as well.

Chris: Yeah, I think he’s doing a lot of really important thinking in a space that we’re still figuring out. I mean, this was something that I think a lot of us have been talking about, sort of off and on for a long time of like, hey, we don’t actually, we have lots of goals for what this field can do. What does it actually mean in practice? And the more we get to do that, the more that people sort of carve out space in their workday to say, hey, I’m going to think about this topic. I’m going to produce something that other people can read and agree with and disagree with and use as a point of discussion or build something off of. Is great and I hope he does it more and I hope Hillary does it more. And I think more people should spend a little bit more time thinking about that.

Chris: And it’s part of it that was so nice to go work for him because it is sort of nice to go work for someone who’s thinking about that kind of stuff and really is big on ethics and is big on trying to use technology for social good. Like for people in this podcast, like my background is not in business. Like I have started a startup, but my main background is nonprofits, humanitarian nonprofits. I work on data on humanitarian nonprofits, did that for most of my career. And so to work for a team that’s led by someone who spends a lot of time thinking about how to build companies with a soul was very refreshing.

About Devoted Health

Hugo: Awesome. So maybe you can start off by telling us just about Devoted in general and the work you’re doing.

Chris: Sure, Devoted, a health insurance company that was started by Todd and Ed Park… Todd Park was the former CTO of the United States and Ed Park was the CEO of another health insurance company or another healthcare company before this. And what Devoted tries to do is tries to frankly create a health insurance company that you would want your own family members to be at. It is a company, it is a startup. It operates exact likely the startup, is funded exactly like the startup. I think there’s areas that would set it apart from startup. But I think overall you would look at it and be like, yes, this is a startup.

Chris: However, Devoted is on a mission. We are trying to make health care that works, that works for people, particularly right now senior citizens. So we work on Medicare. Medicare is the health insurance, those government health insurance program for senior citizens, that is what Devoted is focused on, that’s sort of what we consider ourselves as a Medicare company.

Chris: But on a daily basis if you work inside Devoted, you can see that we are trying to build something that would be a company with a soul, a company that’s trying to do something real and trying to do something that matters in people’s lives and do right by people and is hyper compliant with the law and do more than just simply profit. That was what drew me to it. I think that’s what draws a lot of people to Devoted.

Hugo: That’s great. As we’ll discuss, you’re working a lot on hiring at the moment. And what we’re going to talk about today among other things is people getting their first data science jobs and advice for such people, how it works, what it looks like from your side of the conversation as well. I think particularly at this point where Junior data science paths aren’t necessarily fleshed out, we’re starting to see a bunch of specialization in the industry, these types of things will be incredibly important going forward.

Hugo: But before we dive into that, I just want to get a bit more background about you. And if you could tell me, like you said, your background isn’t necessarily in data science. I’m wondering how you got involved in data science in the first place.

Chris: My background’s in quantitative political science. So political science, studying of politics, I studied civil wars, but it’s political science completely from the perspective of statistics, of quantitative research, of experimentation rather than qualitative work, interviewing people, looking at historic documents, that kind of thing. And when I was getting my PhD, I kept on having drinks with these people in San Francisco where I was living at the time, and they were working at places like LinkedIn, so with DJ and a number of other people, and they were doing such cool applied stuff. Amazing projects, amazing uses of data in a very, very applied way.

Chris: And ir was about then that I kind of decided that if I really wanted to have a real impact and really wanted to do something that would matter that I needed to not be in academia, I needed to go out and actually apply the skills in a way that was beyond research. And don’t get me wrong, I love research, every single person who applies for Devoted, who has any kind of PhD, I just want to talk to about their PhD all day. I think that’s a bias of mine.

Chris: But, there’s so many ways that you could apply it and then at the time, there was some chances for me to go work for some joint Kenyan US nonprofits, Kenyan nonprofits, that kind of stuff. And I spent a number of years working being sort of the first data hire over there at Ushahidi which is a Kenyan nonprofit that works on election monitoring and disaster relief. Went to go work for BRCK which is a Kenyan startup that works on providing free Wi-Fi to lower income people in Kenya. Frontline SMS, did a lot of work around election monitoring.

Chris: But taking data, real data about, say, an election in a country with an authoritarian leadership and actually seeing like is this data true, are people filing in fake elections, are people filing and fake reports, like what is actually happening on the ground with real data with people who are in places where they can get arrested if things go wrong.

Chris: That was a big eye opener to be around issues of safety, around issues of ethics, around issues of like, well, like for example, if I go to a place that has election monitoring and we’re running election monitoring campaign with a local NGO, if the cops bust down the door for some reason because something happened, I would be arrested, then flown back to the US. They would be arrested and God knows what would happen. And that means that our threat models are very different and sort of understanding that your threat model isn’t their threat model I think has lots of applications for data science in the US and everywhere that I think it was, it was an important lesson for me.

Learning Python

Hugo: Absolutely. And I have a question around, it seems like on these this type of work, you’ll learn a lot on the job both in terms of domain expertise, but also in terms of data scientific techniques. I’m just wondering, before you got your first data science job, did you know how to program in Python or did you have an idea of what the landscape looked like from your time in research?

Chris: Where I was when I was finishing up my PhD was I had done some web work. So I’m not going to say web development because it wasn’t a lot of JavaScript that kind of stuff but I’d done like HTML and CSS, which aren’t really programming, but just, that kind of stuff, like making some web pages with a little bit of JavaScript in there and that kind of stuff. And I’d done all of my research in Stata and R. But I wasn’t very good at it. I think my code was very poor at that moment. But I did see through my conversations with other people that there was, this world of coding wasn’t just a little script that you run and like leave your laptop, you know, if it needs to run for more than like six hours, you just leave your laptop open for six hours.

Chris: I realized that there was much more into the world of software engineering and I started to move in that direction because when you want to build stuff with other engineers, it’s nice to use the tools that they use and to think about things they use and to take code that they can insert into their projects. And so I really started to push more along the lines of developing more software engineering skills, more languages that are pretty common in software engineering, such as Python. When I left my PhD, I was not the wizard programmer that I’m not today, but I could be today.

What do you do as a data scientist?

Hugo: I love that. So I usually ask a question around what your colleagues think you do as opposed to what you actually do. That might be slightly different in what you do at Devoted, people might have more of an idea. I think historically, though, in terms of the startups you’ve worked in doing analytics and data science, is there a mismatch between what people think you do and what you actually do?

Chris: I think people, so we’ll take BRCK. BRCK is a Kenyan startup that does free Wi-Fi for low income people. The whole team is Kenyan, they’re based out of Kenya and I was the first data hire at BRCK. I think a lot of them thought that I was doing wizard mathematics. There was a whiteboard in my office and I’m writing equations and solving riddles of mathematics to get them some kind of thing. Where in fact what I was doing was a lot of software engineering stuff, a lot of building functions, running things, cron jobs, like using a lot of Python, or a lot of pandas, some scikit-learn, that kind of stuff.

Chris: I was mostly using established tools, but it would provide them with actually something that was useful in their work. But I think from their perspective, I was like wizardry, right? Because I would do something like imputation where you take missing values in your data and you impute the value, you like fake what the value would probably be. And it was like wizardry, right? Like, whoa, I can’t believe that happened, like you were predicting this stuff.

Chris: So I think it feels very normal to the data scientist and to me, but I think it felt very, it was very exciting for a team that didn’t have that to start to have that kind of option for stuff. I think it worked well.

Getting that First Data Science Job

Hugo: So as we’ve said, we’re here to talk about getting your first data science job, and you’re working on hiring data scientists at Devoted and thinking a lot about people getting their first data science job today. I want to kind of figure out what companies are generally looking for when hiring first time data scientists. But before that, I suppose, yeah, a preliminary question is are a lot of companies trying to hire first time data scientists? Because a lot of the job listings want 10 years experience of distributed computing and all of that jazz. I hesitate to use the term entry level, but for a first time job, are there a lot of jobs out there?

Chris: I think there are. Although from, so the perspective I think we could take this interview is that I’m sitting on the other side of the table where I’m doing a lot of the working with a team at Devoted to do a lot of hiring for our data science team. And I talk to a lot of my friends who are doing similar roles at other companies. So in the interview, they’re on the hiring side. So there definitely are a lot of junior roles, I wouldn’t put into someone’s head that like junior roles are dying or everyone needs experience and that kind of stuff.

Chris: There is something I definitely see that as a team you can absorb infinite number of senior hires. So say you have a team of six people, you can hire six new other senior people and be pretty okay that everything’s just going to work because they’re senior, like it’ll do fine. It is much more of a risk from the organization’s perspective if you have six people on your team and then you hire six junior, because there’s, like those junior people need a lot more support and if you’re unable to give that support, it is not good for the junior people or for you.

Chris: And definitely take this from the perspective of that junior person, like you do not want to be in a place where there is one senior data scientist and there’s six junior people who were all hired at the same time because you are not going to learn what you need to learn. Like you want to be the one junior person on a team of six senior people, like that would be the ideal situation and you can just sit down and take all the time of them as you want and you can work with them very closely and your learning would be massive.

Hugo: That would be incredible.

Chris: There’s definitely junior jobs out there. But there is something from an organization’s perspective where you might have a small startup and you only have three data scientists or two data scientists at this company because data science is a specialty job and you just can’t have a posting that says you want to hire five junior people. Like don’t apply to that job, I think that’d be very horrible. Instead you want somewhere, I think if I was looking, if I was junior, so one thing we should point out that when I was junior in data science it was a very different fields. So you should not take my how I got into the field as an example of how you should get into the field because when I started there wasn’t really the concept of data science and it was sort of a bunch of people who are interested in the same topic and just sort of like got together.

Hugo: Yeah and doing it.

Chris: Doing it. I’s different, obviously. But I think I can see some things around, if you are, you know, if you’re looking for that first job, I would definitely look for places in mid to larger companies and not in smaller scrappier startups.

Chris: I think one of the things that I’ve found a few times during the hiring process is that people who got their first job at say Facebook and they’re a junior junior data scientist at Facebook, there’s so much more support that they got in that time for learning, for experimentation, for using things at scale, like learning how to work at Facebook scale but doing so as a very junior person, they have such great experience that then they move on to say, say apply for a job at Devoted or something like that, and they have that experience and we can use that experience. Like that’s great, you’re awesome. Like cool, you have this great experience. This is hard experience to get because it’s hard for a boot camp to say replicate 50 million messages a day that you have to process or something like that. That’s a weird boot camp project. But that is what a lot of companies end up doing.

Chris: So going to somewhere like Facebook or going to, I’m just like picking on Facebook, and this isn’t an ad for Facebook, but any kind of larger company that can absorb you as a junior higher and allow you to learn and allow you to learn from really senior engineers and then moving on to something else is I think a great perspective.

Hugo: For sure. And those companies I think have the infrastructure all set up as well. So you’re not like battling with your data lakes and airflow and all of that, right?

Chris: Yeah. And there’s absolutely a trap that I think junior data scientists fall into, even mid level data scientists fall into, where they join a small scrappy startup, not as a founder, but just the startup is six people or something like that and they’re the first data hire, and there’s no data infrastructure in place at all. And there’s no safety nets, there’s no ability to get data from the database in a way that’s useful. And you have to sit down and build all that which in certain times can be okay, but I think in a really scrappy startup where they’re like struggling to make payroll or they’re like grinding away or something, it can be a very, very hard experience and probably not the best experience as opposed to something where you had, you go work at some really hard problems at a larger company, but you do so in a way that you can leave work at the end of the day knowing that everything’s fine.

Chris: Then you come back the next day and all that infrastructure is in place And they’ll have talks and lectures and you can go over to some person who’s done data engineering for 10 years and say, hey, why does it work like this? And they’ll be like, oh, it’s because of x, y, and z, and that kind of stuff.

Chris: And so I think there’s just, there’s so much learning that can happen at larger companies. I’ve never worked at a larger company so this is me talking from the outside. But I do think that as someone who’s worked at a few smaller companies or smaller organizations, that if those organizations do better if you’re senior just because you can sort of be left alone and figure everything out by yourself because you’ve done this before as opposed to actually I don’t know what I’m doing, I really need someone to tell me how to do it right.

Chris: And so you learn things the right way. I think there’s a lot of things where, particularly where a data scientist is doing more software engineering stuff, where it’s really nice to sit down for someone to be like, hey, explain to me object oriented programming because I don’t understand what in the world’s happening or explain to me how testing works, like unit tests, integration tests, or patterns, like software engineering patterns, like factory of factories and that kind of stuff. Tell me about that.

Chris: Like all those are totally simple concepts that anyone listening to this podcast can know. Just need to like have someone tell you about them or know that you should read about them. Like understand that, oh, actually this comes up a lot, I should read about this, that kind of stuff, which is great in companies with more support and is terrible in really, really small scrappy places.

What are companies looking for when hiring for some data scientists?

Hugo: So in general, what are companies looking for when hiring for some data scientists, do you think? And if you don’t want to answer that you can tell me like what you in particular are looking for when hiring first time data scientists.

Chris: Sure. I’ll try a stab at the one but then we’ll definitely hit on the other one. It depends on the company. I should say organization because I worked for a lot of nonprofits, but we’ll just use company as fill in gap for that. Places that are very small and scrappy tend to look for senior people that they can run the whole, I think you would call them like a full stack data scientist. I’m not entirely enamored with that phrase, but whatever.

Hugo: Like a generalist.

Chris: A generalist. You could just have them join the software engineering team or the product team or something like that, and you give them pseudo access on your server and they will just build everything they need to build. They’ll build the pipelineing, they’ll build all the, saving backups all the data, they’ll build the tables that they need to build, then they’ll run the analysis and they’ll run it in a way that runs every single, every single day, but only on certain times and they’ll install airflow by themselves and they’ll do all this kind of stuff that just they can do. That’s awesome for that person. That’s not really a junior role.

Hugo: No.

Chris: More, I think if you go to the other end of the spectrum and you go for large companies, they do a mix. Well, they will hire people from more senior positions to do things like, say someone very specialized, say someone who just got their PhD in AI or something like that, like they might have them join the team that’s working on an algorithm and that junior person can work on a part of the algorithm or work on some testing part and grind to that kind of stuff. Or I think what’s very common is more generalist data scientists who are junior join larger companies and they do probably what I would call like advanced analyses or advanced analytics. So you have a new product and you want to understand how that product is actually doing at scale because it’s deployed globally on all Android devices or something like that, and you want to see how that product is working and how people are using it.

Chris: There is some very, very complicated analyses that need to be done for that to be true. And that is a great sort of thing for a junior person to kind of tear off and start working on and it doesn’t affect the production code, but it does affect the business.

Hugo: And in that respect, I suppose we’re talking about the data analyst breed of data scientist, right?

Chris: Yeah. Again, I don’t even know why I keep on talking about Facebook, but I know at Facebook, a lot of people who have the title of data scientist do more analysis, which is totally reasonable and it’s super hard and give them massive credit. And it is, that kind of stuff is, it’s a real role, it’s a real job. And I think people who come from academia, people who I know come from sort of political sciences, social sciences, they go into those kind of roles and have a great time because they are working on hard analyses, like hard analytical problems. You’re not making dashboards to show that someone’s clicking on something, you’re making really complicated analyses to figure out the churn model that applies Bayesian to, you know, all that kind of stuff, which is cool.

Chris: I think in the middle, so you have the teeny companies and you have the large companies, you have companies like Devoted that sort of sit in the middle where there is a mix between trying to hire senior people who can craft the foundational data science infrastructure. So Devoted is building a health insurance companies tech stack from scratch. It was at one point an empty GitHub repo. Now there’s a lot of code in there. But yeah, at some point, we’re like, okay, we are a health insurance company, this is our code base, like let’s write the first line of code, that’s where we are. And in those kind of environments, you want a mix of people who are senior, who can build the architecture for how things work and how data is moved around and how say tasks are run every day or how there’s things around testing, and if companies do testing and all that kind of stuff.

Chris: And in addition, you want people who have more junior experience that you can come in, you can provide some lift them like teaching wise, like you could do more teaching with them. And you can have them support the role where you might tear off a piece of a project and say, hey, can you make this for me so I don’t need to make it. Because it was definitely, so one of the people who’s on our team currently, Kit Rodolfa, who is the former chief data scientist of the Hillary campaign. And he’s done some amazing complicated work, and he’s also done some amazing work that was well below his level of expertise because there was just no one else there to do it. And so, there’s I think typically mid level companies like ours tend to have both. Some are longer range.

What skills or qualities are you looking for when hiring?

Hugo: And so for you, for your size of company, for a junior data scientist, what skills would they, what would be good qualities, what would they need to do or demonstrate for you to be like, hey, this person would be a good fit here?

Chris: Yeah, I think for us at Devoted, we are a health insurance company, but if you looked at the internal workings of how the tech team works, we operate far more like a Silicon Valley tech company and concepts of move fast and break things or ask for forgiveness rather than permission are real things that we are doing. I think a lot of the people at Devoted are building things that is the largest and hardest thing that they’ve built ever in their career. And that’s what we want you to be that. We want you to be the person who’s building the… Doing the best work of career, doing the hardest thing that is pushing yourself right to the limit of what you think you can do.

Chris: We have a very strong culture that says, hey, we want you to run right up until the point that you break the thing, break the thing and then come and be able to say hey, I totally broke this, can someone fix this, dear God. And then we’ll go back and we’ll work with them to fix it. I say this as them but like I have in fact broken many things at Devoted. And to have a culture that says hey, you can break these things, it’s okay, don’t worry about it, you should definitely admit that it’s broken because we should know that, but run right up until you break it, then we’ll talk about why it was broken and we’ll then run again. Like don’t stop running, keep on running.

Chris: But that fast paced, fast paced doesn’t translate into a huge amount of work hours because Devoted, like I have a family, no one at Devoted I think is grinding the midnight oil left, right and center. There’s definitely, people are working nine to five-ish. But, just, you know, what you’re doing, I think there’s a lot of pushing the boundaries of what we can do skill wise right to the end and a lot of learning and trying to build the things that are the limit of our capability is pretty common.

Chris: I think that would be a skill regardless of your level of experience, whether you have one year of experience or no years of experience or 10 years of experience, we really do look for people who are very happy to take on a task that they’ve never done before but they’re pretty, you know, they kind of understand how they might go about it and go for it. Like go and run it. And we have safety guards in place that nothing bad is going to happen. That they can push themselves, and then once they have that, once they’ve built that thing, they go, okay, cool, I understand it now, it’s great, I’ve got it, like, let’s do something else. Like let’s go harder. Let’s build the next thing.

Chris: I think that’s really, it’s hard to instill because it is not experience per se. It’s more, you know, aptitude and attitude is probably a closer one where I don’t necessarily, we don’t necessarily say we’re only looking for people with X years. We tend to have a willingness to say, hey, you’re actually, like we were kind of looking for someone more senior, but you are more junior but you really are kicking butt. And we can see that when we talk to you that you have a lot of aptitude for your level of, like for your levels of experience and we’d love your attitude for running with it. Like, let’s go, like come on board.

Chris: We’ve definitely, we’ve had a few hires like that and I think it has worked out. It is something, it’s not uniquely Devoted, but it is something that is probably pretty uncommon for our industry.

Data Analytic Skills

Hugo: Yeah, and that speaks to certain skills that you’ve mentioned in there being like an active problem solver, communication, having a passion and a drive. How about in terms of hard skills, whether it be domain expertise or background in programming, whether it’s Python or R or knowledge of how the math behind machine learning models actually works to like data analytic skills?

Chris: Yeah, I’ve noticed something with hiring, where typically when we have someone in the hiring process, I will say to them up front that there are places, like if you just got your master’s degree in machine learning and you believe that how you are going to advance your career is to do machine learning, become an expert in machine learning, that’s what you’re going to do, that is a completely valid job, like career track. You should absolutely do that. Also, that isn’t a place like Devoted. Devoted tends to be more generalist builders who are trying to, you know, we’ll use machine learning here because it’s useful and then we’ll use fuzzy matching here because it’s useful and then we’ll use some simple things somewhere else because it gets the job done.

Chris: We tend to focus on the ability of people to build and solve the needs of our internal users more than say someone’s really nuanced view of machine learning algorithms. Just because we are a small team in a new company, if you wanted to just do machine learning, there are definitely companies that are big enough that have the infrastructure to have you just do that. Someone else will worry about some of the analytics or some of the data processing. You can just sit down all day and read machine learning journal articles and then implement them. That’s totally okay. I never worked at Google Brain but I imagine Google Brain has a strong feeling around that. Which is cool!

Advice to First Time Job Seekers

Hugo: Would that be your advice to first time job seekers, learn a bunch of general things in order to build stuff as opposed to go deep into machine learning models and all of that?

Chris: If I was sitting in front of someone who is taking or looking at a junior job or something like that, I would say that you probably want to take one of two tracks. One, if you have the educational experience on paper and the credentials to go for say deep AI that just go deep on AI. Like you have the master’s degree in it or you have a PhD in it or something like that, go for that track. Like that’s an amazing, it’s incredibly high paying, it’s super in demand. And there are places where you won’t have to learn anything else but ML. That can be a career and I think there’s probably going to be a full career in that and there’s absolutely no problem with that, because those are very, very hard.

Hugo: By AI do mean…

Chris: I would almost exclusively say neural networks at this point. Yeah. Basically, if you’re, if you’re working … there’s a lot of companies that are doing things that are self driving cars or things like that, where you just, you need a lot of people with that kind of knowledge to work on those problems because those are incredibly hard problems. But it is definitely along the lines of neural networks I.e deep learning to do those.

Hugo: And for this track, there is a certain mathematical overhead, which perhaps the other track, we haven’t gotten to that yet, might not have, but for example, knowing enough about multivariate calculus to understand backdrop and the vanishing gradient problem and all this stuff.

Chris: Oh, no, absolutely. I would expect that the interviews for those would be very math heavy and very theory heavy, such that it would almost feel like a dissertation defense in that area.

Hugo: That track as you said, you expect some sort of graduate work to have been done in math or something related.

Chris: Yeah, I would expect like 95% of people to have some kind of, like very obvious that they’re going down that track. I’m sure there’s some people who have like gone around that and those people are awesome and amazing but I think typically you would expect someone to be from that perspective.

Chris: The other one, if you don’t have that like very obvious sort of machine learning, deep learning focus which I don’t, so this is more my perspective, there’s another field that is way more sort of doing data science more generally at a company. So instead of just being like all I do is machine learning all day, there’s so many other problems that need to be solved using data science whether you do like bayesian analyses or you do some kind of random forest or even if you use some deep learning stuff but you’re not doing cutting edge deep learning stuff all day, there’s far more jobs.

Chris: Although like the the hype and the focus is on this sort of AI jobs. There is far more jobs, like 50 times as many jobs that people who are semi-generalists data scientists at companies, solving those companies needs to understand what’s happening in their data, whether that’s predicting when their drone should fly over and what are the crops or predicting when a customer might return or for us like predicting when someone might have a particular illness so we could do an intervention and prevent that illness from happening.

Chris: Those kind of analyses fit better for people who sort of have a more general experience because there isn’t a very type for that. There’s no Bayesian analysis PhD and if you have that you get to do Bayesian analyses and if you don’t, you don’t get to do that it’s much more general so that the people who apply for us in those kind of roles can come from any perspective and can be everything from you know a music major or can be someone with a PhD in data science and one of the new programs or can come from a boot camp, from a PhD in some other crazy field or could not have a PhD. It’s more about just if that kind of person fits what we want and what we need. But it is, I think that’s where most people do it.

Chris: I say this only because I think there’s a lot of people who think if I know enough machine learning and I go deep enough on machine learning, like that’s the guaranteed job. Which isn’t true because you could self teach a lot of machine learning and then you apply at Google Brain and they could be like, hey, you’re just not even close because you didn’t get a PhD in this and haven’t spent and haven’t spent like a huge amount of time doing it. But then you could go to another company, take that same person and go to another company say, hey, I know a lot of machine learning. They go awesome, this is like, like you’re not going to just be doing machine learning. But we have need for people who know machine learning because our products use natural language processing and that kind of stuff. Come on, join the team, you’ll be great.

Chris: And so, I think that second job has a lot more generalist things, where you end up working, you end up sort of having to be a lot more software engineering skills, a lot more skills working with sort of the software side of skills, of working with internal customers. So say someone on the sales team wants some kind of particular analysis, that analysis is very difficult to do. it. But then you have to go and present it to a way and then work with them to tweak the analysis and the way that they want. You need to be able to work with them such that they’re happy with what you’re delivering with them.

Chris: That kind of stuff I think is more common. I think that’s what most people do. It’s also, it’s different than just saying, hey, I’m just going to like know everything about machine learning and then I won’t need to care about any other topic.

Experience for a first job?

Hugo: And so for this second job, which I think is kind of the lion’s share of what we’re discussing today, there’s a chicken and egg problem in the sense that for a first time data scientist applying for their first job, how can they demonstrate that they have kind of this general array of skills? Would it be project based, like writing a blog themselves to demonstrate it. Because it seems like you need the experience in order to get the first job essentially.

Chris: Yeah, for us when someone applies, some of the best things that they can apply with are projects that they’ve done or something at like, say, a boot camp or maybe their dissertation research or something like that, where we can take a look and say, oh, cool, like you’ve done some interesting stuff, you’ve worked with some data, some interesting ways. And then for us, we have designed a take home that is very, without giving out any kind of secret to the take home, is very open. So there’s many, many ways. I mean, there’s an infinite amount of ways essentially that someone could solve it and how they solve it really says a lot about them.

Chris: But those kinds of skills of being able to demonstrate, hey, like, I can write software, like I can write unit tests, I can sit down and do bayesian and I’m totally comfortable with it. Here’s my blog posts around how this certain type of bayesian analysis works in the setting or there’s this really sweet data visualization that I did. That kind of stuff is like a nice demonstration. I don’t think it’s required, but I think if you come from say a field that isn’t known for making a bunch of quantitative people, and I come from political science, I think people don’t think about political science and think we’re all math nerds.

Chris: The more things that you can do around that where you can sort of demonstrate that you have that kind of experience is better because otherwise you don’t hit people’s biases around what a social scientist is. Because someone could say, oh, you do political science, you must really love Kant and Rousseau. And that’s all you talk about all day. And if that’s not true, like you need to demonstrate that. It’s probably not fair that you need to demonstrate that but that is it and you need to demonstrate that.

Chris: So things like boot camps, things like projects that you can run your own, blog posts are great because it’s really easy to access them, like you can just click, like someone has a link in their resume and then you click the link and you say, oh cool, this is like a really nice, I love the way that they’re thinking about this particular problem of feature importance in random forest or something like that. Is a really nice way and I think it is helpful.

Chris: We don’t require someone has side projects, but we are trying to do filtering of thousands of candidates. And so, it’s nice to find people who have, who can show you right up front that they have those kinds of skills and can do so in a way that is more than just a resume line because as someone who’s looked at a lot of resumes, everyone says that they do every skill under the sun.

Chris: Like everyone lists every skill. So everyone is, everyone says Python and SQL and R and machine learning and random forests. Everyone says every skill. So it’s not a very strong signal. But if you also have, say a blog posts about some nuance point about random forest or some nuance point of bayesian analyses, that is a real thing, that’s a costly signal to me that you actually do know that rather than just typing in the word in your resume. And it helps. I think it helps for me to get a handle on who that person is. It helps to guide interview process in the future that like other people do and I don’t do. I can sort of say hey, like here’s this article from this person’s blog, that probably will be a subject of an interview which is totally fine and probably helps the candidate a little bit because they can sort of stay in the area that they know.

Chris: Demonstrations of that are very helpful. Do I think everyone needs to do side projects left and right otherwise they’re a crappy data scientist? No, of course not. Like this isn’t, you don’t need to live and breathe data science to get a job in data science. But it is helpful to someone who’s hiring to sort of see the things that you’ve worked on and see the things that you’ve done, more than just adding that keyword to your resume.

Hugo: And I think the other thing that writing blogs demonstrates is the ability to take a project through to the end and actually do a write up. And on top of that, demonstrates communication skills, which of course incredibly important in this line of work.

Chris: We had one candidate who I really liked. They ended up taking another job somewhere else. But I really liked, and they sat down and they actually had a GitHub project that they were working on. I don’t think it was like a full Python package yet but they basically built an open source library, very close to one. I don’t think they had released it, but they had like a little open source library for some project and they had testing in there, and they had documentation and they had that object oriented Python in there.

Chris: And they were importing like relative importance of modules so they could do the tests and all that kind of stuff. And it was cool. Like you could look at that and be like, okay, I kind of, like this person has this level. This person has this this amount of knowledge because they’ve clearly written it in their own GitHub account and I can see it. It’s a nice signal for me. You could do that any way. There’s no particular way that I like absolutely want. But I think the resumes that send the least signal to me or other people who are hiring on Devoted are the ones that have just here’s the five skills that I have and then kind of don’t do any kind of explaining because there’s a lot of like cheap signals.

Chris: I could say that I do deep learning and then you’re like, oh, really, let’s talk about that, and it turns out that I did not know enough about deep learning.

Hugo: Right. And the other thing you mentioned in there, there are a couple of other points, which… GitHub repository I’m sure is incredibly helpful. And on top of that, you mentioned testing and a through line through this has been the importance of at least basic software engineering skills in terms of data science. I find that that’s something that’s missing with a lot of kind of early career data analysts and data scientists, whether it be on using debuggers or unit testing or versioning, these types of things, people need to work on a bit more.

Chris: I think that’s key. If there’s anything that goes through our data science team at Devoted when we are looking to hire someone or when we are working on our own projects is that the stuff that we work on are products, we are building full products for people. Whether they are only a few lines of code or, you know, say they’re something simpler, like moving some kind of data from a Google spreadsheet to redshift, just something really simple like that, or something more complicated, we are building full products, products for people and thus they need to have as much of that as we can have. We want testing in there. We want things to be linted We want things to follow some kind of object oriented notion or say functional Python with the doc strings in there that describe all the documents are doing, say, static typing, if we could do that, we’re seemingly on the verge of doing that, but not doing that yet.

Chris: But that kind of stuff, like we think of that as a software product. We make software products for people. Whether it’s actually like a reusable tool or analyses, that’s just what we do. I wish I thought about that more from the start and I think as data science matures, I think there’s definitely going to be this movement in the direction of data scientists sitting on software engineering teams and being a member of a software engineering team just with a certain specialty set of skills. And part of that means that you need to be able to work with them and write code that they can use and write code that they are fine to incorporate in their stuff. And that just means more software engineering skills.

Grad School

Hugo: For sure. And so, something that we’ve thrown around a bit is this idea of graduate school and dissertations. And a question I get a lot is from people who are thinking of going to grad school and they’re actually wondering whether to go to grad, if they want to work in data science, whether to go to grad school or whether to get like a data analyst job that they can get at that point and then try to progress into data science demonstrating that they have developed a bunch of analytical tools. Do you have any thoughts on that from your side of the hiring table?

Chris: Sure. I would never recommend someone get a PhD, I mean, masters might be different, I would never recommend someone get a PhD because they wanted to get a better job. Like a PhD is a long and difficult process.

Hugo: And you got to really want to do a PhD in order to do a PhD.

Chris: My PhD, like a huge amount of people quit, maybe like half or more than half of people quit along the way and got nothing. A PhD can be six years long. If you quit after year three, you don’t get anything. I think you may be they might give you a masters as like a sort of a consolation prize or something like that. But it is a very, very tough activity and it’s very, very hard and most people don’t complete it and people have a lot of stress around it and is difficult.

Hugo: And the final year in particular, I just remember my PhD, the final year was absolutely brutal. I don’t want to say I hated the work I was doing but there was certainly at the end a bunch of negative sentiments that involved the stress and the suffering, and also the sleeping under my desk for the final six months actually.

Chris: Oh, yeah. I hated my dissertation by the end. The phrase that I kept on saying over my mind is that the only good dissertation is a done dissertation. Just grinding through. But it is worth it because I got to spend five years studying a topic that I really, really cared about. And I got to go as, I mean, imagine being able to go, I mean, you don’t need to imagine.

Chris: But like, if you’re doing a PhD, imagine being given five years to go as deep into a topic as you could possibly go. Like there is no level of deepness that is okay. Keep on going deeper over and over and over again. And that is super interesting and super cool. It totally changes your thinking and it totally changes you as a person, and it’s not a good way to get, like as a stepping stone to getting another job. It really just isn’t. You could have much better spent that time doing other things that are directly related to getting an interview than going off on some crazy quest to study some amphibian in the whatever in the south Polynesian islands or something like that. Like there’s definitely better ways of just getting a pay raise and then more advanced stuff.

Hugo: So is it a viable option to enter as a data analyst and try to progress to data science?

Chris: I think there’s probably two tracks. One, if you can find the right place, you can absolutely go from data analyst into data science. I think it more, it’s about trying to find the place where there isn’t a firm division between the two, right? So like, if I joined a larger company as a data analyst, as a junior data analyst, I would try to find chances where I could do more software engineering stuff. I would start to work on more complicated project. I would try to see like, okay, cool, sure, I can make this this quick analysis but can I do it better using Bayesian or something like that. But you sort of have to be self motivated to gain more of those skills.

Chris: Another option is like a master’s degree which could be one or two years and can expose you to a lot of that kind of stuff in a relatively quick amount of time. And then you get out and then you can, you know, I think you can do like a step up. Like we don’t care about degrees at Devoted, there’s no requirement for some kind of degree. But there is lots of skills that people would learn say in a data science master’s program or any kind of quantitative master’s program, masters in mathematics or somebody that that they could really take advantage of and it can be a big step up for people to do that. I think it’s probably relatively cheap, I don’t really know exactly.

Chris: You can either do the self learning path which is, you have to be scrappy, you have to kind of find opportunities where they are and move up through that and probably have to switch jobs a few times because the people hired you as a basic data analyst and all of a sudden you’re doing Bayesian left, right and center and then you want to be paid like you’re doing Bayesian work all the time and then you go find another job that focuses on Bayesian and then you move up from there.

Chris: And then the other one is you know getting that master’s degree and then going back and trying to find another job after that, saying that you have, you’ve got this experience of that you know more about the formal training around some of this kind of stuff, and it doesn’t help you with a lot of business cases, but it does help you that you can say you understand the problems around Bayesian analyses or the problems around some kind of machine learning model then you can apply in a real way.

Hugo: And if someone comes to you and knows these types of things but doesn’t necessarily know how it works in the health space, I presume a good strategy there is to say, hey, I know these techniques, I don’t know a lot about health, but I would love to learn about this stuff, to demonstrate like a passion for the domain expertise essentially.

Chris: This is the first health insurance company I’ve ever worked for. And we do a lot of learning around that area where, so I remember I think my second day of Devoted, they were like, hey, you should come to this meeting and I came to the meeting. And it’s just four data scientists sitting in a room with a doctor explaining how medical coding works. So like, you know, like what is the code for someone who breaks their hip and how does that relate to the other code and how people are billed on that thing. It’s just doctors sitting around telling us how all these things work. And it was so educational and it was so new that it’s a big part of it.

Self Learning

Hugo: Absolutely. I really liked the way that this conversation has gone in terms of providing a variety of different paths from the machine learning to the first time data scientists demonstrating what they’ve done through project or through quantitative research through to the self learning approach. And for anyone out there who wants to take the self learning approach, I’ve heard that is an incredible place to learn. It was actually Chris who told me that before we started recording.

Chris: That’s it. I was just like DataCamp, you’ve got to use DataCamp. No one’s paying for this. I’m ambivalent. This isn’t sponsored.

Hugo: And this is not sponsored by Facebook either.

Chris: Or Bayesian analyses.

Hugo: Exactly. Yeah, Gelman. Gelman isn’t paying us. Hey, that’s a good idea actually. So, I’m wondering if there are any, like any advice, things that you love people doing when they come into interviews or that you think of the worst things that people could do. Just any general tidbits of advice for first time interviewees from your side of the table?

Chris: That’s a good question. I think for us, I’m trying to say us because I don’t want to say it’s just me that’s biased. Like that’ll be biased if I was just like, we hire as a team so it’s not just me obviously that’s hiring, but I’m the person who’s interviewing. I will say what I, what I tend to think rather than like us as a team because I don’t know what they are biased towards.

Chris: For me, I’m a big fan of people who are excited about learning new things and excited about data because I genuinely enjoy what I do. I enjoy learning a new technique, I enjoy sitting around with a new book around some kind of analysis or an old book around some analysis that I don’t know very well. I really, really enjoy that. And I think when you’re coming in as a junior person and you can sort of say, hey, I don’t know how a lot of this stuff works but I am really, really interested in what it is and I know where I want to be in five years.

Chris: And that involves a huge amount of learning. That is exactly what we want. Like we are not hiring you because we’re junior, we’re hiring you because we think that you could be senior with some training, with some mentorship, with some projects and that kind of stuff. Like we do not want you to be junior forever. We want you to be senior and like we acknowledge that the more we train you the more you might leave. And that’s totally okay with us. Like come join Devoted, be a junior person for a while, enjoy yourself, like learn a huge amount of stuff, really excited what you do and then and then go find a better job somewhere else. Is a completely reasonable like thing that we don’t have expectations that you don’t do that.

Chris: But it is, I want someone to be excited about it. That excitement can come out in various ways. So people who have lots of side projects, obviously, like that’s a signal that people used to be like I love it in my spare time. But other people don’t have a lot of spare time. That can be one signal but it’s not the only signal we care about. If you just come in and are really, really excited about it and I can tell that you have spent a lot of time thinking about it and you have a lot of knowledge around it. A lot of knowledge around it for what I would expect someone at your level to have, like, that’s a nice sign that you really care about this one thing.

Chris: You might not have knowledge around everything. But you might have this one thing where it’s like, I think random forests were super cool. I spent a lot of time reading about them. Like I don’t know deep learning, I don’t know software engineering, I don’t know anything except for like, I’m just like super interested in this one thing. Like that’s kind of cool. I think that that excitement bleeds off on to me, possibly onto other people who are interviewing.

Chris: But definitely, definitely for me, because if you can’t get the job based on your experience, which when you are more experienced, you could just get the job because you’re experienced, you kind of need another strategy. One of the strategies is just saying that you have the right attitude and you’re excited to do it and you can learn a lot and you could be a good member of the team that people want to work with and that’s a great way of doing it if you don’t have the 10 years under your belt or something.

Hugo: Sure. You need something that differentiates you, a differentiating factor.

Chris: I understand that one of the hard parts for junior data scientists is that their resumes often look a lot alike because one person went to a boot camp, someone else went to a different boot camp, someone else was a math major, someone else like did this mathematics project, someone else wrote one research paper. There’s a lot of like signals that are pretty much the same. And so, the way that people can distinguish themselves I think at least in my mind is having some enjoyment for what you do because we want you to enjoy it and therefore learn more at it, you know, learn more about doing it. You don’t need to burn the midnight oil to do it, you could totally work nine to five with the rest of us. But we want you to be interested in it.

Chris: And if we, every hire, now that I’m sitting on this side of the table, every hire is a bet. And for the junior person, the best bet that we could make is that we are hiring you and you don’t know everything we wish you knew. But, that in two years, you could know like a big portion of what we wish you knew as a senior person, right? Like, I mean, you wouldn’t be senior two years but you get my point, right? Like you could be so much more capable and so much more of a resource for the team. So we come in and hire you and you’re junior and then all of a sudden you’re mid level and you’re kicking butt and then we try to retain you because we want you to stay there because you’re so awesome.

Chris: That part is like a big thing.

Hugo: Yeah. And I suppose one thing I’d like to gauge your opinion on is the difference between what you learn self learning or even in research in terms of using Bayesian inference on machine learning, whatever it may be, the difference between the types of tools and techniques you use there, which may be importing CSVs and then doing machine learning in production in a company. And you may have all these things that you haven’t actually been exposed to before by doing Kaggle competitions, for example.

Chris: I’ll give you another example, like one of the things they is very hard to learn is data engineering. So data scientists are different than data engineers of course. But there’s a reason that there isn’t a bunch of data engineering boot camps because to do data engineering, you basically need a production system to learn that skill. Like you need to have millions of data points flowing around and all this kind of stuff for you to actually do stuff with that.

Chris: And so if you don’t learn it on the job, like there’s relatively few areas that you could learn that on your own. And I think there’s some parallels to data science, where there’s just some things that are hard to do as a self directed project. I would recommend that people do side projects or do learning projects that take advantage of those. So like, instead of say, instead of importing a CSV with the data, like load it into a database and then pull it down, and then pull it down once an hour or something like that, just to like get more experience with that.

Chris: These tools are either free or cheap. So if you have like a small amount of data, so for example, like there’s Amazon Athena, which is like a way of kind of search, like it’s basically like a way to do queries on big data. You can totally upload just a little bit of data to that and then you pay by query. So you could pay like 30 cents to do a query. And so you do like 30 queries and you kind of understand what’s happening and then you kind of use it for a project and then you drop that project and you never pay for it again. But you have that experience under your belt. And then so when someone comes to you and says, hey, I saw you have this AWS experience. I did this kind of project, we used Athena. I had to work out the security roles and then how I put it onto a server and how I, etc, etc, etc, and all that kind of stuff you can do because with things like AWS, you just pay for usage, it’s just you, you know, just have a small amount of data.

Chris: But you do have that experience, that is really useful. That’s a good signal that like, hey, you are really interested in the kind of stuff. If I release you on to our system that costs 10s of thousands of dollars a month, like you could sit down and do some really interesting stuff and you’d be interested in that. I think that’s great.

Chris: I would recommend that if you really wanted to stand out, you should try to identify the skills that are harder for people to get, because it’s very easy to download a CSV and then do some group by statements in Pandas or something like that. And that’s why there’s 50,000 tutorials that do that. And that’s why my homepage has a bunch of tutorials on doing that. And that is useful stuff. But to stand out if you had some kind of experience in something that was harder to set up and something that was harder to do, I think that’s a nice way of sort of saying like no, no, no, I’m super serious about this. Like I tried to do this thing that was more difficult. And it’s more in line with like what a real businesses would do just at a much, much smaller scale, but I’m still using the same technologies. I mean, that’s awesome. That’s great.

Favorite Data Science Technique

Hugo: So Chris, I want to ask you what one of your favorite data science techniques or methodologies is. But before that, you’ve said random forest enough. My first question is random forests or support vector machines?

Chris: Always random forest. Who uses support vector machines? I don’t think like I’ve ever, I cannot think of a single case of someone who has used a support vector machine in production. Support vector machines are mathematically cool. They’re just super cool. Like when you explain how they work, it’s just super interesting. It’s a very clever technique but random forests are like if you need to get something done, a random forest like just seemingly works out of the box in so many ways and so many cases very, very easily that it is definitely one of the best starting points. I think a random forest is definitely one of the best starting points for a lot of modeling. And then I would kind of go from there and decide if you wanted to make the random forest more complex or add more feature engineering or change your model up or something like that. But I would definitely… I’m a big random forest fan.

Hugo: For sure. I actually, I was riffing off, it was a couple of weeks ago, someone tweeted out, does anybody use support vector machine for anything. You just replied no.

Chris: Yeah, I can’t think of like, I don’t know who uses that. I cannot think of a single time that someone would choose a support vector machine over anything. I don’t see that that one moment where it’s like the killer model to use. Someone’s going to listen to this and send me some kind of article, they cured some kind of cancer using some kind of support vector machine. I cannot think of an example off the top of my head.

Chris: I always considered a support vector machine more as sort of a teaching function for like, hey, here’s one of the techniques that people try. It’s mathematically genius. I think it explains a lot of the concepts of, like a decision line between groups and that kind of stuff really well and what happens when people are on the wrong side of the… Like when a data point is on the wrong side of the decision line, all that kind of stuff, I think that’s totally totally useful. Also, I’ve never seen someone use it in production.

Call to Action

Hugo: So, do you have a final call to action for our listeners out there?

Chris: Sure. I mean, one, you should completely apply for the jobs at Devoted. This is not actually an attempt for me to get, this is not an attempt for me to get people to apply for jobs at Devoted but hey, we are an awesome company, come work with DJ Patil and myself and a number of great people. It’ll be fun, it’ll be great and I hope you apply.

Hugo: Incredible. We’ll include a link in the show notes as well too.

Chris: Awesome. Look at this, Finally, God, expense, I mean, I’m not really paying any money for it but I feel like I should expense, I don’t know, like a lunch because of this or something.

Hugo: Absolutely. Let’s do that.

Chris: This would be my other more general call to action I think for someone who’s interested in data science. Think very carefully around the skills that you’re trying to develop and the skills that other people who are applying are trying to develop. Sit down and think about ways that you can apply those skills in a way that both demonstrates that you have those skills and/or that you’re learning those skills or something like that, but has some kind of real impact in society around us.

Chris: There’s so much free data out there, there’s so many really, really, really interesting projects that you can do by yourself, on your own with basically no money that you can sit down and say, hey like, I’ve used random forest and this really strong way but I’ve also used it in a way that I detected some crazy cool thing, detected parking tickets in New York or detected like some kind of, I don’t know like some kind of thing around like police shootings or some kind of thing around crime or something, anything like that. Like, I think there’s so many really interesting things that you can do with data science. This is an applied field.

Chris: And so, learn things in data science, that’s a big part of it, but then also apply it in your own life and the projects that interest you. That is the ultimate thing that I want to talk about in every single interview is just the cool things that they’ve done and use data for. And I want to see that spark of excitement in the things that they’ve done because then I want to work with them. Because if they’re so excited about it, I’m excited about it and it makes you want to go to work every day.

Hugo: This has been a great conversation, Chris. So thank you so much for coming on the show.

Chris: Thank you for having me.

Hugo: Such a pleasure.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

The fastest cyclists of Europe live in …

Mon, 03/04/2019 - 17:39

(This article was first published on Stories by Sebastian Wolf on Medium, and kindly contributed to R-bloggers)

The fastest cyclists of Europe live in … Analyzing STRAVA data to find out which city has the faster cyclists with R and R-shiny. My contribution to the shiny contest. A shiny app visualizing STRAVA segments in London and Paris and seeing where people are faster

STRAVA is one of the most popular fitness tracking apps. It not only allows users to track their own activities but also following professionals. STRAVA made running and cycling a game for everybody. It split the world into segments. Each segment can be seen as a tiny race-track. STRAVA runs a leaderboard for each track. Now you will know how fast you are compared to others, without even running a real race. Just do it in a digital way. This feature was what I used to find out, which city in Europe has the fastest cyclists. Why all that? I wanted to participate in the R-Shiny Contest.

Before I start explaining how to find the fastest city, I have to say. There is no fastest city in Europe. The average speed of a city really depends on how you evaluate the city. London is really fast at a short distance, Paris on long distance, Munich people seem to be good climbers… Every one of those settings can be evaluated inside my app. So let’s start from the beginning.

Getting the data

I decided on four cities that I wanted to compare. The STRAVA API allows for 13,000 API calls a day, meaning 13,000 questions you can ask STRAVA on, for the non-programmers. This was my most limiting fact. I chose 4 cities, London, Paris, Berlin, and Munich (because I live here). Each city got split into squares of ~100m x 100m. For these squares, I called the API to get to know which STRAVA segments can be found here. In London, I found e.g. ~2000 segments by 9000 squares. For each of those segments, I called the API twice. Once for the leaderboard of the 30 fastest male cyclists and again for the leaderboard of the 30 fastest female cyclists. From these values, I calculated the average and median for male, female and both per segment. This is the data to start the calculations from.

Inside my app, at you can use this data to compare cities. The app will use a basic setting to compare London against Paris and London will win the race. But now you can start changing settings to see if London is really the faster city.

Compare cities in different ways

The settings window allows e.g. changing the length of STRAVA segments. You can see if London people are better sprinters by choosing 0.0 to 1.0 km as the segment length. Or check out if Munich people are good on long distance segments by choosing 5.0–20.0 km segment length.

Settings window inside the shiny app for cycle races

What also changes the settings a lot is whether you use average or median evaluation. The median tells you the speed of the 15th person inside your leaderboard assuming a distribution of your leaderboard. So actually, how fast will 50% of people on the leaderboard be indeed? While the average just takes the speed of each participant and divides it by the number of participants. A fast guy on place1 can move your board up front, which cannot happen on median calculations.

A great feature is the elevation factor. It adds a linear weight to the speed depending on how steep a segment is. The speed of a steeper segment gets increased by the formula: speed = elevation_factor * 0.1 * climb (in meters). This means the steeper your segment, the higher the speed is.

Last but not least you can run the whole calculation just for men or women.

All settings can be stored and shared using the share feature of the app.

Visual features

To compare the speed of the city I built in an animated bar plot. This one is based on the jQuery plugin simple skill bar. In R it has to be added as to be added as a specific output binding. Finally the speed of each city get’s animated:

To visualize the segments that are used for the evaluation, I implemented a leaflet map within the app. The map does not only show the segments. It also highlights the speed of a segment. Moreover, users can access the STRAVA website to see details of the segment:

If users would like to see if a city has maybe faster people on longer or shorter segments, it is not needed to change the settings. Additionally, graphs plot the calculated speed for different length:

Play on

I hope you can now see, that there is no fastest city. It really depends on how you set up the features of this app. Please start enjoying the app. Here are some last facts about the app and please upvote it on the Shiny Contest Page.

# of segments in the app 421,685

# of segments with average speed (leaderboard crawled) 22,589

Average speed per segment 28.76 km/h

Average speed per segment (men) 32.7 km/h

Average speed per segment (women) 23.49 km/h

Average length of a segment 2162.3 m

App Code at is free-available: zappingseb/stravachaser

App runs on:

Further readings on my apps

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Stories by Sebastian Wolf on Medium. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Using parameters in Rmarkdown

Mon, 03/04/2019 - 11:02

(This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

Nothing new or original here, just something that I learned about quite recently that may be useful for others.

One of my more “popular” code repositories, judging by Twitter, is – well, Twitter. It mostly contains Rmarkdown reports which summarise meetings and conferences by analysing usage of their associated Twitter hashtags.

The reports follow a common template where the major difference is simply the hashtag. So one way to create these reports is to use the previous one, edit to find/replace the old hashtag with the new one, and save a new file.

That works…but what if we could define the hashtag once, then reuse it programmatically anywhere in the document? Enter Rmarkdown parameters.

Here’s an example .Rmd file. It’s fairly straightforward: just include a params: section in the YAML header at the top and include variables as key-value pairs:

--- params: hashtag: "#amca19" max_n: 18000 timezone: "US/Eastern" title: "Twitter Coverage of `r params$hashtag`" author: "Neil Saunders" date: "`r Sys.time()`" output: github_document ---

Then, wherever you want to include the value for the variable named hashtag, simply use params$hashtag, as in the title shown here or in later code chunks.

```{r search-twitter} tweets <- search_tweets(params$hashtag, params$max_n) saveRDS(tweets, "tweets.rds") ```

That's it! There may still be some customisation and editing specific to each report, but parameters go a long way to minimising that work.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – What You're Doing Is Rather Desperate. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

2019-01 A Geometry Engine Interface for ‘grid’

Mon, 03/04/2019 - 02:49

(This article was first published on R – Stat Tech, and kindly contributed to R-bloggers)

This report describes a new function in ‘grid’ called grobCoords and a new package called ‘gridGeometry’ that combines grobCoords with the ‘polyclip’ package to provide a geometry engine interface for ‘grid’.

Paul Murrell


var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Stat Tech. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Polished human cognitive characteristics chapter

Mon, 03/04/2019 - 02:32

(This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers)

It has been just over two years since I release the first draft of the Human cognitive characteristics chapter of my evidence-based software engineering book. As new material was discovered, it got added where it seemed to belong (at the time), no effort was invested in maintaining any degree of coherence.

The plan was to find enough material to paint a coherence picture of the impact of human cognitive characteristics on software engineering. In practice, finishing the book in a reasonable time-frame requires that I stop looking for new material (assuming it exists), and go with what is currently available. There are a few datasets that have been promised, and having these would help fill some holes in the later sections.

The material has been reorganized into what is essentially a pass over what I think are the major issues, discussed via studies for which I have data (the rule of requiring data for a topic to be discussed, gets bent out of shape the most in this chapter), presented in almost a bullet point-like style. At least there are plenty of figures for people to look at, and they are in color.

I think the material will convince readers that human cognition is a crucial topic in software development; download the draft pdf.

Model building by cognitive psychologists is starting to become popular, with probabilistic languages, such as JAGS and Stan, becoming widely used. I was hoping to build models like this for software engineering tasks, but it would have taken too much time, and will have to wait until the book is done.

As always, if you know of any interesting software engineering data, please let me know.

Next, the cognitive capitalism chapter.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: The Shape of Code » R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

CRAN Mirror “Security”

Sun, 03/03/2019 - 21:22

(This article was first published on R –, and kindly contributed to R-bloggers)

In the “Changes on CRAN” section of the latest version of the The R Journal (Vol. 10/2, December 2018) had this short blurb entitled “CRAN mirror security”:

Currently, there are 100 official CRAN mirrors, 68 of which provide both secure downloads via ‘https’ and use secure mirroring from the CRAN master (via rsync through ssh tunnels). Since the R 3.4.0 release, chooseCRANmirror() offers these mirrors in preference to the others which are not fully secured (yet).

I would have linked to the R Journal section quoted above but I can’t because I’m blocked from accessing all resources at the IP address serving from my business-class internet connection likely due to me having a personal CRAN mirror (that was following the rules, which I also cannot link to since I can’t get to the site).

That word — “security” — is one of the most misunderstood and misused terms in modern times in many contexts. The context for the use here is cybersecurity and since CRAN (and others in the R community) seem to equate transport-layer uber-obfuscation with actual security/safety I thought it would be useful for R users in general to get a more complete picture of these so-called “secure” hosts. I also did this since I had to figure out another way to continue to have a CRAN mirror and needed to validate which nodes both supported + allowed mirroring and were at least somewhat trustworthy.

Unless there is something truly egregious in a given section I’m just going to present data with some commentary (I’m unamused abt being blocked so some commentary has an unusually sharp edge) and refrain from stating “X is |” since the goal is really to help you make the best decision of which mirror to use on your own.

The full Rproj supporting the snippets in this post (and including the data gathered by the post) can be found in my new R blog projects.

We’re going to need a few supporting packages so let’s get those out of the way:

library(xml2) library(httr) library(curl) library(stringi) library(urltools) library(ipinfo) # install.packages("ipinfo", repos = "") library(openssl) library(furrr) library(vershist) # install.packages("vershist", repos = "") library(ggalt) library(ggbeeswarm) library(hrbrthemes) library(tidyverse) What Is “Secure”?

As noted, CRAN folks seem to think encryption == security since the criteria for making that claim in the R Journal was transport-layer encryption for rsync (via ssh) mirroring from CRAN to a downstream mirror and a downstream mirror providing an https transport for shuffling package binaries and sources from said mirror to your local system(s). I find that equally as adorable as I do the rhetoric from the Let’s Encrypt cabal as this https gets you:

  • in theory protection from person-in-the-middle attacks that could otherwise fiddle with the package bits in transport
  • protection from your organization or ISP knowing what specific package you were grabbing; note that unless you’ve got a setup where your DNS requests are also encrypted the entity that controls your transport layer does indeed know exactly where you’re going.

and…that’s about it.

The soon-to-be-gone-and-formerly-green-in-most-browsers lock icon alone tells you nothing about the configuration of any site you’re connecting to and using rsync over ssh provides no assurance as to what else is on the CRAN mirror server(s), what else is using the mirror server(s), how many admins/users have shell access to those system(s) nor anything else about the cyber hygiene of those systems.

So, we’re going to look at (not necessarily in this order & non-exhaustively since this isn’t a penetration test and only lightweight introspection has been performed):

  • how many servers are involved in a given mirror URL
  • SSL certificate information including issuer, strength, and just how many other domains can use the cert
  • the actual server SSL transport configuration to see just how many CRAN mirrors have HIGH or CRITICAL SSL configuration issues
  • use (or lack thereof) HTTP “security” headers (I mean, the server is supposed to be “secure”, right?)
  • how much other “junk” is running on a given CRAN mirror (the more running services the greater the attack surface)

We’ll use R for most of this, too (I’m likely never going to rewrite longstanding SSL testers in/for R).

Let’s dig in.

Acquiring Most of the Metadata

It can take a little while to run some of the data gathering steps so the project repo includes the already-gathered data. But, we’ll show the work on the first bit of reconnaissance which involves:

  • Slurping the SSL certificate from the first server in each CRAN mirror entry (again, I can’t link to the mirror page because I literally can’t see CRAN or the main R site anymore)
  • Performing an HTTP HEAD request (to minimize server bandwidth & CPU usage) of the full CRAN mirror URL (we have to since load balancers or proxies could re-route us to a completely different server otherwise)
  • Getting an IP address for each CRAN mirror
  • Getting metadata about that IP address

This all done below:

if (!file.exists(here::here("data/mir-dat.rds"))) { mdoc <- xml2::read_xml(here::here("data/mirrors.html"), as_html = TRUE) xml_find_all(mdoc, ".//td/a[contains(@href, 'https')]") %>% xml_attr("href") %>% unique() -> ssl_mirrors plan(multiprocess) # safety first dl_cert <- possibly(openssl::download_ssl_cert, NULL) HEAD_ <- possibly(httr::HEAD, NULL) dig <- possibly(curl::nslookup, NULL) query_ip_ <- possibly(ipinfo::query_ip, NULL) ssl_mirrors %>% future_map(~{ host <- domain(.x) ip <- dig(host, TRUE) ip_info <- if (length(ip)) query_ip_(ip) else NULL list( host = host, cert = dl_cert(host), head = HEAD_(.x), ip = ip, ip_info = ip_info ) }) -> mir_dat saveRDS(mir_dat, here::here("data/mir-dat.rds")) } else { mir_dat <- readRDS(here::here("data/mir-dat.rds")) } # take a look str(mir_dat[1], 3) ## List of 1 ## $ :List of 5 ## ..$ host : chr "" ## ..$ cert :List of 4 ## .. ..$ :List of 8 ## .. ..$ :List of 8 ## .. ..$ :List of 8 ## .. ..$ :List of 8 ## ..$ head :List of 10 ## .. ..$ url : chr "" ## .. ..$ status_code: int 200 ## .. ..$ headers :List of 13 ## .. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list" ## .. ..$ all_headers:List of 1 ## .. ..$ cookies :'data.frame': 0 obs. of 7 variables: ## .. ..$ content : raw(0) ## .. ..$ date : POSIXct[1:1], format: "2018-11-29 09:41:27" ## .. ..$ times : Named num [1:6] 0 0.0507 0.0512 0.0666 0.0796 ... ## .. .. ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ... ## .. ..$ request :List of 7 ## .. .. ..- attr(*, "class")= chr "request" ## .. ..$ handle :Class 'curl_handle' ## .. ..- attr(*, "class")= chr "response" ## ..$ ip : chr "" ## ..$ ip_info:List of 8 ## .. ..$ ip : chr "" ## .. ..$ hostname: chr "" ## .. ..$ city : chr "Seattle" ## .. ..$ region : chr "Washington" ## .. ..$ country : chr "US" ## .. ..$ loc : chr "47.6348,-122.3450" ## .. ..$ postal : chr "98109" ## .. ..$ org : chr "AS16509, Inc."

Note that two sites failed to respond so they were excluded from all analyses.

A Gratuitous Map of “Secure” CRAN Servers

Since‘s API returns lat/lng geolocation information why not start with a map (since that’s going to be the kindest section of this post):

maps::map("world", ".", exact = FALSE, plot = FALSE, fill = TRUE) %>% fortify() %>% filter(region != "Antarctica") -> world map_chr(mir_dat, ~.x$ip_info$loc) %>% stri_split_fixed(pattern = ",", n = 2, simplify = TRUE) %>% = FALSE) %>% as_tibble() %>% mutate_all(list(as.numeric)) -> wheres_cran ggplot() + ggalt::geom_cartogram( data = world, map = world, aes(long, lat, map_id=region), color = ft_cols$gray, size = 0.125 ) + geom_point( data = wheres_cran, aes(V2, V1), size = 2, color = ft_cols$slate, fill = alpha(ft_cols$yellow, 3/4), shape = 21 ) + ggalt::coord_proj("+proj=wintri") + labs( x = NULL, y = NULL, title = "Geolocation of HTTPS-'enabled' CRAN Mirrors" ) + theme_ft_rc(grid="") + theme(axis.text = element_blank())

Shakesperian Security

What’s in a [Subject Alternative] name? That which we call a site secure. By using dozens of other names would smell as not really secure at all? —Hackmeyo & Pwndmeyet (II, ii, 1-2)

The average internet user likely has no idea that one SSL certificate can front a gazillion sites. I’m not just talking a wildcard cert (e.g. using * for all subdomains which I try not to do for many reasons), I’m talking dozens of subject alternative names. Let’s examine some data since an example is better than blathering:

# extract some of the gathered metadata into a data frame map_df(mir_dat, ~{ tibble( host = .x$host, s_issuer = .x$cert[[1]]$issuer %||% NA_character_, i_issuer = .x$cert[[2]]$issuer %||% NA_character_, algo = .x$cert[[1]]$algorithm %||% NA_character_, names = .x$cert[[1]]$alt_names %||% NA_character_, nm_ct = length(.x$cert[[1]]$alt_names), key_size = .x$cert[[1]]$pubkey$size %||% NA_integer_ ) }) -> certs certs <- filter(certs, complete.cases(certs)) count(certs, host, sort=TRUE) %>% ggplot() + geom_quasirandom( aes("", n), size = 2, color = ft_cols$slate, fill = alpha(ft_cols$yellow, 3/4), shape = 21 ) + scale_y_comma() + labs( x = NULL, y = "# Servers", title = "Distribution of the number of alt-names in CRAN mirror certificates" ) + theme_ft_rc(grid="Y")

Most only front a couple but there are some with a crazy amount of domains. We can look at a slice of

filter(certs, host == "") %>% select(names) %>% head(20) names

The project repo has some more examples and you can examine as many as you like.

For some CRAN mirrors the certificate is used all over the place at the hosting organization. That alone isn’t bad, but organizations are generally terrible at protecting the secrets associated with certificate generation (just look at how many Google/Apple app store apps are found monthly to be using absconded-with enterprise certs) and since each server with these uber-certs has copies of public & private bits users had better hope that mal-intentioned ne’er-do-wells do not get copies of them (making it easier to impersonate any one of those, especially if an attacker controls DNS).

This Berkeley uber-cert is also kinda cute since it mixes alt-names for dev, prod & qa systems across may different apps/projects (dev systems are notoriously maintained improperly in virtually every organization).

There are legitimate reasons and circumstances for wildcard certs and taking advantage of SANs. You can examine what other CRAN mirrors do and judge for yourself which ones are Doing It Kinda OK.

Size (and Algorithm) Matters

In some crazy twist of pleasant surprises most of the mirrors seem to do OK when it comes to the algorithm and key size used for the certificate(s):

distinct(certs, host, algo, key_size) %>% count(algo, key_size, sort=TRUE) algo key_size n sha256WithRSAEncryption 2048 59 sha256WithRSAEncryption 4096 13 ecdsa-with-SHA256 256 2 sha256WithRSAEncryption 256 1 sha256WithRSAEncryption 384 1 sha512WithRSAEncryption 2048 1 sha512WithRSAEncryption 4096 1

You can go to the mirror list and hit up SSL Labs Interactive Server Test (which has links to many ‘splainers) or use the ssllabs R package to get the grade of each site. I dig into the state of config and transport issues below but will suggest that you stick with sites with ecdsa certs or sha256 and higher numbers if you want a general, quick bit of guidance.

Where Do They Get All These Wonderful Certs?

Certs come from somewhere. You can self-generate play ones, setup your own internal/legit certificate authority and augment trust chains, or go to a bona-fide certificate authority to get a certificate.

Your browsers and operating systems have a built-in set of certificate authorities they trust and you can use ssllabs::get_root_certs() to see an up-to-date list of ones for Mozilla, Apple, Android, Java & Windows. In the age of Let’s Encrypt, certificates have almost no monetary value and virtually no integrity value so where they come from isn’t as important as it used to be, but it’s kinda fun to poke at it anyway:

distinct(certs, host, i_issuer) %>% count(i_issuer, sort = TRUE) %>% head(28) i_issuer n CN=DST Root CA X3,O=Digital Signature Trust Co. 20 CN=COMODO RSA Certification Authority,O=COMODO CA Limited,L=Salford,ST=Greater Manchester,C=GB 7 CN=DigiCert Assured ID Root CA,,O=DigiCert Inc,C=US 7 CN=DigiCert Global Root CA,,O=DigiCert Inc,C=US 6 CN=DigiCert High Assurance EV Root CA,,O=DigiCert Inc,C=US 6 CN=QuoVadis Root CA 2 G3,O=QuoVadis Limited,C=BM 5 CN=USERTrust RSA Certification Authority,O=The USERTRUST Network,L=Jersey City,ST=New Jersey,C=US 5 CN=GlobalSign Root CA,OU=Root CA,O=GlobalSign nv-sa,C=BE 4 CN=Trusted Root CA SHA256 G2,O=GlobalSign nv-sa,OU=Trusted Root,C=BE 3 CN=COMODO ECC Certification Authority,O=COMODO CA Limited,L=Salford,ST=Greater Manchester,C=GB 2 CN=DFN-Verein PCA Global – G01,OU=DFN-PKI,O=DFN-Verein,C=DE 2 OU=Security Communication RootCA2,O=SECOM Trust Systems CO.\,LTD.,C=JP 2 CN=AddTrust External CA Root,OU=AddTrust External TTP Network,O=AddTrust AB,C=SE 1 CN=Amazon Root CA 1,O=Amazon,C=US 1 CN=Baltimore CyberTrust Root,OU=CyberTrust,O=Baltimore,C=IE 1 CN=Certum Trusted Network CA,OU=Certum Certification Authority,O=Unizeto Technologies S.A.,C=PL 1 CN=DFN-Verein Certification Authority 2,OU=DFN-PKI,O=Verein zur Foerderung eines Deutschen Forschungsnetzes e. V.,C=DE 1 CN=Go Daddy Root Certificate Authority – G2,\, Inc.,L=Scottsdale,ST=Arizona,C=US 1 CN=InCommon RSA Server CA,OU=InCommon,O=Internet2,L=Ann Arbor,ST=MI,C=US 1 CN=QuoVadis Root CA 2,O=QuoVadis Limited,C=BM 1 CN=QuoVadis Root Certification Authority,OU=Root Certification Authority,O=QuoVadis Limited,C=BM 1

That first one is Let’s Encrypt, which is not unexpected since they’re free and super easy to setup/maintain (especially for phishing campaigns).

A “fun” exercise might be to Google/DDG around for historical compromises tied to these CAs (look in the subject ones too if you’re playing with the data at home) and see what, eh, issues they’ve had.

You might want to keep more of an eye on this whole “boring” CA bit, too, since some trust stores are noodling on the idea of trusting surveillance firms and you never know what Microsoft or Google is going to do to placate authoritarian regimes and allow into their trust stores.

At this point in the exercise you’ve got

  • how many domains a certificate fronts
  • certificate strength
  • certificate birthplace

to use when formulating your own decision on what CRAN mirror to use.

But, as noted, certificate breeding is not enough. Let’s dive into the next areas.

It’s In The Way That You Use It

You can’t just look at a cert to evaluate site security. Sure, you can spend 4 days and use the aforementioned ssllabs package to get the rating for each cert (well, if they’ve been cached then an API call won’t be an assessment so you can prime the cache with 4 other ppl in one day and then everyone else can use the cached values and not burn the rate limit) or go one-by-one in the SSL Labs test site, but we can also use a tool like to gather technical data via interactive protocol examination.

I’m being a bit harsh in this post, so fair’s fair and here are the plaintext results from my own run of for along with ones from Qualys:

As you can see in the detail pages, I am having an issue with the provider of my .is domain (severe limitation on DNS record counts and types) so I fail CAA checks because I literally can’t add an entry for it nor can I use a different nameserver. Feel encouraged to pick nits about that tho as that should provide sufficient impetus to take two weeks of IRL time and some USD to actually get it transferred (yay. international. domain. providers.)

The project repo has all the results from a weekend run on the CRAN mirrors. No special options were chosen for the runs.

list.files(here::here("data/ssl"), "json$", full.names = TRUE) %>% map_df(jsonlite::fromJSON) %>% as_tibble() -> ssl_tests # filter only fields we want to show and get them in order sev <- c("OK", "LOW", "MEDIUM", "HIGH", "WARN", "CRITICAL") group_by(ip) %>% count(severity) %>% ungroup() %>% complete(ip = unique(ip), severity = sev) %>% mutate(severity = factor(severity, levels = sev)) %>% # order left->right by severity arrange(ip) %>% mutate(ip = factor(ip, levels = rev(unique(ip)))) %>% # order alpha by mirror name so it's easier to ref ggplot(aes(severity, ip, fill=n)) + geom_tile(color = "#b2b2b2", size = 0.125) + scale_x_discrete(name = NULL, expand = c(0,0.1), position = "top") + scale_y_discrete(name = NULL, expand = c(0,0)) + viridis::scale_fill_viridis( name = "# Tests", option = "cividis", na.value = ft_cols$gray ) + labs( title = "CRAN Mirror SSL Test Summary Findings by Severity" ) + theme_ft_rc(grid="") + theme(axis.text.y = element_text(size = 8, family = "mono")) -> gg # We're going to move the title vs have too wide of a plot gb <- ggplot2::ggplotGrob(gg) gb$layout$l[gb$layout$name %in% "title"] <- 2 grid::grid.newpage() grid::grid.draw(gb)

Thankfully most SSL checks come back OK. Unfortunately, many do not:

filter(ssl_tests,severity == "HIGH") %>% count(id, sort = TRUE) id n BREACH 42 cipherlist_3DES_IDEA 37 cipher_order 34 RC4 16 cipher_negotiated 10 LOGJAM-common_primes 9 POODLE_SSL 6 SSLv3 6 cert_expiration_status 1 cert_notAfter 1 fallback_SCSV 1 LOGJAM 1 secure_client_renego 1 filter(ssl_tests,severity == "CRITICAL") %>% count(id, sort = TRUE) id n cipherlist_LOW 16 TLS1_1 5 CCS 2 cert_chain_of_trust 1 cipherlist_aNULL 1 cipherlist_EXPORT 1 DROWN 1 FREAK 1 ROBOT 1 SSLv2 1

Some CRAN mirror site admins aren’t keeping up with secure SSL configurations. If you’re not familiar with some of the acronyms here are a few (fairly layman-friendly) links:

You’d be hard-pressed to have me say that the presence of these is the end of the world (I mean, you’re trusting random servers to provide packages for you which may run in secure enclaves on production code, so how important can this really be?) but I also wouldn’t attach the word “secure” to any CRAN mirror with HIGH or CRITICAL SSL configuration weaknesses.

Getting Ahead[er] Of Myself

We did the httr::HEAD() request primarily to capture HTTP headers. And, we definitely got some!

map_df(mir_dat, ~{ if (length(.x$head$headers) == 0) return(NULL) host <- .x$host flatten_df(.x$head$headers) %>% gather(name, value) %>% mutate(host = host) }) -> hdrs count(hdrs, name, sort=TRUE) %>% head(nrow(.)) name n content-type 79 date 79 server 79 last-modified 72 content-length 67 accept-ranges 65 etag 65 content-encoding 38 connection 28 vary 28 strict-transport-security 13 x-frame-options 8 x-content-type-options 7 cache-control 4 expires 3 x-xss-protection 3 cf-ray 2 expect-ct 2 set-cookie 2 via 2 ms-author-via 1 pragma 1 referrer-policy 1 upgrade 1 x-amz-cf-id 1 x-cache 1 x-permitted-cross-domain 1 x-powered-by 1 x-robots-tag 1 x-tuna-mirror-id 1 x-ua-compatible 1

There are a handful of “security” headers that kinda matter so we’ll see how many “secure” CRAN mirrors use “security” headers:

c( "content-security-policy", "x-frame-options", "x-xss-protection", "x-content-type-options", "strict-transport-security", "referrer-policy" ) -> secure_headers count(hdrs, name, sort=TRUE) %>% filter(name %in% secure_headers) name n strict-transport-security 13 x-frame-options 8 x-content-type-options 7 x-xss-protection 3 referrer-policy 1

I’m honestly shocked any were in use but only a handful or two are using even one “security” header. uses all five of the above so good on ya Commonwealth Scientific and Industrial Research Organisation!

I keep putting the word “security” in quotes as R does nothing with these headers when you do an install.packages(). As a whole they’re important but mostly when it comes to your safety when browsing those CRAN mirrors.

I would have liked to have seen at least one with some Content-Security-Policy header, but a girl can at least dream.

Version Aversion

There’s another HTTP response header we can look at, the Server one which is generally there to help attackers figure out whether they should target you further for HTTP server and application attacks. No, I mean it! Back in the day when geeks rules the internets — and it wasn’t just a platform for cat pictures and pwnd IP cameras — things like the Server header were cool because it might help us create server-specific interactions and build cool stuff. Yes, modern day REST APIs are likely better in the long run but the naiveté of the silver age of the internet was definitely something special (and also led to the chaos we have now). But, I digress.

In theory, no HTTP server in it’s rightly configured digital mind would tell you what it’s running down to the version level, but most do. (Again, feel free to pick nits that I let the world know I run nginx…or do I). Assuming the CRAN mirrors haven’t been configured to deceive attackers and report what folks told them to report we can survey what they run behind the browser window:

filter(hdrs, name == "server") %>% separate( value, c("kind", "version"), sep="/", fill="right", extra="merge" ) -> svr count(svr, kind, sort=TRUE) kind n Apache 57 nginx 15 cloudflare 2 CSIRO 1 Hiawatha v10.8.4 1 High Performance 8bit Web Server 1 none 1 openresty 1

I really hope Cloudflare is donating bandwidth vs charging these mirror sites. They’ve likely benefitted greatly from the diverse FOSS projects many of these sites serve. (I hadn’t said anything bad about Cloudflare yet so I had to get one in before the end).

Lots run Apache (makes sense since CRAN-proper does too, not that I can validate that from home since I’m IP blocked…bitter much, hrbrmstr?) Many run nginx. CSIRO likely names their server that on purpose and hasn’t actually written their own web server. Hiawatha is, indeed, a valid web server. While there are also “high performance 8bit web servers” out there I’m willing to bet that’s a joke header value along with “none”. Finally, “openresty” is also a valid web server (it’s nginx++).

We’ll pick on Apache and nginx and see how current patch levels are. Not all return a version number but a good chunk do:

apache_httpd_version_history() %>% arrange(rls_date) %>% mutate( vers = factor(as.character(vers), levels = as.character(vers)) ) -> apa_all filter(svr, kind == "Apache") %>% filter(! %>% mutate(version = stri_replace_all_regex(version, " .*$", "")) %>% count(version) %>% separate(version, c("maj", "min", "pat"), sep="\\.", convert = TRUE, fill = "right") %>% mutate(pat = ifelse(, 1, pat)) %>% mutate(v = sprintf("%s.%s.%s", maj, min, pat)) %>% mutate(v = factor(v, levels = apa_all$vers)) %>% arrange(v) -> apa_vers filter(apa_all, vers %in% apa_vers$v) %>% arrange(rls_date) %>% group_by(rls_year) %>% slice(1) %>% ungroup() %>% arrange(rls_date) -> apa_yrs ggplot() + geom_blank( data = apa_vers, aes(v, n) ) + geom_segment( data = apa_yrs, aes(vers, 0, xend=vers, yend=Inf), linetype = "dotted", size = 0.25, color = "white" ) + geom_segment( data = apa_vers, aes(v, n, xend=v, yend=0), color = ft_cols$gray, size = 8 ) + geom_label( data = apa_yrs, aes(vers, Inf, label = rls_year), family = font_rc, color = "white", fill = "#262a31", size = 4, vjust = 1, hjust = 0, nudge_x = 0.01, label.size = 0 ) + scale_y_comma(limits = c(0, 15)) + labs( x = "Apache Version #", y = "# Servers", title = "CRAN Mirrors Apache Version History" ) + theme_ft_rc(grid="Y") + theme(axis.text.x = element_text(family = "mono", size = 8, color = "white"))


I’ll let you decide if a six-year-old version of Apache indicates how well a mirror site is run or not. Sure, mitigations could be in place but I see no statement of efficacy on any site so we’ll go with #lazyadmin.

But, it’s gotta be better with nginx, right? It’s all cool & modern!

nginx_version_history() %>% arrange(rls_date) %>% mutate( vers = factor(as.character(vers), levels = as.character(vers)) ) -> ngx_all filter(svr, kind == "nginx") %>% filter(! %>% mutate(version = stri_replace_all_regex(version, " .*$", "")) %>% count(version) %>% separate(version, c("maj", "min", "pat"), sep="\\.", convert = TRUE, fill = "right") %>% mutate(v = sprintf("%s.%s.%s", maj, min, pat)) %>% mutate(v = factor(v, levels = ngx_all$vers)) %>% arrange(v) -> ngx_vers filter(ngx_all, vers %in% ngx_vers$v) %>% arrange(rls_date) %>% group_by(rls_year) %>% slice(1) %>% ungroup() %>% arrange(rls_date) -> ngx_yrs ggplot() + geom_blank( data = ngx_vers, aes(v, n) ) + geom_segment( data = ngx_yrs, aes(vers, 0, xend=vers, yend=Inf), linetype = "dotted", size = 0.25, color = "white" ) + geom_segment( data = ngx_vers, aes(v, n, xend=v, yend=0), color = ft_cols$gray, size = 8 ) + geom_label( data = ngx_yrs, aes(vers, Inf, label = rls_year), family = font_rc, color = "white", fill = "#262a31", size = 4, vjust = 1, hjust = 0, nudge_x = 0.01, label.size = 0 ) + scale_y_comma(limits = c(0, 15)) + labs( x = "nginx Version #", y = "# Servers", title = "CRAN Mirrors nginx Version History" ) + theme_ft_rc(grid="Y") + theme(axis.text.x = element_text(family = "mono", color = "white"))

I will at close out this penultimate section with a “thank you!” to the admins at Georg-August-Universität Göttingen and Yamagata University for keeping up with web server patches.

You Made It This Far

If I had known you’d read to the nigh bitter end I would have made cookies. You’ll have to just accept the ones the blog gives your browser (those ones taste taste pretty bland tho).

The last lightweight element we’ll look at is “what else do these ‘secure’ CRAN mirrors run”?

To do this, we’ll turn to Rapid7 OpenData and look at what else is running on the IP addresses used by these CRAN mirrors. We already know some certs are promiscuous, so what about the servers themselves?

cran_mirror_other_things <- readRDS(here::here("data/cran-mirror-other-things.rds")) # "top" 20 distinct(cran_mirror_other_things, ip, port) %>% count(ip, sort = TRUE) %>% head(20) ip n 8 7 5 5 5 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4

Four isn’t bad since we kinda expect at least 80, 443 and 21 (FTP) to be running. We’ll take those away and look at the distribution:

distinct(cran_mirror_other_things, ip, port) %>% filter(!(port %in% c(21, 80, 443))) %>% count(ip) %>% count(n) %>% mutate(n = factor(n)) %>% ggplot() + geom_segment( aes(n, nn, xend = n, yend = 0), size = 10, color = ft_cols$gray ) + scale_y_comma() + labs( x = "Total number of running services", y = "# hosts", title = "How many other services do CRAN mirrors run?", subtitle = "NOTE: Not counting 80/443/21" ) + theme_ft_rc(grid="Y")

So, what are these other ports?

distinct(cran_mirror_other_things, ip, port) %>% count(port, sort=TRUE) port n 80 75 443 75 21 29 22 18 8080 6 25 5 53 2 2082 2 2086 2 8000 2 8008 2 8443 2 111 1 465 1 587 1 993 1 995 1 2083 1 2087 1

22 is SSH, 53 is DNS, 8000/8008/8080/8553 are web high ports usually associated with admin or API endpoints and generally a bad sign when exposed externally (especially on a “secure” mirror server). 25/465/587/993/995 all deal with mail sending and reading (not exactly a great service to have on a “secure” mirror server). I didn’t poke too hard but 208[2367] tend to be cPanel admin ports and those being internet-accessible is also not great.

Port 111 is sunrpc and is a really bad thing to expose to the internet or to run at all. But, the server is a “secure” CRAN mirror, so perhaps everything is fine.


While I hope this posts informs, I’ve worked in cybersecurity for ages and — as a result — don’t really expect anything to change. Tomorrow, I’ll still be blocked from the main CRAN & site despite having better “security” than the vast majority of these “secure” CRAN mirrors (and was following the rules). Also CRAN mirror settings tend to be fairly invisible since most modern R users use the RStudio default (which is really not a bad choice from any “security” analysis angle), choose the first item in the mirror-chooser (Russian roulette!), or live with the setting in the site-wide Rprofile anyway (org-wide risk acceptance/”blame the admin”).

Since I only stated it way back up top (WordPress says this is ~3,900 words but much of that is [I think] code) you can get the full R project for this and examine the data yourself. There is a bit more data and code in the project since I also looked up the IP addresses in Rapid7’s FDNS OpenData study set to really see how many domains point to a particular CRAN mirror but really didn’t want to drag the post on any further.

Now, where did I put those Python 3 & Julia Jupyter notebooks…

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Run Remote R Scripts with Mobile Device using E-mail Triggers

Sun, 03/03/2019 - 05:20

(This article was first published on R – R-BAR, and kindly contributed to R-bloggers)

Have you ever been on the road and wished you could run an R script from your mobile device and see the results? Maybe you’re a business person who needs a quick update on a project or production schedule. Or, possibly you need an up-to-the-minute report out for a meeting, and you don’t have a “cloud based solution to get you the important data. In short, there is data you want and can’t get because you’re not onsite to run an R script. In this post, I’ll show you how to resolve this issue using headless R and E-mail triggers.

Objectives: Run an R script from a mobile device

  1. Review the Overall Process
  2. Checkout a DEMO Scenario to Test our Process
  3. Load DEMO Data into R
  4. E-mail Processed DEMO Results to our Mobile from R
  5. Setup a Batch File and Headless R
  6. Create an E-mail Trigger to Run the Batch File
    1. Outlook
    2. ThunderBird
The General Idea

This is a big article, so here is quick summary of what we are going to do to get our data on demand on our mobile. First, let's pretend your mobile e-mail address is From your mobile, you send an E-mail to your desktop machine with the subject "Hey R – Send Update”. Your ever vigilant desktop e-mail client receives the e-mail, and runs a “rule” you setup before leaving the office. That rule will launch an application. For us, that application will be a simple batch file or shell script. The batch file will do two things i) launch headless R and ii) call the R script you want to run. When the R script runs, it will get the data from the target data source, process the data, and e-mail the result back to your mobile device. Here is a little flow chart that sums it up:

Process Benefits

  • You get the data on demand. No script spamming your email account.
  • Headless R only runs when it’s called.

Things to Consider

  • Your desktop e-mail client needs to be open and receiving e-mails
  • There maybe some security considerations depending on how you implement this. Talk with IT.

Great! If you've got the idea, let's get into the details.

E-mails from the Mobile Device

If you know how to send an e-mail from your mobile device, you are off to a great start. You'll need to pick something unique for your subject line like “Hey R – Send Update”, “Hey R – Run Report”, or something of that sort. If you're not sure how to send an email from your mobile device, perhaps you'll find these texts useful:

The Remote Data Source

The remote data source is the data you'd like access on the road but can't using your mobile device. Here are a few important notes about the remote data source:

  1. It needs to be accessible with an R script on your desktop and
  2. the machine running that R script needs a functioning e-mail client like Outlook or ThunderBird.

To keep us all on the same page for this post, I've created a demo data source which can be found here. We are going to pretend that our mobile devices cannot access this. Looking at the data source, you'll see it's a simple HTML page showing the current production status of some widget. There are two tables in the data source, the “Product Overview” and the “Work In Progress”. Both of these tables update throughout the day as work gets done. If HTML iframes are working, you should see the contents of the data source below. If not, check the data source out here.

Summary of the Table Items Product Overview Today’s Builds: These are the products in queue for today
Shipping Today’s Builds: The products that will be complete by the shipping deadline at 5pm EST.
Shipping Yesterday’s Builds:Products that missed yesterday’s shipping deadline and will ship today.
Potentially Shipping Tomorrow: Today’s builds that aren’t in the shipping queue yet.
Tomorrow’s Orders: What have customers ordered for tomorrow. Work In Progress This table has two rows. The “Building Product” row tells you which product ID is currently being built and how much time remains. If a product is in the packaging process, it will list its product ID and how much time remains. Remember: We are pretending that we can see this data with a desktop device but not our mobile device. Pre-Processing the Data for Our Mobile Report

Great, we have a data source and we are pretending that it can't be directly accessed with our mobile device. But, we can process it with our desktops using R. So, suppose we are on the road, and we really need to know the following: As of right now, how much product will be shipping today and how much time is there before the shipping deadline at the New York site?

Using our desktops, we generate the following code which i) scrapes the data from the data source, ii) sums the appropriate data, and iii) makes a short text summary:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 # Parse the HTML require(rvest) #package to parse HTML page <- read_html("") #Get the Page tbls <- html_nodes(page, "table") #Get the Tables Product_Status <- html_table(tbls[1])[[1]] #Get First Table   #Detemine how much product is shipping today as of right now ToShip <- grep(pattern = "Shipping Yesterday|Shipping Today", x = Product_Status$`Product Metrics`) Shipping_Product <- sum(Product_Status$Count[ToShip])   #Detemine how many minute before the shipping dead line MinLeft <- gsub(pattern = '\\D', replacement = "" , Product_Status$`Product Metrics`)[[2]]   #The summary message to be e-mailed email.msg <- paste0(Shipping_Product, " units of product shipping as of ", date(), ". There is ", MinLeft, " minutes remaining before today's shipping deadline.")

The results should look something like:

28 units of product shipping as of Fri Mar 01 16:55:23 2019. There is 5 minutes remaining before today’s shipping deadline.

E-mailing the Data from R to our Mobile Device

Awesome, we have the data in R and we generated the message we want to send. Now, we need to send it to our mobile device using R. To make this work, you will need to have the following information on hand:

  • SMTP server for outgoing e-mail
  • The SMTP server port
  • Username
  • Password

If you are not sure about these items, check with IT or your internet service provider (ISP). Using this information and R's mailR package, we can generate some code to send our message from the previous section to our mobile device via e-mail. Here is some example code.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 #Email Trigger when "Hey R - Send Update" library(mailR) sender <- "" #The email address R will use to send recipients <- c("") #your mobile email here send.mail(from = sender, to = recipients, subject = "Requested Status Report", body = email.msg, #your summary message smtp = list( = "your.smtp.server",#check with IT or ISP port = 587, #port number (IT or ISP) = "your.username", #likely your email address passwd = "your.password", #email password ssl = TRUE #if using secure connection ), authenticate = TRUE, send = TRUE)

You will need to update the sever and account information as appropriate. If everything is working you should be able to see the results using your mobile after tuning this code. If it doesn't work, check your server details and email addresses.

Batch Files and Headless R

If you were able to receive the e-mail on your mobile device using R's mailR package discussed in the previous section, you are ready for this next step. In this section, we are going to do two things:

  1. Show how to start R in headless mode
  2. Execute a R script in headless mode using a batch file.
Running R in Headless Mode

To run R in headless mode, you need to locate a file called Rscript.exe. To do this, go to your “program files” folder. Find the R sub-folder. Locate the current version of R. Open up the bin folder. For me, the full path looks something like this:

C:\Program Files\R\R-3.5.1\bin

Within this folder, you will find a file called Rscript.exe. This is the file you will need to run R in headless mode. So, for me the full path to this executable is C:\Program Files\R\R-3.5.1\bin\Rscript.exe

Run an R script in Headless Mode

The Rscript we are going to run headless is our E-mail script described above. Here is the complete script if you haven't saved it already:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 # Parse the HTML require(rvest) #package to parse HTML page <- read_html("") #Get the Page tbls <- html_nodes(page, "table") #Get the Tables Product_Status <- html_table(tbls[1])[[1]] #Get First Table   #Detemine how much product is shipping today as of right now ToShip <- grep(pattern = "Shipping Yesterday|Shipping Today", x = Product_Status$`Product Metrics`) Shipping_Product <- sum(Product_Status$Count[ToShip])   #Detemine how many minute before the shipping dead line MinLeft <- gsub(pattern = '\\D', replacement = "" , Product_Status$`Product Metrics`)[[2]]   #The summary message to be e-mailed email.msg <- paste0(Shipping_Product, " units of product shipping as of ", date(), ". There is ", MinLeft, " minutes remaining before today's shipping deadline.")   #Email Trigger when "Hey R - Send Update" library(mailR) sender <- "" #The email address R will use to send recipients <- c("") #your mobile email here send.mail(from = sender, to = recipients, subject = "Requested Status Report", body = email.msg, #your summary message smtp = list( = "your.smtp.server",#check with IT or ISP port = 587, #port number (IT or ISP) = "your.username", #likely your email address passwd = "your.password", #email password ssl = TRUE #if using secure connection ), authenticate = TRUE, send = TRUE)

Make sure you have updated the email addresses, SMTP sever, port, username and password details appropriately (lines 25,26, 31:33). Save this script to a file called emailTriggers.R in a folder called “trigger” at c:. You will need to create the trigger folder. After you have saved the file, make sure you can find it at c:\Trigger\emailTriggers.R .

Awesome! From here, we are going to make the batch file. If you are not familiar with batch files, they are little shell scripts for the windows operating system. I first encountered them before windows was a thing… back in the DOS days. I digress…

All we are going to do with our batch file is call headless R and tell it to run our emailTrigger.R script. Making a batch file is simple. Open up notepad or notepad++ and copy the following code:

C:”Program Files"\R\R-3.5.1\bin\Rscript.exe C:\Trigger\emailTrigger.r

Simple, right? The first part calls headless R; the second part calls the R script. Now, save the batch file in the c:\trigger folder, and call it trigger.bat . Once you are done, the contents of c:\Trigger should look like this:

Great. Now execute the trigger.bat file. To execute the file, you can use the window's file explorer or the command prompt. If you used the file explorer. Go to c:\trigger and double click on the trigger.bat file. If the script runs successfully, you should get an e-mail with the most current results from the DEMO site. Make sure this works before moving onto the next section.

The Email Rules and Triggers

This section assumes you were able to run the batch file and get the newest DEMO site results sent to your mobile device's e-mail. If so, you are at the last step. On your desktop with the trigger.bat file, you need to have either Outlook or ThunderBird installed. Using these email clients, we are going to make a rule that does the following: When an e-mail comes in with the subject “Hey R – Send Update”, and is from launch the trigger.bat file you created in the previous section.


I'm assuming you already have Outlook installed and your desktop e-mail setup. If this is true, we want to make a new rule. In the newer versions of Outlook, you can't make a rule to launch an application without making a minor change to the window's registry. Here is what you need to do to make the change:

Edit Registry for Outlook Application Launch
  1. Press the Windows Key + R: This will open the run dialogue box
  2. In the “open” text box type: regedit
  3. You’ll get an “are you sure you want to use this application” warning from windows. Press OK…
  4. The registry editor should look like this:

Cool we got the registry editor open. Here are the changes we need to make

  1. In the left-hand panel drill down to:
  2. Make a new folder (key) called “Security” by right clicking on the Outlook folder and selecting “New”→“Key”
  3. Enter the new “Security” folder. Full path should now be:
  4. Create a new DWORD(32bit) value in the right panel by right clicking, call it “EnableUnsafeClientMailRules”
  5. Double click on EnableUnsafeClientMailRules and set the data value to 1.
  6. Should look like this when you are done:

All done.

Outlook Email Rules

With the registry updated. Restart Outlook. Now let's make the rule to launch the trigger.bat file when we receive an e-mail from with the subject line “Hey R – Send Update”

  1. Send an email from your mobile device to your desktop with the subject line “Hey R – Send Update”
  2. Find the e-mail you sent in Outlook
  3. Right click on the message and go to Rules→Create Rule
  4. Select advanced from the Create Rule Dialogue Box
  5. First screen – Set rule conditions
    1. Select the “From [your info]” and
    2. “With Hey R – Send Update in the subject”
    3. Press Next
  6. Second screen – set the action if the previous conditions were met
    1. In the Step 1: Select action box, select “start application”. This option won’t be available if you haven’t made the windows registry change discussed above.
    2. In the Step 2: Edit the Rule Description… Click the word “application”
    3. An “open file” dialog box will appear. Find the file c:\Trigger\trigger.bat (to see the file, you will need to change the select box from “executable files (*.exe)” to “All files (*.*)” ) Click the file and press “open”.
    4. Press next to move to the third screen
  7. Third screen – set exceptions to the rules. We aren't going to worry about this right now.
    1. Press Next
  8. Last Screen – Finish Rule

    1. Step 1: Give your Rule a name
    2. Step 2: Make sure the rule is turned on
    3. Press Finish

Great the rule is all setup. Now send an e-mail to your desktop with the subject line “Hey R – Send Update” from your mobile device to make sure the batch file gets launched. On the desktop, you should see a command prompt open briefly with some R output. If your mobile gets an e-mail back with the result from R, your setup is working. Nice job.


I'm assuming you already have ThunderBird installed and your desktop e-mail setup. If this is true, we need to install a plugin called FiltaQuilla, so we can make some e-mail rules. To install the plugin, open ThunderBird and update your mail box as necessary. Press the ALT key on your keyboard. At the top of the ThunderBird window, the program menu should appear. Click on “Tools→Add-ons ”. This will open a new tab within ThunderBird called the “Add-ons Manager”. You will want to find a search box… somewhere. To find one, I had to click on one of the many “See all” links scattered across the page. Clicking one of the “See all” links will open another tab called “Up & Coming Extensions”. Next, find the search for add-ons text box. Search for FiltaQuilla and install it. When you are done, you'll need to restart ThunderBird.

After restarting, press the ALT key on your keyboard to bring up the program menu. Then, click on tools, and select “Message Filters”. The following window will pop up:

Press the “new”“ button and fill in the rule details as shown:

As shown above, when a new message comes in the subject will be checked for the text "Hey R – Send Update” and the sender's e-mail will be checked for the appropriate senders address which will be your mobile e-mail address. If these items match, launch the batch file we made in the previous section. This will launch headless R and your script which will use the mailR package to send your mobile the most up-to-date results on the DEMO site.

After pressing OK, send an e-mail to your desktop with the subject line “Hey R – Send Update” from your mobile device to make sure the batch file gets launched. You should see a command prompt open briefly with some R output. If you mobile gets an e-mail back with the result, your setup is working.


Sometimes we need data from a home or office data source, and we can't access it directly with our mobile devices. In this post, you learned how to get at this data by launching an R script on your desktop using an e-mail trigger from your mobile device. Now, if you need to get to that data. All you need to do is follow three easy steps

  1. Write an R script on the home or office desktop that grabs the data and e-mails the results to your mobile device.
  2. Setup a batch file to have headless R call your script.
  3. Setup an e-mail rule to launch your batch file when it receives an e-mail with the subject “Hey R – Send Update”
Other Articles

WooCommerce Image Gallery | Step by Step, Automate with R
XmR Chart | Step-by-Step Guide by Hand and with R
Windows Clipboard Access with R
Source and List: Organizing R Shiny Apps

The post Run Remote R Scripts with Mobile Device using E-mail Triggers appeared first on R-BAR.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – R-BAR. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit

Sun, 03/03/2019 - 01:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

Can I get enough of historical newspapers data? Seems like I don’t. I already wrote four
3 and
4) blog posts, but
there’s still a lot to explore. This blog post uses a new batch of data announced on twitter:

For all who love to analyse text, the BnL released half a million of processed newspaper articles. Historical news from 1841-1878. They directly contain the full text as well as other metadata. It’s really easy to work with! #digitsation #historicalnews

— Marschall Ralph (@RalphMarschall) March 1, 2019

and this data could not have arrived at a better moment, since something else got announced via Twitter

RVowpalWabbit 0.0.13: Keeping CRAN happy
R Interface to the 'Vowpal Wabbit' fast out-of-core learner #rcpp

— Dirk Eddelbuettel (@eddelbuettel) February 22, 2019

I wanted to try using Vowpal Wabbit
for a couple of weeks now because it seems to be the perfect
tool for when you’re dealing with what I call big-ish data: data that is not big data, and might
fit in your RAM, but is still a PITA to deal with. It can be data that is large enough to take 30
seconds to be imported into R, and then every operation on it lasts for minutes, and estimating/training
a model on it might eat up all your RAM. Vowpal Wabbit avoids all this because it’s an online-learning
system. Vowpal Wabbit is capable of training a model with data that it sees on the fly, which means
VW can be used for real-time machine learning, but also for when the training data is very large.
Each row of the data gets streamed into VW which updates the estimated parameters of the model
(or weights) in real time. So no need to first import all the data into R!

The goal of this blog post is to get started with VW, and build a very simple logistic model
to classify documents using the historical newspapers data from the National Library of Luxembourg,
which you can download here (scroll down and
download the Text Analysis Pack). The goal is not to build the best model, but a model. Several
steps are needed for this: prepare the data, install VW and train a model using {RVowpalWabbit}.

Step 1: Preparing the data

The data is in a neat .xml format, and extracting what I need will be easy. However, the input
format for VW is a bit unusual; it resembles .psv files (Pipe Separated Values) but
allows for more flexibility. I will not dwell much into it, but for our purposes, the file must
look like this:

1 | this is the first observation, which in our case will be free text 2 | this is another observation, its label, or class, equals 2 4 | this is another observation, of class 4

The first column, before the “|” is the target class we want to predict, and the second column
contains free text.

The raw data looks like this:

Click if you want to see the raw data

2019-02-28T11:13:01 digitool-publish:3026998-DTL45 2019-02-28T11:13:01Z newspaper/indeplux/1871-12-29_01 L'indépendance luxembourgeoise issue:newspaper/indeplux/1871-12-29_01/article:DTL45 1871-12-29 Jean Joris 3026998|issue:3026998|article:DTL45 CONSEIL COMMUNAL de la ville de Luxembourg. Séance du 23 décembre 1871. (Suite.) Art. 6. Glacière communale. M. le Bourgmcstr ¦ . Le collège échevinal propose un autro mode de se procurer de la glace. Nous avons dépensé 250 fr. cha- que année pour distribuer 30 kilos do glace; c’est une trop forte somme pour un résultat si minime. Nous aurions voulu nous aboucher avec des fabricants de bière ou autres industriels qui nous auraient fourni de la glace en cas de besoin. L’architecte qui été chargé de passer un contrat, a été trouver des négociants, mais ses démarches n’ont pas abouti. CONSEIL COMMUNAL de la ville de Luxembourg. Séance du 23 décembre 1871. (Suite.) ARTICLE fr 863

I need several things from this file:

  • The title of the newspaper: L'indépendance luxembourgeoise
  • The type of the article: ARTICLE. Can be Article, Advertisement, Issue, Section or Other.
  • The contents: CONSEIL COMMUNAL de la ville de Luxembourg. Séance du ....

I will only focus on newspapers in French, even though newspapers in German also had articles in French.
This is because the tag fr is not always available. If it were, I could
simply look for it and extract all the content in French easily, but unfortunately this is not the case.

First of all, let’s get the data into R:

library("tidyverse") library("xml2") library("furrr") files <- list.files(path = "export01-newspapers1841-1878/", all.files = TRUE, recursive = TRUE)

This results in a character vector with the path to all the files:

head(files) [1] "000/1400000/1400000-ADVERTISEMENT-DTL78.xml" "000/1400000/1400000-ADVERTISEMENT-DTL79.xml" [3] "000/1400000/1400000-ADVERTISEMENT-DTL80.xml" "000/1400000/1400000-ADVERTISEMENT-DTL81.xml" [5] "000/1400000/1400000-MODSMD_ARTICLE1-DTL34.xml" "000/1400000/1400000-MODSMD_ARTICLE2-DTL35.xml"

Now I write a function that does the needed data preparation steps. I describe what the function
does in the comments inside:

to_vw <- function(xml_file){ # read in the xml file file <- read_xml(paste0("export01-newspapers1841-1878/", xml_file)) # Get the newspaper newspaper <- xml_find_all(file, ".//dcterms:isPartOf") %>% xml_text() # Only keep the newspapers written in French if(!(newspaper %in% c("L'UNION.", "L'indépendance luxembourgeoise", "COURRIER DU GRAND-DUCHÉ DE LUXEMBOURG.", "JOURNAL DE LUXEMBOURG.", "L'AVENIR", "L’Arlequin", "La Gazette du Grand-Duché de Luxembourg", "L'AVENIR DE LUXEMBOURG", "L'AVENIR DU GRAND-DUCHE DE LUXEMBOURG.", "L'AVENIR DU GRAND-DUCHÉ DE LUXEMBOURG.", "Le gratis luxembourgeois", "Luxemburger Zeitung – Journal de Luxembourg", "Recueil des mémoires et des travaux publiés par la Société de Botanique du Grand-Duché de Luxembourg"))){ return(NULL) } else { # Get the type of the content. Can be article, advert, issue, section or other type <- xml_find_all(file, ".//dc:type") %>% xml_text() type <- case_when(type == "ARTICLE" ~ "1", type == "ADVERTISEMENT" ~ "2", type == "ISSUE" ~ "3", type == "SECTION" ~ "4", TRUE ~ "5" ) # Get the content itself. Only keep alphanumeric characters, and remove any line returns or # carriage returns description <- xml_find_all(file, ".//dc:description") %>% xml_text() %>% str_replace_all(pattern = "[^[:alnum:][:space:]]", "") %>% str_to_lower() %>% str_replace_all("\r?\n|\r|\n", " ") # Return the final object: one line that looks like this # 1 | bla bla paste(type, "|", description) } }

I can now run this code to parse all the files, and I do so in parallel, thanks to the {furrr} package:

plan(multiprocess, workers = 12) text_fr <- files %>% future_map(to_vw) text_fr <- text_fr %>% discard(is.null) write_lines(text_fr, "text_fr.txt") Step 2: Install Vowpal Wabbit

To easiest way to install VW must be using Anaconda, and more specifically the conda package manager.
Anaconda is a Python (and R) distribution for scientific computing and it comes with a package manager
called conda which makes installing Python (or R) packages very easy. While VW is a standalone
piece of software, it can also be installed by conda or pip. Instead of installing the full Anaconda distribution,
you can install Miniconda, which only comes with the bare minimum: a Python executable and the
conda package manager. You can find Miniconda here
and once it’s installed, you can install VW with:

conda install -c gwerbin vowpal-wabbit

It is also possible to install VW with pip, as detailed here,
but in my experience, managing Python packages with pip is not super. It is better to manage your
Python distribution through conda, because it creates environments in your home folder which are
independent of the system’s Python installation, which is often out-of-date.

Step 3: Building a model

Vowpal Wabbit can be used from the command line, but there are interfaces for Python and since a
few weeks, for R. The R interface is quite crude for now, as it’s still in very early stages. I’m
sure it will evolve, and perhaps a Vowpal Wabbit engine will be added to {parsnip}, which would
make modeling with VW really easy.

For now, let’s only use 10000 lines for prototyping purposes before running the model on the whole file. Because
the data is quite large, I do not want to import it into R. So I use command line tools to manipulate
this data directly from my hard drive:

# Prepare data system2("shuf", args = "-n 10000 text_fr.txt > small.txt")

shuf is a Unix command, and as such the above code should work on GNU/Linux systems, and most
likely macOS too. shuf generates random permutations of a given file to standard output. I use >
to direct this output to another file, which I called small.txt. The -n 10000 options simply
means that I want 10000 lines.

I then split this small file into a training and a testing set:

# Adapted from # The command below counts the lines in small.txt. This is not really needed, since I know that the # file only has 10000 lines, but I kept it here for future reference # notice the stdout = TRUE option. This is needed because the output simply gets shown in R's # command line and does get saved into a variable. nb_lines <- system2("cat", args = "small.txt | wc -l", stdout = TRUE) system2("split", args = paste0("-l", as.numeric(nb_lines)*0.99, " small.txt data_split/"))

split is the Unix command that does the splitting. I keep 99% of the lines in the training set and
1% in the test set. This creates two files, aa and ab. I rename them using the mv Unix command:

system2("mv", args = "data_split/aa data_split/train.txt") system2("mv", args = "data_split/ab data_split/test.txt")

Ok, now let’s run a model using the VW command line utility from R, using system2():

oaa_fit <- system2("~/miniconda3/bin/vw", args = "--oaa 5 -d data_split/train.txt -f oaa.model", stderr = TRUE)

I need to point system2() to the vw executable, and then add some options. --oaa stands for
one against all and is a way of doing multiclass classification; first, one class gets classified
by a logistic classifier against all the others, then the other class against all the others, then
the other…. The 5 in the option means that there are 5 classes.

-d data_split/train.txt specifies the path to the training data. -f means “final regressor”
and specifies where you want to save the trained model.

This is the output that get’s captured and saved into oaa_fit:

[1] "final_regressor = oaa.model" [2] "Num weight bits = 18" [3] "learning rate = 0.5" [4] "initial_t = 0" [5] "power_t = 0.5" [6] "using no cache" [7] "Reading datafile = data_split/train.txt" [8] "num sources = 1" [9] "average since example example current current current" [10] "loss last counter weight label predict features" [11] "1.000000 1.000000 1 1.0 3 1 87" [12] "1.000000 1.000000 2 2.0 1 3 2951" [13] "1.000000 1.000000 4 4.0 1 3 506" [14] "0.625000 0.250000 8 8.0 1 1 262" [15] "0.625000 0.625000 16 16.0 1 2 926" [16] "0.500000 0.375000 32 32.0 4 1 3" [17] "0.375000 0.250000 64 64.0 1 1 436" [18] "0.296875 0.218750 128 128.0 2 2 277" [19] "0.238281 0.179688 256 256.0 2 2 118" [20] "0.158203 0.078125 512 512.0 2 2 61" [21] "0.125000 0.091797 1024 1024.0 2 2 258" [22] "0.096191 0.067383 2048 2048.0 1 1 45" [23] "0.085205 0.074219 4096 4096.0 1 1 318" [24] "0.076172 0.067139 8192 8192.0 2 1 523" [25] "" [26] "finished run" [27] "number of examples = 9900" [28] "weighted example sum = 9900.000000" [29] "weighted label sum = 0.000000" [30] "average loss = 0.073434" [31] "total feature number = 4456798"

Now, when I try to run the same model using RVowpalWabbit::vw() I get the following error:

oaa_class <- c("--oaa", "5", "-d", "data_split/train.txt", "-f", "vw_models/oaa.model") result <- vw(oaa_class) Error in Rvw(args) : unrecognised option '--oaa'

I think the problem might be because I installed Vowpal Wabbit using conda, and the package
cannot find the executable. I’ll open an issue with reproducible code and we’ll see.

In any case, that’s it for now! In the next blog post, we’ll see how to get the accuracy of this
very simple model, and see how to improve it!

Hope you enjoyed! If you found this blog post useful, you might want to follow
me on twitter for blog post updates and
buy me an espresso or

.bmc-button img{width: 27px !important;margin-bottom: 1px !important;box-shadow: none !important;border: none !important;vertical-align: middle !important;}.bmc-button{line-height: 36px !important;height:37px !important;text-decoration: none !important;display:inline-flex !important;color:#ffffff !important;background-color:#272b30 !important;border-radius: 3px !important;border: 1px solid transparent !important;padding: 1px 9px !important;font-size: 22px !important;letter-spacing:0.6px !important;box-shadow: 0px 1px 2px rgba(190, 190, 190, 0.5) !important;-webkit-box-shadow: 0px 1px 2px 2px rgba(190, 190, 190, 0.5) !important;margin: 0 auto !important;font-family:'Cookie', cursive !important;-webkit-box-sizing: border-box !important;box-sizing: border-box !important;-o-transition: 0.3s all linear !important;-webkit-transition: 0.3s all linear !important;-moz-transition: 0.3s all linear !important;-ms-transition: 0.3s all linear !important;transition: 0.3s all linear !important;}.bmc-button:hover, .bmc-button:active, .bmc-button:focus {-webkit-box-shadow: 0px 1px 2px 2px rgba(190, 190, 190, 0.5) !important;text-decoration: none !important;box-shadow: 0px 1px 2px 2px rgba(190, 190, 190, 0.5) !important;opacity: 0.85 !important;color:#82518c !important;} Buy me an Espresso

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Efficient MCMC with Caching

Sat, 03/02/2019 - 21:03

(This article was first published on R – Stable Markets, and kindly contributed to R-bloggers)

This post is part of a running series on Bayesian MCMC tutorials. For updates, follow @StableMarkets. Metropolis Review Metropolis-Hastings is an MCMC algorithm for drawing samples from a distribution known up to a constant of proportionality, . Very briefly, the algorithm works by starting with some initial draw then running … Continue reading Efficient MCMC with Caching →

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Stable Markets. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Using the R Package Profvis on a Linear Model

Sat, 03/02/2019 - 17:34

(This article was first published on R – Detroit Data Lab, and kindly contributed to R-bloggers)

Not all data scientists were computer scientists who discovered their exceptional data literacy skills. They come from all walks of life, and sometimes that can mean optimizing for data structures and performance isn’t the top priority. That’s perfectly fine! There may come a time where you find yourself executing a chunk of code and consciously noting you could go take a short nap, and that’s where you’ve wondered where you could to be more productive. This short example provides help in how to profile using an extremely powerful and user-friendly package, profvis.

Data for this example:

In this tutorial, we’ll create and profile a simple classifier. The dataset linked to above provides all restaurant inspection data for the city of Detroit, from August 2016 to January 2019.

After extensive analysis and exploration in Power BI, some patterns emerge. Quarter 3 is the busiest for inspections, and Quarter 1 is the slowest. Routine inspections occur primarily on Tuesday, Wednesday, or Thursday. Hot dog carts are a roll of the dice.

This doesn’t seem too complex, and we theorize that we can create a classifier that predicts whether a restaurant is in compliance, by taking into account the number of violations in each of three categories (priority, core, and foundation).

To do so, we throw together some simple code that ingests the data, splits into a test and training set, creates the classifier model, and provides us the confusion matrix.

# Import the restaurant inspection dataset df.rst.insp <- read.csv("Restaurant_Inspections_-_All_Inspections.csv", header = TRUE) # A count of rows in the dataset num.rows <- nrow(df.rst.insp) # Create a shuffled subset of rows subset.sample <- sample(num.rows, floor(num.rows*.75)) # Create a training dataset using a shuffled subset of rows <- df.rst.insp[subset.sample,] # Create a test dataset of all rows NOT in the training dataset df.test <- df.rst.insp[-subset.sample,] # Create the generalized linear model using the training data frame mdlCompliance <- glm(In.Compliance ~ Core.Violations + Priority.Violations + Foundation.Violations, family = binomial, data = # Predict the compliance of the test dataset results <- predict(mdlCompliance, newdata=df.test, type = "response") # Turn the response predictions into a binary yes or no results <- ifelse(results < 0.5, "No", "Yes") # Add the results as a new column to the data frame with the actual results df.test$results <- results # Output the confusion matrix table(df.test$In.Compliance, df.test$results) # Output the confusion matrix library(caret) confMat <- table(df.test$In.Compliance, df.test$results) confusionMatrix(confMat, positive = "Yes")

An accuracy rate of 81.5%! That’s pretty great! Admittedly, a human wouldn’t have much trouble seeing a slew of priority violations and predicting a restaurant shutdown, but this classifier can perform the analysis at a much faster rate.

At this point, we have a good model we trust and expect to use for many years. Let’s pretend to fast forward a decade. Detroit’s meteoric rise has continued, the dataset has grown to massive amounts, and we begin to think we could improve the runtime. Easy enough! Profvis is here to give us the most user-friendly introduction to profiling available. To begin, simply install and load the package.

install.packages("profvis") library("profvis")

Wrap your code in a profvis call, placing all code inside of braces. The braces are important, and be sure to put every line you want to profile. Maybe your confusion matrix is the bad part, or maybe you read the CSV in an inefficient way!

profvis({ df.rst.insp <- read.csv("Restaurant_Inspections_-_All_Inspections.csv", header = TRUE) num.rows <- nrow(df.rst.insp) subset.sample <- sample(num.rows, floor(num.rows*.75)) <- df.rst.insp[subset.sample,] df.test <- df.rst.insp[-subset.sample,] mdlCompliance <- glm(In.Compliance ~ Core.Violations + Priority.Violations + Foundation.Violations, family = binomial, data = results <- predict(mdlCompliance, newdata=df.test, type = "response") results <- ifelse(results < 0.5, "No", "Yes") df.test$results <- results confMat <- table(df.test$In.Compliance, df.test$results) confusionMatrix(confMat, positive = "Yes") })

The output can help pinpoint poor-performing sections, and you can appropriately improve code where necessary.

The FlameGraph tab gives us a high-info breakdown. The Data tab gives us the bare-bones stats we need to get started.

In this example, we would certainly choose to improve the way we read in the data, since it accounts for two-thirds of the total run time in that single step!

The result here might be a minor gain, but we can easily understand how larger datasets would see massive performance improvements with a series of tweaks.


var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Detroit Data Lab. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

rquery Substitution

Sat, 03/02/2019 - 16:52

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

The rquery R package has several places where the user can ask for what they have typed in to be substituted for a name or value stored in a variable.

This becomes important as many of the rquery commands capture column names from un-executed code. So knowing if something is treated as a symbol/name (which will be translated to a data.frame column name or a database column name) or a character/string (which will be translated to a constant) is important.

strings/character versus names/symbols

Let’s take a look at this through small examples. First let’s take a look at the difference between strings and symbols in R.

col_string <- "x" col_name <- str(col_string) ## chr "x" str(col_name) ## symbol x

Notice, in R a string is different than a symbol.

We can see this difference in rquery where an un-quoted x is treated as a symbol (and therefore is translated to a database column) and a quoted entity is treated as a string (and therefore is translated to a literal or constant, not to a column).

library("rquery") d <- data.frame(x = c('a', 'b'), stringsAsFactors = FALSE) d_rep <- local_td(d) db_info <- rquery_db_info(identifier_quote_char = "__IDENTIFIER__", string_quote_char = "__STRING_CONSTANT__") # direct use, comparing to a string constant # probaly not the query we intend as the # result is going to be empty independent # of the data. cat(to_sql( d_rep %.>% select_rows(.,'x')), db_info)) ## Warning in warn_about_filter_conditions(parsed): rquery::select_rows: ## expression"x") refers to no columns (so is a constant) ## SELECT * FROM ( ## SELECT ## __IDENTIFIER__x__IDENTIFIER__ ## FROM ## __IDENTIFIER__d__IDENTIFIER__ ## ) tsql_22528812269810523566_0000000000 ## WHERE ( ( __STRING_CONSTANT__x__STRING_CONSTANT__ ) IS NULL )

(The warning is new to rquery version 1.3.2.)

We take careful note what is marked as "__IDENTIFIER__", versus what is marked as "__STRING_CONSTANT__". Notice "__IDENTIFIER__" is used in the SQL for table names and column name, and "__STRING_CONSTANT__" is used for string constants. The above query is probably not what a user intended as we are checking if a user supplied string constant is NA, which is not interesting.

Likely the correct query omits the quote marks from the x.

# direct use, comparing to a column cat(to_sql( d_rep %.>% select_rows(.,, db_info)) ## SELECT * FROM ( ## SELECT ## __IDENTIFIER__x__IDENTIFIER__ ## FROM ## __IDENTIFIER__d__IDENTIFIER__ ## ) tsql_46294476393859703536_0000000000 ## WHERE ( ( __IDENTIFIER__x__IDENTIFIER__ ) IS NULL )

In the above query we are now comparing an identifier to NULL, which is how SQL expresses comparing the contents of the column named to NULL in a row by row fashion (a useful query).

Or combing the two ideas. We check which rows of the column x have the value "a" as follows.

cat(to_sql( d_rep %.>% select_rows(., x == 'a'), db_info)) ## SELECT * FROM ( ## SELECT ## __IDENTIFIER__x__IDENTIFIER__ ## FROM ## __IDENTIFIER__d__IDENTIFIER__ ## ) tsql_37960424657179884082_0000000000 ## WHERE __IDENTIFIER__x__IDENTIFIER__ = __STRING_CONSTANT__a__STRING_CONSTANT__ wrapr::let() substitution

wrapr::let() substitution is designed only to substitute in names as if the user had typed them. It is deliberately not designed to deal with other value substitutions (such as strings, integers, or floating point values). This is intentional and to keep wrapr::let() to one job: adapting NSE (Non-standard interfaces) to accept names as values.

wrapr::let()‘s principle is that there is no reason for wrapr::let() to ever substitute in a value (such as a string or an integer) as normal evaluation of variable names in environments already supplies a better way to do that. The only thing that is hard to substitute in are new symbols, so wrapr::let() has code to make sure it is doing only that.

Accordingly wrapr::let() treats both names/symbols and strings as symbols.

# Let substitution treats all substitutions as source-text # so strings and names are as if the user had typed them # in and behave as names (becoming the name of a column). let(c(COL_STRING = col_string), cat(to_sql(d_rep %.>% select_rows(.,, db_info))) ## SELECT * FROM ( ## SELECT ## __IDENTIFIER__x__IDENTIFIER__ ## FROM ## __IDENTIFIER__d__IDENTIFIER__ ## ) tsql_90437192531349864130_0000000000 ## WHERE ( ( __IDENTIFIER__x__IDENTIFIER__ ) IS NULL ) # Let substitution treats all substitutions as source-text # so strings and names are as if the user had typed them # in and behave as names (becoming the name of a column). let(c(COL_NAME = col_name), cat(to_sql(d_rep %.>% select_rows(.,, db_info))) ## SELECT * FROM ( ## SELECT ## __IDENTIFIER__x__IDENTIFIER__ ## FROM ## __IDENTIFIER__d__IDENTIFIER__ ## ) tsql_97451251523316128028_0000000000 ## WHERE ( ( __IDENTIFIER__x__IDENTIFIER__ ) IS NULL )

wrapr::let()‘s operating assumption is: if the user was using wrapr::let() the user was intending a symbol, regardless if they specify that symbol using a string or a symbol type. This means the user doesn’t have to maintain the distinction between string representations of names and symbol representations of names when using wrapr::let(). And again, for substituting string-values in: there are already much better ways, such as R evaluation itself (as we show below).

value_we_want <- "a" let(c(COL_NAME = col_name), cat(to_sql(d_rep %.>% select_rows(., COL_NAME == value_we_want), db_info))) ## SELECT * FROM ( ## SELECT ## __IDENTIFIER__x__IDENTIFIER__ ## FROM ## __IDENTIFIER__d__IDENTIFIER__ ## ) tsql_39226801511562936942_0000000000 ## WHERE __IDENTIFIER__x__IDENTIFIER__ = __STRING_CONSTANT__a__STRING_CONSTANT__

By assuming more about user intent wrapr::let() can smooth over inessential differences for the user.

base::bquote() substitution

bquote() substitution on the other hand is designed to substitute arbitrary values into un-executed language objects. This is the usual general definition of quasi-quotation, and is an emergent behavior. That we see the behavior one would expect by simply composing existing R language features. bquote() is what you get when you write reasonable code and then accept the resulting behavior as reasonable (even if the resulting behavior may or may not have been your first choice). This is in fact also a good design principle.

In this case the emergent behavior is: strings are treated as string constants, and names/symbols are treated as column names. That is the consequences of the substitution performed by bquote() is a function of the type of what is being substituted in. This actually makes sense, but it is something the user has to learn.

rquery can use bquote() substitution two ways: through its own NSE methods, or through wrapr:qe() (wrapr quote expression). Both work the same: they treat names/symbols as column names, and character/strings as string constants. So users must express their intent by passing in the correct type.

Here are examples to show the differences. In all cases substitution is triggered by the .()-notation.

# bquote substitution on string type: col_string # is taken to represent a string constant, not # the name of a column. cat(to_sql(d_rep %.>% select_rows(.,, db_info)) ## Warning in warn_about_filter_conditions(parsed): rquery::select_rows: ## expression"x") refers to no columns (so is a constant) ## SELECT * FROM ( ## SELECT ## __IDENTIFIER__x__IDENTIFIER__ ## FROM ## __IDENTIFIER__d__IDENTIFIER__ ## ) tsql_48031370941868950818_0000000000 ## WHERE ( ( __STRING_CONSTANT__x__STRING_CONSTANT__ ) IS NULL ) # bquote substitution on name type: col_name # is taken to represent a column name. cat(to_sql(d_rep %.>% select_rows(.,, db_info)) ## SELECT * FROM ( ## SELECT ## __IDENTIFIER__x__IDENTIFIER__ ## FROM ## __IDENTIFIER__d__IDENTIFIER__ ## ) tsql_31583525690015673048_0000000000 ## WHERE ( ( __IDENTIFIER__x__IDENTIFIER__ ) IS NULL ) # bquote substitution on string type: col_string # is taken to represent a string constant, not # the name of a column. cat(to_sql(d_rep %.>% select_rows_se(., qe(, db_info)) ## Warning in warn_about_filter_conditions(parsed): rquery::select_rows: ## expression"x") refers to no columns (so is a constant) ## SELECT * FROM ( ## SELECT ## __IDENTIFIER__x__IDENTIFIER__ ## FROM ## __IDENTIFIER__d__IDENTIFIER__ ## ) tsql_04319820330673360614_0000000000 ## WHERE ( ( __STRING_CONSTANT__x__STRING_CONSTANT__ ) IS NULL ) # bquote substitution on name type: col_name # is taken to represent a column name. cat(to_sql(d_rep %.>% select_rows_se(., qe(, db_info)) ## SELECT * FROM ( ## SELECT ## __IDENTIFIER__x__IDENTIFIER__ ## FROM ## __IDENTIFIER__d__IDENTIFIER__ ## ) tsql_61704616114459293739_0000000000 ## WHERE ( ( __IDENTIFIER__x__IDENTIFIER__ ) IS NULL ) Conclusion

wrapr::let() behavior is an example of a forced design: a desirable effect is identified (in this case the ability to substitute in names from variables) and the implementation guarantees this effect. Because the implementation is attempting complete control of semantics we can precisely determine user visible effects. We can bend the implementation to our teaching. wrapr::let() is working around the R language, but deliberately doing so in a very narrow way (we are not re-implementing all of the evaluation path!).

base::bquote() behavior is an example of an emergent design: the code that is natural to get the desired functionality is written, and the exact consequences and details of the implementation are derived from the underlying language semantics. Because the implementation is not trying to work around the underlying language the semantics tend to be good and compatible with other parts of the language.

Both strategies are valid and have their advantages. I feel this in contrast to systems that re-implement very many (or even every) step of expression representation and evaluation. Once one overrides and re-implements all aspects of representation and evaluation one has two incompatible languages (the original and the overridden) bolted together to great confusion.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Creating blazing fast pivot tables from R with data.table – now with subtotals using grouping sets

Sat, 03/02/2019 - 13:00

(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)


Data manipulation and aggregation is one of the classic tasks anyone working with data will come across. We of course can perform data transformation and aggregation with base R, but when speed and memory efficiency come into play, data.table is my package of choice.

In this post we will look at of the fresh and very useful functionality that came to data.table only last year – grouping sets, enabling us, for example, to create pivot table-like reports with sub-totals and grand total quickly and easily.

  1. Basic by-group summaries with data.table
  2. Quick pivot tables with subtotals and a grand total
  3. Custom grouping sets
  4. Cube and rollup as special cases of grouping sets
  5. References
Basic by-group summaries with data.table

To showcase the functionality, we will use a very slightly modified dataset provided by Hadley Wickham’s nycflights13 package, mainly the flights data frame. Lets prepare a small dataset suitable for the showcase:

library(data.table) dataurl <- "" flights <- readRDS(url(paste0(dataurl, "r006/flights.rds"))) flights <-[month < 3]

Now, for those unfamiliar with data table, to create a summary of distances flown per month and originating airport with data.table, we could simply use:

flights[, sum(distance), by = c("month", "origin")] ## month origin V1 ## 1: 1 EWR 9524521 ## 2: 1 LGA 6359510 ## 3: 1 JFK 11304774 ## 4: 2 EWR 8725657 ## 5: 2 LGA 5917983 ## 6: 2 JFK 10331869

To also name the new column nicely, say distance instead of the default V1:

flights[, .(distance = sum(distance)), by = c("month", "origin")] ## month origin distance ## 1: 1 EWR 9524521 ## 2: 1 LGA 6359510 ## 3: 1 JFK 11304774 ## 4: 2 EWR 8725657 ## 5: 2 LGA 5917983 ## 6: 2 JFK 10331869

For more on basic data.table operations, look at the Introduction to data.table vignette.

As you have probably noticed, the above gave us the sums of distances by months and origins. When creating reports, especially readers coming from Excel may expect 2 extra perks

  • Looking at sub-totals and grand total
  • Seeing the data in wide format

Since the wide format is just a reshape and data table has the dcast() function for that for quite a while now, we will only briefly show it in practice. The focus of this post will be on the new functionality that was only released in data.table v1.11 in May last year – creating the grand- and sub-totals.

Quick pivot tables with subtotals and a grand total

To create a “classic” pivot table as known from Excel, we need to aggregate the data and also compute the subtotals for all combinations of the selected dimensions and a grand total. In comes cube(), the function that will do just that:

# Get subtotals for origin, month and month&origin with `cube()`: cubed <- data.table::cube( flights, .(distance = sum(distance)), by = c("month", "origin") ) cubed ## month origin distance ## 1: 1 EWR 9524521 ## 2: 1 LGA 6359510 ## 3: 1 JFK 11304774 ## 4: 2 EWR 8725657 ## 5: 2 LGA 5917983 ## 6: 2 JFK 10331869 ## 7: 1 27188805 ## 8: 2 24975509 ## 9: NA EWR 18250178 ## 10: NA LGA 12277493 ## 11: NA JFK 21636643 ## 12: NA 52164314

As we can see, compared to the simple group by summary we did earlier, we have extra rows in the output

  1. Rows 7,8 with months 1,2 and origin , – these are the subtotals per month across all origins
  2. Rows 9,10,11 with months NA, NA, NA and origins EWR, LGA, JFK – these are the subtotals per origin across all months
  3. Row 12 with NA month and origin – this is the Grand total across all origins and months

All that is left to get a familiar pivot table shape is to reshape the data to wide format with the aforementioned dcast() function:

# - Origins in columns, months in rows data.table::dcast(cubed, month ~ origin, value.var = "distance") ## month EWR JFK LGA NA ## 1: 1 9524521 11304774 6359510 27188805 ## 2: 2 8725657 10331869 5917983 24975509 ## 3: NA 18250178 21636643 12277493 52164314 # - Origins in rows, months in columns data.table::dcast(cubed, origin ~ month, value.var = "distance") ## origin 1 2 NA ## 1: EWR 9524521 8725657 18250178 ## 2: JFK 11304774 10331869 21636643 ## 3: LGA 6359510 5917983 12277493 ## 4: 27188805 24975509 52164314

Pivot table with data.table

Using more dimensions

We can use the same approach to create summaries with more than two dimensions, for example, apart from months and origins, we can also look at carriers, simply by adding "carrier" into the by argument:

# With 3 dimensions: cubed2 <- cube( flights, .(distance = sum(distance)), by = c("month", "origin", "carrier") ) cubed2 ## month origin carrier distance ## 1: 1 EWR UA 5084378 ## 2: 1 LGA UA 729667 ## 3: 1 JFK AA 2013434 ## 4: 1 JFK B6 3672655 ## 5: 1 LGA DL 1678965 ## --- ## 153: NA F9 174960 ## 154: NA HA 293997 ## 155: NA YV 21526 ## 156: NA OO 733 ## 157: NA 52164314

And dcast() to wide format which suits our needs best:

# For example, with month and carrier in rows, origins in columns: dcast(cubed2, month + carrier ~ origin, value.var = "distance") ## month carrier EWR JFK LGA NA ## 1: 1 9E 46125 666109 37071 749305 ## 2: 1 AA 415707 2013434 1344045 3773186 ## 3: 1 AS 148924 NA NA 148924 ## 4: 1 B6 484431 3672655 542748 4699834 ## 5: 1 DL 245277 2578999 1678965 4503241 ## 6: 1 EV 2067900 24624 86309 2178833 ## 7: 1 F9 NA NA 95580 95580 ## 8: 1 FL NA NA 226658 226658 ## 9: 1 HA NA 154473 NA 154473 ## 10: 1 MQ 152428 223510 908715 1284653 ## 11: 1 OO NA NA 733 733 ## 12: 1 UA 5084378 963144 729667 6777189 ## 13: 1 US 339595 219387 299838 858820 ## 14: 1 VX NA 788439 NA 788439 ## 15: 1 WN 539756 NA 398647 938403 ## 16: 1 YV NA NA 10534 10534 ## 17: 1 9524521 11304774 6359510 27188805 ## 18: 2 9E 42581 605085 34990 682656 ## 19: 2 AA 373884 1817048 1207701 3398633 ## 20: 2 AS 134512 NA NA 134512 ## 21: 2 B6 456151 3390047 490224 4336422 ## 22: 2 DL 219998 2384048 1621728 4225774 ## 23: 2 EV 1872395 24168 112863 2009426 ## 24: 2 F9 NA NA 79380 79380 ## 25: 2 FL NA NA 204536 204536 ## 26: 2 HA NA 139524 NA 139524 ## 27: 2 MQ 140924 201880 812152 1154956 ## 28: 2 UA 4686122 871824 681737 6239683 ## 29: 2 US 301832 222720 293736 818288 ## 30: 2 VX NA 675525 NA 675525 ## 31: 2 WN 497258 NA 367944 865202 ## 32: 2 YV NA NA 10992 10992 ## 33: 2 8725657 10331869 5917983 24975509 ## 34: NA 9E 88706 1271194 72061 1431961 ## 35: NA AA 789591 3830482 2551746 7171819 ## 36: NA AS 283436 NA NA 283436 ## 37: NA B6 940582 7062702 1032972 9036256 ## 38: NA DL 465275 4963047 3300693 8729015 ## 39: NA EV 3940295 48792 199172 4188259 ## 40: NA F9 NA NA 174960 174960 ## 41: NA FL NA NA 431194 431194 ## 42: NA HA NA 293997 NA 293997 ## 43: NA MQ 293352 425390 1720867 2439609 ## 44: NA OO NA NA 733 733 ## 45: NA UA 9770500 1834968 1411404 13016872 ## 46: NA US 641427 442107 593574 1677108 ## 47: NA VX NA 1463964 NA 1463964 ## 48: NA WN 1037014 NA 766591 1803605 ## 49: NA YV NA NA 21526 21526 ## 50: NA 18250178 21636643 12277493 52164314 ## month carrier EWR JFK LGA NA Custom grouping sets

So far we have focused on the “default” pivot table shapes with all sub-totals and a grand total, however the cube() function could be considered just a useful special case shortcut for a more generic concept – grouping sets. You can read more on grouping sets with MS SQL Server or with PostgreSQL.

The groupingsets() function allows us to create sub-totals on arbitrary groups of dimensions. Custom subtotals are defined by the sets argument, a list of character vectors, each of them defining one subtotal. Now let us have a look at a few practical examples:

Replicate a simple group by, without any subtotals or grand total

For reference, to replicate a simple group by with grouping sets, we could use:

groupingsets( flights, j = .(distance = sum(distance)), by = c("month", "origin", "carrier"), sets = list(c("month", "origin", "carrier")), )

Which would give the same results as

flights[, .(distance = sum(distance)), by = c("month", "origin", "carrier")] Custom subtotals

To give only the subtotals for each of the dimensions:

groupingsets( flights, j = .(distance = sum(distance)), by = c("month", "origin", "carrier"), sets = list( c("month"), c("origin"), c("carrier") ) ) ## month origin carrier distance ## 1: 1 27188805 ## 2: 2 24975509 ## 3: NA EWR 18250178 ## 4: NA LGA 12277493 ## 5: NA JFK 21636643 ## 6: NA UA 13016872 ## 7: NA AA 7171819 ## 8: NA B6 9036256 ## 9: NA DL 8729015 ## 10: NA EV 4188259 ## 11: NA MQ 2439609 ## 12: NA US 1677108 ## 13: NA WN 1803605 ## 14: NA VX 1463964 ## 15: NA FL 431194 ## 16: NA AS 283436 ## 17: NA 9E 1431961 ## 18: NA F9 174960 ## 19: NA HA 293997 ## 20: NA YV 21526 ## 21: NA OO 733 ## month origin carrier distance

To give only the subtotals per combinations of 2 dimensions:

groupingsets( flights, j = .(distance = sum(distance)), by = c("month", "origin", "carrier"), sets = list( c("month", "origin"), c("month", "carrier"), c("origin", "carrier") ) ) ## month origin carrier distance ## 1: 1 EWR 9524521 ## 2: 1 LGA 6359510 ## 3: 1 JFK 11304774 ## 4: 2 EWR 8725657 ## 5: 2 LGA 5917983 ## 6: 2 JFK 10331869 ## 7: 1 UA 6777189 ## 8: 1 AA 3773186 ## 9: 1 B6 4699834 ## 10: 1 DL 4503241 ## 11: 1 EV 2178833 ## 12: 1 MQ 1284653 ## 13: 1 US 858820 ## 14: 1 WN 938403 ## 15: 1 VX 788439 ## 16: 1 FL 226658 ## 17: 1 AS 148924 ## 18: 1 9E 749305 ## 19: 1 F9 95580 ## 20: 1 HA 154473 ## 21: 1 YV 10534 ## 22: 1 OO 733 ## 23: 2 US 818288 ## 24: 2 UA 6239683 ## 25: 2 B6 4336422 ## 26: 2 AA 3398633 ## 27: 2 EV 2009426 ## 28: 2 FL 204536 ## 29: 2 MQ 1154956 ## 30: 2 DL 4225774 ## 31: 2 WN 865202 ## 32: 2 9E 682656 ## 33: 2 VX 675525 ## 34: 2 AS 134512 ## 35: 2 F9 79380 ## 36: 2 HA 139524 ## 37: 2 YV 10992 ## 38: NA EWR UA 9770500 ## 39: NA LGA UA 1411404 ## 40: NA JFK AA 3830482 ## 41: NA JFK B6 7062702 ## 42: NA LGA DL 3300693 ## 43: NA EWR B6 940582 ## 44: NA LGA EV 199172 ## 45: NA LGA AA 2551746 ## 46: NA JFK UA 1834968 ## 47: NA LGA B6 1032972 ## 48: NA LGA MQ 1720867 ## 49: NA EWR AA 789591 ## 50: NA JFK DL 4963047 ## 51: NA EWR MQ 293352 ## 52: NA EWR DL 465275 ## 53: NA EWR US 641427 ## 54: NA EWR EV 3940295 ## 55: NA JFK US 442107 ## 56: NA LGA WN 766591 ## 57: NA JFK VX 1463964 ## 58: NA LGA FL 431194 ## 59: NA EWR AS 283436 ## 60: NA LGA US 593574 ## 61: NA JFK MQ 425390 ## 62: NA JFK 9E 1271194 ## 63: NA LGA F9 174960 ## 64: NA EWR WN 1037014 ## 65: NA JFK HA 293997 ## 66: NA JFK EV 48792 ## 67: NA EWR 9E 88706 ## 68: NA LGA 9E 72061 ## 69: NA LGA YV 21526 ## 70: NA LGA OO 733 ## month origin carrier distance Grand total

To give only the grand total:

groupingsets( flights, j = .(distance = sum(distance)), by = c("month", "origin", "carrier"), sets = list( character(0) ) ) ## month origin carrier distance ## 1: NA 52164314 Cube and rollup as special cases of grouping sets Implementation of cube

We mentioned above that cube() can be considered just a shortcut to a useful special case of groupingsets(). And indeed, looking at the implementation of the data.table method, most of what it does is to define the sets to represent the given vector and all of its possible subsets, and passes that to groupingsets():

function (x, j, by, .SDcols, id = FALSE, ...) { if (! stop("Argument 'x' must be a data.table object") if (!is.character(by)) stop("Argument 'by' must be a character vector of column names used in grouping.") if (!is.logical(id)) stop("Argument 'id' must be a logical scalar.") n = length(by) keepBool = sapply(2L^(seq_len(n) - 1L), function(k) rep(c(FALSE, TRUE), times = k, each = ((2L^n)/(2L * k)))) sets = lapply((2L^n):1L, function(j) by[keepBool[j, ]]) jj = substitute(j), by = by, sets = sets, .SDcols = .SDcols, id = id, jj = jj) }

This means for example that

cube(flights, sum(distance), by = c("month", "origin", "carrier")) ## month origin carrier V1 ## 1: 1 EWR UA 5084378 ## 2: 1 LGA UA 729667 ## 3: 1 JFK AA 2013434 ## 4: 1 JFK B6 3672655 ## 5: 1 LGA DL 1678965 ## --- ## 153: NA F9 174960 ## 154: NA HA 293997 ## 155: NA YV 21526 ## 156: NA OO 733 ## 157: NA 52164314

Is equivalent to

groupingsets( flights, j = .(distance = sum(distance)), by = c("month", "origin", "carrier"), sets = list( c("month", "origin", "carrier"), c("month", "origin"), c("month", "carrier"), c("month"), c("origin", "carrier"), c("origin"), c("carrier"), character(0) ) ) ## month origin carrier distance ## 1: 1 EWR UA 5084378 ## 2: 1 LGA UA 729667 ## 3: 1 JFK AA 2013434 ## 4: 1 JFK B6 3672655 ## 5: 1 LGA DL 1678965 ## --- ## 153: NA F9 174960 ## 154: NA HA 293997 ## 155: NA YV 21526 ## 156: NA OO 733 ## 157: NA 52164314 Implementation of rollup

The same can be said about rollup(), another shortcut than can be useful. Instead of all possible subsets, it will create a list representing the vector passed to by and its subsets “from right to left”, including the empty vector to get a grand total. Looking at the implementation of the data.table method

function (x, j, by, .SDcols, id = FALSE, ...) { if (! stop("Argument 'x' must be a data.table object") if (!is.character(by)) stop("Argument 'by' must be a character vector of column names used in grouping.") if (!is.logical(id)) stop("Argument 'id' must be a logical scalar.") sets = lapply(length(by):0L, function(i) by[0L:i]) jj = substitute(j), by = by, sets = sets, .SDcols = .SDcols, id = id, jj = jj) }

For example, the following:

rollup(flights, sum(distance), by = c("month", "origin", "carrier"))

Is equivalent to

groupingsets( flights, j = .(distance = sum(distance)), by = c("month", "origin", "carrier"), sets = list( c("month", "origin", "carrier"), c("month", "origin"), c("month"), character(0) ) ) References var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Visualizing Bike Share Data (NiceRide)

Sat, 03/02/2019 - 01:13

(This article was first published on | R Language Programming, and kindly contributed to R-bloggers)

This tutorial will cover exploring and visualizing data through 2018 for the Minneapolis, MN bike sharing service NiceRide. Part of what makes R incredible is the number of great packages. Part of what makes packages like ggmap and gganimate great is how they build on existing packages.

First step, as always, is to include the libraries we will be using.

# The Usual Suspects library(ggplot2) library(ggthemes) library(dplyr) library(magrittr) library(data.table) library(stringr) # Fetching library(rvest) # Cleaning column names library(janitor) # Date/Time formatting library(lubridate) # Maps library(sf) library(ggmap) # Used for animated density plots library(gganimate) # Only needed for interactive maps library(leaflet) library(leaflet.extras)

Fetching the Data (Fail)

First we are going to cover how I would normally approach fetching tables like the ones shown on

This approach did not work.

url <- "" repo <- url %>% read_html tbls <- html_nodes(repo, "table") # Shows all tables tbls

This is where we learn the approach didn’t work. Calling tbls gives us this return

{xml_nodeset (1)} [1] ...

Since there is only one table we would normally be able to use the html_tables function from rvest and have a nice data frame.

df <- html_table(tbls[1],fill = T)[[1]] nrow(df) # [1] 0

Instead, we get the return above, which has been commented out. Zero rows. This is because the data which populates the page is loaded after. There is still hope though.

Fetching the Data (Success)

This is the approach that worked. If we go up one level to we can see an xml file which we can use.

url <- "" repo <- url %>% read_html %>% xml_child(1) %>% xml_child(1) %>% xml_find_all(".//key") # Now we have an xml_nodeset repo[1:10]

That last bit lets us see a sampling of repo

{xml_nodeset (10)} [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]

Awesome, this is something we can use.

repo %<>% as_list %>% unlist %>% #Starts with 2018 str_match_all("^2018.*") %>% unlist

The function as_list is needed because we are converting an xml_nodeset to a vector. We used str_match_all to find files that start with 2018. The %<>% operator passes repo to as_list and also sets repo to the return value of the chain. At this point if you were to call repo you would see the following…

> repo [1] "" [2] "" [3] "" [4] "" [5] "" [6] "" [7] "" [8] ""

Downloading and Unzipping

You could merge both actions into one loop, but I made two loops. The %T>% operator comes in handy here to unlink the concatenated filename instead of the return of unzip.

dir.create("import") # Download Files for(file in repo) { download.file(paste0(url,file),destfile = paste0("import/",file)) } # Unzip for(file in repo) { paste0("import/",file) %T>% unzip(exdir = "import") %>% unlink }

Read and Merge into a Dataframe

Now that we have the files unzipped we can read them into a list and merge the list into a dataframe. You could do this manually, but this will allow for you to change the year easily, or merge 2018 and 2017 into a dataframe to company ridership between years.

# Read and merge import <- lapply("./import" %>% list.files, function(name) { return(read_csv(paste0("./import/",name))) }) rides <- rbindlist(import,fill = T) rides %<>% clean_names(case = "snake")

Additional Columns

Now that we have the data in rides, we can build some extra columns/features. Age is a much more intuitive field than birth_year, and grouping ages into bins will come in handy later.

rides$age <- 2019-rides$birth_year rides$age_bin <- rides$age %>% .bincode(seq(0,120,20)) rides$age_bin <- sapply(rides$age_bin,function(bin) { return(paste0((bin-1)*20,"-",(bin*20)," Years Old")) })

We probably will want to see things in a more macro level view, so we should build out some date/time columns as well.

# Trip times rides$minutes <- rides$tripduration/60 rides$hours <- rides$tripduration/60/60 # Start times rides$start_hour <- lubridate::hour(rides$start_time) rides$mm <- hour(rides$start_time)*60 + minute(rides$start_time) rides$start_day <- wday(rides$start_time,label = T, abbr = F, week_start = 1) # Weekend/Weekday rides$start_day_type <- ifelse(wday(rides$start_time, week_start = 1)>5, "Weekend", "Weekday") # Week of year rides$week <- week(rides$start_time) # Month (1-12) rides$month <- month(rides$start_time,label = T,abbr = F) # Month (January-December) rides$month_text <- month(rides$start_time,label = T,abbr = F) # Remove unused levels from factor rides$month_text <- droplevels(rides$month_text)

Some Tables for Context

You can print the info in a different way, I am just doing this to format everything better for the web.

table(rides$age_bin) %>% lapply({ . %>% format(big.mark=",") %>% return })

$`0-20 Years Old` [1] 1.43 $`100-120 Years Old` [1] 0 $`20-40 Years Old` [1] 35.39 $`40-60 Years Old` [1] 58.25 $`60-80 Years Old` [1] 4.82 $`80-100 Years Old` [1] 0.1

From this we can see that tossing out anything below a half a percent would remove 80-100 and 100-120. We could probably remove 0-20 as well, but that is up to you.

The Delicious Part

Now comes the time to visually explore the data. For the same of simplicity I am just using theme_fivethirtyeight() and scale_.*_viridis.*() for the theme and colors of most of these plots.

Visualizing Ridership by Month

ggplot(data=rides[which(rides$age<=60),], aes(x=week, fill= month_text)) + geom_histogram(alpha=.9) + theme_fivethirtyeight() + ggtitle("Ride Frequency by Week of Year") + facet_grid(vars(usertype), vars(age_bin)) + scale_fill_viridis_d() ggsave(filename = "ride-frequecy-histogram.png",width = 8,units = "in")

The first thing I notice is the 0-20 and 60-80 year old frequencies are so low they are nearly impossible to identify from this plot. If we want to learn what week of the year people are riding a density plot might work better. There is probably a better way to do this, but in order to get the red line I added an additional layer.

ggplot(data=rides[which(rides$age<=80),], aes(x=week, fill= age_bin)) + geom_histogram(alpha=.9,aes(y=..density..)) + theme_fivethirtyeight() + ggtitle("Ride Distribution by Week of Year") + geom_density(alpha=0,color=rgb(1,0,0,.4)) + facet_grid(vars(usertype), vars(age_bin)) + scale_fill_viridis_d() ggsave(filename = "ride-frequency-density.png",width = 8,units = "in")

Looks like we are seeing a downward trend toward the end of the year for subscribers/20-80 and customers/20-80 with the biggest downward trends for subscribers/60-80. Makes sense, from the previous plot we saw week 40 fell in October and Minneapolis can get chilly.

Visualizing Start Time

I am dumping a big chunk of code here, don’t be alarmed. The first bit of code filters thanks to dplyr and only keeps the more active months. I also filtered out those age bins we talked about earlier. The droplevels() is because ggplot didn’t like when I had unused levels (from the months).

I also added breaks and labels for the time to correspond with peak times.

ggplot(rides %>% filter(age <= 80, month > 4, month < 10) %>% droplevels(), aes(x=mm, fill= age_bin)) + geom_density(alpha=.6) + scale_x_continuous(labels = c("5am","8am","1:30pm","5pm","8pm"), breaks = c(300,480,750,1020,1200)) + labs(fill="",title="NiceRide 2018 Start Times") + theme_fivethirtyeight() + theme(strip.background = element_rect(fill = "#FFFFFF")) + facet_grid(vars(usertype), vars(start_day_type)) + scale_fill_viridis_d(option="A")

This plot gives us a lot of really interesting information! You can see how NiceRide is used by many people who are presumably commuting. There is also a lunch rush for NiceRide bikes.


So far all of this information has been on a macro level for time based ridership. Now we will switch to location based ridership.

Static Network Map

In order to generate static heatmaps we will be using the ggmap package along with the ggplot2::geom_segment() function. There is also a ggplot2::geom_curve() function which you can use to generate curved line segments. For the curved line segments you must also use coord_cartesian()

In order to speed up the rendering of geom_segment() I aggregated the data and then scaled the opacity of the line segments. I don’t think this was a necessary step, but it made rendering (and tweaking the plot) much easier.

df.lines <- rides %>% group_by(start_station_longitude, start_station_latitude, end_station_longitude, end_station_latitude, start_station_name, end_station_name) %>% summarize(rides = n())

Now we have a handy dandy dataframe which has the number of rides between stations. The start_station_name and end_station_name are not required, but they are helpful if you want to further explore station data. You can also summarize on the fly for station information with plyr

rides %>% group_by(start_station_name,end_station_name) %>% filter(start_station_name!="NULL") %>% summarize(rides = n()) %>% ungroup %>% top_n(10)

Selecting by rides # A tibble: 10 x 3 start_station_name end_station_name rides 1 100 Main Street SE 100 Main Street SE 1161 2 11th Ave S & S 2nd Street 11th Ave S & S 2nd Street 1436 3 6th Ave SE & University Ave 6th Ave SE & University Ave 1148 4 Lake Calhoun Center Lake Calhoun Center 2804 5 Lake Como Pavilion Lake Como Pavilion 2070 6 Lake Harriet Bandshell Lake Harriet Bandshell 2332 7 Lake Street & Knox Ave S Lake Street & Knox Ave S 6852 8 Minnehaha Park Minnehaha Park 1673 9 W 36th Street & W Calhoun Parkway W 36th Street & W Calhoun Parkway 2012 10 Weisman Art Museum Willey Hall 1081

Neat, right? Also, that Lake Street & Knox Ave S location had over six thousand round trips. Also pretty interesting that all of the top 10 were round trips except for Weisman Art Museum/Willey Hall. Changing to the top 20 stations we see that pop four times as a starting station.

Selecting by rides # A tibble: 20 x 3 start_station_name end_station_name rides 1 100 Main Street SE 100 Main Street SE 1161 2 11th Ave S & S 2nd Street 11th Ave S & S 2nd Street 1436 3 6th Ave SE & University Ave 6th Ave SE & University Ave 1148 4 Harriet Island Harriet Island 970 5 Hiawatha Ave & E 50th Street Hiawatha Ave & E 50th Street 827 6 Lake Calhoun Center Lake Calhoun Center 2804 7 Lake Calhoun Center Lake Street & Knox Ave S 864 8 Lake Como Pavilion Lake Como Pavilion 2070 9 Lake Harriet Bandshell Lake Harriet Bandshell 2332 10 Lake Nokomis Lake Nokomis 1059 11 Lake Street & Knox Ave S Lake Calhoun Center 777 12 Lake Street & Knox Ave S Lake Harriet Bandshell 741 13 Lake Street & Knox Ave S Lake Street & Knox Ave S 6852 14 Lake Street & Knox Ave S W 36th Street & W Calhoun Parkway 757 15 Minnehaha Park Minnehaha Park 1673 16 W 36th Street & W Calhoun Parkway Lake Street & Knox Ave S 826 17 W 36th Street & W Calhoun Parkway W 36th Street & W Calhoun Parkway 2012 18 Weisman Art Museum Social Sciences 795 19 Weisman Art Museum Willey Hall 1081 20 Willey Hall Weisman Art Museum 1036

Getting the Base Map

The package ggmap has a function called get_map() which we will be using. You can pass the name of the city, but I am passing the boundaries of our start/end coordinates.

# First you will need a Google API Key register_google(key = "YOUR_API_KEY") mpls <- get_map(c(left = min(rides$start_station_longitude), bottom = min(rides$start_station_latitude), right = max(rides$start_station_longitude), top = max(rides$start_station_latitude)), maptype='terrain', source='stamen', zoom=13)

Now that we have our city mpls tiles, we can add some layers, including our geom_segment() layer. The function scale_alpha() was a lifesaver here, and without that we wouldn’t have been able to aggregate our data in df.line as the whole plot would be either blacked out when using aes() or not determined by df.line$rides. The function theme_nothing() comes from ggmap and is useful for removing axis labels among other attributes.

# Darken allows us to lighten up the map ggmap(mpls,darken = c(.8,"#FFFFFF")) + geom_segment(data = df.lines, aes(x = start_station_longitude, y = start_station_latitude, xend = end_station_longitude, yend = end_station_latitude, alpha = sqrt(rides)), color = "#000000") + coord_cartesian() + scale_alpha(range = c(0.0001, .5)) + geom_point(data = df.lines %>% group_by(longitude = start_station_longitude, latitude = start_station_latitude) %>% summarize(rides = sum(rides)), aes(x = longitude, y = latitude, size = rides), color="#009900",alpha=.4) + scale_size_continuous(range(4,100)) + scale_color_viridis_c() + scale_fill_viridis_c() + theme_nothing() ggsave(filename = "station-network.jpg",width = 8,units = "in")

I made the executive decision to aggregate data for geom_point() on the fly. Without aggregating, we would be left with multiple points for each start station.

Some of those roundtrips threw off the scale, which is why sqrt() is used. You could substitute sqrt() for log(), but I found the results more palatable as written.

There are a couple of insights to be garnered from this plot. There is a nice little rectangle of trips down at the bottom along Minnehaha Creak, the chain of lanes, and then on Lake Street and either West River Parkway or Hiawatha. There is also a ton more activity in Minneapolis than St. Paul, as expected. This might also be a good reference for where to put additional NiceRide stations. You may want to first replace geom_segment() with geom_path() from the ggmap::route() function which I will go over in another, shorter tutorial.

Leaflet Interactive Map

This one also required a new dataframe. Essentially we are summarizing the start stations and end stations separately so we can see if there is a change in the heatmap when cycling between the two layers.

df.heatmap.start_end <- list() df.heatmap.start_end$start <- rides %>% group_by(start_station_longitude, start_station_latitude) %>% summarize(intensity = sqrt(n())) names(df.heatmap.start_end$start)[1:2] <- c("longitude","latitude") df.heatmap.start_end$end <- rides %>% group_by(end_station_longitude, end_station_latitude) %>% summarize(intensity = sqrt(n())) names(df.heatmap.start_end$end)[1:2] <- c("longitude","latitude") df.heatmap.start_end$start$pos <- "Start" df.heatmap.start_end$end$pos <- "End" df.heatmap.start_end %<>% rbindlist(fill = T)

Hot Hot Heat

You don’t have to add two layers of heatmaps, but you can! I segmented the data between start and stop to see if there was any change. The function addLayersControl() will allow us to toggle between the two layers. You can tweak the parameters to your liking, but I found these settings produced a plot I was happy with.

leaflet(df.heatmap.start_end) %>% addProviderTiles(providers$CartoDB.DarkMatter) %>% addHeatmap(data = df.heatmap.start_end %>% filter(pos=="Start"), lng=~longitude, lat=~latitude, intensity = ~intensity, blur = 10, max = 100, radius = 15, layerId = "Start", group = "Start") %>% addHeatmap(data = df.heatmap.start_end %>% filter(pos=="End"), lng=~longitude, lat=~latitude, intensity = ~intensity, blur = 10, max = 100, radius = 15, layerId = "End", group = "End") %>% addLayersControl( baseGroups = c("Start","End"), options = layersControlOptions(collapsed = FALSE) )

There was some change, and it was negligible.


library(ggTimeSeries) # Generate frequency table <- rides$start_time %>% as_date() %>% table %>% data.frame names( <- c("Date","Rides")$Date %<>% as_date ggplot_calendar_heatmap(, 'Date', 'Rides' ) + theme_fivethirtyeight() + theme(legend.position = "right", legend.direction = "vertical") + scale_fill_viridis_c() ggsave(filename = "ride-calendar.png",width = 8,units = "in")

Wrap Up

All in all there is a ton of data to go through, and plenty more can be done. Some ideas for those who want to continue this project…

  • Compare ridership by years
  • Append rides with weather information
  • Determine features with significant deviation
  • Use geom_path() and map suggested bike routes
  • Create static heatmaps using 2d density plots
  • Calendar Views
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: | R Language Programming. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

My Shiny Dashboard, Milwaukee Beer

Sat, 03/02/2019 - 01:00

(This article was first published on Posts on Anything Data, and kindly contributed to R-bloggers)

Milwaukee Beer – Inspired by my Job Hunt

I’m excited to launch my latest Shiny app – “Milwaukee Beer” – which I made to learn shinydashboard. Due to my decision to return to the USA and hunt for a career in data, I decided to add another project to my portfolio. Milwaukee Beer is a metric tracking dashboard that provides quick insights to the unique local brews. You may toggle dropdown tabs to get rankings, nice graphs, and sentiment tracking. Feel free to search for beer by type and flavor too!



It’s been a dream of mine to break into the data science field, so prior to my move, I decided to add another project to my portfolio – a sleek Shiny dashboard.

A brutal truth about this project was that I had to invest time in finding my own data and deciding what to do with it. Long story short, I found and scraped the Milwaukee-centric subset of it over multiple iterations to build up my dataset. I then combined the information on their local breweries, beers, and ratings/reviews.

Along the way I also discovered highcharter, an amazing library for interactive plots. (If you haven’t heard of it, look it up!) First, let me provide an example of a highcharter plot I used in my dashboard. I hope it will invoke your curiosity and ultimately lure you to my app that helps explore Milwaukee’s rich array of beers!

library(tidyverse) library(highcharter) library(readr) ##You can find this data on my github: ## milwaukee_beer <- read_csv("milwaukee_beer3.csv")[-1] milwaukees_favorite <- milwaukee_beer %>% select(brewery, beer_name, avg_beer_score, beer_style, beer_category) %>% distinct() %>% filter(! %>% group_by(beer_category) %>% summarise(total = n(), avg = mean(avg_beer_score)) %>% mutate(id = group_indices(., beer_category)) ## Since I want each point colored with custom colors, I need to provide a matching vector of colors. ##Generate pallete colfunc <-colorRampPalette(c("#E9FF5C", "#FF8800", "#FF2B00")) hc_colors <- sample(colfunc(nrow(milwaukees_favorite)))

So far, nothing special – just your familiar dplyr chain.
For anybody that hasn’t used the fabulous highcharter package before, add_series is similar to a geom in ggplot2, hcaes are similar to aes, and you can add tooltips, and embed html to an extent.

highchart() %>% hc_add_series(milwaukees_favorite, type = 'column', hcaes(x = id, y = avg), tooltip = list(pointFormat = "{point.beer_category}: {point.avg}"), showInLegend = F) %>% hc_add_series(milwaukees_favorite, type = 'bubble', hcaes(x = id, y = avg, size = total, group = beer_category), tooltip = list(pointFormat = "{point.beer_category}: {}"), marker = list(fillOpacity=1), minSize = 20, maxSize = 80) %>% hc_plotOptions(column = list(pointWidth = .5, pointPlacement = "on")) %>% hc_title(text = "Milwaukee's Most Common Brewing Styles", useHTML = TRUE) %>% hc_yAxis(title = list(text = "Avg Rating", style = list(color = "white", fontSize = 22)), labels = list(style = list(color = "white", fontSize = 15)), tickColor = "white", gridLineColor = "transparent") %>% hc_xAxis(labels = list(enabled = F), gridLineColor = "transparent") %>% hc_colors(color = hc_colors) %>% hc_chart(divBackgroundImage = '', borderColor = 'white', borderRadius = 10, borderWidth = 2, backgroundColor = 'transparent') %>% hc_legend(backgroundColor = "#0D0D0D99", itemStyle = list(color = "#C9C9C9"), itemHoverStyle = list(color = "yellow"))

{"x":{"hc_opts":{"title":{"text":"Milwaukee's Most Common Brewing Styles<\/span>","useHTML":true},"yAxis":{"title":{"text":"Avg Rating","style":{"color":"white","fontSize":22}},"labels":{"style":{"color":"white","fontSize":15}},"tickColor":"white","gridLineColor":"transparent"},"credits":{"enabled":false},"exporting":{"enabled":false},"plotOptions":{"series":{"label":{"enabled":false},"turboThreshold":0},"treemap":{"layoutAlgorithm":"squarified"},"column":{"pointWidth":0.5,"pointPlacement":"on"}},"series":[{"group":"group","data":[{"beer_category":"(misc) Ales","total":8,"avg":3.44875,"id":1,"x":1,"y":3.44875},{"beer_category":"(misc) lager","total":1,"avg":1.66,"id":2,"x":2,"y":1.66},{"beer_category":"American ale","total":57,"avg":3.31421052631579,"id":3,"x":3,"y":3.31421052631579},{"beer_category":"Belgian & French ale","total":56,"avg":3.42428571428571,"id":4,"x":4,"y":3.42428571428571},{"beer_category":"Bock","total":29,"avg":3.31172413793103,"id":5,"x":5,"y":3.31172413793103},{"beer_category":"Dark lager","total":10,"avg":3.821,"id":6,"x":6,"y":3.821},{"beer_category":"English Brown","total":15,"avg":3.426,"id":7,"x":7,"y":3.426},{"beer_category":"Finnish","total":3,"avg":3.62666666666667,"id":8,"x":8,"y":3.62666666666667},{"beer_category":"German barley","total":18,"avg":3.32388888888889,"id":9,"x":9,"y":3.32388888888889},{"beer_category":"light lager","total":37,"avg":2.77486486486487,"id":10,"x":10,"y":2.77486486486487},{"beer_category":"Pale ale","total":410,"avg":3.57682926829268,"id":11,"x":11,"y":3.57682926829268},{"beer_category":"Pilsner","total":19,"avg":3.02263157894737,"id":12,"x":12,"y":3.02263157894737},{"beer_category":"Porter","total":47,"avg":3.48957446808511,"id":13,"x":13,"y":3.48957446808511},{"beer_category":"Scottish & Irish","total":25,"avg":3.6268,"id":14,"x":14,"y":3.6268},{"beer_category":"Sour","total":7,"avg":3.86,"id":15,"x":15,"y":3.86},{"beer_category":"Specialty","total":69,"avg":3.42826086956522,"id":16,"x":16,"y":3.42826086956522},{"beer_category":"Steam","total":3,"avg":3.94333333333333,"id":17,"x":17,"y":3.94333333333333},{"beer_category":"Stout","total":115,"avg":3.74765217391304,"id":18,"x":18,"y":3.74765217391304},{"beer_category":"Strong ale","total":23,"avg":3.81695652173913,"id":19,"x":19,"y":3.81695652173913},{"beer_category":"VOM","total":25,"avg":3.22,"id":20,"x":20,"y":3.22},{"beer_category":"Wheat and rye","total":28,"avg":3.30357142857143,"id":21,"x":21,"y":3.30357142857143}],"type":"column","tooltip":{"pointFormat":"{point.beer_category}: {point.avg}"},"showInLegend":false},{"name":"(misc) Ales","data":[{"beer_category":"(misc) Ales","total":8,"avg":3.44875,"id":1,"x":1,"y":3.44875,"size":8,"z":8}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"(misc) lager","data":[{"beer_category":"(misc) lager","total":1,"avg":1.66,"id":2,"x":2,"y":1.66,"size":1,"z":1}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"American ale","data":[{"beer_category":"American ale","total":57,"avg":3.31421052631579,"id":3,"x":3,"y":3.31421052631579,"size":57,"z":57}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"Belgian & French ale","data":[{"beer_category":"Belgian & French ale","total":56,"avg":3.42428571428571,"id":4,"x":4,"y":3.42428571428571,"size":56,"z":56}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"Bock","data":[{"beer_category":"Bock","total":29,"avg":3.31172413793103,"id":5,"x":5,"y":3.31172413793103,"size":29,"z":29}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"Dark lager","data":[{"beer_category":"Dark lager","total":10,"avg":3.821,"id":6,"x":6,"y":3.821,"size":10,"z":10}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"English Brown","data":[{"beer_category":"English Brown","total":15,"avg":3.426,"id":7,"x":7,"y":3.426,"size":15,"z":15}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"Finnish","data":[{"beer_category":"Finnish","total":3,"avg":3.62666666666667,"id":8,"x":8,"y":3.62666666666667,"size":3,"z":3}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"German barley","data":[{"beer_category":"German barley","total":18,"avg":3.32388888888889,"id":9,"x":9,"y":3.32388888888889,"size":18,"z":18}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"light lager","data":[{"beer_category":"light lager","total":37,"avg":2.77486486486487,"id":10,"x":10,"y":2.77486486486487,"size":37,"z":37}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"Pale ale","data":[{"beer_category":"Pale ale","total":410,"avg":3.57682926829268,"id":11,"x":11,"y":3.57682926829268,"size":410,"z":410}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"Pilsner","data":[{"beer_category":"Pilsner","total":19,"avg":3.02263157894737,"id":12,"x":12,"y":3.02263157894737,"size":19,"z":19}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"Porter","data":[{"beer_category":"Porter","total":47,"avg":3.48957446808511,"id":13,"x":13,"y":3.48957446808511,"size":47,"z":47}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"Scottish & Irish","data":[{"beer_category":"Scottish & Irish","total":25,"avg":3.6268,"id":14,"x":14,"y":3.6268,"size":25,"z":25}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"Sour","data":[{"beer_category":"Sour","total":7,"avg":3.86,"id":15,"x":15,"y":3.86,"size":7,"z":7}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"Specialty","data":[{"beer_category":"Specialty","total":69,"avg":3.42826086956522,"id":16,"x":16,"y":3.42826086956522,"size":69,"z":69}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"Steam","data":[{"beer_category":"Steam","total":3,"avg":3.94333333333333,"id":17,"x":17,"y":3.94333333333333,"size":3,"z":3}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"Stout","data":[{"beer_category":"Stout","total":115,"avg":3.74765217391304,"id":18,"x":18,"y":3.74765217391304,"size":115,"z":115}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"Strong ale","data":[{"beer_category":"Strong ale","total":23,"avg":3.81695652173913,"id":19,"x":19,"y":3.81695652173913,"size":23,"z":23}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"VOM","data":[{"beer_category":"VOM","total":25,"avg":3.22,"id":20,"x":20,"y":3.22,"size":25,"z":25}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80},{"name":"Wheat and rye","data":[{"beer_category":"Wheat and rye","total":28,"avg":3.30357142857143,"id":21,"x":21,"y":3.30357142857143,"size":28,"z":28}],"type":"bubble","tooltip":{"pointFormat":"{point.beer_category}: {}"},"marker":{"fillOpacity":1},"minSize":20,"maxSize":80}],"xAxis":{"labels":{"enabled":false},"gridLineColor":"transparent"},"colors":["#FF6C00","#EBF352","#FF3400","#F8AB1B","#FF5000","#EDE749","#FA9F12","#FC9309","#FF7E00","#FF6200","#F6B724","#FF7500","#FF2B00","#EFDB40","#FF5900","#F1CF37","#E9FF5C","#FF3D00","#FF8800","#FF4600","#F3C32E"],"chart":{"divBackgroundImage":"","borderColor":"white","borderRadius":10,"borderWidth":2,"backgroundColor":"transparent"},"legend":{"backgroundColor":"#0D0D0D99","itemStyle":{"color":"#C9C9C9"},"itemHoverStyle":{"color":"yellow"}}},"theme":{"chart":{"backgroundColor":"transparent"}},"conf_opts":{"global":{"Date":null,"VMLRadialGradientURL":"http =//","canvasToolsURL":"http =//","getTimezoneOffset":null,"timezoneOffset":0,"useUTC":true},"lang":{"contextButtonTitle":"Chart context menu","decimalPoint":".","downloadJPEG":"Download JPEG image","downloadPDF":"Download PDF document","downloadPNG":"Download PNG image","downloadSVG":"Download SVG vector image","drillUpText":"Back to {}","invalidDate":null,"loading":"Loading...","months":["January","February","March","April","May","June","July","August","September","October","November","December"],"noData":"No data to display","numericSymbols":["k","M","G","T","P","E"],"printChart":"Print chart","resetZoom":"Reset zoom","resetZoomTitle":"Reset zoom level 1:1","shortMonths":["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"thousandsSep":" ","weekdays":["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]}},"type":"chart","fonts":[],"debug":false},"evals":[],"jsHooks":[]}

Of course, this is a teaser example to lure you to see my dashboard. Go there and explore the unique world of Milwaukee beer (in a marketing style dashboard) already!

End Note

The depth and complexity of the data were not too deep, but demonstrating visualization and tracking were my primary goals. While I’ve previously made a Shiny app that uses Halo 5’s API (yes, the game), dash-boarding has become an important skill which I felt necessary to add to my project list. If anybody in the greater Milwaukee area knows about a job relating to R, SQL, data mining, or viz, please leave me a message or add me on linked in! Contact info is hosted on this site as well as my apps.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Posts on Anything Data. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

An architecture for real-time scoring with R

Fri, 03/01/2019 - 20:04

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Let's say you've developed a predictive model in R, and you want to embed predictions (scores) from that model into another application (like a mobile or Web app, or some automated service). If you expect a heavy load of requests, R running on a single server isn't going to cut it: you'll need some kind of distributed architecture with enough servers to handle the volume of requests in real time.

This reference architecture for real-time scoring with R, published in Microsoft Docs, describes a Kubernetes-based system to distribute the load to R sessions running in containers. This diagram from the article provides a good overview:

You can find detailed instructions for deploying this architecture in Github. This architecture uses Azure-specific components, but you could also use their open source equivalents if you wanted to host them yourself:

For more details on this architecture, take a look at the Microsoft Docs article linked below.

Microsoft Docs: Real-time scoring of R machine learning models

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

The delta method and its implementation in R

Fri, 03/01/2019 - 16:39

(This article was first published on The Princess of Science, and kindly contributed to R-bloggers)

Suppose that you have a sample of a variable of interest, e.g. the heights of men in certain population, and for some obscured reason you are interest not in the mean height μ but in its square μ². How would you inference on μ², e.g. test a hypothesis or calculate a confidnce interval? The delta method is the trick you need.

The delta method is mathematical assertion that can yield estimates for the varinance of functons of statistics under mild condition. Loosley speaking, let bₙ is an estimate of β, where n is the sample size, and the distribution of bₙ is approximately normal with mean β and variance σ²/n.

Then, if g is a function, then g(bₙ) is approximately normal with mean g(β) and and variance [g’(β)² ]⋅σ²/n, provided that the sample size is large. Don’t let the calculus scare you. R’s deltmethod function will handle the calculus for you.

If bₙ is a Maximum Likelihood Estimate (MLE) of β then under mild conditions it is approximately normal for a large sample size, and g(bₙ) and g’(bₙ) are MLEs for g(β) and g’(β) respectively. Add on top of this a MLE for σ, and you can implement statistical inference.

I will demonstrate the use of the delta method using the Titanic survival data. For the sake of the example I will use only 3 variables: Survived is a 0/1 variable indicating whther a passenger survived the sinking of the Titanic, Age is obviously the passenger’s age, and Parch is the passenger’s number of parents/children aboard. Lets take a look at some of the data:

> # install.packages("titanic") > library(titanic) > titanic=titanic_train[, c(1, 2, 6,8)] > head(titanic)   PassengerId Survived Age Parch 1           1        0  22     0 2           2        1  38     0 3           3        1  26     0 4           4        1  35     0 5           5        0  35     0 6           6        0  NA     0 >

Let π be the probability of surviving. Then the odds of survival is π/(1-π). The sample size is n=891 is considered large, so we can apply the Central Limit Theorem to conclude that p is approximately normal with mean π and variance π/(1-π)/n. Then π can be estimated by p=0.384, the odds are estimated by p/(1-p)=0.623, and the variance of p is estimated by 0.000265:

> p_survival=mean(titanic$Survived) > print(p_survival) [1] 0.3838384 > survival_odds=p_survival/(1-p_survival) > print(survival_odds) [1] 0.6229508 > n=nrow(titanic) > var_p_survival=(p_survival*(1-p_survival))/n > print(var_p_survival) [1] 0.0002654394 >

Let g(x)=p/(1-x). Then g`(x)=1/(1-x)², (you can check this at so the variane of the odds can be estimated by

Var(p/(1-p))=[g’(p)² ]⋅σ²/n=[1/(1-p)²]² ⋅ [p(1-p)/n].

This can be implemented in the following R code:

> dev_g_odds=1/(1-survival_odds)^2 > var_odds=(dev_g_odds^2)*var_p_survival > print(var_odds) [1] 0.01313328 > se_odds=sqrt(var_odds) > print(se_odds) [1] 0.1146005 >

But of course, instead of doing all the calculus, you can use the deltamethod function of R’s msm package.

The function has three parameters:

  • g is a formula object representating of the transformation g(x). The formula variables must be labeled x1, x2 and so on.
  • mean is the estimate of g(β)
  • cov is the estimate of var(β) which is σ²/n

The use of the delta function provides the same result:

> # install.packages("msm") > library(msm) > se_odds_delta=deltamethod(g=~x1/(1-x1), mean=survival_odds, cov=var_p_survival) > print(se_odds_delta) [1] 0.1146005 >

The second example considers logistic regression. We will model the (logit of the) probability of survival using Age nd Parch. Using R’s glm function we can obtain estimates to the logistic regression coefficients band their standard errors se_b:

> model=glm(Survived ~ Age + Parch, family=binomial(link="logit") > model_beta=data.frame(summary(model)$coefficients[2:3, 1:2]) > names(model_beta)=c("b", "se_b") > print(model_beta)                  b        se_b Age   -0.008792681 0.005421838 Parch  0.191531303 0.090643635 >

Since exp(β) is usually interpreted as the odd ratio (OR) of the response variable with regard to the regression independent variable, reseachers are interested in inference on exp(β). In order to perform such inference one nees to estimate the standard error of exp(β). Since b is a Maximum Likelihood Estimate for β, it is approximately normal and its variance is estimated by se_b², and the delta method can be applied.

The calculus in this case is easy: g(x)=exp(x), so g’(x)=exp(x). Therefore, the standard error of exp(β) can be estimated by exp(b)⋅ se_b:

> model_beta$OR=exp(model_beta$b) > model_beta$se_OR=exp(model_beta$b)*model_beta$se_b > print(model_beta)                  b        se_b        OR       se_OR Age   -0.008792681 0.005421838 0.9912459 0.005374374 Parch  0.191531303 0.090643635 1.2111027 0.109778755 >

To use the deltamethod function, we will first use the vcov function to obtain the variance-covariance matrix of the resression coefficient estimates, and the variances will be the inputs of the deltamethod function:

> vc=vcov(model)[2:3, 2:3] > print(vc)                Age        Parch Age   2.939632e-05 9.236876e-05 Parch 9.236876e-05 8.216269e-03 > model_beta$se_OR_delta[1]=deltamethod(~exp(x1), mean=model_beta$b[1], cov=vc[1,1]) > model_beta$se_OR_delta[2]=deltamethod(~exp(x1), mean=model_beta$b[2], cov=vc[2,2]) > print(model_beta)                  b        se_b        OR       se_OR se_OR_delta Age   -0.008792681 0.005421838 0.9912459 0.005374374 0.005374374 Parch  0.191531303 0.090643635 1.2111027 0.109778755 0.109778755 >

Of course, we get the same results.


var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: The Princess of Science. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Powerball demystified

Fri, 03/01/2019 - 16:35

(This article was first published on The Princess of Science, and kindly contributed to R-bloggers)

The US Powerball lottery hysteria took another step when no one won the big jackpot in the last draw that took place on October 20, 2018. So, the total jackpot is now 2.22 billion dollars. I am sure that you want to win this jackpot. I myself want to win it.

Actually, there are two different lotteries: The Mega Million lottery prize is about 1.6 billion dollars, and the probability of winning when playing a single ticket is about 1 in 302 million. The Powerball lottery jackpot is “only” 620 million dollars but the probability of winning is slightly better: about 1 in 292 million.

The probability of winning both jackpots is therefore the multiplication of the two probabilities stated above, which 1 about 1 in 88000000000000000.

Let’s not be greedy, and aim just to 1.6 billion jackpot, although its probability of winning is slightly worse.

First, it should be noted that although the probability of winning is small, it is still positive. So if you buy a ticket you get a chance. If you do not buy a ticket you will not win, period.

Second, is buying a ticket a good investment? It looks like it is. The price of a ticket is two dollars. On the average, you will win the jackpot with probability of 1 to 302 million, and lose your two dollars dollar with probability of nearly 1. Therefore, your average return is about the jackpot multiplied by the probability of winning it minus the price of the ticket. Since the probability of winning is 1 in 302 million and the jackpot is 1600 million, then the expected return is 1600/302–2 , which is positive — about 3.30 dollars. Therefore, you should play. Or shouldn’t you?

The above figure — expected value of 3.30 dollars is an expectation of money. It is not money. You are not going to gain this expected sum of money when you play the lottery once. You either win the jackpot or lose your money. Of course, if you get a chance to participate in such a lottery with such a jackpot as many times as you wish, you should play, and the law of large numbers will be in your favor. This is not going to happen, of course. You only get to play this game once.

The next interesting question is what is the probability that someone will win?

Assume that you roll a die. The probability of rolling 6 is 1 to six. If two people roll a die, then the probability of at least one of them rolling six is about 1 in 3.3. If 3 people roll a die then the probability of at least one of them rolling six is even better: 1 in 137, and so on. The lottery is similar. Think of a lottery ticket as a die, only that the probability of rolling 6 is 1 to 302 million. If two people are rolling suck a dice, i.e. buying a lottery ticket, then the probability that at least one of them rolling a six is slightly better than 1 to 302 million. How many people should buy a lottery ticket to make the probability of a least one win greater than 5%? 10%? 50%? What is the probability that two or more people will share the jackpot? These probabilities depend on the amount of tickets sold. The more ticket sold, the higher the probability that someone wins. If you know the number of tickets sold, you can be approximated these probabilities using the Poisson distribution. You can also back-calculated the number of tickets need to be sold in order to set the probability that someone wins to any level you like. I’ll skip the technical details. According to my calculations, the number of tickets need to be sold to ensure that the probability of at least one winner exceeds 0.5 is about 210 million.

But wait: the price of a ticket is 2 dollars dollar, and there are only 302 million possible combinations of numbers. So, if I buy all possible tickets, it will cost me only 604 million, and I am guaranteed to win 1.6 billion. This is a net profit of nearly a billion dollars. Not bad. Can I do it?

The short answer is “Yes”. The long answer is “probably not”.

Yes. It was done before. In 1992, the jackpot of the Virginia lottery was 27 million dollars, while the probability of winning the jackpot was about 1 to 7 million, and the price of a ticket was 1 dollar. So it was possible to buy 7 million tickets for 7 million dollars to ensure a jackpot of 27 million. A consortium of 2500 small investors aimed to raise the money to buy all these tickets. However, they managed to buy only about 5 million tickets. They still managed to win, (See: The improbability Principle — David Hand, page 120)

To buy 302 million tickets is a different story. First you need to have 604 million dollars in cash. Second, you need to actually to buy all these tickets, and you have only 4 days to do all these shopping. In 4 days there are 345600 seconds, so you need to buy nearly 900 tickets per second, while you make sure that each ticket carries a different combination of numbers. The logistics may be difficult. In 1992, that consortium mange to but only 5 million tickets in about a week. You may not be able to buy all the tickets you need.

Second, when you win, you will probably want the money in cash, and not in annuity payments. So, the 1.6 billion reduces to 910 million, and the government will take 25% tax (and more later). You will end up with 684 million. Still a net profit of 80 million dollars. Not bad.

Third, it is possible that you will share the jackpot with someone else, maybe even more than one person. As we already saw, there is a good chance for this scenario — if 210 million tickets are sold then the probability for sharing the jackpot is about 50%. Even if you share the jackpot with just another person, your share is just 800 million, and if you want to cash it it will shrink to 342 million, a net loss of 262 million. That’s not good.

And finally: should you buy a ticket? Yes, buy one ticket. It’s fun. And for two dollars you buy the right to hope winning the jackpot, as Prof Robert Aumann, the Nobel Prize winner said.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: The Princess of Science. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R Journal publication

Fri, 03/01/2019 - 16:29

(This article was first published on R – Eran Raviv, and kindly contributed to R-bloggers)

The R Journal is the open access, refereed journal of the R project for statistical computing. It features short to medium length articles covering topics that should be of interest to users or developers of R.

Christoph Weiss, Gernot Roetzer and myself have joined forces to write an R package and the accompanied paper: Forecast Combinations in R using the ForecastComb Package, which is now published in the R journal. Below you can find a few of my thoughts about the journey towards publication in the R journal, and a few words about working with a small team of three, from three different locations.

The road to publishing in the R journal is not easy. The journal is peer-reviewed, and peers chosen are very knowledgeable, which means there are valuable and valid comments to address. Apart from that, and unlike with other publications you may have already, given that it is a computing journal the code is also scrutinized for bugs and important possible extensions. That said, the quality at the end is undoubtedly improved.

Communication and treatment from the R journal is excellent. You are assigned with associate editor who distributes the prospective publication to few referees. Then you get an email describing the next steps and associated (expected) time lines. The whole system can be described as responsive.

The R journal itself is growing in popularity, and in impact:


On working remotely

Nowadays we enjoy an incredible set of tools at our disposal to allow an Amsterdam-Cambridge-Dublin collaboration, without the need to even see each other. I don’t think I could even recognize my co-authors passing them on the street, which is quite something given that we have been doing so much together. We simply had a skype call for working meetings with clear to-do lists every three weeks or so, whatsup app for small things, dropbox for tex files and edits, git for the code and email for concise questions and meetings prep. All free tools (free here means that you don’t pay with money..).

A boundaryless era we live in. Amazing experience.

Related posts:

  1. R tips and tricks – Set Working Directory This is more an Rstudio tip than an R tip….
  2. R vs MATLAB (Round 3) At least for me, R by faR. MATLAB has its…
  3. R and Dropbox When you woRk, you probably have a set of useful…
  4. Most popular machine learning R packages The good thing about using open-source software is the community…

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Eran Raviv. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

A brief history of clinical trials

Fri, 03/01/2019 - 16:28

(This article was first published on The Princess of Science, and kindly contributed to R-bloggers)

The earliest report of a clinical trial is probably provided in the Book of Daniel. Daniel and a group of other Jewish people who stayed at the palace of the king of Babylon, did not want to eat the king’s non-Kosher food and preferred a vegetarian diet. To show that vegetarian and Kosher diet is healthier, Daniel suggested to conduct an experiment. There were two “treatment group” in this trial. One group ate the royal Babylonian food, the other kept the vegetarian diet. The health of the groups was compared after a follow-up period of 10 days. The conclusion was that the vegetarian diet is healthier.

The first modern clinical trial is James Lind’s scurvy trial, which many consider to be the starting point of modern medicine. This is the first documented controlled clinical trial (if you ignore Daniel’s trial). Lind conducted an experiment to test possible treatments for scurvy, the leading cause of death among sailors by the end of the 18th century. In a relatively brief voyage in the Mediterranean in 1749, Linde divided the 12 sailors who fell sick during the voyage to six equal groups. They were all hosted in the same place on the ship and were given the same menu, which was distinguished only by the experimental treatment given to them. The treatments were: drinking a liter of cider a day, drinking 25 drops of sulfuric acid three times a day, drinking two tablespoons of vinegar three times a day, drinking half a liter of seawater a day, an ointment made of garlic, mustard, and radish, or eating two oranges and lemon a day. The citrus patients had recovered completely, and the condition of cider patients improved slightly. The comparison between the groups allowed Lind to evaluate the efficacy of each treatment in relation to other therapeutic alternatives.

The next milestone is William Watson’s trial of treatments to reduce the risk of smallpox. Already in the 11th century it was known that anyone who had this disease and survived would not get sick again. As a result, a practice of immunization of the disease by “mild infection” of healthy people was developed. However, among the doctors there were disagreements about optimal adhesion and treatment for infection. Watson conducted a series of three clinical trials at London Children’s Hospital in 1767. His methodology was similar to that of Lind: The children participating in each trial were divided into groups, and in each group, controlled infection was performed using a bladder from an early stage of the disease. Each group was given a different adjuvant treatment that was supposed to reduce the risk of infection. Watson’s experiments had a number of innovations compared to Lind’s experiment. Watson ensured that in each treatment group there was an equal number of boys and girls to avoid possible bias in case the response to treatment was different between the genders. In addition, one group in each trial did not receive supplementary treatment but served as a control group. Most importantly, Watson was the first to report a quantitative measurement of results. The measure of success of treatment was the number of smallpox that occurred in each child participating in the trial. He also performed a basic statistical analysis and published the average number of blisters per child in each group. Watson concluded that conventional treatments to reduce risk, including mercury, various plants and laxatives, were ineffective.

The next significant milestone is the milk experiment in the Lancashire county of Scotland in the early 20th century. The purpose of the trials was to determine whether daily milk intake improves the growth of children compared to children who did not drink milk on a daily basis, and to check whether there is a difference in growth rates between children fed fresh milk and those fed in pasteurized milk. The experiment, conducted in 1930, was large-scale and included a total of about 20,000 children aged 6–12, who studied in 67 schools. About 5,000 were fed in fresh milk, about 5,000 in pasteurized milk, and approximately 10,000 children were assigned to the control group. The height and weight of the children were measured at the beginning of the experiment (February 1930) and at the end (June 1930). The conclusion was that a daily diet of milk improves the growth of children, and that there is no significant difference between fresh milk and pasteurized milk. The researchers also concluded that children’s age had no effect on the effect of growth rate.

This experiment entered my list because of the criticism leveled at it. The critics included Fisher and Bartlett, but the most comprehensive criticism was cast by William Sealy Gosset, also known as “Student”. In an article published in Biometrika, Gosset actually set rules that were necessary to ensure the validity of a clinical trial. First, he noted that in each school the children were treated with fresh milk or pasteurized milk, but the two groups were not represented in any school. As a result, it is not possible to directly compare fresh and pasteurized milk, due to differences between the different schools. He also noted that the treatments were assigned by the teachers in each class and not randomly. As a result, students in the control group were larger in their body size than students in the treatment groups. Thirdly he notes that although the measurements were conducted in February and June, the weight measurements did not consider the weights of the children cloths. Winter clothes are heavier than spring / summer clothes, and the weight difference between clothes offset the real weight differences. The researchers assumed that the difference in the weight of the clothes would be similar among the groups, but Gosset argued that the bias in the distribution of students to economically affected groups — children from affluent families were usually included in the control groups — meant that the weight of the control group’s winter clothing would be higher.

Gosset concluded that the results did not support the conclusion that there is no difference between a diet with fresh milk and a pasteurized milk diet, and claimed that it is impossible to conclude that there is no connection between age and the change in growth rate. He also mentions the analysis of Fisher and Bartlett that showed that fresh milk has an advantage over pasteurized milk as to the rate of growth.

Following his criticism, Gosset made a number of recommendations, including a proposal to conduct the experiment in a group of twins, one of whom will be fed milk and the other will serve as a control (or one of them will be fed in fresh milk and the other in pasteurized milk to compare the two types of milk). I think that such planning is not accepted ethically today. A more practical recommendation is to re-analyze the data collected to try to overcome the bias created in the non-random allocation to treatment and control groups. His ultimate recommendation was to re-conduct the experiment, this time using randomization, considering bias due to the weight of the clothes worn by each student, and planning the experiment so that each school has representation for the three treatment groups.

The main recommendation of Gosset, to ensure random allocation of patients to groups, was not immediately accepted, as this idea was perceived by some of the scientific community as “unethical”. It should be noted that the principle of randomization was only presented by Fisher in 1923, and there was still insufficient recognition of its importance.

The first clinical trial with random assignment to a treatment and control groups was conducted only in 1947, and is the fourth in my list. This is an experiment to test the efficacy of streptomycin antibiotics to treat pneumonia. Due to the short supply of antibiotics, there was no choice but to decide by “lottery” between the patients who will receive antibiotic treatment and who will not, and thus the planning of the experiment overcame the ethical barrier. However, the experiment was not double blind, and placebo was not used.

It should be noted that there has already been a precedent for a double blind trial: the first clinical trial using the double-blind method was conducted in 1943 to test the efficacy of penicillin as a treatment for common cold. Patients did not know whether they were treated with penicillin, or whether they were treated with placebo. The doctors who treated the patients did not know what treatment each patient received. Such a design prevents bias that may result from doctors’ prior judgment about the efficacy of the treatment, and in fact forces them to give an objective opinion about the patient’s medical condition. However, this trial did not randomize patients for treatment or control.

The debate regarding the importance of the principles outlined by Gosset and Fisher was finally decided in the trial to test the efficacy of Salk’s vaccine against polio virus, carried out in 1954. In fact, two trials were conducted. The main trial, led by Paul Meier, was a double-blind randomized trial, showing a 70% reduction in Polio-related paralysis in the treatment group compared to the control group. The size of the large sample (about 400,000 children aged 6–8) helped to establish external validity of the results. At the same time, another trial was conducted, in which the allocation of treatment (vaccination or placebo) was not random. 725,000 first and third graders who participated in the experiment served as a control group, to which 125,000 second grade children whose parents refused the vaccine were added. Their data were compared with the data of 225,000 second graders whose parents agreed to vaccinate them. A total of more than one million students participated in the experiment, almost three times the size of Meier’s trial. However, this trial results showed a decrease of only 44% in polio-related paralysis. Later analysis found that the effect was reduced due to bias related to the socioeconomic status of the treatment group. Many children in this group belonged to more affluent families, and in this population stratum polio incidence was higher because the proportion of children vaccinated naturally (the polio was mild and recovered without documentation) was lower due to the higher level of sanitation in their environment. The polio trails established the fact that the most important feature of a clinical trial is the randomization , and that only a random and double-blind allocation ensures the internal validity of the experiment.


  • Boylston, AW (2002). Clinical investigation of smallpox in 1767.New England Journal of Medicine, 346 (17), 1326–1328.
  • Leighton G, McKinlay P (1930). Milk consumption and the growth of school-children. Department of Health for Scotland, Edinburgh and London: HM Stationery Office.
  • Student (1931). The Lanarkshire Milk Experiment. Biometrika 23: 398–406
  • Fisher RA, Bartlett S (1931). Pasteurised and raw milk. Nature 127: 591–592.
  • Medical Research Council Streptomycin in Tuberculosis Trials Committee. (1948).
  • Streptomycin treatment for pulmonary tuberculosis. BMJ, 2 , 769–82.
  • Hart, PDA (1999). A change in scientific approach: from alternation to randomized allocation in clinical trials in the 1940s.BMJ, 319 (7209), 572–573.
  • Meier, Paul. “Polio trial: an early efficient clinical trial.” Statistics in medicine 9.1–2 (1990): 13–16.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: The Princess of Science. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Bayesian state space modelling of the Australian 2019 election by @ellis2013nz

Fri, 03/01/2019 - 14:00

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

So I’ve been back in Australia for five months now. While things have been very busy in my new role at Nous Group, it’s not so busy that I’ve failed to notice there’s a Federal election due some time by November this year. I’m keen to apply some of the techniques I used in New Zealand in the richer data landscape (more polls, for one) and complex environment of Australian voting systems.

Polling data

The Australian Electoral Commission has wonderful, highly detailed data on actual results, which I’ll doubtless be coming to at some point. However, I thought for today I’d start with the currency and perpetual conversation-making power (at least in the media) of polling data.

There’s no convenient analysis-ready collection of Australian polling data that I’m aware of. I used similar methods to what’s behind my nzelect package to grab these survey results from Wikipedia where it is compiled by some anonymous benefactor, from the time of the 2007 election campaign until today.

Thanks are owed to Emily Kothe who did a bunch of this scraping herself for 2010 and onwards and put the results on GitHub (and on the way motivated me to develop a date parser for the horror that is Wikipedia’s dates), but in the end I started from scratch so I had all my own code convenient for doing updates, as I’m sure I’ll be wanting.

All the code behind this post is in its own GitHub repository. It covers grabbing the data, fitting the model I’ll be talking about soon, and the graphics for this post. That repo is likely to grow as I do more things with Australian voting behaviopur data.

Here’s how that polling data looks when you put it together:

Notes on the abbreviations of Australian political parties in that chart:

  • ONP ~ “Pauline Hanson’s One Nation” – nationalist, socially conservative, right-wing populism
  • Oth ~ Other parties
  • Grn ~ “Australian Greens” ~ left wing, environment and social justice focus
  • ALP ~ “Australian Labor Party” ~ historically the party of the working person, now the general party of the centre left
  • Lib/Nat ~ “Liberal Party” or “National Party” ~ centre and further right wing, long history of governing in coalition (and often conflated in opinion polling, hence the aggregation into one in this chart)

I’m a huge believer in looking at polling data in the longer term, not just focusing on the current term of government and certainly not just today’s survey release. The chart above certainly tells some of the story of the last decade or so; even a casual observer of Australian politics will recognise some of the key events, and particularly the comings and goings of Prime Ministers, in this chart.

Prior to 2007 there’s polling data available in Simon Jackman’s pscl package which has functionality and data relating to political science, but it only covers the first preference of voters so I haven’t incorporated it into my cleaned up data. I need both the first preference and the estimated two-party-preference of voters.

(Note to non-Australian readers – Australia has a Westminster-based political system, with government recommended to the Governor General by whomever has the confidence of the lower house, the House of Representatives; which is electorate based with a single-transferrable-vote aka “Australian vote” system. And if the USA could just adopt something as sensible as some kind of preferential voting system, half my Twitter feed would probably go quiet).

Two-party-preferred vote

For my introduction today to analysis with this polling data, I decided to focus on the minimal simple variable for which a forecast could be credibly seen as a forecast of the outcome on election day, whenever it is. I chose the two-party-preferred voting intention for the Australian Labor Party or ALP. We can see that this is pretty closely related to how many seats they win in Parliament:

The vertical and horizontal blue lines mark 50% of the vote and of the seats respectively.

US-style gerrymanders generally don’t occur in Australia any more, because of the existence of an independent electoral commission that draws the boundaries. So winning on the two-party-preferred national vote generally means gaining a majority in the House of Representatives.

Of course there are no guarantees; and with a electoral preference that is generally balanced between the two main parties even a few accidents of voter concentration in the key electorates can make a difference. This possibility is enhanced in recent years with a few more seats captured by smaller parties and independents:

All very interesting context.

State space modelling

My preferred model of the two I used for the last New Zealand election was a Bayesian state space model. These are a standard tool in political science now, and I’ve written about them in both the Australian and New Zealand context.

To my knowledge, the seminal paper on state space modelling of voting intention based on an observed variety of polling data is Jackman’s “Pooling the Polls Over an Election Campaign”. I may be wrong; happy to be corrected. I’ve made a couple of blog posts out of replicating some of Jackman’s work with first preference intention for the ALP in the 2007 election. In fact, this was one of my self-imposed homework tasks in learning to use Stan, the wonderfully expressive statistcal modelling and high-performance statistical computation tool and probability programming language.

My state space model of the New Zealand electorate was considerably more complex than I need today, because in New Zealand I needed to model (under proportional representation) the voting intention for multiple parties at once. Whereas today I can focus on just two-party-preferred vote for either of the main blocs. Obviously a better model is possible, but not today!

The essence of this modelling approach is that we theorise the existence of an unobserved latent voting intention, which is measured imperfectly and irregularly by opinion poll surveys. These surveys have sampling error and other sources of “total survey error”, including “house effects” or statistical tendencies to over- or under-estimate vote in particular ways. Every few years, the true voting intention manifests itself in an actual election.

Using modern computational estimation methods we can estimate the daily latent voting intention of the public based on our imperfect observations, and also model the process of change in that voting intention over time and get a sense of the plausibility of different outcomes in the future. Here’s what it looks like for the 2019 election:

This all seems plausible and I’m pretty happy with the way the model works. The model specification written in Stan and the data management in R are both available on GitHub.

An important use for a statistical model in my opinion is to reinforce how uncertain we should be about the world. I like the representation above because it makes clear, in the final few months of modelled voting intention out to October or November 2019, how much change is plausible and consistent with past behaviour. So anyone who feels certain of the election outcome should have a look at the widening cone of uncertainty on this chart and have another think.

A particularly useful side effect of this type of model is statistical estimates of the over- or under-estimation of different survey types or sources. Because I’ve confronted the data with four successive elections we can get a real sense of what is going on here. This is nicely shown in this chart:

We see the tendency of Roy Morgan polls to overestimate the ALP vote by one or two percentage points, and of YouGov to underestimate it. These are interesting and important findings (not new to this blog post though). Simple aggregations of polls can’t incorporate feedback from election results in this way (although of course experienced people routinely make more ad hoc adjustments).

A more sophisticated model would factor in change over time in polling firms methods and results, but again that would take me well beyond the scope of this blog post.

Looking forward to some more analysis of election issues, including of other data sources and of other aspects, over the next few months.

Here’s a list of the contributors to R that made today’s analysis possible:

thankr::shoulders() %>% knitr::kable() %>% clipr::write_clip() maintainer no_packages packages Hadley Wickham 16 assertthat, dplyr, ellipsis, forcats, ggplot2, gtable, haven, httr, lazyeval, modelr, plyr, rvest, scales, stringr, tidyr, tidyverse R Core Team 12 base, compiler, datasets, graphics, grDevices, grid, methods, parallel, stats, stats4, tools, utils Gábor Csárdi 6 callr, cli, crayon, pkgconfig, processx, ps Winston Chang 4 extrafont, extrafontdb, R6, Rttf2pt1 Yihui Xie 4 evaluate, knitr, rmarkdown, xfun Kirill Müller 4 DBI, hms, pillar, tibble Dirk Eddelbuettel 3 digest, inline, Rcpp Lionel Henry 3 purrr, rlang, tidyselect Jeroen Ooms 2 curl, jsonlite Jim Hester 2 pkgbuild, readr Ben Goodrich 2 rstan, StanHeaders Jim Hester 2 glue, withr Vitalie Spinu 1 lubridate Deepayan Sarkar 1 lattice Gabor Csardi 1 prettyunits Patrick O. Perry 1 utf8 Jennifer Bryan 1 cellranger Michel Lang 1 backports Simon Jackman 1 pscl Jennifer Bryan 1 readxl Kevin Ushey 1 rstudioapi Justin Talbot 1 labeling Simon Potter 1 selectr Jonah Gabry 1 loo Charlotte Wickham 1 munsell Alex Hayes 1 broom Joe Cheng 1 htmltools Baptiste Auguie 1 gridExtra Luke Tierney 1 codetools Henrik Bengtsson 1 matrixStats Peter Ellis 1 frs Simon Garnier 1 viridisLite Brodie Gaslam 1 fansi Brian Ripley 1 MASS R-core 1 nlme Stefan Milton Bache 1 magrittr Marek Gagolewski 1 stringi James Hester 1 xml2 Max Kuhn 1 generics Simon Urbanek 1 Cairo Jeremy Stephens 1 yaml Achim Zeileis 1 colorspace var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...