
### Reshama Shaikh discusses women in machine learning and data science.

Mon, 02/25/2019 - 18:00

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Hugo Bowne-Anderson, the host of DataFramed, the DataCamp podcast, recently interviewed Reshama Shaikh, organizer of the meetup groups Women in Machine Learning & Data Science (otherwise known as WiMLDS) and PyLadies.

Introducing Reshama Shaikh

Hugo: Hi there, Reshama, and welcome to DataFramed.

Reshama: Hello, Hugo. Thank you for inviting me.

Hugo: It’s such a pleasure to have you on the show, and some of our listeners may not know, but when we initially launched DataFramed, we had a panel down at Two Sigma on 6th Avenue in New York City, around Tribeca, a panel on which you appeared. So from the genesis of this podcast, you’ve been involved in a variety of ways.

Reshama: Right. It was January of 2018, almost a year ago.

Hugo: It was, and I actually hadn’t been down there until two nights ago where I went back to present at Jared Lander’s meetup.

Reshama: How did that go? I saw that there was a meetup there with you.

Hugo: It went really, really well. I gave a talk called "What Data Scientists Really Do, according to 50 data scientists". It’s essentially my takeaways from 50 hours of season one of DataFramed, and I joke, but it’s true, the great thing about giving this talk is that I get to present other people’s opinions and not be held accountable for them.

Reshama: That’s great. Was it recorded?

Hugo: It was and Jared’s put it up, so it’s definitely up there somewhere.

Reshama: All right, I will look for it.

Hugo: Yeah, it was great to launch the podcast with you and it’s great to have you on near the start of season two, particularly to talk about several things that you’re instrumental in thinking about and that you’re passionate about. For example, your work with, would you say, WiMLDS? I’m trying to figure out how to pronounce that acronym.

Reshama: Yes, WiMLDS for short, after saying Women in Machine Learning and Data Science.

Hugo: Great. So very excited to have you here today to talk about what you do at Women in Machine Learning and Data Science, and Women in Machine Learning and Data Science in general, your work with NumFOCUS on the code of conduct, and a blog post that you wrote recently that took off, titled "Why Are Women Flourishing In The R Community, But Lagging In Python?"

What are you known for in the data community?

Hugo: So I’m really excited to be talking about all of these things with you today, but before we get there, I want to find out a bit about you. So perhaps you could just start off by telling us what you’re known for in the data community.

Reshama: Right, so I’m known for a few things. I’m an organizer, as you mentioned, for Women in Machine Learning and Data Science. I’m also an organizer for PyLadies, and a board member for Women in Machine Learning and Data Science. I’ve been an organizer for about four years now, and at the time that I started organizing we had two chapters. So it’s been good to see it grow and to contribute to that.

Reshama: I also created a repository of documentation for the well-known fastai deep learning library, and that’s been popular. I also give workshops throughout New York City, a few online but mainly in person, and I’ve given a dozen in the past two years. I’m a member, as you mentioned, of the NumFOCUS Diversity in Scientific Computing committee, DISC for short, and I’m known for some of my blogs.

Hugo: That is a wide-ranging, I think, resume or description of the variety of things you do. I’m just wondering how you got involved originally. What was your initial path or trajectory to data science?

Reshama: Sure. I have a degree in statistics, I have a masters in statistics, and I worked for a long time as a biostatistician. While I was working there, I started my MBA part time at the Stern School of Business. The first couple of years were all core classes in economics, finance, organizational behavior, and then after the first half of the MBA we took elective courses.

Reshama: So I took a course, I believe it was spring of 2012, called Data Mining For Business Analytics with professor Foster Provost, and we actually used WEKA. Towards the end of the course we were analyzing project data, and it was mentioned that you could do that really easily in Python.

Reshama: I had never heard of Python at the time, downloaded it, but wasn’t sure how to use it. So the following semester, Professor Foster Provost offered another course called Practical Data Science. It was the first time they had it at Stern, the School of Business, and that’s when I started learning Python and I just, I just fell in love with Python.

Hugo: Fantastic. So I was working in cell biology, thinking about data, statistics, and cell mechanics essentially as well, but when I started thinking about working in data science in industry, Foster Provost’s book Data Science For Business was one of the first books that I ever read in this space.

Reshama: Well that’s so funny because at the time it was a draft copy. So we received his draft copy for free.

Hugo: Awesome. So what happened next in your journey?

Reshama: Well, during my time at Stern, I also took other courses: Data Visualization; Networks, Crowds, and Markets; Design of Apps, and having a statistical background was really helpful. After that I worked at a data science bootcamp and acquired even more skills in data science.

Hugo: Great. And I think boot camps are definitely … People ask me about the quality of boot camps and what you get out of them, and I think two very interesting aspects of boot camps are, firstly, the network that you build and where those people end up, because you’ll be in a class at a boot camp and a couple of years down the track you’ll have part of your cohort at Facebook, Twitter, DataCamp, LinkedIn, all of these places.

Hugo: So that’s very exciting, but as you said, the other aspect is you get a broad overview of kind of all the current tools and techniques that are used in data science. So it helps you really establish the knowledge of what your toolkit may need to be.

Reshama: Absolutely. When I was at Stern, they didn’t actually teach us about the Python libraries. We had to do a lot of coding from scratch, and there are things that are not covered in university programs, like git, and at the time not even cloud computing. So it was nice to get some of those other supplemental and complementary skills.

Hugo: Absolutely. Yeah. Such as, as you say, cloud computing, like figuring out how to use AWS, right?

Reshama: Absolutely. Yeah.

Women in Machine Learning and Data Science (WiMLDS)

Hugo: I’d love to jump in and hear about your work at Women in Machine Learning and Data Science. So perhaps you could start by kind of giving us a brief rundown of the history of WIMLDS and the mission.

Reshama: Sure. So Women in Machine Learning and Data Science was founded in December 2013 by Erin LeDell, who’s out in San Francisco, so we just celebrated our fifth anniversary. She started the first chapter there, which is the Bay Area chapter.

Reshama: It was inspired by the Women in Machine Learning workshops, which are co-located with conferences such as NeurIPS, ICML, and COLT, but the thing about those workshops is that they’re only accessible to people who attend the conferences. The goal of the Women in Machine Learning and Data Science meetup groups is to have local communities where more people can participate. We are a nonprofit organization, and as of today we have 37 chapters in 17 countries with over 21,000 members.

Hugo: That’s amazing. How can people get involved?

Reshama: The best way is to look for a local chapter close to you, and if there’s somewhere a chapter doesn’t exist, we encourage people to consider starting one.

Hugo: So, as stated, the mission is to support and promote women and gender minorities who are practicing, studying, or interested in the fields of ML and data science. How do you go about achieving this mission?

Reshama: One of the things that we do is, for our speaker events, we look for women speakers. We want our community of women and other underrepresented groups to see women speaking and in leadership positions, to inspire them. For our panel talks, we require that at least 50% of the panelists are women. We advocate for and promote conferences, and we work closely with some conferences, such as the O’Reilly Media and Machine Learning conferences, that are really dedicated to having a diversity of speakers.

Reshama: We also work with conferences that really promote a code of conduct. We have a very active Slack team where women can network, ask questions about jobs, and discuss how to approach different issues at work. And we have all different types of events, such as career development panels. We recently had a terrific panel with a group of women at McKinsey who talked about the challenges they face at work and how they balance their personal life with work.

Reshama: We have open source sprints, as the rates of women involved in the open source community are quite low, and we have hands-on technical workshops; examples are deep learning, Neo4j, and making a bot. We really want to promote a supportive, welcoming, and communicative environment.

Reshama: I think one of the reasons that the New York chapter has really done so well is that we communicate with our members and we’ve had a good team of volunteers that’s been consistent for close to four years.

Hugo: We might get back to this, but I want to zoom in on the fact that you mentioned open source sprints, and we might return to it when discussing why women are flourishing in the R community and perhaps lagging, as you say, in Python. The open source development community, I think, is one in which we see a serious lack of diversity. That’s an understatement. So I think, as you say, having open source sprints where you hopefully get women and underrepresented groups making pull requests as soon as possible is actually really essential work, particularly on the Python side. Right?

Reshama: I absolutely agree. In fact, I’ve just been spending the past week on this. We had our second open source sprint at the end of September, and we merged five pull requests, and we had over a dozen that are sort of outstanding. I’ve been working with some other people to get some of those pull requests completed. So it’s a great experience. It was my first scikit-learn sprint. I’m like, "Wow, there’s a lot that I could have learned in the past five years."

Other Initiatives

Hugo: And actually, the first time I heard about you and your work at Women in Machine Learning and Data Science was when Andreas Mueller, who has worked with you on these sprints, and for those who don’t know Andy, he’s a maintainer and core contributor of scikit-learn, was running a sprint with you, or maybe it was actually a workshop. It was several years ago, but it was definitely then that I started hearing about your fantastic work at WiMLDS …

Reshama: Well, that’s good to hear.

Hugo: … in New York City. So what other types of initiatives do you think about and implement?

Reshama: Other initiatives.

Hugo: We’ve talked about career development, and panels, and open source sprints.

Reshama: We’ve done two hackathons. They’re a lot of fun. They’re also a lot of work, but they’re really well organized. There’s a lot of upfront work that goes into them, and we work with different communities in the city. So those are really popular too.

Hugo: Fantastic. Do you also have other networking events, or sharing job postings, or that type of stuff?

Reshama: Right. We have a job board on our website where companies can post jobs, and we share them in a newsletter and also on Slack. As organizers, we really try to get to know our members, and talk to them, and create a space that’s welcoming and open.

Reshama: When I go to our events, we tend to have about 50 to 60 people, and about 90% are women, whereas when I go to regular Python data science events in the New York City community, they tend to be about 80% to 90% men. It’s just interesting that we are able to attract so many women to our events.

Hugo: I couldn’t agree more. I’ve got a relatively ill-formed question about how a lot of these aspects are cultural, and how part of our job is to push back by providing supportive and welcoming environments for everyone. We spoke about Jared Lander, and I had him on the podcast last year talking about how he’s gone about building data science communities with his meetup and his conference.

Hugo: Actually, in your post on why women are flourishing in the R community, you point out that at Jared’s conference, for example, he provides a very supportive and welcoming environment for a lot of underrepresented groups, including women. In terms of the speakers present, it’s almost at parity in terms of gender.

Reshama: It is. I have attended some of his meetups and I went to the R conference. The last one I attended was in 2016, and it’s really relaxed, and fun, and welcoming. He really puts on great events.

What does 2019 look like for WIMLDS?

Hugo: I couldn’t agree more. What does 2019 look like for WIMLDS? What type of stuff do you have going on?

Reshama: We are growing. I would say of our 37 chapters, 23 were added in this past year alone. So to sustain that growth, we are looking for a sponsor. I don’t know if you’re aware, but R-Ladies is supported by the R Consortium, and PyLadies is supported by the PSF. What support means on a fiscal basis is that they cover the meetup dues and provide Meetup Pro accounts, which makes it easier to manage all of the chapters and communicate with them, but they also provide conduits and relationships to other organizations within the community. Women in Machine Learning and Data Science is sort of in the middle… I don’t even know if there is a machine learning organization to connect it to, but we are looking for a sponsor.

Hugo: If any of our listeners were interested in checking out the details, where could they do that?

Reshama: For details about Women in Machine Learning and Data Science, our website is our acronym: wimlds.org. We are very active on Twitter under the same handle, WiMLDS, and you can email us at info@wimlds.org.

Reshama: We don’t have a landing page for that, but you can send us an email to our info email address and that gets forwarded automatically to all the board members.

Hugo: Yeah, perfect. So that’s kind of a brief rundown of what you’re working on. What else is on your wishlist for WiMLDS?

Reshama: So these are things that have been on my wishlist for a while, and I would love to make some progress on them. One of them, I mentioned, is the sponsorship and getting a Meetup Pro account, but I would also love to record our events. We have smaller chapters in, say, Michigan and Texas, and they don’t have access to all the speakers that we are so fortunate to have in New York City, or the Bay Area, or some of these large metropolitan areas. So I would love to record our events in the same way that Jared does for the R community, to be able to make that accessible to other chapters, and also to see what other chapters are doing. Our Paris chapter is very active too; fortunately I can understand the French. And then there’s getting access to space.

Reshama: Probably the most time we spend as meetup organizers is on getting space, and it would be great to have relationships with companies where that was more accessible. I think in the past, even a couple of years ago, we would promote conferences without really thinking as much about what their initiatives were and whether they had a code of conduct. So we want to get really intentional about working with conferences that have a code of conduct and that really want to bring more women into open source. Ultimately, all these initiatives go down the pipeline and bring women into open source machine learning, data science, and artificial intelligence.

How to promote women and gender minorities in data science and machine learning?

Hugo: So we mentioned that listeners could perhaps get involved in sponsorship, but I’m wondering what DataFramed listeners, both individuals and organizations, can do across the board and in general to support and promote women and gender minorities in data science and machine learning?

Reshama: I would say the most important thing that I’ve noticed from my interaction with the community as an organizer is: be intentional. I will hear from organizations, "We support diversity." It’s become such a hackneyed expression. Really be intentional about what it means to support diversity: what do you do, where do you need help, and how are you creating these inclusive spaces for women, which means giving women a chance to speak. I personally would like to hear companies say, "We support pay equity," and see some data that supports that.

Hugo: I think these types of statements are incredibly important, because, as you say, saying "We support pay equity," as opposed to "We support diversity," has a concrete outcome, right?

Reshama: Absolutely.

Intentionality

Hugo: So you mentioned the idea of being intentional, and intentionality, and one example you cited was having something concrete, like supporting pay equity. Is there anything else you want to say about intentionality in general and what it means on a daily basis for people?

Reshama: I want to point to Write/Speak/Code. Write/Speak/Code is an organization, also of meetup groups, and they do a national conference, and I would say they’ve really been at the forefront of what inclusion is. One of the other things about inclusion is using inclusive language. So often I hear, as an example, "you guys," and I think it’s important to be really, really careful of how we communicate with each other in the community, because the default is "you guys," but really, there are all sorts of different kinds of people.

Reshama: There’s a DISCOVER Cookbook that’s put out by NumFOCUS about making events more inclusive; actually research that and implement it. It doesn’t have to be all of the recommendations, but implement some of them, and reach out if you have questions. Also, collect and share data about who you are hiring and retaining, and be open to sharing publicly what’s working and what’s not, so companies can learn from each other. So often people reach out to us and say, "We want to hire more women," and "How can we hire more women software engineers, or women data scientists, or women in machine learning? We just can’t find them."

Reshama: Rather than saying, "We can’t find women," how about adjusting our environments so we can attract, hire, and retain more women? I think it’s important for companies to realize that people speak to each other. So when a culture is working, or if it’s not working, they communicate with each other at dinner, or via email, or via Slack, and it’s important to have a culture that people speak about positively.

Hugo: Absolutely, and I do think the one flip side of intentionality is also developing awareness around what you’re actually doing, what practices you have in place, and what their downstream effects are. That’s incredibly vague and abstract, so I’ll give one concrete example, which is the language of job listings. I’ll try to dig this up, but LinkedIn actually did several studies into how the language of job listings can affect the gender ratio of applicants, or something along those lines.

Reshama: That’s true, and there has been research that shows, for instance, that men will apply for a job if they fit 50% of the list of requirements, whereas women will only feel comfortable applying if they think they can meet 90% or more of the requirements. Knowing that research is out there, companies could adjust their job postings if they want to attract more women.

Diversity and Inclusion at NumFOCUS

Hugo: You mentioned NumFOCUS and the Diversity and Inclusion in Scientific Computing committee that you’re on, otherwise known as DISC, and I’m wondering how you think about diversity and inclusion at NumFOCUS as an organization and what type of initiatives you’re involved in there.

Reshama: I think they’re really, really cognizant of the importance of it and of how to move that initiative forward. They’ve been fortunate to receive a grant from the Moore Foundation, and they have all sorts of different initiatives. One of their projects has been to have a comprehensive code of conduct, which is interesting. I thought, "Oh, I’m sure somebody else has done it," but as I was researching it for the NumFOCUS committee, I was surprised to find that there is actually not much out there in terms of comprehensive codes of conduct, defining acceptable and unacceptable behavior, and how to report.

Reshama: That’s really critical, and I think that for other organizations outside of NumFOCUS and its projects, all of that information is publicly available. They can access it, reference it, and use it to make their communities more inclusive and diverse.

Code of Conduct

Hugo: Great, and maybe you can speak to the idea of a code of conduct in general and what it provides for a community at large.

Reshama: Oh, a code of conduct is so important. When I started organizing four years ago, we didn’t have a code of conduct, and there were some behaviors that I would observe at our meetup events that I wasn’t quite sure how to address. Then codes of conduct came along, and it became so much easier to say, "That behavior’s unacceptable, because it violates our code of conduct." Now I’ve moved to sharing the code of conduct before the event begins, so people know what behavior is expected. That’s been a tremendous help, and I think that for people who attend events, knowing that a code of conduct is important to the organization makes them feel comfortable attending and knowing what the culture will be.

Hugo: Once again, this comes back to being intentional, making things precise, and making things concrete. For example, with an idea like "We value diversity," it isn’t clear how you would implement anything around it, whereas writing a code of conduct puts precise steps down in a concrete document. Right?

Reshama: Absolutely. Actually, as a side note, I wanted to say something about diversity. As an example, we’ve had companies in the city, one major tech company, reach out to us and say, "We want to hire more women but we can’t find them." So we talked about an event, and after, you know, many emails and conversations, they said, "We just don’t have the budget for it." This is a tech company that provides three meals a day to its employees, and they’ve said that they want to hire more women, but then they say they don’t have the budget for it.

Hugo: Yeah, that seems hypocritical to say the least, or willing to talk the talk but not walk the walk, so to speak. Speaking of codes of conduct, you have a blog post on codes of conduct in general.

Reshama: Yes, I do. Yes.

Hugo: We’ll link to that in the show notes but I think that’s wonderful. So I thought maybe you could say a few words about it here on the show.

Reshama: Sure. So that code of conduct was published as we were doing the research for the NumFOCUS DISC committee, and because I knew that there weren’t a lot of comprehensive codes of conduct out there, I wanted to write about it so that other organizations would know that it existed and that it could help create a welcoming, inclusive, and professional environment for them.

Reshama: I really hope that people will take some time to read this blog, because it’s not one that has received that much attention, but it’s probably one of the blogs that I’ve written that’s the most important to me, and I think it’s because it has the potential to really help. I focus on data science and STEM associations, but really it’s applicable to a lot of other fields of work. Going from "Oh, we’re a welcoming community" to having a code of conduct is a more concrete action. It can prevent inappropriate behavior in the community, it encourages professional, inclusive behavior, and it provides a safe avenue for communicating violations.

Reshama: I’ve reviewed about 10 or 12 organizations that serve the data science community, in statistics, computer science, machine learning, physics, economics, or engineering, and all of these organizations, aside from NumFOCUS and the American Statistical Association, have very brief codes of conduct without much information. Total membership in these organizations is 400,000, so if this information gets out to the STEM communities, it really has the potential to have a favorable impact on inclusivity and professionalism in the entire STEM ecosystem, and also for generations to come.

Unconference and the Discover Cookbook

Hugo: As I said, we’ll include the link in the show notes, and if any listeners have any feedback or comments about it, please get in touch with both of us on Twitter or otherwise. So going back to NumFOCUS and DISC, we’ve talked about the code of conduct, but there are other initiatives, such as the Unconference and the DISCOVER Cookbook, and other things, right?

Reshama: About a year ago, in November of 2017, there was a DISC Unconference, and about 45 people came, mostly from the US, a few from Europe, a few from South America, and we worked on different committees. Some of those committees finalized or worked on the DISC events guide on how to make conferences and events more inclusive. There’s a directory of organizations that serve underrepresented groups. There were committees on getting started with open source, communicating feedback anonymously, and diversity metrics. So there were all sorts of initiatives, and it really is a great organization to get involved in.

Reshama: So many things that I’ve learned through the DISC committee and at the DISC Unconference have helped me in my work with my meetup groups as well as in open source.

Hugo: Great. What’s the cookbook?

Reshama: The cookbook. It’s a list of suggestions that can make conferences and events more inclusive.

Hugo: So Reshama, you worked on the cookbook with several people at NumFOCUS?

Reshama: Right. The cookbook is a resource that provides actionable items that communities can adopt to make their events and conferences more inclusive. Not everything is required; it’s separated by impact and effort, whether high impact and high effort, or low impact and low effort, and it gives really helpful suggestions about selecting venues, making food options accessible to people with different dietary restrictions, providing child care, grants and scholarships, how to select participants, event registration, and having name tags with pronouns. There are all sorts of options.

Hugo: So now I really want to jump in and talk about your recent blog post, "Why Are Women Flourishing In The R Community, But Lagging In Python?", and for the listener, I just want to give a bit of background on this post. Reshama and I, a couple of months ago, were chatting on a call about what a podcast conversation could look like around these issues. One of my favorite conferences in the world is SciPy, which happens in Austin, Texas every year.

Hugo: I was at a meeting at SciPy in July 2018, and the question came up of how to think more about inclusivity and diversity at SciPy, which is a big challenge. I kind of off the cuff mentioned that, anecdotally at least, the R community and R conferences seem to be doing a better job at inclusivity and diversity in general.

Inclusivity and Diversity in Python

Hugo: So there was a big conversation around that, and I actually asked Reshama about this, and she said, or you said, Reshama, "That’s a great question." I was just speaking to the listeners for a second. You then decided to go and look into it and published this blog post, which really, really took off.

Reshama: Right. It was sort of a comment at the end, and I had some ideas about why, and I thought, "Well, let me do the research and put it together in a blog post for our conversation today."

Hugo: So maybe you can tell us some of the key takeaways from your research and the post, and then we can discuss the responses.

Reshama: Sure. I would say one of the readers, Dylan Niederhut, really said it best. He wrote, "The key takeaway here seems to be that R has top-down institutional support for inclusion, where efforts in Python have less support and are less connected with each other." I think that pretty much sums it up.

Hugo: I think that’s a very concise and very pertinent takeaway. What does the data actually tell us or show us?

Reshama: There are all sorts of metrics. It’s not something that’s easily measurable, like taking a blood test and knowing exactly what the white blood cell count is. So the question is, how do you define inclusivity? One way is to look at open source contributions, the percentage of women who are members of the communities, whether that’s the general community or the women-in-this-programming-language community, the percentage of speakers, how many women submit talks or proposals, and how many women are involved in different places in the community.

Hugo: What does the breakdown look like between Python and R?

Reshama: The open source contributions for R are just so much higher. One of the things that I did discover in my research is that there’s not a lot of data out there about Python; the data is very limited, and I think that collecting and reporting data is indicative of how much diversity and inclusivity are a priority for that community.

Reshama: I think that Python is a bit lagging… Maybe they have the data, but it’s not really published. In terms of open source, R has four times the number of women package authors compared to Python.

Hugo: Okay. So one of the takeaways is that R has top-down institutional support for inclusion. What does that look like in terms of the community? Where does this institutional support come from?

Reshama: Wow. It goes all the way from the top. The R community has the R Consortium. I know you’re very involved in the R community, so I guess the R Consortium is sort of the equivalent of the Python Software Foundation. Would you agree with that?

Hugo: Yeah, absolutely. I think that’s a nice analog.

Reshama: Okay. All right. So the R Consortium has announced that R-Ladies is a top-level project for them. And what does that mean? What they do is provide a budget cycle for them, which I guess used to be renewed annually but is now set for three years, and, this is really impressive, they’ve given R-Ladies a voting seat on the R Consortium’s infrastructure steering committee.

Reshama: Comparatively speaking, the Python Software Foundation supports PyLadies fiscally. They pay the meetup dues, but only for the PyLadies chapters that are part of the Meetup Pro account. They haven’t really converted them; they’re sort of part of the Python Software Foundation, but there’s no structure there.

Hugo: We’ve discussed how a lot of these aspects are cultural and historical as well, and I actually have no idea whether what I’m about to ask is true or not, but my intuition, anecdotally at least, suggests that the R community historically has been a very statistical, academic, pedagogical community, which already has a lot more women involved, for example. Python historically and culturally comes from a software dev background, which doesn’t. So could these two factors play an important role?

Reshama: I really think yes, I would say they definitely play a role, but there is something that I do want to point out: both Python and R were released in the early nineties. Python was released in 1991 and R was released in 1993, and the women’s communities were founded about 20 years later.

Reshama: I think that’s interesting, that for both Python and R the groups were founded 20 years later. I think that whatever the educational background is, both communities felt the need to create these women’s groups to advance women in the respective programming languages. So that’s really important to consider.

Hugo: I agree. And the other question that springs to mind is that the R community is actually a lot more of a community than the Python community. My spidey sense, or whatever you want to call it, perks up when people talk about the Python community, because my question is whether you’re talking about the scientific Python community, which is actually a very different structure than the Python Software Foundation. So really the point is that SciPy is very different to PyCon, right?

Reshama: Yes. I have not attended either of them. I think one of the things about Python is that it’s used so extensively that there are all these different communities, but I think that can also be an asset for the community too. It doesn’t have to be a hindrance.

What needs to change in the Python community?

Hugo: So what would you like to see the Python community do in the coming years?

Reshama: I would like to see more collaboration between them, communication between them. It’s probably okay for them to retain their identity and their structure and all of that, but they can communicate with each other and merge some of their initiatives. This is the analogy I would use for the Python community versus the R community.

Reshama: Everyone is committed to diversity. Everybody wants inclusivity, and they’re all working hard, and they all have initiatives. But when I looked at the Python community, it’s like each group is in its own individual kayak: they’re paddling hard, everyone’s looking in the same direction, and they’re all moving there. The R community, on the other hand, has sort of invested in one really large boat, and they can move to their goal a lot faster with a lot fewer rowers.

Hugo: Right. And I think what you’re speaking to there is that everyone is on board the boat. That speaks once again to it having a strong sense of community there at the top as well.

Reshama: Yes. Yes. I think it’s more of a structural issue.

Favorite Data Science Technique

Hugo: Right. We’re going to wrap up in a minute, but I’ve got a couple of final questions, and one is something I ask a lot of my guests, which isn’t necessarily related to the rest of the conversation, because we haven’t talked a lot about the technical aspects of data science. So I’m just wondering, what’s one of your favorite data sciencey techniques or methodologies?

Reshama: My favorite one, I would say, is data visualization. I took a course when I was at Stern with Kristen Sosulski, who teaches a terrific data visualization course, and I realized after all my years of being a statistician that I was not presenting my data as optimally as I could have been. So visualization, I would say, is key.

Call to Action

Hugo: For sure. My final question is, do you have a final call to action for the listeners out there?

Reshama: Yes, I do. I would say that if you’re interested in starting a chapter of Women in Machine Learning and Data Science, feel free to reach out to us. Our email is info@wimlds.org. We have a starter kit and we are happy to answer any questions. I would also really love it if people read the blog post I wrote about codes of conduct for NeurIPS and other STEM organizations; it’s really critical and helpful information. NumFOCUS will be having nominations for its committees beginning in March, so I imagine they’ll post in February or so.

Reshama: I would say get involved in the various committees; that can really help wherever people are working, whether it’s at a company, in academia, or just running their own community… And sponsor: if you’re interested in sponsoring Women in Machine Learning and Data Science, also reach out to us at info@wimlds.org.

Hugo: So those are all fantastic final calls to action, and in terms of reading your blog posts, I’ll include them in the show notes and I’ll urge everyone when reading them to share them as loudly and as widely as possible, in particular on Twitter. I think as we’ve seen, Twitter is actually an incredible space to have these conversations, at least initially. There is a critical mass of people really willing to engage on Twitter about this as well.

Reshama: Absolutely. I love Twitter. However it’s designed, it’s a way to connect with people I would not normally get to know, either by meeting them in person or via some of the other social media platforms out there.

Hugo: Yeah. And even on these issues, which are particularly sensitive, and where on social media, as in a lot of other disciplines, you’d generally expect a lot of trolls to come out, a lot of these conversations stay amazingly civilized on Twitter, and constructive, I find.

Reshama: I absolutely agree. I think that’s because people are very committed and are interested in advancing the issues and making the community better for everyone. So I agree.

Hugo: Agreed. Reshama, it’s been such a pleasure having you on the show.

Reshama: You too. Thank you, Hugo, for the invite.

Hugo: Thank you.


To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R Journal Volume 10/2, December 2018 is out!

Mon, 02/25/2019 - 17:57

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

We forgot to say: R Journal Volume 10/2, December 2018 is out!

A huge thanks to the editors who work very hard to make this possible.

And a big “thank you” to the editors, referees, and journal for helping improve our note on pipes in R, and for including it.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

### Introducing Rank Data Analysis with Arkham Horror Data

Mon, 02/25/2019 - 17:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

Introduction

Last week I analyzed player rankings of the Arkham Horror LCG classes. This week I explain what I did in the data analysis. As I mentioned, this is the first time that I attempted inference with rank data, and I discovered how rich the subject is. A lot of the tools for the analysis I had to write myself, so you now have the code I didn’t have access to when I started.

This post will not discuss rank data modelling. Instead, it will cover what one may consider basic statistics and inference. The primary reference for what I did here is Analyzing and Modeling Rank Data, by John Marden. So far I’ve enjoyed his book and I may even buy a personal copy.

What is Rank Data?

Suppose we have objects we ask our study participants (also known as “judges”) to rank. For example, suppose we asked people to rank apples, oranges, and bananas. What we then get is a prioritization of these objects according to our judges. This could come in the form

(3, 1, 2)

and we interpret the number in each position as the ranking of the corresponding item. In this case, since the tuple is in the order of apples, oranges, and bananas, oranges received the highest ranking, bananas the second-highest, and apples the last position.

An alternative view of this data may be

⟨oranges, bananas, apples⟩

where the items are arranged in order of preference. This form of describing a ranking has its uses, but we will consider only the first form in this introduction.

Ranking data has the following characteristics distinguishing it from other data. First, the data is ordinal: all that matters is the order in which items were placed, not necessarily the numbers themselves. We could insist on writing the rank data above as, say, (30, 10, 20) and the information content would not have changed. (But of course we would never do this.) Second, every item gets a ranking. This excludes “choose your top 3 out of 50”-type questions, since not every item would receive a ranking (this is called an incomplete ranking and requires special care; I won’t discuss this type of data in this article). Finally, every item’s ranking is distinct; no ties are allowed.

Thus ranking data is distinct even from just ordinal data since data comes from judges in the form of a tuple, not just a single ordinal value. (Thus we would not consider, say, Likert scale responses as automatically being an instance of rank data.) An ideal method for rank data would account for this unique nature and exploit its features.

Basic Descriptive Statistics

From this point on I will be working with the Arkham Horror player class ranking data. I made the Timestamp column nonsense to anonymize the data. You can download a CSV file of the data from here, then convert it to a .Rda file with the script below (which is intended to be run as an executable):

#!/usr/bin/Rscript
################################################################################
# ArkhamHorrorClassPreferenceSurveyDataCleaner.R
################################################################################
# 2019-02-10
# Curtis Miller
################################################################################
# This file takes a CSV file read in and cleans it for later analysis, saving
# the resulting data in a .Rda file.
################################################################################

# optparse: A package for handling command line arguments
if (!suppressPackageStartupMessages(require("optparse"))) {
  install.packages("optparse")
  require("optparse")
}

################################################################################
# MAIN FUNCTION DEFINITION
################################################################################

main <- function(input, output = "out.Rda", help = FALSE) {
  input_file <- read.csv(input)
  input_columns <- names(input_file)
  arkham_classes <- c("Survivor", "Guardian", "Rogue", "Seeker", "Mystic")
  for (cl in arkham_classes) {
    names(input_file)[grepl(cl, input_columns)] <- cl
  }
  names(input_file)[grepl("Reason", input_columns)] <- "Reason"
  input_file$Reason <- as.character(input_file$Reason)
  input_file$Timestamp <- as.POSIXct(input_file$Timestamp,
                                     format = "%m/%d/%Y %H:%M:%S",
                                     tz = "MST")
  for (cl in arkham_classes) {
    input_file[[cl]] <- substr(as.character(input_file[[cl]]), 1, 1)
    input_file[[cl]] <- as.numeric(input_file[[cl]])
  }

  survey_data <- input_file
  save(survey_data, file = output)
}

################################################################################
# INTERFACE SETUP
################################################################################

if (sys.nframe() == 0) {
  cl_args <- parse_args(OptionParser(
    description = paste("Converts a CSV file with survey data ranking",
                        "Arkham Horror classes into a .Rda file with a",
                        "well-formatted data.frame"),
    option_list = list(
      make_option(c("--input", "-i"), type = "character",
                  help = "Name of input file"),
      make_option(c("--output", "-o"), type = "character", default = "out.Rda",
                  help = "Name of output file to create")
    )
  ))

  do.call(main, cl_args)
}

(The script with all the code for the actual analysis appears at the end of this article.)

The first statistic we will compute for this data is the marginals matrix. This matrix simply records the proportion of times an item received a particular ranking in the sample. If we want to get mathematical: if $\pi = (\pi_1, \ldots, \pi_K)$ is a ranking tuple with $\pi_i$ the ranking of the $i$th option, and the sample is $\pi^{(1)}, \ldots, \pi^{(N)}$, then the $(i, j)$ entry of the marginals matrix is

$\hat{M}_{ij} = \frac{1}{N} \sum_{n = 1}^{N} I_{\{\pi_i^{(n)} = j\}}$

where the function $I_{\{A\}}$ is 1 if $A$ is true and 0 otherwise. (Thus the sum above simply counts how many times $\pi_i^{(n)}$ was equal to $j$.)

The marginals matrix for the Arkham Horror data is given below

MARGINALS
---------
             1     2     3     4     5
Guardian 18.29 20.43 26.84 19.71 14.73
Mystic   19.71 18.29 17.81 20.90 23.28
Rogue    19.24 14.73 20.67 21.38 23.99
Seeker   28.03 25.18 17.10 18.53 11.16
Survivor 14.73 21.38 17.58 19.48 26.84

Below is a visual representation of the marginals matrix.

From the marginals matrix you could compute the vector representing the “mean” ranking of the data. For instance, the mean ranking of the Guardian class is the sum of the ranking numbers (column headers) times their respective proportions (in the Guardian row); here, that’s about 2.92 for Guardians. Repeat this process for every other group to get the mean ranking vector; here, the mean rank vector is approximately (2.92, 3.10, 3.16, 2.60, 3.22) (keeping the ordering of the classes suggested by the rows above, which is alphabetical order; this will always be the ordering I use unless otherwise stated). Of course this is not a ranking vector; rankings are integers. The corresponding ranking vector comes from ranking the means themselves; this gives a ranking vector of (2, 3, 4, 1, 5).
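To make the computation concrete, here is a small sketch (my own illustrative code, not the script from the original analysis) that builds a marginals matrix and the mean rank vector from a hypothetical matrix of rankings, one row per judge:

```r
# Hypothetical rankings: one row per judge, one column per class
# (alphabetical order), 1 = most preferred.
rankings <- rbind(c(3, 4, 5, 1, 2),
                  c(2, 5, 4, 1, 3),
                  c(1, 3, 4, 2, 5))
colnames(rankings) <- c("Guardian", "Mystic", "Rogue", "Seeker", "Survivor")

# Marginals matrix: row i, column j is the proportion of judges
# who gave item i the rank j.
marginals <- function(r) {
  K <- ncol(r)
  t(apply(r, 2, function(item) tabulate(item, nbins = K) / nrow(r)))
}

M <- marginals(rankings)

# "Mean" rank of each item: ranks weighted by their proportions,
# which is the same as the column means of the raw rankings.
mean_ranks <- as.numeric(M %*% seq_len(ncol(rankings)))
```

Each row of `M` sums to one, and the weighted sum of ranks reproduces the ordinary column means of the raw ranking matrix.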

I don’t like inference using the mean ranking vector. As mentioned above, this data is ordinal; the magnitude of the numbers themselves should not matter. We could replace 1, 2, 3, 4, 5 with 1, 10, 100, 1000, 10000 and the data would mean the same thing. That is not the case if you’re using the mean rank, unless you first apply a transformation to the rankings. In short, I don’t think the mean ranking vector respects the nature of the data well. And since the marginals matrix is closely tied to this notion of “mean,” I don’t think the matrix is fully informative.
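A tiny hypothetical example of the problem: a monotone relabeling of the ranks preserves every individual judge's ordering yet changes which item has the best mean.

```r
# Three judges rank items A, B, C (1 = best); one row per judge.
r <- rbind(c(1, 2, 3),
           c(1, 2, 3),
           c(3, 2, 1))

which.min(colMeans(r))  # item A wins (mean ranks 5/3, 2, 7/3)

# Relabel the ranks 1, 2, 3 as 1, 10, 100; each judge's ordering of
# the items is completely unchanged...
relabeled <- matrix(c(1, 10, 100)[r], ncol = 3)

which.min(colMeans(relabeled))  # ...but item B now has the best mean
```

This is exactly why a mean-based summary quietly depends on the arbitrary numeric labels attached to the ranks.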

Another matrix providing descriptive statistics is the pairs matrix. This matrix records the proportion of respondents who preferred one option to the other (specifically, the row option to the column option). Mathematically, the $(i, j)$ entry of the pairs matrix is

$\hat{K}_{ij} = \frac{1}{N} \sum_{n = 1}^{N} I_{\{\pi_i^{(n)} < \pi_j^{(n)}\}}$

The pairs matrix for the Arkham Horror data is below:

PAIRS
-----
         Guardian Mystic Rogue Seeker Survivor
Guardian     0.00  54.16 55.34  42.52    55.82
Mystic      45.84   0.00 51.07  39.90    53.44
Rogue       44.66  48.93  0.00  38.72    51.54
Seeker      57.48  60.10 61.28   0.00    61.52
Survivor    44.18  46.56 48.46  38.48     0.00

First, notice that the diagonal entries are all zero; this will always be the case. Second, the pairs matrix is essentially completely determined by the entries above the diagonal. Other forms of inference use these upper-diagonal entries and ignore the lower-diagonal entries, since the latter give no new information. The number of upper-diagonal entries is $\binom{5}{2} = 10$, which is the number of ways to pick pairs of classes.
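As a sketch (again my own code, with a hypothetical rankings matrix), the pairs matrix can be computed directly from its definition:

```r
# Hypothetical rankings: one row per judge, 1 = most preferred.
rankings <- rbind(c(3, 4, 5, 1, 2),
                  c(2, 5, 4, 1, 3),
                  c(1, 3, 4, 2, 5))
colnames(rankings) <- c("Guardian", "Mystic", "Rogue", "Seeker", "Survivor")

# Pairs matrix: entry (i, j) is the proportion of judges who ranked
# item i above item j (i.e. gave item i the smaller rank number).
pairs_matrix <- function(r) {
  K <- ncol(r)
  P <- matrix(0, K, K, dimnames = list(colnames(r), colnames(r)))
  for (i in seq_len(K)) {
    for (j in seq_len(K)) {
      if (i != j) P[i, j] <- mean(r[, i] < r[, j])
    }
  }
  P
}

P <- pairs_matrix(rankings)
```

Because ties are not allowed, `P[i, j] + P[j, i]` is exactly 1 for every off-diagonal pair, which is why only the upper triangle carries information.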

The pairs matrix for the Arkham Horror data is visualized below.

With the pairs matrix, crossing above or below 50% of the sample being in the bin is a significant event; it indicates which classes are preferred to the other. In fact, by counting how many times this threshold was crossed, we can estimate that the overall favorite class is the Seeker class, followed by Guardians, then Mystics, then Rogues, and finally Survivors. This is another estimate of the “central”, “modal”, or “consensus” ranking. (This agrees with the “mean” ranking, but that’s not always going to be the case; the metrics can disagree with each other.)

While I did not like the marginals matrix I do like the pairs matrix; I feel as if it accounts for the features of rank data I want any measures or inference to take account of. It turns out that the pairs matrix is also related to my favorite distance metric for analyzing rank data.

Distance Metrics for Rank Data

A distance metric is a generalized notion of distance, or “how far away” two objects $x$ and $y$ are. In order for a function $d$ to be a metric, it must have the following properties:

1. $d(x, y) \geq 0$ for all $x$ and $y$.
2. $d(x, y) = 0$ if and only if $x = y$.
3. $d(x, y) = d(y, x)$ for all $x$ and $y$.
4. $d(x, z) \leq d(x, y) + d(y, z)$ for all $x$, $y$, and $z$ (the “triangle
inequality”)

The notion of distance you use in every-day life, the one taught in middle-school geometry and computed whenever you use a ruler, is known as Euclidean distance. It’s not the only notion of distance, though, and may not be the only distance function you use in real-life. For instance, Manhattan or taxi cab distance is the distance from one point to another when you can only make 90-degree turns and is the distance that makes the most sense when travelling in the city.

There are many distance metrics we could consider when working with rank data. The Spearman distance is the square of the Euclidean distance, while the footrule distance corresponds to the Manhattan distance. It turns out that the mean rank vector above minimizes the sum of Spearman distances. The distance metric I based my analysis on, though, was the Kendall distance. I like this distance metric since it is not connected to the mean, and it considers the distance between rankings that disagree on many pairs of items, such as $(1, 2, 3, 4, 5)$ and $(5, 2, 3, 4, 1)$, to be greater than the distance between rankings that differ only by swapping an adjacent pair, such as $(1, 2, 3, 4, 5)$ and $(2, 1, 3, 4, 5)$ (unlike, say, the Hamming distance, which gives the same distance in either case, since each pair of rankings differs in exactly two positions).

Kendall’s distance even has an interpretation. Suppose that two ranking tuples are seen as orderings of books on a bookshelf, and we want to go from one ordering of books to the other. The Kendall distance is how many times we would need to switch adjacent pairs of books (chosen well, so as not to waste time and energy) to go from one ordering to the other. Thus the Kendall distance between $(1, 2, 3, 4, 5)$ and $(2, 1, 3, 4, 5)$ is one; we only need to make one swap. The distance between $(1, 2, 3, 4, 5)$ and $(5, 2, 3, 4, 1)$, in comparison, is seven, since we need to make seven swaps.
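A small sketch of the Kendall distance as a count of discordant pairs, which coincides with the adjacent-swap interpretation (the example tuples here are mine):

```r
# Kendall distance between two rankings: the number of pairs of items
# the two rankings order differently, which equals the number of
# adjacent swaps needed to turn one ranking into the other.
kendall_dist <- function(a, b) {
  idx <- combn(length(a), 2)
  sum(sign(a[idx[1, ]] - a[idx[2, ]]) != sign(b[idx[1, ]] - b[idx[2, ]]))
}

kendall_dist(1:5, c(2, 1, 3, 4, 5))  # one adjacent swap: distance 1
kendall_dist(1:5, c(5, 2, 3, 4, 1))  # distance 7
kendall_dist(1:5, 5:1)               # full reversal: choose(5, 2) = 10
```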

It also turns out that the Kendall distance is related to the pairs matrix. The average Kendall distance of the judges from any chosen ranking $\omega$ is

$\sum_{i, j :\, \omega_i > \omega_j} \hat{K}_{i,j}$

(There is a similar expression relating the Spearman distance to the marginal matrix.)

Central Ranking Estimator

Once we have a distance metric $d$, we can define what the “best” estimate for the most central ranking is. The central ranking is the ranking $\hat{\pi}$ that minimizes

$\sum_{n = 1}^{N} d(\pi^{(n)}, \hat{\pi})$

In other words, the most central ranking minimizes the sum of distances of all the rankings in the data to that ranking.

Sometimes this ranking has already been determined. For instance, when using the Spearman distance, the central ranking emerges from the “mean” rankings. Otherwise, though, we may need to apply some search procedure to find this optimal ranking.

Since we’re working with rank data, though, it’s very tempting to not use any fancy optimization algorithms and simply compute the sum of distances for every possible ranking. This isn’t a bad idea at all if the number of items being ranked is relatively small. Here, since there are five items being ranked, the number of possible rankings is $5! = 120$, which is not too big for a modern computer to handle. It may take some time for the exhaustive search approach to yield an answer, but the answer produced by exhaustive search comes with the reassurance that it does, in fact, minimize the sum of distances.

This is in fact what I did for estimating the central ranking when minimizing the sum of Kendall distances from said ranking. The resulting ranking, again, was Seeker/Guardian/Mystic/Rogue/Survivor (which agrees with what we determined just by looking at the pairs matrix; this likely is not a coincidence).
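An exhaustive-search sketch (my own code with a toy three-item sample, not the original analysis script): enumerate every candidate ranking and keep the one minimizing the total Kendall distance to the sample.

```r
# Kendall distance (number of discordant pairs) between two rankings.
kendall_dist <- function(a, b) {
  idx <- combn(length(a), 2)
  sum(sign(a[idx[1, ]] - a[idx[2, ]]) != sign(b[idx[1, ]] - b[idx[2, ]]))
}

# All permutations of a vector, one permutation per row.
perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  do.call(rbind, lapply(seq_along(v), function(i) cbind(v[i], perms(v[-i]))))
}

# Central ranking: the candidate minimizing the sum of distances
# to every observed ranking.
central_ranking <- function(rankings) {
  candidates <- perms(seq_len(ncol(rankings)))
  totals <- apply(candidates, 1, function(p)
    sum(apply(rankings, 1, kendall_dist, b = p)))
  candidates[which.min(totals), ]
}

# Toy sample of three judges ranking three items.
rankings <- rbind(c(1, 2, 3),
                  c(1, 3, 2),
                  c(2, 1, 3))
central_ranking(rankings)  # (1, 2, 3): total distance 2, lower than any rival
```

For the five-class data the candidate matrix simply has 120 rows, so the same brute force applies.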

Statistical Inference

All of the above I consider falling into the category of descriptive statistics. It describes aspects of the sample without attempting to extrapolate to the rest of the population. With statistical inference we want to see what we can say about the population as a whole.

I should start by saying that the usual assumptions made in statistical inference are likely not satisfied by my sample. It was an opt-in sample; people chose to participate. That alone makes it a non-random sample. Additionally, only participants active on Facebook, Reddit, Twitter, Board Game Geek, and the Fantasy Flight forums were targeted by my advertising of the poll. Thus the respondents were likely Arkham players active on the Internet, likely at a particular time of day and day of the week (given how these websites try to push older content off the main page). They were likely young, male, and engaged enough in the game to be in the community (and unlikely to be “casual” players). Thus the participants are likely to be more homogeneous than the population of Arkham Horror players overall.

Just as a thought experiment, what would be a better study, one where we could feel confident in the inferential ability of our sample? Well, we would grab randomly selected people from the population (perhaps from pulling random names from the phone book), have them join our study, teach them how to play the game, make them play the game for many hours until they could form an educated opinion of the game (probably at least 100 hours), then ask them to rate the classes. This would be high-quality data and we could believe the data is reliable, but damn would it be expensive! No one at FFG would consider data of that quality worth the price, and frankly neither would I.

Having said that, while the sample I have is certainly flawed in how it was collected, I actually believe we can get good results from it. The opinions of the participants are likely educated ones, so we probably still have a good idea how the Arkham Horror classes compare to one another.

In rank data analysis there is a probability model called the uniform distribution that serves as a starting point for inference. Under the uniform distribution, every ranking vector is equally likely to be observed; in short, there is no preference among the judges among the choices. The marginals matrix should have all entries equal to $1/5$, all off-diagonal entries of the pairs matrix should be $1/2$, and any “central” ranking is meaningless since every ranking is equally likely to be seen. According to the uniform distribution, each of the $5! = 120$ possible rankings has probability $1/120$. If we cannot distinguish our data from data drawn from the uniform distribution, our work is done; we basically say there is no “common” ranking scheme and go about our day.

There are many tests for checking for the uniform distribution, and they are often based on the statistics we’ve already seen, such as the mean rank vector, the marginals matrix, and the pairs matrix. If the number of possible rankings is small enough relative to the sample size, we could even just base a test off of how frequently each particular ranking was seen. A test based off the latter could detect any form of non-uniformity in the data, while tests based off the marginals or pairs matrices or the mean vector cannot detect all forms of non-uniformity; that said, they often require much less data to perform.
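For instance (a hypothetical sketch with three items, so only $3! = 6$ rankings exist), the frequency-based test is just a chi-squared goodness-of-fit test against equal expected counts:

```r
set.seed(1)

# Simulate 600 judges ranking three items uniformly at random;
# encode each ranking tuple as a string like "213".
obs <- replicate(600, paste(sample(3), collapse = ""))

all_rankings <- c("123", "132", "213", "231", "312", "321")
counts <- table(factor(obs, levels = all_rankings))

# Null hypothesis: all six rankings are equally likely (expected
# count 100 each); chisq.test uses equal probabilities by default.
test <- chisq.test(counts)
test$p.value
```

With five items this table would have 120 cells, which is why the marginals- and pairs-based tests become attractive at modest sample sizes.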

As mentioned, I like working with the pairs matrix/Kendall distance. The statistical test, though, involves a vector $\hat{\kappa}$, which is the aforementioned upper triangle of the pairs matrix (excluding the diagonal entries, which are always zero). (More specifically, $\hat{\kappa}$ is a vector containing the upper-diagonal entries of the pairs matrix laid out in row-major form.)

The test decides between

$H_0: \kappa = (1/2, \ldots, 1/2)$ versus $H_A: \kappa \neq (1/2, \ldots, 1/2)$

The test statistic is

$N (\hat{\kappa} - \kappa_0)^T \hat{\Sigma}^{-1} (\hat{\kappa} - \kappa_0)$

where $\kappa_0 = (1/2, \ldots, 1/2)$ and $\hat{\Sigma}$ is the sample covariance matrix of the pairwise-indicator vectors described below. If the null hypothesis is true, then the test statistic, for large $N$, approximately follows a $\chi^2$ distribution with $\binom{K}{2}$ degrees of freedom. (For the Arkham Horror classes case, $\binom{5}{2} = 10$.) Large test statistics are evidence against the null hypothesis, so $p$-values are the area underneath the $\chi^2$ curve to the right of the test statistic.

For our data set, the reported test statistic was 2309938376; not shockingly, the corresponding $p$-value is essentially zero. So the data was not drawn from the uniform distribution: Arkham Horror players do have class preferences.
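Here is my reconstruction of such a pairs-based test on simulated data (the exact statistic used in the original analysis may differ): convert each ranking into its vector of pairwise indicators, then compare the mean vector $\hat{\kappa}$ to $(1/2, \ldots, 1/2)$ with a quadratic form referred to a $\chi^2$ distribution with $\binom{5}{2} = 10$ degrees of freedom.

```r
set.seed(42)

K <- 5
N <- 500

# Simulate N judges with uniformly random rankings of K items.
rankings <- t(replicate(N, sample(K)))

# Pairwise indicator vector for one ranking: for each pair i < j,
# 1 if item i is ranked above item j, else 0.
pair_indicators <- function(p) {
  idx <- combn(length(p), 2)
  as.numeric(p[idx[1, ]] < p[idx[2, ]])
}

V <- t(apply(rankings, 1, pair_indicators))  # N x choose(K, 2) matrix
kappa_hat <- colMeans(V)
Sigma_hat <- cov(V)

# Quadratic-form statistic; approximately chi-squared with
# choose(K, 2) degrees of freedom under uniformity.
dev <- kappa_hat - 0.5
stat <- as.numeric(N * t(dev) %*% solve(Sigma_hat) %*% dev)
p_value <- pchisq(stat, df = choose(K, 2), lower.tail = FALSE)
```

On uniform simulated data the statistic stays modest; on strongly patterned data like the Arkham sample it explodes, which matches the enormous reported value.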

But what are plausible preferences players could have? We can answer this using a confidence interval. Specifically, we want to know what rankings are plausible, and thus what we want is a confidence set of rankings.

Finding a formula for a confidence set of the central ranking directly is extremely hard, but it’s not as hard to form one for one of the statistics we can compute from the rankings, then use the possible values of that statistic to find the corresponding plausible central rankings. For example, one could find a confidence set for the mean ranking vector, then translate those mean rankings into ranking vectors (this is what Marden did in his book).

As I said before, I like the pairs matrix/Kendall distance in the rank data context, so I want to form a confidence set for $\kappa$, the population equivalent of $\hat{\kappa}$, the key entries of the pairs matrix. To do this, we cannot view the rank data the same way we did before; instead of seeing the 5-dimensional ranking vector $\pi$, we need to see the equivalent $\binom{5}{2} = 10$-dimensional vector that consists only of ones and zeros and records the pair-wise relationships among the ranks, rather than the ranks themselves (for one ranking in the data, this vector literally says that item one is not ranked higher than item two, item one is ranked higher than item three, same for four, same for five, then that item two is ranked higher than item three, same for four, same for five, and so on, finally saying in its last entry that item four is ranked higher than item five).

We first compute $\hat{\kappa}$ by taking the means of these vectors. Then we compute the sample covariance matrix of the vectors; call it $\hat{\Sigma}$. Then a $100(1 - \alpha)\%$ confidence set for the true $\kappa$, appropriate for large sample sizes, is:

$\left\{ \kappa : N (\hat{\kappa} - \kappa)^T \hat{\Sigma}^{-1} (\hat{\kappa} - \kappa) \leq \chi^2_{J, \alpha} \right\}$

where $\chi^2_{J, \alpha}$ is the $100(1 - \alpha)$ percentile of the $\chi^2$ distribution with $J = \binom{5}{2} = 10$ degrees of freedom.

The region I’ve just described is a 10-dimensional ellipsoid, a football-like shape that lives in a space with (probably) more than three dimensions. It sounds daunting, but one can still figure out what rankings are plausible once this region is computed. The trick is to work with each of the coordinates of the vector $\kappa$ and determine whether there is a $\kappa$ in the ellipsoid where that coordinate is 1/2. If the answer is no, then the value of that coordinate, for all $\kappa$ in the ellipsoid, is either always above or always below 1/2. You can then look to $\hat{\kappa}$ (which is in the dead center of the ellipsoid) to determine which is the case.

What’s the significance of this? Let’s say that you listed all possible rankings in a table, and suppose you did this procedure for the coordinate of $\kappa$ corresponding to the Seeker/Rogue pair. If you determine that this coordinate is never 1/2 and that every $\kappa$ in the ellipsoid ranks Seekers above Rogues, then you would take your list of rankings and remove all rankings that place Rogues before Seekers, since these rankings are not in the confidence set.

If you do find a $\kappa$ in the ellipsoid where the selected coordinate is 1/2, then you would not eliminate any rows in your list of rankings, since you know that your confidence set must include some rankings that rank the two items one way and some rankings where the items are ranked the opposite way.

Repeat this procedure with every coordinate of $\kappa$ (that is, every possible pairing of choices) and you then have a confidence set for central rankings.

Determining whether there is a vector in the ellipsoid with a selected coordinate valued at 1/2 can be done via optimization. That is, find the $\kappa$ that minimizes $N (\hat{\kappa} - \kappa)^T \hat{\Sigma}^{-1} (\hat{\kappa} - \kappa)$ subject to the constraint that the selected coordinate of $\kappa$ equals 1/2. You don’t even need fancy minimization algorithms for doing this; the minimum can, in principle, be computed analytically with multivariate calculus. After you have found a minimizing $\kappa$, determine the value of the quadratic form at that $\kappa$. If it is less than $\chi^2_{J, \alpha}$, then you have found a $\kappa$ in the ellipsoid; otherwise, you know there is no such $\kappa$.
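The analytic minimum is simple here. By a Lagrange-multiplier derivation of mine (assuming $\hat{\Sigma}$ is positive definite), minimizing the quadratic form subject to the single constraint $\kappa_m = 1/2$ gives the value $N (\hat{\kappa}_m - 1/2)^2 / \hat{\Sigma}_{mm}$, so the check reduces to one comparison:

```r
# Can coordinate m of kappa equal 1/2 somewhere inside the
# 100(1 - alpha)% confidence ellipsoid? By the closed-form constrained
# minimum, yes iff N * (kappa_hat[m] - 1/2)^2 / Sigma_hat[m, m] is at
# most the chi-squared critical value with J degrees of freedom.
coord_can_be_half <- function(kappa_hat, Sigma_hat, N, m, alpha = 0.05) {
  J <- length(kappa_hat)
  crit <- qchisq(1 - alpha, df = J)
  N * (kappa_hat[m] - 0.5)^2 / Sigma_hat[m, m] <= crit
}
```

If this returns FALSE, the sign of `kappa_hat[m] - 1/2` tells you which of the two items is preferred everywhere in the ellipsoid.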

This was the procedure I used on the Arkham Horror class ranking data. The 95% confidence set so computed determined that Seekers were ranked higher than Rogues and Survivors. That means that Seekers cannot have a ranking worse than 3, and Rogues and Survivors cannot have rankings better than 2. Any ranking consistent with these constraints, though, is a plausible population central ranking. In fact, this procedure suggested that all the rankings below are plausible central population rankings:

   Guardian Mystic Rogue Seeker Survivor
1         1      2     4      3        5
2         1      2     5      3        4
3         1      3     4      2        5
4         1      3     5      2        4
5         1      4     3      2        5
6         1      4     5      2        3
7         1      5     3      2        4
8         1      5     4      2        3
9         2      1     4      3        5
10        2      1     5      3        4
11        2      3     4      1        5
12        2      3     5      1        4
13        2      4     3      1        5
14        2      4     5      1        3
15        2      5     3      1        4
16        2      5     4      1        3
17        3      1     4      2        5
18        3      1     5      2        4
19        3      2     4      1        5
20        3      2     5      1        4
21        3      4     2      1        5
22        3      4     5      1        2
23        3      5     2      1        4
24        3      5     4      1        2
25        4      1     3      2        5
26        4      1     5      2        3
27        4      2     3      1        5
28        4      2     5      1        3
29        4      3     2      1        5
30        4      3     5      1        2
31        4      5     2      1        3
32        4      5     3      1        2
33        5      1     3      2        4
34        5      1     4      2        3
35        5      2     3      1        4
36        5      2     4      1        3
37        5      3     2      1        4
38        5      3     4      1        2
39        5      4     2      1        3
40        5      4     3      1        2

The confidence set, by design, is much less bold than just an estimate of the most central ranking. Our set suggests that there’s a lot we don’t know about what the central ranking is; we only know that whatever it is, it ranks Seekers above Rogues and Survivors.

The confidence set here is at least conservative in that it could perhaps contain too many candidate central rankings. I don’t know for sure whether we could improve on the set and eliminate more rankings from the plausible set by querying more from the confidence set for $\kappa$. Perhaps there are certain combinations that cannot exist, like rankings that give both Seekers and Guardians a high ranking at the same time. If I were a betting man, though, I’d bet that the confidence set found with this procedure could be improved, in that not every ranking vector in the resulting set corresponds with a $\kappa$ in the original ellipsoidal confidence set. Improving this set, though, would take a lot of work, as one would have to consider multiple coordinates of potential $\kappa$ simultaneously, then find a rule for eliminating ranking vectors based on the results.

Clustering

Matt Newman, the lead designer of Arkham Horror: The Card Game, does not believe all players are the same. Specifically, he believes that there are player types that determine how they like to play. In statistics we might say that Matt Newman believes that there are clusters of players within any sufficiently large and well-selected sample of players. This suggests we may want to perform cluster analysis to find these sub-populations.

If you haven’t heard the term before, clustering is the practice of finding “similar” data points, grouping them together, and identifying them as belonging to some sub-population for which no label was directly observed. It’s not unreasonable to believe that these sub-populations exist and so I sought to do clustering myself.

There are many ways to cluster. Prof. Malden said that a clustering of rank data should minimize the sum of the distances of each observation from its assigned cluster’s center. However, he did not suggest a good algorithm for finding such clusters. He did note that for small samples, a small number of items, and a small number of clusters, we could exhaustively search for the optimal clustering, though that quickly becomes impractical.
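Concretely, Malden’s criterion is a sum of Kendall distances (the number of pairs of items two rankings order differently). Here is a minimal base R sketch of that score; the helper names `kendall_dist` and `cluster_score` are mine, not Malden’s or the script’s (the script below uses `rankdist::DistancePair` for the same distance):

```r
# Kendall distance between two rankings: the number of pairs of items the
# two rankings order differently
kendall_dist <- function(r1, r2) {
  m <- length(r1)
  sum(sapply(1:(m - 1), function(i) {
    sum((r1[i] - r1[(i + 1):m]) * (r2[i] - r2[(i + 1):m]) < 0)
  }))
}

# Malden-style clustering score: total Kendall distance of each observation
# from its assigned cluster's center; a good clustering makes this small
cluster_score <- function(mat, centers, clusters) {
  sum(sapply(seq_len(nrow(mat)), function(i) {
    kendall_dist(mat[i, ], centers[clusters[i], ])
  }))
}
```

For example, `kendall_dist(1:3, 3:1)` counts all three pairs as discordant, the largest possible distance for three items.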

I initially attempted a k-means-type algorithm for finding good clusters, one that used the Kendall distance rather than the Euclidean distance, but unfortunately I could not get the algorithm to give good results. I don’t know whether I have errors in my code (listed below) or whether the algorithm just doesn’t work for Kendall distances, but it didn’t work; in fact, it would take a good clustering and make it worse! I eventually abandoned my home-brewed k-centers algorithm (and the hours of work that went into it) and just used spectral clustering.

Spectral clustering isn’t easily described, but the idea of spectral clustering is to find groups of data that a random walker, walking from point to point along a weighted graph, would spend a long time in before moving to another group. (That’s the best simplification I can make; the rest is linear algebra.) In order to do spectral clustering, one must have a notion of “similarity” of data points. “Similarity” roughly means the opposite of “distance”; in fact, if you have a distance metric (and we do here), you can find a similarity measure by subtracting all distances from the maximum distance between any two objects. Similarity measures are not as strictly defined as distance metrics; any function that gives two “similar” items a high score and two “dissimilar” items a low score could be considered a similarity function.

Spectral clustering takes a matrix of similarity measures, computed for each pair of observations, and spits out cluster assignments. But in addition to the similarity measure, we need to decide how many clusters to find.
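The pipeline — similarity matrix in, cluster labels out — can be sketched in a few lines of base R. To be clear, this is my own illustration of the idea (unnormalized graph Laplacian, eigenvector embedding, k-means), not the `spectralClustering` function from the anocva package that the script below actually uses:

```r
# Minimal spectral clustering sketch: embed observations using the
# eigenvectors of the graph Laplacian's k smallest eigenvalues, then
# run k-means on the embedding. sim must be symmetric and nonnegative.
spectral_cluster <- function(sim, k) {
  deg <- rowSums(sim)
  L <- diag(deg) - sim               # unnormalized graph Laplacian
  eig <- eigen(L, symmetric = TRUE)  # eigenvalues come in decreasing order
  n <- nrow(sim)
  embedding <- eig$vectors[, (n - k + 1):n, drop = FALSE]
  kmeans(embedding, centers = k, nstart = 10)$cluster
}

# "Similarity" as described above: maximum distance minus distance
dist_to_sim <- function(d) max(d) - d

# Two loose groups of points on a line; a random walk lingers within groups
d <- as.matrix(dist(c(0, 0.1, 0.2, 5, 5.1, 5.2)))
cl <- spectral_cluster(dist_to_sim(d), k = 2)
```

On this toy distance matrix the two groups come back as two clusters, whatever integer labels k-means happens to assign them.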

I find determining the “best” number of clusters the hardest part of clustering. We could have only one cluster containing all our data; this is what we start with. We could also assign each data point to its own cluster; our aforementioned measure of cluster quality would then be zero, which would be great if it weren’t for the fact that our clusters would mean nothing!

One approach people use for determining how many clusters to pick is the so-called elbow method. You plot, say, Malden’s metric against the number of clusters and look for the “elbow” in the plot; the elbow corresponds to the “best” number of clusters.
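To illustrate the heuristic on something self-contained, here is the same idea using base R’s `kmeans` and its total within-cluster sum of squares as a stand-in for Malden’s metric, on synthetic data (not the survey) with three obvious groups:

```r
set.seed(42)
# Toy data: three well-separated groups of 20 points each in 2D
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))

# Clustering score for k = 1..8 clusters
scores <- sapply(1:8, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# The elbow: the score drops sharply up to k = 3, then flattens out
plot(1:8, scores, type = "b", xlab = "Number of clusters", ylab = "Score")
```

With three true groups, the drop from two clusters to three dwarfs the drop from three to four, which is exactly the kink the elbow method looks for.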

Here’s the corresponding plot for this data set:

If you’re unsure where the “elbow” of the plot is, that’s okay; I’m not sure either. My best guess is that it’s at five clusters; hence my choice of five clusters.

Another plot that people use is the silhouette plot, explained quite well by the scikit-learn documentation. The silhouette plot for the clustering found by spectral clustering is shown below:
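The quantity behind that plot is simple to compute by hand: a point’s silhouette width is (b − a) / max(a, b), where a is its mean distance to the rest of its own cluster and b is its smallest mean distance to another cluster. A base R sketch (helper name mine, and it assumes every cluster has at least two members):

```r
# Silhouette width of each observation, given a distance matrix d and a
# vector of cluster assignments. Widths near 1 are good; negative widths
# suggest the point is in the wrong cluster.
silhouette_widths <- function(d, clusters) {
  sapply(seq_along(clusters), function(i) {
    own <- clusters == clusters[i] & seq_along(clusters) != i
    a <- mean(d[i, own])
    b <- min(sapply(setdiff(unique(clusters), clusters[i]),
                    function(cl) mean(d[i, clusters == cl])))
    (b - a) / max(a, b)
  })
}

# Two tight, well-separated groups: every width is high
d <- matrix(10, 6, 6); d[1:3, 1:3] <- 1; d[4:6, 4:6] <- 1; diag(d) <- 0
sw_good <- silhouette_widths(d, c(1, 1, 1, 2, 2, 2))
# Misassign point 3: its width goes negative, flagging the mistake
sw_bad <- silhouette_widths(d, c(1, 1, 2, 2, 2, 2))
```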

Is this a good silhouette plot? I’m not sure. It’s not the worst silhouette plot I saw for this data set but it’s not as good as examples shown in the scikit-learn documentation. There are observations that appear to be in the wrong cluster according to the silhouette analysis. So… inconclusive?

I also computed the Dunn index of the clusters. I never got a value greater than 0.125. All together, these methods lead me to suspect that there are no meaningful clusters in this data set, at least none that can be found with this approach.
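For reference, the Dunn index is the smallest between-cluster distance divided by the largest within-cluster distance, so higher is better; a value like 0.125 says some pair of clusters sits closer together than the widest cluster is wide. A base R sketch (helper name mine):

```r
# Dunn index: min distance between points in different clusters, divided by
# max distance between points sharing a cluster. Higher = better separation.
dunn_index <- function(d, clusters) {
  same <- outer(clusters, clusters, "==")
  off_diag <- row(d) != col(d)
  min(d[!same]) / max(d[same & off_diag])
}

d <- matrix(10, 6, 6); d[1:3, 1:3] <- 1; d[4:6, 4:6] <- 1; diag(d) <- 0
dunn_index(d, c(1, 1, 1, 2, 2, 2))  # well-separated clusters score high
dunn_index(d, c(1, 2, 1, 2, 1, 2))  # interleaved clusters score low
</imports>
```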

But people like cluster analysis, so if you’re one of those folks, I have results for you.

CLUSTERING
----------
Counts:
Cluster
  1   2   3   4   5
130  83  80  66  62

Centers:
   Guardian Mystic Rogue Seeker Survivor
 1  3  2  4  1  5
 2  3  5  4  1  2
 3  3  4  1  2  5
 4  1  5  3  4  2
 5  5  1  4  3  2

Score: 881

CLUSTER CONFIDENCE INTERVALS
----------------------------

Cluster 1:

With 95% confidence:
Guardian is better than Rogue
Guardian is better than Survivor
Mystic is better than Rogue
Mystic is better than Survivor
Seeker is better than Rogue
Seeker is better than Survivor

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
 1  1  2  4  3  5
 2  1  2  5  3  4
 3  1  3  4  2  5
 4  1  3  5  2  4
 5  2  1  4  3  5
 6  2  1  5  3  4
 7  2  3  4  1  5
 8  2  3  5  1  4
 9  3  1  4  2  5
10  3  1  5  2  4
11  3  2  4  1  5
12  3  2  5  1  4

Cluster 2:

With 95% confidence:
Guardian is better than Mystic
Guardian is better than Rogue
Seeker is better than Guardian
Seeker is better than Mystic
Survivor is better than Mystic
Seeker is better than Rogue
Survivor is better than Rogue
Seeker is better than Survivor

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
 1  2  4  5  1  3
 2  2  5  4  1  3
 3  3  4  5  1  2
 4  3  5  4  1  2

Cluster 3:

With 95% confidence:
Rogue is better than Guardian
Rogue is better than Mystic
Rogue is better than Seeker
Rogue is better than Survivor
Seeker is better than Survivor

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
 1  2  3  1  4  5
 2  2  4  1  3  5
 3  2  5  1  3  4
 4  3  2  1  4  5
 5  3  4  1  2  5
 6  3  5  1  2  4
 7  4  2  1  3  5
 8  4  3  1  2  5
 9  4  5  1  2  3
10  5  2  1  3  4
11  5  3  1  2  4
12  5  4  1  2  3

Cluster 4:

With 95% confidence:
Guardian is better than Mystic
Guardian is better than Seeker
Rogue is better than Mystic
Survivor is better than Mystic
Survivor is better than Seeker

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
 1  1  4  2  5  3
 2  1  4  3  5  2
 3  1  5  2  4  3
 4  1  5  3  4  2
 5  1  5  4  3  2
 6  2  4  1  5  3
 7  2  4  3  5  1
 8  2  5  1  4  3
 9  2  5  3  4  1
10  2  5  4  3  1
11  3  4  1  5  2
12  3  4  2  5  1
13  3  5  1  4  2
14  3  5  2  4  1

Cluster 5:

With 95% confidence:
Mystic is better than Guardian
Survivor is better than Guardian
Mystic is better than Rogue
Mystic is better than Seeker
Survivor is better than Rogue
Survivor is better than Seeker

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
 1  3  1  4  5  2
 2  3  1  5  4  2
 3  3  2  4  5  1
 4  3  2  5  4  1
 5  4  1  3  5  2
 6  4  1  5  3  2
 7  4  2  3  5  1
 8  4  2  5  3  1
 9  5  1  3  4  2
10  5  1  4  3  2
11  5  2  3  4  1
12  5  2  4  3  1

When computing confidence sets for clusters I ran into an interesting problem: what if, say, you never see Seekers ranked below Guardians? This causes one of the entries of κ̂ to be either 0 or 1, and there is no “variance” in its value; it’s always the same. The covariance matrix is then non-invertible, since it has rows/columns that are all zero. The solution is to eliminate those rows and work only with the non-constant entries of κ̂. That said, I still treat the removed entries as if they were “statistically significant” results and remove rankings from our confidence set that are inconsistent with what we saw in the data. In short, if Seekers are never ranked below Guardians, remove all rankings in the confidence set that rank Seekers below Guardians.
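The filtering step itself fits in a few lines. `drop_constant_coords` is a hypothetical helper of mine; the script below does the analogous thing inline with its `safe_vars`/`const_vars` variables:

```r
# Keep only coordinates of the estimated pairwise-comparison vector that
# actually vary; constant coordinates produce all-zero rows/columns in the
# covariance matrix and make it singular. The dropped coordinates are
# handled separately, as described in the text.
drop_constant_coords <- function(kappa_hat, Sigma) {
  varying <- colSums(Sigma^2) > 0
  list(kappa = kappa_hat[varying],
       Sigma = Sigma[varying, varying, drop = FALSE],
       constant = which(!varying))
}

# A 3-coordinate example where coordinate 2 is constant (always 1)
res <- drop_constant_coords(c(0.4, 1, 0.6), diag(c(0.2, 0, 0.3)))
```

The reduced covariance matrix is invertible again, and the constant coordinate is reported so its implied preference can still prune the confidence set.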

One usually isn’t satisfied with just a clustering; it would be nice to determine what a clustering signifies about those who are in the cluster. For instance, what type of player gets assigned to Cluster 1? I feel that inspecting the data in a more thoughtful, manual way can give a sense of what characteristics individuals assigned to a cluster share. To that end, I read the comments submitted by poll participants to hypothesize what types of players were being assigned to particular clusters. You can read these comments at the bottom of this article, after the code section.

Code

All source code used for the rank analysis is listed below, in a .R file intended to be run as an executable from the command line. (I created and ran it on a Linux system.)

Several packages had useful functions specific to this type of analysis, such as pmr (meant for modelling rank data) and rankdist (which has many tools for working with the Kendall distance). The confidence interval, central ranking estimator, and hypothesis testing tools, though, I wrote myself, and they may not exist elsewhere.

I feel the script itself is well-documented enough that little further explanation is needed. I will warn others, though, that it was tailored to my problem, and the methods employed may not work well with larger sample sizes or when more items need to be ranked.

Conclusion

This is only the tip of the iceberg for rank data analysis. We have not even touched on modelling for rank data, which can provide even richer inference. If you’re interested, I’ll refer you again to Malden’s book.

I enjoyed this analysis so much I asked a Reddit question about where else I could conduct surveys (while at the same time still being statistically sound) because I’d love to do it again. I feel like there’s much to learn from rank data; it has great potential. Hopefully this article sparked your interest too.

R Script for Analysis

#!/usr/bin/Rscript
################################################################################
# ArkhamHorrorClassPreferenceAnalysis.R
################################################################################
# 2019-02-10
# Curtis Miller
################################################################################
# Analyze Arkham Horror LCG class preference survey data.
################################################################################

# optparse: A package for handling command line arguments
if (!suppressPackageStartupMessages(require("optparse"))) {
  install.packages("optparse")
  require("optparse")
}

################################################################################
# CONSTANTS
################################################################################

CLASS_COUNT <- 5
CLASSES <- c("Guardian", "Mystic", "Rogue", "Seeker", "Survivor")
CLASS_COLORS <- c("Guardian" = "#00628C", "Mystic" = "#44397D",
                  "Rogue" = "#17623B", "Seeker" = "#B87D37",
                  "Survivor" = "#AA242D")

################################################################################
# FUNCTIONS
################################################################################

`%s%` <- function(x, y) {paste(x, y)}
`%s0%` <- function(x, y) {paste0(x, y)}

#' Sum of Kendall Distances
#'
#' Given a ranking vector and a matrix of rankings, compute the sum of Kendall
#' distances.
#'
#' @param r The ranking vector
#' @param mat The matrix of rankings, with each row having its own ranking
#' @param weight Optional vector weighting each row of \code{mat} in the sum,
#'   perhaps representing how many times that ranking is repeated
#' @return The (weighted) sum of the Kendall distances
#' @examples
#' mat <- rbind(1:3,
#'              3:1)
#' skd(c(2, 1, 3), mat)
skd <- function(r, mat, weight = 1) {
  dr <- partial(DistancePair, r2 = r)
  sum(apply(mat, 1, dr) * weight)
}

#' Least Sum of Kendall Distances Estimator
#'
#' Estimates the "central" ranking by minimizing the sum of Kendall distances,
#' via exhaustive search.
#'
#' @param mat The matrix of rankings, with each row having its own ranking
#' @param weight Optional vector weighting each row of \code{mat} in the sum,
#'   perhaps representing how many times that ranking is repeated
#' @return Ranking vector that minimizes the (weighted) sum of Kendall
#'   distances
#' @examples
#' mat <- rbind(1:3,
#'              3:1)
#' lskd_estimator(mat)
lskd_estimator <- function(mat, weight = NULL) {
  if (is.null(weight)) {
    reduced <- rank_vec_count(mat)
    mat <- reduced$mat
    weight <- reduced$count
  }
  skdm <- partial(skd, mat = mat, weight = weight)
  m <- max(mat)
  permutation_mat <- permutations(m, m)
  sums <- apply(permutation_mat, 1, skdm)
  permutation_mat[which.min(sums),]
}

#' Identify Ranking With Center
#'
#' Find the index of the center closest to a ranking vector.
#'
#' @param r The ranking vector
#' @param mat The matrix of rankings, with each row having its own ranking
#' @return Index of row that is closest to \code{r}
#' @examples
#' mat <- rbind(1:3,
#'              3:1)
#' close_center(c(2, 1, 3), mat)
close_center <- function(r, mat) {
  dr <- partial(DistancePair, r2 = r)
  which.min(apply(mat, 1, dr))
}

#' Simplify Rank Matrix To Unique Rows
#'
#' Given a matrix with rows representing rankings, this function reduces the
#' matrix to rows of only unique rankings and also counts how many times a
#' ranking appeared.
#'
#' @param mat The matrix of rankings, with each row having its own ranking
#' @return A list with entries \code{"mat"} and \code{"count"}, with
#'   \code{"mat"} being a matrix now with unique rankings and \code{"count"}
#'   being a vector of times each row in new matrix appeared in the old matrix
#' @examples
#' mat <- rbind(1:3,
#'              3:1)
#' rank_vec_count(mat)
rank_vec_count <- function(mat) {
  old_col_names <- colnames(mat)
  old_row_names <- rownames(mat)
  res_df <- aggregate(list(numdup = rep(1, times = nrow(mat))),
                      as.data.frame(mat), length)
  count <- res_df$numdup
  new_mat <- res_df[1:ncol(mat)]
  colnames(new_mat) <- old_col_names
  rownames(new_mat) <- old_row_names
  list("mat" = as.matrix(new_mat), "count" = count)
}

#' Find \eqn{k} Ranking Clusters
#'
#' Estimate \eqn{k} clusters of rankings.
#'
#' The algorithm to find the ranking clusters resembles the \eqn{k}-means++
#' algorithm except that the distance metric is the Kendall distance.
#'
#' @param mat The matrix of rankings, with each row having its own ranking
#' @param k The number of clusters to find
#' @param init_type Initialization method, either \code{"spectral"} or
#'   \code{"kmeans++"}
#' @param max_iter The maximum number of iterations for algorithm
#' @param tol The numerical tolerance at which to end the algorithm if met
#' @return A list containing the central rankings of each cluster (in
#'   \code{"centers"}) and a vector with integers representing cluster
#'   assignments
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' rank_cluster(mat, 2)
rank_cluster <- function(mat, k, init_type = c("spectral", "kmeans++"),
                         max_iter = 100, tol = 1e-4) {
  simplified_mat <- rank_vec_count(mat)
  mat <- simplified_mat$mat
  count <- simplified_mat$count
  init_type <- init_type[1]
  if (init_type == "kmeans++") {
    centers <- rank_cluster_center_init(mat, k)
  } else if (init_type == "spectral") {
    centers <- rank_cluster_spectral(mat, k)$centers
  } else {
    stop("Don't know init_type" %s% init_type)
  }
  old_centers <- centers
  cc_centers <- partial(close_center, mat = centers)
  clusters <- apply(mat, 1, cc_centers)
  for (iter in 1:max_iter) {
    centers <- find_cluster_centers(mat, clusters, count)
    stopifnot(all(dim(centers) == dim(old_centers)))
    cc_centers <- partial(close_center, mat = centers)
    clusters <- apply(mat, 1, cc_centers)
    if (center_distance_change(centers, old_centers) < tol) {
      break
    } else {
      old_centers <- centers
    }
  }
  if (iter == max_iter) {warning("Maximum iterations reached")}
  colnames(centers) <- colnames(mat)
  list("centers" = centers, "clusters" = rep(clusters, times = count))
}

#' Find the Distance Between Two Ranking Matrices
#'
#' Find the distance between two ranking matrices by summing the distance
#' between each row of the respective matrices.
#'
#' @param mat1 First matrix of ranks
#' @param mat2 Second matrix of ranks
#' @return The sum of distances between rows of \code{mat1} and \code{mat2}
#' @examples
#' mat <- rbind(1:3,
#'              3:1)
#' center_distance_change(mat, mat)
center_distance_change <- function(mat1, mat2) {
  if (any(dim(mat1) != dim(mat2))) {stop("Dimensions of matrices don't match")}
  sum(sapply(1:nrow(mat1), function(i) {DistancePair(mat1[i, ], mat2[i, ])}))
}

#' Initialize Cluster Centers
#'
#' Find initial cluster centers as prescribed by the \eqn{k}-means++ algorithm.
#'
#' @param mat The matrix of rankings, with each row having its own ranking
#' @param k The number of clusters to find
#' @return A matrix containing cluster centers.
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' rank_cluster_center_init(mat, 2)
rank_cluster_center_init <- function(mat, k) {
  n <- nrow(mat)
  center <- mat[sample(1:n, 1), ]
  centers_mat <- rbind(center)
  for (i in 2:k) {
    min_distances <- sapply(1:n, function(l) {
      min(sapply(1:(i - 1), function(j) {
        DistancePair(mat[l, ], centers_mat[j, ])
      }))
    })
    center <- mat[sample(1:n, 1, prob = min_distances/sum(min_distances)), ]
    centers_mat <- rbind(centers_mat, center)
  }
  rownames(centers_mat) <- NULL
  colnames(centers_mat) <- colnames(mat)
  centers_mat
}

#' Evaluation Metric for Clustering Quality
#'
#' Evaluates a clustering's quality by summing the distance of each observation
#' to its assigned cluster center.
#'
#' @param mat Matrix of rankings (in the rows); the data
#' @param centers Matrix of rankings (in the rows) representing the centers of
#'   the clusters
#' @param clusters Vector of indices corresponding to cluster assignments (the
#'   rows of the \code{clusters} matrix)
#' @return Score of the clustering
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' centers <- rbind(1:3, 3:1)
#' clusters <- c(1, 1, 2, 2)
#' clustering_score(mat, centers, clusters)
clustering_score <- function(mat, centers, clusters) {
  sum(sapply(1:nrow(centers), function(i) {
    center <- centers[i, ]
    submat <- mat[which(clusters == i), ]
    skd(center, submat)
  }))
}

#' Clustering with Restarts
#'
#' Clusters multiple times and returns the clustering with the lowest
#' clustering score.
#'
#' @param ... Parameters to pass to \code{\link{rank_cluster}}
#' @param restarts Number of restarts
#' @return A list containing the central rankings of each cluster (in
#'   \code{"centers"}) and a vector with integers representing cluster
#'   assignments
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' rank_cluster_restarts(mat, k = 2, restarts = 5)
rank_cluster_restarts <- function(mat, ..., restarts = 10) {
  best_score <- Inf
  rank_cluster_args <- list(...)
  rank_cluster_args$mat <- mat
  for (i in 1:restarts) {
    new_cluster_scheme <- do.call(rank_cluster, rank_cluster_args)
    score <- clustering_score(mat, new_cluster_scheme$centers,
                              new_cluster_scheme$clusters)
    if (score < best_score) {
      best_score <- score
      best_scheme <- new_cluster_scheme
    }
  }
  return(best_scheme)
}

#' Given Clusters, Find Centers
#'
#' Given a collection of clusters, find centers for the clusters.
#'
#' @param mat Matrix of rankings (in rows)
#' @param clusters Vector containing integers identifying cluster assignments,
#'   where the integers range from one to the number of clusters
#' @param weight Optional vector weighting each row of \code{mat} in the sum,
#'   perhaps representing how many times that ranking is repeated
#' @return A matrix of ranks representing cluster centers
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' find_cluster_centers(mat, c(1, 1, 2, 2))
find_cluster_centers <- function(mat, clusters, weight = NULL) {
  if (is.null(weight)) {
    weight <- rep(1, times = nrow(mat))
  }
  centers <- t(sapply(unique(clusters), function(i) {
    submat <- mat[which(clusters == i), ]
    subweight <- weight[which(clusters == i)]
    lskd_estimator(submat, subweight)
  }))
  colnames(centers) <- colnames(mat)
  centers
}

#' Cluster Rankings Via Spectral Clustering
#'
#' Obtain a clustering of rank data via spectral clustering.
#'
#' @param mat Matrix containing rank data
#' @param k Number of clusters to find
#' @return A list with entries: \code{"centers"}, the centers of the clusters;
#'   and \code{"clusters"}, a vector assigning rows to clusters.
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' rank_cluster_spectral(mat, 2)
rank_cluster_spectral <- function(mat, k = 2) {
  dist_mat <- DistanceMatrix(mat)
  sim_mat <- max(dist_mat) - dist_mat
  clusters <- spectralClustering(sim_mat, k)
  centers <- find_cluster_centers(mat, clusters)
  list("centers" = centers, "clusters" = clusters)
}

#' Compute the Test Statistic for Uniformity Based on the Pairs Matrix
#'
#' Compute a test for uniformity based on the estimated pairs matrix.
#'
#' Let \eqn{m} be the number of items ranked and \eqn{n} the size of the data
#' set. Let \eqn{\bar{m} = m(m - 1)/2} and \eqn{\bar{y}} the mean rank vector.
#' Let \eqn{\hat{K}^*} be the upper-triangular part of the estimated pairs
#' matrix (excluding the diagonal), laid out as a vector in row-major order.
#' Finally, let \eqn{1_k} be a vector of \eqn{k} ones. Then the test statistic
#' is
#'
#' \deqn{12n(\|\hat{K}^* - \frac{1}{2} 1_{\bar{m}}\|^2 - \|\bar{y} - \frac{m +
#' 1}{2} 1_m\|^2 / (m + 1))}
#'
#' Under the null hypothesis this statistic asymptotically follows a
#' \eqn{\chi^2} distribution with \eqn{\bar{m}} degrees of freedom.
#'
#' @param mat The data matrix, with rankings in rows
#' @return The value of the test statistic
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' pairs_uniform_test_stat(mat)
pairs_uniform_test_stat <- function(mat) {
  desc_stat <- suppressMessages(destat(mat))
  mean_rank <- desc_stat$mean.rank
  pair <- desc_stat$pair
  m <- ncol(mat) - 1
  n <- nrow(mat)
  mbar <- choose(m, 2)
  K <- pair[upper.tri(pair, diag = FALSE)]
  meanK <- rep(1/2, times = mbar)
  cm <- rep((m + 1)/2, times = m)
  12 * n * (sum((K - meanK)^2) - sum((mean_rank - cm)^2)/(m + 1))
}

#' Compute Covariance Matrix of Pairs Matrix Upper Triangle
#'
#' Compute the covariance matrix of the pairs matrix estimator.
#'
#' @param mat Data matrix, with each ranking having its own row
#' @return The \eqn{m(m - 1)/2}-square matrix representing the covariance
#'   matrix
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' pairs_mat_cov(mat)
pairs_mat_cov <- function(mat) {
  n <- nrow(mat)
  m <- ncol(mat)
  pair <- kappa_est(mat)
  pair <- as.matrix(pair)
  # Transform data into a dataset of pair-wise rank comparisons
  if (m == 1) {
    return(0)
  }
  kappa_data <- sapply(2:m, function(j) {mat[, j] > mat[, 1]})
  for (i in 2:(m - 1)) {
    kappa_data <- cbind(kappa_data, sapply((i + 1):m, function(j) {
      mat[, j] > mat[, i]
    }))
  }
  kappa_data <- kappa_data + 0  # Converts to integers
  cov(kappa_data)
}

#' Estimate \eqn{\kappa} Vector
#'
#' Estimate the \eqn{\kappa} vector, which fully defines the pairs matrix.
#'
#' @param mat Data matrix, with each ranking having its own row
#' @return The \eqn{m(m - 1)/2}-dimensional vector
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' kappa_est(mat)
kappa_est <- function(mat) {
  n <- nrow(mat)
  df <- as.data.frame(mat)
  df$n <- 1
  pair <- suppressMessages(destat(df))
  pair <- t(pair$pair)
  pair <- pair[lower.tri(pair, diag = FALSE)]/n
  pair
}

#' Get Plausible Rankings For Central Ranking Based on Kendall Distance
#'
#' Determine a set of plausible central rankings based on the Kendall distance.
#'
#' Let \eqn{\alpha} be one minus the confidence level, \eqn{m} the number of
#' options, \eqn{\bar{m} = m(m - 1)/2}, \eqn{\kappa} the vectorized
#' upper-triangle of the pairs matrix of the population, \eqn{\hat{\kappa}} the
#' sample estimate of \eqn{\kappa}, and \eqn{\hat{\Sigma}} the estimated
#' covariance matrix of \eqn{\hat{\kappa}}. Then the approximate \eqn{100(1 -
#' \alpha)}% confidence interval for \eqn{\kappa} is
#'
#' \deqn{\kappa: (\hat{\kappa} - \kappa)^T \hat{\Sigma}^{-1} (\hat{\kappa} -
#' \kappa) < \chi^2_{\bar{m}}}
#'
#' Once we have such an interval the next task is to determine which ranking
#' vectors are consistent with plausible \eqn{\kappa}. To do this, the function
#' determines which choices could plausibly be tied according to the confidence
#' interval; that is, which entries of \eqn{\kappa} could plausibly be
#' \eqn{1/2}. Whenever this is rejected, there is a statistically significant
#' difference in the preference of the two choices; looking at
#' \eqn{\hat{\kappa}} can determine which of the two choices is favored. All
#' ranking vectors that disagree with that preference are eliminated from the
#' space of plausible central ranking vectors. The ranking vectors surviving at
#' the end of this process constitute the confidence interval.
#'
#' @param mat Matrix of rank data, each observation having its own row
#' @param conf_level Desired confidence level
#' @return A list with entries \code{"ranks"} holding the matrix of plausible
#'   rankings in the confidence interval and \code{"preference_string"}, a
#'   string enumerating which options are, with statistical significance,
#'   preferred over others
#' @examples
#' mat <- t(replicate(100, {sample(1:3)}))
#' kendall_rank_conf_interval(mat)
kendall_rank_conf_interval <- function(mat, conf_level = 0.95) {
  n <- nrow(mat)
  m <- max(mat)
  mbar <- choose(m, 2)
  kap <- kappa_est(mat)
  Sigma <- pairs_mat_cov(mat)
  crit_value <- qchisq(1 - conf_level, df = mbar, lower.tail = FALSE)

  # Find bad rows of Sigma, where the covariance is zero; that variable must be
  # constant
  const_vars <- which(colSums(Sigma^2) == 0)
  safe_vars <- which(colSums(Sigma^2) > 0)
  safe_kap <- kap[safe_vars]
  safe_Sigma <- Sigma[safe_vars, safe_vars]

  # Determine if hyperplanes where one coordinate is 1/2 intersect confidence
  # set
  b <- as.matrix(solve(safe_Sigma, safe_kap))
  a <- t(safe_kap) %*% b
  a <- a[1, 1]
  check_half <- partial(hei_check, x = 1/2, A = safe_Sigma, b = -2 * b,
                        d = crit_value/n - a, invert_A = TRUE)
  sig_diff_safe_vars <- !sapply(1:length(safe_vars), check_half)
  if (length(const_vars) > 0) {
    sig_diff <- rep(NA, times = mbar)
    sig_diff[safe_vars] <- sig_diff_safe_vars
    sig_diff[const_vars] <- TRUE
  } else {
    sig_diff <- sig_diff_safe_vars
  }

  idx_matrix <- matrix(0, nrow = m, ncol = m)
  idx_matrix[lower.tri(idx_matrix, diag = FALSE)] <- 1:mbar
  idx_matrix <- t(idx_matrix)
  rownames(idx_matrix) <- colnames(mat)
  colnames(idx_matrix) <- colnames(mat)

  # Remove rows of potential centers matrix to reflect confidence interval
  # results; also, record which groups seem to have significant difference in
  # ranking
  rank_string <- ""
  permutation_mat <- permutations(m, m)
  for (i in 1:(m - 1)) {
    for (j in (i + 1):m) {
      sig_diff_index <- idx_matrix[i, j]
      if (sig_diff[sig_diff_index]) {
        direction <- sign(kap[sig_diff_index] - 1/2)
        if (direction > 0) {
          # Row option (i) is preferred to column option (j)
          permutation_mat <- permutation_mat[permutation_mat[, i] <
                                             permutation_mat[, j], ]
          rank_string <- rank_string %s0% colnames(mat)[i] %s%
            "is better than" %s% colnames(mat)[j] %s0% '\n'
        } else if (direction < 0) {
          # Row option (i) is inferior to column option (j)
          permutation_mat <- permutation_mat[permutation_mat[, i] >
                                             permutation_mat[, j], ]
          rank_string <- rank_string %s0% colnames(mat)[j] %s%
            "is better than" %s% colnames(mat)[i] %s0% '\n'
        }
      }
    }
  }
  colnames(permutation_mat) <- colnames(mat)

  return(list("ranks" = permutation_mat, "preference_string" = rank_string))
}

#' Straight Hyperplane and Ellipse Intersection Test
#'
#' Test whether a hyperplane parallel to an axis intersects an ellipse.
#'
#' The ellipse is fully determined by the parameters \code{A}, \code{b}, and
#' \code{d}; in fact, the ellipse consists of all \eqn{x} such that
#'
#' \deqn{x^T A x + b^T x \leq d}
#'
#' \code{x} is the intercept of the hyperplane and \code{k} is the coordinate
#' that is fixed to the value \code{x} and thus determines along which axis the
#' hyperplane is parallel. A value of \code{TRUE} means that there is an
#' intersection, while \code{FALSE} means there is no intersection.
#'
#' @param x The fixed value of the hyperplane
#' @param k The coordinate fixed to \code{x}
#' @param A A \eqn{n \times n} matrix
#' @param b An \eqn{n}-dimensional vector
#' @param d A scalar representing the upper bound of the ellipse
#' @param invert_A If \code{TRUE}, invert \code{A} before use
#' @return \code{TRUE} or \code{FALSE} depending on whether the hyperplane
#'   intersects the ellipse or not
#' @examples
#' hei_check(1, 2, diag(3), rep(0, times = 3), 10)
hei_check <- function(x, k, A, b, d, invert_A = FALSE) {
  b <- as.matrix(b)
  n <- nrow(b)
  stopifnot(k >= 1 & k <= n)
  stopifnot(nrow(A) == ncol(A) & nrow(A) == n)
  stopifnot(all(eigen(A)$values > 0))
  all_but_k <- (1:n)[which(1:n != k)]
  s <- rep(0, times = n)
  s[k] <- x
  s <- as.matrix(s)
  if (invert_A) {
    tb <- as.matrix(solve(A, s))
  } else {
    tb <- A %*% s
  }
  td <- t(s) %*% tb + t(b) %*% s
  if (invert_A) {
    # XXX: curtis: NUMERICALLY BAD; FIX THIS -- Thu 14 Feb 2019 07:50:19 PM MST
    A <- solve(A)
  }
  tA <- A[all_but_k, all_but_k]
  tx <- -solve(tA, (b/2 + tb)[all_but_k, ])
  tx <- as.matrix(tx)
  val <- t(tx) %*% tA %*% tx + t((b + 2 * tb)[all_but_k]) %*% tx + td - d
  val <- val[1, 1]
  val <= 0
}

################################################################################
# MAIN FUNCTION DEFINITION
################################################################################

main <- function(input, prefix = "", width = 6, height = 4, clusters = 5,
                 conflevel = 95, comments = "AHLCGClusterComments.txt",
                 detailed = FALSE, help = FALSE) {
  suppressPackageStartupMessages(library(pmr))
  suppressPackageStartupMessages(library(ggplot2))
  suppressPackageStartupMessages(library(reshape2))
  suppressPackageStartupMessages(library(dplyr))
  suppressPackageStartupMessages(library(rankdist))
  suppressPackageStartupMessages(library(gtools))
  suppressPackageStartupMessages(library(purrr))
  suppressPackageStartupMessages(library(anocva))

  load(input)

  n <- nrow(survey_data)
  rank_data <- survey_data[CLASSES]
  rank_data$n <- 1
  rank_mat <- as.matrix(survey_data[CLASSES])

  # Get basic descriptive statistics: mean ranks, marginals, pairs
  desc_stat <- suppressMessages(destat(rank_data))
  mean_rank <- desc_stat$mean.rank
  marginal <- desc_stat$mar
  pair <- desc_stat$pair
  names(mean_rank) <- CLASSES
  rownames(marginal) <- CLASSES
  colnames(marginal) <- 1:CLASS_COUNT
  rownames(pair) <- CLASSES
  colnames(pair) <- CLASSES

  # Compute "typical" distance based on least sum of Kendall distances
  best_rank <- lskd_estimator(rank_mat)
  names(best_rank) <- CLASSES

  # Hypothesis Testing for Uniformity
  statistic <- pairs_uniform_test_stat(rank_data)

  # Confidence Interval
  ci <- kendall_rank_conf_interval(rank_mat, conf_level = conflevel / 100)

  # Cluster data
  rank_clustering <- rank_cluster_spectral(rank_mat, k = clusters)
  centers <- rank_clustering$centers
  Cluster <- rank_clustering$clusters
  # Naming convention broke for printing
  rownames(centers) <- 1:nrow(centers)

  # Plotting
  marginal_plot <- ggplot(
      melt(100 * marginal / n, varnames = c("Class", "Rank"),
           value.name = "Percent"),
      aes(fill = Class, x = Class, y = Percent, group = Rank)) +
    geom_bar(position = "dodge", stat = "identity") +
    scale_fill_manual(values = CLASS_COLORS) +
    labs(title = "Class Rankings") +
    theme_bw()
  ggsave(prefix %s0% "marginal_plot.png", plot = marginal_plot, width = width,
         height = height, units = "in", dpi = 300)

  pair_plot <- ggplot(
      melt(100 * pair / n, varnames = c("Class", "Opposite"),
           value.name = "Percent") %>% filter(Percent > 0),
      aes(fill = Opposite, x = Class, y = Percent)) +
    geom_bar(position = "dodge", stat = "identity") +
    geom_hline(yintercept = 50, linetype = 2, color = "red") +
    scale_fill_manual(values = CLASS_COLORS) +
    labs(title = "Class Ranking Comparison") +
    theme_bw()
  ggsave(prefix %s0% "pair_plot.png", plot = pair_plot, width = width,
         height = height, units = "in", dpi = 300)

  # Place cluster comments in file
  comment_string <- ""
  for (i in 1:clusters) {
    comment_string <- comment_string %s0% "\n\nCLUSTER" %s% i %s0%
      "\n------------\n\n" %s0%
      paste(survey_data$Reason[survey_data$Reason != "" & Cluster == i],
            collapse = "\n\n-*-\n\n")
  }
  cat(comment_string, file = comments)

  # Printing
  cat("\nMEAN RANK\n---------\n")
  print(round(mean_rank, digits = 2))
  cat("\nMARGINALS\n---------\n")
  print(round(100 * marginal / n, digits = 2))
  cat("\nPAIRS\n-----\n")
  print(round(100 * pair / n, digits = 2))
  cat("\nUNIFORMITY TEST\n---------------\n")
  cat("Test Statistic:", statistic, "\n")
  cat("P-value:", pchisq(statistic, df = choose(CLASS_COUNT, 2),
                         lower.tail = FALSE), "\n")
  cat("\nOPTIMAL RANK ESTIMATE\n---------------------\n")
  print(sort(best_rank))
  cat("\nWith", conflevel %s0% '%', "confidence:",
      '\n' %s0% ci$preference_string)
  if (detailed) {
    cat("\nPlausible Modal Rankings:\n")
    print(as.data.frame(ci$ranks))
  }
  cat("\nCLUSTERING\n----------\nCounts: ")
  print(table(Cluster))
  cat("\nCenters:\n")
  print(centers)
  cat("\nScore:", clustering_score(rank_mat, centers, Cluster), "\n")
  if (detailed) {
    cat("\nCLUSTER CONFIDENCE INTERVALS\n----------------------------\n")
    for (i in 1:clusters) {
      cat("\nCluster", i %s0% ':\n')
      ci_cluster <- kendall_rank_conf_interval(rank_mat[Cluster == i, ])
      cat("\nWith", conflevel %s0% '%', "confidence:",
          '\n' %s0% ci_cluster$preference_string)
      cat("\nPlausible Modal Rankings:\n")
      print(as.data.frame(ci_cluster$ranks))
    }
  }
}

################################################################################
# INTERFACE SETUP
################################################################################

if (sys.nframe() == 0) {
  cl_args <- parse_args(OptionParser(
    description = paste("Analyze Arkham Horror LCG class preference survey",
                        "data and print results."),
    option_list = list(
      make_option(c("--input", "-i"), type = "character",
                  help = paste("Input file containing survey data")),
      make_option(c("--prefix", "-p"), type = "character", default = "",
                  help = "Another command-line argument"),
      make_option(c("--width", "-w"), type = "double", default = 6,
                  help = "Width of plots"),
      make_option(c("--height", "-H"), type = "double", default = 4,
                  help = "Height of plots"),
      make_option(c("--clusters", "-k"), type = "integer", default = 5,
                  help = "Number of clusters in spectral clustering"),
      make_option(c("--comments", "-c"), type = "character",
                  default = "AHLCGClusterComments.txt",
                  help = "File to store participant comments organized" %s%
                         "by cluster"),
      make_option(c("--conflevel", "-a"), type = "double", default = 95,
                  help = "Confidence level of confidence set"),
      make_option(c("--detailed", "-d"), action = "store_true",
                  default = FALSE, help = "More detail in report")
    )
  ))

  do.call(main, cl_args)
}

Report

$ ./ArkhamHorrorClassPreferenceAnalysis.R -i AHLCGClassPreferenceSurveys.Rda --detailed

MEAN RANK
---------
Guardian   Mystic    Rogue   Seeker Survivor
    2.92     3.10     3.16     2.60     3.22

MARGINALS
---------
             1     2     3     4     5
Guardian 18.29 20.43 26.84 19.71 14.73
Mystic   19.71 18.29 17.81 20.90 23.28
Rogue    19.24 14.73 20.67 21.38 23.99
Seeker   28.03 25.18 17.10 18.53 11.16
Survivor 14.73 21.38 17.58 19.48 26.84

PAIRS
-----
         Guardian Mystic Rogue Seeker Survivor
Guardian     0.00  54.16 55.34  42.52    55.82
Mystic      45.84   0.00 51.07  39.90    53.44
Rogue       44.66  48.93  0.00  38.72    51.54
Seeker      57.48  60.10 61.28   0.00    61.52
Survivor    44.18  46.56 48.46  38.48     0.00

UNIFORMITY TEST
---------------
Test Statistic: 2309938376
P-value: 0

OPTIMAL RANK ESTIMATE
---------------------
  Seeker Guardian   Mystic    Rogue Survivor
       1        2        3        4        5

With 95% confidence:
Seeker is better than Rogue
Seeker is better than Survivor

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
 1  1  2  4  3  5
 2  1  2  5  3  4
 3  1  3  4  2  5
 4  1  3  5  2  4
 5  1  4  3  2  5
 6  1  4  5  2  3
 7  1  5  3  2  4
 8  1  5  4  2  3
 9  2  1  4  3  5
10  2  1  5  3  4
11  2  3  4  1  5
12  2  3  5  1  4
13  2  4  3  1  5
14  2  4  5  1  3
15  2  5  3  1  4
16  2  5  4  1  3
17  3  1  4  2  5
18  3  1  5  2  4
19  3  2  4  1  5
20  3  2  5  1  4
21  3  4  2  1  5
22  3  4  5  1  2
23  3  5  2  1  4
24  3  5  4  1  2
25  4  1  3  2  5
26  4  1  5  2  3
27  4  2  3  1  5
28  4  2  5  1  3
29  4  3  2  1  5
30  4  3  5  1  2
31  4  5  2  1  3
32  4  5  3  1  2
33  5  1  3  2  4
34  5  1  4  2  3
35  5  2  3  1  4
36  5  2  4  1  3
37  5  3  2  1  4
38  5  3  4  1  2
39  5  4  2  1  3
40  5  4  3  1  2

CLUSTERING
----------
Counts:
Cluster
  1   2   3   4   5
130  83  80  66  62

Centers:
  Guardian Mystic Rogue Seeker Survivor
1        3      2     4      1        5
2        3      5     4      1        2
3        3      4     1      2        5
4        1      5     3      4        2
5        5      1     4      3        2

Score: 881

CLUSTER CONFIDENCE INTERVALS
----------------------------

Cluster 1:

With 95% confidence:
Guardian is better than Rogue
Guardian is better than Survivor
Mystic is better than Rogue
Mystic is better than Survivor
Seeker is better than Rogue
Seeker is better than Survivor

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         1      2     4      3        5
2         1      2     5      3        4
3         1      3     4      2        5
4         1      3     5      2        4
5         2      1     4      3        5
6         2      1     5      3        4
7         2      3     4      1        5
8         2      3     5      1        4
9         3      1     4      2        5
10        3      1     5      2        4
11        3      2     4      1        5
12        3      2     5      1        4

Cluster 2:

With 95% confidence:
Guardian is better than Mystic
Guardian is better than Rogue
Seeker is better than Guardian
Seeker is better than Mystic
Survivor is better than Mystic
Seeker is better than Rogue
Survivor is better than Rogue
Seeker is better than Survivor

Plausible Modal Rankings:
  Guardian Mystic Rogue Seeker Survivor
1        2      4     5      1        3
2        2      5     4      1        3
3        3      4     5      1        2
4        3      5     4      1        2

Cluster 3:

With 95% confidence:
Rogue is better than Guardian
Rogue is better than Mystic
Rogue is better than Seeker
Rogue is better than Survivor
Seeker is better than Survivor

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         2      3     1      4        5
2         2      4     1      3        5
3         2      5     1      3        4
4         3      2     1      4        5
5         3      4     1      2        5
6         3      5     1      2        4
7         4      2     1      3        5
8         4      3     1      2        5
9         4      5     1      2        3
10        5      2     1      3        4
11        5      3     1      2        4
12        5      4     1      2        3

Cluster 4:

With 95% confidence:
Guardian is better than Mystic
Guardian is better than Seeker
Rogue is better than Mystic
Survivor is better than Mystic
Survivor is better than Seeker

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         1      4     2      5        3
2         1      4     3      5        2
3         1      5     2      4        3
4         1      5     3      4        2
5         1      5     4      3        2
6         2      4     1      5        3
7         2      4     3      5        1
8         2      5     1      4        3
9         2      5     3      4        1
10        2      5     4      3        1
11        3      4     1      5        2
12        3      4     2      5        1
13        3      5     1      4        2
14        3      5     2      4        1

Cluster 5:

With 95% confidence:
Mystic is better than Guardian
Survivor is better than Guardian
Mystic is better than Rogue
Mystic is better than Seeker
Survivor is better than Rogue
Survivor is better than Seeker

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         3      1     4      5        2
2         3      1     5      4        2
3         3      2     4      5        1
4         3      2     5      4        1
5         4      1     3      5        2
6         4      1     5      3        2
7         4      2     3      5        1
8         4      2     5      3        1
9         5      1     3      4        2
10        5      1     4      3        2
11        5      2     3      4        1
12        5      2     4      3        1

Respondent Comments Grouped By Cluster

CLUSTER 1
------------

Guardians have serious bling and they're awesome at what they do, so they're number 1. Seekers also have great cards that guzzle clues and generally provide solid deck building, so they're #2. Rogues have cards that look like a lot of fun (there's bling there too) and they are often good at both clue gathering and fighting, depending on which is needed. Mystic decks feel like they're all the same, so building decks with them is not as much fun. Survivor cards are extremely limited so they're my least favorite.

-*-

I love the Mystic spells, especially the versatility. Hated Rogues since Skids days, although Jenny is great and Preston is very good fun. Guardians and Seeker fall very easy into the usable archetypes of Attack and Investigate.

-*-

I love supporting guardians and seekers. Control focused mistics are also fun.

-*-

Purple is top just because of Recall the Future and Premonition. Yellow for being weird, Green for extra-actions and Finn. Red for cool, weird interactions at a bargain price. Blue is boring.

-*-

I don't like playing Rogues, alright? Please don't crucify me! Oh, this is anonymous? Excellent.

-*-

Simplicity of play and planning.

-*-

I love spells and magic items

-*-

Guardian are probably te most rounded IMO. Seekers next, but great at clue gathering.

-*-

Seeker pool has best card draw & selection; guardian has stick to the plan + stand together + weapons; survivor pool is good but good xp options are less varied (will to survive/true survivor or bust); mystics delve too deep + bag control + David Renfield are good.
Rogue pool is harder to build a full game plan around — its best cards enable great turns (pocket watch, double or nothing, etc) and are valuable to have in the party, but they have a harder time spending actions 2-3 as usefully since some of their best things exhaust (lockpicks).

-*-

Mystic and Rogue tied for first. Mystic is my prefered and I like how I can stack my deck to be heavy in either investigating and/or combat. Rogue because most get a lot of recources where you can purchase more expensive cards.

-*-

I feel as though Mystic have the broadest tool kit and be specialise in almost any direction. However my experience is solely limited to two player with my wife and she plays a cloover, so we need someone with bashing power.

-*-

Matt's response

-*-

I primarily play a seeker (Daisy)

-*-

Yellow fits with my playstyle the best

-*-

I really like all of them, so there's not a ton of distance between them.

-*-

gameplay style, clear focus on purposes

-*-

Guardian and Seeker are very straightforward, and I like that. They have a clear objective, and they do it well.

-*-

While I feel that most classes have merit, the rogue is generally the worst at the core aspects of the game: fighting and clue finding. Evading does not have the punch that killing the enemy foes.

-*-

I prefer a support / team role, and play for consistency over tricks.

-*-

Most useful for the group

-*-

I just looked at options. Mystics have a lot of options in every way, shape or form, and so do Guardians. I just prefer the mystic combos better, since Guardians are pretty bland in that regard. I feel you really can now make different mystic decks, from support to tank and combat master, to main seeking investigator etc. They have everything and even playing one deck a few times is till fun because of so many exp. options. And while their decks are pretty deep, the premise is simple - boost willpower. That leaves them with a nice weakness you have to cover.
Guardians have better weapons (more fun) than mystics have combat spells, although Shattered Aeons really gave Mystics a fun new icy option. And maybe I'd like to see a Mystic that wouldn't be pure Mystic if you get me. Some hybrid guy or girl, that's not just using spells and artifacts from the same class over and over again. That's really what they're missing. Guardians are just so great, because they are sooo well balanced imo. It's quite relaxing looking at their options. You have everything from amazing gear, weapons, allies, events that cover literally everything + your friends' asses, awesome skillcards that can also combo, fun and engaging exp. options etc. But they lack different kinds of investigators. They have options, just some other classes have more. Maybe my least favorite on investigator side. Mystics again are so simple to make in that regard. I gave Seekers 3. because they just have some 0 exp. cards that are just too strong for any class, not just for them. Otherwise I really like Seeker cards theme, maybe even more than Guardian, maybe even my favorite I'd say, but again, Seekers just have so much random stuff and OP stuff (you know what they are). I don't care for balance in a co-op game, OP cards can be really fun, but this stuff really limits their options and sometimes even other classes' options, because not including them just hinders your deck and you know it (example is Shortcut). And that's not good. They have really fun and diverse roster of investigators though. And their experience options are quite game breaking, but in a good way imo. There's seeking, combat, running and evading so much support and combos, really fun and diverse. Rogues have maybe some of my least favorite cards, but they have A LOT of options. They have quite a few very awesome weapons, but they also have SO MUCH cards that are meant for combos and while combo decks are fun, they, in my opinion, are niche, or at least not used in every game. 
Sometimes you just want a simple deck and Rouges have a limited card pool when you look at it that way (example: no useful combat ally or even asset - there is a new Guardian tarrot card for Jenny and Skids, but they need more imo). They got their quite fresh Lockpicks and the seeker gator and that was an amazing get. But more, Leo breaks their ally pool, because he's just too strong. They also have no pure combat investigators, but otherwise their investigators are really really fun and diverse. They have AMAZING experience options. Maybe the best in the game. And btw, they were my favorite to play before the last few expansions. I love Preston, but again the new cards are very niche. The new seeker agent Joe with 4 combat elevates seekers above Rogues for me in the options in card pool department though. They now have an optional pure combat investigator, while Rogues still don't. Survivors have AWESOME cards, especially investigators are just so fun and weird, but they just lack options in the card pool. You have so many "survive" cards, but they lack anything else strong. Their weapons are quite fun, but there are no heavy hitting options. That for me may be their biggest minus. Lack of experience pure combat options. They have quite a few very strong investigate cards though like Look What I Found and Newspaper 2 exp. And their allies, while strong, are still nicely balanced and quite diverse. They have a million evade options, maybe even too much. It would sometimes be nice to get something else rather than just another evade. These new Track Shoes are pretty cool though. Their skill cards are pretty awesome imo. But still, I feel like they have so much niche cards that only allow some very specific combos, like Rogues, and lack anything else meaningful. They are extremely fun to play though, with all their Survivor specializations like Seeker Urchin, combat Gravekeeper, being dead crazy guy, new athlete runner and evader etc. 
They may even be my favorite class, but they still lack options in a big way. And they even lack one investigator only available for 15 bucks along a cheaply written book.

CLUSTER 2
------------

survivors da best

-*-

Guardian just have so many cards that, when looking at them, seem useful. Mystic is my actual favourite class, but it has soo many cards where they went too far with the punishing effects that almost made them useless. Survivor on the other hand has too many events that end up feeling almost the same. Seekers I dont really know, Ive never played them, but everytime I see them looks like they can do many things. And rogue, while it has improved a bit more, I still miss a useful level 1 weapon

-*-

Difficulty wrapping my head around some classes

-*-

Mystics are incredibly dependent on their cards.

-*-

Seekers usually win the game, because the snitch is 150 points

-*-

Always cards in these classes that I have a hard time cutting. Which means they have the deepest pools marking them the most fun to me

-*-

I love deck manipulation for seekers, and the flexibility of survivors. I just can't get my head wrapped around mystics.

-*-

Guardians have a lot of great tools for not just fighting but getting clues. Seeker has the best support so splashing it is great. Rogue and survivor are ties for good splash but survivors card pool is mediocre to me. Mystic aren't bad but I haven't seen it great with others very well. Mystics are good as themselves but really expensive and not great for splash IMO.

-*-

Survivor have many nice tricks to survive and gather clues. Guardians because they have the best weapons (flamethrower) and protective tools. seeker for their upgradable cards and higher ed. mystic for canceling cards but dont like their only good stat is willpower... rogues seems interesting but never played one.

-*-

Seekers have action economy (shortcut, pathfinder), card economy, resource economy (Dr Milan + cheap cards) and they advance the game quickly (i.e. discover clues). Specialist decks are better than generalist decks (in multiplayer, which I play) as they accomplish their goals more consistently, and this favours seekers and guardians. Stick To The Plan with Ever Vigilant is the most powerful deck element I am aware of.

-*-

I tend to play builds focused around consistency of succeeding at tests and action efficiency and my rankings reflect the build consistencies in order except rogue who are consistent but just not interesting.

-*-

Love survivors

-*-

Seeker is m'y main class

-*-

Firstly let me preface this with I only own 2 cores and the Dunwich cycle and have yet to play through Dunwich. Survivor offers the most versatility and always seems to be one of the key factors when beating the odds in most cases as well as enhancing evasion and action economy (survival instinct etc). Seeker cards are my second favourite due to the amount of utility included within them (i.e. Shortcut, Barricade, Medical Texts, Old Book of Lore etc) as well as allowing you what you need to catapult out in front of the agenda deck with cluevering abilities. Guardian and Mystic operate on a similar field marginally behind Seeker to me though mystic finds itself slightly higher because of the unique interactions with the encounter deck and rule bending. though in my limited experience they both seem to be the more combat based of the card pools so operate in that same niche for me. Rogue is unfortunately last but honestly that's just because I haven't had many interactions with them, most of their effects seem too situational to be able to use consistently.

-*-

I don't like taking the obvious solutions to a problem. I.E: Gun to the face, or Spells for everything.

-*-

Efficiency at what is needed to complete scenarios - mostly clue getting and combat.

-*-

Rogue and survivor seem to have the most cards that support each other to suggest a new way of playing.
Recursion survivor is fun and different from no money survivor (though you can do both). Rogue has succeed by 2 and rich as options. Seeker has less of that but has the power of untranslated etc cards. Guardians are okay but kind of blah. I can’t see any fun decks to do with mystic. Like, messing with the bag is a cool thing to do in any deck, it isn’t a deck. Playing with doom is a couple cards that need each other but it isn’t a plan for how the game will play out.

-*-

Definitely hard to rank them, but ranked in order of which I'd most like to have as an off-class

-*-

I like the consistency in the Survivor card pool and how much individual support there is for the variety of Survivor investigators. Although I like the power level of Mystic cards, it always sucks to have your Rite of Seeking or Shriveling 15 cards down after a full mulligan for them.

-*-

More scenarios need cloovers and fighters, so all classes outside of Seeker and Guardian are more tricksy and less focused on the goal. This is a hard-enough game as it is!

-*-

Seeker cards are way too powerful. Rogues are the most fun to play. Survivor cards are super efficient at what they do. Guardian pool is decent but overpriced. Mystics have a few amazing cards, but the rest is pretty meh.

CLUSTER 3
------------

Vaguely from ‘most interactive’ to ‘most straightforward’ with a special mention for the Survivor card pool which has been largely static since roughly core with a few major exceptions.

-*-

Rogue cards are the most fun for me. More money, more actions, more fun.

-*-

I seem to like the classes that are less straight-forward than Guardian and Seeker tend to be. (In the sense that they are the archetypical fighters and cluevers.)

-*-

I like cards that cheat the system and don't depend on leveraging board state

-*-

Green and purple cards have fun and flashy effects. Blue and yellow cards have more standard effects and narrower deck building options.
-*-

I didn't play mystics a lot yet

-*-

The numbers are different depending whether we’re talking theory or practice. In theory the Mystic cards are my favorite, both for flavor and interesting mechanics. In practice I struggle with them and they’re usually the first cut.

-*-

Combos!

-*-

I like moneeeey

-*-

seekers have literally everything, and their cards usually aren't too expensive. rogues have adaptable, streetwise, and really good allies, but they're a bit low in damage output. guardians have really good cards but are limited by how expensive they are. mystic events are amazing, but they are 4th place because level 0 spells kinda suck and are expensive as hell. mystic cards are much better with exp. survivor cards are almost decent. it really sucks that many of their leveled up cards are exile cards, but survivors don't get any extra exp. but in general i find their cards to be lacking in clue-gathering capability and damage dealing. they can turn failures into successes, but that's about it.

-*-

Guardian is solid and predictable, Rogue is fun. Mystic is challenging, Seeker and Survivor are necessary.

-*-

THE JANK

-*-

I really dislike survivors as I simply dont understand how to properly build them (appart maybe Wendy). Even if I have rated mystics 4, I enjoy playing Mystic nearly as much as seeker (which I rated 1) rather than Survivor.

-*-

I think the rogue theme is portayed very well in their card pool

-*-

corelation between mechanisms and theme

-*-

I like big, flashy, ridiculous turns and risky plays, so rogue and mystic are the most fun for me. Guardian and seeker are fine and all, just a bit dry. I don’t understand survivor at all, but I’m happy other people have a thing they like.

-*-

Rogue and survivor give you interesting and powerful but situational tools that you have to figure out how to apply to the scenario.
Mystic and guardian are more about powerful assets that you either draw early and use a bunch or wish you’d drawn earlier but can’t afford now and just commit for icons. Seeker pool makes me sleepy every time I look at it; the only mechanic there I really enjoy is the tome synergies and that’s only with Daisy (Rex, of course, is only played one way).

-*-

Role-play Value

-*-

I went for those that get me excited to play or provide thrills or cool combinations as I play (rather than, say, the power of the cards)

CLUSTER 4
------------

Lol moments. We’d all be survivor if we were in this game!

-*-

The top two were tricky to place; Rogues have fantastically fun combo plays available to them, while I love the 'feel' of many Survivor cards, fighting against fate as hard as they damn well can. Overall, I find the Survivor pool *just* wins out, especially with the excellent Will To Survive and semi-immortal Pete Sylvestre. Guardians and Seekers are two sides of the same coin; I'd say Guardians edge out, because while a Guardian has a few tools (including the infamous Flashlight) to find clues, Seekers have very few options to take care of enemies. As with Survivors and Rogues, though, this is close. Mystics... weeeeell... I acknowledge they are arguably the best class, once set up, and while their charges last on their spells. The ability to do everything while testing just one stat can make them very efficient. But... this is the class I enjoy the least, in part due to their over-reliance on their spells. Their solutions never feel more than stopgaps for me, so I find Mystics a hard class to play. (That won't stop me taking Mystics for a spin though, especially for goodies like Delve Too Deep )

-*-

Ability to bounce off Investigators with good resource and action economy, other card pools (including Neutral), as well as capability to start off with no experience — all the way to full campaign with as much power-card investment as possible.
Seeker may have 2 of the best cards in the game (Higher Education and Dr. Milan Christopher), but the Seeker card pool as a whole does not stand up. It is both narrow and shallow. Mystic is the most detailed and the most broad, but suffers from undue delay leading to deterioration. Guardian definitely needs to be more broad as well. Both Rogue and Survivor blend well, and provide the necessary breadth to take on challenges while melding with the high-economy Investigators. Rogue has a few 3, 4, and 5 xp cards that push it to the top spot. Even for Lola these statements hold up.

-*-

On a scale of most interesting vs. most boring. Options for rogues and survivors feel fresh and like there are multiple deck archetypes that are valid. Less so for the seeker and mystic card pools, where I feel like there are more "must include" cards which makes deck building less exciting and more rote.

-*-

survivor da bass

-*-

The card pool allows rogue/survivor decks to make specialists. Seekers are all just different flavours of clueverer

-*-

Personally, I like healing and support best, which guardian does quite well. Survivor has my second favorite card pool, though, for tricks and card recursion.

-*-

Not much between them but I like guns & ammo, survivor class is cool because it is normies vs horror

-*-

I really like the guardian cards as i enjoy fighting the monsters that appear in scenarios. Unfortunately my least favorite is mystic. Although they have powerful cards, they often take time to set up and I think that the draw backs on some of their cards are too harsh for what they do.

-*-

Just what I gravitate towards

-*-

I like killing monsters

-*-

Mystics have so much of everything with cool effects added on. Guardian cards are efficient at what they do, but really boring.

-*-

Survivors feel more unique, guardians kill stuff, seekers feel like you can't win without them (though you really can). Rogues and mystics by default.
I like rogues better because of Finn and Sefina being really fun to play.

-*-

Almost always let my partner(s) play the seekers as I find the rogue and survivor cardpools allow you to fly by the seat of your pants, which I find even more exciting than just being the clue gatherer. Mystic card pool can sometimes take too long to develop. Also many marquis mystic cards flirt around with the doom mechanic which always bites me in the arse. Thirdly, mystic pool doesn't have a strong ally base. What's funny about that is I always play spellcasters in D n D. Guardian pool is pretty straightforward, one I look at as more of a necessity within the context of the game, but doesn't tug at my heartstrings. Apologize for the sloppy phrasing in my opine due to a long day. Rankings based on personal preferences only. No meta analysis

-*-

Agnes

-*-

Just prefer the in your face monster destruction that Guardian is themed with. Really enjoy that class.

-*-

Flexibility

-*-

I love killing things and then getting clues!

-*-

I like all of them but play seekers least, I also like that guardians can just take the mythos phase to the face

-*-

I like to be the tank, and with some of the new additions guardians have gotten with getting clues they just shine even more. Mystic I never really play but has so many cards I want if I am playing a dunwich survivor or anyone who can get them, same goes for survivor, very few cards from rogue or seeker makes it into my decks unless I am playing chars like Wendy or Leo who kinda needs them to make them work

-*-

Number of fun upgrade options in green, base card pool for red, upgrade options in blue, useful upgrades in seeker, purple sucks to play.

-*-

I like support / killing role, Carolyn FTW

CLUSTER 5
------------

Weapons are fun.

-*-

Leo is da alpha male.

-*-

Red's best

-*-

There’s more variety of cards that let me build decks I enjoy. As you go down the ranking, there’s less variety or fewer viable sub themes to build around.
-*-

Seeker is powerful but boring, while mystic getting to punch pack at the game is great, with good support to boot.

-*-

I enjoy the cardplay in survivor, and the mystic side of things. Seeker cards are generally very powerful. I don’t enjoy playing rogue but there is some good cardplay. Guardian I find less interesting overall as a class

-*-

Wendy is literally the best investigator in the game

-*-

I enjoy support cards and interesting, unique effects.

-*-

I tend to go for lazy game play, and usually guardians allow to smash enemies without worrying too much about strategy. Seekers I love them thematically. Mystics, I never understood how to play them efficiently

Packt Publishing has published a book of mine entitled Hands-On Data Analysis with NumPy and Pandas, based on my video course Unpacking NumPy and Pandas. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis, or know someone who is, please consider buying my book, or at least spread the word about it. You can buy the book directly, or purchase a subscription to Mapt and read it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)!

To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

### MLB run scoring trends: Shiny app update

Mon, 02/25/2019 - 15:26

(This article was first published on Bayes Ball, and kindly contributed to R-bloggers)

The new Major League Baseball season will soon begin, which means it’s time to look back and update my run scoring trends data visualization application, built using RStudio’s shiny package.

You can find the app here: https://monkmanmh.shinyapps.io/MLBrunscoring_shiny/

The github repo for this app is https://github.com/MonkmanMH/MLBrunscoring_shiny

This update gave me the opportunity to make some cosmetic tweaks to the front end, and more consequential changes to the code under the hood.

1. retired reshape, using tidyr

At one point in the app’s code, I had used the now-retired reshape package to melt a data table. Although this still works, I opted to update the code to use the gather() function in the package tidyr, part of the tidyverse.

2. feather instead of csv

The app relied on some pre-wrangled csv files; these have been replaced by files stored using the .feather format, which makes for a significant performance improvement.

3. wrangling: the calculation of team and league run scoring averages

The goal is to create data tables that minimize the amount of processing the app has to do. In previous versions of the app, the filtering of rows (records or observations) and selecting of columns (variables), the calculation of each team’s average runs scored and runs allowed per game, the league average runs per game, and the comparison of the team to the league, were all done first using base R’s apply family of functions. Then I switched to using dplyr; although the steps were now in a pipe, this approach still required creating a separate data table with the league average, and then joining that table back into the main team table so that the team result could be compared to the league average.

For this iteration, preparing the data for the app is done using tidyr::nest() and purrr::map(). What follows is a detailed explanation of how I approached this.
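The first two changes are mechanical, and can be sketched on a toy table. Everything below (the `wide` data frame, its column names, and the file path) is invented for illustration and is not the app's actual data:

```r
library(tidyr)

# Invented example table: league runs per game in wide format
wide <- data.frame(
  yearID = c(2017, 2018),
  AL = c(4.65, 4.45),
  NL = c(4.58, 4.34)
)

# The old call was along the lines of reshape::melt(wide, id.vars = "yearID").
# With gather(), every column except yearID is collapsed into key/value columns:
long <- gather(wide, key = "lgID", value = "R_per_game", -yearID)

# Swapping csv for feather is a similarly small change
# (guarded here in case the feather package is not installed)
if (requireNamespace("feather", quietly = TRUE)) {
  path <- tempfile(fileext = ".feather")
  feather::write_feather(long, path)    # instead of write_csv()
  long2 <- feather::read_feather(path)  # instead of read_csv()
}
```

Newer tidyr code would reach for pivot_longer() instead, but gather() is the function the post describes.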
It’s always valuable to have your end-state in mind when working through a multi-step data wrangle like this. My goal is the values shown on the “team plot” tab of the app – an index value (i.e. a percentage) comparing a team’s average runs scored (and runs allowed) to the league run scoring rate, for a single season.

a. Load packages and read the data

First, load the necessary packages, the first four of which are part of the tidyverse.

# tidyverse packages
library(dplyr)
library(purrr)
library(readr)
library(tidyr)

library(feather)

Then, read in the data.

Teams <- read_csv("Teams.csv",
                  col_types = cols(
                    divID = col_character(),
                    DivWin = col_character(),
                    SF = col_character(),
                    WCWin = col_character()
                  ))

head(Teams)

## # A tibble: 6 x 48
##   yearID lgID teamID franchID divID Rank     G Ghome     W     L DivWin
##
## 1   1871      BS1    BNA            3     31    NA    20    10
## 2   1871      CH1    CNA            2     28    NA    19     9
## 3   1871      CL1    CFC            8     29    NA    10    19
## 4   1871      FW1    KEK            7     19    NA     7    12
## 5   1871      NY2    NNA            5     33    NA    16    17
## 6   1871      PH1    PNA            1     28    NA    21     7
## # ... with 37 more variables: WCWin, LgWin, WSWin,
## #   R, AB, H, 2B, 3B, HR,
## #   BB, SO, SB, CS, HBP, SF, RA,
## #   ER, ERA, CG, SHO, SV, IPouts,
## #   HA, HRA, BBA, SOA, E, DP,
## #   FP, name, park, attendance, BPF,
## #   PPF, teamIDBR, teamIDlahman45, teamIDretro

The table above has far more variables than we need, and some that we’ll have to calculate.

b. Create league summary tables

This is a short set of instructions that starts with the “Teams” table in the Lahman database and summarizes it for the MLB run scoring trends Shiny app. Rather than having the app do the work of (1) removing unnecessary records (rows) and fields (columns) and (2) running the calculations for the runs-per-game, runs-allowed-per-game, and indexed versions of those, the calculations are conducted here. This will vastly improve the performance of the app.

i. create nested table

I started with the “Many Models” chapter of Wickham and Grolemund, R for Data Science.
(And thanks to Dr. Charlotte Wickham, whose training course was invaluable in helping me wrap my head around this.)

At this point, the code:

• filters out the years prior to 1901 and the misbegotten Federal League, and
• creates a nested data table, starting with a group_by() on year (yearID) and league (lgID).

# select a sub-set of teams from 1901 [the establishment of the American League] forward to most recent year
Teams_lgyr <- Teams %>%
  filter(yearID > 1900, lgID != "FL") %>%
  group_by(yearID, lgID) %>%
  nest()

Teams_lgyr

## # A tibble: 236 x 3
## yearID lgID data
##
## 1 1901 AL
## 2 1901 NL
## 3 1902 AL
## 4 1902 NL
## 5 1903 AL
## 6 1903 NL
## 7 1904 AL
## 8 1904 NL
## 9 1905 AL
## 10 1905 NL
## # ... with 226 more rows

Here’s a quick peek inside the first entry of the “data” column…the American League, 1901.

Teams_lgyr$data[[1]]

## # A tibble: 8 x 46
## teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
##
## 1 BLA NYY 5 134 66 68 65 N
## 2 BOS BOS 2 138 69 79 57 N
## 3 CHA CHW 1 137 71 83 53 Y
## 4 CLE CLE 7 138 69 54 82 N
## 5 DET DET 3 135 70 74 61 N
## 6 MLA BAL 8 139 70 48 89 N
## 7 PHA OAK 4 137 66 74 62 N
## 8 WS1 MIN 6 138 68 61 72 N
## # ... with 35 more variables: WSWin , R , AB , H ,
## # 2B , 3B , HR , BB , SO , SB ,
## # CS , HBP , SF , RA , ER , ERA ,
## # CG , SHO , SV , IPouts , HA , HRA ,
## # BBA , SOA , E , DP , FP , name ,
## # park , attendance , BPF , PPF , teamIDBR ,
## # teamIDlahman45 , teamIDretro

ii. functional programming

This step creates a league run scoring function, and then applies that using the purrr::map() function.
Note:

• In the gapminder example in R for Data Science, the variables were called using their names. In this case, for a reason I have not yet determined, we have to specify the data object they are coming from; e.g. for the runs variable R, we have to use df$R (not just R).

First, a simple test: calculate total league runs scored, and check that we got the right answer by comparing it to the value calculated using the old-school dplyr method.

# base R format
leagueRuns_fun <- function(df) {
  sum(df$R)
}

league_year_runs <- map(Teams_lgyr$data, leagueRuns_fun)

league_year_runs[[1]]

## [1] 5873

# check the answer by the old-school dplyr method
Teams %>%
  filter(yearID == 1901, lgID == "AL") %>%
  summarise(leagueruns = sum(R))

## # A tibble: 1 x 1
## leagueruns
##
## 1 5873

Now we move on to the calculation of league averages. For the first approach, the sum calculation is part of the function.

• There are two functions, one for Runs and the other for Runs Allowed. This is because I have not yet figured out how to specify two different variables (i.e. the name of the data object and the variable to be used in the function) in the map_() function and successfully have them carried into my calculation functions.
• Also note that, in order to be consistent with other sources, the number of games played is calculated using the sum of wins (W) and losses (L), rather than the number of games reported in the G variable.

# functions
leagueRPG_fun <- function(df) {
  sum(df$R) / (sum(df$W) + sum(df$L))
}

leagueRAPG_fun <- function(df) {
  sum(df$RA) / (sum(df$W) + sum(df$L))
}

# simple map version
league_year_RPG <- map(Teams_lgyr$data, leagueRPG_fun)

# embed as new columns in nested data object
Teams_lgyr <- Teams_lgyr %>%
  mutate(lgRPG = map_dbl(Teams_lgyr$data, leagueRPG_fun),
         lgRAPG = map_dbl(Teams_lgyr$data, leagueRAPG_fun))

Teams_lgyr

## # A tibble: 236 x 5
## yearID lgID data lgRPG lgRAPG
##
## 1 1901 AL 5.43 5.43
## 2 1901 NL 4.69 4.69
## 3 1902 AL 4.97 4.97
## 4 1902 NL 4.09 4.09
## 5 1903 AL 4.15 4.15
## 6 1903 NL 4.85 4.85
## 7 1904 AL 3.65 3.65
## 8 1904 NL 3.98 3.98
## 9 1905 AL 3.75 3.75
## 10 1905 NL 4.16 4.16
## # ... with 226 more rows
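As an aside (this is not from the original post): one way around needing two near-identical functions is to write a single helper that takes the column name as a string, since extra arguments to map_dbl() are forwarded on to the function. A sketch on a toy nested table with made-up numbers:

```r
library(dplyr)
library(purrr)
library(tidyr)

# a toy nested table standing in for Teams_lgyr (made-up numbers)
toy <- tibble(
  lgID = c("AL", "NL"),
  data = list(
    tibble(R = c(5, 7), RA = c(4, 6), W = c(1, 1), L = c(0, 1)),
    tibble(R = c(3, 2), RA = c(5, 4), W = c(0, 1), L = c(1, 0))
  )
)

# one generic per-game function; the column of interest is passed as a string
per_game_fun <- function(df, col) {
  sum(df[[col]]) / (sum(df$W) + sum(df$L))
}

# extra arguments to map_dbl() are passed through to per_game_fun()
toy <- toy %>%
  mutate(lgRPG  = map_dbl(data, per_game_fun, col = "R"),
         lgRAPG = map_dbl(data, per_game_fun, col = "RA"))

toy$lgRPG   # AL: 12 runs over 3 games = 4; NL: 5 runs over 2 games = 2.5
```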

In the second approach:

• the league and year total runs, runs allowed, and games are first calculated using separate functions
• RPG and RAPG for each league and year combination are then calculated outside the nested tibbles
# more functions - individual league by year totals
leagueR_fun <- function(df) {
  sum(df$R)
}

leagueRA_fun <- function(df) {
  sum(df$RA)
}

leagueG_fun <- function(df) {
  sum(df$W) + sum(df$L)
}

Teams_lgyr <- Teams_lgyr %>%
  mutate(lgR = map_dbl(Teams_lgyr$data, leagueR_fun),
         lgRA = map_dbl(Teams_lgyr$data, leagueRA_fun),
         lgG = map_dbl(Teams_lgyr$data, leagueG_fun))

Teams_lgyr <- Teams_lgyr %>%
  mutate(lgRPG = (lgR / lgG),
         lgRAPG = (lgRA / lgG))

Teams_lgyr

## # A tibble: 236 x 8
## yearID lgID data lgRPG lgRAPG lgR lgRA lgG
##
## 1 1901 AL 5.43 5.43 5873 5873 1082
## 2 1901 NL 4.69 4.69 5194 5194 1108
## 3 1902 AL 4.97 4.97 5407 5407 1088
## 4 1902 NL 4.09 4.09 4494 4494 1098
## 5 1903 AL 4.15 4.15 4543 4543 1096
## 6 1903 NL 4.85 4.85 5349 5349 1102
## 7 1904 AL 3.65 3.65 4433 4433 1216
## 8 1904 NL 3.98 3.98 4872 4872 1224
## 9 1905 AL 3.75 3.75 4547 4547 1212
## 10 1905 NL 4.16 4.16 5092 5092 1224
## # ... with 226 more rows

iii. save LG_RPG files

And then write csv and feather versions. As noted above, the shiny app now uses the feather format.

Notes:

• rounding of variables after all the calculations, simply to make the tables as viewed more legible.
• renaming of variables to correspond with shiny app names.

LG_RPG <- Teams_lgyr %>%
  mutate(lgRPG = round(lgRPG, 2),
         lgRAPG = round(lgRAPG, 2)) %>%
  select(yearID, lgID,
         R = lgR, RA = lgRA, G = lgG,
         leagueRPG = lgRPG, leagueRAPG = lgRAPG)

LG_RPG

## # A tibble: 236 x 7
## yearID lgID R RA G leagueRPG leagueRAPG
##
## 1 1901 AL 5873 5873 1082 5.43 5.43
## 2 1901 NL 5194 5194 1108 4.69 4.69
## 3 1902 AL 5407 5407 1088 4.97 4.97
## 4 1902 NL 4494 4494 1098 4.09 4.09
## 5 1903 AL 4543 4543 1096 4.15 4.15
## 6 1903 NL 5349 5349 1102 4.85 4.85
## 7 1904 AL 4433 4433 1216 3.65 3.65
## 8 1904 NL 4872 4872 1224 3.98 3.98
## 9 1905 AL 4547 4547 1212 3.75 3.75
## 10 1905 NL 5092 5092 1224 4.16 4.16
## # ... with 226 more rows

write_csv(LG_RPG, "LG_RPG.csv")
write_feather(LG_RPG, "LG_RPG.feather")

c. Repeat for MLB total

This only differs from the league summaries in the level of nesting; instead of grouping by year and league, it’s grouped only by year (yearID).
Teams_lgyr <- Teams_lgyr %>%
  unnest() %>%
  group_by(yearID) %>%
  nest()

Teams_lgyr

## # A tibble: 118 x 2
## yearID data
##
## 1 1901
## 2 1902
## 3 1903
## 4 1904
## 5 1905
## 6 1906
## 7 1907
## 8 1908
## 9 1909
## 10 1910
## # ... with 108 more rows

Teams_lgyr <- Teams_lgyr %>%
  mutate(mlbR = map_dbl(Teams_lgyr$data, leagueR_fun),
         mlbRA = map_dbl(Teams_lgyr$data, leagueRA_fun),
         mlbG = map_dbl(Teams_lgyr$data, leagueG_fun),
         mlbRPG = (mlbR / mlbG),
         mlbRAPG = (mlbRA / mlbG))

Teams_lgyr

## # A tibble: 118 x 7
## yearID data mlbR mlbRA mlbG mlbRPG mlbRAPG
##
## 1 1901 11067 11067 2190 5.05 5.05
## 2 1902 9901 9901 2186 4.53 4.53
## 3 1903 9892 9892 2198 4.50 4.50
## 4 1904 9305 9305 2440 3.81 3.81
## 5 1905 9639 9639 2436 3.96 3.96
## 6 1906 8881 8878 2416 3.68 3.67
## 7 1907 8703 8703 2406 3.62 3.62
## 8 1908 8422 8422 2456 3.43 3.43
## 9 1909 8810 8810 2436 3.62 3.62
## 10 1910 9584 9584 2446 3.92 3.92
## # ... with 108 more rows
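A quick arithmetic cross-check, using the numbers printed in the tables above: the 1901 MLB totals should simply be the sum of the AL and NL figures from the league summary table.

```r
# 1901: AL + NL totals, from the league summary table above
5873 + 5194            # total runs: 11067, matching mlbR
1082 + 1108            # total games: 2190, matching mlbG
round(11067 / 2190, 2) # 5.05 runs per game, matching mlbRPG
```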

And again, we save the files for use in the shiny app.

MLB_RPG <- Teams_lgyr %>%
mutate(mlbRPG = round(mlbRPG, 2),
mlbRAPG = round(mlbRAPG, 2)) %>%
select(yearID, R = mlbR, RA = mlbRA, G = mlbG,
leagueRPG = mlbRPG, leagueRAPG = mlbRAPG)

write_csv(MLB_RPG, "MLB_RPG.csv")
write_feather(MLB_RPG, "MLB_RPG.feather")

d. Individual team values

Here we calculate an index of each team’s run scoring against the league average.
Note that we start with unnest() and create a new object, Teams_append … a tibble with all of the variables exposed.

Teams_append <- Teams_lgyr %>%
unnest() %>%
mutate(teamRPG=(R / (W + L)),
teamRAPG=(RA / (W + L)),
WLpct=(W / (W + L))) %>%
# runs scored index where 100=the league average for that season
mutate(R_index = (teamRPG / lgRPG) * 100) %>%
mutate(R_index.sd = sd(R_index)) %>%
mutate(R_z = (R_index - 100) / R_index.sd) %>%
# runs allowed
mutate(RA_index = (teamRAPG / lgRAPG) * 100) %>%
mutate(RA_index.sd = sd(RA_index)) %>%
mutate(RA_z = (RA_index - 100) / RA_index.sd)

Teams_append

## # A tibble: 2,496 x 67
## yearID mlbR mlbRA mlbG mlbRPG mlbRAPG lgID lgRPG lgRAPG lgR lgRA
##
## 1 1901 11067 11067 2190 5.05 5.05 AL 5.43 5.43 5873 5873
## 2 1901 11067 11067 2190 5.05 5.05 AL 5.43 5.43 5873 5873
## 3 1901 11067 11067 2190 5.05 5.05 AL 5.43 5.43 5873 5873
## 4 1901 11067 11067 2190 5.05 5.05 AL 5.43 5.43 5873 5873
## 5 1901 11067 11067 2190 5.05 5.05 AL 5.43 5.43 5873 5873
## 6 1901 11067 11067 2190 5.05 5.05 AL 5.43 5.43 5873 5873
## 7 1901 11067 11067 2190 5.05 5.05 AL 5.43 5.43 5873 5873
## 8 1901 11067 11067 2190 5.05 5.05 AL 5.43 5.43 5873 5873
## 9 1901 11067 11067 2190 5.05 5.05 NL 4.69 4.69 5194 5194
## 10 1901 11067 11067 2190 5.05 5.05 NL 4.69 4.69 5194 5194
## # ... with 2,486 more rows, and 56 more variables: lgG ,
## # teamID , franchID , divID , Rank , G ,
## # Ghome , W , L , DivWin , WCWin , LgWin ,
## # WSWin , R , AB , H , 2B , 3B ,
## # HR , BB , SO , SB , CS , HBP , SF ,
## # RA , ER , ERA , CG , SHO , SV ,
## # IPouts , HA , HRA , BBA , SOA , E ,
## # DP , FP , name , park , attendance ,
## # BPF , PPF , teamIDBR , teamIDlahman45 ,
## # teamIDretro , teamRPG , teamRAPG , WLpct ,
## # R_index , R_index.sd , R_z , RA_index ,
## # RA_index.sd , RA_z
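To make the index concrete, here is the 1901 Boston (AL) row worked through by hand, using the team values shown in the Teams_merge table below (759 runs scored, 79 wins, 57 losses) and the 1901 AL league rate of 5.43 runs per game:

```r
# 1901 Boston (AL): 759 runs scored in 79 + 57 = 136 games
teamRPG <- 759 / (79 + 57)        # about 5.58 runs per game
R_index <- (teamRPG / 5.43) * 100 # 100 = the league average for that season
round(R_index, 1)                 # about 102.8: a bit above the league average
```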

In this final step, we first create a new data object, Teams_merge.
Notes:

• rounding of a variety of the calculated variables, to address readability concerns.
• selection and renaming of variables to correspond with shiny app names.
• then write csv and feather versions.
Teams_merge <- Teams_append %>%
mutate(lgRPG = round(lgRPG, 2),
lgRAPG = round(lgRAPG, 2),
WLpct = round(WLpct, 3),
teamRPG = round(teamRPG, 2),
teamRAPG = round(teamRAPG, 2),
R_index = round(R_index, 1),
RA_index = round(RA_index, 1)
) %>%
select(yearID, lgID, franchID, teamID, name,
W, L, WLpct, R.x = R, RA.x = RA,
teamRPG, leagueRPG = lgRPG, R_index,
teamRAPG, leagueRAPG = lgRAPG, RA_index)

Teams_merge

## # A tibble: 2,496 x 16
## yearID lgID franchID teamID name W L WLpct R.x RA.x teamRPG
##
## 1 1901 AL NYY BLA Balt~ 68 65 0.511 760 750 5.71
## 2 1901 AL BOS BOS Bost~ 79 57 0.581 759 608 5.58
## 3 1901 AL CHW CHA Chic~ 83 53 0.61 819 631 6.02
## 4 1901 AL CLE CLE Clev~ 54 82 0.397 666 831 4.9
## 5 1901 AL DET DET Detr~ 74 61 0.548 741 694 5.49
## 6 1901 AL BAL MLA Milw~ 48 89 0.35 641 828 4.68
## 7 1901 AL OAK PHA Phil~ 74 62 0.544 805 760 5.92
## 8 1901 AL MIN WS1 Wash~ 61 72 0.459 682 771 5.13
## 9 1901 NL LAD BRO Broo~ 79 57 0.581 744 600 5.47
## 10 1901 NL ATL BSN Bost~ 69 69 0.5 531 556 3.85
## # ... with 2,486 more rows, and 5 more variables: leagueRPG ,
## # R_index , teamRAPG , leagueRAPG , RA_index

write_csv(Teams_merge, "Teams_merge.csv")
write_feather(Teams_merge, "Teams_merge.feather")

And the feather files can now be incorporated into the shiny app.
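On the app side, the change is symmetric: where the server code previously called read_csv(), it now calls feather::read_feather() on these files. A self-contained sketch of the round trip (using a tiny made-up table in a temp file, since the real files live in the app directory):

```r
library(feather)

# stand-in for one of the pre-wrangled tables written above
demo <- data.frame(yearID = c(1901, 1902), leagueRPG = c(5.05, 4.53))

# write and read back a .feather file, as the app does at startup;
# feather's binary, column-oriented layout avoids re-parsing csv text
path <- tempfile(fileext = ".feather")
write_feather(demo, path)
teams <- read_feather(path)
nrow(teams)  # 2
```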
-30-


To leave a comment for the author, please follow the link and comment on their blog: Bayes Ball. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### A Tale of Two (Small Belgian) Cities with Open Data: Official Crime Statistics and Self-Reported Feelings of Safety in Leuven and Vilvoorde

Mon, 02/25/2019 - 07:54

(This article was first published on Method Matters, and kindly contributed to R-bloggers)

In this post, we will analyze government data from the Flemish region in Belgium on A) official crime statistics and B) self-reported feelings of safety among residents of Flanders. We will focus our analysis on two cities in the province of Flemish Brabant: Leuven and Vilvoorde. A key question of this analysis is: do the residents of the safer city (as measured by official government statistics) feel safer and have more pride in their city (according to government polling data)?

The Data

Crime

Our first set of data consists of official crime figures from the Flemish government, obtained at this website. I chose to download the data adjusted for population size. Specifically, the data record the number of reported incidents per 1000 residents. The types of crime reported are: property damage, theft, physical violence, and overall incidents. As of October 2018, these figures were available for the years 2000-2016.

Feelings of Safety

The Flemish government regularly conducts a survey (called the “Stadsmonitor”) to monitor how the residents of Flanders feel about many different topics, including how people feel about the city or town where they live. Summary data are available by question on the official survey website. In this blog post, we will analyze answers from 2017 on the following questions: insecurity, problems, vandalism, and pride in one’s city.

Open Data and Code

I’m increasingly sharing data and code for blog posts in the hopes that this will be interesting or useful for readers. If you’d like to reproduce the analyses done here (or explore a different question), you can find the data and code at this link.

Data Preparation

Crime Data

The code below reads the raw crime data file and first does some basic cleaning. These files were only available in an Excel format, and were not meant to be machine readable, so it takes some effort to manipulate the data to work with them in statistical analysis programs like R.

I then produce an additional set of crime statistics for Leuven. Because Leuven is a university city, there are a number of students who live here during the year, but who are not officially registered as residents of the city. Essentially, these “invisible” residents live in the city, potentially commit or are victims of crime, but are not counted as residents and so do not factor into the calculation of crime per 1000 residents. In my analysis below, I’m assuming that there are an additional 30% “invisible” residents, and therefore divide all of the crime figures by 1.3. I call these data “Leuven student assumptions” because they are adjustments to the actual figures, and not the official statistics themselves. The 30% figure comes from this article (in Dutch), in which the former mayor gives an estimate of the number of “invisible” student residents. *
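Dividing the published rates by 1.3 works because the rate is expressed as incidents per 1000 residents: inflating the population denominator by 30% is the same as deflating the rate by that factor. For example, using Leuven’s 2000 overall crime rate (132.81, from the crime_long table shown further below):

```r
# rate = incidents per 1000 registered residents;
# with 30% more (unregistered) residents, the denominator grows by 1.3,
# so the adjusted rate is simply rate / 1.3
official_rate <- 132.81       # Leuven, overall crime, 2000
adjusted_rate <- official_rate / 1.3
round(adjusted_rate, 2)       # 102.16
```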

I then convert the data from its original wide format to a long format, as ggplot2 requires data in the long format for plotting. The code to read the data, produce the adjustment for students, and transform the data is below:

# load the packages we'll need
library(openxlsx)
library(ggplot2)
library(plyr);library(dplyr)
library(tidyr)
# install devtool version of package
# devtools::install_github("jbryer/likert")
library(likert)

# define path to the data
in_dir <- 'C:\\Folder\\'

# function to read in crime data
read_crime_data <- function(file_name_f, in_dir_f){
  # read the file
  crime_raw_f <- read.xlsx(paste0(in_dir_f, file_name_f))
  # remove weird rows with no data
  crime_raw_f <- crime_raw_f[1:12,]
  # make the column name city (translate from Dutch to English)
  names(crime_raw_f)[names(crime_raw_f) == 'Gebied'] <- 'City'
  # make the Flemish region code also in English
  crime_raw_f$City[crime_raw_f$City == "Vlaams Gewest"] <- "Vlaams Gewest (Flemish Region)"
  # return the clean dataset
  return(crime_raw_f)
}

crime_clean <- read_crime_data('Leuven Vilvoorde Crime per 1000 inhabitants clean.xlsx', in_dir)

# make function to adjust for
# "missing" students
# living but not registered in Leuven
divide_it <- function(x){
divided_figure <- x / 1.3
return(divided_figure)
}

# create the data for the Leuven student assumptions
leuven_w_students <- crime_clean %>% filter(City == 'Leuven') %>%
# mutate variables in the year columns (crime figures)
# we apply our function to each cell (divide by 1.3)
mutate_at(vars(`2000`:`2016`), divide_it) %>%
# change the name of the city to indicate it
# reflects our assumptions, not the actual data
mutate(City = dplyr::recode(City, 'Leuven' = 'Leuven Student Assumptions' ))

# stack our newly created data on top of original data
master_crime <- rbind(crime_clean, leuven_w_students)

# transform from wide to long format
# ggplot2 needs data in this structure
crime_long <- gather(master_crime, year, value, `2000`:`2016`)

The code produces a dataset (crime_long) that looks like this (only first 10 lines shown):

   City                           Categorie                                                          year  value
1  Leuven                         Beschadigen van eigendom (aantal per 1.000 inw.)                   2000  14.82
2  Leuven                         Diefstal en afpersing (aantal per 1.000 inw.)                      2000  54.43
3  Leuven                         Misdr. tegen de lichamelijke integriteit (aantal per 1.000 inw.)  2000   6.17
4  Leuven                         Criminaliteitsgraad (aantal per 1.000 inw.)                        2000 132.81
5  Vilvoorde                      Beschadigen van eigendom (aantal per 1.000 inw.)                   2000   8.47
6  Vilvoorde                      Diefstal en afpersing (aantal per 1.000 inw.)                      2000  44.95
7  Vilvoorde                      Misdr. tegen de lichamelijke integriteit (aantal per 1.000 inw.)  2000   6.08
8  Vilvoorde                      Criminaliteitsgraad (aantal per 1.000 inw.)                        2000  88.22
9  Vlaams Gewest (Flemish Region) Beschadigen van eigendom (aantal per 1.000 inw.)                  2000   9.36
10 Vlaams Gewest (Flemish Region) Diefstal en afpersing (aantal per 1.000 inw.)                     2000  37.12

Note that we also have data on the entire Flemish region (Vlaams Gewest) which is a nice point of comparison – the global average across the territory. The descriptions are all in Dutch, but we’ll use English translations for plotting below.

Survey Data

The code below reads in the datasets (there is 1 dataset per question, and each dataset contains summary statistics from many different cities – here we focus on just Leuven and Vilvoorde). Unfortunately, there are slight differences in the formatting of the files between the questions, so I wrote a different function to read in every dataset. The functions read the data, select the observations from Leuven and Vilvoorde in 2017, and make English language column names (the original names are all in Dutch).

###### safety
read_safety_data <- function(file_name_f, in_dir_f){
  # read the file, specifying that there are dates
  safety_raw_f <- read.xlsx(paste0(in_dir, file_name_f), detectDates = TRUE)
  # remove info hanging around at bottom of file
  safety_raw_f <- safety_raw_f[!is.na(safety_raw_f$Indicator),]
  # name the columns
  names(safety_raw_f) <- c("Gemeente","Indicator","Item", 'Date', 'Empty', 'Never/Seldom', 'Sometimes', 'Often/Always')
  # select observations just from 2017 - we only have Vilvoorde for that date
  safety_raw_f <- safety_raw_f[safety_raw_f$Date == '2017-01-01',]
  # remove the empty column
  safety_raw_f$Empty <- NULL
  # replace missings in questionnaire data with zeros
  safety_raw_f[is.na(safety_raw_f)] <- 0
  # select only relevant columns for Leuven and Vilvoorde
  safety_raw_f <- safety_raw_f[safety_raw_f$Gemeente == 'Leuven' | safety_raw_f$Gemeente == 'Vilvoorde', c(1,3,5:7)]
  # make sure that the "Item" variable represents city vs. neighborhood
  # (for Likert plot)
  safety_raw_f$Item <- c('Neighborhood', 'City', 'Neighborhood', 'City')
  # make city and Item factors
  # (for Likert plot)
  safety_raw_f$Gemeente <- as.factor(safety_raw_f$Gemeente)
  safety_raw_f$Item <- as.factor(safety_raw_f$Item)
  # return the clean dataset
  return(safety_raw_f)
}

###### problems
read_problem_data <- function(file_name_f, in_dir_f){
  # read the file, specifying that there are dates
  probs_raw_f <- read.xlsx(paste0(in_dir, file_name_f), detectDates = TRUE)
  # remove info hanging around at bottom of file
  probs_raw_f <- probs_raw_f[!is.na(probs_raw_f$Indicator),]
  # name the columns
  names(probs_raw_f) <- c("Gemeente","Indicator","Item", 'Date', 'Empty', 'Never/Seldom', 'Sometimes', 'Often/Always')
  # select observations just from 2017 - we only have Vilvoorde for that date
  probs_raw_f <- probs_raw_f[probs_raw_f$Date == '2017-01-01',]
  # remove the empty column
  probs_raw_f$Empty <- NULL
  # replace missings in questionnaire data with zeros
  probs_raw_f[is.na(probs_raw_f)] <- 0
  # select only relevant columns for Leuven and Vilvoorde
  probs_raw_f <- probs_raw_f[probs_raw_f$Gemeente == 'Leuven' | probs_raw_f$Gemeente == 'Vilvoorde', c(1,3,5:7)]
  # delete the original item column (this is text on y axis)
  probs_raw_f$Item <- NULL
  # make city (Gemeente) the "Item"
  # (for Likert plot)
  names(probs_raw_f)[1] <- 'Item'
  # return the clean dataset
  return(probs_raw_f)
}

####### vandalism
read_vandalism_data <- function(file_name_f, in_dir_f){
  # read the file, specifying that there are dates
  vand_raw_f <- read.xlsx(paste0(in_dir, file_name_f), detectDates = TRUE)
  # remove info hanging around at bottom of file
  vand_raw_f <- vand_raw_f[!is.na(vand_raw_f$Indicator),]
  # name the columns
  names(vand_raw_f) <- c("Gemeente","Indicator","Item", 'Date', 'Empty', 'Never/Seldom', 'Sometimes', 'Often/Always')
  # select observations just from 2017 - we only have Vilvoorde for that date
  vand_raw_f <- vand_raw_f[vand_raw_f$Date == '2017' & vand_raw_f$Item == 'Vernieling straatmeubilair',]
  # remove the empty column
  vand_raw_f$Empty <- NULL
  # replace missings in questionnaire data with zeros
  vand_raw_f[is.na(vand_raw_f)] <- 0
  # select only relevant columns for Leuven and Vilvoorde
  vand_raw_f <- vand_raw_f[vand_raw_f$Gemeente == 'Leuven' | vand_raw_f$Gemeente == 'Vilvoorde', c(1,3,5:7)]
  # delete the original item column (this is text on y axis)
  vand_raw_f$Item <- NULL
  # make city (Gemeente) the "Item"
  # (for Likert plot)
  names(vand_raw_f)[1] <- 'Item'
  # return the clean dataset
  return(vand_raw_f)
}

######### pride in one's city
read_pride_data <- function(file_name_f, in_dir_f){
  # read the file, specifying that there are dates
  pride_raw_f <- read.xlsx(paste0(in_dir, file_name_f), detectDates = TRUE)
  # remove info hanging around at bottom of file
  pride_raw_f <- pride_raw_f[!is.na(pride_raw_f$Indicator),]
  # name the columns
  names(pride_raw_f) <- c("Gemeente","Indicator","Item", 'Date', 'Empty', 'Mostly/Completely disagree',
                          'Neither agree nor disagree', 'Mostly/Completely agree')
  # select observations just from 2017 - we only have Vilvoorde for that date
  pride_raw_f <- pride_raw_f[pride_raw_f$Date == '2017-01-01',]
  # remove the empty column
  pride_raw_f$Empty <- NULL
  # replace missings in questionnaire data with zeros
  pride_raw_f[is.na(pride_raw_f)] <- 0
  # select only relevant columns for Leuven and Vilvoorde
  pride_raw_f <- pride_raw_f[pride_raw_f$Gemeente == 'Leuven' | pride_raw_f$Gemeente == 'Vilvoorde', c(1,3,5:7)]
  # delete the original item column (this is text on y axis)
  pride_raw_f$Item <- NULL
  # make city (Gemeente) the "Item"
  # (for Likert plot)
  names(pride_raw_f)[1] <- 'Item'
  # return the clean dataset
  return(pride_raw_f)
}

# read in the data
safety_clean <- read_safety_data('SAMEN_onveiligheid.xlsx', in_dir)
probs_clean <- read_problem_data('SAMEN_problemen.xlsx', in_dir)
vandal_clean <- read_vandalism_data('SAMEN_vandalisme.xlsx', in_dir)
pride_clean <- read_pride_data('WON_fierheid.xlsx', in_dir)

Each function is executed in a single line, and returns a cleaned dataset for each of the four questions we will examine. For illustrative purposes, the safety dataset is shown below. The answers are in response to the following question: How often do you feel unsafe in your neighborhood and city?

   Gemeente         Item Never/Seldom Sometimes Often/Always
1    Leuven  Neighborhood        83.79     13.16         3.05
2    Leuven          City        74.76     20.88         4.36
3  Vilvoorde Neighborhood        66.04     24.05         9.91
4  Vilvoorde         City        46.67     31.22        22.11

Visualization

Crime Data

We will produce one graph for each outcome, and assemble the graphs into a single plot. For each outcome, we will plot four different lines, displaying the crime figures across the years for Leuven (“Leuven” in the plots below), Leuven with our assumptions about the student population (assuming 30% “invisible” student residents as described above; “Leuven Student Assumptions” in the plots below), Vilvoorde, and the Vlaams Gewest (average for the entire Flemish region). The code below stores each graph in an object, and uses the gridExtra package to plot them all in a single plot.
The code looks like this:

# property damage
prop_damage_plot <- crime_long %>%
  filter(Categorie == 'Beschadigen van eigendom (aantal per 1.000 inw.)') %>%
  ggplot(aes(x = year, y = value, color = City, group = City)) +
  geom_line(aes(color = City), size = 1) +
  geom_point() +
  labs(x = 'Year', y = 'Property Damage (per 1000 inhabitants)',
       title = 'Property Damage: 2000 - 2016')

# theft
theft_plot <- crime_long %>%
  filter(Categorie == 'Diefstal en afpersing (aantal per 1.000 inw.)') %>%
  ggplot(aes(x = year, y = value, color = City, group = City)) +
  geom_line(aes(color = City), size = 1) +
  geom_point() +
  labs(x = 'Year', y = 'Theft (per 1000 inhabitants)',
       title = 'Theft: 2000 - 2016')

# physical violence
phys_violence_plot <- crime_long %>%
  filter(Categorie == 'Misdr. tegen de lichamelijke integriteit (aantal per 1.000 inw.)') %>%
  ggplot(aes(x = year, y = value, color = City, group = City)) +
  geom_line(aes(color = City), size = 1) +
  geom_point() +
  labs(x = 'Year', y = 'Physical Violence (per 1000 inhabitants)',
       title = 'Physical Violence: 2000 - 2016')

# overall crime
overall_crime_plot <- crime_long %>%
  filter(Categorie == 'Criminaliteitsgraad (aantal per 1.000 inw.)') %>%
  ggplot(aes(x = year, y = value, color = City, group = City)) +
  geom_line(aes(color = City), size = 1) +
  geom_point() +
  labs(x = 'Year', y = 'Overall Crime (per 1000 inhabitants)',
       title = 'Overall Crime: 2000 - 2016')

# plot multiple graphs with grid.arrange
# from the gridExtra package
# we give a bit of extra space at the bottom of each graph
library(gridExtra)
margin <- theme(plot.margin = unit(c(1, .5, .5, .5), "cm"))
grid.arrange(grobs = lapply(list(prop_damage_plot, theft_plot,
                                 phys_violence_plot, overall_crime_plot),
                            "+", margin),
             nrow = 4)

This produces the following plot. The official crime statistics for Leuven are higher than the other points of comparison for every outcome after 2003.
Even with our assumptions about the “invisible” residents, Leuven consistently has higher crime rates than Vilvoorde and the Flemish region average. Vilvoorde sticks close to the Flemish average for property damage and physical violence, but is higher than average for theft and overall crime rates.

Questionnaire / Self-Report Data

Subjective Feelings of Safety

We will first take a look at subjective feelings of safety in one’s city and neighborhood, using the likert package to plot the responses to the questionnaires. The likert package is great for producing stacked bar charts, a common visualization of responses to questionnaire data, and the format of the data from the Stadsmonitor is well suited to this type of visual representation.

We first create a likert scale object from the safety dataset we created above. Because the same question was asked in regard to both one’s neighborhood and one’s city, we will use a grouping in our plot. This allows us to display the responses for both areas with a correct sub-heading for neighborhood and city. The code below creates the likert object and produces the plot.

# make likert object
# for safety we use a grouping: neighborhood vs. city
# (questions were asked about both)
safety_likert <- likert(summary = safety_clean, grouping = safety_clean[,1])

# make the plot
plot(safety_likert, group.order = c('Leuven', 'Vilvoorde')) +
  ggtitle('How often do you feel unsafe in your...')

Leuven respondents rarely report feeling unsafe in their city (only 4% say they feel unsafe always or often) or neighborhood (only 3%). The figures are considerably higher for Vilvoorde: fully 22% of respondents there say that they always or often feel unsafe in their city, and 10% say that they always or often feel unsafe in their neighborhood.
Social Problems and Pride in One’s City

We will next examine questions about two social problems: being hassled on the street and seeing vandalism (literally “witnessing the destruction of street furniture” – public benches, lamp posts, etc.). We will also look at reports of feelings of pride in one’s city (an overall perception which is no doubt informed by feelings of safety).

We will display these three questions in a single plot, as we did above for the crime data. We first create a likert graph object for each question (as the questions don’t have the same structure and response options, it’s better to plot each one by itself). We then use the gridExtra package to display all of the graphs in a single plot. The following code creates the likert graph objects and displays them in one large plot.

# make likert objects for the other questions
likert_problems <- likert(summary = probs_clean)
likert_vand <- likert(summary = vandal_clean)
likert_pride <- likert(summary = pride_clean)

# first make an object for each graph
prob_plot <- plot(likert_problems, group.order = c('Leuven', 'Vilvoorde')) +
  ggtitle('How often are you hassled on the street?')
vand_plot <- plot(likert_vand, group.order = c('Leuven', 'Vilvoorde')) +
  ggtitle('How often do you see destruction of street furniture?')
pride_plot <- plot(likert_pride, group.order = c('Leuven', 'Vilvoorde')) +
  ggtitle('I am truly proud of my city')

# plot multiple graphs with grid.arrange
# from the gridExtra package
# we give a bit of extra space at the bottom of each graph
margin <- theme(plot.margin = unit(c(1, .5, .5, .5), "cm"))
grid.arrange(grobs = lapply(list(prob_plot, vand_plot, pride_plot), "+", margin))

When considering the social problems questions (being hassled on the street and seeing vandalism), the percentage of respondents in Leuven and Vilvoorde who report that they often or always experience these things is more-or-less the same.
The cities distinguish themselves in the middle and lower response categories. More residents in Vilvoorde (vs. Leuven) say that they sometimes are hassled on the street and witness vandalism, and fewer residents in Vilvoorde (vs. Leuven) say that they never or seldom experience these things. In sum, residents of Leuven report fewer social and safety-related problems than residents of Vilvoorde.

The results of the pride question are quite striking. In Leuven, 75% of the respondents are proud of their city, versus only 34% in Vilvoorde (less than half the Leuven percentage). Only 6% of Leuven residents are not proud of their city, while this figure is 23% in Vilvoorde.

The self-report data are clear: residents of Leuven feel safer, report fewer social problems, and are much more proud of their city when compared with residents of Vilvoorde.

Summary and Conclusion

In this post, we examined official crime statistics and survey data on subjective feelings of safety in two Flemish cities: Leuven and Vilvoorde. Interestingly, the official crime statistics and survey data seem to tell two different stories. On the one hand, Leuven (compared to Vilvoorde) has considerably higher rates of property damage, theft and physical violence, even when assuming that the official statistics are slightly biased because of “invisible” students who are not legally registered as city residents. On the other hand, people in Leuven say that they feel much safer in their neighborhoods and in their city. Residents of Leuven also report seeing fewer social problems and are prouder of their city.

It’s interesting to see this disconnect between official crime statistics and self-reported feelings of safety. I will admit that I’m not entirely sure what to make of it. Potential explanations include:

1. There are more “invisible” student residents than I have assumed.
If this is the case, it is possible that the true population-adjusted crime figures are lower in Leuven than in Vilvoorde (although to close the gap between the cities, there would have to be a very large number of invisible residents in Leuven). 2. Residents in Vilvoorde under-report crime that happens in their city. If this is the case, the actual crime figures in Vilvoorde are much higher, perhaps equal to or greater than those in Leuven (although the similarity between the Vilvoorde figures and those of the Flemish region suggests this is unlikely). 3. The “types” of people who respond to the survey questions are different in Leuven vs. Vilvoorde. If this is the case, the difference in observed feelings of safety and seeing social problems reflects sampling bias, not the true aggregate feelings of city residents. However, the Stadsmonitor survey appears to have been meticulously conducted, and it strikes me that such sampling errors are unlikely. My sense of these data is that, in some contexts, actual safety and feelings of safety are different things. The overall crime figures in Leuven are not very high, even though they are much higher than the average for the Flemish region (which is, after all, a relatively well-off part of the world). Given the relatively low crime rate overall, perhaps other factors influence how residents feel in the place where they live. Leuven is a wealthy city in a wealthy province. My sense is that people have a common identity as Leuven residents (Leuvenaars in Dutch) and are in many ways a homogeneous group (despite a sizeable number of foreign students and academics). Furthermore, the city is doing well economically and has a vision for itself (the university, science and technology spinoffs, along with a large bank and brewery company are all valuable sources of revenue for the city and its residents and seem likely to ensure a solid future in the modern economy). Vilvoorde, by comparison, is much less wealthy. 
It is socially very diverse, with its residents coming from many different parts of the world. As such, it doesn’t have a single strong source of identity like Leuven, and its diverse residents do not always mix or know each other very well. Given the different economic and social situations of these two cities, it is perhaps understandable that Leuven residents feel safer than Vilvoorde residents, even when the actual crime statistics suggest that the opposite is true.

Coming Up Next

In the next post, we will solve a basic programming puzzle that is (apparently) asked in data-science interviews. We will use this puzzle as a case study for understanding similarities and differences in programming logic and implementation of control structures in R vs. Python. Stay tuned!

——

* If you’ve got a better one, please let me know in the comments!

To leave a comment for the author, please follow the link and comment on their blog: Method Matters. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Bayesian Optimization for Hyper-Parameter

Mon, 02/25/2019 - 01:44

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

In the past several weeks, I spent a tremendous amount of time reading the literature about automatic parameter tuning in the context of Machine Learning (ML), most of which can be classified into two major categories, namely search and optimization.
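As a small base-R illustration of the optimization category (not the benchmark code from the post, which is not reproduced here), the gradient-free Nelder–Mead simplex available through optim() can minimize a toy stand-in for a validation loss without any derivative information:

```r
# Toy objective with a known minimum at (1, 2);
# it stands in for an expensive cross-validation loss.
f <- function(p) (p[1] - 1)^2 + (p[2] - 2)^2

# Nelder-Mead is the default derivative-free method in optim()
fit <- optim(c(0, 0), f, method = "Nelder-Mead")
round(fit$par, 2)  # close to c(1, 2)
```

In a real tuning problem, f would train a model with the proposed hyper-parameters and return a validation loss, which is exactly where the number of function evaluations starts to matter.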
Searching mechanisms, such as grid search, random search, and the Sobol sequence, can be somewhat computationally expensive. However, they are extremely easy to implement and parallelize on a multi-core PC, as shown in https://statcompute.wordpress.com/2019/02/03/sobol-sequence-vs-uniform-random-in-hyper-parameter-optimization. On the other hand, optimization algorithms, especially gradient-free optimizers such as the Nelder–Mead simplex and particle swarm, are often able to quickly locate close-to-optimal solutions in cases where the global optimum is neither feasible nor necessary, as shown in https://statcompute.wordpress.com/2019/02/10/direct-optimization-of-hyper-parameter and https://statcompute.wordpress.com/2019/02/23/gradient-free-optimization-for-glmnet-parameters. In the example below, another interesting approach, namely Bayesian optimization (https://arxiv.org/abs/1206.2944), is demonstrated and compared with CMA-ES (https://www.researchgate.net/publication/227050324_The_CMA_Evolution_Strategy_A_Comparing_Review), which is also a popular gradient-free optimizer based on the evolution strategy. As shown in the result, the output from Bayesian optimization is closer to the ones from the Nelder–Mead simplex and particle swarm. What’s more, Bayesian optimization is more consistent than CMA-ES across multiple trials in the experiment.

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.
### January 2019: “Top 40” New CRAN Packages

Mon, 02/25/2019 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

One hundred and fifty-three new packages made it to CRAN in January. Here are my “Top 40” picks in eight categories: Computational Methods, Data, Machine Learning, Medicine, Science, Statistics, Utilities, and Visualization.

Computational Methods

cPCG v1.0: Provides a function to solve systems of linear equations using a (preconditioned) conjugate gradient algorithm. The vignette shows how to use the package.

RcppDynProg v0.1.1: Implements dynamic programming using Rcpp. Look here for examples.

Data

cimir v0.1-0: Provides functions to connect to the California Irrigation Management Information System (CIMIS) Web API. See the Quick Start for details.

ecmwfr v1.1.0: Provides a programmatic interface to the European Centre for Medium-Range Weather Forecasts’ public and restricted dataset web services ECMWF, as well as Copernicus’s Climate Data Store CDS, allowing users to download weather forecasts and climate data. There are vignettes for both CDS and ECMWFR.

germanpolls v0.2: Provides functions to extract data from Wahlen.de.

nhdR v0.5.1: Provides tools for working with the National Hydrography Dataset, with functions for querying, downloading, and networking both the NHD and NHDPlus datasets. There are vignettes for Creating Simple Maps and Querying Flow Information, as well as an example.

snotelr v1.0.1: Provides a programmatic interface to the SNOTEL snow data. See the vignette for information.

wdpar v0.0.2: Provides an interface to the World Database on Protected Areas (WDPA).
Data is obtained from Protected Planet. See the README for information.

Machine Learning

analysisPipelines v1.0.0: Implements an R interface that enables data scientists to compose interoperable pipelines between R, Spark, and Python for data manipulation, exploratory analysis, modeling, and reporting. There are vignettes for Python Functions, R data frames, Spark data frames, Interoperable Pipelines, Meta-pipelines, Streaming Analysis Pipelines, and Using Pipelines with Spark.

bender v0.1.1: Implements an R client for the Bender Hyperparameters optimizer.

FiRE v1.0: Implements an algorithm to find outliers and rare entities in voluminous datasets. Look here for information.

foto v1.0.0: Implements the Fourier Transform Textural Ordination method, which uses a principal component analysis on radially averaged, two-dimensional Fourier spectra to characterize image texture. See the vignette for details.

RcppHNSW v0.1.0: Provides bindings to the Hnswlib C++ library for Approximate Nearest Neighbors.

ruimtehol v0.1.2: Wraps the StarSpace library, allowing users to calculate word, sentence, article, document, webpage, link, and entity embeddings. The techniques are explained in detail in Wu et al. (2017). See the vignette for more information.

zoomgrid v1.0.0: Implements a grid search algorithm with zoom to help solve difficult optimization problems where there are many local optima inside the domain of the target function. Look here for information.

Medicine

bayesCT v0.99.0: Provides functions to simulate and analyze Bayesian adaptive clinical trials, incorporating historical data and allowing for early stopping. There is an Introduction, and vignettes for Binomial and Normal outcomes.

BioMedR v1.1.1: Provides tools for calculating 293 chemical descriptors and 14 kinds of chemical fingerprints, 9920 protein descriptors based on protein sequences, more than 6000 DNA/RNA descriptors from nucleotide sequences, and six types of interaction descriptors.
There is a very informative vignette.

dr4pl v1.1.8: Models the relationship between dose levels and responses in a pharmacological experiment using the 4 Parameter Logistic model, and provides bounds that prevent parameter estimates from diverging. See Gadagkar and Call (2015) and Ritz et al. (2015) for background information, and the vignette for examples.

GMMAT v1.0.3: Provides functions to perform association tests using generalized linear mixed models (GLMMs) in genome-wide association studies (GWAS) and sequencing association studies. See Chen et al. (2016) and Chen et al. (2019) for background information, and the vignette for an introduction to the package.

Science

ethnobotanyR v0.1.4: Implements functions to calculate common quantitative ethnobotany indices to assess cultural significance. See Tardio and Pardo-de-Santayana (2008) for background information, and the vignette for information on the package.

wsyn v1.0.0: Implements tools for a wavelet-based approach to analyzing spatial synchrony, principally in ecological data. The vignette gives the details.

Statistics

apcf v0.1.2: Implements the adapted pair correlation function, which transfers the concept of the pair correlation function from point patterns to patterns of objects of finite size and irregular shape. This is a re-implementation of the method suggested by Nuske et al. (2009). See the vignette for details.

concurve v1.0.1: Provides functions to compute confidence (compatibility/consonance) intervals for various statistical tests, along with their corresponding P-values and S-values. Consonance functions allow modelers to determine what effect sizes are compatible with the test model at various compatibility levels. For details, see Poole (1987), Schweder and Hjort (2002), Singh, Xie, and Strawderman (2007), and Amrhein, Trafimow and Greenland (2018). See the vignette for examples.
IMaGES v0.1: Provides functions to implement Independent Multiple-sample Greedy Equivalence Search (IMaGES), a causal inference algorithm for creating aggregate graphs and structural equation modeling data for one or more datasets. See Ramsey et al. (2010) for background information. There is a vignette.

metamer v0.1.0: Provides functions to create data with identical statistics (metamers) using an iterative algorithm proposed by Matejka & Fitzmaurice (2017). See the README for help with the package.

mimi v0.1.0: Implements functions to estimate main effects and interactions in mixed data sets with missing values. Estimation is done through a convex program where main effects are assumed sparse and the interactions low-rank. See Geneviève et al. (2018).

pcLasso v1.1: Implements a method for fitting the entire regularization path of the principal components lasso for linear and logistic regression models. See Tay, Friedman, and Tibshirani (2014) for details and the vignette for an Introduction.

qrandom v1.1: Implements an API to the ANU Quantum Random Number Generator, provided by the Australian National University, which generates true random numbers in real time by measuring the quantum fluctuations of the vacuum. The generator is based on the papers by Symul et al. (2011) and Haw et al. (2015). Look here for live random numbers.

rstap v1.0.3: Provides tools for estimating spatial temporal aggregated predictor models with Stan. See the vignette for an introduction.

ROCit v1.1.1: Provides functions to calculate and visualize performance measures for binary classifiers. The vignette describes the details.

surveysd v1.0.0: Provides functions to calculate point estimates and their standard errors in complex household surveys using bootstrap replicates. A comprehensive description of the methodology can be found here.

Utilities

askpass v1.1: Provides safe password entry for R, Git, and SSH. Look here for help.
logger v0.1: Provides a flexible and extensible way of formatting and delivering log messages with low overhead. There is an Introduction and vignettes on The Anatomy of a Log Request, Format Customization, Migration, Benchmarks, Logging, and Extensions.

pagedown v0.1: Implements tools to use the paged media properties in CSS and the JavaScript library paged.js to split the content of an HTML document into discrete pages. See the README for details.

rmd v0.1.4: Provides functions to manage multiple R Markdown packages. Look here for information.

tor v1.1.1: Provides functions that enable users to import multiple files at the same time. See the README for details.

vitae v0.1.0: Provides templates and functions to simplify the production and maintenance of curricula vitae. There is an Introduction and a vignette for creating templates.

Visualization

gganimate v1.0.1: Implements a ggplot2-compatible grammar for creating animations. The vignette will get you started.

RIdeogram v0.1.1: Implements tools to draw SVG (Scalable Vector Graphics) graphics to visualize and map genome-wide data in ideograms. See the vignette for information.

voronoiTreemap v0.2.0: Provides functions to create Voronoi tree maps using the d3.js framework. Look here for examples.

To leave a comment for the author, please follow the link and comment on their blog: R Views.
### ‚Arrest this man, he talks in maths‘ – Animating ten years of listening history on Last.FM

Mon, 02/25/2019 - 01:00

(This article was first published on Rcrastinate, and kindly contributed to R-bloggers)

Previously, when Rcrastinate was still on blogspot.com, I had a first look at ten years of my playback history on Last.FM. But there is still a lot one can do with this dataset. I have wanted to try {gganimate} for a long time, and this nice longitudinal dataset gives me the opportunity.

Loading and preparing the data

First, I am loading the dataset. I already did some preparations like extracting the top five tags for each track and some other stuff I used in my previous entry.

```r
library(knitr) # for 'kable'
scrobs <- readRDS("LastFM-history-proc2.Rds")
# I am getting rid of some columns we won't need.
scrobs <- scrobs[, c("artist", "track", "album", "timestamp")]
nrow(scrobs)
## [1] 65356
kable(head(scrobs))
```

| artist | track | album | timestamp |
|---|---|---|---|
| Cat Power | Ramblin’ (Wo)man | Cat Power – Jukebox (Deluxe Edition) | 2009-01-01 13:30:00 |
| Cat Power | Metal Heart (2008 Version) | Cat Power – Jukebox (Deluxe Edition) | 2009-01-01 13:33:00 |
| Cat Power | Silver Stallion | Cat Power – Jukebox (Deluxe Edition) | 2009-01-01 13:37:00 |
| Cat Power | Aretha, Sing One for Me | Cat Power – Jukebox (Deluxe Edition) | 2009-01-01 13:40:00 |
| Cat Power | Lost Someone | Cat Power – Jukebox (Deluxe Edition) | 2009-01-01 13:43:00 |
| Cat Power | Lord, Help the Poor & Needy | Cat Power – Jukebox (Deluxe Edition) | 2009-01-01 13:46:00 |

So, you see: there is one row in the dataset for each song I played (65356 in total). Artist, track, and album name as well as the timestamp of the scrobble are associated with each playback.
We will now extract the date from the timestamp and create a new dataset art.day that contains the number of playbacks (or “scrobbles” in Last.FM’s terminology) for each artist on each day in the ten-year period.

```r
scrobs$date <- substr(scrobs$timestamp, 1, 10)
art.day <- as.data.frame(table(scrobs$artist, scrobs$date))
names(art.day) <- c("Artist", "Date", "Scrobbles")
kable(tail(art.day[art.day$Artist == "Beck",]), row.names = F)
```

| Artist | Date | Scrobbles |
|---|---|---|
| Beck | 2018-12-25 | 1 |
| Beck | 2018-12-27 | 0 |
| Beck | 2018-12-28 | 0 |
| Beck | 2018-12-29 | 0 |
| Beck | 2018-12-30 | 9 |
| Beck | 2018-12-31 | 0 |

What you see above are the last 6 days of my scrobble history for Beck. Now, for some {dplyr} magic. I am starting to really like what you can do with all the piping and stuff.

Calculating cumulative sums

What I want to do now is:

• Group art.day by artist.
• Create a new column called cum.plays (for “cumulated playbacks”) which holds the number of scrobbles for this artist up to the respective point in time.
• To achieve this, I am sorting the different groups of the dataset by date before computing the cumulated sum – just to be sure that the cumulated sum is correct.
• In the end, I am only ungrouping the dataset again and overwriting the variable holding the old one (because the operations before are “non-destructive”).
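The same grouped cumulative sum can also be sketched in base R with ave(), shown here on a small made-up slice of the data (the artist names and counts below are invented for illustration):

```r
# Toy version of art.day: one row per artist per day
toy <- data.frame(
  Artist = c("Beck", "Beck", "Beck", "Radiohead", "Radiohead"),
  Date = c("2018-12-29", "2018-12-30", "2018-12-31",
           "2018-12-30", "2018-12-31"),
  Scrobbles = c(0, 9, 0, 4, 0)
)
toy <- toy[order(toy$Date), ]  # sort by date first, as in the pipeline
# running total of scrobbles within each artist group
toy$cum.plays <- ave(toy$Scrobbles, toy$Artist, FUN = cumsum)
toy$cum.plays  # 0 9 4 9 4 (rows interleaved by date)
```

This does in three lines what the {dplyr} pipeline does with group_by(), arrange() and mutate(); the pipeline version scales more readably to further grouped operations.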
```r
library(dplyr)
art.day %>%
  group_by(Artist) %>%
  arrange(Date) %>%
  mutate(cum.plays = cumsum(Scrobbles)) %>%
  ungroup() -> art.day
kable(tail(art.day[art.day$Artist == "Radiohead",]), row.names = F)
```

| Artist | Date | Scrobbles | cum.plays |
|---|---|---|---|
| Radiohead | 2018-12-25 | 0 | 1524 |
| Radiohead | 2018-12-27 | 4 | 1528 |
| Radiohead | 2018-12-28 | 2 | 1530 |
| Radiohead | 2018-12-29 | 0 | 1530 |
| Radiohead | 2018-12-30 | 4 | 1534 |
| Radiohead | 2018-12-31 | 0 | 1534 |

With the last lines for Radiohead we can see that this worked: on Dec 25th, 2018, 1524 scrobbles were registered for Radiohead. On Dec 26th, I didn’t listen to any music, so this day is missing from the dataset altogether. This is no problem because I am converting the Date column to date format later. On Dec 27th, four more scrobbles were registered for Radiohead, so cum.plays is increased by four to 1528, and so on.

Selecting top ten artists

Now, I am doing three things:

• Create a table with the overall artist scrobbles to get the top ten artists.
• Restrict art.day to scrobbles by these top ten artists only and save the result in top.art.day.
• Convert the Date column in the new dataset to date format.

```r
art.tab <- sort(table(scrobs$artist), decreasing = T)
top.art.day <- art.day[art.day$Artist %in% names(art.tab[1:10]),]
top.art.day$Date <- as.Date(top.art.day$Date)
nrow(top.art.day)
## [1] 28260
```

Scrobbles distribution

Since we’re already here, we can also see what percentage of all scrobbles is registered for the top ten artists only. We can also look at the overall distribution of scrobbles over artists using the table that I created before.

```r
library(ggplot2)
sum(top.art.day$Scrobbles) / sum(art.day$Scrobbles) * 100
## [1] 20.11751
ggplot() +
  aes(x = 1:length(art.tab), y = unname(art.tab), group = 1) +
  geom_line() +
  labs(x = "Artist rank", y = "Artist scrobbles") +
  geom_vline(aes(xintercept = 20), col = "red") +
  theme_minimal()
```

So, roughly 20% of all scrobbles come from the top ten artists alone. We are dealing with a massively skewed distribution here.
See that red line in the plot? 20% of all scrobbles lie to the left of this line! It’s almost like word frequency distributions in a corpus of natural language data.

Animating scrobble history

Now for the main aim of this post. I am assuming {gganimate} is already installed. I then create a plot object p which also includes a call to transition_reveal(Date). This is a function from {gganimate} which tells the plot to be revealed over time. The calls to geom_segment() and geom_text() create the artist labels on the right side of the plot that are connected to the moving dot. The artist names rise according to the current value of cum.plays. I got this code fragment from a nice example by Thomas Lin Pedersen. You can read the discussion there for further explanations. Just a quick summary of important points:

• I had to switch off the legend with guides(col = F) to prevent overplotting of the moving labels and the color legend.
• I had to increase the plot margin on the right side to fit the longest artist name.
• Clipping is turned off with coord_cartesian(clip = 'off') because the moving labels are not shown otherwise.

```r
library(gganimate)
p <- ggplot(top.art.day,
            aes(x = Date, y = cum.plays, col = Artist, group = Artist)) +
  geom_line() +
  geom_point(size = 2) +
  geom_segment(aes(xend = as.Date("2019-01-10"), yend = cum.plays),
               linetype = 2, colour = 'grey') +
  geom_text(aes(x = as.Date("2019-01-20"), label = Artist),
            hjust = 0, size = 4) +
  coord_cartesian(clip = 'off') +
  transition_reveal(Date) +
  guides(col = F) +
  labs(x = "Date", y = "Playcount", col = "Artist",
       title = "Artist playcount through time") +
  theme_minimal() +
  theme(plot.margin = margin(5.5, 120, 5.5, 5.5))
```

Now for animating the plot. Depending on the detailed settings, this could take a while. The number of frames especially has a heavy influence on rendering time.
```r
animate(p, nframes = 200, end_pause = 20, height = 1000, width = 1250, res = 120)
```

It takes a little bit of tweaking of the parameters to get a readable result. Check out the different parameters in ?animate if you want to tweak your animations a little more. I have a tip for you: if you’re still tweaking your plot parameters like the dimensions or margins, you might want to set the nframes parameter to a lower number like 50 or even 20. This rapidly accelerates the rendering of your animation.

A few things are worth mentioning when I look at this plot:

• Do you see how I discovered Scott Matthew only at the end of 2013? He quickly made it into my top ten artists with a rapid sprint of scrobbles.
• Radiohead, Prince, and PJ Harvey were a lonely group at the top for quite some time. Then, Menomena entered the picture and joined this top group. Later, Beck also joined. At the end of 2018, with me discovering the album “Colors”, he takes second place after briefly being my top artist.
• Janelle Monáe hadn’t been in my top ten artists up until 2018. Then, she released “Dirty Computer”. I listened to this album a lot and she made a final sprint into the final top ten artists.
• That’s maybe the biggest flaw of this animation: only the final top ten artists are included. The plot does not show the top ten artists over time. I would love to see the development of the top ten artists through time, maybe with greyed-out lines whenever an artist leaves the top ten. For that, however, I would have to learn a lot more of {gganimate}.

Alright, that’s it. See you next time.

To leave a comment for the author, please follow the link and comment on their blog: Rcrastinate.
### Le Monde puzzle [#1087]

Mon, 02/25/2019 - 00:19

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A board-like Le Monde mathematical puzzle in the digit category:

Given a (k,m) binary matrix, what is the maximum number S of entries with only one neighbour equal to one? Solve for k=m=2,…,13, and k=6,m=8.

For instance, for k=m=2, the matrix is producing the maximal number 4. I first attempted a brute force random filling of these matrices with only a few steps of exploration and got the numbers 4,8,16,34,44,57,… for the first cases. Since I was convinced that the maximum was the perfect square S=k², I then tried another approach based on Gibbs sampling and annealing (what else?):

```r
gibzit=function(k,m,A=1e2,N=1e2){
  temp=1 #temperature
  board=sole=matrix(sample(c(0,1),(k+2)*(m+2),rep=TRUE),k+2,m+2)
  board[1,]=board[k+2,]=board[,1]=board[,m+2]=0 #boundaries
  maxol=counter(board,k,m) #how many one-neighbours?
  for (t in 1:A){ #annealing
    for (r in 1:N){ #basic gibbs steps
      for (i in 2:(k+1)) for (j in 2:(m+1)){
        prop=board
        prop[i,j]=1-board[i,j]
        u=runif(1)
        if (log(u/(1-u))<temp*(counter(prop,k,m)-counter(board,k,m)))
          board=prop
        val=counter(board,k,m)
        if (val>maxol){
          maxol=val
          sole=board}
      }
    }
    temp=temp*2}
  return(sole[-c(1,k+2),-c(1,m+2)])}
```

which leads systematically to the optimal solution, namely a perfect square k² when k is even, and a perfect-but-one k²-1 when k is odd.
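The counter() function that gibzit() relies on is not listed in the post; a minimal hypothetical sketch consistent with the zero-padded boundaries used above would count the inner entries whose four neighbours sum to exactly one:

```r
# Hypothetical reconstruction (not from the original post): count
# entries of the padded (k+2) x (m+2) board whose four-neighbour
# sum equals exactly one.
counter <- function(board, k, m) {
  s <- 0
  for (i in 2:(k + 1)) for (j in 2:(m + 1)) {
    nb <- board[i - 1, j] + board[i + 1, j] +
          board[i, j - 1] + board[i, j + 1]
    if (nb == 1) s <- s + 1
  }
  s
}

# 2x2 check: a padded board whose inner part is rbind(c(1,1), c(0,0))
# reaches the maximal value S = 4 quoted in the puzzle
b <- matrix(0, 4, 4)
b[2, 2] <- b[2, 3] <- 1
counter(b, 2, 2)  # 4
```

Note that zeros count too: in the 2x2 example, each 0 also has exactly one neighbour equal to one, which is why all four entries contribute.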
When k=6, m=8, all entries can afford exactly one neighbour:

```r
> gibzbbgiz(6,8)
[1] 48
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    1    0    0    1    1    0    0    1
[2,]    1    0    0    0    0    0    0    1
[3,]    0    0    1    0    0    1    0    0
[4,]    0    0    1    0    0    1    0    0
[5,]    1    0    0    0    0    0    0    1
[6,]    1    0    0    1    1    0    0    1
```

but this does not seem feasible when k=6, m=7, which only achieves 40 entries with one single neighbour.

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

### Allusions to parents in autobiographies (or reading 118 books in a few seconds)

Sun, 02/24/2019 - 17:43

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

If I keep holding out, will the light shine through? (Come Back, Pearl Jam)

Imagine that you are writing the story of your life. You will almost surely make allusions to your parents, but will both of them have the same prominence in your biography, or will you spend more words on one of them? In that case, which one will have more relevance? Your father or your mother? This experiment analyses 118 autobiographies from Project Gutenberg and counts how many times the authors make allusions to their fathers and mothers. This is what I’ve done:

• Download all works from Project Gutenberg containing the word autobiography in their title (there are 118 in total).
• Count how many times the bigrams my father and my mother appear in each text.
This is what I call allusions to father and mother, respectively. The number of allusions that I measure is a lower bound of the exact amount, since the calculation has some limitations:

• Maybe the author refers to them by their names.
• After referring to them as my father or my mother, subsequent sentences may refer to them as He or She.

Anyway, I think these constraints do not introduce any bias into the calculation, since they should affect fathers and mothers equally. Here you can find the dataset I created after downloading all autobiographies and measuring the number of allusions to each parent. Some results:

• 64% of autobiographies have more allusions to the father than the mother.
• 24% of autobiographies have more allusions to the mother than the father.
• 12% allude to them equally.

Most of the works make more allusions to the father than to the mother. As a visual proof of this fact, the next plot is a histogram of the difference between the number of allusions to father and mother across the 118 works (# allusions to father − # allusions to mother):

The distribution is clearly right-skewed, which supports our previous results. Another way to see this fact is this last plot, which places each autobiography in a scatter plot, where the X-axis is the number of allusions to the father and the Y-axis to the mother. It is interactive, so you can navigate through it to see the details of each point (work):

Most of the points (works) are below the diagonal, which means that they contain more allusions to the father than the mother. Here you can find a full version of the previous plot. I don’t have any explanation for this fact, just some simple hypotheses:

• Fathers and mothers influence their children differently.
• Fathers star in more anecdotes than mothers.
• This is the effect of patriarchy (72% of the authors were born in the 19th century).

Whatever the explanation is, this experiment shows how easy it is to do text mining with R.
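The bigram counting itself (done in the post with tidytext) can be sketched in base R with gregexpr(), shown here on a made-up snippet of text:

```r
# Count non-overlapping, case-insensitive occurrences of a phrase
count_phrase <- function(text, phrase) {
  hits <- gregexpr(phrase, tolower(text), fixed = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)  # gregexpr returns -1 on no match
}

bio <- "My father was a sailor. My mother stayed home. My father travelled a lot."
count_phrase(bio, "my father")  # 2
count_phrase(bio, "my mother")  # 1
```

A real run would apply this over the 118 downloaded texts, e.g. with sapply(), which is essentially what the purrr/tidytext pipeline in the post does at scale.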
Special mention to purrr (to iterate efficiently over the set of work IDs), tidytext (to count the number of appearances of bigrams), highcharter (to do the interactive plot) and gutenbergr (to download the books). You can find the code here.

To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin.

### Strategic Data Science: Creating Value With Data Big and Small

Sun, 02/24/2019 - 08:56

(This article was first published on The Devil is in the Data – The Lucid Manager, and kindly contributed to R-bloggers)

The post Strategic Data Science: Creating Value With Data Big and Small appeared first on The Lucid Manager.

Data science is without a doubt the most popular business fad of the past decade. The promise of machine learning blinds many managers so that they forget about deploying these new approaches strategically. This article provides a framework for data science strategy and is a synopsis of my forthcoming book Principles of Strategic Data Science on LeanPub. The markdown files for the text, images and code of the book are available in my GitHub repository.

What is Data Science?

The term data science emerged in the middle of the last century when electronic computation first became a topic of study. In those days, the discipline was literally a science of storing and manipulating data.
The current definition has drifted away from this initial academic activity towards a business activity. The present data science hype can be traced back to an article in the 2012 edition of Harvard Business Review. Davenport and Patil proclaimed data scientist to be the sexiest job of the twenty-first century. In the wake of this article, the number of Google searches for data science increased rapidly.

Organisations have for a long time used data to improve the lives of their customers, shareholders or society overall. Management gurus promoted concepts such as the data-driven organisation, evidence-based management, business intelligence and Six Sigma to help businesses realise the benefits of their data. Data science is an evolution of these methods enabled by the data revolution.

The Data Revolution

Recent developments in information technology have significantly improved what we can do with data, resulting in what we now know as data science. Firstly, most business processes are managed electronically, which has exponentially increased the amount of available data. Developments in communication, such as the Internet of Things and personal mobile devices, have significantly reduced the cost of collecting data. Secondly, the computing capabilities on the average office worker’s desk outstrip the capabilities of the supercomputers of the past. Not only is it cheaper to collect vast amounts of electronic data, but processing these enormous volumes has also come within reach of the average office worker. Lastly, developments in applied mathematics and open source licensing have accelerated our capabilities in analysing this data.
These new technologies allow us to discover patterns that were previously invisible. Most tools required to examine data are freely available on the internet, with a helpful community sharing knowledge on how to use them. These three developments enabled an evolution from traditional business analysis to data science. Data science is the strategic and systematic approach to analysing data to achieve organisational objectives using electronic computing. This definition is agnostic of the promises of machine learning and leverages the three developments mentioned above. Data science is the next evolution in business analysis that maximises the value we can extract from data.

Data Science Strategy Competencies

The craft of data science combines three different competencies. Data scientist Drew Conway visualised the three core competencies of data science in a Venn diagram.

Data Science Venn Diagram (Conway, 2010).

Firstly and most importantly, data science requires domain knowledge. Any analysis needs to be grounded in the reality it seeks to improve. Subject-matter expertise is necessary to make sense of the investigation. Professional expertise in most areas uses mathematics to understand and improve outcomes. New mathematical tools expand the traditional approaches to develop a deeper understanding of the domain under consideration. Computer science is the competency that binds the available data with mathematics. Writing computer code to extract, transform and analyse data to create information and stimulate knowledge is an essential skill for any data scientist.

Good Data Science

To create value with data, we need to know how to create or recognise good data science. The second chapter uses three principles originally introduced two thousand years ago by the Roman architect and engineer Vitruvius. He wrote that buildings need to be useful, sound and aesthetic. These requirements are also ideally suited to defining best practice in data science.
The Vitruvian triangle for data science.

For data science to be useful, it needs to contribute positively to the objectives of an organisation. It is in this sense that data science is an applied science and not an academic pursuit. The famous Data-Information-Knowledge pyramid visualises the process of creating value from data.

Usefulness

Useful data science meaningfully improves our reality through data. Data is a representation of either a social or physical reality. Any data source is only ever a sample of the fullness and complexity of the real world. Information is data imbued with context. The raw data collected from reality needs to be summarised, visualised and analysed for managers to understand the reality of their business. This information increases knowledge about a business process, which is in turn used to improve the reality from which the data was collected. This feedback loop visualises the essence of analysing data in businesses. Data science is a seductive activity because it is reasonably straightforward to create impressive visualisations with sophisticated algorithms. But if data products don’t improve or enlighten the current situation, they are in essence useless.

The Reality, Data, Information, Knowledge pyramid.

Soundness

Data science needs to be sound in that the outcomes are valid and reliable. The validity and reliability of data are where the science meets the traditional approaches to analysing data. Validity is the extent to which the data represents the reality it describes. The reliability of data relates to the accuracy of the measurement. These two concepts depend on the type of data under consideration. Measuring physical processes is less complicated than measuring the social aspects of society. Validity and reliability are in essence a sophisticated way of expressing the well-known Garbage-In-Garbage-Out principle.
The soundness of data science also relates to the reproducibility of the analysis, so that other professionals can review the outcomes. Reproducibility prevents the data, and the process by which it was transformed and analysed, from becoming a black box whose results we have no reason to trust. Data science also needs to be sound concerning the governance of the workflow. All data sources need to be curated by relevant subject matter experts to ensure their validity and reliability. Data experts ensure that the data is available to those who need it.

Aesthetics

Lastly, data science needs to be aesthetic to ensure that any visualisation or report is easy to understand for the consumer of the analysis. This requirement is not about beautification through infographics. Aesthetic data products minimise the risk of making wrong decisions because the information is presented without room for misinterpretation. Any visualisation needs to focus on telling a story with the data. This story can be a comparison, a prediction, a trend or whatever else is relevant to the problem. One of the essential principles of aesthetic data science is the data-to-pixel ratio: maximise the share of the pixels on a screen that present information. Good data visualisation practises austerity to ensure that the people who consume the information understand the story that needs to be told.

Example of low and high data-to-pixel ratio.

Strategic Data Science

The data science continuum is a strategic journey for organisations that seek to maximise value from data. As an organisation moves along the continuum, increased complexity is the price of increased value. The continuum is a hierarchy: although all phases are important, the latter stages cannot exist without the previous ones.

Data science continuum.

Collecting data requires careful consideration of what to collect, how to collect it and at what frequency.
To collect meaningful data requires a good understanding of the relationship between reality and the data. There is no such thing as raw data, as all information relies on assumptions and practical limitations. Describing the data is the first step in extracting value. Descriptive statistics are the core of most business reporting and are an essential first step in analysing the data. Diagnostics, or analysis, is the core activity of most professions. Each subject area uses specialised methods to create new information from data. Predictive analysis seems to be the holy grail for many managers. A prediction is not a perfect description of the future but provides the distribution of possible futures. Managers can use this information to change the present to construct their desired future. Prescriptive analysis uses the knowledge created in the previous phases to automatically run business processes and even decide on future courses of action. Any organisation starting with data science should follow the five phases in this process and not jump ahead to bypass the seemingly less valuable stages.

The Data-Driven Organisation

Implementing a data science strategy is more than a matter of establishing a specialised team and solving complex problems. Creating a data-driven organisation that maximises the value of data requires a whole-of-business approach that involves people with the right attitude and skills, appropriate systems and robust processes. A data science team combines the three competencies described in the Conway Venn diagram. People with skills in all three of these areas are rare, and the industry calls them unicorns. There is no need for recruiters to start hunting unicorns, because these three areas of expertise can also exist within a team. Possibly more important than the technical skills are the social skills of a data scientist.
Not only do they need to create useful, sound and aesthetic data science, they also need to convince the consumers of their work of its value. One of the problems of creating value with data is ensuring that the results are implemented in the organisation. A starting point is to ensure that the users of data products have a relevant level of data literacy. Developing data literacy among the consumers of data science is perhaps the greatest challenge. The required level of data literacy depends on the type of position and the role of the data consumer within the organisation. Data scientists use an extensive range of tools and are often opportunistic in their choice of software. Spreadsheets are not well suited to creating good data science. Data science requires coding skills, and the Python and R languages are powerful tools for solving complex problems. After the data specialists have developed the best way to analyse the data, they need to communicate the results to their customers. Many products exist to communicate data to users through interactive dashboards and other dynamic systems. The final part of the book delves into the ethics of data science. From the fact that something can be done, we cannot conclude that it should be done. Just like any other profession that impacts humans, data scientists need ethical guidelines to ensure that their activities cause no harm. The book provides some basic guidelines that can assist data scientists in assessing the ethical merits of their projects.
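The descriptive and predictive phases of the continuum can be illustrated in a few lines of base R. This is a toy example of my own, not taken from the book; the simulated sales data and model choice are assumptions for illustration only.

```r
# Toy sales data: two years of monthly revenue with a linear trend
set.seed(123)
sales <- data.frame(month   = 1:24,
                    revenue = 100 + 3 * (1:24) + rnorm(24, sd = 5))

# Descriptive phase: summarise what happened
summary(sales$revenue)

# Predictive phase: a simple linear model yields not a single future value
# but a distribution of possible futures (prediction intervals)
fit <- lm(revenue ~ month, data = sales)
predict(fit, newdata = data.frame(month = 25:27), interval = "prediction")
```

The prediction interval, rather than the point forecast alone, is what lets a manager reason about the range of possible futures.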
To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data – The Lucid Manager.

### Plots within plots with ggplot2 and ggmap

Sun, 02/24/2019 - 05:11

(This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers)

Once in a while, you might find yourself wanting to embed one plot within another plot. ggplot2 makes this really easy with the annotation_custom function. The following example illustrates how you can achieve this. (For all the code in one R file, click here.) Let’s generate some random data and make a scatterplot along with a smoothed estimate of the relationship:

library(ggplot2)
set.seed(42)
n <- 1000
x <- runif(n) * 3
y <- x * sin(1/x) + rnorm(n) / 25
df <- data.frame(x = x, y = y)
p1 <- ggplot(df, aes(x, y)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  theme_bw()
p1

The smoother seems to be doing a good job of capturing the relationship for most of the plot, but it looks like there’s something more going on near x = 0. Let’s zoom in:

p2 <- ggplot(df, aes(x, y)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  scale_x_continuous(limits = c(0, 0.5)) +
  scale_y_continuous(limits = c(-0.3, 0.6)) +
  theme_bw()
p2

That certainly seems like a meaningful relationship!
While we might want to plot p1 to depict the overall relationship, it is probably a good idea to show p2 as well. This can be achieved very easily:

p1 + annotation_custom(ggplotGrob(p2), xmin = 1, xmax = 3, ymin = -0.3, ymax = 0.6)

The first argument to annotation_custom must be a “grob” (what is a grob? see details here), which we can create using the ggplotGrob function. The 4 other arguments (xmin etc.) indicate the coordinate limits for the inset: these coordinates are with reference to the axes of the outer plot. As explained in the documentation, the inset will try to fill up the space indicated by these 4 arguments while being center-justified. For ggmap objects, we need to use inset instead of annotation_custom. We illustrate this by making a map of the continental USA with insets for Alaska and Hawaii. Let’s get a map of the continental US (for more details on how to use Stamen maps, see my post here):

library(ggmap)
us_bbox <- c(left = -125, bottom = 25, right = -55, top = 50)
us_main_map <- get_stamenmap(us_bbox, zoom = 5, maptype = "terrain")
p_main <- ggmap(us_main_map)
p_main

Next, let’s get maps for Alaska and Hawaii and save them into R variables. Each plot will have a title for the state, and information on the axes will be removed.

alaska_bbox <- c(left = -180, bottom = 50, right = -128, top = 72)
alaska_map <- get_stamenmap(alaska_bbox, zoom = 5, maptype = "terrain")
p_alaska <- ggmap(alaska_map) +
  labs(title = "Alaska") +
  theme(axis.title = element_blank(), axis.text = element_blank(), axis.ticks = element_blank())
p_alaska

hawaii_bbox <- c(left = -160, bottom = 18.5, right = -154.5, top = 22.5)
hawaii_map <- get_stamenmap(hawaii_bbox, zoom = 6, maptype = "terrain")
p_hawaii <- ggmap(hawaii_map) +
  labs(title = "Hawaii") +
  theme(axis.title = element_blank(), axis.text = element_blank(), axis.ticks = element_blank())
p_hawaii

We can then use inset twice to embed these two plots (I had to fiddle around with the xmin etc.
options to get it to come out right):

library(grid)
p_main +
  inset(ggplotGrob(p_alaska), xmin = -76.7, xmax = -66.7, ymin = 26, ymax = 35) +
  inset(ggplotGrob(p_hawaii), xmin = -66.5, xmax = -55.5, ymin = 26, ymax = 35)

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.

### Read all about it! Navigating the R Package Universe

Sun, 02/24/2019 - 01:00

(This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

In the most recent issue of the R Journal, I have a new paper out with coauthors John Nash and Spencer Graves. Check out the abstract: Today, the enormous number of contributed packages available to R users outstrips any given user’s ability to understand how these packages work, their relative merits, or how they are related to each other. We organized a plenary session at useR!2017 in Brussels for the R community to think through these issues and ways forward. This session considered three key points of discussion. Users can navigate the universe of R packages with (1) capabilities for directly searching for R packages, (2) guidance for which packages to use, e.g., from CRAN Task Views and other sources, and (3) access to common interfaces for alternative approaches to essentially the same problem.
If you’ve been around a little while, you might remember that I ran a brief online survey in the spring of 2017 focused on this topic, and I’ve written before on my blog about the plenary session in general and the specific topics I focused on in this project. The paper represents a summary of this work and what we’ve learned, both in preparing for the plenary session and in synthesizing community feedback afterward. If you are interested in this topic and what it means to be part of a community with over 13,000 packages on CRAN and a fast-growing userbase, check it out and let me know what you think! From the end of the paper: Our exploration of these topics leads us to call for increased respect and value for the work done by local meetup group organizers and individuals who contribute to spreading R knowledge, both online and in their communities. Our survey and discussions show how impactful these community networks are; investing in community building is not something we need do only because of idealism, but because it is effective. We also identify the importance of growing the skills of package developers across the R ecosystem, in dependency management, documentation, and beyond. Wider adoption of best practices makes the R package universe easier for everyone to navigate, from the developer with downstream dependencies to the CRAN Task View maintainer to the new R user. To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.
### Query Azure Cosmos DB in R

Sat, 02/23/2019 - 23:50

(This article was first published on R – Detroit Data Lab, and kindly contributed to R-bloggers)

It’s 2019, and your company has chosen to store data in Cosmos DB, the world’s most versatile and powerful database format, because they’re all-in on cloud native architectures. You’re excited! This means you’re working with an ultra low latency, planet-scale data source! Just one problem: your codebase is R, and connecting to retrieve the data is trickier than you’d hoped. I give you cosmosR: a simple package I wrote to give you a head start on connections to your Cosmos DB. It won’t solve all your problems, but it’ll help you overcome the authentication headers, create some simple queries, and get you up and running much faster than rolling your own connectivity from scratch. *Note: this is only for Cosmos DB storing documents, with retrieval via the SQL API. For more information on the SQL API please visit the Microsoft documentation. Begin by making sure we have devtools installed, then loaded:

install.packages("devtools")
library("devtools")

With that set, install cosmosR via GitHub:

install_github("aaron2012r2/cosmosR")

We’re two lines of code away from your first data frame! Be sure to have your Access Key, URI, Database Name, and Collection Name for your Cosmos DB handy. Store the access key for reusable querying via cosmosAuth:

cosmosAuth("KeyGoesHere", "uri", "dbName", "collName")

Then we can use cosmosQuery to perform a simple “SELECT * from c” query, which will retrieve all documents from the Cosmos DB.
By setting content.response to TRUE, we specify that we want to convert the raw response to a character response that we can read.

dfAllDocuments <- cosmosQuery(content.response = TRUE)

Just like that, we have our information! Because Cosmos DB stores documents as JSON and is primarily controlled using JavaScript, you should carefully inspect all responses. The package used in this tutorial is available at my GitHub; the repo is linked here. Please visit the link for further information on other tips or known limitations. If you see opportunities to contribute, please branch, create issues, and help further develop!

To leave a comment for the author, please follow the link and comment on their blog: R – Detroit Data Lab.

### Let’s get LEGO’d!

Sat, 02/23/2019 - 23:23

(This article was first published on Little Miss Data, and kindly contributed to R-bloggers)

If you follow my blog, you may have noticed that I like to create some pretty weird and nerd-fabulous graphs. In this blog I’ll introduce you to a new and exciting form of neat R graphs that I recently stumbled upon. Oftentimes, my blogs are tutorials. However, since Ryan Timpe and Tyler Morgan-Wall have done such a great job documenting their work, this blog is more like a public service announcement with a few tips.

The blog

Ryan’s blog on creating 3D Lego image masterpieces can be found here.
Ryan kindly created a series of functions to transform your image into Legos and then turn it 3D, building on Tyler’s rayshader package.

Steps

In my journey to create some 3D Lego images of my own, I learned a few tips. Below, I’ll give you the super condensed version of what I did, which is mostly pointing to their joint content!

1) Source Ryan’s functions. There is a whole bunch of genius here. You can definitely browse the code, but also just know that it works to create the functions that you will call when making your Lego graphs. Ryan has done a great job making the code very straightforward, but I’ll still review the steps and the associated output below.

2A) Read in your image and transform it into Lego pieces

mosaic1 <- readJPEG("/MyPics/Laura.jpg") %>%
  scale_image(48) %>%
  legoize() %>%
  collect_bricks()
mosaic1 %>% display_set()

2B) Generate the instructions to create your own physical Lego image

The transformed Lego image is great, but what’s better is if you could actually make this with real Legos! This function generates your own personal guide to creating your real Lego masterpiece.

mosaic1 %>% generate_instructions(6)

2C) Identify which Lego pieces you will need to build your image

Ryan thought of everything. If you are going to create the image with real Legos, you need to know what pieces to use! Call the display_pieces() function to get a comprehensive list of what to use to build your image. Note that these are created with the real Lego colors and you can order most of them through Lego’s “Pick a Brick” website.

pieces <- mosaic1 %>% table_pieces()
mosaic1 %>% display_pieces()

2D) Create your 3D Lego image

Follow the code below to create your 3D Lego image. If you find that you want to flip which image area is elevated, please see the tips and troubleshooting section below.

# install.packages("rayshader")
# If you have issues installing rayshader, please see tips below.
library(rayshader)
mosaic1 %>%
  collect_3d() %>%
  display_3d(fov = 0, theta = -30, phi = 40, windowsize = c(1000, 800), zoom = 0.8)
render_snapshot()

Tips and Troubleshooting

• If you have issues installing rayshader on a Mac, I found Tyler’s tip to install XQuartz directly worked like a charm.
• When running step 1, it assumes that you have the “Lego_Colors.csv” file on your computer. If you don’t want to bother downloading it, please replace the referenced Lego color file (line 5) with “https://raw.githubusercontent.com/ryantimpe/LEGOMosaics/master/Colors/Lego_Colors.csv”. For example: lego_colors <- read_csv("https://raw.githubusercontent.com/ryantimpe/LEGOMosaics/master/Colors/Lego_Colors.csv", ...)
• To flip which area of the image is elevated, use the highest_el option: collect_3d(highest_el='dark')

Calling all RLadies and Rladdies

Because this is such a cool graphing technique, I ran a variety of other images through the functions. Specifically, I thought the R Ladies graphic turned out pretty great! Gabriela de Queiroz (founder of RLadies) had the great idea to 3D print the image. I couldn’t get that to work yet. If anyone is able to successfully 3D print the image or build the actual Lego representation, please let me know! I’m posting the RLadies Lego instructions below, in case anyone feels inspired!

Thank you

Thank you for following along with this fun little blog to showcase Ryan Timpe and Tyler Morgan-Wall’s R wizardry. Please share your creations with us on Twitter!

To leave a comment for the author, please follow the link and comment on their blog: Little Miss Data.
### Gradient-Free Optimization for GLMNET Parameters

Sat, 02/23/2019 - 23:13

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

In the post https://statcompute.wordpress.com/2017/09/03/variable-selection-with-elastic-net, it was shown how to optimize the hyper-parameters, namely alpha and gamma, of glmnet by using the built-in cv.glmnet() function. However, following a similar logic to the hyper-parameter optimization shown in the post https://statcompute.wordpress.com/2019/02/10/direct-optimization-of-hyper-parameter, we can directly optimize the alpha and gamma parameters of glmnet by using gradient-free optimizations, such as the Nelder–Mead simplex or particle swarm. Unlike traditional gradient-based optimizations, gradient-free optimizations are often able to find close-to-optimal solutions that are considered “good enough” from an empirical standpoint in many cases that can’t be solved by gradient-based approaches due to noisy and discontinuous functions. It is very straightforward to set up the optimization workflow. All we need to do is write an objective function, e.g. to maximize the AUC statistic in this specific case, and then maximize this objective function by calling the optimizer. In the demonstration, Nelder–Mead simplex and particle swarm optimizers are employed to maximize the AUC statistic defined in the glmnet.optim() function based on a 10-fold cross validation. As shown in the result, both approaches gave very similar outcomes. For anyone interested, please feel free to compare the demonstrated method with the cv.glmnet() function.
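A minimal sketch of the idea follows. This is my own reconstruction, not the author's glmnet.optim() code: the simulated data, the rank-based AUC, the fold scheme, and the parameter transforms (optimizing alpha and lambda on unconstrained scales) are all assumptions for illustration.

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(500 * 10), ncol = 10)
y <- rbinom(500, 1, plogis(x[, 1] - x[, 2]))

# Rank-based (Mann-Whitney) AUC, to avoid extra dependencies
auc_stat <- function(y, p) {
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  r <- rank(p)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Fixed fold assignment so the objective is deterministic
folds <- sample(rep(1:10, length.out = nrow(x)))

# Objective: negative 10-fold CV AUC for a given (alpha, lambda),
# with unconstrained parameters mapped into their valid ranges
objective <- function(par) {
  alpha  <- plogis(par[1])  # maps to (0, 1)
  lambda <- exp(par[2])     # maps to (0, Inf)
  pred <- numeric(nrow(x))
  for (k in 1:10) {
    fit <- glmnet(x[folds != k, ], y[folds != k], family = "binomial",
                  alpha = alpha, lambda = lambda)
    pred[folds == k] <- predict(fit, x[folds == k, ], type = "response")
  }
  -auc_stat(y, pred)
}

# Gradient-free search with the Nelder-Mead simplex
opt <- optim(c(0, log(0.01)), objective, method = "Nelder-Mead")
c(alpha = plogis(opt$par[1]), lambda = exp(opt$par[2]), cv_auc = -opt$value)
```

The same objective could be handed to a particle swarm optimizer (e.g. from the pso package) without any change, which is the appeal of the gradient-free approach.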
To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

### More on Macros in R

Sat, 02/23/2019 - 17:28

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Recently ran into something interesting on the R macros/quasi-quotation/substitution/syntax front. Romain François: “.@_lionelhenry reveals planned double curly syntax At #satRdayParis as a possible replacement, addition to !! and enquo()” It appears !! is no longer the last word in substitution (it certainly wasn’t the first). The described effect is actually already pretty easy to achieve in R.

suppressPackageStartupMessages(library("dplyr"))

group_by <- wrapr::bquote_function(group_by)
summarize <- wrapr::bquote_function(summarize)

my_average <- function(data, grp_var, avg_var) {
  data %>%
    group_by(.( grp_var )) %>%
    summarize(avg = mean(.( avg_var ), na.rm = TRUE))
}

data <- data.frame(x = 1:10, g = rep(c(0,1), 5))
my_average(data, as.name("g"), as.name("x"))

## # A tibble: 2 x 2
##       g   avg
##   <dbl> <dbl>
## 1     0     5
## 2     1     6

Or if you don’t want to perform the quoting by hand:
my_average <- function(data, grp_var, avg_var,
                       grp_var_name = substitute(grp_var),
                       avg_var_name = substitute(avg_var)) {
  data %>%
    group_by(.( grp_var_name )) %>%
    summarize(avg = mean(.( avg_var_name ), na.rm = TRUE))
}

my_average(data, g, x)

## # A tibble: 2 x 2
##       g   avg
##   <dbl> <dbl>
## 1     0     5
## 2     1     6

And we can use the same Thomas Lumley / bquote() notation for string interpolation.

group_var <- as.name("g")
avg_var <- as.name("x")
wrapr::sinterp("group_var was .(group_var), and avg_var was .(avg_var)")
# [1] "group_var was g, and avg_var was x"

The .() notation has a great history in R and has been used by data.table for years.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

### Object of Type Closure is Not Subsettable

Sat, 02/23/2019 - 04:40

(This article was first published on R – Quality and Innovation, and kindly contributed to R-bloggers)

I started using R in 2004. I started using R religiously on the day of the annular solar eclipse in Madrid (October 3, 2005) after being inspired by David Hunter’s talk at ADASS. It took me exactly 4,889 days to figure out what this vexing error means, even though trial and error helped me move through it almost every time it happened! I’m so moved by what I learned (and embarrassed that it took so long to make sense), that I have to share it with you.
This error happens when you’re trying to treat a function like a list, vector, or data frame. To fix it, start treating the function like a function. Let’s take something that’s very obviously a function. I picked the sampling distribution simulator from a 2015 blog post I wrote. Cut and paste it into your R console:

sdm.sim <- function(n, src.dist=NULL, param1=NULL, param2=NULL) {
  r <- 10000 # Number of replications/samples - DO NOT ADJUST
  # This produces a matrix of observations with
  # n columns and r rows. Each row is one sample:
  my.samples <- switch(src.dist,
    "E" = matrix(rexp(n*r, param1), r),
    "N" = matrix(rnorm(n*r, param1, param2), r),
    "U" = matrix(runif(n*r, param1, param2), r),
    "P" = matrix(rpois(n*r, param1), r),
    "B" = matrix(rbinom(n*r, param1, param2), r),
    "G" = matrix(rgamma(n*r, param1, param2), r),
    "X" = matrix(rchisq(n*r, param1), r),
    "T" = matrix(rt(n*r, param1), r))
  all.sample.sums <- apply(my.samples, 1, sum)
  all.sample.means <- apply(my.samples, 1, mean)
  all.sample.vars <- apply(my.samples, 1, var)
  par(mfrow=c(2,2))
  hist(my.samples[1,], col="gray", main="Distribution of One Sample")
  hist(all.sample.sums, col="gray", main="Sampling Distribution\nof the Sum")
  hist(all.sample.means, col="gray", main="Sampling Distribution\nof the Mean")
  hist(all.sample.vars, col="gray", main="Sampling Distribution\nof the Variance")
}

The right thing to do with this function is use it to simulate a bunch of distributions and plot them using base R. You write the function name, followed by parentheses, followed by each of the four arguments the function needs to work. This will generate a normal distribution with mean of 20 and standard deviation of 3, along with three sampling distributions, using a sample size of 100 and 10000 replications:

sdm.sim(100, src.dist="N", param1=20, param2=3)

(You should get four plots, arranged in a 2×2 grid.) But what if we tried to treat sdm.sim like a list, and call the 3rd element of it?
Or what if we tried to treat it like a data frame, and call one of the variables in a column of the data frame?

```r
> sdm.sim[3]
Error in sdm.sim[3] : object of type 'closure' is not subsettable

> sdm.sim$values
Error in sdm.sim$values : object of type 'closure' is not subsettable
```

SURPRISE! Object of type closure is not subsettable. This happens because sdm.sim is a function, and its data type is (shockingly) something called “closure”:

```r
> class(sdm.sim)
[1] "function"

> typeof(sdm.sim)
[1] "closure"
```

I had read this on Stack Overflow a whole bunch of times, but it never really clicked until I saw it like this! And now that I had a sense of why the error was occurring, it turns out it’s super easy to reproduce with functions in base R or functions in any familiar packages you use:

```r
> ggplot$a
Error in ggplot$a : object of type 'closure' is not subsettable

> table$a
Error in table$a : object of type 'closure' is not subsettable

> ggplot[1]
Error in ggplot[1] : object of type 'closure' is not subsettable

> table[1]
Error in table[1] : object of type 'closure' is not subsettable
```
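In case the “closure” terminology is confusing, here is a minimal base-R sketch (the object name `f` is just for illustration) of how class() and typeof() describe functions:

```r
f <- function(x) x + 1

class(f)     # "function" -- the class R reports for dispatch
typeof(f)    # "closure"  -- the internal type of any user-defined function,
             #              i.e. the function together with its enclosing environment

# Primitive functions written in C are a different internal type, not closures:
typeof(sum)  # "builtin"
```

So “closure” in the error message is simply R’s internal name for an ordinary function.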

As a result, if you’re pulling your hair out over this problem, check to see where in your rogue line of code you’re treating something like a non-function. Or maybe you picked a variable name that’s already the name of a function. Then apologize, change your notation, and move on.
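One hedged sketch (not from the original post) of how that name collision plays out — here the data frame deliberately shadows base R’s transpose function t():

```r
# A data frame named `t` shadows base R's transpose function t():
t <- data.frame(x = 1:3)
t$x          # works: R finds the data frame before the function

rm(t)        # remove the data frame; `t` now resolves to the function again
# t$x        # Error in t$x : object of type 'closure' is not subsettable
```

The error often means the object you thought you created doesn’t exist (or was removed), so the name fell through to a function on the search path.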

But as Luke Smith pointed out, this is not true for functional sequences (which you can also write as functions). Functional sequences are those chains of commands you write when you’re in tidyverse mode, all strung together with %>% pipes:

Luke’s code that you can cut and paste (and try), with the top line that got cut off by Twitter, is:

```r
fun1 <- . %>%
  group_by(col1) %>%
  mutate(n = n + h) %>%
  filter(n == max(n)) %>%
  ungroup() %>%
  summarize(new_col = mean(n))

fun2 <- fun1[-c(2, 5)]
```

Even though Luke’s fun1 and fun2 are of type closure, they are subsettable because they contain a sequence of functions:

```r
> typeof(fun1)
[1] "closure"

> typeof(fun2)
[1] "closure"

> fun1[1]
Functional sequence with the following components:

 1. group_by(., col1)

Use 'functions' to extract the individual functions.

> fun1[[1]]
function (.)
group_by(., col1)
```
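Here is a self-contained sketch of the same idea with a simpler pipeline (the specific steps are made up for illustration, and it assumes the magrittr package is installed):

```r
library(magrittr)

# A functional sequence: a closure magrittr builds from the piped steps
fun1 <- . %>% abs() %>% sqrt() %>% round(2)

typeof(fun1)      # "closure", and yet it is subsettable:
fun1[1]           # a functional sequence holding just the abs() step
functions(fun1)   # the full list of component functions

fun1(-6.25)       # abs -> 6.25, sqrt -> 2.5, round -> 2.5
```

magrittr gives functional sequences a `[` method, which is why subsetting them works even though typeof() still says “closure”.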

Don’t feel bad! This problem has plagued all of us for many, many hours (me: years), and yet it still happens to us more often than we would like to admit. Awareness of this issue will not prevent you from attempting things that give you this error in the future. It’s such a popular error that there have been memes about it and sad valentines written about it.


(Also, if you’ve made it this far, FOLLOW THESE GOOD PEOPLE ON TWITTER: @stephdesilva @djnavarro @lksmth – They all share great information on data science in general, and R in particular. Many thanks too to all the #rstats crowd who shared in my glee last night and didn’t make me feel like an idiot for not figuring this out for ALMOST 14 YEARS. It seems so simple now.

Steph is also a Microsoft whiz who you should definitely hire if you need anything R + Microsoft. Thanks to all of you!)

The post Object of Type Closure is Not Subsettable appeared first on Quality and Innovation.


To leave a comment for the author, please follow the link and comment on their blog: R – Quality and Innovation.

### R course in data wrangling and spatial analysis notes are online

Sat, 02/23/2019 - 01:00

(This article was first published on Bluecology blog, and kindly contributed to R-bloggers)


The course is aimed at biological scientists who have some basic knowledge of R but want to learn to use it for a broader variety of tasks. It shows how you can use R for a complete workflow, from data cleaning through analysis, graphics and mapping.

We walk through the data wrangling, graphics, analysis and mapping for a case-study analysis of a (real) plankton data-set.

One of the maps we’ll make

In terms of packages, we cover data wrangling with dplyr and a little bit of tidyr, graphics with ggplot2, analyses with GLMs and GAMs (the mgcv package), and mapping and GIS with raster, ggplot2 and sf. We also cover how you can do spatial analyses with GAMs.
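As a flavour of that last topic, here is a minimal sketch (not taken from the course notes; the simulated data are invented for illustration) of a spatial analysis with a GAM: a 2-D smooth of coordinates via mgcv’s s():

```r
library(mgcv)

# Simulate a smooth spatial field observed with noise at random locations
set.seed(42)
n <- 400
dat <- data.frame(x = runif(n), y = runif(n))
dat$z <- sin(2 * pi * dat$x) * cos(2 * pi * dat$y) + rnorm(n, sd = 0.2)

# Recover the field with a two-dimensional thin-plate smooth of (x, y)
m <- gam(z ~ s(x, y), data = dat)
summary(m)   # the s(x, y) term should be strongly significant
# plot(m)    # contour plot of the fitted spatial smooth
```

The same s(lon, lat)-style term can be combined with other covariates, which is what makes GAMs a convenient tool for spatial data.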

The R course was written for a 1-day workshop at University of Queensland with my colleagues, Ant Richardson, Dave Schoeman and Bill Venables (author of the MASS package among other R fundamentals).

I would guestimate that if you do this course on your own you will need a day or a bit more. It took us a whole day in class, but we skipped a lot of the bonus material.

I hope you find the course helpful and a little bit fun.


To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog.

### Cloudy with a chance of Caffeinated Query Orchestration – New rJava Wrappers for AWS Athena SDK for Java

Fri, 02/22/2019 - 22:54

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

There are two fledgling rJava-based R packages that enable working with the AWS SDK for Athena:

They’re both needed to conform with the way CRAN likes rJava-based packages with large JAR dependencies to be submitted. The goal is to eventually have wrappers for anything R folks need under the AWS Java SDK menu.

All package pairs will eventually cohabitate under the Cloudy R Project once each gets to 90% API coverage, passes CRAN checks and has passing Travis checks.

One thing I did get working right up front was asynchronous dplyr chain query execution via collect_async(), so if you need that and would rather not use reticulated wrappers, now’s your chance.

You would be correct in assuming this is an offshoot of the recent work on updating metis. My primary impetus for this is to remove the reticulate dependency from our Dockerized production setups but I also have discovered I like the Java libraries more than the boto3-based ones (not really a shocker there if you know my views on Python). As a result I should be able to quickly wrap most any library you may need (see below).

FIN

The next major wrapper coming is S3 (there are bits of it implemented in awsathena now but that’s temporary) and — for now — you can toss a comment here or file an issue in any of the social coding sites you like for priority wrapping of other AWS Java SDK libraries. Also, if you want some experience working with rJava packages in a judgement-free zone, drop a note into these or any new AWS rJava-based package repos and I’ll gladly walk you through your first PR.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.