R-bloggers
R news and tutorials contributed by hundreds of R bloggers

.rprofile: Jenny Bryan

Fri, 12/08/2017 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)


Jenny Bryan @JennyBryan is a Software Engineer at RStudio and is on leave from being an Associate Professor at the University of British Columbia. Jenny serves in leadership positions with rOpenSci and Forwards and as an Ordinary member of The R Foundation.

KO: What is your name, your title, and how many years have you worked in R?

JB: I’m Jenny Bryan, I am a software engineer at RStudio (still getting used to that title), and I am on leave from being an Associate Professor at the University of British Columbia. I’ve been working with R or its predecessors since 1996. I switched to R from S in the early 2000s.

KO: Why did you make the switch to R from S?

JB: It just seemed like the community was switching over to R and I didn’t have a specific reason to do otherwise. I was just following the communal path of least resistance.

KO: You have a huge following from all the stuff you post about your course. Did you always want to be a teacher? How did you get into teaching?

JB: No, I wouldn’t say that I always wanted to be a teacher, but I think I’ve enjoyed that above average compared to other professors. But it was more that I realized several years ago that I could have a bigger impact on what people did by improving data analysis workflows, thinking, and tooling instead of trying to make incremental progress on statistical methodology. It is a reflection of where I have a comparative advantage with respect to interest and aptitude, so it’s not really a knock on statistical methodology. But I feel we could use more people working on this side of the field – working on knowledge translation.

I was also reacting to what I saw in my collaborative work. I would work with people in genomics and if I’m completely honest with myself, often my biggest contribution to the paper would be getting all the datasets and analyses organized. I didn’t necessarily do some highly sophisticated statistical analysis. It would often boil down to just doing millions of t-tests or something. But the reason I had an impact on the project would be that I got everything organized so that we could re-run it and have more confidence in our results. And I was like, I have a PhD in stats, why is this my main contribution? Why do the postdocs, grad students, and bioinformaticians not know how to do these things? So then I started to make that more and more the focus of my course, instead of squeezing in more statistical methods. Then the teaching sort of changed who I was and what I allowed myself to think about and spend time on. I used to not let myself spend time on those things. Or if I did, I would feel guilty about it because I thought, I can’t get any professional credit for this! It’s not statistically profound, but it seems to be what the world needs me to do, and needs other people to be doing.

You don’t always have to be proving a theorem, you don’t always have to be writing a package, there’s still a lot of space for worthwhile activity in between all of those things.

KO: Do you feel proud of what you’ve accomplished?

JB: I finally in some sense gave myself permission to start teaching what I thought people actually needed to know. And then after spending lots of time on it in the classroom, you realize what gaps there are, you become increasingly familiar with the tooling that you’re teaching and you’re like, hey I could actually improve that. Or no one really talks about how you get the output of this step to flow nicely as the input into the following step, i.e. how to create workflows. It really helped open my mind to different forms of work that are still valuable. You don’t always have to be proving a theorem, you don’t always have to be writing a package, there’s still a lot of space for worthwhile activity in between all of those things. However because we don’t have names for all of it, it can be difficult from a career point of view. But so many people see it, use it, and are grateful for it.

KO: Can you talk about your transition into working for RStudio and what that will look like on a day-to-day basis?

JB: In many ways it looks a lot like my life already did because I had, especially in the last two to three years, decided if I want to work on R packages or on exposition, I’m going to do that. That’s what I think tenure is for! So I had decided to stop worrying about how to sell myself in a framework set up to reward traditional work in statistical methodology. That freed up a lot of mental energy to pursue these other activities, unapologetically. Which led to other opportunities, such as RStudio. I was already working mostly from home. The Statistics department is by no means a negative environment for me, but the internet helped me find virtual colleagues around the globe who really share my interests. The physical comfort of home is very appealing. RStudio is also very light on meetings, which is a beautiful thing.

KO: What is your team like at RStudio? How many projects are you juggling at any given time? Do you have an idea of what you want to accomplish while you’re there?

JB: The person I interact with most is Hadley Wickham and he now has a team of five. There’s a fair amount of back and forth with other team members. I might seek their advice on, e.g., development practices, or just put questions out there for everyone. This team is pretty new and the formalization of the tidyverse is pretty new, so everyone has different packages that they’re working on, either from scratch or shifting some of the maintenance burden off of Hadley. There’s a concerted effort to figure out “what does it mean to be an ecosystem of packages that work together?“.

KO: Do you have a well defined road map at this point on the team?

JB: I’ve been on that team since January and before that we had queued up readxl as a good project for me. It was also overdue for maintenance! I was already a “Spreadsheet Lady”, very familiar with the underlying objects, and with the problem space. It was a good opportunity for me to write compiled code which I hadn’t done in a really long time. I had never written C++ so it was a way to kill at least three birds with one stone. So that was an easy selection for the first thing to work on. And even before that was done, it was clear that going back and doing another project in the Google arena made sense. We knew we would do some work with interns. Wrapping the Google Drive API was going to be useful (in general and for a future update of googlesheets) and I knew our intern Lucy McGowan would be a great person to work with on it.

So no, there’s not some detailed 18-month roadmap stretching ahead of me. I think it will cycle between doing something that’s mine or new and doing maintenance on something that already exists. I also continue to do a lot of exposition, training, and speaking.

It actually pisses me off when people criticize “when” people work – like that’s a signifier of a poor work-life balance … their heart is in the right place to encourage balance, but I have a certain amount of work I want to get done.

KO: Day-to-day, do you have regular standups? How do you like your day to be structured?

JB: Oh there’s how I wish my day was structured and how it’s actually structured. I wish I could get up and just work because that’s when I feel by far my most productive. Unfortunately, this coincides with the morning chaos of a household with three kids, who, despite the fact that we’re trying to get them more independent with lunches and getting to school, you cannot completely tune out through this part of the day. So I do not really get up and just work, I sort of work until everyone gets out the door. Then I usually go exercise at that point, get that taken care of. I get more work done in the afternoon until the children all arrive home. I do a lot of work between 9 or 10 at night and 1 in the morning. Not because I love working at that time, but that’s just what I have.

Given that I have this platform, it actually pisses me off when people criticize “when” people work – like that’s a signifier of a poor work-life balance, though it is possible that I have a poor work-life balance, but I feel like it’s usually coming from people who don’t have the same constraints in their life. “You shouldn’t work on the weekends, You shouldn’t work in the evenings”. I’m like, when the heck else do you think I would work? I feel like sometimes people are – their heart is in the right place to encourage balance, but I have a certain amount of work I want to get done. And I have a family and it means that I work when my children are asleep.

They’re happy years but the tension between all the things you want to do is unbelievable because you will not do all of them. You cannot do it all.

KO: This topic is very interesting and personal to me. As I get older I’ve been thinking (nebulously) about starting a family, and I don’t know what that looks like. It’s scary to me, to not want to give up this lifestyle and this career that I’ve started for myself.

JB: My pivoting of thinking about myself as an applied statistician to more of a data scientist, coincided with me reemerging from having little kids. I had all of them pre-tenure and at some point we had “three under three”. I was trying to get tenure, just barely getting it all done and I was kind of in my own little world, just surviving. Then the tenure process completed successfully, the kids got older, they were all in school, and eventually they didn’t need any out of school care. So me being able to string multiple abstract thoughts together and carve out hours at a time to do thought work coincided with me also freeing myself to work on stuff that I found more interesting.

I don’t know how this all would have worked out if the conventional academic statistical work had suited me better. The time where I was most conflicted between doing a decent job parenting and doing decent work was also when I was doing work I wasn’t passionate about. I can’t tell if having more enthusiasm about the work would have made that period harder or easier! I really thought about ditching it all more than a few times.

The reinvigoration that coincided with switching emphasis also coincided with the reinvigoration that comes from the kids becoming more independent. It does eventually happen! There are some very tough years – they’re not dark years, they’re happy years but the tension between all the things you want to do is unbelievable because you will not do all of them. You cannot do it all.

KO: What are your favorite tools for managing your workflow?

JB: In terms of working with R I’ve completely standardized on working with RStudio. Before that I was an Emacs-ESS zealot and I still have more accumulated years in that sphere. But once RStudio really existed and was viable, I started teaching with it. I hate doing R one way when I’m in front of students and another when I’m alone. It got very confusing and mixing up the keyboard shortcuts would create chaos. So now I’ve fully embraced RStudio and have never looked back.

I’m also a git evangelist. Everything I do is in git, everything is on Github and at this point, almost everything is public because I’ve gotten unselfconscious enough to put it up there. Plus there’s enough volume now that no one could be looking at any particular one thing. It’s so much easier for me to find it again later. I just put everything in a public place rather than trying to have this granular access control; it simplifies things greatly. Working in the open has simplified a lot of decisions, that’s nice.

Otherwise I feel like my workflow is very primitive. I have thousands of emails in my inbox. I’ve completely given up on managing email and I’m mostly okay with that. It’s out of my control and I can’t commit to a system where I’m forced to get to inbox zero. I’ve just given up on it. And Twitter and Slack are important ways to feel connected when I’m sitting at home on my sofa.

KO: Do you have any online blogs, personalities or podcasts that you particularly enjoy? It doesn’t have to be R related.

JB: I do follow people on twitter and the rstats hashtag, so that often results in serendipitous one-off links that I enjoy. I don’t follow certain blogs regularly, but there are certain places that I end up at regularly. I like the Not So Standard Deviations podcast. In the end I always listen to every episode, but it’s what I do on an airplane or car drive.

KO: You build up a backlog?

JB: Exactly. Then the next time I need to drive to Seattle in traffic, I’ll power through four episodes.

KO: What are some of your favorite R packages – do you have some that you think are funny, or love?

JB: I live entirely in the tidyverse. I’m not doing primary data analysis on projects anymore. It’s weird that the more involved you become in honing the tools, the less time you spend wielding them. So I’m increasingly focused on the data prep, data wrangling, data input part of the cycle and not on modeling. I did a lot more of that when I was a statistician and now it’s not where my comparative interest and advantage seems to lie. There’s plenty to do on the other end. And also not that many people who like it. I actually do enjoy it. I don’t have to force myself to enjoy it – this is really important, and it pleases me. Given how important I think the work is, it’s a relatively uncrowded field. Whereas machine learning, it seems like everyone wants to make a contribution there. I’m like, you go for it – I’m going to be over here getting data out of Excel spreadsheets.



Writing Excel formatted csv using readr::write_excel_csv2

Fri, 12/08/2017 - 01:00

(This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers)

Why this post?

Currently, my team and I are building a Shiny app that serves as an interface for a forecasting model. The app allows business users to interact with predictions. However, we keep getting feature requests, such as, “Can we please have this exported to Excel.”

Our client chose to have the results exported to a csv file and wants to open them in Excel. The app is already running on a Linux server, and the csv files that can be downloaded via the app are UTF-8 encoded.

If you are a Linux user, you may not be aware that Excel on Windows is not able to recognize UTF-8 encoding automatically. It turns out that quite a few people have faced this problem in the past.

Obviously, we cannot have a solution where our users have to change options in Excel or open the file in any way other than double-clicking.

We find having a Shiny app that allows for Excel export to be a good compromise between R/Shiny and Excel. It gives the user the power of interactivity and online access, while still preserving the possibility to work with the results in the environment they are most used to. This is a great way to gradually accustom users to working in Shiny.

Currently available solution in R

What we want is the following: write a csv file with UTF-8 encoding and a BOM (the byte order mark is a Unicode character that indicates the encoding of the document). This has been addressed by RStudio in the readr package.

library(readr)
write_excel_csv(mtcars, "assets/data/readr/my_file.csv")

This is great and solves the problem with opening the file in Excel, but… it supports only one locale.

Show me your locale

Depending on where you live, you might have a different locale. A locale is a set of parameters that defines the user’s language, region and any special variant preferences that the user wants to see in their user interface.

This means that number formatting can differ between regions: for example, in the USA . is used as the decimal separator, while almost all of Europe uses ,. This article shows how countries around the world define their number formats.
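As a small illustration of the two conventions (my own example in base R, not from the original post):

# the same number rendered with US-style and European-style separators
format(1234.56, big.mark = ",", decimal.mark = ".")   # "1,234.56"
format(1234.56, big.mark = ".", decimal.mark = ",")   # "1.234,56"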

This shows that there is a real need to extend the readr functionality and allow users to save Excel-friendly csv files with a European locale easily and quickly. This is not currently possible, since write_excel_csv() only allows one to write in the US locale.
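To make the requirement concrete, here is a rough sketch of such an export using only base R (my own illustration; this is not how readr implements it):

# hand-rolled approximation of a European-locale, Excel-friendly csv writer
write_excel_csv2_sketch <- function(x, path) {
  # write the UTF-8 byte order mark (EF BB BF) so Excel detects the encoding
  writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), path)
  # append the data with ';' as column separator and ',' as decimal mark
  # (write.table warns about appending column names; that is expected here)
  suppressWarnings(
    write.table(x, file = path, append = TRUE, sep = ";", dec = ",",
                row.names = FALSE, qmethod = "double", fileEncoding = "UTF-8")
  )
}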

New addition to readr

We proposed adding write_excel_csv2() to the readr package, which allows the user to write a csv with , as the decimal separator and ; as the column separator. To be consistent with the R naming convention for functions reading (e.g. read.csv() and read.csv2()) or writing (e.g. write.csv() and write.csv2()) csv files with a different delimiter, we decided to simply add 2 to write_excel_csv().

tmp <- tempfile()
on.exit(unlink(tmp))
readr::write_excel_csv2(mtcars, tmp)

To prove that it works, let’s read the first two lines and inspect the output.

readr::read_lines(tmp, n_max = 2)
## [1] "mpg;cyl;disp;hp;drat;wt;qsec;vs;am;gear;carb"
## [2] "21,0;6;160,0;110;3,90;2,620;16,46;0;1;4;4"

write_excel_csv2() is already available for download from the readr GitHub repository and should be available on CRAN with the next release.

devtools::install_github("tidyverse/readr")

We hope you and your business team will find this addition useful.



In case you missed it: November 2017 roundup

Thu, 12/07/2017 - 22:54

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from November of particular interest to R users.

R 3.4.3 "Kite Eating Tree" has been released.

Several approaches for generating a "Secret Santa" list with R.

The "RevoScaleR" package from Microsoft R Server has now been ported to Python.

The call for papers for the R/Finance 2018 conference in Chicago is now open.

Give thanks to the volunteers behind R.

Advice for R user groups from the organizer of R-Ladies Chicago.

Use containers to build R clusters for parallel workloads in Azure with the doAzureParallel package.

A collection of R scripts for interesting visualizations that fit into a 280-character Tweet.

R is featured in a StackOverflow case study at the Microsoft Connect conference.

The City of Chicago uses R to forecast water quality and issue beach safety alerts.

A collection of best practices for sharing data in spreadsheets, from a paper by Karl Broman and Kara Woo.

The MRAN website has been updated with faster package search and other improvements.

The curl package has been updated to use the built-in winSSL library on Windows.

Beginner, intermediate and advanced on-line learning plans for developing AI applications on Azure.

A recap of the EARL conference (Effective Applications of the R Language) in Boston. 

Giora Simchoni uses R to calculate the expected payout from a slot machine.

An introductory R tutorial by Jesse Sadler focuses on the analysis of historical documents.

A new RStudio cheat sheet: "Working with Strings".

An overview of generating distributions in R via simulated gaming dice.

An analysis of StackOverflow survey data ranks R and Python among the most-liked and least-disliked languages.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.



Clustering Music Genres with R

Thu, 12/07/2017 - 18:28

(This article was first published on Method Matters, and kindly contributed to R-bloggers)

In a number of upcoming posts, I’ll be analyzing an interesting dataset I found on Kaggle. The dataset contains information on 18,393 music reviews from the Pitchfork website. The data cover reviews posted between January 1999 and January 2016. I downloaded the data and did an extensive data munging exercise to turn the data into a tidy dataset for analysis (not described here, but perhaps in a future blog post).

The goal of this post is to describe the similarities and differences among the music genres of the reviewed albums using cluster analysis.

The Data

After data munging, I was left with 18,389 reviews for analysis (there were 4 identical review id’s in the dataset; visual inspection showed that the reviews were identical so I removed the duplicates).

One of the pieces of information about each album is the music genre, with the following options available: electronic, experimental, folk/country, global, jazz, metal, pop/rnb, rap and rock. Each album can have 0, 1 or multiple genres attached to it. I represented this information in the wide format, with one column to represent each genre. The presence of a given genre for a given album is represented with a 1, and the absence of a given genre for a given album is represented with a 0.

The head of the dataset, called “categories,” looks like this:


title                                          artist           electronic  experimental  folk_country  global  jazz  metal  pop_rnb  rap  rock
mezzanine                                      massive attack            1             0             0       0     0      0        0    0     0
prelapsarian                                   krallice                  0             0             0       0     0      1        0    0     0
all of them naturals                           uranium club              0             0             0       0     0      0        0    0     1
first songs                                    kleenex liliput           0             0             0       0     0      0        0    0     1
new start                                      taso                      1             0             0       0     0      0        0    0     0
insecure (music from the hbo original series)  various artists           0             0             0       0     0      0        0    0     0

As we can see, each album has numeric values on all of our 9 genres. The album in the last row of the data shown above does not have a genre attached to it.

Let’s first answer some basic questions about the music genres. How often is each genre represented in our 18,389 reviews? We can make a simple bar plot using base R with the following code:

# histogram of genre frequencies
par(mai=c(1,1.5,1,1))
barplot(sort(colSums(categories[,3:11]), decreasing = FALSE),
horiz = TRUE, cex.names = 1, col = 'springgreen4',
main = 'Genre Frequency', las = 1)

Which gives us the following plot:

Rock is by far the most frequently-occurring genre in the dataset!

Let’s not forget that albums can have more than one genre. How many albums have more than 1 genre attached to them? What is the maximum number of genres in these data? What number of albums have what number of genres? It’s possible to extract these figures with dplyr, but for basic summary statistics I find it quicker and easier to use base R:

# how many of the albums have more than 1 genre?
table(rowSums(categories[,3:11])>1)
# FALSE TRUE
# 14512 3877

# what is the maximum number of genres?
max(rowSums(categories[,3:11]))
# [1] 4

# how many albums have what number of genres?
table(rowSums(categories[,3:11]))
# 0 1 2 3 4
# 2365 12147 3500 345 32

It looks like 3,877 of the albums have more than 1 genre. The table shows that the vast majority of albums with more than 1 genre have 2 genres, with a much smaller number having 3 and 4 genres. There are 2,365 albums with no recorded genre.

Data Preparation

In order to cluster the music genres, we first must make a matrix which contains the co-occurrences of our dummy-coded genre variables with one another. We can use matrix multiplication to accomplish this, as explained in this excellent StackOverflow answer:

# make a co-occurrence matrix
# select the relevant columns and convert to matrix format
library(plyr); library(dplyr)

co_occur <- categories %>%
  select(electronic, experimental, folk_country,
         global, jazz, metal, pop_rnb, rap, rock) %>%
  as.matrix()
# calculate the co-occurrences of genres
out <- crossprod(co_occur)
# make the diagonals of the matrix into zeros
# (we won't count co-occurrences of a genre with itself)
diag(out) <- 0

The resulting co-occurrence matrix, called “out”, is a 9 by 9 matrix containing the counts of the genre co-occurrences together in the data:


              electronic  experimental  folk_country  global  jazz  metal  pop_rnb   rap  rock
electronic             0           198            26      49   127     40      228    81  1419
experimental         198             0            15      20    76     64       32    17  1121
folk_country          26            15             0       7     6     10       65     2    52
global                49            20             7       0    13      0       48     3    33
jazz                 127            76             6      13     0     26       52    25    57
metal                 40            64            10       0    26      0       12    27   449
pop_rnb              228            32            65      48    52     12        0   133   126
rap                   81            17             2       3    25     27      133     0    68
rock                1419          1121            52      33    57    449      126    68     0

Cluster Analysis

We can now proceed with the cluster analysis. We will use hierarchical clustering, an algorithm which seeks to build a hierarchy of clusters in the data. This analysis will produce groupings (e.g. clusters) of music genres. We will visualize the results of this analysis via a dendrogram.

We first need to produce a distance matrix from our co-occurrence matrix. For each pair of music genres, we will calculate the Euclidean distance between them. To calculate the Euclidean distances, we first calculate the sum of the squared differences in co-occurrences for each pair of rows across the nine columns. We then take the square root of the sum of squared differences. If we consider the two bottom rows (rap and rock) in the co-occurrence matrix above, then the Euclidean distance between them is calculated as follows:

# calculate Euclidean distance manually between rock and rap
squared_diffs = (81-1419)^2 + (17-1121)^2 + (2-52)^2 + (3-33)^2 +
(25-57)^2 + (27-449)^2 + (133-126)^2 + (0-68)^2 + (68-0)^2
sqrt(squared_diffs)
# [1] 1789.096

This calculation makes clear our definition of genre similarity, which will define our clustering solution: two genres are similar to one another if they have similar patterns of co-occurrence with the other genres.

Let’s calculate all pairwise Euclidean distances with the dist() function, and check our manual calculation with the one produced by dist().

# first produce a distance matrix
# from our co-occurrence matrix
dist_matrix <- dist(out)
# examine the distance matrix
round(dist_matrix,3)
# the result for rap+rock
# is 1789.096
# the same as we calculated above!

As noted in the R syntax above (and shown in the bottom-right corner of the distance matrix below), the distance between rap and rock is 1789.096, the same value that we obtained from our manual calculation above!

The distance matrix:


              electronic  experimental  folk_country    global      jazz     metal   pop_rnb       rap
experimental     462.453
folk_country    1397.729      1087.241
global           1418.07      1102.418        38.105
jazz            1392.189      1072.719       122.511   106.353
metal           1012.169       698.688       405.326    420.86   405.608
pop_rnb         1346.851      1006.476       275.647   259.559    197.16   397.776
rap             1375.999      1066.633        92.644   101.975   116.816   406.478   259.854
rock            2250.053      2041.001      1836.619  1818.557  1719.464  1854.881  1682.785  1789.096

To perform the clustering, we simply pass the distance matrix to our hierarchical clustering algorithm (specifying that we want to use Ward’s method), and produce the dendrogram plot.*

# perform the hierarchical clustering
hc <- hclust(dist_matrix, method="ward.D")
# plot the dendrogram
plot(hc, hang=-1, xlab="", sub="")

The hierarchical clustering algorithm produces the organization of the music genres visible in the above dendrogram. But how many clusters are appropriate for these data?

There are many different ways of choosing the number of clusters when performing cluster analysis. With hierarchical clustering, we can simply examine the dendrogram and decide where to cut the tree to determine the number of clusters for our solution. One of the key intuitions behind the dendrogram is that observations that fuse together higher up the tree (i.e. toward the top of the plot) are more different from one another, while observations that fuse together further down (i.e. toward the bottom of the plot) are more similar to one another. Therefore, the higher up we split the tree, the more different the music genres will be among the clusters.

If we split the tree near the top (e.g., around a height of 2500), we end up with three clusters. Let’s make a dendrogram that represents each of these three clusters with a different color. The colored dendrogram is useful in the interpretation of the cluster solution.
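As a quick supplementary check (my addition, not part of the original analysis), cutree() returns the explicit cluster membership implied by cutting the tree into three groups:

# cut the hierarchical clustering result from above into k = 3 groups;
# the returned named vector maps each genre to a cluster number
cutree(hc, k = 3)
# given the solution described in the interpretation below, rock,
# electronic/experimental, and the remaining genres should land in
# three separate groups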

There are many different ways to produce a colored dendrogram in R; a lot of them are pretty hacky and require a fair amount of code. Below, I use the wonderful dendextend package to produce the colored dendrogram in a straightforward and simple way (with a little help from this StackOverflow answer):

# make a colored dendrogram with the dendextend package
library(dendextend)
# hc is the result of our hclust call above
dend <- hc
# specify we want to color the branches and labels
dend <- color_branches(dend, k = 3)
dend <- color_labels(dend, k = 3)
# plot the colored dendrogram. add a title and y-axis label
plot(dend, main = 'Cluster Dendrogram', ylab = 'Height')

Which yields the following plot:

Interpretation of the Clusters

The Clusters

The three clusters identified by our chosen cut point are as follows. First, on the right-hand side of the plot, we see electronic and experimental music grouped together in a single cluster (colored in blue). Also on the right-hand of the plot, on the same branch but with a different leaf and colored in green, we find rock music which constitutes a cluster in and of itself. These two clusters (electronic + experimental and rock) are distinct, yet they share some similarities in terms of their co-occurrences with other music genres, as indicated by their fusion at the top right-hand side of the dendrogram.

On the left-hand side of the dendrogram, we see the third cluster (colored in pink) which encompasses metal, pop/r&b, folk/country, global, jazz and rap music. Our chosen cut point lumps all of these genres into a single cluster, but examination of the sub-divisions of this cluster reveals a more nuanced picture. Metal is most different from the other genres in this cluster. A further sub-grouping separates pop/r&b from folk/country, global, jazz and rap. Folk/country and global are fused at the lowest point in the dendrogram, followed by a similar grouping of jazz and rap. These two pairings occur very close to the bottom of the dendrogram, indicating strong similarity in genre co-occurrence between folk/country and global on the one hand, and between jazz and rap on the other hand.

Substantive Interpretation: What have we learned about music genres?

Rock (Green Cluster)
Rock music is the only music genre to get its own cluster, suggesting that it is distinctive in its patterns of co-occurrence with the other genres.

We must keep in mind the following statistical consideration, however. Rock music was the most frequent genre in the dataset. One of the reasons rock’s Euclidean distances were so large is that, because this genre occurs so frequently in the data, the distances calculated between rock and the less frequently occurring genres are naturally larger.

Electronic & Experimental (Blue Cluster)
The blue cluster in the dendrogram above groups electronic and experimental music together. One reason for this might be because these genres are both relatively more modern (compared to, say, jazz or folk), and therefore share some sonic similarities (for example, using electronic or synthetically generated sounds) which leads them to be used in a similar way in conjunction with the other music genres.

Metal, Pop/R&B, Folk/Country, Global, Jazz and Rap (Pink Cluster)
The remaining genres fall into a single cluster. Within this cluster, it seems natural that metal is most different from the other genres, and that pop/r&b separates itself from folk/country, global, jazz and rap.

Pop and R&B are different, in my vision, from folk/country, global, jazz and rap music in that the former feature slick production styles (e.g. a tremendously layered sound with many different tracks making up a song), electronic instrumentation (e.g. keyboard synthesizers and drum machines) and a contemporary musical aesthetic (e.g. auto-tuning to produce noticeably distorted vocals), whereas the latter feature more sparse arrangements and fewer electronically-produced sounds.

Folk/country and global music, meanwhile, share a similar musical palette in terms of their more traditional instrumentation. While there are exceptions, I feel that both folk/country and global music use more acoustic or “natural” instruments (guitars, violins, wind instruments, drums, etc.) and less of the obviously electronically-produced sounds mentioned above.

Finally, jazz and hip-hop, although quite different in many musical aspects, share a similar aesthetic in some ways. Hip-hop, for example, has long made use of samples from older records (including many jazz tunes) to create beats (although recent hip-hop sub-genres such as trap music make use of drum machines and synthesizers for a more mechanical sound). One testament to the complementarity between jazz and rap is the number of recent notable collaborations between artists in these two genres: Kendrick Lamar’s excellent 2015 album To Pimp a Butterfly contains a number of tracks featuring jazz musicians and jazz instrumentation, the jazz pianist Robert Glasper has made several collaborative albums featuring rap artists, and the jazzy group BadBadNotGood made an excellent album with the always brilliant Ghostface Killah.

Conclusion

In this post, we clustered music genres from albums reviewed by Pitchfork. Our original dataset contained dummy-coded indicators for 9 different music genres across 18,389 different albums. We used matrix multiplication to create a co-occurrence matrix for the music genres, which we then turned into a distance matrix for hierarchical clustering. We cut the resulting dendrogram high up on the tree, obtaining three separate clusters of music genres: 1) rock 2) electronic and experimental and 3) metal, pop/r&b, folk/country, global, jazz and rap. This clustering solution seems to align with the production styles, sonic and musical qualities, and past and current cross-pollination between the various music genres.

Coming Up Next

In the next post, I’ll use R to produce a unique visualization of a very famous dataset. Stay tuned!

—-

* If you’re interested, you can play around with other distance metrics and clustering algorithms at home- just import the above co-occurrence matrix into R, and adapt the code however you like!



Building a simple Sales Revenue Dashboard with R Shiny & ShinyDashboard

Thu, 12/07/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

One of the beautiful gifts that R has (and that Python misses) is the Shiny package. Shiny is an R package that makes it easy to build interactive web apps straight from R. Building dashboards is a natural next step wherever data is available, since dashboards help businesses draw insights from their existing data.

In this post, We will see how to leverage Shiny to build a simple Sales Revenue Dashboard.

Loading Packages

All the packages listed below can be directly installed from CRAN.

# load the required packages
library(shiny)
require(shinydashboard)
library(ggplot2)
library(dplyr)

Sample Input File:
Since the dashboard needs input data to visualise, we will use this sample recommendation.csv as the input data for our dashboard, but this can be modified to suit any organisational need, such as a database connection or data from a remote location.

recommendation <- read.csv('recommendation.csv', stringsAsFactors = F, header = T)
head(recommendation)
       Account Product Region Revenue
1    Axis Bank     FBB  North    2000
2         HSBC     FBB  South   30000
3          SBI     FBB   East    1000
4        ICICI     FBB   West    1000
5 Bandhan Bank     FBB   West     200
6    Axis Bank    SIMO  North     200

Every shiny application has two main sections: 1. ui and 2. server. ui is where the code for the front-end (buttons, plot visuals, tabs and so on) lives, and server is where the code for the back-end (data retrieval, manipulation and wrangling) lives.


Image Courtesy: Slideplayer

Instead of using only shiny, here we will couple it with shinydashboard. shinydashboard is an R package whose job is to make it easier (as the name suggests) to build dashboards with shiny. The ui part of a shiny app built with shinydashboard has 3 basic elements wrapped in dashboardPage().

    1. dashboardHeader(),
    2. dashboardSidebar(),
    3. dashboardBody()

Simplest Shiny app with shinydashboard:

## app.R ##
library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  dashboardHeader(),
  dashboardSidebar(),
  dashboardBody()
)

server <- function(input, output) { }

shinyApp(ui, server)

Gives this app:

Image Courtesy: rstudio

Aligning to our larger goal of making a Sales Revenue Dashboard, Let us look at the code of dashboardHeader() and dashboardSidebar().

#Dashboard header carrying the title of the dashboard
header <- dashboardHeader(title = "Basic Dashboard")

#Sidebar content of the dashboard
sidebar <- dashboardSidebar(
  sidebarMenu(
    menuItem("Dashboard", tabName = "dashboard", icon = icon("dashboard")),
    menuItem("Visit-us", icon = icon("send", lib = 'glyphicon'),
             href = "https://www.salesforce.com")
  )
)

To begin with dashboardPage(), we must decide which UI elements we would like to show in our dashboard. Since it’s a Sales Revenue Dashboard, let us show 3 KPI boxes at the top that give a quick summary, followed by 2 graphical plots for a detailed view.

To align these elements one by one, we will define these elements inside fluidRow().

frow1 <- fluidRow(
  valueBoxOutput("value1")
  ,valueBoxOutput("value2")
  ,valueBoxOutput("value3")
)

frow2 <- fluidRow(
  box(
    title = "Revenue per Account"
    ,status = "primary"
    ,solidHeader = TRUE
    ,collapsible = TRUE
    ,plotOutput("revenuebyPrd", height = "300px")
  )
  ,box(
    title = "Revenue per Product"
    ,status = "primary"
    ,solidHeader = TRUE
    ,collapsible = TRUE
    ,plotOutput("revenuebyRegion", height = "300px")
  )
)

# combine the two fluid rows to make the body
body <- dashboardBody(frow1, frow2)

As can be seen from the above code, valueBoxOutput() is used to display the KPI, but what is displayed in this valueBoxOutput() will be written in the server part; the same applies to plotOutput(), which is used in the ui part to display a plot. box() is a function provided by shinydashboard to enclose the plot inside a box with certain features like title, solidHeader and collapsible. Having defined the two fluidRow() functions individually for the sake of modularity, we can combine both of them in dashboardBody().
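As a minimal illustration of this ui/server pairing (a toy example of my own, separate from the dashboard being built here), every *Output() placeholder in the ui is filled by a matching render*() call in the server:

# toy app: a single valueBox, wired from ui placeholder to server renderer
library(shiny)
library(shinydashboard)

toy_ui <- dashboardPage(
  dashboardHeader(),
  dashboardSidebar(),
  dashboardBody(valueBoxOutput("kpi"))
)

toy_server <- function(input, output) {
  output$kpi <- renderValueBox({
    valueBox(42, "Example KPI", icon = icon("ok", lib = 'glyphicon'), color = "blue")
  })
}

shinyApp(toy_ui, toy_server)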

Thus we can complete the ui part comprising Header, Sidebar and Page with the below code:

#completing the ui part with dashboardPage
ui <- dashboardPage(title = 'This is my Page title', header, sidebar, body, skin = 'red')

Note that the value of title in dashboardPage() will serve as the title of the browser page/tab, while the title defined in the dashboardHeader() will be visible as the dashboard title.

With the ui part over, we will create the server part, where the program and logic behind valueBoxOutput() and plotOutput() are added with renderValueBox() and renderPlot() respectively, enclosed inside the server function with input and output as its parameters. Values inside input contain anything that is received from the ui (like a textBox value or slider value) and values inside output contain anything that is sent to the ui (like plotOutput or valueBoxOutput).

Below is the complete server code:

# create the server functions for the dashboard
server <- function(input, output) {

  #some data manipulation to derive the values of KPI boxes
  total.revenue <- sum(recommendation$Revenue)
  sales.account <- recommendation %>%
    group_by(Account) %>%
    summarise(value = sum(Revenue)) %>%
    filter(value == max(value))
  prof.prod <- recommendation %>%
    group_by(Product) %>%
    summarise(value = sum(Revenue)) %>%
    filter(value == max(value))

  #creating the valueBoxOutput content
  output$value1 <- renderValueBox({
    valueBox(
      formatC(sales.account$value, format = "d", big.mark = ',')
      ,paste('Top Account:', sales.account$Account)
      ,icon = icon("stats", lib = 'glyphicon')
      ,color = "purple")
  })

  output$value2 <- renderValueBox({
    valueBox(
      formatC(total.revenue, format = "d", big.mark = ',')
      ,'Total Expected Revenue'
      ,icon = icon("gbp", lib = 'glyphicon')
      ,color = "green")
  })

  output$value3 <- renderValueBox({
    valueBox(
      formatC(prof.prod$value, format = "d", big.mark = ',')
      ,paste('Top Product:', prof.prod$Product)
      ,icon = icon("menu-hamburger", lib = 'glyphicon')
      ,color = "yellow")
  })

  #creating the plotOutput content
  output$revenuebyPrd <- renderPlot({
    ggplot(data = recommendation,
           aes(x = Product, y = Revenue, fill = factor(Region))) +
      geom_bar(position = "dodge", stat = "identity") +
      ylab("Revenue (in Euros)") +
      xlab("Product") +
      theme(legend.position = "bottom"
            ,plot.title = element_text(size = 15, face = "bold")) +
      ggtitle("Revenue by Product") +
      labs(fill = "Region")
  })

  output$revenuebyRegion <- renderPlot({
    ggplot(data = recommendation,
           aes(x = Account, y = Revenue, fill = factor(Region))) +
      geom_bar(position = "dodge", stat = "identity") +
      ylab("Revenue (in Euros)") +
      xlab("Account") +
      theme(legend.position = "bottom"
            ,plot.title = element_text(size = 15, face = "bold")) +
      ggtitle("Revenue by Region") +
      labs(fill = "Region")
  })

}

So far, we have defined both the essential parts of a Shiny app – ui and server. And finally we have to call/run shinyApp with ui and server as its parameters.

#run/call the shiny app
shinyApp(ui, server)

Listening on http://127.0.0.1:5101

The entire file has to be saved as app.R inside a folder before running the shiny app. Also remember to put the input data file (in our case, recommendation.csv) inside the same folder where app.R is saved. While there is another valid way to structure the shiny app with two files, ui.R and server.R (optionally, global.R), it has been ignored in this article for the sake of brevity since this is aimed at beginners.

Upon running the file, the shiny web app would open in your default browser and would look similar to the below screenshot:

Hopefully at this stage, you have got your shiny web app, in the form of a Sales Revenue Dashboard, up and running. The code and plots used here are available on my Github.


    Related Post

    1. Gender Diversity Analysis of Data Science Industry using Kaggle Survey Dataset in R
    2. How Happy is Your Country? — Happy Planet Index Visualized
    3. Exploring, Clustering, and Mapping Toronto’s Crimes
    4. Spring Budget 2017: Circle Visualisation
    5. Qualitative Research in R


    RcppArmadillo 0.8.300.1.0

    Thu, 12/07/2017 - 01:59

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Another RcppArmadillo release hit CRAN today. Since our last 0.8.100.1.0 release in October, Conrad kept busy and produced Armadillo releases 8.200.0, 8.200.1, 8.300.0 and now 8.300.1. We now tend to package these (with proper reverse-dependency checks and all) first for the RcppCore drat repo, from where you can install them "as usual" (see the repo page for details). But this actual release resumes our normal bi-monthly CRAN release cycle.
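For readers who want these interim builds, a sketch of the drat-based install (assuming the RcppCore drat repository, as described on its repo page) would look something like this:

# add the RcppCore drat repo to the package search path, then install as usual
if (!requireNamespace("drat", quietly = TRUE)) install.packages("drat")
drat::addRepo("RcppCore")
install.packages("RcppArmadillo")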

These releases iron out a few little nags from the recent switch to more extensive use of OpenMP, and round out a number of other corners. See below for a brief summary.

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 405 other packages on CRAN.

    A high-level summary of changes follows.

    Changes in RcppArmadillo version 0.8.300.1.0 (2017-12-04)
    • Upgraded to Armadillo release 8.300.1 (Tropical Shenanigans)

      • faster handling of band matrices by solve()

      • faster handling of band matrices by chol()

      • faster randg() when using OpenMP

      • added normpdf()

      • expanded .save() to allow appending new datasets to existing HDF5 files

    • Includes changes made in several earlier GitHub-only releases (versions 0.8.300.0.0, 0.8.200.2.0 and 0.8.200.1.0).

    • Conversion from simple_triplet_matrix is now supported (Serguei Sokol in #192).

    • Updated configure code to check for g++ 5.4 or later to enable OpenMP.

    • Updated the skeleton package to current packaging standards

    • Suppress warnings from Armadillo about missing OpenMP support and -fopenmp flags by setting ARMA_DONT_PRINT_OPENMP_WARNING

    Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



    jmv – one R package (not just) for the social sciences

    Thu, 12/07/2017 - 00:00

    (This article was first published on jamovi, and kindly contributed to R-bloggers)

    tl;dr
    • many analyses in the social sciences require many R packages
    • jmv makes these common analyses available from one R package
    • jmv can be used from jamovi, a graphical statistical spreadsheet, making it super-accessible

    introducing jmv

There are many tremendously useful R packages for the social sciences (and similar fields), such as car, afex, vcd, etc. Although running basic analyses (such as t-tests or ANOVA) with these packages is very straightforward, it is typically necessary to perform a number of supplementary analyses to accompany them: post-hoc tests, effect-size calculations, bias-corrections, and assumption checks. These additional tests often require the use of many additional R packages, and can make reasonably standard analyses quite time-consuming to perform. For example, in the book Discovering Statistics Using R by Andy Field (a popular textbook in the social sciences), the chapter on ANOVA alone recommends the use of 7 packages.

    jmv simplifies this whole process by bringing all of these packages together and makes doing the following analyses with their most common supplementary tests, corrections and assumption checks as easy as possible:

    • Descriptive statistics
    • T-Tests
    • ANOVA
    • ANCOVA
    • Repeated Measures ANOVA
    • Non-parametric ANOVAs
    • Correlation
    • Linear Regression
    • Contingency Tables
    • Proportion Tests
    • Factor Analysis

    and coming soon:

    • Logistic Regression
    • Log-linear Regression

    jmv aims to make all common statistical tests taught at an undergraduate level available from a single package.

    An ANOVA

    Let’s begin with a simple, familiar analysis – an ANOVA. In this example we use the ToothGrowth dataset from R, and explore whether different food supplements and their dosage affect how much a guinea pig’s teeth grow. We’ll specify len to be the dependent variable, and supp and dose to be the factors.

library('jmv')
data('ToothGrowth')

jmv::anova(ToothGrowth,
           dep = 'len',
           factors = c('supp', 'dose'))

##
##  ANOVA
##
##  ANOVA
##  ───────────────────────────────────────────────────────────────────────
##                 Sum of Squares    df    Mean Square    F        p
##  ───────────────────────────────────────────────────────────────────────
##    supp                    205     1          205.4    15.57    < .001
##    dose                   2426     2         1213.2    92.00    < .001
##    supp:dose                108     2           54.2     4.11     0.022
##    Residuals                712    54           13.2
##  ───────────────────────────────────────────────────────────────────────

    This produces what should be a familiar ANOVA table. You have likely seen something like this in R before, though perhaps not as nicely formatted.

    Where jmv really comes into its own, is with additional options. In the following example we will perform the same analysis, but additionally requesting effect-size, post-hoc tests, homogeneity of variances tests, descriptive statistics, and a descriptives plot:

library('jmv')
data('ToothGrowth')

jmv::anova(ToothGrowth,
           dep = 'len',
           factors = c('supp', 'dose'),
           effectSize = 'eta',
           postHoc = c('supp', 'dose'),
           plotHAxis = 'dose',
           plotSepLines = 'supp',
           descStats = TRUE,
           homo = TRUE)

##
##  ANOVA
##
##  ANOVA
##  ────────────────────────────────────────────────────────────────────────────────
##                 Sum of Squares    df    Mean Square    F        p         η²
##  ────────────────────────────────────────────────────────────────────────────────
##    supp                    205     1          205.4    15.57    < .001    0.059
##    dose                   2426     2         1213.2    92.00    < .001    0.703
##    supp:dose                108     2           54.2     4.11     0.022    0.031
##    Residuals                712    54           13.2
##  ────────────────────────────────────────────────────────────────────────────────
##
##
##  ASSUMPTION CHECKS
##
##  Test for Homogeneity of Variances (Levene's)
##  ────────────────────────────────────────────
##    F       df1    df2    p
##  ────────────────────────────────────────────
##    1.94      5     54    0.103
##  ────────────────────────────────────────────
##
##
##  POST HOC TESTS
##
##  Post Hoc Comparisons - supp
##  ────────────────────────────────────────────────────────────────────────────
##    supp         supp    Mean Difference    SE       df      t       p-tukey
##  ────────────────────────────────────────────────────────────────────────────
##    OJ      -    VC                  3.70    0.938    54.0    3.95    < .001
##  ────────────────────────────────────────────────────────────────────────────
##
##
##  Post Hoc Comparisons - dose
##  ─────────────────────────────────────────────────────────────────────────────
##    dose         dose    Mean Difference    SE      df      t          p-tukey
##  ─────────────────────────────────────────────────────────────────────────────
##    0.5     -    1                  -9.13    1.15    54.0     -7.95    < .001
##            -    2                 -15.50    1.15    54.0    -13.49    < .001
##    1       -    2                  -6.37    1.15    54.0     -5.54    < .001
##  ─────────────────────────────────────────────────────────────────────────────
##
##
##  Descriptives
##  ───────────────────────────────────────
##    supp    dose    N     Mean     SD
##  ───────────────────────────────────────
##    OJ      0.5     10    13.23    4.46
##    OJ      1       10    22.70    3.91
##    OJ      2       10    26.06    2.66
##    VC      0.5     10     7.98    2.75
##    VC      1       10    16.77    2.52
##    VC      2       10    26.14    4.80
##  ───────────────────────────────────────

    As can be seen, jmv can provide many additional tests and statistics relevant to the main tests, but with far less effort.

    You can explore additional options for the jmv ANOVA here, and the other tests and their available options here.

    jamovi integration

jmv is also usable from the jamovi statistical spreadsheet. jamovi makes a range of analyses accessible to a broader audience by making them available from a familiar spreadsheet user interface. jamovi can also make the underlying R code for each analysis available, making it easy for people to learn R and transition to R scripting if they like.

    Here is exactly the same analysis as above, having been performed in jamovi.

    summing up
    • jmv makes a whole suite of common analyses from the social sciences very easy to perform
    • jamovi makes these even easier to perform

    jmv is available from CRAN

    jamovi is available from www.jamovi.org



    The British Ecological Society’s Guide to Reproducible Science

    Wed, 12/06/2017 - 23:26

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

The British Ecological Society has published a new volume in their Guides to Better Science series: A Guide to Reproducible Code in Ecology and Evolution (pdf). The introduction describes its scope:

    A Guide to Reproducible Code covers all the basic tools and information you will need to start making your code more reproducible. We focus on R and Python, but many of the tips apply to any programming language. Anna Krystalli introduces some ways to organise files on your computer and to document your workflows. Laura Graham writes about how to make your code more reproducible and readable. François Michonneau explains how to write reproducible reports. Tamora James breaks down the basics of version control. Finally, Mike Croucher describes how to archive your code. We have also included a selection of helpful tips from other scientists.

    The guide proposes a simple reproducible project workflow, and a guide to organizing projects for reproducibility. The Programming section provides concrete tips and traps to avoid (example: use relative, not absolute pathnames), and the Reproducible Reports section provides a step-by-step guide for generating reports with R Markdown.
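
For instance, the "relative, not absolute pathnames" tip can be illustrated with a small R sketch of my own (the file names are hypothetical, not taken from the guide):

# Fragile: an absolute path breaks as soon as the project moves to another machine
# dat <- read.csv("C:/Users/me/projects/birds/data/counts.csv")

# Portable: a path relative to the project's root directory
dat <- read.csv(file.path("data", "counts.csv"))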

While written for an ecology audience (and including some gorgeous photography of animals), this guide would be useful for anyone in the sciences looking to implement a reproducible workflow. You can download the guide at the link below.

    British Ecological Society: A Guide to Reproducible Code in Ecology and Evolution (via Laura Graham)


    14 Jobs for R users from around the world (2017-12-06)

    Wed, 12/06/2017 - 17:07
    To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

    Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

    Featured Jobs
    More New R Jobs
    1. Full-Time
      Quantitative Market Research Analyst Gradient Metrics – Posted by kyblock
      Anywhere
      6 Dec 2017
    2. Full-Time
      Data Scientist in the Institute for Economics and Peace Institute for Economics and Peace – Posted by Institute for Economics and Peace
      Sydney New South Wales, Australia
      5 Dec 2017
    3. Full-Time
      Data Innovation Specialist /Technology Analyst- 20509 jessicaxtan
      Washington District of Columbia, United States
      30 Nov 2017
    4. Full-Time
      Data Scientist: Mobility Services Parkbob – Posted by ivan.kasanicky
      Wien Wien, Austria
      29 Nov 2017
    5. Freelance
      Statistician/Econometrician – R Programmer for Academic Statistical Research Academic Research – Posted by empiricus
      Anywhere
      29 Nov 2017
    6. Full-Time
      Senior Data Analyst @ Bangkok, Thailand Agoda – Posted by atrapassi
      Bangkok Krung Thep Maha Nakhon, Thailand
      28 Nov 2017
    7. Full-Time
      R Shiny Dashboard Engineer in Health Tech Castor EDC – Posted by Castor EDC
      Amsterdam-Zuidoost Noord-Holland, Netherlands
      24 Nov 2017
    8. Full-Time
      Data Analyst: Travel Behavior RSG – Posted by patricia.holland@rsginc.com
      San Diego California, United States
      22 Nov 2017
    9. Full-Time
      R&D Database Developer @ Toronto, Canada Crescendo Technology Ltd – Posted by Crescendo
      Toronto Ontario, Canada
      17 Nov 2017
    10. Full-Time
      Customer Success Representative RStudio, Inc. – Posted by jclemens1
      Anywhere
      17 Nov 2017
    11. Full-Time
      Data Science Engineer Bonify – Posted by arianna@meritocracy
      Berlin Berlin, Germany
      17 Nov 2017
    12. Part-Time
      Development of User-Defined Calculations and Graphical Outputs for WHO’s Influenza Data World Health Organization – Posted by aspenh
      Anywhere
      6 Nov 2017
    13. Full-Time
      Data Scientist for H Labs @ Chicago, Illinois, United States Heidrick & Struggles – Posted by Heidrick1
      Chicago Illinois, United States
      2 Nov 2017
    14. Full-Time
      Business Data Analytics Faculty Maryville University – Posted by tlorden
      St. Louis Missouri, United States
      2 Nov 2017

     

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(You may also look at previous R jobs posts.)


    How to Avoid the dplyr Dependency Driven Result Corruption

    Wed, 12/06/2017 - 16:27

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    In our last article we pointed out a dangerous silent result corruption we have seen when using the R dplyr package with databases.

To systematically avoid this result corruption we suggest breaking up your dplyr::mutate() statements so that they are dependency-free (not assigning the same value twice, and not using any value in the same mutate() in which it is formed). We consider these to be key and critical precautions to take when using dplyr with a database.
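
As a small illustration of that precaution (my own sketch, not code from the original posts), a mutate() that both creates and uses a column can be split into two dependency-free steps:

library(dplyr)

d <- data.frame(x = 1:3)

# Risky on some database backends: y is created and used in the same mutate()
# d %>% mutate(y = x + 1, z = y * 2)

# Dependency-free: each mutate() only uses columns that already exist
d %>%
  mutate(y = x + 1) %>%
  mutate(z = y * 2)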

We would also like to point out that we are distributing free tools to do this automatically, along with a worked example of this solution.


    Mapping lithium production using R

    Wed, 12/06/2017 - 12:15

    (This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)



    A great thing about data science – and data visualization in particular – is that you can use data science tools to get information about a subject very quickly.

    This is particularly true when you know how to present the information. When you know the best technique for visualizing data, you can use those tools to present information in a way that can help you increase your understanding very quickly.

    Here’s a case in point: I was thinking about energy, solar energy, and battery technology, and I got curious about sources of lithium (which is used in batteries and related tech).

    Using a few tools from R, I created a map of lithium production within about 15 minutes. The maps that I created certainly don’t tell the full story, but they at least provide a baseline of knowledge.

    If you’re fluent with the tools and techniques of data science, this becomes possible. Whether you’re just doing some personal research, or working for a multinational corporation, you can use the tools of data science to quickly identify and present insights about the world.

    Let me walk you through this example and show you how I did it.

    Tutorial: how to map lithium production data using R

    First, we’ll load the packages that we’re going to use.

    We’ll load tidyverse mostly for access to ggplot2 and dplyr, but we’ll also load rvest (for data scraping), stringr (to help us manipulate our variables and put the data into shape), and viridis (to help us modify the color of the final plot).

#--------------
# LOAD PACKAGES
#--------------
library(tidyverse)
library(rvest)
library(stringr)
library(viridis)

    Ok. Now, we’re going to scrape this mineral production data from the wikipedia page.

    Notice that we’re essentially using several rvest functions in series. We’re using several functions and combining them together using the “pipe” operator, %>%.

If you're not using it yet, you should definitely familiarize yourself with this operator and start using it. It's beyond the scope of this blog post to talk extensively about the pipe operator, but I will say that it's one of the most useful tools in the R data scientist's toolkit. If you learn to use it properly, it will make your code easier to read and easier to write. It will even train your mind to think about analysis in a more step-by-step way.
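
As a quick illustration (my own toy example, not from the original post), the pipe simply passes the result on its left as the first argument of the function on its right, so the two forms below are equivalent:

# nested form: read from the inside out
round(sqrt(sum(1:10)), 1)

# piped form (the %>% operator is loaded with the tidyverse): read left to right
1:10 %>% sum() %>% sqrt() %>% round(1)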

    Concerning the code: what we’re doing here is designating the URL from which we’re going to scrape the data, then we’re specifying that we’ll be scraping one of the tables. Then, we specify that we’re going to scrape data from the 9th table, and finally we coerce the data into a tibble (instead of keeping it as a traditional data.frame).

#---------------------------
# SCRAPE DATA FROM WIKIPEDIA
#---------------------------
df.lithium <- read_html("https://en.wikipedia.org/wiki/Lithium") %>%
  html_nodes("table") %>%
  .[[9]] %>%
  html_table() %>%
  as.tibble()

# INSPECT
df.lithium

    The resultant dataset, df.lithium, is relatively small (which makes it easy to inspect and easy to work with), but in its raw form, it’s a little messy. We’ll need to do a few things to clean up the data, like change the variable names, parse some data into numerics, etc.

    So first, let’s change the column names.

    There are a few ways we could do this, but the most straightforward is to simply pass a vector of manually-defined column names into the colnames() function.

#--------------------------------------------
# CHANGE COLUMN NAMES
# - the raw column names are capitalized and
#   have some extra information
# - we will just clean them up
#--------------------------------------------
colnames(df.lithium) <- c('country', 'production', 'reserves', 'resources')

colnames(df.lithium)

    Now, we’ll remove an extraneous row of data. The original data table on Wikipedia contained not only the individual records of lithium production for particular countries, but it also contained a “total” row at the bottom of the table. More often than not, these sorts of “total” rows are not appropriate for a data.frame in R; we’ll typically remove them, just as we will do here.

#-----------------------------------------------
# REMOVE "World total"
# - this is a total amount that was
#   in the original data table
# - we need to remove, because it's not a
#   proper data record for a particular country
#-----------------------------------------------
df.lithium <- df.lithium %>% filter(country != 'World total')

df.lithium

    Next, we need to parse the numbers into actual numeric data. The reason is that when we scraped the data, it actually read in the numbers as character data, along with commas and some extra characters. We need to transform this character data into proper numeric data in R.

To do this, we'll need to do a few things. First, we need to remove a few "notes" that were in the original data. This is why we're using the code str_replace(production, "W\\[.*\\]", "-"). Essentially, we're removing those notes from the data.

    After that, we’re using parse_number(production, na = ‘-‘) to transform three variables – production, reserves, resources – into numerics.

    Note once again how we’re structuring this code. We’re using a combination of functions from dplyr and stringr to achieve our objectives.

To a beginner, this might look complicated, but it's really not that bad once you understand the individual pieces. If you don't understand this code (or couldn't write it yourself), I recommend that you learn the individual functions from dplyr and stringr first, and then come back to this once you've learned those pieces.

#---------------------------------------------------------
# PARSE NUMBERS
# - the original numeric quantities in the table
#   were read-in as character data
# - we need to "parse" this information ....
#   & transform it from character into proper numeric data
#---------------------------------------------------------

# Strip out the 'notes' from the numeric data
#str_replace(df.lithium$production,"W\\[.*\\]", "")  #test

df.lithium <- df.lithium %>%
  mutate(production = str_replace(production,"W\\[.*\\]", "-"))

# inspect
df.lithium

# Parse character data into numbers
df.lithium <- df.lithium %>%
  mutate(production = parse_number(production, na = '-')
         ,reserves = parse_number(reserves, na = '-')
         ,resources = parse_number(resources, na = '-')
         )

# Inspect
df.lithium

    Now we’ll get data for a map of the world. To do this, we’ll just use map_data().

#--------------
# GET WORLD MAP
#--------------
map.world <- map_data('world')

    We’ll also get the names of the countries in this dataset.

The reason is that we'll need to join this map data to the data from Wikipedia, and we'll need the country names to be exactly the same. To make this work, we'll examine the names in both datasets and modify any names that don't match.

    Notice that once again, we’re using a combination of functions from dplyr, wiring them together using the pipe operator.

#----------------------------------------------------
# Get country names
# - we can use this list and cross-reference
#   with the country names in the scraped data
# - when we find names that are not the same between
#   this map data and the scraped data, we can recode
#   the values
#----------------------------------------------------
map_data('world') %>%
  group_by(region) %>%
  summarise() %>%
  print(n = Inf)

Ok. Now we're going to recode some country names. Again, we're doing this so that the country names in df.lithium are the same as the corresponding country names in map.world.

#--------------------------------------------
# RECODE COUNTRY NAMES
# - some of the country names do not match
#   the names we will use later in our map
# - we will re-code so that the data matches
#   the names in the world map
#--------------------------------------------
df.lithium <- df.lithium %>%
  mutate(country = if_else(country == "Canada (2010)", 'Canada'
                   ,if_else(country == "People's Republic of China", "China"
                   ,if_else(country == "United States", "USA"
                   ,if_else(country == "DR Congo", "Democratic Republic of the Congo", country))))
         )

# Inspect
df.lithium

    Ok, now we’ll join the data using dplyr::left_join().

#-----------------------------------------
# JOIN DATA
# - join the map data and the scraped-data
#-----------------------------------------
df <- left_join(map.world, df.lithium, by = c('region' = 'country'))

    Now we’ll plot.

    We’ll start with just a basic plot (to make sure that the map plots correctly), and then we’ll proceed to plot separate maps where the fill color corresponds to reserves, production, and resources.

#-----------
# PLOT DATA
#-----------

# BASIC MAP
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon()

# LITHIUM RESERVES
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = reserves))

ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = reserves)) +
  scale_fill_viridis(option = 'plasma')

# LITHIUM PRODUCTION
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = production)) +
  scale_fill_viridis(option = 'plasma')

# LITHIUM RESOURCES
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = resources)) +
  scale_fill_viridis(option = 'plasma')







    In the final three versions, notice as well that we’re modifying the color scales by using scale_fill_viridis().

    There’s actually quite a bit more formatting that we could do on these, but as a first pass, these are pretty good.

    I’ll leave it as an exercise for you to format these with titles, background colors, etc. If you choose to do this, leave your finalized code in the comments section below.

    To master data science, you need a plan

    At several points in this tutorial, I’ve mentioned a high level plan for mastering data science: master individual pieces of a programming language, and then learn to put them together into more complicated structures.

    If you can do this, you will accelerate your progress … although, the devil is in the details.

    That’s actually not the only learning hack that you can use to rapidly master data science. There are lots of other tricks and learning hacks that you can use to dramatically accelerate your progress.

    Want to know them?

    Sign up for our email list.

    Here at Sharp Sight, we teach data science. But we also teach you how to learn and how to study data science, so you master the tools as quickly as possible.

    By signing up for our email list, you’ll get weekly tutorials about data science, delivered directly to your inbox.

    You’ll also get our Data Science Crash Course, for free.

    SIGN UP NOW

    The post Mapping lithium production using R appeared first on SHARP SIGHT LABS.


    Package GetDFPData

    Wed, 12/06/2017 - 09:00

    (This article was first published on R and Finance, and kindly contributed to R-bloggers)

    Downloading Annual Financial Reports and Corporate Events from B3 (formerly Bovespa) –

Financial statements of companies traded at B3 (formerly Bovespa), the Brazilian stock exchange, are available on its website. Accessing the data for a single company is straightforward: the website provides a simple interface for accessing this dataset. An example is given here. However, gathering and organizing the data for large scale research, with many companies and many dates, is painful. Financial reports must be downloaded or copied individually and later aggregated. Changes in the accounting format throughout time can make this process slow, unreliable and irreproducible.

Package GetDFPData provides an R interface to all annual financial statements available on the website and more. It not only downloads the data but also organizes it in a tabular format and allows the use of inflation indexes. Users can select companies and a time period to download all available data. Information about listed companies, such as sector and available quarters, is also within reach. The main purpose of the package is to make it easy to access financial statements in large scale research, facilitating the reproducibility of corporate finance studies with B3 data.

The positive aspects of GetDFPData are:

    • Easy and simple R and web interface
    • Changes in accounting format are internally handled by the software
    • Access to corporate events in the FRE system such as dividend
      payments, changes in stock holder composition, changes in governance
      listings, board composition and compensation, debt composition, and
      a lot more!
    • The output data is automatically organized using tidy data
      principles (long format)
    • A cache system is employed for fast data acquisition
    • Completely free and open source!
    Installation

The package is not yet available on CRAN (release version); the development version is available on Github. You can install it with the following code:

# Release version in CRAN
install.packages('GetDFPData') # not in CRAN yet

# Development version in Github
devtools::install_github('msperlin/GetDFPData')

Shiny interface

    The web interface of GetDFPData is available at
    http://www.msperlin.com/shiny/GetDFPData/.

    How to use GetDFPData

The starting point of GetDFPData is to find the official names of companies in B3. Function gdfpd.search.company serves this purpose. Given a string (text), it will search for partial matches in company names. As an example, let's find the official name of Petrobras, one of the largest companies in Brazil:

library(GetDFPData)
library(tibble)

gdfpd.search.company('petrobras', cache.folder = tempdir())
##
## Reading info file from github
## Found 43873 lines for 687 companies [Actives = 521 Inactives = 167 ]
## Last file update: 2017-10-19
## Caching RDATA into tempdir()
##
## Found 1 companies:
## PETRÓLEO BRASILEIRO S.A. - PETROBRAS | situation = ATIVO | first date = 1998-12-31 | last date - 2016-12-31
## [1] "PETRÓLEO BRASILEIRO S.A. - PETROBRAS"

Its official name in Bovespa records is PETRÓLEO BRASILEIRO S.A. - PETROBRAS. Data for quarterly and annual statements are available from 1998 to 2017. The situation of the company, active or canceled, is also given. This helps verify the availability of data.

The content of all available financial statements can be accessed with function gdfpd.get.info.companies. It will read and parse a .csv file from my github repository, which will be periodically updated with new information. Let's try it out:

df.info <- gdfpd.get.info.companies(type.data = 'companies', cache.folder = tempdir())
##
## Reading info file from github
## Found 43873 lines for 687 companies [Actives = 521 Inactives = 167 ]
## Last file update: 2017-10-19
## Caching RDATA into tempdir()

glimpse(df.info)
## Observations: 689
## Variables: 8
## $ name.company    "521 PARTICIPAÇOES S.A. - EM LIQUIDAÇÃO EXTRAJ...
## $ id.company      16330, 16284, 108, 20940, 21725, 19313, 18970,...
## $ situation       "ATIVO", "ATIVO", "CANCELADA", "CANCELADA", "A...
## $ listing.segment NA, "None", "None", "None", "None", "None", "C...
## $ main.sector     NA, "Financeiro e Outros", "Materiais Básicos"...
## $ tickers         NA, "QVQP3B", NA, NA, NA, "AELP3", "TIET11;TIE...
## $ first.date      1998-12-31, 2001-12-31, 2009-12-31, 2009-12-3...
## $ last.date       2016-12-31, 2016-12-31, 2009-12-31, 2009-12-3...

This file includes several pieces of information gathered from Bovespa: names of companies, official numeric ids, listing segment, sectors, traded tickers and, most importantly, the available dates. The resulting dataframe can be used to filter and gather information for large scale research, such as downloading financial data for a specific sector.
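
For example, here is a hypothetical filter of my own (not from the original post) that keeps only active companies of a given sector, using the situation and main.sector columns shown in the glimpse() output above:

library(dplyr)

# keep active companies from the "Financeiro e Outros" sector
df.active.financial <- df.info %>%
  filter(situation == 'ATIVO',
         main.sector == 'Financeiro e Outros')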

    Downloading financial information for ONE company

    All you need to download financial data with GetDFPData are the
    official names of companies, which can be found with
    gdfpd.search.company, the desired starting and ending dates and the
    type of financial information (individual or consolidated). Let’s try it
    for PETROBRAS:

name.companies <- 'PETRÓLEO BRASILEIRO S.A. - PETROBRAS'
first.date <- '2004-01-01'
last.date <- '2006-01-01'

df.reports <- gdfpd.GetDFPData(name.companies = name.companies,
                               first.date = first.date,
                               last.date = last.date,
                               cache.folder = tempdir())
## Found cache file. Loading data..
##
## Downloading data for 1 companies
## First Date: 2004-01-01
## Laste Date: 2006-01-01
## Inflation index: dollar
##
## Downloading inflation data
## Caching inflation RDATA into tempdir() Done
##
##
## WARNING: Cash flow statements are not available before 2009
##
## Inputs looking good! Starting download of files:
##
## PETRÓLEO BRASILEIRO S.A. - PETROBRAS
## Available periods: 2005-12-31 2004-12-31
##
##
## Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS
## Finding info from Bovespa | downloading and reading data | saving cache
## Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS | date 2005-12-31
## Acessing DFP data | downloading file | reading file | saving cache
## Acessing FRE data | No FRE file available..
## Acessing FCA data | No FCA file available..
## Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS | date 2004-12-31
## Acessing DFP data | downloading file | reading file | saving cache
## Acessing FRE data | No FRE file available..
## Acessing FCA data | No FCA file available..

The resulting object is a tibble, a data.frame type of object that allows for list columns. Let's have a look at its content:

glimpse(df.reports)
## Observations: 1
## Variables: 33
## $ company.name                  "PETRÓLEO BRASILEIRO S.A. - PE...
## $ company.code                  9512
## $ company.tickers               "PETR3;PETR4"
## $ min.date                      2004-12-31
## $ max.date                      2005-12-31
## $ n.periods                     2
## $ company.segment               "Tradicional"
## $ current.stockholders          [...
## (the names of several list-columns holding the financial statements were garbled in the original output)
## $ history.capital.issues        [NULL]
## $ history.mkt.value             [NULL]
## $ history.capital.increases     [NULL]
## $ history.capital.reductions    [NULL]
## $ history.stock.repurchases     [NULL]
## $ history.other.stock.events    [NULL]
## $ history.compensation          [NULL]
## $ history.compensation.summary  [NULL]
## $ history.transactions.related  [NULL]
## $ history.debt.composition      [NULL]
## $ history.governance.listings   [NULL]
## $ history.board.composition     [NULL]
## $ history.committee.composition [NULL]
## $ history.family.relations      [NULL]

Object df.reports only has one row since we only asked for data of one company. The number of rows increases with the number of companies, as we will soon see in the next example. All financial statements for the different years are available within df.reports. For example, the income statements for all desired years of PETROBRAS are:

df.income.long <- df.reports$fr.income[[1]]

glimpse(df.income.long)
## Observations: 48
## Variables: 6
## $ name.company       "PETRÓLEO BRASILEIRO S.A. - PETROBRAS", "...
## $ ref.date           2005-12-31, 2005-12-31, 2005-12-31, 2005-1...
## $ acc.number         "3.01", "3.02", "3.03", "3.04", "3.05", "3....
## $ acc.desc           "Receita Bruta de Vendas e/ou Serviços", "D...
## $ acc.value          143665730, -37843204, 105822526, -57512113,...
## $ acc.value.infl.adj 61398234.97, -16173000.56, 45225234.41, -24...

    The resulting dataframe is in the long format, ready for processing. In
    the long format, financial statements of different years are stacked. In
    the wide format, we have the year as columns of the table.
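
As a generic toy illustration of the difference (my own example, unrelated to the package), tidyr can reshape a long table into a wide one:

library(tidyr)

# long format: one row per (year, account) pair
df.long <- data.frame(year = c(2004, 2005, 2004, 2005),
                      account = c('revenue', 'revenue', 'costs', 'costs'),
                      value = c(100, 120, -60, -70))

# wide format: one column per year
df.wide <- spread(df.long, key = year, value = value)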

    If you want the wide format, which is the most common way that financial
    reports are presented, you can use function gdfpd.convert.to.wide. See
    an example next:

df.income.wide <- gdfpd.convert.to.wide(df.income.long)

knitr::kable(df.income.wide)

acc.number   acc.desc                                    name.company                            2004-12-31   2005-12-31
3.01         Receita Bruta de Vendas e/ou Serviços       PETRÓLEO BRASILEIRO S.A. – PETROBRAS     120024727    143665730
3.02         Deduções da Receita Bruta                   PETRÓLEO BRASILEIRO S.A. – PETROBRAS     -34450292    -37843204
3.03         Receita Líquida de Vendas e/ou Serviços     PETRÓLEO BRASILEIRO S.A. – PETROBRAS      85574435    105822526
3.04         Custo de Bens e/ou Serviços Vendidos        PETRÓLEO BRASILEIRO S.A. – PETROBRAS     -48607576    -57512113
3.05         Resultado Bruto                             PETRÓLEO BRASILEIRO S.A. – PETROBRAS      36966859     48310413
3.06         Despesas/Receitas Operacionais              PETRÓLEO BRASILEIRO S.A. – PETROBRAS     -11110540    -14810467
3.06.01      Com Vendas                                  PETRÓLEO BRASILEIRO S.A. – PETROBRAS      -2858630     -4195157
3.06.02      Gerais e Administrativas                    PETRÓLEO BRASILEIRO S.A. – PETROBRAS      -2599552     -3453753
3.06.03      Financeiras                                 PETRÓLEO BRASILEIRO S.A. – PETROBRAS      -1019901       126439
3.06.04      Outras Receitas Operacionais                PETRÓLEO BRASILEIRO S.A. – PETROBRAS             0            0
3.06.05      Outras Despesas Operacionais                PETRÓLEO BRASILEIRO S.A. – PETROBRAS      -5982336     -9070019
3.06.06      Resultado da Equivalência Patrimonial       PETRÓLEO BRASILEIRO S.A. – PETROBRAS       1349879      1782023
3.07         Resultado Operacional                       PETRÓLEO BRASILEIRO S.A. – PETROBRAS      25856319     33499946
3.08         Resultado Não Operacional                   PETRÓLEO BRASILEIRO S.A. – PETROBRAS       -550694      -199982
3.08.01      Receitas                                    PETRÓLEO BRASILEIRO S.A. – PETROBRAS         46611      1256194
3.08.02      Despesas                                    PETRÓLEO BRASILEIRO S.A. – PETROBRAS       -597305     -1456176
3.09         Resultado Antes Tributação/Participações    PETRÓLEO BRASILEIRO S.A. – PETROBRAS      25305625     33299964
3.10         Provisão para IR e Contribuição Social      PETRÓLEO BRASILEIRO S.A. – PETROBRAS      -5199166     -8581490
3.11         IR Diferido                                 PETRÓLEO BRASILEIRO S.A. – PETROBRAS      -1692288      -422392
3.12         Participações/Contribuições Estatutárias    PETRÓLEO BRASILEIRO S.A. – PETROBRAS       -660000      -846000
3.12.01      Participações                               PETRÓLEO BRASILEIRO S.A. – PETROBRAS       -660000      -846000
3.12.02      Contribuições                               PETRÓLEO BRASILEIRO S.A. – PETROBRAS             0            0
3.13         Reversão dos Juros sobre Capital Próprio    PETRÓLEO BRASILEIRO S.A. – PETROBRAS             0            0
3.15         Lucro/Prejuízo do Exercício                 PETRÓLEO BRASILEIRO S.A. – PETROBRAS      17754171     23450082

Downloading financial information for SEVERAL companies

    If you are doing serious research, it is likely that you need financial
    statements for more than one company. Package GetDFPData is specially
    designed for handling large scale download of data. Let’s build a case
    with two selected companies:

my.companies <- c('PETRÓLEO BRASILEIRO S.A. - PETROBRAS',
                  'BANCO DO ESTADO DO RIO GRANDE DO SUL SA')

first.date <- '2005-01-01'
last.date <- '2007-01-01'
type.statements <- 'individual'

df.reports <- gdfpd.GetDFPData(name.companies = my.companies,
                               first.date = first.date,
                               last.date = last.date,
                               cache.folder = tempdir())
## Found cache file. Loading data..
##
## Downloading data for 2 companies
## First Date: 2005-01-01
## Laste Date: 2007-01-01
## Inflation index: dollar
##
## Downloading inflation data
## Found cache file. Loading data.. Done
##
##
## WARNING: Cash flow statements are not available before 2009
##
## Inputs looking good! Starting download of files:
##
## BANCO DO ESTADO DO RIO GRANDE DO SUL SA
## Available periods: 2006-12-31 2005-12-31
## PETRÓLEO BRASILEIRO S.A. - PETROBRAS
## Available periods: 2006-12-31 2005-12-31
##
##
## Processing 1210 - BANCO DO ESTADO DO RIO GRANDE DO SUL SA
## Finding info from Bovespa | downloading and reading data | saving cache
## Processing 1210 - BANCO DO ESTADO DO RIO GRANDE DO SUL SA | date 2006-12-31
## Acessing DFP data | downloading file | reading file | saving cache
## Acessing FRE data | No FRE file available..
## Acessing FCA data | No FCA file available..
## Processing 1210 - BANCO DO ESTADO DO RIO GRANDE DO SUL SA | date 2005-12-31
## Acessing DFP data | downloading file | reading file | saving cache
## Acessing FRE data | No FRE file available..
## Acessing FCA data | No FCA file available..
## Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS
## Finding info from Bovespa
## Found cache file /tmp/RtmpSpLsOP/9512_PETRÓLEO/GetDFPData_BOV_cache_9512_PETR.rds
## Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS | date 2006-12-31
## Acessing DFP data | downloading file | reading file | saving cache
## Acessing FRE data | No FRE file available..
## Acessing FCA data | No FCA file available..
## Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS | date 2005-12-31
## Acessing DFP data | Found DFP cache file
## Acessing FRE data | No FRE file available..
## Acessing FCA data | No FCA file available..

    And now we can check the resulting tibble:

glimpse(df.reports)
## Observations: 2
## Variables: 33
## $ company.name                  "BANCO DO ESTADO DO RIO GRANDE D...
## $ company.code                  1210, 9512
## $ company.tickers               "BRSR3;BRSR5;BRSR6", "PETR3;PETR4"
## $ min.date                      2005-12-31, 2005-12-31
## $ max.date                      2006-12-31, 2006-12-31
## $ n.periods                     2, 2
## $ company.segment               "Corporate Governance - Level 1"...
## $ current.stockholders          [...
## (the names of several list-columns holding the financial statements were garbled in the original output)
## $ history.capital.issues        [NULL, NULL]
## $ history.mkt.value             [NULL, NULL]
## $ history.capital.increases     [NULL, NULL]
## $ history.capital.reductions    [NULL, NULL]
## $ history.stock.repurchases     [NULL, NULL]
## $ history.other.stock.events    [NULL, NULL]
## $ history.compensation          [NULL, NULL]
## $ history.compensation.summary  [NULL, NULL]
## $ history.transactions.related  [NULL, NULL]
## $ history.debt.composition      [NULL, NULL]
## $ history.governance.listings   [NULL, NULL]
## $ history.board.composition     [NULL, NULL]
## $ history.committee.composition [NULL, NULL]
## $ history.family.relations      [NULL, NULL]

    Every row of df.reports will provide information for one company.
    Metadata about the corresponding dataframes such as min/max dates is
    available in the first columns. Keeping a tabular structure facilitates
    the organization and future processing of all financial data. We can use
    tibble df.reports for creating other dataframes in the long format
    containing data for all companies. See next, where we create dataframes
    with the assets and liabilities of all companies:

df.assets <- do.call(what = rbind, args = df.reports$fr.assets)
df.liabilities <- do.call(what = rbind, args = df.reports$fr.liabilities)

df.assets.liabilities <- rbind(df.assets, df.liabilities)

    As an example, let’s use the resulting dataframe for calculating and
    analyzing a simple liquidity index of a company, the total of current
    (liquid) assets (Ativo circulante) divided by the total of current
    short term liabilities (Passivo Circulante), over time.

library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

my.tab <- df.assets.liabilities %>%
  group_by(name.company, ref.date) %>%
  summarise(Liq.Index = acc.value[acc.number == '1.01']/
                        acc.value[acc.number == '2.01'])

my.tab
## # A tibble: 3 x 3
## # Groups: name.company [?]
##                              name.company   ref.date Liq.Index
##
## 1 BANCO DO ESTADO DO RIO GRANDE DO SUL SA 2006-12-31 0.7251432
## 2    PETRÓLEO BRASILEIRO S.A. - PETROBRAS 2005-12-31 0.9370813
## 3    PETRÓLEO BRASILEIRO S.A. - PETROBRAS 2006-12-31 0.9733600

    Now we can visualize the information using ggplot2:

library(ggplot2)

p <- ggplot(my.tab, aes(x = ref.date, y = Liq.Index, fill = name.company)) +
  geom_col(position = 'dodge')

print(p)

    Exporting financial data

The package includes function gdfpd.export.DFP.data for exporting the financial data to Excel or zipped csv files. See next:

my.basename <- 'MyExcelData'
my.format <- 'csv' # only supported so far

gdfpd.export.DFP.data(df.reports = df.reports,
                      base.file.name = my.basename,
                      type.export = my.format)

The resulting file contains all the data available in df.reports.


    The Cost of True Love (a.k.a. The Tidy — and expensive! — Twelve Days of Christmas)

    Tue, 12/05/2017 - 19:13

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

    I’m in the market for Christmas presents for my true love, @mrshrbrmstr, and thought I’d look to an age-old shopping list for inspiration. Just what would it set me back if I decided to mimic the 12 Days of Christmas in this modern day and age?

    Let’s try to do the whole thing in R (of course!).

    We’ll need to:

    • Grab the lyrics
    • Parse the lyrics
    • Get pricing data
    • Compute some statistics
    • Make some (hopefully) pretty charts

    This one was complex enough formatting-wise that I needed to iframe it below. Feel free to bust out of the iframe at any time.

    Some good follow-ups to this (if you’re so inclined) would be to predict prices next year and/or clean up the charts a bit.

    Grab the code up on GitHub.

    (Note: ColourLovers API times out occasionally so just try that snippet again if you get an error).


    Computing wind average in an area using rWind

    Tue, 12/05/2017 - 17:54

    (This article was first published on long time ago..., and kindly contributed to R-bloggers)


Hi all! A researcher asked me last week how to compute the wind average for an area using rWind. I wrote a simple function to do this using the dplyr library (following the advice of my friend Javier Fajardo). The function will be added to the rWind package as soon as possible. Meanwhile, you can test the results… enjoy!

# First, load the new function
library(dplyr)

wind.region <- function (X){
  # wrap the values of the third column into the range [-180, 180)
  X[,3] <- X[,3] %% 360
  X[X[,3]>=180,3] <- X[X[,3]>=180,3] - 360
  # average all columns except the first one
  avg <- summarise_all(X[,-1], .funs = mean)
  # keep the first column of the first row next to the averages
  wind_region <- cbind(X[1,1],avg)
  return(wind_region)
}

Once you have loaded the function, let's work through an example.

    # Get some wind data and convert it into a raster to be plotted later
    library(rWind)
    library(raster)
    wind_data<-wind.dl(2015,2,12,0,-10,5,35,45)
    wind_fitted_data <- wind.fit(wind_data)
    r_speed <- wind2raster(wind_fitted_data, type="speed")

    Now, you can use the new function to obtain wind average in the study area:

    myMean <- wind.region(wind_data)
    myMean

    # Now, you can use wind.fit to get wind speed and direction.
    myMean_fitted <- wind.fit(myMean)
    myMean_fitted

    # Finally, let's plot the results!
    library(rworldmap)
    library(shape)
    plot(r_speed)
    lines(getMap(resolution = "low"), lwd=4)
    alpha <- arrowDir(myMean_fitted)
    Arrowhead(myMean_fitted$lon, myMean_fitted$lat, angle=alpha,
    arr.length = 2, arr.type="curved")
    text(myMean_fitted$lon+1, myMean_fitted$lat+1,
    paste(round(myMean_fitted$speed,2), "m/s"), cex = 2)


    On the biases in data

    Tue, 12/05/2017 - 16:15

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    Whether we're developing statistical models, training machine learning recognizers, or developing AI systems, we start with data. And while the suitability of that data set is, lamentably, sometimes measured by its size, it's always important to reflect on where those data come from. Data are not neutral: the data we choose to use has profound impacts on the resulting systems we develop. A recent article in Microsoft's AI Blog discusses the inherent biases found in many data sets:

    “The people who are collecting the datasets decide that, ‘Oh this represents what men and women do, or this represents all human actions or human faces.’ These are types of decisions that are made when we create what are called datasets,” she said. “What is interesting about training datasets is that they will always bear the marks of history, that history will be human, and it will always have the same kind of frailties and biases that humans have.”
    Kate Crawford, Principal Researcher at Microsoft Research and co-founder of AI Now Institute.

    “When you are constructing or choosing a dataset, you have to ask, ‘Is this dataset representative of the population that I am trying to model?’”
    Hanna Wallach, Senior Researcher at Microsoft Research NYC. 

The article discusses the consequences of data sets that aren't representative of the populations they are meant to analyze, as well as the consequences of the lack of diversity in the fields of AI research and implementation. Read the complete article at the link below.

    Microsoft AI Blog: Debugging data: Microsoft researchers look at ways to train AI systems to reflect the real world


    Building a Telecom Dictionary scraping web using rvest in R

    Tue, 12/05/2017 - 15:00

    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

One of the biggest problems in business when carrying out any analysis is the availability of data. That is where, in many cases, web scraping comes in very handy in creating the data that's required. Consider the following case: performing text analysis on textual data collected in a telecom company, as part of customer feedback or reviews, primarily requires a dictionary of telecom keywords. But such a dictionary is hard to find out-of-the-box. Hence, as an analyst, the most obvious thing to do when such a dictionary doesn't exist is to build one. This article aims to help beginners get started with web scraping with rvest in R and, at the same time, build a Telecom Dictionary by the end of this exercise.

    Disclaimer

    Web Scraping is not allowed by some websites. Kindly check the site’s Terms of Service before Scraping. This post is only for Educational-purpose.

A simple Google search for Telecom Glossary results in this URL: Atis.org, from which the required Telecom Dictionary could be built.

    Let us start by loading the required libraries:

#load rvest and stringr library
library(rvest)
library(stringr)
library(dplyr)

    rvest is a web scraping library in R that makes it easier to write common scraping tasks (to scrape useful information from web pages) without getting our head into xml parsing. rvest can be downloaded from CRAN and the development version is also available on Github.

It could be seen in the above-mentioned URL that the glossary words are listed alphabetically from A to Z, with a separate link / URL for every (starting) letter. Clicking 'A' takes us to this link: atis.org, which lists all the keywords with starting letter 'A'. If you look closely at the URL, the code written for one letter (here in our case, 'A') could easily be replicated for the other letters, since 'A' is part of the URL and is the only thing that changes in the link for the other letters.

Let us assign this URL to an R object url, which could be passed on as the parameter to rvest's read_html() function.

#url whose content to be scraped
url <- 'http://www.atis.org/glossary/definitionsList.aspx?find=A&kw=0'

#extracting the content of the given url
#url_content <- read_html('http://www.atis.org/glossary/definitionsList.aspx?find=A&kw=0')
url_content <- read_html(url)

    read_html() parses the html page of the given url (as its parameter) and saves the result as an xml object.

To reiterate the objective, we are trying to get the list of Telecom Keywords and, as per the screenshot above, you can see that the keywords are listed as hyperlinks in the given url.

Hyperlinks in HTML are written in the following syntax:

<a href="www.google.com">Google</a>

Google is the link text label that a browser would render, and when clicked it would take us to www.google.com. Hence it is evident that the anchor tags in the page are what we are supposed to scrape/extract. But the issue with the current page is that it's not just the keywords that are represented as anchor text (links); there are also a lot of other links (anchor tags) in the page. Hence, to extract only the required information and filter out the irrelevant information, we need to find a pattern that helps us extract only the keyword links.

    Have a look at this screenshot:

This screenshot shows that while the keywords are also represented as hyperlinks (anchor text), the differentiator is the 'id' element in the url. Only the keyword links have this 'id' in the url, and hence we can try extracting two pieces of information from the current page to get only the relevant information, which in our case is the keywords: 1. the href value / actual URL and 2. the link text.

#extracting all the links from the page
links <- url_content %>% html_nodes('a') %>% html_attr('href')

#extracting all the link text from the page
text <- url_content %>% html_nodes('a') %>% html_text()

With the example hyperlink discussed earlier, the above code gives us two pieces of information.

    www.google.com is saved in links and Google is saved in text.

    With links and text as columns, Let us build a rough dictionary (that’s not yet cleaned/filtered).

#creating a new dictionary of links and text extracted above
rough_dict <- data.frame(links, text, stringsAsFactors = F)

head(rough_dict)
                              links text
1              http://www.atis.org      
2 definitionsList.aspx?find=A&kw=0    A
3 definitionsList.aspx?find=B&kw=0    B
4 definitionsList.aspx?find=C&kw=0    C
5 definitionsList.aspx?find=D&kw=0    D
6 definitionsList.aspx?find=E&kw=0    E

As displayed above, rough_dict contains both signal (keywords) and noise (irrelevant links), and we have to filter out the irrelevant links with the 'id' pattern that we learnt earlier.

#filtering glossary terms leaving out irrelevant information
fair_dict <- rough_dict %>%
  filter(str_detect(links, 'id')) %>%
  select(text)

tail(fair_dict)
                           text
657                          AN
658 axial propagation constant
659            analog component
660                 axial ratio
661             analog computer
662                   axial ray

And that's how, using str_detect(), we can keep only the links with 'id' in them and filter out the rest, building our fair_dict. As displayed in the above output, we have got 662 keywords just for the letter 'A', and the same exercise could be repeated for the other letters available on the site. The only change that's required is the url object. For example, like this:

    url <- 'http://www.atis.org/glossary/definitionsList.aspx?find=B&kw=0'

Note the 'B' in it; the same could be done for the other available letters. This process could be further improved by making the scraping part a function and looping the function call over a character vector with all the available letters (which is beyond the scope of this article's objective and hence left out; see the sketch below for one possible approach). The complete code used in this article is available on my Github.
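
Here is a rough, untested sketch of that improvement (my own illustration, not part of the original article), wrapping the scraping steps above in a function and looping over the letters A to Z:

#scrape the glossary keywords for a single starting letter
scrape_letter <- function(letter) {
  url <- paste0('http://www.atis.org/glossary/definitionsList.aspx?find=', letter, '&kw=0')
  url_content <- read_html(url)
  links <- url_content %>% html_nodes('a') %>% html_attr('href')
  text <- url_content %>% html_nodes('a') %>% html_text()
  #keep only the keyword links, as before
  data.frame(links, text, stringsAsFactors = F) %>%
    filter(str_detect(links, 'id')) %>%
    select(text)
}

#loop over all letters and stack the results into one dictionary
full_dict <- do.call(rbind, lapply(LETTERS, scrape_letter))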

      Related Post

      1. Scraping Javascript-rendered web content using R
      2. Analysing Cryptocurrency Market in R
      3. Time Series Analysis in R Part 3: Getting Data from Quandl
      4. Pulling Data Out of Census Spreadsheets Using R
      5. Extracting Tables from PDFs in R using the Tabulizer Package

      visualizing reassortment history using seqcombo

      Tue, 12/05/2017 - 04:09

      (This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers)

Reassortment is an important strategy for influenza A viruses to introduce an HA subtype that is new to human populations, which creates the possibility of a pandemic.

The diagram shown above (Figure 2 of doi:10.1038/srep25549) is widely used to illustrate reassortment events. However, such diagrams are mostly drawn and edited manually, without a software tool to generate them automatically. Here, I implemented the hybrid_plot function for producing publication-quality figures of reassortment events.

library(tibble)
library(ggplot2)

n <- 8
virus_info <- tibble(
  id = 1:7,
  x = c(rep(1990, 4), rep(2000, 2), 2009),
  y = c(1, 2, 3, 5, 1.5, 3, 4),
  segment_color = list(
    rep('purple', n),
    rep('red', n),
    rep('darkgreen', n),
    rep('lightgreen', n),
    c('darkgreen', 'darkgreen', 'red', 'darkgreen', 'red', 'purple', 'red', 'purple'),
    c('darkgreen', 'darkgreen', 'red', 'darkgreen', 'darkgreen', 'purple', 'red', 'purple'),
    c('darkgreen', 'lightgreen', 'lightgreen', 'darkgreen', 'darkgreen', 'purple', 'red', 'purple'))
)

flow_info <- tibble(from = c(1, 2, 3, 3, 4, 5, 6),
                    to = c(5, 5, 5, 6, 7, 6, 7))

hybrid_plot(virus_info, flow_info)

The hybrid_plot function requires two tibble data frames: one of virus information and one of genetic flow information.

Users need to provide x and y positions to plot the viruses. This makes sense because geographic and temporal information is usually available in such phylodynamic studies and can be employed to set x or y, providing more information and helping the interpretation of the reassortment events.

We use a hexagon to represent a virus. Users can set the virus outer boundary color with v_color and fill the virus with v_fill. The color of the line segments that indicate the genetic flow relationships can be specified via the l_color parameter.

      hybrid_plot(virus_info, flow_info, v_color='firebrick', v_fill='darkgreen', l_color='steelblue')

We usually have more information to present, for example host information and HA subtype, and this information can be used to color the viruses via either v_color or v_fill.

virus_info$Host = c("Avian", "Human", rep("Swine", 4), "Human")

hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host)

The relative virus size can also be specified if a virus_size column is available in the input virus_info data.

virus_info$virus_size <- c(rep(1, 3), 2, 1, 1, 1.5)

hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host)

If label and label_position columns are available, the virus labels (virus name or other information) will be added automatically.

virus_info$label <- c("Avian", "Human\nH3N2", "Classic\nswine\nH1N1", "Eurasian swine",
                      "North American swine\n triple reassrotant H3N2",
                      "North American swine\n triple reassortant H1N2",
                      "2009 Human H1N1")
virus_info$label_position <- c('left', 'left', 'left', 'below', 'below', 'upper', 'below')

hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host)

Users can use asp to set the aspect ratio of the hexagons: asp < 1 for thick/short and asp > 1 for thin/tall.

      hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host, asp=2)

      The output of hybrid_plot is a ggplot object and users can use
      ggplot2 to modify the details.

title <- "Reassortment events in evolution of the 2009 influenza A (H1N1) virus"
caption <- 'Gene segments: PB2, PB1, PA, HA, NP, NA, M, NS'
color <- c(Avian="purple", Human="red", Swine="darkgreen")

hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host) +
  labs(caption=caption, title=title) +
  scale_color_manual(values=color) +
  scale_fill_manual(values=color) +
  scale_x_continuous(breaks=c(1990, 2000, 2009)) +
  xlab(NULL) + ylab(NULL) +
  theme_minimal() +
  theme(axis.line.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        panel.grid.minor=element_blank(),
        panel.grid.major.y=element_blank(),
        legend.position = c(.95, .1)
        )

      Top-down or bottom-up style is also supported.

x <- virus_info$x
virus_info$x <- virus_info$y
virus_info$y <- x
virus_info$label_position <- c(rep("right", 3), "left", "left", "right", "right")

hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host) +
  scale_y_reverse() +
  scale_x_continuous(limits=c(0, 5.5))

Users can also use emoji to label the viruses (host information in this example):

virus_info$label <- c("chicken", "woman", "pig", "pig", "pig", "pig", "woman")

hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host,
            parse='emoji', t_size=8, t_color='firebrick') +
  scale_y_reverse()

In case you don't have xy-coordinate information, you can use the set_layout function to automatically set the xy positions using a selected layout function.

virus_info <- set_layout(virus_info, flow_info, layout="layout.kamada.kawai")

hybrid_plot(virus_info, flow_info, parse='emoji', t_size=8, t_color='firebrick')

virus_info <- set_layout(virus_info, flow_info, layout="layout.fruchterman.reingold")

hybrid_plot(virus_info, flow_info, parse='emoji', t_size=8, t_color='firebrick')

Please let me know if you know of any published reassortment data that contains spatial information; I will demonstrate how to visualize reassortment history on a map.


      My book ‘Practical Machine Learning with R and Python’ on Amazon

      Tue, 12/05/2017 - 03:26

      (This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

      My book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($9.99) and kindle ($6.97/Rs449) versions. In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code. This is almost like listening to parallel channels of music in stereo!
      1. Practical Machine Learning with R and Python – Machine Learning in Stereo (Paperback)
      2. Practical Machine Learning with R and Python – Machine Learning in Stereo (Kindle)
      This book is ideal both for beginners and for experts in R and/or Python. Those starting their journey into data science and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in the two languages. Those who are expert in either R or Python will find the equivalent code ideal for brushing up on the other language. And finally, those who are proficient in both languages can use the R and Python implementations to internalize the ML algorithms better.

      Here is a look at the topics covered:

      Table of Contents
      • Essential R (p. 7)
      • Essential Python for Datascience (p. 54)
      • R vs Python (p. 77)
      • Regression of a continuous variable (p. 96)
      • Classification and Cross Validation (p. 113)
      • Regression techniques and regularization (p. 134)
      • SVMs, Decision Trees and Validation curves (p. 175)
      • Splines, GAMs, Random Forests and Boosting (p. 202)
      • PCA, K-Means and Hierarchical Clustering (p. 234)

      Pick up your copy today!!
      Hope you have a great time learning as I did while implementing these algorithms!


      To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….

      Version 0.6-8 of NIMBLE released

      Tue, 12/05/2017 - 02:19

      (This article was first published on R – NIMBLE, and kindly contributed to R-bloggers)

      We released the newest version of NIMBLE on CRAN and on our website a week ago. Version 0.6-8 has a few new features, and more are on the way in the next few months.

      New features include:

      • the proper Gaussian CAR (conditional autoregressive) model can now be used in BUGS code as dcar_proper, which behaves similarly to BUGS’ car.proper distribution;
      • a new nimbleMCMC function that provides one-line invocation of NIMBLE’s MCMC engine, akin to usage of JAGS and WinBUGS through R (see the sketch after this list);
      • a new runCrossValidate function that will conduct k-fold cross-validation of NIMBLE models fit by MCMC;
      • dynamic indexing in BUGS code is now allowed by default;
      • and a variety of bug fixes and efficiency improvements.
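      To make the one-line invocation concrete, here is a minimal sketch of nimbleMCMC usage; the toy normal-mean model, data, and MCMC settings below are illustrative assumptions, not taken from the release announcement.

      library(nimble)

      ## Toy model: unknown mean and standard deviation of a normal sample.
      code <- nimbleCode({
          mu ~ dnorm(0, sd = 10)
          sigma ~ dunif(0, 10)
          for (i in 1:N) {
              y[i] ~ dnorm(mu, sd = sigma)
          }
      })

      set.seed(1)
      y <- rnorm(50, mean = 2, sd = 1)

      ## A single call builds the model, compiles it, and runs the MCMC.
      fit <- nimbleMCMC(code = code,
                        constants = list(N = length(y)),
                        data = list(y = y),
                        inits = list(mu = 0, sigma = 1),
                        niter = 5000, nburnin = 1000, nchains = 2,
                        summary = TRUE)

      fit$summary   # posterior summaries per chain (and combined)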

      Please see the NEWS file in the installed package for more details.


      To leave a comment for the author, please follow the link and comment on their blog: R – NIMBLE.

      #R – Deprecate functions with roxygen2

      Tue, 12/05/2017 - 01:00

      (This article was first published on R-feed, and kindly contributed to R-bloggers)

      We show how to use roxygen2 tags and templates for deprecating existing documented functions.

      Introduction

      Package roxygen2 provides a powerful way of documenting packages and the objects inside them. In particular, great flexibility is available for organizing the documentation content in source files and templates, and for rendering it as desired in the corresponding help pages.

      We can leverage this flexibility to adapt an existing package’s documentation when making a function deprecated (and similarly defunct). As stated in the base-R ‘Marking Objects as Deprecated’ documentation (help(Deprecated)):

      The original help page for these functions is often available at help(“oldName-deprecated”) (note the quotes). Functions should be listed in help(“pkg-deprecated”) for an appropriate pkg, including base.

      Deprecate existing functions

      We show how an existing and documented function myFun in package ourPkg can be deprecated, and the documentation adjusted accordingly by re-arranging the roxygen2 documentation content.

      ## myFun.r
      #' @title My function.
      #' @description My function.
      #' @param arg1 My first argument.
      #' @param arg2 My second argument.
      #' @return My return value.
      #' @export
      myFun <- function(arg1, arg2) {
          "My return value"
      }

      Let us assume a new function myNewFun exists as a replacement for the old myFun, which becomes deprecated. We want to achieve the following:

      • list myFun in help("ourPkg-deprecated"), along with its replacement;
      • make its original documentation available via help("myFun-deprecated").

      This can be obtained by

      • creating generic content for help("ourPkg-deprecated");
      • adjusting the existing myFun documentation content.
      Package-level documentation

      Generic content for help("ourPkg-deprecated") is defined in its own source file:

      ## ourPkg-deprecated.r
      #' @title Deprecated functions in package \pkg{ourPkg}.
      #' @description The functions listed below are deprecated and will be defunct in
      #'   the near future. When possible, alternative functions with similar
      #'   functionality are also mentioned. Help pages for deprecated functions are
      #'   available at \code{help("<function>-deprecated")}.
      #' @name ourPkg-deprecated
      #' @keywords internal
      NULL

      Function-specific documentation

      Function-specific content is added to the ourPkg-deprecated help page from the corresponding source file, where we also want to make the original help page available via help("myFun-deprecated").

      The source file of myFun is then modified as follows:

      ## myFun.r
      #' @title My function.
      #' @description My function.
      #' @param arg1 My first argument.
      #' @param arg2 My second argument.
      #' @return My return value.
      #'
      #' @name myFun-deprecated
      #' @usage myFun(arg1, arg2)
      #' @seealso \code{\link{ourPkg-deprecated}}
      #' @keywords internal
      NULL

      #' @rdname ourPkg-deprecated
      #' @section \code{myFun}:
      #' For \code{myFun}, use \code{\link{myNewFun}}.
      #'
      #' @export
      myFun <- function(arg1, arg2) {
          .Deprecated("myNewFun")
          "My return value"
      }

      Two blocks of roxygen2 tags have been added to the existing documentation content.

      • The first block is used to create the myFun-deprecated help page, detached from the myFun object. For this reason we need to explicitly add the original function signature as a ‘Usage’ section. We also add a ‘See Also’ link to ourPkg-deprecated.
      • The second block adds myFun-specific content to the ourPkg-deprecated help page. The standard ‘Usage’ section is added automatically, and we create a function-specific section where the use of myNewFun is suggested. Such content will be collected and shown in help("ourPkg-deprecated") for all deprecated functions with a similar tag structure.

      Note also that help(myFun) will point to the ourPkg-deprecated help page, and that @keywords internal prevents the corresponding topics from appearing in the package documentation index.
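      Once ourPkg is installed, the resulting help pages can be browsed as follows (a minimal sketch using the example names from this post):

      library(ourPkg)
      help("ourPkg-deprecated")   # package-level overview of all deprecated functions
      help("myFun-deprecated")    # the original help page of the deprecated function
      help(myFun)                 # now points to the ourPkg-deprecated page
      myFun(1, 2)                 # still works, but emits a deprecation warning via .Deprecated()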

      Using roxygen2 templates

      Deprecating functions as described above can be automated using templates. One can define a template for each additional block.

      ## template-depr_pkg.r
      #' @name ourPkg-deprecated
      #' @section \code{<%= old %>}:
      #' For \code{<%= old %>}, use \code{\link{<%= new %>}}.

      ## template-depr_fun.r
      #' @name <%= fun %>-deprecated
      #' @usage <%= gsub("\n", "\n#' ", roxygen2:::wrapString(roxygen2:::function_usage(fun, formals(fun)))) %>
      #' @seealso \code{\link{ourPkg-deprecated}}
      #' @keywords internal

      Note the presence of template variables and of some roxygen2 internals for automating the construction of the ‘Usage’ section (handling multiple lines in case of many arguments).

      Deprecating herFun in favor of herNewFun then results in the simple inclusion of the templates in the function source.

      ## herFun.r
      #' @title Her function.
      #' @description Her function.
      #' @param arg1 Her first argument.
      #' @param arg2 Her second argument.
      #' @return Her return value.
      #'
      #' @templateVar fun herFun
      #' @template template-depr_fun
      NULL

      #' @templateVar old herFun
      #' @templateVar new herNewFun
      #' @template template-depr_pkg
      #'
      #' @export
      herFun <- function(arg1, arg2) {
          .Deprecated("herNewFun")
          "Her return value"
      }

      Example package manual and sources

      A complete example of ourPkg is available for download. The package contains functions myFun, yourFun, herFun, and hisFun, where all but yourFun are deprecated in favor of their *NewFun counterparts, using roxygen2 templates for herFun and hisFun.

      The resulting documentation content can be assessed in the corresponding PDF reference manual.
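      If you want to regenerate the Rd files and the PDF manual yourself after editing the roxygen2 comments, something along these lines should work (a sketch that assumes the package sources sit in a local ourPkg/ directory and that devtools is installed):

      library(devtools)
      document("ourPkg")       # run roxygen2 and rewrite man/*.Rd and NAMESPACE
      build_manual("ourPkg")   # render the PDF reference manual via R CMD Rd2pdf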


      To leave a comment for the author, please follow the link and comment on their blog: R-feed.
