R-bloggers: R news and tutorials contributed by hundreds of R bloggers

In case you missed it: November 2017 roundup

Thu, 12/07/2017 - 22:54

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from November of particular interest to R users.

R 3.4.3 "Kite Eating Tree" has been released.

Several approaches for generating a "Secret Santa" list with R.

The "RevoScaleR" package from Microsoft R Server has now been ported to Python.

The call for papers for the R/Finance 2018 conference in Chicago is now open.

Give thanks to the volunteers behind R.

Advice for R user groups from the organizer of R-Ladies Chicago.

Use containers to build R clusters for parallel workloads in Azure with the doAzureParallel package.

A collection of R scripts for interesting visualizations that fit into a 280-character Tweet.

R is featured in a StackOverflow case study at the Microsoft Connect conference.

The City of Chicago uses R to forecast water quality and issue beach safety alerts.

A collection of best practices for sharing data in spreadsheets, from a paper by Karl Broman and Kara Woo.

The MRAN website has been updated with faster package search and other improvements.

The curl package has been updated to use the built-in winSSL library on Windows.

Beginner, intermediate and advanced on-line learning plans for developing AI applications on Azure.

A recap of the EARL conference (Effective Applications of the R Language) in Boston. 

Giora Simchoni uses R to calculate the expected payout from a slot machine.

An introductory R tutorial by Jesse Sadler focuses on the analysis of historical documents.

A new RStudio cheat sheet: "Working with Strings".

An overview of generating distributions in R via simulated gaming dice.

An analysis of StackOverflow survey data ranks R and Python among the most-liked and least-disliked languages.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Clustering Music Genres with R

Thu, 12/07/2017 - 18:28

(This article was first published on Method Matters, and kindly contributed to R-bloggers)

In a number of upcoming posts, I’ll be analyzing an interesting dataset I found on Kaggle. The dataset contains information on 18,393 music reviews from the Pitchfork website. The data cover reviews posted between January 1999 and January 2016. I downloaded the data and did an extensive data munging exercise to turn the data into a tidy dataset for analysis (not described here, but perhaps in a future blog post).

The goal of this post is to describe the similarities and differences among the music genres of the reviewed albums using cluster analysis.

The Data

After data munging, I was left with 18,389 reviews for analysis (there were 4 duplicated review IDs in the dataset; visual inspection showed that the reviews were identical, so I removed the duplicates).

One of the pieces of information about each album is the music genre, with the following options available: electronic, experimental, folk/country, global, jazz, metal, pop/rnb, rap and rock. Each album can have 0, 1 or multiple genres attached to it. I represented this information in the wide format, with one column to represent each genre. The presence of a given genre for a given album is represented with a 1, and the absence of a given genre for a given album is represented with a 0.
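To make the dummy-coding step concrete, here is a minimal, hypothetical sketch (not the original munging code) of how a comma-separated genre column could be expanded into one 0/1 column per genre; the toy data frame and its column names are invented for illustration:

library(dplyr)
library(tidyr)
library(tibble)

# two invented albums, genres stored as a single comma-separated string
toy <- tibble(title = c("album a", "album b"),
              genre = c("rock", "electronic,experimental"))

toy_wide <- toy %>%
  mutate(genre = strsplit(genre, ",")) %>%  # split multi-genre strings into a list-column
  unnest(genre) %>%                         # one row per album-genre pair
  mutate(present = 1L) %>%
  spread(genre, present, fill = 0L)         # one 0/1 column per genre

toy_wide
# note: albums with no genre at all would drop out here and need to be re-joined afterwards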

The head of the dataset, called “categories,” looks like this:


title                                           artist           electronic  experimental  folk_country  global  jazz  metal  pop_rnb  rap  rock
mezzanine                                       massive attack            1             0             0       0     0      0        0    0     0
prelapsarian                                    krallice                  0             0             0       0     0      1        0    0     0
all of them naturals                            uranium club              0             0             0       0     0      0        0    0     1
first songs                                     kleenex liliput           0             0             0       0     0      0        0    0     1
new start                                       taso                      1             0             0       0     0      0        0    0     0
insecure (music from the hbo original series)   various artists           0             0             0       0     0      0        0    0     0

As we can see, each album has numeric values on all of our 9 genres. The album in the last row of the data shown above does not have a genre attached to it.

Let’s first answer some basic questions about the music genres. How often is each genre represented in our 18,389 reviews? We can make a simple bar plot using base R with the following code:

# histogram of genre frequencies
par(mai=c(1,1.5,1,1))
barplot(sort(colSums(categories[,3:11]), decreasing = FALSE),
horiz = TRUE, cex.names = 1, col = 'springgreen4',
main = 'Genre Frequency', las = 1)

Which gives us the following plot:

Rock is by far the most frequently-occurring genre in the dataset!

Let’s not forget that albums can have more than one genre. How many albums have more than 1 genre attached to them? What is the maximum number of genres in these data? What number of albums have what number of genres? It’s possible to extract these figures with dplyr, but for basic summary statistics I find it quicker and easier to use base R:

# how many of the albums have more than 1 genre?
table(rowSums(categories[,3:11])>1)
# FALSE TRUE
# 14512 3877

# what is the maximum number of genres?
max(rowSums(categories[,3:11]))
# [1] 4

# how many albums have what number of genres?
table(rowSums(categories[,3:11]))
# 0 1 2 3 4
# 2365 12147 3500 345 32

It looks like 3,877 of the albums have more than 1 genre. The table shows that the vast majority of albums with more than 1 genre have 2 genres, with a much smaller number having 3 and 4 genres. There are 2,365 albums with no recorded genre.

Data Preparation

In order to cluster the music genres, we first must make a matrix which contains the co-occurrences of our dummy-coded genre variables with one another. We can use matrix multiplication to accomplish this, as explained in this excellent StackOverflow answer:

# make a co-occurrence matrix
# select the relevant columns and convert to matrix format
library(plyr); library(dplyr)

co_occur <- categories %>%
  select(electronic, experimental, folk_country, global, jazz,
         metal, pop_rnb, rap, rock) %>%
  as.matrix()
# calculate the co-occurrences of genres
out <- crossprod(co_occur)
# make the diagonals of the matrix into zeros
# (we won't count co-occurrences of a genre with itself)
diag(out) <- 0

The resulting co-occurrence matrix, called “out”, is a 9 by 9 matrix containing the counts of the genre co-occurrences together in the data:


             electronic  experimental  folk_country  global  jazz  metal  pop_rnb  rap  rock
electronic            0           198            26      49   127     40      228   81  1419
experimental        198             0            15      20    76     64       32   17  1121
folk_country         26            15             0       7     6     10       65    2    52
global               49            20             7       0    13      0       48    3    33
jazz                127            76             6      13     0     26       52   25    57
metal                40            64            10       0    26      0       12   27   449
pop_rnb             228            32            65      48    52     12        0  133   126
rap                  81            17             2       3    25     27      133    0    68
rock               1419          1121            52      33    57    449      126   68     0

Cluster Analysis

We can now proceed with the cluster analysis. We will use hierarchical clustering, an algorithm which seeks to build a hierarchy of clusters in the data. This analysis will produce groupings (i.e., clusters) of music genres. We will visualize the results of this analysis via a dendrogram.

We first need to produce a distance matrix from our co-occurrence matrix. For each pair of music genres, we will calculate the Euclidean distance between them. To calculate the Euclidean distances, we first calculate the sum of the squared differences in co-occurrences for each pair of rows across the nine columns. We then take the square root of the sum of squared differences. If we consider the two bottom rows (rap and rock) in the co-occurrence matrix above, then the Euclidean distance between them is calculated as follows:

# calculate Euclidean distance manually between rock and rap
squared_diffs = (81-1419)^2 + (17-1121)^2 + (2-52)^2 + (3-33)^2 +
(25-57)^2 + (27-449)^2 + (133-126)^2 + (0-68)^2 + (68-0)^2
sqrt(squared_diffs)
# [1] 1789.096

This calculation makes clear our definition of genre similarity, which will define our clustering solution: two genres are similar to one another if they have similar patterns of co-occurrence with the other genres.

Let’s calculate all pairwise Euclidean distances with the dist() function, and check our manual calculation with the one produced by dist().

# first produce a distance matrix
# from our co-occurrence matrix
dist_matrix <- dist(out)
# examine the distance matrix
round(dist_matrix,3)
# the result for rap+rock
# is 1789.096
# the same as we calculated above!

As noted in the R syntax above (and shown in the bottom-right corner of the distance matrix below), the distance between rap and rock is 1789.096, the same value that we obtained from our manual calculation above!

The distance matrix:


             electronic  experimental  folk_country    global      jazz     metal   pop_rnb       rap
experimental    462.453
folk_country   1397.729      1087.241
global          1418.07      1102.418        38.105
jazz           1392.189      1072.719       122.511   106.353
metal          1012.169       698.688       405.326    420.86   405.608
pop_rnb        1346.851      1006.476       275.647   259.559    197.16   397.776
rap            1375.999      1066.633        92.644   101.975   116.816   406.478   259.854
rock           2250.053      2041.001      1836.619  1818.557  1719.464  1854.881  1682.785  1789.096

To perform the clustering, we simply pass the distance matrix to our hierarchical clustering algorithm (specifying that we want to use Ward’s method), and produce the dendrogram plot.*

# perform the hierarchical clustering
hc <- hclust(dist_matrix, method="ward.D")
# plot the dendrogram
plot(hc, hang=-1, xlab="", sub="")

The hierarchical clustering algorithm produces the organization of the music genres visible in the above dendrogram. But how many clusters are appropriate for these data?

There are many different ways of choosing the number of clusters when performing cluster analysis. With hierarchical clustering, we can simply examine the dendrogram and decide where to cut the tree to determine the number of clusters for our solution. One of the key intuitions behind the dendrogram is that observations that fuse together higher up the tree (i.e., toward the top of the plot) are more different from one another, while observations that fuse together further down (i.e., toward the bottom of the plot) are more similar to one another. Therefore, the higher up we split the tree, the more different the music genres will be among the clusters.

If we split the tree near the top (e.g., around a height of 2500), we end up with three clusters. Let’s make a dendrogram that represents each of these three clusters with a different color. The colored dendrogram is useful in the interpretation of the cluster solution.
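If you want the cluster assignments themselves rather than only the picture, base R's cutree() returns them directly. A minimal sketch, assuming the hc object fitted above:

# cluster membership for each genre with the tree cut into k = 3 groups
cutree(hc, k = 3)

# or cut at an explicit height instead of a fixed number of clusters
cutree(hc, h = 2500)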

There are many different ways to produce a colored dendrogram in R; a lot of them are pretty hacky and require a fair amount of code. Below, I use the wonderful dendextend package to produce the colored dendrogram in a straightforward and simple way (with a little help from this StackOverflow answer):

# make a colored dendrogram with the dendextend package
library(dendextend)
# hc is the result of our hclust call above
dend <- hc
# specify we want to color the branches and labels
dend <- color_branches(dend, k = 3)
dend <- color_labels(dend, k = 3)
# plot the colored dendrogram. add a title and y-axis label
plot(dend, main = 'Cluster Dendrogram', ylab = 'Height')

Which yields the following plot:

Interpretation of the Clusters

The Clusters

The three clusters identified by our chosen cut point are as follows. First, on the right-hand side of the plot, we see electronic and experimental music grouped together in a single cluster (colored in blue). Also on the right-hand side of the plot, on the same branch but on a different leaf and colored in green, we find rock music, which constitutes a cluster in and of itself. These two clusters (electronic + experimental and rock) are distinct, yet they share some similarities in terms of their co-occurrences with other music genres, as indicated by their fusion at the top right-hand side of the dendrogram.

On the left-hand side of the dendrogram, we see the third cluster (colored in pink) which encompasses metal, pop/r&b, folk/country, global, jazz and rap music. Our chosen cut point lumps all of these genres into a single cluster, but examination of the sub-divisions of this cluster reveals a more nuanced picture. Metal is most different from the other genres in this cluster. A further sub-grouping separates pop/r&b from folk/country, global, jazz and rap. Folk/country and global are fused at the lowest point in the dendrogram, followed by a similar grouping of jazz and rap. These two pairings occur very close to the bottom of the dendrogram, indicating strong similarity in genre co-occurrence between folk/country and global on the one hand, and between jazz and rap on the other hand.

Substantive Interpretation: What have we learned about music genres?

Rock (Green Cluster)
Rock music is the only music genre to get its own cluster, suggesting that it is distinctive in its patterns of co-occurrence with the other genres.

We must keep in mind the following statistical consideration, however. Rock music was the most frequent genre in the dataset. One of the reasons rock's Euclidean distances are so large is that, because this genre occurs so frequently in the data, the distances calculated between rock and the less frequently occurring genres are naturally larger.
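One rough way to probe this concern (my addition, not part of the original analysis) is to rescale each genre's co-occurrence counts to row proportions before computing distances, so that very frequent genres do not dominate purely through their raw counts, and then check whether the clustering changes:

# divide each row of the co-occurrence matrix by its row sum
out_prop <- sweep(out, 1, rowSums(out), "/")
# re-run the same clustering on the rescaled matrix
hc_prop <- hclust(dist(out_prop), method = "ward.D")
plot(hc_prop, hang = -1, xlab = "", sub = "")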

Electronic & Experimental (Blue Cluster)
The blue cluster in the dendrogram above groups electronic and experimental music together. One reason for this might be that these genres are both relatively more modern (compared to, say, jazz or folk), and therefore share some sonic similarities (for example, the use of electronic or synthetically generated sounds) which lead them to be used in a similar way in conjunction with the other music genres.

Metal, Pop/R&B, Folk/Country, Global, Jazz and Rap (Pink Cluster)
The remaining genres fall into a single cluster. Within this cluster, it seems natural that metal is most different from the other genres, and that pop/r&b separates itself from folk/country, global, jazz and rap.

Pop and R&B differ, in my view, from folk/country, global, jazz and rap music in that the former feature slick production styles (e.g. a heavily layered sound with many different tracks making up a song), electronic instrumentation (e.g. keyboard synthesizers and drum machines) and a contemporary musical aesthetic (e.g. auto-tuning to produce noticeably distorted vocals), whereas the latter feature sparser arrangements and fewer electronically produced sounds.

Folk/country and global music, meanwhile, share a similar musical palette in terms of their more traditional instrumentation. While there are exceptions, I feel that both folk/country and global music use more acoustic or “natural” instruments (guitars, violins, wind instruments, drums, etc.) and less of the obviously electronically-produced sounds mentioned above.

Finally, jazz and hip-hop, although quite different in many musical aspects, share a similar aesthetic in some ways. Hip-hop, for example, has long made use of samples from older records (including many jazz tunes) to create beats (although recent hip-hop sub-genres such as trap music make use of drum machines and synthesizers for a more mechanical sound). One testament to the complementarity between jazz and rap is the number of recent notable collaborations between artists in these two genres: Kendrick Lamar’s excellent 2015 album To Pimp a Butterfly contains a number of tracks featuring jazz musicians and jazz instrumentation, the jazz pianist Robert Glasper has made several collaborative albums featuring rap artists, and the jazzy group BadBadNotGood made an excellent album with the always brilliant Ghostface Killah.

Conclusion

In this post, we clustered music genres from albums reviewed by Pitchfork. Our original dataset contained dummy-coded indicators for 9 different music genres across 18,389 different albums. We used matrix multiplication to create a co-occurrence matrix for the music genres, which we then turned into a distance matrix for hierarchical clustering. We cut the resulting dendrogram high up on the tree, obtaining three separate clusters of music genres: 1) rock 2) electronic and experimental and 3) metal, pop/r&b, folk/country, global, jazz and rap. This clustering solution seems to align with the production styles, sonic and musical qualities, and past and current cross-pollination between the various music genres.

Coming Up Next

In the next post, I’ll use R to produce a unique visualization of a very famous dataset. Stay tuned!

—-

* If you’re interested, you can play around with other distance metrics and clustering algorithms at home- just import the above co-occurrence matrix into R, and adapt the code however you like!


To leave a comment for the author, please follow the link and comment on their blog: Method Matters.

Building a simple Sales Revenue Dashboard with R Shiny & ShinyDashboard

Thu, 12/07/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

One of the beautiful gifts that R offers (and that Python lacks) is the Shiny package. Shiny is an R package that makes it easy to build interactive web apps straight from R. Building dashboards is a common need wherever data is available, since dashboards help a business draw insights out of its existing data.

In this post, we will see how to leverage Shiny to build a simple Sales Revenue Dashboard.

Loading Packages

All the packages listed below can be directly installed from CRAN.

# load the required packages
library(shiny)
require(shinydashboard)
library(ggplot2)
library(dplyr)

Sample Input File:
Since a dashboard needs input data to visualise, we will use the sample file recommendation.csv as the input to our dashboard; this can be swapped for whatever an organisation needs, such as a database connection or data pulled from a remote location.

recommendation <- read.csv('recommendation.csv', stringsAsFactors = F, header = T)
head(recommendation)
       Account Product Region Revenue
1    Axis Bank     FBB  North    2000
2         HSBC     FBB  South   30000
3          SBI     FBB   East    1000
4        ICICI     FBB   West    1000
5 Bandhan Bank     FBB   West     200
6    Axis Bank    SIMO  North     200

Every shiny application has two main sections: 1. ui and 2. server. ui holds the front-end code (buttons, plot outputs, tabs and so on), while server holds the back-end code (data retrieval, manipulation and wrangling).


Image Courtesy: Slideplayer

Instead of using shiny alone, here we will couple it with shinydashboard. shinydashboard is an R package whose job is to make it easier (as the name suggests) to build dashboards with shiny. The ui part of a shiny app built with shinydashboard has 3 basic elements wrapped in dashboardPage().

    1. dashboardHeader(),
    2. dashboardSidebar(),
    3. dashboardBody()

Simplest Shiny app with shinydashboard:

## app.R ##
library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  dashboardHeader(),
  dashboardSidebar(),
  dashboardBody()
)

server <- function(input, output) { }

shinyApp(ui, server)

Gives this app:

Image Courtesy: rstudio

Aligning to our larger goal of making a Sales Revenue Dashboard, Let us look at the code of dashboardHeader() and dashboardSidebar().

# Dashboard header carrying the title of the dashboard
header <- dashboardHeader(title = "Basic Dashboard")

# Sidebar content of the dashboard
sidebar <- dashboardSidebar(
  sidebarMenu(
    menuItem("Dashboard", tabName = "dashboard", icon = icon("dashboard")),
    menuItem("Visit-us", icon = icon("send", lib = 'glyphicon'),
             href = "https://www.salesforce.com")
  )
)

Before assembling dashboardPage(), we must decide which UI elements we would like to show in the dashboard. Since it is a Sales Revenue Dashboard, let us show 3 KPI boxes on top as a quick summary, followed by 2 plots for a more detailed view.

To align these elements one by one, we will define these elements inside fluidRow().

frow1 <- fluidRow(
  valueBoxOutput("value1")
  ,valueBoxOutput("value2")
  ,valueBoxOutput("value3")
)

frow2 <- fluidRow(
  box(
    title = "Revenue per Account"
    ,status = "primary"
    ,solidHeader = TRUE
    ,collapsible = TRUE
    ,plotOutput("revenuebyPrd", height = "300px")
  )
  ,box(
    title = "Revenue per Product"
    ,status = "primary"
    ,solidHeader = TRUE
    ,collapsible = TRUE
    ,plotOutput("revenuebyRegion", height = "300px")
  )
)

# combine the two fluid rows to make the body
body <- dashboardBody(frow1, frow2)

As the code above shows, valueBoxOutput() is used to display a KPI, but what is displayed inside it will be defined in the server part; the same applies to plotOutput(), which is used in the ui part to display a plot. box() is a function provided by shinydashboard that encloses a plot inside a box with features such as title, solidHeader and collapsible. Having defined the two fluidRow() calls individually for the sake of modularity, we combine both of them in dashboardBody().

Thus we can complete the ui part comprising Header, Sidebar and Page with the below code:

# completing the ui part with dashboardPage
ui <- dashboardPage(title = 'This is my Page title', header, sidebar, body, skin = 'red')

Note that the value of title in dashboardPage() will serve as the title of the browser page/tab, while the title defined in the dashboardHeader() will be visible as the dashboard title.

With the ui part done, we will create the server part, where the logic behind valueBoxOutput() and plotOutput() is added with renderValueBox() and renderPlot() respectively, enclosed inside the server function with input and output as its parameters. input contains anything received from the ui (such as a text box or slider value) and output contains anything sent back to the ui (such as plotOutput or valueBoxOutput content).

Below is the complete server code:

# create the server functions for the dashboard
server <- function(input, output) {

  # some data manipulation to derive the values of KPI boxes
  total.revenue <- sum(recommendation$Revenue)
  sales.account <- recommendation %>%
    group_by(Account) %>%
    summarise(value = sum(Revenue)) %>%
    filter(value == max(value))
  prof.prod <- recommendation %>%
    group_by(Product) %>%
    summarise(value = sum(Revenue)) %>%
    filter(value == max(value))

  # creating the valueBoxOutput content
  output$value1 <- renderValueBox({
    valueBox(
      formatC(sales.account$value, format = "d", big.mark = ',')
      ,paste('Top Account:', sales.account$Account)
      ,icon = icon("stats", lib = 'glyphicon')
      ,color = "purple")
  })
  output$value2 <- renderValueBox({
    valueBox(
      formatC(total.revenue, format = "d", big.mark = ',')
      ,'Total Expected Revenue'
      ,icon = icon("gbp", lib = 'glyphicon')
      ,color = "green")
  })
  output$value3 <- renderValueBox({
    valueBox(
      formatC(prof.prod$value, format = "d", big.mark = ',')
      ,paste('Top Product:', prof.prod$Product)
      ,icon = icon("menu-hamburger", lib = 'glyphicon')
      ,color = "yellow")
  })

  # creating the plotOutput content
  output$revenuebyPrd <- renderPlot({
    ggplot(data = recommendation,
           aes(x = Product, y = Revenue, fill = factor(Region))) +
      geom_bar(position = "dodge", stat = "identity") +
      ylab("Revenue (in Euros)") +
      xlab("Product") +
      theme(legend.position = "bottom"
            ,plot.title = element_text(size = 15, face = "bold")) +
      ggtitle("Revenue by Product") +
      labs(fill = "Region")
  })
  output$revenuebyRegion <- renderPlot({
    ggplot(data = recommendation,
           aes(x = Account, y = Revenue, fill = factor(Region))) +
      geom_bar(position = "dodge", stat = "identity") +
      ylab("Revenue (in Euros)") +
      xlab("Account") +
      theme(legend.position = "bottom"
            ,plot.title = element_text(size = 15, face = "bold")) +
      ggtitle("Revenue by Region") +
      labs(fill = "Region")
  })
}

So far, we have defined both essential parts of a Shiny app: ui and server. Finally, we call/run shinyApp() with ui and server as its parameters.

# run/call the shiny app
shinyApp(ui, server)

Listening on http://127.0.0.1:5101

The entire file has to be saved as app.R inside a folder before running the shiny app. Also remember to put the input data file (in our case, recommendation.csv) inside the same folder where app.R is saved. There is another valid way to structure a shiny app, with two files ui.R and server.R (optionally plus global.R); it is not covered in detail in this article for the sake of brevity, since this is aimed at beginners, but a minimal sketch of that layout is shown below.
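A minimal, illustrative sketch of the two-file layout (the contents here are a skeleton, not a full app):

# ui.R -- the last expression in this file is the UI object
library(shinydashboard)
dashboardPage(
  dashboardHeader(title = "Basic Dashboard"),
  dashboardSidebar(),
  dashboardBody()
)

# server.R -- the last expression in this file is the server function
function(input, output) {
  # server logic goes here
}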

Upon running the file, the shiny web app would open in your default browser and would look similar to the below screenshot:

Hopefully at this stage you have your shiny web app, in the form of a Sales Revenue Dashboard, up and running. The code and plots used here are available on my Github.

References

    Related Post

    1. Gender Diversity Analysis of Data Science Industry using Kaggle Survey Dataset in R
    2. How Happy is Your Country? — Happy Planet Index Visualized
    3. Exploring, Clustering, and Mapping Toronto’s Crimes
    4. Spring Budget 2017: Circle Visualisation
    5. Qualitative Research in R

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

    RcppArmadillo 0.8.300.1.0

    Thu, 12/07/2017 - 01:59

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    Another RcppArmadillo release hit CRAN today. Since our last 0.8.100.1.0 release in October, Conrad kept busy and produced Armadillo releases 8.200.0, 8.200.1, 8.300.0 and now 8.300.1. We tend to now package these (with proper reverse-dependency checks and all) first for the RcppCore drat repo from where you can install them "as usual" (see the repo page for details). But this actual release resumes within our normal bi-monthly CRAN release cycle.

    These releases improve a few little nags on the recent switch to more extensive use of OpenMP, and round out a number of other corners. See below for a brief summary.

    Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 405 other packages on CRAN.
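    To make the integration concrete, here is a small illustrative sketch (mine, not from the release notes) of calling an Armadillo routine from R; it assumes Rcpp and RcppArmadillo are installed and a C++ compiler is available:

library(Rcpp)

# compile a tiny function that computes t(X) %*% X with Armadillo
cppFunction(depends = "RcppArmadillo", code = '
  arma::mat arma_crossprod(const arma::mat& X) {
    return X.t() * X;
  }
')

m <- matrix(rnorm(20), nrow = 5)
all.equal(arma_crossprod(m), crossprod(m))  # should agree with base R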

    A high-level summary of changes follows.

    Changes in RcppArmadillo version 0.8.300.1.0 (2017-12-04)
    • Upgraded to Armadillo release 8.300.1 (Tropical Shenanigans)

      • faster handling of band matrices by solve()

      • faster handling of band matrices by chol()

      • faster randg() when using OpenMP

      • added normpdf()

      • expanded .save() to allow appending new datasets to existing HDF5 files

    • Includes changes made in several earlier GitHub-only releases (versions 0.8.300.0.0, 0.8.200.2.0 and 0.8.200.1.0).

    • Conversion from simple_triplet_matrix is now supported (Serguei Sokol in #192).

    • Updated configure code to check for g++ 5.4 or later to enable OpenMP.

    • Updated the skeleton package to current packaging standards

    • Suppress warnings from Armadillo about missing OpenMP support and -fopenmp flags by setting ARMA_DONT_PRINT_OPENMP_WARNING

    Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

    jmv – one R package (not just) for the social sciences

    Thu, 12/07/2017 - 00:00

    (This article was first published on jamovi, and kindly contributed to R-bloggers)

    tl;dr
    • many analyses in the social sciences require many R packages
    • jmv makes these common analyses available from one R package
    • jmv can be used from jamovi, a graphical statistical spreadsheet, making it super-accessible

    introducing jmv

    There are many tremendously useful R packages for the social sciences (and similar fields), such as car, afex, vcd, etc. Although running basic analyses (such as t-tests or ANOVA) with these packages is very straightforward, it is typically necessary to perform a number of supplementary analyses to accompany them: post-hoc tests, effect-size calculations, bias corrections, and assumption checks. These additional tests often require the use of many additional R packages, and can make reasonably standard analyses quite time-consuming to perform. For example, in the book Discovering Statistics Using R by Andy Field (a popular textbook in the social sciences), the chapter on ANOVA alone recommends the use of 7 packages.

    jmv simplifies this whole process by bringing all of these packages together and makes doing the following analyses with their most common supplementary tests, corrections and assumption checks as easy as possible:

    • Descriptive statistics
    • T-Tests
    • ANOVA
    • ANCOVA
    • Repeated Measures ANOVA
    • Non-parametric ANOVAs
    • Correlation
    • Linear Regression
    • Contingency Tables
    • Proportion Tests
    • Factor Analysis

    and coming soon:

    • Logistic Regression
    • Log-linear Regression

    jmv aims to make all common statistical tests taught at an undergraduate level available from a single package.

    An ANOVA

    Let’s begin with a simple, familiar analysis – an ANOVA. In this example we use the ToothGrowth dataset from R, and explore whether different food supplements and their dosage affect how much a guinea pig’s teeth grow. We’ll specify len to be the dependent variable, and supp and dose to be the factors.

library('jmv')
data('ToothGrowth')

jmv::anova(ToothGrowth,
           dep = 'len',
           factors = c('supp', 'dose'))

##
##  ANOVA
##
##  ANOVA
##  ───────────────────────────────────────────────────────────────────────
##                 Sum of Squares    df    Mean Square    F        p
##  ───────────────────────────────────────────────────────────────────────
##    supp                    205     1          205.4    15.57    < .001
##    dose                   2426     2         1213.2    92.00    < .001
##    supp:dose               108     2           54.2     4.11     0.022
##    Residuals               712    54           13.2
##  ───────────────────────────────────────────────────────────────────────

    This produces what should be a familiar ANOVA table. You have likely seen something like this in R before, though perhaps not as nicely formatted.

    Where jmv really comes into its own, is with additional options. In the following example we will perform the same analysis, but additionally requesting effect-size, post-hoc tests, homogeneity of variances tests, descriptive statistics, and a descriptives plot:

library('jmv')
data('ToothGrowth')

jmv::anova(ToothGrowth,
           dep = 'len',
           factors = c('supp', 'dose'),
           effectSize = 'eta',
           postHoc = c('supp', 'dose'),
           plotHAxis = 'dose',
           plotSepLines = 'supp',
           descStats = TRUE,
           homo = TRUE)

##
##  ANOVA
##
##  ANOVA
##  ────────────────────────────────────────────────────────────────────────────────
##                 Sum of Squares    df    Mean Square    F        p         η²
##  ────────────────────────────────────────────────────────────────────────────────
##    supp                    205     1          205.4    15.57    < .001    0.059
##    dose                   2426     2         1213.2    92.00    < .001    0.703
##    supp:dose               108     2           54.2     4.11     0.022    0.031
##    Residuals               712    54           13.2
##  ────────────────────────────────────────────────────────────────────────────────
##
##
##  ASSUMPTION CHECKS
##
##  Test for Homogeneity of Variances (Levene's)
##  ────────────────────────────────────────────
##    F       df1    df2    p
##  ────────────────────────────────────────────
##    1.94      5     54    0.103
##  ────────────────────────────────────────────
##
##
##  POST HOC TESTS
##
##  Post Hoc Comparisons - supp
##  ────────────────────────────────────────────────────────────────────────────
##    supp         supp    Mean Difference    SE       df      t       p-tukey
##  ────────────────────────────────────────────────────────────────────────────
##    OJ      -    VC                 3.70    0.938    54.0    3.95     < .001
##  ────────────────────────────────────────────────────────────────────────────
##
##
##  Post Hoc Comparisons - dose
##  ─────────────────────────────────────────────────────────────────────────────
##    dose         dose    Mean Difference    SE      df      t          p-tukey
##  ─────────────────────────────────────────────────────────────────────────────
##    0.5     -    1                 -9.13    1.15    54.0     -7.95     < .001
##            -    2                -15.50    1.15    54.0    -13.49     < .001
##    1       -    2                 -6.37    1.15    54.0     -5.54     < .001
##  ─────────────────────────────────────────────────────────────────────────────
##
##
##  Descriptives
##  ───────────────────────────────────────
##    supp    dose    N     Mean     SD
##  ───────────────────────────────────────
##    OJ      0.5     10    13.23    4.46
##    OJ      1       10    22.70    3.91
##    OJ      2       10    26.06    2.66
##    VC      0.5     10     7.98    2.75
##    VC      1       10    16.77    2.52
##    VC      2       10    26.14    4.80
##  ───────────────────────────────────────

    As can be seen, jmv can provide many additional tests and statistics relevant to the main tests, but with far less effort.

    You can explore additional options for the jmv ANOVA here, and the other tests and their available options here.

    jamovi integration

    jmv is also useable from the jamovi statistical spreadsheet. jamovi makes a range of analyses accessible to a broader audience by making them available from a familiar, spreadsheet user-interface. jamovi can also make the underlying R code for each analysis available, making it easy for people to learn R, and transition to R scripting if they like.

    Here is exactly the same analysis as above, having been performed in jamovi.

    summing up
    • jmv makes a whole suite of common analyses from the social sciences very easy to perform
    • jamovi makes these even easier to perform

    jmv is available from CRAN

    jamovi is available from www.jamovi.org


    To leave a comment for the author, please follow the link and comment on their blog: jamovi.

    The British Ecological Society’s Guide to Reproducible Science

    Wed, 12/06/2017 - 23:26

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    The British Ecological Society has published a new volume in their Guides to Better Science series: A Guide to Reproducible Code in Ecology and Evolution (pdf). The introduction describes its scope:

    A Guide to Reproducible Code covers all the basic tools and information you will need to start making your code more reproducible. We focus on R and Python, but many of the tips apply to any programming language. Anna Krystalli introduces some ways to organise files on your computer and to document your workflows. Laura Graham writes about how to make your code more reproducible and readable. François Michonneau explains how to write reproducible reports. Tamora James breaks down the basics of version control. Finally, Mike Croucher describes how to archive your code. We have also included a selection of helpful tips from other scientists.

    The guide proposes a simple reproducible project workflow, and a guide to organizing projects for reproducibility. The Programming section provides concrete tips and traps to avoid (example: use relative, not absolute pathnames), and the Reproducible Reports section provides a step-by-step guide for generating reports with R Markdown.
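    As a small illustration of the relative-path tip (the file names below are hypothetical):

# absolute path: breaks as soon as the project moves to another machine
# dat <- read.csv("C:/Users/me/projects/beetles/data/counts.csv")

# relative path from the project root: survives copying the whole project
# dat <- read.csv("data/counts.csv")

# the here package builds paths relative to the project root, which also
# works from scripts living in sub-folders of the project
# dat <- read.csv(here::here("data", "counts.csv"))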

    While written for an ecology audience (and also including some gorgeous photography of animals), this guide would be useful for anyone in the sciences looking to implement a reproducible workflow. You can download the guide at the link below.

    British Ecological Society: A Guide to Reproducible Code in Ecology and Evolution (via Laura Graham)


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    14 Jobs for R users from around the world (2017-12-06)

    Wed, 12/06/2017 - 17:07
    To post your R job on the next post

    Just visit this link and post a new R job to the R community.

    You can post a job for free (and there are also “featured job” options available for extra exposure).

    Current R jobs

    Job seekers:  please follow the links below to learn more and apply for your R job of interest:

    Featured Jobs
    More New R Jobs
    1. Full-Time
      Quantitative Market Research Analyst Gradient Metrics – Posted by kyblock
      Anywhere
      6 Dec 2017
    2. Full-Time
      Data Scientist in the Institute for Economics and Peace Institute for Economics and Peace – Posted by Institute for Economics and Peace
      Sydney New South Wales, Australia
      5 Dec 2017
    3. Full-Time
      Data Innovation Specialist /Technology Analyst- 20509 jessicaxtan
      Washington District of Columbia, United States
      30 Nov 2017
    4. Full-Time
      Data Scientist: Mobility Services Parkbob – Posted by ivan.kasanicky
      Wien Wien, Austria
      29 Nov 2017
    5. Freelance
      Statistician/Econometrician – R Programmer for Academic Statistical Research Academic Research – Posted by empiricus
      Anywhere
      29 Nov 2017
    6. Full-Time
      Senior Data Analyst @ Bangkok, Thailand Agoda – Posted by atrapassi
      Bangkok Krung Thep Maha Nakhon, Thailand
      28 Nov 2017
    7. Full-Time
      R Shiny Dashboard Engineer in Health Tech Castor EDC – Posted by Castor EDC
      Amsterdam-Zuidoost Noord-Holland, Netherlands
      24 Nov 2017
    8. Full-Time
      Data Analyst: Travel Behavior RSG – Posted by patricia.holland@rsginc.com
      San Diego California, United States
      22 Nov 2017
    9. Full-Time
      R&D Database Developer @ Toronto, Canada Crescendo Technology Ltd – Posted by Crescendo
      Toronto Ontario, Canada
      17 Nov 2017
    10. Full-Time
      Customer Success Representative RStudio, Inc. – Posted by jclemens1
      Anywhere
      17 Nov 2017
    11. Full-Time
      Data Science Engineer Bonify – Posted by arianna@meritocracy
      Berlin Berlin, Germany
      17 Nov 2017
    12. Part-Time
      Development of User-Defined Calculations and Graphical Outputs for WHO’s Influenza Data World Health Organization – Posted by aspenh
      Anywhere
      6 Nov 2017
    13. Full-Time
      Data Scientist for H Labs @ Chicago, Illinois, United States Heidrick & Struggles – Posted by Heidrick1
      Chicago Illinois, United States
      2 Nov 2017
    14. Full-Time
      Business Data Analytics Faculty Maryville University – Posted by tlorden
      St. Louis Missouri, United States
      2 Nov 2017

     

    In R-users.com you can see all the R jobs that are currently available.

    R-users Resumes

    R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

    (You may also look at previous R jobs posts.)


    How to Avoid the dplyr Dependency Driven Result Corruption

    Wed, 12/06/2017 - 16:27

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    In our last article we pointed out a dangerous silent result corruption we have seen when using the R dplyr package with databases.

    To systematically avoid this result corruption we suggest breaking up your dplyr::mutate() statements so that they are dependency-free (not assigning the same value twice, and not using any value in the same mutate() in which it is formed). We consider these to be key and critical precautions to take when using dplyr with a database.
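    A hedged sketch of that precaution (the toy columns are invented; the danger described in the earlier article concerns database-backed tables, but the safe pattern is the same):

library(dplyr)
library(tibble)

tbl <- tibble(a = 1:3)   # toy stand-in for a remote table

# risky form: `b` is both created and used inside a single mutate()
# tbl %>% mutate(b = a + 1, c = b * 2)

# dependency-free form: each mutate() only refers to columns that
# already existed before that statement
tbl %>%
  mutate(b = a + 1) %>%
  mutate(c = b * 2)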

    We would also like to point out that we are distributing free tools to do this automatically, along with a worked example of this solution.


    To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

    Mapping lithium production using R

    Wed, 12/06/2017 - 12:15

    (This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)



    A great thing about data science – and data visualization in particular – is that you can use data science tools to get information about a subject very quickly.

    This is particularly true when you know how to present the information. When you know the best technique for visualizing data, you can use those tools to present information in a way that can help you increase your understanding very quickly.

    Here’s a case in point: I was thinking about energy, solar energy, and battery technology, and I got curious about sources of lithium (which is used in batteries and related tech).

    Using a few tools from R, I created a map of lithium production within about 15 minutes. The maps that I created certainly don’t tell the full story, but they at least provide a baseline of knowledge.

    If you’re fluent with the tools and techniques of data science, this becomes possible. Whether you’re just doing some personal research, or working for a multinational corporation, you can use the tools of data science to quickly identify and present insights about the world.

    Let me walk you through this example and show you how I did it.

    Tutorial: how to map lithium production data using R

    First, we’ll load the packages that we’re going to use.

    We’ll load tidyverse mostly for access to ggplot2 and dplyr, but we’ll also load rvest (for data scraping), stringr (to help us manipulate our variables and put the data into shape), and viridis (to help us modify the color of the final plot).

#--------------
# LOAD PACKAGES
#--------------
library(tidyverse)
library(rvest)
library(stringr)
library(viridis)

    Ok. Now, we’re going to scrape this mineral production data from the wikipedia page.

    Notice that we’re essentially using several rvest functions in series. We’re using several functions and combining them together using the “pipe” operator, %>%.

    If you’re not using it yet, you should definitely familiarize yourself with this operator, and start using it. It’s beyond the scope of this blog post to talk extensively about pipe operator, but I will say that it’s one of the most useful tools in the R data scientist’s toolkit. If you learn to use it properly, it will make your code easier to read and easier to write. It will even train your mind to think about analysis in a more step by step way.
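    As a tiny illustration of the point (not from the original post), here is the same computation written first as nested calls and then with the pipe:

library(magrittr)  # %>% also comes along with dplyr / the tidyverse

# nested form: read from the inside out
head(sort(round(runif(10), 2)), 3)

# piped form: read from left to right, one step at a time
# (the two lines draw different random numbers; the point is the structure)
runif(10) %>% round(2) %>% sort() %>% head(3)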

    Concerning the code: what we’re doing here is designating the URL from which we’re going to scrape the data, then we’re specifying that we’ll be scraping one of the tables. Then, we specify that we’re going to scrape data from the 9th table, and finally we coerce the data into a tibble (instead of keeping it as a traditional data.frame).

#---------------------------
# SCRAPE DATA FROM WIKIPEDIA
#---------------------------
df.lithium <- read_html("https://en.wikipedia.org/wiki/Lithium") %>%
  html_nodes("table") %>%
  .[[9]] %>%
  html_table() %>%
  as.tibble()

# INSPECT
df.lithium

    The resultant dataset, df.lithium, is relatively small (which makes it easy to inspect and easy to work with), but in its raw form, it’s a little messy. We’ll need to do a few things to clean up the data, like change the variable names, parse some data into numerics, etc.

    So first, let’s change the column names.

    There are a few ways we could do this, but the most straightforward is to simply pass a vector of manually-defined column names into the colnames() function.

#--------------------------------------------
# CHANGE COLUMN NAMES
# - the raw column names are capitalized and
#   have some extra information
# - we will just clean them up
#--------------------------------------------
colnames(df.lithium) <- c('country', 'production', 'reserves', 'resources')

colnames(df.lithium)

    Now, we’ll remove an extraneous row of data. The original data table on Wikipedia contained not only the individual records of lithium production for particular countries, but it also contained a “total” row at the bottom of the table. More often than not, these sorts of “total” rows are not appropriate for a data.frame in R; we’ll typically remove them, just as we will do here.

#-----------------------------------------------
# REMOVE "World total"
# - this is a total amount that was
#   in the original data table
# - we need to remove, because it's not a
#   proper data record for a particular country
#-----------------------------------------------
df.lithium <- df.lithium %>% filter(country != 'World total')

df.lithium

    Next, we need to parse the numbers into actual numeric data. The reason is that when we scraped the data, it actually read in the numbers as character data, along with commas and some extra characters. We need to transform this character data into proper numeric data in R.

    To do this, we’ll need to do a few things. First, we need to remove a few “notes” that were in the original data. This is why we’re using the code str_replace(production,”W\\[.*\\]”, “-“)). Essentially, we’re removing those notes from the data.

    After that, we’re using parse_number(production, na = ‘-‘) to transform three variables – production, reserves, resources – into numerics.

    Note once again how we’re structuring this code. We’re using a combination of functions from dplyr and stringr to achieve our objectives.

    To a beginner, this might look complicated, but it’s really not that bad once you understand the individual pieces. If you don’t understand this code (or couldn’t write it yourself), I recommend that you learn the individual functions from dplyr and stringr first, and then come back to this once you’ve learned those pieces.

#---------------------------------------------------------
# PARSE NUMBERS
# - the original numeric quantities in the table
#   were read-in as character data
# - we need to "parse" this information ....
#   & transform it from character into proper numeric data
#---------------------------------------------------------

# Strip out the 'notes' from the numeric data
#str_replace(df.lithium$production,"W\\[.*\\]", "")  #test

df.lithium <- df.lithium %>%
  mutate(production = str_replace(production,"W\\[.*\\]", "-"))

# inspect
df.lithium

# Parse character data into numbers
df.lithium <- df.lithium %>%
  mutate(production = parse_number(production, na = '-')
         ,reserves = parse_number(reserves, na = '-')
         ,resources = parse_number(resources, na = '-')
         )

# Inspect
df.lithium

    Now we’ll get data for a map of the world. To do this, we’ll just use map_data().

#--------------
# GET WORLD MAP
#--------------
map.world <- map_data('world')

    We’ll also get the names of the countries in this dataset.

    The reason is because we’ll need to join this map data to the data from Wikipedia, and we’ll need the country names to be exactly the same. To make this work, we’ll need to examine the names in both datasets and modify any names that aren’t exactly the same.

    Notice that once again, we’re using a combination of functions from dplyr, wiring them together using the pipe operator.

#----------------------------------------------------
# Get country names
# - we can use this list and cross-reference
#   with the country names in the scraped data
# - when we find names that are not the same between
#   this map data and the scraped data, we can recode
#   the values
#----------------------------------------------------
map_data('world') %>%
  group_by(region) %>%
  summarise() %>%
  print(n = Inf)

    Ok. Now we’re going to recode some country names. Again, we’re doing this so that the country names in df.lithium match the corresponding country names in map.world.

#--------------------------------------------
# RECODE COUNTRY NAMES
# - some of the country names do not match
#   the names we will use later in our map
# - we will re-code so that the data matches
#   the names in the world map
#--------------------------------------------
df.lithium <- df.lithium %>%
  mutate(country = if_else(country == "Canada (2010)", 'Canada'
                   ,if_else(country == "People's Republic of China", "China"
                   ,if_else(country == "United States", "USA"
                   ,if_else(country == "DR Congo", "Democratic Republic of the Congo", country))))
         )

# Inspect
df.lithium

    Ok, now we’ll join the data using dplyr::left_join().

#-----------------------------------------
# JOIN DATA
# - join the map data and the scraped-data
#-----------------------------------------
df <- left_join(map.world, df.lithium, by = c('region' = 'country'))

    Now we’ll plot.

    We’ll start with just a basic plot (to make sure that the map plots correctly), and then we’ll proceed to plot separate maps where the fill color corresponds to reserves, production, and resources.

#-----------
# PLOT DATA
#-----------

# BASIC MAP
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon()

# LITHIUM RESERVES
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = reserves))

ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = reserves)) +
  scale_fill_viridis(option = 'plasma')

# LITHIUM PRODUCTION
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = production)) +
  scale_fill_viridis(option = 'plasma')

# LITHIUM RESOURCES
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = resources)) +
  scale_fill_viridis(option = 'plasma')







    In the final three versions, notice as well that we’re modifying the color scales by using scale_fill_viridis().

    There’s actually quite a bit more formatting that we could do on these, but as a first pass, these are pretty good.

    I’ll leave it as an exercise for you to format these with titles, background colors, etc. If you choose to do this, leave your finalized code in the comments section below.
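    If you want a starting point, here is one possible finishing pass (my own formatting choices, assuming the df object built above; the legend label is left generic because the units are not stated in the scraped table):

ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = production)) +
  scale_fill_viridis(option = 'plasma', name = 'Production') +
  labs(title = 'Lithium production by country',
       caption = 'Source: Wikipedia') +
  theme_void() +                                     # drop axes for a cleaner map
  theme(plot.title = element_text(face = 'bold'))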

    To master data science, you need a plan

    At several points in this tutorial, I’ve mentioned a high level plan for mastering data science: master individual pieces of a programming language, and then learn to put them together into more complicated structures.

    If you can do this, you will accelerate your progress … although, the devil is in the details.

    That’s actually not the only learning hack that you can use to rapidly master data science. There are lots of other tricks and learning hacks that you can use to dramatically accelerate your progress.

    Want to know them?

    Sign up for our email list.

    Here at Sharp Sight, we teach data science. But we also teach you how to learn and how to study data science, so you master the tools as quickly as possible.

    By signing up for our email list, you’ll get weekly tutorials about data science, delivered directly to your inbox.

    You’ll also get our Data Science Crash Course, for free.

    SIGN UP NOW

    The post Mapping lithium production using R appeared first on SHARP SIGHT LABS.


    To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS.

    Package GetDFPData

    Wed, 12/06/2017 - 09:00

    (This article was first published on R and Finance, and kindly contributed to R-bloggers)

    Downloading Annual Financial Reports and Corporate Events from B3 (formerly Bovespa) –

    Financial statements of companies traded at B3 (formerly Bovespa), the
    Brazilian stock exchange, are available in its
    website. Accessing the data for a
    single company is straightforward. In the website one can find a simple
    interface for accessing this dataset. An example is given
    here.
However, gathering and organizing the data for large scale research,
with many companies and many dates, is painful. Financial reports must
be downloaded or copied individually and later aggregated. Changes in
the accounting format throughout time can make this process slow,
unreliable and irreproducible.

Package GetDFPData provides an R interface to all annual financial
statements available in the website and more. It not only downloads the
data but also organizes it in a tabular format and allows the use of
inflation indexes. Users can select companies and a time period to
download all available data. Several pieces of information about current
companies, such as sector and available quarters, are also within reach. The
main purpose of the package is to make it easy to access financial
statements in large scale research, facilitating the reproducibility of
corporate finance studies with B3 data.

The positive aspects of GetDFPData are:

    • Easy and simple R and web interface
    • Changes in accounting format are internally handled by the software
    • Access to corporate events in the FRE system such as dividend
      payments, changes in stock holder composition, changes in governance
      listings, board composition and compensation, debt composition, and
      a lot more!
    • The output data is automatically organized using tidy data
      principles (long format)
    • A cache system is employed for fast data acquisition
    • Completely free and open source!
    Installation

The package is available on Github (development version) and will soon be
available on CRAN (release version). You can install either version with the
following code:

# Release version in CRAN
install.packages('GetDFPData') # not in CRAN yet

# Development version in Github
devtools::install_github('msperlin/GetDFPData')

Shiny interface

    The web interface of GetDFPData is available at
    http://www.msperlin.com/shiny/GetDFPData/.

    How to use GetDFPData

    The starting point of GetDFPData is to find the official names of
    companies in B3. Function gdfpd.search.company serves this purpose.
Given a string (text), it will search for partial matches in company
names. As an example, let’s find the official name of Petrobras, one
    of the largest companies in Brazil:

library(GetDFPData)
library(tibble)

gdfpd.search.company('petrobras', cache.folder = tempdir())
##
## Reading info file from github
## Found 43873 lines for 687 companies  [Actives = 521  Inactives = 167 ]
## Last file update: 2017-10-19
## Caching RDATA into tempdir()
##
## Found 1 companies:
## PETRÓLEO BRASILEIRO S.A. - PETROBRAS | situation = ATIVO | first date = 1998-12-31 | last date - 2016-12-31
## [1] "PETRÓLEO BRASILEIRO S.A. - PETROBRAS"

    Its official name in Bovespa records is
    PETRÓLEO BRASILEIRO S.A. - PETROBRAS. Data for quarterly and annual
    statements are available from 1998 to 2017. The situation of the
company, active or canceled, is also given. This helps verify the
availability of data.

The content of all available financial statements can be accessed with
function gdfpd.get.info.companies. It will read and parse a .csv file
from my github repository, which will be periodically updated with new
information. Let’s try it out:

df.info <- gdfpd.get.info.companies(type.data = 'companies', cache.folder = tempdir())
##
## Reading info file from github
## Found 43873 lines for 687 companies  [Actives = 521  Inactives = 167 ]
## Last file update: 2017-10-19
## Caching RDATA into tempdir()

glimpse(df.info)
## Observations: 689
## Variables: 8
## $ name.company    "521 PARTICIPAÇOES S.A. - EM LIQUIDAÇÃO EXTRAJ...
## $ id.company      16330, 16284, 108, 20940, 21725, 19313, 18970,...
## $ situation       "ATIVO", "ATIVO", "CANCELADA", "CANCELADA", "A...
## $ listing.segment NA, "None", "None", "None", "None", "None", "C...
## $ main.sector     NA, "Financeiro e Outros", "Materiais Básicos"...
## $ tickers         NA, "QVQP3B", NA, NA, NA, "AELP3", "TIET11;TIE...
## $ first.date      1998-12-31, 2001-12-31, 2009-12-31, 2009-12-3...
## $ last.date       2016-12-31, 2016-12-31, 2009-12-31, 2009-12-3...

This file includes several pieces of information gathered from Bovespa:
names of companies, official numeric ids, listing segment, sectors,
traded tickers and, most importantly, the available dates. The resulting
dataframe can be used to filter and gather information for large scale
research, such as downloading financial data for a specific sector.
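For example, here is a minimal sketch of that idea (my own illustration, assuming dplyr is loaded; the sector label comes from the glimpse() output above and the dates are arbitrary):

# Download reports for all active companies in one sector (illustrative sketch)
library(dplyr)

my.companies <- df.info %>%
  filter(main.sector == 'Financeiro e Outros', situation == 'ATIVO') %>%
  pull(name.company)

df.sector.reports <- gdfpd.GetDFPData(name.companies = my.companies,
                                      first.date = '2010-01-01',
                                      last.date  = '2016-12-31',
                                      cache.folder = tempdir())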

    Downloading financial information for ONE company

    All you need to download financial data with GetDFPData are the
    official names of companies, which can be found with
    gdfpd.search.company, the desired starting and ending dates and the
    type of financial information (individual or consolidated). Let’s try it
    for PETROBRAS:

name.companies <- 'PETRÓLEO BRASILEIRO S.A. - PETROBRAS'
first.date <- '2004-01-01'
last.date  <- '2006-01-01'

df.reports <- gdfpd.GetDFPData(name.companies = name.companies,
                               first.date = first.date,
                               last.date = last.date,
                               cache.folder = tempdir())
## Found cache file. Loading data..
##
## Downloading data for 1 companies
## First Date: 2004-01-01
## Laste Date: 2006-01-01
## Inflation index: dollar
##
## Downloading inflation data
## Caching inflation RDATA into tempdir() Done
##
##
## WARNING: Cash flow statements are not available before 2009
##
## Inputs looking good! Starting download of files:
##
## PETRÓLEO BRASILEIRO S.A. - PETROBRAS
##  Available periods: 2005-12-31  2004-12-31
##
##
## Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS
##  Finding info from Bovespa | downloading and reading data | saving cache
##  Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS | date 2005-12-31
##   Acessing DFP data | downloading file | reading file | saving cache
##   Acessing FRE data | No FRE file available..
##   Acessing FCA data | No FCA file available..
##  Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS | date 2004-12-31
##   Acessing DFP data | downloading file | reading file | saving cache
##   Acessing FRE data | No FRE file available..
##   Acessing FCA data | No FCA file available..

The resulting object is a tibble, a data.frame type of object that
allows for list columns. Let’s have a look at its content:

glimpse(df.reports)
## Observations: 1
## Variables: 33
## $ company.name                  "PETRÓLEO BRASILEIRO S.A. - PE...
## $ company.code                  9512
## $ company.tickers               "PETR3;PETR4"
## $ min.date                      2004-12-31
## $ max.date                      2005-12-31
## $ n.periods                     2
## $ company.segment               "Tradicional"
## $ current.stockholders          [...]
## $ history.capital.issues        [NULL]
## $ history.mkt.value             [NULL]
## $ history.capital.increases     [NULL]
## $ history.capital.reductions    [NULL]
## $ history.stock.repurchases     [NULL]
## $ history.other.stock.events    [NULL]
## $ history.compensation          [NULL]
## $ history.compensation.summary  [NULL]
## $ history.transactions.related  [NULL]
## $ history.debt.composition      [NULL]
## $ history.governance.listings   [NULL]
## $ history.board.composition     [NULL]
## $ history.committee.composition [NULL]
## $ history.family.relations      [NULL]

Object df.reports has only one row since we only asked for data of one
company. The number of rows increases with the number of companies, as
we will soon learn with the next example. All financial statements for
the different years are available within df.reports. For example, the
income statements for all desired years of PETROBRAS are:

df.income.long <- df.reports$fr.income[[1]]

glimpse(df.income.long)
## Observations: 48
## Variables: 6
## $ name.company       "PETRÓLEO BRASILEIRO S.A. - PETROBRAS", "...
## $ ref.date           2005-12-31, 2005-12-31, 2005-12-31, 2005-1...
## $ acc.number         "3.01", "3.02", "3.03", "3.04", "3.05", "3....
## $ acc.desc           "Receita Bruta de Vendas e/ou Serviços", "D...
## $ acc.value          143665730, -37843204, 105822526, -57512113,...
## $ acc.value.infl.adj 61398234.97, -16173000.56, 45225234.41, -24...

    The resulting dataframe is in the long format, ready for processing. In
    the long format, financial statements of different years are stacked. In
    the wide format, we have the year as columns of the table.

    If you want the wide format, which is the most common way that financial
    reports are presented, you can use function gdfpd.convert.to.wide. See
    an example next:

df.income.wide <- gdfpd.convert.to.wide(df.income.long)

knitr::kable(df.income.wide)

acc.number   acc.desc                                    name.company                            2004-12-31   2005-12-31
-----------  ------------------------------------------  --------------------------------------  -----------  -----------
3.01         Receita Bruta de Vendas e/ou Serviços       PETRÓLEO BRASILEIRO S.A. – PETROBRAS      120024727    143665730
3.02         Deduções da Receita Bruta                   PETRÓLEO BRASILEIRO S.A. – PETROBRAS      -34450292    -37843204
3.03         Receita Líquida de Vendas e/ou Serviços     PETRÓLEO BRASILEIRO S.A. – PETROBRAS       85574435    105822526
3.04         Custo de Bens e/ou Serviços Vendidos        PETRÓLEO BRASILEIRO S.A. – PETROBRAS      -48607576    -57512113
3.05         Resultado Bruto                             PETRÓLEO BRASILEIRO S.A. – PETROBRAS       36966859     48310413
3.06         Despesas/Receitas Operacionais              PETRÓLEO BRASILEIRO S.A. – PETROBRAS      -11110540    -14810467
3.06.01      Com Vendas                                  PETRÓLEO BRASILEIRO S.A. – PETROBRAS       -2858630     -4195157
3.06.02      Gerais e Administrativas                    PETRÓLEO BRASILEIRO S.A. – PETROBRAS       -2599552     -3453753
3.06.03      Financeiras                                 PETRÓLEO BRASILEIRO S.A. – PETROBRAS       -1019901       126439
3.06.04      Outras Receitas Operacionais                PETRÓLEO BRASILEIRO S.A. – PETROBRAS              0            0
3.06.05      Outras Despesas Operacionais                PETRÓLEO BRASILEIRO S.A. – PETROBRAS       -5982336     -9070019
3.06.06      Resultado da Equivalência Patrimonial       PETRÓLEO BRASILEIRO S.A. – PETROBRAS        1349879      1782023
3.07         Resultado Operacional                       PETRÓLEO BRASILEIRO S.A. – PETROBRAS       25856319     33499946
3.08         Resultado Não Operacional                   PETRÓLEO BRASILEIRO S.A. – PETROBRAS        -550694      -199982
3.08.01      Receitas                                    PETRÓLEO BRASILEIRO S.A. – PETROBRAS          46611      1256194
3.08.02      Despesas                                    PETRÓLEO BRASILEIRO S.A. – PETROBRAS        -597305     -1456176
3.09         Resultado Antes Tributação/Participações    PETRÓLEO BRASILEIRO S.A. – PETROBRAS       25305625     33299964
3.10         Provisão para IR e Contribuição Social      PETRÓLEO BRASILEIRO S.A. – PETROBRAS       -5199166     -8581490
3.11         IR Diferido                                 PETRÓLEO BRASILEIRO S.A. – PETROBRAS       -1692288      -422392
3.12         Participações/Contribuições Estatutárias    PETRÓLEO BRASILEIRO S.A. – PETROBRAS        -660000      -846000
3.12.01      Participações                               PETRÓLEO BRASILEIRO S.A. – PETROBRAS        -660000      -846000
3.12.02      Contribuições                               PETRÓLEO BRASILEIRO S.A. – PETROBRAS              0            0
3.13         Reversão dos Juros sobre Capital Próprio    PETRÓLEO BRASILEIRO S.A. – PETROBRAS              0            0
3.15         Lucro/Prejuízo do Exercício                 PETRÓLEO BRASILEIRO S.A. – PETROBRAS       17754171     23450082

Downloading financial information for SEVERAL companies

If you are doing serious research, it is likely that you need financial
statements for more than one company. Package GetDFPData is especially
designed for handling large-scale downloads of data. Let’s build a case
with two selected companies:

my.companies <- c('PETRÓLEO BRASILEIRO S.A. - PETROBRAS',
                  'BANCO DO ESTADO DO RIO GRANDE DO SUL SA')

first.date <- '2005-01-01'
last.date  <- '2007-01-01'
type.statements <- 'individual'

df.reports <- gdfpd.GetDFPData(name.companies = my.companies,
                               first.date = first.date,
                               last.date = last.date,
                               cache.folder = tempdir())
## Found cache file. Loading data..
##
## Downloading data for 2 companies
## First Date: 2005-01-01
## Laste Date: 2007-01-01
## Inflation index: dollar
##
## Downloading inflation data
## Found cache file. Loading data.. Done
##
##
## WARNING: Cash flow statements are not available before 2009
##
## Inputs looking good! Starting download of files:
##
## BANCO DO ESTADO DO RIO GRANDE DO SUL SA
##  Available periods: 2006-12-31  2005-12-31
## PETRÓLEO BRASILEIRO S.A. - PETROBRAS
##  Available periods: 2006-12-31  2005-12-31
##
##
## Processing 1210 - BANCO DO ESTADO DO RIO GRANDE DO SUL SA
##  Finding info from Bovespa | downloading and reading data | saving cache
##  Processing 1210 - BANCO DO ESTADO DO RIO GRANDE DO SUL SA | date 2006-12-31
##   Acessing DFP data | downloading file | reading file | saving cache
##   Acessing FRE data | No FRE file available..
##   Acessing FCA data | No FCA file available..
##  Processing 1210 - BANCO DO ESTADO DO RIO GRANDE DO SUL SA | date 2005-12-31
##   Acessing DFP data | downloading file | reading file | saving cache
##   Acessing FRE data | No FRE file available..
##   Acessing FCA data | No FCA file available..
## Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS
##  Finding info from Bovespa
##  Found cache file /tmp/RtmpSpLsOP/9512_PETRÓLEO/GetDFPData_BOV_cache_9512_PETR.rds
##  Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS | date 2006-12-31
##   Acessing DFP data | downloading file | reading file | saving cache
##   Acessing FRE data | No FRE file available..
##   Acessing FCA data | No FCA file available..
##  Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS | date 2005-12-31
##   Acessing DFP data | Found DFP cache file
##   Acessing FRE data | No FRE file available..
##   Acessing FCA data | No FCA file available..

    And now we can check the resulting tibble:

glimpse(df.reports)
## Observations: 2
## Variables: 33
## $ company.name                  "BANCO DO ESTADO DO RIO GRANDE D...
## $ company.code                  1210, 9512
## $ company.tickers               "BRSR3;BRSR5;BRSR6", "PETR3;PETR4"
## $ min.date                      2005-12-31, 2005-12-31
## $ max.date                      2006-12-31, 2006-12-31
## $ n.periods                     2, 2
## $ company.segment               "Corporate Governance - Level 1"...
## $ current.stockholders          [...]
## $ history.capital.issues        [NULL, NULL]
## $ history.mkt.value             [NULL, NULL]
## $ history.capital.increases     [NULL, NULL]
## $ history.capital.reductions    [NULL, NULL]
## $ history.stock.repurchases     [NULL, NULL]
## $ history.other.stock.events    [NULL, NULL]
## $ history.compensation          [NULL, NULL]
## $ history.compensation.summary  [NULL, NULL]
## $ history.transactions.related  [NULL, NULL]
## $ history.debt.composition      [NULL, NULL]
## $ history.governance.listings   [NULL, NULL]
## $ history.board.composition     [NULL, NULL]
## $ history.committee.composition [NULL, NULL]
## $ history.family.relations      [NULL, NULL]

    Every row of df.reports will provide information for one company.
    Metadata about the corresponding dataframes such as min/max dates is
    available in the first columns. Keeping a tabular structure facilitates
    the organization and future processing of all financial data. We can use
    tibble df.reports for creating other dataframes in the long format
    containing data for all companies. See next, where we create dataframes
    with the assets and liabilities of all companies:

df.assets <- do.call(what = rbind, args = df.reports$fr.assets)
df.liabilities <- do.call(what = rbind, args = df.reports$fr.liabilities)

df.assets.liabilities <- rbind(df.assets, df.liabilities)

    As an example, let’s use the resulting dataframe for calculating and
    analyzing a simple liquidity index of a company, the total of current
    (liquid) assets (Ativo circulante) divided by the total of current
    short term liabilities (Passivo Circulante), over time.

library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

my.tab <- df.assets.liabilities %>%
  group_by(name.company, ref.date) %>%
  summarise(Liq.Index = acc.value[acc.number == '1.01'] /
                        acc.value[acc.number == '2.01'])

my.tab
## # A tibble: 3 x 3
## # Groups:   name.company [?]
##                              name.company   ref.date Liq.Index
##
## 1 BANCO DO ESTADO DO RIO GRANDE DO SUL SA 2006-12-31 0.7251432
## 2    PETRÓLEO BRASILEIRO S.A. - PETROBRAS 2005-12-31 0.9370813
## 3    PETRÓLEO BRASILEIRO S.A. - PETROBRAS 2006-12-31 0.9733600

    Now we can visualize the information using ggplot2:

library(ggplot2)

p <- ggplot(my.tab, aes(x = ref.date, y = Liq.Index, fill = name.company)) +
  geom_col(position = 'dodge')
print(p)

    Exporting financial data

The package includes the function gdfpd.export.DFP.data for exporting the
financial data to Excel or zipped csv files. See next:

my.basename <- 'MyExcelData'
my.format <- 'csv' # only supported so far

gdfpd.export.DFP.data(df.reports = df.reports,
                      base.file.name = my.basename,
                      type.export = my.format)

The resulting file contains all data available in df.reports.


    To leave a comment for the author, please follow the link and comment on their blog: R and Finance. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    The Cost of True Love (a.k.a. The Tidy — and expensive! — Twelve Days of Christmas)

    Tue, 12/05/2017 - 19:13

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

    I’m in the market for Christmas presents for my true love, @mrshrbrmstr, and thought I’d look to an age-old shopping list for inspiration. Just what would it set me back if I decided to mimic the 12 Days of Christmas in this modern day and age?

    Let’s try to do the whole thing in R (of course!).

    We’ll need to:

    • Grab the lyrics
    • Parse the lyrics
    • Get pricing data
    • Compute some statistics
    • Make some (hopefully) pretty charts

    This one was complex enough formatting-wise that I needed to iframe it below. Feel free to bust out of the iframe at any time.
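The full code lives in the iframe and on GitHub, but as a tiny, self-contained taste of the "compute some statistics" step, here is my own back-of-the-envelope sketch (not the author's code): gift type k is introduced on day k and then repeated on every later day, so it is given 13 - k times in total.

library(tibble)
library(dplyr)

# Gift k (1 = partridge, ..., 12 = drummers) is given on days k through 12,
# i.e. (13 - k) times, and each giving includes k items
gifts <- tibble(k = 1:12) %>%
  mutate(total_items = k * (13 - k))

sum(gifts$total_items)  # 364 items across the twelve days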

    Some good follow-ups to this (if you’re so inclined) would be to predict prices next year and/or clean up the charts a bit.

    Grab the code up on GitHub.

    (Note: ColourLovers API times out occasionally so just try that snippet again if you get an error).


    To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Computing wind average in an area using rWind

    Tue, 12/05/2017 - 17:54

    (This article was first published on long time ago..., and kindly contributed to R-bloggers)


Hi all! A researcher asked me last week how to compute the wind average for an area using rWind. I wrote a simple function to do this using the dplyr library (following the advice of my friend Javier Fajardo). The function will be added to the rWind package as soon as possible. Meanwhile, you can test the results… enjoy!

# First, load the new function
library(dplyr)

wind.region <- function (X){
  X[,3] <- X[,3] %% 360
  X[X[,3]>=180,3] <- X[X[,3]>=180,3] - 360
  avg <- summarise_all(X[,-1], .funs = mean)
  wind_region <- cbind(X[1,1], avg)
  return(wind_region)
}

Once you have loaded the function, let's work through an example.

    # Get some wind data and convert it into a raster to be plotted later
    library(rWind)
    library(raster)
    wind_data<-wind.dl(2015,2,12,0,-10,5,35,45)
    wind_fitted_data <- wind.fit(wind_data)
    r_speed <- wind2raster(wind_fitted_data, type="speed")

    Now, you can use the new function to obtain wind average in the study area:

    myMean <- wind.region(wind_data)
    myMean

    # Now, you can use wind.fit to get wind speed and direction.
    myMean_fitted <- wind.fit(myMean)
    myMean_fitted

    # Finally, let's plot the results!
    library(rworldmap)
    library(shape)
    plot(r_speed)
    lines(getMap(resolution = "low"), lwd=4)
    alpha <- arrowDir(myMean_fitted)
    Arrowhead(myMean_fitted$lon, myMean_fitted$lat, angle=alpha,
    arr.length = 2, arr.type="curved")
    text(myMean_fitted$lon+1, myMean_fitted$lat+1,
    paste(round(myMean_fitted$speed,2), "m/s"), cex = 2)


    To leave a comment for the author, please follow the link and comment on their blog: long time ago.... R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    On the biases in data

    Tue, 12/05/2017 - 16:15

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    Whether we're developing statistical models, training machine learning recognizers, or developing AI systems, we start with data. And while the suitability of that data set is, lamentably, sometimes measured by its size, it's always important to reflect on where those data come from. Data are not neutral: the data we choose to use has profound impacts on the resulting systems we develop. A recent article in Microsoft's AI Blog discusses the inherent biases found in many data sets:

    “The people who are collecting the datasets decide that, ‘Oh this represents what men and women do, or this represents all human actions or human faces.’ These are types of decisions that are made when we create what are called datasets,” she said. “What is interesting about training datasets is that they will always bear the marks of history, that history will be human, and it will always have the same kind of frailties and biases that humans have.”
    Kate Crawford, Principal Researcher at Microsoft Research and co-founder of AI Now Institute.

    “When you are constructing or choosing a dataset, you have to ask, ‘Is this dataset representative of the population that I am trying to model?’”
    Hanna Wallach, Senior Researcher at Microsoft Research NYC. 

    The article discusses the consequences of the data sets that aren't representative of the populations they are set to analyze, and also the consequences of the lack of diversity in the fields of AI research and implementation. Read the complete article at the link below.

    Microsoft AI Blog: Debugging data: Microsoft researchers look at ways to train AI systems to reflect the real world


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Building a Telecom Dictionary scraping web using rvest in R

    Tue, 12/05/2017 - 15:00

    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

One of the biggest problems in business when carrying out any analysis is the availability of data. That is where, in many cases, web scraping comes in very handy for creating the data that's required. Consider the following case: performing text analysis on textual data collected in a telecom company as part of customer feedback or reviews primarily requires a dictionary of telecom keywords. But such a dictionary is hard to find out of the box, so the most obvious thing for an analyst to do when it doesn't exist is to build one. Hence this article aims to help beginners get started with web scraping using rvest in R while building a Telecom Dictionary by the end of this exercise.

    Disclaimer

    Web Scraping is not allowed by some websites. Kindly check the site’s Terms of Service before Scraping. This post is only for Educational-purpose.

A simple Google search for Telecom Glossary results in this URL:
Atis.org, from which the required Telecom Dictionary could be built.

    Let us start by loading the required libraries:

# load the rvest, stringr and dplyr libraries
library(rvest)
library(stringr)
library(dplyr)

    rvest is a web scraping library in R that makes it easier to write common scraping tasks (to scrape useful information from web pages) without getting our head into xml parsing. rvest can be downloaded from CRAN and the development version is also available on Github.

As can be seen at the above-mentioned URL, the glossary words are listed alphabetically from A to Z, with a separate link/URL for every starting letter. Clicking 'A' takes us to this link: atis.org, which lists all the keywords starting with the letter 'A'. If you look closely at the URL, the code written for one letter (here, 'A') can easily be replicated for the other letters, since 'A' is part of the URL and is the only thing that changes in the links for the other letters.

Let us assign this URL to an R object, url, which can be passed on as the parameter to rvest's read_html() function.

# url whose content is to be scraped
url <- 'http://www.atis.org/glossary/definitionsList.aspx?find=A&kw=0'

# extracting the content of the given url
#url_content <- read_html('http://www.atis.org/glossary/definitionsList.aspx?find=A&kw=0')
url_content <- read_html(url)

    read_html() parses the html page of the given url (as its parameter) and saves the result as an xml object.

To reiterate the objective, we are trying to get the list of telecom keywords, and as the screenshot above shows, the keywords are listed as hyperlinks in the given url.

Hyperlinks in HTML are written with the following syntax:

<a href="www.google.com">Google</a>

Google is the link text label that a browser would render, and clicking it would take us to www.google.com. Hence it is evident that the anchor tags on the page are what we are supposed to scrape/extract. But the issue with the current page is that it's not just the keywords that are represented as anchor text (links); there are a lot of other links (anchor tags) on the page as well. Hence, to extract only the required information and filter out the irrelevant information, we need to find a pattern that helps us extract only the keyword links.

    Have a look at this screenshot:

This screenshot shows that while the keywords are also represented as hyperlinks (anchor text), the differentiator is the 'id' element in the url. Only the links of the keywords have this 'id' in the url, and hence we can try extracting two pieces of information from the current page to get only the relevant information, which in our case is the keywords: 1. the href value / actual URL, and 2. the link text.

# extracting all the links from the page
links <- url_content %>% html_nodes('a') %>% html_attr('href')

# extracting all the link text from the page
text <- url_content %>% html_nodes('a') %>% html_text()

With the example hyperlink discussed earlier, the above code gives two pieces of information:

www.google.com is saved in links and Google is saved in text.

    With links and text as columns, Let us build a rough dictionary (that’s not yet cleaned/filtered).

# creating a new dictionary of the links and text extracted above
rough_dict <- data.frame(links, text, stringsAsFactors = F)
head(rough_dict)
                             links text
1              http://www.atis.org
2 definitionsList.aspx?find=A&kw=0    A
3 definitionsList.aspx?find=B&kw=0    B
4 definitionsList.aspx?find=C&kw=0    C
5 definitionsList.aspx?find=D&kw=0    D
6 definitionsList.aspx?find=E&kw=0    E

    As displayed above, rough_dict contains both signal (keywords) and noise (irrelevant links) and we have to filter the irrelevant out with the ‘id‘ pattern that we learnt earlier.

# filtering glossary terms, leaving out irrelevant information
fair_dict <- rough_dict %>%
  filter(str_detect(links, 'id')) %>%
  select(text)
tail(fair_dict)
                          text
657                         AN
658 axial propagation constant
659           analog component
660                axial ratio
661            analog computer
662                  axial ray

And that's how, using str_detect(), we can keep only the links with 'id' in them, filter out the rest, and build our fair_dict. As displayed in the above output, we have got 662 keywords just for the letter 'A', and the same exercise could be repeated for the other letters available on the site. The only change that's required is the url object. For example, like this:

    url <- 'http://www.atis.org/glossary/definitionsList.aspx?find=B&kw=0'

Note the 'B' in it; the same could be done for the other available letters. This process could be further improved by making this scraping part a function and looping the function call over a character vector with all the available letters (which ideally is beyond the scope of this article's objective and hence left out; see the sketch below for one possible shape). The complete code used in this article is available on my Github.
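For readers who want to try that extension, here is a minimal sketch of my own of what such a wrapper might look like. It simply reuses the steps above; the scrape_letter() name is made up for illustration, and you may want to add polite delays between requests.

# Hypothetical helper: repeat the steps above for one starting letter
scrape_letter <- function(letter) {
  url <- paste0('http://www.atis.org/glossary/definitionsList.aspx?find=', letter, '&kw=0')
  url_content <- read_html(url)
  links <- url_content %>% html_nodes('a') %>% html_attr('href')
  text  <- url_content %>% html_nodes('a') %>% html_text()
  data.frame(links, text, stringsAsFactors = FALSE) %>%
    filter(str_detect(links, 'id')) %>%
    select(text)
}

# Loop over all starting letters and stack the results into one dictionary
full_dict <- do.call(rbind, lapply(LETTERS, scrape_letter))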

      Related Post

      1. Scraping Javascript-rendered web content using R
      2. Analysing Cryptocurrency Market in R
      3. Time Series Analysis in R Part 3: Getting Data from Quandl
      4. Pulling Data Out of Census Spreadsheets Using R
      5. Extracting Tables from PDFs in R using the Tabulizer Package

      To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      visualizing reassortment history using seqcombo

      Tue, 12/05/2017 - 04:09

      (This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers)

Reassortment is an important strategy for influenza A viruses to
introduce an HA subtype that is new to human populations, which creates
the possibility of a pandemic.

The diagram shown above (Figure 2 of doi:10.1038/srep25549) is widely
used to illustrate reassortment events, but such diagrams are mostly
drawn and edited manually, without a software tool to generate them
automatically. Here, I implemented the hybrid_plot function for producing
publication-quality figures of reassortment events.

library(seqcombo)
library(tibble)
library(ggplot2)

n <- 8
virus_info <- tibble(
  id = 1:7,
  x = c(rep(1990, 4), rep(2000, 2), 2009),
  y = c(1, 2, 3, 5, 1.5, 3, 4),
  segment_color = list(
    rep('purple', n),
    rep('red', n),
    rep('darkgreen', n),
    rep('lightgreen', n),
    c('darkgreen', 'darkgreen', 'red', 'darkgreen', 'red', 'purple', 'red', 'purple'),
    c('darkgreen', 'darkgreen', 'red', 'darkgreen', 'darkgreen', 'purple', 'red', 'purple'),
    c('darkgreen', 'lightgreen', 'lightgreen', 'darkgreen', 'darkgreen', 'purple', 'red', 'purple'))
)

flow_info <- tibble(from = c(1, 2, 3, 3, 4, 5, 6),
                    to   = c(5, 5, 5, 6, 7, 6, 7))

hybrid_plot(virus_info, flow_info)

hybrid_plot requires two tibble data frames: virus information and
genetic flow information.

Users need to provide x and y positions to plot the viruses. This makes
sense because geographical and temporal information is usually available
in such phylodynamic studies and can be employed to set x or y, providing
more information and helping interpretation of the reassortment events.

We use hexagons to represent viruses. Users can set the virus outer
boundary color with v_color and the fill with v_fill. The color of the
line segments that indicate the genetic flow relationships can be
specified via the l_color parameter.

      hybrid_plot(virus_info, flow_info, v_color='firebrick', v_fill='darkgreen', l_color='steelblue')

We usually have more information to present, for example host
information and HA subtype, and this information can be used to
color the viruses via either v_color or v_fill:

virus_info$Host = c("Avian", "Human", rep("Swine", 4), "Human")

hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host)

The relative virus size can also be specified if a virus_size column is
available in the input virus_info data.

virus_info$virus_size <- c(rep(1, 3), 2, 1, 1, 1.5)

hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host)

If label and label_position columns are available, the virus labels
(virus name or other information) will be added automatically.

      virus_info$label <- c("Avian", "Human\nH3N2", "Classic\nswine\nH1N1", "Eurasian swine", "North American swine\n triple reassrotant H3N2", "North American swine\n triple reassortant H1N2", "2009 Human H1N1") virus_info$label_position <- c('left', 'left', 'left', 'below', 'below', 'upper', 'below') hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host)

Users can use asp to set the aspect ratio of the hexagons: asp < 1 for
thick/short and asp > 1 for thin/tall.

      hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host, asp=2)

      The output of hybrid_plot is a ggplot object and users can use
      ggplot2 to modify the details.

      title <- "Reassortment events in evolution of the 2009 influenza A (H1N1) virus" caption <- 'Gene segments: PB2, PB1, PA, HA, NP, NA, M, NS' color <- c(Avian="purple", Human="red", Swine="darkgreen") hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host) + labs(caption=caption, title=title) + scale_color_manual(values=color) + scale_fill_manual(values=color) + scale_x_continuous(breaks=c(1990, 2000, 2009)) + xlab(NULL) + ylab(NULL) + theme_minimal() + theme(axis.line.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), panel.grid.minor=element_blank(), panel.grid.major.y=element_blank(), legend.position = c(.95, .1) )

      Top-down or bottom-up style is also supported.

x <- virus_info$x
virus_info$x <- virus_info$y
virus_info$y <- x
virus_info$label_position <- c(rep("right", 3), "left", "left", "right", "right")

hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host) +
  scale_y_reverse() +
  scale_x_continuous(limits=c(0, 5.5))

Users can also use emoji to label the viruses (host information in this
example):

      virus_info$label <- c("chicken", "woman", "pig", "pig", "pig", "pig", "woman") hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host, parse='emoji', t_size=8, t_color='firebrick') + scale_y_reverse()

In case you don't have xy-coordinate information, you can use the
set_layout function to automatically set the xy positions using a
selected layout function.

      virus_info <- set_layout(virus_info, flow_info, layout="layout.kamada.kawai") hybrid_plot(virus_info, flow_info, parse='emoji', t_size=8, t_color='firebrick')

      virus_info <- set_layout(virus_info, flow_info, layout="layout.fruchterman.reingold") hybrid_plot(virus_info, flow_info, parse='emoji', t_size=8, t_color='firebrick')

Please let me know if you know of any published reassortment data that
contains spatial information, and I will demonstrate how to visualize
reassortment history on a map.


      To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang YU. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      My book ‘Practical Machine Learning with R and Python’ on Amazon

      Tue, 12/05/2017 - 03:26

      (This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

      My book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($9.99) and kindle ($6.97/Rs449) versions. In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code. This is almost like listening to parallel channels of music in stereo!
1. Practical Machine Learning with R and Python – Machine Learning in Stereo (Paperback)
2. Practical Machine Learning with R and Python – Machine Learning in Stereo (Kindle)
This book is ideal both for beginners and for experts in R and/or Python. Those starting their journey into data science and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and deal with equivalent statements in the two languages. Those who are expert in either R or Python will find the equivalent code ideal for brushing up on the other language. And finally, those who are proficient in both languages can use the R and Python implementations to internalize the ML algorithms better.

      Here is a look at the topics covered

      Table of Contents
      Essential R …………………………………….. 7
      Essential Python for Datascience ………………..   54
      R vs Python ……………………………………. 77
      Regression of a continuous variable ………………. 96
      Classification and Cross Validation ……………….113
      Regression techniques and regularization …………. 134
      SVMs, Decision Trees and Validation curves …………175
      Splines, GAMs, Random Forests and Boosting …………202
      PCA, K-Means and Hierarchical Clustering …………. 234

      Pick up your copy today!!
      Hope you have a great time learning as I did while implementing these algorithms!


      To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      Version 0.6-8 of NIMBLE released

      Tue, 12/05/2017 - 02:19

      (This article was first published on R – NIMBLE, and kindly contributed to R-bloggers)

We released the newest version of NIMBLE on CRAN and on our website a week ago. Version 0.6-8 has a few new features, and more are on the way in the next few months.

      New features include:

      • the proper Gaussian CAR (conditional autoregressive) model can now be used in BUGS code as dcar_proper, which behaves similarly to BUGS’ car.proper distribution;
      • a new nimbleMCMC function that provides one-line invocation of NIMBLE’s MCMC engine, akin to usage of JAGS and WinBUGS through R;
      • a new runCrossValidate function that will conduct k-fold cross-validation of NIMBLE models fit by MCMC;
      • dynamic indexing in BUGS code is now allowed by default;
      • and a variety of bug fixes and efficiency improvements.

      Please see the NEWS file in the installed package for more details.
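As a quick, hedged illustration of the new one-line interface, here is a toy sketch of my own (the model and data are made up for illustration, not taken from the release notes):

library(nimble)

# Toy model: normal observations with an unknown mean
code <- nimbleCode({
  mu ~ dnorm(0, sd = 10)
  for (i in 1:N) {
    y[i] ~ dnorm(mu, sd = 1)
  }
})

set.seed(1)
y <- rnorm(20, mean = 3)  # simulated data, purely for illustration

samples <- nimbleMCMC(code = code,
                      constants = list(N = 20),
                      data = list(y = y),
                      inits = list(mu = 0),
                      niter = 10000, nburnin = 1000)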


      To leave a comment for the author, please follow the link and comment on their blog: R – NIMBLE. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      #R – Deprecate functions with roxygen2

      Tue, 12/05/2017 - 01:00

      (This article was first published on R-feed, and kindly contributed to R-bloggers)

      We show how to use roxygen2 tags and templates for deprecating existing documented functions.

      Introduction

      Package roxygen2 provides a powerful way of documenting packages and objects inside them. In particular, great flexibility is available for organizing the documentation content in source files and templates, and render it as desired in the corresponding help pages.

      We can leverage on this flexibility to adapt an existing package documentation upon making a function deprecated (and similarly defunct). As stated in the base-R ‘Marking Objects as Deprecated’ documentation (help(Deprecated)):

      The original help page for these functions is often available at help(“oldName-deprecated”) (note the quotes). Functions should be listed in help(“pkg-deprecated”) for an appropriate pkg, including base.

      Deprecate existing functions

      We show how an existing and documented function myFun in package ourPkg can be deprecated, and the documentation adjusted accordingly by re-arranging the roxygen2 documentation content.

## myFun.r
#' @title My function.
#' @description My function.
#' @param arg1 My first argument.
#' @param arg2 My second argument.
#' @return My return value.
#' @export
myFun <- function(arg1, arg2) {
  "My return value"
}

      Let us assume a new function myNewFunction exists as a replacement for the old myFun, which becomes deprecated. We want to achieve the following

• list myFun in help("ourPkg-deprecated"), along with its replacement;
      • make its original documentation available via help("myFun-deprecated").

      This can be obtained by

• creating generic content for help("ourPkg-deprecated");
      • adjusting the existing myFun documentation content.
      Package-level documentation

      Generic content for help("ourPkg-deprecated") is defined in its own source file:

## ourPkg-deprecated.r
#' @title Deprecated functions in package \pkg{ourPkg}.
#' @description The functions listed below are deprecated and will be defunct in
#'   the near future. When possible, alternative functions with similar
#'   functionality are also mentioned. Help pages for deprecated functions are
#'   available at \code{help("-deprecated")}.
#' @name ourPkg-deprecated
#' @keywords internal
NULL

Function-specific documentation

      Function-specific content is added to the ourPkg-deprecated help page from the corresponding source file, where we also want to make the original help page available via help("myFun-deprecated").

      The source file of myFun is then modified as follows:

## myFun.r
#' @title My function.
#' @description My function.
#' @param arg1 My first argument.
#' @param arg2 My second argument.
#' @return My return value.
#'
#' @name myFun-deprecated
#' @usage myFun(arg1, arg2)
#' @seealso \code{\link{ourPkg-deprecated}}
#' @keywords internal
NULL

#' @rdname ourPkg-deprecated
#' @section \code{myFun}:
#' For \code{myFun}, use \code{\link{myNewFun}}.
#'
#' @export
myFun <- function(arg1, arg2) {
  .Deprecated("myNewFun")
  "My return value"
}

      Two blocks of roxygen2 tags have been added to the existing documentation content.

      • The first block is used to create the myFun-deprecated help page, detached from the myFun object. For this reason we need to explicitly add the original function signature as ‘Usage’ section. We also add a ‘See Also’ link to ourPkg-deprecated.
      • The second block adds myFun-specific content to the ourPkg-deprecated help page. The standard ‘Usage’ section is added automatically, and we create a function-specific section where using myNewFun is suggested. Such content will be collected and shown in help("ourPkg-deprecated") for all deprecated functions with similar tags structure.

      Note also that help(myFun) will point to the ourPkg-deprecated help page, and that @keywords internal prevents the corresponding topics from appearing in the package documentation index.

      Using roxygen2 templates

      Deprecating functions as described above can be automated using templates. One can define a template for each additional block.

## template-depr_pkg.r
#' @name ourPkg-deprecated
#' @section \code{<%= old %>}:
#' For \code{<%= old %>}, use \code{\link{<%= new %>}}.

## template-depr_fun.r
#' @name <%= fun %>-deprecated
#' @usage <%= gsub("\n", "\n#' ", roxygen2:::wrapString(roxygen2:::function_usage(fun, formals(fun)))) %>
#' @seealso \code{\link{ourPkg-deprecated}}
#' @keywords internal

      Note the presence of template variables and of some roxygen2 internals for automating the construction of the ‘Usage’ section (handling multiple lines in case of many arguments).

      Deprecating herFun in favor of herNewFun then results in the simple inclusion of the templates in the function source.

## herFun.r
#' @title Her function.
#' @description Her function.
#' @param arg1 Her first argument.
#' @param arg2 Her second argument.
#' @return Her return value.
#'
#' @templateVar fun herFun
#' @template template-depr_fun
NULL

#' @templateVar old herFun
#' @templateVar new herNewFun
#' @template template-depr_pkg
#'
#' @export
herFun <- function(arg1, arg2) {
  .Deprecated("herNewFun")
  "Her return value"
}

Example package manual and sources

      A complete example of ourPkg is
available for download.
      The package contains functions myFun, yourFun, herFun, hisFun, where all but yourFun are deprecated in favor of *NewFun, using roxygen2 templates for herFun and hisFun.

      The resulting documentation content can be assessed in the corresponding PDF reference manual.


      To leave a comment for the author, please follow the link and comment on their blog: R-feed. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      Exploratory Data Analysis of Ancient Texts with rperseus

      Tue, 12/05/2017 - 01:00

      (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

      Introduction

      When I was in grad school at Emory, I had a favorite desk in the library. The desk wasn’t particularly cozy or private, but what it lacked in comfort it made up for in real estate. My books and I needed room to operate. Students of the ancient world require many tools, and when jumping between commentaries, lexicons, and interlinears, additional clutter is additional “friction”, i.e., lapses in thought due to frustration. Technical solutions to this clutter exist, but the best ones are proprietary and expensive. Furthermore, they are somewhat inflexible, and you may have to shoehorn your thoughts into their framework. More friction.

      Interfacing with the Perseus Digital Library was a popular online alternative. The library includes a catalog of classical texts, a Greek and Latin lexicon, and a word study tool for appearances and references in other literature. If the university library’s reference copies of BDAG1 and Synopsis Quattuor Evangeliorum2 were unavailable, Perseus was our next best thing.

      Fast forward several years, and I’ve abandoned my quest to become a biblical scholar. Much to my father’s dismay, I’ve learned writing code is more fun than writing exegesis papers. Still, I enjoy dabbling with dead languages, and it was the desire to wed my two loves, biblical studies and R, that birthed my latest package, rperseus. The goal of this package is to furnish classicists with texts of the ancient world and a toolkit to unpack them.

      Exploratory Data Analysis in Biblical Studies

      Working with the Perseus Digital Library was already a trip down memory lane, but here’s an example of how I would have leveraged rperseus many years ago.

      My best papers often sprung from the outer margins of my Nestle-Aland Novum Testamentum Graece. Here the editors inserted cross references to parallel vocabulary, themes, and even grammatical constructions. Given the intertextuality of biblical literature, the margins are a rich source of questions: Where else does the author use similar vocabulary? How is the source material used differently? Does the literary context affect our interpretation of a particular word? This is exploratory data analysis in biblical studies.

      Unfortunately the excitement of your questions is incommensurate with the tedium of the process–EDA continues by flipping back and forth between books, dog-earring pages, and avoiding paper cuts. rperseus aims to streamline this process with two functions: get_perseus_text and perseus_parallel. The former returns a data frame containing the text from any work in the Perseus Digital Library, and the latter renders a parallel in ggplot2.

      Suppose I am writing a paper on different expressions of love in Paul’s letters. Naturally, I start in 1 Corinthians 13, the famed “Love Chapter” often heard at weddings and seen on bumper stickers. I finish the chapter and turn to the margins. In the image below, I see references to Colossians 1:4, 1 Thessalonians 1:3, 5:8, Hebrews 10:22-24, and Romans 8:35-39.


1 Corinthians 13 in Nestle-Aland Novum Testamentum Graece

      Ignoring that some scholars exclude Colossians from the “authentic” letters, let’s see the references alongside each other:

library(rperseus) # devtools::install_github("ropensci/rperseus")
library(tidyverse)

tribble(
  ~label, ~excerpt,
  "Colossians", "1.4",
  "1 Thessalonians", "1.3",
  "1 Thessalonians", "5.8",
  "Romans", "8.35-8.39"
) %>%
  left_join(perseus_catalog) %>%
  filter(language == "grc") %>%
  select(urn, excerpt) %>%
  pmap_df(get_perseus_text) %>%
  perseus_parallel(words_per_row = 4)

      A brief explanation: First, I specify the labels and excerpts within a tibble. Second, I join the lazily loaded perseus_catalog onto the data frame. Third, I filter for the Greek3 and select the columns containing the arguments required for get_perseus_text. Fourth, I map over each urn and excerpt, returning another data frame. Finally, I pipe the output into perseus_parallel.

      The key word shared by each passage is agape (“love”). Without going into detail, it might be fruitful to consider the references alongside each other, pondering how the semantic range of agape expands or contracts within the Pauline corpus. Paul had a penchant for appropriating and recasting old ideas–often in slippery and unexpected ways–and your Greek lexicon provides a mere approximation. In other words, how can we move from the dictionary definition of agape towards Paul’s unique vision?

      If your Greek is rusty, you can parse each word with parse_excerpt by locating the text’s urn within the perseus_catalog object.

parse_excerpt(urn = "urn:cts:greekLit:tlg0031.tlg012.perseus-grc2", excerpt = "1.4")

word    form        verse  part_of_speech  person  number    tense   mood        voice   gender     case        degree
------  ----------  -----  --------------  ------  --------  ------  ----------  ------  ---------  ----------  ------
ἀκούω   ἀκούσαντες  1.4    verb            NA      plural    aorist  participle  active  masculine  nominative  NA
ὁ       τὴν         1.4    article         NA      singular  NA      NA          NA      feminine   accusative  NA
πίστις  πίστιν      1.4    noun            NA      singular  NA      NA          NA      feminine   accusative  NA
ὑμός    ὑμῶν        1.4    pronoun         NA      plural    NA      NA          NA      masculine  genative    NA

      If your Greek is really rusty, you can also flip the language filter to “eng” to view an older English translation.4 And if the margin references a text from the Old Testament, you can call the Septuagint as well as the original Hebrew.5

      tribble( ~label, ~excerpt, "Genesis", "32.31", "Genesis, pointed", "32.31", "Numeri", "12.8", "Numbers, pointed", "12.8" ) %>% left_join(perseus_catalog) %>% filter(language %in% c("grc", "hpt")) %>% select(urn, excerpt) %>% pmap_df(get_perseus_text) %>% perseus_parallel()

      Admittedly, there is some “friction” here in joining the perseus_catalog onto the initial tibble. There is a learning curve with getting acquainted with the idiosyncrasies of the catalog object. A later release will aim to streamline this workflow.
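One low-tech way to get your bearings is simply to browse the catalog columns already used above (label, language, urn); for example, something along these lines (my own sketch, assuming the tidyverse is still loaded):

# Peek at the Greek entries of the catalog, using only columns referenced above
perseus_catalog %>%
  filter(language == "grc") %>%
  select(label, urn) %>%
  head()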

      Future Work

Check the vignette for a more general overview of rperseus. In the meantime, I look forward to getting more intimately acquainted with the Perseus Digital Library. Tentative plans to extend rperseus include a Shiny interface to further reduce "friction" and a method of creating a "book" of custom parallels with bookdown.

      Acknowledgements

      I want to thank my two rOpenSci reviewers, Ildikó Czeller and François Michonneau, for coaching me through the review process. They were the first two individuals to ever scrutinize my code, and I was lucky to hear their feedback. rOpenSci onboarding is truly a wonderful process.

      1. Bauer, Walter. A Greek-English Lexicon of the New Testament and Other Early Christian Literature. Edited by Frederick W. Danker. 3rd ed. Chicago: University of Chicago Press, 2000.
      2. Aland, Kurt. Synopsis Quattuor Evangeliorum. Deutsche Bibelgesellschaft, 1997.
      3. The Greek text from the Perseus Digital Library is from 1885 standards. The advancement of textual criticism in the 20th century led to a more stable text you would find in current editions of the Greek New Testament.
4. The English translation is from Rainbow Missions, Inc. World English Bible. Rainbow Missions, Inc.; revision of the American Standard Version of 1901. I’ve toyed with the idea of incorporating more modern translations, but that would require resources beyond the Perseus Digital Library.
      5. “hpt” is the pointed Hebrew text from Codex Leningradensis.

      To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      Usage of ruler package

      Tue, 12/05/2017 - 01:00

      (This article was first published on QuestionFlow , and kindly contributed to R-bloggers)

      Usage examples of ruler package: dplyr-style exploration and validation of data frame like objects.

      Prologue

      My previous post tells a story about design of my ruler package, which presents tools for “… creating data validation pipelines and tidy reports”. This package offers a framework for exploring and validating data frame like objects using dplyr grammar of data manipulation.

      This post is intended to show some close-to-reality examples of ruler usage. The methods and approaches described reflect the package design. Along the way you will learn why Yoda and Jabba the Hutt are “outliers” among core “Star Wars” characters.

      For more information see the README (for a relatively brief but comprehensive introduction) or the vignettes (for a more thorough description of package capabilities).

      Beware of a lot of code.

      Overview

      suppressMessages(library(dplyr))
      suppressMessages(library(purrr))
      library(ruler)

      The general way of performing validation with ruler can be described with the following steps:

      • Formulate a validation task. It is usually stated in the form of a yes-no question or true-false statement about some part (data unit) of an input data frame. A data unit can be one of: data [as a whole], group of rows [as a whole], column [as a whole], row [as a whole], cell. For example: does every column contain elements with a sum greater than 100?
      • Create a dplyr-style validation function (rule pack) which checks the desired data unit for obedience to [possibly] several rules:
        mtcars %>% summarise_all(funs(enough_sum = sum(.) > 100))
        • Use ruler’s function rules() instead of explicit or implicit usage of funs():
        mtcars %>% summarise_all(rules(enough_sum = sum(.) > 100))

        . %>% summarise_all(rules(enough_sum = sum(.) > 100))
        • Wrap with a rule specification function to explicitly identify the validated data unit and to name the rule pack. In this case it is col_packs() for a column data unit, with “is_enough_sum” as the rule pack name:
        col_packs(
          is_enough_sum = . %>% summarise_all(rules(is_enough = sum(.) > 100))
        )
      • Expose data to rules to obtain the validation result (exposure). Use ruler’s expose() function for that. It doesn’t modify the contents of the input data frame but creates/updates an exposure attribute. An exposure is a list with information about the used rule packs (packs_info) and a tidy data validation report (report).
      • Act after exposure. It can be:
        • Observing validation results with get_exposure(), get_packs_info() or get_report() (a brief get_packs_info() sketch follows this list).
        • Making assertions if specific rules are not followed in desired way.
        • Imputing input data frame based on report.
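      Of the observation helpers, get_packs_info() is not shown elsewhere in this post, so here is a minimal sketch (using the packages loaded above and the same column pack as in the previous step); it returns only the packs_info part of the exposure, without the report:

      mtcars %>%
        expose(col_packs(
          is_enough_sum = . %>% summarise_all(rules(is_enough = sum(.) > 100))
        )) %>%
        get_packs_info()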

      In the examples we will use the starwars data from the dplyr package (to celebrate an upcoming new episode). It is a tibble with every row describing one “Star Wars” character. Every example starts with a validation task stated in italics and performs the validation from beginning to end.

      Create rule packs

      Data

      Does starwars have 1) number of rows 1a) more than 50; 1b) less than 60; 2) number of columns 2a) more than 10; 2b) less than 15?

      check_data_dims <- data_packs(
        check_dims = . %>% summarise(
          nrow_low = nrow(.) >= 50, nrow_up = nrow(.) <= 60,
          ncol_low = ncol(.) >= 10, ncol_up = ncol(.) <= 15
        )
      )

      starwars %>% expose(check_data_dims) %>% get_exposure()
      ## Exposure
      ##
      ## Packs info:
      ## # A tibble: 1 x 4
      ## name type fun remove_obeyers
      ##
      ## 1 check_dims data_pack TRUE
      ##
      ## Tidy data validation report:
      ## # A tibble: 1 x 5
      ## pack rule var id value
      ##
      ## 1 check_dims nrow_up .all 0 FALSE

      The result is interpreted as follows:

      • Data was exposed to one rule pack for data as a whole (a data rule pack) named “check_dims”. For it, all obeyers (data units which follow the specified rule) were removed from the validation report.
      • The combination of var equal to .all and id equal to 0 means that the data as a whole is validated.
      • The input data frame doesn’t obey (because value is equal to FALSE) the rule nrow_up from the rule pack check_dims.

      Does starwars have enough rows for 1) characters with blond hair; 2) humans; 3) humans with blond hair?

      check_enough_rows <- data_packs(
        enough_blond = . %>% filter(hair_color == "blond") %>%
          summarise(is_enough = n() > 10),
        enough_humans = . %>% summarise(
          is_enough = sum(species == "Human", na.rm = TRUE) > 30
        ),
        ehough_blond_humans = . %>% filter(
          hair_color == "blond", species == "Human"
        ) %>%
          summarise(is_enough = n() > 5)
      )

      starwars %>% expose(check_enough_rows) %>% get_exposure()
      ## Exposure
      ##
      ## Packs info:
      ## # A tibble: 3 x 4
      ## name type fun remove_obeyers
      ##
      ## 1 enough_blond data_pack TRUE
      ## 2 enough_humans data_pack TRUE
      ## 3 ehough_blond_humans data_pack TRUE
      ##
      ## Tidy data validation report:
      ## # A tibble: 2 x 5
      ## pack rule var id value
      ##
      ## 1 enough_blond is_enough .all 0 FALSE
      ## 2 ehough_blond_humans is_enough .all 0 FALSE

      New information gained from example:

      • Rule specification functions can be supplied with multiple rule packs, all of which will be used independently during exposing.

      Does starwars have enough numeric columns?

      check_enough_num_cols <- data_packs(
        enough_num_cols = . %>% summarise(
          is_enough = sum(map_lgl(., is.numeric)) > 1
        )
      )

      starwars %>% expose(check_enough_num_cols) %>% get_report()
      ## Tidy data validation report:
      ## # A tibble: 0 x 5
      ## # ... with 5 variables: pack , rule , var , id ,
      ## #   value
      • If no breaker is found, get_report() returns a tibble with zero rows and the usual columns.
      Group

      Does every group defined by hair color and gender have a member from Tatooine?

      has_hair_gender_tatooine <- group_packs(
        hair_gender_tatooine = . %>% group_by(hair_color, gender) %>%
          summarise(has_tatooine = any(homeworld == "Tatooine")),
        .group_vars = c("hair_color", "gender"),
        .group_sep = "__"
      )

      starwars %>% expose(has_hair_gender_tatooine) %>% get_report()
      ## Tidy data validation report:
      ## # A tibble: 12 x 5
      ## pack rule var id value
      ##
      ## 1 hair_gender_tatooine has_tatooine auburn__female 0 FALSE
      ## 2 hair_gender_tatooine has_tatooine auburn, grey__male 0 FALSE
      ## 3 hair_gender_tatooine has_tatooine auburn, white__male 0 FALSE
      ## 4 hair_gender_tatooine has_tatooine blonde__female 0 FALSE
      ## 5 hair_gender_tatooine has_tatooine grey__male 0 FALSE
      ## # ... with 7 more rows
      • group_packs() needs grouping columns supplied via .group_vars.
      • The column var of the validation report contains levels of the grouping columns which identify the group. By default they are pasted together with a dot (.). To change that, supply the .group_sep argument.
      • 12 combinations of hair_color and gender don’t have a character from Tatooine. They are “auburn”-“female”, “auburn, grey”-“male” and so on.
      Column

      Does every list-column have 1) enough average length; 2) enough unique elements?

      check_list_cols <- col_packs(
        check_list_cols = . %>% summarise_if(
          is.list,
          rules(
            is_enough_mean = mean(map_int(., length)) >= 1,
            length(unique(unlist(.))) >= 10
          )
        )
      )

      starwars %>% expose(check_list_cols) %>% get_report()
      ## Tidy data validation report:
      ## # A tibble: 3 x 5
      ## pack rule var id value
      ##
      ## 1 check_list_cols is_enough_mean vehicles 0 FALSE
      ## 2 check_list_cols is_enough_mean starships 0 FALSE
      ## 3 check_list_cols rule..2 films 0 FALSE
      • To specify rule functions inside dplyr’s scoped verbs, use ruler::rules(). It powers correct output interpretation during the exposing process and imputes missing rule names based on the rules present in the current rule pack.
      • Columns vehicles and starships don’t have enough average length and column films doesn’t have enough unique elements.

      Are all values of column birth_year non-NA?

      starwars %>%
        expose(
          col_packs(
            . %>% summarise_at(
              vars(birth_year = "birth_year"),
              rules(all_present = all(!is.na(.)))
            )
          )
        ) %>%
        get_report()
      ## Tidy data validation report:
      ## # A tibble: 1 x 5
      ## pack rule var id value
      ##
      ## 1 col_pack..1 all_present birth_year 0 FALSE
      • To correctly validate one column with a scoped dplyr verb, it should be supplied as a named argument inside vars(). This is needed for correct interpretation of the rule pack output.
      Row

      Has a character appeared in enough films? As a character is defined by a row, this is a row pack.

      has_enough_films <- row_packs(
        enough_films = . %>% transmute(is_enough = map_int(films, length) >= 3)
      )

      starwars %>%
        expose(has_enough_films) %>%
        get_report() %>%
        left_join(y = starwars %>% transmute(id = 1:n(), name), by = "id") %>%
        print(.validate = FALSE)
      ## Tidy data validation report:
      ## # A tibble: 64 x 6
      ## pack rule var id value name
      ##
      ## 1 enough_films is_enough .all 8 FALSE R5-D4
      ## 2 enough_films is_enough .all 9 FALSE Biggs Darklighter
      ## 3 enough_films is_enough .all 12 FALSE Wilhuff Tarkin
      ## 4 enough_films is_enough .all 15 FALSE Greedo
      ## 5 enough_films is_enough .all 18 FALSE Jek Tono Porkins
      ## # ... with 59 more rows
      • 64 characters haven’t appeared in 3 films or more. Those are characters described in starwars in rows 8, 9, etc. (counting based on input data).

      Is every character with height less than 100 a droid?

      is_short_droid <- row_packs(
        is_short_droid = . %>% filter(height < 100) %>%
          transmute(is_droid = species == "Droid")
      )

      starwars %>%
        expose(is_short_droid) %>%
        get_report() %>%
        left_join(y = starwars %>% transmute(id = 1:n(), name, height), by = "id") %>%
        print(.validate = FALSE)
      ## Tidy data validation report:
      ## # A tibble: 5 x 7
      ## pack rule var id value name height
      ##
      ## 1 is_short_droid is_droid .all 19 FALSE Yoda 66
      ## 2 is_short_droid is_droid .all 29 FALSE Wicket Systri Warrick 88
      ## 3 is_short_droid is_droid .all 45 FALSE Dud Bolt 94
      ## 4 is_short_droid is_droid .all 72 FALSE Ratts Tyerell 79
      ## 5 is_short_droid is_droid .all 73 NA R4-P17 96
      • One can expose only a subset of rows by using filter or slice (a sketch with slice follows this list). The value of the id column in the result will reflect the row number in the original input data frame. This feature is powered by the keyholder package; to use it, a rule pack should be created with functions that keyholder supports.
      • A value equal to NA is treated as a rule breaker.
      • 5 “not tall” characters are not droids.
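      As promised above, a minimal sketch with slice(), assuming it is supported by keyholder-backed packs in the same way as filter() (as the first bullet states). It validates only the first five rows of starwars, while id still counts rows of the full data frame; the pack name and the 180 cm threshold are chosen for illustration only:

      first_five_are_tall <- row_packs(
        # Hypothetical pack: check only rows 1-5 for being "tall"
        first_five_are_tall = . %>% slice(1:5) %>% transmute(is_tall = height > 180)
      )

      starwars %>%
        expose(first_five_are_tall) %>%
        get_report()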
      Cell

      Is a non-NA numeric cell not an outlier based on its z-score? This is a bit tricky: to present outliers as rule breakers, one should ask whether a cell is not an outlier.

      z_score <- function(x, ...) {abs(x - mean(x, ...)) / sd(x, ...)}

      cell_isnt_outlier <- cell_packs(
        dbl_not_outlier = . %>% transmute_if(
          is.numeric,
          rules(isnt_out = z_score(., na.rm = TRUE) < 3 | is.na(.))
        )
      )

      starwars %>%
        expose(cell_isnt_outlier) %>%
        get_report() %>%
        left_join(y = starwars %>% transmute(id = 1:n(), name), by = "id") %>%
        print(.validate = FALSE)
      ## Tidy data validation report:
      ## # A tibble: 4 x 6
      ## pack rule var id value name
      ##
      ## 1 dbl_not_outlier isnt_out height 19 FALSE Yoda
      ## 2 dbl_not_outlier isnt_out mass 16 FALSE Jabba Desilijic Tiure
      ## 3 dbl_not_outlier isnt_out birth_year 16 FALSE Jabba Desilijic Tiure
      ## 4 dbl_not_outlier isnt_out birth_year 19 FALSE Yoda
      • 4 non-NA numeric cells appear to be outliers within their columns.
      Expose data to rules

      Do groups defined by species, gender and eye_color (3 different checks) have appropriate size?

      starwars %>%
        expose(
          group_packs(. %>% group_by(species) %>% summarise(isnt_many = n() <= 5),
                      .group_vars = "species")
        ) %>%
        expose(
          group_packs(. %>% group_by(gender) %>% summarise(isnt_many = n() <= 60),
                      .group_vars = "gender"),
          .remove_obeyers = FALSE
        ) %>%
        expose(is_enough_eye_color = . %>% group_by(eye_color) %>%
                 summarise(isnt_many = n() <= 20)) %>%
        get_exposure() %>%
        print(n_report = Inf)
      ## Exposure
      ##
      ## Packs info:
      ## # A tibble: 3 x 4
      ## name type fun remove_obeyers
      ##
      ## 1 group_pack..1 group_pack TRUE
      ## 2 group_pack..2 group_pack FALSE
      ## 3 is_enough_eye_color group_pack TRUE
      ##
      ## Tidy data validation report:
      ## # A tibble: 7 x 5
      ## pack rule var id value
      ##
      ## 1 group_pack..1 isnt_many Human 0 FALSE
      ## 2 group_pack..2 isnt_many female 0 TRUE
      ## 3 group_pack..2 isnt_many hermaphrodite 0 TRUE
      ## 4 group_pack..2 isnt_many male 0 FALSE
      ## 5 group_pack..2 isnt_many none 0 TRUE
      ## 6 group_pack..2 isnt_many NA 0 TRUE
      ## 7 is_enough_eye_color isnt_many brown 0 FALSE
      • expose() can be applied sequentially, which results in updating the existing exposure with new information.
      • expose() imputes names of supplied unnamed rule packs based on the present rule packs for the same data unit type.
      • expose() by default removes obeyers (rows with data units that obey respective rules) from validation report. To stop doing that use .remove_obeyers = FALSE during expose() call.
      • expose() by default guesses the type of the supplied rule pack based only on its output. This has some annoying edge cases but is suitable for interactive usage. To turn this feature off, use .guess = FALSE as an argument for expose() (see the sketch just below). Also, to avoid edge cases, create rule packs with appropriate wrappers.
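      A minimal sketch of turning the guessing off: because the pack is explicitly wrapped with group_packs(), expose() does not need to infer its type, so .guess = FALSE is safe here (the threshold of 20 is chosen only for illustration):

      starwars %>%
        expose(
          group_packs(
            . %>% group_by(eye_color) %>% summarise(isnt_many = n() <= 20),
            .group_vars = "eye_color"
          ),
          .guess = FALSE
        ) %>%
        get_report()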

      Perform some previous checks with one expose().

      my_packs <- list(check_data_dims, is_short_droid, cell_isnt_outlier)

      str(my_packs)
      ## List of 3
      ##  $ :List of 1
      ##   ..$ check_dims:function (value)
      ##   .. ..- attr(*, "class")= chr [1:4] "data_pack" "rule_pack" "fseq" "function"
      ##  $ :List of 1
      ##   ..$ is_short_droid:function (value)
      ##   .. ..- attr(*, "class")= chr [1:4] "row_pack" "rule_pack" "fseq" "function"
      ##  $ :List of 1
      ##   ..$ dbl_not_outlier:function (value)
      ##   .. ..- attr(*, "class")= chr [1:4] "cell_pack" "rule_pack" "fseq" "function"

      starwars_exposed_list <- starwars %>% expose(my_packs)

      starwars_exposed_arguments <- starwars %>%
        expose(check_data_dims, is_short_droid, cell_isnt_outlier)

      identical(starwars_exposed_list, starwars_exposed_arguments)
      ## [1] TRUE
      • expose() can take, as its rule pack argument, a list of lists [of lists, of lists, …] with functions at any depth. This enables creating a list of rule packs wrapped with *_packs() functions (which all return a list of functions).
      • expose() can have multiple rule packs as separate arguments.
      Act after exposure

      Throw an error if any non-NA value of mass is more than 1000.

      starwars %>%
        expose(
          col_packs(
            low_mass = . %>% summarise_at(
              vars(mass = "mass"),
              rules(is_small_mass = all(. <= 1000, na.rm = TRUE))
            )
          )
        ) %>%
        assert_any_breaker()
      ## Breakers report
      ## Tidy data validation report:
      ## # A tibble: 1 x 5
      ## pack rule var id value
      ##
      ## 1 low_mass is_small_mass mass 0 FALSE
      ## Error: assert_any_breaker: Some breakers found in exposure.
      • assert_any_breaker() is used to assert the presence of at least one breaker in the validation report.

      However, the solution offered via a column pack doesn’t show the rows which break the rule. To do that, one can use a cell pack:

      starwars %>%
        expose(
          cell_packs(
            low_mass = . %>% transmute_at(
              vars(mass = "mass"),
              rules(is_small_mass = (. <= 1000) | is.na(.))
            )
          )
        ) %>%
        assert_any_breaker()
      ## Breakers report
      ## Tidy data validation report:
      ## # A tibble: 1 x 5
      ## pack rule var id value
      ##
      ## 1 low_mass is_small_mass mass 16 FALSE
      ## Error: assert_any_breaker: Some breakers found in exposure.

      Remove numeric columns with a mean value below a certain threshold. To achieve that, one should formulate the rule as “column mean should be above the threshold”, identify breakers and act upon this information.

      remove_bad_cols <- function(.tbl) {
        bad_cols <- .tbl %>% get_report() %>% pull(var) %>% unique()

        .tbl[, setdiff(colnames(.tbl), bad_cols)]
      }

      starwars %>%
        expose(
          col_packs(
            . %>% summarise_if(is.numeric, rules(mean(., na.rm = TRUE) >= 100))
          )
        ) %>%
        act_after_exposure(
          .trigger = any_breaker,
          .actor = remove_bad_cols
        ) %>%
        remove_exposure()
      ## # A tibble: 87 x 11
      ## name height hair_color skin_color eye_color gender homeworld
      ##
      ## 1 Luke Skywalker 172 blond fair blue male Tatooine
      ## 2 C-3PO 167 gold yellow Tatooine
      ## 3 R2-D2 96 white, blue red Naboo
      ## 4 Darth Vader 202 none white yellow male Tatooine
      ## 5 Leia Organa 150 brown light brown female Alderaan
      ## # ... with 82 more rows, and 4 more variables: species ,
      ## #   films , vehicles , starships
      • act_after_exposure is a wrapper for performing actions after exposing. It takes a .trigger function to trigger an action and an .actor function to perform the action and return its result.
      • any_breaker is a function which returns TRUE if the attached tidy validation report has any breaker and FALSE otherwise.
      Conclusions
      • Yoda and Jabba the Hutt are outliers among other “Star Wars” characters: Yoda by height and birth year, Jabba by mass and also by birth year.
      • There are fewer than 10 “Star Wars” films so far.
      • ruler offers flexible and extendable functionality for common validation tasks. Validation can be done for data [as a whole], group of rows [as a whole], column [as a whole], row [as a whole] and cell. After exposing the data frame of interest to rules and obtaining a tidy validation report, one can perform any action based on this information: explore the report, throw an error, impute the input data frame, etc.

      To leave a comment for the author, please follow the link and comment on their blog: QuestionFlow.
