R-bloggers
R news and tutorials contributed by hundreds of R bloggers
Updated: 1 day 7 hours ago

RcppArmadillo 0.8.300.1.0

Thu, 12/07/2017 - 01:59

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Another RcppArmadillo release hit CRAN today. Since our last 0.8.100.1.0 release in October, Conrad kept busy and produced Armadillo releases 8.200.0, 8.200.1, 8.300.0 and now 8.300.1. We tend to now package these (with proper reverse-dependency checks and all) first for the RcppCore drat repo, from where you can install them "as usual" (see the repo page for details). But this release resumes our normal bi-monthly CRAN release cycle.
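For readers who want the drat route before the next CRAN release lands, a minimal sketch (assuming the drat package is installed; the repo name follows the RcppCore drat page mentioned above):

# Hedged sketch: register the RcppCore drat repository for this session,
# then install RcppArmadillo from it "as usual".
drat::addRepo("RcppCore")
install.packages("RcppArmadillo")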

These releases iron out a few little nags from the recent switch to more extensive use of OpenMP, and round out a number of other corners. See below for a brief summary.

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to that of Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 405 other packages on CRAN.

A high-level summary of changes follows.

Changes in RcppArmadillo version 0.8.300.1.0 (2017-12-04)
  • Upgraded to Armadillo release 8.300.1 (Tropical Shenanigans)

    • faster handling of band matrices by solve()

    • faster handling of band matrices by chol()

    • faster randg() when using OpenMP

    • added normpdf()

    • expanded .save() to allow appending new datasets to existing HDF5 files

  • Includes changes made in several earlier GitHub-only releases (versions 0.8.300.0.0, 0.8.200.2.0 and 0.8.200.1.0).

  • Conversion from simple_triplet_matrix is now supported (Serguei Sokol in #192).

  • Updated configure code to check for g++ 5.4 or later to enable OpenMP.

  • Updated the skeleton package to current packaging standards

  • Suppress warnings from Armadillo about missing OpenMP support and -fopenmp flags by setting ARMA_DONT_PRINT_OPENMP_WARNING

Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



jmv – one R package (not just) for the social sciences

Thu, 12/07/2017 - 00:00

(This article was first published on jamovi, and kindly contributed to R-bloggers)

tl;dr
  • many analyses in the social sciences require many R packages
  • jmv makes these common analyses available from one R package
  • jmv can be used from jamovi, a graphical statistical spreadsheet, making it super-accessible

introducing jmv

There are many tremendously useful R packages for the social sciences (and similar fields), such as car, afex, vcd, etc. Although running basic analyses (such as t-tests or ANOVA) with these packages is straightforward, it is typically necessary to perform a number of supplementary analyses to accompany them: post-hoc tests, effect-size calculations, bias corrections, and assumption checks. These additional tests often require the use of many additional R packages, and can make reasonably standard analyses quite time-consuming to perform. For example, in the book Discovering Statistics Using R by Andy Field (a popular textbook in the social sciences), the chapter on ANOVA alone recommends the use of 7 packages.

jmv simplifies this whole process by bringing all of these packages together, making the following analyses (with their most common supplementary tests, corrections and assumption checks) as easy as possible:

  • Descriptive statistics
  • T-Tests
  • ANOVA
  • ANCOVA
  • Repeated Measures ANOVA
  • Non-parametric ANOVAs
  • Correlation
  • Linear Regression
  • Contingency Tables
  • Proportion Tests
  • Factor Analysis

and coming soon:

  • Logistic Regression
  • Log-linear Regression

jmv aims to make all common statistical tests taught at an undergraduate level available from a single package.

An ANOVA

Let’s begin with a simple, familiar analysis – an ANOVA. In this example we use the ToothGrowth dataset from R, and explore whether different food supplements and their dosage affect how much a guinea pig’s teeth grow. We’ll specify len to be the dependent variable, and supp and dose to be the factors.

library('jmv')
data('ToothGrowth')

jmv::anova(ToothGrowth, dep = 'len', factors = c('supp', 'dose'))

## 
##  ANOVA
## 
##  ANOVA
##  ───────────────────────────────────────────────────────────────────────
##                 Sum of Squares    df    Mean Square    F        p
##  ───────────────────────────────────────────────────────────────────────
##    supp                    205     1          205.4    15.57    < .001
##    dose                   2426     2         1213.2    92.00    < .001
##    supp:dose               108     2           54.2     4.11     0.022
##    Residuals               712    54           13.2
##  ───────────────────────────────────────────────────────────────────────

This produces what should be a familiar ANOVA table. You have likely seen something like this in R before, though perhaps not as nicely formatted.

Where jmv really comes into its own is with its additional options. In the following example we will perform the same analysis, but additionally request an effect-size, post-hoc tests, homogeneity of variances tests, descriptive statistics, and a descriptives plot:

library('jmv')
data('ToothGrowth')

jmv::anova(ToothGrowth,
           dep = 'len',
           factors = c('supp', 'dose'),
           effectSize = 'eta',
           postHoc = c('supp', 'dose'),
           plotHAxis = 'dose',
           plotSepLines = 'supp',
           descStats = TRUE,
           homo = TRUE)

## 
##  ANOVA
## 
##  ANOVA
##  ────────────────────────────────────────────────────────────────────────────────
##                 Sum of Squares    df    Mean Square    F        p         η²
##  ────────────────────────────────────────────────────────────────────────────────
##    supp                    205     1          205.4    15.57    < .001    0.059
##    dose                   2426     2         1213.2    92.00    < .001    0.703
##    supp:dose               108     2           54.2     4.11     0.022    0.031
##    Residuals               712    54           13.2
##  ────────────────────────────────────────────────────────────────────────────────
## 
## 
##  ASSUMPTION CHECKS
## 
##  Test for Homogeneity of Variances (Levene's)
##  ────────────────────────────────────────────
##    F       df1    df2    p
##  ────────────────────────────────────────────
##    1.94      5     54    0.103
##  ────────────────────────────────────────────
## 
## 
##  POST HOC TESTS
## 
##  Post Hoc Comparisons - supp
##  ────────────────────────────────────────────────────────────────────────────
##    supp         supp    Mean Difference    SE       df      t       p-tukey
##  ────────────────────────────────────────────────────────────────────────────
##    OJ      -    VC                 3.70    0.938    54.0    3.95     < .001
##  ────────────────────────────────────────────────────────────────────────────
## 
## 
##  Post Hoc Comparisons - dose
##  ─────────────────────────────────────────────────────────────────────────────
##    dose         dose    Mean Difference    SE      df      t          p-tukey
##  ─────────────────────────────────────────────────────────────────────────────
##    0.5     -    1                 -9.13    1.15    54.0     -7.95      < .001
##            -    2                -15.50    1.15    54.0    -13.49      < .001
##    1       -    2                 -6.37    1.15    54.0     -5.54      < .001
##  ─────────────────────────────────────────────────────────────────────────────
## 
## 
##  Descriptives
##  ───────────────────────────────────────
##    supp    dose    N     Mean     SD
##  ───────────────────────────────────────
##    OJ      0.5     10    13.23    4.46
##    OJ      1       10    22.70    3.91
##    OJ      2       10    26.06    2.66
##    VC      0.5     10     7.98    2.75
##    VC      1       10    16.77    2.52
##    VC      2       10    26.14    4.80
##  ───────────────────────────────────────

As can be seen, jmv can provide many additional tests and statistics relevant to the main tests, but with far less effort.

You can explore additional options for the jmv ANOVA here, and the other tests and their available options here.

jamovi integration

jmv is also usable from the jamovi statistical spreadsheet. jamovi makes a range of analyses accessible to a broader audience by presenting them in a familiar, spreadsheet user interface. jamovi can also show the underlying R code for each analysis, making it easy for people to learn R and transition to R scripting if they like.

Here is exactly the same analysis as above, having been performed in jamovi.

summing up
  • jmv makes a whole suite of common analyses from the social sciences very easy to perform
  • jamovi makes these even easier to perform

jmv is available from CRAN

jamovi is available from www.jamovi.org



The British Ecological Society’s Guide to Reproducible Science

Wed, 12/06/2017 - 23:26

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The British Ecological Society has published a new volume in their Guides to Better Science series: A Guide to Reproducible Code in Ecology and Evolution (pdf). The introduction describes its scope:

A Guide to Reproducible Code covers all the basic tools and information you will need to start making your code more reproducible. We focus on R and Python, but many of the tips apply to any programming language. Anna Krystalli introduces some ways to organise files on your computer and to document your workflows. Laura Graham writes about how to make your code more reproducible and readable. François Michonneau explains how to write reproducible reports. Tamora James breaks down the basics of version control. Finally, Mike Croucher describes how to archive your code. We have also included a selection of helpful tips from other scientists.

The guide proposes a simple reproducible project workflow, and a guide to organizing projects for reproducibility. The Programming section provides concrete tips and traps to avoid (example: use relative, not absolute pathnames), and the Reproducible Reports section provides a step-by-step guide for generating reports with R Markdown.
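As a small illustration of that first tip (not code from the guide; the file name is made up), a relative path keeps a script portable across machines:

# brittle: an absolute path only works on the original author's machine
# dat <- read.csv("C:/Users/me/projects/ecology/data/counts.csv")

# portable: a path relative to the project directory
dat <- read.csv(file.path("data", "counts.csv"))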

While written for an ecology audience (and also including some gorgeous photography of animals), this guide would be useful for anyone in the sciences looking to implement a reproducible workflow. You can download the guide at the link below.

British Ecological Society: A Guide to Reproducible Code in Ecology and Evolution (via Laura Graham)



14 Jobs for R users from around the world (2017-12-06)

Wed, 12/06/2017 - 17:07
To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs
More New R Jobs
  1. Full-Time
    Quantitative Market Research Analyst Gradient Metrics – Posted by kyblock
    Anywhere
    6 Dec 2017
  2. Full-Time
    Data Scientist in the Institute for Economics and Peace Institute for Economics and Peace – Posted by Institute for Economics and Peace
    Sydney New South Wales, Australia
    5 Dec 2017
  3. Full-Time
    Data Innovation Specialist /Technology Analyst- 20509 jessicaxtan
    Washington District of Columbia, United States
    30 Nov 2017
  4. Full-Time
    Data Scientist: Mobility Services Parkbob – Posted by ivan.kasanicky
    Wien Wien, Austria
    29 Nov 2017
  5. Freelance
    Statistician/Econometrician – R Programmer for Academic Statistical Research Academic Research – Posted by empiricus
    Anywhere
    29 Nov 2017
  6. Full-Time
    Senior Data Analyst @ Bangkok, Thailand Agoda – Posted by atrapassi
    Bangkok Krung Thep Maha Nakhon, Thailand
    28 Nov 2017
  7. Full-Time
    R Shiny Dashboard Engineer in Health Tech Castor EDC – Posted by Castor EDC
    Amsterdam-Zuidoost Noord-Holland, Netherlands
    24 Nov 2017
  8. Full-Time
    Data Analyst: Travel Behavior RSG – Posted by patricia.holland@rsginc.com
    San Diego California, United States
    22 Nov 2017
  9. Full-Time
    R&D Database Developer @ Toronto, Canada Crescendo Technology Ltd – Posted by Crescendo
    Toronto Ontario, Canada
    17 Nov 2017
  10. Full-Time
    Customer Success Representative RStudio, Inc. – Posted by jclemens1
    Anywhere
    17 Nov 2017
  11. Full-Time
    Data Science Engineer Bonify – Posted by arianna@meritocracy
    Berlin Berlin, Germany
    17 Nov 2017
  12. Part-Time
    Development of User-Defined Calculations and Graphical Outputs for WHO’s Influenza Data World Health Organization – Posted by aspenh
    Anywhere
    6 Nov 2017
  13. Full-Time
    Data Scientist for H Labs @ Chicago, Illinois, United States Heidrick & Struggles – Posted by Heidrick1
    Chicago Illinois, United States
    2 Nov 2017
  14. Full-Time
    Business Data Analytics Faculty Maryville University – Posted by tlorden
    St. Louis Missouri, United States
    2 Nov 2017

 

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(You may also look at previous R jobs posts.)


How to Avoid the dplyr Dependency Driven Result Corruption

Wed, 12/06/2017 - 16:27

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

In our last article we pointed out a dangerous silent result corruption we have seen when using the R dplyr package with databases.

To systematically avoid this result corruption, we suggest breaking up your dplyr::mutate() statements so that they are dependency-free (not assigning the same value twice, and not using any value in the same mutate() in which it is formed). We consider these key and critical precautions to take when using dplyr with a database.
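As a small illustration of that precaution (a toy in-memory example, not Win-Vector's own code; the same idea applies to database-backed tables, where the problem was observed):

library(dplyr)

d <- tibble::tibble(x = 1:3)

# risky pattern: 'y' is both created and used inside a single mutate()
# d %>% mutate(y = x + 1, z = y * 2)

# dependency-free alternative: one layer of assignments per mutate()
d %>%
  mutate(y = x + 1) %>%
  mutate(z = y * 2)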

We would also like to point out that we are distributing free tools to do this automatically, along with a worked example of this solution.



Mapping lithium production using R

Wed, 12/06/2017 - 12:15

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)



A great thing about data science – and data visualization in particular – is that you can use data science tools to get information about a subject very quickly.

This is particularly true when you know how to present the information. When you know the best technique for visualizing data, you can use those tools to present information in a way that can help you increase your understanding very quickly.

Here’s a case in point: I was thinking about energy, solar energy, and battery technology, and I got curious about sources of lithium (which is used in batteries and related tech).

Using a few tools from R, I created a map of lithium production within about 15 minutes. The maps that I created certainly don’t tell the full story, but they at least provide a baseline of knowledge.

If you’re fluent with the tools and techniques of data science, this becomes possible. Whether you’re just doing some personal research, or working for a multinational corporation, you can use the tools of data science to quickly identify and present insights about the world.

Let me walk you through this example and show you how I did it.

Tutorial: how to map lithium production data using R

First, we’ll load the packages that we’re going to use.

We’ll load tidyverse mostly for access to ggplot2 and dplyr, but we’ll also load rvest (for data scraping), stringr (to help us manipulate our variables and put the data into shape), and viridis (to help us modify the color of the final plot).

#--------------
# LOAD PACKAGES
#--------------

library(tidyverse)
library(rvest)
library(stringr)
library(viridis)

Ok. Now, we’re going to scrape this mineral production data from the wikipedia page.

Notice that we’re essentially using several rvest functions in series. We’re using several functions and combining them together using the “pipe” operator, %>%.

If you're not using it yet, you should definitely familiarize yourself with this operator and start using it. It's beyond the scope of this blog post to talk extensively about the pipe operator, but I will say that it's one of the most useful tools in the R data scientist's toolkit. If you learn to use it properly, it will make your code easier to read and easier to write. It will even train your mind to think about analysis in a more step-by-step way.
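As a quick illustration of how the pipe reads (a toy example, not part of the tutorial itself; the pipe comes from the packages loaded above):

# The pipe passes the left-hand result into the next function call,
# so a nested expression can be read left to right instead of inside out.
x <- c(1.2, 2.5, 3.8)

round(mean(x), 1)          # nested form
x %>% mean() %>% round(1)  # equivalent piped form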

Concerning the code: what we’re doing here is designating the URL from which we’re going to scrape the data, then we’re specifying that we’ll be scraping one of the tables. Then, we specify that we’re going to scrape data from the 9th table, and finally we coerce the data into a tibble (instead of keeping it as a traditional data.frame).

#---------------------------
# SCRAPE DATA FROM WIKIPEDIA
#---------------------------

df.lithium <- read_html("https://en.wikipedia.org/wiki/Lithium") %>%
  html_nodes("table") %>%
  .[[9]] %>%
  html_table() %>%
  as.tibble()

# INSPECT
df.lithium

The resultant dataset, df.lithium, is relatively small (which makes it easy to inspect and easy to work with), but in its raw form, it’s a little messy. We’ll need to do a few things to clean up the data, like change the variable names, parse some data into numerics, etc.

So first, let’s change the column names.

There are a few ways we could do this, but the most straightforward is to simply pass a vector of manually-defined column names into the colnames() function.

#--------------------------------------------
# CHANGE COLUMN NAMES
# - the raw column names are capitalized and
#   have some extra information
# - we will just clean them up
#--------------------------------------------

colnames(df.lithium) <- c('country', 'production', 'reserves', 'resources')

colnames(df.lithium)

Now, we’ll remove an extraneous row of data. The original data table on Wikipedia contained not only the individual records of lithium production for particular countries, but it also contained a “total” row at the bottom of the table. More often than not, these sorts of “total” rows are not appropriate for a data.frame in R; we’ll typically remove them, just as we will do here.

#-----------------------------------------------
# REMOVE "World total"
# - this is a total amount that was
#   in the original data table
# - we need to remove, because it's not a
#   proper data record for a particular country
#-----------------------------------------------

df.lithium <- df.lithium %>% filter(country != 'World total')

df.lithium

Next, we need to parse the numbers into actual numeric data. The reason is that when we scraped the data, it actually read in the numbers as character data, along with commas and some extra characters. We need to transform this character data into proper numeric data in R.

To do this, we'll need to do a few things. First, we need to remove a few "notes" that were in the original data. This is why we're using str_replace(production, "W\\[.*\\]", "-"): it strips those notes out of the data.

After that, we use parse_number(production, na = '-') to transform the three variables (production, reserves, resources) into numerics.

Note once again how we’re structuring this code. We’re using a combination of functions from dplyr and stringr to achieve our objectives.

To a beginner, this might look complicated, but it's really not that bad once you understand the individual pieces. If you don't understand this code (or couldn't write it yourself), I recommend that you learn the individual functions from dplyr and stringr first, and then come back to this once you've learned those pieces.

#---------------------------------------------------------
# PARSE NUMBERS
# - the original numeric quantities in the table
#   were read-in as character data
# - we need to "parse" this information ....
#   & transform it from character into proper numeric data
#---------------------------------------------------------

# Strip out the 'notes' from the numeric data
#str_replace(df.lithium$production,"W\\[.*\\]", "")  #test

df.lithium <- df.lithium %>%
  mutate(production = str_replace(production,"W\\[.*\\]", "-"))

# inspect
df.lithium

# Parse character data into numbers
df.lithium <- df.lithium %>%
  mutate(production = parse_number(production, na = '-')
         ,reserves = parse_number(reserves, na = '-')
         ,resources = parse_number(resources, na = '-')
         )

# Inspect
df.lithium

Now we’ll get data for a map of the world. To do this, we’ll just use map_data().

#--------------
# GET WORLD MAP
#--------------

map.world <- map_data('world')

We’ll also get the names of the countries in this dataset.

The reason is that we'll need to join this map data to the data from Wikipedia, and we'll need the country names to be exactly the same. To make this work, we'll need to examine the names in both datasets and modify any names that aren't exactly the same.

Notice that once again, we’re using a combination of functions from dplyr, wiring them together using the pipe operator.

#----------------------------------------------------
# Get country names
# - we can use this list and cross-reference
#   with the country names in the scraped data
# - when we find names that are not the same between
#   this map data and the scraped data, we can recode
#   the values
#----------------------------------------------------

map_data('world') %>%
  group_by(region) %>%
  summarise() %>%
  print(n = Inf)

Ok. Now we're going to recode some country names. Again, we're doing this so that the country names in df.lithium are the same as the corresponding country names in map.world.

#--------------------------------------------
# RECODE COUNTRY NAMES
# - some of the country names do not match
#   the names we will use later in our map
# - we will re-code so that the data matches
#   the names in the world map
#--------------------------------------------

df.lithium <- df.lithium %>%
  mutate(country = if_else(country == "Canada (2010)", 'Canada'
                   ,if_else(country == "People's Republic of China", "China"
                   ,if_else(country == "United States", "USA"
                   ,if_else(country == "DR Congo", "Democratic Republic of the Congo", country))))
         )

# Inspect
df.lithium

Ok, now we’ll join the data using dplyr::left_join().

#-----------------------------------------
# JOIN DATA
# - join the map data and the scraped-data
#-----------------------------------------

df <- left_join(map.world, df.lithium, by = c('region' = 'country'))

Now we’ll plot.

We’ll start with just a basic plot (to make sure that the map plots correctly), and then we’ll proceed to plot separate maps where the fill color corresponds to reserves, production, and resources.

#-----------
# PLOT DATA
#-----------

# BASIC MAP
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon()

# LITHIUM RESERVES
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = reserves))

ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = reserves)) +
  scale_fill_viridis(option = 'plasma')

# LITHIUM PRODUCTION
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = production)) +
  scale_fill_viridis(option = 'plasma')

# LITHIUM RESOURCES
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = resources)) +
  scale_fill_viridis(option = 'plasma')







In the final three versions, notice as well that we’re modifying the color scales by using scale_fill_viridis().

There’s actually quite a bit more formatting that we could do on these, but as a first pass, these are pretty good.

I’ll leave it as an exercise for you to format these with titles, background colors, etc. If you choose to do this, leave your finalized code in the comments section below.

To master data science, you need a plan

At several points in this tutorial, I’ve mentioned a high level plan for mastering data science: master individual pieces of a programming language, and then learn to put them together into more complicated structures.

If you can do this, you will accelerate your progress … although, the devil is in the details.

That’s actually not the only learning hack that you can use to rapidly master data science. There are lots of other tricks and learning hacks that you can use to dramatically accelerate your progress.

Want to know them?

Sign up for our email list.

Here at Sharp Sight, we teach data science. But we also teach you how to learn and how to study data science, so you master the tools as quickly as possible.

By signing up for our email list, you’ll get weekly tutorials about data science, delivered directly to your inbox.

You’ll also get our Data Science Crash Course, for free.

SIGN UP NOW

The post Mapping lithium production using R appeared first on SHARP SIGHT LABS.



Package GetDFPData

Wed, 12/06/2017 - 09:00

(This article was first published on R and Finance, and kindly contributed to R-bloggers)

Downloading Annual Financial Reports and Corporate Events from B3 (formerly Bovespa)

Financial statements of companies traded at B3 (formerly Bovespa), the Brazilian stock exchange, are available on its website. Accessing the data for a single company is straightforward: the website offers a simple interface for accessing this dataset. An example is given here. However, gathering and organizing the data for large scale research, with many companies and many dates, is painful. Financial reports must be downloaded or copied individually and later aggregated. Changes in the accounting format throughout time can make this process slow, unreliable and irreproducible.

Package GetDFPData provides an R interface to all annual financial statements available on the website, and more. It not only downloads the data but also organizes it in a tabular format and allows the use of inflation indexes. Users can select companies and a time period to download all available data. Several pieces of information about current companies, such as sector and available quarters, are also within reach. The main purpose of the package is to make it easy to access financial statements in large scale research, facilitating the reproducibility of corporate finance studies with B3 data.

The positive aspects of GetDFPData are:

  • Easy and simple R and web interface
  • Changes in accounting format are internally handled by the software
  • Access to corporate events in the FRE system such as dividend
    payments, changes in stock holder composition, changes in governance
    listings, board composition and compensation, debt composition, and
    a lot more!
  • The output data is automatically organized using tidy data
    principles (long format)
  • A cache system is employed for fast data acquisition
  • Completely free and open source!
Installation

The package is not yet available on CRAN (release version), but the development version is available on Github. You can install it with the following code:

# Release version in CRAN
install.packages('GetDFPData') # not in CRAN yet

# Development version in Github
devtools::install_github('msperlin/GetDFPData')

Shiny interface

The web interface of GetDFPData is available at
http://www.msperlin.com/shiny/GetDFPData/.

How to use GetDFPData

The starting point of GetDFPData is to find the official names of companies in B3. Function gdfpd.search.company serves this purpose. Given a string (text), it will search for partial matches in company names. As an example, let's find the official name of Petrobras, one of the largest companies in Brazil:

library(GetDFPData)
library(tibble)

gdfpd.search.company('petrobras', cache.folder = tempdir())

## 
## Reading info file from github
## Found 43873 lines for 687 companies [Actives = 521 Inactives = 167 ]
## Last file update: 2017-10-19
## Caching RDATA into tempdir()
## 
## Found 1 companies:
## PETRÓLEO BRASILEIRO S.A. - PETROBRAS | situation = ATIVO | first date = 1998-12-31 | last date - 2016-12-31

## [1] "PETRÓLEO BRASILEIRO S.A. - PETROBRAS"

Its official name in Bovespa records is PETRÓLEO BRASILEIRO S.A. - PETROBRAS. Data for quarterly and annual statements are available from 1998 to 2017. The situation of the company, active or canceled, is also given. This helps in verifying the availability of data.

The content of all available financial statements can be accessed with function gdfpd.get.info.companies. It will read and parse a .csv file from my github repository. This file will be periodically updated with new information. Let's try it out:

df.info <- gdfpd.get.info.companies(type.data = 'companies', cache.folder = tempdir())

## 
## Reading info file from github
## Found 43873 lines for 687 companies [Actives = 521 Inactives = 167 ]
## Last file update: 2017-10-19
## Caching RDATA into tempdir()

glimpse(df.info)

## Observations: 689
## Variables: 8
## $ name.company    "521 PARTICIPAÇOES S.A. - EM LIQUIDAÇÃO EXTRAJ...
## $ id.company      16330, 16284, 108, 20940, 21725, 19313, 18970,...
## $ situation       "ATIVO", "ATIVO", "CANCELADA", "CANCELADA", "A...
## $ listing.segment NA, "None", "None", "None", "None", "None", "C...
## $ main.sector     NA, "Financeiro e Outros", "Materiais Básicos"...
## $ tickers         NA, "QVQP3B", NA, NA, NA, "AELP3", "TIET11;TIE...
## $ first.date      1998-12-31, 2001-12-31, 2009-12-31, 2009-12-3...
## $ last.date       2016-12-31, 2016-12-31, 2009-12-31, 2009-12-3...

This file includes several pieces of information gathered from Bovespa: names of companies, official numeric ids, listing segment, sectors, traded tickers and, most importantly, the available dates. The resulting dataframe can be used to filter and gather information for large scale research, such as downloading financial data for a specific sector.
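For example, a hedged sketch of that kind of filtering (the column names are the ones shown in the glimpse output above; the sector string is just an illustrative value taken from that output):

library(dplyr)

# pick the official names of all active companies in one example sector
my.sector <- 'Financeiro e Outros'
my.names <- df.info %>%
  filter(main.sector == my.sector, situation == 'ATIVO') %>%
  pull(name.company)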

Downloading financial information for ONE company

All you need to download financial data with GetDFPData are the
official names of companies, which can be found with
gdfpd.search.company, the desired starting and ending dates and the
type of financial information (individual or consolidated). Let’s try it
for PETROBRAS:

name.companies <- 'PETRÓLEO BRASILEIRO S.A. - PETROBRAS'

first.date <- '2004-01-01'
last.date <- '2006-01-01'

df.reports <- gdfpd.GetDFPData(name.companies = name.companies,
                               first.date = first.date,
                               last.date = last.date,
                               cache.folder = tempdir())

## Found cache file. Loading data..
## 
## Downloading data for 1 companies
## First Date: 2004-01-01
## Laste Date: 2006-01-01
## Inflation index: dollar
## 
## Downloading inflation data
## Caching inflation RDATA into tempdir() Done
## 
## 
## WARNING: Cash flow statements are not available before 2009
## 
## Inputs looking good! Starting download of files:
## 
## PETRÓLEO BRASILEIRO S.A. - PETROBRAS
## Available periods: 2005-12-31 2004-12-31
## 
## 
## Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS
## Finding info from Bovespa | downloading and reading data | saving cache
## Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS | date 2005-12-31
## Acessing DFP data | downloading file | reading file | saving cache
## Acessing FRE data | No FRE file available..
## Acessing FCA data | No FCA file available..
## Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS | date 2004-12-31
## Acessing DFP data | downloading file | reading file | saving cache
## Acessing FRE data | No FRE file available..
## Acessing FCA data | No FCA file available..

The resulting object is a tibble, a data.frame type of object that
allows for list columns. Let’s have a look in its content:

glimpse(df.reports)

## Observations: 1
## Variables: 33
## $ company.name                  "PETRÓLEO BRASILEIRO S.A. - PE...
## $ company.code                  9512
## $ company.tickers               "PETR3;PETR4"
## $ min.date                      2004-12-31
## $ max.date                      2005-12-31
## $ n.periods                     2
## $ company.segment               "Tradicional"
## $ current.stockholders          [ [ [ [ [ [ [ [ [ [ [ [NULL]
## $ history.capital.issues        [NULL]
## $ history.mkt.value             [NULL]
## $ history.capital.increases     [NULL]
## $ history.capital.reductions    [NULL]
## $ history.stock.repurchases     [NULL]
## $ history.other.stock.events    [NULL]
## $ history.compensation          [NULL]
## $ history.compensation.summary  [NULL]
## $ history.transactions.related  [NULL]
## $ history.debt.composition      [NULL]
## $ history.governance.listings   [NULL]
## $ history.board.composition     [NULL]
## $ history.committee.composition [NULL]
## $ history.family.relations      [NULL]

Object df.reports only has one row since we only asked for data of one company. The number of rows increases with the number of companies, as we will soon see in the next example. All financial statements for the different years are available within df.reports. For example, the income statements for all desired years of PETROBRAS are:

df.income.long <- df.reports$fr.income[[1]]

glimpse(df.income.long)

## Observations: 48
## Variables: 6
## $ name.company       "PETRÓLEO BRASILEIRO S.A. - PETROBRAS", "...
## $ ref.date           2005-12-31, 2005-12-31, 2005-12-31, 2005-1...
## $ acc.number         "3.01", "3.02", "3.03", "3.04", "3.05", "3....
## $ acc.desc           "Receita Bruta de Vendas e/ou Serviços", "D...
## $ acc.value          143665730, -37843204, 105822526, -57512113,...
## $ acc.value.infl.adj 61398234.97, -16173000.56, 45225234.41, -24...

The resulting dataframe is in the long format, ready for processing. In the long format, financial statements of different years are stacked. In the wide format, we have the years as columns of the table.

If you want the wide format, which is the most common way that financial
reports are presented, you can use function gdfpd.convert.to.wide. See
an example next:

df.income.wide <- gdfpd.convert.to.wide(df.income.long)

knitr::kable(df.income.wide)

| acc.number | acc.desc                                 | name.company                         | 2004-12-31 | 2005-12-31 |
|------------|------------------------------------------|--------------------------------------|-----------:|-----------:|
| 3.01       | Receita Bruta de Vendas e/ou Serviços    | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |  120024727 |  143665730 |
| 3.02       | Deduções da Receita Bruta                | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |  -34450292 |  -37843204 |
| 3.03       | Receita Líquida de Vendas e/ou Serviços  | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |   85574435 |  105822526 |
| 3.04       | Custo de Bens e/ou Serviços Vendidos     | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |  -48607576 |  -57512113 |
| 3.05       | Resultado Bruto                          | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |   36966859 |   48310413 |
| 3.06       | Despesas/Receitas Operacionais           | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |  -11110540 |  -14810467 |
| 3.06.01    | Com Vendas                               | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |   -2858630 |   -4195157 |
| 3.06.02    | Gerais e Administrativas                 | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |   -2599552 |   -3453753 |
| 3.06.03    | Financeiras                              | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |   -1019901 |     126439 |
| 3.06.04    | Outras Receitas Operacionais             | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |          0 |          0 |
| 3.06.05    | Outras Despesas Operacionais             | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |   -5982336 |   -9070019 |
| 3.06.06    | Resultado da Equivalência Patrimonial    | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |    1349879 |    1782023 |
| 3.07       | Resultado Operacional                    | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |   25856319 |   33499946 |
| 3.08       | Resultado Não Operacional                | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |    -550694 |    -199982 |
| 3.08.01    | Receitas                                 | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |      46611 |    1256194 |
| 3.08.02    | Despesas                                 | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |    -597305 |   -1456176 |
| 3.09       | Resultado Antes Tributação/Participações | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |   25305625 |   33299964 |
| 3.10       | Provisão para IR e Contribuição Social   | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |   -5199166 |   -8581490 |
| 3.11       | IR Diferido                              | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |   -1692288 |    -422392 |
| 3.12       | Participações/Contribuições Estatutárias | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |    -660000 |    -846000 |
| 3.12.01    | Participações                            | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |    -660000 |    -846000 |
| 3.12.02    | Contribuições                            | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |          0 |          0 |
| 3.13       | Reversão dos Juros sobre Capital Próprio | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |          0 |          0 |
| 3.15       | Lucro/Prejuízo do Exercício              | PETRÓLEO BRASILEIRO S.A. – PETROBRAS |   17754171 |   23450082 |

Downloading financial information for SEVERAL companies

If you are doing serious research, it is likely that you need financial statements for more than one company. Package GetDFPData is especially designed for handling large scale downloads of data. Let's build a case with two selected companies:

my.companies <- c('PETRÓLEO BRASILEIRO S.A. - PETROBRAS',
                  'BANCO DO ESTADO DO RIO GRANDE DO SUL SA')

first.date <- '2005-01-01'
last.date <- '2007-01-01'
type.statements <- 'individual'

df.reports <- gdfpd.GetDFPData(name.companies = my.companies,
                               first.date = first.date,
                               last.date = last.date,
                               cache.folder = tempdir())

## Found cache file. Loading data..
## 
## Downloading data for 2 companies
## First Date: 2005-01-01
## Laste Date: 2007-01-01
## Inflation index: dollar
## 
## Downloading inflation data
## Found cache file. Loading data.. Done
## 
## 
## WARNING: Cash flow statements are not available before 2009
## 
## Inputs looking good! Starting download of files:
## 
## BANCO DO ESTADO DO RIO GRANDE DO SUL SA
## Available periods: 2006-12-31 2005-12-31
## PETRÓLEO BRASILEIRO S.A. - PETROBRAS
## Available periods: 2006-12-31 2005-12-31
## 
## 
## Processing 1210 - BANCO DO ESTADO DO RIO GRANDE DO SUL SA
## Finding info from Bovespa | downloading and reading data | saving cache
## Processing 1210 - BANCO DO ESTADO DO RIO GRANDE DO SUL SA | date 2006-12-31
## Acessing DFP data | downloading file | reading file | saving cache
## Acessing FRE data | No FRE file available..
## Acessing FCA data | No FCA file available..
## Processing 1210 - BANCO DO ESTADO DO RIO GRANDE DO SUL SA | date 2005-12-31
## Acessing DFP data | downloading file | reading file | saving cache
## Acessing FRE data | No FRE file available..
## Acessing FCA data | No FCA file available..
## Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS
## Finding info from Bovespa
## Found cache file /tmp/RtmpSpLsOP/9512_PETRÓLEO/GetDFPData_BOV_cache_9512_PETR.rds
## Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS | date 2006-12-31
## Acessing DFP data | downloading file | reading file | saving cache
## Acessing FRE data | No FRE file available..
## Acessing FCA data | No FCA file available..
## Processing 9512 - PETRÓLEO BRASILEIRO S.A. - PETROBRAS | date 2005-12-31
## Acessing DFP data | Found DFP cache file
## Acessing FRE data | No FRE file available..
## Acessing FCA data | No FCA file available..

And now we can check the resulting tibble:

glimpse(df.reports)

## Observations: 2
## Variables: 33
## $ company.name                  "BANCO DO ESTADO DO RIO GRANDE D...
## $ company.code                  1210, 9512
## $ company.tickers               "BRSR3;BRSR5;BRSR6", "PETR3;PETR4"
## $ min.date                      2005-12-31, 2005-12-31
## $ max.date                      2006-12-31, 2006-12-31
## $ n.periods                     2, 2
## $ company.segment               "Corporate Governance - Level 1"...
## $ current.stockholders          [ [ [ [ [ [ [ [ [ [ [ [NULL, NULL]
## $ history.capital.issues        [NULL, NULL]
## $ history.mkt.value             [NULL, NULL]
## $ history.capital.increases     [NULL, NULL]
## $ history.capital.reductions    [NULL, NULL]
## $ history.stock.repurchases     [NULL, NULL]
## $ history.other.stock.events    [NULL, NULL]
## $ history.compensation          [NULL, NULL]
## $ history.compensation.summary  [NULL, NULL]
## $ history.transactions.related  [NULL, NULL]
## $ history.debt.composition      [NULL, NULL]
## $ history.governance.listings   [NULL, NULL]
## $ history.board.composition     [NULL, NULL]
## $ history.committee.composition [NULL, NULL]
## $ history.family.relations      [NULL, NULL]

Every row of df.reports will provide information for one company.
Metadata about the corresponding dataframes such as min/max dates is
available in the first columns. Keeping a tabular structure facilitates
the organization and future processing of all financial data. We can use
tibble df.reports for creating other dataframes in the long format
containing data for all companies. See next, where we create dataframes
with the assets and liabilities of all companies:

df.assets <- do.call(what = rbind, args = df.reports$fr.assets)
df.liabilities <- do.call(what = rbind, args = df.reports$fr.liabilities)

df.assets.liabilities <- rbind(df.assets, df.liabilities)

As an example, let's use the resulting dataframe to calculate and analyze a simple liquidity index of a company: the total of current (liquid) assets (Ativo Circulante) divided by the total of current short term liabilities (Passivo Circulante), over time.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

my.tab <- df.assets.liabilities %>%
  group_by(name.company, ref.date) %>%
  summarise(Liq.Index = acc.value[acc.number == '1.01'] /
                        acc.value[acc.number == '2.01'])

my.tab

## # A tibble: 3 x 3
## # Groups:   name.company [?]
##                               name.company   ref.date Liq.Index
## 
## 1  BANCO DO ESTADO DO RIO GRANDE DO SUL SA 2006-12-31 0.7251432
## 2     PETRÓLEO BRASILEIRO S.A. - PETROBRAS 2005-12-31 0.9370813
## 3     PETRÓLEO BRASILEIRO S.A. - PETROBRAS 2006-12-31 0.9733600

Now we can visualize the information using ggplot2:

library(ggplot2)

p <- ggplot(my.tab, aes(x = ref.date, y = Liq.Index, fill = name.company)) +
  geom_col(position = 'dodge')

print(p)

Exporting financial data

The package includes function gdfpd.export.DFP.data for exporting the financial data to Excel or zipped csv files. See next:

my.basename <- 'MyExcelData'
my.format <- 'csv' # only supported so far

gdfpd.export.DFP.data(df.reports = df.reports,
                      base.file.name = my.basename,
                      type.export = my.format)

The resulting file contains all data available in df.reports.



The Cost of True Love (a.k.a. The Tidy — and expensive! — Twelve Days of Christmas)

Tue, 12/05/2017 - 19:13

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I’m in the market for Christmas presents for my true love, @mrshrbrmstr, and thought I’d look to an age-old shopping list for inspiration. Just what would it set me back if I decided to mimic the 12 Days of Christmas in this modern day and age?

Let’s try to do the whole thing in R (of course!).

We’ll need to:

  • Grab the lyrics
  • Parse the lyrics
  • Get pricing data
  • Compute some statistics
  • Make some (hopefully) pretty charts

This one was complex enough formatting-wise that I needed to iframe it below. Feel free to bust out of the iframe at any time.

Some good follow-ups to this (if you’re so inclined) would be to predict prices next year and/or clean up the charts a bit.

Grab the code up on GitHub.

(Note: ColourLovers API times out occasionally so just try that snippet again if you get an error).



Computing wind average in an area using rWind

Tue, 12/05/2017 - 17:54

(This article was first published on long time ago..., and kindly contributed to R-bloggers)


Hi all! A researcher asked me last week how to compute the wind average for an area using rWind. I wrote a simple function to do this using the dplyr library (following the advice of my friend Javier Fajardo). The function will be added to the rWind package as soon as possible. Meanwhile, you can test the results… enjoy!

# First, load the new function
library(dplyr)

wind.region <- function(X) {
  # wrap the values in the third column into the [-180, 180) range
  X[, 3] <- X[, 3] %% 360
  X[X[, 3] >= 180, 3] <- X[X[, 3] >= 180, 3] - 360
  # average every column except the first
  avg <- summarise_all(X[, -1], .funs = mean)
  # keep the first column's value (first row) alongside the averages
  wind_region <- cbind(X[1, 1], avg)
  return(wind_region)
}

Once you have loaded the function, let's work through an example.

# Get some wind data and convert it into a raster to be plotted later
library(rWind)
library(raster)

wind_data <- wind.dl(2015, 2, 12, 0, -10, 5, 35, 45)
wind_fitted_data <- wind.fit(wind_data)
r_speed <- wind2raster(wind_fitted_data, type = "speed")

Now, you can use the new function to obtain the average wind in the study area:

myMean <- wind.region(wind_data)
myMean

# Now, you can use wind.fit to get wind speed and direction.
myMean_fitted <- wind.fit(myMean)
myMean_fitted

# Finally, let's plot the results!
library(rworldmap)
library(shape)

plot(r_speed)
lines(getMap(resolution = "low"), lwd = 4)

alpha <- arrowDir(myMean_fitted)
Arrowhead(myMean_fitted$lon, myMean_fitted$lat, angle = alpha,
          arr.length = 2, arr.type = "curved")
text(myMean_fitted$lon + 1, myMean_fitted$lat + 1,
     paste(round(myMean_fitted$speed, 2), "m/s"), cex = 2)



On the biases in data

Tue, 12/05/2017 - 16:15

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Whether we're developing statistical models, training machine learning recognizers, or developing AI systems, we start with data. And while the suitability of that data set is, lamentably, sometimes measured by its size, it's always important to reflect on where those data come from. Data are not neutral: the data we choose to use has profound impacts on the resulting systems we develop. A recent article in Microsoft's AI Blog discusses the inherent biases found in many data sets:

“The people who are collecting the datasets decide that, ‘Oh this represents what men and women do, or this represents all human actions or human faces.’ These are types of decisions that are made when we create what are called datasets,” she said. “What is interesting about training datasets is that they will always bear the marks of history, that history will be human, and it will always have the same kind of frailties and biases that humans have.”
Kate Crawford, Principal Researcher at Microsoft Research and co-founder of AI Now Institute.

“When you are constructing or choosing a dataset, you have to ask, ‘Is this dataset representative of the population that I am trying to model?’”
Hanna Wallach, Senior Researcher at Microsoft Research NYC. 

The article discusses the consequences of data sets that aren't representative of the populations they are meant to analyze, as well as the consequences of the lack of diversity in the fields of AI research and implementation. Read the complete article at the link below.

Microsoft AI Blog: Debugging data: Microsoft researchers look at ways to train AI systems to reflect the real world



Building a Telecom Dictionary scraping web using rvest in R

Tue, 12/05/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

One of the biggest problems in business when carrying out any analysis is the availability of data. That is where, in many cases, web scraping comes in very handy for creating the data that's required. Consider the following case: performing text analysis on textual data collected in a telecom company, as part of customer feedback or reviews, primarily requires a dictionary of telecom keywords. But such a dictionary is hard to find out of the box. Hence, as an analyst, the most obvious thing to do when such a dictionary doesn't exist is to build one. This article therefore aims to help beginners get started with web scraping with rvest in R and, at the same time, build a Telecom Dictionary by the end of the exercise.

Disclaimer

Web Scraping is not allowed by some websites. Kindly check the site’s Terms of Service before Scraping. This post is only for Educational-purpose.

A simple Google search for a Telecom Glossary results in this URL: Atis.org, from which the required Telecom Dictionary could be built.

Let us start by loading the required libraries:

#load rvest and stringr library
library(rvest)
library(stringr)
library(dplyr)

rvest is a web scraping library in R that makes it easier to write common scraping tasks (to scrape useful information from web pages) without getting our heads into XML parsing. rvest can be downloaded from CRAN, and the development version is also available on Github.

It can be seen at the above-mentioned URL that the glossary words are listed alphabetically from A to Z, with a separate link/URL for every (starting) letter. Clicking ‘A’ takes us to this link: atis.org, which lists all the keywords with starting letter ‘A’. If you look closely at the URL, the code written for one letter (here, ‘A’) can easily be replicated for the other letters, since ‘A’ is part of the URL and is the only part that changes in the links for the other alphabets.

Let us assign this URL to an R object url, which can be passed as the parameter to rvest's read_html() function.

#url whose content to be scraped
url <- 'http://www.atis.org/glossary/definitionsList.aspx?find=A&kw=0'

#extracting the content of the given url
#url_content <- read_html('http://www.atis.org/glossary/definitionsList.aspx?find=A&kw=0')
url_content <- read_html(url)

read_html() parses the html page of the given url (as its parameter) and saves the result as an xml object.

To reiterate the objective, we are trying to get the list of Telecom Keywords, and as per the screenshot above, you can see that the keywords are listed as hyperlinks in the given url.

Hyperlinks in HTML are written in the following syntax:

<a href="http://www.google.com">Google</a>

Google is the link text label that a browser would render, and when clicked it would take us to www.google.com. Hence it is evident that anchor tags on the page are what we are supposed to scrape/extract. But the issue with the current page is that it's not just the keywords that are represented as anchor text (links): there are a lot of other links (anchor tags) on the page. Hence, to extract only the required information and filter out the irrelevant information, we need to find a pattern that helps us extract only the keyword links.

Have a look at this screenshot:

This screenshot shows that while the keywords are also represented as hyperlinks (anchor text), the differentiator is the ‘id’ element in the url. Only the links of the keywords have got this ‘id’ in the url, and hence we can try extracting two pieces of information from the current page to get only the relevant information, which in our case is the keywords: 1. the href value / actual URL, and 2. the link text.

#extracting all the links from the page
links <- url_content %>% html_nodes('a') %>% html_attr('href')

#extracting all the link text from the page
text <- url_content %>% html_nodes('a') %>% html_text()

With the example hyperlink discussed earlier, the above code gives two pieces of information.

www.google.com is saved in links and Google is saved in text.

With links and text as columns, Let us build a rough dictionary (that’s not yet cleaned/filtered).

#creating a new dictionary of links and text extracted above
rough_dict <- data.frame(links, text, stringsAsFactors = F)
head(rough_dict)

                             links text
1              http://www.atis.org     
2 definitionsList.aspx?find=A&kw=0    A
3 definitionsList.aspx?find=B&kw=0    B
4 definitionsList.aspx?find=C&kw=0    C
5 definitionsList.aspx?find=D&kw=0    D
6 definitionsList.aspx?find=E&kw=0    E

As displayed above, rough_dict contains both signal (keywords) and noise (irrelevant links), and we have to filter the irrelevant links out with the ‘id’ pattern that we learnt earlier.

#filtering glossary terms leaving out irrelevant information
fair_dict <- rough_dict %>%
  filter(str_detect(links, 'id')) %>%
  select(text)
tail(fair_dict)

                           text
657                          AN
658 axial propagation constant
659            analog component
660                 axial ratio
661             analog computer
662                   axial ray

And that is how, using str_detect(), we can keep only the links with ‘id’ in them, filter out the rest, and build our fair_dict. As displayed in the above output, we have got 662 keywords just for the letter ‘A’, and the same exercise can be repeated for the other letters available on the site. The only change required is the url object. For example, like this:

url <- 'http://www.atis.org/glossary/definitionsList.aspx?find=B&kw=0'

Note the ‘B’ in it; the same can be done for the other available letters. This process could be further improved by turning the scraping step into a function and looping the function call over a character vector of all the available letters (which is beyond the scope of this article’s objective and hence left out, though a rough sketch follows below). The complete code used in this article is available on my Github.
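Purely as an illustration of that idea (the helper function name scrape_letter and the use of LETTERS are my additions, not part of the article’s code), the whole exercise could be wrapped up roughly like this:

library(rvest)
library(stringr)
library(dplyr)

#hypothetical helper: scrape the glossary keywords for one starting letter
scrape_letter <- function(letter) {
  url <- paste0('http://www.atis.org/glossary/definitionsList.aspx?find=', letter, '&kw=0')
  url_content <- read_html(url)
  links <- url_content %>% html_nodes('a') %>% html_attr('href')
  text <- url_content %>% html_nodes('a') %>% html_text()
  data.frame(links, text, stringsAsFactors = FALSE) %>%
    filter(str_detect(links, 'id')) %>%
    select(text)
}

#loop over all the letters and combine the results into one dictionary
full_dict <- bind_rows(lapply(LETTERS, scrape_letter))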

    Related Post

    1. Scraping Javascript-rendered web content using R
    2. Analysing Cryptocurrency Market in R
    3. Time Series Analysis in R Part 3: Getting Data from Quandl
    4. Pulling Data Out of Census Spreadsheets Using R
    5. Extracting Tables from PDFs in R using the Tabulizer Package

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    visualizing reassortment history using seqcombo

    Tue, 12/05/2017 - 04:09

    (This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers)

    Reassortment is an important strategy for influenza A viruses to
    introduce an HA subtype that is new to human populations, which creates
    the possibility of a pandemic.

    The diagram shown above (Figure 2 of doi:10.1038/srep25549) is widely
    used to illustrate reassortment events. Such diagrams are mostly drawn
    and edited manually, without a software tool to generate them
    automatically. Here, I implemented the hybrid_plot function for
    producing publication-quality figures of reassortment events.

    library(tibble)
    library(ggplot2)
    library(seqcombo)  # provides hybrid_plot()

    n <- 8
    virus_info <- tibble(
      id = 1:7,
      x = c(rep(1990, 4), rep(2000, 2), 2009),
      y = c(1, 2, 3, 5, 1.5, 3, 4),
      segment_color = list(
        rep('purple', n),
        rep('red', n),
        rep('darkgreen', n),
        rep('lightgreen', n),
        c('darkgreen', 'darkgreen', 'red', 'darkgreen', 'red', 'purple', 'red', 'purple'),
        c('darkgreen', 'darkgreen', 'red', 'darkgreen', 'darkgreen', 'purple', 'red', 'purple'),
        c('darkgreen', 'lightgreen', 'lightgreen', 'darkgreen', 'darkgreen', 'purple', 'red', 'purple'))
    )
    flow_info <- tibble(from = c(1, 2, 3, 3, 4, 5, 6),
                        to   = c(5, 5, 5, 6, 7, 6, 7))
    hybrid_plot(virus_info, flow_info)

    hybrid_plot requires two tibble data frames: one of virus information
    and one of genetic flow information.

    Users need to provide x and y positions to plot the viruses. This makes
    sense because geographical and temporal information is usually available
    in such phylodynamic studies and can be employed to set x or y,
    providing more context and helping the interpretation of the
    reassortment events.

    We use a hexagon to represent each virus. Users can set the virus outer
    boundary color with v_color and the fill color with v_fill. The color of
    the line segments that indicate the genetic flow relationships can be
    specified via the l_color parameter.

    hybrid_plot(virus_info, flow_info, v_color='firebrick', v_fill='darkgreen', l_color='steelblue')

    We usually have more information to present, for example host
    information and HA subtype, and this information can be used to color
    the viruses via either v_color or v_fill.

    virus_info$Host = c("Avian", "Human", rep("Swine", 4), "Human")
    hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host)

    The relative virus size can also be specified if a virus_size column is
    available in the input virus_info data.

    virus_info$virus_size <- c(rep(1, 3), 2, 1, 1, 1.5)
    hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host)

    If label and label_position columns are available, the virus labels
    (virus name or other information) will be added automatically.

    virus_info$label <- c("Avian", "Human\nH3N2", "Classic\nswine\nH1N1", "Eurasian swine",
                          "North American swine\n triple reassrotant H3N2",
                          "North American swine\n triple reassortant H1N2",
                          "2009 Human H1N1")
    virus_info$label_position <- c('left', 'left', 'left', 'below', 'below', 'upper', 'below')
    hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host)

    Users can use asp to set the aspect ratio of the hexagons: asp < 1 for
    thick/short and asp > 1 for thin/tall.

    hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host, asp=2)

    The output of hybrid_plot is a ggplot object and users can use
    ggplot2 to modify the details.

    title <- "Reassortment events in evolution of the 2009 influenza A (H1N1) virus"
    caption <- 'Gene segments: PB2, PB1, PA, HA, NP, NA, M, NS'
    color <- c(Avian="purple", Human="red", Swine="darkgreen")

    hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host) +
      labs(caption=caption, title=title) +
      scale_color_manual(values=color) +
      scale_fill_manual(values=color) +
      scale_x_continuous(breaks=c(1990, 2000, 2009)) +
      xlab(NULL) + ylab(NULL) +
      theme_minimal() +
      theme(axis.line.y = element_blank(),
            axis.text.y = element_blank(),
            axis.ticks.y = element_blank(),
            panel.grid.minor=element_blank(),
            panel.grid.major.y=element_blank(),
            legend.position = c(.95, .1))

    Top-down or bottom-up style is also supported.

    x <- virus_info$x
    virus_info$x <- virus_info$y
    virus_info$y <- x
    virus_info$label_position <- c(rep("right", 3), "left", "left", "right", "right")
    hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host) +
      scale_y_reverse() +
      scale_x_continuous(limits=c(0, 5.5))

    Users can also use emoji to label the viruses (host information in this
    example):

    virus_info$label <- c("chicken", "woman", "pig", "pig", "pig", "pig", "woman")
    hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host,
                parse='emoji', t_size=8, t_color='firebrick') +
      scale_y_reverse()

    In case you don’t have x-y coordinate information, you can use the
    set_layout function to automatically set the x-y positions using a
    selected layout function.

    virus_info <- set_layout(virus_info, flow_info, layout="layout.kamada.kawai")
    hybrid_plot(virus_info, flow_info, parse='emoji', t_size=8, t_color='firebrick')

    virus_info <- set_layout(virus_info, flow_info, layout="layout.fruchterman.reingold")
    hybrid_plot(virus_info, flow_info, parse='emoji', t_size=8, t_color='firebrick')

    Please let me know if you know of any published reassortment data that
    contain spatial information; I will demonstrate how to visualize
    reassortment history on a map.


    To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang YU. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    My book ‘Practical Machine Learning with R and Python’ on Amazon

    Tue, 12/05/2017 - 03:26

    (This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

    My book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($9.99) and kindle ($6.97/Rs449) versions. In this book I implement some of the most common but important Machine Learning algorithms in R and equivalent Python code. This is almost like listening to parallel channels of music in stereo!
    1. Practical Machine Learning with R and Python – Machine Learning in Stereo (Paperback)
    2. Practical Machine Learning with R and Python – Machine Learning in Stereo (Kindle)
    This book is ideal both for beginners and for experts in R and/or Python. Those starting their journey into data science and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally, those who are proficient in both languages can use the R and Python implementations to internalize the ML algorithms better.

    Here is a look at the topics covered

    Table of Contents
    Essential R …………………………………….. 7
    Essential Python for Datascience ………………..   54
    R vs Python ……………………………………. 77
    Regression of a continuous variable ………………. 96
    Classification and Cross Validation ……………….113
    Regression techniques and regularization …………. 134
    SVMs, Decision Trees and Validation curves …………175
    Splines, GAMs, Random Forests and Boosting …………202
    PCA, K-Means and Hierarchical Clustering …………. 234

    Pick up your copy today!!
    Hope you have a great time learning as I did while implementing these algorithms!


    To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Version 0.6-8 of NIMBLE released

    Tue, 12/05/2017 - 02:19

    (This article was first published on R – NIMBLE, and kindly contributed to R-bloggers)

    We released the newest version of NIMBLE on CRAN and on our website a week ago. Version 0.6-8 has a few new features, and more are on the way in the next few months.

    New features include:

    • the proper Gaussian CAR (conditional autoregressive) model can now be used in BUGS code as dcar_proper, which behaves similarly to BUGS’ car.proper distribution;
    • a new nimbleMCMC function that provides one-line invocation of NIMBLE’s MCMC engine, akin to usage of JAGS and WinBUGS through R;
    • a new runCrossValidate function that will conduct k-fold cross-validation of NIMBLE models fit by MCMC;
    • dynamic indexing in BUGS code is now allowed by default;
    • and a variety of bug fixes and efficiency improvements.

    Please see the NEWS file in the installed package for more details.
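    As a quick illustration of the new nimbleMCMC one-line invocation mentioned above, a minimal sketch might look like the following; the toy model, data, and argument values are invented for illustration and are not taken from the release announcement.

    library(nimble)

    ## a toy model: estimating a normal mean (illustrative only)
    code <- nimbleCode({
      mu ~ dnorm(0, sd = 10)
      for (i in 1:N) {
        y[i] ~ dnorm(mu, sd = 1)
      }
    })

    ## one-line MCMC via nimbleMCMC()
    samples <- nimbleMCMC(code = code,
                          constants = list(N = 10),
                          data = list(y = rnorm(10, mean = 2)),
                          inits = list(mu = 0),
                          monitors = "mu",
                          niter = 2000, nburnin = 500)
    summary(samples)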


    To leave a comment for the author, please follow the link and comment on their blog: R – NIMBLE. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    #R – Deprecate functions with roxygen2

    Tue, 12/05/2017 - 01:00

    (This article was first published on R-feed, and kindly contributed to R-bloggers)

    We show how to use roxygen2 tags and templates for deprecating existing documented functions.

    Introduction

    Package roxygen2 provides a powerful way of documenting packages and objects inside them. In particular, great flexibility is available for organizing the documentation content in source files and templates, and rendering it as desired in the corresponding help pages.

    We can leverage this flexibility to adapt an existing package’s documentation when making a function deprecated (and similarly defunct). As stated in the base-R ‘Marking Objects as Deprecated’ documentation (help(Deprecated)):

    The original help page for these functions is often available at help(“oldName-deprecated”) (note the quotes). Functions should be listed in help(“pkg-deprecated”) for an appropriate pkg, including base.

    Deprecate existing functions

    We show how an existing and documented function myFun in package ourPkg can be deprecated, and the documentation adjusted accordingly by re-arranging the roxygen2 documentation content.

    ## myFun.r
    #' @title My function.
    #' @description My function.
    #' @param arg1 My first argument.
    #' @param arg2 My second argument.
    #' @return My return value.
    #' @export
    myFun <- function(arg1, arg2) {
      "My return value"
    }

    Let us assume a new function myNewFun exists as a replacement for the old myFun, which becomes deprecated. We want to achieve the following

    • list myFun in help("ourPkg-deprecated"), along with its replacement;
    • make its original documentation available via help("myFun-deprecated").

    This can be obtained by

    • creating generic content for help("ourPkg-deprecated");
    • adjusting the existing myFun documentation content.
    Package-level documentation

    Generic content for help("ourPkg-deprecated") is defined in its own source file:

    ## ourPkg-deprecated.r
    #' @title Deprecated functions in package \pkg{ourPkg}.
    #' @description The functions listed below are deprecated and will be defunct in
    #'   the near future. When possible, alternative functions with similar
    #'   functionality are also mentioned. Help pages for deprecated functions are
    #'   available at \code{help("<function>-deprecated")}.
    #' @name ourPkg-deprecated
    #' @keywords internal
    NULL

    Function-specific documentation

    Function-specific content is added to the ourPkg-deprecated help page from the corresponding source file, where we also want to make the original help page available via help("myFun-deprecated").

    The source file of myFun is then modified as follows:

    ## myFun.r
    #' @title My function.
    #' @description My function.
    #' @param arg1 My first argument.
    #' @param arg2 My second argument.
    #' @return My return value.
    #'
    #' @name myFun-deprecated
    #' @usage myFun(arg1, arg2)
    #' @seealso \code{\link{ourPkg-deprecated}}
    #' @keywords internal
    NULL

    #' @rdname ourPkg-deprecated
    #' @section \code{myFun}:
    #' For \code{myFun}, use \code{\link{myNewFun}}.
    #'
    #' @export
    myFun <- function(arg1, arg2) {
      .Deprecated("myNewFun")
      "My return value"
    }

    Two blocks of roxygen2 tags have been added to the existing documentation content.

    • The first block is used to create the myFun-deprecated help page, detached from the myFun object. For this reason we need to explicitly add the original function signature as ‘Usage’ section. We also add a ‘See Also’ link to ourPkg-deprecated.
    • The second block adds myFun-specific content to the ourPkg-deprecated help page. The standard ‘Usage’ section is added automatically, and we create a function-specific section where using myNewFun is suggested. Such content will be collected and shown in help("ourPkg-deprecated") for all deprecated functions with similar tags structure.

    Note also that help(myFun) will point to the ourPkg-deprecated help page, and that @keywords internal prevents the corresponding topics from appearing in the package documentation index.
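    At run time, calling the deprecated function still returns its value but now also emits a deprecation warning via .Deprecated(); the transcript below is an approximate sketch of what base R prints, not output copied from the example package.

    myFun(1, 2)
    ## [1] "My return value"
    ## Warning message:
    ## 'myFun' is deprecated.
    ## Use 'myNewFun' instead.
    ## See help("Deprecated")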

    Using roxygen2 templates

    Deprecating functions as described above can be automated using templates. One can define a template for each additional block.

    ## template-depr_pkg.r
    #' @name ourPkg-deprecated
    #' @section \code{<%= old %>}:
    #' For \code{<%= old %>}, use \code{\link{<%= new %>}}.

    ## template-depr_fun.r
    #' @name <%= fun %>-deprecated
    #' @usage <%= gsub("\n", "\n#' ", roxygen2:::wrapString(roxygen2:::function_usage(fun, formals(fun)))) %>
    #' @seealso \code{\link{ourPkg-deprecated}}
    #' @keywords internal

    Note the presence of template variables and of some roxygen2 internals for automating the construction of the ‘Usage’ section (handling multiple lines in case of many arguments).

    Deprecating herFun in favor of herNewFun then results in the simple inclusion of the templates in the function source.

    ## herFun.r
    #' @title Her function.
    #' @description Her function.
    #' @param arg1 Her first argument.
    #' @param arg2 Her second argument.
    #' @return Her return value.
    #'
    #' @templateVar fun herFun
    #' @template template-depr_fun
    NULL

    #' @templateVar old herFun
    #' @templateVar new herNewFun
    #' @template template-depr_pkg
    #'
    #' @export
    herFun <- function(arg1, arg2) {
      .Deprecated("herNewFun")
      "Her return value"
    }

    Example package manual and sources

    A complete example of ourPkg is
    available for download.
    The package contains functions myFun, yourFun, herFun, hisFun, where all but yourFun are deprecated in favor of *NewFun, using roxygen2 templates for herFun and hisFun.

    The resulting documentation content can be assessed in the corresponding PDF reference manual.


    To leave a comment for the author, please follow the link and comment on their blog: R-feed. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Exploratory Data Analysis of Ancient Texts with rperseus

    Tue, 12/05/2017 - 01:00

    (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

    Introduction

    When I was in grad school at Emory, I had a favorite desk in the library. The desk wasn’t particularly cozy or private, but what it lacked in comfort it made up for in real estate. My books and I needed room to operate. Students of the ancient world require many tools, and when jumping between commentaries, lexicons, and interlinears, additional clutter is additional “friction”, i.e., lapses in thought due to frustration. Technical solutions to this clutter exist, but the best ones are proprietary and expensive. Furthermore, they are somewhat inflexible, and you may have to shoehorn your thoughts into their framework. More friction.

    Interfacing with the Perseus Digital Library was a popular online alternative. The library includes a catalog of classical texts, a Greek and Latin lexicon, and a word study tool for appearances and references in other literature. If the university library’s reference copies of BDAG1 and Synopsis Quattuor Evangeliorum2 were unavailable, Perseus was our next best thing.

    Fast forward several years, and I’ve abandoned my quest to become a biblical scholar. Much to my father’s dismay, I’ve learned writing code is more fun than writing exegesis papers. Still, I enjoy dabbling with dead languages, and it was the desire to wed my two loves, biblical studies and R, that birthed my latest package, rperseus. The goal of this package is to furnish classicists with texts of the ancient world and a toolkit to unpack them.

    Exploratory Data Analysis in Biblical Studies

    Working with the Perseus Digital Library was already a trip down memory lane, but here’s an example of how I would have leveraged rperseus many years ago.

    My best papers often sprung from the outer margins of my Nestle-Aland Novum Testamentum Graece. Here the editors inserted cross references to parallel vocabulary, themes, and even grammatical constructions. Given the intertextuality of biblical literature, the margins are a rich source of questions: Where else does the author use similar vocabulary? How is the source material used differently? Does the literary context affect our interpretation of a particular word? This is exploratory data analysis in biblical studies.

    Unfortunately the excitement of your questions is incommensurate with the tedium of the process – EDA continues by flipping back and forth between books, dog-earing pages, and avoiding paper cuts. rperseus aims to streamline this process with two functions: get_perseus_text and perseus_parallel. The former returns a data frame containing the text from any work in the Perseus Digital Library, and the latter renders a parallel in ggplot2.

    Suppose I am writing a paper on different expressions of love in Paul’s letters. Naturally, I start in 1 Corinthians 13, the famed “Love Chapter” often heard at weddings and seen on bumper stickers. I finish the chapter and turn to the margins. In the image below, I see references to Colossians 1:4, 1 Thessalonians 1:3, 5:8, Hebrews 10:22-24, and Romans 8:35-39.


    1 Corinthians 13 in Nestle-Aland Novum Testamentum Graece

    Ignoring that some scholars exclude Colossians from the “authentic” letters, let’s see the references alongside each other:

    library(rperseus) #devtools::install_github("ropensci/rperseus")
    library(tidyverse)

    tribble(
      ~label, ~excerpt,
      "Colossians", "1.4",
      "1 Thessalonians", "1.3",
      "1 Thessalonians", "5.8",
      "Romans", "8.35-8.39"
    ) %>%
      left_join(perseus_catalog) %>%
      filter(language == "grc") %>%
      select(urn, excerpt) %>%
      pmap_df(get_perseus_text) %>%
      perseus_parallel(words_per_row = 4)

    A brief explanation: First, I specify the labels and excerpts within a tibble. Second, I join the lazily loaded perseus_catalog onto the data frame. Third, I filter for the Greek3 and select the columns containing the arguments required for get_perseus_text. Fourth, I map over each urn and excerpt, returning another data frame. Finally, I pipe the output into perseus_parallel.

    The key word shared by each passage is agape (“love”). Without going into detail, it might be fruitful to consider the references alongside each other, pondering how the semantic range of agape expands or contracts within the Pauline corpus. Paul had a penchant for appropriating and recasting old ideas–often in slippery and unexpected ways–and your Greek lexicon provides a mere approximation. In other words, how can we move from the dictionary definition of agape towards Paul’s unique vision?

    If your Greek is rusty, you can parse each word with parse_excerpt by locating the text’s urn within the perseus_catalog object.

    parse_excerpt(urn = "urn:cts:greekLit:tlg0031.tlg012.perseus-grc2", excerpt = "1.4")

    word   form       verse part_of_speech person number   tense  mood       voice  gender    case       degree
    ἀκούω  ἀκούσαντες 1.4   verb           NA     plural   aorist participle active masculine nominative NA
    ὁ      τὴν        1.4   article        NA     singular NA     NA         NA     feminine  accusative NA
    πίστις πίστιν     1.4   noun           NA     singular NA     NA         NA     feminine  accusative NA
    ὑμός   ὑμῶν       1.4   pronoun        NA     plural   NA     NA         NA     masculine genative   NA

    If your Greek is really rusty, you can also flip the language filter to “eng” to view an older English translation.4 And if the margin references a text from the Old Testament, you can call the Septuagint as well as the original Hebrew.5
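    Before the Septuagint/Hebrew example below, here is a sketch of that English variant, reusing the parallel-passage code from earlier; the "eng" filter usage is an assumption based on the sentence above rather than code from the post.

    tribble(
      ~label, ~excerpt,
      "Colossians", "1.4",
      "1 Thessalonians", "1.3",
      "1 Thessalonians", "5.8",
      "Romans", "8.35-8.39"
    ) %>%
      left_join(perseus_catalog) %>%
      filter(language == "eng") %>%  # English translation instead of Greek
      select(urn, excerpt) %>%
      pmap_df(get_perseus_text) %>%
      perseus_parallel(words_per_row = 4)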

    tribble(
      ~label, ~excerpt,
      "Genesis", "32.31",
      "Genesis, pointed", "32.31",
      "Numeri", "12.8",
      "Numbers, pointed", "12.8"
    ) %>%
      left_join(perseus_catalog) %>%
      filter(language %in% c("grc", "hpt")) %>%
      select(urn, excerpt) %>%
      pmap_df(get_perseus_text) %>%
      perseus_parallel()

    Admittedly, there is some “friction” here in joining the perseus_catalog onto the initial tibble. There is a learning curve with getting acquainted with the idiosyncrasies of the catalog object. A later release will aim to streamline this workflow.

    Future Work

    Check the vignette for a more general overview of rperseus. In the meantime, I look forward to getting more intimately acquainted with the Perseus Digital Library. Tentative plans to extend rperseus include a Shiny interface to further reduce “friction” and a method of creating a “book” of custom parallels with bookdown.

    Acknowledgements

    I want to thank my two rOpenSci reviewers, Ildikó Czeller and François Michonneau, for coaching me through the review process. They were the first two individuals to ever scrutinize my code, and I was lucky to hear their feedback. rOpenSci onboarding is truly a wonderful process.

    1. Bauer, Walter. A Greek-English Lexicon of the New Testament and Other Early Christian Literature. Edited by Frederick W. Danker. 3rd ed. Chicago: University of Chicago Press, 2000.
    2. Aland, Kurt. Synopsis Quattuor Evangeliorum. Deutsche Bibelgesellschaft, 1997.
    3. The Greek text from the Perseus Digital Library is from 1885 standards. The advancement of textual criticism in the 20th century led to a more stable text you would find in current editions of the Greek New Testament.
    4. The English translation is from Rainbow Missions, Inc. World English Bible. Rainbow Missions, Inc.; revision of the American Standard Version of 1901. I’ve toyed with the idea of incorporating more modern translations, but that would require resources beyond the Perseus Digital Library.
    5. “hpt” is the pointed Hebrew text from Codex Leningradensis.

    To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Usage of ruler package

    Tue, 12/05/2017 - 01:00

    (This article was first published on QuestionFlow , and kindly contributed to R-bloggers)

    Usage examples of ruler package: dplyr-style exploration and validation of data frame like objects.

    Prologue

    My previous post tells a story about design of my ruler package, which presents tools for “… creating data validation pipelines and tidy reports”. This package offers a framework for exploring and validating data frame like objects using dplyr grammar of data manipulation.

    This post is intended to show some close to reality ruler usage examples. Described methods and approaches reflect package design. Along the way you will learn why Yoda and Jabba the Hutt are “outliers” among core “Star Wars” characters.

    For more information see README (for relatively brief comprehensive introduction) or vignettes (for more thorough description of package capabilities).

    Beware of a lot of code.

    Overview

    suppressMessages(library(dplyr))
    suppressMessages(library(purrr))
    library(ruler)

    The general way of performing validation with ruler can be described with following steps:

    • Formulate a validation task. It is usually stated in the form of a yes-no question or true-false statement about some part (data unit) of an input data frame. Data unit can be one of: data [as a whole], group of rows [as a whole], column [as a whole], row [as a whole], cell. For example, does every column contain elements with sum more than 100?.
    • Create a dplyr-style validation function (rule pack) which checks desired data unit for obedience to [possibly] several rules: mtcars %>% summarise_all(funs(enough_sum = sum(.) > 100))
      • Use ruler’s function rules() instead of explicit or implicit usage of funs():
      mtcars %>% summarise_all(rules(enough_sum = sum(.) > 100)) . %>% summarise_all(rules(enough_sum = sum(.) > 100))
      • Wrap with rule specification function to explicitly identify validated data unit and to name rule pack. In this case it is col_packs() for column data unit with “is_enough_sum” as rule pack name:
      col_packs( is_enough_sum = . %>% summarise_all(rules(is_enough = sum(.) > 100)) )
    • Expose data to rules to obtain validation result (exposure). Use ruler’s expose() function for that. It doesn’t modify contents of input data frame but creates/updates exposure attribute. Exposure is a list with information about used rule packs (packs_info) and tidy data validation report (report).
    • Act after exposure. It can be:
      • Observing validation results with get_exposure(), get_packs_info() or get_report().
      • Making assertions if specific rules are not followed in desired way.
      • Imputing input data frame based on report.

    In examples we will use starwars data from dplyr package (to celebrate an upcoming new episode). It is a tibble with every row describing one “Star Wars” character. Every example starts with a validation task stated in italic and performs validation from beginning to end.

    Create rule packs
    Data

    Does starwars have 1) number of rows 1a) more than 50; 1b) less than 60; 2) number of columns 2a) more than 10; 2b) less than 15?

    check_data_dims <- data_packs( check_dims = . %>% summarise( nrow_low = nrow(.) >= 50, nrow_up = nrow(.) <= 60, ncol_low = ncol(.) >= 10, ncol_up = ncol(.) <= 15 ) ) starwars %>% expose(check_data_dims) %>% get_exposure() ## Exposure ## ## Packs info: ## # A tibble: 1 x 4 ## name type fun remove_obeyers ## ## 1 check_dims data_pack TRUE ## ## Tidy data validation report: ## # A tibble: 1 x 5 ## pack rule var id value ## ## 1 check_dims nrow_up .all 0 FALSE

    The result is interpreted as follows:

    • Data was exposed to one rule pack for data as a whole (data rule pack) named “check_dims”. For it all obeyers (data units which follow specified rule) were removed from validation report.
    • Combination of var equals .all and id equals 0 means that data as a whole is validated.
    • Input data frame doesn’t obey (because value is equal to FALSE) rule nrow_up from rule pack check_dims.

    Does starwars have enough rows for characters 1) with blond hair; 2) humans; 3) humans with blond hair?

    check_enough_rows <- data_packs( enough_blond = . %>% filter(hair_color == "blond") %>% summarise(is_enough = n() > 10), enough_humans = . %>% summarise( is_enough = sum(species == "Human", na.rm = TRUE) > 30 ), ehough_blond_humans = . %>% filter( hair_color == "blond", species == "Human" ) %>% summarise(is_enough = n() > 5) ) starwars %>% expose(check_enough_rows) %>% get_exposure() ## Exposure ## ## Packs info: ## # A tibble: 3 x 4 ## name type fun remove_obeyers ## ## 1 enough_blond data_pack TRUE ## 2 enough_humans data_pack TRUE ## 3 ehough_blond_humans data_pack TRUE ## ## Tidy data validation report: ## # A tibble: 2 x 5 ## pack rule var id value ## ## 1 enough_blond is_enough .all 0 FALSE ## 2 ehough_blond_humans is_enough .all 0 FALSE

    New information gained from example:

    • Rule specification functions can be supplied with multiple rule packs all of which will be independently used during exposing.

    Does starwars have enough numeric columns?

    check_enough_num_cols <- data_packs( enough_num_cols = . %>% summarise( is_enough = sum(map_lgl(., is.numeric)) > 1 ) ) starwars %>% expose(check_enough_num_cols) %>% get_report() ## Tidy data validation report: ## # A tibble: 0 x 5 ## # ... with 5 variables: pack , rule , var , id , ## # value
    • If no breaker is found get_report() returns tibble with zero rows and usual columns.
    Group

    Does group defined by hair color and gender have a member from Tatooine?

    has_hair_gender_tatooine <- group_packs( hair_gender_tatooine = . %>% group_by(hair_color, gender) %>% summarise(has_tatooine = any(homeworld == "Tatooine")), .group_vars = c("hair_color", "gender"), .group_sep = "__" ) starwars %>% expose(has_hair_gender_tatooine) %>% get_report() ## Tidy data validation report: ## # A tibble: 12 x 5 ## pack rule var id value ## ## 1 hair_gender_tatooine has_tatooine auburn__female 0 FALSE ## 2 hair_gender_tatooine has_tatooine auburn, grey__male 0 FALSE ## 3 hair_gender_tatooine has_tatooine auburn, white__male 0 FALSE ## 4 hair_gender_tatooine has_tatooine blonde__female 0 FALSE ## 5 hair_gender_tatooine has_tatooine grey__male 0 FALSE ## # ... with 7 more rows
    • group_packs() needs grouping columns supplied via .group_vars.
    • Column var of the validation report contains levels of the grouping columns to identify the group. By default they are pasted together with .. To change that, supply the .group_sep argument.
    • 12 combinations of hair_color and gender don’t have a character from Tatooine. They are “auburn”-“female”, “auburn, grey”-“male” and so on.
    Column

    Does every list-column have 1) enough average length; 2) enough unique elements?

    check_list_cols <- col_packs( check_list_cols = . %>% summarise_if( is.list, rules( is_enough_mean = mean(map_int(., length)) >= 1, length(unique(unlist(.))) >= 10 ) ) ) starwars %>% expose(check_list_cols) %>% get_report() ## Tidy data validation report: ## # A tibble: 3 x 5 ## pack rule var id value ## ## 1 check_list_cols is_enough_mean vehicles 0 FALSE ## 2 check_list_cols is_enough_mean starships 0 FALSE ## 3 check_list_cols rule..2 films 0 FALSE
    • To specify rule functions inside dplyr’s scoped verbs use ruler::rules(). It powers correct output interpretation during exposing process and imputes missing rule names based on the present rules in current rule pack.
    • Columns vehicles and starships don’t have enough average length and column films doesn’t have enough unique elements.

    Are all values of column birth_year non-NA?

    starwars %>% expose( col_packs( . %>% summarise_at( vars(birth_year = "birth_year"), rules(all_present = all(!is.na(.))) ) ) ) %>% get_report() ## Tidy data validation report: ## # A tibble: 1 x 5 ## pack rule var id value ## ## 1 col_pack..1 all_present birth_year 0 FALSE
    • To correctly validate one column with scoped dplyr verb it should be a named argument inside vars. It is needed for correct interpretation of rule pack output.
    Row

    Has character appeared in enough films? As character is defined by row, this is a row pack.

    has_enough_films <- row_packs( enough_films = . %>% transmute(is_enough = map_int(films, length) >= 3) ) starwars %>% expose(has_enough_films) %>% get_report() %>% left_join(y = starwars %>% transmute(id = 1:n(), name), by = "id") %>% print(.validate = FALSE) ## Tidy data validation report: ## # A tibble: 64 x 6 ## pack rule var id value name ## ## 1 enough_films is_enough .all 8 FALSE R5-D4 ## 2 enough_films is_enough .all 9 FALSE Biggs Darklighter ## 3 enough_films is_enough .all 12 FALSE Wilhuff Tarkin ## 4 enough_films is_enough .all 15 FALSE Greedo ## 5 enough_films is_enough .all 18 FALSE Jek Tono Porkins ## # ... with 59 more rows
    • 64 characters haven’t appeared in 3 films or more. Those are characters described in starwars in rows 8, 9, etc. (counting based on input data).

    Is a character with height less than 100 a droid?

    is_short_droid <- row_packs( is_short_droid = . %>% filter(height < 100) %>% transmute(is_droid = species == "Droid") ) starwars %>% expose(is_short_droid) %>% get_report() %>% left_join(y = starwars %>% transmute(id = 1:n(), name, height), by = "id") %>% print(.validate = FALSE) ## Tidy data validation report: ## # A tibble: 5 x 7 ## pack rule var id value name height ## ## 1 is_short_droid is_droid .all 19 FALSE Yoda 66 ## 2 is_short_droid is_droid .all 29 FALSE Wicket Systri Warrick 88 ## 3 is_short_droid is_droid .all 45 FALSE Dud Bolt 94 ## 4 is_short_droid is_droid .all 72 FALSE Ratts Tyerell 79 ## 5 is_short_droid is_droid .all 73 NA R4-P17 96
    • One can expose only subset of rows by using filter or slice. The value of id column in result will reflect row number in the original input data frame. This feature is powered by keyholder package. In order to use it, rule pack should be created using its supported functions.
    • value equal to NA is treated as rule breaker.
    • 5 “not tall” characters are not droids.
    Cell

    Is a non-NA numeric cell not an outlier based on z-score? This is a bit tricky: to present outliers as rule breakers, one should ask whether a cell is not an outlier.

    z_score <- function(x, ...) {abs(x - mean(x, ...)) / sd(x, ...)} cell_isnt_outlier <- cell_packs( dbl_not_outlier = . %>% transmute_if( is.numeric, rules(isnt_out = z_score(., na.rm = TRUE) < 3 | is.na(.)) ) ) starwars %>% expose(cell_isnt_outlier) %>% get_report() %>% left_join(y = starwars %>% transmute(id = 1:n(), name), by = "id") %>% print(.validate = FALSE) ## Tidy data validation report: ## # A tibble: 4 x 6 ## pack rule var id value name ## ## 1 dbl_not_outlier isnt_out height 19 FALSE Yoda ## 2 dbl_not_outlier isnt_out mass 16 FALSE Jabba Desilijic Tiure ## 3 dbl_not_outlier isnt_out birth_year 16 FALSE Jabba Desilijic Tiure ## 4 dbl_not_outlier isnt_out birth_year 19 FALSE Yoda
    • 4 non-NA numeric cells appear to be an outlier within their column.
    Expose data to rules

    Do groups defined by species, gender and eye_color (3 different checks) have appropriate size?

    starwars %>% expose( group_packs(. %>% group_by(species) %>% summarise(isnt_many = n() <= 5), .group_vars = "species") ) %>% expose( group_packs(. %>% group_by(gender) %>% summarise(isnt_many = n() <= 60), .group_vars = "gender"), .remove_obeyers = FALSE ) %>% expose(is_enough_eye_color = . %>% group_by(eye_color) %>% summarise(isnt_many = n() <= 20)) %>% get_exposure() %>% print(n_report = Inf) ## Exposure ## ## Packs info: ## # A tibble: 3 x 4 ## name type fun remove_obeyers ## ## 1 group_pack..1 group_pack TRUE ## 2 group_pack..2 group_pack FALSE ## 3 is_enough_eye_color group_pack TRUE ## ## Tidy data validation report: ## # A tibble: 7 x 5 ## pack rule var id value ## ## 1 group_pack..1 isnt_many Human 0 FALSE ## 2 group_pack..2 isnt_many female 0 TRUE ## 3 group_pack..2 isnt_many hermaphrodite 0 TRUE ## 4 group_pack..2 isnt_many male 0 FALSE ## 5 group_pack..2 isnt_many none 0 TRUE ## 6 group_pack..2 isnt_many NA 0 TRUE ## 7 is_enough_eye_color isnt_many brown 0 FALSE
    • expose() can be applied sequentially which results into updating existing exposure with new information.
    • expose() imputes names of supplied unnamed rule packs based on the present rule packs for the same data unit type.
    • expose() by default removes obeyers (rows with data units that obey respective rules) from validation report. To stop doing that use .remove_obeyers = FALSE during expose() call.
    • expose() by default guesses the type of the supplied rule pack based only on its output. This has some annoying edge cases but is suitable for interactive usage. To turn this feature off use .guess = FALSE as an argument for expose(). Also, to avoid edge cases create rule packs with appropriate wrappers.

    Perform some previous checks with one expose().

    my_packs <- list(check_data_dims, is_short_droid, cell_isnt_outlier) str(my_packs) ## List of 3 ## $ :List of 1 ## ..$ check_dims:function (value) ## .. ..- attr(*, "class")= chr [1:4] "data_pack" "rule_pack" "fseq" "function" ## $ :List of 1 ## ..$ is_short_droid:function (value) ## .. ..- attr(*, "class")= chr [1:4] "row_pack" "rule_pack" "fseq" "function" ## $ :List of 1 ## ..$ dbl_not_outlier:function (value) ## .. ..- attr(*, "class")= chr [1:4] "cell_pack" "rule_pack" "fseq" "function" starwars_exposed_list <- starwars %>% expose(my_packs) starwars_exposed_arguments <- starwars %>% expose(check_data_dims, is_short_droid, cell_isnt_outlier) identical(starwars_exposed_list, starwars_exposed_arguments) ## [1] TRUE
    • expose() can have for rule pack argument a list of lists [of lists, of lists, …] with functions at any depth. This enables creating a list of rule packs wrapped with *_packs() functions (which all return a list of functions).
    • expose() can have multiple rule packs as separate arguments.
    Act after exposure

    Throw an error if any non-NA value of mass is more than 1000.

    starwars %>% expose( col_packs( low_mass = . %>% summarise_at( vars(mass = "mass"), rules(is_small_mass = all(. <= 1000, na.rm = TRUE)) ) ) ) %>% assert_any_breaker() ## Breakers report ## Tidy data validation report: ## # A tibble: 1 x 5 ## pack rule var id value ## ## 1 low_mass is_small_mass mass 0 FALSE ## Error: assert_any_breaker: Some breakers found in exposure.
    • assert_any_breaker() is used to assert presence of at least one breaker in validation report.

    However, offered solution via column pack doesn’t show rows which break the rule. To do that one can use cell pack:

    starwars %>% expose( cell_packs( low_mass = . %>% transmute_at( vars(mass = "mass"), rules(is_small_mass = (. <= 1000) | is.na(.)) ) ) ) %>% assert_any_breaker() ## Breakers report ## Tidy data validation report: ## # A tibble: 1 x 5 ## pack rule var id value ## ## 1 low_mass is_small_mass mass 16 FALSE ## Error: assert_any_breaker: Some breakers found in exposure.

    Remove numeric columns with mean value below certain threshold. To achieve that one should formulate rule as “column mean should be above threshold”, identify breakers and act upon this information.

    remove_bad_cols <- function(.tbl) { bad_cols <- .tbl %>% get_report() %>% pull(var) %>% unique() .tbl[, setdiff(colnames(.tbl), bad_cols)] } starwars %>% expose( col_packs( . %>% summarise_if(is.numeric, rules(mean(., na.rm = TRUE) >= 100)) ) ) %>% act_after_exposure( .trigger = any_breaker, .actor = remove_bad_cols ) %>% remove_exposure() ## # A tibble: 87 x 11 ## name height hair_color skin_color eye_color gender homeworld ## ## 1 Luke Skywalker 172 blond fair blue male Tatooine ## 2 C-3PO 167 gold yellow Tatooine ## 3 R2-D2 96 white, blue red Naboo ## 4 Darth Vader 202 none white yellow male Tatooine ## 5 Leia Organa 150 brown light brown female Alderaan ## # ... with 82 more rows, and 4 more variables: species , ## # films , vehicles , starships
    • act_after_exposure is a wrapper for performing actions after exposing. It takes .trigger function to trigger action and .actor function to perform action and return its result.
    • any_breaker is a function which returns TRUE if the tidy validation report attached to it has any breaker and FALSE otherwise.
    Conclusions
    • Yoda and Jabba the Hutt are outliers among other “Star Wars” characters: Yoda is by height and birth year, Jabba is by mass and also birth year.
    • There are less than 10 “Star Wars” films yet.
    • ruler offers flexible and extendable functionality for common validation tasks. Validation can be done for data [as a whole], group of rows [as a whole], column [as a whole], row [as a whole] and cell. After exposing data frame of interest to rules and obtaining tidy validation report, one can perform any action based on this information: explore report, throw error, impute input data frame, etc.

    To leave a comment for the author, please follow the link and comment on their blog: QuestionFlow . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    AI School: Microsoft R and SQL Server ML Services

    Mon, 12/04/2017 - 21:40

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    If you'd like to learn how to use R to develop AI applications, the Microsoft AI School now features a learning path focused on Microsoft R and SQL Server ML Services. This learning path includes eight modules, each comprising detailed tutorials and examples:

    All of the Microsoft AI School learning paths are free to access, and the content is hosted on Github (where feedback is welcome!). You can access this course and many others at the link below.

    Microsoft AI School: Microsoft R and SQL Server ML Services 


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    5 ways a SAS Health Check can simplify a transition to R

    Mon, 12/04/2017 - 18:26

    (This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

    Nic Crane, Data Scientist

    At Mango, we’re seeing more and more clients making the decision to modernise their analytics process, moving away from SAS and on to R, Python, and other technologies. There are a variety of reasons for this, including SAS license costs, the increase in recent graduates with R and Python skills, SAS becoming increasingly uncommon, and the need for flexible technologies which have the capability for advanced analytics and quality graphics output.

    While such transitions are typically about much more than just technology migration, the code accounts for a significant degree of the complexity. So, in order to support our clients, we have developed a software suite to analyse the existing SAS code and simplify this process.

    So how can a SAS Code Health Check help you decide on how to tackle this kind of transformation?

    1. Analyse procedure calls to inform technology choice

    Using the right technology for the right job is important if we want to create code which is easy to maintain for years, saving us time and resources. But how can we determine the best tool for the job?

    A key part of any SAS code analysis involves looking at the procedure calls in the SAS codebase to get a quick view of the key functionality. For example, we can see from the analysis above that this codebase mainly consists of calls to PROC SORT and PROC SQL –SAS procedures which reorder data and execute SQL commands used for interacting with databases or tables of data. As there are no statistics related procs, we may decide —if we migrate this application away from SAS— to move this functionality directly into the database. The second graph shows an application which has a high degree of statistical functionality, using the FORECAST, TIMESERIES, and ARIMA procedures to fit complex predictive time series models. As R has sophisticated time series modelling packages, we might decide to move this application to R.
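    The post does not show the underlying tooling, but as a rough sketch of the idea, tallying procedure calls across a SAS codebase can be approximated in R with a simple regular expression; the directory path and the regex are assumptions for illustration, not Mango's actual implementation.

    library(stringr)

    ## hypothetical location of the SAS codebase
    sas_files <- list.files("sas_code", pattern = "\\.sas$",
                            recursive = TRUE, full.names = TRUE)

    ## pull out every "PROC <name>" occurrence in one file
    extract_procs <- function(path) {
      code <- tolower(readLines(path, warn = FALSE))
      matches <- str_match_all(code, "\\bproc\\s+(\\w+)")
      unlist(lapply(matches, function(m) m[, 2]))
    }

    ## frequency of each procedure across all files, most common first
    proc_calls <- unlist(lapply(sas_files, extract_procs))
    sort(table(proc_calls), decreasing = TRUE)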

    2. Use macro analysis to find the most and least important components of an application

    Looking at the raw source code doesn’t give us any context about what the most important components of our codebase are. How do we know which code is most important and needs to be prioritised? And how can we avoid spending time redeveloping code which has been written, but is never actually used?

    We can answer these questions by taking a look at the analysis of the macros and how often they’re used in the code. Macros are like user-defined functions which can combine multiple data steps, proc steps, and logic, and are useful for grouping commands we want to call more than once.

    Looking at the plot above, we can see that the transfer_data macro is called 17 times, so we know it’s important to our codebase. When redeveloping the code, we might want to pay extra attention to this macro as it’s crucial to the application’s functionality.

    On the other hand, looking at load_other, we can see that it’s never called – this is known as ‘orphaned code’ and is common in large legacy codebases. With this knowledge, we can automatically exclude this to avoid wasting time and resource examining it.

    3. Looking at the interrelated components to understand process flow

    When redeveloping individual applications, planning the project and allocating resources requires an understanding of how the different components fit together and which parts are more complex than others. How do we gain this understanding without spending hours reading every line of code?

    The process flow diagram above allows us to see which scripts are linked to other scripts. Each node represents a script in the codebase, and is scaled by size. The nodes are coloured by complexity. Looking at the diagram above, we can instantly see that the create_metadata script is both large and complex, and so we might choose to assign this to a more experienced developer, or look to restructure it first.
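    Again purely as an illustrative sketch (the diagram in the post comes from Mango's own tooling, which is not shown), a process flow graph like this could be drawn with the igraph package once you have a table of which script calls or %includes which; the edge list, node sizes, and complexity scores below are invented, with only the create_metadata and transfer_data names borrowed from the article.

    library(igraph)

    ## invented example: which script calls or %includes which
    edges <- data.frame(
      from = c("main", "main", "create_metadata", "transfer_data"),
      to   = c("create_metadata", "transfer_data", "load_lookups", "load_lookups")
    )

    ## invented node attributes: size (lines of code) and a complexity score
    nodes <- data.frame(
      name       = c("main", "create_metadata", "transfer_data", "load_lookups"),
      lines      = c(120, 900, 400, 150),
      complexity = c(3, 25, 10, 4)
    )

    g <- graph_from_data_frame(edges, vertices = nodes, directed = TRUE)

    ## scale node size by script size and colour nodes by complexity
    plot(g,
         vertex.size     = nodes$lines / 30,
         vertex.color    = heat.colors(30)[pmin(nodes$complexity, 30)],
         edge.arrow.size = 0.4)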

    4. Examine code complexity to assess what needs redeveloping and redesigning

    Even with organisational best practice guidelines, there can still be discrepancies in the quality and style of code produced when it was first created. How do we know which code is fit for purpose, and which code needs restructuring so we can allocate resources more effectively?

    Thankfully, we can use ‘cyclomatic complexity’ which will assess how complex the code is. The results of this analysis will determine: whether it needs to be broken down into smaller chunks, how much testing is needed, and which code needs to be assigned to more experienced developers.

    5. Use the high level overview to get an informed perspective which ties into your strategic objectives

    Analytics modernisation programs can be large and complex projects, and the focus of a SAS Code Health Check is to allow people to make well-informed decisions by reducing the number of unknowns. So, how do we prioritise our applications in a way that ties into our strategic objectives?

    The overall summary can be used to answer questions around the relative size and complexity of multiple applications, making it possible to estimate more accurately the time and effort required for redevelopment. Custom comparison metrics can be created on the basis of strategic decisions.

    For example, if your key priority is to consolidate your ETL process, you might first focus on the apps which have a high number of calls to PROC SQL. Or you might have a goal of improving the quality of your graphics, and so you'll focus on the applications which produce a large number of plots. Either way, a high level summary like the one below collects all the information you need in one place and simplifies the decision-making process.

    SAS conversion projects tend to be large and complicated, and require deep expertise to ensure their success. A SAS Health Check can help reduce uncertainty, guide your decisions and save you time, resources and, ultimately, money.

    If you’re thinking of reducing or completely redeveloping your SAS estate, and want to know more about how Mango Solutions can help, get in touch with with our team today via sales@mango-solutions.com or +44 (0)1249 705 450.


    To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Announcing the Plumber v0.4.4 Release

    Mon, 12/04/2017 - 17:11

    (This article was first published on Trestle Technology, LLC - R, and kindly contributed to R-bloggers)

    Plumber is a package which allows you to create web APIs from your R code. If you’re new to Plumber, you can find out more at www.rplumber.io.
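    For readers who have not used it before, a minimal plumber.R looks roughly like the sketch below; this is a generic illustration of Plumber's annotation style (the /echo and /plot endpoints are invented), not code from the release notes.

    ## plumber.R -- a minimal API definition

    #* Echo back a message
    #* @param msg The message to echo
    #* @get /echo
    function(msg = "") {
      list(message = paste("The message is:", msg))
    }

    #* Return a plot as a PNG image
    #* @png
    #* @get /plot
    function() {
      plot(1:10)
    }

    Running the API is then a one-liner from R:

    library(plumber)
    pr <- plumb("plumber.R")
    pr$run(port = 8000)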

    We’re excited to announce the v0.4.4 release of Plumber! This release adds a handful of oft-requested features and cleans up a few issues that came out of the major refactor that took place in the 0.4.2 release. We’ve also continued to expand the official Plumber documentation. We’ll mention the highlights below, but you can see the full release notes for v0.4.4 here. The release is on CRAN now, so you can update using:

    install.packages("plumber")

    Plumber v0.4.4 Highlights
    1. Support customized image sizes on the @png and @jpeg annotations. More details here.
    2. Support expiration, HTTPOnly, and Secure flags on cookies as discussed here (but see the “Known Bugs” section below for an important note about cookie expiration in v0.4.4).
    3. Restore functionality of PlumberStatic routers (#156).
    4. Support arguments sent from clients to nested subrouters.
    5. For APIs deployed using DigitalOcean, set the working directory to the root of the API before starting.
    6. Restore functionality of setErrorHandler.
    7. Ignore capitalization when searching for plumber.r and entrypoint.r files when plumb()ing a directory.
    8. Support query strings with keys that appear more than once
      (#165)
    9. Fix the validation error warning at the bottom of deployed Swagger files
      which would have appeared any time your swagger.json file was hosted in
      such a way that a hosted validator service would not have been able to access
      it. For now we just suppress validation of swagger.json files. (#149)
    10. Support for floating IPs in DNS check that occurs in do_configure_https()
    11. Make adding swap file idempotent in do_provision() so you can now call that function
      on a single droplet multiple times.
    12. Support an exit hook which can define a function that will be
      evaluated when the API is interrupted. e.g.
      pr <- plumb("plumber.R"); pr$registerHook("exit", function(){ print("Bye bye!") })
    13. Fixed bug in which a single function couldn’t support multiple paths for a
      single verb (#203).
    Known Bugs

    At the time of writing, one bug has already been fixed since the v0.4.4 release. Cookie expiration times were not properly being sent to clients which caused most clients to ignore these times altogether and revert back to using session cookies. If you wish to set an expiration time on a cookie, you will need to use the development release of Plumber which you can install using:

    devtools::install_github("trestletech/plumber")

    To leave a comment for the author, please follow the link and comment on their blog: Trestle Technology, LLC - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
