### ‘LMX ot NOSJ!’ Interchanging Classic Data Formats With Single `blackmagic` Incantations

(This article was first published on ** R – rud.is**, and kindly contributed to R-bloggers)

The D.C. Universe magic hero Zatanna used spells (i.e. incantations) to battle foes and said spells were just sentences said backwards, hence the mixed up jumble in the title. But, now I’m regretting not naming the package zatanna and reversing the function names to help ensure they’re only used deliberately & carefully. You’ll see why in a bit.

Just like their ore-seeking speleological counterparts, workers in our modern day data mines process a multitude of mineralic data formats to achieve our goals of world domination finding meaning, insight & solutions to hard problems.

Two formats in particular are common occurrences in many of our $DAYJOBs: XML and JSON. The rest of this (hopefully short-ish) post is going to assume you have at least a passing familiarity with — if not full-on battle scars from working with — them.

XML and JSON are, in many ways, very similar. This similarity is on purpose since JSON was originally created to make is easier to process data in browsers and help make data more human-readable. If your $DAYJOB involves processing small or large streams of nested data, you likely prefer JSON over XML.

There are times, though, that — even if one generally works with only JSON data — one comes across a need to ingest some XML and turn it into JSON. This was the case for a question-poser on Stack Overflow this week (I won’t point-shill with a direct link but it’ll be easy to find if you are interested in this latest SODD package).

Rather than take on the potentially painful task of performing the XML to JSON transformation on their own the OP wished for a simple incantation to transform the entirety of the incoming XML into JSON.

We’ll switch comic universes for a moment to issue a warning that all magic comes with a price. And, the cost for automatic XML<->JSON conversion can be quite high. XML has namespaces, attributes tags and values and requires schemas to convey data types and help validate content structure. JSON has no attributes, implicitly conveys types and is generally schema-less (though folks have bolted on that concept).

If one is going to use magic for automatic data conversion there must be rules (no, *not* those kind of Magic rules), otherwise how various aspects of XML become encoded into JSON (and the reverse) will generate inconsistency and may even result in significant data corruption. Generally speaking, you are *always* better off writing *your own* conversion utility vs rely on specific settings in a general conversion script/function. However, if your need is a one-off (which anyone who has been doing this type of work for a while knows is also generally never the case) you may have cause to throw caution to the wind, get your quick data fix, and move on. If that is the case, the blackmagic package may be of use to you.

One file that’s in XML that I only occasionally have to process is the IANA IPv4 Address Space Registry. If you visited that link you may have uttered *“Hey! That’s not XML it’s HTML!”*, to wit — I would respond — *“Well, HTML is really XML anyway, but use the View Source, Luke! and see that it is indeed XML with some clever XSL style sheet processing being applied in-browser to make the gosh awful XML human readable.”*.

With blackmagic we can make quick work of converting this monstrosity into JSON.

The blackmagic package itself uses even *darker* magic to accomplish its goals. The package is just a thin V8 wrapper around the xml-js javascript library. Because of this, it is recommended that you do not try to process gigabytes of XML with it as there is a round trip of data marshalling between R and the embedded v8 engine.

You can thank @yoniceedee for the URL processing capability in blackmagic:

source_url <- "https://www.iana.org/assignments/ipv4-address-space/ipv4-address-space.xml" iana_json <- blackmagic::xml_to_json(source_url) # NOTE: cat the whole iana_json locally to see it — perhaps to file="..." vs clutter your console cat(substr(iana_json, 1800, 2300)) ## me":"prefix","elements":[{"type":"text","text":"000/8"}]},{"type":"element","name":"designation","elements":[{"type":"text","text":"IANA - Local Identification"}]},{"type":"element","name":"date","elements":[{"type":"text","text":"1981-09"}]},{"type":"element","name":"status","elements":[{"type":"text","text":"RESERVED"}]},{"type":"element","name":"xref","attributes":{"type":"note","data":"2"}}]},{"type":"element","name":"record","elements":[{"type":"element","name":"prefix","elements":[{"type":"By by the hoary hosts of Hoggoth that's not very "human readable"! And, it looks super-verbose. Thankfully, Yousuf Almarzooqi knew we'd want to fine-tune the output and we can use those options to make this a bit better:

blackmagic::xml_to_json( doc = source_url, spaces = 2, # Number of spaces to be used for indenting XML output compact = FALSE, # Whether to produce detailed object or compact object ignoreDeclaration = TRUE # No declaration property will be generated. ) -> iana_json # NOTE: cat the whole iana_json locally to see it — perhaps to file="..." vs clutter your console cat(substr(iana_json, 3000, 3300)) ## pe": "element", ## "name": "prefix", ## "elements": [ ## { ## "type": "text", ## "text": "000/8" ## } ## ] ## }, ## { ## "type": "element", ## "name": "designation", ##One "plus side" for doing the mass-conversion is that we don't really need to do much processing to have it be "usable" data in R:

blackmagic::xml_to_json( doc = source_url, compact = FALSE, ignoreDeclaration = TRUE ) -> iana_json # NOTE: consider taking some more time to explore this monstrosity than this str(processed <- jsonlite::fromJSON(iana_json), 3) ## List of 1 ## $ elements:'data.frame': 3 obs. of 5 variables: ## ..$ type : chr [1:3] "instruction" "instruction" "element" ## ..$ name : chr [1:3] "xml-stylesheet" "oxygen" "registry" ## ..$ instruction: chr [1:3] "type=\"text/xsl\" href=\"ipv4-address-space.xsl\"" "RNGSchema=\"ipv4-address-space.rng\" type=\"xml\"" NA ## ..$ attributes :'data.frame': 3 obs. of 2 variables: ## .. ..$ xmlns: chr [1:3] NA NA "http://www.iana.org/assignments" ## .. ..$ id : chr [1:3] NA NA "ipv4-address-space" ## ..$ elements :List of 3 ## .. ..$ : NULL ## .. ..$ : NULL ## .. ..$ :'data.frame': 280 obs. of 4 variables: compact(processed$elements$elements[[3]]$elements) %>% head(6) %>% str(3) ## List of 6 ## $ :'data.frame': 1 obs. of 2 variables: ## ..$ type: chr "text" ## ..$ text: chr "IANA IPv4 Address Space Registry" ## $ :'data.frame': 1 obs. of 2 variables: ## ..$ type: chr "text" ## ..$ text: chr "Internet Protocol version 4 (IPv4) Address Space" ## $ :'data.frame': 1 obs. of 2 variables: ## ..$ type: chr "text" ## ..$ text: chr "2018-04-23" ## $ :'data.frame': 3 obs. of 4 variables: ## ..$ type : chr [1:3] "text" "element" "text" ## ..$ text : chr [1:3] "Allocations to RIRs are made in line with the Global Policy published at " NA ". \nAll other assignments require IETF Review." ## ..$ name : chr [1:3] NA "xref" NA ## ..$ attributes:'data.frame': 3 obs. of 2 variables: ## .. ..$ type: chr [1:3] NA "uri" NA ## .. ..$ data: chr [1:3] NA "http://www.icann.org/en/resources/policy/global-addressing" NA ## $ :'data.frame': 3 obs. of 4 variables: ## ..$ type : chr [1:3] "text" "element" "text" ## ..$ text : chr [1:3] "The allocation of Internet Protocol version 4 (IPv4) address space to various registries is listed\nhere. Origi"| __truncated__ NA " documents most of these allocations." ## ..$ name : chr [1:3] NA "xref" NA ## ..$ attributes:'data.frame': 3 obs. of 2 variables: ## .. ..$ type: chr [1:3] NA "rfc" NA ## .. ..$ data: chr [1:3] NA "rfc1466" NA ## $ :'data.frame': 5 obs. of 4 variables: ## ..$ type : chr [1:5] "element" "element" "element" "element" ... ## ..$ name : chr [1:5] "prefix" "designation" "date" "status" ... ## ..$ elements :List of 5 ## .. ..$ :'data.frame': 1 obs. of 2 variables: ## .. ..$ :'data.frame': 1 obs. of 2 variables: ## .. ..$ :'data.frame': 1 obs. of 2 variables: ## .. ..$ :'data.frame': 1 obs. of 2 variables: ## .. ..$ : NULL ## ..$ attributes:'data.frame': 5 obs. of 2 variables: ## .. ..$ type: chr [1:5] NA NA NA NA ... ## .. ..$ data: chr [1:5] NA NA NA NA ...As noted previously, all magic comes with a price and we just traded XML processing for some gnarly list processing. This isn't the case for all XML files and you can try to tweak the parameters to xml_to_json() to make the output more usable (NOTE: key name transformation parameters still need to be implemented in the package), but this seems a *whole* lot easier (to me):

xml_to_json() has a sibling function --- json_to_xml() for the reverse operation and you're invited to fill in the missing parameters with a PR as there is a fairly consistent and straightforward way to do that. Note that a small parameter tweak can radically change the output, which is one of the aforementioned potentially costly pitfalls of this automagic conversion.

Before using either function, seriously consider taking the time to write a dedicated, small package that exposes a function or two to perform the necessary conversions.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – rud.is**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R Improvements for Bio7 2.8

(This article was first published on ** R – Bio7 Website**, and kindly contributed to R-bloggers)

**18.05.2018
**

The next release of Bio7 adds a lot of new R features and improvements. One minor change is that the default perspective after the startup of Bio7 now is the R perspective to emphazise the importance of R within this software.

The R-Shell view has been simplified and the R templates have been moved in it’s own simple view for an improved usability (see screenshot from R perspective below).

In addition the context menu has been enhanced to allow the creation of submenus from scripts found in folders and subfolders (recursively added) which you can create for a menu structure.

Scripts can be added created in R, JavaScript, Groovy, Jython, BeanShell, ImageJ Macros.

Java (with dependant classes) can be dynamically compiled and executed like a script, too.

Several improvements have also been added to the R-Shell and the R editor for an easier generation of valid R code. The R-Shell and the R editor now display R workspace objects with it’s class and structure in the code completion dialog (marked with a new workspace icon – see below).

R-Shell:

R editor:

In the R editor a new quick fix function has been added to detect and install missing packages (from scanned default packages folder of an R installation – has to be enabled in the Bio7 R code analysis preferences).

Also the detection of missing package imports are fixable (when a function is called but the installed package declaration is missing in the code but the package is installed to deliver the function).

The code assistance in the R-Shell and in the R editor now offers completions for, e.g., dataframes (columns) in the %>% operator of piped function calls:

In addition code assistance is available for list, vectors, dataframes and arrays of named rows and columns, etc., when available in the current R environment.

Code completion for package functions can now easily added with the R-Shell or the R editor which loads the package function help for both interfaces. The editor will automatically be updated (see updated editor marking unknown functions in screencast below).

Numerous other features, improvements and bugfixes have been added, too.

Bio7 2.8 will hopefully be available soon at:

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Bio7 Website**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### NYC restaurants reviews and inspection scores

(This article was first published on ** R – NYC Data Science Academy Blog**, and kindly contributed to R-bloggers)

If you ever pass outside a restaurant in New York City, you’ll notice a prominently displayed letter grade. Since July 2010, the Health Department has required restaurants to post letter grades showing sanitary inspection results.

An A grade attests to top marks for health and safety, so you can feel secure about eating there. But you don’t necessarily know that you will enjoy the food and experience courteous service. To find that out, you’d refer to the restaurant reviews. For this project, I looked at a simple data analysis and visualization of the NYC restaurants reviews and inspection scores data to find out if there is any correlation between the two. The data will also show which types of cuisines and which NYC locations tend to attract more ratings.

Nowadays, business reviews, ratings and grades are the decision making for any business to measure for their quality, popularity and future success. For restaurants business, ratings, hygienic, and cleanliness are essential. A popular site for reviews, Yelp, offers many individual ratings for restaurants. The New York City Department of Health and Mental Hygiene (DOHMH) conducts unannounced restaurant inspections annually. They check if the food handling, food temperature, personal hygiene of workers and vermin control of the restaurants are in compliance with hygienic standards.. The scoring and grading process can be found here.

The restaurant ratings and location information used in this project come from Yelp’s API. The inspection data was downloaded from NYC open data website. I merge yelp restaurants review data and inspection data and remove NA rows which doesn’t haveeither inspection score or reviews. I also reassigned the inspection score in the grades A, B, and C category as this measure is widely used and label on restaurants. There were other scores, primarily P or Z, or some version of grade pending which we are ignoring in our analysis here. Restaurants with a score between 0 and 13 points earn an A, those with 14 to 27 points receive a B and those with 28 or more a C.

The data shows that an A is the most commonly assigned inspection grade for restaurants of all types in all locations. I plotted various bar plots to visualized the inspection scores and ratings based on borough and cuisine type.

With respect to location, this borough bar plot shows that Manhattan has highest number of restaurants with all grades compared to others. This is obvious as it has highest number of restaurants in general. Staten Island has lowest number of restaurants with grades A, B and C among all.

As for cuisine types, the cuisines plots shows first 15 restaurants with highest number of counts for based on cuisine. This indicates that the American cuisine has highest number of A grade compared to other. This indicate that american restaurants are focus more on hygienic and cleanliness compare to others type of restaurants.

The review plot indicates that most restaurants do achieve the top rating of 4 stars. Again, Manhattan has the highest number of restaurants with ratings four stars while Staten Island has lowest numbers of restaurants with high ratings. It also shows that almost all borough have a low number of 2 star restaurants. Moreover, cuisine reviews plot indicates that American cuisine tend to have the highest rating compared to other cuisines. The reasons could be more American restaurants under this category then others.

The scatter plots shows therelationship between inspection score and rating. It indicates that there is no direct clear correlation between two variables. It is fairly common for a restaurant with a C grade inspection score to achieve a 4-5 star ratings in a review. Also it is possible to find a number of A grade ratings for restaurants that only have 1-2 stars. This could be because so long as food is tasty, people will rate the restaurant well because they do not pay very much attentions to cleanliness and hygienic issues. The scatter plots also show that though some restaurants maintain a very high level of cleanliness and hygienic food conditions, they fail to get good ratings, which could be due to bad service or less than tasty food . We can do further analysis on both side of restaurants by analyzing review comments and try to find why some restaurants have good reviews but low inspection score and vice-versa. This require further data about reviews comments and further analysis using NLP.

The cluster map of NYC restaurants helps visualize locations and to filter the restaurants based cuisine types. The color mark of the point indicates the ratings and includes descriptions of the featured restaurants. The heat map show the density of the restaurants based on borough selection or cuisine selection. It indicate which area has a greater number of restaurants. This could be helpful for business people to make informed decisions about where to open new restaurants based on the types of restaurants already in place.

Finally, this app can be useful for people to filter the data base on borough, cuisine , ratings , and inspection grade. The people want to go to eat with specific criteria can filters the restaurants and visit their favorite restaurants based on top marks for both ratings and inspection grades. The shiny app link is here.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – NYC Data Science Academy Blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### drake’s improved high-performance computing power

(This article was first published on ** rOpenSci - open tools for open science**, and kindly contributed to R-bloggers)

The drake R package is not only a reproducible research solution, but also a serious high-performance computing engine. The Get Started page introduces drake, and this technical note draws from the guides on high-performance computing and timing.

You can help!Some of these features are brand new, and others are newly refactored. The GitHub version has all the advertised functionality, but it needs more testing and development before I can submit it to CRAN in good conscience. New issues such as r-lib/processx#113 and HenrikBengtsson/future#226 seem to affect drake, and more may emerge. If you use drake for your own work, please consider supporting the project by field-testing the claims below and posting feedback here.

Let drake schedule your targets.A typical workflow is a sequence of interdependent data transformations. Consider the example from the Get Started page.

When you call make() on this project, drake takes care of "raw_data.xlsx", then raw_data, and then data in sequence. Once data completes, fit and hist can launch in parallel, and then "report.md" begins once everything else is done. It is drake’s responsibility to deduce this order of execution, hunt for ways to parallelize your work, and free you up to focus on the substance of your research.

Activate parallel processing.Simply set the jobs argument to an integer greater than 1. The following make() recruits multiple processes on your local machine.

make(plan, jobs = 2)For parallel deployment to a computing cluster (SLURM, TORQUE, SGE, etc.) drake calls on packages future, batchtools, and future.batchtools. First, create a batchtools template file to declare your resource requirements and environment modules. There are built-in example files in drake, but you will likely need to tweak your own by hand.

drake_batchtools_tmpl_file("slurm") # Writes batchtools.slurm.tmpl.Next, tell future.batchtools to talk to the cluster.

library(future.batchtools) future::plan(batchtools_slurm, template = "batchtools.slurm.tmpl")Finally, set make()’s parallelism argument equal to "future" or "future_lapply".

make(plan, parallelism = "future", jobs = 8) Choose a scheduling algorithm.The parallelism argument of make() controls not only where to deploy the workers, but also how to schedule them. The following table categorizes the 7 options.

Deploy: local Deploy: remote**Schedule: persistent**“mclapply”, “parLapply” “future_lapply”

**Schedule: transient**“future”, “Makefile”

**Schedule: staged**“mclapply_staged”, “parLapply_staged” Staged scheduling

drake’s first custom parallel algorithm was staged scheduling. It was easier to implement than the other two, but the workers run in lockstep. In other words, all the workers pick up their targets at the same time, and each worker has to finish its target before any worker can move on. The following animation illustrates the concept.

But despite weak parallel efficiency, staged scheduling remains useful because of its low overhead. Without the bottleneck of a formal master process, staged scheduling blasts through armies of tiny conditionally independent targets (example here). Consider it if the bulk of your work is finely diced and perfectly parallel, maybe if your dependency graph is tall and thin.

Persistent scheduling

Persistent scheduling is brand new to drake. Here, make(jobs = 2) deploys three processes: two workers and one master. Whenever a worker is idle, the master assigns it the next target whose dependencies are fully ready. The workers keep running until no more targets remain. See the animation below.

If the time limits of your cluster are too strict for persistent workers, consider transient scheduling, another new arrival. Here, make(jobs = 2) starts a brand new worker for each individual target. See the following video.

How many jobs should you choose?

The predict_runtime() function can help. Let’s revisit the mtcars example.

Let’s also

- Plan for non-staged scheduling,
- Assume each non-file target (black circle) takes 2 hours to build, and
- Rest assured that everything else is super quick.

When we declare the runtime assumptions with the known_times argument and cycle over a reasonable range of jobs, predict_runtime() paints a clear picture.

jobs = 4 is a solid choice. Any fewer would slow us down, and the next 2-hour speedup would take double the jobs and the hardware to back it up. Your choice of jobs for make() ultimately depends on the runtime you can tolerate and the computing resources at your disposal.

Thanks!When I attended RStudio::conf(2018), drake relied almost exclusively on staged scheduling. Kirill Müller spent hours on site and hours afterwards helping me approach the problem and educating me on priority queues, message queues, and the knapsack problem. His generous help paved the way for drake’s latest enhancements.

DisclaimerThis post is a product of my own personal experiences and opinions and does not necessarily represent the official views of my employer. I created and embedded the Powtoon videos only as explicitly permitted in the Terms and Conditions of Use, and I make no copyright claim to any of the constituent graphics.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** rOpenSci - open tools for open science**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Understanding PCA using Stack Overflow data

(This article was first published on ** Rstats on Julia Silge**, and kindly contributed to R-bloggers)

This year, I have given some talks about understanding principal component analysis using what I spend day in and day out with, Stack Overflow data. You can see a recording of one of these talks from rstudio::conf 2018. When I have given these talks, I’ve focused a lot on understanding PCA. This blog post walks through how I implemented PCA and how I made the plots I used in my talk.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** Rstats on Julia Silge**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Analyzing Customer Data from Square

(This article was first published on ** Blog-rss on stevenmortimer.com**, and kindly contributed to R-bloggers)

- The Square Data Model
- Authenticating
- Pulling Transaction Data
- Spend by Customer Group
- Issues with the APIs

Whether you own your own business or consult for a business using Square to capture payment data, Square can offer some amazing opportunities to gain insights by leveraging their Connect v1 & v2 APIs. The Square data backend operates much like a CRM system that holds information about *transactions* that *customers* make when purchasing *items* offered at a *location*. Naturally, there are API endpoints for each of the object types italicized above (Locations, Customers, Transactions, and Items) and many more endpoints for administrative tasks (employees, roles, timecards, refunds, etc.).

While working with Square data I decided to develop an R package **squareupr** that makes it easier to retrieve Square data from R so that you can focus on the analysis. After installing the **squareupr** package you must authenticate by supplying a personal access token (PAT) or using an OAuth 2.0 flow. You can find your PAT by logging into the Square Dashboard -> Apps -> My Apps. Create an app or click “Manage App” if you’ve already created one and there you should see your personal access token:

The package also offers OAuth 2.0 authentication. More information is available here.

Pulling Transaction DataAs mentioned above there are endpoints for every major type of data stored by Square. The API documentation does a very good job at laying out how requests should be made to each of those endpoints. One thing to note is that it is important to first pull down the location details for your business because the location is often required when searching for things like transactions and items.

our_locations <- sq_list_locations() our_locations$name <- "{HIDDEN}" our_locations %>% select(id, name, address, timezone, capabilities, status, created_at) #> # A tibble: 5 x 7 #> id name address timezone capabilities status created_at #> #> 1 46FYN9N9RQS54 {HIDDEN} ACTIVE 2017-04-2… #> 2 DRDCJ2X8E2PMV {HIDDEN} ACTIVE 2016-09-2… #> 3 8T1TYXE840S00 {HIDDEN} ACTIVE 2016-09-2… #> 4 1AWPRVVVFWGQF {HIDDEN} ACTIVE 2017-04-1… #> 5 50X1GNAWEC8V0 {HIDDEN} ACTIVE 2017-03-0…Now that you have the location ids if you would like to pull all of the transactions during a given timeframe, you would use the function sq_list_transactions().

# list all transactions for our 2nd location on May 11, 2018 # by default, if a date is provided with no time, then the time component is set to midnight our_transactions <- sq_list_transactions(location = our_locations$id[2], begin_time = as.Date('2018-05-11'), end_time = as.Date('2018-05-12')) our_transactions #> # A tibble: 245 x 6 #> id location_id created_at tenders product client_id #> #> 1 bUjFGVjBvN… DRDCJ2X8E2P… 2018-05-12T0… 2 5PZP31N5Zs… DRDCJ2X8E2P… 2018-05-11T2… 3 BTrGydD6he… DRDCJ2X8E2P… 2018-05-11T2… 4 XsqOAHl68z… DRDCJ2X8E2P… 2018-05-11T2… 5 vmLRzrwByS… DRDCJ2X8E2P… 2018-05-11T2… 6 pTbzQApZW7… DRDCJ2X8E2P… 2018-05-11T2… 7 lnE20zklpP… DRDCJ2X8E2P… 2018-05-11T2… 8 DSumrqQW0L… DRDCJ2X8E2P… 2018-05-11T2… 9 tPwFXetIwe… DRDCJ2X8E2P… 2018-05-11T2… 10 bqUuFrzH71… DRDCJ2X8E2P… 2018-05-11T2… # ... with 235 more rowsAt first glance there does not appear to be very much detail on the transaction record. However, the tender field represents a method of payment used in a Square transaction so it contains information regarding the amount of money paid in total, in Square fees, and tip. The tender field even contains information regarding the customer_id and credit card information. In the following I will loop through the transactions in April 2018 and determine the total spend for each customer in the dataset. Note: The call to sq_list_transactions may take a couple minutes to complete if you are pulling thousands of transactions.

april_transactions <- sq_list_transactions(location = our_locations$id[2], begin_time = as.Date('2018-04-01'), end_time = as.Date('2018-05-01'))In order to extract the customer ID and money spent I create a function that checks for the tender object on the transaction and if it exists tries to extract the data into a tbl_df. When I supply this function as map_df(extract_cust_info_func) %>% I get the data from each transaction stacked into a single tbl_df that’s ready to analyze.

# create a function that will extract out just the customer id and money spent extract_cust_info_func <- function(x){ if(!is.null(x$tender)){ tibble(customer_id = sq_null_to_na(x$tender[[1]]$customer_id), money_spent = sq_null_to_na(x$tender[[1]]$amount_money$amount)) } else { tibble(customer_id = NA_character_, money_spent = NA_integer_) } } april_customer_spend <- april_transactions %>% transpose() %>% # pull out just the information we want from each transaction map_df(extract_cust_info_func) %>% group_by(customer_id) %>% summarize(total_spend = sum(money_spent, na.rm=TRUE)) april_customer_spend #> # A tibble: 208 x 2 #> customer_id total_spend #> #> 1 064HFDQG0N52AHDBSG00C1BAC8 1700 #> 2 07VNWH1V4S6W4W2EJ4AN7SEJNR 0 #> 3 08M453QNJ97BCT97SM09TN7QK4 0 #> 4 08XE43X8FS0MPX8P2W4N0DEQY0 350 #> 5 0CZ78CVRW12V7AZET8S3S82GGW 675 #> 6 0G1J81148H42GGMTMQKRWSJHGC 0 #> 7 0V1Y1BX23WYRK889ERVBE2T0KM 900 #> 8 13HHFBFZTD33RX0RSJNAZQKV5M 0 #> 9 1BFCHB9MK91GQ39HTD7QK6R7ZR 0 #> 10 1DR0AK5GKX57H9ER9SG1JF01P0 2900 #> # ... with 198 more rowsThe amounts in april_customer_spend may seem large, but the Square APIs return money as integers that represent whole cents. If you divide by 100, then you will have the money amounts in dollars.

Spend by Customer GroupSquare has this concept of “groups” that customers belong to. These groups can be fashioned to do marketing campaigns complete with email blasts. In our analysis let’s further determine which groups these customers belong to. The Square API has an endpoint to retrieve one customer at a time; however, with large lists you may get subjected to rate limiting. Rate limiting is errors on your requests because too many are coming from the same application or access token. I would recommend pulling down the entire list of customers with sq_list_customers() and then matching them up in R.

cust_groups <- sq_list_customers() %>% select(id, groups) %>% sq_extract_cust_groups() %>% # filter to only the groups designated automatically by Square filter(grepl("^CQ689YH4KCJMY", groups.id)) cust_groups #> # A tibble: 13,444 x 3 #> id groups.id groups.name #> #> 1 M1RBDFRK7S1Q1EP6EZFJFV3CBW CQ689YH4KCJMY.LOYALTY_ALL Loyalty Participa… #> 2 58MK9F1HQ5447D1QZDX60NHTP4 CQ689YH4KCJMY.CHURN_RISK Lapsed #> 3 58MK9F1HQ5447D1QZDX60NHTP4 CQ689YH4KCJMY.REACHABLE Reachable #> 4 MBSJA4QV4WX6N2XV8WV9VJJTG8 CQ689YH4KCJMY.LOYALTY_ALL Loyalty Participa… #> 5 MBSJA4QV4WX6N2XV8WV9VJJTG8 CQ689YH4KCJMY.REACHABLE Reachable #> 6 ZCBZJ234217KTV812WX4DP2404 CQ689YH4KCJMY.REACHABLE Reachable #> 7 FKEMR8KZCN3BH98RV78PKHKQ1R CQ689YH4KCJMY.LOYALTY_ALL Loyalty Participa… #> 8 FKEMR8KZCN3BH98RV78PKHKQ1R CQ689YH4KCJMY.LOYAL Regulars #> 9 78VMJPJNK959AHH0ZQPXDXEG3C CQ689YH4KCJMY.LOYALTY_ALL Loyalty Participa… #> 10 QASM1G54VX0QN2S15YS6KHEFCC CQ689YH4KCJMY.LOYAL Regulars #> # ... with 13,434 more rowsNow that we know the customer memberships, let’s join back with the transaction data to determine the average total spend across the different membership groups.

cust_groups %>% # bring in the spend data inner_join(., april_customer_spend, by=c('id'='customer_id')) %>% # group by the customer groups to find average spend per group group_by(groups.name) %>% summarize(avg_spend = mean(total_spend / 100, na.rm=TRUE)) #> # A tibble: 6 x 2 #> groups.name avg_spend #> #> 1 Cards on File 569. #> 2 Lapsed 20.3 #> 3 Loyalty Participants 13.5 #> 4 Nearing a Loyalty Reward 15.3 #> 5 Reachable 51.0 #> 6 Regulars 27.4 Issues with the APIsWhat is great about this analysis is that we can use the Square APIs to quickly and reliably pull down transaction data, match it to customer information and see how certain customer groupings or campaigns are performing. However, I did find some quirks and drawbacks that I wish the Square team would consider:

Cannot Request Specific FieldsFirst, it would be nice to implement, as part of the query parameters, the ability to only return certain fields (e.g. the id and groups fields from the Customer object). This would help for two reasons: 1) It would improve the speed since only the required data would be passed across and 2) For analysts who do not want to deal with personally identifiable information (even in RAM) you would not be forced to pull information like phone number, email, names, and even credit card information.

Cannot Update Customer Groups ProgrammaticallySecond, the API will **not** allow you to update the customer groups programmatically. This means that you are stuck using the dashboard to create and assign customers to the groups you want to analyze. It is a drag when you want to create a reproducible research workflow to assign customer groups but the API hinders your ability to do so.

Finally, it appears that the customer ID associated to a transaction is not reliably captured on every transaction’s tender object. This means that you cannot get a complete picture of all the transactions. I believe part of this is an issue with the merchant capturing the customer information at the point of sale, but also partly Square’s fault since they seem to be able to piece everything together in their dashboard. However, this is not the case with the APIs. Overall, the Square APIs are a rich data resource for helping run a business and they should only get better with time as development progresses.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** Blog-rss on stevenmortimer.com**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Beautiful and Powerful Correlation Tables in R

(This article was first published on ** Dominique Makowski**, and kindly contributed to R-bloggers)

*correlation*function?!

Yes, the correlation function from the psycho package.

devtools::install_github("neuropsychology/psycho.R") # Install the newest version library(psycho) library(tidyverse) cor <- psycho::affective %>% correlation()This function automatically select numeric variables and run a correlation analysis. It returns a psychobject.

A tableWe can then extract a **formatted table** that can be saved and pasted into reports and manuscripts by using the summary function.

It integrates a **plot** done with ggcorplot.

It also includes a **pairwise correlation printing** method.

You can also cutomize the **type** (*pearson, spearman or kendall*), the **p value correction method** (*holm (default), bonferroni, fdr, none…*) and run **partial, semi-partial** or **glasso** correlations.

*p*-hacking

In order to **prevent people for running many uncorrected correlation tests** (promoting *p*-hacking and result-fishing), we included the i_am_cheating parameter. If FALSE (default), the function will help you finding interesting results!

**As we can see, Schopenhauer’s Optimism is strongly related to many variables!!!**

This package was useful? You can cite psycho as follows:

- Makowski, (2018).
*The psycho Package: an Efficient and Publishing-Oriented Workflow for Psychological Science*. Journal of Open Source Software, 3(22), 470. https://doi.org/10.21105/joss.00470

To **leave a comment** for the author, please follow the link and comment on their blog: ** Dominique Makowski**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Le Monde puzzle [#1051]

(This article was first published on ** R – Xi'an's Og**, and kindly contributed to R-bloggers)

**A** combinatoric Le Monde mathematical puzzle of limited size:

*When the only allowed move is to switch two balls from adjacent boxes, what is the minimal number of moves to return all balls in the above picture to their respective boxes? Same question with six boxes and 12 balls.*

**T**he question is rather interesting to code as I decided to use recursion (as usual!) but wanted to gain time by storing the number of steps needed by any configuration to reach its ordered recombination. Meaning I had to update an external vector within the recursive function for each new configuration I met. With help from Julien Stoehr, who presented me with the following code, a simplification of a common R function

which assigns one or several entries to the external vector i. I thus used this trick in the following R code, where cosz is a vector of size 5¹⁰, much larger than the less than 10! values I need but easier to code. While n≤5.

n=5;tn=2*n baz=n^(0:(tn-1)) cosz=rep(-1,n^tn) swee <- function(balz){ indz <- sum((balz-1)*baz) if (cosz[indz]==-1){ if (min(diff(balz))==0){ #ordered v.assign("cosz",indz,value=1)}else{ val <- n^tn for (i in 2:n) for (j in (2*i-1):(2*i)) for (k in (2*i-3):(2*i-2)){ calz <- balz calz[k] <- balz[j];calz[j] <- balz[k] if (max(balz[k]-calz[k],calz[j]-balz[j])>0) val <- min(val,1+swee(calz))} v.assign("cosz",indz,value=val) }} return(cosz[indz])}which returns 2 for n=2, 6 for n=3, 11 for n=4, 15 for n=5. In the case n=6, I need a much better coding of the permutations of interest. Which is akin to ranking all words within a dictionary with letters (1,1,…,6,6). After some thinking (!) and searching, I came up with a new version, defining

parclass=rep(2,n) rankum=function(confg){ n=length(confg);permdex=1 for (i in 1:(n-1)){ x=confg[i] if (x>1){ for (j in 1:(x-1)){ if(parclass[j]>0){ parclass[j]=parclass[j]-1 permdex=permdex+ritpermz(n-i,parclass) parclass[j]=parclass[j]+1}}} parclass[x]=parclass[x]-1} return(permdex)} ritpermz=function(n,parclass){ return(factorial(n)/prod(factorial(parclass)))}for finding the index of a given permutation, between 1 and (2n)!/2!..2!, and then calling the initial swee(p) with this modified allocation. The R code is still running…

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Xi'an's Og**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### To purrr or not to purrr

(This article was first published on ** Mango Solutions**, and kindly contributed to R-bloggers)

*Nicolas Attalides, Data Scientist*library(purrr) library(magrittr)

I first started using R long before the RStudio and tidyverse days… I remember writing chunks of code in a text editor and copy/pasting it into the R console! Yes I know, shocking. Nonetheless, most of us will have written code over the years that works just fine in base R, however in my case, the ever-growing adoption of the tidyverse packages (and my personal aspiration to improve my coding skills) has created a sense of necessity to re-write parts of it to fit within the tidyverse setting.

In this blog post I explore the purrr package (part of tidyverse collection) and its use within a data

scientist’s toolset. I aim to present the case for using the purrr functions and through the use of examples compare them with base R functionality. To do this, we will concentrate on two typical coding scenarios in base R: 1) loops and 2) the suite of apply functions and then compare them with their relevant counterpart map functions in the purrr package.

However, before I start, I wanted to make it clear that I do sympathise with those of you whose first reaction to purrr is *“but I can do all this stuff in base R”*. Putting that aside, the obvious first obstacle for us to overcome is to lose the notion of *“if it’s not broken why change it”* and open our ‘coding’ minds to change. At least, I hope you agree with me that the silver lining of this kind of exercise is to satisfy ones curiosity about the purrr package and maybe learn something new!

Let us first briefly describe the concept of functional programming (FP) in case you are not familiar with it.

Functional programming (FP)R is a functional programming language which means that a user of R has the necessary tools to create and manipulate functions. There is no need to go into too much depth here but it suffices to know that FP is the process of writing code in a structured way and through functions remove code duplications and redundancies. In effect, computations or evaluations are treated as mathematical functions and the output of a function only depends on the values of its inputs – known as arguments. FP ensures that any side-effects such as changes in state do not affect the expected output such that if you call the same function twice with the same arguments the function returns the same output.

For those that are interested to find out more, I suggest reading Hadley Wickham’s *Functional Programming* chapter in the “Advanced R” book. The companion website for this can be found here.

The purrr package, which forms part of the tidyverse ecosystem of packages, further enhances the functional programming aspect of R. It allows the user to write functional code with less friction in a complete and consistent manner. The purrr functions can be used, among other things, to replace loops and the suite of apply functions.

Let’s talk about loopsThe motivation behind the examples we are going to look at involve iterating in R for various scenarios. For example, iterate over elements of a vector or list, iterate over rows or columns of a matrix … the list (pun intended) can go on and on!

One of the first things that one gets very excited to ‘play’ when learning to use R – at least that was the case for me – is loops! Lot’s of loops, elaborate, complex… dare I say never ending infinite loops (queue hysteric laughter emoji). Joking aside, it is usually the default answer to a problem that involves iteration of some sort as I demonstrate below.

# Create a vector of the mean values of all the columns of the mtcars dataset # The long repetitive way mean_vec <- c(mean(mtcars$mpg), mean(mtcars$cyl), mean(mtcars$disp), mean(mtcars$hp), mean(mtcars$drat), mean(mtcars$wt), mean(mtcars$qsec), mean(mtcars$vs), mean(mtcars$am), mean(mtcars$gear), mean(mtcars$carb)) mean_vec # The loop way mean_vec_loop <- vector("double", ncol(mtcars)) for (i in seq_along(mtcars)) { mean_vec_loop[[i]] <- mean(mtcars[[i]]) } mean_vec_loopThe resulting vectors are the same and the difference in speed (milliseconds) is negligible. I hope that we can all agree that the long way is definitely not advised and actually is bad coding practice, let alone the frustration (and error-prone task) of copy/pasting. Having said that, I am sure there are other ways to do this – I demonstrate this later using lapply – but my aim was to show the benefit of using a for loop in base R for an iteration problem.

Now imagine if in the above example I wanted to calculate the variance of each column as well…

# Create two vectors of the mean and variance of all the columns of the mtcars dataset # For mean mean_vec_loop <- vector("double", ncol(mtcars)) for (i in seq_along(mtcars)) { mean_vec_loop[[i]] <- mean(mtcars[[i]]) } mean_vec_loop #For variance var_vec_loop <- vector("double", ncol(mtcars)) for (i in seq_along(mtcars)) { var_vec_loop[[i]] <- var(mtcars[[i]]) } var_vec_loop # Or combine both calculations in one loop for (i in seq_along(mtcars)) { mean_vec_loop[[i]] <- mean(mtcars[[i]]) var_vec_loop[[i]] <- var(mtcars[[i]]) } mean_vec_loop var_vec_loopNow let us assume that we know that we want to create these vectors not just for the mtcars dataset but for other datasets as well. We could in theory copy/paste the for loops and just change the dataset we supply in the loop but one should agree that this action is repetitive and could result to mistakes. Instead we can generalise this into functions. This is where FP comes into play.

# Create two functions that returns the mean and variance of the columns of a dataset # For mean col_mean <- function(df) { output <- vector("double", length(df)) for (i in seq_along(df)) { output[[i]] <- mean(df[[i]]) } output } col_mean(mtcars) #For variance col_variance <- function(df) { output <- vector("double", length(df)) for (i in seq_along(df)) { output[[i]] <- var(df[[i]]) } output } col_variance(mtcars)Why not take this one step further and take full advantage of R’s functional programming tools by creating a function that takes as an argument a function! Yes, you read it correctly… a function within a function!

Why do we want to do that? Well, the code for the two functions above, as clean as it might look, is still repetitive and the only real difference between col_mean and col_var is the mathematical function that we are calling. So why not generalise this further?

# Create a function that returns a computational value (such as mean or variance) # for a given dataset col_calculation <- function(df, fun) { output <- vector("double", length(df)) for (i in seq_along(df)) { output[[i]] <- fun(df[[i]]) } output } col_calculation(mtcars, mean) col_calculation(mtcars, var) Did someone say*apply?*

I mentioned earlier that an alternative way to solve the problem is to use the apply function (or suite of apply functions such as lapply, sapply, vapply, etc). In fact, these functions are what we call Higher Order Functions. Similar to what we did earlier, these are functions that can take other functions as an argument.

The benefit of using higher order functions instead of a for loop is that they allow us to think about what code we are executing at a higher level. Think of it as: “apply this to that” rather than “take the first item, do this, take the next item, do this…”

I must admit that at first it might take a little while to get used to but there is definitely a sense of pride when you can improve your code by eliminating for loops and replace them with *apply-type* functions.

Once again, speed of execution is not the issue and neither is the common misconception about loops being slow compared to apply functions. As a matter of fact the main argument in favour of using lapply or any of the purrr functions as we will see later is the pure *simplicity* and *readability* of the code. Full stop.

The best place to start when exploring the purrr package is the map function. The reader will notice that these functions are utilised in a very similar way to the apply family of functions. The subtle difference is that the purrr functions are consistent and the user can be assured of the output – as opposed to some cases when using for example sapply as I demonstrate later on.

# Create a list/vector of the mean values of all the columns of the mtcars dataset map(mtcars, mean) %>% head # Returns a list map_dbl(mtcars, mean) %>% head # Returns a vector - of class doubleLet us introduce the iris dataset with a slight modification in order to demonstrate the inconsistency that sometimes can occur when using the sapply function. This can often cause issues with the code and introduce mystery bugs that are hard to spot.

# Modify iris dataset iris_mod <- iris iris_mod$Species <- ordered(iris_mod$Species) # Ordered factor levels class(iris_mod$Species) # Note: The ordered function changes the class # Extract class of every column in iris_mod sapply(iris_mod, class) %>% str # Returns a list of the results sapply(iris_mod[1:3], class) %>% str # Returns a character vector!?!? - Note: inconsistent object typeSince by default map returns a list one can ensure that an object of the same class is returned without any unexpected (and unwanted) surprises. This is inline with FP consistency.

# Extract class of every column in iris_mod map(iris_mod, class) %>% str # Returns a list of the results map(iris_mod[1:3], class) %>% str # Returns a list of the resultsTo further demonstrate the consistency of the purrr package in this type of setting, the map_*() functions (see below) can be used to return a vector of the expected type, otherwise you get an informative error.

- map_lgl() makes a logical vector.
- map_int() makes an integer vector.
- map_dbl() makes a double vector.
- map_chr() makes a character vector.

It is worth noting that if the user does not wish to rely on tidyverse dependencies they can always use base R functions but need to be extra careful of the potential inconsistencies that might arise.

Multiple arguments and neat tricksIn case we wanted to apply a function to multiple vector arguments we have the option of mapply from base R or the map2 from purrr.

# Create random normal values from a list of means and a list of standard deviations mu <- list(10, 100, -100) sigma <- list(0.01, 1, 10) mapply(rnorm, n = 5, mu, sigma, SIMPLIFY = FALSE) # I need SIMPLIFY = FALSE because otherwise I get a matrix map2(mu, sigma, rnorm, n = 5)The map2 function can easily extend to further arguments – not just two as in the example above – and that is where the pmap function comes in.

I also thought of sharing a couple of neat tricks that one can use with the map function.

1) Say you want to fit a linear model for every cylinder type in the mtcars dataset. You can avoid code duplication and do it as follows:

# Split mtcars dataset by cylinder values and then fit a simple lm models <- mtcars %>% split(.$cyl) %>% # Split by cylinder into 3 lists map(function(df) lm(mpg ~ wt, data = df)) # Fit linear model for each list2) Say we are using a function, such as sqrt (calculate square root), on a list that contains a non-numeric element. The base R function lapply throws an error and execution stops without knowing what caused the error. The safely function of purrr completes execution and the user can identify what caused the error.

x <- list(1, 2, 3, "e", 5) # Base R lapply(x, sqrt) # purrr package safe_sqrt <- safely(sqrt) safe_result_list <- map(x, safe_sqrt) %>% transpose safe_result_list$result ConclusionOverall, I think it is fair to say that using higher order functions in R is a great way to improve ones code. With that in mind, my closing remark for this blog post is to simply re-iterate the benefits of using the purrr package. That is:

- The output is consistent.
- The code is easier to read and write.

If you enjoyed learning about purrr, then you can join us at our purrr workshop at this years EARL London – early bird tickets are available now!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** Mango Solutions**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### My book “Deep Learning from first principles” now on Amazon

(This article was first published on ** R – Giga thoughts …**, and kindly contributed to R-bloggers)

My 4th book(self-published), “Deep Learning from first principles – In vectorized Python, R and Octave” (557 pages), is now available on Amazon in both paperback ($16.99) and kindle ($6.65/Rs449). The book starts with the most primitive 2-layer Neural Network and works its way to a generic L-layer Deep Learning Network, with all the bells and whistles. The book includes detailed derivations and vectorized implementations in Python, R and Octave. The code has been extensively commented and has been included in the Appendix section.

Pick up your copy today!!!

My other books

1. Practical Machine Learning with R and Python

2. Beaten by sheer pace – Cricket analytics with yorkr

3. Cricket analytics with cricketr

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Giga thoughts …**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Tuning xgboost in R: Part I

(This article was first published on ** R – insightR**, and kindly contributed to R-bloggers)

**By Gabriel Vasconcelos**Before we begin, I would like to thank Anuj for kindly including our blog in his list of the top40 R blogs! Check out the full list at his page, FeedSpot! Introduction

Tuning a Boosting algorithm for the first time may be a very confusing task. There are so many parameters to choose and they all have different behaviour on the results. Also, the best choice may depends on the data. Every time I get a new dataset I learn something new. A good understanding of classification and regression trees (CART) is also helpful because we will be boosting trees, you can start here if you have no idea of what a CART is.

My favourite Boosting package is the xgboost, which will be used in all examples below. Before going to the data let’s talk about some of the parameters I believe to be the most important. These parameters mostly are used to control how much the model may fit to the data. We would like to have a fit that captures the structure of the data but only the real structure. In other words, we do not want the model to fit noise because this will be translated in a poor out-of-sample performance.

**eta:**Learning (or shrinkage) parameter. It controls how much information from a new tree will be used in the Boosting. This parameter must be bigger than 0 and limited to 1. If it is close to zero we will use only a small piece of information from each new tree. If we set eta to 1 we will use all information from the new tree. Big values of eta result in a faster convergence and more over-fitting problems. Small values may need to many trees to converge.**colsample_bylevel:**Just like Random Forests, some times it is good to look only at a few variables to grow each new node in a tree. If we look at all variables the algorithm needs less trees to converge, but looking at, for example, of the variables may result in models more robust to over-fitting. There is a similar parameter called colsample_bytree that re-sample the variables in each new tree instead of each new node.**max_depth:**Controls the maximum depth of the trees. Deeper trees have more terminal nodes and fit more data. Convergence also requires less trees if we grow them deep. However, if the trees are to deep we will be using a lot of information from the first trees and the final trees of the algorithm will have less importance on the loss function. The Boosting benefits from using information from many trees. Therefore it is intuitive that huge trees are not desirable. Smaller trees also grow faster and because the Boosting grow new trees in a pseudo-residual and we do not require any amazing adjustment for an individual tree.**sub_sample:**This parameter determines if we are estimating a Boosting or a Stochastic Boosting. If we use 1 we obtain the regular Boosting. Values between 0 and 1 are for the stochastic case. The stochastic Boosting uses only a fraction of the data to grow each tree. For example, if we use 0.5 each tree will sample 50% of the data to grow. Stochastic Boosting is very useful if we have outliers because it limits their influence on the final model because they are dropped on several sub-samples. Moreover, it is possible to have a significant improvement on smaller instances because they are more susceptible to over-fitting.**gamma**: Controls the minimum reduction in the loss function required to grow a new node in a tree. This parameter is sensitive to the scale of the loss function, which will be linked to the scale of your response variable. The main consequence of using a gamma different from 0 is to stop the algorithm from growing useless trees that barely reduce the in-sample error and are likely to result in over-fitting. I will leave this parameter to a future post to save some space here.**min_child_weigth:**Controls the minimum number of observations (instances) in a terminal node. The minimum value for this parameter is 1, which allows the tree to have terminal nodes with only one observation. If we use bigger values we limit a possible perfect fit on some observations. This parameter will also be left to part II.

The data will be generate from the following equation:

where , , . The number of variables, , will be set to 10 and the number of instances to 1000.

The experiment will be to change each Boosting parameter keeping all the others constant to try to isolate their effects. The standard model will have the following parameters:

- eta: 0.1
- colsample_bylevel:
- max_depth: 6
- sub_sample: 0.5

I will change each of these parameters to the values in the code. We will analyse the convergence and the root mean squared error (RMSE) in the test sample. The code below will prepare everything to run the models.

library(xgboost) library(ggplot2) library(reshape2) library(Ecdat) set.seed(1) N = 1000 k = 10 x = matrix(rnorm(N*k),N,k) b = (-1)^(1:k) yaux=(x%*%b)^2 e = rnorm(N) y=yaux+e # = select train and test indexes = # train=sample(1:N,800) test=setdiff(1:N,train) # = parameters = # # = eta candidates = # eta=c(0.05,0.1,0.2,0.5,1) # = colsample_bylevel candidates = # cs=c(1/3,2/3,1) # = max_depth candidates = # md=c(2,4,6,10) # = sub_sample candidates = # ss=c(0.25,0.5,0.75,1) # = standard model is the second value of each vector above = # standard=c(2,2,3,2) # = train and test data = # xtrain = x[train,] ytrain = y[train] xtest = x[test,] ytest = y[test] etaWe will start analysing the eta parameter. The code below estimates Boosting models for each candidate eta. First we have the convergence plot, which shows that bigger values of eta converge faster. However, the train RMSE just below the plot shows that faster convergence does not translate into good out-of-sample performance. Smaller values of eta like 0.05 and 0.1 are the ones that produce smaller errors. My opinion is that in this case 0.1 gives us a good out-of-sample performance with acceptable convergence speed. Note that the result for eta=0.5 and 1 are really bad compared to the others.

set.seed(1) conv_eta = matrix(NA,500,length(eta)) pred_eta = matrix(NA,length(test), length(eta)) colnames(conv_eta) = colnames(pred_eta) = eta for(i in 1:length(eta)){ params=list(eta = eta[i], colsample_bylevel=cs[standard[2]], subsample = ss[standard[4]], max_depth = md[standard[3]], min_child_weigth = 1) xgb=xgboost(xtrain, label = ytrain, nrounds = 500, params = params) conv_eta[,i] = xgb$evaluation_log$train_rmse pred_eta[,i] = predict(xgb, xtest) } conv_eta = data.frame(iter=1:500, conv_eta) conv_eta = melt(conv_eta, id.vars = "iter") ggplot(data = conv_eta) + geom_line(aes(x = iter, y = value, color = variable)) (RMSE_eta = sqrt(colMeans((ytest-pred_eta)^2))) ## 0.05 0.1 0.2 0.5 1 ## 9.964462 10.052367 10.223738 13.691344 20.929690 colsample_bylevelThe next parameter controls the fraction of variables (characteristics) tested in each new node. Recall that if we use 1 all variables are tested. Values smaller than 1 test only the correspondent fraction of variables. The convergence shows that the model is much less sensitive to the colsample_bylevel. The curves with smaller values are just slightly above the curves with big values. The most accurate model seems to be the one that uses 25% of the sample in each tree. This result may change with different types of data. For example, the data we generated used the same distribution to create all response variables and the betas gave the variables a similar weight, which makes the sampling less relevant.

set.seed(1) conv_cs = matrix(NA,500,length(cs)) pred_cs = matrix(NA,length(test), length(cs)) colnames(conv_cs) = colnames(pred_cs) = cs for(i in 1:length(cs)){ params = list(eta = eta[standard[1]], colsample_bylevel = cs[i], subsample = ss[standard[4]], max_depth = md[standard[3]], min_child_weigth = 1) xgb=xgboost(xtrain, label = ytrain,nrounds = 500, params = params) conv_cs[,i] = xgb$evaluation_log$train_rmse pred_cs[,i] = predict(xgb, xtest) } conv_cs = data.frame(iter=1:500, conv_cs) conv_cs = melt(conv_cs, id.vars = "iter") ggplot(data = conv_cs) + geom_line(aes(x = iter, y = value, color = variable)) (RMSE_cs = sqrt(colMeans((ytest-pred_cs)^2))) ## 0.333333333333333 0.666666666666667 1 ## 10.29836 10.05237 10.20938 max_depthThe third parameter is the maximum deepness allowed in each tree. Naturally, the model converges faster if we grow bigger trees (see the figure below). However, the best out-of-sample performance is for max_depth=4. Remember that bigger trees are more likely to result in over-fitting. In my experience there is no need to use values bigger than 4 to 6, but there may be exceptions.

set.seed(1) conv_md=matrix(NA,500,length(md)) pred_md=matrix(NA,length(test),length(md)) colnames(conv_md)=colnames(pred_md)=md for(i in 1:length(md)){ params=list(eta=eta[standard[1]],colsample_bylevel=cs[standard[2]], subsample=ss[standard[4]],max_depth=md[i], min_child_weigth=1) xgb=xgboost(xtrain, label = ytrain,nrounds = 500,params=params) conv_md[,i] = xgb$evaluation_log$train_rmse pred_md[,i] = predict(xgb, xtest) } conv_md=data.frame(iter=1:500,conv_md) conv_md=melt(conv_md,id.vars = "iter") ggplot(data=conv_md)+geom_line(aes(x=iter,y=value,color=variable)) (RMSE_md=sqrt(colMeans((ytest-pred_md)^2))) ## 2 4 6 10 ## 12.502733 9.374257 10.134965 10.100691 sub_sampleThe next parameter determines if we are estimating a Boosting or a Stochastic Boosting. Smaller values result in bigger errors in-sample but it may generate more robust out-of-sample estimates. However, as mentioned before, you will obtain bigger improvement on samples that have to many characteristics compares to the number of observations and when the data is very noisy. This is not the case here. Nevertheless, there is some improvement compared to the deterministic case, which I believed is mostly caused by outliers in the response variable (boxplot the variable y to see).

set.seed(1) conv_ss=matrix(NA,500,length(ss)) pred_ss=matrix(NA,length(test),length(ss)) colnames(conv_ss)=colnames(pred_ss)=ss for(i in 1:length(ss)){ params=list(eta=eta[standard[1]],colsample_bylevel=cs[standard[2]], subsample=ss[i],max_depth=md[standard[3]], min_child_weigth=1) xgb=xgboost(xtrain, label = ytrain,nrounds = 500,params=params) conv_ss[,i] = xgb$evaluation_log$train_rmse pred_ss[,i] = predict(xgb, xtest) } conv_ss=data.frame(iter=1:500,conv_ss) conv_ss=melt(conv_ss,id.vars = "iter") ggplot(data=conv_ss)+geom_line(aes(x=iter,y=value,color=variable)) (RMSE_ss=sqrt(colMeans((ytest-pred_ss)^2))) ## 0.25 0.5 0.75 1 ## 9.731273 10.052367 11.119535 11.233855The main conclusion here is that there is no unique rule to tune Boosting models. The best way is to test several configurations.

Stay tuned if you liked this article, we will be talking more about Boosting soon.

References(Boosting) Friedman, Jerome H. “Greedy function approximation: a gradient boosting machine.” Annals of statistics (2001): 1189-1232.

(Stochastig Boosting) Friedman, Jerome H. “Stochastic gradient boosting.” Computational Statistics & Data Analysis 38.4 (2002): 367-378.

(CART) Breiman, Leo. Classification and regression trees. Routledge, 2017.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – insightR**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Laser Beams and Elliptical Billiards: Euler Problem 144

(This article was first published on ** The Devil is in the Data**, and kindly contributed to R-bloggers)

Euler problem 144 has been one of the most fun to solve. The underlying problem is the pathway of the reflection of a laser inside an ellipse-shaped mirror. Before I delve into this problem, I like to share this delightful video from Numberphile in which Alex Bellos demonstrates an elliptical billiards table. The billiards problem is mathematically equivalent to the laser problem. The reflection rule optics is the same as the bouncing rule in mechanics, but instead of using light, we use a billiard ball.

This article outlines my solution to Euler problem 104 and simulates the elliptical pool table in the R language. You can find the code on the GitHub repository for this website.

Euler Problem 144 DefinitionIn laser physics, a “white cell” is a mirror system that acts as a delay line for the laser beam. The beam enters the cell, bounces around on the mirrors, and eventually works its way back out. The specific white cell we will be considering is an ellipse with the equation . The section corresponding to at the top is missing, allowing the light to enter and exit through the hole.

The light beam in this problem starts at the point (0.0,10.1) just outside the white cell, and the beam first impacts the mirror at (1.4,-9.6). Each time the laser beam hits the surface of the ellipse, it follows the usual law of reflection “angle of incidence equals the angle of reflection.” That is, both the incident and reflected beams make the same angle with the normal line at the point of incidence. In the figure on the left, the red line shows the first two points of contact between the laser beam and the wall of the white cell; the blue line shows the line tangent to the ellipse at the point of incidence of the first bounce. The slope of the tangent line at any point of the given ellipse is . The normal line is perpendicular to this tangent line at the point of incidence.

How many times does the beam hit the internal surface of the white cell before exiting?

Proposed Solution to Euler Problem 144The first step was to rewrite the equation to use functions to generalise the problem. The general Cartesian equation for an ellipse is:

The length of the axes for this problem is and . While the Project Euler description gives the formula for the slope of the tangent to the ellipse, I have generalised the code to reuse it for the elliptical billiards table. The slope of the tangent to an ellipse at point is:

This first code snippet defines functions to draw an ellipse and calculate the bouncing angle. The last part of the code bounces the laser inside the cell until it exits through the top.

I sourced the formula to find the intersection between a line and an ellipse from Ambrsoft. The equation has two possible solutions, one of which is the same as the original point.

plot_ellipse <- function(a, b, colour = NA, line = "black") { plot.new() plot.window(xlim = c(-a, a), ylim = c(-b, b), asp = 1) par(mar = rep(0,4)) x <- seq(-a, a, length = 200) y <- sqrt(b^2 - (b^2 / a^2) * x^2) lines(x, y, col = line) lines(x, -y, col = line) polygon(x, y, col = colour, border = NA) polygon(x, -y, col = colour, border = NA) } bounce <- function(coords) { x <- coords$x y <- coords$y ## Tangent to ellipse t <- -(b^2 / a^2) * (x[2] / y[2]) ## Deflection on sloping mirror y = mx + c dydx <- diff(y) / diff(x) m <- tan(pi - atan(dydx) + 2 * atan(t)) c <- y[2] - m * x[2] ## Determine intersection point ## Source: http://www.ambrsoft.com/TrigoCalc/Circles2/Ellipse/EllipseLine.htm x[1] <- x[2] y[1] <- y[2] x2 <- (-a^2 * m * c + c(-1, 1) * (a * b * sqrt(a^2 * m^2 + b^2 - c^2))) / (a^2 * m^2 + b^2) x[2] <- ifelse(round(x[1] / x2[1], 6) == 1, x2[2], x2[1]) y[2] <- m * x[2] + c return(data.frame(x, y)) } # Initial conditions a <- 5 b <- 10 x1 <- 0 y1 <- 10.1 x2 <- 1.4 y2 <- -9.6 answer <- 0 plot_ellipse(a, b) points(c(0,0), c(-c, c), pch = 19) ## Bounce laser breams laser <- data.frame(x = c(x1, x2), y = c(y1, y2)) while((laser$x[2] < -0.01 | laser$x[2] > 0.01) | laser$y[2] < 0) { ## Escape? lines(laser$x, laser$y, col = "red", lwd = .5) laser <- bounce(laser) answer <- answer + 1 } print(answer)The result of this code is a pretty image of all the laser beams that have bounced around the mirror, which looks like the evil Eye of Sauron in Lord of the Rings.

We can use the solution to Euler problem 144 to play billiards on an elliptical billiards table. To close the article, we return to the elliptical pool table demonstrated by Alex Bellos. This code draws the pool table to the dimensions mentioned in the video. We know that the table has an eccentricity of and a long axis of cm. The code defines the short axis () and the distance of the focal points from the centre.

The code selects a random starting point and angle of the shot. The code first determines whether the line passes through the pocket. If this is not the case, the algorithm then finds the place where the ball hits and keeps bouncing it until it falls into the pocket or the ball bounces 100 times.

Elliptical billiard tables have four possible outcomes. Any ball the pass through a focal point will fall into the pocket, ending the simulation. Any ball that passes outside the focal points will bounce around, and the combined trajectories form an ellipse. When the ball moves between the foci, the result is a hyperbola. Lastly, there are some unique circumstances which result in a regular polygon.

If simulations are not enough for you, then head over to the Instructables website to find out how you can construct an elliptical billiards table. There is even a patent for an elliptical pocket billiard table, with the pockets at the edge.

## Elliptical pool table ## https://www.youtube.com/watch?time_continue=54&v=4KHCuXN2F3I e <- 0.43 a <- 130 b <- a * sqrt((1 + e) * (1 - e)) # a > b f <- sqrt(a^2 - b^2) plot_ellipse(a, b, "darkgreen", NA) points(-f, 0, pch = 19, cex = 2) points(f, 0, pch = 19, col = "grey") ## Simulate random shot angle <- runif(1, 0, 2 * pi) x1 <- runif(1, -a, a) ymax <- sqrt(b^2 - (b^2 / a^2) * x1^2) y1 <- runif(1, -ymax, ymax) ## First shot m <- tan(angle) c <- y1 - m * x1 x2 <- (-a^2 * m * c + c(-1, 1) * (a * b * sqrt(a^2 * m^2 + b^2 - c^2))) / (a^2 * m^2 + b^2) y2 <- m * x2 + c x2 <- x2[which(((x2 - x1) < 0) == (cos(angle) < 0))] y2 <- y2[which(((y2 - y1) < 0) == (sin(angle) < 0))] shot <- (data.frame(x = c(x1, x2), y = c(y1, y2))) ## Bounce ball for (i in 1:100){ dydx <- diff(shot$y) / diff(shot$x) if (all.equal(dydx, (shot$y[1] - 0) / (shot$x[1] - -f)) == TRUE) { shot[2, ] <- c(-f, 0) } lines(shot$x, shot$y, col = "yellow", lwd = 1) if (shot[2,2] == 0) break shot <- bounce(shot) } points(x1, y1, pch = 19, col = "blue", cex = 1.8)The post Laser Beams and Elliptical Billiards: Euler Problem 144 appeared first on The Devil is in the Data.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** The Devil is in the Data**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### General gems of comperes

(This article was first published on ** QuestionFlow **, and kindly contributed to R-bloggers)

Examples of exported functions from comperes package that could be useful for general tasks.

PrologueI am very glad to announce that my new package comperes is on CRAN now. It provides tools for managing competition results in a tidy manner as much as possible. For more information go to:

- Package README.
- Package vignettes.
- My previous post for usage examples based on built-in hp_survey data set (results of my Harry Potter Books Survey).

Besides tools for competition results, comperes offers some functions that can be useful in more general tasks. This post presents examples of their most common usage.

OverviewThis post covers the following themes:

**Compute vector levels**with levels2().**Manage item summaries**with summarise_item() and join_item_summary().**Convert pairwise data**with long_to_mat() and mat_to_long().

For examples we will use a shortened version of the everlasting mtcars data set. We will need the following setup:

library(comperes) library(rlang) # For example analysis library(dplyr) library(tibble) mtcars_tbl <- mtcars %>% rownames_to_column(var = "car") %>% select(car, cyl, vs, carb) %>% as_tibble() Compute vector levelsWe will start with the most simple function. During comperes development, idea about the it really helped me reason more clearly about package functional API. I am talking about levels2() which computes “levels” of any non-list vector.

It has the following logic: if x has levels attribute then return levels(x); otherwise return character representation of vector’s sorted unique values. Notes about design and implementation of this function:

- I hesitated
*a lot*about whether it should return character or same type as input vector in case x has no levels. In many practical cases there is a need in latter behavior. However, in the end I decided that type stable output (levels(x) always returns character vector or NULL) is better. - Conversion to character is done
**after**sorting, which is really important when dealing with numeric vectors.

This function is helpful when one needs to produce unique values in standardized manner (for example, during pairwise distance computation). Some examples:

levels2(mtcars_tbl$cyl) ## [1] "4" "6" "8" # Importance of conversion to character after sorting tricky_vec <- c(10, 1, 2, 12) sort(as.character(tricky_vec)) ## [1] "1" "10" "12" "2" levels2(tricky_vec) ## [1] "1" "2" "10" "12" Manage item summariesArguably, the most common task in data analysis is computation of group summaries. This task is conveniently done by consecutive application of dplyr’s group_by(), summarise() and ungroup() (to return regular data frame and not grouped one). comperes offers a wrapper summarise_item() for this task (which always returns tibble instead of a data frame) with additional feature of modifying column names by adding prefix (which will be handy soon):

cyl_vs_summary <- mtcars_tbl %>% summarise_item( item = c("cyl", "vs"), n = n(), mean_carb = mean(carb), .prefix = "cyl_vs__" ) cyl_vs_summary ## # A tibble: 5 x 4 ## cyl vs cyl_vs__n cyl_vs__mean_carb ## ## 1 4. 0. 1 2.00 ## 2 4. 1. 10 1.50 ## 3 6. 0. 3 4.67 ## 4 6. 1. 4 2.50 ## 5 8. 0. 14 3.50Sometimes, there is also a need to compare actual values with their summaries across different grouping. For example, **determine whether car’s number of carburetors (carb) is bigger than average value per different groupings**: by number of cylinders cyl and V/S vs.

To simplify this task, comperes offers a join_item_summary() function for that: it computes item summary with summarise_item() and joins it (with dplyr::left_join()) to input data frame:

# Save (with rlang magic) expression for reused summary carb_summary <- list(mean_carb = expr(mean(carb))) # Create new columns with joined grouped summaries mtcats_gear_summary <- mtcars_tbl %>% join_item_summary("cyl", !!! carb_summary, .prefix = "cyl__") %>% join_item_summary("vs", !!! carb_summary, .prefix = "vs__") print(mtcats_gear_summary, width = Inf) ## # A tibble: 32 x 6 ## car cyl vs carb cyl__mean_carb vs__mean_carb ## ## 1 Mazda RX4 6. 0. 4. 3.43 3.61 ## 2 Mazda RX4 Wag 6. 0. 4. 3.43 3.61 ## 3 Datsun 710 4. 1. 1. 1.55 1.79 ## 4 Hornet 4 Drive 6. 1. 1. 3.43 1.79 ## 5 Hornet Sportabout 8. 0. 2. 3.50 3.61 ## # ... with 27 more rows # Compute comparisons mtcats_gear_summary %>% mutate_at(vars(ends_with("mean_carb")), funs(carb > .)) %>% select(car, ends_with("mean_carb")) %>% rename_at(vars(-car), funs(gsub("__mean_carb$", "", .))) ## # A tibble: 32 x 3 ## car cyl vs ## ## 1 Mazda RX4 TRUE TRUE ## 2 Mazda RX4 Wag TRUE TRUE ## 3 Datsun 710 FALSE FALSE ## 4 Hornet 4 Drive FALSE FALSE ## 5 Hornet Sportabout FALSE FALSE ## # ... with 27 more rowsAdding different prefixes helps navigating through columns with different summaries.

Convert pariwise dataOne of the main features of comperes is the ability to compute Head-to-Head values of players in competition. There are functions h2h_long() and h2h_mat() which produce output in “long” (tibble with row describing one ordered pair) and “matrix” (matrix with cell value describing pair in corresponding row and column) formats respectively.

These formats of pairwise data is quite common: “long” is better for tidy computing and “matrix” is better for result presentation. Also converting distance matrix to data frame with pair data is a theme of several Stack Overflow questions (for example, this one and that one).

Package comperes has functions as_h2h_long() and as_h2h_mat() for converting between those formats. They are powered by a “general usage” functions long_to_mat() and mat_to_long(). Here is an example of how they can be used to convert between different formats of pairwise distances:

# Compute matrix of pairwise distances based on all numeric columns dist_mat <- mtcars_tbl %>% select_if(is.numeric) %>% dist() %>% as.matrix() dist_mat[1:4, 1:4] ## 1 2 3 4 ## 1 0.000000 0.000000 3.741657 3.162278 ## 2 0.000000 0.000000 3.741657 3.162278 ## 3 3.741657 3.741657 0.000000 2.000000 ## 4 3.162278 3.162278 2.000000 0.000000 # Convert to data frame (tibble in this case) dist_tbl <- dist_mat %>% mat_to_long(row_key = "id_1", col_key = "id_2", value = "dist") dist_tbl ## # A tibble: 1,024 x 3 ## id_1 id_2 dist ## ## 1 1 1 0. ## 2 1 2 0. ## 3 1 3 3.74 ## 4 1 4 3.16 ## 5 1 5 2.83 ## # ... with 1,019 more rows # Convert tibble back to matrix dist_mat_new <- dist_tbl %>% # To make natural row and column sortings mutate_at(vars("id_1", "id_2"), as.numeric) %>% long_to_mat(row_key = "id_1", col_key = "id_2", value = "dist") identical(dist_mat, dist_mat_new) ## [1] TRUE Conclusion- Package comperes provides not only tools for managing competition results but also functions with general purpose:
- Compute vector levels with levels2(). Usually used to produce unique values in standardized manner.
- Manage item summaries with summarise_item() and join_item_summary(). May be used to concisely compute comparisons of values with summaries from different groupings.
- Convert pairwise data with long_to_mat() and mat_to_long(). Very helpful in converting pairwise distances between “long” and “matrix” formats.

To **leave a comment** for the author, please follow the link and comment on their blog: ** QuestionFlow **.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### treeio: Phylogenetic data integration

(This article was first published on ** rOpenSci - open tools for open science**, and kindly contributed to R-bloggers)

Phylogenetic trees are commonly used to present evolutionary relationships of species. Newick is the *de facto* format in phylogenetic for representing tree(s). Nexus format incorporates Newick tree text with related information organized into separated units known as blocks. For the R community, we have ape and phylobase packages to import trees from Newick and Nexus formats. However, analysis results (tree + analysis findings) from widely used software packages in this field are not well supported. Some of them are extended from Newick and Nexus (*e.g.* RevBayes and BEAST outputs), while some of the others are just log files (*e.g.* r8s and PAML outputs). Parsing these output files is important for interpreting analysis findings.

Information associated with taxon species/strains can be meta-data (*e.g.* isolation host, time and location, *etc.*), phenotypic or experimental data. These data as well as analysis finding may be further analyzed in the evolutionary context for downstream integrative and comparative studies. Unfortunately, taxon-associated data are mostly stored in separate files and often in inconsistent formats. To fill this gap, I developed the treeio package for phylogenetic tree data integration.

Currently, treeio is able to read the following file formats: Newick, Nexus, New Hampshire eXtended format (NHX), jplace and Phylip as well as the data outputs from the following analysis programs: BEAST, EPA, HyPhy, MrBayes, PAML, PHYLDOG, pplacer, r8s, RAxML and RevBayes.

InstallationThe treeio package is released within Bioconductor project.

To get the released version from Bioconductor:

## try http:// if https:// URLs are not supported source("https://bioconductor.org/biocLite.R") biocLite("treeio")Or the development version from github:

## install.packages("devtools") devtools::install_github("GuangchuangYu/treeio") Importing trees with data Importing tree data from software outputsThe treeio package provides several parser functions as illustrated in the following table:

Parser function Description read.beast parsing output of BEAST read.codeml parsing output of CodeML (rst and mlc files) read.codeml_mlc parsing mlc file (output of CodeML) read.hyphy parsing output of HYPHY read.jplace parsing jplace file including output of EPA and pplacer read.mrbayes parsing output of MrBayes read.newick parsing newick string, with ability to parse node label as support values read.nhx parsing NHX file including output of PHYLDOG and RevBayes read.paml_rst parsing rst file (output of BaseML or CodeML) read.phylip parsing phylip file (phylip alignment + newick string) read.r8s parsing output of r8s read.raxml parsing output of RAxMLFor example, users can parse BEAST output using read.beast function. All the information inferred by by BEAST will be stored in the output object.

file <- system.file("extdata/BEAST", "beast_mcc.tree", package="treeio") beast <- read.beast(file) beast ## 'treedata' S4 object that stored information of ## '/Library/R/library/treeio/extdata/BEAST/beast_mcc.tree'. ## ## ...@ phylo: ## Phylogenetic tree with 15 tips and 14 internal nodes. ## ## Tip labels: ## A_1995, B_1996, C_1995, D_1987, E_1996, F_1997, ... ## ## Rooted; includes branch lengths. ## ## with the following features available: ## 'height', 'height_0.95_HPD', 'height_median', 'height_range', 'length', ## 'length_0.95_HPD', 'length_median', 'length_range', 'posterior', 'rate', ## 'rate_0.95_HPD', 'rate_median', 'rate_range'. Linking external data to phylogenyIn addition to analysis findings that are associated with the tree as we showed above, there is a wide range of heterogeneous data, including phenotypic data, experimental data and clinical data *etc.*, that need to be integrated and linked to phylogeny. For example, in the study of viral evolution, tree nodes may be associated with epidemiological information, such as location, age and subtype. To facilitate data integration, treeio provides full_join method to link external data to phylogeny as demonstrated below:

To allow comparative study, treeio supports combining multiple phylogenetic trees into one with their node/branch-specific attribute data.

The following example merges a tree analyzed by BEAST and CODEML.

beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree") rst_file <- system.file("examples/rst", package="ggtree") mlc_file <- system.file("examples/mlc", package="ggtree") beast_tree <- read.beast(beast_file) codeml_tree <- read.codeml(rst_file, mlc_file) merged_tree <- merge_tree(beast_tree, codeml_tree) merged_tree ## 'treedata' S4 object that stored information of ## '/Library/R/library/ggtree/examples/MCC_FluA_H3.tree', ## '/Library/R/library/ggtree/examples/rst', ## '/Library/R/library/ggtree/examples/mlc'. ## ## ...@ phylo: ## Phylogenetic tree with 76 tips and 75 internal nodes. ## ## Tip labels: ## A/Hokkaido/30-1-a/2013, A/New_York/334/2004, A/New_York/463/2005, A/New_York/452/1999, A/New_York/238/2005, A/New_York/523/1998, ... ## ## Rooted; includes branch lengths. ## ## with the following features available: ## 'height', 'height_0.95_HPD', 'height_median', 'height_range', 'length', ## 'length_0.95_HPD', 'length_median', 'length_range', 'posterior', 'rate', ## 'rate_0.95_HPD', 'rate_median', 'rate_range', 'subs', 'AA_subs', 't', 'N', ## 'S', 'dN_vs_dS', 'dN', 'dS', 'N_x_dN', 'S_x_dS'.After merging, the merged_tree object contains the whole set of node/branch attributes from both beast_tree and codeml_tree. The tree object can be converted to tidy data frame using tidytree package and visualized as hexbin scatterplot of *dN/dS*, *dN* and *dS* inferred by CODEML versus *rate* (substitution rate in unit of substitutions/site/year) inferred by BEAST on the same branches, as an example to demonstrate comparison of node attributes inferred by different software.

With this feature, users are able to compare/integrate evolutionary inferences obtained from different software packages/models.

Further readingsFor more details of importing trees with data, please refer to the vignette. In addition, treeio also supports exporting tree data to BEAST Nexus format, which allows software output format conversion and also enables combining external data with tree into a single file.

The following screenshot shows an example of exporting CODEML output to BEAST Nexus file. The tree was visualized using FigTree and colored by *dN*. For more details, please refer to the vignette, exporting trees with data.

All the data parsed/merged by treeio can be converted to tidy data frame using tidytree and can be used to visualize and annotate the tree using ggtree.

For more details, please refer to the online vignettes:

- A Tidy Tool for Phylogenetic Tree Data Manipulation
- Tree Visualization
- Tree Manipulation
- Tree Annotation
- Phylomoji
- Annotating phylogenetic tree with images
- Annotate a phylogenetic tree with insets

To **leave a comment** for the author, please follow the link and comment on their blog: ** rOpenSci - open tools for open science**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### maximal spacing around order statistics

(This article was first published on ** R – Xi'an's Og**, and kindly contributed to R-bloggers)

**T**he riddle from the Riddler for the coming weeks is extremely simple to express in mathematical terms, as it summarises into characterising the distribution of

when the n-sample is made of iid Normal variates. I however had a hard time finding a result connected with this quantity since most available characterisations are for either Uniform or Exponential variates. I eventually found a 2017 arXival by Nagaraya et al. covering the issue. Since the Normal distribution belongs to the Gumbel domain of attraction, the extreme spacings, that is the spacings between the most extreme orders statistics [rescaled by nφ(Φ⁻¹{1-n⁻¹})] are asymptotically independent and asymptotically distributed as (Theorem 5, p.15, after correcting a typo):

where the ξ’s are Exp(1) variates. A crude approximation is thus to consider that the above Δ is distributed as the maximum of two standard and independent exponential distributions, modulo the rescaling by nφ(Φ⁻¹{1-n⁻¹})… But a more adequate result was pointed out to me by Gérard Biau, namely a 1986 Annals of Probability paper by Paul Deheuvels, my former head at ISUP, Université Pierre and Marie Curie. In this paper, Paul Deheuvels establishes that the largest spacing in a normal sample, M¹, satisfies

from which a conservative upper bound on the value of n required for a given bound x⁰ can be derived. The simulation below compares the limiting cdf (in red) with the empirical cdf of the above Δ based on 10⁴ samples of size n=10³.The limiting cdf is the cdf of the maximum of an infinite sequence of independent exponentials with scales 1,½,…. Which connects with the above result, *in fine*. For a practical application, the 99% quantile of this distribution is 4.71. To achieve a maximum spacing of, say 0.1, with probability 0.99, one would need 2 log(n) > 5.29²/0.1², i.e., log(n)>1402, which is a pretty large number…

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Xi'an's Og**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Ready Made Plots make Work Easier

(This article was first published on ** R – Win-Vector Blog**, and kindly contributed to R-bloggers)

A while back Simon Jackson and Kara Woo shared some great ideas and graphs on grouped bar charts and density plots (link). Win-Vector LLC‘s Nina Zumel just added a graph of this type to the development version of WVPlots.

Nina has, as usual, some great documentation here.

More and more I am finding when you are in the middle of doing something, having a ready made plotting tool is less distracting than working directly designing a ggplot2 graph on the fly. In addition to WVPlots other great “ready to go” plot packages include ggpubr and ggstatsplot. I definitely recommend checking out all three packages and the packages/tools they use.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Win-Vector Blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### 50 years of Delaware Estuary Dissolved Oxygen

(This article was first published on ** AdventuresInData**, and kindly contributed to R-bloggers)

The Delaware River Basin Commission’s Delaware Estuary water quality monitoring program, which was initiated in 1967, is one of the longest running monitoring programs in the world. One advantage of such a long data set is that we can see the changes to water quality over time.

Since 1967, summer dissolved oxygen levels have steadily improved due to active water quality management and the addition of secondary treatment to waste water treatment plants. More work remains, however. A dissolved oxygen sag persists in the most urbanized portion of the estuary, and daytime surface spot measurements may not reveal pockets of lower dissolved oxygen water that could continue to impact aquatic life.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** AdventuresInData**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Cleaning up tables

(This article was first published on ** R – Stat Bandit**, and kindly contributed to R-bloggers)

This post is re-published from my blog

ContextOne of things I have to do quite often is create tables for papers and presentations. Often the “Table 1” of a paper has descriptives about the study, broken down by subgroups. For presentation purposes, it doesn’t look good (to me, at least) that the name of each subgroup be repeated down one column of the table.

One way to deal with this is, of course, by hand. Save the table as a CSV or Excel file, open it up in your favorite spreadsheet program, and prettify things. But, of course, this being a R blog, I wanted to create a function that would fix this. I’ve created hack-y functions for this before, but a neat trick pointed out here gave me an idea for a more elegant solution. It also meant I had to use the tidyeval paradigm a bit, which I figured I should at least become familiar with.

Here’s what I want to doTake a table like this:

Location Gender values Rural Male 32.74 Rural Female 25.18 Urban Male 40.48 Urban Female 25.28to something like this:

Location Gender values Rural Male 32.74 Female 25.18 Urban Male 40.48 Female 25.28The point is that the first column has repeating values, and I just want the first row of the cluster of rows corresponding to Rural and Urban to have text, the rest being blank. I find this a cleaner look and more typical of tables I see in papers.

This is purely for presentation purposes.I would never do this for data frames I’ll still analyze, since the blank cells screw up things. Of course this could be fixed easily using last value carried forward imputation on the column.

A solutionI created this simple function to do this for a single column within a magrittr pipeline:

clean_col = function(x, colvar){ require(dplyr) colv = enquo(colvar) x %>% group_by(!!colv) %>% mutate(rown = row_number()) %>% ungroup() %>% mutate_at(vars(!!colv), funs(ifelse(rown > 1, '', .))) %>% select (-rown) }The first thing to note here is that I’m using quosures and quasiquotation to allow the pipeline to work with the function’s inputs, specifically the column name, which is provided as an unquoted name. Admittedly this was done without much understanding, following examples on Edwin Thoen’s excellent blog.

The second thing was the use of the dummy rown column to identify the first row of each cluster of rows defined by the variable colvar. This was inspired by this blog I read through R-Bloggers yesterday. This trick allowed me to easily “blank out” the appropriate cells in the colvar column.

Desirable updatesThere are two directions I want to take this, but I don’t understand tidyeval or functions with variable numbers of arguments well enough yet to do it. The simpler extension is to do the same process using two or more columns, rather than one column. For example, taking

Location Gender AgeGrp DeathRate Rural Male 50-54 11.7 Rural Male 55-59 18.1 Rural Male 60-64 26.9 Rural Male 65-69 41.0 Rural Male 70-74 66.0 Rural Female 50-54 8.7 Rural Female 55-59 11.7 Rural Female 60-64 20.3 Rural Female 65-69 30.9 Rural Female 70-74 54.3 Urban Male 50-54 15.4 Urban Male 55-59 24.3 Urban Male 60-64 37.0 Urban Male 65-69 54.6 Urban Male 70-74 71.1 Urban Female 50-54 8.4 Urban Female 55-59 13.6 Urban Female 60-64 19.3 Urban Female 65-69 35.1 Urban Female 70-74 50.0to

Location Gender AgeGrp DeathRate Rural Male 50-54 11.7 55-59 18.1 60-64 26.9 65-69 41.0 70-74 66.0 Rural Female 50-54 8.7 55-59 11.7 60-64 20.3 65-69 30.9 70-74 54.3 Urban Male 50-54 15.4 55-59 24.3 60-64 37.0 65-69 54.6 70-74 71.1 Urban Female 50-54 8.4 55-59 13.6 60-64 19.3 65-69 35.1 70-74 50.0The second extension would be to create truly nested row labels, like this:

Location Gender AgeGrp DeathRate Rural Male 50-54 11.7 55-59 18.1 60-64 26.9 65-69 41.0 70-74 66.0 Female 50-54 8.7 55-59 11.7 60-64 20.3 65-69 30.9 70-74 54.3 Urban Male 50-54 15.4 55-59 24.3 60-64 37.0 65-69 54.6 70-74 71.1 Female 50-54 8.4 55-59 13.6 60-64 19.3 65-69 35.1 70-74 50.0I can create these on a case-by-case basis, but I’m not sure how to do this in a function, yet. Looking forward to comments.

UpdateI can do the first example with the following code (based on Edwin Thoen’s blog, again):

clean_col = function(x, ...){ require(dplyr) colvs = quos(...) x %>% group_by(!!!colvs) %>% mutate(rown = row_number()) %>% ungroup() %>% mutate_at(vars(!!!colvs), funs(ifelse(rown > 1, '', .))) %>% select (-rown) } Update 2The second example can be solved by extracting elements of quosures, which are essentially a list:

clean_cols = function(x, ...){ colvs = quos(...) for(i in 1:length(colvs)){ rowvar = rlang::sym(paste0('rown',i)) # Create dummy x = x %>% group_by(!!!colvs[1:i]) %>% mutate(!!rowvar := row_number()) %>% ungroup() } for(i in 1:length(colvs)){ rowvar = rlang::sym(paste0('rown',i)) x = x %>% mutate_at(vars(!!colvs[[i]]), funs(ifelse(!!rowvar > 1, '', .))) } x = x %>% select(-starts_with('rown')) # remove the dummies return(x) } var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Stat Bandit**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Elegant regression results tables and plots in R: the finalfit package

(This article was first published on ** R – DataSurg**, and kindly contributed to R-bloggers)

The finafit package brings together the day-to-day functions we use to generate final results tables and plots when modelling. I spent many years repeatedly manually copying results from R analyses and built these functions to automate our standard healthcare data workflow. It is particularly useful when undertaking a large study involving multiple different regression analyses. When combined with RMarkdown, the reporting becomes entirely automated. Its design follows Hadley Wickham’s tidy tool manifesto.

Installation and DocumentationIt lives on GitHub.

You can install finalfit from github with:

# install.packages("devtools") devtools::install_github("ewenharrison/finalfit")It is recommended that this package is used together with dplyr, which is a dependent.

Some of the functions require rstan and boot. These have been left as Suggests rather than Depends to avoid unnecessary installation. If needed, they can be installed in the normal way:

install.packages("rstan") install.packages("boot")To install off-line (or in a Safe Haven), download the zip file and use devtools::install_local().

Main Features 1. Summarise variables/factors by a categorical variablesummary_factorlist() is a wrapper used to aggregate any number of explanatory variables by a single **variable of interest**. This is often “Table 1” of a published study. When categorical, the variable of interest can have a maximum of five levels. It uses Hmisc::summary.formula().

See other options relating to inclusion of missing data, mean vs. median for continuous variables, column vs. row proportions, include a total column etc.

summary_factorlist() is also commonly used to summarise any number of variables by an **outcome variable** (say dead yes/no).

Tables can be knitted to PDF, Word or html documents. We do this in RStudio from a .Rmd document. Example chunk:

```{r, echo = FALSE, results='asis'} knitr::kable(example_table, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r")) ``` 2. Summarise regression model results in final table formatThe second main feature is the ability to create final tables for linear (lm()), logistic (glm()), hierarchical logistic (lme4::glmer()) and

Cox proportional hazards (survival::coxph()) regression models.

The finalfit() “all-in-one” function takes a single dependent variable with a vector of explanatory variable names (continuous or categorical variables) to produce a final table for publication including summary statistics, univariable and multivariable regression analyses. The first columns are those produced by summary_factorist(). The appropriate regression model is chosen on the basis of the dependent variable type and other arguments passed.

Logistic regression: glm()Of the form: glm(depdendent ~ explanatory, family="binomial")

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor") dependent = 'mort_5yr' colon_s %>% finalfit(dependent, explanatory) Logistic regression with reduced model: glm()Where a multivariable model contains a subset of the variables included specified in the full univariable set, this can be specified.

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor") explanatory_multi = c("age.factor", "obstruct.factor") dependent = 'mort_5yr' colon_s %>% finalfit(dependent, explanatory, explanatory_multi) Mixed effects logistic regression: lme4::glmer()Of the form: lme4::glmer(dependent ~ explanatory + (1 | random_effect), family="binomial")

Hierarchical/mixed effects/multilevel logistic regression models can be specified using the argument random_effect. At the moment it is just set up for random intercepts (i.e. (1 | random_effect), but in the future I’ll adjust this to accommodate random gradients if needed (i.e. (variable1 | variable2).

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor") explanatory_multi = c("age.factor", "obstruct.factor") random_effect = "hospital" dependent = 'mort_5yr' colon_s %>% finalfit(dependent, explanatory, explanatory_multi, random_effect) Cox proportional hazards: survival::coxph()Of the form: survival::coxph(dependent ~ explanatory)

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor") dependent = "Surv(time, status)" colon_s %>% finalfit(dependent, explanatory) Add common model metrics to outputmetrics=TRUE provides common model metrics. The output is a list of two dataframes. Note chunk specification for output below.

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor") dependent = 'mort_5yr' colon_s %>% finalfit(dependent, explanatory, metrics=TRUE) ```{r, echo=FALSE, results="asis"} knitr::kable(table7[[1]], row.names=FALSE, align=c("l", "l", "r", "r", "r")) knitr::kable(table7[[2]], row.names=FALSE) ```Rather than going all-in-one, any number of subset models can be manually added on to a summary_factorlist() table using finalfit_merge(). This is particularly useful when models take a long-time to run or are complicated.

Note the requirement for fit_id=TRUE in summary_factorlist(). fit2df extracts, condenses, and add metrics to supported models.

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor") explanatory_multi = c("age.factor", "obstruct.factor") random_effect = "hospital" dependent = 'mort_5yr' # Separate tables colon_s %>% summary_factorlist(dependent, explanatory, fit_id=TRUE) -> example.summary colon_s %>% glmuni(dependent, explanatory) %>% fit2df(estimate_suffix=" (univariable)") -> example.univariable colon_s %>% glmmulti(dependent, explanatory) %>% fit2df(estimate_suffix=" (multivariable)") -> example.multivariable colon_s %>% glmmixed(dependent, explanatory, random_effect) %>% fit2df(estimate_suffix=" (multilevel)") -> example.multilevel # Pipe together example.summary %>% finalfit_merge(example.univariable) %>% finalfit_merge(example.multivariable) %>% finalfit_merge(example.multilevel) %>% select(-c(fit_id, index)) %>% # remove unnecessary columns dependent_label(colon_s, dependent, prefix="") # place dependent variable label Bayesian logistic regression: with stanOur own particular rstan models are supported and will be documented in the future. Broadly, if you are running (hierarchical) logistic regression models in [Stan](http://mc-stan.org/users/interfaces/rstan) with coefficients specified as a vector labelled beta, then fit2df() will work directly on the stanfit object in a similar manner to if it was a glm or glmerMod object.

3. Summarise regression model results in plotModels can be summarized with odds ratio/hazard ratio plots using or_plot, hr_plot and surv_plot.

OR plot # OR plot explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor") dependent = 'mort_5yr' colon_s %>% or_plot(dependent, explanatory) # Previously fitted models (`glmmulti()` or `glmmixed()`) can be provided directly to `glmfit` HR plot # HR plot explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor") dependent = "Surv(time, status)" colon_s %>% hr_plot(dependent, explanatory, dependent_label = "Survival") # Previously fitted models (`coxphmulti`) can be provided directly using `coxfit` Kaplan-Meier survival plotsKM plots can be produced using the library(survminer)

# KM plot explanatory = c("perfor.factor") dependent = "Surv(time, status)" colon_s %>% surv_plot(dependent, explanatory, xlab="Time (days)", pval=TRUE, legend="none") NotesUse Hmisc::label() to assign labels to variables for tables and plots.

label(colon_s$age.factor) = "Age (years)"Export dataframe tables directly or to R Markdown knitr::kable().

Note wrapper summary_missing() is also useful. Wraps mice::md.pattern.

colon_s %>% summary_missing(dependent, explanatory)Development will be on-going, but any input appreciated.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – DataSurg**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Enterprise Dashboards with R Markdown

(This article was first published on ** R Views**, and kindly contributed to R-bloggers)

*This is a second post in a series on enterprise dashboards. See our previous post, Enterprise-ready dashboards with Shiny Databases.*

We have been living with spreadsheets for so long that most office workers think it is obvious that spreadsheets generated with programs like Microsoft Excel make it easy to understand data and communicate insights. Everyone in a business, from the newest intern to the CEO, has had some experience with spreadsheets. But using Excel as the de facto analytic standard is problematic. Relying exclusively on Excel produces environments where it is almost impossible to organize and maintain efficient operational workflows. In addition to fostering low productivity, organizations risk profits and reputations in an age where insightful analyses and process control translate to a competitive advantage. Most organizations want better control over accessing, distributing, and processing data. You can use the R programming language, along with with R Markdown reports and RStudio Connect, to build enterprise dashboards that are robust, secure, and manageable.

This Excel dashboard attempts to function as a real application by allowing its users to filter and visualize key metrics about customers. It took dozens of hours to build. The intent was to hand off maintenance to someone else, but the dashboard was so complex that the author was forced to maintain it. Every week, the author copied data from an ETL tool and pasted it into the workbook, spot checked a few cells, and then emailed the entire workbook to a distribution list. Everyone on the distribution list got a new copy in their inbox every week. There were no security controls around data management or data access. Anyone with the report could modify its contents. The update process often broke the brittle cell dependencies; or worse, discrepancies between weeks passed unnoticed. It was almost impossible to guarantee the integrity of each weekly report.

Why coding is importantExcel workbooks are hard to maintain, collaborate on, and debug because they are not reproducible. The content of every cell and the design of every chart is set without ever recording the author’s actions. There is no simple way to recreate an Excel workbook because there is no recipe (i.e., set of instructions) that describes how it was made. Because Excel workbooks lack a recipe, they tend to be hard to maintain and prone to errors. It takes care, vigilance, and subject-matter knowledge to maintain a complex Excel workbook. Even then, human errors abound and changes require a lot of effort.

A better approach is to write code. There are many reasons to start programming. When you create a recipe with code, anyone can reproduce your work (including your future self). The act of coding implicitly invites others to collaborate with you. You can systematically validate and debug your code. All of these things lead to better code over time. Coding in R has particular advantages given its vast ecosystem of packages, its vibrant community, and its powerful tool chain.

Using R MarkdownThere are many tools for replacing complex Excel dashboards with R code. One of these tools is R Markdown, an open-source R package that turns your analyses into high quality documents, reports, presentations and dashboards. R Markdown documents are fully reproducible and support dozens of output formats including HTML, PDF, and Microsoft Word documents.

Here is the same Excel dashboard translated to an R Markdown report. Because this report is written in code, it is vastly simpler and easier to maintain. Like the Excel dashboard above, this R Markdown report is designed to take user inputs so that it could render custom report versions.

Many people are already aware that R Markdown reports combine narrative, code, and output in a single document. What is less commonly known is that you can generalize any R Markdown report by declaring parameters in the document header. R Markdown documents with parameters are known as parameterized reports. In the Excel dashboard users can select segment, group, and period. In a parameterized R Markdown document, you would specify these inputs with the following YAML header:

--- title: Customer Tracker Report output: html_notebook params: seg: label: "Segment:" value: Total input: select choices: [Total, Heavy, Mainstream, Focus1, Focus2, Specialty, Diverse1, Diverse2, Other, New] grp: label: "Group:" value: Total input: select choices: [Total, Core, Extra] per: label: "Period:" value: Week input: radio choices: [Week, YTD] ---You can then call the parameters you declare in the YAML header from your R code chunks.

```{r} params$segment params$grp params$per ```You can render the document with different inputs by selecting knit with parameters in RStudio. This option will open a user interface that allows you to select the parameters you want.

If you want to automate the process of creating custom report versions, you can render these documents programmatically with the rmarkdown::render() function.

rmarkdown::render( input = "tracker-report.Rmd", params = list(seg = "Focus1", grp = "Core", per = "Weekly") ) Publishing to RStudio ConnectManaging access and permissions for an ocean of Excel files is painful. Data in Excel spreads through an organization without controls like a virus spreads through a body without disease prevention. There are better ways to secure the operation, access, and distribution of information.

RStudio Connect is a server product from RStudio that is designed for secure sharing of R content. It is on-premises software you run behind your firewall. You keep control of your data and of who has access. With RStudio Connect, you can see all your content, decide who should be able to view and collaborate on it, tune performance, schedule updates, and view logs. You can schedule your R Markdown reports to run automatically or even distribute the latest version by email.

When you publish a parameterized R Markdown report to RStudio Connect, an interface appears for selecting inputs. Viewers can create new report versions, then email themselves a copy. Collaborators can save and schedule new report versions, then email others a copy. You can even attach output files to these versions. Using parameterized R Markdown documents in RStudio Connect is a powerful way to communicate information.

You can publish content from the RStudio IDE by clicking the Publish button that looks like a blue Eye of Horus. Pressing this button will begin the publishing process. First, it creates a set of instructions for recreating your content. Second, it deploys your content bundle to the server. Third, it recreates your content on RStudio Connect. Push-button publishing has a long history of being used with RStudio. In 2012, RStudio enabled push-button publishing of R Markdown documents to RPubs. In 2014, RStudio enabled push-button publishing of Shiny apps to shinyapps.io. In 2016, RStudio enabled push-button publishing to RStudio Connect.

Adding ShinyR Markdown documents are rendered with batch processing. That makes them ideal for automation, long running workflows, and custom report versions. However, if you want your documents to be immediately reactive to user input, then you can add a Shiny runtime. These interactive documents behave like a Shiny application in that they must be hosted. You can host interactive documents and Shiny applications with RStudio Connect. Deciding when to choose between R Markdown, interactive documents, and Shiny applications is a subject for a later post.

SummaryReproducible code in R leads to better analysis and collaboration. You can use parameterized R Markdown reports to create complex, interactive dashboards. Hosting these dashboards securely in RStudio Connect gives you control over accessing, distributing, and processing data. You can use the R programming language, along with with R Markdown reports and RStudio Connect, to build enterprise dashboards that are robust, secure, and manageable.

*Click here for source code.*

_____='https://rviews.rstudio.com/2018/05/16/replacing-excel-reports-with-r-markdown-and-shiny/';

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R Views**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...