
Gamma Process Prior for Semiparametric Survival Analysis

Sun, 05/12/2019 - 02:00

(This article was first published on R on , and kindly contributed to R-bloggers)

Heads up: equations may not render on blog aggregation sites. See the original post here for proper formatting. If you like this post, you can follow me on Twitter.

Motivation

Suppose we observe survival/event times from some distribution \[T_{i\in1:n} \stackrel{iid}{\sim} f(t)\] where \(f\) is the density and \(F(t)=1-S(t)\) is the corresponding CDF expressed in terms of the survival function \(S(t)\). We can represent the hazard function of this distribution in terms of the density, \[\lambda(t) = \frac{f(t)}{S(t)}\] The hazard, CDF, and survival functions are all related. Thus, if we have a model for the hazard, we also have a model for the survival function and the survival time distribution. The well-known Cox proportional hazard approach models the hazard as a function of covariates \(x_i \in \mathbb{R}^p\) that multiply some baseline hazard \(\lambda_0(t)\), \[ \lambda(t_i) = \lambda_0(t_i)\exp(x_i'\theta)\] Frequentist estimation of \(\theta\) follows from maximizing the profile likelihood – which avoids the need to specify the baseline hazard \(\lambda_0(t)\). The model is semi-parametric because, while we don’t model the baseline hazard, we require that the multiplicative relationship between covariates and the hazard is correct.
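For concreteness, here is a minimal R sketch (not from the original post) of the frequentist fit just described, using the survival package and its built-in lung data; coxph() maximizes the partial (profile) likelihood without specifying the baseline hazard.

```r
# Minimal frequentist reference point (assumed example, not from the post):
# fit a Cox proportional hazards model without specifying the baseline hazard.
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(fit)        # estimates of theta (log hazard ratios) with asymptotic SEs
head(basehaz(fit))  # nonparametric (Breslow) estimate of the cumulative baseline hazard
```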

This already works fine, so why go Bayesian? Here are just a few (hopefully) compelling reasons:

  • We may want to nonparametrically estimate the baseline hazard itself.
  • Posterior inference is exact, so we don’t need to rely on asymptotic uncertainty estimates (though we may want to evaluate the frequentist properties of resulting point and interval estimates).
  • Easy credible interval estimation for any function of the parameters. If we have posterior samples for the hazard, we also get automatic inference for the survival function as well.

Full Bayesian inference requires a proper probability model for both \(\theta\) and \(\lambda_0\). This post walks through a Bayesian approach that places a nonparametric prior on \(\lambda_0\) – specifically the Gamma Process.

The Gamma Process Prior

Independent Hazards

Most of this comes from Kalbfleisch (1978), with an excellent technical outline by Ibrahim (2001).

Recall that the cumulative baseline hazard is \(H_0(t) = \int_0^t \lambda_0(s)\, ds\), where the integral is the Riemann-Stieltjes integral. The central idea is to develop a prior for the cumulative hazard \(H_0(t)\), which will then admit a prior for the hazard, \(\lambda_0(t)\).

The Gamma Process is such a prior. Each realization of a Gamma Process is a cumulative hazard function centered around some prior cumulative hazard function, \(H^*\), with a dispersion/concentration parameter, \(\beta\), that controls how tightly the realizations are distributed around the prior \(H^*\).

Okay, now the math. Let \(\mathcal{G}(\alpha, \beta)\) denote the Gamma distribution with shape parameter \(\alpha\) and rate parameter \(\beta\). Let \(H^*(t)\) for \(t\geq 0\) be our prior cumulative hazard function. For example we could choose \(H^*\) to be the exponential cumulative hazard, \(H^*(t)= \eta\cdot t\), where \(\eta\) is a fixed hyperparameter. By definition \(H^*(0)=0\). The Gamma Process is defined as having the following properties:

  • \(H_0(0) = 0\)
  • \(\lambda_0(t) = H_0(t) - H_0(s) \sim \mathcal G \Big(\ \beta\big(H^*(t) - H^*(s)\big)\ , \ \beta \ \Big)\), for \(t>s\)

The increments of the cumulative hazard are the hazards. The Gamma Process has the property that these increments are independent and Gamma-distributed. For a grid of time points \(t\geq0\), we can use the properties above to generate one realization of the hazards \(\{\lambda_0(t) \}_{t\geq0}\). Equivalently, one realization of the cumulative hazard function is \(\{H_0(t)\}_{t\geq0}\), where \(H_0(t) = \sum_{k=0}^t \lambda_0(k)\). We denote the Gamma Process just described as \[H_0(t) \sim \mathcal{GP}\Big(\ \beta H^*(t), \ \beta \Big), \ \ t\geq0\]

Below in Panel A are some prior realizations of \(H_0(t)\) with a Weibull \(H^*\) prior for various concentration parameters, \(\beta\). Notice that for low \(\beta\) the realizations are widely dispersed around the mean cumulative hazard. Higher \(\beta\) yields tighter dispersion around \(H^*\).

Since there’s a correspondence between the \(H_0(t)\), \(\lambda_0(t)\), and \(S_0(t)\), we could also plot prior realizations of the baseline survival function \(S_0(t) = \exp\big\{- H_0(t) \big\}\) using the realization \(\{H_0(t)\}_{t\geq0}\). This is shown in Panel B with the Weibull survival function \(S^*\) corresponding to \(H^*\).
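As an illustration, here is a minimal R sketch (not the author's code) of how such prior realizations can be simulated; the Weibull(2, 1.5) prior hazard, the value of \(\beta\), and the time grid are assumptions chosen for illustration.

```r
# Sketch: simulate independent Gamma Process realizations of H0 around a
# Weibull(shape = 2, scale = 1.5) prior cumulative hazard H*.
set.seed(1)
tgrid <- seq(0, 3, by = 0.05)          # assumed time grid
Hstar <- function(t) (t / 1.5)^2       # Weibull(2, 1.5) cumulative hazard
beta  <- 5                             # assumed concentration parameter

sim_gp <- function(tgrid, Hstar, beta) {
  dHstar <- diff(Hstar(tgrid))                       # prior increments H*(t) - H*(s)
  lambda <- rgamma(length(dHstar),                   # independent Gamma increments
                   shape = beta * dHstar, rate = beta)
  H0 <- c(0, cumsum(lambda))                         # cumulative hazard realization
  data.frame(t = tgrid, H0 = H0, S0 = exp(-H0))      # baseline survival S0 = exp(-H0)
}

reals <- replicate(20, sim_gp(tgrid, Hstar, beta)$H0)
matplot(tgrid, reals, type = "s", lty = 1, col = "grey60",
        xlab = "t", ylab = expression(H[0](t)))
lines(tgrid, Hstar(tgrid), col = "red", lwd = 2)     # prior mean H*
```

Each column of reals is one draw of the cumulative baseline hazard; exp(-reals) gives the corresponding draws of the baseline survival function.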

Correlated Hazards

In the previous section, the hazards \(\lambda_0(t)\) between increments were a priori independent – a naive prior belief perhaps. Instead, we might expect that the hazard at the current time point is centered around the hazard in the previous time point. We’d also expect that a higher hazard at the previous time point likely means a higher hazard at the current time point (positive correlation across time).

Nieto‐Barajas et al (2002) came up with a correlated Gamma Process that expresses exactly this prior belief. The basic idea is to introduce a latent stochastic process \(\{u_t\}_{t\geq0}\) that links \(\lambda_0(t)\) with \(\lambda_0(t-1)\). Here is the correlated Gamma Process,

  • Draw a hazard rate for the first time interval, \(I_1=[0, t_1)\), \(\lambda_1 \sim \mathcal G(\beta H^*(t_1), \beta)\),
  • Draw a latent variable \(u_1 | \lambda_1 \sim Pois(c \cdot \lambda_1)\)
  • Draw a hazard rate for the second time interval \(I_2 = [t_1, t_2)\), \(\lambda_2 \sim \mathcal G(\beta( H^*(t_2) - H^*(t_1) ) + u_1, \beta + c )\)
  • In general, for \(k\geq1\), define \(\alpha_k = \beta( H^*(t_k) - H^*(t_{k-1}) )\)

    • \(u_k | \lambda_k \sim Pois(c\cdot \lambda_k)\)
    • \(\lambda_{k+1} | u_k \sim \mathcal G( \alpha_{k+1} + u_k, \beta + c )\)

Notice that if \(c=0\), then \(u=0\) with probability \(1\) and the process reduces to the independent Gamma Process of the previous section. Now consider \(c=1\). The hazard rate in the next interval has mean \(E[\lambda_{k+1} | u_k] = \frac{ \alpha_{k+1} + u_k }{\beta+c}\). Now \(u_k \sim Pois(\lambda_k)\) is centered around the current \(\lambda_k\), allowing \(\lambda_k\) to influence \(\lambda_{k+1}\) through the latent variable \(u_k\). The higher the current hazard, \(\lambda_k\), the higher \(u_k\), and the higher the mean of the next hazard, \(\lambda_{k+1}\).
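A minimal R sketch of one realization of this correlated process, under the same assumed Weibull prior, \(\beta\), and time grid as before, with \(c=1\); this is illustrative code, not the author's.

```r
# Sketch: simulate one correlated Gamma Process realization via the latent u_k.
set.seed(2)
tgrid <- seq(0, 3, by = 0.05)
Hstar <- function(t) (t / 1.5)^2
beta  <- 5
cc    <- 1                                  # c; setting cc <- 0 recovers the independent process

alpha  <- beta * diff(Hstar(tgrid))         # alpha_k = beta * (H*(t_k) - H*(t_{k-1}))
K      <- length(alpha)
lambda <- numeric(K)
lambda[1] <- rgamma(1, shape = alpha[1], rate = beta)     # hazard for the first interval
for (k in 1:(K - 1)) {
  u_k <- rpois(1, cc * lambda[k])                         # latent link: u_k | lambda_k
  lambda[k + 1] <- rgamma(1, shape = alpha[k + 1] + u_k,  # lambda_{k+1} | u_k
                          rate = beta + cc)
}
H0 <- c(0, cumsum(lambda))                  # one correlated cumulative hazard realization
```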

Below are some realizations of correlated and independent Gamma Processes centered around the \(Weibull(2,1.5)\) hazard shown in red. One realization is highlighted in blue to make it easier to see the differences between the correlated and independent realizations.

Notice that the correlated Gamma Process looks very snake-y. This is because of the autoregressive structure on \(\{\lambda_0(t)\}_{t\geq0}\) induced by the latent process \(\{ u_t\}_{t\geq0}\).



RcppArmadillo 0.9.400.3.0

Sat, 05/11/2019 - 19:49

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

The recent 0.9.400.2.0 release of RcppArmadillo required a bug fix release. Conrad followed up on Armadillo 9.400.2 with 9.400.3, which we packaged (and tested extensively as usual). It is now on CRAN and will get to Debian shortly.

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 597 other packages on CRAN.
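For readers new to the package, here is a tiny illustrative example (not from this post) of calling Armadillo from R through RcppArmadillo:

```r
# Illustrative sketch only: a least-squares solve done in Armadillo, called from R.
library(Rcpp)

cppFunction(depends = "RcppArmadillo", '
  arma::vec ols_coef(const arma::mat& X, const arma::vec& y) {
    // arma::solve() returns the least-squares solution of X * beta = y
    return arma::solve(X, y);
  }
')

X <- cbind(1, mtcars$wt)
ols_coef(X, mtcars$mpg)   # matches coef(lm(mpg ~ wt, data = mtcars))
```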

A brief discussion of possible issues under 0.9.400.2.0 is at this GitHub issue ticket. The list of changes in 0.9.400.3.0 is below:

Changes in RcppArmadillo version 0.9.400.3.0 (2019-05-09)
  • Upgraded to Armadillo release 9.400.3 (Surrogate Miscreant)

    • check for symmetric / hermitian matrices (used by decomposition functions) has been made more robust

    • linspace() and logspace() now honour requests for generation of vectors with zero elements

    • fix for vectorisation / flattening of complex sparse matrices

The previous changes in 0.9.400.2.0 were:

Changes in RcppArmadillo version 0.9.400.2.0 (2019-04-28)
  • Upgraded to Armadillo release 9.400.2 (Surrogate Miscreant)

    • faster cov() and cor()

    • added .as_col() and .as_row()

    • expanded .shed_rows() / .shed_cols() / .shed_slices() to remove rows/columns/slices specified in a vector

    • expanded vectorise() to handle sparse matrices

    • expanded element-wise versions of max() and min() to handle sparse matrices

    • optimised handling of sparse matrix expressions: sparse % (sparse +- scalar) and sparse / (sparse +- scalar)

    • expanded eig_sym(), chol(), expmat_sym(), logmat_sympd(), sqrtmat_sympd(), inv_sympd() to print a warning if the given matrix is not symmetric

    • more consistent detection of vector expressions

Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



winedApp: Wine Recommendation and Data Analysis Web App

Sat, 05/11/2019 - 19:42

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

About 

Don’t wind up with the wrong wine at the wrong time: unwind with the world’s best!

winedApp provides insights into prices, ratings, descriptions, and geographic distribution of the world’s most esteemed wines. Novice or connoisseur, consumer or seller, this app will meet your oenophile needs.

In the Wine Explorer, you can enter your location, varietal, aroma, taste, and price range preferences, and retrieve information on compatible wines. 

The Global Insights feature offers map visualizations of international wine trends.

Graphs and Charts provides additional lenses into relationships amongst countries of origin, varietals, prices per bottle, and ratings.

The data was sourced from ca. 36,000 wine reviews on the WineEnthusiast site. In this dataset, 145 varietals from 36 countries and 23 US states are represented. 

Further information was extracted from the Wikipedia list of grape varieties

General Trends 
  1. Wine prices (per bottle) and points awarded do not show a strong positive correlation (for certain countries, there is even a negative correlation). 
  2. The US, Italy, and France are by far the most represented in the dataset. 
  3. The most represented varietals are Chardonnay (10,996 entries) and Cabernet Sauvignon (9,058 entries). 
  4. Most wines fall within the range of $4-50 per bottle, but the distribution is right-skewed. The full price range is $4-2,013. 
  5. Varietals vary considerably with respect to point and price ranges, as well as country distribution. 
  6. There is a statistically significant difference between average prices per bottle for red vs. white wines ($42.47 and $30.51, respectively, with p << 0.05). 
  7. There is also a statistically significant difference regarding average point values for reds vs. whites (88.43 and 88.29, respectively, with p << 0.05, on an 80-100 point scale). 
Future Work

Features will be added and refined on a continuous basis. Any suggestions are welcome.

Current objectives include: 

  1. Testing the app on a larger dataset; 
  2. Enabling the user to determine if a given wine is available in their local area; 
  3. Building a price predictor. 
Technical Details 

Web scraping was completed using the CRAN rvest package. The list of descriptive keywords featured in the Wine Explorer menu item was generated using the nltk (Natural Language Toolkit) library in Python. All other content was produced via the R Shiny Dashboard library and associated data visualization packages. To view the app code, please visit this Github repository.
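As a rough illustration of the scraping step, here is a hedged rvest sketch; the listing URL and CSS selectors below are assumptions for demonstration, not the selectors actually used for the WineEnthusiast scrape.

```r
# Illustrative sketch only: the listing URL and CSS selectors are assumed.
library(rvest)

page    <- read_html("https://www.winemag.com/?s=&drink_type=wine")   # assumed listing page
titles  <- page %>% html_nodes(".review-listing .title")  %>% html_text(trim = TRUE)
ratings <- page %>% html_nodes(".review-listing .rating") %>% html_text(trim = TRUE)

head(data.frame(title = titles, rating = ratings))
```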

External Links

App || Project Github || Author’s LinkedIn 



What is “Tidy Data”?

Sat, 05/11/2019 - 18:58

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

I would like to write a bit on the meaning and history of the phrase “tidy data.”

Hadley Wickham has been promoting the term “tidy data.” For example in an eponymous paper, he wrote:

In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Wickham, Hadley “Tidy Data”, Journal of Statistical Software, Vol 59, 2014.

Let’s try to apply this definition to the following data set from Wikipedia:

Tournament Winners
Tournament           | Year | Winner         | Winner Date of Birth
Indiana Invitational | 1998 | Al Fredrickson | 21 July 1975
Cleveland Open       | 1999 | Bob Albertson  | 28 September 1968
Des Moines Masters   | 1999 | Al Fredrickson | 21 July 1975
Indiana Invitational | 1999 | Chip Masterson | 14 March 1977

This would seem to be a nice “ready to analyze” data set. Rows are keyed by tournament and year, and rows carry additional key-derived observations of winner’s name and winner’s date of birth. From such a data set we could look for repeated winners, and look at the age of winners.

A question is: is such a data set “tidy”? The paper itself claims the above definitions are “Codd’s 3rd normal form.” So, no, the above table is not “tidy” under that paper’s definition. The winner’s date of birth is a fact about the winner alone, and not a fact about the joint row keys (the tournament plus year), as required by the rules of Codd’s 3rd normal form. The critique being: this data presentation does not express the intended data invariant that Al Fredrickson must have the same “Winner Date of Birth” in all rows.
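To make the normal-form point concrete, here is a small R sketch (mine, not the article's) that splits the table into a tournament table and a winner table, so the date of birth is stored once per winner:

```r
# Codd's 3rd normal form, roughly: move the winner's date of birth into its own
# table keyed by winner, so it is stated only once.
tournaments <- data.frame(
  Tournament = c("Indiana Invitational", "Cleveland Open",
                 "Des Moines Masters", "Indiana Invitational"),
  Year   = c(1998, 1999, 1999, 1999),
  Winner = c("Al Fredrickson", "Bob Albertson", "Al Fredrickson", "Chip Masterson")
)
winners <- data.frame(
  Winner        = c("Al Fredrickson", "Bob Albertson", "Chip Masterson"),
  Date_of_Birth = as.Date(c("1975-07-21", "1968-09-28", "1977-03-14"))
)

# The original single-table presentation is recovered by a join:
merge(tournaments, winners, by = "Winner")
```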

Around January of 2017 Hadley Wickham apparently retconned the “tidy data” definition to be:

Tidy data is data where:

  1. Each variable is in a column.
  2. Each observation is a row.
  3. Each value is a cell.

tidyr commit 58cf5d1ebad6b26bd33ad1c94cc5e5e7ef1acf7e

Notice point-3 is now something possibly more related to Codd’s guaranteed access rule, and now the example table is plausibly “tidy.”

The above concept was already well known in statistics and called a “data matrix.” For example:

A standard method of displaying a multivariate set of data is in the form of a data matrix in which rows correspond to sample individuals and columns to variables, so that the entry in the ith row and jth column gives the value of the jth variate as measured or observed on the ith individual.

Krzanowski, W. J., F. H. C. Marriott, Multivariate Analysis Part 1, Edward Arnold, 1994, page 1.

One must understand that in statistics, “individual” often refers to observations, not people.

The above reference clearly considers “data matrix” to be a noun phrase already in common use in statistics. It is in the book’s index, and often used in phrases such as:

Suppose X is an n × p data matrix …

Krzanowski, W. J., F. H. C. Marriott, Multivariate Analysis Part 1, Edward Arnold, 1994, page 75.

So statistics not only already has this data organization concept, it already has standard terminology around it. Data engineering often calls this data organization a “de-normalized form.”

As a further example, the statistical system R itself uses variations of the above standard terminology. Take for instance the help() text from R’s data.matrix() method:

data.matrix {base}    R Documentation
Convert a Data Frame to a Numeric Matrix
Description
Return the matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Factors and ordered factors are replaced by their internal codes.

What is the extra “Factors and ordered factors are replaced by their internal codes” part going on about? That is also fairly standard, let’s expand the earlier data matrix quote a bit to see this.

A standard method of displaying a multivariate set of data is in the form of a data matrix in which rows correspond to sample individuals and columns to variables, so that the entry in the ith row and jth column gives the value of the jth variate as measured or observed on the ith individual. When presenting data in this form, it is customary to assign a numerical code to the categories of a qualitative variable …

Krzanowski, W. J., F. H. C. Marriott, Multivariate Analysis Part 1, Edward Arnold, 1994, page 1.

Note: for many R analyses the model.matrix() command is implicitly called in preference to the data.matrix() command, as this conversion expands factors into “dummy variables”- which is a representation often more useful for modeling. The model.matrix() documentation starts as follows:

model.matrix {stats}    R Documentation
Construct Design Matrices
Description
model.matrix creates a design (or model) matrix, e.g., by expanding factors to a set of dummy variables (depending on the contrasts) and expanding interactions similarly.
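A small example (mine, not the post's) of the contrast being drawn here:

```r
# data.matrix() replaces a factor by its internal integer codes;
# model.matrix() expands it into dummy/indicator columns instead.
d <- data.frame(y = c(1.2, 3.4, 5.6), g = factor(c("a", "b", "b")))

data.matrix(d)           # column g becomes the codes 1, 2, 2
model.matrix(y ~ g, d)   # column gb is the 0/1 indicator for level "b"
```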

So to summarize: the whole time we have been talking about well understood concepts of organizing data for analysis that have a long history.

Frankly it appears “tidy data” is something akin to a trademark or marketing term, especially in its “tidyverse” variation.



Big Data-4: Webserver log analysis with RDDs, Pyspark, SparkR and SparklyR

Sat, 05/11/2019 - 17:06

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

“There’s something so paradoxical about pi. On the one hand, it represents order, as embodied by the shape of a circle, long held to be a symbol of perfection and eternity. On the other hand, pi is unruly, disheveled in appearance, its digits obeying no obvious rule, or at least none that we can perceive. Pi is elusive and mysterious, forever beyond reach. Its mix of order and disorder is what makes it so bewitching. ” 

From  Infinite Powers by Steven Strogatz

Anybody who wants to be “anybody” in Big Data must necessarily be able to work on both large structured and unstructured data. Log analysis, which usually involves unstructured data, is critical in any enterprise. As I mentioned in my previous post Big Data: On RDDs, Dataframes, Hive QL with Pyspark and SparkR-Part 3, RDDs are typically used to handle unstructured data. Spark has the Dataframe abstraction over RDDs, which performs better as it is optimized with the Catalyst optimization engine. Nevertheless, it is important to be able to process data with RDDs. This post is a continuation of my 3 earlier posts on Big Data, namely

1. Big Data-1: Move into the big league:Graduate from Python to Pyspark
2. Big Data-2: Move into the big league:Graduate from R to SparkR
3. Big Data: On RDDs, Dataframes,Hive QL with Pyspark and SparkR-Part 3

This post uses publicly available webserver logs from NASA. The logs are for the months of Jul 95 and Aug 95 and are a good place to start unstructured text analysis/log analysis. I highly recommend parsing these publicly available logs with regular expressions. It is only when you do so that the truth of Jamie Zawinski’s pearl of wisdom

“Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.” – Jamie Zawinski

hits home. I spent many hours struggling with regex!!

For this post for the RDD part,  I had to refer to Dr. Fisseha Berhane’s blog post Webserver Log Analysis and for the Pyspark part, to the Univ. of California Specialization which I had done 3 years back Big Data Analysis with Apache Spark. Once I had played around with the regex for RDDs and PySpark I managed to get SparkR and SparklyR versions to work.

The notebooks used in this post have been published and are available at

  1. logsAnalysiswithRDDs
  2. logsAnalysiswithPyspark
  3. logsAnalysiswithSparkRandSparklyR

You can also download all the notebooks from Github at WebServerLogsAnalysis

An essential and unavoidable aspect of Big Data processing is the need to process unstructured text. Web server logs are one such area which requires Big Data techniques to process massive amounts of logs. The Common Log Format, also known as the NCSA Common log format, is a standardized text file format used by web servers when generating server log files. Because the format is standardized, the files can be readily analyzed.

A publicly available set of webserver logs is the NASA-HTTP Web server logs. This is a good dataset with which we can play around to get familiar with handling web server logs. The logs can be accessed at NASA-HTTP

Description These two traces contain two month’s worth of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida.

Format The logs are an ASCII file with one line per request, with the following columns:

-host making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up.

-timestamp in the format “DAY MON DD HH:MM:SS YYYY”, where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400.

-request given in quotes.

-HTTP reply code.

-bytes in the reply.
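Before turning to Spark, here is a small base-R sketch (not part of the original post) of how a single Common Log Format line can be split into these fields with one regular expression; the example line is taken from the sample output shown later in the post.

```r
# Parse one NCSA Common Log Format line into host, timestamp, request, status, bytes.
line <- 'ix-stp-fl2-19.ix.netcom.com - - [03/Aug/1995:23:03:09 -0400] "GET /images/faq.gif HTTP/1.0" 200 263'

pat <- '^(\\S+) \\S+ \\S+ \\[([^]]+)\\] "(\\w+) (\\S+) [^"]*" (\\d{3}) (\\S+)$'
m   <- regmatches(line, regexec(pat, line))[[1]]

setNames(m[-1], c("host", "timestamp", "method", "path", "status", "bytes"))
```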

1 Parse Web server logs with RDDs 1.1 Read NASA Web server logs

Read the logs files from NASA for the months Jul 95 and Aug 95

from pyspark import SparkContext, SparkConf from pyspark.sql import SQLContext conf = SparkConf().setAppName("Spark-Logs-Handling").setMaster("local[*]") sc = SparkContext.getOrCreate(conf) sqlcontext = SQLContext(sc) rdd = sc.textFile("/FileStore/tables/NASA_access_log_*.gz") rdd.count() Out[1]: 3461613 1.2Check content

Check the logs to identify the parsing rules required for the logs

i=0 for line in rdd.sample(withReplacement = False, fraction = 0.00001, seed = 100).collect(): i=i+1 print(line) if i >5: break ix-stp-fl2-19.ix.netcom.com – – [03/Aug/1995:23:03:09 -0400] “GET /images/faq.gif HTTP/1.0” 200 263
slip183-1.kw.jp.ibm.net – – [04/Aug/1995:18:42:17 -0400] “GET /shuttle/missions/sts-70/images/DSC-95EC-0001.gif HTTP/1.0” 200 107133
piweba4y.prodigy.com – – [05/Aug/1995:19:17:41 -0400] “GET /icons/menu.xbm HTTP/1.0” 200 527
ruperts.bt-sys.bt.co.uk – – [07/Aug/1995:04:44:10 -0400] “GET /shuttle/countdown/video/livevideo2.gif HTTP/1.0” 200 69067
dal06-04.ppp.iadfw.net – – [07/Aug/1995:21:10:19 -0400] “GET /images/NASA-logosmall.gif HTTP/1.0” 200 786
p15.ppp-1.directnet.com – – [10/Aug/1995:01:22:54 -0400] “GET /images/KSC-logosmall.gif HTTP/1.0” 200 1204 1.3 Write the parsing rule for each of the fields
  • host
  • timestamp
  • path
  • status
  • content_bytes
1.21 Get IP address/host name

This regex matches the run of non-whitespace characters at the start of each log line (the host / IP address)

import re rslt=(rdd.map(lambda line: re.search('\S+',line) .group(0)) .take(3)) # Get the IP address \host name rslt Out[3]: [‘in24.inetnebr.com’, ‘uplherc.upl.com’, ‘uplherc.upl.com’] 1.22 Get timestamp

Get the time stamp

rslt=(rdd.map(lambda line: re.search(‘(\S+ -\d{4})’,line) .groups()) .take(3)) #Get the date rslt [(‘[01/Aug/1995:00:00:01 -0400’,),
(‘[01/Aug/1995:00:00:07 -0400’,),
(‘[01/Aug/1995:00:00:08 -0400’,)] 1.23 HTTP request

Get the HTTP request sent to Web server \w+ {GET}

# Get the REST call with ” “ rslt=(rdd.map(lambda line: re.search('"\w+\s+([^\s]+)\s+HTTP.*"',line) .groups()) .take(3)) # Get the REST call rslt [(‘/shuttle/missions/sts-68/news/sts-68-mcc-05.txt’,),
(‘/’,),
(‘/images/ksclogo-medium.gif’,)] 1.23Get HTTP response status

Get the HTTP response to the request

rslt=(rdd.map(lambda line: re.search('"\s(\d{3})',line) .groups()) .take(3)) #Get the status rslt Out[6]: [(‘200’,), (‘304’,), (‘304’,)] 1.24 Get content size

Get the HTTP response in bytes

rslt=(rdd.map(lambda line: re.search(‘^.*\s(\d*)$’,line) .groups()) .take(3)) # Get the content size rslt Out[7]: [(‘1839’,), (‘0’,), (‘0’,)] 1.24 Putting it all together

Now put all the individual pieces together into 1 big regular expression and assign to the groups

  1. Host 2. Timestamp 3. Path 4. Status 5. Content_size
rslt=(rdd.map(lambda line: re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s(\d*)$)',line) .groups()) .take(3)) rslt [(‘in24.inetnebr.com’,
‘ -‘,
‘ ‘,
‘-‘,
‘[01/Aug/1995:00:00:01 -0400]’,
‘”GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0″‘,
‘/shuttle/missions/sts-68/news/sts-68-mcc-05.txt’,
‘200 1839’,
‘1839’),
(‘uplherc.upl.com’,
‘ -‘,
‘ ‘,
‘-‘,
‘[01/Aug/1995:00:00:07 -0400]’,
‘”GET / HTTP/1.0″‘,
‘/’,
‘304 0’,
‘0’),
(‘uplherc.upl.com’,
‘ -‘,
‘ ‘,
‘-‘,
‘[01/Aug/1995:00:00:08 -0400]’,
‘”GET /images/ksclogo-medium.gif HTTP/1.0″‘,
‘/images/ksclogo-medium.gif’,
‘304 0’,
‘0’)] 1.25 Add a log parsing function import re def parse_log1(line): match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s(\d*)$)',line) if match is None: return(line,0) else: return(line,1) 1.26 Check for parsing failure

Check how many lines successfully parsed with the parsing function

n_logs = rdd.count() failed = rdd.map(lambda line: parse_log1(line)).filter(lambda line: line[1] == 0).count() print('Out of a total of {} logs, {} failed to parse'.format(n_logs,failed)) # Get the failed records line[1] == 0 failed1=rdd.map(lambda line: parse_log1(line)).filter(lambda line: line[1]==0) failed1.take(3) Out of a total of 3461613 logs, 38768 failed to parse
Out[10]:
[(‘gw1.att.com – – [01/Aug/1995:00:03:53 -0400] “GET /shuttle/missions/sts-73/news HTTP/1.0” 302 -‘,
0),
(‘js002.cc.utsunomiya-u.ac.jp – – [01/Aug/1995:00:07:33 -0400] “GET /shuttle/resources/orbiters/discovery.gif HTTP/1.0” 404 -‘,
0),
(‘pipe1.nyc.pipeline.com – – [01/Aug/1995:00:12:37 -0400] “GET /history/apollo/apollo-13/apollo-13-patch-small.gif” 200 12859’,
0)] 1.26 The above rule is not enough to parse the logs

It can be seen that the single rule only parses part of the logs and we cannot group the regex separately. The error “AttributeError: ‘NoneType’ object has no attribute ‘group’” shows up:

#rdd.map(lambda line: re.search(‘^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s(“\w+\s+([^\s]+)\s+HTTP.*”)\s(\d{3}\s(\d*)$)’,line[0]).group()).take(4)

File “/databricks/spark/python/pyspark/util.py”, line 99, in wrapper
return f(*args, **kwargs)
File “”, line 1, in
AttributeError: ‘NoneType’ object has no attribute ‘group’

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:490)

1.27 Add rule for parsing failed records

One of the issues with the earlier rule is the content_size has “-” for some logs

import re def parse_failed(line): match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s-$)',line) if match is None: return(line,0) else: return(line,1) 1.28 Parse records which fail

Parse the records that fails with the new rule

failed2=rdd.map(lambda line: parse_failed(line)).filter(lambda line: line[1]==1) failed2.take(5) Out[13]:
[(‘gw1.att.com – – [01/Aug/1995:00:03:53 -0400] “GET /shuttle/missions/sts-73/news HTTP/1.0” 302 -‘,
1),
(‘js002.cc.utsunomiya-u.ac.jp – – [01/Aug/1995:00:07:33 -0400] “GET /shuttle/resources/orbiters/discovery.gif HTTP/1.0” 404 -‘,
1),
(‘tia1.eskimo.com – – [01/Aug/1995:00:28:41 -0400] “GET /pub/winvn/release.txt HTTP/1.0” 404 -‘,
1),
(‘itws.info.eng.niigata-u.ac.jp – – [01/Aug/1995:00:38:01 -0400] “GET /ksc.html/facts/about_ksc.html HTTP/1.0” 403 -‘,
1),
(‘grimnet23.idirect.com – – [01/Aug/1995:00:50:12 -0400] “GET /www/software/winvn/winvn.html HTTP/1.0” 404 -‘,
1)] 1.28 Add both rules

Add both rules for parsing the log.

Note that it can be shown that even with both rules not all the logs are parsed. Further rules may need to be added.

import re def parse_log2(line): # Parse logs with the rule below match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(\d*)$',line) # If match failed then use the rule below if match is None: match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s-$)',line) if match is None: return (line, 0) # Return 0 for failure else: return (line, 1) # Return 1 for success 1.29 Group the different regex to groups for handling def map2groups(line): match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(\d*)$',line) if match is None: match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(-)$',line) return(match.groups()) 1.30 Parse the logs and map the groups parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])

parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))

2. Parse Web server logs with Pyspark 2.1Read data into a Pyspark dataframe import os logs_file_path="/FileStore/tables/" + os.path.join('NASA_access_log_*.gz') from pyspark.sql.functions import split, regexp_extract base_df = sqlContext.read.text(logs_file_path) #base_df.show(truncate=False) from pyspark.sql.functions import split, regexp_extract split_df = base_df.select(regexp_extract('value', r'^([^\s]+\s)', 1).alias('host'), regexp_extract('value', r'^.*\[(\d\d\/\w{3}\/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'), regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'), regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'), regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size')) split_df.show(5,truncate=False) +———————+————————–+———————————————–+——+————+
|host |timestamp |path |status|content_size|
+———————+————————–+———————————————–+——+————+
|199.72.81.55 |01/Jul/1995:00:00:01 -0400|/history/apollo/ |200 |6245 |
|unicomp6.unicomp.net |01/Jul/1995:00:00:06 -0400|/shuttle/countdown/ |200 |3985 |
|199.120.110.21 |01/Jul/1995:00:00:09 -0400|/shuttle/missions/sts-73/mission-sts-73.html |200 |4085 |
|burger.letters.com |01/Jul/1995:00:00:11 -0400|/shuttle/countdown/liftoff.html |304 |0 |
|199.120.110.21 |01/Jul/1995:00:00:11 -0400|/shuttle/missions/sts-73/sts-73-patch-small.gif|200 |4179 |
+———————+————————–+———————————————–+——+————+
only showing top 5 rows 2.2 Check data bad_rows_df = split_df.filter(split_df[‘host’].isNull() | split_df['timestamp'].isNull() | split_df['path'].isNull() | split_df['status'].isNull() | split_df['content_size'].isNull()) bad_rows_df.count() Out[20]: 33905 2.3Check no of rows which do not have digits

We have already seen that the content_size field has ‘-‘ instead of digits in RDDs

bad_content_size_df = base_df.filter(~ base_df['value'].rlike(r'\d+$')) bad_content_size_df.count() Out[21]: 33905 2.4 Add ‘*’ to identify bad rows

To identify the rows that are bad, concatenate ‘*’ to the content_size field where the field does not have digits. It can be seen that the content_size has ‘-‘ instead of a valid number

from pyspark.sql.functions import lit, concat bad_content_size_df.select(concat(bad_content_size_df['value'], lit('*'))).show(4,truncate=False) +—————————————————————————————————————————————————+
|concat(value, *) |
+—————————————————————————————————————————————————+
|dd15-062.compuserve.com – – [01/Jul/1995:00:01:12 -0400] “GET /news/sci.space.shuttle/archive/sci-space-shuttle-22-apr-1995-40.txt HTTP/1.0” 404 -*|
|dynip42.efn.org – – [01/Jul/1995:00:02:14 -0400] “GET /software HTTP/1.0” 302 -* |
|ix-or10-06.ix.netcom.com – – [01/Jul/1995:00:02:40 -0400] “GET /software/winvn HTTP/1.0” 302 -* |
|ix-or10-06.ix.netcom.com – – [01/Jul/1995:00:03:24 -0400] “GET /software HTTP/1.0” 302 -* |
+—————————————————————————————————————————————————+ 2.5 Fill NAs with 0s # Replace all null content_size values with 0.

cleaned_df = split_df.na.fill({‘content_size’: 0})

3. Webserver  logs parsing with SparkR library(SparkR) library(stringr) file_location = "/FileStore/tables/NASA_access_log_Jul95.gz" file_location = "/FileStore/tables/NASA_access_log_Aug95.gz" # Load the SparkR library # Initiate a SparkR session sparkR.session() sc <- sparkR.session() sqlContext <- sparkRSQL.init(sc) df <- read.text(sqlContext,"/FileStore/tables/NASA_access_log_Jul95.gz") #df=SparkR::select(df, "value") #head(SparkR::collect(df)) #m=regexp_extract(df$value,'\\\\S+',1) a=df %>% withColumn('host', regexp_extract(df$value, '^(\\S+)', 1)) %>% withColumn('timestamp', regexp_extract(df$value, "((\\S+ -\\d{4}))", 2)) %>% withColumn('path', regexp_extract(df$value, '(\\"\\w+\\s+([^\\s]+)\\s+HTTP.*")', 2)) %>% withColumn('status', regexp_extract(df$value, '(^.*"\\s+([^\\s]+))', 2)) %>% withColumn('content_size', regexp_extract(df$value, '(^.*\\s+(\\d+)$)', 2)) #b=a%>% select(host,timestamp,path,status,content_type) head(SparkR::collect(a),10)

1 199.72.81.55 – – [01/Jul/1995:00:00:01 -0400] “GET /history/apollo/ HTTP/1.0” 200 6245
2 unicomp6.unicomp.net – – [01/Jul/1995:00:00:06 -0400] “GET /shuttle/countdown/ HTTP/1.0” 200 3985
3 199.120.110.21 – – [01/Jul/1995:00:00:09 -0400] “GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0” 200 4085
4 burger.letters.com – – [01/Jul/1995:00:00:11 -0400] “GET /shuttle/countdown/liftoff.html HTTP/1.0” 304 0
5 199.120.110.21 – – [01/Jul/1995:00:00:11 -0400] “GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0” 200 4179
6 burger.letters.com – – [01/Jul/1995:00:00:12 -0400] “GET /images/NASA-logosmall.gif HTTP/1.0” 304 0
7 burger.letters.com – – [01/Jul/1995:00:00:12 -0400] “GET /shuttle/countdown/video/livevideo.gif HTTP/1.0” 200 0
8 205.212.115.106 – – [01/Jul/1995:00:00:12 -0400] “GET /shuttle/countdown/countdown.html HTTP/1.0” 200 3985
9 d104.aa.net – – [01/Jul/1995:00:00:13 -0400] “GET /shuttle/countdown/ HTTP/1.0” 200 3985
10 129.94.144.152 – – [01/Jul/1995:00:00:13 -0400] “GET / HTTP/1.0” 200 7074
host timestamp
1 199.72.81.55 [01/Jul/1995:00:00:01 -0400
2 unicomp6.unicomp.net [01/Jul/1995:00:00:06 -0400
3 199.120.110.21 [01/Jul/1995:00:00:09 -0400
4 burger.letters.com [01/Jul/1995:00:00:11 -0400
5 199.120.110.21 [01/Jul/1995:00:00:11 -0400
6 burger.letters.com [01/Jul/1995:00:00:12 -0400
7 burger.letters.com [01/Jul/1995:00:00:12 -0400
8 205.212.115.106 [01/Jul/1995:00:00:12 -0400
9 d104.aa.net [01/Jul/1995:00:00:13 -0400
10 129.94.144.152 [01/Jul/1995:00:00:13 -0400
path status content_size
1 /history/apollo/ 200 6245
2 /shuttle/countdown/ 200 3985
3 /shuttle/missions/sts-73/mission-sts-73.html 200 4085
4 /shuttle/countdown/liftoff.html 304 0
5 /shuttle/missions/sts-73/sts-73-patch-small.gif 200 4179
6 /images/NASA-logosmall.gif 304 0
7 /shuttle/countdown/video/livevideo.gif 200 0
8 /shuttle/countdown/countdown.html 200 3985
9 /shuttle/countdown/ 200 3985
10 / 200 7074

4 Webserver logs parsing with SparklyR install.packages("sparklyr") library(sparklyr) library(dplyr) library(stringr) #sc <- spark_connect(master = "local", version = "2.1.0") sc <- spark_connect(method = "databricks") sdf <-spark_read_text(sc, name="df", path = "/FileStore/tables/NASA_access_log*.gz") sdf Installing package into ‘/databricks/spark/R/lib’ # Source: spark [?? x 1] line 1 "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] \"GET /history/apollo/ HTTP/1… 2 "unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] \"GET /shuttle/countd… 3 "199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] \"GET /shuttle/missions/sts… 4 "burger.letters.com - - [01/Jul/1995:00:00:11 -0400] \"GET /shuttle/countdow… 5 "199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] \"GET /shuttle/missions/sts… 6 "burger.letters.com - - [01/Jul/1995:00:00:12 -0400] \"GET /images/NASA-logo… 7 "burger.letters.com - - [01/Jul/1995:00:00:12 -0400] \"GET /shuttle/countdow… 8 "205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] \"GET /shuttle/countdown/c… 9 "d104.aa.net - - [01/Jul/1995:00:00:13 -0400] \"GET /shuttle/countdown/ HTTP… 10 "129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] \"GET / HTTP/1.0\" 200 7074" # … with more rows #install.packages(“sparklyr”) library(sparklyr) library(dplyr) library(stringr) #sc <- spark_connect(master = "local", version = "2.1.0") sc <- spark_connect(method = "databricks") sdf <-spark_read_text(sc, name="df", path = "/FileStore/tables/NASA_access_log*.gz") sdf <- sdf %>% mutate(host = regexp_extract(line, '^(\\\\S+)',1)) %>% mutate(timestamp = regexp_extract(line, '((\\\\S+ -\\\\d{4}))',2)) %>% mutate(path = regexp_extract(line, '(\\\\"\\\\w+\\\\s+([^\\\\s]+)\\\\s+HTTP.*")',2)) %>% mutate(status = regexp_extract(line, '(^.*"\\\\s+([^\\\\s]+))',2)) %>% mutate(content_size = regexp_extract(line, '(^.*\\\\s+(\\\\d+)$)',2)) 5 Hosts 5.1  RDD 5.11 Parse and map to hosts to groups parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0]) parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line)) # Create tuples of (host,1) and apply reduceByKey() and order by descending rslt=(parsed_rdd2.map(lambda xx[0],1)) .reduceByKey(lambda a,b:a+b) .takeOrdered(10, lambda x: -x[1])) rslt Out[18]:
[(‘piweba3y.prodigy.com’, 21988),
(‘piweba4y.prodigy.com’, 16437),
(‘piweba1y.prodigy.com’, 12825),
(‘edams.ksc.nasa.gov’, 11962),
(‘163.206.89.4’, 9697),
(‘news.ti.com’, 8161),
(‘www-d1.proxy.aol.com’, 8047),
(‘alyssa.prodigy.com’, 8037),
(‘siltb10.orl.mmc.com’, 7573),
(‘www-a2.proxy.aol.com’, 7516)] 5.12Plot counts of hosts import seaborn as sns

import pandas as pd import matplotlib.pyplot as plt df=pd.DataFrame(rslt,columns=[‘host’,‘count’]) sns.barplot(x=‘host’,y=‘count’,data=df) plt.subplots_adjust(bottom=0.6, right=0.8, top=0.9) plt.xticks(rotation=“vertical”,fontsize=8) display()

5.2 PySpark 5.21 Compute counts of hosts df= (cleaned_df .groupBy('host') .count() .orderBy('count',ascending=False)) df.show(5) +——————–+—–+
| host|count|
+——————–+—–+
|piweba3y.prodigy….|21988|
|piweba4y.prodigy….|16437|
|piweba1y.prodigy….|12825|
| edams.ksc.nasa.gov |11964|
| 163.206.89.4 | 9697|
+——————–+—–+
only showing top 5 rows 5.22 Plot count of hosts import matplotlib.pyplot as plt import pandas as pd import seaborn as sns df1=df.toPandas() df2 = df1.head(10) df2.count() sns.barplot(x='host',y='count',data=df2) plt.subplots_adjust(bottom=0.5, right=0.8, top=0.9) plt.xlabel("Hosts") plt.ylabel('Count') plt.xticks(rotation="vertical",fontsize=10) display() 5.3 SparkR 5.31 Compute count of hosts c <- SparkR::select(a,a$host) df=SparkR::summarize(SparkR::groupBy(c, a$host), noHosts = count(a$host)) df1 =head(arrange(df,desc(df$noHosts)),10) head(df1)                   host noHosts 1 piweba3y.prodigy.com 17572 2 piweba4y.prodigy.com 11591 3 piweba1y.prodigy.com 9868 4 alyssa.prodigy.com 7852 5 siltb10.orl.mmc.com 7573 6 piweba2y.prodigy.com 5922 5.32 Plot count of hosts library(ggplot2) p <-ggplot(data=df1, aes(x=host, y=noHosts,fill=host)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Host') + ylab('Count') p 5.4 SparklyR 5.41 Compute count of Hosts df <- sdf %>% select(host,timestamp,path,status,content_size) df1 <- df %>% select(host) %>% group_by(host) %>% summarise(noHosts=n()) %>% arrange(desc(noHosts)) df2 <-head(df1,10) 5.42 Plot count of hosts library(ggplot2)

p <-ggplot(data=df2, aes(x=host, y=noHosts,fill=host)) + geom_bar(stat=“identity”)+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab(‘Host’) + ylab(‘Count’)

p

6 Paths 6.1 RDD
6.11 Parse and map HTTP requests to groups parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0]) parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line)) rslt=(parsed_rdd2.map(lambda x: (x[5],1)) .reduceByKey(lambda a,b:a+b) .takeOrdered(10, lambda x: -x[1])) rslt [(‘”GET /images/NASA-logosmall.gif HTTP/1.0″‘, 207520),
(‘”GET /images/KSC-logosmall.gif HTTP/1.0″‘, 164487),
(‘”GET /images/MOSAIC-logosmall.gif HTTP/1.0″‘, 126933),
(‘”GET /images/USA-logosmall.gif HTTP/1.0″‘, 126108),
(‘”GET /images/WORLD-logosmall.gif HTTP/1.0″‘, 124972),
(‘”GET /images/ksclogo-medium.gif HTTP/1.0″‘, 120704),
(‘”GET /ksc.html HTTP/1.0″‘, 83209),
(‘”GET /images/launch-logo.gif HTTP/1.0″‘, 75839),
(‘”GET /history/apollo/images/apollo-logo1.gif HTTP/1.0″‘, 68759),
(‘”GET /shuttle/countdown/ HTTP/1.0″‘, 64467)] 6.12 Plot counts of HTTP Requests import seaborn as sns

df=pd.DataFrame(rslt,columns=[‘path’,‘count’]) sns.barplot(x=‘path’,y=‘count’,data=df) plt.subplots_adjust(bottom=0.7, right=0.8, top=0.9) plt.xticks(rotation=“vertical”,fontsize=8)

display()

6.2 Pyspark 6.21 Compute count of HTTP Requests df= (cleaned_df .groupBy('path') .count() .orderBy('count',ascending=False)) df.show(5) Out[20]:
+——————–+——+
| path| count|
+——————–+——+
|/images/NASA-logo…|208362|
|/images/KSC-logos…|164813|
|/images/MOSAIC-lo…|127656|
|/images/USA-logos…|126820|
|/images/WORLD-log…|125676|
+——————–+——+
only showing top 5 rows 6.22 Plot count of HTTP Requests import matplotlib.pyplot as plt

import pandas as pd import seaborn as sns df1=df.toPandas() df2 = df1.head(10) df2.count() sns.barplot(x=‘path’,y=‘count’,data=df2)

plt.subplots_adjust(bottom=0.7, right=0.8, top=0.9) plt.xlabel(“HTTP Requests”) plt.ylabel(‘Count’) plt.xticks(rotation=90,fontsize=8)

display()

 

6.3 SparkR 6.31Compute count of HTTP requests library(SparkR) c <- SparkR::select(a,a$path) df=SparkR::summarize(SparkR::groupBy(c, a$path), numRequest = count(a$path)) df1=head(df) 3.14 Plot count of HTTP Requests library(ggplot2) p <-ggplot(data=df1, aes(x=path, y=numRequest,fill=path)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1))+ xlab('Path') + ylab('Count') p 6.4 SparklyR 6.41 Compute count of paths df <- sdf %>% select(host,timestamp,path,status,content_size) df1 <- df %>% select(path) %>% group_by(path) %>% summarise(noPaths=n()) %>% arrange(desc(noPaths)) df2 <-head(df1,10) df2 # Source: spark [?? x 2] # Ordered by: desc(noPaths) path noPaths 1 /images/NASA-logosmall.gif 208362 2 /images/KSC-logosmall.gif 164813 3 /images/MOSAIC-logosmall.gif 127656 4 /images/USA-logosmall.gif 126820 5 /images/WORLD-logosmall.gif 125676 6 /images/ksclogo-medium.gif 121286 7 /ksc.html 83685 8 /images/launch-logo.gif 75960 9 /history/apollo/images/apollo-logo1.gif 68858 10 /shuttle/countdown/ 64695 6.42 Plot count of Paths library(ggplot2) p <-ggplot(data=df2, aes(x=path, y=noPaths,fill=path)) + geom_bar(stat="identity")+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Path') + ylab('Count') p 7.1 RDD 7.11 Compute count of HTTP Status

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])

parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line)) rslt=(parsed_rdd2.map(lambda x: (x[7],1)) .reduceByKey(lambda a,b:a+b) .takeOrdered(10, lambda x: -x[1])) rslt Out[22]: [(‘200’, 3095682),
(‘304’, 266764),
(‘302’, 72970),
(‘404’, 20625),
(‘403’, 225),
(‘500’, 65),
(‘501’, 41)] 1.37 Plot counts of HTTP response status’ import seaborn as sns

df=pd.DataFrame(rslt,columns=[‘status’,‘count’]) sns.barplot(x=‘status’,y=‘count’,data=df) plt.subplots_adjust(bottom=0.4, right=0.8, top=0.9) plt.xticks(rotation=“vertical”,fontsize=8)

display()

7.2 Pyspark 7.21 Compute count of HTTP status status_count=(cleaned_df .groupBy('status') .count() .orderBy('count',ascending=False)) status_count.show() +——+——-+
|status| count|
+——+——-+
| 200|3100522|
| 304| 266773|
| 302| 73070|
| 404| 20901|
| 403| 225|
| 500| 65|
| 501| 41|
| 400| 15|
| null| 1| 7.22 Plot count of HTTP status

Plot the HTTP return status vs the counts

df1=status_count.toPandas()

df2 = df1.head(10) df2.count() sns.barplot(x=‘status’,y=‘count’,data=df2) plt.subplots_adjust(bottom=0.5, right=0.8, top=0.9) plt.xlabel(“HTTP Status”) plt.ylabel(‘Count’) plt.xticks(rotation=“vertical”,fontsize=10) display()

7.3 SparkR 7.31 Compute count of HTTP Response status library(SparkR) c <- SparkR::select(a,a$status) df=SparkR::summarize(SparkR::groupBy(c, a$status), numStatus = count(a$status)) df1=head(df) 3.16 Plot count of HTTP Response status library(ggplot2) p <-ggplot(data=df1, aes(x=status, y=numStatus,fill=status)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Status') + ylab('Count') p 7.4 SparklyR
7.41 Compute count of status df <- sdf %>% select(host,timestamp,path,status,content_size) df1 <- df %>% select(status) %>% group_by(status) %>% summarise(noStatus=n()) %>% arrange(desc(noStatus)) df2 <-head(df1,10) df2 # Source: spark [?? x 2] # Ordered by: desc(noStatus) status noStatus 1 200 3100522 2 304 266773 3 302 73070 4 404 20901 5 403 225 6 500 65 7 501 41 8 400 15 9 "" 1 7.42 Plot count of status library(ggplot2)

p <-ggplot(data=df2, aes(x=status, y=noStatus,fill=status)) + geom_bar(stat=“identity”)+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab(‘Status’) + ylab(‘Count’) p

8.1 RDD
8.12 Compute count of content size parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0]) parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line)) rslt=(parsed_rdd2.map(lambda x: (x[8],1)) .reduceByKey(lambda a,b:a+b) .takeOrdered(10, lambda x: -x[1])) rslt Out[24]:
[(‘0’, 280017),
(‘786’, 167281),
(‘1204’, 140505),
(‘363’, 111575),
(‘234’, 110824),
(‘669’, 110056),
(‘5866’, 107079),
(‘1713’, 66904),
(‘1173’, 63336),
(‘3635’, 55528)] 8.21 Plot content size import seaborn as sns

df=pd.DataFrame(rslt,columns=[‘content_size’,‘count’]) sns.barplot(x=‘content_size’,y=‘count’,data=df) plt.subplots_adjust(bottom=0.4, right=0.8, top=0.9) plt.xticks(rotation=“vertical”,fontsize=8) display()

8.2 Pyspark 8.21 Compute count of content_size size_counts=(cleaned_df .groupBy('content_size') .count() .orderBy('count',ascending=False)) size_counts.show(10) +------------+------+ |content_size| count| +------------+------+ | 0|313932| | 786|167709| | 1204|140668| | 363|111835| | 234|111086| | 669|110313| | 5866|107373| | 1713| 66953| | 1173| 63378| | 3635| 55579| +------------+------+ only showing top 10 rows 8.22 Plot counts of content size

Plot the content sizes versus the counts

df1=size_counts.toPandas()

df2 = df1.head(10) df2.count() sns.barplot(x=‘content_size’,y=‘count’,data=df2) plt.subplots_adjust(bottom=0.5, right=0.8, top=0.9) plt.xlabel(“content_size”) plt.ylabel(‘Count’) plt.xticks(rotation=“vertical”,fontsize=10) display()

8.3 SparkR 8.31 Compute count of content size library(SparkR) c <- SparkR::select(a,a$content_size) df=SparkR::summarize(SparkR::groupBy(c, a$content_size), numContentSize = count(a$content_size)) df1=head(df) df1      content_size numContentSize 1 28426 1414 2 78382 293 3 60053 4 4 36067 2 5 13282 236 6 41785 174 8.32 Plot count of content sizes library(ggplot2)

p <-ggplot(data=df1, aes(x=content_size, y=numContentSize,fill=content_size)) + geom_bar(stat=“identity”) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab(‘Content Size’) + ylab(‘Count’)

p

8.4 SparklyR
8.41Compute count of content_size df <- sdf %>% select(host,timestamp,path,status,content_size) df1 <- df %>% select(content_size) %>% group_by(content_size) %>% summarise(noContentSize=n()) %>% arrange(desc(noContentSize)) df2 <-head(df1,10) df2 # Source: spark [?? x 2] # Ordered by: desc(noContentSize) content_size noContentSize 1 0 280027 2 786 167709 3 1204 140668 4 363 111835 5 234 111086 6 669 110313 7 5866 107373 8 1713 66953 9 1173 63378 10 3635 55579 8.42 Plot count of content_size library(ggplot2) p <-ggplot(data=df2, aes(x=content_size, y=noContentSize,fill=content_size)) + geom_bar(stat="identity")+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Content size') + ylab('Count') p

Conclusion: I spent many, many hours struggling with regex and getting the RDD and Pyspark versions to work. I also had to spend a lot of time working out the syntax for parsing with SparkR and SparklyR. Once you have parsed the logs, the plotting and analysis is a piece of cake! This is definitely worth a try!

Watch this space!!

Also see
1. Practical Machine Learning with R and Python – Part 3
2. Deep Learning from first principles in Python, R and Octave – Part 5
3. My book ‘Cricket analytics with cricketr and cricpy’ is now on Amazon
4. Latency, throughput implications for the Cloud
5. Modeling a Car in Android
6. Architecting a cloud based IP Multimedia System (IMS)
7. Dabbling with Wiener filter using OpenCV

To see all posts click Index of posts



Kernel spatial smoothing : transforming points pattern to continuous coverage

Sat, 05/11/2019 - 14:39

(This article was first published on r.iresmi.net, and kindly contributed to R-bloggers)

Representing mass data (inhabitants, livestock, …) on a map is not always easy: choropleth maps are clearly a no-go, except if you normalize by area, and then you stumble on the MAUP… A nice solution is smoothing, producing a raster. You even get freebies like (potential) statistical confidentiality, a better geographic synthesis and easy computations across multiple layers.

The kernel smoothing should not be confused with interpolation or kriging : the aim here is to « spread » and sum point values, see Loonis and de Bellefon (2018) for a comprehensive explanation.

We’ll use the btb package (Santos et al. 2018) which has the great advantage of providing a way to specify a geographical study zone, avoiding our values to bleed in another country or in the sea for example.

We’ll map the French population:

  • the data is available on the IGN site
  • a 7z decompress utility must be available in your $PATH ;
  • the shapefile COMMUNE.shp has a POPULATION field ;
  • this is a polygon coverage, so we’ll take the « centroids ».
library(raster) # load before dplyr (against select conflict) library(tidyverse) library(httr) library(sf) library(btb) # download and extract data zipped_file <- tempfile() GET("ftp://Admin_Express_ext:Dahnoh0eigheeFok@ftp3.ign.fr/ADMIN-EXPRESS_2-0__SHP__FRA_2019-03-14.7z.001", write_disk(zipped_file), progress()) system(paste("7z x -odata", zipped_file)) # create a unique polygon for France (our study zone) fr <- read_sf("data/ADMIN-EXPRESS_2-0__SHP__FRA_2019-03-14/ADMIN-EXPRESS/1_DONNEES_LIVRAISON_2019-03-14/ADE_2-0_SHP_LAMB93_FR/REGION.shp") %>% st_union() %>% st_sf() %>% st_set_crs(2154) # load communes ; convert to points comm <- read_sf("data/ADMIN-EXPRESS_2-0__SHP__FRA_2019-03-14/ADMIN-EXPRESS/1_DONNEES_LIVRAISON_2019-03-14/ADE_2-0_SHP_LAMB93_FR/COMMUNE.shp")%>% st_set_crs(2154) %>% st_point_on_surface()

We create a function :

#' Kernel weighted smoothing with arbitrary bounding area #' #' @param df sf object (points) #' @param field weigth field in sf #' @param bandwith kernel bandwidth (map units) #' @param resolution output grid resolution (map units) #' @param zone sf study zone (polygon) #' @param out_crs EPSG (should be an equal-area projection) #' #' @return a raster object #' @import btb, raster, fasterize, dplyr, plyr, sf lissage <- function(df, field, bandwidth, resolution, zone, out_crs = 3035) { if (st_crs(zone)$epsg != out_crs) { message("reprojecting data...") zone <- st_transform(zone, out_crs) } if (st_crs(df)$epsg != out_crs) { message("reprojecting study zone...") df <- st_transform(df, out_crs) } zone_bbox <- st_bbox(zone) # grid generation message("generating reference grid...") zone_xy <- zone %>% dplyr::select(geometry) %>% st_make_grid(cellsize = resolution, offset = c(plyr::round_any(zone_bbox[1] - bandwidth, resolution, f = floor), plyr::round_any(zone_bbox[2] - bandwidth, resolution, f = floor)), what = "centers") %>% st_sf() %>% st_join(zone, join = st_intersects, left = FALSE) %>% st_coordinates() %>% as_tibble() %>% dplyr::select(x = X, y = Y) # kernel message("computing kernel...") kernel <- df %>% cbind(., st_coordinates(.)) %>% st_set_geometry(NULL) %>% dplyr::select(x = X, y = Y, field) %>% btb::kernelSmoothing(dfObservations = ., sEPSG = out_crs, iCellSize = resolution, iBandwidth = bandwidth, vQuantiles = NULL, dfCentroids = zone_xy) # rasterization message("\nrasterizing...") raster::raster(xmn = plyr::round_any(zone_bbox[1] - bandwidth, resolution, f = floor), ymn = plyr::round_any(zone_bbox[2] - bandwidth, resolution, f = floor), xmx = plyr::round_any(zone_bbox[3] + bandwidth, resolution, f = ceiling), ymx = plyr::round_any(zone_bbox[4] + bandwidth, resolution, f = ceiling), resolution = resolution) %>% fasterize::fasterize(kernel, ., field = field) }

Instead of a raw choropleth map like this (don’t do this at home) :

Inhabitants, quantile classification ; see the red arrow : a big commune with a somewhat low population (2100 inhabitants) pops out due to its big area

… we should use a proportional symbol. But it’s quite cluttered in urban areas :

You’ve got measles

We can also get the polygon density :

Density, quantile classification. Our previous commune is now more coherent, however the map is not very synthetic due to the heterogeneous size of the communes

We just have to run our function for example with a bandwidth of 20 km and a cell size of 2 km. We use an equal area projection : the LAEA Europa (EPSG:3035).

comm %>% lissage("POPULATION", 20000, 2000, fr, 3035) %>% raster::writeRaster("pop.tif")

And lastly our smoothing :

Smoothing, discrete quantile classification

That's a nice synthetic representation!

After that it's easy in R to do raster algebra; for example, dividing a grid of crop yields by a grid of agricultural area, computing a percent change between dates, etc.
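As a rough sketch (mine, not from the original post) of what such raster algebra could look like, assuming "crop.tif" and "area.tif" are hypothetical grids produced beforehand with lissage():

library(raster)
yield_total <- raster("crop.tif")   # e.g. smoothed total crop production per cell
area_agri   <- raster("area.tif")   # e.g. smoothed agricultural area per cell
ratio <- yield_total / area_agri    # cell-by-cell division
raster::writeRaster(ratio, "yield_per_area.tif")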

References

Loonis, Vincent and Marie-Pierre de Bellefon, eds 2018. Handbook of Spatial Analysis. Theory and Application with R. INSEE Méthodes 131. Montrouge, France: Institut national de la statistique et des études économiques. https://www.insee.fr/en/information/3635545.

Santos, Arlindo Dos, Francois Semecurbe, Auriane Renaud, Cynthia Faivre, Thierry Cornely and Farida Marouchi. 2018. btb: Beyond the Border – Kernel Density Estimation for Urban Geography (version 0.1.30). https://CRAN.R-project.org/package=btb.



Porting and redirecting a Hugo-based blogdown website to an HTTPS-enabled custom domain and how to do it the easy way

Sat, 05/11/2019 - 14:00

(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)

Introduction

As we wrote in Should you start your R blog now?, blogging has probably never been more accessible to the general population, R users included. Usually, the simplest solution is to host your blog via a service that provides it for free, such as Netlify, GitHub or GitLab Pages. But what if you want to host that awesome blog on your own, HTTPS enabled domain?

In this post, we will look at how to port a Hugo-based website, such as a blogdown blog to our own domain, specifically focusing on GitLab Pages. We will also cover setting up SSL certificates, redirects from www to non-www sites and other details that I had to solve when porting my blogdown blog from GitLab’s hosting.

Contents
  1. If you are just starting – there is an easy way
  2. Serving a page on a custom domain via GitLab Pages
  3. Redirecting the gitlab.io address to the custom domain
  4. Redirecting www and non-www URLs to a single address
  5. References
If you are just starting – there is an easy way

This post is mostly a reminder-to-self of what porting this blog from GitLab Pages hosting to a custom domain entailed. The route I took was heavily influenced by the way I was serving the website at the beginning – using GitLab Pages on a project address. Migrating to a new domain, I wanted to:

  1. Keep all the functionality that a GitLab repository with GitLab CI/CD provides
  2. Serve the content at a custom HTTPS-enabled domain
  3. Redirect to the new domain with minimal to no content duplication
  4. Make sure that the website works on both www and non-www addresses

If you have your blog hosted on GitLab pages and want to port and redirect it to your own HTTPS-enabled domain, while keeping the functionality that GitLab provides, you might find my journey useful.

Blogdown, Hugo & Let’s Encrypt logos

If you are just starting, it is easier to choose a different approach to publish your blog – here are two tips to consider if you want to avoid the pain I went through because of my past decisions:

What would I do if starting today

With the knowledge I gained when investigating the process, I would probably take the following route:

  1. Register a custom domain via CloudFlare. This should make non-www/www redirects and getting SSL certificates seamless.
  2. Deploy the pages by connecting the GitLab repository to Netlify, which should be as easy as using GitLab CI/CD.
  3. Set up deployment to the custom domain via Netlify. This should make the redirects to the custom domain seamless and technically sound.
Doing it the simplest way

Serving a Hugo-based website can in principle be even simpler – in fact, all that is really necessary is copying/uploading the contents of the public directory, generated for example with blogdown::build_site(), to the proper place. Everything else we do around this is there to make our lives nicer, at the cost of extra effort.
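As a minimal illustration (my own sketch, not part of the original post), that simplest route boils down to something like:

blogdown::build_site()   # regenerates the static site into the "public/" folder
list.files("public")     # these files are what gets copied/uploaded to the web server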

Serving a page on a custom domain via GitLab Pages

1. Choose and register a domain name

The first step is to choose a domain name (i.e. the web address) for your brand new website. This is completely up to you and the internet is full of tips like this one. Next, register that domain name with a provider of your choice. I have used a local provider for all of my websites for many years, so the choice was easy.

To register your domain name, you can pick from a plethora of providers, each with their own pros and cons.

2. Set up an SSL certificate

Setting up your website such that it can be accessed via HTTPS should be the standard these days, so we also need to set up an SSL certificate. Once again, this should be simple as we can use free Let’s Encrypt certificates to achieve that.

The actual process once again depends on the provider of your domain services – in practice, it should entail just a few clicks in their web UI.

3. Set up GitLab to serve your pages to a custom domain

Setting up GitLab Pages to be served to your own domain is well documented in GitLab’s documentation here and even better here.

If you have chosen CloudFlare as your service to manage the DNS, Nick Zeng wrote a detailed guide on how to set up GitLab Pages with a custom domain. GitLab also has links to setting up DNS records for other hosting providers.

After these 3 steps, you should see your website served on your new domain and HTTPS should work just fine. Now onto the not-so-simple issues.

Redirecting the gitlab.io address to the custom domain

Server-side redirects are not supported with GitLab Pages

Now that we can see our content on our new domain, we may want to take care of the fact that it is now visible on 2 addresses – the new custom domain and the original GitLab pages address. The traditional way of handling this is with server-side redirects – the server would issue an HTTP 301 Moved Permanently redirect to the new domain. The issue with that is that GitLab Pages does not support server-side redirects.

Redirects are, on the other hand, supported by GitHub Pages and by Netlify as well.

JavaScript to the rescue

We can also find a suggestion to use meta refresh tags, but since using them is not always simple and server-side functionality is not available, we can opt for client-side JavaScript to solve our redirection issues.

It may not seem like a good idea from an SEO perspective at first, but looking at some research on how Google handles JavaScript redirects, it looks like JavaScript redirects are quickly followed by Google. From an indexing standpoint, they are interpreted as 301s — the end-state URLs replaced the redirected URLs in Google’s index.

An example of a JavaScript implementation using the window.location object that can be used to get the current page address and to redirect the browser to a new page can look as follows:

function replacePath(path, old_d, new_d) {
  path = path.replace(old_d, new_d);
  if (path.includes(new_d)) {
    // only if really on the new domain
    path = path.replace("http:", "https:");
  }
  return path;
}

// the blanks below stand for the current page address, e.g. obtained from window.location
newpath = replacePath(
  _____,
  "://jozefhajnala.gitlab.io/r",
  "://jozef.io"
);

// Prevent infinite redirect
if (_____ != newpath) {
  window.location.replace(newpath);
}

We would obviously replace the mentioned addresses by the desired ones and omit the https replacement if the new domain does not have SSL enabled.

We can test our JavaScript with a very simple function to see if all the URLs will be translated correctly:

// Place all urls into a variable
// this is just an example with a few
var oldLinks = [
  "https://jozefhajnala.gitlab.io/r",
  "https://jozefhajnala.gitlab.io/r/categories/rcase4base",
  "https://jozefhajnala.gitlab.io/r/categories/rstudioaddins",
  "https://jozefhajnala.gitlab.io/r/categories/various"
];

const old_d = "://jozefhajnala.gitlab.io/r";
const new_d = "://jozef.io";

// get the new links
var newLinks = oldLinks.map(x => replacePath(x, old_d, new_d));

function getStatus(url) {
  var req = new XMLHttpRequest();
  req.open("GET", url, false);
  req.send(null);
  return req.status;
}

// check the response statuses
// we want no 404s here, all 200 would be ideal
var statuses = newLinks.map(getStatus);

Setting canonical links

To be completely sure about duplicate content, if we have several similar versions of the same content, we can choose one version and point the search engines at this version by specifying a canonical URL.

To specify them using Hugo is very simple thanks to the way it provides variables and the partials approach to building themes. Simply add a canonical link line to your header.html or head.html partial file.

The .html files for partials are usually located in the themes//layouts/partials/ directory.

Redirecting www and non-www URLs to a single address

Another aspect of the move is to make sure that the content is available via both www.example.com and example.com, but not duplicated. Which of those is preferred is once again up to you. One solution would be to tell GitLab to serve the content to both and use the canonical link or use the JavaScript redirect again. However, there is a much nicer solution on offer here since for our own domain we can use server-side redirects.

If your provider of choice is CloudFlare, it seems that this redirect can be done in a few clicks via the web UI.

Using .htaccess

One way to create this redirect is by using a .htaccess file. An example content, if you want to redirect to the www address with HTTPS, can look as follows:

RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule .* https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule .* https://www.%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

Once you have the file ready, upload it to your site through an FTP client. If you host directly on your own server:

  • place the .htaccess file into the proper directory, for example, /var/www/example.com/
  • do not forget to activate the apache mod_rewrite module using sudo a2enmod rewrite
  • an extra SSL certificate is likely to be needed for HTTPS to work correctly. Generating one with Let’s Encrypt and certbot is simple, as described for example here

Read more details on using .htaccess here and more details on using Mod_Rewrites for redirects here

References

  • Using custom domains with GitLab Pages
  • Redirects and duplicate content
  • Setting up virtual hosts, securing Apache with SSL, using .htaccess


What’s that disease called? Overview of icd package

Sat, 05/11/2019 - 02:00

(This article was first published on R on notast, and kindly contributed to R-bloggers)

Intro

There are many illnesses and diseases known to man. How do the various stakeholders in the medical science industry classify the same illness? The illness will need to be coded in a standardized manner to aid in fair reimbursements and concise reporting of diseases. The International Classification of Diseases (ICD) provides this uniform coding system. The ICD “is the standard diagnostic tool for epidemiology, health management and clinical purposes.”. (There is a more detailed coding system known as the Systematized Nomenclature of Medicine — Clinical Terms (SNOMED-CT) but it will not be covered in this post.)

The ICD currently has 11 versions. At this point in time, countries and researchers are using either ICD-9 or ICD-10, with those using ICD-9 gradually transitioning to ICD-10. ICD-11 has yet to be adopted in clinical practice.

R has a package, icd, which deals with both ICD-9 and ICD-10. The package also includes built-in functions to conduct common ICD-related calculations, such as Hierarchical Condition Codes and the Charlson and Van Walraven scores. We will use the icd package to help explain ICD-9 and ICD-10 and do some analysis on an external dataset.

The ICD is a hierarchy-based classification. There are four levels in total:

  1. chapter
  2. sub-chapter
  3. major. Each major has a three_digit identifier with a character length of three
  4. descriptor, long_desc. Each descriptor has an identifier code with a character length from three to five.
library(tidyverse)
library(icd)
theme_set(theme_light())

# Level 1-3
icd9cm_hierarchy %>%
  select(chapter, sub_chapter, major, three_digit) %>%
  head(10)
##                              chapter                    sub_chapter
## 1  Infectious And Parasitic Diseases Intestinal Infectious Diseases
## 2  Infectious And Parasitic Diseases Intestinal Infectious Diseases
## 3  Infectious And Parasitic Diseases Intestinal Infectious Diseases
## 4  Infectious And Parasitic Diseases Intestinal Infectious Diseases
## 5  Infectious And Parasitic Diseases Intestinal Infectious Diseases
## 6  Infectious And Parasitic Diseases Intestinal Infectious Diseases
## 7  Infectious And Parasitic Diseases Intestinal Infectious Diseases
## 8  Infectious And Parasitic Diseases Intestinal Infectious Diseases
## 9  Infectious And Parasitic Diseases Intestinal Infectious Diseases
## 10 Infectious And Parasitic Diseases Intestinal Infectious Diseases
##                             major three_digit
## 1                         Cholera         001
## 2                         Cholera         001
## 3                         Cholera         001
## 4                         Cholera         001
## 5  Typhoid and paratyphoid fevers         002
## 6  Typhoid and paratyphoid fevers         002
## 7  Typhoid and paratyphoid fevers         002
## 8  Typhoid and paratyphoid fevers         002
## 9  Typhoid and paratyphoid fevers         002
## 10 Typhoid and paratyphoid fevers         002

# Level 3-4
icd9cm_hierarchy %>%
  select(major, three_digit, long_desc, code) %>%
  head(10)
##                             major three_digit
## 1                         Cholera         001
## 2                         Cholera         001
## 3                         Cholera         001
## 4                         Cholera         001
## 5  Typhoid and paratyphoid fevers         002
## 6  Typhoid and paratyphoid fevers         002
## 7  Typhoid and paratyphoid fevers         002
## 8  Typhoid and paratyphoid fevers         002
## 9  Typhoid and paratyphoid fevers         002
## 10 Typhoid and paratyphoid fevers         002
##                                 long_desc code
## 1                                 Cholera  001
## 2          Cholera due to vibrio cholerae 0010
## 3   Cholera due to vibrio cholerae el tor 0011
## 4                    Cholera, unspecified 0019
## 5          Typhoid and paratyphoid fevers  002
## 6                           Typhoid fever 0020
## 7                     Paratyphoid fever A 0021
## 8                     Paratyphoid fever B 0022
## 9                     Paratyphoid fever C 0023
## 10         Paratyphoid fever, unspecified 0029

We can see the subordinate codes of the three_digit identifier with the function, children.

children("001") ## [1] "001" "0010" "0011" "0019"

Beware that in some instances the first three characters of codes are not the same as the three_digit identifiers.

icd9cm_hierarchy %>%
  mutate(first_3_char_of_code = substr(three_digit, 1, 3),
         same = first_3_char_of_code == three_digit) %>%
  ggplot(aes(same)) +
  geom_bar() +
  labs(x = "", title = "Is the first three characters of `code` the same as the `three_digit` identifier?")

Let's examine which codes these are. It looks like codes beginning with "E" resulted in the mismatch.

icd9cm_hierarchy %>%
  mutate(first_3_char_of_code = substr(three_digit, 1, 3),
         same = first_3_char_of_code == three_digit) %>%
  filter(same == "FALSE") %>%
  select(code, first_3_char_of_code, three_digit) %>%
  sample_n(10)
##       code first_3_char_of_code three_digit
## 90   E0129                  E01        E012
## 380   E828                  E82        E828
## 6    E0009                  E00        E000
## 1360 E9830                  E98        E983
## 1011 E9284                  E92        E928
## 1085 E9353                  E93        E935
## 1143 E9422                  E94        E942
## 1117 E9389                  E93        E938
## 1024 E9298                  E92        E929
## 456  E8359                  E83        E835

Difference between ICD-9 and ICD-10

Breadth and depth

Now that we understand the structure of the ICD, let's look at the difference between ICD-9 and ICD-10. ICD-10 has more chapters and more permutations and combinations of subordinate members than ICD-9. Thus, the ICD-10 dataset is longer than the ICD-9 dataset.

cbind(ICD9 = nrow(icd9cm_hierarchy), ICD10 = nrow(icd10cm2019)) %>%
  as_tibble()
## # A tibble: 1 x 2
##    ICD9 ICD10
##
## 1 17561 94444

Coding

The majority of ICD-9 codes use numeric values for the first character of the three_digit identifier (and therefore also of the code).

substr(icd9cm_hierarchy$three_digit, 1, 1) %>% unique()
## [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "V" "E"

ICD-10, on the other hand, uses letters for the first character.

substr(icd10cm2019$three_digit, 1, 1) %>% unique()
# https://stackoverflow.com/questions/33199203/r-how-to-display-the-first-n-characters-from-a-string-of-words
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "V" "W" "X" "Y" "Z"

I will be referring to ICD-9 for the rest of the post.

code format

code can be expressed in two ways:

  1. Short format, which has been used in all the above examples. It has a character length from three to five. The first three characters of code are the same as the three_digit identifier on most occasions. The mismatch occurs when the code begins with the letter "E".

  2. Decimal format. A handful of healthcare databases and research datasets adopt this format. code in this format has three characters on the left side of the decimal point, which are the same as the three_digit identifier, and at most two characters on the right side of the decimal point (e.g. "250.33"). However, due to the formatting of electronic medical records or exporting the code to Excel, the code may be truncated. For instance, zeros before a non-zero numeric character will be dropped (e.g. "004.11" -> "4.11"), and zeros after a non-zero numeric character on the right side of the decimal point will also be dropped (e.g. "250.50" -> "250.5"). A sketch of re-padding dropped leading zeros follows below this list.
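As a minimal sketch of my own (not a function from the icd package or the original post), dropped leading zeros on purely numeric decimal codes could be re-padded before conversion; E- and V-codes and dropped trailing zeros are not handled here:

pad_icd9_decimal <- function(x) {
  left  <- sub("\\..*$", "", x)   # part before the decimal point
  right <- sub("^[^.]*", "", x)   # decimal point and whatever follows ("" if none)
  paste0(strrep("0", pmax(0, 3 - nchar(left))), left, right)
}
pad_icd9_decimal(c("4.11", "99.17"))  # "004.11" "099.17"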

Inspecting for data entry errors

Data entry is susceptible to errors considering the code format and the magnitude of permutations and combinations of code. The icd package has two functions to identify data entry errors.

Validation of code appearance

is_valid will help to determine if the code looks correct

is_valid("123.456") #max of 2 char of R side of decimal point ## [1] FALSE is_valid("045l") #l is an invalid character ## [1] FALSE is_valid("099.17", short_code = T) #expecting `code` to be short format and not decimal format ## [1] FALSE is_valid("099.17", short_code = F) #plausible `code` in decimal format ## [1] TRUE Legitimate definition behind code

Codes which appear valid may not have any underpinning meaning. is_defined helps to determine if the code is actually defined.

as.icd9cm("089") %>% #as.icd9cm informs is_defined which ICD version you are referring to is_defined() ## [1] FALSE

The codes 088 and 090 exist, but 089 does not.

Application

After completing a crash course on the concepts of the ICD, let's see how the package can help us with our data wrangling. We will be using a dataset on hospital admissions of individuals with diabetes.

diabetic <- read_csv("diabetic_data.csv") %>%
  select(primary = diag_1, secondary = diag_2) %>% # only using primary and secondary diagnosis for this exercise
  gather(primary, secondary, key = "diagnosis", value = "code") # longer tidy format

Exploring and cleaning the data

What format are the codes in?

The codes are formatted in the decimal form.

diabetic %>%
  select(diagnosis) %>%
  str_detect(".")
## [1] TRUE

Are there NA values?

There are no NA values.

diabetic %>%
  map_dbl(~sum(is.na(.x)))
## diagnosis      code
##         0         0

However, by visually inspecting the dataset, we can see observations recorded as "?", which suggests unknown or missing values. We'll coerce "?" values into NA.

diabetic <- diabetic %>%
  mutate(code = ifelse(code == "?", NA, code))

Providing the disease name

The codes make the encoding of diseases more convenient but render them less comprehensible. We will extract the names of the diseases from major, the disease types from sub-chapter and the disease classes from chapter.

Converting into short format

The ICD dictionary code is in the short form while the code in the dataset is in the decimal form. I will need to convert the format of code in the dataset from the decimal form to the short form.

diabetic <- diabetic %>%
  mutate(code = decimal_to_short(code))

Extracting the names

# shorten `chapter` name to range of `three_digit` identifier
icd9cm_hierarchy$chapter <- fct_recode(
  icd9cm_hierarchy$chapter,
  `001-139` = "Infectious And Parasitic Diseases",
  `140-239` = "Neoplasms",
  `240-279` = "Endocrine, Nutritional And Metabolic Diseases, And Immunity Disorders",
  `280-289` = "Diseases Of The Blood And Blood-Forming Organs",
  `290-319` = "Mental Disorders",
  `320-389` = "Diseases Of The Nervous System And Sense Organs",
  `390-459` = "Diseases Of The Circulatory System",
  `460-519` = "Diseases Of The Respiratory System",
  `520-579` = "Diseases Of The Digestive System",
  `580-629` = "Diseases Of The Genitourinary System",
  `630-679` = "Complications Of Pregnancy, Childbirth, And The Puerperium",
  `680-709` = "Diseases Of The Skin And Subcutaneous Tissue",
  `710-739` = "Diseases Of The Musculoskeletal System And Connective Tissue",
  `740-759` = "Congenital Anomalies",
  `760-779` = "Certain Conditions Originating In The Perinatal Period",
  `780-799` = "Symptoms, Signs, And Ill-Defined Conditions",
  `800-999` = "Injury And Poisoning",
  `V01-V91` = "Supplementary Classification Of Factors Influencing Health Status And Contact With Health Services",
  `E000-E999` = "Supplementary Classification Of External Causes Of Injury And Poisoning")

# merge dataset with ICD dictionary to extract disease names, types, classes
diabetic_names <- left_join(diabetic, icd9cm_hierarchy, by = c("code" = "code")) %>% # making the arg explicit
  select(diagnosis, disease_name = major, disease_type = sub_chapter, disease_class = chapter)

head(diabetic_names, 10)
## # A tibble: 10 x 4
##    diagnosis disease_name                disease_type         disease_class
##
##  1 primary   Diabetes mellitus           Diseases Of Other E~ 240-279
##  2 primary   Disorders of fluid, electr~ Other Metabolic And~ 240-279
##  3 primary   Other current conditions i~ Complications Mainl~ 630-679
##  4 primary   Intestinal infections due ~ Intestinal Infectio~ 001-139
##  5 primary   Secondary malignant neopla~ Malignant Neoplasm ~ 140-239
##  6 primary   Other forms of chronic isc~ Ischemic Heart Dise~ 390-459
##  7 primary   Other forms of chronic isc~ Ischemic Heart Dise~ 390-459
##  8 primary   Heart failure               Other Forms Of Hear~ 390-459
##  9 primary   Other rheumatic heart dise~ Chronic Rheumatic H~ 390-459
## 10 primary   Occlusion of cerebral arte~ Cerebrovascular Dis~ 390-459

Summary of Diagnosis

Disease names

The most common disease name for the primary diagnosis is diabetes, which is not surprising given that the dataset is about individuals with diabetes. The most common class of disease is cardiovascular (390-459), which relates to the heart and the blood circulatory system.

# top 20 primary diagnosis
diabetic_names %>%
  filter(diagnosis == "primary") %>%
  count(disease_name, disease_class, sort = T) %>%
  top_n(20) %>%
  mutate(disease_name = fct_reorder(disease_name, n)) %>%
  ggplot(aes(disease_name, n, fill = disease_class)) +
  geom_col() +
  coord_flip() +
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(title = "Disease Class", ncol = 5)) + # legend based on aes fill, split into 5 columns as the legend runs off the page; change legend title
  labs(x = "", y = "",
       title = "Top 20 Disease Names for Primary \n Diagnosis",
       subtitle = "disease name refers to ICD major, disease \n class refers to ICD chapter ") +
  scale_fill_brewer(palette = "Set3")

Similarly, the most common disease for the secondary diagnosis is diabetes and the most common disease class is cardiovascular. However, the number of disease classes for the secondary diagnosis is smaller than for the primary diagnosis.

diabetic_names %>%
  filter(diagnosis == "secondary") %>%
  count(disease_name, disease_class, sort = T) %>%
  top_n(20) %>%
  mutate(disease_name = fct_reorder(disease_name, n)) %>%
  ggplot(aes(disease_name, n, fill = disease_class)) +
  geom_col() +
  coord_flip() +
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(title = "Disease Class", ncol = 5)) +
  labs(x = "", y = "",
       title = "Top 20 Disease Names for Secondary \n Diagnosis ",
       subtitle = "disease name refers to ICD major, disease class \n refers to ICD chapter") +
  scale_fill_brewer(palette = "Set3")

Disease types

The disease type for diabetes is “Diseases of Other Endocrine Glands” and knowing that diabetes is the most common disease name for primary diagnosis, let’s see if “Diseases of Other Endocrine Glands” will also be the most common disease type.

diabetic_names %>%
  filter(diagnosis == "primary") %>%
  count(disease_type, disease_class, sort = T) %>%
  top_n(20) %>%
  mutate(disease_type = fct_reorder(disease_type, n)) %>%
  ggplot(aes(disease_type, n, fill = disease_class)) +
  geom_col() +
  coord_flip() +
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(title = "Disease Class", ncol = 4)) +
  labs(x = "", y = "",
       title = "Top 20 Types of Diseases for \n Primary Diagnosis",
       subtitle = "disease type refers to ICD sub-chapter \n and disease class refers ICD chapter") +
  scale_fill_brewer(palette = "Set3")

When we collapse the disease names for the primary diagnosis to their superordinate disease types, the most common disease type is "Ischemic Heart Diseases", though "Diseases of Other Endocrine Glands" is the third most common disease type.

Let’s see if this is the same for secondary diagnosis.

diabetic_names %>%
  filter(diagnosis == "secondary") %>%
  count(disease_type, disease_class, sort = T) %>%
  top_n(20) %>%
  mutate(disease_type = fct_reorder(disease_type, n)) %>%
  ggplot(aes(disease_type, n, fill = disease_class)) +
  geom_col() +
  coord_flip() +
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(title = "Disease Class", ncol = 5)) +
  labs(x = "", y = "",
       title = "Top 20 Types of Diseases \n for Secondary Diagnosis",
       subtitle = "disease type refers to ICD \n sub-chapter and disease class \n refers ICD chapter") +
  scale_fill_brewer(palette = "Set3")

“Diseases of Other Endocrine Glands” is still not the most common disease type though it moved up a spot. “Ischemic Heart Diseases” is now the 5th most common disease type.

To sum up

In this post, we learned about the International Classification of Diseases which is an invaluable reference for various stakeholders in healthcare to have a uniform code for illnesses. The icd package was introduced to aid in the processing of datasets with ICD codes.



Easily explore your data using the summarytools package

Sat, 05/11/2019 - 02:00

(This article was first published on Anindya Mozumdar, and kindly contributed to R-bloggers)

Whenever we start working with data with which we are not familiar, our first step is usually some kind of exploratory data analysis. We may look at the structure of the data using the str function, or use a tool like the RStudio Viewer to examine the data. We might also use the base R function summary or the describe function from the Hmisc package to obtain summary statistics on our data.

In this article, we will look at the summarytools package, which provides a few functions to neatly summarize your data. This is especially useful in a Shiny application or an RMarkdown report, as you can directly render the summary outputs to HTML and then display them in the application or report. In fact, I discovered this package while playing around with the Shiny application radiant.

The four main functions in this package are dfSummary, freq, descr and ctable. The print.summarytools function is used to print summarytools objects with the appropriate rendering method.

dfSummary is used to summarize an entire data frame; descriptive statistics, along with plots showing the distribution of the data, are automatically produced by the function. Let us apply it on the tobacco dataset which is included as part of this package.

print(dfSummary(tobacco), method = "render")

Data Frame Summary: tobacco
Dimensions: 1000 x 9
Duplicates: 2

  1. gender [factor]. Values: F 489 (50.0%), M 489 (50.0%). Valid: 978 (97.8%), Missing: 22 (2.2%)
  2. age [numeric]. Mean (sd): 49.6 (18.3); min < med < max: 18 < 50 < 80; IQR (CV): 32 (0.4); 63 distinct values. Valid: 975 (97.5%), Missing: 25 (2.5%)
  3. age.gr [factor]. Values: 18-34 258 (26.5%), 35-50 241 (24.7%), 51-70 317 (32.5%), 71+ 159 (16.3%). Valid: 975 (97.5%), Missing: 25 (2.5%)
  4. BMI [numeric]. Mean (sd): 25.7 (4.5); min < med < max: 8.8 < 25.6 < 39.4; IQR (CV): 5.7 (0.2); 974 distinct values. Valid: 974 (97.4%), Missing: 26 (2.6%)
  5. smoker [factor]. Values: Yes 298 (29.8%), No 702 (70.2%). Valid: 1000 (100%), Missing: 0 (0%)
  6. cigs.per.day [numeric]. Mean (sd): 6.8 (11.9); min < med < max: 0 < 0 < 40; IQR (CV): 11 (1.8); 37 distinct values. Valid: 965 (96.5%), Missing: 35 (3.5%)
  7. diseased [factor]. Values: Yes 224 (22.4%), No 776 (77.6%). Valid: 1000 (100%), Missing: 0 (0%)
  8. disease [character]. Values: Hypertension 36 (16.2%), Cancer 34 (15.3%), Cholesterol 21 (9.5%), Heart 20 (9.0%), Pulmonary 20 (9.0%), Musculoskeletal 19 (8.6%), Diabetes 14 (6.3%), Hearing 14 (6.3%), Digestive 12 (5.4%), Hypotension 11 (5.0%), 3 others 21 (9.5%). Valid: 222 (22.2%), Missing: 778 (77.8%)
  9. samp.wgts [numeric]. Mean (sd): 1 (0.1); min < med < max: 0.9 < 1 < 1.1; IQR (CV): 0.2 (0.1); rounded values: 0.86 267 (26.7%), 1.04 249 (24.9%), 1.05 324 (32.4%), 1.06 160 (16.0%). Valid: 1000 (100%), Missing: 0 (0%)

Generated by summarytools 0.9.3 (R version 3.6.0), 2019-05-12

As can be seen in the output above, the type of variable is shown along with the variable name. This is followed by basic descriptive statistics for numeric variables or the values for categorical variables. A simple histogram or bar chart is plotted showing the distribution of the variable. Overall, it provides you with a concise summary of the variables in the data. The output can also be controlled using a vast number of arguments, but the defaults may be good enough when quickly exploring data.

To obtain more detailed statistics on numeric variables, use the descr function. In this example, we use the default ASCII text output.

descr(tobacco$BMI)
## Descriptive Statistics
## tobacco$BMI
## N: 1000
##
##                        BMI
## ----------------- --------
##              Mean    25.73
##           Std.Dev     4.49
##               Min     8.83
##                Q1    22.93
##            Median    25.62
##                Q3    28.65
##               Max    39.44
##               MAD     4.18
##               IQR     5.72
##                CV     0.17
##          Skewness     0.02
##       SE.Skewness     0.08
##          Kurtosis     0.26
##           N.Valid   974.00
##         Pct.Valid    97.40

One useful option supported by this function is the weights argument. So if your data has sample weights, they can directly be used. However, not all statistics will be adjusted using the weights; refer to the documentation for more details.

descr(tobacco$BMI, weights = tobacco$samp.wgts)
## Weighted Descriptive Statistics
## tobacco$BMI
## Weights: samp.wgts
## N: 1000
##
##                      BMI
## --------------- --------
##            Mean    25.83
##         Std.Dev     4.48
##             Min     8.83
##          Median    25.68
##             Max    39.44
##             MAD     4.13
##              CV     0.17
##         N.Valid   973.85
##       Pct.Valid    97.38

While descr is most useful for numeric variables, freq is most useful for categorical variables. This also supports the weights argument, and provides the frequency as well as the cumulative frequency for each distinct value.

freq(tobacco$disease, weights = tobacco$samp.wgts)
## Weighted Frequencies
## tobacco$disease
## Type: Character
## Weights: samp.wgts
##
##                          Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## --------------------- ------- --------- -------------- --------- --------------
##                Cancer   33.66     15.03          15.03      3.37           3.37
##           Cholesterol   21.11      9.42          24.45      2.11           5.48
##              Diabetes   14.36      6.41          30.87      1.44           6.91
##             Digestive   11.49      5.13          36.00      1.15           8.06
##               Hearing   14.15      6.32          42.31      1.41           9.48
##                 Heart   19.89      8.88          51.20      1.99          11.46
##          Hypertension   36.52     16.31          67.51      3.65          15.12
##           Hypotension   11.37      5.08          72.58      1.14          16.25
##       Musculoskeletal   19.59      8.75          81.33      1.96          18.21
##          Neurological   10.14      4.53          85.86      1.01          19.23
##                 Other    2.09      0.93          86.79      0.21          19.44
##             Pulmonary   20.29      9.06          95.85      2.03          21.46
##                Vision    9.29      4.15         100.00      0.93          22.39
##                  <NA>  776.07                              77.61         100.00
##                 Total 1000.00    100.00         100.00    100.00         100.00

Finally, ctable is used to cross-tabulate frequencies for a pair of categorical variables.

print(
  ctable(tobacco$disease, tobacco$gender, weights = tobacco$samp.wgts),
  method = "render"
)

Cross-Tabulation, Row Proportions: disease * gender

Data Frame: tobacco

                        gender
disease                 F               M               <NA>           Total
Cancer                  15.9 (47.2%)    17.8 (52.8%)    0.0 (0.0%)      33.7 (100.0%)
Cholesterol             10.3 (48.8%)    10.8 (51.2%)    0.0 (0.0%)      21.1 (100.0%)
Diabetes                 8.2 (57.3%)     5.3 (36.7%)    0.9 (6.0%)      14.4 (100.0%)
Digestive                4.7 (40.9%)     6.8 (59.1%)    0.0 (0.0%)      11.5 (100.0%)
Hearing                  5.1 (35.8%)     9.1 (64.2%)    0.0 (0.0%)      14.1 (100.0%)
Heart                    8.5 (42.8%)    11.4 (57.2%)    0.0 (0.0%)      19.9 (100.0%)
Hypertension            18.2 (49.7%)    17.3 (47.4%)    1.1 (2.9%)      36.5 (100.0%)
Hypotension              7.3 (64.6%)     4.0 (35.4%)    0.0 (0.0%)      11.4 (100.0%)
Musculoskeletal          8.2 (42.0%)    10.3 (52.6%)    1.1 (5.4%)      19.6 (100.0%)
Neurological             7.2 (70.7%)     3.0 (29.3%)    0.0 (0.0%)      10.1 (100.0%)
Other                    1.0 (50.1%)     1.0 (49.9%)    0.0 (0.0%)       2.1 (100.0%)
Pulmonary                9.3 (45.8%)    11.0 (54.2%)    0.0 (0.0%)      20.3 (100.0%)
Vision                   6.1 (66.0%)     3.2 (34.0%)    0.0 (0.0%)       9.3 (100.0%)
<NA>                   378.9 (48.8%)   378.0 (48.7%)   19.2 (2.5%)     776.1 (100.0%)
Total                  488.9 (48.9%)   488.9 (48.9%)   22.2 (2.2%)    1000.0 (100.0%)


It is often desirable to obtain the output of the descr or freq functions in a data frame. To do this, the package provides a function tb which can convert the output of the above functions into a tibble. If you are not aware, tibbles are the data frames of the tidyverse and are usually nicer to use than base R data frames.
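As a quick sketch (mine, not from the original article), and assuming a summarytools version that exports tb(), as the 0.9.3 release used here should:

library(summarytools)
tb(freq(tobacco$disease))  # frequency table as a tibble
tb(descr(tobacco$BMI))     # descriptive statistics as a tibble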

To conclude, whenever you start working with new data, consider generating an HTML report using dfSummary to quickly understand the variables in the data and their distributions. Use the other functions if you need to understand the individual variables further.



EARL speaker – highlights from the Mango team

Fri, 05/10/2019 - 17:14

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

While we're aiming to try and fit in as many EARL talks as we can, we know it's impossible to see them all! We've asked some of the Mango team to let us know who they're looking forward to seeing speak. First up is Alfie Smith, one of our Data Scientists – more team picks to follow!

Alfie Smith

Avision Ho’s “Why a Nobel Prize algorithm is not always optimal for business” looks to be an interesting presentation on the problems that come with translating academic research into commercial applications. As data science consultants, we have to be able to tell our clients the risks of rushing to the newest algorithm; particularly when every new research paper creates a hype-bubble.

I'm excited to hear "Promoting the use of R in the NHS – progress and challenges" by Professor Mohammed A Mohammed. As the brother of an NHS doctor, I've heard many stories of the NHS' dependence on archaic tech and the bottlenecks it creates. I'm fascinated to hear whether R is solving some of these problems and whether my R skills could be of value to the UK's health service.

Lastly, I’m very intrigued by Theo Boutaris’ “Deep Milk: The Quest of identifying Milk-Related Instagram Posts using Keras”. At EARL, we’re going to hear lots of stories of R solving huge business problems. However, it’s often the smaller, wackier, stories that I remember long after the event. I’m hoping Theo’s presentation will give me an anecdote to talk about at the next Bristol Data Science meet-up.

If any of these talks sound interesting, please take a look at who else is speaking – we also have early bird ticket prices available for a limited time.



Dealing with correlation in designed field experiments: part II

Fri, 05/10/2019 - 02:00

(This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers)

With field experiments, studying the correlation between the observed traits may not be an easy task. Indeed, in these experiments, subjects are not independent, but they are grouped by treatment factors (e.g., genotypes or weed control methods) or by blocking factors (e.g., blocks, plots, main-plots). I have dealt with this problem in a previous post and I gave a solution based on traditional methods of data analyses.

In a recent paper, Piepho (2018) proposed a more advanced solution based on mixed models. He presented four example datasets and gave SAS code to analyse them, based on PROC MIXED. I was very interested in those examples, but I do not use SAS. Therefore, I tried to 'transport' the models into R, which turned out to be a difficult task. After struggling for a while with several mixed model packages, I came to an acceptable solution, which I would like to share.

A ‘routine’ experiment

I will use the same example as presented in my previous post, which should allow us to compare the results with those obtained by using more traditional methods of data analyses. It is a genotype experiment, laid out in randomised complete blocks, with 27 wheat genotypes and three replicates. For each plot, the colleague who ran the experiment recorded several traits, including yield (Yield) and weight of thousand kernels (TKW). The dataset 'WheatQuality.csv' is available on 'gitHub'; it consists of 81 records (plots) and just as many pairs of measurements in all. The code below loads the necessary packages, the data and transforms the numeric variable 'Block' into a factor.

library(reshape2)
library(plyr)
library(nlme)
library(asreml)

dataset <- read.csv("https://raw.githubusercontent.com/OnofriAndreaPG/aomisc/master/data/WheatQuality.csv", header = T)
dataset$Block <- factor(dataset$Block)
head(dataset)
##     Genotype Block Height  TKW Yield Whectol
## 1 arcobaleno     1     90 44.5 64.40    83.2
## 2 arcobaleno     2     90 42.8 60.58    82.2
## 3 arcobaleno     3     88 42.7 59.42    83.1
## 4       baio     1     80 40.6 51.93    81.8
## 5       baio     2     75 42.7 51.34    81.3
## 6       baio     3     76 41.1 47.78    81.1

In my previous post, I used the above dataset to calculate the Pearson’s correlation coefficient between Yield and TKW for:

  1. plot measurements,
  2. residuals,
  3. treatment/block means.

Piepho (2018) showed that, for an experiment like this one, the above correlations can be estimated by coding a multiresponse mixed model, as follows:

\[ Y_{ijk} = \mu_i + \beta_{ik} + \tau_{ij} + \epsilon_{ijk}\]

where \(Y_{ijk}\) is the response for the trait \(i\), the genotype \(j\) and the block \(k\), \(\mu_i\) is the mean for the trait \(i\), \(\beta_{ik}\) is the effect of the block \(k\) and trait \(i\), \(\tau_{ij}\) is the effect of genotype \(j\) for the trait \(i\) and \(\epsilon_{ijk}\) is the residual for each of 81 observations for two traits.

In the above model, the residuals \(\epsilon_{ijk}\) need to be normally distributed and heteroscedastic, with trait-specific variances. Furthermore, residuals belonging to the same plot (the two observed traits) need to be correlated (correlation of errors).

Hans-Peter Piepho, in his paper, put forward the idea that the ‘genotype’ and ‘block’ effects for the two traits can be taken as random, even though they might be of fixed nature, especially the genotype effect. This idea makes sense, because, for this application, we are mainly interested in variances and covariances. Both random effects need to be heteroscedastic (trait-specific variance components) and there must be a correlation between the two traits.

To the best of my knowledge, there is no way to fit such a complex model with the 'nlme' package and related 'lme()' function (I'll give a hint later on, for a simpler model). Therefore, I decided to use the package 'asreml' (Butler et al., 2018), although this is not freeware. With the function 'asreml()', we need to specify the following components.

  1. The response variables. When we set a bivariate model with ‘asreml’, we can ‘cbind()’ Yield and TKW and use the name ‘trait’ to refer to them.
  2. The fixed model, that only contains the trait effect. The specification is, therefore, 'cbind(Yield, TKW) ~ trait - 1'. Following Piepho (2018), I removed the intercept, to separately estimate the means for the two traits.
  3. The random model, that is composed by the interactions ‘genotype x trait’ and ‘block x trait’. For both, I specified a general unstructured variance covariance matrix, so that the traits are heteroscedastic and correlated. Therefore, the random model is ~ Genotype:us(trait) + Block:us(trait).
  4. The residual structure, where the observations in the same plot (the term ‘units’ is used in ‘asreml’ to represent the observational units, i.e. the plots) are heteroscedastic and correlated.

The model call is:

mod.asreml <- asreml(cbind(Yield, TKW) ~ trait - 1,
                     random = ~ Genotype:us(trait, init = c(3.7, -0.25, 1.7)) +
                       Block:us(trait, init = c(77, 38, 53)),
                     residual = ~ units:us(trait, init = c(6, 0.16, 4.5)),
                     data = dataset)
## Model fitted using the sigma parameterization.
## ASReml 4.1.0 Fri May 10 17:33:27 2019
##          LogLik   Sigma2    DF      wall   cpu
##  1     -641.556      1.0   160  17:33:27   0.0  (3 restrained)
##  2     -548.767      1.0   160  17:33:27   0.0
##  3     -448.970      1.0   160  17:33:27   0.0
##  4     -376.952      1.0   160  17:33:27   0.0
##  5     -334.100      1.0   160  17:33:27   0.0
##  6     -317.511      1.0   160  17:33:27   0.0
##  7     -312.242      1.0   160  17:33:27   0.0
##  8     -311.145      1.0   160  17:33:27   0.0
##  9     -311.057      1.0   160  17:33:27   0.0
## 10     -311.056      1.0   160  17:33:27   0.0

summary(mod.asreml)$varcomp[,1:3]
##                                   component  std.error    z.ratio
## Block:trait!trait_Yield:Yield     3.7104778  3.9364268  0.9426005
## Block:trait!trait_TKW:Yield      -0.2428390  1.9074544 -0.1273105
## Block:trait!trait_TKW:TKW         1.6684568  1.8343662  0.9095549
## Genotype:trait!trait_Yield:Yield 77.6346623 22.0956257  3.5135761
## Genotype:trait!trait_TKW:Yield   38.8322972 15.0909109  2.5732242
## Genotype:trait!trait_TKW:TKW     53.8616088 15.3520661  3.5084274
## units:trait!R                     1.0000000         NA         NA
## units:trait!trait_Yield:Yield     6.0939037  1.1951128  5.0990195
## units:trait!trait_TKW:Yield       0.1635551  0.7242690  0.2258209
## units:trait!trait_TKW:TKW         4.4717901  0.8769902  5.0990195

The box above shows the results about the variance-covariance parameters. In order to get the correlations, I specified the necessary combinations of variance-covariance parameters. It is necessary to remember that estimates, in ‘asreml’, are named as V1, V2, … Vn, according to their ordering in model output.

parms <- mod.asreml$vparameters
vpredict(mod.asreml, rb ~ V2 / (sqrt(V1)*sqrt(V3)))
##       Estimate        SE
## rb -0.09759916 0.7571335
vpredict(mod.asreml, rt ~ V5 / (sqrt(V4)*sqrt(V6)))
##     Estimate       SE
## rt 0.6005174 0.130663
vpredict(mod.asreml, re ~ V9 / (sqrt(V8)*sqrt(V10)))
##      Estimate        SE
## re 0.03133109 0.1385389

We see that the estimates are very close to those obtained by using the Pearson’s correlation coefficients (see my previous post). The advantage of this mixed model solution is that we can also test hypotheses in a relatively reliable way. For example, I tested the hypothesis that residuals are not correlated by:

  1. coding a reduced model where residuals are heteroscedastic and independent, and
  2. comparing this reduced model with the complete model by way of a REML-based Likelihood Ratio Test (LRT).

Removing the correlation of residuals is easily done, by changing the correlation structure from ‘us’ (unstructured variance-covariance matrix) to ‘idh’ (diagonal variance-covariance matrix).

mod.asreml2 <- asreml(cbind(Yield, TKW) ~ trait - 1,
                      random = ~ Genotype:us(trait) + Block:us(trait),
                      residual = ~ units:idh(trait),
                      data = dataset)
## Model fitted using the sigma parameterization.
## ASReml 4.1.0 Fri May 10 17:33:28 2019
##          LogLik   Sigma2    DF      wall   cpu
##  1     -398.023      1.0   160  17:33:28   0.0  (2 restrained)
##  2     -383.859      1.0   160  17:33:28   0.0
##  3     -344.687      1.0   160  17:33:28   0.0
##  4     -321.489      1.0   160  17:33:28   0.0
##  5     -312.488      1.0   160  17:33:28   0.0
##  6     -311.167      1.0   160  17:33:28   0.0
##  7     -311.083      1.0   160  17:33:28   0.0
##  8     -311.082      1.0   160  17:33:28   0.0

lrt.asreml(mod.asreml, mod.asreml2)
## Likelihood ratio test(s) assuming nested random models.
## (See Self & Liang, 1987)
##
##                        df LR-statistic Pr(Chisq)
## mod.asreml/mod.asreml2  1      0.05107    0.4106

Likewise, I tried to further reduce the model to test the significance of the correlation between block means and genotype means.

mod.asreml3 <- asreml(cbind(Yield, TKW) ~ trait - 1,
                      random = ~ Genotype:us(trait) + Block:idh(trait),
                      residual = ~ units:idh(trait),
                      data = dataset)
## Model fitted using the sigma parameterization.
## ASReml 4.1.0 Fri May 10 17:33:29 2019
##          LogLik   Sigma2    DF      wall   cpu
##  1     -398.027      1.0   160  17:33:29   0.0  (2 restrained)
##  2     -383.866      1.0   160  17:33:29   0.0
##  3     -344.694      1.0   160  17:33:29   0.0
##  4     -321.497      1.0   160  17:33:29   0.0
##  5     -312.496      1.0   160  17:33:29   0.0
##  6     -311.175      1.0   160  17:33:29   0.0
##  7     -311.090      1.0   160  17:33:29   0.0
##  8     -311.090      1.0   160  17:33:29   0.0

lrt.asreml(mod.asreml, mod.asreml3)
## Likelihood ratio test(s) assuming nested random models.
## (See Self & Liang, 1987)
##
##                        df LR-statistic Pr(Chisq)
## mod.asreml/mod.asreml3  2     0.066663    0.6399

mod.asreml4 <- asreml(cbind(Yield, TKW) ~ trait - 1,
                      random = ~ Genotype:idh(trait) + Block:idh(trait),
                      residual = ~ units:idh(trait),
                      data = dataset)
## Model fitted using the sigma parameterization.
## ASReml 4.1.0 Fri May 10 17:33:30 2019
##          LogLik   Sigma2    DF      wall   cpu
##  1     -406.458      1.0   160  17:33:30   0.0  (2 restrained)
##  2     -394.578      1.0   160  17:33:30   0.0
##  3     -352.769      1.0   160  17:33:30   0.0
##  4     -327.804      1.0   160  17:33:30   0.0
##  5     -318.007      1.0   160  17:33:30   0.0
##  6     -316.616      1.0   160  17:33:30   0.0
##  7     -316.549      1.0   160  17:33:30   0.0
##  8     -316.549      1.0   160  17:33:30   0.0

lrt.asreml(mod.asreml, mod.asreml4)
## Likelihood ratio test(s) assuming nested random models.
## (See Self & Liang, 1987)
##
##                        df LR-statistic Pr(Chisq)
## mod.asreml/mod.asreml4  3       10.986  0.003364 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that only the genotype means are significantly correlated.

An alternative (and more useful) way to code the same model is by using the 'corgh' structure, instead of 'us'. These two structures are equivalent, apart from the fact that the former uses correlations instead of covariances. Another difference, which we should consider when giving starting values, is that correlations come before variances.

mod.asreml.r <- asreml(cbind(Yield, TKW) ~ trait - 1,
                       random = ~ Genotype:corgh(trait, init = c(-0.1, 3.8, 1.8)) +
                         Block:corgh(trait, init = c(0.6, 77, 53)),
                       residual = ~ units:corgh(trait, init = c(0.03, 6, 4.5)),
                       data = dataset)
## Model fitted using the sigma parameterization.
## ASReml 4.1.0 Fri May 10 17:33:31 2019
##          LogLik   Sigma2    DF      wall   cpu
##  1     -632.445      1.0   160  17:33:31   0.0  (3 restrained)
##  2     -539.383      1.0   160  17:33:31   0.0  (2 restrained)
##  3     -468.810      1.0   160  17:33:31   0.0  (1 restrained)
##  4     -422.408      1.0   160  17:33:31   0.0
##  5     -371.304      1.0   160  17:33:31   0.0
##  6     -336.191      1.0   160  17:33:31   0.0
##  7     -317.547      1.0   160  17:33:31   0.0
##  8     -312.105      1.0   160  17:33:31   0.0
##  9     -311.118      1.0   160  17:33:31   0.0
## 10     -311.057      1.0   160  17:33:31   0.0
## 11     -311.056      1.0   160  17:33:31   0.0

summary(mod.asreml.r)$varcomp[,1:2]
##                                             component  std.error
## Block:trait!trait!TKW:!trait!Yield.cor    -0.09759916  0.7571335
## Block:trait!trait_Yield                    3.71047783  3.9364268
## Block:trait!trait_TKW                      1.66845679  1.8343662
## Genotype:trait!trait!TKW:!trait!Yield.cor  0.60051740  0.1306748
## Genotype:trait!trait_Yield                77.63466334 22.0981962
## Genotype:trait!trait_TKW                  53.86160957 15.3536792
## units:trait!R                              1.00000000         NA
## units:trait!trait!TKW:!trait!Yield.cor     0.03133109  0.1385389
## units:trait!trait_Yield                    6.09390366  1.1951128
## units:trait!trait_TKW                      4.47179012  0.8769902

The advantage of this parameterisation is that we can test our hypotheses by easily setting up constraints on correlations. One way to do this is to run the model with the argument 'start.values = T'. In this way, I could derive a data frame ('mod.init$vparameters') with the starting values for REML maximisation.

# Getting the starting values
mod.init <- asreml(cbind(Yield, TKW) ~ trait - 1,
                   random = ~ Genotype:corgh(trait, init = c(-0.1, 3.8, 1.8)) +
                     Block:corgh(trait, init = c(0.6, 77, 53)),
                   residual = ~ units:corgh(trait, init = c(0.03, 6, 4.5)),
                   data = dataset, start.values = T)
init <- mod.init$vparameters
init
##                                    Component Value Constraint
## 1  Genotype:trait!trait!TKW:!trait!Yield.cor -0.10          U
## 2                 Genotype:trait!trait_Yield  3.80          P
## 3                   Genotype:trait!trait_TKW  1.80          P
## 4     Block:trait!trait!TKW:!trait!Yield.cor  0.60          U
## 5                    Block:trait!trait_Yield 77.00          P
## 6                      Block:trait!trait_TKW 53.00          P
## 7                              units:trait!R  1.00          F
## 8     units:trait!trait!TKW:!trait!Yield.cor  0.03          U
## 9                    units:trait!trait_Yield  6.00          P
## 10                     units:trait!trait_TKW  4.50          P

We see that the 'init' data frame has three columns: (i) names of parameters, (ii) initial values and (iii) type of constraint (U = unconstrained, P = positive, F = fixed). Now, if we take the eighth row (correlation of residuals), set the initial value to 0 and set the third column to 'F' (meaning: keep the initial value fixed), we are ready to fit a model without correlation of residuals (same as the 'model.asreml2' above). What I had to do was just to pass this data frame as the starting value matrix for a new model fit (see the argument 'R.param', below).

init2 <- init
init2[8, 2] <- 0
init2[8, 3] <- "F"
mod.asreml.r2 <- asreml(cbind(Yield, TKW) ~ trait - 1,
                        random = ~ Genotype:corgh(trait) + Block:corgh(trait),
                        residual = ~ units:corgh(trait),
                        data = dataset, R.param = init2)
## Model fitted using the sigma parameterization.
## ASReml 4.1.0 Fri May 10 17:33:33 2019
##          LogLik   Sigma2    DF      wall   cpu
##  1    -1136.066      1.0   160  17:33:33   0.0  (1 restrained)
##  2     -939.365      1.0   160  17:33:33   0.0  (1 restrained)
##  3     -719.371      1.0   160  17:33:33   0.0  (1 restrained)
##  4     -550.513      1.0   160  17:33:33   0.0
##  5     -427.355      1.0   160  17:33:33   0.0
##  6     -353.105      1.0   160  17:33:33   0.0
##  7     -323.421      1.0   160  17:33:33   0.0
##  8     -313.616      1.0   160  17:33:33   0.0
##  9     -311.338      1.0   160  17:33:33   0.0
## 10     -311.087      1.0   160  17:33:33   0.0
## 11     -311.082      1.0   160  17:33:33   0.0

summary(mod.asreml.r2)$varcomp[,1:2]
##                                             component  std.error
## Block:trait!trait!TKW:!trait!Yield.cor    -0.09516456  0.7572040
## Block:trait!trait_Yield                    3.71047783  3.9364268
## Block:trait!trait_TKW                      1.66845679  1.8343662
## Genotype:trait!trait!TKW:!trait!Yield.cor  0.60136047  0.1305180
## Genotype:trait!trait_Yield                77.63463977 22.0809306
## Genotype:trait!trait_TKW                  53.86159936 15.3451466
## units:trait!R                              1.00000000         NA
## units:trait!trait!TKW:!trait!Yield.cor     0.00000000         NA
## units:trait!trait_Yield                    6.09390366  1.1951128
## units:trait!trait_TKW                      4.47179012  0.8769902

lrt.asreml(mod.asreml.r2, mod.asreml.r)
## Likelihood ratio test(s) assuming nested random models.
## (See Self & Liang, 1987)
##
##                            df LR-statistic Pr(Chisq)
## mod.asreml.r/mod.asreml.r2  1     0.051075    0.4106

What is even more interesting is that 'asreml' allows us to force the parameters to be linear combinations of one another. For instance, we can code a model where the residual correlation is constrained to be equal to the treatment correlation. To do so, we have to set up a two-column matrix (M), with row names matching the component names in the 'asreml' parameter vector. The matrix M should contain:

  1. in the first column, the equality relationships (same number, same value)
  2. in the second column, the coefficients for the multiplicative relationships

In this case, we would set the matrix M as follows:

firstCol <- c(1, 2, 3, 4, 5, 6, 7, 1, 8, 9)
secCol <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
M <- cbind(firstCol, secCol)
dimnames(M)[[1]] <- mod.init$vparameters$Component
M
##                                           firstCol secCol
## Genotype:trait!trait!TKW:!trait!Yield.cor        1      1
## Genotype:trait!trait_Yield                       2      1
## Genotype:trait!trait_TKW                         3      1
## Block:trait!trait!TKW:!trait!Yield.cor           4      1
## Block:trait!trait_Yield                          5      1
## Block:trait!trait_TKW                            6      1
## units:trait!R                                    7      1
## units:trait!trait!TKW:!trait!Yield.cor           1      1
## units:trait!trait_Yield                          8      1
## units:trait!trait_TKW                            9      1

Please note that in 'firstCol', the 1st and 8th elements are both equal to 1, which constrains them to assume the same value. We can now pass the matrix M as the value of the 'vcc' argument in the model call.

mod.asreml.r3 <- asreml(cbind(Yield, TKW) ~ trait - 1,
                        random = ~ Genotype:corgh(trait) + Block:corgh(trait),
                        residual = ~ units:corgh(trait),
                        data = dataset, R.param = init, vcc = M)
## Model fitted using the sigma parameterization.
## ASReml 4.1.0 Fri May 10 17:33:34 2019
##          LogLik   Sigma2    DF      wall   cpu
##  1    -1122.762      1.0   160  17:33:34   0.0  (1 restrained)
##  2     -900.308      1.0   160  17:33:34   0.0
##  3     -665.864      1.0   160  17:33:34   0.0
##  4     -492.020      1.0   160  17:33:34   0.0
##  5     -383.085      1.0   160  17:33:34   0.0
##  6     -336.519      1.0   160  17:33:34   0.0  (1 restrained)
##  7     -319.561      1.0   160  17:33:34   0.0
##  8     -315.115      1.0   160  17:33:34   0.0
##  9     -314.540      1.0   160  17:33:34   0.0
## 10     -314.523      1.0   160  17:33:34   0.0

summary(mod.asreml.r3)$varcomp[,1:3]
##                                            component  std.error    z.ratio
## Block:trait!trait!TKW:!trait!Yield.cor    -0.1146908  0.7592678 -0.1510545
## Block:trait!trait_Yield                    3.6991785  3.9364622  0.9397216
## Block:trait!trait_TKW                      1.6601799  1.8344090  0.9050217
## Genotype:trait!trait!TKW:!trait!Yield.cor  0.2336876  0.1117699  2.0907919
## Genotype:trait!trait_Yield                70.5970531 19.9293722  3.5423621
## Genotype:trait!trait_TKW                  48.9763800 13.8464106  3.5371174
## units:trait!R                              1.0000000         NA         NA
## units:trait!trait!TKW:!trait!Yield.cor     0.2336876  0.1117699  2.0907919
## units:trait!trait_Yield                    6.3989855  1.2811965  4.9945387
## units:trait!trait_TKW                      4.6952670  0.9400807  4.9945358

lrt.asreml(mod.asreml.r3, mod.asreml.r)
## Likelihood ratio test(s) assuming nested random models.
## (See Self & Liang, 1987)
##
##                            df LR-statistic Pr(Chisq)
## mod.asreml.r/mod.asreml.r3  1       6.9336   0.00423 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output, we see that the residual and treatment correlations are equal in this latter model. We also see that this reduced model fits significantly worse than the complete model ‘mod.asreml.r’.

Going freeware

Considering that the block means are not correlated, if we were willing to take the ‘block’ effect as fixed, we could also fit the resulting model with the ‘nlme’ package and the function ‘lme()’ (Pinheiro and Bates, 2000). However, we need to cast the model as a univariate model.

To this aim, the two variables (Yield and TKW) need to be stacked on top of each other, and a new factor (‘Trait’) needs to be introduced to identify the observations for the two traits. Another factor is also necessary to identify the different plots, i.e. the observational units. To perform such a restructuring, I used the ‘melt()’ function in the ‘reshape2’ package (Wickham, 2007) and assigned the name ‘Y’ to the response variable, which is now composed of the two original variables Yield and TKW, one on top of the other.

dataset$Plot <- 1:81 mdataset <- melt(dataset[,c(-3,-6)], variable.name= "Trait", value.name="Y", id=c("Genotype", "Block", "Plot")) mdataset$Block <- factor(mdataset$Block) head(mdataset) ## Genotype Block Plot Trait Y ## 1 arcobaleno 1 1 TKW 44.5 ## 2 arcobaleno 2 2 TKW 42.8 ## 3 arcobaleno 3 3 TKW 42.7 ## 4 baio 1 4 TKW 40.6 ## 5 baio 2 5 TKW 42.7 ## 6 baio 3 6 TKW 41.1 tail(mdataset) ## Genotype Block Plot Trait Y ## 157 vesuvio 1 76 Yield 54.37 ## 158 vesuvio 2 77 Yield 55.02 ## 159 vesuvio 3 78 Yield 53.41 ## 160 vitromax 1 79 Yield 54.39 ## 161 vitromax 2 80 Yield 50.97 ## 162 vitromax 3 81 Yield 48.83

The fixed model is:

Y ~ Trait*Block

In order to introduce a totally unstructured variance-covariance matrix for the random effect, I used the ‘pdMat’ construct:

random = list(Genotype = pdSymm(~Trait - 1))

Relating to the residuals, heteroscedasticity can be included by using the ‘weight()’ argument and the ‘varIdent’ variance function, which allows a different standard deviation per trait:

weight = varIdent(form = ~1|Trait)

I also allowed the residuals to be correlated within each plot, by using the ‘correlation’ argument and specifying a general ‘corSymm()’ correlation form. Plots are nested within genotypes, thus I used a nesting operator, as follows:

correlation = corSymm(form = ~1|Genotype/Plot)

The final model call is:

mod <- lme(Y ~ Trait*Block, random = list(Genotype = pdSymm(~Trait - 1)), weight = varIdent(form=~1|Trait), correlation = corCompSymm(form=~1|Genotype/Plot), data = mdataset)

Retrieving the variance-covariance matrices needs some effort, as the function ‘getVarCov()’ does not appear to work properly with this multistratum model. First of all, we can retrieve the variance-covariance matrix for the genotype random effect (G) and the corresponding correlation matrix.

#Variance structure for random effects obj <- mod G <- matrix( as.numeric(getVarCov(obj)), 2, 2 ) G ## [,1] [,2] ## [1,] 53.86081 38.83246 ## [2,] 38.83246 77.63485 cov2cor(G) ## [,1] [,2] ## [1,] 1.0000000 0.6005237 ## [2,] 0.6005237 1.0000000

Second, we can retrieve the ‘conditional’ variance-covariance matrix (R), which describes the correlation of errors:

#Conditional variance-covariance matrix (residual error) V <- corMatrix(obj$modelStruct$corStruct)[[1]] #Correlation for residuals sds <- 1/varWeights(obj$modelStruct$varStruct)[1:2] sds <- obj$sigma * sds #Standard deviations for residuals (one per trait) R <- t(V * sds) * sds #Going from correlation to covariance R ## [,1] [,2] ## [1,] 4.4718234 0.1635375 ## [2,] 0.1635375 6.0939251 cov2cor(R) ## [,1] [,2] ## [1,] 1.00000000 0.03132756 ## [2,] 0.03132756 1.00000000

The total covariance matrix is simply obtained as the sum of G and R, from which the total correlation matrix follows:

Tr <- G + R cov2cor(Tr) ## [,1] [,2] ## [1,] 1.0000000 0.5579906 ## [2,] 0.5579906 1.0000000

We see that the results are the same as those obtained by using ‘asreml’. Hope these snippets are useful!

References
  • Butler, D., Cullis, B.R., Gilmour, A., Gogel, B., Thomson, R., 2018. ASReml-r reference manual – version 4. UK.
  • Piepho, H.-P., 2018. Allowing for the structure of a designed experiment when estimating and testing trait correlations. The Journal of Agricultural Science 156, 59–70.
  • Pinheiro, J.C., Bates, D.M., 2000. Mixed-effects models in s and s-plus. Springer-Verlag Inc., New York.
  • Wickham, H., 2007. Reshaping data with the reshape package. Journal of Statistical Software 21.


easy Riddler

Fri, 05/10/2019 - 00:19

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

The riddle of the week is rather standard probability calculus

If N points are generated at random places on the perimeter of a circle, what is the probability that you can pick a diameter such that all of those points are on only one side of the newly halved circle?

Since the event is equivalent to the range of N Uniform variates being less than ½, and since the range of N Uniform variates is distributed as a Be(N-1, 2) random variate, the resulting probability, which happens to be exactly , is decreasing exponentially, as shown below…
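
As a quick check, here is a minimal Monte Carlo sketch of mine (not from the original post) that simulates the event directly: the N points all fit in a common semicircle exactly when some gap between consecutive points, going around the circle, is at least half a turn.

# Minimal simulation sketch: estimated probability that N uniform points
# on a circle all fall within some semicircle
semicircle_prob <- function(N, nsim = 1e4) {
  mean(replicate(nsim, {
    u <- sort(runif(N))            # angles as fractions of a full turn
    gaps <- diff(c(u, u[1] + 1))   # gaps between consecutive points, wrapping around
    max(gaps) >= 0.5               # a free gap of half a turn <=> all points in a semicircle
  }))
}
round(sapply(2:8, semicircle_prob), 3)  # estimates decay roughly geometrically in N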



What’s new in R 3.6.0

Thu, 05/09/2019 - 23:08

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

A major update to the open-source R language, R 3.6.0, was released on April 26 and is now available for download for Windows, Mac and Linux. As a major update, it has many new features, user-visible changes and bug fixes. You can read the details in the release announcement, and in this blog post I'll highlight the most significant ones.

Changes to random number generation. R 3.6.0 changes the method used to generate random integers in the sample function. In prior versions, the probability of generating each integer could vary from equal by up to 0.04% (or possibly more if generating more than a million different integers). This change will mainly be relevant to people who do large-scale simulations, and it also means that scripts using the sample function will generate different results in R 3.6.0 than they did in prior versions of R. If you need to keep the results the same (for reproducibility or for automated testing), you can revert to the old behavior by adding RNGkind(sample.kind="Rounding") to the top of your script.
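
For instance (a small illustration of my own, not from the original post), the switch can be made and undone within a script:

RNGkind(sample.kind = "Rounding")   # revert to the pre-3.6.0 sampling algorithm (emits a warning)
set.seed(123)
sample(10, 3)                       # now matches results from R <= 3.5.x
RNGkind(sample.kind = "Rejection")  # back to the new default when done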

Changes to R's serialization format. If you want to save R data to disk with R 3.6.0 and read it back with R version 3.4.4 or earlier, you'll need to use saveRDS(..., version=2) to save your data in the old serialization format, since the default has been updated to Version 3 for this release. (The same applies to the functions save, serialize, and byte-compiled R code.)  The R 3.5 series had forwards-compatibility in mind, and can already read data serialized in the Version 3 format.
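
As a minimal sketch (the object and file names here are placeholders of mine), the old format is requested explicitly at save time, while reading back needs no special argument:

saveRDS(mtcars, "mydata.rds", version = 2)         # readable by R 3.4.4 and earlier
save(mtcars, file = "mydata.RData", version = 2)   # same idea for save()
old <- readRDS("mydata.rds")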

Improvements to base graphics. You now have more options for the appearance of axis labels (and perpendicular labels no longer overlap), better control over text positioning, a formula specification for barplots, and color palettes with better visual perception.
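
Two quick illustrations of my own (not from the release notes): barplot() now accepts a formula, and hcl.colors() provides the new perceptually based palettes:

barplot(GNP ~ Year, data = longley,
        col = hcl.colors(nrow(longley)),  # new HCL-based palettes (default "viridis")
        las = 2)                          # perpendicular axis labels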

Improvements to package installation and loading, which should eliminate problems with partially-installed packages and reduce the space required for installed packages.

More functions now support vectors with more than 2 billion elements, including which and pmin/pmax.

Various speed improvements to functions including outer, substring, stopifnot, and the $ operator for data frames.

Improvements to statistical functions, including standard errors for T tests and better influence measures for multivariate models.

R now uses less memory, thanks to improvements to functions like drop, unclass and seq to use the ALTREP system to avoid duplicating data.

More control over R's memory usage, including the ability to limit the amount of memory R will use for data. (Attempting to exceed this limit will generate a "cannot allocate memory" error.) This is particularly useful when R is being used in production, to limit the impact of a wayward R process on other applications in a shared system.

Better consistency between platforms. This has been an ongoing process, but R now has fewer instances of functions (or function arguments) that are only available on limited platforms (e.g. on Windows but not on Linux). The documentation is now more consistent between platforms, too. This should mean fewer instances of finding R code that doesn't run on your machine because it was written on a different platform.

There are many other specialized changes as well, which you can find in the release notes. Other than the issues raised above, most code from prior versions of R should run fine in R 3.6.0, but you will need to re-install any R packages you use, as they won't carry over from your R 3.5.x installation. Now that a couple of weeks have passed since the release, most packages should be readily available on CRAN for R 3.6.0.

As R enters its 20th year of continuous stable releases, please do take a moment to reflect on the ongoing commitment of the R Core Team for their diligence in improving the R engine multiple times a year. Thank you to everyone who has contributed. If you'd like to support the R project, here's some information on contributing to the R Foundation.

Final note: the code name for R 3.6.0 is "Planting of a Tree". The R code-names generally refer to Peanuts comics: if anyone can identify what the R 3.6.0 codename is referring to, please let us know in the comments!

R-announce mailing list: R 3.6.0 is released



Introduction to statistics with origami (and R examples)

Thu, 05/09/2019 - 21:00

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

I have written a small booklet (about 120 pages, A5 size) on statistics and origami.
It is a very simple, basic book, useful for introductory courses.
There are several R examples and the R script for each of them.

You can download the files for free from:
http://www.mariocigada.com/mariodocs/origanova_e.html

In the .zip file you will find:
the text in .pdf format
a colour cover
a (small) dataset
a .R file with all the script

Here is a link for downloading the book directly, I hope someone can find it useful:

Link for downloading the book: origanova2en

ciao
m



Analysing Raster Data: the rasterList Package

Thu, 05/09/2019 - 18:19

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

The rasterList package has been developed to create an S4 class object that can manage complex operations and objects in a spatially gridded map, i.e. a raster map. The RasterList-class (Cordano 2017b) object differs from a traditional raster map in that each cell can contain a generic object instead of numeric values on one or more bands or layers. The raster of objects, i.e. the RasterList-class object, consists of a RasterLayer-class (Hijmans 2016) object connected to a generic list: one list element for each raster cell. It allows complex map algebra to be written in a few lines of R code. Furthermore, in accordance with the existing classes defined in the R environment, each cell of the raster is “occupied” by an instantiated object. This may be very useful in many applications, e.g. environmental modelling.

Analysing Raster Data: an introduction

Nowadays the availability of spatially gridded coverages, often obtained from distributed biophysical/environmental models or remotely sensed observations, e.g. CHIRPS rainfall (Funk et al. 2015), means that routine time-series analysis and statistics must be replicated in each pixel or cell of the spatially gridded domain (raster). Based on the raster package (Hijmans 2016), an S4 class has been created so that complex operations can be executed, and specific R objects (e.g. S3 or S4) stored, in each cell of a raster map. The raster of objects contains the traditional raster map with the addition of a list of generic objects: one object for each raster cell. In this way, different kinds of objects, e.g. the probability distribution of the mapped random variable or htest, lm, glm or function class objects (Chambers 1992; Hastie and Pregibon 1992), can be saved for each cell, so that the user can save intermediate results without losing any geographic information. This new object class is implemented in a new package, rasterList (Cordano 2017b). Two applications are presented:

  • Creation of a 2D/3D modelled map of the soil water retention curve based on an existing soil properties map and a state-of-the-art empirical formulation (VanGenuchten 1980);
  • Processing spatial gridded datasets of daily or monthly precipitation (Funk et al. 2015) for a trend or frequency analysis (estimation of the return period of critical events).
Analysing Raster Data: Application 1

Application 1 is the estimation of the mathematical relationship between soil water content \theta and soil water pressure \psi, i.e. the soil water retention curve, in the cells of a raster map. The soil water retention curve is widely used in subsurface water/groundwater hydrological modeling because it allows the water balance equation to be closed. Generally some theoretical formulas are used to model this curve as a function of soil properties. In this example, given a raster map of soil properties, a map of soil water retention curves is produced. First of all, in the absence of detailed soil information, as is often the case in hydrological modeling (Kollet et al. 2017), the soil water retention curve is defined with the following empirical formula (VanGenuchten 1980):

\theta = \theta_{res}+(\theta_{sat}-\theta_{res})\,(1+ |\alpha \psi|^n)^{-m} \qquad \psi < 0

\theta = \theta_{sat} \qquad \psi \geq 0

where \theta is the volumetric water content (i.e. water volume per soil volume), ranging between the residual water content \theta_{res} and the saturated water content \theta_{sat}. \alpha, n and m are parameters assumed to depend on soil type. An estimate of the parameter values \theta_{sat}, \theta_{res}, \alpha, m, n as a function of soil type is given in the following table (Maidment 1992):

library(soilwater) library(stringr) soilparcsv <- system.file("external/soil_data.csv",package="soilwater") soilpar <- read.table(soilparcsv,stringsAsFactors=FALSE,header=TRUE,sep=",") knitr::kable(soilpar,caption="Average value of Van Genuchten's parameter per each soil type")

Average value of Van Genuchten’s parameter per each soil type

type              swc   rwc  alpha     n        m  Ks_m_per_hour  Ks_m_per_sec
sand             0.44  0.02   13.8  2.09  0.52153        0.15000      4.17e-05
loamy sand       0.44  0.04   11.5  1.87  0.46524        0.10000      2.78e-05
sandy loam       0.45  0.05    6.8  1.62  0.38272        0.03900      1.08e-05
loam             0.51  0.04    9.0  1.42  0.29577        0.03300      9.20e-06
silt loam        0.50  0.03    4.8  1.36  0.26471        0.00900      2.50e-06
sandy clay loam  0.40  0.07    3.6  1.56  0.35897        0.00440      1.20e-06
clay loam        0.46  0.09    3.9  1.41  0.29078        0.00480      1.30e-06
silty clay loam  0.46  0.06    3.1  1.32  0.24242        0.00130      4.00e-07
sandy clay       0.43  0.11    3.4  1.40  0.28571        0.00073      2.00e-07
silty clay       0.48  0.07    2.9  1.26  0.20635        0.00076      2.00e-07
clay             0.48  0.10    2.7  1.29  0.22481        0.00041      1.00e-07

soilpar$color <- str_sub(rainbow(nrow(soilpar)),end=7) ## Only first 7 characters of HTML code are considered.

It is assumed that each cell of a raster map has its own soil water retention curve. Therefore, to map the soil water retention curve and not only its parameters, the rasterList package is loaded:

library(rasterList)

## Loading required package: raster

## Loading required package: sp

The study area is the area of the meuse dataset:

library(sp) data(meuse) help(meuse) data(meuse.grid) help(meuse.grid)

A soil map is available, with soil types classified according to the 1:50 000 soil map of the Netherlands. At the Meuse site, the following soil types are present:

  • 1 = Rd10A (Calcareous weakly-developed meadow soils, light sandy clay);
  • 2 = Rd90C/VII (Non-calcareous weakly-developed meadow soils, heavy sandy clay to light clay);
  • 3 = Bkd26/VII (Red Brick soil, fine-sandy, silty light clay) which can be respectively assimilated as:
soiltype_id <- c(1,2,3) soiltype_name <- c("sandy clay","clay","silty clay loam") soilpar_s <- soilpar[soilpar$type %in% soiltype_name,] is <- order(soilpar_s[,1],by=soiltype_name) soilpar_s <- soilpar_s[is,] soilpar_s$id <- soiltype_id

A geographic view of the Meuse test case is available through the leaflet package (Cheng, Karambelkar, and Xie 2018):

library(rasterList) library(soilwater) library(leaflet) set.seed(1234) data(meuse.grid) data(meuse) coordinates(meuse.grid) <- ~x+y coordinates(meuse) <- ~x+y gridded(meuse.grid) <- TRUE proj4string(meuse) <- CRS("+init=epsg:28992") proj4string(meuse.grid) <- CRS("+init=epsg:28992") soilmap <- as.factor(stack(meuse.grid)[['soil']]) ref_url <- "https://{s}.tile.opentopomap.org/{z}/{x}/{y}.png" ref_attr <- 'Map data: © OpenStreetMap, SRTM | Map style: © OpenTopoMap (CC-BY-SA)' opacity <- 0.6 color <- colorFactor(soilpar_s$color,level=soilpar_s$id) labFormat <- function(x,values=soilpar_s$type){values[as.integer(x)]} leaf1 <- leaflet() %>% addTiles(urlTemplate=ref_url,attribution=ref_attr) leaf2 <- leaf1 %>% addRasterImage(soilmap,opacity=opacity,color=color) %>% addLegend(position="bottomright",colors=soilpar_s$color,labels=soilpar_s$type,title="Soil Type",opacity=opacity) leaf2


Therefore, a map of each parameter of VanGenuchten (1980)’s formula can be obtained as follows (the names swc and rwc for the saturated and residual water contents correspond to \theta_{sat} and \theta_{res}, following the arguments of the swc function of the soilwater package (Cordano, Andreis, and Zottele 2017)):

soil_parameters_f <- function (soiltype,sp=soilpar_s) { o <- sp[soiltype,c("swc","rwc","alpha","n","m")] names(o) <- c("theta_sat","theta_res","alpha","n","m") return(o) } soil_parameters_rl <- rasterList(soilmap,FUN=soil_parameters_f)

A RasterList-class object has been created and each cell contains a vector of soil water parameters:

soil_parameters <- stack(soil_parameters_rl) soil_parameters

## class : RasterStack ## dimensions : 104, 78, 8112, 5 (nrow, ncol, ncell, nlayers) ## resolution : 40, 40 (x, y) ## extent : 178440, 181560, 329600, 333760 (xmin, xmax, ymin, ymax) ## coord. ref. : +init=epsg:28992 +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957,0.343988,-1.87740,4.0725 +units=m +no_defs ## names : theta_sat, theta_res, alpha, n, m ## min values : 0.43000, 0.06000, 2.70000, 1.29000, 0.22481 ## max values : 0.48000, 0.11000, 3.40000, 1.40000, 0.28571

A map of saturated water content (porosity) can be visualized as follows:

theta_sat <- soil_parameters[["theta_sat"]] color <- colorNumeric("Greens",domain=theta_sat[]) leaf3 <- leaf1 %>% addRasterImage(theta_sat,color=color,opacity=opacity) %>% addLegend(position="bottomright",pal=color,values=theta_sat[],opacity=opacity,title="Porosity")

Let’s consider 3 points of interest in the Meuse area, located (latitude and longitude) as follows:

lat <- 50.961532+c(0,0,0.02) lon <- 5.724514+c(0,0.01,0.0323) name <- c("A","B","C") points <- data.frame(lon=lon,lat=lat,name=name) knitr::kable(points,caption="Points of interest: geographical coordinates (latitude and longitude) and name")

Points of interest: geographical coordinates (latitude and longitude) and name

      lon      lat  name
 5.724514 50.96153     A
 5.734514 50.96153     B
 5.756814 50.98153     C

coordinates(points) <- ~lon+lat proj4string(points) <- CRS("+proj=longlat +ellps=WGS84") leaf3 %>% addMarkers(lng=points$lon,lat=points$lat,label=points$name)


They are indexed as follows in the raster map:

points$icell <- cellFromXY(soil_parameters,spTransform(points,projection(soil_parameters)))

where icell is a vector of integer numbers corresponding to the raster cells containing the points A,B and C. Information on these points can be extracted as follows:

soil_parameters[points$icell]

## theta_sat theta_res alpha n m ## [1,] 0.48 0.10 2.7 1.29 0.22481 ## [2,] 0.48 0.10 2.7 1.29 0.22481 ## [3,] 0.43 0.11 3.4 1.40 0.28571

Using swc, a function is defined that, given the soil parameters, returns a soil water retention function:

library(soilwater) swc_func <- function(x,...) { o <- function(psi,y=x,...) { args <- c(list(psi,...),as.list(y)) oo <- do.call(args=args,what=get("swc")) } return(o) }

When applied to point A (i.e. using the soil parameters of point A), we get:

soilparA <- soil_parameters[points$icell[1]][1,] swcA <- swc_func(soilparA)

where swcA is a function corresponding to the soil water retention curve at point A, which can be plotted as follows:

library(ggplot2) psi <- seq(from=-5,to=1,by=0.25) title <- "Soil Water Retention Curve at Point A" xlab <- "Soil Water Pressure Head [m]" ylab <- "Soil Water Content (ranging 0 to 1) " gswA <- ggplot()+geom_line(aes(x=psi,y=swcA(psi)))+theme_bw() gswA <- gswA+ggtitle(title)+xlab(xlab)+ylab(ylab) gswA


The procedure is replicated for all cells of the raster map, saving the soil water retention curve as an R object for each cell, by making use of the rasterList function:

swc_rl <- rasterList(soil_parameters,FUN=swc_func)

The RasterList-class object contains one soil water retention curve per cell. Since the list elements associated with the raster cells are of class function, swc_rl is coerced to a single function defined over the whole raster extent through the rasterListFun function:

swc_rlf <- rasterListFun(swc_rl)

In this way, the function swc_rlf can be used to estimate the soil water content given a spatially uniform value of the soil water pressure head \psi over the whole Meuse region.

psi <- c(0,-0.5,-1,-2) names(psi) <- sprintf("psi= %1.1f m",psi) soil_water_content <- stack(swc_rlf(psi)) names(soil_water_content) <- names(psi) plot(soil_water_content,main=names(psi))

Finally, assuming that \psi may vary in space, ranging between -0.5 m and -1.5 m with a spatial exponential decay profile from point A, a map for \psi is calculated as follows:

region <- raster(swc_rlf(0)) mask <- !is.na(region) region[] <- NA region[points$icell[1]] <- 1 dist <- distance(region) dist[mask==0] <- NA mdist <- max(dist[],na.rm=TRUE) psi_max <- 0 psi_min <- -2 psi <- psi_max+(psi_min-psi_max)*(1-exp(-dist/mdist)) plot(psi)

color <- colorNumeric("Blues",domain=psi[]) leaf_psi <- leaf1 %>% addRasterImage(psi,color=color,opacity=opacity) %>% addLegend(position="bottomright",pal=color,values=psi[],opacity=opacity,title="Psi") %>% addMarkers(lng=points$lon,lat=points$lat,label=points$name)

## Warning in colors(.): Some values were outside the color scale and will be ## treated as NA

leaf_psi



The obtained map of \theta is then visualized through leaflet:

theta <- raster(swc_rlf(psi)) plot(theta)

color <- colorNumeric("Blues",domain=theta[]) leaf_psi <- leaf1 %>% addRasterImage(theta,color=color,opacity=opacity) %>% addLegend(position="bottomright",pal=color,values=theta[],opacity=opacity,title="Psi") %>% addMarkers(lng=points$lon,lat=points$lat,label=points$name) leaf_psi


Conclusions

The above applications illustrate how statistical analysis and other processing may be replicated at every cell of a raster map. The rasterList package has been created to answer the need for complex operations on spatially gridded regions. The examples shown go beyond their specific task: they present the most important functions of the package and how they can be used to solve practical problems. rasterList is able to save different types of objects within the cells of a raster map and allows each object of a raster cell to be transformed and/or coerced to another object. Two or more RasterList-class objects can interact through operations with one another. This is of great relevance because it allows intermediate steps to be saved during a processing workflow (e.g. statistical analysis).

References

Chambers, J. M. 1992. “Linear Models.” In Statistical Models in S. Wadsworth & Brooks/Cole.

Cheng, Joe, Bhaskar Karambelkar, and Yihui Xie. 2018. Leaflet: Create Interactive Web Maps with the Javascript ’Leaflet’ Library. https://CRAN.R-project.org/package=leaflet.

Cordano, Emanuele. 2017a. LmomPi: (Precipitation) Frequency Analysis and Variability with L-Moments from ’Lmom’. https://CRAN.R-project.org/package=lmomPi.

———. 2017b. RasterList: A Raster Where Cells Are Generic Objects. https://CRAN.R-project.org/package=rasterList.

Cordano, Emanuele, Daniele Andreis, and Fabio Zottele. 2017. Soilwater: Implementation of Parametric Formulas for Soil Water Retention or Conductivity Curve. https://CRAN.R-project.org/package=soilwater.

Funk, Chris, Pete Peterson, Martin Landsfeld, Diego Pedreros, James Verdin, Shraddhanand Shukla, Gregory Husak, et al. 2015. “The Climate Hazards Infrared Precipitation with Stations–a New Environmental Record for Monitoring Extremes.” Scientific Data 2 (December). The Author(s) SN -: 150066 EP. http://dx.doi.org/10.1038/sdata.2015.66.

Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. http://www.jstatsoft.org/v40/i03/.

Hastie, T. J., and D. Pregibon. 1992. “Generalized Linear Models.” In Statistical Models in S. Wadsworth & Brooks/Cole.

Hijmans, Robert J. 2016. Raster: Geographic Data Analysis and Modeling. https://CRAN.R-project.org/package=raster.

Hosking, J. R. M. 2017. L-Moments. https://CRAN.R-project.org/package=lmom.

Hosking, J. R. M., and James R. Wallis. 1997. “L-Moments.” In Regional Frequency Analysis: An Approach Based on L-Moments, 14–43. Cambridge University Press. doi:10.1017/CBO9780511529443.004.

Kollet, Stefan, Mauro Sulis, Reed M. Maxwell, Claudio Paniconi, Mario Putti, Giacomo Bertoldi, Ethan T. Coon, et al. 2017. “The Integrated Hydrologic Model Intercomparison Project, Ih-Mip2: A Second Set of Benchmark Results to Diagnose Integrated Hydrology and Feedbacks.” Water Resources Research 53 (1): 867–90. doi:10.1002/2016WR019191.

Maidment, D. R. 1992. Handbook of Hydrology. Mc-Graw Hill.

Marsaglia, George, Wai Wan Tsang, and Jingbo Wang. 2003. “Evaluating Kolmogorov’s Distribution.” Journal of Statistical Software, Articles 8 (18): 1–4. doi:10.18637/jss.v008.i18.

Pohlert, Thorsten. 2018. Trend: Non-Parametric Trend Tests and Change-Point Detection. https://CRAN.R-project.org/package=trend.

VanGenuchten, M. Th. 1980. “A Closed-Form Equation for Predicting the Hydraulic Conductivity of Unsaturated Soils.” Soil Sci. Soc. Am. Jour. 44: 892–98.

The post Analysing Raster Data: the rasterList Package appeared first on MilanoR.



(Bootstraping) Follow-Up Contrasts for Within-Subject ANOVAs

Thu, 05/09/2019 - 18:00

(This article was first published on Mattan S. Ben-Shachar, and kindly contributed to R-bloggers)

So you’ve run an ANOVA and found that your residuals are neither normally distributed, nor homogeneous, or that you are in violation of any other assumptions. Naturally you want to run some a-parametric analysis… but how?

In this post I will demonstrate how to run a permutation test ANOVA (easy!) and how to run bootstrap follow-up analysis (a bit more challenging) in a mixed design (both within- and between-subject factors) ANOVA. I’ve chosen to use a mixed design model in this demonstration for two reasons:

  1. I’ve never seen this done before.
  2. You can easily modify this code (change / skip some of these steps) to accommodate purely within- or purely between-subject designs.
Permutation ANOVA

Running a permutation test for your ANOVA in R is as easy as… running an ANOVA in R, but substituting aov with aovperm from the permuco package.

library(permuco)
data(obk.long, package = "afex") # data from the afex package

# permutation anova
fit_mixed_p <-
aovperm(value ~ treatment * gender * phase * hour + Error(id / (phase * hour)),
data = obk.long)

fit_mixed_p
##
## Permutation test using Rd_kheradPajouh_renaud to handle nuisance variables and 5000 permutations.
## SSn dfn SSd dfd MSEn MSEd F
## treatment 179.730 2 228.06 10 89.8652 22.806 3.94049
## gender 83.448 1 228.06 10 83.4483 22.806 3.65912
## treatment:gender 130.241 2 228.06 10 65.1206 22.806 2.85547
## phase 129.511 2 80.28 20 64.7557 4.014 16.13292
## treatment:phase 77.885 4 80.28 20 19.4713 4.014 4.85098
## gender:phase 2.270 2 80.28 20 1.1351 4.014 0.28278
## treatment:gender:phase 10.221 4 80.28 20 2.5553 4.014 0.63660
## hour 104.285 4 62.50 40 26.0714 1.563 16.68567
## treatment:hour 1.167 8 62.50 40 0.1458 1.563 0.09333
## gender:hour 2.814 4 62.50 40 0.7035 1.562 0.45027
## treatment:gender:hour 7.755 8 62.50 40 0.9694 1.562 0.62044
## phase:hour 11.347 8 96.17 80 1.4183 1.202 1.17990
## treatment:phase:hour 6.641 16 96.17 80 0.4151 1.202 0.34529
## gender:phase:hour 8.956 8 96.17 80 1.1195 1.202 0.93129
## treatment:gender:phase:hour 14.155 16 96.17 80 0.8847 1.202 0.73594
## parametric P(>F) permutation P(>F)
## treatment 5.471e-02 0.0586
## gender 8.480e-02 0.0916
## treatment:gender 1.045e-01 0.1084
## phase 6.732e-05 0.0002
## treatment:phase 6.723e-03 0.0056
## gender:phase 7.566e-01 0.7644
## treatment:gender:phase 6.424e-01 0.6480
## hour 4.027e-08 0.0002
## treatment:hour 9.992e-01 0.9996
## gender:hour 7.716e-01 0.7638
## treatment:gender:hour 7.555e-01 0.7614
## phase:hour 3.216e-01 0.3260
## treatment:phase:hour 9.901e-01 0.9910
## gender:phase:hour 4.956e-01 0.5100
## treatment:gender:phase:hour 7.496e-01 0.7590

The results of the permutation test suggest an interaction between Treatment (a between-subject factor) and Phase (a within-subject factor). To fully understand this interaction, we would like to conduct some sort of follow-up analysis (planned comparisons or post hoc tests). Usually I would pass the model to emmeans for any follow-ups, but here, due to our assumption violations, we need some sort of bootstrapping method.

Bootstrapping with car and emmeans

For bootstrapping we will be using the Boot function from the car package. In the case of within-subject factors, this function requires that they be specified in a multivariate data structure. Let’s see how this is done.

1. Make your data WIIIIIIIIIIDEEEEEEEE

library(dplyr)
library(tidyr)
obk_mixed_wide <- obk.long %>%
unite("cond", phase, hour) %>%
spread(cond, value)

head(obk_mixed_wide)
## id treatment gender age fup_1 fup_2 fup_3 fup_4 fup_5 post_1 post_2
## 1 1 control M -4.75 2 3 2 4 4 3 2
## 2 2 control M -2.75 4 5 6 4 1 2 2
## 3 3 control M 1.25 7 6 9 7 6 4 5
## 4 4 control F 7.25 4 4 5 3 4 2 2
## 5 5 control F -5.75 4 3 6 4 3 6 7
## 6 6 A M 7.25 9 10 11 9 6 9 9
## post_3 post_4 post_5 pre_1 pre_2 pre_3 pre_4 pre_5
## 1 5 3 2 1 2 4 2 1
## 2 3 5 3 4 4 5 3 4
## 3 7 5 4 5 6 5 7 7
## 4 3 5 3 5 4 7 5 4
## 5 8 6 3 3 4 6 4 3
## 6 10 8 9 7 8 7 9 9

This is not enough, as we also need our new columns (representing the different levels of the within subject factors) to be in a matrix column.

obk_mixed_matrixDV <- obk_mixed_wide %>%
select(id, age, treatment, gender)
obk_mixed_matrixDV$M <- obk_mixed_wide %>%
select(-id, -age, -treatment, -gender) %>%
as.matrix()

str(obk_mixed_matrixDV)
## 'data.frame': 16 obs. of 5 variables:
## $ id : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ age : num -4.75 -2.75 1.25 7.25 -5.75 7.25 8.25 2.25 2.25 -7.75 ...
## $ treatment: Factor w/ 3 levels "control","A",..: 1 1 1 1 1 2 2 2 2 3 ...
## $ gender : Factor w/ 2 levels "F","M": 2 2 2 1 1 2 2 1 1 2 ...
## $ M : num [1:16, 1:15] 2 4 7 4 4 9 8 6 5 8 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr "fup_1" "fup_2" "fup_3" "fup_4" ...
2. Fit your regular model

fit_mixed <- aov(M ~ treatment * gender, obk_mixed_matrixDV)

Note that the left-hand-side of the formula (the M) is a matrix representing all the within-subject factors and their levels, and the right-hand-side of the formula (treatment * gender) includes only the between-subject factors.

3. Define the contrast(s) of interest

For this step we will be using emmeans. But first, we need to create a list of the within-subject factors and their levels (I did say this was difficult – bear with me!). This list needs to correspond to the order of the multi-variate column in our data, such that if there is more than one factor, the combinations of factor levels are used in expand.grid order. In our case:

colnames(obk_mixed_matrixDV$M)
## [1] "fup_1" "fup_2" "fup_3" "fup_4" "fup_5" "post_1" "post_2"
## [8] "post_3" "post_4" "post_5" "pre_1" "pre_2" "pre_3" "pre_4"
## [15] "pre_5"

rm_levels <- list(hour = c("1", "2", "3", "4", "5"),
phase = c("fup", "post", "pre"))

Make sure you get the order of the variables and their levels correct! This will affect your results!

Let’s use emmeans to get the estimates of the pairwise differences between the treatment groups within each phase of the study:

library(emmeans)
# get the correct reference grid with the correct multivariate levels!
rg <- ref_grid(fit_mixed, mult.levs = rm_levels)
rg
## 'emmGrid' object with variables:
## treatment = control, A, B
## gender = F, M
## hour = multivariate response levels: 1, 2, 3, 4, 5
## phase = multivariate response levels: fup, post, pre

# get the expected means:
em_ <- emmeans(rg, ~ phase * treatment)
em_
## phase treatment emmean SE df lower.CL upper.CL
## fup control 4.33 0.551 10 3.11 5.56
## post control 4.08 0.628 10 2.68 5.48
## pre control 4.25 0.766 10 2.54 5.96
## fup A 7.25 0.604 10 5.90 8.60
## post A 6.50 0.688 10 4.97 8.03
## pre A 5.00 0.839 10 3.13 6.87
## fup B 7.29 0.461 10 6.26 8.32
## post B 6.62 0.525 10 5.45 7.80
## pre B 4.17 0.641 10 2.74 5.59
##
## Results are averaged over the levels of: gender, hour
## Confidence level used: 0.95

# run pairwise tests between the treatment groups within each phase
c_ <- contrast(em_, "pairwise", by = 'phase')
c_
## phase = fup:
## contrast estimate SE df t.ratio p.value
## control - A -2.9167 0.818 10 -3.568 0.0129
## control - B -2.9583 0.719 10 -4.116 0.0054
## A - B -0.0417 0.760 10 -0.055 0.9983
##
## phase = post:
## contrast estimate SE df t.ratio p.value
## control - A -2.4167 0.931 10 -2.595 0.0634
## control - B -2.5417 0.819 10 -3.105 0.0275
## A - B -0.1250 0.865 10 -0.144 0.9886
##
## phase = pre:
## contrast estimate SE df t.ratio p.value
## control - A -0.7500 1.136 10 -0.660 0.7911
## control - B 0.0833 0.999 10 0.083 0.9962
## A - B 0.8333 1.056 10 0.789 0.7177
##
## Results are averaged over the levels of: gender, hour
## P value adjustment: tukey method for comparing a family of 3 estimates

# extract the estimates
est_names <- c("fup: control - A", "fup: control - B", "fup: A - B",
"post: control - A", "post: control - B", "post: A - B",
"post: control - A", "post: control - B", "post: A - B")
est_values <- summary(c_)$estimate
names(est_values) <- est_names
est_values
## fup: control - A fup: control - B fup: A - B post: control - A
## -2.91666667 -2.95833333 -0.04166667 -2.41666667
## post: control - B post: A - B post: control - A post: control - B
## -2.54166667 -0.12500000 -0.75000000 0.08333333
## post: A - B
## 0.83333333
4. Running the bootstrap

Now let’s wrap this all in a function that accepts the fitted model as an argument:

treatment_phase_contrasts <- function(mod){
rg <- ref_grid(mod, mult.levs = rm_levels)

# get the expected means:
em_ <- emmeans(rg, ~ phase * treatment)

# run pairwise tests between the treatment groups within each phase
c_ <- contrast(em_, "pairwise", by = 'phase')

# extract the estimates
est_names <- c("fup: control - A", "fup: control - B", "fup: A - B",
"post: control - A", "post: control - B", "post: A - B",
"post: control - A", "post: control - B", "post: A - B")
est_values <- summary(c_)$estimate
names(est_values) <- est_names
est_values
}

# test it
treatment_phase_contrasts(fit_mixed)
## fup: control - A fup: control - B fup: A - B post: control - A
## -2.91666667 -2.95833333 -0.04166667 -2.41666667
## post: control - B post: A - B post: control - A post: control - B
## -2.54166667 -0.12500000 -0.75000000 0.08333333
## post: A - B
## 0.83333333

Finally, we will use car::Boot to get the bootstrapped estimates!

library(car)

treatment_phase_results <-
Boot(fit_mixed, treatment_phase_contrasts, R = 50) # R = 599 at least

summary(treatment_phase_results) # original vs. bootstrapped estimate (bootMed)
##
## Number of bootstrap replications R = 27
## original bootBias bootSE bootMed
## fup..control...A -2.916667 0.0342593 0.58002 -3.05000
## fup..control...B -2.958333 -0.0246914 0.73110 -2.96667
## fup..A...B -0.041667 -0.0589506 0.35745 -0.16667
## post..control...A -2.416667 -0.1728395 0.65088 -2.75000
## post..control...B -2.541667 -0.1425926 0.77952 -2.66667
## post..A...B -0.125000 0.0302469 0.58006 -0.11667
## post..control...A.1 -0.750000 0.0067901 0.83692 -0.56667
## post..control...B.1 0.083333 -0.0169753 0.89481 0.33333
## post..A...B.1 0.833333 -0.0237654 0.73113 1.08333

confint(treatment_phase_results, type = "perc") # does include zero? ## Bootstrap percent confidence intervals
##
## 2.5 % 97.5 %
## fup..control...A -4.000000 -1.750000
## fup..control...B -4.300000 -1.500000
## fup..A...B -0.750000 0.750000
## post..control...A -3.500000 -1.333333
## post..control...B -4.250000 -1.333333
## post..A...B -1.416667 0.875000
## post..control...A.1 -2.600000 0.700000
## post..control...B.1 -2.000000 1.500000
## post..A...B.1 -0.600000 1.833333

Results indicate that the Control group is lower than both treatment groups in the post and fup (follow-up) phases.

If we wanted p-values, we could use this little function (based on this demo):

boot_pvalues <- function(x, side = c(0, -1, 1)) {
side <- side[1]
x <- as.data.frame(x$t)

ps <- sapply(x, function(.x) {
s <- na.omit(.x)
s0 <- 0
N <- length(s)

if (side == 0) {
min((1 + sum(s >= s0)) / (N + 1),
(1 + sum(s <= s0)) / (N + 1)) * 2
} else if (side < 0) {
(1 + sum(s <= s0)) / (N + 1)
} else if (side > 0) {
(1 + sum(s >= s0)) / (N + 1)
}
})

setNames(ps,colnames(x))
}

boot_pvalues(treatment_phase_results)
## fup: control - A fup: control - B fup: A - B post: control - A
## 0.07142857 0.07142857 0.71428571 0.07142857
## post: control - B post: A - B post: control - A post: control - B
## 0.07142857 0.85714286 0.42857143 0.85714286
## post: A - B
## 0.35714286

These p-values can then be passed to p.adjust() for the p-value adjustment method of your choosing.
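
For example (using the objects created above), a Holm correction would be:

p.adjust(boot_pvalues(treatment_phase_results), method = "holm")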

Summary

I’ve demonstrated how to run permutation tests on main effects / interactions, with follow-up analysis using the bootstrap method. Using this code as a basis for any analysis you might have in mind gives you all the flexibility of emmeans, which supports many (many) models!



    Bibliographies in RStudio Markdown are difficult – here’s how to make it easy

    Thu, 05/09/2019 - 17:14

    (This article was first published on R – DataSurg, and kindly contributed to R-bloggers)

    This blog is intended for researchers, PhD students, MD students and any other students who wish to have a robust and effective reference management setup. The blog has a particular focus on those using R markdown, Bookdown or LaTeX. Parts of the blog can also help setup Zotero for use with Microsoft Word. The blog has been designed to help achieve the following goals:

    • Effective citation storage
      • Fast and easy citation storage (one-click from Chrome)
      • Fast and easy PDF storage using cloud storage
      • Immediate, automatic and standardised PDF renaming
      • Immediate, automatic and standardised citation key generation
    • Effective citation integration with markdown etc.
      • Generation of citation keys which work with LaTeX and md (no non-standard characters)
      • Ability to lock citation keys so that they don’t update with Zotero updates
      • Storage of immediately updated .bib files for use with Rmd, Bookdown and LaTeX
      • Automated update of the .bib file in RStudio server
    Downloads and Setup

    For my current reference management setup I need the following software:

    • Zotero
      • Zotero comes with 300MB of free storage which allows well over 1000 references to be stored as long as PDFs are stored separately
      • From the same download page download the Chrome connector to enable the “save to zotero” function in Google Chrome
    • ZotFile
      • Zotfile is a Zotero plugin which helps with PDF management. Download the .xpi file, then open Zotero, go to “Tools → Add-Ons”, click the little cog in the top right corner and navigate to the file to install it (Figure 1)
    • Better BibTeX
      • Better BibTeX is a plugin to help generate citation keys which will be essential for writing articles in LaTeX, R Markdown or Bookdown
      • If the link doesn’t work go to github and scroll down to the ReadMe to find a link to download the .xpi file
      • The same approach is then used to install the Better BibTeX plugin for zotero (“Tools → Add-Ons”)

    After downloading Zotero, ZotFile and Better BibTeX create an account on Zotero online.

    In addition to the Zotero downloads this guide will focus on an efficient setup for writing with R markdown or Bookdown and assumes that you have access to the following software / accounts:

    • Dropbox / Google Drive / other cloud storage service which allows APIs
      • It will also be necessary for these to be accessible using Windows Explorer or Mac Finder (there are many guides online for syncing Google Drive and Dropbox so that they appear in file explorers)
    • RStudio (this is not 100% essential but it is far harder to use Rmd without it)
      • Packages which will be required for this setup include rdrop2 (if using dropbox, other packages are available to convert this setup to Google Drive etc.), encryptr, bookdown or Rmarkdown, tinytex and a LaTeX installation (the Bookdown author recommends using tinytex which can be installed by the similarly named R package: tinytex::install_tinytex())
    Folder Setup

When using Zotero it is a good idea to create a folder in which you will store PDFs retrieved from articles. Ultimately it is optional whether or not PDFs are stored, but if you have access to cloud storage with a good quota then it can make writing in Rmd etc. much faster as there is no requirement to search online for the original PDF. This folder should be set up in Google Drive, Dropbox or another cloud storage service which can be accessed from your own computer through the file explorer.

    A second folder may be useful to store bibliographies which will be generated for specific projects or submissions. Again this folder should be made available in cloud storage.

    ZotFile Preferences

    To setup Zotero so that retrieved PDFs are automatically stored and renamed in the cloud storage without consuming the Zotero storage quota go to “Tools → ZotFile Preferences” and on the first tab: General Settings and set the folder and subfolder naming strategy for PDFs. I have set the location of the files to a Custom location and in this case used the path to a Google Drive folder (~\Google Drive\Zotero PDF Library). ZotFile will also store retrieved PDFs in subfolders to help with finding PDFs at a later date. The current setup I use is to create a subfolder with the first author surname so that all papers authored by one (or more) author with the same name are stored together using the \%a in the subfolder field (Figure 2). Other alternatives are to store PDFs in subfolders using year (\%y); journal or publisher (\%w); or item type (\%T).

    Next the Renaming Rules tab can be configured to provide sensible names to each of the files (this is essential if PDFs are not to be stored as random strings of characters which provide no meaning). In this tab I have set the format to: {%a_}{%y_}{%t} which provides names for the PDFs in the format of: Fairfield_2019_Gallstone_Disease_and_the_Risk_of_Cardiovascular_Disease.pdf. I find that this shows author, year and first word of title without needing to expand the file name (Figure 3).

I have not changed any of the default settings in either the Tablet Settings or Advanced Settings tabs apart from removing special characters in the Advanced Settings (this stops things from breaking later).

    General Zotero Settings

    Zotero has several configurable settings (accessed through: “Edit → Preferences”) and I have either adopted the defaults or made changes as follows:

    General:

    • I have ticked the following:
      • Automatically attach associated PDFs
      • Automatically retrieve metadata for PDFs
      • Automatically rename attachments using parent metadata
      • Automatically tag items with keywords and subject headings
      • All options in Group section
    • I have left the following unticked:
      • Automatically take snapshots
      • Rename linked files

    Sync:

    • Enter the account details
    • Tick sync automatically
    • Untick sync full text (if you choose to save PDFs then syncing full text will quickly consume the 300MB quota)

    Search:

    • Left unchanged

    Export:

    • Left unchanged

    Cite:

    • There are several sensible defaults but if there is a new citation style you wish to be able to use in Microsoft Word for example then click “Get additional styles” as there is probably a version that you need already created. You can click the “+” button to add a style from a .csl file if you have one already. Finally, if you are desperate for a style that doesn’t already exist then you can select a citation style and click Style Editor and edit the raw .csl file.
    • In the Word Processors subtab (on the main Cite tab), you can install the Microsoft Word add-in to allow Zotero to work in Microsoft Word.

    Advanced:

    • I changed nothing on the General subtab
    • In the Files and Folders subtab I have selected the path to base directory for attachments
    • I have not changed the Shortcuts subtab
    • I have not changed the Feeds subtab

    Better BibTex:

    • In this section I have set my Citation Key format to [auth:lower:alphanum]_[year:alphanum]_[veryshorttitle:lower:alphanum]_[journal:lower:clean:alphanum] (Figure 4). This generates a citation key for each reference in the format of fairfield_2019_gallstones_scientificreports or harrison_2012_hospital_bmj. It always takes the first author’s surname, the year, the first word of the title and the journal abbreviation if known. The clean and alphanum arguments to this field are used to remove unwanted punctuation which can cause citation to fail in LaTeX.
    Figure 4: Better BibTeX Citation Key

Once the settings have been configured, if you already had references stored in Zotero and wish to change the citation keys for old references, select your entire library root (above all folders), select all references, right click and use “Better BibTex → Refresh BibTeX Key”, and all of the citation keys should be updated.

    Creating a .bib file

    For referencing in a new project, publication or submission it may be helpful to have a dynamic .bib file that updates with every new publication and can be accessed from any device through cloud storage.

    To set up a .bib file, first find the folder that you wish to create the file from (this should be the folder which contains any citations you will use and ideally not the full library to cut down on unnecessary storage and syncing requirements). Note that the .bib file will generate a bibliography from any citations stored directly in the folder when using default settings. This prevents use of subfolders which I find particularly helpful for organising citations and I have therefore changed the setting so that folders also show any citations stored in subfolders. To make this change go to “Edit Preferences” and select the “Advanced” tab and at the bottom of the “General” subtab select “Config Editor”. This will bring up a searchable list of configurations (it may show a warning message before this) and search in the search box for “extensions.zotero.recursiveCollections”. Set “Value” to TRUE and then when you click a folder you should see all of the citations also stored in subfolders.

    Right click the folder and select “Export Collection”. A pop-up window will appear at which point select “Keep Updated” and if using RStudio desktop save the file in the directory where you have your Rmd project files. If you are working with RStudio server then save the file in a cloud storage location which will then be accessed from the server. I have a .bib file stored in Dropbox which I access from RStudio server.

    Linking Dropbox and RStudio Server to Access the .bib File

    The following covers linking Dropbox to RStudio server but could be adapted to cover another cloud storage service.

    Dropbox provides a token to allow communication between different apps. The rdrop2 package is what I used to create a token to allow this. I actually created the token on RStudio desktop as I couldn’t get the creation to work on the server but this is perfectly ok.

    Caution: The token generated by this process could be used to access your Dropbox from anywhere using RStudio if you do not keep it secure. If somebody were to access an unencrypted token then it would be equivalent to handing out your email and password. I therefore used the encryptr package to allow safe storage of this token.

    Token Creation

    Open Rstudio desktop and enter the following code:

    library(rdrop2) library(encryptr) # Create token token <- drop_auth() # Save token saveRDS(token, "droptoken.rds") # Encrypt token genkeys() # Default file names are id_rsa and id_rsa.pub encrypt_file("droptoken.rds", "droptoken.rds.encryptr.bin") encrypt_file(".httr-oauth", ".httr-oauth.encryptr.bin") # Same details should appear later drop_acc() # Remove token from local environment rm(token) # Delete the unencrypted files system("rm droptoken.rds") system("rm .httr-oauth")

    The code will create two files, a token and the .httr-oauth file from which a token can also be made. The encryptr package can then encrypt the files using a public / private key pair. It is essential that the password that is set when using genkeys() is remembered otherwise the token cannot then be used. In this case the original token can’t be retrieved but could be created again from scratch.

    The following files will then be needed to upload to the RStudio server:

    • droptoken.rds.encryptr.bin – or the name provided for the encrypted Dropbox token
    • id_rsa – or the name provided for the private key from the private / public key pair
    Dropbox Linkage for Referencing the .bib File

    Now that the encrypted token and necessary (password-protected) private key are available in RStudio server, the following can be saved as a separate script. The script is designed to read in and decrypt the encrypted token (this will require a password and should be done if the .bib file needs updated). Only the drop_download() needs repeated if using the token again during the same session. The token should be cleared at the end of every session for additional security.

    library(rdrop2) library(encryptr) # ******** WARNING ******** # Losing the unencrypted token will give anyone # complete control of your Dropbox account # If you are concerned this has happened, # you can then revoke the rdrop2 app from your # dropbox account and start over. # ******** WARNING ******** safely_extract_dropbox_token <- function(encrypted_db_token = NULL, private_key_file = NULL){ decrypt_file(encrypted_db_token, file_name = "temporary_dropbox_token.rds", private_key_path = private_key_file) token <<- readRDS("temporary_dropbox_token.rds") system("rm temporary_dropbox_token.rds") } safely_extract_dropbox_token(encrypted_db_token = "droptoken.rds.encryptr.bin", private_key_file = "id_rsa") # Then pass the token to each drop_ function drop_acc(dtoken = token) # The path is the Dropbox file location drop_download(path = "My_Dropbox_Home_Directory/Zotero Library/my.bib", local_path = "my.bib", dtoken = token, overwrite = TRUE)

    Now that the .bib file has been created and is stored as “my.bib” in the local directory, it should update whenever the token is loaded and drop_download() is run.
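
As a minimal sketch of the final step (the title, output format and example citation key below are just placeholders), the downloaded file is then referenced in the YAML header of the R Markdown document, and citations are inserted with the usual [@citationkey] syntax:

---
title: "My manuscript"
output: pdf_document
bibliography: my.bib
---

Some text citing a reference [@fairfield_2019_gallstones_scientificreports].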

    Final Result

On clicking the “Save to Zotero” button in Chrome and running drop_download(), the following should all happen almost instantaneously:

    • Zotero stores a new reference
    • A PDF is stored in the cloud storage having been named appropriately
    • A link to the PDF is stored in Zotero (without using up significant memory)
    • A citation key is established for the reference in a standardised format without conflicts
    • Pre-existing citation keys which have been referenced earlier in the writing of the paper are not altered
    • A .bib file is updated in the RStudio server directory
    • And much unwanted frustration of reference management is resolved

    This is my current reference management system which I have so far found to be very effective. If there are ways you think it can be improved I would love to hear about them.



    Open Trade Statistics

    Thu, 05/09/2019 - 02:00

    (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

    Introduction

    Open Trade Statistics (OTS) was created with the intention to lower the barrier to working with international economic trade data. It includes a public API, a dashboard, and an R package for data retrieval.

The project started because many Latin American universities, including those around me, have limited or no access to the United Nations Commodity Trade Statistics Database (UN COMTRADE).

There are alternatives to COMTRADE. For example, the Base Pour L’Analyse du Commerce International (BACI) constitutes an improvement over COMTRADE, as it is constructed from the raw data with a method that reconciles the declarations of the exporter and the importer. The main problem with BACI is that you need UN COMTRADE institutional access to download its datasets.

After contacting UN COMTRADE and proposing something similar to BACI, but available to anyone and with commercial purposes kept out of the project's scope, I obtained authorization to share curated versions of their datasets.

Different projects, such as The Atlas of Economic Complexity and The Observatory of Economic Complexity, use UN COMTRADE data and focus on data visualization to answer questions like:

    • What did Germany export in 2016?
    • Who imported Electronics in 1980?
    • Who exported Refined Copper in 1990?
    • Where did Chile export Wine to in 2016?

    Unlike existing visualization projects, I wanted to focus on data retrieval and reproducibility, and the starting point was to study the existing trade data APIs to create something more flexible and easier to use than those tools.

    Making the code (always) work

I started organizing code I wrote during the last four years at https://github.com/tradestatistics/. There was code there that I hadn't touched in more than two years, and I had written almost no comments indicating what the parts of the code actually do, so it was not understandable to others.

Reproducibility can be explained as: “Work in a smart way so that your future self won’t ask ‘Why does the code return an error?’, ‘What does this code do?’ or ‘Why did the result change if I haven’t touched the script?’”. My data cleaning process was not reproducible, and that was tragic to discover! I decided to start using RStudio Server to test the code line by line in a fresh environment, and then to divide the code into smaller pieces and comment on what the different sections actually do.

    Once I had reproducible results I took a snapshot of my packages by using packrat. To ensure reproducibility over time, I decided to build R from source, isolated from the system package manager and therefore avoiding accidental updates that might break the code.
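As a rough illustration of the packrat step (a minimal sketch only; the project path here is hypothetical), the project-private library is initialised once and then snapshotted whenever packages change:

library(packrat)

# Initialise a project-private package library (run once per project)
packrat::init("~/open-trade-statistics")   # hypothetical project path

# After installing or updating packages, record the exact versions
packrat::snapshot()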

It is worth mentioning that I'm using DigitalOcean virtual machines to store the datasets and run all the services behind the API. Under their Open Source Sponsorships the server cost is subsidized.

As a way to contribute back to the community, here you can find a ready-to-use RStudio image for working with databases and building R packages.

    The power of Open Source

With a reproducible data pipeline I had the power to do more, and to do it in a sustainable way. Finally, I was able to create the R package that I wanted to submit for rOpenSci software peer review, though that package was only the final step.

    The base for the project is Ubuntu, the database of choice is PostgreSQL, and R constitutes 95% of the project.

The datasets were cleaned by using data.table, jsonlite, dplyr, tidyr, stringr and janitor. The database was created by using RPostgreSQL. The documentation is written in R Markdown with bookdown. The dashboard was made with Shiny. To adhere to a coding style guide I used styler. In addition, I used the doParallel, purrr, Rcpp and Matrix packages to make efficient use of as much of the server's resources as possible, so a fraction of the code involves sparse matrices and C++.

Even our API was made with R. I used the Plumber package, combining it with RPostgreSQL, dplyr, glue and other R packages. With some input sanitization in place, to avoid situations like this XKCD vignette, I was ready to start working on a new R package for rOpenSci and a dashboard to visualize the data.
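To give a flavour of what a Plumber endpoint with basic input sanitization can look like, here is a minimal sketch. The /yrpc path and the y, r and p parameters match the endpoint queried later in this post, but the table name, connection details and validation rules are my own assumptions, not the project's actual code:

# plumber.R -- illustrative only
library(plumber)
library(DBI)
library(RPostgreSQL)
library(glue)

#* Yearly reporter-partner-commodity data
#* @param y year
#* @param r reporter ISO code
#* @param p partner ISO code or alias
#* @get /yrpc
function(y, r, p) {
  # Basic input sanitization: only allow expected shapes
  stopifnot(grepl("^[0-9]{4}$", y),
            grepl("^[a-z]{3}$", r),
            grepl("^[a-z-]{3,4}$", p))

  con <- dbConnect(PostgreSQL(), dbname = "trade")   # assumed connection details
  on.exit(dbDisconnect(con))

  # glue_sql quotes values safely, avoiding SQL injection
  query <- glue_sql(
    "SELECT * FROM yrpc
      WHERE year = {as.integer(y)}
        AND reporter_iso = {r}
        AND partner_iso = {p}",
    .con = con
  )
  dbGetQuery(con, query)
}

Such a file would typically be served with plumber::plumb("plumber.R")$run().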

The web server is nginx, with a secure connection provided by Let's Encrypt. The landing page is a modified HTML5UP template, and Atom and yui-compressor were the tools used to personalize the CSS behind the landing page, documentation and dashboard, with Fira Sans as the typeface of choice.

    Even our email service is a stack of Open Source tools. We use mail-in-a-box with some very simple tweaks such as email forwarding and integration with Thunderbird.

    rOpenSci contributions

Thanks to Maëlle Salmon, Amanda Dobbyn, Jorge Cimentada, Emily Riederer and Mark Padgham, the overall result can be said to be top quality!

After a long reviewing process (more than six months counting from the initial submission!), what started as an individual effort mutated into something that I consider a collective result. Thanks to the amazing team behind rOpenSci, their constructive feedback, exhaustive software reviewing and the confidence to propose ideas that I had never thought of, what you have now is not just a solid R package.

The hours spent on the reviewing process translated into changes to the database and the API. Following the reviewers' comments, I took the limited opportunities available to implement server-side changes and then update the R code. With the inclusion of API parameters that I initially hadn't considered, the current API/package provides an efficient solution that is far better than post-filtering: you always extract exactly the data you require and no more.

One useful contributed idea was to create aliases for groups of countries. Both in the API and in the package, you can request an ISO code such as “usa” (United States) or an alias such as “c-am” (all countries in America), which returns a condensed query.
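As a sketch of how an alias query could look against the API (assuming aliases are accepted by the same yrpc endpoint as ordinary ISO codes):

library(jsonlite)

# Exports from the United States to every country in America in 2016,
# using the "c-am" alias instead of listing each partner
usa_to_america <- fromJSON("https://api.tradestatistics.io/yrpc?y=2016&r=usa&p=c-am")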

    Our API vs the Atlas and the OEC

Our project provides extensive documentation with examples for both the API and the R package. The package example is reserved for a later section, so you may prefer to skip this part.

As a simple example, I shall compare the three APIs by extracting what Chile exported to Argentina, Bolivia and Perú in 2016, using only commonly used R packages (jsonlite, dplyr and purrr).

What I am going to do now is obtain the same information from three different sources, showing how easy or hard it is to use each source and commenting on some of the problems that emerge from the different APIs.

Packages

library(jsonlite)
library(dplyr)
library(purrr)

Open Trade Statistics

If I don't know the ISO codes for the country of origin or destination, I can check the countries data and inspect it in the browser.
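The same countries table can also be pulled straight into R; a small snippet using the endpoint referenced in the code below:

library(jsonlite)
library(dplyr)

# Country metadata, including the ISO codes used by the API
countries <- fromJSON("https://api.tradestatistics.io/countries")
as_tibble(countries)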

As the code below shows, this API is quite easy to use:

# Function to read each combination reporter-partners
read_from_ots <- function(p) {
  fromJSON(sprintf("https://api.tradestatistics.io/yrpc?y=2016&r=chl&p=%s", p))
}

# The ISO codes are here: https://api.tradestatistics.io/countries
partners <- c("arg", "bol", "per")

# Now with purrr I can combine the three resulting datasets
# Chile-Argentina, Chile-Bolivia, and Chile-Perú
ots_data <- map_df(partners, read_from_ots)

# Preview the data
as_tibble(ots_data)

# A tibble: 2,788 x 15
    year reporter_iso partner_iso product_code product_code_le… export_value_usd import_value_usd
 1  2016 chl          arg         0101         4                          593659          1074372
 2  2016 chl          arg         0106         4                              NA            36588
 3  2016 chl          arg         0201         4                              NA        138990325
 4  2016 chl          arg         0202         4                              NA           501203
 5  2016 chl          arg         0204         4                              NA           213358
 6  2016 chl          arg         0206         4                              NA           202296
 7  2016 chl          arg         0207         4                              NA         21218283
 8  2016 chl          arg         0302         4                        35271947               NA
 9  2016 chl          arg         0303         4                          249011          1180820
10  2016 chl          arg         0304         4                        15603048            28658
# … with 2,778 more rows, and 8 more variables: export_value_usd_change_1_year ,
#   export_value_usd_change_5_years , export_value_usd_percentage_change_1_year ,
#   export_value_usd_percentage_change_5_years , import_value_usd_change_1_year ,
#   import_value_usd_change_5_years , import_value_usd_percentage_change_1_year ,
#   import_value_usd_percentage_change_5_years

The resulting data is tidy and, in my opinion, obtaining it involved few and simple steps. The codes in the product_code column are official Harmonized System (HS) codes, which are used by UN COMTRADE and by customs authorities all over the world.

The original question cannot be answered from this data as is, but I can use the API again and join two tables. I'll obtain the product information and then group the data by product group:

# Product information
products <- fromJSON("https://api.tradestatistics.io/products")

# Join the two tables and then summarise by product group
# This will condense the original table into something more compact
# and even probably more informative
ots_data %>%
  left_join(products, by = "product_code") %>%
  group_by(group_name) %>%
  summarise(export_value_usd = sum(export_value_usd, na.rm = T)) %>%
  arrange(-export_value_usd)

# A tibble: 97 x 2
   group_name                                                   export_value_usd
 1 Vehicles; other than railway or tramway rolling stock, and…         444052393
 2 Nuclear reactors, boilers, machinery and mechanical applia…         328008667
 3 Mineral fuels, mineral oils and products of their distilla…         221487719
 4 Electrical machinery and equipment and parts thereof; soun…         179309083
 5 Plastics and articles thereof                                       172385449
 6 Iron or steel articles                                              153072803
 7 Miscellaneous edible preparations                                   149936537
 8 Paper and paperboard; articles of paper pulp, of paper or …         149405846
 9 Fruit and nuts, edible; peel of citrus fruit or melons              139800139
10 Wood and articles of wood; wood charcoal                            113034494
# … with 87 more rows

    Now we can say that in 2016, Chile exported primarily vehicles to Argentina, Bolivia and Perú.

    The Observatory of Economic Complexity

    This API is documented here. I’ll try to replicate the result from OTS API:

# Function to read each combination reporter-partners
read_from_oec <- function(p) {
  fromJSON(sprintf("https://atlas.media.mit.edu/hs07/export/2016/chl/%s/show/", p))
}

# From their documentation I can infer their links use ISO codes for countries,
# so I'll use the same codes from the previous example
destination <- c("arg", "bol", "per")

# One problem here is that the API returns a nested JSON that doesn't work with map_df
# I can obtain the same result with map and flatten
oec_data <- map(destination, read_from_oec)
oec_data <- bind_rows(oec_data[[1]]$data, oec_data[[2]]$data, oec_data[[3]]$data)

# Preview the data
as_tibble(oec_data)

# A tibble: 9,933 x 15
   dest_id export_val export_val_grow… export_val_grow… export_val_grow… export_val_grow… hs07_id  hs07_id_len
 1 saarg      455453.             6.97            0.108          398317.          182558. 010101             6
 2 saarg      100653.             1.79           -0.064           64634.          -39290. 010101…            8
 3 saarg      354799.            15.8             0.217          333682.          221847. 010101…            8
 4 saarg          NA              NA              NA                  NA               NA 010106             6
 5 saarg          NA              NA              NA                  NA               NA 010106…            8
 6 saarg          NA              NA              NA                  NA               NA 010201             6
 7 saarg          NA              NA              NA                  NA               NA 010201…            8
 8 saarg          NA              NA              NA                  NA               NA 010202             6
 9 saarg          NA              NA              NA                  NA               NA 010202…            8
10 saarg          NA              NA              NA                  NA               NA 010204             6
# … with 9,923 more rows, and 7 more variables: import_val , import_val_growth_pct ,
#   import_val_growth_pct_5 , import_val_growth_val , import_val_growth_val_5 , origin_id ,
#   year

At first sight, the API returned many more rows than the previous example. To obtain the exact same result I'll need to post-filter by product code. One curious column in the table above is hs07_id_len, which reflects the length of the HS code. For example, in the first row the HS code is 010101 and its length is 6. This can be a huge problem, as that column contains the values 6 and 8: the HS does not contain 8-digit codes, and these 6-digit codes are not official HS codes.

If you need to join that table with official HS tables, for example to append a column with product names, exactly zero of the codes above will match. Among official HS codes, “7325” means “Iron or steel; cast articles” and “732510” means “Iron; articles of non-malleable cast iron”, and those codes are used by customs authorities all over the world. In the OEC case, their “157325” code is actually “7325” from the HS, because they prepend a “15” that stands for “product community #15, metals”.

    Let’s filter with this consideration in mind:

# Remember that this is a "false 6", and is a "4" actually
as_tibble(oec_data) %>%
  filter(hs07_id_len == 6)

# A tibble: 2,558 x 15
   dest_id export_val export_val_grow… export_val_grow… export_val_grow… export_val_grow… hs07_id hs07_id_len
 1 saarg      558931.            0.223            0.277          101763.          394357. 010101            6
 2 saarg          NA            NA               NA                  NA               NA  010106            6
 3 saarg          NA            NA               NA                  NA               NA  010201            6
 4 saarg          NA            NA               NA                  NA               NA  010202            6
 5 saarg          NA            NA               NA                  NA               NA  010204            6
 6 saarg          NA            NA               NA                  NA               NA  010206            6
 7 saarg          NA            NA               NA                  NA               NA  010207            6
 8 saarg    41842074.            0.14             0.163         5146236.        22203666. 010302            6
 9 saarg      621080.            1.93            -0.135          409185.         -661807. 010303            6
10 saarg    20324918.            0.287            0.231         4534606.        13148256. 010304            6
# … with 2,548 more rows, and 7 more variables: import_val , import_val_growth_pct ,
#   import_val_growth_pct_5 , import_val_growth_val , import_val_growth_val_5 , origin_id ,
#   year

Finally, I get something closer to what can be obtained with the OTS API.
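If product names were also needed, one option would be to drop the two-digit community prefix so that the remaining four digits line up with official HS codes. This is a minimal sketch under that assumption, not part of the original comparison:

library(dplyr)

# Strip the leading "product community" digits (e.g. the "15" in "157325")
# so the remaining 4-digit codes can be joined against official HS tables
oec_hs <- as_tibble(oec_data) %>%
  filter(hs07_id_len == 6) %>%
  mutate(product_code = substr(hs07_id, 3, 6))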

    The Atlas of Economic Complexity

I couldn’t find documentation for this API, but I'll still try to replicate the result from the OTS API (I obtained the URL by using the Firefox inspector on their website):

# Function to read each combination reporter-partners
read_from_atlas <- function(p) {
  fromJSON(sprintf("http://atlas.cid.harvard.edu/api/data/location/42/hs_products_by_partner/%s/?level=4digit", p))
}

# Getting to know these codes required web scraping from http://atlas.cid.harvard.edu/explore
# These codes don't follow UN COMTRADE numeric codes, which are an alternative to ISO codes
destination <- c("8", "31", "173")

# The resulting JSON doesn't work with map_df either
# This can still be combined without much hassle
atlas_data <- map(destination, read_from_atlas)
atlas_data <- bind_rows(atlas_data[[1]]$data, atlas_data[[2]]$data, atlas_data[[3]]$data)

# Preview the data
as_tibble(atlas_data)

## # A tibble: 59,518 x 6
##    export_value import_value location_id partner_id product_id  year
##  1        23838       413061          42          8        650  1995
##  2       172477       368650          42          8        650  1996
##  3       146238       310383          42          8        650  1997
##  4        69139       141525          42          8        650  1998
##  5        79711        97951          42          8        650  1999
##  6        85042       392098          42          8        650  2000
##  7       463361       252611          42          8        650  2001
##  8       191069       186278          42          8        650  2002
##  9        88566       106782          42          8        650  2003
## 10       234638       113184          42          8        650  2004
## # … with 59,508 more rows

Post-filtering by year is required, as more years are returned than were requested:

as_tibble(atlas_data) %>%
  filter(year == 2016)

## # A tibble: 2,718 x 6
##    export_value import_value location_id partner_id product_id  year
##  1       463809      1074354          42          8        650  2016
##  2            0        17189          42          8        655  2016
##  3            0    139638464          42          8        656  2016
##  4            0       507301          42          8        657  2016
##  5            0       212049          42          8        659  2016
##  6            0       124921          42          8        661  2016
##  7            0     20601067          42          8        662  2016
##  8     34454500            0          42          8        667  2016
##  9       211614       724851          42          8        668  2016
## 10     14944704        25975          42          8        669  2016
## # … with 2,708 more rows

Finally, I get something closer to what can be obtained with the OTS API. A major drawback is that the product id consists of numbers, which goes against the convention for HS codes: they are always handled as character strings, since some codes start with a zero.
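A tiny illustration, unrelated to any particular API, of why HS codes need to be stored as character:

# Treating HS codes as numbers silently drops the leading zero,
# so later joins on the code will fail
as.numeric("0101")
## [1] 101

# Keeping the codes as character (or re-padding them) preserves the zero
sprintf("%04d", 101L)
## [1] "0101"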

    R package

Even though the package simply connects to the API, it required a dedicated site with documentation and examples. Please check the documentation here.

Now that I’ve compared the APIs, I'll dig a bit into the R package we have prepared. If I want to obtain the same data as in the examples above, I can do this:

# install.packages("tradestatistics")
library(tradestatistics)

ots_create_tidy_data(
  years = 2016,
  reporters = "chl",
  partners = c("arg", "bol", "per")
)

# A tibble: 2,788 x 20
    year reporter_iso partner_iso reporter_fullna… partner_fullnam… product_code product_code_le… product_fullnam…
 1  2016 chl          arg         Chile            Argentina        0101         4                Horses, asses, …
 2  2016 chl          arg         Chile            Argentina        0106         4                Animals, n.e.c.…
 3  2016 chl          arg         Chile            Argentina        0201         4                Meat of bovine …
 4  2016 chl          arg         Chile            Argentina        0202         4                Meat of bovine …
 5  2016 chl          arg         Chile            Argentina        0204         4                Meat of sheep o…
 6  2016 chl          arg         Chile            Argentina        0206         4                Edible offal of…
 7  2016 chl          arg         Chile            Argentina        0207         4                Meat and edible…
 8  2016 chl          arg         Chile            Argentina        0302         4                Fish; fresh or …
 9  2016 chl          arg         Chile            Argentina        0303         4                Fish; frozen (e…
10  2016 chl          arg         Chile            Argentina        0304         4                Fish fillets an…
# … with 2,778 more rows, and 12 more variables: group_code , group_name , export_value_usd ,
#   import_value_usd , export_value_usd_change_1_year , export_value_usd_change_5_years ,
#   export_value_usd_percentage_change_1_year , export_value_usd_percentage_change_5_years ,
#   import_value_usd_change_1_year , import_value_usd_change_5_years ,
#   import_value_usd_percentage_change_1_year , import_value_usd_percentage_change_5_years

Here the added value is that the package does all the work of combining the data, and it performs some joins for you to add country names, product names and the full product category/community description.

Even for more demanding queries, the functions in this package remain simple. For example, if I need several years and, instead of product-level data, just aggregated bilateral flows from all countries in America to all countries in Asia, this is how to obtain that data:

ots_create_tidy_data(
  years = 2010:2017,
  reporters = "c-am",
  partners = "c-as",
  table = "yr"
)

# A tibble: 386 x 21
    year reporter_iso reporter_fullna… export_value_usd import_value_usd top_export_prod… top_export_trad…
 1  2010 aia          Anguilla                 12165731         64287919 8514                      3274981
 2  2010 ant          Neth. Antilles         1631080123       2966955978 2710                   1229297847
 3  2010 arg          Argentina             76056875101      64416501373 2304                   9352050413
 4  2010 atg          Antigua and Bar…       2464746725       2573456652 8703                    263196190
 5  2010 bhs          Bahamas                3139761427      12310398156 2710                   1473528434
 6  2010 blz          Belize                  459835990       1053385171 2709                    130371691
 7  2010 bmu          Bermuda                 718819987       5021642429 8903                    531493968
 8  2010 bol          Bolivia                7754773449       7197859978 2711                   2797774138
 9  2010 bra          Brazil               241714684212     225037224410 2601                  37576257058
10  2010 brb          Barbados               1097175359       2655154217 8481                    110615870
# … with 376 more rows, and 14 more variables: top_import_product_code ,
#   top_import_trade_value_usd , export_value_usd_change_1_year ,
#   export_value_usd_change_5_years , export_value_usd_percentage_change_1_year ,
#   export_value_usd_percentage_change_5_years , import_value_usd_change_1_year ,
#   import_value_usd_change_5_years , import_value_usd_percentage_change_1_year ,
#   import_value_usd_percentage_change_5_years , eci_4_digits_product_code ,
#   eci_rank_4_digits_commodity_code , eci_rank_4_digits_commodity_code_delta_1_year ,
#   eci_rank_4_digits_commodity_code_delta_5_years

How to contribute?

    If you are interested in contributing to this project send us a tweet or an email. We’d also like to read ideas not listed here.

    Here’s a list of ideas for future work:

• Crops data: I got suggestions to include crops data from the Food and Agriculture Organization of the United Nations (FAO), to make it easy to compare volumes and exported values of crop/commodity groups. The problem is that UN COMTRADE and FAO group their data into different categories.

• R package for Economic Complexity: We have a set of R and Rcpp functions, such as revealed comparative advantage, the Economic Complexity Index and other related measures, that might lead to a new package (a small sketch of the revealed comparative advantage computation appears after this list).

• D3plus htmlwidget: D3plus is an open-source D3-based library released under the MIT license. It can be a good alternative to Highcharts based on the license type. I've been working in my spare time on a D3plus htmlwidget that integrates with Shiny.

    • Network layouts: We are in the middle of creating the Product Space for the HS rev. 2007. The idea is to provide a visualization that accounts for products not included in the networks used both by the Atlas and the Observatory, which use HS rev. 1992 and therefore do not reflect the huge changes, especially in electronics, in the last two decades.
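For reference, revealed comparative advantage is the classic Balassa index: a country's export share in a product divided by the world's export share in that product. Below is a minimal dplyr sketch of that computation (my own illustration, not the project's Rcpp implementation), assuming a data frame like ots_data with reporter_iso, product_code and export_value_usd columns:

library(dplyr)

rca_table <- ots_data %>%
  group_by(reporter_iso, product_code) %>%
  summarise(export_value_usd = sum(export_value_usd, na.rm = TRUE), .groups = "drop") %>%
  mutate(world_total = sum(export_value_usd)) %>%
  group_by(reporter_iso) %>%
  mutate(country_total = sum(export_value_usd)) %>%
  group_by(product_code) %>%
  mutate(product_total = sum(export_value_usd)) %>%
  ungroup() %>%
  # Balassa index: (x_cp / x_c) / (x_p / x_world)
  mutate(rca = (export_value_usd / country_total) / (product_total / world_total))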



    Introducing RStudio Team

    Thu, 05/09/2019 - 02:00

    (This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

    RStudio is excited to announce RStudio Team, a new software bundle that makes it easier and more economical to adopt our commercially licensed and supported professional offerings.

    RStudio Team includes RStudio Server Pro, RStudio Connect, and RStudio Package Manager. With RStudio Team, your data science team will be properly equipped to analyze data at scale using R and Python; manage R packages; and create and share plots, Shiny apps, R Markdown documents, REST APIs (with plumber), and even Jupyter Notebooks, with your entire organization.

    Contact Sales
    Configure your own RStudio Team
    Evaluate RStudio Team QuickStart VM
    Learn More

    RStudio Team Standard and RStudio Team Enterprise

    RStudio Team is available in two configurations: Standard and Enterprise.

RStudio Team Standard
• Number of RStudio Server Pro Users: 5+
• Number of RStudio Connect Users: 20+
• RStudio Package Manager Version: Base or Standard
• Number of Key Activations: One per product*
• Staging or High Availability Servers: Optional Purchase

RStudio Team Enterprise
• Number of RStudio Server Pro Users: 10+
• Number of RStudio Connect Users: 100+
• RStudio Package Manager Version: Enterprise
• Number of Key Activations: Unrestricted
• Staging or High Availability Servers: Included

    RStudio Team Standard – an affordable place to start for smaller teams

    Team Standard fits the needs and budgets of smaller businesses and data science departments. For 5 RStudio Server Pro users and 20 RStudio Connect users sharing RStudio Package Manager Base, RStudio Team Standard starts at $22,000 per year. SMB and Academic discounts can further reduce the starting price, ensuring that every professional data science team can afford to start out with the right solution.

    RStudio Team Enterprise – unrestricted servers for larger deployments

    Team Enterprise has it all. For 10 RStudio Server Pro users and 100 RStudio Connect users sharing RStudio Package Manager Enterprise, RStudio Team Enterprise starts at $58,000, including all development, production, staging, high-availability, and disaster-recovery servers you require. With RStudio Team Enterprise, servers are unrestricted so you can deploy the professional data science configuration your teams need and support the virtualized IT infrastructure your organization wants. SMB, Academic, and volume discounts also apply.

    Visit our pricing page to get your estimate for RStudio Team

    RStudio Team is a new bundled offering from RStudio that combines RStudio Server Pro, RStudio Connect, and RStudio Package Manager products. Now organizations using the open-source ecosystems of Python and R have what they need to analyze data effectively and inform critical business decisions across the enterprise with convenience, simplicity, and savings. Visit our pricing page to estimate your own price for RStudio Team. Except for a few RStudio Team Standard options, all you need to provide is the number of RStudio Server Pro users you want and the number of RStudio Connect users you will have. For most scenarios, we will estimate your total price on the spot. Please contact us at sales@rstudio.com to discuss your unique circumstances or to get a formal price quote.



    Introducing RStudio Server Pro 1.2 – New Features, Packaging, and Pricing

    Thu, 05/09/2019 - 02:00

    (This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

    We are excited to announce the general availability of RStudio Server Pro 1.2, and to introduce RStudio Server Pro Standard and RStudio Server Pro Enterprise.

    RStudio customers have made it clear to us that the future of data analysis is moving to a more elastic and flexible computational model. RStudio Server Pro now allows you to execute R processes remotely, and introduces new packaging and pricing to reflect this new trend.

    Contact Sales
    Learn More
    Try a 45 Day Evaluation

    New RStudio Server Pro Features

    The most significant new feature of RStudio Server Pro is Launcher, the long-awaited capability to separate the execution of R processes from the server where RStudio Server Pro is installed. Launcher allows you to run RStudio sessions and ad-hoc R scripts within your existing cluster workload managers, so you can leverage your current infrastructure instead of provisioning load balancer nodes manually. Now organizations that want to use Kubernetes or other job managers can run interactive sessions or batch jobs remotely and scale them independently.

    Learn more about Launcher here.

    Other features exclusive to RStudio Server Pro 1.2 include improved R version management and enhanced configuration reload.

In addition, RStudio Server Pro has all of the new features of RStudio v1.2.

    New Named User Packaging and Pricing for RStudio Server Pro

    Effective today, RStudio is introducing RStudio Server Pro Standard and RStudio Server Pro Enterprise. Standard and Enterprise are priced per user, and include the remote Launcher feature. Existing RStudio Server Pro customers may upgrade to Standard or Enterprise to take advantage of the remote Launcher feature, or continue to purchase RStudio Server Pro without Launcher under their current terms.

    RStudio Server Pro Standard for smaller teams – an affordable place to start

    RStudio Server Pro Standard is more affordable for smaller teams. Instead of $9,995 per year at a minimum, RStudio Server Pro Standard is now only $4,975 per year for 5 users on a single server. Additional Named Users are $995 each per year. Additional “Staging” servers for testing and “High Availability” servers for load balancing user sessions are optional. SMB, Academic, Volume, or Bundle discounts may apply.

    RStudio Server Pro Enterprise for larger teams – unrestricted servers

    RStudio Server Pro Enterprise eliminates restrictions on servers for larger teams. Containerized IT infrastructure using tools such as Docker and Kubernetes are increasingly popular. Larger teams often need more development, staging, production, and high availability servers and find license key management troublesome. RStudio Server Pro Enterprise starts at only $11,950 per year, allowing up to 10 Named Users to use the software on as many servers as needed. Additional Named Users are $1,195 each per year. SMB, Academic, Volume, or Bundle discounts may apply.

    The table below summarizes our new pricing and packages for RStudio Server Pro; visit RStudio Pricing on our website to learn more.

    RStudio Server Pro 1.2 Packaging and Pricing

RStudio Server Pro Enterprise
• Annual Price: $11,995
• Named Users: 10
• Launcher: Yes
• License Type: Named User
• Server Licenses: Unrestricted
• Additional Named Users: $1,195 each

RStudio Server Pro Standard
• Annual Price: $4,975
• Named Users: 5
• Launcher: Yes
• License Type: Named User per Server
• Server Licenses: One included; staging and high availability servers available for an additional charge
• Additional Named Users: $995 each

RStudio Server Pro*
• Annual Price: $9,995
• Named Users: Unlimited
• Launcher: No
• License Type: Per Server
• Server Licenses: One; staging servers available for an additional charge

    *RStudio Server Pro per server licensing is available only to existing RStudio Server Pro customers who have purchased prior to May 2019.

    For questions about RStudio Server Pro, RStudio Server Pro Standard, or RStudio Server Pro Enterprise, please contact sales@rstudio.com or your customer success representative.

Additional information is also available in our Support FAQ.


