### RStudio Connect v1.6.4.2 – Security Update

(This article was first published on **RStudio Blog**, and kindly contributed to R-bloggers)

A security vulnerability in a third-party library used by RStudio Connect was uncovered during a security audit last week. We have confirmed that this vulnerability has not been used against any of the RStudio Connect instances we host, and are unaware of it being exploited on any customer deployments. Under certain conditions, this vulnerability could compromise the session of a user that was tricked into visiting a specially crafted URL. The issue affects all versions of RStudio Connect up to and including 1.6.4.1, but none of our other products. We have prepared a hotfix: v1.6.4.2.

RStudio remains committed to providing the most secure product possible. We regularly perform internal security audits against RStudio Connect in order to ensure the product’s security.

As part of the responsible disclosure process, we will provide additional details about the vulnerability and how to ensure that you have not been affected, in the coming weeks once customers have had time to update their systems. For now, **please update your RStudio Connect installations to version 1.6.4.2 as soon as possible**.

To **leave a comment** for the author, please follow the link and comment on their blog: **RStudio Blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Solver Interfaces in CVXR

(This article was first published on **R Views**, and kindly contributed to R-bloggers)

In our previous blog post, we introduced CVXR, an R package for disciplined convex optimization. The package allows one to describe an optimization problem with Disciplined Convex Programming rules using high-level mathematical syntax. Passing this problem definition (along with a list of constraints, if any) to the solve function transforms it into a form that can be handed off to a solver. The default installation of CVXR comes with two (imported) open-source solvers:

- ECOS and its mixed-integer cousin ECOS_BB, via the CRAN package ECOSolveR
- SCS, via the CRAN package scs

CVXR (version 0.99) can also make use of several other open-source solvers implemented in R packages:

- The linear and mixed-integer programming package lpSolve, via the lpSolveAPI package
- The linear and mixed-integer programming package GLPK, via the Rglpk package
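As a small, self-contained sketch of the workflow (the data here are illustrative and not taken from the post), one can define a problem in CVXR's syntax and hand it to the default solver:

```r
library(CVXR)

# illustrative data for a tiny nonnegative least-squares problem
A <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)
b <- c(7, 8, 9)

x <- Variable(2)                                  # decision variable
prob <- Problem(Minimize(sum_squares(A %*% x - b)),
                list(x >= 0))                     # optional constraint list
result <- solve(prob)                             # dispatched to ECOS by default
result$status
result$getValue(x)
```

The same problem object can later be re-solved with a different backend via the solver argument of solve.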

The real work of finding a solution is done by solvers, and writing good solvers is hard work. Furthermore, some solvers work particularly well for certain types of problems (linear programs, quadratic programs, etc.). Not surprisingly, there are commercial vendors whose solvers are designed for performance and scale. Two well-known solvers are MOSEK and GUROBI. R packages for these solvers are also provided, but they require the problem data to be constructed in a specific form. This necessitates a bit of work in the current version of CVXR and is certainly something we plan to include in future versions. However, it is also true that these commercial solvers expose a much richer API to Python programmers than to R programmers. How, then, do we interface such solvers with R as quickly as possible, at least in the short term?

The current version of CVXR exploits the reticulate package for commercial solvers such as MOSEK and GUROBI. We took the Python solver interfaces in CVXPY version 0.4.11, edited them suitably to make them self-contained, and hooked them up to reticulate. This means that one needs two prerequisites to use these commercial solvers in the current version of CVXR:

- A Python installation
- The reticulate R package

Both MOSEK and GUROBI provide academic versions (registration required) free of charge. For example, Anaconda users can install MOSEK via conda; others can use the pip command:

```
pip install -f https://download.mosek.com/stable/wheel/index.html Mosek
```

GUROBI is handled in a similar fashion. The solvers must be activated using a license provided by the vendor. Once activated, one can check that CVXR recognizes the solvers; installed_solvers() should list them.
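For instance (a sketch; the exact set of names returned depends on your installation):

```r
library(CVXR)

# the default install should report at least the bundled open-source solvers
solvers <- installed_solvers()
print(solvers)

# a specific backend can then be requested explicitly when solving, e.g.
# result <- solve(prob, solver = "MOSEK")
```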

More information on these solvers, along with a number of tutorial examples, is available on the CVXR site. If you are attending useR! 2018, you can catch Anqi’s CVXR talk on Friday, July 13.

Original post: https://rviews.rstudio.com/2018/07/09/solver-interfaces-in-cvxr/


### Time Series Analysis With Documentation And Steps I Follow For Analytics Projects.

(This article was first published on **R-Analytics**, and kindly contributed to R-bloggers)

To demonstrate this, I will create a prediction of the open values for Bitcoin for the next 3 days.

The process I follow is based on CRISP-DM methodology: https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome

1.- Planning the activities.

To plan the activities I use a spreadsheet. A sample is shown below; if you would like the document, please use the following link:

https://github.com/LaranIkal/R-ANALYTICS/blob/master/BitCoinPredictionsAdmin%26FollowUp.ods

| Activity | Activity Description | DueDate | Activity Owner | Status | Comments |
|---|---|---|---|---|---|
| Functional Requirement Specification | A text document explaining the objectives of this project. | 4/19/2018 | Carlos Kassab | Done | |
| Get Data For Analysis | Get initial data to create feasibility analysis. | 4/19/2018 | Carlos Kassab | Done | |
| ETL Development | ETL to get final data for next analysis. | 4/20/2018 | Carlos Kassab | Done | 2018/04/19: In this case, no ETL was needed; the dataset was downloaded from Kaggle: https://www.kaggle.com/vivekchamp/bitcoin/data |
| Exploratory Data Analysis | Dataset summary and histogram to know the data normalization. | 4/20/2018 | Carlos Kassab | In progress | |
| Variables Frequency | Frequency of variable occurrence (frequency of value changes, etc.). | 4/20/2018 | Carlos Kassab | Done | 2018/04/19: We have already seen that our values change every day. |
| Outliers Analysis | Analysis of variability in numeric variables, shown in charts and grids. | 4/20/2018 | Carlos Kassab | In progress | |
| Time Series Decomposition | Getting metric charts: raw data, seasonality, trend and remainder. | 4/20/2018 | Carlos Kassab | In progress | |
| Modelling | Create the analytics model. | 4/25/2018 | Carlos Kassab | Not started | |
| SQL View Development For Training, Validation And Testing | | NA | Carlos Kassab | Not started | 2018/04/19: No SQL view needed; everything is done inside the R script. |
| Model Selection | Use a random parameter search algorithm to find the right model for this data. | 4/25/2018 | Carlos Kassab | Not started | |
| Model Fine Tuning | After finding the right algorithm, find the right model parameters to use. | 4/25/2018 | Carlos Kassab | Not started | |
| Chart Development | Final data chart development. | 4/25/2018 | Carlos Kassab | Not started | |
| Data Validation | Run the analytics model daily for at least 2 weeks to observe its behavior. | NA | Carlos Kassab | Not started | |
| Deployment | Schedule the automatic execution of the R code. | NA | Carlos Kassab | Not started | |

The first activity is the functional specification; this is similar to business understanding in the CRISP-DM methodology.

For this analysis I use a text document; you can get it from this link:

https://github.com/LaranIkal/R-ANALYTICS/blob/master/BitCoinPredictionRequirementSpecification.odt

Now, the next step is to get the data for analysis and create the ETL script. In this case we simply got the data from kaggle.com, as mentioned in the documentation, so no ETL script was needed.

So, we have our data; now we are going to analyze it, carrying out all the activities highlighted in the grid above. I did this in an R Markdown document; here you can see the HTML output:

https://github.com/LaranIkal/R-ANALYTICS/blob/master/BitcoinDataAnalysis.html

Note: at the end, in order to have everything together, I included the time series algorithms in the same Rmd file that creates the BitcoinDataAnalysis.html file.
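For the time-series-decomposition step listed in the plan, a minimal base-R sketch (using simulated data in place of the Bitcoin series, which is not reproduced here) looks like this:

```r
# simulate two years of daily "open" values with weekly seasonality
set.seed(42)
open_values <- ts(cumsum(rnorm(730, mean = 0.1)) +
                    5 * sin(2 * pi * seq_len(730) / 7),
                  frequency = 7)

# decompose into seasonal, trend and remainder components
dec <- stl(open_values, s.window = "periodic")
head(dec$time.series)   # columns: seasonal, trend, remainder

# plot(dec) would show the raw data, seasonality, trend and remainder panels
```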

You can get all the sources from:

https://github.com/LaranIkal/R-ANALYTICS

Enjoy it!

Carlos Kassab, https://www.linkedin.com/in/carlos-kassab-48b40743/

More information about R: https://www.r-bloggers.com


### a thread to bin them all [puzzle]

(This article was first published on **R – Xi'an's Og**, and kindly contributed to R-bloggers)

**T**he most recent riddle on the Riddler consists in finding the shortest sequence of digits (in 0,1,…,9) such that all 10⁴ numbers between 0 (or 0000) and 9,999 can be found as a group of four consecutive digits. This sequence is obviously no shorter than 10⁴+3, but how long is it? On my trip to Brittany last weekend, I wrote an R code first constructing the sequence at random by picking with high preference the next digit among those producing a new four-digit number

which usually returns a value above 75,000. I then looked through the sequence to eliminate useless replicas

```r
for (i in sample(4:(length(snak) - 5))) {
  if ((seqz[wn2dg(snak[(i-3):i])] > 1) &
      (seqz[wn2dg(snak[(i-2):(i+1)])] > 1) &
      (seqz[wn2dg(snak[(i-1):(i+2)])] > 1) &
      (seqz[wn2dg(snak[i:(i+3)])] > 1)) {
    seqz[wn2dg(snak[(i-3):i])] = seqz[wn2dg(snak[(i-3):i])] - 1
    seqz[wn2dg(snak[(i-2):(i+1)])] = seqz[wn2dg(snak[(i-2):(i+1)])] - 1
    seqz[wn2dg(snak[(i-1):(i+2)])] = seqz[wn2dg(snak[(i-1):(i+2)])] - 1
    seqz[wn2dg(snak[i:(i+3)])] = seqz[wn2dg(snak[i:(i+3)])] - 1
    snak = snak[-i]
    seqz[wn2dg(snak[(i-3):i])] = seqz[wn2dg(snak[(i-3):i])] + 1
    seqz[wn2dg(snak[(i-2):(i+1)])] = seqz[wn2dg(snak[(i-2):(i+1)])] + 1
    seqz[wn2dg(snak[(i-1):(i+2)])] = seqz[wn2dg(snak[(i-1):(i+2)])] + 1
  }
}
```

repeating until no further removal is found. A first attempt produced 12,911 terms in the sequence. A second one 12,913. A third one 12,871. Rather consistent figures, but not concentrated enough to believe a true minimum has been achieved. An overnight run produced 12,779 as the lowest value.
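The helper wn2dg is not shown in the excerpt; a plausible base-R reconstruction (an assumption about the author's code, not taken from it) maps a window of four digits to a 1-based index into the 10⁴ counts vector:

```r
# hypothetical reconstruction: turn a window of four digits (each 0-9)
# into a 1-based index for the counts vector seqz of length 10^4
wn2dg <- function(w) 1 + sum(w * 10^(3:0))

wn2dg(c(0, 0, 0, 0))  # -> 1
wn2dg(c(9, 9, 9, 9))  # -> 10000
```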


### World Income, Inequality and Murder

(This article was first published on **Theory meets practice...**, and kindly contributed to R-bloggers)

We follow up on last weeks post on using Gapminder data to study the world’s income distribution. In order to assess the inequality of the distribution we compute the Gini coefficient for the world’s income distribution by Monte Carlo approximation and visualize the result as a time series. Furthermore, we animate the association between Gini coefficient and homicide rate per country using the new version of gganimate.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.

**Introduction**

One of the main messages of the chapter ‘The Gap Instinct’ of the book *Factfulness* is that there is no justification for the ‘we’ and ‘them’ classification of countries anymore, because they have converged towards the same levels in key indicators such as life expectancy, child mortality, and births per female. The difference between countries is, hence, not as many imagine it to be: there is less inequality and no real gap. While reading, I became curious about the following: **what if countries became more equal, but simultaneously inequality within countries became bigger?** This was also indirectly a Disqus comment by F. Weidemann to the post *Factfulness: Building Gapminder Income Mountains*. The aim of the present post is to investigate this hypothesis using the Gapminder data. Furthermore, we use the country-specific Gini coefficients to investigate the association with the number of homicides in the country.

There are different ways to measure income inequality, both in terms of which response you consider and which statistical summary you compute for it. Without going into the details of this discussion, we use the GDP/capita in Purchasing Power Parities (PPP) measured in so-called international dollars (fixed prices 2011). In other words, comparisons between years and countries are possible, because the response is adjusted for inflation and differences in cost of living.

The **Gini coefficient** is a statistical measure to quantify inequality. In what follows we shall focus on computing the Gini coefficient for a continuous probability distribution with a known probability density function. Let the probability density function of the non-negative continuous income distribution be defined by \(f\), then the Gini coefficient is given as **half the relative mean difference**:

\[
G = \frac{1}{2\mu}\int_0^\infty \int_0^\infty |x-y| \, f(x) \, f(y) \, dx \, dy, \quad\text{where}\quad \mu = \int_{0}^\infty x\cdot f(x) \, dx.
\]

Depending on \(f\) it might be possible to solve these integrals analytically; however, a straightforward computational approach is to use Monte Carlo sampling, as we shall see shortly. Personally, I find the above relative-mean-difference presentation of the Gini index much more intuitive than the area argument using the Lorenz curve. From the equation it also becomes clear that the Gini coefficient is invariant to multiplicative changes in income: if everybody increases their income by a factor \(k>0\) then the Gini coefficient remains the same, because \(|kx - ky| = k|x - y|\) and \(E(k \cdot X) = k\mu\).

The above formula indirectly also states how to compute the Gini coefficient for a discrete sample of size \(n\) with incomes \(x_1,\ldots, x_n\):

\[
G = \frac{\sum_{i=1}^n \sum_{j=1}^n |x_i - x_j| \frac{1}{n} \frac{1}{n}}{2 \sum_{i=1}^n x_i \frac{1}{n}} = \frac{\sum_{i=1}^n \sum_{j=1}^n |x_i - x_j|}{2 n \sum_{j=1}^n x_j}.
\]
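The discrete formula translates directly into a few lines of R (a sketch; the function name gini_sample is ours, not from the post):

```r
# Gini coefficient of a finite sample, straight from the formula above:
# sum of all pairwise absolute differences over 2 * n * total income
gini_sample <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (2 * n * sum(x))
}

gini_sample(rep(1, 4))  # perfectly equal incomes -> 0
gini_sample(c(0, 1))    # maximal two-person inequality -> 0.5
```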

If one is able to easily sample from \(f\), then instead of solving the integral analytically one can use \(K\) pairs \((x_k, y_k)\), each drawn at random from \(f\), to approximate the double integral:

\[
G \approx \frac{1}{2\mu K} \sum_{k=1}^K |x_k - y_k|, \quad\text{where}\quad
x_k \stackrel{\text{iid}}{\sim} f \text{ and } y_k \stackrel{\text{iid}}{\sim} f,
\]

where for our mixture model

\[
\mu = \sum_{i=1}^{192} w_i \, E(X_i) = \sum_{i=1}^{192} w_i \exp\left(\mu_i + \frac{1}{2}\sigma_i^2\right).
\]

This allows us to compute \(G\) even in the case of a complex \(f\) such as the log-normal mixture distribution. As always, the larger \(K\) is, the better the Monte Carlo approximation.
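For a single log-normal component this Monte Carlo recipe can be checked against the known closed form \(G = 2\Phi(\sigma/\sqrt{2}) - 1\) (the meanlog/sdlog values below are illustrative, not Gapminder estimates):

```r
set.seed(123)
K <- 1e5
meanlog <- 0
sdlog <- 1

# draw K iid pairs from f and average the absolute differences
x <- rlnorm(K, meanlog, sdlog)
y <- rlnorm(K, meanlog, sdlog)
mu <- exp(meanlog + sdlog^2 / 2)          # E(X) of the log-normal
gini_mc <- mean(abs(x - y)) / (2 * mu)

# closed-form Gini coefficient of a log-normal distribution
gini_exact <- 2 * pnorm(sdlog / sqrt(2)) - 1
c(gini_mc, gini_exact)                    # should agree to ~2 decimals
```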

We start by showing the country specific Gini coefficient per year since 1950 for a somewhat arbitrary selection of countries. The dashed black line shows the mean Gini coefficient each year over all 192 countries in the dataset.

In addition, we now show the Monte Carlo computed Gini coefficient for the world’s income distribution given as a log-normal mixture with 192 components.

We notice that the Gini coefficient taken over the 192 countries’ GDP/capita remains very stable over time. This, however, does not take the large differences in populations between countries into account. A fairer measure is thus the Gini coefficient for the world’s income distribution. We see that this Gini coefficient increased over time until peaking around 1990. From then on it has declined. However, the pre-1950 Gini coefficients are rather guesstimates as stated by Gapminder, hence, we zoom in on the period from 1970, because data are more reliable from this point on.

**Gini coefficient and Homicide Rate**

Finally, we end the post by illustrating the association between the Gini coefficient and the homicide rate per country using a 2D scatterplot over the years. The Gapminder data download page also contains homicide data for the years 1950-2005. Unfortunately, no data for more recent years are available from the Gapminder homepage; the plot shown below is the situation in the last available year, with a log-base-10 y-axis for the homicide rates. For each of the four Gapminder regional groups we also fit a simple linear regression line to the points of all countries within the region. Findings such as Fajnzylber, Lederman, and Loayza (2002) suggest that there is a strong positive correlation between Gini coefficient and homicide rate, but we see from the plot that there are regional differences, and of course correlation is not causality. Explanations for this phenomenon are debated and beyond the scope of this post.

We extend the plots to all years 1950-2005. Unfortunately, not all countries are available every year, so we only plot the available countries each year. This means that many African countries are missing from the animation; an improvement would be to try some form of linear interpolation. Furthermore, for simplicity of illustration, we fix countries with a reported murder rate of zero in a given year (as happens, for example, for Cyprus, Iceland and Fiji in some years) to 0.01 per 100,000 population. This can be nicely animated using the new version of the gganimate package by Thomas Lin Pedersen.

```r
## New version of gganimate. Not on CRAN yet.
## devtools::install_github('thomasp85/gganimate')
require(gganimate)
p <- ggplot(gm2_nozero, aes(x = gini, y = murder_rate,
                            size = population, color = Region)) +
  geom_point() +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(trans = "log10",
    breaks = trans_breaks("log10", function(x) 10^x, n = 5),
    labels = trans_format("log10", function(x)
      ifelse(x < 0,
             sprintf(paste0("%.", ifelse(is.na(x), "0", round(abs(x))), "f"), 10^x),
             sprintf("%.0f", 10^x)))) +
  geom_smooth(se = FALSE, method = "lm", formula = y ~ x) +
  geom_text(data = gm2, aes(x = gini, y = murder_rate, label = country),
            vjust = -0.9, show.legend = FALSE) +
  ylab("Murder rate (per 100,000 population)") +
  xlab("Gini coefficient (in %)") +
  guides(size = FALSE) +
  labs(title = 'Year: {frame_time}') +
  transition_time(year) +
  ease_aes('linear')
animate(p, nframes = length(unique(gm2$year)), fps = 4,
        width = 800, height = 400, res = 100)
```

**Discussion**

Based on the available Gapminder data we showed that in the last 25 years the Gini coefficient of the world’s income distribution has decreased. For several individual countries the opposite dynamic is, however, observed. One particular concern is the share that the richest 1% have of the overall wealth: more than 50%.

**Literature**

Fajnzylber, P., D. Lederman, and N. Loayza. 2002. “Inequality and Violent Crime.” *Journal of Law and Economics* 45 (April). http://siteresources.worldbank.org/DEC/Resources/Crime&Inequality.pdf.


### Statistics Sunday: Mixed Effects Meta-Analysis

(This article was first published on **Deeply Trivial**, and kindly contributed to R-bloggers)

As promised, how to conduct mixed effects meta-analysis in R:

Code used in the video is available here. And I’d recommend the following posts to provide background for this video:

- What is meta-analysis?
- Introduction to effect sizes
- Variance and weights in meta-analysis
- Video on conducting fixed and random effects meta-analysis in R
- And the homepage for the metafor package


### Speed up your R Work

(This article was first published on **R – Win-Vector Blog**, and kindly contributed to R-bloggers)

In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base-R and other packages.

For each of the above packages we speed up work by using wrapr::execute_parallel which in turn uses wrapr::partition_tables to partition un-related data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines.

The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.
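The partition-then-distribute idea can be sketched with base R alone (a simplified illustration of the principle, not the wrapr implementation):

```r
library(parallel)

# toy data: key_group marks sets of rows that must stay together
d <- data.frame(key_group = rep(c("a", "b", "c", "d"), each = 25),
                value = runif(100))

cl <- makeCluster(2)
parts <- split(d, d$key_group)       # one data.frame per group
res <- parLapply(cl, parts, function(p) {
  # per-partition work; here just a group total
  data.frame(key_group = p$key_group[1], total = sum(p$value))
})
stopCluster(cl)
result <- do.call(rbind, res)        # reassemble the partitioned results
```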

Keep in mind: unless the pipeline steps have non-trivial cost, the overhead of partitioning and distributing the work may overwhelm any parallel speedup. Also data.table itself already seems to exploit some thread-level parallelism (notice user time is greater than elapsed time). That being said, in this note we will demonstrate a synthetic example where computation is expensive due to a blow-up in an intermediate join step.

**Our example**

First we set up our execution environment and example (some details: OSX 10.13.4 on a 2.8 GHz Intel Core i5 Mac Mini (Late 2015 model) with 8GB RAM and a hybrid disk drive).

```r
library("rqdatatable")
## Loading required package: rquery
library("microbenchmark")
library("ggplot2")
library("WVPlots")
suppressPackageStartupMessages(library("dplyr"))
## Warning: package 'dplyr' was built under R version 3.5.1

base::date()
## [1] "Sun Jul 8 09:05:25 2018"
R.version.string
## [1] "R version 3.5.0 (2018-04-23)"
parallel::detectCores()
## [1] 4
packageVersion("parallel")
## [1] '3.5.0'
packageVersion("rqdatatable")
## [1] '0.1.2'
packageVersion("rquery")
## [1] '0.5.1'
packageVersion("dplyr")
## [1] '0.7.6'

ncore <- parallel::detectCores()
print(ncore)
## [1] 4
cl <- parallel::makeCluster(ncore)
print(cl)
## socket cluster with 4 nodes on host 'localhost'

set.seed(2362)

mk_example <- function(nkey, nrep, ngroup = 20) {
  keys <- paste0("key_", seq_len(nkey))
  key_group <- sample(as.character(seq_len(ngroup)),
                      length(keys), replace = TRUE)
  names(key_group) <- keys
  key_table <- data.frame(
    key = rep(keys, nrep),
    stringsAsFactors = FALSE)
  key_table$data <- runif(nrow(key_table))
  instance_table <- data.frame(
    key = rep(keys, nrep),
    stringsAsFactors = FALSE)
  instance_table$id <- seq_len(nrow(instance_table))
  instance_table$info <- runif(nrow(instance_table))
  # groups should be no finer than keys
  key_table$key_group <- key_group[key_table$key]
  instance_table$key_group <- key_group[instance_table$key]
  list(key_table = key_table, instance_table = instance_table)
}

dlist <- mk_example(10, 10)
data <- dlist$instance_table
annotation <- dlist$key_table
```

**rquery / rqdatatable**

rquery and rqdatatable can implement a non-trivial calculation as follows.

```r
# possible data lookup: find rows that
# have lookup data <= info
optree <- local_td(data) %.>%
  natural_join(.,
               local_td(annotation),
               jointype = "INNER",
               by = "key") %.>%
  select_rows_nse(., data <= info) %.>%
  pick_top_k(.,
             k = 1,
             partitionby = "id",
             orderby = "data",
             reverse = "data",
             keep_order_column = FALSE) %.>%
  orderby(., "id")
cat(format(optree))
## table('data';
##   key,
##   id,
##   info,
##   key_group) %.>%
##  natural_join(.,
##   table('annotation';
##     key,
##     data,
##     key_group),
##   j= INNER, by= key) %.>%
##  select_rows(.,
##    data <= info) %.>%
##  extend(.,
##   row_number := row_number(),
##   p= id,
##   o= "data" DESC) %.>%
##  select_rows(.,
##    row_number <= 1) %.>%
##  drop_columns(.,
##    row_number) %.>%
##  orderby(., id)

res1 <- ex_data_table(optree)
head(res1)
##         data id      info   key key_group
## 1: 0.9152014  1 0.9860654 key_1        20
## 2: 0.5599810  2 0.5857570 key_2         8
## 3: 0.3011882  3 0.3334490 key_3        10
## 4: 0.3650987  4 0.3960980 key_4         5
## 5: 0.1469254  5 0.1753649 key_5        14
## 6: 0.2567631  6 0.3510280 key_6         7
nrow(res1)
## [1] 94
```

And we can execute the operations in parallel.

```r
parallel::clusterEvalQ(cl, library("rqdatatable"))
## [[1]]
## [1] "rqdatatable" "rquery"      "stats"       "graphics"    "grDevices"
## [6] "utils"       "datasets"    "methods"     "base"
##
## [[2]]
## [1] "rqdatatable" "rquery"      "stats"       "graphics"    "grDevices"
## [6] "utils"       "datasets"    "methods"     "base"
##
## [[3]]
## [1] "rqdatatable" "rquery"      "stats"       "graphics"    "grDevices"
## [6] "utils"       "datasets"    "methods"     "base"
##
## [[4]]
## [1] "rqdatatable" "rquery"      "stats"       "graphics"    "grDevices"
## [6] "utils"       "datasets"    "methods"     "base"

res2 <- ex_data_table_parallel(optree, "key_group", cl)
head(res2)
##         data id      info   key key_group
## 1: 0.9152014  1 0.9860654 key_1        20
## 2: 0.5599810  2 0.5857570 key_2         8
## 3: 0.3011882  3 0.3334490 key_3        10
## 4: 0.3650987  4 0.3960980 key_4         5
## 5: 0.1469254  5 0.1753649 key_5        14
## 6: 0.2567631  6 0.3510280 key_6         7
nrow(res2)
## [1] 94
```

**data.table**

data.table can implement the same function.

```r
library("data.table")
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
##     between, first, last
packageVersion("data.table")
## [1] '1.11.4'

data_table_f <- function(data, annotation) {
  data <- data.table::as.data.table(data)
  annotation <- data.table::as.data.table(annotation)
  joined <- merge(data, annotation,
                  by = "key",
                  all = FALSE,
                  allow.cartesian = TRUE)
  joined <- joined[joined$data <= joined$info, ]
  data.table::setorderv(joined, cols = "data")
  joined <- joined[, .SD[.N], id]
  data.table::setorderv(joined, cols = "id")
}
resdt <- data_table_f(data, annotation)
head(resdt)
##    id   key      info key_group.x      data key_group.y
## 1:  1 key_1 0.9860654          20 0.9152014          20
## 2:  2 key_2 0.5857570           8 0.5599810           8
## 3:  3 key_3 0.3334490          10 0.3011882          10
## 4:  4 key_4 0.3960980           5 0.3650987           5
## 5:  5 key_5 0.1753649          14 0.1469254          14
## 6:  6 key_6 0.3510280           7 0.2567631           7
nrow(resdt)
## [1] 94
```

We can also run data.table in parallel using wrapr::execute_parallel.

```r
parallel::clusterEvalQ(cl, library("data.table"))
## [[1]]
##  [1] "data.table"  "rqdatatable" "rquery"      "stats"       "graphics"
##  [6] "grDevices"   "utils"       "datasets"    "methods"     "base"
##
## [[2]]
##  [1] "data.table"  "rqdatatable" "rquery"      "stats"       "graphics"
##  [6] "grDevices"   "utils"       "datasets"    "methods"     "base"
##
## [[3]]
##  [1] "data.table"  "rqdatatable" "rquery"      "stats"       "graphics"
##  [6] "grDevices"   "utils"       "datasets"    "methods"     "base"
##
## [[4]]
##  [1] "data.table"  "rqdatatable" "rquery"      "stats"       "graphics"
##  [6] "grDevices"   "utils"       "datasets"    "methods"     "base"

parallel::clusterExport(cl, "data_table_f")

dt_f <- function(tables_list) {
  data <- tables_list$data
  annotation <- tables_list$annotation
  data_table_f(data, annotation)
}

data_table_parallel_f <- function(data, annotation) {
  respdt <- wrapr::execute_parallel(
      tables = list(data = data,
                    annotation = annotation),
      f = dt_f,
      partition_column = "key_group",
      cl = cl) %.>%
    data.table::rbindlist(.)
  data.table::setorderv(respdt, cols = "id")
  respdt
}
respdt <- data_table_parallel_f(data, annotation)
head(respdt)
##    id   key      info key_group.x      data key_group.y
## 1:  1 key_1 0.9860654          20 0.9152014          20
## 2:  2 key_2 0.5857570           8 0.5599810           8
## 3:  3 key_3 0.3334490          10 0.3011882          10
## 4:  4 key_4 0.3960980           5 0.3650987           5
## 5:  5 key_5 0.1753649          14 0.1469254          14
## 6:  6 key_6 0.3510280           7 0.2567631           7
nrow(respdt)
## [1] 94
```

**dplyr**

dplyr can also implement the example.

```r
dplyr_pipeline <- function(data, annotation) {
  res <- data %>%
    inner_join(annotation, by = "key") %>%
    filter(data <= info) %>%
    group_by(id) %>%
    arrange(-data) %>%
    mutate(rownum = row_number()) %>%
    ungroup() %>%
    filter(rownum == 1) %>%
    arrange(id)
  res
}
resd <- dplyr_pipeline(data, annotation)
head(resd)
## # A tibble: 6 x 7
##   key      id  info key_group.x  data key_group.y rownum
##
## 1 key_1     1 0.986 20          0.915 20               1
## 2 key_2     2 0.586 8           0.560 8                1
## 3 key_3     3 0.333 10          0.301 10               1
## 4 key_4     4 0.396 5           0.365 5                1
## 5 key_5     5 0.175 14          0.147 14               1
## 6 key_6     6 0.351 7           0.257 7                1
nrow(resd)
## [1] 94
```

And we can use wrapr::execute_parallel to parallelize the dplyr solution.

```r
parallel::clusterEvalQ(cl, library("dplyr"))
## [[1]]
##  [1] "dplyr"       "data.table"  "rqdatatable" "rquery"      "stats"
##  [6] "graphics"    "grDevices"   "utils"       "datasets"    "methods"
## [11] "base"
##
## [[2]]
##  [1] "dplyr"       "data.table"  "rqdatatable" "rquery"      "stats"
##  [6] "graphics"    "grDevices"   "utils"       "datasets"    "methods"
## [11] "base"
##
## [[3]]
##  [1] "dplyr"       "data.table"  "rqdatatable" "rquery"      "stats"
##  [6] "graphics"    "grDevices"   "utils"       "datasets"    "methods"
## [11] "base"
##
## [[4]]
##  [1] "dplyr"       "data.table"  "rqdatatable" "rquery"      "stats"
##  [6] "graphics"    "grDevices"   "utils"       "datasets"    "methods"
## [11] "base"

parallel::clusterExport(cl, "dplyr_pipeline")

dplyr_f <- function(tables_list) {
  data <- tables_list$data
  annotation <- tables_list$annotation
  dplyr_pipeline(data, annotation)
}

dplyr_parallel_f <- function(data, annotation) {
  respdt <- wrapr::execute_parallel(
      tables = list(data = data,
                    annotation = annotation),
      f = dplyr_f,
      partition_column = "key_group",
      cl = cl) %>%
    dplyr::bind_rows() %>%
    arrange(id)
}
respdplyr <- dplyr_parallel_f(data, annotation)
head(respdplyr)
## # A tibble: 6 x 7
##   key      id  info key_group.x  data key_group.y rownum
##
## 1 key_1     1 0.986 20          0.915 20               1
## 2 key_2     2 0.586 8           0.560 8                1
## 3 key_3     3 0.333 10          0.301 10               1
## 4 key_4     4 0.396 5           0.365 5                1
## 5 key_5     5 0.175 14          0.147 14               1
## 6 key_6     6 0.351 7           0.257 7                1
nrow(respdplyr)
## [1] 94
```

**Benchmark**

We can benchmark the various realizations.

```r
dlist <- mk_example(300, 300)
data <- dlist$instance_table
annotation <- dlist$key_table

timings <- microbenchmark(
  data_table_parallel  = nrow(data_table_parallel_f(data, annotation)),
  data_table           = nrow(data_table_f(data, annotation)),
  rqdatatable_parallel = nrow(ex_data_table_parallel(optree, "key_group", cl)),
  rqdatatable          = nrow(ex_data_table(optree)),
  dplyr_parallel       = nrow(dplyr_parallel_f(data, annotation)),
  dplyr                = nrow(dplyr_pipeline(data, annotation)),
  times = 10L)

saveRDS(timings, "Parallel_rqdatatable_timings.RDS")
```

**Conclusion**

```r
print(timings)
## Unit: seconds
##                  expr       min        lq      mean    median        uq
##   data_table_parallel  5.274560  5.457105  5.609827  5.546554  5.686829
##            data_table  9.401677  9.496280  9.701807  9.541218  9.748159
##  rqdatatable_parallel  7.165216  7.497561  7.587663  7.563883  7.761987
##           rqdatatable 12.490469 12.700474 13.320480 12.898154 14.229233
##        dplyr_parallel  6.492262  6.572062  6.784865  6.787277  6.875076
##                 dplyr 20.056555 20.450064 20.647073 20.564529 20.800350
##        max neval
##   6.265888    10
##  10.419316    10
##   7.949404    10
##  14.282269    10
##   7.328223    10
##  21.332103    10

# autoplot(timings)
timings <- as.data.frame(timings)
timings$seconds <- timings$time/1e+9
ScatterBoxPlotH(timings,
                xvar = "seconds", yvar = "expr",
                title = "task duration by method")
```

The benchmark timings show parallelized data.table is the fastest, followed by parallelized dplyr and parallelized rqdatatable. In the non-parallelized case data.table is the fastest, followed by rqdatatable and then dplyr.

One reason dplyr sees a greater speedup relative to its own non-parallel implementation (yet does not beat parallelized data.table) is that data.table is already multi-threaded: it was exploiting some parallelism even before we added process-level parallelism, and hence sees less of an additional speedup, though it remains the fastest.
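For reference, data.table's implicit thread use can be inspected and pinned explicitly. A small sketch (the thread count reported will vary by machine), useful when you want process-level parallelism to have the cores to itself:

```r
library(data.table)

# data.table is multi-threaded out of the box; see how many threads it uses
getDTthreads()

# pin data.table to one thread so process-level parallelism
# (e.g. a parallel cluster) is not competing with it for cores
setDTthreads(1)

# setting 0 restores the default (use the cores data.table considers safe)
setDTthreads(0)
```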

rquery pipelines exhibit superior performance on big-data systems (Spark, PostgreSQL, Amazon Redshift, and hopefully soon Google BigQuery), and rqdatatable supplies a very good in-memory implementation of the rquery system based on data.table. rquery also speeds up solution development by supplying higher-order operators and early debugging features.

In this note we have demonstrated simple procedures to reliably parallelize any of rqdatatable, data.table, or dplyr.

Note: we did not include alternatives such as multidplyr or dtplyr in the timings, as they did not appear to work on this example.

Materials

The original rendering of this article can be found here, source code here, and raw timings here.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Win-Vector Blog**.

### Dealing with heteroskedasticity; regression with robust standard errors using R

(This article was first published on ** Econometrics and Free Software**, and kindly contributed to R-bloggers)

First of all, is it heteros**k**edasticity or heteros**c**edasticity? According to McCulloch (1985), heteros**k**edasticity is the proper spelling, because when transliterating Greek words, scientists use the Latin letter k in place of the Greek letter κ (kappa). κ is sometimes transliterated as the Latin letter c, but only when those words entered the English language through French, such as scepter.

Now that this is out of the way, we can get to the meat of this blogpost (foreshadowing pun).

A random variable is said to be heteroskedastic if its variance is not constant. For example, the variability of expenditures may increase with income. Richer families may spend a similar amount on groceries as poorer people, but some rich families will sometimes buy expensive items such as lobster. The variability of expenditures for rich families is thus quite large. However, the expenditures on food of poorer families, who cannot afford lobster, will not vary much.

Heteroskedasticity can also appear when data is clustered; for example, the variability of expenditures on food may vary from city to city, but be quite constant within a city.
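A tiny simulation makes the idea concrete. This is a sketch using base R only (the exact numbers depend on the random seed); the noise standard deviation is made proportional to income, so the residual spread widens as income grows:

```r
set.seed(1234)

income <- runif(500, min = 10, max = 100)
# expenditure noise grows with income: classic heteroskedasticity
expenditure <- 5 + 0.5 * income + rnorm(500, sd = 0.1 * income)

# residual spread is visibly wider at high incomes
plot(income, expenditure)

# compare residual variance in the bottom and top income quartiles
fit <- lm(expenditure ~ income)
var(residuals(fit)[income < quantile(income, 0.25)])
var(residuals(fit)[income > quantile(income, 0.75)])
```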

To illustrate this, let’s first load all the packages needed for this blog post:

```r
library(robustbase)
library(tidyverse)
library(sandwich)
library(lmtest)
library(modelr)
library(broom)
```

First, let's load and prepare the data:

```r
data("education")

education <- education %>%
  rename(residents = X1,
         per_capita_income = X2,
         young_residents = X3,
         per_capita_exp = Y,
         state = State) %>%
  mutate(region = case_when(
    Region == 1 ~ "northeast",
    Region == 2 ~ "northcenter",
    Region == 3 ~ "south",
    Region == 4 ~ "west"
  )) %>%
  select(-Region)
```

I will be using the education data set from the {robustbase} package. I renamed some columns and changed the values of the Region column. Now, let's do a scatterplot of per capita expenditures on per capita income:

It would seem that, as income increases, variability of expenditures increases too. Let's look at the same plot by region:

I don't think this shows much; it would seem that observations might be clustered, but there are not enough observations to draw any conclusion from this plot (in any case, drawing conclusions from plots only is dangerous).

Let’s first run a good ol’ linear regression:

```r
lmfit <- lm(per_capita_exp ~ region + residents +
              young_residents + per_capita_income,
            data = education)

summary(lmfit)
## 
## Call:
## lm(formula = per_capita_exp ~ region + residents + young_residents + 
##     per_capita_income, data = education)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.963 -25.499  -2.214  17.618  89.106 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -467.40283  142.57669  -3.278 0.002073 ** 
## regionnortheast     15.72741   18.16260   0.866 0.391338    
## regionsouth          7.08742   17.29950   0.410 0.684068    
## regionwest          34.32416   17.49460   1.962 0.056258 .  
## residents           -0.03456    0.05319  -0.650 0.519325    
## young_residents      1.30146    0.35717   3.644 0.000719 ***
## per_capita_income    0.07204    0.01305   5.520 1.82e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.88 on 43 degrees of freedom
## Multiple R-squared:  0.6292, Adjusted R-squared:  0.5774 
## F-statistic: 12.16 on 6 and 43 DF,  p-value: 6.025e-08
```

Let's test for heteroskedasticity using the Breusch-Pagan test that you can find in the {lmtest} package:
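With the lmfit model estimated above, the Breusch-Pagan test is a single function call (a minimal sketch of the step described here):

```r
library(lmtest)

# Breusch-Pagan test: the null hypothesis is homoskedasticity
# (constant residual variance)
bptest(lmfit)
```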

This test shows that we can reject the null hypothesis that the variance of the residuals is constant; heteroskedasticity is thus present. To get the correct standard errors, we can use the vcovHC() function from the {sandwich} package (hence the choice for the header picture of this post):

By default, vcovHC() estimates a heteroskedasticity-consistent (HC) variance-covariance matrix for the parameters. There are several ways to estimate such an HC matrix, and by default vcovHC() estimates the "HC3" one. You can refer to Zeileis (2004) for more details.
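Concretely, the two-step version of what the text describes might look like this (a sketch; lmfit is the model fitted above):

```r
library(sandwich)
library(lmtest)

# heteroskedasticity-consistent covariance matrix (type = "HC3" by default)
vcov_hc <- vcovHC(lmfit)

# robust standard errors are the square roots of the diagonal
sqrt(diag(vcov_hc))

# re-test the coefficients using the robust covariance matrix
coeftest(lmfit, vcov = vcov_hc)
```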

We see that the standard errors are much larger than before! The intercept and regionwest variables are not statistically significant anymore.

You can achieve the same in one single step:

```r
coeftest(lmfit, vcov = vcovHC(lmfit))
## 
## t test of coefficients:
## 
##                      Estimate  Std. Error t value Pr(>|t|)  
## (Intercept)       -467.402827  311.310887 -1.5014  0.14056  
## regionnortheast     15.727405   25.307782  0.6214  0.53759  
## regionsouth          7.087424   23.561063  0.3008  0.76501  
## regionwest          34.324157   24.122587  1.4229  0.16198  
## residents           -0.034558    0.091844 -0.3763  0.70857  
## young_residents      1.301458    0.688297  1.8908  0.06540 .
## per_capita_income    0.072036    0.029999  2.4013  0.02073 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

It is also easy to change the estimation method for the variance-covariance matrix:

```r
coeftest(lmfit, vcov = vcovHC(lmfit, type = "HC0"))
## 
## t test of coefficients:
## 
##                      Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept)       -467.402827  172.577569 -2.7084  0.009666 ** 
## regionnortheast     15.727405   20.488148  0.7676  0.446899    
## regionsouth          7.087424   17.755889  0.3992  0.691752    
## regionwest          34.324157   19.308578  1.7777  0.082532 .  
## residents           -0.034558    0.054145 -0.6382  0.526703    
## young_residents      1.301458    0.387743  3.3565  0.001659 ** 
## per_capita_income    0.072036    0.016638  4.3296 8.773e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

As I wrote above, by default the type argument is equal to "HC3".

Another way of dealing with heteroskedasticity is to use the lmrob() function from the {robustbase} package. This package is quite interesting, and offers quite a lot of functions for robust linear, and nonlinear, regression models. Running a robust linear regression is just the same as with lm():
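A minimal call, mirroring the lm() specification above, might look like this (a sketch):

```r
library(robustbase)

# same formula and data as lm(), but a robust MM-type estimator
lmrobfit <- lmrob(per_capita_exp ~ region + residents +
                    young_residents + per_capita_income,
                  data = education)
summary(lmrobfit)
```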

This, however, gives you different estimates than fitting a linear regression model with lm(). One might expect the estimates to be the same and only the standard errors to differ, but here the estimation method itself is different, and it is also robust to outliers (at least that's my understanding; I haven't read the theoretical papers behind the package yet).

Finally, it is also possible to bootstrap the standard errors. For this I will use the bootstrap() function from the {modelr} package:
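Generating the resamples is a single call (a sketch assuming 100 bootstrap replicates, consistent with the output shown below; the seed is hypothetical, for reproducibility):

```r
library(modelr)

set.seed(1234)  # hypothetical seed
boot_education <- bootstrap(education, 100)
```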

Let’s take a look at the boot_education object:

```r
boot_education
## # A tibble: 100 x 2
##    strap .id  
##  1       001  
##  2       002  
##  3       003  
##  4       004  
##  5       005  
##  6       006  
##  7       007  
##  8       008  
##  9       009  
## 10       010  
## # ... with 90 more rows
```

The column strap contains resamples of the original data. I will run my linear regression from before on each of the resamples:
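That step might be sketched like this (assuming the purrr-style idiom these workflows usually use; the regressions column name matches the text):

```r
library(purrr)
library(dplyr)

boot_education <- boot_education %>%
  mutate(regressions = map(
    strap,
    ~ lm(per_capita_exp ~ region + residents + young_residents +
           per_capita_income, data = .)))
```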

I have added a new column called regressions which contains the linear regressions on each bootstrapped sample. Now, I will create a list of tidied regression results:
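A sketch of the tidying step (assuming the regressions list-column described above; the object name tidied matches the columns referenced below):

```r
library(purrr)
library(dplyr)

tidied <- boot_education %>%
  mutate(tidy_lm = map(regressions, broom::tidy))
```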

broom::tidy() creates a data frame of the regression results. Let’s look at one of these:

```r
tidied$tidy_lm[[1]]
##                term      estimate    std.error  statistic      p.value
## 1       (Intercept) -515.19835839 129.61828003 -3.9747353 2.648477e-04
## 2   regionnortheast  -11.75706535  18.72014312 -0.6280436 5.332970e-01
## 3       regionsouth   -8.89139412  15.51588707 -0.5730510 5.695950e-01
## 4        regionwest   25.96217940  17.48335459  1.4849656 1.448476e-01
## 5         residents   -0.08921914   0.05557157 -1.6054819 1.157079e-01
## 6   young_residents    1.41203710   0.35297506  4.0003877 2.447887e-04
## 7 per_capita_income    0.08609340   0.01191468  7.2258276 6.069534e-09
```

This format is easier to handle than the standard lm() output:

```r
tidied$regressions[[1]]
## 
## Call:
## lm(formula = per_capita_exp ~ region + residents + young_residents + 
##     per_capita_income, data = .)
## 
## Coefficients:
##       (Intercept)  regionnortheast      regionsouth  
##        -515.19836        -11.75707         -8.89139  
##        regionwest        residents  young_residents  
##          25.96218         -0.08922          1.41204  
## per_capita_income  
##           0.08609
```

Now that I have all these regression results, I can compute any statistic I need. But first, let's transform the data even further:

list_mods is a list of the tidy_lm data frames. I now add an index and bind the rows together (by using map2_df() instead of map2()):
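A sketch of that step (assuming list_mods as described; map2_df() binds the rows while carrying the resample index along):

```r
library(purrr)
library(dplyr)

mods_df <- map2_df(list_mods, seq_along(list_mods),
                   ~ mutate(.x, resample = .y))
```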

Let’s take a look at the final object:

```r
head(mods_df, 25)
##                 term      estimate    std.error  statistic      p.value resample
## 1        (Intercept) -515.19835839 129.61828003 -3.9747353 2.648477e-04        1
## 2    regionnortheast  -11.75706535  18.72014312 -0.6280436 5.332970e-01        1
## 3        regionsouth   -8.89139412  15.51588707 -0.5730510 5.695950e-01        1
## 4         regionwest   25.96217940  17.48335459  1.4849656 1.448476e-01        1
## 5          residents   -0.08921914   0.05557157 -1.6054819 1.157079e-01        1
## 6    young_residents    1.41203710   0.35297506  4.0003877 2.447887e-04        1
## 7  per_capita_income    0.08609340   0.01191468  7.2258276 6.069534e-09        1
## 8        (Intercept) -526.78466908 142.64004523 -3.6931050 6.207704e-04        2
## 9    regionnortheast    3.35909252  21.97165783  0.1528830 8.792057e-01        2
## 10       regionsouth    2.62845462  17.20973263  0.1527307 8.793251e-01        2
## 11        regionwest   26.40527188  20.50214280  1.2879274 2.046593e-01        2
## 12         residents   -0.04834920   0.05722962 -0.8448282 4.028830e-01        2
## 13   young_residents    1.49618012   0.37091593  4.0337445 2.208994e-04        2
## 14 per_capita_income    0.07489435   0.01179888  6.3475800 1.140844e-07        2
## 15       (Intercept) -466.17160566 130.18954610 -3.5807146 8.659014e-04        3
## 16   regionnortheast   -6.75977787  17.24994600 -0.3918724 6.970880e-01        3
## 17       regionsouth    6.31061828  16.27953099  0.3876413 7.001938e-01        3
## 18        regionwest   27.89261897  15.43852463  1.8066894 7.781178e-02        3
## 19         residents   -0.08760677   0.05007780 -1.7494134 8.735434e-02        3
## 20   young_residents    1.23763741   0.32965208  3.7543746 5.168492e-04        3
## 21 per_capita_income    0.08469609   0.01236601  6.8491057 2.128982e-08        3
## 22       (Intercept) -316.44265722 166.96769979 -1.8952328 6.479692e-02        4
## 23   regionnortheast   11.54907449  14.95746934  0.7721276 4.442620e-01        4
## 24       regionsouth    9.25759229  16.89974638  0.5477947 5.866654e-01        4
## 25        regionwest   31.83905855  16.56287888  1.9223143 6.120738e-02        4
```

Now this is a very useful format, because I can now group by the term column and compute any statistics I need, in the present case the standard deviation:
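Computing the bootstrap standard error per coefficient is then a standard grouped summary (a sketch; the column name r.std.error matches the join in the next step):

```r
library(dplyr)

r.std.error <- mods_df %>%
  group_by(term) %>%
  summarise(r.std.error = sd(estimate))
```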

We can append this column to the linear regression model result:

```r
lmfit %>%
  broom::tidy() %>%
  full_join(r.std.error) %>%
  select(term, estimate, std.error, r.std.error)
## Joining, by = "term"
##                term      estimate    std.error  r.std.error
## 1       (Intercept) -467.40282655 142.57668653 226.08735720
## 2   regionnortheast   15.72740519  18.16259519  19.65120904
## 3       regionsouth    7.08742365  17.29949951  19.05153934
## 4        regionwest   34.32415663  17.49459866  18.72767470
## 5         residents   -0.03455770   0.05318811   0.06285984
## 6   young_residents    1.30145825   0.35716636   0.50928879
## 7 per_capita_income    0.07203552   0.01305066   0.02109277
```

As you can see, the whole bootstrapping procedure takes longer than either of the first two methods. However, it is very flexible and can thus be adapted to a very large range of situations. Either way, in the case of heteroskedasticity you can see that results vary a lot depending on the procedure you use, so I would advise using them all as robustness tests and discussing the differences.

If you found this blog post useful, you might want to follow me on twitter for blog post updates.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Econometrics and Free Software**.

### A primer in using Java from R – part 2

(This article was first published on ** Jozef's Rblog**, and kindly contributed to R-bloggers)

In this part of the primer we discuss creating and using custom .jar archives within our R scripts and packages, handle Java exceptions from R, and take a quick look at a performance comparison between the low- and high-level interfaces provided by rJava.

In the first part we talked about using the rJava package to create objects, call methods and work with arrays, we examined the various ways to call Java methods and calling Java code from R directly via execution of shell commands.

Contents

- Using rJava with custom built classes
- Very quick look at performance
- Usage of jars in R packages
- Handling Java exceptions in R
- References

Getting back to our example with running the main method of our HelloWorldDummy class from the first part of this primer, in practice we most likely want to actually create objects and invoke methods for such classes rather than simply call the main method.

For our resources to be available to rJava, we need to create a .jar archive and add it to the class path. An example of the process can be as follows. Compile our code to create the class file, and jar it:

```shell
$ javac DummyJavaClassJustForFun/HelloWorldDummy.java
$ cd DummyJavaClassJustForFun/
$ jar cvf HelloWorldDummy.jar HelloWorldDummy.class
```

Adding the .jar file to the class path

Within R, attach rJava, initialize the JVM and investigate our current class path using .jclassPath:

```r
library(rJava)
.jinit()
.jclassPath()
```

Now, we add our newly created .jar to the class path using .jaddClassPath:

```r
.jaddClassPath(paste0(jardir, "HelloWorldDummy.jar"))
```

If this worked, we can see the added jar(s) in the class path if we call .jclassPath() again.

Creating objects, investigating methods and fields

Now that we have our .jar in the class path, we can create a new Java object from our class:

```r
dummyObj <- .jnew("DummyJavaClassJustForFun/HelloWorldDummy")
str(dummyObj)
## Formal class 'jobjRef' [package "rJava"] with 2 slots
##   ..@ jobj  : 
##   ..@ jclass: chr "DummyJavaClassJustForFun/HelloWorldDummy"
```

We can also investigate the available constructors, methods and fields for our class (or provide the object as an argument, in which case its class will be queried):

- .jconstructors returns a character vector with all constructors for a given class or object
- .jmethods returns a character vector with all methods for a given class or object
- .jfields returns a character vector with all fields (aka attributes) for a given class or object
- .DollarNames returns all fields and methods associated with the object. Method names are followed by ( or () depending on arity.
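For example, applied to the dummyObj created above (a sketch; the exact signatures returned depend on the class):

```r
library(rJava)

# reflect over our object's class
.jconstructors(dummyObj)
.jmethods(dummyObj)
.jfields(dummyObj)

# the same helpers also accept a class name directly
.jmethods("java.lang.String")
```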

We can now invoke our SayMyName method on this object in the three ways discussed in the first part of this primer:

```r
# low level
lres <- .jcall(dummyObj, "Ljava/lang/String;", "SayMyName")

# high level
hres <- J(dummyObj, method = "SayMyName")

# convenient $ shorthand
dres <- dummyObj$SayMyName()

c(lres, hres, dres)
## [1] "My name is DummyJavaClassJustForFun.HelloWorldDummy"
## [2] "My name is DummyJavaClassJustForFun.HelloWorldDummy"
## [3] "My name is DummyJavaClassJustForFun.HelloWorldDummy"
```

Very quick look at performance

The low-level interface is much faster, since J has to use reflection to find the most suitable method. The $ shorthand seems to be the slowest, but it is also very convenient, as it supports code completion:

```r
microbenchmark::microbenchmark(times = 1000
  , .jcall(dummyObj, "Ljava/lang/String;", "SayMyName")
  , J(dummyObj, "SayMyName")
  , dummyObj$SayMyName()
)
## Unit: microseconds
##                                                 expr     min       lq       mean    median        uq       max neval
##  .jcall(dummyObj, "Ljava/lang/String;", "SayMyName")  34.992  48.0385   62.13235   61.3425   66.6470   723.719  1000
##                             J(dummyObj, "SayMyName") 696.860 748.9630  903.54615  786.2715  841.2835 66191.562  1000
##                                 dummyObj$SayMyName() 894.576 963.1035 1093.67310 1009.4020 1075.2565  6743.220  1000
```

Usage of jars in R packages

To use rJava within an R package, Simon Urbanek, the author of rJava, provides a convenience function, .jpackage(), which initializes the JVM and registers the Java classes and native code contained in the package. A quick step-by-step guide to using .jars within a package is as follows:

- place our .jars into inst/java/
- add Depends: rJava and SystemRequirements: Java into our DESCRIPTION file
- add a call to .jpackage(pkgname, lib.loc = libname) into our .onLoad or .First.lib function, for example like so:
- if possible, add .java source files into the /java folder of our package
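The .onLoad hook mentioned in the third step can be as small as this (a sketch; libname and pkgname are supplied by R when the package loads):

```r
.onLoad <- function(libname, pkgname) {
  # initialize the JVM and add the package's inst/java jars to the class path
  .jpackage(pkgname, lib.loc = libname)
}
```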

If you are interested in more detail than provided in this super-quick overview, Tobias Verbeke created a Hello Java World! package with a vignette providing a verbose step-by-step tutorial for interfacing to Java archives inside R packages.

Setting java.parameters

The .jpackage function calls .jinit with the default parameters = getOption("java.parameters"), so if we want to set some of the Java parameters, we can do it for example like so:

```r
.onLoad <- function(libname, pkgname) {
  options(java.parameters = c("-Xmx1000m"))
  .jpackage(pkgname, lib.loc = libname)
}
```

Note that the options call needs to happen before the call to .jpackage, as Java parameters can only be used during JVM initialization. Consequently, this will only work if another package has not already initialized the JVM.

Handling Java exceptions in R

rJava maps Java exceptions to R conditions relayed by the stop function, so we can use standard R mechanisms such as tryCatch to handle them.

The R condition object (call it e) is actually an S3 object (a list) that contains:

- call – a language object containing the call resulting in the exception
- jobj – an S4 object containing the actual exception object, so we can for example investigate its class: e[["jobj"]]@jclass

Since class(e) is a vector of simple Java class names, which allows the R code to use direct handlers, we can handle different exception classes differently:

```r
withCallingHandlers(
  iOne <- .jnew(class = "java/lang/Integer", 1)
  , error = function(e) {
      message("Meh, just a boring error")
  }
  , NoSuchMethodError = function(e) {
      message("We have a NoSuchMethodError")
  }
  , IncompatibleClassChangeError = function(e) {
      message("We also have a IncompatibleClassChangeError - lets recover")
      recover()
      # recovering here and looking at
      #   2: .jnew(class = "java/lang/Integer", 1)
      # we see that the issue is in
      #   str(list(...))
      #   List of 1
      #    $ : num 1
      # We actually passed a numeric, not integer
      # To fix it, just do
      #   .jnew(class = "java/lang/Integer", 1L)
  }
  , LinkageError = function(e) {
      message("Ok, this is getting a bit overwhelming, lets smile and end here :o)")
  }
)
## Meh, just a boring error
## We have a NoSuchMethodError
## We also have a IncompatibleClassChangeError - lets recover
## recover called non-interactively; frames dumped, use debugger() to view
## Ok, this is getting a bit overwhelming, lets smile and end here :o)
## Error in .jnew(class = "java/lang/Integer", 1): java.lang.NoSuchMethodError:
```

References

- Hello Java World! vignette – a tutorial for interfacing to Java archives inside R packages by Tobias Verbeke
- rJava basic crashcourse – at the rJava site on rforge, scroll down to the Documentation section
- The JNI Type Signatures – at Oracle JNI specs
- rJava documentation on CRAN
- Calling Java code from R by prof. Darren Wilkinson

To **leave a comment** for the author, please follow the link and comment on their blog: ** Jozef's Rblog**.

### HavanaR Workshop 2018

(This article was first published on ** Blog on FORWARDS**, and kindly contributed to R-bloggers)

The first meeting of R users in Cuba, HavanaR! Club 2018, will take place from 29-10-2018 to 02-11-2018. The workshop is organized jointly by the Group of Probability and Statistics of Universidad de La Habana, the >eR-Biostat initiative, the Center for Statistics, CenStat & I-Biostat, Hasselt University, Belgium and Forwards.

The workshop will focus on visualisation methods and exploratory analysis of high dimensional and multivariate datasets.

Program:

29-Oct: Basic concepts of exploratory data analysis using R. Exploratory Analysis of high dimensional data using R (Adetayo Kasim, Durham University, UK, a.s.kasim@durham.ac.uk and Ziv Shkedy, Hasselt University, Belgium, ziv.shkedy@uhasselt.be).

30-Oct: Exploratory Analysis of high dimensional data using R (Adetayo Kasim and Ziv Shkedy).

31-Oct: HavanaR! Club 2018: The Cuban R users workshop with presentations of the participants.

01-Nov: Exploratory Multivariate Data Analysis using R (Julie Josse, Ecole Polytechnique, France, julie.josse@polytechnique.edu).

02-Nov: Exploratory Multivariate Data Analysis using R (Julie Josse).

The workshop is free but registration is mandatory. To register, please send an email to erbiostat@gmail.com or cbouza2002@gmail.com.

More details and workshop materials will be available online at https://er-biostat.github.io/Courses/.

All updates about the workshop will be announced on the Facebook page (https://www.facebook.com/ER-BioStat-1463845487001786/) and on Twitter (@erbiostat).

To **leave a comment** for the author, please follow the link and comment on their blog: ** Blog on FORWARDS**.

### Visualizing macOS App Usage with a Little Help from osqueryr & mactheknife

(This article was first published on ** R – rud.is**, and kindly contributed to R-bloggers)

Both my osqueryr and mactheknife packages have had a few updates and I wanted to put together a fun example (it being Friday, and all) of what you can do with them. All my packages are now on GitHub and GitLab and I'll be maintaining them on both so I can accommodate the comfort level of any and all contributors, but will be prioritizing issues and PRs on GitLab ahead of any other platform. Having said that, I'll mark non-CRAN packages with a # notcran comment in the source views so you know you need to install them from wherever you like to grab sketchy packages from.

One table that osquery makes available under macOS is an inventory of all “apps” that macOS knows about. Previous posts have shown how to access these tables via the dplyr interface I built for osquery, but they involved multiple steps and as I started to use it more regularly (especially to explore the macOS 10.14 beta I’m running) I noticed that it could use some helper functions. One in particular — osq_expose_tables() — is pretty helpful in that it handles all the dplyr boilerplate code and makes table(s) available in the global environment by name. It takes a single table name or regular expression and then exposes all matching entities. While the function has a help page, it’s easier just to see it in action. Let’s expose the apps table:

```r
library(osqueryr) # notcran
library(tidyverse)

osq_expose_tables("apps")

apps
## # Source:   table [?? x 19]
## # Database: OsqueryConnection
##    applescript_enab… bundle_executable    bundle_identifier   bundle_name  bundle_package_…
##  1 0                 1Password 6          com.agilebits.onep… 1Password 6  APPL
##  2 0                 2BUA8C4S2C.com.agil… 2BUA8C4S2C.com.agi… 1Password m… APPL
##  3 1                 Adium                com.adiumX.adiumX   Adium        APPL
##  4 1                 Adobe Connect        com.adobe.adobecon… Adobe Conne… APPL
##  5 1                 Adobe Illustrator    com.adobe.illustra… Illustrator… APPL
##  6 ""                AIGPUSniffer         com.adobe.AIGPUSni… AIGPUSniffer APPL
##  7 ""                CEPHtmlEngine Helper com.adobe.cep.CEPH… CEPHtmlEngi… APPL
##  8 ""                CEPHtmlEngine        com.adobe.cep.CEPH… CEPHtmlEngi… APPL
##  9 ""                LogTransport2       com.adobe.headligh… LogTranspor… APPL
## 10 ""                droplet              ""                  Analyze Doc… APPL
## # ... with more rows, and 14 more variables: bundle_short_version,
## #   bundle_version, category, compiler, copyright, development_region,
## #   display_name, element, environment, info_string, last_opened_time,
## #   minimum_system_version, name, path
```

There's tons of info on all the apps macOS knows about, some of which are system services and "helper" apps (like Chrome's auto-updater). One field — last_opened_time — caught my eye and I thought it would be handy to see which apps had little use (i.e. ones that haven't been opened in a while) and which apps I might use more frequently (i.e. ones with more recent "open" times). That last_opened_time is a fractional POSIX timestamp and, due to the way osquery created the schemas, it's in a character field. That's easy enough to convert, and then arrange() the whole list in descending order to let you see what you use most frequently.

But, this is R and we can do better than a simple table or even a DT::datatable().

I recently added the ability to read macOS property lists (a.k.a. "plists") to mactheknife by wrapping a Python module (plistlib). Since all (OK, "most") macOS apps have an icon, I thought it would be fun to visualize the last-opened frequency for each app using the app icons and ggplot2. Unfortunately, ImageMagick (and, thus, the magick package) cannot read macOS icns files, so you'll need to do a brew install libicns before working with any of the remaining code since we'll be relying on a command-line utility from that formula.

Let’s get the frontmatter out of the way:

```r
library(sys)
library(magick)
library(osqueryr)    # notcran
library(mactheknife) # notcran
library(ggimage)
library(hrbrthemes)
library(ggbeeswarm)
library(tidyverse)

osq_expose_tables("apps")

# macOS will use a generic app icon when none is present in an app bundle;
# this is the location and we'll need to use it when our plist app
# spelunking comes up short
default_app <- "/System/Library/CoreServices/CoreTypes.bundle/Contents/Resources/GenericApplicationIcon.icns"
```

Next, we'll:

- collect the apps table locally
- filter out system-ish things (which we really don't care about for this post)
- convert the last used time to something useful (and reduce it to a day resolution)
- try to locate the property list for the app and read the path to the app icon file, substituting the generic one if not found (or other errors pop up):
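Those four steps might look roughly like the following sketch. Hypothetical throughout: the system-path filter regex, the lop_day column name, and the use of mactheknife::read_plist() to pull the standard CFBundleIconFile key out of each app's Info.plist, with default_app (defined above) as the fallback:

```r
apps_df <- apps %>%
  collect() %>%
  # hypothetical filter: drop system-ish paths we don't care about here
  filter(!grepl("^/System/|^/usr/|Helper", path)) %>%
  # last_opened_time is a character POSIX timestamp; reduce to day resolution
  mutate(lop_day = as.Date(as.POSIXct(as.numeric(last_opened_time),
                                      origin = "1970-01-01"))) %>%
  # look up each app's icon in its Info.plist, falling back to the generic icon
  mutate(icns = map_chr(path, function(app_path) {
    plist_path <- file.path(app_path, "Contents", "Info.plist")
    if (!file.exists(plist_path)) return(default_app)
    icon <- tryCatch(read_plist(plist_path)$CFBundleIconFile,
                     error = function(e) NULL)
    if (length(icon) != 1) return(default_app)
    if (!grepl("\\.icns$", icon)) icon <- paste0(icon, ".icns")
    candidate <- file.path(app_path, "Contents", "Resources", icon)
    if (file.exists(candidate)) candidate else default_app
  }))
```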

Since I really didn't feel like creating a package wrapper for libicns, we're going to use the sys package to make system calls to convert the icns files to png files. We *really* don't want to do this repeatedly for the same files if we ever run this again, so we'll setup a cache directory to hold our converted pngs.

Apps can (and, usually do) have multiple icons with varying sizes and are not guaranteed to have every common size available. So, we'll have the libicns icns2png utility extract *all* the icons and use the highest resolution one, using magick to reduce it to a 32x32 png bitmap.

You can open up that cache directory with the macOS finder to find all the extracted/converted pngs.

Now, we're on the final leg of our app-use visualization journey.

Some system/utility apps have start-of-epoch dates due to the way the macOS installer tags them. We only want "recent" ones so I set an arbitrary cutoff date of the year 2000. Since many apps would have the same last opened date, I wanted to get a spread out layout "for free". One way to do that is to use ggbeeswarm::position_beswarm():

```r
filter(apps_df, lop_day > as.Date("2000-01-01")) %>%
  ggplot() +
  geom_image(
    aes(x = "", lop_day, image = icns_png),
    size = 0.033, position = position_quasirandom(width = 0.5)
  ) +
  geom_text(
    data = data_frame(
      x = c(0.6, 0.6),
      y = as.Date(c("2018-05-01", "2017-09-15")),
      label = c("More recently used ↑", "Not used in a while ↓")
    ),
    aes(x, y, label = label),
    family = font_an, size = 5, hjust = 0, color = "lightslategray"
  ) +
  labs(x = NULL, y = "Last Opened Time") +
  labs(
    x = NULL, y = NULL,
    title = "macOS 'Last Used' App History"
  ) +
  theme_ipsum_rc(grid = "Y") +
  theme(axis.text.x = element_blank())
```

There are tons of other ways to look at this data and you can use the osquery daemon to log this data regularly so you can get an extra level of detail. An interesting offshoot project would be to grab the latest RStudio dailies and see if you can wrangle a sweet D3 visualization from the app data we collected. Make sure to drop a note with your creations in the comments. You can find the full code in this snippet.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – rud.is**.

### Setting up RStudio Server, Shiny Server and PostgreSQL by @ellis2013nz

(This article was first published on ** free range statistics - R**, and kindly contributed to R-bloggers)

For a variety of reasons I wanted a Linux server in the cloud with data science software on it. In no particular order:

- so I could access a reliable setup from anywhere with a web browser, including on networks without the software I need (which happens a lot in corporate/government environments)
- so I could do round-the-clock data gathering when necessary
- to run my pet HappyRrobot twitterbot, having retired the old physical linux box it used to be hosted from
- to make sure I stay familiar with what’s needed to set up such a system from scratch.

I couldn’t tell you which of these are the more important.

In this blog post I’ll be noting down the steps necessary for setting up this server, and illustrating purpose 2 with a Shiny app, hosted on the server and drawing data from a database on the server, which is collecting a round-the-clock random sample of Twitter. The end result looks like this (click on the image to go to the actual working app):

If you’re interested in Twitter data but not in setting up a Linux server with R, Shiny and PostgreSQL on it, you can skip straight to where I talk about getting the data in and analysing it.

**Setting up a server for data and stats**

I chose Amazon Web Services Elastic Compute Cloud (EC2) to host the server. I used the Elastic IP service (fairly cheap) to allocate a permanent IP address to the instance.

I was keen on using Red Hat or similar flavoured Linux, because of purpose #4 noted above; I already knew a bit about Ubuntu and wanted to expand my knowledge sphere, and Red Hat is the flavour that seems to pop up on corporate and government servers. In the end I opted for CentOS, which is “a Linux distribution that provides a free, enterprise-class, community-supported computing platform functionally compatible with its upstream source, Red Hat Enterprise Linux (RHEL)”. In fact, installing R is a bit easier on CentOS than it was on my Red Hat experiments.

I want the following on this machine:

- R, RStudio Server and Python 3 for analysis and for data munging
- PostgreSQL for storing data and supporting analysis
- Shiny Server for dissemination
- A webserver (I chose Nginx) to support RStudio Server and Shiny Server through reverse proxy so I can access them on regular web browser ports rather than ports 8787 and 3838, which will often not be available from a corporate network
- all the other utilities and extras to support all this, such as curl, mail and fonts

This was non-trivial, which is one of the reasons why I’m blogging about it! There are lots of fishhooks and little things to sort through, and while there are some excellent blog posts and tutorials out there to help do it, none of them covered quite the end-to-end setup I needed. One of my outputs from the process was a set of notes – not quite a single run-and-forget configuration script as you’d want in a professional setup, but fairly easy to use – that makes it easy to do something similar in the future.

**R and a few basics**

Here’s where I started, having spun up the server itself (there are plenty of tutorials on how to do that, provided by Amazon themselves and others). I begin by installing R, along with the EPEL (“Extra Packages for Enterprise Linux”) repository that is needed first.

```sh
# we'll want mail later
sudo yum install -y mailx

#-----------------R and its dependencies------------------------
# install R. We need EPEL first
sudo yum update -y
sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install -y texlive
sudo yum install -y texinfo

# clean up
rm *.rpm

sudo yum install -y R

# and version control of course
sudo yum install -y git
```

**Database**

Next thing was to install PostgreSQL. I found it useful to install this before I started installing R packages, because some R packages that speak to PostgreSQL behave differently on installation depending on whether PostgreSQL is found on the machine or not.

```sh
#=====================postgresql=======================
# install postgresql. Good to do this before we start installing R packages
sudo yum install -y postgresql-server postgresql-contrib postgresql-devel
sudo postgresql-setup initdb

# stop to give the postgres account a password for the operating system
sudo passwd postgres

# start the postgresql service
sudo systemctl start postgresql

# make the database service start automatically when the machine is rebooted
sudo systemctl enable postgresql
```

**Extending R via packages and their dependencies**

I find installing all the R packages I regularly use a harder job on Linux than on Windows. I’m sorry, but I do. In particular, Windows installations of packages like gdal seem to look after upstream dependencies seamlessly and quietly. Not so on Linux. Here’s what I needed to do at the command line to get all the R packages I wanted installed.

```sh
#======================miscellaneous dependencies needed by R packages===============
# Best to do this on a large instance, even if you only start it as big
# during the install. 8GB seems a minimum.

#----------------tidyverse and Cairo---------
# First, some dependencies that rvest, devtools and Cairo need:
sudo yum install -y libcurl-devel libxml2-devel openssl-devel
sudo yum install -y cairo-devel libXt-devel udunits2-devel gdal-devel poppler-cpp-devel

#------------The gdal epic---------------------
# needed for spatial stuff and in particular sf, which needs version > 2 (currently 2.2).
# This should work according to the sf GitHub page (excluding udunits2, which we look after later):
sudo yum install -y gdal-devel proj-devel proj-epsg proj-nad geos-devel

# but that installs the wrong version of gdal! We have to install it by hand.
# see https://gis.stackexchange.com/questions/263495/how-to-install-gdal-on-centos-7-4
# adapted from https://gist.github.com/simondobner/f859b2db15ad65090c3c316d3c224f45
wget http://download.osgeo.org/gdal/2.2.4/gdal-2.2.4.tar.gz
tar zxvf gdal-2.2.4.tar.gz
cd gdal-2.2.4/
./configure --prefix=/usr/ --with-sfcgal=no
make -j4
sudo make install

# should have a test here to only do the next two things if installed correctly!
rm *.tar.gz
rm gdal-2.2.4 -r

#======================R packages=====================
# Now we can install R packages that need all the above system dependencies first.
#
# udunits2 needs special configuration when installing in R, so let's do that first and get it out of the way
sudo R -e "install.packages('udunits2', configure.args='--with-udunits2-include=/usr/include/udunits2', repos='http://cran.rstudio.com/')"

# these are a bunch of packages that are heavily used and that I want installed up front and available
# for all users (hence installing them as super user)
sudo R -e "install.packages(c('Rcpp', 'rlang', 'bindrcpp', 'dplyr', 'digest', 'htmltools', 'tidyverse',
  'shiny', 'leaflet', 'sf', 'scales', 'Cairo', 'forecast', 'forcats', 'h2o', 'seasonal', 'data.table',
  'extrafont', 'survey', 'forecastHybrid', 'ggseas', 'treemap', 'glmnet', 'ranger', 'RPostgres',
  'igraph', 'ggraph', 'nzelect', 'tm', 'wordcloud', 'praise', 'showtext', 'ngram', 'pdftools',
  'rtweet', 'GGally', 'ggExtra', 'lettercase', 'xgboost'), repos='http://cran.rstudio.com/')"

# fonts
sudo yum install -y dejavu-sans-fonts
sudo yum install -y google-droid-*-fonts
sudo yum install -y gnu-free-*-fonts
sudo R -e "extrafont::font_import(prompt = FALSE)"
```

I did all of this with my server set up as a “large” instance with 8GB of RAM. This particularly makes a difference when installing Rcpp. After all the initial setup you can stop the instance, downsize it to something cheaper, and restart it.

Note that I am using sudo to install R packages so they are available to all users (which will include the shiny user down the track), not just to me. I wanted everyone using this server to have the same set of packages available; obviously whether this is desirable or not depends on the purpose of the setup.

**Server-related stuff**

Next I want to get RStudio Server and Shiny Server working, and accessible via a web browser that just talks to standard port 80. There is a step here where the Nginx configuration file gets edited by hand; the links to RStudio support for RStudio Server and for Shiny Server contain instructions on what needs to go where.
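To give a flavour of those edits, here is a minimal sketch of the kind of reverse-proxy additions the RStudio support articles describe (take the exact directives from the articles themselves; the `map` block belongs in the `http` context and supports the websockets both servers use):

```nginx
# inside the http { } block of /etc/nginx/nginx.conf
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}

server {
    listen 80;

    location /rstudio/ {
        rewrite ^/rstudio/(.*)$ /$1 break;
        proxy_pass http://localhost:8787;
        proxy_redirect http://localhost:8787/ $scheme://$host/rstudio/;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
    }

    location /shiny/ {
        rewrite ^/shiny/(.*)$ /$1 break;
        proxy_pass http://localhost:3838;
        proxy_redirect http://localhost:3838/ $scheme://$host/shiny/;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
    }
}
```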

Also note that the actual versions of RStudio Server and of Shiny Server below are date-specific (because they are installed via local install), and probably the links are already out of date.

```sh
#-------------webby stuff------------
# install a web server so we can deliver things through it via reverse proxy
# see https://support.rstudio.com/hc/en-us/articles/200552326-Running-RStudio-Server-with-a-Proxy
# and https://support.rstudio.com/hc/en-us/articles/213733868-Running-Shiny-Server-with-a-Proxy
sudo yum install -y nginx

# install RStudio Server (2018-04-23)
wget https://download2.rstudio.org/rstudio-server-rhel-1.1.447-x86_64.rpm
sudo yum localinstall -y --nogpgcheck rstudio-server-rhel-1.1.447-x86_64.rpm

# install shiny and shiny-server (2018-04-23)
wget https://download3.rstudio.org/centos6.3/x86_64/shiny-server-1.5.7.907-rh6-x86_64.rpm
sudo yum localinstall -y --nogpgcheck shiny-server-1.5.7.907-rh6-x86_64.rpm
rm *.rpm

# now go make the necessary edits to /etc/nginx/nginx.conf
# note that the additions are made in two different bits of that file; you don't just paste the whole
# lot in.
sudo nano /etc/nginx/nginx.conf
sudo systemctl restart nginx

# go to yr.ip.number/shiny/ and yr.ip.number/rstudio/ to check all is working

# add some more users if wanted at this point
# sudo useradd ellisp
# sudo passwd ellisp

# not sure if all these are needed:
sudo systemctl enable nginx
sudo systemctl enable rstudio-server
sudo systemctl enable shiny-server

# set the ownership of the directory we're going to keep apps in so the `shiny`
# user can access it
sudo chown -R shiny:shiny /srv/shiny-server
```

**Python**

CentOS currently comes with Python 2.7, but I wanted to be using Python 3. My Python skills are halting at best, but I want them to be as future-proofed as possible. Anaconda seems a relatively straightforward way to manage Python.

```sh
#---------------Anaconda / python----------------
# go to https://repo.continuum.io/archive/ or https://www.anaconda.com/download/#linux to see the latest version.
# Anaconda3 is with Python 3.X, Anaconda2 is with Python 2.7. Note
# that Python 2.7 is part of the CentOS Linux distribution and shouldn't be
# overwritten, ie `python xxx.py` should run Python 2.7. But doing the process below does this;
# watch out for if this causes problems later...
wget https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
sudo bash Anaconda3-5.1.0-Linux-x86_64.sh
# agree to the license, and specify /opt/anaconda3 as location when asked

# we want to give all users anaconda on their path, so I snitched this from:
# https://www.vultr.com/docs/how-to-install-jupyter-notebook-on-a-vultr-centos-7-server-instance
sudo cp /etc/profile /etc/profile_backup
echo 'export PATH=/opt/anaconda3/bin:$PATH' | sudo tee -a /etc/profile
source /etc/profile
echo $PATH

sudo /opt/anaconda3/bin/conda install psycopg2
# as far as I can tell this makes python3.6 the default python, which is surely going to cause problems down
# the track...
```

**Configuring PostgreSQL**

I installed PostgreSQL and started its database service early in this process, but the next step is to actually set up some databases and users. The PostgreSQL security model is thorough and comprehensive, but with lots of fishhooks. Here’s how I set it up for this particular (very simple) use case. First, I enter the psql environment as the postgres user (currently the only user with any access to the database server):

```sh
sudo -u postgres psql
```

Now we can set up the users we want to be accessing our databases, some databases for them to use, and schemas within those databases. In this case, I set up two databases for now:

- twitter
- survey_microdata

and three different users, in addition to postgres:

- ellisp (ie me, in development mode)
- external_analyst (ie me or others, in read-only mode)
- shiny (the Shiny Server’s id on the server, needed so Shiny apps can access the database)
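Inside psql, the corresponding setup might look roughly like this – a hedged sketch using the names above, not the author’s actual script:

```sql
-- databases
CREATE DATABASE twitter;
CREATE DATABASE survey_microdata;

-- users (choose real passwords, obviously)
CREATE USER ellisp WITH PASSWORD 'xxxx';            -- development
CREATE USER external_analyst WITH PASSWORD 'xxxx';  -- read-only
CREATE USER shiny WITH PASSWORD 'xxxx';             -- Shiny Server's id

-- broad-brush privileges: full rights for the developer;
-- connect (plus per-table SELECT grants later) for the others
GRANT ALL PRIVILEGES ON DATABASE twitter TO ellisp;
GRANT CONNECT ON DATABASE twitter TO external_analyst, shiny;
```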

We also need to tweak the configuration so the PostgreSQL database is accessible from the outside world (if that’s what we want, which I do).

```sh
# follow the instructions at https://blog.bigbinary.com/2016/01/23/configure-postgresql-to-allow-remote-connection.html if
# you want to remotely access eg from DBeaver on your laptop. Definitely need a password then.
# First add in listen_addresses = '*' just above the commented-out version of
# #listen_addresses = 'localhost'
sudo nano /var/lib/pgsql/data/postgresql.conf

# now the client authentication file, about how individuals can actually log on.
# Add the below two lines (without the # at the beginning) to the bottom of the table.
# This lets users log on via password from anywhere. If this doesn't suit your
# risk profile, find something more restrictive...
# host all all 0.0.0.0/0 md5
# host all all ::/0 md5
sudo nano /var/lib/pgsql/data/pg_hba.conf

sudo systemctl restart postgresql
```

**A Twitter sample stream database**

**Collecting data**

OK, that wasn’t so bad, was it (or was it…). I now have my server running (I stopped it and restarted it as a smaller, cheaper instance than the 8GB-of-RAM one I used during that setup) and available to do Useful Stuff. Like collect Twitter data for analysis and dissemination in Shiny:

For a while, I’ve been mildly exercised by the problem of sampling from Twitter. See for example this earlier post where I was reduced to using a snowballing network method to find users, and searching for tweets with the letter “e” in them to get a sample of tweets. Both of these methods have obvious problems if you want to do inference about how people as a whole are using Twitter.

On the other hand, Twitter make available several sets of public Tweets that are fully representative:

- The free Sample Tweets API “returns a small random sample of all public Tweets.”
- The Decahose stream provides a 10% sample of all public Tweets
- The Firehose provides all public tweets

The latter two services are for paying customers only. My interest in Twitter is curiosity at most, so I’m only interested in the free sample, which is thought to be around 1% of the Firehose (exactly what proportion of the Firehose it is isn’t publicly known, and is a question of some inferential interest).

So I was interested in the sample stream, but I wanted to collect a sample over a period of time, not just from the day I was going to do some analysis. Even this 1% sample was more than I wanted to pay disk space to store if I were to collect over time, so I decided I would collect 30 seconds of sample streaming data every hour, at a random time within the hour to avoid problems associated with doing the sampling at the same time each day.
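That sampling scheme is easy to sketch in R. This is an illustrative fragment rather than the author’s actual collection script; it assumes an rtweet OAuth token is already configured, and the rtweet calls are shown commented out because they hit the live API:

```r
# library(rtweet)  # provides stream_tweets()

# pick a random second within the hour to start the 30-second window,
# so samples don't always come from the same moment of each hour
offset <- sample(0:(3600 - 30), 1)

# in the scheduled job: wait until that moment, then stream.
# stream_tweets() with an empty query reads the free public sample stream.
# Sys.sleep(offset)
# tweets <- stream_tweets(q = "", timeout = 30)
```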

I designed a data model to capture the data I was most interested in while discarding attached video and images (this was about saving me disk space; I think serious Twitter analysis would have to do better than just collecting text). It looks like this:

BTW that diagram (and much of the database development) was done with the excellent universal SQL editor and database admin tool, DBeaver. It works with different flavours of relational database and is awesome.

The code that creates and populates that database is available on GitHub:

- the SQL that builds the empty database.
- the R code that imports a 30 second window of the sample stream and uploads it to the public schema of the twitter database. All the heavy lifting is done by the awesome rtweet package.
- the SQL that transforms and loads the data into the twitter.tweets schema. This does the work, for example, of matching users in the latest sample with previously observed users; making sure that re-tweeted users are in the users table even if we haven’t seen them directly tweet something themselves; and so on.
- the shell script that is activated by a cron job 24 times a day and runs the above R and SQL.
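The cron side of that can be sketched as one crontab entry driving a small shell script (paths and file names here are hypothetical stand-ins for the scripts linked above):

```sh
# crontab -e: run the collection script once an hour
# 0 * * * * /home/ellisp/twitter/collect_sample.sh >> /home/ellisp/twitter/collect.log 2>&1

#!/bin/bash
# collect_sample.sh -- one 30-second window, then transform and load
Rscript stream_30_seconds.R                # waits a random offset, streams 30s, uploads to the public schema
psql -d twitter -f transform_and_load.sql  # transforms and loads into the twitter.tweets schema
```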

This has now been running smoothly since 17 May 2018, apart from one day last week when I botched an R upgrade and it all went down for half a day before I noticed (lesson learned – run `update.packages(ask = FALSE, checkBuilt = TRUE)` to ensure your R packages all keep working after the R internals change). So far the database is about 3GB in size, and I’m quite happy to let it grow quite a bit more than that.

**What do we find out?**

So far the main use I’ve put this data to is the Shiny app that I’ve scattered a few screenshots of in this blog post. The source code is on GitHub of course. That Shiny app writes its own SQL based on the inputs provided by the user (eg date range), queries the database and produces charts.
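Building SQL from user inputs is best done with parameterised queries rather than pasting strings together. Here is a sketch of that pattern with hypothetical table and column names (RPostgres supports `$1`-style placeholders via the `params` argument of `dbGetQuery()`):

```r
library(DBI)
library(RPostgres)

# hypothetical helper: tweets per hour over a user-chosen date range
tweets_per_hour <- function(con, start_date, end_date) {
  dbGetQuery(
    con,
    "SELECT date_trunc('hour', created_at) AS hour, count(*) AS n
       FROM tweets
      WHERE created_at BETWEEN $1 AND $2
      GROUP BY 1
      ORDER BY 1",
    params = list(start_date, end_date)
  )
}

# in the Shiny server function, something like:
# con <- dbConnect(Postgres(), dbname = "twitter")
# tweets_per_hour(con, input$daterange[1], input$daterange[2])
```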

So what have I learned about Twitter (as opposed to about Linux administration) from the exercise? No time to explore in much depth right now, but some of the interesting things include:

- Tweets have a daily cycle, peaking at around 15:30 UTC each day (this assumes that the sampling ratio in the Twitter sample stream is roughly constant, which I think is likely – otherwise why would we see this seasonality?).
- The most tweeted hashtags all relate to teen-oriented popular music. I had to look up TeenChoice just to find out what it was… My filter bubble isn’t so much a liberal-v-conservative one as something relating to different interests to most people in the world altogether. The things that dominate my own Twitter feed are not even faintly representative of Twitter as a whole (I expected this with regard to statistical computing of course, but it was interesting to find out that even US politics hardly makes a dent in the most common tweets/retweets in any particular day, compared to popular retweets such as “If the Cleveland Cavaliers win the 2018 NBA finals I’ll buy everyone who retweet’s this a jersey…” (sic) – 1.1 million retweets – and several suspiciously similar variants)
- If you ignore tweets in Japanese, Korean, Thai and Arabic script you are missing three of the top seven languages on Twitter. Any serious analysis needs to find a way to bring them on board (my first obstacle in this was getting a font that could represent as many different scripts and emojis as possible without knowing in advance the language; in the end I opted for GNU FreeFont, as described in my own answer to my question on StackOverflow about this problem)
- Only six of the ten most prolific Tweeters listed on Twitter Counter are currently prolifically tweeting. In particular, @venethis @AmexOffers @BEMANISoundTeam and @__Scc__ seem to have gone either completely quiet or just much lower frequency tweeting (code for analysis).
- The currently most prolific tweeter is @akiko_lawson, which I think (I don’t read Japanese) is an account associated with Lawson convenience stores. This single account issues around 1 tweet for every 5,000 tweets by anyone on the planet.
- The currently second most prolific tweeter is probably a spam bot, called @test5f1798, that amongst other things tweets pictures of Spam. Very meta.

There’s some interesting statistical challenges with using this database for inference that I might come back to. For example, I could use a small amount of auxiliary information such as the 1.1 million retweets of that Cleveland Cavaliers jersey tweet and compare it to the 105 times I found the tweet in my sample; and deduce that my sample is about 1 in 10,000 of the full population of tweets. This is consistent with the sample stream being a genuine 1% sample, of which I collect 1/120th (30 seconds every hour). So I should be able to treat my sample as a 1/12,000 sample of the whole population, clustered by the 30 second window they are in. Something for later.
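That back-of-envelope reasoning is just arithmetic, and it hangs together:

```r
# observed: 105 copies in my sample of a tweet retweeted about 1.1 million times
105 / 1.1e6       # roughly 9.5e-05, ie about 1 tweet in every 10,500

# expected if the stream really is a 1% sample and I collect 30 seconds per hour
0.01 * 30 / 3600  # 8.33e-05, ie exactly 1 in 12,000
```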


### Lyric Analysis with NLP and Machine Learning using R: Part One – Text Mining

(This article was first published on ** R-posts.com**, and kindly contributed to R-bloggers)

*June 22*

*By Debbie Liske*

This is Part One of a three part tutorial series originally published on the DataCamp online learning platform in which you will use R to perform a variety of analytic tasks on a case study of musical lyrics by the legendary artist, Prince. The three tutorials cover the following:

- **Part One**: Text Mining and Exploratory Analysis (beginner)
- **Part Two**: Sentiment Analysis and Topic Modeling with NLP (intermediate)
- **Part Three**: Predictive Analytics using Machine Learning (intermediate-advanced)

Musical lyrics may represent an artist’s perspective, but popular songs reveal what society wants to hear. Lyric analysis is no easy task. Because lyrics are often structured so differently from prose, they require caution with assumptions and a uniquely discriminant choice of analytic techniques. Musical lyrics permeate our lives and influence our thoughts with subtle ubiquity. The concept of *Predictive Lyrics* is beginning to buzz and is increasingly prevalent as a subject of research papers and graduate theses. This case study will just touch on a few pieces of this emerging subject.

**Prince: The Artist**

To celebrate the inspiring and diverse body of work left behind by Prince, you will explore the sometimes obvious, but often hidden, messages in his lyrics. However, you don’t have to like Prince’s music to appreciate the influence he had on the development of many genres globally. Rolling Stone magazine listed Prince as the 18th best songwriter of all time, just behind the likes of Bob Dylan, John Lennon, Paul Simon, Joni Mitchell and Stevie Wonder. Lyric analysis is slowly finding its way into data science communities as the possibility of predicting “Hit Songs” approaches reality.

Prince was a man bursting with music – a wildly prolific songwriter, a virtuoso on guitars, keyboards and drums and a master architect of funk, rock, R&B and pop, even as his music defied genres. – Jon Pareles (NY Times)

In this tutorial, Part One of the series, you’ll utilize text mining techniques on a set of lyrics using the tidy text framework. Tidy datasets have a specific structure in which each variable is a column, each observation is a row, and each type of observational unit is a table. After cleaning and conditioning the dataset, you will create descriptive statistics and exploratory visualizations while looking at different aspects of Prince’s lyrics.
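As a minimal sketch of that tidy structure, here is the one-token-per-row shape on a toy data frame (not the actual Prince lyrics dataset used in the tutorial), assuming the tidytext package is installed:

```r
library(dplyr)
library(tidytext)

lyrics <- tibble(
  song = c("song_a", "song_a", "song_b"),
  line = c(1, 2, 1),
  text = c("Dearly beloved", "We are gathered here today", "Dig if you will the picture")
)

# one token (word) per row -- the tidy structure the analysis builds on
tidy_lyrics <- lyrics %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")  # drop very common words
```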

Check out the article here!

(reprint by permission of DataCamp online learning platform)


### Ghosts of Animals Haunt Portland’s Overpriced Apartments

(This article was first published on ** Dan Garmat's Blog -- R**, and kindly contributed to R-bloggers)

Apartment hunting in an expensive city is leading me to curses and exclamations. Following are some outstanding examples of insanely priced apartments in Portland, OR, run through Google Deep Dream in hopes of my understanding why people pay so much for a small box. These listings will be gone in no time (I’m sure) so including some captions for posterity.

Let’s start with this one. Indeed, it appears $1899 for 1 bedroom grants access to this clubhouse haunted by some floating apparition:

The Deep Dream InceptionV3 algorithm here is trained on ImageNet, then makes changes to the image that increase confidence in the predicted category. Looped several times with the num_octave hyperparameter, the result starts to look a good bit trippy and helps give some intuition about what a neural network “sees” as prototypical examples of a predicted class. Apparently there is no “view of apartment” class, as it keeps seeing ghastly animals. Perhaps it is no coincidence that even before running InceptionV3 this clubhouse already looks like it could work in The Shining:

*$1850 / 1br – 697ft2 – BRAND NEW! Enjoy Luxury Uban [Urban?] Living at The Franklin Flats!*

*“NEW ON THE MARKET!*

*“The Franklin Flats is the newest addition to this desirable part of town! Built with the urban adventurer in mind, our small community offers luxury appeal with a neighborhood feel. Boasting a walkability score of 86 out of 100, you can’t beat the location! [unless an 87+?] Our close proximity to Mt. Tabor, food carts, [because you won’t have anything left over for restaurants] shopping and eateries gives you the classic Northwest experience you crave. Email or call to schedule a personal tour today!”*

Apparently the Attack on Titan seals make this a desirable part of town.

Perhaps those seals are why walkability doesn’t crack 90. If you survive the seals on a midday stroll, there are titan polar bears who can walk through walls:

*$4250 / 2br – 1900ft2 – Condo on the Willamette*

*“Breathtaking views of the city and the Willamette River, located in the elegant Atwater. This condo has two bedrooms, living room, dining room, gourmet kitchen, gas fireplace, small office, two balconies, utility room and underground parking. Includes concierge desk, card-accessed security.”*

Something tells me this view of the Willamette River would be complete if a cocker spaniel is staring at me…

But this view is what you really pay for: look at all the suckers in the two identical buildings who massively overpaid for – how the heck did those get up here?!

*$3900 / 2br – 1004ft2 – Portland’s most well-appointed two bedroom apartments now available*

*“Portland’s premier rental community is now pre-leasing. Find everything you desire in one place: The finest dining, distinguished boutiques, and most-beloved haunts. Experience the perfect merger of luxury and livability with our timeless brick exterior, stunning marble lobby, tiled bathrooms, tall ceilings, ample light, extensive storage, concierge services, and even a dog washing station.*

*“Proudly situated in Portland’s [S]Nob Hill/23rd Ave neighborhood, 21 Astor boasts a “97” walk score and a “96” bike score. [Beat that Franklin Flats!] Life is great in the epicenter of Portland’s charm. Pick up your locally-grown produce at City Market. Grab happy hour with the gang at Seratto. Sweat it out at Corepower Yoga.*

*“Our greater than 1:1 parking ratio assures a space for your car in our garage. Imagine never having to find parking on NW 23rd again.”*

I work on NW 21st, that’s almost the cost of parking alone. People walk by on Oregon ‘pot tours’ and this may be how they see the building as well:

Whoa! Is that a pig in the sky? Far out!

At least their new $3.88/sqft/month kitchens aren’t yet haunted – oh, god! Are those giant psychedelic worms!

**Code**

Started with JJ Allaire, François Chollet, RStudio, and Google’s keras code based on Google DeepDream, and added a little looping to try different parameters over these 7 images:

```r
library(keras)
library(tensorflow)
library(purrr)

# Function Definitions ----------------------------------------------------

preprocess_image <- function(image_path){
  image_load(image_path) %>%
    image_to_array() %>%
    array_reshape(dim = c(1, dim(.))) %>%
    inception_v3_preprocess_input()
}

deprocess_image <- function(x){
  x <- x[1,,,]
  # Remove zero-center by mean pixel
  x <- x/2.
  x <- x + 0.5
  x <- x * 255
  # 'BGR'->'RGB'
  x <- x[,,c(3,2,1)]
  # Clip to interval 0, 255
  x[x > 255] <- 255
  x[x < 0] <- 0
  x[] <- as.integer(x)/255
  x
}

# Parameters --------------------------------------------------------

## list of images to process --
list_images <- list.files('images/deep dream apartments/orig/', full.names = TRUE)

## list of settings to try --
list_settings <- list(
  settings = list(
    features = list(mixed2 = 0.2, mixed3 = 0.5, mixed4 = 2., mixed5 = 1.5),
    hyperparams = list(
      # Playing with these hyperparameters will also allow you to achieve new effects
      step = 0.01,        # Gradient ascent step size
      num_octave = 6,     # Number of scales at which to run gradient ascent
      octave_scale = 1.4, # Size ratio between scales
      iterations = 20,    # Number of ascent steps per scale
      max_loss = 10
    )
  ),
  settings = list(
    features = list(mixed2 = 0.5, mixed3 = 0.2, mixed4 = 1.1, mixed5 = 1.5),
    hyperparams = list(step = 0.01, num_octave = 9, octave_scale = 1.1, iterations = 20, max_loss = 5)
  ),
  settings = list(
    features = list(mixed2 = 0.02, mixed3 = 0.05, mixed4 = 0.01, mixed5 = 0.05),
    hyperparams = list(step = 0.01, num_octave = 11, octave_scale = 1.1, iterations = 20, max_loss = 20)
  ),
  settings = list(
    features = list(mixed2 = 0.2, mixed3 = 0.5, mixed4 = 2., mixed5 = 1.5),
    hyperparams = list(step = 0.01, num_octave = 8, octave_scale = 1.4, iterations = 20, max_loss = 10)
  ),
  settings = list(
    features = list(mixed2 = 0.2, mixed3 = 0.1, mixed4 = 0.4, mixed5 = 0.3),
    hyperparams = list(step = 0.01, num_octave = 8, octave_scale = 1.4, iterations = 20, max_loss = 10)
  ),
  settings = list(
    features = list(mixed2 = 1.2, mixed3 = 1.5, mixed4 = 3., mixed5 = 2.5),
    hyperparams = list(step = 0.01, num_octave = 8, octave_scale = 1.4, iterations = 20, max_loss = 25)
  ),
  settings = list(
    features = list(mixed2 = 0.2, mixed3 = 2.5, mixed4 = 2., mixed5 = 3.5),
    hyperparams = list(step = 0.05, num_octave = 6, octave_scale = 1.4, iterations = 20, max_loss = 13)
  ),
  settings = list(
    features = list(mixed2 = 0.2, mixed3 = 2.5, mixed4 = 2., mixed5 = 3.5),
    hyperparams = list(step = 0.05, num_octave = 8, octave_scale = 1.4, iterations = 20, max_loss = 13)
  )
)

# Model Definition --------------------------------------------------------

k_set_learning_phase(0)

# Build the InceptionV3 network with our placeholder.
# The model will be loaded with pre-trained ImageNet weights.
model <- application_inception_v3(weights = "imagenet", include_top = FALSE)

# This will contain our generated image
dream <- model$input

# Get the symbolic outputs of each "key" layer (we gave them unique names).
layer_dict <- model$layers
names(layer_dict) <- map_chr(layer_dict, ~.x$name)

## just loop on images - bottleneck is training anyway --
for(image_path in list_images){

  image <- preprocess_image(image_path)

  # (earlier attempt at getting a model prediction, parked for now:)
  #res <- predict(model, image)
  #reduce(dim(res)[-1], .f = `*`) %>% as.numeric()
  #array_reshape(res, c(2, 51200)) %>% imagenet_decode_predictions()
  ## get model prediction:
  #x <- image_to_array(image)
  #x <- array_reshape(image, c(1, dim(image)))
  #x <- imagenet_preprocess_input(image)
  #preds = predict(model, x)
  #imagenet_decode_predictions(preds, top = 3)
  #predict_classes(model, image)
  # need to try again later

  setting_counter <- 0

  ## then nested loop on settings --
  for(settings in list_settings){

    setting_counter <- setting_counter + 1
    print(paste0('starting ', image_path, ', setting # ', setting_counter))

    # reload image each time
    image <- preprocess_image(image_path)

    # Define the loss
    loss <- k_variable(0.0)
    for(layer_name in names(settings$features)){
      # Add the L2 norm of the features of a layer to the loss
      coeff <- settings$features[[layer_name]]
      x <- layer_dict[[layer_name]]$output
      scaling <- k_prod(k_cast(k_shape(x), 'float32'))
      # Avoid border artifacts by only involving non-border pixels in the loss
      loss <- loss + coeff * k_sum(k_square(x)) / scaling
    }

    # Compute the gradients of the dream wrt the loss
    grads <- k_gradients(loss, dream)[[1]]

    # Normalize gradients.
    grads <- grads / k_maximum(k_mean(k_abs(grads)), k_epsilon())

    # Set up function to retrieve the value
    # of the loss and gradients given an input image.
    fetch_loss_and_grads <- k_function(list(dream), list(loss, grads))

    # this function will crash the R session if too many octaves on too small an image
    eval_loss_and_grads <- function(image){
      outs <- fetch_loss_and_grads(list(image))
      list(
        loss_value = outs[[1]],
        grad_values = outs[[2]]
      )
    }

    gradient_ascent <- function(x, iterations, step, max_loss = NULL) {
      for (i in 1:iterations) {
        # return an NA loss_value (rather than a bare NA) on error, so the check below works
        out <- tryCatch(eval_loss_and_grads(x),
                        error = function(e) list(loss_value = NA, grad_values = NULL))
        # need to add this for negative gradients
        if(is.na(out$loss_value)){
          print(paste0('NA loss_value on setting # ', setting_counter))
          break
        } else if (!is.null(max_loss) && out$loss_value > max_loss) {
          break
        }
        print(paste("Loss value at", i, ':', out$loss_value))
        x <- x + step * out$grad_values
      }
      x
    }

    original_shape <- dim(image)[-c(1, 4)]
    successive_shapes <- list(original_shape)
    for (i in 1:settings$hyperparams$num_octave) {
      successive_shapes[[i + 1]] <- as.integer(original_shape / settings$hyperparams$octave_scale^i)
    }
    successive_shapes <- rev(successive_shapes)

    original_image <- image
    shrunk_original_img <- image_array_resize(
      image,
      successive_shapes[[1]][1],
      successive_shapes[[1]][2]
    )

    shpnum <- 0 # for debugging
    for (shp in successive_shapes) {
      shpnum <- shpnum + 1 # for debugging
      image <- image_array_resize(image, shp[1], shp[2])
      print(paste0('running octave ', shpnum)) # for debugging
      image <- gradient_ascent(image,
                               settings$hyperparams$iterations,
                               settings$hyperparams$step,
                               settings$hyperparams$max_loss)
      print(paste0('finished octave ', shpnum)) # for debugging
      upscaled_shrunk_original_img <- image_array_resize(shrunk_original_img, shp[1], shp[2])
      same_size_original <- image_array_resize(original_image, shp[1], shp[2])
      lost_detail <- same_size_original - upscaled_shrunk_original_img
      image <- image + lost_detail
      shrunk_original_img <- image_array_resize(original_image, shp[1], shp[2])
    }

    image_path %>%
      gsub('/orig/', '/dream/', .) %>%
      gsub('.jpg', paste0('_dream', setting_counter, '.png'), .) %>%
      png(filename = .)
    plot(as.raster(deprocess_image(image)))
    dev.off()

    print(paste0('finished ', image_path, ', setting # ', setting_counter))
  }
}
```
To **leave a comment** for the author, please follow the link and comment on their blog: ** Dan Garmat's Blog -- R**.

### New package packagefinder – Search for packages from the R console

(This article was first published on ** Topics in R**, and kindly contributed to R-bloggers)

There are over 12,700 R packages on CRAN. How do you find the right one for you? The new package ‘packagefinder’ helps you search for packages on CRAN right from the R console. With ‘packagefinder’ you can search for multiple keywords in the name, title and description of a CRAN package, either case-sensitively or case-insensitively, and define your own weighting scheme for the search results, if you like. Once you have found a promising package, you can use the simple function go() to go to the package’s CRAN webpage or view its PDF manual, directly from the R console and without having to install the package first. Of course, you can also install the package easily if you want to try it out. Check out ‘packagefinder’ on CRAN: https://cloud.r-project.org/web/packages/packagefinder/index.html
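For illustration, a search session might look like this (a sketch, not from the original post: the keywords are made up, and the exact arguments of `go()` are assumptions based on the package's description rather than verified against its manual):

```r
library(packagefinder)

# search CRAN package names, titles and descriptions for keywords
# (case-insensitive by default), ranked by relevance
findPackage(c("interactive", "maps"))

# from the results, open the CRAN webpage (or PDF manual) of a hit
# without installing it first -- here, hypothetically, result number 3
go(3, where = "website")
```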

And leave your comments on GitHub (https://github.com/jsugarelli/packagefinder) or contact me via Twitter or e-mail. Your ideas are highly appreciated!

To **leave a comment** for the author, please follow the link and comment on their blog: ** Topics in R**.

### seven digit addition

(This article was first published on ** R – Xi'an's Og**, and kindly contributed to R-bloggers)

**A**nother quick riddle from The Riddler: solve the equation

EXMREEK + EHKREKK = ?K?H?X?E

which involves every digit between 0 and 9. While the puzzle can be unravelled by first considering E and K, which must be equal to 6 and 3 respectively, a simple R code also leads to the conclusion:

```r
isok <- function(a, b){
  s = as.numeric(unlist(strsplit(as.character(sum(10^(6:0) * a) +
                                              sum(10^(6:0) * b)), "")))
  if (length(s) == 7){
    goal = FALSE
  } else {
    goal = (length(unique(c(a, b, s))) == 10) & (a[2] == s[6]) &
      (s[8] == a[6]) & (s[2] == a[7]) & (b[2] == s[4])
  }
  return(goal)
}

pasok <- function(T = 1e3){
  a = b = rep(0, 7)  # initialise both seven-digit numbers
  for (t in 1:T){
    a[1] = a[5] = a[6] = 6; a[7] = 3
    digs = sample(c(0:2, 4, 5, 7:9), 4)
    a[2:4] = digs[1:3]
    b[1] = a[1]; b[2] = digs[4]
    b[3] = a[7]; b[4] = a[4]; b[5] = a[1]; b[6:7] = a[7]
    if (isok(a = a, b = b)) print(rbind(a, b))
  }
}

> pasok()
  [,1] [,2] [,3] [,4] [,5] [,6] [,7]
a    6    2    4    7    6    6    3
b    6    8    3    7    6    3    3
```

whose sum is 13085296.
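As a quick base-R sanity check of that solution, independent of the search code above:

```r
a <- 6247663  # EXMREEK with E = 6, K = 3
b <- 6837633  # EHKREKK
s <- a + b
print(s)  # 13085296

# all ten digits 0-9 appear somewhere in the two summands and the sum
digs <- unique(unlist(strsplit(as.character(c(a, b, s)), "")))
stopifnot(s == 13085296, length(digs) == 10)
```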

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Xi'an's Og**.

### Convex Regression Model

(This article was first published on ** R-english – Freakonometrics**, and kindly contributed to R-bloggers)

This morning, during the lecture on nonlinear regression, I mentioned (very) briefly the case of convex regression. Since I forgot to mention the codes in R, I will publish them here. Assume that $y_i = m(x_i) + \varepsilon_i$ where $m(\cdot)$ is some convex function.

Then $m$ is convex if and only if, for all $x_1, x_2$ and $t \in [0,1]$, $m(t x_1 + (1-t) x_2) \le t\, m(x_1) + (1-t)\, m(x_2)$. Hildreth (1954) proved that if $m^\star = \operatorname{argmin}\left\{ \sum_{i=1}^n (y_i - m(x_i))^2 \right\}$ (the minimum being taken over all convex functions), then $m^\star$ is unique.

Let $\boldsymbol{\theta}^\star = (m^\star(x_1), \ldots, m^\star(x_n))$; then $\boldsymbol{\theta}^\star = \operatorname{argmin}\left\{ \|\mathbf{y} - \boldsymbol{\theta}\|^2,\; \boldsymbol{\theta} \in \mathcal{K} \right\}$, where $\mathcal{K}$ is the set of vectors obtained by evaluating some convex function at $x_1, \ldots, x_n$. I.e. $\boldsymbol{\theta}^\star$ is the projection of $\mathbf{y}$ onto the (closed) convex cone $\mathcal{K}$. The projection theorem gives existence and unicity.

For convenience, in the application, we will consider the real-valued case, $d = 1$, i.e. $x_i \in \mathbb{R}$. Assume that observations are ordered $x_1 \le x_2 \le \cdots \le x_n$. Here the cone $\mathcal{K}$ is described by the linear constraints $$\frac{\theta_2 - \theta_1}{x_2 - x_1} \le \frac{\theta_3 - \theta_2}{x_3 - x_2} \le \cdots \le \frac{\theta_n - \theta_{n-1}}{x_n - x_{n-1}}.$$

Hence, we face a quadratic program with linear constraints.
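To make that concrete, here is a sketch (my own illustration, not from the original post) that solves this quadratic program directly with the quadprog package on the cars data, writing the convexity constraint as non-decreasing slopes of consecutive chords:

```r
library(quadprog)

# distinct, ordered x values (collapse ties for simplicity)
keep <- !duplicated(cars$speed)
x <- cars$speed[keep]
y <- cars$dist[keep]
n <- length(x)

# minimize ||y - theta||^2 = (1/2) theta' I theta - y' theta + const
Dmat <- diag(n)
dvec <- y

# convexity: (theta[i+1]-theta[i])/(x[i+1]-x[i]) <= (theta[i+2]-theta[i+1])/(x[i+2]-x[i+1])
Amat <- matrix(0, nrow = n, ncol = n - 2)
for (i in 1:(n - 2)) {
  d1 <- x[i + 1] - x[i]
  d2 <- x[i + 2] - x[i + 1]
  Amat[i,     i] <-  1 / d1
  Amat[i + 1, i] <- -1 / d1 - 1 / d2
  Amat[i + 2, i] <-  1 / d2
}

# solve.QP imposes t(Amat) %*% theta >= bvec
theta <- solve.QP(Dmat, dvec, Amat, bvec = rep(0, n - 2))$solution

plot(x, y)
lines(x, theta, col = 2)  # the fitted convex (piecewise linear) function
```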

$m^\star$ is then a piecewise linear function (the linear interpolation of the consecutive pairs $(x_i, \theta_i^\star)$).

If $m$ is differentiable, $m$ is convex if $m(\mathbf{x}) + \nabla m(\mathbf{x})^\top (\mathbf{y} - \mathbf{x}) \le m(\mathbf{y})$ for all $\mathbf{x}, \mathbf{y}$.

More generally, if $m$ is convex, then there exists $\boldsymbol{\xi}_{\mathbf{x}} \in \mathbb{R}^d$ such that $m(\mathbf{x}) + \boldsymbol{\xi}_{\mathbf{x}}^\top (\mathbf{y} - \mathbf{x}) \le m(\mathbf{y})$ for all $\mathbf{y}$;

$\boldsymbol{\xi}_{\mathbf{x}}$ is a subgradient of $m$ at $\mathbf{x}$. And then $\partial m(\mathbf{x}) = \left\{ \boldsymbol{\xi} : m(\mathbf{y}) \ge m(\mathbf{x}) + \boldsymbol{\xi}^\top (\mathbf{y} - \mathbf{x}),\ \forall \mathbf{y} \right\}$ is the subdifferential of $m$ at $\mathbf{x}$.

Hence, $\boldsymbol{\theta}^\star$ is the solution of $\min \|\mathbf{y} - \boldsymbol{\theta}\|^2$ subject to $\theta_j \ge \theta_i + \boldsymbol{\xi}_i^\top (x_j - x_i)$ for all $i, j$, for some subgradients $\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_n$. Now, to do it for real, use the cobs package for constrained (B-)splines regression,

```r
library(cobs)
```

To get a convex regression, use

```r
plot(cars)
x = cars$speed
y = cars$dist
rc = conreg(x, y, convex = TRUE)
lines(rc, col = 2)
```

Here we can get the values of the knots

and actually, if we use them in a linear-spline regression, we get the same output here

```r
library(splines)
reg = lm(dist ~ bs(speed, degree = 1, knots = c(4, 7, 8, 9, 20, 23, 25)),
         data = cars)
u = seq(4, 25, by = .1)
v = predict(reg, newdata = data.frame(speed = u))
lines(u, v, col = "green")
```

Let us add vertical lines for the knots

```r
abline(v = c(4, 7, 8, 9, 20, 23, 25), col = "grey", lty = 2)
```
To **leave a comment** for the author, please follow the link and comment on their blog: ** R-english – Freakonometrics**.

### New Course: Support Vector Machines in R

(This article was first published on ** DataCamp Community - r programming**, and kindly contributed to R-bloggers)

Here is the course link.

**Course Description**

Support Vector Machines in R will help students develop an understanding of the SVM model as a classifier and gain practical experience using R’s libsvm implementation from the e1071 package. Along the way, students will gain an intuitive understanding of important concepts, such as hard and soft margins, the kernel trick, different types of kernels, and how to tune SVM parameters. Get ready to classify data with this impressive model.

**Chapter 1: Introduction**

This chapter introduces some key concepts of support vector machines through a simple 1-dimensional example. Students are also walked through the creation of a linearly separable dataset that is used in the subsequent chapter.

**Chapter 2: Support Vector Classifiers – Linear Kernels**

Chapter 2 introduces students to the basic concepts of support vector machines by applying the svm algorithm to a dataset that is linearly separable. Key concepts are illustrated through ggplot visualizations built from the outputs of the algorithm, and the role of the cost parameter is highlighted via a simple example. The chapter closes with a section on how the algorithm deals with multi-class problems.

**Chapter 3: Polynomial Kernels**

This chapter provides an introduction to polynomial kernels via a dataset that is radially separable (i.e., has a circular decision boundary). After demonstrating the inadequacy of linear kernels for this dataset, students will see how a simple transformation renders the problem linearly separable, thus motivating an intuitive discussion of the kernel trick. Students will then apply the polynomial kernel to the dataset and tune the resulting classifier.

**Chapter 4: Radial Basis Function Kernels**

Chapter 4 builds on the previous three chapters by introducing the highly flexible Radial Basis Function (RBF) kernel. Students will create a "complex" dataset that exposes the limitations of polynomial kernels. Then, following an intuitive motivation for the RBF kernel, students see how it addresses the shortcomings of the other kernels discussed in this course.
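As a small taste of the course's toolkit, a radially separable toy problem and an RBF-kernel fit with e1071 might look like this (the dataset construction is my own illustration, not course material):

```r
library(e1071)

# toy data: class depends on distance from the origin (circular boundary)
set.seed(42)
n <- 400
df <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
df$y <- factor(ifelse(df$x1^2 + df$x2^2 < 0.49, "inside", "outside"))

# an RBF kernel handles the circular decision boundary that defeats
# a linear kernel
fit_rbf <- svm(y ~ ., data = df, kernel = "radial", cost = 1, gamma = 1)
mean(predict(fit_rbf, df) == df$y)  # training accuracy
```

Tuning `cost` and `gamma`, e.g. with `tune.svm()`, is the kind of exercise the later chapters walk through.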

The course requires the following prerequisites:

Introduction to Machine Learning

To **leave a comment** for the author, please follow the link and comment on their blog: ** DataCamp Community - r programming**.

### First Class R Support in Binder / Binderhub – Shiny Apps As Well as R-Kernels and RStudio

(This article was first published on ** Rstats – OUseful.Info, the blog…**, and kindly contributed to R-bloggers)

I notice from the binder-examples/r repo that Binderhub now appears to offer all sorts of R goodness out of the can, if you specify a particular R build.

From the same repo root, you can get:

- Jupyter notebook+R: http://beta.mybinder.org/v2/gh/binder-examples/r/master?filepath=index.ipynb
- RStudio: http://beta.mybinder.org/v2/gh/binder-examples/r/master?urlpath=rstudio
- A Shiny app: http://beta.mybinder.org/v2/gh/binder-examples/r/master?urlpath=shiny/bus-dashboard/
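For reference, the R support is driven by a couple of small files at the repo root; a minimal sketch follows (the snapshot date and package are illustrative, not taken from the repo -- check binder-examples/r for the current conventions):

```
# runtime.txt -- selects the R build; the date pins package installs
# to an MRAN snapshot of that day
r-2018-02-05

# install.R -- executed while the Binder image is built
install.packages("shiny")
```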

And from previously, here’s a workaround for displaying R/HTMLwidgets in a Jupyter notebook.

OpenRefine is also available from a simple URL – https://mybinder.org/v2/gh/betatim/openrefineder/master?urlpath=openrefine – courtesy of betatim/openrefineder:

Perhaps it’s time for me to try to get my head round what the Jupyter notebook proxy handlers are doing…

PS see also Scripted Forms for a simple markdown script way of building interactive Jupyter widget powered UIs.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Rstats – OUseful.Info, the blog…**.

### New Course: Interactive Maps with leaflet in R

(This article was first published on ** DataCamp Community - r programming**, and kindly contributed to R-bloggers)

Here is the course link.

**Course Description**

Get ready to have some fun with maps! Interactive Maps with leaflet in R will give you the tools to make attractive and interactive web maps using spatial data and the tidyverse. In this course, you will create maps using the IPEDS dataset, which contains data on U.S. colleges and universities. Along the way, you will customize your maps using labels, popups, and custom markers, and add layers to enhance interactivity. Following the course, you will be able to create and customize your own interactive web maps to reveal patterns in your data.

**Chapter 1: Setting Up Interactive Web Maps (Free)**

This chapter will introduce students to the htmlwidgets package and the leaflet package. Following this introduction, students will build their first interactive web map using leaflet. Through the process of creating this first map, students will be introduced to many of the core features of the leaflet package, including adding different map tiles, setting the center point and zoom level, plotting single points based on latitude and longitude coordinates, and storing leaflet maps as objects.
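The core pattern taught here is a pipe of leaflet layers; a minimal sketch (the coordinates are an arbitrary example, not from the course):

```r
library(leaflet)

m <- leaflet() %>%
  addTiles() %>%                                    # default OpenStreetMap tiles
  setView(lng = -98.58, lat = 39.83, zoom = 4) %>%  # set center point and zoom
  addMarkers(lng = -122.26, lat = 37.87,
             popup = "An example point of interest")
m  # printing the stored leaflet object renders the interactive map
```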

**Chapter 2: Plotting Points**

In this chapter, students will build on the leaflet map they created in Chapter 1 to create an interactive web map of every four-year college in California. After plotting hundreds of points on an interactive leaflet map, students will learn to customize the markers on their leaflet map. This chapter will also show how to color-code markers based on a factor variable.

**Chapter 3: Groups, Layers, and Extras**

In Chapter 3, students will expand on their map of all four-year colleges in California to create a map of all American colleges. First, in section 3.1, students will review and build on the material from Chapter 2 to create a map of all American colleges. Then students will re-plot the colleges on their leaflet map by sector (public, private, or for-profit), using groups to enable users to toggle which colleges are displayed on the map. In section 3.3, students will learn to add multiple base maps so that users can toggle between map tiles.

**Chapter 4: Plotting Polygons**

In Chapter 4, students will learn to map polygons, which can be used to define geographic regions (e.g., zip codes, states, countries). Chapter 4 will start by plotting the zip codes in North Carolina that fall in the top quartile of mean family incomes. Students will learn to customize the polygons with color palettes and labels. Chapter 4 will conclude by adding a new layer to the map of every college in America that displays every zip code with a mean income of $200,000 or more during the 2015 tax year. Through the process of mapping zip codes, students will learn about spatial data generally, geoJSON data, the @ symbol, and the addPolygons() function.
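In its simplest form, addPolygons() accepts raw coordinate vectors; a toy sketch (the rectangle below is made up, standing in for a real zip-code boundary loaded from geoJSON):

```r
library(leaflet)

m <- leaflet() %>%
  addTiles() %>%
  addPolygons(lng = c(-80.0, -79.9, -79.9, -80.0),
              lat = c(35.0, 35.0, 35.1, 35.1),
              color = "navy", fillOpacity = 0.4,
              label = "A hypothetical high-income zip code")
m
```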

This course requires the following prerequisites:

To **leave a comment** for the author, please follow the link and comment on their blog: ** DataCamp Community - r programming**.