R news and tutorials contributed by hundreds of R bloggers

### RStudio Connect v1.6.4.2 – Security Update

Mon, 07/09/2018 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

A security vulnerability in a third-party library used by RStudio Connect was uncovered during a security audit last week. We have confirmed that this vulnerability has not been used against any of the RStudio Connect instances we host, and are unaware of it being exploited on any customer deployments. Under certain conditions, this vulnerability could compromise the session of a user that was tricked into visiting a specially crafted URL. The issue affects all versions of RStudio Connect up to and including 1.6.4.1, but none of our other products. We have prepared a hotfix: v1.6.4.2.

RStudio remains committed to providing the most secure product possible. We regularly perform internal security audits against RStudio Connect in order to ensure the product’s security.

As part of the responsible disclosure process, we will provide additional details about the vulnerability, and about how to verify that you have not been affected, in the coming weeks, once customers have had time to update their systems. For now, please update your RStudio Connect installations to version 1.6.4.2 as soon as possible.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Solver Interfaces in CVXR

Mon, 07/09/2018 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Introduction

In our previous blog post, we introduced CVXR, an R package for disciplined convex optimization. The package allows one to describe an optimization problem with Disciplined Convex Programming rules using high-level mathematical syntax. Passing this problem definition (along with a list of constraints, if any) to the solve function transforms it into a form that can be handed off to a solver. The default installation of CVXR comes with two (imported) open source solvers:

• ECOS and its mixed integer cousin ECOS_BB, via the CRAN package ECOSolveR
• SCS, via the CRAN package scs

CVXR (version 0.99) can also make use of several other open source
solvers implemented in R packages:

• The linear and mixed integer programming package lpSolve, via the lpSolveAPI package
• The linear and mixed integer programming package GLPK, via the Rglpk package
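Whichever solvers are installed, the choice between them is made through the `solver` argument of `solve()`. The small problem below is our own illustration (the variable and objective are not from the post), using the ECOS solver that ships with the default CVXR install:

```r
library(CVXR)

# Small least-squares problem with a nonnegativity constraint
x <- Variable(2)
objective <- Minimize(sum_squares(x - c(1, 2)))
prob <- Problem(objective, list(x >= 0))

# Pick a solver explicitly; ECOS is bundled with CVXR
result <- solve(prob, solver = "ECOS")
result$status        # "optimal"
result$getValue(x)   # approximately c(1, 2)
```

Swapping `solver = "ECOS"` for `"SCS"`, `"GLPK"`, or (once set up) `"MOSEK"`/`"GUROBI"` leaves the problem description untouched, which is the point of the DSL design.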

The real work of finding a solution is done by solvers, and writing
good solvers is hard work. Furthermore, some solvers work particularly
well for certain types of problems (linear programs, quadratic
programs, etc.). Not surprisingly, there are commercial vendors who
have solvers that are designed for performance and scale. Two
well-known solvers are MOSEK and
GUROBI. R packages for these solvers are
also provided, but they require the problem data to be constructed in
a specific form. This necessitates a bit of work in the current version of
CVXR and is certainly something we plan to include in future versions.
However, it is also true that these commercial solvers expose a much
richer API to Python programmers than to R programmers. How, then, do we
interface such solvers with R as quickly as possible, at least
in the short term?

Reticulate to the Rescue

The current version of CVXR exploits the reticulate package for commercial solvers such as MOSEK and GUROBI. We took the Python solver interfaces in CVXPY version 0.4.11, edited them suitably to make them self-contained, and hooked them up to reticulate.

This means that one needs two prerequisites to use these commercial solvers in the current version of CVXR: the reticulate package, and a working installation of the solver's own Python interface.

Installing MOSEK/GUROBI

Both MOSEK and GUROBI provide academic licenses (registration required) free of charge. For example, Anaconda users can install MOSEK with the command:

conda install -c mosek mosek

Others can install it via pip. GUROBI is handled in a similar fashion. The solvers must be activated using a license.

Once activated, one can check that CVXR recognizes the solver;
installed_solvers() should list them.

```r
> installed_solvers()
[1] "ECOS"    "ECOS_BB" "SCS"     "MOSEK"   "LPSOLVE" "GLPK"    "GUROBI"
```

Further information

More information on these solvers, along with a number of tutorial examples, is available on the CVXR site. If you are attending useR! 2018, you can catch Anqi’s CVXR talk on Friday, July 13.




### Time Series Analysis With Documentation And Steps I Follow For Analytics Projects.

Mon, 07/09/2018 - 01:01

(This article was first published on R-Analytics, and kindly contributed to R-bloggers)

To illustrate the process, I will create a prediction of the open values for Bitcoin for the next 3 days.

The process I follow is based on CRISP-DM methodology: https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome

1.- Planning the activities.

To plan the activities I use a spreadsheet. Below I show a sample of it; if you would like the document, please use the following link:


| Activity | Activity Description | Due Date | Activity Owner | Status | Comments |
|---|---|---|---|---|---|
| Functional Requirement Specification | A text document explaining the objectives of this project. | 4/19/2018 | Carlos Kassab | Done | |
| Get Data For Analysis | Get initial data to create a feasibility analysis. | 4/19/2018 | Carlos Kassab | Done | |
| ETL Development | ETL to get final data for the next analysis. | 4/20/2018 | Carlos Kassab | Done | 2018/04/19: No ETL needed; the dataset was downloaded from Kaggle: https://www.kaggle.com/vivekchamp/bitcoin/data |
| Exploratory Data Analysis | Dataset summary and histogram to check data normalization. | 4/20/2018 | Carlos Kassab | In progress | |
| Variables Frequency | Frequency of variable occurrence (frequency of value changes, etc.). | 4/20/2018 | Carlos Kassab | Done | 2018/04/19: We have already seen that our values change every day. |
| Outliers Analysis | Analysis of variability in numeric variables, shown in charts and grids. | 4/20/2018 | Carlos Kassab | In progress | |
| Time Series Decomposition | Getting metric charts: raw data, seasonality, trend and remainder. | 4/20/2018 | Carlos Kassab | In progress | |
| Modelling | Create the analytics model. | 4/25/2018 | Carlos Kassab | Not started | |
| SQL View Development | For training, validation and testing. | NA | Carlos Kassab | Not started | 2018/04/19: No SQL view needed; everything is done inside the R script. |
| Model Selection | Use a random parameter search algorithm to find the right model for this data. | 4/25/2018 | Carlos Kassab | Not started | |
| Model Fine Tuning | After finding the right algorithm, find the right model parameters to use. | 4/25/2018 | Carlos Kassab | Not started | |
| Chart Development | Final data chart development. | 4/25/2018 | Carlos Kassab | Not started | |
| Data Validation | Run the analytics model daily for at least 2 weeks to observe its behavior. | NA | Carlos Kassab | Not started | |
| Deployment | Schedule the automatic execution of the R code. | NA | Carlos Kassab | Not started | |

The first activity is the functional requirement specification; this is similar to the business-understanding phase of the CRISP-DM methodology.

For this analysis I use a text document; you can get it from this link:
https://github.com/LaranIkal/R-ANALYTICS/blob/master/BitCoinPredictionRequirementSpecification.odt

Now, the next step is to get the data for analysis and create the ETL script. In this case we simply got the data from kaggle.com as mentioned in the documentation, so no ETL script was needed.

So, we have our data; now we are going to analyze it, performing all the activities highlighted in the grid above. I did this in a Markdown document; here you can see the HTML output:

https://github.com/LaranIkal/R-ANALYTICS/blob/master/BitcoinDataAnalysis.html

Note: at the end, in order to keep everything together, I included the time series algorithms in the same Rmd file that creates the BitcoinDataAnalysis.html file.

You can get all the sources from:

https://github.com/LaranIkal/R-ANALYTICS

Enjoy it!!!

More information about R: https://www.r-bloggers.com


### a thread to bin them all [puzzle]

Mon, 07/09/2018 - 00:18

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

The most recent riddle on the Riddler consists in finding the shortest sequence of digits (in 0,1,..,9) such that all 10⁴ numbers between 0 (or 0000) and 9,999 appear as a group of four consecutive digits. This sequence is obviously longer than 10⁴+3, but how long? On my trip to Brittany last weekend, I wrote an R code first constructing the sequence at random by picking with high preference the next digit among those producing a new four-digit number

```r
tenz = 10^(0:3)
wn2dg = function(dz) 1 + sum(dz * tenz)
seqz = rep(0, 10^4)
snak = wndz = sample(0:9, 4, rep = TRUE)
seqz[wn2dg(wndz)] = 1
while (min(seqz) == 0) {
  wndz[1:3] = wndz[-1]; wndz[4] = 0
  wndz[4] = sample(0:9, 1, prob = .01 + .99 * (seqz[wn2dg(wndz) + 0:9] == 0))
  snak = c(snak, wndz[4])
  sek = wn2dg(wndz)
  seqz[sek] = seqz[sek] + 1
}
```

which usually returns a value above 75,000. I then looked through the sequence to eliminate useless replicas

```r
for (i in sample(4:(length(snak) - 5))) {
  if ((seqz[wn2dg(snak[(i-3):i])] > 1) &
      (seqz[wn2dg(snak[(i-2):(i+1)])] > 1) &
      (seqz[wn2dg(snak[(i-1):(i+2)])] > 1) &
      (seqz[wn2dg(snak[i:(i+3)])] > 1)) {
    seqz[wn2dg(snak[(i-3):i])] = seqz[wn2dg(snak[(i-3):i])] - 1
    seqz[wn2dg(snak[(i-2):(i+1)])] = seqz[wn2dg(snak[(i-2):(i+1)])] - 1
    seqz[wn2dg(snak[(i-1):(i+2)])] = seqz[wn2dg(snak[(i-1):(i+2)])] - 1
    seqz[wn2dg(snak[i:(i+3)])] = seqz[wn2dg(snak[i:(i+3)])] - 1
    snak = snak[-i]
    seqz[wn2dg(snak[(i-3):i])] = seqz[wn2dg(snak[(i-3):i])] + 1
    seqz[wn2dg(snak[(i-2):(i+1)])] = seqz[wn2dg(snak[(i-2):(i+1)])] + 1
    seqz[wn2dg(snak[(i-1):(i+2)])] = seqz[wn2dg(snak[(i-1):(i+2)])] + 1
  }
}
```

until none is found. A first attempt produced 12,911 terms in the sequence. A second one 12,913. A third one 12,871. Rather consistent figures but not concentrated enough to believe in achieving a true minimum. An overnight run produced 12,779 as the lowest value.
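As a quick sanity check on any candidate sequence, one can verify that every four-digit window occurs at least once. The helper below is our own, not code from the post; it encodes each consecutive window as a number in 0..9999 and counts distinct values:

```r
# Does a digit sequence (a vector of digits 0..9) contain every
# four-digit window from 0000 to 9999?
covers_all <- function(snak) {
  wins <- embed(snak, 4)[, 4:1, drop = FALSE]   # all consecutive windows, in order
  codes <- as.vector(wins %*% c(1000, 100, 10, 1))  # window -> number in 0..9999
  length(unique(codes)) == 10^4
}

# A short random sequence cannot possibly cover all 10^4 windows:
covers_all(sample(0:9, 100, replace = TRUE))  # FALSE
```

A sequence of length L contains L − 3 windows, so any sequence passing this check has length at least 10⁴ + 3, matching the lower bound quoted in the post.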



### World Income, Inequality and Murder

Mon, 07/09/2018 - 00:00

(This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)

Abstract:

We follow up on last week’s post on using Gapminder data to study the world’s income distribution. To assess the inequality of the distribution we compute the Gini coefficient for the world’s income distribution by Monte Carlo approximation and visualize the result as a time series. Furthermore, we animate the association between the Gini coefficient and the homicide rate per country using the new version of gganimate.

Introduction

One of the main messages of the chapter ‘The Gap Instinct’ of the book Factfulness is that there is no justification for the ‘us’ and ‘them’ classification of countries anymore, because countries have converged towards the same levels in key indicators such as life expectancy, child mortality, and births per female. The difference between countries is hence not as many imagine it to be: there is less inequality and no real gap. While reading, I became curious about the following: what if countries became more equal, but simultaneously inequality within countries became bigger? This was also raised indirectly in a Disqus comment by F. Weidemann on the post Factfulness: Building Gapminder Income Mountains. The aim of the present post is to investigate this hypothesis using the Gapminder data. Furthermore, we use the country-specific Gini coefficients to investigate the association with the number of homicides in each country.

Gini coefficient

There are different ways to measure income inequality, both in terms of which response you consider and which statistical summary you compute for it. Not going into the details of this discussion, we use GDP/capita in Purchasing Power Parities (PPP) measured in so-called international dollars (fixed prices, 2011). In other words, comparisons between years and countries are possible, because the response is adjusted for inflation and for differences in cost of living.

The Gini coefficient is a statistical measure to quantify inequality. In what follows we shall focus on computing the Gini coefficient for a continuous probability distribution with a known probability density function. Let the probability density function of the non-negative continuous income distribution be defined by $$f$$, then the Gini coefficient is given as half the relative mean difference:

$G = \frac{1}{2\mu}\int_0^\infty \int_0^\infty |x-y| \> f(x) \> f(y) \> dx\> dy, \quad\text{where}\quad \mu = \int_{0}^\infty x\cdot f(x) dx.$

Depending on $$f$$ it might be possible to solve these integrals analytically; however, a straightforward computational approach is to use Monte Carlo sampling, as we shall see shortly. Personally, I find the above relative-mean-difference presentation of the Gini index much more intuitive than the area argument using the Lorenz curve. From the equation it also becomes clear that the Gini coefficient is invariant to multiplicative changes in income: if everybody increases their income by a factor $$k>0$$ then the Gini coefficient remains the same, because $$|k x - k y| = k | x - y|$$ and $$E(k \cdot X) = k \mu$$.

The above formula indirectly also states how to compute the Gini coefficient for a discrete sample of size $$n$$ with incomes $$x_1,\ldots, x_n$$: $G = \frac{\sum_{i=1}^n \sum_{j=1}^n |x_i - x_j| \frac{1}{n} \frac{1}{n}}{2 \sum_{i=1}^n x_i \frac{1}{n}} = \frac{\sum_{i=1}^n \sum_{j=1}^n |x_i - x_j|}{2 n \sum_{j=1}^n x_j}.$
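The discrete formula translates directly into R; the function name below is ours, not from the post. It is the straightforward O(n²) version, and the scale invariance discussed above is easy to confirm on a small sample:

```r
# Gini coefficient of a sample of incomes (direct O(n^2) double sum)
gini_discrete <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (2 * n * sum(x))
}

gini_discrete(c(1, 1, 1, 1))   # 0: perfect equality
gini_discrete(c(0, 1))         # 0.5
gini_discrete(7 * c(0, 1))     # 0.5: invariant to multiplying all incomes by k
```

For large n one would use the sorted-sample formula instead of the double sum, but the version above matches the equation term by term.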

Approximating the integral by Monte Carlo

If one is able to easily sample from $$f$$, then instead of solving the integral analytically one can use $$K$$ pairs $$(x,y)$$, both drawn at random from $$f$$, to approximate the double integral:

$G \approx \frac{1}{2\mu K} \sum_{k=1}^K |x_k - y_k|, \quad\text{where}\quad x_k \stackrel{\text{iid}}{\sim} f \text{ and } y_k \stackrel{\text{iid}}{\sim} f,$ where for our mixture model $\mu = \sum_{i=1}^{192} w_i \> E(X_i) = \sum_{i=1}^{192} w_i \exp\left(\mu_i + \frac{1}{2}\sigma_i^2\right).$ This allows us to compute $$G$$ even in the case of a complex $$f$$ such as the log-normal mixture distribution. As always, the larger $$K$$ is, the better the Monte Carlo approximation.
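For a single log-normal component the Monte Carlo estimate can be checked against the known closed form for the log-normal Gini coefficient, $G = 2\Phi(\sigma/\sqrt{2}) - 1$. The sketch below is our own check, not the post's code:

```r
set.seed(42)
K <- 1e5
sdlog <- 1

# Two independent samples from the income distribution f (here one log-normal)
x <- rlnorm(K, meanlog = 0, sdlog = sdlog)
y <- rlnorm(K, meanlog = 0, sdlog = sdlog)

# mu = E(X) for a log-normal with meanlog = 0
mu <- exp(sdlog^2 / 2)
G_mc <- mean(abs(x - y)) / (2 * mu)

# Exact Gini coefficient of a log-normal distribution
G_exact <- 2 * pnorm(sdlog / sqrt(2)) - 1

c(G_mc, G_exact)  # both close to 0.52
```

With K = 10⁵ the two values agree to about two decimal places, which is adequate for the time-series plots below; the post uses K = 10⁶.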

```r
## Precision of the Monte Carlo approximation is controlled by the number of samples
n <- 1e6

## Compute Gini index of world income per year
gini_year <- gm %>% group_by(year) %>% do({
  x <- rmix(n, meanlog = .$meanlog, sdlog = .$sdlog, w = .$w)
  y <- rmix(n, meanlog = .$meanlog, sdlog = .$sdlog, w = .$w)
  int <- mean(abs(x - y))
  mu <- sum(exp(.$meanlog + 1/2 * .$sdlog^2) * .$w)
  data.frame(gini_all_mc = 1/(2*mu) * int,
             country_gini = gini(.$gdp * .$population))
}) %>%
  rename(`World Population GDP/capita` = gini_all_mc,
         `Country GDP/capita` = country_gini)

## Convert to long format
gini_ts <- gini_year %>% gather(key = "type", value = "gini", -year)
```

Results

We start by showing the country-specific Gini coefficient per year since 1950 for a somewhat arbitrary selection of countries. The dashed black line shows the mean Gini coefficient each year over all 192 countries in the dataset. In addition, we now show the Monte Carlo computed Gini coefficient for the world’s income distribution given as a log-normal mixture with 192 components. We notice that the Gini coefficient taken over the 192 countries’ GDP/capita remains very stable over time. This, however, does not take the large differences in population between countries into account. A fairer measure is thus the Gini coefficient for the world’s income distribution. We see that this Gini coefficient increased over time until peaking around 1990. From then on it has declined. However, the pre-1950 Gini coefficients are rather guesstimates, as stated by Gapminder; hence, we zoom in on the period from 1970, because data are more reliable from this point on.

Gini coefficient and Homicide Rate

Finally, we end the post by illustrating the association between the Gini coefficient and the homicide rate per country using a 2D scatterplot over the years. The Gapminder data download page also contains data for this for the years 1950-2005. Unfortunately, no data for more recent years are available from the Gapminder homepage, but the plot shown below is the situation in `r show_year` with a log-base-10 y-axis for the homicide rates. For each of the four Gapminder regional groups we also fit a simple linear regression line to the points of all countries within the region. Findings such as Fajnzylber, Lederman, and Loayza (2002) suggest that there is a strong positive correlation between Gini coefficient and homicide rate, but we see from the plot that there are regional differences, and of course correlation is not causality. Explanations for this phenomenon are debated and beyond the scope of this post.

We extend the plots to all years 1950-2005. Unfortunately, not all countries are available every year, so we only plot the available countries each year. This means that many African countries are missing from the animation. An improvement would be to try some form of linear interpolation. Furthermore, for the sake of simplicity of illustration, we fix countries with a reported murder rate of zero in a given year (this happens, for example, for Cyprus, Iceland and Fiji in some years) to 0.01 per 100,000 population. This can be nicely animated using the new version of the gganimate pkg by Thomas Lin Pedersen.

```r
## New version of gganimate. Not on CRAN yet.
## devtools::install_github('thomasp85/gganimate')
require(gganimate)

p <- ggplot(gm2_nozero, aes(x = gini, y = murder_rate,
                            size = population, color = Region)) +
  geom_point() +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(trans = "log10",
                     breaks = trans_breaks("log10", function(x) 10^x, n = 5),
                     labels = trans_format("log10", function(x)
                       ifelse(x < 0,
                              sprintf(paste0("%.", ifelse(is.na(x), "0", round(abs(x))), "f"), 10^x),
                              sprintf("%.0f", 10^x)))) +
  geom_smooth(se = FALSE, method = "lm", formula = y ~ x) +
  geom_text(data = gm2, aes(x = gini, y = murder_rate, label = country),
            vjust = -0.9, show.legend = FALSE) +
  ylab("Murder rate (per 100,000 population)") +
  xlab("Gini coefficient (in %)") +
  guides(size = FALSE) +
  labs(title = 'Year: {frame_time}') +
  transition_time(year) +
  ease_aes('linear')

animate(p, nframes = length(unique(gm2$year)), fps = 4,
        width = 800, height = 400, res = 100)
```

Discussion

Based on the available Gapminder data we showed that in the last 25 years the Gini coefficient for the world’s income distribution has decreased. For several individual countries opposite dynamics are, however, observed. One particular concern is the share that the richest 1% have of the overall wealth: more than 50%.

Literature

Fajnzylber, P., D. Lederman, and N. Loayza. 2002. “Inequality and Violent Crime.” Journal of Law and Economics 45 (April). http://siteresources.worldbank.org/DEC/Resources/Crime&Inequality.pdf.



### Statistics Sunday: Mixed Effects Meta-Analysis

Sun, 07/08/2018 - 20:08

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

As promised, how to conduct mixed effects meta-analysis in R:

Code used in the video is available here. And I’d recommend the following posts to provide background for this video:



### Speed up your R Work

Sun, 07/08/2018 - 18:24

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Introduction

In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base-R and other packages.

For each of the above packages we speed up work by using wrapr::execute_parallel which in turn uses wrapr::partition_tables to partition un-related data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines.

The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.
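The same partition-then-distribute pattern can be sketched with nothing but base R and the parallel package. This is a simplified stand-in for the wrapr::partition_tables / wrapr::execute_parallel machinery described above (the data and function names are ours):

```r
library(parallel)

# Toy data with a grouping column; rows sharing a group must stay together
d <- data.frame(key_group = rep(c("a", "b", "c", "d"), each = 25),
                x = runif(100))

pieces <- split(d, d$key_group)     # partition by the grouping column
cl2 <- makeCluster(2)
res <- parLapply(cl2, pieces, function(p) {
  p$rank <- rank(p$x)               # per-group work runs on a worker process
  p
})
stopCluster(cl2)

out <- do.call(rbind, res)          # reassemble the partitions
nrow(out)  # 100
```

Because each piece is self-contained, any per-group computation is correct under this scheme; the overhead of shipping the pieces to workers is what must be amortized by the per-piece work.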

Keep in mind: unless the pipeline steps have non-trivial cost, the overhead of partitioning and distributing the work may overwhelm any parallel speedup. Also data.table itself already seems to exploit some thread-level parallelism (notice user time is greater than elapsed time). That being said, in this note we will demonstrate a synthetic example where computation is expensive due to a blow-up in an intermediate join step.

Our example

First we set up our execution environment and example (some details: OSX 10.13.4 on a 2.8 GHz Intel Core i5 Mac Mini (Late 2015 model) with 8GB RAM and hybrid disk drive).

```r
library("rqdatatable")
## Loading required package: rquery
library("microbenchmark")
library("ggplot2")
library("WVPlots")
suppressPackageStartupMessages(library("dplyr"))
## Warning: package 'dplyr' was built under R version 3.5.1

base::date()
## [1] "Sun Jul 8 09:05:25 2018"
R.version.string
## [1] "R version 3.5.0 (2018-04-23)"
parallel::detectCores()
## [1] 4
packageVersion("parallel")
## [1] '3.5.0'
packageVersion("rqdatatable")
## [1] '0.1.2'
packageVersion("rquery")
## [1] '0.5.1'
packageVersion("dplyr")
## [1] '0.7.6'

ncore <- parallel::detectCores()
print(ncore)
## [1] 4
cl <- parallel::makeCluster(ncore)
print(cl)
## socket cluster with 4 nodes on host 'localhost'

set.seed(2362)

mk_example <- function(nkey, nrep, ngroup = 20) {
  keys <- paste0("key_", seq_len(nkey))
  key_group <- sample(as.character(seq_len(ngroup)),
                      length(keys), replace = TRUE)
  names(key_group) <- keys
  key_table <- data.frame(
    key = rep(keys, nrep),
    stringsAsFactors = FALSE)
  key_table$data <- runif(nrow(key_table))
  instance_table <- data.frame(
    key = rep(keys, nrep),
    stringsAsFactors = FALSE)
  instance_table$id <- seq_len(nrow(instance_table))
  instance_table$info <- runif(nrow(instance_table))
  # groups should be no finer than keys
  key_table$key_group <- key_group[key_table$key]
  instance_table$key_group <- key_group[instance_table$key]
  list(key_table = key_table, instance_table = instance_table)
}

dlist <- mk_example(10, 10)
data <- dlist$instance_table
annotation <- dlist$key_table
```

rquery / rqdatatable

rquery and rqdatatable can implement a non-trivial calculation as follows.
```r
# possible data lookup: find rows that
# have lookup data <= info
optree <- local_td(data) %.>%
  natural_join(.,
               local_td(annotation),
               jointype = "INNER",
               by = "key") %.>%
  select_rows_nse(., data <= info) %.>%
  pick_top_k(.,
             k = 1,
             partitionby = "id",
             orderby = "data",
             reverse = "data",
             keep_order_column = FALSE) %.>%
  orderby(., "id")

cat(format(optree))
## table('data';
##   key,
##   id,
##   info,
##   key_group) %.>%
##  natural_join(.,
##   table('annotation';
##     key,
##     data,
##     key_group),
##   j= INNER, by= key) %.>%
##  select_rows(.,
##   data <= info) %.>%
##  extend(.,
##   row_number := row_number(),
##   p= id,
##   o= "data" DESC) %.>%
##  select_rows(.,
##   row_number <= 1) %.>%
##  drop_columns(.,
##   row_number) %.>%
##  orderby(., id)

res1 <- ex_data_table(optree)
head(res1)
##         data id      info   key key_group
## 1: 0.9152014  1 0.9860654 key_1        20
## 2: 0.5599810  2 0.5857570 key_2         8
## 3: 0.3011882  3 0.3334490 key_3        10
## 4: 0.3650987  4 0.3960980 key_4         5
## 5: 0.1469254  5 0.1753649 key_5        14
## 6: 0.2567631  6 0.3510280 key_6         7
nrow(res1)
## [1] 94
```

And we can execute the operations in parallel.

```r
parallel::clusterEvalQ(cl, library("rqdatatable"))
## [[1]]
## [1] "rqdatatable" "rquery"    "stats"     "graphics"  "grDevices"
## [6] "utils"       "datasets"  "methods"   "base"
## (identical output repeated for the remaining 3 workers)

res2 <- ex_data_table_parallel(optree, "key_group", cl)
head(res2)
##         data id      info   key key_group
## 1: 0.9152014  1 0.9860654 key_1        20
## 2: 0.5599810  2 0.5857570 key_2         8
## 3: 0.3011882  3 0.3334490 key_3        10
## 4: 0.3650987  4 0.3960980 key_4         5
## 5: 0.1469254  5 0.1753649 key_5        14
## 6: 0.2567631  6 0.3510280 key_6         7
nrow(res2)
## [1] 94
```

data.table

data.table can implement the same function.
```r
library("data.table")
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
##     between, first, last
packageVersion("data.table")
## [1] '1.11.4'

data_table_f <- function(data, annotation) {
  data <- data.table::as.data.table(data)
  annotation <- data.table::as.data.table(annotation)
  joined <- merge(data, annotation, by = "key",
                  all = FALSE, allow.cartesian = TRUE)
  joined <- joined[joined$data <= joined$info, ]
  data.table::setorderv(joined, cols = "data")
  joined <- joined[, .SD[.N], id]
  data.table::setorderv(joined, cols = "id")
}

resdt <- data_table_f(data, annotation)
head(resdt)
##    id   key      info key_group.x      data key_group.y
## 1:  1 key_1 0.9860654          20 0.9152014          20
## 2:  2 key_2 0.5857570           8 0.5599810           8
## 3:  3 key_3 0.3334490          10 0.3011882          10
## 4:  4 key_4 0.3960980           5 0.3650987           5
## 5:  5 key_5 0.1753649          14 0.1469254          14
## 6:  6 key_6 0.3510280           7 0.2567631           7
nrow(resdt)
## [1] 94
```

We can also run data.table in parallel using wrapr::execute_parallel.

```r
parallel::clusterEvalQ(cl, library("data.table"))
## [[1]]
## [1] "data.table" "rqdatatable" "rquery"   "stats"    "graphics"
## [6] "grDevices"  "utils"       "datasets" "methods"  "base"
## (identical output repeated for the remaining 3 workers)
parallel::clusterExport(cl, "data_table_f")

dt_f <- function(tables_list) {
  data <- tables_list$data
  annotation <- tables_list$annotation
  data_table_f(data, annotation)
}

data_table_parallel_f <- function(data, annotation) {
  respdt <- wrapr::execute_parallel(
      tables = list(data = data, annotation = annotation),
      f = dt_f,
      partition_column = "key_group",
      cl = cl) %.>%
    data.table::rbindlist(.)
  data.table::setorderv(respdt, cols = "id")
  respdt
}

respdt <- data_table_parallel_f(data, annotation)
head(respdt)
##    id   key      info key_group.x      data key_group.y
## 1:  1 key_1 0.9860654          20 0.9152014          20
## 2:  2 key_2 0.5857570           8 0.5599810           8
## 3:  3 key_3 0.3334490          10 0.3011882          10
## 4:  4 key_4 0.3960980           5 0.3650987           5
## 5:  5 key_5 0.1753649          14 0.1469254          14
## 6:  6 key_6 0.3510280           7 0.2567631           7
nrow(respdt)
## [1] 94
```

dplyr

dplyr can also implement the example.

```r
dplyr_pipeline <- function(data, annotation) {
  res <- data %>%
    inner_join(annotation, by = "key") %>%
    filter(data <= info) %>%
    group_by(id) %>%
    arrange(-data) %>%
    mutate(rownum = row_number()) %>%
    ungroup() %>%
    filter(rownum == 1) %>%
    arrange(id)
  res
}

resd <- dplyr_pipeline(data, annotation)
head(resd)
## # A tibble: 6 x 7
##   key      id  info key_group.x  data key_group.y rownum
## 1 key_1     1 0.986          20 0.915          20      1
## 2 key_2     2 0.586           8 0.560           8      1
## 3 key_3     3 0.333          10 0.301          10      1
## 4 key_4     4 0.396           5 0.365           5      1
## 5 key_5     5 0.175          14 0.147          14      1
## 6 key_6     6 0.351           7 0.257           7      1
nrow(resd)
## [1] 94
```

And we can use wrapr::execute_parallel to parallelize the dplyr solution.
```r
parallel::clusterEvalQ(cl, library("dplyr"))
## [[1]]
## [1] "dplyr"     "data.table" "rqdatatable" "rquery"   "stats"
## [6] "graphics"  "grDevices"  "utils"       "datasets" "methods"
## [11] "base"
## (identical output repeated for the remaining 3 workers)
parallel::clusterExport(cl, "dplyr_pipeline")

dplyr_f <- function(tables_list) {
  data <- tables_list$data
  annotation <- tables_list$annotation
  dplyr_pipeline(data, annotation)
}

dplyr_parallel_f <- function(data, annotation) {
  respdt <- wrapr::execute_parallel(
      tables = list(data = data, annotation = annotation),
      f = dplyr_f,
      partition_column = "key_group",
      cl = cl) %>%
    dplyr::bind_rows() %>%
    arrange(id)
}

respdplyr <- dplyr_parallel_f(data, annotation)
head(respdplyr)
## # A tibble: 6 x 7
##   key      id  info key_group.x  data key_group.y rownum
## 1 key_1     1 0.986          20 0.915          20      1
## 2 key_2     2 0.586           8 0.560           8      1
## 3 key_3     3 0.333          10 0.301          10      1
## 4 key_4     4 0.396           5 0.365           5      1
## 5 key_5     5 0.175          14 0.147          14      1
## 6 key_6     6 0.351           7 0.257           7      1
nrow(respdplyr)
## [1] 94
```

Benchmark

We can benchmark the various realizations.
dlist <- mk_example(300, 300)
data <- dlist$instance_table
annotation <- dlist$key_table

timings <- microbenchmark(
  data_table_parallel  = nrow(data_table_parallel_f(data, annotation)),
  data_table           = nrow(data_table_f(data, annotation)),
  rqdatatable_parallel = nrow(ex_data_table_parallel(optree, "key_group", cl)),
  rqdatatable          = nrow(ex_data_table(optree)),
  dplyr_parallel       = nrow(dplyr_parallel_f(data, annotation)),
  dplyr                = nrow(dplyr_pipeline(data, annotation)),
  times = 10L)

saveRDS(timings, "Parallel_rqdatatable_timings.RDS")

Conclusion

print(timings)

## Unit: seconds
##                  expr       min        lq      mean    median        uq       max neval
##   data_table_parallel  5.274560  5.457105  5.609827  5.546554  5.686829  6.265888    10
##            data_table  9.401677  9.496280  9.701807  9.541218  9.748159 10.419316    10
##  rqdatatable_parallel  7.165216  7.497561  7.587663  7.563883  7.761987  7.949404    10
##           rqdatatable 12.490469 12.700474 13.320480 12.898154 14.229233 14.282269    10
##        dplyr_parallel  6.492262  6.572062  6.784865  6.787277  6.875076  7.328223    10
##                 dplyr 20.056555 20.450064 20.647073 20.564529 20.800350 21.332103    10

# autoplot(timings)

timings <- as.data.frame(timings)
timings$seconds <- timings$time/1e+9

ScatterBoxPlotH(timings,
                xvar = "seconds", yvar = "expr",
                title = "task duration by method")

The benchmark timings show parallelized data.table is the fastest, followed by parallelized dplyr, and then parallelized rqdatatable. In the non-parallelized case data.table is the fastest, followed by rqdatatable, and then dplyr. A reason dplyr sees a greater speedup relative to its own non-parallel implementation (yet still does not beat data.table) is that data.table is multi-threaded from the start, so it was already exploiting some parallelism before we added the process-level parallelism; hence it sees less of an additional speedup, though it remains the fastest.
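The pattern shared by all the parallel variants above (partition by a grouping column, apply the pipeline to each partition, bind the results back together) can be sketched in base R. The sketch below uses made-up toy data and a made-up worker function, and applies the steps serially with lapply(); swapping in parallel::parLapply() over a cluster is the parallelizing step, assuming the worker only touches data shipped to it:

```r
# Hypothetical per-partition worker: here it just sums a column per id.
process_partition <- function(d) {
  aggregate(value ~ id, data = d, FUN = sum)
}

# Toy data with a partitioning column key_group (invented for illustration).
d <- data.frame(id = c(1, 1, 2, 3),
                key_group = c("a", "a", "b", "b"),
                value = c(1, 2, 3, 4))

parts   <- split(d, d$key_group)             # partition by key_group
results <- lapply(parts, process_partition)  # parLapply(cl, parts, process_partition) in parallel
res     <- do.call(rbind, results)           # bind the partition results back together
res     <- res[order(res$id), ]
res
```

This only gives correct answers when, as in the article's example, every row needed for a given result lands in the same partition.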
rquery pipelines exhibit superior performance on big data systems (Spark, PostgreSQL, Amazon Redshift, and hopefully soon Google BigQuery), and rqdatatable supplies a very good in-memory implementation of the rquery system based on data.table. rquery also speeds up solution development by supplying higher-order operators and early debugging features.

In this note we have demonstrated simple procedures to reliably parallelize any of rqdatatable, data.table, or dplyr.

Note: we did not include alternatives such as multidplyr or dtplyr in the timings, as they did not appear to work on this example.

Materials

The original rendering of this article can be found here, source code here, and raw timings here.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Dealing with heteroskedasticity; regression with robust standard errors using R

Sun, 07/08/2018 - 02:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

First of all, is it heteroskedasticity or heteroscedasticity? According to McCulloch (1985), heteroskedasticity is the proper spelling, because when transliterating Greek words, scientists use the Latin letter k in place of the Greek letter κ (kappa).
κ is sometimes transliterated as the Latin letter c, but only in words that entered the English language through French, such as scepter. Now that this is out of the way, we can get to the meat of this blogpost (foreshadowing pun).

A random variable is said to be heteroskedastic if its variance is not constant. For example, the variability of expenditures may increase with income. Richer families may spend a similar amount on groceries as poorer people, but some rich families will sometimes buy expensive items such as lobster. The variability of expenditures for rich families is thus quite large. However, the expenditures on food of poorer families, who cannot afford lobster, will not vary much. Heteroskedasticity can also appear when data is clustered; for example, variability of expenditures on food may vary from city to city, but is quite constant within a city.

To illustrate this, let’s first load all the packages needed for this blog post:

library(robustbase)
library(tidyverse)
library(sandwich)
library(lmtest)
library(modelr)
library(broom)

First, let’s load and prepare the data:

data("education")

education <- education %>%
  rename(residents = X1,
         per_capita_income = X2,
         young_residents = X3,
         per_capita_exp = Y,
         state = State) %>%
  mutate(region = case_when(
    Region == 1 ~ "northeast",
    Region == 2 ~ "northcenter",
    Region == 3 ~ "south",
    Region == 4 ~ "west"
  )) %>%
  select(-Region)

I will be using the education data set from the {robustbase} package. I renamed some columns and changed the values of the Region column. Now, let’s do a scatterplot of per capita expenditures on per capita income:

ggplot(education, aes(per_capita_income, per_capita_exp)) +
  geom_point() +
  theme_dark()

It would seem that, as income increases, variability of expenditures increases too.
Let’s look at the same plot by region:

ggplot(education, aes(per_capita_income, per_capita_exp)) +
  geom_point() +
  facet_wrap(~region) +
  theme_dark()

I don’t think this shows much; it would seem that observations might be clustered, but there are not enough observations to draw any conclusion from this plot (in any case, drawing conclusions from plots only is dangerous).

Let’s first run a good ol’ linear regression:

lmfit <- lm(per_capita_exp ~ region + residents + young_residents + per_capita_income,
            data = education)
summary(lmfit)

## 
## Call:
## lm(formula = per_capita_exp ~ region + residents + young_residents + 
##     per_capita_income, data = education)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.963 -25.499  -2.214  17.618  89.106 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -467.40283  142.57669  -3.278 0.002073 ** 
## regionnortheast     15.72741   18.16260   0.866 0.391338    
## regionsouth          7.08742   17.29950   0.410 0.684068    
## regionwest          34.32416   17.49460   1.962 0.056258 .  
## residents           -0.03456    0.05319  -0.650 0.519325    
## young_residents      1.30146    0.35717   3.644 0.000719 ***
## per_capita_income    0.07204    0.01305   5.520 1.82e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.88 on 43 degrees of freedom
## Multiple R-squared:  0.6292, Adjusted R-squared:  0.5774 
## F-statistic: 12.16 on 6 and 43 DF,  p-value: 6.025e-08

Let’s test for heteroskedasticity using the Breusch-Pagan test that you can find in the {lmtest} package:

bptest(lmfit)

## 
##  studentized Breusch-Pagan test
## 
## data:  lmfit
## BP = 17.921, df = 6, p-value = 0.006432

This test shows that we can reject the null hypothesis that the variance of the residuals is constant; heteroskedasticity is therefore present.
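As a sanity check on what bptest() computes under the hood, the Breusch-Pagan statistic can be built by hand: regress the squared residuals on the model’s regressors and use LM = n·R², which is asymptotically chi-squared with as many degrees of freedom as regressors. The sketch below uses simulated data (not the education data), so the numbers are illustrative only:

```r
set.seed(42)
n <- 200
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)  # error sd grows with x: heteroskedastic

fit <- lm(y ~ x)
u2  <- residuals(fit)^2

aux <- lm(u2 ~ x)                  # auxiliary regression of squared residuals
bp  <- n * summary(aux)$r.squared  # LM statistic; 1 regressor, so df = 1
p   <- pchisq(bp, df = 1, lower.tail = FALSE)
c(BP = bp, p.value = p)
```

With the strongly heteroskedastic errors simulated here, the p-value comes out far below 0.05, matching the rejection the article obtains on the real data.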
To get the correct standard errors, we can use the vcovHC() function from the {sandwich} package (hence the choice for the header picture of this post):

lmfit %>%
  vcovHC() %>%
  diag() %>%
  sqrt()

##       (Intercept)   regionnortheast       regionsouth        regionwest 
##      311.31088691       25.30778221       23.56106307       24.12258706 
##         residents   young_residents per_capita_income 
##        0.09184368        0.68829667        0.02999882

By default, vcovHC() estimates a heteroskedasticity-consistent (HC) variance-covariance matrix for the parameters. There are several ways to estimate such an HC matrix, and by default vcovHC() estimates the “HC3” one. You can refer to Zeileis (2004) for more details. We see that the standard errors are much larger than before! The intercept and regionwest variables are not statistically significant anymore.

You can achieve the same in one single step:

coeftest(lmfit, vcov = vcovHC(lmfit))

## 
## t test of coefficients:
## 
##                      Estimate  Std. Error t value Pr(>|t|)  
## (Intercept)       -467.402827  311.310887 -1.5014  0.14056  
## regionnortheast     15.727405   25.307782  0.6214  0.53759  
## regionsouth          7.087424   23.561063  0.3008  0.76501  
## regionwest          34.324157   24.122587  1.4229  0.16198  
## residents           -0.034558    0.091844 -0.3763  0.70857  
## young_residents      1.301458    0.688297  1.8908  0.06540 .
## per_capita_income    0.072036    0.029999  2.4013  0.02073 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It is also easy to change the estimation method for the variance-covariance matrix:

coeftest(lmfit, vcov = vcovHC(lmfit, type = "HC0"))

## 
## t test of coefficients:
## 
##                      Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept)       -467.402827  172.577569 -2.7084  0.009666 ** 
## regionnortheast     15.727405   20.488148  0.7676  0.446899    
## regionsouth          7.087424   17.755889  0.3992  0.691752    
## regionwest          34.324157   19.308578  1.7777  0.082532 .  
## residents           -0.034558    0.054145 -0.6382  0.526703    
## young_residents      1.301458    0.387743  3.3565  0.001659 ** 
## per_capita_income    0.072036    0.016638  4.3296 8.773e-05 ***
## ---
## Signif.
codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 As I wrote above, by default, the type argument is equal to “HC3”. Another way of dealing with heteroskedasticity is to use the lmrob() function from the {robustbase} package. This package is quite interesting, and offers quite a lot of functions for robust linear, and nonlinear, regression models. Running a robust linear regression is just the same as with lm(): lmrobfit <- lmrob(per_capita_exp ~ region + residents + young_residents + per_capita_income, data = education) summary(lmrobfit) ## ## Call: ## lmrob(formula = per_capita_exp ~ region + residents + young_residents + per_capita_income, ## data = education) ## \--> method = "MM" ## Residuals: ## Min 1Q Median 3Q Max ## -57.074 -14.803 -0.853 24.154 174.279 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -156.37169 132.73828 -1.178 0.24526 ## regionnortheast 20.64576 26.45378 0.780 0.43940 ## regionsouth 10.79694 29.42747 0.367 0.71549 ## regionwest 45.22589 33.07950 1.367 0.17867 ## residents 0.03406 0.04412 0.772 0.44435 ## young_residents 0.57896 0.25512 2.269 0.02832 * ## per_capita_income 0.04328 0.01442 3.000 0.00447 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Robust residual standard error: 26.4 ## Multiple R-squared: 0.6235, Adjusted R-squared: 0.571 ## Convergence in 24 IRWLS iterations ## ## Robustness weights: ## observation 50 is an outlier with |weight| = 0 ( < 0.002); ## 7 weights are ~= 1. The remaining 42 ones are summarized as ## Min. 1st Qu. Median Mean 3rd Qu. Max. 
##  0.05827 0.85200 0.93870 0.85250 0.98700 0.99790 
## Algorithmic parameters: 
##        tuning.chi                bb        tuning.psi        refine.tol 
##         1.548e+00         5.000e-01         4.685e+00         1.000e-07 
##           rel.tol         scale.tol         solve.tol       eps.outlier 
##         1.000e-07         1.000e-10         1.000e-07         2.000e-03 
##             eps.x warn.limit.reject warn.limit.meanrw 
##         1.071e-08         5.000e-01         5.000e-01 
##         nResample            max.it          best.r.s          k.fast.s             k.max 
##               500                50                 2                 1               200 
##       maxit.scale         trace.lev               mts        compute.rd    fast.s.large.n 
##               200                 0              1000                 0              2000 
##               psi       subsampling               cov 
##        "bisquare"     "nonsingular"     ".vcov.avar1" 
## compute.outlier.stats 
##                  "SM" 
## seed : int(0)

This, however, gives you different estimates than when fitting a linear regression model with lm(). You might expect the estimates to be the same, with only the standard errors differing, but that is not the case here: the estimation method itself is different, and it is also robust to outliers (at least that’s my understanding; I haven’t read the theoretical papers behind the package yet).

Finally, it is also possible to bootstrap the standard errors. For this I will use the bootstrap() function from the {modelr} package:

resamples <- 100

boot_education <- education %>%
  modelr::bootstrap(resamples)

Let’s take a look at the boot_education object:

boot_education

## # A tibble: 100 x 2
##    strap .id
##
##  1       001
##  2       002
##  3       003
##  4       004
##  5       005
##  6       006
##  7       007
##  8       008
##  9       009
## 10       010
## # ... with 90 more rows

The column strap contains resamples of the original data. I will run my linear regression from before on each of the resamples:

(
  boot_lin_reg <- boot_education %>%
    mutate(regressions = map(strap,
                             ~lm(per_capita_exp ~ region + residents +
                                   young_residents + per_capita_income,
                                 data = .)))
)

## # A tibble: 100 x 3
##    strap .id   regressions
##
##  1       001
##  2       002
##  3       003
##  4       004
##  5       005
##  6       006
##  7       007
##  8       008
##  9       009
## 10       010
## # ... with 90 more rows

I have added a new column called regressions which contains the linear regressions on each bootstrapped sample.
Now, I will create a list of tidied regression results: ( tidied <- boot_lin_reg %>% mutate(tidy_lm = map(regressions, broom::tidy)) ) ## # A tibble: 100 x 4 ## strap .id regressions tidy_lm ## ## 1 001 ## 2 002 ## 3 003 ## 4 004 ## 5 005 ## 6 006 ## 7 007 ## 8 008 ## 9 009 ## 10 010 ## # ... with 90 more rows broom::tidy() creates a data frame of the regression results. Let’s look at one of these: tidied$tidy_lm[[1]] ## term estimate std.error statistic p.value ## 1 (Intercept) -515.19835839 129.61828003 -3.9747353 2.648477e-04 ## 2 regionnortheast -11.75706535 18.72014312 -0.6280436 5.332970e-01 ## 3 regionsouth -8.89139412 15.51588707 -0.5730510 5.695950e-01 ## 4 regionwest 25.96217940 17.48335459 1.4849656 1.448476e-01 ## 5 residents -0.08921914 0.05557157 -1.6054819 1.157079e-01 ## 6 young_residents 1.41203710 0.35297506 4.0003877 2.447887e-04 ## 7 per_capita_income 0.08609340 0.01191468 7.2258276 6.069534e-09

This format is easier to handle than the standard lm() output:

tidied$regressions[[1]] ## ## Call: ## lm(formula = per_capita_exp ~ region + residents + young_residents + ## per_capita_income, data = .) ## ## Coefficients: ## (Intercept) regionnortheast regionsouth ## -515.19836 -11.75707 -8.89139 ## regionwest residents young_residents ## 25.96218 -0.08922 1.41204 ## per_capita_income ## 0.08609 Now that I have all these regression results, I can compute any statistic I need. But first, let’s transform the data even further: list_mods <- tidied %>% pull(tidy_lm) list_mods is a list of the tidy_lm data frames. I now add an index and bind the rows together (by using map2_df() instead of map2()): mods_df <- map2_df(list_mods, seq(1, resamples), ~mutate(.x, resample = .y)) Let’s take a look at the final object: head(mods_df, 25) ## term estimate std.error statistic p.value ## 1 (Intercept) -515.19835839 129.61828003 -3.9747353 2.648477e-04 ## 2 regionnortheast -11.75706535 18.72014312 -0.6280436 5.332970e-01 ## 3 regionsouth -8.89139412 15.51588707 -0.5730510 5.695950e-01 ## 4 regionwest 25.96217940 17.48335459 1.4849656 1.448476e-01 ## 5 residents -0.08921914 0.05557157 -1.6054819 1.157079e-01 ## 6 young_residents 1.41203710 0.35297506 4.0003877 2.447887e-04 ## 7 per_capita_income 0.08609340 0.01191468 7.2258276 6.069534e-09 ## 8 (Intercept) -526.78466908 142.64004523 -3.6931050 6.207704e-04 ## 9 regionnortheast 3.35909252 21.97165783 0.1528830 8.792057e-01 ## 10 regionsouth 2.62845462 17.20973263 0.1527307 8.793251e-01 ## 11 regionwest 26.40527188 20.50214280 1.2879274 2.046593e-01 ## 12 residents -0.04834920 0.05722962 -0.8448282 4.028830e-01 ## 13 young_residents 1.49618012 0.37091593 4.0337445 2.208994e-04 ## 14 per_capita_income 0.07489435 0.01179888 6.3475800 1.140844e-07 ## 15 (Intercept) -466.17160566 130.18954610 -3.5807146 8.659014e-04 ## 16 regionnortheast -6.75977787 17.24994600 -0.3918724 6.970880e-01 ## 17 regionsouth 6.31061828 16.27953099 0.3876413 7.001938e-01 ## 18 regionwest 27.89261897 15.43852463 1.8066894 
7.781178e-02 ## 19 residents -0.08760677 0.05007780 -1.7494134 8.735434e-02 ## 20 young_residents 1.23763741 0.32965208 3.7543746 5.168492e-04 ## 21 per_capita_income 0.08469609 0.01236601 6.8491057 2.128982e-08 ## 22 (Intercept) -316.44265722 166.96769979 -1.8952328 6.479692e-02 ## 23 regionnortheast 11.54907449 14.95746934 0.7721276 4.442620e-01 ## 24 regionsouth 9.25759229 16.89974638 0.5477947 5.866654e-01 ## 25 regionwest 31.83905855 16.56287888 1.9223143 6.120738e-02 ## resample ## 1 1 ## 2 1 ## 3 1 ## 4 1 ## 5 1 ## 6 1 ## 7 1 ## 8 2 ## 9 2 ## 10 2 ## 11 2 ## 12 2 ## 13 2 ## 14 2 ## 15 3 ## 16 3 ## 17 3 ## 18 3 ## 19 3 ## 20 3 ## 21 3 ## 22 4 ## 23 4 ## 24 4 ## 25 4 Now this is a very useful format, because I now can group by the term column and compute any statistics I need, in the present case the standard deviation: ( r.std.error <- mods_df %>% group_by(term) %>% summarise(r.std.error = sd(estimate)) ) ## # A tibble: 7 x 2 ## term r.std.error ## ## 1 (Intercept) 226. ## 2 per_capita_income 0.0211 ## 3 regionnortheast 19.7 ## 4 regionsouth 19.1 ## 5 regionwest 18.7 ## 6 residents 0.0629 ## 7 young_residents 0.509 We can append this column to the linear regression model result: lmfit %>% broom::tidy() %>% full_join(r.std.error) %>% select(term, estimate, std.error, r.std.error) ## Joining, by = "term" ## term estimate std.error r.std.error ## 1 (Intercept) -467.40282655 142.57668653 226.08735720 ## 2 regionnortheast 15.72740519 18.16259519 19.65120904 ## 3 regionsouth 7.08742365 17.29949951 19.05153934 ## 4 regionwest 34.32415663 17.49459866 18.72767470 ## 5 residents -0.03455770 0.05318811 0.06285984 ## 6 young_residents 1.30145825 0.35716636 0.50928879 ## 7 per_capita_income 0.07203552 0.01305066 0.02109277 As you see, using the whole bootstrapping procedure is longer than simply using either one of the first two methods. However, this procedure is very flexible and can thus be adapted to a very large range of situations. 
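The core of the procedure above (resample rows with replacement, refit the model, take the standard deviation of each coefficient across refits) needs nothing beyond base R, which may help demystify the {modelr}/{purrr} machinery. A self-contained sketch on simulated data (illustrative only, not the education data):

```r
set.seed(123)
n <- 100
x <- runif(n)
y <- 2 + 3 * x + rnorm(n, sd = x)   # heteroskedastic noise
d <- data.frame(x, y)

resamples <- 200
boot_coefs <- t(replicate(resamples, {
  idx <- sample(nrow(d), replace = TRUE)   # resample rows with replacement
  coef(lm(y ~ x, data = d[idx, ]))         # refit and keep the coefficients
}))

boot_se <- apply(boot_coefs, 2, sd)  # bootstrap SEs for (Intercept) and x
boot_se
```

The apply() call plays the role of the group_by(term) %>% summarise(sd(estimate)) step in the article.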
Either way, in the case of heteroskedasticity, you can see that results vary a lot depending on the procedure you use, so I would advise using them all as robustness tests and discussing the differences.

If you found this blog post useful, you might want to follow me on Twitter for blog post updates.

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

### A primer in using Java from R – part 2

Sat, 07/07/2018 - 15:00

(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)

Introduction

In this part of the primer we discuss creating and using custom .jar archives within our R scripts and packages, handling Java exceptions from R, and a quick performance comparison between the low- and high-level interfaces provided by rJava. In the first part we talked about using the rJava package to create objects, call methods and work with arrays; we examined the various ways to call Java methods and to call Java code from R directly via execution of shell commands.

R <3 Java, or maybe not?
Contents

Using rJava with custom built classes

Preparing a .jar archive for use

Getting back to our example with running the main method of our HelloWorldDummy class from the first part of this primer, in practice we most likely want to actually create objects and invoke methods for such classes rather than simply call the main method. For our resources to be available to rJava, we need to create a .jar archive and add it to the class path. An example of the process can be as follows.

Compile our code to create the class file, and jar it:

$ javac DummyJavaClassJustForFun/HelloWorldDummy.java
$ cd DummyJavaClassJustForFun/
$ jar cvf HelloWorldDummy.jar HelloWorldDummy.class

Adding the .jar file to the class path

Within R, attach rJava, initialize the JVM and investigate our current class path using .jclassPath:

library(rJava) .jinit() .jclassPath()

Now, we add our newly created .jar to the class path using .jaddClassPath:

If this worked, we can see the added jar(s) in the class path if we call .jclassPath() again.

Creating objects, investigating methods and fields

Now that we have our .jar in the class path, we can create a new Java object from our class:

dummyObj <- .jnew("DummyJavaClassJustForFun/HelloWorldDummy") str(dummyObj) ## Formal class 'jobjRef' [package "rJava"] with 2 slots ## ..@ jobj : ## ..@ jclass: chr "DummyJavaClassJustForFun/HelloWorldDummy"

We can also investigate the available constructors, methods and fields for our class (or provide the object as argument, then its class will be queried):

• .jconstructors returns a character vector with all constructors for a given class or object
• .jmethods returns a character vector with all methods for a given class or object
• .jfields returns a character vector with all fields (aka attributes) for a given class or object
• .DollarNames returns all fields and methods associated with the object. Method names are followed by ( or () depending on arity.
# Requesting vectors of methods, constructors and fields by class .jmethods("DummyJavaClassJustForFun/HelloWorldDummy") ## [1] "public static void DummyJavaClassJustForFun.HelloWorldDummy.main(java.lang.String[])" ## [2] "public java.lang.String DummyJavaClassJustForFun.HelloWorldDummy.SayMyName()" ## [3] "public final void java.lang.Object.wait(long,int) throws java.lang.InterruptedException" ## [4] "public final native void java.lang.Object.wait(long) throws java.lang.InterruptedException" ## [5] "public final void java.lang.Object.wait() throws java.lang.InterruptedException" ## [6] "public boolean java.lang.Object.equals(java.lang.Object)" ## [7] "public java.lang.String java.lang.Object.toString()" ## [8] "public native int java.lang.Object.hashCode()" ## [9] "public final native java.lang.Class java.lang.Object.getClass()" ## [10] "public final native void java.lang.Object.notify()" ## [11] "public final native void java.lang.Object.notifyAll()" .jconstructors("DummyJavaClassJustForFun/HelloWorldDummy") ## [1] "public DummyJavaClassJustForFun.HelloWorldDummy()" .jfields("DummyJavaClassJustForFun/HelloWorldDummy") ## NULL # Requesting vectors of methods, constructors and fields by object .jmethods(dummyObj) ## [1] "public static void DummyJavaClassJustForFun.HelloWorldDummy.main(java.lang.String[])" ## [2] "public java.lang.String DummyJavaClassJustForFun.HelloWorldDummy.SayMyName()" ## [3] "public final void java.lang.Object.wait(long,int) throws java.lang.InterruptedException" ## [4] "public final native void java.lang.Object.wait(long) throws java.lang.InterruptedException" ## [5] "public final void java.lang.Object.wait() throws java.lang.InterruptedException" ## [6] "public boolean java.lang.Object.equals(java.lang.Object)" ## [7] "public java.lang.String java.lang.Object.toString()" ## [8] "public native int java.lang.Object.hashCode()" ## [9] "public final native java.lang.Class java.lang.Object.getClass()" ## [10] "public final native void 
java.lang.Object.notify()" ## [11] "public final native void java.lang.Object.notifyAll()" .jconstructors(dummyObj) ## [1] "public DummyJavaClassJustForFun.HelloWorldDummy()" .jfields(dummyObj) ## NULL Calling methods 3 different ways

We can now invoke our SayMyName method on this object in the three ways discussed in the first part of this primer:

# low level
lres <- .jcall(dummyObj, "Ljava/lang/String;", "SayMyName")

# high level
hres <- J(dummyObj, method = "SayMyName")

# convenient $ shorthand
dres <- dummyObj$SayMyName()

c(lres, hres, dres)

## [1] "My name is DummyJavaClassJustForFun.HelloWorldDummy"
## [2] "My name is DummyJavaClassJustForFun.HelloWorldDummy"
## [3] "My name is DummyJavaClassJustForFun.HelloWorldDummy"

Very quick look at performance

The low-level .jcall is much faster, since J has to use reflection to find the most suitable method. The $ shorthand seems to be the slowest, but is also very convenient, as it supports code completion:

microbenchmark::microbenchmark(times = 1000
  , .jcall(dummyObj, "Ljava/lang/String;", "SayMyName")
  , J(dummyObj, "SayMyName")
  , dummyObj$SayMyName()
)

## Unit: microseconds
##                                                 expr     min       lq       mean    median        uq       max neval
##  .jcall(dummyObj, "Ljava/lang/String;", "SayMyName")  34.992  48.0385   62.13235   61.3425   66.6470   723.719  1000
##                             J(dummyObj, "SayMyName") 696.860 748.9630  903.54615  786.2715  841.2835 66191.562  1000
##                                 dummyObj$SayMyName() 894.576 963.1035 1093.67310 1009.4020 1075.2565  6743.220  1000

Usage of jars in R packages

For using rJava within an R package, Simon Urbanek, the author of rJava, even provides a convenience function which initializes the JVM and registers the Java classes and native code contained in the package with it. A quick step-by-step guide to using .jars within a package is as follows:

1. place our .jars into inst/java/
2. add Depends: rJava and SystemRequirements: Java to our DESCRIPTION file
3. add a call to .jpackage(pkgname, lib.loc = libname) into our .onLoad.R or .First.lib, for example like so:

.onLoad <- function(libname, pkgname) {
  .jpackage(pkgname, lib.loc = libname)
}

4. if possible, add .java source files into the /java folder of our package

If you are interested in more detail than provided in this super-quick overview, Tobias Verbeke created a Hello Java World! package with a vignette providing a verbose step-by-step tutorial for interfacing to Java archives inside R packages.
Setting java.parameters

The .jpackage function calls .jinit with the default parameters = getOption("java.parameters"), so if we want to set some of the Java parameters, we can do it for example like so:

.onLoad <- function(libname, pkgname) {
  options(java.parameters = c("-Xmx1000m"))
  .jpackage(pkgname, lib.loc = libname)
}

Note that the options call needs to happen before the call to .jpackage, as Java parameters can only be used during JVM initialization. Consequently, this will only work if another package has not already initialized the JVM.

Handling Java exceptions in R

rJava maps Java exceptions to R conditions relayed by the stop function, therefore we can use the standard R mechanisms such as tryCatch to handle the exceptions. The R condition object (assume we call it e for this) is actually an S3 object (a list) that contains:

• call – a language object containing the call resulting in the exception
• jobj – an S4 object containing the actual exception object, so we can, for example, investigate its class: e[["jobj"]]@jclass

tryCatch(
  iOne <- .jnew(class = "java/lang/Integer", 1),
  error = function(e) {
    message("\nLets look at the condition object:")
    str(e)
    message("\nClass of the jobj item:")
    print(e[["jobj"]]@jclass)
    message("\nClasses of the condition object: ")
    class(e)
  }
)

## 
## Lets look at the condition object:
## List of 3
##  $ message: chr "java.lang.NoSuchMethodError: "
##  $ call   : language .jnew(class = "java/lang/Integer", 1)
##  $ jobj   :Formal class 'jobjRef' [package "rJava"] with 2 slots
##   .. ..@ jobj  :
##   .. ..@ jclass: chr "java/lang/NoSuchMethodError"
##  - attr(*, "class")= chr [1:9] "NoSuchMethodError" "IncompatibleClassChangeError" "LinkageError" "Error" ...
## 
## Class of the jobj item:
## [1] "java/lang/NoSuchMethodError"
## 
## Classes of the condition object: 
## [1] "NoSuchMethodError"            "IncompatibleClassChangeError"
## [3] "LinkageError"                 "Error"
## [5] "Throwable"                    "Object"
## [7] "Exception"                    "error"
## [9] "condition"
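The dispatch mechanism rJava relies on here is plain R condition handling: a condition carries a vector of classes, and tryCatch() uses the first matching handler. A self-contained base-R sketch (no JVM involved; the class names merely mimic the Java hierarchy shown above):

```r
# Build a condition whose class vector mimics a Java exception hierarchy.
java_like_error <- structure(
  list(message = "java.lang.NoSuchMethodError", call = NULL),
  class = c("NoSuchMethodError", "LinkageError", "Error", "error", "condition")
)

caught <- tryCatch(
  stop(java_like_error),                                # signal the condition
  NoSuchMethodError = function(e) "handled specifically",
  error             = function(e) "handled generically"
)
caught
## [1] "handled specifically"
```

Because "NoSuchMethodError" appears in the class vector, the specific handler wins over the generic error handler.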

Since class(e) is a vector of simple Java class names, R code can register handlers for them directly, and we can handle different exception classes differently:

withCallingHandlers(
  iOne <- .jnew(class = "java/lang/Integer", 1)
  , error = function(e) {
      message("Meh, just a boring error")
    }
  , NoSuchMethodError = function(e) {
      message("We have a NoSuchMethodError")
    }
  , IncompatibleClassChangeError = function(e) {
      message("We also have a IncompatibleClassChangeError - lets recover")
      recover()
      # recovering here and looking at
      #   2: .jnew(class = "java/lang/Integer", 1)
      # we see that the issue is in
      #   str(list(...))
      #   List of 1
      #    $ : num 1
      # We actually passed a numeric, not integer
      # To fix it, just do
      #   .jnew(class = "java/lang/Integer", 1L)
    }
  , LinkageError = function(e) {
      message("Ok, this is getting a bit overwhelming, lets smile and end here :o)")
    }
)

## Meh, just a boring error
## We have a NoSuchMethodError
## We also have a IncompatibleClassChangeError - lets recover
## recover called non-interactively; frames dumped, use debugger() to view
## Ok, this is getting a bit overwhelming,
## lets smile and end here
## :o)
## Error in .jnew(class = "java/lang/Integer", 1): java.lang.NoSuchMethodError:

References

1. Hello Java World! vignette – a tutorial for interfacing to Java archives inside R packages by Tobias Verbeke
2. rJava basic crashcourse – at the rJava site on rforge, scroll down to the Documentation section
3. The JNI Type Signatures – at Oracle JNI specs
4. rJava documentation on CRAN
5. Calling Java code from R by prof. Darren Wilkinson

To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog.
### HavanaR Workshop 2018

Sat, 07/07/2018 - 02:00

(This article was first published on Blog on FORWARDS, and kindly contributed to R-bloggers)

The first meeting of R users in Cuba, HavanaR! Club 2018, will take place from 29-10-2018 to 02-11-2018. The workshop is organized jointly by the Group of Probability and Statistics of Universidad de La Habana, the >eR-Biostat initiative, the Center for Statistics, CenStat & I-Biostat, Hasselt University, Belgium, and Forwards. The workshop will focus on visualisation methods and exploratory analysis of high dimensional and multivariate datasets.

Program:

29-Oct: Basic concepts of exploratory data analysis using R. Exploratory Analysis of high dimensional data using R (Adetayo Kasim, Durham University, UK, a.s.kasim@durham.ac.uk and Ziv Shkedy, Hasselt University, Belgium, ziv.shkedy@uhasselt.be).

30-Oct: Exploratory Analysis of high dimensional data using R (Adetayo Kasim and Ziv Shkedy).

31-Oct: HavanaR! Club 2018: The Cuban R users workshop with presentations of the participants.

01-Nov: Exploratory Multivariate Data Analysis using R (Julie Josse, Ecole Polytechnique, France, julie.josse@polytechnique.edu).

02-Nov: Exploratory Multivariate Data Analysis using R (Julie Josse).

The workshop is free but registration is mandatory. To register please send an email to erbiostat@gmail.com or cbouza2002@gmail.com. More details and workshop materials will be available online at https://er-biostat.github.io/Courses/. All updates about the workshop will be announced on the Facebook page (https://www.facebook.com/ER-BioStat-1463845487001786/) and Twitter (@erbiostat).
To leave a comment for the author, please follow the link and comment on their blog: Blog on FORWARDS.

### Visualizing macOS App Usage with a Little Help from osqueryr & mactheknife

Fri, 07/06/2018 - 20:34

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Both my osqueryr and mactheknife packages have had a few updates and I wanted to put together a fun example (it being Friday, and all) for what you can do with them. All my packages are now on GitHub and GitLab and I’ll be maintaining them on both so I can accommodate the comfort level of any and all contributors, but will be prioritizing issues and PRs on GitLab ahead of any other platform. Having said that, I’ll mark non-CRAN packages with a # notcran comment in the source views so you know you need to install them from wherever you like to grab sketch packages from.

One table that osquery makes available under macOS is an inventory of all “apps” that macOS knows about. Previous posts have shown how to access these tables via the dplyr interface I built for osquery, but they involved multiple steps, and as I started to use it more regularly (especially to explore the macOS 10.14 beta I’m running) I noticed that it could use some helper functions.
One helper in particular, osq_expose_tables(), handles all the dplyr boilerplate code and makes the matching table(s) available in the global environment by name. It takes a single table name or a regular expression and exposes all matching entities. While the function has a help page, it's easier just to see it in action. Let's expose the apps table:

```r
library(osqueryr) # notcran
library(tidyverse)

osq_expose_tables("apps")

apps
## # Source:   table [?? x 19]
## # Database: OsqueryConnection
##    applescript_enab… bundle_executable    bundle_identifier   bundle_name  bundle_package_…
##  1 0                 1Password 6          com.agilebits.onep… 1Password 6  APPL
##  2 0                 2BUA8C4S2C.com.agil… 2BUA8C4S2C.com.agi… 1Password m… APPL
##  3 1                 Adium                com.adiumX.adiumX   Adium        APPL
##  4 1                 Adobe Connect        com.adobe.adobecon… Adobe Conne… APPL
##  5 1                 Adobe Illustrator    com.adobe.illustra… Illustrator… APPL
##  6 ""                AIGPUSniffer         com.adobe.AIGPUSni… AIGPUSniffer APPL
##  7 ""                CEPHtmlEngine Helper com.adobe.cep.CEPH… CEPHtmlEngi… APPL
##  8 ""                CEPHtmlEngine        com.adobe.cep.CEPH… CEPHtmlEngi… APPL
##  9 ""                LogTransport2        com.adobe.headligh… LogTranspor… APPL
## 10 ""                droplet              ""                  Analyze Doc… APPL
## # ... with more rows, and 14 more variables: bundle_short_version, bundle_version,
## #   category, compiler, copyright, development_region, display_name, element,
## #   environment, info_string, last_opened_time, minimum_system_version, name, path
```

There's tons of info on all the apps macOS knows about, some of which are system services and "helper" apps (like Chrome's auto-updater). One field, last_opened_time, caught my eye: it would be handy to see which apps get little use (i.e. ones that haven't been opened in a while) and which apps I use more frequently (i.e. ones with more recent "open" times). That last_opened_time is a fractional POSIX timestamp and, due to the way osquery created the schemas, it lives in a character field.
That's easy enough to convert, and then arrange() the whole list in descending order to see what you use most frequently. But this is R, and we can do better than a simple table or even a DT::datatable(). I recently added the ability to read macOS property lists (a.k.a. "plists") to mactheknife by wrapping a Python module (plistlib). Since all (OK, "most") macOS apps have an icon, I thought it would be fun to visualize the last-opened frequency for each app using the app icons and ggplot2.

Unfortunately, ImageMagick (and thus the magick package) cannot read macOS icns files, so you'll need to do a brew install libicns before working with any of the remaining code, since we'll be relying on a command-line utility from that formula.

Let's get the frontmatter out of the way:

```r
library(sys)
library(magick)
library(osqueryr) # notcran
library(mactheknife) # notcran
library(ggimage)
library(hrbrthemes)
library(ggbeeswarm)
library(tidyverse)

osq_expose_tables("apps")

# macOS will use a generic app icon when none is present in an app bundle; this is
# the location and we'll need to use it when our plist app spelunking comes up short
default_app <- "/System/Library/CoreServices/CoreTypes.bundle/Contents/Resources/GenericApplicationIcon.icns"
```

Next, we'll:

- collect the apps table locally
- filter out system-ish things (which we really don't care about for this post)
- convert the last-used time to something useful (and reduce it to a day resolution)
- try to locate the property list for the app and read the path to the app icon file, substituting the generic one if it isn't found (or other errors pop up):

```r
select(apps, name, path, last_opened_time) %>%
  collect() %>%
  filter(!str_detect(path, "(^/System|usr|//System|/Library/|Helper|/Contents/|\\.service$)")) %>%
  mutate(lop_day = as.Date(anytime::anytime(as.numeric(last_opened_time)))) %>%
  mutate(icon = map_chr(path, ~{
    p <- read_plist(file.path(.x, "Contents", "Info.plist"))
    icns <- p$CFBundleIconFile[1]
    if (is.null(icns)) return(default_app)
    if (!str_detect(icns, "\\.icns$")) icns <- sprintf("%s.icns", icns)
    file.path(.x, "Contents", "Resources", icns)
  })) -> apps_df

apps_df
## # A tibble: 274 x 5
##    last_opened_time name                       path                      lop_day    icon
##  1 1529958322.11297 1Password 6.app            /Applications/1Password … 2018-06-25 /Applications/1Password 6.…
##  2 1523889402.80918 Adium.app                  /Applications/Adium.app   2018-04-16 /Applications/Adium.app/Co…
##  3 1516307513.7606  Adobe Connect.app          /Applications/Adobe Conn… 2018-01-18 /Applications/Adobe Connec…
##  4 1530044681.76677 Adobe Illustrator.app      /Applications/Adobe Illu… 2018-06-26 /Applications/Adobe Illust…
##  5 -1.0             Analyze Documents.app      /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
##  6 -1.0             Make Calendar.app          /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
##  7 -1.0             Contact Sheets.app         /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
##  8 -1.0             Export Flash Animation.app /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
##  9 -1.0             Web Gallery.app            /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
## 10 -1.0             Adobe InDesign CC 2018.app /Applications/Adobe InDe… 1969-12-31 /Applications/Adobe InDesi…
## # ... with 264 more rows
```

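As a quick cross-check of the timestamp handling above (done in Python here purely for illustration, not part of the original post): last_opened_time is a fractional count of seconds since the Unix epoch stored as text, so converting the first row's value should reproduce the 2018-06-25 lop_day shown in apps_df, and the -1.0 sentinel values should land on 1969-12-31.

```python
from datetime import datetime, timezone

# last_opened_time arrives as a character field; the value below is taken
# from the first row of the apps_df output above (1Password 6.app).
last_opened_time = "1529958322.11297"

ts = datetime.fromtimestamp(float(last_opened_time), tz=timezone.utc)
print(ts.date())  # 2018-06-25

# apps that were never opened carry -1.0, which lands just before the epoch:
print(datetime.fromtimestamp(-1.0, tz=timezone.utc).date())  # 1969-12-31
```

This is exactly what the as.Date(anytime::anytime(as.numeric(...))) pipeline step does in one go.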
Since I really didn't feel like creating a package wrapper for libicns, we're going to use the sys package to make system calls that convert the icns files to png files. We really don't want to do this repeatedly for the same files if we ever run this again, so we'll set up a cache directory to hold our converted pngs.

Apps can (and usually do) have multiple icons in varying sizes, and they are not guaranteed to have every common size available. So we'll have the libicns icns2png utility extract all the icons and use the highest-resolution one, then use magick to reduce it to a 32x32 png bitmap.

```r
# set up the cache dir -- use whatever you want
cache_dir <- path.expand("~/.r-icns-cache")
dir.create(cache_dir)

# create a unique name hash for more compact names
mutate(apps_df, icns_png = map_chr(icon, ~{
  hash <- digest::digest(.x, serialize = FALSE)
  file.path(cache_dir, sprintf("%s.png", hash))
})) -> apps_df

# find the icns2png program
icns2png <- unname(Sys.which("icns2png"))

# go through each icon file
pb <- progress_estimated(length(apps_df$icns_png))
walk2(apps_df$icon, apps_df$icns_png, ~{

  pb$tick()$print() # progress!

  if (!file.exists(.y)) { # don't create it if it already exists

    td <- tempdir()

    # no icon file == use default one
    if (!file.exists(.x)) .x <- default_app

    # convert all of them to pngs
    sys::exec_internal(
      cmd = icns2png,
      args = c("-x", "-o", td, .x),
      error = FALSE
    ) -> res

    rawToChar(res$stdout) %>%          # go through icns2png output
      str_split("\n") %>%
      flatten_chr() %>%
      keep(str_detect, " Saved") %>%   # find all the extracted icons
      last() %>%                       # use the last one
      str_replace(".* to /", "/") %>%  # clean up the filename so we can read it in
      str_replace("\\.$", "") -> png

    # read and convert
    image_read(png) %>%
      image_resize(geometry_area(32, 32)) %>%
      image_write(.y)
  }

})
```

You can open up that cache directory with the macOS Finder to see all the extracted/converted pngs. Now we're on the final leg of our app-use visualization journey. Some system/utility apps have start-of-epoch dates due to the way the macOS installer tags them. We only want "recent" ones, so I set an arbitrary cutoff date of the year 2000. Since many apps would have the same last-opened date, I wanted to get a spread-out layout "for free".
One way to do that is to use ggbeeswarm::position_quasirandom():

```r
filter(apps_df, lop_day > as.Date("2000-01-01")) %>%
  ggplot() +
  geom_image(
    aes(x = "", lop_day, image = icns_png),
    size = 0.033,
    position = position_quasirandom(width = 0.5)
  ) +
  geom_text(
    data = data_frame(
      x = c(0.6, 0.6),
      y = as.Date(c("2018-05-01", "2017-09-15")),
      label = c("More recently used ↑", "Not used in a while ↓")
    ),
    aes(x, y, label = label),
    family = font_an, size = 5, hjust = 0,
    color = "lightslategray"
  ) +
  labs(
    x = NULL, y = NULL,
    title = "macOS 'Last Used' App History"
  ) +
  theme_ipsum_rc(grid = "Y") +
  theme(axis.text.x = element_blank())
```

There are tons of other ways to look at this data, and you can use the osquery daemon to log this data regularly for an extra level of detail. An interesting offshoot project would be to grab the latest RStudio dailies and see if you can wrangle a sweet D3 visualization from the app data we collected. Make sure to drop a link to your creations in the comments. You can find the full code in this snippet.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.
### Setting up RStudio Server, Shiny Server and PostgreSQL by @ellis2013nz

Fri, 07/06/2018 - 14:00

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

Motivation

For a variety of reasons I wanted a Linux server in the cloud with data science software on it. In no particular order:

1. so I could access a reliable setup from anywhere with a web browser, including on networks without the software I need (which happens a lot in corporate/government environments)
2. so I could do round-the-clock data gathering when necessary
3. to run my pet HappyRrobot twitterbot, having retired the old physical Linux box it used to be hosted on
4. to make sure I stay familiar with what's needed to set up such a system from scratch.

I couldn't tell you which of these is the most important. In this blog post I'll be noting down the steps necessary for setting up this server, and illustrating purpose 2 with a Shiny app, hosted on the server and drawing data from a database on the server, which is collecting a round-the-clock random sample of Twitter. The end result looks like this (click on the image to go to the actual working app):

If you're interested in Twitter data but not in setting up a Linux server with R, Shiny and PostgreSQL on it, you can skip straight to where I talk about getting the data in and analysing it.

Setting up a server for data and stats

I chose Amazon Web Services Elastic Cloud to host the server. I used the Elastic IP service (fairly cheap) to allocate a permanent IP address to the instance. I was keen on using Red Hat or a similarly flavoured Linux because of purpose 4 noted above; I already knew a bit about Ubuntu and wanted to expand my knowledge sphere, and Red Hat is the flavour that seems to pop up on corporate and government servers.
In the end I opted for CentOS, which is "a Linux distribution that provides a free, enterprise-class, community-supported computing platform functionally compatible with its upstream source, Red Hat Enterprise Linux (RHEL)". In fact, installing R is a bit easier on CentOS than it was in my Red Hat experiments. I want the following on this machine:

- R, RStudio Server and Python 3 for analysis and for data munging
- PostgreSQL for storing data and supporting analysis
- Shiny Server for dissemination
- a web server (I chose Nginx) to support RStudio Server and Shiny Server through a reverse proxy, so I can access them on regular web browser ports rather than ports 8787 and 3838, which will often not be reachable from a corporate network
- all the other utilities and extras to support all this, such as curl, mail and fonts

This was non-trivial, which is one of the reasons why I'm blogging about it! There are lots of fishhooks and little things to sort through, and while there are some excellent blog posts and tutorials out there to help, none of them covered quite the end-to-end setup I needed. One of my outputs from the process was a set of notes (not quite a single run-and-forget configuration script as you'd want in a professional setup, but fairly easy to use) that makes it easy to do something similar in the future.

R and a few basics

Here's where I started, having started up the server itself (there are plenty of tutorials on how to do that, provided by Amazon themselves and others). I begin by installing R, including the EPEL "Extra Packages for Enterprise Linux" repository that is needed beforehand.

```sh
# we'll want mail later
sudo yum install -y mailx
```
```sh
#-----------------R and its dependencies------------------------
# install R; we need EPEL first
sudo yum update -y
sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install -y texlive
sudo yum install -y texinfo

# clean up
rm *.rpm

sudo yum install -y R

# and version control of course
sudo yum install -y git
```

Database

Next thing was to install PostgreSQL. I found it useful to install this before installing R packages, because some R packages that speak to PostgreSQL behave differently on installation depending on whether PostgreSQL is found on the machine or not.

```sh
#=====================postgresql=======================
# install postgresql. Good to do this before we start installing R packages
sudo yum install -y postgresql-server postgresql-contrib postgresql-devel
sudo postgresql-setup initdb

# stop to give the postgres account a password for the operating system
sudo passwd postgres

# start the postgresql service
sudo systemctl start postgresql

# I think this next line means the database service restarts when the machine is rebooted
sudo systemctl enable postgresql
```

Extending R via packages and their dependencies

I find installing all the R packages I regularly use a harder job on Linux than on Windows. I'm sorry, but I do. In particular, Windows installations of packages like gdal seem to look after upstream dependencies seamlessly and quietly. Not so on Linux. Here's what I needed to do at the command line to get all the R packages I wanted installed.

```sh
#======================miscellaneous dependencies needed by R packages===============
# Best to do this on a large instance, even if you only start it as big
# during the install. 8GB seems a minimum.

#----------------tidyverse and Cairo---------
# First, some dependencies that rvest, devtools and Cairo need:
sudo yum install -y libcurl-devel libxml2-devel openssl-devel
sudo yum install -y cairo-devel libXt-devel udunits2-devel gdal-devel poppler-cpp-devel

#------------The gdal epic---------------------
# needed for spatial stuff, in particular sf, which needs version > 2 (currently 2.2).
# This should work according to the sf GitHub page (excluding udunits2, which we look after later):
sudo yum install -y gdal-devel proj-devel proj-epsg proj-nad geos-devel

# but that installs the wrong version of gdal! We have to install it by hand.
# see https://gis.stackexchange.com/questions/263495/how-to-install-gdal-on-centos-7-4
# adapted from https://gist.github.com/simondobner/f859b2db15ad65090c3c316d3c224f45
wget http://download.osgeo.org/gdal/2.2.4/gdal-2.2.4.tar.gz
tar zxvf gdal-2.2.4.tar.gz
cd gdal-2.2.4/
./configure --prefix=/usr/ --with-sfcgal=no
make -j4
sudo make install

# should have a test here to only do the next two things if installed correctly!
rm *.tar.gz
rm gdal-2.2.4 -r
```

Now we can install the R packages that need all the above system dependencies first.
```sh
#======================R packages=====================
# udunits2 needs special configuration when installing in R, so let's do that first and get it out of the way
sudo R -e "install.packages('udunits2', configure.args='--with-udunits2-include=/usr/include/udunits2', repos='http://cran.rstudio.com/')"

# these are a bunch of packages that are heavily used and that I want installed up front and available
# for all users (hence installing them as super user)
sudo R -e "install.packages(c('Rcpp', 'rlang', 'bindrcpp', 'dplyr', 'digest', 'htmltools', 'tidyverse',
    'shiny', 'leaflet', 'sf', 'scales', 'Cairo', 'forecast', 'forcats', 'h2o', 'seasonal', 'data.table',
    'extrafont', 'survey', 'forecastHybrid', 'ggseas', 'treemap', 'glmnet', 'ranger', 'RPostgres',
    'igraph', 'ggraph', 'nzelect', 'tm', 'wordcloud', 'praise', 'showtext', 'ngram', 'pdftools',
    'rtweet', 'GGally', 'ggExtra', 'lettercase', 'xgboost'), repos='http://cran.rstudio.com/')"

# fonts
sudo yum install -y dejavu-sans-fonts
sudo yum install -y google-droid-*-fonts
sudo yum install -y gnu-free-*-fonts
sudo R -e "extrafont::font_import(prompt = FALSE)"
```

I did all of this with my server set up as a "large" instance with 8GB of RAM. This particularly makes a difference when installing Rcpp. After the initial setup is done you can stop the instance, downsize it to something cheaper, and restart it.

Note that I am using sudo to install R packages so they are available to all users (which will include the shiny user down the track), not just to me. I wanted everyone using this server to have the same set of packages available; obviously, whether this is desirable or not depends on the purpose of the setup.

Server-related stuff

Next I want to get RStudio Server and Shiny Server working, and accessible via a web browser that just talks to standard port 80. There is a step here where the Nginx configuration file gets edited by hand; the links to RStudio support for RStudio Server and for Shiny Server contain instructions on what needs to go where.
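Those hand edits broadly follow the two RStudio support articles. A minimal, hypothetical sketch of the reverse-proxy stanzas (not the author's actual file; it assumes the default RStudio Server and Shiny Server ports, 8787 and 3838, and omits the websocket upgrade map block that the articles also add) might look something like:

```nginx
# Hypothetical fragment for /etc/nginx/nginx.conf (inside the http block).
# Simplified from the RStudio support articles linked in this post.
server {
    listen 80;

    location /rstudio/ {
        rewrite ^/rstudio/(.*)$ /$1 break;
        proxy_pass http://localhost:8787;
        proxy_redirect http://localhost:8787/ $scheme://$http_host/rstudio/;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 20d;
    }

    location /shiny/ {
        rewrite ^/shiny/(.*)$ /$1 break;
        proxy_pass http://localhost:3838;
        proxy_redirect http://localhost:3838/ $scheme://$http_host/shiny/;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

The rewrite/proxy_redirect pair is what lets the apps live under /rstudio/ and /shiny/ path prefixes while the servers themselves think they are at the site root.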
Also note that the actual versions of RStudio Server and Shiny Server below are date-specific (because they are installed via local install), and the links are probably already out of date.

```sh
#-------------webby stuff------------
# install a web server so we can deliver things through it via reverse proxy
# see https://support.rstudio.com/hc/en-us/articles/200552326-Running-RStudio-Server-with-a-Proxy
# and https://support.rstudio.com/hc/en-us/articles/213733868-Running-Shiny-Server-with-a-Proxy
sudo yum install -y nginx

# install RStudio Server (2018-04-23)
wget https://download2.rstudio.org/rstudio-server-rhel-1.1.447-x86_64.rpm
sudo yum localinstall -y --nogpgcheck rstudio-server-rhel-1.1.447-x86_64.rpm

# install shiny and shiny-server (2018-04-23)
wget https://download3.rstudio.org/centos6.3/x86_64/shiny-server-1.5.7.907-rh6-x86_64.rpm
sudo yum localinstall -y --nogpgcheck shiny-server-1.5.7.907-rh6-x86_64.rpm
rm *.rpm

# now go make the necessary edits to /etc/nginx/nginx.conf
# note that the additions are made in two different bits of that file; you don't just paste
# the whole lot in.
sudo nano /etc/nginx/nginx.conf
sudo systemctl restart nginx

# go to yr.ip.number/shiny/ and yr.ip.number/rstudio/ to check all is working

# add some more users if wanted at this point
# sudo useradd ellisp
# sudo passwd ellisp

# not sure if all these are needed:
sudo systemctl enable nginx
sudo systemctl enable rstudio-server
sudo systemctl enable shiny-server

# set the ownership of the directory we're going to keep apps in so the shiny
# user can access it
sudo chown -R shiny:shiny /srv/shiny-server
```

Python

CentOS currently comes with Python 2.7, but I wanted to be using Python 3. My Python skills are halting at best, but I want them to be as future-proofed as possible. Anaconda seems a relatively straightforward way to manage Python.
```sh
#---------------Anaconda / python----------------
# go to https://repo.continuum.io/archive/ or https://www.anaconda.com/download/#linux to see the latest version.
# Anaconda3 comes with Python 3.X, Anaconda2 with Python 2.7. Note that Python 2.7 is part of the
# CentOS Linux distribution and shouldn't be overwritten, i.e. "python xxx.py" should run Python 2.7.
# But the process below does change this; watch out in case it causes problems later...
wget https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
sudo bash Anaconda3-5.1.0-Linux-x86_64.sh
# agree to the license, and specify /opt/anaconda3 as the location when asked

# we want to give all users anaconda on their path, so I snitched this from:
# https://www.vultr.com/docs/how-to-install-jupyter-notebook-on-a-vultr-centos-7-server-instance
sudo cp /etc/profile /etc/profile_backup
echo 'export PATH=/opt/anaconda3/bin:$PATH' | sudo tee -a /etc/profile
source /etc/profile
echo $PATH

sudo /opt/anaconda3/bin/conda install psycopg2

# as far as I can tell this makes python3.6 the default python, which is surely going to cause
# problems down the track...
```

Configuring PostgreSQL

I installed PostgreSQL and started its database service early in this process, but the next step is to actually set up some databases and users. The PostgreSQL security model is thorough and comprehensive, but with lots of fishhooks. Here's how I set it up for this particular (very simple) use case. First, I enter the psql environment as the postgres user (currently the only user with any access to the database server):

```sh
sudo -u postgres psql
```

Now we can set up the users we want to be accessing our databases, some databases for them to use, and schemas within those databases.
In this case, I set up two databases for now:

- survey_microdata
- twitter

and three different users, in addition to postgres:

- ellisp (i.e. me, in development mode)
- external_analyst (i.e. me or others, in read-only mode)
- shiny (the Shiny Server's id on the server, needed so Shiny apps can access the database)

```sql
-- you are now in psql as user postgres. Although the default is to use unix's identification of you,
-- and you don't need a password to access the database from the local host, it's good to have a
-- password if you want to set up other connections later
\password postgres

CREATE DATABASE survey_microdata;
CREATE DATABASE twitter;

CREATE ROLE ellisp;
\password ellisp
ALTER ROLE ellisp WITH LOGIN;

CREATE ROLE shiny;
-- no need for a password for shiny; it can only access the db from this machine

CREATE ROLE external_analyst;
\password external_analyst

GRANT ALL PRIVILEGES ON DATABASE twitter TO ellisp;
GRANT ALL PRIVILEGES ON DATABASE survey_microdata TO ellisp;

\c survey_microdata
CREATE SCHEMA nzivs;
CREATE SCHEMA nzis2011;
GRANT ALL PRIVILEGES ON SCHEMA nzivs TO ellisp;
GRANT ALL PRIVILEGES ON SCHEMA nzis2011 TO ellisp;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA nzivs TO ellisp;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA nzis2011 TO ellisp;
GRANT SELECT ON ALL TABLES IN SCHEMA nzis2011 TO external_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA nzivs TO external_analyst;

\c twitter
CREATE SCHEMA tweets;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO ellisp;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA tweets TO ellisp;
GRANT SELECT ON ALL TABLES IN SCHEMA tweets TO shiny;
GRANT CONNECT ON DATABASE twitter TO shiny;

\q
```

We also need to tweak the configuration so the PostgreSQL database is accessible from the outside world (if that's what we want, which I do).

```sh
# follow the instructions at https://blog.bigbinary.com/2016/01/23/configure-postgresql-to-allow-remote-connection.html
# if you want to remotely access it, e.g. from DBeaver on your laptop. Definitely need a password then.

# first add in listen_addresses = 'localhost' just above the commented-out version of
# #listen_addresses = 'localhost'
sudo nano /var/lib/pgsql/data/postgresql.conf

# now the client authentication file, about how individuals can actually log on.
# Add the two lines below (without the leading #) to the bottom of the table. This
# lets users log on via password from anywhere. If that doesn't suit your
# risk profile, find something more restrictive...
# host    all    all    0.0.0.0/0    md5
# host    all    all    ::/0         md5
sudo nano /var/lib/pgsql/data/pg_hba.conf

sudo systemctl restart postgresql
```

A Twitter sample stream database

Collecting data

OK, that wasn't so bad, was it (or was it...). I now have my server running (I stopped it and restarted it as a smaller, cheaper instance than the 8GB-of-RAM one I used during setup) and available to do Useful Stuff. Like collect Twitter data for analysis and dissemination in Shiny.

For a while, I've been mildly exercised by the problem of sampling from Twitter. See for example this earlier post, where I was reduced to using a snowballing network method to find users, and searching for tweets with the letter "e" in them to get a sample of tweets. Both of these methods have obvious problems if you want to do inference about how people as a whole are using Twitter. On the other hand, Twitter makes available several sets of public Tweets that are fully representative:

- The free Sample Tweets API "returns a small random sample of all public Tweets."
- The Decahose stream provides a 10% sample of all public Tweets.
- The Firehose provides all public Tweets.

The latter two services are for paying customers only. My interest in Twitter is curiosity at most, so I'm only interested in the free sample, which is thought to be around 1% of the Firehose (exactly what proportion it is of the Firehose isn't publicly known, and is a question of some inferential interest).
So I was interested in the sample stream, but I wanted to collect a sample over a period of time, not just from the day I was going to do some analysis. Even this 1% sample was more than I wanted to pay disk space to store over time, so I decided I would collect 30 seconds of sample streaming data every hour, at a random time within the hour, to avoid problems associated with doing the sampling at the same time each day.

I designed a data model to capture the data I was most interested in while discarding attached video and images (this was about saving me disk space; I think serious Twitter analysis would have to do better than just collecting text). It looks like this:

By the way, that diagram (and much of the database development) was done with the excellent universal SQL editor and database admin tool, DBeaver. It works with different flavours of relational database and is awesome. The code that creates and populates that database is available on GitHub.

This has now been running smoothly since 17 May 2018, apart from one day last week when I botched an R upgrade and it all went down for half a day before I noticed (lesson learned: run update.packages(ask = FALSE, checkBuilt = TRUE) to ensure your R packages all keep working after the R internals change). So far the database is about 3GB in size, and I'm quite happy to let it grow quite a bit more than that.

What do we find out?

So far the main use I've put this data to is the Shiny app that I've scattered a few screenshots of through this blog post. The source code is on GitHub, of course. That Shiny app writes its own SQL based on the inputs provided by the user (e.g. date range), queries the database, and produces charts. So what have I learned about Twitter (as opposed to about Linux administration) from the exercise?
No time to explore in much depth right now, but some of the interesting things include:

- Tweets have a daily cycle, peaking at around 15:30 UTC each day (this assumes that the sampling ratio in the Twitter sample stream is roughly constant, which I think is likely, as otherwise why would we see this seasonality?).
- The most-tweeted hashtags all relate to teen-oriented popular music. I had to look up TeenChoice just to find out what it was... My filter bubble isn't so much a liberal-v-conservative one as one of different interests from most people in the world altogether. The things that dominate my own Twitter feed are not even faintly representative of Twitter as a whole. I expected this with regard to statistical computing, of course, but it was interesting to find out that even US politics hardly makes a dent in the most common tweets/retweets on any particular day, compared to popular retweets such as "If the Cleveland Cavaliers win the 2018 NBA finals I'll buy everyone who retweet's this a jersey…" (sic), with 1.1 million retweets, and several suspiciously similar variants.
- If you ignore tweets in Japanese, Korean, Thai and Arabic script you are missing three of the top seven languages on Twitter. Any serious analysis needs to find a way to bring them on board. My first obstacle here was getting a font that could represent as many different scripts and emojis as possible without knowing the language in advance; in the end I opted for GNU FreeFont, as described in my own answer to my question on StackOverflow about this problem.
- Only six of the ten most prolific tweeters listed on Twitter Counter are currently prolifically tweeting. In particular, @venethis, @AmexOffers, @BEMANISoundTeam and @__Scc__ seem to have gone either completely quiet or to much lower-frequency tweeting (code for analysis).
- The currently most prolific tweeter is @akiko_lawson, which I think (I don't read Japanese) is an account associated with Lawson convenience stores. This single account issues around 1 tweet for every 5,000 tweets by anyone on the planet.
- The currently second most prolific tweeter is probably a spam bot, called @test5f1798, that amongst other things tweets pictures of Spam. Very meta.

There are some interesting statistical challenges in using this database for inference that I might come back to. For example, I could use a small amount of auxiliary information, such as the 1.1 million retweets of that Cleveland Cavaliers jersey tweet, compare it to the 105 times I found the tweet in my sample, and deduce that my sample is about 1 in 10,000 of the full population of tweets. This is consistent with the sample stream being a genuine 1% sample, of which I collect 1/120th (30 seconds every hour). So I should be able to treat my sample as a 1/12,000 sample of the whole population, clustered by the 30-second window they are in. Something for later.

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.
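As a postscript, that closing back-of-envelope calculation is easy to make concrete (sketched in Python here purely for illustration; the numbers are the ones quoted in the post: 105 sampled occurrences of a tweet retweeted 1.1 million times, a nominal 1% stream, and 30 seconds collected per hour):

```python
# Observed sampling fraction, from the Cleveland Cavaliers jersey tweet
retweets_population = 1_100_000  # total retweets reported by Twitter
retweets_in_sample = 105         # occurrences found in the collected sample
observed = retweets_in_sample / retweets_population

# Expected fraction: a 1% sample stream, of which 30 seconds per hour is kept
expected = 0.01 * (30 / 3600)    # = 1/12,000

print(f"observed: 1 in {1 / observed:,.0f}")
print(f"expected: 1 in {1 / expected:,.0f}")
```

The observed ratio (roughly 1 in 10,500) is close enough to the expected 1 in 12,000 to support treating the collection as a genuine 1/12,000 clustered sample.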
### Lyric Analysis with NLP and Machine Learning using R: Part One – Text Mining

Fri, 07/06/2018 - 12:48

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

June 22, by Debbie Liske

This is Part One of a three-part tutorial series, originally published on the DataCamp online learning platform, in which you will use R to perform a variety of analytic tasks on a case study of musical lyrics by the legendary artist Prince. The three tutorials cover the following:

Musical lyrics may represent an artist's perspective, but popular songs reveal what society wants to hear. Lyric analysis is no easy task. Because it is often structured so differently than prose, it requires caution with assumptions and a uniquely discriminant choice of analytic techniques. Musical lyrics permeate our lives and influence our thoughts with subtle ubiquity. The concept of Predictive Lyrics is beginning to buzz and is increasingly prevalent as a subject of research papers and graduate theses. This case study will just touch on a few pieces of this emerging subject.

Prince: The Artist

To celebrate the inspiring and diverse body of work left behind by Prince, you will explore the sometimes obvious, but often hidden, messages in his lyrics. However, you don't have to like Prince's music to appreciate the influence he had on the development of many genres globally. Rolling Stone magazine listed Prince as the 18th best songwriter of all time, just behind the likes of Bob Dylan, John Lennon, Paul Simon, Joni Mitchell and Stevie Wonder. Lyric analysis is slowly finding its way into data science communities as the possibility of predicting "Hit Songs" approaches reality.

Prince was a man bursting with music – a wildly prolific songwriter, a virtuoso on guitars, keyboards and drums and a master architect of funk, rock, R&B and pop, even as his music defied genres.
– Jon Pareles (NY Times) In this tutorial, Part One of the series, you’ll utilize text mining techniques on a set of lyrics using the tidy text framework. Tidy datasets have a specific structure in which each variable is a column, each observation is a row, and each type of observational unit is a table. After cleaning and conditioning the dataset, you will create descriptive statistics and exploratory visualizations while looking at different aspects of Prince’s lyrics. Check out the article here! (reprint by permission of DataCamp online learning platform) var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Ghosts of Animals Haunt Portland’s Overpriced Apartments Fri, 07/06/2018 - 02:00 (This article was first published on Dan Garmat's Blog -- R, and kindly contributed to R-bloggers) Apartment hunting in an expensive city is leading me to curses and exclamations. Following are some outstanding examples of insanely priced apartments in Portland, OR, run through Google Deep Dream in hopes of my understanding why people pay so much for a small box. These listings will be gone in no time (I’m sure) so including some captions for posterity. Let’s start with this one. Indeed, it appears$1899 for 1 bedroom grants access to this clubhouse haunted by some floating apparition:

The Deep Dream InceptionV3 algorithm here is trained on ImageNet, then makes changes to the image that increase the network's confidence in the predicted category. Looped several times with the num_octave hyperparameter, the image starts to look a good bit trippy, and this helps give some intuition about what a neural network “sees” as prototypical examples of a predicted class. Apparently there is no “view of apartment” class, as it keeps seeing ghastly animals. Perhaps it is no coincidence that, even before running InceptionV3, this clubhouse already looks like it could work in The Shining:

$1850 / 1br – 697ft2 – BRAND NEW! Enjoy Luxury Uban [Urban?] Living at The Franklin Flats!

“NEW ON THE MARKET! The Franklin Flats is the newest addition to this desirable part of town! Built with the urban adventurer in mind, our small community offers luxury appeal with a neighborhood feel. Boasting a walkability score of 86 out of 100, you can’t beat the location! [unless an 87+?] Our close proximity to Mt. Tabor, food carts, [because you won’t have anything left over for restaurants] shopping and eateries gives you the classic Northwest experience you crave. Email or call to schedule a personal tour today!”

Apparently the Attack on Titan seals make this a desirable part of town. Perhaps those seals are why walkability doesn’t crack 90. If you survive the seals on a midday stroll, there are titan polar bears who can walk through walls:

$4250 / 2br – 1900ft2 – Condo on the Willamette

“Breathtaking views of the city and the Willamette River, located in the elegant Atwater. This condo has two bedrooms, living room, dining room, gourmet kitchen, gas fireplace, small office, two balconies, utility room and underground parking. Includes concierge desk, card-accessed security.”

Something tells me this view of the Willamette River would be complete if a cocker spaniel were staring at me…

But this view is what you really pay for: look at all the suckers in the two identical buildings who massively overpaid for – how the heck did those get up here?!

$3900 / 2br – 1004ft2 – Portland’s most well-appointed two bedroom apartments now available

“Portland’s premier rental community is now pre-leasing. Find everything you desire in one place: The finest dining, distinguished boutiques, and most-beloved haunts. Experience the perfect merger of luxury and livability with our timeless brick exterior, stunning marble lobby, tiled bathrooms, tall ceilings, ample light, extensive storage, concierge services, and even a dog washing station.

“Proudly situated in Portland’s [S]Nob Hill/23rd Ave neighborhood, 21 Astor boasts a “97” walk score and a “96” bike score. [Beat that, Franklin Flats!] Life is great in the epicenter of Portland’s charm. Pick up your locally-grown produce at City Market. Grab happy hour with the gang at Seratto. Sweat it out at Corepower Yoga.

“Our greater than 1:1 parking ratio assures a space for your car in our garage. Imagine never having to find parking on NW 23rd again.”

I work on NW 21st; that’s almost the cost of parking alone. People walk by on Oregon ‘pot tours’ and this may be how they see the building as well: Whoa! Is that a pig in the sky? Far out! At least their new $3.88/sqft/month kitchens aren’t yet haunted – oh, god! Are those giant psychedelic worms?!

Code:

Started with JJ Allaire, François Chollet, RStudio, and Google’s keras code based on Google DeepDream. Added a little looping to try different parameters over these 7 images:

```r
library(keras)
library(tensorflow)
library(purrr)

# Function Definitions ----------------------------------------------------

preprocess_image <- function(image_path){
  image_load(image_path) %>%
    image_to_array() %>%
    array_reshape(dim = c(1, dim(.))) %>%
    inception_v3_preprocess_input()
}

deprocess_image <- function(x){
  x <- x[1,,,]
  # Remove zero-center by mean pixel
  x <- x/2.
  x <- x + 0.5
  x <- x * 255
  # 'BGR'->'RGB'
  x <- x[,,c(3,2,1)]
  # Clip to interval 0, 255
  x[x > 255] <- 255
  x[x < 0] <- 0
  x[] <- as.integer(x)/255
  x
}

# Parameters --------------------------------------------------------------

## list of images to process --
list_images <- list.files('images/deep dream apartments/orig/', full.names = TRUE)

## list of settings to try --
list_settings <- list(
  settings = list(
    features = list(mixed2 = 0.2, mixed3 = 0.5, mixed4 = 2., mixed5 = 1.5),
    hyperparams = list(
      # Playing with these hyperparameters will also allow you to achieve new effects
      step = 0.01,         # Gradient ascent step size
      num_octave = 6,      # Number of scales at which to run gradient ascent
      octave_scale = 1.4,  # Size ratio between scales
      iterations = 20,     # Number of ascent steps per scale
      max_loss = 10
    )
  ),
  settings = list(
    features = list(mixed2 = 0.5, mixed3 = 0.2, mixed4 = 1.1, mixed5 = 1.5),
    hyperparams = list(step = 0.01, num_octave = 9, octave_scale = 1.1,
                       iterations = 20, max_loss = 5)
  ),
  settings = list(
    features = list(mixed2 = 0.02, mixed3 = 0.05, mixed4 = 0.01, mixed5 = 0.05),
    hyperparams = list(step = 0.01, num_octave = 11, octave_scale = 1.1,
                       iterations = 20, max_loss = 20)
  ),
  settings = list(
    features = list(mixed2 = 0.2, mixed3 = 0.5, mixed4 = 2., mixed5 = 1.5),
    hyperparams = list(step = 0.01, num_octave = 8, octave_scale = 1.4,
                       iterations = 20, max_loss = 10)
  ),
  settings = list(
    features = list(mixed2 = 0.2, mixed3 = 0.1, mixed4 = 0.4, mixed5 = 0.3),
    hyperparams = list(step = 0.01, num_octave = 8, octave_scale = 1.4,
                       iterations = 20, max_loss = 10)
  ),
  settings = list(
    features = list(mixed2 = 1.2, mixed3 = 1.5, mixed4 = 3., mixed5 = 2.5),
    hyperparams = list(step = 0.01, num_octave = 8, octave_scale = 1.4,
                       iterations = 20, max_loss = 25)
  ),
  settings = list(
    features = list(mixed2 = 0.2, mixed3 = 2.5, mixed4 = 2., mixed5 = 3.5),
    hyperparams = list(step = 0.05, num_octave = 6, octave_scale = 1.4,
                       iterations = 20, max_loss = 13)
  ),
  settings = list(
    features = list(mixed2 = 0.2, mixed3 = 2.5, mixed4 = 2., mixed5 = 3.5),
    hyperparams = list(step = 0.05, num_octave = 8, octave_scale = 1.4,
                       iterations = 20, max_loss = 13)
  )
)

# Model Definition --------------------------------------------------------

k_set_learning_phase(0)

# Build the InceptionV3 network with our placeholder.
# The model will be loaded with pre-trained ImageNet weights.
model <- application_inception_v3(weights = "imagenet", include_top = FALSE)

# This will contain our generated image
dream <- model$input

# Get the symbolic outputs of each "key" layer (we gave them unique names).
layer_dict <- model$layers
names(layer_dict) <- map_chr(layer_dict, ~.x$name)

## just loop on images - bottleneck is training anyway --
for(image_path in list_images){

  image <- preprocess_image(image_path)

  ## get model prediction (kept for reference; need to try again later):
  #x <- array_reshape(image, c(1, dim(image)))
  #x <- imagenet_preprocess_input(x)
  #preds <- predict(model, x)
  #imagenet_decode_predictions(preds, top = 3)

  setting_counter <- 0

  ## then nested loop on settings --
  for(settings in list_settings){

    setting_counter <- setting_counter + 1
    print(paste0('starting ', image_path, ', setting # ', setting_counter))

    # reload image each time
    image <- preprocess_image(image_path)

    # Define the loss
    loss <- k_variable(0.0)
    for(layer_name in names(settings$features)){
      # Add the L2 norm of the features of a layer to the loss
      coeff <- settings$features[[layer_name]]
      x <- layer_dict[[layer_name]]$output
      scaling <- k_prod(k_cast(k_shape(x), 'float32'))
      # Avoid border artifacts by only involving non-border pixels in the loss
      loss <- loss + coeff*k_sum(k_square(x)) / scaling
    }

    # Compute the gradients of the dream wrt the loss
    grads <- k_gradients(loss, dream)[[1]]

    # Normalize gradients.
    grads <- grads / k_maximum(k_mean(k_abs(grads)), k_epsilon())

    # Set up function to retrieve the value
    # of the loss and gradients given an input image.
    fetch_loss_and_grads <- k_function(list(dream), list(loss, grads))

    # this function will crash the R session if too many octaves on too small an image
    eval_loss_and_grads <- function(image){
      outs <- fetch_loss_and_grads(list(image))
      list(
        loss_value = outs[[1]],
        grad_values = outs[[2]]
      )
    }

    gradient_ascent <- function(x, iterations, step, max_loss = NULL) {
      for (i in 1:iterations) {
        # on error, return a list with an NA loss so the check below works
        out <- tryCatch(eval_loss_and_grads(x),
                        error = function(e) list(loss_value = NA))
        # need to add this for negative gradients
        if(is.na(out$loss_value)){
          print(paste0('NA loss_value on setting # ', setting_counter))
          break
        } else if (!is.null(max_loss) && out$loss_value > max_loss) {
          break
        }
        print(paste("Loss value at", i, ':', out$loss_value))
        x <- x + step * out$grad_values
      }
      x
    }

    original_shape <- dim(image)[-c(1, 4)]
    successive_shapes <- list(original_shape)
    for (i in 1:settings$hyperparams$num_octave) {
      successive_shapes[[i+1]] <-
        as.integer(original_shape/settings$hyperparams$octave_scale^i)
    }
    successive_shapes <- rev(successive_shapes)

    original_image <- image
    shrunk_original_img <- image_array_resize(
      image, successive_shapes[[1]][1], successive_shapes[[1]][2]
    )

    shpnum <- 0  # for debugging
    for (shp in successive_shapes) {
      shpnum <- shpnum + 1  # for debugging
      image <- image_array_resize(image, shp[1], shp[2])
      print(paste0('running octave ', shpnum))
      image <- gradient_ascent(image,
                               settings$hyperparams$iterations,
                               settings$hyperparams$step,
                               settings$hyperparams$max_loss)
      print(paste0('finished octave ', shpnum))
      upscaled_shrunk_original_img <- image_array_resize(shrunk_original_img, shp[1], shp[2])
      same_size_original <- image_array_resize(original_image, shp[1], shp[2])
      lost_detail <- same_size_original - upscaled_shrunk_original_img
      image <- image + lost_detail
      shrunk_original_img <- image_array_resize(original_image, shp[1], shp[2])
    }

    image_path %>%
      gsub('/orig/', '/dream/', .) %>%
      gsub('.jpg', paste0('_dream', setting_counter, '.png'), .) %>%
      png(filename = .)
    plot(as.raster(deprocess_image(image)))
    dev.off()

    print(paste0('finished ', image_path, ', setting # ', setting_counter))
  }
}
```


### New package packagefinder – Search for packages from the R console

Fri, 07/06/2018 - 01:52

(This article was first published on Topics in R, and kindly contributed to R-bloggers)

There are over 12,700 R packages on CRAN. How do you find the right one for you? The new package ‘packagefinder‘ helps you search for packages on CRAN right from the R console. With ‘packagefinder’ you can search for multiple keywords in the name, title and description of a CRAN package, either case-sensitively or case-insensitively, and define your own weighting scheme for the search results, if you like. Once you have found a promising package, you can use the simple function go() to go to the package’s CRAN webpage or view its PDF manual, directly from the R console, without having to install the package first. Of course, you can also install the package easily, if you want to try it out. Check out ‘packagefinder’ on CRAN: https://cloud.r-project.org/web/packages/packagefinder/index.html
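A quick sketch of that workflow (the keywords and argument values here are illustrative, not from the announcement; check the package manual for the exact signatures):

```r
# Illustrative sketch of a packagefinder session
library(packagefinder)

# search names, titles and descriptions of CRAN packages for keywords
findPackage(c("spatial", "regression"))

# jump to a hit's CRAN webpage (or its PDF manual) without installing it;
# treat the 'where' value as illustrative and consult the manual
go("cobs", where = "details")
```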



Fri, 07/06/2018 - 00:18

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

Another quick riddle from the riddler: solve the equation

EXMREEK + EHKREKK = ?K?H?X?E

which involves every digit between 0 and 9. While the puzzle can be unravelled by first considering E and K, which must be equal to 6 and 3 respectively, a simple R code also leads to the conclusion

```r
isok <- function(a, b){
  s = as.numeric(unlist(strsplit(as.character(sum(10^(6:0)*a) +
                                              sum(10^(6:0)*b)), "")))
  if (length(s) == 7){
    goal = FALSE
  } else {
    goal = (length(unique(c(a, b, s))) == 10) & (a[2] == s[6]) &
      (s[8] == a[6]) & (s[2] == a[7]) & (b[2] == s[4])
  }
  return(goal)
}

pasok <- function(T = 1e3){
  a <- b <- numeric(7)  # initialise the digit vectors
  for (t in 1:T){
    a[1] = a[5] = a[6] = 6; a[7] = 3
    digs = sample(c(0:2, 4, 5, 7:9), 4)
    a[2:4] = digs[1:3]
    b[1] = a[1]; b[2] = digs[4]
    b[3] = a[7]; b[4] = a[4]; b[5] = a[1]; b[6:7] = a[7]
    if (isok(a = a, b = b)) print(rbind(a, b))
  }
}
```

```r
> pasok()
  [,1] [,2] [,3] [,4] [,5] [,6] [,7]
a    6    2    4    7    6    6    3
b    6    8    3    7    6    3    3
```

whose sum is 13085296.
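A quick base-R sanity check of that solution (EXMREEK = 6247663, EHKREKK = 6837633):

```r
a <- 6247663   # EXMREEK
b <- 6837633   # EHKREKK
a + b          # 13085296

# all ten digits 0-9 appear across the two summands and their sum
digs <- unlist(strsplit(as.character(c(a, b, a + b)), ""))
length(unique(digs))   # 10
```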



### Convex Regression Model

Thu, 07/05/2018 - 23:23

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

This morning during the lecture on nonlinear regression, I mentioned (very) briefly the case of convex regression. Since I forgot to mention the codes in R, I will publish them here. Assume that $y_i=m(x_i)+\varepsilon_i$ where $m(\cdot)$ is some convex function.

Then $m$ is convex if and only if $m(\lambda x_1+(1-\lambda)x_2)\leq\lambda m(x_1)+(1-\lambda)m(x_2)$ for all $x_1,x_2$ and all $\lambda\in[0,1]$. Hildreth (1954) proved that if $m^\star=\text{argmin}\{\sum_i(y_i-m(x_i))^2\}$ over the set of convex functions, then $m^\star$ is unique.

Let $\theta_i=m(x_i)$; then $\hat\theta=\text{argmin}\{\|y-\theta\|^2,\ \theta\in\mathcal{K}\}$, where $\mathcal{K}$ is the set of vectors $\theta$ consistent with some convex function. I.e. $\hat\theta$ is the projection of $y$ onto the (closed) convex cone $\mathcal{K}$. The projection theorem gives existence and unicity.

For convenience, in the application, we will consider the real-valued case, $m:\mathbb{R}\to\mathbb{R}$, i.e. $x_i\in\mathbb{R}$. Assume that observations are ordered, $x_1\leq x_2\leq\cdots\leq x_n$. Here the convexity constraint reads $\frac{\theta_2-\theta_1}{x_2-x_1}\leq\frac{\theta_3-\theta_2}{x_3-x_2}\leq\cdots\leq\frac{\theta_n-\theta_{n-1}}{x_n-x_{n-1}}$.

Hence, we have a quadratic program with linear constraints.

$\hat m$ is a piecewise linear function (the linear interpolation of consecutive pairs $(x_i,\hat\theta_i)$).

If $m$ is differentiable, $m$ is convex if $m(x)+\nabla m(x)\cdot(y-x)\leq m(y)$ for all $x,y$.

More generally, if $m$ is convex, then there exists $\xi_x$ such that $m(x)+\xi_x\cdot(y-x)\leq m(y)$ for all $y$;
$\xi_x$ is a subgradient of $m$ at $x$. And then $m(x)=\sup_y\{m(y)+\xi_y\cdot(x-y)\}$.

Hence, $\hat\theta$ is the solution of $\min\{\|y-\theta\|^2\}$ subject to $\theta_j\geq\theta_i+\xi_i\cdot(x_j-x_i)$ for all $i,j$, with unknowns $\theta_1,\dots,\theta_n$ and $\xi_1,\dots,\xi_n$. Now, to do it for real, use the cobs package for constrained (b)splines regression,

```r
library(cobs)
```

To get a convex regression, use

```r
plot(cars)
x = cars$speed
y = cars$dist
rc = conreg(x, y, convex = TRUE)
lines(rc, col = 2)
```

Here we can get the values of the knots

```r
rc

Call: conreg(x = x, y = y, convex = TRUE)
Convex regression: From 19 separated x-values, using 5 inner knots,
      7, 8, 9, 20, 23.
RSS = 1356; R^2 = 0.8766; needed (5,0) iterations
```

and actually, if we use them in a linear-spline regression, we get the same output here

```r
library(splines)
reg = lm(dist ~ bs(speed, degree = 1, knots = c(4, 7, 8, 9, 20, 23, 25)), data = cars)
u = seq(4, 25, by = .1)
v = predict(reg, newdata = data.frame(speed = u))
lines(u, v, col = "green")
```

Let us add vertical lines for the knots

```r
abline(v = c(4, 7, 8, 9, 20, 23, 25), col = "grey", lty = 2)
```
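For completeness, the quadratic program with linear constraints mentioned earlier can also be solved directly. This is a sketch (not code from the post), using the quadprog package on the cars data, with ties in $x$ averaged out so that consecutive slopes are well defined:

```r
# Sketch (not from the post): convex regression as a QP with quadprog,
# minimising ||y - theta||^2 subject to non-decreasing slopes.
library(quadprog)

x <- sort(unique(cars$speed))
y <- as.numeric(tapply(cars$dist, cars$speed, mean))  # average ties
n <- length(x)

# one linear constraint per interior point:
# (theta[i+1]-theta[i])/(x[i+1]-x[i]) >= (theta[i]-theta[i-1])/(x[i]-x[i-1])
A <- matrix(0, n, n - 2)
for (i in 2:(n - 1)) {
  d1 <- x[i] - x[i - 1]
  d2 <- x[i + 1] - x[i]
  A[i - 1, i - 1] <-  1 / d1
  A[i,     i - 1] <- -1 / d1 - 1 / d2
  A[i + 1, i - 1] <-  1 / d2
}

# solve.QP minimises (1/2) t(theta) %*% D %*% theta - t(d) %*% theta
# subject to t(A) %*% theta >= b
fit <- solve.QP(Dmat = diag(n), dvec = y, Amat = A, bvec = rep(0, n - 2))
theta_hat <- fit$solution

plot(cars)
lines(x, theta_hat, col = 4)
```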



### New Course: Support Vector Machines in R

Thu, 07/05/2018 - 20:28

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Course Description

Support Vector Machines in R will help students develop an understanding of the SVM model as a classifier and gain practical experience using R’s libsvm implementation from the e1071 package. Along the way, students will gain an intuitive understanding of important concepts, such as hard and soft margins, the kernel trick, different types of kernels, and how to tune SVM parameters. Get ready to classify data with this impressive model.

Chapter 1: Introduction

This chapter introduces some key concepts of support vector machines through a simple 1-dimensional example. Students are also walked through the creation of a linearly separable dataset that is used in the subsequent chapter.

Chapter 2: Support Vector Classifiers – Linear Kernels

Chapter 2 introduces students to the basic concepts of support vector machines by applying the svm algorithm to a dataset that is linearly separable. Key concepts are illustrated through ggplot visualizations that are built from the outputs of the algorithm and the role of the cost parameter is highlighted via a simple example. The chapter closes with a section on how the algorithm deals with multi-class problems.

Chapter 3: Polynomial Kernels

This chapter provides an introduction to polynomial kernels via a dataset that is radially separable (i.e. has a circular decision boundary). After demonstrating the inadequacy of linear kernels for this dataset, students will see how a simple transformation renders the problem linearly separable thus motivating an intuitive discussion of the kernel trick. Students will then apply the polynomial kernel to the dataset and tune the resulting classifier.

Chapter 4: Radial Basis Function Kernels

Chapter 4 builds on the previous three chapters by introducing the highly flexible Radial Basis Function (RBF) kernel. Students will create a "complex" dataset that shows up the limitations of polynomial kernels. Then, following an intuitive motivation for the RBF kernel, students see how it addresses the shortcomings of the other kernels discussed in this course.

The course requires the following prerequisites:

Introduction to R

Introduction to Machine Learning
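To give a flavour of the interface the course works with, here is a minimal sketch using e1071's svm on the built-in iris data (a reader's sketch, not course material):

```r
# Sketch (not course code): a linear-kernel SVM with e1071 on iris
library(e1071)

set.seed(1)
train <- sample(nrow(iris), 100)
fit <- svm(Species ~ ., data = iris[train, ],
           kernel = "linear", cost = 1)
pred <- predict(fit, iris[-train, ])
mean(pred == iris$Species[-train])   # held-out accuracy

# grid-search the cost and gamma parameters for an RBF kernel
tuned <- tune.svm(Species ~ ., data = iris[train, ],
                  kernel = "radial", cost = 10^(-1:2), gamma = 10^(-2:0))
tuned$best.parameters
```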



### First Class R Support in Binder / Binderhub – Shiny Apps As Well as R-Kernels and RStudio

Thu, 07/05/2018 - 19:59

(This article was first published on Rstats – OUseful.Info, the blog…, and kindly contributed to R-bloggers)

I notice from the binder-examples/r repo that Binderhub now appears to offer all sorts of R goodness out of the can, if you specify a particular R build.

From the same repo root, you can get:

And from previously, here’s a workaround for displaying R/HTMLwidgets in a Jupyter notebook.

OpenRefine is also available from a simple URL – https://mybinder.org/v2/gh/betatim/openrefineder/master?urlpath=openrefine – courtesy of betatim/openrefineder:

Perhaps it’s time for me to try to get my head round what the Jupyter notebook proxy handlers are doing…

PS see also Scripted Forms for a simple markdown script way of building interactive Jupyter widget powered UIs.



### New Course: Interactive Maps with leaflet in R

Thu, 07/05/2018 - 19:19

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Course Description

Get ready to have some fun with maps! Interactive Maps with leaflet in R will give you the tools to make attractive and interactive web maps using spatial data and the tidyverse. In this course, you will create maps using the IPEDS dataset, which contains data on U.S. colleges and universities. Along the way, you will customize your maps using labels, popups, and custom markers, and add layers to enhance interactivity. Following the course, you will be able to create and customize your own interactive web maps to reveal patterns in your data.

Chapter 1: Setting Up Interactive Web Maps (Free)

This chapter will introduce students to the htmlwidgets package and the leaflet package. Following this introduction, students will build their first interactive web map using leaflet. Through the process of creating this first map students will be introduced to many of the core features of the leaflet package, including adding different map tiles, setting the center point and zoom level, plotting single points based on latitude and longitude coordinates, and storing leaflet maps as objects.

Chapter 2: Plotting Points

In this chapter students will build on the leaflet map they created in Chapter 1 to create an interactive web map of every four-year college in California. After plotting hundreds of points on an interactive leaflet map, students will learn to customize the markers on their leaflet map. This chapter will also cover how to color-code markers based on a factor variable.

Chapter 3: Groups, Layers, and Extras

In chapter 3, students will expand on their map of all four year colleges in California to create a map of all American colleges. First, in section 3.1 students will review and build on the material from Chapter 2 to create a map of all American colleges. Then students will re-plot the colleges on their leaflet map by sector (public, private, or for-profit) using groups to enable users to toggle the colleges that are displayed on the map. In section 3.3 students will learn to add multiple base maps so that users can toggle between multiple map tiles.

Chapter 4: Plotting Polygons

In Chapter 4, students will learn to map polygons, which can be used to define geographic regions (e.g., zip codes, states, countries, etc.). Chapter 4 will start by plotting the zip codes in North Carolina that fall in the top quartile of mean family incomes. Students will learn to customize the polygons with color palettes and labels. Chapter 4 will conclude with adding a new layer to the map of every college in America that displays every zip code with a mean income of \$200,000 or more during the 2015 tax year. Through the process of mapping zip codes students will learn about spatial data generally, geoJSON data, the @ symbol, and the addPolygons() function.

This course requires the following prerequisites:

Introduction to R

Introduction to the Tidyverse
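As a taste of the syntax the course builds on, a minimal leaflet map looks like this (a sketch with illustrative coordinates, not course code):

```r
# Sketch (not course code): a basic interactive leaflet map
library(leaflet)

leaflet() %>%
  addTiles() %>%                                      # default OpenStreetMap tiles
  setView(lng = -122.68, lat = 45.52, zoom = 12) %>%  # center point and zoom level
  addMarkers(lng = -122.683, lat = 45.512,
             popup = "An example marker")
```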

