R-bloggers: R news and tutorials contributed by 750 R bloggers

Frankenstein

Tue, 03/07/2017 - 08:30

Remember me, remember me, but ah! forget my fate (Dido’s Lament, Henry Purcell)

A Voronoi diagram divides a plane based on a set of original points. Each polygon, or Voronoi cell, contains an original point and all the locations that are closer to that point than to any other.

This is a nice example of a Voronoi tessellation. You can find good explanations of Voronoi diagrams and Delaunay triangulations here (in English) or here (in Spanish).

A grayscale image is simply a matrix where the darkness of the pixel located at coordinates (i, j) is represented by the value of its corresponding element of the matrix: a grayscale image is a dataset. This is a Voronoi diagram of Frankenstein:

To produce it I followed these steps:

  1. Read this image
  2. Convert it to gray scale
  3. Turn it into a pure black and white image
  4. Obtain a random sample of black pixels (the previous image corresponds to a sample of 6,000 points)
  5. Compute the Voronoi tessellation

Steps 1 to 3 were done with imager, a very appealing package to process and analyse images. Step 5 was done with deldir, also a convenient package which computes Delaunay triangulations and the Dirichlet or Voronoi tessellations.
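Before looking at the full script, here is a minimal, self-contained sketch of the deldir step on its own, using random points instead of sampled pixels (the points and window below are made up purely for illustration):

# Minimal sketch: Voronoi tessellation of 50 random points with deldir
library(deldir)
set.seed(1)
pts <- data.frame(x = runif(50), y = runif(50))
tess <- deldir(pts$x, pts$y, rw = c(0, 1, 0, 1))  # rw = frame limits (xmin, xmax, ymin, ymax)
head(tess$dirsgs)                                 # one row per Voronoi edge: x1, y1, x2, y2, ...
plot(tess, wlines = "tess", wpoints = "none", lty = 1)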

The next grid shows tessellations for sample sizes from 500 to 12,000 points in steps of 500:

I gathered all the previous images in this gif, created with magick, another amazing R package I discovered recently:

This is the code:

library(imager)
library(dplyr)
library(deldir)
library(ggplot2)
library(scales)

# Download the image
file="http://ereaderbackgrounds.com/movies/bw/Frankenstein.jpg"
download.file(file, destfile = "frankenstein.jpg", mode = 'wb')

# Read and convert to grayscale
load.image("frankenstein.jpg") %>% grayscale() -> x

# This is just to define frame limits
x %>%
  as.data.frame() %>%
  group_by() %>%
  summarize(xmin=min(x), xmax=max(x), ymin=min(y), ymax=max(y)) %>%
  as.vector()->rw

# Filter image to convert it to bw
x %>%
  threshold("45%") %>%
  as.data.frame() -> df

# Function to compute and plot Voronoi tesselation depending on sample size
doPlot = function(n)
{
  #Voronoi tesselation
  df %>%
    sample_n(n, weight=(1-value)) %>%
    select(x,y) %>%
    deldir(rw=rw, sort=TRUE) %>%
    .$dirsgs -> data

  # This is just to add some alpha to lines depending on its longitude
  data %>%
    mutate(long=sqrt((x1-x2)^2+(y1-y2)^2),
           alpha=findInterval(long, quantile(long, probs = seq(0, 1, length.out = 20)))/21)-> data

  # A little bit of ggplot to plot results
  data %>%
    ggplot(aes(alpha=(1-alpha))) +
    geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2), color="black", lwd=1) +
    scale_x_continuous(expand=c(0,0))+
    scale_y_continuous(expand=c(0,0), trans=reverse_trans())+
    theme(legend.position = "none",
          panel.background = element_rect(fill="white"),
          axis.ticks = element_blank(),
          panel.grid = element_blank(),
          axis.title = element_blank(),
          axis.text = element_blank())->plot

  return(plot)
}

# I call the previous function and store resulting plot in jpeg format
i=5000
name=paste0("frankie",i,".jpeg")
jpeg(name, width = 600, height = 800, units = "px", quality = 100)
doPlot(i)
dev.off()

# Once all images are stored I can create gif
library(magick)
frames=c()
images=list.files(pattern="jpeg")

for (i in length(images):1)
{
  x=image_read(images[i])
  x=image_scale(x, "300")
  c(x, frames) -> frames
}

animation=image_animate(frames, fps = 2)
image_write(animation, "Frankenstein.gif")

RProtoBuf 0.4.9

Tue, 03/07/2017 - 02:17

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

RProtoBuf provides R bindings for the Google Protocol Buffers ("Protobuf") data encoding and serialization library used and released by Google, and deployed as a language and operating-system agnostic protocol by numerous projects.

The RProtoBuf 0.4.9 release is the fourth and final update this weekend following the request by CRAN to not use package= in .Call() when PACKAGE= is really called for.
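For context, PACKAGE= (upper case) is the .Call() argument that restricts the symbol lookup to one shared library, whereas a lower-case package= is just an ordinary argument that gets passed through to the compiled routine. A hedged sketch of the corrected calling pattern — the routine name and arguments below are invented for illustration, not taken from the RProtoBuf sources:

# Hypothetical entry point; PACKAGE= tells R which shared library to search
# for the symbol and is not forwarded to the C++ routine itself.
.Call("rprotobuf_read_message", descriptor, payload, PACKAGE = "RProtoBuf")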

Some of the code in RProtoBuf 0.4.9 had this bug; some other entry points had neither (!!). With the ongoing drive to establish proper registration of entry points, a few more issues were coming up, all of which are now addressed. And we had some other unreleased minor cleanup, so this made for a somewhat longer (compared to the other updates this weekend) NEWS list:

Changes in RProtoBuf version 0.4.9 (2017-03-06)
  • A new file init.c was added with calls to R_registerRoutines() and R_useDynamicSymbols()

  • Symbol registration is enabled in useDynLib

  • Several missing PACKAGE= arguments were added to the corresponding .Call invocations

  • Two (internal) C++ functions were renamed with suffix _cpp to disambiguate them from R functions with the same name

  • All of above were part of #26

  • Some editing corrections were made to the introductory vignette (David Kretch in #25)

  • The ‘configure.ac’ file was updated, and renamed from the older convention ‘configure.in’, along with ‘src/Makevars’. (PR #24 fixing #23)

CRANberries also provides a diff to the previous release. The RProtoBuf page has an older package vignette, a ‘quick’ overview vignette, a unit test summary vignette, and the pre-print for the JSS paper. Questions, comments etc should go to the GitHub issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

RVowpalWabbit 0.0.9

Tue, 03/07/2017 - 02:06

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

The RVowpalWabbit package update is the third of four upgrades requested by CRAN, following RcppSMC 0.1.5 and RcppGSL 0.3.2.

This package being somewhat raw, the change was simple and just meant converting the single entry point to using Rcpp Attributes — which addressed the original issue in passing.

No new code or features were added.

We should mention that there is parallel work ongoing in a higher-level package interfacing the vw binary — rvw — as well as a plan to redo this package via the external libraries. If that sounds interesting to you, please get in touch.

More information is on the RVowpalWabbit page. Issues and bug reports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Hyper-parameter Tuning with Grid Search for Deep Learning

Tue, 03/07/2017 - 01:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

Last week I showed how to build a deep neural network with h2o and rsparkling. As we could see there, it is not trivial to optimize the hyper-parameters for modeling. Hyper-parameter tuning with grid search allows us to test different combinations of hyper-parameters and find one with improved accuracy.

Keep in mind though, that hyper-parameter tuning can only improve the model so much without overfitting. If you can’t achieve sufficient accuracy, the input features might simply not be adequate for the predictions you are trying to model. It might be necessary to go back to the original features and try e.g. feature engineering methods.

Preparing Spark instance and plotting theme

Check out last week’s post for details on how I prepared the data.

library(rsparkling)
options(rsparkling.sparklingwater.version = "2.0.3")

library(h2o)
library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.0.0")

library(ggplot2)
library(ggrepel)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "darkgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    legend.justification = "top",
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

arrhythmia_sc <- copy_to(sc, arrhythmia_subset)
arrhythmia_hf <- as_h2o_frame(sc, arrhythmia_sc, strict_version_check = FALSE)

arrhythmia_hf[, 2] <- h2o.asfactor(arrhythmia_hf[, 2])
arrhythmia_hf[, 3] <- h2o.asfactor(arrhythmia_hf[, 3])

splits <- h2o.splitFrame(arrhythmia_hf, ratios = c(0.7, 0.15), seed = 1)

train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]

response <- "diagnosis"
weights <- "weights"
features <- setdiff(colnames(train), c(response, weights, "class"))

Grid Search

We can use the h2o.grid() function to perform a Random Grid Search (RGS). We could also test all possible combinations of parameters with Cartesian Grid or exhaustive search, but RGS is much faster when we have a large number of possible combinations and usually finds sufficiently accurate models.

For RGS, we first define a set of hyper-parameters and search criteria to fine-tune our models. Because there are many hyper-parameters, each with a range of possible values, we want to find an (ideally) optimal combination to maximize our model’s accuracy.
We can also specify how long we want to run the grid search for. Based on the results of each model tested in the grid, we can choose the one with the highest accuracy or best performance for the question on hand.

Activation Functions
  • Rectifier: the default activation function. It is the fastest and most versatile option, but it can lead to instability and tends to be lower in accuracy.
  • Tanh: The hyperbolic tangent is a scaled and shifted variant of the sigmoid activation function. It can take on values from -1 to 1 and centers around 0. Tanh needs more computational power than e.g. the Rectifier function.
  • Maxout: an activation function that takes the maximum of the inputs. It is computationally quite demanding but can produce high-accuracy models.

  • ...WithDropout: When we specify an activation function with dropout, a random subset of the network is trained and the weights of all sub-networks are averaged. It works together with the parameter hidden_dropout_ratios, which controls the fraction of neurons that are randomly dropped in each hidden layer. Hidden dropout ratios are useful for preventing overfitting on learned features.
Hidden layers
  • are the most important hyper-parameter to set for deep neural networks, as they specify how many hidden layers and how many nodes per hidden layer the model should learn
L1 and L2 penalties
  • L1: lets only strong weights survive
  • L2: prevents any single weight from getting too big.
Rho and Epsilon
  • rho: similar to prior weight updates
  • epsilon: prevents getting stuck in local optima
hyper_params <- list(
  activation = c("Rectifier", "Maxout", "Tanh", "RectifierWithDropout", "MaxoutWithDropout", "TanhWithDropout"),
  hidden = list(c(5, 5, 5, 5, 5), c(10, 10, 10, 10), c(50, 50, 50), c(100, 100, 100)),
  epochs = c(50, 100, 200),
  l1 = c(0, 0.00001, 0.0001),
  l2 = c(0, 0.00001, 0.0001),
  rate = c(0, 0.01, 0.005, 0.001),
  rate_annealing = c(1e-8, 1e-7, 1e-6),
  rho = c(0.9, 0.95, 0.99, 0.999),
  epsilon = c(1e-10, 1e-8, 1e-6, 1e-4),
  momentum_start = c(0, 0.5),
  momentum_stable = c(0.99, 0.5, 0),
  input_dropout_ratio = c(0, 0.1, 0.2),
  max_w2 = c(10, 100, 1000, 3.4028235e+38)
)
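The hidden_dropout_ratios parameter mentioned above is not part of this grid. If we wanted to tune it for the ...WithDropout activations, it could be added in a separate, smaller grid; a hedged sketch with illustrative values (each ratio vector needs one entry per hidden layer, so the hidden architecture is kept fixed here):

# Sketch only: dropout-specific hyper-parameters with a fixed hidden architecture
hyper_params_dropout <- list(
  activation = c("RectifierWithDropout", "TanhWithDropout"),
  hidden = list(c(50, 50, 50)),
  hidden_dropout_ratios = list(c(0.1, 0.1, 0.1), c(0.3, 0.3, 0.3), c(0.5, 0.5, 0.5))
)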

Early stopping criteria
  • stopping_metric: metric that we want to use as stopping criterion
  • stopping_tolerance and stopping_rounds: training stops when the stopping metric does not improve by the stopping tolerance proportion any more (e.g. by 0.05 or 5%) for the number of consecutive rounds defined by stopping rounds.
search_criteria <- list(strategy = "RandomDiscrete",
                        max_models = 100,
                        max_runtime_secs = 900,
                        stopping_tolerance = 0.001,
                        stopping_rounds = 15,
                        seed = 42)

Now, we can train the model with combinations of hyper-parameters from our specified stopping criteria and hyper-parameter grid.

dl_grid <- h2o.grid(algorithm = "deeplearning",
                    x = features,
                    y = response,
                    weights_column = weights,
                    grid_id = "dl_grid",
                    training_frame = train,
                    validation_frame = valid,
                    nfolds = 25,
                    fold_assignment = "Stratified",
                    hyper_params = hyper_params,
                    search_criteria = search_criteria,
                    seed = 42
)

We now want to extract the best model from the grid model list. What makes a model the best depends on the question you want to address with it: in some cases, the model with highest AUC is the most suitable, or the one with the lowest mean squared error, etc. See last week’s post again for a more detailed discussion of performance metrics.

For demonstration purposes, I am choosing the best models from a range of possible quality criteria. We first use the h2o.getGrid() function to sort all models by the quality metric we choose (depending on the metric, you want it ordered by descending or ascending values). We can then get the model that’s the first in the list to work with further. This model’s hyper-parameters can be found with best_model@allparameters. You can now work with your best model as with any regular model in h2o (for an example see last week’s post).

# performance metrics where smaller is better -> order with decreasing = FALSE
sort_options_1 <- c("mean_per_class_error", "mse", "err")

for (sort_by_1 in sort_options_1) {
  grid <- h2o.getGrid("dl_grid", sort_by = sort_by_1, decreasing = FALSE)
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  assign(paste0("best_model_", sort_by_1), best_model)
}

# performance metrics where bigger is better -> order with decreasing = TRUE
sort_options_2 <- c("auc", "precision", "accuracy", "recall", "specificity")

for (sort_by_2 in sort_options_2) {
  grid <- h2o.getGrid("dl_grid", sort_by = sort_by_2, decreasing = TRUE)
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  assign(paste0("best_model_", sort_by_2), best_model)
}

Let’s plot the mean per class error for each best model:

library(tibble)

sort_options <- c("mean_per_class_error", "mse", "err", "auc", "precision", "accuracy", "recall", "specificity")

for (sort_by in sort_options) {
  best_model <- get(paste0("best_model_", sort_by))
  errors <- h2o.mean_per_class_error(best_model, train = TRUE, valid = TRUE, xval = TRUE)
  errors_df <- data.frame(model_id = best_model@model_id, sort = sort_by, errors) %>%
    rownames_to_column(var = "rowname")

  if (sort_by == "mean_per_class_error") {
    errors_df_comb <- errors_df
  } else {
    errors_df_comb <- rbind(errors_df_comb, errors_df)
  }
}

order <- subset(errors_df_comb, rowname == "xval") %>%
  arrange(errors)

errors_df_comb %>%
  mutate(sort = factor(sort, levels = order$sort)) %>%
  ggplot(aes(x = sort, y = errors, fill = model_id)) +
    facet_grid(rowname ~ ., scales = "free") +
    geom_bar(stat = "identity", alpha = 0.8) +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
          plot.margin = unit(c(0.5, 0, 0, 1), "cm")) +
    labs(x = "")

Model performance

The ultimate performance test for our model will be its prediction accuracy on the test set it hasn’t seen before. Here, I will compare the AUC and mean squared error for each best model from before. You could of course look at any other quality metric that is most appropriate for your model.

for (sort_by in sort_options) {
  best_model <- get(paste0("best_model_", sort_by))
  mse_auc_test <- data.frame(model_id = best_model@model_id,
                             sort = sort_by,
                             mse = h2o.mse(h2o.performance(best_model, test)),
                             auc = h2o.auc(h2o.performance(best_model, test)))

  if (sort_by == "mean_per_class_error") {
    mse_auc_test_comb <- mse_auc_test
  } else {
    mse_auc_test_comb <- rbind(mse_auc_test_comb, mse_auc_test)
  }
}

library(tidyr)

mse_auc_test_comb %>%
  gather(x, y, mse:auc) %>%
  ggplot(aes(x = sort, y = y, fill = model_id)) +
    facet_grid(x ~ ., scales = "free") +
    geom_bar(stat = "identity", alpha = 0.8, position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
          plot.margin = unit(c(0.5, 0, 0, 1.5), "cm")) +
    labs(x = "", y = "value", fill = "")

I will then create a dataframe with predictions for each test sample with all best models. As in last week’s post, I want to compare the default predictions with stringent predictions.

for (sort_by in sort_options) {
  best_model <- get(paste0("best_model_", sort_by))
  finalRf_predictions <- data.frame(model_id = rep(best_model@model_id, nrow(test)),
                                    sort = rep(sort_by, nrow(test)),
                                    class = as.vector(test$class),
                                    actual = as.vector(test$diagnosis),
                                    as.data.frame(h2o.predict(object = best_model, newdata = test)))

  finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict, "yes", "no")

  finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$arrhythmia > 0.8, "arrhythmia",
                                                  ifelse(finalRf_predictions$healthy > 0.8, "healthy", "uncertain"))
  finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes",
                                                   ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no"))

  if (sort_by == "mean_per_class_error") {
    finalRf_predictions_comb <- finalRf_predictions
  } else {
    finalRf_predictions_comb <- rbind(finalRf_predictions_comb, finalRf_predictions)
  }
}

To get a better overview, I am going to plot the predictions (default and stringent):

finalRf_predictions_comb %>%
  ggplot(aes(x = actual, fill = accurate)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    facet_wrap(~ sort, ncol = 4) +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Default predictions")

finalRf_predictions_comb %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    facet_wrap(~ sort, ncol = 4) +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Stringent predictions")

With predictions made by different models, we can see where each model performs best. This obviously corresponds with the quality metric we chose to define the best model. Stringent prediction thresholds reduced the number of false predictions but of course also the number of predictions made at all.

We can now decide which model is most relevant. Let’s say we want this model to give recommendations for further validating a diagnosis of arrhythmia; we might want to detect as many arrhythmia cases as possible, while accepting that some healthy subjects will be erroneously flagged as well. In this case, patients with likely arrhythmia could be flagged for further testing. Here, this would mean that we want to minimize the number of wrong predictions of arrhythmia cases, so we would prefer the mse, auc and specificity model (which is the same model, chosen by all three quality metrics). The worst model for this purpose would be the recall model, which also did not make any predictions that passed the stringency threshold. The recall model would have been best if we wanted to be confident that healthy people get flagged correctly. However, in a real-life setting, you can imagine that it would be much worse if a sick patient got sent home because he was wrongly diagnosed as healthy than if a healthy person got submitted to further testing for possible arrhythmia.

Other machine learning topics I have covered include

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidyr_0.6.1 tibble_1.2 ggrepel_0.6.5
## [4] ggplot2_2.2.1.9000 sparklyr_0.5.3-9002 dplyr_0.5.0
## [7] h2o_3.10.3.6 rsparkling_0.1.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.9 RColorBrewer_1.1-2 plyr_1.8.4
## [4] bitops_1.0-6 base64enc_0.1-3 tools_3.3.2
## [7] digest_0.6.12 jsonlite_1.2 evaluate_0.10
## [10] gtable_0.2.0 shiny_1.0.0 DBI_0.5-1
## [13] rstudioapi_0.6 yaml_2.1.14 parallel_3.3.2
## [16] withr_1.0.2 httr_1.2.1 stringr_1.2.0
## [19] knitr_1.15.1 rappdirs_0.3.1 rprojroot_1.2
## [22] grid_3.3.2 R6_2.2.0 rmarkdown_1.3
## [25] reshape2_1.4.2 magrittr_1.5 backports_1.0.5
## [28] scales_0.4.1 htmltools_0.3.5 assertthat_0.1
## [31] mime_0.5 xtable_1.8-2 colorspace_1.3-2
## [34] httpuv_1.3.3 labeling_0.3 config_0.2
## [37] stringi_1.1.2 RCurl_1.95-4.8 lazyeval_0.2.0
## [40] munsell_0.4.3

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

Create Air Travel Route Maps in ggplot: A Visual Travel Diary

Mon, 03/06/2017 - 21:24

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

I have been lucky to fly to a few countries around the world. Like any other bored traveller, I thumb through the airline magazines and look at the air travel route maps. These maps are beautifully stylised depictions of the world with gently curved lines between the many destinations. I always wanted such a map for my own travel adventures.

Create Air Travel Route Maps using ggplot2

The first step was to create a list of all the places I have flown between at least once. Paging through my travel photos and diaries, I managed to create a pretty complete list. The structure of this document is simply a list of all routes (From, To) and every flight only gets counted once. The next step finds the spatial coordinates for each airport by searching Google Maps using the geocode function from the ggmap package. In some instances, I had to add the country name to avoid confusion between places.
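For illustration, the data frame read from flights.csv just needs two character columns; the rows below are placeholders (the Auckland–Los Angeles leg is mentioned further down, the other routes are invented for the example):

# Hypothetical illustration of the expected flights.csv structure
flights <- data.frame(From = c("Amsterdam", "Sydney", "Auckland"),
                      To = c("Sydney", "Auckland", "Los Angeles"),
                      stringsAsFactors = FALSE)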

# Read flight list
flights <- read.csv("flights.csv", stringsAsFactors = FALSE)

# Lookup coordinates
library(ggmap)
airports <- unique(c(flights$From, flights$To))
coords <- geocode(airports)
airports <- data.frame(airport=airports, coords)

We now have a data frame of airports with their coordinates. The data frames are merged so that we can draw the routes using the curve geom. The borders function of ggplot2 creates the map data. The ggrepel package helps to prevent overplotting of text.

# Add coordinates to flight list
flights <- merge(flights, airports, by.x="To", by.y="airport")
flights <- merge(flights, airports, by.x="From", by.y="airport")

# Plot flight routes
library(ggplot2)
library(ggrepel)
worldmap <- borders("world", colour="#efede1", fill="#efede1") # create a layer of borders
ggplot() + worldmap +
  geom_curve(data=flights, aes(x = lon.x, y = lat.x, xend = lon.y, yend = lat.y),
             col = "#b29e7d", size = 1, curvature = .2) +
  geom_point(data=airports, aes(x = lon, y = lat), col = "#970027") +
  geom_text_repel(data=airports, aes(x = lon, y = lat, label = airport),
                  col = "black", size = 2, segment.color = NA) +
  theme(panel.background = element_rect(fill="white"),
        axis.line = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank()
        )

I also tried to use the ggmap package to display the maps and get a satellite image background. This did not work because the curve geom struggles with the map projection methods used in ggmap. Another problem is that the flight from Auckland to Los Angeles is drawn the wrong way. I hope no flat-earthers will see this map because they might use it as proof that the world is flat.

Alternative Visualisation

Another way of visualising this type of data is using a network diagram provided by the igraph package. This visualisation shows the logic between the locations and not their spatial locations.

# Network visualisation
library(igraph)
edgelist <- as.matrix(flights[c("From", "To")])
g <- graph_from_edgelist(edgelist, directed = TRUE)
g <- simplify(g)
par(mar=rep(0,4))
plot.igraph(g,
            edge.arrow.size=0,
            edge.color="black",
            edge.curved=TRUE,
            edge.width=2,
            vertex.size=3,
            vertex.color=NA,
            vertex.frame.color=NA,
            vertex.label=V(g)$name,
            vertex.label.cex=3,
            layout=layout.fruchterman.reingold
            )

The post Create Air Travel Route Maps in ggplot: A Visual Travel Diary appeared first on The Devil is in the Data.

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.

R 3.3.3 now available

Mon, 03/06/2017 - 19:10

The R core group announced today the release of R 3.3.3 (code-name: "Another Canoe"). As the wrap-up release of the R 3.3 series, this update mainly contains minor bug-fixes. (Bigger changes are planned for R 3.4.0, expected in mid-April.) Binaries for the Windows version are already up on the CRAN master site, and binaries for all platforms will appear on your local CRAN mirror within the next couple of days. 

R 3.3.3 fixes an issue related to attempting to use download.file on sites that automatically redirect from http to https: now, R will re-attempt to download the secure link rather than failing. Other fixes include support for long vectors in the vapply function, the ability to use pmax (and pmin) on ordered factors, improved accuracy for qbeta for some extreme cases, corrected spelling for "Cincinnati" in the precip data set, and a few other minor issues.

R 3.3.3 should be completely compatible with your existing R 3.3.x scripts and data sets. Unless some major unforeseen bug rears its head between now and the release of R 3.4.0, this is the final release of the R 3.3 series, so it's definitely a good idea to upgrade when you can.

As always, thanks go to the R Core Group for the continued and welcome improvements to R!

Step-Debugging magrittr/dplyr Pipelines in R with wrapr and replyr

Mon, 03/06/2017 - 18:59

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

In this screencast we demonstrate how to easily and effectively step-debug magrittr/dplyr pipelines in R using wrapr and replyr.



Some of the big issues in trying to debug magrittr/dplyr pipelines include:

  • Pipelines being large expressions that are hard to line-step into.
  • Visibility of intermediate results.
  • Localizing operations (in time and code position) in the presence of lazy evaluation semantics.

In this screencast we demonstrate how to mitigate the impact of all of these issues and quickly and effectively debug pipelined code.

The example code is here.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Mapping 5,000 Years of City Growth

Mon, 03/06/2017 - 17:41

(This article was first published on R – Spatial.ly, and kindly contributed to R-bloggers)

I recently stumbled upon a great dataset. It’s the first to provide comprehensive data for world city sizes as far back as 3700BC. The authors (Meredith Reba, Femke Reitsma & Karen Seto) write:

How were cities distributed globally in the past? How many people lived in these cities? How did cities influence their local and regional environments? In order to understand the current era of urbanization, we must understand long-term historical urbanization trends and patterns. However, to date there is no comprehensive record of spatially explicit, historic, city-level population data at the global scale. Here, we developed the first spatially explicit dataset of urban settlements from 3700 BC to AD 2000, by digitizing, transcribing, and geocoding historical, archaeological, and census-based urban population data previously published in tabular form by Chandler and Modelski.

These kinds of data are crying out to be mapped so that’s what I did…

Creating the Maps

I was keen to see how easy it was to produce an animated map with R and the ggplot2 package from these data. It turned out to be a slightly bigger job than I thought – partly because the cleaned data doesn’t have all the population estimates updating for every city each year, so there’s a bit of a messy loop at the start of the code to get that bit working, and partly because there are LOADS of different parameters to tweak in ggplot to get the maps looking the way I like. The script below will generate a series of PNGs that can be strung together into an animation using a graphics package. I’m a big fan of what ggplot2 can do now and I hope, at the very least, the chunks of ggplot2 code below will provide a useful template for similar mapping projects.

R Code

Load the libraries we need

library("ggplot2")
library("ggthemes")
library("rworldmap")
library("classInt")
library("gridExtra")
library("grid")
library("cowplot")

Load the data – this is the processed file from http://www.nature.com/articles/sdata201634. I offer no guarantees about whether I ran the script correctly etc.! I’d encourage you to take the data direct from the authors.

#City data
cities<- read.csv("alldata.csv")
#for some reason the coords were loaded as factors so I've created two new numeric fields here
cities$X<-as.numeric(as.vector(cities$Longitude))
cities$Y<-as.numeric(as.vector(cities$Latitude))
#World map base from rworldmap
world <- fortify(getMap(resolution="low"))

Take a look at the data

head(cities)

Create the base map

base<- ggplot() +
  geom_map(data=world, map=world, aes(x=long,y=lat, map_id=id, group=group), fill="#CCCCCC") +
  theme_map()

Now plot the cities on top – with circles scaled by city size. Here I am using the mollweide projection. This script loops through the data and checks for updates from one year to the next. It then plots the cities as proportional circles for each year and saves out an image for each 100 years of data. If you are just interested in the maps themselves and have your own point data then most of the below can be ignored.

Set the breaks for the size of the circles

size_breaks<-c(10000,50000,100000,500000,1000000,5000000,10000000)

#Here I'm looping through the data each year to select the years required and comparing/ updating the values for each city.
count<-1
for (i in unique(cities$year))
{
  if (count==1)
  {
    Data<-cities[which(cities$year==i),]
  }else{
    New_Data<-cities[which(cities$year==i),]
    #replace old rows
    Data2<-merge(Data, New_Data, by.x="X", by.y="X", all=T)
    New_Vals<-Data2[!is.na(Data2$pop.y),c("City.y","X","Y.y","pop.y")]
    names(New_Vals)<-c("City","X","Y","pop")
    Old_Vals<-Data2[is.na(Data2$pop.y),c("City.x","X","Y.x","pop.x")]
    names(Old_Vals)<-c("City","X","Y","pop")
    Data<-rbind(New_Vals,Old_Vals)
    Data<- aggregate(Data$pop, by=list(Data$City,Data$X, Data$Y), FUN="mean")
    names(Data)<-c("City","X","Y","pop")
  }

  #This statement sorts out the BC/ AC in the title that we'll use for the plot.
  if(i<0)
  {
    title <- paste0("Cities in the Year ",sqrt(i^2)," BC")
  }else{
    title <- paste0("Cities in the Year ",i," AD")
  }

  #There's lots going on here with the myriad of different ggplot2 parameters. I've broken each chunk up line by line to help make it a little clearer.
  Map<-base+
    geom_point(data=Data, aes(x=X, y=Y, size=pop), colour="#9933CC",alpha=0.3, pch=20)+
    scale_size(breaks=size_breaks,range=c(2,14), limits=c(min(cities$pop), max(cities$pop)),labels=c("10K","50k","100K","500k","1M","5M","10M+"))+
    coord_map("mollweide", ylim=c(-55,80),xlim=c(-180,180))+
    theme(legend.position="bottom",
          legend.direction="horizontal",
          legend.justification="center",
          legend.title=element_blank(),
          plot.title = element_text(hjust = 0.5,face='bold',family = "Helvetica"))+
    ggtitle(title)+
    guides(size=guide_legend(nrow=1))

  #I only want to plot map every 100 years
  if(i%%100==0)
  {
    png(paste0("outputs/",i,"_moll.png"), width=30,height=20, units="cm", res=150)
    print(Map)
    dev.off()
  }
  count<-count+1
}

Just to make things a little more interesting I wanted to try a plot of two hemispheres (covering most, but not all) of the globe. I’ve also added the key in between them. This depends on the plot_grid() functionality as well as a few extra steps you’ll spot that we didn’t need above.

count<-1
for (i in unique(cities$year))
{
  if (count==1)
  {
    Data<-cities[which(cities$year==i),]
  }else{
    New_Data<-cities[which(cities$year==i),]
    #replace old rows
    Data2<-merge(Data, New_Data, by.x="X", by.y="X", all=T)
    New_Vals<-Data2[!is.na(Data2$pop.y),c("City.y","X","Y.y","pop.y")]
    names(New_Vals)<-c("City","X","Y","pop")
    Old_Vals<-Data2[is.na(Data2$pop.y),c("City.x","X","Y.x","pop.x")]
    names(Old_Vals)<-c("City","X","Y","pop")
    Data<-rbind(New_Vals,Old_Vals)
    Data<- aggregate(Data$pop, by=list(Data$City,Data$X, Data$Y), FUN="mean")
    names(Data)<-c("City","X","Y","pop")
  }

  #Create a map for each hemisphere - to help with the positioning I needed to use the `plot.margin` options in the `theme()` settings.
  map1<-base+
    geom_point(data=Data, aes(x=X, y=Y, size=pop), colour="#9933CC",alpha=0.3, pch=20)+
    scale_size(breaks=size_breaks,range=c(2,16), limits=c(min(cities$pop), max(cities$pop)))+
    coord_map("ortho", orientation = c(10, 60, 0),ylim=c(-55,70))+
    theme(legend.position="none",plot.margin = unit(c(0.5,0.5,-0.1,0.5), "cm"))

  map2<- base+
    geom_point(data=Data, aes(x=X, y=Y, size=pop), colour="#9933CC",alpha=0.3, pch=20)+
    scale_size(breaks=size_breaks,range=c(2,16),limits=c(min(cities$pop), max(cities$pop)))+
    coord_map("ortho", orientation = c(10, -90, 0),ylim=c(-55,70))+
    theme(legend.position="none", plot.margin = unit(c(0.5,0.5,0,0.5), "cm"))

  #create dummy plot that we will use the legend from - basically I just want to create something that has a legend (they've been switched off for the maps above)
  dummy<-ggplot()+
    geom_point(aes(1:7,1,size=size_breaks),colour="#9933CC", alpha=0.3, pch=20)+
    scale_size(breaks=size_breaks,range=c(2,16),labels=c("10K","50k","100K","500k","1M","5M","10M+"),limits=c(min(cities$pop), max(cities$pop)))+
    theme(legend.position="bottom",legend.direction="vertical",legend.title=element_blank())+
    guides(size=guide_legend(nrow=1,byrow=TRUE))

  Legend <- get_legend(dummy)

  #This statement sorts out the BC/ AC in the title.
  if(i<0)
  {
    title <- ggdraw() + draw_label(paste0("Cities in the Year ",sqrt(i^2)," BC"), fontface='bold',fontfamily = "Helvetica")
  }else{
    title <- ggdraw() + draw_label(paste0("Cities in the Year ",i," AD"), fontface='bold',fontfamily = "Helvetica")
  }

  #I only want to plot map every 100 years
  if(i%%100==0)
  {
    png(paste0("outputs/",i,".png"), width=20,height=30, units="cm", res=150)
    #This is where the elements of the plot are combined
    print(plot_grid(title,map1, Legend,map2, ncol=1,rel_heights = c(.1,1, .1,1)))
    dev.off()
  }
  count<-count+1
}

 

To leave a comment for the author, please follow the link and comment on their blog: R – Spatial.ly.

Descriptive summary: Proportions of values in a vector #rstats

Mon, 03/06/2017 - 16:33

(This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

When describing a sample, researchers in my field often show proportions of specific characteristics as description. For instance, proportion of female persons, proportion of persons with higher or lower income etc. Since it happens often that I like to know these characteristics when exploring data, I decided to write a function, prop(), which is part of my sjstats-package – a package dedicated to summary-functions, mostly for fit- or association-measures of regression models or descriptive statistics.

prop() is designed in a similar fashion to most functions of my sjmisc-package: first, the data; then a user-defined number of logical comparisons that define the proportions. A single comparison argument as input returns a vector, multiple comparisons return a tibble (where the first column contains the comparison, and the second the related proportion).

An example from the mtcars dataset:

library(sjstats)
data(mtcars)

# proportions of observations in mpg that are greater than 25
prop(mtcars, mpg > 25)
#> [1] 0.1875

prop(mtcars, mpg > 25, disp > 200, gear == 4)
#> # A tibble: 3 × 2
#>   condition   prop
#>       <chr>  <dbl>
#> 1    mpg>25 0.1875
#> 2  disp>200 0.5000
#> 3   gear==4 0.3750

The function also works on grouped data frames, and with labelled data. In the following example, we group a dataset on family carers by their gender and education, and then get the proportions of observations where care-receivers are at least moderately dependent and male. To get an impression of what the raw variables look like, we first compute simple frequency tables with frq().

library(sjmisc) # for frq()-function
data(efc)

frq(efc, e42dep)
#> # elder's dependency
#>
#> val                label  frq raw.prc valid.prc cum.prc
#> 1          independent     66    7.27      7.33    7.33
#> 2   slightly dependent    225   24.78     24.97   32.30
#> 3 moderately dependent    306   33.70     33.96   66.26
#> 4   severely dependent    304   33.48     33.74  100.00
#> 5                   NA      7    0.77        NA      NA

frq(efc, e16sex)
#> # elder's gender
#>
#> val  label  frq raw.prc valid.prc cum.prc
#> 1     male  296   32.60     32.85   32.85
#> 2   female  605   66.63     67.15  100.00
#> 3       NA    7    0.77        NA      NA

efc %>%
  select(e42dep, c161sex, c172code, e16sex) %>%
  group_by(c161sex, c172code) %>%
  prop(e42dep > 2, e16sex == 1)
#> # A tibble: 6 × 4
#>   `carer's gender` `carer's level of education` `e42dep>2` `e16sex==1`
#>              <chr>                        <chr>      <dbl>       <dbl>
#> 1             Male           low level of education   0.6829      0.3659
#> 2             Male  intermediate level of education   0.6590      0.3155
#> 3             Male          high level of education   0.7872      0.2766
#> 4           Female           low level of education   0.7101      0.4638
#> 5           Female  intermediate level of education   0.5929      0.2832
#> 6           Female          high level of education   0.6881      0.2752

So, within the group of male family carers with low level of education, 68.29% of care-receivers are moderately or severely dependent, and 36.59% of care-receivers are male. Within female family carers with high level of education, 68.81% of care-receivers are at least moderately dependent and 27.52% are male.

Tagged: R, rstats

To leave a comment for the author, please follow the link and comment on their blog: R – Strenge Jacke!.

Slides from my Talk at the CDC

Mon, 03/06/2017 - 16:00

Two weeks ago I had the honor of giving a talk about my open source work at the CDC. Last week I gave the talk again as an open-enrollment webinar.

In case you missed that talk, you can view the slides here. You can also download the slides and code from the original post.


The post Slides from my Talk at the CDC appeared first on AriLamstein.com.

Python and R for code development

Mon, 03/06/2017 - 04:03

(This article was first published on logopt: a journey in R, Python, finance and open source, and kindly contributed to R-bloggers)

The previous post glossed about why I now prefer Python to write code, including for a module like logopt. This post explains in more details some specific differences where I prefer one of these two languages:

  • 0-based indexing in Python versus 1-based indexing in R.  This may seem a small difference but for me, 0-based indexing is more natural and results in fewer off-by-one errors.  No less than Dijkstra agrees with me on 0-based indexing.
  • = versus <- for assignment.  I like R's approach here, and I would like to see more languages doing the same.  I still sometimes end up using = where I wanted ==.  If only R would allow <- in call arguments.
  • CRAN versus pypi
    • CRAN is much better for the user, the CRAN Task Views is a gold mine, and in general CRAN is a better repository, with higher quality packages.
    • But publishing on CRAN is simply daunting, and the reason logopt remained on R-Forge only.  The manual explaining how to write extensions is 178 pages long.
  • Python has better data structures, especially the Python dictionary is something I miss whenever I write in R.  Python has no native dataframe, but this is easily taken care of by importing pandas.
  • Object orientation is conceptually clean and almost easy to use in Python, less so in R.
  • Plotting is better in R.  There are some efforts to make Python better in that area, especially for ease of use.  Matplotlib is powerful but difficult to master.
  • lm is a gem in R; the simplicity with which you can express the models you want to fit is incredible.
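As a quick example of that last point, a model with a transformed response, two predictors and their interaction is a one-liner thanks to the formula interface (using the built-in mtcars data):

# lm formula interface: log response, main effects of wt and hp plus their interaction
fit <- lm(log(mpg) ~ wt * hp, data = mtcars)
summary(fit)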

All in all, I prefer coding in Python.  This is a personal opinion of course, and R remains important because of some packages, but for more general purpose tasks, Python is simpler to use, and that translates into being more productive. 

To leave a comment for the author, please follow the link and comment on their blog: logopt: a journey in R, Python, finance and open source.

HTML vignettes crashing your RStudio? This may be the reason

Mon, 03/06/2017 - 01:05

Short version: if RStudio on Windows 7 crashes when viewing vignettes in HTML format, it may be because those packages specify knitr::rmarkdown as the vignette engine instead of knitr::knitr, and you’re using rmarkdown v1.

Longer version with details – read on.

update: looks like this issue relates to the installed version of rmarkdown (1.3 in my case) – see here for details.

HTML documentation for broom in RStudio

At work I run RStudio (currently version 1.0.136) on Windows 7 (because I have no choice). This works:

  1. open the Packages tab
  2. click on broom
  3. click on User guides, package vignettes and other documentation
  4. click on HTML to see documentation for broom::broom

 

HTML documentation for dplyr in RStudio

If I do the same for the dplyr package and choose the HTML documentation for dplyr::two-table, the result is not so pleasing (see image, right). Click on Debug for an uninformative message:

an unhandled win 32 exception occurred in rstudio.exe [10276].

Looking at the RMD source for the vignettes, we see that dplyr specifies knitr::rmarkdown for the vignette engine, whereas broom specifies knitr::knitr.

# dplyr
---
title: "Two-table verbs"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Two-table verbs}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# broom
<!--
  %\VignetteEngine{knitr::knitr}
  %\VignetteIndexEntry{Introduction to broom}
-->

Some further investigation suggests that the crash occurs for all packages where %\VignetteEngine{knitr::rmarkdown}, but not for those where %\VignetteEngine{knitr::knitr}. As it happens, I’ve just written a package which used knitr::rmarkdown (and crashed RStudio) and so was able to confirm that switching to knitr::knitr fixed the problem.

Filed under: R, statistics Tagged: debug, rstudio, software, windows

Scraping CRAN with rvest

Mon, 03/06/2017 - 01:00

I am participating in a session at useR 2017 this coming July focusing on discovering and learning about R packages. How do R users find packages that meet their needs? Can we make this process easier? As somebody who is relatively new to the R world compared to many, this is a topic that resonates with me and I am happy to be part of the discussion. I am working on this session with John Nash and Spencer Graves, and we hope that some useful discussion and results come out of the session.

In preparation for this session, I wanted to look at the distribution of R packages by date, number of versions, etc. There have been some great plots that came out around the time when CRAN passed the 10,000 package mark, but most of the code behind those plots involves packages and idioms I am less familiar with, so here is an rvest and tidyverse centered version of those analyses!

Scraping CRAN

The first thing we need to do is get all the packages that are currently available on CRAN. Let’s use rvest to scrape the page that lists all the packages currently on CRAN. It also has some other directories besides packages so we can use filter to remove the things that don’t look like R packages.

library(rvest)
library(stringr)
library(lubridate)
library(tidyverse)

cran_link <- function(...) {
    file.path("https://cran.rstudio.com/src/contrib", ...)
}

pkgs_raw <- read_html(cran_link()) %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table()

pkgs_raw <- pkgs_raw[,-1]

pkgs <- pkgs_raw %>%
    filter(Size != "-",
           str_detect(Name, "tar.gz$")) %>%
    mutate(Date = dmy_hm(`Last modified`),
           Name = str_extract(Name, "^[^_]+(?=_)")) %>%
    select(-Size, -Description) %>%
    as_tibble()

pkgs

## # A tibble: 10,211 × 3
##            Name   `Last modified`                Date
##           <chr>             <chr>              <dttm>
## 1            A3 16-Aug-2015 21:05 2015-08-16 21:05:00
## 2       ABC.RAP 20-Oct-2016 08:52 2016-10-20 08:52:00
## 3   ABCanalysis 23-Aug-2016 12:57 2016-08-23 12:57:00
## 4      ABCoptim 17-Nov-2016 09:04 2016-11-17 09:04:00
## 5         ABCp2 04-Feb-2016 10:27 2016-02-04 10:27:00
## 6  ABHgenotypeR 04-Feb-2016 10:27 2016-02-04 10:27:00
## 7           ACA 10-Mar-2016 16:55 2016-03-10 16:55:00
## 8        ACCLMA 29-Oct-2012 12:13 2012-10-29 12:13:00
## 9           ACD 31-Oct-2013 19:59 2013-10-31 19:59:00
## 10         ACDm 16-Jul-2016 10:19 2016-07-16 10:19:00
## # ... with 10,201 more rows

So that’s currently available packages!

Now let’s turn to the archive. Let’s do a similar operation.

archives_raw <- read_html(cran_link("Archive")) %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table()

archives_raw <- archives_raw[,-1]

archives_processed <- archives_raw %>%
    filter(str_detect(Name, "/$")) %>%
    mutate(Date = dmy_hm(`Last modified`),
           Name = str_sub(Name, end = -2)) %>%
    select(-Size, -Description) %>%
    as_tibble()

archives_processed

## # A tibble: 8,897 × 3
##            Name   `Last modified`                Date
##           <chr>             <chr>              <dttm>
## 1            A3 16-Aug-2015 21:05 2015-08-16 21:05:00
## 2   ABCExtremes 19-Jun-2015 11:26 2015-06-19 11:26:00
## 3   ABCanalysis 23-Aug-2016 12:57 2016-08-23 12:57:00
## 4      ABCoptim 17-Nov-2016 09:04 2016-11-17 09:04:00
## 5         ABCp2 01-Jul-2015 06:12 2015-07-01 06:12:00
## 6  ABHgenotypeR 04-Feb-2016 10:27 2016-02-04 10:27:00
## 7           ACD 31-Oct-2013 19:59 2013-10-31 19:59:00
## 8          ACDm 16-Jul-2016 10:19 2016-07-16 10:19:00
## 9          ACEt 18-Nov-2016 21:19 2016-11-18 21:19:00
## 10         ACNE 27-Oct-2015 07:09 2015-10-27 07:09:00
## # ... with 8,887 more rows

That is good, but now we need to get more detailed information for packages that have been archived at least once to get the date they originally were released and how many versions they have had.

Visiting every page in the archive

Let’s set up a function for scraping an individual page for a package and apply that to every page in the archive. This step takes A WHILE because it queries a web page for every package in the CRAN archive. I’ve set this up with map from purrr; it is one of my favorite ways to organize tasks these days.

read_page <- function(name) {
    message(name)
    read_html(cran_link("Archive", name)) %>%
        html_nodes("td") %>%
        html_text()
}

archives_scraped <- archives_processed %>%
    mutate(page = map(Name, read_page))

What do these pages look like?

archives_scraped$page[8457]

## [[1]]
## [1] "" "Parent Directory" " " " - "
## [5] " " "" "tidytext_0.1.0.tar.gz" "28-Apr-2016 09:50 "
## [9] "1.4M" " " "" "tidytext_0.1.1.tar.gz"
## [13] "25-Jun-2016 17:07 " "1.5M" " "

This is exactly what we need: the dates that the packages were released and how many times they have been released. Let’s use mutate and map again to extract these values.

archives <- archives_scraped %>%
    mutate(Date = dmy_hm(map_chr(page, ~ .[8])),
           ArchivedVersions = map_dbl(page, ~ length(.) / 5 - 1)) %>%
    select(-page)

archives

## # A tibble: 8,897 × 4
##            Name   `Last modified`                Date ArchivedVersions
##           <chr>             <chr>              <dttm>            <dbl>
## 1            A3 16-Aug-2015 21:05 2013-02-07 09:00:00                2
## 2   ABCExtremes 19-Jun-2015 11:26 2013-05-15 08:45:00                1
## 3   ABCanalysis 23-Aug-2016 12:57 2015-04-20 15:40:00                5
## 4      ABCoptim 17-Nov-2016 09:04 2013-11-05 17:00:00                2
## 5         ABCp2 01-Jul-2015 06:12 2013-04-10 15:04:00                2
## 6  ABHgenotypeR 04-Feb-2016 10:27 2016-01-21 07:26:00                1
## 7           ACD 31-Oct-2013 19:59 2012-11-27 06:43:00                2
## 8          ACDm 16-Jul-2016 10:19 2015-07-16 12:24:00                2
## 9          ACEt 18-Nov-2016 21:19 2016-02-14 17:48:00                7
## 10         ACNE 27-Oct-2015 07:09 2011-09-16 17:43:00                4
## # ... with 8,887 more rows

Putting it together

Now it’s time to join the data from the currently available packages and the archives.

  • Packages that are in archives but not pkgs are no longer on CRAN.
  • Packages that are in pkgs but not archives only have one CRAN release.
  • Packages that are in both dataframes have had more than one CRAN release.

Sounds like a good time to use anti_join and inner_join.

all_pkgs <- bind_rows(archives %>%
                          anti_join(pkgs, by = "Name") %>%
                          mutate(Archived = TRUE),
                      pkgs %>%
                          anti_join(archives, by = "Name") %>%
                          mutate(ArchivedVersions = 0,
                                 Archived = FALSE),
                      archives %>%
                          semi_join(pkgs, by = "Name") %>%
                          mutate(Archived = FALSE)) %>%
    mutate(Versions = ifelse(Archived, ArchivedVersions, ArchivedVersions + 1)) %>%
    arrange(Name)

all_pkgs

## # A tibble: 11,489 × 6
##            Name   `Last modified`                Date ArchivedVersions Archived Versions
##           <chr>             <chr>              <dttm>            <dbl>    <lgl>    <dbl>
## 1            A3 16-Aug-2015 21:05 2013-02-07 09:00:00                2    FALSE        3
## 2          aaMI 30-Jul-2010 12:17 2005-06-24 15:55:00                2     TRUE        2
## 3        abbyyR 20-Jun-2016 15:32 2015-06-12 04:43:00                7    FALSE        8
## 4           abc 05-May-2015 09:34 2010-10-05 08:45:00               10    FALSE       11
## 5      abc.data 05-May-2015 09:34 2015-05-05 09:34:00                0    FALSE        1
## 6       ABC.RAP 20-Oct-2016 08:52 2016-10-20 08:52:00                0    FALSE        1
## 7   ABCanalysis 23-Aug-2016 12:57 2015-04-20 15:40:00                5    FALSE        6
## 8      abcdeFBA 15-Sep-2012 13:13 2011-11-05 10:48:00                3    FALSE        4
## 9   ABCExtremes 19-Jun-2015 11:26 2013-05-15 08:45:00                1     TRUE        1
## 10     ABCoptim 17-Nov-2016 09:04 2013-11-05 17:00:00                2    FALSE        3
## # ... with 11,479 more rows

Plotting results

Let’s look at some results now.

all_pkgs %>%
    filter(!Archived) %>%
    group_by(Date = floor_date(Date, unit = "month")) %>%
    summarise(NewPackages = n()) %>%
    ungroup %>%
    mutate(TotalPackages = cumsum(NewPackages)) %>%
    ggplot(aes(Date, TotalPackages)) +
    geom_line(size = 1.5, alpha = 0.8, color = "midnightblue") +
    labs(x = NULL, y = "Number of available packages",
         title = "How many packages are available on CRAN?",
         subtitle = "Only packages that are still available")

There we go! That is similar to the results we all saw going around when CRAN passed 10,000 packages, which is good.

What about the number of archived vs. available packages?

all_pkgs %>%
    ggplot(aes(Archived)) +
    geom_histogram(stat = "count", alpha = 0.8, fill = "midnightblue") +
    scale_x_discrete(labels=c("Still available", "Archived, no longer available")) +
    labs(y = "Number of packages", x = NULL,
         title = "How many packages are no longer available on CRAN?",
         subtitle = "About 10% of total packages are no longer available")

And lastly, let’s look at the distribution of number of releases for each package.

all_pkgs %>%
    ggplot(aes(Versions)) +
    geom_histogram(binwidth = 10, alpha = 0.8, fill = "midnightblue") +
    labs(y = "Number of packages", x = "Number of versions on CRAN",
         title = "How many versions do CRAN packages have?",
         subtitle = "About 25% of packages are on their first version")

The End

It is pretty ironic that I worked on this code and wrote this post because I wanted to do an analysis using different packages than the ones used in the original scripts shared. That is exactly part of the challenge facing all of us as R users now that there is such a diversity of tools out there! I hope that our session at useR this summer provides some clarity and perspective for attendees on these types of issues. The R Markdown file used to make this blog post is available here. I am very happy to hear feedback or questions!

Shiny Based Tablet or Desktop App

Mon, 03/06/2017 - 00:54

(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)

Mark Sellors
Head of Data Engineering

Turn a shiny application into a tablet or desktop app

Since we first demoed it at our really successful trip to Strata London last year, a few people have asked us how we made the awesome looking Data Science Radar app that we were running on the tablets we had with us. In this post we’ll take a look at how we did it, and hopefully show you how easy it is to do yourself.

Mango is primarily known for our work with the R language, so it should come as no surprise that this is the secret sauce used in the creation of the app. More specifically, we used a Shiny app written by one of our Senior Data Scientists, Aimee Gott. The app uses the radarchart package which you can find on github.

I think the fact that it was written with Shiny has actually surprised a few people, largely because of how it looks and the way that we run it.

The tablets in question are cheap Windows 10 devices, nothing special, but we had to come up with a way of running the application that would be simple enough for the non-R-using members of the team. This meant that anything from the everyday world of R had to be abstracted away or made as simple to use as possible. In turn this means, not starting RStudio, or having to type anything in to start the app.

R and the required packages are installed on the tablets, ready to start the configuration that would allow the whole Mango team to use them in the high pressure, high visibility setting of a stand at an extremely busy conference.

We wrote a simple batch file that would start the app. This only got us part of the way though, because the default browser on Windows 10, Microsoft’s Edge, doesn’t have a full screen mode, which makes the app look less slick. We therefore changed the default browser to Microsoft’s IE, and put it in full screen mode (with F11) when it first opened. The good news here is that IE remembers that it was in full screen mode when you close and re-open it, so that’s another problem solved. The app now opens automatically and covers the full screen.

The code for the batch file is a simple one-liner and looks like this:

"C:\Program Files\R\R-3.3.0\bin\Rscript.exe" -e "shiny::runApp('/Users/training2/Documents/dsRadar/app.R', launch.browser = TRUE)"

Next, it was necessary to set the rotation lock on the tablets, to avoid the display flipping round to portrait mode while in use on the stand. This is more cosmetic than anything else, and we did find that the Win10 rotation lock is slightly buggy in that it doesn’t always seem to remember which way round the lock is, so that it occasionally needs to be reset between uses. Remember, our app was written specifically for this device, so the layout is optimised correctly for the resolution and landscape orientation, you may want to approach that differently if you try this yourself.

We also found that the on-screen keyboard wasn’t enabled by default with our devices (which have a detachable keyboard), so we had to turn that on in the settings as well.

Having the application start via a Windows batch file isn’t the prettiest way of starting an app, as it opens the Windows command prompt before launching the application itself. This is hidden behind the application when it’s fully started, but it just doesn’t look good enough. The problem can be avoided with a small amount of VBScript, which runs the contents of the batch file without displaying the command prompt. Unfortunately the VBScript icon you end up with is pretty horrid; the easiest way around that is to create a shortcut to the VBScript file and change the icon of the shortcut instead.

Here’s the VBScript:

Set objShell = WScript.CreateObject("WScript.Shell")
objShell.Run("C:\Users\training2\Desktop\dsRadar.bat"), 0, True

Check out the video below to see it in action. We hope you agree that it looks really good, and we hope you find this simple method of turning a Shiny application into a tablet or desktop app as useful as we do!


Scheduling R scripts and processes on Windows and Unix/Linux

Sun, 03/05/2017 - 23:00

(This article was first published on bnosac :: open analytical helpers, and kindly contributed to R-bloggers)

Two new R packages were put on CRAN last week by BNOSAC (www.bnosac.be).

  • One package for scheduling R scripts and processes on Windows (taskscheduleR) and
  • Another package for scheduling R scripts and processes on Unix / Linux (cronR)

These 2 packages allow you to schedule R processes from R directly. This is done by passing commands directly to cron, the basic Linux/Unix job scheduling utility, or by using the Windows Task Scheduler. The packages were developed for beginning R users who are unaware of the fact that R scripts can also be run non-interactively and automated.

We blogged already about the taskscheduleR R package at this blog post and also here. This time we devote some more details to the cronR R package.
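For readers on Windows, the rough taskscheduleR equivalent looks like the sketch below. The script path is a made-up example, and the schedule values shown are just two of the options the package accepts (see its documentation for the full list):

library(taskscheduleR)
## A hypothetical R script to schedule - replace with your own
myscript <- "C:/scripts/myscript.R"
## Run it once, one minute from now
taskscheduler_create(taskname = "myscript_once", rscript = myscript,
                     schedule = "ONCE", starttime = format(Sys.time() + 60, "%H:%M"))
## Run it every day at 09:10
taskscheduler_create(taskname = "myscript_daily", rscript = myscript,
                     schedule = "DAILY", starttime = "09:10")
## List the scheduled tasks and delete one again
taskscheduler_ls()
taskscheduler_delete(taskname = "myscript_daily")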

The cronR package allows you to

  • Get the list of scheduled jobs
  • Remove scheduled jobs
  • Add a job
    • a job is basically a script with R code which is run through Rscript
    • You can schedule tasks ‘ONCE’, ‘EVERY MINUTE’, ‘EVERY HOUR’, ‘EVERY DAY’, ‘EVERY WEEK’, ‘EVERY MONTH’ or any complex schedule
    • The task log contains the stdout & stderr of the Rscript run at that timepoint. The log can be found in the same folder as the R script

The package is especially suited for people working on an RStudio server in the cloud or within the premises of their corporate environment. It allows you to easily schedule processes. To make that extremely easy for beginning R users, an RStudio addin was developed, which is shown in the example below. The RStudio addin basically allows you to select an R script and schedule it at specific timepoints. It does this by copying the script to your launch/log folder and setting up a cronjob for that script.

The example below shows how to set up a cron job using the RStudio addin so that the scripts are launched every minute or every day at a specific hour. The R code is launched through Rscript, and the log will contain the errors and warnings in case your script fails, so that you can review where the code failed.

Note that you can also pass arguments to the R script, so that you can launch the same script for productXYZ and productABC.

Of course scheduling scripts can also be done from R directly. Some examples are shown below. More information at https://github.com/bnosac/cronR

library(cronR)
f <- system.file(package = "cronR", "extdata", "helloworld.R")
cmd <- cron_rscript(f, rscript_args = c("productx", "20160101"))
## Every minute
cron_add(cmd, frequency = 'minutely', id = 'job1', description = 'Customers')
## Every hour at 20 past the hour on Monday and Tuesday
cron_add(cmd, frequency = 'hourly', id = 'job2', at = '00:20', description = 'Weather', days_of_week = c(1, 2))
## Every day at 14h20 on Sunday, Wednesday and Friday
cron_add(cmd, frequency = 'daily', id = 'job3', at = '14:20', days_of_week = c(0, 3, 5))
## Every starting day of the month at 10h30
cron_add(cmd, frequency = 'monthly', id = 'job4', at = '10:30', days_of_month = 'first', days_of_week = '*')
## Get all the jobs
cron_ls()
## Remove all scheduled jobs
cron_clear(ask=FALSE)

We hope this will save you some precious time, and if you need more help automating R processes, feel free to get in touch. We have a special training course devoted to managing R processes which can be given in your organisation. More information is in our training curriculum.

 


RcppGSL 0.3.2

Sun, 03/05/2017 - 20:39

The RcppGSL package provides an interface from R to the GNU Scientific Library (GSL) using the Rcpp package.

RcppGSL release 0.3.2 is one of several maintenance releases this weekend to fix an issue flagged by CRAN: calls to .Call() sometimes used package= where PACKAGE= was meant. This came up now while the registration mechanism is being reworked.
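To illustrate the distinction (with a made-up symbol name, not an actual RcppGSL routine): a lower-case package= is not recognised as the PACKAGE specifier at all and is simply passed through as one more argument to the compiled routine, whereas PACKAGE= restricts the symbol lookup to the package’s shared library.

## Hypothetical illustration; "some_routine" is not a real RcppGSL symbol
.Call("some_routine", x, package = "RcppGSL")  # 'package' is silently passed to the C code as an extra argument
.Call("some_routine", x, PACKAGE = "RcppGSL")  # 'PACKAGE' correctly limits the symbol search to RcppGSL's DLL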

So RcppGSL was updated too, and we took the opportunity to bring several packaging aspects up to the newest standards, including support for the soon-to-be required registration of routines.

No new code or features were added. The NEWS file entries follow below:

Changes in version 0.3.2 (2017-03-04)
  • In the fastLm function, .Call now uses the correct PACKAGE= argument

  • Added file init.c with calls to R_registerRoutines() and R_useDynamicSymbols(); also use .registration=TRUE in useDynLib in NAMESPACE

  • The skeleton configuration for created packages was updated.

Courtesy of CRANberries, a summary of changes to the most recent release is available.

More information is on the RcppGSL page. Questions, comments etc should go to the issue tickets at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

RcppSMC 0.1.5

Sun, 03/05/2017 - 20:38

RcppSMC provides Rcpp-based bindings to R for the Sequential Monte Carlo Template Classes (SMCTC) by Adam Johansen, described in his JSS article.

RcppSMC release 0.1.5 is one of several maintenance releases this weekend to fix an issue flagged by CRAN: calls to .Call() sometimes used package= where PACKAGE= was meant. This came up now while the registration mechanism is being reworked.

Hence RcppSMC was updated, and we took the opportunity to bring several packaging aspects up to the newest standards, including support for the soon-to-be required registration of routines.

No new code or features were added. The NEWS file entries follow below:

Changes in RcppSMC version 0.1.5 (2017-03-03)
  • Correct .Call to use PACKAGE= argument

  • DESCRIPTION, NAMESPACE, README.md changes to comply with current R CMD check levels

  • Added file init.c with calls to R_registerRoutines() and R_useDynamicSymbols()

  • Updated .travis.yml file for continuous integration

Courtesy of CRANberries, there is a diffstat report for this release.

More information is on the RcppSMC page. Issues and bug reports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

replyr: Get a Grip on Big Data in R

Sun, 03/05/2017 - 19:09

replyr is an R package that contains extensions, adaptations, and work-arounds to make remote R dplyr data sources (including big data systems such as Spark) behave more like local data. This allows the analyst to more easily develop and debug procedures that simultaneously work on a variety of data services (in-memory data.frame, SQLite, PostgreSQL, and Spark2 currently being the primary supported platforms).

Example

Suppose we had a large data set hosted on a Spark cluster that we wished to work with using dplyr and sparklyr (for this article we will simulate such using data loaded into Spark from the nycflights13 package).
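As a point of reference, a handle like flts (the name used in the output below) could be set up along these lines, assuming a local Spark installation via sparklyr; this is a sketch of the setup, not necessarily the exact code behind the examples:

library(sparklyr)
library(dplyr)
library(nycflights13)
# Connect to a local Spark instance (spark_install() can set one up if needed)
sc <- spark_connect(master = "local[4]")
# Copy the flights data into Spark and keep a remote handle to it
flts <- copy_to(sc, flights, name = "flights")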

We will work through a trivial example: taking a quick peek at your data. The analyst should always be able and willing to look at the data.

It is easy to look at the top of the data, or any specific set of rows of the data.

Either through print() (which is much safer with tbl_df-derived classes than with base data.frame).

print(flts)
## Source: query [3.368e+05 x 19]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      533            529         4      850
## 3   2013     1     1      542            540         2      923
## 4   2013     1     1      544            545        -1     1004
## 5   2013     1     1      554            600        -6      812
## 6   2013     1     1      554            558        -4      740
## 7   2013     1     1      555            600        -5      913
## 8   2013     1     1      557            600        -3      709
## 9   2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 3.368e+05 more rows, and 12 more variables:
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

Or with dplyr::glimpse():

dplyr::glimpse(flts)
## Observations: 3.368e+05
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour      <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...

What replyr adds to the task of "looking at the data" is a rough equivalent to base::summary(): a few per-column statistics.

# using dev version of replyr https://github.com/WinVector/replyr
replyr::replyr_summary(flts, countUniqueNonNum= TRUE)
##            column index     class  nrows  nna nunique  min  max
## 1            year     1   integer 336776    0      NA 2013 2013
## 2           month     2   integer 336776    0      NA    1   12
## 3             day     3   integer 336776    0      NA    1   31
## 4        dep_time     4   integer 336776 8255      NA    1 2400
## 5  sched_dep_time     5   integer 336776    0      NA  106 2359
## 6       dep_delay     6   numeric 336776 8255      NA  -43 1301
## 7        arr_time     7   integer 336776 8713      NA    1 2400
## 8  sched_arr_time     8   integer 336776    0      NA    1 2359
## 9       arr_delay     9   numeric 336776 9430      NA  -86 1272
## 10        carrier    10 character 336776    0      16   NA   NA
## 11         flight    11   integer 336776    0      NA    1 8500
## 12        tailnum    12 character 336776    0    4044   NA   NA
## 13         origin    13 character 336776    0       3   NA   NA
## 14           dest    14 character 336776    0     105   NA   NA
## 15       air_time    15   numeric 336776 9430      NA   20  695
## 16       distance    16   numeric 336776    0      NA   17 4983
## 17           hour    17   numeric 336776    0      NA    1   23
## 18         minute    18   numeric 336776    0      NA    0   59
## 19      time_hour    19   numeric 336776    0      NA 2013 2013
##           mean          sd lexmin lexmax
## 1  2013.000000    0.000000   <NA>   <NA>
## 2     6.548510    3.414457   <NA>   <NA>
## 3    15.710787    8.768607   <NA>   <NA>
## 4  1349.109947  488.281791   <NA>   <NA>
## 5  1344.254840  467.335756   <NA>   <NA>
## 6    12.639070   40.210061   <NA>   <NA>
## 7  1502.054999  533.264132   <NA>   <NA>
## 8  1536.380220  497.457142   <NA>   <NA>
## 9     6.895377   44.633292   <NA>   <NA>
## 10          NA          NA     9E     YV
## 11 1971.923620 1632.471938   <NA>   <NA>
## 12          NA          NA   <NA>   <NA>
## 13          NA          NA    EWR    LGA
## 14          NA          NA    ABQ    XNA
## 15  150.686460   93.688305   <NA>   <NA>
## 16 1039.912604  733.233033   <NA>   <NA>
## 17   13.180247    4.661316   <NA>   <NA>
## 18   26.230100   19.300846   <NA>   <NA>
## 19 2013.000000    0.000000   <NA>   <NA>

As we see, replyr_summary() returns its results in a data frame, and can deal with multiple column types.

Note: the above summary has problems with NA in character columns with Spark, and is thus mis-reporting the NA count in the tailnum column. We are working on the issue. That is also one of the advantages of taking your work-arounds from a package: when they improve, you can bring the improvements into your own work with a mere package update.

We could also use dplyr::summarize_each for the task, but it has the minor downside of returning the data in a wide form.

# currently throws if tailnum left in column list
vars <- setdiff(colnames(flts), 'tailnum')
flts %>% summarize_each(funs(min, max, mean, sd), one_of(vars))
## Source: query [1 x 72]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
##   year_min month_min day_min dep_time_min sched_dep_time_min dep_delay_min
##      <int>     <int>   <int>        <int>              <int>         <dbl>
## 1     2013         1       1            1                106           -43
## # ... with 66 more variables: arr_time_min <int>,
## #   sched_arr_time_min <int>, arr_delay_min <dbl>, carrier_min <chr>,
## #   flight_min <int>, origin_min <chr>, dest_min <chr>,
## #   air_time_min <dbl>, distance_min <dbl>, hour_min <dbl>,
## #   minute_min <dbl>, time_hour_min <dbl>, year_max <int>,
## #   month_max <int>, day_max <int>, dep_time_max <int>,
## #   sched_dep_time_max <int>, dep_delay_max <dbl>, arr_time_max <int>,
## #   sched_arr_time_max <int>, arr_delay_max <dbl>, carrier_max <chr>,
## #   flight_max <int>, origin_max <chr>, dest_max <chr>,
## #   air_time_max <dbl>, distance_max <dbl>, hour_max <dbl>,
## #   minute_max <dbl>, time_hour_max <dbl>, year_mean <dbl>,
## #   month_mean <dbl>, day_mean <dbl>, dep_time_mean <dbl>,
## #   sched_dep_time_mean <dbl>, dep_delay_mean <dbl>, arr_time_mean <dbl>,
## #   sched_arr_time_mean <dbl>, arr_delay_mean <dbl>, carrier_mean <dbl>,
## #   flight_mean <dbl>, origin_mean <dbl>, dest_mean <dbl>,
## #   air_time_mean <dbl>, distance_mean <dbl>, hour_mean <dbl>,
## #   minute_mean <dbl>, time_hour_mean <dbl>, year_sd <dbl>,
## #   month_sd <dbl>, day_sd <dbl>, dep_time_sd <dbl>,
## #   sched_dep_time_sd <dbl>, dep_delay_sd <dbl>, arr_time_sd <dbl>,
## #   sched_arr_time_sd <dbl>, arr_delay_sd <dbl>, carrier_sd <dbl>,
## #   flight_sd <dbl>, origin_sd <dbl>, dest_sd <dbl>, air_time_sd <dbl>,
## #   distance_sd <dbl>, hour_sd <dbl>, minute_sd <dbl>, time_hour_sd <dbl>

Special code for remote data is needed, as none of the obvious "one liner" candidates (base::summary() or broom::glance()) is currently (as of March 4, 2017) intended to work with remote data sources.

summary(flts)
##     Length Class          Mode
## src 1      src_spark      list
## ops 3      op_base_remote list

str(flts)
## List of 2
##  $ src:List of 1
##   ..$ con:List of 11
##   .. ..$ master   : chr "local[4]"
##   .. ..$ method   : chr "shell"
##   .. ..$ app_name : chr "sparklyr"
##   ...

packageVersion('broom')
## [1] '0.4.2'

broom::glance(flts)
## Error: glance doesn't know how to deal with data of class tbl_sparktbl_sqltbl_lazytbl

The source for the examples can be found here.

Conclusion

replyr_summary is not the only service replyr supplies; replyr includes many more adaptations, including my own version of case-completion.

Roughly, replyr is where I collect my adaptations so they don’t infest application code. replyr is a way you can use heavy-duty big-data machinery while keeping your fingers out of the gears.

Direct integration of sjPlot-tables in knitr-rmarkdown-documents #rstats

Sun, 03/05/2017 - 10:54

(This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

A new update of my sjPlot-package was just released on CRAN. Thanks to @c_schwemmer, it’s now possible to easily integrate the HTML-output of all table-functions into knitr-rmarkdown-documents.

Simple Tables

In the past, to integrate table-output in knitr, you needed to set the argument no.output = TRUE and use the return-value $knitr:

`r sjt.df(efc, no.output=TRUE)$knitr`

If you also wanted to include the code to the function call, things were even more complicated. However, since sjPlot 2.3.1, you can use the table-functions like any other R-code-snippet:

```{r}
sjt.frq(efc$e42dep)
```

elder’s dependency
value                   N     raw %   valid %   cumulative %
independent             66    7.27    7.33      7.33
slightly dependent      225   24.78   24.97     32.30
moderately dependent    306   33.70   33.96     66.26
severely dependent      304   33.48   33.74     100.00
missings                7     0.77
total N=908 · valid N=901 · x̄=2.94 · σ=0.94

Grouped Tables

The same applies to the sjtab() function, which is a "pipe"-wrapper for the sjPlot table-functions. Here is an example of how you can print multiple tables of grouped data frames in knitr. sjtab() calls sjt.xtab() (see argument fun = "xtab") and uses the first two variables of the data frame for the contingency table. Since the data frame is grouped (by gender and education), a table for each group is displayed. The group characteristic is printed as the table caption. Note that the HTML-output here is truncated and limited to the first three tables only.

```{r}
library(dplyr)
library(sjPlot)
library(sjmisc)
data(efc)
efc %>%
  group_by(e16sex, c172code) %>%
  select(e42dep, n4pstu, e16sex, c172code) %>%
  sjtab(fun = "xtab")
```

elder’s gender: male
carer’s level of education: low level of education

elder’s dependency      Care level                                                                  Total
                        No Care Level   Care Level 1   Care Level 2   Care Level 3   Care Level 3+
independent             4               1              0              0              0              5
slightly dependent      9               3              1              2              0              15
moderately dependent    15              6              4              2              0              27
severely dependent      8               5              10             7              0              30
Total                   36              15             15             11             0              77
χ2=13.843 · df=9 · Cramer’s V=0.245 · Fisher’s p=0.141

 

elder’s gender: male
carer’s level of education: intermediate level of education

elder’s dependency      Care level                                                                  Total
                        No Care Level   Care Level 1   Care Level 2   Care Level 3   Care Level 3+
independent             7               2              4              2              0              15
slightly dependent      22              7              8              2              0              39
moderately dependent    27              14             13             4              1              59
severely dependent      7               3              12             20             1              43
Total                   63              26             37             28             2              156
χ2=42.707 · df=12 · Cramer’s V=0.302 · Fisher’s p=0.000

 

elder’s gender: male
carer’s level of education: high level of education

elder’s dependency      Care level                                                                  Total
                        No Care Level   Care Level 1   Care Level 2   Care Level 3   Care Level 3+
independent             1               0              0              0              0              1
slightly dependent      8               1              2              2              0              13
moderately dependent    11              2              5              0              0              18
severely dependent      4               2              3              2              0              11
Total                   24              5              10             4              0              43
χ2=5.992 · df=9 · Cramer’s V=0.216 · Fisher’s p=0.620

(… remaining table output truncated)

The knitr-integration works for all table-functions in sjPlot. You can find some examples over here.

Tagged: knitr, markdown, R, rmarkdown, rstats, sjPlot


Resizing spatial data in R

Sun, 03/05/2017 - 09:22

Here I am after a short break, writing again about R!

In December I worked on a project that required me to work with spatial data. This led me to learn about how R deals with this kind of data and to look around for ways to make my “spatial data experience” less painful. I discovered that R is, as always, full of options.

I’m used to dplyr for exploring and manipulating data; however, when it comes to spatial data, I think that gstat, raster and rasterVis are just some of the useful packages that will make your life a lot easier.

For instance, suppose you would like to analyse, among other variables, the net heat flux ($\frac{W}{m^2}$) data (from here on referred to as “qnet”). The data can be found on the website of the WHOI OAFlux Project1, which has a lot of satellite observations available, dating as far back as 1985. These observations come in nice netcdf files and can be monthly or daily, as far as I know.

When doing your analysis, it will almost surely happen that qnet is not the only variable you are considering, and in that case, if you have gathered spatial data from many different sources, not all of it may have the same resolution! This can be a bit of a problem, since if you would like to use almost any model that R offers, you need the same number of features and as few NAs as possible!

Qnet has a resolution (latitude x longitude) of 180 x 360 pixels. This means that for each year (read: file) you will find 365 grids of 64,800 observations each; or, if you prefer, each netcdf file offers 23,652,000 observations of the qnet variable. In reality, a good-sized share of these pixels, as you can see from the picture above, represent land, so they are just NAs to be ignored.

If you are interested in a particular region, you can use the raster::crop function on the raster object in order to select the area you need. For this example, I wanted to focus my attention on a 40×20 pixel region in the north of the Atlantic Ocean. This is the plot of the selected region:
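(The plot itself does not carry over here.) As a rough sketch of that reading-and-cropping step, something like the following would work; the file name, variable name and extent values are only placeholders for whatever OAFlux file and region you are working with:

library(raster)
# Read a year of daily qnet grids from a netcdf file (requires the ncdf4 package);
# the file and variable names are placeholders.
qnet <- brick("qnet_2008.nc", varname = "qnet")
# Crop to a region of interest in the North Atlantic; extent() takes
# xmin, xmax, ymin, ymax in degrees, and these values are illustrative.
region <- extent(-60, -20, 30, 50)
qnet_region <- crop(qnet, region)
plot(qnet_region[[1]])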

Now suppose that all your other datapoints are in a different resolution, say 161×81 pixels. The gstat package is the solution to the problem! By choosing the appropriate algorithm you can easily resize any spatial grid you like. I chose to use inverse distance weighting interpolation since I wanted to obtain a smooth surface, but keep in mind that in the docs there are a lot of other methods for interpolating a surface and, depending on your particular problem, another method may be a better fit. Inverse distance weighting is a technique that estimates the unknown points by taking a weighted average of the points nearby. As the name suggests, the points closer to the one being estimated have a greater impact on the estimate. I think this method is suited to this task since, by just changing the resolution, I do not expect (and would like to avoid) significant changes in the distribution of qnet.
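In formula form, the standard inverse distance weighting estimate of the value at a new location $x_0$ from known values $z(x_i)$ is

$$\hat{z}(x_0) = \frac{\sum_{i=1}^{n} w_i \, z(x_i)}{\sum_{i=1}^{n} w_i}, \qquad w_i = \frac{1}{d(x_0, x_i)^{p}},$$

where $d(x_0, x_i)$ is the distance to the i-th known point and the power $p$ (the idp argument of gstat::idw) controls how quickly the influence of far-away points decays.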

The result of the procedure is shown below:

As a comparison, you can check the pictures before and after the change in resolution by putting them side by side. Just be sure to set the colorbar to the same scale for both pictures, otherwise it would be pretty hard to tell what has changed and what has not. I think you can tell that the picture on the right is the one with the higher resolution.


By using the rasterVis::levelplot function, you can get a perhaps more insightful plot with the marginal distributions of the qnet variable. As you can see, the overall distribution and the marginals of the variable have not been altered significantly, but now we have more datapoints and we can match the qnet data to the other data for the analysis. I would suggest playing around with the parameters of the gstat::idw function; depending on the resolution you would like to achieve, they can make some difference in the outcome. Below you can find the code I used for this example. This is not exactly the code I used in the project, but the basic concept is the same.
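The original code embed does not carry over here, so the snippet below is only a rough sketch of the resampling step described above, building on the qnet_region object from the earlier sketch; the target grid size, the idp value and the object names are assumptions:

library(raster)
library(gstat)
library(sp)
# Points holding the original (coarse) qnet values; NA land pixels are dropped
coarse <- as.data.frame(rasterToPoints(qnet_region[[1]]))
names(coarse) <- c("x", "y", "qnet")
coordinates(coarse) <- ~ x + y
proj4string(coarse) <- crs(qnet_region)
# Empty raster defining the finer target grid (161 x 81 cells, same extent)
target <- raster(extent(qnet_region), ncols = 161, nrows = 81, crs = crs(qnet_region))
newgrid <- SpatialPoints(xyFromCell(target, 1:ncell(target)), proj4string = crs(target))
# Inverse distance weighted interpolation onto the finer grid
fit <- gstat::idw(qnet ~ 1, locations = coarse, newdata = newgrid, idp = 2)
qnet_fine <- setValues(target, fit$var1.pred)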

Then you might have the weird idea of making a short video of the observations over the years 2008 and 2009, to have a look at how qnet evolved as time passed by. Nothing could be easier in R: just use a for loop to save each picture and you’re almost done:
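Something along these lines, for example (here qnet_region is the multi-layer brick from the earlier sketch; the file naming, plot range and image size are all just one possible choice):

library(rasterVis)
# One png per daily layer; fixing the colour breaks keeps the colorbar
# comparable across frames (the range here is an assumption).
breaks <- seq(-400, 400, length.out = 50)
for (i in 1:nlayers(qnet_region)) {
  png(sprintf("qnet_frame_%03d.png", i), width = 800, height = 500)
  print(levelplot(qnet_region[[i]], margin = TRUE, at = breaks))
  dev.off()
}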

Then use your video editor of choice and you’re done! I have to admit though, making the video was the first thing I thought of when I made the first plot.

Thank you for reading this post. Leave a comment if you have any suggestion and if you have found it useful please share it.

1 Global ocean heat flux and evaporation products were provided by the WHOI OAFlux project (http://oaflux.whoi.edu) funded by the NOAA Climate Observations and Monitoring (COM) program.
