R-bloggers
R news and tutorials contributed by hundreds of R bloggers

R GUI Update: BlueSky User’s Guide, New Features

Tue, 08/04/2020 - 00:33

[This article was first published on R – r4stats.com, and kindly contributed to R-bloggers.]

The BlueSky Statistics graphical user interface (GUI) for the R language has added quite a few new features (described below). I’m also working on a BlueSky User’s Guide, a draft of which you can read about and download here. Although I’m spending a lot of time on BlueSky, I still plan to be as obsessive as ever about reviewing all (or nearly all) of the R GUIs; that work is summarized here.

The new data management features in BlueSky are:

  • Date Order Check – quickly checks the dates stored in many variables and reports any rows whose dates do not increase from left to right.
  • Find Duplicates – generates a report of duplicates and saves a copy of the data set with the duplicates removed. Duplicates can be based on all variables, or on just a set of ID variables.
  • Select First/Last Observation per Group – selecting the first or last observation in each group lets you create new datasets from the “best” or “worst” case in each group, find the most current record, and so on (a dplyr sketch of roughly equivalent operations follows this list).
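
For readers who prefer code, the sketch below (my own dplyr illustration, not BlueSky’s generated code) shows roughly equivalent operations for finding duplicates and selecting the first or last observation per group:

library(dplyr)

# report duplicate rows and keep a de-duplicated copy, using cyl + gear as stand-in ID variables
dupes   <- mtcars[duplicated(mtcars[, c("cyl", "gear")]), ]
deduped <- distinct(mtcars, cyl, gear, .keep_all = TRUE)

# first and last observation per group, e.g. the lightest and heaviest car per cylinder count
first_last <- mtcars %>%
  group_by(cyl) %>%
  arrange(wt, .by_group = TRUE) %>%
  slice(c(1, n())) %>%
  ungroup()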

Model Fitting / Tuning

One of the more interesting features in BlueSky is its offering of what it calls Model Fitting and Model Tuning. Model Fitting gives you direct control over the R function that does the work. That provides precise control over every setting, and it can teach you the code that the menus create, but it also means that model tuning is up to you to do. However, it does standardize scoring, so that you do not have to keep up with the wide range of parameters that each of those functions needs for scoring. Model Tuning controls models through the caret package, which lets you do things like K-fold cross-validation and parameter tuning. However, it does not allow control over every model setting.
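
To give a sense of what the Model Tuning menus drive under the hood, here is a minimal caret sketch (my own example, not code generated by BlueSky) that tunes a k-nearest-neighbors model with 10-fold cross-validation:

library(caret)

# 10-fold cross-validation; caret searches over k for the KNN model
ctrl <- trainControl(method = "cv", number = 10)
knn_fit <- train(Species ~ ., data = iris,
                 method = "knn",
                 trControl = ctrl,
                 tuneLength = 5)
knn_fit  # prints resampled accuracy for each candidate value of k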

New Model Fitting menu items are:

  • Cox Proportional Hazards Model: Cox Single Model
  • Cox Multiple Models
  • Cox with Formula
  • Cox Stratified Model
  • Extreme Gradient Boosting
  • KNN
  • Mixed Models
  • Neural Nets: Multi-layer Perceptron
  • NeuralNets (i.e. the package of that name)
  • Quantile Regression

There are so many Model Tuning entries that it’s easier to just paste in the list from the main BlueSky review, which I updated earlier this morning:

  • Model Tuning: Adaboost Classification Trees
  • Model Tuning: Bagged Logic Regression
  • Model Tuning: Bayesian Ridge Regression
  • Model Tuning: Boosted trees: gbm
  • Model Tuning: Boosted trees: xgbtree
  • Model Tuning: Boosted trees: C5.0
  • Model Tuning: Bootstrap Resample
  • Model Tuning: Decision trees: C5.0tree
  • Model Tuning: Decision trees: ctree
  • Model Tuning: Decision trees: rpart (CART)
  • Model Tuning: K-fold Cross-Validation
  • Model Tuning: K Nearest Neighbors
  • Model Tuning: Leave One Out Cross-Validation
  • Model Tuning: Linear Regression: lm
  • Model Tuning: Linear Regression: lmStepAIC
  • Model Tuning: Logistic Regression: glm
  • Model Tuning: Logistic Regression: glmnet
  • Model Tuning: Multi-variate Adaptive Regression Splines (MARS via earth package)
  • Model Tuning: Naive Bayes
  • Model Tuning: Neural Network: nnet
  • Model Tuning: Neural Network: neuralnet
  • Model Tuning: Neural Network: dnn (Deep Neural Net)
  • Model Tuning: Neural Network: rbf
  • Model Tuning: Neural Network: mlp
  • Model Tuning: Random Forest: rf
  • Model Tuning: Random Forest: cforest (uses ctree algorithm)
  • Model Tuning: Random Forest: ranger
  • Model Tuning: Repeated K-fold Cross-Validation
  • Model Tuning: Robust Linear Regression: rlm
  • Model Tuning: Support Vector Machines: svmLinear
  • Model Tuning: Support Vector Machines: svmRadial
  • Model Tuning: Support Vector Machines: svmPoly

You can download the free open-source version from https://BlueSkyStatistics.com.



finding contour lines

Mon, 08/03/2020 - 21:01

[This article was first published on bnosac :: open analytical helpers, and kindly contributed to R-bloggers.]

Finally, the R package you all have been waiting for has arrived – image.ContourDetector, developed at https://github.com/bnosac/image. It detects contour lines in images using the ‘Unsupervised Smooth Contour Detection’ algorithm described at http://www.ipol.im/pub/art/2016/175.

Have you always wanted to be able to draw like you are in art school? Let me show you how to do this quickly.

If you want to reproduce this, the following snippets show how. The steps are as follows:

1. Install the packages from CRAN:

install.packages("image.ContourDetector")
install.packages("magick")
install.packages("sp")

2. Get an image, convert it to grey scale, pass the pixels to the function and off you go:

library(magick)
library(image.ContourDetector)
library(sp)
img <- image_read("https://cdn.mos.cms.futurecdn.net/9sUwFGNJvviJks7jNQ7AWc-1200-80.jpg")
mat <- image_data(img, channels = "gray")
mat <- as.integer(mat, transpose = TRUE)
mat <- drop(mat)
contourlines <- image_contour_detector(mat)
plt <- plot(contourlines)
class(plt)

3. If you want to have the same image as shown at the top of the article:

Put the 3 images (original, combined, contour lines only) together in 1 plot using the excellent magick R package:

plt <- image_graph(width = image_info(img)$width, height = image_info(img)$height)
plot(contourlines)
dev.off()

plt_combined <- image_graph(width = image_info(img)$width, height = image_info(img)$height)
plot(img)
plot(contourlines, add = TRUE, col = "red", lwd = 5)
dev.off()

combi <- image_append(c(img, plt_combined, plt))
combi
image_write(combi, "example-contourlines.png", format = "png")


RStudio Adds New R Features in Qubole’s Open Data Lake

Mon, 08/03/2020 - 02:00

[This article was first published on RStudio Blog, and kindly contributed to R-bloggers.]

Launch RStudio Server Pro from inside the Qubole platform

We are excited to team up with Qubole to offer data science teams the ability to use RStudio Server Pro from directly within the Qubole Open Data Lake Platform. Qubole is an open, simple, and secure data lake platform for machine learning, streaming and ad-hoc analytics. RStudio and Qubole customers now have access to RStudio’s out-of-the-box features and Qubole’s unique managed services that supercharge data science and data exploration workflows for R users, while optimizing costs for R-based projects. Within the Qubole platform, data scientists are able to easily access and analyze large datasets using the RStudio IDE, securely within their enterprise running in their public cloud environment of choice (AWS, Azure, or Google).

With massive amounts of data becoming more accessible, data scientists increasingly need more computational power. Cluster frameworks such as Apache Spark, and their integration with R using the SparkR and sparklyr libraries, help these users quickly make sense of their big data and derive actionable insights for their businesses. However, high CPU costs, long setup times, and complex management processes often prevent data scientists from taking advantage of these powerful frameworks.
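
As a rough illustration of the R-side workflow these integrations build on (a generic sparklyr sketch, not Qubole-specific code), connecting to Spark and pushing a dplyr pipeline to the cluster looks like this:

library(sparklyr)
library(dplyr)

# connect to Spark (locally here; on a managed platform this would point at the cluster)
sc <- spark_connect(master = "local")

# copy a data frame to Spark and run a dplyr pipeline that executes on the cluster
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)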

Now that Qubole has added RStudio Server Pro to its offering, it provides its users with:

  • Single click access to Spark clusters. With Qubole’s authentication mechanisms, no additional sign-in is required.
  • Automatic persistence of users’ files and data sets when clusters are restarted.
  • Pre-installed packages such as sparklyr, the tidyverse, and other popular R packages.
  • A Cluster Package Manager that allows users to define cluster-wide R and Python dependencies for Spark applications.
  • Performance optimizations: Qubole’s optimized Spark distribution automatically scales the cluster up when the sparklyr application needs more resources and scales it down when cluster resources are not in use.
  • Spark UI, Logs, and Resource Manager links available in the RStudio Connections pane for seamlessly managing applications.

Enterprise users benefit from this integration because the upgraded platform:

  • Limits CPU expenses to what users need. The Qubole cluster automatically scales up when the sparklyr application needs more resources, and downscales when cluster resources are not in use.
  • Allows on-demand cluster use. With single-click integration, users can seamlessly access large datasets that can persist automatically.
  • Simplifies cluster management. Qubole’s Cluster Package Manager, with pre-installed R packages, lets users define R and Python dependencies across their clusters.
How do I enable this integration?

If you already are a Qubole customer, and would like to enable RStudio Server Pro in your environment, please contact your Qubole support team.

Want to learn more about RStudio Server Pro?

RStudio Server Pro is the preferred data analysis and integrated development experience for professional R users and data science teams who use R and Python. RStudio Server Pro enables the collaboration, centralized management, metrics, security, and commercial support that professional data science teams need to operate at scale.

Try a Free 45 Day Evaluation or See it in Action



How to measure spatial diversity and segregation?

Mon, 08/03/2020 - 02:00

[This article was first published on rstats on Jakub Nowosad's website, and kindly contributed to R-bloggers.]

The raceland package implements a computational framework for a pattern-based, zoneless analysis and visualization of (ethno)racial topography.

The main concept in this package is the racial landscape (RL). It consists of many large and small patches (racial enclaves) formed by adjacent raster grid cells having the same race category. The distribution of racial enclaves creates a specific spatial pattern, which can be quantified by two metrics derived from Information Theory (IT): entropy, which measures racial diversity, and mutual information, which measures racial segregation.
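
To make these two metrics concrete, here is a small hand-rolled sketch of their standard information-theoretic definitions (an illustration only, not the package's internal code, which derives them from co-occurrences of race categories in adjacent cells):

# toy joint distribution of race categories observed in pairs of adjacent cells
p_xy <- prop.table(matrix(c(40, 5, 5, 50), nrow = 2))
p_x  <- rowSums(p_xy)
p_y  <- colSums(p_xy)

ent_x  <- -sum(p_x * log(p_x, base = 2))   # entropy: diversity of categories
ent_y  <- -sum(p_y * log(p_y, base = 2))
joint  <- -sum(p_xy * log(p_xy, base = 2)) # joint entropy of adjacent pairs
mutinf <- ent_x + ent_y - joint            # mutual information: segregation
c(entropy = ent_x, mutual_information = mutinf)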

Methods in the raceland package are based on raster data and, unlike previous methods, do not depend on a division into specific zones (census tracts, census blocks, etc.). The calculation of racial diversity (entropy) and racial segregation (mutual information) can be performed for the whole area of interest (for example, a metropolitan area) or for any portion of it, without introducing any arbitrary divisions.

To learn more about this topic, read our Applied Geography article or its preprint:

Dmowska, A., Stepinski T., Nowosad J. Racial landscapes – a pattern-based, zoneless method for analysis and visualization of racial topography. Applied Geography. 122:1-9, DOI:10.1016/j.apgeog.2020.102239

Example calculations

To reproduce the results on your own computer, install and attach the following packages:

library(raceland)
library(raster)
library(sf)
library(tmap)
library(dplyr)

You also need to download and extract the data.zip file containing the example data.

temp_data_file = tempfile(fileext = ".zip")
download.file("https://github.com/Nowosad/raceland-bp1/raw/master/data.zip", destfile = temp_data_file, mode = "wb")
unzip(temp_data_file)

Input data

The presented approach requires a set of rasters, where each raster represents one of five race-groups: Asians, Blacks, Hispanic, others, and Whites. In this example, we use data limited to the city of Cincinnati, Ohio.

list_raster = dir("data", pattern = ".tif$", full.names = TRUE)
race_raster = stack(list_raster)

We also use vector data containing the city borders to ease the understanding of the results.

cincinnati = read_sf("data/cincinnati.gpkg")

We can visualize the data using the tmap package:

tm_race = tm_shape(race_raster) +
  tm_raster(style = "fisher", n = 10, palette = "viridis", title = "Number of people") +
  tm_facets(nrow = 3) +
  tm_shape(cincinnati) +
  tm_borders(lwd = 3, col = "black")
tm_race

The above maps show the distribution of people from different race-groups in Cincinnati. Each 30 by 30 meter cell represents the number of people living in that area. The data was obtained from http://sil.uc.edu/cms/index.php?id=socscape-data and preprocessed using the instructions at https://cran.r-project.org/web/packages/raceland/vignettes/raceland-intro3.html.

Basic example

Our goal is to measure racial diversity and racial segregation for different places in the city. We can use the quanfity_raceland() function for this purpose.

results_metrics = quanfity_raceland(race_raster, n = 30, window_size = 10, fun = "mean", size = 20, threshold = 0.75)
head(results_metrics)
## Simple feature collection with 6 features and 4 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 978285 ymin: 1858035 xmax: 984885 ymax: 1859235
## CRS:            +proj=aea +lat_0=23 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
##    row col       ent     mutinf                       geometry
## 30   1  30 1.1050557 0.01559040 POLYGON ((981885 1859235, 9...
## 31   1  31 1.3120756 0.03010253 POLYGON ((982485 1859235, 9...
## 33   1  33 1.1301688 0.01910744 POLYGON ((983685 1859235, 9...
## 34   1  34 1.6320160 0.06155428 POLYGON ((984285 1859235, 9...
## 74   2  24 0.9527805 0.01716798 POLYGON ((978285 1858635, 9...
## 80   2  30 1.4438328 0.04498205 POLYGON ((981885 1858635, 9...

It requires several arguments:

  • x – a RasterStack with race-specific population densities assigned to each cell
  • n – the number of realizations
  • window_size – the length, expressed in numbers of cells, of the side of a square-shaped block of cells for which local densities will be calculated
  • fun – the function used to calculate values from adjacent cells that contribute to the exposure matrix: "mean" calculates the average of the local population densities from adjacent cells, "geometric_mean" calculates their geometric mean, and "focal" assigns the value from the focal cell
  • size – the length, expressed in numbers of cells, of the side of a square-shaped block of cells; it defines the extent of a local pattern
  • threshold – the share of NA cells allowed for the metrics calculation

The result is a spatial vector object containing areas of the size of 20 by 20 cells from input data (600 by 600 meters in this example). Its attribute table has five columns – row and col allowing for identification of each square polygon, ent – entropy measuring racial diversity, mutinf – mutual information, which is associated with measuring racial segregation, and geometry containing spatial geometries.

diversity_map = tm_shape(results_metrics) +
  tm_polygons(col = "ent", title = "Diversity", style = "cont", palette = "magma") +
  tm_shape(cincinnati) +
  tm_borders(lwd = 1, col = "black")
segregation_map = tm_shape(results_metrics) +
  tm_polygons(col = "mutinf", title = "Segregation", style = "cont", palette = "cividis") +
  tm_shape(cincinnati) +
  tm_borders(lwd = 1, col = "black")
tmap_arrange(diversity_map, segregation_map)

The above results present areas with different levels of racial diversity and segregation. Interestingly, there is a low correlation between these two properties. Some areas inside the city do not have any value attached – this indicates that either more than 75% of their area is covered with missing values or nobody lives there.

Extended example

The quanfity_raceland() function is a wrapper around several steps implemented in raceland, namely create_realizations(), create_densities(), calculate_metrics(), and create_grid(). All of them can be used sequentially, as you can see below.

Additionally, the raceland package has a zones_to_raster() function that prepares input data from spatial vector data with race counts.

Constructing racial landscapes

The racial landscape is a high-resolution grid in which each cell contains only inhabitants of a single race. It is constructed using the create_realizations() function, which expects a stack of race-specific rasters. Racial composition at each cell is translated into probabilities of drawing a person of a specific race from a cell. For example, if a cell has 100 people, where 90 are classified as Black (90% chance) and 10 as White (10% chance), then we can assign a specific race randomly based on these probabilities.
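
As a minimal illustration of that drawing step (my own sketch, not the package's internal code), a single realization could assign the cell's race label by sampling with those probabilities:

# hypothetical counts for one cell: 90 Black residents and 10 White residents
counts <- c(Asian = 0, Black = 90, Hispanic = 0, other = 0, White = 10)

# draw one race label for this cell, proportionally to the counts
set.seed(1)
sample(names(counts), size = 1, prob = counts)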

This approach generates a specified number (n = 30, in this case) of realizations with slightly different patterns.

realizations_raster = create_realizations(race_raster, n = 30)

The output of this function is a RasterStack, where each raster contains values from 1 to k, where k is the number of provided race-specific grids. In this case, we provided five race-specific grids (Asians, Blacks, Hispanic, others, and Whites); therefore, the value of 1 in the output object represents Asians, 2 represents Blacks, etc.

my_pal = c("#F16667", "#6EBE44", "#7E69AF", "#C77213", "#F8DF1D")
tm_realizations = tm_shape(realizations_raster[[1:4]]) +
  tm_raster(style = "cat", palette = my_pal, labels = c("Asians", "Blacks", "Hispanic", "others", "Whites"), title = "") +
  tm_facets(ncol = 2) +
  tm_shape(cincinnati) +
  tm_borders(lwd = 3, col = "black") +
  tm_layout(panel.labels = paste("Realization", 1:30))
tm_realizations

The above plot shows four of 30 created realizations and makes it clear that all of them are fairly similar.

Local densities

Now, for each of the created realizations, we can calculate local densities of subpopulations (race-specific local densities) using the create_densities() function.

dens_raster = create_densities(realizations_raster, race_raster, window_size = 10)

The output is a RasterStack with local densities calculated separately for each realization.

tm_density = tm_shape(dens_raster[[1:4]]) +
  tm_raster(style = "fisher", n = 10, palette = "viridis", title = "Number of people") +
  tm_facets(ncol = 2) +
  tm_shape(cincinnati) +
  tm_borders(lwd = 3, col = "black") +
  tm_layout(panel.labels = paste("Realization", 1:30))
tm_density

Total diversity and segregation

We can use both the realization and density rasters to calculate racial diversity and segregation using the calculate_metrics() function. It calculates four information theory-derived metrics: entropy (ent), joint entropy (joinent), conditional entropy (condent), and mutual information (mutinf). As mentioned before, ent measures racial diversity, while mutinf is associated with racial segregation. These metrics can be calculated at a given spatial scale. For example, setting size to NULL, as in the example below, calculates the metrics for the whole area of each realization.

metr_df = calculate_metrics(x = realizations_raster, w = dens_raster, fun = "mean", size = NULL, threshold = 1)
head(metr_df)
##   realization row col      ent  joinent  condent    mutinf
## 1           1   1   1 1.400229 2.625657 1.225428 0.1748010
## 2           2   1   1 1.398806 2.624101 1.225295 0.1735102
## 3           3   1   1 1.398361 2.623339 1.224978 0.1733824
## 4           4   1   1 1.400530 2.625777 1.225247 0.1752829
## 5           5   1   1 1.395641 2.617376 1.221734 0.1739072
## 6           6   1   1 1.397392 2.616627 1.219235 0.1781572

Now, we can calculate the average metrics across all realizations, which should give more accurate results.

metr_df %>%
  summarise(
    mean_ent = mean(ent, na.rm = TRUE),
    mean_mutinf = mean(mutinf)
  )
##   mean_ent mean_mutinf
## 1 1.397863   0.1742165

These values could be compared with values obtained for other US cities to evaluate which cities have high average racial diversity (larger values of mean_ent) and which have high average racial segregation (larger values of mean_mutinf).

Local diversity and segregation

The information theory-derived metrics can also be calculated at smaller, local scales using the size argument. It describes the size of the local area used for the metric calculations. For example, size = 20 indicates that each local area will consist of 20 by 20 cells of the original raster.

metr_df_20 = calculate_metrics(x = realizations_raster, w = dens_raster, fun = "mean", size = 20, threshold = 0.75)

Now, we can summarize the results for each local area independently (group_by(row, col)).

smr = metr_df_20 %>%
  group_by(row, col) %>%
  summarize(
    ent_mean = mean(ent, na.rm = TRUE),
    mutinf_mean = mean(mutinf, na.rm = TRUE),
  ) %>%
  na.omit()
head(smr)
## # A tibble: 6 x 4
## # Groups:   row [2]
##     row   col ent_mean mutinf_mean
##   <int> <int>    <dbl>       <dbl>
## 1     1    30    1.09       0.0152
## 2     1    31    1.30       0.0356
## 3     1    33    1.12       0.0159
## 4     1    34    1.62       0.0576
## 5     2    24    0.959      0.0195
## 6     2    30    1.44       0.0445

Each row in the obtained results relates to some spatial locations. We can create an empty grid with appropriate dimensions using the create_grid() function. Its size argument expects the same value as used in the calculate_metrics() function.

grid_sf = create_grid(realizations_raster, size = 20)

The result is a spatial vector object with three columns: row and col allowing for identification of each square polygon, and geometry containing spatial geometries.

tm_shape(grid_sf) + tm_polygons()

The first two columns, row and col, can be used to connect the grid with the summary results.

grid_attr = dplyr::left_join(grid_sf, smr, by = c("row", "col"))
grid_attr = na.omit(grid_attr)

Finally, we are able to create two maps. The first one represents racial diversity (the larger the value, the larger the diversity; the ent_mean variable) and the second one shows racial segregation (the larger the value, the larger the segregation; the mutinf_mean variable).

diversity_map = tm_shape(grid_attr) +
  tm_polygons(col = "ent_mean", title = "Diversity", style = "cont", palette = "magma") +
  tm_shape(cincinnati) +
  tm_borders(lwd = 3, col = "black")
segregation_map = tm_shape(grid_attr) +
  tm_polygons(col = "mutinf_mean", title = "Segregation", style = "cont", palette = "cividis") +
  tm_shape(cincinnati) +
  tm_borders(lwd = 3, col = "black")
tmap_arrange(diversity_map, segregation_map)

Bonus: visualizing racial landscapes

While the realizations created a few steps before represent the spatial distribution of race fairly well, they do not take the spatial variability of population densities into consideration. The additional function plot_realization() displays a selected realization taking into account not only the spatial distribution of race but also the population density.

plot_realization(x = realizations_raster[[2]], y = race_raster, hex = my_pal)

In the result, darker areas have larger populations, and brighter areas are less inhabited.

Summary

The raceland package implements a computational framework for a pattern-based, zoneless analysis and visualization of (ethno)racial topography. The most comprehensive description of the method can be found in the Racial landscapes – a pattern-based, zoneless method for analysis and visualization of racial topography article published in Applied Geography. Its preprint is available at https://osf.io/preprints/socarxiv/mejz5. Additionally, raceland has three extensive vignettes:

This approach is based on the concept of ‘landscape’ used in the domain of landscape ecology. To learn more about information theory metrics used in this approach you can read the Information theory as a consistent framework for quantification and classification of landscape patterns article published in Landscape Ecology.

The raceland package requires race-specific grids. They can be obtained in two main ways. The first is to download prepared grids from the SocScape project, which provides high-resolution raster grids for the years 1990, 2000, and 2010 for 365 metropolitan areas and for each county in the conterminous US. The second is to rasterize a spatial vector file (e.g., an ESRI Shapefile) with an attribute table containing race counts for some areas, using the zones_to_raster() function.

Finally, while the presented methods have been applied to race-specific raster grids, they can be also used for many other problems where it is important to determine spatial diversity and segregation.



87th TokyoR Meetup Roundup: {data.table}, Bioconductor, & more!

Mon, 08/03/2020 - 01:00

[This article was first published on R by R(yo), and kindly contributed to R-bloggers.]

As the monsoon season (finally) ends, another TokyoR meetup! Since COVID
hit, all of TokyoR’s meetups since February have been held online, and the
transition has been seamless thanks to the efforts of the TokyoR
organizing team. It was my first TokyoR since January, so it was great
to be back!

In line with my previous round up posts:

I will be going over around half of all the talks. Hopefully, my efforts
will help spread the vast knowledge of Japanese R users to the wider R
community. Throughout I will also post helpful blog posts and links from
other sources if you are interested in learning more about the topic of
a certain talk. You can follow Tokyo.R by searching for the
#TokyoR hashtag on Twitter.

Anyway…

Let’s get started!

BeginneR Session

As with every TokyoR meetup, we began
with a set of beginner user focused talks:

Main Talks

u_ribo: Let’s learn {data.table}!

@u_ribo gave an introduction to the {data.table} package, which
extends data.frame and allows you to do fast data manipulation,
data aggregation, and more! @u_ribo’s slides were very easy to
understand and are probably a very good intro to {data.table} for
tidyverse users, as the walk-through included side-by-side
comparisons with similar {dplyr} and {tidyr} syntax (shown in
detail below).


The three main points he made to contrast it with {dplyr} were:

  • Lower # of dependencies: {data.table} only uses {methods}
  • Lower memory usage: deep-copy {dplyr} vs. shallow-copy {data.table}
  • “Conservative” development: Try to minimize the amount of breaking changes in new code

Other {data.table} resources:

Lightning Talks

soupcurry049: Introduction to {ggspatial}!

@soupcurry049 gave an introduction to the {ggspatial} package, which
provides the user with a ggplot-like style for plotting spatial data.
It supports sf, sp, and raster objects, and you get a lot of cool
options for annotations (spatial lines, a north arrow, etc.), layers,
and spatial geometries (in ggplot2::geom_*() style). @soupcurry049
finished off the LT with a quick demonstration of a map showing onsen
locations in Hokkaido prefecture.

andrew_cb2: (x) => x + 1

@andrew_cb2 talked about … not programming IN R but programming R
itself. Currently, the R syntax for creating a function requires
typing out function(...), but typing all eight letters every time can
be annoying; couldn’t there be a way to make it shorter? Recently
there has been talk about creating a shorter anonymous function
syntax, and the following three styles were discussed:

The reason why some implementations are harder than others is the
location of the special characters within R’s grammar. @andrew_cb2
then gave us a quick tutorial on going into the R source files and
adding your own anonymous function syntax to R.
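
For reference (my own illustration, not the talk’s slides), the existing options and the style in the talk’s title look like this:

library(purrr)

# current base R syntax: the full `function` keyword
map_dbl(1:3, function(x) x + 1)

# purrr's existing formula shorthand for anonymous functions
map_dbl(1:3, ~ .x + 1)

# a JavaScript-style `(x) => x + 1` is not valid R syntax today; the talk shows
# how such a form could be added by modifying R's parser in the source tree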


@andrew_cb2 has made the entire tutorial available on Github
here.

flaty13: Data Science 100 Knocks!

@flaty13 talked about a new initiative by the Japan Data Science
Society, the Data Science 100 Knocks for Data Pre-processing. It is a
series of problem solving exercises meant for beginner and intermediate
data scientists to practice their data pre-processing/handling skills in
SQL, Python, and R. The problems are all contained in a Docker
container, so you can learn how to use Docker as well.

kozo2: Introduction to BioConductor!

@kozo2 introduced Bioconductor and its community. Bioconductor is a
package repository for bioinformatics much like CRAN for most R users.
@kozo2 talked about a number of differences with CRAN including:

  • A more rigid code review
  • A strict Bioconductor coding style
  • Github-based package submission and updating
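
One practical difference users notice right away is installation: Bioconductor packages are installed via BiocManager rather than install.packages() (a standard example, not taken from the talk):

# install the BiocManager helper from CRAN, then packages from Bioconductor
install.packages("BiocManager")
BiocManager::install("limma")  # limma is used here purely as an example package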

To develop the local Japanese community, the
Bio-Pack-athon monthly meetup was
created which helps bioinformatics developers with ideas and workflows
to nurture future Bioconductor authors. One of the bigger goals of this
meetup is to increase the number of Bioconductor devs in Japan so that
Tokyo could be a candidate to host the Bioc Asia conference in 2021.

Earlier this year, a Community Advisory Board was also created, which
aims to support training and outreach and to promote cooperation
between users and developers.

Other Bioconductor materials:

Other talks

Conclusion

TokyoR happens almost monthly, and it’s a great way to mingle with
Japanese R users, as it’s the largest regular meetup here in Japan.
The next meetup will be in January.

For the time being meetups will continue to be conducted online. Talks
in English are also welcome so come join us!



Effect of COVID-19 on Our Mobility

Sun, 08/02/2020 - 23:47

[This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers.]

I just moved to NYC in January 2020. My reason for moving was to enjoy the socially rich community and the plethora of activities that I could explore, such as museums, cute little theaters, Broadway shows, fancy stores, riding the subway and watching people, restaurants and the fun nightlife. And I did for two months. Then in March, the pandemic hit the city hard. WOW! All the fun activities and sights that I had hoped for were closed.

Many places were forced to close due to lockdown measures, and some businesses chose to close even if they were allowed to remain open. I was curious about how the numbers of daily COVID cases were increasing and how those numbers influenced our mobility and decisions. The change in mobility is defined by how the number of visitors to, or the time spent at, places such as parks or transit stations changed compared to a baseline. The baseline used for this data is the median value for the corresponding day of the week in the 5-week period from January 3 to February 6, 2020. The mobility data was collected from the Google Community Mobility Reports. To understand how COVID-19 affected our mobility and some businesses, I developed a Shiny App to visualize the mobility data and to help interpret it. This post presents some of the answers that the app can provide. There are six studied mobility types: grocery and pharmacy, retail, parks, workplaces, residential, and public transit. Details on each category are available in the mobility data documentation.
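
As a sketch of how such a baseline comparison could be computed (my own illustration with a hypothetical `mobility` data frame holding `date` and `visits` columns; this is not the app's actual code):

library(dplyr)
library(lubridate)

# `mobility` is a hypothetical data frame with one row per day: columns date, visits

# baseline: median visits for each day of the week over Jan 3 - Feb 6, 2020
baseline <- mobility %>%
  filter(date >= as.Date("2020-01-03"), date <= as.Date("2020-02-06")) %>%
  group_by(wday = wday(date)) %>%
  summarise(base_visits = median(visits), .groups = "drop")

# percentage change of every later day relative to its weekday-specific baseline
mobility_change <- mobility %>%
  mutate(wday = wday(date)) %>%
  left_join(baseline, by = "wday") %>%
  mutate(pct_change = 100 * (visits - base_visits) / base_visits)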

How has COVID-19 changed mobility across the US?

The mobility in the US by type (either grocery and pharmacy, retail, park, workplace, residential and public transit) can be visualized in the app by selecting a date and a mobility type. For example, the following map shows the percentage change in mobility at residential places on March 25th. As shown, how strictly people observed sheltering at home varied by state. Northeastern states had the greatest relative increase, which can be attributed to the high level of severity in those states; the number of daily cases was also higher there. The box plots show the top and bottom 5 states/districts for the selected mobility type (which is residential mobility in this example). Comparing the median change in residential mobility of the counties in each region (a measure of citizens staying at home), the District of Columbia, New Jersey, Massachusetts, Michigan and New York had the highest relative increase on March 25th, while Arkansas, Arizona, Idaho, Mississippi and Oklahoma had the lowest.

Another interesting change is in the mobility at retail stores on May 28th, when the number of daily cases was decreasing in the US, as shown in the following map. Of course, some states decided to close retail stores, but not all. Retail closure may have affected other mobility types such as residential mobility, since people may have chosen to stay at home more because stores were closed, as shown in the previous map of residential mobility on the same day. This is noticeable in the District of Columbia: it had the highest decrease in retail mobility (-58% from the previous box plots) and the highest increase in residential mobility (28% from the following box plots). The District of Columbia and New Jersey had the highest decrease in mobility at retail stores, while states like Alaska, Oklahoma, Wyoming, Mississippi and South Dakota did not show a considerable change in mobility at retail stores, because their state governments did not enforce the closing of retail stores as strictly as the other states and the COVID situation was not as severe.

Similar visualizations are available per day for the other mobility types, including public transit, workplaces, parks, and grocery stores, during the period between March 1st and July 29th. This period was chosen based on the availability of the most up-to-date mobility data as of the date of this post. The percentage change in mobility in each state is a daily median of the mobility change of its counties over the whole measured time frame.

How do mobility types correlate to one another?

Studying the correlation between the changes in mobility by type from March 1st to July 29th, 2020 nationwide shows that the mobility types are correlated with one another, as expected. There is a strong direct correlation between the retail, grocery, transit, and workplace mobility types. Due to COVID-19, offices are closed and people are going to work less. They also use less mass transit and reduce their trips for retail and groceries, because of both restrictions imposed by the governors and personal safety choices. The correlation between the change in park mobility and the other mobility types is also positive but weaker, because some people still went to the parks even when there were mobility limitations. The closure of parks was not imposed by the governors in most of the states, so the choice to visit the parks was left to the people.

Moreover, the correlation between the change in residential mobility and the rest of the mobility types is inverted: the more people limit their other mobility, the more they stay at home.

How have the reported daily cases affected mobility in each state?

The app also addresses how the daily cases and the mobility trends are correlated on a state-by-state basis. The following is an example of the visualization that shows the trends of daily cases in the state of NY. As you see, the reported daily cases peaked around the middle of April, then started to decrease. The daily cases visualization is available until August 1st.

The percentage change in mobility for the state of NY is also reported as shown in the following graph.

The percentage changes in mobility differ for each mobility type. The parks, grocery, retail, transit and workplace mobility types were decreasing until the middle of April, when they started to increase. The residential mobility trend is the opposite: it was increasing until the middle of April, then it started to decrease. This indicates that as the number of daily cases was increasing, more people stayed home. But when the number of daily cases started to decrease, people started to leave home and to increase the other mobility types (parks, grocery, retail, transit and workplace). Quantifying the relationship between the number of daily cases and the change in people’s mobility can be insightful. We want to know how strongly people responded to the daily new cases. This can be expressed by the correlation coefficient between the reported daily cases and the change in mobility on the following day. It takes values between -1 and 1. A negative value indicates that as the reported daily cases increased, people’s mobility decreased the following day. A positive value indicates that as the reported daily cases increased, people’s mobility increased the following day. A value near zero indicates little response, and a value near one in magnitude indicates a strong response. The values of these correlation coefficients are also reported in the app, in the small boxes below the trend plots, as shown in the following figure.
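
A minimal sketch of that calculation (with a hypothetical `state_df` data frame holding daily `daily_cases` and `pct_change` columns for one state and one mobility type; not the app's actual code):

library(dplyr)

# correlate today's reported cases with the mobility change on the following day
state_df %>%
  summarise(response = cor(daily_cases, lead(pct_change), use = "complete.obs"))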

As presented by the numbers for NY, people react inversely to the increase in the number of daily cases for the retail, grocery, transit, park and workplace types. As the daily cases increase, people limit their mobility in those categories, either by personal choice or because of the closure of some places. For residential mobility, however, as the daily cases increase, people stay home, and the correlation is about 27%.

An interesting question to ask would be whether any states responded anomalously to the trend in diagnosed cases – Florida, for example. The difference is noticeable: as the daily cases increase, people do not stay home after the middle of April. People stopped going to the beaches and to the parks until the middle of April, when the number of daily cases was almost constant. However, they kept going out even though the daily cases had been increasing since June. Note that the grocery, parks, retail, transit, and workplace mobility types increased right after the middle of April, which is when the state started to open. That also had a great effect on the large increase in the number of daily cases – a positive feedback loop.

People in states outside New York reacted to the number of daily cases in NY because New York’s daily cases dominated the total number in the US at the start of the crisis. Therefore, people in the other states used those numbers as their reference to decide how to mobilize. As shown from the figure above, in FL people started to increase their mobility outside their houses when the daily cases in NY started to decrease while the daily cases in FL were plateauing. FL residents reacted more to the numbers in NY than in FL itself.

Another example is California, shown in the figure below. Though California had not reached its peak by the middle of April, people still decided to go out when NY daily cases started to drop, which also supports the previous analysis.

How did people react to daily cases in NYC? 

Now let’s look at NYC. The more daily cases were reported, the more people stayed home—53% correlation—and the more they limited their other mobility types (~ -46% correlation on average). Part of the decrease in mobility was due to the lockdown orders, and the other part of it was due to people’s voluntary actions. The decrease in retail, transit and workplace mobility types was more probably due to the lockdown order. On the other hand, the decrease in grocery and park mobility types was more probably due to people’s voluntary actions. However, NYC residents started to go to the parks after mid-April when the reported daily cases started to decrease. After staying home for so long, they felt the need to get out, especially as warmer weather lured them outdoors. There was a huge increase in park-related mobility during weekends and on holidays like Memorial Day weekend. The number of daily cases and the positive reaction to mobility could have been different if people in other states had considered their local daily cases rather than the global daily cases.

The percentage change in mobility in the different NYC counties showed significant differences during the period between March 1st and July 29th. The following map shows, as an example, the change in retail mobility on June 15th in the five NYC counties (the Bronx, Manhattan, Richmond, Queens and Kings counties). Retail mobility also includes restaurants and theaters.

P.S. June 15th is used as an example date for the following analysis. The user of the app can change the date to any day between March 1st and July 29th.

The biggest decrease in retail mobility was in Manhattan (70%), where most of NYC's retail businesses, theaters and restaurants exist. The lowest decrease on that day was in the Bronx (30%). From these data, it is clear just how hard retail businesses in NYC have been hit, especially in the Manhattan area.

A similar decrease was also observed in workplace mobility on June 15th. The decrease ranged from 45% in the Bronx to 65% in Manhattan. Manhattan took the biggest hit of -65%, since it has most of the business offices in NYC.

The decrease in grocery and pharmacy mobility ranged from 10 to 30% on June 15th in NYC. Manhattan was the hardest-hit county (-30%), while the lowest change was in Queens (-10%). People generally tried to limit their trips to the grocery and drug stores in the five counties because of the pandemic. Manhattan had the highest decrease, which can be attributed to the reduced human activity in the area. When Manhattan was open, it was the main business and entertainment destination in NYC, and people would visit the nearby grocery and drug stores while in Manhattan for work or entertainment. But when Manhattan was on lockdown (offices and retail were closed), it lost its significance as the main business and entertainment destination in NYC, and people visited Manhattan – and accordingly its grocery and drug stores – much less.

The mobility change percentages in the previous three maps highlight how much mobility at retail, workplaces, and grocery and drug stores was affected by COVID-19. They also show how this effect varied across the five NYC counties. The drop in these three mobility types had, in turn, a great impact on the economy of NYC, and Manhattan was the most affected county.

The following map shows how mobility on public transportation, such as the subway and bus systems, changed on June 15th. The change is similar to the above maps, which reflects the strong correlation between the retail, workplace, grocery, and transit mobility types. Since people limited their trips to work, to shop, to entertainment, or to get groceries, they used the transit system less. As expected, Manhattan had the highest decrease in transit mobility (-65%).

Mobility at the parks had a different distribution on June 15th, as shown on the following map. It ranged from -20 to 100%. Manhattan had the greatest decrease in parks mobility (-20%). That can be attributed to the facts that there are fewer parks in Manhattan, that people preferred not to go to the parks in Manhattan, that people generally chose to avoid going out in Manhattan, or that going out in Manhattan was heavily restricted. On the contrary, Queens had an increase in park mobility of 100%. That can be attributed to there being more or better parks in Queens, to residents from the other counties visiting those parks, or to there being fewer restrictions on parks in Queens. Moreover, since entertainment destinations such as theaters and restaurants, which are mostly in Manhattan, were closed, people may have decided to visit the parks in other parts of the city, like Queens, Kings and Richmond counties, where there was an increase in parks mobility as shown on the map.

Mobility in NYC has been dramatically affected by the pandemic. There are variations in how each mobility type changed in each county, and that change is related to the kinds of prominent activities and businesses in each of these counties. The application provides visualizations of these variations that can be helpful for understanding how the pandemic affected our mobility, behavior and different businesses.

And hopefully one day, we will get to enjoy the city again!

The application is available for use at COVID-MOBILITY 2020.

Data Sources

Google LLC “Google COVID-19 Community Mobility Reports”

Johns Hopkins Daily Reports

New York Times COVID-19 Reports

Geocodes from Healthcare.gov



Credit Risk Modelling using Machine Learning: A Gentle Introduction

Sun, 08/02/2020 - 16:05

[This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers.]


Assume you are given a dataset for a large bank and you are tasked to come up with a credit risk score for each customer. You have just been briefed that you are going to work on this project and you have to come up with a prototype demonstrating how this problem could be solved.

Approach

Credit risk scoring is a very complicated process with a lot of due diligence on data, model reviews, internal controls, and sign-offs. As a first step, you could follow the steps outlined below with the accompanying code to create a straw-man version of your approach.

The first step in building your prototype will be obtaining a sample dataset and performing high level analysis on it.

#setting up the data and performing high level analysis#
########################################################
#downloading the data
#https://raw.githubusercontent.com/obaidpervaizgill/CreditRiskModelling/master/credit.csv

#loading data
credit <- read.csv("credit.csv")

#identifying the structure of variables
str(credit)

#getting summary of the variables
summary(credit)

#getting the column names
colnames(credit)
#[1] "checking_balance"     "months_loan_duration" "credit_history"       "purpose"              "amount"               "savings_balance"
#[7] "employment_duration"  "percent_of_income"    "years_at_residence"   "age"                  "other_credit"         "housing"
#[13] "existing_loans_count" "job"

#tabulating dependent variables
table(credit$default)

#No missing values in the data
#Note : I would have used the "mice" package in R to impute missing values if there were any

#Normalizing or standardizing data
#Note : I would have scaled the variables using standardization or minmax normalization, but I haven't done this here!

#Removing correlated features
#Note : I would have removed correlated features based on an 80 percent correlation rule in the correlation matrix

#splitting data into test and train
library(caTools)
split <- sample.split(credit$default, SplitRatio = 0.70)
train <- subset(cbind(credit,split), cbind(credit,split)$split == TRUE)
test <- subset(cbind(credit,split), cbind(credit,split)$split == FALSE)

#checking proportions across train and test
prop.table(table(train$default))
prop.table(table(test$default))

The second step in your prototype will be to train an explainable model, such as a logistic regression model so that you can identify and explain the driving variables.

#training a model using logistic regression#
############################################
#training a model
creditLogReg <- glm(train$default ~ ., data = train[,c(-17,-18)], family = "binomial") #removing split feature and dependent variable
summary(creditLogReg) #summary of the model output
#Note: In theory I should rerun the model removing the non-significant features but since I want to demonstrate multiple model usage I would let it slide

#predicting on test data
predCreditLogReg <- predict(creditLogReg, newdata = test[,c(-17,-18)], type = "response")
table(test$default, predCreditLogReg > 0.5)
#Note: we want our model to be optimally sensitive hence we use 0.5 as the threshold, reducing the threshold will make the model more sensitive

#computing the accuracy of the model
accuracyCreditLogReg <- ((as.matrix(table(test$default, predCreditLogReg > 0.5))[1,1]) + (as.matrix(table(test$default, predCreditLogReg > 0.5))[2,2]))/nrow(test)

#computing the baseline model for comparison
baseLineAccuracy <- max(table(test$default))/nrow(test)
print(accuracyCreditLogReg)
print(baseLineAccuracy)
#Note : Our simple logistic regression model beats the baseline model

#assessing the robustness of model
library(ROCR)
rocrPredCreditLogReg <- prediction(predCreditLogReg, test$default)
areaUnderCurve <- as.numeric(performance(rocrPredCreditLogReg, "auc")@y.values) #out of sample auc
print(areaUnderCurve)
#Note : Closer to 1 is better, 0.78 here is not bad for a first model

The third step in your prototype will be to train a more complicated model to assess whether you can improve over your explainable model, through additional tuning as well.

#training a model using decision trees#
#######################################
library("rpart")
library("rpart.plot")

#training a model
creditDecTree <- rpart(train$default ~ ., data = train[,c(-17,-18)], method = "class", minbucket = 1) #min bucket is minimum number of observations in a terminal node
summary(creditDecTree) #summary of the model output

#plotting a decision tree to see splits
prp(creditDecTree)

#predicting on test data
predictCreditDecTree <- predict(creditDecTree, newdata = test[,c(-17,-18)], type = "class") #getting classes rather than probability

#computing the accuracy of the model
table(test$default, predictCreditDecTree) #since we dont have a probability here we dont set a threshold
accuracyCreditDecTree <- ((as.matrix(table(test$default, predictCreditDecTree))[1,1]) + (as.matrix(table(test$default, predictCreditDecTree))[2,2]))/nrow(test)

#computing the baseline model for comparison
baseLineAccuracy <- max(table(test$default))/nrow(test)
print(accuracyCreditDecTree)
print(baseLineAccuracy)
#Note: Our decision tree model beats the baseline model in terms of accuracy

#assessing the robustness of model
library(ROCR)
rocrPredictCreditDecTree <- prediction((predict(creditDecTree, newdata = test[,c(-17,-18)])[,2]), test$default) #getting probability and then picking predicted class
areaUnderCurve <- as.numeric(performance(rocrPredictCreditDecTree, "auc")@y.values) #out of sample auc
print(areaUnderCurve)

#tuning a model using decision trees#
#####################################
library(caret)
#tuning for complexity parameter, this penalizes model complexity and avoids overfitting
tuneGridDecTree <- expand.grid(.cp=seq(0.01,0.5,0.01)) #creating a list of parameters to be passed onto the model
fitControlDecTree <- trainControl(method = "cv", number = 10)
tunedCreditDecTree <- train(train$default ~., data = train[,c(-17,-18)],
                            method = "rpart",
                            trControl = fitControlDecTree,
                            tuneGrid = tuneGridDecTree)
tunedPredictCreditDecTree <- predict(tunedCreditDecTree, newdata=test[,c(-17,-18)], type="raw")

#computing the accuracy of the model
table(test$default, tunedPredictCreditDecTree) #since we dont have a probability here we dont set a threshold
accuracyTunedCreditDecTree <- ((as.matrix(table(test$default, tunedPredictCreditDecTree))[1,1]) + (as.matrix(table(test$default, tunedPredictCreditDecTree))[2,2]))/nrow(test)

The final step in your prototype will be to train a highly robust but more black-box model, to assess whether you can improve over your existing approaches and whether it is worthwhile to pursue this path.

#training a model using random forest#
#######################################
library(randomForest)

#training a model
creditRandFor <- randomForest(as.factor(train$default) ~., data = train[,c(-17,-18)], nodesize = 25, ntree = 200)
summary(creditRandFor) #summary of the model output

#identifying the most important variables based on mean gini decrease
varImpPlot(creditRandFor)
#Note : Shows how each split results in low impurities or increased homogeneity

#predicting on test data
predictCreditRandFor <- predict(creditRandFor, newdata = test[,c(-17,-18)])

#computing the accuracy of the model
table(test$default, predictCreditRandFor) #since we dont have a probability here we dont set a threshold
accuracyCreditRandFor <- ((as.matrix(table(test$default, predictCreditRandFor))[1,1]) + (as.matrix(table(test$default, predictCreditRandFor))[2,2]))/nrow(test)

#computing the baseline model for comparison
baseLineAccuracy <- max(table(test$default))/nrow(test)
print(accuracyCreditRandFor)
print(baseLineAccuracy)
#Note: Our random forest model beats the baseline model in terms of accuracy

#assessing the robustness of model
library(ROCR)
rocrPredictCreditRandFor <- prediction((predict(creditRandFor, newdata = test[,c(-17,-18)], type = "prob")[,2]), test$default) #getting probability and then picking predicted class
areaUnderCurve <- as.numeric(performance(rocrPredictCreditRandFor, "auc")@y.values) #out of sample auc
print(areaUnderCurve)
#Note : Very high area under the curve but slightly less than logistic regression
#Note : Very high accuracy, as good as logistic regression

#tuning a model using random forest#
#####################################
#Note : We could tune it using the tuneRF package but repeated cross validation using caret produces much better results
library(caret)
#tuning for mtry, this is the number of variables randomly sampled for splits
tuneGridRandFor <- expand.grid(.mtry=c(1:sqrt(ncol(train[,c(-17,-18)])))) #creating a list of parameters to be passed onto the model
fitControlRandFor <- trainControl(method = "repeatedcv",
                                  number = 5,
                                  repeats = 3, #five-fold cross validation repeated 3 times
                                  classProbs = TRUE,
                                  summaryFunction = twoClassSummary)
tunedCreditRandFor <- train(as.factor(train$default) ~., data = train[,c(-17,-18)],
                            method = "rf",
                            trControl = fitControlRandFor,
                            verbose = TRUE,
                            metric = "ROC",
                            tuneGrid = data.frame(tuneGridRandFor),
                            importance = TRUE)
tunedPredictCreditRandFor <- predict(tunedCreditRandFor, newdata = test[,c(-17,-18)])

#computing the accuracy of the model
table(test$default, tunedPredictCreditRandFor) #since we dont have a probability here we dont set a threshold
accuracyTunedCreditRandFor <- ((as.matrix(table(test$default, tunedPredictCreditRandFor))[1,1]) + (as.matrix(table(test$default, tunedPredictCreditRandFor))[2,2]))/nrow(test)

Conclusion

Depending on the problem you are trying to solve, pick the model that best serves your case; the simplest one is usually preferable unless a more complicated one is significantly better. Also note that while there may be a temptation to jump straight into modelling, most improvements in model performance come from data wrangling and from creating new features for your models.

Related Post


To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Weathering the Storm

Sun, 08/02/2020 - 02:00

[This article was first published on R | Quantum Jitter, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Covid-19 began battering the financial markets in February. Which sectors are faring best?

I’ll compare each sector in the S&P 500 with the overall market. And I’ll baseline each at 100% as of February 19th, 2020 so we can see which have recovered lost ground.

symbols <- c(
  "EOD/SPY", "EOD/XLV", "EOD/XLK", "EOD/XLE",
  "EOD/XLF", "EOD/XLC", "EOD/XLI", "EOD/XLY",
  "EOD/XLP", "EOD/XLRE", "EOD/XLU", "EOD/XLB"
)

from <- "2020-02-19"

eod_sectors <- tq_get(symbols, get = "quandl", from = from) %>%
  group_by(symbol) %>%
  mutate(
    norm_close = adj_close / first(adj_close),
    type = if_else(symbol == "EOD/SPY", "Market", "Sector"),
    sector = case_when(
      symbol == "EOD/SPY" ~ "S&P 500",
      symbol == "EOD/XLB" ~ "Materials",
      symbol == "EOD/XLE" ~ "Energy",
      symbol == "EOD/XLU" ~ "Utilities",
      symbol == "EOD/XLI" ~ "Industrials",
      symbol == "EOD/XLRE" ~ "Real Estate",
      symbol == "EOD/XLV" ~ "Health",
      symbol == "EOD/XLK" ~ "Technology",
      symbol == "EOD/XLF" ~ "Financial",
      symbol == "EOD/XLC" ~ "Communication",
      symbol == "EOD/XLY" ~ "Consumer Discretionary",
      symbol == "EOD/XLP" ~ "Consumer Staples",
      TRUE ~ "Other"
    )
  ) %>%
  ungroup()

With all that home-working and web conferencing, perhaps not too surprising to see Tech and Comms doing relatively well, along with Consumer Discretionary and Health.

eod_sectors %>%
  mutate(
    sector = str_wrap(sector, 12),
    sector = fct_reorder(sector, norm_close, last, .desc = TRUE)
  ) %>%
  ggplot(aes(date, norm_close, colour = type)) +
  geom_rect(aes(xmin = min(date), xmax = max(date), ymin = -Inf, ymax = Inf),
            fill = if_else(eod_sectors$type == "Market", cols[3], NULL),
            colour = "white") +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "grey80") +
  geom_line(key_glyph = "timeseries") +
  facet_wrap(~sector) +
  scale_colour_manual(values = cols[c(1, 4)]) +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "S&P 500 Sector Impact of Covid-19",
    subtitle = glue("Relative to {from}"),
    x = NULL, y = NULL, colour = NULL
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

R Toolbox

Summarising below the packages and functions used in this post enables me to separately create a toolbox visualisation summarising the usage of packages and functions across all posts.

Package / Functions used
base: library[8]; c[1]; conflicts[1]; cumsum[1]; function[1]; max[1]; min[1]; search[1]; sum[1]
dplyr: mutate[6]; if_else[5]; filter[4]; group_by[2]; tibble[2]; arrange[1]; as_tibble[1]; case_when[1]; desc[1]; first[1]; select[1]; summarise[1]; ungroup[1]
forcats: fct_reorder[1]
ggplot2: aes[2]; element_text[1]; facet_wrap[1]; geom_hline[1]; geom_line[1]; geom_rect[1]; ggplot[1]; labs[1]; scale_colour_manual[1]; scale_y_continuous[1]; theme[1]; theme_bw[1]; theme_set[1]
glue: glue[2]
kableExtra: kable[1]
lubridate: date[2]
purrr: map[1]; map2_dfr[1]; possibly[1]; set_names[1]
Quandl: Quandl[1]; Quandl.api_key[1]
readr: read_lines[1]
rebus: literal[4]; lookahead[3]; whole_word[2]; ALPHA[1]; lookbehind[1]; one_or_more[1]; or[1]
scales: percent_format[1]
stringr: str_detect[3]; str_c[2]; str_remove[2]; str_count[1]; str_remove_all[1]; str_wrap[1]
tibble: enframe[1]
tidyquant: tq_get[1]
tidyr: tibble[2]; as_tibble[1]; unnest[1]
wesanderson: wes_palette[1]
xts: first[1]
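For readers curious how such a tally might be produced, here is a hedged sketch of one way to count function calls in a post's R chunks; it is not necessarily the author's actual toolbox code, and "post.Rmd" is a placeholder file name.

library(knitr)

purl("post.Rmd", output = "post.R", quiet = TRUE)            # extract the R chunks from the post
tokens <- getParseData(parse("post.R", keep.source = TRUE))  # token-level parse data
calls  <- tokens$text[tokens$token == "SYMBOL_FUNCTION_CALL"]
head(sort(table(calls), decreasing = TRUE))                  # most-used functions, with counts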

To leave a comment for the author, please follow the link and comment on their blog: R | Quantum Jitter. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Choroplethr v3.6.4 is now on CRAN (and the Future of Choroplethr)

Sat, 08/01/2020 - 18:00

[This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Choroplethr v3.6.4 is now on CRAN. This is the first update to the package in two years, and was necessary because of a recent change to the tigris package, which choroplethr uses to make Census Tract maps. I also took this opportunity to add new example demographic data for Census Tracts.

Quick Demo

The first new dataset is ?df_pop_ny_tract, which contains 2012 population estimates for all Tracts in New York State. Here is an example of mapping the data:

install.packages("choroplethr")
library(choroplethr)

data(df_pop_ny_tract)

tract_choropleth(df_pop_ny_tract,
                 state_name = "new york",
                 title      = "NY State Tract Population Estimates (2012)",
                 legend     = "Population")

There are a total of 4,918 Census Tracts in New York State. As you can see, viewing them all at once isn’t very useful. My hope is that one day Choroplethr will also support interactive maps, which might make statewide Tract maps more useful. Until then, I recommend using the county_zoom parameter. This parameter requires a County FIPS Code. Here’s an example of zooming in on Manhattan, which has the FIPS Code 36061:

tract_choropleth(df_pop_ny_tract,
                 state_name  = "new york",
                 county_zoom = 36061,
                 title       = "Manhattan Tract Population Estimates (2012)",
                 legend      = "Population")

 

The second dataset I added is ?df_ny_tract_demographics, which contains 8 demographic variables for Census Tracts in NY from 2013. Here is how to map Per Capita Income in Manhattan. Note that like all choroplethr functions, tract_choropleth requires a data.frame with one column named region and one column named value. df_ny_tract_demographics does not come with a column named value, so we must set it ourselves.

data(df_ny_tract_demographics)
df_ny_tract_demographics$value = df_ny_tract_demographics$per_capita_income

tract_choropleth(df_ny_tract_demographics,
                 state_name  = "new york",
                 county_zoom = 36061,
                 title       = "Manhattan Tract Income Estimates (2013)",
                 legend      = "Per Capita Income")

Exploratory Data Analysis

One of the main goals with choroplethr is to facilitate the exploration of the Census Bureau’s American Community Survey (ACS). If you want to explore Tract level demographics outside of New York’s 2013 data, then use the function ?get_tract_demographics. Here is an example of comparing the variables percent_white and percent_black in Chicago (Cook County, Illinois – FIPS Code 17031).

df_chicago_tract_demographics = get_tract_demographics("illinois", 17031, 2018, 5)

# create and store the % white map
df_chicago_tract_demographics$value = df_chicago_tract_demographics$percent_white
chicago_percent_white = tract_choropleth(df_chicago_tract_demographics,
                                         state_name  = "illinois",
                                         county_zoom = 17031,
                                         title       = "",
                                         legend      = "Percent White")

# create and store the % black map
df_chicago_tract_demographics$value = df_chicago_tract_demographics$percent_black
chicago_percent_black = tract_choropleth(df_chicago_tract_demographics,
                                         state_name  = "illinois",
                                         county_zoom = 17031,
                                         title       = "",
                                         legend      = "Percent Black")

# place the maps side-by-side
double_map(chicago_percent_white, chicago_percent_black,
           title = "Racial Characteristics of Chicago Census Tracts\n2018 ACS")

Technical Details

To fully understand this change to choroplethr, you need to understand two things: where Choroplethr’s maps come from, and the development of the Simple Features format for storing map data.

Where Choroplethr’s maps come from

When Choroplethr was first released in 2014, there were a lot of problems with maps in R. For example, the maps package, which ships with R, contained a world map … that included the USSR. It also had a US County map that had errors. Also, I believe that none of the US maps in that package contained Alaska or Hawaii.

As a result of these limitations I created a new package, choroplethrMaps, that contains official US maps that come from the US Census Bureau. The choroplethr approach to mapping proved to be very popular, and people increasingly asked for me to add more maps to the package. As such, and through the generosity of my employer at the time, I created more packages with more maps. They all followed a similar format:

choroplethrMaps
  Map variables: state.map, county.map, country.map
  Metadata variables: state.regions, county.regions, country.regions
  Mapping functions: state_choropleth, county_choropleth, country_choropleth
  Notes: US States and Counties, plus Countries of the world

choroplethrAdmin1
  Map variables: admin1.map
  Metadata variables: admin.regions
  Mapping functions: admin_choropleth
  Notes: Administrative subdivisions of each country of the world (e.g. prefectures of Japan)

choroplethrZip
  Map variables: zip.map
  Metadata variables: zip.regions
  Mapping functions: zip_choropleth
  Notes: US ZIP Code Tabulation Areas, which serve as a proxy for ZIP Codes for US Census data.

While this worked for a while, at some point there are just too many maps. For example, the national ZCTA map is 419 MB, which CRAN considers to be too large to host. (It is hosted on github instead). And the Census Bureau does not even release a national Tract map; rather they publish a separate one for each State. So if you wanted to package all the Tract maps, you would need 50 different packages. At some point, this approach simply doesn’t scale.

Enter Tigris

Kyle Walker, the author of the Tigris package, found a great way around this problem. Tigris uses the Census Bureau’s API to pull Census Tract maps on demand. This obviates the need to package maps at all, although you will likely still need to process and generate metadata on them (see ?zip.map and ?zip.regions for an example of this processing and metadata).

Behind the scenes, choroplethr’s ?tract_choropleth function uses the Tigris package to fetch the map you need on demand.
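As an illustration of the on-demand approach, here is a hedged sketch of calling tigris directly (this is not choroplethr's internal code):

library(tigris)
options(tigris_use_cache = TRUE)   # cache downloads between sessions

# pull the Census Tract geometry for Manhattan (New York County, NY) on demand
manhattan_tracts <- tracts(state = "NY", county = "New York", cb = TRUE)
class(manhattan_tracts)            # recent tigris versions return a Simple Features (sf) object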

Simple Features

Choroplethr uses ggplot2 to render maps. When Choroplethr was first released in 2014, the only way ggplot2 could render maps was if they were in the Shapefile format. ggplot2 required you to read the Shapefile into R, convert it to a “fortified” data.frame, and then pass it that data.frame. Choroplethr stores all of its maps as “fortified” data.frames, so that ggplot2 can render the maps as quickly as possible.
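A minimal sketch of that older workflow, assuming a hypothetical local shapefile called tracts.shp (rgdal's readOGR was the usual reader at the time):

library(rgdal)      # readOGR() predates the sf package
library(ggplot2)

shp    <- readOGR(dsn = ".", layer = "tracts")   # read the Shapefile
shp_df <- fortify(shp)                           # "fortified" data.frame of polygon vertices

ggplot(shp_df, aes(long, lat, group = group)) +
  geom_polygon(fill = "grey80", colour = "white") +
  coord_equal()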

Now R is moving away from Shapefiles to a new format called Simple Features. Tigris, with its latest version, made Simple Features the default format it returns from the Census API. This change broke Choroplethr’s ability to render Tract maps at all, because Choroplethr expected all maps to be in the old format.

In short, this release updated Choroplethr so that it can render maps that are encoded as Simple Features.
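For contrast, a small sketch of the Simple Features route, using the example shapefile that ships with the sf package rather than a choroplethr map:

library(sf)
library(ggplot2)

# nc.shp is bundled with sf, so this runs anywhere sf is installed
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

ggplot(nc) +
  geom_sf(aes(fill = AREA)) +   # geom_sf renders sf objects directly
  theme_void()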

Future Work

As the R mapping world continues to migrate to Simple Features, it would be nice to migrate all of the maps that Choroplethr uses to Simple Features as well. This would allow me, among other things, to start work on making Choroplethr produce interactive maps. While ggplot2 does not support interactive maps, the Leaflet package does. But Leaflet requires maps to be stored in Simple Features. (While leaflet can also render Shapefiles, it cannot render the “Fortified data.frames” that ggplot2 used to exclusively require, and which choroplethr uses to store its data).
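A hedged sketch of what that could eventually look like, again with sf's bundled example data rather than one of choroplethr's own maps:

library(sf)
library(leaflet)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc <- st_transform(nc, 4326)     # leaflet expects longitude/latitude (WGS84)

leaflet(nc) %>%
  addTiles() %>%                 # background map tiles
  addPolygons(weight = 1, fillColor = "steelblue", fillOpacity = 0.5)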

As the above table shows, Choroplethr has 5 maps stored in 3 packages. Most of them were modified / processed prior to import, and each of them also contains metadata that needs to be tested and possibly updated as a result of the change. The packages then need to be re-submitted to CRAN and summarized in a blog post. I also need to work with the maintainers of the packages that import these maps so that their packages do not break. I estimate this project to take about 3 weeks.

Choroplethr was originally created during the “Innovation Weeks” at my previous employer. All employees had a full week each quarter to work on the project of their choice. If I were still at that company, then I would do this work over a few “Innovation Weeks”, as well as during any downtime I had between regularly scheduled projects.

But now that I am self employed, I am not sure if I can afford to take the time to make these changes. In fact, I am beginning to see how the maps package became outdated: finding funding to maintain and update an open source project is hard.

Choroplethr has over 175k total downloads, including over 3k in the last month. These numbers indicate that Choroplethr is widely used and valuable to the R community. While Choroplethr is popular, it is unfortunately now using outdated mapping technology. There is a clear path to address this, but it will require funding to make it a reality.

If you know of an organization that is able to fund this work, then please contact me.

The post Choroplethr v3.6.4 is now on CRAN (and the Future of Choroplethr) appeared first on AriLamstein.com.


To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Visualisation options to show growth in occupations in the Australian health industry by @ellis2013nz

Sat, 08/01/2020 - 16:00

[This article was first published on free range statistics - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Visualising growth in occupations in one industry

A chart is doing the rounds purporting to show the number of administrators working in health care in the USA has grown much faster than the number of physicians – more than 3,000% growth from 1970 to 2009 for administrators (allegedly) compared to about 90% or so for physicians. I don’t much like the original chart so I’ve relegated it to the bottom of this post. It presumably dates from the time of the debates about the introduction of the Affordable Care Act (aka ‘Obamacare’). I find it very difficult to believe the 3,000% number, and suspect there is either deliberate definitional sleight of hand going on, or a genuine classification challenge. One obvious possibility is that some “administrator” classification has been cherry-picked that was very rarely present under that name in the 1970s, and much of the growth is movement from other differently-classified roles into that one.

A similar visualisation with Australian Labour Force Survey data

It did cross my mind that the problem was the visualisation method; in fact the tweet that brought this to my attention was from a researcher wondering what it would look like if it showed absolute numbers rather than cumulative growth. While I’m not really interested in the facts of “administrators” in the US health system, the broader data viz question sounded like something I should know about. So I had a look at Australian figures from the Australian Bureau of Statistics’ Labour Force Survey Quarterly Detail. Here is my own version of a chart showing cumulative growth in various occupations in an industry:

Actually, I think my chart is much better than the US original, not only because it uses an official and well-defined occupation classification, but because it has a go at showing absolute size as well. So we can see that while the total hours worked in the health care and human services industry by managers and professionals who aren’t health-specific (more on this below) have grown fast, the orange and grey dots are still small compared to the pink dots that represent health professionals.

The industry I’m looking at here is “Health Care and Social Assistance”, so some of those managers and other professionals (lawyers, accountants, statisticians, etc) are in social assistance rather than health, but this is as granular as we can get for an occupation and industry crosstabulation with this data without a custom request to access the microdata.

In fact, clearly one of the big stories from this chart is the thick blue dots and the rapidly rising blue line – community and personal care workers. The biggest occupations by far in that category in this industry are “Aged and Disabled Carers” and “Child Carers”, and the growth in importance in these roles (particularly the former) is one of the big news stories of the economy seen over a few decades’ perspective.

I have split the “Professionals” ANZSCO code (the lowest level published by industry in this dataset) into health and other, by getting employment hours for everyone in occupation codes 2500 to 2544 (“Health Professionals not further defined” to “Registered Nurses”). This is from a different cut of the data from the industry version and is only published for “all industries”. I adjust the all-industries health professionals number downwards by about 14%, based on the 2011 Census which tells us that there were 433,726 health professionals in total of which 373,609 worked in the health care and social assistance industry (see screenshot at bottom of post). For example, a mining company or sports team can hire a doctor or nurse. To avoid working on this thing all weekend, I’ve applied that single correction across all years of data.
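For concreteness, the adjustment factor implied by those 2011 Census counts is simply:

# proportion of all health professionals who worked in health care and social assistance (2011 Census)
373609 / 433726   # about 0.861, i.e. scale the all-industries hours down by roughly 14%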

Other ways of showing this data

What are the other obvious ways to visualise this data? Obviously, in absolute numbers as a stacked area chart:

… or as above but with position “fill” so we see changing proportions:

All three of these methods are completely valid.

  • The very first line chart – cumulative growth – is visually equivalent to converting labour hours to an index. It’s great for showing growth over time, and for many purposes would be suitable. For example, it nicely highlights that the number of labourers in the health care and social assistance industry has declined, and the fastest growing occupation types are managers and non-health professionals.
  • The second – absolute numbers – highlights the aggregate size and growth of labour in these occupations, while still allowing basic comparisons of changes.
  • The third – proportions – lets you see changes in the proportion of the workforce while still getting a snapshot overview (like a pie chart would, but for many times). In this case the change we see is the growth in community and personal service workers rather than health professionals.
Physicians / medical practitioners compared to other health professionals?

The original US chart had focused specifically on “physicians” and I’ve used a broader category of “Health Professionals”. This prompted me to do one last bit of analysis with this – to find out how much of the Australian health profession’s labour is by medical practitioners and whether this is changing. I was surprised to see that the proportion of all health professional labour done by medical practitioners of various sorts (there is no “physician” in the ANZSCO so I chose the combination of unit groups I thought was closest to this) has stayed pretty constant over the past 35 years:

That’s a very boring chart, but it’s boring for interesting reasons – a fairly steady composition of employment hours within the health professionals category, at least when divided between medical practitioners and others.

So what occupations are growing fast?

Finally, I was intrigued by the 3,000% cumulative growth in the one occupation in the original US chart. Could an occupation really grow this much in a few decades? Turns out we have a couple in the Australian data, but in our case I think these are genuine changes in employment patterns. Professional outdoor guides and ICT test engineers are two professions that we believe really have grown materially in the last few decades, partly due to changes in demand and workflow and partly due to specialisation and reclassification of other roles.

I like that plot because it gives a real snapshot (at least from one perspective) of how the economy has changed over 32 years.

Code

Here’s today’s R code, all in one chunk. The most interesting thing here is the need to use the occupation detailed data (cube EQ08) to separate out the single digit occupation data that we get from the higher level industry by occupation data in cube EQ09.

library(tidyverse) library(readxl) library(janitor) library(scales) library(RColorBrewer) library(lubridate) #==================Data management======================= #----------Industry by occupation---------------- url <- "https://www.abs.gov.au/AUSSTATS/subscriber.nsf/log?openagent&eq09.zip&6291.0.55.003&Data%20Cubes&0A68E700485EF985CA2585910016AF26&0&May%202020&25.06.2020&Latest" download.file(url, "lfs-eq09.zip", mode = "wb") unzip("lfs-eq09.zip") eq09 <- read_excel("EQ09.xlsx", sheet = "Data 1", skip = 3) %>% clean_names() %>% rename(industry1 = industry_division_of_main_job_anzsic_2006_rev_2_0, occupation1 = occupation_major_group_of_main_job_anzsco_2013_v1_2) %>% mutate(total_hours = number_of_hours_actually_worked_in_all_jobs_employed_full_time_000_hours + number_of_hours_actually_worked_in_all_jobs_employed_part_time_000_hours) #------------Detailed occupation------------- url <- "https://www.abs.gov.au/AUSSTATS/subscriber.nsf/log?openagent&eq08.zip&6291.0.55.003&Data%20Cubes&CB6124B8CB5B515DCA2585910016A499&0&May%202020&25.06.2020&Latest" download.file(url, "lfs-eq08.zip", mode = "wb") unzip("lfs-eq08.zip") eq08 <- read_excel("EQ08.xlsx", sheet = "Data 1", skip = 3) %>% clean_names() %>% rename( occupation4 = occupation_of_main_job_anzsco_2013_v1_2, total_hours = number_of_hours_actually_worked_in_all_jobs_000_hours ) health_profs <- c( "2500 Health Professionals nfd" , "2510 Health Diagnostic and Promotion Professionals nfd" , "2511 Nutrition Professionals" , "2512 Medical Imaging Professionals" , "2513 Occupational and Environmental Health Professionals" , "2514 Optometrists and Orthoptists" , "2515 Pharmacists" , "2519 Other Health Diagnostic and Promotion Professionals" , "2520 Health Therapy Professionals nfd" , "2521 Chiropractors and Osteopaths" , "2522 Complementary Health Therapists" , "2523 Dental Practitioners" , "2524 Occupational Therapists" , "2525 Physiotherapists" , "2526 Podiatrists" , "2527 Audiologists and Speech Pathologists \\ Therapists" , "2530 Medical Practitioners nfd" , "2531 General Practitioners and Resident Medical Officers" , "2532 Anaesthetists" , "2533 Specialist Physicians" , "2534 Psychiatrists" , "2535 Surgeons" , "2539 Other Medical Practitioners" , "2540 Midwifery and Nursing Professionals nfd" , "2541 Midwives" , "2542 Nurse Educators and Researchers" , "2543 Nurse Managers" , "2544 Registered Nurses" ) #---------------------mucking around with classifications----------- # We will find the total hours by health professionals (codes above)... health_profs_hours <- eq08 %>% filter(occupation4 %in% health_profs) %>% group_by(mid_quarter_month) %>% summarise(health_prof_hours = sum(total_hours), # adjust downwards to crudely remove health professionals in other industries # Source: Table Builder for Census 2011 (note this ratio is applying to our # whole time period, so this is really rough) health_prof_hours = health_prof_hours * 373609 / 433726 ) %>% mutate(occupation1 = "Professionals") # And subtract it from the Professionals in the Health Care and Social Assistance Industry, # to get a data frame that has two types of professionals. Note this is problematic because # of health professionals in other industries. 
separated_profs <- eq09 %>% filter(industry1 == "Health Care and Social Assistance") %>% group_by(occupation1, mid_quarter_month) %>% summarise(total_hours = sum(total_hours)) %>% inner_join(health_profs_hours, by = c("mid_quarter_month", "occupation1")) %>% mutate(other_profs_hours = total_hours - health_prof_hours) %>% ungroup() %>% select(-total_hours, -occupation1) %>% gather(occupation, total_hours, -mid_quarter_month) %>% mutate(occupation = case_when( occupation == "health_prof_hours" ~ "Health Professionals", occupation == "other_profs_hours" ~ "Non-Health Professionals" )) # join back to the original data d <- eq09 %>% filter(industry1 == "Health Care and Social Assistance" & occupation1 != "Professionals") %>% rename(occupation = occupation1) %>% group_by(occupation, mid_quarter_month) %>% summarise(total_hours = sum(total_hours)) %>% ungroup() %>% rbind(separated_profs) %>% mutate(occupation = fct_reorder(occupation, total_hours)) #==========Plotting============ #----------named palette and caption------------- occ_palette <- brewer.pal(length(unique(d$occupation)), "Set1") names(occ_palette) <- unique(d$occupation) the_caption <- "Source: ABS Labour Force Survey EQ08 and EQ09, analysis by http://freerangestats.info" #------------Line chart------------ # This is equivalent to an index - showing cumulative growth d %>% mutate(yr = year(mid_quarter_month)) %>% group_by(yr, occupation) %>% summarise(total_hours = mean(total_hours)) %>% group_by(occupation) %>% arrange(yr) %>% mutate(growth = total_hours / total_hours[1] - 1) %>% ungroup() %>% mutate(occupation = fct_reorder(occupation, -growth, .fun = last)) %>% ggplot(aes(x = yr, y = growth, colour = occupation)) + scale_y_continuous(label = percent) + geom_point(aes(size = total_hours / 1000)) + geom_line(stat="smooth", method = "loess", span = 1/2, alpha = 0.5, size = 2) + scale_colour_manual(values = occ_palette) + scale_size_area(label = comma_format(accuracy = 1)) + theme(legend.position = "right") + labs(x = "", y = "Cumulative growth in hours since 1986", size = "Million hours per quarter", colour = "Occupation", subtitle = "Hours by occupation of workers in Australia's health care and social assistance industry", title = "Strong growth in managers and non-health professionals, but absolute numbers are small", caption = the_caption) #----------stacked and filled area charts----------------- # fundamental guts of the plot with no geom p <- d %>% ggplot(aes(x = mid_quarter_month, y = total_hours / 1000, fill = occupation)) + scale_fill_manual(values = occ_palette) + theme(legend.position = "right") + labs(caption = the_caption, x = str_wrap("Health care professionals who work in other industries adjusted for by subtracting around 14% over all years, based on a rough estimate from the ABS Census of Population and Housing 2011.", 120), fill = "Occupation", subtitle = "Hours by occupation of workers in Australia's health care and social assistance industry") + theme(axis.title.x = element_text(size = 9, hjust = 0, colour = "grey")) # chart: stacked so we see absolute size p + geom_area(position = "stack") + scale_y_continuous(label = comma) + labs(y = "Millions of hours worked per quarter", title = "Steady growth over time") # chart: filled to top, showing proportions p + geom_area(position = "fill") + scale_y_continuous(label = percent_format(accuracy = 1)) + labs(y = "Proportion of hours worked", title = "More community and personal service workers and less labourers") #==================Medical practitioners======= # Out of 
curiousity, let's look more at the breakdown of those health professionals med_prac <- c( "2530 Medical Practitioners nfd" , "2531 General Practitioners and Resident Medical Officers" , "2532 Anaesthetists" , "2533 Specialist Physicians" , "2534 Psychiatrists" , "2535 Surgeons" , "2539 Other Medical Practitioners" ) profs_only <- eq08 %>% filter(occupation4 %in% health_profs) %>% mutate(med_prac = if_else(occupation4 %in% med_prac, "Medical practitioner", "Other health professional")) %>% group_by(mid_quarter_month, med_prac) %>% summarise(total_hours = sum(total_hours)) # chart: medical practitioners as a proportion of health professionals profs_only %>% ggplot(aes(x = mid_quarter_month, y = total_hours, fill = med_prac)) + geom_area(position = "fill") + scale_y_continuous(label = percent) + labs(x = str_wrap("Medical practitioners defined as GPs, Resident Medical Officers, Anaesthetists, Specialist Physicians, Psychiatrists, Surgeons, and other Medical Practitioners. 'Other health professional' examples includes nurses, pharmacists, midwives, nutritional practitioners, dental practitioners.", 120), fill = "", y = "Percentage of all health professionals' hours", subtitle = "Hours worked by all health professionals (unit group code 2500 to 2544)", title = "Medical practitioners' labour has remained a constant proportion of health professionals'", caption = the_caption) + theme(axis.title.x = element_text(size = 9, hjust = 0, colour = "grey")) #=================which are growing fast?============ # chart: lollipop of fastest growing or shrinking occupations eq08 %>% # remove any 'not further defined' residual categories so we can focus on real occupations filter(!grepl("nfd$", occupation4)) %>% mutate(yr = year(mid_quarter_month)) %>% filter(yr %in% c(1987, 2019)) %>% group_by(occupation4, yr) %>% summarise(total_hours = sum(total_hours)) %>% group_by(occupation4) %>% arrange(yr) %>% mutate(start_hours = total_hours[1], growth = total_hours / start_hours - 1, growth_backwards = start_hours / total_hours - 1, growth_either = ifelse(growth > 0, growth, -growth_backwards)) %>% filter(yr == max(yr)) %>% ungroup() %>% arrange(desc(abs(growth_either))) %>% slice(1:25) %>% mutate(occupation4 = fct_reorder(str_sub(occupation4, 6), growth_either)) %>% ggplot(aes(y = occupation4, yend = occupation4, x = growth_either, xend = 0)) + geom_segment(size = 2, aes(colour = growth_either)) + geom_point(aes(size = total_hours / 1000)) + scale_x_continuous(label = percent) + scale_size_area(label = comma_format(accuracy = 1)) + scale_colour_viridis_c(option = "C", label = percent, guide = 'none') + labs(x = "Growth in employment hours from 1987 to 2019 (if increasing),\nor from 2019 to 1987 (if decreasing)", y = "", caption = the_caption, colour = "", size = "Million hours per year in 2019", title = "The biggest growing and shrinking occupations by employment hours", subtitle = paste0("Two occupations grew by 3,000% from nearly non-existent in 1987.\n", "Textile footwear machine operators' hours were 1,500% more in 1987 than (near zero) in 2019.")) Other supplementary material

Here’s the original image that prompted me to think about all this:

And here’s a screenshot from the ABS Census tablebuilder, for anyone curious about which industries other than Health Care and Social Assistance employ medical professionals:


To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

How to manage credentials and secrets safely in R

Sat, 08/01/2020 - 15:39

[This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Are you interested in guest posting? Publish at DataScience+ via your RStudio editor. Category Tags

If you have ever received an embarrassing message with a warning saying that you may have published your credentials or secrets when publishing your code, you know what I’m talking about. A very common mistake among noob coders is (temporarily) hardcoding passwords, tokens, secrets, that should never be shared with others, and… shared them.

  • But, how can we handle a public or shared repository or reproducible code without doing so?
  • Are there one-time-only safe solutions that can set our credentials once and for all without having to worry if they will be shared but will always work?

Today I’ll share with you a simple but effective approach.

I have several functions that live in my public lares library that use get_creds() to fetch my secrets. Some of them use credentials to query databases, send emails with API services such as Mailgun, ping notifications using Slack‘s webhook, interact with Google Sheets programmatically, fetch Facebook and Twitter’s API stuff, Typeform, Github, Hubspot… I even have a portfolio performance report for my personal investments. If you check the code underneath, you won’t find credentials written anywhere, but the code will actually work (for me and for anyone that uses the library). So, how can we accomplish this?

You may want to install the library to follow the examples:

devtools::install_github("laresbernardo/lares")

Credentials in YAML files

A YAML (acronym for “YAML Ain’t Markup Language”) file is a readable text file, commonly used to save configurations in a .yml file. So the trick here will be to store our credentials and secrets in a local YAML file, set RStudio to “know and remember” where it is saved, and read that file every time we use a function that needs credentials. That’s where get_creds() comes in!

When using functions in lares that need credentials to actually work, you’ll notice there is always a creds argument. In it, you specify which service to fetch the secrets from so they can be used in the function. Every time you call such a function, it checks your .Renviron file, which reveals where your .yml file is saved, and gets a list with the credentials needed.
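Under the hood the idea is simple. A hand-rolled sketch of the same pattern (not lares' actual implementation; the LARES_CREDS variable name is taken from the setup output shown below) might look like this:

library(yaml)

# ~/.Renviron holds something like: LARES_CREDS=/path/to/secure/folder
creds_dir <- Sys.getenv("LARES_CREDS")
config    <- read_yaml(file.path(creds_dir, "config.yml"))

# each top-level key in the YAML is a service; grab one as a plain list
config[["service3"]]$token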

The first time you run get_creds() or use any function that has the creds parameter, it will interactively ask you to set the path to your local YAML file. This will be asked once and will be set for further R sessions. Remember, once you change this path you must reset your session for this setup to start working properly.

One-time only setup

Let’s run an example. If you already have a YAML file, you’re halfway there. If you already installed the lares library, you already have a dummy file locally that will work just fine for this exercise; you can find it here: system.file("docs", "config.yml", package = "lares"). If not, you can download the file and save it in your machine, wherever you wish to keep it.

1. Know the path: you must place the YAML file in a secure place and know its absolute path.

2. Set the path: load the library and call the get_creds() function to set the directory. It will ask for the directory (not the file).

library(lares)
# I'm using this function to get the library's dummy file directory
# dirname(system.file("docs", "config.yml", package = "lares"))
get_creds()

Please, set your creds directory (one-time only step to set LARES_CREDS):
Set directory where your config.yml file is saved: /Library/Frameworks/R.framework/Versions/4.0/Resources/library/lares/docs
ALL's SET! But, you must reset your session for it to work!

3. Reset your session: close your R/RStudio session and open it again. That should be all!

get_creds()

NULL
Warning message:
In get_creds() : No credentials for NA found in your YML file.
Try any of the following: 'service1', 'service2', 'service3'

We did it! As the warning message suggested, we can run the same function with one of the options available in our file. We’ll get a “list” object containing a (dummy) username, a repo, and a (fake) token, which can be now passed to any function without revealing its values. Awesome, right!?

get_creds("service3")

$username
[1] "myusername"

$repo
[1] "laresbernardo/lares"

$token
[1] "clntbjnrdbgvutdlkcecricuurtjtnbe"

Once you set your path, it will work from now on as long as you keep your file in the correct path. Of course, you don’t need the library to follow this logic, but feel free to use it and pass any feedback. I’ve been using this method for more than 3 years now, locally and in servers, with no issues so far.

BONUS 1: I frequently use 2-3 different computers all the time. To avoid having three different files (which will probably be recommended for security reasons), I only have one that syncs across all machines using Dropbox. So the path I’ve set is ~/Dropbox (Personal)/... for all of them, regardless of their origin path names.

BONUS 2: You can manually change your .Renviron file with usethis::edit_r_environ().

Hope you find this approach useful next time you are in need of hiding your coding secrets! Remember: reveal only what’s necessary and avoid shouting your credentials to the web. Happy coding!

Related Post


To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Why R? 2020 (Remote) Call for Papers Extended

Fri, 07/31/2020 - 18:00

[This article was first published on http://r-addict.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

We decided to give you one more week to submit a talk or a workshop to the Call for Papers for the 2020.whyr.pl remote conference. Please fill in this form 2020.whyr.pl/submit/ if you are interested in active participation.

The new deadline for submissions is 2020-08-07 23:59 CEST (UTC+2)!

Looking forward to your submissions!

As the meeting is held in English we invite R users from all over the globe!

We will stream the conference on youtube.com/WhyRFoundation. The channel already contains past R webinars and keynote talks from previous conferences.

Our Social Media

Find out our Social Media channels:


To leave a comment for the author, please follow the link and comment on their blog: http://r-addict.com. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

fairmodels: let’s fight with biased Machine Learning models (part 1 — detection)

Fri, 07/31/2020 - 17:36

[This article was first published on Stories by Przemyslaw Biecek on Medium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

fairmodels: let’s fight with biased Machine Learning models (part 1 — detection)

Author: Jakub Wiśniewski

TL;DR

The fairmodels R package facilitates bias detection through model visualizations. It implements a few mitigation strategies that can reduce bias. It enables easy-to-use checks of fairness metrics and comparisons between different Machine Learning (ML) models.

Longer version

Fairness in ML is a quickly emerging field. Big companies like IBM or Google have already developed some tools (see AIF360) with a growing community of users. Unfortunately, there aren’t many tools for discovering bias and discrimination in machine learning models created in R, so checking the fairness of a classifier built in R can be a difficult task. This is why the fairmodels R package was created.

Introduction to fairness concepts

What does it mean that model is fair? Imagine we have a classification model which decisions would have some impact on a human. For example, the model must decide whether some individuals will get a loan or not. What we don’t want is our model predictions to be based on sensitive (later called protected) attributes such as sex, race, nationality, etc… because it could potentially harm some unprivileged groups of people. However, not using such variables might not be enough because the correlations are usually hidden deep inside the data. That is what fairness in ML is for. It checks if privileged and unprivileged groups are treated similarly and if not, it offers some bias mitigation techniques.

There are numerous fairness metrics such as Statistical Parity, Equalized odds, Equal opportunity, and more. They check if model properties on privileged and unprivileged groups are the same

Equal opportunity criterion is satisfied when probabilities for 2 subgroups where A = 1 denotes privileged one are equal.
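Written out in symbols (with $\hat{Y}$ the prediction, $Y$ the true label, and $A$ the protected attribute, where $A = 1$ denotes the privileged group as in the caption above), the criterion is:

\[ P(\hat{Y} = 1 \mid Y = 1, A = 1) = P(\hat{Y} = 1 \mid Y = 1, A = 0) \]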

Many of these metrics can be derived from the confusion matrix. For example, Equal opportunity requires an equal TPR (True Positive Rate) across the subgroups of the protected variable. However, the raw rates themselves are not the essential information for us; we would like to know whether the difference in these rates between the privileged group and each unprivileged one is acceptable. Let’s say that the acceptable difference in fairness metrics is 0.1; we will call this epsilon. The TPR criterion for this metric would then be:

For all subgroups (unique values in the protected variable) the fairness metric difference between subgroup denoted as i and the privileged one must be lower than some acceptable value epsilon ( 0.1 in our case ).

Such a criterion is double-sided. It also ensures that there is not much difference in favour of the unprivileged group.
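Written out for a metric such as TPR and an acceptable difference $\varepsilon$ (0.1 here), the double-sided check described above is roughly:

\[ \left| \mathrm{TPR}_{i} - \mathrm{TPR}_{\text{privileged}} \right| < \varepsilon \quad \text{for every unprivileged subgroup } i \]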

fairmodels as bias detection tool

fairmodels is an R package for discovering, eliminating, and visualizing bias. Its main function, fairness_check(), enables the user to quickly check if popular fairness metrics are satisfied. fairness_check() returns an object called fairness_object, which wraps models together with metrics in a useful structure. To create this object we need to provide:

  • Classification model explained with DALEX
  • The protected variable in the form of a factor. Unlike in other fairness tools, it doesn’t need to be binary, just discrete.
  • The privileged group in the form of character or factor

So let’s see how it works in practice. We will make a linear regression model with german credit data predicting whether a certain person makes more or less than 50k annually. Sex will be used as a protected variable.

  1. Create a model
library(fairmodels)
data("german")
y_numeric <- as.numeric(german$Risk) -1
lm_model <- glm(Risk~., data = german, family=binomial(link="logit"))

2. Create an explainer

library(DALEX)
explainer_lm <- explain(lm_model, data = german[,-1], y = y_numeric)

3. Use the fairness_check(). Here the epsilon value is set to default which is 0.1

fobject <- fairness_check(explainer_lm,
protected = german$Sex,
privileged = "male")

Now we can check the level of bias

print(fobject) prints information to the console: it tells us how many metrics the model passes and what the total difference (loss) across all metrics is. plot(fobject) returns a ggplot object showing red and green areas, where the red field signifies bias: if the bar reaches the left red field, the unprivileged group is discriminated against; if it reaches the right red zone, there is a bias towards the privileged group.

As we can see checking fairness is not difficult. What is more complicated is comparing the discrimination between models. But even this can be easily done with fairmodels!

fairmodels is flexible

When we have many models, they can be passed into one fairness_check() together. Moreover, an iterative approach is possible: if we explain a model and it does not satisfy the fairness criteria, we can add other models, along with the existing fairness_object, to fairness_check(). That way even the same model with different parameters and/or trained on different data can be compared with the previous one(s).

library(ranger)

rf_model <- ranger(Risk ~., data = german, probability = TRUE)
explainer_rf <- explain(rf_model, data = german[,-1], y = y_numeric)

fobject <- fairness_check(explainer_rf, fobject)
print(fobject)

print(fobject) with the additional explainer

That is it. The ranger model passes our fairness criteria (epsilon = 0.1) and is therefore considered fair.

Summary

fairmodels is a flexible and easy-to-use tool for checking whether an ML model is fair. It can handle multiple models trained on different data, regardless of how the data were encoded or whether features were standardized. It facilitates the bias detection process across multiple models while allowing those models to be compared with each other.

Learn more


To leave a comment for the author, please follow the link and comment on their blog: Stories by Przemyslaw Biecek on Medium. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Explainable ‘AI’ using Gradient Boosted randomized networks Pt2 (the Lasso)

Fri, 07/31/2020 - 02:00

[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This post is about LSBoost, an Explainable ‘AI’ algorithm which uses Gradient Boosted randomized networks for pattern recognition. As we discussed last week, LSBoost is a cousin of GFAGBM’s LS_Boost. More specifically, in LSBoost the so-called weak learners from LS_Boost are based on randomized neural networks’ components and variants of Least Squares regression models.

I’ve already presented some promising examples of use of LSBoost based on Ridge Regression weak learners. In mlsauce’s version 0.7.1, the Lasso can also be used as an alternative ingredient to the weak learners. Here is a comparison of the regression coefficients obtained by using mlsauce’s implementation of Ridge regression and the Lasso:

R example: LSBoostRegressor with Ridge regression and the Lasso

The following example is about training set error vs testing set error, as a function of the regularization parameter, both for Ridge regression and Lasso-based weak learners.

Packages and data

# 0 - Packages and data -------------------------------------------------------

library(devtools)
devtools::install_github("thierrymoudiki/mlsauce/R-package")
library(mlsauce)
library(datasets)

print(summary(datasets::mtcars))

X <- as.matrix(datasets::mtcars[, -1])
y <- as.integer(datasets::mtcars[, 1])

n <- dim(X)[1]
p <- dim(X)[2]

set.seed(21341)
train_index <- sample(x = 1:n, size = floor(0.8*n), replace = TRUE)
test_index <- -train_index

X_train <- as.matrix(X[train_index, ])
y_train <- as.double(y[train_index])
X_test <- as.matrix(X[test_index, ])
y_test <- as.double(y[test_index])

LSBoost using Ridge regression

# 1 - Ridge -------------------------------------------------------------------

obj <- mlsauce::LSBoostRegressor() # default h is Ridge
print(obj$get_params())

n_lambdas <- 100
lambdas <- 10**seq(from=-6, to=6, length.out = n_lambdas)

rmse_matrix <- matrix(NA, nrow = 2, ncol = n_lambdas)
rownames(rmse_matrix) <- c("training rmse", "testing rmse")

for (j in 1:n_lambdas) {
  obj$set_params(reg_lambda = lambdas[j])
  obj$fit(X_train, y_train)
  rmse_matrix[, j] <- c(sqrt(mean((obj$predict(X_train) - y_train)**2)),
                        sqrt(mean((obj$predict(X_test) - y_test)**2)))
}

LSBoost using the Lasso

# 2 - Lasso -------------------------------------------------------------------

obj <- mlsauce::LSBoostRegressor(solver = "lasso")
print(obj$get_params())

n_lambdas <- 100
lambdas <- 10**seq(from=-6, to=6, length.out = n_lambdas)

rmse_matrix2 <- matrix(NA, nrow = 2, ncol = n_lambdas)
rownames(rmse_matrix2) <- c("training rmse", "testing rmse")

for (j in 1:n_lambdas) {
  obj$set_params(reg_lambda = lambdas[j])
  obj$fit(X_train, y_train)
  rmse_matrix2[, j] <- c(sqrt(mean((obj$predict(X_train) - y_train)**2)),
                         sqrt(mean((obj$predict(X_test) - y_test)**2)))
}

R session info > print(session_info()) ─ Session info ───────────────────────────────────────────────────────────── setting value version R version 4.0.2 (2020-06-22) os Ubuntu 16.04.6 LTS system x86_64, linux-gnu ui RStudio language (EN) collate C.UTF-8 ctype C.UTF-8 tz Etc/UTC date 2020-07-31 ─ Packages ───────────────────────────────────────────────────────────────── package * version date lib source assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.2) backports 1.1.8 2020-06-17 [1] RSPM (R 4.0.2) callr 3.4.3 2020-03-28 [1] RSPM (R 4.0.2) cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.2) crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.2) curl 4.3 2019-12-02 [1] RSPM (R 4.0.2) desc 1.2.0 2018-05-01 [1] RSPM (R 4.0.2) devtools * 2.3.1 2020-07-21 [1] RSPM (R 4.0.2) digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.2) ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.2) fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.2) fs 1.4.2 2020-06-30 [1] RSPM (R 4.0.2) glue 1.4.1 2020-05-13 [1] RSPM (R 4.0.2) jsonlite 1.7.0 2020-06-25 [1] RSPM (R 4.0.2) lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.2) magrittr 1.5 2014-11-22 [1] RSPM (R 4.0.2) Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.2) memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.2) mlsauce * 0.7.1 2020-07-31 [1] Github (thierrymoudiki/mlsauce@68e391a) pkgbuild 1.1.0 2020-07-13 [1] RSPM (R 4.0.2) pkgload 1.1.0 2020-05-29 [1] RSPM (R 4.0.2) prettyunits 1.1.1 2020-01-24 [1] RSPM (R 4.0.2) processx 3.4.3 2020-07-05 [1] RSPM (R 4.0.2) ps 1.3.3 2020-05-08 [1] RSPM (R 4.0.2) R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.2) rappdirs 0.3.1 2016-03-28 [1] RSPM (R 4.0.2) Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2) remotes 2.2.0 2020-07-21 [1] RSPM (R 4.0.2) reticulate 1.16 2020-05-27 [1] RSPM (R 4.0.2) rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2) rprojroot 1.3-2 2018-01-03 [1] RSPM (R 4.0.2) rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.2) sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.2) testthat 2.3.2 2020-03-02 [1] RSPM (R 4.0.2) usethis * 1.6.1 2020-04-29 [1] RSPM (R 4.0.2) withr 2.2.0 2020-04-20 [1] RSPM (R 4.0.2) [1] /home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0 [2] /opt/R/4.0.2/lib/R/library

No post in August


To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Handling R6 objects in C++

Fri, 07/31/2020 - 02:00

[This article was first published on Rcpp Gallery, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

When we are using R6 objects and want to introduce some C++ code in our
project, we may also want to interact with these objects using Rcpp. With
this objective in mind, the key to interacting with R6 objects is that
they are essentially environments, and so they can be treated as such.

In this example, we will define a ‘Person’ class with some private attributes
and public methods. It will allow us to show how to create R6 objects in C++
and how to interact with them.

Creating a dummy R6 class

First, we define the class ‘Person’, with a name, an id, an item and some
methods:

library(R6)

#' R6 class that defines a person
Person <- R6::R6Class("Person",
  public = list(
    #' @description
    #' Constructor of the 'Person' class
    #' @param name a string that defines this person's name
    #' @param id an integer that defines this person's id
    #' @return A new 'Person' object
    initialize = function(name, id){
      private$name <- name
      private$id <- id
    },
    get_name = function(){return(private$name)},
    get_id = function(){return(private$id)},
    #' @description
    #' Gives an item to the person
    #' @param item a string that defines the item
    give_item = function(item){private$item <- item},
    #' @description
    #' A public function that calls a private one
    good_foo = function(){
      return(paste0("Wrapped inside: {", private$bad_foo(), "}"))
    }
  ),
  private = list(
    #' @field name the name of the person
    name = NULL,
    #' @field id the id of the person
    id = NULL,
    #' @field item some item that the person has
    item = NULL,
    #' @description
    #' A private function that should not be called from outside the object
    bad_foo = function(){return("This is a private function")}
  )
)

With this simple class, we will be able to create R6 objects, initialize them
and call some of their methods.

Creating R6 objects in C++

To create R6 objects, we must first get the ‘new’ function that initializes
them. In this case, given that we have loaded the Person class into the
global environment, we will get it from there:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List initialize_list(StringVector &names, IntegerVector &ids){
  List res(names.size());
  Environment g_env = Environment::global_env();
  Environment p_env = g_env["Person"];
  Function new_person = p_env["new"];
  for(unsigned int i = 0; i < names.size(); i++){
    Environment new_p;
    String name = names[i];
    int id = ids[i];
    new_p = new_person(name, id);
    res[i] = new_p;
  }
  return res;
}

names <- c("Jake", "Anne", "Alex")
ids <- 1:3
res <- initialize_list(names, ids)
print(res)

The previous example initializes a list with Persons. It is important to notice
that the only function relevant to us from the global environment is the ‘new’
function created by the R6 class. All the functions defined inside Person will
be contained in each instance of the class.

If the R6 object is defined inside some package, including our own package if we are developing one, we must get it from there. To do this, the previous function has to be modified as follows:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List initialize_list(unsigned int size){
  List res(size);
  Environment package_env("package:some_package");
  Environment class_env = package_env["some_class"];
  Function new_instance = class_env["new"];
  for(unsigned int i = 0; i < size; i++){
    Environment new_i;
    new_i = new_instance();
    res[i] = new_i;
  }
  return res;
}

Calling the methods of an instance

Once we have instantiated an R6 object as an environment, we can call its
methods after getting them from the instance:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
Environment create_person_item(){
  Environment g_env = Environment::global_env();
  Environment p_env = g_env["Person"];
  Function new_person = p_env["new"];
  Environment new_p;
  String name = "Jake";
  int id = 1;
  String item = "shovel";
  new_p = new_person(name, id);
  Function give_i = new_p["give_item"];
  give_i(item);
  return new_p;
}

res <- create_person_item()
print(res)

Note that if we are creating multiple instances, we have to get the desired
method from each one of them.

Private attributes and methods cannot be accessed in this manner; trying to do so results in an error:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
String access_method(std::string foo_name){
  Environment g_env = Environment::global_env();
  Environment p_env = g_env["Person"];
  Function new_person = p_env["new"];
  Environment new_p;
  String name = "Jake";
  int id = 1;
  new_p = new_person(name, id);
  Function foo = new_p[foo_name];
  String res = foo();
  return res;
}

tryCatch({res <- access_method("bad_foo")},
         error = function(cond){print(paste0("Exception raised: ", cond))})
res <- access_method("good_foo")
print(res)

The error tells us that there is no function called “bad_foo” in the environment of the object “new_p”: only public attributes and methods are present on the surface of that environment. Theoretically, we could still dig up the private attributes and methods inside the object, but doing so works against the encapsulation the class is meant to provide, and it is much easier to simply make the method we want to call public.
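For completeness, the same digging is possible from pure R (shown here only to make the point, and just as inadvisable): R6 keeps the private members in the enclosing environment attached to each instance.

p1 <- Person$new("Jake", 1)
p1$.__enclos_env__$private$bad_foo()   # "This is a private function"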

Passing down R6 objects

Passing R6 objects as arguments to C++ functions is also straightforward when we treat them as environments:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
void give_shovel(Environment &person){
  String item = "shovel";
  Function give_i = person["give_item"];
  give_i(item);
}

p <- Person$new("Jake", 1)
give_shovel(p)
print(p)

Public:
  clone: function (deep = FALSE)
  get_id: function ()
  get_name: function ()
  give_item: function (item)
  good_foo: function ()
  initialize: function (name, id)
Private:
  bad_foo: function ()
  id: 1
  item: shovel
  name: Jake

Note that there is no need to return the object, as the environment is passed
by reference and changes inside it will be reflected outside the function.
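The same reference semantics can be seen from pure R with the class above (a small sketch): no return value or reassignment is needed for the change to stick.

give_shovel_r <- function(person) person$give_item("shovel")
p <- Person$new("Jake", 1)
give_shovel_r(p)   # nothing returned, nothing reassigned...
print(p)           # ...yet the private 'item' field of p now holds "shovel"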


To leave a comment for the author, please follow the link and comment on their blog: Rcpp Gallery. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I like to MVO it!

Fri, 07/31/2020 - 02:00

[This article was first published on R on OSM, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In our last post, we ran through a bunch of weighting scenarios using our returns simulation. This resulted in three million portfolios composed, in part or in whole, of four assets: stocks, bonds, gold, and real estate. These simulations relaxed the allocation constraints to allow us to exclude assets, yielding a wider range of return and risk results, while lowering the likelihood of achieving our risk and return targets. We bucketed the portfolios to simplify the analysis around the risk-return trade-off. We then calculated the median returns and risk for each bucket and found that some buckets achieved Sharpe ratios close to or better than the one implied by our original risk-return constraint. Cutting the data further, we calculated the average weights for the better Sharpe ratio portfolios. The result: relatively equal weighting tended to produce a better risk-reward outcome than significant overweighting.

At the end of the post we noted that we could have bypassed much of this data wrangling and simply calculated the optimal portfolio weights for various risk profiles using mean-variance optimization. That is what we plan to do today.

The madness behind all this data wrangling was to identify the best return afforded by a given level of risk. Mean-variance optimization (MVO) solves that problem more elegantly than our “hacky” methods. It uses quadratic programming[1] to minimize the portfolio variance by altering the weights of the various assets in the portfolio. It is subject to the constraints (in the simplest form) that the return of any particular portfolio is at least equal to the expected return of the portfolio[2] and the weights of the assets sum to one.

More formally it can be expressed as follows:

Minimize: \(\frac{1}{2}w'\Sigma w\)
Subject to: \(r'w = \mu\) and \(e'w = 1\)

Here \(w\) = asset weights, \(\Sigma\) = the covariance matrix of the assets with themselves and every other asset, \(r\) = returns of the assets, \(\mu\) = expected return of the portfolio, and \(e\) = a vector of ones. It is understood that we are employing matrix notation, so \(w'\) is the transpose of \(w\).

If you understand that, it's probably the roughest rendition of MVO you've seen; if you don't, don't worry about it. The point is that, through some nifty math, you can solve for the precise weights such that every portfolio falling along a particular line has the lowest volatility for a given level of return, or the highest return for a given level of volatility. This line is called the efficient frontier: "efficient" because, in econospeak, it means every asset is optimally allocated, and "frontier", well, you get that one, we hope.
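To make the mechanics concrete, here is a minimal sketch of a single such optimization using the quadprog package, with made-up monthly returns; the asset names, the target return, and the long-only (non-negative weight) constraint are illustrative assumptions, not part of the analysis in this post.

library(quadprog)

set.seed(1)
rets <- matrix(rnorm(60 * 4, mean = 0.005, sd = 0.03), ncol = 4,
               dimnames = list(NULL, c("stock", "bond", "gold", "realt")))

mu   <- mean(colMeans(rets))       # a target return that is always feasible
Dmat <- cov(rets)                  # Sigma in the formulas above
dvec <- rep(0, 4)                  # no linear term: pure minimum variance
Amat <- cbind(colMeans(rets),      # r'w = mu  (equality, via meq)
              rep(1, 4),           # e'w = 1   (equality)
              diag(4))             # w >= 0    (long-only, an extra assumption)
bvec <- c(mu, 1, rep(0, 4))

sol <- solve.QP(Dmat, dvec, Amat, bvec, meq = 2)
round(sol$solution, 3)             # the weights of one efficient portfolio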

What does this look like in practice? Let’s bring back our original portfolio, run the simulations, and then calculate the efficient frontier. We graph our original simulation with the original weighting constraint (all assets are in the portfolio) below.

Recall that after we ran this simulation we averaged the weightings for those portfolios that achieved our constraints of not less than a 7% return and not more than 10% risk on an annual basis. We then applied that weighting to our first five-year test period. We show the weighting below.

Before we look at the forward returns and the efficient frontier, let’s see where our portfolio lies in the original simulation to orient ourselves. It’s the red dot.

As is clear, the portfolio ends up in the higher end of the continuum, but there are other portfolios that dominate it. Now the moment we’ve been waiting for—portfolio optimization! Taking a range of returns between the minimum and maximum of the simulated portfolios, we’ll calculate the optimal weights to produce the highest return for the lowest amount of risk.

Wow! That optimization stuff sure does work. The blue line representing the efficient frontier clearly shows that there are other portfolios that could generate much higher returns for the implied level of risk we’re taking on. Alternatively, if we move horizontally to the left we see that we could achieve the same level of return at a much lower level of risk, shown by where the blue line crosses above 7% return.

Recall for illustrative purposes we used a simple version for the original weight simulation that required an investment in all assets. When we relax that constraint, we get a much wider range of outcomes, as we pointed out in the last post. What if we ran the weighting simulation with the relaxed constraint? What would our simulation and allocation look like in that case? We show those results below.

We see a much broader range of outcomes, which yields a higher weighting to bonds and a lower one to gold than the previous portfolio. Now we’ll overlay the placement of our satisfactory portfolio on the broader weight simulation along with the efficient frontier in the graph below.

Who needs mean-variance optimization when you've got data science simulation?! As one can see, when you allow portfolio weights to approach zero in many, but not all, of the assets, you can approximate the efficient frontier without having to rely on quadratic programming. This should give new meaning to "p-hacking." Still, quadratic programming is likely to be a lot faster than running thousands of simulations with a large portfolio of assets. Recall that for the four-asset portfolio, relaxing the inclusion constraint tripled the number of simulations. Hence, for any simulation in which some portfolios won't be invested in all the assets, the number of calculations increases by a factor of the total number of assets minus one.

Whatever the case, we see that the satisfactory portfolio may not be that satisfactory given how much it’s dominated by the efficient frontier. Recall, however, we weren’t trying to achieve an optimal portfolio per se. We “just” wanted a portfolio that would meet our risk-return constraints.

Let’s see what happens when we use our satisfactory portfolio’s weights on the first five-year test period. In the graph below, we calculate our portfolio’s risk and return and then place it within our weight simulation scatter plot. We also calculate the risk and returns of various portfolios using the weights we derived from our efficient frontier above and add that to our graph as the blue line.

Uh oh, not so efficient. The weights from the previous efficient frontier did not achieve optimal portfolios in the future and produced an unusual shape too. This illustrates one of the main problems with mean-variance optimization: “optimal weights are sensitive to return estimates”. In other words, if your estimates of returns aren’t that great, your optimal portfolio weights won’t be so optimal. Moreover, even if your estimates reflect all presently available information, that doesn’t mean they’ll be that accurate in the future.
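A tiny, self-contained illustration of that sensitivity, reusing the quadprog setup sketched earlier (again with made-up returns; the size of the perturbation is arbitrary): nudging the mean-return estimates slightly is enough to move the "optimal" weights.

library(quadprog)

set.seed(1)
rets <- matrix(rnorm(60 * 4, mean = 0.005, sd = 0.03), ncol = 4)

min_var_wts <- function(mu_hat, Sigma) {
  n    <- length(mu_hat)
  Amat <- cbind(mu_hat, rep(1, n), diag(n))
  bvec <- c(mean(mu_hat), 1, rep(0, n))   # target return = mean of the estimates
  solve.QP(Sigma, rep(0, n), Amat, bvec, meq = 2)$solution
}

round(min_var_wts(colMeans(rets), cov(rets)), 2)                       # estimated means
round(min_var_wts(colMeans(rets) + rnorm(4, 0, 0.001), cov(rets)), 2)  # after a small nudge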

A great way to see this is to calculate the efficient frontier using as much of the data as we have, ignoring incomplete cases (which produces bias), and plotting that against the original and first five-year simulations.

You win some; you lose some. As is evident, different return estimates yield different frontiers both retrospectively and prospectively. Should we be skeptical of mean-variance optimization as Warren Buffett is of “geeks bearing gifts”? Not really. It’s an elegant solution to the thorny problem of portfolio construction. But it’s not very dynamic and it doesn’t exactly allow for much uncertainty around estimates.

There have been a number of attempts to address such shortcomings, including multi-period models, inter-temporal models, and even a statistics-free approach, among others. Even summarizing these different approaches would take us far afield of this post. Suffice it to say, there isn’t a clear winner; instead each refinement addresses a particular issue or fits a particular risk preference.

We’ve now partially revealed why we’ve been talking about a “satisfactory” portfolio all along. It’s the trade-off between satisficing and optimizing. While we cannot possibly discuss all the nuances of satisficing now, our brief explanation is this. Satisficing is finding the best available solution when the optimal one is uncertain or unattainable. It was a concept developed by Herbert Simon, who argued that decision makers could choose an optimal solution to a simplified reality or a satisfactory solution to a messy one.

If the “optimal” solution to portfolio allocation is a moving target with multiple approaches to calculating it, many of which involve a great deal of complexity, then electing a “good-enough” solution might be more satisfactory. The cost to become conversant in the technical details necessary to understand some of the solutions, let alone compile all the data necessary, could be prohibitive. Of course, if you’re a fund manager being paid to outperform (i.e., beat everyone else trying to beat you), then it behooves you to seek out these arcane solutions if your competitors are apt to use them too.

This discussion explains, in part, why the “simple” 1/n or 60/40 stock/bond portfolios are so popular. The exercise of mean-variance optimization and all its offshoots may simply be too much effort if the answers it gives aren’t dramatically better than a simplified approach. But it would be wrong to lay the blame for poor results or uncertainty on MVO: financial markets have way more noise than signal.

In pursuit of the signal, our next posts will look at the “simple” portfolios and see what they produce over multiple simulations relative to the satisfactory and optimal portfolios we’ve already discussed. If you think this blog is producing more noise than signal or vice versa, we want to know! Our email address is after the R and Python code below.

R code:

# Written in R 3.6.2 # Code for any source('function.R') is found at the end. ## Load packages suppressPackageStartupMessages({ library(tidyquant) library(tidyverse) library(quadprog) }) ## Load data df <- readRDS("port_const.rds") dat <- readRDS("port_const_long.rds") sym_names <- c("stock", "bond", "gold", "realt", "rfr") ## Call simuation functions source("Portfolio_simulation_functions.R") ## Run simulation set.seed(123) port_sim_1 <- port_sim(df[2:61,2:5],1000,4) ## Graph port_sim_1$graph + theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"), legend.background = element_rect(fill = NA)) ## Run selection function and graph results_1 <- port_select_func(port_sim_1, 0.07, 0.1, sym_names[1:4]) results_1$graph # Create satisfactory portfolio satis_ret <- sum(results_1$port_wts*colMeans(df[2:61, 2:5])) satis_risk <- sqrt(as.numeric(results_1$port_wts) %*% cov(df[2:61, 2:5]) %*% as.numeric(results_1$port_wts)) port_satis <- data.frame(returns = satis_ret, risk = satis_risk) # Graph with simulated port_sim_1$graph + geom_point(data = port_satis, aes(risk*sqrt(12)*100, returns*1200), size = 4, color="red") + theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"), legend.background = element_rect(fill = NA)) ## Find efficient frontier source("Efficient_frontier.R") eff_port <- eff_frontier_long(df[2:61,2:5], risk_increment = 0.01) df_eff <- data.frame(returns = eff_port$exp_ret, risk = eff_port$stdev) port_sim_1$graph + geom_line(data = df_eff, aes(risk*sqrt(12)*100, returns*1200), color = 'blue', size = 1.5, linetype = "dashed") + geom_point(data = port_satis, aes(risk*sqrt(12)*100, returns*1200), size = 4, color="red") + theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"), legend.background = element_rect(fill = NA)) # Simulation with leaving out assets port_sim_1lv <- port_sim_lv(df[2:61,2:5],1000,4) lv_graf <- port_sim_1lv$graph + theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"), legend.background = element_rect(fill = NA), plot.title = element_text(size=10)) ## Run selection function results_1lv <- port_select_func(port_sim_1lv, 0.07, 0.1, sym_names[1:4]) lv_res_graf <- results_1lv$graph + theme(plot.title = element_text(size=10)) gridExtra::grid.arrange(lv_graf, lv_res_graf, ncol=2) ## Create satisfactory data frame and graph leave out portfolios with efficient frontier satis_ret_lv <- sum(results_1lv$port_wts*colMeans(df[2:61, 2:5])) satis_risk_lv <- sqrt(as.numeric(results_1lv$port_wts) %*% cov(df[2:61, 2:5]) %*% as.numeric(results_1lv$port_wts)) port_satis_lv <- data.frame(returns = satis_ret_lv, risk = satis_risk_lv) port_sim_1lv$graph + geom_line(data = df_eff, aes(risk*sqrt(12)*100, returns*1200), color = 'blue', size = 1.5, linetype = "dashed") + geom_point(data = port_satis_lv, aes(risk*sqrt(12)*100, returns*1200), size = 4, color="red") + theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"), legend.background = element_rect(fill = NA)) ## Run function and create actual portfolio and data frame for graph port_1_act <- rebal_func(df[62:121,2:5],results_1lv$port_wts) port_act <- data.frame(returns = mean(port_1_act$ret_vec), risk = sd(port_1_act$ret_vec), sharpe = mean(port_1_act$ret_vec)/sd(port_1_act$ret_vec)*sqrt(12)) ## Simulate portfolios on first five-year period set.seed(123) port_sim_2 <- port_sim_lv(df[62:121,2:5], 1000, 4) eff_ret1 <- apply(eff_port[,1:4], 1, function(x) x %*% colMeans(df[62:121, 2:5])) eff_risk1 <- sqrt(apply(eff_port[,1:4], 1, function(x) as.numeric(x) %*% 
cov(df[62:121,2:5]) %*% as.numeric(x))) eff_port1 <- data.frame(returns = eff_ret1, risk = eff_risk1) ## Graph simulation with chosen portfolio port_sim_2$graph + geom_point(data = port_act, aes(risk*sqrt(12)*100, returns*1200), size = 4, color="red") + geom_line(data = eff_port1, aes(risk*sqrt(12)*100, returns*1200), color = 'red', size = 2) + theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"), legend.background = element_rect(fill = NA)) ## Using longer term data eff_port_old <- eff_frontier_long(dat[1:253,2:5], risk_increment = 0.01) df_eff_old <- data.frame(returns = eff_port_old$exp_ret, risk = eff_port_old$stdev) p1 <- port_sim_1lv$graph + geom_line(data = df_eff_old, aes(risk*sqrt(12)*100, returns*1200), color = 'blue', size = 1.5) + geom_point(data = port_satis, aes(risk*sqrt(12)*100, returns*1200), size = 4, color="red") + theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"), legend.background = element_rect(fill = NA), plot.title = element_text(size=10)) + labs(title = 'Simulated portfolios with long-term optimzation') # For forward graph eff_ret1_old <- apply(eff_port_old[,1:4], 1, function(x) x %*% colMeans(dat[1:253, 2:5], na.rm = TRUE)) eff_risk1_old <- sqrt(apply(eff_port_old[,1:4], 1, function(x) as.numeric(x) %*% cov(dat[1:253,2:5], use = 'pairwise.complete.obs') %*% as.numeric(x))) eff_port1_old <- data.frame(returns = eff_ret1_old, risk = eff_risk1_old) ## Graph simulation with chosen portfolio p2 <- port_sim_2$graph + geom_point(data = port_act, aes(risk*sqrt(12)*100, returns*1200), size = 4, color="red") + geom_line(data = eff_port1_old, aes(risk*sqrt(12)*100, returns*1200), color = 'blue', size = 2) + theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"), legend.background = element_rect(fill = NA), plot.title = element_text(size=10)) + labs(title = 'Forward portfolios with long-term optimization') gridExtra::grid.arrange(p1, p2, ncol=2) #### Portfolio_simulation_functions.R # Portfolio simulations ## Portfolio simuation function port_sim <- function(df, sims, cols){ if(ncol(df) != cols){ print("Columns don't match") break } # Create weight matrix wts <- matrix(nrow = sims, ncol = cols) for(i in 1:sims){ a <- runif(cols,0,1) b <- a/sum(a) wts[i,] <- b } # Find returns mean_ret <- colMeans(df) # Calculate covariance matrix cov_mat <- cov(df) # Calculate random portfolios port <- matrix(nrow = sims, ncol = 2) for(i in 1:sims){ port[i,1] <- as.numeric(sum(wts[i,] * mean_ret)) port[i,2] <- as.numeric(sqrt(t(wts[i,]) %*% cov_mat %*% wts[i,])) } colnames(port) <- c("returns", "risk") port <- as.data.frame(port) port$Sharpe <- port$returns/port$risk*sqrt(12) max_sharpe <- port[which.max(port$Sharpe),] graph <- port %>% ggplot(aes(risk*sqrt(12)*100, returns*1200, color = Sharpe)) + geom_point(size = 1.2, alpha = 0.4) + scale_color_gradient(low = "darkgrey", high = "darkblue") + labs(x = "Risk (%)", y = "Return (%)", title = "Simulated portfolios") out <- list(port = port, graph = graph, max_sharpe = max_sharpe, wts = wts) } ## Portfolio Simulation leave port_sim_lv <- function(df, sims, cols){ if(ncol(df) != cols){ print("Columns don't match") break } # Create weight matrix wts <- matrix(nrow = (cols-1)*sims, ncol = cols) count <- 1 for(i in 1:(cols-1)){ for(j in 1:sims){ a <- runif((cols-i+1),0,1) b <- a/sum(a) c <- sample(c(b,rep(0,i-1))) wts[count,] <- c count <- count+1 } } # Find returns mean_ret <- colMeans(df) # Calculate covariance matrix cov_mat <- cov(df) # Calculate random portfolios port <- matrix(nrow = 
(cols-1)*sims, ncol = 2) for(i in 1:nrow(port)){ port[i,1] <- as.numeric(sum(wts[i,] * mean_ret)) port[i,2] <- as.numeric(sqrt(t(wts[i,]) %*% cov_mat %*% wts[i,])) } colnames(port) <- c("returns", "risk") port <- as.data.frame(port) port$Sharpe <- port$returns/port$risk*sqrt(12) max_sharpe <- port[which.max(port$Sharpe),] graph <- port %>% ggplot(aes(risk*sqrt(12)*100, returns*1200, color = Sharpe)) + geom_point(size = 1.2, alpha = 0.4) + scale_color_gradient(low = "darkgrey", high = "darkblue") + labs(x = "Risk (%)", y = "Return (%)", title = "Simulated portfolios") out <- list(port = port, graph = graph, max_sharpe = max_sharpe, wts = wts) } ## Load portfolio selection function port_select_func <- function(port, return_min, risk_max, port_names){ port_select <- cbind(port$port, port$wts) port_wts <- port_select %>% mutate(returns = returns*12, risk = risk*sqrt(12)) %>% filter(returns >= return_min, risk <= risk_max) %>% summarise_at(vars(4:7), mean) %>% `colnames<-`(port_names) p <- port_wts %>% rename("Stocks" = stock, "Bonds" = bond, "Gold" = gold, "Real estate" = realt) %>% gather(key,value) %>% ggplot(aes(reorder(key,value), value*100 )) + geom_bar(stat='identity', position = "dodge", fill = "blue") + geom_text(aes(label=round(value,2)*100), vjust = -0.5) + scale_y_continuous(limits = c(0,max(port_wts*100+2))) + labs(x="", y = "Weights (%)", title = "Average weights for risk-return constraints") out <- list(port_wts = port_wts, graph = p) out } ## Function for portfolio returns without rebalancing rebal_func <- function(act_ret, weights){ ret_vec <- c() wt_mat <- matrix(nrow = nrow(act_ret), ncol = ncol(act_ret)) for(i in 1:nrow(wt_mat)){ wt_ret <- act_ret[i,]*weights # wt'd return ret <- sum(wt_ret) # total return ret_vec[i] <- ret weights <- (weights + wt_ret)/(sum(weights)+ret) # new weight based on change in asset value wt_mat[i,] <- as.numeric(weights) } out <- list(ret_vec = ret_vec, wt_mat = wt_mat) out } #### Efficient_frontier.R # Adapted from https://www.nexteinstein.org/wp-content/uploads/sites/6/2017/01/ORIG_Portfolio-Optimization-Using-R_Pseudo-Code.pdf eff_frontier_long <- function(returns, risk_premium_up = 0.5, risk_increment = 0.005){ covariance <- cov(returns, use = "pairwise.complete.obs") num <- ncol(covariance) Amat <- cbind(1, diag(num)) bvec <- c(1, rep(0, num)) meq <- 1 risk_steps <- risk_premium_up/risk_increment+1 count <- 1 eff <- matrix(nrow = risk_steps, ncol = num + 3) colnames(eff) <- c(colnames(returns), "stdev", "exp_ret", "sharpe") loop_step <- seq(0, risk_premium_up, risk_increment) for(i in loop_step){ dvec <- colMeans(returns, na.rm = TRUE)*i sol <- quadprog::solve.QP(covariance, dvec = dvec, Amat = Amat, bvec = bvec, meq = meq) eff[count, "stdev"] <- sqrt(sum(sol$solution * colSums(covariance * sol$solution))) eff[count, "exp_ret"] <- as.numeric(sol$solution %*% colMeans(returns, na.rm = TRUE)) eff[count, "sharpe"] <- eff[count,"exp_ret"]/eff[count, "stdev"] eff[count, 1:num] <- sol$solution count <- count + 1 } return(as.data.frame(eff)) }

Python code:

# Load libraries import pandas as pd import pandas_datareader.data as web import numpy as np import matplotlib.pyplot as plt %matplotlib inline plt.style.use('ggplot') # SKIP IF ALREADY HAVE DATA # Load data start_date = '1970-01-01' end_date = '2019-12-31' symbols = ["WILL5000INDFC", "BAMLCC0A0CMTRIV", "GOLDPMGBD228NLBM", "CSUSHPINSA", "DGS5"] sym_names = ["stock", "bond", "gold", "realt", 'rfr'] filename = 'data_port_const.pkl' try: df = pd.read_pickle(filename) print('Data loaded') except FileNotFoundError: print("File not found") print("Loading data", 30*"-") data = web.DataReader(symbols, 'fred', start_date, end_date) data.columns = sym_names data_mon = data.resample('M').last() df = data_mon.pct_change()['1987':'2019'] # df.to_pickle(filename) # If you haven't saved the file dat = data_mon.pct_change()['1971':'2019'] # pd.to_pickle(df,filename) # if you haven't saved the file # Portfolio simulation functions ## Simulation function class Port_sim: def calc_sim(df, sims, cols): wts = np.zeros((sims, cols)) for i in range(sims): a = np.random.uniform(0,1,cols) b = a/np.sum(a) wts[i,] = b mean_ret = df.mean() port_cov = df.cov() port = np.zeros((sims, 2)) for i in range(sims): port[i,0] = np.sum(wts[i,]*mean_ret) port[i,1] = np.sqrt(np.dot(np.dot(wts[i,].T,port_cov), wts[i,])) sharpe = port[:,0]/port[:,1]*np.sqrt(12) best_port = port[np.where(sharpe == max(sharpe))] max_sharpe = max(sharpe) return port, wts, best_port, sharpe, max_sharpe def calc_sim_lv(df, sims, cols): wts = np.zeros(((cols-1)*sims, cols)) count=0 for i in range(1,cols): for j in range(sims): a = np.random.uniform(0,1,(cols-i+1)) b = a/np.sum(a) c = np.random.choice(np.concatenate((b, np.zeros(i))),cols, replace=False) wts[count,] = c count+=1 mean_ret = df.mean() port_cov = df.cov() port = np.zeros(((cols-1)*sims, 2)) for i in range(sims): port[i,0] = np.sum(wts[i,]*mean_ret) port[i,1] = np.sqrt(np.dot(np.dot(wts[i,].T,port_cov), wts[i,])) sharpe = port[:,0]/port[:,1]*np.sqrt(12) best_port = port[np.where(sharpe == max(sharpe))] max_sharpe = max(sharpe) return port, wts, best_port, sharpe, max_sharpe def graph_sim(port, sharpe): plt.figure(figsize=(14,6)) plt.scatter(port[:,1]*np.sqrt(12)*100, port[:,0]*1200, marker='.', c=sharpe, cmap='Blues') plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25) plt.title('Simulated portfolios', fontsize=20) plt.xlabel('Risk (%)') plt.ylabel('Return (%)') plt.show() # Constraint function def port_select_func(port, wts, return_min, risk_max): port_select = pd.DataFrame(np.concatenate((port, wts), axis=1)) port_select.columns = ['returns', 'risk', 1, 2, 3, 4] port_wts = port_select[(port_select['returns']*12 >= return_min) & (port_select['risk']*np.sqrt(12) <= risk_max)] port_wts = port_wts.iloc[:,2:6] port_wts = port_wts.mean(axis=0) return port_wts def port_select_graph(port_wts): plt.figure(figsize=(12,6)) key_names = {1:"Stocks", 2:"Bonds", 3:"Gold", 4:"Real estate"} lab_names = [] graf_wts = port_wts.sort_values()*100 for i in range(len(graf_wts)): name = key_names[graf_wts.index[i]] lab_names.append(name) plt.bar(lab_names, graf_wts, color='blue') plt.ylabel("Weight (%)") plt.title("Average weights for risk-return constraint", fontsize=15) for i in range(len(graf_wts)): plt.annotate(str(round(graf_wts.values[i])), xy=(lab_names[i], graf_wts.values[i]+0.5)) plt.show() # Return function with no rebalancing def rebal_func(act_ret, weights): ret_vec = np.zeros(len(act_ret)) wt_mat = np.zeros((len(act_ret), len(act_ret.columns))) for i in range(len(act_ret)): 
wt_ret = act_ret.iloc[i,:].values*weights ret = np.sum(wt_ret) ret_vec[i] = ret weights = (weights + wt_ret)/(np.sum(weights) + ret) wt_mat[i,] = weights return ret_vec, wt_mat ## Rum simulation and graph np.random.seed(123) port_sim_1, wts_1, _, sharpe_1, _ = Port_sim.calc_sim(df.iloc[1:60,0:4],1000,4) Port_sim.graph_sim(port_sim_1, sharpe_1) # Weight choice results_1_wts = port_select_func(port_sim_1, wts_1, 0.07, 0.1) port_select_graph(results_1_wts) # Compute satisfactory portfolio satis_ret = np.sum(results_1_wts * df.iloc[1:60,0:4].mean(axis=0).values) satis_risk = np.sqrt(np.dot(np.dot(results_1_wts.T, df.iloc[1:60,0:4].cov()),results_1_wts)) # Graph simulation with actual portfolio return plt.figure(figsize=(14,6)) plt.scatter(port_sim_1[:,1]*np.sqrt(12)*100, port_sim_1[:,0]*1200, marker='.', c=sharpe_1, cmap='Blues') plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25) plt.scatter(satis_risk*np.sqrt(12)*100, satis_ret*1200, c='red', s=50) plt.title('Simulated portfolios', fontsize=20) plt.xlabel('Risk (%)') plt.ylabel('Return (%)') plt.show() # Create efficient frontier function from scipy.optimize import minimize def eff_frontier(df_returns, min_ret, max_ret): n = len(df_returns.columns) def get_data(weights): weights = np.array(weights) returns = np.sum(df_returns.mean() * weights) risk = np.sqrt(np.dot(weights.T, np.dot(df_returns.cov(), weights))) sharpe = returns/risk return np.array([returns,risk,sharpe]) # Contraints def check_sum(weights): return np.sum(weights) - 1 # Rante of returns mus = np.linspace(min_ret,max_ret,20) # Function to minimize def minimize_volatility(weights): return get_data(weights)[1] # Inputs init_guess = np.repeat(1/n,n) bounds = tuple([(0,1) for _ in range(n)]) eff_risk = [] port_weights = [] for mu in mus: # function for return cons = ({'type':'eq','fun': check_sum}, {'type':'eq','fun': lambda w: get_data(w)[0] - mu}) result = minimize(minimize_volatility,init_guess,method='SLSQP',bounds=bounds,constraints=cons) eff_risk.append(result['fun']) port_weights.append(result.x) eff_risk = np.array(eff_risk) return mus, eff_risk, port_weights ## Create variables for froniter function df_returns = df.iloc[1:60, 0:4] min_ret = min(port_sim_1[:,0]) max_ret = max(port_sim_1[:,0]) eff_ret, eff_risk, eff_weights = eff_frontier(df_returns, min_ret, max_ret) ## Graph efficient frontier plt.figure(figsize=(12,6)) plt.scatter(port_sim_1[:,1]*np.sqrt(12)*100, port_sim_1[:,0]*1200, marker='.', c=sharpe_1, cmap='Blues') plt.plot(eff_risk*np.sqrt(12)*100,eff_ret*1200,'b--',linewidth=2) plt.scatter(satis_risk*np.sqrt(12)*100, satis_ret*1200, c='red', s=50) plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25) plt.title('Simulated portfolios', fontsize=20) plt.xlabel('Risk (%)') plt.ylabel('Return (%)') plt.show() ## Graph with unconstrained weights np.random.seed(123) port_sim_1lv, wts_1lv, _, sharpe_1lv, _ = Port_sim.calc_sim_lv(df.iloc[1:60,0:4],1000,4) Port_sim.graph_sim(port_sim_1lv, sharpe_1lv) # Weight choice results_1lv_wts = port_select_func(port_sim_1lv, wts_1lv, 0.07, 0.1) port_select_graph(results_1lv_wts) # Satisfactory portfolio unconstrained weights satis_ret1 = np.sum(results_1lv_wts * df.iloc[1:60,0:4].mean(axis=0).values) satis_risk1 = np.sqrt(np.dot(np.dot(results_1lv_wts.T, df.iloc[1:60,0:4].cov()),results_1lv_wts)) # Graph with efficient frontier plt.figure(figsize=(12,6)) plt.scatter(port_sim_1lv[:,1]*np.sqrt(12)*100, port_sim_1lv[:,0]*1200, marker='.', c=sharpe_1lv, cmap='Blues') 
plt.plot(eff_risk*np.sqrt(12)*100,eff_ret*1200,'b--',linewidth=2) plt.scatter(satis_risk1*np.sqrt(12)*100, satis_ret1*1200, c='red', s=50) plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25) plt.title('Simulated portfolios', fontsize=20) plt.xlabel('Risk (%)') plt.ylabel('Return (%)') plt.show() # Five year forward with unconstrained satisfactory portfolio # Returns ## Run rebalance function using desired weights port_1_act, wt_mat = rebal_func(df.iloc[61:121,0:4], results_1lv_wts) port_act = {'returns': np.mean(port_1_act), 'risk': np.std(port_1_act), 'sharpe': np.mean(port_1_act)/np.std(port_1_act)*np.sqrt(12)} # Run simulation on recent five-years np.random.seed(123) port_sim_2lv, wts_2lv, _, sharpe_2lv, _ = Port_sim.calc_sim_lv(df.iloc[61:121,0:4],1000,4) # Graph simulation with actual portfolio return plt.figure(figsize=(14,6)) plt.scatter(port_sim_2lv[:,1]*np.sqrt(12)*100, port_sim_2lv[:,0]*1200, marker='.', c=sharpe_2lv, cmap='Blues') plt.plot(eff_risk*np.sqrt(12)*100,eff_ret*1200,'b--',linewidth=2) plt.scatter(port_act['risk']*np.sqrt(12)*100, port_act['returns']*1200, c='red', s=50) plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25) plt.title('Simulated portfolios', fontsize=20) plt.xlabel('Risk (%)') plt.ylabel('Return (%)') plt.show() ## Eficient frontier on long term data df_returns_l = dat.iloc[1:254, 0:4] min_ret_l = min(port_sim_1[:,0]) max_ret_l = max(port_sim_1[:,0]) eff_ret_l, eff_risk_l, eff_weightsl = eff_frontier(df_returns1, min_ret1, max_ret1) ## Graph with original plt.figure(figsize=(12,6)) plt.scatter(port_sim_1lv[:,1]*np.sqrt(12)*100, port_sim_1lv[:,0]*1200, marker='.', c=sharpe_1lv, cmap='Blues') plt.plot(eff_risk_l*np.sqrt(12)*100,eff_ret_l*1200,'b--',linewidth=2) plt.scatter(satis_risk1*np.sqrt(12)*100, satis_ret1*1200, c='red', s=50) plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25) plt.title('Simulated portfolios', fontsize=20) plt.xlabel('Risk (%)') plt.ylabel('Return (%)') plt.show() ## Graph with five-year forward # Graph simulation with actual portfolio return plt.figure(figsize=(14,6)) plt.scatter(port_sim_2lv[:,1]*np.sqrt(12)*100, port_sim_2lv[:,0]*1200, marker='.', c=sharpe_2lv, cmap='Blues') plt.plot(eff_risk_l*np.sqrt(12)*100,eff_ret_l*1200,'b--',linewidth=2) plt.scatter(port_act['risk']*np.sqrt(12)*100, port_act['returns']*1200, c='red', s=50) plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25) plt.title('Simulated portfolios', fontsize=20) plt.xlabel('Risk (%)') plt.ylabel('Return (%)') plt.show()
  1. As opposed to linear since we’re dealing with squares

  2. Which is often the historical average return


To leave a comment for the author, please follow the link and comment on their blog: R on OSM. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Explainable ‘AI’ using Gradient Boosted randomized networks Pt2 (the Lasso)

Fri, 07/31/2020 - 02:00

[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post is about LSBoost, an Explainable ‘AI’ algorithm which uses Gradient Boosted randomized networks for pattern recognition. As we discussed last week, LSBoost is a cousin of GFAGBM’s LS_Boost. More specifically, in LSBoost the so-called weak learners from LS_Boost are based on randomized neural networks’ components and variants of Least Squares regression models.

I’ve already presented some promising examples of use of LSBoost based on Ridge Regression weak learners. In mlsauce’s version 0.7.1, the Lasso can also be used as an alternative ingredient to the weak learners. Here is a comparison of the regression coefficients obtained by using mlsauce’s implementation of Ridge regression and the Lasso:

R example: LSBoostRegressor with Ridge regression and the Lasso

The following example is about training set error vs testing set error, as a function of the regularization parameter, both for Ridge regression and Lasso-based weak learners.

Packages and data

# 0 - Packages and data -------------------------------------------------------
library(devtools)
devtools::install_github("thierrymoudiki/mlsauce/R-package")
library(mlsauce)
library(datasets)

print(summary(datasets::mtcars))

X <- as.matrix(datasets::mtcars[, -1])
y <- as.integer(datasets::mtcars[, 1])

n <- dim(X)[1]
p <- dim(X)[2]

set.seed(21341)
train_index <- sample(x = 1:n, size = floor(0.8*n), replace = TRUE)
test_index <- -train_index

X_train <- as.matrix(X[train_index, ])
y_train <- as.double(y[train_index])
X_test <- as.matrix(X[test_index, ])
y_test <- as.double(y[test_index])

LSBoost using Ridge regression

# 1 - Ridge --------------------------------------------------------------------
obj <- mlsauce::LSBoostRegressor() # default h is Ridge
print(obj$get_params())

n_lambdas <- 100
lambdas <- 10**seq(from=-6, to=6, length.out = n_lambdas)

rmse_matrix <- matrix(NA, nrow = 2, ncol = n_lambdas)
rownames(rmse_matrix) <- c("training rmse", "testing rmse")

for (j in 1:n_lambdas) {
  obj$set_params(reg_lambda = lambdas[j])
  obj$fit(X_train, y_train)
  rmse_matrix[, j] <- c(sqrt(mean((obj$predict(X_train) - y_train)**2)),
                        sqrt(mean((obj$predict(X_test) - y_test)**2)))
}

LSBoost using the Lasso

# 2 - Lasso --------------------------------------------------------------------
obj <- mlsauce::LSBoostRegressor(solver = "lasso")
print(obj$get_params())

n_lambdas <- 100
lambdas <- 10**seq(from=-6, to=6, length.out = n_lambdas)

rmse_matrix2 <- matrix(NA, nrow = 2, ncol = n_lambdas)
rownames(rmse_matrix2) <- c("training rmse", "testing rmse")

for (j in 1:n_lambdas) {
  obj$set_params(reg_lambda = lambdas[j])
  obj$fit(X_train, y_train)
  rmse_matrix2[, j] <- c(sqrt(mean((obj$predict(X_train) - y_train)**2)),
                         sqrt(mean((obj$predict(X_test) - y_test)**2)))
}
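To visualize the two error curves, something along these lines works with base graphics alone, using only the lambdas, rmse_matrix and rmse_matrix2 objects computed above (the figures in the original post may look different):

par(mfrow = c(1, 2))
matplot(log10(lambdas), t(rmse_matrix), type = "l", lty = 1, lwd = 2,
        col = c("blue", "red"), xlab = "log10(reg_lambda)", ylab = "RMSE",
        main = "Ridge weak learners")
legend("topleft", legend = rownames(rmse_matrix), col = c("blue", "red"), lty = 1)
matplot(log10(lambdas), t(rmse_matrix2), type = "l", lty = 1, lwd = 2,
        col = c("blue", "red"), xlab = "log10(reg_lambda)", ylab = "RMSE",
        main = "Lasso weak learners")
legend("topleft", legend = rownames(rmse_matrix2), col = c("blue", "red"), lty = 1)
par(mfrow = c(1, 1))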

R session info > print(session_info()) ─ Session info ───────────────────────────────────────────────────────────── setting value version R version 4.0.2 (2020-06-22) os Ubuntu 16.04.6 LTS system x86_64, linux-gnu ui RStudio language (EN) collate C.UTF-8 ctype C.UTF-8 tz Etc/UTC date 2020-07-31 ─ Packages ───────────────────────────────────────────────────────────────── package * version date lib source assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.2) backports 1.1.8 2020-06-17 [1] RSPM (R 4.0.2) callr 3.4.3 2020-03-28 [1] RSPM (R 4.0.2) cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.2) crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.2) curl 4.3 2019-12-02 [1] RSPM (R 4.0.2) desc 1.2.0 2018-05-01 [1] RSPM (R 4.0.2) devtools * 2.3.1 2020-07-21 [1] RSPM (R 4.0.2) digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.2) ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.2) fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.2) fs 1.4.2 2020-06-30 [1] RSPM (R 4.0.2) glue 1.4.1 2020-05-13 [1] RSPM (R 4.0.2) jsonlite 1.7.0 2020-06-25 [1] RSPM (R 4.0.2) lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.2) magrittr 1.5 2014-11-22 [1] RSPM (R 4.0.2) Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.2) memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.2) mlsauce * 0.7.1 2020-07-31 [1] Github (thierrymoudiki/mlsauce@68e391a) pkgbuild 1.1.0 2020-07-13 [1] RSPM (R 4.0.2) pkgload 1.1.0 2020-05-29 [1] RSPM (R 4.0.2) prettyunits 1.1.1 2020-01-24 [1] RSPM (R 4.0.2) processx 3.4.3 2020-07-05 [1] RSPM (R 4.0.2) ps 1.3.3 2020-05-08 [1] RSPM (R 4.0.2) R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.2) rappdirs 0.3.1 2016-03-28 [1] RSPM (R 4.0.2) Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2) remotes 2.2.0 2020-07-21 [1] RSPM (R 4.0.2) reticulate 1.16 2020-05-27 [1] RSPM (R 4.0.2) rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2) rprojroot 1.3-2 2018-01-03 [1] RSPM (R 4.0.2) rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.2) sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.2) testthat 2.3.2 2020-03-02 [1] RSPM (R 4.0.2) usethis * 1.6.1 2020-04-29 [1] RSPM (R 4.0.2) withr 2.2.0 2020-04-20 [1] RSPM (R 4.0.2) [1] /home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0 [2] /opt/R/4.0.2/lib/R/library

No post in August


To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R Package Integration with Modern Reusable C++ Code Using Rcpp – Part 3

Fri, 07/31/2020 - 02:00

[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Daniel Hanson is a full-time lecturer in the Computational Finance & Risk Management program within the Department of Applied Mathematics at the University of Washington.

In the previous post in this series, we looked at some design considerations when integrating standard and reusable C++ code into an R package. Specific emphasis was on Rcpp’s role in facilitating a means of communication between R and the C++ code, particularly highlighting a few of the C++ functions in the Rcpp namespace that conveniently and efficiently pass data between an R numeric vector and a C++ std::vector object.

Today, we will look at a specific example of implementing the interface. We will see how to configure code that allows the use of standard reusable C++ in an R package, without having to modify it with any R or Rcpp-specific syntax. It is admittedly a simple and toy example, but the goal is to provide a starting point that can be easily extended for more real world examples.

The Code

To get started, let’s have a look at the code we will use for our demonstration. It is broken into three categories, consistent with the design considerations from the previous post:

  • Standard and reusable C++: No dependence on R or Rcpp
  • Interface level C++: Uses functions in the Rcpp namespace
  • R functions exported by the interface level: Same names as in the interface level
Standard and Reusable C++ Code

In this example, we will use a small set of C++ non-member functions, and two classes. There is a declaration (header) file for the non-member functions, say NonmemberCppFcns.h, and another with class declarations for two shape classes, Square and Circle, called ConcreteShapes.h. Each of these is accompanied by a corresponding implementation file, with file extension .cpp, as one might expect in a more realistic C++ code base.

Nonmember C++ Functions

These functions are shown and described here, as declared in the following C++ header file:

#include <vector>

// Adds two real numbers
double add(double x, double y);

// Sorts a vector of real numbers and returns it
std::vector<double> sortVec(std::vector<double> v);

// Computes the product of the LCM and GCD of two integers,
// using C++17 functions std::lcm(.) and std::gcd(.)
int prodLcmGcd(int m, int n);

The last function uses recently added features in the C++ Standard Library, to show that we can use C++17.

C++ Classes

The two classes in our reusable code base are declared in the ConcreteShapes.h file, as shown and described here. Much like textbook C++ examples, we’ll write classes for two geometric shapes, each with a member function to compute the area of its corresponding object.

class Circle {
public:
    Circle(double radius);

    // Computes the area of a circle with given radius
    double area() const;

private:
    double radius_;
};

class Square {
public:
    Square(double side);

    // Computes the area of a square with given side length
    double area() const;

private:
    double side_;
};

Interface Level C++

Now, the next step is to employ Rcpp, namely for the following essential tasks:

  • Export the interface functions to R
  • Facilitate data exchange between R and C++ container objects

An interface file containing functions designated for export to R does not require a header file with declarations; one can think of it as being analogous to a .cpp file that contains the main() function in a C++ executable project. In addition, the interface can be contained in one file, or split into multiple files. For demonstration, I have written two such files: CppInterface.cpp, which provides the interface to the non-member functions above, and CppInterface2.cpp, which does the same for the two C++ classes.

Interface to Non-Member C++ Functions:

Let’s first have a look the CppInterface.cpp interface file, which connects R with the nonmember functions in our C++ code base:

#include "NonmemberCppFcns.h" #include #include // Nonmember Function Interfaces: // [[Rcpp::export]] int rAdd(double x, double y) { // Call the add(.) function in the reusable C++ code base: return add(x, y); } // [[Rcpp::export]] Rcpp::NumericVector rSortVec(Rcpp::NumericVector v) { // Transfer data from NumericVector to std::vector auto stlVec = Rcpp::as>(v); // Call the reusable sortVec(.) function, with the expected // std::vector argument: stlVec = sortVec(stlVec); // Reassign the results from the vector return object // to the same NumericVector v, using Rcpp::wrap(.): v = Rcpp::wrap(stlVec); // Return as an Rcpp::NumericVector: return v; } // C++17 example: // [[Rcpp::export]] int rProdLcmGcd(int m, int n) { return prodLcmGcd(m, n); } Included Declarations:

The NonmemberCppFcns.h declaration file is included at the top with #include, just as it would be in a standalone C++ application, so that the interface will recognize these functions that reside in the reusable code base. The STL vector declaration is required, as we shall soon see. And the key to making the interface work resides in the Rcpp.h file, which provides access to very useful C++ functions in the Rcpp namespace.

Function implementations:

Each of these functions is designated for export to R when the package is built by placing the // [[Rcpp::export]] tag just above each function signature, as shown above. In this particular example, each interface function simply calls a function in the reusable code base. For example, the rAdd(.) function simply calls the add(.) function in the reusable C++ code. In the absence of a user-defined namespace, the interface function name must be different from the function it calls to prevent name clash errors during the build, so I have simply chosen to prefix an r to the name of each interface function.

Note that the rSortVec(.) function takes in an Rcpp::NumericVector object, v. This type will accept data passed in from R as a numeric vector and present it as a C++ object. Then, so that we can call a function in our code base, such as sortVec(.), which expects a std::vector type as its input, Rcpp also provides the Rcpp::as<.>(.) function that facilitates the transfer of data from an Rcpp::NumericVector object to the STL container:

auto stlVec = Rcpp::as<std::vector<double>>(v);

Rcpp also gives us a function that will transfer data from a std::vector type being returned from our reusable C++ code back into an Rcpp::NumericVector, so that the results can be passed back to R as a familiar numeric vector type:

v = Rcpp::wrap(stlVec);

As the std::vector object is the workhorse C++ STL container in quantitative work, these two Rcpp functions are a godsend.
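If you want to see the as/wrap round trip in action without building a full package, a throwaway function compiled from R with Rcpp::cppFunction() behaves the same way. This is only a sketch: it assumes Rcpp and a working C++ toolchain are installed, and the function name is made up.

library(Rcpp)

cppFunction(includes = "#include <algorithm>", code = '
NumericVector sortViaStl(NumericVector v) {
  std::vector<double> s = Rcpp::as< std::vector<double> >(v); // NumericVector -> std::vector
  std::sort(s.begin(), s.end());                              // plain STL code
  return Rcpp::wrap(s);                                       // std::vector -> NumericVector
}')

sortViaStl(c(3, 1, 2))   # 1 2 3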

Remark 1: There is no rule that says an interface function can only call a single function in the reusable code; one can use whichever functions or classes that are needed to get the job done, just like with any other C++ function. I’ve merely kept it simple here for demonstration purposes.

Remark 2: The tag // [[Rcpp::plugins(cpp17)]] is sometimes placed at the top of a C++ source file in online examples related to Rcpp and C++17. I have not found this necessary in my own code, however, as long as the Makeconf file has been updated for C++17, as described in the first post in this series.

Interface to C++ Classes:

We now turn our attention to the second interface file, CppInterface2.cpp, which connects R with the C++ classes in our reusable code. It is shown here:

#include "ConcreteShapes.h" // Class Member Function Interfaces: // Interface to Square member // function area(.): // [[Rcpp::export]] double squareArea(double side) { Square sq(side); return sq.area(); } // Interface to Circle member // function area(.): // [[Rcpp::export]] double circleArea(double radius) { Circle circ(radius); return circ.area(); }

This is again nothing terribly sophisticated, but the good news is that it shows the process of creating instances of classes from the code base is not difficult at all. We first #include only the header file containing these class declarations; Rcpp.h is not required here, as we are not using any functions in the Rcpp namespace.

To compute the area of a square, the side length is input in R as a simple numeric type and passed to the interface function as a C++ double. The Square object, sq, is constructed with the side argument, and its area() member function performs said calculation and returns the result. The process is trivially similar for the circleArea(.) function.

R Functions Exported by the Interface Level

To wrap up this discussion, let’s look at the functions an R user will have available after we build the package in RStudio (coming next in this series). Each of these functions will be exported from their respective C++ interface functions as regular R functions, namely:

  • rAdd(.)
  • rSortVec(.)
  • rProdLcmGcd(.)
  • squareArea(.)
  • circleArea(.)
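Once the package is built, installed, and loaded, an R session might look like the sketch below; the package name cppdemo is an assumption for illustration, since the actual name is chosen when the package is built in the next post.

library(cppdemo)      # hypothetical package name

rAdd(2.5, 4.0)        # 6.5
rSortVec(c(3, 1, 2))  # 1 2 3
rProdLcmGcd(4, 6)     # lcm * gcd = 12 * 2 = 24
squareArea(3)         # 9
circleArea(1)         # pi, assuming area = pi * r^2 in the implementation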

The package user will not need to know or care that the core calculations are being performed in C++. Visually, we can represent the associations as shown in the following diagram:

The solid red line represents a “Chinese Wall” that separates our code base from the interface and allows us to maintain it as standard and reusable C++.

Summary

This concludes our example of configuring code that allows the use of standard reusable C++ in an R package, without having to modify it with any R or Rcpp-specific syntax. In the next post, we will examine how to actually build this code into an R package by leveraging the convenience of Rcpp and RStudio, and deploy it for any number of R users. The source code will also be made available so that you can try it out for yourself.



To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Spatial GLMM(s) using the INLA Approximation

Fri, 07/31/2020 - 02:00

[This article was first published on Corey Sparks R blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The INLA Approach to Bayesian models

The Integrated Nested Laplace Approximation, or INLA, approach is a recently developed, computationally simpler method for fitting Bayesian models (Rue et al., 2009), compared to traditional Markov Chain Monte Carlo (MCMC) approaches. INLA fits models that are classified as latent Gaussian models, which are applicable in many settings (Martino & Rue, 2010). In general, INLA fits a general form of additive models such as:

\(\eta = \alpha + \sum_{j=1}^{nf} f^{(j)}(u_{ij}) + \sum_{k=1}^{n\beta}\beta_k z_{ki} + \epsilon_i\)

where \(\eta\) is the linear predictor for a generalized linear model formula, and is composed of a linear function of some variables u, \(\beta\) are the effects of covariates, z, and \(\epsilon\) is an unstructured residual (Rue et al., 2009). As this model is often parameterized as a Bayesian one, we are interested in the posterior marginal distributions of all the model parameters. Rue and Martino (2007) show that the posterior marginal for the random effects (x) in such models can be approximated as:

\(\tilde{p}(x_i|y) = \sum_k \tilde{p}(x_i|\theta_k, y) \tilde{p}(\theta_k|y) \Delta_k\)

via numerical integration (Rue & Martino, 2007; Schrodle & Held, 2011a, 2011b). The posterior distribution of the hyperparameters (\(\theta\)) of the model can also be approximated as:

\(\tilde{p}(\theta | y) \propto \frac{p(x, \theta, y)}{\tilde{p}_G(x|\theta,y)} \Big|_{x = x^*(\theta)}\)

where \(G\) is a Gaussian approximation of the posterior and \(x^*(\theta)\) is the mode of the conditional distribution of \(p(x|\theta,y)\). Thus, instead of using MCMC to find an iterative, sampling-based estimate of the posterior, it is arrived at numerically. This method of fitting the spatial models specified above has been presented by numerous authors (Blangiardo & Cameletti, 2015; Blangiardo et al., 2013; Lindgren & Rue, 2015; Martins et al., 2013; Schrodle & Held, 2011a, 2011b), with comparable results to MCMC.
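Before turning to the county data, a minimal sketch of what an INLA call looks like may help orient readers (simulated data and made-up column names; this is not the model fit later in the post): a Poisson likelihood, one fixed effect, an iid random effect, and an expected-count offset passed through E.

library(INLA)   # installed from r-inla.org, not CRAN

d <- data.frame(y  = rpois(100, 5),   # observed counts
                E  = rep(5, 100),     # expected counts (the offset)
                x  = rnorm(100),      # a covariate
                id = 1:100)           # index for the iid random effect

fit <- inla(y ~ x + f(id, model = "iid"),
            family = "poisson", data = d, E = E,
            control.compute = list(dic = TRUE, waic = TRUE))

summary(fit)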

Libraries

#library(rgdal)
library(spdep)
library(RColorBrewer)
library(lattice)
library(INLA)
library(tigris)
library(tidycensus)
library(ggplot2)
library(dplyr)

Data

I have the data on my github site under the nhgis_vs page. These are data from the NHGIS project by IPUMS, which has started providing birth and death data from the US vital statistics program.

The data we will use here are infant mortality rates in US counties between 2000 and 2007.

files<-list.files("~/ExpanDrive/Google Drive/classes/dem7473/data/nhgis0022_csv/nhgis0022_csv/", pattern = "*.csv", full.names = T) vital<-lapply(files, read.csv, header=T) library(plyr) df <- ldply(vital, data.frame) df$cofips<-paste(substr(df$GISJOIN, 2,3), substr(df$GISJOIN, 5,7), sep="") df<-df%>% filter(YEAR %in%2000:2007)%>% mutate(rate=as.numeric(AGWJ001) )%>% select(YEAR, cofips,rate) head(df) ## YEAR cofips rate ## 1 2000 01001 34 ## 2 2000 01003 61 ## 3 2000 01005 125 ## 4 2000 01007 70 ## 5 2000 01009 89 ## 6 2000 01011 242 Census intercensus population estimates

From the Census population estimates program

popurl<-url("http://www2.census.gov/programs-surveys/popest/datasets/2000-2010/intercensal/county/co-est00int-tot.csv") pops<-read.csv(popurl) names(pops)<-tolower(names(pops)) pops<-pops%>% mutate(cofips = paste(sprintf(fmt = "%02d", state), sprintf(fmt = "%03d",county), sep=""))%>% filter(sumlev==50, !state%in%c(2, 15)) head(pops) ## sumlev region division state county stname ctyname estimatesbase2000 ## 1 50 3 6 1 1 Alabama Autauga County 43751 ## 2 50 3 6 1 3 Alabama Baldwin County 140416 ## 3 50 3 6 1 5 Alabama Barbour County 29042 ## 4 50 3 6 1 7 Alabama Bibb County 19856 ## 5 50 3 6 1 9 Alabama Blount County 50982 ## 6 50 3 6 1 11 Alabama Bullock County 11603 ## popestimate2000 popestimate2001 popestimate2002 popestimate2003 ## 1 44021 44889 45909 46800 ## 2 141342 144875 147957 151509 ## 3 29015 28863 28653 28594 ## 4 19913 21028 21199 21399 ## 5 51107 51845 52551 53457 ## 6 11581 11358 11256 11316 ## popestimate2004 popestimate2005 popestimate2006 popestimate2007 ## 1 48366 49676 51328 52405 ## 2 156266 162183 168121 172404 ## 3 28287 28027 27861 27757 ## 4 21721 22042 22099 22438 ## 5 54124 54624 55485 56240 ## 6 11056 11011 10776 11011 ## popestimate2008 popestimate2009 census2010pop popestimate2010 cofips ## 1 53277 54135 54571 54632 01001 ## 2 175827 179406 182265 183195 01003 ## 3 27808 27657 27457 27411 01005 ## 4 22705 22941 22915 22867 01007 ## 5 57055 57341 57322 57338 01009 ## 6 10953 10987 10914 10890 01011 Data prep pops.long<-reshape(data = pops, idvar = "cofips", varying = list(names(pops)[9:16]), direction="long", drop = names(pops)[c(2,3,4,5,6,8,17,18,19,20)], v.names = "population") pops.long$year<-pops.long$time+1999 head(pops.long) ## sumlev ctyname cofips time population year ## 01001.1 50 Autauga County 01001 1 44021 2000 ## 01003.1 50 Baldwin County 01003 1 141342 2000 ## 01005.1 50 Barbour County 01005 1 29015 2000 ## 01007.1 50 Bibb County 01007 1 19913 2000 ## 01009.1 50 Blount County 01009 1 51107 2000 ## 01011.1 50 Bullock County 01011 1 11581 2000 dat.long<-merge(pops.long, df, by.x=c("cofips", "year"), by.y=c("cofips", "YEAR")) head(dat.long) ## cofips year sumlev ctyname time population rate ## 1 01001 2000 50 Autauga County 1 44021 34 ## 2 01001 2001 50 Autauga County 2 44889 78 ## 3 01001 2002 50 Autauga County 3 45909 83 ## 4 01001 2003 50 Autauga County 4 46800 79 ## 5 01001 2004 50 Autauga County 5 48366 76 ## 6 01001 2005 50 Autauga County 6 49676 124 Get census data using tidycensus

Here I get data from the 2000 decennial census summary file 3

#v00<-load_variables(year=2000, dataset = "sf3", cache = T) cov_dat<-get_decennial(geography = "county", year = 2000, sumfile = "sf3", summary_var = "P001001", variables = c("P007003", "P007004","P007010","P053001", "P089001", "P089002" ), output = "wide") ## Getting data from the 2000 decennial Census cov_dat<-cov_dat%>% mutate(cofips=GEOID,pwhite=P007003/summary_value, pblack=P007004/summary_value, phisp=P007010/summary_value,medhhinc=as.numeric(scale(P053001)), ppov=P089002/P089001) final.dat<-merge(dat.long, cov_dat, by="cofips") head(final.dat) ## cofips year sumlev ctyname time population rate GEOID NAME ## 1 01001 2006 50 Autauga County 7 51328 93 01001 Autauga County ## 2 01001 2003 50 Autauga County 4 46800 79 01001 Autauga County ## 3 01001 2004 50 Autauga County 5 48366 76 01001 Autauga County ## 4 01001 2005 50 Autauga County 6 49676 124 01001 Autauga County ## 5 01001 2000 50 Autauga County 1 44021 34 01001 Autauga County ## 6 01001 2007 50 Autauga County 8 52405 83 01001 Autauga County ## P007003 P007004 P007010 P053001 P089001 P089002 summary_value pwhite ## 1 34760 7450 394 42013 43377 4738 43671 0.7959515 ## 2 34760 7450 394 42013 43377 4738 43671 0.7959515 ## 3 34760 7450 394 42013 43377 4738 43671 0.7959515 ## 4 34760 7450 394 42013 43377 4738 43671 0.7959515 ## 5 34760 7450 394 42013 43377 4738 43671 0.7959515 ## 6 34760 7450 394 42013 43377 4738 43671 0.7959515 ## pblack phisp medhhinc ppov ## 1 0.1705938 0.009022005 0.7593459 0.1092284 ## 2 0.1705938 0.009022005 0.7593459 0.1092284 ## 3 0.1705938 0.009022005 0.7593459 0.1092284 ## 4 0.1705938 0.009022005 0.7593459 0.1092284 ## 5 0.1705938 0.009022005 0.7593459 0.1092284 ## 6 0.1705938 0.009022005 0.7593459 0.1092284 Create expected numbers of cases

In count data models, and in spatial epidemiology generally, we have to express the raw counts of events relative to some expected value, or population offset; see this RPub for a reminder.

#ratesyr<-aggregate(rate~year, final.dat,mean) #in this case, we will standardize to the average IMR for the period #ratesyr$E<-ratesyr$rate #final.dat<-merge(final.dat, ratesyr[,-2], by="year") #rates<-aggregate(rate~1, final.dat, mean) final.dat$E_d<-mean(final.dat$rate) final.dat<-final.dat[order(final.dat$cofips, final.dat$year),] final.dat$id<-1:dim(final.dat)[1] head(final.dat) ## cofips year sumlev ctyname time population rate GEOID NAME ## 5 01001 2000 50 Autauga County 1 44021 34 01001 Autauga County ## 8 01001 2001 50 Autauga County 2 44889 78 01001 Autauga County ## 7 01001 2002 50 Autauga County 3 45909 83 01001 Autauga County ## 2 01001 2003 50 Autauga County 4 46800 79 01001 Autauga County ## 3 01001 2004 50 Autauga County 5 48366 76 01001 Autauga County ## 4 01001 2005 50 Autauga County 6 49676 124 01001 Autauga County ## P007003 P007004 P007010 P053001 P089001 P089002 summary_value pwhite ## 5 34760 7450 394 42013 43377 4738 43671 0.7959515 ## 8 34760 7450 394 42013 43377 4738 43671 0.7959515 ## 7 34760 7450 394 42013 43377 4738 43671 0.7959515 ## 2 34760 7450 394 42013 43377 4738 43671 0.7959515 ## 3 34760 7450 394 42013 43377 4738 43671 0.7959515 ## 4 34760 7450 394 42013 43377 4738 43671 0.7959515 ## pblack phisp medhhinc ppov E_d id ## 5 0.1705938 0.009022005 0.7593459 0.1092284 72.33683 1 ## 8 0.1705938 0.009022005 0.7593459 0.1092284 72.33683 2 ## 7 0.1705938 0.009022005 0.7593459 0.1092284 72.33683 3 ## 2 0.1705938 0.009022005 0.7593459 0.1092284 72.33683 4 ## 3 0.1705938 0.009022005 0.7593459 0.1092284 72.33683 5 ## 4 0.1705938 0.009022005 0.7593459 0.1092284 72.33683 6 options(scipen=999)

Next we build the spatial information: we get the county polygons directly from the Census using the tigris package, drop counties outside the contiguous 48 US states, and keep a subset of (mostly southern) states for this analysis.

us_co<-counties( cb = T)
us_co<-us_co%>%
  subset(!STATEFP%in%c("02", "15", "60", "66", "69", "72", "78"))%>%
  filter(STATEFP%in%c("01", "05", "12", "13", "21", "22", "28", "37", "45", "47", "48", "51", "40"))

Construction of spatial relationships: Contiguity based neighbors

In a general sense, we can think of a square grid: cells that share common elements of their geometry are said to be “neighbors”. There are several ways to describe these patterns, and for polygons, we generally use the rules of the chessboard.

Rook adjacency: Neighbors must share a line segment.

Queen adjacency: Neighbors must share a vertex or a line segment.

If polygons share these boundaries (based on the specific definition: rook or queen), they are said to be “spatial neighbors” of one another. The figure below illustrates this principle.

For an observation of interest (the pink area), the rook-adjacent areas are those shown in green in the figure, because they share a line segment with it. In the second part of the figure, on the right, the pink area has a different set of neighbors compared to the rook rule, because it also shares vertices with other polygons, making those polygons queen neighbors.

Adjacency using Chessboard Rules

Order of adjacency

The figure above also highlights the order of adjacency among observations. By order of adjacency, we simply mean that observations are either immediate neighbors (the green areas) or neighbors of immediate neighbors. These are referred to as first- and second-order neighbors.

So we can see that the yellow polygons are the neighboring areas for this tract, which lets us think about the spatial structure of the area surrounding this part of campus.

As an example, let's consider the case of San Antonio again. If our data are polygons, then there is a function in the spdep library in R, poly2nb(), that will take a polygon layer and find the neighbors of all areas using either a queen or a rook rule. First we form the neighbors using the rook rule for all the tracts in Bexar County; a brief sketch of this is given below.
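Below is a minimal sketch of how this looks with spdep; the object sa_tracts, standing in for an sf polygon layer of Bexar County tracts, is a hypothetical placeholder since those data are not loaded in this post.

library(spdep)
# sa_tracts: assumed sf polygon layer of Bexar County census tracts
nb_rook  <- poly2nb(sa_tracts, queen = FALSE)  # rook rule: shared line segments only
nb_queen <- poly2nb(sa_tracts, queen = TRUE)   # queen rule: shared segments or vertices
summary(nb_rook)
# first- and second-order neighbor sets can be built from the queen set with nblag()
nb_lag2 <- nblag(nb_queen, maxlag = 2)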

Distance based association

The queen and rook rules are useful for polygon features, but distance-based contiguity is useful for all feature types (points, polygons, lines). The idea is similar to the polygon adjacency rule above, but here the rule is based on the calculated distance between features. There are a variety of distance metrics used in statistics, but the most commonly assumed is the Euclidean distance. The Euclidean distance between any two points is:

\[D = \sqrt{\left(x_1 - x_2\right)^2 + \left(y_1 - y_2\right)^2} \] where x and y are the coordinates of each of the two features. For polygons, these coordinates are typically the centroid of the polygon (you may have noticed this above when we were plotting the neighbor lists), while for point features, they are the feature's two-dimensional coordinates. The collection of these distances between all features forms what is known as the distance matrix, which summarizes all pairwise distances between features in the data. A small sketch of this follows.
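As a quick illustration with made-up coordinates (not the county data used in this post), the distance matrix and a distance-band neighbor definition can be computed like this:

library(spdep)
xy <- cbind(x = c(0, 1, 3, 5), y = c(0, 1, 2, 0))  # hypothetical point coordinates
as.matrix(dist(xy))                        # full Euclidean distance matrix between features
nb_dist <- dnearneigh(xy, d1 = 0, d2 = 3)  # neighbors are features within 3 distance units
summary(nb_dist)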

K nearest neighbors
  • A useful way to use distances is to construct a k-nearest neighbors set.

  • This will find the “k” closest observations for each observation, where k is some integer.

  • For instance, if we find the k = 3 nearest neighbors, then each observation will have 3 neighbors, which are the closest observations to it, regardless of the distance between them, and that caveat is important.

  • Using the k nearest neighbor rule, two observations could potentially be very far apart and still be considered neighbors.

#In INLA, we don't need FIPS codes, we need a simple numeric index for our counties
us_co$struct<-1:dim(us_co)[1]
nbs<-knearneigh(coordinates(as_Spatial(us_co) ), k = 5, longlat = T) #k=5 nearest neighbors
nbs<-knn2nb(nbs, row.names = us_co$struct, sym = T) #force symmetry!!
mat <- nb2mat(nbs, style="B",zero.policy=TRUE)
colnames(mat) <- rownames(mat)
mat <- as.matrix(mat[1:dim(mat)[1], 1:dim(mat)[1]])

nb2INLA("cl_graph",nbs)
am_adj <-paste(getwd(),"/cl_graph",sep="")
H<-inla.read.graph(filename="cl_graph")
#image(inla.graph2matrix(H), xlab="", ylab="", main="")

Plot geographies

library(sf)
us_co<-st_as_sf(us_co)
us_co$cofips<-paste(us_co$STATEFP, us_co$COUNTYFP, sep="")
us_co%>%
  ggplot()+geom_sf()+coord_sf(crs =7603)

final.dat<-merge( us_co,final.dat, by="cofips", all.y=F) final.dat<-final.dat[order(final.dat$cofips, final.dat$year),] final.dat%>% filter(year==2000)%>% ggplot()+geom_sf(aes(fill=rate, col=rate))+coord_sf(crs =7603)+ggtitle("Infant Mortality Rate - 2000")

Model setup
  • We have a count outcome (deaths and births), in counties over time, and a set of time-constant covariates.
  • We have several options in the GLM framework with which to model these data, for example:

  • Binomial – \[y_{ij} \sim \text{Binomial}(n_{ij}, \pi_{ij}) \text{: } \text{logit}(\pi_{ij}) = \beta_{0} + x'\beta \]

  • Poisson – \[y_{ij} \sim \text{Poisson}(\mu_{ij}) \text{: } \log(\mu_{ij}) = \log(E_{ij}) + \beta_{0} + x'\beta \]

  • Negative Binomial – \[y_{ij} \sim \text{Neg Bin}(\mu_{ij}, \alpha) \text{: } \log(\mu_{ij}) = \log(E_{ij}) + \beta_{0} + x'\beta \]

  • In addition, there are zero-inflated versions of each of these distributions. A minimal non-spatial sketch of the negative binomial option follows this list.
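For comparison only, here is a minimal non-spatial sketch of the negative binomial option using glm.nb() from the MASS package on the data constructed above; this is an assumed analogue of the INLA models fit below, not the analysis itself.

library(MASS)
m_nb <- glm.nb(rate ~ scale(pblack) + scale(phisp) + scale(ppov) + year +
                 offset(log(E_d)),   # expected-count offset, as in the INLA models below
               data = final.dat)
summary(m_nb)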

ggplot(data = final.dat)+geom_histogram(aes(x =log(rate) , y=0.5*..density..))+facet_wrap(~year)+ ggtitle(label = "Distribution of Infant Mortality Rate by Year", subtitle = "US Counties, 2000-2007") ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## Warning: Removed 1849 rows containing non-finite values (stat_bin).

ggplot(data = final.dat)+geom_histogram(aes(x =log(rate/E_d) , y=0.5*..density..))+facet_wrap(~year)+ ggtitle(label = "Distribution of Infant Mortality Relative Risk by Year", subtitle = "US Counties, 2000-2007") ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## Warning: Removed 1849 rows containing non-finite values (stat_bin).

final.dat%>% dplyr::group_by(cofips)%>% dplyr::summarise(meanrate=mean(rate), meanE=mean(E_d))%>% dplyr::mutate(rate_e=meanrate/meanE)%>% #ungroup() # filter(year%in%c(2000))%>% # mutate(rate_e=rate/E_d)%>% mutate(qrr=cut(rate_e, breaks = quantile(rate_e, p=seq(0,1,length.out = 7)), include.lowest = T))%>% ggplot()+ geom_sf(aes(fill=qrr))+ scale_fill_brewer(type="seq", palette = "RdBu")+ ggtitle(label="Relative Risk Quartile - period average 2000 to 2007")+ coord_sf(crs = 7603) ## `summarise()` ungrouping output (override with `.groups` argument)

We can fit these models in the Bayesian framework with INLA.

First, we consider the basic GLM for the mortality outcome, without any hierarchical structure. We can write this model as a Negative Binomial model, for instance as:

\[\text{Deaths}_{ij} \sim NB(\mu_{ij}, \gamma)\] \[\log(\mu_{ij}) = \log(E_d)_{ij} + X' \beta\]

INLA will use vague Normal priors for the \(\beta\)’s, and we have other parameters in the model to specify priors for. INLA does not require you to specify all priors, as all parameters have a default prior specification. In this example, I will use a \(Gamma(1, .5)\) prior for all hierarchical precision terms; the equivalent explicit syntax is sketched below.
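For reference, the shorthand param = c(1, .5) used in the f() terms below corresponds to the explicit hyper specification on the log precision shown in this sketch; the formula object f_alt is my own illustration, not part of the original analysis.

f_alt <- rate ~ scale(pblack) + scale(phisp) + scale(ppov) + year +
  f(struct, model = "iid",
    hyper = list(prec = list(prior = "loggamma", param = c(1, 0.5))))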

#Model specification: f1<-rate~scale(pblack)+scale(phisp)+scale(ppov)+year #Model fit mod1<-inla(formula = f1,data = final.dat, #linear predictor - fixed effects family = "nbinomial", E = E_d, #marginal distribution for the outcome, expected count control.compute = list(waic=T), # compute DIC or not? control.predictor = list(link=1), #estimate predicted values & their marginals or not? num.threads = 2, verbose = F) #model summary summary(mod1) ## ## Call: ## c("inla(formula = f1, family = \"nbinomial\", data = final.dat, E = ## E_d, ", " verbose = F, control.compute = list(waic = T), ## control.predictor = list(link = 1), ", " num.threads = 2)") ## Time used: ## Pre = 0.928, Running = 21.8, Post = 0.722, Total = 23.5 ## Fixed effects: ## mean sd 0.025quant 0.5quant 0.975quant mode kld ## (Intercept) -5.047 10.723 -26.102 -5.048 15.989 -5.047 0 ## scale(pblack) 0.159 0.015 0.130 0.159 0.188 0.159 0 ## scale(phisp) -0.025 0.013 -0.050 -0.025 0.001 -0.025 0 ## scale(ppov) 0.041 0.015 0.012 0.041 0.070 0.041 0 ## year 0.003 0.005 -0.008 0.003 0.013 0.003 0 ## ## Model hyperparameters: ## mean sd 0.025quant ## size for the nbinomial observations (1/overdispersion) 0.624 0.009 0.608 ## 0.5quant 0.975quant ## size for the nbinomial observations (1/overdispersion) 0.624 0.641 ## mode ## size for the nbinomial observations (1/overdispersion) 0.624 ## ## Expected number of effective parameters(stdev): 5.04(0.001) ## Number of equivalent replicates : 2124.92 ## ## Watanabe-Akaike information criterion (WAIC) ...: 114586.38 ## Effective number of parameters .................: 10.27 ## ## Marginal log-Likelihood: -57331.80 ## Posterior marginals for the linear predictor and ## the fitted values are computed

Plot our observed vs fitted values

plot(x= mod1$summary.fitted.values$mean, y=final.dat$rate/final.dat$E_d , ylab="Observed", xlab="Estimated" )

Basic county level random intercept model

Now we add basic nesting of rates within counties, with a random intercept term for each county. This would allow there to be heterogeneity in the mortality rate for each county, over and above each county’s observed characteristics.

This model would be:

\[\text{Deaths}_{ij} \sim NB(\mu_{ij}, \gamma)\] \[\log(\mu_{ij}) = \log(E_d)_{ij} + X' \beta + u_j\] \[u_j \sim \text{Normal} (0 , \tau_u)\]

where \(\tau_u\) here is the precision, not the variance, and precision = 1/variance. INLA puts a log-gamma prior on the precision by default.
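As a quick numeric illustration of that relationship, using a made-up precision value of the size INLA reports below:

prec <- 50       # hypothetical precision for the county random effect
1 / prec         # implied variance = 0.02
sqrt(1 / prec)   # implied standard deviation, about 0.14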

f2<-rate~scale(pblack)+scale(phisp)+scale(ppov)+year+ #fixed effects f(struct, model = "iid",param=c(1,.5)) #random effects mod2<-inla(formula = f2,data = final.dat, family = "nbinomial", E = E_d, control.compute = list(waic=T), control.predictor = list(link=1), num.threads = 2, verbose = F) #total model summary summary(mod2) ## ## Call: ## c("inla(formula = f2, family = \"nbinomial\", data = final.dat, E = ## E_d, ", " verbose = F, control.compute = list(waic = T), ## control.predictor = list(link = 1), ", " num.threads = 2)") ## Time used: ## Pre = 0.571, Running = 160, Post = 1.36, Total = 162 ## Fixed effects: ## mean sd 0.025quant 0.5quant 0.975quant mode kld ## (Intercept) -2.824 10.758 -23.945 -2.824 18.279 -2.824 0 ## scale(pblack) 0.158 0.015 0.128 0.158 0.189 0.158 0 ## scale(phisp) -0.041 0.014 -0.069 -0.041 -0.013 -0.041 0 ## scale(ppov) 0.044 0.015 0.014 0.044 0.074 0.044 0 ## year 0.001 0.005 -0.009 0.001 0.012 0.001 0 ## ## Random effects: ## Name Model ## struct IID model ## ## Model hyperparameters: ## mean sd 0.025quant ## size for the nbinomial observations (1/overdispersion) 0.627 0.009 0.609 ## Precision for struct 50.626 7.005 38.292 ## 0.5quant 0.975quant ## size for the nbinomial observations (1/overdispersion) 0.627 0.644 ## Precision for struct 50.138 65.780 ## mode ## size for the nbinomial observations (1/overdispersion) 0.626 ## Precision for struct 49.174 ## ## Expected number of effective parameters(stdev): 125.34(15.33) ## Number of equivalent replicates : 85.47 ## ## Watanabe-Akaike information criterion (WAIC) ...: 114610.09 ## Effective number of parameters .................: 66.26 ## ## Marginal log-Likelihood: -57375.58 ## Posterior marginals for the linear predictor and ## the fitted values are computed Marginal Distributions of hyperparameters

We can plot the posterior marginal of the hyperparameter in this model, in this case \(\sigma^2_u = 1/\tau_u\).

m2<- inla.tmarginal( function(x) (1/x), #invert the precision to be on variance scale mod2$marginals.hyperpar$`Precision for struct`) #95% credible interval for the variance inla.hpdmarginal(.95, marginal=m2) ## low high ## level:0.95 0.01491462 0.02565338 plot(m2, type="l", main=c("Posterior distibution for between county variance", "- IID model -"))

final.dat$fitted_m2<-mod2$summary.fitted.values$mean p1<-final.dat%>% filter(year%in%c(2000))%>% mutate(qrr=cut(fitted_m2, breaks = quantile(final.dat$fitted_m2, p=seq(0,1,length.out = 6)), include.lowest = T))%>% ggplot()+geom_sf(aes(fill=qrr))+scale_colour_brewer(palette = "RdBu" )+scale_fill_brewer(palette = "RdBu", na.value="grey")+guides(fill=guide_legend(title="Relative Risk Quartile"))+ggtitle(label="Relative Risk Quartile - IID Model, 2000")+coord_sf(crs = 7603) p2<-final.dat%>% filter(year%in%c(2007))%>% mutate(qrr=cut(fitted_m2, breaks = quantile(final.dat$fitted_m2, p=seq(0,1,length.out = 6)), include.lowest = T))%>% ggplot()+geom_sf(aes(fill=qrr))+scale_colour_brewer(palette = "RdBu" )+scale_fill_brewer(palette = "RdBu", na.value="grey")+guides(fill=guide_legend(title="Relative Risk Quartile"))+ggtitle(label="Relative Risk Quartile - IID Model, 2007")+coord_sf(crs = 7603) library(gridExtra) ## ## Attaching package: 'gridExtra' ## The following object is masked from 'package:dplyr': ## ## combine pall<-grid.arrange(p1, p2, nrow=2)

pall

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]

# library(mapview)
#
# map1<-final.dat%>%
#   filter(year%in%c(2007))%>%
#   mutate(qrr=cut(fitted_m2, breaks = quantile(fitted_m2, p=seq(0,1,length.out = 8))))
# clrs <- colorRampPalette(brewer.pal(8, "RdBu"))
# mapView(as(map1, "Spatial"), zcol="qrr", legend=T, col.regions=clrs)

BYM Model

A model with spatial correlation, the Besag, York, and Mollie (1991) model, plus temporal heterogeneity: \[\text{Deaths}_{ij} \sim NB(\mu_{ij}, \gamma)\] \[\log(\mu_{ij}) = \log(E_d)_{ij} + X' \beta + u_j + v_j + \gamma_t\]

This model has two random effects: an IID random effect \(u_j\) and a spatially correlated random effect \(v_j\), the latter specified with a conditionally autoregressive prior. This is the Besag model:

\[v_j | v_{i \neq j} \sim \text{Normal}\left(\frac{1}{n_j}\sum_{i \sim j}v_i, \frac{1}{n_j\tau_v}\right)\] Here \(u_j\) is an IID Normal random effect and \(\gamma_t\) is also given an IID Normal random effect specification. There are now three hyperparameters, \(\tau_u\), \(\tau_v\), and \(\tau_{\gamma}\), each given a log-gamma prior.

For the BYM model we must specify the spatial connectivity matrix in the random effect.
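As an aside, INLA also provides the reparameterized "bym2" version of this model, which separates the total standard deviation from a spatial mixing parameter and uses penalized-complexity priors. A sketch of that alternative specification, with illustrative prior values I have chosen, might look like the following; it is not the model fit below.

f3_alt <- rate ~ scale(pblack) + scale(phisp) + scale(ppov) +
  f(struct, model = "bym2", graph = H, scale.model = TRUE, constr = TRUE,
    hyper = list(prec = list(prior = "pc.prec", param = c(1, 0.01)),
                 phi  = list(prior = "pc", param = c(0.5, 0.5)))) +
  f(year, model = "iid")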

#final.dat$year_c<-final.dat$year - 2004 f3<-rate~scale(pblack)+scale(phisp)+scale(ppov)+ f(struct, model = "bym", constr = T, scale.model = T, graph = H,param=c(1,.5))+ f(year, model="iid",param=c(1,.5)) #temporal random effect mod3<-inla(formula = f3,data = final.dat, family = "nbinomial", E = E_d, control.compute = list(waic=T), num.threads = 2, verbose = F, control.predictor = list(link=1)) #total model summary summary(mod3) ## ## Call: ## c("inla(formula = f3, family = \"nbinomial\", data = final.dat, E = ## E_d, ", " verbose = F, control.compute = list(waic = T), ## control.predictor = list(link = 1), ", " num.threads = 2)") ## Time used: ## Pre = 0.737, Running = 138, Post = 1.26, Total = 140 ## Fixed effects: ## mean sd 0.025quant 0.5quant 0.975quant mode kld ## (Intercept) 0.115 0.129 -0.145 0.115 0.374 0.115 0 ## scale(pblack) 0.157 0.016 0.126 0.158 0.189 0.158 0 ## scale(phisp) -0.039 0.016 -0.069 -0.039 -0.007 -0.040 0 ## scale(ppov) 0.043 0.016 0.012 0.043 0.075 0.043 0 ## ## Random effects: ## Name Model ## struct BYM model ## year IID model ## ## Model hyperparameters: ## mean sd ## size for the nbinomial observations (1/overdispersion) 0.627 0.009 ## Precision for struct (iid component) 51.094 7.099 ## Precision for struct (spatial component) 1974.289 1903.577 ## Precision for year 8.760 4.130 ## 0.025quant 0.5quant ## size for the nbinomial observations (1/overdispersion) 0.609 0.627 ## Precision for struct (iid component) 38.602 50.591 ## Precision for struct (spatial component) 174.447 1425.658 ## Precision for year 2.885 8.075 ## 0.975quant mode ## size for the nbinomial observations (1/overdispersion) 0.644 0.628 ## Precision for struct (iid component) 66.447 49.595 ## Precision for struct (spatial component) 7055.730 496.592 ## Precision for year 18.742 6.583 ## ## Expected number of effective parameters(stdev): 133.75(15.30) ## Number of equivalent replicates : 80.09 ## ## Watanabe-Akaike information criterion (WAIC) ...: 114605.76 ## Effective number of parameters .................: 69.81 ## ## Marginal log-Likelihood: -56934.15 ## Posterior marginals for the linear predictor and ## the fitted values are computed plot(y=mod3$summary.random$year$mean,x=unique(final.dat$year), type="l")

m3a<- inla.tmarginal( function(x) (1/x), mod3$marginals.hyperpar$`Precision for struct (iid component)`) m3b<- inla.tmarginal( function(x) (1/x), mod3$marginals.hyperpar$`Precision for struct (spatial component)`) m3c<- inla.tmarginal( function(x) (1/x), mod3$marginals.hyperpar$`Precision for year`) plot(m3a, type="l", main=c("Posterior distibution for between county variance", "- BYM model -"), xlim=c(0, .2), ylim=c(0, 300)) lines(m3b, col="red") lines(m3c, col="green") legend("topright", legend=c("BYM IID", "BYM Spatial", "Year"), col=c(1, "red", "green"), lty=c(1,1,1))

#HPD intervals inla.hpdmarginal(.95,m3a) ## low high ## level:0.95 0.01475866 0.02544088 inla.hpdmarginal(.95,m3b) ## low high ## level:0.95 0.00005416961 0.003970123 inla.hpdmarginal(.95,m3c) ## low high ## level:0.95 0.03927999 0.2945931

This indicates very low spatially correlated variance in these data.

Space-time mapping of the fitted values

final.dat$fitted_m3<-mod3$summary.fitted.values$mean

p3<-final.dat%>%
  filter(year%in%c(2000))%>%
  mutate(qrr=cut(fitted_m3, breaks = quantile(final.dat$fitted_m3, p=seq(0,1,length.out = 6)), include.lowest = T))%>%
  ggplot()+geom_sf(aes(fill=qrr))+scale_colour_brewer(palette = "RdBu" )+scale_fill_brewer(palette = "RdBu", na.value="grey")+guides(fill=guide_legend(title="Relative Risk Quartile"))+ggtitle(label="Relative Risk Quartile - BYM Model, 2000")+coord_sf(crs = 7603)

p4<-final.dat%>%
  filter(year%in%c(2007))%>%
  mutate(qrr=cut(fitted_m3, breaks = quantile(final.dat$fitted_m3, p=seq(0,1,length.out = 6)), include.lowest = T))%>%
  ggplot()+geom_sf(aes(fill=qrr))+scale_colour_brewer(palette = "RdBu" )+scale_fill_brewer(palette = "RdBu", na.value="grey")+guides(fill=guide_legend(title="Relative Risk Quartile"))+ggtitle(label="Relative Risk Quartile - BYM Model, 2007")+coord_sf(crs = 7603)

pall2<-grid.arrange(p3, p4, nrow=2)

pall2

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]

#library(mapview)
#map1<-final.dat%>%
#  filter(year%in%c(2007))%>%
#  mutate(qrr=cut(fitted_m3, breaks = quantile(fitted_m3, p=seq(0,1,length.out = 8))))
#clrs <- colorRampPalette(brewer.pal(8, "RdBu"))
#mapView(as(map1, "Spatial"), zcol="qrr", legend=T, col.regions=clrs)

Map of spatial random effects

It is common to map the random effects from the BYM model to look for spatial trends; in this case, there are no strong spatial signals:

us_co$sp_re<-mod3$summary.random$struct$mean[1:length(unique(final.dat$cofips))] us_co%>% mutate(qse=cut(sp_re, breaks = quantile(sp_re, p=seq(0,1,length.out = 6)), include.lowest = T))%>% ggplot()+geom_sf(aes(fill=qse))+scale_colour_brewer(palette = "RdBu" )+scale_fill_brewer(palette = "RdBu", na.value="grey")+guides(fill=guide_legend(title="Spatial Excess Risk"))+ggtitle(label="Spatial Random Effect - BYM Model")+coord_sf(crs = 7603)

Exceedence probabilities

In Bayesian spatial models that are centered on an epidemiological type of outcome, it is common to examine the data for spatial clustering. One way to do this is to examine the clustering in the relative risk from one of these GLMM models. For instance, if \(\theta\) is the relative risk \[\theta = \exp(\beta_0 + \beta_1 x_1 + u_j)\] from one of our negative binomial models above, we can use the posterior marginals of the relative risk to ask whether \(\theta \gt \theta^*\), where \(\theta^*\) is a specific level of excess risk, say 25% extra risk, or \(\theta > 1.25\). If \(\text{Pr}(\theta \gt \theta^*)\) is high, then there is evidence that the excess risk is not only high, but significantly high.

To get the exceedence probabilities from one of our models, we can use the inla.pmarginal() function, which returns \(\text{Pr}(\theta \leq \theta^*)\); the exceedence probability is then \(1 - \text{Pr}(\theta \leq \theta^*)\).

thetastar<-1.25#theta* inlaprob<- unlist(lapply(mod3$marginals.fitted.values, function(X){ 1-inla.pmarginal(thetastar, X) })) hist(inlaprob)

So, we see lots of occasions where the exceedence probability is greater than .9. We can visualize these in a map.

final.dat$exceedprob<-inlaprob final.dat%>% filter(year%in%c(2007))%>% mutate(qrr=cut(exceedprob, breaks = c(0, .5, .9, .95, .99, 1), include.lowest = T))%>% ggplot()+geom_sf(aes(fill=qrr))+scale_colour_brewer(palette = "Blues" )+scale_fill_brewer(palette = "Blues", na.value="grey")+guides(fill=guide_legend(title=""))+ggtitle(label=expression(paste("Exceedence Probability Relative Risk ","Pr( ",theta," >1.25"," ) - 2007") ))+coord_sf(crs = 7603)

#library(mapview) #map1<-final.dat%>% # filter(year%in%c(2007))%>% # mutate(qrr=cut(exceedprob, breaks = c(0, .5, .9, .95, .99, 1), include.lowest = T)) #clrs <- colorRampPalette(brewer.pal(6, "Blues")) #mapView(as(map1, "Spatial"), zcol="qrr", legend=T, col.regions=clrs, map.types="OpenStreetMap")

This shows several areas of the South where the infant mortality rate is significantly higher than the national rate, with high posterior probability.

References

Besag, J., York, J., & Mollié, A. (1991). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43(1), 1-20. https://doi.org/10.1007/BF00116466


To leave a comment for the author, please follow the link and comment on their blog: Corey Sparks R blog.

rfm 0.2.2

Fri, 07/31/2020 - 02:00

[This article was first published on Rsquared Academy Blog - Explore Discover Learn, and kindly contributed to R-bloggers].

We’re excited to announce the release of rfm 0.2.2 on CRAN! rfm provides tools for customer segmentation using Recency, Frequency, and Monetary value analysis. It includes a Shiny app for interactive segmentation. You can install rfm with:

install.packages("rfm")

In this blog post, we will summarize the changes implemented in the current (0.2.2) and previous release (0.2.1).

Segmentation

In previous versions, rfm_segment() would overwrite a segment if the intervals used to define it were a subset of another segment's intervals. It was expected that the end user would be careful to ensure that the intervals for each segment were unique and not a subset of any other segment's. You can see the example here.

We are grateful to @leungi for bringing this to our attention and also for fixing it. Now, rfm_segment() does not overwrite the segments even if the intervals for one segment are a subset of another's.

# analysis date analysis_date <- lubridate::as_date("2006-12-31") # rfm score rfm_result <- rfm_table_order(rfm_data_orders, customer_id, order_date, revenue, analysis_date) rfm_result ## # A tibble: 995 x 9 ## customer_id date_most_recent recency_days transaction_cou~ amount ## ## 1 Abbey O'Re~ 2006-06-09 205 6 472 ## 2 Add Senger 2006-08-13 140 3 340 ## 3 Aden Lesch~ 2006-06-20 194 4 405 ## 4 Admiral Se~ 2006-08-21 132 5 448 ## 5 Agness O'K~ 2006-10-02 90 9 843 ## 6 Aileen Bar~ 2006-10-08 84 9 763 ## 7 Ailene Her~ 2006-03-25 281 8 699 ## 8 Aiyanna Br~ 2006-04-29 246 4 157 ## 9 Ala Schmid~ 2006-01-16 349 3 363 ## 10 Alannah Bo~ 2005-04-21 619 4 196 ## # ... with 985 more rows, and 4 more variables: recency_score , ## # frequency_score , monetary_score , rfm_score # segmentation segment_names <- c( "Champions", "Loyal Customers", "Potential Loyalist", "New Customers", "Promising", "Need Attention", "About To Sleep", "At Risk", "Can't Lose Them", "Lost" ) recency_lower <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1) recency_upper <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2) frequency_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1) frequency_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2) monetary_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1) monetary_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2) segments <- rfm_segment( rfm_result, segment_names, recency_lower, recency_upper, frequency_lower, frequency_upper, monetary_lower, monetary_upper ) # segment size segments %>% count(segment) %>% arrange(desc(n)) %>% rename(Segment = segment, Count = n) ## # A tibble: 8 x 2 ## Segment Count ## ## 1 Loyal Customers 278 ## 2 Potential Loyalist 229 ## 3 Champions 158 ## 4 Lost 111 ## 5 At Risk 86 ## 6 About To Sleep 50 ## 7 Others 48 ## 8 Need Attention 35

In the above example, the interval used to define the Champions segment is a subset of the one for Loyal Customers. In previous versions, customers who should have been assigned Champions were reassigned as Loyal Customers if the criteria for Champions were evaluated before those for Loyal Customers. From version 0.2.0, rfm_segment() avoids such overwriting.

Visualization

rfm used to print all the plots by default instead of returning a plot object. This resulted in difficulties for some end users who wanted to:

  • further modify the plot
  • include the plot in a panel of other plots

From version 0.2.1, all plotting functions take an additional argument, print_plot. It is set to TRUE by default to avoid any disruption to current workflows. Users who want a plot object to be returned can set this argument to FALSE.

# analysis date analysis_date <- lubridate::as_date('2007-01-01') # transactions data rfm_order <- rfm_table_order(rfm_data_orders, customer_id, order_date, revenue, analysis_date) # customer data rfm_customer <- rfm_table_customer(rfm_data_customer, customer_id, number_of_orders, recency_days, revenue, analysis_date) # plots p1 <- rfm_heatmap(rfm_order, plot_title = "Transaction Data", print_plot = FALSE) p2 <- rfm_heatmap(rfm_customer, plot_title = "Customer Data", print_plot = FALSE) # using patchwork p1 + p2

Custom Threshold for RFM Scores

Many users wanted to know the thresholds used for generating the RFM scores. From version 0.2.1, the rfm_table_* family of functions returns the thresholds.

analysis_date <- lubridate::as_date('2006-12-31') result <- rfm_table_order(rfm_data_orders, customer_id, order_date, revenue, analysis_date) # threshold result$threshold ## # A tibble: 5 x 6 ## recency_lower recency_upper frequency_lower frequency_upper monetary_lower ## ## 1 1 115 1 4 12 ## 2 115 181 4 5 256. ## 3 181 297. 5 6 382 ## 4 297. 482 6 8 506. ## 5 482 977 8 15 666 ## # ... with 1 more variable: monetary_upper

Another request (see here) was to be able to use custom or user-specified thresholds for generating the RFM scores. rfm uses quantiles to generate the lower and upper thresholds used for scoring. Unfortunately, if the data are skewed, using quantiles is not effective. From version 0.2.1, users can specify custom thresholds for generating the RFM scores, and we will learn how to do this using an example.

analysis_date <- lubridate::as_date('2006-12-31') result <- rfm_table_order(rfm_data_orders, customer_id, order_date, revenue, analysis_date) result$threshold ## # A tibble: 5 x 6 ## recency_lower recency_upper frequency_lower frequency_upper monetary_lower ## ## 1 1 115 1 4 12 ## 2 115 181 4 5 256. ## 3 181 297. 5 6 382 ## 4 297. 482 6 8 506. ## 5 482 977 8 15 666 ## # ... with 1 more variable: monetary_upper

If you look at the above output, we have 5 bins/scores but six different values. Let us focus on the monetary_* columns in the threshold table. The lower threshold of the first bin and the upper threshold of the last bin are the min and max values from the revenue column of rfm_data_orders, and the rest of the values are returned by the quantile() function.

revenue <- rfm_data_orders %>% group_by(customer_id) %>% summarize(total = sum(revenue)) ## `summarise()` ungrouping (override with `.groups` argument) # revenue at customer level revenue ## # A tibble: 995 x 2 ## customer_id total ## * ## 1 Abbey O'Reilly DVM 472 ## 2 Add Senger 340 ## 3 Aden Lesch Sr. 405 ## 4 Admiral Senger 448 ## 5 Agness O'Keefe 843 ## 6 Aileen Barton 763 ## 7 Ailene Hermann 699 ## 8 Aiyanna Bruen PhD 157 ## 9 Ala Schmidt DDS 363 ## 10 Alannah Borer 196 ## # ... with 985 more rows # min and max min(revenue$total) ## [1] 12 max(revenue$total) ## [1] 1488

Let us look at the quantiles used for generating the scores.

quantile(revenue$total, probs = seq(0, 1, length.out = 6)) ## 0% 20% 40% 60% 80% 100% ## 12.0 254.8 381.0 505.4 665.0 1488.0

The intervals are created in the following style:

Left-closed, right-open: [ a , b ) = { x ∣ a ≤ x < b }
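Base R's findInterval() follows the same left-closed convention, which may help build intuition; the cut points below are the monetary thresholds from the table above:

findInterval(c(12, 254, 256, 666), vec = c(12, 256, 382, 506, 666))
# returns 1 1 2 5: each value falls in the bin whose lower bound it meets or exceeds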

Since rfm uses left-closed intervals to generate the scores, we add 1 to all values except the minimum value. Now, let us recreate the RFM scores using custom thresholds instead of quantiles.

rfm_table_order(rfm_data_orders, customer_id, order_date, revenue, analysis_date, recency_bins = c(115, 181, 297, 482), frequency_bins = c(4, 5, 6, 8), monetary_bins = c(256, 382, 506, 666)) ## # A tibble: 995 x 9 ## customer_id date_most_recent recency_days transaction_cou~ amount ## ## 1 Abbey O'Re~ 2006-06-09 205 6 472 ## 2 Add Senger 2006-08-13 140 3 340 ## 3 Aden Lesch~ 2006-06-20 194 4 405 ## 4 Admiral Se~ 2006-08-21 132 5 448 ## 5 Agness O'K~ 2006-10-02 90 9 843 ## 6 Aileen Bar~ 2006-10-08 84 9 763 ## 7 Ailene Her~ 2006-03-25 281 8 699 ## 8 Aiyanna Br~ 2006-04-29 246 4 157 ## 9 Ala Schmid~ 2006-01-16 349 3 363 ## 10 Alannah Bo~ 2005-04-21 619 4 196 ## # ... with 985 more rows, and 4 more variables: recency_score , ## # frequency_score , monetary_score , rfm_score

We have used the values from the threshold table to reproduce the earlier result. If you observe carefully, we have specified 4 values while generating 5 bins/scores. When using custom thresholds, the number of values supplied should be one less than the number of bins/scores generated, as rfm internally computes the min and max values. In general, if you have n bins/scores, you only specify n - 1 threshold values.

We have tried our best to explain how to use custom thresholds, but we completely understand that it can be confusing to implement at first. If you have any questions about this method, feel free to write to us and our team will be happy to help you.

Acknowledgements

We are grateful to @leungi, @gfagherazzi and @DavidGarciaEstaun for their inputs.

Learning More

Feedback

*As the reader of this blog, you are our most important critic and commentator.
We value your opinion and want to know what we are doing right, what we could
do better, what areas you would like to see us publish in, and any other words
of wisdom you are willing to pass our way.

We welcome your comments. You can email to let us know what you did or did not
like about our blog as well as what we can do to make our post better.*

Email:


To leave a comment for the author, please follow the link and comment on their blog: Rsquared Academy Blog - Explore Discover Learn.
