Detect When the Random Number Generator Was Used
[This article was first published on JottR on R, and kindly contributed to R-bloggers].
If you ever need to figure out if a function call in R generated a random number or not, here is a simple trick that you can use in an interactive R session. Add the following to your ~/.Rprofile(*):
if (interactive()) {
  invisible(addTaskCallback(local({
    last <- .GlobalEnv$.Random.seed

    function(...) {
      curr <- .GlobalEnv$.Random.seed
      if (!identical(curr, last)) {
        msg <- "NOTE: .Random.seed changed"
        if (requireNamespace("crayon", quietly = TRUE))
          msg <- crayon::blurred(msg)
        message(msg)
        last <<- curr
      }
      TRUE
    }
  }), name = "RNG tracker"))
}

It works by checking whether or not the state of the random number generator (RNG), that is, .Random.seed in the global environment, was changed. If it has, a note is produced. For example,
> sum(1:100)
[1] 5050
> runif(1)
[1] 0.280737
NOTE: .Random.seed changed

It is not always obvious that a function generates random numbers internally. For instance, the rank() function may or may not update the RNG state depending on its ties.method argument, as illustrated in the following example:
> x <- c(1, 2, 2, 3)
> rank(x)
[1] 1.0 2.5 2.5 4.0
> rank(x, ties.method = "random")
[1] 1 3 2 4
NOTE: .Random.seed changed

For some functions, it may even depend on the input data whether or not random numbers are generated, e.g.
> y <- matrixStats::rowRanks(matrix(c(1,2,2), nrow=2, ncol=3), ties.method = "random")
NOTE: .Random.seed changed
> y <- matrixStats::rowRanks(matrix(c(1,2,3), nrow=2, ncol=3), ties.method = "random")

I have this RNG tracker enabled all the time to learn about functions that unexpectedly draw random numbers internally, which can be important to know when you run statistical analysis in parallel.
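The same before/after comparison that the tracker automates can be done by hand for a one-off check, using only base R:

```r
# One-off, manual version of the tracker's check: snapshot .Random.seed,
# call the function in question, and compare the states (base R only).
seed_before <- get0(".Random.seed", envir = globalenv())
invisible(sample(5))  # sample() draws random numbers
seed_after <- get0(".Random.seed", envir = globalenv())
identical(seed_before, seed_after)  # FALSE: the call used the RNG
```

get0() is used instead of a plain lookup because .Random.seed does not exist until the RNG has been used at least once in the session.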
As a bonus, if you have the crayon package installed, the RNG tracker will output the note with a style that is less intrusive.
(*) If you use the startup package, you can add it to a new file ~/.Rprofile.d/interactive=TRUE/rng_tracker.R. To learn more about the startup package, have a look at the blog posts on startup.
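If the tracker ever gets noisy, e.g. during a session with lots of simulation, it can be switched off without restarting R; removeTaskCallback() is the base-R counterpart to addTaskCallback() and takes the callback name used above:

```r
# Disable the tracker for the rest of the session (base R);
# returns TRUE if a callback with that name was found and removed.
removeTaskCallback("RNG tracker")
```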
To leave a comment for the author, please follow the link and comment on their blog: JottR on R.
The post Detect When the Random Number Generator Was Used first appeared on R-bloggers.
August 2020: “Top 40” New CRAN Packages
[This article was first published on R Views, and kindly contributed to R-bloggers].
One hundred forty-six new packages stuck to CRAN in August. Below are my “Top 40” picks in eleven categories: Computational Methods, Data, Genomics, Insurance, Machine Learning, Mathematics, Medicine, Statistics, Time Series, Utilities and Visualization.
Computational Methods

dpseg v0.1.1: Implements piecewise linear segmentation of ordered data via a dynamic programming algorithm. See the vignette.
qpmadr v0.1.0: Implements the method outlined in Goldfarb & Idnani (1983) for solving quadratic problems with linear inequality, equality, and box constraints.
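The problem class qpmadr targets is the standard quadratic program; in generic notation (symbols here are illustrative, not the package's own), the Goldfarb & Idnani dual method solves

```latex
\min_{x} \; \tfrac{1}{2}\, x^\top Q\, x + c^\top x
\quad \text{subject to} \quad
A_{\mathrm{eq}}\, x = b_{\mathrm{eq}}, \qquad
A\, x \ge b, \qquad
l \le x \le u,
```

with $Q$ positive definite, which is the setting (linear inequality, equality, and box constraints) described above.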
WoodburyMatrix v0.0.1: Implements a hierarchy of classes and methods for manipulating matrices formed implicitly from the sums of the inverses of other matrices, a situation commonly encountered in spatial statistics and related fields. See the vignette for details.
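The implicit representation pays off because of the Woodbury matrix identity, which expresses the inverse of such a sum in terms of the inverses of the (often structured) pieces:

```latex
(A + UCV)^{-1} \;=\; A^{-1} \;-\; A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1}.
```

When $A$ and $C$ are cheap to invert (e.g. diagonal or sparse), the right-hand side avoids ever forming or factorizing $A + UCV$ directly.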
Data

neonstore v0.2.2: Provides access to numerous National Ecological Observatory Network (NEON) data sets through its API.
pdxTrees v0.4.0: A collection of datasets from Portland Parks and Recreation, which inventoried every tree in over one hundred seventy parks and along the streets in ninety-six neighborhoods. See the vignette.
Genomics

hiphop v0.0.1: Implements a method to compare the genotypes of offspring with any combination of potential parents, and scores the number of mismatches of these individuals at biallelic genetic markers; these scores can be used for paternity and maternity assignment. See Huisman (2017) for background, and the vignette for an introduction.
RapidoPGS v1.0.2: Provides functions to quickly compute polygenic scores from GWAS summary statistics of either case-control or quantitative traits, without LD matrix computation or parameter tuning. See Reales et al. (2020) for details and the vignette for examples.
Insurance

SynthETIC v0.1.0: Implements an individual claims simulator which generates synthetic data that emulates various features of non-life insurance claims. Refer to Avanzi et al. (2020) for background and see the vignette for examples.
Machine Learning

sparklyr.flint v0.1.1: Extends sparklyr to include Flint time series functionality. Vignettes include Importing Data and Time Series RDD.
torch v0.0.3: Provides functionality to define and train neural networks similar to PyTorch by Paszke et al. (2019), but written entirely in R. There are vignettes on Extending Autograd, Indexing tensors, Loading data, Creating tensors, and Using autograd.
Mathematics

gasper v1.0.1: Provides the standard operations for signal processing on graphs, including the graph Fourier transform, the spectral graph wavelet transform, and visualization tools. See De Loynes et al. (2019) for background and the package vignette.
GeodRegr v0.1.0: Provides a gradient descent algorithm to find a geodesic relationship between real-valued independent variables and a manifold-valued dependent variable (i.e. geodesic regression). Available manifolds are Euclidean space, the sphere, and Kendall’s 2-dimensional shape space. See Shin & Oh (2020), Fletcher (2013), and Kim et al. (2014) for background.
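In Fletcher's (2013) formulation, geodesic regression generalizes least squares by replacing the fitted line with a geodesic: for a base point $p$ and tangent vector $v$ on the manifold $M$, one minimizes the sum of squared geodesic distances

```latex
\min_{p,\, v} \;\; \sum_{i=1}^{N} d\!\left( \operatorname{Exp}_{p}(x_i\, v),\; y_i \right)^{2},
```

where $\operatorname{Exp}_p$ is the Riemannian exponential map and $d(\cdot,\cdot)$ the geodesic distance on $M$; on Euclidean space this reduces to ordinary linear regression.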
geos v0.0.1: Provides an R API to the Open Source Geometry Engine (GEOS) library and a vector format with which to efficiently store GEOS geometries. See the README for an example.
pcSteiner v1.0.0: Provides functions for obtaining an approximate solution to the prize-collecting Steiner tree problem, which seeks a connected subgraph spanning a given set of vertices that collects the most valuable nodes over the least expensive edges. This implementation uses a loopy belief propagation algorithm. There is a Tutorial.
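In its usual statement, the prize-collecting Steiner tree problem searches over connected subtrees $T$ of a graph $G$, trading node prizes $p(v)$ against edge costs $c(e)$:

```latex
\max_{\substack{T \subseteq G \\ T \text{ connected}}} \;\;
\sum_{v \in V(T)} p(v) \;-\; \sum_{e \in E(T)} c(e).
```

The problem is NP-hard in general, which is why heuristic approaches such as the loopy belief propagation used here are attractive.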
TCIU v1.1.0: Provides the core functionality to transform longitudinal data to complex-time (kime) data using analytic and numerical techniques, visualize the original time series and reconstructed kime-surfaces, and perform model-based (e.g., tensor-linear regression) and model-free classification and clustering methods. See Dinov & Velev (2021) for background. There are vignettes on Laplace Transform and Kime Surface Transforms and Workflows of TCIU Analytics.
Medicine

epigraphdb v0.2.1: Provides access to the EpiGraphDB platform. There is an overview, vignettes on the API, Platform Functionality, and Meta Functions, and three case studies on SNP-protein associations, Drug Targets, and Causal Evidence.
raveio v0.0.3: Implements an interface to the RAVE (R Analysis and Visualization of human intracranial Electroencephalography data) project, which aims at analyzing brain recordings from patients with electrodes placed on the cortical surface or inserted into the brain. See Magnotti et al. (2020) for background.
tboot v0.2.0: Provides functions to simulate clinical trial data with realistic correlation structures and assumed efficacy levels by using a tilted bootstrap resampling approach. There is a tutorial on The Tilted Bootstrap and another on Bayesian Marginal Reconstruction.
Statistics

BayesMRA v1.0.0: Fits sparse Bayesian multiresolution spatial models using Markov chain Monte Carlo. See the vignette.
bsem v1.0.0: Implements functions to allow structural equation modeling for particular cases using rstan, including Bayesian semi-confirmatory factor analysis, confirmatory factor analysis, and structural equation models. See Mayrink (2013) for background and the vignettes: Get Started and Exploring bsem class.
cyclomort v1.0.2: Provides functions to do survival modeling with a periodic hazard function. See Gurarie et al. (2020) and the vignette for details.
ebmstate v0.1.1: Implements an empirical Bayes, multi-state Cox model for survival analysis. See Schall (1991) for details.
fairmodels v0.2.2: Provides functions to measure fairness for multiple models, including a model’s bias with respect to race, sex, nationality, etc. There are Basic and Advanced tutorials.
MGMM v0.3.1: Implements clustering of multivariate normal random vectors with missing elements. Clustering is achieved by fitting a Gaussian Mixture Model (GMM). See McCaw et al. (2019) for details, and the vignette for examples.
rmsb v0.0.1: Is a Bayesian companion to the rms package, providing Bayesian model fitting, post-fit estimation, and graphics, and implementing Bayesian regression models whose fit objects can be processed by rms functions. Look here for more information.
RoBMA v1.0.4: Implements a framework for estimating ensembles of meta-analytic models (assuming either presence or absence of the effect, heterogeneity, and publication bias) and uses Bayesian model averaging to combine them. See Maier et al. (2020) for background and the vignettes: Fitting custom meta-analytic ensembles, Reproducing BMA, and Common warnings and errors.
tTOlr v0.2: Implements likelihood ratio statistics for one- and two-sample t-tests. There are two vignettes: Likelihood Ratio and False Positive Risk and P-values – Uses, abuses, and alternatives.
Time Series

fable.prophet v0.1.0: Enables prophet models to be used in tidy workflows created with fabletools. See the vignette for an introduction.
garma v0.9.3: Provides methods for estimating long-memory seasonal/cyclical Gegenbauer univariate time series processes. See Dissanayake et al. (2018) for background and the vignette for the details of model fitting.
gratis v0.2.0: Generates time series based on mixture autoregressive models. See Kang et al. (2020) for background and the vignette for an introduction to the package.
rhosa v0.1.0: Implements higher-order spectra (polyspectra) analysis for time series. See Brillinger & Irizarry (1998) and Lii & Helland (1981) for background and the vignette for examples.
Utilities

DataEditR v0.0.5: Implements an interactive editor for viewing, entering, and editing data in R. See the vignette for details.
equatiomatic v0.1.0: Simplifies writing LaTeX formulas by providing a function that takes a fitted model object as its input and returns the corresponding LaTeX code for the model. There is an Introduction and a vignette on Tests and Coverage.
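The idea can be sketched in a few lines of base R: walk a fitted model's coefficient names and paste together a LaTeX string. This toy version is ours, not the package's; equatiomatic itself handles transformations, interactions, and many model classes far more carefully.

```r
# Toy, base-R-only version of the model-to-LaTeX idea: build a LaTeX
# equation string from a fitted lm()'s coefficient names.
fit <- lm(mpg ~ wt + hp, data = mtcars)
preds <- names(coef(fit))[-1]  # drop "(Intercept)"
rhs <- paste(
  c("\\alpha",
    paste0("\\beta_{", seq_along(preds), "}(\\operatorname{", preds, "})")),
  collapse = " + ")
eq <- paste0("\\operatorname{mpg} = ", rhs, " + \\epsilon")
cat(eq)
```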
starschemar v1.1.0: Provides functions to obtain star schemas from flat tables. The vignette shows multiple examples.
Visualization

glow v0.10.1: Provides a framework for creating plots with glowing points. See the vignette for examples.
graph3d v0.1.0: Implements a wrapper for the JavaScript library vis-graph3d that enables users to create three-dimensional interactive visualizations. Look here for an example.
jsTreeR v0.1.0: Provides functions to implement interactive trees for representing hierarchical data that can be included in Shiny apps and R markdown documents. Look here for examples.
KMunicate v0.1.0: Provides functions to produce Kaplan–Meier plots in the style recommended following the KMunicate study by Morris et al. (2019). See the vignette for examples.
rAmCharts4 v0.1.0: Provides functions to create JavaScript charts that can be included in Shiny apps and R Markdown documents, or viewed from the R console and RStudio viewer. Look here for examples.
tabularmaps v0.1.0: Provides functions for creating tabular maps, a visualization method for efficiently displaying data consisting of multiple elements by tiling them. When dealing with geospatial data, it corrects for differences in visibility between areas. Look here and at the vignette for examples.
The post August 2020: "Top 40" New CRAN Packages first appeared on R-bloggers.
Exploring 30 years of local CT weather history with R
[This article was first published on R on Redwall Analytics, and kindly contributed to R-bloggers].
# Libraries
packages <- c("data.table", "ggplot2", "stringr", "skimr", "janitor", "glue")

if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))
}
invisible(lapply(packages, library, character.only = TRUE))

knitr::opts_chunk$set(
  comment = NA,
  fig.width = 12,
  fig.height = 8,
  out.width = '100%',
  cache = TRUE
)
Introduction

As our journey with open-source software continues, there is a growing list of things we have tried but were unable to figure out, or that took too long, so we moved on. Sometimes the missing piece turns out to be a blog or Twitter post, sometimes a new package or API we hadn’t heard of, and sometimes the solution just pops into our head. A few months ago, we were trying to figure out how to explore the history of the local weather and air quality. We walk by the EPA monitor at our local beach all the time, so connecting to it didn’t seem like a big deal, but we didn’t know where to find the data.
Over the weekend, we were reading how Californians have come to rely on Wi-Fi air quality monitors in Purple Air’s network to track the effects of the wildfires on the air they are breathing. Naturally, this got us thinking about the subject again, but we discovered that, unlike in California, there seem to be very few Purple Air monitors around our area. This time when we Googled around, though, we found the EPA ArcGIS map and links with all the data from our local monitor going back to 1990:
In this post, we will tap the links on the map for all the available daily data. In the future, we may look at the API to see if we can extend our time span, increase the granularity to hourly, or even find more elements of the Air Quality Index (AQI).
This will be a very quick, code-based blog post showing how to load and visualize the data from our local monitor. Below, we show how to use glue::glue() to hit the API for our local site (“090010017”) for the annual daily data sets from 1990-2020 with data.table::fread(), which took only a few minutes. We could change the code to get the same for the next closest monitor in Westport, CT, or a simple adjustment would get us all the monitors in Connecticut or beyond if needed. For the actual blog post, data saved to disk will be used.
# Years to retrieve
year <- c(1990:2020)

ac <- lapply(
  glue::glue(
    'https://www3.epa.gov/cgi-bin/broker?_service=data&_program=dataprog.Daily.sas&check=void&polname=Ozone&debug=0&year={year}&site=090010017'
  ),
  fread
)
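glue::glue() vectorizes over {year}, returning one URL per element. The same vector can be built dependency-free with base sprintf(), which makes the pattern easy to check without hitting the network:

```r
# Base-R equivalent of the glue() call above: one download URL per year,
# reusing the same query string and site id (090010017) from the post.
year <- 1990:2020
urls <- sprintf(
  paste0("https://www3.epa.gov/cgi-bin/broker?_service=data",
         "&_program=dataprog.Daily.sas&check=void&polname=Ozone",
         "&debug=0&year=%d&site=090010017"),
  year)
length(urls)  # 31 URLs, one per year
```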
Fortunately, the annual data sets were consistent over the period, with all the same variables, so it took just a few minutes to get a nice data.table stretching back to 1990. In a future post, we might look at a more complicated exploration of the EPA API, which has data going back much further for some monitors and some variables, and seems to be one of the better organized and documented government APIs we have come across.
# Bind lists to data.table
ac_dt <- rbindlist(ac)

# Clean names
ac_dt <- janitor::clean_names(ac_dt)
Exploration and Preparation

We will start off with the full 34 columns, but throw out the identifier columns where there is only one unique value. We can see that there are only 10,416 unique dates though there are 55,224 rows, so many fields are layered in the data set. Four of the logical columns are entirely missing, so they will have to go. There are 14 unique values in the parameter_name field, so we will have to explore those. A majority of the pollutant_standard rows are missing. We can also see that aqi, which we consider to be a parameter, is included in a separate column as a character. The two main measurements are “arithmetic_mean” and “first_maximum_value”. There are a couple of time-related variables, including year, day_in_year and date_local. There are a lot of fields with only one unique value that identify the monitor, so these can all be dropped. It’s pretty messy, so the best thing we can think of doing is to tidy up the data set so it is easier to work with.
# Summarize data.table skimr::skim(ac_dt)
Table 1: Data summary

Name: ac_dt
Number of rows: 55224
Number of columns: 34
Column type frequency: character 15, Date 1, logical 4, numeric 14
Group variables: None

Variable type: character
skim_variable              n_missing  complete_rate  min  max  empty  n_unique  whitespace
Datum                      0          1              5    5    0      1         0
Parameter Name             0          1              5    26   0      14        0
Duration Description       0          1              6    23   0      6         0
Pollutant Standard         0          1              0    17   35497  6         0
Units of Measure           0          1              5    29   0      8         0
Exceptional Data Type      0          1              4    8    0      2         0
AQI                        0          1              1    3    0      157       0
Daily Criteria Indicator   0          1              1    1    0      2         0
State Name                 0          1              11   11   0      1         0
County Name                0          1              9    9    0      1         0
City Name                  0          1              19   19   0      1         0
Local Site Name            0          1              20   20   0      1         0
Address                    0          1              31   31   0      1         0
MSA or CBSA Name           0          1              31   31   0      1         0
Data Source                0          1              13   13   0      1         0

Variable type: Date
skim_variable  n_missing  complete_rate  min         max         median      n_unique
Date (Local)   0          1              1990-01-01  2020-03-31  2002-08-29  10416

Variable type: logical
skim_variable               n_missing  complete_rate  mean  count
Nonreg Observation Count    55224      0              NaN   :
Nonreg Arithmetic Mean      55224      0              NaN   :
Nonreg First Maximum Value  55224      0              NaN   :
Tribe Name                  55224      0              NaN   :

Variable type: numeric
skim_variable        n_missing complete_rate     mean      sd       p0      p25      p50      p75     p100  hist
State Code           0         1                 9.00     0.00     9.00     9.00     9.00     9.00     9.00  ▁▁▇▁▁
County Code          0         1                 1.00     0.00     1.00     1.00     1.00     1.00     1.00  ▁▁▇▁▁
Site Number          0         1                17.00     0.00    17.00    17.00    17.00    17.00    17.00  ▁▁▇▁▁
Parameter Code       0         1             55689.14  9429.11 42401.00 44201.00 61102.00 62101.00 82403.00  ▅▁▇▁▁
POC                  0         1                 1.01     0.29     1.00     1.00     1.00     1.00     9.00  ▇▁▁▁▁
Latitude             0         1                41.00     0.00    41.00    41.00    41.00    41.00    41.00  ▁▁▇▁▁
Longitude            0         1               -73.59     0.00   -73.59   -73.59   -73.59   -73.59   -73.59  ▁▁▇▁▁
Year                 0         1              2002.86     8.22  1990.00  1996.00  2002.00  2009.00  2020.00  ▇▆▇▃▃
Day In Year (Local)  0         1               181.50    96.44     1.00   106.00   181.00   258.00   366.00  ▆▇▇▇▅
Observation Count    0         1                21.28     5.60     1.00    23.00    24.00    24.00    24.00  ▁▁▁▁▇
Observation Percent  0         1                97.44     9.07     4.00   100.00   100.00   100.00   100.00  ▁▁▁▁▇
Arithmetic Mean      0         1                44.42    75.70    -7.50     0.04     5.00    56.62   353.00  ▇▁▁▁▁
First Maximum Value  0         1                65.56   112.18    -5.00     0.07    10.00    64.00   360.00  ▇▁▁▁▁
First Maximum Hour   0         1                10.89     6.69     0.00     6.00    12.00    15.00    23.00  ▆▅▇▇▅

First, we will drop all of the identifier columns before column 9 and after column 27, and also the “tribe” and “nonreg” columns using data.table::patterns(). We then convert the air quality index (“aqi”) column to numeric for the cases where it is not missing. We are not clear why “aqi” is not included in the “parameter_name” variable with the other measures, but it seems to be associated with rows that have “ozone” and “sulfur dioxide” (two of the five variables which compose the “aqi” itself). Air quality is also stored by the 1-hour average and, separately, by a single 8-hour measurement for each day, and these numbers can be significantly different.
# Drop unneeded cols
ac_dt <- ac_dt[, c(9:27)][, .SD, .SDcols = !patterns("tribe|nonreg")]

# Convert aqi to integer
ac_dt[, `:=`(aqi = as.integer(str_extract(aqi, "\\d*")))]
We melt the three measurement columns into a single value column, with the 12 identifier columns carried along as keys. We decided to separate the “aqi” index column from the rest of the data, which is identified in the “parameter_name” column, before tidying, and then bind the two back together, leaving three variables (“aqi”, “arithmetic_mean” and “first_maximum_value”).
# Separate out and tidy up rows with aqi
aqi <- ac_dt[!is.na(aqi)]
# Tidy key measures
measures <- c("first_maximum_value", "arithmetic_mean")
ids <- setdiff(names(ac_dt), measures)
ids <- ids[!str_detect(ids, "aqi")]

ac_dt_tidy <-
  ac_dt[,
        melt(.SD,
             id.vars = ids,
             measure.vars = measures),
        .SDcols = !"aqi"]

aqi <-
  aqi[,
      melt(.SD,
           id.vars = ids,
           measure.vars = "aqi"),
      .SDcols = !measures]

# Put two data sets back together
ac_dt_tidy <- rbind(ac_dt_tidy, aqi)
# Show sample rows
ac_dt_tidy
parameter_name duration_description pollutant_standard
1: Outdoor Temperature 1 HOUR
2: Sulfur dioxide 1 HOUR SO2 1-hour 2010
3: Sulfur dioxide 3-HR BLK AVG SO2 3-hour 1971
4: Sulfur dioxide 3-HR BLK AVG SO2 3-hour 1971
5: Outdoor Temperature 1 HOUR
---
120581: Ozone 8-HR RUN AVG BEGIN HOUR Ozone 8-hour 2015
120582: Ozone 8-HR RUN AVG BEGIN HOUR Ozone 8-hour 2015
120583: Ozone 8-HR RUN AVG BEGIN HOUR Ozone 8-hour 2015
120584: Ozone 8-HR RUN AVG BEGIN HOUR Ozone 8-hour 2015
120585: Ozone 8-HR RUN AVG BEGIN HOUR Ozone 8-hour 2015
date_local year day_in_year_local units_of_measure
1: 1990-01-01 1990 1 Degrees Fahrenheit
2: 1990-01-01 1990 1 Parts per billion
3: 1990-01-01 1990 1 Parts per billion
4: 1990-01-02 1990 2 Parts per billion
5: 1990-01-02 1990 2 Degrees Fahrenheit
---
120581: 2020-03-27 2020 87 Parts per million
120582: 2020-03-28 2020 88 Parts per million
120583: 2020-03-29 2020 89 Parts per million
120584: 2020-03-30 2020 90 Parts per million
120585: 2020-03-31 2020 91 Parts per million
exceptional_data_type observation_count observation_percent
1: None 24 100
2: None 23 96
3: None 7 88
4: None 7 88
5: None 24 100
---
120581: None 17 100
120582: None 17 100
120583: None 17 100
120584: None 17 100
120585: None 12 71
first_maximum_hour daily_criteria_indicator variable value
1: 0 Y first_maximum_value 43.0
2: 8 Y first_maximum_value 11.0
3: 8 Y first_maximum_value 9.3
4: 23 Y first_maximum_value 17.6
5: 14 Y first_maximum_value 42.0
---
120581: 10 Y aqi 48.0
120582: 22 Y aqi 46.0
120583: 8 Y aqi 46.0
120584: 7 Y aqi 37.0
120585: 16 N aqi 42.0
When we graph with “parameter_name” facets, with separate colors for the mean and maximum values, we can see a few things. There are a few gaps in collection, including a big one in sulfur dioxide from about 1997-2005. The Air Quality Index, first created under the Clean Air Act, has the following components: ground-level ozone, particulate matter, carbon monoxide, sulfur dioxide, and nitrogen dioxide. We are unsure how the AQI in our data set is calculated for the full period, given the stretch where sulfur dioxide is missing. When we read up on the AQI, we learned that there may be several ways of calculating it. We will leave the details for later research.
# Look for missing periods
ac_dt_tidy[
  variable %in% c("arithmetic_mean", "first_maximum_value"),
  ggplot(.SD, aes(date_local, y = value, color = variable)) +
    geom_line() +
    facet_wrap(~ parameter_name, scale = 'free') +
    theme_bw() +
    labs(caption = "Source: EPA Monitor 090010017")]
Wind

We had hoped to look at the wind speeds during Sandy but, apparently, the monitor was knocked out during the storm, so there are no measurements for that date or for several months afterwards; it looks like we are not going to do a lot with the wind data. It is hard to see much in the charts above, so we averaged the values by month. We might have guessed, but hadn’t thought, that wind was as seasonal as it appears below.
ac_dt_tidy[
  str_detect(parameter_name, "Wind") &
    variable %in% c("first_maximum_value", "arithmetic_mean"),
  .(avg_speed = mean(value)),
  by = .(month(date_local), parameter_name, variable)][,
  ggplot(.SD, aes(month, avg_speed, color = variable)) +
    geom_line() +
    facet_wrap(~ parameter_name, scales = "free_y") +
    theme_bw() +
    labs(
      x = "Month",
      y = "Wind Speed",
      caption = "Source: EPA Monitor 090010017"
    )]
Air Quality

The data set actually records four measurements for ozone: the average and maximum values by hour and, separately, for 8-hour periods. The EPA sets the threshold for ozone levels to be avoided at 0.064 ppm, and days above this are shown in red. The “first_maximum_value” very often registers undesirable levels, although the hourly reading does so much less often. We can see that there are clearly fewer unhealthy days over time, and only two unhealthy days based on the hourly arithmetic average since 2003. We can also see that the low end of the readings has been moving up over time, even though it remains well within the healthy zone.
ac_dt_tidy[
  parameter_name == "Ozone" &
    variable %in% c("arithmetic_mean", "first_maximum_value")][,
  ggplot(.SD, aes(date_local, y = value)) +
    geom_point(aes(color = cut(
      value,
      breaks = c(0, 0.064, 0.3),
      labels = c("Good", "Unhealthy")
    )), size = 1) +
    scale_color_manual(
      name = "Ozone",
      values = c("Good" = "green1", "Unhealthy" = "red1")) +
    theme_bw() +
    labs(x = "Year", y = "Ozone",
         caption = "Source: EPA Monitor 090010017") +
    facet_wrap(~ variable + duration_description, scales = "free_y")]
We can see in the chart of air quality based on the Ozone 8-hour 2015 parameter below that, even if the EPA calculates it, it doesn’t report the AQI during the winter months, which probably makes sense because people are not out and air quality appears to be worst in the summer. Sometimes we get iPhone messages about air quality and naturally worry, but when we look at the AQI daily over the last 30 years, we can see that the number of unhealthy days has been declining, similar to what we saw above with ozone, and the last “Very Unhealthy” day was in 2006. The same trend of the low end of the AQI rising a little over time is apparent.
ac_dt_tidy[
  variable == "aqi" & pollutant_standard == "Ozone 8-hour 2015"][,
  ggplot(.SD, aes(date_local, y = value)) +
    geom_point(aes(color = cut(
      value,
      breaks = c(0, 50, 100, 150, 200, 300),
      labels = c(
        "Good", "Moderate", "Unhealthy - Sensitive",
        "Unhealthy", "Very Unhealthy"
      )
    )), size = 1) +
    scale_color_manual(
      name = "AQI",
      values = c(
        "Good" = "green1",
        "Moderate" = "yellow",
        "Unhealthy - Sensitive" = "orange",
        "Unhealthy" = "red",
        "Very Unhealthy" = "violetred4"
      )
    ) +
    theme_bw() +
    labs(x = "Year", y = "Air Quality Indicator (AQI)",
         caption = "Source: EPA Monitor 090010017")]
A heat map is another way to look at the ozone data, one which better shows the time dimension. The y-axis shows the day of the year, so the most unhealthy air quality is between days 175-225, or the end of June through the first half of August. We can also see that “unhealthy” days might even have outnumbered healthy days back in the early 1990s, but we rarely see above “moderate” now.
breaks <- c(0, 50, 100, 150, 200, 300, 1000)
labels <- c("Good", "Moderate", "Unhealthy - Sensitive Groups",
            "Unhealthy", "Very Unhealthy", "Hazardous")

ac_dt[parameter_name == "Ozone" & exceptional_data_type == "None",
      .(
        year,
        day_in_year_local,
        observation_count,
        duration_description,
        date_local,
        aqi = as.integer(str_extract(aqi, "\\d*")),
        parameter_name,
        `Air Quality` = cut(
          as.integer(str_extract(aqi, "\\d*")),
          breaks = breaks,
          labels = labels
        )
      )][!is.na(`Air Quality`) & day_in_year_local %in% c(90:260),
         ggplot(.SD, aes(year, day_in_year_local, fill = `Air Quality`)) +
           geom_tile() +
           theme_bw() +
           labs(
             x = "Year",
             y = "Day of Year",
             caption = "Source: EPA Monitor 090010017"
           )]
Temperature

Below is the dot plot of the outside temperature over the period. Hot days are defined as above 85°F and very hot above 95°F, while cold days are below 32°F. There isn’t much of a trend visible in the middle of the graphs. As might be expected, the daily first maximum highs and lows tend to be significantly above the daily average levels. All in all, if there is change, it is less definitive than in the air quality data when viewed this way.
ac_dt_tidy[
  parameter_name == "Outdoor Temperature" &
    variable %in% c("arithmetic_mean", "first_maximum_value")][,
  ggplot(.SD, aes(date_local, y = value)) +
    geom_point(aes(
      color = cut(
        value,
        breaks = c(-20, 32, 50, 65, 85, 95, 120),
        labels = c("Very Cold", "Cold", "Cool", "Moderate", "Hot", "Very Hot"))),
      size = 1) +
    scale_color_manual(
      name = "Outside Temperature",
      values = c(
        "Very Cold" = "blue",
        "Cold" = "yellow",
        "Cool" = "green1",
        "Moderate" = "green4",
        "Hot" = "orange",
        "Very Hot" = "red"
      )
    ) +
    theme_bw() +
    labs(
      x = "Year",
      y = "Outside Temperature",
      caption = "Source: EPA Monitor 090010017"
    ) +
    facet_wrap(~ variable)]
We also tried to look at the change in temperature over the period versus the first five years (1990-1994). By doing this, we probably learned more about heat maps than about the temperature. It does look like the bigger changes in temperature have happened more at the beginning and the end of the year. Movements in the maximum temperatures seem more pronounced than in the averages, but again, it makes sense that this would be the case.
temperature <- ac_dt[parameter_name == "Outdoor Temperature",
                     c("year", "day_in_year_local",
                       "arithmetic_mean", "first_maximum_value")]

baseline <- temperature[year < 1995,
                        .(base_mean = mean(arithmetic_mean),
                          base_max = mean(first_maximum_value)),
                        day_in_year_local]

temperature <- baseline[temperature[year > 1994], on = c("day_in_year_local")][
  , `:=`(change_avg = arithmetic_mean - base_mean,
         change_max = first_maximum_value - base_max)]

temperature <- temperature[, melt(
  .SD,
  id.vars = c("day_in_year_local", "year"),
  measure.vars = c("change_max", "change_avg")
)]

temperature[
  year %in% c(1995:2019) & !is.na(value),
  ggplot(.SD, aes(year, day_in_year_local, fill = cut(
    value,
    breaks = c(-100, -15, -5, 5, 15, 100),
    labels = c("Much Colder", "Colder", "Similar", "Warmer", "Much Warmer")
  ))) +
    geom_tile() +
    scale_fill_manual(name = "Temp. Change",
                      values = c("skyblue4", "skyblue", "green", "red", "red4")) +
    theme_bw() +
    labs(
      title = "Days Compared to 1990-1994 Average Temp. on That Day",
      subtitle = "Hotter Days Shown Redder",
      x = "Year",
      y = "Day of Year",
      caption = "Source: EPA"
    ) +
    facet_wrap(~ variable)]
The interesting thing we learned about heat maps is how much we could control the perception of the chart through our decisions about the size of the groupings and the color choices. Dark colors on the days with the biggest temperature increases could flood the chart with red. If we chose equal-sized groups for the cutoffs, there would be a lot more days for the average which were colder (as shown below), but a lot more for the max which were hotter. It made us more wary of heat maps.
# Uneven, hand-selected cutoffs to find more balanced counts
lapply(list(
  cut(temperature[variable == "change_avg" & !is.na(value)]$value,
      c(-100, -15, -5, 5, 15, 100)),
  cut(temperature[variable == "change_max" & !is.na(value)]$value,
      c(-100, -15, -5, 5, 15, 100))
), summary)
[[1]]
(-100,-15]   (-15,-5]     (-5,5]     (5,15]   (15,100]
       169       1489       4929       1786         65
[[2]]
(-100,-15]   (-15,-5]     (-5,5]     (5,15]   (15,100]
       286       2016       4155       1821        160
# Even range limits with less even counts
lapply(list(cut(temperature[variable == "change_avg" &
!is.na(value)]$value, 5),
cut(temperature[variable == "change_max" &
!is.na(value)]$value, 5)), summary)
[[1]]
(38,25] (25,12.1] (12.1,0.827] (0.827,13.8] (13.8,26.8]
5 305 4253 3767 108
[[2]]
(28.8,15.8] (15.8,2.95] (2.95,9.95] (9.95,22.9] (22.9,35.8]
215 2858 4659 686 20
That wraps up this quick exploration of our local EPA monitor. We still have many questions about air quality and wind speed. We wonder why the EPA put the monitor down by the edge of the water away from the heavy traffic and population. We didn’t have a lot of time to spend, and realize that we may have misread or misinterpreted some of the data, but now we at least know what to look for. There is a comments section below, so please feel free to correct, point us in the right direction or inform us about other ways of using this data.
To leave a comment for the author, please follow the link and comment on their blog: R on Redwall Analytics. R-bloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? Click here if you have a blog, or here if you don't.
The post Exploring 30 years of local CT weather history with R first appeared on R-bloggers.
The treedata.table Package
[This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers].
The data.table package enables high-performance extended functionality for data tables in R. treedata.table is a wrapper for data.table for phylogenetic analyses that matches a phylogeny to the data.table and preserves the matching during data.table operations. Using the data.table package greatly increases analysis reproducibility and the efficiency of data manipulation operations over other ways of performing similar tasks in base R, enabling processing of larger trees and datasets.
Simultaneous processing of phylogenetic trees and data remains a computationally intensive task.
For example, processing a tree alongside a dataset of phylogenetic characters means processing much more data than processing the characters alone. This results in much longer processing times than a simpler analysis of the characters alone in data.table (Fig. 1A).
treedata.table provides new tools for increasing the speed and efficiency of phylogenetic data processing.
Data manipulation in treedata.table is significantly faster than in other commonly used packages such as base (>35%), treeplyr (>60%), and dplyr (>90%).
Additionally, treedata.table is >400% faster than treeplyr during the initial data/tree matching step (Fig. 1B).
Fig. 1. Results for the treedata.table microbenchmark during (A) data manipulation (treedata.table) and (B) tree/data matching steps.
We compare the performance of treedata.table against data.table, base, treeplyr, and dplyr using the microbenchmark R package.
The most basic operation in treedata.table is matching the tips of a phylogenetic tree to the rows of a data.table object. Like many R packages involving phylogenetic trees, treedata.table assumes that your tree is an object of class phylo.
The first vignette (in English here and Spanish here) walks new users through matching a phylogeny to a data.table, and performing some common data operations, such as selecting data columns and dropping tips that do not have data.
There are other ways to achieve similar results, but as seen in Fig. 1, these are not as efficient.
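As a minimal sketch of this matching step, assuming the anolis example dataset that the package's README and vignettes use (along with its as.treedata.table() constructor):

```r
# A sketch, not verbatim package code: assumes the anolis example data
# ships with treedata.table, as in its vignettes, with a tree of class phylo
library(treedata.table)
data(anolis)

# Match tips of the phylogeny to rows of the data.table; unmatched
# tips and rows are dropped, and the pairing is preserved thereafter
td <- as.treedata.table(tree = anolis$phy, data = anolis$dat)

# Subsequent data.table-style operations on td keep tree and data in
# sync: selecting columns or dropping rows also prunes the matching tips
```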
Something unique to treedata.table is the ability to work with multiPhylo objects.
multiPhylo objects, as their name suggests, are collections of trees.
For example, perhaps you would like to match all the trees in a Bayesian posterior sample to a dataset to perform
a comparative methods analysis for every tree in your sample.
Our second vignette walks researchers through the process of syncing many trees to one dataset, manipulating the data, and performing a continuous trait evolution modeling exercise in geiger.
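Under the same assumptions as above, the syncing of many trees to one dataset might be sketched like this (a hedged illustration, not the vignette's verbatim code):

```r
# Sketch: c() on phylo objects returns a multiPhylo (standard ape behavior)
library(ape)
library(treedata.table)
data(anolis)

# A toy multiPhylo of two identical trees standing in for a posterior sample
trees <- c(anolis$phy, anolis$phy)

# Per the multiPhylo vignette, the constructor also accepts a multiPhylo,
# matching every tree in the sample to the same dataset at once
td_multi <- as.treedata.table(tree = trees, data = anolis$dat)
```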
Our other two vignettes cover important operations researchers may wish to perform on phylogenetic data, for example, syncing names across datasets and phylogenetic trees based on partial matching, and assorted data-tidying functions.
Fig. 2. The beach outside the workshop bunkhouse.
treedata.table is a research output from the Nantucket DevelopR workshop.
Funded by an NSF grant to Liam Revell, the workshop aimed to support intermediate R developers working in phylogenetics and phylogenetic comparative methods.
Last time around, we had four instructors (April Wright, Josef Uyeda, Claudia SolisLemus and Klaus Schliep).
Each instructor taught one serious lesson during the week, and most did one “fireside chat” about being a computational scientist working in R.
But the real heart of the workshop was spending the day immersed in coding, both collaboratively and alone.
Participants each came with a project in mind, and made pitches to others for their project.
Some participants worked on multiple projects, or one with a group and one alone.
The instructors were housed separately, and often we’d show up to the beachside house to find that no one wanted a lecture yet – they were too busy working already.
Intermediate programmers are often left to their own devices in their education, and spending a week developing skills, community, and friendships was absolutely lovely. And, as evidenced by this package, productive!
We’d like to thank all the participants at the workshop for a very special week together in an amazing place.
We would also like to thank the instructors (Claudia and Klaus), Liam (NSF DBI-1759940), UMass field station coordinator Yolanda Vaillancourt, and course facilitator Fabio Machado Lugar. treedata.table is an rOpenSci peer-reviewed package. We would also like to thank package reviewers Hugo Gruson and Kari Norman, editor Julia Gustavson, and blog editor Steffi LaZerte. AMW was supported by an Institutional Development Award (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health under grant number P20 GM103424-18.
To leave a comment for the author, please follow the link and comment on their blog: rOpenSci – open tools for open science.
The post The treedata.table Package first appeared on R-bloggers.
R/exams for Distance Learning: Resources and Experiences
[This article was first published on R/exams, and kindly contributed to R-bloggers].
Resources and experiences shared by the R/exams community for supporting distance learning in different learning management systems.
Motivation

The COVID-19 pandemic forced many of us to switch abruptly from our usual teaching schedules to distance-learning formats. Within days or a couple of weeks, it was necessary to deliver courses in live streams or screencasts, to develop new materials for giving students feedback online, and to move assessments to online formats. This led to a sudden increase in interest in video-conference software (like Zoom, BigBlueButton, …), learning management systems (like Moodle, Canvas, OpenOLAT, Blackboard, ILIAS, …), and – last but not least – our R/exams package. Due to the flexible “one-for-all” approach, it was relatively easy for R/exams users who had been using the system for in-class materials to transition to e-learning materials. But many new R/exams users also emerged who were interested in leveraging their university’s learning management system (often Moodle) for their classes (often related to statistics, data analysis, etc.).
As distance learning will continue to play an important role in most universities in the upcoming semester, I called for sharing experiences and resources related to R/exams. Here, I collect and summarize some of the feedback from the R/exams community. The full original threads can be found on Twitter and in the R-Forge forum, respectively.
- Twitter (@AchimZeileis): After a semester of #distancelearning is before the next semester of #distancelearning…
- R-Forge forum: Experiences with R/exams for distance learning.
Due to the increased interest in R/exams, I contributed various YouTube videos to help users get an overview of the system and tackle specific tasks (creating exercises or online tests, summative exams, etc.). Also, the many questions asked on StackOverflow and in the R-Forge forum led to numerous improvements in the software and support for various typical challenges.
- R/exams webinar @ WhyR?: YouTube video, blog post.
- Video tutorials: YouTube channel Achim Zeileis.
- Questions & answers: StackOverflow/r-exams, R-Forge forum.
But in addition to these materials that we developed, many R/exams users stepped up and created really cool and useful materials of their own. A lot of interest, along with great contributions, came especially from users in Spain:
- Jonathan Tuke (YouTube video, in English): Online tests in Canvas.
- Julio Mulero (blog post, in Spanish): Online tests in Moodle.
- Estadística útil (YouTube videos, in Spanish): Package, Installation, First steps, Single choice, Written exams, Moodle exams.
- Pedro Luis Luque Calvo (YouTube video and blog, in Spanish): Online tests in Blackboard, blog post.
- Kenji Sato (blog posts, in Japanese or English): Part 1, Part 2, Part 3, bordered tables.
Not surprisingly, we had been using R/exams extensively in our teaching for a blended learning approach in a Mathematics 101 course (R/exams blog). The asynchronous parts of the course (screencasts, forum, self tests, and formative online tests) were already delivered through the university’s OpenOLAT learning management system. Thus, we “just” had to switch the synchronous parts of the course (lecture, tutorials, and summative written exams) to distance learning formats:
              Learning                Feedback                        Assessment
Synchronous   Lecture → Live stream   Previously: Live quiz           Previously: Written exam
                                      Now: Online test (+ Tutorial)   Now: Online exam
Asynchronous  Textbook, Screencast    Self test (+ Forum)             Online test
The most challenging part was the summative online exam (R/exams blog), but overall everything worked out reasonably well. One nice aspect was that other colleagues profited immediately by using a similar approach for their online exams, like my colleague Lisa Lechner from the Department of Political Science:
Lisa Lechner: R/exams was a lifesaver during the difficult semester we faced. I used it for the statistics lecture with approx. 300 students (final online exam, but also ARSnova questions) and for a smaller course on empirical methods in political science. But even before the pandemic, R/exams increased the fairness of my final exams and reduced the time spent on evaluating results. #bestRpackage2020
Replacing written exams

Finding a suitable replacement for the summative written exams was not only a challenge for us. One concern that came up several times was that learning management systems like Moodle might not be reliable enough for online exams, especially if students’ internet connections were flaky. Hence, one idea that was discussed, especially with Felix Schönbrodt and Arthur Charpentier, was to email fillable PDF forms (StackOverflow Q&A) to students instead of relying entirely on Moodle (Twitter). However, I could convince them that PDF forms would create more problems than they solve (see the linked threads above for more details):
Felix Schönbrodt @nicebread303: We thought about doing a (pseudo-written) PDF exam, but we are very glad that we followed @AchimZeileis’ advice to go for Moodle. Grading is so much easier!
Another concern is, of course, cheating, no matter which format the exams use. Many R/exams users recommended using open-ended questions and/or more complex cloze questions that force participants to interact with a particular randomized problem, e.g., one based on a randomized data set. As Ulrike Grömping reports, this can also be combined with online proctoring for moderately sized exams.
Ulrike Grömping (R-Forge): I have also run online exams via Moodle in real time, supervised in a video conference […]. And I have not used auto correction, but have asked questions that also require free text answers from students (text fields with arbitrary correct answer of (e.g.) 72 characters in a cloze exercise). Of course, the solutions support my manual correction process.
Other approaches to tackling this problem can be found in the R-Forge thread; see, e.g., the contribution by Niels Smits.
Moodle vs. other learning management systems

While overall there is a lot of heterogeneity in the learning management systems that the R/exams community uses, Moodle is clearly the favorite. Moodle quizzes have their quirks, but they are surprisingly flexible and can be used at various levels.
Eric Lin @LinXEric: I use #rexams extensively, have been for years. I used to do paper exams. Now, with distance learning, I’m serving up exams, quizzes, entry ticket questions, exit ticket questions, all using a #rexams with #moodle.
The cloze question format is especially popular, as it allows combining multiple-choice, single-choice, numeric, and string items.
Rebecca Killick (R-Forge): We moved to using this exclusively for our undergraduate quizzes in the summer of 2019 and this served us well when COVID-19 hit. We are now expanding our online assessment using R/exams. We mainly use cloze questions with randomization.
See also the tweet by Emilio L. Cano @emilopezcano.
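As a hedged sketch of what such an export looks like (the exercise names below come from the demo pool that ships with the exams package; this is an illustration, not any of the users’ actual setups):

```r
library("exams")

# "boxplots" and "deriv" are demo exercises shipped with the package;
# n = 10 produces ten random replications of each for a Moodle XML quiz
exams2moodle(c("boxplots.Rmd", "deriv.Rmd"), n = 10, name = "demo-quiz")

# The same exercise files can target other systems, e.g. QTI 2.1
# (the standard supported by OpenOLAT)
exams2qti21(c("boxplots.Rmd", "deriv.Rmd"), n = 10, name = "demo-quiz")
```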
Of course, other learning management systems are also very powerful, with Blackboard and Canvas probably being second and third in popularity. The OpenOLAT system, also used at our university, is less popular but in various respects even more powerful than Moodle. Its support for the QTI 2.1 standard (question and test interoperability) allows specifying a complete exam, including random selection of exercises, randomized question order, and flexible cloze questions.
Collaborating with colleagues

One design principle of R/exams is that each exercise is a separate file. The motivation is to make it easy to distribute work on a question bank among a larger group before someone eventually compiles a final exam from that question bank. Hence it’s great to see that various R/exams users report working on exercises together with their colleagues, e.g., Fernando Mayer @fernando_mayer on Twitter or:
Ilaria Prosdocimi @ilapros: used by me and colleagues in Venice for about 500 exams via Moodle (in different basic stats courses). Mostly single choice and numeric questions. After some initial learning it worked perfectly and made the whole exam writing much easier.
To facilitate their collaboration, the group around Fernando Mayer has even created a ShinyExams interface (in Portuguese).
Statistics & data science vs. other fields

It is not unexpected that the majority of the R/exams community employs the system to create resources for teaching statistics, data science, or related topics. See the tweets and posts by Filip Děchtěrenko @FDechterenko, @BrettPresnell, Emi Tanaka @statsgen, Dianne Cook @visnut, and Stuart Lee. In many cases such courses are for students from other fields, such as pharmacy, as reported by Francisco A. Ocaña @faolmath, or biology, as reported by:
@OwenPetchey: The Data Analysis in Biology course at University of Zurich, with 300 students this spring, used this fantastic resource to author its final exams, with upload to OLAT.
However, there is also a broad diversity of other fields/topics, e.g., international finance and macroeconomics (Kamal Romero @kamromero), applied economics and accounting (Eric Lin @LinXEric), econometrics (Francisco Goerlich), linguistics (Maria Paola Bissiri), soil mechanics (Wolfgang Fellin), or programming (John Tipton).
Going forward

All the additions, improvements, and bug fixes in R/exams that were added over the last months are currently only available in the development version of R/exams 2.4-0 on R-Forge. The plan is to release this version to CRAN in the next weeks.
Another topic that will likely become more relevant in the next months is the creation of interactive self-study materials outside of learning management systems, e.g., as part of an online tutorial. Debbie Yuster describes such a scenario nicely:
Debbie Yuster (R-Forge): I plan to use R/exams to maintain source files for “video quizzes” students will take after they watch lecture videos asynchronously in my Introduction to Data Science course. Instructors at different universities will be using these videos (created by the curriculum author), and I will distribute the quizzes to interested faculty, each of whom may have a different preferred delivery mode. I’ll be giving the quizzes through my LMS (Canvas), while others may administer the quizzes in learnr, or use a different LMS. With R/exams, I can maintain a single source, and instructors can convert the quizzes into their preferred format.
Two approaches that are particularly interesting are learnr (RStudio), which uses R/Markdown with shiny, and webex (psyTeachR project), which uses more lightweight R/Markdown with some custom JavaScript and CSS. Some first proof-of-concept code is available for a learnr interface (R-Forge) and for integrating webex exercises in Markdown documents (StackOverflow Q&A), but more work is needed to turn these into full R/exams interfaces.
To leave a comment for the author, please follow the link and comment on their blog: R/exams.
The post R/exams for Distance Learning: Resources and Experiences first appeared on R-bloggers.
Last Week to Register for Why R? 2020 Conference
[This article was first published on http://r-addict.com, and kindly contributed to R-bloggers].
The fourth edition of the Why R? Conference (2020.whyr.pl) is happening this week. Below is information on how to participate!
- Register to get all relevant notifications for participants.
- All talks, besides workshops, will be streamed on youtube.com/WhyRFoundation.
- You can check the detailed agenda on 2020.whyr.pl/plan/.
- For your convenience we set up a whyr.pl/live/ site with talks divided by days.
- Join our whyr.pl/slack/ for networking.
- You can leave comments and questions on the YouTube chat or on the relevant Slack channels.
- You can still join the Text Mining Hackathon (lasts 24 hours: 2020-09-23 5pm UTC to 2020-09-24 5pm UTC).
- Abstracts of talks are available at 2020.whyr.pl/abstracts/.
- Check out the last seats for workshops at 2020.whyr.pl/workshops/.
To leave a comment for the author, please follow the link and comment on their blog: http://r-addict.com.
The post Last Week to Register for Why R? 2020 Conference first appeared on R-bloggers.
Momentum in Sports: Does Conference Tournament Performance Impact NCAA Tournament Performance
[This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers].
Every year in March, the NCAA holds its March Madness basketball tournament to crown a National Champion. Prior to this, many conferences also hold a conference tournament to name their Conference Champions. I’ve often wondered, ‘Is there any correlation between how a team performs in their conference tourney and how they perform in the Big Dance?’ So I decided to dive into the data. The data used was taken from the NCAA and Google Cloud ML Kaggle competition. The competition this data was taken from did not occur because the 2020 NCAA Tournament was cancelled due to COVID-19. Conference Tournament data only went back to 2001, so I truncated all the data to 2001.
Comparing Conference Tournament wins and NCAA Tournament wins by season

Overall, there doesn’t seem to be much of a correlation. It could be noted that only one team has ever won more than 3 games in its conference tournament and gone on to win the National Championship (UConn, 2011). The reverse can also be said: no team that failed to win at least one conference tournament game has won the National Championship. But I wanted to look deeper.
Subsetting the Conferences

Every Conference Tournament champion automatically makes the NCAA Tournament. For a lot of conferences, this is the only team that makes it into the field, and these teams often do not win a single game. These conferences should be filtered out to really get at the underlying question. Here I averaged the number of teams that make it each year for each conference and filtered out those that average fewer than 1.5. The reasoning is that, more often than not, these conferences send only one team. That left only 12 conferences: Conference USA, American Athletic (AAC), Atlantic 10, Atlantic Coast (ACC), Big 12, Big East, Big 10, Missouri Valley (MVC), Mountain West, Pac-12, SEC, and West Coast (WCC). These conferences and a graph showing the relationship between conference wins and tournament wins are shown below.
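The filtering step described above can be sketched roughly as follows (the table and column names here are hypothetical stand-ins, not the actual names in the Kaggle data):

```r
# Sketch with hypothetical names: `bids` has one row per tournament team
# per season, with columns conference, season, team_id
library(data.table)

# Average number of bids per season for each conference
avg_bids <- bids[, .N, by = .(conference, season)][
  , .(avg_teams = mean(N)), by = conference]

# Keep only conferences averaging at least 1.5 bids per season
keep <- avg_bids[avg_teams >= 1.5, conference]
bids_subset <- bids[conference %in% keep]
```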
Power Conferences

After subsetting to conferences averaging more than 1.5 teams, we are down to 12. There do not seem to be any overwhelming trends, just a few possibly interesting observations. We can subset even further to the Power Conferences: those that average sending 4 or more teams to the Dance.
Conclusions

Though not strong, the trend seems to suggest that conference champions perform better. Big East and West Coast champions average twice as many wins as other teams that make it, although most of the wins from the WCC champion come from Gonzaga, a perennially good team. AAC teams that win two conference games have a much higher average than other teams; however, the conference has not been around that long, so there are not enough data points to draw any major conclusions. Not surprisingly, power-conference teams overall average more wins in the NCAA Tournament. The Shiny App can be accessed here.
Photo by Markus Spiske on Unsplash
The post Momentum in Sports: Does Conference Tournament Performance Impact NCAA Tournament Performance first appeared on NYC Data Science Academy Blog.
To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.
The post Momentum in Sports: Does Conference Tournament Performance Impact NCAA Tournament Performance first appeared on R-bloggers.
Last Seats for Why R? 2020 Workshops
[This article was first published on http://r-addict.com, and kindly contributed to R-bloggers].
The Why R? 2020 conference (2020.whyr.pl) is happening this week. There are still some last seats for the workshops held on Friday, Sept 25th.
Register: 2020.whyr.pl/register/
Registration for workshops started 2020-09-01. Workshops take place 2020-09-25. They are divided into 2 sessions:
- 09:00 – 12:30 / CEST / GMT+2 – Morning (M)
- 14:00 – 17:30 / CEST / GMT+2 – Afternoon (A)
During the break, McKinsey & Company will host a Discussion Panel:
- Lessons learned from 500+ interviews for data science jobs (13:00 – 14:00 / CEST / GMT+2 – Panel (P))
Register 2020.whyr.pl/register/
A symbolic price was put on the workshop ticket so that those who register feel obliged to attend the workshop. Funds will be used to finance the ‘Supporting grant for Women in Data Science’ (whyr.pl > grants) and Why R? pre-meetings in Europe and Africa (whyr.pl > events).
Title                                                                         Tutors                                                    Seats
Invited: First steps with Continuous Integration                              Colin Gillespie & Rhian Davis from Jumping Rivers         25 (M)
Invited: Bayesian Optimization with mlr3mbo                                   Jakob Richter                                             25 (M)
Basics of Shiny                                                               Weronika Puchała, Krystyna Grzesiak, Katarzyna Sidorczuk  30 (M)
How to make your code fast – R and C++ integration using Rcpp                 Jadwiga Słowik, Dominik Rafacz, Mateusz Bąkała            30 (M)
Reproducible data analysis with drake                                         Jakub Kwiecień                                            30 (M)
Recruiting Panel: Lessons learned from 500+ interviews for data science jobs  McKinsey & Company                                        1000 (P)
Invited: Advanced Shiny                                                       Colin Fay from ThinkR                                     30 (A)
Highlighted: Causal machine learning in practice                              Mateusz Zawisza, McKinsey & Company                       40 (A)
Creating R Subroutines with Fortran and OpenMP Tools                          Erin Hodgess                                              30 (A)
Satellite imagery analysis in R                                               Ewa Grabska                                               40 (A)

To leave a comment for the author, please follow the link and comment on their blog: http://r-addict.com.
The post Last Seats for Why R? 2020 Workshops first appeared on R-bloggers.
Learn to Code Free — Our Interactive Courses Are ALL Free This Week!
[This article was first published on rpromotefeed – Dataquest, and kindly contributed to R-bloggers].
Exciting news: for the next week, all courses are free. Yup, every single course in every learning path is free from Sept 21-28.
This free week includes all of our courses in R, Python, SQL, machine learning, Git, the command line, and much more!
Even more exciting: complete at least one mission during this week and you’ll unlock an additional prize: a downloadable data science career resources pack sent to your email!
Now, it’s easier than ever to go from total beginner to job-qualified using Dataquest. The paywall is down!
(No credit card required.)
How our interactive programming courses work

At Dataquest, we do things a little differently. You won’t find any lecture videos here. Nor will you find any multiple-choice quizzes or fill-in-the-blank assessments.
Instead, we challenge you to write real code and work with real datasets. From the first mission of the first course, you’ll be writing and running your own code using our interactive code-running platform. You’ll get real-time feedback when your code is correct, and hints, answers, and access to our amazing learner community whenever you need a little help.
At the end of a course, you’ll often be challenged to put your new skills together by building a real data science project. Our guided projects will help you build skills and confidence even as you’re building out a cool project portfolio for your GitHub.
Why try Dataquest’s courses? Here’s what students say:

97% recommend Dataquest for career advancement. Based on our 2020 survey (n=656). This is consistent with the results of our 2019 survey, in which 96% of respondents recommended Dataquest for improving your career.

“Dataquest gave me a very clear pathway of hands-on learning. This massive data science world has opened up.”

“That is the power of a Dataquest education. This job found me [and] I doubled my income overnight. I could not be happier with where I am, and the skills that Dataquest allowed me to develop.”

“[Other platforms] take you by the hand. You think you’re learning but you’re not. Dataquest finds the balance between being too easy and being so hard that you get discouraged.”

$30k median salary increase after Dataquest. Many respondents were still studying, and some folks study on Dataquest to improve efficiency at their current jobs. Among those who did report a post-Dataquest salary change, the median was a US$30,000 increase in yearly salary.
Sign Up & Start Learning!
FAQ & Details
Simply sign up for a free Dataquest account (or log in to your existing account) and you will have access to every course on our platform.
You can start at the beginning of one of our learning paths, or dive in at any point you’d like.
All courses will be free from Sept. 21 at 0:00 UTC to Sept. 28 at 23:59 UTC.
Do I need a credit card to sign up?

No! You can create a Dataquest account for free, and no credit card is required. Even after the free week has ended, you’ll be able to use your free account to access any of our dozens of free missions.
Are there any limits to how much I can do during this week?

Nope! Our platform is self-serve and allows you to work at your own pace. You can complete as much of the path as you’d like during the free week, and you will be issued certificates for any courses you complete.
After the free week is over, users who do not have a paid subscription will no longer have access to paywalled parts of the platform, but your progress from the free week will be saved, and you will still have access to a free mission in each course.
What’s in the data science career resources pack?

The data science career resources pack is a downloadable ZIP file that contains PDF resources relating to getting your first job in data science, including a 30,000+ word ebook on the data science job application process, as well as additional career resources.
How do I get the data science career resources pack?
To unlock the pack, you’ll need to complete at least one mission during the free week. A “mission” is a lesson on Dataquest; each course is made up of several missions. You can complete any mission you’d like.
You can confirm you’ve completed a mission if you see a green checkmark on the dashboard screen next to the mission name, like this:
Inside the mission itself, you can also click the sandwich menu icon in the bottom left to view your progress. If every screen in the mission has a green checkmark, as seen on the left in the image below, the mission is complete:
This guide to using Dataquest is a good resource for answering questions about the platform and understanding the structure of screens, missions, and courses.
When do I get the data science career resources pack?
Completing a mission will not instantly unlock the pack! After the free week is over, we’ll analyze our platform data. Sometime on or before October 9, we will send a link to the email address associated with your Dataquest account that will contain a downloadable ZIP file.
How can I access the platform once Free Week is over?
Once the free week is over, a Dataquest subscription will be required to access most of the content on our platform, although there are 60+ missions available for free.
Based on a 2020 survey of more than 600 Dataquest learners, 91% of students rated a Dataquest subscription a good or great investment, and 97% said they recommended Dataquest for career advancement.
Charlie CusterCharlie is a student of data science, and also a content marketer at Dataquest. In his free time, he’s learning to mountain bike and making videos about it.
The post Learn to Code Free — Our Interactive Courses Are ALL Free This Week! appeared first on Dataquest.
To leave a comment for the author, please follow the link and comment on their blog: rpromotefeed – Dataquest.
The post Learn to Code Free — Our Interactive Courses Are ALL Free This Week! first appeared on Rbloggers.
Double dispatch in R: S4 vs vctrs
[This article was first published on krzjoa, and kindly contributed to R-bloggers].
Why might we need double dispatch?

In most cases, when writing R scripts or even creating R packages, it is enough to use standard functions or S3 methods. However, there is one important area that forces us to consider the double dispatch question: arithmetic operators.
Suppose we’d like to create a class that fits the problem we’re currently working on. Let’s name such a class beer.

Then, we create an operator which defines some non-standard behaviour:

- if we add an opener to the beer, we get an opened_beer
- if we add a numeric x, we get a case of beers (which may even contain a negative number of beers, i.e. what we owe…)
- if the second argument is neither an opener nor a numeric, we get… an untouched beer

Let’s demonstrate how it works:
```r
`+.beer` <- function(a, b) {
  if (inherits(b, "opener")) {
    return(structure(
      list(name = paste("opened", a$name)),
      class = "opened_beer"
    ))
  } else if (inherits(b, "numeric")) {
    print("It's magic! You've got a case of beers!")
    return(structure(
      list(n_beers = 1 + b),
      class = "case_of_beers"
    ))
  } else {
    return(a)
  }
}

pilsner + my_opener
## $name
## [1] "opened "
##
## attr(,"class")
## [1] "opened_beer"

pilsner + -0.1
## [1] "It's magic! You've got a case of beers!"
## $n_beers
## [1] 0.9
##
## attr(,"class")
## [1] "case_of_beers"
```

Don’t you think such operations should be commutative?
```r
my_opener + pilsner
## list()
## attr(,"class")
## [1] "opener"
```

What happened here? This is an example of the way the R interpreter handles arithmetic operators. It was described in detail on Hiroaki Yutani’s blog. Briefly speaking, in this particular case the R engine matched the method to the second argument (not to the first one), because there is no `+.opener` S3 method. What about such a trick:
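The code for the “trick” did not survive in this excerpt; judging from the “Incompatible methods” warning that follows, it presumably consisted of defining a `+.opener` S3 method as well, for example:

```r
# Hypothetical reconstruction of the missing chunk: once the opener class
# also has its own `+` method, both operands carry conflicting methods.
`+.opener` <- function(a, b) {
  b + a  # delegate to the other operand's method (never actually reached)
}
```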
After that, the result is different:
```r
my_opener + pilsner
## Warning: Incompatible methods ("+.opener", "+.beer") for "+"
## Error in my_opener + pilsner: non-numeric argument to binary operator
```

We crashed our function call. When both objects have a `+` method defined and these methods are not the same, R tries to resolve the conflict by applying the internal `+`. That obviously cannot work. This case could easily be solved with more ‘ifs’ in the `+.beer` function body. But let’s face a different situation.
What a mess! Simple S3 methods are definitely not the best solution when we need double dispatch.
To civilize such code, we can use the classic R approach: S4 methods. We’ll start with the S4 class declarations.
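The class-declaration chunk is missing from this excerpt. Judging from the generator functions used later (`.S4_beer()`, `.S4_opener()`, etc.), it presumably looked roughly like this sketch (the slot types are assumptions):

```r
# Hypothetical reconstruction of the missing S4 class declarations;
# setClass() returns a generator function, which explains the .S4_* calls below.
.S4_beer          <- setClass("S4_beer",          representation(type = "character"))
.S4_opened_beer   <- setClass("S4_opened_beer",   representation(type = "character"))
.S4_case_of_beers <- setClass("S4_case_of_beers", representation(n_beers = "numeric"))
.S4_opener        <- setClass("S4_opener",        representation(ID = "numeric"))
```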
Then, we have two options for handling the `+` operator. I didn’t mention it in the previous example, but both S3 and S4 operators are grouped as so-called group generic functions (learn more: S3, S4).

We can set an S4 method for a single operator, which looks as follows:
```r
setMethod("+", c(e1 = "S4_beer", e2 = "S4_opener"), function(e1, e2) {
  if (inherits(e2, "S4_opener")) {
    return(.S4_opened_beer(type = paste("opened", e1@type)))
  } else if (inherits(e2, "numeric")) {
    print("It's magic! You've got a case of beers!")
    return(.S4_case_of_beers(n_beers = 1 + e2))
  } else {
    return(e1)
  }
})

setMethod("+", c(e1 = "S4_opener", e2 = "S4_beer"), function(e1, e2) e2 + e1)
```

Alternatively, we can define a method for the Arith generic and check which method is actually being called at the moment. I decided to use the second approach, because it’s more similar to the way double dispatch is implemented in the vctrs library.
Let’s create our class instances and do a piece of math.
```r
S4_pilsner <- .S4_beer(type = "Pilsner")
S4_opener  <- .S4_opener(ID = 1)

S4_pilsner + S4_opener
## An object of class "S4_opened_beer"
## Slot "type":
## [1] "opened Pilsner"

S4_opener + S4_pilsner
## An object of class "S4_opened_beer"
## Slot "type":
## [1] "opened Pilsner"
```

The declared methods are clear and, most importantly, they work correctly.
vctrs is an interesting library, conceived as a remedy for a couple of R’s shortcomings. Among other things, it delivers a custom double-dispatch system based on the well-known S3 mechanism.

In the first step, we declare the class ‘constructors’.
```r
library(vctrs)

.vec_beer <- function(type) {
  new_vctr(.data = list(type = type), class = "vec_beer")
}

.vec_opened_beer <- function(type) {
  new_vctr(.data = list(type = type), class = "vec_opened_beer")
}

.vec_case_of_beers <- function(n_beers) {
  new_vctr(.data = list(n_beers = n_beers), class = "vec_case_of_beers")
}

.vec_opener <- function() {
  new_vctr(.data = list(), class = "vec_opener")
}
```

Then, we create class instances.
```r
vec_pilsner <- .vec_beer("pilsner")
vec_opener  <- .vec_opener()

print(class(vec_pilsner))
## [1] "vec_beer"   "vctrs_vctr"

print(class(vec_opener))
## [1] "vec_opener" "vctrs_vctr"
```

Finally, we write the double-dispatched methods in vctrs style. As you can see, they work properly, too.
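The vctrs-style methods themselves are also missing from this excerpt. A sketch of what they may have looked like, using vctrs’ `vec_arith()` double-dispatch generic (the method bodies are illustrative and reuse the constructors above):

```r
# vec_arith() dispatches on the class of the first argument; each method then
# re-dispatches on the second argument with its own UseMethod() call.
vec_arith.vec_beer <- function(op, x, y, ...) {
  UseMethod("vec_arith.vec_beer", y)
}

vec_arith.vec_beer.vec_opener <- function(op, x, y, ...) {
  .vec_opened_beer(type = paste("opened", field(x, "type")))
}

vec_arith.vec_beer.default <- function(op, x, y, ...) x

vec_arith.vec_opener <- function(op, x, y, ...) {
  UseMethod("vec_arith.vec_opener", y)
}

# Commutativity: opener + beer simply flips the arguments.
vec_arith.vec_opener.vec_beer <- function(op, x, y, ...) y + x
```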
Benchmark

I’ve created all the classes and methods above not only to demonstrate how to implement double dispatch in R. My main goal is to benchmark both approaches and check which one has the smaller overhead. The hardware I used for the test looks as follows:

It’s my good old notebook, which is not a beast.
```r
library(microbenchmark)
library(ggplot2)

# Beer + opener
bm1 <- microbenchmark(
  s4     = S4_pilsner + S4_opener,
  s3_vec = vec_pilsner + vec_opener,
  times  = 1000
)

## Unit: microseconds
##    expr     min       lq      mean   median       uq      max neval
##      s4 153.292 158.2120 178.40541 161.4225 165.6375 5506.681  1000
##  s3_vec  56.686  60.1265  69.52364  68.9240  70.8830  163.278  1000
```
It seems that the vctrs-based approach performs better than traditional S4 methods. Obviously, I checked only one operation, and some edge cases may exist. However, I think it gives us some indication of what execution times we can expect.
If you are interested in how to implement double-dispatched operators in S4, I encourage you to get familiar with the code of the following R libraries:

If you are looking for some examples of vctrs, I recommend you study the source code of:
To leave a comment for the author, please follow the link and comment on their blog: krzjoa.
The post Double dispatch in R: S4vsvctrs first appeared on Rbloggers.
Running an R Script on a Schedule: Heroku
[This article was first published on Category R on Roel's R-tefacts, and kindly contributed to R-bloggers].
In this tutorial I have an R script that runs every day on heroku. It creates a curve in ggplot2 and posts that picture to twitter.
The use case is this: You have a script and it needs to run on a schedule (for instance every day).
In 2018 I wrote a small post about how to run an R script on heroku.
The amazing thing is that the bot I created back then is still running! But I recently got a question about the scheduling, and because I did not really document it that well, I will do a small update here.
I will create a new post for many of the other ways in which you can run an R script on a schedule. But in this case I will run the script on heroku. Heroku is useful if you have one script you want to run, and not too often (every hour / every day). If you have many scripts, long-running scripts, or you want more precise time control, heroku is not the best solution. Also, there are quite a few manual steps, so this is not really suited for a completely automatic setup.
Heroku details

Heroku does not have dedicated R runners, but you can install an R runtime created by other people. In heroku these are called buildpacks. I’m using this one: https://github.com/virtualstaticvoid/heroku-buildpack-r
On a high level, this is what is going to happen (we want the code to run on a computer in the cloud):

- You save your script locally in a git repository
- You push everything to heroku (a cloud provider; think a laptop in the sky)
- Installation: heroku installs R, the relevant packages, and the script; heroku saves this state and stops
- Running: you can give heroku a command and it will start up and run the script; this starting up can be done on a timer
I will first explain what you need, what my R script does, and how to deal with credentials. If you are not interested, skip ahead to the steps.

What you need:

- no fear of using the command line (knowing where it is and how to navigate is enough)
- a heroku account (you will need a credit card, but running once a day is probably free)
- the heroku CLI (command line interface) installed
- your CLI authenticated by running heroku login
- a folder with a script that does what you want to do
- (not necessary, but very useful) renv set up for this project
I have an R script that:

- creates a U-shaped curve dataset
- adds random names to the x and y axes
- creates a ggplot2 image
- posts the tweet from a Twitter account
With this as result:
Of course you could create something that is actually useful, like downloading data, cleaning it and pushing it into a database. But this example is relatively small and you can actually see the results online.
Small diversion: credentials/secrets

For many applications you need credentials, and you don’t want to put the credentials in the script: if you share the script with someone, they also have your credentials, and if you put it on GitHub, the world has your secrets (I just did this).

So how can you do it? R can read environment variables, and in heroku you can input the environment variables that will be passed to the runner when it runs (there are better, more professional tools to do the same thing, but this is good enough for me). So you create an environment variable called apikey with a value like aVerY5eCretKEy. In your script you use Sys.getenv("apikey") and the script will retrieve the apikey aVerY5eCretKEy and use it.
How do you add them to your local environment?

- Create a .Renviron file in your local project.
- Add a new line to your .gitignore file: .Renviron. Now this file with secrets will be ignored by git and you can never accidentally add it to a repo.
- The .Renviron file is a simple text file where you can add ‘secrets’ like apikey="aVerY5eCretKEy", one per line.
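Inside the script, reading the secret might then look like this minimal sketch (the apikey name comes from the example above; the error message is mine):

```r
# Read the secret from the environment; .Renviron is loaded automatically
# when R starts, and heroku injects its config vars the same way.
api_key <- Sys.getenv("apikey")
if (api_key == "") {
  stop("apikey is not set; add it to .Renviron locally or to the heroku config vars")
}
```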
How do you add them to heroku?

I went to the heroku website for my project and manually set the config vars (heroku’s name for environment variables), but it is also possible to set them by running heroku config:set GITHUB_USERNAME=joesmith in your project folder.

Check that the env vars are correctly set by running heroku config.
Steps

So what do you need to make this work? Steps in order:

1. Check if your script runs on your computer
2. (Set up renv)
3. On the command line, set up a heroku project
4. Add the buildpack
5. git commit all the files you need
6. Push to heroku
7. Test-run the script on heroku
8. Add a scheduler
Steps with explanation:

- Run your R script locally to make sure it works: source("script.R")
- (optional, but recommended) Check that you have set up renv for this project: renv::status(). When you are satisfied with the script, use renv::snapshot() to fix the versions of your required packages. This creates an ‘renv.lock’ file that contains the package versions you used.
- Go to the project folder on your command line
- Set up a heroku project. Do either:

heroku create --buildpack https://github.com/virtualstaticvoid/heroku-buildpack-r.git

in the folder (this creates a new project and adds the buildpack), or do heroku create first and add the buildpack with:

heroku buildpacks:set https://github.com/virtualstaticvoid/heroku-buildpack-r.git
In the previous step you get a name for your project, for instance powerful-dusk-49558.

- Make sure you connect your repo to heroku (this didn’t happen automatically for me):

heroku git:remote -a powerful-dusk-49558

You now have a remote called ‘heroku’ (git remote -v shows this).
- Commit all the necessary things:

renv/activate.R, renv.lock, script.R

You need the renv/activate.R script from renv so that the buildpack recognizes this as an renv-equipped R project. The buildpack also works with an init.R file if you don’t want to use renv and prefer to manually write out which packages to install.
- Push to heroku: git push heroku master

The terminal shows all the installs of the buildpack:

```
remote: -----> R (renv) app detected
remote: -----> Installing R
remote:        Version 4.0.0 will be installed.
remote: -----> Downloading buildpack archives from AWS S3
..etc...
```

This will take a while because it needs to install R and all the packages. Subsequent pushes are faster because of caching, but will still take several minutes.
 After this is finished, test if it works on heroku with heroku run Rscript script.R
If it was successful, you can add a scheduler:
heroku addons:create scheduler:standard
Setting up this scheduler is still manual work.
Go to your heroku project in the browser to set the scheduler or use
heroku addons:open scheduler to let your browser move to the correct window.
I first set the frequency to hourly to see if the tweet bot works.
I set the job to Rscript script.R
There are no logs, so you’d better make sure it works when you run heroku run Rscript script.R.
If it works via run, it should also work via the scheduler.
I set the schedule to once a day.
Conclusion

And now it runs every day. However, this free plan has no logs and no fine-grained control, so if it fails, you wouldn’t know.
 Find more tutorials by me in this tutorial overview page
 previous post by me about heroku
 heroku CLI docs
 my current R script
 heroku config vars docs (environmental variables)
 Custom R buildpack for heroku by virtualstaticvoid
```
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.0.2 (2020-06-22)
 os       macOS Catalina 10.15.6
 system   x86_64, darwin17.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Amsterdam
 date     2020-09-21

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
 cli           2.0.2   2020-02-28 [1] CRAN (R 4.0.0)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.0)
 digest        0.6.25  2020-02-23 [1] CRAN (R 4.0.0)
 evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.0)
 fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.0)
 glue          1.4.1   2020-05-13 [1] CRAN (R 4.0.1)
 htmltools     0.5.0   2020-06-16 [1] CRAN (R 4.0.1)
 knitr         1.29    2020-06-23 [1] CRAN (R 4.0.1)
 magrittr      1.5     2014-11-22 [1] CRAN (R 4.0.0)
 rlang         0.4.7   2020-07-09 [1] CRAN (R 4.0.2)
 rmarkdown     2.3     2020-06-18 [1] CRAN (R 4.0.1)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.1)
 stringi       1.4.6   2020-02-17 [1] CRAN (R 4.0.0)
 stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
 withr         2.2.0   2020-04-20 [1] CRAN (R 4.0.2)
 xfun          0.15    2020-06-21 [1] CRAN (R 4.0.2)
 yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.0)

[1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
```
To leave a comment for the author, please follow the link and comment on their blog: Category R on Roel's R-tefacts.
The post Running an R Script on a Schedule: Heroku first appeared on Rbloggers.
Multi-Armed Bandit with Thompson Sampling
[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers].
A few words about Thompson Sampling

Thompson Sampling is an algorithm for decision problems where actions are taken in sequence, balancing between exploitation, which maximizes immediate performance, and exploration, which accumulates new information that may improve future performance. There is always a trade-off between exploration and exploitation in all multi-armed bandit problems.
Thompson Sampling has recently grown in popularity, since it is widely used in online campaigns (e.g. Facebook, YouTube, and web campaigns) where many variants are served at the beginning and, as time passes, the algorithm gives more weight to the strong variants.
Conjugate PriorIn Bayesian probability theory, if the posterior distributions p(θ  x) are in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function p(x  θ)
Let us focus on the binomial case, since in digital marketing we care mostly about CTR and conversion rates, which follow a binomial distribution. According to Bayesian theory, the conjugate prior of a binomial distribution is the Beta distribution, with posterior hyperparameters α and β corresponding to the successes and the failures respectively.
(Source: https://en.wikipedia.org/wiki/Conjugate_prior) Notice that the exact interpretation of the parameters of a Beta distribution in terms of the number of successes and failures depends on what function is used to extract a point estimate from the distribution. The mean of a Beta distribution is \(\frac{\alpha}{\alpha+\beta}\), which corresponds to \(\alpha\) successes and \(\beta\) failures, while the mode is \(\frac{\alpha-1}{\alpha+\beta-2}\), which corresponds to \(\alpha-1\) successes and \(\beta-1\) failures. Bayesians generally prefer to use the posterior mean rather than the posterior mode as a point estimate, justified by a quadratic loss function, and the use of \(\alpha\) and \(\beta\) is more convenient mathematically, while the use of \(\alpha-1\) and \(\beta-1\) has the advantage that a uniform \(\mathrm{Beta}(1,1)\) prior corresponds to 0 successes and 0 failures. The same issues apply to the Dirichlet distribution.
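The mean-versus-mode point quoted above is easy to check numerically. A minimal Python sketch (the function names are ours; the Beta(101, 901) posterior matches Variant 1 of the example further down):

```python
# Point estimates from a Beta(a, b) posterior, illustrating the
# mean-vs-mode interpretation quoted above.

def beta_mean(a: float, b: float) -> float:
    """Posterior mean: a / (a + b)."""
    return a / (a + b)

def beta_mode(a: float, b: float) -> float:
    """Posterior mode: (a - 1) / (a + b - 2), valid for a, b > 1."""
    return (a - 1) / (a + b - 2)

# 100 successes out of 1000 trials with a uniform Beta(1, 1) prior
a, b = 100 + 1, 900 + 1                  # Beta(101, 901)
print(beta_mean(a, b))                   # 101/1002, i.e. ~0.1008 (reads as a successes, b failures)
print(beta_mode(a, b))                   # 100/1000 = 0.1 (reads as a-1 successes, b-1 failures)
```

Note how the mode recovers the raw observed rate exactly, while the mean is pulled slightly toward the uniform prior.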
Thompson Sampling Algorithm

In practice, we want to maximize the expected number of rewards (i.e. clicks, conversions) given the actions (i.e. variants) and the current context. Conceptually, this means that at the beginning we serve the variants randomly, and then we readjust our strategy (i.e. the distribution of the variants) based on the results.
The question is how we get the weights at every step, and the answer is: with Monte Carlo simulation. We assume the variants follow the binomial distribution, and according to Bayesian theory their conjugate prior follows the Beta distribution with hyperparameters α = responses + 1 and β = non-responses + 1.
Let’s give an example of three variants:
 Variant 1: N=1000, responses=100 (i.e RR=10%). The conjugate prior will be Beta(a=101, b=901)
 Variant 2: N=1000, responses=110 (i.e RR=11%). The conjugate prior will be Beta(a=111, b=891)
 Variant 3: N=100, responses=10 (i.e RR=10%). The conjugate prior will be Beta(a=11, b=91)
Each time, we should serve the variant most likely to have the highest RR. This is estimated by Monte Carlo simulation. In the example above, we draw 5000 samples from each Beta distribution and, for each draw, pick the variant with the highest value. The weights we got were:
 14% Variant 1
 45% Variant 2
 41% Variant 3.
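The Monte Carlo weighting above can be sketched with just the Python standard library (random.betavariate draws from a Beta distribution). This is an illustrative translation, not the post's code; the exact shares wobble around 14/45/41 depending on the seed:

```python
import random

random.seed(42)  # fix the seed for reproducibility

# (impressions, responses) for the three variants above
variants = [(1000, 100), (1000, 110), (100, 10)]
N = 5000  # Monte Carlo draws

wins = [0] * len(variants)
for _ in range(N):
    # one Beta(successes + 1, failures + 1) draw per variant
    draws = [random.betavariate(r + 1, (n - r) + 1) for n, r in variants]
    wins[draws.index(max(draws))] += 1  # credit the highest draw

weights = [w / N for w in wins]
print(weights)  # roughly [0.14, 0.45, 0.41], varying with the seed
```

Variant 3 gets a large share despite its lower observed rate because its small sample makes its posterior wide.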
Once we adjust the weights and get new results, we follow the same approach again.
It is quite easy to apply Thompson Sampling in R. We will work with the three variants of our example above. Notice that variant 3 has fewer impressions than the other two, which is why its distribution is flatter (less peaked).
# vector of impressions per variant
b_Sent <- c(1000, 1000, 100)
# vector of responses per variant
b_Reward <- c(100, 110, 10)
msgs <- length(b_Sent)
# number of simulations
N <- 5000
# simulation of Beta distributions (successes + 1, failures + 1)
B <- matrix(rbeta(N * msgs, b_Reward + 1, (b_Sent - b_Reward) + 1), N, byrow = TRUE)
# take the percentage where each variant
# was observed with the highest rate
P <- table(factor(max.col(B), levels = 1:ncol(B))) / dim(B)[1]
P

and we get:
> P
     1      2      3
0.1368 0.4496 0.4136
Notice that since we are dealing with a Monte Carlo Simulation, if you do not set a random seed you will get slightly different results each time.
Thompson Sampling in Action

Above we showed the suggested weights based on the observed results of the three variants. Let's now provide an example with 4 variants whose ground-truth probabilities we know, and see how Thompson Sampling adjusts the weights at every step.
Let’s assume that the ground truth conversion rates of the 4 variants are:
 Variant 1: 10%
 Variant 2: 11%
 Variant 3: 12%
 Variant 4: 13%
Our strategy is to update the weights every 1000 impressions, and let's assume that the total sample size is 10000 (i.e. we will update the weights ten times).
output <- {}
b_Probs <- c(0.10, 0.11, 0.12, 0.13)
b_Sent <- rep(0, length(b_Probs))
b_Reward <- rep(0, length(b_Probs))
batch_size <- 1000
N <- 10000
steps <- floor(N / batch_size)
msgs <- length(b_Probs)

for (i in 1:steps) {
  B <- matrix(rbeta(1000 * msgs, b_Reward + 1, (b_Sent - b_Reward) + 1), 1000, byrow = TRUE)
  P <- table(factor(max.col(B), levels = 1:ncol(B))) / dim(B)[1]
  # tmp are the weights for each time step
  tmp <- round(P * batch_size, 0)
  # update the rewards
  b_Reward <- b_Reward + rbinom(rep(1, msgs), size = tmp, prob = b_Probs)
  # update the sent counts
  b_Sent <- b_Sent + tmp
  # print(P)
  output <- rbind(output, t(matrix(P)))
}

# the weights of every step
output

And we get:
> output
       [,1]  [,2]  [,3]  [,4]
 [1,] 0.239 0.259 0.245 0.257
 [2,] 0.002 0.349 0.609 0.040
 [3,] 0.003 0.329 0.533 0.135
 [4,] 0.001 0.078 0.386 0.535
 [5,] 0.001 0.016 0.065 0.918
 [6,] 0.002 0.016 0.054 0.928
 [7,] 0.002 0.095 0.154 0.749
 [8,] 0.003 0.087 0.067 0.843
 [9,] 0.001 0.059 0.069 0.871
[10,] 0.006 0.058 0.116 0.820
As we can see, from the 5th step onward the algorithm started to assign most of the weight to variant 4 and almost nothing to variant 1 and variant 2.
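For readers following along outside R, the batched loop can be sketched in Python with the standard library only. This is an illustrative translation, seeded for reproducibility; the per-step weights will differ from the R run but typically show the same drift toward variant 4:

```python
import random

random.seed(7)  # seeded so the run is reproducible

true_probs = [0.10, 0.11, 0.12, 0.13]  # ground-truth conversion rates
sent = [0] * 4       # impressions served so far, per variant
rewards = [0] * 4    # responses observed so far, per variant
batch_size = 1000
n_draws = 1000       # Monte Carlo draws per weight update

history = []
for step in range(10):
    # estimate each variant's win share over the current posteriors
    wins = [0] * 4
    for _ in range(n_draws):
        draws = [random.betavariate(rewards[i] + 1, (sent[i] - rewards[i]) + 1)
                 for i in range(4)]
        wins[draws.index(max(draws))] += 1
    weights = [w / n_draws for w in wins]
    history.append(weights)
    # serve the batch according to the weights; rewards arrive binomially
    for i in range(4):
        shown = round(weights[i] * batch_size)
        sent[i] += shown
        rewards[i] += sum(random.random() < true_probs[i] for _ in range(shown))

for row in history:
    print([round(w, 3) for w in row])
```

Because the per-variant batch sizes are rounded, the total served per step can be off by one or two impressions, just like the R version.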
Discussion

There exist other multi-armed bandit algorithms, like ε-greedy, greedy, UCB, etc. There are also contextual multi-armed bandits. In practice, there are some issues with multi-armed bandits. Let's mention some:
The CTR/CR can change across days, as can the preference for the variants (seasonality effect, day-of-week effect, fatigue of the campaign, etc.)
Changing the weights of the variants itself affects the CTR/CR, mainly because:
Due to cookies, it affects the distribution of new and repeated users
The results are skewed due to late conversions
To leave a comment for the author, please follow the link and comment on their blog: R – Predictive Hacks.
The post Multi-Armed Bandit with Thompson Sampling first appeared on R-bloggers.
The Raspberry Pi 4B as a shiny server
[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers.]
This blog post will not have any code, but will document how I went from hosting apps
on shinyapps.io to hosting shiny apps on my own server, which is a Raspberry Pi 4B with 8 gigs of ram.
First of all, why host apps on a Raspberry Pi? And why not continue on shinyapps.io?
Or why not get one of those nifty droplets on DigitalOcean? Well, for two reasons; one is that I wanted
to have full control of the server, and learn some basic web dev/web engineering skills that I lacked. These services
simplify the process of deploying and hosting a lot, which of course is a good thing if your only
goal is to deploy apps. But I wanted to learn how to do it myself from scratch for some time.
True, with a DigitalOcean droplet, I could have learned quite a lot about the whole process as well,
but there’s a second problem; the minimum amount of processing power that the droplet needed to run
shiny came at 10€ a month. Not a fortune, but already quite expensive for me, since I just wanted
to learn some stuff on my free time. Which is why I got a Raspberry Pi 4B with 8 gigs of ram. It’s less
than 100€, and now that I have it, I can do whatever I want whenever I want to. If I don’t touch it
for several months, no harm done. And if I get tired of it, I’ll make a retro console out of it and
play some old school games. It's a win-win situation if you ask me.
So first, you should get a Raspberry Pi. Those are quite easy to find online, and there’s many
tutorials available on how to install Ubuntu (or any other Linux distro) on it, so I won’t bother
with that. I also won’t explain to you how to ssh into your Raspberry Pi, again, there’s many tutorials
online. More important is how to get Shiny on it. There are two solutions: you either install
it from source, or you use Docker. I chose to use Docker, but maybe not in the way you’d expect;
there’s a lot of talk online about dockerizing apps, complete with all their dependencies and
environment. The advantage is that you're guaranteed that deployment will be very smooth. But the
big disadvantage is that these dockerized apps are huge, around 1GB, or sometimes more. It is true that disk space is
quite cheap nowadays, but still… so I prefer to run a Shiny server from Docker, and then run the
apps out of this server. My apps are thus very small, and it’s only the Shiny server that is huge.
I found a Github repository from user havlev that explains how to do it here.
I have followed this guide, and created my own docker container, which is based on havlev’s
one. I added some dependencies (to the base Debian distro included, as well as some more R packages).
If you’re in a hurry, and want to use my Docker image, you can simply type the following on your
Raspberry pi:
mkdir shiny-server
cd shiny-server
mkdir apps
mkdir conf
mkdir logs
docker run -d -p 3838:3838 -v shinyapps:/srv/shiny-server/ -v shinylogs:/var/log/ -v shinyconf:/etc/shiny-server/ --name rpishinyserver brodriguesco/shiny_1_5:firstcommit
The first 5 commands will create some folders that we’ll need later on, while the last one will pull
my Docker container, which is based on havlev’s one, launch the server and it’ll start listening to
port 3838.
I made an app (another blog post, focusing on this app, will follow soon), hosted on my Raspberry Pi
that you can find here. I’ll also give you
some pointers on how you can achieve that.
But let’s start from the beginning.
Adding dependencies to a Docker container

So let's suppose that you're me a few weeks ago, and that you find and follow havlev's guide here.
Getting the docker running is quite easy, you just need to set up Docker, and then find the line in the
tutorial that starts with docker run…. You’ll get Shiny running with its hello world app. Now,
how can you add more packages, either to the base Debian image, or R packages? For this part, I
followed this guide.
The idea is to “log in” to the console of the base Debian distro that is running from the container.
First, find the ID of the container by typing the following command in the terminal:
docker ps
You should see something like this:
ubuntu@ubuntu:~$ docker ps
CONTAINER ID   IMAGE                                COMMAND                  CREATED              STATUS              PORTS                    NAMES
69420blazeit   brodriguesco/shiny_1_5:firstcommit   "/etc/shiny-server/i…"   About a minute ago   Up About a minute   0.0.0.0:3838->3838/tcp   rpishinyserver
now with the ID in hand, you can start any command line program from your Docker container, for instance
bash:
docker exec -it 69420blazeit bash
You’ll be “logged in” as root:
root@69420blazeit:/#
and from there, you can install Debian packages. The following two packages are necessary to install
many R packages from source, so I recommend you install them:
root@69420blazeit:/# apt-get install libssl-dev libxml2-dev
Once these Debian packages are installed, you can start R by simply typing R in the same console,
and install whatever packages your Shiny apps will need. In my case, I installed {golem} and several
others, but this will be the subject of another blog post. We’re almost done with that; we now need
to save the changes because if you restart the container, you’ll lose all these changes. To save these
changes, let’s run the following command, but in a new terminal on your Raspberry Pi (on the
“local” Ubuntu, not the Debian running in the container):
ubuntu@ubuntu:~$ docker commit -m "added some dependencies" 69420blazeit shiny_with_deps
So now you could run this container with the command from above, by replacing the adequate parts:
docker run -d -p 3838:3838 -v shinyapps:/srv/shiny-server/ -v shinylogs:/var/log/ -v shinyconf:/etc/shiny-server/ --name rpishinyserver shiny_with_deps
Using your Shiny server

OK, so now that the server is running, how can you deploy apps on it? Remember the folders that we
created at the beginning of the blog post (or that you created if you followed havlev’s guide)?
This is where you'll drop your apps, the usual way. You create a folder there, and simply put the
ui.R and server.R files in it, and that's it. These folders can be found in your $HOME directory,
and they are accessible to your docker container as well. Once you dropped one or two apps, you’ll be able to access them on a link
similar as this one:
http://192.168.178.55:3838/hello/
where 192.168.178.55 is the local IP address of the Raspberry Pi, 3838 is the port the server
is listening to, and /hello/ is the name of the subfolder contained in the ~/shiny-server/apps
folder that you created before. What is left to do is make your Raspberry Pi a proper server that
can be accessed from the internet. For this, you’ll need to ask your ISP for a dynamic IP address.
Generally, you’ll have to pay some money for it; in my case, I’m paying 2€ a month. This address
can then be used to access your Raspberry Pi from the internet. The problem, is that being dynamic,
the address changes every time you restart your server. To solve this issue, you can use a free
dynamic DNS. I use duckdns. This will allow you to have a domain that you
can share with the world. What’s nice is that if you follow their guide
the redirection to the dynamic IP address will happen seamlessly every time it changes, so no need
to think about it and do it manually.
Finally, you’ll also have to open up port 3838 on your router. The procedure changes from router
to router, but you should be able to find the instructions for your router quite easily. If not, you
should also be able to get help from your ISP.
The end result is that you’ll have your own Shiny server running off a Raspberry Pi, and accessible
over the internet! You’ll be able to deploy as many apps as you want, but of course, don’t forget
that you’re running all this on a Raspberry Pi. While these machines have become quite powerful over
the years, they won’t be powerful enough if you’re running some heavy duty apps with hundreds of
concurrent users.
In my next blog post, I’ll walk you through the development of a Shiny app using the {golem} package,
which you can find here.
Hope you enjoyed! If you found this blog post useful, you might want to follow
me on twitter for blog post updates and
buy me an espresso or paypal.me, or buy my ebook on Leanpub.
To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.
The post The Raspberry Pi 4B as a shiny server first appeared on R-bloggers.
Moving on as Head of Solutions and AI at Draper and Dash
[This article was first published on R Blogs – Hutsons-hacks, and kindly contributed to R-bloggers.]
Well, what can I say? It has been a blast over the past year and a half. In my time at the company I have managed to work on a number of fascinating projects, here is a list of just some of them:
Data Science Platform – this has been the culmination of lots of machine learning projects, undertaken separately in R and Python, and has allowed for our Healthcare Data Science Platform to be developed. With this, alongside skilled data scientists, data engineers and front end developers, we think we have developed something really special
Clinical Outcome Review System (CORS) – we have built a lot into this product and it has allowed the web tool to be co-created and aligned with our core customers. The CORS tool is a web framework that enables the centralisation of 'learning from deaths' in one easy-to-use web tool.
Command Centre – led the design and development of the underpinning machine learning and forecasting models embedded into the solution. Furthermore, working alongside key trusts to deploy the solution and tailor it to the trusts' needs. For information about this product see: https://draperanddash.com/solutions2/commandcentre/.
Readmission avoidance tool, stranded patient predictor, ED Length of Stay estimator, waiting list cohorter, building an effective forecasting solution, alongside many other projects – primarily developing the forecasting and predictive functionality, alongside our key data scientist Alfonso. These have now been created as Qlik and other BI dashboards which smoothly integrate and augment the machine learning with exploratory analysis functionality in the various business intelligence solutions.
 Data Assurance Programme – working with a number of trusts to establish our data assurance framework and build out the discovery process that heightens and improves the visibility of data issues across the organisational sphere.
AI and ML innovation lead – established the wider AI/ML meetup group and managed the preexisting trust-wide AI and ML group. This involved sending regular updates to the group and making sure they were sighted on upcoming D&D products, tools and algorithms.
 Developed the Machine Learning and Data Science Training – led on the rollout of upskilling analysts in data science and machine learning techniques, mainly using the R programming language as a vehicle for the demonstration of the ‘art of the possible’.
Cancer analytics – led the statistical analysis of a number of key cancer measures to establish whether the change in the group was statistically significant.
COVID-19 – SIR and SEIR model development. These led to the correct estimation of a number of epidemic curves and peaks. Additionally, I created a Shiny web tool to allow for the estimation of peak demand. This was a very fun project to be working on, in a distressing time.
These are just a few of the projects I have been involved with and I am hugely proud of the time at the company.
I look forward to my next challenge back in the NHS as Head of Advanced Analytics for a CSU I won’t name…
What will I miss about the role?

I will miss a number of things in the role, but if I had to rank order them (as a statistician, you know how much we like to quantify), they would be:
 The people – the team at D&D are great and I have made some really good friends.
 The challenge – every day is different, from coding a Shiny app, to working in R to do a Machine Learning problem, to working in Python to work on NLP and Deep Learning, to working as a consultant in a consulting role, to working with clients, to upskilling NHS analysts, to working on a number of different domains.
 The learning curve – I like to continuously develop, and D&D affords that opportunity. Every day is a school day, as my professor used to say.
 The team spirit – we are a team and that is how we deliver all our projects.
 The coding – I love coding and I have time to do this on a daily basis – from R to Python for data science, SQL for extracting information, among others.
The only thing I won’t miss is the travel, but you can’t have it all…
Does this sound like your bag?

If you want to become the new me, then you are in luck, as D&D are looking for a replacement:
Click on the image above to be routed to the Draper and Dash main website.
A special thanks to all the guys at D&D

I would like to say that I have really enjoyed my time with the team and I will miss each and every one of them.
Team D&D
You guys have been great and it has been emotional.
To leave a comment for the author, please follow the link and comment on their blog: R Blogs – Hutsons-hacks.
The post Moving on as Head of Solutions and AI at Draper and Dash first appeared on R-bloggers.
Permutations in R
[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers.]
During the interview process for Data Science positions, you are likely to be asked to calculate combinations or permutations. Today we will provide an example of how we can solve permutation problems numerically in R.
Find all Permutations of the word baboon

Mathematically we can approach this question as follows:
\(P=\frac{n!}{n_1! n_2! n_3!…n_k!}\)
Where:
\(n\) is the total number of objects, i.e. letters, which in our case is 6
 \(n_1\) is the number of objects of type 1, for example, the number of b which is 2
 \(n_2\) is the number of objects of type 2, for example, the number of a which is 1
 \(n_3\) is the number of objects of type 3, for example, the number of o which is 2
 \(n_4\) is the number of objects of type 4, for example, the number of n which is 1
Hence the number of permutations is \(P=\frac{6!}{2! \times 2!} = 180\)
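The closed form is easy to cross-check by brute force; a small Python sketch (illustrative, not the post's code):

```python
from itertools import permutations
from math import factorial

word = "baboon"

# brute force: permutations() treats positions as distinct, so a set
# collapses arrangements that only swap identical letters
distinct = set(permutations(word))
print(len(distinct))  # 180

# closed form: n! divided by the factorial of each letter's multiplicity
count = factorial(len(word))
for letter in set(word):
    count //= factorial(word.count(letter))
print(count)  # 180
```

Both routes agree with the formula: 6!/(2! × 2!) = 720/4 = 180.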
Return all Permutations in R

Let's return all permutations in R.
Working with the combinat package
library(combinat)

# get the list of all permutations
my_list <- combinat::permn(c("b", "a", "b", "o", "o", "n"))
# convert the list to a matrix
my_matrix <- do.call(rbind, my_list)
# take the unique rows
my_matrix <- unique(my_matrix)
head(my_matrix)

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,] "b"  "a"  "b"  "o"  "o"  "n"
[2,] "b"  "a"  "b"  "o"  "n"  "o"
[3,] "b"  "a"  "b"  "n"  "o"  "o"
[4,] "b"  "a"  "n"  "b"  "o"  "o"
[5,] "b"  "n"  "a"  "b"  "o"  "o"
[6,] "n"  "b"  "a"  "b"  "o"  "o"
If we want to get the number of rows of the table, which are actually our permutations:
dim(my_matrix)
# [1] 180   6

As expected, we got 180 rows (the permutations) and 6 columns (the number of letters).
Comments

When we are in a position to get all the possible permutations, we will be able to calculate the permutations of more complicated problems. Let's say the question was to calculate all the possible permutations of the word baboon, but picking 4 letters each time (instead of 6).
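That follow-up also yields to brute force: itertools.permutations takes a length argument, and deduplicating with a set again leaves the distinct arrangements. For baboon taken 4 at a time this gives 102 (a Python sketch, illustrative only):

```python
from itertools import permutations

word = "baboon"

# pass the length to permutations() and dedupe, exactly as before
four_letter = set(permutations(word, 4))
print(len(four_letter))  # 102
```

The same answer falls out of a case count: 24 arrangements with all four letters distinct, 72 with one repeated pair, and 6 with both pairs, totalling 102.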
To leave a comment for the author, please follow the link and comment on their blog: R – Predictive Hacks.
The post Permutations in R first appeared on R-bloggers.
Sequential satisficing
[This article was first published on R on OSM, and kindly contributed to R-bloggers.]
In our last post, we ran simulations on our 1,000 randomly generated return scenarios to compare the average and risk-adjusted return for satisfactory, naive, and mean-variance optimized (MVO) maximum return and maximum Sharpe ratio portfolios.1 We found that you can shoot for high returns or high risk-adjusted returns, but rarely both. Assuming no major change in the underlying average returns and risk, choosing the efficient high return or high risk-adjusted return portfolio generally leads to similar performance a majority of the time in out-of-sample simulations. What was interesting about the results was that even though we weren't arguing that one should favor satisficing over optimizing, we found that the satisfactory portfolio generally performed well.
One area we didn’t test was the effect of sequence on meanvariance optimization. Recall we simulated 1,000 different sixty month (five year) return profiles for four potential assets—stocks, bonds, commodities (gold) and real estate. The weights we used to calculate the different portfolio returns and risk were based on data from 19871991 and were kept constant for all the simulations. What if the weighting system could “learn” from previous scenarios or adjust to recent information?
In this post, we'll look at the impact of incorporating sequential information on portfolio results and compare that to the non-sequential results. To orient folks, we'll show the initial portfolio weight simulation graph with the different portfolios and efficient frontier based on the historical data. The satisfactory, naive, MVO-derived maximum Sharpe ratio, and MVO-derived maximum return portfolios are the blue, black, red, and purple points, respectively.
Now we’ll take the weighting schemes and apply them to the return simulations. However, we’ll run through the returns sequentially and calculate the returns and riskadjusted returns for the weighting scheme using the prior simulation as the basis for the allocation on the next simulation. For example, we’ll calculate the efficient frontier and run the portfolio weight algorithm (see Weighting on a friend for more details) on simulation #75. From those calculations, we’ll assign the weights for the satisfactory, maximum Sharpe ratio, and maximum efficient return portfolios to compute the returns and riskadjusted returns on simulation #76. For the record, simulation #1 uses the same weights as those that produced the dots in the graph above.
Running the simulations, here’s the average return by weighting algorithm.
And here’s the original.
For the naive and maximum return portfolios, the average return isn't much different in the sequential calculations than in the original constant weight computations. The sequential Sharpe allocations are about 1 percentage point lower than the stable ones. However, the sequential satisfactory allocations are almost 2 percentage points lower than the constant allocation. In about 5% of the cases, the return simulation was not capable of achieving a satisfactory portfolio with the original constraints of not less than 7% and not more than 10% annualized return and risk. In such a case, we reverted to the original weighting. This adjustment did not affect the average return even when we excluded the "unsatisfactory" portfolios.
We assume that the performance drag is due to the specificity of the constraints amid random outcomes. The other portfolios (except for the naive one) optimize for the previous scenario and apply those optimal weights on the next portfolio in the sequence. The satisfactory portfolio chooses average weights to achieve a particular constraint and then applies those weights in sequence. Due to the randomness of the return scenarios, it seems possible that a weighting scheme that works for one scenario would produce poor results on another scenario. Of course, the satisfactory portfolios still achieved their return constraint on average.
This doesn’t exactly explain why the MVO portfolios didn’t suffer a performance drag. Our intuition is that the MVO portfolios are border cases. Hence, there probably won’t be too much dispersion in the results on average since the return simulations draw from the same return, risk, and error terms throughout. However, the satisfactory portfolio lies with the main portion of the territory, so the range of outcomes could be much larger. Clearly, this requires further investigation, but we’ll have to shelve that for now.
In the next set of graphs, we check out how each portfolio performs relative to the others. First the sequential simulation.
And the original:
There are some modest changes in the satisfactory and naive vs. MVO-derived portfolios. But the biggest change occurs in the satisfactory vs. naive portfolio: almost a 22 percentage point decline! This appears to support our conjecture above that the order of the sequence could be affecting the satisfactory portfolio's performance. Nonetheless, sequencing does not appear to alter the performance of the naive and satisfactory portfolios relative to the MVO-derived ones in a material way. Let's randomize the sequence to see if results change.
For the random draw case, we randomly draw one return simulation on which to calculate our portfolio weights and then apply those weights on another randomly drawn portfolio. The draws are without replacement to ensure we don’t use the same simulation both to calculate weights and returns as well as to ensure that we use all simulations.2
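One way to implement the without-replacement pairing is to shuffle the simulation indices once and walk the shuffled order, so each simulation serves once as the weight source and once as the return target (a Python sketch of the idea, not the post's code):

```python
import random

random.seed(123)

n_sims = 1000
order = list(range(n_sims))
random.shuffle(order)  # one pass through all sims, without replacement

# weights computed on order[k-1] are applied to the returns of order[k]
pairs = [(order[k - 1], order[k]) for k in range(1, n_sims)]

print(len(pairs))                     # 999 weight/return pairings
print(all(a != b for a, b in pairs))  # True: never fit and score on the same sim
```

Consecutive entries of a permutation are always distinct, which guarantees the weight simulation and the return simulation never coincide.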
The average return graph isn’t much different for the random sampling than it is for the sequential so we’ll save some space and not print it. Here is the relative performance graph.
And here is the original relative performance for ease of comparison.
With random ordering we see that the naive and satisfactory portfolios experience a modest performance improvement relative to the maximum return portfolio. Relative to the maximum Sharpe portfolio, the performance is relatively unchanged. The satisfactory portfolio again suffers a drag relative to the naive, but somewhat less severe than in the sequential case.
What are the takeaways? In both sequential and random sampling, the performance of the naive and satisfactory compared to the MVO portfolios was relatively stable. That supports our prior observation that you can shoot for high returns or high risk-adjusted returns, but usually not both. It's also noteworthy that the maximum Sharpe ratio consistently underperforms the naive and satisfactory portfolios in all of the simulations. The team at Alpha Architect noted similar results in a much broader research project. There, part of the study found that the maximum Sharpe portfolio suffered the worst performance relative to many others, including an equal-weighted (i.e., naive) portfolio.
One issue with our random sampling test is that it was only one random sequence. To be truly robust, we'd want to run the sampling simulation more than once. But we'll have to save that for when we rent time on a cloud GPU because we don't think our flintstone-aged laptop will take kindly to running 1,000 simulations of 1,000 random sequences of 1,000 simulated portfolios with 1,000 optimizations and 3,000 random portfolio weightings!
Perceptive readers might quibble with how we constructed our “learning” simulations. Indeed, in both the sequential and random sample cases we calculated our allocation weights only using the prior simulation in the queue (e.g. weights derived from return simulation #595 are deployed on simulation #596 (or, for example, #793 in the random simulation)). But that “learning” could be considered relatively short term. A more realistic scenario would be to allow for cumulative learning by trying to incorporate all or a majority of past returns in the sequence. And we could get even more complicated by throwing in some weighting scheme to emphasize nearer term results. Finally, we could calculate cumulative returns over one scenario or multiple scenarios. We suspect relative performance would be similar, but you never know!
Until we experiment with these other ideas, the R and Python code are below. While we generally don’t discuss the code in detail, we have been getting more questions and constructive feedback on it of late. Thank you for that! One thing to note: the R code that produces the simulations runs much quicker than the Python code. Part of that is likely due to how we coded the efficient frontier and maximum Sharpe and return portfolios in the different languages. But the differences are close to an order of magnitude. If anyone notices anything we could do to improve performance, please drop us an email at nbw dot osm at gmail dot com. Thanks!
R code

# Built using R 3.6.2

### Load packages
suppressPackageStartupMessages({
  library(tidyquant)
  library(tidyverse)
})

### Load data
# See prior posts for how we built these data frames
df <- readRDS("port_const.rds")
dat <- readRDS("port_const_long.rds")
sym_names <- c("stock", "bond", "gold", "realt", "rfr")
sim1 <- readRDS("hist_sim_port16.rds")

### Call functions
# See prior posts for how we built these functions
source("Portfolio_simulation_functions.R")
source("Efficient_frontier.R")

## Function for calculating satisfactory weighting
port_sim_wts <- function(df, sims, cols, return_min, risk_max){
  if(ncol(df) != cols){
    print("Columns don't match")
    break
  }

  # Create weight matrix
  wts <- matrix(nrow = (cols-1)*sims, ncol = cols)
  count <- 1
  for(i in 1:(cols-1)){
    for(j in 1:sims){
      a <- runif((cols-i+1), 0, 1)
      b <- a/sum(a)
      c <- sample(c(b, rep(0, i-1)))
      wts[count,] <- c
      count <- count + 1
    }
  }

  # Find returns
  mean_ret <- colMeans(df)

  # Calculate covariance matrix
  cov_mat <- cov(df)

  # Calculate random portfolios
  port <- matrix(nrow = (cols-1)*sims, ncol = 2)
  for(i in 1:nrow(port)){
    port[i,1] <- as.numeric(sum(wts[i,] * mean_ret))
    port[i,2] <- as.numeric(sqrt(t(wts[i,]) %*% cov_mat %*% wts[i,]))
  }

  port <- as.data.frame(port) %>%
    `colnames<-`(c("returns", "risk"))

  port_select <- cbind(port, wts)

  port_wts <- port_select %>%
    mutate(returns = returns*12,
           risk = risk*sqrt(12)) %>%
    filter(returns >= return_min,
           risk <= risk_max) %>%
    summarise_at(vars(3:6), mean)

  port_wts
}

## Portfolio func
port_func <- function(df, wts){
  mean_ret <- colMeans(df)
  returns <- sum(mean_ret*wts)
  risk <- sqrt(t(wts) %*% cov(df) %*% wts)
  c(returns, risk)
}

## Portfolio graph
pf_graf <- function(df, nudge, multiplier, rnd, y_lab, text){
  df %>%
    gather(key, value) %>%
    ggplot(aes(reorder(key, value), value*multiplier*100)) +
    geom_bar(stat = 'identity', fill = 'darkblue') +
    geom_text(aes(label = format(round(value*multiplier, rnd)*100, nsmall = 1)),
              nudge_y = nudge) +
    labs(x = "",
         y = paste(y_lab, "(%)", sep = " "),
         title = paste(text, "by portfolio", sep = " "))
}

## Create outperformance graph
perf_graf <- function(df, rnd, nudge, text){
  df %>%
    rename("Satisfactory vs. Naive" = ovs,
           "Satisfactory vs. Max" = ovr,
           "Naive vs. Max" = rve,
           "Satisfactory vs. Sharpe" = ove,
           "Naive vs. Sharpe" = sve) %>%
    gather(key, value) %>%
    ggplot(aes(reorder(key, value), value*100)) +
    geom_bar(stat = 'identity', fill = 'darkblue') +
    geom_text(aes(label = format(round(value, rnd)*100, nsmall = 1)),
              nudge_y = nudge) +
    labs(x = "",
         y = "Percent (%)",
         title = paste("Frequency of outperformance:", text, sep = " "))
}

### Create weight schemes
satis_wts <- c(0.32, 0.4, 0.08, 0.2) # Calculated in previous post using port_select_func
simple_wts <- rep(0.25, 4)
eff_port <- eff_frontier_long(df[2:61, 2:5], risk_increment = 0.01)
eff_sharp_wts <- eff_port[which.max(eff_port$sharpe), 1:4] %>% as.numeric()
eff_max_wts <- eff_port[which.max(eff_port$exp_ret), 1:4] %>% as.numeric()

## Run port sim on 1987-1991 data
port_sim_1 <- port_sim_lv(df[2:61, 2:5], 1000, 4)

# Run function on three weighting schemes and one simulation
weight_list <- list(satis = satis_wts, naive = simple_wts,
                    sharp = eff_sharp_wts, max = eff_max_wts)

wts_df <- data.frame(wts = c("satis", "naive", "sharp", "max"),
                     returns = 1:4, risk = 5:8,
                     stringsAsFactors = FALSE)

for(i in 1:4){
  wts_df[i, 2:3] <- port_func(df[2:61, 2:5], weight_list[[i]])
}
wts_df$sharpe <- wts_df$returns/wts_df$risk

# Graph portfolio simulation with three portfolios
port_sim_1$graph +
  geom_point(data = wts_df,
             aes(x = risk*sqrt(12)*100, y = returns*1200),
             color = c("darkblue", "black", "red", "purple"),
             size = 4) +
  geom_line(data = eff_port,
            aes(stdev*sqrt(12)*100, exp_ret*1200),
            color = 'blue', size = 1.5) +
  theme(legend.position = c(0.05, 0.8),
        legend.key.size = unit(.5, "cm"),
        legend.background = element_rect(fill = NA))

## Calculate performance with rolling frontier on max sharpe
set.seed(123)
eff_sharp_roll <- list()
eff_max_roll <- list()
satis_roll <- list()

for(i in 1:1000){
  if(i == 1){
    sharp_wts <- eff_sharp_wts
    max_wts <- eff_max_wts
    sat_wts <- satis_wts
  } else {
    eff_calc <- eff_frontier_long(sim1[[i-1]]$df, risk_increment = 0.01)
    sharp_wts <- eff_calc[which.max(eff_calc$sharpe), 1:4]
    max_wts <- eff_calc[which.max(eff_calc$exp_ret), 1:4]
    sat_wts <- port_sim_wts(sim1[[i-1]]$df, 1000, 4, 0.07, 0.1)
  }

  eff_sharp_roll[[i]] <- port_func(sim1[[i]]$df, as.numeric(sharp_wts))
  eff_max_roll[[i]] <- port_func(sim1[[i]]$df, as.numeric(max_wts))
  satis_roll[[i]] <- port_func(sim1[[i]]$df, as.numeric(sat_wts))
}

list_to_df <- function(list_ob){
  x <- do.call('rbind', list_ob) %>%
    as.data.frame() %>%
    `colnames<-`(c("returns", "risk")) %>%
    mutate(sharpe = returns/risk)
  x
}

roll_lists <- c("eff_sharp_roll", "eff_max_roll", "satis_roll")
for(obj in roll_lists){
  x <- list_to_df(get(obj))
  assign(obj, x)
}

# Return
roll_mean_pf <- data.frame(Satisfactory = mean(satis_roll[,1], na.rm = TRUE),
                           Naive = mean(simple_df[,1]),
                           Sharpe = mean(eff_sharp_roll[,1]),
                           Max = mean(eff_max_roll[,1]))

# Graph mean returns
pf_graf(roll_mean_pf, 1, 12, 2, "Return (%)", "Rolling simulation return")
pf_graf(mean_pf, 1, 12, 2, "Return (%)", "Original simulation return")

# Create relative performance df
roll_ret_pf <- data.frame(ovs = mean(satis_df[,1] > simple_df[,1]),
                          ovr = mean(satis_df[,1] > eff_max_roll[,1]),
                          rve = mean(simple_df[,1] > eff_max_roll[,1]),
                          ove = mean(satis_df[,1] > eff_sharp_roll[,1]),
                          sve = mean(simple_df[,1] > eff_sharp_roll[,1]))

# Graph outperformance
perf_graf(roll_ret_pf, 2, 4, "Rolling simulation returns")
perf_graf(ret_pf, 2, 4, "Original simulation returns")

## Sampling portfolios
set.seed(123)
eff_sharp_samp <- list()
eff_max_samp <- list()
satis_samp <- list()

for(i in 1:1000){
  if(i == 1){
    sharp_wts <- eff_sharp_wts
    max_wts <- eff_max_wts
    sat_wts <- satis_wts
  } else {
    samp_1 <- sample(1000, 1) # Sample a return simulation for weight scheme
    eff_calc <- eff_frontier_long(sim1[[samp_1]]$df, risk_increment = 0.01)
    sharp_wts <- eff_calc[which.max(eff_calc$sharpe), 1:4]
    max_wts <- eff_calc[which.max(eff_calc$exp_ret), 1:4]
    sat_wts <- port_sim_wts(sim1[[samp_1]]$df, 1000, 4, 0.07, 0.1)
  }

  samp_2 <- sample(1000, 1) # Sample a return simulation to analyze performance
  eff_sharp_samp[[i]] <- port_func(sim1[[samp_2]]$df, as.numeric(sharp_wts))
  eff_max_samp[[i]] <- port_func(sim1[[samp_2]]$df, as.numeric(max_wts))
  satis_samp[[i]] <- port_func(sim1[[samp_2]]$df, as.numeric(sat_wts))
}

samp_lists <- c("eff_sharp_samp", "eff_max_samp", "satis_samp")
for(obj in samp_lists){
  x <- list_to_df(get(obj))
  assign(obj, x)
}

# Return
samp_mean_pf <- data.frame(Satisfactory = mean(satis_samp[,1], na.rm = TRUE),
                           Naive = mean(simple_df[,1]),
                           Sharpe = mean(eff_sharp_samp[,1]),
                           Max = mean(eff_max_samp[,1]))

# Graph mean returns: NOT SHOWN
pf_graf(samp_mean_pf, 1, 12, 2, "Return (%)", "Random sampling return")
# pf_graf(mean_pf, 1, 12, 2, "Return (%)", "Original return")

# Create relative performance df
samp_ret_pf <- data.frame(ovs = mean(satis_df[,1] > simple_df[,1]),
                          ovr = mean(satis_df[,1] > eff_max_samp[,1]),
                          rve = mean(simple_df[,1] > eff_max_samp[,1]),
                          ove = mean(satis_df[,1] > eff_sharp_samp[,1]),
                          sve = mean(simple_df[,1] > eff_sharp_samp[,1]))

# Graph outperformance
perf_graf(samp_ret_pf, 2, 4, "Random sample returns")
perf_graf(ret_pf, 2, 4, "Original returns")
Python code

# Built using Python 3.7.4

# Load libraries
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

## Load data
# See prior posts for how we built these data frames
df = pd.read_pickle('port_const.pkl')
dat = pd.read_pickle('data_port_const.pkl')
port_names = ['Original', 'Naive', 'Sharpe', 'Max']
sim1 = pd.read_pickle('hist_sim_port16.pkl')

## Load functions part 1
# Portfolio simulation functions
# Note the Port_sim class is slightly different than previous posts,
# so we're reproducing it here.
# We were getting a lot of "numpy.float64 is not callable" errors
# due to overlapping names on variables and functions, so we needed
# to fix the code. If it still throws errors, let us know.

## Simulation function
class Port_sim:
    def calc_sim(df, sims, cols):
        wts = np.zeros((sims, cols))
        for i in range(sims):
            a = np.random.uniform(0, 1, cols)
            b = a/np.sum(a)
            wts[i,] = b
        mean_ret = df.mean()
        port_cov = df.cov()
        port = np.zeros((sims, 2))
        for i in range(sims):
            port[i,0] = np.sum(wts[i,]*mean_ret)
            port[i,1] = np.sqrt(np.dot(np.dot(wts[i,].T, port_cov), wts[i,]))
        sharpe = port[:,0]/port[:,1]*np.sqrt(12)
        return port, wts, sharpe

    def calc_sim_lv(df, sims, cols):
        wts = np.zeros(((cols-1)*sims, cols))
        count = 0
        for i in range(1, cols):
            for j in range(sims):
                a = np.random.uniform(0, 1, (cols-i+1))
                b = a/np.sum(a)
                c = np.random.choice(np.concatenate((b, np.zeros(i))), cols, replace=False)
                wts[count,] = c
                count += 1
        mean_ret = df.mean()
        port_cov = df.cov()
        port = np.zeros(((cols-1)*sims, 2))
        for i in range(sims):
            port[i,0] = np.sum(wts[i,]*mean_ret)
            port[i,1] = np.sqrt(np.dot(np.dot(wts[i,].T, port_cov), wts[i,]))
        sharpe = port[:,0]/port[:,1]*np.sqrt(12)
        return port, wts, sharpe

    def graph_sim(port, sharpe):
        plt.figure(figsize=(14,6))
        plt.scatter(port[:,1]*np.sqrt(12)*100, port[:,0]*1200, marker='.', c=sharpe, cmap='Blues')
        plt.colorbar(label='Sharpe ratio', orientation='vertical', shrink=0.25)
        plt.title('Simulated portfolios', fontsize=20)
        plt.xlabel('Risk (%)')
        plt.ylabel('Return (%)')
        plt.show()

# Constraint function
def port_select_func(port, wts, return_min, risk_max):
    port_select = pd.DataFrame(np.concatenate((port, wts), axis=1))
    port_select.columns = ['returns', 'risk', 1, 2, 3, 4]
    port_wts = port_select[(port_select['returns']*12 >= return_min) &
                           (port_select['risk']*np.sqrt(12) <= risk_max)]
    port_wts = port_wts.iloc[:,2:6]
    port_wts = port_wts.mean(axis=0)
    return port_wts

def port_select_graph(port_wts):
    plt.figure(figsize=(12,6))
    key_names = {1: "Stocks", 2: "Bonds", 3: "Gold", 4: "Real estate"}
    lab_names = []
    graf_wts = port_wts.sort_values()*100
    for i in range(len(graf_wts)):
        name = key_names[graf_wts.index[i]]
        lab_names.append(name)
    plt.bar(lab_names, graf_wts, color='blue')
    plt.ylabel("Weight (%)")
    plt.title("Average weights for risk-return constraint", fontsize=15)
    for i in range(len(graf_wts)):
        plt.annotate(str(round(graf_wts.values[i])), xy=(lab_names[i], graf_wts.values[i]+0.5))
    plt.show()

## Load functions part 2
# We should have wrapped the three different efficient frontier functions
# into one class or function but ran out of time. This is probably what slows
# down the simulations below.

# Create efficient frontier function
from scipy.optimize import minimize

def eff_frontier(df_returns, min_ret, max_ret):
    n = len(df_returns.columns)

    def get_data(weights):
        weights = np.array(weights)
        returns = np.sum(df_returns.mean() * weights)
        risk = np.sqrt(np.dot(weights.T, np.dot(df_returns.cov(), weights)))
        sharpe = returns/risk
        return np.array([returns, risk, sharpe])

    # Constraints
    def check_sum(weights):
        return np.sum(weights) - 1

    # Range of returns
    mus = np.linspace(min_ret, max_ret, 21)

    # Function to minimize
    def minimize_volatility(weights):
        return get_data(weights)[1]

    # Inputs
    init_guess = np.repeat(1/n, n)
    bounds = ((0.0, 1.0),) * n

    eff_risk = []
    port_weights = []

    for mu in mus:
        # Function for return
        cons = ({'type': 'eq', 'fun': check_sum},
                {'type': 'eq', 'fun': lambda w: get_data(w)[0] - mu})
        result = minimize(minimize_volatility, init_guess, method='SLSQP',
                          bounds=bounds, constraints=cons)
        eff_risk.append(result['fun'])
        port_weights.append(result.x)

    eff_risk = np.array(eff_risk)
    return mus, eff_risk, port_weights

# Create max sharpe function
def max_sharpe(df_returns):
    n = len(df_returns.columns)

    def get_data(weights):
        weights = np.array(weights)
        returns = np.sum(df_returns.mean() * weights)
        risk = np.sqrt(np.dot(weights.T, np.dot(df_returns.cov(), weights)))
        sharpe = returns/risk
        return np.array([returns, risk, sharpe])

    # Function to minimize (negative Sharpe, so the optimizer maximizes Sharpe)
    def neg_sharpe(weights):
        return -get_data(weights)[2]

    # Inputs
    init_guess = np.repeat(1/n, n)
    bounds = ((0.0, 1.0),) * n

    # Weight constraint
    constraint = {'type': 'eq', 'fun': lambda x: np.sum(x) - 1}
    result = minimize(neg_sharpe, init_guess, method='SLSQP',
                      bounds=bounds, constraints=constraint)
    return result['fun'], result['x']

# Create max return function
def max_ret(df_returns):
    n = len(df_returns.columns)

    def get_data(weights):
        weights = np.array(weights)
        returns = np.sum(df_returns.mean() * weights)
        risk = np.sqrt(np.dot(weights.T, np.dot(df_returns.cov(), weights)))
        sharpe = returns/risk
        return np.array([returns, risk, sharpe])

    # Function to minimize (negative return, so the optimizer maximizes return)
    def port_ret(weights):
        return -get_data(weights)[0]

    # Inputs
    init_guess = np.repeat(1/n, n)
    bounds = ((0.0, 1.0),) * n

    # Weight constraint
    constraint = {'type': 'eq', 'fun': lambda x: np.sum(x) - 1}
    result = minimize(port_ret, init_guess, method='SLSQP',
                      bounds=bounds, constraints=constraint)
    return result['fun'], result['x']

## Load functions part 3
## Portfolio return
def port_func(df, wts):
    mean_ret = df.mean()
    returns = np.sum(mean_ret * wts)
    risk = np.sqrt(np.dot(wts, np.dot(df.cov(), wts)))
    return returns, risk

## Return and relative performance graph
def pf_graf(names, values, rnd, nudge, ylabs, graf_title):
    graf_df = pd.DataFrame(zip(names, values), columns=['key', 'value'])
    sorted_df = graf_df.sort_values(by='value')  # renamed to avoid shadowing built-in sorted()
    plt.figure(figsize=(12,6))
    plt.bar('key', 'value', data=sorted_df, color='darkblue')
    for i in range(len(names)):
        plt.annotate(str(round(sorted_df['value'][i], rnd)),
                     xy=(sorted_df['key'][i], sorted_df['value'][i]+nudge))
    plt.ylabel(ylabs)
    plt.title('{} performance by portfolio'.format(graf_title))
    plt.show()

## Create portfolio
np.random.seed(123)
port_sim_1, wts_1, sharpe_1 = Port_sim.calc_sim(df.iloc[1:61,0:4], 1000, 4)

# Create returns and min/max ranges
# Note: the max-return scalar is named max_ret_val so it doesn't shadow the
# max_ret() function above (the source of the "not callable" errors)
df_returns = df.iloc[1:61, 0:4]
min_ret = min(port_sim_1[:,0])
max_ret_val = max(port_sim_1[:,0])

# Find efficient portfolio
eff_ret, eff_risk, eff_weights = eff_frontier(df_returns, min_ret, max_ret_val)
eff_sharpe = eff_ret/eff_risk

## Create weight schemes
satis_wts = np.array([0.32, 0.4, 0.08, 0.2]) # Calculated in previous post using port_select_func
simple_wts = np.repeat(0.25, 4)
eff_sharp_wts = eff_weights[np.argmax(eff_sharpe)]
eff_max_wts = eff_weights[np.argmax(eff_ret)]

wt_list = [satis_wts, simple_wts, eff_sharp_wts, eff_max_wts]
wts_df = np.zeros([4,3])
for i in range(4):
    wts_df[i,:2] = port_func(df.iloc[1:61,0:4], wt_list[i])
wts_df[:,2] = wts_df[:,0]/wts_df[:,1]

## Graph portfolios
plt.figure(figsize=(12,6))
plt.scatter(port_sim_1[:,1]*np.sqrt(12)*100, port_sim_1[:,0]*1200, marker='.', c=sharpe_1, cmap='Blues')
plt.plot(eff_risk*np.sqrt(12)*100, eff_ret*1200, 'b', linewidth=2)
col_code = ['blue', 'black', 'red', 'purple']
for i in range(4):
    plt.scatter(wts_df[i,1]*np.sqrt(12)*100, wts_df[i,0]*1200, c=col_code[i], s=50)
plt.colorbar(label='Sharpe ratio', orientation='vertical', shrink=0.25)
plt.title('Simulated portfolios', fontsize=20)
plt.xlabel('Risk (%)')
plt.ylabel('Return (%)')
plt.show()

## Create simplified satisfactory portfolio finder function
def port_sim_wts(df1, sims1, cols1, ret1, risk1):
    pf, wt, _ = Port_sim.calc_sim(df1, sims1, cols1)
    port_wts = port_select_func(pf, wt, ret1, risk1)
    return port_wts

## Run sequential simulation
np.random.seed(123)
eff_sharp_roll = np.zeros([1000,3])
eff_max_roll = np.zeros([1000,3])
satis_roll = np.zeros([1000,3])

for i in range(1000):
    if i == 0:
        sharp_weights = eff_sharp_wts
        max_weights = eff_max_wts
        sat_weights = satis_wts
    else:
        _, sharp_wts = max_sharpe(sim1[i-1][0])
        _, max_wts = max_ret(sim1[i-1][0])
        # Running the optimization twice is probably slowing down the simulation
        sharp_weights = sharp_wts
        max_weights = max_wts
        test = port_sim_wts(sim1[i-1][0], 1000, 4, 0.07, 0.1)
        if np.isnan(test).any():
            sat_weights = satis_wts
        else:
            sat_weights = test

    eff_sharp_roll[i,:2] = port_func(sim1[i][0], sharp_weights)
    eff_max_roll[i,:2] = port_func(sim1[i][0], max_weights)
    satis_roll[i,:2] = port_func(sim1[i][0], sat_weights)

eff_sharp_roll[:,2] = eff_sharp_roll[:,0]/eff_sharp_roll[:,1]
eff_max_roll[:,2] = eff_max_roll[:,0]/eff_max_roll[:,1]
satis_roll[:,2] = satis_roll[:,0]/satis_roll[:,1]

# Calculate simple returns
simple_df = np.zeros([1000,3])
for i in range(1000):
    simple_df[i,:2] = port_func(sim1[i][0], simple_wts)
simple_df[:,2] = simple_df[:,0]/simple_df[:,1]

## Add simulations to list and graph
roll_sim = [satis_roll, simple_df, eff_sharp_roll, eff_max_roll]
port_means = []
for sim_df in roll_sim:
    port_means.append(np.mean(sim_df[:,0])*1200)

port_names = ['Satisfactory', 'Naive', 'Sharpe', 'Max']

# Sequential simulation
pf_graf(port_names, port_means, 1, 0.5, 'Returns (%)', 'Rolling simulation return')

# Original simulation
port_means1 = []
for sim_df in list_df:
    port_means1.append(np.mean(sim_df[:,0])*1200)
pf_graf(port_names, port_means1, 1, 0.5, 'Returns (%)', 'Original simulation return')

## Comparison charts
# Build names for comparison chart
comp_names = []
for i in range(4):
    for j in range(i+1, 4):
        comp_names.append('{} vs. {}'.format(port_names[i], port_names[j]))

# Calculate comparison values
comp_values = []
for i in range(4):
    for j in range(i+1, 4):
        comps = np.mean(roll_sim[i][:,0] > roll_sim[j][:,0])
        comp_values.append(comps)

# Sequential comparisons
pf_graf(comp_names[:-1], comp_values[:-1], 2, 0.025, 'Frequency (%)', 'Rolling simulation frequency of')

# Original comparisons
comp_values1 = []
for i in range(4):
    for j in range(i+1, 4):
        comps1 = np.mean(list_df[i][:,0] > list_df[j][:,0])
        comp_values1.append(comps1)

pf_graf(comp_names[:-1], comp_values1[:-1], 2, 0.025, 'Frequency (%)', 'Original simulation frequency of')

## Sample simulation
from datetime import datetime
start_time = datetime.now()

np.random.seed(123)
eff_sharp_samp = np.zeros([1000,3])
eff_max_samp = np.zeros([1000,3])
satis_samp = np.zeros([1000,3])
naive_samp = np.zeros([1000,3])

for i in range(1000):
    if i == 0:
        sharp_weights = eff_sharp_wts
        max_weights = eff_max_wts
        sat_weights = satis_wts
        nav_weights = simple_wts
    else:
        samp1 = int(np.random.choice(1000, 1))
        _, sharp_wts = max_sharpe(sim1[samp1][0])
        _, max_wts = max_ret(sim1[samp1][0])
        sharp_weights = sharp_wts
        max_weights = max_wts
        test = port_sim_wts(sim1[samp1][0], 1000, 4, 0.07, 0.1)
        if np.isnan(test).any():
            sat_weights = satis_wts
        else:
            sat_weights = test

    samp2 = int(np.random.choice(1000, 1))
    eff_sharp_samp[i,:2] = port_func(sim1[samp2][0], sharp_weights)
    eff_max_samp[i,:2] = port_func(sim1[samp2][0], max_weights)
    satis_samp[i,:2] = port_func(sim1[samp2][0], sat_weights)
    naive_samp[i,:2] = port_func(sim1[samp2][0], nav_weights)

eff_sharp_samp[:,2] = eff_sharp_samp[:,0]/eff_sharp_samp[:,1]
eff_max_samp[:,2] = eff_max_samp[:,0]/eff_max_samp[:,1]
satis_samp[:,2] = satis_samp[:,0]/satis_samp[:,1]
naive_samp[:,2] = naive_samp[:,0]/naive_samp[:,1]

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
# Duration: 0:07:19.733893

# Create sample list and graph
samp_list = [eff_sharp_samp, eff_max_samp, satis_samp, naive_samp]
port_means_samp = []
for sim_df in samp_list:
    port_means_samp.append(np.mean(sim_df[:,0])*1200)

# Sample graph
pf_graf(port_names, port_means_samp, 1, 0.5, 'Returns (%)', 'Random sample simulation return')
# Original graph
pf_graf(port_names, port_means1, 1, 0.5, 'Returns (%)', 'Original simulation return')

# Calculate comparison values for sample simulation
comp_values_samp = []
for i in range(4):
    for j in range(i+1, 4):
        comps_samp = np.mean(samp_list[i][:,0] > samp_list[j][:,0])
        comp_values_samp.append(comps_samp)

# Sample graph
pf_graf(comp_names[:-1], comp_values_samp[:-1], 2, 0.025, 'Frequency (%)', 'Random sample simulation frequency of')
# Original graph
pf_graf(comp_names[:-1], comp_values1[:-1], 2, 0.025, 'Frequency (%)', 'Original simulation frequency of')

Ensuring our scenario draws don’t match is almost overkill, since the likelihood of drawing two of the same samples in a row is about one in a million.
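That back-of-the-envelope figure can be checked with one line of arithmetic. On one reading (our interpretation, not a calculation from the post), with draws uniform over 1,000 simulations, the chance that both of an iteration's two draws repeat the previous pair is (1/1000)²:

```python
p_repeat_one = 1 / 1000            # one uniform draw from 1,000 simulations repeats
p_repeat_pair = p_repeat_one ** 2  # both of an iteration's draws repeat the prior pair
print(p_repeat_pair)               # ~1e-06, i.e. about one in a million
```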
To leave a comment for the author, please follow the link and comment on their blog: R on OSM. R-bloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? Click here if you have a blog, or here if you don't.
The post Sequential satisficing first appeared on R-bloggers.
A Frosty Deal?
[This article was first published on R - Quantum Jitter, and kindly contributed to R-bloggers.]
Reading news articles on the will-they-won’t-they post-Brexit trade negotiations with the EU sees days of optimism jarred by days of gloom. Do negative news articles, when one wants a positive outcome, leave a deeper impression?
I wondered if I could get a more objective view from quantitative analysis of textual data. To do this, I’m going to look at hundreds of articles published in the Guardian newspaper over the course of the year to see how trade-talk sentiment has changed week-to-week.
library(tidyverse)
library(rebus)
library(wesanderson)
library(kableExtra)
library(lubridate)
library(GuardianR)
library(quanteda)
library(scales)
library(tictoc)
library(patchwork)
library(text2vec)
library(topicmodels)
theme_set(theme_bw())
cols <- wes_palette(name = "Chevalier1")
The Withdrawal Agreement between the UK and the European Union was signed on the 24th of January 2020. I’ll import Brexit-related newspaper articles from that date.
The Guardian newspaper asks for requests to span no more than 1 month at a time. So I’ll first create a set of monthly date ranges.
dates_df <- tibble(start_date = seq(ymd("2020-01-24"), today(), by = "1 month")) %>%
  mutate(end_date = start_date + months(1) - 1)

dates_df %>% kable()
start_date   end_date
2020-01-24   2020-02-23
2020-02-24   2020-03-23
2020-03-24   2020-04-23
2020-04-24   2020-05-23
2020-05-24   2020-06-23
2020-06-24   2020-07-23
2020-07-24   2020-08-23
2020-08-24   2020-09-23

I’ll import the newspaper articles in monthly chunks. Note, access to the Guardian’s API requires a key which may be requested here.
tic()

article_df <- dates_df %>%
  pmap_dfr(., function(start_date, end_date) {
    Sys.sleep(1)
    get_guardian(
      "brexit",
      from.date = start_date,
      to.date = end_date,
      api.key = key
    )
  })

toc()
The data need a little cleaning, for example, to remove multi-topic articles, html tags and non-breaking spaces.
trade_df <- article_df %>%
  filter(!str_detect(id, "/live/"),
         sectionId %in% c("world", "politics", "business")) %>%
  mutate(
    body = str_remove_all(body, "<.*?>") %>% str_to_lower(),
    body = str_remove_all(body, "[^a-z0-9 .]"),
    body = str_remove_all(body, "nbsp")
  )
A corpus then gives me a collection of texts whereby each document is a newspaper article.
trade_corp <- trade_df %>%
  corpus(docid_field = "shortUrl", text_field = "body")
Although I’ve only imported articles mentioning Brexit since the Withdrawal Agreement was signed, some of these articles will not be related to trade negotiations with the EU. For example, there are ongoing negotiations with many countries around the world. So, I’m going to use word embeddings to help narrow the focus to the specific context of the UK-EU trade deal.
The chief negotiator for the EU is Michel Barnier, so I’ll quantitatively identify words in close proximity to “Barnier” in the context of these Brexit news articles.
window <- 5
trade_fcm <-
trade_corp %>%
fcm(context = "window", window = window, count = "weighted", weights = window:1)
glove <- GlobalVectors$new(rank = 60, x_max = 10)
set.seed(42)
wv_main <- glove$fit_transform(trade_fcm, n_iter = 10)
## INFO [10:06:33.114] epoch 1, loss 0.3817
## INFO [10:06:34.959] epoch 2, loss 0.2510
## INFO [10:06:36.759] epoch 3, loss 0.2225
## INFO [10:06:38.577] epoch 4, loss 0.2021
## INFO [10:06:40.438] epoch 5, loss 0.1847
## INFO [10:06:42.303] epoch 6, loss 0.1710
## INFO [10:06:44.124] epoch 7, loss 0.1605
## INFO [10:06:45.936] epoch 8, loss 0.1524
## INFO [10:06:47.754] epoch 9, loss 0.1457
## INFO [10:06:49.594] epoch 10, loss 0.1403
wv_context <- glove$components
word_vectors <- wv_main + t(wv_context)
search_coord <-
word_vectors["barnier", , drop = FALSE]
word_vectors %>%
sim2(search_coord, method = "cosine") %>%
as_tibble(rownames = NA) %>%
rownames_to_column("term") %>%
rename(similarity = 2) %>%
arrange(desc(similarity)) %>%
slice(1:10) %>%
kable()
Word embedding is a learned modelling technique placing words into a multi-dimensional vector space such that contextually-similar words may be found close by. Not surprisingly, the closest word contextually is “Michel”. And as he is the chief negotiator for the EU, we find “eu’s”, “chief”, and “negotiator” also among the top most contextually-similar words.
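For readers unfamiliar with the `sim2(search_coord, method = "cosine")` step above, cosine similarity just measures the angle between two embedding vectors: vectors pointing the same way score near 1. A toy sketch (in Python, with made-up three-dimensional "embeddings" rather than the real 60-dimensional GloVe vectors trained above):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-d embeddings: contextually similar words point the same way
barnier = [0.9, 0.1, 0.3]
michel  = [0.8, 0.2, 0.3]
pound   = [0.1, 0.9, 0.1]

assert cosine_similarity(barnier, michel) > cosine_similarity(barnier, pound)
```

The `sim2` call does exactly this, but against every row of the word-vector matrix at once, which is why sorting the result descending surfaces the nearest neighbours of "barnier".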
The word embeddings algorithm, through word co-occurrence, has identified the name of Michel Barnier’s UK counterpart David Frost. So filtering articles for “Barnier”, “Frost” and “UK-EU” should help narrow the focus.
context_df <- trade_df %>%
  filter(str_detect(body, "barnier|frost|ukeu"))

context_corp <- context_df %>%
  corpus(docid_field = "shortUrl", text_field = "body")
I can then use quanteda’s kwic function to review the key phrases in context to ensure I’m homing in on the texts I want. Short URLs are included below so I can click on any to read the actual article as presented by The Guardian.
set.seed(123)

context_corp %>%
  tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE
  ) %>%
  kwic(pattern = phrase(c("trade negotiation", "trade deal", "trade talks")),
       valuetype = "regex", window = 7) %>%
  as_tibble() %>%
  left_join(article_df, by = c("docname" = "shortUrl")) %>%
  slice_sample(n = 10) %>%
  select(docname, pre, keyword, post, headline) %>%
  kable()
docname | pre | keyword | post | headline
https://gu.com/p/ee3qc | ecj unless we have such a thin | trade deal | that its not worth the paper its | Brexit: Boris Johnson faces Eurotunnel test
https://gu.com/p/end82 | london a separate process to the troubled | trade talks | that got under way in london on | Irish MEP in line for EU finance role vacated due to lockdown scandal
https://gu.com/p/ezjdz | said the downsides with the eu free | trade deal | the us free trade deal and our | Brexit bill hugely damaging to UK’s reputation, says ex-ambassador
https://gu.com/p/d7d9t | people we have who have been negotiating | trade deals | forever she said while people criticise the | Brexit trade talks: EU to back Spain over Gibraltar claims
https://gu.com/p/eyzhq | played down the prospect of reaching a | trade deal | with the eu in time for december | No 10 blames EU and plays down prospects of Brexit trade deal
https://gu.com/p/ez2v6 | it will make it harder to strike | trade deals | going forward he told channel news after | Brexit: UK negotiators ‘believe brinkmanship will reboot trade talks’
https://gu.com/p/d7n4t | alignment with eu rules in any brexit | trade deal | while brussels threatened to put tariffs on | Pound falls as Boris Johnson takes tough line on EU trade deal
https://gu.com/p/dnvbj | personal rapport when communicating remotely related postbrexit | trade talks | with eu on course to fail johnson | Fears Brexit talks could collapse in June but UK still optimistic
https://gu.com/p/d94j9 | this situation and we work on a | trade deal | with them of course the united kingdom | Ursula von der Leyen mocks Boris Johnson’s stance on EU trade deal
https://gu.com/p/ezkxc | it threatens to damage british prospects of | trade deals | with the us and eu it puts | Tuesday briefing: Rancour as law-breaking bill goes forward

Quanteda provides a sentiment dictionary which, in addition to identifying positive and negative words, also finds negative-negatives and negative-positives such as, for example, “not effective”.
For each week’s worth of articles, I’ll calculate the proportion of positive sentiments.
tic()

sent_df <- context_corp %>%
  dfm(dictionary = data_dictionary_LSD2015) %>%
  as_tibble() %>%
  left_join(context_df, by = c("doc_id" = "shortUrl")) %>%
  mutate(
    date = ceiling_date(as_date(webPublicationDate), "week"),
    pct_pos = (positive + neg_negative) / (positive + neg_negative + negative + neg_positive)
  )

sent_df %>%
  select(doc_id, starts_with("pos"), starts_with("neg")) %>%
  slice(1:10) %>%
  kable()
doc_id | positive | negative | neg_positive | neg_negative
https://gu.com/p/d6qhb | 40 | 22 | 0 | 0
https://gu.com/p/d9e9j | 27 | 15 | 0 | 0
https://gu.com/p/d6kzd | 51 | 27 | 0 | 1
https://gu.com/p/d6bt2 | 37 | 7 | 0 | 0
https://gu.com/p/d9vjq | 13 | 23 | 0 | 0
https://gu.com/p/d7n8b | 57 | 34 | 1 | 0
https://gu.com/p/d79cn | 56 | 48 | 3 | 1
https://gu.com/p/d6t3c | 28 | 26 | 0 | 0
https://gu.com/p/d9xtf | 33 | 13 | 1 | 0
https://gu.com/p/d696t | 15 | 21 | 1 | 0

summary_df <- sent_df %>%
group_by(date) %>%
summarise(pct_pos = mean(pct_pos), n = n())
toc()
## 0.708 sec elapsed
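The `pct_pos` calculation above has a simple logic: negated negatives count toward the positive tally and negated positives toward the negative one. A quick Python restatement of the same formula (our re-expression, not code from the post), checked against the first row of the sample output:

```python
def pct_pos(positive, negative, neg_positive, neg_negative):
    """Share of positive sentiment: negated negatives count as positive,
    negated positives as negative (mirroring the LSD2015-based R formula)."""
    pos = positive + neg_negative
    neg = negative + neg_positive
    return pos / (pos + neg)

# First row of the sample output above: positive=40, negative=22, neg_positive=0, neg_negative=0
print(round(pct_pos(40, 22, 0, 0), 3))  # 0.645
```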
Plotting the changing proportion of positive sentiment over time did surprise me a little. The outcome was more balanced than I expected which perhaps confirms the deeper impression left on me by negative articles.
The upper violin plot shows the average weight of the sentiment across multiple articles for each week. Individually the articles range from 20% to 80% positive, with discernible periods of relatively negative and relatively positive sentiment.
The lower plot shows the volume of articles. As we draw closer to the crunch point, the volume appears to be picking up.
p1 <- sent_df %>%
  ggplot(aes(date, pct_pos)) +
  geom_violin(aes(group = date), alpha = 0.5, fill = cols[1]) +
  geom_line(data = summary_df, aes(date, pct_pos),
            colour = cols[1], linetype = "dashed") +
  geom_hline(yintercept = 0.5, linetype = "dotted", colour = cols[4]) +
  scale_y_continuous(labels = percent_format(), limits = c(0.2, 0.8)) +
  labs(title = "Changing Sentiment Towards a UK-EU Trade Deal",
       subtitle = "Week-to-week Since the Withdrawal Agreement",
       x = NULL, y = "Positive Sentiment")

p2 <- summary_df %>%
  ggplot(aes(date, n)) +
  geom_line(colour = cols[1]) +
  labs(x = "Weeks", y = "Article Count",
       caption = "Source: Guardian Newspaper")

p1 / p2 + plot_layout(heights = c(2, 1))
Some writers exhibit more sentiment variation than others.
byline_df <- sent_df %>%
  mutate(byline = word(byline, 1, 2) %>% str_remove_all(PUNCT)) %>%
  group_by(byline, date) %>%
  summarise(pct_pos = mean(pct_pos), n = n())

top_3 <- byline_df %>%
  count(byline, sort = TRUE) %>%
  ungroup() %>%
  filter(!is.na(byline)) %>%
  slice(c(3, 2)) %>%
  pull(byline)

byline_df %>%
  filter(byline %in% top_3) %>%
  ggplot(aes(date, pct_pos, colour = byline)) +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dotted", colour = cols[2]) +
  scale_y_continuous(labels = percent_format(), limits = c(0.2, 0.8)) +
  scale_colour_manual(values = cols[c(1, 4)]) +
  labs(title = "Changing Sentiment Towards a UK-EU Trade Deal",
       subtitle = "Week-to-week Since the Withdrawal Agreement",
       x = "Weeks", y = "Positive Sentiment", colour = "Byline",
       caption = "Source: Guardian Newspaper")
R Toolbox

Summarising below the packages and functions used in this post enables me to separately create a toolbox visualisation summarising the usage of packages and functions across all posts.
Package | Function
base | library[12]; c[8]; function[2]; mean[2]; set.seed[2]; conflicts[1]; cumsum[1]; is.na[1]; months[1]; search[1]; seq[1]; sum[1]; Sys.sleep[1]
dplyr | filter[8]; mutate[8]; as_tibble[4]; group_by[3]; if_else[3]; n[3]; select[3]; slice[3]; summarise[3]; tibble[3]; arrange[2]; desc[2]; left_join[2]; starts_with[2]; count[1]; pull[1]; rename[1]; slice_sample[1]; ungroup[1]
ggplot2 | aes[5]; geom_line[3]; ggplot[3]; labs[3]; geom_hline[2]; scale_y_continuous[2]; geom_violin[1]; scale_colour_manual[1]; theme_bw[1]; theme_set[1]
GuardianR | get_guardian[1]
kableExtra | kable[5]
lubridate | date[3]; as_date[1]; ceiling_date[1]; today[1]; ymd[1]
patchwork | plot_layout[1]
purrr | map[1]; map2_dfr[1]; pmap_dfr[1]; possibly[1]; set_names[1]
quanteda | corpus[2]; data_dictionary_LSD2015[1]; dfm[1]; fcm[1]; kwic[1]; phrase[1]; t[1]; tokens[1]
readr | read_lines[1]
rebus | literal[4]; lookahead[3]; whole_word[2]; ALPHA[1]; lookbehind[1]; one_or_more[1]; or[1]; PUNCT[1]
scales | percent_format[2]
stringr | str_detect[5]; str_remove_all[5]; str_c[2]; str_remove[2]; str_count[1]; str_to_lower[1]; word[1]
text2vec | sim2[1]
tibble | enframe[1]; rownames_to_column[1]
tictoc | tic[2]; toc[2]
tidyr | unnest[1]
wesanderson | wes_palette[1]

To leave a comment for the author, please follow the link and comment on their blog: R - Quantum Jitter.
The post A Frosty Deal? first appeared on R-bloggers.
R – Risk and Compliance Survey: we need your help!
[This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers].
We would like to evaluate how business restrictions impact the commercial usage of R. To do this, we would love to hear your thoughts. We have created a short survey, which should take one minute of your time.
Thank you!
The post R – Risk and Compliance Survey: we need your help! appeared first on Mango Solutions.
To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.
The post R – Risk and Compliance Survey: we need your help! first appeared on R-bloggers.
VirtuEARL: Speaker interview
[This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers].
We caught up with one of our VirtuEARL speakers and EARL regulars, Jeremy Horne, on what we can expect from his VirtuEARL talk and how he thinks the R community has adapted since lockdown.
You’ve presented a few times now at EARL; what makes you come back each time?
I’ve said it for several years now: EARL has become the “must-go” event of the R calendar. Not only have I presented a few times, but I’ve been fortunate enough to attend every EARL since it started in 2014, and on each occasion I’ve taken something new away to try with my clients. Indeed, the quality of the talks and use cases grows better each year, so I’m looking forward to some more excellent talks this year too.
How has the R community adapted since lockdown?
Like every community, it was tough in the first few months. Events got cancelled (including my very own BrightonR user group) and interaction with other R users was limited to channels like Slack. But with the realisation that we’re a long way off face-to-face meetups again, virtual user groups have been set up and have done quite well: there were some fantastic presentations at LondonR last month, and we’ve finally got BrightonR off the ground again too!
Why do you think events like EARL and R meetups are important?
The networking: being able to talk to other like-minded R users and staying on the pulse of what they’re up to. This is the bit I miss terribly at the moment, as it always gets me thinking about new things I can do within R.
What can we expect to hear about from your EARL talk?
I’m hoping you’ll take away a few tips on how you can improve your email marketing using R, and specifically text analytics. Equally, I’d love to hear how you get on afterwards if you implement any of the techniques.
If you’d like to hear Jeremy’s talk and some other inspirational R-based talks, join us at VirtuEARL, the online Enterprise Applications of the R Language conference taking place this October.
While you’re here, please complete this short poll to help us evaluate how business restrictions impact the commercial usage of R.
The post VirtuEARL: Speaker interview appeared first on Mango Solutions.
To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.
The post VirtuEARL: Speaker interview first appeared on R-bloggers.
D&D’s Data Science Platform (DSP) – making healthcare analytics easier
[This article was first published on R Blogs – Hutsonshacks, and kindly contributed to R-bloggers].
My current company, Draper and Dash, has been working on the Healthcare DSP for a number of months, and we are now ready to launch. Find out what the HTN had to say: https://www.thehtn.co.uk/2020/09/15/newhealthdatascienceplatformlaunches/
A word from D&D’s Chief Technology Officer
Data Science Resilience & Collaboration in times of unprecedented disruption https://t.co/pPYTw55w4g – all powered by R and Python under the hood. It has been great to work on this project.
— Gary Hutson (@StatsGary) September 17, 2020
What it does
The DSP uses web technologies alongside healthcare models built in R and Python to combine analytics with predictive machine learning, forecasting, and natural language processing, among other techniques.
Find out more
To learn more, please register for our webinar:
It has been a pleasure for my team to work so closely alongside our Chief Technology Officer, and we have put together a really exciting product. As stated, register for the webinar by clicking the image above to find out more.
To leave a comment for the author, please follow the link and comment on their blog: R Blogs – Hutsonshacks.
The post D&D’s Data Science Platform (DSP) – making healthcare analytics easier first appeared on R-bloggers.