Ligature fonts for R
(This article was first published on R – Benomics, and kindly contributed to R-bloggers)
Ligature fonts are fonts which sometimes map multiple characters to a single glyph, either for readability or just because it looks neat. Importantly, this only affects the rendering of the text with said font, while the distinct characters remain in the source.
Maybe ligatures are an interesting topic in themselves if you’re into typography, but it’s the relatively modern monospaced variants which are significantly more useful in the context of R programming.
Two of the most popular fonts in this category are:
 Fira Code — an extension of Fira Mono which really goes all out providing a wide range of ligatures for obscure Haskell operators, as well as the more standard set which will be used when writing R
 Hasklig — a fork of Source Code Pro (in my opinion a nicer base font) which is more conservative with the ligatures it introduces
Here’s some code to try out with these ligature fonts, first rendered via a bog-standard monospace font:
library(magrittr)
library(tidyverse)
filtered_storms <- dplyr::storms %>%
  filter(category == 5, year >= 2000) %>%
  unite("date", year:day, sep = "-") %>%
  group_by(name) %>%
  filter(pressure == max(pressure)) %>%
  mutate(date = as.Date(date)) %>%
  arrange(desc(date)) %>%
  ungroup() %T>%
  print()
Here’s the same code rendered with Hasklig:
Some of the glyphs on show here are:
 A single arrow glyph for less-than hyphen (<-)
 Altered spacing around two colons (::)
 Joined-up double-equals (==)
Fira Code takes this a bit further and also converts >= to a single glyph:
In my opinion these fonts are a nice and clear way of reading and writing R. In particular the single arrow glyph harks back to the APL keyboards with real arrow keys, for which our modern two-character <- is a poor substitute.
One downside could be a bit of confusion when showing your IDE to someone else, or maybe writing slightly longer lines than it appears, but personally I’m a fan and my RStudio is now in Hasklig.
To leave a comment for the author, please follow the link and comment on their blog: R – Benomics. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Neville’s Method of Polynomial Interpolation
(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)
Part 1 of 5 in the series Numerical Analysis
Neville’s method evaluates a polynomial that passes through a given set of $x$ and $y$ points for a particular $x$ value using the Newton polynomial form. Neville’s method is similar to a now-defunct procedure named Aitken’s algorithm and is based on the divided differences recursion relation (“Neville’s Algorithm”, n.d.).
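It was stated in a previous post on Lagrangian polynomial interpolation that, for a function defined at the x values $x_0, x_1, \cdots, x_n$, there exists a Lagrange polynomial that passes through the $k$ points $x_{m_1}, x_{m_2}, \cdots, x_{m_k}$, where each $m_i$ is a distinct integer and $0 \leq m_i \leq n$. This polynomial is denoted $P_{m_1, m_2, \cdots, m_k}(x)$.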
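Neville’s Method
Neville’s method can be stated as follows:
Let a function $f$ be defined at points $x_0, x_1, \cdots, x_k$ where $x_i$ and $x_j$ are two distinct members. For each $k$, there exists a Lagrange polynomial $P$ that interpolates the function $f$ at the $k + 1$ points $x_0, x_1, \cdots, x_k$. The $k$th Lagrange polynomial is defined as:
$$P(x) = \frac{(x - x_j) P_{0,1,\cdots,j-1,j+1,\cdots,k}(x) - (x - x_i) P_{0,1,\cdots,i-1,i+1,\cdots,k}(x)}{x_i - x_j}$$
The $P_{0,1,\cdots,i-1,i+1,\cdots,k}$ and $P_{0,1,\cdots,j-1,j+1,\cdots,k}$ are often denoted $Q$ and $\hat{Q}$, respectively, for ease of notation.
$$P(x) = \frac{(x - x_j) \hat{Q}(x) - (x - x_i) Q(x)}{x_i - x_j}$$
The interpolating polynomials can thus be generated recursively, which we will see in the following example: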
Neville’s Method Example
Consider the following table of $x$ and corresponding $y$ values.
x      y
8.1    16.9446
8.3    17.56492
8.6    18.50515
8.7    18.82091
Suppose we are interested in interpolating a polynomial that passes through these points to approximate the resulting $y$ value from an $x$ value of 8.4.
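We can construct the interpolating polynomial approximations using the function above. Writing $Q_{i,j}$ for the approximation built from the points $x_{i-j}, \cdots, x_i$ (indices starting at 0, with $Q_{i,0} = y_i$), the recursion is $Q_{i,j} = \frac{(x - x_{i-j}) Q_{i,j-1} - (x - x_i) Q_{i-1,j-1}}{x_i - x_{i-j}}$, and the first iteration gives:
$$Q_{1,1} = \frac{(8.4 - 8.1)(17.56492) - (8.4 - 8.3)(16.9446)}{8.3 - 8.1} = 17.87508$$
$$Q_{2,1} = \frac{(8.4 - 8.3)(18.50515) - (8.4 - 8.6)(17.56492)}{8.6 - 8.3} = 17.87833$$
$$Q_{3,1} = \frac{(8.4 - 8.6)(18.82091) - (8.4 - 8.7)(18.50515)}{8.7 - 8.6} = 17.87363$$
The approximated values are then used in the next iteration.
$$Q_{2,2} = \frac{(8.4 - 8.1)(17.87833) - (8.4 - 8.6)(17.87508)}{8.6 - 8.1} = 17.87703$$
$$Q_{3,2} = \frac{(8.4 - 8.3)(17.87363) - (8.4 - 8.7)(17.87833)}{8.7 - 8.3} \approx 17.87716$$
Then the final iteration yields the approximated $y$ value for the given $x$ value.
$$Q_{3,3} = \frac{(8.4 - 8.1)(17.87716) - (8.4 - 8.7)(17.87703)}{8.7 - 8.1} \approx 17.8771$$
Therefore $17.8771$ is the approximated value at the point $x = 8.4$.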
Neville’s Method in R
The following function is an implementation of Neville’s method for interpolating and evaluating a polynomial.
poly.neville <- function(x, y, x0) {
  n <- length(x)
  # q[j, i] holds the approximation built from the i points ending at x[j]
  q <- matrix(data = 0, n, n)
  q[,1] <- y
  for (i in 2:n) {
    for (j in i:n) {
      q[j,i] <- ((x0 - x[j-i+1]) * q[j,i-1] - (x0 - x[j]) * q[j-1,i-1]) / (x[j] - x[j-i+1])
    }
  }
  res <- list('Approximated value'=q[n,n], 'Neville iterations table'=q)
  return(res)
}
Let’s test this function to see if it reports the same result as what we found earlier.
x <- c(8.1, 8.3, 8.6, 8.7)
y <- c(16.9446, 17.56492, 18.50515, 18.82091)
poly.neville(x, y, 8.4)
## $`Approximated value`
## [1] 17.87709
##
## $`Neville iterations table`
##          [,1]     [,2]     [,3]     [,4]
## [1,] 16.94460  0.00000  0.00000  0.00000
## [2,] 17.56492 17.87508  0.00000  0.00000
## [3,] 18.50515 17.87833 17.87703  0.00000
## [4,] 18.82091 17.87363 17.87716 17.87709
The approximated value is reported as 17.87709, the same value we calculated previously (minus a few decimal places). The function also outputs the iteration table that stores the intermediate results.
The pracma package contains the neville() function which also performs Neville’s method of polynomial interpolation and evaluation.
library(pracma)
neville(x, y, 8.4)
## [1] 17.87709
The neville() function reports the same approximated value that we found with our manual calculations and function.
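As an extra sanity check, the same value can be recovered from the Lagrange form directly, since the interpolating polynomial through the four points is unique. The sketch below is not part of the original post: the helper name lagrange.eval is made up for illustration, and it uses base R only.

```r
# Hypothetical helper (not from the original post): evaluate the Lagrange
# interpolating polynomial through the points (x, y) at the value x0.
lagrange.eval <- function(x, y, x0) {
  total <- 0
  for (i in seq_along(x)) {
    # Basis polynomial L_i(x0) = product over j != i of (x0 - x_j) / (x_i - x_j)
    basis <- prod((x0 - x[-i]) / (x[i] - x[-i]))
    total <- total + y[i] * basis
  }
  total
}

x <- c(8.1, 8.3, 8.6, 8.7)
y <- c(16.9446, 17.56492, 18.50515, 18.82091)
lagrange.value <- lagrange.eval(x, y, 8.4)
print(lagrange.value)
```

Both approaches evaluate the same cubic, so the printed value should agree with the Neville result up to floating-point rounding. Neville’s method has the advantage of exposing the intermediate lower-order approximations.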
References
Burden, R. L., & Faires, J. D. (2011). Numerical analysis (9th ed.). Boston, MA: Brooks/Cole, Cengage Learning.
Cheney, E. W., & Kincaid, D. (2013). Numerical mathematics and computing (6th ed.). Boston, MA: Brooks/Cole, Cengage Learning.
Neville’s algorithm. (2016, January 2). In Wikipedia, The Free Encyclopedia. From https://en.wikipedia.org/w/index.php?title=Neville%27s_algorithm&oldid=697870140
The post Neville’s Method of Polynomial Interpolation appeared first on Aaron Schlegel.
ArCo Package v 0.2 is on CRAN
(This article was first published on R – insightR, and kindly contributed to R-bloggers)
The ArCo package 0.2 is now available on CRAN. The functions are now more user friendly. The new features are:
 A default estimation function is used if the user does not supply the functions fn and p.fn. The default model is Ordinary Least Squares.
 The user can now add extra arguments to the fn function in the call.
 The data will be automatically coerced when possible.
Data wrangling : Transforming (2/3)
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
Data wrangling is a task of great importance in data analysis. Data wrangling is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process which is estimated to take about 60-80% of an analyst’s time. In this series we will go through this process. It will be a brief series with the goal of building the reader’s data wrangling skills. This is the third part of the series and it aims to cover the transforming of data. This can include filtering, summarizing, and ordering your data by different means. This also includes combining various data sets, creating new variables, and many other manipulation tasks. In this post, we will go through a few more advanced transformation tasks on the mtcars data set.
Before proceeding, it might be helpful to look over the help pages for group_by, ungroup, summary, summarise, arrange, mutate and cumsum.
Moreover please load the following libraries.
install.packages("dplyr")
library(dplyr)
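To get a feel for how these verbs chain together before starting, here is a small warm-up sketch on a different built-in data set (ToothGrowth), so it does not give away the answers. It is an illustration added here, not part of the original exercise set, and the object names are arbitrary.

```r
library(dplyr)

# ToothGrowth ships with R: len (numeric), supp (factor), dose (numeric)
by_group <- ToothGrowth %>%
  mutate(dose = factor(dose)) %>%   # store the grouping variable as a factor
  group_by(supp, dose)

dose_summary <- by_group %>%
  summarise(mean_len = mean(len), n = n()) %>%  # one row per supp/dose group
  ungroup() %>%                                 # drop any remaining grouping
  arrange(desc(mean_len)) %>%                   # largest mean first
  mutate(cum_n = cumsum(n))                     # running count of observations

print(dose_summary)
```

The pattern is the same throughout the exercises: group, summarise or mutate, then ungroup once the grouping is no longer needed.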
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Create a new object named cars_cyl and assign to it the mtcars data frame grouped by the variable cyl
Hint: be careful about the data type of the variable, in order to be used for grouping it has to be a factor.
Exercise 2
Remove the grouping from the object cars_cyl
Exercise 3
Print out the summary statistics of the mtcars data frame using the summary function and pipeline symbols %>%.
Exercise 4
Make a more descriptive summary statistics output containing the 4 quantiles, the mean, the standard deviation and the count.
Exercise 5
Print out the average *hp* for every cyl category
Exercise 6
Print out the mtcars data frame sorted by hp (ascending order)
Exercise 7
Print out the mtcars data frame sorted by hp (descending order)
Exercise 8
Create a new object named cars_per containing the mtcars data frame along with a new variable called performance and calculated as performance = hp/mpg
Exercise 9
Print out the cars_per data frame, sorted by performance in descending order and create a new variable called rank indicating the rank of the cars in terms of performance.
Exercise 10
To wrap everything up, we will use the iris data set. Print out the mean of every variable for every Species and create two new variables called Sepal.Density and Petal.Density, calculated as Sepal.Density = Sepal.Length * Sepal.Width and Petal.Density = Sepal.Length * Petal.Width respectively.
New R Course: Writing Efficient R Code
Hello R users, we’ve got a brand new course today: Writing Efficient R Code by Colin Gillespie.
The beauty of R is that it is built for performing data analysis. The downside is that sometimes R can be slow, thereby obstructing our analysis. For this reason, it is essential to become familiar with the main techniques for speeding up your analysis, so you can reduce computational time and get insights as quickly as possible.
Writing Efficient R Code features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a master in writing efficient, quick R code!
What you’ll learn:
Chapter 1: The Art of Benchmarking
In order to make your code go faster, you need to know how long it takes to run.
Chapter 2: Fine Tuning – Efficient Base R
R is flexible because you can often solve a single problem in many different ways. Some ways can be several orders of magnitude faster than the others.
Chapter 3: Diagnosing Problems – Code Profiling
Profiling helps you locate the bottlenecks in your code.
Chapter 4: Turbo Charged Code – Parallel Programming
Some problems can be solved faster using multiple cores on your machine. This chapter shows you how to write R code that runs in parallel.
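To make the first two chapter themes concrete, here is a minimal base R sketch (not taken from the course) that times two ways of solving the same problem. The function name sum_loop and the vector size are arbitrary choices for illustration, and timings will differ between machines.

```r
set.seed(1)
x <- runif(1e5)

# Explicit loop: one interpreted addition per element
sum_loop <- function(v) {
  total <- 0
  for (value in v) total <- total + value
  total
}

# system.time() reports user, system and elapsed time for an expression
t_loop <- system.time(res_loop <- sum_loop(x))
t_vec  <- system.time(res_vec <- sum(x))

elapsed_loop <- t_loop[["elapsed"]]
elapsed_vec  <- t_vec[["elapsed"]]
cat("loop:", elapsed_loop, "s; vectorised:", elapsed_vec, "s\n")

# Both approaches must agree before their speed is worth comparing
stopifnot(isTRUE(all.equal(res_loop, res_vec)))
```

For very short expressions a single system.time() call is noisy, so repeating the measurement several times gives a more trustworthy comparison.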
Learn how to write efficient R code today!
Short course on Bayesian data analysis and Stan 23-25 Aug in NYC!
(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)
Jonah “ShinyStan” Gabry, Mike “Riemannian NUTS” Betancourt, and I will be giving a three-day short course next month in New York, following the model of our successful courses in 2015 and 2016.
Before class everyone should install R, RStudio and RStan on their computers. (If you already have these, please update to the latest version of R and the latest version of Stan.) If problems occur please join the stan-users group and post any questions. It’s important that all participants get Stan running and bring their laptops to the course.
Class structure and example topics for the three days:
Day 1: Foundations
Foundations of Bayesian inference
Foundations of Bayesian computation with Markov chain Monte Carlo
Intro to Stan with hands-on exercises
Real-life Stan
Bayesian workflow
Day 2: Linear and Generalized Linear Models
Foundations of Bayesian regression
Fitting GLMs in Stan (logistic regression, Poisson regression)
Diagnosing model misfit using graphical posterior predictive checks
Little data: How traditional statistical ideas remain relevant in a big data world
Generalizing from sample to population (surveys, Xbox example, etc)
Day 3: Hierarchical Models
Foundations of Bayesian hierarchical/multilevel models
Accurately fitting hierarchical models in Stan
Why we don’t (usually) have to worry about multiple comparisons
Hierarchical modeling and prior information
Specific topics on Bayesian inference and computation include, but are not limited to:
Bayesian inference and prediction
Naive Bayes, supervised, and unsupervised classification
Overview of Monte Carlo methods
Convergence and effective sample size
Hamiltonian Monte Carlo and the no-U-turn sampler
Continuous and discrete-data regression models
Mixture models
Measurement-error and item-response models
Specific topics on Stan include, but are not limited to:
Reproducible research
Probabilistic programming
Stan syntax and programming
Optimization
Warmup, adaptation, and convergence
Identifiability and problematic posteriors
Handling missing data
Ragged and sparse data structures
Gaussian processes
Again, information on the course is here.
The course is organized by Lander Analytics.
The course is not cheap. Stan is open-source, and we organize these courses to raise money to support the programming required to keep Stan up to date. We hope and believe that the course is more than worth the money you pay for it, but we hope you’ll also feel good, knowing that this money is being used directly to support Stan R&D.
The post Short course on Bayesian data analysis and Stan 23-25 Aug in NYC! appeared first on Statistical Modeling, Causal Inference, and Social Science.
Machine Learning Explained: supervised learning, unsupervised learning, and reinforcement learning
(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)
Machine learning is often split between three main types of learning: supervised learning, unsupervised learning, and reinforcement learning. Knowing the differences between these three types of learning is necessary for any data scientist.
The big picture
The type of learning is defined by the problem you want to solve and is intrinsic to the goal of your analysis:
 Supervised learning: you have a target, a value or a class to predict. For instance, let’s say you want to predict the revenue of a store from different inputs (day of the week, advertising, promotion). Then your model will be trained on historical data and use them to forecast future revenues. Hence the model is supervised: it knows what to learn.
 Unsupervised learning: you have unlabelled data and look for patterns or groups in these data. For example, you want to cluster clients according to the type of products they order, how often they purchase your product, their last visit, and so on. Instead of doing it manually, unsupervised machine learning will automatically discriminate between different clients.
 Reinforcement learning: you want to attain an objective. For example, you want to find the best strategy to win a game with specified rules. Once these rules are specified, reinforcement learning techniques will play this game many times to find the best strategy.
On supervised learning
Supervised learning groups together different techniques which all share the same principles:
 The training dataset contains input data (your predictors) and the value you want to predict (which can be numeric or not).
 The model will use the training data to learn a link between the inputs and the outputs. The underlying idea is that the training data can be generalized and that the model can be used on new data with some accuracy.
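These two principles can be sketched in a few lines of base R. The example below is an illustration added here, not from the original post: it assumes the built-in mtcars data, an arbitrary 24/8 train/test split and a plain linear regression as the supervised model.

```r
set.seed(42)

# Split the data: the model only sees the training rows
train_idx <- sample(nrow(mtcars), size = 24)
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Learn a link between the inputs (wt, hp) and the target (mpg)
model <- lm(mpg ~ wt + hp, data = train)

# Use the model on new data and measure how well it generalizes
pred <- predict(model, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
cat("Test RMSE:", rmse, "\n")
```

The error on the held-out rows, rather than on the training rows, is what indicates whether the learned link carries over to new data.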
Some supervised learning algorithms:
 Linear and logistic regression
 Support vector machine
 Naive Bayes
 Neural network
 Gradient boosting
 Classification trees and random forest
Supervised learning is often used for expert systems in image recognition, speech recognition, forecasting, and in some specific business domains (targeting, financial analysis, etc.).
On unsupervised learning
[Figure: cluster analysis, from Wikipedia]
On the other hand, unsupervised learning does not use output data (at least output data that are different from the input). Unsupervised algorithms can be split into different categories:
 Clustering algorithms, such as K-means, hierarchical clustering or mixture models. These algorithms try to discriminate and separate the observations into different groups.
 Dimensionality reduction algorithms (which are mostly unsupervised) such as PCA, ICA or autoencoders. These algorithms find the best representation of the data with fewer dimensions.
 Anomaly detection, to find outliers in the data, i.e. observations which do not follow the data set patterns.
Most of the time unsupervised learning algorithms are used to pre-process the data, during the exploratory analysis, or to pre-train supervised learning algorithms.
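As a rough illustration of the first two categories (added here, not from the original post), the base R sketch below clusters the iris measurements with K-means and projects them with PCA. The choice of three clusters is an assumption made for the example, and the Species column is only used afterwards to inspect the result.

```r
set.seed(1)

# Keep only the numeric inputs: no target is given to the algorithms
features <- scale(iris[, 1:4])

# Clustering: separate the observations into 3 groups
clusters <- kmeans(features, centers = 3, nstart = 20)

# Dimensionality reduction: represent the data with fewer dimensions
pca <- prcomp(features)
reduced <- pca$x[, 1:2]

# Inspect how the discovered groups relate to the (unused) labels
print(table(cluster = clusters$cluster, species = iris$Species))
print(summary(pca)$importance[, 1:2])
```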
On reinforcement learning
Reinforcement learning algorithms try to find the best ways to earn the greatest reward. Rewards can be winning a game, earning more money or beating other opponents. They present state-of-the-art results on very human tasks; for instance, this paper from the University of Toronto shows how a computer can beat humans in old-school Atari video games.
Reinforcement learning algorithms follow these circular steps:
[Figure: the reinforcement learning loop, from Wikipedia: Reinforcement Learning]
Given its own state and the environment’s state, the agent will choose the action which will maximize its reward or will explore a new possibility. These actions will change the environment’s and the agent’s states. They will also be interpreted to give a reward to the agent. By performing this loop many times, the agent will improve its behavior.
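The loop can be sketched with a deliberately simplified case, a multi-armed bandit with an epsilon-greedy agent, written in base R. This is an illustration added here, not an algorithm from the original post: the reward probabilities, the value of epsilon and the number of steps are made-up assumptions, and unlike a full reinforcement learning problem the environment has no state that the actions change.

```r
set.seed(123)

true_means <- c(0.2, 0.5, 0.8)    # hypothetical reward probability of each action
n_actions  <- length(true_means)
estimates  <- numeric(n_actions)  # agent's current value estimate per action
counts     <- integer(n_actions)  # how often each action has been tried
epsilon    <- 0.1                 # probability of exploring

for (step in 1:5000) {
  # Explore a new possibility with probability epsilon, otherwise exploit
  if (runif(1) < epsilon) {
    action <- sample.int(n_actions, 1)
  } else {
    action <- which.max(estimates)
  }
  # The environment interprets the action and returns a reward (0 or 1)
  reward <- rbinom(1, size = 1, prob = true_means[action])
  # The agent updates its estimate for that action (incremental mean)
  counts[action] <- counts[action] + 1L
  estimates[action] <- estimates[action] + (reward - estimates[action]) / counts[action]
}

print(round(estimates, 3))
print(counts)
```

Over many repetitions the estimates for frequently tried actions approach their true reward probabilities, which is the "improve its behavior" step in miniature.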
Reinforcement learning already performs well on ‘small’ dynamic systems and is definitely a field to follow in the years to come.
The post Machine Learning Explained: supervised learning, unsupervised learning, and reinforcement learning appeared first on Enhance Data Science.
RcppAPT 0.0.4
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
A new version of RcppAPT (our interface from R to the C++ library behind the awesome apt, apt-get, apt-cache, … commands and their cache powering Debian, Ubuntu and the like) arrived on CRAN yesterday.
We added a few more functions in order to compute on the package graph. A concrete example is shown in this vignette which determines the (minimal) set of remaining Debian packages requiring a rebuild under R 3.4.* to update their .C() and .Fortran() registration code. It has been used for the binNMU request #868558.
As we also added a NEWS file, its (complete) content covering all releases follows below.
Changes in version 0.0.4 (2017-07-16)
 New function getDepends
 New function reverseDepends
 Added package registration code
 Added usage examples in scripts directory
 Added vignette, also in docs as rendered copy
 Added dumpPackages, showSrc
 Added reverseDepends, dumpPackages, showSrc
 Initial version with getPackages and hasPackages
A bit more information about the package is available here as well as at the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
seplyr update
(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)
The development version of my new R package seplyr is performing in practical applications with dplyr 0.7.* much better than even I (the seplyr package author) expected.
I think I have hit a very good set of tradeoffs, and I have now spent significant time creating documentation and examples.
I wish there had been such a package weeks ago, and that I had started using this approach in my own client work at that time. If you are already a dplyr user I strongly suggest trying seplyr in your own analysis projects.
Some Ideas for your Internal R Package
(This article was first published on R Views, and kindly contributed to R-bloggers)
At RStudio, I have the pleasure of interacting with data science teams around the world. Many of these teams are led by R users stepping into the role of analytic admins. These users are responsible for supporting and growing the R user base in their organization and often lead internal R user groups.
One of the most successful strategies to support a corporate R user group is the creation of an internal R package. This article outlines some common features and functions shared in internal packages. Creating an R package is easier than you might expect. A good place to start is this webinar on package creation.
Logos and Custom CSS
Interestingly, one powerful way to increase the adoption of data science outputs – plots, reports, and even slides – is to stick to consistent branding. Having a common look and feel makes it easier for management to recognize the work of the data science team, especially as the team grows. Consistent branding also saves the R user time that would normally be spent picking fonts and color schemes.
It is easy to include logos and custom CSS inside of an R package, and to write wrapper functions that copy the assets from the package to a user’s local working directory. For example, this wrapper function adds a logo from the RStudioInternal package to the working directory:
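getLogo <- function(copy_to = getwd()){
  # Resolve the destination directory to an absolute path
  copy_to <- normalizePath(copy_to)
  # Copy the logo shipped with the internal package into that directory
  file.copy(system.file("logo.png", package = "RStudioInternal"), copy_to)
}
Once available, logos and CSS can be added to Shiny apps and R Markdown documents.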
ggplot2 Themes
Similar to logos and custom CSS, many internal R packages include a custom ggplot2 theme. These themes ensure consistency across plots in an organization, making data science outputs easier to recognize and read.
ggplot2 themes are shared as functions. To get started writing a ggplot2 theme, see resource 1. For inspiration, take a look at the ggthemes package.
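A theme function of this kind might look like the sketch below. It is only an illustration: the name theme_company, the colour and the other settings are hypothetical, and a real internal package would export the function rather than define it in a script.

```r
library(ggplot2)

# Hypothetical house theme: start from a built-in theme and override a few
# elements. The name, colour and sizes are placeholders.
theme_company <- function(base_size = 12, base_family = "") {
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      plot.title = element_text(face = "bold", colour = "#1F4E79"),
      panel.grid.minor = element_blank(),
      legend.position = "bottom"
    )
}

# Analysts then add the theme like any other ggplot2 layer
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  labs(title = "Fuel economy by weight") +
  theme_company()
```

Keeping the theme as a function with base_size and base_family arguments lets individual reports adjust sizing without abandoning the shared look.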
Data Connections
Internal R packages are also an effective way to share functions that make it easy for analysts to connect to internal data sources. Nothing is more frustrating for a first-time R user than trying to navigate the world of ODBC connections and complex database schemas before they can get started with data relevant to their day-to-day job.
If you’re not sure where to begin, look through your own scripts for common database connection strings or configurations. db.rstudio.com can provide more information on how to handle credentials, drivers, and config files.
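One possible shape for such a helper is sketched below using the DBI package. Everything specific here is an assumption made for the example: the function name connect_internal is hypothetical, and an in-memory SQLite database (via RSQLite) stands in for whatever ODBC driver and connection details an organization actually uses.

```r
library(DBI)

# Hypothetical wrapper an internal package might export. The driver and its
# arguments are parameters, so production code could pass an ODBC driver
# while this demo falls back to an in-memory SQLite database.
connect_internal <- function(drv = RSQLite::SQLite(), ...) {
  DBI::dbConnect(drv, ...)
}

con <- connect_internal(dbname = ":memory:")

# Stand-in for a real internal table
dbWriteTable(con, "cars", mtcars)
heavy <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM cars WHERE wt > 3")
print(heavy)

dbDisconnect(con)
```

Hiding the driver and connection details behind one function means analysts only need to remember a single call, and credentials can be handled in one place.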
learnr Tutorials
RStudio recently released a new package for creating interactive tutorials in R Markdown called learnr. There are many great resources online for getting started with R, but it can be useful to create tutorials specific to your internal data and domain. learnr tutorials can serve as training wheels for the other components of the internal R package or teach broader concepts and standards accepted across the organization. For example, you might provide a primer that teaches new users your organization’s R style guide.
Sharing an Internal R Package
Internal packages can be built in the RStudio IDE and distributed as tar files. Alternatively, many organizations use RStudio Server or RStudio Server Pro to standardize the R environment in their organization. In addition to making it easy to share an internal package, a standard compute environment keeps new R users from having to spend time installing R, RStudio, and packages. While these are necessary skills, the first interactions with R should get new users to a data insight as fast as possible. RStudio Server Pro also includes IT functions for monitoring and restricting resources.
Wrap Up
If you are leading an R group, an internal R package is a powerful way to support your users and the adoption of R. Imagine how easy it would be to introduce R to coworkers if they could connect to real, internal data and create a useful, beautiful plot in under 10 minutes. Investing in an internal R package makes that onboarding experience possible.
JAGS 4.3.0 is released
(This article was first published on R – JAGS News, and kindly contributed to R-bloggers)
The source tarball for JAGS 4.3.0 is now available from Sourceforge. Binary distributions will be available later. See the updated manual for details of the features in the new version. This version is fully compatible with the current version of rjags on CRAN.
Securely store API keys in R scripts with the "secret" package
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
If you use an API key to access a secure service, or need to use a password to access a protected database, you'll need to provide these "secrets" in your R code somewhere. That's easy to do if you just include those keys as strings in your code, but it's not very secure. This means your private keys and passwords are stored in plaintext on your hard drive, and if you email your script they're available to anyone who can intercept that email. It's also really easy to inadvertently include those keys in a public repo if you use GitHub or similar code-sharing services.
To address this problem, Gábor Csárdi and Andrie de Vries created the secret package for R. The secret package integrates with OpenSSH, providing R functions that allow you to create a vault for keys on your local machine, define trusted users who can access those keys, and then include encrypted keys in R scripts or packages that can only be decrypted by you or by people you trust. You can see how it works in the vignette secret: Share Sensitive Information in R Packages, and in this presentation by Andrie de Vries at useR!2017:
To use the secret package, you'll need access to your private key, which you'll also need to store securely. For that, you might also want to take a look at the in-progress keyring package, which allows you to access secrets stored in Keychain on macOS, Credential Store on Windows, and the Secret Service API on Linux.
The secret package is available now on CRAN, and you can also find the latest development version on Github.
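The workflow described above (create a vault, register a trusted user, store an encrypted secret, decrypt it with a private key) might look roughly like the sketch below. It assumes the secret package is installed and that it ships the example key pair alice.pub / alice.pem used in its vignette; the vault location, the secret name and the secret value are made up for illustration.

```r
library(secret)

# Create an empty vault in a temporary directory (illustrative location).
vault <- file.path(tempdir(), ".vault")
dir.create(vault, showWarnings = FALSE)
create_vault(vault)

# Example key pair assumed to ship with the package (as used in its vignette).
key_dir <- file.path(system.file(package = "secret"), "user_keys")
alice_public_key  <- file.path(key_dir, "alice.pub")
alice_private_key <- file.path(key_dir, "alice.pem")

# Register a trusted user by her public key.
add_user("alice", alice_public_key, vault = vault)

# Store a secret that only alice can decrypt (placeholder value).
add_secret("api_key", c(token = "not-a-real-token"), users = "alice", vault = vault)

# Decrypt it again with the matching private key.
recovered <- get_secret("api_key", key = alice_private_key, vault = vault)
```

In a real project the vault would live inside the package or project directory rather than in tempdir(), and the private key would stay outside version control.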
Multiple Factor Analysis to analyse several data tables
(This article was first published on François Husson, and kindly contributed to R-bloggers)
How can you take into account, and compare, information from different sources? Multiple Factor Analysis is a principal component method that deals with datasets containing quantitative and/or categorical variables that are structured in groups.
Here is a course with videos that present the method named Multiple Factor Analysis.
Multiple Factor Analysis (MFA) allows you to study complex data tables, where a group of individuals is characterized by variables structured as groups, possibly coming from different information sources. The method is of interest because it can analyze a data table as a whole, and also compare the information provided by the various sources.
Four videos present a course on MFA, highlighting the way to interpret the data. Then you will find videos presenting the way to implement MFA in FactoMineR.
With this course, you will be able to perform MFA on your own and interpret the results obtained.
 Course
 Introduction
 Weighting and global PCA
 Study of the groups of variables
 Complements: qualitative groups, frequency tables
 MFA with FactoMineR
 Material on the course videos: the slides, the transcription
 Tutorial in R
Animating a spinner using ggplot2 and ImageMagick
(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)
It’s Sunday, and I [Bob] am just sitting on the couch peacefully ggplotting to illustrate basic sample spaces using spinners (a trick I’m borrowing from Jim Albert’s book Curve Ball). There’s an underlying continuous outcome (i.e., where the spinner lands) and a quantization into a number of regions to produce a discrete outcome (e.g., “success” and “failure”). I’m quite pleased with myself for being able to use polar coordinates to create the spinner and arrow. ggplot works surprisingly well in polar coordinates once you figure them out; almost everything people have said about them online is confused and the doc itself assumes you’re a bit more of a ggplotter and geometer than me.
I’m so pleased with it that I show the plot to Mitzi. She replies, “Why don’t you animate it?” I don’t immediately say, “What a waste of time,” then get back to what I’m doing. Instead, I boast, “It’ll be done when you get back from your run.” Luckily for me, she goes for long runs; I just barely had the prototype working as she got home. And then I had to polish it and turn it into a blog post. So here it is, for your wonder and amazement.
Here’s the R magic.
    library(ggplot2)

    # Draw the spinner with the arrow pointing at `angle` degrees.
    draw_curve <- function(angle) {
      df <- data.frame(outcome = c("success", "failure"), prob = c(0.3, 0.7))
      plot <- ggplot(data=df, aes(x=factor(1), y=prob, fill=outcome)) +
        geom_bar(stat="identity", position="fill") +
        coord_polar(theta="y", start = 0, direction = 1) +
        scale_y_continuous(breaks = c(0.12, 0.7), labels=c("success", "failure")) +
        # x = -1 puts the tail of the arrow at the centre of the polar plot
        geom_segment(aes(y= angle/360, yend= angle/360, x = -1, xend = 1.4),
                     arrow=arrow(type="closed"), size=1) +
        theme(axis.title = element_blank(), axis.ticks = element_blank(),
              axis.text.y = element_blank()) +
        theme(panel.grid = element_blank(), panel.border = element_blank()) +
        theme(legend.position = "none") +
        geom_point(aes(x=-1, y = 0), color="#666666", size=5)
      return(plot)
    }

    # Arrow position (in degrees) for each frame: the step shrinks by one each frame.
    ds <- c()
    pos <- 0
    for (i in 1:66) {
      pos <- (pos + (67 - i)) %% 360
      ds[i] <- pos
    }
    ds <- c(rep(0, 10), ds)                  # hold the start position for 10 frames
    ds <- c(ds, rep(ds[length(ds)], 10))     # hold the end position for 10 frames

    # Save one PNG per frame, zero-padding the frame number.
    for (i in 1:length(ds)) {
      ggsave(filename = paste("frame", ifelse(i < 10, "0", ""), i, ".png", sep=""),
             plot = draw_curve(ds[i]), device="png", width=4.5, height=4.5)
    }

I probably should've combined the theme functions. Ben would've been able to define ds in a one-liner and then map ggsave. I hope it's at least clear what my code does (it just decrements the number of degrees moved each frame by one; no physics involved).
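For what it's worth, here is one guess at what that one-liner could look like. It is only a sketch in base R, not Ben's actual code: the running position is a cumulative sum of the steps 66, 65, ..., 1 wrapped at 360, and sprintf() takes care of the zero padding. The name frame_names is made up here.

```r
# Per-frame steps of 66, 65, ..., 1 degrees, accumulated and wrapped at 360.
ds <- cumsum(66:1) %% 360

# Hold the first and last positions for 10 frames each, as in the loop version.
ds <- c(rep(0, 10), ds, rep(ds[length(ds)], 10))

# Zero-padded file names keep the frames in alphabetical order.
frame_names <- sprintf("frame%02d.png", seq_along(ds))
```

Taking the remainder once at the end gives the same positions as taking it at every step, so this matches the loop. Mapping ggsave over the frames could then be done with Map() over frame_names and ds.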
After producing the frames in alphabetical order (all that ifelse and paste mumbo-jumbo), I went to the output directory and ran the results through ImageMagick (which I'd previously installed on my now ancient MacBook Pro) from the terminal, using

    > convert *.png -delay 3 -loop 0 spin.gif

That took a minute or two. Each of the pngs is about 100KB, but the final output is only 2.5MB or so. Maybe I should've gone with less delay (I don't even know what the units are!) and fewer rotations and maybe a slower final slowing down (maybe study the physics). How do the folks at Pixar ever get anything done?
P.S. I can no longer get the animation package to work in R, though it used to work in the past. It just wraps up those calls to ImageMagick.
P.P.S. That salmon and teal color scheme is the default!
The post Animating a spinner using ggplot2 and ImageMagick appeared first on Statistical Modeling, Causal Inference, and Social Science.
Plants
(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)
Blue dragonflies dart to and fro
I tie my life to your balloon and let it go
(Warm Foothills, Alt-J)
In my last post I did some drawings based on L-Systems. These drawings are done sequentially. At any step, the state of the drawing can be described by the position (coordinates) and the orientation of the pencil. In that case I only used two kinds of operators: drawing a straight line and turning by a constant angle. Today I used two more symbols to do stack operations:
 “[” Push the current state (position and orientation) of the pencil onto a pushdown operations stack
 “]” Pop a state from the stack and make it the current state of the pencil (no line is drawn)
These operators allow you to return to a previous state and continue drawing from there. Using them you can draw plants like these:
Each image corresponds to a different set of axiom, rules, angle and depth parameters. I described these terms in my previous post. If you want to reproduce the images you can find the code below. Change colors, add noise to angles, try your own plants … I am sure you will find nice images:
    library(gsubfn)
    library(stringr)
    library(dplyr)
    library(ggplot2)

    # Run only one of the six plant definitions below:
    # as written, each one overwrites the previous one.

    #Plant 1
    axiom="F"
    rules=list("F"="FF-[-F+F+F]+[+F-F-F]")
    angle=22.5
    depth=4

    #Plant 2
    axiom="X"
    rules=list("X"="F[+X][-X]FX", "F"="FF")
    angle=25.7
    depth=7

    #Plant 3
    axiom="X"
    rules=list("X"="F[+X]F[-X]+X", "F"="FF")
    angle=20
    depth=7

    #Plant 4
    axiom="X"
    rules=list("X"="F-[[X]+X]+F[+FX]-X", "F"="FF")
    angle=22.5
    depth=5

    #Plant 5
    axiom="F"
    rules=list("F"="F[+F]F[-F]F")
    angle=25.7
    depth=5

    #Plant 6
    axiom="F"
    rules=list("F"="F[+F]F[-F][F]")
    angle=20
    depth=5

    # Expand the axiom by applying the rules depth times
    for (i in 1:depth) axiom=gsubfn(".", rules, axiom)

    # Split the expanded string into single drawing actions
    actions=str_extract_all(axiom, "\\d*\\+|\\d*\\-|F|L|R|\\[|\\]|\\|") %>% unlist

    # status is the stack of saved states; the first row of points is the current pencil state
    status=data.frame(x=numeric(0), y=numeric(0), alfa=numeric(0))
    points=data.frame(x1 = 0, y1 = 0, x2 = NA, y2 = NA, alfa=90, depth=1)

    for (action in actions)
    {
      # Draw a unit segment in the current direction
      if (action=="F")
      {
        x=points[1, "x1"]+cos(points[1, "alfa"]*(pi/180))
        y=points[1, "y1"]+sin(points[1, "alfa"]*(pi/180))
        points[1,"x2"]=x
        points[1,"y2"]=y
        data.frame(x1 = x, y1 = y, x2 = NA, y2 = NA,
                   alfa=points[1, "alfa"],
                   depth=points[1,"depth"]) %>% rbind(points)->points
      }
      # Turn by +angle or -angle
      if (action %in% c("+", "-")){
        alfa=points[1, "alfa"]
        points[1, "alfa"]=eval(parse(text=paste0("alfa",action, angle)))
      }
      # Push the current state onto the stack
      if(action=="["){
        data.frame(x=points[1, "x1"], y=points[1, "y1"], alfa=points[1, "alfa"]) %>%
          rbind(status) -> status
        points[1, "depth"]=points[1, "depth"]+1
      }
      # Pop the last saved state and continue from there
      if(action=="]"){
        depth=points[1, "depth"]
        points[-1,]->points
        data.frame(x1=status[1, "x"], y1=status[1, "y"], x2=NA, y2=NA,
                   alfa=status[1, "alfa"],
                   depth=depth-1) %>%
          rbind(points) -> points
        status[-1,]->status
      }
    }

    ggplot() +
      geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2),
                   lineend = "round",
                   colour="white",
                   data=na.omit(points)) +
      coord_fixed(ratio = 1) +
      theme(legend.position="none",
            panel.background = element_rect(fill="black"),
            panel.grid=element_blank(),
            axis.ticks=element_blank(),
            axis.title=element_blank(),
            axis.text=element_blank())
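To see the two ideas in isolation (string rewriting, then a pencil with a stack), here is a small base R sketch. It is not the code used for the images: the function names expand_lsystem and trace_lsystem are made up for this illustration, the rules are given as a named character vector rather than a list, and the result is a plain matrix of segments instead of a ggplot.

```r
# Rewrite every symbol that has a rule; other symbols ("+", "-", "[", "]") stay as they are.
expand_lsystem <- function(axiom, rules, depth) {
  for (i in seq_len(depth)) {
    symbols <- strsplit(axiom, "")[[1]]
    has_rule <- symbols %in% names(rules)
    symbols[has_rule] <- rules[symbols[has_rule]]
    axiom <- paste(symbols, collapse = "")
  }
  axiom
}

# Follow the pencil: "[" pushes the current state, "]" pops it back.
trace_lsystem <- function(program, angle) {
  state <- c(x = 0, y = 0, heading = 90)   # start at the origin, pointing up
  stack <- list()
  segs <- list()
  for (symbol in strsplit(program, "")[[1]]) {
    if (symbol == "F") {
      rad <- state[["heading"]] * pi / 180
      new_x <- state[["x"]] + cos(rad)
      new_y <- state[["y"]] + sin(rad)
      segs[[length(segs) + 1]] <- c(x1 = state[["x"]], y1 = state[["y"]],
                                    x2 = new_x, y2 = new_y)
      state[["x"]] <- new_x
      state[["y"]] <- new_y
    } else if (symbol == "+") {
      state[["heading"]] <- state[["heading"]] + angle
    } else if (symbol == "-") {
      state[["heading"]] <- state[["heading"]] - angle
    } else if (symbol == "[") {
      stack[[length(stack) + 1]] <- state     # push
    } else if (symbol == "]") {
      state <- stack[[length(stack)]]         # pop: jump back, draw nothing
      stack[[length(stack)]] <- NULL
    }
  }
  do.call(rbind, segs)   # one row per drawn segment
}

rules <- c(F = "F[+F]F[-F]F")                      # the rule of Plant 5
program <- expand_lsystem("F", rules, depth = 2)
segs <- trace_lsystem(program, angle = 25.7)
```

Each row of segs is one straight line, so the matrix can be handed to base graphics or converted to a data frame for geom_segment.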
Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part 3)
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
Statistics is often taught in school by and for people who like mathematics. As a consequence, in those classes the emphasis is put on learning equations, solving calculus problems and creating mathematical models instead of building an intuition for probabilistic problems. But if you are reading this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistical problems by writing simple R code, and how to think in probabilistic terms.
In the first two parts of this series, we saw how to identify the distribution of a random variable by plotting the distribution of a sample and by estimating statistics. We also saw that it can be tricky to identify a distribution from a small sample of data. Today, we’ll see how to estimate the confidence interval of a statistic in this situation by using a powerful method called bootstrapping.
Answers to the exercises are available here.
Exercise 1
Load this dataset and draw the histogram, the ECDF of this sample and the ECDF of a density that is a good fit for the data.
Exercise 2
Write a function that takes a dataset and a number of iterations as parameters. For each iteration, this function must create a sample with replacement of the same size as the dataset, calculate the mean of the sample and store it in a matrix, which the function must return.
Exercise 3
Use t.test() to compute the 95% confidence interval estimate for the mean of your dataset.
Exercise 4
Use the function you just wrote to estimate the mean of your sample 10,000 times. Then draw the histogram of the results and mark the sample mean of the data on it.
The probability distribution of the estimate of a mean is approximately a normal distribution centered around the real value of the mean. In other words, if we take a lot of samples from a population and compute the mean of each sample, the histogram of those means will look like that of a normal distribution centered around the real value of the mean we are trying to estimate. We have recreated this process artificially by creating a bunch of new samples from the dataset, resampling it with replacement. Now we can make a point estimate of the mean by computing the average of the sample of means, or compute the confidence interval by finding the correct percentiles of this distribution. This process is basically what is called bootstrapping.
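The process described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the sample x is simulated here with made-up parameters rather than loaded from the exercise dataset, and bootstrap_means is a hypothetical name.

```r
set.seed(42)                              # reproducible resampling
x <- rnorm(50, mean = 0.5, sd = 1.3)      # stand-in sample (made-up parameters)

# Resample x with replacement n_iter times; return the means as a one-column matrix.
bootstrap_means <- function(x, n_iter) {
  means <- matrix(NA_real_, nrow = n_iter, ncol = 1)
  for (i in seq_len(n_iter)) {
    resample <- sample(x, size = length(x), replace = TRUE)
    means[i, 1] <- mean(resample)
  }
  means
}

boot_means <- bootstrap_means(x, 10000)

point_estimate <- mean(boot_means)                       # average of the resampled means
ci_95 <- quantile(boot_means, probs = c(0.025, 0.975))   # percentile confidence interval
```

A call such as hist(boot_means) then shows the roughly bell-shaped distribution of the resampled means.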
Exercise 5
Calculate the values of the 2.5 and 97.5 percentiles of your sample of 10,000 estimates of the mean, and the mean of this sample. Compare this last value to the sample mean of your data.
Exercise 6
Bootstrapping can be used to compute the confidence interval of any statistic of interest, but you don’t have to write a function for each of them! You can use the boot() function from the package of the same name and pass the statistic as an argument to compute the bootstrapped sample. Use this function with 10,000 replicates to compute the median of the dataset.
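As a hint on how boot() expects its arguments, here is a minimal sketch on simulated data (not the exercise dataset, and with fewer replicates to keep it quick). The statistic must be a function of the data and a vector of resampled indices; median_stat is a name chosen for this example.

```r
library(boot)

set.seed(42)
x <- rexp(100, rate = 1)   # stand-in sample (made-up)

# boot() calls the statistic with the data and a vector of resampled indices.
median_stat <- function(data, indices) median(data[indices])

boot_median <- boot(data = x, statistic = median_stat, R = 2000)

# Percentile interval computed from the replicates stored in boot_median$t.
ci <- boot.ci(boot_median, conf = 0.95, type = "perc")
ci$percent[4:5]   # lower and upper bounds of the interval
```

The replicates themselves are in boot_median$t and the statistic on the original data is in boot_median$t0, which is what you need for the histogram in the next exercise.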
Exercise 7
Look at the structure of your result and plot its histogram. On the same plot, draw the value of the sample median of your dataset and plot the 95% confidence interval of this statistic by adding two vertical green lines at the lower and upper bounds of the interval.
Exercise 8
Write functions to compute by bootstrapping the following statistics:
 Variance
 kurtosis
 Max
 Min
Exercise 9
Use the functions from last exercise and the boot function with 10,000 replicates to compute the following statistics:
 Variance
 kurtosis
 Max
 Min
Then draw the histogram of the bootstrapped sample and plot the 95% confidence interval of the statistics.
Exercise 10
Generate 1000 points from a normal distribution with mean and standard deviation equal to those of the dataset. Use the bootstrap method to estimate the 95% confidence interval of the mean, the variance, the kurtosis, the min and the max of this density. Then plot the histograms of the bootstrap samples for each of the variables and draw the 95% confidence interval as two red vertical lines.
Two bootstrap estimates of the same statistic from two samples that follow the same density should be pretty similar. When we compare those last plots with the confidence intervals we drew before, we see that they are. More importantly, the confidence intervals computed in exercise 10 overlap the confidence intervals of the statistics of the first dataset. As a consequence, we can’t conclude that the two samples come from different distributions, and in practice we could use a normal distribution with a mean of 0.4725156 and a standard deviation of 1.306665 to simulate this random variable.
Related exercise sets:
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part 2)
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part 1)
 Data Science for Doctors – Part 3 : Distributions
Multiple Correspondence Analysis with FactoMineR
(This article was first published on François Husson, and kindly contributed to R-bloggers)
Here is a course with videos that present Multiple Correspondence Analysis in a French way. The most well-known use of Multiple Correspondence Analysis is the analysis of surveys.
Four videos present a course on MCA, highlighting the way to interpret the data. Then you will find videos presenting the way to implement MCA in FactoMineR, to deal with missing values in MCA thanks to the package missMDA, and lastly a video on drawing interactive graphs with Factoshiny. Finally, you will see that the new package FactoInvestigate allows you to automatically obtain an interpretation of your MCA results.
With this course, you will be able to perform MCA on your own and interpret the results obtained.
 Course
 Data – issues
 Visualizing the point cloud of individuals
 Visualizing the cloud of categories
 Interpretation aids
 Material on the course videos: the slides, the PCA_transcription
 Tutorial in R
 Automatic interpretation
 The package FactoInvestigate allows you to obtain a first automatic description of your MCA results.
For more information, you can see the book below. Here are some reviews of the book and a link to order it.
constants 0.0.1
(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)
The new constants package is available on CRAN. This small package provides the CODATA 2014 internationally recommended values of the fundamental physical constants (universal, electromagnetic, physicochemical, atomic…), provided as symbols for direct use within the R language. Optionally, the values with errors and/or the values with units are also provided if the errors and/or the units packages are installed as well.
But, what is CODATA? The Committee on Data for Science and Technology (CODATA) is an interdisciplinary committee of the International Council for Science. The Task Group on Fundamental Constants periodically provides the internationally accepted set of values of the fundamental physical constants. The version currently in force is the “2014 CODATA”, published on 25 June 2015.
This package wraps the codata dataset, defines unique symbols for each one of the 237 constants, and provides them enclosed in three sets of symbols: syms, syms_with_errors and syms_with_units.
    library(constants)

    # the speed of light
    with(syms, c0)
    ## [1] 299792458

    # explore which constants are available
    lookup("planck constant", ignore.case=TRUE)
    ##                  quantity  symbol            value      unit
    ## 7         Planck constant       h  6.626070040e-34       J s
    ## 8         Planck constant    h_eV  4.135667662e-15      eV s
    ## 9         Planck constant    hbar         h/(2*pi)       J s
    ## 10        Planck constant hbar_eV      h_eV/(2*pi)      eV s
    ## 11        Planck constant hbar.c0      197.3269788    MeV fm
    ## 212 molar Planck constant    Na.h 3.9903127110e-10 J s mol-1
    ## 213 molar Planck constant Na.h.c0   0.119626565582 J m mol-1
    ##     rel_uncertainty            type
    ## 7           1.2e-08       universal
    ## 8           6.1e-09       universal
    ## 9           1.2e-08       universal
    ## 10          6.1e-09       universal
    ## 11          6.1e-09       universal
    ## 212         4.5e-10 physicochemical
    ## 213         4.5e-10 physicochemical

    # symbols can also be attached to the search path
    attach(syms)

    # the Planck constant
    hbar
    ## [1] 1.054572e-34

If the errors/units package is installed in your system, constants with errors/units are available:

    attach(syms_with_errors)

    # the Planck constant with error
    hbar
    ## 1.05457180(1)e-34

    attach(syms_with_units)

    # the Planck constant with units
    hbar
    ## 1.054572e-34 J*s

The dataset is available for lazy loading:

    data(codata)
    head(codata)
    ##                             quantity    symbol        value        unit
    ## 1           speed of light in vacuum        c0    299792458       m s-1
    ## 2                  magnetic constant       mu0    4*pi*1e-7       N A-2
    ## 3                  electric constant  epsilon0 1/(mu0*c0^2)       F m-1
    ## 4 characteristic impedance of vacuum        Z0       mu0*c0           Ω
    ## 5  Newtonian constant of gravitation         G  6.67408e-11 m3 kg-1 s-2
    ## 6  Newtonian constant of gravitation G_hbar.c0  6.70861e-39    GeV-2 c4
    ##   rel_uncertainty      type
    ## 1         0.0e+00 universal
    ## 2         0.0e+00 universal
    ## 3         0.0e+00 universal
    ## 4         0.0e+00 universal
    ## 5         4.7e-05 universal
    ## 6         4.7e-05 universal

    dplyr::count(codata, type, sort=TRUE)
    ## # A tibble: 15 x 2
    ##    type                           n
    ##    <chr>                      <int>
    ##  1 atomic-nuclear-electron       31
    ##  2 atomic-nuclear-proton         26
    ##  3 atomic-nuclear-neutron        24
    ##  4 physicochemical               24
    ##  5 atomic-nuclear-helion         18
    ##  6 atomic-nuclear-muon           17
    ##  7 electromagnetic               17
    ##  8 universal                     16
    ##  9 atomic-nuclear-deuteron       15
    ## 10 atomic-nuclear-general        11
    ## 11 atomic-nuclear-tau            11
    ## 12 atomic-nuclear-triton         11
    ## 13 adopted                        7
    ## 14 atomic-nuclear-alpha           7
    ## 15 atomic-nuclear-electroweak     2

Article originally published in Enchufa2.es: constants 0.0.1.
The Value of #Welcome
(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)
I’m participating in the AAAS Community Engagement Fellows Program (CEFP), funded by the Alfred P. Sloan Foundation. The inaugural cohort of Fellows is made up of 17 community managers working in a wide range of scientific communities. This is cross-posted from the Trellis blog as part of a series of reflections that the CEFP Fellows are sharing.
In my training as a AAAS Community Engagement Fellow, I hear repeatedly about the value of extending a personal welcome to your community members. This seems intuitive, but recently I put this to the test. Let me tell you about my experience creating and maintaining a #welcome channel in a community Slack group.
"Welcome" by Nathan under CC BY-SA 2.0
I listen in on and occasionally participate in a Slack group for R-Ladies community organizers (R-Ladies is a global organization with local meetup chapters around the world, for women who do/want to do programming in R). Their Slack is incredibly well-organized and has a #welcome channel where new joiners are invited to introduce themselves in a couple of sentences. The leaders regularly jump in to add a wave emoji and ask people to introduce themselves if they have not already.
At rOpenSci, where I am the Community Manager, when people joined our 150+ person Slack group, they used to land in the #general channel where people ask and answer questions. Often, new people joining went unnoticed among the conversations. So recently I copied R-Ladies and created a #welcome channel in our Slack and made sure any new people got dropped in there, as well as in the #general channel. The channel purpose is set as "A place to welcome new people and for new people to introduce themselves. We welcome participation and civil conversations that adhere to our code of conduct."
I pinged three new rOpenSci community members to join and introduce themselves, and in the #general channel said “Hey, come say hi to people over here at #welcome!”. One day later, we already had 33 people in #welcome. I spent that morning nurturing it, noting some people’s activities and contributions that they might not otherwise highlight themselves, e.g. Bea just did her first open peer software review for rOpenSci, Matt has been answering people’s questions about metadata, and Julie, Ben and Jamie are co-authors on this cool new paper about open data science tools. And I gave a shout-out to R-Ladies, stating clearly that I copied their fantastic #welcome channel.
People are introducing themselves, tagging with emoji and thanking “the community” saying things like:
“…I feel super lucky to be a part of the rOpenSci community, which has … had a great positive impact on my life!”
“I love rOpenSci for building this community and helping people like me become confident data scientists!”
“[I’m] hoping to contribute more to the group moving forward”
“…thank you for having me part of this community!”
Such is the value of #welcome.
Automatically Fitting the Support Vector Machine Cost Parameter
(This article was first published on R – Displayr, and kindly contributed to R-bloggers)
In an earlier post I discussed how to avoid overfitting when using Support Vector Machines. This was achieved using cross validation. In cross validation, prediction accuracy is maximized by varying the cost parameter. Importantly, prediction accuracy is calculated on a different subset of the data from that used for training.
In this blog post I take that concept a step further, by automating the manual search for the optimal cost.
The data set I’ll be using describes different types of glass based upon physical attributes and chemical composition. You can read more about the data here, but for the purposes of my analysis all you need to know is that the outcome variable is categorical (7 types of glass) and the 4 predictor variables are numeric.
Creating the base support vector machine model
I start, as in my earlier analysis, by splitting the data into a larger 70% training sample and a smaller 30% testing sample. Then I train a support vector machine on the training sample with the following code:

    library(flipMultivariates)
    svm = SupportVectorMachine(Type ~ RefractiveIndex + Ca + Ba + Fe, subset = training, cost = 1)

This produces output as shown below. There are 2 reasons why we can largely disregard the 64.67% accuracy:
 We used the training data (and not the independent testing data) to calculate accuracy.
 We have used a default value for the cost of 1 and not attempted to optimize.
I am going to amend the code above in order to loop over a range of values of cost. For each value, I will calculate the accuracy on the test sample. The updated code is as follows:
    library(flipMultivariates)
    library(flipRegression)
    costs = c(0.1, 1, 10, 100, 1000, 10000)
    i = 1
    accuracies = rep(0, length(costs))

    for (cost in costs)
    {
        svm = SupportVectorMachine(Type ~ RefractiveIndex + Ca + Ba + Fe, subset = training, cost = cost)
        accuracies[i] = attr(ConfusionMatrix(svm, subset = (testing == 1)), "accuracy")
        i = i + 1
    }
    plot(costs, accuracies, type = "l", log = "x")

The first 5 lines set things up. I load the libraries required to run the Support Vector Machine and calculate the accuracy. Next I choose a range of costs, initialize a loop counter i and an empty vector accuracies, where I store the results.
Then I add a loop around the code that created the base model to iterate over costs. The next line calculates and stores the accuracy on the testing sample. Finally I plot the results, which tells me that the greatest accuracy appears around a cost of 100. This allows us to go back and update costs to a more granular range around this value.
Rerunning the code using the new costs (10, 20, 50, 75, 100, 150, 200, 300, 500, 1000) I get the final chart shown below. This indicates that a cost of 50 gives the best performance.
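Reading the best cost off the chart can itself be automated. The sketch below shows the same search written directly against the e1071 package, on which flipMultivariates is built. It is an illustration under stated assumptions, not the Displayr code: it uses the built-in iris data in place of the glass data, a random 70/30 split, and picks the winner with which.max().

```r
library(e1071)

set.seed(1)
# Stand-in data: iris instead of the glass data, with a random 70/30 split.
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

costs <- c(0.1, 1, 10, 100, 1000, 10000)

# Fit one model per cost and score it on the held-out rows.
accuracies <- sapply(costs, function(cost) {
  fit <- svm(Species ~ ., data = train, cost = cost)
  mean(predict(fit, newdata = test) == test$Species)
})

# Cost with the highest test accuracy (the first one in case of ties).
best_cost <- costs[which.max(accuracies)]
```

The same call can be repeated with a finer grid of costs around best_cost, mirroring the manual refinement described above.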
TRY IT OUT
The analysis in this post used R in Displayr. The flipMultivariates package (available on GitHub), which uses the e1071 package, performed the calculations. You can try automatically fitting the Support Vector Machine Cost Parameter yourself using the data in this example.