R news and tutorials contributed by (750) R bloggers

### New R Course: Writing Efficient R Code

Wed, 07/19/2017 - 16:38

Hello R users, we’ve got a brand new course today: Writing Efficient R Code by Colin Gillespie.

The beauty of R is that it is built for performing data analysis. The downside is that sometimes R can be slow, thereby obstructing our analysis. For this reason, it is essential to become familiar with the main techniques for speeding up your analysis, so you can reduce computational time and get insights as quickly as possible.

Take me to chapter 1!

Writing Efficient R Code features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a master in writing efficient, quick, R code!

What you’ll learn:

Chapter 1: The Art of Benchmarking

In order to make your code go faster, you need to know how long it takes to run.
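As a zero-dependency illustration, base R's `system.time()` already shows the kind of gap benchmarking reveals; the course likely uses dedicated tools such as microbenchmark, but the idea is the same:

```r
# Compare a hand-written summation loop against the vectorized built-in.
slow_sum <- function(x) {
  total <- 0
  for (v in x) total <- total + v   # element-by-element loop in R
  total
}
x <- runif(1e6)
t_loop <- system.time(slow_sum(x))["elapsed"]
t_vec  <- system.time(sum(x))["elapsed"]
# The vectorized sum() is typically far faster than the explicit loop.
c(loop = t_loop, vectorized = t_vec)
```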

Chapter 2: Fine Tuning – Efficient Base R

R is flexible because you can often solve a single problem in many different ways. Some approaches can be several orders of magnitude faster than others.
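A classic example of "many ways to solve the same problem" is building a vector: growing it by repeated concatenation forces R to copy the vector on every iteration, while pre-allocating or vectorizing avoids that cost entirely (a sketch; the course's own examples may differ):

```r
# Three ways to compute the first n squares, from slowest to fastest.
n <- 1e4
grow <- function(n) {
  out <- c()
  for (i in 1:n) out <- c(out, i^2)   # reallocates on every iteration
  out
}
prealloc <- function(n) {
  out <- numeric(n)
  for (i in 1:n) out[i] <- i^2        # writes into pre-allocated storage
  out
}
vectorized <- function(n) (1:n)^2     # no explicit loop at all
```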

Chapter 3: Diagnosing Problems – Code Profiling

Profiling helps you locate the bottlenecks in your code.
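A minimal sketch of base R's sampling profiler; the exact rows and timings in `summaryRprof()`'s table will vary by machine:

```r
# Rprof() writes sampling-profiler output to a file;
# summaryRprof() then reports which functions consumed the most time.
tmp <- tempfile()
Rprof(tmp)
x <- matrix(rnorm(2e6), ncol = 200)
s <- apply(x, 2, sd)        # deliberately non-vectorized work to profile
Rprof(NULL)
summaryRprof(tmp)$by.total  # table of where the time went
```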

Chapter 4: Turbo Charged Code – Parallel Programming

Some problems can be solved faster using multiple cores on your machine. This chapter shows you how to write R code that runs in parallel.
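A minimal sketch with the base parallel package; a socket cluster via `makeCluster()` works on all platforms, while `mclapply()` is Unix-only:

```r
library(parallel)

# Run a function over inputs on two worker processes.
cl <- makeCluster(2)
res <- parLapply(cl, 1:4, function(i) i^2)
stopCluster(cl)
unlist(res)
```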


### Short course on Bayesian data analysis and Stan 23-25 Aug in NYC!

Wed, 07/19/2017 - 16:00

(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

Jonah “ShinyStan” Gabry, Mike “Riemannian NUTS” Betancourt, and I will be giving a three-day short course next month in New York, following the model of our successful courses in 2015 and 2016.

Before class everyone should install R, RStudio and RStan on their computers. (If you already have these, please update to the latest version of R and the latest version of Stan.) If problems occur please join the stan-users group and post any questions. It’s important that all participants get Stan running and bring their laptops to the course.

Class structure and example topics for the three days:

Day 1: Foundations
Foundations of Bayesian inference
Foundations of Bayesian computation with Markov chain Monte Carlo
Intro to Stan with hands-on exercises
Real-life Stan
Bayesian workflow

Day 2: Linear and Generalized Linear Models
Foundations of Bayesian regression
Fitting GLMs in Stan (logistic regression, Poisson regression)
Diagnosing model misfit using graphical posterior predictive checks
Little data: How traditional statistical ideas remain relevant in a big data world
Generalizing from sample to population (surveys, Xbox example, etc)

Day 3: Hierarchical Models
Foundations of Bayesian hierarchical/multilevel models
Accurately fitting hierarchical models in Stan
Why we don’t (usually) have to worry about multiple comparisons
Hierarchical modeling and prior information

Specific topics on Bayesian inference and computation include, but are not limited to:
Bayesian inference and prediction
Naive Bayes, supervised, and unsupervised classification
Overview of Monte Carlo methods
Convergence and effective sample size
Hamiltonian Monte Carlo and the no-U-turn sampler
Continuous and discrete-data regression models
Mixture models
Measurement-error and item-response models

Specific topics on Stan include, but are not limited to:
Reproducible research
Probabilistic programming
Stan syntax and programming
Optimization
Identifiability and problematic posteriors
Handling missing data
Ragged and sparse data structures
Gaussian processes

The course is organized by Lander Analytics.

The course is not cheap. Stan is open-source, and we organize these courses to raise money to support the programming required to keep Stan up to date. We hope and believe that the course is more than worth the money you pay for it, but we hope you’ll also feel good, knowing that this money is being used directly to support Stan R&D.


To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Machine Learning Explained: supervised learning, unsupervised learning, and reinforcement learning

Wed, 07/19/2017 - 15:51

(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

Machine learning is often split between three main types of learning: supervised learning, unsupervised learning, and reinforcement learning. Knowing the differences between these three types of learning is necessary for any data scientist.

The big picture

The type of learning is defined by the problem you want to solve and is intrinsic to the goal of your analysis:

• You have a target, a value or a class to predict. For instance, say you want to predict the revenue of a store from different inputs (day of the week, advertising, promotion). Your model will be trained on historical data and use it to forecast future revenues. Hence the model is supervised: it knows what to learn.
• You have unlabelled data and look for patterns or groups in the data. For example, you want to cluster your clients according to the type of products they order, how often they purchase, or when they last visited. Instead of doing this manually, unsupervised machine learning will automatically discriminate between different clients.
• You want to attain an objective. For example, you want to find the best strategy to win a game with specified rules. Once these rules are specified, reinforcement learning techniques will play the game many times to find the best strategy.
On supervised learning

Supervised learning groups together different techniques which all share the same principles:

• The training dataset contains input data (your predictors) and the value you want to predict (which can be numeric or not).
• The model uses the training data to learn a link between the inputs and the outputs. The underlying idea is that the training data generalize, so the model can be used on new data with some accuracy.

Some supervised learning algorithms:

• Linear and logistic regression
• Support vector machine
• Naive Bayes
• Neural network
• Classification trees and random forest

Supervised learning is often used for expert systems in image recognition, speech recognition, forecasting, and in some specific business domains (targeting, financial analysis, etc.).
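As an illustrative sketch of the supervised setup described above, here is the store-revenue example on simulated data, using plain linear regression:

```r
# Train on historical (input, output) pairs, then predict for new inputs.
set.seed(1)
train <- data.frame(advertising = runif(100, 0, 10))
train$revenue <- 50 + 8 * train$advertising + rnorm(100)  # simulated ground truth

model <- lm(revenue ~ advertising, data = train)          # supervised: targets given
predict(model, newdata = data.frame(advertising = 5))     # forecast a new day
```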

On unsupervised learning

Cluster Analysis from Wikipedia

On the other hand, unsupervised learning does not use output data (at least none different from the input). Unsupervised algorithms can be split into different categories:

• Clustering algorithms, such as K-means, hierarchical clustering or mixture models. These algorithms try to discriminate between and separate the observations into different groups.
• Dimensionality reduction algorithms (which are mostly unsupervised), such as PCA, ICA or autoencoders. These algorithms find the best representation of the data with fewer dimensions.
• Anomaly detection, to find outliers in the data, i.e. observations which do not follow the patterns of the data set.

Most of the time, unsupervised learning algorithms are used to pre-process data, during exploratory analysis, or to pre-train supervised learning algorithms.
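A minimal clustering sketch with base R's `kmeans()` on simulated, unlabelled 2-D points:

```r
# Two well-separated blobs; k-means recovers the grouping with no labels given.
set.seed(42)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
fit <- kmeans(pts, centers = 2)
table(fit$cluster)   # sizes of the two discovered groups
```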

On reinforcement learning

Reinforcement learning algorithms try to find the best ways to earn the greatest reward. Rewards can be winning a game, earning more money or beating other opponents. They present state-of-the-art results on very human tasks; for instance, this paper from the University of Toronto shows how a computer can beat humans at old-school Atari video games.

Reinforcement learning algorithms follow a circular loop of steps:

From Wikipedia: Reinforcement Learning

Given its own state and the environment's state, the agent chooses the action that maximizes its reward, or explores a new possibility. These actions change the environment's state and the agent's state, and are also interpreted to give a reward to the agent. By performing this loop many times, the agent improves its behavior.
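The loop above can be sketched with a toy two-armed bandit and an epsilon-greedy agent; this is a deliberately minimal illustration, not a full reinforcement learning algorithm:

```r
# The agent repeatedly picks one of two actions, gets a reward from the
# environment, and updates its estimates -- mostly exploiting, sometimes exploring.
set.seed(7)
true_mean <- c(0.2, 0.8)   # the environment's payout rates (hidden from the agent)
est <- c(0, 0)             # the agent's estimated value of each action
n   <- c(0, 0)             # how often each action was taken
for (step in 1:1000) {
  a <- if (runif(1) < 0.1) sample(2, 1) else which.max(est)  # explore vs exploit
  r <- rbinom(1, 1, true_mean[a])                            # reward from environment
  n[a]   <- n[a] + 1
  est[a] <- est[a] + (r - est[a]) / n[a]                     # update agent's state
}
est   # the estimate for arm 2 should approach its true rate of 0.8
```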

Reinforcement learning already performs well on ‘small’ dynamic systems and is definitely a field to follow in the years to come.



### RcppAPT 0.0.4

Wed, 07/19/2017 - 14:12

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version of RcppAPT — our interface from R to the C++ library behind the awesome apt, apt-get, apt-cache, … commands and their cache powering Debian, Ubuntu and the like — arrived on CRAN yesterday.

We added a few more functions in order to compute on the package graph. A concrete example is shown in this vignette which determines the (minimal) set of remaining Debian packages requiring a rebuild under R 3.4.* to update their .C() and .Fortran() registration code. It has been used for the binNMU request #868558.

As we also added a NEWS file, its (complete) content covering all releases follows below.

Changes in version 0.0.4 (2017-07-16)
• New function getDepends

• New function reverseDepends

• Added usage examples in scripts directory

• Added vignette, also in docs as rendered copy

Changes in version 0.0.3 (2016-12-07)
Changes in version 0.0.2 (2016-04-04)
Changes in version 0.0.1 (2015-02-20)
• Initial version with getPackages and hasPackages

A bit more information about the package is available here, as well as at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



### seplyr update

Wed, 07/19/2017 - 09:06

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

The development version of my new R package seplyr is performing in practical applications with dplyr 0.7.* much better than even I (the seplyr package author) expected.

I think I have hit a very good set of trade-offs, and I have now spent significant time creating documentation and examples.

I wish there had been such a package weeks ago, and that I had started using this approach in my own client work at that time. If you are already a dplyr user I strongly suggest trying seplyr in your own analysis projects.



### Some Ideas for your Internal R Package

Wed, 07/19/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

At RStudio, I have the pleasure of interacting with data science teams around the world. Many of these teams are led by R users stepping into the role of analytic admins. These users are responsible for supporting and growing the R user base in their organization and often lead internal R user groups.

One of the most successful strategies to support a corporate R user group is the creation of an internal R package. This article outlines some common features and functions shared in internal packages. Creating an R package is easier than you might expect. A good place to start is this webinar on package creation.

Logos and Custom CSS

Interestingly, one powerful way to increase the adoption of data science outputs – plots, reports, and even slides – is to stick to consistent branding. Having a common look and feel makes it easier for management to recognize the work of the data science team, especially as the team grows. Consistent branding also saves the R user time that would normally be spent picking fonts and color schemes.

It is easy to include logos and custom CSS inside of an R package, and to write wrapper functions that copy the assets from the package to a user’s local working directory. For example, this wrapper function adds a logo from the RStudioInternal package to the working directory:

getLogo <- function(copy_to = getwd()) {
  copy_to <- normalizePath(copy_to)
  file.copy(system.file("logo.png", package = "RStudioInternal"), copy_to)
}

Once available, logos and CSS can be added to Shiny apps and R Markdown documents.

ggplot2 Themes

Similar to logos and custom CSS, many internal R packages include a custom ggplot2 theme. These themes ensure consistency across plots in an organization, making data science outputs easier to recognize and read.

ggplot2 themes are shared as functions. To get started writing a ggplot2 theme, see resource 1. For inspiration, take a look at the ggthemes package.
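A minimal sketch of such a theme function; the name `theme_corp` and the specific overrides are hypothetical, but the pattern of starting from a complete theme and adjusting a few elements is the usual one:

```r
library(ggplot2)

# A hypothetical corporate theme: build on theme_minimal() and override
# a handful of elements, then export this function from the internal package.
theme_corp <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(plot.title      = element_text(face = "bold"),
          legend.position = "bottom",
          panel.grid.minor = element_blank())
}

# Usage: ggplot(mtcars, aes(wt, mpg)) + geom_point() + theme_corp()
```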

Data Connections

Internal R packages are also an effective way to share functions that make it easy for analysts to connect to internal data sources. Nothing is more frustrating for a first time R user than trying to navigate the world of ODBC connections and complex database schemas before they can get started with data relevant to their day-to-day job.

If you’re not sure where to begin, look through your own scripts for common database connection strings or configurations. db.rstudio.com can provide more information on how to handle credentials, drivers, and config files.
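A hedged sketch of such a connection helper; the DSN name, the environment variables, and the use of the DBI/odbc packages are all assumptions about your setup:

```r
# One exported function hides the connection details from analysts.
# Credentials come from environment variables rather than being hard-coded.
connect_warehouse <- function() {
  DBI::dbConnect(odbc::odbc(),
                 dsn = "warehouse",                  # hypothetical DSN
                 uid = Sys.getenv("WAREHOUSE_USER"),
                 pwd = Sys.getenv("WAREHOUSE_PWD"))
}
```

Analysts then call `con <- connect_warehouse()` and never touch connection strings directly.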

learnr Tutorials

RStudio recently released a new package for creating interactive tutorials in R Markdown called learnr. There are many great resources online for getting started with R, but it can be useful to create tutorials specific to your internal data and domain. learnr tutorials can serve as training wheels for the other components of the internal R package or teach broader concepts and standards accepted across the organization. For example, you might provide a primer that teaches new users your organization’s R style guide.

Sharing an Internal R Package

Internal packages can be built in the RStudio IDE and distributed as tar files. Alternatively, many organizations use RStudio Server or RStudio Server Pro to standardize the R environment in their organization. In addition to making it easy to share an internal package, a standard compute environment keeps new R users from having to spend time installing R, RStudio, and packages. While these are necessary skills, the first interactions with R should get new users to a data insight as fast as possible. RStudio Server Pro also includes IT functions for monitoring and restricting resources.

Wrap Up

If you are leading an R group, an internal R package is a powerful way to support your users and the adoption of R. Imagine how easy it would be to introduce R to co-workers if they could connect to real, internal data and create a useful, beautiful plot in under 10 minutes. Investing in an internal R package makes that onboarding experience possible.




### JAGS 4.3.0 is released

Tue, 07/18/2017 - 23:13

(This article was first published on R – JAGS News, and kindly contributed to R-bloggers)

The source tarball for JAGS 4.3.0 is now available from Sourceforge. Binary distributions will be available later. See the updated manual for details of the features in the new version. This version is fully compatible with the current version of rjags on CRAN.



### Securely store API keys in R scripts with the "secret" package

Tue, 07/18/2017 - 20:40

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you use an API key to access a secure service, or need to use a password to access a protected database, you'll need to provide these "secrets" in your R code somewhere. That's easy to do if you just include those keys as strings in your code — but it's not very secure. This means your private keys and passwords are stored in plain-text on your hard drive, and if you email your script they're available to anyone who can intercept that email. It's also really easy to inadvertently include those keys in a public repo if you use Github or similar code-sharing services.

To address this problem, Gábor Csárdi and Andrie de Vries created the secret package for R. The secret package integrates with OpenSSH, providing R functions that allow you to create a vault for keys on your local machine, define trusted users who can access those keys, and then include encrypted keys in R scripts or packages that can only be decrypted by you or by people you trust. You can see how it works in the vignette secret: Share Sensitive Information in R Packages, and in this presentation by Andrie de Vries at useR!2017:
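The basic workflow might look like the sketch below; the function names follow my reading of the secret package's documentation, so check the vignette for exact signatures and arguments:

```r
# Hypothetical helpers wrapping the secret package's vault workflow.
# Assumed API: add_secret(), get_secret(), local_key() from the secret package.
store_api_key <- function(vault) {
  # Encrypt a secret so only the listed users can read it.
  secret::add_secret("db_password", "s3cret",
                     users = "alice", vault = vault)
}

read_api_key <- function(vault) {
  # Decrypt with your own private key.
  secret::get_secret("db_password",
                     key = secret::local_key(), vault = vault)
}
```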

To use the secret package, you'll need access to your private key, which you'll also need to store securely. For that, you might also want to take a look at the in-progress keyring package, which allows you to access secrets stored in Keychain on macOS, Credential Store on Windows, and the Secret Service API on Linux.

The secret package is available now on CRAN, and you can also find the latest development version on Github.



### Multiple Factor Analysis to analyse several data tables

Tue, 07/18/2017 - 20:21

(This article was first published on François Husson, and kindly contributed to R-bloggers)

How can you take into account, and compare, information from different information sources? Multiple Factor Analysis is a principal component method that deals with datasets containing quantitative and/or categorical variables structured by groups.

Here is a course with videos that present the method named Multiple Factor Analysis.

Multiple Factor Analysis (MFA) allows you to study complex data tables, where a group of individuals is characterized by variables structured as groups, possibly coming from different information sources. Our interest in the method is due to its ability to analyze a data table as a whole, but also to compare the information provided by the various information sources.

Four videos present a course on MFA, highlighting the way to interpret the results. Then you will find videos presenting how to implement MFA in FactoMineR.

With this course, you will be able to independently perform MFA and interpret the results you obtain.



### Animating a spinner using ggplot2 and ImageMagick

Tue, 07/18/2017 - 20:00

(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

It’s Sunday, and I [Bob] am just sitting on the couch peacefully ggplotting to illustrate basic sample spaces using spinners (a trick I’m borrowing from Jim Albert’s book Curve Ball). There’s an underlying continuous outcome (i.e., where the spinner lands) and a quantization into a number of regions to produce a discrete outcome (e.g., “success” and “failure”). I’m quite pleased with myself for being able to use polar coordinates to create the spinner and arrow. ggplot works surprisingly well in polar coordinates once you figure them out; almost everything people have said about them online is confused and the doc itself assumes you’re a bit more of a ggplotter and geometer than me.

I’m so pleased with it that I show the plot to Mitzi. She replies, “Why don’t you animate it?” I don’t immediately say, “What a waste of time,” then get back to what I’m doing. Instead, I boast, “It’ll be done when you get back from your run.” Luckily for me, she goes for long runs—I just barely had the prototype working as she got home. And then I had to polish it and turn it into a blog post. So here it is, for your wonder and amazement.

Here’s the R magic.

library(ggplot2)

draw_curve <- function(angle) {
  df <- data.frame(outcome = c("success", "failure"),
                   prob = c(0.3, 0.7))
  plot <- ggplot(data = df, aes(x = factor(1), y = prob, fill = outcome)) +
    geom_bar(stat = "identity", position = "fill") +
    coord_polar(theta = "y", start = 0, direction = 1) +
    scale_y_continuous(breaks = c(0.12, 0.7), labels = c("success", "failure")) +
    geom_segment(aes(y = angle / 360, yend = angle / 360, x = -1, xend = 1.4),
                 arrow = arrow(type = "closed"), size = 1) +
    theme(axis.title = element_blank(),
          axis.ticks = element_blank(),
          axis.text.y = element_blank()) +
    theme(panel.grid = element_blank(), panel.border = element_blank()) +
    theme(legend.position = "none") +
    geom_point(aes(x = -1, y = 0), color = "#666666", size = 5)
  return(plot)
}

ds <- c()
pos <- 0
for (i in 1:66) {
  pos <- (pos + (67 - i)) %% 360
  ds[i] <- pos
}
ds <- c(rep(0, 10), ds)
ds <- c(ds, rep(ds[length(ds)], 10))

for (i in 1:length(ds)) {
  ggsave(filename = paste("frame", ifelse(i < 10, "0", ""), i, ".png", sep = ""),
         plot = draw_curve(ds[i]),
         device = "png", width = 4.5, height = 4.5)
}

I probably should've combined theme functions. Ben would've been able to define ds in a one-liner and then map ggsave. I hope it's at least clear what my code does: it just decrements the number of degrees moved each frame by one (no physics involved).

After producing the frames in alphabetical order (all that ifelse and paste mumbo-jumbo), I went to the output directory and ran the results through ImageMagick (which I'd previously installed on my now ancient Macbook Pro) from the terminal, using

> convert *.png -delay 3 -loop 0 spin.gif

That took a minute or two. Each of the pngs is about 100KB, but the final output is only 2.5MB or so. Maybe I should've gone with less delay (I don't even know what the units are!) and fewer rotations and maybe a slower final slowing down (maybe study the physics). How do the folks at Pixar ever get anything done?

P.S. I can no longer get the animation package to work in R, though it used to work in the past. It just wraps up those calls to ImageMagick.

P.P.S. That salmon and teal color scheme is the default!



### Plants

Tue, 07/18/2017 - 19:14

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Blue dragonflies dart to and fro
I tie my life to your balloon and let it go
(Warm Foothills, Alt-J)

In my last post I did some drawings based on L-Systems. These drawings are done sequentially. At any step, the state of the drawing can be described by the position (coordinates) and the orientation of the pencil. In that case I only used two kinds of operators: drawing a straight line and turning by a constant angle. Today I used two more symbols to perform stack operations:

• “[” Push the current state (position and orientation) of the pencil onto a pushdown operations stack
• “]” Pop a state from the stack and make it the current state of the pencil (no line is drawn)

These operators allow you to return to a previous state and continue drawing from there. Using them, you can draw plants like these:

Each image corresponds to a different set of axiom, rules, angle and depth parameters; I described these terms in my previous post. If you want to reproduce them, you can find the code below. Change colors, add noise to the angles, try your own plants … I am sure you will find nice images:

library(gsubfn)
library(stringr)
library(dplyr)
library(ggplot2)

#Plant 1
axiom="F"
rules=list("F"="FF-[-F+F+F]+[+F-F-F]")
angle=22.5
depth=4

#Plant 2
axiom="X"
rules=list("X"="F[+X][-X]FX", "F"="FF")
angle=25.7
depth=7

#Plant 3
axiom="X"
rules=list("X"="F[+X]F[-X]+X", "F"="FF")
angle=20
depth=7

#Plant 4
axiom="X"
rules=list("X"="F-[[X]+X]+F[+FX]-X", "F"="FF")
angle=22.5
depth=5

#Plant 5
axiom="F"
rules=list("F"="F[+F]F[-F]F")
angle=25.7
depth=5

#Plant 6
axiom="F"
rules=list("F"="F[+F]F[-F][F]")
angle=20
depth=5

for (i in 1:depth) axiom=gsubfn(".", rules, axiom)

actions=str_extract_all(axiom, "\\d*\\+|\\d*\\-|F|L|R|\\[|\\]|\\|") %>% unlist

status=data.frame(x=numeric(0), y=numeric(0), alfa=numeric(0))
points=data.frame(x1 = 0, y1 = 0, x2 = NA, y2 = NA, alfa=90, depth=1)

for (action in actions) {
  if (action=="F") {
    x=points[1, "x1"]+cos(points[1, "alfa"]*(pi/180))
    y=points[1, "y1"]+sin(points[1, "alfa"]*(pi/180))
    points[1, "x2"]=x
    points[1, "y2"]=y
    data.frame(x1 = x, y1 = y, x2 = NA, y2 = NA,
               alfa=points[1, "alfa"], depth=points[1, "depth"]) %>%
      rbind(points) -> points
  }
  if (action %in% c("+", "-")) {
    alfa=points[1, "alfa"]
    points[1, "alfa"]=eval(parse(text=paste0("alfa", action, angle)))
  }
  if (action=="[") {
    data.frame(x=points[1, "x1"], y=points[1, "y1"], alfa=points[1, "alfa"]) %>%
      rbind(status) -> status
    points[1, "depth"]=points[1, "depth"]+1
  }
  if (action=="]") {
    depth=points[1, "depth"]
    points[-1, ] -> points
    data.frame(x1=status[1, "x"], y1=status[1, "y"], x2=NA, y2=NA,
               alfa=status[1, "alfa"], depth=depth-1) %>%
      rbind(points) -> points
    status[-1, ] -> status
  }
}

ggplot() +
  geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2),
               lineend = "round", colour="white", data=na.omit(points)) +
  coord_fixed(ratio = 1) +
  theme(legend.position="none",
        panel.background = element_rect(fill="black"),
        panel.grid=element_blank(),
        axis.ticks=element_blank(),
        axis.title=element_blank(),
        axis.text=element_blank())


### Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-3)

Tue, 07/18/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Statistics is often taught in school by and for people who like mathematics. As a consequence, in those classes the emphasis is put on learning equations, solving calculus problems and creating mathematical models instead of building an intuition for probabilistic problems. But if you are reading this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let's learn how we can tackle useful statistical problems by writing simple R queries, and how to think in probabilistic terms.
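In that spirit, here is a tiny standalone example of answering a probability question by simulation instead of calculus — estimating the chance that two fair dice sum to 7 (the setup is illustrative and not part of the exercise set):

```r
set.seed(2017)
n <- 100000

# Roll two fair dice n times and count how often they sum to 7.
die1 <- sample(1:6, n, replace = TRUE)
die2 <- sample(1:6, n, replace = TRUE)
p_hat <- mean(die1 + die2 == 7)

p_hat   # simulated estimate, close to the exact answer of 6/36 = 1/6
```

With 100,000 simulated rolls the estimate lands within a fraction of a percent of the exact value, no combinatorics required.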

In the first two parts of this series, we saw how to identify the distribution of a random variable by plotting the distribution of a sample and by estimating statistics. We also saw that it can be tricky to identify a distribution from a small sample of data. Today, we'll see how to estimate the confidence interval of a statistic in this situation by using a powerful method called bootstrapping.

Answers to the exercises are available here.

Exercise 1
Load this dataset and draw the histogram, the ECDF of this sample and the ECDF of a density that is a good fit for the data.

Exercise 2
Write a function that takes a dataset and a number of iterations as parameters. For each iteration this function must create a sample with replacement of the same size as the dataset, calculate the mean of the sample and store it in a matrix, which the function must return.
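One possible sketch of such a function (the function name is illustrative, and stand-in `rnorm` data is used in place of the exercise dataset):

```r
# Bootstrap the mean: resample the data with replacement `iter` times
# and store each resampled mean in a one-column matrix.
boot_mean <- function(dataset, iter) {
  means <- matrix(NA_real_, nrow = iter, ncol = 1)
  for (i in seq_len(iter)) {
    resample <- sample(dataset, size = length(dataset), replace = TRUE)
    means[i, 1] <- mean(resample)
  }
  means
}

set.seed(1)
x <- rnorm(100)        # stand-in data; the exercise uses its own dataset
m <- boot_mean(x, 500)
dim(m)                 # a 500 x 1 matrix of bootstrapped means
```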

Exercise 3
Use the t.test() to compute the 95% confidence interval estimate for the mean of your dataset.
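As a hint of the shape of the output, `t.test()` returns the interval in its `conf.int` element (shown here on stand-in normal data, not the exercise dataset):

```r
set.seed(42)
x <- rnorm(200, mean = 0.5, sd = 1.3)  # stand-in for the exercise dataset

tt <- t.test(x, conf.level = 0.95)
ci <- tt$conf.int        # lower and upper bounds of the 95% CI for the mean
ci
attr(ci, "conf.level")   # confirms the confidence level used
```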

Learn more about bootstrapping functions in the online course Structural equation modeling (SEM) with lavaan. In this course you will learn how to:

• Learn how to develop bootstrapped confidence intervals
• Go in-depth into the lavaan package for modelling equations
• And much more

Exercise 4
Use the function you just wrote to estimate the mean of your sample 10,000 times. Then draw the histogram of the results and the sampling mean of the data.

The probability distribution of the estimate of a mean is a normal distribution centered around the real value of the mean. In other words, if we take a lot of samples from a population and compute the mean of each sample, the histogram of those means will look like that of a normal distribution centered around the real value of the mean we are trying to estimate. We have recreated this process artificially by creating a bunch of new samples from the dataset, resampling it with replacement. Now we can make a point estimate of the mean by computing the average of the sample of means, or compute a confidence interval by finding the correct percentiles of this distribution. This process is basically what is called bootstrapping.

Exercise 5
Calculate the 2.5 and 97.5 percentiles of your sample of 10,000 estimates of the mean, and the mean of this sample. Compare this last value to the sample mean of your data.
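The percentile step can be sketched with `quantile()` (illustrated on simulated bootstrap means rather than the exercise data):

```r
set.seed(7)
x <- rnorm(100)   # stand-in dataset
boot_means <- replicate(10000, mean(sample(x, length(x), replace = TRUE)))

# 95% percentile confidence interval for the mean
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
mean(boot_means)  # point estimate; should be very close to mean(x)
```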

Exercise 6
Bootstrapping can be used to compute the confidence interval of any statistic of interest, but you don't have to write a function for each of them! You can use the boot() function from the library of the same name and pass the statistic as an argument to compute the bootstrapped sample. Use this function with 10,000 replicates to compute the median of the dataset.
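A minimal sketch of the `boot()` pattern (using stand-in data; note that `boot()` expects a statistic function taking the data and a vector of resampled indices):

```r
library(boot)  # ships with R as a recommended package

set.seed(7)
x <- rnorm(100)  # stand-in for the exercise dataset

# The statistic is evaluated on data[idx] for each bootstrap replicate.
med_stat <- function(data, idx) median(data[idx])

b <- boot(x, statistic = med_stat, R = 10000)
b$t0        # the sample median of the original data
head(b$t)   # the bootstrapped medians, one per replicate
```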

Exercise 7
Look at the structure of your result and plot its histogram. On the same plot, draw the value of the sample median of your dataset and show the 95% confidence interval of this statistic by adding two vertical green lines at the lower and upper bounds of the interval.

Exercise 8
Write functions to compute by bootstrapping the following statistics:

• Variance
• Kurtosis
• Max
• Min

Exercise 9
Use the functions from last exercise and the boot function with 10,000 replicates to compute the following statistics:

• Variance
• Kurtosis
• Max
• Min

Then draw the histogram of the bootstrapped sample and plot the 95% confidence interval of the statistics.

Exercise 10
Generate 1000 points from a normal distribution with mean and standard deviation equal to those of the dataset. Use the bootstrap method to estimate the 95% confidence interval of the mean, the variance, the kurtosis, the min and the max of this density. Then plot the histograms of the bootstrap samples for each variable and draw the 95% confidence intervals as two red vertical lines.

Two bootstrap estimates of the same statistic from two samples drawn from the same density should be pretty similar. When we compare these last plots with the confidence intervals we drew before, we see that they are. More importantly, the confidence intervals computed in exercise 10 overlap the confidence intervals of the statistics of the first dataset. As a consequence, we can't conclude that the two samples come from different density distributions, and in practice we could use a normal distribution with a mean of 0.4725156 and a standard deviation of 1.306665 to simulate this random variable.


To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

### Multiple Correspondence Analysis with FactoMineR

Tue, 07/18/2017 - 13:06

(This article was first published on François Husson, and kindly contributed to R-bloggers)

Here is a course with videos that presents Multiple Correspondence Analysis (MCA) in a French way. The most well-known use of Multiple Correspondence Analysis is the analysis of surveys.

Four videos present a course on MCA, highlighting the way to interpret the data. Then you will find videos presenting the way to implement MCA in FactoMineR, how to deal with missing values in MCA thanks to the package missMDA, and lastly a video on drawing interactive graphs with Factoshiny. And finally you will see that the new package FactoInvestigate allows you to obtain an interpretation of your MCA results automatically.

With this course, you will be able to perform MCA and interpret the results obtained with it on your own.

For more information, you can see the book below. Here are some reviews of the book and a link to order it.


To leave a comment for the author, please follow the link and comment on their blog: François Husson.

### constants 0.0.1

Tue, 07/18/2017 - 10:00

(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)

The new constants package is available on CRAN. This small package provides the CODATA 2014 internationally recommended values of the fundamental physical constants (universal, electromagnetic, physicochemical, atomic…), provided as symbols for direct use within the R language. Optionally, the values with errors and/or the values with units are also provided if the errors and/or the units packages are installed as well.

But, what is CODATA? The Committee on Data for Science and Technology (CODATA) is an interdisciplinary committee of the International Council for Science. The Task Group on Fundamental Constants periodically provides the internationally accepted set of values of the fundamental physical constants. The version currently in force is the “2014 CODATA”, published on 25 June 2015.

This package wraps the codata dataset, defines unique symbols for each one of the 237 constants, and provides them enclosed in three sets of symbols: syms, syms_with_errors and syms_with_units.

library(constants)

# the speed of light
with(syms, c0)
## [1] 299792458

# explore which constants are available
lookup("planck constant", ignore.case = TRUE)
##                  quantity  symbol            value      unit
## 7         Planck constant       h  6.626070040e-34       J s
## 8         Planck constant    h_eV  4.135667662e-15      eV s
## 9         Planck constant    hbar         h/(2*pi)       J s
## 10        Planck constant hbar_eV      h_eV/(2*pi)      eV s
## 11        Planck constant hbar.c0      197.3269788    MeV fm
## 212 molar Planck constant    Na.h 3.9903127110e-10 J s mol-1
## 213 molar Planck constant Na.h.c0   0.119626565582 J m mol-1
##     rel_uncertainty            type
## 7           1.2e-08       universal
## 8           6.1e-09       universal
## 9           1.2e-08       universal
## 10          6.1e-09       universal
## 11          6.1e-09       universal
## 212         4.5e-10 physicochemical
## 213         4.5e-10 physicochemical

# symbols can also be attached to the search path
attach(syms)

# the Planck constant
hbar
## [1] 1.054572e-34

If the errors/units package is installed in your system, constants with errors/units are available:

attach(syms_with_errors)

# the Planck constant with error
hbar
## 1.05457180(1)e-34

attach(syms_with_units)

# the Planck constant with units
hbar
## 1.054572e-34 J*s

data(codata)
head(codata)
##                             quantity    symbol        value        unit
## 1           speed of light in vacuum        c0    299792458       m s-1
## 2                  magnetic constant       mu0    4*pi*1e-7       N A-2
## 3                  electric constant  epsilon0 1/(mu0*c0^2)       F m-1
## 4 characteristic impedance of vacuum        Z0       mu0*c0           Ω
## 5  Newtonian constant of gravitation         G  6.67408e-11 m3 kg-1 s-2
## 6  Newtonian constant of gravitation G_hbar.c0  6.70861e-39    GeV-2 c4
##   rel_uncertainty      type
## 1         0.0e+00 universal
## 2         0.0e+00 universal
## 3         0.0e+00 universal
## 4         0.0e+00 universal
## 5         4.7e-05 universal
## 6         4.7e-05 universal

dplyr::count(codata, type, sort = TRUE)
## # A tibble: 15 x 2
##                          type     n
##                         <chr> <int>
##  1    atomic-nuclear-electron    31
##  2      atomic-nuclear-proton    26
##  3     atomic-nuclear-neutron    24
##  4            physicochemical    24
##  5      atomic-nuclear-helion    18
##  6        atomic-nuclear-muon    17
##  7            electromagnetic    17
##  8                  universal    16
##  9    atomic-nuclear-deuteron    15
## 10     atomic-nuclear-general    11
## 11         atomic-nuclear-tau    11
## 12      atomic-nuclear-triton    11
## 13                    adopted     7
## 14       atomic-nuclear-alpha     7
## 15 atomic-nuclear-electroweak     2

Article originally published in Enchufa2.es: constants 0.0.1.


To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2.

### The Value of #Welcome

Tue, 07/18/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

I’m participating in the AAAS Community Engagement Fellows Program (CEFP), funded by the Alfred P. Sloan Foundation. The inaugural cohort of Fellows is made up of 17 community managers working in a wide range of scientific communities. This is cross-posted from the Trellis blog as part of a series of reflections that the CEFP Fellows are sharing.

In my training as a AAAS Community Engagement Fellow, I hear repeatedly about the value of extending a personal welcome to your community members. This seems intuitive, but recently I put this to the test. Let me tell you about my experience creating and maintaining a #welcome channel in a community Slack group.

"Welcome" by Nathan under CC BY-SA 2.0

I listen in on and occasionally participate in a Slack group for R-Ladies community organizers (R-Ladies is a global organization with local meetup chapters around the world, for women who do/want to do programming in R). Their Slack is incredibly well-organized and has a #welcome channel where new joiners are invited to introduce themselves in a couple of sentences. The leaders regularly jump in to add a wave emoji and ask people to introduce themselves if they have not already.

At rOpenSci, where I am the Community Manager, when people joined our 150+ person Slack group, they used to land in the #general channel where people ask and answer questions. Often, new people joining went unnoticed among the conversations. So recently I copied R-Ladies and created a #welcome channel in our Slack and made sure any new people got dropped in there, as well as in the #general channel. The channel purpose is set as "A place to welcome new people and for new people to introduce themselves. We welcome participation and civil conversations that adhere to our code of conduct."

I pinged three new rOpenSci community members to join and introduce themselves, and in the #general channel said “Hey, come say hi to people over here at #welcome!”. One day later, we already had 33 people in #welcome. I spent that morning nurturing it, noting some people’s activities and contributions that they might not otherwise highlight themselves e.g. Bea just did her first open peer software review for rOpenSci, or Matt has been answering people’s questions about meta data, or Julie, Ben and Jamie are co-authors on this cool new paper about open data science tools. And I gave a shoutout to R-Ladies stating clearly that I copied their fantastic #welcome channel.

People are introducing themselves, tagging with emoji and thanking “the community” saying things like:

“…I feel super lucky to be a part of the rOpenSci community, which has … had a great positive impact on my life!”

“I love rOpenSci for building this community and helping people like me become confident data scientists!”

“[I’m] hoping to contribute more to the group moving forward”

“…thank you for having me part of this community!”

Such is the value of #welcome.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.

### Automatically Fitting the Support Vector Machine Cost Parameter

Tue, 07/18/2017 - 07:32

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

In an earlier post I discussed how to avoid overfitting when using Support Vector Machines. This was achieved using cross validation. In cross validation, prediction accuracy is maximized by varying the cost parameter. Importantly, prediction accuracy is calculated on a different subset of the data from that used for training.

In this blog post I take that concept a step further, by automating the manual search for the optimal cost.

The data set I’ll be using describes different types of glass based upon physical attributes and chemical composition.  You can read more about the data here, but for the purposes of my analysis all you need to know is that the outcome variable is categorical (7 types of glass) and the 4 predictor variables are numeric.

Creating the base support vector machine model

I start, as in my earlier analysis, by splitting the data into a larger 70% training sample and a smaller 30% testing sample. Then I train a support vector machine on the training sample with the following code:

library(flipMultivariates)

svm = SupportVectorMachine(Type ~ RefractiveIndex + Ca + Ba + Fe,
                           subset = training,
                           cost = 1)

This produces output as shown below. There are 2 reasons why we can largely disregard the 64.67% accuracy:

1. We used the training data (and not the independent testing data) to calculate accuracy.
2. We have used a default value for the cost of 1 and not attempted to optimize.

Amending the R code

I am going to amend the code above in order to loop over a range of values of cost. For each value, I will calculate the accuracy on the test sample. The updated code is as follows:

library(flipMultivariates)
library(flipRegression)

costs = c(0.1, 1, 10, 100, 1000, 10000)
i = 1
accuracies = rep(0, length(costs))

for (cost in costs) {
  svm = SupportVectorMachine(Type ~ RefractiveIndex + Ca + Ba + Fe,
                             subset = training,
                             cost = cost)
  accuracies[i] = attr(ConfusionMatrix(svm, subset = (testing == 1)), "accuracy")
  i = i + 1
}

plot(costs, accuracies, type = "l", log = "x")

The first few lines set things up. I load the libraries required to run the Support Vector Machine and calculate the accuracy. Next I choose a range of costs, and initialize a loop counter i and a vector accuracies in which I will store the results.

Then I add a loop around the code that created the base model to iterate over costs. The next line calculates and stores the accuracy on the testing sample. Finally I plot the results, which tells me that the greatest accuracy appears around a cost of 100. This allows us to go back and update costs to a more granular range around this value.

Re-running the code again using the new costs (10, 20, 50, 75, 100, 150, 200, 300, 500, 1000) I get the final chart shown below. This indicates that a cost of 50 gives best performance.
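Since flipMultivariates may not be available outside Displayr, the same tuning pattern can be sketched with a model from a package that ships with R — here k-nearest neighbours from class, varying k instead of cost. The dataset and split are illustrative, not the glass data from the post:

```r
library(class)  # recommended package shipped with R

set.seed(123)
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))  # 70% training split
train <- iris[train_idx, 1:4]
test  <- iris[-train_idx, 1:4]
train_cl <- iris$Species[train_idx]
test_cl  <- iris$Species[-train_idx]

# Loop over the tuning parameter and record out-of-sample accuracy,
# exactly as the cost loop above does for the SVM.
ks <- c(1, 3, 5, 7, 9)
accuracies <- numeric(length(ks))
for (i in seq_along(ks)) {
  pred <- knn(train, test, cl = train_cl, k = ks[i])
  accuracies[i] <- mean(pred == test_cl)
}

plot(ks, accuracies, type = "l")
```

The shape is identical: fit at each candidate value, score on held-out data, then zoom the grid in around the best region.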

TRY IT OUT
The analysis in this post used R in Displayr. The flipMultivariates package (available on GitHub), which uses the e1071 package, performed the calculations. You can try automatically fitting the Support Vector Machine Cost Parameter yourself using the data in this example.


To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.

### R Programming Notes – Part 2

Tue, 07/18/2017 - 06:34

In an older post, I discussed a number of functions that are useful for programming in R. I wanted to expand on that topic by covering other functions, packages, and tools that are useful. Over the past year, I have been working as an R programmer and these are some of the new learnings that have become fundamental in my work.

IS TRUE and IS FALSE

isTRUE is a logical operator that can be very useful in checking whether a condition or variable has been set to true. Let's say that we are writing a script that will run a generalized linear regression when the parameter run_mod is set to true. The conditional portion of the script can be written as either if(isTRUE(run_mod)) or if(run_mod). I am partial to isTRUE, but this is entirely a matter of personal preference. Users should also be aware of the isFALSE function, which is part of the BBmisc package.

run_mod = TRUE

if(isTRUE(run_mod)){
  tryCatch(
    GLM_Model(full_data = full.df, train_data = train.df, test_data = test.df),
    error = function(e) {
      print("error occurred")
      print(e)
    })
}

if(BBmisc::isFALSE(check_one) & BBmisc::isFALSE(check_two)){
  data_output.tmp$score = 0.8
}

INVISIBLE

The invisible function can be used to return an object that is not printed out, and can be useful in a number of circumstances. For example, it's useful when you have helper functions that will be utilized within other functions to do calculations. In those cases, it's often not desirable to print those results. I generally use invisible when I'm checking function arguments. For example, consider a function that takes two arguments, where you need to check whether the input is a vector.

if(!check_input(response, type = 'character', length = 1)) {
  stop('something is wrong')
}

The check_input function is something I created, and it has a few lines which contain invisible. The idea is for check_input to return true or false based on the inputs, so that it'll stop the execution when needed.

if(is.null(response) & !length(response) == 0) {
  return(FALSE)
} else if (!is.null(response)) {
  return(invisible(TRUE))
}
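To see what invisible actually changes, here is a minimal standalone demonstration (the function names are illustrative):

```r
loud  <- function(x) x * 2             # result auto-prints at top level
quiet <- function(x) invisible(x * 2)  # same result, but not auto-printed

res <- quiet(21)  # the value is still returned and assignable
res               # explicitly printing it works as usual

withVisible(quiet(21))$visible  # FALSE: the result is marked invisible
withVisible(loud(21))$visible   # TRUE
```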

DEBUG

When I’m putting together new classes or have multiple functions that interact with one another, I ensure that the code includes an comprehensive debugging process. This means that I’m checking my code at various stages so that I can identify when issues arise. Consider that I’m putting together a function that will go through a number of columns in a data frame, summarize those variables, and save the results as a nested list. To effectively put together code without issues, I ensure that the functions takes a debug argument that will run when it’s set to true. In the code below, it will print out values at different stages of the code. Furthermore, the final line of the code will check the resulting data structure.

DSummary_Variable <- function(data_obj, var, debug = TRUE){

  ......

  if(debug) message('|==========>>> Processing of the variable. \n')

  if(debug){
    if(!missing(var_summary)){
      message('|==========>>> var_summary has been created and has a length of ',
              length(var_summary), ' and the nested list has a length of ',
              length(var_summary[['var_properties']]), ' \n')
    } else {
      stop("var_summary is missing. Please investigate")
    }
  }
}

If you have multiple functions that interact with one another, it's a good idea to preface the printed message with the name of the function.

add_func <- function(a, b) a + b
mult_func <- function(a, b) a * b

main_func <- function(data, cola, colb, debug=TRUE){
  if(debug){
    message("mult_func: checking inputs to be used")
  }
  mult_func(data[, cola], data[, colb])
  if(debug){
    message("mult_add: checking inputs to be used")
  }
}

Stay tuned for part three, where I'll talk about the testthat and assertive packages.


### Investigating Cryptocurrencies (Part II)

Tue, 07/18/2017 - 02:36

(This article was first published on R-Chart, and kindly contributed to R-bloggers)

This is the second in a series of posts designed to show how the R programming language can be used with cryptocurrency related data sets.  A number of R packages are great for analyzing stocks and bonds and similar financial instruments.  These can also be applied to working with cryptocurrencies.  In this post we will focus on Bitcoin.

Bitcoin has garnered enough attention that it is available through Yahoo's finance data under the symbol BTCUSD=X.  The quantmod package comprises a set of packages and utilities geared towards the time series analysis traditionally associated with stocks.  You can load Bitcoin along with other stock symbols using the loadSymbols function.  In this example we will also load AMD, which makes graphics cards used by cryptocurrency miners.

library(quantmod)

If you have any issue downloading the data, make sure you update to the latest version of quantmod.  If all goes well, you will have two objects in your global environment named AMD and BTCUSD=X.

ls()
[1] "AMD"      "BTCUSD=X"

You can plot AMD by simply passing it to the plot function.

plot(AMD)

Bitcoin is slightly different simply because the symbol in use includes an equal sign.  To ensure that R evaluates the code properly, the symbol must be surrounded in back ticks.

plot(`BTCUSD=X`)
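Back ticks are R's general mechanism for non-syntactic names, so the same trick works anywhere such a name appears. A standalone illustration with made-up values:

```r
# A name containing "=" is not a valid R identifier, but back ticks allow it.
`BTCUSD=X` <- c(2200, 2450, 2300)  # hypothetical closing prices

mean(`BTCUSD=X`)
get("BTCUSD=X")  # equivalent lookup of the same object by string
```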

There is data missing for certain days.  There are other sources for cryptocurrency data which can be substituted if needed.  We will ignore this anomaly for the remainder of this post.  There is data for the last 4 weeks. We can construct a candle chart that focuses on this subset of data.

chartSeries(`BTCUSD=X`, subset = 'last 4 weeks')

This chart can then be modified to include technical analysis – for instance Bollinger Bands.

The capabilities of the quantmod package were covered in an earlier post (see http://www.r-chart.com/2010/06/stock-analysis-using-r.html), which includes a listing of other functions that can be applied.

Inasmuch as cryptocurrencies behave like traditional equities, they lend themselves to similar types of analysis.  The quantmod package is a great place to start when analyzing cryptocurrencies.


To leave a comment for the author, please follow the link and comment on their blog: R-Chart.

### Young people neither in employment nor in education and training in Europe, 2000-2016

Tue, 07/18/2017 - 02:00

(This article was first published on Ilya Kashnitsky, and kindly contributed to R-bloggers)


One of the nice features of R is the ease of data acquisition. I am now working on examples of data acquisition from different sources within an R session. Soon I am going to publish a long-read with an overview of demographic data acquisition in R.

NEET in Europe

As an example of Eurostat data usage, I chose to show the dynamics of NEET (young people neither in employment nor in education and training) in European countries. The example uses the brilliant geofacet package.

library(tidyverse)
library(lubridate)
library(forcats)
library(eurostat)
library(geofacet)
library(viridis)
library(ggthemes)
library(extrafont)

# Find the needed dataset code
# http://ec.europa.eu/eurostat/web/regions/data/database

# download NEET data for countries
neet <- get_eurostat("edat_lfse_22")

# if the automated download does not work, the data can be grabbed manually at
# http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing

neet %>%
  filter(geo %>% paste %>% nchar == 2,
         sex == "T",
         age == "Y18-24") %>%
  group_by(geo) %>%
  mutate(avg = values %>% mean()) %>%
  ungroup() %>%
  ggplot(aes(x = time %>% year(), y = values)) +
  geom_path(aes(group = 1)) +
  geom_point(aes(fill = values), pch = 21) +
  scale_x_continuous(breaks = seq(2000, 2015, 5),
                     labels = c("2000", "'05", "'10", "'15")) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 40)) +
  scale_fill_viridis("NEET, %", option = "B") +
  facet_geo(~ geo, grid = "eu_grid1") +
  labs(x = "Year",
       y = "NEET, %",
       title = "Young people neither in employment nor in education and training in Europe",
       subtitle = "Data: Eurostat Regional Database, 2000-2016",
       caption = "ikashnitsky.github.io") +
  theme_few(base_family = "Roboto Condensed", base_size = 15) +
  theme(axis.text = element_text(size = 10),
        panel.spacing.x = unit(1, "lines"),
        legend.position = c(0, 0),
        legend.justification = c(0, 0))


To leave a comment for the author, please follow the link and comment on their blog: Ilya Kashnitsky.

### Ecosystems chapter added to “Empirical software engineering using R”

Tue, 07/18/2017 - 01:05

(This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers)

The Ecosystems chapter of my Empirical software engineering book has been added to the draft pdf (download here).

I don’t seem to be able to get away from rewriting everything, despite working on the software engineering material for many years. Fortunately the sparsity of the data keeps me in check, but I keep finding new and interesting data (not a lot, but enough to slow me down).

There is still a lot of work to be done on the ecosystems chapter, not least integrating all the data I have been promised. The basic threads are there, they just need filling out (assuming the promised data sets arrive).

I did not get any time to integrate the developer and economics data received since those draft chapters were released; there has been some minor reorganization.

As always, if you know of any interesting software engineering data, please tell me.

I’m looking to rerun the workshop on analyzing software engineering data. If anybody has a venue in central London, that holds 30 or so people+projector, and is willing to make it available at no charge for a series of free workshops over several Saturdays, please get in touch.

Projects chapter next.


To leave a comment for the author, please follow the link and comment on their blog: The Shape of Code » R.