My #Best9of2017 tweets
(This article was first published on Maëlle, and kindly contributed to Rbloggers)
You’ve probably seen people posting their #Best9of2017, primarily on Instagram I’d say. I’m not an Instagram user, although I do have an account to spy on my younger sister and cousins, so I don’t even have 9 Instagram posts in total but I do love the collage people get to show off… So what about my best 9 tweets of 2017?
Get my 9 best tweets by number of likesI first wanted to use rtweet::get_timeline but it only returned me tweets from July, even when using include_rts = FALSE, so I downloaded my analytics files from the Twitter website, one per trimester.
my_files < c("tweet_activity_metrics_ma_salmon_20170101_20170402_en.csv", "tweet_activity_metrics_ma_salmon_20170402_20170702_en.csv", "tweet_activity_metrics_ma_salmon_20170702_20171001_en.csv", "tweet_activity_metrics_ma_salmon_20171001_20171231_en.csv") paths < paste0("data/", my_files) # read them all at once my_tweets < purrr::map_df(paths, readr::read_csv) # just in case I got some data ranges wrong my_tweets < unique(my_tweets) # get the top 9! my_tweets < dplyr::arrange(my_tweets,  likes) my_tweets < janitor::clean_names(my_tweets) best9 < my_tweets$tweet_permalink[1:9]My husband advised me to use something more elaborate than number of likes, which is a wise idea, but I was happy with that simple method.
Take screenshots and paste themThere’s a great R package to do screenshots from R, webshot. I was a bit annoyed at the “Follow” button appearing, but I did not want to have to write Javascript code to first login in the hope to make that thing disappear. I tried using a CSS selector instead of a rectangle, but I was less satisfied. An obvious problem here is that contrary to Instagram images, tweets have different heights depending on the text length and on the size of the optional attached media… It’s a bit sad but not too sad, my collage will still give a look at my Twitter 2017.
library("magrittr") save_one_file < function(url, name){ filename < paste0(name, ".png") # save and output filename webshot::webshot(url, filename, cliprect = c(0, 150, 750, 750)) filename } files < purrr::map2_chr(best9, 1:9, save_one_file)Regarding the collage part using magick, I used my “Faces of R” post as a reference, which is funny since it features in my top 9 tweets.
no_rows < 3 no_cols < 3 make_column < function(i, files, no_rows){ filename < paste0("col", i, ".jpg") magick::image_read(files[(i*no_rows+1):((i+1)*no_rows)]) %>% magick::image_border("salmon", "20x20") %>% magick::image_append(stack = TRUE) %>% magick::image_write(filename) filename } purrr::map_chr(0:(no_cols1), make_column, files = files, no_rows = no_rows) %>% magick::image_read() %>% magick::image_append(stack = FALSE) %>% magick::image_border("salmon", "20x20") %>% magick::image_write("20171230best9of2017.jpg")And since I’m well behaved, I clean after myself.
# clean file.remove(files) file.removes(paste0("col", 1:3, ".jpg"))So, apart from fun blog posts (Where to live in the US, Faces of #rstats Twitter, A plot against the CatterPlots complot) and more serious ones (Automatic tools for improving R packages, How to develop good R packages (for open science), Where have you been? Getting my Github activity), my top 9 tweets include a good ggplot2 tip (this website) and above all, the birth of baby Émile! Why have I only included two heart emojis?
What about your 2017?I’ve now read a few awesome posts from other R bloggers, from the top of my head:

Bob Rudis posted an impressive review of his year on Stack Overflow, Github and Twitter, and even made it reproducible for you to make it with your own data!

Mara Averick posted her top 2 tweets by month. Note of them praises Mara so let’s remind everyone she’s the greatest curator of R content ever, and recently became a tidyverse advocate for RStudio!

Thomas Lin Pedersen looked back on his fantastic 2017.

Kasia’s end of year thoughts include an inspiring description of her journey this year and awesome productivity tips for R bloggers.
What’s your take on your 2017, dear reader? In any case, I wish you a happy New Year!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Maëlle. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
The first step to becoming a top performing data scientist
(This article was first published on rbloggers – SHARP SIGHT LABS, and kindly contributed to Rbloggers)
Nearly every day, I see a new article talking about the benefits of data: “data will change the world” … “data is transforming business” … “data is the new oil.”
Setting aside the hyperbolic language, most of this is true.
So when you hear that “data scientist is the sexiest job of the 21st century,” you should mostly believe it. Companies are fighting to hire great data scientists.
But there’s a catch.
Even though there’s a huge demand for data scientists, a lot of people who study data science still can’t get jobs.
I regularly hear from young data science students who tell me that they can’t get a job. Or they can get a “job,” but it’s actually an unpaid internship.
What’s going on here?
The dirty little secret is that companies are desperate for highly skilled data scientists.
Companies want data scientists that are great at what they do. They want people who create more value than they cost in terms of salary.
What this means is that to get a data science job, you actually need to be able to “get things done.”
… and if you want a highlypaid data job, you need to be a top performer.
I can’t stress this enough: if you want to get a great data science job, certificates aren’t enough. You need to become a top performer.
Your first steps towards becoming a top performerYour first step towards becoming a topperforming data scientist is mastering the foundations:
 data visualization
 data manipulation
 exploratory data analysis
Have you mastered these? Have you memorized the syntax to accomplish these? Are you “fluent” in the foundations?
If not, you need to go back and practice. Believe me. You’ll thank me later. (You’re welcome.)
The reason is that these skills are used in almost every part of the data science workflow, particularly at earlier parts of your career.
Given almost data task, you’ll almost certainly need to clean your data, visualize it, and do some exploratory data analysis.
Moreover, they are also important as you move into more advanced topics. Do you want to start doing machine learning, artificial intelligence, and deep learning? You had better know how to clean and explore a dataset. If you can’t, you’ll basically be lost.
“Fluency” with the basics … what does this mean?I want to explain a little more about what I mean by “master of the foundations.” By “mastery,” I mean something like “fluency.”
As I’ve said before, programming languages are a lot like human languages.
To communicate effectively and “get things done” in a language, you essentially need to be “fluent.” You need to be able to express yourself in that language, and you need to be able to do so in a way that’s accurate and performed with ease.
Granted, you can “get by” without fluency, but you couldn’t expect to be hired for a languagedependent job without fluency.
For example, do you think you could get a job as a journalist at the New York Times if you hadn’t mastered basic English grammar? Do you think you could get a job at the Wall Street Journal if you needed to look up 50% of the words you used?
Of course not. If you wanted to be a journalist (in the USA), you would absolutely need to be fluent in English.
Data science is similar. You can’t expect to get a paid job as a data scientist if you’re doing google searches for syntax every few minutes.
If you eventually want a great job as a data scientist, you need to be fluent in writing data science code.
Can you write this code fluently?Here’s an example. This is some code to analyze some data.
Ask yourself, can you write this code fluently, from memory?
# # LOAD PACKAGES # library(tidyverse) library(forcats) # # BUILD DATASET # NOTE: # In this post, we will be using data from an analysis # performed by pwc # source: pwc.to/2totbnj # df.ai_growth < tribble( ~region, ~ai_econ_growth ,"China", 7 ,"North America", 3.7 ,"Northern Europe", 1.8 ,"Africa, Oceania, & Other Asia", 1.2 ,"Developed Asia", .9 ,"Southern Europe", .7 ,"Latin America", .5 ) # # CREATE FIRST DRAFT PLOT # (bar chart) # ggplot(data = df.ai_growth, aes(x = region, y = ai_econ_growth)) + geom_bar(stat = 'identity') # # FLIP THE COORDINATES # ggplot(data = df.ai_growth, aes(x = region, y = ai_econ_growth)) + geom_bar(stat = 'identity') + coord_flip() # # REORDER REGIONS #  reorder by econ growth # using forcats::fct_reorder() # ggplot(data = df.ai_growth, aes(x = fct_reorder(region, ai_econ_growth), y = ai_econ_growth)) + geom_bar(stat = 'identity') + coord_flip() # # CREATE THEME # theme.futurae < theme(text = element_text(family = 'Gill Sans', color = "#444444") ,panel.background = element_rect(fill = '#444B5A') ,panel.grid.minor = element_line(color = '#4d5566') ,panel.grid.major = element_line(color = '#586174') ,plot.title = element_text(size = 26) ,axis.title = element_text(size = 18, color = '#555555') ,axis.title.y = element_text(vjust = .7, angle = 0) ,axis.title.x = element_text(hjust = .5) ,axis.text = element_text(size = 12) ,plot.subtitle = element_text(size = 14) ) # # CREATE FINAL PLOT # ggplot(data = df.ai_growth, aes(x = fct_reorder(region, ai_econ_growth), y = ai_econ_growth)) + geom_bar(stat = 'identity', fill = 'cyan') + coord_flip() + labs(x = NULL , y = 'Projected AIdriven growth (Trillion USD)' , title = 'AI is projected to add an additional\n$15 Trillion to the global economy by 2030' , subtitle = '...strongest growth predicted in China & North America') + annotate(geom = 'text', label = "source: pwc.to/2totbnj", x = 1, y = 6, color = 'white') + theme.futuraeYou should be able to write most of this code fluently, from memory. You shouldn’t have to use many google searches or external resources at all.
Will you maybe forget a few things? Sure, every now and again. Will you write it all in one go? No. Even the best data scientists write code iteratively.
But in terms of remembering the syntax, you should know most of this cold. You should know most of the syntax by memory.
That’s what fluency means.
… and that’s what it will take to be one of the best.
Mastering data science is easier than you thinkI get it. This probably sounds hard.
I don’t want to lie to you. It’s not “easy” in the sense that you can achieve “fluency” without any effort.
But it is much easier than you think. With some discipline, and a good practice system, you can master the essentials within a couple of months.
If you know how to practice, within a few months you can learn to write data science code fluently and from memory.
Discover how to become a topperforming data scientistIf you want to become a topperforming data scientist, then make sure you sign up for our email list.
Next week, we will reopen enrollment for our data science training course, Starting Data Science.
This course will teach you the essentials of data science in R, and give you a practice system that will enable you to memorize everything you learn.
Want to become a top performer? Our course will show you how.
Sign up for our email list and you’ll get an exclusive invitation to join the course when it opens.
SIGN UP NOW
The post The first step to becoming a top performing data scientist appeared first on SHARP SIGHT LABS.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: rbloggers – SHARP SIGHT LABS. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Downtime Reading
(This article was first published on R Views, and kindly contributed to Rbloggers)
Not everyone has the luxury of taking some downtime at the end the year, but if you do have some free time, you may enjoy something on my short list of downtime reading. The books and articles here are not exactly “light reading”, nor are they literature for cuddling by the fire. Nevertheless, you may find something that catches your eye.
The Syncfusion series of free eBooks contains more than a few gems on a variety of programming subjects, including James McCaffrey’s R Programming Succinctly and Barton Poulson’s R Succinctly.
For a more ambitious read, mine the rich vein of SUNY Open Textbooks. My pick is Hiroki Sayama’s Introduction to the Modeling and Analysis of Complex Systems.
If you just can’t get enough of data science, then a few articles that caught my attention are:
 Christopher Olah’s brief but mindstretching post on Neural Networks, Manifolds, and Topology, which is good preparation for the Fujitsu Laboratories paper on Time Series Classification via Topological Data Analysis
 The paper by Nguyen and Holmes on their Bayesian Unidimensional Scaling (BUDS) method for detecting patterns in highdimensional data
 BouHamad et. al’s A review of survival trees, a valuable introduction to the literature on the subject
 Rob Hyndman’s recent post on Some new time series packages
 Mike Bostock’s beautiful and mindaltering post on Visualizing Algorithms.
Finally, if you really have some time on your hands, try searching through the 318M+ papers on PDFDRIVE.
Happy reading, and have a Happy and Prosperous New Year from all of us at RStudio!!
_____='https://rviews.rstudio.com/2017/12/29/downtimereading/';
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R Views. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Working with PDFs – scraping the PASS budget
(This article was first published on R on Locke Data Blog, and kindly contributed to Rbloggers)
Using tabulizer we’re able to extract information from PDFs so it comes in really handy when people publish data as a PDF! This post takes you through using tabulizer and tidyverse packages to scrape and clean up some budget data from PASS, an association for the Microsoft Data Platform community. The goal is to mainly show some of the tricks of the data wrangling trade that you may need to utilise when you scrape data from PDFs.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R on Locke Data Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Adding bananas from the commandline (extending the oomsifier)
(This article was first published on Clean Code, and kindly contributed to Rbloggers)
Sometimes you just want to add bananas from the commandline. Previously
I created a small script that takes an image and adds a dancing banana to the bottom left of the image. I wanted to make an API too, but that will have to wait till next year. Today we will create a commandline script that will do the same thing.
With the excellent explanation in Mark Sellors’ guide I have now created a cmdline thingy in very few steps.
I can now add bananas from the commandline with:
./bananafy.R ../images/ggplotexample.png out.gifThis executes and says:
Linking to ImageMagick 6.9.7.4 Enabled features: fontconfig, freetype, fftw, lcms, pango, x11 Disabled features: cairo, ghostscript, rsvg, webp writing bananafied image to out.gif The modified scriptFirst the script itself, saved as bananafy.R
#!/usr/bin/Rscript vanilla args < commandArgs(trailingOnly = TRUE) if (length(args) < 1){ stop("I think you forgot to input an image and output name? \n") } library(magick) ## Commandline version of add banana #banana < image_read("images/banana.gif") # this assumes you have a project with the folder /images/ inside. #add_banana < function(, offset = NULL, debug = FALSE){ offset < NULL # maybe a third argument here would be cool? debug < FALSE image_in < magick::image_read(args[[1]]) banana < image_read("../images/banana.gif") # 365w 360 h image_info < image_info(image_in) if("gif" %in% image_info$format ){stop("gifs are to difficult for me now")} stopifnot(nrow(image_info)==1) # scale banana to correct size: # take smallest dimension. target_height < min(image_info$width, image_info$height) # scale banana to 1/3 of the size scaling < (target_height /3) front < image_scale(banana, scaling) # place in lower right corner # offset is width and hieight minus dimensions picutre? scaled_dims < image_info(front) x_c < image_info$width  scaled_dims$width y_c < image_info$height  scaled_dims$height offset_value < ifelse(is.null(offset), paste0("+",x_c,"+",y_c), offset) if(debug) print(offset_value) frames < lapply(as.list(front), function(x) image_composite(image_in, x, offset = offset_value)) result < image_animate(image_join(frames), fps = 10) message("writing bananafied image to", args[[2]]) image_write(image = result, path = args[[2]])As you might notice I copied the entire thing from the previous post and added some extra Things
 It starts with ‘#!/usr/bin/Rscript’
According to Mark:
Sometimes called a ‘shebang’, this line tells the Linux and MacOS command line interpreters (which both default to one called ‘bash’), what you want to use to run the rest of the code in the file. ….The –vanilla on the end, tells Rscript to run without saving or restoring anything in the process. This just keeps things nice a clean.
I’ve added a message call that tells me where the script saves the image. I could have suppressed the magic messages, but meh, it is a proof of concept.
To make it work, you have to tell linux (which I’m working on) that it can execute the file. That means changing the permissions on that file
In the terminal you go to the projectfolder and type chmod +x bananafy.R. You CHange MODe by adding (+) eXecution rights to that file.
advanced use: making bananafy options available always and everywhere in the terminal.We could make the bananafyer available to you always in in every folder. T do that you could move the script to f.i. ~/scripts/, modify the code a bit and add the bananagif to that same folder. You then have to modify your bashrc file.
 I had to make the link to the banana hardcoded: ‘~/scripts/images/banana.gif’
 you can call the code from anywhere and the output of the script will end up in the folder you currently are in. So if I’m in ~/pictures/reallynicepictures the bananafied image will be there.
Adding bananas from the commandline (extending the oomsifier) was originally published by at Clean Code on December 29, 2017.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Clean Code. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
cyclic riddle [not in NYC!]
(This article was first published on R – Xi'an's Og, and kindly contributed to Rbloggers)
In the riddle of this week on fivethirtyeight, the question is to find the average number of rounds when playing the following game: P=6 players sitting in a circle each have B=3 coins and with probability 3⁻¹ they give one coin to their right or left side neighbour, or dump the coin to the centre. The game ends when all coins are in the centre. Coding this experiment in R is rather easy
situz=rep(B,P) r=1 while (max(situz)>0){ unz=runif(P) newz=situz(situz>0) for (i in (1:P)[unz<1/3]) newz[i%%P+1]=newz[i%%P+1]+(situz[i]>0) for (i in (1:P)[unz>2/3]) newz[i1+P*(i==1)]=newz[i1+P*(i==1)]+(situz[i]>0) situz=newz r=r+1}resulting in an average of 15.58, but I cannot figure out (quickly enough) an analytic approach to the problem. (The fit above is to a Gamma(â1,ĝ) distribution.)
In a completely unrelated aparté (aside), I read earlier this week that New York City had prohibited the use of electric bikes. I was unsure of what this meant (prohibited on sidewalks? expressways? metro carriages?) so rechecked and found that electric bikes are simply not allowed for use anywhere in New York City. Because of the potential for “reckless driving”. The target is apparently the high number of delivery ebikes, but this ban sounds so absurd that I cannot understand it passed. Especially when De Blasio has committed the city to the Paris climate agreement after Trump moronically dumped it… Banning the cars would sound much more in tune with this commitment! (A further aparté is that I strongly dislike ebikes, running on nuclear power plants, especially when they pass me on sharp hills!, but they are clearly a lesser evil when compared with mopeds and cars.)
Filed under: Kids, R, Running, Travel Tagged: car accidents, ebike, electric bike, FiveThirtyEight, game, mathematical puzzle, New York city, R, random walk, reckless driving, The Riddler
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Announcing rquery
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
We are excited to announce the rquery R package.
rquery is WinVector LLC‘s currently in development big data query tool for R.
rquery supplies set of operators inspired by Edgar F. Codd‘s relational algebra (updated to reflect lessons learned from working with R, SQL, and dplyr at big data scale in production).
As an example: rquery operators allow us to write our earlier “treatment and control” example as follows.
dQ < d %.>% extend_se(., if_else_block( testexpr = "rand()>=0.5", thenexprs = qae( a_1 := 'treatment', a_2 := 'control'), elseexprs = qae( a_1 := 'control', a_2 := 'treatment'))) %.>% select_columns(., c("rowNum", "a_1", "a_2"))rquery pipelines are firstclass objects; so we can extend them, save them, and even print them.
cat(format(dQ)) table('d') %.>% extend(., ifebtest_1 := rand() >= 0.5) %.>% extend(., a_1 := ifelse(ifebtest_1,"treatment",a_1), a_2 := ifelse(ifebtest_1,"control",a_2)) %.>% extend(., a_1 := ifelse(!( ifebtest_1 ),"control",a_1), a_2 := ifelse(!( ifebtest_1 ),"treatment",a_2)) %.>% select_columns(., rowNum, a_1, a_2)rquery targets only databases, and right now primarilly SparkSQL and PostgreSQL. rquery is primarily a SQL generator, allowing it to avoid some of the tradeoffs required to directly support inmemory data.frames. We demonstrate converting the above rquery pipeline into SQL and executing it here.
rquery itself is still in early development (and not yet ready for extensive use in production), but it is maturing fast, and we expect more rquery announcements going forward. Our current intent is to bring in sponsors, partners, and R community voices to help develop and steer rquery.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Getting started with dplyr in R using Titanic Dataset
(This article was first published on R Programming – DataScience+, and kindly contributed to Rbloggers)
dplyr is one of the most popular rpackages and also part of tidyverse that’s been developed by Hadley Wickham. The mere fact that dplyr package is very famous means, it’s one of the most frequently used. Being a data scientist is not always about creating sophisticated models but Data Analysis (Manipulation) and Data Visualization play a very important role in BAU of many us – in fact, a very important part before any modeling exercise since Feature Engineering and EDA are the most important differentiating factors of your model and someone else’s.
Hence, this post aims to bring out some wellknown and notsowellknown applications of dplyr so that any data analyst could leverage its potential using a much familiar – Titanic Dataset.
dplyr library can be installed directly from CRAN and loaded into R session like any other R package.
#install.packages('dplyr') library(dplyr) # Loading Dplyr packageLet us start by reading the input training file using the base r function read.csv
train < read.csv('../input/train.csv',stringsAsFactors = F, header = T)Getting the total number of rows in the given data frame (even though it’s been very straightforward with nrow() in baser, this being a dplyr starterkit, we’ll start with that.
train%>% count() n 891The above code just gives the row count of the data frame that’s been passed with the pipe %>% operator. The pipe operator works very similar to the  (pipe) operator in Unix environment where the output of the current operation is fed as the input of the following operation. Similarly, in dplyr or any other package that supports pipe operator, the functions in it will always take only data frame as the first argument hence the function can be called in two ways like below:
count(train) #Without pipe, passing the df as the first argument train %>% count() #with pipe, more convenient and more readability n 891 n 891But dplyr’s real flavor starts with the following 5 functions (or as most people call, verbs of dplyr):
 select()
 filter()
 arrange()
 mutate()
 summarise()
 group_by()
And let us see what every one of these does!
selectselect() as the name suggests selects the columns that are required from a given dataframe and if multiple columns are required or not required, then one_of() could be used within select.
select(train,Age) #without pipe Age 22 38 26 35 35 NA 54 2 27 14 4 #multicolumn selection train %>% select(one_of('Sex','Age')) Sex Age male 22 female 38 female 26 female 35 male 35 male NA male 54 male 2 female 27 female 14 female 4 #multicolumn rejection train %>% select(one_of('Age','Sex')) PassengerId Survived Pclass Name SibSp Parch Ticket Fare Cabin Embarked 1 0 3 Braund, Mr. Owen Harris 1 0 A/5 21171 7.2500 S 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 1 0 PC 17599 71.2833 C85 C 3 1 3 Heikkinen, Miss. Laina 0 0 STON/O2. 3101282 7.9250 S 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 0 113803 53.1000 C123 S 5 0 3 Allen, Mr. William Henry 0 0 373450 8.0500 S 6 0 3 Moran, Mr. James 0 0 330877 8.4583 Q 7 0 1 McCarthy, Mr. Timothy J 0 0 17463 51.8625 E46 S 8 0 3 Palsson, Master. Gosta Leonard 3 1 349909 21.0750 S 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 0 2 347742 11.1333 S 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) 1 0 237736 30.0708 CLike selecting a column with entire column name (or multiple column names with one_of()), select could also be used with a few more string ops.
train %>% select(starts_with('P')) PassengerId Pclass Parch 1 3 0 2 1 0 3 3 0 4 1 0 5 3 0 6 3 0 7 1 0 8 3 1 9 3 2 10 2 0 train %>% select(ends_with('e')) Name Age Fare Braund, Mr. Owen Harris 22 7.2500 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38 71.2833 Heikkinen, Miss. Laina 26 7.9250 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35 53.1000 Allen, Mr. William Henry 35 8.0500 Moran, Mr. James NA 8.4583 McCarthy, Mr. Timothy J 54 51.8625 Palsson, Master. Gosta Leonard 2 21.0750 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 27 11.1333 Nasser, Mrs. Nicholas (Adele Achem) 14 30.0708 Sandstrom, Miss. Marguerite Rut 4 16.7000 group_bygroup_by is a lot similar to SQL Group by but more versatile. It is related to concept of “splitapplycombine”. Let us understand group_by with a starter example of finding out number of male and number of female – which logically could be the count of each Sex Type (once grouped by Sex).
train %>% group_by(Sex) %>% count() Sex n female 314 male 577Aha! That seems simple and now let us do a two level grouping to understand how many of survived of each gender.
train %>% group_by(Survived, Sex) %>% count() train %>% group_by(Sex, Survived) %>% count() Survived Sex n 0 female 81 0 male 468 1 female 233 1 male 109 Sex Survived n female 0 81 female 1 233 male 0 468 male 1 109 mutate and summariseThat’s minimally group_by, but the true power of group_by is unveiled only when it is coupled with mutate and summarise functions.
Mutate function adds a new column based on the given expression while summarise function summarises the dataset based on the given function and let us see the difference in action with the following example.
Let us get the average age of all survivors (and nonsurvivors): so this must be group_by ed based on Survived while summarised by Age so that we will get a summarised mean value.for two groups.
train %>% group_by(Survived) %>% summarise(mean(Age)) #Remember we have got NAs, so mean() wouldn't work and to bypass NAs, na.rm = T must be passed. train %>% group_by(Survived) %>% summarise(average_age = mean(Age,na.rm=T)) Survived mean(Age) 0 NA 1 NA Survived average_age 0 30.62618 1 28.34369That’s summarise() giving us the summary of the dataframe. If we need to create a new column, values filled for all 891 datapoints, that’s where mutate plays its role. Let us create a new column, Age_Bracket containing value Minor if Age is less than 18 else Major
train %>% mutate(Age_Bracket = ifelse(Age < 18, 'Minor','Major')) %>% select(starts_with('Age')) #In fact this can be coupled with Survivor list to see the impact of this Age_bracket train %>% mutate(Age_Bracket = ifelse(Age < 18, 'Minor','Major')) %>% group_by(Survived,Age_Bracket) %>% summarise(pnt = (n()/nrow(train))*100) Age Age_Bracket 22 Major 38 Major 26 Major 35 Major 35 Major NA NA 54 Major 2 Minor 27 Major 14 Minor Survived Age_Bracket pnt 0 Major 41.750842 0 Minor 5.836139 0 NA 14.029181 1 Major 25.701459 1 Minor 6.846240 1 NA 5.836139That’s how dplyr can get more powerful with group_by coupled with mutate or summarise for feautre engineering and for better data visualization. But this doesn’t stop here, because one of the most important function a dataanalyst would require is sorting and that’s what arrange() does.
arrangeExtracting last 4 results after sorting the fare in asending order:
train %>% arrange(Fare) %>% tail(4) PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 870 319 1 1 Wick, Miss. Mary Natalie female 31 0 2 36928 164.8667 C7 S 871 857 1 1 Wick, Mrs. George Dennick (Mary Hitchcock) female 45 1 1 36928 164.8667 S 872 690 1 1 Madill, Miss. Georgette Alexandra female 15 0 1 24160 211.3375 B5 S 873 731 1 1 Allen, Miss. Elisabeth Walton female 29 0 0 24160 211.3375 B5 S 874 780 1 1 Robert, Mrs. Edward Scott (Elisabeth Walton McMillan) female 43 0 1 24160 211.3375 B3 SArrange in descending order:
train %>% arrange(desc(Age)) %>% head(5) PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 631 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0 27042 30.0000 A23 S 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 S 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C 494 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 C 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 Q filterfilter does row_wise filter ( similar to what select did with columns). filter() takes a logical expression and evaluates them and results the only_true datapoints. So to be clear, all that matters to filter() function is if the expression evaluates to TRUE.
Let us start with filtering (extracting) only male and getting their Embarked station count.
And this is dplyr in a nut shell and hope you get a decent start with this article, if you are a beginner. Please share your thoughts and suggestions in comments!. The notebook used here is available on my github.
References
Related Post
 How to apply Monte Carlo simulation to forecast Stock prices using Python
 Analysing iOS App Store iTunes Reviews in R
 Handling ‘Happy’ vs ‘Not Happy’: Better sentiment analysis with sentimentr in R
 Creating Reporting Template with Glue in R
 Predict Employee Turnover With Python
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Visualising US Voting Records with shinydashboard
(This article was first published on Computational Social Science, and kindly contributed to Rbloggers)
Introducing adavisMy second ever post on this blog was on introducing adamap, a Shiny app that maps Americans for Democratic Action voting scores (the socalled Liberal Quotient) between 19472015. It was built with highcharter, and hence it was nicely interactive but quite slow. I wanted to switch to another package since, and when I eventually ran into statebins, I knew what had to be done.
I was certain that statebins would definitely add some oomph to the design, but because it’s so easy to implement, I had some spare time to do other things. As it is often the case, one thing led to the other, and I came to the conclusion that the revamped app should feature one plot from every major graphics package. Of course, a strict implementation of that statement would be quite difficult, so I downgraded the challenge to just four plots using a different package each time. I ended with statebins for statelevel mapping, plotly for plotting changes from the previous year, ggExtra for headtohead state comparisons, and *drum roll* base R for graphing a single politican’s voting record over time. It turned out to be fun.
As I have already explained the data setup in my previous post linked above, I will skip to the visualisations. I won’t be including the actual code for the Shiny app, which includes reactive elements throughout. You can fork the code underlying the live dashboard on GitHub and/or run a local copy via the runGitHub code provided there. Also, none of the codes are evaluated here (as I translate them from the app), so they will not work if you just plug them in. Hence, there are primarily for motivation rather than replication.
Slicker US with StatebinsThere are many good things about statebins. First, you get identicalsized states so you don’t get biased by the variation in their size. It’s cleaner by definition, not featuring idiosyncratic shapes that are found in nature. Also, it plays really nice with viridis, which is important (maybe). In addition, you can define light and dark labels for the state abbreviations, ensuring they will not blend into the fill colour. statebins can be called as a function (as I did), or applied later to a ggplot object. The only thing that did not work for me was the ggplot2_scale_function argument; R kept saying no such function is defined (I’m using the dev version from GitHub) so I ended up passing the fill colours separately. It gives a warning about overriding the existing fill, but works otherwise. If you download the dataset and want to visualise how House Democrats in 1962 voted, something along the lines of:
#Not evaluated library(ggplot2) library(viridis) library(statebins) library(hrbrthemes) theme_set(theme_ipsum_rc()) #assumes data = data, year as "Year", state names in "State", voting scores in "ADA" etc. us < statebins(data[data$Year == 1962 & data$Chamber == 1 & data$Party == "Democrat", ], state_col = "State", value_col = "ADA", round = TRUE, font_size = 7, state_border_col = "#232d33", dark_label = "white", light_label = "black", state_border_size = .8, radius = grid::unit(10, "pt")) + labs(title = "") + theme(axis.text.x = element_blank(), axis.text.y = element_blank(), axis.ticks = element_blank(), panel.grid.minor = element_blank(), panel.grid.major = element_blank(), panel.background = element_rect(fill = "white", linetype = "blank"), legend.position = c(.075, .85), legend.direction = "horizontal", legend.text = element_text(colour = "#232d33", size = 14), legend.title = element_text(colour = "#232d33", size = 18), legend.key.height = grid::unit(.01, "snpc"), legend.key.width = grid::unit(.05, "snpc"), plot.margin = margin(1, 0, 0, 0, "cm")) us + scale_fill_viridis(direction = 1, breaks = c(seq(25, 100, 25)), labels = c("No Rep", paste(seq(0, 100, 25), "% ")), guide = guide_legend(title = "", title.position = "top", keywidth = 2, keyheight = 2, ncol = 1))should do the trick. Now, I am totally cheating because the image is from the live app and the above code is not evaluated. However, it should give you an idea, mainly most of the clutter is about the layout rather than the content. Can we get someone with clout to ping Hadley and Bob regarding any updates on issue#4 from 2015 so we get magically created bins (and other shapes, for the ambitious) not just for US states but for everything? Cheers.
One trick regarding the data; the original data only have scores for selected representatives (naturally). Meaning, in any given year, there will be several states (approx. 810 per party) with no Democrat or Republican reps. As these are rowwise missing instead of NA, if you plot them as they are, those states will not show in the plot. If only there was a tidyverse function that would solve common data problems like this…
library(tidyverse) #Add rows for missing states in partyyear #Use builtin states data states < data.frame(state.name, stringsAsFactors = FALSE) states$state.no < 1:50 dataset < merge(dataset, states, by.x = "State", by.y = "state.name") #I couldn't get this to work with strings so matched them by state.no dataset < dataset %>% tidyr::complete(state.no = full_seq(state.no, period = 1), Year, Chamber, Party, fill = list(ADA = 25, aADA = 25)) #Arbitrary low score instead of NA dataset$State < ifelse(is.na(dataset$State), states[dataset$state.no, 1], dataset$State) Interactive Charts with PlotlyMoving on to challenge number #2, I wanted to keep to the same filter (Year > Chamber > Party), but with the amount of change from last year plotted instead. I haven’t used plotly much before so I learned onthego, but it has robust documentation if you are considering delving into it.
The main reason for going with plotly was its buildin interactivity. I wanted the users to just hover over points and see a block of text describing the shift from the previous year. This turned out to be easy, just with a somewhat ugly paste. One curious thing was the alpha functionality, which is governed with toRGB("colour", "alpha"), but called opacity in plotly. In the app, the default is showing Senate Republicans in 1990 (i.e. difference from 1989):
library(plotly) #Store common args ax < list( showline = FALSE, showticklabels = TRUE, showgrid = FALSE) #Mandatory data should be subsetted before comment plot_ly(data, x = ~Score, key = data[, 1], y = ~Change, mode = "markers", type = "scatter", hoverinfo = "text", hoverlabel = list(font = list(family = "Roboto Condensed", size = 14)), #Add custom hover text text = ~paste(data$Chamber, data$Party, "from", State, "voted\n", paste0(abs(round(Change, 2)), "% more"), Label, "in", data$Year), color = ~Change, colors = viridis(direction = 1, n = 12), marker = list(size = 30, opacity = .7)) %>% layout(dragmode = "select", showlegend = FALSE, xaxis = c(ax, list(title = "Selected Year Voting Score", zeroline = FALSE)), yaxis = c(ax, list(title = "Change from Last Year", zeroline = TRUE, zerolinecolor = toRGB("black", .05))), font = list(family = "Roboto Condensed")) %>% config(displayModeBar = FALSE)I admit today’s code chunks are a bit like this, so if you have any questions, just fire away.
Last but not least…Base RWhen I saw this brilliant post on Tufte, the plot I wanted to replicate the most was the very first one. The one that was done in base R. Some might even argue I added a representative lookup tab to the app just for trying this out. Hmm. Like plotly, I was out of practice with base R graphics, so I mimicked the original code as much as I could. One thing I wanted to convey with this graph is the consistency of a single politician over their tenure. I didn’t want to show minima and maxima, but just their mean score with some sort of confidence measure. I also learned that you can pass Greek letters with expression(), which is handy. Say, you want to plot the complete voting history of Nancy Pelosi:
#Store descriptives v1 < mean(data$Score) v2 < sd(data$Score) v3 < min(data$Year) v4 < max(data$Year) v5 < summary(data$Year) #Base plot, data should be a representative subset plot(data$Score ~ data$Year, xlab = "", ylab = "", axes = FALSE, family = "Roboto Condensed", pch = 16, type = "b", lwd = 2) #Upper sd abline(h = v1 + v2, lty = 2, col = alpha("black", .2)) #Mean abline(h = v1, lty = 2) #Lower sd abline(h = v1  v2, lty = 2, col = alpha("black", .2)) #Right axis axis(1, at = c(v3, v4, (v3 + v4) / 2), labels = c(v3, v4, round((v3 + v4) / 2, 0)), tick = FALSE, family = "Roboto Condensed") #Bottom axis axis(2, at = c(v1, v1 + v2, v1  v2), labels = round(c(v1, v1 + v2, v1  v2), 0), las = 2, family = "Roboto Condensed", tick = FALSE, lty = 0) #Left axis axis(4, at = c(v1, v1 + v2, v1  v2), lwd = 0, las = 2, labels = c(expression(mu), expression(sigma), expression(sigma)), col = alpha("black", .2), family = "Roboto Condensed", ps = 20) Viridis OptionsI might be relying on viridis a lot, although I also utilise the RColorBrewer package as well (Map of Westeros, anyone?). To be honest, I more or less only like the default palette, the namesake or option = "D", but others might fancy some good old diversity. To this end, I added a dropdown menu for switching viridis palettes, and a button for changing the direction (i.e. whether the palette should start from the lightest or the darkest colour). Both of these options are global, so you can switch any time at any tab. Except for the base R plot; that looks much better in black, Tufte style.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Computational Social Science. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Six Reasons To Learn R For Business
(This article was first published on businessscience.io  Articles, and kindly contributed to Rbloggers)
Data science for business (DS4B) is the future of business analytics yet it is really difficult to figure out where to start. The last thing you want to do is waste time with the wrong tool. Making effective use of your time involves two pieces: (1) selecting the right tool for the job, and (2) efficiently learning how to use the tool to return business value. This article focuses on the first part, explaining why R is the right choice in six points. Our next article will focus on the second part, learning R in 12 weeks.
Reason 1: R Has The Best Overall QualitiesThere are a number of tools available business analysis/business intelligence (with DS4B being a subset of this area). Each tool has its pros and cons, many of which are important in the business context. We can use these attributes to compare how each tool stacks up against the others! We did a qualitative assessment using several criteria:
 Business Capability (1 = Low, 10 = High)
 Ease of Learning (1 = Difficult, 10 = Easy)
 Cost (Free/Minimal, Low, High)
 Trend (0 = Fast Decline, 5 = Stable, 10 = Fast Growth)
Further discussion on the assessment is included in the Appendix at the end of the article.
What we saw was particularly interesting. A trendline developed exposing a tradeoff between learning curve and DS4B capability rating. The most flexible tools are more difficult to learn but tend to have higher business capability. Conversely, the “easytolearn” tools are often not the best longterm tools for business or data science capability. Our opinion is go for capability over ease of use.
Of the top tools in capability, R has the best mix of desirable attributes including high data science for business capability, low cost, and it’s growing very fast. The only downside is the learning curve. The rest of the article explains why R is so great for business.
Reason 2: R Is Data Science For NonComputer ScientistsIf you are seeking highperformance data science tools, you really have two options: R or Python. When starting out, you should pick one. It’s a mistake to try to learn both. Your choice comes down to what’s right for you. The difference between the R and Python has been described in numerous infographics and debates online, but the most overlooked reason is personprogramming language fit. Don’t understand what we mean? Let’s break it down.
Fact 1: Most people interested in learning data science for business are not computer scientists. They are business professionals, nonsoftware engineers (e.g. mechanical, chemical), and other technicaltobusiness converts. This is important because of where each language excels.
Fact 2: Most activities in business and finance involve communication. This comes in the form of reports, dashboards, and interactive web applications that allow decision makers to recognize when things are not going well and to make wellinformed decisions that improve the business.
Now that we recognize what’s important, let’s learn about the two major players in data science.
About PythonPython is a general service programming language developed by software engineers that has solid programming libraries for math, statistics and machine learning. Python has bestinclass tools for pure machine learning and deep learning, but lacks much of the infrastructure for subjects like econometrics and communication tools such as reporting. Because of this, Python is wellsuited for computer scientists and software engineers.
About RR is a statistical programming language developed by scientists that has open source libraries for statistics, machine learning, and data science. R lends itself well to business because of its depth of topicspecific packages and its communciation infrastructure. R has packages covering a wide range of topics such as econometrics, finance, and time series. R has bestinclass tools for visualization, reporting, and interactivity, which are as important to business as they are to science. Because of this, R is wellsuited for scientists, engineers and business professionals.
What Should You Do?Don’t make the decision tougher than what it is. Think about where you are coming from:

Are you a computer scientist or software engineer? If yes, choose Python.

Are you an analytics professional or mechanical/industrial/chemical engineer looking to get into data science? If yes, choose R.
Think about what you are trying to do:

Are you trying to build a selfdriving car? If yes, choose Python.

Are you trying to communicate business analytics throughout your organization? If yes, choose R.
Learning R used to be a major challenge. Base R was a complex and inconsistent programming language. Structure and formality was not the top priority as in other programming languages. This all changed with the “tidyverse”, a set of packages and tools that have a consistently structured programming interface.
When tools such as dplyr and ggplot2 came to fruition, it made the learning curve much easier by providing a consistent and structured approach to working with data. As Hadley Wickham and many others continued to evolve R, the tidyverse came to be, which includes a series of commonly used packages for data manipulation, visualization, iteration, modeling, and communication. The end result is that R is now much easier to learn (we’ll show you in our next article!).
Source: tidyverse.org
R continues to evolve in a structured manner, with advanced packages that are built on top of the tidyverse infrastructure. A new focus is being placed on modeling and algorithms, which we are excited to see. Further, the tidyverse is being extended to cover topical areas such as text (tidytext) and finance (tidyquant). For newcomers, this should give you confidence in selecting this language. R has a bright future.
Reason 4: R Has Brains, Muscle, And HeartSaying R is powerful is actually an understatement. From the business context, R is like Excel on steroids! But more important than just muscle is the combination of what R offers: brains, muscle, and heart.
R has brainsR implements cuttingedge algorithms including:
 H2O (h2o) – Highend machine learning package
 Keras/TensorFlow (keras, tensorflow) – Goto deep learning packages
 xgboost – Top Kaggle algorithm
 And many more!
These tools are used everywhere from AI products to Kaggle Competitions, and you can use them in your business analyses.
R has muscleR has powerful tools for:
 Vectorized Operations – R uses vectorized operations to make math computations lightning fast right out of the box
 Loops (purrr)
 Parallelizing operations (parallel, future)
 Speeding up code using C++ (Rcpp)
 Connecting to other languages (rJava, reticulate)
 Working With Databases – Connecting to databases (dbplyr, odbc, bigrquery)
 Handling Big Data – Connecting to Apache Spark (sparklyr)
 And many more!
We already talked about the infrastructure, the tidyverse, that enables the ecosystem of applications to be built using a consistent approach. It’s this infrastructure that brings life into your data analysis. The tidyverse enables:
 Data manipulation (dplyr, tidyr)
 Working with data types (stringr for strings, lubridate for date/datetime, forcats for categorical/factors)
 Visualization (ggplot2)
 Programming (purrr, tidyeval)
 Communication (Rmarkdown, shiny)
Two major advantages of R versus every other programming language is that it can produce businessready reports and machine learningpowered web applications. Neither Python or Tableau or any other tool can currently do this as efficiently as R can. The two capabilities we refer to are rmarkdown for report generation and shiny for interactive web applications.
RmarkdownRmarkdown is a framework for creating reproducible reports that has since been extended to building blogs, presentations, websites, books, journals, and more. It’s the technology that’s behind this blog, and it allows us to include the code with the text so that anyone can follow the analysis and see the output right with the explanation. What’s really cool is that the technology has evolved so much. Here are a few examples of its capability:
 rmarkdown for generating HTML, Word and PDF reports
 rmarkdown for generating presentations
 flexdashboard for creating web apps via the userfriendly Rmarkdown format.
 blogdown for building blogs and websites
 bookdown for creating online books
 Interactive documents
 Parameterized reports for generating custom reports (e.g. reports for a specific geographic segment, department, or segment of time)
Source: shiny.rstudio.com
Shiny is a framework for creating interactive web applications that are powered by R. Shiny is a major consulting area for us as four of five assignments involve building a web application using shiny. It’s not only powerful, it enables nondata scientists to gain the benefit of data science via interactive decision making tools. Here’s an example of a Google Trend app built with shiny.
Reason 6: R Community SupportBeing a powerful language alone is not enough. To be successful, a language needs community support. We’ll hit on two ways that R excels in this respects: CRAN and the R Community.
CRAN: CommunityProvided R PackagesCRAN is like the Apple App store, except everything is free, super useful, and built for R. With over 14,000 packages, it has most everything you can possibly want from machine learning to highperformance computing to finance and econometrics! The task views cover specific areas and are one way to explore R’s offerings. CRAN is communitydriven, with top open source authors such as Hadley Wickham and Dirk Eddelbuettel leading the way. Package development is a great way to contribute to the community especially for those looking to showcase their coding skills and give back!
Community SupportYou begin with R because of its capability, you stay with R because of its community. The R Community is the coolest part. It’s tightknit, opinionated, fun, silly, and highly knowledgeable… all of the things you want in a high performing team.
Social/WebFor your #rstats holiday wish list consideration ????????https://t.co/dXYEAYXpzK pic.twitter.com/caQu53Czy8
— RLadies DC (@RLadiesDC) December 2, 2017
R users can be found all over the web. A few of the popular hangouts are:
 RBloggers
 #rstats on Twitter
 The R Project for Statistical Computing group on LinkedIn
Rfocused business conferences are gaining traction in a big way. Here are a few that we attend and/or will be attending in the future:
 EARL – Mango Solution’s conference on enterprise and business applications of R
 R/Finance – Communityhosted conference on financial asset and portfolio analytics and applied finance
 Rstudio Conf – Rstudio’s technology conference
 New York R – Business and technologyfocused R conference
A full list of Rconferences can be found here.
MeetupsA really cool thing about R is that many major cities have a meetup nearby. Meetups are exactly what you think: a group of Rusers getting together to talk R. They are usually funded by RConsortium. You can get a full list of meetups here.
ConclusionR has a wide range of benefits making it our obvious choice for Data Science for Busienss (DS4B). That’s not to say that Python isn’t a good choice as well, but, for the widerange of needs for business, there’s nothing that compares to R. In this article we saw why R is a great choice. In the next article we’ll show you how to learn R in 12 weeks.
About Business ScienceBusiness Science specializes in “ROIdriven data science”. Our focus is machine learning and data science in business and financial applications. We build web applications and automated reports to put machine learning in the hands of decision makers. Visit the Business Science or contact us to learn more!
Business Science UniversityInterested in learning data science for business? Enroll in Business Science University. We’ll teach you how to apply data science and machine learning in realworld business applications. We take you through the entire process of modeling problems, creating interactive data products, and distributing solutions within an organization. We are launching courses in early 2018!
Follow Business Science on Social Media @bizScienc is on twitter!
 Check us out on Facebook page!
 Check us out on LinkedIn!
 Sign up for our insights blog to stay updated!
 If you like our software, star our GitHub packages!
Here’s some additional information on the tool assessment. We have provided the code used to make the visualization, the criteria explanation, and the tool assessment.
Criteria ExplanationOur assessment of the most powerful DS4B tools was based on three criteria:

Business Capability (1 = Low, 10 = High): How wellsuited is the tool for use in the business? Does it include features needed for the business including advanced analytics, interactivity, communication, interactivity, and web apps?

Ease of Learning (1 = Difficult, 10 = Easy): How easy is it to pick up? Can you learn it in a week of short courses or will it take a longer time horizon to become proficient?

Cost (Free/Minimal, Low, High): Cost has two undesirable effects. From a firstorder perspective, the organization has to spend money. This is not inandofitself undesirable because the software companies can theoretically spend on R&D and other efforts to advance the product. The secondorder effect of lowering adoption is much more concerning. Highcost tools tend to have much less discussion in the online world, whereas open source or lowcost tools have great trends.

Trend (0 = Fast Decline, 5 = Stable, 10 = Fast Growth): We used StackOverflow Insights of questions as a proxy for the trend of usage over time. A major assumption is that growing number of Stack Overflow questions is that the usage is also increasing in a similar trend.
Source: Stack Overflow Trends
Individual Tool Assessment R: DS4B Capability = 10: Has it all. Great data science capability, great visualization libraries, Shiny for interactive web apps, rmarkdown for professional reporting.
 Learning Curve = 4: A lot to learn, but learning is getting easier with the tidyverse.
 Trend = 10: Stack overflow questions are growing at a very fast pace.
 Cost = Low: Free and open source
 DS4B Capability = 7: Has great machine learning and deep learning libraries. Can connect to any major database. Communication is limited by flask / Django web applications, which can be difficult to build. Does not have a business reporting infrastructure comparable to rmarkdown.
 Learning Curve = 4: A lot to learn, but learning is relatively easy compared to other object oriented programming languages like Java.
 Trend = 10: Stack overflow questions are growing at a very fast pace.
 Cost = Low: Free and open source
 DS4B Capability = 4: Mainly a spreadsheet software but has programming built in with VBA. Difficult to integrate R, but is possible. No data science libraries.
 Learning Curve = 10: Relatively easy to become an advanced user.
 Trend = 7: Stack overflow questions are growing at a relatively fast pace.
 Cost = Low: Comes with Microsoft Office, which most organizations use.
 DS4B Capability = 6: Has R integrated, but is very difficult to implement advanced algorithms and not as flexible as R+shiny.
 Learning Curve = 7: Very easy to pick up.
 Trend = 6: Stack overflow questions are growing at a relatively fast pace.
 Cost = Low: Free public version. Enterprise licenses are relatively affordable.
 DS4B Capability = 5: Similar to Tableau, but not quite as featurerich. Can integrate R to some extent.
 Learning Curve = 8: Very easy to pick up.
 Trend = 6: Expected to have same trend as Tableau.
 Cost = Low: Free public version. Licenses are very affordable.
 DS4B Capability = 6: Can do a lot with it, but lacks the infrastructure to use for business.
 Learning Curve = 2: Matlab is quite difficult to learn.
 Trend = 1: Stack overflow growth is declining at a rapid pace.
 Cost = High: Matlab licenses are very expensive. Licensing structure does not scale well.
 DS4B Capability = 8: Has data science, database connection, business reporting and visualization capabilities. Can also build applications. However, limited by closedsource nature. Does not get latest technologies like tensorflow and H2O.
 Learning Curve = 4: Similar to most data science programming languages for the tough stuff. Has a GUI for the easy stuff.
 Trend = 3: Stack Overflow growth is declining.
 Cost = High: Expensive for licenses. Licensing structure does not scale well.
library(ggrepel)
data_apps < tribble(
~application, ~business_capability, ~ease_of_learning, ~trend, ~cost,
"R", 10, 4, 10, "Free",
"Python", 7, 4, 10, "Free",
"Excel", 4, 10, 7, "Low",
"Tableau", 6, 7, 6, "Low",
"PowerBI", 5, 8, 6, "Low",
"Matlab", 6, 2, 1, "High",
"SAS", 8, 4, 3, "High"
)
cap < paste0(
"Why R? Tools like Excel, Tableau, PowerBI are easier to learn, but have lower ",
"Business Capability. Tools like Python, SAS, and Matlab have high ",
"Data Science Capability, but lack the visualization and interactive ",
"application tools needed for business. R has the best data science, visualization, ",
" and interactive tools plus it's free!"
)
data_apps %>%
ggplot(aes(x = business_capability, y = ease_of_learning,
color = cost, size = trend)) +
geom_point() +
geom_label_repel(aes(label = application, fill = application),
size = 3.5,
fontface = 'bold', color = 'white',
box.padding = 0.1, point.padding = 0.5,
segment.color = 'grey50', segment.size = 1) +
geom_smooth(color = palette_dark()[[1]], method = "lm", se = FALSE, show.legend = F) +
expand_limits(x = c(0, 10), y = c(0, 10)) +
theme_tq() +
theme(legend.direction = "vertical") +
scale_fill_tq() +
scale_color_tq() +
scale_y_continuous(breaks = seq(0, 10, 2)) +
scale_x_continuous(breaks = 0:10) +
scale_size_continuous(range = c(2, 14)) +
labs(title = "DS4B Tools: Capability Vs Learning Curve",
subtitle = "R has a longer learning curve but has a massive business capability rating",
caption = label_wrap_gen(115)(cap),
x = "Data Science For Business Capability Rating",
y = "Learning Curve Rating",
color = "Cost",
size = "Trend",
fill = "Tool") var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: businessscience.io  Articles. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Building formulae
(This article was first published on Econometrics and Free Software, and kindly contributed to Rbloggers)
This Stackoverflow question made me think about how to build formulae. For example, you might want to programmatically build linear model formulae and then map these models on data. For example, suppose the following (output suppressed):
data(mtcars) lm(mpg ~ hp, data = mtcars) lm(mpg ~I(hp^2), data = mtcars) lm(mpg ~I(hp^3), data = mtcars) lm(mpg ~I(hp^4), data = mtcars) lm(mpg ~I(hp^5), data = mtcars) lm(mpg ~I(hp^6), data = mtcars)To avoid doing this, one can write a function that builds the formulae:
create_form = function(power){ rhs = substitute(I(hp^pow), list(pow=power)) rlang::new_formula(quote(mpg), rhs) }If you are not familiar with substitute(), try the following to understand what it does:
substitute(y ~ x, list(x = 1)) ## y ~ 1Then using rlang::new_formula() I build a formula by providing the left hand side, which is quote(mpg) here, and the right hand side, which I built using substitute(). Now I can create a list of formulae:
library(tidyverse) list_formulae = map(seq(1, 6), create_form) str(list_formulae) ## List of 6 ## $ :Class 'formula' language mpg ~ I(hp^1L) ## .. .. attr(*, ".Environment")= ## $ :Class 'formula' language mpg ~ I(hp^2L) ## .. .. attr(*, ".Environment")= ## $ :Class 'formula' language mpg ~ I(hp^3L) ## .. .. attr(*, ".Environment")= ## $ :Class 'formula' language mpg ~ I(hp^4L) ## .. .. attr(*, ".Environment")= ## $ :Class 'formula' language mpg ~ I(hp^5L) ## .. .. attr(*, ".Environment")= ## $ :Class 'formula' language mpg ~ I(hp^6L) ## .. .. attr(*, ".Environment")=As you can see, power got replaced by 1, 2, 3,… and each element of the list is a nice formula. Exactly what lm() needs. So now it’s easy to map lm() to this list of formulae:
data(mtcars) map(list_formulae, lm, data = mtcars) ## [[1]] ## ## Call: ## .f(formula = .x[[i]], data = ..1) ## ## Coefficients: ## (Intercept) I(hp^1) ## 30.09886 0.06823 ## ## ## [[2]] ## ## Call: ## .f(formula = .x[[i]], data = ..1) ## ## Coefficients: ## (Intercept) I(hp^2) ## 24.3887252 0.0001649 ## ## ## [[3]] ## ## Call: ## .f(formula = .x[[i]], data = ..1) ## ## Coefficients: ## (Intercept) I(hp^3) ## 2.242e+01 4.312e07 ## ## ## [[4]] ## ## Call: ## .f(formula = .x[[i]], data = ..1) ## ## Coefficients: ## (Intercept) I(hp^4) ## 2.147e+01 1.106e09 ## ## ## [[5]] ## ## Call: ## .f(formula = .x[[i]], data = ..1) ## ## Coefficients: ## (Intercept) I(hp^5) ## 2.098e+01 2.801e12 ## ## ## [[6]] ## ## Call: ## .f(formula = .x[[i]], data = ..1) ## ## Coefficients: ## (Intercept) I(hp^6) ## 2.070e+01 7.139e15This is still a new topic for me there might be more elegant ways to do that, using tidyeval to remove the hardcoding of the columns in create_form(). I might continue exploring this.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
random wake
(This article was first published on R – Xi'an's Og, and kindly contributed to Rbloggers)
Just too often on X validated, one sees questions displaying a complete ignorance of the basics that makes one purposelessly wonder what is the point of trying to implement advanced methods when missing the necessary background. And just as often, I reacted to the question by wondering out loud about this… In the current case, the question was about debugging an R code for a mixture of two exponential distributions and the random walk Metropolis algorithm that comes with it. Except that the Normal noise was replaced with a Uniform U(0,1) noise, leading to a most obviously transient Markov chain.I eventually corrected the R code, which returned a rather nicely labelswitching output. And which did not necessarily help with the comprehension of the fundamentals.
Filed under: Books, Kids, R, Statistics Tagged: cross validated, fundamentals, Gaussian random walk, MetropolisHastings algorithm, Monte Carlo Statistical Methods, R, simulation
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
SubGaussian property for the Beta distribution (part 3, final)
(This article was first published on R – Statisfaction, and kindly contributed to Rbloggers)
In this third and last post about the SubGaussian property for the Beta distribution [1] (post 1 and post 2), I would like to show the interplay with the Bernoulli distribution as well as some connexions with optimal transport (OT is a hot topic in general, and also on this blog with Pierre’s posts on Wasserstein ABC).
Let us see how subGaussian proxy variances can be derived from transport inequalities. To this end, we need first to introduce the Wasserstein distance (of order 1) between two probability measures P and Q on a space . It is defined wrt a distance d on by
where is the set of probability measures on with fixed marginal distributions respectively and Then, a probability measure is said to satisfy a transport inequality with positive constant , if for any probability measure dominated by ,
where is the entropy, or Kullback–Leibler divergence, between and . The nice result proven by Bobkov and Götze (1999) [2] is that the constant is a subGaussian proxy variance for P.
For a discrete space equipped with the Hamming metric, , the induced Wasserstein distance reduces to the total variation distance, . In that setting, Ordentlich and Weinberger (2005) [3] proved the distributionsensitive transport inequality:
where the function is defined by and the coefficient is called the balance coefficient of , and is defined by . In particular, the Bernoulli balance coefficient is easily shown to coincide with its mean. Hence, applying the result of Bobkov and Götze (1999) [2] to the above transport inequality yields a distributionsensitive proxy variance of for the Bernoulli with mean , as plotted in blue above.
In the Beta distribution case, we have not been able to extend this transport inequality methodology since the support is not discrete. However, a nice limiting argument holds. Consider a sequence of Beta random variables with fixed mean and with a sum going to zero. This converges to a Bernoulli random variable with mean , and we have shown that the limiting optimal proxy variance of such a sequence of Beta with decreasing sum is the one of the Bernoulli.
References[1] Marchal, O. and Arbel, J. (2017), On the subGaussianity of the Beta and Dirichlet distributions. Electronic Communications in Probability, 22:1–14, 2017. Code on GitHub.
[2] Bobkov, S. G. and Götze, F. (1999). Exponential integrability and transportation cost related to logarithmic Sobolev inequalities. Journal of Functional Analysis, 163(1):1–28.
[3] Ordentlich, E. and Weinberger, M. J. (2005). A distribution dependent refinement of Pinsker’s inequality. IEEE Transactions on Information Theory, 51(5):1836–1840.
To leave a comment for the author, please follow the link and comment on their blog: R – Statisfaction. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Building your own blockchain in R
(This article was first published on R – Big Data Doctor, and kindly contributed to Rbloggers)
Everybody is freaking out about the rise of the Bitcoin and the potential of the Blockchain technologies. The advent of cryptocurrencies, game changing use cases, disruption of established business models by disintermediation, etc..
By the time I’m writing this article, there are more than 1300 cryptocurrencies listed in coinmarketcap.. And a lot more coming with the next ICOs (Internet Coin Offering).
Most certainly, the main enabler of Bitcoin and of many other currencies (although not all of them) is the Blockchain technology.
Although the original paper from Satoshi explains very well the ground idea behind this technology, nothing like creating your own blockchain to fully understand how it works, its limitations and potential improvements (aka “If you can code it, you certainly understand it”).
In this post I’d like to share a gentle coding exercise in R (#rstats). Why R? Just because it’s my favorite language… I wouldn’t choose R for a productive, fullfledge block chain implementation, but again, this is not the purpose of this post. This is just a learning exercise hacked quite quickly without any aspiration of ever running this code in a productive environment, and should be understood as such.
For convenience, I stored the code in a git repository, for others to improve, reuse, try, etc.
First things first: what is what in a Blockchain
A blockchain is an immutable chain of sequential records or blocks that are “chained” together using hashes. Blocks can be understood as containers, typically of transactions, but it can be extended to documents, etc. We can think of a blockchain as a database where new data is stored in blocks, that are added to an immutable chain based on all other existing blocks.
Blockchain is often referred as a digital ledger of transactions performed in a cryptocurrency of choice (Bitcoin or whatever). A transaction requires a sender address, a recipient address and a given amount and needs to be assigned to a block.
The json below shows a typical block with a particular index, an integer representing the proof, a hashed representation of the previous block, which allows for consistency check (more about proof and hashing later) and a timestamp.
Once we have understood the concept behind the blockchain, let’s build one and make it work
We are going to need three different files:
 The class definition file, where we create the Blockchain class with its components and methods.
 The API definition file, where we instantiate the class, register a node and expose the blockchain methods as GET and POST calls for the peers to interact with.
 The plumber launch file, to start the server and expose the API methods
To represent the Blockchain, we need a list of blocks (the chain), a list of currentTransactions (waiting to be mined) and a list of mining nodes.
Our Blockchain class implements a method to register a new transaction addTransaction. Transactions are appended to the list of currentTransactions until a new block is created, which takes care of the newly added transactions that have not been added to any block yet.
The creation of a new block takes place after calling nextBlock. To maintain the consistency, a new block can only be added knowing the hashed value of the previous one as part of a Proof of Work procedure (PoW). Basically, it is just finding a number that satisfies a condition. This number should be easy enough to be verified by anyone in the chain, but difficult enough so that a brute force attack to find it wouldn’t be feasible (too long, too expensive).
In our example, proofOfWork relies on finding a number called proof such that if we append this proof to the previous proof and we hash it, the last 2 characters of the resulting string are exactly two zeroes. The way it works in the real Bitcoin chain is quite different, but based on the same concept.
In Bitcoin, the Proof of Work algorithm is called Hashcash. Solving this problem requires computational resources (the longer the strings and the more characters to be found within it, the higher the complexity), and miners are rewarded by receiving a coin—in a transaction.
The Blockchain class provides a method to check the consistency over all blocks validChain, which iterates over the chain checking if both each block is properly linked to the previous one and whether the PoW is preserved.
The method registerNode adds the URL of a mining node to the central nodes register.
Finally, as we need to think of a Blockchain as a distributed system, there is a chance that one node has a different chain to another node. The handleConflicts resolves this conflict by declaring the longest chain the proper one… the one that all nodes take.
list.of.packages < c("digest", "httr","jsonlite") new.packages < list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])] if(length(new.packages)) install.packages(new.packages) require(digest) require(jsonlite) require(httr) Blockchain < function () { bc = list ( chain = list(), currentTransactions = list(), nodes = list() ) #' Create a new Block in the Blockchain #' #' @param proof The proof given by the Proof of Work algorithm #' @param previousHash Hash of previous Block #' @return new block generated given the \code{proof} and the \code{previousHash} #' @examples #' blockchain = Blockchain() #' blockchain$nextBlock(previousHash=1, proof=100) # genesis block bc$nextBlock = function (proof, previousHash=NULL){ previousHash < ifelse (is.null(previousHash), bc$hashBlock(bc$chain[length(bc$chain)]), previousHash) block = list('block' = list('index' = length (bc$chain) + 1, 'timestamp' = as.numeric(Sys.time()) , 'transactions' = bc$currentTransactions, 'proof' = proof, 'previousHash' = previousHash)) bc$currentTransactions = NULL bc$chain < append(bc$chain, block) return (block) } #' Returns the last block in the Blockchain #' #' @examples #' blockchain$lastBlock() bc$lastBlock = function () { bc$chain[length(bc$chain)] } #' Register a new transaction in the Blockchain #' #' @param sender address of the sender #' @param recipient address of the recipient #' @param amount transaction amount #' @return Index of the Block that will hold this transaction bc$addTransaction = function (sender, recipient, amount) { txn < list('transaction'= list('sender'=sender,'recipient'=recipient,'amount'=amount)) bc$currentTransactions < append(bc$currentTransactions, txn) last.block < bc$lastBlock() return(last.block$block$index + 1) } #' Hash a block using SHA256 #' #' @param block #' @return SHA256 hashed value for \code(block) #' @examples bc$hashBlock = function (block) { require(digest) digest(block,algo="sha256") } #' Find a number p' such that hash(pp') contains leading 4 zeroes, where p is the previous p' #' p is the previous proof and p' is the new proof #' @param last_proof #' @return SHA256 hashed value for \code(block) bc$proofOfWork < function (last_proof) { proof < 0 while (!bc$validProof(last_proof, proof)) { proof < proof + 1 } return (proof) } #' Find a number p' such that hash(pp') ends with two zeroes, where p is the previous p' #' p is the previous proof and p' is the new proof #' @param last_proof previous proof #' @param proof proof #' @return TRUE if correct, FALSE if not bc$validProof < function (last_proof, proof) { guess = paste0(last_proof,proof) guess_hash = digest(guess, algo = 'sha256') return (gsub('.*(.{2}$)', '\\1',guess_hash) == "00") } #' Checks whether a given blockchain is valid #' #' @return TRUE if the chain is valid, FALSE otherwise bc$validChain < function (chain) { lastBlock < chain[0] currentIndex < 1 while (currentIndex < length(chain)) { block = chain[currentIndex] # checking for valid linking if (block$block$previousHash != bc$hashBlock(lastBlock)) { return(FALSE) } # checking for proof validity if(!bc$validProof(lastBlock$block$proof, block$block$proof)) { return (FALSE) } lastBlock < block currentIndex < currentIndex +1 } return(TRUE) } #' Add a new node to the list of existing nodes #' #' @param address full URL of the node #' @examples #' blockchain = Blockchain() #' blockchain$registerNode('http://192.168.0.5:5000') bc$registerNode < function(address) { parsed_url = address bc$nodes< append(bc$nodes, parsed_url) } #' Resolve conflicts by replacing the current chain by the longest chain in the network #' #' @return TRUE if the chain was replaced, FALSE otherwise bc$handleConflicts < function() { neighbours < bc$nodes new_chain < NULL max_length = length(bc$chain) for (i in 1:length(neighbours)) { chain.node < GET(paste0(neighbours[i],'/chain')) node.chain.length < jsonlite::fromJSON(chain.node)$length node.chain.chain < jsonlite::fromJSON(chain.node)$chain if (node.chain.length > max_length) { new_chain = node.chain.chain max_length<node.chain.length } } if (!is.null(new_chain)) { bc$chain < new_chain } } # Adding bc to the environment bc < list2env(bc) class(bc) < "BlockchainClass" return(bc) } Defining the API MethodsAfter defining the Blockchain class, we need to create an instance of it running on a node. First we generate a valid identifier for the node (using Universally Unique IDentifier). Then, we add the genesis block or first block in the chain, using some default parameters (previousHash=1 and proof=100).
Everything else takes place when users invoke the “transaction/new” method to create new transactions and miners call the “/mine” function to trigger the creation of new blocks according to the PoW schema and process the newly created transactions.
Apart from these core methods, we also enable the registration of new nodes “/nodes/register”, the consensus method to handle potential conflicts “/nodes/resolve”, the retrieval “/chain” and html display of the chain “/chain/show”.
To host these methods, we use rplumber, which is a wonderful package to expose R functions as REST and RESTFULL services. It’s really versatile and easy to use (but singlethreaded, which requires more advanced setups).
Apart from the aforementioned methods, we define a filter to log all requests with the server (logger).
') render.html < paste0(render.html, 'Current transactions:
') for (i in 1:length(blockchain$currentTransactions)) { render.html < paste0(render.html, 'Transaction' , i ,'
') render.html < paste0(render.html, 'sender:', blockchain$currentTransactions[i]$transaction$sender) render.html < paste0(render.html, '
') render.html < paste0(render.html, 'recipient:', blockchain$currentTransactions[i]$transaction$recipient) render.html < paste0(render.html, '
') render.html < paste0(render.html, 'amount:', blockchain$currentTransactions[i]$transaction$amount) render.html < paste0(render.html, '
') } render.html < paste0(render.html, '
') render.html < paste0(render.html, 'Current transactions:') render.html < paste0(render.html, '') for (i in 1:blockchain$lastBlock()$block$index) { render.html < paste0(render.html, '
') render.html < paste0(render.html, 'Block nr:', blockchain$chain[i]$block$index) render.html < paste0(render.html, '
') render.html < paste0(render.html, 'Transactions') render.html < paste0(render.html, '
') render.html < paste0(render.html, blockchain$chain[i]$block$transactions) render.html < paste0(render.html, '
Proof') render.html < paste0(render.html, '
') render.html < paste0(render.html,blockchain$chain[i]$block$proof) render.html < paste0(render.html, '
') } render.html < paste0(render.html, '') render.html }
We stored the code above in a file called “blockchainnodeserver.R” to “plumb” it with the script below. First of all, we define a custom Serializer to handle a json boxing issue, as shown here.
list.of.packages < c("plumber","jsonlite") new.packages < list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])] if(length(new.packages)) install.packages(new.packages) require(plumber) require(jsonlite) custom_json < function(){ function(val, req, res, errorHandler){ tryCatch({ json < jsonlite::toJSON(val,auto_unbox=TRUE) res$setHeader("ContentType", "application/json") res$body < json return(res$toResponse()) }, error=function(e){ errorHandler(req, res, e) }) } } addSerializer("custom_json",custom_json) # Make sure you put the path to your blockchainnodeserver.R script r < plumb('blockchainnodeserver.R') r$run(port=8000)Rplumber implements the swagger UI, which can be reached under http://127.0.0.1:8000/__swagger__/. The picture below shows our methods as we declared them in the blockchainnodeserver.R script:
Once we have our Blockchain API running in the url we specified in the plumb command, we can try it with a simple client:
library(jsonlite) library(httr) # to register a node req < POST("http://127.0.0.1:8000/nodes/register", body = '{"nodes": "http://127.0.0.1:8000"}') cat(jsonlite::fromJSON(content(req, "text"))) # create a new transaction req < POST("http://127.0.0.1:8000/transactions/new", body = '{"sender": "d4ee26eee15148ee92c6cd394edd964", "recipient": "23448ee92cd4ee26eee6cd394edd964", "amount": 15}') object < jsonlite::fromJSON(content(req, "text"));object$message # create a new transaction req < POST("http://127.0.0.1:8000/transactions/new", body = '{"sender": "6eee15148ee92c6cd394edd974d4ee2", "recipient": "15148ee92cd4ee26eee6cd394edd964", "amount": 225}') object < jsonlite::fromJSON(content(req, "text"));object$message # start mining req < GET("http://127.0.0.1:8000/mine") object < jsonlite::fromJSON(content(req, "text"));object$message # create a new transaction req < POST("http://127.0.0.1:8000/transactions/new", body = '{"sender": "334e15148ee92c6cd394edd974d4ee2", "recipient": "8ee98ee92cd4ee26eee6cd3334e1514", "amount": 887}') object < jsonlite::fromJSON(content(req, "text"));object$message # mine again req < GET("http://127.0.0.1:8000/mine") object < jsonlite::fromJSON(content(req, "text"));object$messageTo try it locally, I opened 2 instances of RStudio: the first one for the server, where rplumber executes the server part, and the second one for the client, from where I fired all the sample requests from the previous script.
You can check the status of the chain in the browser, as shown in the picture below
But you can also interact with the chain programmatically from the client:
# retrieve the chain req < GET("http://127.0.0.1:8000/chain") chain < jsonlite::fromJSON(content(req, "text")) # check the amount of the first transaction in the first block of the chain chain$chain$block.1$transactions$transaction$amount CreditsThis coding has been inspired by different articles and tutorials (mostly in Python) showing how to build a blockchain from scratch. For example, the ones from Gerald Nash, Eric Alcaide and
Lauri Hartikka (this one in JS).
Special mention to the one created by Daniel van Flymen, very inspiring and easy to follow.
To leave a comment for the author, please follow the link and comment on their blog: R – Big Data Doctor. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Combined outlier detection with dplyr and ruler
(This article was first published on QuestionFlow , and kindly contributed to Rbloggers)
Overview of simple outlier detection methods with their combination using dplyr and ruler packages.
PrologueDuring the process of data analysis one of the most crucial steps is to identify and account for outliers, observations that have essentially different nature than most other observations. Their presence can lead to untrustworthy conclusions. The most complicated part of this task is to define a notion of “outlier”. After that, it is straightforward to identify them based on given data.
There are many techniques developed for outlier detection. Majority of them deal with numerical data. This post will describe the most basic ones with their application using dplyr and ruler packages.
After reading this post you will know:
 Most basic outlier detection techniques.
 A way to implement them using dplyr and ruler.
 A way to combine their results in order to obtain a new outlier detection method.
 A way to discover notion of “diamond quality” without prior knowledge of this topic (as a happy consequence of previous point).
We will perform an analysis with the goal to find not typical diamonds listed in diamonds dataset from ggplot2 package. Here one observation represents one diamond and is stored as a row in data frame.
The way we will do that is by combining different outlier detection techniques to identify rows which are “strong outliers”, i.e. which might by considered outliers based on several methods.
Packages required for this analysis:
library(dplyr) library(tidyr) library(ggplot2) library(ruler) Outlier detection methodsTo do convenient outlier detection with ruler it is better to define notion of nonoutlier in form of the rule “Observation is not an outlier if …”. This way actual outliers are considered as rule breakers, objects of interest of ruler package. Note that definition of nonoutlier is essentially a definition of outlier because of total two possibilities.
ZscoreZscore, also called a standard score, of an observation is [broadly speaking] a distance from the population center measured in number of normalization units. The default choice for center is sample mean and for normalization unit is standard deviation.
⬛ Observation is not an outlier based on zscore if its absolute value of default zscore is lower then some threshold (popular choice is 3).
Here is the function for identifying nonoutliers based on zscore:
isnt_out_z < function(x, thres = 3, na.rm = TRUE) { abs(x  mean(x, na.rm = na.rm)) <= thres * sd(x, na.rm = na.rm) }It takes a numeric vector as input and returns logical vector of the same length indicating whether input value is a nonoutlier.
Zscore with MADMedian Absolute Deviation is a robust normalization unit based on median as a population center. In order to use MAD “as a consistent estimator for the estimation of the standard deviation” one takes its value multiplied by a factor. This way base R function mad is implemented.
⬛ Observation is not an outlier based on MAD if its absolute value of zscore with median as center and MAD as normalization unit is lower then some threshold (popular choice is 3).
isnt_out_mad < function(x, thres = 3, na.rm = TRUE) { abs(x  median(x, na.rm = na.rm)) <= thres * mad(x, na.rm = na.rm) } Tukey’s fencesTukey’s fences is a technique used in box plots. The nonoutlier range is defined with \([Q_1 – k(Q_3 – Q_1),~ Q_3 + k(Q_3 – Q_1)]\), where \(Q_1\) and \(Q_3\) are the lower and upper quartiles respectively, \(k\) – some nonnegative constant (popular choice is 1.5).
⬛ Observation is not an outlier based on Tukey’s fences if its value lies in nonoutlier range.
isnt_out_tukey < function(x, k = 1.5, na.rm = TRUE) { quar < quantile(x, probs = c(0.25, 0.75), na.rm = na.rm) iqr < diff(quar) (quar[1]  k * iqr <= x) & (x <= quar[2] + k * iqr) } Mahalanobis distanceAll previous approaches were created for univariate numerical data. To detect outliers in multivariate case one can use Mahalanobis distance to reduce to univariate case and then apply known techniques.
⬛ Observation is not an outlier based on Mahalanobis distance if its distance is not an outlier.
maha_dist < . %>% select_if(is.numeric) %>% mahalanobis(center = colMeans(.), cov = cov(.)) isnt_out_maha < function(tbl, isnt_out_f, ...) { tbl %>% maha_dist() %>% isnt_out_f(...) }This function takes as input a data frame of interest (with possible nonnumeric columns which are ignored) and function performing univariate outlier detection. It returns a logical vector of the same length as number of rows in input data frame.
To read more about practical usefulness of Mahalanobis distance in detecting outliers go to Steffen’s very helpful post.
Using dplyr and ruler Definition of nonoutlier rowPackage ruler, based on dplyr grammar of data manipulation, offers tools for validating the following data units: data as a whole, group [of rows] as a whole, column as a whole, row as a whole, cell. Our primary interest is row as a whole. However, using this framework, we can construct several approaches for definition of the nonoutlier row:
 Row is not an outlier based on some column if it doesn’t contain outlier (computed based on the target column) on the intersection with that column. In other words, first a univariate outlier detection is performed based solely on data from target column and then all rows containing nonoutliers are named nonoutlier rows.
 Row is not an outlier based on Mahalanobis distance if its distance (computed based on the selected numeric columns) is not an outlier.
 Row is not an outlier based on grouping if it is a part of a nonoutlier group [of rows]. A group [of rows] is not an outlier if its summary value is not an outlier among summary values of other groups.
Note that all listed approached depend on the choice of the univariate outlier detection method. We will use all three previously listed univariate techniques.
isnt_out_funs < funs( z = isnt_out_z, mad = isnt_out_mad, tukey = isnt_out_tukey ) ImplementationIn ruler framework rules are defined in packs (to learn more go to ruler README and vignettes).
Column based nonoutlier rowsFor diamonds dataset rules for column based nonoutlier rows can be defined based on 7 numeric columns and 3 presented univariate detection methods. There is a convenient way of computing all them at once using scoped variant of dplyr::transmute():
diamonds %>% transmute_if(is.numeric, isnt_out_funs) ## # A tibble: 53,940 x 21 ## carat_z depth_z table_z price_z x_z y_z z_z carat_mad depth_mad ## ## 1 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## 2 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## 3 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE ## 4 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## 5 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## # ... with 5.394e+04 more rows, and 12 more variables: table_mad , ## # price_mad , x_mad , y_mad , z_mad , ## # carat_tukey , depth_tukey , table_tukey , ## # price_tukey , x_tukey , y_tukey , z_tukeyThe result has outputs for 21 methods. Their names are of the form _. So the name ‘carat_z’ is interpreted as result of univariate method with name ‘z’ for column with name ‘carat’.
Mahalanobis based nonoutlier rowsTo define nonoutlier rows based on Mahalanobis distance one should apply univariate method for distances computed for some subset of numeric columns. To simplify a little bit, we will one “subset” with all numeric columns and all listed methods:
diamonds %>% transmute(maha = maha_dist(.)) %>% transmute_at(vars(maha = maha), isnt_out_funs) ## # A tibble: 53,940 x 3 ## maha_z maha_mad maha_tukey ## ## 1 TRUE TRUE TRUE ## 2 TRUE FALSE FALSE ## 3 TRUE FALSE FALSE ## 4 TRUE TRUE TRUE ## 5 TRUE TRUE TRUE ## # ... with 5.394e+04 more rowsThe result has outputs for 3 methods. Their names are considered as method names. Note that with this approach outlier rows are not only the ones far from multivariate center, but also the ones that are unnaturally close to it.
Group based nonoutlier rowsDefinition of nonoutlier rows based on grouping depends on group summary function and univariate outlier detection method. As grouping column we will choose all nonnumeric columns (cut, color and clarity) united into one called group (for later easier imputation of nonoutlier rows). As reasonable summary functions we will choose mean value of some numeric column (total of 7 functions):
data_tbl < diamonds %>% unite(col = "group", cut, color, clarity) compute_group_non_outliers < . %>% # Compute per group mean values of columns group_by(group) %>% summarise_if(is.numeric, mean) %>% ungroup() %>% # Detect outliers among groups mutate_if(is.numeric, isnt_out_funs) %>% # Remove unnecessary columns select_if(Negate(is.numeric)) data_tbl %>% compute_group_non_outliers() ## # A tibble: 276 x 22 ## group carat_z depth_z table_z price_z x_z y_z z_z carat_mad ## ## 1 Fair_D_I1 FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE ## 2 Fair_D_IF TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## 3 Fair_D_SI1 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## 4 Fair_D_SI2 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## 5 Fair_D_VS1 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## # ... with 271 more rows, and 13 more variables: depth_mad , ## # table_mad , price_mad , x_mad , y_mad , ## # z_mad , carat_tukey , depth_tukey , table_tukey , ## # price_tukey , x_tukey , y_tukey , z_tukeyThe result has outputs for 21 methods applied to the 276 groups. Their names are of the form _. So the name ‘carat_z’ is interpreted as result of method ‘z’ for summary function equal to mean value of ‘carat’ column. Column group defines names of the groupings.
ExposureColumn and Mahalanobis based definition of nonoutlier rows can be expressed with row packs and group based – as group packs.
row_packs_isnt_out < row_packs( # Nonoutliers based on some column column = . %>% transmute_if(is.numeric, isnt_out_funs), # Nonoutliers based on Mahalanobis distance maha = . %>% transmute(maha = maha_dist(.)) %>% transmute_at(vars(maha = maha), isnt_out_funs) ) group_packs_isnt_out < group_packs( # Nonoutliers based on grouping group = compute_group_non_outliers, .group_vars = "group" )Application of all those packs is called exposing process. The result is an exposure from which we can extract tidy data validation report using get_report.
# Don't remove obeyers to compute total number of applied rules full_report < data_tbl %>% expose(row_packs_isnt_out, group_packs_isnt_out, .remove_obeyers = FALSE) %>% get_report() used_rules < full_report %>% distinct(pack, rule) breaker_report < full_report %>% filter(!(value %in% TRUE))used_rules contains data about all definitions of nonoutlier rows applied to data. They are encoded with combination of columns pack and rule.
breaker_report contains data about data units that break certain rules. Packs column and maha has actual row numbers of data_tbl listed in id column of report (for rows which should be considered as outliers).
On the other hand, pack group defines group pack and is represented in breaker_report with id 0. To obtain row outliers based on grouping we need to expand those rows with information about rows in the data that belong to those groups. This can be done using dplyr::left_join():
group_breakers < breaker_report %>% # Filter group packs filter(pack == "group") %>% # Expand rows by matching group with its rows select(id) %>% left_join( y = data_tbl %>% transmute(var = group, id = 1:n()), by = "var" ) %>% select(pack, rule, var, id, value) outliers < bind_rows( breaker_report %>% filter(pack != "group"), group_breakers ) %>% select(pack, rule, id) # Not all group based definitions resulted with outliers outliers %>% count(pack, rule) %>% filter(pack == "group") %>% print(n = Inf) ## # A tibble: 13 x 3 ## pack rule n ## ## 1 group carat_mad 37 ## 2 group carat_tukey 37 ## 3 group carat_z 29 ## 4 group depth_mad 1093 ## 5 group depth_tukey 1016 ## 6 group depth_z 156 ## 7 group price_mad 209 ## 8 group price_tukey 1146 ## 9 group price_z 44 ## 10 group table_mad 920 ## 11 group table_tukey 8 ## 12 group table_z 7 ## 13 group z_z 23Tibble outliers contains data about outlier rows. Combination of columns pack and rule defines nonoutlier/outlier definition approach and column id defines row number of input data frame that should be considered an outlier based on the definition.
Definitions with most outliers are as follows:
outliers %>% count(pack, rule, sort = TRUE) ## # A tibble: 37 x 3 ## pack rule n ## ## 1 maha maha_mad 6329 ## 2 maha maha_tukey 5511 ## 3 column price_mad 5386 ## 4 column price_tukey 3540 ## 5 column table_mad 2560 ## # ... with 32 more rowsTwo out of three Mahalanobis based definition yielded the most row outliers.
CombinationGiven outliers data frame, one can do whatever he/she wants to identify outliers. Here we will use the basic combination approach based on average score.
Combined outlier detection score for certain row can be defined as share of applied methods that tagged it as outlier. Alternatively one can define it just as number of those methods as it will only change absolute value of the result and not the order.
outlier_score < outliers %>% group_by(id) %>% # nrow(used_rules) equals total number of applied methods summarise(score = n() / nrow(used_rules)) # Top 10 outliers outlier_score %>% arrange(desc(score)) %>% slice(1:10) ## # A tibble: 10 x 2 ## id score ## ## 1 26432 0.5777778 ## 2 27416 0.5777778 ## 3 27631 0.5777778 ## 4 27131 0.4666667 ## 5 23645 0.4222222 ## 6 26445 0.4222222 ## 7 26745 0.4000000 ## 8 27430 0.4000000 ## 9 15952 0.3777778 ## 10 17197 0.3777778Finally we will tag those rows as strong outliers which has score more than 0.2 (subjective threshold which should be researched more).
diam_tbl < diamonds %>% mutate(id = 1:n()) %>% left_join(y = outlier_score, by = "id") %>% mutate( score = coalesce(score, 0), is_out = if_else(score > 0.2, "Outlier", "Not outlier") ) # Total number of outliers sum(diam_tbl$score > 0.2) ## [1] 161Tibble diam_tbl is basically the diamonds but with three more columns: id for row number, score for combined outlier score and is_out for nonoutlier/outlier tag.
Plots illustrating strong outliers:
theme_set(theme_bw()) plot_outliers < function(tbl, x, y, facet_var) { tbl %>% arrange(is_out) %>% ggplot(aes_string(x, y, colour = "is_out")) + geom_point() + facet_wrap(facets = facet_var) + scale_colour_manual(values = c("#AAAAAA", "#004080")) + guides(colour = guide_legend(title = NULL, override.aes = list(size = 4))) + labs(title = paste0("Strong outliers illustration by ", facet_var)) + theme(legend.position = "bottom", legend.text = element_text(size = 14)) } diam_tbl %>% plot_outliers("carat", "price", facet_var = "cut") diam_tbl %>% plot_outliers("x", "depth", facet_var = "color") diam_tbl %>% plot_outliers("price", "table", facet_var = "clarity")Based on those plots we see the complicated nature of “strong outliers”. They are not necessary located “on the edge” of twodimensional scatter plots, but most extreme cases are tagged as outliers.
Also one interesting observation: most outliers are concentrated in the combination of “Fair” cut, “J” colour and “I1” clarity which are worst options among their features. The reason of this effect is groupbased definitions of nonoutliers which tagged certain groups more than others:
breaker_report %>% filter(pack == "group") %>% count(var, sort = TRUE) %>% print(n = 10) ## # A tibble: 47 x 2 ## var n ## ## 1 Fair_D_I1 7 ## 2 Fair_J_I1 7 ## 3 Fair_H_VVS1 6 ## 4 Ideal_J_I1 6 ## 5 Fair_J_VVS1 5 ## 6 Fair_G_VVS1 4 ## 7 Fair_D_VVS1 3 ## 8 Fair_E_I1 3 ## 9 Fair_F_I1 3 ## 10 Fair_H_I1 3 ## # ... with 37 more rowsHere we see that “Fair” cut is among majority of top breaker groups. There are also some interesting combinations: Fair_D_I1 (“worst”“best”“worst”), Fair_J_I1 (“worst”“worst”“worst”), Ideal_J_I1 (“best”“worst”“worst”).
This fact might be interpreted as suggested combined outlier detection approach discovered notion of diamond quality without prior knowledge about it.
Conclusions Using only basic outlier detection methods one can achieve insightful results by combining them. Observations which are tagged as outlier by more than some threshold number of methods might be named as “strong outliers”. Those should be considered as outliers based on the whole data rather then on separate features.
 With ruler combining results of several outlier detection methods is straightforward due to the format of tidy data validation report.
 Suggested “strong outlier” observations in diamonds dataset are not only those with extreme numerical values but also ones based on quality of diamonds. This is achieved without prior knowledge of “diamond quality” notion.
To leave a comment for the author, please follow the link and comment on their blog: QuestionFlow . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Merry Christmas and Happy New Year!
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
The Revolutions team is celebrating Christmas today, and we're taking a break with family and enjoying good food. And given the number of Eggnogs that are being prepared — thanks to Hadley Wickham's eggnogr Shiny app — it might be a good idea to take the rest of the week off as well. (You can find the R source behind the eggnogr app here.)
We'll be back again in the New Year, on January 2. Enjoy the holiday season, and a happy New Year to all.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
A Simple Intro to QLearning in R: Floor Plan Navigation
(This article was first published on R – Quality and Innovation, and kindly contributed to Rbloggers)
This example is drawn from “A Painless QLearning Tutorial” at http://mnemstudio.org/pathfindingqlearningtutorial.htm which explains how to manually calculate iterations using the updating equation for QLearning, based on the Bellman Equation (image from https://www.is.unifreiburg.de/ressourcen/businessanalytics/13_reinforcementlearning.pdf):
The example explores pathfinding through a house:
The question to be answered here is: What’s the best way to get from Room 2 to Room 5 (outside)? Notice that by answering this question using reinforcement learning, we will also know how to find optimal routes from any room to outside. And if we run the iterative algorithm again for a new target state, we can find out the optimal route from any room to that new target state.
Since QLearning is modelfree, we don’t need to know how likely it is that our agent will move between any room and any other room (the transition probabilities). If you had observed the behavior in this system over time, you might be able to find that information, but it many cases it just isn’t available. So the key for this problem is to construct a Rewards Matrix that explains the benefit (or penalty!) of attempting to go from one state (room) to another.
Assigning the rewards is somewhat arbitrary, but you should give a large positive value to your target state and negative values to states that are impossible or highly undesirable. Here’s the guideline we’ll use for this problem:
 1 if “you can’t get there from here”
 0 if the destination is not the target state
 100 if the destination is the target state
We’ll start constructing our rewards matrix by listing the states we’ll come FROM down the rows, and the states we’ll go TO in the columns. First, let’s fill the diagonal with 1 rewards, because we don’t want our agent to stay in the same place (that is, move from Room 1 to Room 1, or from Room 2 to Room 2, and so forth).
Next, let’s move across the first row. Starting in Room 0, we only have one choice: go to Room 4. All other possibilities are blocked (1):
Now let’s fill in the row labeled 1. From Room 1, you have two choices: go to Room 3 (which is not great but permissible, so give it a 0) or go to Room 5 (the target, worth 100)!
Continue moving row by row, determining if you can’t get there from here (1), you can but it’s not the target (0), or it’s the target(100). You’ll end up with a final rewards matrix that looks like this:
Now create this rewards matrix in R:
R < matrix(c(1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 100, 1, 1, 100, 100), nrow=6, ncol=6, byrow=TRUE)And run the code. Notice that we’re calling the target state 6 instead of 5 because even though we have a room labeled with a zero, our matrix starts with a 1s so we have to adjust:
source("https://raw.githubusercontent.com/NicoleRadziwill/RFunctions/master/qlearn.R") results < q.learn(R,10000,alpha=0.1,gamma=0.8,tgt.state=6) > round(results) [,1] [,2] [,3] [,4] [,5] [,6] [1,] 0 0 0 0 64 80 [2,] 0 0 0 51 0 80 [3,] 0 0 0 51 0 0 [4,] 0 64 41 0 64 0 [5,] 64 0 0 51 0 80 [6,] 0 84 0 0 84 100You can read this table of average value to obtain policies. A policy is a “path” through the states of the system:
 Start at Room 0 (first row, labeled 1): Choose Room 4 (80), then from Room 4 choose Room 5 (100)
 Start at Room 1: Choose Room 5 (100)
 Start at Room 2: Choose Room 3 (64), from Room 3 choose Room 1 or Room 4 (80); from 1 or 4 choose 5 (100)
 Start at Room 3: Choose Room 1 or Room 4 (80), then Room 5 (100)
 Start at Room 4: Choose Room 5 (100)
 Start at Room 5: Stay at Room 5 (100)
To answer the original problem, we would take route 2315 or 2345 to get out the quickest if we started in Room 2. This is easy to see with a simple map, but is much more complicated when the maps get bigger.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Quality and Innovation. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Reinforcement Learning: QLearning with the Hopping Robot
(This article was first published on R – Quality and Innovation, and kindly contributed to Rbloggers)
Overview: Reinforcement learning uses “reward” signals to determine how to navigate through a system in the most valuable way. (I’m particularly interested in the variant of reinforcement learning called “QLearning” because the goal is to create a “Quality Matrix” that can help you make the best sequence of decisions!) I found a toy robot navigation problem on the web that was solved using custom R code for reinforcement learning, and I wanted to reproduce the solution in different ways than the original author did. This post describes different ways that I solved the problem described at http://bayesianthink.blogspot.com/2014/05/hoppingrobotsandreinforcement.html
The Problem: Our agent, the robot, is placed at random on a board of wood. There’s a hole at s1, a sticky patch at s4, and the robot is trying to make appropriate decisions to navigate to s7 (the target). The image comes from the blog post linked above.
To solve a problem like this, you can use MODELBASED approaches if you know how likely it is that the robot will move from one state to another (that is, the transition probabilities for each action) or MODELFREE approaches (you don’t know how likely it is that the robot will move from state to state, but you can figure out a reward structure).
 Markov Decision Process (MDP) – If you know the states, actions, rewards, and transition probabilities (which are probably different for each action), you can determine the optimal policy or “path” through the system, given different starting states. (If transition probabilities have nothing to do with decisions that an agent makes, your MDP reduces to a Markov Chain.)
 Reinforcement Learning (RL) – If you know the states, actions, and rewards (but not the transition probabilities), you can still take an unsupervised approach. Just randomly create lots of hops through your system, and use them to update a matrix that describes the average value of each hop within the context of the system.
Solving a RL problem involves finding the optimal value functions (e.g. the Q matrix in Attempt 1) or the optimal policy (the StateAction matrix in Attempt 2). Although there are many techniques for reinforcement learning, we will use Qlearning because we don’t know the transition probabilities for each action. (If we did, we’d model it as a Markov Decision Process and use the MDPtoolbox package instead.) QLearning relies on traversing the system in many ways to update a matrix of average expected rewards from each state transition. This equation that it uses is from https://www.is.unifreiburg.de/ressourcen/businessanalytics/13_reinforcementlearning.pdf:
For this to work, all states have to be visited a sufficient number of times, and all stateaction pairs have to be included in your experience sample. So keep this in mind when you’re trying to figure out how many iterations you need.
Attempt 1: Quick QLearning with qlearn.R Input: A rewards matrix R. (That’s all you need! Your states are encoded in the matrix.)
 Output: A Q matrix from which you can extract optimal policies (or paths) to help you navigate the environment.
 Pros: Quick and very easy. Cons: Does not let you set epsilon (% of random actions), so all episodes are determined randomly and it may take longer to find a solution. Can take a long time to converge.
Set up the rewards matrix so it is a square matrix with all the states down the rows, starting with the first and all the states along the columns, starting with the first:
hopper.rewards < c(10, 0.01, 0.01, 1, 1, 1, 1, 10, 1, 0.1, 3, 1, 1, 1, 1, 0.01, 1, 3, 0.01, 1, 1, 1, 1, 0.01, 1, 0.01, 0.01, 1, 1, 1, 1, 3, 1, 0.01, 100, 1, 1, 1, 1, 0.01, 1, 100, 1, 1, 1, 1, 1, 0.01, 100) HOP < matrix(hopper.rewards, nrow=7, ncol=7, byrow=TRUE) > HOP [,1] [,2] [,3] [,4] [,5] [,6] [,7] [1,] 10 0.01 0.01 1 1.00 1.00 1 [2,] 10 1.00 0.10 3 1.00 1.00 1 [3,] 1 0.01 1.00 3 0.01 1.00 1 [4,] 1 1.00 0.01 1 0.01 0.01 1 [5,] 1 1.00 1.00 3 1.00 0.01 100 [6,] 1 1.00 1.00 1 0.01 1.00 100 [7,] 1 1.00 1.00 1 1.00 0.01 100Here’s how you read this: the rows represent where you’ve come FROM, and the columns represent where you’re going TO. Each element 1 through 7 corresponds directly to S1 through S7 in the cartoon above. Each cell contains a reward (or penalty, if the value is negative) if we arrive in that state.
The S1 state is bad for the robot… there’s a hole in that piece of wood, so we’d really like to keep it away from that state. Location [1,1] on the matrix tells us what reward (or penalty) we’ll receive if we start at S1 and stay at S1: 10 (that’s bad). Similarly, location [2,1] on the matrix tells us that if we start at S2 and move left to S1, that’s also bad and we should receive a penalty of 10. The S4 state is also undesirable – there’s a sticky patch there, so we’d like to keep the robot away from it. Location [3,4] on the matrix represents the action of going from S3 to S4 by moving right, which will put us on the sticky patch
Now load the qlearn command into your R session:
qlearn < function(R, N, alpha, gamma, tgt.state) { # Adapted from https://stackoverflow.com/questions/39353580/howtoimplementqlearninginr Q < matrix(rep(0,length(R)), nrow=nrow(R)) for (i in 1:N) { cs < sample(1:nrow(R), 1) while (1) { next.states < which(R[cs,] > 1) # Get feasible actions for cur state if (length(next.states)==1) # There may only be one possibility ns < next.states else ns < sample(next.states,1) # Or you may have to pick from a few if (ns > nrow(R)) { ns < cs } # NOW UPDATE THE QMATRIX Q[cs,ns] < Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > 1)])  Q[cs,ns]) if (ns == tgt.state) break cs < ns } } return(round(100*Q/max(Q))) }Run qlearn with the HOP rewards matrix, a learning rate of 0.1, a discount rate of 0.8, and a target state of S7 (the location to the far right of the wooden board). I did 10,000 episodes (where in each one, the robot dropped randomly onto the wooden board and has to get to S7):
r.hop < qlearn(HOP,10000,alpha=0.1,gamma=0.8,tgt.state=7) > r.hop [,1] [,2] [,3] [,4] [,5] [,6] [,7] [1,] 0 51 64 0 0 0 0 [2,] 0 0 64 0 0 0 0 [3,] 0 51 0 0 80 0 0 [4,] 0 0 64 0 80 80 0 [5,] 0 0 0 0 0 80 100 [6,] 0 0 0 0 80 0 100 [7,] 0 0 0 0 0 80 100The QMatrix that is presented encodes the bestvalue solutions from each state (the “policy”). Here’s how you read it:
 If you’re at s1 (first row), hop to s3 (biggest value in first row), then hop to s5 (go to row 3 and find biggest value), then hop to s7 (go to row 5 and find biggest value)
 If you’re at s2, go right to s3, then hop to s5, then hop to s7
 If you’re at s3, hop to s5, then hop to s7
 If you’re at s4, go right to s5 OR hop to s6, then go right to s7
 If you’re at s5, hop to s7
 If you’re at s6, go right to s7
 If you’re at s7, stay there (when you’re in the target state, the value function will not be able to pick out a “best action” because the best action is to do nothing)
Alternatively, the policy can be expressed as the best action from each of the 7 states: HOP, RIGHT, HOP, RIGHT, HOP, RIGHT, (STAY PUT)
Attempt 2: Use ReinforcementLearning PackageI also used the ReinforcementLearning package by Nicholas Proellochs (6/19/2017) described in https://cran.rproject.org/web/packages/ReinforcementLearning/ReinforcementLearning.pdf.
 Input: 1) a definition of the environment, 2) a list of states, 3) a list of actions, and 4) control parameters alpha (the learning rate; usually 0.1), gamma (the discount rate which describes how important future rewards are; often 0.9 indicating that 90% of the next reward will be taken into account), and epsilon (the probability that you’ll try a random action; often 0.1)
 Output: A StateAction Value matrix, which attaches a number to how good it is to be in a particular state and take an action. You can use it to determine the highest value action from each state. (It contains the same information as the Qmatrix from Attempt 1, but you don’t have to infer the action from the destination it brings you to.)
 Pros: Relatively straightforward. Allows you to specify epsilon, which controls the proportion of random actions you’ll explore as you create episodes and explore your environment. Cons: Requires manual setup of all state transitions and associated rewards.
First, I created an “environment” that describes 1) how the states will change when actions are taken, and 2) what rewards will be accrued when that happens. I assigned a reward of 1 to all actions that are not special, e.g. landing on S1, landing on S4, or landing on S7. To be perfectly consistent with Attempt 1, I could have used 0.01 instead of 1, but the results will be similar. The values you choose for rewards are sort of arbitrary, but you do need to make sure there’s a comparatively large positive reward at your target state and “negative rewards” for states you want to avoid or are physically impossible.
y.env < function(state,action) { next_state < state if (state == state("s1") && action == "right") { next_state < state("s2") } if (state == state("s1") && action == "hop") { next_state < state("s3") } if (state == state("s2") && action == "left") { next_state < state("s1"); reward < 10 } if (state == state("s2") && action == "right") { next_state < state("s3") } if (state == state("s2") && action == "hop") { next_state < state("s4"); reward < 3 } if (state == state("s3") && action == "left") { next_state < state("s2") } if (state == state("s3") && action == "right") { next_state < state("s4"); reward < 3 } if (state == state("s3") && action == "hop") { next_state < state("s5") } if (state == state("s4") && action == "left") { next_state < state("s3") } if (state == state("s4") && action == "right") { next_state < state("s5") } if (state == state("s4") && action == "hop") { next_state < state("s6") } if (state == state("s5") && action == "left") { next_state < state("s4"); reward < 3 } if (state == state("s5") && action == "right") { next_state < state("s6") } if (state == state("s5") && action == "hop") { next_state < state("s7"); reward < 10 } if (state == state("s6") && action == "left") { next_state < state("s5") } if (state == state("s6") && action == "right") { next_state < state("s7"); reward < 10 } if (next_state == state("s7") && state != state("s7")) { reward < 10 } else { reward < 1 } out < list(NextState = next_state, Reward = reward) return(out) }Next, I installed and loaded up the ReinforcementLearning package and ran the RL simulation:
install.packages("ReinforcementLearning") library(ReinforcementLearning) states < c("s1", "s2", "s3", "s4", "s5", "s6", "s7") actions < c("left","right","hop") data < sampleExperience(N=3000,env=my.env,states=states,actions=actions) control < list(alpha = 0.1, gamma = 0.8, epsilon = 0.1) model < ReinforcementLearning(data, s = "State", a = "Action", r = "Reward", s_new = "NextState", control = control)Now we can see the results:
> print(model) StateAction function Q hop right left s1 2.456741 1.022440 1.035193 s2 2.441032 2.452331 1.054154 s3 4.233166 2.469494 1.048073 s4 4.179853 4.221801 2.422842 s5 6.397159 4.175642 2.456108 s6 4.217752 6.410110 4.223972 s7 4.602003 4.593739 4.591626 Policy s1 s2 s3 s4 s5 s6 s7 "hop" "right" "hop" "right" "hop" "right" "left" Reward (last iteration) [1] 223The recommended policy is: HOP, RIGHT, HOP, RIGHT, HOP, RIGHT, (STAY PUT)
If you tried this example and it didn’t produce the same response, don’t worry! Modelfree reinforcement learning is done by simulation, and when you used the sampleExperience function, you generated a different set of state transitions to learn from. You may need more samples, or to tweak your rewards structure, or both.)
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Quality and Innovation. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Plotting Deep Learning Model Performance Trajectories
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
I am excited to share a new deep learning model performance trajectory graph.
Here is an example produced based on Keras in R using ggplot2:
The ideas include:
 We plot model performance as a function of training epoch, data set (training and validation), and metric.
 For legibility we facet on metric, and facets are adjusted so all facets have the same visual interpretation (“up is better”).
 The only solid horizontal curve is validation performance, and training performance is only indicated as the topregion of a shared region that depicts degree of overfit.
Obviously is going to take some training and practice to read these graphs quickly: but that is petty much true for all visualizations.
The methods work with just about any staged machine learning algorithm (neural nets, deep learning, boosting, random forests, and more) and can also be adapted to nonstaged bug regularized methods (lasso, elastic net, and so on).
The graph is now part of the development version of WVPlots. And we have complete worked examples for Keras and xgboost.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Tiny Art in Less Than 280 Characters
(This article was first published on R – Fronkonstin, and kindly contributed to Rbloggers)
TweetNow that Twitter allows 280 characters, the code of some drawings I have made can fit in a tweet. In this post I have compiled a few of them.
The first one is a cardioid inspired in string art (more info here):
library(ggplot2) n=300 t1=1:n t0=seq(3,2*n+1,2)%%n t2=t0+(t0==0)*n df=data.frame(x=cos((t11)*2*pi/n), y=sin((t11)*2*pi/n), x2=cos((t21)*2*pi/n), y2=sin((t21)*2*pi/n)) ggplot(df,aes(x,y,xend=x2,yend=y2)) + geom_segment(alpha=.1)+theme_void()
This other is based on Fermat’s spiral (more info here):
A recurrence plot of Gauss error function (more info here):
A xy scatter plot of a trigonometric function on R2 (more info here):
A turtle graphic (more info here):
A curve generated by a simulated harmonograph (more info here):
A chord diagram of a 20×20 1matrix (more info here):
Most of them are made with ggplot2 package. I love R and the sense of wonder of how just one or two lines of code can create beautiful and unexpected patterns.
I recently did this project for DataCamp to show how easy is to do art with R and ggplot. Starting from a extremely simple plot, and following a well guided path, you can end making beautiful images like this one:
Furthermore, you can learn also ggplot2 while you do art.
I have done the project together with Rasmus Bååth, instructor at DataCamp and the perfect mate to work with. He is looking for people to build more projects so if you are interested, here you can find more information. Do not hesitate to ask him for details.
All the best for 2018.
Merry Christmas.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...