A quick introduction to using color in density plots
(This article was first published on rbloggers – SHARP SIGHT LABS, and kindly contributed to Rbloggers)
Right now, many people are pursuing data science because they want to learn artificial intelligence and machine learning. And for good reason. Machine learning is white hot right now, and it will probably reshape almost every sector of the world economy.
Having said that, if you want to be a great machine learning expert, and a great data scientist in general, you need to master data visualization too.
This is because data visualization is a critical prerequisite for advanced topics (like machine learning), and also because visualization is very useful for getting things done in its own right.
So let’s talk a little more about data visualization. As you begin learning data visualization in R, you should master the basics: how to use ggplot2, how to think about data visualization, and how to make basic plots (like the bar chart, line chart, histogram, and scatterplot).
And I really mean that you need to master these basics. To be a great data scientist, you need to be “fluent” in these basics. You should be able to write the code for basic charts and plots without even thinking about it. If you can’t, you should go back and practice them. Don’t get shiny object syndrome and try to move on to advanced topics before you do. Show some discipline. Master the foundations.
After you do master the foundations though, you’ll need to learn some intermediate tools.
One such thing that you’ll need to learn is how to work with color. Specifically, you’ll need to learn how to manipulate the “fill” color of things like density plots (as well as heatmaps).
With that in mind, let’s take a look at using color in density plots.
As always, we’re going to use the metalearning strategy of learning the tools with basic cases. Essentially, we’ll learn and practice how to modify the fill aesthetic for very simple plots. Later, once you get the hang of it, you can move on to more advanced applications.
As always, first we will load the packages we will need.
```r
# LOAD PACKAGES
library(tidyverse)
library(viridis)
library(RColorBrewer)
```

Next, we will set a “seed” that will make the dataset exactly reproducible when we create our data. Without this, running rnorm() and runif() would produce data that is similar, but not exactly the same as the data you see here. To be clear, it won’t make too much difference for the purposes of this blog post, but setting a seed is a best practice when you want to show reproducible results, so we will do this anyway.
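As a quick aside, here is a minimal sketch of what set.seed() buys you: with the same seed, the random number generators produce exactly the same draws.

```r
# With the same seed, rnorm() reproduces exactly the same values
set.seed(19031025)
a <- rnorm(5)

set.seed(19031025)
b <- rnorm(5)

identical(a, b)
#> TRUE
```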
```r
# SET SEED
set.seed(19031025)
```

Now, we will create our data frame.
We will use runif() to create 20,000 uniformly distributed values for our x variable, and rnorm() to create 20,000 random normal values for our y variable. We’re also using the tibble() function to ultimately create a tibble/data.frame out of these two new variables.
```r
# CREATE DATA FRAME
df.data <- tibble(x = runif(20000),
                  y = rnorm(20000))
```

Now that we have a dataset created, let’s create a simple plot of the data. Ultimately, we will be working with density plots, but it will be useful to first plot the data points as a simple scatter plot.
Here, we’re using the typical ggplot syntax: we’re specifying the data frame inside of ggplot() and specifying our variable mappings inside of aes().
```r
# CREATE SCATTER PLOT
ggplot(df.data, aes(x = x, y = y)) +
  geom_point()
```

Ok. We can at least see the data points and the general structure of the data (i.e., the horizontal band).
Having said that, these data are very heavily overplotted. There are a few ways to mitigate this overplotting (e.g., manipulating the alpha aesthetic), but a great way is to create a density plot.
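As an aside, here is a minimal sketch of the alpha approach (the data frame is re-created here so the snippet is self-contained):

```r
library(ggplot2)

# Re-create the example data so the snippet stands alone
set.seed(19031025)
df.data <- data.frame(x = runif(20000), y = rnorm(20000))

# Lowering alpha makes individual points nearly transparent,
# so dense regions appear darker and the structure becomes visible
ggplot(df.data, aes(x = x, y = y)) +
  geom_point(alpha = 0.05)
```

This helps, but as noted above, a density plot encodes the same information more directly.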
To create the density plot, we’re using stat_density2d(). I won’t explain this in detail here, but essentially in this application, stat_density2d() calculates the density of observations in each region of the plot, and then fills in that region with a color that corresponds to the density.
```r
# DENSITY PLOT, w/ DEFAULT FILL
ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = ..density..), geom = 'tile', contour = F)
```

This isn’t bad. It gives us a sense of the density of the data (you can see the thick band across the middle). However, there are two issues.
First, the differences in density are not completely obvious, because of the color scale. The default light blue/dark blue color scheme doesn’t illuminate the differences in data density.
Second, this is just not very aesthetically appealing. It just doesn’t look that good.
To fix these issues, let’s modify the color scheme. I’ll show you a few options. Some will work better than others, and after you see these, I’ll encourage you to experiment with other color palettes.
Let’s first take a look at the color palette options.
You can examine a large number of ready-made color palettes from the RColorBrewer package by using display.brewer.all().
```r
# DISPLAY COLOR PALETTES
display.brewer.all()
```

As you can see, there are quite a few palettes from RColorBrewer. We’ll use a few of these to change the fill color of our plot.
```r
# FILL WITH COLOR PALETTE: Greens
ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = ..density..), geom = 'tile', contour = F) +
  scale_fill_distiller(palette = 'Greens')
```

Now let’s use a few more color palettes.
```r
# FILL WITH COLOR PALETTES: Reds, Greys, Red/Yellow/Blue
ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = ..density..), geom = 'tile', contour = F) +
  scale_fill_distiller(palette = 'Reds')
```

… grey

```r
ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = ..density..), geom = 'tile', contour = F) +
  scale_fill_distiller(palette = 'Greys')
```

… and a scale from red, to yellow, to blue.

```r
ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = ..density..), geom = 'tile', contour = F) +
  scale_fill_distiller(palette = 'RdYlBu')
```

These aren’t bad.
I think they work a little better than the default color scheme, but I think we can do better, so let’s try one more.
The following plot uses a custom color palette from the viridis package.
```r
# FILL WITH COLOR PALETTE: Viridis
ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = ..density..), geom = 'tile', contour = F) +
  scale_fill_viridis()
```

I should have called this blog post “ggplot for people who love Mark Rothko.”
Ok, the viridis color palette (and a related set of palettes in the viridis package) is probably my favorite option. Not only do I think this color palette is one of the most aesthetically attractive, it’s also more functional. As noted in the documentation for the package, the viridis color palette is “designed in such a way that it will analytically be perfectly perceptually-uniform … It is also designed to be perceived by readers with the most common form of color blindness.”
Master these techniques with simple cases

Admittedly, when you move on to more complex datasets later, it will take a bit of finesse to properly apply these color palettes.
But as I noted earlier, when you’re trying to master R (or any programming language), you should first master the syntax and techniques on basic cases, and then increase the complexity as you attain basic competence.
With that in mind, if you want to master these color and fill techniques, learn and practice these tools with simple cases like the ones shown here, and you can attempt more advanced applications later.
Sign up now, and discover how to rapidly master data science

To master data visualization and data science, you need to master the essential tools.
Moreover, to make rapid progress, you need to know what to learn, what not to learn, and you need to know how to practice what you learn.
Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible.
Sign up now for our email list, and you’ll receive regular tutorials and lessons.
You’ll learn:
 How to do data visualization in R
 How to practice data science
 How to apply data visualization to more advanced topics (like machine learning)
 … and more
If you sign up for our email list right now, you’ll also get access to our “Data Science Crash Course” for free.
SIGN UP NOW
The post A quick introduction to using color in density plots appeared first on SHARP SIGHT LABS.
To leave a comment for the author, please follow the link and comment on their blog: rbloggers – SHARP SIGHT LABS. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Introducing RITCH: Parsing ITCH Files in R (Finance & Market Microstructure)
(This article was first published on R – Data Shenanigans, and kindly contributed to Rbloggers)
Recently I was faced with a file compressed in NASDAQ’s ITCH protocol. As I wasn’t able to find an R package that parses and loads such a file into R for me, I spent (probably) way too much time writing one, so here it is.
But you might wonder, what exactly is ITCH and why should I care?
Well, ITCH is the outbound protocol NASDAQ uses to communicate market data to its clients, that is, all information including market status, orders, trades, circuit breakers, etc., with nanosecond timestamps for each day and each exchange. It is a must-have if you are looking into market microstructure, and a good thing to have looked into if you are interested in general finance and/or in data analytics and large structured datasets.
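To give a feel for those nanosecond timestamps: ITCH timestamps count nanoseconds since midnight, so converting one to a time of day is simple arithmetic. A base-R sketch (the value below matches the 07:00:00 orders shown later in this post):

```r
# ITCH timestamps are nanoseconds since midnight;
# 2.52e13 ns corresponds to the first orders parsed below
ts <- 2.52e13

secs <- ts / 1e9  # 25,200 seconds since midnight
hh <- as.integer(secs %/% 3600)
mm <- as.integer((secs %% 3600) %/% 60)
ss <- as.integer(secs %% 60)
sprintf("%02d:%02d:%02d", hh, mm, ss)
#> "07:00:00"
```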
If you are wondering where you might get some of these fancy datasets in the first place, I have good news for you. NASDAQ provides some sample datasets (6 days for 3 exchanges (NASDAQ, PSX, and BX), together about 25GBs gzipped) on its FTP server: ftp://emi.nasdaq.com/ITCH/
The RITCH Package

Now that I (hopefully) have your attention, let me present to you RITCH, an R package that parses ITCH files (version 5.0).
Currently the package only lives on GitHub (https://github.com/DavZim/RITCH), but it should find its way into CRAN eventually. Until then, you have to use devtools to install it
```r
# install.packages("devtools")
devtools::install_github("DavZim/RITCH")
```

And you should be good to go (if not, please let me know!). If you get a message related to make, you can safely ignore it for now.
RITCH so far has a very limited scope: extracting the messages from the ITCH file, plus some functions to count messages. The package leverages C++ and the excellent Rcpp library to optimise parsing. RITCH itself does not contain any data, as the datasets are too large for any repository, and I have no copyright on the datasets in any way.
For the following code I will use the 20170130.BX_ITCH_50 file from NASDAQ’s FTP server, as it’s not too large at 714MB gzipped (1.6GB gunzipped), but still has almost 55 million messages. All functions can take the gzipped or unzipped files, but if you use the file more than once and hard-drive space is not of utmost concern, I suggest you gunzip the file by hand (i.e., use R.utils::gunzip(file, new_file, remove = FALSE) in R or gunzip -k YYYYMMDD.XXX_ITCH_50.gz in the terminal) and call the functions on the “plain” file. I will address some concerns about size and speed later on.
To download and prepare the data in R, we can use the following code
```r
# might take some time as it downloads 714MB
download.file("ftp://emi.nasdaq.com/ITCH/20170130.BX_ITCH_50.gz",
              "20170130.BX_ITCH_50.gz", mode = "wb")

# gunzip the file, but keep the original file
R.utils::gunzip("20170130.BX_ITCH_50.gz", "20170130.BX_ITCH_50", remove = FALSE)
```

First, we want to get a general overview of the file, which we can do with count_messages()
```r
library(RITCH)
file <- "20170130.BX_ITCH_50"
msg_count <- count_messages(file, add_meta_data = TRUE)
#> [Counting]   54473386 messages found
#> [Converting] to data.table
msg_count
#>     msg_type    count                                  msg_name                                    msg_group  doc_nr
#>  1:        S        6                      System Event Message                         System Event Message     4.1
#>  2:        R     8371                           Stock Directory                       Stock Related Messages   4.2.1
#>  3:        H     8401                      Stock Trading Action                       Stock Related Messages   4.2.2
#>  4:        Y     8502                       Reg SHO Restriction                       Stock Related Messages   4.2.3
#>  5:        L     6011               Market Participant Position                       Stock Related Messages   4.2.4
#>  6:        V        2                MWCB Decline Level Message                       Stock Related Messages 4.2.5.1
#>  7:        W        0                       MWCB Status Message                       Stock Related Messages 4.2.5.2
#>  8:        K        0                 IPO Quoting Period Update                       Stock Related Messages   4.2.6
#>  9:        J        0                       LULD Auction Collar                       Stock Related Messages   4.2.7
#> 10:        A 21142017                         Add Order Message                            Add Order Message   4.3.1
#> 11:        F    20648      Add Order - MPID Attribution Message                            Add Order Message   4.3.2
#> 12:        E  1203625                    Order Executed Message                        Modify Order Messages   4.4.1
#> 13:        C     8467 Order Executed Message With Price Message                        Modify Order Messages   4.4.2
#> 14:        X  1498904                      Order Cancel Message                        Modify Order Messages   4.4.3
#> 15:        D 20282644                      Order Delete Message                        Modify Order Messages   4.4.4
#> 16:        U  3020278                     Order Replace Message                        Modify Order Messages   4.4.5
#> 17:        P   330023                 Trade Message (Non-Cross)                               Trade Messages   4.5.1
#> 18:        Q        0                       Cross Trade Message                               Trade Messages   4.5.2
#> 19:        B        0                      Broken Trade Message                               Trade Messages   4.5.3
#> 20:        I        0                              NOII Message Net Order Imbalance Indicator (NOII) Message     4.6
#> 21:        N  6935487                   Retail Interest Message    Retail Price Improvement Indicator (RPII)     4.7
#>     msg_type    count                                  msg_name                                    msg_group  doc_nr
```

As you can see, there are a lot of different message types. Currently this package parses only messages from the group “Add Order Messages” (types ‘A’ and ‘F’), “Modify Order Messages” (types ‘E’, ‘C’, ‘X’, ‘D’, and ‘U’), and “Trade Messages” (types ‘P’, ‘Q’, and ‘B’).
You can extract the different message types by using the functions get_orders(), get_modifications(), and get_trades(), respectively. The doc number refers to the section in the official documentation (which also contains a more detailed description of what each type contains).
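As a hypothetical illustration of what the counting helpers shown later are doing (a sketch, not the package’s actual implementation): summing the ‘A’ and ‘F’ rows of the msg_count table above reproduces the order count reported by get_orders() further down in this post.

```r
# Toy subset of the msg_count table from above (msg_type / count)
msg_count <- data.frame(
  msg_type = c("S", "A", "F", "E", "P"),
  count    = c(6, 21142017, 20648, 1203625, 330023)
)

# Sketch of counting all "Add Order" messages: sum types 'A' and 'F'
n_orders <- sum(msg_count$count[msg_count$msg_type %in% c("A", "F")])
n_orders
#> 21162665
```

21162665 is exactly the number of messages get_orders() reports finding later on.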
If you are annoyed by the feedback the function gives you ([Counting] ... [Converting]...), you can always turn the feedback off with the quiet = TRUE option (this applies to all functions).
Let’s try to parse the first 10 orders
```r
orders <- get_orders(file, 1, 10)
#> 10 messages found
#> [Loading]    .
#> [Converting] to data.table
#> [Formatting]
orders
#>     msg_type locate_code tracking_number    timestamp order_ref   buy shares stock   price mpid       date            datetime
#>  1:        A        7584               0 2.520001e+13     36132  TRUE 500000  UAMY  0.0001   NA 2017-01-30 2017-01-30 07:00:00
#>  2:        A        3223               0 2.520001e+13     36133  TRUE 500000  GLOW  0.0001   NA 2017-01-30 2017-01-30 07:00:00
#>  3:        A        2937               0 2.520001e+13     36136 FALSE    200   FRP 18.6500   NA 2017-01-30 2017-01-30 07:00:00
#>  4:        A        5907               0 2.520001e+13     36137  TRUE   1500   PIP  3.1500   NA 2017-01-30 2017-01-30 07:00:00
#>  5:        A        5907               0 2.520001e+13     36138 FALSE   2000   PIP  3.2500   NA 2017-01-30 2017-01-30 07:00:00
#>  6:        A        5907               0 2.520001e+13     36139  TRUE   3000   PIP  3.1000   NA 2017-01-30 2017-01-30 07:00:00
#>  7:        A        5398               0 2.520001e+13     36140  TRUE    200   NSR 33.0000   NA 2017-01-30 2017-01-30 07:00:00
#>  8:        A        5907               0 2.520001e+13     36141 FALSE    500   PIP  3.2500   NA 2017-01-30 2017-01-30 07:00:00
#>  9:        A        2061               0 2.520001e+13     36142 FALSE   1300  DSCI  7.0000   NA 2017-01-30 2017-01-30 07:00:00
#> 10:        A        1582               0 2.520001e+13     36143  TRUE    500  CPPL 17.1500   NA 2017-01-30 2017-01-30 07:00:00
```

The same works for trades using the get_trades() function and for order modifications using the get_modifications() function.
To speed up the get_* functions, we can use the message-count information from earlier. For example, the following code yields the same results as above, but saves time.
```r
orders <- get_orders(file, 1, count_orders(msg_count))
trades <- get_trades(file, 1, count_trades(msg_count))
```

If you want to get more information about each field, you can have a look at the official ITCH protocol specification manual, or you can get a small data.table about each message type by calling get_meta_data().
To have at least one piece of eye candy in this post, let’s have a quick go at the orders and trades of SPY (an S&P 500 ETF and one of the most traded assets, in case you didn’t know), IWO (Russell 2000 Growth ETF), IWM (Russell 2000 Index ETF), and VXX (S&P 500 VIX ETF) on the BX exchange.
In case you are wondering, I got these four tickers with
```r
library(magrittr)
get_orders(file, 1, count_orders(msg_count), quiet = T) %>%
  .$stock %>%
  table %>%
  sort(decreasing = T) %>%
  head(4)
#> .
#>    SPY    IWO    IWM    VXX
#> 135119 135016 118123 117395
```

First we load the data (orders and trades) from the file, then we do some data munging, and finally we plot the data using ggplot2.
```r
library(ggplot2)

# 0. load the data
orders <- get_orders(file, 1, count_orders(msg_count))
#> 21162665 messages found
#> [Loading]    ................
#> [Converting] to data.table
#> [Formatting]

trades <- get_trades(file, 1, count_trades(msg_count))
#> 330023 messages found
#> [Loading]    ................
#> [Converting] to data.table
#> [Formatting]

# 1. data munging
tickers <- c("SPY", "IWO", "IWM", "VXX")
dt_orders <- orders[stock %in% tickers]
dt_trades <- trades[stock %in% tickers]

# for each ticker, use only orders that are within 1% of the range of traded prices
ranges <- dt_trades[, .(min_price = min(price), max_price = max(price)), by = stock]

# filter the orders
dt_orders <- dt_orders[ranges, on = "stock"][price >= 0.99 * min_price & price <= 1.01 * max_price]

# replace the buy factor with something more useful
dt_orders[, buy := ifelse(buy, "Bid", "Ask")]
dt_orders[, stock := factor(stock, levels = tickers)]

# 2. data visualization
ggplot() +
  # add the orders to the plot
  geom_point(data = dt_orders, aes(x = datetime, y = price, color = buy), size = 0.5) +
  # add the trades as a black line to the plot
  geom_step(data = dt_trades, aes(x = datetime, y = price)) +
  # add a facet for each ETF
  facet_wrap(~stock, scales = "free_y") +
  # some aesthetics
  theme_light() +
  labs(title = "Orders and Trades of the largest ETFs",
       subtitle = "Date: 2017-01-30 - Exchange: BX",
       caption = "Source: NASDAQ",
       x = "Time", y = "Price", color = "Side") +
  scale_y_continuous(labels = scales::dollar) +
  scale_color_brewer(palette = "Set1")
```

Now it’s up to you to do something interesting with the data; I hope RITCH can help you with it.
Speed & RAM concerns

If your machine struggles with some files, you can always load only parts of a file as shown above. And of course, make sure that you have only necessary datasets in your R environment and that no unused programs are open (Chrome with some open tabs in the background happily eats your RAM).
If your machine is able to handle larger datasets, you can increase the buffer size of each function call to 1GB or more using the buffer_size = 1e9 argument, increasing the speed with which a file is parsed.
Addendum

If you find this package useful or have any other kind of feedback, I’d be happy if you let me know. Otherwise, if you need more functionality for additional message types, please feel free to create an issue or a pull request on GitHub.
Filed under: R Tagged: Finance, R, RITCH, Visualisation
To leave a comment for the author, please follow the link and comment on their blog: R – Data Shenanigans.
A crazy day in the Bitcoin World
(This article was first published on R – insightR, and kindly contributed to Rbloggers)
By Gabriel Vasconcelos

Today, November 29, 2017, was a crazy day in the Bitcoin world, and the craziness is still going on as I write this post. The price range was thousands of Dollars within a few hours. Bitcoins were today the main topic in all the discussion groups I participate in. Some people believe we are in the middle of a giant bubble and are very skeptical about Bitcoin’s intrinsic value, and other people believe cryptocurrencies are the future and are already counting on a price of hundreds of thousands of dollars in a few years. I am no expert and I have no idea which group is right, but I hope it is the second, because I really like the Bitcoin idea as the money of the future.
My objective here is to have a brief look at what happened with Bitcoin prices and transactions today (2017-11-29) compared to two days ago (2017-11-27), which was an ordinary day given the current levels of Bitcoin prices and volume. I will use data made available by a Brazilian Bitcoin exchange called “Mercado Bitcoin” (MB, “Bitcoin Market”). You can see more details in their GitHub page here. The prices are in Brazilian Real, which trades today at approximately 31 cents of a US Dollar. Since some readers may only be interested in the results, I will put all the code to download the data and generate the plots at the end of the post.
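Since the exchange’s API works in UNIX time (seconds since 1970-01-01 UTC), it may help to see how the anchor timestamp used in the code at the end of the post maps to calendar time. A base-R sketch:

```r
# The anchor timestamp used in the download code below
ts <- 1512004273

# UNIX time is seconds since 1970-01-01 00:00:00 UTC
as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")
#> "2017-11-30 01:11:13 UTC"

# The 24-hour window two days earlier is just an offset in seconds
as.POSIXct(ts - 86400 * 2, origin = "1970-01-01", tz = "UTC")
#> "2017-11-28 01:11:13 UTC"
```

(01:11 UTC on November 30th is late evening on November 29th in Brazilian summer time, which matches the “now it is 23:17” comment in the code.)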
First I will show a small table that demonstrates how volume and prices changed from one day to the other. The table shows that the number of transactions and the number of Bitcoins traded increased by something close to 500%, and the average amount of Bitcoins in each transaction also increased. The price range was less than 2000 BRL on November 27th, compared to more than 10000 BRL today (November 29th).
stat          Nov.29     Nov.27
Transactions  31838.000   6975.000
Amount         2583.511    446.133
Avg.Amount        0.081      0.064
Avg.Price     40063.902  34274.563
Min.Price     33002.000  33500.000
Max.Price     44980.000  34965.000

The boxplot helps us to understand how the prices were distributed. The distance between the 25% and the 75% percentiles of today covers much more than the entire range of prices from November 27th. If you are more interested in the dynamics of prices and volume, you should look at the figure below the boxplot. It shows how prices and volume behaved across the two days (the data is coerced to minutes). Both figures show that November 29th was clearly an unusual day in the Bitcoin world.
Codes

```r
library(jsonlite)
library(tidyverse)
library(lubridate)
library(reshape2)
library(grid)
library(scales)
library(zoo)  # for na.locf()

## == The data is stored in JSON == ##
## == We can only download 1000 trades each time == ##
## == The while loop is to download all trades in the day == ##
## == Dates are in UNIX time == ##

## == Data from the last 24h (now it is 23:17 - 2017-11-29) == ##
start = 1512004273 - 86400
end = 1512004273
datalist = list()
while(start < end){
  file = paste("https://www.mercadobitcoin.net/api/BTC/trades/", start, "/", sep = "")
  data <- fromJSON(file, flatten = TRUE)
  start = max(data$date) + 1
  datalist[[length(datalist) + 1]] = data
}
df_present = Reduce("rbind", datalist)
df_present = df_present %>% filter(date <= end)

## == Data from the same time 2 days before == ##
start = 1512004273 - 86400 * 2
end = 1512004273 - 86400 * 1
datalist = list()
while(start < end){
  file = paste("https://www.mercadobitcoin.net/api/BTC/trades/", start, "/", sep = "")
  data <- fromJSON(file, flatten = TRUE)
  start = max(data$date) + 1
  datalist[[length(datalist) + 1]] = data
}
df_past = Reduce("rbind", datalist)
df_past = df_past %>% filter(date <= end)

## = adjust date = ##
df_past$date = as.POSIXct(df_past$date, origin = "1970-01-01")
df_present$date = as.POSIXct(df_present$date, origin = "1970-01-01")

# = statistics = #
past = c(nrow(df_past), sum(df_past$amount, na.rm = TRUE),
         mean(df_past$amount, na.rm = TRUE), mean(df_past$price), range(df_past$price))
present = c(nrow(df_present), sum(df_present$amount, na.rm = TRUE),
            mean(df_present$amount, na.rm = TRUE), mean(df_present$price), range(df_present$price))
stat = round(data.frame("Nov.29" = present, "Nov.27" = past), 3)
stat = data.frame(stat = c("Transactions", "Amount", "Avg.Amount",
                           "Avg.Price", "Min.Price", "Max.Price"), stat)

## = make data by minute = ##
df_present_min = df_present %>%
  mutate(day = 1 + day(date) - min(day(date)), hour = hour(date), min = minute(date)) %>%
  mutate(date = make_datetime(day = day, hour = hour, min = min, tz = "BRST")) %>%
  group_by(date) %>%
  summarise(price = tail(price, 1), vol = sum(amount))
df_past_min = df_past %>%
  mutate(day = 1 + day(date) - min(day(date)), hour = hour(date), min = minute(date)) %>%
  mutate(date = make_datetime(day = day, hour = hour, min = min, tz = "BRST")) %>%
  group_by(date) %>%
  summarise(price = tail(price, 1), vol = sum(amount))
df_min = full_join(df_present_min, df_past_min, by = c("date")) %>% arrange(date)
df_min$price.x = na.locf(df_min$price.x, na.rm = FALSE)
df_min$price.y = na.locf(df_min$price.y, na.rm = FALSE)

## = Plots = ##
df1 = melt(df_min[, c(1, 2, 4)], id.vars = "date")
df2 = melt(df_min[, c(1, 3, 5)], id.vars = "date")
p0 = ggplot() +
  geom_boxplot(data = df1, aes(variable, value)) +
  labs(x = "Day", y = "Price") +
  scale_x_discrete(labels = c("Nov. 29", "Nov. 27"))
p1 = ggplot() +
  geom_line(data = df1, aes(x = date, y = value,
                            color = factor(variable, labels = c("Nov. 29", "Nov. 27")))) +
  labs(x = "Time", y = "Price") +
  guides(color = guide_legend(title = "Day")) +
  scale_x_datetime(labels = date_format("%H:%M"))
p2 = ggplot() +
  geom_line(data = df2, aes(x = date, y = value,
                            color = factor(variable, labels = c("Nov. 29", "Nov. 27")))) +
  labs(x = "Time", y = "Volume") +
  guides(color = guide_legend(title = "Day")) +
  scale_x_datetime(labels = date_format("%H:%M"))
p0
grid.draw(rbind(ggplotGrob(p1), ggplotGrob(p2), size = "first"))
```

To leave a comment for the author, please follow the link and comment on their blog: R – insightR.
Using Heatmap Coloring on a Density Plot Using R to Visualize Distributions
(This article was first published on R – Displayr, and kindly contributed to Rbloggers)
Lots of different visualizations have been proposed for understanding distributions. In this post, I am going to show how to create my current favorite, which is a density plot using heatmap shading. I find them particularly useful for showing duration data.
Example: time to purchase

The table below shows the number of days it took 200 people to purchase a product, after their first trial of the product. The average time taken is 131 days and the median is 54 days. The rest of the post looks at a number of different ways of visualizing this data, finishing with the heated density plot.
Density plot

One of the classic ways of plotting this type of data is as a density plot. The standard R version is shown below. I have set the from argument to better display this data, as otherwise density plots tend to show negative values even when the data contains no negative values.
```r
plot(density(days, from = 0),
     main = "Density plot",
     xlab = "Number of days since trial started")
```

This plot clearly shows that purchases occur relatively close to 0 days since the trial started. But, unless you use a ruler, there is no way to work out precisely where the peak occurs. Is it at 20 days or 50 days? The values shown on the y-axis have no obvious meaning (unless you read the technical documentation, and even then they are not numbers that can be explained in a useful way to non-technical people).
We can make this easier to read by only plotting the data for the first 180 days (to = 180), and changing the bandwidth used in estimating the density (adjust = 0.2) to make the plot less smooth. We can now see that the peak is around 30. While the plot does a good job of describing the shape of the data, it does not allow us to draw conclusions regarding the cumulative proportion of people to purchase by a specific time. For example, what proportion of people buy in the first 100 days? We need a different plot.
```r
plot(density(days, from = 0, to = 180, adjust = 0.2),
     main = "Density plot - Up to 180 days (86% of data)",
     xlab = "Number of days since trial started")
```

Survival curve

Survival curves have been invented for this type of data. An example is shown below. In this case, the survival curve shows the proportion of people who have yet to purchase at each point in time. The relatively sharp drop in the survival function at 30 is the same pattern that was revealed by the peak in the density plot. We can also see that at about 100 days, around 26% or so of people have “survived” (i.e., not purchased). The plot also reveals that around 14% of customers have not purchased during the time interval shown (as this is where the line crosses the right side of the plot). To my mind the survival plot is more informative than the density plot. But, it is also harder work. It is hard to see most non-technical audiences finding all the key patterns in this plot without guidance.
```r
library(survival)
surv.days = Surv(days)
surv.fit = survfit(surv.days ~ 1)
plot(surv.fit,
     main = "Kaplan-Meier estimate with 95% confidence bounds (86% of data)",
     xlab = "Days since trial started", xlim = c(0, 180),
     ylab = "Survival function")
grid(20, 10, lwd = 2)
```

Heated density plot

I call the visualization below a heated density plot. No doubt somebody invented this before we did, so please tell me if there is a more appropriate name. It is identical to the density plot from earlier in this post, except that:
 The heatmap coloring shows the cumulative proportion to purchase, ranging from red (0%), to yellow (50%, the median), to blue (100%). Thus, we can now see that the median is at about 55, which could not be ascertained from the earlier density plots.
 If you hover your mouse pointer (if you have one) over the plot, it will show you the cumulative percentage of people to purchase. We can readily see that the peak occurs near 30 days. Similarly, we can see that 93% of people purchased within 500 days.
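The cumulative percentages that drive this coloring are just the empirical CDF of the data. A minimal base-R sketch (with simulated stand-in data, since the post's days variable is not shown):

```r
# Simulated stand-in for the post's `days` data: right-skewed durations
set.seed(1)
days <- rexp(200, rate = 1/130)

# ecdf() returns a function giving the cumulative proportion at any point --
# this is what the heatmap coloring of the density plot encodes
cum_fun <- ecdf(days)
cum_fun(100)   # proportion of people who purchased within 100 days
median(days)   # the value where the color scale reaches 50% (yellow)
```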
Below I show the plot limited to the first 180 days with the bandwidth adjusted as in the earlier density plot. The use of the legend on the right allows the viewer to readily see that only a little over 85% of people have purchased.
```r
HeatedDensityPlot(days, from = 0, to = 180, adjust = 0.2,
                  title = "Heated density plot - Up to 180 days (86% of data)",
                  x.title = "Number of days since trial started",
                  legend.title = "% of buyers")
```

Bonus: they are awesome for discrete data

The visualization below shows randomly generated data where I have generated whole numbers in the range of 0 through 12 (i.e., one 0, six 1s, seven 2s, etc.). In all the previous heated density plots the color transition was relatively smooth. In the visualization below, the discrete nature of the data is emphasized by the clear vertical lines between each color. This is not a feature that can be seen on either a traditional density plot or a histogram with small sample sizes.
```r
set.seed(12230)
x = rpois(100, 5)
HeatedDensityPlot(x, from = 0, title = "Poisson(5), n = 100", x.title = "x")
```

Conclusion

The heated density plot is, to my mind, a straightforward improvement on traditional density plots. It shows additional information about the cumulative values of the distribution and reveals when the data is discrete. However, the function we have written is still a bit rough around the edges. For example, sometimes a faint line appears at the top of the plot, and a whole host of warnings appear in R when the plot is created. If you can improve on this, please tell us how!
The code

The R function used to create the plot is shown below. If you run this code you will get a plot of a normal distribution. If you want to reproduce my exact examples, please click here to sign into Displayr and access the document that contains all my code and charts. This document contains further examples of the use of this plot. To see the code, just click on one of the charts in the document, and the code should be shown on the right in Properties > R CODE.
HeatedDensityPlot <- function(x,
                              title = "",
                              x.title = "x",
                              colors = c('#d7191c', '#fdae61', '#ffffbf', '#abdda4', '#2b83ba'),
                              show.legend = TRUE,
                              legend.title = "Cumulative %",
                              n = 512,      # Number of unique values of x in the heatmap/density estimation
                              n.breaks = 5, # Number of values of x to show on the x-axis. Final number may differ.
                              ...)
{
  # Checking inputs.
  n.obs <- length(x)
  if (any(nas <- is.na(x)))
  {
    warning(sum(nas), " observations with missing values have been removed.")
    x <- x[!nas]
    n.obs <- length(x)
  }
  if (n.obs < 4)
    stop(n.obs, " is too few observations for a valid density plot. See Silverman (1986) Density Estimation for Statistics and Data Analysis.")

  # Computing densities
  dens <- density(x, ...)
  y.max <- max(dens$y) * 1.1 # To ensure top of plot isn't too close to title
  x.to.plot.true <- x.to.plot <- dens$x
  y.seq <- c(0, y.max / 2, y.max)
  y.to.plot <- dens$y
  range.x <- range(x)
  cum.dens <- ecdf(x)(x.to.plot) # * 100

  # Due to plotly bug that misaligns heatmap with ensuing white line,
  # putting blanks at beginning of data.
  n.blanks <- 10
  cum.dens <- c(rep(NA, n.blanks), cum.dens)
  diff <- x.to.plot[1] - x.to.plot[2]
  blanks <- diff * (n.blanks:1) + x.to.plot[1]
  x.to.plot <- c(blanks, x.to.plot)
  y.to.plot <- c(rep(0, n.blanks), y.to.plot)

  # Creating the matrix of heatmap values
  cum.perc <- cum.dens * 100
  z.mat <- matrix(cum.perc, byrow = TRUE, nrow = 3, ncol = n + n.blanks,
                  dimnames = list(y = y.seq, x = x.to.plot))

  # Specifying the colors
  col.fun <- scales::col_numeric(colors, domain = 0:1, na.color = "white") # range.x
  x.as.colors <- col.fun(cum.dens)
  z.to.plot.scaled <- scales::rescale(cum.perc)
  color.lookup <- setNames(data.frame(z.to.plot.scaled, x.as.colors), NULL)

  # Creating the base heatmap.
  require(plotly)
  p <- plot_ly(z = z.mat,
               xsrc = x.to.plot,
               ysrc = y.seq,
               type = "heatmap",
               colorscale = color.lookup,
               cauto = FALSE,
               hoverinfo = "none",
               colorbar = list(title = legend.title),
               showscale = show.legend)

  # Placing white on top of the bits of the heatmap to hide
  p <- add_trace(p,
                 x = c(1:(n + n.blanks), (n + n.blanks):1),
                 y = c(y.to.plot, rep(y.max * 1.10, n + n.blanks)),
                 fill = "tonexty",
                 hoverinfo = "none",
                 showlegend = FALSE,
                 type = "scatter",
                 mode = "line",
                 showscale = FALSE,
                 line = list(color = "white", width = 0),
                 fillcolor = "white")

  # Adding the tooltips
  p <- add_trace(p,
                 x = 1:(n + n.blanks),
                 y = y.to.plot,
                 name = "",
                 hoverinfo = "text",
                 text = sprintf(paste0(x.title, ": %.0f %% < %.1f"), cum.perc, x.to.plot),
                 type = "scatter",
                 mode = "lines",
                 line = list(color = "white", width = 0),
                 showlegend = FALSE,
                 showscale = FALSE)

  p <- plotly::config(p, displayModeBar = FALSE)

  # Formatting the x-axis
  x.text <- pretty(x.to.plot, n = n.breaks)
  x.tick <- 1 + (x.text - x.to.plot[1]) / (x.to.plot[n + n.blanks] - x.to.plot[1]) * (n + n.blanks - 1)
  p <- layout(p,
              title = title,
              xaxis = list(title = x.title, tickmode = "array",
                           tickvals = x.tick, ticktext = x.text),
              yaxis = list(title = "", showline = FALSE, ticks = "",
                           showticklabels = FALSE, range = c(0, y.max)),
              margin = list(t = 30, l = 5, b = 50, r = 5))
  p
}

Acknowledgements

My colleague Carmen wrote most of the code. It is written using the wonderful plotly package.
To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Sentiment Analysis of “A Christmas Carol”
(This article was first published on R – rud.is, and kindly contributed to R-bloggers)
Our family has been reading, listening to and watching “A Christmas Carol” for just about 30 years now. I got it into my crazy noggin to perform a sentiment analysis on it the other day and tweeted out the results, but a large chunk of the R community is not on Twitter and it would be good to get a holiday-themed post or two up for the season.
One reason I embarked on this endeavour is that @juliasilge & @drob made it so gosh darn easy to do so with their tidytext package and book, Text Mining with R:
(btw: That makes an excellent holiday gift for the data scientist[s] in your life.)
Let us begin!
STAVE I: hrbrmstr’s Code

We need the text of this book to work with and thankfully it’s long been in the public domain. As @drob noted, we can use the gutenbergr package to retrieve it. We’ll use an RStudio project structure for this and cache the results locally to avoid burning bandwidth:
library(rprojroot)
library(gutenbergr)
library(hrbrthemes)
library(stringi)
library(tidytext)
library(tidyverse)

rt <- find_rstudio_root_file()
carol_rds <- file.path(rt, "data", "carol.rds")

if (!file.exists(carol_rds)) {
  carol_df <- gutenberg_download("46")
  write_rds(carol_df, carol_rds)
} else {
  carol_df <- read_rds(carol_rds)
}

How did I know to use 46? We can use gutenberg_works() to get to that info:
gutenberg_works(author == "Dickens, Charles")
## # A tibble: 74 x 8
##    gutenberg_id title
##  1           46 A Christmas Carol in Prose; Being a Ghost Story of Christmas
##  2           98 A Tale of Two Cities
##  3          564 The Mystery of Edwin Drood
##  4          580 The Pickwick Papers
##  5          588 Master Humphrey's Clock
##  6          644 The Haunted Man and the Ghost's Bargain
##  7          650 Pictures from Italy
##  8          653 "The Chimes\r\nA Goblin Story of Some Bells That Rang an Old Year out and a New Year In"
##  9          675 American Notes
## 10          678 The Cricket on the Hearth: A Fairy Tale of Home
## # ... with 64 more rows, and 6 more variables: author, gutenberg_author_id, language,
## #   gutenberg_bookshelf, rights, has_text

STAVE II: The first of three wrangles

We’re eventually going to make a ggplot2 faceted chart of the sentiments by paragraphs in each stave (chapter). I wanted nicer titles for the facets so we’ll clean up the stave titles first:
#' Convenience only
carol_txt <- carol_df$text

# Just want the chapters (staves)
carol_txt <- carol_txt[-(1:(which(grepl("STAVE I:", carol_txt)) - 1))]

#' We'll need this later to make prettier facet titles
data_frame(
  stave = 1:5,
  title = sprintf("Stave %s: %s", stave,
                  carol_txt[stri_detect_fixed(carol_txt, "STAVE")] %>%
                    stri_replace_first_regex("STAVE [[:alpha:]]{1,3}: ", "") %>%
                    stri_trans_totitle())
) -> stave_titles

stri_trans_totitle() is a super-handy function and all we’re doing here is extracting the stave titles and doing a small transformation. There are scads of ways to do this, so don’t get stuck on this example. Try out other ways of doing this munging.
You’ll also see that I made sure we started at the first stave break vs include the title bits in the analysis.
Now, we need to prep the text for text analysis.
STAVE III: The second of three wranglesThere are other text mining packages and processes in R. I’m using tidytext because it takes care of so many details for you and does so elegantly. I was also at the rOpenSci Unconf where the idea was spawned & worked on and I’m glad it blossomed into such a great package and a book!
Since we (I) want to do the analysis by stave & paragraph, let’s break the text into those chunks. Note that I’m doing an extra break by sentence in the event folks out there want to replicate this work but do so on a more granular level.
#' Break the text up into chapters, paragraphs, sentences, and words,
#' preserving the hierarchy so we can use it later.
data_frame(txt = carol_txt) %>%
  unnest_tokens(chapter, txt, token = "regex",
                pattern = "STAVE [[:alpha:]]{1,3}: [[:alpha:] [:punct:]]+") %>%
  mutate(stave = 1:n()) %>%
  unnest_tokens(paragraph, chapter, token = "paragraphs") %>%
  group_by(stave) %>%
  mutate(para = 1:n()) %>%
  ungroup() %>%
  unnest_tokens(sentence, paragraph, token = "sentences") %>%
  group_by(stave, para) %>%
  mutate(sent = 1:n()) %>%
  ungroup() %>%
  unnest_tokens(word, sentence) -> carol_tokens

carol_tokens
## # A tibble: 28,710 x 4
##    stave  para  sent word
##  1     1     1     1 marley
##  2     1     1     1 was
##  3     1     1     1 dead
##  4     1     1     1 to
##  5     1     1     1 begin
##  6     1     1     1 with
##  7     1     1     1 there
##  8     1     1     1 is
##  9     1     1     1 no
## 10     1     1     1 doubt
## # ... with 28,700 more rows

By indexing each hierarchy level, we have the flexibility to do all sorts of structured analyses just by choosing grouping combinations.
STAVE IV: The third of three wrangles

Now, we need to layer in some sentiments and do some basic sentiment calculations. Many of these sentiment posts (including this one) take a naive approach with basic matching and only looking at 1-grams. One reason I didn’t go further was to make the code accessible to new R folk (since I primarily blog for new R folk :). I’m prepping some 2018 posts with more involved text analysis themes and will likely add some complexity then with other texts.
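For readers new to this: the "naive approach with basic matching" amounts to looking up each word in a sentiment lexicon and summing the hits. A toy illustration with a made-up four-word lexicon (this is not the NRC lexicon, just a sketch of the mechanism):

```r
# Score a sentence by summing per-word sentiment values from a tiny,
# hypothetical lexicon; unknown words contribute nothing (NA is dropped).
lexicon <- c(dead = -1, doubt = -1, merry = 1, bright = 1)
sentence <- "marley was dead there is no doubt"
words <- strsplit(tolower(sentence), "\\s+")[[1]]
score <- sum(lexicon[words], na.rm = TRUE)
score  # -2: two negative words matched
```

tidytext's inner_join against a real lexicon does exactly this matching, just at scale and with counts per group.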
#' Retrieve sentiments and compute them.
#'
#' I left the `index` in vs just use `paragraph` since it'll make this easier to reuse
#' this block (which I'm not doing but thought I might).
inner_join(carol_tokens, get_sentiments("nrc"), "word") %>%
  count(stave, index = para, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  left_join(stave_titles, "stave") -> carol_with_sent

STAVE V: The end of it

Now, we just need to do some really basic ggplot-ing to get to our desired result:
ggplot(carol_with_sent) +
  geom_segment(aes(index, sentiment, xend = index, yend = 0, color = title), size = 0.33) +
  scale_x_comma(limits = range(carol_with_sent$index)) +
  scale_y_comma() +
  scale_color_ipsum() +
  facet_wrap(~title, scales = "free_x", ncol = 5) +
  labs(x = NULL, y = "Sentiment",
       title = "Sentiment Analysis of A Christmas Carol",
       subtitle = "By stave & ¶",
       caption = "Humbug!") +
  theme_ipsum_rc(grid = "Y", axis_text_size = 8,
                 strip_text_face = "italic", strip_text_size = 10.5) +
  theme(legend.position = "none")

You’ll want to tap/click on that to make it bigger.
Despite using a naive analysis, I think it tracks pretty well with the flow of the book.
Stave one is quite bleak. Marley is morose and frightening. There is no joy apart from Fred’s brief appearance.
The truly terrible (-10 sentiment) paragraph also makes sense:
Marley’s face. It was not in impenetrable shadow as the other objects in the yard were, but had a dismal light about it, like a bad lobster in a dark cellar. It was not angry or ferocious, but looked at Scrooge as Marley used to look: with ghostly spectacles turned up on its ghostly forehead. The hair was curiously stirred, as if by breath or hot air; and, though the eyes were wide open, they were perfectly motionless. That, and its livid colour, made it horrible; but its horror seemed to be in spite of the face and beyond its control, rather than a part of its own expression.
(I got to that via this snippet which you can use as a template for finding the other significant sentiment points:)
filter(carol_tokens,
       stave == 1,
       para == filter(carol_with_sent, stave == 1) %>%
         filter(sentiment == min(sentiment)) %>%
         pull(index))

Stave two (Christmas past) is all about Scrooge’s youth and includes details about Fezziwig’s party so the mostly-positive tone also makes sense.
Stave three (Christmas present) has the highest:
The Grocers’! oh, the Grocers’! nearly closed, with perhaps two shutters down, or one; but through those gaps such glimpses! It was not alone that the scales descending on the counter made a merry sound, or that the twine and roller parted company so briskly, or that the canisters were rattled up and down like juggling tricks, or even that the blended scents of tea and coffee were so grateful to the nose, or even that the raisins were so plentiful and rare, the almonds so extremely white, the sticks of cinnamon so long and straight, the other spices so delicious, the candied fruits so caked and spotted with molten sugar as to make the coldest lookers-on feel faint and subsequently bilious. Nor was it that the figs were moist and pulpy, or that the French plums blushed in modest tartness from their highly-decorated boxes, or that everything was good to eat and in its Christmas dress; but the customers were all so hurried and so eager in the hopeful promise of the day, that they tumbled up against each other at the door, crashing their wicker baskets wildly, and left their purchases upon the counter, and came running back to fetch them, and committed hundreds of the like mistakes, in the best humour possible; while the Grocer and his people were so frank and fresh that the polished hearts with which they fastened their aprons behind might have been their own, worn outside for general inspection, and for Christmas daws to peck at if they chose.
and lowest (sentiment) points of the entire book:
And now, without a word of warning from the Ghost, they stood upon a bleak and desert moor, where monstrous masses of rude stone were cast about, as though it were the burial-place of giants; and water spread itself wheresoever it listed, or would have done so, but for the frost that held it prisoner; and nothing grew but moss and furze, and coarse rank grass. Down in the west the setting sun had left a streak of fiery red, which glared upon the desolation for an instant, like a sullen eye, and frowning lower, lower, lower yet, was lost in the thick gloom of darkest night.
Stave four (Christmas yet to come) is fairly middling. I had expected to see lower marks here. The standout negative sentiment paragraph (and the one that follows) are pretty dark, though:
They left the busy scene, and went into an obscure part of the town, where Scrooge had never penetrated before, although he recognised its situation, and its bad repute. The ways were foul and narrow; the shops and houses wretched; the people half-naked, drunken, slipshod, ugly. Alleys and archways, like so many cesspools, disgorged their offences of smell, and dirt, and life, upon the straggling streets; and the whole quarter reeked with crime, with filth, and misery.
Finally, Stave five is both short and positive (whew!). Which I heartily agree with!
FIN

The code is up on GitHub and I hope that it will inspire more folks to experiment with this fun (& useful!) aspect of data science.
Make sure to send links to anything you create and shoot over PRs for anything you think I did that was awry.
For those who celebrate Christmas, I hope you keep Christmas as well as or even better than old Scrooge. “May that be truly said of us, and all of us! And so, as Tiny Tim observed, God bless Us, Every One!”
To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.
R Programmers: What is your biggest problem when working with Census data?
(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)
A few weeks ago I announced my latest project: Data Science instruction at the Census Bureau.
In addition to announcing the project, I also snuck a bit of market research into the post. I asked people the types of analyses they do when working with Census data. I also asked what packages they use when solving those problems.
23 people left comments, and they have been very helpful in shaping the curriculum of the course. Thank you to everyone who left a comment!
That was such an effective way to learn about the community of R Census users that I’ve decided to do it again. If you are an R programmer who has worked with Census data, please leave a comment with an answer to this question:
What is your biggest problem when working with Census data in R?
Understanding the obstacles people face has the potential to help us design better courses.
Leave your answer as a comment below!
The post R Programmers: What is your biggest problem when working with Census data? appeared first on AriLamstein.com.
To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.
Win-Vector LLC announces new “big data in R” tools
(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)
Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the 0.5.0 version of seplyr (also now available on CRAN):
 partition_mutate_se() / partition_mutate_qt(): these are query planners/optimizers that work over dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or Apache Spark) these planners can make your code faster and sequence steps to avoid critical issues (the complementary problems of too-long mutate dependence chains, of too many mutate steps, and incidental bugs; all explained in the linked tutorials).
 if_else_device(): provides a dplyr::mutate()-based simulation of per-row conditional blocks (including conditional assignment). This allows powerful imperative code (such as is often seen when porting from SAS) to be directly and legibly translated into performant dplyr::mutate() data-flow code that works on Spark (via sparklyr) and databases.
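To make concrete what "per-row conditional blocks" means, here is the kind of row-wise branching logic such a device targets, written with plain base-R ifelse() on hypothetical data (this illustrates the problem shape only; it is not the seplyr API):

```r
# Row-wise conditional assignment: each row takes a value depending on a test.
d <- data.frame(x = c(-2, 0, 3))
d$absx  <- ifelse(d$x < 0, -d$x, d$x)                               # conditional value
d$label <- ifelse(d$x < 0, "neg", ifelse(d$x > 0, "pos", "zero"))   # nested branches
```

Imperative SAS-style code expresses this as an if/else block executed per row; translating it into a single vectorized mutate step is what makes it fast on Spark and databases.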
Image by Jeff Kubina from Columbia, Maryland – [1], CC BY-SA 2.0, Link
For “big data in R” users these two function families (plus the included support functions and examples) are simple, yet game-changing. These tools were developed by Win-Vector LLC to fill gaps identified by Win-Vector and our partners when standing up production-scale R plus Apache Spark projects.
We are happy to share these tools as open source, and we are very interested in consulting with your teams on developing R/Spark solutions (including porting existing SAS code). For more information please reach out to Win-Vector.
To help teams get started we are supplying the following initial documentation, discussion, and examples:
 Mutate Partitioner package vignette
 if_else_device reference
 “Partition Mutate” article
 “Partitioning Mutate, Example 2” (includes if_else_device) article.
To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.
Gender Diversity Analysis of Data Science Industry using Kaggle Survey Dataset in R
(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)
Kaggle recently released the dataset of an industry-wide survey that it conducted with 16K respondents.
This article aims to understand how the argument of Gender Diversity plays out in Data Science Practice.
Disclaimer: Yes, I understand this dataset is not the output of a randomized experiment, hence it cannot be representative of all Data Science practitioners, and it also contains selection bias, of which I’m well aware. Let us proceed further with this disclaimer in mind.
Loading required R packages:
#Loading Required Libraries
library(dplyr)
library(stringr)
library(ggplot2)
library(ggthemes)
library(tidyr)
library(scales)

We are going to deal with only the multipleChoiceResponses.csv dataset from the downloaded files, so let us load it into our R environment.
#Load Input Data
complete_data <- read.csv(".../multipleChoiceResponses.csv", header = T, stringsAsFactors = F)

The very first thing we would check is the Gender distribution of the respondents.
#Gender Distribution
complete_data %>%
  filter(GenderSelect != '') %>%
  group_by(GenderSelect) %>%
  count() %>%
  ggplot(aes(x = GenderSelect, y = (n / sum(n)) * 100)) +
  geom_bar(stat = 'identity') +
  ylab('Percent') +
  theme_solarized() +
  theme(axis.text = element_text(size = 6)) +
  ggtitle('Gender Distribution of Kaggle Survey Respondents')

Unsurprisingly, like many other technical domains, Data Science is dominated by men, with more than 80% of respondents being Male and less than 20% Female. How does the Male/Female split look across the participants’ countries?
complete_data %>%
  filter(GenderSelect %in% c('Male','Female')) %>%
  group_by(Country, GenderSelect) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  ggplot() +
  geom_bar(aes(Country, count, fill = GenderSelect), stat = 'identity') +
  #facet_grid(.~GenderSelect) +
  theme_solarized() +
  theme(axis.text = element_text(size = 9),
        axis.text.x = element_text(angle = 40, vjust = 0.5, hjust = 0.5)) +
  ggtitle('Country wise Survey Respondents - M/F')

With the US being the top country of respondents, followed by India, we can see what the Male-Female split across countries looks like. While this chart represents the distribution of countries well, it doesn’t reflect which country is doing well in terms of Female representation in Data Science. To understand that, let us create a new KPI – the F2M Ratio (Female-to-Male Ratio %, which can be interpreted as the number of Females per 100 Males).
complete_data %>%
  filter(GenderSelect %in% c('Male','Female') & Country != "") %>%
  group_by(Country, GenderSelect) %>%
  summarise(count = n()) %>%
  spread(GenderSelect, count) %>%
  mutate(F2M = (Female / Male) * 100) %>%
  arrange(desc(F2M)) %>%
  #mutate(F2M = percent(F2M)) %>%
  ggplot() +
  geom_bar(aes(Country, F2M, fill = F2M), stat = 'identity') +
  theme_solarized() +
  theme(axis.text = element_text(size = 9),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
  ggtitle('Female to Male Ratio - Country Wise') +
  scale_fill_continuous_tableau()

It turns out that no country has a Female-to-Male ratio above 50%, which is not a very healthy insight, but three countries fared above 40%: Ireland, Malaysia and Egypt. Ironically, the US and India, which were at the top overall, come in below 30% and 20% respectively.
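The F2M calculation itself is just a spread-and-divide. The same arithmetic in base R, on small hypothetical counts so it can be checked without the survey file:

```r
# Female-to-Male ratio (% of Females per 100 Males) from gender counts.
counts <- data.frame(
  Country = c("A", "A", "B", "B"),
  Gender  = c("Female", "Male", "Female", "Male"),
  n       = c(20, 80, 45, 100)
)
# Long-to-wide reshape: one row per country, columns n.Female and n.Male.
wide <- reshape(counts, idvar = "Country", timevar = "Gender", direction = "wide")
wide$F2M <- wide$n.Female / wide$n.Male * 100   # A: 25, B: 45
```

tidyr's spread() in the pipeline above does the same reshape step.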
Let us see how the Age distribution looks between Male and Female:
complete_data %>%
  filter(GenderSelect %in% c('Male','Female')) %>%
  ggplot() +
  geom_histogram(aes(Age), binwidth = 1) +
  theme_solarized() +
  facet_grid(.~GenderSelect) +
  ggtitle('Age Distribution - Male vs Female')

It can be inferred from the above plot that the central tendency is similar for Males and Females, but it is very clear that men seem to start earlier than their female counterparts – which could be a big factor in establishing themselves in this industry with confidence.
And finally, what is the most familiar language among Data Enthusiasts?
complete_data %>%
  group_by(LanguageRecommendationSelect) %>%
  summarize(count = n())

# A tibble: 14 x 2
   LanguageRecommendationSelect count
 1                               5718
 2                     C/C++/C#   307
 3                           F#     4
 4                      Haskell    17
 5                         Java   138
 6                        Julia    30
 7                       Matlab   238
 8                        Other    85
 9                       Python  6941
10                            R  2643
11                          SAS    88
12                        Scala    94
13                          SQL   385
14                        Stata    28

The answer is the obvious one – Python, followed by R (remember the selection-bias disclaimer). So, what’s the language that does well with our Female Data Enthusiasts?
complete_data %>%
  filter(GenderSelect %in% c('Male','Female')) %>%
  group_by(LanguageRecommendationSelect, GenderSelect) %>%
  summarize(count = n()) %>%
  spread(GenderSelect, count) %>%
  mutate(F2M = Female / Male) %>%
  arrange(desc(F2M)) %>%
  mutate(F2M = F2M * 100) %>%
  ggplot() +
  geom_bar(aes(LanguageRecommendationSelect, F2M, fill = F2M), stat = 'identity') +
  theme_solarized() +
  theme(axis.text = element_text(size = 9),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
  scale_fill_continuous_tableau() +
  ggtitle('F2M Ratio of Languages used by Kagglers')

Stata has outperformed R and Python among Female Data Enthusiasts; a possible explanation could be the greater penetration of Stata as a language in academia and research.
Well, that’s a simple Gender Diversity Analysis of Data Science Industry with Kaggle Dataset. The entire code used here is available on my Github.
Related Post
 How Happy is Your Country? — Happy Planet Index Visualized
 Exploring, Clustering, and Mapping Toronto’s Crimes
 Spring Budget 2017: Circle Visualisation
 Qualitative Research in R
 Multi-Dimensional Reduction and Visualisation with t-SNE
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.
Flip axes in plotly histograms – few approaches
(This article was first published on http://r-addict.com, and kindly contributed to R-bloggers)
This seems easy, as it requires just changing the x parameter to y in the plot specification. Well, there are some edge cases where R users might get in trouble!
library(plotly)
packageVersion('plotly')
[1] '4.7.1'

Before you go, let me just explain myself. I have just started learning the R interface to the plotly library and I am really amazed by the effort people have put in to make those wires connected and usable by a broad audience.
set.seed(2)
values <- rnorm(50)
p1 <- plot_ly(x = values, type = "histogram", name = "default")
p2 <- plot_ly(y = values, type = "histogram", name = "flipped")
subplot(p1, p2)

What if the plotly object specification is longer than several dozen lines and one would like to have this feature parametrized in a function call? Like in a shiny application, to have the flip as an option?
The quickest solution is to provide an if statement like
#' @param values Numeric values to be used in the plot.
#' @param flip Logical: should the plot be a horizontal histogram?
to_flip_or_not_to_flip <- function(values, flip = FALSE) {
  if (flip) {
    p <- plot_ly(y = values, type = "histogram", name = "flipped")
  } else {
    p <- plot_ly(x = values, type = "histogram", name = "default")
  }
  p
}
#' @examples
#' to_flip_or_not_to_flip(values)
#' to_flip_or_not_to_flip(values, flip = TRUE)

This is a typical redundancy of code. Imagine the plot being created in a loop with many extra layers, where in the end the specification is longer than 50 lines. Would you copy and paste all 50 lines just to substitute x with y?
The orientation parameter

Maybe the orientation parameter is an option? Following the reference (https://plot.ly/r/reference/#histogram):

p3 <- plot_ly(x = values, type = "histogram", name = "orient v", orientation = "v")
p4 <- plot_ly(x = values, type = "histogram", name = "orient h", orientation = "h")
subplot(p3, p4)

one gets a wrong plot. values should be assigned to the y parameter again.
Of course there is a plotly guide book for R users (where I learned about subplot()), but one is not going to read the whole documentation just to create a simple horizontal histogram (I suppose).
The solution?

One can create a list with all the parameters that he/she would like to specify (besides the default parameters) and then either:

1. change the name of the x parameter to y in that list, depending on the flip requirement, or
2. change the object directly after you have specified the plot.
One can easily guess what needs to be changed after looking at the str(plot) call.
We would change the data attributes and rename the x object to y. Note that we can also modify values, not only the names of parameters.
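The rename-in-place idea can be sketched on an ordinary named list; a plotly object's attributes live in nested lists, so the same trick applies (the spec below is a simplified stand-in, not a real plotly object):

```r
# Flip a histogram spec by renaming its x element to y after creation.
spec <- list(type = "histogram", x = c(1, 2, 2, 3))
flip_spec <- function(spec) {
  names(spec)[names(spec) == "x"] <- "y"  # rename in place; values untouched
  spec
}
flipped <- flip_spec(spec)
names(flipped)  # "type" "y"
```

For a real plotly object the element to rename sits deeper (as str(plot) reveals), but the mechanism is the same.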
Do you know a cleaner approach? Please don’t hesitate to leave comments on my blog.
I suppose one could create a plot in ggplot2 and then apply the ggplotly() function, but I am not convinced this function translates all possible plots to the client-ready format, so I’d like to stick to the plotly interface.
Note: sorry for the screenshots. I could not get interactive results for plotly in the R Markdown document compiled with jekyll and knitr.
To leave a comment for the author, please follow the link and comment on their blog: http://r-addict.com.
Rcpp now used by 1250 CRAN packages
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
Earlier today Rcpp passed 1250 reverse dependencies on CRAN, another big milestone. The graph on the left depicts the growth of Rcpp usage (as measured by Depends, Imports and LinkingTo, but excluding Suggests) over time.
Rcpp cleared 300 packages in November 2014. It passed 400 packages in June 2015 (when I only tweeted about it), 500 packages in late October 2015, 600 packages last March, 700 packages last July, 800 packages last October, 900 packages early January, and 1000 packages in April. The chart extends to the very beginning via manually compiled data from CRANberries and checked with crandb. The next part uses manually saved entries. The core (and by far largest) part of the data set was generated semiautomatically via a short script appending updates to a small filebased backend. A list of packages using Rcpp is kept on this page.
Also displayed in the graph is the relative proportion of CRAN packages using Rcpp. The four percent hurdle was cleared just before useR! 2014 where I showed a similar graph (as two distinct graphs) in my invited talk. We passed five percent in December of 2014, six percent July of 2015, seven percent just before Christmas 2015, eight percent last summer, nine percent midDecember 2016 and then cracked ten percent this summer.
1250 user packages is staggering. For perspective, we can use the progression of CRAN itself, compiled by Henrik in a series of posts and emails to the main development mailing list. A decade ago CRAN itself did not have 1250 packages, and here we are approaching 12k with Rcpp at 10% and growing steadily. Amazeballs.
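As a quick sanity check of those figures (taking "approaching 12k" as roughly 12,000 CRAN packages, an approximation rather than an exact count):

```r
# Rcpp's share of CRAN under the post's approximate counts.
rcpp_users <- 1250
cran_total <- 12000
share <- round(100 * rcpp_users / cran_total, 1)
share  # 10.4
```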
This puts a whole lot of responsibility on us in the Rcpp team as we continue to keep Rcpp as performant and reliable as it has been.
And with that, and as always, a very big Thank You! to all users and contributors of Rcpp for help, suggestions, bug reports, documentation or, of course, code.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in third-party for-profit settings.
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
An introduction to Monte Carlo Tree Search
(This article was first published on Appsilon Data Science Blog, and kindly contributed to Rbloggers)
Introduction
We recently witnessed one of the biggest game AI events in history: AlphaGo became the first computer program to beat the world champion in a game of Go. The publication can be found here. Developers from DeepMind combined different techniques from machine learning and tree search to achieve this result. One of them is the Monte Carlo Tree Search (MCTS) algorithm. This algorithm is fairly simple to understand and, interestingly, has applications outside of game AI. Below, I will explain the concept behind the MCTS algorithm and briefly tell you how it was used at the European Space Agency for planning interplanetary flights.
Perfect Information Games
Monte Carlo Tree Search is an algorithm used when playing a so-called perfect information game. In short, perfect information games are games in which, at any point in time, each player has perfect information about all events and actions that have previously taken place. Examples of such games are Chess, Go and Tic-Tac-Toe. But just because every move is known doesn't mean that every possible outcome can be calculated and extrapolated. For example, the number of possible legal game positions in Go is over 10^{170}.
Every perfect information game can be represented as a tree data structure in the following way. At the root you have the beginning state of the game. For Chess that would be the 16 white and 16 black pieces placed in their proper places on the chessboard. For Tic-Tac-Toe it would simply be an empty 3×3 matrix. The first player has some number n_1 of possible moves to make. In the case of Tic-Tac-Toe these would be the 9 possible places to draw a circle. Each such move changes the state of the game. These outcome states are the children of the root node. Then, for each of the n_1 children, the next player has n_2 possible moves to consider, each of them generating another state of the game and thus another child node. Note that n_2 might differ for each of the n_1 nodes. For instance, in Chess you might make a move which forces your opponent to move their king, or consider another move which leaves your opponent with many other options.
An outcome of a play is a path from the root node to one of the leaves. Each leaf carries definite information about which player (or players) won and which lost the game.
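To make the tree representation concrete, here is a minimal sketch (in Python, for illustration; the names `Node` and `expand` are mine, not from any library) of a Tic-Tac-Toe game tree node and the expansion of the root into its n_1 = 9 children:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One game state; children are the states reachable in a single move."""
    board: tuple          # 9 cells, each 'X', 'O', or None
    player: str           # whose turn it is in this state
    children: list = field(default_factory=list)

def expand(node):
    """Generate a child node for every legal move from this state."""
    nxt = 'O' if node.player == 'X' else 'X'
    for i, cell in enumerate(node.board):
        if cell is None:
            child_board = node.board[:i] + (node.player,) + node.board[i + 1:]
            node.children.append(Node(child_board, nxt))

root = Node(board=(None,) * 9, player='X')   # empty 3x3 board
expand(root)
print(len(root.children))   # 9 possible first moves (n_1 = 9)
```

For Chess the idea is identical, only the board representation and the move generator get more complicated.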
Making a decision based on a tree
There are two main problems we face when making a decision in a perfect information game. The first, and main, one is the size of the tree.
This doesn't bother us with very limited games such as Tic-Tac-Toe. We have at most 9 child nodes (at the beginning) and this number gets smaller and smaller as we continue playing. It's a completely different story with Chess or Go. There the corresponding tree is so huge that you cannot hope to search it exhaustively. The way to approach this is to do a random walk on the tree for some time and obtain a subtree of the original decision tree.
This, however, creates a second problem. If every time we play we just walk randomly down the tree, we don't care at all about the quality of our moves and do not learn from our previous games. Anyone who has played Chess knows that making random moves on a chessboard won't get you very far. It might be good for a beginner to get an understanding of how the pieces move, but game after game it's better to learn how to distinguish good moves from bad ones.
So, is there a way to somehow use the facts contained in the previously built decision trees to reason about our next move? As it turns out, there is.
Multi-Armed Bandit Problem
Imagine that you are at a casino and would like to play a slot machine. You choose one randomly and enjoy your game. Later that night, another gambler sits next to you and wins more in 10 minutes than you have during the last few hours. You shouldn't compare yourself to the other guy; it's just luck. But still, it's natural to ask whether you could do better next time. Which slot machine should you choose to win the most? Maybe you should play more than one machine at a time?
The problem you are facing is the Multi-Armed Bandit Problem. It was already known during World War II, but the most commonly known version today was formulated by Herbert Robbins in 1952. There are N slot machines, each with a different expected return value (what you expect to net from a given machine). You don't know the expected return value of any machine. You are allowed to change machines at any time and play each machine as many times as you'd like. What is the optimal strategy for playing?
What does "optimal" mean in this scenario? Clearly your best option would be to play only the machine with the highest return value. An optimal strategy is one for which you do as well as possible compared to always playing that best machine.
It was actually proven that, on average, you cannot do better than O( \ln n ) regret. So that's the best you can hope for. Luckily, it was also proven that you can achieve this bound (again, on average). One way to do so is the following.
(Read this paper if you are interested in the proof.)
For each machine i we keep track of two things: how many times we have tried this machine (n_i) and what the mean return value (x_i) was. We also keep track of how many times (n) we have played in general. Then for each i we compute the confidence interval around x_i:
x_i \pm \sqrt{ 2 \cdot \frac{\ln n}{n_i} }
Each time, we choose to play the machine with the highest upper confidence bound for x_i (the "+" in the formula above).
This is a solution to the Multi-Armed Bandit Problem. Now note that we can use it for our perfect information game: just treat each possible next move (child node) as a slot machine. Each time we choose to play a move we end up winning, losing or drawing. This is our payout. For simplicity, I will assume that we are only interested in winning, so the payout is 1 if we have won and 0 otherwise.
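To see the rule in action, here is a small illustrative simulation (Python for brevity; the function names `ucb1` and `play_bandit` and the machine probabilities are mine, invented for the example) of always pulling the arm with the highest upper bound:

```python
import math
import random

def ucb1(wins, plays, total_plays):
    """Upper confidence bound for one machine: mean payout plus exploration bonus."""
    if plays == 0:
        return float('inf')          # try every machine at least once
    return wins / plays + math.sqrt(2 * math.log(total_plays) / plays)

def play_bandit(true_probs, rounds, seed=0):
    """Repeatedly pull the arm with the highest upper bound (the UCB1 strategy)."""
    rng = random.Random(seed)
    n = len(true_probs)
    wins = [0] * n
    plays = [0] * n
    for t in range(1, rounds + 1):
        arm = max(range(n), key=lambda i: ucb1(wins[i], plays[i], t))
        plays[arm] += 1
        wins[arm] += rng.random() < true_probs[arm]   # payout is 1 or 0
    return plays

plays = play_bandit([0.3, 0.5, 0.7], rounds=5000)
print(plays)   # pulls should concentrate on the 0.7 arm over time
```

The suboptimal arms are still tried occasionally (that is the logarithmic exploration term at work), which is exactly the behavior the regret bound above describes.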
Real-world application example
MAB algorithms have multiple practical applications in the real world, for example price-engine optimization or finding the best online campaign. Let's focus on the first one and see how we can implement it in R. Imagine you are selling your products online and want to introduce a new one, but are not sure how to price it. You came up with 4 price candidates based on your expert knowledge and experience: 99$, 100$, 115$ and 120$. Now you want to test how those prices perform and which to choose eventually.
During the first day of your experiment, 4000 people visited your shop while the first price (99$) was tested and 368 bought the product. For the rest of the prices we have the following outcomes:
- 100$: 4060 visits and 355 purchases,
- 115$: 4011 visits and 373 purchases,
- 120$: 4007 visits and 230 purchases.
Now let’s look at the calculations in R and check which price was performing best during the first day of our experiment.
library(bandit)
set.seed(123)
visits_day1 <- c(4000, 4060, 4011, 4007)
purchase_day1 <- c(368, 355, 373, 230)
prices <- c(99, 100, 115, 120)
post_distribution = sim_post(purchase_day1, visits_day1, ndraws = 10000)
probability_winning <- prob_winner(post_distribution)
names(probability_winning) <- prices
probability_winning
##     99    100    115    120
## 0.3960 0.0936 0.5104 0.0000

We calculated the Bayesian probability that each price performs best and can see that the price 115$ has the highest probability (0.5). On the other hand, 120$ seems a bit too much for the customers.
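For readers curious what `sim_post` and `prob_winner` are doing under the hood, here is a rough sketch in Python. It assumes the standard Beta-Binomial model with a uniform Beta(1, 1) prior, which matches the bandit package's behavior as far as I know; treat it as an illustration of the idea, not an exact reimplementation:

```python
import random

def prob_best(successes, trials, draws=10000, seed=1):
    """Monte Carlo estimate of P(each variant has the highest conversion rate).

    Under a uniform prior, the posterior of each conversion rate is
    Beta(successes + 1, failures + 1); we sample all rates jointly and
    count how often each variant comes out on top.
    """
    rng = random.Random(seed)
    n = len(successes)
    wins = [0] * n
    for _ in range(draws):
        samples = [rng.betavariate(s + 1, t - s + 1)
                   for s, t in zip(successes, trials)]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]

# Day 1 data from the post: purchases and visits per price point
p = prob_best([368, 355, 373, 230], [4000, 4060, 4011, 4007])
for price, prob in zip([99, 100, 115, 120], p):
    print(price, round(prob, 3))
```

The estimates will not match the R output digit for digit (different random streams), but the ranking of the prices should agree.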
The experiment continues for a few more days.
Day 2 results:
visits_day2 <- c(8030, 8060, 8027, 8037)
purchase_day2 <- c(769, 735, 786, 420)
post_distribution = sim_post(purchase_day2, visits_day2, ndraws = 1000000)
probability_winning <- prob_winner(post_distribution)
names(probability_winning) <- prices
probability_winning
##       99      100      115      120
## 0.308623 0.034632 0.656745 0.000000

After the second day the price 115$ still shows the best results, with 99$ and 100$ performing very similarly.
Using the bandit package we can also perform significance analysis, which is handy for overall proportion comparison using prop.test.
significance_analysis(purchase_day2, visits_day2)
##   successes totals estimated_proportion       lower      upper
## 1       769   8030           0.09576588 0.004545319 0.01369494
## 2       735   8060           0.09119107 0.030860453 0.04700507
## 3       786   8027           0.09791952 0.007119595 0.01142688
## 4       420   8037           0.05225831          NA         NA
##   significance rank best       p_best
## 1 3.322143e-01    2    1 3.086709e-01
## 2 1.437347e-21    3    1 2.340515e-06
## 3 6.637812e-01    1    1 6.564434e-01
## 4           NA    4    0 1.548068e-39

At this point we can see that 120$ is still performing badly, so we drop it from the experiment and continue for the next day. According to p_best, the chance that this alternative is the best is negligible.
Day 3 results:
visits_day3 <- c(15684, 15690, 15672, 8037)
purchase_day3 <- c(1433, 1440, 1495, 420)
post_distribution = sim_post(purchase_day3, visits_day3, ndraws = 1000000)
probability_winning <- prob_winner(post_distribution)
names(probability_winning) <- prices
probability_winning
##       99      100      115      120
## 0.087200 0.115522 0.797278 0.000000
value_remaining = value_remaining(purchase_day3, visits_day3)
potential_value = quantile(value_remaining, 0.95)
potential_value
##        95%
## 0.02670002

The day 3 results lead us to conclude that 115$ will generate the highest conversion rate and revenue.
We are still unsure about the conversion probability for the best price 115$, but whatever it is, one of the other prices might beat it by as much as 2.67% (the 95% quantile of value remaining).
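The "value remaining" quantity can be sketched the same way: for each joint posterior draw, compute how much (fractionally) the best-sampled conversion rate exceeds the rate of the arm we currently believe is best. The Python sketch below illustrates the idea under the same uniform-prior assumption as before; it is not an exact port of bandit's value_remaining:

```python
import random

def value_remaining(successes, trials, best, draws=10000, seed=2):
    """Posterior draws of the fractional amount by which some other arm
    might still beat the arm currently believed to be best."""
    rng = random.Random(seed)
    out = []
    for _ in range(draws):
        samples = [rng.betavariate(s + 1, t - s + 1)
                   for s, t in zip(successes, trials)]
        out.append((max(samples) - samples[best]) / samples[best])
    return out

# Day 3 data from the post; index 2 is the 115$ price, the current best
vr = value_remaining([1433, 1440, 1495, 420], [15684, 15690, 15672, 8037], best=2)
vr.sort()
print(round(vr[int(0.95 * len(vr))], 3))   # approximate 95% quantile
```

In most draws the best arm's own sample is the maximum, so the value remaining is zero; the 95% quantile summarizes the tail where a rival price might still win.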
The histograms below show what happens to the value-remaining distribution, the distribution of improvement amounts that another price might have over the current best price, as the experiment continues. With the larger sample we are much more confident about the conversion rate, and over time the other prices have lower chances of beating the 115$ price.
If this example was interesting to you, check out our other post about dynamic pricing.
Monte Carlo Tree Search
We are ready to learn how the Monte Carlo Tree Search algorithm works.
As long as we have enough information to treat child nodes as slot machines, we choose the next node (move) as we would when solving the Multi-Armed Bandit Problem. This can be done whenever we have some information about the payouts for each child node.
At some point we reach a node where we can no longer proceed in this manner, because at least one child has no statistics present. It's time to explore the tree to get new information. This can be done either completely randomly or by applying heuristic methods when choosing child nodes (in practice this might be necessary for games with a high branching factor, like Chess or Go, if we want to achieve good results).
Finally, we arrive at a leaf. Here we can check whether we have won or lost.
It's time to update the nodes we have visited on our way to the leaf. If the player making a move at the corresponding node turned out to be the winner, we increase the number of wins for that node by one; otherwise we keep it the same. Whether we have won or not, we always increase the number of times the node was played (in the corresponding picture we could deduce it automatically from the numbers of losses and wins).
That's it! We repeat this process until some condition is met: a timeout is reached, or the confidence intervals we mentioned earlier stabilize (convergence). Then we make our final decision based on the information we have gathered during the search. We can choose the node with the highest upper bound for payout (as we would in each iteration of the Multi-Armed Bandit), or, if you prefer, the one with the highest mean payout.
The decision is made and the move is played. Now it's time for our opponent (or opponents). When they've finished, we arrive at a new node, somewhere deeper in the tree, and the story repeats.
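Putting the four steps (selection, expansion, simulation, backpropagation) together, here is a compact illustrative implementation in Python. To keep it self-contained it uses a deliberately tiny game, Nim (players alternately take 1 to 3 sticks; whoever takes the last stick wins), rather than Chess or Go; all names are mine and this is a sketch of the algorithm described above, not production code:

```python
import math
import random

rng = random.Random(7)
C = math.sqrt(2)   # exploration constant from the UCB formula

class Node:
    def __init__(self, sticks, player):
        self.sticks = sticks      # sticks left; `player` moves next
        self.player = player      # 0 or 1
        self.wins = 0             # playouts won by the player who moved INTO this node
        self.visits = 0
        self.children = {}        # move taken -> resulting Node

    def untried(self):
        return [m for m in (1, 2, 3) if m <= self.sticks and m not in self.children]

def rollout(sticks, player):
    """Play random moves to the end; return the winner (who takes the last stick)."""
    while sticks > 0:
        sticks -= rng.choice([m for m in (1, 2, 3) if m <= sticks])
        player = 1 - player
    return 1 - player             # the player who just moved took the last stick

def mcts(root, iterations=2000):
    for _ in range(iterations):
        node, path = root, [root]
        # 1. Selection: descend while every child already has statistics
        while node.sticks > 0 and not node.untried():
            node = max(node.children.values(),
                       key=lambda c: c.wins / c.visits
                       + C * math.sqrt(math.log(node.visits) / c.visits))
            path.append(node)
        # 2. Expansion: add one unexplored child
        if node.sticks > 0:
            m = rng.choice(node.untried())
            child = Node(node.sticks - m, 1 - node.player)
            node.children[m] = child
            node, path = child, path + [child]
        # 3. Simulation: random playout from the new node
        winner = rollout(node.sticks, node.player) if node.sticks else 1 - node.player
        # 4. Backpropagation: update every node on the path
        for n in path:
            n.visits += 1
            n.wins += (winner != n.player)   # a win for the player who moved into n
    # final decision: the most-visited move
    return max(root.children, key=lambda m: root.children[m].visits)

# Nim with 5 sticks: taking 1 leaves 4, a losing position for the opponent
root = Node(sticks=5, player=0)
best = mcts(root)
print(best)
```

The win counter is kept from the perspective of the player who moved into each node, so the selection step always compares children by the win rate of the player choosing between them.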
Not just for games
As you might have noticed, Monte Carlo Tree Search can be considered a general technique for making decisions in perfect-information scenarios. Therefore its use does not have to be restricted to games. The most amazing use case I have heard of is planning interplanetary flights. You can read about it at this website, but I will summarize it briefly.
Think of an interplanetary flight as a trip during which you would like to visit more than one planet. For instance, you are flying from Earth to Jupiter via Mars.
An efficient way to do this is to make use of the planets' gravitational fields (like they did in 'The Martian' movie) so you can take less fuel with you. The question is when the best time is to arrive at and leave each planet's surface or orbit (for the first and last planet it's only leaving and arriving, respectively).
You can treat this problem as a decision tree. If you divide time into intervals, at each planet you make a decision: in which time slot to arrive and in which to leave. Each such choice determines the next. First of all, you cannot leave before you arrive. Second, your previous choices determine how much fuel you have left and, consequently, which places in the universe you can reach.
Each such set of consecutive choices determines where you arrive at the end. If you visited all required checkpoints, you've won; otherwise, you've lost. It's like a perfect information game: there is no opponent, and you make a move by choosing the time slot for leaving or arriving. This can be handled with the Monte Carlo Tree Search described above. As you can read here, it fares quite well in comparison with other known approaches to this problem, and at the same time it does not require any domain-specific knowledge to implement.
Write your question and comments below. We’d love to hear what you think.
Announcing a New rOpenSci Software Review Collaboration
(This article was first published on rOpenSci  open tools for open science, and kindly contributed to Rbloggers)
rOpenSci is pleased to announce a new collaboration with Methods in Ecology and Evolution (MEE), a journal of the British Ecological Society, published by Wiley1. Publications destined for MEE that include the development of a scientific R package will now have the option of a joint review process, whereby the R package is reviewed by rOpenSci, followed by a fast-tracked review of the manuscript by MEE. Authors opting for this process will be recognized via a mark on both the web and print versions of their paper.
We are very excited for this partnership to improve the rigor of both scientific software and software publications and to provide greater recognition to developers in the fields of ecology and evolution. It is a natural outgrowth of our interest in supporting scientists in developing and maintaining software, and of MEE's mission of vetting and disseminating tools and methods for the research community. The collaboration formalizes and eases a path already pursued by researchers: the rotl, RNeXML, and treebase packages were all developed or reviewed by rOpenSci and subsequently had associated manuscripts published in MEE.
About rOpenSci software review
rOpenSci is a diverse community of researchers from academia, nonprofits, government, and industry who collaborate to develop and maintain tools and practices around open data and reproducible research. The rOpenSci suite of tools is made up of core infrastructure software developed and maintained by the project staff. The suite also contains numerous packages contributed by members of the broader R community. The volume of community submissions has grown considerably over the years, necessitating a formal system of review quite analogous to that of a peer-reviewed academic journal.
rOpenSci welcomes full software submissions that fit within our aims and scope, with the option of a presubmission inquiry in cases where the scope of a submission is not immediately obvious. This software peer-review framework, known as the rOpenSci Onboarding process, operates with three editors and one editor-in-chief who carefully vet all incoming submissions. After an editorial review, editors solicit detailed, public, signed reviews from two reviewers, and the path to acceptance from then on is similar to a standard journal review process. Details about the system are described in various blog posts by the editorial team.
Collaboration with journals
This is our second collaboration with a journal. Since late 2015, rOpenSci has partnered with the Journal of Open Source Software (JOSS), an open-access journal that publishes brief articles on research software. Packages accepted to rOpenSci can be submitted for fast-track publication at JOSS, where JOSS editors may evaluate based on rOpenSci's reviews alone. As rOpenSci's review criteria are significantly more stringent and designed to be compatible with JOSS, these packages are generally accepted without additional review. We have had great success with this partnership, providing rOpenSci authors with an additional venue to publicize and archive their work. Given this success, we are keen on expanding to other journals and fields where software reviewed and created via rOpenSci can play a significant role in supporting scientific findings.
The details
Our new partnership with MEE broadly resembles that with JOSS, with the major difference that MEE, rather than rOpenSci, leads review of the manuscript component. Authors with R packages and associated manuscripts that fit the aims and scope of both rOpenSci and MEE are encouraged to first submit to rOpenSci. The rotl, RNeXML, and treebase packages are all great examples of such packages. MEE editors may also refer authors to this option if they submit an appropriate manuscript to MEE first.
On submission to rOpenSci, authors can use our updated submission template to choose MEE as a publication venue. Following acceptance by rOpenSci, the associated manuscript will be reviewed by an expedited process at MEE, with reviewers and editors knowing that the software has already been reviewed, and with the public reviews available to them.
Should the manuscript be accepted, a footnote will appear in the web version and the first page of the print version of the MEE article indicating that the software as well as the manuscript has been peerreviewed, with a link to the rOpenSci open reviews.
As with any collaboration, there may be a few hiccups early on and we welcome ideas to make the process more streamlined and efficient. We look forward to the community’s submissions and to your participation in this process.
Many thanks to MEE’s Assistant Editor Chris Grieves and Senior Editor Bob O’Hara for working with us on this collaboration.
 See also MEE’s post from today at https://methodsblog.wordpress.com/2017/11/29/softwarereview/
Some new time series packages
(This article was first published on R on Rob J Hyndman, and kindly contributed to Rbloggers)
This week I have finished preliminary versions of two new R packages for time series analysis. The first (tscompdata) contains several large collections of time series that have been used in forecasting competitions; the second (tsfeatures) is designed to compute features from univariate time series data. For now, both are only on GitHub. I will probably submit them to CRAN after they've been tested by a few more people.
tscompdata
There are already two packages containing forecasting competition data: Mcomp (containing the M and M3 competition data) and Tcomp (containing the tourism competition data).
Competition: StanCon 2018 ticket
(This article was first published on R on The Jumping Rivers Blog, and kindly contributed to Rbloggers)
Today we are happy to announce our Stan contest. Something we feel very strongly about at Jumping Rivers is giving back to the community. We have benefited immensely from the hard work of numerous people, so when possible, we try to give something back.
This year we’re sponsoring StanCon 2018. If you don’t know,
Stan is freedomrespecting, opensource software for facilitating statistical inference at the frontiers of applied statistics.
Or to put it another way, it makes Bayesian inference fast and (a bit) easier. StanCon is the premier conference for all things Stan related and this year it will take place at the Asilomar Conference Grounds, a National Historic Landmark on the Monterey Peninsula right on the beach.
Our interaction with Stan is via the excellent R package, rstan and the introduction to Rstan course we run.
The prize
One of the benefits of sponsoring a conference is free tickets. Unfortunately, we can't make StanCon this year, so we've decided to give away our ticket via a competition.
How do I enter?
Entering the contest is straightforward. We want a nice image to use in our introduction to Rstan course. This can be a cartoon, some neat-looking densities, whatever you think!
For example, my first attempt at a logo resulted in
Pretty enough, but lacking something.
Once you've come up with your awesome image, simply email info@jumpingrivers.com with a link to the image. If the image was generated in R (or another language), then the associated code would be nice. The deadline is the 6th of December.
FAQ
- Can I submit more than one image? Feel free.
- Who owns the copyright? Our intention is to make the image CC0. However, we're happy to change the copyright to suit. The only proviso is that we can use the image for our rstan course.
- Another question? Contact us via the contact page.
Cyber Week Only: Save 50% on DataCamp!
(This article was first published on Rposts.com, and kindly contributed to Rbloggers)
For Cyber Week only, DataCamp offers the readers of Rbloggers over $150 off for unlimited access to its data science library. That's over 90 courses and 5,200 exercises, of which a large chunk are R-focused, plus access to the mobile app, Practice Challenges, and hands-on Projects. All by expert instructors such as Hadley Wickham (RStudio), Matt Dowle (data.table), Garrett Grolemund (RStudio), and Max Kuhn (caret)! Claim your offer now! Offer ends 12/5!
Skills track:
- Learn how to translate numbers into plain English for businesses: whether it's sales figures, market research, logistics, or transportation costs, get the most out of your data.
- Learn how to combine statistical and machine learning techniques with R programming to analyze and interpret complex data.
- Learn how to ensure portfolios are risk balanced, help find new trading opportunities, and evaluate asset prices using mathematical models.
Skills track:
- Communicate the most important features of your data by creating beautiful visualizations using ggplot2 and base R graphics.
- Learn how to parse data in any format. Whether it's flat files, statistical software, databases, or web data, you'll learn to handle it all.
- Learn key statistical concepts and techniques like exploratory data analysis, correlation, regression, and inference.
- Apply your R skills to financial data, including bond valuation, financial trading, and portfolio analysis.
Individual courses:
- Machine Learning Toolbox (by Zachary Deane-Mayer, DataRobot & caret, and Max Kuhn, Pfizer Global R&D & caret)
- Unsupervised Learning in R (Hank Roark, data scientist at Boeing)
- Case Study: Bag of Words (Ted Kwartler, Liberty Mutual)
- Writing Functions in R (Hadley Wickham, RStudio)
A word about DataCamp
For those of you who don't know DataCamp: it's the most intuitive way out there to learn data science, thanks to its combination of short expert videos and immediate hands-on-the-keyboard exercises, as well as its instant personalized feedback on every exercise. They focus only on data science to offer the best learning experience possible and rely on expert instructors to teach.
How to make Python easier for the R user: revoscalepy
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
by Siddarth Ramesh, Data Scientist, Microsoft
I’m an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualizations. However, there is a huge community of Data Scientists and Analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organizations, and it is important for both languages to coexist.
Many times, this means that R coders will develop a workflow in R but then must redesign and recode it in Python for their production systems. If the coder is lucky, this is easy, and the R model can be exported as a serialized object and read into Python; there are packages that do this, such as pmml. Unfortunately, it is often more challenging, because the production system might demand that the entire end-to-end workflow be built exclusively in Python. That's sometimes tough, because there are aspects of statistical model building in R which are more intuitive than in Python.
Python has many strengths, such as its robust data structures (dictionaries, for example), its compatibility with deep learning and Spark, and its ability to serve as a general-purpose language. However, many scenarios in enterprise analytics require going back to basic statistics and machine learning, for which the classic data science packages in Python are not as intuitive as R. The key difference is that many statistical methods are built into R natively. As a result, there is a gap when R users must build workflows in Python. To help bridge this gap, this post will discuss a relatively new package developed by Microsoft: revoscalepy.
Why revoscalepy?
Revoscalepy is the Python implementation of the R package RevoScaleR, included with Microsoft Machine Learning Server.
The methods in ‘revoscalepy’ are the same, and more importantly, the way the R user can view data is the same. The reason this is so important is that for an R programmer, being able to understand the data shape and structure is one of the challenges with getting used to Python. In Python, data types are different, preprocessing the data is different, and the criteria to feed the processed dataset into a model is different.
To understand how revoscalepy eases the transition from R to Python, the following section will compare building a decision tree using revoscalepy with building a decision tree using sklearn. The Titanic dataset from Kaggle will be used for this example. To be clear, this post is written from an R user’s perspective, as many of the challenges this post will outline are standard practices for native Python users.
revoscalepy versus sklearn
revoscalepy works on Python 3.5 and can be downloaded as part of Microsoft Machine Learning Server. Once downloaded, set the Python environment path to the python executable in the MML directory, and then import the packages.
The first chunk of code imports the revoscalepy, numpy, pandas, and sklearn packages, and imports the Titanic data. Pandas has some R roots in that it has its own implementation of DataFrames, as well as methods that resemble R's exploratory methods.
import revoscalepy as rp
import numpy as np
import pandas as pd
import sklearn as sk

titanic_data = pd.read_csv('titanic.csv')
titanic_data.head()

Preprocessing with sklearn
One of the challenges for an R user with sklearn is that sklearn's decision tree model can only handle numeric datatypes. Pandas has a categorical type that looks like factors in R, but sklearn's decision tree does not integrate with it. As a result, numerically encoding the categorical data becomes a mandatory step. This example will use a one-hot encoder to shape the categories in a way that sklearn's decision tree understands.
The side effect of having to one-hot encode the variables is that if the dataset contains high-cardinality features, the encoding can be memory-intensive and computationally expensive, because each category becomes its own binary column. While implementing one-hot encoding is not a difficult transformation in Python and produces good results, it is still an extra step that an R programmer has to implement manually. The following chunk of code detaches the categorical columns, label- and one-hot-encodes them, and then reattaches the encoded columns to the rest of the dataset.
from sklearn import tree
le = sk.preprocessing.LabelEncoder()
x = titanic_data.select_dtypes(include=[object])
x = x.drop(['Name', 'Ticket', 'Cabin'], 1)
x = pd.concat([x, titanic_data['Pclass']], axis = 1)
x['Pclass'] = x['Pclass'].astype('object')
x = pd.DataFrame(x)
x = x.fillna('Missing')
x_cats = x.apply(le.fit_transform)
enc = sk.preprocessing.OneHotEncoder()
enc.fit(x_cats)
onehotlabels = enc.transform(x_cats).toarray()
encoded_titanic_data = pd.concat([pd.DataFrame(titanic_data.select_dtypes(include=[np.number])), pd.DataFrame(onehotlabels)], axis = 1)

At this point, there are more columns than before, and the columns no longer have semantic names (they have been enumerated). This means that if a decision tree is visualized, it will be difficult to understand without the extra step of renaming these columns. There are techniques in Python to help with this, but it is still an extra step that must be considered.
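To illustrate the cardinality problem independently of sklearn and the Titanic file, here is a tiny self-contained sketch (plain Python, hypothetical data) showing how a single categorical column fans out into one binary column per category; a column with 1,000 distinct values would become 1,000 columns:

```python
def one_hot(values):
    """Expand one categorical column into one binary column per category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]

# A toy 'Embarked'-style column: 3 categories become 3 binary columns
cols, encoded = one_hot(['S', 'C', 'S', 'Q'])
print(cols)       # ['C', 'Q', 'S']
print(encoded)    # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Each row keeps exactly one 1 per original column, which is why both memory use and column count scale linearly with the number of categories.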
Preprocessing with revoscalepy

Unlike sklearn, revoscalepy reads pandas' 'category' type much like factors in R. This section of code iterates through the DataFrame, finds the string types, and converts them to 'category'. In pandas, the ordered argument can be set to False to prevent ordered factors.
titanic_data_object_types = titanic_data.select_dtypes(include = ['object'])
titanic_data_object_types_columns = np.array(titanic_data_object_types.columns)
for column in titanic_data_object_types_columns:
    titanic_data[column] = titanic_data[column].astype('category', ordered = False)
titanic_data['Pclass'] = titanic_data['Pclass'].astype('category', ordered = False)

This dataset is now ready to be fed into the revoscalepy model.
Training models with sklearn

One difference between implementing a model in R and in sklearn in Python is that sklearn does not use formulas.
Formulas are important and useful for modeling because they provide a consistent framework for developing models with varying degrees of complexity. With formulas, users can easily express different variable treatments, such as '+' for separate independent variables, ':' for interaction terms, and '*' to include both the variables and their interaction term, along with many other convenient shorthands. Within a formula, users can do mathematical calculations, create factors, and include more complex terms such as third-order interactions. Furthermore, formulas allow for building highly complex models such as mixed-effects models, which are next to impossible to build without them. In Python, there are packages such as 'statsmodels' which offer more intuitive ways to build certain statistical models. However, statsmodels has a limited selection of models, and does not include tree-based models.
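To make the '+' / ':' / '*' semantics concrete, here is a toy term expansion in Python (a deliberately naive sketch, not a real formula parser; for real formula support in Python, the patsy package used by statsmodels implements this properly):

```python
# Toy illustration of R formula term expansion (not a real parser):
# '+' separates terms, ':' is an interaction, and '*' expands into
# both main effects plus their interaction, e.g. a*b -> a, b, a:b.

def expand_terms(rhs):
    terms = []
    for term in (t.strip() for t in rhs.split("+")):
        if "*" in term:
            a, b = (v.strip() for v in term.split("*", 1))
            terms += [a, b, f"{a}:{b}"]
        else:
            terms.append(term)
    return terms

expand_terms("Pclass + Sex*Age")
# -> ['Pclass', 'Sex', 'Age', 'Sex:Age']
```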
With sklearn, model.fit expects the independent and dependent terms to be columns from the DataFrame. Interactions must be created manually as a preprocessing step for more complex examples. The code below trains the decision tree:
model = tree.DecisionTreeClassifier(max_depth = 50)
x = encoded_titanic_data.drop(['Survived'], 1)
x = x.fillna(1)
y = encoded_titanic_data['Survived']
model = model.fit(x,y)

Training models with revoscalepy

revoscalepy brings back formulas. Granted, users cannot view the formula the same way as in R, because formulas are strings in Python. However, porting code from R to Python is an easy transition, because formulas are read the same way by the revoscalepy functions as by the model-fitting functions in R. The code below fits the decision tree in revoscalepy.
#rx_dtree works with formulas, just like models in R
form = 'Survived ~ Pclass + Sex + Age + Parch + Fare + Embarked'
titanic_data_tree = rp.rx_dtree(form, titanic_data, max_depth = 50)

The resulting object, titanic_data_tree, has the same structure as the object RxDTree() would create in R. Because the individual elements that make up the rx_dtree() object are the same as in RxDTree(), R users can understand the decision tree without having to translate between two object structures.
Conclusion

From this workflow, it should be clear how revoscalepy can help with transliteration between R and Python. sklearn has different preprocessing considerations because the data must be fed into the model differently. The advantage of revoscalepy is that R programmers can easily convert their R code to Python without thinking too much about the 'Pythonic way' of implementing it. Categories replace factors, rx_dtree() reads the R-like formula, and the arguments are similar to their R equivalents. Looking at the big picture, revoscalepy is one way to ease Python for the R user, and future posts will cover other ways to transition between R and Python.
Microsoft Docs: Introducing revoscalepy
To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Quick RoundUp – Visualising Flows Using Network and Sankey Diagrams in Python and R
(This article was first published on Rstats – OUseful.Info, the blog…, and kindly contributed to Rbloggers)
Got some data relating to how students move from one module to another. Rows are student ID, module code, presentation date. The flow is not well-behaved. Students may take multiple modules in one presentation, and modules may be taken in any order (which means there are loops…).
My first take on the data was just to treat it as a graph and chart flows without paying attention to time, other than to direct the edges (module A taken directly after module B; if multiple modules are taken by the same student in the same presentation, they all have the same precursor(s) and follow-on module(s), if any) – the following (dummy) data shows the sort of thing we can get out using networkx and the pygraphviz output:
The data can also be filtered to show just the modules taken leading up to a particular module, or taken following a particular module.
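A rough sketch of that flow extraction in plain Python (dummy data and a made-up helper name; the post itself uses networkx and pygraphviz for the rendering):

```python
# Rows are (student, module, presentation); modules taken in the same
# presentation share the same precursors and follow-on modules, so each
# module in one presentation links to each module in the student's next
# presentation.

from collections import defaultdict

def module_edges(rows):
    by_student = defaultdict(lambda: defaultdict(list))
    for student, module, presentation in rows:
        by_student[student][presentation].append(module)
    edges = set()
    for presentations in by_student.values():
        ordered = sorted(presentations)
        for earlier, later in zip(ordered, ordered[1:]):
            for a in presentations[earlier]:
                for b in presentations[later]:
                    edges.add((a, b))
    return edges

rows = [("s1", "A", "2015J"), ("s1", "B", "2015J"), ("s1", "C", "2016B")]
module_edges(rows)  # -> {("A", "C"), ("B", "C")}
```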
The R diagram package can generate similar sorts of network diagram using its plotweb() function. For example, a simple flow graph:
Or something that looks a bit more like a finite state machine diagram:
(In passing, I also note the R diagram package can be used to draw electrical circuit diagrams/schematics.)
Another way of visualising this blocked data might be to use a simple two-dimensional flow diagram, such as a transition plot from the R Gmisc package.
For example, the following could describe total flow from one module to another over a given period of time:
If there are a limited number of presentations (or modules) of interest, we could further break down each category to show the count of students taking a module in a particular presentation (or going directly on to / having directly come from a particular module; in this case, we may want an “other” group to act as a catch all for modules outside a set of modules we are interested in; getting the proportions right might also be a fudge).
Another way we might be able to look at the data “out of time” to show flow between modules is to use a Sankey diagram that allows for the possibility of feedback loops.
The Python sankeyview package (described in Hybrid Sankey diagrams: Visual analysis of multidimensional data for understanding resource use) looks like it could be useful here, if I can work out how to do the setup correctly!
Again, it may be appropriate to introduce a catchall category to group modules into a generic Other bin where there is only a small flow to/from that module to reduce clutter in the diagram.
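That catch-all grouping is easy to sketch. The helper below (a hypothetical plain-Python illustration, not from any of the packages mentioned) folds modules whose total flow is under a threshold into a generic "Other" node:

```python
# Fold low-volume modules into a catch-all "Other" node to reduce
# clutter: any module whose total in+out flow is below the threshold
# is renamed, and the merged edge counts are summed.

def group_small_modules(flows, threshold):
    """flows: dict {(source, target): count}."""
    totals = {}
    for (src, dst), n in flows.items():
        totals[src] = totals.get(src, 0) + n
        totals[dst] = totals.get(dst, 0) + n

    def rename(m):
        return m if totals[m] >= threshold else "Other"

    grouped = {}
    for (src, dst), n in flows.items():
        key = (rename(src), rename(dst))
        grouped[key] = grouped.get(key, 0) + n
    return grouped

flows = {("A", "B"): 40, ("C", "B"): 2, ("D", "B"): 1}
group_small_modules(flows, 5)  # -> {("A", "B"): 40, ("Other", "B"): 3}
```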
The sankeyview package is actually part of a family of packages that includes d3-sankey-diagram and ipysankeywidget.
We can use the ipysankeywidget to render a simple graph data structure of the sort that can be generated by networkx.
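As a sketch, ipysankeywidget takes a list of source/target/value link dicts, so a weighted edge list (of the sort networkx can produce) converts in one line. `edges_to_links` is a made-up helper, and the widget call is shown only as a comment since it needs a notebook:

```python
# ipysankeywidget's SankeyWidget expects links as a list of
# {'source', 'target', 'value'} dicts; converting a weighted edge
# list is a one-liner.

def edges_to_links(edges):
    """edges: iterable of (source, target, weight) tuples."""
    return [{"source": s, "target": t, "value": w} for s, t, w in edges]

links = edges_to_links([("A", "B", 3), ("B", "C", 1)])
# SankeyWidget(links=links)  # renders inside a Jupyter notebook
```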
One big problem with the view I took of the data is that it doesn't respect time, or the numbers of students taking a particular presentation of a course. This detail could help tell a story about the evolving curriculum as new modules come on stream, for example, and perhaps change student behaviour about which module they take next after a particular module. So how could we capture it?
If we can linearise the flow by using module_presentation-keyed nodes, rather than just module-identified nodes, and limit the data to just show students progressing from one presentation to the next, we should be able to use something like a categorical parallel coordinates plot, such as an alluvial diagram from the R alluvial package.
With time indexed modules, we can also explore richer Sankey style diagrams that require a one way flow (no loops).
So for example, here are a few more packages that might be able to help with that, as well as the aforementioned Python sankeyview and ipysankeywidget packages.
First up, the R networkD3 package includes support for Sankey diagrams. Data can be sourced from igraph and then exported into the JSON format expected by the package:
If you prefer Google charts, the googleVis R package has a gvisSankey function (that I’ve used elsewhere).
The R riverplot package also supports Sankey diagrams – and the gallery includes a demo of how to recreate Minard’s visualisation of Napoleon’s 1812 march.
The R sankey package generates a simple Sankey diagram from a simple data table:
Back in the Python world, the pySankey package can generate a simple Sankey diagram from a pandas dataframe.
matplotlib also has support for Sankey diagrams via its matplotlib.sankey module (see also this tutorial):
What I really need to do now is set up a Binder demo of each of them… but that will have to wait till another day…
If you know of any other R or Python packages / demos that might be useful for visualising flows, please let me know via the comments and I’ll add them to the post.
To leave a comment for the author, please follow the link and comment on their blog: Rstats – OUseful.Info, the blog….
Secret Santa is a graph traversal problem
(This article was first published on Higher Order Functions, and kindly contributed to Rbloggers)
Last week at Thanksgiving, my family drew names from a hat for our annual game
of Secret Santa. Actually, it wasn’t a hat but you know what I mean. (Now that I
think about it, I don’t think I’ve ever seen names drawn from a literal hat
before!) In our family, the rules of Secret Santa are pretty simple:
- The players’ names are put in “a hat”.
- Players randomly draw a name from a hat, become that person’s Secret Santa, and get them a gift.
- If a player draws their own name, they draw again.
Once again this year, somebody asked if we could just use an app or a website to
handle the drawing for Secret Santa. Or I could write a script to do it, I
thought to myself. The problem nagged at the back of my mind for the past few
days. You could just shuffle the names… no, no, no. It’s trickier than that.
In this post, I describe a couple of algorithms for Secret Santa sampling using
R and directed graphs. I use the
DiagrammeR package which creates
graphs from dataframes of nodes and edges, and I liberally use
dplyr verbs to manipulate tables of
edges.
If you would like a more practical way to use R for Secret Santa, including
automating the process of drawing names and emailing players, see
this blog post.
Let’s start with a subset of my family’s game of just five players. I assign
each name a unique ID number.
We can think of the players as nodes in a directed graph. An edge connecting two
players indicates a “gives-to” (Secret Santa) relationship. Suppose I drew
Marissa’s name. Then the graph will have an edge connecting me to her. In the
code below, I use DiagrammeR to create a graph by combining a node dataframe
(create_node_df()) and an edge dataframe (create_edge_df()).
(Graph: the five players as nodes – 1 Jeremy, 2 TJ, 3 Jonathan, 4 Alex, 5 Marissa – with a single edge 2→5.)
Before the game starts, anyone could draw anyone else’s name, so let’s visualize
all possible gives-to relations. We can do this by using combn(n, 2) to
generate all n-choose-2 pairs and creating two edges for each pair.
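combn(n, 2) has a direct Python analogue in itertools; a minimal sketch (not the post's R code) that emits both directed edges per unordered pair:

```python
# itertools.combinations yields each unordered pair once; we add both
# directed "gives-to" edges for each pair.

from itertools import combinations

def all_gives_to_edges(players):
    edges = []
    for a, b in combinations(players, 2):
        edges += [(a, b), (b, a)]
    return edges

len(all_gives_to_edges([1, 2, 3, 4, 5]))  # 5-choose-2 = 10 pairs -> 20 edges
```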
(Graph: all 20 possible directed edges among the five players – both directions for each of the 10 pairs.)
Do you need to organize a gift-giving drawing for a group of people? The easiest
solution is to shuffle the names and have the first name give to the second name,
the second to the third, and so on, with the last name looping back around to
the first name. This solution is equivalent to walking through the graph and
visiting every node just once. Such a path is called a
Hamiltonian path.
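That shuffle-into-a-cycle idea can be sketched in a few lines of Python (a language-agnostic illustration with a made-up function name, not the post's R code):

```python
# Shuffle the players into a random order; each player gives to the
# next name in the shuffled order, and the last loops back to the
# first, so every node is visited exactly once.

import random

def secret_santa_cycle(players, rng=random):
    order = list(players)
    rng.shuffle(order)
    return {giver: order[(i + 1) % len(order)]
            for i, giver in enumerate(order)}

pairs = secret_santa_cycle(["Jeremy", "TJ", "Jonathan", "Alex", "Marissa"])
# every player gives exactly once, receives exactly once,
# and never draws themselves
```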
Here we find a Hamiltonian path and create a helper function overwrite_edges()
to update the edges that fall on the path.
(Graph: the same 20 edges, with the edges along the Hamiltonian path highlighted in red.)
As promised, the red paths loop through all nodes exactly once. No one is their
own gift giver :white_check_mark:, and everyone has an incoming red path
:white_check_mark: and an outgoing red path :white_check_mark:. Very nice.
Actually… let’s put that checklist into a validation function.
Despite its elegance, this solution does not simulate drawing names from a
hat! Because each node is visited only once, there is no backtracking, so
there is no reciprocal gift-giving or other sub-circuits in the graph.
Whether you think this is a bad thing is a matter of preference. Personally, I
would like all remaining pairs to be equally probable at each step of the
drawing. This is not the case when backtracking is not allowed. (For example, if
I draw Marissa, then the remaining edges are not all equally likely, because
P(Marissa draws TJ | TJ draws Marissa) = 0.)
Let’s think about what happens when I draw Marissa’s name from a nice big red
Santa hat.
- The edge from TJ to Marissa is fixed. (I drew her name.)
- All other edges from TJ become illegal. (I can’t draw any more names.)
- All other edges onto Marissa become illegal. (No one else can draw her name either.)
To simulate a single hat-drawing, we randomly select a legal edge, fix it, and
delete all illegal edges. Let’s work through a couple of examples.
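As a plain-Python sketch (the post implements this in R with dplyr; `fix_edge` is a made-up helper), fixing a drawn edge and deleting the newly illegal ones looks like this:

```python
# Keep the drawn (giver, receiver) edge; delete every other edge that
# leaves the same giver or enters the same receiver.

def fix_edge(edges, drawn):
    giver, receiver = drawn
    return {(g, r) for g, r in edges
            if (g, r) == drawn                    # the fixed edge stays
            or (g != giver and r != receiver)}    # others lose that giver/receiver

# all 20 edges among five players, then "TJ (2) draws Marissa (5)"
edges = {(g, r) for g in range(1, 6) for r in range(1, 6) if g != r}
after = fix_edge(edges, (2, 5))
# no other edge now leaves node 2 or enters node 5
```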
First, we need some helper functions.
draw_secret_santa_edge <- function(edge_df, ...) {
  aes_options <- quos(...)
  edge_df %>%
    filter(rel != "gives-to") %>%
    sample_n(1) %>%
    mutate(!!! aes_options)
}

find_illegal_edges <- function(edge_df, edge, ...) {
  aes_options <- quos(...)
  outgoing <- edge_df %>%
    filter(from %in% edge$from)
  incoming <- edge_df %>%
    filter(to %in% edge$to)
  # The one edge that is not illegal is in both
  # the incoming and outgoing sets
  to_keep <- dplyr::intersect(outgoing, incoming)
  outgoing %>%
    bind_rows(incoming) %>%
    anti_join(to_keep, by = c("from", "to")) %>%
    mutate(!!! aes_options)
}

Here we draw a single edge (red with a fat arrow). All of the other edges that
point to the same node are illegal (navy), as are all of the edges that have the
same origin as the drawn edge.
(Graph, “Selected vs. illegal”: all 20 edges, with the drawn edge 2→1 shown with a fat arrow and the illegal edges – the other edges leaving node 2 or entering node 1 – in navy.)
We delete those illegal edges, leaving us with the following graph.
edges_after_pick1 <- all_possible_edges %>%
  overwrite_edges(current_pick %>% mutate(arrowsize = NULL)) %>%
  anti_join(current_illegal_edges, by = "id")

create_graph(nodes, edges_after_pick1) %>%
  render_graph(title = "After one draw")
(Graph, “After one draw”: the 14 remaining edges – 2→1 is now node 2’s only outgoing edge and node 1’s only incoming edge.)
The name has been removed from the hat, and the graph is simpler now!
Let’s do it again. Draw a random legal edge (fat arrow) and identify all the
illegal paths (navy).
(Graph, “Selected vs. illegal”: the 14 remaining edges, with the drawn edge 4→3 shown with a fat arrow and the newly illegal edges in navy.)
After deleting illegal edges, the problem simplifies further.
edges_after_pick2 <- edges_after_pick1 %>%
  overwrite_edges(current_pick %>% mutate(arrowsize = NULL)) %>%
  anti_join(current_illegal_edges, by = "id")

create_graph(nodes, edges_after_pick2) %>%
  render_graph(title = "After two draws")
(Graph, “After two draws”: 10 edges remain; 4→3 has joined 2→1 as a fixed gift-giving edge.)
You can tell where this is going… Loop Town!
To finish up, we are going to repeat this process until only
gift-giving edges are left. We will control the loop with a helper function
that tells us whether any free edges remain.
In the function below, the while-loop does the same steps as above: randomly
selecting a free edge and removing illegal edges.
After the while-loop, the function checks whether it has a valid set of gift edges,
and if it doesn’t, the function calls itself again. This bit is intended to
handle more constrained situations. Such as…
I lied above. Secret Santa is not so simple in my family. For my generation (me,
my siblings, and our partners), there’s a rule that a player can’t draw their
partner’s name. Similarly, my nieces and nephews (and now also my child
:sparkling_heart:) have
their own gift exchange with an added constraint: a player can’t give their
sibling a gift. The elegant and simple Hamiltonian solution fails under these
constraints unless you write a special shuffling algorithm. Our hat-drawing
approach, however, handles this situation with minimal effort. Let’s work
through an example with my nieces and nephews (and :baby:!). To protect the very
young, I have replaced their names with Pokemon names.
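Putting the whole hat-drawing loop together with the extra constraint, here is a plain-Python sketch (`draw_names` and the family mapping are hypothetical illustrations, not the post's R/DiagrammeR code):

```python
# Repeatedly draw a free edge, fix it, and delete the newly illegal
# edges; if the hat runs dry before everyone has a match (a dead end),
# start the whole drawing over.

import random

def draw_names(players, same_family, rng=random):
    """same_family(a, b) -> True means a may not draw b."""
    while True:
        edges = {(g, r) for g in players for r in players
                 if g != r and not same_family(g, r)}
        picked = {}
        while edges:
            g, r = rng.choice(sorted(edges))
            picked[g] = r
            edges = {(a, b) for a, b in edges if a != g and b != r}
        if len(picked) == len(players):   # otherwise: dead end, retry
            return picked

families = {"Squirtle": 1, "Wartortle": 1, "Pikachu": 2, "Bulbasaur": 3}
pairs = draw_names(list(families), lambda a, b: families[a] == families[b])
```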
Below we define the children and their families and do some data-wrangling so
that we have columns with the family at the start and end of each edge.
In the graph below, we draw an olive edge to connect each pair of siblings.
These edges are illegal before we even start drawing names.
(Graph: all 132 directed edges among the twelve children – 1 Squirtle, 2 Wartortle, 3 Pikachu, 4 Bulbasaur, 5 Ivysaur, 6 Venusaur, 7 Machamp, 8 Machoke, 9 Ratata, 10 Raticate, 11 Mew, 12 Mewtwo – with sibling edges drawn in olive.)
The solution for this trickier, more constrained version of the problem is to
delete the illegal edges beforehand so that they can never be drawn from the
hat. After that, the same algorithm works as before.
(Graph: the final draw – 1 Squirtle→4 Bulbasaur, 2 Wartortle→6 Venusaur, 3 Pikachu→9 Ratata, 4 Bulbasaur→3 Pikachu, 5 Ivysaur→11 Mew, 6 Venusaur→8 Machoke, 7 Machamp→1 Squirtle, 8 Machoke→12 Mewtwo, 9 Ratata→7 Machamp, 10 Raticate→2 Wartortle, 11 Mew→10 Raticate, 12 Mewtwo→5 Ivysaur.)
Like usual, I’m not sure how to close one of these blog posts. I guess: when a
problem involves relations among individuals, always consider whether there is a
graph hiding in the background. Even the simple process of drawing names from a
hat for Secret Santa describes a graph: a graph of gift-giving relations among
individuals. This graph wasn’t obvious to me until I thought about the Hamiltonian
path solution. Hey, wait a minute, that’s a big gift-giving circle! It’s like
some kind of network… ooooooh.
To leave a comment for the author, please follow the link and comment on their blog: Higher Order Functions.
Vectorized Block ifelse in R
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
Win-Vector LLC has been working on porting some significant large-scale production systems from SAS to R.
From this experience we want to share how to simulate, in R with Apache Spark (via sparklyr), a nifty SAS feature: the vectorized “block if(){}else{}” structure.
When porting code from one language to another you hope the expressive power and style of the languages are similar.
- If the source language is too weak then the original code will be very long (and essentially over-specified), meaning a direct transliteration will be unlikely to be efficient, as you are not using the higher-order operators of the target language.
- If the source language is too strong you will have operators that don’t have direct analogues in the target language.
SAS has some strong and powerful operators. One such is what I am calling “the vectorized block if(){}else{}“. From SAS documentation:
The subsetting IF statement causes the DATA step to continue processing only those raw data records or those observations from a SAS data set that meet the condition of the expression that is specified in the IF statement.
That is a really wonderful operator!
R has some available related operators: base::ifelse(), dplyr::if_else(), and dplyr::mutate_if(). However, none of these has the full expressive power of the SAS operator, which can per data row:
- Conditionally choose where different assignments are made to (not just conditionally choose which values are taken).
- Conditionally specify blocks of assignments that happen together.
- Be efficiently nested and chained with other IF statements.
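For intuition, here is what a per-row "block" of conditional assignments looks like, sketched over plain Python dict-rows (the helper and column names are made up; seplyr's actual device generates dplyr code):

```python
# A "block if/else": the condition is evaluated once per row, and a
# whole block of assignments is applied together, rather than choosing
# values one column at a time.

def block_if_else(rows, condition, then_block, else_block):
    """Apply a block of assignments per row, chosen by condition."""
    for row in rows:
        row.update(then_block(row) if condition(row) else else_block(row))
    return rows

rows = [{"x": 1}, {"x": -1}]
block_if_else(rows,
              condition=lambda r: r["x"] > 0,
              then_block=lambda r: {"sign": "pos", "y": r["x"] * 2},
              else_block=lambda r: {"sign": "neg", "y": 0})
# -> 'sign' and 'y' are assigned together, per row
```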
To help achieve such expressive power in R, Win-Vector is introducing seplyr::if_else_device(). When combined with seplyr::partition_mutate_se() you get a good high-performance simulation of the SAS power in R. These are now available in the open-source R package seplyr.
For more information please reach out to us here at WinVector or try help(if_else_device).
Also, we will publicize more documentation and examples shortly (especially showing big data scale use with Apache Spark via Sparklyr).
To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog.
#11: (Much) Faster Package (Re)Installation via Caching
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
Welcome to the eleventh post in the rarely rued R rants series, or R4 for short. Time clearly flies: it has been three months since our last post on significantly reducing library size via stripping. I had been meaning to post on today’s topic for quite some time, but somehow something (working on a paper, releasing a package, …) got in the way.
Just a few days ago Colin (of Efficient R Programming fame) posted about speeding up package installation. His recommendation? Remember that we (usually) have multiple cores, and use several of them via options(Ncpus = XX). It is an excellent point, and it bears repeating.
But it turns out I have not one but two salient recommendations too. Today covers the first; we should hopefully get to the second pretty soon. Both have one thing in common: you will be fastest if you avoid doing the work in the first place.
What?

One truly outstanding tool for this in the context of the installation of compiled packages is ccache. It is actually a pretty old tool that has been around for well over a decade, and it comes from the folks that gave us the Samba filesystem.
What does it do? Well, in a nutshell, it "hashes" a checksum of a source file once the preprocessor has operated on it, and stores the resulting object file. In the case of a rebuild with unchanged code, you get the object code back pretty much immediately. The idea is very similar to memoisation (as implemented in R, for example, in the excellent little memoise package by Hadley, Jim, Kirill and Daniel). The idea is the same: if you have to do something even moderately expensive a few times, do it once and then recall it the other times.
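The same caching idea in miniature, sketched in Python with a content hash standing in for ccache's preprocessor checksum (BuildCache is a toy illustration, not ccache's actual design):

```python
# Hash the input; if we've seen this exact content before, return the
# stored result instead of redoing the expensive "compile" step.

import hashlib

class BuildCache:
    def __init__(self):
        self.store = {}
        self.hits = 0

    def build(self, source, compile_fn):
        key = hashlib.sha256(source.encode()).hexdigest()
        if key in self.store:
            self.hits += 1          # unchanged source: result comes back free
        else:
            self.store[key] = compile_fn(source)
        return self.store[key]

cache = BuildCache()
cache.build("int main(){}", compile_fn=len)   # "compiles" (here: just len())
cache.build("int main(){}", compile_fn=len)   # cache hit, no recompile
```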
This happens (at least to me) more often than not in package development. Maybe you change just one of several source files. Maybe you change just the R code, the Rd documentation, or a test file, yet you still need a full reinstallation. In all these cases, ccache can help tremendously, as illustrated below.
How?

Because essentially all our access to compilation happens through R, we need to set this in a file read by R. I use ~/.R/Makevars for this and have something like these lines on my machines:
VER=
CCACHE=ccache
CC=$(CCACHE) gcc$(VER)
CXX=$(CCACHE) g++$(VER)
CXX11=$(CCACHE) g++$(VER)
CXX14=$(CCACHE) g++$(VER)
FC=$(CCACHE) gfortran$(VER)
F77=$(CCACHE) gfortran$(VER)

That way, when R calls the compiler(s) it will prefix the call with ccache, and ccache will then speed things up.
There is an additional issue due to how R is used. Often we install from a .tar.gz. These will be freshly unpacked, and hence have "new" timestamps. This would usually lead ccache to skip the file (for fear of "false positives"), so we have to override this. Similarly, the tarball is usually unpacked in a temporary directory with an ephemeral name, creating a unique path. That too needs to be overridden. So in my ~/.ccache/ccache.conf I have this:
max_size = 5.0G
# important for R CMD INSTALL *.tar.gz as tarballs are expanded freshly -> fresh ctime
sloppiness = include_file_ctime
# also important as the (temp.) directory name will differ
hash_dir = false

Show Me

A quick illustration will round out the post. Some packages are meatier than others. More C++ with more templates usually means longer build times. Below is a quick chart comparing times for a few such packages (i.e. RQuantLib, dplyr, rstan) as well as igraph ("merely" a large C package) and lme4 as well as Rcpp. The worst among these is still my own RQuantLib package wrapping (still just parts of) the ginormous and Boost-heavy QuantLib library.
Pretty dramatic gains. Best of all, we can of course combine these with other methods such as Colin’s use of multiple CPUs, or even a simple MAKE="make -j4" to have multiple compilation units considered in parallel. So maybe we all get to spend less time on social media and other time-wasters as we spend less time waiting for our builds. Or maybe that is too much to hope for…
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.