R news and tutorials contributed by (750) R bloggers

### 6 myths about refuelling – tackled with statistics

Fri, 05/17/2019 - 09:00

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

On my daily commute to STATWORX, I pass the gas station down the street twice a day. Even though I ride a bike, I still look at the gas prices. After a few weeks of observation, I thought I had spotted some patterns. My statistical senses were tingling!

Of course, I am not the first one to notice patterns. Nearly everyone has their own theory about when to buy cheap gas! Some of the more famous examples are:

1. Everything was better in the old days
2. It’s cheaper in the evening
3. Gas stations at motorways are more expensive
4. No-name stations have lower prices than top brands
5. The difference between diesel and gas (E10) is shrinking
6. It’s cheapest on Mondays

But, are these facts or just rumours? To check all those myths, I gathered some data, explored relations and finally plotted my results!

FYI: The code I used will be presented in a future blog post, where I will also take a deeper look at my coding and optimisation process – stay tuned!

The king of gas price data

Since December 2013, all gas stations in Germany must report their prices to the Markttransparenzstelle für Kraftstoffe, which is part of the Bundeskartellamt. While searching for this data, I stumbled upon the site tankerkoenig.de, which has gathered the price data for nearly all gas stations since 2014. At first, they only provided a database dump with all historical data, but since October 2018 they also offer a git repository with daily CSV files.

The data contain all price changes of around 15,000 gas stations and look as follows:

```
                  date station_uuid diesel    e5   e10
1: 2019-01-01 00:01:06     2ae0e4dc  1.319 1.399 1.379
2: 2019-01-01 00:01:06     08a67e2e  1.339 1.409 1.389
3: 2019-01-01 00:01:06     939f06e5  1.299 1.379 1.359
4: 2019-01-01 00:01:06     d299da0e  1.279 1.389 1.369
5: 2019-01-01 00:01:06     a1e15688  1.228 1.328 1.308
```
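A minimal sketch of loading one of these daily files with data.table. The file here is a tiny synthetic stand-in written to a temp path (the real repository files are far larger); the column layout follows the print-out above.

```r
library(data.table)

# Synthetic stand-in for one of the daily tankerkoenig CSV files,
# using the same columns as shown above
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(
  date         = c("2019-01-01 00:01:06", "2019-01-01 00:01:06"),
  station_uuid = c("2ae0e4dc", "08a67e2e"),
  diesel       = c(1.319, 1.339),
  e5           = c(1.399, 1.409),
  e10          = c(1.379, 1.389)
), tmp)

# Read the file and parse the timestamp column
prices <- fread(tmp)
prices[, date := as.POSIXct(date, tz = "UTC")]
print(prices)
```

For the full data set, the same `fread()` call simply points at each daily file from the git repository instead.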

In addition to the prices, tankerkoenig.de provides further information like the brand, the geo-coordinates and the city of each gas station. Since I was also interested in whether a station is near a motorway, I added this information from the ADAC (the German automobile club). This data did not seem to be complete, but it was good enough for my purpose. With the result, I was able to plot the stations, and the motorways emerged!

If you want to know more about how to make plots with geo-coordinates and ggplot2, check out the blog post of my colleague Lea!

Only believe the facts!

To confirm or debunk the myths I chose, I went ahead and filtered, aggregated, and merged the raw data. Be prepared for some nice plots and insights!

1. Everything was better in the old days

Nothing shows a development better than a good old timeline! As you can see in the following graphic, I plotted the mean price per day. To get a little more insight into the price ranges, I included the 10% and 90% quantiles of the prices for each day.
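As a sketch of the aggregation behind this graphic (on synthetic data, since the full tankerkoenig dump is several gigabytes), the daily mean and the 10%/90% quantiles can be computed like this:

```r
# Synthetic stand-in for the real data: daily e10 prices at 50 stations
set.seed(42)
dates  <- seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day")
prices <- data.frame(date = rep(dates, each = 50))
prices$e10 <- 1.35 + 0.1 * sin(as.numeric(prices$date) / 58) +
  rnorm(nrow(prices), sd = 0.03)

# Mean price plus 10% and 90% quantiles per day
daily <- data.frame(
  date = dates,
  mean = aggregate(e10 ~ date, prices, mean)$e10,
  q10  = aggregate(e10 ~ date, prices, quantile, probs = 0.10)$e10,
  q90  = aggregate(e10 ~ date, prices, quantile, probs = 0.90)$e10
)

# Ribbon plot of the quantile band around the mean, if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ggplot(daily, aes(date, mean)) +
    geom_ribbon(aes(ymin = q10, ymax = q90), fill = "grey85") +
    geom_line(colour = "#0085af") +
    labs(y = "price [EUR]")
}
```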

Of course, this is just a glimpse, and there were times (many years back) when the price was below 1 DM (~0.50 €). But over the last five years, the price stayed roughly within the same range. There was a low point early in 2016, and the latest peak was still a bit lower than the one in mid-2014.

Conclusion: For the “short-term” old days, the myth does not hold up – for the “long run” … probably yes!

2. It’s cheaper in the evening

To check this myth, I aggregated the data over days and years. I noticed that there were different patterns over the years. From 2014 till 2017, there seemed to be a pattern where the price was high in the morning, slowly decreased over the day with a small peak at noon, and rose again in the evening. Since 2018, a new wave-like pattern with three peaks has emerged: the highest one in the morning around 7:00, one at noon, and the last one around 17:00.
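The hourly aggregation can be sketched like this on synthetic data, with an artificial morning peak standing in for the real pattern:

```r
# Synthetic price changes over one year, with an artificial peak at 7:00
set.seed(1)
n    <- 10000
time <- as.POSIXct("2019-01-01", tz = "UTC") + runif(n, 0, 365 * 24 * 3600)
hour <- as.integer(format(time, "%H"))
e10  <- 1.35 + rnorm(n, sd = 0.02) + ifelse(hour == 7, 0.05, 0)

# Mean price per hour of the day; the peak hour pops out
hourly <- tapply(e10, hour, mean)
names(which.max(hourly))
```

On the real data, the same aggregation would additionally be grouped by year to see the pattern shift described above.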

Since the price level varied a bit over the years, I also looked at the scaled prices. Here, the difference between the patterns was even more visible.

Conclusion: From 2015 to 2017, the myth seemed to be right. Lately, however, there have been multiple “good times” across the day – mostly right before the afternoon or evening peaks. Since the price does not rise much after the evening peak, the myth still holds. At the moment, the lowest point seems to be around 21:00.

Update: While writing this post, I found a recent study from the Goethe University Frankfurt that shows there is a new pattern with even more peaks!

3. Gas stations at motorways are more expensive

In the plot above, I already showed where the motorway stations are – the blue dots. In the next plot, the blue lines represent the prices at those motorway stations. One can clearly spot the difference, even though there are some stations whose prices overlap with the 10% and 90% quantile range.

Conclusion: Obviously right!

4. No-name stations have lower prices than top brands

To define the top brands, I took the nine brands with the most stations in Germany and labelled the rest “other”. After that, I calculated the difference between the overall mean price and the mean price per brand. Lines above zero therefore indicate that a brand is more expensive than the mean, and vice versa. Of the top brands, ARAL and SHELL were always above the mean, some fluctuated around it, and some were always below it. The no-name stations were cheaper than the mean over the whole timespan.
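A minimal sketch of that brand comparison on synthetic data (the brand effects here are made up purely for illustration):

```r
# Synthetic station prices with a hypothetical brand effect
set.seed(2)
brand <- sample(c("ARAL", "SHELL", "other"), 300, replace = TRUE)
e10   <- 1.35 + rnorm(300, sd = 0.02) +
  ifelse(brand == "ARAL", 0.03, ifelse(brand == "other", -0.03, 0))

# Deviation of each brand's mean price from the overall mean price;
# positive values mean "more expensive than the mean"
deviation <- tapply(e10, brand, mean) - mean(e10)
round(deviation, 3)
```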

Conclusion: The result depends on how a no-name station is defined. Since the top five brands are mostly more expensive than the mean, I think the myth is somewhat right. At least, it is safe to say that there are differences between the brands.

5. The difference between diesel and gas (E10) is shrinking

To compare the differences, I took the E10 price as the baseline and plotted the other two fuel types against it. Between E5 and E10, there was nearly no change over the last five years. For diesel, however, it was a different story: since 2014, the difference took a wild ride from 10% cheaper than E10, to 20% cheaper at the beginning of 2016, and back to only 5% now.
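The baseline comparison boils down to dividing each fuel's price by the E10 price; a toy example with made-up, constant offsets:

```r
# Synthetic daily mean prices per fuel type
set.seed(3)
e10    <- 1.35 + rnorm(100, sd = 0.01)
e5     <- e10 + 0.02   # E5 tracks E10 with a near-constant offset
diesel <- e10 * 0.9    # diesel here a flat 10% below E10

# Relative difference to the E10 baseline; negative means cheaper
e5_rel     <- e5 / e10 - 1
diesel_rel <- diesel / e10 - 1
range(diesel_rel)
```

On the real data, `diesel_rel` would be a time series, and the shrinking gap shows up as the curve drifting towards zero.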

Conclusion: Yes, the difference is shrinking at the moment compared to the last five years.

6. It’s cheapest on Mondays

To check the last myth, I calculated the mean price per hour, fuel type and weekday. The weekday with the lowest price at a given time is shown in the following plot. All gas types revealed a similar pattern. Between the years, there was a subtle change, mostly between Wednesday and Thursday.
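The per-hour “cheapest weekday” lookup can be sketched as follows (synthetic data; Sunday is made artificially cheap just to have a clear winner):

```r
# Synthetic mean prices per weekday and hour; Sunday made artificially cheap
set.seed(4)
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
grid <- expand.grid(weekday = days, hour = 0:23, stringsAsFactors = FALSE)
grid$mean_price <- 1.35 + rnorm(nrow(grid), sd = 0.005) -
  ifelse(grid$weekday == "Sun", 0.05, 0)

# For each hour of the day, pick the weekday with the lowest mean price
cheapest <- sapply(split(grid, grid$hour),
                   function(d) d$weekday[which.min(d$mean_price)])
table(cheapest)
```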

Conclusion: Since Monday barely ever was the cheapest day – this myth got debunked!

The results are in

After all of that data hacking and pattern finding, these are my results:

1. Everything was better in the old days – FALSEish
2. It’s cheaper in the evening – TRUE
3. Gas stations at motorways are more expensive – TRUE
4. No-name stations have lower prices than top brands – TRUEish
5. The difference between diesel and gas (E10) is shrinking – TRUE
6. It’s cheapest on Mondays – FALSE

Of course, this analysis is far from complete and took some shortcuts as well. Still, I had a lot of fun analysing this big data set, regardless of all the internal fights with my RAM. I will present some of these fights and their outcomes in my next blog post, where you will also get more information about my code optimisation process.

In another upcoming blog post, my colleague Matthias will use this data to predict the future price. So, stay tuned!

References
• The data is used under the Creative Commons license (CC BY 4.0).

About the author

Jakob Gepp

Numbers have always been my passion, and as a data scientist and statistician at STATWORX, I can fulfill my nerdy needs. I am also responsible for our blog. So if you have any questions or suggestions, just send me an email!

STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI.


The post 6 myths about refuelling – tackled with statistics first appeared on STATWORX.


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Four ways to reverse a string in R

Fri, 05/17/2019 - 05:26

(This article was first published on R – Open Source Automation, and kindly contributed to R-bloggers)

R offers several ways to reverse a string, including some base R options. We go through a few of them in this post and also compare the computational time of each method.

Reversing a string can be especially useful in bioinformatics (e.g. finding the reverse complement of a DNA strand). To get started, let’s generate a random string of 10 million DNA bases (we could do this with the stringi package as well, but for our purposes here, let’s just use base R functions).

```r
set.seed(1)
dna <- paste(sample(c("A", "T", "C", "G"), 10000000, replace = T), collapse = "")
```

1) Base R with strsplit and paste

One way to reverse a string is to use strsplit with paste. This is the slowest method that will be shown, but it does get the job done without needing any packages. In this example, we use strsplit to break the string into a vector of its individual characters. We then reverse this vector using rev. Finally, we concatenate the vector of characters into a string using paste.

```r
start <- proc.time()
splits <- strsplit(dna, "")[[1]]
reversed <- rev(splits)
final_result <- paste(reversed, collapse = "")
end <- proc.time()
print(end - start)
```

2) Base R: Using utf8 magic

This example also does not require any external packages. In this method, we can use the built-in R function utf8ToInt to convert our DNA string to a vector of integers. We then reverse this vector with the rev function. Lastly, we convert this reversed vector of integers back to its original encoding – except now the string is in reverse.

```r
start <- proc.time()
final_result <- intToUtf8(rev(utf8ToInt(dna)))
end <- proc.time()
print(end - start)
```

3) The stringi package

Of all the examples presented, this option is the fastest when tested. Here we use the stri_reverse function from the stringi package.

```r
library(stringi)
start <- proc.time()
final_result <- stri_reverse(dna)
end <- proc.time()
print(end - start)
```

4) The Biostrings package

Our last example uses the Biostrings package, which contains a collection of functions useful for working with DNA-string data. One function, called str_rev, can reverse strings. You can download and load the Biostrings package like this:

```r
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
library(Biostrings)
```

Then, all we have to do is input our DNA string into the str_rev function and we get our result.

```r
start <- proc.time()
final_result <- str_rev(dna)
end <- proc.time()
print(end - start)
```
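To put numbers side by side on one machine, the two base-R approaches can be timed against each other. This is a rough sketch on 1 million bases (absolute timings vary by hardware); the stringi and Biostrings routes can be benchmarked the same way:

```r
# Two base-R reversal helpers, wrapping the methods shown above
reverse_split <- function(x) paste(rev(strsplit(x, "")[[1]]), collapse = "")
reverse_utf8  <- function(x) intToUtf8(rev(utf8ToInt(x)))

set.seed(1)
dna <- paste(sample(c("A", "T", "C", "G"), 1000000, replace = TRUE), collapse = "")

# Time both; the utf8 route avoids materialising a million-element
# character vector and is typically much faster
t_split <- system.time(r1 <- reverse_split(dna))["elapsed"]
t_utf8  <- system.time(r2 <- reverse_utf8(dna))["elapsed"]

# Both methods must agree on the result
stopifnot(identical(r1, r2))
c(strsplit = t_split, utf8 = t_utf8)
```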

That’s it for this post! Please check out my other articles here.

The post Four ways to reverse a string in R appeared first on Open Source Automation.


To leave a comment for the author, please follow the link and comment on their blog: R – Open Source Automation.

### Using Data Cubes with R

Fri, 05/17/2019 - 04:29

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)


Data cubes are a popular way to organise and display multidimensional data, which makes the method well suited for big data. Given the incredible growth of data, it is natural that the method has become increasingly popular. In this article, you will learn to use R for data cubes.

Read data cube packages into R

First we will read the packages into the R library:

```r
# Read packages into R library
library(data.table)
library(data.cube)
library(rpivotTable)
```

Next we will build a sample array for the data cube:

```r
# Create sample array
set.seed(1L)
ar.dimnames = list(color = sort(c("green","yellow","red")),
                   year = as.character(2011:2015),
                   status = sort(c("active","inactive","archived","removed")))
ar.dim = sapply(ar.dimnames, length)
ar = array(sample(c(rep(NA, 4), 4:7/2), prod(ar.dim), TRUE),
           unname(ar.dim), ar.dimnames)
print(ar)
cb = as.cube(ar)
print(cb)
str(cb)
all.equal(ar, as.array(cb))
all.equal(dim(ar), dim(cb))
all.equal(dimnames(ar), dimnames(cb))
```

```
> print(cb)
fact:
  fact 30 rows x 4 cols (0.00 MB)
dims:
  color 3 rows x 1 cols (0.00 MB)
  year 5 rows x 1 cols (0.00 MB)
  status 4 rows x 1 cols (0.00 MB)
total size: 0.01 MB
> str(cb)
Classes 'cube', 'R6'.
cube$env$fact: List of 1
 $ fact  :Classes 'data.table' and 'data.frame': 30 obs. of 4 variables:
cube$env$dims: List of 3
 $ color :Classes 'data.table' and 'data.frame': 3 obs. of 1 variable:
 $ year  :Classes 'data.table' and 'data.frame': 5 obs. of 1 variable:
 $ status:Classes 'data.table' and 'data.frame': 4 obs. of 1 variable:
```

Now it is time to create a slice and dice for the data cube:

```r
# slice
arr = ar["green",,]
print(arr)
r = cb["green",]
print(r)
all.equal(arr, as.array(r))
arr = ar["green",,,drop=FALSE]
print(arr)
r = cb["green",,,drop=FALSE]
print(r)
all.equal(arr, as.array(r))
arr = ar["green",,"active"]
r = cb["green",,"active"]
all.equal(arr, as.array(r))

# dice
arr = ar["green",, c("active","archived","inactive")]
r = cb["green",, c("active","archived","inactive")]
all.equal(arr, as.array(r))
as.data.table(r)
as.data.table(r, na.fill = TRUE)
```

```
> print(arr)
      status
year   active archived inactive removed
  2011     NA       NA       NA     3.0
  2012    3.5       NA       NA     2.5
  2013    3.5       NA        3     3.0
  2014     NA       NA       NA      NA
  2015    2.5       NA        3     2.0
> print(r)
fact:
  fact 9 rows x 3 cols (0.00 MB)
dims:
  year 5 rows x 1 cols (0.00 MB)
  status 4 rows x 1 cols (0.00 MB)
total size: 0.01 MB
> print(arr)
, , status = active
       year
color   2011 2012 2013 2014 2015
  green   NA  3.5  3.5   NA  2.5

, , status = archived
       year
color   2011 2012 2013 2014 2015
  green   NA   NA   NA   NA   NA

, , status = inactive
       year
color   2011 2012 2013 2014 2015
  green   NA   NA    3   NA    3

, , status = removed
       year
color   2011 2012 2013 2014 2015
  green    3  2.5    3   NA    2
> print(r)
fact:
  fact 9 rows x 4 cols (0.00 MB)
dims:
  color 1 rows x 1 cols (0.00 MB)
  year 5 rows x 1 cols (0.00 MB)
  status 4 rows x 1 cols (0.00 MB)
total size: 0.01 MB
> as.data.table(r)
   year   status value
1: 2012   active   3.5
2: 2013   active   3.5
3: 2013 inactive   3.0
4: 2015   active   2.5
5: 2015 inactive   3.0
> as.data.table(r, na.fill = TRUE)
    year   status value
 1: 2011   active    NA
 2: 2011 archived    NA
 3: 2011 inactive    NA
 4: 2012   active   3.5
 5: 2012 archived    NA
 6: 2012 inactive    NA
 7: 2013   active   3.5
 8: 2013 archived    NA
 9: 2013 inactive   3.0
10: 2014   active    NA
11: 2014 archived    NA
12: 2014 inactive    NA
13: 2015   active   2.5
14: 2015 archived    NA
15: 2015 inactive   3.0
```

Now it is time to apply rollup and drilldown to the data cube:

```r
# apply
format(aggregate(cb, c("year","status"), sum))
format(capply(cb, c("year","status"), sum))

# rollup and drilldown
# granular data with all totals
r = rollup(cb, MARGIN = c("color","year"), FUN = sum)
format(r)
# chose subtotals - drilldown to required levels of aggregates
r = rollup(cb, MARGIN = c("color","year"), INDEX = 1:2, FUN = sum)
format(r)
```

Now let’s try to turn the data cube into a pivot table:

```r
# pivot
r = capply(cb, c("year","status"), sum)
format(r, dcast = TRUE, formula = year ~ status)
library(rpivotTable)
r = rollup(cb, c("year","status"), FUN = sum, normalize = FALSE)
rpivotTable(r, rows = "year", cols = c("status"), width = "100%", height = "400px")
```

This gives us the following pivot table in HTML:



To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

### MobileTrigger Setup: Run R Scripts, Models, Reports with Mobile Device

Fri, 05/17/2019 - 03:50

(This article was first published on R – R-BAR, and kindly contributed to R-bloggers)

Say hello to the MobileTrigger package. With it, you can run R scripts, models, and reports from a mobile device using e-mail triggers. In its debut offering, version 0.0.31 of the package provides List and Run trigger functions. The list functions allow you to see available scripts, models, and reports. Complementing them, the run functions allow you to launch and interact with R while you're traveling and cannot be at your desk to run R directly. In this post, we'll cover MobileTrigger installation and setup. Follow-up articles will detail how to set up scripts, reports, and models for mobile execution and interaction.

The Concept

In Running Remote R Scripts with Mobile Devices using E-mail Triggers, I showed you how a mobile device could be used to trigger an R script back at your home or office. In a nutshell, the process looked like this:

1. Write an R script on the remote machine that performs an operation and e-mails the results back to your mobile device using the mailR package.
2. Create a special batch file that calls “headless R” and your target R script.
3. Set your desktop e-mail client to launch the special batch file when an e-mail with the subject “Hey R – Send Update” is received from your mobile device.
The MobileTrigger Package

Having the ability to launch a single R script remotely is nifty. But, most of us need to run many R scripts, models, and reports during the day.

hmmm… How to handle this?

To work more generally with R via e-mail, I started working on the MobileTrigger package which is now on CRAN. The package allows for basic two-way communication between your mobile device and R. As discussed above, communication relies strongly on e-mail client rules, batch files, and headless R. The graphic below summarizes the core workflow of the MobileTrigger package.

Package Note: At this time, the MobileTrigger package only works for Windows users who read their e-mail with either Outlook or ThunderBird. I'm confident a solution could be made for Mac and Linux environments. If you would like to contribute to the project, please visit the github repository for the MobileTrigger package.

Now that we have oriented ourselves, let's install and setup the MobileTrigger package.

Objectives: Install and Setup MobileTrigger

Install MobileTrigger

Mobile triggers can be installed from CRAN or GitHub using the following code:

```r
install.packages("MobileTrigger")                    # CRAN
devtools::install_github("kenithgrey/MobileTrigger") # GitHub
```

Planning Your MobileTrigger Setup

There is a lot of communication going on in the MobileTrigger package. Specifically, communication needs to happen between three systems: your mobile device, R, and your desktop e-mail client. You can do this using the same e-mail address for all three systems, a different e-mail address for each system, or a hybrid, as detailed below:

MobileTriggers Uses One E-mail Address

```r
MobileEmail <- RsEmail <- MailClientEmail <- "Your.Email@somewhere.com"
```

The benefit to using a single e-mail address is that you only need SMTP server settings for that one e-mail address.

MobileTriggers Uses Two E-mail Addresses

```r
MobileEmail <- "Email@MobileDevice.com"
RsEmail <- MailClientEmail <- "Your.Email@somewhere.com"
```

For this you will need SMTP settings for both e-mail addresses.

MobileTriggers Uses Three E-mail Addresses

```r
MobileEmail <- "Email@MobileDevice.com"
RsEmail <- "For.R@R-email.com"
MailClientEmail <- "Your.Email@somewhere.com"
```

For this you will need SMTP settings for both R and the Mobile e-mail addresses.

In the sections that follow, we are going to use the three e-mail address approach. If you understand how to work with three e-mail addresses, then setting up the MobileTrigger package with the one or two e-mail address options is trivial.

MobileTrigger File and Folder Setup

To get us rolling with the MobileTrigger package, we are going to start with the SetupWindowsTrigger() function. The function creates most of the files you need to facilitate communication between R and your mobile device. To use SetupWindowsTrigger(), you need to know the details of the e-mail server R should use to send messages to your mobile device. Please make sure you have the following information on hand:

• SMTP server for outgoing e-mail
• The SMTP server port

Other pieces of information you will need for the SetupWindowsTrigger() function include:

• “root trigger path”: The root trigger path is where MobileTrigger will conduct its business. It will also be the home of the scripts, models, and reports you plan to use with MobileTrigger. I recommend you put it some place simple like “c:/trigger” if you can.

Note: At this time, spaces in the trigger path are poorly supported. I recommend that you keep the trigger path simple (e.g., “c:/trigger”).

• Mobile E-Mail Address: The package uses this to send results to your mobile device. Specify this e-mail address using the function’s Mail.To argument.
• R’s Sending E-mail Address: R will need a way to communicate out using the mailR package. You may want to make an e-mail address like R@mydomain.com. Specify this e-mail address using the function’s Mail.From argument.

Below is an example call to the SetupWindowsTrigger() function showing how all this information gets used.

```r
require(MobileTrigger)

################################################
# Define path & E-mail Addresses               #
################################################
path <- "c:/trigger"
MobileEmail <- "Email@MobileDevice.com"
RsEmail <- "For.R@outlook.com"
MailClientEmail <- "Your.Email@somewhere.com"

###################################################
# R's Email Outlook SMTP Example (uses tls = TRUE) #
###################################################
SetupWindowsTrigger(
  path = path,
  Mail.To = MobileEmail,
  Mail.From = RsEmail,
  SMTP.Settings = list(host.name = "smtp.office365.com",
                       port = 587,
                       user.name = RsEmail,
                       passwd = 'password',
                       tls = TRUE)
)
```

Getting all the SMTP server information setup correctly can be challenging, particularly if you are using a webmail provider like gmail, yahoo, hotmail etc. If you are having trouble, check out my post on setting up webmail communication with the mailR package. Lastly, make sure you update the SetupWindowsTrigger function call with your information.

After Running SetupWindowsTrigger

After running the SetupWindowsTrigger function, take a moment to examine what's in the root trigger path. Below is a screenshot of what mine looks like with some descriptions.

As you can see from the screenshot above, the folder contains e-mail input files, a series of batch and R files for headless R operations, and some folders to put your codes depending on what you're trying to accomplish. All these files work together to facilitate communications between R and your mobile device.

If your trigger folder is written and you have a basic understanding of the files and folders, let's start setting up your e-mail client to work with the MobileTrigger framework.

Setting up the e-mail client

This section will help you get your desktop e-mail client (Outlook or ThunderBird) ready to work with the MobileTrigger package. This is the most involved part of the setup, but once you're done it should be trouble free.

ThunderBird

There are two ways to set up ThunderBird with the MobileTrigger framework: the fast way and the slow way. In general, the process goes like this:

1. Install FiltaQuilla Plugin
2. Adjust the FiltaQuilla options so the “Save Message as File” option is ON
3. Setup ThunderBird Rules: The Fast Way or the Slow Way
4. Test Communication

In the next few sections, we'll discuss the ThunderBird setup in detail. We'll start with the fast way; then I'll show an example of the slow way. Let's get started by installing the FiltaQuilla add-on.

We need to install a plugin called FiltaQuilla, so we can make e-mail filters in ThunderBird. To install the plugin:

1. Open ThunderBird and let everything sync-up
2. Press (ALT + t) on your keyboard. At the top of the ThunderBird window, the tools menu should appear
3. Click the “Add-ons” menu item. This will open a new tab within ThunderBird called the “Add-ons Manager”.
4. Click “Extensions” in the left-hand panel to open the extensions screen. Use the search box in the top-right corner to search for FiltaQuilla
5. Install it
6. Restart ThunderBird when you’re done.
7. After restarting, press (ALT + t) again. This time select “Add-on Options” → “FiltaQuilla”. In the dialogue window, make sure the “Save Message as File” option is checked. Press OK. (see image below)

FiltaQuilla is now installed and ready. Close ThunderBird.

Setup ThunderBird Rules: The Fast Way

This process takes me about 2 minutes. That being said, I've had a lot of developer practice. Anyway, this method is much faster than the “slow” ThunderBird method or the Outlook method. Why? Because FiltaQuilla's e-mail rules live in a nice, editable text file called:

msgFilterRules.dat

1. Make sure ThunderBird is closed
2. Find the file msgFilterRules.dat buried deep in your windows home directory. For me, the path looked a little something like this: C:/Users/[username]/AppData/Roaming/ThunderBird/Profiles/wrvfw5me.default/ImapMail/imap.ionos.com/msgFilterRules.dat

The highlighted portion of the path above is likely to be a little different for your local system, but the above should give you an indication of what the path looks like.

3. Open the msgFilterRules.dat file in your favorite text editor. Initially, I had these two lines of text:

```
version="9"
logging="no"
```
4. If you already have filters working in your environment, backup the file before experimenting with it. Mistakes happen!
5. In R, run the WriteThunderBirdFilters() function from the MobileTrigger package:

```r
require(MobileTrigger)

################################################
# Define path & E-mail Addresses               #
################################################
path <- "c:/trigger" # your root trigger path
MobileEmail <- "Email@MobileDevice.com"
RsEmail <- "For.R@outlook.com"
MailClientEmail <- "Your.Email@somewhere.com"

################################################
# Write the ThunderBird Filters                #
################################################
WriteThunderBirdFilters(
  path = path,
  sent.from = MobileEmail
)
```

After running this code, a new msgFilterRules.dat will be written to your root trigger path (in our examples, the root trigger path has been c:/trigger).

6. If you just installed FiltaQuilla for the first time, you can overwrite the existing msgFilterRules.dat file with the new one generated by R. If you have existing rules, copy the text from the R generated file from line 3 onward. Paste-append that text to the end of your existing msgFilterRules.dat using a text editor. Save and close.
7. Open ThunderBird
8. Press (ALT + t) to open the program tools menu and select the “Message Filters” menu item. The dialogue box that appears should show 6 rules as illustrated below:

9. To check your setup, skip down to the Communication Testing section.
Setup ThunderBird Rules: The Slow Way

If you're in this section, you have FiltaQuilla installed, ThunderBird is closed, and you have elected to set up the rules the slow way. This will take about 15-20 minutes. If you don't want to set up the rules this way, go back to the Fast Way section.

Alright, the slow way.

Example List Script Rule
1. Open ThunderBird
2. Go to your root trigger path. In our running example, this would be c:/trigger.
3. Run the “starterMessage.bat” file. This will launch headless R and use the e-mail settings you provided when you built the folder to send 6 messages with different subjects: “Hey R – List Scripts”, “Hey R – Run Scripts”, “Hey R – List Models”, “Hey R – Run Models”, “Hey R – List Reports”, and “Hey R – Run Reports”.
4. When you receive the messages, copy the subject line from the “Hey R – List Scripts” email.
5. Press (Alt + t), to open the ThunderBird Tool Menu, then click “Message Filters” and Press “New”. The following screen will appear.

6. Name the rule “R List Scripts” and specify a rule that responds to two conditions:
1. the subject line is “Hey R – List Scripts”
2. the message is from your mobile e-mail address
7. Next, set the rule action to launch the ListScripts.bat file in your trigger root folder. In our examples, we’ve been using c:\trigger. The resulting e-mail rule should look like the screenshot below. Other list types will look similar:

8. After pressing OK, restart ThunderBird
9. Test: Send an e-mail with the subject line “Hey R – List Scripts” from your mobile device to your desktop to make sure the batch file gets launched. You should see a command prompt open briefly with some R output. The message that returns to your mobile will indicate “No Scripts in Path”. This is expected.

We just set up the “List Scripts” trigger. Now we need to set up the “Run Scripts” e-mail trigger.

Example Run Script Rule
1. Copy the subject line from the “Hey R – Run Scripts” message.
2. Open the FiltaQuilla Rule Manager using (Alt + t) → “Message Filters”
3. Setup the run script rule as shown in the image below. Notice the filter criteria is the same as before, but the action has two operations:
1. write message to file in the root trigger path (C:/trigger)
2. Launch RunScripts.bat in the root trigger path

The completed rule should look like this:

4. Press OK
5. Restart ThunderBird
6. Test: Send an e-mail with the subject line “Hey R – Run Scripts” from your mobile device and make sure the batch file gets launched. During the process, you should see a command prompt open briefly with some R output. Shortly after, you should get an e-mail message back on your mobile indicating “No Scripts in Path”. This is expected.

If the first two triggers appear to be working as expected, congratulations! You made it through a very long process. To finish, grab a coffee, put on some upbeat music, and repeat the above procedure for the last 4 triggers listed in the table below.

| Rule Name | Filter Action | Filter Criteria |
| --- | --- | --- |
| List Scripts | Launch: [Trigger Path]/ListScripts.bat | Subject == ‘Hey R – List Scripts’; From == ‘your.mobile.email@mobile.com’ |
| Run Scripts | 1. Save Message as File to [Trigger Path]/ 2. Launch: [Trigger Path]/RunScripts.bat | Subject == ‘Hey R – Run Scripts’; From == ‘your.mobile.email@mobile.com’ |
| List Models | Launch: [Trigger Path]/ListModels.bat | Subject == ‘Hey R – List Models’; From == ‘your.mobile.email@mobile.com’ |
| Run Models | 1. Save Message as File to [Trigger Path]/ 2. Launch: [Trigger Path]/RunModels.bat | Subject == ‘Hey R – Run Models’; From == ‘your.mobile.email@mobile.com’ |
| List Reports | Launch: [Trigger Path]/ListReports.bat | Subject == ‘Hey R – List Reports’; From == ‘your.mobile.email@mobile.com’ |
| Run Reports | 1. Save Message as File to [Trigger Path]/ 2. Launch: [Trigger Path]/RunReports.bat | Subject == ‘Hey R – Run Reports’; From == ‘your.mobile.email@mobile.com’ |

After you get all the e-mail rules in, you'll want to test that everything triggers as planned. You may have been doing this as you set up each trigger. If so, move on to the Next Steps section; otherwise, skip down to the Communication Testing section.
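Conceptually, each “List” trigger's batch file just launches R, inventories a folder, and mails the result back. The sketch below is illustrative only — the folder layout, function name, and message format are assumptions, not the actual MobileTrigger internals — but it shows why an empty setup replies “No Scripts in Path”:

```r
# Illustrative sketch of a "list scripts" step (NOT the real MobileTrigger
# internals): inventory the trigger path and build the reply body.
list_scripts_reply <- function(trigger_path) {
  scripts <- list.files(file.path(trigger_path, "scripts"),
                        pattern = "\\.R$", ignore.case = TRUE)
  if (length(scripts) == 0) {
    return("No Scripts in Path")  # the reply you saw during the tests above
  }
  # Number the scripts so a "Run Scripts" request can refer to one by index
  paste0(seq_along(scripts), ": ", scripts, collapse = "\n")
}

# Example: an empty trigger folder yields the "No Scripts in Path" reply
tmp <- tempfile()
dir.create(file.path(tmp, "scripts"), recursive = TRUE)
list_scripts_reply(tmp)
```

The “Run” triggers additionally save the incoming message to the trigger path, which is presumably how the R side learns which script to run before mailing its reply back.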

Outlook

Setting up MobileTrigger with Outlook is going to take about 20 minutes. The process goes roughly like this:

Registry Tweak
1. Press the (Windows Key + R): This will open the run dialogue box
2. In the “open” text box type: regedit
3. You’ll get an “are you sure you want to use this application” warning from Windows. Press OK
4. The registry editor should look like this:

Cool, we've got the registry editor open. Here are the changes we need to make:

1. In the left-hand panel drill down to:
Computer\HKEY_CURRENT_USER\Software\Microsoft\Office\16.0\Outlook
2. Make a new folder (key) called “Security” by right clicking on the Outlook folder and selecting “New”→“Key”
3. Enter the new “Security” folder. Full path should now be:
Computer\HKEY_CURRENT_USER\Software\Microsoft\Office\16.0\Outlook\Security
4. Create a new DWORD(32bit) value in the right panel by right clicking, call it “EnableUnsafeClientMailRules”
5. Double click on EnableUnsafeClientMailRules and set the data value to 1.
6. Should look like this when you are done:

All done.
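If you'd rather script this registry change (or verify it later), the same edit can be made from R on Windows. This is a hedged sketch, not part of MobileTrigger: it assumes Office 16.0 (adjust the version for your install) and uses Windows' built-in `reg.exe`:

```r
# Windows-only sketch: check, and if needed create, the registry value that
# allows Outlook "run a script" rules. Office version 16.0 is an assumption.
key <- "Software\\Microsoft\\Office\\16.0\\Outlook\\Security"
cmd <- sprintf(
  'reg add "HKCU\\%s" /v EnableUnsafeClientMailRules /t REG_DWORD /d 1 /f',
  key
)

if (.Platform$OS.type == "windows") {
  current <- tryCatch(
    utils::readRegistry(key, hive = "HCU")$EnableUnsafeClientMailRules,
    error = function(e) NULL   # key may not exist yet
  )
  if (is.null(current) || current != 1) shell(cmd)
}
```

On other platforms the block does nothing; the manual regedit steps above achieve the same result.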

Setup the Visual Basic Macros
1. In the root path for MobileTrigger, open the file OUTLOOK.txt in a text editor and copy the text. The contents of this file are Visual Basic for Applications (VBA) code, which we need to set up our e-mail rules.
2. Open Outlook
3. On the program menu, find the “Tell me what you want to do” search box

4. Search for “Visual Basic Editor” and open it. A new window should popup
5. Find the “ThisOutlookSession” module circled in the screenshot below and double-click it. An empty window will likely appear. See the example image below

6. Paste in the VBA code you copied from the OUTLOOK.txt file (see step 1).
7. Save and close the Visual Basic Editor
Enable Macros

If you don't enable macros, the MobileTrigger package won't work with Outlook. If the security risk is too great for you, consult your IT department or perhaps consider using ThunderBird instead.

1. Go to “File” → “Options”
2. Select the Trust Center
3. Click the Trust Center Settings (a new window will appear)
4. Select the Macro Settings Option
5. Adjust the macro settings to “Enable all macros”. Note: the “Notifications for all macros” setting will also work, but only within a given Outlook session.

6. Press OK.
7. Restart Outlook.
Starter Messages
1. Run the starterMessages.bat file in your root trigger path
2. Six messages should appear in your Outlook inbox.
Setup the Rules
1. Right Click on the e-mail with subject “Hey R – List Reports”
2. Go to Rules → “Create Rule…”
3. Select advanced from the Create Rule Dialogue Box
4. First screen – Set rule conditions
1. Select the “From [Your Mobile e-mail]” and
2. With “Hey R – List Reports” in the subject
3. Press Next
5. Second screen – set the action to do if the previous conditions were met
1. In the Step 1: Select action box, select “run a script”. This option won’t be available if you haven’t made the Windows registry change discussed above.
2. In the Step 2: Edit the Rule Description box, click the words “a script”
3. A “Select Script” dialog box will appear. Select the VBA macro for listing reports, and press “OK”

4. Press “Next” to move to the third screen
6. Third screen – set exceptions to the rules. We aren't going to worry about this right now.
1. Press Next
7. Last Screen – Finish Rule

1. Step 1: Give your Rule a name
2. Step 2: Make sure the rule is turned on
3. Press Finish

If you try to test the rule at this point, it may not work because of an Outlook bug. This bug tripped me up for several hours, so hopefully this saves you some time.

1. On the Home Ribbon, click “Rules” → “Manage Rules & Alerts”
2. A dialogue box will open and show your new rule for R to list reports. To the right of the rule name, you will see “(For Other Computer)”. For whatever reason, this causes problems.
3. To get rid of it, click the wrench and screwdriver icon at the far right (see picture below).
4. The Rule Wizard will open. Toggle “on this computer” OFF and then ON again.

5. Press Finish
6. You should no longer see the “(For Other Computer)” in your rule list.
7. Press OK
8. Check that the List Reports rule is working by sending a test e-mail to Outlook with the subject line “Hey R – List Reports”. Make sure it’s from the right e-mail address or the rule won’t run.

If the test works, repeat the above steps on the other 5 triggers. Here is a complete list.

| Rule Name    | Use Script                                  | Filter Criteria                                                            |
|--------------|---------------------------------------------|----------------------------------------------------------------------------|
| List Scripts | OutlookRule.ThisOutlookSession.ListScripts  | Subject == ‘Hey R – List Scripts’; From == ‘your.mobile.email@mobile.com’  |
| Run Scripts  | OutlookRule.ThisOutlookSession.RunScripts   | Subject == ‘Hey R – Run Scripts’; From == ‘your.mobile.email@mobile.com’   |
| List Models  | OutlookRule.ThisOutlookSession.ListModels   | Subject == ‘Hey R – List Models’; From == ‘your.mobile.email@mobile.com’   |
| Run Models   | OutlookRule.ThisOutlookSession.RunModels    | Subject == ‘Hey R – Run Models’; From == ‘your.mobile.email@mobile.com’    |
| List Reports | OutlookRule.ThisOutlookSession.ListReports  | Subject == ‘Hey R – List Reports’; From == ‘your.mobile.email@mobile.com’  |
| Run Reports  | OutlookRule.ThisOutlookSession.RunReports   | Subject == ‘Hey R – Run Reports’; From == ‘your.mobile.email@mobile.com’   |

Once you have everything setup, move on to the Communication Testing section

Communication Testing

Congratulations! If you've made it to this section, I assume you have all the rules set up in either Outlook or ThunderBird. Perhaps you've tested your communications as you set up each rule. If you know the rules are working, skip this section and move on to this post's Summary. If you still need to test your setup, here is an R script to help you through the process. The script is organized as follows:

1. Define your trigger path and e-mail addresses
2. Run the trigger tests

The testTriggers() function in the script below will make requests to List and Run scripts, models, and reports that don't currently exist in your setup. This is fine; we are just checking that communication is working. In follow-up posts, you'll see these modules in action.

To test your triggers and communications, start by running the “Lists” trigger test. There is a 15-second delay between each message sent. As the requests roll in, your triggers should activate, and you should start seeing e-mails from R indicating no model, script, or report in path. If these are working, move on to test each set of Run triggers (Model, Script, and Report). Remember to update Mail.To and Mail.From for your situation.

```r
# Load Libraries ----------------------------------------------------------
require(mailR)
require(MobileTrigger)

################################################
#        Define path & E-mail Addresses        #
################################################
path            <- "c:/trigger"  # your root trigger path
MobileEmail     <- "Email@MobileDevice.com"
RsEmail         <- "For.R@outlook.com"
MailClientEmail <- "Your.Email@somewhere.com"

################################################
#              Test Communication              #
#    - Test Each of the TestWhat Conditions    #
################################################

testTriggers(
  TestWhat  = "Lists",
  #TestWhat = "RunModels",
  #TestWhat = "RunScripts",
  #TestWhat = "RunReports",
  path      = path,
  Mail.To   = MailClientEmail,  # Client Email Address
  Mail.From = MobileEmail,      # Simulate Mobile
  Mail.From.SMTP.Settings = list(host.name = "Mobile SMTP Server Settings",
                                 port      = [mobile SMTP port number],
                                 user.name = MobileEmail,
                                 passwd    = 'Mobile Password',
                                 tls       = TRUE)
)
```

If all “List” and “Run” modules are communicating, MobileTrigger is set up and ready for use.

Next Steps

If MobileTrigger is working with your e-mail client, the next step is to bring in your models, scripts, and reports. We will cover each case in a series of brief follow-up posts. So, stay tuned!

Summary

You made it to the end. Yes, I know setting up the e-mail client is a pain. However, if you've made it this far, you've got the basic idea of how the MobileTrigger package works, and you should have working e-mail communications between your mobile device and R on a remote desktop at your home or office. In a series of brief upcoming posts, I will show you how to integrate your reports, models, and scripts.

If anyone wants to support the project by making the Outlook setup easier or setting up capabilities for Linux or Mac, please visit the MobileTrigger GitHub Project

Other Articles from r-bar.net

The post MobileTrigger Setup: Run R Scripts, Models, Reports with Mobile Device appeared first on R-BAR.


To leave a comment for the author, please follow the link and comment on their blog: R – R-BAR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R usage in Brazil

Fri, 05/17/2019 - 02:00

(This article was first published on R on msperlin, and kindly contributed to R-bloggers)

I’ve been using R for at least five years and have always been curious about its usage in Brazil. I see some minor personal evidence that the number of users is increasing over time. My book in Portuguese is steadily increasing its sales, and I’ve been receiving far more emails about my R packages. Conferences are also booming: every year there are at least two or three R conferences in Brazil.

What I learned from experience is that software choice is a group decision. It is very likely that you will use whatever your peer group uses. For example, if you are a PhD student, you will never convince your adviser to change research software, even if you have perfectly good reasons!

It takes some independence and autonomy to break free from bad group choices. In academia, you can only do that later on, when you finish your PhD and start your own career. Then you can use whatever rocks your boat. And even then, it takes courage and humility to relearn all your research tricks, from data acquisition to research reporting.

```r
library(tidyverse)

df.dls <- read_rds('data/r-downloads-brazil.rds')
glimpse(df.dls)
## Observations: 72,853
## Variables: 7
## $ date    2012-10-31, 2012-10-31, 2012-10-31, 2012-10-31, 2012-10…
## $ time    16:17:34, 18:21:35, 19:26:20, 19:28:03, 19:26:03, 19:06…
## $ size    49351035, 33301364, 49351035, 49351024, 1424794, 6652340…
## $ version "2.15.2", "2.15.2", "2.15.2", "2.15.2", "2.15.2", "2.15.…
## $ os      "win", "win", "win", "win", "win", "osx", "win", "win", …
## $ country "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "B…
## $ ip_id   30, 59, 73, 30, 87, 90, 143, 213, 231, 260, 260, 134, 18…
```

As you can see, we have the date, time, version, os (platform), country and ip (randomized daily). First of all, let’s see how many downloads per day we have for Brazil. I’m also including the release dates of the major R versions.

```r
df_by_day <- df.dls %>%
  group_by(ref.date = date) %>%
  summarise(n = n())

df.R.releases <- tibble(
  ref.date  = as.Date(c('2013-04-03', '2014-04-10', '2015-04-16', '2016-05-03',
                        '2017-04-21', '2018-04-23', '2019-04-26')),
  R_version = c('3.0.0', '3.1.0', '3.2.0', '3.3.0', '3.4.0', '3.5.0', '3.6.0')
)

p <- ggplot(data = df_by_day, aes(y = n, x = ref.date)) +
  geom_point() +
  geom_smooth(size = 2) +
  labs(x = 'Date (day)', y = 'Number of Downloads',
       title = 'Number of R downloads in Brazil',
       subtitle = 'Data from Rstudio logs') +
  geom_vline(data = df.R.releases,
             aes(xintercept = ref.date, color = R_version), size = 1) +
  scale_color_grey(start = 0.8, end = 0.2)

print(p)
```

The number of downloads is steadily increasing over time, and the new releases of R also seem to explain the outliers in the dataset. Let’s clean it up a bit by decreasing the frequency and calculating the number of downloads per month instead of per day.
```r
df_by_month <- df.dls %>%
  group_by(ref.month = lubridate::ymd(format(date, '%Y-%m-01'))) %>%
  summarise(n = n())

p <- ggplot(data = df_by_month, aes(y = n, x = ref.month)) +
  geom_point() +
  geom_smooth(size = 2) +
  labs(x = 'Date (month)', y = 'Number of Downloads',
       title = 'Number of R downloads in Brazil',
       subtitle = 'Data from Rstudio logs') +
  geom_vline(data = df.R.releases,
             aes(xintercept = ref.date, color = R_version), size = 1) +
  scale_color_grey(start = 0.8, end = 0.2)

print(p)
```

Much better! Overall, R downloads average about 910.7 per month, with a monthly compound rate of 6.40%. This means that, each month, the number of downloads increases by 6.40% over the previous month.

The data also include information about the operating system. Let’s check its distribution:

```r
df_by_os <- df.dls %>%
  group_by(os) %>%
  count() %>%
  na.omit() %>%
  ungroup() %>%
  mutate(os = fct_recode(os,
                         "Windows" = "win",
                         "Mac OS"  = "osx",
                         "Linux"   = "src"))

p <- ggplot(df_by_os, aes(x = os, y = n)) +
  geom_col() +
  labs(x = 'Operating System', y = 'Number of Download Cases',
       title = 'Distribution of OS',
       subtitle = 'Data from Rstudio logs')

print(p)
```

Not unexpectedly, Windows is the winner! I’m very surprised to see that Mac OS shows more downloads than Linux. With an unfavorable exchange rate and many import taxes, the price of a Mac computer (desktop or laptop) is exorbitant in Brazil. This says a lot about the purchasing power of R users.

I hope you liked this post. Next time I’ll analyze the logs of package installation in Brazil.

To leave a comment for the author, please follow the link and comment on their blog: R on msperlin.
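For reference, a monthly compound growth rate like the 6.40% quoted above can be computed as the geometric growth implied by the first and last monthly counts. Here is a minimal sketch of that arithmetic with made-up numbers (not the actual download data):

```r
# Compound monthly growth rate implied by a series of monthly counts:
# rate = (last / first)^(1 / number_of_periods) - 1
compound_monthly_rate <- function(n) {
  periods <- length(n) - 1
  (n[length(n)] / n[1])^(1 / periods) - 1
}

# Toy example: 100 downloads growing exactly 5% per month for 12 months
counts <- 100 * 1.05^(0:12)
round(compound_monthly_rate(counts), 4)
```

The same idea applied to `df_by_month$n` would recover the rate quoted in the post (a regression on log counts is another common way to estimate it).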
### Make R Great Again

Fri, 05/17/2019 - 02:00

(This article was first published on R on Gianluca Baio, and kindly contributed to R-bloggers)

Our editorial on using R in HTA (I talked about it here) has finally been published in Value in Health. This is nice in the (rather long, I know…) build-up to our workshop. And speaking of the workshop, we’ve already received quite a few abstracts for contributed talks – below is the “official” advert we’ve circulated.

This is a reminder that the deadline for abstract submission to the Workshop on R for trial and model-based cost-effectiveness analysis (CEA) is 15th May. The workshop will take place at University College London on 9th July. This is a great opportunity to present any interesting work or challenges you’ve faced in using R for CEA. We are open to a wide range of topics, including experiences beginning to use R for CEA, implementations of Markov models, tricks to speed up model evaluations, ways to improve transparency, or even reasons you’re never using R again! If you’re unsure, please feel free to drop me an email about a potential topic. Abstracts can be structured or unstructured and up to about 300 words, but there are no strict requirements. Please submit abstracts to howard.thom@bristol.ac.uk.

Registration for the workshop is still open, and details are at the following link: http://www.statistica.it/gianluca/teaching/r-hta-workshop/2019/

In fact, as of today (17 May), there are still about 20 places available – and yesterday we had a nice surge of 5 registrations in the space of half an hour. So hurry!
To leave a comment for the author, please follow the link and comment on their blog: R on Gianluca Baio.

### You are what you vote

Fri, 05/17/2019 - 02:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

I’ve tried my hand at writing for the wider public with an article for The Conversation, based on my paper with Di Cook and Jeremy Forbes, “Spatial modelling of the two-party preferred vote in Australian federal elections: 2001-2016”. With the next Australian election taking place tomorrow, we thought it was timely to put out a publicly accessible version of our analysis.

To leave a comment for the author, please follow the link and comment on their blog: R on Rob J Hyndman.
### SatRday LA 2019 Slides on Futures

Thu, 05/16/2019 - 22:00

(This article was first published on JottR on R, and kindly contributed to R-bloggers)

A bit late, but here are my slides on Future: Friendly Parallel Processing in R for Everyone, which I presented at the satRday LA 2019 conference in Los Angeles, CA, USA on April 6, 2019.

My talk (33 slides; ~45 minutes):

• Title: Future: Friendly Parallel and Distributed Processing in R for Everyone
• HTML (incremental slides; requires online access)
• PDF (flat slides)
• Video (44 min; YouTube; sorry, different page numbers)

Thank you all for making this a stellar satRday event. I enjoyed it very much!

To leave a comment for the author, please follow the link and comment on their blog: JottR on R.

### Analysing the HIV pandemic, Part 3: Genetic diversity

Thu, 05/16/2019 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Phillip (Armand) Bester is a medical scientist, researcher, and lecturer at the Division of Virology, University of the Free State, and National Health Laboratory Service (NHLS), Bloemfontein, South Africa.

Andrie de Vries is the author of “R for Dummies” and a Solutions Engineer at RStudio.

Recap

In part 2 of this series, we discussed the PhyloPi pipeline for conducting routine HIV phylogenetics in the drug-resistance testing laboratory as part of quality control.
As mentioned, during HIV replication the error-prone viral reverse transcriptase (RT) converts its RNA genome into DNA before it can be integrated into the host cell genome. During this conversion, the enzyme makes random mistakes in the copying process. These mistakes, or mutations, can be deleterious, beneficial, or may have no measurable impact on the replicative fitness of the virus. However, the fast rate of mutation provides enough divergence to be useful for phylogenetic analysis.

Introduction

As infections spread from person to person, the virus continues to mutate and becomes more and more divergent. This allows us to use the genetic information we obtain while doing the drug-resistance test and analyse the sequences for abnormalities. We showed how DNA sequences can be aligned and how, based on the composition of ‘columns’ in these strings, a distance matrix can be calculated of each string against each other. In the example we discussed in part 2, we had a very simple method for scoring matches, i.e., we used either a one or a zero. We can get closer to the truth by using substitution models, as we will explain below.

Many machine learning algorithms require one to first calculate the distances of each observation against each other, and the choice of distance measure is up to the analyst. Phylogenetic inference is very similar in that a distance matrix needs to be constructed, on which the tree can then be calculated. If the sequence targeted for phylogenetic inference is very stable, with little or no evolution, the calculated distances will be zero or very close to it, which will not allow for differentiation. However, as mentioned, HIV has a very fast rate of evolution due to its error-prone reverse transcriptase. Cuevas et al. (2015) published work on the in vivo rate of HIV evolution. Their analysis revealed the highest mutation rate of any biological entity: $$4.1 \cdot 10^{-3}$$ ($$sd=1.7 \cdot 10^{-3}$$).
However, the error-prone reverse transcriptase is not the only mechanism of mutation. One defence against HIV infection is an enzyme called apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like, or APOBEC. These enzymes act on RNA and convert, or mutate, cytidine to uridine (uridine in RNA is the counterpart of thymidine in DNA). This results in a G-to-A mutation in the cDNA. As also shown by Cuevas et al., these enzymes are not equally active in all people. On the other hand, the viral Vif protein inhibits this hypermutation by ‘tagging’ the APOBEC protein with ubiquitin for degradation by the cytoplasmic ubiquitin-dependent proteasome machinery.

But how does this virus-driven mutation, or APOBEC-driven hypermutation, affect the virus in a negative (or positive) way? We first need to understand how RNA is translated into proteins. Below is a table showing the codon combinations for each of the 20 amino acids.

Figure 1: Amino acid encoding. Available at https://www.biologyjunction.com/protein-synthesis-worksheet/

As can be seen from the table above, some amino acids are encoded by more than one codon. For example, if we change the codon CGU to AGA, the resulting amino acid stays Arginine (R). This is referred to as a silent mutation, since the resulting protein will look the same. On the other hand, if we mutate AGU to CGU, the resulting mutation is from Serine to Arginine, or in single-letter notation, S to R. A change in the amino acid is referred to as a non-synonymous mutation.

Example

In reality, the APOBEC enzyme recognizes specific RNA sequence motifs, but just to give an idea of how this works, let’s look at an example.
Load some packages:

```r
library(ape)
library(Biostrings)
library(tibble)
library(tidyr)
library(dplyr)
library(knitr)
library(plotly)
library(RColorBrewer)
library(diagram)
```

Create an RNA sequence (remember, U is T in RNA language):

```r
WT <- c("CGA", "GUU", "AUA", "GAG", "UGG", "AGU")
```

We have the sequence CGAGUUAUAGAGUGGAGU, created in the cell block above as codons for clarity. We can now translate this sequence using the codon table or some function.

```r
translate_dna_sequence <- function(x){
  x %>%
    paste0(collapse = "") %>%
    gsub("U", "T", .) %>%
    DNAString() %>%
    as.DNAbin() %>%
    trans() %>%
    .[[1]] %>%
    as.character.AAbin()
}

AA <- WT %>% translate_dna_sequence()
```

The code block above translated our RNA sequence into the protein sequence R, V, I, E, W, S. Now let’s mutate all occurrences of C to U/T:

```r
MUT <- gsub("C", "U", WT)
```

The resulting mutant sequence is UGA, GUU, AUA, GAG, UGG, AGU, and if we now translate that, we get …

```r
AA <- MUT %>% translate_dna_sequence()
```

… the protein sequence *, V, I, E, W, S. The * means a stop codon was introduced. Stop codons are responsible for terminating translation from RNA to protein. If one of the viral genes has a stop codon in it, the protein will truncate prematurely and will most likely be dysfunctional. Mutations other than stop codons can also have a negative effect on the virus, or can cause resistance to an ARV.

Calculating genetic distances from a multiple sequence alignment (MSA)

In part 2, we showed the general principle of an MSA. In biology, sequence alignments are used to look at similarities of DNA or protein sequences. For most phylogenetic analysis, a multiple sequence alignment is a requirement, and the more accurate the MSA, the more accurate the phylogenetic inference. First, we read in the multiple sequence alignment file.

```r
# Read in the alignment file
aln <- read.dna('example.aln', format = 'fasta')
```

Next, we can calculate the distance matrix using the Kimura two-parameter (K80) model.
There are various models that can be applied when looking at DNA substitution. We will use a model based on Markov chains. Remember:

“All models are wrong, but some are useful” – George Box

This is very true when it comes to estimating genetic distances and phylogenetic inference. Consider the image below:

Figure 2: Transversions vs. transitions. Available at https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/All_transitions_and_transversions.svg/1024px-All_transitions_and_transversions.svg.png

The figure above shows transition and transversion events. Transitions between A and G (the purines) and between C and T (the pyrimidines) are more likely than transversions (indicated by the red arrows). The K80 model takes this into account as one of its parameters, and these rates, or probabilities, are calculated or estimated by maximum likelihood. Let’s see what that looks like:

```r
tmDNA <- matrix(c(0.80, 0.05, 0.10, 0.05,
                  0.05, 0.80, 0.05, 0.10,
                  0.10, 0.05, 0.80, 0.05,
                  0.05, 0.10, 0.05, 0.80),
                nrow = 4, byrow = TRUE)

stateNames <- c("A", "C", "G", "T")
row.names(tmDNA) <- stateNames
colnames(tmDNA)  <- stateNames

tmDNA %>%
  kable(caption = "Example K80 probabilities of transitions or transversions")
```

Table 1: Example K80 probabilities of transitions or transversions

|   | A    | C    | G    | T    |
|---|------|------|------|------|
| A | 0.80 | 0.05 | 0.10 | 0.05 |
| C | 0.05 | 0.80 | 0.05 | 0.10 |
| G | 0.10 | 0.05 | 0.80 | 0.05 |
| T | 0.05 | 0.10 | 0.05 | 0.80 |

```r
plotmat(tmDNA, pos = c(2, 2), lwd = 1, box.lwd = 2,
        cex.txt = 0.8, box.size = 0.1, box.type = "circle",
        box.prop = 0.5, box.col = "light blue",
        arr.length = .1, arr.width = .1,
        self.cex = .6, self.shifty = -.01, self.shiftx = .14,
        main = "Markov Chain")
```

This example is contrived, but it should explain the concept of a substitution model. The viral reverse transcriptase is not a random sequence generator, but it does make mistakes. Most of the time, when it is copying the RNA into DNA, the base (state) stays the same. Also, the probability of a transversion differs from that of a transition.
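One way to build intuition for the Markov-chain view is to propagate a base's state distribution through the toy transition matrix above. After many steps the distribution drifts toward the chain's stationary distribution, which is uniform here because this toy matrix is symmetric:

```r
# The same toy one-step transition matrix as above; each row sums to 1
tmDNA <- matrix(c(0.80, 0.05, 0.10, 0.05,
                  0.05, 0.80, 0.05, 0.10,
                  0.10, 0.05, 0.80, 0.05,
                  0.05, 0.10, 0.05, 0.80),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("A", "C", "G", "T"),
                                c("A", "C", "G", "T")))

# Multiply a state distribution by the matrix to advance one step
propagate <- function(p0, tm, n) {
  for (i in seq_len(n)) p0 <- p0 %*% tm
  p0
}

# Start certain that the base is "A"
p0 <- c(A = 1, C = 0, G = 0, T = 0)
round(propagate(p0, tmDNA, 1), 3)   # mostly still A; G is the likeliest change
round(propagate(p0, tmDNA, 50), 3)  # close to uniform (0.25 each)
```

Real substitution models work in continuous time with estimated rates, but the mechanics are the same: a matrix of state-change probabilities applied along the branches of a tree.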
If you look at the figure above, where we introduced transversions and transitions, you will notice that A is more similar to G, and T is more similar to C, in chemical structure.

There are many other substitution models, and it is not always trivial to select the best model for phylogenetic inference. One technique is to run multiple maximum likelihood phylogenetic calculations using different models, and then pick the model with the lowest AIC (Akaike Information Criterion). For our pipeline, we selected the rather simple K80 model. Since we are looking at a different set of sequences at each submission, a simple model is probably better in order to avoid the problems caused by overfitting.

We can use the ape package to calculate distances using the K80 model.

```r
# Calculate the genetic distances between sequences using the K80 model;
# as.matrix makes the rest easier
alnDist <- dist.dna(aln, model = "K80", as.matrix = TRUE)

alnDist[1:5, 1:5] %>%
  kable(caption = "First few rows of our distance matrix")
```

Table 2: First few rows of our distance matrix

|                       | 01_AE.JP.AB253686_INT | B.US.HM450245_INT | B.AU.AF407664_INT | B.CN.KJ820110_INT | B.RU.HM466986_INT |
|-----------------------|-----------|-----------|-----------|-----------|-----------|
| 01_AE.JP.AB253686_INT | 0.0000000 | 0.0935626 | 0.0961965 | 0.0962887 | 0.0962887 |
| B.US.HM450245_INT     | 0.0935626 | 0.0000000 | 0.0378446 | 0.0378167 | 0.0378748 |
| B.AU.AF407664_INT     | 0.0961965 | 0.0378446 | 0.0000000 | 0.0454602 | 0.0494138 |
| B.CN.KJ820110_INT     | 0.0962887 | 0.0378167 | 0.0454602 | 0.0000000 | 0.0479955 |
| B.RU.HM466986_INT     | 0.0962887 | 0.0378748 | 0.0494138 | 0.0479955 | 0.0000000 |

The matrix has a shape of 47 by 47, so we just preview the first 5 rows and columns.

Reduction of the heatmap to focus on the important data

The pipeline uses the Basic Local Alignment Search Tool (BLAST) to retrieve previously sampled sequences and adds these retrieved sequences to the analysis. BLAST is like a search engine you use on the web, but for protein or DNA sequences.
By doing this, important sequences from retrospective samples are included, which enables PhyloPi to be aware of past sequences, and not just batch-per-batch aware. Have a look at the paper for some examples.

The data we have are ready for heatmap plotting, but the data also contain previously sampled sequences, and comparing those sequences amongst themselves would be a distraction. We are interested in those samples only in comparison with the current batch of samples analysed. The figures below should explain this a bit better.

A diagram of a heatmap with lots of redundant and distracting data.

From the image above you can see that, typical of a heatmap, it is symmetrical on the diagonal. We show submitted vs. retrieved samples in both the horizontal and vertical directions. Notice also, annotated as “Distraction”, that the previous samples are compared amongst themselves. We are not interested in those samples now, as we would already have acted on any issues then. What we want instead is a heatmap as depicted in the image below.

A diagram of a more focussed heatmap with the redundant and distracting data removed.

Fortunately, we have a very powerful tool, R, at our disposal, and plenty of really useful and convenient packages like dplyr to fix this.
```r
alnDistLong <- alnDist %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  rownames_to_column(var = "sample_1") %>%
  gather(key = "sample_2", value = "distance", -sample_1, na.rm = TRUE) %>%
  arrange(distance)

alnDistLong %>% head()
##                sample_1              sample_2 distance
## 1 01_AE.JP.AB253686_INT 01_AE.JP.AB253686_INT        0
## 2     B.US.HM450245_INT     B.US.HM450245_INT        0
## 3     B.AU.AF407664_INT     B.AU.AF407664_INT        0
## 4     B.CN.KJ820110_INT     B.CN.KJ820110_INT        0
## 5     B.RU.HM466986_INT     B.RU.HM466986_INT        0
## 6     B.US.DQ127546_INT     B.US.DQ127546_INT        0
```

Final cleanup and removal of distracting data

```r
# Get the names of samples originally in the fasta file used for submission
qSample <- names(read.dna("example.fasta", format = "fasta"))

# Compute the new order of samples, so the new alignment is in the order
# of the heatmap example
sample_1  <- unique(alnDistLong$sample_1)
new_order <- c(sort(qSample), setdiff(sample_1, qSample))
new_order
##  [1] "01_AE.JP.AB253686_INT"  "01_AE.TH.JX448243_INT"
##  [3] "01_AE.VN.LC100946_INT"  "38_BF1.UY.FJ213783_INT"
##  [5] "B.AU.AF407664_INT"      "B.CN.KJ820110_INT"
##  [7] "B.KR.JN417106_INT"      "B.RU.HM466986_INT"
##  [9] "B.US.DQ127546_INT"      "B.US.GU076504_INT"
## [11] "B.US.HM450245_INT"      "BC.CN.JQ898256_INT"
## [13] "C.ZA.KT183056_INT"      "C.ZM.KM049918_INT"
## [15] "C.ZM.KM050042_INT"      "01_AE.TH.JX448252_INT"
## [17] "01_AE.TH.JX448250_INT"  "01_AE.TH.JX448249_INT"
## [19] "C.ZA.KT183058_INT"      "C.ZM.KM049913_INT"
## [21] "B.KR.JN417120_INT"      "B.KR.JN417117_INT"
## [23] "B.KR.JN417116_INT"      "57_BC.CN.JX679207_INT"
## [25] "C.ZM.KM050043_INT"      "C.ZM.KM050041_INT"
## [27] "01_AE.JP.AB253682_INT"  "01_AE.JP.AB253689_INT"
## [29] "B.US.KJ704790_INT"      "B.ES.KC238594_INT"
## [31] "B.AU.AF407665_INT"      "B.AU.AF407667_INT"
## [33] "B.CN.KC987976_INT"      "B.CN.KT192001_INT"
## [35] "B.US.AF040369_INT"      "B.US.M38429_INT"
## [37] "B.US.DQ127547_INT"      "B.US.DQ127543_INT"
## [39] "C.ZA.KT183062_INT"      "B.US.GU076505_INT"
## [41] "B.US.GU076507_INT"      "C.ZM.KM049917_INT"
## [43] "01_AE.CN.JQ302565_INT"  "01_AE.VN.FJ185234_INT"
## [45] "F1.BR.FJ771006_INT"     "BF.AR.AF408631_INT"
## [47] "BC.CN.KC898983_INT"
```

Plot the heatmap using plotly for interactivity

alnDistLong %>%
  filter(
    sample_1 %in% qSample,
    sample_1 != sample_2
  ) %>%
  mutate(
    sample_2 = factor(sample_2, levels = new_order)
  ) %>%
  plot_ly(
    x = ~sample_2, y = ~sample_1, z = ~distance,
    type = "heatmap",
    colors = brewer.pal(11, "RdYlBu"),
    zmin = 0.0, zmax = 0.03,
    xgap = 2, ygap = 1
  ) %>%
  layout(
    margin = list(l = 100, r = 10, b = 100, t = 10, pad = 4),
    yaxis = list(tickfont = list(size = 10), showspikes = TRUE),
    xaxis = list(tickfont = list(size = 10), showspikes = TRUE)
  )

Above we used the package ape to calculate the genetic distances for the heatmap.
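The distance matrix itself is produced earlier in the pipeline; as a minimal sketch of how ape can compute such a matrix (the file name and the TN93 substitution model here are illustrative assumptions, not necessarily PhyloPi's exact settings):

```r
library(ape)

# read an aligned FASTA file (hypothetical file name)
aln <- read.dna("example_aligned.fasta", format = "fasta")

# pairwise genetic distances under the TN93 substitution model;
# as.matrix = TRUE returns the square matrix the heatmap code expects
alnDist <- dist.dna(aln, model = "TN93",
                    pairwise.deletion = TRUE, as.matrix = TRUE)
```

The resulting matrix is what the dplyr pipeline above reshapes into long format.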

Another way of looking at our alignment data is to use phylogenetic inference. The PhyloPi pipeline saves each step of phylogenetic inference to allow the user to intercept at any step. We can use the newick tree file (a text file formatted as newick) and draw our own tree:

tree <- read.tree("example-tree.txt")

plot.phylo(
  tree,
  cex = 0.8,
  use.edge.length = TRUE,
  tip.color = 'blue',
  align.tip.label = FALSE,
  show.node.label = TRUE
)

nodelabels("This one", 9, frame = "r", bg = "red", adj = c(-8.2, -46))

We have highlighted a node with a red block, with the text "This one", which we can now discuss. We have three leaves in this node – KM050043, KM050042, KM050041 – and if you look up these accession numbers at NCBI, you will notice the publication they are tied to:

“HIV transmission. Selection bias at the heterosexual HIV-1 transmission bottleneck”

In this paper, the authors looked at selection bias when the infection is transmitted. They found that in a pool of viral quasi-species, transmission is biased to benefit the fittest viral quasi-species. The node highlighted above shows the kind of clustering one would expect with a study like the one mentioned above. You will also notice plenty of other nodes, which you can explore using the accession number and searching for it here.

The tree above is much like a dendrogram used when displaying agglomerative or hierarchical clustering. The numbers on the tree indicate the probability that the corresponding clusters are correct. The branch lengths indicate the distances between samples. In conjunction with a properly coloured heatmap, this is very useful for finding relevant clusters to investigate. If the reason for close clustering cannot be explained, the tests are repeated.

The importance of phylogenetics

Phylogenetics, and thus genetic distance calculation, is used in many branches of biology. For us it is one of the quality-control measures at our disposal, but it has also been used to reconstruct the origin of HIV. You may find the research papers listed below interesting; in them the authors used phylogenetics to infer the zoonotic origins of HIV.

As another example, in 1998, six foreign medical workers were accused of deliberately infecting hospitalized children with HIV and were sentenced to death in Libya. In 2006, de Oliveira, et al. used phylogenetics to provide evidence that the origin of the HIV strains that infected the children had an evolutionary history in the mid-90s, which was before the health care workers arrived in 1998. The six medics were released in 2007. There is also a very good writeup on the case by Declan Butler. Although probably very emotional, this would be a great movie.

These techniques are also used in criminal convictions. However, the interpretation of this kind of evidence in court cases can be unsafe. The insights of Pillay, et al. should bring this to light.

Summary

In this post we discussed how, as infections spread from person to person, the virus continues to mutate and becomes more and more divergent. This allows us to take the genetic information we obtain while doing the drug resistance test and analyse the sequences for abnormalities.

We then showed how to compute genetic distance using multiple sequence alignment (MSA) and that it's possible to model this process as a Markov chain. The resulting model can then be viewed as a heatmap or a phylogenetic tree.

These techniques find practical application in diverse situations, for example shedding light on the origin of HIV, as well as providing evidence in legal trials.

What’s next

In the fourth and final part of this series, we will show how we analysed the inter- and intra-patient genetic distances of HIV sequences by logistic regression. This was useful in properly colouring our heatmap explained in this series. See you there!


To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### rOpenSci Dev Guide 0.2.0: Updates Inside and Out

Thu, 05/16/2019 - 02:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

As announced in our recent post about updates to our Software Peer Review system, all our guidance for package development, review and maintenance is available as an online book. Our goal is to update it approximately quarterly, so it's already time to present its second official version! You can read the changelog or this blog post to find out what's new in our dev guide 0.2.0!

A more legit and accessible book

Let’s start with very exciting news, the dev guide now has a cover, designed by Oz Locke from Locke Creatives!

It shows a package production line with small humans discussing, examining and promoting packages, all of this in white on rOpenSci blue. Speaking of rOpenSci branding, thanks to Will Landau’s help, the online book now features the rOpenSci hex logo as favicon.

Two changes improved the accessibility of the book, thanks to feedback from readers. First, the book became more legible by the use of a darker blue for links. Thanks a lot to Kevin Wright for telling us how hard it was to read the links in the former color! Then, the README of the GitHub repository holding the source of our blog got more accessible thanks to a contribution by Katrin Leinweber. In particular, Katrin made us aware of guidance about why not to use “click here” for links.

One last outside change to the book is its availability as PDF to download (see PDF button in the navbar), after a request by Indrajeet Patil.

Updates to our policies and guidance Style: arrow vs. equal

Thanks to a question by Robert M Flight, our style guidance now states that it is fine to use either an arrow or an equal sign for assignment, as long as the choice is consistent within the whole package. You can choose to use = over <- as long as you are consistent with one choice within your package. We recommend avoiding the use of -> for assignment within a package. If you do use <- throughout your package, and you also use R6 in that package, you'll be forced to use = for assignment within your R6Class construction – this is not considered inconsistency because you can't use <- in this case.
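To illustrate the R6 exception with a small hypothetical class (not an example from the guide itself): the fields passed to R6Class() are function arguments, so they must use =, while method bodies can keep using <- consistently.

```r
library(R6)

Counter <- R6Class("Counter",
  public = list(          # fields are arguments, so `=` is required here
    n = 0,
    add = function(x = 1) {
      self$n <- self$n + x   # inside a method body, `<-` is fine
      invisible(self)
    }
  )
)
```

Using = inside the public list is not an inconsistency; it is the only syntax R accepts in that position.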

Continuous integration

We added a new continuous integration requirement: package maintainers must now use devel and oldrel R versions on Travis, not only R-release.
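A minimal .travis.yml build matrix satisfying this requirement might look like the following sketch (a real package would typically add package caching, covr, and so on):

```yaml
language: r
cache: packages
r:
  - oldrel
  - release
  - devel
```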

That same chapter was improved with more guidance, including links to examples, thanks to a contribution by Mark Padgham.

Preferred dependencies

In the section about recommended scaffolding we already recommended xml2 over XML for most cases but now we’ve added a link to Daniel Nüst’s very neat notes about migration from XML to xml2.

After acceptance

Three changes relate to what happens after a package has been accepted.

rOpenSci community manager Stefanie Butland and other authors recently reported how our software review guidelines ended up being used “in the wild” for a scientific paper. Knowing about such use cases makes us happy and helps us assess the usefulness of our material beyond our own system, so we’ve now added the following wish to several places in our guide (intro to our software review system, reviewer guide, chapter about contributing to rOpenSci):

If you use our standards/checklists etc. when reviewing software elsewhere, do tell the recipients (e.g. journal editors, students, internal code review) that they came from rOpenSci, and tell us in our public forum, or privately by email.

The contributing guide now contains more reasons to contribute, and the approval template for editors now features more specific guidance about writing a blog post or tech note about an approved package.

Conclusion

In this post we summarized the changes incorporated into our online book “rOpenSci Packages: Development, Maintenance, and Peer Review” over the last three months. We are very grateful for all contributions that made this release possible. Now, if you have any feedback about the book, you too can head to the issue tracker!



### Introducing {tvthemes}: ggplot2 palettes and themes from your favorite TV shows!

Thu, 05/16/2019 - 01:00

(This article was first published on R by R(yo), and kindly contributed to R-bloggers)

This blog post will provide an intro to {tvthemes} as well as some
lessons learned (codecov, Github badges, creating a hexsticker,
usethis::use_*(), unit testing for ggplot2, etc.) and the future
direction of the package. What kick-started this idea was my blog
post looking at simple TV stats on my current favorite TV show, Brooklyn
Nine-Nine. I got a lot of good feedback on the colors I used for the
custom ggplot2 theme and color palettes so I decided to expand it to
other shows that I love!

Suggestions and Pull Requests for palettes/themes are welcome, you can
find the Github repo for the package
HERE!

pacman::p_load(ggplot2, dplyr, tvthemes, extrafont, glue, gapminder, emo, patchwork, cowplot)
loadfonts()

Current list of TV shows
• Avatar: The Last Airbender: theme + palettes (Fire Nation, Water
Tribe, Earth Kingdom, & Air Nomads)
• Brooklyn Nine-Nine: theme + palettes (regular & dark)
• Game of Thrones/A Song of Ice & Fire: ‘The Palettes of Ice &
Fire’ (currently: Stark, Lannister, Tully, Targaryen, Greyjoy, &
Tyrell)
• Rick & Morty: theme + palette
• Parks & Recreation: two themes (light & dark) + palette
• The Simpsons: theme + palette
• Spongebob Squarepants: theme + palette + background images
• More in future releases…
Installation

tvthemes is currently available only on Github, you can install it by:

## install.packages("devtools")
devtools::install_github("Ryo-N7/tvthemes")

I hope to have a CRAN version soon!

Fonts

The difficulty with a lot of the fonts used by TV shows in their logos
and other marketing media is that they are made by font foundries and
can be rather expensive (for a regular person like you or me) to
purchase. So I endeavored to find free fonts to use that were
somewhat similar to the real ones used by the shows from resources like
Google Fonts. In the documentation you can find the actual fonts
used by the TV shows if you are so inclined to buy them (some are just
my best guesses though)! In some cases there were fan-made fonts such as
“Some Time Later” for Spongebob or “Akbar” for The Simpsons that I
included with the package.

Instead of dealing with extrafont yourself, I re-purposed the
import_*() functions from the hrbrthemes package so you can import
the included fonts very easily. Do note that you still might need to
install the fonts directly on your computer from the .ttf files found
in tvthemes/inst/fonts. When you're done, run the import_*() functions,
load extrafont with library(extrafont), and then run loadfonts(). If
you're having problems, check out the documentation on extrafont's
Github repo or on CRAN.

The help files for each function tells you the recommended font names in
case you forget!

import_simpsons()           ## "Akbar" font
import_theLastAirbender()   ## "Slayer" font
import_rickAndMorty()       ## "Get Schwifty" font
import_roboto_condensed()   ## "Roboto Condensed" Google Font import from hrbrthemes
import_titillium_web()      ## "Titillium Web" Google Font import from hrbrthemes
import_spongeBob()          ## "Some Time Later" font
import_cinzel()             ## "Cinzel" font to use with 'The Palettes of Ice & Fire'

## install.packages("extrafont")
library(extrafont)
loadfonts() ## You need to do this at the beginning of a session.

Colors

I gathered the colors/hex codes by looking at images online or
re-watching some old episodes and then using various hex code websites
and hex code extraction websites. Most of the color palettes were pretty
straightforward as the characters or certain elements of the TV shows
naturally provided some kind of differentiation by color. The colors in
some of these palettes may still change from feedback and further
experimentation.

You can check out all the colors for each palette by running
scales::show_col(tvthemes:::name_of_palette). Some examples below:

scales::show_col(tvthemes:::rickAndMorty_palette)

scales::show_col(tvthemes:::lannister_palette)

scales::show_col(tvthemes:::simpsons_palette)

Example Plots Brooklyn Nine-Nine

For the most part this theme and palette are unchanged from the original
blog post. I just added a few more colors to both the normal palette
(shown below) and the dark palette.

mpg %>% ggplot(aes(displ)) + geom_histogram(aes(fill = class), col = "black", size = 0.1, binwidth = 0.1) + scale_fill_brooklyn99() + labs(title = "Do you know what it means to 'clap back', Raymond?", subtitle = glue::glue("BE- {emo::ji('clap')} -CAUSE {emo::ji('clap')} I {emo::ji('clap')} DO {emo::ji('clap')} !"), caption = "Pizza bagels? Pizza rolls? Pizza poppers? Pizzaritos? Pizza pockets?") + theme_brooklyn99(title.font = "Titillium Web", text.font = "Calibri Light", subtitle.size = 14)

Spongebob Squarepants

I had a lot of fun with this one. The colors are taken from the
characters and their clothes while I was able to find a fan-made font
that looks similar to the one used in the transition slides of the show.

bobspog_plot <- mpg %>%
  ggplot(aes(displ)) +
  geom_histogram(aes(fill = class), col = "black", size = 0.1) +
  scale_fill_spongeBob() +
  labs(title = "F is for Fire that burns down the whole town!",
       subtitle = "U is for Uranium... bombs! N is for No survivors when you're-",
       caption = "Plankton, those things aren't what fun is about!") +
  theme_spongeBob(title.font = "Some Time Later",
                  text.font = "Some Time Later",
                  title.size = 22, subtitle.size = 16,
                  axis.title.size = 16, axis.text.size = 14,
                  legend.title.size = 14)

bobspog_plot

In addition I was inspired by ggpomological::paint_pomological() and
created a similar function that allows you to place your plot on top of
a Spongebob-themed background.

paintBikiniBottom(plot = bobspog_plot, background = "background") ## or "floral"

The Simpsons

Pretty simple one to do and I’m sure many of you have seen other
Simpsons related color palettes around the internet. The blue background
and yellow title font for the theme were inspired by Nathan Cunningham's
cool blog posts on The Simpsons, like this one here.

data <- gapminder::gapminder %>%
  filter(country %in% c("France", "Germany", "Ireland", "Italy",
                        "Japan", "Norway", "Belarus")) %>%
  mutate(year = as.Date(paste(year, "-01-01", sep = "", format = '%Y-%b-%d')))

ggplot(data = data, aes(x = year, y = gdpPercap, fill = country)) +
  geom_area(alpha = 0.8) +
  scale_x_date(breaks = data$year, date_labels = "%Y") +
  scale_y_continuous(expand = c(0, 0), labels = scales::dollar) +
  scale_fill_simpsons() +
  labs(title = "The Simpsons",
       caption = glue::glue("A 'Bake 'em Away, Toys!' Production"),
       x = "Wel-diddly-elcome neighborino!",
       y = "Price of Duff Beer") +
  theme_simpsons(title.font = "Akbar", text.font = "Akbar", axis.text.size = 8)

Rick and Morty

The colors from the show make for a pretty good discrete color palette, while the font is a fan-made creation that tries to emulate co-creator Justin Roiland's handwriting as used on the show.

ggplot(diamonds, aes(price, fill = cut)) +
  geom_histogram(binwidth = 500) +
  scale_fill_rickAndMorty() +
  labs(title = "Dammit Morty, You Know Diamonds Aren't Forever Right?",
       subtitle = "They're blood diamonds, Morty **burp**",
       caption = "Wubbalubbadubdub!") +
  theme_rickAndMorty(title.font = "Get Schwifty",
                     text.font = "Get Schwifty",
                     title.size = 14)

Game of Thrones: Houses Stark, Lannister, Targaryen

As a fan of the books and medieval heraldry, looking around for inspiration for the Great Houses of Westeros was quite fun. I hope to add others in the future (Martell, Arryn), even smaller houses if they have a great color scheme, like House Bolton, Dayne, Flint (Widow's Watch), etc.

mpg %>%
  ggplot(aes(displ)) +
  geom_histogram(aes(fill = class), col = "black", size = 0.1) +
  labs(title = "The winters are hard, but the Starks will endure.",
       subtitle = "We always have...",
       caption = "Winter Is Coming...") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_stark() +
  theme_minimal() +
  theme(text = element_text(family = "Cinzel", size = 14)) -> stark

colstully <- tully_pal()(5)

ggplot(diamonds, aes(price, fill = cut)) +
  geom_histogram(binwidth = 500) +
  scale_fill_manual(values = rev(colstully)) + # scale_fill_tully() +
  labs(title = "I've seen wet shits I like better than Walder Frey.",
       subtitle = "Pardon my lord, my lady. I need to find a tree to piss on.",
       caption = "- The Blackfish") +
  theme_minimal() +
  theme(text = element_text(family = "Cinzel", size = 10),
        title = element_text(family = "Cinzel", size = 14)) -> tully

ggplot(gapminder::gapminder, aes(x = log10(gdpPercap), y = lifeExp)) +
  geom_point(aes(color = continent)) +
  scale_x_log10() +
  scale_color_targaryen() +
  labs(title = "I am the blood of the dragon. I must be strong.",
       subtitle = "I must have fire in my eyes when I face them, not tears.",
       caption = "- Fire & Blood.") +
  theme_minimal() +
  theme(text = element_text(family = "Cinzel", size = 10),
        title = element_text(family = "Cinzel", size = 14)) -> targaryen

## patchwork together:
stark + tully - targaryen + plot_layout(ncol = 1)

Game of Thrones: Houses Tyrell, Tully, Greyjoy

data <- gapminder::gapminder %>%
  filter(country %in% c("France", "Germany", "Ireland", "Italy",
                        "Japan", "Norway", "Belarus")) %>%
  mutate(year = as.Date(paste(year, "-01-01", sep = "", format = '%Y-%b-%d')))

ggplot(data = data, aes(x = year, y = gdpPercap, fill = country)) +
  geom_area(alpha = 0.8) +
  scale_x_date(breaks = data$year, date_labels = "%Y") +
  scale_y_continuous(expand = c(0, 0), labels = scales::dollar) +
  scale_fill_tyrell() +
  labs(title = "All men are fools, if truth be told, but",
       subtitle = "the ones in motley are more amusing than ones with crowns.",
       caption = "- The Queen of Thorns") +
  theme_minimal() +
  theme(text = element_text(family = "Cinzel", size = 10),
        plot.title = element_text(family = "Cinzel", size = 16),
        plot.subtitle = element_text(family = "Cinzel", size = 12)) -> tyrell

cols <- lannister_pal()(5)

ggplot(diamonds, aes(price, fill = cut)) +
  geom_histogram(binwidth = 500) +
  labs(title = "You are done with whores.",
       subtitle = "The next one I find in your bed, I'll hang.",
       caption = "Rains of Castamere") +
  scale_fill_manual(values = rev(cols)) + # scale_fill_lannister() +
  theme_minimal() +
  theme(text = element_text(family = "Cinzel", size = 14)) -> lannister

airquality %>%
  mutate(Month = as.factor(Month)) %>%
  ggplot(aes(x = Day, y = Temp, group = Month, color = Month)) +
  geom_line(size = 1.5) +
  scale_color_greyjoy() +
  labs(title = "I am the storm, my lord.",
       subtitle = "The first storm, and the last.",
       caption = "- Euron 'The Crow's Eye' Greyjoy") +
  theme_minimal() +
  theme(text = element_text(family = "Cinzel", size = 10),
        title = element_text(family = "Cinzel", size = 14)) -> greyjoy

## patchwork together:
tyrell + lannister - greyjoy + plot_layout(ncol = 1)

Avatar: The Last Airbender

With four very distinct nations (each based on a certain natural
element) in the Avatar universe it was very easy to come up with color
palettes.

mpg %>%
  ggplot(aes(displ)) +
  geom_histogram(aes(fill = class), col = "black", size = 0.1) +
  scale_fill_fireNation() +
  labs(title = "Flameo, Hotman!",
       subtitle = "Fire. Wang Fire. This is my wife, Sapphire.",
       x = "Lion Vultures Owned",
       y = "Agni Kai Participation") +
  theme_theLastAirbender(title.font = "Slayer", text.font = "Slayer") -> firenation

airquality %>%
  mutate(Month = as.factor(Month)) %>%
  ggplot(aes(x = Day, y = Temp, group = Month, color = Month)) +
  geom_line(size = 1.5) +
  scale_color_airNomads() +
  labs(title = "Let's head to the Eastern Air Temple!",
       subtitle = "Appa, Yip Yip!") +
  theme_theLastAirbender(title.font = "Slayer", text.font = "Slayer",
                         title.size = 10) -> airnomads

ggplot(gapminder::gapminder, aes(x = log10(gdpPercap), y = lifeExp)) +
  geom_point(aes(color = continent)) +
  scale_x_log10() +
  scale_color_waterTribe() +
  labs(title = "I am thinking maybe we could... do an activity together?",
       subtitle = "... Do an activity?",
       x = "GDP per Otter-Penguins",
       y = "Life Expectancy of Arctic Camels") +
  theme_theLastAirbender(title.font = "Slayer", text.font = "Slayer",
                         title.size = 8, subtitle.size = 8) -> watertribe

mpg %>%
  ggplot(aes(displ)) +
  geom_histogram(aes(fill = class), col = "black", size = 0.1) +
  scale_fill_earthKingdom() +
  labs(title = "There is no war in Ba Sing Se",
       subtitle = "(Welcome to Lake Laogai)") +
  theme_theLastAirbender(title.font = "Slayer", text.font = "Slayer",
                         title.size = 14) -> earthkingdom

## plot together:
plot_grid(firenation, airnomads, watertribe, earthkingdom, ncol = 2)

Parks and Recreation

This is probably a more literal interpretation of “Parks and Recreation”
rather than any of the colors in the palette/theme representing the
characters. This one was the one I took the most liberties with and just
tried to use colors that felt very outdoorsy and “Parks”-like. The font
used with this theme, “Titillium Web” is similar to the typeface used in
the logo of “Parks and Rec”, “Champion HTF-Heavyweight”.

airquality %>%
  mutate(Month = as.factor(Month)) %>%
  ggplot(aes(x = Day, y = Temp, group = Month, color = Month)) +
  geom_point(size = 2.5) +
  labs(title = "Calzones are pointless.",
       subtitle = "They're just pizza that's harder to eat!",
       caption = "No one likes them. Good day, sir.") +
  scale_color_parksAndRec() +
  theme_minimal() +
  theme_parksAndRec(text.font = "Titillium Web",
                    title.font = "Titillium Web Black",
                    legend.font = "Titillium Web") -> parksandrec

mpg %>%
  ggplot(aes(displ)) +
  geom_histogram(aes(fill = class), col = "black", size = 0.1) +
  labs(title = "Parks & Recreation",
       subtitle = "Gotta Spend Money To Make Money!",
       caption = "And I spent... all of my money!") +
  scale_fill_parksAndRec() +
  theme_minimal() +
  theme_parksAndRec_light(title.font = "Titillium Web Black",
                          text.font = "Titillium Web") -> parksandreclight

## plot together:
plot_grid(parksandrec, parksandreclight, ncol = 2)

Lessons Learned & Future Steps

Although this package is mainly just for fun, I still learned quite a
lot from the experience. I was able to get more practice with usethis
and devtools for package development. Instead of manually creating
package files, be it the DESCRIPTION, R scripts, license, etc. the
usethis::use_*() set of functions creates them for you automatically
along with the correct directories and when relevant adds them to
.gitignore or makes a note of any changes in other relevant files.
Below is an example of a really nifty usethis function that creates a
unit test to run a spell check on all your documentation!
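The function in question is, presumably, usethis::use_spell_check(), which wires the spelling package into your package's test setup:

```r
## one-time setup: adds spelling to Suggests, creates inst/WORDLIST,
## and adds a tests/spelling.R file that runs during R CMD check
usethis::use_spell_check()

## you can also run the spell check manually at any time
spelling::spell_check_package()
```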

A lot of the best packages I've seen have always had cool badges on
their READMEs. The repo status badges, described in detail here, are
pretty straightforward: they give the viewer an idea of what stage of
development the R package is in. The other badges I have on tvthemes
are the ones for the license (GPL v3) and for codecov. I just used
usethis::use_badge(), but there are alternative ways to produce them,
such as the badger package by Guangchuang Yu.

On the topic of testing, I create a lot of unit tests for the R packages
that I help maintain at the NGO I work for but until tvthemes I had
never created tests for ggplot2 themes and palette objects before.
After looking around at other theme/palette packages such as
vapoRwave,
hrbrthemes, and
ggpomological I was able
to get a sense for what to test for such as:

test_that("theme_brooklyn99 works", {
  thm <- theme_brooklyn99()
  expect_s3_class(thm, "theme")
  expect_equal(thm$text$family, "")
})

test_that("brooklyn99_pal raises warning with number greater than colors available, x > 10", {
  expect_warning(brooklyn99_pal()(11))
})

In addition, I used codecov to see the exact
percentage of my code that my tests covers.

Although all of the actual theme or palette functions are "covered", I
could still add more depth to the tests. Currently I only have a test that
checks for one component of the theme, the font family, but I should
expand this to include other key components of the particular theme I'm
testing, for example the exact color of the panel.background for
theme_spongeBob() or the correct default size of the title. I also need
to figure out how to write tests for the import_*() functions.

The next thing I want to try out is to use the
vdiffr package to test the outputted
plots themselves. From the README the steps are as follows:

1. Add expectations by including expect_doppelganger() in your
test files.

2. Run manage_cases() to generate the plots which vdiffr will test
against in the future. This will launch a Shiny gadget which will
ask you to confirm that each plot is correct.

3. Run devtools::test() to execute the tests as normal.

This can be extremely useful for tvthemes to check whether the
individual themes and palettes are being applied properly whenever I
make some changes in the code!
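As a sketch of how such a test might look for one of the themes (the test and figure names here are hypothetical):

```r
test_that("theme_brooklyn99 plots look as expected", {
  p <- ggplot2::ggplot(ggplot2::mpg, ggplot2::aes(displ, hwy)) +
    ggplot2::geom_point() +
    theme_brooklyn99()
  # compares p against the snapshot previously approved via manage_cases()
  vdiffr::expect_doppelganger("brooklyn99-scatter", p)
})
```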

Last, but certainly not least, was the hex sticker for my package.
Originally I wanted to use a character from one of the TV shows in the
logo but in the end I went with something that wasn’t copyrighted. Also,
I didn't really have the Photoshop skills to make it work! What I ended
up using was the popular Japanese site,
irasutoya, which provides thousands of
drawings for free use. It’s a great website that provides drawings for
any object, person, and situation! After finding the best image (a child
watching cartoons and sitting too close to the TV), I used the
hexSticker package by Guangchuang Yu to create it in R:

## Download image:
download.file("https://3.bp.blogspot.com/-kHda-2huAF8/WUdZKshMQkI/AAAAAAABFDM/sXt-PT9dKtYp2Y0Y_OV64TJzs7PvOLZmgCLcBGAs/s800/tv_boy_chikaku.png",
              destfile = paste0(tempdir(), ".png"), mode = "wb")

## Create sticker:
hexSticker::sticker(subplot = paste0(tempdir(), ".png"),
                    package = "tvthemes",
                    p_size = 14, p_color = "#8B4411",
                    p_x = 1, p_y = 1.65,
                    s_x = 1.0, s_y = 0.95,
                    s_width = 0.6, s_height = 0.6,
                    h_fill = "#f5f5dc", h_color = "black",
                    filename = "tvthemes_hexsticker.png")

Besides adding more themes and palettes the next big step would be
releasing tvthemes to CRAN. I was rather confused by the process at
first but after watching Jim
Hester’s
video of prepping
vroom for a CRAN release, a lot of my hangups were cleared up for me.
I definitely recommend it as he carefully guides you through the
process!

I had a LOT of fun creating this package and I hope to continue with
it in my free time!

Suggestions and Pull Requests for palettes/themes are most welcome, you
can find the Github repo for the package
HERE!



### Personalized treatment effects with model4you

Thu, 05/16/2019 - 00:00

(This article was first published on Achim Zeileis, and kindly contributed to R-bloggers)

Personalized treatment effects can be estimated easily with model-based trees and model-based random forests using the R package model4you.

Citation

Heidi Seibold, Achim Zeileis, Torsten Hothorn (2019). “model4you: An R Package for Personalised Treatment Effect Estimation.” Journal of Open Research Software, 7(17), 1-6. doi:10.5334/jors.219

Abstract

Typical models estimating treatment effects assume that the treatment effect is the same for all individuals. Model-based recursive partitioning allows to relax this assumption and to estimate stratified treatment effects (model-based trees) or even personalised treatment effects (model-based forests). With model-based trees one can compute treatment effects for different strata of individuals. The strata are found in a data-driven fashion and depend on characteristics of the individuals. Model-based random forests allow for a similarity estimation between individuals in terms of model parameters (e.g. intercept and treatment effect). The similarity measure can then be used to estimate personalised models. The R package model4you implements these stratified and personalised models in the setting with two randomly assigned treatments with a focus on ease of use and interpretability so that clinicians and other users can take the model they usually use for the estimation of the average treatment effect and with a few lines of code get a visualisation that is easy to understand and interpret.

Software

https://CRAN.R-project.org/package=model4you

Illustration

The correlation between exam group and exam performance in an introductory mathematics exam (for business and economics students) is investigated using tree-based stratified and personalized treatment effects. Group 1 took the exam in the morning and group 2 started the exam with slightly different exercises after the first group finished. Potential sources of heterogeneity in the group effect include gender, field of study, whether the exam was taken (and failed) previously, and prior performance in online “tests” earlier in the semester. Performance in both the written exam and the online tests is captured by percentage of correctly solved exercises.

Overall, it seems that the split into two different exam groups was fair: The second group performed only slightly worse, by around 2 or 3 percentage points, suggesting that the exam in the second group was only very slightly more difficult. However, when investigating the heterogeneity of this group effect with a model-based tree, it turns out that the tree distinguishes the students by their performance in the online tests. The largest difference between the two exam groups is in the students who did very well in the online tests (more than 92.3 percent correct), where the second-group students performed worse by 13.3 percentage points. So the split into the two exam groups seems not to have been fully fair for those very good students.

To refine the assessment further, a model-based forest can be estimated. This reveals that the dependence of the group effect on the performance in the online tests is even more pronounced. This is shown in the dependence plots and beeswarm plots below with the group treatment effect on the y-axis and the performance in the online tests on the x-axis.

To fit the simple linear base model in R, lm() can be used. The subsequent tree based on this model can be obtained with pmtree() from model4you, and the forest with pmforest(). Example code is shown below; the full replication code for the entire analysis and graphics is included in the manuscript.

bmod_math <- lm(pcorrect ~ group, data = MathExam)
tr_math <- pmtree(bmod_math, control = ctree_control(maxdepth = 2))
forest_math <- pmforest(bmod_math)
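For readers who want to try the whole workflow end to end, here is a self-contained sketch on simulated data (my own, not from the manuscript; the simulated variable names are made up, and it assumes the model4you API as documented, in particular that pmodel() returns one row of personalised coefficients per observation):

```r
library(model4you)
library(partykit)   # for ctree_control()

set.seed(1)
n <- 200
sim <- data.frame(
  group  = factor(sample(1:2, n, replace = TRUE)),
  online = runif(n)                  # stand-in for online-test performance
)
# build in heterogeneity: the group effect grows with online performance
sim$pcorrect <- 50 + 20 * sim$online -
  10 * (sim$group == "2") * sim$online + rnorm(n, 0, 5)

bmod   <- lm(pcorrect ~ group, data = sim)      # base model
tr     <- pmtree(bmod, control = ctree_control(maxdepth = 2))
forest <- pmforest(bmod)
coefs  <- pmodel(forest)   # personalised (intercept, group effect) per person
head(coefs)
```

The tree stratifies the sample; pmodel() on the forest instead gives each individual their own coefficient estimates, which is what the beeswarm plots visualise.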

To leave a comment for the author, please follow the link and comment on their blog: Achim Zeileis. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### New Free R Course on Census.gov!

Wed, 05/15/2019 - 19:08

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

Today I am happy to announce that a new free course on Choroplethr – Mapping Census Bureau Data in R with Choroplethr – is available on Census Academy, the US Census Bureau’s new training platform! (Choroplethr is a suite of R packages I created for mapping demographic statistics).

This course is a more in-depth version of Learn to Map Census Data in R, the first Choroplethr course I published in 2015. That course walks you through the steps necessary to create this map, which shows how US Per Capita Income has changed over time.

Mapping Census Bureau Data in R with Choroplethr goes into much more detail. It has 18 video lessons spread over five modules, and you can download the slides and code used in each lesson.

I am most excited by Module 4 of the course (“Data Details”), which goes into more detail about US Census data than I have previously taught. It also details advanced support that Choroplethr provides for working with the Census API. Specific topics include:

1. What the American Community Survey (ACS) is, and how it differs from the Decennial Census
2. Nuances of ACS data, such as span and margin of error
3. How to search for ACS data
4. How to load arbitrary ACS tables with Choroplethr
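As a small flavour of what Choroplethr makes possible, a state-level map takes only a couple of lines (my sketch, not course material; it assumes the choroplethr package and its bundled df_pop_state dataset):

```r
library(choroplethr)

# df_pop_state ships with choroplethr: state population estimates with the
# two columns choroplethr expects, 'region' and 'value'
data(df_pop_state)
state_choropleth(df_pop_state, title = "US State Population Estimates")
```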

I would like to thank Jeff Meisel, former CMO of the Census Bureau, for championing the idea for this course and Census Academy. I would also like to thank Alexandra Barker, a Supervisory Program Analyst at Census, for helping to make the course a reality!

The post New Free R Course on Census.gov! appeared first on AriLamstein.com.


To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.

### Timing Working With a Row or a Column from a data.frame

Wed, 05/15/2019 - 18:24

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

In this note we share a quick study timing how long it takes to perform some simple data manipulation tasks with R data.frames.

We are interested in the time needed to select a column, alter a column, or select a row. Knowing what is fast and what is slow is critical in planning code, so here we examine some common simple cases. It is often impractical to port large applications between different work-paradigms, so we use porting small tasks as approximate stand-ins for measuring porting whole systems.

We tend to work with medium size data (hundreds of columns and millions of rows in memory), so that is the scale we simulate and study.

Introduction

We will time the above tasks both using base R, and dplyr 0.8.1. We will actually perform the timings using the tibble variation of data.frames, as this (hopefully) should not affect the base-R timings, and may potentially help dplyr (and we want to time each system “used well”).

Below are the results in graphical form (images link through to the full code of the experiments).

The experiments

We won’t go deep into the details of the timings, as we are just interested in relative performance, except to say:

• We vary number of rows or number of columns as appropriate for each experiment, but not both at the same time.
• All data structures were tibble, and we tended to force a gc() before task timing. We will say data.frame throughout, as tibbles are roughly a sub-class of data.frames, and many R users are interested in the performance of data.frames.
• Each measurement was repeated to try and get stable estimates and see variation. However, what we are trying to estimate is the cost of performing one of these tasks for the first time. So if we had noticed a case where the first run was systematically different from the repetitions, we would have re-structured the code to simulate a first run multiple times by refreshing the data each time. This didn't occur in this set of measurements, but we have used the method in related studies, so it was part of our measurement intent.
• All runs were with a current version of R 3.6.0, all packages up to date, on an un-loaded late 2014 Mac Mini with 8 GB of RAM. More details are in the linked materials.
• As we are running a great range of examples, all graphs are presented in a log-log scale. This is a legibility trade-off; log-log scale compresses (or obscures) variation, but this crushing is just about the only thing that makes anything but the large examples visible when plotting a large range of examples jointly. In a log-log plot: small changes in height can in fact be large effects.
• As always, these are initial, not terminal studies. If you feel a variation is important, feel free to copy and alter the code to run your own study.
Timings

Selecting the first column from a data.frame

First we look at the time to select the first column from a data frame, with time to complete the task plotted as a function of the number of columns in the task.

For both the base-R bracket notation and for dplyr::select() the time is growing with the number of columns. In both cases this is undesirable, but can be excused as it may not be a good design trade-off to maintain fast column lookup structures: column names may change and column names can repeat. However, the ranges of behavior differ: base-R is slowing down for larger tasks, but is always fast. dplyr::select() is taking many seconds on the larger tasks.

The ratio of the time for dplyr::select() to complete the task over the time it takes base-R to complete the task is given here.

The take-away being: dplyr::select() is routinely thousands of times slower than R itself, and the times are substantial for large data.
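For readers who want to reproduce this kind of measurement on their own machine, a minimal sketch looks like the following (mine, much smaller than the linked experiments; it assumes the microbenchmark package is installed, and note the two operations return different shapes — a vector versus a one-column data.frame):

```r
library(microbenchmark)
library(dplyr)

# a wide data.frame: 1000 rows, 1000 numeric columns (names V1, V2, ...)
d <- as.data.frame(matrix(1.0, nrow = 1000, ncol = 1000))

microbenchmark(
  base_bracket = d[["V1"]],       # base-R column extraction (vector)
  dplyr_select = select(d, V1),   # dplyr equivalent (one-column data.frame)
  times = 10L
)
```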

Altering the first column in a data.frame

As one would expect, the time to alter the first column in a data.frame shows similar behavior to the time to select the first column.

(We apologize for swapping colors in the above graph, it is due to us using the alpha order as the color order.)

Selecting the first row from a data.frame

Selecting the first row from a data.frame can be fast due to the row indexing structures. It may or may not be constant time as the indexing structures may be non-trivial keys (not just integers). Similarly, picking a row from a more general database requires a full scan, unless there are index keys.

However, selecting the first row by index should in fact be easy. Let’s look at the timings.

It appears R does achieve constant time access to the first row when the row indices are simple integers. dplyr::slice() shows time-growth, even though dplyr::slice() is defined to work with indices (so has a good chance of being constant time). These timings are faster than the column examples, as both systems are designed to work with many rows.

data.table

We did not time data.table, as it is known to be very fast both on trivial and non-trivial workloads (at all scales). Here are some shared data.table timings. If you are working with data at even medium scale, we strongly recommend data.table.

Conclusion

If you are working in a data set size range where the above effects are present, and you are unsatisfied with how long your analyses are taking to run: you may want to look into timing both dplyr and base-R variations of small extracts of your work pattern to see what the trade-offs of using either methodology are. The point is: even if your code is currently taking a long time to run there may be an easy variation of it that is in fact fast.

R is not intrinsically slow in working with data.

If you are not seeing slow results the above issues can be safely ignored.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

### Why R? 2019 Conference – Registration and Call for Papers Opened

Wed, 05/15/2019 - 10:00

The two previous editions of the Why R? conference were a great networking success. We gathered 200–300 R enthusiasts from across Europe at each edition and hosted more than 25 workshops altogether. The hunger for a bigger event led us to the 3rd edition!

About the Why R? 2019 conference

We are more than happy to announce that Why R? 2019 Conference will be organized by Why R? Foundation. The third official meeting of Polish R enthusiasts will be held in Warsaw 26-29 September 2019. As the meeting is held in English, we are happy to invite R users from other countries.

Important dates

Abstract submissions are format-free, but please do not exceed 400 words and clearly state your chosen call. The abstract submission form is available here: whyr.pl/2019/submit/. One can register for the event at whyr.pl/2019/register/.

Keynotes

Among the multiple workshops we are planning to host, the following keynote speakers have confirmed their talks at Why R? 2019:

• Steph Locke (Principal Consultant (GBR), Locke Data),
• Marvin Wright (ranger package, Leibniz Institute for Prevention Research and Epidemiology),
• Sigrid Keydana (Applied Researcher at RStudio, Inc.),
• Institute of Geoecology and Geoinformation, Adam Mickiewicz University.
Programme

The following events will be hosted during the Why R? 2019 conference:

• plenary lectures of invited speakers,
• lightning talks,
• community/table session,
• Why R? paRty,
• workshops – blocks of several small to medium-sized courses (up to 50 people) for R users at different levels of proficiency.
Pre-meetings

We are organizing pre-meetings in many European cities to cultivate the R experience of knowledge sharing. You are more than welcome to visit an upcoming event and check photos and presentations from previous ones. A few meetings are still being organized and are not yet added to the map. If you are interested in co-organizing a Why R? pre-meeting in your city, let us know (at kontakt_at_whyr.pl) and the Why R? Foundation can provide speakers for the venue!

Past events

The Why R? 2017 edition, organized in Warsaw, gathered 200 participants. The Facebook reach of the conference page exceeded 15,000 users, with almost 800 subscribers. Our official web page had over 8,000 unique visitors and over 12,000 visits in total. To learn more about Why R? 2017, see the conference after-movie (https://vimeo.com/239259242).

The Why R? 2018 edition, organized in Wroclaw, gathered 250 participants. The after-movie can be found here.


### Challenge accepted

Wed, 05/15/2019 - 02:00

(This article was first published on R on Gianluca Baio, and kindly contributed to R-bloggers)

As part of our R for HTA workshop, we have also prepared a kind-of-hackathon, which we are now releasing to the general public (I think “general public” is a bit pretentious, but this afternoon, I’m in the right mood for it, I think…).

Participation is actually open to all (and we’ll advertise more widely on relevant mailing lists) but we expect contributions particularly from anyone attending the workshop, which will be held at UCL on 9th July.

The 6-step challenge (which has been put together by Nathan, with inputs from Howard and a couple of very minor comments from me — I think/hope I’m not missing out any other contributor!) is available from here.

Participants should email their code solutions, including a main.R or main.Rmd script, to nathan.green@imperial.ac.uk by 24th June 2019. I think we are allowing group submissions, as long as this is acknowledged clearly.

The code will be judged on the grounds of speed, elegance, and transparency/readability. First, second, and third place ranked solutions will be given signed e-certificates confirming their achievement. However, this exercise aims to encourage a collaborative approach to the programming and there will be time for discussion of solutions and problems at the 9th July workshop. Participants will be encouraged to briefly talk through their solutions.


To leave a comment for the author, please follow the link and comment on their blog: R on Gianluca Baio.

### an attempt at code golf

Wed, 05/15/2019 - 00:19

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

Having discovered codegolf on Stack Exchange a few weeks ago, I had spotted a few interesting puzzles since then but only got the opportunity for a try over a quiet and rainy weekend (and Robin being on vacation)! The challenge was to write R code deciding whether or not a given integer n is congruent, where congruent means that it is the area of a right triangle with all three sides rational. The question included a pointer to the Birch and Swinnerton-Dyer conjecture as a means to check congruence, although the real solution is provided by Tunnell's Theorem, which states that an odd n is congruent if and only if the number of integer solutions to 2x²+y²+8z²=n is twice the number of integer solutions to 2x²+y²+32z²=n, and an even n is congruent if and only if the number of integer solutions to 8x²+y²+16z²=n is twice the number of integer solutions to 8x²+y²+64z²=n. This is only true for square-free integers, though. (I actually spent more time on figuring out the exact wording of the theorem than on optimising the R code!)
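A plain, unoptimised transcription of the criterion as stated above can serve as a readability baseline before any golfing (a sketch of mine, only meant for small n; it strips square factors first so the criterion applies, and takes the odd/even conditions literally):

```r
# count integer solutions (x, y, z) of a*x^2 + b*y^2 + c*z^2 == n
count_sols <- function(a, b, c, n) {
  r <- floor(sqrt(n))
  s <- 0L
  for (x in -r:r) for (y in -r:r) for (z in -r:r)
    if (a*x^2 + b*y^2 + c*z^2 == n) s <- s + 1L
  s
}

is_congruent <- function(n) {
  i <- 2L                          # reduce n to its square-free part first
  while (i * i <= n) {
    while (n %% (i * i) == 0) n <- n %/% (i * i)
    i <- i + 1L
  }
  if (n %% 2) count_sols(2, 1, 8, n)  == 2 * count_sols(2, 1, 32, n)
  else        count_sols(8, 1, 16, n) == 2 * count_sols(8, 1, 64, n)
}

sapply(c(1, 3, 5, 6, 7), is_congruent)  # FALSE FALSE TRUE TRUE TRUE
```

(5, 6 and 7 are the classical small congruent numbers; 1 and 3 are not congruent.)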

My original solution

p=function(n){
for (i in(n:2)^2)if(n%%i<1)n=n/i
if(n%%2){d=8;f=2;g=16}else{d=2;f=1;g=8}
A=0;b=(-n:n)^2
for(x in d*b)for(y in x+f*b)for(z in g*b)
A=A+(y+z==n)-2*(y+4*z==n)
A==0}

was quite naïve, as shown by the subsequent improvements by senior players, like the final (?) version of Guiseppe:

function(n){b=(-n:n)^2
for(i in b[b>0])n=n/i^(!n%%i)
P=2^(n%%2)
o=outer
!sum(!o(y<-o(8/P*b,2*b,"+")/P-n,z<-16/P*b,"+"),-2*!o(y,4*z,"+"))}

exhibiting a load of code golf tricks, from using an anonymous function to renaming functions with a single letter, to switching from integers to booleans and back with the exclamation mark.


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

### Make your package spherical

Tue, 05/14/2019 - 22:04

(This article was first published on R – Mark van der Loo, and kindly contributed to R-bloggers)

Larger packages typically consist of functions that are visible to the users (exported functions) and functions that are used by the exported functions, but that are invisible to the user. For example:

# exported, user-visible function
inch2cm <- function(x){
  x*conversion_factor("inch")
}

# not exported function, package-internal
conversion_factor <- function(unit){
  confac <- c(inch=2.54, pound=1/2.2056)
  confac[unit]
}

We can think of the exported functions (or more correctly, the interface of the exported functions) as the surface of a package, and all the other functions as the volume. The surface is what a user sees; the volume is what the developer sees. The surface is how a user interacts with a package.

If the surface is small (few functions exported, no unnecessary parameters in the interface), users are limited in the ways they can interact with your package, and that means there is less to test. It also means that you, as a package developer, have more room to move and change things in the volume. So as a rule of thumb, it is a good idea to keep the surface small.

Since a sphere has the smallest possible surface-to-volume ratio, I refer to this rule of thumb as: make your package spherical.
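In practice, the surface is whatever the NAMESPACE file exports; with roxygen2 that comes down to where you put @export tags. A sketch of the example above (my own; the roxygen comments are illustrative, not from the original post):

```r
#' Convert inches to centimetres
#' @param x length in inches
#' @export
inch2cm <- function(x){
  x * conversion_factor("inch")
}

# no @export tag: conversion_factor stays in the volume, reachable
# inside the package but not part of the surface users see
conversion_factor <- function(unit){
  confac <- c(inch = 2.54, pound = 1/2.2056)
  confac[unit]
}
```

Shrinking the surface later (removing an export) breaks users; growing it is cheap, which is another reason to start small.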

Note
This post was first published as a paragraph in the vignette of the tinytest package. I repeat it here with a few changes for more visibility.


To leave a comment for the author, please follow the link and comment on their blog: R – Mark van der Loo.

### Production Line Stations Maintenance Prediction – Process Flow.

Tue, 05/14/2019 - 19:34

(This article was first published on R-Analytics, and kindly contributed to R-bloggers)

Steps Needed in a Process to Detect Anomalies and Issue a Maintenance Notice Before Scrap Is Created on the Production Line.

The process flow described in my previous articles (1, 2):

• Get Training Data.
  • At least 2 weeks of measurements from passed units.
• Data Cleaning.
  • Ensure no null values.
  • At least 95% of the data must have measurement values.
• Anomaly Detection Model Creation.
  • Deep Learning Autoencoders, or
  • Isolation Forest.
• Set the Desired Yield Threshold, Normally 99%.
• Get the Prediction Value Limit by Linking the Yield Threshold to the Training Data Using the Anomaly Detection Model Created.
• Get Testing Data.
  • Last 24 Hours of Station Measurements, Passed and Failed Units.
• Testing Data Cleaning.
  • Ensure no null values.
• Get Anomalies From the Testing Data by Using the Model Created and the Prediction Limit Found Before.
• If Anomalies Are Found, Notify Maintenance to Avoid Scrap.
• Display a Chart Showing the Last 24 Hours' Anomalies and Failures Found:

As you can see (anomalies in blue, failures in orange), we are detecting anomalies (units close to the measurement limits) before failures. Sending an alert when the first or second anomaly is detected will prevent scrap, because the station will get maintenance before failures occur.
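As a model-free stand-in for this idea (not the autoencoder or isolation-forest pipeline from the articles), flagging passing units that sit close to a specification limit can be sketched in a few lines; the spec limits and the 5% margin here are made up for illustration:

```r
# hypothetical spec limits for one station measurement
lsl <- 9.0; usl <- 11.0
margin <- 0.05 * (usl - lsl)   # "close to the limit" band: 5% of spec width

meas <- c(10.00, 10.20, 10.85, 10.96, 11.10)  # last few units, oldest first

status <- ifelse(meas < lsl | meas > usl, "failure",
          ifelse(meas < lsl + margin | meas > usl - margin, "anomaly", "pass"))
status  # "pass" "pass" "pass" "anomaly" "failure"
```

Here the fourth unit would already trigger the maintenance alert, one unit before the actual failure, which is exactly the pattern in the chart above.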

We are using R; more information about R: https://www.r-bloggers.com

To leave a comment for the author, please follow the link and comment on their blog: R-Analytics.

### Introduction to AzureKusto

Tue, 05/14/2019 - 19:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

By Hong Ooi and Alex Kyllo

This post is to announce the availability of AzureKusto, the R interface to Azure Data Explorer (internally codenamed “Kusto”), a fast, fully managed data analytics service from Microsoft. It is available from CRAN, or you can install the development version from GitHub via devtools::install_github("cloudyr/AzureKusto").

AzureKusto provides an interface to Data Explorer, including DBI-compliant methods for connecting to Kusto clusters and submitting Kusto Query Language (KQL) statements, as well as a dbplyr-style backend that translates dplyr queries into KQL statements. On the administrator side, it extends the AzureRMR framework to allow for creating clusters and managing database principals.

Connecting to a cluster

To connect to a Data Explorer cluster, call the kusto_database_endpoint() function. Once you are connected, call run_query() to execute queries and command statements.

library(AzureKusto)

## Connect to a Data Explorer cluster with (default) device code authentication
Samples <- kusto_database_endpoint(
    server="https://help.kusto.windows.net",
    database="Samples")

res <- run_query(Samples,
    "StormEvents | summarize EventCount = count() by State | order by State asc")
head(res)
##            State EventCount
## 1        ALABAMA       1315
## 2         ALASKA        257
## 3 AMERICAN SAMOA         16
## 4        ARIZONA        340
## 5       ARKANSAS       1028
## 6 ATLANTIC NORTH        188

# run_query can also handle command statements, which begin with a '.' character
res <- run_query(Samples, ".show tables | count")
res[[1]]
##   Count
## 1     5

dplyr Interface

The package also implements a dplyr-style interface for building a query upon a tbl_kusto object and then running it on the remote Kusto database and returning the result as a regular tibble object with collect(). All the standard verbs are supported.

library(dplyr)

StormEvents <- tbl_kusto(Samples, "StormEvents")
q <- StormEvents %>%
    group_by(State) %>%
    summarize(EventCount=n()) %>%
    arrange(State)

show_query(q)
## database('Samples').['StormEvents']
## | summarize ['EventCount'] = count() by ['State']
## | order by ['State'] asc

collect(q)
## # A tibble: 67 x 2
##            State EventCount
##
## 1        ALABAMA       1315
## 2         ALASKA        257
## 3 AMERICAN SAMOA         16
## ...

DBI interface

AzureKusto implements a subset of the DBI specification for interfacing with databases in R.

The following methods are supported:

• Connections: dbConnect, dbDisconnect, dbCanConnect
• Table management: dbExistsTable, dbCreateTable, dbRemoveTable, dbReadTable, dbWriteTable
• Querying: dbGetQuery, dbSendQuery, dbFetch, dbSendStatement, dbExecute, dbListFields, dbColumnInfo

It should be noted, though, that Data Explorer is quite different to the SQL databases that DBI targets. This affects the behaviour of certain DBI methods and renders others moot.

library(DBI)

Samples <- dbConnect(AzureKusto(),
    server="https://help.kusto.windows.net",
    database="Samples")

dbListTables(Samples)
## [1] "StormEvents"       "demo_make_series1" "demo_series2"
## [4] "demo_series3"      "demo_many_series1"

dbExistsTable(Samples, "StormEvents")
## [1] TRUE

dbGetQuery(Samples, "StormEvents | summarize ct = count()")
##      ct
## 1 59066

If you have any questions, comments or other feedback, please feel free to open an issue on the GitHub repo.

And one more thing…

As of Build 2019, Data Explorer can also run R (and Python) scripts in-database. For more information on this feature, currently in public preview, see the Azure blog and the documentation article.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.