R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Package management: Using repositories in production systems

Thu, 11/21/2019 - 15:53

[This article was first published on R-Bloggers – eoda GmbH, and kindly contributed to R-bloggers].

Data science is characterized, among other things, by the use of open source tools. One advantage of working with open source languages such as R or Python is the large package ecosystem, which, thanks to development within huge communities, provides tools for numerous use cases and problems. The packages are organized in digital online archives – so-called repositories. Data scientists can use these repositories to access current or past package versions and use them in their work. An important aspect here is the continuous development of many packages. New package versions include new, improved or extended functionality as well as bug fixes. In some cases, however, a new package version also behaves differently with the same code, or introduces new dependencies on other packages, on the programming language itself or on other system components, such as the underlying operating system. These changes require additional adjustments to keep already developed code working: the code must be adapted to the new behavior of the packages, or additional packages must be installed to satisfy the dependencies. Production systems not only have to guarantee almost constant availability, they also have many developers working on them. It is therefore important that updates to the package landscape are carried out quickly and smoothly.

Package management and collaborative work

Ideally, all developers work in identical environments, i.e. with the same packages and package versions. In practice, however, developers may work with different package versions whose functionality differs, and their scripts and analyses diverge as a result.

Such scripts do not behave uniformly for all developers and cause either errors or different results. In addition to the danger that scripts behave differently across development environments, there is the danger that developers' package versions differ from those of the production system, where the analyses generate value and must therefore function almost continuously. To avoid conflicts between different package versions, a good infrastructure ensures smooth package management that guarantees equal development conditions as well as controlled and synchronous updates.

A first measure towards good package management is to provide packages in a local, company- or team-wide repository. For developers, the local repository works like an online repository, except that only selected packages and package versions are available in it. This gives all data scientists access to the same central collection of packages, while ensuring that the package versions in the repository are largely stable and all dependencies are met. It also guarantees that the developed algorithms and code behave the same way throughout the company, across the various development environments and on the production system. However, the coexistence of different versions of the same package cannot always be avoided, and there is again the danger of different developers developing against different package versions, as with an online repository. The RStudio Package Manager is well suited here: it acts as a bridge that integrates different package sources, such as an online repository, a local repository and an external development repository (e.g. GitLab). Companies with restrictive corporate governance principles may also want only an approved subset of packages in their local repository.
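To make this concrete (a minimal sketch; the repository URL below is a placeholder and not part of the original article), R can be pointed at such an internal repository via options(repos = ...):

# Hedged sketch: resolve installs against a company-internal repository first,
# falling back to a public CRAN mirror (both URLs are placeholders)
options(repos = c(
  internal = "https://packages.example-company.com/cran/latest",
  CRAN     = "https://cloud.r-project.org"
))
install.packages("dplyr")  # now served from the repositories configured above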

Package management in practice

To keep different versions of the same package from colliding, the local repository can hold several versions of a package while each project is restricted to one specific version. For this purpose, a project environment is defined per project, which contains a subset of the packages of the local repository and is pinned to fixed package versions. This has the advantage that different projects can work with different packages or package versions, while a stable and conflict-free package world is provided within each project. For the data scientists this means either developing on a central development system (e.g. RStudio Server) or working on their local system with the packages defined for the project (e.g. as an R project or conda environment, optionally within a Docker container). In addition, a production system is operated whose package landscape is identical to the development environment. Here the local repository provides an additional level of security by ensuring that only packages are used that have proven stable over a certain period and already contain initial bug fixes.
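Such a project environment can be sketched, for example, with the renv package (one of several options; the article equally mentions R projects, conda environments and Docker):

# Hedged sketch using renv to pin per-project package versions
# install.packages("renv")
renv::init()      # create a project-local library and lockfile
renv::snapshot()  # record the exact package versions used by the project
renv::restore()   # reproduce that package set on another machine or the server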

When it is time to update the packages, this should be done almost simultaneously on the development and production environments in order to keep the period in which the environments behave differently as short as possible. It is especially important that the production system runs stably and without interruptions. It is therefore advisable to set up a test system on which updates are carried out beforehand to check for missing package dependencies or conflicts between certain package versions. Once the test system has reached a stable state, the development environments can be updated and the algorithms and analyses adapted to the new package versions where necessary. The package landscape on the production system can then be updated at the same time as the analysis adjustments already tested on the development environments are deployed, keeping the risk of errors on the production system as low as possible. A reliable infrastructure is essential to carry out such updates quickly, smoothly and regularly. The shape of such an infrastructure depends on many factors, such as the number of projects, the size of the development teams, or the length of the update cycles.

Good package management in production systems and a fully functional infrastructure are the basis for a complication-free development environment. We are happy to support and advise you in the planning and implementation of an IT infrastructure in your company. Learn more about eoda | analytic infrastructure consulting!


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – eoda GmbH.

BayesComp 20 [schedule]

Wed, 11/20/2019 - 00:19

[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers].

The schedule for the program is now available on the conference webpage of BayesComp 20, which runs 7-10 Jan 2020. There are twelve invited sessions, including one j-ISBA session, plus a further thirteen contributed sessions selected by the scientific committee, and three tutorials on the first day. Looking forward to seeing you in Florida! (Poster submissions are still welcome!)


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

Galton’s board all askew

Tue, 11/19/2019 - 14:18

[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers].

Galton’s quincunx has fascinated me since the (early) days when I saw a model of it as a teenager in an industry museum near Birmingham, so I jumped on the challenge of building an uneven-nail version where the probabilities of ending up in each of the boxes are not the Binomial ones – for instance, producing a uniform distribution with the maximum number of nails having probability ½ of turning right. I obviously chose to try simulated annealing to figure out the probabilities, facing as usual the unpleasant task of setting the objective function and calibrating the moves and the temperature schedule. Plus, less usually, a choice of the space where the optimisation takes place, i.e., deciding on a common denominator for the (rational) probabilities. Should it be 2⁸?! Or more (since the solution with two levels also involves 1/3)? Using the functions

evol <- function(P){
  Q = matrix(0, 7, 8)
  Q[1,1] = P[1,1]; Q[1,2] = 1 - P[1,1]
  for (i in 2:7){
    Q[i,1] = Q[i-1,1] * P[i,1]
    for (j in 2:i)
      Q[i,j] = Q[i-1,j-1] * (1 - P[i,j-1]) + Q[i-1,j] * P[i,j]
    Q[i,i+1] = Q[i-1,i] * (1 - P[i,i])
    Q[i,] = Q[i,] / sum(Q[i,])
  }
  return(Q)
}

and

temper <- function(T=1e3){
  bestar = tarP = targ(P <- matrix(1/2, 7, 7))
  temp = .01
  while (sum(abs(8 * evol(R <- P)[7,] - 1)) > .01){
    for (i in 2:7) R[i, sample(rep(1:i, 2), 1)] = sample(0:deno, 1)/deno
    if (log(runif(1))/temp
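As a quick sanity check of evol() (a sketch added here, not part of the original post): with every nail probability equal to ½, the bottom row reproduces the Binomial(7, ½) distribution.

P <- matrix(1/2, 7, 7)
round(evol(P)[7, ], 4)   # 0.0078 0.0547 0.1641 0.2734 0.2734 0.1641 0.0547 0.0078
dbinom(0:7, 7, 1/2)      # the same Binomial(7, 1/2) probabilities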

I first tried running my simulated annealing code with a target function like

targ<-function(P)(1+.1*sum(!(2*P==1)))*sum(abs(8*evol(P)[7,]-1))

where P is the 7×7 lower triangular matrix of nail probabilities, all with a 2⁸ denominator, reaching

60
126 35
107 81 20
104 71 22 0
126 44 26 69 14
61 123 113 92 91 38
109 60 7 19 44 74 50

for 128P, with four entries close to 64, i.e. ½’s. Reducing the denominator to 16 once produced

8
12 1
13 11 3
16  7  6   2
14 13 16 15 0
15  15  2  7   7  4
    8   0    8   9   8  16  8

as 16P, with five ½’s (8). But none of the solutions had exactly a uniform probability of 1/8 to reach all endpoints. Success (with exact 1/8’s and a denominator of 4) was met with the new target

(1+.1*sum(!(2*P==1)))*(.01+sum(!(8*evol(P)[7,]==1)))

imposing precisely 1/8 on the final line. With a solution with 11 ½’s

0.5
1.0 0.0
1.0 0.0 0.0
1.0 0.5 1.0 0.5
0.5 0.5 1.0 0.0 0.0
1.0 0.0 0.5 0.0 0.5 0.0
0.5 0.5 0.5 1.0 1.0 1.0 0.5

and another one with 12 ½’s:

0.5
1.0 0.0
1.0 .375 0.0
1.0 1.0 .625 0.5
0.5  0.5  0.5  0.5  0.0
1.0  0.0  0.5  0.5  0.0  0.5
0.5  1.0  0.5  0.0  1.0  0.5  0.0

Incidentally, Michael Proschan and my good friend Jeff Rosenthal have a 2009 American Statistician paper on another modification of the quincunx, which they call the uncunx! Playing a wee bit further with the annealing, and using a denominator of 840, led to a 2P with 14 ½’s out of 28

.5
60 0
60 1 0
30 30 30 0
30 30 30 30 30
60  60  60  0  60  0
60  30  0  30  30 60 30


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

A baby named Al*

Tue, 11/19/2019 - 12:36

[This article was first published on R – scottishsnow, and kindly contributed to R-bloggers].

About half the males in my team seem to be called Alasdair, but few of them spell it the same way. I live in hope that the International Organization for Standardization will fix the spelling; I can’t believe it hasn’t been higher up their agenda.

Anyhoo, here’s a quick post about baby names using National Records of Scotland data and a wee bit of R magic to tidy and visualise.

library(tidyverse)

df = read_csv("Downloads/babies-first-names-18-vis-final.csv")

# https://en.wikipedia.org/wiki/Alistair
var = c("Alasdair", "Alistair", "Alastair", "Allister", "Alister", "Aleister")

png("Downloads/a_baby_named_Al.png", height = 800, width = 1200)
df %>%
  gather(year, count, -firstname, -sex) %>%
  filter(sex == "Male") %>%
  filter(firstname %in% var) %>%
  mutate(year = as.numeric(year)) %>%
  ggplot(aes(year, count, colour = firstname)) +
  geom_line(lwd = 2) +
  scale_color_brewer(type = "qual", palette = "Set1") +
  labs(title = "Popularity of Al baby names",
       subtitle = "Created by Mike Spencer @mikerspencer\nNational Records of Scotland data:\nhttps://www.nrscotland.gov.uk/news/2018/most-popular-names-in-scotland",
       x = "Year", y = "Named babies", colour = "") +
  theme_bw() +
  theme(text = element_text(size = 30),
        plot.subtitle = element_text(size = 14))
dev.off()

Here’s the output!


To leave a comment for the author, please follow the link and comment on their blog: R – scottishsnow.

Debugging in R: How to Easily and Efficiently Conquer Errors in Your Code

Tue, 11/19/2019 - 11:34

[This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers].

When you write code, you’re sure to run into problems from time to time. Debugging is the process of finding errors in your code to figure out why it’s behaving in unexpected ways. This typically involves:

  1. Running the code
  2. Stopping the code where something suspicious is taking place
  3. Looking at the code step-by-step from this point on to either change the values of some variables, or modify the code itself.

Since R is an interpreted language, debugging in R means debugging functions.

There are a few kinds of problems you’ll run into with R:

  • messages give the user a hint that something is wrong, or may be missing. They can be ignored, or suppressed altogether with suppressMessages().
  • warnings don’t stop the execution of a function, but rather give a heads up that something unusual is happening. They display potential problems.
  • errors are problems that are fatal, and result in the execution stopping altogether. Errors are used when there is no way for the function to continue with its task.

There are many ways to approach these problems when they arise. For example, condition handling using tools like try(), tryCatch(), and withCallingHandlers() can increase your code’s robustness by proactively steering error handling.
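For example, a minimal sketch of condition handling with tryCatch() (the safe_log() function is hypothetical, not from the original post):

safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) { message("caught warning: ", conditionMessage(w)); NA },
    error   = function(e) { message("caught error: ", conditionMessage(e)); NA }
  )
}
safe_log(10)    # 2.302585
safe_log(-1)    # "NaNs produced" warning is caught, NA returned
safe_log("ten") # non-numeric argument error is caught, NA returned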

R also includes several advanced debugging tools that can be very helpful for quickly and efficiently locating problems, which will be the focus of this article. To illustrate, we’ll use an example adapted from an excellent paper by Roger D. Peng, and show how these tools work along with some updated ways to interact with them via RStudio. In addition to working with errors, the debugging tools can also be used on warnings by converting them to errors via options(warn = 2).

traceback()

If we’ve run our code and it has already crashed, we can use traceback() to try to locate where this happened. traceback() does this by printing a list of the functions that were called before the error occurred, called the “call stack.” The call stack is read from bottom to top:

traceback() shows that the error occurred during evaluation of func3(y).
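The post’s original func1()/func2()/func3() code is not reproduced in this excerpt; a minimal stand-in of the same shape (hypothetical code) behaves like this:

func3 <- function(y) stop("unexpected input")
func2 <- function(y) func3(y)
func1 <- function(x) func2(x)

func1(1)      # Error in func3(y) : unexpected input
traceback()   # prints the call stack; read bottom to top: func1 -> func2 -> func3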

Another way to use traceback(), besides calling it manually after a crash, is to register it as an error handler (meaning it will be called automatically whenever an error occurs). This can be done using options(error = traceback).

We can also access traceback() directly through the button on the right-hand side of the error message in RStudio:

Debug Mode

While traceback() is certainly useful, it doesn’t show us where, exactly, an error occurred within a function. For this, we need “debug mode.”

Entering debug mode will pause your function and let you examine and interact with the environment of the function itself, rather than the usual global environment. In the function’s runtime environment you’re able to do some useful new things. For example, the environment pane shows the objects that are saved in the function’s local environment, which can be inspected by typing their name into the browser prompt.

You can also run code and view the results that normally only the function would see. Beyond just viewing, you’re able to make changes directly inside debug mode.

You’ll notice that while debugging, the prompt changes to Browse[1]> to let you know that you’re in debug mode. In this state you’ll still have access to all the usual commands, but also some extra ones. These can be used via the toolbar that shows up, or by entering the commands into the console directly:

  • ls() to see what objects are available in the current environment
  • str() and print() to examine these objects
  • n to evaluate the next statement
  • s to step into the next line, if it is a function. From there you can go through each line of the function.
  • where to print a stack trace of all active function calls
  • f to finish the execution of the current loop or function
  • c to leave the debug mode and continue with the regular execution of the function
  • Q to stop debug mode, terminate the function, and return to the R prompt

Debug mode sounds pretty useful, right? Here are some ways we can access it.

browser()

One way to enter debug mode is to insert a browser() statement into your code manually, allowing you to step into debug mode at a pre-specified point.

If you want to use a manual browser() statement on installed code, you can use print(functionName) to print the function code (or you can download the source code locally), and use browser() just like you would on your own code.

While you don’t have to run any special code to quit browser(), do remember to remove the browser() statement from your code once you’re done.
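A minimal sketch (the function is hypothetical, not from the original post):

divide_means <- function(x, y) {
  m1 <- mean(x)
  browser()               # execution pauses here; the prompt changes to Browse[1]>
  m2 <- mean(y)
  m1 / m2
}
divide_means(1:10, 0:5)   # drops into debug mode right after m1 has been computed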

debug()

In contrast to browser(), which can be inserted anywhere into your code, debug() automatically inserts a browser() statement at the beginning of a function.

This can also be achieved by using the “Rerun with Debug” button on the right-hand side of the error message in RStudio, just under “Show Traceback.”

Once you’re done with debug(), you’ll need to call undebug(), otherwise it’ll enter debug mode every time the function is called. An alternative is to use debugonce(). You can check whether a function is in debug mode using isdebugged().
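Continuing the sketch above (divide_means() is the hypothetical function defined earlier):

debug(divide_means)        # enter debug mode at the start of every call
divide_means(1:10, 0:5)
isdebugged(divide_means)   # TRUE while the debugging flag is set
undebug(divide_means)      # remove the flag; debugonce() would apply to a single call only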

Options in RStudio

In addition to debug() and browser(), you can also enter debug mode by setting “editor breakpoints” in RStudio by clicking to the left of the line in RStudio, or by selecting the line and typing shift+F9. Editor breakpoints are denoted by a red circle on the left-hand side, indicating that debug mode will be entered at this line once the source is run.

Editor breakpoints avoid having to modify code with a browser() statement, though it is important to note that there are some instances where editor breakpoints won’t function properly, and they cannot be used conditionally (unlike browser(), which can be used in an if() statement).

You can also have RStudio enter debug mode for you. For example, you can have RStudio stop the execution when an error is raised via Debug (on the top bar) > On Error, changing the setting from “Error Inspector” to “Break in Code.”

To prevent debug mode from opening every time an error occurs, RStudio won’t invoke the debugger unless it looks like some of your own code is on the stack. If this is causing problems for you, navigate to Tools > Global Options > General > Advanced, and unclick “Use debug error handler only when my code contains errors.”

If you just want to invoke debug mode every single time an error occurs, use options(error = browser).

recover()

recover() is similar to browser(), but lets you choose which function in the call stack you want to debug. recover() is not used directly, but rather as an error handler by calling options(error = recover).

Once put in place, when an error is encountered, recover() will pause R, print the call stack (though note that this call stack will be upside-down relative to the order in traceback()), and allow you to select which function’s browser you’d like to enter. This is helpful because you’ll be able to browse any function on the call stack, even before the error occurred, which is important if the root cause is a few calls prior to where the error actually takes place.
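A sketch of how this looks in a session (func1() is the hypothetical function from the traceback example above):

options(error = recover)   # register recover() as the error handler
func1(1)                   # the error now pauses R instead of returning to the prompt
# recover() prints the call stack and asks "Enter a frame number, or 0 to exit";
# choosing a frame opens the browser inside that function's environment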

Once you’ve found the problem, you can switch back to default error handling by removing the option from your .Rprofile file. Note that previously options(error = NULL) was used to accomplish this, but this became illegal in R 3.6.0 and as of September 2019 may cause RStudio to crash the next time you try running certain things, such as .Rmd files.

trace()

The trace() function is slightly more complicated to use, but can be useful when you don’t have access to the source code (for example, with base functions). trace() allows you to insert any code at any location in a function, and the functions are only modified indirectly (without re-sourcing them).

The basic syntax is as follows:

trace(what = yourFunction, tracer = some R expression, at = code line)

In order to figure out which line of code to use, try: as.list(body(yourFunction))

Note that if called with no arguments beyond the function name, trace(yourFunction) just prints a message each time the function is called:

Let’s try it out:

Now our function func3() is an object with tracing code:

If we want to see the tracing code to get a better understanding of what’s going on, we can use body(yourFunction):

At this point, if we call the function func1(), debug mode will open if r is not a number.
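Since the traced function from the original post is not shown in this excerpt, here is a self-contained, hypothetical illustration of the same mechanics:

f <- function(x) {
  y <- x * 2
  log(y)
}
as.list(body(f))                           # step 1 is `{`, step 2 is y <- x * 2, step 3 is log(y)
trace(what = f, tracer = browser, at = 3)  # pause just before log(y) is evaluated
body(f)                                    # the body now shows the inserted tracing code
f(5)                                       # drops into debug mode at step 3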

When you’re done, you can remove tracing from a function using untrace().

And that’s it! These methods may seem a bit confusing at first, but once you get the hang of them, they will be an important tool to help you quickly and efficiently overcome (inevitable) bugs in your code.


To leave a comment for the author, please follow the link and comment on their blog: INWT-Blog-RBloggers.

Learning R: Data Wrangling in Password Hacking Game

Tue, 11/19/2019 - 09:00

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers].


Data Scientists know that about 80% of a Data Science project consists of preparing the data so that it can be analyzed. Building Machine Learning models is the fun part that only comes afterwards!

This process is called Data Wrangling (or Data Munging). If you want to use some Base R data wrangling techniques in a fun game to hack a password, read on!

First install proton (on CRAN), load it and type proton() to start it:

library(proton)
## _____ _ _____ _ _____
## |_ _| |_ ___ | _ |___ ___| |_ ___ ___ | __|___ _____ ___
## | | | | -_| | __| _| . | _| . | | | | | .'| | -_|
## |_| |_|_|___| |__| |_| |___|_| |___|_|_| |_____|__,|_|_|_|___|
##
## Your goal is to find Slawomir Pietraszko's credentials for the Proton server.
## This is the only way for Bit to find the secret plans of Pietraszko's laboratory.
##
## Enter the `proton()` command in order to start the adventure.
##
## Remember that at any time you may add `hint=TRUE` argument to the executed command in order to get additional suggestions.

proton()
## Pietraszko uses a password which is very difficult to guess.
## At first, try to hack an account of a person which is not as cautious as Pietraszko.
##
## But who is the weakest point? Initial investigation suggests that John Insecure doesn't care about security and has an account on the Proton server. He may use a password which is easy to crack.
## Let's attack his account first!
##
## Problem 1: Find the login of John Insecure.
##
## Bit has scrapped 'employees' data (names and logins) from the www web page of Technical University of Warsaw. The data is in the data.frame `employees`.
## Now, your task is to find John Insecure's login.
## When you finally find out what John's login is, use `proton(action = "login", login="XYZ")` command, where XYZ is Insecure's login.

Now, try to solve the problem yourself before reading on…

The best way is always to get some overview over the data. That can be achieved by the str() function (for structure):

str(employees)
## 'data.frame': 541 obs. of 3 variables:
## $ name : Factor w/ 534 levels "Aaron","Adam",..: 272 198 240 442 34 389 433 460 351 16 ...
## $ surname: Factor w/ 541 levels "Abbott","Adams",..: 369 274 310 247 251 78 90 462 130 291 ...
## $ login : chr "j.patrick" "gerald.long" "j.mendoza" "rjoh" ...

We know that John’s surname is “Insecure”, so we use that for subsetting the data frame:

employees[employees$surname == "Insecure", ]
## name surname login
## 217 John Insecure johnins

Ok, we can now use this information to get to the next problem:

proton(action = "login", login = "johnins", hint = TRUE)
## Congratulations! You have found out what John Insecure's login is!
## It is highly likely that he uses some typical password.
## Bit downloaded from the Internet a database with 1000 most commonly used passwords.
## You can find this database in the `top1000passwords` vector.
##
## Problem 2: Find John Insecure's password.
##
## Use `proton(action = "login", login="XYZ", password="ABC")` command in order to log into the Proton server with the given credentials.
## If the password is correct, you will get the following message:
## `Success! User is logged in!`.
## Otherwise you will get:
## `Password or login is incorrect!`.
##
## HINT:
## Use the brute force method.
## By using a loop, try to log in with subsequent passwords from `top1000passwords` vector as long as you receive:
## `Success! User is logged in!`.

Now, try to solve the problem yourself before reading on…

Again, let us gain an overview:

str(top1000passwords)
## chr [1:1000] "123456" "password" "12345678" "qwerty" "123456789" ...

Ok, Brute Force means that we simply try all possibilities to find the right one (a technique that is also used by real hackers… which is the reason why most modern systems only accept a few (e.g. 3) trials before hitting a time limit or deactivating an account altogether). Here, we can use a for loop for that:

for (pw in top1000passwords) {
  proton(action = "login", login = "johnins", password = pw, hint = TRUE)
}
## Well done! This is the right password!
## Bit used John Insecure's account in order to log into the Proton server.
## It turns out that John has access to server logs.
## Now, Bit wants to check from which workstation Pietraszko is frequently logging into the Proton server. Bit hopes that there will be some useful data.
##
## Logs are in the `logs` dataset.
## Consecutive columns contain information such as: who, when and from which computer logged into Proton.
##
## Problem 3: Check from which server Pietraszko logs into the Proton server most often.
##
## Use `proton(action = "server", host="XYZ")` command in order to learn more about what can be found on the XYZ server.
## The biggest chance to find something interesting is to find a server from which Pietraszko logs in the most often.
##
##
## HINT:
## In order to get to know from which server Pietraszko is logging the most often one may:
## 1. Use `filter` function to choose only Pietraszko's logs,
## 2. Use `group_by` and `summarise` to count the number of Pietraszko's logs into separate servers,
## 3. Use `arrange` function to sort servers' list by the frequency of logs.
##
## Use `employees` database in order to check what Pietraszko's login is.

There are always different possibilities to achieve a goal and you can, of course, use any of the functions given under “HINT”… yet one of my favorite functions is the table() function and I encourage you to try it out here…

Ok, the following first steps shouldn’t come as a surprise by now:

str(logs)
## 'data.frame': 59366 obs. of 3 variables:
## $ login: Factor w/ 541 levels "j.patrick","gerald.long",..: 172 45 42 196 254 390 169 397 469 361 ...
## $ host : Factor w/ 312 levels "193.0.96.13.0",..: 35 124 250 146 157 227 230 69 239 134 ...
## $ data : POSIXct, format: "2014-09-01 03:01:12" "2014-09-01 03:01:51" ...

employees[employees$surname == "Pietraszko", ]
## name surname login
## 477 Slawomir Pietraszko slap

As said above we will use the table() function to get the frequencies of server logins:

log <- logs[logs$login == "slap", ]
table(as.character(log$host))
##
## 193.0.96.13.20 193.0.96.13.38 194.29.178.108 194.29.178.155  194.29.178.16
##             33              1             74              6            112

We now use the most frequently used host to get to the next and last problem:

proton(action = "server", host = "194.29.178.16", hint = TRUE)
## It turns out that Pietraszko often uses the public workstation 194.29.178.16.
## What a carelessness.
##
## Bit infiltrated this workstation easily. He downloaded `bash_history` file which contains a list of all commands that were entered into the server's console.
## The chances are that some time ago Pietraszko typed a password into the console by mistake thinking that he was logging into the Proton server.
##
## Problem 4: Find the Pietraszko's password.
##
## In the `bash_history` dataset you will find all commands and parameters which have ever been entered.
## Try to extract from this dataset only commands (only strings before space) and check whether one of them looks like a password.
##
##
## HINT:
## Commands and parameters are separated by a space. In order to extract only names of commands from each line, you can use `gsub` or `strsplit` function.
## After having all commands extracted you should check how often each command is used.
## Perhaps it will turn out that one of typed in commands look like a password?
##
## If you see something which looks like a password, you shall use `proton(action = "login", login="XYZ", password="ABC")` command to log into the Proton server with Pietraszko credentials.

After checking the structure again, we could use strsplit() (for string split) in the following way:

str(bash_history) ## chr [1:19913] "mcedit /var/log/lighttpd/*" "pwd" ... table(unlist(strsplit(bash_history, " "))) ## ## /bin /boot ## 338 338 ## /cdrom /dev ## 338 338 ## /etc /home ## 338 338 ## /lib /lost+found ## 338 338 ## /media /mnt ## 338 338 ## /opt /proc ## 338 338 ## /root /run ## 338 338 ## /sbin /selinux ## 338 338 ## /srv /sys ## 338 338 ## /tmp /usr ## 338 338 ## /var /var/log/apache2/* ## 338 219 ## /var/log/apport.log /var/log/auth.log ## 219 219 ## /var/log/boot /var/log/daemon.log ## 219 219 ## /var/log/debug /var/log/dmesg ## 219 219 ## /var/log/dpkg.log /var/log/faillog ## 219 219 ## /var/log/fsck/* /var/log/kern.log ## 219 219 ## /var/log/lighttpd/* /var/log/lpr.log ## 219 219 ## /var/log/mail.* /var/log/messages ## 219 219 ## /var/log/mysql.* /var/log/user.log ## 219 219 ## /var/log/xorg.0.log ~/.bash_history ## 219 133 ## ~/.bash_login ~/.bash_logout ## 133 133 ## ~/.bash_profile ~/.bashrc ## 133 133 ## ~/.emacs ~/.exrc ## 133 133 ## ~/.forward ~/.fvwm2rc ## 133 133 ## ~/.fvwmrc ~/.gtkrc ## 133 133 ## ~/.hushlogin ~/.kderc ## 133 133 ## ~/.mail.rc ~/.muttrc ## 133 133 ## ~/.ncftp/ ~/.netrc ## 133 133 ## ~/.pinerc ~/.profile ## 133 133 ## ~/.rhosts ~/.rpmrc ## 133 133 ## ~/.signature ~/.twmrc ## 133 133 ## ~/.vimrc ~/.Xauthority ## 133 133 ## ~/.Xdefaults-hostname ~/.Xdefaults, ## 133 133 ## ~/.xinitrc ~/.Xmodmap ## 133 133 ## ~/.xmodmaprc ~/.Xresources ## 133 133 ## ~/.xserverrc ~/mbox ## 133 133 ## ~/News/Sent-Message-IDs alert_actions.conf ## 133 74 ## app.conf audit.conf ## 74 74 ## authentication.conf authorize.conf ## 74 74 ## aux ax ## 100 100 ## cat cd ## 4341 2520 ## collections.conf commands.conf ## 74 74 ## cp crawl.conf ## 1176 74 ## datamodels.conf default.meta.conf ## 74 74 ## deploymentclient.conf DHbb7QXppuHnaXGN ## 74 1 ## distsearch.conf event_renderers.conf ## 74 74 ## eventdiscoverer.conf eventtypes.conf ## 74 74 ## fields.conf httpd ## 74 20 ## indexes.conf inputs.conf ## 74 74 ## instance.cfg.conf limits.conf ## 74 74 ## literals.conf ls ## 74 1806 ## macros.conf mc ## 74 112 ## mcedit multikv.conf ## 1944 74 ## outputs.conf pdf_server.conf ## 74 74 ## procmon-filters.conf props.conf ## 74 74 ## ps pubsub.conf ## 420 74 ## pwd restmap.conf ## 80 74 ## rm savedsearches.conf ## 1596 74 ## searchbnf.conf segmenters.conf ## 74 74 ## server.conf serverclass.conf ## 74 74 ## serverclass.seed.xml.conf service ## 74 20 ## source-classifier.conf sourcetypes.conf ## 74 74 ## start tags.conf ## 20 74 ## tenants.conf times.conf ## 74 74 ## top transactiontypes.conf ## 150 74 ## transforms.conf user-seed.conf ## 74 74 ## vi viewstates.conf ## 2666 74 ## vim web.conf ## 2991 74 ## whoiam wmi.conf ## 90 74 ## workflow_actions.conf ## 74

When you look through this list you will find something that looks suspiciously like a password: “DHbb7QXppuHnaXGN”. Let’s try this to finally hack into the system:

proton(action = "login", login = "slap", password = "DHbb7QXppuHnaXGN")
## Congratulations!
##
## You have cracked Pietraszko's password!
## Secret plans of his lab are now in your hands.
## What is in this mysterious lab?
## You may read about it in the `Pietraszko's cave` story which is available at http://biecek.pl/BetaBit/Warsaw
##
## Next adventure of Beta and Bit will be available soon.
## proton.login.pass
## "Success! User is logged in!"

This was fun, wasn’t it! And hopefully, you learned some helpful data wrangling techniques along the way…


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.

Hangman game with R

Tue, 11/19/2019 - 07:57

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers].

Hangman is a classic word game in which you need to guess the letters in a word, so that you can guess the word itself before running out of tries (lives).

Upon running out of tries, you are hanged!

The game can be played in RStudio, where the user inputs new letters in the console, and the picture is drawn (using the ggplot2 library). The picture allows for 7 false tries, so it is drawn in 7 steps.

The diagram is created using simple X, Y coordinates with groups for determining the steps:

level1 <- data.frame(x=c(1,2,3,4,5,6,7,8), y=c(1,1,1,1,1,1,1,1), group=c(1,1,1,1,1,1,1,1))
level2 <- data.frame(x=c(4,4,4,4,4), y=c(1,2,3,4,5), group=c(2,2,2,2,2))
level3 <- data.frame(x=c(4,5,6), y=c(5,5,5), group=c(3,3,3))
level4 <- data.frame(x=c(6,6), y=c(5,4), group=c(4,4))
level5 <- drawHead(c(6,3.5), 1, 10, 5)
level6 <- data.frame(x=c(6,6,5.8,6.2), y=c(3,1.5,1.5,1.5), group=c(6,6,6,6))
level7 <- data.frame(x=c(5.5,6,6.5), y=c(2,2.5,2), group=c(7,7,7))

levels <- rbind(level1, level2, level3, level4, level5, level6, level7)

The drawing itself is created by a simple function using the ggplot2 library:

drawMan <- function(st_napak) {
  ggplot(levels[which(levels$group <= st_napak), ], aes(x=x, y=y, group=group)) +
    geom_path(size=2.5) +
    theme_void()
}

The function draws the hanging man in 7 steps
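For example (a usage sketch; st_napak is the number of wrong guesses so far, and levels is the data frame built above):

drawMan(3)   # draws the gallows up to the third stage
drawMan(7)   # the full figure – game over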

The rest of the logic is fairly simple: continue until you find the correct word, or until you are hanged. A section of the code:

beseda <- readline(prompt="Word: ")
iskana_beseda <- replicate(nchar(beseda), '_')

while (active == TRUE) {
  if (i == 0) {
    writeLines(paste(iskana_beseda, collapse = " "))
  }
  crka <- readline(prompt="Enter Letter: ")
  izbor <- rbind(izbor, crka)
  #iskana_beseda
  if (grepl(crka, beseda) == TRUE) {
    cilj <- rbind(cilj, crka)
    iskana_beseda <- zamenjaj2(beseda, crka)
    #print(zamenjaj2(beseda, crka))
    print(paste("Yay!", "Try N:", i+1, "Wrong letters: {", (toString(paste0(cilj_n, sep=","))), "}"))
    if (as.character(paste(iskana_beseda, collapse = "")) == beseda) {
      active <- FALSE
      print("Bravo, win!")
      break
    }
{code continues.....}

… and the rest of the code is here –>> GitHub.

When playing, this is how it looks in my RStudio.

 

As always, complete code is available at Github.

Happy R-hanging


To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

ttdo 0.0.4: Extension

Tue, 11/19/2019 - 02:12

[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers].

A first update release to the still very new (and still very small) ttdo package arrived on CRAN today. Introduced about two months ago in September, the ttdo package extends the most excellent (and very minimal / zero depends) unit testing package tinytest by Mark van der Loo with the very clever and well-done diffobj package by Brodie Gaslam.

Just as the package's creation was motivated by our needs in teaching STAT 430 at Illinois, so is the extension code in this release, which generalizes how we extend the tinytest test predicates with additional arguments that help in the use of the PrairieLearn system (developed at Illinois) to provide tests, quizzes or homework. This release is mostly the work of Alton, who is now also a coauthor.

The NEWS entries follow.

Changes in ttdo version 0.0.4 (2019-11-18)
  • Generalize tinytest extensions with additional arguments for test predicates (Alton in #2).

  • Use Travis CI for continuous integration off GitHub (Dirk).

Please use the GitHub repo and its issues for any questions.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Customer Churn Modeling using Machine Learning with parsnip

Mon, 11/18/2019 - 06:44

[This article was first published on business-science.io, and kindly contributed to R-bloggers].

This article comes from Diego Usai, a student in Business Science University. Diego has completed both 101 (Data Science Foundations) and 201 (Advanced Machine Learning & Business Consulting) courses. Diego shows off his progress in this Customer Churn Tutorial using Machine Learning with parsnip. Diego originally posted the article on his personal website, diegousai.io, which has been reproduced on the Business Science blog here. Enjoy!

R Packages Covered:

  • parsnip – NEW Machine Learning API in R, similar to scikit learn in Python
  • rsample – 10-Fold Cross Validation
  • recipes – Data preprocessing
  • yardstick – Model scoring and metrics
  • skimr – Quickly skim data
  • ranger – Random Forest Library used for churn modeling
Churn Modeling Using Machine Learning

by Diego Usai, Customer Insights Consultant

Recently I have completed the online course Business Analysis With R focused on applied data and business science with R, which introduced me to a couple of new modelling concepts and approaches. One that especially captured my attention is parsnip and its attempt to implement a unified modelling and analysis interface (similar to python’s scikit-learn) to seamlessly access several modelling platforms in R.

parsnip is the brainchild of RStudio’s Max Kuhn (of caret fame) and Davis Vaughan and forms part of tidymodels, a growing ensemble of tools to explore and iterate modelling tasks that share a common philosophy (and a few libraries) with the tidyverse.

Although there are a number of packages at different stages in their development, I have decided to take tidymodels “for a spin”, and create and execute a “tidy” modelling workflow to tackle a classification problem. My aim is to show how easy it is to fit a simple logistic regression in R’s glm and quickly switch to a cross-validated random forest using the ranger engine by changing only a few lines of code.

For this post in particular I’m focusing on four different libraries from the tidymodels suite:

  • parsnip for machine learning and modeling
  • rsample for data sampling and 10-fold cross-validation
  • recipes for data preprocessing
  • yardstick for model assessment.

Note that the focus is on modelling workflow and libraries interaction. For that reason, I am keeping data exploration and feature engineering to a minimum. Data exploration, data wrangling, visualization, and business understanding are CRITICAL to your ability to perform machine learning. If you want to learn the end-to-end process for completing business projects with data science with H2O and parsnip and Shiny web applications using AWS, then I recommend Business Science’s 4-Course R-Track System – One complete system to go from beginner to expert in 6-months.

My Workflow

Here’s a diagram of the workflow I used for this analysis:

  1. Start with raw data in CSV format

  2. Use skimr to quickly understand the features

  3. Use rsample to split into training/testing sets

  4. Use recipes to create data preprocessing pipeline

  5. Use parsnip, rsample and yardstick to build models and assess machine learning performance



My Code Workflow for Machine Learning with parsnip

Tutorial – Churn Classification using Machine Learning

This is an intermediate tutorial to expose business analysts and data scientists to churn modeling with the new parsnip Machine Learning API.

1.0 Setup and Data

First, I load the packages I need for this analysis.

library(tidyverse)  # Loads dplyr, ggplot2, purrr, and other useful packages
library(tidymodels) # Loads parsnip, rsample, recipes, yardstick
library(skimr)      # Quickly get a sense of data
library(knitr)      # Pretty HTML Tables

For this project I am using the Telco Customer Churn from IBM Watson Analytics, one of IBM Analytics Communities. The data contains 7,043 rows, each representing a customer, and 21 columns for the potential predictors, providing information to forecast customer behaviour and help develop focused customer retention programmes.

Churn is the Dependent Variable and shows the customers who left within the last month. The dataset also includes details on the Services that each customer has signed up for, along with Customer Account and Demographic information.

Next, we read in the data (I have hosted on my GitHub repo for this project).

telco <- read_csv("https://raw.githubusercontent.com/DiegoUsaiUK/Classification_Churn_with_Parsnip/master/00_Data/WA_Fn-UseC_-Telco-Customer-Churn.csv")

telco %>% head() %>% kable()

customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
9305-CDSKC Female 0 No No 8 Yes Yes Fiber optic No No Yes No Yes Yes Month-to-month Yes Electronic check 99.65 820.50 Yes

2.0 Skim the Data

We can get a quick sense of the data using the skim() function from the skimr package.

telco %>% skim()
## Skim summary statistics
## n obs: 7043
## n variables: 21
##
## ── Variable type:character ──────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## Churn 0 7043 7043 2 3 0 2
## Contract 0 7043 7043 8 14 0 3
## customerID 0 7043 7043 10 10 0 7043
## Dependents 0 7043 7043 2 3 0 2
## DeviceProtection 0 7043 7043 2 19 0 3
## gender 0 7043 7043 4 6 0 2
## InternetService 0 7043 7043 2 11 0 3
## MultipleLines 0 7043 7043 2 16 0 3
## OnlineBackup 0 7043 7043 2 19 0 3
## OnlineSecurity 0 7043 7043 2 19 0 3
## PaperlessBilling 0 7043 7043 2 3 0 2
## Partner 0 7043 7043 2 3 0 2
## PaymentMethod 0 7043 7043 12 25 0 4
## PhoneService 0 7043 7043 2 3 0 2
## StreamingMovies 0 7043 7043 2 19 0 3
## StreamingTV 0 7043 7043 2 19 0 3
## TechSupport 0 7043 7043 2 19 0 3
##
## ── Variable type:numeric ────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## MonthlyCharges 0 7043 7043 64.76 30.09 18.25 35.5 70.35 89.85 118.75 ▇▁▃▂▆▅▅▂
## SeniorCitizen 0 7043 7043 0.16 0.37 0 0 0 0 1 ▇▁▁▁▁▁▁▂
## tenure 0 7043 7043 32.37 24.56 0 9 29 55 72 ▇▃▃▂▂▃▃▅
## TotalCharges 11 7032 7043 2283.3 2266.77 18.8 401.45 1397.47 3794.74 8684.8 ▇▃▂▂▁▁▁▁

There are a couple of things to notice here:

  • customerID is a unique identifier for each row. As such it has no descriptive or predictive power and it needs to be removed.

  • Given the relative small number of missing values in TotalCharges (only 11 of them) I am dropping them from the dataset.

telco <- telco %>%
  select(-customerID) %>%
  drop_na()

3.0 Tidymodels Workflow – Generalized Linear Model (Baseline)

To show the basic steps in the tidymodels framework I am fitting and evaluating a simple logistic regression model as a baseline.

3.1 Train/Test Split

rsample provides a streamlined way to create a randomised training and test split of the original data.

set.seed(seed = 1972)

train_test_split <- rsample::initial_split(
  data = telco,
  prop = 0.80
)
train_test_split
## <5626/1406/7032>

Of the 7,043 total customers, 5,626 have been assigned to the training set and 1,406 to the test set. I save them as train_tbl and test_tbl.

train_tbl <- train_test_split %>% training()
test_tbl  <- train_test_split %>% testing()

3.2 Prepare

The recipes package uses a cooking metaphor to handle all the data preprocessing, like missing values imputation, removing predictors, centring and scaling, one-hot-encoding, and more.

First, I create a recipe where I define the transformations I want to apply to my data. In this case I create a simple recipe to change all character variables to factors.

Then, I “prep the recipe” by mixing the ingredients with prep. Here I have included the prep bit in the recipe function for brevity.

recipe_simple <- function(dataset) {
  recipe(Churn ~ ., data = dataset) %>%
    step_string2factor(all_nominal(), -all_outcomes()) %>%
    prep(data = dataset)
}

Note – In order to avoid Data Leakage (e.g: transferring information from the train set into the test set), data should be “prepped” using the train_tbl only.

recipe_prepped <- recipe_simple(dataset = train_tbl)

Finally, to continue with the cooking metaphor, I “bake the recipe” to apply all preprocessing to the data sets.

train_baked <- bake(recipe_prepped, new_data = train_tbl)
test_baked  <- bake(recipe_prepped, new_data = test_tbl)

3.3 Machine Learning and Performance

Fit the Model

parsnip is a recent addition to the tidymodels suite and is probably the one I like best. This package offers a unified API that allows access to several machine learning packages without the need to learn the syntax of each individual one.

With 3 simple steps you can:

  1. Set the type of model you want to fit (here is a logistic regression) and its mode (classification)

  2. Decide which computational engine to use (glm in this case)

  3. Spell out the exact model specification to fit (I’m using all variables here) and what data to use (the baked train dataset)

logistic_glm <- logistic_reg(mode = "classification") %>%
  set_engine("glm") %>%
  fit(Churn ~ ., data = train_baked)

If you want to use another engine, you can simply switch the set_engine argument (for logistic regression you can choose from glm, glmnet, stan, spark, and keras) and parsnip will take care of changing everything else for you behind the scenes.
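As a sketch (not from the original article; the glmnet engine needs the glmnet package installed, and the penalty and mixture values below are placeholders), the same specification with a regularised engine could look like this:

# Hedged sketch: same model specification, different engine
logistic_glmnet <- logistic_reg(mode = "classification", penalty = 0.001, mixture = 1) %>%
  set_engine("glmnet") %>%
  fit(Churn ~ ., data = train_baked)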

Assess Performance

predictions_glm <- logistic_glm %>%
  predict(new_data = test_baked) %>%
  bind_cols(test_baked %>% select(Churn))

predictions_glm %>% head() %>% kable()

.pred_class Churn
Yes No
No No
No No
No No
No No
No No

There are several metrics that can be used to investigate the performance of a classification model but for simplicity I’m only focusing on a selection of them: accuracy, precision, recall and F1_Score.

All of these measures (and many more) can be derived by the Confusion Matrix, a table used to describe the performance of a classification model on a set of test data for which the true values are known.

In and of itself, the confusion matrix is a relatively easy concept to get your head around as is shows the number of false positives, false negatives, true positives, and true negatives. However some of the measures that are derived from it may take some reasoning with to fully understand their meaning and use.

predictions_glm %>%
  conf_mat(Churn, .pred_class) %>%
  pluck(1) %>%
  as_tibble() %>%
  # Visualize with ggplot
  ggplot(aes(Prediction, Truth, alpha = n)) +
  geom_tile(show.legend = FALSE) +
  geom_text(aes(label = n), colour = "white", alpha = 1, size = 8)

Accuracy

The model’s Accuracy is the fraction of predictions the model got right and can be easily calculated by passing the predictions_glm to the metrics function. However, accuracy is not a very reliable metric as it will provide misleading results if the data set is unbalanced.

With only basic data manipulation and feature engineering the simple logistic model has achieved 80% accuracy.

predictions_glm %>%
  metrics(Churn, .pred_class) %>%
  select(-.estimator) %>%
  filter(.metric == "accuracy") %>%
  kable()

.metric .estimate
accuracy 0.8058321

Precision and Recall

Precision shows how sensitive models are to False Positives (i.e. predicting a customer is leaving when he-she is actually staying) whereas Recall looks at how sensitive models are to False Negatives (i.e. forecasting that a customer is staying whilst he-she is in fact leaving).

These are very relevant business metrics because organisations are particularly interested in accurately predicting which customers are truly at risk of leaving so that they can target them with retention strategies. At the same time they want to minimising efforts of retaining customers incorrectly classified as leaving who are instead staying.

tibble(
  "precision" = precision(predictions_glm, Churn, .pred_class) %>% select(.estimate),
  "recall"    = recall(predictions_glm, Churn, .pred_class) %>% select(.estimate)
) %>%
  unnest(cols = c(precision, recall)) %>%
  kable()

precision recall
0.8466368 0.9024857

F1 Score

Another popular performance assessment metric is the F1 Score, which is the harmonic average of the precision and recall. An F1 score reaches its best value at 1 with perfect precision and recall.

predictions_glm %>%
  f_meas(Churn, .pred_class) %>%
  select(-.estimator) %>%
  kable()

.metric .estimate
f_meas 0.8736696

4.0 Random Forest – Machine Learning Modeling and Cross Validation

This is where the real beauty of tidymodels comes into play. Now I can use this tidy modelling framework to fit a Random Forest model with the ranger engine.

4.1 Cross Validation – 10-Fold

To further refine the model’s predictive power, I am implementing a 10-fold cross validation using vfold_cv from rsample, which splits again the initial training data.

cross_val_tbl <- vfold_cv(train_tbl, v = 10)
cross_val_tbl
## # 10-fold cross-validation
## # A tibble: 10 x 2
##    splits id
##
##  1 Fold01
##  2 Fold02
##  3 Fold03
##  4 Fold04
##  5 Fold05
##  6 Fold06
##  7 Fold07
##  8 Fold08
##  9 Fold09
## 10 Fold10

If we take a further look, we should recognise the 5,626 number, which is the total number of observations in the initial train_tbl. In each round, 563 observations will in turn be withheld from estimation and used to validate the model for that fold.

cross_val_tbl %>% pluck("splits", 1)
## <5063/563/5626>

To avoid confusion and to distinguish the initial train/test splits from those used for cross validation, the author of rsample, Max Kuhn, has coined two new terms: the analysis and the assessment sets. The former is the portion of the train data used to recursively estimate the model, while the latter is the portion used to validate each estimate.

4.2 Machine Learning Random Forest

Switching to another model could not be simpler! All I need to do is change the model type to rand_forest, add its hyper-parameters, change the set_engine argument to ranger, and I’m ready to go.

I’m bundling all the steps into a function that estimates the model across all folds, runs predictions and returns a convenient tibble with all the results. I need to add an extra step before the recipe “prepping” to map the cross validation splits to the analysis() and assessment() functions. This will guide the iterations through the 10 folds.

rf_fun <- function(split, id, try, tree) {
  analysis_set <- split %>% analysis()
  analysis_prepped <- analysis_set %>% recipe_simple()
  analysis_baked <- analysis_prepped %>% bake(new_data = analysis_set)
  model_rf <- rand_forest(
      mode = "classification",
      mtry = try,
      trees = tree
    ) %>%
    set_engine("ranger", importance = "impurity") %>%
    fit(Churn ~ ., data = analysis_baked)
  assessment_set <- split %>% assessment()
  assessment_prepped <- assessment_set %>% recipe_simple()
  assessment_baked <- assessment_prepped %>% bake(new_data = assessment_set)
  tibble(
    "id" = id,
    "truth" = assessment_baked$Churn,
    "prediction" = model_rf %>% predict(new_data = assessment_baked) %>% unlist()
  )
}

Modeling with purrr

I iteratively apply the random forest modeling function, rf_fun(), to each of the 10 cross validation folds using purrr.

pred_rf <- map2_df(
  .x = cross_val_tbl$splits,
  .y = cross_val_tbl$id,
  ~ rf_fun(split = .x, id = .y, try = 3, tree = 200)
)
head(pred_rf)
## # A tibble: 6 x 3
##   id     truth prediction
## 1 Fold01 Yes   Yes
## 2 Fold01 Yes   No
## 3 Fold01 Yes   Yes
## 4 Fold01 No    No
## 5 Fold01 No    No
## 6 Fold01 Yes   Yes

Assess Performance

I’ve found that yardstick has a very handy confusion matrix summary() function, which returns 13 different confusion matrix metrics, but in this case I only want to see the four I used for the glm model.

pred_rf %>%
  conf_mat(truth, prediction) %>%
  summary() %>%
  select(-.estimator) %>%
  filter(.metric %in% c("accuracy", "precision", "recall", "f_meas")) %>%
  kable()

.metric     .estimate
accuracy    0.7975471
precision   0.8328118
recall      0.9050279
f_meas      0.8674194

The random forest model performs on par with the simple logistic regression. Given the very basic feature engineering I’ve carried out, there is certainly room to improve the model further, but that is beyond the scope of this post.

Parting Thoughts

One of the great advantages of tidymodels is the flexibility and ease of access to every phase of the analysis workflow. Creating the modelling pipeline is a breeze, and you can easily re-use the initial framework by changing the model type with parsnip and the data pre-processing with recipes; in no time you’re ready to check your new model’s performance with yardstick.

In any analysis you would typically audit several models and parsnip frees you up from having to learn the unique syntax of every modelling engine so that you can focus on finding the best solution for the problem at hand.

If you would like to learn how to apply Data Science to Business Problems, take the program that I chose to build my skills. You will learn tools like parsnip and H2O for machine learning and Shiny for web applications, and many more critical tools (tidyverse, recipes, and more!) for applying data science to business problems. For a limited time you can get 15% OFF the 4-Course R-Track System.

Code Repository

The full R code can be found on my GitHub profile.

Other Student Articles You Might Enjoy

Here are more Student Success Tutorials on data science for business and building shiny applications.

October 2019: “Top 40” New R Packages

Mon, 11/18/2019 - 01:00

[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Two hundred twenty-three new packages made it to CRAN in October. Here are my “Top 40” picks in ten categories: Computational Methods, Data, Genomics, Machine Learning, Mathematics, Medicine, Pharmacology, Statistics, Utilities, and Visualization.

Computational Methods

admmDensestSubmatrix v0.1.0: Implements a method to identify the densest sub-matrix in a given or sampled binary matrix. See Bombina et al. (2019) for the technical details and the vignette for examples.

mbend v1.2.3: Provides functions to “bend” non-positive-definite (symmetric) matrices to positive-definite matrices using weighted and unweighted methods. See Jorjani et al. (2003) and Schaeffer (2010) for background and the vignette for an Introduction.

Data

cqcr v0.1.2: Provides access to data from the Care Quality Commission, the health and adult social care regulator for England. Data available under the Open Government License include information on service providers, hospitals, care homes, and medical clinic locations, as well as ratings and inspection reports.

fpp3 v0.1: Contains all data sets required for the examples and exercises in the book Forecasting: principles and practice by Rob J Hyndman and George Athanasopoulos.

fsbrain v0.0.2: Provides high-level access to FreeSurfer neuroimaging data on the level of subjects and groups. There is a vignette.

opendatatoronto v0.1.0: Provides access to data from the City of Toronto Open Data Portal. There is an Introduction and vignettes on Geospatial Data, Zip Resources, Retrieving Multiple Resources, and Retrieving XLS/XLSX Resources.

povcalnetR v0.1.0: Provides an interface to Povcalnet, a computational tool that allows users to estimate poverty rates for regions, sets of countries or individual countries, over time, and at any poverty line that is managed by the World Bank’s development economics division. There is a Getting Started Guide, and vignettes on Examples and Advanced Usage.

Genomics

dynwrap v1.1.4: Provides functions to infer trajectories from single-cell data, represent them into a common format, and adapt them. See Saelens et al. (2019) for background. There are vignettes on Containers, Scripts, Adding Methods, and Wrapping Trajectories.

phyr v1.0.2: Provides a collection of functions to do model-based phylogenetic analysis, including functions to calculate community phylogenetic diversity, to estimate correlations among functional traits while accounting for phylogenetic relationships, and to fit phylogenetic generalized linear mixed models. The Bayesian phylogenetic generalized linear mixed models are fitted with the INLA package. There is a Performance Benchmark and vignettes on Usage and Plotting.

Machine Learning

cwbtools v0.1.0: Provides tools to create, modify, and manage Corpus Workbench (CWB) Corpora. See Evert and Hardie (2011) for background, and the vignettes Introducing cwbtools and Europal for information on the package.

discrim v0.0.1: Provides bindings for additional classification models for use with the parsnip package, including linear discriminant analysis (See Fisher (1936).), regularized discriminant analysis (See Friedman (1989).), and flexible discriminant analysis (See Hastie et al. (1994).), as well as naive Bayes classifiers (See Hand and Yu (2007).).

forecastML v0.5.0: Provides functions for forecasting time series using machine learning models and an approach inspired by Bergmeir, Hyndman, and Koo (2018). There is an Overview, and vignettes on Customizing Wrapper Functions, Multiple Time Series, and Custom Feature Lags.

interpret v0.1.23: Implements the Explainable Boosting Machine (EBM) framework for machine learning interpretability. See Caruana et al. (2015), for details, and look here for help with the package.

mlr3pipelines v0.1.1: Implements a dataflow programming toolkit that enriches mlr3 with a diverse set of pipelining operators that can be composed into graphs. Operations exist for data preprocessing, model fitting, and ensemble learning. There is an Introduction and a vignette on Comparing Frameworks.

postDoubleR v1.4.12: Implements the double/debiased machine learning algorithm described in Chernozhukov et al. (2017).

SLEMI v1.0: Implements the method described in Jetka et al. (2019) for estimating mutual information and channel capacity from experimental data by classification procedures (logistic regression). The vignette describes how to use the package.

tfprobability v0.0.2: Provides an interface to TensorFlow Probability, a Python library built on TensorFlow that makes it easy to combine probabilistic models and deep learning on modern hardware including TPUs and GPUs. There are vignettes on Dynamic Linear Models, Multi-level Modeling with Hamiltonian Monte Carlo, and Uncertainty Estimates.

Mathematics

Ryacas0 v0.4.2: Provides an interface to the yacas computer algebra system. There is a Getting Started Guide and vignettes on Ryacas functionality, a Naive Bayes Model, a State Space Model, and Matrix and Vector Objects.

silicate v0.2.0: Provides functions to generate common forms for complex hierarchical and relational data structures inspired by the simplicial complex. There is a vignette.

Medicine

diyar v0.0.2: Implements multistage record linkage and case definition for epidemiological analyses. There are vignettes on Case Definitions and Multistage Deterministic Linkage.

ushr v0.1.0: Presents an analysis of longitudinal data of HIV decline in patients on antiretroviral therapy using the canonical biphasic exponential decay model described in Perelson et al. (1997) and Wu and Ding (1999), and includes options to calculate the time to viral suppression. The vignette walks through the analysis.

Pharmacology

chlorpromazineR v0.1.2: Provides functions to convert doses of antipsychotic medications to chlorpromazine-equivalent doses using conversion keys generated from Gardner et. al (2010) and Leucht et al. (201). See the vignette.

ubiquity v1.0.0: Implements a complete work flow for the analysis of pharmacokinetic pharmacodynamic (PKPD), physiologically-based pharmacokinetic (PBPK) and systems pharmacology models including: creation of ODE-based models, pooled parameter estimation, simulations for clinical trial design and modeling assays and deployment with Shiny and reporting with PowerPoint. There are vignettes on Deployment, Estimation, Language, NCA, Reporting, Simulation, and Titration.

Statistics

DPQ v0.3-5: Provides the computations for approximations and alternatives for the density, cumulative density and quantile functions for R’s probability distributions. This package from researchers working with R-core is intended primarily for those working to improve R’s beta, gamma and related distributions. See the vignettes Non-central Chi-Squared Probabilities – Algorithms in R and Computing Beta for Large Arguments.

hypr v0.1.3: Provides functions to translate between experimental null hypotheses, hypothesis matrices, and contrast matrices as used in linear regression models based on the method described in Schad et al. (2019). There is an Introduction and vignettes on Contrasts and Linear Regression.

HTLR v0.4-1: Implements Bayesian multinomial logistic regression based on heavy-tailed (hyper-LASSO, non-convex) priors for high-dimensional feature selection. Li and Yao (2018) provides a detailed description of the method, and the vignette introduces the package.

meteorits v0.1.0: Provides a unified mixture-of-experts (ME) modeling and estimation framework to model, cluster and classify heterogeneous data in many complex situations where the data are distributed according to non-normal, possibly skewed distributions. See Chamroukhi et al. (2009), Chamroukhi (2010). Chamroukhi (2015), Chamroukhi (2016), and Chamroukhi (2017) for background, and the vignettes NMoE, SNMoE, StMoE and tMoE.

mHHMbayes v0.1.1: Implements multilevel (mixed or random effects) hidden Markov model using Bayesian estimation in R. For background see Rabiner (1989) and de Haan-Rietdijk et al. (2017). There is a Tutorial and a vignette on Estimation.

nhm v0.1.0: Provides functions to fit non-homogeneous Markov multistate models and misclassification-type hidden Markov models in continuous time to intermittently observed data. See Titman (2011) for background and the User Guide for package details.

mniw v1.0: Implements the Matrix-Normal Inverse-Wishart (MNIW) distribution, as well as the Matrix-Normal, Matrix-T, Wishart, and Inverse-Wishart distributions. The vignette does the math.

PosteriorBootstrap v0.1.0: Implements a non-parametric statistical model using a parallelized Monte Carlo sampling scheme that allows non-parametric inference to be regularized for small sample sizes. The method is described in full in Lyddon et al. (2018). There is a vignette.

spBFA v1.0: Implements functions for spatial Bayesian non-parametric factor analysis model with inference. See Berchuck et al. (2019) for the technical background and the vignette for package details.

VARshrink v0.3.1: Provides functions that integrate shrinkage estimation with vector autoregressive models including nonparametric, parametric, and semiparametric methods such as the multivariate ridge regression (See Golub et al. (1979).), a James-Stein type nonparametric shrinkage method (See Opgen-Rhein and Strimmer (2007).), and Bayesian estimation methods as in Lee et al. (2016) and Ni and Sun (2005). There is a vignette.

Utilities

geospark v0.2.1: Provides simple features bindings to GeoSpark extending the sparklyr package to bring geocomputing to Spark distributed systems. See README for more information.

labelmachine v1.0.0: Provides functions to assign meaningful labels to data frame columns, and to manage label assignment rules in yaml files, making it easy to use the same labels in multiple projects. There is a Getting Started Guide and vignettes on Altering lama-dictionaries, Creating lama-dictionaries, and Translating Variables.

renv v0.8-3: Implements a dependency management toolkit that enables creating and managing project-local R libraries, saving the state of these libraries and later restoring them. There is an Introduction and a series of vignettes: Continuous Integration, Collaborating with renv, Using renv with Docker, Frequently Asked Questions, Local Sources, Lockfiles, and Using Python with renv.
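For a sense of what that workflow looks like in practice, here is a minimal sketch using the three core functions documented by the package (run inside an R project; no non-default arguments are assumed):

renv::init()      # create a project-local library and an renv.lock lockfile
# ...install, upgrade or remove packages as usual...
renv::snapshot()  # record the exact package versions currently in use
renv::restore()   # later, or on another machine, reinstall those recorded versions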

ymlthis v0.1.0: Provides functions to write YAML front matter for R Markdown and related documents. There is an Introduction to the package, a YAML Field Guide and an YAML Overview.

Visualization

geometr v0.1.1: Provides tools that generate and process tidy geometric shapes. There is a vignette.

ggVennDiagram v0.3: Provides functions to generate publication-quality Venn diagrams using two to four sets. See README for more information.

rayrender v0.4.2: Provides functions to render scenes using path tracing including building 3D scenes out of geometrical shapes and 3D models in the Wavefront OBJ file format. Look here for more information.

sankeywheel v0.1.0: Implements bindings to the Highcharts library to provide a simple way to draw dependency wheels and sankey diagrams. There is a vignette.

Our first artist in residence: Allison Horst!

Mon, 11/18/2019 - 01:00

[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’m very excited to announce that Allison Horst is RStudio’s inaugural artist-in-residence. Allison is one of my favourite artists in the R community. Over the next year, she’s going to be working with us to create even more awesome artwork. Here’s a little about Allison in her own words.

— Hadley

Hello everyone, I’m Allison.

Some of you might know me from my R- and stats-inspired illustrations on twitter. I’m excited to share that, as of October 2019, I am an Artist-in-Residence with RStudio. My goal as an RStudio Artist-in-Residence is to create useful and engaging illustrations and designs that welcome users to explore, teach, and learn about new packages and functions in R. In this post, I’ll share what motivates me to create R-related artwork, and what I’ll be working on at RStudio.

Why did I start making R-related artwork?

My primary job is teaching data science and statistics courses to ~100 incoming students each year at the Bren School of Environmental Science and Management, UC Santa Barbara.

When teaching, I’ve frequently found myself struggling to motivate students to try new R packages or functions. As an example, imagine you’re a student in an “Intro to Data Science” class learning to code for the first time. You’re already kind of intimidated by R, and then the really excited (unnamed) instructor exclaims “dplyr::mutate() is sooooo awesome!!!” while displaying code examples and/or R documentation on a slide behind them:

Even if the instructor is positive and encouraging, a screen full of code and documentation behind them might cast a daunting cloud over the introduction.

That’s the position I found myself in as a teacher. There was a clear disconnect between my excitement about sharing new things in R, and what I was presenting visually as a “first glimpse” into what a package or function could do. I felt frustrated to not have educational visuals that aligned with my enthusiasm. I also felt that if I could just make a student’s first exposure to a new coding skill something positive — funny, or happy, or intriguing, or just plain cute — they would be less resistant to investing in a new [insert thing] in R.

What are my goals?

When I started creating my aRt to lower learning barriers, I kept three things in mind:

  • Focus first on the big-picture application/use of the R function or package.
  • Make illustrations visually engaging, welcoming, and useful for useRs at all levels.
  • Use imagery to make it feel like R is working with you, not against you.

I tried a few different styles and characters and the friendly, hardworking, colorful monsters were most representative of how I think about work done by packages and functions. All of the monsteRs illustrations are driven by the goal of creating a friendlier bridge between learners and R functions / packages that might look intimidating at first glance.

For example, instead of showing a chunk of code while trying to encourage students to learn dplyr::mutate(), their first sighting of the function would be mutant monsteRs working behind the scenes to add columns to a data frame, while keeping the existing ones:

And here are the R Markdown wizard monsteRs, helping to keep text, code and outputs all together, then knitting to produce a final document:

And of course the ggplot2 artist monsteRs are using geoms, themes, and aesthetic mappings to build masterful data visualizations:

Do the monsteRs teach code? Well, no. But I hope that they do provide a welcome entry point for learners, and make the use of an R function or package clear and memorable. And while I create the illustrations mostly with teachers and learners in mind, users at any level can learn something new, or remember something old, through art reminders.

What else am I working on?

The monsteRs make frequent appearances in my artwork, but I’ve also enjoyed contributing to the R community through other graphic design and illustrations. Here’s an extended cut of the classic schematic from R for Data Science, updated to include environmental data and science communication bookends, that Dr. Julia Lowndes envisioned and presented in her useR!2019 keynote:

I had a great time creating buttons and banners for the “Birds of a Feather” sessions at the upcoming rstudio::conf(2020) – where I’m looking forward to meeting many of you in person!

And, I’ve been working on hex designs for R-related groups and packages! Here are a few: the hex sticker for Santa Barbara R-Ladies (our growing local chapter of R-Ladies Global), the new rray package hex envisioned by Davis Vaughan, and a design for the butcher package from Joyce Cahoon and the tidymodels team:

I’m inspired by how RStudio and the broader R community have embraced and supported art as a means of reaching more users, improving education materials (see the beautiful RStudio Education site with artwork by Desirée De Leon!), and simply making the R landscape a bit brighter. I am excited to continue producing aRt as an RStudio Artist-in-Residence over the next year.

— Allison

Eigenvectors from Eigenvalues – a NumPy implementation

Sun, 11/17/2019 - 15:49

[This article was first published on Rstats – bayesianbiologist, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I was intrigued by the recent splashy result showing how eigenvectors can be computed from eigenvalues alone. The finding was covered in Quanta magazine and the original paper is pretty easy to understand, even for a non-mathematician.

Being a non-mathematician myself, I tend to look for insights and understanding via computation, rather than strict proofs. What seems cool about the result to me is that you can compute the directions from the stretches alone (along with the stretches of the sub-matrices). It seems kind of magical (of course, it’s not). To get a feel for it, I implemented the key identity in the paper in Python and NumPy and confirmed that it gives the right answer for a random (real-valued, symmetric) matrix.
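The same check translates readily to R; here is a rough sketch of the identity for a single eigenvector component (this is my own R re-implementation rather than the NumPy notebook, and it assumes the eigenvalues of the random matrix are distinct, which they almost surely are):

# |v[j, i]|^2 * prod_{k != i} (lambda_i - lambda_k) == prod_k (lambda_i - mu_k),
# where mu are the eigenvalues of A with row and column j deleted
set.seed(1)
n <- 5
A <- matrix(rnorm(n * n), n)
A <- (A + t(A)) / 2                  # random real symmetric matrix
eA <- eigen(A, symmetric = TRUE)
i <- 2; j <- 3                       # pick an eigenvector and a component
mu <- eigen(A[-j, -j], symmetric = TRUE)$values
lhs <- eA$vectors[j, i]^2 * prod(eA$values[i] - eA$values[-i])
rhs <- prod(eA$values[i] - mu)
all.equal(lhs, rhs)                  # TRUE, up to floating point error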

I posted the Jupyter Notebook here.

Using R: From gather to pivot

Sun, 11/17/2019 - 10:05

[This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Since version 1.0.0, released in September, the tidyr package has a new replacement for the gather/spread pair of functions, called pivot_longer/pivot_wider. (See the blog post about the release. It can do a lot of cool things.) Just what we needed, another pair of names for melt/cast, right?

Yes, I feel like this might just be what we need!

My journey started with reshape2, and after a bit of confusion, I internalised the logic of melt/cast. Look at this beauty:

library(reshape2)
fake_data <- data.frame(id = 1:20,
                        variable1 = runif(20, 0, 1),
                        variable2 = rnorm(20))
melted <- melt(fake_data, id.vars = "id")

This turns a data frame that looks like this …

  id  variable1   variable2
1  1 0.10287737 -0.21740708
2  2 0.04219212  1.36050438
3  3 0.78119150  0.09808656
4  4 0.44304613  0.48306900
5  5 0.30720140 -0.45028374
6  6 0.42387957  1.16875579

… into a data frame that looks like this:

  id  variable      value
1  1 variable1 0.10287737
2  2 variable1 0.04219212
3  3 variable1 0.78119150
4  4 variable1 0.44304613
5  5 variable1 0.30720140
6  6 variable1 0.42387957

This is extremely useful. Among other things it comes up all the time when using ggplot2.
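For instance, once the former column names live in a single variable column, you can map them straight to facets or aesthetics. A minimal sketch, assuming ggplot2 is installed:

library(ggplot2)
ggplot(melted, aes(x = value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ variable, scales = "free_x")   # one panel per original column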

Then, as I detailed in a post two years ago, I switched to tidyr as that became the replacement package. ”Gather” and ”spread” made no sense to me as descriptions of operations on a data frame. To be fair, ”melt” and ”cast” felt equally arbitrary, but by that time I was used to them. Getting the logic of the arguments (the order, what needed quotation marks and what didn’t) took some staring at examples and a fair bit of trial and error.

Here are some examples. If you’re not used to these functions, just skip ahead, because you will want to learn the pivot functions instead!

library(tidyr)
melted <- gather(fake_data, variable, value, 2:3)

## Column names instead of indices
melted <- gather(fake_data, variable, value, variable1, variable2)

## Excluding instead of including
melted <- gather(fake_data, variable, value, -1)

## Excluding using column name
melted <- gather(fake_data, variable, value, -id)

Enter the pivot functions. Now, I have never used pivot tables in any spreadsheet software, and in fact, the best way to explain them to me was to tell me that they were like melt/cast (and summarise) … But pivot_longer/pivot_wider are definitely friendlier on first use than gather/spread. The naming of both the functions themselves and their arguments feel like a definite improvement.

long <- pivot_longer(fake_data, 2:3,
                     names_to = "variable",
                     values_to = "value")

# A tibble: 40 x 3
      id variable    value
 1     1 variable1  0.103
 2     1 variable2 -0.217
 3     2 variable1  0.0422
 4     2 variable2  1.36
 5     3 variable1  0.781
 6     3 variable2  0.0981
 7     4 variable1  0.443
 8     4 variable2  0.483
 9     5 variable1  0.307
10     5 variable2 -0.450
# … with 30 more rows

We tell it into what column we want the names to go, and into what column we want the values to go. The function is named after a verb that is associated with moving things about in tables all the way to matrix algebra, followed by an adjective (in my opinion the most descriptive, out of the alternatives) that describes the layout of the data that we want.

Or, to switch us back again:

wide <- pivot_wider(long,
                    names_from = "variable",
                    values_from = "value")

# A tibble: 20 x 3
     id variable1 variable2
 1    1    0.103    -0.217
 2    2    0.0422    1.36
 3    3    0.781     0.0981
 4    4    0.443     0.483
 5    5    0.307    -0.450
 6    6    0.424     1.17

Here, instead, we tell it where we want the new column names taken from and where we want the new values taken from. None of this is self-explanatory, by any means, but they are thoughtful choices that make a lot of sense.

We’ll see what I think after trying to explain them to beginners a few times, and after I’ve fought warning messages involving list columns for some time, but so far: well done, tidyr developers!

Quicker knitr kables in RStudio notebook

Sun, 11/17/2019 - 01:00

[This article was first published on Roman Pahl, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The setup

The RStudio notebook is a great interactive tool to build a statistical report. Being able to see statistics and graphs right on the fly probably has saved me countless hours, especially when building complex reports.

However, one thing that has always bothered me was the way tables are displayed in the notebook with knitr’s kable function. For example, consider the airquality data set:

head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

To get a nice table in your report you type

knitr::kable(head(airquality), caption = "New York Air Quality Measurements.")

which shows up nicely formatted in the final output

New York Air Quality Measurements.

Ozone   Solar.R   Wind   Temp   Month   Day
   41       190    7.4     67       5     1
   36       118    8.0     72       5     2
   12       149   12.6     74       5     3
   18       313   11.5     62       5     4
   NA        NA   14.3     56       5     5
   28        NA   14.9     66       5     6

The problem

But in the interactive RStudio notebook session the table looks something like the following:

So first of all, the formatting is not that great. Secondly, the table chunk consumes way too much space of the notebook and, at times, can be very cumbersome to scroll. Also for bigger tables (and depending on your hardware) it can take up to a few seconds for the table to be built.

So often when I was using kable, I felt my workflow being disrupted. In the interactive session I want a table being built quickly and in a clean format. Now, using the simple print function you’ll get exactly this

So my initial quick-and-dirty workaround during the interactive session was to comment out the knitr statement and use the print function.

# knitr::kable(head(airquality), caption = "New York Air Quality Measurements.")
print(head(airquality))

Then, only when creating the final report, I would comment out the print function and use kable again. Of course, there is a much more elegant and easier solution to get this without having to switch between functions.

The solution

We define a simple wrapper, which chooses the corresponding function depending on the context:

kable_if <- function(x, ...) if (interactive()) print(x, ...) else knitr::kable(x, ...)

Then you simply call it as you would invoke kable and now you get both, the quick table in the interactive session …

… and a formatted table in the report.

kable_if(head(airquality), caption = "New York Air Quality Measurements.")

New York Air Quality Measurements.

Ozone   Solar.R   Wind   Temp   Month   Day
   41       190    7.4     67       5     1
   36       118    8.0     72       5     2
   12       149   12.6     74       5     3
   18       313   11.5     62       5     4
   NA        NA   14.3     56       5     5
   28        NA   14.9     66       5     6

That’s it. Simply put this function definition somewhere in the top of your document and enjoy a quick workflow.
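A small variation on the same idea, in case you also run chunks non-interactively outside of a knit: instead of testing interactive(), test whether knitr is currently rendering. This is only a sketch, relying on the knitr.in.progress option that knitr sets while a document is being rendered:

kable_if2 <- function(x, ...) {
  if (isTRUE(getOption("knitr.in.progress"))) knitr::kable(x, ...) else print(x, ...)
}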

RcppEigen 0.3.3.7.0

Sat, 11/16/2019 - 16:41

[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A new minor release 0.3.3.7.0 of RcppEigen arrived on CRAN today (and just went to Debian too) bringing support for Eigen 3.3.7 to R.

This release comes almost a year after the previous minor release 0.3.3.5.0. Besides the upgrade to the new upstream version, it brings a few accumulated polishes to some helper and setup functions, and switches to the very nice tinytest package for unit tests; see below for the full list. As before, we carry a few required changes to Eigen in a diff. And as we said before at the previous two releases:

One additional and recent change was the accommodation of a recent CRAN Policy change to not allow gcc or clang to mess with diagnostic messages. A word of caution: this may make your compilation of packages using RcppEigen very noisy so consider adding -Wno-ignored-attributes to the compiler flags added in your ~/.R/Makevars.

The complete NEWS file entry follows.

Changes in RcppEigen version 0.3.3.7.0 (2019-11-16)
  • Fixed skeleton package creation listing RcppEigen under Imports (James Balamuta in #68 addressing #16).

  • Small RNG use update to first example in skeleton package used by package creation helper (Dirk addressing #69).

  • Update vignette example to use RcppEigen:::eigen_version() (Dirk addressing #71).

  • Correct one RcppEigen.package.skeleton() corner case (Dirk in #77 fixing #75).

  • Correct one usage case with pkgKitten (Dirk in #78).

  • The package now uses tinytest for unit tests (Dirk in #81).

  • Upgraded to Eigen 3.3.7 (Dirk in #82 fixing #80).

Courtesy of CRANberries, there is also a diffstat report for the most recent release.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

Practical Data Science with R, 2nd Edition, IS OUT!!!!!!!

Sat, 11/16/2019 - 00:53

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Practical Data Science with R, 2nd Edition author Dr. Nina Zumel, with a fresh author’s copy of her book!

The hidden diagnostic plots for the lm object

Thu, 11/14/2019 - 20:07

[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When plotting an lm object in R, one typically sees a 2 by 2 panel of diagnostic plots, much like the one below:

set.seed(1)
x <- matrix(rnorm(200), nrow = 20)
y <- rowSums(x[, 1:3]) + rnorm(20)
lmfit <- lm(y ~ x)
summary(lmfit)

par(mfrow = c(2, 2))
plot(lmfit)

This link has an excellent explanation of each of these 4 plots, and I highly recommend giving it a read.

Most R users are familiar with these 4 plots. But did you know that the plot() function for lm objects can actually give you 6 plots? It says so right in the documentation:

We can specify which of the 6 plots we want when calling this function using the which option. By default, we are given plots 1, 2, 3 and 5. Let’s have a look at what plots 4 and 6 are.

Plot 4 is of Cook’s distance vs. observation number (i.e. row number). Cook’s distance is a measure of how influential a given observation is on the linear regression fit, with a value > 1 typically indicating a highly influential point. By plotting this value against row number, we can see if highly influential points exhibit any relationship to their position in the dataset. This is useful for time series data as it can indicate if our fit is disproportionately influenced by data from a particular time period.
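You can also inspect the same quantity numerically; a minimal sketch using base R’s influence measures:

cd <- cooks.distance(lmfit)
which(cd > 1)                       # indices of highly influential points, if any
head(sort(cd, decreasing = TRUE))   # the most influential observations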

Here is what plot 4 might look like:

plot(lmfit, which = 4)

Plot 6 is of Cook’s distance against (leverage)/(1 - leverage). An observation’s leverage must fall in the interval [0, 1], so plotting against (leverage)/(1 - leverage) allows the x-axis to span the whole positive real line. The contours on the plot represent points where the absolute value of the standardized residual is the same. On this plot they happen to be straight lines; the documentation says so as well but I haven’t had time to check it mathematically.
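The quantities on both axes are easy to compute directly, which can help when reading the plot; a minimal sketch:

h <- hatvalues(lmfit)            # leverages, one per observation
x_axis <- h / (1 - h)            # the transformed leverage used on the x-axis
cd <- cooks.distance(lmfit)      # the y-axis
head(cbind(leverage = h, transformed = x_axis, cooks = cd))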

Here is what plot 6 might look like:

plot(lmfit, which = 6)

I’m not too sure how one should interpret this plot. As far as I know, one should take extra notice of points with high leverage and/or high Cook’s distance. So any observation in the top-left, top-right or bottom-right corner should be taken note of. If anyone knows of a better way to interpret this plot, let me know!
