R-bloggers
R news and tutorials contributed by hundreds of R bloggers

R Function Call with Ellipsis Trap/Pitfall

Sun, 04/02/2017 - 10:00

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Objective of this post: alerting all users to double-check the case and spelling of all function parameters.

I am a newbie in R and was trying the RSNNS mlp function, and wasted a lot of time due to some typos.

RSNNS mlp function silently ignores misspelled keywords
Example:

model <- mlp(iris[, 1:4], decodeClassLabels(iris[, 5]), hize = 7, mazit = 100)

I intentionally misspelled size as hize and maxit as mazit. There are no warnings or errors.

I suspect many packages have the same problem, since package writers do not always validate ellipsis arguments. I made a small spelling mistake and got puzzling results because there was no parameter validation; I had expected that mature, widely used packages would be robust and help users recover from typos.

Let us see what happens with no ellipsis:

> square <- function(num) {
+   return(num * num)
+ }
> square(num = 4)
[1] 16
> square(numm = 4)
Error in square(numm = 4) : unused argument (numm = 4)

# With ellipsis added
> square <- function(num, ...) {
+   print(names(list(...)))
+   return(num * num)
+ }
> square(num = 3, bla = 4, kla = 9)
[1] "bla" "kla"
[1] 9

As you can see, names(list(...)) does give access to the names of the parameters passed through the ellipsis.

The problem is that ellipsis arguments exist for flexibility, but package writers should take extra care to throw an “unused argument” error when a function parameter is misspelled.

This, to my mind, is a major weakness of the R ecosystem. Most parameters have defaults, and a small case or spelling mistake can lead to really wrong conclusions!
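One possible defensive pattern is sketched below. This is not code from any particular package: the `allowed` vector is a hypothetical whitelist of extra argument names a function is willing to accept through `...`.

```r
# A minimal sketch of how a function taking `...` could reject misspelled
# arguments instead of silently ignoring them.
square <- function(num, ...) {
  extra   <- names(list(...))
  allowed <- c("verbose")        # hypothetical whitelist of extra arguments
  unknown <- setdiff(extra, allowed)
  if (length(unknown) > 0) {
    stop("unused argument(s): ", paste(unknown, collapse = ", "))
  }
  num * num
}

square(num = 4)        # 16
try(square(numm = 4))  # error: unused argument(s): numm
```

With this check in place, a typo like `numm` fails loudly instead of being swallowed by the ellipsis.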

RSNNS itself is fantastic; it is used here only as an example of the ellipsis function call trap. I hope other newbies benefit and learn to avoid the trap of wrong arguments.

Jayanta Narayan Choudhuri
Kolkata India

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Superclassing to R⁶

Sun, 04/02/2017 - 03:31

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

To avoid “branding” confusion with R⁴ I’m superclassing it to R⁶ and encouraging others in the R community to don the moniker and do their own small, focused posts on topics that would help the R community learn things. Feel free to use R⁶ (I’ll figure out an acronym later). Feel free to tag your posts as R⁶ (or r6) and use the moniker as you see fit.

I’ll eventually tag the 2 current “r4” posts as “r6”.

Hopefully we can link together a cadre of R⁶ posts into a semi-organized structure that all can benefit from.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

Dealing with unbalanced data in machine learning

Sun, 04/02/2017 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

In my last post, where I shared the code that I used to produce an example analysis to go along with my webinar on building meaningful models for disease prediction, I mentioned that it is advised to consider over- or under-sampling when you have unbalanced data sets. Because my focus in this webinar was on evaluating model performance, I did not want to add an additional layer of complexity and therefore did not further discuss how to specifically deal with unbalanced data.

But because I had gotten a few questions regarding this, I thought it would be worthwhile to explain over- and under-sampling techniques in more detail and show how you can very easily implement them with caret.

library(caret)

Unbalanced data

In this context, unbalanced data refers to classification problems where we have unequal instances for different classes. Having unbalanced data is actually very common in general, but it is especially prevalent when working with disease data, where we usually have more healthy control samples than disease cases. Even more extreme unbalance is seen with fraud detection, where e.g. most credit card uses are okay and only very few will be fraudulent. In the example I used for my webinar, a breast cancer dataset, we had about twice as many benign as malignant samples.

summary(bc_data$classes)
##    benign malignant 
##       458       241

Why is unbalanced data a problem in machine learning?

Most machine learning classification algorithms are sensitive to unbalance in the predictor classes. Let’s consider an even more extreme example than our breast cancer dataset: assume we had 10 malignant vs 90 benign samples. A machine learning model that has been trained and tested on such a dataset could now predict “benign” for all samples and still gain a very high accuracy. An unbalanced dataset will bias the prediction model towards the more common class!
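This accuracy trap is easy to demonstrate with a couple of lines of R (a made-up illustration, not from the original post):

```r
# With 90 benign and 10 malignant samples, a "model" that always predicts
# "benign" still reaches 90% accuracy while missing every malignant case.
actual    <- c(rep("benign", 90), rep("malignant", 10))
predicted <- rep("benign", 100)

mean(predicted == actual)  # 0.9 -- high accuracy, useless classifier
```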

How to balance data for modeling

The basic theoretical concepts behind over- and under-sampling are very simple:

  • With under-sampling, we randomly select a subset of samples from the class with more instances to match the number of samples in the smaller class. In our example, we would randomly pick 241 out of the 458 benign cases. The main disadvantage of under-sampling is that we lose potentially relevant information from the left-out samples.

  • With over-sampling, we randomly duplicate samples from the class with fewer instances, or we generate additional instances based on the data that we have, so as to match the number of samples in each class. While we avoid losing information with this approach, we also run the risk of overfitting our model, as we are more likely to get the same samples in the training and in the test data, i.e. the test data is no longer independent from the training data. This would lead to an overestimation of our model’s performance and generalizability.
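The two strategies can be sketched manually in base R before handing things over to caret. This is an illustration only, not code from the post; `df` is a stand-in data frame built with the post’s class counts.

```r
# Manual under- and over-sampling with base R on a synthetic two-class frame.
set.seed(42)
df <- data.frame(classes = factor(c(rep("benign", 458), rep("malignant", 241))),
                 x = rnorm(699))

counts   <- table(df$classes)
minority <- names(counts)[which.min(counts)]
majority <- names(counts)[which.max(counts)]

# Under-sampling: draw the majority class down to the minority size
maj_idx <- which(df$classes == majority)
under   <- rbind(df[df$classes == minority, ],
                 df[sample(maj_idx, min(counts)), ])

# Over-sampling: draw the minority class up, with replacement
min_idx <- which(df$classes == minority)
over    <- rbind(df[df$classes == majority, ],
                 df[sample(min_idx, max(counts), replace = TRUE), ])

table(under$classes)  # 241 benign, 241 malignant
table(over$classes)   # 458 benign, 458 malignant
```

In practice you would not do this by hand on the whole training set, for the cross-validation reason explained below.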

In reality though, we should not simply perform over- or under-sampling on our training data and then run the model. We need to account for cross-validation and perform over- or under-sampling on each fold independently to get an honest estimate of model performance!

Modeling the original unbalanced data

Here is the same model I used in my webinar example: I randomly divide the data into training and test sets (stratified by class) and perform Random Forest modeling with 10 x 10 repeated cross-validation. Final model performance is then measured on the test set.

set.seed(42)
index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
train_data <- bc_data[index, ]
test_data  <- bc_data[-index, ]

set.seed(42)
model_rf <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10,
                                                  repeats = 10,
                                                  verboseIter = FALSE))

final <- data.frame(actual = test_data$classes,
                    predict(model_rf, newdata = test_data, type = "prob"))
final$predict <- ifelse(final$benign > 0.5, "benign", "malignant")
cm_original <- confusionMatrix(final$predict, test_data$classes)

Under-sampling

Luckily, caret makes it very easy to incorporate over- and under-sampling techniques with cross-validation resampling. We can simply add the sampling option to our trainControl and choose down for under- (also called down-) sampling. The rest stays the same as with our original model.

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 10,
                     verboseIter = FALSE,
                     sampling = "down")

set.seed(42)
model_rf_under <- caret::train(classes ~ .,
                               data = train_data,
                               method = "rf",
                               preProcess = c("scale", "center"),
                               trControl = ctrl)

final_under <- data.frame(actual = test_data$classes,
                          predict(model_rf_under, newdata = test_data, type = "prob"))
final_under$predict <- ifelse(final_under$benign > 0.5, "benign", "malignant")
cm_under <- confusionMatrix(final_under$predict, test_data$classes)

Oversampling

For over- (also called up-) sampling we simply specify sampling = "up".

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 10,
                     verboseIter = FALSE,
                     sampling = "up")

set.seed(42)
model_rf_over <- caret::train(classes ~ .,
                              data = train_data,
                              method = "rf",
                              preProcess = c("scale", "center"),
                              trControl = ctrl)

final_over <- data.frame(actual = test_data$classes,
                         predict(model_rf_over, newdata = test_data, type = "prob"))
final_over$predict <- ifelse(final_over$benign > 0.5, "benign", "malignant")
cm_over <- confusionMatrix(final_over$predict, test_data$classes)

ROSE

Besides over- and under-sampling, there are hybrid methods that combine under-sampling with the generation of additional data. Two of the most popular are ROSE and SMOTE.

From Nicola Lunardon, Giovanna Menardi and Nicola Torelli’s “ROSE: A Package for Binary Imbalanced Learning” (R Journal, 2014, Vol. 6 Issue 1, p. 79): “The ROSE package provides functions to deal with binary classification problems in the presence of imbalanced classes. Artificial balanced samples are generated according to a smoothed bootstrap approach and allow for aiding both the phases of estimation and accuracy evaluation of a binary classifier in the presence of a rare class. Functions that implement more traditional remedies for the class imbalance and different metrics to evaluate accuracy are also provided. These are estimated by holdout, bootstrap, or cross-validation methods.”

You implement them the same way as before, this time choosing sampling = "rose"…

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 10,
                     verboseIter = FALSE,
                     sampling = "rose")

set.seed(42)
model_rf_rose <- caret::train(classes ~ .,
                              data = train_data,
                              method = "rf",
                              preProcess = c("scale", "center"),
                              trControl = ctrl)

final_rose <- data.frame(actual = test_data$classes,
                         predict(model_rf_rose, newdata = test_data, type = "prob"))
final_rose$predict <- ifelse(final_rose$benign > 0.5, "benign", "malignant")
cm_rose <- confusionMatrix(final_rose$predict, test_data$classes)

SMOTE

… or by choosing sampling = "smote" in the trainControl settings.

From Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall and W. Philip Kegelmeyer’s “SMOTE: Synthetic Minority Over-sampling Technique” (Journal of Artificial Intelligence Research, 2002, Vol. 16, pp. 321–357): “This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples.”

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 10,
                     verboseIter = FALSE,
                     sampling = "smote")

set.seed(42)
model_rf_smote <- caret::train(classes ~ .,
                               data = train_data,
                               method = "rf",
                               preProcess = c("scale", "center"),
                               trControl = ctrl)

final_smote <- data.frame(actual = test_data$classes,
                          predict(model_rf_smote, newdata = test_data, type = "prob"))
final_smote$predict <- ifelse(final_smote$benign > 0.5, "benign", "malignant")
cm_smote <- confusionMatrix(final_smote$predict, test_data$classes)

Predictions

Now let’s compare the predictions of all these models:

models <- list(original = model_rf,
               under = model_rf_under,
               over = model_rf_over,
               smote = model_rf_smote,
               rose = model_rf_rose)

resampling <- resamples(models)
bwplot(resampling)

library(dplyr)

comparison <- data.frame(model = names(models),
                         Sensitivity = rep(NA, length(models)),
                         Specificity = rep(NA, length(models)),
                         Precision = rep(NA, length(models)),
                         Recall = rep(NA, length(models)),
                         F1 = rep(NA, length(models)))

for (name in names(models)) {
  model <- get(paste0("cm_", name))
  comparison[comparison$model == name, ] <- filter(comparison, model == name) %>%
    mutate(Sensitivity = model$byClass["Sensitivity"],
           Specificity = model$byClass["Specificity"],
           Precision = model$byClass["Precision"],
           Recall = model$byClass["Recall"],
           F1 = model$byClass["F1"])
}

library(tidyr)

comparison %>%
  gather(x, y, Sensitivity:F1) %>%
  ggplot(aes(x = x, y = y, color = model)) +
  geom_jitter(width = 0.2, alpha = 0.5, size = 3)

With this small dataset, we can already see how the different techniques influence model performance. Sensitivity (or recall) describes the proportion of benign cases that have been predicted correctly, while specificity describes the proportion of malignant cases that have been predicted correctly. Precision describes the proportion of benign predictions that actually came from benign samples. F1 is the weighted average of precision and sensitivity/recall.
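The relationships between these metrics can be written out directly. The counts below are hypothetical, chosen only to illustrate the formulas, with “benign” treated as the positive class:

```r
# Made-up counts for a 2x2 confusion matrix; tp = true positives, etc.
tp <- 120; fn <- 6; fp <- 10; tn <- 74

sensitivity <- tp / (tp + fn)   # recall: benign cases predicted correctly
specificity <- tn / (tn + fp)   # malignant cases predicted correctly
precision   <- tp / (tp + fp)   # benign predictions that were truly benign
f1 <- 2 * precision * sensitivity / (precision + sensitivity)
```

caret’s confusionMatrix computes the same quantities and exposes them via the byClass element used in the loop above.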

Here, all four methods improved specificity and precision compared to the original model.
Under-sampling, over-sampling, and ROSE additionally improved the F1 score.

This post shows a simple example of how to correct for unbalance in datasets for machine learning. For more advanced instructions and potential caveats with these techniques, check out the excellent caret documentation.

If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.

sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
## 
## other attached packages:
## [1] tidyr_0.6.1 dplyr_0.5.0 randomForest_4.6-12
## [4] caret_6.0-73 ggplot2_2.2.1 lattice_0.20-34
## 
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.9 nloptr_1.0.4 plyr_1.8.4
## [4] class_7.3-14 iterators_1.0.8 tools_3.3.3
## [7] digest_0.6.12 lme4_1.1-12 evaluate_0.10
## [10] tibble_1.2 gtable_0.2.0 nlme_3.1-131
## [13] mgcv_1.8-17 Matrix_1.2-8 foreach_1.4.3
## [16] DBI_0.5-1 yaml_2.1.14 parallel_3.3.3
## [19] SparseM_1.74 e1071_1.6-8 stringr_1.2.0
## [22] knitr_1.15.1 MatrixModels_0.4-1 stats4_3.3.3
## [25] rprojroot_1.2 grid_3.3.3 nnet_7.3-12
## [28] R6_2.2.0 rmarkdown_1.3 minqa_1.2.4
## [31] reshape2_1.4.2 car_2.1-4 magrittr_1.5
## [34] backports_1.0.5 scales_0.4.1 codetools_0.2-15
## [37] ModelMetrics_1.1.0 htmltools_0.3.5 MASS_7.3-45
## [40] splines_3.3.3 assertthat_0.1 pbkrtest_0.4-6
## [43] colorspace_1.3-2 labeling_0.3 quantreg_5.29
## [46] stringi_1.1.2 lazyeval_0.2.0 munsell_0.4.3

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

Building Shiny App Exercises (part-9)

Sat, 04/01/2017 - 18:00

Shiny Dashboard Overview

In this part we will “dig deeper” to discover the amazing capabilities that a Shiny Dashboard provides.
Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let’s begin!

Answers to the exercises are available here.

The dashboardPage function expects three components: a header, sidebar, and body:
#ui.R
dashboardPage(
dashboardHeader(),
dashboardSidebar(),
dashboardBody()
)

For more complicated apps, splitting the app into pieces can make it more readable:

header <- dashboardHeader()

sidebar <- dashboardSidebar()

body <- dashboardBody()

dashboardPage(header, sidebar, body)

Now we’ll look at each of the three main components of a shinydashboard.

HEADER

A header can have a title and dropdown menus. The dropdown menus are generated by the dropdownMenu function. There are three types of menus – messages, notifications, and tasks – and each one must be populated with a corresponding type of item.

Message menus

A messageItem contained in a message menu needs values for from and message. You can also control the icon and a notification time string. By default, the icon is a silhouette of a person. The time string can be any text. For example, it could be a relative date/time like “5 minutes”, “today”, or “12:30pm yesterday”, or an absolute time, like “2014-12-01 13:45”.
#ui.R
dropdownMenu(type = "messages",
messageItem(
from = "Sales Dept",
message = "Sales are steady this month."
),
messageItem(
from = "New User",
message = "How do I register?",
icon = icon("question"),
time = "13:45"
),
messageItem(
from = "Support",
message = "The new server is ready.",
icon = icon("life-ring"),
time = "2014-12-01"
)
)

Exercise 1

Create a dropdownMenu in your dashboardHeader like the example above. Use dates, times, and any text of your choice.

Dynamic content

In most cases, you’ll want to make the content dynamic. That means that the HTML content is generated on the server side and sent to the client for rendering. In the UI code, you’d use dropdownMenuOutput like this:

dashboardHeader(dropdownMenuOutput("messageMenu"))

Exercise 2

Replace dropdownMenu with dropdownMenuOutput and the three messageItem with messageMenu.

The next step is to create some messages for this example. The code below does this work for us.
# Example message data in a data frame
messageData <- data.frame(
from = c("Administrator", "New User", "Support"),
message = c(
"Sales are steady this month.",
"How do I register?",
"The new server is ready."
),
stringsAsFactors = FALSE
)

Exercise 3

Put messageData inside your server.R but outside of the shinyServer function.

And on the server side, you’d generate the entire menu in a renderMenu, like this:
output$messageMenu <- renderMenu({
# Code to generate each of the messageItems here, in a list. messageData
# is a data frame with two columns, 'from' and 'message'.
# Also add on slider value to the message content, so that messages update.
msgs <- apply(messageData, 1, function(row) {
messageItem(
from = row[["from"]],
message = paste(row[["message"]], input$slider)
)
})

dropdownMenu(type = "messages", .list = msgs)
})

Exercise 4

Put the code above (output$messageMenu) in the shinyServer function of server.R.

Hopefully you now understand the logic behind the dynamic content of your menu. Let’s return to the static version in order to describe it a little bit more: make the proper changes to your code so that you return exactly to the point we were at after Exercise 1.

Notification menus

A notificationItem contained in a notification menu holds a text notification. You can also control the icon and the status color. The code below gives an example.
#ui.r
dropdownMenu(type = "notifications",
notificationItem(
text = "20 new users today",
icon("users")
),
notificationItem(
text = "14 items delivered",
icon("truck"),
status = "success"
),
notificationItem(
text = "Server load at 84%",
icon = icon("exclamation-triangle"),
status = "warning"
)
)

Exercise 5

Create a dropdownMenu for your notifications like the example. Use text of your choice. Be careful of the type and the notificationItem.

Task menus

Task items have a progress bar and a text label. You can also specify the color of the bar. Valid colors are listed in ?validColors. Take a look at the example below.
#ui.r
dropdownMenu(type = "tasks", badgeStatus = "success",
taskItem(value = 90, color = "green",
"Documentation"
),
taskItem(value = 17, color = "aqua",
"Project X"
),
taskItem(value = 75, color = "yellow",
"Server deployment"
),
taskItem(value = 80, color = "red",
"Overall project"
)
)

Exercise 6

Create a dropdownMenu for your tasks like the example above. Use text of your choice and create as many taskItem elements as you want. Be careful of the type and the taskItem.

Disabling the header

If you don’t want to show a header bar, you can disable it with:

dashboardHeader(disable = TRUE)

Exercise 7

Disable the header.

Now enable it again.

Body

The body of a dashboard page can contain any regular Shiny content. However, if you’re creating a dashboard you’ll likely want to make something that’s more structured. The basic building block of most dashboards is a box. Boxes in turn can contain any content.

Boxes

Boxes are the main building blocks of dashboard pages. A basic box can be created with the box function, and the contents of the box can be (most) any Shiny UI content. We have already created some boxes in part 8, so let’s enhance their appearance a little bit.
Boxes can have titles and header bar colors with the title and status options. Look at the examples below.

box(title = "Histogram", status = "primary",solidHeader = TRUE, plotOutput("plot2", height = 250)),

box(
title = "Inputs", status = "warning",
"Box content here", br(), "More box content",
sliderInput("slider", "Slider input:", 1, 100, 50),
textInput("text", "Text input:")
)

Exercise 8

Give a title of your choice to all the boxes you have created in your dashboard, except for the three widgets’ boxes.

Exercise 9

Change the status of the first three boxes to “primary” and the last three to “warning”.

Exercise 10

Transform the headers of your first three boxes into solid headers.

Related exercise sets:
  1. Building Shiny App exercises part 1
  2. Building Shiny App exercises part 3
  3. Building Shiny App exercises part 2
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

Tutorial: Using R for Scalable Data Analytics

Fri, 03/31/2017 - 21:45

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

At the recent Strata conference in San Jose, several members of the Microsoft Data Science team presented the tutorial Using R for Scalable Data Analytics: Single Machines to Spark Clusters. The materials are all available online, including the presentation slides and hands-on R scripts. You can follow along with the materials at home, using the Data Science Virtual Machine for Linux, which provides all the necessary components like Spark and Microsoft R Server. (If you don't already have an Azure account, you can get $200 credit with the Azure free trial.)

The tutorial covers many different techniques for training predictive models at scale, and deploying the trained models as predictive engines within production environments. Among the technologies you'll use are Microsoft R Server running on Spark, the SparkR package, the sparklyr package and H2O (via the rsparkling package). It also touches on some non-Spark methods, like the bigmemory and ff packages for R (and various other packages that make use of them), and using the foreach package for coarse-grained parallel computations. You'll also learn how to create prediction engines from these trained models using the mrsdeploy package.

The tutorial also includes scripts for comparing the performance of these various techniques, both for training the predictive model and for generating predictions from the trained model. (The above tests used 4 worker nodes and 1 edge node, all with 16 cores and 112 GB of RAM.)

You can find the tutorial details, including slides and scripts, at the link below.

Strata + Hadoop World 2017, San Jose: Using R for scalable data analytics: From single machines to Hadoop Spark clusters


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

ggedit 0.2.0 is now on CRAN

Fri, 03/31/2017 - 21:17

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Jonathan Sidi, Metrum Research Group

We are pleased to announce the release of the ggedit package on CRAN.

To install the package you can call the standard R command

install.packages('ggedit')

The source version is still tracked on github, which has been reorganized to be easier to navigate.

To install the dev version:

devtools::install_github('metrumresearchgroup/ggedit')

What is ggedit?

ggedit is an R package that is used to facilitate ggplot formatting. With ggedit, R users of all experience levels can easily move from creating ggplots to refining aesthetic details, all while maintaining portability for further reproducible research and collaboration.
ggedit is run from an R console or as a reactive object in any Shiny application. The user inputs a ggplot object or a list of objects. The application populates Bootstrap modals with all of the elements found in each layer, scale, and theme of the ggplot objects. The user can then edit these elements and interact with the plot as changes occur. During editing, a comparison of the script is logged, which can be directly copied and shared. The application output is a nested list containing the edited layers, scales, and themes in both object and script form, so you can apply the edited objects independent of the original plot using regular ggplot2 grammar.
Why does it matter?

ggedit promotes efficient collaboration. You can share your plots with team members to make formatting changes, and they can then send any objects they’ve edited back to you for implementation. No more email chains to change a circle to a triangle!

Updates in ggedit 0.2.0:
  • The layer modal (popups) elements have been reorganized for less clutter and easier navigation.
  • The S3 method written to plot and compare themes has been removed from the package, but can still be found on the repo, see plot.theme.
Deploying
  • call from the console: ggedit(p)
  • call from the addin toolbar: highlight script of a plot object on the source editor window of RStudio and run from toolbar.
  • call as part of Shiny: use the Shiny module syntax to call the ggEdit UI elements.
    • server: callModule(ggEdit,'pUI',obj=reactive(p))
    • ui: ggEditUI('pUI')
  • if you have installed the package you can see an example of a Shiny app by executing runApp(system.file('examples/shinyModule.R',package = 'ggedit'))
Outputs

ggedit returns a list containing 8 elements, either to the global environment or as a reactive output in Shiny.

  • updatedPlots
    • List containing updated ggplot objects
  • updatedLayers
    • For each plot a list of updated layers (ggproto) objects
    • Portable object
  • updatedLayersElements
    • For each plot a list elements and their values in each layer
    • Can be used to update the new values in the original code
  • updatedLayerCalls
    • For each plot a list of scripts that can be run directly from the console to create a layer
  • updatedThemes
    • For each plot a list of updated theme objects
    • Portable object
    • If the user doesn’t edit the theme updatedThemes will not be returned
  • updatedThemeCalls
    • For each plot a list of scripts that can be run directly from the console to create a theme
  • updatedScales
    • For each plot a list of updated scales (ggproto) objects
    • Portable object
  • updatedScaleCalls
    • For each plot a list of scripts that can be run directly from the console to create a scale
  Short Clip to use ggedit in Shiny

Jonathan Sidi joined Metrum Research Group in 2016 after working for several years on problems in applied statistics, financial stress testing and economic forecasting in both industrial and academic settings. To learn more about additional open-source software packages developed by Metrum Research Group please visit the Metrum website. Contact: For questions and comments, feel free to email me at: yonis@metrumrg.com or open an issue for bug fixes or enhancements at github.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

The 5 Most Effective Ways to Learn R

Fri, 03/31/2017 - 19:53

(This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to R-bloggers)

Whether you’re plotting a simple time series or building a predictive model for the next election, the R programming language’s flexibility will ensure you have all the capabilities you need to get the job done. In this blog we will take a look at five effective tactics for learning this essential data science language, as well as some of the top resources associated with each. These tactics should be used to complement one another on your path to mastering the world’s most powerful statistical language!

1. Watch Instructive Videos

We often flock to YouTube when we want to learn how to play a song on the piano, change a tire, or chop an onion, but why should it be any different when learning how to perform calculations using the most popular statistical programming language? LearnR, Google Developers, and MarinStatsLectures are all fantastic YouTube channels with playlists specifically dedicated to the R language.

2. Read Blogs

There’s a good chance you came across this article through the R-bloggers website, which curates content from some of the best blogs about R that can be found on the web today. Since there are 750+ blogs that are curated on R-bloggers alone, you shouldn’t have a problem finding an article on the exact topic or use case you’re interested in!

A few notable R blogs:

3. Take an Online Course

As we’ve mentioned in previous blogs, there are a great number of online classes you can take to learn specific technical skills. In many instances, these courses are free, or very affordable, with some offering discounts to college students. Why spend thousands of dollars on a university course when you can get as good an understanding, if not better (IMHO), online?

Some sites that offer great R courses include:

4. Read Books

Many times, books are given a bad rap since most programming concepts can be found online, for free. Sure, if you are going to use the book just as a reference, you’d probably be better off saving that money and taking to Google search. However, if you’re a beginner, or someone who wants to learn the fundamentals, working through an entire book at the foundational level will provide a high degree of understanding.

There is a fantastic list of the best books for R at Data Science Central.

5. Experiment!

You can read articles and watch videos all day long, but if you never try it for yourself, you’ll never learn! Datazar is a great place for you to jump right in and experiment with what you’ve learned. You can immediately start by opening the R console or creating a notebook in our cloud-based environment. If you get stuck, you can consult with other users and even work with scripts that have been opened up by others!

I hope you found this helpful and as always if you would like to share any additional resources, feel free to drop them in the comments below!

Resources Included in this Article


The 5 Most Effective Ways to Learn R was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

To leave a comment for the author, please follow the link and comment on their blog: R Language in Datazar Blog on Medium. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Easy leave-one-out cross validation with pipelearner

Fri, 03/31/2017 - 14:10

(This article was first published on blogR, and kindly contributed to R-bloggers)

@drsimonj here to show you how to do leave-one-out cross validation using pipelearner.

 Leave-one-out cross validation

Leave-one-out is a type of cross validation whereby the following is done for each observation in the data:

  • Run model on all other observations
  • Use model to predict value for observation

This means that a model is fitted, and a prediction is made, n times, where n is the number of observations in your data.
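For intuition, here is what that procedure looks like by hand in base R, before we hand it over to pipelearner (a minimal sketch using lm and mtcars, the same model and data as the example below):

```r
# Leave-one-out by hand: fit on the n-1 other rows, predict the held-out row
data(mtcars)
n <- nrow(mtcars)

preds <- vapply(seq_len(n), function(i) {
  fit <- lm(hp ~ ., data = mtcars[-i, ])            # fit on all other observations
  predict(fit, newdata = mtcars[i, , drop = FALSE]) # predict the left-out observation
}, numeric(1))

length(preds) == n  # one out-of-sample prediction per observation
```

pipelearner automates exactly this bookkeeping, which is why the tibble it returns below has one row per observation.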

 Leave-one-out in pipelearner

pipelearner is a package for streamlining machine learning pipelines, including cross validation. If you’re new to it, check out blogR for other relevant posts.

To demonstrate, let’s use regression to predict horsepower (hp) with all other variables in the mtcars data set. Set this up in pipelearner as follows:

library(pipelearner)

pl <- pipelearner(mtcars, lm, hp ~ .)

How cross validation is done is handled by learn_cvpairs(). For leave-one-out, specify k = number of rows:

pl <- learn_cvpairs(pl, k = nrow(mtcars))

Finally, learn() the model on all folds:

pl <- learn(pl)

This can all be written in a pipeline:

pl <- pipelearner(mtcars, lm, hp ~ .) %>%
  learn_cvpairs(k = nrow(mtcars)) %>%
  learn()

pl
#> # A tibble: 32 × 9
#>    models.id cv_pairs.id train_p      fit target model     params
#>        <chr>       <chr>   <dbl>   <list>  <chr> <chr>     <list>
#> 1          1          01       1 <S3: lm>     hp    lm <list [1]>
#> 2          1          02       1 <S3: lm>     hp    lm <list [1]>
#> 3          1          03       1 <S3: lm>     hp    lm <list [1]>
#> 4          1          04       1 <S3: lm>     hp    lm <list [1]>
#> 5          1          05       1 <S3: lm>     hp    lm <list [1]>
#> 6          1          06       1 <S3: lm>     hp    lm <list [1]>
#> 7          1          07       1 <S3: lm>     hp    lm <list [1]>
#> 8          1          08       1 <S3: lm>     hp    lm <list [1]>
#> 9          1          09       1 <S3: lm>     hp    lm <list [1]>
#> 10         1          10       1 <S3: lm>     hp    lm <list [1]>
#> # ... with 22 more rows, and 2 more variables: train <list>, test <list>

 Evaluating performance

Performance can be evaluated in many ways depending on your model. We will calculate R2:

library(tidyverse)

# Extract true and predicted values of hp for each observation
pl <- pl %>%
  mutate(true = map2_dbl(test, target, ~as.data.frame(.x)[[.y]]),
         predicted = map2_dbl(fit, test, predict))

# Summarise results
results <- pl %>%
  summarise(
    sse = sum((predicted - true)^2),
    sst = sum(true^2)
  ) %>%
  mutate(r_squared = 1 - sse / sst)

results
#> # A tibble: 1 × 3
#>        sse    sst r_squared
#>      <dbl>  <dbl>     <dbl>
#> 1 41145.56 834278 0.9506812

Using leave-one-out cross validation, the regression model obtains an R2 of 0.95 when generalizing to predict horsepower in new data.

We’ll conclude with a plot of each true data point and its predicted value:

pl %>%
  ggplot(aes(true, predicted)) +
  geom_point(size = 2) +
  geom_abline(intercept = 0, slope = 1, linetype = 2) +
  theme_minimal() +
  labs(x = "True value", y = "Predicted value") +
  ggtitle("True against predicted values based\non leave-one-out cross validation")

 Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

To leave a comment for the author, please follow the link and comment on their blog: blogR.

#2: Even Easier Package Registration

Fri, 03/31/2017 - 05:31

Welcome to the second post in the rambling random R recommendation series, or R4 for short.

Two days ago I posted the initial (actual) post. It provided context for why we need package registration entries (tl;dr: because R CMD check now tests for it, and because it is The Right Thing to do, see documentation in the posts). I also showed how generating such a file src/init.c was essentially free as all it took was a single call to a new helper function added to R-devel by Brian Ripley and Kurt Hornik.
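With R-devel (or R 3.4.0 once released), that whole step is the single helper call below; the skeleton is printed to the console and can be redirected into src/init.c (the signature matches the backported script further down in this post):

```r
library(tools)

# run from the top-level directory of the package source:
package_native_routine_registration_skeleton(".")

# or write it straight into the package:
# package_native_routine_registration_skeleton(".", con = file("src/init.c", "w"))
```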

Now, to actually use R-devel you obviously need to have it accessible. There are a myriad of ways to achieve that: just compile it locally as I have done for years, use a Docker image as I showed in the post — or be creative with e.g. Travis or win-builder, both of which give you access to R-devel if you're clever about it.

But as no good deed goes unpunished, I was of course told off today for showing a Docker example, as Docker was not "easy". I think the formal answer to that is baloney. But let's leave that aside, and promise to discuss setting up Docker at another time.

R is after all … just R. So below please find a script you can save as, say, ~/bin/pnrrs.r. And calling it—even with R-release—will generate the same code snippet as I showed via Docker. Call it a one-off backport of the new helper function — with a half-life of a few weeks at best as we will have R 3.4.0 as default in just a few weeks. The script will then reduce to just the final line as the code will be present with R 3.4.0.

#!/usr/bin/r

library(tools)

.find_calls_in_package_code <- tools:::.find_calls_in_package_code
.read_description <- tools:::.read_description

## all what follows is from R-devel aka R 3.4.0 to be

package_ff_call_db <- function(dir) {
    ## A few packages such as CDM use base::.Call
    ff_call_names <- c(".C", ".Call", ".Fortran", ".External",
                       "base::.C", "base::.Call",
                       "base::.Fortran", "base::.External")

    predicate <- function(e) {
        (length(e) > 1L) &&
            !is.na(match(deparse(e[[1L]]), ff_call_names))
    }

    calls <- .find_calls_in_package_code(dir,
                                         predicate = predicate,
                                         recursive = TRUE)
    calls <- unlist(Filter(length, calls))

    if(!length(calls)) return(NULL)

    attr(calls, "dir") <- dir
    calls
}

native_routine_registration_db_from_ff_call_db <- function(calls, dir = NULL,
                                                           character_only = TRUE) {
    if(!length(calls)) return(NULL)

    ff_call_names <- c(".C", ".Call", ".Fortran", ".External")
    ff_call_args <- lapply(ff_call_names,
                           function(e) args(get(e, baseenv())))
    names(ff_call_args) <- ff_call_names
    ff_call_args_names <-
        lapply(lapply(ff_call_args,
                      function(e) names(formals(e))), setdiff, "...")

    if(is.null(dir))
        dir <- attr(calls, "dir")

    package <- # drop name
        as.vector(.read_description(file.path(dir, "DESCRIPTION"))["Package"])

    symbols <- character()
    nrdb <-
        lapply(calls,
               function(e) {
                   if (startsWith(deparse(e[[1L]]), "base::"))
                       e[[1L]] <- e[[1L]][3L]
                   ## First figure out whether ff calls had '...'.
                   pos <- which(unlist(Map(identical,
                                           lapply(e, as.character),
                                           "...")))
                   ## Then match the call with '...' dropped.
                   ## Note that only .NAME could be given by name or
                   ## positionally (the other ff interface named
                   ## arguments come after '...').
                   if(length(pos)) e <- e[-pos]
                   ## drop calls with only ...
                   if(length(e) < 2L) return(NULL)
                   cname <- as.character(e[[1L]])
                   ## The help says
                   ##
                   ## '.NAME' is always matched to the first argument
                   ## supplied (which should not be named).
                   ##
                   ## But some people do (Geneland ...).
                   nm <- names(e); nm[2L] <- ""; names(e) <- nm
                   e <- match.call(ff_call_args[[cname]], e)
                   ## Only keep ff calls where .NAME is character
                   ## or (optionally) a name.
                   s <- e[[".NAME"]]
                   if(is.name(s)) {
                       s <- deparse(s)[1L]
                       if(character_only) {
                           symbols <<- c(symbols, s)
                           return(NULL)
                       }
                   } else if(is.character(s)) {
                       s <- s[1L]
                   } else { ## expressions
                       symbols <<- c(symbols, deparse(s))
                       return(NULL)
                   }
                   ## Drop the ones where PACKAGE gives a different
                   ## package. Ignore those which are not char strings.
                   if(!is.null(p <- e[["PACKAGE"]]) &&
                      is.character(p) &&
                      !identical(p, package))
                       return(NULL)
                   n <- if(length(pos)) {
                            ## Cannot determine the number of args: use
                            ## -1 which might be ok for .External().
                            -1L
                        } else {
                            sum(is.na(match(names(e),
                                            ff_call_args_names[[cname]]))) - 1L
                        }
                   ## Could perhaps also record whether 's' was a symbol
                   ## or a character string ...
                   cbind(cname, s, n)
               })
    nrdb <- do.call(rbind, nrdb)
    nrdb <- as.data.frame(unique(nrdb), stringsAsFactors = FALSE)

    if(NROW(nrdb) == 0L || length(nrdb) != 3L)
        stop("no native symbols were extracted")
    nrdb[, 3L] <- as.numeric(nrdb[, 3L])
    nrdb <- nrdb[order(nrdb[, 1L], nrdb[, 2L], nrdb[, 3L]), ]
    nms <- nrdb[, "s"]
    dups <- unique(nms[duplicated(nms)])

    ## Now get the namespace info for the package.
    info <- parseNamespaceFile(basename(dir), dirname(dir))

    ## Could have ff calls with symbols imported from other packages:
    ## try dropping these eventually.
    imports <- info$imports
    imports <- imports[lengths(imports) == 2L]
    imports <- unlist(lapply(imports, `[[`, 2L))

    info <- info$nativeRoutines[[package]]
    ## Adjust native routine names for explicit remapping or
    ## namespace .fixes.
    if(length(symnames <- info$symbolNames)) {
        ind <- match(nrdb[, 2L], names(symnames), nomatch = 0L)
        nrdb[ind > 0L, 2L] <- symnames[ind]
    } else if(!character_only &&
              any((fixes <- info$registrationFixes) != "")) {
        ## There are packages which have not used the fixes, e.g. utf8latex
        ## fixes[1L] is a prefix, fixes[2L] is an undocumented suffix
        nrdb[, 2L] <- sub(paste0("^", fixes[1L]), "", nrdb[, 2L])
        if(nzchar(fixes[2L]))
            nrdb[, 2L] <- sub(paste0(fixes[2L], "$"), "", nrdb[, 2L])
    }
    ## See above.
    if(any(ind <- !is.na(match(nrdb[, 2L], imports))))
        nrdb <- nrdb[!ind, , drop = FALSE]

    ## Fortran entry points are mapped to l/case
    dotF <- nrdb$cname == ".Fortran"
    nrdb[dotF, "s"] <- tolower(nrdb[dotF, "s"])

    attr(nrdb, "package") <- package
    attr(nrdb, "duplicates") <- dups
    attr(nrdb, "symbols") <- unique(symbols)
    nrdb
}

format_native_routine_registration_db_for_skeleton <- function(nrdb, align = TRUE,
                                                               include_declarations = FALSE) {
    if(!length(nrdb))
        return(character())

    fmt1 <- function(x, n) {
        c(if(align) {
              paste(format(sprintf("    {\"%s\",", x[, 1L])),
                    format(sprintf(if(n == "Fortran")
                                       "(DL_FUNC) &F77_NAME(%s),"
                                   else
                                       "(DL_FUNC) &%s,",
                                   x[, 1L])),
                    format(sprintf("%d},", x[, 2L]), justify = "right"))
          } else {
              sprintf(if(n == "Fortran")
                          "    {\"%s\", (DL_FUNC) &F77_NAME(%s), %d},"
                      else
                          "    {\"%s\", (DL_FUNC) &%s, %d},",
                      x[, 1L], x[, 1L], x[, 2L])
          },
          "    {NULL, NULL, 0}")
    }

    package <- attr(nrdb, "package")
    dups <- attr(nrdb, "duplicates")
    symbols <- attr(nrdb, "symbols")

    nrdb <- split(nrdb[, -1L, drop = FALSE],
                  factor(nrdb[, 1L],
                         levels = c(".C", ".Call", ".Fortran", ".External")))

    has <- vapply(nrdb, NROW, 0L) > 0L
    nms <- names(nrdb)
    entries <- substring(nms, 2L)
    blocks <- Map(function(x, n) {
                      c(sprintf("static const R_%sMethodDef %sEntries[] = {", n, n),
                        fmt1(x, n),
                        "};",
                        "")
                  },
                  nrdb[has],
                  entries[has])

    decls <- c(
        "/* FIXME: ",
        "   Add declarations for the native routines registered below.",
        "*/")

    if(include_declarations) {
        decls <- c(
            "/* FIXME: ",
            "   Check these declarations against the C/Fortran source code.",
            "*/",
            if(NROW(y <- nrdb$.C)) {
                args <- sapply(y$n,
                               function(n) if(n >= 0)
                                   paste(rep("void *", n), collapse = ", ")
                               else "/* FIXME */")
                c("", "/* .C calls */",
                  paste0("extern void ", y$s, "(", args, ");"))
            },
            if(NROW(y <- nrdb$.Call)) {
                args <- sapply(y$n,
                               function(n) if(n >= 0)
                                   paste(rep("SEXP", n), collapse = ", ")
                               else "/* FIXME */")
                c("", "/* .Call calls */",
                  paste0("extern SEXP ", y$s, "(", args, ");"))
            },
            if(NROW(y <- nrdb$.Fortran)) {
                args <- sapply(y$n,
                               function(n) if(n >= 0)
                                   paste(rep("void *", n), collapse = ", ")
                               else "/* FIXME */")
                c("", "/* .Fortran calls */",
                  paste0("extern void F77_NAME(", y$s, ")(", args, ");"))
            },
            if(NROW(y <- nrdb$.External))
                c("", "/* .External calls */",
                  paste0("extern SEXP ", y$s, "(SEXP);"))
        )
    }

    headers <- if(NROW(nrdb$.Call) || NROW(nrdb$.External))
                   c("#include <R.h>", "#include <Rinternals.h>")
               else if(NROW(nrdb$.Fortran))
                   "#include <R_ext/RS.h>"
               else
                   character()

    c(headers,
      "#include <stdlib.h> // for NULL",
      "#include <R_ext/Rdynload.h>",
      "",
      if(length(symbols)) {
          c("/*",
            "  The following symbols/expresssions for .NAME have been omitted",
            "",
            strwrap(symbols, indent = 4, exdent = 4),
            "",
            "  Most likely possible values need to be added below.",
            "*/",
            "")
      },
      if(length(dups)) {
          c("/*",
            "  The following name(s) appear with different usages",
            "  e.g., with different numbers of arguments:",
            "",
            strwrap(dups, indent = 4, exdent = 4),
            "",
            "  This needs to be resolved in the tables and any declarations.",
            "*/",
            "")
      },
      decls,
      "",
      unlist(blocks, use.names = FALSE),
      ## We cannot use names with '.' in: WRE mentions replacing with "_"
      sprintf("void R_init_%s(DllInfo *dll)",
              gsub(".", "_", package, fixed = TRUE)),
      "{",
      sprintf("    R_registerRoutines(dll, %s);",
              paste0(ifelse(has,
                            paste0(entries, "Entries"),
                            "NULL"),
                     collapse = ", ")),
      "    R_useDynamicSymbols(dll, FALSE);",
      "}")
}

package_native_routine_registration_db <- function(dir, character_only = TRUE) {
    calls <- package_ff_call_db(dir)
    native_routine_registration_db_from_ff_call_db(calls, dir, character_only)
}

package_native_routine_registration_skeleton <- function(dir, con = stdout(),
                                                         align = TRUE,
                                                         character_only = TRUE,
                                                         include_declarations = TRUE) {
    nrdb <- package_native_routine_registration_db(dir, character_only)
    writeLines(format_native_routine_registration_db_for_skeleton(nrdb,
                                                                  align,
                                                                  include_declarations),
               con)
}

package_native_routine_registration_skeleton(".")  ## when R 3.4.0 is out you only need this line

Here I use /usr/bin/r as I happen to like littler a lot, but you can use Rscript the same way.

Easy enough now?

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

Take your data frames to the next level.

Fri, 03/31/2017 - 04:27

(This article was first published on R – Real Data, and kindly contributed to R-bloggers)

 

While finishing up R-rockstar Hadley Wickham’s book (Free Book – R for Data Science), the section on model building elaborates on something pretty cool that I had no idea about – list columns.

Most of us have probably seen the following data frame column format:

df <- data.frame("col_uno" = c(1,2,3),
                 "col_dos" = c('a','b','c'),
                 "col_tres" = factor(c("google", "apple", "amazon")))

And the output:

df
##   col_uno col_dos col_tres
## 1       1       a   google
## 2       2       b    apple
## 3       3       c   amazon

This is an awesome way to organize data and one of R’s strong points. However, we can use list functionality to go deeper. Check this out:

library(tidyverse)
library(datasets)

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

nested <- iris %>%
  group_by(Species) %>%
  nest()

nested
## # A tibble: 3 × 2
##      Species              data
##       <fctr>            <list>
## 1     setosa <tibble [50 × 4]>
## 2 versicolor <tibble [50 × 4]>
## 3  virginica <tibble [50 × 4]>

Using nest we can compartmentalize our data frame for readability and more efficient iteration. Here we can use map from the purrr package to compute the mean of each column in our nested data.

means <- map(nested$data, colMeans)
means
## [[1]]
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##        5.006        3.428        1.462        0.246
##
## [[2]]
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##        5.936        2.770        4.260        1.326
##
## [[3]]
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##        6.588        2.974        5.552        2.026

Once you’re done messing around with data-ception, use unnest to revert your data back to its original state.

head(unnest(nested))
## # A tibble: 6 × 5
##   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
##    <fctr>        <dbl>       <dbl>        <dbl>       <dbl>
## 1  setosa          5.1         3.5          1.4         0.2
## 2  setosa          4.9         3.0          1.4         0.2
## 3  setosa          4.7         3.2          1.3         0.2
## 4  setosa          4.6         3.1          1.5         0.2
## 5  setosa          5.0         3.6          1.4         0.2
## 6  setosa          5.4         3.9          1.7         0.4
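List columns really shine once you store fitted models alongside the data they were fitted on. Here is a small sketch on the same nested iris data; the fit and rsq column names (and the particular model formula) are just illustrative choices of mine, not from the book:

```r
library(tidyverse)

models <- iris %>%
  group_by(Species) %>%
  nest() %>%
  mutate(
    fit = map(data, ~ lm(Sepal.Length ~ Petal.Length, data = .x)),  # one model per species
    rsq = map_dbl(fit, ~ summary(.x)$r.squared)                     # one summary stat per model
  )

models %>% select(Species, rsq)
```

Each row now carries a data set, a model, and a statistic — all in one tidy data frame.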

I was pretty excited to learn about this property of data.frames and will definitely make use of it in the future. If you have any neat examples of nested dataset usage, please feel free to share in the comments.  As always, I’m happy to answer questions or talk data!

Kiefer Smith

To leave a comment for the author, please follow the link and comment on their blog: R – Real Data.

Who is old? Visualizing the concept of prospective ageing with animated population pyramids

Fri, 03/31/2017 - 02:00

(This article was first published on Ilya Kashnitsky, and kindly contributed to R-bloggers)

This post is about illustrating the concept of prospective ageing, a relatively fresh approach in demography to refine our understanding of population ageing. This visualization was created in collaboration with my colleague Michael Boissonneault: (mostly) his idea and (mostly) my implementation. The animated visualization builds upon Michael’s viz prepared for the submission to the highly anticipated event at the end of June 2017 – Rostock Retreat Visualization. My visualization of the provided Swedish dataset can be found in the previous post.

Prospective ageing

Over the past decades, alarmist views of an upcoming population ageing disaster have become widespread. True, with a growing number of countries approaching the end of the Demographic Transition, the average/median age of their populations is increasing rapidly, which is unprecedented in documented human history. But does that imply an unbearable burden of an elderly population in the near future? Not necessarily.

The demographic prospects depend a lot on how we define ageing. Quite recently Warren Sanderson and Sergei Scherbov proposed 1 2 a new way to look at population ageing, which they called Prospective Ageing. The underlying idea is really simple – age is not static: a person aged 65 (the conventional border delimiting the elderly population) today is in many aspects not the same as a person aged 65 half a century ago. Health and lifespan have improved a lot in recent decades, meaning that today people generally have many more remaining years of life at the moment they are recognized as elderly by the conventional standard. Thus, Sanderson and Scherbov proposed to define the elderly population based on the expected remaining length of life rather than on years lived. Such a refined view of population ageing disqualifies alarmist claims of an approaching demographic collapse. The seemingly paradoxical title of one of the latest papers by Sanderson and Scherbov 3 summarizes the phenomenon nicely: Faster Increases in Human Life Expectancy Could Lead to Slower Population Aging.

Of course, the choice of the new ageing threshold is a rather arbitrary question 4. It has become usual to set this threshold at a remaining life expectancy of 15 years.
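Given a life table with remaining life expectancy (ex) by age, the prospective old-age threshold is simply the first age at which ex drops to 15 years or below. A toy sketch in R — the ex values here are made up purely for illustration, not taken from the Swedish data:

```r
# toy life table fragment: age and remaining life expectancy (illustrative numbers)
lt <- data.frame(age = 60:75,
                 ex  = seq(24, 9, by = -1))

# conventional threshold: fixed at age 65
# prospective threshold: first age where ex <= 15
prospective_threshold <- min(lt$age[lt$ex <= 15])
prospective_threshold
# [1] 69  (with these made-up values)
```

This is exactly the old_p = ex <= 15 flag computed in the data preparation step below, next to the conventional old_c = Age >= 65.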

Population pyramids for Sweden

The population pyramid is a simple and nice way to represent population composition and to compare changes in the age structure of a given population over time. We show the difference between the conventional and prospective approaches to defining the elderly population using Swedish data for the last half century. Sweden is a natural choice for demographers aiming to play with rich and reliable data.

The data used for this visualization comes from Human Mortality Database. It can be easily accessed from an R session using HMDHFDplus package by Tim Riffe (for examples see my previous posts – one and two). For this exercise, I will use the dataset for Sweden that was provided for an application task for Rostock Retreat Visualization 5.

Data preparation

# load packages
library(tidyverse)
library(extrafont)
myfont <- "Ubuntu Mono"

# download data
df_swe <- read_csv("http://www.rostock-retreat.org/files/application2017/SWE.csv")
# copy at https://ikashnitsky.github.io/doc/misc/application-rostock-retreat/SWE.csv

# define the selection of years to visualize
years <- c(seq(1965, 2010, 5), 2014)

df <- df_swe %>%
  select(Year, Sex, Age, Exposure, ex) %>%
  filter(Year %in% years) %>%
  mutate(old_c = Age >= 65,
         old_p = ex <= 15) %>%
  gather("type", "old", contains("old")) %>%
  group_by(Year, Sex, type) %>%
  mutate(share = Exposure / sum(Exposure)) %>%
  ungroup() %>%
  mutate(share = ifelse(Sex == 'f', share, -share))

names(df) <- names(df) %>% tolower()

df_old <- df %>%
  filter(old == T) %>%
  group_by(year, sex, type, old) %>%
  summarise(cum_old = sum(share)) %>%
  ungroup()

Visualization

Let’s first have a look at the pyramids in 1965, 1990, and 2014 (the latest available year).

gg_three <- ggplot(df %>% filter(year %in% c(1965, 1990, 2014))) +
  geom_bar(aes(x = age, y = share, fill = sex, alpha = old),
           stat = 'identity', width = 1) +
  geom_vline(xintercept = 64.5, size = .5, color = 'gold') +
  scale_y_continuous(breaks = c(-.01, 0, .01),
                     labels = c(.01, 0, .01),
                     limits = c(-.02, .02),
                     expand = c(0, 0)) +
  facet_grid(year ~ type) +
  theme_minimal(base_family = 'Ubuntu Mono') +
  theme(strip.text = element_blank(),
        legend.position = 'none',
        plot.title = element_text(hjust = 0.5, size = 20),
        plot.caption = element_text(hjust = 0, size = 10)) +
  coord_flip() +
  labs(y = NULL, x = 'Age') +
  geom_text(data = data_frame(type = c('old_c', 'old_p'),
                              label = c('CONVENTIONAL', 'PROSPECTIVE')),
            aes(label = label), y = 0, x = 50,
            size = 5, vjust = 1, family = 'Ubuntu Mono') +
  geom_text(data = df_old %>% filter(year %in% c(1965, 1990, 2014), sex == 'f'),
            aes(label = year), y = 0, x = 30,
            vjust = 1, hjust = .5, size = 7, family = 'Ubuntu Mono') +
  geom_text(data = df_old %>% filter(year %in% c(1965, 1990, 2014), sex == 'f'),
            aes(label = paste('Elderly\nfemales\n', round(cum_old*100, 1), '%')),
            y = .0125, x = 105, vjust = 1, hjust = .5,
            size = 4, family = 'Ubuntu Mono') +
  geom_text(data = df_old %>% filter(year %in% c(1965, 1990, 2014), sex == 'm'),
            aes(label = paste('Elderly\nmales\n', round(-cum_old*100, 1), '%')),
            y = -.0125, x = 105, vjust = 1, hjust = .5,
            size = 4, family = 'Ubuntu Mono')

# ggsave("figures/three-years.png", gg_three, width = 6, height = 8)

Animated pyramid

To get an animated pyramid I simply saved all the separate plots and then used the very convenient free online tool GIFCreator to make an animated image 6.

note <- 'The population pyramid can be used to compare change in the age structure of a given population over time. In many cases, doing so gives the impression of rapid aging. This is due to the fact that age is represented as a static variable; however, as Sanderson and Scherbov showed repeatedly, age is not static: a person age 65 in 1965 is in many aspects not the same as a person age 65 in 2015. In the right panel, old age is considered to start when the period remaining life expectancy reaches 15 years, thereby providing another look at the change in the age structure of a population. The gold line deliminates the conventional border of old age at 65. Elderly populations are filled with non-transparent colors. Authors: Michael Boissonneault, Ilya Kashnitsky (NIDI)'

# I will store the plots in a list
plots <- list()

for (i in 1:length(years)) {
  gg <- ggplot(df %>% filter(year == years[[i]])) +
    geom_bar(aes(x = age, y = share, fill = sex, alpha = old),
             stat = 'identity', width = 1) +
    geom_vline(xintercept = 64.5, size = .5, color = 'gold') +
    scale_y_continuous(breaks = c(-.01, 0, .01),
                       labels = c(.01, 0, .01),
                       limits = c(-.02, .02),
                       expand = c(0, 0)) +
    facet_wrap(~ type, ncol = 2) +
    theme_minimal(base_family = 'Ubuntu Mono') +
    theme(strip.text = element_blank(),
          legend.position = 'none',
          plot.title = element_text(hjust = 0.5, size = 20),
          plot.caption = element_text(hjust = 0, size = 10)) +
    coord_flip() +
    labs(title = paste("Sweden", years[i]),
         caption = paste(strwrap(note, width = 106), collapse = '\n'),
         y = NULL, x = 'Age') +
    geom_text(data = data_frame(type = c('old_c', 'old_p'),
                                label = c('CONVENTIONAL', 'PROSPECTIVE')),
              aes(label = label), y = 0, x = 115,
              size = 5, vjust = 1, family = 'Ubuntu Mono') +
    geom_text(data = df_old %>% filter(year == years[[i]], sex == 'f'),
              aes(label = paste('Elderly\nfemales\n', round(cum_old*100, 1), '%')),
              y = .0125, x = 105, vjust = 1, hjust = .5,
              size = 4, family = 'Ubuntu Mono') +
    geom_text(data = df_old %>% filter(year == years[[i]], sex == 'm'),
              aes(label = paste('Elderly\nmales\n', round(-cum_old*100, 1), '%')),
              y = -.0125, x = 105, vjust = 1, hjust = .5,
              size = 4, family = 'Ubuntu Mono')

  plots[[i]] <- gg
}

# # a loop to save the plots
# for (i in 1:length(years)) {
#   ggsave(paste0('figures/swe-', years[i], '.png'), plots[[i]],
#          width = 8, height = 5.6)
# }

  1. Sanderson, W. C., & Scherbov, S. (2005). Average remaining lifetimes can increase as human populations age. Nature, 435(7043), 811–813. Retrieved from http://www.nature.com/nature/journal/v435/n7043/abs/nature03593.html 

  2. Sanderson, W. C., & Scherbov, S. (2010). Remeasuring Aging. Science, 329(5997), 1287–1288. https://doi.org/10.1126/science.1193647 

  3. Sanderson, W. C., & Scherbov, S. (2015). Faster Increases in Human Life Expectancy Could Lead to Slower Population Aging. PLoS ONE, 10(4), e0121922. http://doi.org/10.1371/journal.pone.0121922 

  4. See the working paper of my colleagues devoted to this question 

  5. By using this data, I agree to the user agreement 

  6. I did try to play with the package gganimate, though it produced a strange output. 

To leave a comment for the author, please follow the link and comment on their blog: Ilya Kashnitsky.

Building meaningful machine learning models for disease prediction

Fri, 03/31/2017 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

Webinar for the ISDS R Group

This document presents the code I used to produce the example analysis and figures shown in my webinar on building meaningful machine learning models for disease prediction.

My webinar slides are available on Github

Description: Dr Shirin Glander will go over her work on building machine-learning models to predict the course of different diseases. She will cover building a model, evaluating its performance, and answering or addressing different disease-related questions using machine learning. Her talk will cover the theory of machine learning as it is applied using R.

Setup

All analyses are done in R using RStudio. For detailed session information including R version, operating system and package versions, see the sessionInfo() output at the end of this document.

All figures are produced with ggplot2.

The dataset

The dataset I am using in these example analyses is the Breast Cancer Wisconsin (Diagnostic) Dataset. The data was downloaded from the UC Irvine Machine Learning Repository.

The first dataset looks at the predictor classes:

  • malignant or
  • benign breast mass.

The features characterize cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses:

  • Sample ID (code number)
  • Clump thickness
  • Uniformity of cell size
  • Uniformity of cell shape
  • Marginal adhesion
  • Single epithelial cell size
  • Number of bare nuclei
  • Bland chromatin
  • Number of normal nuclei
  • Mitosis
  • Classes, i.e. diagnosis
bc_data <- read.table("datasets/breast-cancer-wisconsin.data.txt",
                      header = FALSE, sep = ",")

colnames(bc_data) <- c("sample_code_number", "clump_thickness", "uniformity_of_cell_size",
                       "uniformity_of_cell_shape", "marginal_adhesion",
                       "single_epithelial_cell_size", "bare_nuclei", "bland_chromatin",
                       "normal_nucleoli", "mitosis", "classes")

bc_data$classes <- ifelse(bc_data$classes == "2", "benign",
                          ifelse(bc_data$classes == "4", "malignant", NA))

Missing data

bc_data[bc_data == "?"] <- NA

# how many NAs are in the data
length(which(is.na(bc_data)))
## [1] 16

# how many samples would we lose, if we removed them?
nrow(bc_data)
## [1] 699

nrow(bc_data[is.na(bc_data), ])
## [1] 16

Missing values are imputed with the mice package.

# impute missing data
library(mice)

bc_data[, 2:10] <- apply(bc_data[, 2:10], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data[, 2:10], print = FALSE)
bc_data <- cbind(bc_data[, 11, drop = FALSE], mice::complete(dataset_impute, 1))

bc_data$classes <- as.factor(bc_data$classes)

# how many benign and malignant cases are there?
summary(bc_data$classes)

Data exploration
  • Response variable for classification
library(ggplot2)

ggplot(bc_data, aes(x = classes, fill = classes)) +
  geom_bar()

  • Response variable for regression
ggplot(bc_data, aes(x = clump_thickness)) +
  geom_histogram(bins = 10)

  • Principal Component Analysis
library(pcaGoPromoter)
library(ellipse)

# perform pca and extract scores
pcaOutput <- pca(t(bc_data[, -1]), printDropped = FALSE, scale = TRUE, center = TRUE)
pcaOutput2 <- as.data.frame(pcaOutput$scores)

# define groups for plotting
pcaOutput2$groups <- bc_data$classes
centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean)

conf.rgn <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t)
  data.frame(groups = as.character(t),
             ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]),
                     centre = as.matrix(centroids[centroids$groups == t, 2:3]),
                     level = 0.95),
             stringsAsFactors = FALSE)))

ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) +
  geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) +
  geom_point(size = 2, alpha = 0.6) +
  scale_color_brewer(palette = "Set1") +
  labs(color = "", fill = "",
       x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
       y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance"))

  • Features
library(tidyr)

gather(bc_data, x, y, clump_thickness:mitosis) %>%
  ggplot(aes(x = y, color = classes, fill = classes)) +
  geom_density(alpha = 0.3) +
  facet_wrap(~ x, scales = "free", ncol = 3)

Machine Learning packages for R

caret

# configure multicore
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

library(caret)

Training, validation and test data

set.seed(42)
index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
train_data <- bc_data[index, ]
test_data <- bc_data[-index, ]

library(dplyr)

rbind(data.frame(group = "train", train_data),
      data.frame(group = "test", test_data)) %>%
  gather(x, y, clump_thickness:mitosis) %>%
  ggplot(aes(x = y, color = group, fill = group)) +
  geom_density(alpha = 0.3) +
  facet_wrap(~ x, scales = "free", ncol = 3)

Regression

set.seed(42)
model_glm <- caret::train(clump_thickness ~ .,
                          data = train_data,
                          method = "glm",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv",
                                                   number = 10,
                                                   repeats = 10,
                                                   savePredictions = TRUE,
                                                   verboseIter = FALSE))

model_glm
## Generalized Linear Model
##
## 490 samples
##   9 predictor
##
## Pre-processing: scaled (9), centered (9)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 441, 441, 440, 442, 441, 440, ...
## Resampling results:
##
##   RMSE      Rsquared
##   1.974296  0.5016141
##

predictions <- predict(model_glm, test_data)

# model_glm$finalModel$linear.predictors == model_glm$finalModel$fitted.values
data.frame(residuals = resid(model_glm),
           predictors = model_glm$finalModel$linear.predictors) %>%
  ggplot(aes(x = predictors, y = residuals)) +
  geom_jitter() +
  geom_smooth(method = "lm")

# y == train_data$clump_thickness
data.frame(residuals = resid(model_glm),
           y = model_glm$finalModel$y) %>%
  ggplot(aes(x = y, y = residuals)) +
  geom_jitter() +
  geom_smooth(method = "lm")

data.frame(actual = test_data$clump_thickness,
           predicted = predictions) %>%
  ggplot(aes(x = actual, y = predicted)) +
  geom_jitter() +
  geom_smooth(method = "lm")

Classification

Decision trees

rpart

library(rpart)
library(rpart.plot)

set.seed(42)
fit <- rpart(classes ~ .,
             data = train_data,
             method = "class",
             control = rpart.control(xval = 10, minbucket = 2, cp = 0),
             parms = list(split = "information"))

rpart.plot(fit, extra = 100)

Random Forests

Random Forests predictions are based on the generation of multiple classification trees. They can be used for both classification and regression tasks. Here, I show a classification task.

set.seed(42)
model_rf <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10,
                                                  repeats = 10,
                                                  savePredictions = TRUE,
                                                  verboseIter = FALSE))

When you specify savePredictions = TRUE, you can access the cross-validation results with model_rf$pred.
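As a quick sketch of what those saved predictions give you (this assumes the fitted model_rf object from above; the column names pred, obs, mtry and Resample are caret's conventions), you can recompute the per-fold accuracy yourself:

```r
library(dplyr)

# model_rf$pred holds one row per held-out observation, per resample and
# per tuning candidate, so filter to the winning mtry value first
model_rf$pred %>%
  filter(mtry == model_rf$bestTune$mtry) %>%
  group_by(Resample) %>%
  summarise(accuracy = mean(pred == obs)) %>%
  head()
```

This is handy for spotting folds where the model performed unusually badly.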

model_rf$finalModel$confusion

##           benign malignant class.error
## benign       313         8  0.02492212
## malignant      4       165  0.02366864

  • Feature Importance
imp <- model_rf$finalModel$importance
imp[order(imp, decreasing = TRUE), ]

##     uniformity_of_cell_size    uniformity_of_cell_shape
##                   54.416003                   41.553022
##             bland_chromatin                 bare_nuclei
##                   29.343027                   28.483842
##             normal_nucleoli single_epithelial_cell_size
##                   19.239635                   18.480155
##             clump_thickness           marginal_adhesion
##                   13.276702                   12.143355
##                     mitosis
##                    3.081635

# estimate variable importance
importance <- varImp(model_rf, scale = TRUE)
plot(importance)

  • predicting test data
confusionMatrix(predict(model_rf, test_data), test_data$classes)

## Confusion Matrix and Statistics
##
##            Reference
## Prediction  benign malignant
##   benign       133         2
##   malignant      4        70
##
##                Accuracy : 0.9713
##                  95% CI : (0.9386, 0.9894)
##     No Information Rate : 0.6555
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.9369
##  Mcnemar's Test P-Value : 0.6831
##
##             Sensitivity : 0.9708
##             Specificity : 0.9722
##          Pos Pred Value : 0.9852
##          Neg Pred Value : 0.9459
##              Prevalence : 0.6555
##          Detection Rate : 0.6364
##    Detection Prevalence : 0.6459
##       Balanced Accuracy : 0.9715
##
##        'Positive' Class : benign

results <- data.frame(actual = test_data$classes,
                      predict(model_rf, test_data, type = "prob"))
results$prediction <- ifelse(results$benign > 0.5, "benign",
                             ifelse(results$malignant > 0.5, "malignant", NA))
results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE)

ggplot(results, aes(x = prediction, fill = correct)) +
  geom_bar(position = "dodge")

ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) +
  geom_jitter(size = 3, alpha = 0.6)

Extreme gradient boosting trees

Extreme gradient boosting (XGBoost) is a faster and improved implementation of gradient boosting for supervised learning.

“XGBoost uses a more regularized model formalization to control over-fitting, which gives it better performance.” Tianqi Chen, developer of xgboost

XGBoost is a tree ensemble model, meaning that it sums the predictions of a set of classification and regression trees (CART). In that respect, XGBoost is similar to Random Forests, but it uses a different approach to model training. It can be used for both classification and regression tasks. Here, I show a classification task.
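To make the sum-of-trees idea concrete, here is a small sketch using the xgboost package directly, independent of the caret workflow used in this tutorial (mtcars stands in as a toy dataset; it is not part of the original analysis):

```r
library(xgboost)

# Toy binary task: predict transmission type from three mtcars features
x <- as.matrix(mtcars[, c("mpg", "wt", "hp")])
y <- mtcars$am

# Each boosting round adds one tree; the final score is the sum of the
# leaf values of all trees, mapped through the logistic link function
bst <- xgboost(data = x, label = y, nrounds = 5, max_depth = 2,
               objective = "binary:logistic", verbose = 0)

# Inspect the individual trees: Quality holds the split gain,
# Cover the sum of second-order gradients in each node
xgb.model.dt.tree(model = bst)
```

Because later trees are fit to the residual errors of the running sum, the trees are sequential and dependent, which is the key contrast with the independently grown trees of a Random Forest.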

set.seed(42)
model_xgb <- caret::train(classes ~ .,
                          data = train_data,
                          method = "xgbTree",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv",
                                                   number = 10,
                                                   repeats = 10,
                                                   savePredictions = TRUE,
                                                   verboseIter = FALSE))

  • Feature Importance
importance <- varImp(model_xgb, scale = TRUE)
plot(importance)

  • predicting test data
confusionMatrix(predict(model_xgb, test_data), test_data$classes)

## Confusion Matrix and Statistics
##
##            Reference
## Prediction  benign malignant
##   benign       132         2
##   malignant      5        70
##
##                Accuracy : 0.9665
##                  95% CI : (0.9322, 0.9864)
##     No Information Rate : 0.6555
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.9266
##  Mcnemar's Test P-Value : 0.4497
##
##             Sensitivity : 0.9635
##             Specificity : 0.9722
##          Pos Pred Value : 0.9851
##          Neg Pred Value : 0.9333
##              Prevalence : 0.6555
##          Detection Rate : 0.6316
##    Detection Prevalence : 0.6411
##       Balanced Accuracy : 0.9679
##
##        'Positive' Class : benign

results <- data.frame(actual = test_data$classes,
                      predict(model_xgb, test_data, type = "prob"))
results$prediction <- ifelse(results$benign > 0.5, "benign",
                             ifelse(results$malignant > 0.5, "malignant", NA))
results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE)

ggplot(results, aes(x = prediction, fill = correct)) +
  geom_bar(position = "dodge")

ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) +
  geom_jitter(size = 3, alpha = 0.6)

Feature Selection

Performing feature selection on the whole dataset would lead to prediction bias; we therefore need to run the whole modeling process on the training data alone!
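One way to honour this in practice (a sketch using caret's selection-by-filter wrapper; this is not part of the original analysis, and it assumes the train_data split from above) is to nest the filtering inside the resampling loop itself, so the held-out folds never influence which features are kept:

```r
library(caret)

# sbf() applies a univariate filter within each cross-validation fold,
# then fits the model (here a Random Forest via rfSBF) on the survivors
set.seed(42)
model_sbf <- sbf(x = train_data[, -1],
                 y = train_data$classes,
                 sbfControl = sbfControl(functions = rfSBF,
                                         method = "cv", number = 10))

# features that survived the filter on the full training set
predictors(model_sbf)
```

The resampled performance estimate from such a nested procedure reflects the whole pipeline, selection step included, rather than a model whose features were chosen with knowledge of the test folds.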

  • Correlation

Correlations between all features are calculated and visualised with the corrplot package. I then remove all features with a correlation higher than 0.7, keeping the feature with the lower mean absolute correlation.

library(corrplot)

# calculate correlation matrix
corMatMy <- cor(train_data[, -1])
corrplot(corMatMy, order = "hclust")

# Apply correlation filter at 0.70
highlyCor <- colnames(train_data[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]

## Compare row 2 and column 3 with corr 0.899
## Means: 0.696 vs 0.575 so flagging column 2
## Compare row 3 and column 7 with corr 0.736
## Means: 0.654 vs 0.55 so flagging column 3
## All correlations <= 0.7

# which variables are flagged for removal?
highlyCor

## [1] "uniformity_of_cell_size" "uniformity_of_cell_shape"

# then we remove these variables
train_data_cor <- train_data[, which(!colnames(train_data) %in% highlyCor)]

  • Recursive Feature Elimination (RFE)

Another way to choose features is with Recursive Feature Elimination. RFE uses a Random Forest algorithm to test combinations of features and rates each combination with an accuracy score. The combination with the highest score is usually chosen.

set.seed(7)
results_rfe <- rfe(x = train_data[, -1],
                   y = train_data$classes,
                   sizes = c(1:9),
                   rfeControl = rfeControl(functions = rfFuncs, method = "cv", number = 10))

# chosen features
predictors(results_rfe)

## [1] "bare_nuclei"                 "uniformity_of_cell_size"
## [3] "clump_thickness"             "uniformity_of_cell_shape"
## [5] "bland_chromatin"             "marginal_adhesion"
## [7] "normal_nucleoli"             "single_epithelial_cell_size"
## [9] "mitosis"

train_data_rfe <- train_data[, c(1, which(colnames(train_data) %in% predictors(results_rfe)))]

  • Genetic Algorithm (GA)

The Genetic Algorithm (GA) has been developed based on evolutionary principles of natural selection: It aims to optimize a population of individuals with a given set of genotypes by modeling selection over time. In each generation (i.e. iteration), each individual’s fitness is calculated based on their genotypes. Then, the fittest individuals are chosen to produce the next generation. This subsequent generation of individuals will have genotypes resulting from (re-) combinations of the parental alleles. These new genotypes will again determine each individual’s fitness. This selection process is iterated for a specified number of generations and (ideally) leads to fixation of the fittest alleles in the gene pool.

This concept of optimization can be applied to non-evolutionary models as well, like feature selection processes in machine learning.

set.seed(27)
model_ga <- gafs(x = train_data[, -1],
                 y = train_data$classes,
                 iters = 10,   # generations of algorithm
                 popSize = 10, # population size for each generation
                 levels = c("malignant", "benign"),
                 gafsControl = gafsControl(functions = rfGA,   # Assess fitness with RF
                                           method = "cv",      # 10 fold cross validation
                                           genParallel = TRUE, # Use parallel programming
                                           allowParallel = TRUE))

plot(model_ga) # Plot mean fitness (AUC) by generation

train_data_ga <- train_data[, c(1, which(colnames(train_data) %in% model_ga$ga$final))]

Grid search with caret
  • Automatic Grid
set.seed(42)
model_rf_tune_auto <- caret::train(classes ~ .,
                                   data = train_data,
                                   method = "rf",
                                   preProcess = c("scale", "center"),
                                   trControl = trainControl(method = "repeatedcv",
                                                            number = 10,
                                                            repeats = 10,
                                                            savePredictions = TRUE,
                                                            verboseIter = FALSE,
                                                            search = "random"),
                                   tuneLength = 15)
model_rf_tune_auto

## Random Forest
##
## 490 samples
##   9 predictor
##   2 classes: 'benign', 'malignant'
##
## Pre-processing: scaled (9), centered (9)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 442, 441, 441, 441, 441, 441, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa
##   1     0.9692153  0.9323624
##   2     0.9704277  0.9350498
##   5     0.9645085  0.9216721
##   6     0.9639087  0.9201998
##   7     0.9632842  0.9186919
##   8     0.9626719  0.9172257
##   9     0.9636801  0.9195036
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

plot(model_rf_tune_auto)

  • Manual Grid

  • mtry: Number of variables randomly sampled as candidates at each split.

set.seed(42)
grid <- expand.grid(mtry = c(1:10))

model_rf_tune_man <- caret::train(classes ~ .,
                                  data = train_data,
                                  method = "rf",
                                  preProcess = c("scale", "center"),
                                  trControl = trainControl(method = "repeatedcv",
                                                           number = 10,
                                                           repeats = 10,
                                                           savePredictions = TRUE,
                                                           verboseIter = FALSE,
                                                           search = "random"),
                                  tuneGrid = grid)
model_rf_tune_man

## Random Forest
##
## 490 samples
##   9 predictor
##   2 classes: 'benign', 'malignant'
##
## Pre-processing: scaled (9), centered (9)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 442, 441, 441, 441, 441, 441, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa
##    1    0.9696153  0.9332392
##    2    0.9706440  0.9354737
##    3    0.9696194  0.9330647
##    4    0.9661495  0.9253163
##    5    0.9649252  0.9225586
##    6    0.9653209  0.9233806
##    7    0.9634881  0.9192265
##    8    0.9624718  0.9169227
##    9    0.9641005  0.9203072
##   10    0.9628760  0.9176675
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

plot(model_rf_tune_man)

Grid search with h2o

The R package h2o provides a convenient interface to H2O, an open-source machine learning and deep learning platform. H2O offers distributed implementations of a wide range of common machine learning algorithms for classification, regression and deep learning.

library(h2o)
h2o.init(nthreads = -1)

## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
##     C:\Users\s_glan02\AppData\Local\Temp\RtmpwDqf33/h2o_s_glan02_started_from_r.out
##     C:\Users\s_glan02\AppData\Local\Temp\RtmpwDqf33/h2o_s_glan02_started_from_r.err
##
## Starting H2O JVM and connecting: . Connection successful!
##
## R is connected to the H2O cluster:
##     H2O cluster uptime:         1 seconds 815 milliseconds
##     H2O cluster version:        3.10.3.6
##     H2O cluster version age:    1 month and 10 days
##     H2O cluster name:           H2O_started_from_R_s_glan02_tvy462
##     H2O cluster total nodes:    1
##     H2O cluster total memory:   3.54 GB
##     H2O cluster total cores:    8
##     H2O cluster allowed cores:  8
##     H2O cluster healthy:        TRUE
##     H2O Connection ip:          localhost
##     H2O Connection port:        54321
##     H2O Connection proxy:       NA
##     R Version:                  R version 3.3.3 (2017-03-06)

bc_data_hf <- as.h2o(bc_data)

h2o.describe(bc_data_hf) %>%
  gather(x, y, Zeros:Sigma) %>%
  mutate(group = ifelse(x %in% c("Min", "Max", "Mean"), "min, mean, max",
                        ifelse(x %in% c("NegInf", "PosInf"), "Inf", "sigma, zeros"))) %>%
  ggplot(aes(x = Label, y = as.numeric(y), color = x)) +
  geom_point(size = 4, alpha = 0.6) +
  scale_color_brewer(palette = "Set1") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  facet_grid(group ~ ., scales = "free") +
  labs(x = "Feature", y = "Value", color = "")

library(reshape2) # for melting

bc_data_hf[, 1] <- h2o.asfactor(bc_data_hf[, 1])

cor <- h2o.cor(bc_data_hf)
rownames(cor) <- colnames(cor)

melt(cor) %>%
  mutate(Var2 = rep(rownames(cor), nrow(cor))) %>%
  mutate(Var2 = factor(Var2, levels = colnames(cor))) %>%
  mutate(variable = factor(variable, levels = colnames(cor))) %>%
  ggplot(aes(x = variable, y = Var2, fill = value)) +
  geom_tile(width = 0.9, height = 0.9) +
  scale_fill_gradient2(low = "white", high = "red", name = "Cor.") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(x = "", y = "")

Training, validation and test data

splits <- h2o.splitFrame(bc_data_hf, ratios = c(0.7, 0.15), seed = 1)

train <- splits[[1]]
valid <- splits[[2]]
test  <- splits[[3]]

response <- "classes"
features <- setdiff(colnames(train), response)

summary(train$classes, exact_quantiles = TRUE)

## classes
## benign   :317
## malignant:174

summary(valid$classes, exact_quantiles = TRUE)

## classes
## benign   :71
## malignant:35

summary(test$classes, exact_quantiles = TRUE)

## classes
## benign   :70
## malignant:32

pca <- h2o.prcomp(training_frame = train,
                  x = features,
                  validation_frame = valid,
                  transform = "NORMALIZE",
                  impute_missing = TRUE,
                  k = 3,
                  seed = 42)

eigenvec <- as.data.frame(pca@model$eigenvectors)
eigenvec$label <- features

library(ggrepel)
ggplot(eigenvec, aes(x = pc1, y = pc2, label = label)) +
  geom_point(color = "navy", alpha = 0.7) +
  geom_text_repel()

Classification

Random Forest

hyper_params <- list(
  ntrees = c(25, 50, 75, 100),
  max_depth = c(10, 20, 30),
  min_rows = c(1, 3, 5)
)

search_criteria <- list(
  strategy = "RandomDiscrete",
  max_models = 50,
  max_runtime_secs = 360,
  stopping_rounds = 5,
  stopping_metric = "AUC",
  stopping_tolerance = 0.0005,
  seed = 42
)

rf_grid <- h2o.grid(algorithm = "randomForest", # h2o.randomForest,
                                                # alternatively h2o.gbm for Gradient boosting trees
                    x = features,
                    y = response,
                    grid_id = "rf_grid",
                    training_frame = train,
                    validation_frame = valid,
                    nfolds = 25,
                    fold_assignment = "Stratified",
                    hyper_params = hyper_params,
                    search_criteria = search_criteria,
                    seed = 42)

# performance metrics where smaller is better -> order with decreasing = FALSE
sort_options_1 <- c("mean_per_class_error", "mse", "err", "logloss")

for (sort_by_1 in sort_options_1) {
  grid <- h2o.getGrid("rf_grid", sort_by = sort_by_1, decreasing = FALSE)
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  h2o.saveModel(best_model, path = "models", force = TRUE)
}

# performance metrics where bigger is better -> order with decreasing = TRUE
sort_options_2 <- c("auc", "precision", "accuracy", "recall", "specificity")

for (sort_by_2 in sort_options_2) {
  grid <- h2o.getGrid("rf_grid", sort_by = sort_by_2, decreasing = TRUE)
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  h2o.saveModel(best_model, path = "models", force = TRUE)
}

files <- list.files(path = "models")
rf_models <- files[grep("rf_grid_model", files)]

for (model_id in rf_models) {
  path <- paste0("U:\\Github_blog\\Webinar\\Webinar_ML_for_disease\\models\\", model_id)
  best_model <- h2o.loadModel(path)
  mse_auc_test <- data.frame(model_id = model_id,
                             mse = h2o.mse(h2o.performance(best_model, test)),
                             auc = h2o.auc(h2o.performance(best_model, test)))
  if (model_id == rf_models[[1]]) {
    mse_auc_test_comb <- mse_auc_test
  } else {
    mse_auc_test_comb <- rbind(mse_auc_test_comb, mse_auc_test)
  }
}

mse_auc_test_comb %>%
  gather(x, y, mse:auc) %>%
  ggplot(aes(x = model_id, y = y, fill = model_id)) +
  facet_grid(x ~ ., scales = "free") +
  geom_bar(stat = "identity", alpha = 0.8, position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        plot.margin = unit(c(0.5, 0, 0, 1.5), "cm")) +
  labs(x = "", y = "value", fill = "")

for (model_id in rf_models) {
  best_model <- h2o.getModel(model_id)
  finalRf_predictions <- data.frame(model_id = rep(best_model@model_id, nrow(test)),
                                    actual = as.vector(test$classes),
                                    as.data.frame(h2o.predict(object = best_model, newdata = test)))

  finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict,
                                         "yes", "no")
  finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, "benign",
                                                  ifelse(finalRf_predictions$malignant > 0.8,
                                                         "malignant", "uncertain"))
  finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes",
                                                   ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no"))

  if (model_id == rf_models[[1]]) {
    finalRf_predictions_comb <- finalRf_predictions
  } else {
    finalRf_predictions_comb <- rbind(finalRf_predictions_comb, finalRf_predictions)
  }
}

finalRf_predictions_comb %>%
  ggplot(aes(x = actual, fill = accurate)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  facet_wrap(~ model_id, ncol = 3) +
  labs(fill = "Were\npredictions\naccurate?",
       title = "Default predictions")

finalRf_predictions_comb %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  facet_wrap(~ model_id, ncol = 3) +
  labs(fill = "Were\npredictions\naccurate?",
       title = "Stringent predictions")

rf_model <- h2o.loadModel("models/rf_grid_model_6")
h2o.varimp_plot(rf_model)

#h2o.varimp(rf_model)
h2o.mean_per_class_error(rf_model, train = TRUE, valid = TRUE, xval = TRUE)

##       train       valid        xval
## 0.024674571 0.007042254 0.023097284

h2o.confusionMatrix(rf_model, valid = TRUE)

## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.293125896751881:
##           benign malignant    Error    Rate
## benign        70         1 0.014085   =1/71
## malignant      0        35 0.000000   =0/35
## Totals        70        36 0.009434  =1/106

plot(rf_model, timestep = "number_of_trees", metric = "classification_error")

plot(rf_model, timestep = "number_of_trees", metric = "logloss")

plot(rf_model, timestep = "number_of_trees", metric = "AUC")

plot(rf_model, timestep = "number_of_trees", metric = "rmse")

h2o.auc(rf_model, train = TRUE)
## [1] 0.989521

h2o.auc(rf_model, valid = TRUE)
## [1] 0.9995976

h2o.auc(rf_model, xval = TRUE)
## [1] 0.9890496

perf <- h2o.performance(rf_model, test)
perf

## H2OBinomialMetrics: drf
##
## MSE: 0.03673598
## RMSE: 0.1916663
## LogLoss: 0.1158835
## Mean Per-Class Error: 0.0625
## AUC: 0.990625
## Gini: 0.98125
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           benign malignant    Error    Rate
## benign        70         0 0.000000   =0/70
## malignant      4        28 0.125000   =4/32
## Totals        74        28 0.039216  =4/102
##
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.735027 0.933333  25
## 2                       max f2  0.294222 0.952381  37
## 3                 max f0point5  0.735027 0.972222  25
## 4                 max accuracy  0.735027 0.960784  25
## 5                max precision  1.000000 1.000000   0
## 6                   max recall  0.294222 1.000000  37
## 7              max specificity  1.000000 1.000000   0
## 8             max absolute_mcc  0.735027 0.909782  25
## 9   max min_per_class_accuracy  0.424524 0.937500  31
## 10 max mean_per_class_accuracy  0.294222 0.942857  37
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

plot(perf)

h2o.logloss(perf)
## [1] 0.1158835

h2o.mse(perf)
## [1] 0.03673598

h2o.auc(perf)
## [1] 0.990625

head(h2o.metric(perf))

## Metrics for Thresholds: Binomial metrics as a function of classification thresholds
##   threshold       f1       f2 f0point5 accuracy precision   recall
## 1  1.000000 0.171429 0.114504 0.340909 0.715686  1.000000 0.093750
## 2  0.998333 0.222222 0.151515 0.416667 0.725490  1.000000 0.125000
## 3  0.998000 0.270270 0.187970 0.480769 0.735294  1.000000 0.156250
## 4  0.997222 0.315789 0.223881 0.535714 0.745098  1.000000 0.187500
## 5  0.996210 0.358974 0.259259 0.583333 0.754902  1.000000 0.218750
## 6  0.994048 0.400000 0.294118 0.625000 0.764706  1.000000 0.250000
##   specificity absolute_mcc min_per_class_accuracy mean_per_class_accuracy
## 1    1.000000     0.257464               0.093750                0.546875
## 2    1.000000     0.298807               0.125000                0.562500
## 3    1.000000     0.335794               0.156250                0.578125
## 4    1.000000     0.369755               0.187500                0.593750
## 5    1.000000     0.401478               0.218750                0.609375
## 6    1.000000     0.431474               0.250000                0.625000
##   tns fns fps tps      tnr      fnr      fpr      tpr idx
## 1  70  29   0   3 1.000000 0.906250 0.000000 0.093750   0
## 2  70  28   0   4 1.000000 0.875000 0.000000 0.125000   1
## 3  70  27   0   5 1.000000 0.843750 0.000000 0.156250   2
## 4  70  26   0   6 1.000000 0.812500 0.000000 0.187500   3
## 5  70  25   0   7 1.000000 0.781250 0.000000 0.218750   4
## 6  70  24   0   8 1.000000 0.750000 0.000000 0.250000   5

finalRf_predictions <- data.frame(actual = as.vector(test$classes),
                                  as.data.frame(h2o.predict(object = rf_model, newdata = test)))

finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict,
                                       "yes", "no")

finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, "benign",
                                                ifelse(finalRf_predictions$malignant > 0.8,
                                                       "malignant", "uncertain"))

finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes",
                                                 ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no"))

finalRf_predictions %>%
  group_by(actual, predict) %>%
  dplyr::summarise(n = n())

## Source: local data frame [3 x 3]
## Groups: actual [?]
##
##      actual   predict     n
##      <fctr>    <fctr> <int>
## 1    benign    benign    62
## 2    benign malignant     8
## 3 malignant malignant    32

finalRf_predictions %>%
  group_by(actual, predict_stringent) %>%
  dplyr::summarise(n = n())

## Source: local data frame [4 x 3]
## Groups: actual [?]
##
##      actual predict_stringent     n
##      <fctr>             <chr> <int>
## 1    benign            benign    61
## 2    benign         uncertain     9
## 3 malignant         malignant    26
## 4 malignant         uncertain     6

finalRf_predictions %>%
  ggplot(aes(x = actual, fill = accurate)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  labs(fill = "Were\npredictions\naccurate?",
       title = "Default predictions")

finalRf_predictions %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  labs(fill = "Were\npredictions\naccurate?",
       title = "Stringent predictions")

df <- finalRf_predictions[, c(1, 3, 4)]

thresholds <- seq(from = 0, to = 1, by = 0.1)

prop_table <- data.frame(threshold = thresholds, prop_true_b = NA, prop_true_m = NA)

for (threshold in thresholds) {
  pred <- ifelse(df$benign > threshold, "benign", "malignant")
  pred_t <- ifelse(pred == df$actual, TRUE, FALSE)

  group <- data.frame(df, "pred" = pred_t) %>%
    group_by(actual, pred) %>%
    dplyr::summarise(n = n())

  group_b <- filter(group, actual == "benign")
  prop_b <- sum(filter(group_b, pred == TRUE)$n) / sum(group_b$n)
  prop_table[prop_table$threshold == threshold, "prop_true_b"] <- prop_b

  group_m <- filter(group, actual == "malignant")
  prop_m <- sum(filter(group_m, pred == TRUE)$n) / sum(group_m$n)
  prop_table[prop_table$threshold == threshold, "prop_true_m"] <- prop_m
}

prop_table %>%
  gather(x, y, prop_true_b:prop_true_m) %>%
  ggplot(aes(x = threshold, y = y, color = x)) +
  geom_point() +
  geom_line() +
  scale_color_brewer(palette = "Set1") +
  labs(y = "proportion of true predictions",
       color = "b: benign cases\nm: malignant cases")

h2o.shutdown()

If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.

sessionInfo()

## R version 3.3.3 (2017-03-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets
## [8] methods   base
##
## other attached packages:
##  [1] ggrepel_0.6.5        reshape2_1.4.2       h2o_3.10.3.6
##  [4] corrplot_0.77        plyr_1.8.4           xgboost_0.6-4
##  [7] randomForest_4.6-12  dplyr_0.5.0          caret_6.0-73
## [10] lattice_0.20-35      doParallel_1.0.10    iterators_1.0.8
## [13] foreach_1.4.3        tidyr_0.6.1          pcaGoPromoter_1.18.0
## [16] Biostrings_2.42.1    XVector_0.14.0       IRanges_2.8.1
## [19] S4Vectors_0.12.1     BiocGenerics_0.20.0  ellipse_0.3-8
## [22] ggplot2_2.2.1.9000
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10         class_7.3-14         assertthat_0.1
##  [4] rprojroot_1.2        digest_0.6.12        R6_2.2.0
##  [7] backports_1.0.5      MatrixModels_0.4-1   RSQLite_1.1-2
## [10] evaluate_0.10        e1071_1.6-8          zlibbioc_1.20.0
## [13] lazyeval_0.2.0       minqa_1.2.4          data.table_1.10.4
## [16] SparseM_1.76         car_2.1-4            nloptr_1.0.4
## [19] Matrix_1.2-8         rmarkdown_1.4        labeling_0.3
## [22] splines_3.3.3        lme4_1.1-12          stringr_1.2.0
## [25] RCurl_1.95-4.8       munsell_0.4.3        mgcv_1.8-17
## [28] htmltools_0.3.5      nnet_7.3-12          tibble_1.2
## [31] codetools_0.2-15     MASS_7.3-45          bitops_1.0-6
## [34] ModelMetrics_1.1.0   grid_3.3.3           nlme_3.1-131
## [37] jsonlite_1.3         gtable_0.2.0         DBI_0.6
## [40] magrittr_1.5         scales_0.4.1         stringi_1.1.3
## [43] RColorBrewer_1.1-2   tools_3.3.3          Biobase_2.34.0
## [46] pbkrtest_0.4-7       yaml_2.1.14          AnnotationDbi_1.36.2
## [49] colorspace_1.3-2     memoise_1.0.0        knitr_1.15.1
## [52] quantreg_5.29

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

Customising Shiny Server HTML Pages

Fri, 03/31/2017 - 01:16

(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)

Mark Sellors
Head of Data Engineering

At Mango we work with a great many clients using the Shiny framework for R. Many of those use Shiny Server or Shiny Server Pro to publish their shiny apps within their organisations. Shiny Server Pro in particular is a great product, but some of the stock html pages are a little plain, so I asked the good folks at RStudio if it was possible to customise them to match corporate themes and so on. It turns out that it’s a documented feature that’s been available for around 3 years now!

The stock Shiny Server Pro login screen

If you want to try this yourself, check out the Shiny Server Admin Guide (http://docs.rstudio.com/shiny-server/#custom-templates), but it's pretty simple to do. My main point of interest is customising the Shiny Server Pro login screen, but you can customise several other pages too. It's worth noting, though, that if you create a custom 404 page, it will not be available from within an application; it only appears in places where the page would be rendered by Shiny Server rather than by Shiny itself.
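For reference, pointing Shiny Server at a directory of customised pages is a one-line change in the server configuration. The sketch below uses the template_dir directive described in the Admin Guide; the paths and the single-location layout are example assumptions, not our actual setup:

```
# /etc/shiny-server/shiny-server.conf (sketch)
server {
  listen 3838;

  location / {
    site_dir /srv/shiny-server;
    log_dir  /var/log/shiny-server;

    # Serve customised copies of the stock html pages (login page,
    # error pages, directory index) from this directory instead of
    # the built-in templates
    template_dir /etc/shiny-server/templates;
  }
}
```

Because template_dir can be set per-location, different apps on the same server can carry different branding.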

To get a feel for how this works, I asked Mango’s resident web dev, Ben, to rustle me up a quick login page for a fake company and I then set about customising that to fit the required format. The finished article can be seen below and we hope you’ll agree, it’s a marked improvement on the stock one. (Eagle eyed readers and sci-fi/film-buffs will hopefully recognise the OCP logo!)

Our new, customised Shiny Server Pro login screen

This new login screen works exactly like the original and is configured for a single app only on the server, rather than all apps. We do have an additional “Forgot” button, that could be used to direct users to a support portal or to mail an administrator or similar.

Customisations are very simple and use the reasonably common handlebars/mustache format. A snippet of our custom page is below. Values placed within handlebars, like `{{value}}`, are variables that Shiny Server replaces with the correct info when it renders the page. To keep things simple, I stripped out some parts of the original template that we didn't need, such as the logic used when the server is configured with different authentication mechanisms.


<form action="{{url}}" method="POST">
<input id="successUrl" name="success_url" type="hidden" value="{{success_url}}" />
<input id="username" name="username" required="required" type="text" value="{{username}}" />
<input id="password" name="password" required="required" type="password" value="Password" />
<button>
<i class="fa fa-question-circle"></i> Forgot
</button>
<button type="submit">
<i class="fa fa-sign-in"></i> Login
</button>
</form>

Hopefully this post has inspired you to check out this feature. It’s an excellent way to provide custom corporate branding, especially as it can be applied on an app-by-app basis. It’s worth knowing that this feature is not yet available in RStudio Connect, but hopefully it will arrive in a future update. If you do create any customisations of your own be sure to let us know! You’ll find us on twitter @MangoTheCat.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions » R Blog.

Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger

Thu, 03/30/2017 - 23:25

(This article was first published on bnosac :: open analytical helpers, and kindly contributed to R-bloggers)

Parts of Speech (POS) tagging is a crucial part of natural language processing. It consists of labelling each word in a text document with a certain category like noun, verb, adverb, pronoun, … . At BNOSAC, we use it on a daily basis in order to select only nouns before we do topic detection, or in specific NLP flows. For R users working with different languages, the number of POS tagging options is small, and all have upsides and downsides. The following taggers are commonly used.

  • The Stanford Part-Of-Speech Tagger, which is terribly slow and whose language set is limited to English/French/German/Spanish/Arabic/Chinese (no Dutch). R packages for this are available at http://datacube.wu.ac.at.
  • Treetagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger) contains more languages but is only usable for non-commercial purposes (can be used based on the koRpus R package)
  • OpenNLP is faster and supports POS tagging for Dutch, Spanish, Polish, Swedish, English, Danish and German, but no French or Eastern-European languages. R packages for this are available at http://datacube.wu.ac.at.
  • Package pattern.nlp (https://github.com/bnosac/pattern.nlp) allows Parts of Speech tagging and lemmatisation for Dutch, French, English, German, Spanish, Italian but needs Python installed which is not always easy to request at IT departments
  • SyntaxNet and Parsey McParseface (https://github.com/tensorflow/models/tree/master/syntaxnet) have good accuracy for POS tagging but need tensorflow installed which might be too much installation hassle in a corporate setting not to mention the computational resources needed.

Enter RDRPOSTagger, which BNOSAC released at https://github.com/bnosac/RDRPOSTagger. It has the following features:

  1. Easily installable in a corporate environment as a simple R package based on rJava
  2. Covering more than 40 languages:
    UniversalPOS annotation for languages: Ancient_Greek, Ancient_Greek-PROIEL, Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Czech-CAC, Czech-CLTT, Danish, Dutch, Dutch-LassySmall, English, English-LinES, Estonian, Finnish, Finnish-FTB, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Kazakh, Latin, Latin-ITTB, Latin-PROIEL, Latvian, Norwegian, Old_Church_Slavonic, Persian, Polish, Portuguese, Portuguese-BR, Romanian, Russian-SynTagRus, Slovenian, Slovenian-SST, Spanish, Spanish-AnCora, Swedish, Swedish-LinES, Tamil, Turkish. Prepend UD_ to the language name if you want to use these models.
    MORPH annotation for languages: Bulgarian, Czech, Dutch, French, German, Portuguese, Spanish, Swedish
    POS annotation for languages: English, French, German, Hindi, Italian, Thai, Vietnamese
  3. Fast tagging as the Single Classification Ripple Down Rules are easy to execute and hence are quick on larger text volumes
  4. Competitive accuracy in comparison to state-of-the-art POS and morphological taggers
  5. Cross-platform running on Windows/Linux/Mac
  6. It allows morphological tagging, POS tagging and universal POS tagging of sentences

The Ripple Down Rules are basic binary classification trees which are built on top of the Universal Dependencies datasets available at http://universaldependencies.org. The methodology is explained in detail in the paper ‘A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging’, available at http://content.iospress.com/articles/ai-communications/aic698. If you just want to apply POS tagging to your text, you can go ahead as follows:

library(RDRPOSTagger)
rdr_available_models()

## POS annotation
x <- c("Oleg Borisovich Kulik is a Ukrainian-born Russian performance artist")
tagger <- rdr_model(language = "English", annotation = "POS")
rdr_pos(tagger, x = x)

## MORPH/POS annotation
x <- c("Dus godvermehoeren met pus in alle puisten , zei die schele van Van Bukburg .",
       "Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont",
       "  ", "", NA)
tagger <- rdr_model(language = "Dutch", annotation = "MORPH")
rdr_pos(tagger, x = x)

## Universal POS tagging annotation
tagger <- rdr_model(language = "UD_Dutch", annotation = "UniversalPOS")
rdr_pos(tagger, x = x)

## This gives the following output
sentence.id word.id             word word.type
           1       1              Dus       ADV
           1       2   godvermehoeren      VERB
           1       3              met       ADP
           1       4              pus      NOUN
           1       5               in       ADP
           1       6             alle      PRON
           1       7          puisten      NOUN
           1       8                ,     PUNCT
           1       9              zei      VERB
           1      10              die      PRON
           1      11           schele       ADJ
           1      12              van       ADP
           1      13              Van     PROPN
           1      14          Bukburg     PROPN
           1      15                .     PUNCT
           2       1               Er       ADV
           2       2              was       AUX
           2       3             toen     SCONJ
           2       4              dat     SCONJ
           2       5           liedje      NOUN
           2       6              van       ADP
           2       7 tietenkonttieten      VERB
           2       8             kont     PROPN
           2       9           tieten      VERB
           2      10     kontkontkont     PROPN
           2      11                .     PUNCT
           3       0             <NA>      <NA>
           4       0             <NA>      <NA>
           5       0             <NA>      <NA>

The function rdr_pos takes as input a vector of sentences. If you need to split your text data into sentences, just use tokenize_sentences from the tokenizers package.
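As a quick illustration (a minimal sketch, assuming the tokenizers package is installed), splitting raw text into the sentence vector that rdr_pos expects could look like this:

```r
library(tokenizers)

# tokenize_sentences returns a list with one character vector per input document;
# unlist() flattens it into the plain vector of sentences that rdr_pos wants
txt <- "Oleg Kulik is a performance artist. He was born in Ukraine."
sentences <- unlist(tokenize_sentences(txt))
print(sentences)
```

The resulting character vector can then be passed straight to rdr_pos as the x argument.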

Good luck with text mining.
If you need our help for a text mining project, let us know and we’ll be glad to get you started.

To leave a comment for the author, please follow the link and comment on their blog: bnosac :: open analytical helpers.

Live Presentation: Kyle Walker on the tigris Package

Thu, 03/30/2017 - 19:08

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

Kyle Walker

Next Tuesday (April 4) at 9am PT I will be hosting a webinar with Kyle Walker about the tigris package in R. In addition to creating the tigris package, Kyle is also a Professor of Geography at Texas Christian University.

If you are interested in R, Mapping and US Census Geography then I encourage you to attend!

What tigris Does

When I released choroplethr back in 2014, many people loved that it allows you to easily map US States, Counties and ZIP Codes. What many people don’t realize is that:

  1. These maps come from the US Census Bureau
  2. The Census Bureau publishes many more maps than choroplethr ships with

What tigris does is remarkable: rather than relying on someone to create an R package with the US map you want, it allows you to download shapefiles directly from the US Census Bureau and import them into your R session, all without leaving the R console. From there you can render the map with your favorite graphics library.
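As a hedged sketch of what that workflow looks like (assuming the tigris package is installed; the states function and its cb argument are from its documented interface):

```r
library(tigris)

# Download US state boundaries straight from the Census Bureau;
# cb = TRUE fetches the smaller cartographic boundary file rather
# than the full-resolution TIGER shapefile
us_states <- states(cb = TRUE)

# Render the downloaded shapefile with base graphics
plot(us_states)
```

Note that states() downloads data over the network, so the first call can take a moment; tigris offers similar functions for counties, tracts and other Census geographies.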

Kyle will explain what shapefiles the Census Bureau has available, and also give a demonstration of how to use tigris. After the presentation, there will be a live Q&A session.

If you are unable to attend live, please register so I can email you a link to the replay.

SAVE MY SEAT

The post Live Presentation: Kyle Walker on the tigris Package appeared first on AriLamstein.com.

To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.

Learning Scrabble strategy from robots, using R

Thu, 03/30/2017 - 18:42

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

While you might think of Scrabble as that game you play with your grandparents on a rainy Sunday, some people take it very seriously. There's an international competition devoted to Scrabble, and no end of guides and strategies for competitive play. James Curley, a psychology professor at Columbia University, has used an interesting method to collect data about what plays are most effective in Scrabble: by having robots play against each other, thousands of times.

The data were generated with a Visual Basic script that automated two AI players completing a game in Quackle. Quackle emulates the Scrabble board, and provides a number of AI players; the simulation used the "Speedy Player" AI which tends to make tactical scoring moves while missing some longer-term strategic plays (like most reasonably skilled Scrabble players). He recorded the results of 2566 games between two such computer players and provided the resulting plays and boards in an R package. With these data, you can see some interesting statistics on long-term outcomes from competitive Scrabble games, like this map (top-left) of which squares on the board are most used in games (darker means more frequently), and also for just the Q, Z and blank tiles. Scrabble games in general tend to follow the diagonals where the double-word score squares are located, while the high-scoring Q and Z tiles tend to be used on double- and triple-letter squares. The zero-point blank tile, by comparison, is used fairly uniformly across the board.

 

Further analysis of the actual plays during the simulated games reveals some interesting Scrabble statistics:

It's best to play first. Player 1 won 54.6% of games, while Player 2 won 44.9%, a statistically significant difference. (The remaining 0.5% of the games were ties.) 
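You can check a claim like this yourself in one line of base R. This is a sketch only: the win count is my back-calculation from the reported percentage, and it tests Player 1's wins against a fair 50/50 split while ignoring ties.

```r
# Roughly 54.6% of 2566 simulated games were won by Player 1
wins_p1 <- round(0.546 * 2566)

# Exact binomial test against the null of no first-player advantage
binom.test(wins_p1, 2566, p = 0.5)$p.value
# a very small p-value, consistent with a real first-player advantage
```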

Some uncommon words are frequently used. The 10 most frequently-played words were QI, QAT, QIN, XI, OX, EUOI, XU, ZO, ZA, and EX; not necessarily words you'd use in casual conversation, but all words very familiar to competitive Scrabble players. As we'll see later though, not all are high-scoring plays. Sometimes it's a good idea to get rid of a high-scoring but restrictive letter (QI, the life energy in Chinese philosophy), or simply to shake up a vowel-heavy rack for more opportunities next turn (EUOI, a cry of impassioned rapture in ancient Bacchic revels).

For high-scoring plays, go for the bingo. A Scrabble Bingo, where you lay down all 7 tiles in your rack in one play, comes with a 50-point bonus. The top three highest-scoring plays in the simulation were all bingo plays: REPIQUED (239 points), CZARISTS (230 points), and IDOLIZED. (Remember though, that this is from just a couple of thousand simulated games; there are many many more potentially high-scoring words.)

High-scoring non-bingo plays can be surprisingly long words. It's super-satisfying to lay down a short word like EX with the X making a second word on a triple-letter square (for a 9x bonus), so I was surprised to see the top 10 highest-scoring non-bingo plays were still fairly long words: XENURINE (144 points), CYANOSES (126 points) and SNAPWEED (126 points), all using at least 2 tiles already on the board. The shortest word in the top 10 of this list was ZITS.

Some tiles just don't work well together. The pain of getting a Q without a U seems obvious, but it turns out getting two U's is way worse in point-scoring potential. From the simulation, you can estimate the point-scoring potential of any pair of tiles in your rack: lighter is better, and darker is worse.

Managing the scoring potential of the tiles in your rack is a big part of Scrabble strategy, as we saw in another Scrabble analysis using R a few years ago. The lowly zero-point blank is actually worth a lot of potential points, while the highest-scoring tile Q is actually a liability. Here are the findings from that analysis:

  • The blank is worth about 30 points to a good player, mainly by making 50-point "bingo" plays possible.
  • Each S is worth about 10 points to the player who draws it.
  • The Q is a burden to whichever player receives it, effectively serving as a 5 point penalty: it reduces bingo opportunities, since you need either a U or a blank for a chance at a bingo and its 50-point bonus.
  • The J is essentially neutral pointwise.
  • The X and the Z are each worth about 3-5 extra points to the player who receives them. Their difficulty in playing in bingoes is mitigated by their usefulness in other short words.

For more on James Curley's recent Scrabble analysis, including the R code using his scrabblr package, follow the link below.

RPubs: Analyzing Scrabble Games

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Delaporte package: The SPARCmonster is sated

Thu, 03/30/2017 - 18:26

Finally, finally after months of pulling out my hair, the Delaporte project on CRAN passes all of its checks on Solaris SPARC. The last time it did that, it was still using serial C++. Now it uses OpenMP-based parallel Fortran 2003. What a relief! One of these days I should write up what I did and why, but for now, I’ll be glad to put the project down for a month or seventeen!

R-Lab #2: Shiny & maps + earthquakes | Milan, April 11th

Thu, 03/30/2017 - 16:22

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

 

R-Lab #2 – April 11th, Milano Shiny & Maps: a geo representation of earthquakes evolution with R

[The map above, which shows the earthquakes of magnitude > 5 in the last 3 years, was made following this great post by Andrew Collier. The nice colors were easy to choose thanks to the amazing colourpicker addin.]

Hi everybody! We are ready to announce our second R-Lab! Whether you are an R expert, a beginner, or just curious, you are welcome to join us! The event will be on April 11th at Mikamai, Milano, and you can register here: https://www.meetup.com/R-Lab-Milano/events/238824405/

This time the R topic is very intriguing: Shiny and maps! As several of you will know, Shiny is the R framework for interactive visualization, while packages such as ggmap and leaflet provide several functions for georeferencing and mapping data.
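To give a taste of the kind of thing we will build, here is a minimal sketch using R's built-in quakes dataset of Fiji earthquakes (it assumes the leaflet package is installed; the real event will of course use the fault data provided by EarthCloud):

```r
library(leaflet)

# Keep only the stronger earthquakes, as in the map above
big_quakes <- quakes[quakes$mag > 5, ]

# An interactive map with one circle per earthquake, sized by magnitude
leaflet(big_quakes) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~long, lat = ~lat, radius = ~mag)
```

In a Shiny app, the same map would simply be wrapped in renderLeaflet, with inputs filtering the data by time or magnitude.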

These tools will be very useful for the task proposed by our first presenter: EarthCloud, an organisation that aims to build a close link between information technologies and earth sciences related to risk prevention. With their help we will work on a very hot topic: mapping the earthquakes that happened along a specific fault, to study the fault's evolution over time and provide qualitative and quantitative insights about it. As always, we will work in teams, hands on R code!

 

Agenda

18:45 : Meeting at Mikamai

19:00 : Intro: Earthquakes data and fault evolution + R tools for mapping and Shiny

19:30 : Get a pizza (for those who want it)

19:45 : Coding together while eating the pizza.

The goal is to build a Shiny app that maps the fault's evolution over time, enriching the app as much as possible with qualitative and quantitative insights about the earthquakes.

22:30 : Greetings and see you soon!

 

A bit more about the proposer – EarthCloud

EarthCloud is an association that aims at creating a strong link between new information technologies (distributed systems, Cloud, serverless, etc.) and earth sciences related to risk prevention (seismic, geological and environmental), in order to increase population protection.

 

Who is this event for

Whether you are an R expert, a basic user, or just curious, you are welcome! The R-Lab is a space for learning and sharing knowledge, working together on a challenging goal. If you are a geologist or an expert in earth science, you are welcome as well! We will appreciate your contribution on the meaning of what we are doing.

 

What to bring

Be sure to bring your own computer, possibly with the latest version of RStudio.

 

Where to go

We will be hosted by Mikamai and LinkMe at their location in Via Giulio e Corrado Venini, 42 (very close to the Pasteur metro station). The doorbell is Mikamai: when you enter, go straight and cross the inner courtyard of the building until you face a metal door with a sheet labelled Mikamai. The office is on the top (and only) floor.

 

For any additional info, please contact us via meetup: https://www.meetup.com/it-IT/R-Lab-Milano/

Looking forward to seeing you there!!!

 

If you still don’t know the R-Lab project… What is an R-Lab?

The R-Labs are evening mini-hackathons where we work together with R on a real problem.

In short, this is what we do:

  • A company/institution/university proposes a real problem, and teaches something about that issue
  • We work together on the solution, possibly having fun
  • Everything we do is released on Github, exactly here

 

Where can I join the group?

You can join the group on the Meetup platform here:

https://www.meetup.com/it-IT/R-Lab-Milano/

 

 

The post R-Lab #2: Shiny & maps + earthquakes | Milan, April 11th appeared first on MilanoR.

To leave a comment for the author, please follow the link and comment on their blog: MilanoR.

R Weekly Bulletin Vol – II

Thu, 03/30/2017 - 13:42

This week’s R bulletin covers function calls, sorting a data frame, creating a time series object, and functions like is.na, na.omit, paste, help, rep, and seq. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. To show files – Ctrl+5
2. To show plots – Ctrl+6
3. To show packages – Ctrl+7

Problem Solving Ideas

Calling a function in an R script

If you want to call a custom-built function from another script, you can use the “exists” function along with the “source” function. See the example below:

Example:

if(!exists("daily_price_data", mode="function")) source("Stock price data.R")

In this case, the expression checks whether a function called "daily_price_data" is already defined in the current session, and if it is not, it sources "Stock price data.R" to load the function. We can then use the function any number of times in our script by providing the relevant arguments.

Convert dates from Google finance to a time series object

When we download stock price data from Google Finance, the "DATE" column shows dates in the yyyymmdd format. This format is not recognized as a date/time object in R. To convert the dates from Google Finance into date objects, one can use the ymd function from the lubridate package. The ymd function accepts dates in the order year, month, day. For dates in other orders, the lubridate package has functions like ydm, mdy, myd, dmy, and dym, which can be used to convert them.

Example:

library(lubridate)
dt = ymd(20160523)
print(dt)

[1] “2016-05-23”

Sorting a data frame in an ascending or descending order

The arrange function from the dplyr package can be used to sort a data frame. The first argument is the data.frame and the next argument is the variable to sort by, either in an ascending or in a descending order.

In the example below, we create a two column data frame comprising of stock symbols and their respective percentage price change. We then sort the Percent change column first in an ascending order, and in the second instance in a descending order.

Example:

library(dplyr)

# Create a data frame
Ticker = c("UNITECH", "RCOM", "VEDL", "CANBK")
Percent_Change = c(2.3, -0.25, 0.5, 1.24)
df = data.frame(Ticker, Percent_Change)
print(df)

   Ticker Percent_Change
1 UNITECH           2.30
2    RCOM          -0.25
3    VEDL           0.50
4   CANBK           1.24

# Sort in an ascending order
df_ascending = arrange(df, Percent_Change)
print(df_ascending)

   Ticker Percent_Change
1    RCOM          -0.25
2    VEDL           0.50
3   CANBK           1.24
4 UNITECH           2.30

# Sort in a descending order
df_descending = arrange(df, desc(Percent_Change))
print(df_descending)

   Ticker Percent_Change
1 UNITECH           2.30
2   CANBK           1.24
3    VEDL           0.50
4    RCOM          -0.25

Functions Demystified

paste function

paste is a very useful function in R, used to concatenate (join) the arguments supplied to it. To control the separator placed between the arguments, use the “sep” argument.

Example 1: Combining a string of words and a function using paste

x = c(20:45)
paste("Mean of x is", mean(x), sep = " ")

[1] “Mean of x is 32.5”

Example 2: Creating a filename using dirPath, symbol, and the file extension as arguments to the paste function.

dirPath = "C:/Users/MyFolder/"
symbol = "INFY"
filename = paste(dirPath, symbol, ".csv", sep = "")
print(filename)

[1] “C:/Users/MyFolder/INFY.csv”

is.na and na.omit function

The is.na function checks whether there are any NA values in the given data set, whereas the na.omit function removes all rows containing NA values from it.

Example: Consider a data frame comprising of open and close prices for a stock corresponding to each date.

date = c(20160501, 20160502, 20160503, 20160504)
open = c(234, NA, 236.85, 237.45)
close = c(236, 237, NA, 238)
df = data.frame(date, open, close)
print(df)

      date   open close
1 20160501 234.00   236
2 20160502     NA   237
3 20160503 236.85    NA
4 20160504 237.45   238

Let us check whether the data frame has any NA values using the is.na function.

is.na(df)

      date  open close
[1,] FALSE FALSE FALSE
[2,] FALSE  TRUE FALSE
[3,] FALSE FALSE  TRUE
[4,] FALSE FALSE FALSE

As you can see from the result, it has two NA values. Let us now use the na.omit function, and view the results.

na.omit(df)

      date   open close
1 20160501 234.00   236
4 20160504 237.45   238

As can be seen from the result, the rows having NA values were omitted, and the resultant data frame now comprises non-NA values only.

These functions can be used to check for any NA values in large data sets on which we wish to apply some computations. The presence of NA values can cause the computations to give unwanted results, and hence such NA values need to be either removed or replaced with relevant values.
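Alternatively, many base R functions can skip NA values on the fly via the na.rm argument, which avoids dropping whole rows. A small sketch:

```r
x = c(1, 2, NA, 4)

sum(x)                # NA, because one value is missing
sum(x, na.rm = TRUE)  # 7, computed ignoring the NA
mean(x, na.rm = TRUE) # about 2.33, the mean of the non-NA values
```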

rep and seq function

The rep function repeats its arguments a specified number of times, while the seq function is used to generate a sequence of numbers. Note that in the seq function we separate the arguments with commas, not a colon.

Example 1:

rep("Strategy", times = 3)

[1] “Strategy” “Strategy” “Strategy”

rep(1:3, 2)

[1] 1 2 3 1 2 3

Example 2:

seq(1, 5)

[1] 1 2 3 4 5

seq(1, 5, 0.5)

[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

help and example function

The help function provides documentation on a given topic, while the example function runs the examples associated with it.

help(sum)
example(sum)

To access the R help files for a function within a particular package, pass the function name as the first argument to the help function and the package name as the second argument.

Example:

help(barplot, package="graphics")

Alternatively, one can type a question mark followed by the function name (e.g. ?barplot) and execute the command to learn more about the function.

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

The post R Weekly Bulletin Vol – II appeared first on .

Are you fluent in R?

Thu, 03/30/2017 - 10:00

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

A few weeks ago, I wrote an article saying that you should master R. The basic argument is that if you want to actually work as a data scientist, you need to know the essential tools backwards and forwards.

In response, a reader left a comment. I have to say that it’s unfortunate to read something like this, but sadly it’s very common.

Essentially, he wrote that he took an online course, but still can’t write code:

“I was able to take a data analysis course on edX with no problem, following all the instruction and guides they provide, acing the course, but as you noted astutely, not learning or remembering much of it, because I had mostly cut and pasted the code. I tried to do some elementary analysis recently, after almost a year later, and was not able to even do that”

Does this sound like you?

If you’re an aspiring data scientist, and you’re taking online courses and reading books, but not getting results, you need to regroup.

How many online courses have you taken?

How many data science books have you bought?

After taking courses and buying books, can you write R code rapidly and from memory?

… or instead, are you doing Google searches to find out how to execute basic data science techniques in R?

Which one are you? Are you fluent in R, or are you constantly struggling to remember the essentials?

If you’re still struggling to fluently write code, keep reading.

Are you fluent in R?

fluent
– Able to speak a language accurately, rapidly, and confidently – in a flowing way.

Usage notes
In casual use, “fluency” refers to language proficiency broadly, while in narrow use it refers to speaking a language flowingly, rather than haltingly.

– Wiktionary

I find it a little odd that the concept of “fluency” is used so rarely among programmers and technologists. Honestly, this idea of “fluency” almost perfectly encapsulates the skill level you need in order to achieve your goals.

Read the definition. To be fluent, is to be able to execute proficiently, accurately, rapidly, and confidently.

Is that how you write R data science code?

Do you write code proficiently, accurately, rapidly, and confidently? Do you write your code from memory?



Or do you write your code slowly? Laboriously? Can you remember the syntax at all? Are you constantly looking things up?

Be honest.

You need to be honest with yourself, because if you have weaknesses as a data scientist (or data science candidate), the only way you can correct those weaknesses is by being honest about where you need to improve.

The reality is that many data science students are not fluent in the essential techniques.

To be clear, when I say “essential techniques” I’m not talking about advanced techniques, like machine learning, deep learning, etc. If you’re a beginner or intermediate data science student, it’s normal to still struggle with machine learning.

No. I’m talking about the essential, foundational techniques, like data manipulation, data visualization, and data analysis.

Most data science students (and even some practitioners) can’t do these things fluently.

If that sounds like you, you need to rethink your learning strategy. If you can’t fluently execute essential techniques – like visualization, manipulation, and analysis – then you need to revisit those things and focus on them.

To get a data science job, to keep a data science job, and to excel in a data science job, you need to master the foundational data science techniques.

Getting a job as a data scientist without fluency in R (or another data language) is like trying to get a job as a writer for a Spanish magazine without having basic fluency in Spanish.

Don’t kid yourself.

Your first milestone: fluency with the essential techniques

Your real first milestone as an aspiring data scientist is achieving basic fluency in writing R code. More specifically, you need to be fluent in writing R code to perform data visualization, data manipulation, and data analysis. These are the foundations, and you need to be able to execute them proficiently, rapidly, from memory.

If you can’t do visualization, manipulation, and analysis rapidly and from memory then you’re probably not ready to do real data science work. Which means, you’re not ready to apply for a data science job.

Your first milestone: fluency in the essentials.

Let’s break that down more. Here are some of the things you should be able to execute “with your eyes closed”:

  1. Data visualization basics
    – Bar charts
    – Line charts
    – Histograms
    – Scatterplots
    – Box plots
    – Small multiples (these are rarely used, but very useful)
  2. Intermediate visualization
    – Manipulating colors
    – Manipulating size (i.e., bubble charts)
    – Dealing with data visualization problems (e.g., overplotting)
    – Formatting plots
  3. Data manipulation
    – How to read in a dataset (from a file, or inline)
    – How to add variables to a dataset
    – How to remove variables from a dataset
    – How to aggregate data
    – …

… I could go on.

This is a brief (and incomplete) list of things that you should be able to execute without even thinking about it. These are basic, essential tools. If you can’t do them fluently, you need to refocus your efforts until you can.
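To make this concrete, here is the kind of small drill you should be able to write from memory, combining aggregation and a basic bar chart (a sketch using R's built-in mtcars dataset; it assumes the dplyr and ggplot2 packages are installed):

```r
library(dplyr)
library(ggplot2)

# Aggregate: average fuel economy per cylinder count
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))

# Visualize: a basic bar chart of the aggregated data
ggplot(mpg_by_cyl, aes(x = factor(cyl), y = avg_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Average MPG")
```

If producing something like this takes you more than a minute or two, that is exactly the kind of fluency gap worth drilling.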

How to become fluent in R

This sounds great in theory. “Fluent in R!” It sounds good.

But how do you accomplish this?

I’ve said it before, and I’ll repeat:

Practice.

To become fluent in data science in R, you need to practice the essential techniques. You need to drill these essential techniques until they become second nature.

It’s not enough to just cut-and-paste some code one time and assume that you’ve learned it.

And it’s not enough to watch a few videos. To be clear: videos are an excellent way to learn something new the very first time. Good video lectures can teach you how something works and offer explanations. They are good for giving you basic understanding.

However, learning a technique from a lecture video is not the same thing as practiced, internalized skill. Lots of people watch a video or a lecture and say “yep, that makes sense.” Great.

But they don’t actually practice the technique, so they never internalize it. Actually, what happens is that they “learn” the technique from the video, but forget the technique soon after. They forget because they fail to practice.

Example: learning R is like learning a foreign language

As I’ve already suggested, learning R is much like learning a foreign language, like Spanish.

Let’s use that as an example. Let’s say you’re learning Spanish.

One day, in a lecture, you learn a little piece of grammar. As you learn that piece of grammar in the lecture, you understand it. Because you “learned” it, you’ll be able to use that grammatical construct by simply repeating it. You’ll also likely be able to use it for a few minutes or hours after class (although, you’re likely to struggle a little bit).

Next, you leave the classroom and don’t practice that grammatical construct.

A week later, do you think you’d still be able to use it? Would you remember it? How “fluent” will you be with that piece of grammar?

Here’s my bet: if you don’t practice that piece of grammar, you will forget it.

Foreign language vocabulary is the same. To remember a vocabulary word, it’s not enough to learn the word a single time. I bet you’ve had that experience: you learn the Spanish word for “cat,” and you can remember it for a few minutes, but if you don’t practice it, you will forget it.

In foreign language, if you learn grammar and words, and you want to remember them in the long run, you need to practice them. You need to drill. The best way to remember a word, is to learn it, and then practice it repeatedly over time.

Guess what? Learning a programming language is almost exactly the same.

To learn and remember programming language syntax, you need to practice. You need to drill the basic “vocabulary” and syntax of the programming language until you know it without thinking about it.

If you do this … if you learn the basic syntax, and practice that syntax until you can write it fluidly and quickly … you will achieve fluency.

I will repeat: identify the essential techniques and practice them relentlessly.

How long does it take to achieve basic fluency in R?

You might be asking how long this will take.

Actually, it depends on how good you are at learning. Learning itself is a technical skill.

If you don’t know how to practice, this could take years. I know people who started learning R years ago, and they still aren’t fluent. They bought dozens of books, but still can’t write code very well because they never really practiced. Again, it’s like foreign languages. I know people who have been studying Spanish for years and they still can’t have conversations.

This is one of your major risks. It might take you years to achieve basic fluency in R.

Even worse: you might fail altogether.

The problem here is that most online courses will not show you how to practice. They might show you syntax and explain how the language works, but they don’t show you how to practice to achieve mastery.

On the other hand, there is some good news …

If you know how to practice programming languages, you could achieve basic fluency as fast as about 6 weeks.

I won’t go into the details here, but if you know how to “hack your memory” you can learn R very, very quickly. Essentially, you need to know exactly how to practice for maximum gains and efficiency.

If you know how to do this, and you practice diligently every day, it’s possible to master the foundations of R within 6-8 weeks. (In fact, it’s probably possible to do it faster, if you really hustle.)

To succeed as a data scientist, become fluent in the essentials

I strongly believe that to succeed as a data scientist, you need fluency. You need a rapid, unconscious mastery of the essential syntax and techniques for data science in R.

And that requires practice.

If you want to be a data scientist, here is my recommendation. Learn and drill the major techniques from the following R packages:

  • ggplot2
  • dplyr
  • tidyr
  • lubridate
  • stringr
  • forcats
  • readr

These give you essential tools for manipulating, cleaning, wrangling, visualizing and analyzing data.
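To give a concrete sense of what fluency with these tools looks like, here is a minimal sketch of a typical wrangle-then-visualize pipeline using dplyr and ggplot2 on the built-in mtcars dataset. (This assumes both packages are installed; the column names come from mtcars itself.)

```r
# A minimal sketch of the kind of workflow these packages enable.
# Assumes dplyr and ggplot2 are installed:
#   install.packages(c("dplyr", "ggplot2"))
library(dplyr)
library(ggplot2)

# Wrangle: average mpg by cylinder count, using the built-in mtcars data
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# Visualize: a simple bar chart of the summary
ggplot(mpg_by_cyl, aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean MPG")
```

Fluency means being able to write a pipeline like this from memory, quickly, without stopping to look up the syntax.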

If you can become fluent with these, you’ll have all of the tools that you need to get things done at an entry level.

You’ll be prepared to work on your own projects.

You’ll be prepared for more advanced topics.

And you’ll be well on your way to becoming a top-performer.

Our data science course opens next week

If you’re interested in rapidly mastering data science, then sign up for our list right now.

Next week, we will re-open registration for our flagship course, Starting Data Science.

Starting Data Science will teach you the essentials of R, including ggplot2, dplyr, tidyr, stringr, and lubridate.

It will also give you a practice system that you can use to rapidly master the tools from these packages.

If you sign up for our email list, you’ll get an exclusive invitation to join the course when it opens.

SIGN UP NOW

The post Are you fluent in R? appeared first on SHARP SIGHT LABS.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
