
Time Series Forecasting with Recurrent Neural Networks

Wed, 12/20/2017 - 01:00

(This article was first published on TensorFlow for R, and kindly contributed to R-bloggers)

Deep Learning with R

This post is an excerpt from Chapter 6 of François Chollet’s and J.J. Allaire’s book, Deep Learning with R (Manning Publications). Use the code fccallaire for a 42% discount on the book at manning.com.

Time Series Forecasting with Recurrent Neural Networks

In this section, we’ll review three advanced techniques for improving the performance and generalization power of recurrent neural networks. By the end of the section, you’ll know most of what there is to know about using recurrent networks with Keras. We’ll demonstrate all three concepts on a temperature-forecasting problem: given a time series of data points coming from sensors installed on the roof of a building (measurements such as temperature, air pressure, and humidity), you’ll predict what the temperature will be 24 hours after the last data point. This is a fairly challenging problem that exemplifies many common difficulties encountered when working with time series.

We’ll cover the following techniques:

• Recurrent dropout — This is a specific, built-in way to use dropout to fight overfitting in recurrent layers.
• Stacking recurrent layers — This increases the representational power of the network (at the cost of higher computational loads).
• Bidirectional recurrent layers — These present the same information to a recurrent network in different ways, increasing accuracy and mitigating forgetting issues.
A temperature-forecasting problem

Until now, the only sequence data we’ve covered has been text data, such as the IMDB dataset and the Reuters dataset. But sequence data is found in many more problems than just language processing. In all the examples in this section, you’ll play with a weather time-series dataset recorded at the Weather Station at the Max Planck Institute for Biogeochemistry in Jena, Germany.

In this dataset, 14 different quantities (such as air temperature, atmospheric pressure, humidity, wind direction, and so on) were recorded every 10 minutes, over several years. The original data goes back to 2003, but this example is limited to data from 2009–2016. This dataset is perfect for learning to work with numerical time series. You’ll use it to build a model that takes as input some data from the recent past (a few days’ worth of data points) and predicts the air temperature 24 hours in the future.
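As a quick sanity check on those numbers, the 10-minute recording interval fixes how many observations you should expect (a small arithmetic sketch; the variable names here are ours, not from the dataset):

```r
# One observation every 10 minutes
points_per_day <- 24 * 60 / 10           # 144 data points per day
points_per_year <- points_per_day * 365  # 52,560 data points per year

# Eight years (2009-2016) of records is therefore on the order of
# 8 * 52,560 = 420,480 rows, which matches the ~420,000 observations
# in the CSV file (leap days and the like account for the difference)
points_per_day * 365 * 8
# [1] 420480
```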

Let’s look at the data.

```r
library(tibble)
library(readr)

data_dir <- "~/Downloads/jena_climate"
fname <- file.path(data_dir, "jena_climate_2009_2016.csv")
data <- read_csv(fname)
glimpse(data)
```

```
Observations: 420,551
Variables: 15
$ Date Time        "01.01.2009 00:10:00", "01.01.2009 00:20:00", "...
$ p (mbar)         996.52, 996.57, 996.53, 996.51, 996.51, 996.50,...
$ T (degC)         -8.02, -8.41, -8.51, -8.31, -8.27, -8.05, -7.62...
$ Tpot (K)         265.40, 265.01, 264.91, 265.12, 265.15, 265.38,...
$ Tdew (degC)      -8.90, -9.28, -9.31, -9.07, -9.04, -8.78, -8.30...
$ rh (%)           93.3, 93.4, 93.9, 94.2, 94.1, 94.4, 94.8, 94.4,...
$ VPmax (mbar)     3.33, 3.23, 3.21, 3.26, 3.27, 3.33, 3.44, 3.44,...
$ VPact (mbar)     3.11, 3.02, 3.01, 3.07, 3.08, 3.14, 3.26, 3.25,...
$ VPdef (mbar)     0.22, 0.21, 0.20, 0.19, 0.19, 0.19, 0.18, 0.19,...
$ sh (g/kg)        1.94, 1.89, 1.88, 1.92, 1.92, 1.96, 2.04, 2.03,...
$ H2OC (mmol/mol)  3.12, 3.03, 3.02, 3.08, 3.09, 3.15, 3.27, 3.26,...
$ rho (g/m**3)     1307.75, 1309.80, 1310.24, 1309.19, 1309.00, 13...
$ wv (m/s)         1.03, 0.72, 0.19, 0.34, 0.32, 0.21, 0.18, 0.19,...
$ max. wv (m/s)    1.75, 1.50, 0.63, 0.50, 0.63, 0.63, 0.63, 0.50,...
$ wd (deg)         152.3, 136.1, 171.6, 198.0, 214.3, 192.7, 166.5...
```

Here is the plot of temperature (in degrees Celsius) over time. On this plot, you can clearly see the yearly periodicity of temperature. (Note the backticks around the column name, which contains spaces and parentheses.)

```r
library(ggplot2)
ggplot(data, aes(x = 1:nrow(data), y = `T (degC)`)) + geom_line()
```

Here is a narrower plot of the first 10 days of temperature data (see figure 6.15). Because the data is recorded every 10 minutes, you get 144 data points per day.

```r
ggplot(data[1:1440,], aes(x = 1:1440, y = `T (degC)`)) + geom_line()
```

On this plot, you can see daily periodicity, especially evident for the last 4 days. Also note that this 10-day period must be coming from a fairly cold winter month. If you were trying to predict average temperature for the next month given a few months of past data, the problem would be easy, due to the reliable year-scale periodicity of the data.
But looking at the data over a scale of days, the temperature looks a lot more chaotic. Is this time series predictable at a daily scale? Let’s find out.

Preparing the data

The exact formulation of the problem will be as follows: given data going as far back as lookback timesteps (a timestep is 10 minutes) and sampled every step timesteps, can you predict the temperature in delay timesteps? You’ll use the following parameter values:

• lookback = 1440 — Observations will go back 10 days.
• step = 6 — Observations will be sampled at one data point per hour.
• delay = 144 — Targets will be 24 hours in the future.

To get started, you need to do two things:

• Preprocess the data to a format a neural network can ingest. This is easy: the data is already numerical, so you don’t need to do any vectorization. But each time series in the data is on a different scale (for example, temperature is typically between -20 and +30, but atmospheric pressure, measured in mbar, is around 1,000). You’ll normalize each time series independently so that they all take small values on a similar scale.
• Write a generator function that takes the current array of float data and yields batches of data from the recent past, along with a target temperature in the future. Because the samples in the dataset are highly redundant (sample N and sample N + 1 will have most of their timesteps in common), it would be wasteful to explicitly allocate every sample. Instead, you’ll generate the samples on the fly using the original data.

NOTE: Understanding generator functions

A generator function is a special type of function that you call repeatedly to obtain a sequence of values. Generators often need to maintain internal state, so they are typically constructed by calling another function that returns the generator function (the environment of the function that returns the generator is then used to track state).
For example, the sequence_generator() function below returns a generator function that yields an infinite sequence of numbers:

```r
sequence_generator <- function(start) {
  value <- start - 1
  function() {
    value <<- value + 1
    value
  }
}

gen <- sequence_generator(10)
gen()
# [1] 10
gen()
# [1] 11
```

The current state of the generator is the value variable that is defined outside of the function. Note that superassignment (<<-) is used to update this state from within the function.

Generator functions can signal completion by returning the value NULL. However, generator functions passed to Keras training methods (e.g. fit_generator()) should always return values infinitely (the number of calls to the generator function is controlled by the epochs and steps_per_epoch parameters).

First, you’ll convert the R data frame which we read earlier into a matrix of floating-point values (we’ll discard the first column, which contains a text timestamp):

```r
data <- data.matrix(data[,-1])
```

You’ll then preprocess the data by subtracting the mean of each time series and dividing by the standard deviation. You’re going to use the first 200,000 timesteps as training data, so compute the mean and standard deviation for normalization only on this fraction of the data.

```r
train_data <- data[1:200000,]
mean <- apply(train_data, 2, mean)
std <- apply(train_data, 2, sd)
data <- scale(data, center = mean, scale = std)
```

The code for the data generator you’ll use is below. It yields a list (samples, targets), where samples is one batch of input data and targets is the corresponding array of target temperatures. It takes the following arguments:

• data — The original array of floating-point data, which you normalized above.
• lookback — How many timesteps back the input data should go.
• delay — How many timesteps in the future the target should be.
• min_index and max_index — Indices in the data array that delimit which timesteps to draw from.
This is useful for keeping a segment of the data for validation and another for testing.

• shuffle — Whether to shuffle the samples or draw them in chronological order.
• batch_size — The number of samples per batch.
• step — The period, in timesteps, at which you sample data. You’ll set it to 6 in order to draw one data point every hour.

```r
generator <- function(data, lookback, delay, min_index, max_index,
                      shuffle = FALSE, batch_size = 128, step = 6) {
  if (is.null(max_index))
    max_index <- nrow(data) - delay - 1
  i <- min_index + lookback
  function() {
    if (shuffle) {
      rows <- sample(c((min_index+lookback):max_index), size = batch_size)
    } else {
      if (i + batch_size >= max_index)
        i <<- min_index + lookback
      rows <- c(i:min(i+batch_size, max_index))
      i <<- i + length(rows)
    }
    samples <- array(0, dim = c(length(rows),
                                lookback / step,
                                dim(data)[[-1]]))
    targets <- array(0, dim = c(length(rows)))
    for (j in 1:length(rows)) {
      indices <- seq(rows[[j]] - lookback, rows[[j]],
                     length.out = dim(samples)[[2]])
      samples[j,,] <- data[indices,]
      targets[[j]] <- data[rows[[j]] + delay, 2]
    }
    list(samples, targets)
  }
}
```

The i variable contains the state that tracks the next window of data to return, so it is updated using superassignment (e.g. i <<- i + length(rows)).

Now, let’s use the abstract generator function to instantiate three generators: one for training, one for validation, and one for testing. Each will look at different temporal segments of the original data: the training generator looks at the first 200,000 timesteps, the validation generator looks at the following 100,000, and the test generator looks at the remainder.
```r
lookback <- 1440
step <- 6
delay <- 144
batch_size <- 128

train_gen <- generator(
  data,
  lookback = lookback,
  delay = delay,
  min_index = 1,
  max_index = 200000,
  shuffle = TRUE,
  step = step,
  batch_size = batch_size
)

val_gen <- generator(
  data,
  lookback = lookback,
  delay = delay,
  min_index = 200001,
  max_index = 300000,
  step = step,
  batch_size = batch_size
)

test_gen <- generator(
  data,
  lookback = lookback,
  delay = delay,
  min_index = 300001,
  max_index = NULL,
  step = step,
  batch_size = batch_size
)

# How many steps to draw from val_gen in order to see the entire validation set
val_steps <- (300000 - 200001 - lookback) / batch_size

# How many steps to draw from test_gen in order to see the entire test set
test_steps <- (nrow(data) - 300001 - lookback) / batch_size
```

A common-sense, non-machine-learning baseline

Before you start using black-box deep-learning models to solve the temperature-prediction problem, let’s try a simple, common-sense approach. It will serve as a sanity check, and it will establish a baseline that you’ll have to beat in order to demonstrate the usefulness of more-advanced machine-learning models. Such common-sense baselines can be useful when you’re approaching a new problem for which there is no known solution (yet). A classic example is that of unbalanced classification tasks, where some classes are much more common than others. If your dataset contains 90% instances of class A and 10% instances of class B, then a common-sense approach to the classification task is to always predict “A” when presented with a new sample. Such a classifier is 90% accurate overall, and any learning-based approach should therefore beat this 90% score in order to demonstrate usefulness. Sometimes, such elementary baselines can prove surprisingly hard to beat.

In this case, the temperature time series can safely be assumed to be continuous (the temperatures tomorrow are likely to be close to the temperatures today) as well as periodical with a daily period.
Thus a common-sense approach is to always predict that the temperature 24 hours from now will be equal to the temperature right now. Let’s evaluate this approach, using the mean absolute error (MAE) metric:

```r
mean(abs(preds - targets))
```

Here’s the evaluation loop.

```r
# %<-% is the multi-assignment operator from the zeallot package
# (re-exported by keras)
evaluate_naive_method <- function() {
  batch_maes <- c()
  for (step in 1:val_steps) {
    c(samples, targets) %<-% val_gen()
    preds <- samples[, dim(samples)[[2]], 2]
    mae <- mean(abs(preds - targets))
    batch_maes <- c(batch_maes, mae)
  }
  print(mean(batch_maes))
}

evaluate_naive_method()
```

This yields an MAE of 0.29. Because the temperature data has been normalized to be centered on 0 and have a standard deviation of 1, this number isn’t immediately interpretable. It translates to an average absolute error of 0.29 x temperature_std degrees Celsius: 2.57˚C.

```r
celsius_mae <- 0.29 * std[[2]]
```

That’s a fairly large average absolute error. Now the game is to use your knowledge of deep learning to do better.

A basic machine-learning approach

In the same way that it’s useful to establish a common-sense baseline before trying machine-learning approaches, it’s useful to try simple, cheap machine-learning models (such as small, densely connected networks) before looking into complicated and computationally expensive models such as RNNs. This is the best way to make sure any further complexity you throw at the problem is legitimate and delivers real benefits.

The following listing shows a fully connected model that starts by flattening the data and then runs it through two dense layers. Note the lack of activation function on the last dense layer, which is typical for a regression problem. You use MAE as the loss. Because you evaluate on the exact same data and with the exact same metric you did with the common-sense approach, the results will be directly comparable.
```r
library(keras)

model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(lookback / step, dim(data)[-1])) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)

history <- model %>% fit_generator(
  train_gen,
  steps_per_epoch = 500,
  epochs = 20,
  validation_data = val_gen,
  validation_steps = val_steps
)
```

Let’s display the loss curves for validation and training. Some of the validation losses are close to the no-learning baseline, but not reliably. This goes to show the merit of having this baseline in the first place: it turns out to be not easy to outperform. Your common sense contains a lot of valuable information that a machine-learning model doesn’t have access to.

You may wonder, if a simple, well-performing model exists to go from the data to the targets (the common-sense baseline), why doesn’t the model you’re training find it and improve on it? Because this simple solution isn’t what your training setup is looking for. The space of models in which you’re searching for a solution – that is, your hypothesis space – is the space of all possible two-layer networks with the configuration you defined. These networks are already fairly complicated. When you’re looking for a solution within a space of complicated models, the simple, well-performing baseline may be unlearnable, even if it’s technically part of the hypothesis space. That is a pretty significant limitation of machine learning in general: unless the learning algorithm is hardcoded to look for a specific kind of simple model, parameter learning can sometimes fail to find a simple solution to a simple problem.

A first recurrent baseline

The first fully connected approach didn’t do well, but that doesn’t mean machine learning isn’t applicable to this problem. The previous approach first flattened the time series, which removed the notion of time from the input data.
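To make “removed the notion of time” concrete: after flattening, each sample is just one long vector, and the first dense layer treats every timestep-feature pair as an unrelated input. A back-of-the-envelope sketch (plain R; the numbers follow from lookback, step, and the dataset’s 14 feature columns):

```r
timesteps <- 1440 / 6      # lookback / step = 240 hourly observations
features  <- 14            # feature columns in the Jena data

# layer_flatten() turns each (240, 14) sample into a single vector...
flat_inputs <- timesteps * features
flat_inputs                # [1] 3360

# ...so the 32-unit dense layer learns one independent weight per
# timestep-feature pair (plus 32 biases), with nothing telling it that
# two of those inputs are the same sensor one hour apart
dense_params <- flat_inputs * 32 + 32
dense_params               # [1] 107552
```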
Let’s instead look at the data as what it is: a sequence, where causality and order matter. You’ll try a recurrent-sequence-processing model – it should be the perfect fit for such sequence data, precisely because it exploits the temporal ordering of data points, unlike the first approach.

Instead of the LSTM layer introduced in the previous section, you’ll use the GRU layer, developed by Chung et al. in 2014. Gated recurrent unit (GRU) layers work using the same principle as LSTM, but they’re somewhat streamlined and thus cheaper to run (although they may not have as much representational power as LSTM). This trade-off between computational expensiveness and representational power is seen everywhere in machine learning.

```r
model <- keras_model_sequential() %>%
  layer_gru(units = 32, input_shape = list(NULL, dim(data)[[-1]])) %>%
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)

history <- model %>% fit_generator(
  train_gen,
  steps_per_epoch = 500,
  epochs = 20,
  validation_data = val_gen,
  validation_steps = val_steps
)
```

The results are plotted below. Much better! You can significantly beat the common-sense baseline, demonstrating the value of machine learning as well as the superiority of recurrent networks compared to sequence-flattening dense networks on this type of task. The new validation MAE of ~0.265 (before you start significantly overfitting) translates to a mean absolute error of 2.35˚C after denormalization. That’s a solid gain on the initial error of 2.57˚C, but you probably still have a bit of a margin for improvement.

Using recurrent dropout to fight overfitting

It’s evident from the training and validation curves that the model is overfitting: the training and validation losses start to diverge considerably after a few epochs.
You’re already familiar with a classic technique for fighting this phenomenon: dropout, which randomly zeros out input units of a layer in order to break happenstance correlations in the training data that the layer is exposed to. But how to correctly apply dropout in recurrent networks isn’t a trivial question. It has long been known that applying dropout before a recurrent layer hinders learning rather than helping with regularization. In 2015, Yarin Gal, as part of his PhD thesis on Bayesian deep learning, determined the proper way to use dropout with a recurrent network: the same dropout mask (the same pattern of dropped units) should be applied at every timestep, instead of a dropout mask that varies randomly from timestep to timestep. What’s more, in order to regularize the representations formed by the recurrent gates of layers such as layer_gru and layer_lstm, a temporally constant dropout mask should be applied to the inner recurrent activations of the layer (a recurrent dropout mask). Using the same dropout mask at every timestep allows the network to properly propagate its learning error through time; a temporally random dropout mask would disrupt this error signal and be harmful to the learning process. Yarin Gal did his research using Keras and helped build this mechanism directly into Keras recurrent layers. Every recurrent layer in Keras has two dropout-related arguments: dropout, a float specifying the dropout rate for input units of the layer, and recurrent_dropout, specifying the dropout rate of the recurrent units. Let’s add dropout and recurrent dropout to the layer_gru and see how doing so impacts overfitting. Because networks being regularized with dropout always take longer to fully converge, you’ll train the network for twice as many epochs. 
```r
model <- keras_model_sequential() %>%
  layer_gru(units = 32, dropout = 0.2, recurrent_dropout = 0.2,
            input_shape = list(NULL, dim(data)[[-1]])) %>%
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)

history <- model %>% fit_generator(
  train_gen,
  steps_per_epoch = 500,
  epochs = 40,
  validation_data = val_gen,
  validation_steps = val_steps
)
```

The plot below shows the results. Success! You’re no longer overfitting during the first 20 epochs. But although you have more stable evaluation scores, your best scores aren’t much lower than they were previously.

Stacking recurrent layers

Because you’re no longer overfitting but seem to have hit a performance bottleneck, you should consider increasing the capacity of the network. Recall the description of the universal machine-learning workflow: it’s generally a good idea to increase the capacity of your network until overfitting becomes the primary obstacle (assuming you’re already taking basic steps to mitigate overfitting, such as using dropout). As long as you aren’t overfitting too badly, you’re likely under capacity.

Increasing network capacity is typically done by increasing the number of units in the layers or adding more layers. Recurrent layer stacking is a classic way to build more-powerful recurrent networks: for instance, what currently powers the Google Translate algorithm is a stack of seven large LSTM layers – that’s huge.

To stack recurrent layers on top of each other in Keras, all intermediate layers should return their full sequence of outputs (a 3D tensor) rather than their output at the last timestep. This is done by specifying return_sequences = TRUE.
```r
model <- keras_model_sequential() %>%
  layer_gru(units = 32,
            dropout = 0.1,
            recurrent_dropout = 0.5,
            return_sequences = TRUE,
            input_shape = list(NULL, dim(data)[[-1]])) %>%
  layer_gru(units = 64, activation = "relu",
            dropout = 0.1,
            recurrent_dropout = 0.5) %>%
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)

history <- model %>% fit_generator(
  train_gen,
  steps_per_epoch = 500,
  epochs = 40,
  validation_data = val_gen,
  validation_steps = val_steps
)
```

The figure below shows the results. You can see that the added layer does improve the results a bit, though not significantly. You can draw two conclusions:

• Because you’re still not overfitting too badly, you could safely increase the size of your layers in a quest for validation-loss improvement. This has a non-negligible computational cost, though.
• Adding a layer didn’t help by a significant factor, so you may be seeing diminishing returns from increasing network capacity at this point.

Using bidirectional RNNs

The last technique introduced in this section is called bidirectional RNNs. A bidirectional RNN is a common RNN variant that can offer greater performance than a regular RNN on certain tasks. It’s frequently used in natural-language processing – you could call it the Swiss Army knife of deep learning for natural-language processing.

RNNs are notably order dependent, or time dependent: they process the timesteps of their input sequences in order, and shuffling or reversing the timesteps can completely change the representations the RNN extracts from the sequence. This is precisely the reason they perform well on problems where order is meaningful, such as the temperature-forecasting problem.
A bidirectional RNN exploits the order sensitivity of RNNs: it consists of two regular RNNs, such as the layer_gru and layer_lstm you’re already familiar with, each of which processes the input sequence in one direction (chronologically and antichronologically); their representations are then merged. By processing a sequence both ways, a bidirectional RNN can catch patterns that may be overlooked by a unidirectional RNN.

Remarkably, the fact that the RNN layers in this section have processed sequences in chronological order (older timesteps first) may have been an arbitrary decision – at least, it’s a decision we’ve made no attempt to question so far. Could the RNNs have performed well enough if they processed input sequences in antichronological order, for instance (newer timesteps first)? Let’s try this in practice and see what happens. All you need to do is write a variant of the data generator where the input sequences are reversed along the time dimension (replace the last line with list(samples[,ncol(samples):1,], targets)). Training the same one-GRU-layer network that you used in the first experiment in this section, you get the results shown below.

The reversed-order GRU underperforms even the common-sense baseline, indicating that in this case, chronological processing is important to the success of the approach. This makes perfect sense: the underlying GRU layer will typically be better at remembering the recent past than the distant past, and naturally the more recent weather data points are more predictive than older ones for this problem (that’s what makes the common-sense baseline fairly strong). Thus the chronological version of the layer is bound to outperform the reversed-order version. Importantly, this isn’t true for many other problems, including natural language: intuitively, the importance of a word in understanding a sentence isn’t usually dependent on its position in the sentence.
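The generator change described above is pure array indexing: reversing dimension 2 (time) of the 3D samples array. A minimal toy-array sketch of what that last-line replacement does:

```r
# A toy batch: 2 samples, 4 timesteps, 3 features
samples <- array(1:24, dim = c(2, 4, 3))

# Reverse along the time dimension (dimension 2), as in the
# reversed-order generator variant (ncol() on a 3D array returns dim 2)
reversed <- samples[, ncol(samples):1, , drop = FALSE]

samples[1, , 1]   # [1] 1 3 5 7   (chronological)
reversed[1, , 1]  # [1] 7 5 3 1   (antichronological)
```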
Let’s try the same trick on the LSTM IMDB example from section 6.2.

```r
library(keras)

# Number of words to consider as features
max_features <- 10000
# Cuts off texts after this number of words
maxlen <- 500

imdb <- dataset_imdb(num_words = max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb

# Reverses sequences
x_train <- lapply(x_train, rev)
x_test <- lapply(x_test, rev)

# Pads sequences
x_train <- pad_sequences(x_train, maxlen = maxlen)
x_test <- pad_sequences(x_test, maxlen = maxlen)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_features, output_dim = 128) %>%
  layer_lstm(units = 32) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("acc")
)

history <- model %>% fit(
  x_train, y_train,
  epochs = 10,
  batch_size = 128,
  validation_split = 0.2
)
```

You get performance nearly identical to that of the chronological-order LSTM. Remarkably, on such a text dataset, reversed-order processing works just as well as chronological processing, confirming the hypothesis that, although word order does matter in understanding language, which order you use isn’t crucial.

Importantly, an RNN trained on reversed sequences will learn different representations than one trained on the original sequences, much as you would have different mental models if time flowed backward in the real world – if you lived a life where you died on your first day and were born on your last day. In machine learning, representations that are different yet useful are always worth exploiting, and the more they differ, the better: they offer a new angle from which to look at your data, capturing aspects of the data that were missed by other approaches, and thus they can help boost performance on a task. This is the intuition behind ensembling, a concept we’ll explore in chapter 7. A bidirectional RNN exploits this idea to improve on the performance of chronological-order RNNs.
It looks at its input sequence both ways, obtaining potentially richer representations and capturing patterns that may have been missed by the chronological-order version alone.

To instantiate a bidirectional RNN in Keras, you use the bidirectional() function, which takes a recurrent layer instance as an argument. The bidirectional() function creates a second, separate instance of this recurrent layer and uses one instance for processing the input sequences in chronological order and the other instance for processing the input sequences in reversed order. Let’s try it on the IMDB sentiment-analysis task.

```r
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_features, output_dim = 32) %>%
  bidirectional(
    layer_lstm(units = 32)
  ) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("acc")
)

history <- model %>% fit(
  x_train, y_train,
  epochs = 10,
  batch_size = 128,
  validation_split = 0.2
)
```

It performs slightly better than the regular LSTM you tried in the previous section, achieving over 89% validation accuracy. It also seems to overfit more quickly, which is unsurprising because a bidirectional layer has twice as many parameters as a chronological LSTM. With some regularization, the bidirectional approach would likely be a strong performer on this task.

Now let’s try the same approach on the temperature-prediction task.

```r
model <- keras_model_sequential() %>%
  bidirectional(
    layer_gru(units = 32), input_shape = list(NULL, dim(data)[[-1]])
  ) %>%
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)

history <- model %>% fit_generator(
  train_gen,
  steps_per_epoch = 500,
  epochs = 40,
  validation_data = val_gen,
  validation_steps = val_steps
)
```

This performs about as well as the regular layer_gru.
It’s easy to understand why: all the predictive capacity must come from the chronological half of the network, because the antichronological half is known to be severely underperforming on this task (again, because the recent past matters much more than the distant past in this case).

Going even further

There are many other things you could try, in order to improve performance on the temperature-forecasting problem:

• Adjust the number of units in each recurrent layer in the stacked setup. The current choices are largely arbitrary and thus probably suboptimal.
• Adjust the learning rate used by the RMSprop optimizer.
• Try using layer_lstm instead of layer_gru.
• Try using a bigger densely connected regressor on top of the recurrent layers: that is, a bigger dense layer or even a stack of dense layers.
• Don’t forget to eventually run the best-performing models (in terms of validation MAE) on the test set! Otherwise, you’ll develop architectures that are overfitting to the validation set.

As always, deep learning is more an art than a science. We can provide guidelines that suggest what is likely to work or not work on a given problem, but, ultimately, every problem is unique; you’ll have to evaluate different strategies empirically. There is currently no theory that will tell you in advance precisely what you should do to optimally solve a problem. You must iterate.

Wrapping up

Here’s what you should take away from this section:

• As you first learned in chapter 4, when approaching a new problem, it’s good to first establish common-sense baselines for your metric of choice. If you don’t have a baseline to beat, you can’t tell whether you’re making real progress.
• Try simple models before expensive ones, to justify the additional expense. Sometimes a simple model will turn out to be your best option.
• When you have data where temporal ordering matters, recurrent networks are a great fit and easily outperform models that first flatten the temporal data.
• To use dropout with recurrent networks, you should use a time-constant dropout mask and recurrent dropout mask. These are built into Keras recurrent layers, so all you have to do is use the dropout and recurrent_dropout arguments of recurrent layers.
• Stacked RNNs provide more representational power than a single RNN layer. They’re also much more expensive and thus not always worth it. Although they offer clear gains on complex problems (such as machine translation), they may not always be relevant to smaller, simpler problems.
• Bidirectional RNNs, which look at a sequence both ways, are useful on natural-language processing problems. But they aren’t strong performers on sequence data where the recent past is much more informative than the beginning of the sequence.

NOTE: Markets and machine learning

Some readers are bound to want to take the techniques we’ve introduced here and try them on the problem of forecasting the future price of securities on the stock market (or currency exchange rates, and so on). Markets have very different statistical characteristics than natural phenomena such as weather patterns. Trying to use machine learning to beat markets, when you only have access to publicly available data, is a difficult endeavor, and you’re likely to waste your time and resources with nothing to show for it. Always remember that when it comes to markets, past performance is not a good predictor of future returns – looking in the rear-view mirror is a bad way to drive. Machine learning, on the other hand, is applicable to datasets where the past is a good predictor of the future.
To leave a comment for the author, please follow the link and comment on their blog: TensorFlow for R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Using blogdown with an existing Hugo site

Wed, 12/20/2017 - 01:00

(This article was first published on R on Locke Data Blog, and kindly contributed to R-bloggers)

If you decide you want to use R in your existing Hugo blog, it’s really easy to convert over. There’s a single command you need to know from blogdown, and the rest is working out your deployment process. To create content, use the blogdown RStudio add-in to quickly get started. This niftily reads all tags and categories from past posts to help you get going. You can then write Rmarkdown as usual.

To leave a comment for the author, please follow the link and comment on their blog: R on Locke Data Blog.
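As an aside to the blogdown post above, the workflow it alludes to can be sketched as follows. The directory path is a placeholder, and `serve_site()` is a reasonable candidate for the "single command" mentioned — the post itself does not name it.

```r
library(blogdown)

# One-time setup: install Hugo via blogdown if you don't already manage it.
# install_hugo()

# From the root of your existing Hugo site, serve_site() builds the site
# (rendering any .Rmd posts) and serves a live preview that rebuilds on save.
setwd("~/my-hugo-site")   # placeholder path
blogdown::serve_site()

# New posts can also be created programmatically instead of via the add-in:
blogdown::new_post(title = "Hello R", ext = ".Rmd")
```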
Registration now open for workshop on Deep Learning with Keras and TensorFlow using R

Wed, 12/20/2017 - 01:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

Recently, I announced my workshop on Deep Learning with Keras and TensorFlow. The next dates for it are January 18th and 19th in Solingen, Germany. You can register now by following this link: https://www.codecentric.de/schulung/deep-learning-mit-keras-und-tensorflow

If any non-German-speaking people want to attend, I’m happy to give the course in English! Contact me if you have further questions.

As a little bonus, I am also sharing my sketch notes from a podcast I listened to when first getting into Keras: Sketchnotes: Software Engineering Daily – Podcast from Jan 29th 2016.

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

ASA Police Data Challenge student visualization contest winners

Tue, 12/19/2017 - 18:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The winners of the American Statistical Association Police Data Challenge have been announced.
The ASA teamed up with the Police Data Initiative, which provides open data from local law enforcement agencies in the US, to create a competition for high school and college students to analyze crime data from Baltimore, Seattle and Cincinnati. In this post, we take a look at the winners of the visualization category.

The winners of Best Visualization for college students were Julia Nguyen, Katherine Qian, Youbeen Shim and Catherine Sun from the University of Virginia. Their entry included several visualizations of crime data in Baltimore, including the crime density map shown below. The team used R for all of the data cleaning, manipulation, and visualizations. The tidyverse suite of packages was used for data pipelining (including stringr and lubridate for merging data on latitude/longitude and date), and the ggmap package for visualization.

The winners of Best Visualization for high school students were Alex Lapuente, Ana Kenefick and Sara Kenefick from Charlotte Latin School (Charlotte, N.C.). They used Microsoft Excel to look at overall trends in Seattle crime data, the impact of employment and poverty on crime, and this visualization of the frequency of traffic-related incidents (note the "pedestrian violation" segment — I can attest from experience that jaywalking is strictly enforced in Seattle!):

For more on the Police Data Challenge and the winners in the Overall and Best Use of External Data categories, follow the link below.

This is Statistics: Police Data Challenge: Congratulations to our Winners!

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
Read and write using fst & feather for large data files

Tue, 12/19/2017 - 16:07

(This article was first published on Coastal Econometrician Views, and kindly contributed to R-bloggers)

For the past few years, I was using feather as my favorite data writing and reading option in R (one reason was its cross-platform compatibility across Julia, Python and R). However, I recently observed that its read and write times were not at all effective with large files of size > 5 GB. I found the fst format to be good for both reading and writing large files in R; its only disadvantage is that it is not cross-platform compatible the way feather is. In particular, fst compression at 100% is good for storing large files, and fst is more efficient at reading them back into the R environment. I feel there is no point in re-testing, as benchmarks are already available at the fst site.

Happy R programming – let me know your experiences.

To leave a comment for the author, please follow the link and comment on their blog: Coastal Econometrician Views.

Data Wonderland: Christmas songs from the viewpoint of a data scientist

Tue, 12/19/2017 - 15:54

(This article was first published on eoda english R news, and kindly contributed to R-bloggers)

Whether "Driving Home for Christmas", "Winter Wonderland", "Let it snow!" or "Last Christmas" – every year Christmas songs are taking over the charts again. While the average Joe is joyfully putting on the next Christmas song, the data scientist starts his journey of discovery through the snowy music history. The data set comes from the Kaggle dataset 55000+ Song Lyrics, which contains more than 55,000 songs.
It is a data frame with 55,000+ rows and four columns:

• artist
• song
• link
• text

library(igraph)
library(ggraph)
library(ggplot2)
library(wordcloud2)
library(dplyr)
library(widyr)
library(tidytext)
library(tm)
library(stringr)
library(topicmodels)
library(reshape2)
library(quanteda)
library(Rtsne)
library(DT)
library(knitr)
library(animation)
library(ldatuning)

set.seed(201712)

# Preparation -------------------------------------------------------------
songs <- read.csv("songdata.csv")
songs$song <- songs$song %>% as.character()
songs$artist <- songs$artist %>% as.character()
songs$link <- songs$link %>% as.character()
songs$text <- songs$text %>% as.character()

Our goal is to perform a comprehensive analysis of the song texts to identify the Christmas songs. In order to do so, we first add an additional column to the data frame to give each song a label of either Christmas or Not Christmas, where every song whose title contains the word Christmas, Xmas or X-mas is labeled as Christmas and otherwise as Not Christmas.

# Initialization of the labels
label <- character(dim(songs)[1])
for(i in 1:dim(songs)[1]){
  if(str_detect(songs$song[i], "Christmas") |
     str_detect(songs$song[i], "X-mas") |
     str_detect(songs$song[i], "Xmas")){
    label[i] <- "Christmas"
  } else{
    label[i] <- "Not Christmas"
  }
}
songs <- songs %>% mutate(Label = label)

This is just the initialization of the labels, later we will apply Naive Bayes to a training set to identify the other Christmas songs. First of all, we will start by exploring the data set by means of some intuitive descriptive approaches.
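As an aside, the row-by-row labeling loop above can be collapsed into a single vectorized call. This sketch assumes `songs` has been loaded as shown earlier and applies the same title-matching rule via one regex alternation:

```r
library(dplyr)
library(stringr)

# Vectorized equivalent of the labeling for loop: str_detect works on the
# whole `song` column at once, and ifelse assigns the two labels.
songs <- songs %>%
  mutate(Label = ifelse(str_detect(song, "Christmas|X-mas|Xmas"),
                        "Christmas", "Not Christmas"))
```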

D3Vis <- function(edgeList, directed){
  colnames(edgeList) <- c("SourceName", "TargetName", "Weight")
  # Min-max & inverse scaling, because the weights should represent distance/similarity
  edgeList$Weight <- 1 - edgeList$Weight
  weight.min <- edgeList$Weight %>% min
  weight.max <- edgeList$Weight %>% max
  edgeList$Weight <- (edgeList$Weight - weight.min)/(weight.max - weight.min)

  # Create a graph. Use simplify to ensure that there are no duplicated edges or self loops
  gD <- igraph::simplify(igraph::graph.data.frame(edgeList, directed = directed))

  # Create a node list object (actually a data frame) that will contain information about nodes
  nodeList <- data.frame(ID = c(0:(igraph::vcount(gD) - 1)),  # networkD3 requires IDs to start at 0
                         nName = igraph::V(gD)$name)

  # Map node names from the edge list to node IDs
  getNodeID <- function(x){
    which(x == igraph::V(gD)$name) - 1  # to ensure that IDs start at 0
  }

  # And add them to the edge list
  edgeList <- plyr::ddply(edgeList, .variables = c("SourceName", "TargetName", "Weight"),
                          function(x) data.frame(SourceID = getNodeID(x$SourceName),
                                                 TargetID = getNodeID(x$TargetName)))

  # Calculate some node properties and node similarities that will be used to illustrate
  # different plotting abilities, and add them to the edge and node lists

  # Calculate degree for all nodes
  nodeList <- cbind(nodeList, nodeDegree = igraph::degree(gD, v = igraph::V(gD), mode = "all"))

  # Calculate betweenness for all nodes
  betAll <- igraph::betweenness(gD, v = igraph::V(gD), directed = directed) /
    (((igraph::vcount(gD) - 1) * (igraph::vcount(gD) - 2)) / 2)
  betAll.norm <- (betAll - min(betAll))/(max(betAll) - min(betAll))
  # Scale by 100 for visualization purposes only (to create larger nodes)
  nodeList <- cbind(nodeList, nodeBetweenness = 100*betAll.norm)
  rm(betAll, betAll.norm)

  # Calculate Dice similarities between all pairs of nodes
  dsAll <- igraph::similarity.dice(gD, vids = igraph::V(gD), mode = "all")
  F1 <- function(x) {data.frame(diceSim = dsAll[x$SourceID + 1, x$TargetID + 1])}
  edgeList <- plyr::ddply(edgeList,
                          .variables = c("SourceName", "TargetName", "Weight", "SourceID", "TargetID"),
                          function(x) data.frame(F1(x)))
  rm(dsAll, F1, getNodeID, gD)

  # Create a set of colors for each edge, based on their dice similarity values.
  # We interpolate edge colors using "colorRampPalette", which returns a function
  # corresponding to a color palette of "bias" number of elements (in our case,
  # the total number of edges, i.e., number of rows in the edgeList data frame)
  F2 <- colorRampPalette(c("#FFFF00", "#FF0000"), bias = nrow(edgeList),
                         space = "rgb", interpolate = "linear")
  colCodes <- F2(length(unique(edgeList$diceSim)))
  edges_col <- sapply(edgeList$diceSim,
                      function(x) colCodes[which(sort(unique(edgeList$diceSim)) == x)])
  rm(colCodes, F2)

  # Reverse the earlier transformation of the weights
  edgeList$Weight <- -(edgeList$Weight*(weight.max - weight.min) + weight.min - 1)

  # Let's create a network
  D3_network_LM <- networkD3::forceNetwork(
    Links = edgeList,              # data frame that contains info about edges
    Nodes = nodeList,              # data frame that contains info about nodes
    Source = "SourceID",           # ID of source node
    Target = "TargetID",           # ID of target node
    Value = "Weight",              # edge-list value used to weight relationships amongst nodes
    NodeID = "nName",              # node-list column with the node description we want to use (e.g., node name)
    Nodesize = "nodeBetweenness",  # node-list column used for node size
    Group = "nodeDegree",          # node-list column used for node color
    # height = 500,                # size of the plot (vertical)
    # width = 1000,                # size of the plot (horizontal)
    fontSize = 20,
    # Function to determine distance between any two nodes; uses variables already
    # defined in forceNetwork (not variables from a data frame)
    linkDistance = networkD3::JS("function(d) { return 10*d.value; }"),
    # linkWidth = networkD3::JS("function(d) { return d.value/5; }"),  # link/edge thickness
    opacity = 0.85,
    arrows = directed,
    zoom = TRUE,                   # ability to zoom when clicking on a node
    # opacityNoHover = 0.1,        # opacity of labels when static
    legend = F,
    linkColour = edges_col)        # edge colors

  # Plot network
  D3_network_LM
}

Exploration of the initial Christmas songs

Cleaning & Tokenization

We should start with the data cleaning and tokenization. Afterwards, the Christmas songs will be selected and saved as a variable.

songs.unnest <- songs %>%
  unnest_tokens(word, text) %>%
  anti_join(tibble(word = stop_words$word)) %>%
  filter(!str_detect(word, "\\d+"))

xmas.unnest <- songs.unnest %>% filter(Label == "Christmas")

Correlation Analysis

Now we can start analyzing the initial Christmas songs by means of correlations from different perspectives. In the following, we will visualize the correlations with the networkD3 HTML widget, where nodes with the same total number of connections are given the same color, and the color of an edge reflects the number of common neighbors shared by its two nodes. Moreover, the size of a node indicates its centrality, defined by the betweenness, i.e. the number of shortest paths going through it. The distance between two nodes is the min-max transformation of 1 minus the correlation, which makes sense because, intuitively, the higher the correlation, the nearer two nodes should be. Moreover, the shorter the distance, the wider the edge.

Note that the correlations are always based on lyrics.

Correlation between words

The correlation between words which appeared more than 100 times and are correlated with at least one other word with a correlation greater than 0.55.

correlation.words <- xmas.unnest %>%
  group_by(word) %>%
  filter(n() > 100) %>%
  ungroup() %>%
  pairwise_cor(word, song, sort = T)

# Network visualization
correlation.words %>%
  filter(correlation > 0.55) %>%
  D3Vis(directed = F)

Correlation between songs

The correlation between songs which are correlated with at least 3 other songs with a correlation greater than 0.75. With this, we may detect similar or just slightly modified songs.

correlation.songs <- xmas.unnest %>% pairwise_cor(song, word, sort = T)

# Network visualization
correlation.songs %>%
  filter(correlation > 0.75) %>%
  group_by(item1) %>%
  filter(n() >= 3) %>%
  ungroup() %>%
  D3Vis(directed = F)

Correlation between certain words

The correlation between certain words

correlation.words %>%
  filter(item1 == "christus" | item1 == "jesus" | item1 == "snow" |
         item1 == "reindeer" | item1 == "home" | item1 == "holy" |
         item1 == "love" | item1 == "tree" | item1 == "white" |
         item1 == "christmas",
         correlation > 0.4) %>%
  D3Vis(directed = F)

Correlation between artists

The correlation between artists

correlation.artists <- xmas.unnest %>% pairwise_cor(artist, word, sort = T)

# Network visualization
correlation.artists %>%
  filter(correlation > 0.8) %>%
  group_by(item1) %>%
  filter(n() >= 3) %>%
  ungroup() %>%
  D3Vis(directed = F)

Word Cloud

Word cloud of the initial Christmas songs

xmas.cloud <- xmas.unnest %>% count(word) %>% as.data.frame()
xmas.cloud %>% wordcloud2(minSize = 3, shape = 'star')

Naive Bayes

Naive Bayes is a popular supervised machine learning algorithm for classification problems with a large number of features. It is "naive" in the sense that, conditioned on a class, the features are assumed to be independently distributed. In our case, we would like to know, given a bunch of features, i.e. the tf-idf of the words in a document, whether Naive Bayes classifies a song as a Christmas song or not.

Generally, given features $\mathbf{x} = (x_1, \dots, x_p)$ we have
\begin{aligned} \mathbb{P}(C_k \mid \mathbf{x}) &= \frac{\mathbb{P}(C_k)\,\mathbb{P}(\mathbf{x} \mid C_k)}{\mathbb{P}(\mathbf{x})} \\ &= \frac{\mathbb{P}(C_k)\prod_{i=1}^p \mathbb{P}(x_i \mid C_k)}{\mathbb{P}(\mathbf{x})} \varpropto \mathbb{P}(C_k)\prod_{i=1}^p \mathbb{P}(x_i \mid C_k), \end{aligned}
where $\mathbb{P}(C_k)$ is called the prior, $\mathbb{P}(C_k \mid \mathbf{x})$ the posterior and $\mathbb{P}(\mathbf{x} \mid C_k)$ the likelihood. The classifier then picks the class that maximizes the posterior: $$\hat{C} := \underset{k}{\arg\max}\; \mathbb{P}(C_k)\prod_{i = 1}^p \mathbb{P}(x_i \mid C_k)$$

Because we assume that the features are independent conditional on an arbitrary class, we may estimate each $\mathbb{P}(x_i \mid C_k)$, for $i = 1, \dots, p$, independently of the other features using a training set, which makes the whole thing much easier. Popular assumptions for the likelihood are Gaussian, multinomial or Bernoulli. The harder part of constructing the estimator is the choice of the prior distribution, i.e. the probability distribution of the classes, which is usually assumed to be uniform or estimated by the class frequencies. In our case, the multinomial distribution for the likelihood and the uniform distribution for the prior are used, which means we have no prejudice regarding the categorization of the songs without given further information.
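To make the formula concrete, here is a toy posterior computation under the multinomial likelihood and uniform prior. The per-class word probabilities and the document's word counts are made-up numbers for illustration, not values from the actual corpus:

```r
# Toy multinomial Naive Bayes with a uniform prior over two classes.
vocab <- c("snow", "love", "reindeer")
p_word_given_xmas     <- c(snow = 0.5, love = 0.1, reindeer = 0.4)
p_word_given_not_xmas <- c(snow = 0.1, love = 0.7, reindeer = 0.2)
prior <- 0.5  # uniform over the two classes

doc <- c(snow = 2, love = 1, reindeer = 1)  # word counts in one "song"

# Unnormalized log-posteriors: log P(C_k) + sum_i n_i * log P(word_i | C_k)
log_post_xmas <- log(prior) + sum(doc * log(p_word_given_xmas))
log_post_not  <- log(prior) + sum(doc * log(p_word_given_not_xmas))

if (log_post_xmas > log_post_not) "Christmas" else "Not Christmas"
# -> "Christmas" (-5.3 vs -7.3 on the log scale)
```

Working on the log scale avoids underflow when the product runs over thousands of vocabulary terms, which is why text classifiers implement the arg max this way.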

Identify the hidden Christmas songs

# Document-feature matrix
songs.dfm.tfidf <- corpus(songs, text_field = "text", docid_field = "song") %>%
  dfm(tolower = T, stem = TRUE, remove_punct = TRUE, remove = stopwords("english")) %>%
  dfm_trim(min_count = 5, min_docfreq = 3) %>%
  dfm_weight(type = "tfidf")

# Determine the indices for the training set
christmas.index <- which(label == "Christmas")
not_christmas.index <- which(label == "Not Christmas")
christmas.train.index <- christmas.index
not_christmas.train.index <- sample(not_christmas.index, length(christmas.index))
train.index <- c(christmas.train.index, not_christmas.train.index)
label.train <- label[train.index]
trainning.set <- songs.dfm.tfidf[train.index, ]

# Train the model
classifier_NB <- textmodel_NB(trainning.set, label.train)

# Prediction
predictions <- classifier_NB %>% predict(newdata = songs.dfm.tfidf)

# Confusion matrix
confusion <- table(predictions$nb.predicted, label)
confusion

So we have identified 2965 hidden Christmas songs, and there are 2 songs out of the initial 500 Christmas songs that are rejected by Naive Bayes as Christmas songs.

Explore the hidden Christmas songs

# Determine the indices for the hidden (not) Christmas songs
hidden.index <- (predictions$nb.predicted == "Christmas") & (songs$Label == "Not Christmas")
hidden_not.index <- (predictions$nb.predicted == "Not Christmas") & (songs$Label == "Christmas")

# Change the labels
label[hidden.index] <- "Hidden Christmas"
label[hidden_not.index] <- "Hidden Not Christmas"
songs$Label <- label
songs.dfm.tfidf@docvars$Label <- label

# Word cloud for the hidden Christmas songs
hidden.xmas <- songs[hidden.index, ]
hidden.unnest <- hidden.xmas %>%
  unnest_tokens(word, text) %>%
  anti_join(tibble(word = stop_words$word)) %>%
  filter(!str_detect(word, "\\d+"))
hidden.unnest %>%
  count(word) %>%
  filter(n >= 5) %>%
  as.data.frame() %>%
  wordcloud2(shape = "star", minSize = 5)

# Correlation
hidden.correlation.words <- hidden.unnest %>%
  group_by(word) %>%
  filter(n() > 15) %>%
  ungroup() %>%
  pairwise_cor(word, song, sort = T)

# Network visualization
hidden.correlation.words %>%
  filter(correlation > 0.65) %>%
  group_by(item1) %>%
  filter(n() >= 20) %>%
  ungroup() %>%
  D3Vis(directed = F)

We have therefore successfully identified a bunch of religious Christmas songs, whose titles usually do not contain the word "Christmas" or "X-mas".

Latent Dirichlet Allocation & t-Distributed Stochastic Neighbor Embedding

Data Preparation

Only the top 300 features of the Christmas songs (including the hidden ones) will be used for Rtsne & LDA; otherwise the available memory will not be sufficient.

xmas.dfm.tfidf <- songs.dfm.tfidf %>%
  dfm_subset(Label == "Christmas" | Label == "Hidden Christmas")

songs.dfm.tfidf_300 <- songs.dfm.tfidf %>%
  dfm_select(pattern = xmas.dfm.tfidf %>% topfeatures(300) %>% names(),
             selection = "keep")

xmas.dfm.tfidf_300 <- xmas.dfm.tfidf %>%
  dfm_select(pattern = xmas.dfm.tfidf %>% topfeatures(300) %>% names(),
             selection = "keep")

LDA

LDA stands for Latent Dirichlet Allocation, introduced in Blei, Ng and Jordan (2003). It is a generative probabilistic model of a corpus, where the documents are represented as random mixtures over latent topics, and for a single document there are usually only a few topics that are assigned non-negligible probabilities. Moreover, each topic is characterized by a distribution over words, where usually only a small set of words is assigned significant probabilities for a certain topic. Either the variational expectation-maximization algorithm or Gibbs sampling is used for the statistical inference of the parameters.

LDA requires a fixed number of topics, i.e. it assumes that the number of topics is already known before applying the algorithm. However, it is possible to determine the optimal number of topics via different performance metrics (see Nikita's documentation for the ldatuning package).

dtm_xmas <- xmas.dfm.tfidf_300 %>% convert(to = "topicmodels")
OptimalNumber <- FindTopicsNumber(dtm_xmas,
                                  topics = seq(2, 8, by = 1),
                                  mc.cores = 2,
                                  metrics = c("CaoJuan2009", "Arun2010", "Deveaud2014"),
                                  method = "VEM",
                                  verbose = T)
FindTopicsNumber_plot(OptimalNumber)

Therefore, we will choose 8 as the optimal number of topics.

LDA_xmas <- xmas.dfm.tfidf_300 %>% convert(to = "topicmodels") %>% LDA(k = 8)

We may use the package tidytext to inspect the topic probability distribution of each document, i.e. for each document the sum of the probabilities that it belongs to a topic from 1 to 8 is equal to 1.

LDA_xmas %>% tidy(matrix = "gamma") %>% datatable(rownames = F)

Analogously, we can also obtain the probability distribution of words for each topic, i.e. for each topic the sum of probabilities that it generates different words is equal to 1.

LDA_xmas %>% tidy(matrix = "beta") %>% datatable(rownames = F)

The top terms for each topic are:

# LDA for the Christmas songs
terms(LDA_xmas, 10)

t-SNE

Developed by van der Maaten and Hinton (2008), t-SNE stands for t-Distributed Stochastic Neighbor Embedding, a dimensionality-reduction technique formulated to capture the local clustering structure of the original data points. It is non-linear and non-deterministic.

Generally speaking, we have data points with high dimensionality $$x_1, \dots, x_n \in \mathbb{R}^N$$ and would like to calculate their counterparts $$y_1, \dots, y_n \in \mathbb{R}^M$$ in a low-dimensional space, i.e. where $M \ll N$.

First of all, we define the probability that $x_i$ would pick $x_j$ as its neighbor as $$p_{j|i} = \frac{\exp(-\|x_j - x_i\|^2/2\sigma_i^2)}{\sum_{k\neq i} \exp(-\|x_k - x_i\|^2/2\sigma_i^2)},$$ i.e. it is proportional to a Gaussian centered at $x_i$, where the variance $\sigma_i$ is determined by a binary search such that the perplexity $$Perp(p_i) = 2^{H(p_i)} = 2^{-\sum_{j\neq i} p_{j|i}\log_2 p_{j|i}}$$ is as close as possible to a perplexity predefined by the user. However, the conditional probability is not symmetric. In order to measure the similarity between $x_i$ and $x_j$, we define the metric to be $$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2}.$$ The similarity metric for $y_1, \dots, y_n$ is defined via the Student-t distribution with one degree of freedom, i.e. the similarity between $y_i$ and $y_j$ is $$q_{ij} = \frac{(1 + \|y_j - y_i\|^2)^{-1}}{\sum_{k \neq i}(1 + \|y_k - y_i\|^2)^{-1}}.$$

The goal of t-SNE is to find the counterparts $\mathfrak{Y} = (y_1, \dots, y_n)$ of $\mathfrak{X} = (x_1, \dots, x_n)$ such that the Kullback-Leibler divergence $$D_{KL}(P\|Q) = \sum_{i \neq j} p_{ij}\log\frac{p_{ij}}{q_{ij}},$$ i.e. our loss function $C$, which can be understood as the information loss from using $\mathfrak{Y}$ to represent $\mathfrak{X}$, is minimized. Obviously, there will be a relatively high loss if we use a far-apart pair $(y_i, y_j)$ to represent a nearby pair $(x_i, x_j)$. Therefore, the local clustering/neighborhood structure of $\mathfrak{X}$ is preserved. It can be shown that the gradient of the loss function has the relatively simple form $$\frac{dC}{d\mathfrak{Y}} = \left(\frac{\partial C}{\partial y_1}, \dots, \frac{\partial C}{\partial y_n}\right)$$ where $$\frac{\partial C}{\partial y_i} = 4\sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}.$$

Gradient descent with momentum is applied to minimize the loss function: $$\mathfrak{Y}^{(t)} = \mathfrak{Y}^{(t-1)} + \eta\frac{dC}{d\mathfrak{Y}} + \alpha(t)\left(\mathfrak{Y}^{(t-1)} - \mathfrak{Y}^{(t-2)}\right),$$ where $\eta$ is called the learning rate and $\alpha(t)$ the momentum. $\mathfrak{Y}^{(0)}$ is a sample from an isotropic Gaussian with small variance.

The following computation will take about 30 minutes.

# t-Distributed Stochastic Neighbor Embedding --------------------------------
index.unique.songs <- !songs.dfm.tfidf_300 %>% as.matrix() %>% duplicated()
songs.unique <- songs.dfm.tfidf_300[index.unique.songs, ] %>% as.matrix()
tsne.all <- Rtsne(songs.unique)

songs_2d <- tsne.all$Y %>%
  as.data.frame() %>%
  mutate(Label = label[index.unique.songs])

songs_2d %>%
  ggplot(aes(x = V1, y = V2, color = Label)) +
  geom_point(size = 0.25) +
  scale_color_manual(values = c("Not Christmas" = "#a6a6a6",
                                "Christmas" = "#88ab33",
                                "Hidden Christmas" = "#F98948",
                                "Hidden Not Christmas" = "#437F97")) +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "black"))

What if we repeat the procedure for more than one iteration?

So far we have only run the Naive Bayes for one iteration. However, we may repeat this procedure for more than one iteration, i.e. train a Naive Bayes classifier and relabel all the false positives as Hidden Christmas/Christmas and all the false negatives as Hidden Not Christmas/Not Christmas over and over again.

First of all, we prepare the data again to avoid bugs.

songs <- read.csv("songdata.csv")
songs$song <- songs$song %>% as.character()
songs$artist <- songs$artist %>% as.character()
songs$link <- songs$link %>% as.character()
songs$text <- songs$text %>% as.character()

# Initialization of the labels
label <- character(dim(songs)[1])
for(i in 1:dim(songs)[1]){
  if(str_detect(songs$song[i], "Christmas") |
     str_detect(songs$song[i], "X-mas") |
     str_detect(songs$song[i], "Xmas")){
    label[i] <- "Christmas"
  } else{
    label[i] <- "Not Christmas"
  }
}
songs <- songs %>% mutate(Label = label)

songs.dfm.tfidf <- corpus(songs, text_field = "text", docid_field = "song") %>%
  dfm(tolower = T, stem = TRUE, remove_punct = TRUE, remove = stopwords("english")) %>%
  dfm_trim(min_count = 5, min_docfreq = 3) %>%
  dfm_weight(type = "tfidf")

results <- data.frame(precision = numeric(10), recall = numeric(10), f1_score = numeric(10))

Run 10 iterations.

for(i in 1:10){
  # Determine the indices
  christmas.index <- which(label == "Christmas")
  not_christmas.index <- which(label == "Not Christmas")
  if(length(christmas.index) < length(not_christmas.index)){
    christmas.train.index <- christmas.index
    not_christmas.train.index <- sample(not_christmas.index, length(christmas.index))
  } else{
    not_christmas.train.index <- not_christmas.index
    christmas.train.index <- sample(christmas.index, length(not_christmas.index))
  }
  train.index <- c(christmas.train.index, not_christmas.train.index)
  label.train <- label[train.index]
  trainning.set <- songs.dfm.tfidf[train.index, ]

  # Train the model
  classifier_NB <- textmodel_NB(trainning.set, label.train)

  # Prediction
  predictions <- classifier_NB %>% predict(newdata = songs.dfm.tfidf)

  # Confusion matrix
  confusion <- table(predictions$nb.predicted, label)
  precision <- confusion[1, 1]/sum(confusion[1, ])
  recall <- confusion[1, 1]/sum(confusion[, 1])
  f1_score <- 2*precision*recall/(precision + recall)

  # The hidden (not) Christmas songs ----------------------------------------
  hidden.index <- (predictions$nb.predicted == "Christmas") & (songs$Label == "Not Christmas")
  hidden_not.index <- (predictions$nb.predicted == "Not Christmas") & (songs$Label == "Christmas")
  hidden.xmas <- songs[hidden.index, ]
  hidden.not_xmas <- songs[hidden_not.index, ]
  label[hidden.index] <- "Hidden Christmas"
  label[hidden_not.index] <- "Hidden Not Christmas"

  songs_2d <- tsne.all$Y %>%
    as.data.frame() %>%
    mutate(Label = label[index.unique.songs])
  iteration.plot <- songs_2d %>%
    ggplot(aes(x = V1, y = V2, color = Label)) +
    geom_point(size = 0.25) +
    scale_color_manual(values = c("Not Christmas" = "#a6a6a6",
                                  "Christmas" = "#88ab33",
                                  "Hidden Christmas" = "#F98948",
                                  "Hidden Not Christmas" = "#437F97")) +
    guides(color = guide_legend(override.aes = list(size = 5))) +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_rect(fill = "black")) +
    ggtitle(paste("Iteration:", i))

  # Change the labels
  label[hidden.index] <- "Christmas"
  label[hidden_not.index] <- "Not Christmas"
  songs$Label <- label
  songs.dfm.tfidf@docvars$Label <- label

  results[i, ] <- c(precision, recall, f1_score)
  plot(iteration.plot)
}

results %>%
  mutate(index = 1:10) %>%
  melt(id = "index") %>%
  ggplot(aes(x = index, y = value, color = variable)) +
  geom_line()

The precision as well as the F1 score grow monotonically at first and then converge to a value around 0.95, which means there are not many "Hidden Christmas Songs" and "Hidden Not Christmas Songs" left to be detected. However, in this procedure we always assume that the Naive Bayes classifier is 100% accurate, which is hardly possible. Thus, in each iteration some songs falsely classified by Naive Bayes as "Christmas" end up in the next iteration's training set. With this accumulating error, the results may actually get worse with more iterations.
confusion

At the end we have roughly half of the songs classified as "Christmas" and the other half as "Not Christmas", which seems very implausible. It raises the question of whether there is an optimal number of iterations; however, we simply cannot manually check whether all 57,650 songs are correctly classified. This remains an open question.

To leave a comment for the author, please follow the link and comment on their blog: eoda english R news.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

5 interesting subtle insights from TED videos data analysis in R

Tue, 12/19/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

This post aims to bring out some not-so-obvious subtle insights from analyzing TED videos posted on TED.com. For those who do not know what TED is, here's the summary from Wikipedia:

TED (Technology, Entertainment, Design) is a media organization which posts talks online for free distribution, under the slogan "ideas worth spreading".

This analysis uses the TED Talks dataset posted on Kaggle Datasets.

Data Description

Once the data is downloaded from the above link and unzipped, two files – transcripts.csv and ted_main.csv – can be read into R as below:

transcripts <- read.csv('../transcripts.csv', stringsAsFactors = F, header = T)
main <- read.csv('../ted_main.csv', stringsAsFactors = F, header = T)

ted_main.csv contains information such as speaker name, talk name, event name, talk duration, comments, video views and much more for the videos made available on TED, and transcripts.csv contains the entire talk transcripts of the same talks.

Loading Libraries

library(dplyr)
library(ggplot2)
library(ggthemes)

Extract some basic info regarding the dataset.

nrow(main)
2550

Video details are available in the main data frame. Since not all TED videos are extremely popular, let us see how many of them have more than 1M views.

paste0('Total Number of videos with more than 1M views: ',
       main %>% filter(views > 1000000) %>% count())
paste0('% of videos with more than 1M views: ',
       round((main %>% filter(views > 1000000) %>% count() / nrow(main))*100, 2), '%')

'Total Number of videos with more than 1M views: 1503'
'% of videos with more than 1M views: 58.94%'

Being a one-trick pony is very easy in any business, so let us explore the best of the not-so-one-trick ponies.

main %>% filter(views > 1000000) %>%
  group_by(main_speaker) %>% count() %>%
  filter(n > 2) %>% arrange(desc(n)) %>% head(20) %>%
  ggplot() +
  geom_bar(aes(reorder(main_speaker, -n), n), stat = 'identity') +
  theme_solarized() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab('Speakers') +
  ggtitle('Top 20 Frequently Appearing Speakers in all videos with 1M+ views')

Gives this plot:

And the winner is none other than Mr. Hans Rosling, whose Gapminder TED talk is always an inspiration for any data wiz. Many a time, the amount of effort put in gets translated into the effectiveness of the outcome, but the best is always getting things done with less effort – which in our case means more views with less time.
main %>% filter(views > 1000000) %>% arrange(duration) %>% slice(1:10) %>%
  select('name', 'duration', 'views', 'event')

name                                                  duration    views  event
Derek Sivers: Weird, or just different?                    162  2835976  TEDIndia 2009
Paolo Cardini: Forget multitasking, try monotasking        172  2324212  TEDGlobal 2012
Mitchell Joachim: Don't build your home, grow it!          176  1332785  TED2010
Arthur Benjamin: Teach statistics before calculus!         178  2175141  TED2009
Terry Moore: How to tie your shoes                         179  6263759  TED2005
Malcolm London: "High School Training Ground"              180  1188177  TED Talks Education
Bobby McFerrin: Watch me play ... the audience!            184  3302312  World Science Festival
Derek Sivers: How to start a movement                      189  6475731  TED2010
Bruno Maisonnier: Dance, tiny robots!                      189  1193896  TEDxConcorde
Dean Ornish: Your genes are not your fate                  192  1384333  TED2008

And the winner this time is Derek Sivers, who happens to appear twice on the same list, delivering two popular TED talks that are each just under 6 minutes.

Malcolm Gladwell, in his book Outliers, presents an interesting case of how date of birth plays a role in hockey team selection, so let us see if there is any magical first letter of the name that stands out among TED speakers.

main$first_letter <- substr(main$main_speaker, 1, 1)

main %>% group_by(first_letter = toupper(first_letter)) %>% count() %>% arrange(desc(n)) %>%
  ggplot() +
  geom_bar(aes(reorder(first_letter, -n), n), stat = 'identity') +
  theme_solarized() +
  xlab('Speaker First Letter') + ylab('Count') +
  ggtitle('Popular First Letter of Author Names appearing in TED Talks')

Gives this plot:

And 'J' is the outstanding winner of the first-letter race, whose name holders most frequently appeared in TED talks (remember, correlation doesn't mean causation!).

While TED primarily publishes videos from TED Global events, some great TEDx videos get featured on TED, so let us analyze which TEDx chapter made it big.
# Note: subset TEDx events first (the tedx object was not defined in the original post;
# filtering the event column for 'TEDx' is an assumption)
tedx <- main %>% filter(grepl('TEDx', event))

tedx %>% filter(views > 1000000) %>%
  group_by(event) %>% count() %>%
  filter(n > 2) %>% arrange(desc(n)) %>% head(20) %>%
  ggplot() +
  geom_bar(aes(reorder(event, -n), n), stat = 'identity') +
  theme_solarized() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab('TEDx Events') +
  ggtitle('Top 20 TEDx Events with the most 1M+ view talks on TED.com')

Gives this plot:

And finally, let us analyze what magical (first) word TED speakers usually start their talks with.

transcripts$first_word <- unlist(lapply(transcripts$transcript,
                                        function(x) strsplit(x, " ")[[1]][1]))

transcripts %>% group_by(first_word) %>% count() %>% arrange(desc(n)) %>% head(25) %>%
  ggplot() +
  geom_bar(aes(reorder(first_word, -n), n), stat = 'identity') +
  theme_solarized() +
  xlab('First Word of the Talk') + ylab('Count') +
  ggtitle('Top First Word of the Talk') +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Gives this plot:

And like most humans on the planet, most TED speakers seem to start with 'I' (narcissism, maybe?), and strangely 'Chris' – the first name of Chris Anderson, the owner of TED – appears on the same list too (gratitude, maybe!). This dataset still has a lot more interesting, subtle insights to be unveiled. The code used here is available on my Github.

Related Post

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.
ANNOUNCEMENT: EARL London 2018 + abstract submissions open!

Tue, 12/19/2017 - 11:34

We are thrilled to announce the dates for next year's EARL London: 11-13 September 2018 (make sure you pop these dates in your calendars!).

Call for abstracts

Abstract submissions are now open! We'd love to hear from you if you want to share your R story. Every year, R users from all over the world submit fantastic abstracts around really cool commercial R usage and we want you to be one of them in 2018!

We're accepting lightning talks…

For 2018, we're mixing things up and accepting abstracts for lightning talks as well as the usual 30-minute slots. These lightning talk slots are 10 minutes long, so it's about concentrating the awesomeness of R into a bite-sized chunk.

Seven reasons you should submit an abstract

In case you're 'umming' and 'ahhing', we've put together the top seven reasons why you should submit an abstract:

1. Our attendees – our audience are amongst the most passionate about R in the world. They come from across the globe to discuss, share and network with their fellow R users. Our audience are also incredibly friendly, which makes networking fun and easy. You shouldn't worry about attending solo either, because many of our attendees do too.
2. The other speakers – hopefully after your talk you'll be able to catch some of the other presentations. At EARL, speaker slots aren't sponsored, so we're not charging companies to share their stories. This helps us get fascinating speakers from a huge range of industries and different-sized companies.
3. Forefront of R usage in business – on the topic of speakers, having so many industries represented means EARL is a great place to get a live snapshot of what's going on in the industry right now and what trends and technologies are starting to emerge.
4. The supportive atmosphere – maybe you think you don't know enough to present a talk, but it's not about being the best or knowing the most.
People come to EARL to hear from people just like them. They want to hear about the lessons you've learnt, your successes, and your mistakes. EARL is about inspiring others to solve problems with R.
5. The evening networking event – we've been to HMS Belfast, the Tower of London, Tower Bridge and on a cruise down the Thames. We can't reveal our location for 2018 just yet, but it's lining up to be another incredible venue. The evening reception is a great opportunity to catch someone you've been wanting to talk to or just have a more relaxed discussion over a couple of drinks.
6. The range of topics – if you're unsure whether anyone will be interested in your talk, we can promise you that there will be others there who love what you love. Our audience always have plenty of interesting questions to ask no matter what the talk topic is.
7. Free conference ticket – speakers at EARL receive a free conference pass for the day they are speaking and the evening networking event.

Deadline

Abstract submissions for EARL London close on 9 March 2018. Please do get in contact if you have any questions about your topic, submitting abstracts, or if you just need some encouragement! Submit your abstract here.

Tickets

Tickets will be available in January 2018, so keep your eyes peeled! We look forward to reading your abstracts!

Insurance Data Science Conference 2018

Tue, 12/19/2017 - 01:00

(This article was first published on R on mages' blog, and kindly contributed to R-bloggers)

Following five R in Insurance conferences, we are organising the first Insurance Data Science conference at Cass Business School London, 16 July 2018.
In 2013, we started with the aim of bringing practitioners from industry and academia together to discuss and exchange ideas and needs from both sides. R was and is a perfect glue between the two groups, a tool which both sides embrace and which has fostered knowledge transfer between the two. However, R is just one example, and other languages serve this purpose equally well. Python is another popular language, but Julia and Stan have also gained momentum. For that reason we have rebranded our conference series to "Insurance Data Science". We believe that by removing the explicit link to "R" we have more freedom to stay relevant and embrace whatever technology may evolve in the future.

We expect contributions on topics related to risk assessment, customer analytics, pricing, reserving, capital management, catastrophe and econometric modelling.

The keynote speakers are:

The conference will be followed by a full-day Stan in Insurance workshop with Eric Novik and Paul-Christian Bürkner.

Key dates

• 1 February 2018: Registration opens
• 9 April 2018: Abstract submission deadline
• 31 May 2018: Early-bird registration deadline
• 16 July 2018: Insurance Data Science Conference
• 17 July 2018: Stan in Insurance workshop

For more details visit https://insurancedatascience.org

To leave a comment for the author, please follow the link and comment on their blog: R on mages' blog.
Le Monde puzzle [#1033]

Tue, 12/19/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A simple Le Monde mathematical puzzle after two geometric ones I did not consider:

1. Bob gets a 2×3 card with three integer entries on the first row and two integer entries on the second row such that (i) entry (1,1) is 1, and (ii) summing up subsets of adjacent entries produces all integers from 1 to 21. (Adjacent means sharing an index.) Deduce Bob's voucher.
2. Alice gets Bob's voucher completed into a 2×4 card with further integer entries. What is the largest value of N such that all integers from 1 to N are available through summing up subsets of entries?

The first question only requires a few attempts, but it can also be solved by brute-force simulation. Here is R code that leads to the solution:

alsumz <- function(sol){
  return(c(sol, sum(sol[1:2]), sum(sol[2:3]), sum(sol[4:5]),
           sum(sol[c(1,4)]), sum(sol[c(1,5)]), sum(sol[1:3]),
           sum(sol[c(1,4,5)]), sum(sol[c(1,2,5)]), sum(sol[c(2,4,5)]),
           sum(sol[c(1,2,3,5)]), sum(sol[2:5]), sum(sol[c(1,2,4)]),
           sum(sol[c(1,2,4,5)]), sum(sol[1:4]), sum(sol), sum(sol[c(2,3,5)])))
}

which produces (1,8,7,3,2) as the only case for which

length(unique(alsumz(sol))) == 21

The second puzzle means considering all subset sums and checking that every integer up to N can be produced. There is no adjacency constraint in this second question, hence in principle this could produce N = 2⁸−1 = 255, but I have been unable to exceed 175 through brute-force simulation. (Which entitled me to use the as.logical(intToBits(i)) R command!)
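The all-subsets search for the second question can be sketched with the intToBits trick mentioned above; the helper names and the example card below are my own, for illustration only:

```r
# All subset sums of a vector, indexing subsets by the bits of 1..(2^n - 1)
all_subset_sums <- function(v){
  n <- length(v)
  sapply(1:(2^n - 1), function(i) sum(v[as.logical(intToBits(i))[1:n]]))
}

# Largest N such that every integer 1..N is attainable as a subset sum
max_consecutive <- function(v){
  s <- unique(all_subset_sums(v))
  N <- 0
  while((N + 1) %in% s) N <- N + 1
  N
}

max_consecutive(c(1, 2, 4, 8))  # distinct powers of two reach every value up to 15
```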
Filed under: Books, Kids, R Tagged: Alice and Bob, Le Monde, mathematical puzzle, partition, R

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

How to Create Notebooks on Datazar

Tue, 12/19/2017 - 00:07

(This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to R-bloggers)

Getting started with notebooks is as easy as clicking a button on Datazar. Notebooks are great tools that make creating and sharing reproducible analysis super easy. To create R or Python notebooks, follow these directions:

I. Navigate to your project, or create a new one if you don't have one yet. Click on the "Create File" button and a box containing several options will pop up. Let's go with "R Notebook" for this one. You can choose "Python Notebook" and the steps will be exactly the same. And just like that, you'll be redirected to your freshly printed Notebook! The first thing you'll see is the first cell (a code text box). If you created an R Notebook, type in message("hello world!") and then hit Shift+Enter. If you created a Python Notebook, type in print "hello world!" and then Shift+Enter. You'll see the results for the cell displayed right below it, and a new cell created right below the result. On and on; you get the gist.

II.
If you want to load data that's in your project and access it from your Notebook for your analysis, all you have to do is click on the "Load Files" button and then click on the button next to the file you want to load. If you're using the R Notebook, you can then import the dataset with data <- read.csv("ExampleData.csv"). That's it! The data is now loaded into the data variable. You can do the same with script files using the source function in R. Whatever you already do in R (in your terminal or other programs), you can replicate in these Notebooks, including plots/visualizations, loading external libraries, etc.

Visit www.datazar.com to create Notebooks for free and show us what you built @datazarhq

How to Create Notebooks on Datazar was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

To leave a comment for the author, please follow the link and comment on their blog: R Language in Datazar Blog on Medium.

Data Manipulation in R

Mon, 12/18/2017 - 22:29

(This article was first published on R on Locke Data Blog, and kindly contributed to R-bloggers)

Data Manipulation in R is now generally available on Amazon. All book links will attempt geo-targeting so you end up at the right Amazon. Prices are in USD as most readers are American and the price will be the equivalent in local currency.
Data Manipulation in R is the second book in my R Fundamentals series, which takes folks from no programming knowledge through to being an experienced R user.

To leave a comment for the author, please follow the link and comment on their blog: R on Locke Data Blog.

How to Perform Hierarchical Clustering using R

Mon, 12/18/2017 - 22:01

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

What is Hierarchical Clustering?

Clustering is a technique to club similar data points into one group and separate out dissimilar observations into different groups or clusters. In Hierarchical Clustering, clusters are created such that they have a predetermined ordering, i.e. a hierarchy. For example, consider the concept hierarchy of a library. A library has many sections, each section has many books, and the books are grouped according to their subject, let's say. This forms a hierarchy. In Hierarchical Clustering, this hierarchy of clusters can either be created from top to bottom or vice versa. Hence, there are two types, namely Divisive and Agglomerative. Let's discuss them in detail.

Divisive Method

In the Divisive method, we assume that all of the observations belong to a single cluster and then divide the cluster into two least similar clusters. This is repeated recursively on each cluster until there is one cluster for each observation.
This technique is also called DIANA, which is an acronym for Divisive Analysis.

Agglomerative Method

It's also known as Hierarchical Agglomerative Clustering (HAC) or AGNES (an acronym for Agglomerative Nesting). In this method, each observation is initially assigned to its own cluster. Then, the similarity (or distance) between each pair of clusters is computed and the two most similar clusters are merged into one. This merging step is repeated until there is only one cluster left.

Please note that the Divisive method is good for identifying large clusters, while the Agglomerative method is good for identifying small clusters. We will proceed with Agglomerative Clustering for the rest of the article, since HACs account for the majority of hierarchical clustering algorithms, while Divisive methods are rarely used.

I think we now have a general overview of Hierarchical Clustering. Let's also get familiar with its algorithm.

HAC Algorithm

Given a set of N items to be clustered and an N×N distance (or similarity) matrix, the basic process of Johnson's (1967) hierarchical clustering is:

1. Assign each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

In steps 2 and 3, the algorithm talks about finding similarity among clusters. So, before any clustering is performed, it is required to determine the distance matrix that specifies the distance between each pair of data points using some distance function (Euclidean, Manhattan, Minkowski, etc.).
Then, the matrix is updated to specify the distance between different clusters that are formed as a result of merging. But how do we measure the distance (or similarity) between two clusters of observations? For this, we have three different methods, as described below. These methods differ in how the distance between clusters is measured.

Single Linkage

It is also known as the connectedness or minimum method. Here, the distance between one cluster and another is taken to be equal to the shortest distance from any data point of one cluster to any data point in the other. That is, the distance is based on the similarity of the closest pair of data points, as shown in the figure. It tends to produce long, "loose" clusters. A disadvantage of this method is that it can cause premature merging of groups with close pairs, even if those groups are quite dissimilar overall.

Complete Linkage

This method is also called the diameter or maximum method. In this method, we consider the similarity of the furthest pair. That is, the distance between one cluster and another is taken to be equal to the longest distance from any member of one cluster to any member of the other. It tends to produce more compact clusters. One drawback of this method is that outliers can cause close groups to be merged later than is optimal.

Average Linkage

In the Average linkage method, we take the distance between one cluster and another to be equal to the average distance from any member of one cluster to any member of the other. A variation on average-link clustering is the UCLUS method of D'Andrade (1978), which uses the median distance instead of the mean distance.

Ward's Method

Ward's method aims to minimize the total within-cluster variance. At each step, the pair of clusters with minimum between-cluster distance is merged. In other words, it forms clusters in a manner that minimizes the loss associated with each cluster.
At each step, the union of every possible cluster pair is considered, and the two clusters whose merger results in the minimum increase in information loss are combined. Here, information loss is defined by Ward in terms of an error sum-of-squares (ESS) criterion. If you want a mathematical treatment of this, visit this link.

The following table describes the mathematical equations for each of the methods, where:

• X1, X2, …, Xk = observations from cluster 1
• Y1, Y2, …, Yl = observations from cluster 2
• d(x, y) = distance between a subject with observation vector x and a subject with observation vector y
• ||.|| = Euclidean norm

Implementing Hierarchical Clustering in R

Data Preparation

To perform clustering in R, the data should be prepared as per the following guidelines:

1. Rows should contain observations (or data points) and columns should be variables.
2. Check if your data has any missing values; if yes, remove or impute them.
3. Data across columns must be standardized or scaled, to make the variables comparable.

We'll use a data set called 'Freedman' from the 'car' package. The 'Freedman' data frame has 110 rows and 4 columns. The observations are U.S. metropolitan areas with 1968 populations of 250,000 or more. There are some missing data. (Make sure you install the car package before proceeding.)

data <- car::Freedman

To remove any missing values that might be present in the data, type this:

data <- na.omit(data)

This simply removes any row that contains missing values. You can use more sophisticated methods for imputing missing values; however, we'll skip this as it is beyond the scope of the article. Also, we will be using numeric variables here for the sake of simply demonstrating how clustering is done in R. Categorical variables would require special treatment, which is also not within the scope of this article. Therefore, we have selected a data set with numeric variables alone for conciseness.
Next, we have to scale all the numeric variables. Scaling means each variable will now have mean zero and standard deviation one. Ideally, you want a unit in each coordinate to represent the same degree of difference. Scaling makes the standard deviation the unit of measurement in each coordinate, and prevents the clustering algorithm from depending on arbitrary variable units. You can scale/standardize the data using the R function scale:

data <- scale(data)

Implementing Hierarchical Clustering in R

There are several functions available in R for hierarchical clustering. Here are some commonly used ones:

• 'hclust' (stats package) and 'agnes' (cluster package) for agglomerative hierarchical clustering
• 'diana' (cluster package) for divisive hierarchical clustering

Agglomerative Hierarchical Clustering

The 'hclust' function requires distance values, which can be computed in R using the 'dist' function. The default measure for the dist function is 'Euclidean'; however, you can change it with the method argument. We also need to specify the linkage method we want to use (i.e. "complete", "average", "single", "ward.D").

# Dissimilarity matrix
d <- dist(data, method = "euclidean")

# Hierarchical clustering using complete linkage
hc1 <- hclust(d, method = "complete")

# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1)

Another alternative is the agnes function. Both of these functions are quite similar; however, with the agnes function you can also get the agglomerative coefficient, which measures the amount of clustering structure found (values closer to 1 suggest strong clustering structure).
# Compute with agnes (make sure you have the cluster package)
hc2 <- agnes(data, method = "complete")

# Agglomerative coefficient
hc2$ac
## [1] 0.9317012

Let's compare the methods discussed:

# vector of methods to compare
m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")

# function to compute coefficient
ac <- function(x) {
  agnes(data, method = x)$ac
}

map_dbl(m, ac)  ## from the 'purrr' package
##   average    single  complete      ward
## 0.9241325 0.9215283 0.9317012 0.9493598

Ward's method gets us the highest agglomerative coefficient. Let us look at its dendrogram.

hc3 <- agnes(data, method = "ward")
pltree(hc3, cex = 0.6, hang = -1, main = "Dendrogram of agnes")

Divisive Hierarchical Clustering

The function 'diana' in the cluster package helps us perform divisive hierarchical clustering. 'diana' works similarly to 'agnes'; however, there is no method argument, and, instead of the agglomerative coefficient, we have the divisive coefficient.

# compute divisive hierarchical clustering
hc4 <- diana(data)

# Divisive coefficient
hc4$dc
## [1] 0.9305939

# plot dendrogram
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of diana")

A dendrogram is a cluster tree where each group is linked to two or more successor groups. These groups are nested and organized as a tree. You can manage dendrograms and do much more with them using the package "dendextend". Check out the vignette of the package here:

Great! So now we understand how to perform clustering and come up with dendrograms. Let us move to the final step of assigning clusters to the data points.

This can be done with the R function cutree. It cuts a tree (or dendrogram), as resulting from hclust (or diana/agnes), into several groups either by specifying the desired number of groups (k) or the cut height (h). At least one of k or h must be specified; k overrides h if both are given.
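The two ways of cutting can be compared side by side; a minimal sketch on the built-in USArrests data (not the Freedman data from the demo):

```r
hc <- hclust(dist(scale(USArrests)), method = "complete")

by_k <- cutree(hc, k = 4)   # ask for exactly four groups
by_h <- cutree(hc, h = 4)   # undo every merge that happens above height 4

table(by_k)                 # sizes of the four groups
length(unique(by_h))        # number of groups implied by the cut height
```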

Following our demo, assign clusters for the tree obtained by diana function (under section Divisive Hierarchical Clustering).

clust <- cutree(hc4, k = 5)

We can also use the fviz_cluster function from the factoextra package to visualize the result in a scatter plot.

fviz_cluster(list(data = data, cluster = clust)) ## from ‘factoextra’ package

You can also visualize the clusters inside the dendrogram itself by drawing borders around them, as shown next.

pltree(hc4, hang=-1, cex = 0.6)

rect.hclust(hc4, k = 9, border = 2:10)

dendextend

You can do a lot of other manipulations on dendrograms with the dendextend package, such as changing the colour of labels, changing the size of the label text, changing the line type of branches, their colours, etc. You can also change the label text as well as sort it. You can see the many options in the online vignette of the package.
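A couple of these manipulations can be sketched with dendextend's set() function (a minimal sketch on a built-in dataset; the option names used here are the ones documented in the dendextend vignette):

```r
library(dendextend)

dend <- as.dendrogram(hclust(dist(scale(USArrests))))

dend %>%
  set("labels_cex", 0.6) %>%    # shrink the label text
  set("labels_col", "blue") %>% # colour the labels
  set("branches_lwd", 2) %>%    # thicken the branch lines
  plot()
```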

Another important function that the dendextend package offers is tanglegram. It is used to compare two dendrograms (with the same set of labels), one facing the other, with their labels connected by lines.

As an example, let’s compare the single and complete linkage methods for agnes function.

library(dendextend)

hc_single <- agnes(data, method = "single")
hc_complete <- agnes(data, method = "complete")

# converting to dendrogram objects, as dendextend works with dendrogram objects
hc_single <- as.dendrogram(hc_single)
hc_complete <- as.dendrogram(hc_complete)

tanglegram(hc_single, hc_complete)

This is useful for comparing two methods. As seen in the figure, the comparison shows how the two linkage methods group the observations differently.
You can find more functionalities of the dendextend package here.

End Notes

Hope you now have a better understanding of clustering algorithms than you started with. We discussed Divisive and Agglomerative clustering techniques and four linkage methods, namely Single, Complete, Average and Ward's method. Next, we implemented the discussed techniques in R using a numeric dataset. Note that we didn't have any categorical variables in the dataset we used; you need to treat categorical variables in order to incorporate them into a clustering algorithm. Lastly, we discussed a couple of plots to visualise the clusters/groups formed.

Note that we have assumed the value of 'k' (the number of clusters) is known. However, this is not always the case. There are a number of heuristics and rules of thumb for picking the number of clusters. A given heuristic will work better on some datasets than others. It's best to take advantage of domain knowledge to help set the number of clusters, if that's possible. Otherwise, try a variety of heuristics, and perhaps a few different values of k.
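As one example of such a heuristic check, the factoextra package's fviz_nbclust() plots a cluster-quality measure over a range of candidate k values; hcut is factoextra's hierarchical-clustering helper. A sketch on the same scaled data used throughout the tutorial:

```r
library(factoextra)

# Average silhouette width for a range of k using hierarchical clustering;
# the peak of the curve suggests a candidate number of clusters
fviz_nbclust(data, hcut, method = "silhouette")

# The "elbow" heuristic on total within-cluster sum of squares
fviz_nbclust(data, hcut, method = "wss")
```

Treat the suggested k as a starting point to sanity-check against the dendrogram and domain knowledge, not as a definitive answer.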

Consolidated code

install.packages('cluster')
install.packages('purrr')
install.packages('factoextra')

library(cluster)
library(purrr)
library(factoextra)

data <- car::Freedman
data <- na.omit(data)
data <- scale(data)

d <- dist(data, method = "euclidean")
hc1 <- hclust(d, method = "complete")
plot(hc1, cex = 0.6, hang = -1)

hc2 <- agnes(data, method = "complete")
hc2$ac

m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")
ac <- function(x) {
  agnes(data, method = x)$ac
}
map_dbl(m, ac)

hc3 <- agnes(data, method = "ward")
pltree(hc3, cex = 0.6, hang = -1, main = "Dendrogram of agnes")

hc4 <- diana(data)
hc4$dc
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of diana")

clust <- cutree(hc4, k = 5)
fviz_cluster(list(data = data, cluster = clust))

pltree(hc4, hang = -1, cex = 0.6)
rect.hclust(hc4, k = 9, border = 2:10)

library(dendextend)
hc_single <- agnes(data, method = "single")
hc_complete <- agnes(data, method = "complete")
# converting to dendrogram objects as dendextend works with dendrogram objects
hc_single <- as.dendrogram(hc_single)
hc_complete <- as.dendrogram(hc_complete)
tanglegram(hc_single, hc_complete)

Author Bio: This article was contributed by Perceptive Analytics. Amanpreet Singh, Chaitanya Sagar, Jyothirmayee Thondamallu and Saneesh Veetil contributed to this article. Perceptive Analytics provides data analytics, business intelligence and reporting services to e-commerce, retail and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

9th MilanoR Meeting – Dataviz with R | Photos and presentations!

Mon, 12/18/2017 - 18:25

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Hello R-users,

On November 20th we had the pleasure to host the 9th MilanoR meeting in Mikamai! Lots of you participated with passion and enthusiasm, and we're so glad of the outcome. Thank you all, from the bottom of our hearts!

This time the meeting focused on a thrilling and very useful subject: Data Visualization. This post is meant for those of you who participated and want to look through the materials again, and for everyone who could not make it.

As you know, MilanoR meetings have a defined structure: two talks and the free buffet. But this time the structure was different! After the speech from our sponsors, we had an introductory talk, a presentation by Andrea Cirillo, and then four lightning talks of 5 minutes each. But don't worry: the free buffet and networking time were still there!

The introductory talk was hosted by Mariachiara Fortuna, from MilanoR; she gave us a brief overview of the history of Data Visualization with R, how it has changed over time and how it probably will change, as well as what Data Visualization with R actually is now and how we can use it. The presentation gives a framework for the R world of dataviz, from Shiny to ggplot2, to htmlwidgets and more.

Dataviz with R Overview by MilanoR – Mariachiara Fortuna

When it was Andrea Cirillo's turn to seize the stage, we were transported into the world of Data Visualization. Have you ever dreamt of giving the colors of the Mona Lisa to your graphic? You can do it now, with Andrea's package paletter!
Paletter takes the colors from your chosen image and optimizes them to give you the perfect graphic palette. You can download the presentation and the package itself with the links below.

Automatically build the perfect palette for your plot with paletteR + paletteR package by Andrea Cirillo

Then it was the turn of the lightning talks! Stefano Biffani explained the process of getting from modeling to visualization itself, with a great example: the fertility of bunnies! The second talk was from Stefano Barberis, who showed us how to map data with R using Google Earth and the plotKML package.

From modelling to visualization by Stefano Biffani

Interactive maps and spatial visualizations with R by Stefano Barberis

Nicola Pesenti introduced us to Highcharts, a wonderful library for creating interactive charts and plots. We highly suggest giving it a look! At last, Moreno Riccardi put an end to the meeting with his testimony on the use of Data Visualization in his work. Moreno is a data scientist working with the well-known ESP MailUp and uses Shiny to monitor customers' behavior and prevent abuse of the platform.

by Nicola Pesenti

by Moreno Riccardi

During the buffet and the networking time we had the chance to see many smiling faces. It's wonderful to see so many people brought together by a shared passion. Of course none of this would have been possible without our sponsors: Quantide, a provider of consulting services and training courses specialized in R that has supported MilanoR meetings since the first edition (here is Quantide's presentation at the meeting), and Open Search Network, a headhunting company focused on Italian Data Digital Experts that has chosen to support us for the first time. Our many thanks also go to Le Madeleine Catering, to our photographer Alberto Morici and of course to Mikamai, which once again hosted our meeting free of charge.
We really had fun, and if you did too, please stay in touch with us: many other dates and meeting occasions are coming! Our next event will focus on the fascinating subject of personalized healthcare with R [R-Lab#5, hands on code, on January 16th in Mikamai – subscription should open here in a few days, if we manage not to sink into the Christmas funnel of things to do xD]. Until then, thanks again and happy holidays!

The post 9th MilanoR Meeting – Dataviz with R | Photos and presentations! appeared first on MilanoR.

To leave a comment for the author, please follow the link and comment on their blog: MilanoR.

New Course: Working with Dates & Times in R

Mon, 12/18/2017 - 16:58

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Hello R users! We just launched another course: Working with Dates and Times in R by Charlotte Wickham!

Dates and times are abundant in data and essential for answering questions that start with when, how long, or how often. However, they can be tricky, as they come in a variety of formats and can behave in unintuitive ways. This course teaches you the essentials of parsing, manipulating, and computing with dates and times in R. By the end, you'll have mastered the lubridate package, a member of the tidyverse, specifically designed to handle dates and times.
You'll also have applied your new skills to explore how often R versions are released, when the weather is good in Auckland (the birthplace of R), and how long monarchs ruled in Britain.

Take me to chapter 1!

Working with Dates and Times in R features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you an expert in dates and times in R!

What you'll learn

1. Dates and Times in R

Dates and times come in a huge assortment of formats, so your first hurdle is often to parse the format you have into an R datetime. This chapter teaches you to import dates and times with the lubridate package. You'll also learn how to extract parts of a datetime. You'll practice by exploring the weather in R's birthplace, Auckland NZ.

2. Parsing and Manipulating Dates and Times with lubridate

3. Arithmetic with Dates and Times

Getting datetimes into R is just the first step. Now that you know how to parse datetimes, you need to learn how to do calculations with them. In this chapter, you'll learn the different ways of representing spans of time with lubridate and how to leverage them to do arithmetic on datetimes. By the end of the chapter, you'll have calculated how long it's been since the first man stepped on the moon, generated sequences of dates to help schedule reminders, calculated when an eclipse occurs, and explored the reigns of monarchs of England (and which ones might have seen Halley's comet!).

4.
Problems in practice

You now know most of what you need to tackle data that includes dates and times, but there are a few other problems you might encounter in practice. In this final chapter you'll learn a little more about these problems by returning to some of the earlier data examples and learning how to handle time zones, deal with times when you don't care about dates, parse dates quickly, and output dates and times.

Master dates and times in R with our latest course!

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

Conference Cost

Mon, 12/18/2017 - 01:00

(This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers)

In last week's post we tantalised you with upcoming R & data science conferences, but from a cost point of view, not all R conferences are the same. Using the R conference site, it's fairly easy to compare the cost of previous R conferences. I selected the main conferences over the last few years and obtained the cost of the full ticket (including any tutorial days, but ignoring any discounts). Next, I converted all prices to dollars and calculated the cost per day.

Conference            Cost ($)   #Days   $ per day
eRum 2016                   30       3          10
SatRday 2017                82       3          27
useR! 2017                 770       4         192
R Finance 2017             600       2         300
rstudio::conf 2018         995       3         331
New York R                 725       2         362
EARL London               1191       3         397

The conferences fall into two camps, business-oriented and more general R conferences; useR! is somewhere in the middle. A simple bar plot highlights the extreme difference in cost per day.

I think the organisers of eRum and SatRdays should be proud of themselves for keeping the cost down; it would also be really useful if the organisers wrote a blog post giving more general tips for keeping costs low. I'm going to resist commenting on the conferences since I have only ever attended useR! (which is excellent), but in terms of speakers, most conferences have the same keynotes, so this year I'll be looking at eRum 2018 (sadly useR! is a bit too far away from me).

To leave a comment for the author, please follow the link and comment on their blog: R on The Jumping Rivers Blog.

mqtt Development Log : On DSLs, Rcpp Modules and Custom Formula Functions

Sun, 12/17/2017 - 21:23

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I know some folks had a bit of fun with the previous post since it exposed the fact that I left out unique MQTT client id generation from the initial 0.1.0 release of the in-development package (client ids need to be unique).
There have been some serious improvements since said post, and I thought a (hopefully not-too-frequent) blog-journal of the development of this particular package might be interesting/useful to some folks, especially since I'm delving into some not-oft-blogged (anywhere) topics as I use some new tricks in this particular package.

Thank The Great Maker for C++

I'm comfortable and not-too-shabby at wrapping C/C++ things with an R bow and I felt quite daft seeing this after I had started banging on the mosquitto C interface. Yep, that's right: it has a C++ interface. It's waaaaay easier (in my experience) bridging C++ libraries since Dirk/Romain's (et al, as I know there are many worker hands involved as well) Rcpp has so many tools available to do that very thing.

As an aside, if you do any work with Rcpp or want to start doing work in Rcpp, Masaki E. Tsuda's Rcpp for Everyone is an excellent tome.

I hadn't used Rcpp Modules before (that link goes to a succinct but very helpful post by James Curran) but they make exposing C++ library functionality even easier than I had experienced before. So easy, in fact, that it made it possible to whip out an alpha version of a "domain specific language" (or a pipe-able, customized API — however you want to frame these things in your head) for the package. But, I'm getting ahead of myself.

The mosquittopp class in the mosqpp namespace is much like the third bowl of porridge: just right. It's neither too low-level nor too high-level, and it was well thought out enough that it only required a bit of tweaking to use as an Rcpp Module. First, there are more than a few char * parameters that needed to be std::strings. So, something like:

int username_pw_set(const char *username, const char *password);

becomes:

int username_pw_set(std::string username, std::string password);

in our custom wrapper class.
Since the whole point of the mqtt package is to work in R vs C[++] or any other language, the callbacks — the functions that do the work when message, publish, subscribe, etc. events are triggered — need to be in R. I wanted to have some default callbacks during the testing phase and they're really straightforward to set up in Rcpp:

Rcpp::Environment pkg_env = Rcpp::Environment::namespace_env("mqtt");

Rcpp::Function ccb = pkg_env[".mqtt_connect_cb"];
Rcpp::Function dcb = pkg_env[".mqtt_disconnect_cb"];
Rcpp::Function pcb = pkg_env[".mqtt_publish_cb"];
Rcpp::Function mcb = pkg_env[".mqtt_message_cb"];
Rcpp::Function scb = pkg_env[".mqtt_subscribe_cb"];
Rcpp::Function ucb = pkg_env[".mqtt_unsubscribe_cb"];
Rcpp::Function lcb = pkg_env[".mqtt_log_cb"];
Rcpp::Function ecb = pkg_env[".mqtt_error_cb"];

The handy thing about that approach is you don't need to export the functions (it works like ::: does). But the kicker is the Rcpp Module syntactic sugar:

RCPP_MODULE(MQTT) {

  using namespace Rcpp;

  class_<mqtt_r>("mqtt_r")

    .constructor<std::string, std::string, int>("id/host/port constructor")
    .constructor<std::string, std::string, int, std::string, std::string>("id/host/port/user/pass constructor")
    .constructor<std::string, std::string, int, Rcpp::Function, Rcpp::Function, Rcpp::Function>("id/host/post/con/mess/discon constructor")

    .method("connect", &mqtt_r::connect)
    .method("disconnect", &mqtt_r::disconnect)
    .method("reconnect", &mqtt_r::reconnect)
    .method("username_pw_set", &mqtt_r::username_pw_set)
    .method("loop_start", &mqtt_r::loop_start)
    .method("loop_stop", &mqtt_r::loop_stop)
    .method("loop", &mqtt_r::loop)
    .method("publish_raw", &mqtt_r::publish_raw)
    .method("publish_chr", &mqtt_r::publish_chr)
    .method("subscribe", &mqtt_r::subscribe)
    .method("unsubscribe", &mqtt_r::unsubscribe)
    .method("set_connection_cb", &mqtt_r::set_connection_cb)
    .method("set_discconn_cb", &mqtt_r::set_discconn_cb)
    .method("set_publish_cb", &mqtt_r::set_publish_cb)
    .method("set_message_cb", &mqtt_r::set_message_cb)
    .method("set_subscribe_cb", &mqtt_r::set_subscribe_cb)
    .method("set_unsubscribe_cb", &mqtt_r::set_unsubscribe_cb)
    .method("set_log_cb", &mqtt_r::set_log_cb)
    .method("set_error_cb", &mqtt_r::set_error_cb)

  ;

}

That, combined with RcppModules: MQTT in the DESCRIPTION file and a MQTT <- Rcpp::Module("MQTT") just above where you'd put an .onLoad handler, means you can do something like (internally, since it's not exported):

mqtt_obj <- MQTT$mqtt_r

mqtt_conn_obj <- new(mqtt_obj, "unique_client_id", "test.mosquitto.org", 1883L)

and have access to each of those methods right from R (e.g. mqtt_conn_obj$subscribe(0, "topic", 0)). If you're careful with your C++ class code, you'll be able to breeze through exposing functionality. Because of the existence of Rcpp Modules, I was able to do what follows in the next section in near record time.

"The stump of a %>% he held tight in his teeth"

I felt compelled to get a Christmas reference into the post, and it's relevant to this section. I like %>%, recommend the use of %>% and use %>% in my own day-to-day R coding (it's even creeping into internal package code, though I still try not to do that). I knew I wanted to expose a certain way of approaching MQTT workflows in this mqtt package, and that meant coming up with an initial — but far from complete — mini-language or pipe-able API for it. Here's the current thinking/implementation:

• Setup connection parameters with mqtt_broker(). For now, it takes some parameters, but there is a URI scheme for MQTT so I want to be able to support that as well at some point.
• Support authentication with mqtt_username_pw(). There will also be a function for dealing with certificates and other security-ish features which will look similar to this.
• Make it dead-easy to subscribe to topics and associate callbacks with mqtt_subscribe() (more on this below).
• Support an "until you kill it" workflow with mqtt_run() that loops either forever or for a certain number of iterations.
• Support user-controlled iterations with mqtt_begin(), mqtt_loop() and mqtt_end(). An example (in a bit) should help explain this further, but this one is likely to be especially useful in a Shiny context.

Now, as hopefully came across in the previous post, the heart of MQTT message processing is the callback function. You write a function with a contractually defined set of parameters and operate on the values passed in.
While we should all likely get into a better habit of using named function objects vs anonymous functions, anonymous functions are super handy, and short ones don't cause code to get too gnarly. However, in this new DSL/API I've cooked up, each topic message callback has six parameters, so if you want to use an anonymous function (vs a named one) you have to do something like this in mqtt_subscribe():

mqtt_subscribe("sometopic", function(id, topic, payload, qos, retain, con) {})

That's not very succinct, elegant or handy. Since those are three attributes I value highly in most things related to R, I had to do something about it. Since I'm highly attached to the ~{} syntax introduced with purrr and now expanding across the Tidyverse, I decided to make a custom version of it for mqtt_subscribe(). As a result, the code above can be written as:

mqtt_subscribe("sometopic", ~{})

and you can reference id, topic, payload, etc. inside those brackets without the verbose function declaration. How is this accomplished? Via:

as_message_callback <- function(x, env = rlang::caller_env()) {
  rlang::coerce_type(
    x, rlang::friendly_type("function"),

    closure = { x },

    formula = {
      if (length(x) > 2) rlang::abort("Can't convert a two-sided formula to an mqtt message callback function")
      f <- function() { x }
      formals(f) <- alist(id=, topic=, payload=, qos=, retain=, con=)
      body(f) <- rlang::f_rhs(x)
      f
    }
  )
}

It's a shortened version of some Tidyverse code that's more generic in nature. That as_message_callback() function looks to see if you've passed in a ~{} or a named/anonymous function. If ~{} was used, it builds a function with the contractually obligated/expected signature; otherwise it shoves in what you gave it.
A code example is worth a thousand words (which is, in fact, the precise number of "words" up until this snippet, at least insofar as the WordPress editor counts them):

library(mqtt)

# We're going to subscribe to *three* BBC subtitle feeds at the same time!
#
# We'll distinguish between them by coloring the topic and text differently.

# this is a named function object that displays BBC 2's subtitle feed when it gets messages
moar_bbc <- function(id, topic, payload, qos, retain, con) {
  if (topic == "bbc/subtitles/bbc_two_england/raw") {
    cat(crayon::cyan(topic), crayon::blue(readBin(payload, "character")), "\n", sep=" ")
  }
}

mqtt_broker("makmeunique", "test.mosquitto.org", 1883L) %>% # connection info

  mqtt_silence(c("all")) %>% # silence all the development screen messages

  # subscribe to BBC 1's topic using a fully specified anonymous function
  mqtt_subscribe(
    "bbc/subtitles/bbc_one_london/raw",
    function(id, topic, payload, qos, retain, con) { # regular anonymous function
      if (topic == "bbc/subtitles/bbc_one_london/raw")
        cat(crayon::yellow(topic), crayon::green(readBin(payload, "character")), "\n", sep=" ")
    }) %>%

  # as you can see we can pipe-chain as many subscriptions as we like. the package
  # handles the details of calling each of them. This makes it possible to have
  # very focused handlers vs lots of "if/then/case_when" impossible-to-read functions.

  # Ahh. A tidy, elegant, succinct ~{} function instead
  mqtt_subscribe("bbc/subtitles/bbc_news24/raw", ~{ # tilde shortcut function (passing in named, pre-known params)
    if (topic == "bbc/subtitles/bbc_news24/raw")
      cat(crayon::yellow(topic), crayon::red(readBin(payload, "character")), "\n", sep=" ")
  }) %>%

  # And, a boring, but -- in the long run, better (IMO) -- named function object
  mqtt_subscribe("bbc/subtitles/bbc_two_england/raw", moar_bbc) %>% # named function

  mqtt_run() -> res # this runs until you Ctrl-C

There's in-code commentary, so I'll refrain from blathering about it more here, except to note that there are a staggering amount of depressing stories on BBC News and an equally staggering amount of un-hrbrmstr-like language use in BBC One and BBC Two shows. Apologies if any of the GH README.md snippets or animated screenshots ever cause offense, as it's quite unintentional.

But you said something about begin/end/loop before?

Quite right! For that we'll use a different example. I came across a topic — "sfxrider/+/locations" — on broker.mqttdashboard.com. It looks like live data from folks who do transportation work for "Shadowfax Technologies" (which is a crowd-sourced transportation/logistics provider in India).
It publishes the following in the payload:

| device:6170774037 | latitude:28.518363 | longitude:77.095753 | timestamp:1513539899000 |
| device:6170774037 | latitude:28.518075 | longitude:77.09555 | timestamp:1513539909000 |
| device:6170774037 | latitude:28.518015 | longitude:77.095488 | timestamp:1513539918000 |
| device:8690150597 | latitude:28.550963 | longitude:77.13432 | timestamp:1513539921000 |
| device:6170774037 | latitude:28.518018 | longitude:77.095492 | timestamp:1513539928000 |
| device:6170774037 | latitude:28.518022 | longitude:77.095495 | timestamp:1513539938000 |
| device:6170774037 | latitude:28.518025 | longitude:77.095505 | timestamp:1513539947000 |
| device:6170774037 | latitude:28.518048 | longitude:77.095527 | timestamp:1513539957000 |
| device:6170774037 | latitude:28.518075 | longitude:77.095573 | timestamp:1513539967000 |
| device:8690150597 | latitude:28.550963 | longitude:77.13432 | timestamp:1513539975000 |
| device:6170774037 | latitude:28.518205 | longitude:77.095603 | timestamp:1513539977000 |
| device:6170774037 | latitude:28.5182 | longitude:77.095587 | timestamp:1513539986000 |
| device:6170774037 | latitude:28.518202 | longitude:77.095578 | timestamp:1513539996000 |
| device:6170774037 | latitude:28.5182 | longitude:77.095578 | timestamp:1513540006000 |
| device:6170774037 | latitude:28.518203 | longitude:77.095577 | timestamp:1513540015000 |
| device:6170774037 | latitude:28.518208 | longitude:77.095577 | timestamp:1513540025000 |

Let's turn that into proper, usable JSON (we'll just cat() it out for this post):

library(mqtt)
library(purrr)
library(stringi)

# turn the pipe-separated, colon-delimited lines into a proper list
.decode_payload <- function(.x) {
  .x <- readBin(.x, "character")
  .x <- stri_match_all_regex(.x, "([[:alpha:]]+):([[:digit:]\\.]+)")[[1]][,2:3]
  .x <- as.list(setNames(as.numeric(.x[,2]), .x[,1]))
  .x$timestamp <- as.POSIXct(.x$timestamp/1000, origin="1970-01-01 00:00:00")
  .x
}

# do it safely as the payload in MQTT can be anything
decode_payload <- purrr::safely(.decode_payload)

# change the client id
mqtt_broker("makemeuique", "broker.mqttdashboard.com", 1883L) %>%
  mqtt_silence(c("all")) %>%
  mqtt_subscribe("sfxrider/+/locations", ~{
    x <- decode_payload(payload)$result
    if (!is.null(x)) {
      cat(crayon::yellow(jsonlite::toJSON(x, auto_unbox=TRUE), "\n", sep=""))
    }
  }) %>%
  mqtt_run(times = 10000) -> out

What if you wanted to do that one by one, so you could plot the data live on a Shiny map? Well, we won't do that in this post, but the user-controlled loop version would look like this:

mqtt_broker("makemeuique", "broker.mqttdashboard.com", 1883L) %>%
  mqtt_silence(c("all")) %>%
  mqtt_subscribe("sfxrider/+/locations", ~{
    x <- decode_payload(payload)$result
    if (!is.null(x)) {
      cat(crayon::yellow(jsonlite::toJSON(x, auto_unbox=TRUE), "\n", sep=""))
    }
  }) %>%
  mqtt_begin() -> tracker # _begin!! not _run!!

# call this individually and have the callback update a
# larger scoped variable or Redis or a database. You
# can also just loop like this for setup.
for (i in 1:25) mqtt_loop(tracker, timeout = 1000)

mqtt_end(tracker) # this cleans up stuff!

FIN

Whew. 1,164 words later and I hope I've kept your interest through it all. I've updated the GH repo for the package and also updated the requirements for the package in the README. I'm also working on a configure script (mimicking @opencpu's 'anti-conf' approach) and found Win32 library binaries that should make this easier to get up and running on Windows, so stay tuned for the next installment, and don't hesitate to jump on board with issues, questions, comments or PRs. The goal for the next post is to cover reading from either that logistics feed or OwnTracks and dynamically displaying points on a map with Shiny. Stay tuned!

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.
littler 0.3.3

Sun, 12/17/2017 - 17:37

(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)

The fourth release of littler as a CRAN package is now available, following in the now more than ten-year history of a package started by Jeff in 2006 and joined by me a few weeks later.

littler is the first command-line interface for R and predates Rscript. In my very biased eyes it is better, as it allows for piping as well as shebang scripting via #!, uses command-line arguments more consistently and still starts faster. Last but not least, it is also less silly than Rscript and always loads the methods package, avoiding those bizarro bugs between code running in R itself and a scripting front-end.

littler prefers to live on Linux and Unix, has its difficulties on OS X due to yet-another-braindeadedness there (whoever thought case-insensitive filesystems were a good idea?) and simply does not exist on Windows (yet — the build system could be extended — see RInside for an existence proof, and volunteers welcome!). A few examples are highlighted at the Github repo.

This release brings a few new example scripts, extends a few existing ones and also includes two fixes thanks to Carl. Again, no internals were changed. The NEWS file entry is below.

Changes in littler version 0.3.3 (2017-12-17)

• Changes in examples

• The script installGithub.r now correctly uses the upgrade argument (Carl Boettiger in #49).
• New script pnrrs.r to call the package-native registration helper function added in R 3.4.0.
• The script install2.r now has more robust error handling (Carl Boettiger in #50).
• New script cow.r to use R Hub's check_on_windows.
• Scripts cow.r and c4c.r use #!/usr/bin/env r.
• New option --fast (or -f) for scripts build.r and rcc.r for faster package build and check.
• The build.r script now defaults to using the current directory if no argument is provided.
• The RStudio getters now use the rvest package to parse the webpage with available versions.

• Changes in package

• Travis CI now uses https to fetch script, and sets the group

Courtesy of CRANberries, there is a comparison to the previous release. Full details for the littler release are provided as usual at the ChangeLog page. The code is available via the GitHub repo, from tarballs off my littler page and the local directory here — and now of course all from its CRAN page and via install.packages("littler"). Binary packages are available directly in Debian, as well as soon via Ubuntu binaries at CRAN thanks to the tireless Michael Rutter.

Comments and suggestions are welcome at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Using R: reshape2 to tidyr

Sun, 12/17/2017 - 12:35

(This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers)

Tidy data — it's one of those terms that tend to confuse people, and it certainly confused me. It's Codd's third normal form, but you can't go around telling that to people and expect to be understood. One form is "long", the other is "wide".
One form is ”melted”, another ”cast”. One form is ”gathered”, the other ”spread”. To make matters worse, I often botch the explanation and mix up at least two of the terms.

The word is also associated with the tidyverse suite of R packages in a somewhat loose way. But you don’t need to write in a tidyverse style (including the %>%s and all) to enjoy tidy data. Hadley Wickham’s definition is straightforward. In tidy data:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

In practice, I don’t think people always take their data frames all the way to tidy. For example, to make a scatterplot, it is convenient to keep a couple of variables as different columns. The key is that we need to move between different forms rapidly (brain-time-rapidly, more than computer-time-rapidly, I might add). And not everything should be organized this way. If you’re a geneticist, genotypes are notoriously inconvenient in normalized form. Better to keep that individual-by-marker matrix.

The first serious piece of R code I wrote for someone else was a function to turn data into long form for plotting. I suspect plotting is often the gateway to tidy data. The function was what you’d expect from R code written by a beginner who comes from C-style languages: it reinvented the wheel, and I bet it had nested for loops, a bunch of hard-bracket indices, and so on. Then I discovered reshape2.

library(reshape2)

fake_data <- data.frame(id = 1:20,
                        variable1 = runif(20, 0, 1),
                        variable2 = rnorm(20))
melted <- melt(fake_data, id.vars = "id")

The id.vars argument tells the function that the id column is the key, a column that tells us which individual each observation comes from. As the name suggests, id.vars can name multiple columns in a vector.

So this is the data before:

  id  variable1    variable2
1  1 0.938173781  0.852098580
2  2 0.408216233  0.261269134
3  3 0.341325188  1.796235963
4  4 0.958889279 -0.356218000

And this is after.
This time the data frame doesn’t become wider. There are still three columns. But we go from 20 rows to 40: two variables times 20 individuals.

  id  variable       value
1  1 variable1 0.938173781
2  2 variable1 0.408216233
3  3 variable1 0.341325188
4  4 variable1 0.958889279

And now: tidyr. tidyr is the new tidyverse package for rearranging data like this. The tidyr equivalent of the melt function is called gather. There are two important differences that messed with my mind at first.

First, melt and gather make opposite default assumptions about which columns should be treated as keys and which should be treated as containing values. In melt, as we saw above, we need to list the keys to keep them with each observation. In gather, we need to list the value columns, and the rest will be treated as keys.

Second, the second and third arguments (and they would be the first and second if you piped something into it) are the variable names that will be used in the long-form data. In this case, to get a data frame that looks exactly the same as the first, we will stick with ”variable” and ”value”.

Here are five different ways to get the same long-form data frame as above:

library(tidyr)

melted <- gather(fake_data, variable, value, 2:3)

## Column names instead of indices
melted <- gather(fake_data, variable, value, variable1, variable2)

## Excluding instead of including
melted <- gather(fake_data, variable, value, -1)

## Excluding using column name
melted <- gather(fake_data, variable, value, -id)

## With pipe
melted <- fake_data %>% gather(variable, value, -id)

Usually, this is the transformation we need: wide to long. If we need to go the other way, we can use reshape2’s cast functions and tidyr’s spread.
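As a quick sanity check (not in the original post), we can verify that melt and gather really do produce the same long data frame. This is a minimal sketch, assuming both reshape2 and tidyr are installed; the only difference to expect is that melt returns the key column as a factor while gather returns it as character by default.

```r
library(reshape2)
library(tidyr)

set.seed(1)
fake_data <- data.frame(id = 1:20,
                        variable1 = runif(20, 0, 1),
                        variable2 = rnorm(20))

## Both stack variable1 on top of variable2, keeping id with each value
melted_r2 <- melt(fake_data, id.vars = "id")
melted_td <- gather(fake_data, variable, value, -id)

## The values agree; only the type of the key column differs
identical(melted_r2$value, melted_td$value)                      # TRUE
identical(as.character(melted_r2$variable), melted_td$variable)  # TRUE
```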
This code recovers the original data frame:

## reshape2
dcast(melted, id ~ variable)

## tidyr
spread(melted, variable, value)

To leave a comment for the author, please follow the link and comment on their blog: R – On unicorns and genes.

Teaching the tidyverse to beginners

Sun, 12/17/2017 - 01:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

At the end of October I tweeted this:

will teach #rstats soon again but this time following @drob 's suggestion of the tidyverse first as laid out here: https://t.co/js8SsUs8Nv

— Bruno Rodrigues (@brodriguesco) October 24, 2017

and it generated some discussion. Some people believe that this is the right approach, and others think that one should first present base R and then show how the tidyverse complements it.

This year, I taught three classes: a 12-hour class to colleagues that work with me, a 15-hour class to master’s students and 3 hours again to some of my colleagues. Each time, I decided to focus on the tidyverse (almost) entirely, and I must say that I am not disappointed with the results!

The 12-hour class was divided into two 6-hour days. It was a bit intense, especially the last 3 hours, which took place Friday afternoon.
The crowd was composed of some economists that had experience with Stata, some others that were mostly using Excel, and finally some colleagues from the IT department that sometimes need to dig into some data themselves. Apart from 2 people, all the others had never had any experience with R. We went from 0 to being able to do the plot below by the end of the first day (so 6 hours in). Keep in mind that practically none of them had even opened RStudio before. I show the code so you can see the progress made in just a few hours:

library(Ecdat)
library(tidyverse)
library(ggthemes)

data(Bwages)

bwages = Bwages %>%
  mutate(educ_level = case_when(educ == 1 ~ "Primary School",
                                educ == 2 ~ "High School",
                                educ == 3 ~ "Some university",
                                educ == 4 ~ "Master's degree",
                                educ == 5 ~ "Doctoral degree"))

ggplot(bwages) +
  geom_smooth(aes(exper, wage, colour = educ_level)) +
  theme_minimal() +
  theme(legend.position = "bottom",
        legend.title = element_blank())

## geom_smooth() using method = 'loess'

Of course some of them needed some help here and there, and I also gave them hints (for example, I told them about case_when() and to try to use it inside mutate() instead of nested ifs), but it was mostly due to lack of experience and because they hadn’t had the time to fully digest R’s syntax, which was completely new for most people involved.

On the second day I showed purrr::map() and purrr::reduce(), and overall it went quite well too. I even showed list-columns, and this is where I started losing some of them; I did not insist too much on it though, as I only wanted to show them the flexibility of data.frame objects. Some of them were quite impressed by list-columns! Then I started showing (for and while) loops and writing functions. I even showed them tidyeval and again, it went relatively well. Once they had the opportunity to play around with it a bit, I think it clicked (plus they have lots of code examples to go back to).
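For readers who haven’t met them, the purrr functions mentioned above can be illustrated with a toy example (this sketch is mine, not from the course material):

```r
library(purrr)

## map() applies a function to each element and returns a list
squares <- map(1:5, function(x) x^2)

## map_dbl() does the same but returns a plain numeric vector
squares_dbl <- map_dbl(1:5, ~ .x^2)

## reduce() folds a vector (or list) down to a single value
total <- reduce(1:5, `+`)   # 1 + 2 + 3 + 4 + 5 = 15
```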
At the end, people seemed to have enjoyed the course, but told me that Friday was heavy; indeed it was, but I feel that this was mostly because 12 hours spread over 2 days is not the best format for this type of material. We all had time constraints, though.

The 15-hour master’s course was spread over 4 days and covered basically the same material. I just used the last 3 hours to show the students some basic functions for model estimation (linear, count, logit/probit and survival models). Again, the students were quite impressed by how easily they could get descriptive statistics by first grouping by some variables. Through their questions, I even got to show them scoped versions of dplyr verbs, such as select_if() and summarise_at(). I was expecting to lose them there, but actually most of them got these scoped versions quite fast. These students already had some experience with R, though none with the tidyverse.

Finally, the 3-hour course was perhaps the most interesting: I only had 100% total beginners. Some just knew R by name and had never heard of, seen, or opened RStudio (with the exception of one person)! I did not show them any loops, function definitions or plots. I only showed them how RStudio looked and worked, what packages were and how to install them (as well as the CRAN Task Views), and then how to import data with rio and do descriptive statistics only with dplyr. They were really interested and quite impressed by rio (“what do you mean I can use the same code for importing any dataset, in any format?”) but also by the simplicity of dplyr.

In all the courses, I did show the $ primitive to refer to columns inside a data.frame. First I showed them lists, which is where I introduced $. Then it was easy to explain to them why it was the same for a column inside a data.frame: a data.frame is simply a list! This is also the distinction I made from previous years; I simply mentioned (and showed really quickly) matrices and focused almost entirely on lists.
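The data.frame-is-a-list point is easy to demonstrate; a minimal sketch (my own illustration, with made-up data):

```r
## A data.frame is a list of equal-length columns with some extra
## attributes, so list syntax carries over directly
df <- data.frame(id = 1:3, value = c(10, 20, 30))

is.list(df)                          # TRUE
identical(df$value, df[["value"]])   # TRUE: $ is plain list extraction

## The same $ syntax works on an ordinary list
l <- list(id = 1:3, value = c(10, 20, 30))
l$value                              # 10 20 30
```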
Most participants, if not all, had learned to program statistics by thinking about linear algebra and matrices. Nothing wrong with that, but I feel that R really shines when you focus on lists and on how to work with them.

Overall, as the teacher, I think that focusing on the tidyverse might be a very good strategy. I might have to make some adjustments here and there for future courses, but my hunch is that the difficulties some participants had were not necessarily due to the tidyverse but simply to a lack of time to digest what was shown, as well as a total lack of experience with R. I do not think that these participants would have understood a more traditional, base, matrix-oriented course any better.


To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.