R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Data Manipulation with Data Table -Part 1

Thu, 06/15/2017 - 21:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


In the exercises below we cover some useful features of data.table, an R package for fast manipulation of large data frames. Please read the data.table vignette before attempting the solutions. This first set is intended for beginners of the data.table package and does not cover the set* functions or data.table joins, which will be covered in the next set. Load the data.table package in your R session before starting the exercises.
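As a warm-up, here is a minimal, hedged sketch of the DT[i, j, by] idiom that the exercises build on, using the built-in iris data (this is not one of the exercise answers):

library(data.table)

# i selects rows, j computes, by groups
dt <- as.data.table(iris)
dt[Sepal.Length > 5,                    # rows where Sepal.Length exceeds 5
   .(mean_petal = mean(Petal.Length)),  # grouped summary computed in j
   by = Species]                        # one result row per Species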
Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Load the iris dataset, convert it to a data.table and name it iris_dt. Print the mean of Petal.Length from iris_dt, grouping by the first letter of Species.

Exercise 2
Load the diamonds dataset from the ggplot2 package as dt (a data.table). Find the mean price for each group of cut and color.
Exercise 3
Load the diamonds dataset from the ggplot2 package as dt. Now group the dataset by price per carat and print the top 5 groups in terms of count per group. Don't use head; use data.table chaining to achieve this.

Exercise 4
Use the already loaded diamonds dataset and print the last two carat values of each cut.

Exercise 5
In the same dataset, find the median of the columns x, y and z per cut. Use data.table's methods to achieve this.

Exercise 6
Load the airquality dataset as a data.table. Now find the logarithm of the wind rate for each month, for days greater than 15.
Exercise 7
In the same dataset, update the Temp column of all odd-numbered rows by adding 10.

Exercise 8
data.table comes with a powerful feature for updating columns by reference, as you saw in the last exercise. It is even possible to update or create multiple columns at once. To test that, add 10 to Solar.R and Wind in the airquality data.table you created previously.
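As a reminder of the syntax (and not the answer to this exercise), here is a small sketch on a throwaway table showing how := updates several columns by reference:

library(data.table)

# Toy table, not airquality: `:=` modifies the columns in place, no copy is made
dt <- data.table(x = 1:3, y = 4:6)
dt[, c("x", "y") := .(x + 10, y + 10)]  # add 10 to both columns by reference
dt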

Exercise 9
Now you have a fairly good idea of how easy it is to create multiple columns. It is even possible to delete multiple columns using the same idea. In this exercise, use the same airquality data.table you created previously and delete Solar.R, Wind and Temp using a single expression.
Exercise 10
Load the airquality dataset as a data.table again. I want to create two columns, a and b, which indicate Temp on the Celsius and Kelvin scales. Write an expression to achieve this.
Celsius = (Temp - 32) * 5/9
Kelvin = Celsius + 273.15

Related exercise sets:
  1. Shiny Application Layouts Exercises (Part-2)
  2. Descriptive Analytics-Part 5: Data Visualisation (Categorical variables)
  3. Summary Statistics With Aggregate()
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Retailers: Here’s What Your Employees Are Saying About You – A Project on Web Scraping Indeed.com

Thu, 06/15/2017 - 20:56

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

To view the original source code, you can visit my Github page here. If you are interested in learning more about my background, you can visit my LinkedIn page here.

About This Project

This is the second of four projects for the NYC Data Science Academy. In this project, students are required to collect their datasets using a method called web scraping. Essentially, web scraping is the process of collecting information (e.g. text) from a series of websites into structured formats such as .csv files. Web scraping enables analyses to be done on data that is not already provided in a structured format.

In this analysis, we will scrape employee reviews for the 8 retail companies that made it onto Indeed's 50 Best Corporate Employers in the US list. See below for additional details regarding each of the 8 retail firms discussed. The rank and overall score correspond to the employer's rank and overall score as they appear on Indeed's website.

Objectives

Given a two-week timeframe, the scope of the analysis was limited to these three primary objectives:

  • To understand when employees are more likely to submit positive and negative reviews
  • To understand the sentiments and emotions associated with positive and negative reviews (i.e. Sentiment Analysis)
  • To understand the topics of positive and negative reviews (i.e. Topic Modeling)

Web Scraping with Scrapy

Scrapy (a Python framework) was used for collecting employee reviews for each of the retailers on the 50 Best Corporate Employers list. Each scraped review contains the following elements (a hedged R-based alternative sketch follows the list):

  • Company
  • Industry
  • Job Title
  • Date
  • Month
  • Location
  • Rating
  • Header
  • Comment (Main Review)
  • Pro
  • Con
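Since this is an R feed, here is a hedged sketch of the same scraping step using rvest instead of Scrapy. The URL and the CSS selectors are placeholders made up for illustration; Indeed's real markup differs and changes over time:

library(rvest)

# Hypothetical page and selectors, for illustration only
page <- read_html("https://www.indeed.com/cmp/Example-Company/reviews")

headers  <- html_text(html_nodes(page, ".review-title"),  trim = TRUE)
ratings  <- html_text(html_nodes(page, ".review-rating"), trim = TRUE)
comments <- html_text(html_nodes(page, ".review-text"),   trim = TRUE)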

As seen in the illustration below, not every review contains a pro and a con, as those are optional fields for the reviewer. To see an example of a real review, you can visit Indeed’s Walt Disney Parks and Resorts review page here.

Employee Reviews At a Glance

In total, we have scraped over 20,000 reviews. The number of reviews scraped for each firm varies significantly across the board. This is likely due to the difference in the number of employees hired by each firm across the nation.

Employee reviews for these eight retail firms tend to peak around the beginning of the year (January through April). This is expected as we know that retail is a cyclical industry in which the majority of sales happen during the holiday season (November through January). Because employees are oftentimes hired only for the duration of this period, it makes sense  that the majority of the reviews come in around the beginning of the year.

Among the eight retailers on the 50 Best Corporate Employers list, we see that the review ratings skew toward a score of 5, which is the highest score a review can get.

There are no significant differences in average rating across retailers.

There are no significant differences in average rating across months.

When Are Employees At Retail Firms More Likely to Submit Positive and Negative Reviews?

Average Employee Rating by Month

When we consider reviews of all ratings (i.e. reviews with ratings 1 to 5), retailers receive the highest average ratings during the months of October, November, December, and January. Note that the ratings have been scaled to 0 to 1 to allow for comparison across firms and months.

Most Negative Reviews By Month (Reviews with Ratings Below 3)

If we only consider the count of negative reviews (i.e. reviews with ratings 1 or 2), retailers on average receive the most negative reviews during the months of October, February, March, and April (i.e. these months scored higher across retailers).

Most Positive Reviews By Month (Reviews with Ratings Above 3)

If we only consider the count of positive reviews (i.e. reviews with ratings 4 or 5), retailers on average receive the most positive reviews during the months of January, February, and April (i.e. these months scored higher across retailers).

Summary Observations

In summary, while the highest average ratings concentrate toward the end of the year, the highest count of both positive and negative reviews are given during the beginning of the year. This aligns with our understanding that the retail industry is cyclical. In other words, employees are oftentimes hired specifically to aid sales around the holiday season (i.e. Nov to Jan). When their work concludes around the beginning of the year, that’s when we can expect employees to write the majority of the reviews (i.e. both positive and negative).

Understanding Employee Reviews Using Sentiment Analysis

A sentiment analysis was done using R's syuzhet library to visualize the emotions across employee reviews. In the illustration below, each graph represents the intensity of an emotion from January to December on a positive scale (i.e. 0 and above). For example, for Build-A-Bear Workshop's 'positive' graph, we can see that there is an uptick in positive sentiment toward March and November, and each of those months scored approximately a 6. This is a high score relative to the scores recorded across all other emotions.
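For readers who want to try this, a minimal sketch with syuzhet on two made-up reviews (not the scraped data) would look like this:

library(syuzhet)

reviews <- c("Great team and flexible hours, I loved working here.",
             "Poor management and low pay, very stressful season.")

# One row per review, with counts for eight emotions plus negative/positive
nrc <- get_nrc_sentiment(reviews)
colSums(nrc)  # aggregate emotion counts across reviews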

Generally, the sentiments observed align with our understanding that the majority of the employee reviews are positive, as we are narrowly focused on analyzing employee reviews among the 8 best corporate retail employers according to Indeed's ranking. As an aside, it is interesting to see that Build-A-Bear Workshop and Trader Joe's recorded more volatility in their scoring across the negative emotions (i.e. negative, sadness, fear, disgust, anger).

Using Topic Modeling to Identify Patterns Among Positive and Negative Employee Reviews

Topic modeling is a form of unsupervised machine learning, and it can help us identify patterns by clustering similar items into groups. In this analysis, pyLDAvis, a Python library, was used to analyze the groups among positive and negative reviews (using pros and cons of the employee reviews). Essentially, the library takes a list of documents as inputs (i.e. a list of reviews) and attempts to group the reviews based on common combinations of keywords appearing in each of these reviews.

The drawback of using topic modeling, as with other clustering techniques, is that the user has to specify the number of groups to cluster the documents into. This presents a catch-22, as the initial objective is to understand the number of topics or groups among your inputs. Given this hurdle, the user must apply business judgment as well as run multiple trials with different numbers of topics to determine which results are most interpretable. More sophisticated clustering libraries aid the user by automatically selecting the best number of topics, but this should only be used as a guide to begin the analysis.
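The post itself used pyLDAvis in Python; as an R-side sketch of the same idea, the topicmodels package fits an LDA model to a document-term matrix. The tiny corpus below is made up, and k is the user-chosen number of topics discussed above:

library(tm)
library(topicmodels)

docs <- c("great benefits and great coworkers",
          "long hours and poor management",
          "fun environment, flexible schedule, good pay")

corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

lda <- LDA(dtm, k = 2)  # k topics; in practice try several values of k
terms(lda, 5)           # top terms per topic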

In the illustration below, we can see that words like ‘great’ and ‘benefit,’ among many other words, are oftentimes seen together among pro reviews.

Observed Topics Among Pros and Cons in Employee Reviews

Two topic modeling analyses were conducted using pro reviews and con reviews. Below are the most common combinations of words appearing in each of the groups for each analysis. It is not a surprise that the groups overlap in the common words identified among them, as we have chosen a relatively high number of topics. Nevertheless, the results provide a useful start for understanding what employees are thinking.

Visualizing Employee Reviews Using Word Clouds

R's wordcloud2 library was used to further aid the visualization of what employees are saying about these firms. Essentially, the library takes a list of words along with their frequencies and produces a word cloud based on those inputs. The higher the frequency a word is associated with, the larger it appears in the word cloud.
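A minimal sketch of that input format, using a made-up frequency table rather than the review data:

library(wordcloud2)

freqs <- data.frame(word = c("great", "work", "management", "hours", "pay"),
                    freq = c(120, 95, 60, 45, 30))
wordcloud2(freqs)  # word size is scaled by freq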

Word Cloud – Positive Reviews

The positive reviews word cloud was created using only the pros from each of the reviews. (Note: Each review contains the main review, a pro, and a con.) Some of the common words we see among pros are 'great,' 'work,' 'job,' 'company,' and 'customers.'

Word Cloud – Negative Reviews

Similarly, a word cloud was created using only the cons from each of the reviews. We can see that common words appearing include work, management, hours, employees, and pay.

Future Directions

Beyond the scope of this project, below are other interesting areas that are worth looking into if additional time allows.

  • Evaluate the credibility of Indeed’s Top 50 Ranking corporate employers by sampling comparable employers that did not make it to the list
  • Better understand the rationale why positive and negative reviews are concentrated in the months they were observed to be in, and evaluate whether this behavior is prevalent across other industries
  • Perform statistical methods to evaluate the relationships among variables

The post Retailers: Here’s What Your Employees Are Saying About You – A Project on Web Scraping Indeed.com appeared first on NYC Data Science Academy Blog.


To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.

Demo: Real-Time Predictions with Microsoft R Server

Thu, 06/15/2017 - 19:59

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

At the R/Finance conference last month, I demonstrated how to operationalize models developed in Microsoft R Server as web services using the mrsdeploy package. Then, I used that deployed model to generate predictions for loan delinquency, using a Python script as the client. (You can see slides here, and a video of the presentation below.)

With Microsoft R Server 9.1, there are now two ways to operationalize models, as a web service or as a SQL Server stored procedure:

  • Flexible Operationalization: Deploy any R script or function.
  • Real-Time Operationalization: Deploy model objects generated by specific functions in Microsoft R; predictions are generated much more quickly because the R interpreter is bypassed.

In the demo, which begins at the 10:00 mark in the video below, you can see a comparison of using the two types of deployment. Ultimately, I was able to generate predictions from a random forest at a rate of 1M predictions per second, with three Python clients simultaneously drawing responses from the server (an Azure GS5 instance running the Windows Data Science VM).

If you'd like to try out this capability yourself, you can find the R and Python scripts used in the demo at this Github repository. The Lending Club data is available here, and the script used to featurize the data is here.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Neural networks Exercises (Part-2)

Thu, 06/15/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Source: Wikipedia

Neural networks have become a cornerstone of machine learning in the last decade. Created in the late 1940s with the intention of writing computer programs that mimic the way neurons process information, these kinds of algorithms were long believed to be only an academic curiosity, deprived of practical use, since they require a lot of processing power and other machine learning algorithms outperformed them. However, since the mid-2000s, the creation of new neural network types and techniques, coupled with the increased availability of fast computers, has made the neural network a powerful tool that every data analyst or programmer should know.

In this series of articles, we'll see how to fit a neural network with R, learn the core concepts needed to apply those algorithms well, and learn how to evaluate whether our model is appropriate to use in production. Today, we'll practice using the nnet and neuralnet packages to create feedforward neural networks, which we introduced in the last set of exercises. In this type of neural network, all the neurons of the input layer are linked to the neurons of the hidden layer, and all of those neurons are linked to the output layer, as seen in this image. Since there is no cycle in this network, information flows in one direction, from the input layer through the hidden layers to the output layer. For more information about these types of neural networks you can read this page.
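As a small, hedged illustration of such a network (on iris rather than the biopsy data used in the exercises), a single hidden layer of three neurons can be fitted with nnet:

library(nnet)

set.seed(1)
fit <- nnet(Species ~ ., data = iris, size = 3, trace = FALSE)  # size = neurons in the hidden layer
table(predict(fit, iris, type = "class"), iris$Species)         # training-set confusion table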

Answers to the exercises are available here.

Exercise 1
We'll start by practicing what we've seen in the last set of exercises. Load the MASS package and the biopsy dataset, then prepare your data to be fed to a neural network.

Exercise 2
We'll use the nnet() function from the package of the same name to do a logistic regression on the biopsy data set using a feedforward neural network. If you remember the last set of exercises, you know that we have to choose the number of neurons in the hidden layer of our feedforward neural network. There is no rule or equation that can tell us the optimal number of neurons to use, so the best way to find the better model is to run a series of cross-validations of our model with different numbers of neurons in the hidden layer and choose the one that fits the data best. A good range to test with this process is between one neuron and the number of input variables.

Write a function that takes a training set, a test set and a range of integers corresponding to the number of neurons as parameters. For each possible number of neurons in the hidden layer, this function should train a neural network made with nnet(), make predictions on the test set and return the accuracy of the predictions.
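One possible shape for such a function (a sketch only, not the answer key, and shown on iris instead of biopsy) is:

library(nnet)

accuracy_by_size <- function(train, test, neurons) {
  sapply(neurons, function(h) {
    fit  <- nnet(Species ~ ., data = train, size = h, trace = FALSE)
    pred <- predict(fit, test, type = "class")
    mean(pred == test$Species)  # classification accuracy on the test set
  })
}

set.seed(1)
idx <- sample(nrow(iris), 100)
accuracy_by_size(iris[idx, ], iris[-idx, ], 1:4)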

Exercise 3
Use your function on your data set and plot the result. What should be the number of neurons to use in the hidden layer of your feedforward neural network?

Exercise 4
The nnet() function is easy to use, but doesn't give us many options to customize our neural network. As a consequence, it's a good package to use if you have to build a quick model to test a hypothesis, but for more complex models the neuralnet package is a lot more powerful. Documentation for this package can be found here.

Use the neuralnet() function with the default parameters and the number of neurons in the hidden layer set to the answer of the last exercise. Note that this function can only handle numeric values and cannot deal with factors. Then use the compute() function to make predictions on the values of the test set and compute the accuracy of your model.

Learn more about neural networks in the online course Machine Learning A-Z™: Hands-On Python & R In Data Science. In this course you will learn how to:

  • Work with Deep Learning networks and related packages in R
  • Create Natural Language Processing models
  • And much more

Exercise 5
The nnet() function uses by default the BFGS algorithm to adjust the values of the weights until the output values of our model are close to the values of our data set. The neuralnet package gives us the option to use more efficient algorithms to compute those values, which results in faster processing time and overall better estimation. For example, by default this function uses resilient backpropagation with weight backtracking.

Use the neuralnet() function with the parameter algorithm set to 'rprop-', which stands for resilient backpropagation without weight backtracking.
Then test your model and print the accuracy.

Exercise 6
Two other algorithms can be used with the neuralnet() function: 'sag' and 'slr'. These two strings tell the function to use the globally convergent algorithm (grprop) and to modify the learning rate associated with either the smallest absolute gradient (sag) or the smallest learning rate (slr). When using these algorithms, it can be useful to pass a vector or list containing the lowest and highest limits for the learning rate to the learningrate.limit parameter.

Again, use the neuralnet() function twice, once with parameter algorithm set to 'sag' and then to 'slr'. In both cases set the learningrate.limit parameter to c(0.1,1) and change the stepmax parameter to 1e+06.

Exercise 7
The learning rate determines how much the backpropagation can affect the weights at each iteration. A high learning rate means that during training each iteration can strongly change the values of the weights or, to put it another way, the algorithm learns a lot from each observation in your data set. This means that outliers could easily affect your weights and make the algorithm diverge from the path to the ideal weights for your problem. A small learning rate means that the algorithm learns less from each observation, so your neural network is less affected by outliers, but you will need more observations to build a good model.

Use the neuralnet() function with parameter algorithm set to ‘rprop+’ twice: once with the learningrate parameter set to 0.01 and another time with the learningrate parameter set to 1. Notice the difference in running time in both cases.

Exercise 8
The neuralnet package gives us the ability to make a visual representation of the neural networks we build. Use the plot() function to visualize one of the neural networks from the last exercise.

Exercise 9
Until now, we've used feedforward neural networks with one hidden layer of neurons, but we could use more. In fact, state-of-the-art neural networks often use many hidden layers to model complex behavior. For basic regression problems or even basic digit recognition problems one layer is enough, but if you want to use more, you can do so with the neuralnet() function by passing a vector of integers to the hidden parameter, representing the number of neurons in each layer.

Create a feedforward neural network with three hidden layers of nine neurons and use it on your data.

Exercise 10
Plot the feedforward neural network from the last exercise.

Related exercise sets:
  1. Neural networks Exercises (Part-1)
  2. Evaluate your model with R Exercises
  3. Model Evaluation Exercises 1
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


Sampling weights and multilevel modeling in R

Thu, 06/15/2017 - 15:26

(This article was first published on Data Literacy - The blog of Andrés Gutiérrez, and kindly contributed to R-bloggers)

So many things have been said about weighting, but in my personal view of statistical inference you do have to weight. From a single statistic to a complex model, you have to weight, because the probability measure that induces the variation of the sample comes from an (almost always) complex sampling design that you should not ignore. Weighting is a complex issue that has been discussed by several authors in recent years. Social researchers have found no consensus about the appropriateness of using weights when fitting statistical models. Angrist and Pischke (2009, p. 91) claim that "few things are as confusing to applied researchers as the role of sample weights. Even now, 20 years post-Ph.D., we read the section of the Stata manual on weighting with some dismay."

Anyway, despite the fact that researchers have not reached consensus on when to weight, the reality is that you have to be careful when doing so. For example, when it comes to estimating totals, means or proportions, you can use the inverse selection probability as a weight, and it looks like every social researcher agrees to weight in order to estimate these kinds of descriptive statistics. The rationale behind this practice is that you assume every unit belonging to the sample represents itself and many others that were not selected into the sample.

Now, when using weights to estimate model parameters, you have to keep in mind the nature of the sampling design. For example, when it comes to estimating multilevel parameters, you have to take into account not only the final sampling unit weights but also the primary sampling unit weights. For example, let's assume that you have a sample of students selected from a national frame of schools. Then we have two sets of weights, the first one regarding schools (notice that one selected school represents itself as well as others not in the sample) and the second one regarding students.

Now, let's assume that in the finite population we have 10,000 students and 40 schools. For the sake of the example, let's consider that you have selected 500 students allocated to 8 schools. For the sake of simplicity, let's assume that a simple random sample is used (I know, this kind of sampling design is barely used) to select students. Think about it: if you take into account only the students' weights to fit your multilevel model, you will find that you are estimating parameters with an expanded sample that represents 10,000 students allocated to a sample of just eight schools. So any conclusion stated will be wrong. For example, when performing a simple analysis of variance, the percentage of variance explained by the schools will be extremely low, because you are expanding the sample of students but not the sample of schools. Now, if you take into account both sets of weights (students and schools), you will find yourself fitting a model with expanded samples that represent 10,000 students and 40 schools (which is good).

Unfortunately, as far as I know, the R ecosystem lacks a package that performs this kind of design-based inference for fitting multilevel models. So, right now, we can unbiasedly estimate the model parameters, but when it comes to estimating standard errors (from a design-based perspective) we need to use other computational resources and techniques like bootstrapping or the jackknife. Because of the dependency among sampled observation units, most standard statistical methods, which assume independence, cannot be used to analyze this kind of data directly. Inaccurate standard errors may be produced if no adjustment is made when analyzing complex survey data.

Now, when it comes to educational studies (based on large-scale assessments), we can distinguish (at least) four sets of weights: the total student weight, the student house-weight, the student senate-weight and the school weight. The TIMSS team claims that the total student weight is appropriate for single-level student-level analyses. The student house-weight, also called the normalized weight, is used when analyses are sensitive to sample size; it is essentially a linear transformation of the total student weight so that the sum of the weights equals the sample size. The student senate-weight is used when analyses involve more than one country, because it is the total student weight scaled in such a way that all students' senate weights sum to 500 (or 1000) in each country. The school weight should be used when analyzing school-level data, as it is the inverse of the selection probability of the selected school.

R workshop

We will use the student house-weight to fit a multilevel model. As stated before, the sum of these weights equals the sample size. For the R workshop, we will use PISA 2012 data (available on the OECD website). I have filtered the Colombian case and saved the data to be directly compatible with R (available here). Let's load the data into R.

rm(list = ls())
library(dplyr)
library(ggplot2)
library(lme4)

setwd("/your working directory")
load("PisaCol.RData")

head(names(PisaCol))
summary(PisaCol$STRATUM)

Now, we create an object containing the student house-weights and summarize some results based on that set of weights. Notice that the total student weights are stored in the column W_FSTUWT of the PISA database. Recall that I am working with the first plausible value of the mathematics test, and that score will be defined as our (dependent) variable of interest for the modeling.

n <- nrow(PisaCol)
PisaCol$W_HOUSEWHT <- n * PisaCol$W_FSTUWT / sum(PisaCol$W_FSTUWT)

PisaCol %>%
group_by(STRATUM) %>%
summarise(avg1 = weighted.mean(PV1MATH, w = W_HOUSEWHT),
avg2 = weighted.mean(PV2MATH, w = W_HOUSEWHT))

We use the lmer function of the lme4 package to obtain the estimates of the model coefficients in the null model (where schools enter only as a grouping factor).

##################
### Null model ###
##################

HLM0 <- lmer(PV1MATH ~ (1 | SCHOOLID), data = PisaCol,
weights = W_HOUSEWHT)
coef(HLM0)
summary(HLM0)

# 62.81% of the variance is due to students
# 37.19% of the variance is due to schools
100 * 3569 / (3569 + 2113)

Now, as you may know, the PISA index of economic, social and cultural status has a strong relationship to student achievement, so it is a good idea to control for this variable in a more refined model. 

#################
### ESCS model ###
#################

HLM1 <- lmer(PV1MATH ~ ESCS + (1 + ESCS | SCHOOLID), data = PisaCol,
weights = W_HOUSEWHT)
coef(HLM1)
summary(HLM1)

# After controlling for ESCS, 34.58% of the variance is due to schools
100 * (96.12 + 1697.36) / (3392.58 + 96.12 + 1697.36)

So then, in summary: we have 3569 units of within-school variance (63%); after controlling for ESCS that figure drops to 3392 units (student background explains 5% of that variation). We have 2113 units (37%) of between-school variance; after controlling for ESCS that figure drops to 1793 (student background explains 15% of that variation). The following code makes a graph that summarizes the relationship of student achievement with ESCS.

ggplot(data = PisaCol, aes(x = ESCS, y = PV1MATH, size = W_HOUSEWHT)) +
theme_minimal() + geom_point() + theme(legend.position="none")

ggplot(data = PisaCol, aes(x = ESCS, y = PV1MATH, size = W_HOUSEWHT)) +
geom_point(aes(colour = SCHOOLID)) + theme(legend.position="none")

To leave a comment for the author, please follow the link and comment on their blog: Data Literacy - The blog of Andrés Gutiérrez.


Lexicographic Permutations: Euler Problem 24

Thu, 06/15/2017 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Euler Problem 24 asks for lexicographic permutations, which are ordered arrangements of objects listed in lexicographic order. Tushar Roy of Coding Made Simple has shared a great introduction on how to generate lexicographic permutations.

Euler Problem 24 Definition

A permutation is an ordered arrangement of objects. For example, 3124 is one possible permutation of the digits 1, 2, 3 and 4. If all of the permutations are listed numerically or alphabetically, we call it lexicographic order. The lexicographic permutations of 0, 1 and 2 are:

012 021 102 120 201 210

What is the millionth lexicographic permutation of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9?

Brute Force Solution

The digits 0 to 9 have 10! = 3,628,800 permutations (including those that start with 0). Most of these permutations are, however, not in lexicographic order. A brute-force way to solve the problem is to determine the next lexicographic permutation of a number string and repeat this one million times.

nextPerm <- function(a) {
  # Find longest non-increasing suffix
  i <- length(a)
  while (i > 1 && a[i - 1] >= a[i])
    i <- i - 1
  # i is the head index of the suffix
  # Are we at the last permutation?
  if (i <= 1) return(NA)
  # a[i - 1] is the pivot
  # Find rightmost element that exceeds the pivot
  j <- length(a)
  while (a[j] <= a[i - 1])
    j <- j - 1
  # Swap pivot with j
  temp <- a[i - 1]
  a[i - 1] <- a[j]
  a[j] <- temp
  # Reverse the suffix
  a[i:length(a)] <- rev(a[i:length(a)])
  return(a)
}

numbers <- 0:9
for (i in 1:(1E6 - 1)) numbers <- nextPerm(numbers)
answer <- numbers
print(answer)

This code takes the following steps:

  1. Find the largest index i such that a[i - 1] < a[i].
    1. If no such index exists, then this is already the last permutation.
  2. Find the largest index j such that a[j] > a[i - 1].
  3. Swap a[i - 1] and a[j].
  4. Reverse the suffix starting at a[i].

Combinatorics

A more efficient solution is to use combinatorics, thanks to MathBlog. The last nine digits can be ordered in 9! = 362,880 ways, so the first 362,880 permutations start with a 0. By extending this thought, it follows that the millionth permutation must start with a 2.

From this rule, it follows that the 725,761st permutation is 2013456789. We now need 274,239 more lexicographic permutations, since 1,000,000 - 725,761 = 274,239.

We can repeat this logic to find the next digit. The last 8 digits can be ordered in 8! = 40,320 ways, and floor(274,239 / 40,320) = 6 complete blocks are skipped, so the second digit is the 7th of the remaining digits (0, 1, 3, 4, 5, 6, 7, 8, 9), which is 7.

This process is repeated until all digits have been used.

numbers <- 0:9
n <- length(numbers)
answer <- vector(length = 10)
remain <- 1E6 - 1
for (i in 1:n) {
  j <- floor(remain / factorial(n - i))
  answer[i] <- numbers[j + 1]
  remain <- remain %% factorial(n - i)
  numbers <- numbers[-(j + 1)]
}
answer <- paste(answer, collapse = "")
print(answer)

R blogger Tony’s Bubble Universe created a generalised function to solve this problem a few years ago.

The post Lexicographic Permutations: Euler Problem 24 appeared first on The Devil is in the Data.


To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.


Studying disease with R: RECON, The R Epidemics Consortium

Wed, 06/14/2017 - 23:41

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

For almost a year now, a collection of researchers from around the world has been collaborating to develop the next generation of analysis tools for disease outbreak response using R. The R Epidemics Consortium (RECON) creates R packages for handling, visualizing, and analyzing outbreak data using cutting-edge statistical methods, along with general-purpose tools for data cleaning, versioning, and encryption, and system infrastructure. Like ROpenSci, the Epidemics Consortium is focused on developing efficient, reliable, and accessible open-source tools, but with a focus on epidemiology as opposed to science generally.

The Epidemics Consortium has already created several useful resources for epidemiology:

There are also a large number of additional packages under development.

RECON welcomes new members, particularly experienced R developers and public health officers specialized in outbreak response. You can find information on how to join here, and general information about the R Epidemics Consortium at the link below.

RECON: The R Epidemics Consortium (via Maëlle Salmon‏)

 


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Exploratory Factor Analysis – Exercises

Wed, 06/14/2017 - 18:10

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

This set of exercises is about exploratory factor analysis. We shall use some basic features of the psych package. For a quick introduction to exploratory factor analysis and the psych package, we recommend this short "how to" guide.
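If you want to see the main functions in action before starting, here is a hedged sketch on psych's built-in bfi data (not the exercise dataset):

library(psych)
library(GPArotation)

items <- bfi[, 1:25]                   # 25 personality items
fa.parallel(items, fa = "fa")          # parallel analysis to suggest the number of factors
fit <- fa(items, nfactors = 5, rotate = "oblimin", fm = "minres")
print(fit$loadings, cutoff = 0.3)      # loadings table with a 0.3 cut-off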

You can download the dataset here. The data is fictitious.

Answers to the exercises are available here.

If you have a different solution, feel free to post it.

Exercise 1

Load the data, install the packages psych and GPArotation which we will use in the following exercises, and load them. Describe the data.

Exercise 2

Using parallel analysis, determine the number of factors.

Exercise 3

Determine the number of factors using Very Simple Structure method.

Exercise 4

Based on a normality test, is the maximum likelihood factoring method appropriate, or is OLS/minres better? (Tip: the maximum likelihood method requires a normal distribution.)

Exercise 5

Using oblimin rotation, 5 factors and the factoring method from the previous exercise, find the factor solution. Print the loadings table with a cut-off at 0.3.

Exercise 6

Plot the factor loadings.

Exercise 7

Plot structure diagram.

Exercise 8

Find the higher-order factor model with five factors plus general factor.

Exercise 9

Find the bifactor solution.

Exercise 10

Reduce the number of dimensions using hierarchical clustering analysis.

Related exercise sets:
  1. Repeated measures ANOVA in R Exercises
  2. Data science for Doctors: Cluster Analysis Exercises
  3. Factor exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


How to draw connecting routes on map with R and great circles

Wed, 06/14/2017 - 16:25

(This article was first published on Blog – The R graph Gallery, and kindly contributed to R-bloggers)

This post explains how to draw connection lines between several locations on a map using R. The method proposed here relies on the gcIntermediate function from the geosphere package. Instead of drawing straight lines, it draws the shortest routes, using great circles. Special care is given to situations where cities are very far from each other and the shortest connection thus passes behind the map.

First we need to load 3 libraries: maps allows us to draw the background map, and geosphere provides the gcIntermediate function.

library(tidyverse)
library(maps)
library(geosphere)

 

1- Draw an empty map

This is easily done using the ‘maps’ package. You can see plenty of other maps made with R in the map section of the R graph gallery.

par(mar=c(0,0,0,0))
map('world', col="#f2f2f2", fill=TRUE, bg="white", lwd=0.05, mar=rep(0,4), border=0, ylim=c(-80,80))

2- Add 3 cities

First we create a data frame with the coordinates of Buenos Aires, Melbourne and Paris

Buenos_aires=c(-58,-34)
Paris=c(2,49)
Melbourne=c(145,-38)
data=rbind(Buenos_aires, Paris, Melbourne) %>% as.data.frame()
colnames(data)=c("long","lat")

Then add it to the map using the ‘points’ function:

points(x=data$long, y=data$lat, col="slateblue", cex=3, pch=20)

4- Show connection between them

Now we can connect the cities by drawing the shortest route between them. This is done using great circles, which gives a better visualization than straight lines. This technique was proposed by Nathan Yau on FlowingData.

# Connection between Buenos Aires and Paris
inter <- gcIntermediate(Paris, Buenos_aires, n=50, addStartEnd=TRUE, breakAtDateLine=F)
lines(inter, col="slateblue", lwd=2)

# Between Paris and Melbourne
inter <- gcIntermediate(Melbourne, Paris, n=50, addStartEnd=TRUE, breakAtDateLine=F)
lines(inter, col="slateblue", lwd=2)


5 – Correcting gcIntermediate

If we use the same method between Melbourne and Buenos Aires, we get this disappointing result:

What happens is that gcIntermediate follows the shortest path, which means it will go east from Australia until the date line, break the line, and continue east across the Pacific to South America. Because we do not want to see the horizontal line, we need to plot this connection in 2 steps.

To do so we can use the following function, which breaks the line in 2 sections when the distance between 2 points is longer than 180 degrees.

plot_my_connection=function(dep_lon, dep_lat, arr_lon, arr_lat, ...){
  inter <- gcIntermediate(c(dep_lon, dep_lat), c(arr_lon, arr_lat), n=50, addStartEnd=TRUE, breakAtDateLine=F)
  inter=data.frame(inter)
  diff_of_lon=abs(dep_lon) + abs(arr_lon)
  if(diff_of_lon > 180){
    lines(subset(inter, lon>=0), ...)
    lines(subset(inter, lon<0), ...)
  }else{
    lines(inter, ...)
  }
}

Let’s try it!

map('world', col="#f2f2f2", fill=TRUE, bg="white", lwd=0.05, mar=rep(0,4), border=0, ylim=c(-80,80))
points(x=data$long, y=data$lat, col="slateblue", cex=3, pch=20)
plot_my_connection(Paris[1], Paris[2], Melbourne[1], Melbourne[2], col="slateblue", lwd=2)
plot_my_connection(Buenos_aires[1], Buenos_aires[2], Melbourne[1], Melbourne[2], col="slateblue", lwd=2)
plot_my_connection(Buenos_aires[1], Buenos_aires[2], Paris[1], Paris[2], col="slateblue", lwd=2)

6 – Apply it to several pairs of cities

Let’s consider 8 cities:

data=rbind(
  Buenos_aires=c(-58,-34),
  Paris=c(2,49),
  Melbourne=c(145,-38),
  Saint.Petersburg=c(30.32, 59.93),
  Abidjan=c(-4.03, 5.33),
  Montreal=c(-73.57, 45.52),
  Nairobi=c(36.82, -1.29),
  Salvador=c(-38.5, -12.97)
) %>% as.data.frame()
colnames(data)=c("long","lat")

We can generate all pairs of coordinates

all_pairs=cbind(t(combn(data$long, 2)), t(combn(data$lat, 2))) %>% as.data.frame()
colnames(all_pairs)=c("long1","long2","lat1","lat2")

And plot every connection:

# background map
par(mar=c(0,0,0,0))
map('world', col="#f2f2f2", fill=TRUE, bg="white", lwd=0.05, mar=rep(0,4), border=0, ylim=c(-80,80))

# add every connection:
for(i in 1:nrow(all_pairs)){
  plot_my_connection(all_pairs$long1[i], all_pairs$lat1[i], all_pairs$long2[i], all_pairs$lat2[i], col="skyblue", lwd=1)
}

# add points and names of cities
points(x=data$long, y=data$lat, col="slateblue", cex=2, pch=20)
text(rownames(data), x=data$long, y=data$lat, col="slateblue", cex=1, pos=4)

7 – An alternative using the greatCircle function

This is the method proposed by the Simply Statistics Blog to draw a twitter connection map. The idea is to calculate the whole great circle, and keep only the part that stays in front of the map, never going behind it.

# A function that keeps the good part of the great circle, by Jeff Leek:
getGreatCircle = function(userLL, relationLL){
  tmpCircle = greatCircle(userLL, relationLL, n=200)
  start = which.min(abs(tmpCircle[,1] - data.frame(userLL)[1,1]))
  end = which.min(abs(tmpCircle[,1] - relationLL[1]))
  greatC = tmpCircle[start:end,]
  return(greatC)
}

# map 3 connections:
map('world', col="#f2f2f2", fill=TRUE, bg="white", lwd=0.05, mar=rep(0,4), border=0, ylim=c(-80,80))
great = getGreatCircle(Paris, Melbourne)
lines(great, col="skyblue", lwd=2)
great = getGreatCircle(Buenos_aires, Melbourne)
lines(great, col="skyblue", lwd=2)
great = getGreatCircle(Paris, Buenos_aires)
lines(great, col="skyblue", lwd=2)
points(x=data$long, y=data$lat, col="slateblue", cex=3, pch=20)
text(rownames(data), x=data$long, y=data$lat, col="slateblue", cex=1, pos=4)

Note that the R graph gallery offers many other examples of maps made with R. You can follow the gallery on Twitter or on Facebook to be aware of recent updates.



To leave a comment for the author, please follow the link and comment on their blog: Blog – The R graph Gallery.


When the LASSO fails???

Wed, 06/14/2017 - 16:20

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos

When the LASSO fails?

The LASSO has two important uses: the first is forecasting and the second is variable selection. We are going to talk about the second. The variable selection objective is to recover the correct set of variables that generate the data, or at least the best approximation given the candidate variables. The LASSO has attracted a lot of attention lately because it allows us to estimate a linear regression with thousands of variables and have the model select the right ones for us. However, what many people ignore is when the LASSO fails.

Like any model, the LASSO relies on assumptions in order to work. The first is sparsity, i.e. only a small number of variables may actually be relevant. If this assumption does not hold, there is no hope of using the LASSO for variable selection. Another assumption is that the irrepresentable condition must hold; this condition may look very technical, but it only says that the relevant variables may not be too correlated with the irrelevant variables.

Suppose your candidate variables are represented by the matrix $X$, where each column is a variable and each row is an observation. We can calculate the covariance matrix $\Sigma$, which is a symmetric matrix. This matrix may be broken into four pieces:

$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$$

The first piece, $\Sigma_{11}$, shows the covariances between only the important variables, $\Sigma_{22}$ is the covariance matrix of the irrelevant variables, and $\Sigma_{12}$ and $\Sigma_{21}$ show the covariances between relevant and irrelevant variables. With that in mind, the irrepresentable condition is:

$$\left| \Sigma_{21} \, \Sigma_{11}^{-1} \, \mathrm{sign}(\beta_1) \right| < 1$$

The inequality above must hold for all elements, where $\beta_1$ is the coefficient vector of the relevant variables and $\mathrm{sign}(x)$ is $1$ for positive values of $x$, $-1$ for negative values and $0$ if $x = 0$.

Example

For this example we are going to generate two covariance matrices, one that satisfies the irrepresentable condition and one that violates it. Our design will be very simple: only 10 candidate variables where five of them are relevant.

library(mvtnorm)
library(corrplot)
library(glmnet)
library(clusterGeneration)

k=10            # = Number of Candidate Variables
p=5             # = Number of Relevant Variables
N=500           # = Number of observations
betas=(-1)^(1:p) # = Values for beta
set.seed(12345)  # = Seed for replication

sigma1=genPositiveDefMat(k,"unifcorrmat")$Sigma # = Sigma1 violates the irc
sigma2=sigma1                                   # = Sigma2 satisfies the irc
sigma2[(p+1):k,1:p]=0
sigma2[1:p,(p+1):k]=0

# = Verify the irrepresentable condition
irc1=sort(abs(sigma1[(p+1):k,1:p]%*%solve(sigma1[1:p,1:p])%*%sign(betas)))
irc2=sort(abs(sigma2[(p+1):k,1:p]%*%solve(sigma2[1:p,1:p])%*%sign(betas)))
c(max(irc1),max(irc2))

## [1] 3.222599 0.000000

# = Have a look at the correlation matrices
par(mfrow=c(1,2))
corrplot(cov2cor(sigma1))
corrplot(cov2cor(sigma2))

As you can see, irc1 violates the irrepresentable condition and irc2 does not. The correlation matrix that satisfies the irrepresentable condition is block diagonal and the relevant variables have no correlation with the irrelevant ones. This is an extreme case, you may have a small correlation and still satisfy the condition.

Now let us check how the LASSO works for both covariance matrices. First we need to understand what the regularization path is. The LASSO objective function penalizes the size of the coefficients, and this penalization is controlled by a hyper-parameter $\lambda$. We can find the exact $\lambda$ that is just big enough to shrink all variables to zero, and for any smaller value some variable will be included. As we decrease $\lambda$, more variables are included until we have a model with all variables (or the biggest identified model when we have more variables than observations). This path between the model with all variables and the model with no variables is the regularization path. The code below generates data from multivariate normal distributions for the covariance matrix that violates the irrepresentable condition and the covariance matrix that satisfies it. Then I estimate the regularization path for both cases and summarize the information in plots.

X1=rmvnorm(N,sigma = sigma1) # = Variables for the design that violates the IRC = #
X2=rmvnorm(N,sigma = sigma2) # = Variables for the design that satisfies the IRC = #
e=rnorm(N)                   # = Error = #
y1=X1[,1:p]%*%betas+e        # = Generate y for design 1 = #
y2=X2[,1:p]%*%betas+e        # = Generate y for design 2 = #

lasso1=glmnet(X1,y1,nlambda = 100) # = Estimation for design 1 = #
lasso2=glmnet(X2,y2,nlambda = 100) # = Estimation for design 2 = #

## == Regularization path == ##
par(mfrow=c(1,2))
l1=log(lasso1$lambda)
matplot(as.matrix(l1),t(coef(lasso1)[-1,]),type="l",lty=1,col=c(rep(1,9),2),ylab="coef",xlab="log(lambda)",main="Violates IRC")
l2=log(lasso2$lambda)
matplot(as.matrix(l2),t(coef(lasso2)[-1,]),type="l",lty=1,col=c(rep(1,9),2),ylab="coef",xlab="log(lambda)",main="Satisfies IRC")

The plot on the left shows the results when the irrepresentable condition is violated, and the plot on the right the case when it is satisfied. The five black lines that slowly converge to zero are the five relevant variables, and the red line is an irrelevant variable. As you can see, when the IRC is satisfied all relevant variables shrink very fast to zero as we increase lambda. However, when the IRC is violated, one irrelevant variable starts with a very small coefficient that slowly increases before decreasing to zero at the very end of the path. This variable is selected through the entire path; it is virtually impossible to recover the correct set of variables in this case unless you apply a different penalty to each variable. This is precisely what the adaptive LASSO does. Does that mean that the adaLASSO is free from the irrepresentable condition??? The answer is: partially. The adaptive LASSO requires a less restrictive condition called the weighted irrepresentable condition, which is much easier to satisfy. The two plots below show the regularization paths for the LASSO and the adaLASSO in the case where the IRC is violated. As you can see, the adaLASSO selects the correct set of variables along the entire path.

lasso1.1=cv.glmnet(X1,y1)
w.=(abs(coef(lasso1.1)[-1])+1/N)^(-1)
adalasso1=glmnet(X1,y1,penalty.factor = w.)

par(mfrow=c(1,2))
l1=log(lasso1$lambda)
matplot(as.matrix(l1),t(coef(lasso1)[-1,]),type="l",lty=1,col=c(rep(1,9),2),ylab="coef",xlab="log(lambda)",main="LASSO")
l2=log(adalasso1$lambda)
matplot(as.matrix(l2),t(coef(adalasso1)[-1,]),type="l",lty=1,col=c(rep(1,9),2),ylab="coef",xlab="log(lambda)",main="adaLASSO")

The biggest problem is that the irrepresentable condition and its less restrictive weighted version are not testable in the real world, because we would need the population covariance matrix and the true betas that generate the data. The solution is to study your data as much as possible to at least have an idea of the situation.

Some articles on the topic

Zhao, Peng, and Bin Yu. “On model selection consistency of Lasso.” Journal of Machine learning research 7.Nov (2006): 2541-2563.

Meinshausen, Nicolai, and Bin Yu. “Lasso-type recovery of sparse representations for high-dimensional data.” The Annals of Statistics (2009): 246-270.


To leave a comment for the author, please follow the link and comment on their blog: R – insightR.


Dynamic Networks: Visualizing Arms Trades Over Time

Wed, 06/14/2017 - 16:09

(This article was first published on sweissblaug, and kindly contributed to R-bloggers)

I previously made some network graphs of Middle East country relationships here using Slate’s Middle East Friendship Chart. I was thinking of a way to visualize these relationships (and possibly other ones) with a more rule-based and objective measure over time. What kind of public dataset could show countries’ relationships accurately?

I used weapons / arms trades between nations to explore these relationships. I think arms trades are a good indicator of how friendly countries are with each other because 1) countries won’t sell to enemies and 2) if a country wants to befriend another country, buying weapons is a good way to buy influence. SIPRI has a very detailed dataset of international arms transfers between >257 nations / organizations in the post-war era. They include a type / description and quantity for all trades. For this analysis I will use only the fact that two countries bought or sold some weapons in a particular year. One can create adjacency matrices from this dataset and make some network plots (a sketch of that step follows below). However, doing this one year at a time will not create smooth plots over time. I decided to use siamese networks to embed a country indicator and a year variable in two dimensions, with the response variable being whether or not two countries traded arms in that particular year. This transforms each country-year pair into a 2-dimensional space, and means that if two points are near each other they probably traded in that particular year.
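The adjacency-matrix step mentioned above can be illustrated with a small sketch. This is not the author's code, and the column names (seller, buyer, year) are hypothetical stand-ins for the real SIPRI extract.

library(igraph)

# toy stand-in for the SIPRI trade register
trades <- data.frame(
  seller = c("USA", "USSR", "USA"),
  buyer  = c("Iran", "Egypt", "UK"),
  year   = c(1960, 1960, 1960)
)

# build and plot the trade network for a single year
plot_trade_year <- function(trades, yr) {
  edges <- trades[trades$year == yr, c("seller", "buyer")]
  g <- graph_from_data_frame(edges, directed = FALSE)
  plot(g, main = paste("Arms trades,", yr))
}

plot_trade_year(trades, 1960)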

You can check out the graph here. It’s cool that we can see Iran moving away from the US towards Russia / the Soviet Union around 1979, while Egypt and Iraq move towards the US. One can also explore other relationships. Below is a graph of the world’s relationships in 1960. There is a Russia / Soviet Union cluster and a US / Western Europe cluster. Neat! The analysis was done in Python (using the Keras example for siamese networks) and R (for plotting).

To leave a comment for the author, please follow the link and comment on their blog: sweissblaug. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


New course: Data Visualization in R with lattice

Wed, 06/14/2017 - 16:05

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

Hello, R users! Today we’re launching Data Visualization in R with lattice by Deepayan Sarkar, the creator of the lattice package.

Visualization is an essential component of interactive data analysis in R. Traditional (base) graphics is powerful, but limited in its ability to deal with multivariate data. Trellis graphics is the natural successor to traditional graphics, extending its simple philosophy to gracefully handle common multivariable data visualization tasks. This course introduces the lattice package, which implements Trellis graphics for R, and illustrates its basic use.

Start course for free

Data Visualization in R with lattice features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a master in data visualization with lattice!

What you’ll learn

You’ll start off with an introduction to some basic plotting functions in lattice. Draw histograms, scatter plots, density plots, and box and whisker plots.

Chapter 2 will teach you to create “conditioned” plots consisting of multiple panels using the formula interface.
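As a flavour of what that looks like (this snippet is not from the course, just a minimal sketch using a built-in dataset), the part of the formula after the | defines the conditioning variable and hence the panels:

library(lattice)

# one scatter-plot panel per number of cylinders
xyplot(mpg ~ wt | factor(cyl), data = mtcars,
       layout = c(3, 1),
       xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")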

You’ll then learn how to control and customize axis limits and visual appearance in chapter 3.

Chapter 4 will cover how to use panel and prepanel functions to enhance existing displays or create new ones.

Finally, you’ll see that the lattice package is not just meant to be used as a standalone collection of plotting functions. Rather, it is a framework that is used as a base by many other packages. Some of these are very specialized and beyond the scope of this course. In chapter 5, we give a brief survey of extensions that are generally useful to enhance displays or create new ones.

Start Data Visualization with lattice today!


To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


How to make and share an R package in 3 steps

Wed, 06/14/2017 - 14:07

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

If you find yourself often repeating the same scripts in R, you might come to the point where you want to turn them into reusable functions and create your own R package. I recently reached that point and wanted to learn how to build my own R package – as simple as possible.

It took me some time to figure out the full process from A to Z. I was often left with remaining questions:

  • What to do with the import of external libraries?
  • What about documentation?
  • How to actually install my package?
  • How can I share it with others?

Therefore, I will explain how I made my first R package and which methods I found helpful. Of course, there are different approaches out there, some of them very well documented, like for example this one – but the following 3 easy steps worked for me, so maybe they will help you, too, to get your first R package ready, installed and online quickly.

My first package is a wrapper for the FlightStats API. You can sign up for FlightStats to get a free trial API key (which unfortunately works for one month only). However, you can of course make a wrapper for any API you like or make a non-API related package. Two links I found very useful for getting started making my first package are this one and this one (regarding the storing of API keys). Also, I learned a lot using this package as an example.

Prepare the functions you want to include

First of all, we need to have all the functions (and possibly dataframes & other objects) ready in our R environment. For example, imagine we want to include these two functions to store the API key and an app ID:

setAPIKey <- function(){
  input = readline(prompt="Enter your FlightStats API Key: ")
  # Sys.setenv() is a simple way of storing API keys, but it only sets an
  # environment variable for the current session - meaning next session the
  # login details will have to be provided again. See below for how to store
  # login details in a more durable way.
  Sys.setenv(flightstats_api_key = input)
}

setAppId <- function(){
  input = readline(prompt="Enter your FlightStats appID: ")
  Sys.setenv(flightstats_app_id = input)
}

Now you can retrieve the login details like this:

# If the key does not exist, Sys.getenv() returns an empty string (""); in that
# case the user should be prompted to use the setAPIKey() and setAppId() functions.
ID = Sys.getenv("flightstats_app_id")
KEY = Sys.getenv("flightstats_api_key")

However, if you would like the login details to still be accessible the next time you open R, you can define your login details in the .Renviron file (instead of the .Rprofile file). The .Renviron file can be found in your home directory. You can find the path to your home directory by running Sys.getenv("HOME"). For more info see here. The final function to store the API key in the FlightsR package looked like this.
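As a generic illustration of the .Renviron approach (this is not the FlightsR function linked above, and the key value is made up), a line like the one below in your .Renviron file makes the key available in every new session, where it can be read with Sys.getenv():

# line added to ~/.Renviron (no quotes needed):
# flightstats_api_key=abc123-not-a-real-key

# then, in any new R session:
Sys.getenv("flightstats_api_key")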

Side note: I actually did not know this before, but find it quite handy: if you want to see the source of any function in R, you can print it by typing the function name without parentheses and hitting enter. For example:

> read.csv
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
    fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
    dec = dec, fill = fill, comment.char = comment.char, ...)

Let’s make another function that we can add to the package. For example, here is a simple function to list all airlines:

listAirlines <- function(activeOnly=TRUE){
  ID = Sys.getenv("flightstats_app_id")
  if (ID == ""){
    stop("Please set your FlightStats AppID and API Key with the setAPIKey() and setAppId() function. You can obtain these from https://developer.flightstats.com.")
  }
  KEY = Sys.getenv("flightstats_api_key")
  if (KEY == ""){
    stop("Please set your FlightStats AppID and API Key with the setAPIKey() and setAppId() function. You can obtain these from https://developer.flightstats.com.")
  }
  if(missing(activeOnly)){
    choice = "active"
  }
  if(activeOnly == FALSE) {
    choice = "all"
  } else {
    choice = "active"
  }
  link = paste0("https://api.flightstats.com/flex/airlines/rest/v1/json/",choice,"?appId=",ID,"&appKey=",KEY)
  dat = getURL(link)
  dat_list <- fromJSON(dat)
  airlines <- dat_list$airlines
  return(airlines)
}

Prepare the documents

Use package.skeleton(), devtools & roxygen2 to let R prepare the necessary documents for you

package.skeleton(name = "FlightR", list = c("listAirlines","listAirports","scheduledFlights","scheduledFlightsFullDay","searchAirline","searchAirport","setAPIKey","setAppId"))

That’s it! Now in your working directory folder there should be a new folder with the name you just gave to your package.

Now, what is handy from the function above is that it creates the folders and files you need in a new package folder (“FlightsR” in this case). In the /R folder you see now that every function you added has its own .R script and in the /man folder there is an .Rd file for each of the functions.

You can now go and manually change everything in these files that needs to be changed (documentation needs to be added, the import of external packages needs to be defined, etc.) – or use roxygen2 and devtools to do it for you. roxygen2 will complete the documentation in each .Rd file correctly and will create a NAMESPACE file for you. To do this, make sure you first delete the current incomplete files (that is, all the files in the /man folder and the NAMESPACE file), otherwise you will get an error when you use the document() function later.

Now extra information needs to be added in the functions (for example, what are the parameters of the function, an example usage, necessary imports, etc.), in the following way:

#' Function searches a specific airline by IATA code
#'
#' @param value character, an airline IATA code
#' @return data.frame() with the airline
#'
#' @author Emelie Hofland, \email{emelie_hofland@hotmail.com}
#'
#' @examples
#' searchAirline("FR")
#'
#' @import RCurl
#' @import jsonlite
#' @export
#'
searchAirline <- function(value){
  ID = Sys.getenv("flightstats_app_id")
  if (ID == ""){
    stop("Please set your FlightStats AppID and API Key with the setAPIKey() and setAppId() function. You can obtain these from https://developer.flightstats.com.")
  }
  KEY = Sys.getenv("flightstats_api_key")
  if (KEY == ""){
    stop("Please set your FlightStats AppID and API Key with the setAPIKey() and setAppId() function. You can obtain these from https://developer.flightstats.com.")
  }
  link = paste0("https://api.flightstats.com/flex/airlines/rest/v1/json/iata/",toupper(value),"?appId=",ID,"&appKey=",KEY)
  dat <- getURL(link)
  dat_list <- fromJSON(dat)
  result <- dat_list$airlines
  if (length(result)==0){
    warning("Please make sure that you provide a valid airline IATA code.")
  }
  return(result)
}

Do not forget to add the @export tag, otherwise your functions will not be available when you load your package!

Now, when you have added this information for all your functions, make sure that devtools and roxygen2 are installed.

install.packages("roxygen2") install.packages("devtools")

Make sure your working directory is set to the folder of your package and run the following commands in R:

library(devtools)

# to automatically generate the documentation:
document()

# to build the package:
build()

# to install the package:
install()

Voila! You are done. I do not know if it is necessary, but just to be sure I restarted R at this point. In a new session you can now run library(YourPackageName) and this should work.

To adjust functions, you can just change things in the functions in the package and re-run the document(), build() and install() commands.

Pushing a custom made R package to GitHub (or GitLab)

Note: These steps assume that you have Git installed and configured on your PC.
1) Create a new repository in your github account.
2) Create and copy the https link to clone the repo on your PC.
3) Go to the folder on your PC where you want to save your repo, open the command line interface & type:

$ git clone https://github.com/YourGithub/YourPackage.git

4) Copy all the files from your package in the folder and run:

$ git add .
$ git commit -m "whatever message you want to add"
$ git push origin master

5) Voila – now your package should be on GitHub!

Now people can download & install your package straight from GitHub or GitLab – the devtools package has a function for this:

if (!require(devtools)) {
  install.packages('devtools')
}

# If your repo is on GitHub:
devtools::install_github('YourGithub/YourPackage')

# If your repo is a public repo on GitLab:
devtools::install_git("https://gitlab/link/to/your/repo")

# If your repo is a private repo on GitLab:
devtools::install_git("https://emelie.hofland:password@gitlab/link/to/your/repo.git")

Post a comment below if you have a question.

    Related Post

    1. Introductory Data Analysis with Python
    2. Understanding Linear SVM with R
    3. How to add a background image to ggplot2 graphs
    4. Streamline your analyses linking R to SAS: the workfloweR experiment
    5. R Programming – Pitfalls to avoid (Part 1)

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    Installing R packages with rxInstallPackages in Microsoft R Server

    Wed, 06/14/2017 - 13:07

    (This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

With the MicrosoftML package comes – in my opinion – a long-anticipated function for installing R packages for SQL Server and Microsoft R Server. And I am super happy.

Last year, in one of my previous blog posts, I showed how to install an R package from SSMS using sp_execute_external_script. Now, with the new MicrosoftML package (which is part of Microsoft R Server 9.x and above), a new function is available that enables you to easily install packages and a little bit more.

    Code is relatively simple and straightforward:

USE SQLR;
GO

EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = N'
        packagesToInstall <- c("caret","tree","party")
        library(MicrosoftML)
        SqlServerCC <- RxInSqlServer(connectionString = "Driver=SQL Server;Server=SICN-KASTRUN\\SQLSERVER2017C2;Database=SQLR;Trusted_Connection=True;")
        rxInstallPackages(pkgs = packagesToInstall, owner = '',
                          scope = "shared", computeContext = "SqlServerCC");';
GO

This is way too easy to be true, but it is. Make sure to do a couple of things prior to running this code:

    1. set the compute environment to where your packages are installed
    2. set up the correct permissions and access
3. Also check the TCP/IP protocols

In the rxInstallPackages function, use the computeContext parameter to point either to "Local" or to your "SqlServer" environment. You can also set scope to shared or private (the difference being that a package installed as shared can be used by different users across different databases, whereas a private installation cannot). You can also specify owner if you are running this command outside of the db_owner role.
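For comparison, a hedged sketch of a local, private install using the same parameters described above (the values are illustrative; check your Microsoft R Server documentation for the exact compute context to pass in your setup):

library(MicrosoftML)

# install into the user's private library on the local machine
rxInstallPackages(pkgs = c("caret"),
                  scope = "private",
                  computeContext = "Local")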

    Happy SQLR-ing!


    To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    #7: C++14, R and Travis — A useful hack

    Wed, 06/14/2017 - 03:54

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    Welcome to the seventh post in the rarely relevant R ramblings series, or R4 for short.

    We took a short break as several conferences and other events interfered during the month of May, keeping us busy and away from this series. But we are back now with a short and useful hack I came up with this weekend.

    The topic is C++14, i.e. the newest formally approved language standard for C++, and its support in R and on Travis CI. With release R 3.4.0 of a few weeks ago, R now formally supports C++14. Which is great.

But there be devils. A little known fact is that R hangs on to its configuration settings from its own compile time. That matters in cases such as the one we are looking at here: Travis CI. Travis is a tremendously useful and widely-deployed service, most commonly connected to GitHub driving "continuous integration" (the ‘CI’) testing after each commit. But Travis CI, for as useful as it is, is also maddeningly conservative, still forcing everybody to live and die by [Ubuntu 14.04](http://releases.ubuntu.com/14.04/). So while we all benefit from the fine work by Michael who faithfully provides Ubuntu binaries for distribution via CRAN (based on the Debian builds provided by yours truly), we are stuck with Ubuntu 14.04. Which means that while Michael can provide us with current R 3.4.0 it will be built on ancient Ubuntu 14.04.

    Why does this matter, you ask? Well, if you just try to turn the very C++14 support added to R 3.4.0 on in the binary running on Travis, you get this error:

** libs
Error in .shlib_internal(args) :
  C++14 standard requested but CXX14 is not defined

    And you get it whether or not you define CXX14 in the session.

    So R (in version 3.4.0) may want to use C++14 (because a package we submitted requests it), but having been built on the dreaded Ubuntu 14.04, it just can’t oblige. Even when we supply a newer compiler. Because R hangs on to its compile-time settings rather than current environment variables. And that means no C++14 as its compile-time compiler was too ancient. Trust me, I tried: adding not only g++-6 (from a suitable repo) but also adding C++14 as the value for CXX_STD. Alas, no mas.

    The trick to overcome this is twofold, and fairly straightforward. First off, we just rely on the fact that g++ version 6 defaults to C++14. So by supplying g++-6, we are in the green. We have C++14 by default without requiring extra options. Sweet.

    The remainder is to tell R to not try to enable C++14 even though we are using it. How? By removing CXX_STD=C++14 on the fly and just for Travis. And this can be done easily with a small script configure which conditions on being on Travis by checking two environment variables:

#!/bin/bash

## Travis can let us run R 3.4.0 (from CRAN and the PPAs) but this R version
## does not know about C++14. Even though we can select CXX_STD = C++14, R
## will fail as the version we use there was built in too old an environment,
## namely Ubuntu "trusty" 14.04.
##
## So we install g++-6 from another repo and rely on the fact that it
## defaults to C++14. Sadly, we need R to not fail and hence, just on
## Travis, remove the C++14 instruction

if [[ "${CI}" == "true" ]]; then
    if [[ "${TRAVIS}" == "true" ]]; then
        echo "** Overriding src/Makevars and removing C++14 on Travis only"
        sed -i 's|CXX_STD = CXX14||' src/Makevars
    fi
fi

    I have deployed this now for two sets of builds in two distinct repositories for two "under-development" packages not yet on CRAN, and it just works. In case you turn on C++14 via SystemRequirements: in the file DESCRIPTION, you need to modify it here.

    So to sum up, there it is: C++14 with R 3.4.0 on Travis. Only takes a quick Travis-only modification.


    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    Use a Join Controller to Document Your Work

    Wed, 06/14/2017 - 00:17

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

This note describes a useful replyr tool we call a "join controller" (and is part of our "R and Big Data" series, please see here for the introduction, and here for one of our big data courses).

When working on real world predictive modeling tasks in production, the ability to join data and document how you join data is paramount. There are very strong reasons to organize production data in something resembling one of the Codd normal forms. However, for machine learning we need a fully denormalized form (all columns populated into a single ready-to-go row, no matter what their provenance, keying, or stride).

    This is not an essential difficulty as in relational data systems moving between these forms can be done by joining, and data stores such as PostgreSQL or Apache Spark are designed to provide powerful join capabilities.

    However there are some inessential (in that they can be avoided) but substantial difficulties in managing and documenting long join plans. It is not uncommon to have to join 7 or more tables to get an analysis ready. This at first seems trivial, but once you add in the important tasks of narrowing tables (eliminating columns not used later) and disambiguating column names (ensuring unique names after the join) the documentation and work can become substantial. Specifying the join process directly in R code leads to hard to manage, hard to inspect, and hard to share spaghetti code (even when using a high-level data abstraction such as dplyr).

    If you have done non-trivial work with production data you have seen this pain point.

    The fix is to apply the following principles:

    • Anything long, repetitive, and tedious should not be done directly.
    • Moving specification out of code and into data is of huge benefit.
    • A common special case can be treated separately, as that documents intent.

To supply such a solution the development version of replyr now supplies an item called a "join controller" under the method replyr::executeLeftJoinPlan().

    This is easiest to explain through a concrete example, which is what we will do here.

    First let’s load the needed packages.

# load packages
suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.7.0'

library("replyr")
packageVersion("replyr")
## [1] '0.4.1'

    Now let’s load some notional example data. For our example we have:

    • One primary table of measurements (called "meas1") keyed by id and date.
    • A fact table that maps ids to patient names (called "names", and keyed by id).
    • A second table of additional measurements (called "meas2") that we consider "nice to have." That is: rows missing from this table should not censor-out meas1 rows, and additional rows found here should not be included in the analysis.

    The data is given below:

# load notional example data
my_db <- dplyr::src_sqlite(":memory:", create = TRUE)

# example data
replyr_copy_to(my_db,
               data.frame(id= c(1,1,2,2),
                          date= c(1,2,1,2),
                          weight= c(200, 180, 98, 120),
                          height= c(60, 54, 12, 14)),
               'meas1_train')
replyr_copy_to(my_db,
               data.frame(id= seq_len(length(letters)),
                          name= letters,
                          stringsAsFactors=FALSE),
               'names_facts')
replyr_copy_to(my_db,
               data.frame(pid= c(2,3),
                          date= c(2,2),
                          weight= c(105, 110),
                          width= 1),
               'meas2_train')

    An important (and very neglected) step in data science tasks is documenting roles of tables, especially their key-structure (which we also call "stride" in the sense it describes how you move from row to row). replyr::tableDescription() is a function that builds an initial description of the tables. (Note: replyr::tableDescription() is misspelled in the current release version of replyr, we have fixed this in dev).

# map from abstract names to realized names
tables <- list('meas1' = 'meas1_train',
               'names' = 'names_facts',
               'meas2' = 'meas2_train')

# get the initial description of table defs
got <- lapply(names(tables),
              function(ni) {
                # get table reference from source by concrete name
                ti <- tbl(my_db, tables[[ni]])
                # map abstract name to reference
                tableDescription(ni, ti)
              })
tDesc <- bind_rows(got)

    tDesc is essentially a slightly enriched version of the data handle concordance described in "Managing Spark data handles in R." We can take a quick look at the stored simplified summaries:

print(tDesc %>% select(tableName, sourceClass, isEmpty))
## # A tibble: 3 x 3
##   tableName           sourceClass isEmpty
##
## 1     meas1 src_dbi, src_sql, src   FALSE
## 2     names src_dbi, src_sql, src   FALSE
## 3     meas2 src_dbi, src_sql, src   FALSE

print(tDesc$columns)
## [[1]]
## [1] "id"     "date"   "weight" "height"
##
## [[2]]
## [1] "id"   "name"
##
## [[3]]
## [1] "pid"    "date"   "weight" "width"

print(tDesc$colClass)
## [[1]]
##        id      date    weight    height
## "numeric" "numeric" "numeric" "numeric"
##
## [[2]]
##          id        name
##   "integer" "character"
##
## [[3]]
##       pid      date    weight     width
## "numeric" "numeric" "numeric" "numeric"

# add names for printing
names(tDesc$keys) <- tDesc$tableName
print(tDesc$keys)
## $meas1
##       id     date   weight   height
##     "id"   "date" "weight" "height"
##
## $names
##     id   name
##   "id" "name"
##
## $meas2
##      pid     date   weight    width
##    "pid"   "date" "weight"  "width"

The table returned by tableDescription() holds the following:

    • tableName: the abstract name we wish to use for this table.
    • handle: the actual data handle (either a data.frame or a handle to a remote data source such as PostgreSQL or Spark). Notice in the example it is of class "tbl_sqlite" or "tbl_dbi" (depending on the version of dplyr).
    • columns: the list of columns in the table.
    • keys: a named list mapping abstract key names to table column names. The set of keys together is supposed to uniquely identify rows.
    • colClasses: a vector of column classes of the underlying table.
    • sourceClass: the declared class of the data source.
    • isEmpty: an advisory column indicating if any rows were present when we looked.

    The tableName is "abstract" in that it is only used to discuss tables (i.e., it is only ever used as row identifier in this table). The data is actually found through the handle. This is critical in processes where we may need to run the same set of joins twice on different sets of tables (such as building a machine learning model, and then later applying the model to new data).

    The intent is to build a detailed join plan (describing order, column selection, and column re-naming) from the tDesc table. We can try this with the supplied function buildJoinPlan(), which in this case tells us our table descriptions are not quite ready to specify a join plan:

tryCatch(
  buildJoinPlan(tDesc),
  error = function(e) {e}
)
##

In the above, the keys column is wrong in that it claims every column of each table is a table key. The join plan builder noticed this is unsupportable: when it comes time to join the "names" table, not all of the columns that are claimed to be "names" keys are already known from previous tables. That is, the "names$name" column is not present in the earlier tables, and so can not be joined on. We can’t check everything, but the join controller tries to "front load" and surface as many configuration inconsistencies as possible early, before any expensive steps have been started.

The intent is: the user should edit the "tDesc" keys column and share it with partners for criticism. In our case we declare the primary key of the measurement tables to be PatientID and MeasurementDate, and the primary key of the names table to be PatientID. Notice we do this by specifying named lists or vectors mapping desired key names to the names actually used in the tables.

# declare keys (and give them consistent names)
tDesc$keys[[1]] <- c(PatientID= 'id', MeasurementDate= 'date')
tDesc$keys[[2]] <- c(PatientID= 'id')
tDesc$keys[[3]] <- c(PatientID= 'pid', MeasurementDate= 'date')

print(tDesc$keys)
## $meas1
##       PatientID MeasurementDate
##            "id"          "date"
##
## $names
## PatientID
##      "id"
##
## $meas2
##       PatientID MeasurementDate
##           "pid"          "date"

    The above key mapping could then be circulated to partners for comments and help. Notice since this is not R code we can easily share it with non-R users for comment and corrections.

It is worth confirming the keying is as expected (else some rows can reproduce in bad ways during joining). This is a potentially expensive operation, but it can be done as follows:

keysAreUnique(tDesc)
## meas1 names meas2
##  TRUE  TRUE  TRUE

    Once we are satisfied with our description of tables we can build a join plan. The join plan is an ordered sequence of left-joins.

In practice, when preparing data for predictive analytics or machine learning there is often a primary table that has exactly the set of rows you want to work over (especially when encountering production star-schemas). By starting joins from this table we can perform most of our transformations using only left-joins. To keep things simple we have only supplied a join controller for this case. This is obviously not the only join pattern needed; but it is the common one.

    A join plan can now be built from our table descriptions:

# build the column join plan
columnJoinPlan <- buildJoinPlan(tDesc)
print(columnJoinPlan %>% select(tableName, sourceColumn, resultColumn, isKey, want))
## # A tibble: 10 x 5
##    tableName sourceColumn    resultColumn isKey  want
##
##  1     meas1           id       PatientID  TRUE  TRUE
##  2     meas1         date MeasurementDate  TRUE  TRUE
##  3     meas1       weight    meas1_weight FALSE  TRUE
##  4     meas1       height          height FALSE  TRUE
##  5     names           id       PatientID  TRUE  TRUE
##  6     names         name            name FALSE  TRUE
##  7     meas2          pid       PatientID  TRUE  TRUE
##  8     meas2         date MeasurementDate  TRUE  TRUE
##  9     meas2       weight    meas2_weight FALSE  TRUE
## 10     meas2        width           width FALSE  TRUE

    Essentially the join plan is an unnest of the columns from the table descriptions. This was anticipated in our article "Managing Spark Data Handles".

    We then alter the join plan to meet our needs (either through R commands or by exporting the plan to a spreadsheet and editing it there).

    Only columns named in the join plan with a value of TRUE in the want column are kept in the join (columns marked isKey must also have want set to TRUE). This is very useful as systems of record often have very wide tables (with hundreds of columns) of which we only want a few columns for analysis.

    For example we could decide to exclude the width column by either dropping the row or setting the row’s want column to FALSE.

    Since we have edited the join plan it is a good idea to both look at it and also run it through the inspectDescrAndJoinPlan() to look for potential inconsistencies.

# decide we don't want the width column
columnJoinPlan$want[columnJoinPlan$resultColumn=='width'] <- FALSE

# double check our plan
if(!is.null(inspectDescrAndJoinPlan(tDesc, columnJoinPlan))) {
  stop("bad join plan")
}

print(columnJoinPlan %>% select(tableName, sourceColumn, resultColumn, isKey, want))
## # A tibble: 10 x 5
##    tableName sourceColumn    resultColumn isKey  want
##
##  1     meas1           id       PatientID  TRUE  TRUE
##  2     meas1         date MeasurementDate  TRUE  TRUE
##  3     meas1       weight    meas1_weight FALSE  TRUE
##  4     meas1       height          height FALSE  TRUE
##  5     names           id       PatientID  TRUE  TRUE
##  6     names         name            name FALSE  TRUE
##  7     meas2          pid       PatientID  TRUE  TRUE
##  8     meas2         date MeasurementDate  TRUE  TRUE
##  9     meas2       weight    meas2_weight FALSE  TRUE
## 10     meas2        width           width FALSE FALSE

    The join plan is the neglected (and often missing) piece of documentation key to non-trivial data science projects. We strongly suggest putting it under source control, and circulating it to project partners for comment.

    As a diagram the key structure of the join plan looks like the following (produced by DiagrammeR::mermaid(makeJoinDiagramSpec(columnJoinPlan))):

    Note the diagramming ability is currently only in the dev version of replyr. These diagrams are kind of fun. For instance, here is a more complicated one from the help(makeJoinDiagramSpec) examples:

    Once you have a good join plan executing it is a one-line command with executeLeftJoinPlan() (once you have set up a temp name manager as described in "Managing intermediate results when using R/sparklyr"):

# manage the temp names as in:
# http://www.win-vector.com/blog/2017/06/managing-intermediate-results-when-using-rsparklyr/
tempNameGenerator <- makeTempNameGenerator("extmps")

# execute the left joins
results <- executeLeftJoinPlan(tDesc, columnJoinPlan,
                               verbose= TRUE,
                               tempNameGenerator= tempNameGenerator)
## [1] "start meas1 Tue Jun 13 15:30:25 2017"
## [1] " rename/restrict meas1"
## [1] " 'table_meas1_present' = 'table_meas1_present'"
## [1] " 'PatientID' = 'id'"
## [1] " 'MeasurementDate' = 'date'"
## [1] " 'meas1_weight' = 'weight'"
## [1] " 'height' = 'height'"
## [1] " res <- meas1"
## [1] "done meas1 Tue Jun 13 15:30:25 2017"
## [1] "start names Tue Jun 13 15:30:25 2017"
## [1] " rename/restrict names"
## [1] " 'table_names_present' = 'table_names_present'"
## [1] " 'PatientID' = 'id'"
## [1] " 'name' = 'name'"
## [1] " res <- left_join(res, names, by = c( 'PatientID' ))"
## [1] "done names Tue Jun 13 15:30:25 2017"
## [1] "start meas2 Tue Jun 13 15:30:25 2017"
## [1] " rename/restrict meas2"
## [1] " 'table_meas2_present' = 'table_meas2_present'"
## [1] " 'PatientID' = 'pid'"
## [1] " 'MeasurementDate' = 'date'"
## [1] " 'meas2_weight' = 'weight'"
## [1] " res <- left_join(res, meas2, by = c( 'PatientID', 'MeasurementDate' ))"
## [1] "done meas2 Tue Jun 13 15:30:26 2017"

    executeLeftJoinPlan() takes both a table description table (tDesc, keyed by tableName) and the join plan (columnJoinPlan, keyed by tableName and sourceColumn).

    The separation of concerns is strong: all details about the intended left-join sequence are taken from the columnJoinPlan, and only the mapping from abstract table names to tables (or table references/handles) is taken from tDesc. This is deliberate design and makes running the same join plan on two different sets of tables (say once for model construction, and later for model application) very easy. tDesc is a runtime entity (as it binds names to live handles, so can’t be serialized: you must save the code steps to produce it; note only the columns tableName and handle are used so there is no point re-editing the keys column after running tableDescription() on new tables) and columnJoinPlan is a durable entity (has only information, not handles).

    Basically you:

    • Build simple procedures to build up tDesc.
    • Work hard to get a good columnJoinPlan.
    • Save columnJoinPlan in source control and re-load it (not re-build it) when you need it.
    • Re-build new tDesc compatible with the saved columnJoinPlan later when you need to work with tables (note only the columns tableName and handle are used during join execution, so you only need to create those).

    As always: the proof is in the pudding. We should look at results:

print(results %>%
        select(PatientID, MeasurementDate, meas1_weight, height, name,
               table_meas2_present, meas2_weight),
      width= Inf)
## # Source:   lazy query [?? x 7]
## # Database: sqlite 3.11.1 [:memory:]
##   PatientID MeasurementDate meas1_weight height  name table_meas2_present meas2_weight
##
## 1         1               1          200     60     a                   0           NA
## 2         1               2          180     54     a                   0           NA
## 3         2               1           98     12     b                   0           NA
## 4         2               2          120     14     b                   1          105

    Notice the joiner added extra columns of the form table_*_present to show which tables had needed rows. This lets us tell different sorts of missingness apart (value NA as there was no row to join, versus value NA as a NA came from a row) and appropriately coalesce results easily. These columns are also very good for collecting statistics on data coverage, and in business settings often are very useful data quality and data provenance features which can often be directly included in machine learning models.
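For instance (a hedged sketch, not from the original post or the replyr documentation), the presence column makes it easy to prefer the meas2 measurement only when a meas2 row was actually found, falling back to meas1 otherwise:

# prefer the meas2 weight when that table actually contributed a row
results %>%
  mutate(weight = ifelse(table_meas2_present == 1 & !is.na(meas2_weight),
                         meas2_weight,
                         meas1_weight)) %>%
  select(PatientID, MeasurementDate, weight)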

    Also notice the join plan is very specific: every decision (such as what order to operate and how to disambiguate column names) is already explicitly set in the plan. The executor is then allowed to simply move through the tables left-joining in the order the table names first appear in the plan.

    Having to "join a bunch of tables to get the data into simple rows" is a common step in data science. Therefore you do not want this to be a difficult and undocumented task. By using a join controller you essentially make the documentation the executable specification for the task.

# cleanup
temps <- tempNameGenerator(dumpList= TRUE)
for(ti in temps) {
  replyr_drop_table_name(my_db, ti)
}
rm(list=ls())
gc(verbose= FALSE)

    To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    Syberia: A development framework for R code in production

    Tue, 06/13/2017 - 22:57

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    Putting R code into production generally involves orchestrating the execution of a series of R scripts. Even if much of the application logic is encoded into R packages, a run-time environment typically involves scripts to ingest and prepare data, run the application logic, validate the results, and operationalize the output. Managing those scripts, especially in the face of working with multiple R versions, can be a pain — and worse, very complex scripts are difficult to understand and reuse for future applications.

That's where Syberia comes in: an open-source framework created by Robert Krzyzanowski and other engineers at the consumer lending company Avant. There, Syberia has been used by more than 30 developers to build a production data modeling system. In fact, building production R systems was the motivating tenet of Syberia:

    Developing classifiers using the Syberia modeling engine follows one primary tenet: development equals production. When a web developer experiments with a new website layout or feature, they are targeting a production system. After they are satisfied with their work, they push a button and deploy the code base so it is live and others can interact with it.

    Feature engineering and statistical modeling … should belong to the same class of work. When an architect designs a skyscraper their work has to be translated to reality through a construction team by replaying the design using a physical medium. This is the current industry standard for machine learning: prototype the model in one language, typically a Python or R "notebook," and then write it in a "solid production-ready" language so it can survive the harsh winds of the real world.

    This is wrong.

    In much the same way that ggplot2 is a completely different way of thinking about R graphics, Syberia is a completely different way of thinking about R scripts. It's also similarly difficult to get your head around at first, but once you do, it reveals itself as an elegant and customizable way of managing complex and interconnected R code — and a much better solution than simply source-ing an 800-line R script.

At its core, Syberia defines a set of conventions for defining the steps in a data-analysis workflow, and specifying them with a collection (in real-world projects, a large collection) of small R scripts in a standardized folder structure. Collectively, these define the complete data analysis process, which you can execute with a simple R command: run. To make modifying and maintaining this codebase (which you'd typically manage in a source-code control system) easier, Syberia is designed to isolate dependencies between files. For example, rather than specifying a file name and format (say, "data.csv") in a script that reads data, you'd instead define "adapters" to read and write data in the adapters/adapters.R script:

# ./adapters/csv.R
read <- function(key) {
  read.csv(file.path("/some/path", paste0(key, ".csv")))
}

write <- function(value, key) {
  write.csv(value, file.path("/some/path", paste0(key, ".csv")))
}

    Syberia will then use those "read" and "write" adapters to connect with your data. That way, when you later decide to source the data from a database, you can just write new adapters rather than trying to find the lines dealing with data I/O in a massive script. (This also helps avoid conflicts when working in a large code-base with multiple developers.) Syberia defines similar "patterns" for data preparation (including feature generation), statistical modeling, and testing; the "run" function conceptually synthesizes the entire codebase into a single R script to run.

    Syberia also encourages you to break up your process into a series of distinct steps, each of which can be run (and tested) independently. It also has a make-like feature, in that results from intermediate steps are cached, and do not need to be re-run each time unless their dependencies have been modified.

Syberia can also be used to associate specific R versions with scripts, or even other R engines like Microsoft R. I was extremely impressed when, during a 30-minute break at the R/Finance conference last month, Robert was able to sketch out a Syberia implementation of a modeling process using the RevoScaleR library. In fact Robert's talk from the conference, embedded below, provides a nice introduction to Syberia.

    Syberia is available now under the open-source MIT license. (Note: git is required to install Syberia.) To get started with Syberia check out the documentation, which is available at the Syberia website linked below.

    Syberia: The development framework for R


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    Keeping Users Safe While Collecting Data

    Tue, 06/13/2017 - 22:48

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

    I caught a mention of this project by Pete Warden on Four Short Links today. If his name sounds familiar, he’s the creator of the DSTK, an O’Reilly author, and now works at Google. A decidedly clever and decent chap.

    The project goal is noble: crowdsource and make a repository of open speech data for researchers to make a better world. Said sourcing is done by asking folks to record themselves saying “Yes”, “No” and other short words.

    As I meandered over the blog post I looked in horror on the URL for the application that did the recording: https://open-speech-commands.appspot.com/.

    Why would the goal of the project combined with that URL give pause? Read on!

    You’ve Got Scams!

    Picking up the phone and saying something as simple as ‘Yes’ has been a major scam this year. By recording your voice, attackers can replay it on phone prompts and because it’s your voice it makes it harder to refute the evidence and can foil recognition systems that look for your actual voice.

    As the chart above shows, the Better Business Bureau has logged over 5,000 of these scams this year (searching for ‘phishing’ and ‘yes’). You can play with the data (a bit — the package needs work) in R with scamtracker.

    Now, these are “analog” attacks (i.e. a human spends time socially engineering a human). Bookmark this as you peruse section 2.

    Integrity Challenges in 2017

    I “trust” Pete’s intentions, but I sure don’t trust open-speech-commands.appspot.com (and, you shouldn’t either). Why? Go visit https://totally-harmless-app.appspot.com. It’s a Google App Engine app I made for this post. Anyone can make an appspot app and the https is meaningless as far as integrity & authenticity goes since I’m running on google’s infrastructure but I’m not google.

    You can’t really trust most SSL/TLS sessions as far as site integrity goes anyway. Let’s Encrypt put the final nail in the coffin with their Certs Gone Wild! initiative. With super-recent browser updates you can almost trust your eyes again when it comes to URLs, but you should be very wary of entering your info — especially uploading voice, prints or eye/face images — into any input box on any site if you aren’t 100% sure it’s a legit site that you trust.

    Tracking the Trackers

    If you don’t know that you’re being tracked 100% of the time on the internet then you really need to read up on the modern internet.

    In many cases your IP address can directly identify you. In most cases your device & browser profile — which most commercial sites log — can directly identify you. So, just visiting a web site means that it’s highly likely that web site can know that you are both not a dog and are in fact you.

    Still Waiting for the “So, What?”

    Many states and municipalities have engaged in awareness campaigns to warn citizens about the “Say ‘Yes’” scam. Asking someone to record themselves saying ‘Yes’ into a random web site pretty much negates that advice.

    Folks like me regularly warn about trust on the internet. I could have cloned the functionality of the original site to open-speech-commmands.appspot.com. Did you even catch the 3rd ‘m’ there? Even without that, it’s an appspot.com domain. Anyone can set one up.

    Even if the site doesn’t ask for your name or other info and just asks for your ‘Yes’, it can know who you are. In fact, when you’re enabling the microphone to do the recording, it could even take a picture of you if it wanted to (and you’d likely not know or not object since it’s for SCIENCE!).

    So, in the worst case scenario a malicious entity could be asking you for your ‘Yes’, tying it right to you and then executing the post-scam attacks that were being performed in the analog version.

    But, go so far as to assume this is a legit site with good intentions. Do you really know what’s being logged when you commit your voice info? If the data was mishandled, it would be just as easy to tie the voice files back to you (assuming a certain level of data logging).

    The “so what” is not really a warning to users but a message to researchers: You need to threat model your experiments and research initiatives, especially when innocent end users are potentially being put at risk. Data is the new gold, diamonds and other precious bits that attackers are after. You may think you’re not putting folks at risk and aren’t even a hacker target, but how you design data gathering can reinforce good or bad behaviour on the part of users. It can solidify solid security messages or tear them down. And, you and your data may be more of a target than you really know.

    Reach out to interdisciplinary colleagues to help threat model your data collection, storage and dissemination methods to ensure you aren’t putting yourself or others at risk.

    FIN

    Pete did the right thing:

    and, I’m sure the site will be on a “proper” domain soon. When it is, I’ll be one of the first in line to help make a much-needed open data set for research purposes.


    To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    dplyr 0.7.0

    Tue, 06/13/2017 - 18:18

    (This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

    I’m pleased to announce that dplyr 0.7.0 is now on CRAN! (This was dplyr 0.6.0 previously; more on that below.) dplyr provides a “grammar” of data transformation, making it easy and elegant to solve the most common data manipulation challenges. dplyr supports multiple backends: as well as in-memory data frames, you can also use it with remote SQL databases. If you haven’t heard of dplyr before, the best place to start is the Data transformation chapter in R for Data Science.

    You can install the latest version of dplyr with:

    install.packages("dplyr") Features

    dplyr 0.7.0 is a major release including over 100 improvements and bug fixes, as described in the release notes. In this blog post, I want to discuss one big change and a handful of smaller updates. This version of dplyr also saw a major revamp of database connections. That’s a big topic, so it’ll get its own blog post next week.

    Tidy evaluation

The biggest change is a new system for programming with dplyr, called tidy evaluation, or tidy eval for short. Tidy eval is a system for capturing expressions and later evaluating them in the correct context. It is important because it allows you to interpolate values in contexts where dplyr usually works with expressions:

my_var <- quo(homeworld)

starwars %>%
  group_by(!!my_var) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)
#> # A tibble: 49 x 3
#>         homeworld   height  mass
#>
#>  1       Alderaan 176.3333  64.0
#>  2    Aleen Minor  79.0000  15.0
#>  3         Bespin 175.0000  79.0
#>  4     Bestine IV 180.0000 110.0
#>  5 Cato Neimoidia 191.0000  90.0
#>  6          Cerea 198.0000  82.0
#>  7       Champala 196.0000   NaN
#>  8      Chandrila 150.0000   NaN
#>  9   Concord Dawn 183.0000  79.0
#> 10       Corellia 175.0000  78.5
#> # ... with 39 more rows

    This makes it possible to write your functions that work like dplyr functions, reducing the amount of copy-and-paste in your code:

starwars_mean <- function(my_var) {
  my_var <- enquo(my_var)

  starwars %>%
    group_by(!!my_var) %>%
    summarise_at(vars(height:mass), mean, na.rm = TRUE)
}

starwars_mean(homeworld)

    You can also use the new .data pronoun to refer to variables with strings:

    my_var <- "homeworld" starwars %>% group_by(.data[[my_var]]) %>% summarise_at(vars(height:mass), mean, na.rm = TRUE)

    This is useful when you’re writing packages that use dplyr code because it avoids an annoying note from R CMD check.

To learn more about how tidy eval helps solve data analysis challenges, please read the new programming with dplyr vignette. Tidy evaluation is implemented in the rlang package, which also provides a vignette on the theoretical underpinnings. Tidy eval is a rich system and takes a while to get your head around, but we are confident that learning tidy eval will pay off, especially as it rolls out to other packages in the tidyverse (tidyr and ggplot2 are next on the todo list).

    The introduction of tidy evaluation means that the standard evaluation (underscored) version of each main verb (filter_(), select_() etc) is no longer needed, and so these functions have been deprecated (but remain around for backward compatibility).
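As a rough illustration (the droid filter below is made up, not taken from the release notes), here is the same operation written with a deprecated underscored verb and with tidy eval:

library(dplyr)

# Deprecated standard-evaluation verb; still works for backward compatibility.
filter_(starwars, ~ species == "Droid")

# Tidy eval replacement for interactive use.
filter(starwars, species == "Droid")

# Tidy eval replacement when programming with a captured expression.
my_filter <- quo(species == "Droid")
filter(starwars, !!my_filter)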

    Character encoding

    We have done a lot of work to ensure that dplyr works with encodings other than Latin1 on Windows. This is most likely to affect you if you work with data that contains Chinese, Japanese, or Korean (CJK) characters. dplyr should now just work with such data. Please let us know if you have problems!

    New datasets

dplyr has some new datasets that help us write more interesting examples:

• starwars, shown above, contains information about characters from the Star Wars movies, sourced from the Star Wars API. It contains a number of list-columns.

  starwars
  #> # A tibble: 87 x 13
  #>                  name height  mass    hair_color  skin_color eye_color
  #>  1     Luke Skywalker    172    77         blond        fair      blue
  #>  2              C-3PO    167    75                      gold    yellow
  #>  3              R2-D2     96    32               white, blue       red
  #>  4        Darth Vader    202   136          none       white    yellow
  #>  5        Leia Organa    150    49         brown       light     brown
  #>  6          Owen Lars    178   120   brown, grey       light      blue
  #>  7 Beru Whitesun lars    165    75         brown       light      blue
  #>  8              R5-D4     97    32                white, red       red
  #>  9  Biggs Darklighter    183    84         black       light     brown
  #> 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
  #> # ... with 77 more rows, and 7 more variables: birth_year,
  #> #   gender, homeworld, species, films, vehicles, starships
• storms has the trajectories of ~200 tropical storms. It contains a strong grouping structure.

  storms
  #> # A tibble: 10,010 x 13
  #>     name  year month   day  hour   lat  long              status category
  #>  1   Amy  1975     6    27     0  27.5 -79.0 tropical depression       -1
  #>  2   Amy  1975     6    27     6  28.5 -79.0 tropical depression       -1
  #>  3   Amy  1975     6    27    12  29.5 -79.0 tropical depression       -1
  #>  4   Amy  1975     6    27    18  30.5 -79.0 tropical depression       -1
  #>  5   Amy  1975     6    28     0  31.5 -78.8 tropical depression       -1
  #>  6   Amy  1975     6    28     6  32.4 -78.7 tropical depression       -1
  #>  7   Amy  1975     6    28    12  33.3 -78.0 tropical depression       -1
  #>  8   Amy  1975     6    28    18  34.0 -77.0 tropical depression       -1
  #>  9   Amy  1975     6    29     0  34.4 -75.8      tropical storm        0
  #> 10   Amy  1975     6    29     6  34.0 -74.8      tropical storm        0
  #> # ... with 10,000 more rows, and 4 more variables: wind, pressure,
  #> #   ts_diameter, hu_diameter
• band_members, band_instruments and band_instruments2 have a tiny amount of data about bands. They're designed to be very simple so you can illustrate how joins work without getting distracted by the details of the data.

  band_members
  #> # A tibble: 3 x 2
  #>    name    band
  #> 1  Mick  Stones
  #> 2  John Beatles
  #> 3  Paul Beatles

  band_instruments
  #> # A tibble: 3 x 2
  #>    name  plays
  #> 1  John guitar
  #> 2  Paul   bass
  #> 3 Keith guitar
    New and improved verbs
• The pull() generic allows you to extract a single column either by name or position. It's similar to select() but returns a vector, rather than a smaller tibble.

  mtcars %>% pull(-1) %>% str()
  #> num [1:32] 4 4 1 1 2 1 4 2 2 4 ...

  mtcars %>% pull(cyl) %>% str()
  #> num [1:32] 6 6 4 6 8 6 8 4 4 6 ...

      Thanks to Paul Poncet for the idea!

• arrange() for grouped data frames gains a .by_group argument so you can choose to sort by groups if you want to (defaults to FALSE); there's a small sketch after this list.
• All single table verbs now have scoped variants suffixed with _if(), _at() and _all(). Use these if you want to do something to every variable (_all), variables selected by their names (_at), or variables that satisfy some predicate (_if).

  iris %>% summarise_if(is.numeric, mean)
  starwars %>% select_if(Negate(is.list))
  storms %>% group_by_at(vars(month:hour))
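Here is a rough sketch of the new .by_group argument (the choice of grouping and sort columns is just illustrative):

library(dplyr)

# Without .by_group, arrange() sorts across the whole table; with
# .by_group = TRUE, rows are sorted by storm name first, then by wind
# speed within each storm.
storms %>%
  group_by(name) %>%
  arrange(desc(wind), .by_group = TRUE)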
    Other important changes
    • Local join functions can now control how missing values are matched. The default value is na_matches = "na", which treats two missing values as equal. To prevent missing values from matching, use na_matches = "never".

You can change the default behaviour by calling pkgconfig::set_config("dplyr::na_matches", "never"). A short sketch of both behaviours follows this list.

    • bind_rows() and combine() are more strict when coercing. Logical values are no longer coerced to integer and numeric. Date, POSIXct and other integer or double-based classes are no longer coerced to integer or double to avoid dropping important metadata. We plan to continue improving this interface in the future.
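To make the na_matches behaviour concrete, here is a minimal sketch using two made-up data frames (df1 and df2 are not dplyr datasets):

library(dplyr)

# Two tiny data frames with an NA in the join key, made up for illustration.
df1 <- data.frame(name = c("John", NA), band = c("Beatles", "Unknown"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("John", NA), plays = c("guitar", "drums"),
                  stringsAsFactors = FALSE)

# Default: the two NAs match, so the second row gets plays = "drums".
left_join(df1, df2, by = "name")

# NA never matches, so the second row gets plays = NA.
left_join(df1, df2, by = "name", na_matches = "never")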
    Breaking changes

From time-to-time I discover that I made a mistake in an older version of dplyr and developed what is now a clearly suboptimal API. If the problem isn't too big, I try to just leave it – the cost of making small improvements is not worth it when compared to the cost of breaking existing code. However, there are bigger improvements where I believe the short-term pain of breaking code is worth the long-term payoff of a better API.

    Regardless, it’s still frustrating when an update to dplyr breaks your code. To minimise this pain, I plan to do two things going forward:

    • Adopt an odd-even release cycle so that API breaking changes only occur in odd numbered releases. Even numbered releases will only contain bug fixes and new features. This is why I’ve skipped dplyr 0.6.0 and gone directly to dplyr 0.7.0.
• Invest time in developing better tools for isolating packages across projects so that you can choose when to upgrade a package on a project-by-project basis, and if something goes wrong, easily roll back to a version that worked. Look for news about this later in the year.
    Contributors

    dplyr is truly a community effort. Apart from the dplyr team (myself, Kirill Müller, and Lionel Henry), this release wouldn’t have been possible without patches from Christophe Dervieux, Dean Attali, Ian Cook, Ian Lyttle, Jake Russ, Jay Hesselberth, Jennifer (Jenny) Bryan, @lindbrook, Mauro Lepore, Nicolas Coutin, Daniel, Tony Fischetti, Hiroaki Yutani and Sergio Oller. Thank you all for your contributions!


    To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    Using purrr with APIs – revamping my code

    Tue, 06/13/2017 - 12:20

I wrote a little while back about using Microsoft Cognitive Services APIs with R to first of all detect the language of pieces of text and then do sentiment analysis on them. I wasn't too happy with some of the code, as it was very inelegant. I knew I could code better than I had, especially as I've been doing a lot more work with purrr recently. However, it had sat in drafts for a while. Then David Smith kindly posted about the process I used, which meant I had to get this nicer version of my code out ASAP!

    Get the complete code in this gist.

Prerequisites

Setup

library(httr)
library(jsonlite)
library(dplyr)
library(purrr)

cogapikey <- "XXX"

text = c("is this english?"
         ,"tak er der mere kage"
         ,"merci beaucoup"
         ,"guten morgen"
         ,"bonjour"
         ,"merde"
         ,"That's terrible"
         ,"R is awesome")

# Put data in an object that converts to the expected schema for the API
data_frame(text) %>%
  mutate(id = row_number()) -> textdf

textdf %>%
  list(documents = .) -> mydata

Language detection

We need to identify the most likely language for each bit of text so that we can pass this additional piece of information to the sentiment analysis API and get decent results back.

    cogapi<-"https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/languages?numberOfLanguagesToDetect=1" cogapi %>% POST(add_headers(`Ocp-Apim-Subscription-Key`=cogapikey), body=toJSON(mydata)) -> response # Process response response %>% content() %>% flatten_df() %>% select(detectedLanguages) %>% flatten_df()-> respframe textdf %>% mutate(language= respframe$iso6391Name) -> textdf Sentiment analysis

    With an ID, text, and a language code, we can now request the sentiment of our text be analysed.

# New info
mydata <- list(documents = textdf)

# New endpoint
cogapi <- "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment"

# Construct a request
cogapi %>%
  POST(add_headers(`Ocp-Apim-Subscription-Key` = cogapikey),
       body = toJSON(mydata)) -> response

# Process response
response %>%
  content() %>%
  flatten_df() %>%
  mutate(id = as.numeric(id)) -> respframe

# Combine
textdf %>%
  left_join(respframe) -> textdf

And… et voilà! A multi-language dataset with the language identified and the sentiment scored, using purrr for easier-to-read code.

    Using purrr with APIs makes code nicer and more elegant as it really helps interact with hierarchies from JSON objects. I feel much better about this code now!
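As a toy illustration (resp below is a made-up, simplified stand-in for a parsed API response, not the real schema), purrr makes it easy to reach into nested list structures:

library(purrr)

# Made-up nested structure shaped loosely like a parsed JSON response.
resp <- list(documents = list(
  list(id = "1", detectedLanguages = list(list(iso6391Name = "en", score = 1))),
  list(id = "2", detectedLanguages = list(list(iso6391Name = "da", score = 1)))
))

# Pull the language code out of each nested element.
map_chr(resp$documents, ~ .x$detectedLanguages[[1]]$iso6391Name)
#> [1] "en" "da"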

Original

  id language text                 score
  1  en       is this english?     0.2852910
  2  da       tak er der mere kage NA
  3  fr       merci beaucoup       0.8121097
  4  de       guten morgen         NA
  5  fr       bonjour              0.8118965
  6  fr       merde                0.0515683
  7  en       That’s terrible      0.1738841
  8  en       R is awesome         0.9546152

Revised

  text                 id language score
  is this english?     1  en       0.2265771
  tak er der mere kage 2  da       0.7455934
  merci beaucoup       3  fr       0.8121097
  guten morgen         4  de       0.8581840
  bonjour              5  fr       0.8118965
  merde                6  fr       0.0515683
  That’s terrible      7  en       0.0068665
  R is awesome         8  en       0.9973871

    Interestingly the scores for English have not stayed the same – for instance, Microsoft now sees “R is awesome” in a much more positive light. It’s also great to see German and Danish are now supported!

    Get the complete code in this gist.

    The post Using purrr with APIs – revamping my code appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.

