
### Financial time series forecasting – an easy approach

Wed, 03/22/2017 - 00:51

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Financial time series analysis and forecasting have a history of remarkable contributions. It is therefore quite hard for the beginner to get oriented and to capitalize on the scientific literature, as doing so requires a solid understanding of basic statistics, a detailed study of the foundations of time series analysis, and knowledge of the specific statistical models used for financial products. Further, the financial time series ecosystem is not one of the easiest to deal with: trends are typically transitory, as they are driven by underlying random processes, and non-stationarity, heteroscedasticity, structural breaks and outliers are rather common. All this has driven the adoption of sophisticated models and simulation techniques that require good understanding and expertise to take advantage of.

On the other hand, you may want to get a basic understanding of stock price time series forecasting by taking advantage of a simple model that provides sufficient reliability. For that purpose, the Black-Scholes-Merton model, which is based on the lognormal distribution hypothesis and is widely used in financial analysis, can be helpful. The rest of this short dissertation shows how to take advantage of it.

Model

As said, I am going to introduce the Black-Scholes-Merton model, which assumes that percentage changes in the stock price over a short period of time are normally distributed.

The return of the stock price S(t) at time t can be expressed under these hypotheses as:
$$\begin{aligned} \frac{S(t)-S(t_{0})}{S(t_{0})}\ \sim\ N(u\ \Delta T,\ \sigma^2\Delta T) \ \ \ \ \ (1) \end{aligned}$$
where the left-hand term is the (discrete) return on the stock price S at time t. By formulating the same equation in terms of first-order differentials for price S (dS) and time t (dt), equation (1) can be expressed in terms of continuously compounded returns, obtaining:
$$\begin{cases} t = t_{0} + \Delta T \\ ln(\frac{S(t)}{S(t_{0})})\ \sim \ N((u\ -\ \frac{\sigma^2}{2})\ \Delta T,\ \sigma^2\Delta T) \ \ \ \ \ (2) \\ \end{cases}$$
Equation (2) states that the log returns follow a normal distribution, where u is the stock price drift and σ is the stock price standard deviation. From equation (2), the following relationship binding the stock price S(t) to the stock price S(t0) can be stated:
$$\begin{aligned} S(t)\ =\ S(t_{0})\ e^{(u\ -\ \frac{\sigma^2}{2})\ \Delta T\ +\ \sigma B(t)} \ \ \ \ \ (3) \end{aligned}$$
The B(t) term represents the Brownian motion.
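Equation (3) can be simulated directly. The sketch below draws one geometric Brownian motion path in base R; the drift, volatility and starting price are illustrative values, not estimates from any real series.

```r
# Minimal sketch of equation (3): one simulated GBM path (illustrative parameters).
set.seed(123)
u     <- 0.001   # hypothetical daily drift
sigma <- 0.02    # hypothetical daily volatility
S0    <- 100     # hypothetical starting price
t     <- 1:250   # one year of daily steps

# B(t): standard Brownian motion approximated by cumulated N(0,1) increments
B <- cumsum(rnorm(length(t)))

# geometric Brownian motion path from equation (3)
S <- S0 * exp((u - sigma^2 / 2) * t + sigma * B)
head(round(S, 2))
```

Plotting `S` against `t` shows the characteristic noisy exponential trajectory the model assumes for share prices.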

Furthermore, equation (2) allows us to determine the distribution of the stock price, as stated by the following equation:
$$\begin{aligned} \ln(S(t_{0}+\Delta T))\ \sim \ N(ln(S(t_{0})) + (u\ -\ \frac{\sigma^2}{2})\Delta T,\ \sigma^2\Delta T)\ \ \ \ \ (4) \end{aligned}$$
The drift u and the standard deviation σ can be estimated from the stock price time series history.
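That estimation step can be sketched in base R with a made-up price vector (the actual implementation below uses the timeSeries and quantmod packages on real data):

```r
# Hypothetical price history, for illustration only
S <- c(100, 101.2, 100.8, 102.5, 103.1, 102.0)

# discrete daily returns: (S(t) - S(t-1)) / S(t-1)
r <- diff(S) / head(S, -1)

# drift and volatility estimates on the daily time scale
u_hat     <- mean(r)
sigma_hat <- sd(r)
c(u_hat, sigma_hat)
```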

The time interval ΔT represents our future horizon. Please note that both the drift and the standard deviation must refer to the same time scale: if ΔT is measured in days, we have to use daily returns and the daily standard deviation; if ΔT is measured in weeks, weekly returns and the weekly standard deviation; and so on.
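Under the model's i.i.d. log-return assumption, converting between time scales is simple: the drift scales linearly with the horizon, while the standard deviation scales with its square root. A minimal sketch, assuming 5 trading days per week and illustrative daily values:

```r
# Time-scale conversion sketch (illustrative values, 5 trading days per week)
daily_drift <- 0.0005   # hypothetical daily mean log return
daily_sd    <- 0.019    # hypothetical daily standard deviation

weekly_drift <- daily_drift * 5         # drift scales linearly with time
weekly_sd    <- daily_sd * sqrt(5)      # volatility scales with sqrt(time)
c(weekly_drift, weekly_sd)
```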

Taking the exponential of both sides of equation (4), we obtain:
$$\begin{aligned} S(t_{0} + \Delta T)\ &\sim \ exp(\ N(ln(S(t_{0})) + (u\ -\ \frac{\sigma^2}{2})\ \Delta T,\ \sigma^2 \Delta T)) \\ &= exp(\ N(\ \hat u(\Delta T),\ \hat\sigma^2(\Delta T))) \ \ \ \ \ (5) \end{aligned}$$
The above equation provides a family of normal distributions with known parameters, dependent on the time interval ΔT = [0, T]. Lower and upper bounds at each instant of time t in ΔT can be modeled as a 95% confidence interval, determined by the 1.96 multiple of the standard deviation at time t. As a result, the expected value and the lower and upper bounds of the stock price S(t) are determined as follows:
$$\begin{cases} E(S(t))\ =\ exp(ln(S(t_{0})) + (u\ -\ \frac{\sigma^2}{2})\ \Delta T) \ \ \ \ \ (6) \\ LB(S(t))\ = exp(ln(S(t_{0})) + (u\ -\ \frac{\sigma^2}{2})\ \Delta T\ -\ 1.96\ \sigma \sqrt{\Delta T}) \\ UB(S(t))\ = exp(ln(S(t_{0})) + (u\ -\ \frac{\sigma^2}{2})\ \Delta T\ +\ 1.96\ \sigma \sqrt{\Delta T}) \\ \end{cases}$$
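Equation (6) translates almost line by line into a small helper function. The sketch below is a base-R version with a hypothetical function name (`forecast_band`) and illustrative inputs:

```r
# Minimal helper implementing equation (6); name and inputs are illustrative.
forecast_band <- function(S0, u, sigma, dT, level = 1.96) {
  m <- log(S0) + (u - sigma^2 / 2) * dT     # mean of log price at horizon dT
  list(expected = exp(m),                    # E(S(t))
       lower    = exp(m - level * sigma * sqrt(dT)),  # LB(S(t))
       upper    = exp(m + level * sigma * sqrt(dT)))  # UB(S(t))
}

forecast_band(S0 = 100, u = 0.001, sigma = 0.02, dT = 10)
```

With zero drift, the expected value is slightly below the current price because of the -σ²/2 correction term.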

Implementation

I will take advantage of the timeSeries package to compute returns (see the returns() function) and the quantmod package to obtain financial stock prices (see the getSymbols() function). Specifically, I will download the Apple share price history.

```r
suppressPackageStartupMessages(library(timeSeries))
suppressPackageStartupMessages(library(quantmod))

# downloading the stock price history
getSymbols("AAPL")
head(AAPL)
           AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2007-01-03     86.29     86.58    81.90      83.80   309579900      10.85709
2007-01-04     84.05     85.95    83.82      85.66   211815100      11.09807
2007-01-05     85.77     86.20    84.40      85.05   208685400      11.01904
2007-01-08     85.96     86.53    85.28      85.47   199276700      11.07345
2007-01-09     86.45     92.98    85.15      92.57   837324600      11.99333
2007-01-10     94.75     97.80    93.45      97.00   738220000      12.56728

# using the adjusted close price
Y <- coredata(AAPL[, "AAPL.Adjusted"])

# history time span
hist_start <- 1
hist_end <- 100
hist <- c(hist_start:hist_end)

# historical prices
Y.hist <- Y[hist]

# historical returns = (Y(t1) - Y(t0)) / Y(t0)
Y.hist.ret <- returns(Y.hist)

# standard deviation computed from the history
(sv_hist <- sd(Y.hist.ret, na.rm = TRUE))
[1] 0.01886924
```

The value shown above is the estimate of the standard deviation of our share price daily returns, as determined from the historical observations.

It is good practice to compute confidence intervals for the estimated parameters, in order to understand whether we have sufficient precision given the sample size.

```r
# 95% confidence interval for the standard deviation
n <- length(hist)
sv_hist_low <- sqrt((n - 1) * sv_hist^2 / qchisq(.975, df = n - 1))
sv_hist_up  <- sqrt((n - 1) * sv_hist^2 / qchisq(.025, df = n - 1))
(sv_hist_confint_95 <- c(sv_hist_low, sv_hist, sv_hist_up))
[1] 0.01656732 0.01886924 0.02191993
```

I am now going to show a plot outlining the future share price evolution.

```r
# relative future time horizon
t <- 1:20

# martingale hypothesis: the expected future price equals the current price (zero drift)
u <- 0

# future expected value based on the normal distribution hypothesis
fc <- log(Y.hist[hist_end]) + (u - 0.5 * sv_hist^2) * t

# lower bound of the 95% confidence interval
fc.lb <- fc - 1.96 * sv_hist * sqrt(t)

# upper bound of the 95% confidence interval
fc.ub <- fc + 1.96 * sv_hist * sqrt(t)

# collecting lower, expected and upper values
fc_band <- list(lb = exp(fc.lb), m = exp(fc), ub = exp(fc.ub))

# absolute future time horizon
xt <- c(hist_end + t)

# stock price history line plot
plot(Y[hist_start:(hist_end + max(t))], type = 'l',
     xlim = c(0, hist_end + max(t)), ylim = c(5, 20),
     xlab = "Time Index", ylab = "Share Price", panel.first = grid())

# starting point of our forecast
suppressWarnings(points(x = hist_end, y = Y.hist[hist_start + hist_end - 1],
                        pch = 21, bg = "green"))

# lower bound stock price forecast
lines(x = xt, y = fc_band$lb, lty = 'dotted', col = 'violet', lwd = 2)

# expected stock price forecast
lines(x = xt, y = fc_band$m, lty = 'dotted', col = 'blue', lwd = 2)

# upper bound stock price forecast
lines(x = xt, y = fc_band$ub, lty = 'dotted', col = 'red', lwd = 2)
```

This gives the following plot:

The plot shows the lower (violet) and upper (red) bounds together with the actual future price evolution and the forecast expected value (blue). Since I did not account for a drift (u = 0), the future expected value is a flat line (actually, its slope is slightly negative, as determined by the -0.5*σ² term). If you would like the future stock price drift to be more consistent with its recent history, you may compute a mean return on the same time scale as the standard deviation.
The lines of code to add are the following:

```r
# added lines of code
(mu_hist <- mean(Y.hist.ret, na.rm = TRUE))
n <- length(hist)

# 95% confidence interval for the mean
mu_hist_low <- mu_hist - qt(0.975, df = n - 1) * sv_hist / sqrt(n)
mu_hist_up  <- mu_hist + qt(0.975, df = n - 1) * sv_hist / sqrt(n)
(mu_hist_confint_95 <- c(mu_hist_low, mu_hist, mu_hist_up))
[1] -0.0006690514  0.0030750148  0.0068190811
```

The confidence interval for the historical daily returns is shown above. We also have to change the assignment to the drift variable u:

```r
# drift computed from historical values
(u <- mu_hist)
[1] 0.0030750148
```

The resulting plot is:

Furthermore, the code shown above can easily be enhanced with sliders to specify the stock price history used for parameter estimation and the desired future time horizon. That can be done by taking advantage of the manipulate package, or by implementing a Shiny gadget or application, for example.

Using the same model, it is possible to compute the probability that the future stock price is above or below a predetermined value at time t. This can be done by computing the normal distribution parameters as a function of ΔT = t - t0 and applying a basic property of the density distribution. Here is how.
This is the historical share price daily drift u:

```r
(u <- mu_hist)
[1] 0.003075015
```

This is the current share price S(t0):

```r
(curr_share_price <- Y.hist[hist_end])
[1] 14.72056
```

This is the mean mu_t of our normal distribution computed with ΔT = 10, ten units of time (days) ahead of the current time:

```r
t <- 10
(mu_t <- log(curr_share_price) + (u - 0.5 * sv_hist^2) * t)
```

This evaluates to approximately 2.7182.

This is the standard deviation sv_t of our normal distribution computed with ΔT = 10, ten units of time (days) ahead of the current time:

```r
(sv_t <- sv_hist * sqrt(t))
[1] 0.05966977
```

Arbitrarily, I determine a target price 10% higher than the current one, hence equal to:

```r
(target_share_price <- 1.10 * curr_share_price)
[1] 16.19261
```

The probability that the share price at time t is equal to or greater than the target price (please note the lower.tail = FALSE parameter) is the following:

```r
pnorm(q = log(target_share_price), mean = mu_t, sd = sv_t, lower.tail = FALSE)
```

Our model states there is a probability of about 13% that the share price will be at or above the target price.

The Misbehavior of Markets: criticism of the lognormal hypothesis

In 1963, Mandelbrot published research highlighting that, contrary to the general assumption that share price movements are normally distributed, they instead follow a Pareto-Levy distribution, which has infinite variance. This implies that events assigned a negligible probability under the normal distribution hypothesis are actually not that unlikely under a Pareto-Levy distribution. As a consequence, market booms and crashes are more frequent than we may think. Be aware of this when applying lognormal distribution assumptions.

Conclusions

We have outlined an approach that can be easily implemented to compute the expected values and the lower and upper bounds of future stock prices. It is based on the well-known Black-Scholes-Merton model and its normal distribution hypothesis.
The plot showing the future share price expected value and the lower and upper bounds can be enhanced with interactive inputs to allow users to select the history length and the future horizon of interest. By using the same model, probabilities associated with future price thresholds can be computed.

A reference for the Black-Scholes-Merton model we discussed can be found in the following book:

• Options, Futures and Other Derivatives, John C. Hull, Prentice Hall, 8th Ed.

A reference for Mandelbrot's criticism of the share price lognormal distribution hypothesis is the following book:

• The Misbehavior of Markets: A Fractal View of Finance Turbulence, Benoit Mandelbrot & Richard L. Hudson, Basic Books.

Disclaimer

Any securities or databases referred to in this post are solely for illustration purposes, and under no regard should the findings presented here be interpreted as investment advice or a promotion of any particular security or source.

If you have any questions, feel free to comment below.

To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### The Next Era of Research Communication

Tue, 03/21/2017 - 22:04

(This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to R-bloggers)

From the days of actual research papers (before the digital age) to now, when research papers are posted online first, not much has really changed in the way we communicate. We still use static images, formulas and a bunch of text to show what we have discovered in the laboratory. This has definitely worked.
However, I think (or I'd like to believe) we're heading into a new era of how we share research. The rise of animations, GIFs and interactive visualizations is about to make reading scientific papers a little bit more fun.

So what has contributed to this movement? Of course, like any paradigm shift, several things accumulate to make it happen. But one of the major players is JavaScript, and one library in particular: D3.js. D3.js is a JavaScript library that makes it easy to manipulate the HTML DOM (Document Object Model). Using this library, the sky is the limit: you can build an insane number of visualizations. The learning curve is a bit steep, but the freedom it gives you is worth every minute spent learning how to use it properly.

LaTeX has been the ultimate publishing tool when it comes to scientific papers. LaTeX, pronounced lay-tek, is an open-source domain-specific language. It was initially released in 1985, so it's been around for a very, very long time. The norm is usually to author papers in LaTeX and then convert them into PDFs, where they will be consumed by the general readers.

So What's Next?

There are several efforts going in the direction of interactive papers. Some are conscious, and some are just evolving naturally out of the sheer need of whoever is playing around with them.

MarkDown

MarkDown (MD) is a "text-to-HTML" conversion tool authored by John Gruber. MarkDown is widely used in the software development community (e.g., README.md docs) and is now slowly creeping into the statistics community with RMarkdown. Its syntax lets you write in simple plain text and then converts it into a robust HTML page. It's very simple to learn and very convenient to use, thanks to its simplicity. Finding a way to add animations and visualizations, not just GIFs, to MarkDown might make sharing (at least of drafts) a bit more pleasant and understandable.

PDBF

PDBF is probably the most interesting of all of these. It's three things at the same time.
It's a PDF document, an HTML page and an OVA file. What you get depends on where you open the file and what extension you give it. If you save it as a PDF, you get the static part of the document. If you save it as an HTML doc and open it in a browser, you get the dynamic and static parts of the document, meaning you can interact with the visualizations. And lastly, the OVA part lets you add an entire virtual box to it, so you can run an entire operating system out of it.

Online

Some new blogs, non-official publications and pre-print hubs are also starting to allow users and/or contributors to add custom visualizations by letting them write in HTML and add D3.js visualizations. The only issue with these is export and sharing beyond the Internet-connected browser: they are only viewable in the browser, and good luck saving them into a single file. Crazy enough, I think they are the future. An Internet connection is as available as electricity these days (which might be a stretch in some places), so should you really download and save the research paper? Can't you just read it in your browser, bookmark it and come back to it again if you need to, while enjoying the visualizations?

Edit: check out distill.pub, a publication that lets you submit interactive articles on machine learning.

What are your thoughts?

The Next Era of Research Communication was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

To leave a comment for the author, please follow the link and comment on their blog: R Language in Datazar Blog on Medium.
### Give a talk about an application of R at EARL

Tue, 03/21/2017 - 15:57

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The EARL (Enterprise Applications of R) conference is one of my favourite events to go to. As the name of the conference suggests, its focus is where the rubber of the R language meets the road of it being used to solve real-world problems. Prior conferences have included presentations on how Maersk uses R to optimize capacity for its shipping lines, how Investec has deployed R to support investment and risk teams, how Verizon uses R to analyze cybersecurity data, and how AstraZeneca uses R to design clinical trials. This year's conferences, to be held in San Francisco June 5-7 and London September 12-14, promise to be equally fascinating. (The social program is always a lot of fun, too.)

The presentations at EARL conferences come directly from R users, so if you're using R to solve an interesting problem in business, government, or for the public good, consider giving a presentation yourself. The call for papers closes on March 25 for San Francisco, and on March 31 for London.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

### The one thing you need to master data science

Tue, 03/21/2017 - 08:00

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

When you ask people what makes a person great, what makes someone an elite performer, they commonly say "talent." Most people believe that elite performers are born with their talent.
Most people believe that top performers come into the world with an innate talent that makes them special. You see something like this in data science too. People hear about elite data scientists and assume that these people are just naturally gifted. "He must be a genius." "She must have been born with a talent for math." "That guy has a gift for programming." From the outside, people look at top-performing data scientists and say, "that could never be me … I don't have that gift." In data science, and more generally, people think that innate talent is what causes exceptional performance.

This may be one of the biggest misconceptions in history. As it turns out, there has been extensive research on elite performers of all kinds: musicians, doctors, chess players, even mathematicians. Time after time, research demonstrates that top performers are made, not born.

The myth of innate talent

The idea that talent creates greatness was detailed and largely debunked in the popular book Talent is Overrated, by Geoff Colvin. In a specific example, Colvin writes about a 1992 study of several hundred young music students. The researchers categorized the students' performance abilities into 5 levels (the top performers, middling performers, low performers, and so on). Next, they collected data on these students: they investigated how many hours per day the students practiced by interviewing the students and their parents, and they asked at what age the students first started playing their instrument.

When they analyzed the data, there were a few critical insights. First, when they looked for evidence of innate talent in the highest-performing students, they didn't find it. Essentially, they analyzed the highest-performing music students and looked for early signs of talent. Among the best-performing music students, there had been no unusual signs of talent when they were young.
Second, they found that there was a single factor that predicted the performance of the students: how much they practiced. Practice was the secret. Practice was the thing that made the "great performers" great. You can use this information as a data science student. The primary factor that enables you to master data science is practice.

Practice can make you a top-tier data scientist

To learn and master data science, you need to practice. But importantly, it's a very particular kind of practice that leads to expertise. To master data science, you need deliberate practice. Deliberate practice is practice that is specifically designed to push you beyond your current skill level. Unfortunately, when most people practice a skill, they don't actually push themselves. According to Anders Ericsson, a famous researcher of human expertise and human performance, when most people practice, they focus on things that they already know how to do.

For example, let's say that you already know how to create a bar chart:

```r
library(ggplot2)
ggplot(data = diamonds, aes(x = cut)) +
  geom_bar()
```

If you've already mastered the syntax to create a bar chart, but you keep practicing it intensely without moving on to more advanced techniques, you are unlikely to improve as a data scientist. Now don't misunderstand: it is useful to occasionally practice techniques that you've already mastered. However, if you don't push yourself forward by expanding your skills to more advanced techniques, then you won't develop as a data scientist. Another example is with musicians: if you learn only 5 simple songs, and you only play those 5 songs for 10 years, you won't improve. To improve, you need to master the basics, but you also need to consistently push yourself beyond your current skills.
To quote Anders Ericsson: "[Deliberate practice] entails considerable, specific, and sustained efforts to do something you can't do well—or even at all."

What this means for you, as a developing data scientist, is that you need to practice in a way that:

1. Enables you to learn techniques
2. Enables you to practice those techniques
3. Pushes you beyond your present level so you continue to improve

These principles will help you develop a good practice system. However, to give you a clearer understanding of how to optimize your practice, let's dig a little deeper into the nature of deliberate practice.

What is deliberate practice?

In Talent is Overrated, Colvin described the most important features of deliberate practice. Deliberate practice is:

1. Designed for improvement
2. Repeatable
3. Provides feedback
4. Mentally demanding
5. Not much fun

Let's examine each of these features one by one. I'll also explain how you can apply them to your study of data science.

Deliberate practice is designed to improve performance

To train optimally, your practice should be specifically designed to push you beyond your current skill. In Talent is Overrated, Colvin gives the example of Tiger Woods dropping a ball in a sand trap and deliberately stepping on it to press the ball further into the sand, making the shot very, very challenging.

Let me explain how you can apply this to studying data science. You need a system that is deliberately designed to challenge you. You need to practice in a way that pushes you beyond your comfort zone. So, for example, if you've already learned the basics of ggplot2, like geom_line(), geom_bar() and geom_point(), pushing yourself might mean learning the theme() function and the corresponding element functions. Or, if you've already learned ggplot2::theme(), you may need to move on to more advanced visualization techniques, etc.
I want to emphasize, however, that the important word is 'designed.' You need a system that is designed to challenge you. As a data science student, this will be one of your biggest obstacles: as a beginner or intermediate student, it will be difficult for you to design a practice routine yourself that teaches you the right skills, in the right order, in such a way that you're continuously challenged. It's hard to do without a coach or mentor. Sadly, I also think that most data science courses lack well-designed practice systems. I've heard many stories from students who say that they "took an online course" but still can't write code. In most cases, I suspect that this is because the course lacks a well-designed practice system that teaches you skills and then pushes you beyond your skill level to higher and higher levels of mastery. One way or another, I highly recommend that you create or invest in a practice system that is designed to improve your performance.

Deliberate practice is repeatable

To master a skill, you need to repeat your practice activities until they are second nature. Sadly, people learning data science rarely repeat their practice activities. At best, most people learn a new technique and then practice it only a couple of times. Inevitably, if they don't repeat their practice, they forget the technique. Again, you frequently hear people say "I took several online courses, but I still can't write R code very well." They frequently talk about learning a technique, but then forgetting it. The reason they forget is that they fail to practice over the long run. If you want to be a top-performing data scientist, you need to repeat your data science practice until you can perform techniques unconsciously (i.e., without thinking about them).

To help you understand what I mean by "perform techniques unconsciously," I'll break down the phases of learning as follows:

1. Unconscious incompetence
2. Conscious incompetence
3. Conscious competence
4. Unconscious competence

Let me explain these. Before learning a skill, you will be unconsciously incompetent. Essentially, this means that before you begin learning a new technique, you're bad at it and you don't even know it. For example, if you want to learn data science but don't even know which packages to learn, you would be "unconsciously incompetent." At this stage, you don't even know what you don't know.

Next, when you first learn a skill, you become consciously incompetent. At this stage, you are unskilled, but you are aware of your lack of skill. For example, the first time you learn how to create a bar chart with geom_bar(), you'll essentially be consciously incompetent. If you only practice it a few times, you'll probably struggle to remember how to execute the technique. You would lack skill, and you'd be acutely aware of your lack of skill. That's conscious incompetence.

But if you repeat your practice for a few days and weeks, you'll begin to be consciously competent. At this level, you can smoothly execute a technique, but still with some mental effort. For example, if you systematically and repeatedly practice the techniques from ggplot2 and dplyr for a couple of weeks, you'll eventually reach a point where you can execute those techniques. Having said that, at this stage it still requires effort. You'll need to consciously think about the code. You'll need to work to remember the syntax. But the important thing is that you will remember. You will be able to execute the techniques. This stage, when you can perform the activity (but only with mental effort), is called conscious competence.

Finally, if you stick with it and systematically practice your data science techniques, you'll reach unconscious competence. At the stage of unconscious competence, you can execute techniques without any thought at all.
The techniques are so well practiced that you can do them without effort. For example, I have students who say that they can write R code "with their eyes closed." That's unconscious competence, and that should be your goal: unconscious competence in the techniques of the tidyverse, like the ggplot2, dplyr, stringr and tidyr functions. Your goal should be to write the code effortlessly, fluidly, rapidly and smoothly. Imagine writing R code effortlessly, with your eyes closed, as fast as you can type. That should be your goal: "fluency" in R. It sounds hard, but this is absolutely possible for you as a data scientist. You can achieve this level of "fluency," but it requires repeated practice. You need to repeat your data science practice until writing R code is second nature.

Deliberate practice provides feedback

Moreover, for your practice to be effective, you need feedback. This is a major difference between "regular" practice and "deliberate practice." Ideally, this feedback should come from a skilled instructor. Having an expert analyze your performance to identify strengths and weaknesses is extremely helpful. Having said that, not everyone can afford a coach or tutor for this sort of guidance. On the other hand, as data scientists, we're actually somewhat lucky: in many instances, we get feedback on our techniques directly from our programming environment. If you type some code into RStudio, it either runs without error or it doesn't. If your code contains an error, you'll get an error message (however cryptic it may be) about what you did wrong. Alternatively, if your code runs without errors, you commonly get other types of feedback by examining the output. For example, if you intended to make a line chart, you can examine the output. Did the code produce the exact line chart that you envisioned? Did the code do what you thought it would? This is also feedback.
Ideally, though, your data science practice system should provide more than just the output from RStudio. Ideally, you want feedback on whether your exact answer was correct or not. If you know the right tools to use, this is absolutely possible. Moreover, once you start using a feedback-driven practice system to learn data science, your progress will accelerate. You'll learn more, faster, and make fewer mistakes.

Deliberate practice is mentally demanding

Another feature of deliberate practice is that it's mentally challenging. If you're really pushing yourself to learn and practice skills that are out of your comfort zone, it's going to hurt a little. For example, the first time you start learning the functions from the tidyr package (which I recommend if you're a beginner), they might be a little difficult to understand. tidyr reshapes your data into new formats, and these reshaping transformations can be difficult to understand.

And it's not just that some techniques are difficult to understand. Simply remembering R syntax can be challenging. If you've just learned a new R technique, it will be difficult to remember the syntax after a few days. Even if you learn a technique well the first time, it's very likely that you'll begin to forget it quickly. Ultimately, in data science, some techniques are hard to understand, syntax is hard to remember, and the process of learning new, advanced techniques (and pushing yourself to memorize the syntax) is hard. If you're doing it right, learning data science is mentally demanding. Deliberate practice is supposed to be hard. If it's not, then you're not pushing yourself hard enough. If your practice always feels easy, you're doing it wrong.
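To make the tidyr point concrete, here is a small, hypothetical example of the kind of reshaping tidyr::gather() performs: turning one column per year into key/value pairs (wide to long). The data frame and column names are made up for illustration.

```r
# Hypothetical wide-format data: one column per year
library(tidyr)
wide <- data.frame(product = c("A", "B"),
                   `2016`  = c(10, 20),
                   `2017`  = c(12, 25),
                   check.names = FALSE)

# gather() collects the year columns into key ("year") / value ("sales") pairs
long <- gather(wide, key = "year", value = "sales", -product)
long
```

Tracking how each wide cell becomes a (key, value) row is exactly the kind of mental work that makes these transformations demanding at first.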
The challenge for you, as someone who’s learning data science, is that you need a practice system that continuously pushes you beyond your skill level towards techniques of increasing difficulty.

Deliberate practice is not fun

Finally, deliberate practice is not fun. In saying this, I still want to emphasize that you can rapidly master data science if you know how to practice. You can learn data science faster than you ever thought possible. I think you can learn data science 2x, 3x, even 5x faster than average, if you know how to practice. Having said that, a good practice system is not a magic wand. It’s not going to make data science effortless. It’s going to be hard. It will be frustrating at times. It’s not always fun. If you want to master data science, you need to embrace the hurt. You need to accept that there’s no success without struggle. You need to accept that mastering data science will be a little painful sometimes, and you need to power through. If you can embrace the fact that deliberate practice – the type of practice that leads to mastery – will be hard sometimes, then you can succeed.

A quick guide to deliberate practice for data science

“Deliberate practice requires that one identify certain sharply defined elements of performance that need to be improved, and then work intently on them.” – Geoff Colvin

Now that you understand what deliberate practice is and how it can help you, here are a few recommendations.

Identify sharply defined techniques that you can practice

To engage in deliberate practice, you need to sharply define a set of techniques and practice them intensely until you master them. After you master them, define additional, more advanced techniques and practice those as well. As a data scientist, this means that you should sharply define individual techniques, practice those techniques repeatedly, and move on to harder techniques as you progress.
The tidyverse functions are small units that you can practice

As I wrote in a recent article, the modular nature of R’s tidyverse makes it somewhat easy to define “practicable techniques.” The functions of the tidyverse are highly modular: almost all of them do one thing. You should consider the functions of the tidyverse to be sharply defined techniques that you can individually learn, practice, and master. For example, I consider dplyr::mutate() to be one technique. dplyr::arrange() is another, separate technique. tidyr::gather() is a technique. Within ggplot2, geom_line(), geom_bar(), and geom_point() should be considered separate techniques. These are individual techniques that you can practice and master. They are the small units that you need to practice.

Establish a practice system

Ideally, you should set up a practice system. You need a system that will teach you the right skills in the right order. You need a system that enables you to practice your techniques repeatedly over time until you reach mastery. You need a system that gives you feedback. If you don’t have these things, you are unlikely to reach your full potential. But with the right system in place, you can travel very rapidly on the path to data science mastery.

Commit to practice

As I’ve mentioned, if you want to become a top-tier data scientist, you need to practice. It’s not enough to learn a technique and practice it one time. You need to practice a technique repeatedly over time until it becomes second nature. When you reach that point, you need to move on to new skills that push you beyond your current skill level. Practicing like this requires commitment. It’s hard. You need to be disciplined. You need to commit to showing up every day and doing the work. There are. no. shortcuts. I promise you though, if you can commit to practice, and you practice data science the right way, then you can learn data science very, very quickly.
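For instance, each of these small units can be practiced in isolation on a toy data frame. A minimal sketch, assuming the dplyr and tidyr packages are installed (the data frame and its column names are invented for illustration):

```r
library(dplyr)
library(tidyr)

# A toy data frame: one row per id, one column per quarter of sales.
df <- data.frame(id = c(1, 2), q1 = c(10, 20), q2 = c(30, 40))

# Each call below is one sharply defined, practicable technique.
long <- gather(df, key = "quarter", value = "sales", q1, q2) # tidyr: wide -> long
long <- mutate(long, sales_k = sales / 1000)                 # dplyr: add a column
long <- arrange(long, desc(sales))                           # dplyr: sort rows
long
```

Practicing each function on tiny inputs like this, until the syntax is automatic, is exactly the kind of repetition described above.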
Our data science course is reopening soon

If you’re interested in rapidly mastering data science, then sign up for our list right now. We will be re-opening our flagship course, Starting Data Science, within a few weeks. Starting Data Science will teach you the essentials of R, including ggplot2, dplyr, tidyr, stringr, and lubridate. It will also give you a practice system that you can use to rapidly master these techniques. If you sign up for our email list, you’ll get an exclusive invitation to join the course when it opens.

The post The one thing you need to master data science appeared first on SHARP SIGHT LABS.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### camsRad, satellite-based time series of solar irradiation

Tue, 03/21/2017 - 08:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

camsRad is a lightweight R client for the CAMS Radiation Service, which provides satellite-based time series of solar irradiation for actual weather conditions as well as for clear-sky conditions. Satellite-based solar irradiation data have been around roughly as long as our modern-era satellites, but the price tag has been very high, in the range of several thousand euros per site. This has dampened research and development of downstream applications. With the CAMS Radiation Service coming online in 2016, this changed, as the services are provided under the (not yet fully implemented) European Union standpoint that data and services produced with public funding shall be provided on free and open grounds.
The service is part of Copernicus, a European Union programme aimed at developing information services based on satellite earth observation and in situ data. All Copernicus information services are free and openly accessible.

Satellite-based solar irradiation data

The main groups of users are planners and managers of solar energy systems; architects and building engineers; and researchers in renewable energies and building engineering. Surface solar irradiation is relatively cumbersome and expensive to retrieve by ground observation, so a satellite-based modelling approach can in many cases be a more feasible option. These approaches build upon the principle that pixels in satellite images of clouds (target 2 in the figure below) appear brighter/whiter than pixels of ground (target 1). For a more thorough description of the underlying theory and technical details, head to the user guide and articles. The CAMS Radiation Service is jointly developed and provided by DLR, Armines and Transvalor.

Illustration of using satellite images to determine the level of cloudiness. Source: MACC-III Deliverable D57.5

Check out the Shiny app

The camsRad package

In a perfect world, this package would be unnecessary. Accessing data and web services should be easy, a procedure so standardized that anyone with a bit of programming experience should be able to achieve it with a few lines of code. I've worked with quite a few web APIs, but still haven't encountered two of the same kind. It always requires time and effort to familiarize yourself with the concepts of communication, formatting etc. That there exist numerous ways of disseminating data can be seen in the rOpenSci packages listing, where roughly half of the listed packages are categorized as data-access packages. The CAMS Radiation Service uses a so-called Web Processing Service (WPS) interface. WPS provides standardization for geospatial processing services.
It requires you to make a POST request carrying an XML-formatted payload that specifies how the service is invoked. As I wanted to have a thorough and reliable R client, and thought other users could have use of it, I decided to make it a public R package and submit it to rOpenSci for review. I think I'm a quite typical aspiring R developer: a mostly self-taught programmer with a non-computer-science degree. Another, a bit selfish, motive for submitting was that I thought it would be a good learning ground for adopting a more formal R development style. And yes, I learned a lot! A big thank you to reviewer Jeffrey Hollister and editor Scott Chamberlain. The biggest change after the review was getting rid of package dependencies, both to get a more lightweight package and to decrease the risk of breakage when upstream packages get updated.

The core of the camsRad package is the cams_api function, which interfaces the WPS of the CAMS Radiation Service. It can be of interest for anyone wanting to call the CAMS Radiation Service from languages other than R, or for those who want to invoke other WPS-based web services from R. The two convenience functions cams_get_radiation and cams_get_mcclear are for those of you who just want to get the data into an R data frame with as little hassle as possible. Check the vignette and readme file for further instructions and examples.

My use cases

Shiny Weather Data is a web service making different sources of European gridded climate data available in hourly time series formats used by common building performance modeling tools. This web service has been around for a while and has a steadily growing user group of professional building modelers as well as students and researchers. Smart Energy Modeler demonstrates how modeled climate data can be used to calibrate simple building models with utility bill data. This is a dissemination from my research as a PhD candidate at Mälardalen University, Sweden.
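For those who just want data with minimal setup, usage of the two convenience functions looks roughly like the following. This is a sketch based on the package README: the email address is a placeholder, a free registration at soda-pro.com is required, the call needs network access, and the exact arguments and defaults should be checked against the package vignette:

```r
library(camsRad)

# Set the email you registered at soda-pro.com (placeholder shown here)
cams_set_user("your@email.com")

# Hourly irradiation for a point in central Sweden, first half of July 2016
df <- cams_get_radiation(lat = 60, lng = 15,
                         date_begin = "2016-07-01",
                         date_end = "2016-07-15",
                         time_step = "PT01H")
head(df)
```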
It's in an early stage of development, but a nice showcase of how access to open climate data enables data-driven applications that solve real-world problems.

Future development

The CAMS Radiation Service relies on images from the Meteosat satellite, which is located at the prime meridian of 0° longitude (covering Europe, Africa and the Middle East). I've started a GitHub issue about alternative data sources. So, please, let me know if you know any sources of open and free, high-resolution solar radiation data based on images from satellites above the Americas and Asia.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.

### Simultaneous intervals for derivatives of smooths revisited

Tue, 03/21/2017 - 07:00

(This article was first published on From the Bottom of the Heap - R, and kindly contributed to R-bloggers)

Eighteen months ago I screwed up! I’d written a post in which I described the use of simulation from the posterior distribution of a fitted GAM to derive simultaneous confidence intervals for the derivatives of a penalized spline. It was a nice post that attracted some interest. It was also wrong. In December I corrected the first part of that mistake by illustrating one approach to compute an actual simultaneous interval, but only for the fitted smoother. At the time I thought that the approach I outlined would translate to the derivatives, but I was being lazy; then Christmas came and went and I was back to teaching — you know how it goes.
Anyway, in this post I hope to finally rectify my past stupidity and show how the approach used to generate simultaneous intervals from the December 2016 post can be applied to the derivatives of a spline. If you haven’t read the December 2016 post I suggest you do so, as there I explain this:

\[
\begin{align}
\mathbf{\hat{f}_g} &\pm m_{1 - \alpha}
\begin{bmatrix}
\widehat{\mathrm{st.dev}} (\hat{f}(g_1) - f(g_1)) \\
\widehat{\mathrm{st.dev}} (\hat{f}(g_2) - f(g_2)) \\
\vdots \\
\widehat{\mathrm{st.dev}} (\hat{f}(g_M) - f(g_M)) \\
\end{bmatrix}
\end{align}
\]

This equation states that the critical value for a 100(1 − α)% simultaneous interval is given by the 100(1 − α)% quantile of the distribution of the standard errors of deviations of the fitted values from the true values of the smoother. We don’t know this distribution, so we generated realizations from it using simulation, and used the empirical quantiles of the simulated distribution to give the appropriate critical value m with which to calculate the simultaneous interval. In that post I worked my way through some R code to show how you can calculate this for a fitted spline.

To keep this post relatively short, I won’t rehash the discussion of the code used to compute the critical value m. I also won’t cover in detail how the derivatives are computed. We use finite differences, and the general approach is explained in an older post. I don’t recommend you use the code in that post for real data analysis, however. Whilst I was putting together this post I re-wrote the derivative code, as well as that for computing point-wise and simultaneous intervals, and started a new R package, tsgam. tsgam is available on GitHub and we’ll use it here. Note this package isn’t even at version 0.1 yet, but the code for derivatives and intervals has been through several iterations now and has worked well whenever I have tested it.
Assuming you have the devtools package installed, you can install tsgam using

devtools::install_github("gavinsimpson/tsgam")

As example data, I’ll again use the strontium isotope data set included in the SemiPar package, which is extensively analyzed in the monograph Semiparametric Regression (Ruppert et al., 2003). First, load the packages we’ll need as well as the data, which is the data set fossil. If you don’t have SemiPar installed, install it using install.packages("SemiPar") before proceeding

library("mgcv")    # fit the GAM
library("tsgam")   # code for derivatives & intervals
library("ggplot2") # package for nice plots
theme_set(theme_bw()) # simpler theme for the plots
data(fossil, package = "SemiPar") # load the data

The fossil data set includes two variables and is a time series of strontium isotope measurements on samples from a sediment core. The data are shown below using ggplot()

ggplot(fossil, aes(x = age, y = strontium.ratio)) +
    geom_point() + scale_x_reverse()

The strontium isotope example data used in the post

The aim of the analysis of these data is to model how the measured strontium isotope ratio changed through time, using a GAM to estimate the clearly non-linear change in the response. As time is the complement of sediment age, we should probably model this on that time scale, especially if you wanted to investigate residual temporal auto-correlation. This requires creating a new variable negAge for modelling purposes only

fossil <- transform(fossil, negAge = -age)

As per the previous post, a reasonable GAM for these data is fitted using mgcv and gam()

m <- gam(strontium.ratio ~ s(negAge, k = 20), data = fossil, method = "REML")

Having fitted the model we should do some evaluation of it, but I’m going to skip that here and move straight to computing the derivative of the fitted spline and a simultaneous interval for it.
First we set some constants that we can refer to throughout the rest of the post

## parameters for testing
UNCONDITIONAL <- FALSE # unconditional or conditional on estimating smooth params?
N <- 10000             # number of posterior draws
n <- 500               # number of newdata values
EPS <- 1e-07           # finite difference

To facilitate checking that this interval has the correct coverage properties, I’m going to fix the locations where we’ll evaluate the derivative, calculating the vector of values to predict at once only. Normally you wouldn’t need to do this just to compute the derivatives and associated confidence intervals — you would just need to set the number of values n over the range of the predictors you want — and if you have a model with several splines it is probably easier to let tsgam handle this part for you.

## where are we going to predict at?
newd <- with(fossil, data.frame(negAge = seq(min(negAge), max(negAge), length = n)))

The fderiv() function in tsgam computes the first derivative of any splines in the supplied GAM1, or you can request derivatives for a specified smooth term. As we have only a single smooth term in the model, we simply pass in the model and the data frame of locations at which to evaluate the derivative

fd <- fderiv(m, newdata = newd, eps = EPS, unconditional = UNCONDITIONAL)

(We set eps = EPS so we have the same grid shift later in the post when checking coverage properties, and we don’t account for the uncertainty due to estimating the smoothness parameters (unconditional = FALSE); normally you can leave these at the defaults.) The object returned by fderiv()

str(fd, max = 1)

List of 6
 $ derivatives  :List of 1
 $ terms        : chr "negAge"
 $ model        :List of 52
  ..- attr(*, "class")= chr [1:3] "gam" "glm" "lm"
 $ eps          : num 1e-07
 $ eval         :'data.frame': 500 obs. of 1 variable:
 $ unconditional: logi FALSE
 - attr(*, "class")= chr "fderiv"

contains a component derivatives that contains the evaluated derivatives for all or the selected smooth terms.
The other components include a copy of the fitted model and some additional parameters that are required for the confidence intervals. Confidence intervals for the derivatives are computed using the confint() method. The type argument specifies whether point-wise or simultaneous intervals are required. For the latter, the number of simulations to draw is required via nsim

set.seed(42) # set the seed to make this repeatable
sint <- confint(fd, type = "simultaneous", nsim = N)

To make it easier to work with the results, I wrote the confint() method so that it returns the confidence interval as a tidy data frame suitable for plotting with ggplot2. sint is a data frame with an identifier for which smooth term each row relates to (term), plus columns containing the estimated (est) derivative and the lower and upper confidence interval

head(sint)

    term     lower      est    upper
1 negAge -0.000053 -6.1e-06 0.000041
2 negAge -0.000053 -6.1e-06 0.000041
3 negAge -0.000053 -6.1e-06 0.000040
4 negAge -0.000052 -6.1e-06 0.000040
5 negAge -0.000052 -6.1e-06 0.000040
6 negAge -0.000051 -6.0e-06 0.000039

The estimated derivative plus its 95% simultaneous confidence interval are shown below

ggplot(cbind(sint, age = -newd$negAge), aes(x = age, y = est)) +
    geom_hline(aes(yintercept = 0)) +
    geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
    geom_line() +
    scale_x_reverse() +
    labs(y = "First derivative", x = "Age")

Estimated first derivative of the spline fitted to the strontium isotope data. The grey band shows the 95% simultaneous interval.

So far so good.

Having thought about how to apply the theory outlined in the previous post, it seems that all we need to do to apply it to derivatives is to make the assumption that the estimate of the first derivative is unbiased and hence we can proceed as we did in the previous post by computing BUdiff using a multivariate normal with zero mean vector and the Bayesian covariance matrix of the model coefficients. Where the version for derivatives differs is that we use a prediction matrix for the derivatives instead of for the fitted spline. This prediction matrix is created as follows

1. generate a prediction matrix from the current model for the locations in newd,
2. generate a second prediction matrix as before but for slightly shifted locations newd + eps
3. difference these two prediction matrices yielding the prediction matrix for the first differences Xp
4. for each smooth in turn

1. create a zero matrix, Xi, of the same dimensions as the prediction matrices
2. fill in the columns of Xi that relate to the current smooth using the values of the same columns from Xp
3. multiply Xi by the vector of model coefficients to yield predicted first differences
4. calculate the standard error of these predictions

The matrix Xi is supplied for each smooth term in the derivatives component of the object returned by fderiv().

Once I’d grokked this one basic assumption about the unbiasedness of the first derivative, the rest of the translation of the method to derivatives fell into place. As we are using finite differences, we may be a little biased in estimating the first derivatives, but this can be reduced by making eps smaller, though the default probably suffices.
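The finite-difference idea itself is easy to see on a function whose derivative we know exactly; this toy sketch (unrelated to the fossil model) shows the forward difference and why a small eps keeps the bias negligible:

```r
# Forward finite difference: f'(x) is approximated by (f(x + eps) - f(x)) / eps.
# For f(x) = x^2 the true derivative at x = 2 is 4.
eps <- 1e-07
f <- function(x) x^2
fd_approx <- (f(2 + eps) - f(2)) / eps
fd_approx # close to 4; the error of the forward difference is O(eps)
```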

To see the detail of how this is done, look at the source code for tsgam:::simultaneous, which apart from a bit of renaming of objects follows closely the code in the previous post.

Having computed the purported simultaneous interval for the derivatives of the trend, we should do what I didn’t do in the original posts about these intervals and go and look at the coverage properties of the generated interval.

To do that I’m going to simulate a large number, N, of draws from the posterior distribution of the model. Each of these draws is a fitted spline that includes the uncertainty in the estimated model coefficients. Note that I’m not including a correction here for the uncertainty due to smoothing parameters being estimated — you can set unconditional = TRUE throughout (or change UNCONDITIONAL above) to include this extra uncertainty if you wish.

Vb <- vcov(m, unconditional = UNCONDITIONAL)
set.seed(24)
sims <- MASS::mvrnorm(N, mu = coef(m), Sigma = Vb)
X0 <- predict(m, newd, type = "lpmatrix")
newd <- newd + EPS
X1 <- predict(m, newd, type = "lpmatrix")
Xp <- (X1 - X0) / EPS
derivs <- Xp %*% t(sims)

The code above basically makes a large number of draws from the model posterior and applies the steps of the algorithm outlined above to generate derivs, a matrix containing 10000 draws from the posterior distribution of the model derivatives. Our simultaneous interval should entirely contain about 95% of these posterior draws. Note that a draw here refers to the entire set of evaluations of the first derivative for each posterior draw from the model. The plot below shows 50 such draws (lines)
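The shape bookkeeping in that snippet can be sanity-checked at a tiny scale (a toy sketch, not the fossil model): with n evaluation points, p coefficients and N posterior draws, Xp is n × p, the draws matrix is N × p, and derivs comes out n × N, one column per draw.

```r
library(MASS) # for mvrnorm; MASS ships with R

set.seed(1)
p <- 3; n <- 5; N <- 100           # coefficients, eval points, posterior draws
Xp   <- matrix(rnorm(n * p), n, p) # stand-in for the derivative prediction matrix
sims <- mvrnorm(N, mu = rep(0, p), Sigma = diag(p)) # N draws of p coefficients
derivs <- Xp %*% t(sims)           # each column is one posterior draw
dim(derivs)                        # n rows (eval points) by N columns (draws)
```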

set.seed(2)
matplot(derivs[, sample(N, 50)], type = "l", lty = "solid")

50 draws from the posterior distribution of the first derivative of the fitted spline.

and 95% of the 10000 draws (lines) should lie entirely within the simultaneous interval if it has the right coverage properties. Put another way, only 5% of the draws (lines) should ever venture outside the limits of the interval.

To check this is the case, we reuse the inCI() function, which checks whether a draw lies entirely within the interval or not

inCI <- function(x, upr, lwr) { all(x >= lwr & x <= upr) }

As each column of derivs contains a different draw, we want to apply inCI() to each column in turn

fitsInCI <- with(sint, apply(derivs, 2L, inCI, upr = upper, lwr = lower))

inCI() returns a TRUE if all the points that make up the line representing a single posterior draw lie within the interval and FALSE otherwise, therefore we can sum up the TRUEs (recall that a TRUE == 1 and a FALSE == 0) and divide by the number of draws to get an estimate of the coverage properties of the interval. If we do this for our interval

sum(fitsInCI) / length(fitsInCI)

[1] 0.95

we see that the interval includes 95% of the 10000 draws. Which, you’ll agree, is pretty close to the desired coverage of 95%.
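The same tally can be sanity-checked on a toy example with made-up draws and a fixed interval of [-1, 1] at every evaluation point (nothing here depends on the fitted model):

```r
# A draw only counts as "inside" if every one of its points is inside.
inCI <- function(x, upr, lwr) all(x >= lwr & x <= upr)

# Three fake draws, one per column; the middle one pokes above the interval
draws <- cbind(c(0.1, 0.2), c(0.5, 1.5), c(-0.2, 0.1))
ok <- apply(draws, 2L, inCI, upr = 1, lwr = -1)
coverage <- sum(ok) / length(ok)
coverage # 2 of the 3 draws lie entirely within the interval
```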

That’s it for this post; whilst the signs are encouraging that these simultaneous intervals have the required coverage properties, I’ve only looked at them for a simple single-term GAM, and only for a response that is conditionally distributed Gaussian. I also haven’t looked at anything other than the coverage at an expected 95%. If you do use this in your work, please do check that the interval is working as anticipated. If you do discover problems, please let me know either in the comments below or via email. The next task is to start thinking about extending these ideas to work with a wider range of GAMs that mgcv can fit, including location-scale models and models with factor-smooth interactions.

References

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric regression. Cambridge University Press.

1. fderiv() currently works for smooths of a single variable fitted using gam() or gamm(). It hasn’t been tested with the location-scale extended families in newer versions of mgcv and I doubt it will work with them currently.

To leave a comment for the author, please follow the link and comment on their blog: From the Bottom of the Heap - R.

### Is it possible to use RevoScaleR package in Power BI?

Mon, 03/20/2017 - 22:19

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

I was invited to deliver a session for the Belgium User Group on SQL Server and R integration. After the session – which we did online using web-based Citrix – I got an interesting question: “Is it possible to use RevoScaleR performance computational functions within Power BI?“. My first answer was a sceptical yes. But I said that I hadn’t used it in this manner yet and that there might be some limitations.

The idea of having a scalable environment and a parallel computational package with all the predictive analytical functions in Power BI is absolutely great. But something tells me that it will not be that straightforward.

So let’s start by taking a large (500 MB) txt file and creating an XDF file:

The file is available online at this address as a zip file.

Getting data with R script

Open Power BI, choose Get Data -> R Script, and copy/paste the following, slightly changed, code:

After copy-pasting and clicking OK,

you will have to wait for the data to be read into memory and the data models to be created. After monitoring the memory consumption and patiently waiting, you will notice that this particular dataset (500 MB txt, or 160 MB as XDF) will consume a minimum of 3 GB of RAM, and you will end up with a preview:

By now, you will also notice that after saving this Power BI document, it will take up to 700 MB of your disk space, and all the data visualization will consume additional RAM and time. After you close the Power BI document, you will notice a lot of RAM being released.

Using R Script in the visuals

In a new Power BI document, I will create a new dataset by entering data. I will create three “dummy” variables.

With these three variables I will try to inject the data returned from the XDF file and have the data represented in Power BI.

After selecting the new visual and choosing R visual, I inserted the following code:

And this time, the result is fascinating. R plots the histogram in a split second, meaning it takes advantage of the XDF file and injects it into Power BI.

This is still an external file or dataset that Power BI does not have a clue about, meaning no slicers are available for dynamically changing the user selection.

Let’s try to insert the data into those three dummy variables, where the third one will be a factor that I have to pre-prepare. Since in this case the factor is Year, it is relatively easy to do:

library(RevoScaleR)
library(gridExtra)
library(dplyr)
Year % filter(year == c("2000","2001","2002")))
grid.table(df_f %>% filter(year == Year))

Once I have this inserted in a new R visual, I just need to add a dummy slicer.

Now I can easily change the years for my cross-tabulation (using the rxCrossTabs function). Since the calculation is performed in the background on the whole dataset, with the dplyr package used just to omit or filter the results, it is also possible to use rxDataStep:

rxDataStep(inData = outputFile, outFile = "C:\\Files\\YearPredictMSD_Year.xdf",
           overwrite = TRUE, transforms = list(LateYears = V1 > 1999))
rxCrossTabs(V2 ~ F(LateYears), data = "C:\\Files\\YearPredictMSD_Year.xdf")

In this way, you will be creating a new XDF file through Power BI with the transformation. Bear in mind that this step might take some extra seconds to create a new variable or to make a subset, if you need one. Again, this is up to you to decide, based on the file size.

Using SQL Server procedure with R Script

This approach is not that uncommon, because it has been proven that using stored procedures with T-SQL and R code is a useful and powerful way to use SQL Server and R integration within SSRS. Changing the computational context is surely another workaround.

Creating the stored procedure:

CREATE PROCEDURE [dbo].[SP_YearMSD_CrossTab]
AS
BEGIN
    DECLARE @RScript nvarchar(max)
    SET @RScript = N'
        library(RevoScaleR)
        sampleDataDir

Or by copying the T-SQL code into the SQL Server data source; the result is the same.

In both cases, you should have a cross-tabular representation of the XDF dataset within Power BI. And now you can really use all the advantages of Power BI visuals, slicers, and any additional R predictions.

There is a slight minus to this approach (if not all approaches like this): you need to have many stored procedures or queries generated like this. rxCube will also help you to some extent, but repetitive work will not be avoided.

Using XDF data files stored in HDInsight or in Hadoop would generally mean using the same dataset and steps as for the SQL Server procedure, just that you would need to, prior to executing the T-SQL script, also change the computational context:

# HD Insight - Spark - Azure HDInsight
mySshUsername = USNM, mySshHostname = HSTNM, mySshSwitches = SWTCH)
rxSetComputeContext("HDInsight")

## Hadoop
mySshUsername = USNM, mySshHostname = HSTNM, mySshSwitches = SWTCH)
rxSetComputeContext("Hadoop")

Verdict

I have explored a couple of ways to use Power BI visuals and the Power BI environment with RevoScaleR XDF (eXternal Data Frame) data files. I have to admit I was surprised that there is a way to do it relatively easily, but from a data scientist's perspective, it is still some additional load and work before you can start with the actual data analysis. The last two approaches (R script in visuals and SQL Server procedures) are by far the fastest and also take advantage of the parallel and distributed computations that the RevoScaleR package brings.

I would very strongly advise Microsoft and the Power BI development team to add an XDF plug-in to Power BI. The plug-in would work with a metadata presentation of the data; each time computations were needed, the metadata would push the code to R Server and have the results returned. This would surely be a great way to bring the Big Data concept to Power BI Desktop.

As always, code and samples are available at GitHub.

Happy coding!

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

### Sentiment Analysis of Warren Buffett’s Letters to Shareholders

Mon, 03/20/2017 - 22:06

(This article was first published on Michael Toth's Blog, and kindly contributed to R-bloggers)

Last week, I was reading through Warren Buffett’s most recent letter to Berkshire Hathaway shareholders. Every year, he writes a letter that he makes publicly available on the Berkshire Hathaway website. In the letters he talks about the performance of Berkshire Hathaway and their portfolio of businesses and investments. But he also talks about his views on business, the market, and investing more generally, and it is for this wisdom that many investors, including me, read what he has to say.

In many ways Warren Buffett’s letters are atypical. When most companies report their financial performance, they fill their reports with dense, technical language designed to obscure and confuse. Mr. Buffett does not follow this approach. His letters are written in easily understandable language, because he wants them to be accessible to everybody. Warren Buffett is not often swayed by what others are doing. He goes his own way, and that has been a source of incredible strength. In annually compounded returns, Berkshire stock has gained 20.8% since 1965, while the S&P 500 as a whole has gained only 9.7% over the same period. To highlight how truly astounding this performance is: one dollar invested in the S&P in 1965 would have grown to $112.34 by the end of 2016, while the same dollar invested in Berkshire stock would have grown to the massive sum of $15,325.46!

I’ve been reading the annual Berkshire letters as they come out for the last few years. One day I’ll sit down and read through all of them, but I haven’t gotten around to it yet. But while I was reading through his most recent letter last week, I got to thinking. I wondered whether there are any trends in his letters over time, or how strongly his writings are influenced by external market factors. I decided I could probably answer some of these questions through a high-level analysis of the text in his letters, which brings me to the subject of this blog post.

In this post I’m going to be performing a sentiment analysis of the text of Warren Buffett’s letters to shareholders from 1977 – 2016. A sentiment analysis is a method of identifying and quantifying the overall sentiment of a particular set of text. Sentiment analysis has many use cases, but a common one is to determine how positive or negative a particular text document is, which is what I’ll be doing here. For this, I’ll be using Bing sentiment analysis, developed by Bing Liu of the University of Illinois at Chicago. For this type of sentiment analysis, you first split a text document into a set of distinct words, and then determine for each word whether it is positive, negative, or neutral.

In the graph below, I show something called the ‘Net Sentiment Ratio’ for each of Warren Buffett’s letters, beginning in 1977 and ending with 2016. The net sentiment ratio indicates how positive or negative a particular text is. I’m defining the net sentiment ratio as:

(Number of Positive Words – Number of Negative Words) / (Number of Total Words)
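As a quick sketch of this computation in R, with tiny made-up word lists standing in for the full Bing lexicon (the word lists and the sample sentence here are illustrative only, not from the actual letters):

```r
# Toy positive/negative word lists; the real analysis uses the Bing lexicon
positive <- c("gain", "excellent", "outstanding")
negative <- c("loss", "difficult", "bad")

# Net sentiment ratio: (positive count - negative count) / total words
net_sentiment <- function(text) {
  words <- tolower(unlist(strsplit(text, "\\s+")))
  (sum(words %in% positive) - sum(words %in% negative)) / length(words)
}

net_sentiment("an excellent year despite one difficult quarter and a bad loss")
```

With 1 positive and 3 negative words out of 11, the sentence scores (1 - 3)/11, i.e. slightly negative.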

The results here show that overall, Warren Buffett’s letters have been positive. Over the forty years of letters I’m analyzing here, only 5 show a negative net sentiment score. The five years that do show negative net sentiment scores are closely tied with major negative economic events:

• 1987: The market crash that happened on October 19th, 1987 (Black Monday) is widely known as the largest single-day percentage decline ever experienced for the Dow-Jones Industrial Average, 22.61% in one day.
• 1990: The recession of 1990, triggered by an oil price shock following Iraq’s invasion of Kuwait, resulted in a notable increase in unemployment.
• 2001: Following the 1990s, which represented the longest period of growth in American history, 2001 saw the collapse of the dot-com bubble and associated declining market values, as well as the September 11th attacks.
• 2002: The market, already falling in 2001, continued to see declines throughout much of 2002.
• 2008: The Great Recession was a large worldwide economic recession, characterized by the International Monetary Fund as the worst global recession since World War II. Other related events during this period included the financial crisis of 2007-2008 and the subprime mortgage crisis of 2007-2009.

Another interesting topic to examine is which words were actually the strongest contributors to the positive and negative sentiment in the letters. For this exercise, I analyzed the letters as one single text, and present the most common positive and negative words in the graph below.

The results here are interesting. Many of the most common words–‘gain’, ‘gains’, ‘loss’, ‘losses’, ‘worth’, ‘liability’, and ‘debt’–are what we’d expect given the financial nature of these documents. I find the adjectives that make their way into this set particularly interesting, as they give insight into the way Warren Buffett thinks. On the positive side we have ‘significant’, ‘outstanding’, ‘excellent’, ‘extraordinary’, and ‘competitive’. On the negative side there are ‘negative’, ‘unusual’, ‘difficult’, and ‘bad’. One interesting inclusion that shows some of the limitations of sentiment analysis is ‘casualty’, where Mr. Buffett is not referring to death, but to the basket of property and casualty insurance companies that make up a significant portion of his business holdings.

While the above is interesting, and helps us to highlight the most frequent positive and negative words, it’s a bit limited in the number of words we can present before the graph becomes too crowded. To see a larger number of words, we can use a word cloud. The word cloud below shows 400 of the most commonly used words, split by positive and negative sentiment.

If you’re interested in reproducing this blog post or analysis, please check out the R code I used to produce this document.

To leave a comment for the author, please follow the link and comment on their blog: Michael Toth's Blog.

### Alteryx integrates with Microsoft R

Mon, 03/20/2017 - 20:57

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

You can now use Alteryx Designer, the data science workflow tool from Alteryx, as a drag-and-drop interface for many of the big-data statistical modeling tools included with Microsoft R. Alteryx v11.0 includes expanded support for Microsoft SQL Server 2016, Microsoft R Server, Azure SQL Data Warehouse, and Microsoft Analytics Platform System (APS), with new workflow tools to access functionality without having to write R code manually.

Alteryx V11 adds a new XDF Input tool (and corresponding XDF Output tool) to bring data into Alteryx using the Microsoft R out-of-memory file format. In addition, you can use several other new tools to train statistical models on that data, without needing to bring the data into Alteryx itself (only the meta-data is passed from one node to the next). The models provided include:

• Boosted Model
• Count Regression
• Decision Tree
• Forest Model
• Gamma Regression
• Linear Regression
• Logistic Regression

along with the ability to use stepwise methods to select variables, and scoring (prediction) using trained models.

To use Microsoft R with Alteryx, you will first need to download and install the version of Alteryx Predictive Tools corresponding to the version of R you have installed. Alteryx provides versions for both Microsoft R Server and the free Microsoft R Client, as well as open source R 3.3.2.

For more on the integration of Alteryx with Microsoft R, check out the blog post linked below.

Alteryx Community: Alteryx V11.0 Integrates with Microsoft

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

### What’s in the words? Comparing artists and lyrics with R.

Mon, 03/20/2017 - 17:28

(This article was first published on Rcrastinate, and kindly contributed to R-bloggers)

It’s been a while since I had the opportunity to post something on music. Let’s get back to that.

I got my hands on some song lyrics by a range of artists. (I have an R script to download all lyrics for a given artist from a lyrics website. Since these lyrics are protected by copyright law, I cannot share the download script here, but I can show some of the analyses I made with the lyrics.)

My main question is: What can we learn about an artist, or several artists, when we have a corpus of their lyrics? I’m gonna analyze lyrics by the following artists:

• Beck
• Bob Dylan
• Britney Spears
• Leonard Cohen
• Lou Reed
• Metallica
• Michael Jackson
• Modest Mouse
• Nick Cave and TBS
• Nikka Costa
• Nine Inch Nails
• Nirvana
• PJ Harvey
• Prince
• Rihanna
• The Cure
• The Doors
• The National

Let’s start with an easy one. I wanna know which artist has the longest songs, where I simply measure the length of a song as the number of words in its lyrics.

Mean length of songs in words (click to enlarge).

That’s quite a surprise (at least for me). Rihanna and Britney Spears, certainly the most prototypical pop artists in the list, actually have pretty long lyrics. Another measure from linguistics is the type-token ratio, where the number of different words (types) is divided by the total number of words (tokens). This measure is often interpreted as “lexical diversity” because the vocabulary is less diverse if only a few words are repeated very often. Suppose you have a song that only consists of the words “oh yeah” and this is repeated 10 times: you will have 2 types and 20 tokens, which would lead to a type-token ratio of 2/20 = 0.1.
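The “oh yeah” example can be checked in a couple of lines of R:

```r
# A song that is just "oh yeah" repeated 10 times: 2 types, 20 tokens
tokens <- rep(c("oh", "yeah"), times = 10)
ttr <- length(unique(tokens)) / length(tokens)
ttr  # 0.1
```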
Mean type-token ratio of songs (higher means more diverse vocabulary, click to enlarge).

Well, look at that – Nikka Costa, one of my favorite funk/soul artists comes out on top in this list, followed by Beck and The Doors. Rihanna and Britney obviously have a lot of words in their songs, but with regard to lexical diversity, they rank last within the artists analysed here.

Let’s try something content-related. Obviously, it’s quite hard to tackle the content (or even meaning) of songs. But we can do some really easy stuff. The first thing I want to try is what I want to call the “self-centered ratio”. I simply define a list of keywords (or better: sequences of characters) that are referencing the first person: “i”, “me”, “i’ve”, “i’m”, “my”, “mine”, “myself”. Now I calculate for each song how many of the words in the lyrics are in this list and divide this number by the number of words in the song. Suppose you have a song with these lyrics: “i’m my enemy and my enemy is mine” (I really don’t know what that would mean but that’s just an example, right?). The “self-centered ratio” would be 4/8 = 0.5 because we have “i’m”, “my”, “my” and “mine” and 8 words altogether (“i’m” is counted as one word here because it is not separated by a space). Here is the result.
Mean self-centered ratio of songs (click to enlarge)
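A minimal R sketch of the self-centered ratio, using the keyword list and the example sentence from the text:

```r
# Keyword list for first-person references, as defined above
self_words <- c("i", "me", "i've", "i'm", "my", "mine", "myself")

# Ratio of first-person words to all words (split on whitespace)
self_centered_ratio <- function(lyrics) {
  words <- unlist(strsplit(tolower(lyrics), "\\s+"))
  sum(words %in% self_words) / length(words)
}

self_centered_ratio("i'm my enemy and my enemy is mine")  # 4/8 = 0.5
```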

Britney and Nine Inch Nails are definitely not very similar in terms of their music (that’s a wild guess, I only know very few songs by Britney Spears!), but they are quite similar when it comes to singing about stuff that concerns themselves.

Next up is sentiment analysis. Professionally, I don’t like it very much because in my opinion, it has a lot of empirical and methodological problems. But why not give it a try for this application here? We’re not here for the hard science side of things, are we? So, what I did was basically the same as for the self-centered ratio but only with much bigger keyword lists for positive words and negative words (so, actually I did it twice, one time with positive words and one time with negative words). I got the word lists from here (for negative words) and from here (for positive words).

I show you two plots, one where you can see both ratios and one where I combined both ratios per song to get one value (positive value + negative value). These are the results:
Mean ratio of positive and negative words of songs (click to enlarge).

Mean combined measure for sentiment of songs (click to enlarge).

Actually, this seems to make sense. I’m no expert on Metallica, but for Nine Inch Nails, Nirvana and Radiohead, the second plot looks plausible to me. Also, Prince, Michael Jackson, Nikka Costa, Rihanna and Britney Spears getting an overall positive score works for me. Nick Cave is sometimes called the “Prince of Darkness”. In this analysis, however, this is not really confirmed. Or the “dark” aspects of his lyrics are just hidden from this quite coarse approach. Just think of the song “People just ain’t no good”. Here, each occurrence of “good” is counted as positive because my simple word list approach is simply not sensitive to the negation in this line.

One last thing: I wanted to know if artists can be clustered (grouped) just with the use of their lyrics. What we need is a measure of dissimilarity for each artist-artist combination. There are several ways to do that and I experimented with a few (e.g. cosine distance or correlation of frequency vectors). It turns out there is an even easier measure: take the 500 most frequent words one artist uses in their lyrics, and do the same for another artist. Then intersect these two word lists and divide the size of the intersection by 500. What we get is the ratio of words that are present in both top-500 vocabularies, which is essentially a similarity measure. Taking 1 minus this value gives a dissimilarity measure which we can use as input to a hierarchical cluster analysis. This is what we get.
Dendrogram for a hierarchical cluster analysis of overlapping top-500 words.
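A scaled-down sketch of this procedure, with top-5 vocabularies for three made-up artists instead of top-500 vocabularies for the real ones:

```r
top_n <- 5
# Made-up top-5 word lists for three hypothetical artists
top_words <- list(
  artist_a = c("love", "baby", "dance", "night", "heart"),
  artist_b = c("love", "baby", "party", "night", "girl"),
  artist_c = c("death", "rain", "cold", "night", "soul")
)

n <- length(top_words)
d <- matrix(0, n, n, dimnames = list(names(top_words), names(top_words)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    # dissimilarity = 1 - (shared top words / list length)
    d[i, j] <- 1 - length(intersect(top_words[[i]], top_words[[j]])) / top_n
  }
}

hc <- hclust(as.dist(d))
# artist_a and artist_b share the most vocabulary, so they merge first
```

In the dendrogram, artist_a and artist_b (sharing “love”, “baby” and “night”, dissimilarity 0.4) join before either joins artist_c (dissimilarity 0.8 to both).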

Look at that, I think it works quite nicely: We get a “pop” cluster on the left with Nikka Costa, Britney Spears, Rihanna, Michael Jackson and Prince. Feel free to interpret the other clusters in the comments.

R CODE is coming soon!

LOOK, there are frequency plots available here for all the artists!

To leave a comment for the author, please follow the link and comment on their blog: Rcrastinate.

### Linear Regression and ANOVA shaken and stirred (Part 2)

Mon, 03/20/2017 - 17:00

(This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers)

In the first part of this entry I showed some mtcars examples, where am is useful for explaining ANOVA since its observations are defined as:

$$am_i = \begin{cases}1 &\text{ if car } i \text{ is manual} \cr 0 &\text{ if car } i \text{ is automatic}\end{cases}$$

Now I’ll show another example to continue from Part 1, and then I’ll move to something involving more variables.

ANOVA with one dummy variable

Consider a model where the outcome is mpg and the design matrix is $$X = (\vec{1} \: \vec{x}_2)$$.

From the last entry let

$$x_2 = \begin{cases}1 &\text{ if car } i \text{ is automatic} \cr 0 &\text{ if car } i \text{ is manual}\end{cases}$$

This will lead to this estimate:

$$\hat{\vec{\beta}} = \begin{bmatrix}\bar{y}_1 \cr \bar{y}_2 - \bar{y}_1\end{bmatrix}$$

Fitting the model gives:

y = mtcars$mpg
x1 = mtcars$am
x2 = ifelse(x1 == 1, 0, 1)
fit = lm(y ~ x2)
summary(fit)

Call:
lm(formula = y ~ x2)

Residuals:
    Min      1Q  Median      3Q     Max
-9.3923 -3.0923 -0.2974  3.2439  9.5077

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   24.392      1.360  17.941  < 2e-16 ***
x2            -7.245      1.764  -4.106 0.000285 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.902 on 30 degrees of freedom
Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385
F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

So to see the relationship between the estimates and the group means I need additional steps:

x0 = rep(1, length(y))
X = cbind(x0, x2)
beta = solve(t(X) %*% X) %*% (t(X) %*% y)
beta

        [,1]
x0 24.392308
x2 -7.244939

I obtained the same estimates as with the lm command, so now I calculate the group means:

x1 = ifelse(x1 == 0, NA, x1)
x2 = ifelse(x2 == 0, NA, x2)
m1 = mean(y*x1, na.rm = TRUE)
m2 = mean(y*x2, na.rm = TRUE)
beta0 = m1
beta2 = m2 - m1
beta0; beta2

[1] 24.39231
[1] -7.244939

In this case this means that the slope for the two groups is the same but the intercept is different; therefore, on average, there is a negative effect of automatic transmission on miles per gallon.

Again I’ll verify the equivalency between lm and aov in this particular case:

y = mtcars$mpg
x1 = mtcars$am
x2 = ifelse(x1 == 1, 0, 1)
fit2 = aov(y ~ x2)
fit2$coefficients

(Intercept)          x2
  24.392308   -7.244939

I can calculate the residuals by hand:

mean_mpg = mean(mtcars$mpg)
fitted_mpg = fit2$coefficients[1] + fit2$coefficients[2]*x2
observed_mpg = mtcars$mpg
TSS = sum((observed_mpg - mean_mpg)^2)
ESS = sum((fitted_mpg - mean_mpg)^2)
RSS = sum((observed_mpg - fitted_mpg)^2)
TSS; ESS; RSS

[1] 1126.047
[1] 405.1506
[1] 720.8966

Here its verified that $$TSS = ESS + RSS$$ but aside from that I can extract information from aov:

summary(fit2)

            Df Sum Sq Mean Sq F value   Pr(>F)
x2           1  405.2   405.2   16.86 0.000285 ***
Residuals   30  720.9    24.0
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And check that, as expected, $$ESS$$ is the sum of squares explained by x2.

I also can run ANOVA over lm with:

anova(fit)

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value   Pr(>F)
x2         1 405.15  405.15   16.86 0.000285 ***
Residuals 30 720.90   24.03
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The table provides information on the effect of x2 on y. In this case the null hypothesis is rejected because of the large F-value and the associated small p-value.

Considering a 0.05 significance threshold, I can say at the 95% confidence level that the regression slope is statistically different from zero, or equivalently that there is a difference in group means between automatic and manual transmission.
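A quick way to convince yourself of this equivalence: the regression of mpg on the single transmission dummy is mathematically the same test as a pooled-variance two-sample t-test on the two groups, so both give the same p-value (0.000285, as in the summary above):

```r
# Regression of mpg on the transmission dummy
fit <- lm(mpg ~ am, data = mtcars)
p_lm <- summary(fit)$coefficients["am", "Pr(>|t|)"]

# Pooled-variance two-sample t-test on the same grouping
tt <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

c(regression_p = p_lm, t_test_p = tt$p.value)  # both 0.000285
```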

ANOVA with three dummy variables

Now let’s explore something more complex than am. Reading the documentation I wonder if cyl has an impact on mpg so I explore that variable:

str(mtcars)

'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

unique(mtcars$cyl)

[1] 6 4 8

One (wrong) possibility is to write:

y = mtcars$mpg
x1 = mtcars$cyl; x1 = ifelse(x1 == 4, 1, 0)
x2 = mtcars$cyl; x2 = ifelse(x1 == 6, 1, 0)
x3 = mtcars$cyl; x3 = ifelse(x1 == 8, 1, 0)
fit = lm(y ~ x1 + x2 + x3)
summary(fit)

Call:
lm(formula = y ~ x1 + x2 + x3)

Residuals:
    Min      1Q  Median      3Q     Max
-6.2476 -2.2846 -0.4556  2.6774  7.2364

Coefficients: (2 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  16.6476     0.7987  20.844  < 2e-16 ***
x1           10.0160     1.3622   7.353 3.44e-08 ***
x2                NA         NA      NA       NA
x3                NA         NA      NA       NA
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.66 on 30 degrees of freedom
Multiple R-squared:  0.6431, Adjusted R-squared:  0.6312
F-statistic: 54.06 on 1 and 30 DF,  p-value: 3.436e-08

Here the NAs mean there are variables that are linearly related to the other variables (i.e. the variable marked NA is a linear combination of one or more of the remaining variables, like $$x_2 = 2x_1 + x_3$$), so there is no unique solution to the regression without dropping variables.
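A small sketch of why correctly-coded level dummies run into the same problem: together with the intercept they are perfectly collinear, so the design matrix loses rank (this is the classic dummy variable trap):

```r
# One 0/1 dummy per cyl level (4, 6, 8)
x4 <- ifelse(mtcars$cyl == 4, 1, 0)
x6 <- ifelse(mtcars$cyl == 6, 1, 0)
x8 <- ifelse(mtcars$cyl == 8, 1, 0)

# Every car falls in exactly one level, so the dummies
# sum to the intercept column of ones:
all(x4 + x6 + x8 == 1)  # TRUE

# Hence the 4-column design matrix has rank 3, not 4
qr(cbind(1, x4, x6, x8))$rank  # 3
```

Dropping one dummy (or the intercept) restores full rank, which is exactly what as.factor() handles for you below.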

But R has a command called as.factor() that is useful in these cases and also can save you some lines of code in other cases:

fit2 = lm(mpg ~ as.factor(cyl), data = mtcars)
summary(fit2)

Call:
lm(formula = mpg ~ as.factor(cyl), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-5.2636 -1.8357  0.0286  1.3893  7.2364

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      26.6636     0.9718  27.437  < 2e-16 ***
as.factor(cyl)6  -6.9208     1.5583  -4.441 0.000119 ***
as.factor(cyl)8 -11.5636     1.2986  -8.905 8.57e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared:  0.7325, Adjusted R-squared:  0.714
F-statistic:  39.7 on 2 and 29 DF,  p-value: 4.979e-09

The aov version of this is:

fit3 = aov(mpg ~ as.factor(cyl), data = mtcars)
TukeyHSD(fit3)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = mpg ~ as.factor(cyl), data = mtcars)

$as.factor(cyl)
          diff        lwr        upr     p adj
6-4  -6.920779 -10.769350 -3.0722086 0.0003424
8-4 -11.563636 -14.770779 -8.3564942 0.0000000
8-6  -4.642857  -8.327583 -0.9581313 0.0112287

As I said many times in this entry, ANOVA is linear regression. Interpreting the coefficients is up to you.

To leave a comment for the author, please follow the link and comment on their blog: Pachá (Batteries Included).

### survminer 0.3.0

Mon, 03/20/2017 - 16:45

(This article was first published on Easy Guides, and kindly contributed to R-bloggers)

I’m very pleased to announce that survminer 0.3.0 is now available on CRAN. survminer makes it easy to create elegant and informative survival curves. It also includes functions for summarizing and graphically inspecting the Cox proportional hazards model assumptions.
This is a big release and special thanks go to Marcin Kosiński and Przemysław Biecek for their great work in actively improving and adding new features to the survminer package. The official online documentation is available at http://www.sthda.com/english/rpkgs/survminer/.

Release notes

In this post, we present only the most important changes in v0.3.0. See the release notes for a complete list.

New arguments in ggsurvplot()

• data: Now, it’s recommended to specify the data used to compute survival curves (#142). This avoids the error generated when trying to use the ggsurvplot() function inside other functions (@zzawadz, #125).
• cumevents and cumcensor: logical values for displaying the cumulative number of events table (#117) and the cumulative number of censored subjects table (#155), respectively.
• tables.theme for changing the theme of the tables under the main plot.
• pval.method and log.rank.weights: new possibilities to compare survival curves. Functionality based on survMisc::comp (@MarcinKosinski, #17). Read also the following blog post on the R-Addict website: Comparing (Fancy) Survival Curves with Weighted Log-rank Tests.

New functions

• pairwise_survdiff() for pairwise comparisons of survival curves (#97).
• arrange_ggsurvplots() to arrange multiple ggsurvplots on the same page (#66).

Thanks to the work of Przemysław Biecek, survminer 0.3.0 has received four new functions:

• ggsurvevents() to plot the distribution of event times (@pbiecek, #116).
• ggcoxadjustedcurves() to plot adjusted survival curves for the Cox proportional hazards model (@pbiecek, #133 & @markdanese, #67).
• ggforest() to draw a forest plot (i.e. graphical summary) for the Cox model (@pbiecek, #114).
• ggcompetingrisks() to plot the cumulative incidence curves for competing risks (@pbiecek, #168).
Vignettes and examples

Two new vignettes were contributed by Marcin Kosiński.

Installing and loading survminer

Install the latest development version from GitHub:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/survminer", build_vignettes = TRUE)

Or install the latest release from CRAN as follows:

install.packages("survminer")

To load survminer in R, type this:

library("survminer")

Survival curves

New arguments display supplementary survival tables – cumulative events & censored subjects – under the main survival curves:

• risk.table = TRUE: Displays the risk table.
• cumevents = TRUE: Displays the cumulative number of events table.
• cumcensor = TRUE: Displays the cumulative number of censoring table.
• tables.height = 0.25: Numeric value (in [0 – 1]) to adjust the height of all tables under the main survival plot.

# Fit survival curves
require("survival")
fit <- survfit(Surv(time, status) ~ sex, data = lung)

# Plot informative survival curves
library("survminer")
ggsurvplot(fit, data = lung,
           title = "Survival Curves",
           pval = TRUE, pval.method = TRUE,    # Add p-value & method name
           surv.median.line = "hv",            # Add median survival lines
           legend.title = "Sex",               # Change legend titles
           legend.labs = c("Male", "female"),  # Change legend labels
           palette = "jco",                    # Use JCO journal color palette
           risk.table = TRUE,                  # Add No at risk table
           cumevents = TRUE,                   # Add cumulative No of events table
           tables.height = 0.15,               # Specify tables height
           tables.theme = theme_cleantable(),  # Clean theme for tables
           tables.y.text = FALSE               # Hide tables y axis text
)

Cumulative events and censored tables are good additional feedback to survival curves, so that one can see both the size of the risk set AND the cause that makes the risk set smaller: events or censoring.
Arranging multiple ggsurvplots on the same page

The function arrange_ggsurvplots() [in survminer] can be used to arrange multiple ggsurvplots on the same page.

# List of ggsurvplots
splots <- list()
splots[[1]] <- ggsurvplot(fit, data = lung, risk.table = TRUE,
                          tables.y.text = FALSE, ggtheme = theme_light())
splots[[2]] <- ggsurvplot(fit, data = lung, risk.table = TRUE,
                          tables.y.text = FALSE, ggtheme = theme_grey())

# Arrange multiple ggsurvplots and print the output
arrange_ggsurvplots(splots, print = TRUE,
                    ncol = 2, nrow = 1, risk.table.height = 0.25)

If you want to save the output into a pdf, type this:

# Arrange and save into pdf file
res <- arrange_ggsurvplots(splots, print = FALSE)
ggsave("myfile.pdf", res)

Distribution of events’ times

The function ggsurvevents() [in survminer] calculates and plots the distribution for events (both status = 0 and status = 1). It helps to notice when censoring is more common (@pbiecek, #116). This is an alternative to cumulative events and censored tables, described in the previous section. For example in the colon dataset, as illustrated below, censoring occurs mostly after the 6th year:

require("survival")
surv <- Surv(colon$time, colon$status)
ggsurvevents(surv)

Adjusted survival curves for Cox model

Adjusted survival curves show how a selected factor influences survival estimated from a Cox model. If you want to read more about why we need to adjust survival curves, see this document: Adjusted survival curves. Briefly, in clinical investigations, there are many situations where several known factors potentially affect patient prognosis. For example, suppose two groups of patients are compared: those with and those without a specific genotype. If one of the groups also contains older individuals, any difference in survival may be attributable to genotype or age or indeed both. Hence, when investigating survival in relation to any one factor, it is often desirable to adjust for the impact of others.
The Cox proportional hazards model is one of the most important methods used for modelling survival analysis data. Here, we present the function ggcoxadjustedcurves() [in survminer] for plotting adjusted survival curves for the Cox proportional hazards model. The ggcoxadjustedcurves() function models the risks due to the confounders as described in section 5.2 of this article: Terry M Therneau (2015); Adjusted survival curves. Briefly, the key idea is to predict survival for all individuals in the cohort, and then take the average of the predicted curves by groups of interest (for example, sex, age, genotype groups, etc.).

# Data preparation and computing cox model
library(survival)
lung$sex <- factor(lung$sex, levels = c(1,2), labels = c("Male", "Female"))
res.cox <- coxph(Surv(time, status) ~ sex + age + ph.ecog, data = lung)

# Plot the baseline survival function,
# showing all individual predicted surv. curves
ggcoxadjustedcurves(res.cox, data = lung, individual.curves = TRUE)

# Adjusted survival curves for the variable "sex"
ggcoxadjustedcurves(res.cox, data = lung,
                    variable = lung[, "sex"],  # Variable of interest
                    legend.title = "Sex",      # Change legend title
                    palette = "npg",           # nature publishing group color palettes
                    curv.size = 2              # Change line size
)

Graphical summary of Cox model

The function ggforest() [in survminer] can be used to create a graphical summary of a Cox model, also known as a forest plot. For each covariate, it displays the hazard ratio (HR) and the 95% confidence intervals of the HR. By default, covariates with a significant p-value are highlighted in red.
# Fit a Cox model
library(survival)
res.cox <- coxph(Surv(time, status) ~ sex + age + ph.ecog, data = lung)
res.cox

## Call:
## coxph(formula = Surv(time, status) ~ sex + age + ph.ecog, data = lung)
##
##               coef exp(coef) se(coef)     z       p
## sexFemale -0.55261   0.57544  0.16774 -3.29 0.00099
## age        0.01107   1.01113  0.00927  1.19 0.23242
## ph.ecog    0.46373   1.58999  0.11358  4.08 4.4e-05
##
## Likelihood ratio test=30.5  on 3 df, p=1.08e-06
## n= 227, number of events= 164
##    (1 observation deleted due to missingness)

# Create a forest plot
ggforest(res.cox)

Pairwise comparisons for survival curves

When you compare three or more survival curves at once, the function survdiff() [in the survival package] returns a global p-value for whether or not to reject the null hypothesis. With this, you know that a difference exists between groups, but you don’t know where. You can’t know until you test each combination. Therefore, we implemented the function pairwise_survdiff() [in survminer]. It calculates pairwise comparisons between group levels with corrections for multiple testing.

• Multiple survival curves with global p-value:

library("survival")
library("survminer")

# Survival curves with global p-value
data(myeloma)
fit2 <- survfit(Surv(time, event) ~ molecular_group, data = myeloma)
ggsurvplot(fit2, data = myeloma,
           legend.title = "Molecular Group",
           legend.labs = levels(myeloma$molecular_group),
           legend = "right",
           pval = TRUE, palette = "lancet")

survminer

• Pairwise survdiff:
# Pairwise survdiff
res <- pairwise_survdiff(Surv(time, event) ~ molecular_group, data = myeloma)
res
##
## 	Pairwise comparisons using Log-Rank test
##
## data:  myeloma and molecular_group
##
##                  Cyclin D-1 Cyclin D-2 Hyperdiploid Low bone disease MAF   MMSET
## Cyclin D-2       0.723      -          -            -                -     -
## Hyperdiploid     0.328      0.103      -            -                -     -
## Low bone disease 0.644      0.447      0.723        -                -     -
## MAF              0.943      0.723      0.103        0.523            -     -
## MMSET            0.103      0.038      0.527        0.485            0.038 -
## Proliferation    0.723      0.988      0.103        0.485            0.644 0.062
##
## P value adjustment method: BH
• Symbolic number coding:
# Symbolic number coding
symnum(res$p.value, cutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 0.1, 1),
       symbols = c("****", "***", "**", "*", "+", " "),
       abbr.colnames = FALSE, na = "")
##                  Cyclin D-1 Cyclin D-2 Hyperdiploid Low bone disease MAF MMSET
## Cyclin D-2
## Hyperdiploid
## Low bone disease
## MAF
## MMSET                       *                                        *
## Proliferation                                                            +
## attr(,"legend")
## [1] 0 '****' 1e-04 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
## NA: ''

Visualizing competing risk analysis

Competing risk events refer to a situation where an individual (patient) is at risk of more than one mutually exclusive event, such as death from different causes, and the occurrence of one of these will prevent any other event from ever happening. For example, when studying relapse in patients who underwent HSCT (hematopoietic stem cell transplantation), transplant-related mortality is a competing risk event, and the cumulative incidence function (CIF) must be calculated with appropriate accounting for it. A competing risks analysis is implemented in the R package cmprsk. Here, we provide the ggcompetingrisks() function [in survminer] to plot the results using ggplot2-based elegant data visualization.

# Create a demo data set
set.seed(2)
failure_time <- rexp(100)
status <- factor(sample(0:2, 100, replace = TRUE), 0:2,
                 c('no event', 'death', 'progression'))
disease <- factor(sample(1:3, 100, replace = TRUE), 1:3,
                  c('BRCA', 'LUNG', 'OV'))

# Cumulative incidence function
require(cmprsk)
fit3 <- cuminc(ftime = failure_time, fstatus = status, group = disease)

# Visualize
ggcompetingrisks(fit3, palette = "Dark2", legend = "top",
                 ggtheme = theme_bw())

The ggcompetingrisks() function also supports multi-state survival objects (type = "mstate"), where the status variable can have multiple levels. The first of these stands for censoring, and the others for the various event types, e.g., causes of death.
# Data preparation
df <- data.frame(time = failure_time, status = status, group = disease)

# Fit multi-state survival
library(survival)
fit5 <- survfit(Surv(time, status, type = "mstate") ~ group, data = df)
ggcompetingrisks(fit5, palette = "jco")

Related articles

Infos

This analysis has been performed using R software (ver. 3.3.2).

To leave a comment for the author, please follow the link and comment on their blog: Easy Guides. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R Correlation Tutorial

Mon, 03/20/2017 - 15:41

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

In this tutorial, you explore a number of data visualization methods and their underlying statistics, particularly with regard to identifying trends and relationships between variables in a data frame. That's right, you'll focus on concepts such as correlation and regression! First, you'll get introduced to correlation in R. Then, you'll see how you can plot correlation matrices in R, using packages such as ggplot2 and GGally. Lastly, you'll see what types of correlations exist and how they matter for your further analysis. If you're interested in diving deeper into this topic, consider taking DataCamp's Correlation and Regression course.
Background

In today's tutorial, you'll be working with a data set of movies acquired from Kaggle to discover how you can better understand the relationships among its variables. I have lightly cultivated the data so that our analysis is an "apples-to-apples" one, by ensuring that things like currency use the same units. Without that step, our statistical analysis of variables like gross, budget, and profit would be misleading. You can access the original data set here.

Importing The Data

In order to access the movies data set and put it to use, you can use the read.csv() function to import your data into a data frame and store it in the variable with the stunningly original name movies!

movies <- read.csv(url("http://s3.amazonaws.com/dcwoods2717/movies.csv"))

That's all it takes to get started!

Basic Inspection of Your Data

It's a good idea, once a data frame has been imported, to get an idea about your data. First, check out the structure of the data that is being examined. Below you can see the results of using this super-simple, helpful function str():

str(movies)
## 'data.frame':    2961 obs. of  11 variables:
##  $ title              : Factor w/ 2907 levels "10 Cloverfield Lane",..: 1560 2143 34 2687 1405 1896 2633 894 1604 665 ...
##  $ genre              : Factor w/ 17 levels "Action","Adventure",..: 6 12 5 5 5 3 2 8 3 8 ...
##  $ director           : Factor w/ 1366 levels "Aaron Schneider",..: 474 472 781 828 175 1355 1328 1328 968 747 ...
##  $ year               : int  1920 1929 1933 1935 1936 1937 1939 1939 1940 1946 ...
##  $ duration           : int  110 100 89 81 87 83 102 226 88 144 ...
##  $ gross              : int  3000000 2808000 2300000 3000000 163245 184925485 22202612 198655278 84300000 20400000 ...
##  $ budget             : int  100000 379000 439000 609000 1500000 2000000 2800000 3977000 2600000 8000000 ...
##  $ cast_facebook_likes: int  4 109 995 824 352 229 2509 1862 1178 2037 ...
##  $ votes              : int  5 4546 7921 13269 143086 133348 291875 215340 90360 6304 ...
##  $ reviews            : int  2 107 162 164 331 349 746 863 252 119 ...
##  $ rating             : num  4.8 6.3 7.7 7.8 8.6 7.7 8.1 8.2 7.5 6.9 ...

In this particular data frame, you can see from the console that 2961 observations of 11 variables are present.

As a side note, even if each movie was only an hour long, you’d need to watch movies non-stop for over four months to see them all!

The console also lists each variable by name, the class of each variable, and a few instances of each variable. This gives us a pretty good idea of what is in the data frame, the understanding of which is crucial to our analytic endeavors.

Another great function to help us perform a quick, high-level overview of our data frame is summary(). Note the similarities and differences between the output produced by running str().

summary(movies)
##                          title          genre
##  Home                       :   3   Comedy   :848
##  A Nightmare on Elm Street  :   2   Action   :738
##  Across the Universe        :   2   Drama    :498
##  Alice in Wonderland        :   2   Adventure:288
##  Aloha                      :   2   Crime    :202
##  Around the World in 80 Days:   2   Biography:135
##  (Other)                    :2948   (Other)  :252
##              director         year         duration
##  Steven Spielberg :  23   Min.   :1920   Min.   : 37.0
##  Clint Eastwood   :  19   1st Qu.:1999   1st Qu.: 95.0
##  Martin Scorsese  :  16   Median :2004   Median :106.0
##  Tim Burton       :  16   Mean   :2003   Mean   :109.6
##  Spike Lee        :  15   3rd Qu.:2010   3rd Qu.:119.0
##  Steven Soderbergh:  15   Max.   :2016   Max.   :330.0
##  (Other)          :2857
##      gross               budget          cast_facebook_likes
##  Min.   :      703   Min.   :      218   Min.   :     0
##  1st Qu.: 12276810   1st Qu.: 11000000   1st Qu.:  2241
##  Median : 34703228   Median : 26000000   Median :  4604
##  Mean   : 58090401   Mean   : 40619384   Mean   : 12394
##  3rd Qu.: 75590286   3rd Qu.: 55000000   3rd Qu.: 16926
##  Max.   :760505847   Max.   :300000000   Max.   :656730
##
##      votes            reviews           rating
##  Min.   :      5   Min.   :   2.0   Min.   :1.600
##  1st Qu.:  19918   1st Qu.: 199.0   1st Qu.:5.800
##  Median :  55749   Median : 364.0   Median :6.500
##  Mean   : 109308   Mean   : 503.3   Mean   :6.389
##  3rd Qu.: 133348   3rd Qu.: 631.0   3rd Qu.:7.100
##  Max.   :1689764   Max.   :5312.0   Max.   :9.300
##

With a single command, you’ve had R return some key statistical information for each variable in our data frame. Now that you know what you’re working with, let’s dive in and explore the data some further!

Feature Engineering: Calculating Profit

In reviewing the variables available to you, it appears that some of the numeric variables can be manipulated to provide new insights into our data frame.

For instance, you’ve got gross and budget variables, so why not use a little subsetting to calculate the profit for each movie?

You can calculate profit by using the formula profit = gross - budget. Try it out in the DataCamp Light chunk below!

Awesome!

You can see that there is quite a collection of money-makers and money-pits in our data frame.

The profit of our movies can probably be used for some interesting analyses down the road, so let’s go ahead and add profit as a new column to our data frame.

# Add a column for `profit` to `movies`
movies$profit <- movies$gross - movies$budget

Correlation

Now that profit has been added as a new column in our data frame, it’s time to take a closer look at the relationships between the variables of your data set.

Let’s check out how profit fluctuates relative to each movie’s rating.

For this, you can use R's built-in plot and abline functions, where plot will produce a scatter plot and abline will add a regression line, or "line of best fit", thanks to the linear model we pass to it, as you will see below.

# Create the scatter plot with `rating` on the x-axis and `profit` on the y-axis
plot(movies$rating, movies$profit)

# Add a regression line with the form `abline(lm(y ~ x))`
abline(lm(movies$profit ~ movies$rating))

Was the output pretty much what you expected?

In general, it appears that movies with a higher rating tend to have higher profit. Another way to phrase this statement would be to say that a positive correlation exists between rating and profit, at least in our data frame.

That said, even a cursory glance at the plot reveals that there are plenty of highly rated movies that weren’t exactly blockbusters, and there are a number of very profitable movies that got relatively low ratings.

Correlation does NOT imply causation!

One anecdote to help you understand correlation versus causation is as follows: I run an ice cream stand at the beach. The average number of jaywalkers in the city tends to increase when my ice cream sales do, but is my ice cream causing people to disregard traffic laws, or is some other force in play? My ice cream is admittedly awesome, but the sunshine outside probably has something to do with people wanting ice cream and people not wanting to stand around getting sunburnt at crosswalks. A relationship (correlation) does exist between my ice cream sales and the number of jaywalkers, but you can't definitively state that it's a causal relationship.
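The ice cream anecdote is easy to reproduce in a few lines of R. In this simulated sketch (all numbers are invented for illustration), sunshine drives both ice cream sales and jaywalking, so the two are strongly correlated even though neither causes the other — and the association vanishes once you control for sunshine:

```r
set.seed(42)
sunshine   <- runif(200, 0, 10)            # the hidden common cause
icecream   <- 5 * sunshine + rnorm(200)    # sales driven by sunshine
jaywalkers <- 2 * sunshine + rnorm(200)    # jaywalking also driven by sunshine

cor(icecream, jaywalkers)  # strongly positive, yet not causal

# Controlling for sunshine (correlating the residuals) removes the association
cor(resid(lm(icecream ~ sunshine)), resid(lm(jaywalkers ~ sunshine)))
```

The second correlation is close to zero: once the common cause is accounted for, nothing links the two variables.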

Keep that line of reasoning in mind as you proceed through this tutorial!

Calculating Correlation in R

So what types of relationships exist between the variables in movies, and how can you evaluate those relationships quantitatively?

The first way is to produce correlations and correlation matrices with cor():

# Compute the Pearson correlation between rating and profit
cor(movies$rating, movies$profit)

# Correlation matrix of the numeric columns
cor(movies[, 4:10])

Note that you can also specify the method argument to indicate which correlation coefficient you want to compute. Be careful, because each method comes with its own assumptions: the Kendall and Spearman methods only make sense for ordered (rankable) inputs, since both are computed from the ranks of your data rather than the raw values. R handles the ranking internally, but your variables must at least be ordinal for the result to be meaningful.

Additionally, the default method, the Pearson correlation, assumes that your variables are normally distributed, that there is a straight-line relationship between each pair of variables, and that the data is normally distributed about the regression line.
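To see why the choice of method matters, compare Pearson and Spearman on a relationship that is perfectly monotone but not linear — a small sketch with made-up data:

```r
set.seed(7)
x <- runif(100)
y <- x^3  # strictly increasing, but curved

cor(x, y)                        # Pearson: below 1, since the fit isn't a straight line
cor(x, y, method = "spearman")   # Spearman works on ranks: exactly 1
```

Spearman reports a perfect association because the ranks of y match the ranks of x exactly, while Pearson is penalized by the curvature.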

Note also that you can use rcorr(), which is part of the Hmisc package, to compute the significance levels for Pearson and Spearman correlations.
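If you'd rather not pull in an extra package for a single pair of variables, base R's cor.test() also reports a p-value alongside the coefficient — a quick sketch with simulated data:

```r
set.seed(123)
x <- rnorm(50)
y <- 0.8 * x + rnorm(50)  # built-in positive relationship

ct <- cor.test(x, y, method = "pearson")
ct$estimate  # the correlation coefficient
ct$p.value   # significance level of the correlation
```

Hmisc::rcorr() is still the more convenient choice when you want significance levels for a whole matrix at once.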

You'll also see that the correlation matrix is actually a table that shows correlation coefficients between sets of variables. This is already a great first way to get an idea of which relationships exist between the variables of your data set, but let's go a bit deeper into this in the next section.

Visually Exploring Correlation: The R Correlation Matrix

In this next exploration, you’ll plot a correlation matrix using the variables available in your movies data frame. This simple plot will enable you to quickly visualize which variables have a negative, positive, weak, or strong correlation to the other variables.

To pull this wizardry off, you’ll be using a nifty package called GGally and a function called ggcorr().

The form of this function call will be ggcorr(df), where df is the name of the data frame you’re calling the function on. The output of this function will be a triangular-shaped, color-coded matrix labelled with our variable names. The correlation coefficient of each variable relative to the other variables can be found by reading across and / or down the matrix, depending on the variable’s location in the matrix.

It may sound confusing, but it’s really simple in practice, so give it a shot in the following code chunk!

So in the correlation matrix you’ve just created (great job, by the way!), you can go to the row or column associated with a variable, such as year and see its correlation coefficient as indicated by the color of the cell that corresponds with another variable.

In examining year, for example, you can see that there is a weak, positive correlation with budget and a similarly weak, negative correlation with rating.

Correlation coefficients are always between -1 and 1, inclusive. A correlation coefficient of -1 indicates a perfect, negative fit in which y-values decrease at the same rate that x-values increase. A correlation coefficient of 1 indicates a perfect, positive fit in which y-values increase at the same rate that x-values increase.

In most cases, such as your year example above, the correlation coefficient will be somewhere between -1 and 1.
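You can confirm the boundary cases directly in R: any exact linear relationship hits the ±1 limits, regardless of slope or intercept, while unrelated data falls strictly in between.

```r
set.seed(1)
x <- 1:10
noise <- rnorm(10)

cor(x, 2 * x + 3)  # perfect positive linear fit: coefficient of 1
cor(x, -x)         # perfect negative linear fit: coefficient of -1
cor(x, noise)      # unrelated noise: somewhere strictly between -1 and 1
```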

Did you notice anything a bit odd about the variables shown in the correlation matrix?

Not all of the variables in movies are present!

That’s because not all of the variables in movies are numeric. The ggcorr function disregards non-numeric variables automatically, which saves some time by exempting the user from “subsetting-out” such variables prior to creating the matrix.
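Dropping the non-numeric columns yourself is a one-liner, which is handy when a function is less forgiving than ggcorr. A toy sketch (the data frame here is invented):

```r
df <- data.frame(title  = c("a", "b", "c"),
                 year   = c(2001, 2002, 2003),
                 rating = c(3.5, 4.0, 2.5))

# Keep only the numeric columns, as ggcorr does internally
df_num <- df[, sapply(df, is.numeric)]
names(df_num)  # only "year" and "rating" survive
```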

If you want your correlation matrix to really “pop” (or maybe you’re a little colorblind, like me), there are a few simple tweaks you can make to produce more visually-compelling, data-loaded matrices.

In the code chunk that follows, you’ve included the label argument, which can be set equal to either TRUE or FALSE. The default setting is FALSE, but if you add label = TRUE, the correlation coefficient of each relationship is included in the appropriate cell. This keeps you from guessing the value of each coefficient based off the color scale.

The label_alpha argument allows you to increase or decrease the opacity of each label based on the strength of the correlation coefficient. This is super helpful in quick, visual data analysis.

Don’t take my word for it, try it for yourself!

Now let’s plot a better correlation matrix:

ggcorr(movies,
       label = TRUE,
       label_alpha = TRUE)

In the plots that follow, you will see that when a plot with a “strong” correlation is created, the slope of its regression line (x/y) is closer to 1/1 or -1/1, while a “weak” correlation’s plot may have a regression line with barely any slope. A slope closer to 1/1 or -1/1 implies that the two variables plotted are closely related.

In the case of our movies data frame, such a regression line can lend powerful insights about the nature of your variables and may indicate an interdependence of those variables.

For instance, in your previous plot of profit over rating, you saw that our regression line had a moderate, positive slope. Taking a look back at your correlation matrix, you see that the correlation coefficient of these two variables is 0.3.

Makes sense, right?

Correlation Types

Let’s now check out additional plots illustrating the different types of correlations that you have seen in the previous section!

Strong Correlation: Plotting Votes Versus Reviews

Lets begin by plotting two variables with a strong, positive correlation.

In looking at our correlation matrix, it seems that votes versus reviews meets your criteria, with a correlation value of 0.8. That means that our slope should be relatively close to 1/1.

You will be implementing the ggplot2 package in this exercise and the exercises that follow. The ggplot2 package is a versatile toolset that can be used to create compelling data visualizations.

In this case, you’re going to take advantage of the qplot function within the ggplot2 package, which can produce many kinds of plots based on what plot type is passed to the geom (geometry) argument.

In the code chunk that follows, geom has been set to a vector containing two types of geometry, point and smooth, where point results in a scatterplot and smooth generates a trendline. You'll also notice that method has been set to lm, which means the trendline will be the familiar straight regression line, like the one created in our earlier plot of profit over rating.

In short, you’re about to learn how easy it is to generate killer data visualizations in a flash!

# Plot votes vs reviews
qplot(movies$votes,
      movies$reviews,
      data = movies,
      geom = c("point", "smooth"),
      method = "lm",
      alpha = I(1 / 5),
      se = FALSE)

You may have recognized the alpha argument included in the qplot function above. The use of alpha in qplot applies a gradient of opacity to the points in the scatter plot similar to how it changed the opacity of the correlation coefficient labels in ggcorr.

In the configuration you used, total opacity is only achieved when a point is overlapped by 5 other points. Increasing or decreasing the number in the denominator of alpha will affect the number of overlapping points required to change a point’s opacity. One reason to use this aesthetic is that it can help users quickly identify concentrations of data points in their plots, which in turn can bring new insights about our data to light with only a glance.

Go ahead and tinker with the code and see how it works for yourself!

Weak Correlation: Plotting Profit Over Years

Now let’s see what a “weak”, negative correlation looks like by plotting profit versus years. This time, no method has been specified, so our trendline is going to vary according to the data in a curvilinear fashion (a fitted curve), which is the default for this function. You’ll notice that along certain parts of the trendline there is a light gray area that increases or decreases in size according to the confidence interval of the trendline.

If you’re not familiar with the concept of a confidence interval, don’t worry! R will do the work for you, and the concept will be better explained after this exercise.

For now, just fill in the blanks with the necessary code and observe!

# Plot profit over years
qplot(movies$year,
      movies$profit,
      data = movies,
      geom = c("point", "smooth"),
      alpha = I(1 / 5))

So now that you’ve had a chance to see the confidence interval and smoothing curve in action, what observations can you make?

• First, you see that in the initial portion of the plot, profit seems to be increasing each year.
• Second, you notice that the grey area offset from your smoothing curve is initially quite large and the number of data points (or observations) is quite small.
• Third, as the curve moves into a larger concentration of observations, the grey area diminishes and practically hugs the curve, which begins to decline in later years.

In conclusion, plotting the confidence interval in conjunction with the smoothing curve allows us to see the amount of uncertainty associated with our regression line. That uncertainty increases with fewer observations and decreases with more observations.

This is very helpful in visualizing how the relationship between our variables changes throughout the data frame and reveals how concentrations of observations alter the plotted curve.
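The link between the width of the grey band and the local number of observations can be demonstrated with a loess smoother on simulated data, where the left side of the x-range is deliberately starved of points (the values below are invented for illustration):

```r
set.seed(11)
x <- c(seq(0, 3, length.out = 5),  # only 5 points on the left...
       runif(95, 3, 10))           # ...95 points on the right
y <- 2 * x + rnorm(100, sd = 2)

fit <- loess(y ~ x)
p <- predict(fit, data.frame(x = c(1, 8)), se = TRUE)
p$se.fit  # larger standard error (wider band) at x = 1 than at x = 8
```

The standard error of the fitted curve, which drives the width of the confidence band, is larger in the sparse region.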

The caveat here is that the plot may be a bit misleading, as it’s hard to visually discern whether there is a positive or negative correlation of your variables.

Let's now plot profit over years in the same way you plotted votes versus reviews: simply replace the "....." in each argument so that year is plotted on the x-axis and profit is on the y-axis. Make sure to specify which data frame "data" equals!

# Plot the years on the x-axis, profit on the y-axis
qplot(movies$year,
      movies$profit,
      data = movies,
      geom = c("point", "smooth"),
      method = "lm",
      alpha = I(1 / 5),
      se = FALSE)

Looking back to your correlation matrix, you see that the correlation coefficient of profit and year is -0.1, which is perhaps more readily visualized in the plot that incorporates a line of best fit than it is in the plot that used a fitted curve.

Tying it all together

Let’s now take a look at another powerful function available in GGally called ggpairs. This function is great because it allows users to create a matrix that shows the correlation coefficient of multiple variables in conjunction with a scatterplot (including a line of best fit with a confidence interval) and a density plot.

Using this one function, you can effectively combine everything you’ve covered in this tutorial in a concise, readily comprehensible fashion.

In the following code chunk, pick your three favorite variables from the data frame (no subsetting required!), plug them in, and play!

# Plug in your three favorite variables and tinker away!
ggpairs(movies,
        columns = c("cast_facebook_likes", "budget", "profit"),
        upper = list(continuous = wrap("cor", size = 10)),
        lower = list(continuous = "smooth"))

As you may have guessed, there are many scenarios in which one type of plot is ideal – the key is to learn what visualization techniques are available, how to use them, and what the end result means!

Going Forward

You’ve covered a lot of ground in this tutorial, so congratulations for making it through to the end!

I hope you are able to put these concepts to work in your own analytical adventures! These are really the basics of data exploration in R and there’s so much more that you can do to make sure that you get a good feel for your data before you start analyzing and modeling it. Why not take your efforts up a notch and start DataCamp’s Exploratory Data Analysis course?

In the meantime, I encourage you to head on over to Kaggle and find a rich, fascinating data set of your own to explore if you don't want to continue with the movies data set!

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog.

### Data validation with the assertr package

Mon, 03/20/2017 - 15:03

(This article was first published on R – On the lambda, and kindly contributed to R-bloggers)

Version 2.0 of my data set validation package assertr hit CRAN just this weekend. It has some pretty great improvements over version 1. For those new to the package, what follows is a short and new introduction. For those who are already using assertr, the text below will point out the improvements.

I can (and have) go on and on about the treachery of messy/bad datasets. Though it's substantially less exciting than… pretty much everything else, I believe (in proportion to the heartache and stress it causes) we don't spend enough time talking about it or building solutions around it. No matter how new and fancy your ML algorithm is, its success is predicated upon a properly sanitized dataset. If you are using bad data, your approach will fail—either flagrantly (best case), or unnoticeably (considerably more probable and considerably more pernicious).

assertr is an R package to help you identify common dataset errors. More specifically, it helps you easily spell out your assumptions about how the data should look and alerts you to any deviation from those assumptions.

I'll return to this point later in the post when we have more background, but I want to be up front about the goals of the package; assertr is not (and can never be) a "one-stop shop" for all of your data validation needs. The specific kinds of checks individuals or teams have to perform on any particular dataset are often far too idiosyncratic to ever be exhaustively addressed by a single package (although, the assertive meta-package may come very close!) But all of these checks will reuse motifs and follow the same patterns. So, instead, I'm trying to sell assertr as a way of thinking about dataset validations—a set of common dataset validation actions. If we think of these actions as verbs, you could say that assertr attempts to impose a grammar of error checking for datasets.

In my experience, the overwhelming majority of data validation tasks fall into only five different patterns:

• For every element in a column, you want to make sure it fits certain criteria. Examples of this strain of error checking would be to make sure every element is a valid credit card number, or fits a certain regex pattern, or represents a date between two other dates. assertr calls this verb assert.
• For every element in a column, you want to make sure certain criteria are met, but the criteria can only be decided after looking at the entire column as a whole. For example, testing whether each element is within n standard deviations of the mean of the elements requires computing on the whole column first to inform the criterion to check against. assertr calls this verb insist.
• For every row of a dataset, you want to make sure certain assumptions hold. Examples include ensuring that no row has more than n number of missing values, or that a group of columns are jointly unique and never duplicated. assertr calls this verb assert_rows.
• For every row of a dataset, you want to make sure certain assumptions hold, but the criteria can only be decided after looking at the entire dataset as a whole. This closely mirrors the distinction between assert and insist, but for entire rows (not individual elements). An example of using this would be checking to make sure that the Mahalanobis distance between each row and all other rows is within n standard deviations of the mean distance. assertr calls this verb insist_rows.
• You want to check some property of the dataset as a whole object. Examples include making sure the dataset has more than n columns, making sure the dataset has some specified column names, etc… assertr calls this last verb verify.

Some of this might sound a little complicated, but I promise this is a worthwhile way to look at dataset validation. Now we can begin with an example of what can be achieved with these verbs. The following example is borrowed from the package vignette and README…

Pretend that, before finding the average miles per gallon for each number of engine cylinders in the mtcars dataset, we wanted to confirm the following dataset assumptions…

• that it has the columns mpg, vs, and am
• that the dataset contains more than 10 observations
• that the column for ‘miles per gallon’ (mpg) is a positive number
• that the column for ‘miles per gallon’ (mpg) does not contain a datum that is outside 4 standard deviations from its mean
• that the am and vs columns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only
• each row contains at most 2 NAs
• each row is unique jointly between the mpg, am, and wt columns
• each row’s Mahalanobis distance is within 10 median absolute deviations of all the distances (for outlier detection)
library(dplyr)
library(assertr)

mtcars %>%
  verify(has_all_names("mpg", "vs", "am", "wt")) %>%
  verify(nrow(.) > 10) %>%
  verify(mpg > 0) %>%
  insist(within_n_sds(4), mpg) %>%
  assert(in_set(0, 1), am, vs) %>%
  assert_rows(num_row_NAs, within_bounds(0, 2), everything()) %>%
  assert_rows(col_concat, is_uniq, mpg, am, wt) %>%
  insist_rows(maha_dist, within_n_mads(10), everything()) %>%
  group_by(cyl) %>%
  summarise(avg.mpg = mean(mpg))

Before assertr version 2, the pipeline would immediately terminate at the first failure. Sometimes this is a good thing. However, sometimes we’d like to run a dataset through our entire suite of checks and record all failures. The latest version includes the chain_start and chain_end functions; all assumptions within a chain (below a call to chain_start and above chain_end) will run from beginning to end and accumulate errors along the way. At the end of the chain, a specific action can be taken but the default is to halt execution and display a comprehensive report of what failed including line numbers and the offending datum, where applicable.
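As a minimal sketch of what a chain looks like (the checks here all pass on mtcars, so the pipeline runs through; a failing check inside the chain would be recorded and reported at chain_end rather than halting immediately):

```r
library(dplyr)
library(assertr)

# Every assumption between chain_start and chain_end is evaluated even
# if an earlier one fails; errors accumulate and are reported together.
result <- mtcars %>%
  chain_start %>%
  verify(nrow(.) > 10) %>%
  assert(in_set(0, 1), am, vs) %>%
  chain_end %>%
  group_by(cyl) %>%
  summarise(avg.mpg = mean(mpg))
```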

Another major improvement since the last version of assertr on CRAN is that assertr errors are now S3 classes (instead of dumb strings). Additionally, the behavior of each assertion statement on success (no error) and failure can now be flexibly customized. For example, you can now tell assertr to just return TRUE or FALSE instead of, respectively, returning the data passed in or halting execution. Alternatively, you can instruct assertr to give a warning instead of throwing a fatal error. For more information on this, see help("success_and_error_functions").
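For instance, using the success_logical and error_logical helpers documented in help("success_and_error_functions"), an assertion can be made to return a plain logical instead of the data or an error:

```r
library(assertr)

# success_logical / error_logical make the assertion return TRUE or
# FALSE instead of returning the data passed in or halting execution.
ok <- assert(mtcars, within_bounds(0, Inf), mpg,
             success_fun = success_logical,
             error_fun = error_logical)
```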

Beyond these examples

Since the package was initially published on CRAN (almost exactly two years ago) many people have asked me how they should go about using assertr to test a particular assumption (and I’m very happy to help if you have one of your own, dear reader!) In every single one of these cases, I’ve been able to express it as an incantation using one of these 5 verbs. It also underscored, to me, that creating specialized functions for every need is a pipe dream. There are, however, two good pieces of news.

The first is that there is another package, assertive (vignette here) that greatly enhances the assertr experience. The predicates (functions that start with “is_”) from this (meta)package can be used in assertr pipelines just as easily as assertr’s own predicates. And assertive has an enormous number of them! Some specialized and exciting examples include is_hex_color, is_ip_address, and is_isbn_code!

The second is that if assertive doesn’t have what you’re looking for, with just a little bit of study of the assertr grammar you can whip up your own predicates with relative ease. Using some of these basic constructs and a little effort, I’m confident that the grammar is expressive enough to completely adapt to your needs.
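As a sketch, a custom predicate is just a function of a single value that returns TRUE or FALSE; the predicate and data below are hypothetical examples, not part of assertr:

```r
library(assertr)

# Hypothetical predicate: does the value look like a two-letter,
# upper-case country code?
is_country_code <- function(x) grepl("^[A-Z]{2}$", x)

d <- data.frame(code = c("NZ", "AU", "US"), stringsAsFactors = FALSE)

# Passes for every element of the column, so the data is returned
# unchanged and the pipeline could continue.
checked <- assert(d, is_country_code, code)
```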

If this package interests you, I urge you to read the most recent package vignette here. If you’re an assertr old-timer, I point you to this NEWS file that lists the changes from the previous version.

To leave a comment for the author, please follow the link and comment on their blog: R – On the lambda.

### House effects in New Zealand voting intention polls

Mon, 03/20/2017 - 12:00

This post is one of a series leading up to a purely data-driven probabilistic prediction model for the New Zealand general election in 2017. No punditry will be indulged in (if only to avoid complications with my weekday role as an apolitical public servant)! This is straight statistics, if there is such a thing…

There are important sources of uncertainty in political polling other than sampling error

An important question in using polling/survey data to predict election results is how to quantify the “house effects” of the different polling companies conducting surveys of potential voters. This is important for establishing which of the various pollsters is best at predicting the overall result, which has obvious commercial implications for them, and also for predicting future election results for polling aggregators.

The naive approach is the one taken in this article on polling leading up to the 2014 New Zealand general election, which takes polling results as predictions of the eventual result. A more sophisticated approach is to treat polling numbers as data in a predictive model in which the election vote is a time-bound outcome. Basically, voters get to change their mind between a poll and the election; the more time between the two, the more randomness there is, and the uncertainty from this can certainly swamp the often-quoted margins of error.

If we look at the data from three major pollsters that have both a track record and continue to poll today (March 2017) for New Zealand intended party vote, we see that discrepancies between election results and voting intention estimated from opinion polls are not purely a matter of pollsters’ uncertainty or error:

The case of the Labour Party in the period between the 2011 and 2014 elections is a particular case in point. All three pollsters in the chart had, on average, higher intended party vote for Labour during the period than eventually transpired. But the consistency of their patterns strongly suggests that there was a genuine trend over time, best captured by Reid Research and Roy Morgan, with a decrease in support in the months leading up to the actual election.

In contrast, in the 2011 and 2014 (and perhaps 2008) elections, all three pollsters do indeed seem to systematically underestimate the vote for New Zealand First.

There are several things to take account of

A well-regarded approach to studying the relationship between opinion polls and electoral results has been based on a seminal 2005 article by Simon Jackman, “Pooling the Polls Over an Election Campaign”:

“Poll results vary over the course of an election campaign and across polling organisations, making it difficult to track genuine changes in voter support. I present a statistical model that tracks changes in voter support over time by pooling the polls, and corrects for variation across polling organisations due to biases known as ‘house effects’. The result is a less biased and more precise estimate of vote intentions than is possible from any one poll alone.”

Simon Jackman

The method is also described in his classic text, Bayesian Analysis for the Social Sciences. The statistical method theorises a latent, largely unobserved (ie except on election day) state space of the voting intention each day, and that polls are various dirty variables linked to the state of that latent variable, but not direct observations of it. The state space changes at random due to unknown forces over time, and the various opinion polls are grossly imperfect glimpses of its reality. The method is state of the art but very computationally intensive – it requires estimating the state of the voting intention for each party on each day since observations began, resulting in many thousands of parameters if the model is fit over multiple election cycles. Computational methods exist for fitting such models and I intend to do so at some point, but I also wanted a quicker look at the data that doesn’t take hours or days to fit.

It’s worth noting at this point that there’s good reason to believe that the latent, unobserved voting intention isn’t as volatile as polls indicate – ie an endorsement of Jackman’s method or a similar smoothing approach to aggregating polls, one that is more conservative about change in the underlying intention than even the common weighted rolling average method of aggregation. A brilliant study during the 2012 US Presidential election by Andrew Gelman and Andrew Rothschild showed compellingly that much of the fluctuation in apparent voting intention comes from bias in responses:

“When there was good news for Mitt Romney, more Republicans opted to respond to the poll; when Obama was riding high, Democrats were more likely to respond. The result was that large and systematic changes in nonresponse had the effect of amplifying small changes in actual voter intention.”

These results are compelling evidence for some kind of smoothing of polling results: one that not only provides a weighted average of recent polls, but also applies some healthy statistical skepticism (or regularisation, shrinking towards zero the apparent rapid moves up and down in voting intention).

Generalized additive models give a cheap and quick solution to a broad range of smoothing challenges

To get a glance at house effects in New Zealand polling, I decided to use generalized additive models for the polls leading up to individual elections, for one party at a time, to produce a predicted election result for each pollster-party-election combination. Most studies I’ve seen in this space have applied Jackman’s method to a single election cycle; the computational problems magnify with longer periods as do the data management challenges. My nzelect R package makes available multiple years of polling data sourced from Wikipedia (albeit unfortunately without sample size and margin of error information); using a GAM rather than a state space model massively reduces the computational challenges.
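As a minimal sketch of the idea (with simulated data and hypothetical column names, not the post’s actual code, which appears at the end): a GAM models vote share as a smooth function of time plus a fixed offset per pollster, and those offsets estimate the house effects.

```r
library(mgcv)
set.seed(123)

# Simulated polling data: a smooth trend in true support plus a fixed
# per-pollster bias ("house effect") and sampling noise.
polls <- data.frame(
  date_num = rep(1:100, 3),
  pollster = rep(c("A", "B", "C"), each = 100)
)
trend <- 0.30 + 0.05 * sin(polls$date_num / 20)
bias  <- c(A = 0.00, B = 0.02, C = -0.01)[polls$pollster]
polls$vote <- trend + bias + rnorm(300, sd = 0.005)

# Smooth trend plus a fixed pollster offset; the pollster coefficients
# estimate each house effect relative to pollster A.
mod <- gam(vote ~ s(date_num) + pollster, data = polls)
coef(mod)[c("pollsterB", "pollsterC")]
```

Predicting from such a model at the election date, with and without the pollster terms, gives a per-pollster “predicted result” to compare against the actual outcome.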

I limit myself to the seven parties with a party vote likely to influence the 2017 election and which also contested previous elections; and the three major pollsters who both have a track record and an intention of polling leading up to the 2017 election (my understanding is that Digipoll does not have such an intention; otherwise they would be a fourth pollster to include in the analysis).

For each combination of party and election year I used the polling data from all pollsters to predict the election result, allowing an absolute level of bias for each pollster in the model. For two of the pollsters I have four complete election cycles of data, and I obtained these results:

Roy Morgan stands out as particularly inclined to over-estimate the vote for the Greens; and Colmar Brunton to under-estimate the vote for New Zealand First.

For the third pollster I only have two election cycles of data and I get these results:

Here is a table summarising the amount by which each pollster appears to over/under-estimate voting intention, comparing the prediction for election day based on patterns to date with the actual result:

| Party | Colmar Brunton | Reid Research | Roy Morgan |
|---------------|-------|-------|-------|
| ACT | -0.2% | -0.1% | 0.2% |
| Green | 0.6% | 2% | 2.6% |
| Labour | -0.1% | -0.9% | -1% |
| Maori | -0.2% | -0.2% | -0.1% |
| National | 3.2% | 3.2% | 0.1% |
| NZ First | -1.8% | -2.5% | 0% |
| United Future | -0.4% | -0.3% | 0% |

Number of polls

For those who are interested, here is an exploratory chart on the number of actual polls available by pollster and election cycle which led me to my conclusion to fit models just to polls from Colmar Brunton, Reid Research and Roy Morgan:

R code

Here is the code in R for all the above. Caveat – this is very much a spare time project for me, and none of this has been peer reviewed. I’d never allow this situation in my professional life, but here I am publishing results on a political subject this way… Use at your own risk, and if you spot something wrong or an idea for improvement, please let me know.