R-bloggers: R news and tutorials contributed by 750 R bloggers

Introduction to Volatility

Wed, 07/12/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)



This is the beginning of a series on portfolio volatility, variance, and standard deviation. I realize that it’s a lot more fun to fantasize about analyzing stock returns, which is why television shows and websites constantly update the daily market returns and give them snazzy green and red colors. But good ol’ volatility is quite important in its own right, especially to finance geeks, aspiring finance geeks, and institutional investors. If you are, might become, or might ever work with/for any of those, this series should at least serve as a jumping-off point.

Briefly, our volatility project will proceed as follows:

Part 1
  1. Portfolio Volatility Intro: by-hand, matrix algebra, built-in and compare to SPY

  2. Visualizing Volatility Intro: chart portfolio sd over time – will have to roll apply

  3. Shiny app to test different portfolios

Part 2
  1. Asset Contributions to Volatility: by-hand, matrix algebra, built-in and visualize snapshot with bar graph

  2. Chart contributions over time: flag any asset that surpasses a threshold

  3. Shiny app to test different portfolios

Part 3
  1. Minimum Variance Portfolio: find minimum variance portfolio weights

  2. Shiny app to test different portfolios

A quick word of warning: this series begins at the beginning with portfolio standard deviation, builds up to a more compelling data visualization in the next post, and finally a nice Shiny app after that. R users with experience in the world of volatility may wish to skip this post and wait for the visualizations in the next one. That said, I would humbly offer a couple of benefits to the R code that awaits us.

First, volatility is important, possibly more important than returns. I don’t think any investment professional looks back on hours spent pondering volatility as a waste of time. Plus, today we’ll look at a new way to convert daily prices to monthly using the tidyquant package, and that might offer enough new substance.

Second, as always, we have an eye on making our work reproducible and reusable. This Notebook makes it exceedingly clear how we derive our final data visualizations on portfolio volatility. It’s a good template for other visualization derivations, even if standard deviation is old hat for you.

Okay, without further ado, here’s where we are headed today:

  1. Import prices and calculate returns for 5 assets and construct a portfolio.

  2. Calculate the standard deviation of monthly portfolio returns using three methods:
  • the old-fashioned equation
  • matrix algebra
  • a built-in function from the PerformanceAnalytics package
  3. Compare those to the standard deviation of monthly SPY returns.

On to step 1, wherein we import prices and calculate returns for the 5 ETFs to be used in our portfolio. Those are AGG (a US bond fund), DBC (a commodities fund), EFA (a non-US equities fund), SPY (an S&P 500 ETF), and VGT (a technology fund).

Let’s import prices and save them to an xts object.

# Packages used in this post; the original assumes they are already loaded.
library(quantmod)              # getSymbols(), Cl()
library(tidyquant)             # tq_transmute(), as_tibble(), as_xts()
library(tidyverse)             # %>%, map(), reduce(), mutate(), gather(), spread()
library(lubridate)             # ymd()
library(PerformanceAnalytics)  # StdDev()

# A vector of symbols for our ETFs.
symbols <- sort(c("SPY", "VGT", "EFA", "DBC", "AGG"))

# Pipe them to getSymbols, extract the closing prices, and merge to one xts object.
# Take a look at the result before moving on to calculate the returns.
# Notice that we are only grabbing prices from 2013 to present, but that is
# only to keep the loading time shorter for the post.
prices <- getSymbols(symbols,
                     src = 'google',
                     from = "2013-01-01",
                     auto.assign = TRUE,
                     warnings = FALSE) %>%
  map(~Cl(get(.))) %>%
  reduce(merge) %>%
  `colnames<-`(symbols)

Next we want to turn those daily prices into monthly returns. We will pipe the prices object to the tq_transmute() function from the tidyquant package, but we can’t do that directly. First we need to transform our xts object to a tibble using a call to as_tibble(preserve_row_names = TRUE) from tidyquant. There’s probably a more efficient way to set this up (check out the tidyquant articles to learn more) than this whole piped construct, but I derive a certain pleasure from toggling between tibble and xts objects.

# We are going to make heavy use of the tidyquant package to get monthly returns.
portfolio_component_monthly_returns_xts <- prices %>%
  # Convert to tibble so can stay in the tidyquant/verse.
  as_tibble(preserve_row_names = TRUE) %>%
  # Add a date column.
  mutate(date = ymd(row.names)) %>%
  # Remove the row.names column; it's not needed anymore.
  select(-row.names) %>%
  # I like to have the date column as the first column.
  select(date, everything()) %>%
  # We need to gather into long format in order to use tq_transmute().
  gather(asset, return, -date) %>%
  group_by(asset) %>%
  # Use the function from tidyquant; note how easily we could change to
  # a different time period like weekly or yearly.
  tq_transmute(mutate_fun = periodReturn, period = "monthly") %>%
  # Put the results back to wide format.
  spread(asset, monthly.returns) %>%
  # Convert back to an xts, so we can use the cov() and StdDev() functions.
  as_xts(date_col = date)

head(portfolio_component_monthly_returns_xts)
##                      AGG          DBC         EFA         SPY          VGT
## 2013-01-31 -0.0050473186  0.021162123  0.02147558  0.02492127 -0.008422235
## 2013-02-28  0.0039858683 -0.047067088 -0.01288572  0.01275885  0.005237826
## 2013-03-28 -0.0009022828  0.006634722  0.01305393  0.03337511  0.026615970
## 2013-04-30  0.0073150908 -0.038081289  0.05018650  0.01921236  0.005349794
## 2013-05-31 -0.0217859064 -0.015607156 -0.03019051  0.02354709  0.042434166
## 2013-06-28 -0.0174136193 -0.028228925 -0.04611287 -0.01847773 -0.031675393

Take a quick look at the monthly returns above, to make sure things appear to be in order.

Now, on to constructing a portfolio and calculating volatility. To turn these five ETFs into a portfolio, we need to assign them weights. Let’s first create a weights vector.

weights <- c(0.10, 0.10, 0.15, 0.40, 0.15)

Before we use the weights in our calculations, we perform a quick sanity check in the next code chunk. This might not be necessary with five assets as we have today, but it is good practice because if we had 50 assets, it could save us a lot of grief to catch a mistake early.

# Make sure the weights line up with assets.
asset_weights_sanity_check <- tibble(weights, symbols)

asset_weights_sanity_check
## # A tibble: 5 x 2
##   weights symbols
##     <dbl>   <chr>
## 1    0.10     AGG
## 2    0.10     DBC
## 3    0.15     EFA
## 4    0.40     SPY
## 5    0.15     VGT
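Another quick check worth adding here (my own addition, not in the original post) is the total allocation; for a fully invested, unlevered portfolio you would expect the weights to sum to 1:

# Total allocation implied by the weights vector.
sum(weights)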

Alright, now on to the fun part, wherein we use the textbook equation for the standard deviation of a multi-asset portfolio.

  • First, we assign the weights of each asset.
  • Then, we isolate and assign returns of each asset.
  • Next, we plug those weights and returns into the equation for portfolio standard deviation (written out as a formula just after this list), which involves the following:
  • Take the weight squared of each asset times its variance, and sum those weighted variance terms.
  • Then we take the covariance of each asset pair, multiplied by two times the weight of the first asset times the weight of the second asset.
  • Sum together the covariance terms and the weighted variance terms. This gives us the portfolio variance.
  • Then take the square root to get the standard deviation.
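In symbols, with weights $w_i$ and monthly returns $R_i$ for assets $i = 1, \dots, 5$, the quantity the list above describes is

$$\sigma_{portfolio} = \sqrt{\;\sum_{i=1}^{5} w_i^{2}\,\mathrm{Var}(R_i) \;+\; 2\sum_{i<j} w_i\,w_j\,\mathrm{Cov}(R_i, R_j)\;}$$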
# This code chunk is intentionally verbose, repetitive, and inefficient
# to emphasize how to break down volatility and grind through the equation.

# Let's assign each asset a weight from our weights vector above.
w_asset1 <- weights[1]
w_asset2 <- weights[2]
w_asset3 <- weights[3]
w_asset4 <- weights[4]
w_asset5 <- weights[5]

# And each asset has a return as well, stored in our
# portfolio_component_monthly_returns_xts object.
asset1 <- portfolio_component_monthly_returns_xts[, 1]
asset2 <- portfolio_component_monthly_returns_xts[, 2]
asset3 <- portfolio_component_monthly_returns_xts[, 3]
asset4 <- portfolio_component_monthly_returns_xts[, 4]
asset5 <- portfolio_component_monthly_returns_xts[, 5]

# I am going to label this 'sd_by_hand' to distinguish it from the matrix algebra
# we use later, and a built-in function for the same operation.
sd_by_hand <-
  # Important, don't forget to take the square root!
  sqrt(
    # Our weighted variance terms.
    (w_asset1^2 * var(asset1)) + (w_asset2^2 * var(asset2)) +
    (w_asset3^2 * var(asset3)) + (w_asset4^2 * var(asset4)) +
    (w_asset5^2 * var(asset5)) +
    # Our weighted covariance terms.
    (2 * w_asset1 * w_asset2 * cov(asset1, asset2)) +
    (2 * w_asset1 * w_asset3 * cov(asset1, asset3)) +
    (2 * w_asset1 * w_asset4 * cov(asset1, asset4)) +
    (2 * w_asset1 * w_asset5 * cov(asset1, asset5)) +
    (2 * w_asset2 * w_asset3 * cov(asset2, asset3)) +
    (2 * w_asset2 * w_asset4 * cov(asset2, asset4)) +
    (2 * w_asset2 * w_asset5 * cov(asset2, asset5)) +
    (2 * w_asset3 * w_asset4 * cov(asset3, asset4)) +
    (2 * w_asset3 * w_asset5 * cov(asset3, asset5)) +
    (2 * w_asset4 * w_asset5 * cov(asset4, asset5))
  )

# I want to print the percentage, so multiply by 100 and round.
sd_by_hand_percent <- round(sd_by_hand * 100, 2)

Okay, writing that equation out was painful and very copy/pasty, but at least we won’t be forgetting it any time soon. Our result is a monthly portfolio returns standard deviation of 2.22%.

Now, let’s turn to the less verbose matrix algebra path and confirm that we get the same result.

First, we will build a covariance matrix of returns using the cov() function.

# Build the covariance matrix.
covariance_matrix <- cov(portfolio_component_monthly_returns_xts)

covariance_matrix
##               AGG           DBC          EFA           SPY           VGT
## AGG  8.185794e-05 -0.0000645834 6.046268e-05 -1.370992e-06 -1.412557e-06
## DBC -6.458340e-05  0.0018089475 4.198803e-04  2.931955e-04  1.901253e-04
## EFA  6.046268e-05  0.0004198803 1.251623e-03  7.809411e-04  9.529822e-04
## SPY -1.370992e-06  0.0002931955 7.809411e-04  8.119639e-04  9.003360e-04
## VGT -1.412557e-06  0.0001901253 9.529822e-04  9.003360e-04  1.329909e-03

Have a look at the covariance matrix.

AGG, the US bond ETF, has a negative covariance with the other ETFs (besides EFA – something to note), and it should make a nice volatility dampener. Interestingly, the covariance between DBC, a commodities ETF, and VGT is quite low, as well. Our painstakingly written-out equation above is a good reminder of how low covariances affect total portfolio standard deviation.

Back to our calculation: let’s take the square root of the transpose of the weights vector times the covariance matrix times the weights vector. To perform matrix multiplication, we use %*%.
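In matrix notation, the quantity we are about to compute is

$$\sigma_{portfolio} = \sqrt{\mathbf{w}^{\top}\,\Sigma\,\mathbf{w}}$$

where $\mathbf{w}$ is the weights vector and $\Sigma$ is the covariance matrix of returns.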

# If we wrote out the matrix multiplication, we would get the original by-hand equation.
sd_matrix_algebra <- sqrt(t(weights) %*% covariance_matrix %*% weights)

# I want to print out the percentage, so I'll multiply by 100 and round.
sd_matrix_algebra_percent <- round(sd_matrix_algebra * 100, 2)

The by-hand calculation is 2.22% and the matrix algebra calculation is 2.22%. Thankfully, these return the same result, so we don’t have to sort through the by-hand equation again.

Finally, we can use the built-in StdDev() function from the PerformanceAnalytics package. It takes two arguments: returns and weights.

# Confirm portfolio volatility.
portfolio_sd <- StdDev(portfolio_component_monthly_returns_xts, weights = weights)

# I want to print out the percentage, so I'll multiply by 100 and round.
portfolio_sd_percent <- round(portfolio_sd * 100, 2)

We now have:

  • by-hand calculation = 2.22%
  • matrix algebra calculation = 2.22%
  • built-in function calculation = 2.22%

Huzzah! That was quite a lot of work to confirm that the results of three calculations are equal to each other, but there are a few benefits.

First, while it was tedious, we should all be pretty comfortable with calculating portfolio standard deviations in various ways. That might never be useful to us, until the day that for some reason it is (e.g., if during an interview someone asks you to go to a whiteboard and write down the code for standard deviation or whatever equation/model – I think that’s still a thing in interviews).

More importantly, as our work gets more complicated and we build custom functions, we'll want to rely on the built-in StdDev() function, and we now have confidence in its accuracy. Even better, now that we have the template above, we can reuse it for other portfolios (a quick sketch of that reuse follows).
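As a small illustration of that reuse (my own sketch, not part of the original post), the whole matrix-algebra calculation collapses into a helper that takes any xts of component returns and a weights vector; the function name is hypothetical:

# Hypothetical helper: portfolio standard deviation from component returns and weights.
portfolio_sd_fun <- function(returns_xts, w) {
  stopifnot(ncol(returns_xts) == length(w))
  sqrt(as.numeric(t(w) %*% cov(returns_xts) %*% w))
}

# portfolio_sd_fun(portfolio_component_monthly_returns_xts, weights)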

Also, as usual, this is more of a toy example than an actual template for use in industry. If a team relies heavily on pre-built functions, even those built by the team itself, it's not a bad idea to have a grind-it-out sanity check Notebook like this one. It reminds team members what a pre-built function might be doing under the hood.

Now, let’s turn to a little bit of portfolio theory (or, why we want to build a portfolio instead of putting all of our money into SPY). We believe that by building a portfolio of assets whose covariances of returns are lower than the variance of SPY returns (or, equivalently, lower than the covariance of SPY returns with themselves), we can construct a portfolio whose standard deviation is lower than the standard deviation of SPY. If we believe that standard deviation and volatility are a good proxy for risk, then the portfolio would have a lower risk.

To see if we succeeded, first, isolate the returns of SPY, then find the standard deviation of those returns.

# First get the returns of the S&P 500 isolated.
spy_returns <- portfolio_component_monthly_returns_xts$SPY

# Now calculate the standard deviation.
spy_sd <- StdDev(spy_returns)

# To confirm the variance of SPY's returns is equal to
# the covariance of SPY's returns with themselves,
# uncomment and run the next two lines of code.
# spy_var <- var(spy_returns)
# spy_cov <- cov(spy_returns, spy_returns)

# We could also have extracted this value from the SPY column and SPY row of the
# covariance matrix, since the covariance of SPY with itself is equal to its variance.
# spy_sd_from_cov_matrix <- sqrt(covariance_matrix[4, 4])

# Again, I want percent, so will multiply by 100 and round.
spy_sd_percent <- round(spy_sd * 100, 2)

The standard deviation of monthly SPY returns is 2.85% and that of the portfolio is 2.22%.

Fantastic, our portfolio has lower monthly volatility!

Alright, despite the fact that we have completely ignored returns, we can see the volatility benefits of assets with low or even negative covariances. That’s all for today’s introduction to volatility. Next time, we will move to visualizing these differences in a Notebook, before heading to the world of Shiny.



To leave a comment for the author, please follow the link and comment on their blog: R Views.

Registration open for rstudio::conf 2018!

Wed, 07/12/2017 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

RStudio is very excited to announce that rstudio::conf 2018 is open for registration!

rstudio::conf, the conference on all things R and RStudio, will take place February 2 and 3, 2018 in San Diego, California, preceded by Training Days on January 31 and February 1.

This year’s conference will feature keynotes from Di Cook, Monash University Professor and Iowa State University Emeritus Faculty; and J.J. Allaire, RStudio Founder, CEO & Principal Developer, along with talks from Shiny creator Joe Cheng and (no-introduction-necessary) Hadley Wickham.

We are especially happy to announce that the following outstanding R innovators have already accepted invitations to speak:

Speaker (Role, Affiliation):

  • Mara Averick (Research Analyst, Web Developer, Data Nerd; TCB Analytics)
  • Nick Carchedi (Director of Course Development; Datacamp)
  • Tanya Cashorali (Data Entrepreneur; TCB Analytics)
  • Eric Colson (Chief Algorithms Officer; Stitch Fix)
  • Sandra Griffith (Senior Methodologist; Flatiron Health)
  • Aaron Horowitz (Analytics Team Leader; McKinsey)
  • JD Long (Economist; Cocktail Party Host)
  • Elaine McVey (Data Science Lead; TransLoc)
  • Logan Meltabarger (Information Science & Analytics; Slalom Consulting)
  • Ian Lyttle (Senior Staff Engineer; Schneider Electric)
  • Edzer Pebesma (Institute for Geoinformatics; University of Muenster)
  • Thomas Peterson (Analytics Programmer; SKAT)
  • Olga Pierce (Deputy Data Editor, Reporter; ProPublica)
  • David Robinson (Data Scientist; Stack Overflow)

Attendees will also hear from a rare assembly of popular RStudio data scientists and developers like Yihui Xie, Winston Chang, Garrett Grolemund, Jenny Bryan, Max Kuhn, Kevin Ushey, Gabor Csardi, Amanda Gadrow, Jeff Allen, Jonathan McPherson, Javier Luraschi, Lionel Henry, Jim Hester, Bárbara Borges Ribeiro, Nathan Stephens, Sean Lopp, Edgar Ruiz, Phil Bowsher, Jonathan Regenstein, Mine Çetinkaya-Rundel and Joseph Rickert.

Additional speakers will be added soon.

The conference will feature more than 60 sessions, with three tracks specifically designed for people newer to R, advanced R users, and those looking for solutions to an interesting problem or industry application.

Preceding the conference, on January 31 and February 1, RStudio will offer two days of optional in-person training. This year, workshop choices include:

3 workshops for those newer to R (Workshop: Instructor):
  • Intro to R and RStudio (2 days): RStudio Instructor TBD
  • Data Science in the Tidyverse (2 days): Charlotte Wickham
  • Intro to Shiny and RMarkdown (2 days): Mine Çetinkaya-Rundel

5 workshops for those familiar with R (Workshop: Instructor):
  • Applied machine learning (2 days): Max Kuhn
  • Intermediate Shiny (2 days): Joe Cheng
  • Extending the tidyverse (2 days): Hadley Wickham
  • What they forgot to teach you about R (2 days): Jenny Bryan
  • Big Data with R (2 days): Edgar Ruiz

2 workshops for RStudio Partners and Administrators (Workshop: Instructor):
  • Tidyverse Trainer Certification (2 days): Garrett Grolemund
  • RStudio Connect Administrator Certification (1 day): Sean Lopp

Who should go?

rstudio::conf is for RStudio users, R administrators, and RStudio partners who want to learn how to write better Shiny applications, explore all the new capabilities of R Markdown, apply R to big data and work effectively with Spark, understand the tidyverse of tools for data science with R, discover best practices and tips for coding with RStudio, and investigate enterprise scale development & deployment practices and tools.

Why do people go to rstudio::conf?

Because there is simply no better way to immerse in all things R & RStudio.

rstudio::conf 2017 | Average Satisfaction Rating 8.7 out of 10

  • “Overall, one of the best conferences I’ve attended on any subject.”
  • “I had a blast and learned a lot. Glad I came.”
  • “There were real takeaways I could immediately apply to my work. It was a very effective way to be exposed to all of the new tools and ideas in the R community. I would have to read hundreds of blog posts and tweets.”
  • “It was just a really fun and collegial conference.”

Also, it’s in San Diego.

What should I do now?

Be an early bird! Attendance is limited. All seats are available on a first-come, first-served basis. Early Bird registration discounts are available (Conference only) and a capped number of Academic discounts are also available for eligible students and faculty. If all tickets available for a particular workshop are sold out before you are able to purchase, we apologize in advance!

Please go to www.rstudio.com/conference to purchase.

We hope to see you in San Diego at rstudio::conf 2018!

For questions or issues registering, please email conf@rstudio.com. If you would like to sponsor rstudio::conf 2018 please email anne@rstudio.com.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

easy riddle

Wed, 07/12/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

From the current Riddler, a problem that only requires a few lines of code and a few seconds of reasoning. Or not.

N households each stole the earnings from one of the (N-1) other households, one at a time. What is the probability that a given household is not burglarised? And what are the expected final earnings of each household in the list, assuming they all start with $1?

The first question is close to Feller's envelope problem in that the probability of a given household escaping burglary,

$$\left(1 - \frac{1}{N-1}\right)^{N-1},$$

is close to exp(-1) for N large. The second question can easily be solved by an R code like

N=1e3; M=1e6
fina=rep(1,N)
for (v in 1:M){
  ordre=sample(1:N)
  vole=sample(1:N,N,rep=TRUE)
  while (min(abs(vole-(1:N)))==0)
    vole[abs(vole-(1:N))==0]=sample(1:N, sum(vole-(1:N)==0))
  cash=rep(1,N)
  for (t in 1:N){
    cash[ordre[t]]=cash[ordre[t]]+cash[vole[t]]; cash[vole[t]]=0}
  fina=fina+cash[ordre]}
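As a quick numerical check of the first answer (my own sketch, not from the original post), one can estimate the probability that a given household escapes burglary and compare it with exp(-1):

# Estimate P(household 1 is never robbed) by simulation.
N <- 1e3; M <- 1e4
escaped <- 0
for (m in 1:M) {
  vole <- sample(1:N, N, replace = TRUE)        # who each household robs
  while (any(vole == 1:N)) {                    # redraw any self-robberies
    idx <- which(vole == 1:N)
    vole[idx] <- sample(1:N, length(idx), replace = TRUE)
  }
  if (!any(vole == 1)) escaped <- escaped + 1   # household 1 left untouched
}
escaped / M   # should be close to exp(-1), roughly 0.368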

which returns a pretty regular exponential-like curve, although I cannot figure out the exact form of the curve beyond the third burglary. The published solution gives the curve

corresponding to the probability of never being robbed (and getting on average an extra unit from the robbery) and of being robbed only before robbing someone else (with average wealth N/(N-1)).

Filed under: Books, Kids, R Tagged: FiveThirtyEight, mathematical puzzle, R, simulation, The Riddler, William Feller


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

Asynchronous API calls with postlightmercury

Tue, 07/11/2017 - 21:53

(This article was first published on Renglish – 56north | Skræddersyet dataanalyse, and kindly contributed to R-bloggers)

In this post I'll tell you about a new package I've built and also take you under the hood to show you a really cool thing that's going on: asynchronous API calls.

The package: postlightmercury

I created a package called postlightmercury which is now on CRAN. The package is a wrapper for the Mercury web parser by Postlight.

Basically you sign up for free, get an API key, and with that you can send it URLs that it then parses for you. This is actually pretty clever if you are scraping a lot of different websites and you don't want to write a web parser for each and every one of them.

Here is how the package works.

Installation

Since the package is on CRAN, it's very straightforward to install it:

install.packages("postlightmercury") Load libraries

We'll need the postlightmercury, dplyr and stringr libraries:

library(postlightmercury)
library(dplyr)
library(stringr)

Get an API key

Before you can use the package you need to get an API key from Postlight. Get yours here: https://mercury.postlight.com/web-parser/ Replace the XXXX’s below with your new API key.

Parse a URL

We will use this extremely sad story from the BBC that Gangnam Style is no longer the #1 most viewed video on YouTube. Sad to see a masterpiece like that get dethroned.

# Then run the code below replacing the X's with your api key:
parsed_url <- web_parser(page_urls = "http://www.bbc.co.uk/news/entertainment-arts-40566816",
                         api_key = XXXXXXXXXXXXXXXXXXXXXXX)

As you can see below the result is a tibble (data frame) with 14 different variables:

glimpse(parsed_url)
## Observations: 1
## Variables: 14
## $ title          "Gangnam Style is no longer the most-played vid...
## $ author         "Mark Savage BBC Music reporter"
## $ date_published NA
## $ dek            NA
## $ lead_image_url "https://ichef.bbci.co.uk/news/1024/cpsprodpb/9...
## $ content        (raw article HTML; rendered by the blog page and omitted here)
## $ next_page_url  NA
## $ url            "http://www.bbc.co.uk/news/entertainment-arts-4...
## $ domain         "www.bbc.co.uk"
## $ excerpt        "Psy's megahit was the most-played video for fi...
## $ word_count     685
## $ direction      "ltr"
## $ total_pages    1
## $ rendered_pages 1

Parse more than one URL:

You can also parse more than one URL. Instead of one, let's try giving it three URLs: two about Gangnam Style and one about Sauerkraut; with all that dancing, proper nutrition is after all important.

urls <- c("http://www.bbc.co.uk/news/entertainment-arts-40566816",
          "http://www.bbc.co.uk/news/world-asia-30288542",
          "https://www.bbcgoodfood.com/howto/guide/health-benefits-sauerkraut")

# Then run the code below replacing the X's with your api key:
parsed_url <- web_parser(page_urls = urls,
                         api_key = XXXXXXXXXXXXXXXXXXXXXXX)

Just like before the result is a tibble (data frame) with 14 different variables – but this time with 3 observations instead of one:

glimpse(parsed_url)
## Observations: 3
## Variables: 14
## $ title          "Gangnam Style is no longer the most-played vid...
## $ author         "Mark Savage BBC Music reporter", NA, "Nicola S...
## $ date_published NA, NA, NA
## $ dek            NA, NA, NA
## $ lead_image_url "https://ichef.bbci.co.uk/news/1024/cpsprodpb/9...
## $ content        (raw article HTML for each page; rendered by the blog page and omitted here)
## $ next_page_url  NA, NA, NA
## $ url            "http://www.bbc.co.uk/news/entertainment-arts-4...
## $ domain         "www.bbc.co.uk", "www.bbc.co.uk", "www.bbcgoodf...
## $ excerpt        "Psy's megahit was the most-played video for fi...
## $ word_count     685, 305, 527
## $ direction      "ltr", "ltr", "ltr"
## $ total_pages    1, 1, 1
## $ rendered_pages 1, 1, 1

Clean the content of HTML

The content block keeps the HTML of the website:

str_trunc(parsed_url$content[1], 500, "right")
## [1] (the first 500 characters of raw HTML; the blog page renders it, showing the byline
##     "By Mark Savage BBC Music reporter" and image copyright/caption markup)

We can clean that quite easily:

parsed_url$content <- remove_html(parsed_url$content)

str_trunc(parsed_url$content[1], 500, "right")
## [1] "By Mark Savage BBC Music reporter Image copyright Schoolboy/Universal Republic Records Image caption Gangnam Style had been YouTube's most-watched video for five years Psy's Gangnam Style is no longer the most-watched video on YouTube.The South Korean megahit had been the site's most-played clip for the last five years.The surreal video became so popular that it \"broke\" YouTube's play counter, exceeding the maximum possible number of views (2,147,483,647), and forcing the company to rewr..."

And that is basically what the package does!

Under the hood: asynchronous API calls

Originally I wrote the package using the httr package, which I normally use for my everyday API-calling business.

But after reading about the crul package on R-bloggers and how it can handle asynchronous api calls I rewrote the web_parser() function so it uses the crul package.

This means that instead of calling each URL sequentially it calls them in parallel. This no doubt has major implications if you want to call a lot of URLs and can speed up your analysis significantly.

The web_parser function looks like this under the hood (look for where the magic happens):

web_parser <- function (page_urls, api_key) {
  if (missing(page_urls))
    stop("One or more urls must be provided")
  if (missing(api_key))
    stop("API key must be provided. Get one here: https://mercury.postlight.com/web-parser/")

  ### THIS IS WHERE THE MAGIC HAPPENS
  async <- lapply(page_urls, function(page_url) {
    crul::HttpRequest$new(url = "https://mercury.postlight.com/parser",
                          headers = list(`x-api-key` = api_key))$get(query = list(url = page_url))
  })
  res <- crul::AsyncVaried$new(.list = async)
  ### END OF MAGIC

  output <- res$request()
  api_content <- lapply(output, function(x) x$parse("UTF-8"))
  api_content <- lapply(api_content, jsonlite::fromJSON)
  api_content <- null_to_na(api_content)
  df <- purrr::map_df(api_content, tibble::as_tibble)
  return(df)
}

As you can see from the above code, I create a list async that holds the three different URL calls. I then add these to the res object. When I call the results from res it will fetch the data in parallel if there is more than one URL. That is pretty smart!

You can use this basic template for your own API calls if you have a function that routinely calls several URLs sequentially.
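To make that concrete, here is a stand-alone sketch (my own, not from the package) of the same pattern applied to arbitrary GET requests; the httpbin.org URLs are just placeholders:

library(crul)

urls <- c("https://httpbin.org/get?x=1",
          "https://httpbin.org/get?x=2",
          "https://httpbin.org/get?x=3")

# Build one HttpRequest per URL, then hand the whole list to AsyncVaried.
reqs <- lapply(urls, function(u) crul::HttpRequest$new(url = u)$get())
res <- crul::AsyncVaried$new(.list = reqs)

# request() performs all the calls in parallel; parse each response body afterwards.
out <- res$request()
bodies <- lapply(out, function(x) x$parse("UTF-8"))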

Note: In this case the “surrounding conditions” are all the same. But you can also do asynchronous requests that call different end-points. Check out the crul package documentation for more on that.


To leave a comment for the author, please follow the link and comment on their blog: Renglish – 56north | Skræddersyet dataanalyse.

Take the R Consortium survey on R

Tue, 07/11/2017 - 21:32

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Since its foundation just a little over two years ago, the R Consortium has been dedicated to providing support to the R Project and the R community. Already, the R Consortium has channeled the contributions from its corporate members to fund more than 25 projects, working groups, and community initiatives. Recently funded initiatives include a code coverage tool for R, improved connectivity between R and databases, new methods for handling spatial data, the R-hub package builder, and the SatRdays and R-Ladies community programs. (Many of these projects were presented at the recent useR!2017 conference in Brussels, which was awesome to see.)

Now, the R Consortium would like to hear from you, the R user community, to help guide its direction. Please take a few minutes to take the R Consortium's Survey on R, and share your thoughts on the past, present and future of R. The R Consortium says this about the survey in a blog post by Joe Rickert and Hadley Wickham:

The R Consortium exists to promote R as a language, environment and community. In order to answer some of the questions above and to help us understand our mission better we have put together the first of what we hope will be an annual survey of R users. This first attempt is a prototype. We don’t have any particular hypothesis or point of view. We would like to reach everyone who is interested in participating. So please, take a few minutes to take the survey yourself and help us get the word out. The survey will adapt depending on your answers, but will take about 10 minutes to complete.

The R Consortium will also make an anonymized form of the data available to the community for analysis after the survey has closed. 

You can find the survey (and share it with others) at this link: R Consortium Survey

 


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Data Visualization with googleVis exercises part 6

Tue, 07/11/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Geographical Charts

In part 6 of this series we are going to see some amazing geographical charts that googleVis provides.

Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let's begin!

Answers to the exercises are available here.

Package

As you already know, the first thing you have to do is install and load the googleVis package with:
install.packages("googleVis")
library(googleVis)

NOTE: The charts are created locally by your browser. In case they are not displayed at once press F5 to reload the page.

Geo Chart

It is quite simple to create a Geo Chart with googleVis. We will use the “Exports” dataset. First let’s take a look at it with head(Exports). As you can see there are three variables (“Country”, “Profit”, “Online”) which we are going to use later.
Look at the example below to create a simple geo chart:
Geo=gvisGeoChart(Exports)
plot(Geo)

Exercise 1

Create a list named “GeoC” and pass to it the “Exports” dataset as a geo chart. HINT: Use gvisGeoChart().

Exercise 2

Plot the geo chart. HINT: Use plot().

Furthermore, you can add much more information to your chart by using the locationvar and colorvar options to color the countries according to their profit. Look at the example below.
Geo=gvisGeoChart(Exports,
locationvar="Country",
colorvar="Profit")
plot(Geo)

Exercise 3

Color the countries of your geo chart according to their profit and plot it. HINT: Use locationvar and colorvar.

Google Maps

It is quite simple to create a Google Map with googleVis. We will use the “Andrew” dataset. First let’s take a look at it with head(Andrew) to see its variables. Look at the example below to create a simple google map:
GoogleMap <- gvisMap(Andrew)
plot(GoogleMap)

Exercise 4

Create a list named “GoogleMap” and pass to it the “Andrew” dataset as a google map. HINT: Use gvisMap().

Learn more about using GoogleVis in the online course Mastering in Visualization with R programming. In this course you will learn how to:

  • Work extensively with the GoogleVis package and its functionality
  • Learn what visualizations exist for your specific use case
  • And much more

Exercise 5

Plot the google map. HINT: Use plot().

As you can see there are no data points on it, as we have not selected anything yet. We have to select the latitude and longitude variables from the dataset, as in the example below.
GoogleMap <- gvisMap(Andrew, "LatLong")

Exercise 6

Display the map by adding the "LatLong" variable to your list and plot it.

Exercise 7

Display the “Tip” variable on your google map just like you displayed the “LatLong” and plot it.

There are some useful options that gvisMap() provides to you that can enhance your map. Check the example below.
options=list(showTip=TRUE,
showLine=TRUE,
mapType='terrain',
useMapTypeControl=TRUE)
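To apply such a list, pass it to the options argument of gvisMap(); this is just a sketch of the call pattern (it is not one of the exercise answers):

GoogleMap <- gvisMap(Andrew, "LatLong", "Tip",
options=list(showTip=TRUE,
showLine=TRUE,
mapType='terrain',
useMapTypeControl=TRUE))
plot(GoogleMap)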

Exercise 8

Deactivate the Tip information from your map, plot the map and then enable it again. HINT: Use showTip.

Exercise 9

Enable useMapTypeControl and plot the map.

Exercise 10

Set the mapType by default to “terrain” and plot the map.

Related exercise sets:
  1. Data Visualization with googleVis exercises part 2
  2. Data Visualization with googleVis exercises part 3
  3. Data Visualization with googleVis exercises part 4
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

R 3.4.1 is released – with some Windows related bug-fixes

Tue, 07/11/2017 - 09:02

(This article was first published on R – R-statistics blog, and kindly contributed to R-bloggers)

R 3.4.1 (codename “Single Candle”) was released several days ago. You can get the latest binaries version from here. (or the .tar.gz source code from here).

As mentioned last week by David Smith, R 3.4.1 includes several Windows-related bug fixes:

including an issue sometimes encountered when attempting to install packages on Windows, and problems displaying functions including Unicode characters (like “日本語”) in the Windows GUI.

 

The full list of bug fixes and new features is provided below.

Upgrading to R 3.4.1 on Windows

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

install.packages("installr") # install setInternet2(TRUE) # only for R versions older than 3.3.0 installr::updateR() # updating R. # If you wish it to go faster, run: installr::updateR(T)

Running "updateR()" will detect if there is a new R version available, and if so it will download and install it (etc.). There is also a step by step tutorial (with screenshots) on how to upgrade R on Windows, using the installr package. If you only see the option to upgrade to an older version of R, then change your mirror or try again in a few hours (it usually takes around 24 hours for all CRAN mirrors to get the latest version of R).

I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package – you are invited to open an issue in the github page.

CHANGES IN R 3.4.1

INSTALLATION on a UNIX-ALIKE
  • The deprecated support for PCRE versions older than 8.20 has been removed.
BUG FIXES
  • getParseData() gave incorrect column information when code contained multi-byte characters. (PR#17254)
  • Asking for help using expressions like ?stats::cor() did not work. (PR#17250)
  • readRDS(url(....)) now works.
  • R CMD Sweave again returns status = 0 on successful completion.
  • Vignettes listed in ‘.Rbuildignore’ were not being ignored properly. (PR#17246)
  • file.mtime() no longer returns NA on Windows when the file or directory is being used by another process. This affected installed.packages(), which is now protected against this.
  • R CMD INSTALL Windows .zip file obeys --lock and --pkglock flags.
  • (Windows only) The choose.files() function could return incorrect results when called with multi = FALSE. (PR#17270)
  • aggregate(<data.frame>, drop = FALSE) now also works in case of near-equal numbers in by. (PR#16918)
  • fourfoldplot() could encounter integer overflow when calculating the odds ratio. (PR#17286)
  • parse() no longer gives spurious warnings when extracting srcrefs from a file not encoded in the current locale.

    This was seen from R CMD check with ‘inst/doc/*.R’ files, and check has some additional protection for such files.

  • print.noquote(x) now always returns its argument x (invisibly).
  • Non-UTF-8 multibyte character sets were not handled properly in source references. (PR#16732)

 

 


To leave a comment for the author, please follow the link and comment on their blog: R – R-statistics blog.

Introducing learnr

Tue, 07/11/2017 - 02:00

(This article was first published on Home on RStudio Blog, and kindly contributed to R-bloggers)

We’re pleased to introduce the learnr package, now available on CRAN. The learnr package makes it easy to turn any R Markdown document into an interactive tutorial. Tutorials consist of content along with interactive components for checking and reinforcing understanding. Tutorials can include any or all of the following:

  • Narrative, figures, illustrations, and equations.

  • Code exercises (R code chunks that users can edit and execute directly).

  • Multiple choice quizzes.

  • Videos (supported services include YouTube and Vimeo).

  • Interactive Shiny components.

Each learnr tutorial is a Shiny interactive document, which means that tutorials can be deployed all of the same ways that Shiny applications can, including locally on an end-user’s machine, on a Shiny or RStudio Connect Server, or on a hosting service like shinyapps.io.

Getting Started

To create a learnr tutorial, install the learnr package with

install.packages("learnr")

Then select the Interactive Tutorial template from the New R Markdown dialog in the RStudio IDE (v1.0.136 or later).

Exercises

Exercises are interactive R code chunks that allow readers to directly execute R code and see its results:

To add an exercise, add exercise = TRUE to the chunk options of an R Markdown code chunk. R Markdown will preload the chunk with the code that you supply.

```{r ex1, exercise = TRUE}
head(mtcars, n = 5)
```

becomes

Exercises can include hints or solutions as well as custom checking code to provide feedback on user answers. The learnr Exercises page includes a more in depth discussion of exercises and their various available options and behaviors.

Questions

You can include one or more multiple-choice quiz questions within a tutorial to help verify that readers understand the concepts presented. Questions can have a single or multiple correct answers.

Include a question by calling the question() function within an R code chunk:

```{r letter-a, echo=FALSE}
question("What number is the letter A in the English alphabet?",
  answer("8"),
  answer("14"),
  answer("1", correct = TRUE),
  answer("23")
)
```
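A question with more than one correct answer follows the same pattern; the chunk below is a sketch of my own (the chunk name and wording are illustrative, not from the announcement), simply marking several answers as correct:

```{r letter-vowels, echo=FALSE}
question("Which of these letters are vowels?",
  answer("A", correct = TRUE),
  answer("E", correct = TRUE),
  answer("K"),
  answer("Z")
)
```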

Videos

You can include videos published on either YouTube or Vimeo within a tutorial using the standard markdown image syntax. Note that any valid YouTube or Vimeo URL will work. For example, the following are all valid examples of video embedding:

![](https://youtu.be/zNzZ1PfUDNk)
![](https://www.youtube.com/watch?v=zNzZ1PfUDNk)
![](https://vimeo.com/142172484)
![](https://player.vimeo.com/video/142172484)

Code checking

learnr works with external code checking packages to let you evaluate student answers and provide targeted, automated feedback, like the message below.

You can use any package that provides a learnr compatible checker function to do code checking (the checkr package provides a working prototype of a compatible code checker).

Navigation and progress tracking

Each learnr tutorial includes a Table of Contents that tracks student progress. learnr remembers which sections of a tutorial a student completes, and returns a student to where they left off when they reopen a tutorial.

Progressive Reveal

learnr optionally reveals content one sub-section at a time. You can use this feature to let students set their own pace, or to hide information that would spoil an exercise or question that appears just before it.

To use progressive reveal, set the progressive field to true in the yaml header.

---
title: "Programming basics"
output:
  learnr::tutorial:
    progressive: true
    allow_skip: true
runtime: shiny_prerendered
---

Visit rstudio.github.io/learnr/ to learn more about creating interactive tutorials with learnr.


To leave a comment for the author, please follow the link and comment on their blog: Home on RStudio Blog.

In case you missed it: June 2017 roundup

Mon, 07/10/2017 - 23:25

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from June of particular interest to R users.

R 3.4.1 "Single Candle" has been released.

The Scientific Computing Coordinator at the FDA explains how R is used at the FDA and by sponsors for clinical trial submissions.

Several useful tips related to including images in Rmarkdown documents.

A review of one of R's best features — its community.

It's now possible to include interactive R visualizations in Power BI reports (like graphics created with plotly or htmlwidgets).

The Azure Data Science Virtual Machine for Windows now supports GPU-based computations with Microsoft R, Tensorflow, and other included software.

The 2017 Burtch Works survey of data science software popularity shows R leading, Python gaining, and SAS declining.

A video presentation by Ali Zaidi on using the sparklyr package with Microsoft R Server.

The EARL conference in San Francisco featured applications of R at Pandora, Pfizer, Amgen, Hitachi, and many other companies.

A demo of real-time predictions from a model created with Microsoft R, at a rate of one million predictions per second.

The R Epidemics Consortium is a coalition of researchers developing epidemiology resources for R.

Syberia is an open-source framework for orchestrating R scripts in production.

A from-the-basics guide to accessing open APIs from R, from Locke Data.

Highlights of talks from useR!2017 (recordings will be available in late July).

The doAzureParallel package provides a backend to "foreach" that spins up a parallel-processing cluster on Azure incorporating low-cost low-priority VMs.

A tutorial on creating dot-density maps in R, an alternative to choropleths.

A free 13-page e-book on using Power BI with R.

Python edges out R for the first time in the 2017 KDnuggets poll of data science software usage.

The miner package encourages kids to learn to program in R while manipulating the world of Minecraft with R functions.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Useful dplyr Functions (w/examples)

Mon, 07/10/2017 - 21:02

(This article was first published on Environmental Science and Data Analytics, and kindly contributed to R-bloggers)

The R package dplyr is an extremely useful resource for data cleaning, manipulation, visualisation and analysis. It contains a large number of very useful functions and is, without doubt, one of my top 3 R packages today (ggplot2 and reshape2 being the others). When I was learning how to use dplyr for the first time, I used DataCamp which offers some fantastic interactive courses on R. I also made use of  a decent data wrangling cheat sheet which can be found here.

There are many useful functions contained within the dplyr package. This post does not attempt to cover them all but does look at the major functions that are commonly used in data manipulation tasks. These are:

  • select()
  • filter()
  • mutate()
  • group_by()
  • summarise()
  • arrange()
  • join()

The data used in this post are taken from the UCI Machine Learning Repository and contain census information from 1994 for the USA. The dataset can be used for classification of income class in a machine learning setting and can be obtained here.

require(dplyr)

# Data file
file <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

# Some sensible variable names
df_names <- c("age", "wrkclass", "fnlweight", "education_lvl", "edu_score",
              "marital_status", "occupation", "relationship", "ethnic", "gender",
              "cap_gain", "cap_loss", "hrs_wk", "nationality", "income")

# Import the data
df <- read.csv(file, header = F, sep = ",",
               na.strings = c(" ?", " ", ""),
               row.names = NULL, col.names = df_names)

Many data manipulation tasks in dplyr can be performed with the assistance of the forward-pipe operator (%>%). The pipe was first introduced in the magrittr package and has since been included in the dplyr package. It is an incredibly useful tool for fluid data manipulation and results in highly readable code.
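As a tiny illustration (my own, using the df object created above), these two calls are equivalent; the pipe simply passes the left-hand result in as the first argument of the next call:

# Nested call versus the piped version.
head(select(df, age, gender), 3)
df %>% select(age, gender) %>% head(3)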

The census dataset requires a bit of preprocessing to get it ready for classification algorithms. This post does not cover preprocessing nor does it include predictive modelling.

The first function I would like to introduce removes duplicate entries which, in fact, is a preprocessing step one may carry out in a data analysis. It is so useful that it must be included.

# Remove duplicate rows and check number of rows
df %>% distinct() %>% nrow()

# Drop duplicate rows and assign to new dataframe object
df_clean <- df %>% distinct()

# Drop duplicates based on one or more variables
df %>% distinct(gender, .keep_all = T)
df %>% distinct(gender, education_lvl, .keep_all = T)

Taking random samples of data is easy with dplyr.

# Sample random rows with or without replacement
sample_n(df, size = nrow(df) * 0.7, replace = F)
sample_n(df, size = 20, replace = T)

# Sample a proportion of rows with or without replacement
sample_frac(df, size = 0.7, replace = F)
sample_frac(df, size = 0.8, replace = T)

Renaming variables is also easy with dplyr.

# Rename one or more variables in a dataframe
df <- df %>% rename("INCOME" = "income")
df <- df %>% rename("INCOME" = "income", "AGE" = "age")

The main “verbs” of dplyr are now introduced. Let’s begin with the select() verb which filters a dataframe by column.

# Select specific columns (note that INCOME is the new name from earlier)
df %>% select(education_lvl, INCOME)

# With dplyr 0.7.0 the pull() function extracts a variable as a vector
df %>% pull(age)

# Drop a column using the - operator (variable can be referenced by name or column position)
df %>% select(-edu_score)
df %>% select(-1, -4)
df %>% select(-c(2:6))

Some useful helper functions are available in dplyr and can be used in conjunction with the select() verb. Here are some quick examples.

# Select columns with their names starting with "e"
df %>% select(starts_with("e"))

# The negative sign works for dropping here too
df %>% select(-starts_with("e"))

# Select columns with some pattern in the column name
df %>% select(contains("edu"))

# Reorder data to place a particular column at the start followed by all others using everything()
df %>% select(INCOME, everything())

# Select columns ending with a pattern
df %>% select(ends_with("e"))
df %>% select(ends_with("_loss"))

The next major verb we look at is filter() which, surprisingly enough, filters a dataframe by row based on one or more conditions.

# Filter rows to retain observations where age is greater than 30
df %>% filter(age > 30)

# Filter by multiple conditions using the %in% operator (make sure strings match)
df %>% filter(relationship %in% c(" Unmarried", " Wife"))

# You can also use the OR operator (|)
df %>% filter(relationship == " Husband" | relationship == " Wife")

# Filter using the AND operator
df %>% filter(age > 30 & INCOME == " >50K")

# Combine them too
df %>% filter(education_lvl %in% c(" Doctorate", " Masters") & age > 30)

# The NOT condition (filter out doctorate holders)
df %>% filter(education_lvl != " Doctorate")

# The grepl() function can be conveniently used with filter()
df %>% filter(grepl(" Wi", relationship))

Next, we look at the summarise() verb which allows one to dynamically summarise groups of data and even pipe groups to ggplot data visualisations.

# The summarise() verb in dplyr is useful for summarising grouped data
df %>%
  filter(INCOME == " >50K") %>%
  summarise(mean_age = mean(age),
            median_age = median(age),
            sd_age = sd(age))

# Summarise multiple variables using summarise_at()
df %>%
  filter(INCOME == " >50K") %>%
  summarise_at(vars(age, hrs_wk), funs(n(), mean, median))

# We can also summarise with custom functions
# The . in parentheses represents all called variables
df %>%
  summarise_at(vars(age, hrs_wk),
               funs(n(), missing = sum(is.na(.)), mean = mean(., na.rm = T)))

# Create a new summary statistic with an anonymous function
df %>%
  summarise_at(vars(age), function(x) { sum((x - mean(x)) / sd(x)) })

# Summarise conditionally using summarise_if()
df %>%
  filter(INCOME == " >50K") %>%
  summarise_if(is.numeric, funs(n(), mean, median))

# Subset numeric variables and use summarise_all() to get summary statistics
ints <- df[sapply(df, is.numeric)]
summarise_all(ints, funs(mean, median, sd, var))

Next up is the arrange() verb which is useful for sorting data in ascending or descending order (ascending is default).

# Sort by ascending age and print top 10
df %>% arrange(age) %>% head(10)

# Sort by descending age and print top 10
df %>% arrange(desc(age)) %>% head(10)

The group_by() verb is useful for grouping together observations which share common characteristics.

# The group_by verb is extremely useful for data analysis
df %>% group_by(gender) %>% summarise(Mean = mean(age))
df %>% group_by(relationship) %>% summarise(total = n())
df %>% group_by(relationship) %>% summarise(total = n(), mean_age = mean(age))

The mutate() verb is used to create new variables from existing local variables or global objects. New variables, such as sequences, can be also specified within mutate().

# Create new variables from existing or global variables
df %>% mutate(norm_age = (age - mean(age)) / sd(age))

# Multiply each numeric element by 1000 (the name "new" is added to the original variable name)
df %>% mutate_if(is.numeric, funs(new = (. * 1000))) %>% head()

The join() verb is used to merge rows from disjoint tables which share a primary key ID  or some other common variable. There are many join variants but I will consider just left, right, inner and full joins.

# Create ID variable which will be used as the primary key
df <- df %>% mutate(ID = seq(1:nrow(df))) %>% select(ID, everything())

# Create two tables (purposely overlap to facilitate joins)
table_1 <- df[1:50, ] %>% select(ID, age, education_lvl)
table_2 <- df[26:75, ] %>% select(ID, gender, INCOME)

# Left join joins rows from table 2 to table 1 (the direction is implicit in the argument order)
left_join(table_1, table_2, by = "ID")

# Right join joins rows from table 1 to table 2
right_join(table_1, table_2, by = "ID")

# Inner join joins and retains only complete cases
inner_join(table_1, table_2, by = "ID")

# Full join joins and retains all values
full_join(table_1, table_2, by = "ID")

That wraps up a brief demonstration of some of dplyr's excellent functions. For additional information on the functions and their arguments, check out the help documentation using ? followed by the function name (for example, ?select).

 

References

Hadley Wickham, Romain Francois, Lionel Henry and Kirill Müller (2017). dplyr: A
Grammar of Data Manipulation. R package version 0.7.0.
https://CRAN.R-project.org/package=dplyr

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New
York, 2009.



What is magrittr’s future in the tidyverse?

Mon, 07/10/2017 - 18:52

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

For many R users the magrittr pipe is a popular way to arrange computation and famously part of the tidyverse.

The tidyverse itself is a rapidly evolving centrally controlled package collection. The tidyverse authors publicly appear to be interested in re-basing the tidyverse in terms of their new rlang/tidyeval package. So it is natural to wonder: what is the future of magrittr (a pre-rlang/tidyeval package) in the tidyverse?

For instance: here is a draft of rlang/tidyeval based pipe (from one of the primary rlang/tidyeval authors). This pipe even fixes dplyr issue 2726:

# do NOT perform the source step directly!
# Instead, save and inspect the source code from:
# https://gist.github.com/lionel-/10cd649b31f11512e4aea3b7a98fe381
# Output printed is the result of an example in the gist.
source("https://gist.githubusercontent.com/lionel-/10cd649b31f11512e4aea3b7a98fe381/raw/b8804e41424a4f721ce21292a7ec9c35b5f3689d/pipe.R")
#> List of 1
#>  $ :List of 1
#>   ..$ :List of 2
#>   .. ..$ : chr "foo"
#>   .. ..$ : chr "foo"

# previously throwing example from dplyr issue 2726
(function(x) mtcars %>% dplyr::select(!!enquo(x)))(disp) %>% head()
#>                   disp
#> Mazda RX4          160
#> Mazda RX4 Wag      160
#> Datsun 710         108
#> Hornet 4 Drive     258
#> Hornet Sportabout  360
#> Valiant            225


Volatility modelling in R exercises (Part-3)

Mon, 07/10/2017 - 18:00

This is the third part of the series on volatility modelling. For other parts of the series follow the tag volatility.

In this exercise set we will use GARCH models to forecast volatility.

Answers to the exercises are available here.

Exercise 1
Load the rugarch and the FinTS packages. Next, load the m.ibmspln dataset from the FinTS package. This dataset contains monthly excess returns of the S&P500 index from Jan-1926 to Dec-1999 (Ruey Tsay (2005), Analysis of Financial Time Series, 2nd ed., Wiley, chapter 3).
Also, load the forecast package which we will use for auto-correlation graphs.

Exercise 2
Excess S&P500 returns are defined as a regular zoo variable. Convert this to a time series variable with correct dates.

Exercise 3
Plot the excess S&P500 returns along with its ACF and PACF graphs and comment on the apparent correlation.

Exercise 4
Plot the squared excess S&P500 returns along with its ACF and PACF graphs and comment on the apparent correlation.


Exercise 5
Using the results from exercise-3, estimate a suitable ARMA model for excess returns assuming normal errors.

Exercise 6
Using the results from exercise-4, estimate a suitable ARMA model for excess returns without assuming normal errors.

Exercise 7
Using the results from exercises 5 and 6, estimate a more parsimonious model that has better fit.

Exercise 8
Generate a 10-step-ahead forecast for the model from exercise-7.

Exercise 9
Plot the excess returns forecast.

Exercise 10
Plot the volatility forecast.
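To give a feel for the rugarch workflow these exercises build toward, here is a minimal sketch (my own, not the exercise answers): an ARMA(1,1) mean equation with a standard GARCH(1,1) variance equation, followed by a 10-step-ahead forecast. The sp_excess series below is a synthetic stand-in for the excess-return series constructed in exercises 1 and 2.

library(rugarch)

# Synthetic stand-in for the monthly excess-return series used in the exercises
set.seed(42)
sp_excess <- rnorm(500, mean = 0.005, sd = 0.04)

# ARMA(1,1) mean equation + sGARCH(1,1) variance equation, normal errors
spec <- ugarchspec(
  variance.model     = list(model = "sGARCH", garchOrder = c(1, 1)),
  mean.model         = list(armaOrder = c(1, 1), include.mean = TRUE),
  distribution.model = "norm"
)

fit  <- ugarchfit(spec, data = sp_excess)   # estimate the model
fcst <- ugarchforecast(fit, n.ahead = 10)   # 10 steps ahead

fitted(fcst)   # forecast of the (excess) return series
sigma(fcst)    # forecast of the conditional volatility
plot(fcst)     # interactive menu of forecast plots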

Related exercise sets:
  1. Forecasting: Linear Trend and ARIMA Models Exercises (Part-2)
  2. Volatility modelling in R exercises (Part-2)
  3. Volatility modelling in R exercises (Part-1)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

The R community is one of R’s best features

Mon, 07/10/2017 - 17:48

(This article was first published on Shige's Research Blog, and kindly contributed to R-bloggers)

Excellent summary here.



The R Shiny packages you need for your web apps!

Mon, 07/10/2017 - 16:10

(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

Shiny is an R Package to deploy web apps using an R backend. Let’s face it, Shiny is awesome! It brings all the power of R to a simple web app with interactivity, user inputs, and interactive visualizations. If you don’t know Shiny yet, you can access a selection of apps on Show me shiny.
As usual with R, the community is great and has created lots of packages for Shiny, here is a personal selection of Shiny packages.
NB: The post focuses on shiny functionalities and visual improvements but it does not deal with data visualization packages. You should check A pick of the best R packages for interactive plot and visualization (1/2) to discover visualizations!

Improving the Shiny UI

This section will show you some Shiny packages to improve your UI with new components and UI/UX enhancements like draggable elements or in-app tutorials.

ShinyDashboard by Rstudio

ShinyDashboard makes it easy to create a dashboard and is based on the AdminLTE Bootstrap theme. Basically, you can build much more complex shiny apps with a sidebar, a header with notifications, messages, … and Bootstrap UI components (box, value box, info box). As it’s based on the adminLTE theme and bootstrap, you can often use their custom HTML classes by adding a few HTML lines to your application.


An example of a Shiny dashboard application

This package should be installed alongside Shiny and is really a must-have!
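To make this concrete, here is a minimal shinydashboard skeleton (my own sketch, not taken from the post): a header, a sidebar menu, and a body with a value box and a plot inside a box.

library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  dashboardHeader(title = "Demo dashboard"),
  dashboardSidebar(
    sidebarMenu(
      menuItem("Overview", tabName = "overview", icon = icon("dashboard"))
    )
  ),
  dashboardBody(
    tabItems(
      tabItem(tabName = "overview",
        fluidRow(
          valueBox(42, "Answers computed", icon = icon("calculator")),
          box(title = "A plot", plotOutput("hist"))
        )
      )
    )
  )
)

server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(100)))
}

shinyApp(ui, server)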

Version: 0.6.1 (June 15, 2017) on CRAN
License: GPL-2
Package components’ License: GPL-2

shiny.semantic by Appsilon

Like ShinyDashboard, shiny.semantic is a wrapper for a web UI: Semantic UI. Semantic UI looks really beautiful and modern and has more varied components and inputs than Bootstrap.

Components from Semantic UI.


The apps made with this package look really awesome; you can check this demo app from Appsilon. However, the package is not easy to use at first: its syntax is very close to HTML and may require some adaptation.

Version: 0.1.1 (2017-05-29) on CRAN
License: MIT
Package components’ License: MIT

ShinyBS by ebailey78

ShinyBS adds some Bootstrap components to Shiny, such as popovers, tooltips, and modal windows. Since modal windows are now included in Shiny itself, the main interest of the package lies in the popovers and tooltips. They can be easily added and parametrized using an R syntax.
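As a rough sketch of what this looks like (my example, not from the post), a tooltip and a popover can be attached to an existing input by its id:

library(shiny)
library(shinyBS)

ui <- fluidPage(
  actionButton("go", "Run model"),
  # Tooltip shown on hover
  bsTooltip("go", "Fits the model with the current settings", placement = "bottom"),
  # Popover shown on click
  bsPopover("go", title = "Run", content = "This can take a few seconds.",
            trigger = "click")
)

server <- function(input, output) {}

shinyApp(ui, server)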

Version: 0.61 (2015-03-30 ) on CRAN
License: GPL-3
Package components’ License: MIT

shinyjqui by Yang-Tang

Shinyjqui adds jQuery UI to Shiny, and this is really awesome! With the package, you can drag, resize, sort and animate the elements of your UI with a few lines of code. This really adds a wahoo factor to your apps and corrects the fixed/static aspect of Shiny applications. Just check this example from the repository and you will love this Shiny package!

 

Version: 0.2.0 ( 2017-07-04 ) on CRAN
License: MIT
Package components’ license: CC0

rintrojs by CarlGanz

Is your app really complex? You don’t want your users to get lost and quit? Then you should really take a look at rintrojs. It’s a wrapper for the intro.js library that creates step-by-step walkthroughs for your app and shows users how to use it. You can trigger the intro programmatically (for instance when the user clicks a ‘Hey, I need help’ button), and the intro will highlight the elements of interest in your app with a comment. You can check the demo app to see it working.
This is definitely a must have for complex apps!

Version: 0.2.0 ( 2017-07-04 ) on CRAN
License: GNU Affero General Public License v3.0
Package components’ license (intro.js)
Commercial license for commercial use (prices are reasonable though).
GNU AGPLv3 for open-source use.

New Shiny inputs

These Shiny packages bring new inputs and variety to your app.

shinyWidgets by dreamRs

ShinyWidgets brings new and improved Bootstrap-style inputs to Shiny (a small sketch follows below):

  • Improved radio button and checkbox inputs
  • Grouped checkboxes and radio buttons which can be used instead of the originals. Very useful for mobile-friendly shiny apps!
  • Two different types of switch for boolean inputs
  • A pickerInput which greatly enhances the vanilla selectize input
  • Drop-down buttons and improved modal/alert windows.

Examples of inputs from shinyWidgets

You can see a live version of all the inputs here.
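Here is a small sketch (mine, not from the post) wiring two of these widgets, a pickerInput and a switchInput, into a toy app:

library(shiny)
library(shinyWidgets)

ui <- fluidPage(
  # A selectize-like dropdown with a search box and select-all/deselect-all actions
  pickerInput("vars", "Pick variables:", choices = names(mtcars),
              multiple = TRUE, options = list(`actions-box` = TRUE)),
  # A toggle switch instead of a plain checkbox
  switchInput("log", "Log scale", value = FALSE),
  verbatimTextOutput("chosen")
)

server <- function(input, output) {
  output$chosen <- renderPrint(list(vars = input$vars, log = input$log))
}

shinyApp(ui, server)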

Version: 0.3.0 (June 12, 2017) on CRAN
License: GPL-3
Package components’ License: MIT

ShinySky by AnalytixWare

ShinySky also brings some new inputs. Though the package does not seem active anymore, it has some nice inputs that are not in Shiny:

  • A typeahead input which completes what the user is typing and allows richer CSS and HTML in the results that appear. It’s like a selectizeInput on steroids.
  • Select2Input, which is also a selectize on steroids, with the possibility to rearrange the order of the inputs.

Version: 0.1.2 (2014-3-27) on Github
License: MIT

Make your tables great again

The vanilla tables are so-so, this selection of Shiny packages will help you to create interactive, editable and stylish tables.

DT by RStudio

DT is an interface to the JS library DataTables. The package is a must-have if you plan to insert tables in your apps; it improves the vanilla tables with the following (a minimal usage sketch follows the list):

  • Sorting, search, filtering, and scrolling for long or wide tables
  • Enhanced table layout with captions and custom table containers.
  • Possibility to insert raw HTML codes (for instance to add buttons) and to modify your table style.
  • Easy table export to convenient formats (PDF, Excel, clipboard, …)
  • Built-in selection to use the table as a Shiny Input. For instance, you can detect which cell has been clicked, which row(s) or columns are selected, …

Version: 0.2 ( 2016-08-09) on CRAN
License: GPL-3
Package components’ License: Open source components

Formattable by Renkun

Formattable is centered around data presentation and formatting in a table. It has conditional styling options (colors, bar charts, icons, …) and formatting options (rounding, percentages, …). Even though I tend to prefer DT for its interactivity, I find formattable easier to use for formatting and styling tables.

Example of a formattable (from package repo)

Version: 0.2.0.1 ( 2016-08-05) on CRAN
License: MIT

Rhandsontable by jrowen

Rhandsontable brings handsontable.js into Shiny. Handsontable.js is like Excel brought into your internet browser; with rhandsontable you can use editable tables in your shiny application! A minimal usage sketch follows the feature list below.

Example of rhansontable (from the package vignette)

Here is a quick list of its functionalities:

  • Editable table and shiny binding to use the table as a Shiny input
  • Definition of column types; for instance, you can set a column as a date column (with a date picker), a character column, or a boolean column (with a checkbox).
  • Contextual events on right clicks to add and remove rows, to add comments or borders.
  • Data validation with built-in or custom validation functions (in javascript)
  • Custom formatting of data, insertion of HTML code and of sparklines.
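And here is the minimal sketch announced above (my own example, not from the post): an editable table whose edited contents are read back into R with hot_to_r().

library(shiny)
library(rhandsontable)

ui <- fluidPage(
  rHandsontableOutput("grid"),
  verbatimTextOutput("summary")
)

server <- function(input, output) {
  output$grid <- renderRHandsontable(
    rhandsontable(head(iris), rowHeaders = NULL)  # spreadsheet-like editable table
  )
  output$summary <- renderPrint({
    req(input$grid)
    summary(hot_to_r(input$grid))  # convert the edited table back to a data.frame
  })
}

shinyApp(ui, server)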

Version: 0.2.0.1 (2016-08-05) on CRAN
License: MIT
Package components’ License: MIT for the community version of handsontable (which is used in the package)

Sparkline by timelyportfolio & ramnathv

Sparklines are tiny plots that can be put in a single cell; they can be used to show trends or evolution over the last few periods. Sparkline is an HTMLWidget to add interactive sparklines to your tables, and it also includes bar charts and boxplots. That’s a great package to make your tables even more interactive and visually appealing.

Examples of sparkline (from the package repo)

Version: 2.0 ( 2016-11-12) on CRAN
License: MIT
Package components’ License: 3-Clause BSD License ( jquery sparkline)

A text and code editor in your shiny apps: ShinyAce by trestletech

Ace is a web-based code editor that gives you syntax highlighting and code editing in your web browser. As you have already guessed, ShinyAce is a wrapper for the Ace editor and exposes most of its functionality. The editor field can also be used as an input, for example to run the code typed into the editor.
[Warning: you should only allow this in restricted/logged parts of your application. You don’t want everyone playing with your server.]
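A small sketch (mine, not from the post) of embedding an Ace editor and reading its contents as an input; in line with the warning above, it only echoes the text instead of evaluating it.

library(shiny)
library(shinyAce)

ui <- fluidPage(
  aceEditor("code", mode = "r", theme = "chrome",
            value = "x <- rnorm(10)\nmean(x)"),
  verbatimTextOutput("contents")
)

server <- function(input, output) {
  # The editor's current text arrives as input$code; we only echo it here
  output$contents <- renderPrint(cat(input$code))
}

shinyApp(ui, server)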

Version: 0.2.1 (2016-03-14) on CRAN
License: MIT
Package components’ License: BSD 3 (Ace)

Interfacing your Shiny app with JS

shinyjs by Dean Attali

shinyjs greatly improves the interaction between Shiny and JavaScript; the package is divided into two parts:

  • It makes some JS-based actions (hiding an element, listening for an event, changing the class of an HTML element, …) accessible with an R and Shiny syntax. Hence it helps non-JS programmers perform these simple actions in Shiny.
  • It makes it easier to call your custom JS functions from Shiny. You can now easily pass parameters to your JS functions and call them in a more R-friendly way.

So whether you intend to add more interactivity or to include js code in your application, this package is a must!
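For illustration, here is a minimal sketch (my own example, not from the post) of the first kind of use: toggling the visibility of a UI element from R without writing any JavaScript.

library(shiny)
library(shinyjs)

ui <- fluidPage(
  useShinyjs(),   # must be called once in the UI
  actionButton("toggle_btn", "Show/hide details"),
  hidden(div(id = "details", "Some details that start out hidden"))
)

server <- function(input, output) {
  # toggle() shows or hides the element with the given id on each click
  observeEvent(input$toggle_btn, toggle("details", anim = TRUE))
}

shinyApp(ui, server)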

Version: 0.9.1(2017-06-29) on CRAN
License: GNU Affero General Public License v3.0 for open-source use

d3R by timelyportfolio

d3R is not really focused on Shiny but on making the use of d3.js easier in R:

  • Loading D3.js with a single function to avoid checking the latest D3.js version
  • Several helper functions to convert usual R data types to data types that can be used in d3.js (such as nested list or graphs).

d3R removes the painful data transformation part and simplifies the integration of custom d3.js viz’ in R.

Version: 0.9.1(2017-06-29) on CRAN
License: BSD-3
Package components’ License: BSD License (d3.js)

Improving your development pipeline

shinytest by RStudio

One of the painful parts of Shiny development is testing your app. You have to do it manually, so the testing is inconsistent, you may forget some cases, it’s boring, and it requires a lot of time.
shinytest aims to automate application testing: you create your different tests once, and each time you want to test the app, shinytest runs them and compares the inputs and outputs with the original run. You don’t need to run manual tests anymore!

Version: 1.1.0 (2017-06-23) on Github
License: MIT

shinyLoadTest by Rstudio

shinyLoadTest is an extension of shinytest which helps you ensure that your applications are scalable. The package creates concurrent sessions which perform a predefined set of actions, to test the ability of your application to sustain the load.

Version: 0.1.0 (2017-06-23) on Github
License: GPL-3

That’s it, our little tour of Shiny packages is done. If you know other packages and wish to share them, please comment, I’d be glad to add them to the post.

You can follow the blog on Twitter and Facebook, and subscribe to the newsletter.


The post The R Shiny packages you need for your web apps! appeared first on Enhance Data Science.



The R Journal, Volume 9, Issue 1, June 2017 – is online!

Mon, 07/10/2017 - 15:58

The new issue of The R Journal is now available!

You may Download the complete issue, or choose your topic of interest from the following links:

 

Table of contents

Editorial 
Roger Bivand 4

Contributed Research Articles

iotools: High-Performance I/O Tools for R 
Taylor Arnold, Michael J. Kane and Simon Urbanek 6

IsoGeneGUI: Multiple Approaches for Dose-Response Analysis of Microarray Data Using R 
Martin Otava, Rudradev Sengupta, Ziv Shkedy, Dan Lin, Setia Pramana, Tobias Verbeke, Philippe Haldermans, Ludwig A. Hothorn, Daniel Gerhard, Rebecca M. Kuiper, Florian Klinglmueller and Adetayo Kasim 14

Network Visualization with ggplot2 
Sam Tyner, François Briatte and Heike Hofmann 27

OrthoPanels: An R Package for Estimating a Dynamic Panel Model with Fixed Effects Using the Orthogonal Reparameterization Approach 
Mark Pickup, Paul Gustafson, Davor Cubranic and Geoffrey Evans 60

The mosaic Package: Helping Students to Think with Data Using R 
Randall Pruim, Daniel T Kaplan and Nicholas J Horton 77

smoof: Single- and Multi-Objective Optimization Test Functions 
Jakob Bossek 103

minval: An R package for MINimal VALidation of Stoichiometric Reactions 
Daniel Osorio, Janneth González and Andrés Pinzón 114

Working with Daily Climate Model Output Data in R and the futureheatwaves Package 
G. Brooke Anderson, Colin Eason and Elizabeth A. Barnes 124

alineR: an R Package for Optimizing Feature-Weighted Alignments and Linguistic Distances 
Sean S. Downey, Guowei Sun and Peter Norquest 138

Implementing a Metapopulation Bass Diffusion Model using the R Package deSolve 
Jim Duggan 153

MDplot: Visualise Molecular Dynamics 
Christian Margreitter and Chris Oostenbrink 164

On Some Extensions to GA Package: Hybrid Optimisation, Parallelisation and Islands Evolution 
Luca Scrucca 187

imputeTS: Time Series Missing Value Imputation in R 
Steffen Moritz and Thomas Bartz-Beielstein 207

The NoiseFiltersR Package: Label Noise Preprocessing in R 
Pablo Morales, Julián Luengo, Luís P.F. Garcia, Ana C. Lorena, André C.P.L.F. de Carvalho and Francisco Herrera 219

coxphMIC: An R Package for Sparse Estimation of Cox Proportional Hazards Models via Approximated Information Criteria 
Razieh Nabi and Xiaogang Su 229

Update of the nlme Package to Allow a Fixed Standard Deviation of the Residual Error 
Simon H. Heisterkamp, Engelbertus van Willigen, Paul-Matthias Diderichsen and John Maringwa 239

EMSaov: An R Package for the Analysis of Variance with the Expected Mean Squares and its Shiny Application 
Hye-Min Choe, Mijeong Kim and Eun-Kyung Lee 252

GsymPoint: An R Package to Estimate the Generalized Symmetry Point, an Optimal Cut-off Point for Binary Classification in Continuous Diagnostic Tests 
Mónica López-Ratón, Elisa M. Molanes-López, Emilio Letón and Carmen Cadarso-Suárez 262

autoimage: Multiple Heat Maps for Projected Coordinates 
Joshua P. French 284

Market Area Analysis for Retail and Service Locations with MCI 
Thomas Wieland 298

PSF: Introduction to R Package for Pattern Sequence Based Forecasting Algorithm 
Neeraj Bokde, Gualberto Asencio-Cortés, Francisco Martínez-Álvarez and Kishore Kulat 324

flan: An R Package for Inference on Mutation Models. 
Adrien Mazoyer, Rémy Drouilhet, Stéphane Despréaux and Bernard Ycart 334

Multilabel Classification with R Package mlr 
Philipp Probst, Quay Au, Giuseppe Casalicchio, Clemens Stachl and Bernd Bischl 352

Counterfactual: An R Package for Counterfactual Analysis 
Mingli Chen, Victor Chernozhukov, Iván Fernández-Val and Blaise Melly 370

Retrieval and Analysis of Eurostat Open Data with the eurostat Package 
Leo Lahti, Janne Huovari, Markus Kainu and Przemysław Biecek 385

PGEE: An R Package for Analysis of Longitudinal Data with High-Dimensional Covariates 
Gul Inan and Lan Wang 393

BayesBinMix: an R Package for Model Based Clustering of Multivariate Binary Data 
Panagiotis Papastamoulis and Magnus Rattray 403

pdp: An R Package for Constructing Partial Dependence Plots 
Brandon M. Greenwell 421

checkmate: Fast Argument Checks for Defensive R Programming 
Michel Lang 437

milr: Multiple-Instance Logistic Regression with Lasso Penalty 
Ping-Yang Chen, Ching-Chuan Chen, Chun-Hao Yang, Sheng-Mao Chang and Kuo-Jung Lee 446

spcadjust: An R Package for Adjusting for Estimation Error in Control Charts 
Axel Gandy and Jan Terje Kvaløy 458

Weighted Effect Coding for Observational Data with wec 
Rense Nieuwenhuis, Manfred te Grotenhuis and Ben Pelzer 477

Hosting Data Packages via drat: A Case Study with Hurricane Exposure Data 
G. Brooke Anderson and Dirk Eddelbuettel 486

News and Notes

R Foundation News 
Torsten Hothorn 498

Conference Report: European R Users Meeting 2016 
Maciej Beręsewicz, Adolfo Alvarez, Przemysław Biecek, Marcin K. Dyderski, Marcin Kosinski, Jakub Nowosad, Kamil Rotter, Alicja Szabelska-Beręsewicz, Marcin Szymkowiak, Łukasz Wawrowski and Joanna Zyprych-Walczak 501

Changes on CRAN 
Kurt Hornik, Uwe Ligges and Achim Zeileis 505

News from the Bioconductor Project 
Bioconductor Core Team 508

Changes in R 
R Core Team 509

 


Downloading S&P 500 Stock Data from Google/Quandl with R (Command Line Script)

Mon, 07/10/2017 - 14:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

DISCLAIMER: Any losses incurred based on the content of this post are the responsibility of the trader, not me. I, the author, neither take responsibility for the conduct of others nor offer any guarantees. None of this should be considered as financial advice; the content of this article is only for educational/entertainment purposes.

While most Americans have heard of the Dow Jones Industrial Average (DJIA), most people active in finance consider the S&P 500 stock index to be the better barometer of the overall American stock market. The 500 stocks included in the index are large-cap stocks seen as a leading indicator for the performance of stocks overall. Thus the S&P 500 and its component stocks are sometimes treated as “the market.”

I wanted to download the historical price data (specifically closing price) for all the stocks included in the S&P 500, which you can read on a Wikipedia list. Services such as Google Finance or Quandl provide data for individual companies and the index as a whole (or at least for SPY, the ticker symbol of the SPDR S&P 500 ETF Trust, which is an index ETF that tracks the S&P 500), but they don’t provide a utility to download closing prices for all symbols included in an index, nor will they provide the list. That said, once I have the list I can easily fetch the data for the symbols individually, then merge them together into one dataset.

I wrote an R script for doing this. The script executes the following steps:

  1. Scrape a list of symbols included in the S&P 500 index from Wikipedia, storing the list as a character vector.
  2. Using either quantmod or the R package Quandl (both available from CRAN), attempt to fetch (daily) price data for each symbol in the list created in step one in a loop for a certain date range. If no data for a symbol is available, ignore it.
  3. Merge closing price data for each symbol into a single dataset.

In addition to these steps, I made my script executable from the command line on Unix/Linux systems using the argparse package, further streamlining the process and hiding the R implementation details (so if you’re a Pythonista who doesn’t want to touch R beyond installing it and perhaps some dependencies, this script is still useful to you). I make no guarantees for Windows users; you may need to run the script by hand. By default, the script downloads data from Google, but in the command line you can instruct the script to get data from Quandl; you will need a Quandl API key if you do so. Quandl data can also be adjusted for stock splits and dividends if you so choose (Google data is adjusted automatically for stock splits only).

You can apply either an inner join or an outer join to the data. An inner join will leave no missing information in your CSV, but the resulting file will likely contain much less data than you wanted, since some symbols only have data starting a few years ago, which truncates the entire dataset. The default, an outer join, will return a CSV with as much data as possible for each symbol, including NA when some symbols don’t have data extending as far back as specified.

Here is the script. Copy this into a .R file (I named my file get-sp500-data.R), make it executable (for example, chmod +x get-sp500-data.R on Ubuntu Linux), execute (for example, ./get-sp500-data.R in Bash), and get that sweet, juicy data.

#!/usr/bin/Rscript # get-sp500-data.R # Author: Curtis Miller suppressPackageStartupMessages(library(argparse)) suppressPackageStartupMessages(library(rvest)) suppressPackageStartupMessages(library(magrittr)) suppressPackageStartupMessages(library(tibble)) suppressPackageStartupMessages(library(quantmod)) # Argument parser for command line use parser <- ArgumentParser() parser$add_argument("-v", "--verbose", action = "store_true", default = TRUE, help = "Print extra output [default]") parser$add_argument("--quietly", action = "store_false", dest = "verbose", help = "Print little output") parser$add_argument("-f", "--file", type = "character", dest = "csv_name", default = "sp-500.csv", help = "CSV file to save data to [default: sp-500.csv]") parser$add_argument("-s", "--sleep", type = "integer", dest = "sleeptime", default = 2, help = paste("Time (seconds) between fetching symbols", "[default: 2] (don't flood websites with", "requests!)")) parser$add_argument("--inner", action = "store_true", default = FALSE, dest = "inner", help = paste("Inner join; only dates where all symbols", "have data will be included")) parser$add_argument("--start", type = "character", dest = "start", default = "1997-01-01", help = paste("Earliest date (YYYY-MM-DD) to include", "[default: 1997-01-01]")) parser$add_argument("--end", type = "character", dest = "end", default = "today", help = paste('Last date (YYYY-MM-DD or "today") to', 'include [default: "today"]')) parser$add_argument("-k", "--key", type = "character", dest = "api_key", default = NULL, help = "Quandl API key, needed if getting Quandl data") parser$add_argument("-q", "--quandl", action = "store_true", default = FALSE, dest = "use_quandl", help = "Get data from Quandl") parser$add_argument("-a", "--adjust", action = "store_true", default = FALSE, dest = "adjust", help = "Adjust prices (Quandl only)") parser$add_argument("--about", action = "store_true", default = FALSE, dest = "about", help = paste("Print information about the script and its", "usage, then quit")) args <- parser$parse_args() join <- "outer" if (args$inner) { join <- "inner" } verbose <- args$verbose start <- args$start if (args$end == "today") { end <- Sys.Date() } else { end <- args$end } sleeptime <- args$sleeptime # In seconds csv_name <- args$csv_name api_key <- args$api_key use_quandl <- args$use_quandl adjust <- args$adjust about <- args$about if (about) { # Display a message, then quit comm_name <- substring(commandArgs(trailingOnly = FALSE)[4], 8) cat(comm_name, "\n(c) 2017 Curtis Miller\n", "Licensed under GNU GPL v. 3.0 available at ", "https://www.gnu.org/licenses/gpl-3.0.en.html \n", "E-mail: cgmil@msn.com\n\n", "This script fetches closing price data for ticker symbols included ", "in the S&P 500 stock index. A list of symbols included in the index ", "is fetched from this webpage:", "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies The list ", "is parsed and the symbols included in the list are fetched from ", "either Google Finance (the default) or Quandl (which requires a ", "Quandl API key). If Quandl is the data source, adjusted data can be ", "fetched instead. 
The resulting data set is then saved to a CSV", "file in the current working directory.\n\n", "This package requires the following R packages be installed in order ", "to work (all of which are available from CRAN and can be downloaded ", "and installed automatically from R via the command ", "'install.packages(\"package_name\")'):\n\n", "* rvest\n", "* magrittr\n", "* quantmod\n", "* Quandl\n", "* tibble\n", "* argparse (used only for the command line interface)\n\n", "This script was written by Curtis Miller and was made available on ", "his website: https://ntguardian.wordpress.com\n\n", "You can read more about this script in the following article: ", "https://ntguardian.wordpress.com/blog\n\n", sep = "") quit() } start %<>% as.Date end %<>% as.Date options("getSymbols.warning4.0"=FALSE) sp500_wiki <- read_html( "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies") symbols_table <- sp500_wiki %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[1]') %>% html_table() symbols_table <- symbols_table[[1]] symbols <- as.character(symbols_table$`Ticker symbol`) if (use_quandl) { suppressPackageStartupMessages(library(Quandl)) # This will be needed Quandl.api_key(api_key) } sp500 <- NULL for (s in symbols) { Sys.sleep(sleeptime) if (verbose) { cat("Processing:", s, "...") } tryCatch({ if (use_quandl) { s_data <- Quandl.datatable("WIKI/PRICES", ticker = c(s), date.gte = start, date.lte = end) rownames(s_data) <- as.Date(s_data$date) if (adjust) { s_data <- s_data[, "adj_close", drop = FALSE] } else { s_data <- s_data[, "close", drop = FALSE] } } else { s_data <- Cl(getSymbols(s, src="google", from = start, to = end, env = NULL)) } names(s_data) <- s s_data %<>% as.xts if (length(unique(s_data)) > 1) { # Don't allow what is effectively # empty data if (is.null(sp500)) { sp500 <- as.xts(s_data) } else { sp500 %<>% merge(s_data, join = join) } if (verbose) { cat(" Got it! From", start(s_data) %>% as.character, "to", end(s_data) %>% as.character, "\n") } } else if (verbose) { cat("Sorry, but not this one!\n") } }, error = function(e) { if (verbose) { cat("Sorry, but not this one!\n") } }) } badsymbols <- setdiff(symbols, names(sp500)) if (verbose & (length(badsymbols) > 0)) { cat("There were", length(badsymbols), "symbols for which data could not be obtained.\nThey are:", badsymbols, "\n") } write.csv(rownames_to_column(as.data.frame(sp500), "Date"), file=csv_name)

Clearly this can be modified to work for other indices, which requires getting a different list to scrape. I’m sure Wikipedia has more lists that can easily substitute for the one I scraped. You could even construct a list of symbols by hand (though this could be a painful process). I’ll leave that up to you. If you have alternative lists of symbols for different indices, I would love to read your substitute for lines 107-112.

I have created a video course that Packt Publishing will be publishing later this month, entitled Unpacking NumPy and Pandas, the first volume in a four-volume set of video courses entitled, Taming Data with Python; Excelling as a Data Analyst. This course covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there (when it becomes available).

If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.



Assessing the Accuracy of our models (R Squared, Adjusted R Squared, RMSE, MAE, AIC)

Mon, 07/10/2017 - 10:07

(This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers)

Assessing the accuracy of our model

There are several ways to check the accuracy of our models; some are printed directly in R within the summary output, others are just as easy to calculate with specific functions.

R-Squared

This is probably the most commonly used statistic and allows us to understand the percentage of variance in the target variable explained by the model. It can be computed as a ratio of the regression sum of squares and the total sum of squares. This is one of the standard measures of accuracy that R prints out through the function summary for linear models and ANOVAs.

Adjusted R-Squared

This is a form of R-squared that is adjusted for the number of terms in the model. It can be computed as follows:

\(R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}\)

where \(R^2\) is the R-squared of the model, \(n\) is the sample size and \(p\) is the number of terms (or predictors) in the model. This index is extremely useful to determine possible overfitting in the model. This happens particularly when the sample size is small: if we fill the model with predictors, we may end up increasing the R-squared simply because the model starts adapting to the noise in the data rather than properly describing it.

Root Mean Squared Deviation or Root Mean Squared Error

The previous indexes measure the amount of variance in the target variable that can be explained by our model. This is a good indication, but in some cases we are more interested in quantifying the error in the same measuring unit as the variable. In such cases we need to compute indexes that average the residuals of the model. The problem is that the residuals are both positive and negative and their distribution should be fairly symmetrical, which means their average will always be close to zero. So we need other indexes to quantify the average residual, for example by averaging the squared residuals:

\(RMSE = \sqrt{\frac{1}{n}\sum_{t=1}^{n}(\hat{y}_t - y_t)^2}\)

This is the square root of the mean of the squared residuals, with \(\hat{y}_t\) being the estimated value at point \(t\), \(y_t\) being the observed value in \(t\), and \(n\) being the sample size. The RMSE has the same measuring unit as the variable \(y\).

Mean Squared Deviation or Mean Squared Error

This is simply the quantity under the square root in the previous equation, but it is not used often. The issue with both the RMSE and the MSE is that, since they square the residuals, they tend to be more affected by large residuals. This means that even if our model explains the large majority of the variation in the data very well, with few exceptions, these exceptions will inflate the value of the RMSE.

Mean Absolute Deviation or Mean Absolute Error

To solve the problem with large residuals we can use the mean absolute error, where we average the absolute value of the residuals:

\(MAE = \frac{1}{n}\sum_{t=1}^{n}|\hat{y}_t - y_t|\)

This index is more robust against large residuals. Since the RMSE is still widely used, even though its problems are well known, it is always better to calculate and present both in a research paper.

Akaike Information Criterion

This index is another popular index we have used along the text to compare different models. It is very popular because it corrects the RMSE for the number of predictors in the model, thus allowing us to account for overfitting. For a model with Gaussian errors it can be computed (up to an additive constant) as:

\(AIC = n \log\left(\frac{\sum_{t=1}^{n}(\hat{y}_t - y_t)^2}{n}\right) + 2p\)

where again \(p\) is the number of terms in the model.
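To make these definitions concrete, here is a short illustration of my own (not part of the original post) that computes the indices by hand for a simple linear model on mtcars and compares them with R’s built-ins:

fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
p   <- length(coef(fit)) - 1      # number of predictors
res <- residuals(fit)

r2     <- summary(fit)$r.squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)   # matches summary(fit)$adj.r.squared
rmse   <- sqrt(mean(res^2))
mae    <- mean(abs(res))

c(r2 = r2, adj_r2 = adj_r2, rmse = rmse, mae = mae, aic = AIC(fit))

Note that AIC() uses the full Gaussian log-likelihood, so its value differs from the simplified expression above by an additive constant that depends only on the sample size; comparisons between models fit to the same data are unaffected.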


A tour of the tibble package

Mon, 07/10/2017 - 07:00

(This article was first published on Higher Order Functions, and kindly contributed to R-bloggers)

Dataframes are used in R to hold tabular data. Think of the prototypical
spreadsheet or database table: a grid of data arranged into rows and columns.
That’s a dataframe. The tibble R package
provides a fresh take on dataframes to fix some longstanding annoyances with
them. For example, printing a large tibble shows just the first 10 rows instead of flooding the console with the first 1,000 rows.

In this post, I provide a tour of the tibble package. Because the package
provides tools for working with tabular data, it also contains some less
well-known helper functions that I would like to advertise in this post. In
particular, I find add_column(), rownames_to_column() and
rowid_to_column() to be useful tools in my work.

Why the name “tibble”? Tibbles first appeared in the dplyr package in
January 2014, but they weren’t called “tibbles” yet. dplyr used a subclass
tbl_df for its dataframe objects and they behaved like modern tibbles: Better
printing, not converting strings to factors, etc. We loved them, and we would
convert our plain-old dataframes into these tbl_dfs for these features.
However, the name tee-bee-ell-dee-eff is quite a mouthful. On Twitter,
@JennyBryan raised the question of how to pronounce
tbl_df
, and
@kevin_ushey suggested “tibble
diff”
. The name was
enthusiastically received.

Creating tibbles

Create a fresh tibble using tibble() and vectors of values for each column.
The column definitions are evaluated sequentially, so additional columns can be
created by manipulating earlier defined ones. Below x is defined and then the
values of x are manipulated to create the column x_squared.

library(tibble) library(magrittr) tibble(x = 1:5, x_squared = x ^ 2) #> # A tibble: 5 x 2 #> x x_squared #> #> 1 1 1 #> 2 2 4 #> 3 3 9 #> 4 4 16 #> 5 5 25

Note that this sequential evaluation does not work on classical dataframes.

data.frame(x = 1:5, x_squared = x ^ 2) #> Error in data.frame(x = 1:5, x_squared = x^2): object 'x' not found

The function data_frame()—note the underscore instead of a dot—is an alias
for tibble(), which might be more transparent if your audience has never
heard of tibbles.

data_frame(x = 1:5, x_squared = x ^ 2) #> # A tibble: 5 x 2 #> x x_squared #> #> 1 1 1 #> 2 2 4 #> 3 3 9 #> 4 4 16 #> 5 5 25

In tibble(), the data are defined column-by-column. We can use tribble() to
write out tibbles row-by-row. Formulas like ~x are used to denote column names.

tribble( ~ Film, ~ Year, "A New Hope", 1977, "The Empire Strikes Back", 1980, "Return of the Jedi", 1983) #> # A tibble: 3 x 2 #> Film Year #> #> 1 A New Hope 1977 #> 2 The Empire Strikes Back 1980 #> 3 Return of the Jedi 1983

The name “tribble” is short for “transposed tibble” (the transposed
part referring to change from column-wise creation in tibble() to row-wise
creation in tribble).

I like to use light-weight tribbles for two particular tasks:

  • Recoding: Create a tribble of, say, labels for a plot and join it onto a dataset.
  • Exclusion: Identify observations to exclude, and remove them with an anti-join.

Pretend that we have a tibble called dataset. The code below shows
examples of these tasks with dataset.

library(dplyr)

# Recoding example
plotting_labels <- tribble(
  ~ Group, ~ GroupLabel,
  "TD",    "Typically Developing",
  "CI",    "Cochlear Implant",
  "ASD",   "Autism Spectrum"
)

# Attach labels to dataset
dataset <- left_join(dataset, plotting_labels, by = "Group")

# Exclusion example
ids_to_exclude <- tibble::tribble(
  ~ Study,      ~ ResearchID,
  "TimePoint1", "053L",
  "TimePoint1", "102L",
  "TimePoint1", "116L"
)

reduced_dataset <- anti_join(dataset, ids_to_exclude)

Converting things into tibbles

as_tibble() will convert dataframes, matrices, and some other types into
tibbles.

as_tibble(mtcars) #> # A tibble: 32 x 11 #> mpg cyl disp hp drat wt qsec vs am gear carb #> * #> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 #> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 #> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 #> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 #> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 #> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 #> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 #> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 #> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 #> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 #> # ... with 22 more rows

We can convert simple named vectors into tibbles with enframe().
For example, quantile() returns a named vector which we can enframe().

quantiles <- quantile(mtcars$hp, probs = c(.1, .25, .5, .75, .9)) quantiles #> 10% 25% 50% 75% 90% #> 66.0 96.5 123.0 180.0 243.5 enframe(quantiles, "quantile", "value") #> # A tibble: 5 x 2 #> quantile value #> #> 1 10% 66.0 #> 2 25% 96.5 #> 3 50% 123.0 #> 4 75% 180.0 #> 5 90% 243.5

I have not had an opportunity to use enframe() since I learned about it,
but I definitely have created dataframes from names-value pairs in the past.

It’s also worth noting the most common way I create tibbles: Reading in files.
The readr package will create tibbles when
reading in data files like csvs.
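For example (the file name below is just a placeholder of mine), read_csv() returns a tibble directly, with no conversion step needed:

library(readr)

df <- read_csv("my_data.csv")  # "my_data.csv" is a placeholder path
df                             # prints as a tibble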

Viewing some values from each column

When we print() a tibble, we only see enough columns to fill the
width of the console. For example, we will not see every column in this
tibble().

# Create a 200 x 26 dataframe df <- as.data.frame(replicate(26, 1:200)) %>% setNames(letters) %>% as_tibble() df #> # A tibble: 200 x 26 #> a b c d e f g h i j k l #> #> 1 1 1 1 1 1 1 1 1 1 1 1 1 #> 2 2 2 2 2 2 2 2 2 2 2 2 2 #> 3 3 3 3 3 3 3 3 3 3 3 3 3 #> 4 4 4 4 4 4 4 4 4 4 4 4 4 #> 5 5 5 5 5 5 5 5 5 5 5 5 5 #> 6 6 6 6 6 6 6 6 6 6 6 6 6 #> 7 7 7 7 7 7 7 7 7 7 7 7 7 #> 8 8 8 8 8 8 8 8 8 8 8 8 8 #> 9 9 9 9 9 9 9 9 9 9 9 9 9 #> 10 10 10 10 10 10 10 10 10 10 10 10 10 #> # ... with 190 more rows, and 14 more variables: m , n , #> # o , p , q , r , s , t , u , #> # v , w , x , y , z

We can “transpose” the printing with glimpse() to see a few values from every
column. Once again, just enough data is shown to fill the width of the output
console.

glimpse(df) #> Observations: 200 #> Variables: 26 #> $ a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ b 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ c 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ e 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ f 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ g 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ h 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ i 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ j 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ k 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ l 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ m 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ n 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ o 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ p 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ q 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ r 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ s 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ t 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ u 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ v 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ w 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ x 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ y 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... #> $ z 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... Growing a tibble

We can add new rows and columns with add_row() and add_column().

Below we add rows to the bottom of the tibble (the default behavior) and to the
top of the tibble by using the .before argument (add the row before row 1).
There also is an .after argument, but I prefer to only add rows to the
tops and bottoms of tables. The values in the add_row() are computed
iteratively, so we can define values x_squared in terms of x.

df <- tibble(comment = "original", x = 1:2, x_squared = x ^ 2) df #> # A tibble: 2 x 3 #> comment x x_squared #> #> 1 original 1 1 #> 2 original 2 4 df <- df %>% add_row(comment = "append", x = 3:4, x_squared = x ^ 2) %>% add_row(comment = "prepend", x = 0, x_squared = x ^ 2, .before = 1) df #> # A tibble: 5 x 3 #> comment x x_squared #> #> 1 prepend 0 0 #> 2 original 1 1 #> 3 original 2 4 #> 4 append 3 9 #> 5 append 4 16

The value NA is used when values are not provided for a certain column.
Also, because we provide the names of the columns when adding rows, we don’t have to write out the columns in any particular order.

df %>% add_row(x = 5, comment = "NA defaults") %>% add_row(x_squared = 36, x = 6, comment = "order doesn't matter") #> # A tibble: 7 x 3 #> comment x x_squared #> #> 1 prepend 0 0 #> 2 original 1 1 #> 3 original 2 4 #> 4 append 3 9 #> 5 append 4 16 #> 6 NA defaults 5 NA #> 7 order doesn't matter 6 36

We can similarly add columns with add_column().

df %>% add_column(comment2 = "inserted column", .after = "comment") #> # A tibble: 5 x 4 #> comment comment2 x x_squared #> #> 1 prepend inserted column 0 0 #> 2 original inserted column 1 1 #> 3 original inserted column 2 4 #> 4 append inserted column 3 9 #> 5 append inserted column 4 16

Typically, with dplyr loaded, you would create new columns by using mutate(),
although I have recently started to prefer using add_column() for cases like
the above example, where I add a column with a single recycled value.
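For comparison, here is a small example of my own showing both approaches adding the same single recycled value (the column name source is arbitrary):

library(dplyr)

df %>% mutate(source = "simulation")      # dplyr way
df %>% add_column(source = "simulation")  # tibble way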

Row names and identifiers

Look at the converted mtcars tibble again.

as_tibble(mtcars) #> # A tibble: 32 x 11 #> mpg cyl disp hp drat wt qsec vs am gear carb #> * #> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 #> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 #> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 #> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 #> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 #> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 #> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 #> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 #> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 #> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 #> # ... with 22 more rows

The row numbers in the converted dataframe have an asterisk * above them. That
means that the dataframe has row-names. Row-names are clunky and quirky; they are
just a column of data (labels) that umm :confused: we store away from the rest
of the data.

We should move those row-names into an explicit column, and
rownames_to_column() does just that.

mtcars %>% as_tibble() %>% rownames_to_column("model") #> # A tibble: 32 x 12 #> model mpg cyl disp hp drat wt qsec vs am #> #> 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 #> 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 #> 3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 #> 4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 #> 5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 #> 6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 #> 7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 #> 8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 #> 9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 #> 10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 #> # ... with 22 more rows, and 2 more variables: gear , carb

When I fit Bayesian models, I end up with a bunch of samples from a posterior
distribution. In my data-tidying, I need to assign a ID-number to each sample.
The function rowid_to_column() automates this step by creating a new column in
a dataframe with the row-numbers. In the example below, I load some MCMC samples
from the coda package and create draw IDs.

library(coda) data(line, package = "coda") line1 <- as.matrix(line$line1) %>% as_tibble() line1 #> # A tibble: 200 x 3 #> alpha beta sigma #> #> 1 7.17313 -1.566200 11.233100 #> 2 2.95253 1.503370 4.886490 #> 3 3.66989 0.628157 1.397340 #> 4 3.31522 1.182720 0.662879 #> 5 3.70544 0.490437 1.362130 #> 6 3.57910 0.206970 1.043500 #> 7 2.70206 0.882553 1.290430 #> 8 2.96136 1.085150 0.459322 #> 9 3.53406 1.069260 0.634257 #> 10 2.09471 1.480770 0.912919 #> # ... with 190 more rows line1 %>% rowid_to_column("draw") #> # A tibble: 200 x 4 #> draw alpha beta sigma #> #> 1 1 7.17313 -1.566200 11.233100 #> 2 2 2.95253 1.503370 4.886490 #> 3 3 3.66989 0.628157 1.397340 #> 4 4 3.31522 1.182720 0.662879 #> 5 5 3.70544 0.490437 1.362130 #> 6 6 3.57910 0.206970 1.043500 #> 7 7 2.70206 0.882553 1.290430 #> 8 8 2.96136 1.085150 0.459322 #> 9 9 3.53406 1.069260 0.634257 #> 10 10 2.09471 1.480770 0.912919 #> # ... with 190 more rows

From here, I could reshape the data into a long format or draw some random
samples for use in a plot, all while preserving the draw number.

…And that covers the main functionality of the tibble package. I hope you
discovered a new useful feature of the tibble package. To learn more about the
technical differences between tibbles and dataframes, see the tibble chapter in
R for Data Science
.



Using simulation for power analysis: an example based on a stepped wedge study design

Mon, 07/10/2017 - 02:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

Simulation can be super helpful for estimating power or sample size requirements when the study design is complex. This approach has some advantages over an analytic one (i.e. one based on a formula), particularly the flexibility it affords in setting up the specific assumptions in the planned study, such as time trends, patterns of missingness, or effects of different levels of clustering. A downside is certainly the complexity of writing the code as well as the computation time, which can be a bit painful. My goal here is to show that at least writing the code need not be overwhelming.

Recently, I was helping an investigator plan a stepped wedge cluster randomized trial to study the effects of modifying a physician support system on patient-level diabetes management. While analytic approaches for power calculations do exist in the context of this complex study design, it seemed worth the effort to be explicit about all of the assumptions. So in this case I opted to use simulation. The basic approach is outlined below.

The stepped wedge design

In cluster randomized trials, the unit of randomization is the group rather than the individual. While outcomes might be collected at the individual (e.g. student or patient) level, the intervention effect is assessed at the group level (e.g. school or clinic). In a stepped wedge cluster design, the randomization unit is still the group, but all groups are eventually exposed to the intervention at some point in the study. Randomization determines when the intervention starts.

Below is a schematic view of how a stepped wedge study is implemented. In this example, a block of clusters receives the intervention starting in the second period, another block starts the intervention in the third period, and so on. The intervention effect is essentially assessed by making within-group comparisons. By staggering the starting points, the study is able to distinguish between time effects and treatment effects. If all groups started the intervention at the same point, we would need to assume that any improvements were due only to the intervention rather than to changes that were occurring over time. This is not an assumption anyone can easily justify.

Power and simulation

The statistical power of a study is the conditional probability (conditional on a given effect size) that a hypothesis test will correctly reject the null hypothesis (i.e. conclude there is an effect when there actually is one). Power is underscored by the notion that a particular study can be replicated exactly over and over again. So, if the power of a study is 80%, that means that in 80% of the replications of that study we will (correctly) reject the null hypothesis.

So, to estimate power, we can simulate replications of the study many times and conduct repeated hypothesis tests. The proportion of tests where we reject the null hypothesis is the estimated power. Each of these replications is based on the same set of data generating assumptions: effect sizes, sample sizes, individual level variation, group level variation, etc.
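Before turning to the stepped wedge design itself, here is a generic illustration of the idea (my own toy example, not part of the original post): estimating the power of a two-sample t-test by simulation and checking it against the analytic answer.

set.seed(123)

reps <- 1000
rejections <- replicate(reps, {
  y0 <- rnorm(50, mean = 0,   sd = 1)  # control arm
  y1 <- rnorm(50, mean = 0.5, sd = 1)  # treated arm, true effect = 0.5 SD
  t.test(y1, y0)$p.value < 0.05
})

mean(rejections)                                  # estimated power
power.t.test(n = 50, delta = 0.5, sd = 1)$power   # analytic answer, about 0.70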

Simulating from a stepped wedge design

In this example, we are assuming a 3-year study with four groups of clusters randomized to start an intervention at either 12 months, 18 months, 24 months, or 30 months (i.e. every 6 months following the 1st baseline year). The study would enroll patients at baseline in each of the clusters, and a measurement of a binary outcome (say diabetes under control, or not) would be collected at that time. Those patients would be followed over time and the same measurement would be collected every 6 months, concluding with the 7th measurement in the 36th month of the study. (It is totally possible to enroll new patients as the study progresses and have a different follow-up scheme, but this approximates the actual study I was working on.)

The data are generated based on a mixed effects model where there are group level effects (\(b_j\) in the model) as well as individual level effects (\(b_i\)). The model also assumes a very slight time trend before the intervention (e.g. diabetes control is improving slightly over time for an individual), an intervention effect, and an almost non-existent change in the time trend after the intervention. The outcome in each period is generated based on this formula:

\(\text{logit}(Y_{ijt}) = 0.8 + 0.01 \cdot period + 0.8 \cdot I_{jt} + 0.001 \cdot I_{jt} \cdot (period - s_j) + b_i + b_j,\)

where \(period\) goes from 0 to 6 (period 0 is the baseline, period 1 is the 6-month follow-up, etc.), \(I_{jt}\) is 1 if cluster \(j\) is in the intervention in period \(t\), \(s_j\) is the period in which the intervention starts for cluster \(j\), and \(\text{logit}(Y_{ijt})\) is the log odds of the outcome \(Y\) for individual \(i\) in cluster \(j\) during period \(t\).
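To get a feel for what these coefficients imply on the probability scale, we can plug values into the formula with the random effects set to zero. This is just a back-of-the-envelope check, not part of the simulation itself.

# Implied probability of a controlled outcome, holding the random effects bi and bj at zero.
plogis(0.8)              # baseline (period 0, no intervention): about 0.69
plogis(0.8 + 0.8)        # first period on the intervention: about 0.83
plogis(0.8 + 0.01 * 6)   # final period if never treated: still only about 0.70

Those two reference values (roughly 0.70 and 0.83) appear to be why the plotting code further down draws horizontal guide lines at 0.7 and 0.83.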

We start by defining the data structure using simstudy "data def" commands. We assume that 100 individuals will be followed at each site for the full study. (We are not assuming any dropout, though we could easily do that.) In this particular case, we assume an effect size of 0.8, which is a log odds ratio:

library(simstudy)

# intervention start periods: 4 waves of 10 sites each, starting in periods 2 through 5
starts <- "rep(c(2 : 5), each = 10)"

# site-level definitions: random effect bj, start period sj, and 100 individuals per site (ips)
siteDef <- defData(varname = "bj", dist = "normal", formula = 0, variance = .01, id = "site")
siteDef <- defData(siteDef, varname = "sj", dist = "nonrandom", formula = starts)
siteDef <- defData(siteDef, varname = "ips", dist = "nonrandom", formula = 100)

# individual-level random effect bi
indDef <- defDataAdd(varname = "bi", dist = "normal", formula = 0, variance = 0.01)

# time-varying definitions: intervention indicator Ijt and the binary outcome Yijt
trtDef <- defDataAdd(varname = "Ijt", formula = "as.numeric(period >= sj)", dist = "nonrandom")

f <- "0.8 + .01 * period + 0.8 * Ijt + 0.001 * Ijt * (period-sj) + bi + bj"

trtDef <- defDataAdd(trtDef, varname = "Yijt", formula = f, dist = "binary", link = "logit")

To generate 40 clusters of data, we use the following code:

set.seed(6789)

dtSite <- genData(40, siteDef)
dtSite <- genCluster(dtSite, cLevelVar = "site", numIndsVar = "ips", level1ID = "id")
dtSite <- addColumns(indDef, dtSite)

dtSiteTm <- addPeriods(dtSite, nPeriods = 7, idvars = "id")
dtSiteTm <- addColumns(trtDef, dtSiteTm)

dtSiteTm

##          id period site         bj sj ips         bi timeID Ijt Yijt
##     1:    1      0    1 -0.1029785  2 100 0.08926153      1   0    1
##     2:    1      1    1 -0.1029785  2 100 0.08926153      2   0    1
##     3:    1      2    1 -0.1029785  2 100 0.08926153      3   1    1
##     4:    1      3    1 -0.1029785  2 100 0.08926153      4   1    1
##     5:    1      4    1 -0.1029785  2 100 0.08926153      5   1    1
##    ---
## 27996: 4000      2   40  0.1000898  5 100 0.18869371  27996   0    1
## 27997: 4000      3   40  0.1000898  5 100 0.18869371  27997   0    0
## 27998: 4000      4   40  0.1000898  5 100 0.18869371  27998   0    1
## 27999: 4000      5   40  0.1000898  5 100 0.18869371  27999   1    1
## 28000: 4000      6   40  0.1000898  5 100 0.18869371  28000   1    1

And to visualize what the study data might look like under these assumptions:

library(ggplot2)

# summarize the proportion controlled by site and period
dt <- dtSiteTm[, .(Y = mean(Yijt)), keyby = .(site, period, Ijt, sj)]

ggplot(data = dt, aes(x = period, y = Y, group = site)) +
  geom_hline(yintercept = c(.7, .83), color = "grey99") +
  geom_line(aes(color = factor(site))) +
  geom_point(data = dt[sj == period], color = "grey50") +   # mark each site's start period
  theme(panel.background = element_rect(fill = "grey90"),
        panel.grid = element_blank(),
        plot.title = element_text(size = 10, hjust = 0),
        panel.border = element_rect(fill = NA, colour = "gray90"),
        legend.position = "none",
        axis.title.x = element_blank()) +
  ylab("Proportion controlled") +
  scale_x_continuous(breaks = seq(0, 10, by = 2),
                     labels = c("Baseline", paste("Year", c(1:5)))) +
  scale_y_continuous(limits = c(.5, 1), breaks = c(.5, .6, .7, .8, .9, 1)) +
  ggtitle("Stepped-wedge design with immediate effect") +
  facet_grid(sj ~ .)

Estimating power

We are going to estimate power using only 20 clusters and an effect size of 0.25. (Assuming 40 clusters and a large effect size was useful for visualizing the data, but it is not so interesting for illustrating power, since under those assumptions we are virtually guaranteed to find an effect.)
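The modified data generation code is not shown below, but under the stated assumptions it amounts to shrinking the effect in the outcome formula and cutting the number of sites. Here is a hedged sketch of what that change might look like; the 50 patients per site is inferred from the model output further down (1,000 individuals across 20 sites), and the spread of start periods across the smaller number of sites is a guess.

# Sketch of the revised assumptions: 20 sites, 50 patients per site, log odds ratio of 0.25.
nsites <- 20
starts <- "rep(c(2 : 5), each = 5)"   # 4 waves of 5 sites each

siteDef <- defData(varname = "bj", dist = "normal", formula = 0, variance = .01, id = "site")
siteDef <- defData(siteDef, varname = "sj", dist = "nonrandom", formula = starts)
siteDef <- defData(siteDef, varname = "ips", dist = "nonrandom", formula = 50)

indDef <- defDataAdd(varname = "bi", dist = "normal", formula = 0, variance = 0.01)

f <- "0.8 + .01 * period + 0.25 * Ijt + 0.001 * Ijt * (period - sj) + bi + bj"

trtDef <- defDataAdd(varname = "Ijt", formula = "as.numeric(period >= sj)", dist = "nonrandom")
trtDef <- defDataAdd(trtDef, varname = "Yijt", formula = f, dist = "binary", link = "logit")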

After generating the data for one iteration (code not shown), we fit a generalized linear mixed effects model to examine the effect estimate. In this case, the estimate is 1.46 (95% CI 1.21-1.77) on the odds ratio scale, or 0.37 (95% CI 0.19-0.57) on the log odds ratio scale.

library(lme4)
library(sjPlot)

glmfit <- glmer(data = dtSiteTm,
                Yijt ~ period + Ijt + I(Ijt * (period - sj)) + (1|id) + (1|site),
                family = "binomial")

sjt.glmer(glmfit, show.icc = FALSE, show.dev = FALSE)

                                          Yijt
                             Odds Ratio   CI            p
Fixed Parts
(Intercept)                  2.15         1.90 – 2.44   <.001
period                       1.00         0.95 – 1.06   .959
Ijt                          1.46         1.21 – 1.77   <.001
I(Ijt * (period – sj))       0.99         0.91 – 1.07   .759
Random Parts
τ00, id                      0.011
τ00, site                    0.029
N (id)                       1000
N (site)                     20
Observations                 7000
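If sjt.glmer is not available, essentially the same numbers can be pulled straight out of the lme4 fit. A quick sketch (the Wald confidence interval here is an approximation, so the values should only roughly reproduce the 1.46 on the odds ratio scale and 0.37 on the log odds scale reported above):

# Intervention effect on the log odds scale with an approximate Wald 95% CI,
# then the same quantities exponentiated to the odds ratio scale.
est <- coef(summary(glmfit))["Ijt", "Estimate"]
se  <- coef(summary(glmfit))["Ijt", "Std. Error"]

round(c(logOR = est, lower = est - 1.96 * se, upper = est + 1.96 * se), 2)
round(exp(c(OR = est, lower = est - 1.96 * se, upper = est + 1.96 * se)), 2)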

In order to estimate power, we need to generate a large number of replications. I created a simple function that generates a new data set every iteration based on the definitions. If we want to vary the model assumptions across different replications, we can write code to modify the data definition part of the process. In this way we could look at power across different sample size, effect size, or variance assumptions. Here, I am only considering a single set of assumptions.

# gData() relies on siteDef, indDef, trtDef, and nsites from the global environment;
# for the power runs below, nsites is 20.
gData <- function() {

  dtSite <- genData(nsites, siteDef)
  dtSite <- genCluster(dtSite, cLevelVar = "site", numIndsVar = "ips", level1ID = "id")
  dtSite <- addColumns(indDef, dtSite)

  dtSiteTm <- addPeriods(dtSite, nPeriods = 7, idvars = "id")
  dtSiteTm <- addColumns(trtDef, dtSiteTm)

  return(dtSiteTm)
}

And finally, we iterate through a series of replications, keeping track of each hypothesis test in the vector result. Ideally we would use a large number of replications (say, 1000), but that can take a long time. Each call to glmer is very resource intensive – unfortunately, I know of no way to speed this up (please get in touch if you have thoughts on this) – so for the purposes of illustration I have used only 99 iterations. Note also that I check whether the model converges in each iteration and only keep results from valid estimates; convergence can be an issue with mixed effects models, particularly when sample sizes are small. To estimate the power (which in this case comes out around 78%), we calculate the proportion of successful iterations with a p-value below 0.05, the alpha-level threshold used in the hypothesis test:

result <- NULL
i <- 1

while (i < 100) {

  dtSite <- gData()

  # fit the model, catching convergence warnings so they do not count toward the 99 valid runs
  glmfit <- tryCatch(
    glmer(data = dtSite,
          Yijt ~ period + Ijt + I(Ijt * (period - sj)) + (1|id) + (1|site),
          family = "binomial"),
    warning = function(w) { "warning" }
  )

  if (!is.character(glmfit)) {
    pvalue <- coef(summary(glmfit))["Ijt", "Pr(>|z|)"]
    result <- c(result, pvalue)
    i <- i + 1
  }
}

mean(result < .05)

## [1] 0.7812

To explore how sensitive the power estimates are to the underlying assumptions about effect size, sample size, variation, and time trends, we could vary those parameters and run a sequence of iterations. The code gets a little more complicated (essentially we need to change the "data defs" for each set of iterations), but it is still quite manageable. Of course, you should plan for fairly long execution times, particularly if you use 500 or 1000 iterations for each scenario rather than the 99 I used here.
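As a rough sketch of how such a sweep might be organized, the loop below varies only the effect size; the grid values and per-scenario iteration counts are made up for illustration, and siteDef, indDef, nsites, and gData() are assumed to be defined exactly as above.

# Hypothetical scenario sweep over effect sizes (log odds ratios).
effects <- c(0.20, 0.25, 0.30)
power   <- numeric(length(effects))

for (k in seq_along(effects)) {

  # redefine the outcome formula, and hence the global trtDef that gData() picks up
  f <- paste0("0.8 + .01 * period + ", effects[k],
              " * Ijt + 0.001 * Ijt * (period - sj) + bi + bj")
  trtDef <- defDataAdd(varname = "Ijt", formula = "as.numeric(period >= sj)",
                       dist = "nonrandom")
  trtDef <- defDataAdd(trtDef, varname = "Yijt", formula = f,
                       dist = "binary", link = "logit")

  # same replication loop as above, with fewer iterations per scenario to save time
  result <- NULL
  i <- 1
  while (i < 50) {
    glmfit <- tryCatch(
      glmer(data = gData(),
            Yijt ~ period + Ijt + I(Ijt * (period - sj)) + (1|id) + (1|site),
            family = "binomial"),
      warning = function(w) "warning")
    if (!is.character(glmfit)) {
      result <- c(result, coef(summary(glmfit))["Ijt", "Pr(>|z|)"])
      i <- i + 1
    }
  }
  power[k] <- mean(result < .05)
}

data.frame(effect = effects, power = power)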


To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

How to Scrape Images from Google

Mon, 07/10/2017 - 02:00

(This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

In my last post, I tried to train a deep neural net to detect brand logos in images. For that, I downloaded the Flickr27 dataset, which contains 270 images of 27 different brands.
As this dataset is rather small, I turned to Google image search and wrote a small R script to download the first 20 images for each search term.

In order to use the script, you need to download phantomJS and create a small JavaScript file. (See the Gist at the end of this post.)
If you have the phantomJS executable, the small JS file, and the R file in one folder, you just need to run the following lines to download the images:

# scrapeJSSite() and downloadImages() are defined in the Gist linked at the end of the post;
# the second argument (i) is presumably a label or index from a loop over search terms.
gg <- scrapeJSSite(searchTerm = "adidas+logo")
downloadImages(as.character(gg$images), i)
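The two helpers live in the Gist and are not reproduced in this excerpt. Purely to give a sense of the moving parts, here is a rough sketch of how such helpers could be structured; the function bodies, the file names (imageScrape.js, 1.html), and the CSS selector are assumptions for illustration, not the actual Gist code.

library(rvest)

# Hypothetical sketch: render a Google image search page with phantomJS, then pull the
# image URLs out of the saved HTML. Assumes imageScrape.js writes the rendered page to 1.html.
scrapeJSSite <- function(searchTerm) {
  url <- paste0("https://www.google.com/search?site=&tbm=isch&q=", searchTerm)
  system(paste("phantomjs imageScrape.js", shQuote(url)))   # writes 1.html
  pg <- read_html("1.html")
  data.frame(images = html_attr(html_nodes(pg, "img"), "src"),
             stringsAsFactors = FALSE)
}

# Hypothetical sketch: save each image URL to disk, numbering files by a label.
downloadImages <- function(urls, label) {
  for (j in seq_along(urls)) {
    try(download.file(urls[j], destfile = paste0(label, "_", j, ".jpg"), mode = "wb"),
        silent = TRUE)
  }
}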

Good luck scraping some images from Google.
I will use the script to enhance the brand logo dataset, hopefully improving my model with some better image data.


To leave a comment for the author, please follow the link and comment on their blog: Florian Teschner.
