
A Shiny App for Exploring Commodities Prices and Economic Indicators, via Quandl

Fri, 06/02/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)



In a previous post, we created an R Notebook to explore the relationship between the copper/gold price ratio and 10-year Treasury yields (if you’re curious why we might care about this relationship, have a quick look at that previous post), relying on data from Quandl. Today, we’ll create a Shiny app that lets users choose which commodity ratios and economic indicators to investigate. Perhaps users don’t care about Dr. Copper and Treasury yields, but instead want to explore the oil/gold price ratio and how it correlates with the US inflation rate, or the EU inflation rate. Let’s give them some flexibility!

The finished app is available here.

Before we get to it, note a few issues that we have seen in the past.

As with our previous Shiny app, in the code chunk below we have some important decisions to make about how a user selects a commodity. We could use textInput to allow the user to enter the code for the desired data set, which would not limit the user in any way – he or she could choose any data set on Quandl. The cost would be that the user would need to know, or go to Quandl and look up, the code for any data set.

For example, to import iron ore prices, the user would have to type in ODA/PIORECR_USD. That’s a big problem if your end users are not familiar with and have no interest in data set codes. To emphasize convenience and usability we will go with selectInput instead of textInput, meaning our app will show a drop-down of a few choices. The user just clicks on “Iron Ore” instead of typing ODA/PIORECR_USD, or clicks on “copper” instead of typing CHRIS/CME_HG1.1. But, if a user wants to work with a data set that we haven’t included, said user is out of luck.

Another big decision is how many choices to give the user. I’ve included a few: copper, oil, iron, platinum, palladium and silver. That’s a nice smattering of commodities whose prices tend to rise with a growing economy, plus silver, which does not. I included silver so we could take a look at a commodity that should behave differently from the rest. As always, our choice here is driven by how broad this app should be. We could have added steel, lead, zinc, and tin, or we could have included just copper, oil and iron, depending on the use case. Either way, the number of drop-down choices is another tradeoff between usability and flexibility.

The final decision is a bit more nuanced and requires looking ahead to how these inputs will be used further down in the app. Have a peek at the object called commodityChoices and you might notice that we don’t strictly need that object. We could have put the vector of choices as an argument to selectInput, so that our code would have read choices = c("Copper" = "CHRIS/CME_HG1.1", ...) instead of choices = commodityChoices. In that choice assignment, “copper” is called the name and “CHRIS/CME_HG1.1” is called the value (together we can think of them as a name-value pair). The reason for building a separate commodityChoices object is that we want the ability to extract either the name or the value of the name-value pair. Usually we would care only about the value, because we want to pass the value to Quandl and import the data, but that name is going to be useful when we label our graphs.

Without further ado, let’s look at commodityChoices, econIndicatorChoices and the use of selectInput.

# Create a vector of commodity choices.
commodityChoices <- c("Copper" = "CHRIS/CME_HG1",
                      "WTI oil" = "FRED/DCOILWTICO",
                      "Iron Ore" = "ODA/PIORECR_USD", # monthly
                      "Platinum" = "LPPM/PLAT",
                      "Palladium" = "LPPM/PALL",
                      "Silver" = "LBMA/SILVER")

# Make those commodity choices available via selectInput.
selectInput("commodity", "Commodity",
            choices = commodityChoices,
            selected = "Copper")

# Create a vector of economic indicator choices.
econIndicatorChoices <- c("10-Yr Yield" = "FRED/DGS10", # daily
                          "US CPI" = "RATEINF/INFLATION_USA", # monthly
                          "Japan CPI" = "RATEINF/INFLATION_JPN",
                          "EU CPI" = "RATEINF/INFLATION_EUR")

# Make those economic indicator choices available via selectInput.
selectInput("econIndicator", "Economic Indicator",
            choices = econIndicatorChoices,
            selected = "10-yr Yield")

# A standard date range input.
dateRangeInput("dateRange", "Date range",
               start = "1990-01-01",
               end = "2016-12-31")

Now that we have the inputs in a sidebar for the user, it’s back to Quandl to import the data for the chosen commodity, gold and the chosen economic indicator. There’s a common date range for all three so we’ll start by creating start and end date objects.

ratio_indicator <- reactive({

  Quandl.api_key("your API key here")

  start_date <- format(input$dateRange[1])
  end_date <- format(input$dateRange[2])

We could now write three separate calls to Quandl for each of the data sets but, instead, let’s make use of the map() function from the purrr package. If you’re not familiar with purrr, have a look here. I’ll just say that you’ll probably never have to use lapply() again (and that should be motivation enough), but, in short, the family of map() functions takes a function and applies it to the elements in a vector, similar to the apply() functions.

Before we can use map() though, we need a vector to feed it. Let’s create a vector of Quandl codes.

# Create a vector of 3 data set codes:
# 1) commodity chosen by user
# 2) gold quandl code
# 3) economic indicator chosen by user
gold_code <- "CHRIS/CME_GC1.1"

# Vector of Quandl codes.
data_set_codes <- c(input$commodity, gold_code, input$econIndicator)

Then we’ll apply the Quandl() function by piping our vector of codes and using map().

# Pipe the data_set_codes vector to Quandl via the map() function.
# Note we can still set the start and end date and object type
# as we always can with Quandl.
quandlData <- data_set_codes %>%
  # Pipe the datasets vector to Quandl via the map() function.
  map(Quandl,
      start_date = start_date,
      end_date = end_date,
      collapse = "monthly",
      type = "xts") %>%

Next, we will use map() to apply the na.locf() function to our time series and ensure that no NAs remain.

  # Replace all NAs using map() and na.locf().
  map(na.locf, fromLast = TRUE) %>%

If we stopped here, we would have a list of three xts series, but I don’t want a list, I want one xts object. So, we’ll pipe our list of three and use reduce() together with merge() to combine the three time series into one xts object.

  # Merge to one xts object using reduce() and merge().
  reduce(merge) %>%
  # Add nicer column names.
  `colnames<-`(c(names(commodityChoices[commodityChoices == input$commodity]),
                 "Gold",
                 names(econIndicatorChoices[econIndicatorChoices == input$econIndicator])))

Alright, after running our Quandl codes through that series of mapped functions, we have three time series stored in one xts object, and now we want to calculate the price ratio of the chosen commodity/gold.

To create that price ratio, we need to divide the first column by the second column and we’ll store it in a new column called ratio.

Then we will save just the ratio and the economic indicator column data into their own xts object. That is not strictly necessary, but it makes things cleaner and easier when we pass the data to dygraph().

# Create a column and add the price ratio.
quandlData$ratio <- quandlData[, 1]/quandlData[, 2]

# Save just the ratio and the economic indicator data.
ratio_indicator <- merge(quandlData$ratio, quandlData[, 3])

# Add more general names.
colnames(ratio_indicator) <- c("ratio", "indicator")

return(ratio_indicator)
})

Now we just need to pass our reactive object ratio_indicator() to dygraph() and follow the same steps as we did when testing in our Notebook.

We will use dyRoller() to smooth out our chart and make each point an average of the number of periods specified with rollPeriod = X. This won’t affect our xts object, where we store the data, it just makes the chart more readable.

Remember also that we are charting two time series on the same chart and they are on different scales, so we want to add a right-hand-side y-axis.

To do so, we need to invoke dyAxis() for the left-hand axis, called “y”. Then we invoke dyAxis() for the right-hand axis, called “y2”. We also need to set independentTicks = TRUE so that we can use a unique, independent value scale for the right-hand side. Next, in our dySeries() call for each time series, we assign each one to an axis. Here we assign “ratio” with axis = 'y', so that the commodity-gold price ratio will be on the left-hand scale, and we assign “indicator” with axis = 'y2', so the economic indicator will be on the right-hand scale.

dygraphOutput("ratio_indicator") output$ratio_indicator <- renderDygraph({ dygraph(ratio_indicator()) %>% # Add the rollPeriod for smoothing. dyRoller(rollPeriod = 3) %>% # Create two independent axes, just we did in the Notebook. dyAxis("y", label = "USD") %>% dyAxis("y2", label = "Percent (%)", independentTicks = TRUE) %>% # Assign each time series to an axis. # Use the name from the name-value pair to create nice labels for each. dySeries("ratio", axis = 'y', label = paste(names(commodityChoices[commodityChoices == input$commodity]), "/Gold (LHS)", sep = ""), color = "blue") %>% dySeries("indicator", axis = 'y2', label = paste(names(econIndicatorChoices[econIndicatorChoices == input$econIndicator]), "(RHS)", sep = ""), color = "red") })

We could end things here but let’s go ahead and add a chart to show the rolling correlation between the ratio and the indicator. We’ve done so much work to calculate and wrangle these time series, might as well put them to use!

First, we’ll calculate the rolling correlation using the rollapply() function. Nothing too complicated here.

dygraphOutput("rollingCorrelation") output$rollingCorrelation <- renderDygraph({ rolling_cor <- rollapply(ratio_indicator(), 24, function(x) cor(x[, 1], x[, 2], use = "pairwise.complete.obs"), by.column = FALSE) # Make a nicer name for the xts object that stores the rolling correlation. # This name will be displayed when a user hovers on the dygraph. names(rolling_cor) <- paste(names(commodityChoices[commodityChoices == input$commodity]), "/Gold ", names(econIndicatorChoices[econIndicatorChoices == input$econIndicator]), " Correlation", sep = "")

It’s not necessary, but I like to display the mean, minimum and maximum rolling correlations on the chart. We’ll store those in three objects: avg, mini, and maxi.

  avg  <- round(mean(rolling_cor, na.rm = T), 2)
  mini <- round(min(rolling_cor, na.rm = T), 2)
  maxi <- round(max(rolling_cor, na.rm = T), 2)

Now we pass our rolling_cor xts object to dygraph() and pass the mean, minimum and maximum objects to dyLimit().

  dygraph(rolling_cor,
          main = paste(names(commodityChoices[commodityChoices == input$commodity]),
                       "/Gold ",
                       names(econIndicatorChoices[econIndicatorChoices == input$econIndicator]),
                       " Correlation", sep = "")) %>%
    dyRangeSelector(dateWindow = c("2015-01-01", "2016-12-31")) %>%
    # Add a line for the mean, min and max.
    dyLimit(avg, color = 'purple') %>%
    dyLimit(mini, color = 'red') %>%
    dyLimit(maxi, color = 'blue') %>%
    # Add an event for the US election.
    dyEvent("2016-11-08", "Trump!", labelLoc = "bottom")
})

And, we’re done! It’s fun to explore different relationships amongst different time series with this app. And once we have this template in the toolkit, all sorts of different data sets can be substituted in for exploration. For example, we might want to port this work over to a currencies dashboard, or a country GDP dashboard. The nice thing is, it’s just a matter of finding the right Quandl codes and imagining new hypotheses to explore.

Things got a little choppy today with all the piping, so just a reminder that if you want the reusable code for this app, it’s available via the source code button at the top right of the live app. Thanks, and see you next time.





Python and R top 2017 KDnuggets rankings

Thu, 06/01/2017 - 23:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The results of KDnuggets' 18th annual poll of data science software usage are in, and for the first time in three years Python has edged out R as the most popular software. While R increased its share of usage from 45.7% in last year's poll to 52.1% this year, Python's usage among data scientists increased even more, from 36.6% of users in 2016 to 52.6% of users this year.

There were some interesting moves in the long tail, as well. Several tools entered the KDNuggets chart for the first time, including Keras (9.5% of users), PyCharm (9.0%) and Microsoft R Server (4.3%). And several returning tools saw big jumps in usage, including Microsoft Cognitive Toolkit (3.4% of users), Tensorflow (20.2%) and Power BI (10.2%). Microsoft SQL Server increased its share to 11.6% (up from 10.8%), whereas SAS (7.1%) and Matlab (7.4%) saw declines. Julia, somewhat surprisingly, remained flat at 1.1%.

For the complete results and analysis of the 2017 KDnuggets data science software poll, follow the link below.

KDnuggets: New Leader, Trends, and Surprises in Analytics, Data Science, Machine Learning Software Poll


Canada Labour Market: Future Perspectives

Thu, 06/01/2017 - 19:21

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

For anyone looking for job opportunities, it is nice to have an idea of how the job market will perform in the future for your chosen career or industry. Many countries have open data sets that offer this kind of data. In these exercises we will use R to analyze the future perspective of the Canadian labour market.

Answers to the exercises are available here.

Exercise 1
Download and read into R all data sets from Canadian Occupational Projection System (COPS) – 2015 to 2024 projections.

Exercise 2
Load library tidyr. Use gather to rearrange any occupation-related data set that presents time series data into a tidy data format.

Exercise 3
Use gather to rearrange ALL other occupation-related data sets that present time series data into a tidy data format, and combine them into a single data frame.

Exercise 4
Remove rows that contain NA values, columns in French, and the “X” in front of every year. Take a look at your tidy data set.
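To make the reshaping step concrete, here is a minimal sketch (my own illustration, not the official solution) using a tiny made-up table whose year columns follow the X2015, X2016, … naming pattern the exercise refers to:

library(tidyr)
library(dplyr)

occ_wide <- data.frame(Occupation = c("Nurses", "Welders"),
                       X2015 = c(100, 40),
                       X2016 = c(110, NA))

occ_tidy <- occ_wide %>%
  gather(key = "year", value = "value", -Occupation) %>%  # wide -> long (tidy) format
  filter(!is.na(value)) %>%                               # drop rows with NA values
  mutate(year = as.numeric(sub("^X", "", year)))          # strip the "X" in front of the year

occ_tidy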

Learn more about Data manipulation in the online course R: Complete Data Analysis Solutions. In this course you will learn how to:

  • Learn in depth how to work with dplyr
  • Get a full introduction to the data.table package
  • And much more

Exercise 5
Let’s do the same with the industries data sets. Start by taking one of the industries data sets that presents data as a time series and use gather.

Exercise 6
Apply the same procedure from Exercise 5 to all the other industries data sets. Combine them into a new data frame.

Exercise 7
Remove NAs and French columns. In addition, set year and value as numeric, and take a look at your new tidy data set about industries.

Exercise 8
Find the industries that have the lowest number of jobseekers, and create a new data set by subsetting the previous one.

Exercise 9
Plot the recently created data set using a line for each industry.

Exercise 10
Create a similar plot for the top 5 occupations in terms of the lowest number of jobseekers.
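For Exercises 9 and 10, here is a hedged sketch of the kind of line plot being asked for, using placeholder data in place of the subsetted industries table (the column names industry, year, and value are assumptions, not the original data's names):

library(ggplot2)

# Placeholder data standing in for the subsetted industries data set.
ind_low <- data.frame(industry = rep(c("A", "B"), each = 3),
                      year = rep(2015:2017, 2),
                      value = c(5, 4, 3, 7, 6, 5))

# One line per industry over the projection years.
ggplot(ind_low, aes(x = year, y = value, colour = industry)) +
  geom_line()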

Related exercise sets:
  1. Reshape 2 Exercises
  2. As.Date() Exercises
  3. Data frame exercises Vol. 2
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory


In defense of wrapr::let()

Thu, 06/01/2017 - 18:01

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Saw this the other day:

In defense of wrapr::let() (originally part of replyr, and still re-exported by that package) I would say:

  • let() was deliberately designed for a single real-world use case: working with data when you don’t know the column names when you are writing the code (i.e., the column names will come later in a variable). We can re-phrase that as: there is deliberately less to learn as let() is adapted to a need (instead of one having to adapt to let()).
  • The R community already has months of experience confirming let() working reliably in production while interacting with a number of different packages.
  • let() will continue to be a very specific, consistent, reliable, and relevant tool even after dplyr 0.6.* is released, and the community gains experience with rlang/tidyeval in production.

If rlang/tidyeval is your thing, by all means please use and teach it. But please continue to consider also using wrapr::let(). If you are trying to get something done quickly, or trying to share work with others: a “deeper theory” may not be the best choice.

An example follows.

In “base R” one can write:

d <- data.frame(x = 1:3)

If we know the column name we wish to add to this data frame we write:

d$newColumn <- 1

The above is “non-parameterized” evaluation in that the variable name is taken from the source code, and not from a variable carrying the information. This is great for ad-hoc analysis, but it would be hard to write functions, scripts and packages if this was the only way we had to manipulate columns.

This isn’t a problem as R supplies an additional parameterized notation as we show here. When we don’t know the name of the column (but expect it to be in a variable at run time) we write:

# code written very far away from us
variableHoldingColumnName <- 'newColumn'

# our code
d[[variableHoldingColumnName]] <- 1

The purpose of wrapr::let() is to allow one to use the non-parameterized form as if it were parameterized. Obviously this is only useful if there is not a convenient parameterized alternative (which is the case for some packages). But for teaching purposes: how would wrapr::let() let us use the “$” as if it were parameterized (which we would have to do if [[]] and [] did not exist)?

With wrapr::let() we can re-write the “dollar-sign form” as:

wrapr::let(
  c(NEWCOL = variableHoldingColumnName),
  {
    d$NEWCOL <- 1
  }
)

The name “NEWCOL” is a stand-in name that we write all our code in terms of. The expression “c(NEWCOL = variableHoldingColumnName)” is saying: “NEWCOL is to be interpreted as being whatever name is stored in the variable variableHoldingColumnName.” Notice we can’t tell from the code what value is in variableHoldingColumnName, as that won’t be known until execution time. The alias list or vector can be arbitrarily long and built wherever you like (it is just data). The expression block can be arbitrarily large and complex (so you need only pay the mapping notation cost once per function, not once per line of code).
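As a small sketch of that last point (my own example, not from the original post), the alias mapping is ordinary data, so it can be built at run time and can carry several stand-in names at once:

library(wrapr)

d <- data.frame(x = 1:3)

# A named character vector built as data; SRCCOL and NEWCOL are the
# stand-in names used inside the code block below.
mapping <- c(SRCCOL = "x", NEWCOL = "newColumn")

let(mapping, {
  d$NEWCOL <- d$SRCCOL * 2
})

d
#   x newColumn
# 1 1         2
# 2 2         4
# 3 3         6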

And that is wrapr::let().

If you don’t need parametric names you don’t need wrapr::let(). If you do need parametric names wrapr::let() is one of the most reliable and easiest to learn ways to get them.

A nice article from a user recording their experience trying different ways to parameterize their code (including wrapr::let()) can be found here.

If you wish to learn more we have a lot of resources available:

And many more examples on our blog.


A Primer in functional Programming in R (part -2)

Thu, 06/01/2017 - 17:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In the last exercise, we saw how powerful functional programming principles can be, how dramatically they can increase the readability of your code, and how easily you can work with them. In this set of exercises we will look at functional programming principles with purrr. purrr comes with a number of interesting features and is really useful in writing clean and concise code. Please check the documentation and load the purrr library in your R session before starting this exercise set.
Answers to the exercises are available here

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
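If you have not used purrr before, here is a minimal illustration of the map() style these exercises build on (a generic sketch, not the solution to any particular exercise):

library(purrr)

# map() applies a function to every element of a vector or list; typed
# variants such as map_dbl() and map_chr() simplify the result.
map_dbl(mtcars, mean)               # mean of every column of a data frame
map_chr(1:3, ~ paste0("id_", .x))   # formula shorthand for an anonymous function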

Exercise 1
From the airquality dataset (available in base R), find the mean, median, and standard deviation of all columns using map functions.

Exercise 2

In the same dataset, find the 95th percentile of each column, excluding the NA values.

Exercise 3
Load the iris dataset. With the help of the pipe and map functions, find the mean of the relevant columns. Keep in mind that mean is meant for numeric columns, so you may need multiple map-like functions. I expect the output as a data frame.

Exercise 4
I have a vector x <- 1:20. I want to multiply every odd element by 2 and every even element by 4; find a solution using purrr.

Exercise 5
I have a sequence x <- 1:100. I want to increase the 1st, 11th, 21st, …, 91st elements by 1. How can I achieve that?

Exercise 6
Suppose I have two character vectors
x <- letters # letters is a built-in vector of the lowercase alphabet
y <- LETTERS # LETTERS is a built-in vector of the uppercase alphabet

How do I join each capital letter with the corresponding small letter so that the end result looks like this:
"A,a" "B,b" … and so on

Exercise 7
The previous exercise gave you the intuition of how to work in parallel on two vectors. Now write a function that accepts a list of character vectors of the same size and joins them like the previous exercise. So if I have a
list like x <- list(c("a","b","c"), c("d","e","f"), c("g","h","i")), it should give me the output:
[1] "a d g" "b e h" "c f i"

Exercise 8
Using a functional tool from purrr, reverse the letters so that the output is a string of 26 characters starting from z and ending at a,
like “z y x w v u t s r q p o n m l k j i h g f e d c b a”.

Exercise 9
An exercise set on purrr wouldn’t be complete if we didn’t mention its “Filter”-like functions. Take the numeric vector 1:100, keep all the numbers that are divisible by 2, and after that remove the ones that are divisible by 5.

Exercise 10
The ability to create partial functions is a powerful tool in functional programming. It allows functions to behave a little like data structures. Consider creating a partial function whenever you see that you are repeating a function’s arguments. This may not sound very useful in the context of this exercise, but I am sure you will find it very useful in your R gigs. Now create a partial function
ql, which finds the 25th and 50th percentiles of a column from a data frame. Use it to find the same for the airquality dataset. Don’t use quantile twice in this exercise.
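As a generic illustration of partial application (not the solution to this exercise), purrr::partial() pre-fills arguments and returns a new function:

library(purrr)

mean_narm <- partial(mean, na.rm = TRUE)  # mean() with na.rm fixed to TRUE
mean_narm(c(1, 2, NA, 4))
# [1] 2.333333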

Related exercise sets:
  1. Higher Order Functions Exercises
  2. A Primer in Functional Programming in R (Part – 1)
  3. Building Shiny App exercises part 4
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory


A tidy model pipeline with twidlr and broom

Thu, 06/01/2017 - 14:00

(This article was first published on blogR, and kindly contributed to R-bloggers)

@drsimonj here to show you how to go from data in a data.frame to a tidy data.frame of model output by combining twidlr and broom in a single, tidy model pipeline.

 The problem

Different model functions take different types of inputs (data.frames, matrices, etc) and produce different types of output! Thus, we’re often confronted with the very untidy challenge presented in this Figure:

Thus, different models may need very different code.

However, it’s possible to create a consistent, tidy pipeline by combining the twidlr and broom packages. Let’s see how this works.

 Two-step modelling

To understand the solution, think of the problem as a two-step process, depicted in this Figure:

 Step 1: from data to fitted model

Step 1 must take data in a data.frame as input and return a fitted model object. twidlr exposes model functions that do just this!

To demonstrate:

# devtools::install_github("drsimonj/twidlr")  # To install
library(twidlr)

lm(mtcars, hp ~ .)
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)         mpg         cyl        disp        drat
#>      79.048      -2.063       8.204       0.439      -4.619
#>          wt        qsec          vs          am        gear
#>     -27.660      -1.784      25.813       9.486       7.216
#>        carb
#>      18.749

This means we can pipe data.frames into any model function exposed by twidlr. For example:

library(dplyr)

mtcars %>% lm(hp ~ .)
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)         mpg         cyl        disp        drat
#>      79.048      -2.063       8.204       0.439      -4.619
#>          wt        qsec          vs          am        gear
#>     -27.660      -1.784      25.813       9.486       7.216
#>        carb
#>      18.749

 Step 2: fitted model to tidy results

Step 2 must take a fitted model object as its input and return a tidy data frame of results. This is precisely what the broom package does via three functions: glance, tidy, and augment! To demonstrate:

#install.packages("broom") # To install library(broom) fit <- mtcars %>% lm(hp ~ .) glance(fit) #> r.squared adj.r.squared sigma statistic p.value df logLik #> 1 0.9027993 0.8565132 25.97138 19.50477 1.89833e-08 11 -142.8905 #> AIC BIC deviance df.residual #> 1 309.7809 327.3697 14164.76 21 tidy(fit) #> term estimate std.error statistic p.value #> 1 (Intercept) 79.0483879 184.5040756 0.4284371 0.672695339 #> 2 mpg -2.0630545 2.0905650 -0.9868407 0.334955314 #> 3 cyl 8.2037204 10.0861425 0.8133655 0.425134929 #> 4 disp 0.4390024 0.1492007 2.9423609 0.007779725 #> 5 drat -4.6185488 16.0829171 -0.2871711 0.776795845 #> 6 wt -27.6600472 19.2703681 -1.4353668 0.165910518 #> 7 qsec -1.7843654 7.3639133 -0.2423121 0.810889101 #> 8 vs 25.8128774 19.8512410 1.3003156 0.207583411 #> 9 am 9.4862914 20.7599371 0.4569518 0.652397317 #> 10 gear 7.2164047 14.6160152 0.4937327 0.626619355 #> 11 carb 18.7486691 7.0287674 2.6674192 0.014412403 augment(fit) %>% head() #> .rownames hp mpg cyl disp drat wt qsec vs am gear carb #> 1 Mazda RX4 110 21.0 6 160 3.90 2.620 16.46 0 1 4 4 #> 2 Mazda RX4 Wag 110 21.0 6 160 3.90 2.875 17.02 0 1 4 4 #> 3 Datsun 710 93 22.8 4 108 3.85 2.320 18.61 1 1 4 1 #> 4 Hornet 4 Drive 110 21.4 6 258 3.08 3.215 19.44 1 0 3 1 #> 5 Hornet Sportabout 175 18.7 8 360 3.15 3.440 17.02 0 0 3 2 #> 6 Valiant 105 18.1 6 225 2.76 3.460 20.22 1 0 3 1 #> .fitted .resid .hat .sigma .cooksd .std.resid #> 1 148.68122 -38.681220 0.2142214 24.75946 0.069964902 -1.6801773 #> 2 140.62866 -30.628664 0.2323739 25.43881 0.049861042 -1.3460408 #> 3 79.99158 13.008418 0.3075987 26.38216 0.014633059 0.6019364 #> 4 125.75448 -15.754483 0.2103960 26.31579 0.011288712 -0.6826601 #> 5 183.21756 -8.217565 0.2016137 26.53317 0.002878707 -0.3541128 #> 6 111.38490 -6.384902 0.3147448 26.55680 0.003682813 -0.2969840  A single, tidy pipeline

So twidlr and broom functions can be combined into a single, tidy pipeline to go from data.frame to tidy output:

library(twidlr)
library(broom)

mtcars %>%
  lm(hp ~ .) %>%
  glance()
#>   r.squared adj.r.squared    sigma statistic     p.value df    logLik
#> 1 0.9027993     0.8565132 25.97138  19.50477 1.89833e-08 11 -142.8905
#>        AIC      BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76          21

Any model included in twidlr and broom can be used in this same way. Here’s a kmeans example:

iris %>%
  select(-Species) %>%
  kmeans(centers = 3) %>%
  tidy()
#>         x1       x2       x3       x4 size withinss cluster
#> 1 5.901613 2.748387 4.393548 1.433871   62 39.82097       1
#> 2 5.006000 3.428000 1.462000 0.246000   50 15.15100       2
#> 3 6.850000 3.073684 5.742105 2.071053   38 23.87947       3

And a ridge regression with cross-fold validation example:

mtcars %>%
  cv.glmnet(am ~ ., alpha = 0) %>%
  glance()
#>   lambda.min lambda.1se
#> 1  0.2284167  0.8402035

So next time you want to do some tidy modelling, keep this pipeline in mind:

 Limitations

Currently, the major limitation for this approach is that a model must be covered by twidlr and broom. For example, you can’t use randomForest in this pipeline because, although twidlr exposes a data.frame friendly version of it, broom doesn’t provide tidying methods for it. So if you want to write tidy code for a model that isn’t covered by these packages, have a go at helping out by contributing to these open source projects! To get started creating and contributing to R packages, take a look at Hadley Wickham’s free book, “R Packages”.

 Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.


Correcting bias in meta-analyses: What not to do (meta-showdown Part 1)

Thu, 06/01/2017 - 10:24

(This article was first published on R – Nicebread, and kindly contributed to R-bloggers)

tl;dr: Publication bias and p-hacking can dramatically inflate effect size estimates in meta-analyses. Many methods have been proposed to correct for such bias and to estimate the underlying true effect. In a large simulation study, we found out which methods do not work well under which conditions, and give recommendations about what not to use. Estimated reading time: 7 min.

It is well known that publication bias and p-hacking inflate effect size estimates from meta-analyses. In recent years, methodologists have developed an ever-growing menu of statistical approaches to correct for such overestimation. However, to date it was unclear under which conditions they perform well, and what to do if they disagree. Born out of a Twitter discussion, Evan Carter, Joe Hilgard, Will Gervais and I did a large simulation project, where we compared the performance of naive random effects meta-analysis (RE), trim-and-fill (TF), p-curve, p-uniform, PET, PEESE, PET-PEESE, and the three-parameter selection model (3PSM).

Previous investigations typically looked only at publication bias or at questionable research practices (QRPs), but not both; used non-representative study-level sample sizes; or compared only a few bias-correcting techniques, but not all of them. Our goal was to simulate a research literature that is as realistic as possible for psychology. In order to simulate several research environments, we fully crossed five experimental factors: (1) the true underlying effect, δ (0, 0.2, 0.5, 0.8); (2) between-study heterogeneity, τ (0, 0.2, 0.4); (3) the number of studies in the meta-analytic sample, k (10, 30, 60, 100); (4) the percentage of studies in the meta-analytic sample produced under publication bias (0%, 60%, 90%); and (5) the use of QRPs in the literature that produced the meta-analytic sample (none, medium, high).

This blog post summarizes some insights from our study, internally called “meta-showdown”. Check out the preprint; and the interactive app metaExplorer. The fully reproducible and reusable simulation code is on Github, and more information is on OSF.

In this blog post, I will highlight some lessons that we learned during the project, primarily focusing on what not to do when performing a meta-analysis.

Limitation of generalizability disclaimer: These recommendations apply to typical sample sizes, effect sizes, and heterogeneities in psychology; other research literatures might have different settings and therefore a different performance of the methods. Furthermore, the recommendations rely on the modeling assumptions of our simulation. We went a long way to make them as realistic as possible, but other assumptions could lead to other results.

Never trust a naive random effects meta-analysis or trim-and-fill (unless you meta-analyze a set of registered reports)

If studies have no publication bias, nothing can beat plain old random effects meta-analysis: it has the highest power, the least bias, and the highest efficiency compared to all other methods. Even in the presence of some (though not extreme) QRPs, naive RE performs better than all other methods. When can we expect no publication bias? If (and, in my opinion only if) we meta-analyze a set of registered reports.

But.

In any other setting except registered reports, a consequential amount of publication bias must be expected. In the field of psychology/psychiatry, more than 90% of all published hypothesis tests are significant (Fanelli, 2011) despite the average power being estimated as around 35% (Bakker, van Dijk, & Wicherts, 2012) – the gap points towards a huge publication bias. In the presence of publication bias, naive random effects meta-analysis and trim-and-fill have false positive rates approaching 100%:

More thoughts about trim-and-fill’s inability to recover δ=0 are in Joe Hilgard’s blog post. (Note: this insight is not really new and has been shown multiple times before, for example by Moreno et al., 2009, and Simonsohn, Nelson, and Simmons, 2014).

Our recommendation: Never trust meta-analyses based on naive random effects and trim-and-fill, unless you can rule out publication bias. Results from previously published meta-analyses based on these methods should be treated with a lot of skepticism.

 

Do not use p-curve and p-uniform for effect size estimation (under heterogeneity)

As a default, heterogeneity should always be expected – even under the most controlled conditions, where many labs perform the same computer-administered experiment, a large proportion showed significant and substantial amounts of between-study heterogeneity (cf. ManyLabs 1 and 3; see also our supplementary document for more details). p-curve and p-uniform assume homogeneous effect sizes, and their performance is impacted to a large extent by heterogeneity:

 

As you can see, all other methods retain the nominal false positive rate, but p-curve and p-uniform go through the roof as soon as heterogeneity comes into play (see also McShane, Böckenholt, & Hansen, 2016; van Aert et al., 2016).

Under H1, heterogeneity leads to overestimation of the true effect:

(additional settings for these plots: no QRPs, no publication bias, k = 100 studies, true effect size = 0.5)

Note that in their presentation of p-curve, Simonsohn et al. (2014) emphasize that, in the presence of heterogeneity, p-curve is intended as an estimate of the average true effect size among the studies submitted to p-curve (see here, Supplement 2). p-curve may indeed yield an accurate estimate of the true effect size among the significant studies, but in our view, the goal of bias-correction in meta-analysis is to estimate the average effect of all conducted studies. Of course this latter estimation hinges on modeling assumptions (e.g., that the effects are normally distributed), which can be disputed, and there might be applications where indeed the underlying true effect of all significant studies is more interesting.

Furthermore, as McShane et al (2016) demonstrate, p-curve and p-uniform are constrained versions of the more general three-parameter selection model (3PSM; Iyengar & Greenhouse, 1988). The 3PSM estimates (a) the mean of the true effect, δ, (b) the heterogeneity, τ, and (c) the probability that a non-significant result enters the literature, p. The constraints of p-curve and p-uniform are: 100% publication bias (i.e., p = 0) and homogeneity (i.e., τ = 0). Hence, for the estimation of effect sizes, 3PSM seems to be a good replacement for p-curve and p-uniform, as it makes these constraints testable.

Our recommendation: Do not use p-curve or p-uniform for effect size estimation when heterogeneity can be expected (which is nearly always the case).

Ignore overadjustments in the opposite direction

Many bias-correcting methods are driven by QRPs – the more QRPs, the stronger the downward correction. However, this effect can get so strong that methods overadjust in the opposite direction, even if all studies in the meta-analysis are of the same sign:

 

Note: You need to set the option “Keep negative estimates” to get this plot.

Our recommendation: Ignore bias-corrected results that go into the opposite direction; set the estimate to zero, do not reject H₀.

Do not take it seriously if PET-PEESE does a reverse correction

Typical small-study effects (e.g., by p-hacking or publication bias) induce a negative correlation between sample size and effect size – the smaller the sample, the larger the observed effect size. PET-PEESE aims to correct for that relationship. In the absence of bias and QRPs, however, random fluctuations can lead to a positive correlation between sample size and effect size, which leads to a PET and PEESE slope of the unintended sign. Without publication bias, this reversal of the slope actually happens quite often.

See for example the next figure. The true effect size is zero (red dot), naive random effects meta-analysis slightly overestimates the true effect (see black dotted triangle), but PET and PEESE massively overadjust towards more positive effects:

 

PET-PEESE was never intended to correct in the reverse direction. An underlying biasing process would have to systematically remove small studies that show a significant result with larger effect sizes, and keep small studies with non-significant results. In the current incentive structure, I see no reason for such a process.

Our recommendation: Ignore the PET-PEESE correction if it has the wrong sign.

 

PET-PEESE sometimes overestimates, sometimes underestimates

A bias can be more easily accepted if it always is conservative – then one could conclude: “This method might miss some true effects, but if it indicates an effect, we can be quite confident that it really exists”. Depending on the conditions (i.e., how much publication bias, how much QRPs, etc.), however, PET/PEESE sometimes shows huge overestimation and sometimes huge underestimation.

For example, with no publication bias, some heterogeneity (τ=0.2), and severe QRPs, PET/PEESE underestimates the true effect of δ = 0.5:

In contrast, if no effect exists in reality, but strong publication bias, large heterogeneity and no QRPs, PET/PEESE overestimates at lot:

In fact, the distribution of PET/PEESE estimates looks virtually identical for these two examples, although the underlying true effect is δ = 0.5 in the upper plot and δ = 0 in the lower plot. Furthermore, note the huge spread of PET/PEESE estimates (the error bars visualize the 95% quantiles of all simulated replications): Any single PET/PEESE estimate can be very far off.

Our recommendation: As one cannot know the condition of reality, it is safest not to use PET/PEESE at all.

 

Recommendations in a nutshell: What you should not use in a meta-analysis

Again, please consider the “limitations of generalizability” disclaimer above.

  • When you can exclude publication bias (i.e., in the context of registered reports), do not use bias-correcting techniques. Even in the presence of some QRPs they perform worse than plain random effects meta-analysis.
  • In any other setting except registered reports, expect publication bias, and do not use random effects meta-analysis or trim-and-fill. Both will give you a 100% false positive rate in typical settings, and a biased estimation.
  • Under heterogeneity, p-curve and p-uniform overestimate the underlying effect and have false positive rates >= 50%
  • Even if all studies entering a meta-analysis point into the same direction (e.g., all are positive), bias-correcting techniques sometimes overadjust and return a significant estimate of the opposite direction. Ignore these results, set the estimate to zero, do not reject H₀.
  • Sometimes PET/PEESE adjust into the wrong direction (i.e., increasing the estimated true effect size)

As with any general recommendations, there might be good reasons to ignore them.

Additional technical recommendations
  • The p-uniform package (v. 0.0.2) very rarely does not provide a lower CI. In this case, ignore the estimate.
  • Do not run p-curve or p-uniform on <=3 significant and directionally consistent studies. Although computationally possible, this gives hugely variable results, which are often very biased. See our supplemental material for more information and plots.
  • If the 3PSM method (in the implementation of McShane et al., 2016) returns an incomplete covariance matrix, ignore the result (even if a point estimate is provided).

 

Now you probably ask: But what should I use? Read our preprint for an answer!

The post Correcting bias in meta-analyses: What not to do (meta-showdown Part 1) appeared first on Nicebread.


Simple bash script for a fresh install of R and its dependencies in Linux

Thu, 06/01/2017 - 09:00

(This article was first published on R and Finance, and kindly contributed to R-bloggers)

I’ve been working with Linux for some time but always in a dual boot setup. While I used Linux-Mint at home, my university computer always had Windows. Most of the time it was not a problem to work in one or the other and share files between computers with Dropbox.

After getting constantly frustrated with Windows updates, crashes and slow-downs, I decided to make the full switch. All my computers are now Linux based. I am very happy with my choice and regret not making the switch earlier.

Since I formatted all three of my computers (home/laptop/work), I wrote a small bash file to automate the process of installing R and its dependencies. I use lots of R packages on a daily basis. For some of them, it is required to install dependencies using the terminal. Each time that an install.packages() call failed, I saved the name of the required software and added it to the bash file. While my bash file will not cover all dependencies for all packages, it will suffice for a great proportion.

The steps for using it are:

1) Copy the contents of the code presented next into a file called InstallR.sh. Notice that the name of the release, in this case trusty, is added to the CRAN address. Try not to run the bash file twice, or it will append the CRAN address more than once to /etc/apt/sources.list.

#!/bin/bash
# Adds R to apt and installs it
#
# Instructions:
# sudo chmod 700 InstallR.sh
# sudo ./InstallR.sh

sudo echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" | sudo tee -a /etc/apt/sources.list
gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
sudo apt-get update
sudo apt-get install -y r-base r-base-dev r-cran-xml r-cran-rjava libcurl4-openssl-dev
sudo apt-get install -y libssl-dev libxml2-dev openjdk-7-* libgdal-dev libproj-dev libgsl-dev
sudo apt-get install -y xml2 default-jre default-jdk mesa-common-dev libglu1-mesa-dev freeglut3-dev
sudo apt-get install -y mesa-common-dev libx11-dev r-cran-rgl r-cran-rglpk r-cran-rsymphony r-cran-plyr
sudo apt-get install -y r-cran-reshape r-cran-reshape2 r-cran-rmysql
sudo R CMD javareconf

2) Change the permissions of the file so that it can be executed:

sudo chmod 700 InstallR.sh

3) Execute it as sudo

sudo ./InstallR.sh

That’s it. I hope others will find this simple bash script useful as well.


A Partial Remedy to the Reproducibility Problem

Thu, 06/01/2017 - 06:48

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

Several years ago, John Ioannidis jolted the scientific establishment with an article titled, “Why Most Published Research Findings Are False.” He had concerns about inattention to statistical power, multiple inference issues and so on. Most people had already been aware of all this, of course, but that conversation opened the floodgates, and many more issues were brought up, such as hidden lab-to-lab variability. In addition, there is the occasional revelation of outright fraud.

Many consider the field to be at a crisis point.

In the 2014 JSM, Phil Stark organized a last-minute session on the issue, including Marcia McNutt, former editor of Science and Yoav Benjamini of multiple inference methodology fame. The session attracted a standing-room-only crowd.

In this post, Reed Davis and I are releasing the prototype of an R package that we are writing, revisit, with the goal of partially remedying the statistical and data wrangling aspects of this problem. It is assumed that the authors of a study have supplied (possibly via carrots or sticks) not only the data but also the complete code for their analyses, from data cleaning up through formal statistical analysis.

There are two main aspects:

  • The package allows the user to “replay” the authors’ analysis, and most importantly, explore other alternate analyses that the authors may have overlooked. The various alternate analyses may be saved for sharing.
  • Warn of statistical errors, such as: overreliance on p-values; need for multiple inference procedures; possible distortion due to outliers; etc.

The term user here could refer to several different situations:

  • The various authors of a study, collaborating and trying different analyses during the course of the study.
  • Reviewers of a paper submitted for publication on the results of the study.
  • Fellow scientists who wish to delve further into the study after it is published.

The package has text and GUI versions. The latter is currently implemented as an RStudio add-in.

The package is on my GitHub site, and has a fairly extensive README file introducing the goals and usage.


U.S. Residential Energy Use: Machine Learning on the RECS Dataset

Thu, 06/01/2017 - 02:25

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Contributed by Thomas Kassel. He is currently enrolled in the NYC Data Science Academy remote bootcamp program taking place from January-May 2017. This post is based on his final capstone project, focusing on the use of machine learning techniques learned throughout the course.

 

Introduction

The residential sector accounts for up to 40% of annual U.S. electricity consumption, representing a large opportunity for energy efficiency and conservation. A strong understanding of the main electricity end-uses in residences can allow homeowners to make more informed decisions to lower their energy bills, help utilities maximize efficiency/incentive programs, and allow governments or NGOs to better forecast energy demand and address climate concerns.

The Residential Energy Consumption Survey, RECS, collects energy-related data on a nationally representative sample of U.S. homes. First conducted in 1978 and administered every 4-5 years by the Energy Information Administration, it is a leading data source for residential energy analysis. Methodology for the study is well-documented – a comprehensive overview of RECS can be found in the program’s technical documentation.

This project applied machine learning methods to the most recently available RECS dataset, published in 2009. The primary goals were twofold, as is common in many regression exercises:

  • Description – identification of variable importance or features with the strongest statistical associations to annual residential electricity usage (kWh)
  • Prediction – feature engineering and use of machine learning platform h2o to optimize predictive accuracy over annual electricity usage

 

EDA and Data Pre-Processing

RECS 2009 consists of 12,083 observations (each representing a unique housing unit) and over 900 features encompassing physical building characteristics, appliances, occupant behavior/usage patterns, occupant demographics, etc. These features serve as independent variables in predicting the outcome variable, annual electricity usage in kilowatt-hours (kWh). Because RECS aims to cover a comprehensive overview of many types of residences for a variety of analytical purposes (beyond energy use prediction), many of the features are sparsely populated, collinear, or uncorrelated with kWh usage. Therefore, a preliminary and recurring task throughout the project was dimension reduction.

Since most of the independent variables were collected through occupant interviews – as opposed to exact physical examination of the residence – the values of many are binned as factor/categorical variables. The dataset’s 931 original variables had the following characteristics:

Missingness was common in the raw dataset, with 73 features having NA values in more than 95% of observations. To quickly gauge whether these features correlated with the outcome variable kWh (and therefore should be retained/imputed), I made use of flexible EDA graphing functions from the GGally R package. An example is shown below on continuous variables, revealing generally low r-statistics in correlation to kWh and having non-significant p-values (not shown). GGally also accommodates similar exploratory analysis on factor variables via pairwise box plots. Overall, although a more complex imputation strategy could have been employed to address these 73 features, they were dropped due to their high missingness and little evidence of predictive power over the outcome variable.
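The post shows the resulting plot rather than the code, so here is a hedged sketch of the kind of GGally call involved, assuming the RECS data have been read into a data frame named recs; the columns shown are an illustrative subset, not necessarily the ones plotted in the post:

library(GGally)

# Pairwise scatterplots, densities and correlation coefficients for a few
# continuous features against the outcome variable KWH.
ggpairs(recs[, c("KWH", "TOTSQFT", "TOTCSQFT", "NUMFRIG")])

# ggcorr(recs[, sapply(recs, is.numeric)])  # compact correlation heat map alternative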

 

Feature Engineering/Reduction

Factor Encoding

As previously mentioned, the majority of the dataset features were factor variables, many of which needed recoding in order to correctly capture the variation being described. As illustrated below, nominal factor variables such as WALLTYPE have no intrinsic ordering; it would not make sense to order “stucco” higher than “wood”, for example. On the other hand, ordinal factors such as AGERFRI1 (age of most-used refrigerator) have a clear order to their factor levels; each level denotes a numerically higher value than the previous. Information contained in variables like AGERFRI1 is often more appropriately formatted in integer/numeric form, which can better convey their continuous range of values, aid interpretation of the variable coefficient, and reduce standard error by decreasing the number of coefficients for model estimation (many algorithms generate a separate model coefficient for each factor level via one-hot-encoding).

In the image above, WALLTYPE was left as a factor while AGERFRI1 was recoded as an integer value (using the mid range of each bin as number of years). This factor-to-integer recoding was applied to additional binned variables relating to age of appliances and frequency of use of those appliances.

New factor variables were also created by consolidating information from existing factors into one-hot-encoded binary values such as presence of heated pool, use of electricity for space heating, etc.

Feature Reduction

We can visualize multicollinearity among the integer and numeric features using correlation plots, again made with R package GGally. In the plot below, a conceptual group of collinear variables stands out, i.e. those directly or indirectly representing a residence’s size, for example, total square footage (TOTSQFT), cooling square footage (TOTCSQFT), number of bathrooms (NCOMBATH), number of refrigerators (NUMFRIG), etc.

To confirm the degree of multicollinearity, variance inflation factors were calculated based on a linear model with the above numeric features – those with high VIF scores (loosely following the rule of thumb VIF > 5.0) were dropped.
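A brief sketch of that VIF screening step, again assuming the cleaned data sit in a data frame named recs; the exact model formula here is an assumption:

library(car)

# VIF > 5 (the rule of thumb mentioned above) flags a feature as highly collinear.
size_fit <- lm(KWH ~ TOTSQFT + TOTCSQFT + NCOMBATH + NUMFRIG, data = recs)
sort(vif(size_fit), decreasing = TRUE)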

Despite variable recoding and de-correlation using the methods previously described, a large number of features remained in the dataset, the majority of them being factor variables. Although principal component analysis (PCA) is a powerful method for dimension reduction, it does not generalize easily to categorical data. Therefore, feature reduction was continued instead through the use of an exploratory LASSO regression model, utilizing the L1 regularization penalty to drive non-significant variable coefficients in the model’s output to 0. This was helpful in identifying and dropping features with very little predictive power over kWh usage, particularly factor variables that were not covered in the multicollinearity exercise above.
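A sketch of what such an exploratory LASSO might look like with glmnet (not the author's exact code), assuming the cleaned data frame recs with KWH as the outcome:

library(glmnet)

# glmnet needs a numeric matrix, so factors are expanded with model.matrix().
x <- model.matrix(KWH ~ ., data = recs)[, -1]  # drop the intercept column
y <- recs$KWH

lasso_cv <- cv.glmnet(x, y, alpha = 1)         # alpha = 1 -> pure L1 (LASSO) penalty
coef(lasso_cv, s = "lambda.1se")               # coefficients shrunk to 0 are candidates to drop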

 

Modeling With H2o

The processed and cleaned dataset was migrated to open-source, online machine learning platform h2o, which offers highly efficient and scalable modeling methods. In contrast to machine learning conducted in slower native R packages (caret, glm) in the local R environment, R package h2o facilitates API calls to h2o’s online platform, sending the given dataset to be distributed and parallel-processed among multiple clusters. H2o offers an array of the most common machine learning algorithms (glm, kNN, random forests, gbm, deep learning) at very impressive run times – lessening the large burden of computational speed in the model fitting and tuning process. It also provides handy, built-in functions for pre-processing such as training/validation/test set splitting, in this case chosen to be 70/20/10.

View the code on Gist.
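The gist is not reproduced here, but the H2O setup and the 70/20/10 split would look roughly like this sketch (frame names and seed are illustrative):

library(h2o)
h2o.init(nthreads = -1)                            # start or connect to a local H2O instance

recs_h2o <- as.h2o(recs)                           # push the cleaned data frame to H2O
splits   <- h2o.splitFrame(recs_h2o, ratios = c(0.7, 0.2), seed = 123)
train <- splits[[1]]
valid <- splits[[2]]
test  <- splits[[3]]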

Multivariable Regression

To best understand variable importance and interpret the linear effects of the features on kWh usage, a simple multivariable regression model was fit using h2o’s glm function with default parameters (no regularization). A variable importance plot (see below) ranks the top 15 features by z-statistic, all of which are quite large – an observed z-score of 15 translates to a p-value of 9.63e-54, or virtually zero. Hence all 15 variables show extremely strong statistical evidence of a relationship to kWh, with an additional ~100 of the features (not shown) also meeting significance at the p = 0.05 threshold. Variables beginning with “hasElec” were feature engineered (as previously described), validating the approach of using domain knowledge to add or recode valuable information into new features. Coefficient estimates are denoted on the bar for each feature. We can observe that the presence of electric space heating at a residence (one-hot-encoded as a factor variable) increases yearly electricity usage at the residence by about 2,722 kWh, and each additional square foot of air-conditioned area (TOTCSQFT, numeric) adds about 1.146 kWh/year.
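A hedged sketch of that baseline fit (predictor and response names assumed; lambda = 0 disables regularization so that p-values can be computed):

predictors <- setdiff(h2o.colnames(train), "KWH")

glm_fit <- h2o.glm(x = predictors, y = "KWH",
                   training_frame = train, validation_frame = valid,
                   family = "gaussian", lambda = 0,
                   compute_p_values = TRUE, remove_collinear_columns = TRUE)

h2o.varimp(glm_fit)    # ranked standardized coefficients / z-statistics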

ElasticNet Regression with Grid Search

While default h2o glm provided interpretability and p-values, adding regularization to the linear model enabled an increase in predictive accuracy. H2o allows flexible grid search methods to tune the main glm hyperparameters alpha (type of regularization) and lambda (degree of coefficient shrinkage). Because grid search can quickly become computationally expensive, I utilized h2o’s powerful early-stopping and random search functionality to ensure that computation time is not wasted for very small, marginal gains in validation accuracy.

View the code on Gist.
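The gist itself is not shown here, but a random grid search over alpha and lambda with early stopping would look roughly like the sketch below (grid values, run time, and ids are illustrative):

hyper_params <- list(alpha  = seq(0, 1, by = 0.05),
                     lambda = 10 ^ seq(-6, 1, by = 0.35))

search_criteria <- list(strategy = "RandomDiscrete",
                        max_runtime_secs = 300,
                        stopping_metric = "RMSE",
                        stopping_rounds = 5,
                        stopping_tolerance = 1e-4)

glm_grid <- h2o.grid("glm", x = predictors, y = "KWH",
                     grid_id = "glm_grid",
                     training_frame = train, validation_frame = valid,
                     family = "gaussian",
                     hyper_params = hyper_params,
                     search_criteria = search_criteria)

sorted   <- h2o.getGrid("glm_grid", sort_by = "rmse", decreasing = FALSE)
best_glm <- h2o.getModel(sorted@model_ids[[1]])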

A primary purpose of grid search is to choose optimal tuning parameters according to a validation metric –  in this case root-mean-squared-error (RMSE) of kWh prediction on the validation set – and then train a new model using the optimal values discovered on the full dataset (no train/validation split) to maximize sample size. In 5 minutes of training, h2o fit all ~440 model combinations specified in the grid search, with the best RMSE model having alpha = 1.0 (LASSO regression) and lambda = 1.0 (small amount of regularization). These parameters were then used to fit a final model on 90% of the data (combining 70% train and 20% validation sets) and performance was evaluated on the 10% holdout test data, which had not yet been seen by the model.

Gradient Boosted Machine

To increase predictive accuracy over the generalized linear model-based approaches, I used h2o’s implementation of gradient boosted machines. Boosting, a nonlinear tree-based approach, sequentially fits many shallow decision trees, each one correcting the residual errors of the ensemble built so far, to slowly reduce overall predictive error (in the regression setting, often RMSE). Grid search was performed using the same methods as above and with the following tuning parameters:

  • learn_rate – the amount of shrinkage applied to each new tree’s contribution, i.e. how strongly each tree “learns” from the errors of the previous ones
  • max_depth – max tree depth (rule of thumb – sqrt of # of features)
  • sample_rate – percentage of rows/observations to be chosen to fit a given tree
  • col_sample_rate – percentage of columns to be chosen to fit a given tree

Following the same process as with GLM, parameter values were then chosen from the trained model with the best validation set RMSE. The model was re-trained on the full 90% of training data and tested on the final 10% holdout split.
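A corresponding sketch of the GBM grid search (parameter ranges, run time, and ids are illustrative, not the values used in the original gist):

gbm_hyper <- list(learn_rate      = c(0.01, 0.05, 0.1),
                  max_depth       = c(3, 5, 7, 9),
                  sample_rate     = c(0.7, 0.8, 0.9, 1.0),
                  col_sample_rate = c(0.5, 0.7, 0.9, 1.0))

gbm_grid <- h2o.grid("gbm", x = predictors, y = "KWH",
                     grid_id = "gbm_grid",
                     training_frame = train, validation_frame = valid,
                     ntrees = 500, score_tree_interval = 10,
                     stopping_metric = "RMSE", stopping_rounds = 5,
                     hyper_params = gbm_hyper,
                     search_criteria = list(strategy = "RandomDiscrete",
                                            max_runtime_secs = 600))

best_gbm <- h2o.getModel(h2o.getGrid("gbm_grid", sort_by = "rmse")@model_ids[[1]])
h2o.performance(best_gbm, newdata = test)   # evaluate on the 10% holdout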

Results from the two modeling strategies are summarized in the table below.

Additionally, predicted vs actual kWh values for either modeling strategy are plotted against the most significant numeric feature, CDD30YR (30-year average of cooling degree-days in the residence’s region). Both model types demonstrate reasonable ability to predict kWh over a range of annual cooling loads/climate regions.

Conclusions 

Final predictive accuracy was similar for the linear (GLM) and tree-based (GBM) models on the RECS dataset. The two models’ RMSE values represent a large improvement over the baseline RMSE of 7,560, which was established by initially fitting a default GLM on the raw dataset (before any feature engineering or reduction). This improvement validates the approaches used: applying domain knowledge to engineer highly significant features and tuning hyperparameters via grid search.

Aside from the regression setting, future analysis could be applied to RECS in a classification context. Since the majority of the dataset’s features are factor variables, energy providers could use the rich training set to attempt to answer important energy-related classification questions about their customers – type of space heating, presence of major appliances, or whether the home is a good candidate for solar installation. Greater predictive ability over residential electricity use will contribute over time to a smarter, cleaner, and more efficient power grid.

The post U.S. Residential Energy Use: Machine Learning on the RECS Dataset appeared first on NYC Data Science Academy Blog.

To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Complete Subset Regressions, simple and powerful

Thu, 06/01/2017 - 02:05

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos

The complete subset regressions (CSR) is a forecasting method proposed by Elliott, Gargano and Timmermann in 2013. It is a very simple but powerful technique. Suppose you have a set of variables and you want to forecast one of them using information from the others. If your variables are highly correlated and the variable you want to predict is noisy, you will have collinearity problems and in-sample overfitting because the model will try to fit the noise.

These problems may be solved if you estimate a smaller model using only a subset of the explanatory variables; however, if you do not know which variables are important, you may lose information. What if we estimate models from many different subsets and combine their forecasts? Even better, what if we estimate models for all possible combinations of variables? This would be a good solution, but even with only 20 variables the number of regressions would be more than 1 million. The CSR is a compromise between using only one subset and all possible subsets. Instead of estimating all possible combinations of models, we fix a number k and estimate all possible models that use exactly k of the variables. Then we combine the forecasts from all these models to get our final result. Naturally, k must be smaller than the number of variables you have. Let us review the steps:

  • Suppose you have K explanatory variables. Estimate all possible models that use exactly k of them,
  • Compute the forecasts for all the models,
  • Compute the average of all the forecasts to have the final result.
Application

We are going to generate data from a linear model where the explanatory variables, X, are drawn from a multivariate normal distribution. The coefficients are generated from a normal distribution and multiplied by a parameter alpha to ensure the dependent variable has a significant random component. The number of variables K is 10 and the CSR is estimated with k = 4.

set.seed(1) # = Seed for replication = #
K = 10  # = Number of Variables = #
T = 300 # = Number of Observations = #

# = Generate covariance matrix = #
library(mvtnorm)
D = diag(0.1, K)
P = matrix(rnorm(K * K), K, K)
sigma = t(P) %*% D %*% P

alpha = 0.1
beta = rnorm(K) * alpha # = coefficients = #

X = rmvnorm(T, sigma = sigma) # = Explanatory Variables = #
u = rnorm(T)                  # = Error = #
y = X %*% beta + u            # = Generate y = #

# = Break data into in-sample and out-of-sample = #
y.in = y[1:200]
y.out = y[-c(1:200)]
X.in = X[1:200, ]
X.out = X[-c(1:200), ]

# = Estimate model by OLS to compare = #
model = lm(y.in ~ X.in)
pred = cbind(1, X.out) %*% coef(model)

# = CSR = #
k = 4
models = combn(K, k)                   # = Gives all combinations with 4 variables = #
csr.fitted = rep(0, length(y.in))      # = Store in-sample fitted values = #
csr.pred = rep(0, length(y.out))       # = Store forecast = #
for(i in 1:ncol(models)){
  m = lm(y.in ~ X.in[, models[, i]])
  csr.fitted = csr.fitted + fitted(m)
  csr.pred = csr.pred + cbind(1, X.out[, models[, i]]) %*% coef(m)
}

R2ols = 1 - var(y.in - fitted(model)) / var(y.in)            # = R2 OLS = #
R2csr = 1 - var(y.in - csr.fitted / ncol(models)) / var(y.in) # = R2 CSR = #
c(R2ols, R2csr)
## [1] 0.1815733 0.1461342

# = In-sample fit = #
plot(y.in, type = "l")
lines(fitted(model), col = 2)
lines(csr.fitted / ncol(models), col = 4)

# = Out-of-sample fit = #
plot(y.out, type = "l")
lines(pred, col = 2)
lines(csr.pred / ncol(models), col = 4)

# = MAPE = #
MAPEols = mean(abs(y.out - pred))
MAPEcsr = mean(abs(y.out - csr.pred / ncol(models)))
c(MAPEols, MAPEcsr)
## [1] 0.8820019 0.8446682

The main conclusion from the results is that the CSR gives up some in-sample performance to improve the forecasts. In fact, the CSR is very robust to overfitting, and you should definitely add it to your collection to use when you believe you are facing this type of problem. Most modern forecasting techniques share the same idea of accepting some bias in-sample in exchange for a more parsimonious or stable model out-of-sample.

This is the simplest implementation possible for the CSR. In some cases you may have fixed controls you want to include in all regressions such as lags of the dependent variable. The CSR computational costs increase fast with the number of variables. If you are in a high-dimensional framework you may need to do some type of pre-selection of the variables to reduce the problem’s size.
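As a rough sketch of how such fixed controls fit in, the in-sample loop above can be extended so that, say, the first lag of the dependent variable appears in every regression (illustrative only; an out-of-sample version would also need the lag handled at prediction time):

y.lag <- c(NA, y.in[-length(y.in)])                 # fixed control: first lag of y
csr.fitted2 <- rep(0, length(y.in))
for (i in 1:ncol(models)) {
  m <- lm(y.in ~ y.lag + X.in[, models[, i]])       # the lag is always included
  csr.fitted2 <- csr.fitted2 + c(NA, fitted(m))     # first observation lost to the lag
}
csr.fitted2 <- csr.fitted2 / ncol(models)           # average the fitted values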

References

Elliott, Graham, Antonio Gargano, and Allan Timmermann. “Complete subset regressions.” Journal of Econometrics 177.2 (2013): 357-373.

To leave a comment for the author, please follow the link and comment on their blog: R – insightR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Euler Problem 23: Non-Abundant Sums

Thu, 06/01/2017 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

A demonstration of the abundance of the number 12 using Cuisenaire rods (Wikipedia).

Euler problem 23 asks us to solve a problem with abundant or excessive numbers.

These are numbers for which the sum of their proper divisors is greater than the number itself.

12 is an abundant number because the sum of its proper divisors (the aliquot sum) is larger than 12: (1 + 2 + 3 + 4 + 6 = 16).

All highly composite numbers or anti-primes greater than six are abundant numbers. These are numbers that have so many divisors that they are considered the opposite of primes, as explained in the Numberphile video below.

Euler Problem 23 Definition

A perfect number is a number for which the sum of its proper divisors is exactly equal to the number. For example, the sum of the proper divisors of 28 would be 1 + 2 + 4 + 7 + 14 = 28, which means that 28 is a perfect number.

A number n is called deficient if the sum of its proper divisors is less than n and it is called abundant if this sum exceeds n.

As 12 is the smallest abundant number, 1 + 2 + 3 + 4 + 6 = 16, the smallest number that can be written as the sum of two abundant numbers is 24. By mathematical analysis, it can be shown that all integers greater than 28123 can be written as the sum of two abundant numbers. However, this upper limit cannot be reduced any further by analysis, even though it is known that the greatest number that cannot be expressed as the sum of two abundant numbers is less than this limit.

Find the sum of all the positive integers which cannot be written as the sum of two abundant numbers.

Solution

This solution repurposes the divisors function that determines the proper divisors for a number, introduced for Euler Problem 21. The first code snippet creates the sequence of all abundant numbers up to 28123 (sequence A005101 in the OEIS). An abundant number is one where its aliquot sum is larger than n.
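The proper.divisors function itself comes from the Euler Problem 21 post and is not repeated here; a sketch consistent with how it is used below (the snippet subtracts n, so this version returns all divisors of n, including n itself) might look like:

proper.divisors <- function(n) {
  # all divisors of n, including n itself, found by checking up to sqrt(n)
  divisors <- vector()
  d <- 1
  for (i in 1:floor(sqrt(n))) {
    if (n %% i == 0) {
      divisors[d] <- i
      d <- d + 1
      if (i != n / i) {
        divisors[d] <- n / i
        d <- d + 1
      }
    }
  }
  return(divisors)
}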

# Generate abundant numbers (OEIS A005101)
A005101 <- function(x) {
  abundant <- vector()
  a <- 1
  for (n in 1:x) {
    aliquot.sum <- sum(proper.divisors(n)) - n
    if (aliquot.sum > n) {
      abundant[a] <- n
      a <- a + 1
    }
  }
  return(abundant)
}
abundant <- A005101(28123)

The solution to this problem is also a sequence in the Online Encyclopedia of Integer Sequences (OEIS A048242). This page states that the highest number in this sequence is 20161, not 28123 as stated in the problem definition.

The second section of code creates a list of all potential numbers not the sum of two abundant numbers. The next bit of code sieves any sum of two abundant numbers from the list. The answer is determined by adding remaining numbers in the sequence.

# Create a list of potential numbers that are not the sum of two abundant numbers
A048242 <- 1:20161

# Remove any number that is the sum of two abundant numbers
for (i in 1:length(abundant)) {
  for (j in i:length(abundant)) {
    if (abundant[i] + abundant[j] <= 20161) {
      A048242[abundant[i] + abundant[j]] <- NA
    }
  }
}
A048242 <- A048242[!is.na(A048242)]
answer <- sum(A048242)
print(answer)

The post Euler Problem 23: Non-Abundant Sums appeared first on The Devil is in the Data.

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

To eat or not to eat! That’s the question? Measuring the association between categorical variables

Thu, 06/01/2017 - 02:00

(This article was first published on Stories Data Speak, and kindly contributed to R-bloggers)

1. Introduction

I serve as a reviewer for several ISI- and Scopus-indexed journals in Information Technology. Recently, I was reviewing an article in which the researchers had made a critical mistake in the data analysis. They converted the original categorical data to continuous form without providing a rigorous statistical treatment, or any justification for the information lost, if any. My motivation for this study is borne out of their error.

We know the standard association measure between continuous variables is the product-moment correlation coefficient introduced by Karl Pearson. This measure determines the degree of linear association between continuous variables and is both normalized to lie between -1 and +1 and symmetric: the correlation between variables x and y is the same as that between y and x. The best-known association measure between two categorical variables is probably the chi-square measure, also introduced by Karl Pearson. Like the product-moment correlation coefficient, this association measure is symmetric, but it is not normalized. This lack of normalization provides one motivation for Cramer’s V, defined as the square root of a normalized chi-square value; the resulting association measure varies between 0 and 1 and is conveniently available via the assocstats function in the vcd package. An interesting alternative to Cramer’s V is Goodman and Kruskal’s tau, which is not nearly as well known and is asymmetric. This asymmetry arises because the tau measure is based on the fraction of variability in the categorical variable y that can be explained by the categorical variable x.
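As a quick illustration of both measures on made-up categorical vectors (the data here are random and purely illustrative):

library(vcd)
library(GoodmanKruskal)

set.seed(42)
x <- sample(c("a", "b", "c"), 200, replace = TRUE)
y <- sample(c("low", "medium", "high"), 200, replace = TRUE)

assocstats(table(x, y))   # chi-square, contingency coefficient and Cramer's V
GKtau(x, y)               # asymmetric tau(x, y) and tau(y, x)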

The data for this study is sourced from the UCI Machine Learning repository. As stated in the data information section, “This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom.”

Furthermore, the research questions I want to explore are:

  • Is significance test enough to justify a hypothesis?
  • How to measure associations between categorical predictors?
2. Making data management decisions

As a first step, I imported the data in R environment as;

# Import data from UCI ML repo
> theURL <- "http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"

# Explicitly adding the column headers from the data dictionary
> mushroom.data <- read.csv(file = theURL, header = FALSE, sep = ",", strip.white = TRUE,
                            stringsAsFactors = TRUE,
                            col.names = c("class","cap-shape","cap-surface","cap-color","bruises",
                                          "odor","gill-attachment","gill-spacing","gill-size",
                                          "gill-color","stalk-shape","stalk-root","stalk-surface-above-ring",
                                          "stalk-surface-below-ring","stalk-color-above-ring","stalk-color-below-ring",
                                          "veil-type","veil-color","ring-number","ring-type","spore-print-color",
                                          "population","habitat"))

Next, I quickly summarize the dataset to get a brief glimpse. The reader should note that the data has no missing values.

# Calculate number of levels for each variable
> mushroom.data.levels <- cbind.data.frame(Variable = names(mushroom.data),
    Total_Levels = sapply(mushroom.data, function(x){ as.numeric(length(levels(x))) }))
> print(mushroom.data.levels)
                                         Variable Total_Levels
class                                       class            2
cap.shape                               cap.shape            5
cap.surface                           cap.surface            3
cap.color                               cap.color           10
bruises                                   bruises            2
odor                                         odor            8
gill.attachment                   gill.attachment            1
gill.spacing                         gill.spacing            2
gill.size                               gill.size            3
gill.color                             gill.color           10
stalk.shape                           stalk.shape            3
stalk.root                             stalk.root            6
stalk.surface.above.ring stalk.surface.above.ring            4
stalk.surface.below.ring stalk.surface.below.ring            5
stalk.color.above.ring     stalk.color.above.ring            7
stalk.color.below.ring     stalk.color.below.ring            8
veil.type                               veil.type            2
veil.color                             veil.color            2
ring.number                           ring.number            3
ring.type                               ring.type            5
spore.print.color               spore.print.color            7
population                             population            7
habitat                                   habitat            8

As we can see, the variable gill.attachment has just one level; it makes no significant contribution to the target class, so it is dropped.

# dropping variable with constant variance
> mushroom.data$gill.attachment <- NULL

The different levels are uninterpretable in their current format. I will use the data dictionary and recode the levels into meaningful names.

> levels(mushroom.data$class) <- c("edible","poisonous")
> levels(mushroom.data$cap.shape) <- c("bell","conical","flat","knobbed","sunken","convex")
> levels(mushroom.data$cap.surface) <- c("fibrous","grooves","smooth","scaly")
> levels(mushroom.data$cap.color) <- c("buff","cinnamon","red","gray","brown","pink","green","purple","white","yellow")
> levels(mushroom.data$bruises) <- c("bruisesno","bruisesyes")
> levels(mushroom.data$odor) <- c("almond","creosote","foul","anise","musty","nosmell","pungent","spicy","fishy")
> # gill.attachment was dropped earlier, so it needs no recoding
> levels(mushroom.data$gill.spacing) <- c("close","crowded")
> levels(mushroom.data$gill.size) <- c("broad","narrow")
> levels(mushroom.data$gill.color) <- c("buff","red","gray","chocolate","black","brown","orange","pink","green","purple","white","yellow")
> levels(mushroom.data$stalk.shape) <- c("enlarging","tapering")
> table(mushroom.data$stalk.root) # has a missing level coded as ?

   ?    b    c    e    r
2480 3776  556 1120  192

> levels(mushroom.data$stalk.root) <- c("missing","bulbous","club","equal","rooted")
> levels(mushroom.data$stalk.surface.above.ring) <- c("fibrous","silky","smooth","scaly")
> levels(mushroom.data$stalk.surface.below.ring) <- c("fibrous","silky","smooth","scaly")
> levels(mushroom.data$stalk.color.above.ring) <- c("buff","cinnamon","red","gray","brown","orange","pink","white","yellow")
> levels(mushroom.data$stalk.color.below.ring) <- c("buff","cinnamon","red","gray","brown","orange","pink","white","yellow")
> levels(mushroom.data$veil.type) <- c("partial")
> levels(mushroom.data$veil.color) <- c("brown","orange","white","yellow")
> levels(mushroom.data$ring.number) <- c("none","one","two")
> levels(mushroom.data$ring.type) <- c("evanescent","flaring","large","none","pendant")
> levels(mushroom.data$spore.print.color) <- c("buff","chocolate","black","brown","orange","green","purple","white","yellow")
> levels(mushroom.data$population) <- c("abundant","clustered","numerous","scattered","several","solitary")
> levels(mushroom.data$habitat) <- c("woods","grasses","leaves","meadows","paths","urban","waste")

3. Initial data visualization

a. Is there a relationship between cap-surface and cap-shape of a mushroom?

> library(ggplot2)
> p <- ggplot(data = mushroom.data, aes(x = cap.shape, y = cap.surface, color = class))
> p + geom_jitter(alpha = 0.3) + scale_color_manual(breaks = c('edible','poisonous'), values = c('darkgreen','red'))

Fig-1: Mushroom cap-shape and cap-surface

From Fig-1, we can easily notice that mushrooms with a flat cap shape and a scaly, smooth or fibrous cap surface are poisonous, while mushrooms with a bell, knobbed or sunken cap shape and a fibrous, smooth or scaly cap surface are edible. A majority of flat-capped mushrooms with a scaly or smooth cap surface are poisonous.

b. Is mushroom habitat and its population related?

> p <- ggplot(data = mushroom.data, aes(x = population, y = habitat, color = class))
> p + geom_jitter(alpha = 0.3) + scale_color_manual(breaks = c('edible','poisonous'), values = c('darkgreen','red'))

Fig-2: Mushroom population and habitat

From Fig-2, we see that mushrooms which are clustered or scattered in population and live in the woods are entirely poisonous. Those that live in grasses, wasteland, meadows, leaves, paths and urban areas are edible.

c. What’s the deal with living condition and odor?

> p <- ggplot(data = mushroom.data, aes(x = odor, y = habitat, color = class))
> p + geom_jitter(alpha = 0.3) + scale_color_manual(breaks = c('edible','poisonous'), values = c('darkgreen','red'))

Fig-3: Mushroom habitat and odor

From Fig-3, we notice that mushrooms with a fishy, spicy, pungent, foul, musty or creosote odor are clearly marked poisonous irrespective of their habitat, whereas the ones with an almond or anise odour are edible. We also notice that a minority of the no-odour mushrooms are poisonous, and that the no-odour mushrooms living in meadows are entirely poisonous.

Although there could be many other pretty visualizations, I will leave those as future work.

I will now focus on exploratory data analysis.

4. Exploratory data analysis

a. Correlation detection & treatment for categorical predictors

If we look at the structure of the dataset, we notice that each variable has several factor levels. Moreover, these levels are unordered. Such unordered categorical variables are termed as nominal variables. The opposite of unordered is ordered, we all know that. The ordered categorical variables are called, ordinal variables.

“In the measurement hierarchy, interval variables are highest, ordinal variables are next, and nominal variables are lowest. Statistical methods for variables of one type can also be used with variables at higher levels but not at lower levels.”, see Agresti

I found this cheat-sheet that can aid in determining the right kind of test to perform on categorical predictors (independent/explanatory variables). Also, this SO post is very helpful. See the answer by user gung.

For categorical variables, the concept of correlation can be understood in terms of a significance test and an effect size (strength of association).

The Pearson’s chi-squared test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data. It is a significance test. Given two categorical random variables, X and Y, the chi-squared test of independence determines whether or not there exists a statistical dependence between them. Formally, it is a hypothesis test. The chi-squared test assumes a null hypothesis and an alternate hypothesis. The general practice is, if the p-value that comes out in the result is less than a pre-determined significance level, which is 0.05 usually, then we reject the null hypothesis.

H0: The two variables are independent

H1: The two variables are dependent

The null hypothesis of the chi-squared test is that the two variables are independent and the alternate hypothesis is that they are related.

To establish that two categorical variables (or predictors) are dependent, the chi-squared statistic must exceed a certain cutoff. This cutoff increases as the number of classes within the variable (or predictor) increases.

In sections 3a, 3b and 3c, I detected possible indications of dependency between variables by visualizing the predictors of interest. In this section, I will test how well those dependencies hold up. First, I will apply the chi-squared test of independence to determine whether each dependency is significant. Thereafter, I will apply Goodman and Kruskal’s tau to measure the effect size (strength of association).

i. Pearson’s chi-squared test of independence (significance test)

> chisq.test(mushroom.data$cap.shape, mushroom.data$cap.surface, correct = FALSE)

        Pearson's Chi-squared test

data:  mushroom.data$cap.shape and mushroom.data$cap.surface
X-squared = 1011.5, df = 15, p-value < 2.2e-16

Since the p-value (< 2.2e-16) is less than the cut-off value of 0.05, we can reject the null hypothesis in favor of the alternative hypothesis and conclude that the variables cap.shape and cap.surface are dependent on each other.

> chisq.test(mushroom.data$habitat, mushroom.data$odor, correct = FALSE)

        Pearson's Chi-squared test

data:  mushroom.data$habitat and mushroom.data$odor
X-squared = 6675.1, df = 48, p-value < 2.2e-16

Similarly, the variables habitat and odor are dependent on each other, as the p-value (< 2.2e-16) is less than the cut-off value of 0.05.

ii. Effect size (strength of association)

A measure of association does not indicate causality; it indicates only whether one variable is associated with another, and also the strength of that relationship, whether weak or strong.

Since I’m dealing with nominal categorical predictors, Goodman and Kruskal’s tau is the appropriate measure. Interested readers are invited to see pages 68 and 69 of the Agresti book. More information on this test can be found here.

> library(GoodmanKruskal)
> varset1 <- c("cap.shape", "cap.surface", "habitat", "odor", "class")
> mushroomFrame1 <- subset(mushroom.data, select = varset1)
> GKmatrix1 <- GKtauDataframe(mushroomFrame1)
> plot(GKmatrix1, corrColors = "blue")

In Fig-4, I have shown the association plot. This plot is based on the corrplot library. In this plot, the diagonal element K refers to the number of unique levels for each variable. The off-diagonal elements contain the forward and backward tau measures for each variable pair. Specifically, the numerical values appearing in each row represent the association measure τ(x,y) from the variable x indicated in the row name to the variable y indicated in the column name.

The most obvious feature of this plot is that the variable class is almost perfectly predictable from odor (τ(x,y) = 0.94), and this forward association is quite strong. The forward association suggests that x = odor (which has levels “almond”, “creosote”, “foul”, “anise”, “musty”, “nosmell”, “pungent”, “spicy”, “fishy”) is highly predictive of y = class (which has levels “edible”, “poisonous”). This association between odor and class is strong and indicates that if we know a mushroom’s odor then we can easily predict whether its class is edible or poisonous.

On the contrary, the reverse association from y = class to x = odor is considerably weaker (τ(y,x) = 0.34), indicating that knowing whether a mushroom is edible or poisonous tells us much less about what its odor will be.

Earlier we found that cap.shape and cap.surface are dependent on each other (chi-squared significance test). Now, let’s see whether the association is also strong. Again, from Fig-4, both the forward and reverse associations suggest that x = cap.shape is only weakly associated with y = cap.surface (τ(x,y) = 0.03 and τ(y,x) = 0.01). Thus, we can safely say that although the dependence between these two variables is statistically significant, their association is weak; that is, it would be difficult to predict one from the other.

Similarly, many more associations can be interpreted from Fig-4. I invite interested readers to explore it further.

5. Conclusion

The primary objective of this study was to drive home the message: do not tamper with the data without providing a credible justification. The reason I chose categorical data for this study was to provide an in-depth treatment of the various association measures that can be applied to it. From my prior readings of statistical texts, I could recall that a significance test alone was not enough justification; there had to be something more. It was then that I found out about the different types of association measures, and they certainly cleared my doubts. In my next post, I will continue the current work by providing inferential and predictive analysis. For interested readers, I have uploaded the complete code to my GitHub repository here.

To leave a comment for the author, please follow the link and comment on their blog: Stories Data Speak. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

My new DataCamp course: Forecasting Using R

Thu, 06/01/2017 - 02:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

For the past few months I’ve been working on a new DataCamp course teaching Forecasting using R. I’m delighted that it is now available for anyone to do.
Course blurb

Forecasting involves making predictions about the future. It is required in many situations: deciding whether to build another power generation plant in the next ten years requires forecasts of future demand; scheduling staff in a call center next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements.

To leave a comment for the author, please follow the link and comment on their blog: R on Rob J Hyndman. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

let there be progress

Thu, 06/01/2017 - 01:00

(This article was first published on HighlandR, and kindly contributed to R-bloggers)

The ‘wrapr’package for use with dplyr programming –

UPDATED POST

I’m the first to admit I’m not an R expert, (even duffers can blog about it though), but when I began thinking about writing some dplyr functions to help me create and analyse run charts, I had no idea that I was going to struggle quite so much getting to grips with Non Standard Evaluation and Lazy Evaluation (see my last post for links and more background).

To clarify, I wanted to create flexible dplyr functions, without hardcoded grouping parameters, so I could query different SQL Server tables and apply the same functions reliably to transform and plot the data.

Eventually I sussed things out and created the functions I needed, but it was yet another example of how R makes me feel really stupid on a regular basis.
Please tell me I’m not alone with that…

I’m aware that dplyr 0.6 is imminent, and that my code will need revising as the dplyr verbs with underscores (for programmatic use) will be deprecated (at least eventually).

While I could probably install the dev version, I thought I should finally take a look at the let function in the wrapr package, as I’d seen numerous tweets giving it positive feedback.

I had a bash at recreating one of my functions – got some help directly from win-vector and was then able to rewrite all my remaining functions quickly.

I’ve put the code gist here – should hopefully run as intended, assuming you have the current dplyr 0.5 and wrapr packages installed.

Points to note
  • I commented out a line in the wrapr let example code for this demo so that all 3 return the same output.

  • There is some nice code to handle the lack of a grouping column, courtesy of Win-Vector. For my real world use, there will always be a grouping variable, but its nice to build this in for future requirements.

  • I used slice instead of filter in the NSE code – I could not get it to work at all otherwise. I’m pretty convinced I’ve done something wrong here, because the function only runs if I don’t wrap the variables in quotes when I call it. For the life of me I couldn’t figure out why a grouping call would work in one NSE function and fail in another when the same syntax was being used. Bearing in mind that the work I actually WANTED to do was going to be complex enough, I was just happy that it worked.

Code examples

Day to day, interactive dplyr code :

My attempt at the whole NSE/lazy eval thing :

Wrapr code :

With the let function you create the mapping block first, then use the variables there in your function – and can use the regular dplyr verbs. Once I’d seen how to set up the code, it was easy to recreate the others.
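For readers who have not seen the pattern, a minimal sketch looks something like the block below; the data frame, column names, and summary statistic here are made up, not taken from my run-chart functions:

library(dplyr)
library(wrapr)

# toy data: a measure recorded for two groups
df <- data.frame(group = rep(c("A", "B"), each = 6),
                 value = c(5, 7, 6, 8, 7, 9, 12, 11, 13, 12, 14, 13))

summarise_by <- function(data, groupcol, valuecol) {
  let(
    alias = c(GROUPCOL = groupcol, VALUECOL = valuecol),  # mapping block
    expr = {
      data %>%
        group_by(GROUPCOL) %>%                             # regular dplyr verbs
        summarise(median_value = median(VALUECOL),
                  n = n()) %>%
        ungroup()
    }
  )
}

summarise_by(df, "group", "value")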

Outputs (in same order as above code examples):

Here’s another example of using the underscore versions compared to wrapr.

This time the NSE code works with quoted variables:

And here is the wrapr equivalent using let:

From a real world deadlines / getting things done perspective – if you need to create flexible dplyr functions, and you need to do it soon – take a look at wrapr to see if it fits your needs.

To leave a comment for the author, please follow the link and comment on their blog: HighlandR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Shiny: data presentation with an extra

Thu, 06/01/2017 - 00:40

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)


A Shiny app with three tabs presenting different sections of the same data.

Shiny is an application based on R/RStudio which enables an interactive exploration of data through a dashboard with drop-down lists and checkboxes—programming-free. The apps can be useful for both the data analyst and the public.

Shiny apps are based on the Internet: This allows for private consultation of the data on one’s own browser as well as for online publication. Free apps can handle quite a lot of data, which can be increased with a subscription.

The target user of Shiny is extremely broad. Let’s take science—open science. At a time when openly archiving all data is becoming standard practice (e.g., OSF.io, Figshare.com, Github.com), Shiny can be used to walk the extra mile by letting people tour the data at once without programming. It’s the right tool for acknowledging all aspects of the data. Needless to say, these apps do not replace raw data archiving.

The apps simply add. For instance, the data in the lower app is a little noisy, right? Indeed it shows none of the succession of waves that characterizes word reading. The app helped me in identifying this issue. Instead of running along a host of participants and electrodes through a heavy code score, I found that the drop-down lists of the app let me seamlessly surf the data. By Participant 7, though, my wave dropped me…
 

Those data were very poor—systematically poorer than those of any other participants. I then checked the EEG preprocessing logs, and confirmed that those data had to be discarded. So much for the analysis utility of such an app. On the open science utility, what I did on discovering the fault was maintain the discarded data in the app, with a note, so that any colleagues and reviewers could consider it too. Now, although this example of use concerns a rather salient trait in the data, some other Shiny app might help to spot patterns such as individual differences, third variables.

Building a Shiny app is not difficult. Apps basically draw on some data presentation code (tables, plots) that you already have. Then just add a couple of scripts into the folder: one for the user interface (named ui.R), one for the process (named server.R), and perhaps another one compiling the commands for deploying the app and checking any errors.
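For orientation, a bare-bones pair of scripts might look like the sketch below; the input ids, titles, and plotting code are placeholders, not the app shown above:

# ui.R
library(shiny)
shinyUI(fluidPage(
  titlePanel("Data explorer"),
  sidebarLayout(
    sidebarPanel(
      selectInput("participant", "Participant:", choices = paste("Participant", 1:10))
    ),
    mainPanel(plotOutput("wavePlot"))
  )
))

# server.R
library(shiny)
shinyServer(function(input, output) {
  output$wavePlot <- renderPlot({
    # subset the pre-loaded data for the chosen participant and plot it
    plot(rnorm(100), type = "l", main = input$participant)
  })
})

# deployment (run once the ui and server files are in the app folder)
# rsconnect::deployApp()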

The steps to start a Shiny app from scratch are:

1: Tutorials. Being open-source software, the best manuals are available through a Google search.

2: User token. Signing up and reading in your private key—just once.

3: GO. Plunge into the ui and server scripts, and deployApp().

4: Bugs and logs. They are not bugs in fact—rather fancies. For instance, some special characters have to get even more special (technically, UTF-8 encoding). For a character such as μ, Shiny prefers Âμ. Just cling to error logs by calling:

showLogs(appPath = getwd(), appFile = NULL, appName = NULL, account = NULL, entries = 50, streaming = FALSE)

The log output will mention any typos and unaccepted characters, pointing to specific lines in your code.

It may take a couple of intense days to get a first app running. As usual with programming, it’s easy to run into the traps which are there to spice up the way. The app’s been around for years, so tips and tricks abound on the Internet. For greater companionship, there are dedicated Google groups, and then there’s StackExchange etc., where you can post any needs/despair. Post your code, log, and explanation, and you’ll be rescued out in a couple of days. Long live those contributors.

It will often be enough to upload a bare app, but you might then think it can look better.

5 (optional): Pro up.
Use tabs to combine multiple apps in one, use different widgets, etc. Tutorials like this one on Youtube can take you there, especially those that provide the code, as in the description of that video. Use those scripts as templates. For example, see this script in which the function conditionalPanel() is used to modify the app’s sidebar based on which tab is selected. The utility of tabs is illustrated in the upper cover of this article and in the app shown in the text: When having multiple data sections, the tabs allow you to have all in one (cover screenshot), instead of linking to other apps in each (screenshot in text).

Time for logistics. You can include any text in your app’s webpage, such as explanations of any length, web links, and author information. Oh, also importantly: the Shiny application allows for the presentation of data in any of the forms available in R, notably plots and tables. Costs: Shiny is free to use, in principle. You may only have to pay if you use your app(s) a lot (commonly from the first month), in which case you may pay 9 euros a month. There are different subscription plans. The free plan allows 25 hours of use per month, distributed among up to five apps.

There do exist alternatives to Shiny. One is fairly similar: It’s called Tableau. A nice comparison of the two apps is available here. Then, there’s a more advanced application called D3.js, which allows for lush graphics but proves more restrictive to newbies.

In sum, if you already program in R or even in Python, but have not tried online data presentation yet, consider it.

Feel free to share your ideas, experiences, or doubts in a comment on the original post.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Watch presentations from R/Finance 2017

Wed, 05/31/2017 - 23:23

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It was another great year for the R/Finance conference, held earlier this month in Chicago. This is normally a fairly private affair: with attendance capped at around 300 people every year, it's a somewhat exclusive gathering of the best and brightest minds from industry and academia in financial data analysis with R. But for the first time this year (and with thanks to sponsorship from Microsoft), videos of the presentations are available for viewing by everyone. I've included the complete list (copied from the R/Finance website) below, but here are a few of my favourites:

You can find an up-to-date version of the table below at the R/Finance website (click on the "Program" tab), and you can also browse the videos at Channel 9. Note that the lightning talk sessions (in orange) are bundled together in a single video, which you can find linked after the first talk in each session.

Friday, May 19th, 2017

09:30 – 09:35   Kickoff (video)
09:35 – 09:40   Sponsor Introduction
09:40 – 10:10   Marcelo Perlin: GetHFData: An R package for downloading and aggregating high frequency trading data from Bovespa (pdf) (video)
                Jeffrey Mazar: The obmodeling Package (html)
                Yuting Tan: Return Volatility, Market Microstructure Noise, and Institutional Investors: Evidence from High Frequency Market (pdf)
                Stephen Rush: Adverse Selection and Broker Execution (pdf)
                Jerzy Pawlowski: How Can Machines Learn to Trade? (html)
10:10 – 10:30   Michael Hirsch: Revealing High-Frequency Trading Provisions of Liquidity with Visualization in R (html) (video)
10:30 – 10:50   Eric Glass: Equity Factor Portfolio Case Study (html) (video)
10:50 – 11:10   Break
11:10 – 11:30   Seoyoung Kim: Zero-Revelation RegTech: Detecting Risk through Linguistic Analysis of Corporate Emails and News (pdf) (video)
11:30 – 12:10   Szilard Pafka: No-Bullshit Data Science (pdf) (video)
12:10 – 13:30   Lunch
13:30 – 14:00   Francesco Bianchi: Measuring Risk with Continuous Time Generalized Autoregressive Conditional Heteroscedasticity Models (pdf) (video)
                Eina Ooka: Bunched Random Forest in Monte Carlo Risk Simulation (pdf)
                Matteo Crimella: Operational Risk Stress Testing: An Empirical Comparison of Machine Learning Algorithms and Time Series Forecasting Methods (pdf)
                Thomas Zakrzewski: Using R for Regulatory Stress Testing Modeling (pdf)
                Andy Tang: How much structure is best? (pptx)
14:00 – 14:20   Robert McDonald: Ratings and Asset Allocation: An Experimental Analysis (pdf)
14:20 – 14:50   Break
14:50 – 15:10   Dries Cornilly: Nearest Comoment Estimation with Unobserved Factors and Linear Shrinkage (pdf) (video)
15:10 – 15:30   Bernhard Pfaff: R package: mcrp: Multiple criteria risk contribution optimization (pdf) (video)
15:30 – 16:00   Oliver Haynold: Practical Options Modeling with the sn Package, Fat Tails, and How to Avoid the Ultraviolet Catastrophe (pdf) (video)
                Shuang Zhou: A Nonparametric Estimate of the Risk-Neutral Density and Its Applications (pdf)
                Luis Damiano: A Quick Intro to Hidden Markov Models Applied to Stock Volatility
                Oleg Bondarenko: Rearrangement Algorithm and Maximum Entropy (pdf)
                Xin Chen: Risk and Performance Estimator Standard Errors for Serially Correlated Returns (pdf)
16:00 – 16:20   Qiang Kou: Text analysis using Apache MxNet (pdf) (video)
16:20 – 16:40   Robert Krzyzanowski: Syberia: A development framework for R (pdf) (video)
16:40 – 16:52   Matt Dancho: New Tools for Performing Financial Analysis Within the 'Tidy' Ecosystem (pptx) (video)
                Leonardo Silvestri: ztsdb, a time-series DBMS for R users (pdf)

Saturday, May 20th, 2017

09:05 – 09:35   Stephen Bronder: Integrating Forecasting and Machine Learning in the mlr Framework (pdf) (video)
                Leopoldo Catania: Generalized Autoregressive Score Models in R: The GAS Package (pdf)
                Guanhao Feng: Regularizing Bayesian Predictive Regressions (pdf)
                Jonas Rende: partialCI: An R package for the analysis of partially cointegrated time series (pdf)
                Carson Sievert: Interactive visualization for multiple time series (pdf)
09:35 – 09:55   Emanuele Guidotti: yuimaGUI: A graphical user interface for the yuima package (pptx) (video)
09:55 – 10:15   Daniel Kowal: A Bayesian Multivariate Functional Dynamic Linear Model (pdf) (video)
10:15 – 10:45   Break
10:45 – 11:05   Jason Foster: Scenario Analysis of Risk Parity using RcppParallel (pdf) (video)
11:05 – 11:35   Michael Weylandt: Convex Optimization for High-Dimensional Portfolio Construction (pdf) (video)
                Lukas Elmiger: Risk Parity Under Parameter Uncertainty (pdf)
                Ilya Kipnis: Global Adaptive Asset Allocation, and the Possible End of Momentum (pptx)
                Vyacheslav Arbuzov: Dividend strategy: towards the efficient market (pdf)
                Nabil Bouamara: The Alpha and Beta of Equity Hedge UCITS Funds – Implications for Momentum Investing (pdf)
11:35 – 12:15   Dave DeMers: Risk Fast and Slow (pdf) (video)
12:15 – 13:35   Lunch
13:35 – 13:55   Matthew Dixon: MLEMVD: A R Package for Maximum Likelihood Estimation of Multivariate Diffusion Models (pdf) (video)
13:55 – 14:15   Jonathan Regenstein: Reproducible Finance with R: A Global ETF Map (html) (video)
14:15 – 14:35   David Ardia: Markov-Switching GARCH Models in R: The MSGARCH Package (pdf) (video)
14:35 – 14:55   Keven Bluteau: Forecasting Performance of Markov-Switching GARCH Models: A Large-Scale Empirical Study (pdf) (video)
14:55 – 15:07   Riccardo Porreca: Efficient, Consistent and Flexible Credit Default Simulation (pdf) (video)
                Maisa Aniceto: Machine Learning and the Analysis of Consumer Lending (pdf)
15:07 – 15:27   David Smith: Detecting Fraud at 1 Million Transactions per Second (pptx) (video)
15:27 – 15:50   Break
15:50 – 16:10   Thomas Harte: The PE package: Modeling private equity in the 21st century (pdf) (video)
16:10 – 16:30   Guanhao Feng: The Market for English Premier League (EPL) Odds (pdf) (video)
16:30 – 16:50   Bryan Lewis: Project and conquer (html) (video)

 

 

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

VIF and multicollinearity diagnostics

Wed, 05/31/2017 - 23:14

(This article was first published on R code – Serious Stats, and kindly contributed to R-bloggers)

In the book I use the car package to get VIF and other multicollinearity diagnostics. I’ve occasionally found this breaks down (usually through mixing different versions of R on different machines at work, at home, or on the move). I recently saw the mctest package and thought it would be useful to have as a backup, and also because it offers a slightly different set of diagnostics.

If you’d like to try it out, I’ve written a helper function that makes it easier to apply these diagnostics directly to a linear model. You can find the function and a simple example here.
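The linked function is the authoritative version; purely as a sketch of the idea (not the actual helper), something like the function below wraps both packages. Note that the mctest call shown in the fallback is an assumption and its interface has changed across package versions.

vif_diag <- function(model) {
  # model: a fitted lm object
  if (requireNamespace("car", quietly = TRUE)) {
    return(car::vif(model))
  }
  if (requireNamespace("mctest", quietly = TRUE)) {
    X <- model.matrix(model)[, -1, drop = FALSE]   # predictors without the intercept
    y <- model.response(model.frame(model))
    return(mctest::mctest(X, y, type = "i"))       # assumed signature: may differ by version
  }
  stop("Neither 'car' nor 'mctest' is installed.")
}

m <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
vif_diag(m)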

 

Filed under: R code, serious stats Tagged: multicollinearity, R, statistics, vif

To leave a comment for the author, please follow the link and comment on their blog: R code – Serious Stats. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Bland-Altman/Tukey Mean-Difference Plots using ggplot2

Wed, 05/31/2017 - 23:04

(This article was first published on Environmental Science and Data Analytics, and kindly contributed to R-bloggers)

A very useful data visualisation tool in science, particularly in medical and sports settings, is the Bland-Altman/Tukey Mean-Difference plot. When comparing two sets of measurements for the same variable made by different instruments, it is often required to determine whether the instruments are in agreement or not.

Correlation and linear regression can tell us something about the bivariate relationship which exists between two sets of measurements. We can identify the strength, form and direction of a relationship but this approach is not recommended for comparative analyses.

The Bland-Altman plot’s first use was in 1983 by J.M Bland and D.G Altman who applied it to medical statistics. The Tukey Mean-Difference Plot was one of many exploratory data visualisation tools created by John Tukey who, interestingly, also created the beloved boxplot.

A simple implementation of the Bland-Altman/Tukey Mean-Difference plot in R is described below. The dataset used for this example is one of R’s built-in datasets, Loblolly, which contains data on the growth of Loblolly pine trees.

Prepare the workspace and have a look at the structure of the data.

library(dplyr)
library(ggplot2)

pine_df <- Loblolly
glimpse(pine_df)

 

Randomly split the data into two samples of equal length. These samples will be used as our two sets of pine tree height measurements. Arrange in order of descending height to mimic realistic dual measurements of single trees and combine into a dataframe. Note that these data are used simply to demonstrate and the measures in reality are NOT of the same tree using different instruments.

sample_df <- data.frame(
  sample_n(pine_df, size = nrow(pine_df) * 0.5) %>% select(height) %>% arrange(desc(height)),
  sample_n(pine_df, size = nrow(pine_df) * 0.5) %>% select(height) %>% arrange(desc(height)))

 

Assign sensible names to the two pine tree height samples in the dataframe.

names(sample_df) <- c("Sample_1", "Sample_2")

Each row in the dataframe consists of a pair of measurements. The Bland-Altman plot has the average of the two measures in a pair on the x-axis. The y-axis contains the difference between the two measures in each pair. Add the averages and differences data to the dataframe.

sample_df$Avg <- (sample_df$Sample_1 + sample_df$Sample_2) / 2
sample_df$Dif <- sample_df$Sample_1 - sample_df$Sample_2

Finally, code the plot and add the mean difference (blue line) and the 95% limits of agreement (red lines), computed as the mean difference plus or minus 1.96 times the standard deviation of the differences. Roughly 95% of the differences between the two sets of measurements are expected to fall within these limits.

ggplot(sample_df, aes(x = Avg, y = Dif)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = mean(sample_df$Dif), colour = "blue", size = 0.5) +
  geom_hline(yintercept = mean(sample_df$Dif) - (1.96 * sd(sample_df$Dif)), colour = "red", size = 0.5) +
  geom_hline(yintercept = mean(sample_df$Dif) + (1.96 * sd(sample_df$Dif)), colour = "red", size = 0.5) +
  ylab("Diff. Between Measures") +
  xlab("Average Measure")

 

To leave a comment for the author, please follow the link and comment on their blog: Environmental Science and Data Analytics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

RStudio Server Pro is now available on AWS Marketplace

Wed, 05/31/2017 - 23:02

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

RStudio is excited to announce the availability of its flagship enterprise-ready integrated development environment for R in AWS Marketplace.

RStudio Server Pro AWS is identical to RStudio Server Pro, but with turnkey convenience. It comes pre-configured with multiple versions of R, common systems libraries, and the most popular R packages.

RStudio Server Pro AWS helps you adapt to your unique circumstances. It allows you to choose different AWS computing instances no matter how large, whenever a project requires it (flat hourly pricing). Or you can set up a persistent instance of RStudio Server Pro ready to be used anytime you need it (annual pricing), avoiding the sometimes complicated processes for procuring on-premises software.

If the enhanced security, elegant support for multiple R versions and multiple sessions, and commercially licensed and supported features of RStudio Server Pro appeal to you and your enterprise, consider RStudio Server Pro for AWS. It’s ready to go!

Read the FAQ         Try RStudio Server Pro AWS

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
