A Shiny App for Exploring Commodities Prices and Economic Indicators, via Quandl
(This article was first published on R Views, and kindly contributed to R-bloggers)
In a previous post, we created an R Notebook to explore the relationship between the copper/gold price ratio and 10-year Treasury yields (if you’re curious why we might care about this relationship, have a quick look at that previous post), relying on data from Quandl. Today, we’ll create a Shiny app that lets users choose which commodity ratios and which economic indicators to investigate. Perhaps users don’t care about Dr. Copper and Treasury yields, but instead want to explore the oil/gold price ratio and how it correlates with the US inflation rate, or the EU inflation rate. Let’s give them some flexibility!
The finished app is available here.
Before we get to it, note a few issues that we have seen in the past.
Very similar to our previous Shiny app, in the code chunk below, we have some important decisions about how a user selects a commodity. We could use textInput to allow the user to enter the code for the desired data set which would not limit the user in any way – he or she could choose any dataset on Quandl. The cost would be that the user would need to know, or go to Quandl and look up, the code for any data set.
For example, to import iron ore prices, the user would have to type in ODA/PIORECR_USD. That’s a big problem if your end users are not familiar with and have no interest in data set codes. To emphasize convenience and usability we will go with selectInput instead of textInput, meaning our app will show a dropdown of a few choices. The user just clicks on “Iron Ore” instead of typing ODA/PIORECR_USD, or clicks on “copper” instead of typing CHRIS/CME_HG1.1. But, if a user wants to work with a data set that we haven’t included, said user is out of luck.
Another big decision is how many choices to give the user. I’ve included a few: copper, oil, iron, platinum, palladium and silver. That’s a nice smattering of commodities whose prices tend to rise with a growing economy, plus silver which does not. I included silver so we could take a look at a commodity that should behave differently from the rest. As always, our choice here is driven by how broad this app should be. We could have added steel, lead, zinc, and tin, or we could have included just copper, oil and iron, depending on the use case. Either way, the number of drop downs is another tradeoff between usability and flexibility.
The final decision is a bit more nuanced and requires looking ahead to how these inputs will be used further down in the app. Have a peek at the object called commodityChoices and you might notice that we don’t strictly need that object. We could have passed the vector of choices as an argument to selectInput, so that our code would have read choices = c("Copper" = "CHRIS/CME_HG1.1", ...) instead of choices = commodityChoices. In that choice assignment, “Copper” is called the name and “CHRIS/CME_HG1.1” is called the value (together we can think of them as a name-value pair). The reason for building a separate commodityChoices object is that we want the ability to extract either the name or the value of the name-value pair. Usually we would care only about the value, because we want to pass the value to Quandl and import the data, but the name is going to be useful when we label our graphs.
Without further ado, let’s look at commodityChoices, econIndicatorChoices and the use of selectInput.
```r
# Create a vector of commodity choices.
commodityChoices <- c("Copper"    = "CHRIS/CME_HG1.1",
                      "WTI oil"   = "FRED/DCOILWTICO",
                      "Iron Ore"  = "ODA/PIORECR_USD",  # monthly
                      "Platinum"  = "LPPM/PLAT",
                      "Palladium" = "LPPM/PALL",
                      "Silver"    = "LBMA/SILVER")

# Make those commodity choices available via selectInput.
selectInput("commodity",
            "Commodity",
            choices = commodityChoices,
            selected = "Copper")

# Create a vector of economic indicator choices.
econIndicatorChoices <- c("10-Yr Yield" = "FRED/DGS10",            # daily
                          "US CPI"      = "RATEINF/INFLATION_USA", # monthly
                          "Japan CPI"   = "RATEINF/INFLATION_JPN",
                          "EU CPI"      = "RATEINF/INFLATION_EUR")

# Make those economic indicator choices available via selectInput.
selectInput("econIndicator",
            "Economic Indicator",
            choices = econIndicatorChoices,
            selected = "10-Yr Yield")

# A standard date range input.
dateRangeInput("dateRange",
               "Date range",
               start = "1990-01-01",
               end   = "2016-12-31")
```

Now that we have the inputs in a sidebar for the user, it’s back to Quandl to import the data for the chosen commodity, gold, and the chosen economic indicator. There’s a common date range for all three, so we’ll start by creating start- and end-date objects.
```r
ratio_indicator <- reactive({

  Quandl.api_key("your API key here")

  start_date <- format(input$dateRange[1])
  end_date   <- format(input$dateRange[2])
```

We could now write three separate calls to Quandl, one for each data set, but instead let’s make use of the map() function from the purrr package. If you’re not familiar with purrr, have a look here. I’ll just say that you’ll probably never have to use lapply() again (and that should be motivation enough); in short, the family of map() functions takes a function and applies it to each element of a vector, similar to the apply() functions.
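If map() is new to you, here is a minimal, self-contained sketch of what it does (an illustration only, not part of the app):

```r
library(purrr)

# map() applies a function to each element and returns a list;
# the typed variants such as map_dbl() return an atomic vector.
map(c(1, 2, 3), function(x) x * 10)

map_dbl(c(1, 2, 3), function(x) x * 10)
#> [1] 10 20 30
```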
Before we can use map() though, we need a vector to feed it. Let’s create a vector of Quandl codes.
```r
  # Create a vector of 3 data set codes:
  # 1) the commodity chosen by the user
  # 2) the Quandl code for gold
  # 3) the economic indicator chosen by the user
  gold_code <- "CHRIS/CME_GC1.1"

  # Vector of Quandl codes.
  data_set_codes <- c(input$commodity, gold_code, input$econIndicator)
```

Then we’ll apply the Quandl() function by piping our vector of codes to map().
```r
  # Pipe the data_set_codes vector to Quandl via the map() function.
  # Note that we can still set the start and end dates and the object type,
  # as we always can with Quandl.
  quandlData <- data_set_codes %>%
    # Pipe the data sets vector to Quandl via the map() function.
    map(Quandl,
        start_date = start_date,
        end_date = end_date,
        collapse = "monthly",
        type = "xts") %>%
```

Next, we will use map() to apply the na.locf() function to each time series and ensure that no NAs remain.
```r
    # Replace all NAs using map() and na.locf().
    map(na.locf, fromLast = TRUE) %>%
```

If we stopped here, we would have a list of three xts series, but I don’t want a list, I want one xts object. So, we’ll pipe our list of three through reduce() and merge() to combine the three time series into one xts object.
```r
    # Merge into one xts object using reduce() and merge().
    reduce(merge) %>%
    # Add nicer column names.
    `colnames<-`(c(names(commodityChoices[commodityChoices == input$commodity]),
                   "Gold",
                   names(econIndicatorChoices[econIndicatorChoices == input$econIndicator])))
```

Alright, after running our Quandl codes through that series of mapped functions, we have three time series stored in one xts object, and now we want to calculate the price ratio of the chosen commodity/gold.
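To see what reduce() is doing with merge(), here is a small standalone sketch with toy data frames (the app itself does this with xts objects):

```r
library(purrr)

# Three toy data frames that share a "date" key, standing in for
# our three imported time series.
df1 <- data.frame(date = 1:3, commodity = 11:13)
df2 <- data.frame(date = 1:3, gold      = 21:23)
df3 <- data.frame(date = 1:3, indicator = 31:33)

# reduce() folds merge() across the list, i.e.
# merge(merge(df1, df2), df3): one data frame with all columns.
reduce(list(df1, df2, df3), merge)
```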
To create that price ratio, we need to divide the first column by the second column and we’ll store it in a new column called ratio.
Then we will save just the ratio and the economic indicator column data into their own xts object. That is not strictly necessary, but it makes things cleaner and easier when we pass the data to dygraph().
```r
  # Create a column and add the price ratio.
  quandlData$ratio <- quandlData[, 1] / quandlData[, 2]

  # Save just the ratio and the economic indicator data.
  ratio_indicator <- merge(quandlData$ratio, quandlData[, 3])

  # Add more general names.
  colnames(ratio_indicator) <- c("ratio", "indicator")

  return(ratio_indicator)
})
```

Now we just need to pass our reactive object ratio_indicator() to dygraph() and follow the same steps as we did when testing in our Notebook.
We will use dyRoller() to smooth out our chart and make each point an average of the number of periods specified with rollPeriod = X. This won’t affect our xts object, where we store the data, it just makes the chart more readable.
Remember also that we are charting two time series on the same chart and they are on different scales, so we want to add a right-hand-side y-axis.
To do so, we need to invoke dyAxis() for the left-hand axis, called “y”. Then we invoke dyAxis() for the right-hand axis, called “y2”. We also need to set independentTicks = TRUE so that we can use a unique, independent value scale for the right-hand side. Next, in our dySeries() call for each time series, we assign each one to an axis. Here we assign “ratio” with axis = 'y', so that the commodity/gold price ratio will be on the left-hand scale, and we assign “indicator” with axis = 'y2', so the economic indicator will be on the right-hand scale.
```r
dygraphOutput("ratio_indicator")

output$ratio_indicator <- renderDygraph({
  dygraph(ratio_indicator()) %>%
    # Add the rollPeriod for smoothing.
    dyRoller(rollPeriod = 3) %>%
    # Create two independent axes, just as we did in the Notebook.
    dyAxis("y", label = "USD") %>%
    dyAxis("y2", label = "Percent (%)", independentTicks = TRUE) %>%
    # Assign each time series to an axis.
    # Use the name from the name-value pair to create nice labels for each.
    dySeries("ratio",
             axis = 'y',
             label = paste(names(commodityChoices[commodityChoices == input$commodity]),
                           "/Gold (LHS)", sep = ""),
             color = "blue") %>%
    dySeries("indicator",
             axis = 'y2',
             label = paste(names(econIndicatorChoices[econIndicatorChoices == input$econIndicator]),
                           "(RHS)", sep = ""),
             color = "red")
})
```

We could end things here, but let’s go ahead and add a chart to show the rolling correlation between the ratio and the indicator. We’ve done so much work to calculate and wrangle these time series, we might as well put them to use!
First, we’ll calculate the rolling correlation using the rollapply() function. Nothing too complicated here.
```r
dygraphOutput("rollingCorrelation")

output$rollingCorrelation <- renderDygraph({

  rolling_cor <- rollapply(ratio_indicator(), 24,
                           function(x) cor(x[, 1], x[, 2],
                                           use = "pairwise.complete.obs"),
                           by.column = FALSE)

  # Make a nicer name for the xts object that stores the rolling correlation.
  # This name will be displayed when a user hovers on the dygraph.
  names(rolling_cor) <- paste(names(commodityChoices[commodityChoices == input$commodity]),
                              "/Gold ",
                              names(econIndicatorChoices[econIndicatorChoices == input$econIndicator]),
                              " Correlation", sep = "")
```

It’s not necessary, but I like to display the mean, minimum and maximum rolling correlations on the chart. We’ll store those in three objects: avg, mini, and maxi.
```r
  avg  <- round(mean(rolling_cor, na.rm = TRUE), 2)
  mini <- round(min(rolling_cor, na.rm = TRUE), 2)
  maxi <- round(max(rolling_cor, na.rm = TRUE), 2)
```

Now we pass our rolling_cor xts object to dygraph(), and pass the mean, minimum and maximum objects to dyLimit().
```r
  dygraph(rolling_cor,
          main = paste(names(commodityChoices[commodityChoices == input$commodity]),
                       "/Gold ",
                       names(econIndicatorChoices[econIndicatorChoices == input$econIndicator]),
                       " Correlation", sep = "")) %>%
    dyRangeSelector(dateWindow = c("2015-01-01", "2016-12-31")) %>%
    # Add a line for the mean, min and max.
    dyLimit(avg, color = 'purple') %>%
    dyLimit(mini, color = 'red') %>%
    dyLimit(maxi, color = 'blue') %>%
    # Add an event for the US election.
    dyEvent("2016-11-08", "Trump!", labelLoc = "bottom")
})
```

And, we’re done! It’s fun to explore different relationships amongst different time series with this app. And once we have this template in the toolkit, all sorts of different data sets can be substituted in for exploration. For example, we might want to port this work over to a currencies dashboard, or a country GDP dashboard. The nice thing is, it’s just a matter of finding the right Quandl codes and imagining new hypotheses to explore.
Things got a little choppy today with all the piping, so just a reminder that if you want the reusable code for this app, it’s available via the source code button at the top right of the live app. Thanks, and see you next time.
To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Python and R top 2017 KDnuggets rankings
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
The results of KDnuggets' 18th annual poll of data science software usage are in, and for the first time in three years Python has edged out R as the most popular software. While R increased its share of usage from 45.7% in last year's poll to 52.1% this year, Python's usage among data scientists increased even more, from 36.6% of users in 2016 to 52.6% of users this year.
There were some interesting moves in the long tail, as well. Several tools entered the KDnuggets chart for the first time, including Keras (9.5% of users), PyCharm (9.0%) and Microsoft R Server (4.3%). And several returning tools saw big jumps in usage, including Microsoft Cognitive Toolkit (3.4% of users), TensorFlow (20.2%) and Power BI (10.2%). Microsoft SQL Server increased its share to 11.6% (up from 10.8%), whereas SAS (7.1%) and Matlab (7.4%) saw declines. Julia, somewhat surprisingly, remained flat at 1.1%.
For the complete results and analysis of the 2017 KDnuggets data science software poll, follow the link below.
KDnuggets: New Leader, Trends, and Surprises in Analytics, Data Science, Machine Learning Software Poll
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
Canada Labour Market: Future Perspectives
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
For anyone looking for job opportunities, it is nice to have an idea of how the job market will perform in the future for your chosen career or industry. Many countries have open data sets that offer this kind of data. In these exercises we will use R to analyze the future perspectives of the Canadian labour market.
Answers to the exercises are available here.
Exercise 1
Download and read into R all data sets from Canadian Occupational Projection System (COPS) – 2015 to 2024 projections.
Exercise 2
Load the library tidyr. Use gather to rearrange any occupation-related data set that presents time-series data into a tidy data format.
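As a hint, here is a minimal sketch of what gather() does, using a made-up wide-format table (the column names are hypothetical, not the actual COPS ones):

```r
library(tidyr)

# Hypothetical wide-format data: one column per year, as in the
# raw COPS files after read-in.
wide <- data.frame(occupation = c("Welders", "Nurses"),
                   X2015 = c(100, 200),
                   X2016 = c(110, 210))

# gather() stacks the year columns into key/value pairs,
# leaving the occupation column as an identifier.
tidy <- gather(wide, key = "year", value = "value", -occupation)
```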
Exercise 3
Use gather to rearrange ALL the other occupation-related data sets that present time-series data into a tidy data format, and pile them up into a single data frame.
Exercise 4
Remove lines that present NA values, columns in French, and the “X” in front of every year. Take a look at your tidy data set.
 Learn in-depth how to work with dplyr
 Get a full introduction to the data.table package
 And much more
Exercise 5
Let’s do the same with the industries data sets. Start by taking one of the industries data sets that presents data as a time series and use gather.
Exercise 6
Apply the same procedure as in exercise 5 to all the other industries data sets. Pile them up into a new data frame.
Exercise 7
Remove NAs and French columns. In addition, set year and value as numeric, and take a look at your new tidy data set about industries.
Exercise 8
Find the industries that have the lowest number of job seekers, and create a new data set by subsetting the previous one.
Exercise 9
Plot the recently created data set using a line for each industry.
Exercise 10
Create a similar plot for the top 5 occupations with the lowest number of job seekers.
 Reshape 2 Exercises
 As.Date() Exercises
 Data frame exercises Vol. 2
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
In defense of wrapr::let()
(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)
Saw this the other day:
In defense of wrapr::let() (originally part of replyr, and still re-exported by that package) I would say:
 let() was deliberately designed for a single real-world use case: working with data when you don’t know the column names at the time you are writing the code (i.e., the column names will come later, in a variable). We can rephrase that as: there is deliberately less to learn, as let() is adapted to a need (instead of one having to adapt to let()).
 The R community already has months of experience confirming let() working reliably in production while interacting with a number of different packages.
 let() will continue to be a very specific, consistent, reliable, and relevant tool even after dplyr 0.6.* is released and the community gains experience with rlang/tidyeval in production.
If rlang/tidyeval is your thing, by all means please use and teach it. But please continue to consider also using wrapr::let(). If you are trying to get something done quickly, or trying to share work with others: a “deeper theory” may not be the best choice.
An example follows.
In “base R” one can write:
```r
d <- data.frame(x = 1:3)
```

If we know the name of the column we wish to add to this data frame, we write:
```r
d$newColumn <- 1
```

The above is “non-parameterized” evaluation, in that the variable name is taken from the source code and not from a variable carrying the information. This is great for ad-hoc analysis, but it would be hard to write functions, scripts and packages if this were the only way we had to manipulate columns.
This isn’t a problem as R supplies an additional parameterized notation as we show here. When we don’t know the name of the column (but expect it to be in a variable at run time) we write:
```r
# code written very far away from us
variableHoldingColumnName <- 'newColumn'

# our code
d[[variableHoldingColumnName]] <- 1
```

The purpose of wrapr::let() is to allow one to use the non-parameterized form as if it were parameterized. Obviously this is only useful if there is no convenient parameterized alternative (which is the case for some packages). But for teaching purposes: how would wrapr::let() let us use the “$” as if it were parameterized (which we would have to do if [[]] and [] did not exist)?
With wrapr::let() we can rewrite the “dollarsign form” as:
```r
wrapr::let(
  c(NEWCOL = variableHoldingColumnName),
  {
    d$NEWCOL <- 1
  }
)
```

The name “NEWCOL” is a stand-in name that we write all our code in terms of. The expression “c(NEWCOL = variableHoldingColumnName)” is saying: “NEWCOL is to be interpreted as being whatever name is stored in the variable variableHoldingColumnName.” Notice we can’t tell from the code what value is in variableHoldingColumnName, as that won’t be known until execution time. The alias list or vector can be arbitrarily long and built wherever you like (it is just data). The expression block can be arbitrarily large and complex (so you need only pay the mapping-notation cost once per function, not once per line of code).
And that is wrapr::let().
If you don’t need parametric names, you don’t need wrapr::let(). If you do need parametric names, wrapr::let() is one of the most reliable and easiest-to-learn ways to get them.
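To make the parametric use concrete, here is a small sketch (the function name is ours, not part of wrapr) that wraps the pattern above in a reusable function:

```r
library(wrapr)

# A hypothetical helper: add a constant column whose name is
# only known at run time.
addConstantColumn <- function(d, colName, value) {
  let(
    c(COLNAME = colName),
    {
      d$COLNAME <- value
      d
    }
  )
}

d <- data.frame(x = 1:3)
d <- addConstantColumn(d, "newColumn", 1)
# d now has columns x and newColumn
```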
A nice article from a user recording their experience trying different ways to parameterize their code (including wrapr::let()) can be found here.
If you wish to learn more we have a lot of resources available:
 A vignette: vignette('let', package='wrapr').
 The method help: help('let', package='wrapr')
 Examples: Using replyr::let to Parameterize dplyr Expressions.
 A video lecture: My recent BARUG talk: Parametric Programming in R with replyr.
 Some notes: Parametric Programming in R.
 The package introduction: The wrapr introduction.
 Comparison to the soon-to-be-deprecated SE underbar/underscore methods: Comparative examples using replyr::let.
And many more examples on our blog.
To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.
A Primer in Functional Programming in R (Part 2)
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
In the last exercise set, we saw how powerful functional programming principles can be, how dramatically they can increase the readability of code, and how easily you can work with them. In this set of exercises we will look at functional programming principles with purrr. purrr comes with a number of interesting features and is really useful for writing clean and concise code. Please check the documentation and load the purrr library in your R session before starting this exercise set.
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
From the airquality dataset (available in base R), find the mean, median, and standard deviation of all columns using map functions.
Exercise 2
In the same dataset, find the 95th percentile of each column, excluding the NA values.
Exercise 3
Load the iris dataset. With the help of the pipe and map functions, find the mean of the relevant columns. Keep in mind that mean is meant for numeric columns, so you may need multiple map-like functions. I expect the output as a data frame.
Exercise 4
I have a vector x <- 1:20. I want to multiply every odd element by 2 and every even element by 4. Find a solution using purrr.
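One possible approach (the answer page may use a different, equally valid one):

```r
library(purrr)

x <- 1:20

# map_dbl() returns a numeric vector; the anonymous function
# branches on whether the element is odd or even.
map_dbl(x, function(e) if (e %% 2 == 1) e * 2 else e * 4)
```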
Exercise 5
I have a sequence x <- 1:100. I want to increase the 1st, 11th, 21st, …, 91st elements by 1. How can I achieve that?
Exercise 6
Suppose I have two character vectors:
x <- letters # letters is the built-in vector of lowercase letters in R
y <- LETTERS # LETTERS is the built-in vector of uppercase letters in R
How do I join each capital letter with its lowercase counterpart, so that the end result looks like this:
"A,a" "B,b" ... and so on
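One possible approach, sketched with map2_chr() (the answer page may differ):

```r
library(purrr)

# map2_chr() walks both vectors in parallel and returns a
# character vector; paste() joins each pair with a comma.
map2_chr(LETTERS, letters, paste, sep = ",")
# first elements: "A,a" "B,b" "C,c"
```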
Exercise 7
The previous exercise gave you an intuition for working on two vectors in parallel. Now write a function that accepts a list of character vectors of the same size and joins them like in the previous exercise. So if I have a list like x <- list(c("a","b","c"), c("d","e","f"), c("g","h","i")), it should give me the output:
[1] "a d g" "b e h" "c f i"
Exercise 8
Using a functional tool from purrr, reverse the letters so that the output is a string of 26 characters starting from z and ending at a,
like “z y x w v u t s r q p o n m l k j i h g f e d c b a”.
Exercise 9
An exercise set on purrr won’t be complete if we don’t mention its “Filter”-like functions. Take the numeric vector 1:100, keep all the numbers that are divisible by 2, and after that remove the ones that are divisible by 5.
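One possible approach using purrr’s keep() and discard() (the answer page may differ):

```r
library(purrr)

x <- 1:100

# keep() retains elements where the predicate is TRUE;
# discard() then drops the multiples of 5 from what remains.
x %>%
  keep(~ .x %% 2 == 0) %>%
  discard(~ .x %% 5 == 0)
```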
Exercise 10
The ability to create partial functions is a powerful tool in functional programming. It allows functions to behave a little like data structures. Consider creating a partial function whenever you see yourself repeating function arguments. This may not sound very useful in the context of this exercise, but I am sure you will find it very useful in your R gigs. Now create a partial function
ql, which finds the 25th and 50th percentiles of a column from a data frame. Use it to find the same for the airquality dataset. Don’t use quantile twice in this exercise.
 Higher Order Functions Exercises
 A Primer in Functional Programming in R (Part – 1)
 Building Shiny App exercises part 4
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
A tidy model pipeline with twidlr and broom
(This article was first published on blogR, and kindly contributed to R-bloggers)
@drsimonj here to show you how to go from data in a data.frame to a tidy data.frame of model output by combining twidlr and broom in a single, tidy model pipeline.
The problem

Different model functions take different types of inputs (data.frames, matrices, etc.) and produce different types of output! Thus, we’re often confronted with the very untidy challenge presented in this Figure:
Thus, different models may need very different code.
However, it’s possible to create a consistent, tidy pipeline by combining the twidlr and broom packages. Let’s see how this works.
Two-step modelling

To understand the solution, think of the problem as a two-step process, depicted in this Figure:
Step 1: from data to fitted model

Step 1 must take data in a data.frame as input and return a fitted model object. twidlr exposes model functions that do just this!
To demonstrate:
```r
#devtools::install_github("drsimonj/twidlr")  # To install
library(twidlr)

lm(mtcars, hp ~ .)
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)          mpg          cyl         disp         drat
#>      79.048        2.063        8.204        0.439        4.619
#>          wt         qsec           vs           am         gear
#>      27.660        1.784       25.813        9.486        7.216
#>        carb
#>      18.749
```

This means we can pipe data.frames into any model function exposed by twidlr. For example:
```r
library(dplyr)

mtcars %>% lm(hp ~ .)
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)          mpg          cyl         disp         drat
#>      79.048        2.063        8.204        0.439        4.619
#>          wt         qsec           vs           am         gear
#>      27.660        1.784       25.813        9.486        7.216
#>        carb
#>      18.749
```

Step 2: fitted model to tidy results

Step 2 must take a fitted model object as its input and return a tidy data frame of results. This is precisely what the broom package does via three functions: glance, tidy, and augment! To demonstrate:
```r
#install.packages("broom")  # To install
library(broom)

fit <- mtcars %>% lm(hp ~ .)

glance(fit)
#>   r.squared adj.r.squared    sigma statistic     p.value df   logLik
#> 1 0.9027993     0.8565132 25.97138  19.50477 1.89833e-08 11 142.8905
#>        AIC      BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76          21

tidy(fit)
#>           term   estimate   std.error statistic     p.value
#> 1  (Intercept) 79.0483879 184.5040756 0.4284371 0.672695339
#> 2          mpg  2.0630545   2.0905650 0.9868407 0.334955314
#> 3          cyl  8.2037204  10.0861425 0.8133655 0.425134929
#> 4         disp  0.4390024   0.1492007 2.9423609 0.007779725
#> 5         drat  4.6185488  16.0829171 0.2871711 0.776795845
#> 6           wt 27.6600472  19.2703681 1.4353668 0.165910518
#> 7         qsec  1.7843654   7.3639133 0.2423121 0.810889101
#> 8           vs 25.8128774  19.8512410 1.3003156 0.207583411
#> 9           am  9.4862914  20.7599371 0.4569518 0.652397317
#> 10        gear  7.2164047  14.6160152 0.4937327 0.626619355
#> 11        carb 18.7486691   7.0287674 2.6674192 0.014412403

augment(fit) %>% head()
#>           .rownames  hp  mpg cyl disp drat    wt  qsec vs am gear carb
#> 1         Mazda RX4 110 21.0   6  160 3.90 2.620 16.46  0  1    4    4
#> 2     Mazda RX4 Wag 110 21.0   6  160 3.90 2.875 17.02  0  1    4    4
#> 3        Datsun 710  93 22.8   4  108 3.85 2.320 18.61  1  1    4    1
#> 4    Hornet 4 Drive 110 21.4   6  258 3.08 3.215 19.44  1  0    3    1
#> 5 Hornet Sportabout 175 18.7   8  360 3.15 3.440 17.02  0  0    3    2
#> 6           Valiant 105 18.1   6  225 2.76 3.460 20.22  1  0    3    1
#>     .fitted    .resid      .hat   .sigma     .cooksd .std.resid
#> 1 148.68122 38.681220 0.2142214 24.75946 0.069964902  1.6801773
#> 2 140.62866 30.628664 0.2323739 25.43881 0.049861042  1.3460408
#> 3  79.99158 13.008418 0.3075987 26.38216 0.014633059  0.6019364
#> 4 125.75448 15.754483 0.2103960 26.31579 0.011288712  0.6826601
#> 5 183.21756  8.217565 0.2016137 26.53317 0.002878707  0.3541128
#> 6 111.38490  6.384902 0.3147448 26.55680 0.003682813  0.2969840
```

A single, tidy pipeline

So twidlr and broom functions can be combined into a single, tidy pipeline to go from a data.frame to tidy output:
```r
library(twidlr)
library(broom)

mtcars %>%
  lm(hp ~ .) %>%
  glance()
#>   r.squared adj.r.squared    sigma statistic     p.value df   logLik
#> 1 0.9027993     0.8565132 25.97138  19.50477 1.89833e-08 11 142.8905
#>        AIC      BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76          21
```

Any model included in twidlr and broom can be used in this same way. Here’s a kmeans example:
```r
iris %>%
  select(-Species) %>%
  kmeans(centers = 3) %>%
  tidy()
#>         x1       x2       x3       x4 size withinss cluster
#> 1 5.901613 2.748387 4.393548 1.433871   62 39.82097       1
#> 2 5.006000 3.428000 1.462000 0.246000   50 15.15100       2
#> 3 6.850000 3.073684 5.742105 2.071053   38 23.87947       3
```

And a cross-validated ridge regression example:
```r
mtcars %>%
  cv.glmnet(am ~ ., alpha = 0) %>%
  glance()
#>   lambda.min lambda.1se
#> 1  0.2284167  0.8402035
```

So next time you want to do some tidy modelling, keep this pipeline in mind:
Limitations

Currently, the major limitation of this approach is that a model must be covered by both twidlr and broom. For example, you can’t use randomForest in this pipeline because, although twidlr exposes a data.frame-friendly version of it, broom doesn’t provide tidying methods for it. So if you want to write tidy code for a model that isn’t covered by these packages, have a go at helping out by contributing to these open-source projects! To get started creating and contributing to R packages, take a look at Hadley Wickham’s free book, “R Packages”.
Sign off

Thanks for reading and I hope this was useful for you.
For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.
If you’d like the code that produced this blog, check out the blogR GitHub repository.
To leave a comment for the author, please follow the link and comment on their blog: blogR.
Correcting bias in meta-analyses: What not to do (meta-showdown Part 1)
(This article was first published on R – Nicebread, and kindly contributed to R-bloggers)
tl;dr: Publication bias and p-hacking can dramatically inflate effect size estimates in meta-analyses. Many methods have been proposed to correct for such bias and to estimate the underlying true effect. In a large simulation study, we found out which methods do not work well under which conditions, and give recommendations on what not to use. Estimated reading time: 7 min.

It is well known that publication bias and p-hacking inflate effect size estimates from meta-analyses. In recent years, methodologists have developed an ever-growing menu of statistical approaches to correct for such overestimation. However, to date it was unclear under which conditions they perform well, and what to do if they disagree. Born out of a Twitter discussion, Evan Carter, Joe Hilgard, Will Gervais and I did a large simulation project in which we compared the performance of naive random effects meta-analysis (RE), trim-and-fill (TF), p-curve, p-uniform, PET, PEESE, PET-PEESE, and the three-parameter selection model (3PSM).

Previous investigations typically looked only at publication bias or questionable research practices (QRPs), but not both; used non-representative study-level sample sizes; or compared only a few bias-correcting techniques, but not all of them. Our goal was to simulate a research literature that is as realistic as possible for psychology. In order to simulate several research environments, we fully crossed five experimental factors: (1) the true underlying effect, δ (0, 0.2, 0.5, 0.8); (2) between-study heterogeneity, τ (0, 0.2, 0.4); (3) the number of studies in the meta-analytic sample, k (10, 30, 60, 100); (4) the percentage of studies in the meta-analytic sample produced under publication bias (0%, 60%, 90%); and (5) the use of QRPs in the literature that produced the meta-analytic sample (none, medium, high).
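The fully crossed design can be sketched in a few lines of R (a toy illustration only; the actual simulation code is on GitHub):

```r
# Fully cross the five simulation factors; each row of the grid
# is one simulated research environment.
conditions <- expand.grid(
  delta   = c(0, 0.2, 0.5, 0.8),   # true underlying effect
  tau     = c(0, 0.2, 0.4),        # between-study heterogeneity
  k       = c(10, 30, 60, 100),    # studies per meta-analysis
  pubBias = c(0, 0.6, 0.9),        # share produced under publication bias
  qrp     = c("none", "medium", "high")  # use of QRPs
)
nrow(conditions)
#> [1] 432
```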
This blog post summarizes some insights from our study, internally called “meta-showdown”. Check out the preprint and the interactive app metaExplorer. The fully reproducible and reusable simulation code is on GitHub, and more information is on OSF.
In this blog post, I will highlight some lessons that we learned during the project, primarily focusing on what not to do when performing a meta-analysis.
Limitation-of-generalizability disclaimer: These recommendations apply to typical sample sizes, effect sizes, and heterogeneities in psychology; other research literatures might have different settings and therefore different performance of the methods. Furthermore, the recommendations rely on the modeling assumptions of our simulation. We went a long way to make them as realistic as possible, but other assumptions could lead to other results.
Never trust a naive random effects meta-analysis or trim-and-fill (unless you meta-analyze a set of registered reports)

If studies have no publication bias, nothing can beat plain old random effects meta-analysis: it has the highest power, the least bias, and the highest efficiency compared to all other methods. Even in the presence of some (though not extreme) QRPs, naive RE performs better than all other methods. When can we expect no publication bias? If (and, in my opinion, only if) we meta-analyze a set of registered reports.
But.
In any other setting except registered reports, a consequential amount of publication bias must be expected. In the field of psychology/psychiatry, more than 90% of all published hypothesis tests are significant (Fanelli, 2011), despite the average power being estimated at around 35% (Bakker, van Dijk, & Wicherts, 2012) – the gap points towards huge publication bias. In the presence of publication bias, naive random effects meta-analysis and trim-and-fill have false positive rates approaching 100%:
More thoughts about trim-and-fill’s inability to recover δ = 0 are in Joe Hilgard’s blog post. (Note: this insight is not really new and has been shown multiple times before, for example by Moreno et al., 2009, and Simonsohn, Nelson, and Simmons, 2014.)
Our recommendation: Never trust meta-analyses based on naive random effects or trim-and-fill, unless you can rule out publication bias. Results from previously published meta-analyses based on these methods should be treated with a lot of skepticism.
Do not use p-curve and p-uniform for effect size estimation (under heterogeneity)
As a default, heterogeneity should always be expected – even under the most controlled conditions, where many labs perform the same computer-administered experiment, a large proportion of effects showed significant and substantial amounts of between-study heterogeneity (cf. ManyLabs 1 and 3; see also our supplementary document for more details). p-curve and p-uniform assume homogeneous effect sizes, and their performance is impacted to a large extent by heterogeneity:
As you can see, all other methods retain the nominal false positive rate, but p-curve and p-uniform go through the roof as soon as heterogeneity comes into play (see also McShane, Böckenholt, & Hansen, 2016; van Aert et al., 2016).
Under H1, heterogeneity leads to overestimation of the true effect:
(additional settings for these plots: no QRPs, no publication bias, k = 100 studies, true effect size = 0.5)
Note that in their presentation of p-curve, Simonsohn et al. (2014) emphasize that, in the presence of heterogeneity, p-curve is intended as an estimate of the average true effect size among the studies submitted to p-curve (see here, Supplement 2). p-curve may indeed yield an accurate estimate of the true effect size among the significant studies, but in our view, the goal of bias correction in meta-analysis is to estimate the average effect of all conducted studies. Of course, this latter estimation hinges on modeling assumptions (e.g., that the effects are normally distributed), which can be disputed, and there might be applications where the underlying true effect of all significant studies is indeed more interesting.
Furthermore, as McShane et al. (2016) demonstrate, p-curve and p-uniform are constrained versions of the more general three-parameter selection model (3PSM; Iyengar & Greenhouse, 1988). The 3PSM estimates (a) the mean of the true effect, δ, (b) the heterogeneity, τ, and (c) the probability that a non-significant result enters the literature, p. The constraints of p-curve and p-uniform are: 100% publication bias (i.e., p = 0) and homogeneity (i.e., τ = 0). Hence, for the estimation of effect sizes, 3PSM seems to be a good replacement for p-curve and p-uniform, as it makes these constraints testable.
Our recommendation: Do not use p-curve or p-uniform for effect size estimation when heterogeneity can be expected (which is nearly always the case).
Ignore overadjustments in the opposite direction

Many bias-correcting methods are driven by QRPs – the more QRPs, the stronger the downward correction. However, this effect can become so strong that methods overadjust into the opposite direction, even if all studies in the meta-analysis are of the same sign:
Note: You need to set the option “Keep negative estimates” to get this plot.
Our recommendation: Ignore bias-corrected results that go into the opposite direction; set the estimate to zero, do not reject H₀.
Typical small-study effects (e.g., induced by p-hacking or publication bias) imply a negative correlation between sample size and effect size – the smaller the sample, the larger the observed effect size. PET-PEESE aims to correct for that relationship. In the absence of bias and QRPs, however, random fluctuations can lead to a positive correlation between sample size and effect size, which gives the PET and PEESE slope the unintended sign. Without publication bias, this reversal of the slope actually happens quite often.
See for example the next figure. The true effect size is zero (red dot), naive random effects metaanalysis slightly overestimates the true effect (see black dotted triangle), but PET and PEESE massively overadjust towards more positive effects:
PET-PEESE was never intended to correct in the reverse direction. An underlying biasing process would have to systematically remove small studies that show a significant result with larger effect sizes, and keep small studies with non-significant results. In the current incentive structure, I see no reason for such a process.
Our recommendation: Ignore the PET-PEESE correction if it has the wrong sign.
PET-PEESE sometimes overestimates, sometimes underestimates
A bias can be more easily accepted if it is always conservative – then one could conclude: “This method might miss some true effects, but if it indicates an effect, we can be quite confident that it really exists.” Depending on the conditions (i.e., how much publication bias, how many QRPs, etc.), however, PET/PEESE sometimes shows huge overestimation and sometimes huge underestimation.
For example, with no publication bias, some heterogeneity (τ=0.2), and severe QRPs, PET/PEESE underestimates the true effect of δ = 0.5:
In contrast, if no effect exists in reality, but there is strong publication bias, large heterogeneity, and no QRPs, PET/PEESE overestimates a lot:
In fact, the distribution of PET/PEESE estimates looks virtually identical for these two examples, although the underlying true effect is δ = 0.5 in the upper plot and δ = 0 in the lower plot. Furthermore, note the huge spread of PET/PEESE estimates (the error bars visualize the 95% quantiles of all simulated replications): Any single PET/PEESE estimate can be very far off.
Our recommendation: As one cannot know the condition of reality, it is safest not to use PET/PEESE at all.
Recommendations in a nutshell: What you should not use in a meta-analysis
Again, please consider the “limitations of generalizability” disclaimer above.
 When you can exclude publication bias (i.e., in the context of registered reports), do not use bias-correcting techniques. Even in the presence of some QRPs they perform worse than plain random effects meta-analysis.
 In any other setting except registered reports, expect publication bias, and do not use naive random effects meta-analysis or trim-and-fill. Both will give you a 100% false positive rate in typical settings, and biased estimates.
 Under heterogeneity, p-curve and p-uniform overestimate the underlying effect and have false positive rates of 50% or more.
 Even if all studies entering a meta-analysis point in the same direction (e.g., all are positive), bias-correcting techniques sometimes overadjust and return a significant estimate in the opposite direction. Ignore these results, set the estimate to zero, and do not reject H₀.
 Sometimes PET/PEESE adjusts in the wrong direction (i.e., increasing the estimated true effect size).
As with any general recommendations, there might be good reasons to ignore them.
Additional technical recommendations

 The p-uniform package (v. 0.0.2) in rare cases does not provide a lower CI. In this case, ignore the estimate.
 Do not run p-curve or p-uniform on three or fewer significant and directionally consistent studies. Although computationally possible, this gives hugely variable results, which are often very biased. See our supplemental material for more information and plots.
 If the 3PSM method (in the implementation of McShane et al., 2016) returns an incomplete covariance matrix, ignore the result (even if a point estimate is provided).
Now you probably ask: But what should I use? Read our preprint for an answer!
The post Correcting bias in meta-analyses: What not to do (meta-showdown Part 1) appeared first on Nicebread.
To leave a comment for the author, please follow the link and comment on their blog: R – Nicebread.
Simple bash script for a fresh install of R and its dependencies in Linux
(This article was first published on R and Finance, and kindly contributed to R-bloggers)
I’ve been working with Linux for some time, but always in a dual-boot setup. While I used Linux Mint at home, my university computer always had Windows. Most of the time it was not a problem to work in one or the other and share files between computers with Dropbox.
After getting constantly frustrated with Windows updates, crashes and slowdowns, I decided to make the full switch. All my computers are now Linux-based. I am very happy with my choice and regret not making the switch earlier.
Since I formatted all my three computers (home/laptop/work), I wrote a small bash file to automate the process of installing R and its dependencies. I use lots of R packages on a daily basis. For some of them, it is required to install dependencies using the terminal. Each time a call to install.packages() failed, I saved the name of the required software and added it to the bash file. While my bash file will not cover the dependencies of all packages, it will suffice for a great proportion.
The steps for using it are:
1) Copy the contents of the code presented next into a file called InstallR.sh. Do notice that the name of the release, in this case trusty, is added to the CRAN address. Try not to run the bash file twice, or it will append the CRAN address more than once to /etc/apt/sources.list.
#!/bin/bash
# Adds R to apt and installs it
#
# Instructions:
# sudo chmod 700 InstallR.sh
# ./FirstTimeInstallR.sh

sudo echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" | sudo tee -a /etc/apt/sources.list
gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
sudo apt-get update
sudo apt-get install -y r-base r-base-dev r-cran-xml r-cran-rjava libcurl4-openssl-dev
sudo apt-get install -y libssl-dev libxml2-dev openjdk-7* libgdal-dev libproj-dev libgsl-dev
sudo apt-get install -y xml2 default-jre default-jdk mesa-common-dev libglu1-mesa-dev freeglut3-dev
sudo apt-get install -y mesa-common-dev libx11-dev r-cran-rgl r-cran-rglpk r-cran-rsymphony r-cran-plyr
sudo apt-get install -y r-cran-reshape r-cran-reshape2 r-cran-rmysql
sudo R CMD javareconf

2) Change the permissions of the file so that it can be executed:
sudo chmod 700 InstallR.sh

3) Execute it as sudo:
sudo ./FirstTimeInstallR.sh

That’s it. I hope others will find this simple bash script useful as well.
To leave a comment for the author, please follow the link and comment on their blog: R and Finance.
A Partial Remedy to the Reproducibility Problem
(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)
Several years ago, John Ioannidis jolted the scientific establishment with an article titled “Why Most Published Research Findings Are False.” He had concerns about inattention to statistical power, multiple inference issues and so on. Most people had already been aware of all this, of course, but that conversation opened the floodgates, and many more issues were brought up, such as hidden lab-to-lab variability. In addition, there is the occasional revelation of outright fraud.
Many consider the field to be at a crisis point.
At the 2014 JSM, Phil Stark organized a last-minute session on the issue, including Marcia McNutt, former editor of Science, and Yoav Benjamini of multiple inference methodology fame. The session attracted a standing-room-only crowd.
In this post, Reed Davis and I are releasing the prototype of an R package that we are writing, revisit, with the goal of partially remedying the statistical and data wrangling aspects of this problem. It is assumed that the authors of a study have supplied (possibly via carrots or sticks) not only the data but also the complete code for their analyses, from data cleaning up through formal statistical analysis.
There are two main aspects:
 The package allows the user to “replay” the authors’ analysis and, most importantly, explore alternate analyses that the authors may have overlooked. The various alternate analyses may be saved for sharing.
 It warns of statistical errors, such as overreliance on p-values, the need for multiple inference procedures, possible distortion due to outliers, etc.
The term user here could refer to several different situations:
 The various authors of a study, collaborating and trying different analyses during the course of the study.
 Reviewers of a paper submitted for publication on the results of the study.
 Fellow scientists who wish to delve further into the study after it is published.
The package has text and GUI versions. The latter is currently implemented as an RStudio addin.
The package is on my GitHub site, and has a fairly extensive README file introducing the goals and usage.
To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.
U.S. Residential Energy Use: Machine Learning on the RECS Dataset
(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)
Contributed by Thomas Kassel. He is currently enrolled in the NYC Data Science Academy remote bootcamp program taking place from January to May 2017. This post is based on his final capstone project, focusing on the use of machine learning techniques learned throughout the course.
Introduction
The residential sector accounts for up to 40% of annual U.S. electricity consumption, representing a large opportunity for energy efficiency and conservation. A strong understanding of the main electricity end-uses in residences can allow homeowners to make more informed decisions to lower their energy bills, help utilities maximize efficiency/incentive programs, and allow governments or NGOs to better forecast energy demand and address climate concerns.
The Residential Energy Consumption Survey (RECS) collects energy-related data on a nationally representative sample of U.S. homes. First conducted in 1978 and administered every 4–5 years by the Energy Information Administration, it is a leading data source for residential energy analysis. Methodology for the study is well-documented – a comprehensive overview of RECS can be found in the program’s technical documentation.
This project applied machine learning methods to the most recently available RECS dataset, published in 2009. The primary goals were twofold, as is common in many regression exercises:
 Description – identification of variable importance or features with the strongest statistical associations to annual residential electricity usage (kWh)
 Prediction – feature engineering and use of machine learning platform h2o to optimize predictive accuracy over annual electricity usage
EDA and Data PreProcessing
RECS 2009 consists of 12,083 observations (each representing a unique housing unit) and over 900 features encompassing physical building characteristics, appliances, occupant behavior/usage patterns, occupant demographics, etc. These features serve as independent variables in predicting the outcome variable, annual electricity usage in kilowatt-hours (kWh). Because RECS aims to provide a comprehensive overview of many types of residences for a variety of analytical purposes (beyond energy use prediction), many of the features are sparsely populated, collinear, or uncorrelated with kWh usage. Therefore, a preliminary and recurring task throughout the project was dimension reduction.
Since most of the independent variables were collected through occupant interviews – as opposed to exact physical examination of the residence – the values of many are binned as factor/categorical variables. The dataset’s 931 original variables had the following characteristics:
Missingness was common in the raw dataset, with 73 features having NA values in more than 95% of observations. To quickly gauge whether these features correlated with the outcome variable kWh (and therefore should be retained and imputed), I made use of flexible EDA graphing functions from the GGally R package. An example is shown below on continuous variables, revealing generally low correlation coefficients with kWh and non-significant p-values (not shown). GGally also accommodates similar exploratory analysis on factor variables via pairwise box plots. Overall, although a more complex imputation strategy could have been employed to address these 73 features, they were dropped due to their high missingness and little evidence of predictive power over the outcome variable.
Feature Engineering/Reduction
Factor Encoding
As previously mentioned, the majority of the dataset features were factor variables, many of which needed recoding in order to correctly capture the variation being described. As illustrated below, nominal factor variables such as WALLTYPE have no intrinsic ordering; it would not make sense to order “stucco” higher than “wood”, for example. On the other hand, ordinal factors such as AGERFRI1 (age of most-used refrigerator) have a clear order to their factor levels; each level denotes a numerically higher value than the previous. Information contained in variables like AGERFRI1 is often more appropriately formatted in integer/numeric form, which can better convey their continuous range of values, aid interpretation of the variable coefficient, and reduce standard error by decreasing the number of coefficients for model estimation (many algorithms generate a separate model coefficient for each factor level via one-hot encoding).
In the image above, WALLTYPE was left as a factor while AGERFRI1 was recoded as an integer value (using the midpoint of each bin as the number of years). This factor-to-integer recoding was applied to additional binned variables relating to the age of appliances and their frequency of use.
New factor variables were also created by consolidating information from existing factors into one-hot-encoded binary values, such as presence of a heated pool, use of electricity for space heating, etc.
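To illustrate the ordinal recoding idea, here is a small Python sketch; the bin labels and midpoints below are hypothetical stand-ins, not the actual RECS survey codes:

```python
# Hypothetical ordinal bins for appliance age, mapped to bin midpoints in years
AGE_BIN_MIDPOINTS = {
    "Less than 2 years": 1,
    "2 to 4 years": 3,
    "5 to 9 years": 7,
    "10 to 14 years": 12,
    "15 or more years": 17,  # open-ended bin: an assumed representative value
}

def recode_age(level):
    """Convert a binned ordinal factor level to an integer number of years."""
    return AGE_BIN_MIDPOINTS[level]

# One integer column replaces five one-hot coefficients in the model
ages = [recode_age(x) for x in ["2 to 4 years", "10 to 14 years"]]
```

The open-ended top bin has no true midpoint, so its numeric value is necessarily a judgment call.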
Feature Reduction
We can visualize multicollinearity among the integer and numeric features using correlation plots, again made with R package GGally. In the plot below, a conceptual group of collinear variables stands out, i.e. those directly or indirectly representing a residence’s size, for example, total square footage (TOTSQFT), cooling square footage (TOTCSQFT), number of bathrooms (NCOMBATH), number of refrigerators (NUMFRIG), etc.
To confirm the degree of multicollinearity, variance inflation factors were calculated based on a linear model with the above numeric features – those with high VIF scores (loosely following the rule of thumb VIF > 5.0) were dropped.
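The VIF screening step can be sketched as follows (with synthetic data rather than the RECS features). Each feature's VIF is 1/(1 − R²), where R² comes from regressing that feature on all the others:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the n x p matrix X."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus the remaining columns
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
# First two columns are nearly collinear; the third is independent
X = np.column_stack([a, a + 0.1 * rng.normal(size=500), rng.normal(size=500)])
scores = vif(X)  # large VIFs for the collinear pair, ~1 for the last column
```

Columns exceeding the VIF > 5 rule of thumb would be dropped, as described above.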
Despite variable recoding and decorrelation using the methods previously described, a large number of features remained in the dataset, the majority of them being factor variables. Although principal component analysis (PCA) is a powerful method for dimension reduction, it does not generalize easily to categorical data. Therefore, feature reduction was continued instead through the use of an exploratory LASSO regression model, utilizing the L1 regularization penalty to drive nonsignificant variable coefficients in the model’s output to 0. This was helpful in identifying and dropping features with very little predictive power over kWh usage, particularly factor variables that were not covered in the multicollinearity exercise above.
Modeling With H2o
The processed and cleaned dataset was migrated to open-source, online machine learning platform h2o, which offers highly efficient and scalable modeling methods. In contrast to machine learning conducted in slower native R packages (caret, glm) in the local R environment, R package h2o facilitates API calls to h2o’s online platform, sending the given dataset to be distributed and parallel-processed among multiple clusters. H2o offers an array of the most common machine learning algorithms (glm, kNN, random forests, gbm, deep learning) at very impressive run times – lessening the large burden of computational speed in the model fitting and tuning process. It also provides handy, built-in functions for preprocessing such as training/validation/test set splitting, in this case chosen to be 70/20/10.
View the code on Gist.
Multivariable Regression
To best understand variable importance and interpret linear effects of the features on kWh usage, a simple multivariable regression model was fit using h2o’s glm function with default parameters (no regularization). A variable importance plot (see below) ranks the top 15 features by z-statistic, all of which are quite large – an observed z-score of 15 translates to a p-value of 9.63e-54, or virtually zero. Hence all 15 variables show extremely significant statistical evidence of a relationship to kWh, with an additional ~100 of the features (not shown) also meeting significance at the p = 0.05 threshold. Variables beginning with “hasElec” were feature engineered (as previously described), validating the approach of using domain knowledge to add or recode valuable information in new features. Coefficient estimates are denoted on the bar for each feature. We can observe that the presence of electric space heating at a residence (one-hot-encoded as a factor variable) increases yearly electricity usage at the residence by about 2,722 kWh; each additional square foot of TOTCSQFT (air-conditioned square footage, numeric) adds 1.146 kWh/year.
ElasticNet Regression with Grid Search
While the default h2o glm provided interpretability and p-values, adding regularization to the linear model enabled an increase in predictive accuracy. H2o allows flexible grid search methods to tune the main glm hyperparameters alpha (type of regularization) and lambda (degree of coefficient shrinkage). Because grid search can quickly become computationally expensive, I utilized h2o’s powerful early-stopping and random search functionality to ensure that computation time is not wasted on very small, marginal gains in validation accuracy.
View the code on Gist.
A primary purpose of grid search is to choose optimal tuning parameters according to a validation metric – in this case root-mean-squared error (RMSE) of kWh prediction on the validation set – and then train a new model using the optimal values discovered on the full dataset (no train/validation split) to maximize sample size. In 5 minutes of training, h2o fit all ~440 model combinations specified in the grid search, with the best-RMSE model having alpha = 1.0 (LASSO regression) and lambda = 1.0 (a small amount of regularization). These parameters were then used to fit a final model on 90% of the data (combining the 70% train and 20% validation sets), and performance was evaluated on the 10% holdout test data, which had not yet been seen by the model.
Gradient Boosted Machine
To increase predictive accuracy over generalized-linear-model-based approaches, I used h2o’s implementation of gradient boosted machines. Boosting, a non-linear tree-based approach, sequentially fits many shallow decision trees to slowly reduce overall predictive error (in the regression setting, often RMSE). Grid search was performed using the same methods as above and with the following tuning parameters:
 learn_rate – how much error should each tree “remember” from the previous
 max_depth – max tree depth (rule of thumb – sqrt of # of features)
 sample_rate – percentage of rows/observations to be chosen to fit a given tree
 col_sample_rate – percentage of columns to be chosen to fit a given tree
Following the same process as with GLM, parameter values were then chosen from the trained model with the best validation set RMSE. The model was retrained on the full 90% of training data and tested on the final 10% holdout split.
Results from the two modeling strategies are summarized in the table below.
Additionally, predicted vs. actual kWh values for each modeling strategy are plotted against the most significant numeric feature, CDD30YR (30-year average of cooling degree-days in the residence’s region). Both model types demonstrate reasonable ability to predict kWh over a range of annual cooling loads/climate regions.
Conclusions

Final predictive accuracy was similar for both the linear-based (GLM) and tree-based (GBM) models on the RECS dataset. The two models’ RMSE values represent a large improvement over an RMSE of 7,560, which was established as a baseline accuracy by initially fitting a default GLM on the raw dataset (before any feature engineering or reduction). This improvement validates the approaches used – using domain knowledge to feature engineer highly significant variables and grid search for hyperparameter tuning.
Aside from the regression setting, future analysis could be applied to RECS in a classification context. Since the majority of the dataset’s features are factor variables, energy providers could use the rich training set to attempt to answer important energyrelated classification questions about their customers – type of space heating, presence of major appliances, or whether the home is a good candidate for solar installation. Greater predictive ability over residential electricity use will contribute over time to a smarter, cleaner, and more efficient power grid.
The post U.S. Residential Energy Use: Machine Learning on the RECS Dataset appeared first on NYC Data Science Academy Blog.
To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.
Complete Subset Regressions, simple and powerful
(This article was first published on R – insightR, and kindly contributed to R-bloggers)
By Gabriel Vasconcelos

The complete subset regressions (CSR) is a forecasting method proposed by Elliott, Gargano and Timmermann in 2013. It is a very simple but powerful technique. Suppose you have a set of variables and you want to forecast one of them using information from the others. If your variables are highly correlated and the variable you want to predict is noisy, you will have collinearity problems and in-sample overfitting because the model will try to fit the noise.
These problems may be solved if you estimate a smaller model using only a subset of the explanatory variables; however, if you do not know which variables are important, you may lose information. What if we estimate models from many different subsets and combine their forecasts? Even better, what if we estimate models for all possible combinations of variables? This would be a good solution, but with only 20 variables the number of regressions would already be more than 1 million. The CSR is a compromise between using only one subset and using all possible subsets. Instead of estimating all possible combinations of models, we fix a value k and estimate all possible models with k variables. Then we combine the forecasts from all these models to get our final result. Naturally, k must be smaller than the number of variables you have. Let us review the steps:
 Suppose you have K explanatory variables. Estimate all possible models with k of those variables,
 Compute the forecasts for all the models,
 Compute the average of all the forecasts to have the final result.
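To see why fixing k matters, compare the number of models involved: with K = 20 variables there are over a million non-empty subsets, but only a few thousand subsets of exactly k = 4 variables. A quick Python check of this arithmetic (the post's own implementation below is in R):

```python
from math import comb

K = 20
all_subsets = 2 ** K - 1   # every non-empty subset of the 20 variables
fixed_k = comb(K, 4)       # CSR with exactly k = 4 variables per model
print(all_subsets, fixed_k)
```

Fixing k reduces more than a million regressions to 4,845, which is why the CSR stays computationally feasible.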
We are going to generate data from a linear model where the explanatory variables, X, are drawn from a multivariate normal distribution. The coefficients are generated from a normal distribution and multiplied by a parameter alpha to ensure the dependent variable has a significant random component. The value for K is 10 and the CSR is estimated with k = 4.
set.seed(1) # = Seed for replication = #
K = 10  # = Number of Variables = #
T = 300 # = Number of Observations = #

# = Generate covariance matrix = #
library(mvtnorm)
D = diag(0.1, K)
P = matrix(rnorm(K * K), K, K)
sigma = t(P) %*% D %*% P

alpha = 0.1
beta = rnorm(K) * alpha # = coefficients = #

X = rmvnorm(T, sigma = sigma) # = Explanatory Variables = #
u = rnorm(T) # = Error = #
y = X %*% beta + u # = Generate y = #

# = Break data into in-sample and out-of-sample = #
y.in = y[1:200]
y.out = y[-c(1:200)]
X.in = X[1:200, ]
X.out = X[-c(1:200), ]

# = Estimate model by OLS to compare = #
model = lm(y.in ~ X.in)
pred = cbind(1, X.out) %*% coef(model)

# = CSR = #
k = 4
models = combn(K, k) # = Gives all combinations with 4 variables = #
csr.fitted = rep(0, length(y.in))  # = Store in-sample fitted values = #
csr.pred = rep(0, length(y.out))   # = Store forecasts = #
for(i in 1:ncol(models)){
  m = lm(y.in ~ X.in[, models[, i]])
  csr.fitted = csr.fitted + fitted(m)
  csr.pred = csr.pred + cbind(1, X.out[, models[, i]]) %*% coef(m)
}

R2ols = 1 - var(y.in - fitted(model))/var(y.in)           # = R2 OLS = #
R2csr = 1 - var(y.in - csr.fitted/ncol(models))/var(y.in) # = R2 CSR = #
c(R2ols, R2csr)
## [1] 0.1815733 0.1461342

# = In-sample fit = #
plot(y.in, type = "l")
lines(fitted(model), col = 2)
lines(csr.fitted/ncol(models), col = 4)

# = Out-of-sample fit = #
plot(y.out, type = "l")
lines(pred, col = 2)
lines(csr.pred/ncol(models), col = 4)

# = MAPE = #
MAPEols = mean(abs(y.out - pred))
MAPEcsr = mean(abs(y.out - csr.pred/ncol(models)))
c(MAPEols, MAPEcsr)
## [1] 0.8820019 0.8446682

The main conclusion from the results is that the CSR gives up some in-sample performance to improve the forecasts. In fact, the CSR is very robust to overfitting and you should definitely add it to your collection to use when you believe you are facing this type of problem. Most modern forecasting techniques share the same idea of accepting some in-sample bias to obtain a more parsimonious or stable model out-of-sample.
This is the simplest possible implementation of the CSR. In some cases you may have fixed controls you want to include in all regressions, such as lags of the dependent variable. Note also that the computational cost of the CSR increases quickly with the number of variables; if you are in a high-dimensional setting you may need some type of pre-selection of the variables to reduce the problem's size.
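One way to handle fixed controls is to prepend the control column to every subset inside the loop. The sketch below illustrates this on freshly simulated data; the object names and the choice of variable 1 as the control are my own illustration, not from the original post.

```r
# Sketch: CSR where column 1 of X is a fixed control included in every
# subset regression (illustrative names, simulated data).
set.seed(1)
K <- 10; T <- 300
X <- matrix(rnorm(T * K), T, K)
y <- X %*% (rnorm(K) * 0.1) + rnorm(T)
y.in <- y[1:200];       y.out <- y[-c(1:200)]
X.in <- X[1:200, ];     X.out <- X[-c(1:200), ]

k <- 3
free <- 2:K                      # candidate variables; column 1 is the control
models <- combn(free, k)
csr.pred <- rep(0, length(y.out))
for (i in 1:ncol(models)) {
  vars <- c(1, models[, i])      # always prepend the fixed control
  m <- lm(y.in ~ X.in[, vars])
  csr.pred <- csr.pred + cbind(1, X.out[, vars]) %*% coef(m)
}
csr.pred <- csr.pred / ncol(models)  # average over all subset forecasts
```

The same pattern works for several controls: keep them in a fixed index vector and concatenate it with each combination drawn from the remaining columns.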
References

Elliott, Graham, Antonio Gargano, and Allan Timmermann. "Complete subset regressions." Journal of Econometrics 177.2 (2013): 357-373.
To leave a comment for the author, please follow the link and comment on their blog: R – insightR. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Euler Problem 23: Non-Abundant Sums
(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)
Euler problem 23 asks us to solve a problem with abundant or excessive numbers: numbers for which the sum of their proper divisors is greater than the number itself.
12 is an abundant number because the sum of its proper divisors (the aliquot sum) is larger than 12: (1 + 2 + 3 + 4 + 6 = 16).
All highly composite numbers or antiprimes greater than six are abundant numbers. These are numbers that have so many divisors that they are considered the opposite of primes, as explained in the Numberphile video below.
Euler Problem 23 Definition

A perfect number is a number for which the sum of its proper divisors is exactly equal to the number. For example, the sum of the proper divisors of 28 would be 1 + 2 + 4 + 7 + 14 = 28, which means that 28 is a perfect number.
A number n is called deficient if the sum of its proper divisors is less than n and it is called abundant if this sum exceeds n.
As 12 is the smallest abundant number, 1 + 2 + 3 + 4 + 6 = 16, the smallest number that can be written as the sum of two abundant numbers is 24. By mathematical analysis, it can be shown that all integers greater than 28123 can be written as the sum of two abundant numbers. However, this upper limit cannot be reduced any further by analysis, even though it is known that the greatest number that cannot be expressed as the sum of two abundant numbers is less than this limit.
Solution

This solution repurposes the proper.divisors function that determines the divisors of a number, introduced for Euler Problem 21. The first code snippet creates the sequence of all abundant numbers up to 28123 (sequence A005101 in the OEIS). An abundant number is one whose aliquot sum is larger than the number n itself.
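The proper.divisors helper from Problem 21 is referenced but not shown here. A minimal sketch consistent with the snippet below (which subtracts n to get the aliquot sum, so the helper must return every divisor of n, including n itself) could look like this; the Problem 21 version may differ:

```r
# Minimal sketch of the proper.divisors helper assumed by the snippet below:
# returns every divisor of n, including n itself, which is why the snippet
# subtracts n to obtain the aliquot sum.
proper.divisors <- function(n) {
  d <- 1:n
  d[n %% d == 0]
}
sum(proper.divisors(12)) - 12  # aliquot sum of 12: 1 + 2 + 3 + 4 + 6 = 16
```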
# Generate abundant numbers (OEIS A005101)
A005101 <- function(x){
  abundant <- vector()
  a <- 1
  for (n in 1:x) {
    aliquot.sum <- sum(proper.divisors(n)) - n
    if (aliquot.sum > n) {
      abundant[a] <- n
      a <- a + 1
    }
  }
  return(abundant)
}
abundant <- A005101(28123)

The solution to this problem is also a sequence in the Online Encyclopedia of Integer Sequences (OEIS A048242). This page states that the highest number in this sequence is 20161, not 28123 as stated in the problem definition.
The second section of code creates a list of all candidate numbers that might not be the sum of two abundant numbers. The next bit of code sieves out every sum of two abundant numbers from the list. The answer is determined by adding the remaining numbers in the sequence.
# Create a list of potential numbers that are not the sum of two abundant numbers
A048242 <- 1:20161

# Remove any number that is the sum of two abundant numbers
for (i in 1:length(abundant)) {
  for (j in i:length(abundant)) {
    if (abundant[i] + abundant[j] <= 20161) {
      A048242[abundant[i] + abundant[j]] <- NA
    }
  }
}
A048242 <- A048242[!is.na(A048242)]
answer <- sum(A048242)
print(answer)

The post Euler Problem 23: Non-Abundant Sums appeared first on The Devil is in the Data.
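The double loop can also be vectorized with outer(), which builds all pairwise sums at once. The sketch below demonstrates the idea on a small limit (100) so the outer() matrix stays tiny; the helper is redefined here (my own illustration) so the snippet is self-contained.

```r
# Vectorized sieve of sums of two abundant numbers, on a small limit.
# aliquot.sum is an illustrative helper, not the post's proper.divisors.
aliquot.sum <- function(n) {
  d <- 1:(n %/% 2)        # a proper divisor of n is at most n/2
  sum(d[n %% d == 0])
}
limit <- 100
abundant <- Filter(function(n) aliquot.sum(n) > n, 2:limit)
sums <- unique(as.vector(outer(abundant, abundant, `+`)))
not.sum <- setdiff(1:limit, sums)   # numbers not a sum of two abundant numbers
```

For the full limit of 20161 the outer() matrix is large but still manageable; the loop version trades memory for time.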
To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.
To eat or not to eat! That’s the question? Measuring the association between categorical variables
(This article was first published on Stories Data Speak, and kindly contributed to R-bloggers)
1. Introduction

I serve as a reviewer for several ISI- and Scopus-indexed journals in Information Technology. Recently, I was reviewing an article in which the researchers had made a critical mistake in data analysis: they converted the original categorical data to continuous data without any rigorous statistical treatment, and without any justification for the resulting loss of information. My motivation for this study is borne out of their error.
We know the standard association measure between continuous variables is the product-moment correlation coefficient introduced by Karl Pearson. This measure captures the degree of linear association between continuous variables, is normalized to lie between -1 and +1, and is symmetric: the correlation between variables x and y is the same as that between y and x. The best-known association measure between two categorical variables is probably the chi-square measure, also introduced by Karl Pearson. Like the product-moment correlation coefficient, this association measure is symmetric, but it is not normalized. This lack of normalization provides one motivation for Cramer's V, defined as the square root of a normalized chi-square value; the resulting association measure varies between 0 and 1 and is conveniently available via the assocstats function in the vcd package. An interesting alternative to Cramer's V is Goodman and Kruskal's tau, which is not nearly as well known and is asymmetric. This asymmetry arises because the tau measure is based on the fraction of variability in the categorical variable y that can be explained by the categorical variable x. [1]
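The normalization behind Cramer's V can be made concrete in a few lines of base R: V = sqrt(chi2 / (n * (min(r, c) - 1))) for an r x c table, which is the quantity vcd::assocstats reports. This is a sketch on a made-up table, not part of the original analysis.

```r
# Cramer's V from a contingency table, computed with base R.
cramers.v <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n <- sum(tab)
  sqrt(as.numeric(chi2) / (n * (min(dim(tab)) - 1)))
}

tab <- matrix(c(20, 5, 10, 15, 5, 20), nrow = 2)  # hypothetical 2x3 counts
cramers.v(tab)                 # somewhere strictly between 0 and 1
cramers.v(diag(c(10, 10)))     # exactly 1 for a perfectly associated table
```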
The data for this study is sourced from the UCI Machine Learning repository. As it states in the data information section, "This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom."
Furthermore, the possible research questions I want to explore are:
- Is a significance test enough to justify a hypothesis?
- How do we measure associations between categorical predictors?
As a first step, I imported the data into the R environment:

# Import data from UCI ML repo
> theURL <- "http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
# Explicitly adding the column headers from the data dictionary
> mushroom.data <- read.csv(file = theURL, header = FALSE, sep = ",",
                            strip.white = TRUE, stringsAsFactors = TRUE,
                            col.names = c("class", "cap-shape", "cap-surface", "cap-color", "bruises",
                                          "odor", "gill-attachment", "gill-spacing", "gill-size",
                                          "gill-color", "stalk-shape", "stalk-root", "stalk-surface-above-ring",
                                          "stalk-surface-below-ring", "stalk-color-above-ring", "stalk-color-below-ring",
                                          "veil-type", "veil-color", "ring-number", "ring-type", "spore-print-color",
                                          "population", "habitat"))

Next, I quickly summarize the dataset to get a brief glimpse. Readers should note that the data has no missing values.
# Calculate number of levels for each variable
> mushroom.data.levels <- cbind.data.frame(Variable = names(mushroom.data),
      Total_Levels = sapply(mushroom.data, function(x){ as.numeric(length(levels(x))) }))
> print(mushroom.data.levels)
                                         Variable Total_Levels
class                                       class            2
cap.shape                               cap.shape            5
cap.surface                           cap.surface            3
cap.color                               cap.color           10
bruises                                   bruises            2
odor                                         odor            8
gill.attachment                   gill.attachment            1
gill.spacing                         gill.spacing            2
gill.size                               gill.size            3
gill.color                             gill.color           10
stalk.shape                           stalk.shape            3
stalk.root                             stalk.root            6
stalk.surface.above.ring stalk.surface.above.ring            4
stalk.surface.below.ring stalk.surface.below.ring            5
stalk.color.above.ring     stalk.color.above.ring            7
stalk.color.below.ring     stalk.color.below.ring            8
veil.type                               veil.type            2
veil.color                             veil.color            2
ring.number                           ring.number            3
ring.type                               ring.type            5
spore.print.color               spore.print.color            7
population                             population            7
habitat                                   habitat            8

As we can see, the variable gill.attachment has just one level, and it makes no significant contribution to the target class, so we drop it.
# dropping variable with constant variance
> mushroom.data$gill.attachment <- NULL

The different levels are uninterpretable in their current format. I will use the data dictionary to recode the levels into meaningful names.
> levels(mushroom.data$class) <- c("edible", "poisonous")
> levels(mushroom.data$cap.shape) <- c("bell", "conical", "flat", "knobbed", "sunken", "convex")
> levels(mushroom.data$cap.surface) <- c("fibrous", "grooves", "smooth", "scaly")
> levels(mushroom.data$cap.color) <- c("buff", "cinnamon", "red", "gray", "brown", "pink", "green", "purple", "white", "yellow")
> levels(mushroom.data$bruises) <- c("bruisesno", "bruisesyes")
> levels(mushroom.data$odor) <- c("almond", "creosote", "foul", "anise", "musty", "nosmell", "pungent", "spicy", "fishy")
> levels(mushroom.data$gill.attachment) <- c("attached", "free")
> levels(mushroom.data$gill.spacing) <- c("close", "crowded")
> levels(mushroom.data$gill.size) <- c("broad", "narrow")
> levels(mushroom.data$gill.color) <- c("buff", "red", "gray", "chocolate", "black", "brown", "orange", "pink", "green", "purple", "white", "yellow")
> levels(mushroom.data$stalk.shape) <- c("enlarging", "tapering")
> table(mushroom.data$stalk.root)  # has a missing level coded as ?

    ?     b     c     e     r
 2480  3776   556  1120   192

> levels(mushroom.data$stalk.root) <- c("missing", "bulbous", "club", "equal", "rooted")
> levels(mushroom.data$stalk.surface.above.ring) <- c("fibrous", "silky", "smooth", "scaly")
> levels(mushroom.data$stalk.surface.below.ring) <- c("fibrous", "silky", "smooth", "scaly")
> levels(mushroom.data$stalk.color.above.ring) <- c("buff", "cinnamon", "red", "gray", "brown", "orange", "pink", "white", "yellow")
> levels(mushroom.data$stalk.color.below.ring) <- c("buff", "cinnamon", "red", "gray", "brown", "orange", "pink", "white", "yellow")
> levels(mushroom.data$veil.type) <- c("partial")
> levels(mushroom.data$veil.color) <- c("brown", "orange", "white", "yellow")
> levels(mushroom.data$ring.number) <- c("none", "one", "two")
> levels(mushroom.data$ring.type) <- c("evanescent", "flaring", "large", "none", "pendant")
> levels(mushroom.data$spore.print.color) <- c("buff", "chocolate", "black", "brown", "orange", "green", "purple", "white", "yellow")
> levels(mushroom.data$population) <- c("abundant", "clustered", "numerous", "scattered", "several", "solitary")
> levels(mushroom.data$habitat) <- c("woods", "grasses", "leaves", "meadows", "paths", "urban", "waste")

3. Initial data visualization

a. Is there a relationship between cap-surface and cap-shape of a mushroom?

> library(ggplot2)
> p <- ggplot(data = mushroom.data, aes(x = cap.shape, y = cap.surface, color = class))
> p + geom_jitter(alpha = 0.3) +
    scale_color_manual(breaks = c('edible', 'poisonous'), values = c('darkgreen', 'red'))

Fig 1: Mushroom cap-shape and cap-surface
From Fig 1 we can easily see that mushrooms with a flat cap-shape and a scaly, smooth or fibrous cap-surface are poisonous, while mushrooms with a bell, knobbed or sunken cap-shape that are fibrous, smooth or scaly are edible. A majority of flat cap-shaped mushrooms with a scaly or smooth cap-surface are poisonous.
b. Are mushroom habitat and population related?

> p <- ggplot(data = mushroom.data, aes(x = population, y = habitat, color = class))
> p + geom_jitter(alpha = 0.3) +
    scale_color_manual(breaks = c('edible', 'poisonous'), values = c('darkgreen', 'red'))

Fig 2: Mushroom population and habitat
From Fig 2, we see that mushrooms which are clustered or scattered in population and living in woods are entirely poisonous, while those that live in grasses, wasteland, meadows, leaves, paths and urban areas are edible.
c. What's the deal with living conditions and odor?

> p <- ggplot(data = mushroom.data, aes(x = habitat, y = odor, color = class))
> p + geom_jitter(alpha = 0.3) +
    scale_color_manual(breaks = c('edible', 'poisonous'), values = c('darkgreen', 'red'))

Fig 3: Mushroom habitat and odor
From Fig 3, we notice that mushrooms with a fishy, spicy, pungent, foul, musty or creosote odor are clearly marked poisonous irrespective of their habitat, whereas the ones with an almond or anise odor are edible. We also notice that a minority of the no-smell mushrooms are poisonous, while the no-smell ones living in meadows are entirely poisonous.
Although there could be many other pretty visualizations, I will leave those as future work. I will now focus on exploratory data analysis.
4. Exploratory data analysis

a. Correlation detection & treatment for categorical predictors

If we look at the structure of the dataset, we notice that each variable has several factor levels. Moreover, these levels are unordered. Such unordered categorical variables are termed nominal variables. Ordered categorical variables, by contrast, are called ordinal variables.
"In the measurement hierarchy, interval variables are highest, ordinal variables are next, and nominal variables are lowest. Statistical methods for variables of one type can also be used with variables at higher levels but not at lower levels." (see Agresti)
I found this cheatsheet that can aid in determining the right kind of test to perform on categorical predictors (independent/explanatory variables). This SO post is also very helpful; see the answer by user gung.
For categorical variables, the concept of correlation can be understood in terms of a significance test and an effect size (strength of association).
The Pearson chi-squared test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data. Given two categorical random variables, X and Y, the chi-squared test of independence determines whether or not there exists a statistical dependence between them. The test assumes a null hypothesis and an alternate hypothesis, and the general practice is that if the p-value in the result is less than a predetermined significance level (usually 0.05), we reject the null hypothesis.
H0: The two variables are independent.

H1: The two variables are dependent.

In other words, the null hypothesis of the chi-squared test is that the two variables are independent and the alternate hypothesis is that they are related.
To establish that two categorical variables (or predictors) are dependent, the chi-squared statistic must exceed a certain cutoff. This cutoff increases as the number of levels within the variables increases.
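This growth of the cutoff can be seen directly with qchisq: the 5% critical value rises with the degrees of freedom, df = (r - 1) * (c - 1) for an r x c table. A quick illustration (df values chosen for this sketch; df = 15 matches the 6 x 4 cap.shape by cap.surface table tested below):

```r
# 5% critical values of the chi-squared distribution grow with df,
# i.e. with the number of levels in the two variables.
df <- c(1, 4, 15, 48)        # e.g. 2x2, 3x3, 6x4, and 7x9 tables
round(qchisq(0.95, df), 2)   # 3.84  9.49  25.00  65.17
```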
In sections 3a, 3b and 3c, I detected possible indications of dependency between variables by visualizing the predictors of interest. In this section, I will test how well those dependencies hold up. First, I will apply the chi-squared test of independence to check whether the dependency is significant. Thereafter, I will apply Goodman and Kruskal's tau to measure the effect size (strength of association).
i. Pearson's chi-squared test of independence (significance test)

> chisq.test(mushroom.data$cap.shape, mushroom.data$cap.surface, correct = FALSE)

        Pearson's Chi-squared test

data:  mushroom.data$cap.shape and mushroom.data$cap.surface
X-squared = 1011.5, df = 15, p-value < 2.2e-16

Since the p-value < 2.2e-16 is less than the cutoff of 0.05, we reject the null hypothesis in favor of the alternative and conclude that the variables cap.shape and cap.surface are dependent on each other.
> chisq.test(mushroom.data$habitat, mushroom.data$odor, correct = FALSE)

        Pearson's Chi-squared test

data:  mushroom.data$habitat and mushroom.data$odor
X-squared = 6675.1, df = 48, p-value < 2.2e-16

Similarly, the variables habitat and odor are dependent on each other, as the p-value < 2.2e-16 is less than the cutoff of 0.05.
ii. Effect size (strength of association)

A measure of association does not indicate causality, but association: that is, whether a variable is associated with another variable. It also indicates the strength of the relationship, whether weak or strong.
Since I'm dealing with nominal categorical predictors, the Goodman and Kruskal tau measure is appropriate. Interested readers are invited to see pages 68 and 69 of the Agresti book. More information on this test can be seen here.
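To make the "fraction of variability in y explained by x" idea concrete, here is a hand-rolled sketch of the forward tau(x -> y) using the Gini variation; the GoodmanKruskal package used below does this (and more) for you, so this is for intuition only.

```r
# Goodman and Kruskal's forward tau(x -> y): proportional reduction in the
# Gini variation of y when x is known. Illustrative base-R sketch.
gk.tau <- function(x, y) {
  tab <- table(x, y)
  n <- sum(tab)
  p.y <- colSums(tab) / n
  v.y <- 1 - sum(p.y^2)                 # marginal Gini variation of y
  p.x <- rowSums(tab) / n
  v.y.given.x <- sum(p.x * apply(tab, 1, function(r) {
    if (sum(r) == 0) return(0)
    1 - sum((r / sum(r))^2)             # variation of y within this level of x
  }))
  (v.y - v.y.given.x) / v.y
}

# Perfect prediction of y from x gives tau = 1:
gk.tau(rep(c("a", "b"), each = 5), rep(c("u", "v"), each = 5))
```

The asymmetry is built in: swapping the roles of x and y changes which conditional distributions are averaged, so tau(x, y) and tau(y, x) generally differ.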
> library(GoodmanKruskal)
> varset1 <- c("cap.shape", "cap.surface", "habitat", "odor", "class")
> mushroomFrame1 <- subset(mushroom.data, select = varset1)
> GKmatrix1 <- GKtauDataframe(mushroomFrame1)
> plot(GKmatrix1, corrColors = "blue")

In Fig 4, I have shown the association plot. This plot is based on the corrplot library. The diagonal element K refers to the number of unique levels for each variable. The off-diagonal elements contain the forward and backward tau measures for each variable pair. Specifically, the numerical values appearing in each row represent the association measure τ(x, y) from the variable x indicated in the row name to the variable y indicated in the column name.
The most obvious feature of this plot is that the variable class is almost perfectly predictable from odor (τ(x, y) = 0.94), and this forward association is quite strong. It suggests that x = odor (with levels "almond", "creosote", "foul", "anise", "musty", "nosmell", "pungent", "spicy", "fishy") is highly predictive of y = class (with levels "edible", "poisonous"). This association is strong and indicates that if we know a mushroom's odor, then we can easily predict whether it is edible or poisonous.
On the contrary, the reverse association, from y = class to x = odor, is considerably weaker (τ(y, x) = 0.34): knowing whether a mushroom is edible or poisonous tells us much less about its odor.
Earlier we found that cap.shape and cap.surface are dependent on each other (chi-squared significance test). Now let's see whether the association is also strong. Again from Fig 4, both the forward and reverse associations suggest that x = cap.shape is weakly associated with y = cap.surface (τ(x, y) = 0.03 and τ(y, x) = 0.01). Thus, we can safely say that although the dependency between these two variables is statistically significant, their association is weak; it will be difficult to predict one from the other.
Many more associations can be interpreted from Fig 4 in a similar fashion. I invite interested readers to explore it further.
5. Conclusion

The primary objective of this study was to drive home the message: do not tamper with the data without providing a credible justification. The reason I chose categorical data for this study was to provide an in-depth treatment of the various measures that can be applied to it. From my prior readings of statistical texts, I recalled that a significance test alone was not justification enough; there had to be something more. It was then that I found out about the different types of association measures, and they cleared my doubts. In my next post, I will continue the current work by providing inferential and predictive analysis. For interested readers, I have uploaded the complete code to my GitHub repository here.
To leave a comment for the author, please follow the link and comment on their blog: Stories Data Speak.
My new DataCamp course: Forecasting Using R
(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)
For the past few months I’ve been working on a new DataCamp course teaching Forecasting using R. I’m delighted that it is now available for anyone to do.
Course blurb

Forecasting involves making predictions about the future. It is required in many situations: deciding whether to build another power generation plant in the next ten years requires forecasts of future demand; scheduling staff in a call center next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements.
To leave a comment for the author, please follow the link and comment on their blog: R on Rob J Hyndman.
let there be progress
(This article was first published on HighlandR, and kindly contributed to R-bloggers)
The 'wrapr' package for use with dplyr programming

UPDATED POST

I'm the first to admit I'm not an R expert (even duffers can blog about it, though), but when I began thinking about writing some dplyr functions to help me create and analyse run charts, I had no idea that I was going to struggle quite so much getting to grips with Non-Standard Evaluation and Lazy Evaluation (see my last post for links and more background).
To clarify, I wanted to create flexible dplyr functions, without hardcoded grouping parameters, so I could query different SQL Server tables and apply the same functions reliably to transform and plot the data.
Eventually I sussed things out and created the functions I needed, but it was yet another example of how R makes me feel really stupid on a regular basis.
Please tell me I’m not alone with that…
I’m aware that dplyr 0.6 is imminent, and that my code will need revising as the dplyr verbs with underscores (for programmatic use) will be deprecated (at least eventually).
While I could probably install the dev version, I thought I should finally take a look at the let function in the wrapr package, as I'd seen numerous tweets giving it positive feedback.
I had a bash at recreating one of my functions – got some help directly from winvector and was then able to rewrite all my remaining functions quickly.
I’ve put the code gist here – should hopefully run as intended, assuming you have the current dplyr 0.5 and wrapr packages installed.
Points to note

- I commented out a line in the wrapr let example code for this demo so that all three return the same output.
- There is some nice code to handle the lack of a grouping column, courtesy of Win-Vector. For my real-world use there will always be a grouping variable, but it's nice to build this in for future requirements.
- I used slice instead of filter in the NSE code, as I could not get it to work at all otherwise. I'm pretty convinced I've done something wrong here, because the function only runs if I don't wrap the variables in quotes when I call it. For the life of me I couldn't figure out why a grouping call would work in one NSE function and fail in another when the same syntax was being used. Bearing in mind that the work I actually WANTED to do was going to be complex enough, I was just happy that it worked.
Day-to-day, interactive dplyr code:

My attempt at the whole NSE/lazy eval thing:

Wrapr code:
With the let function you create the mapping block first, then use the variables there in your function, and you can use the regular dplyr verbs. Once I'd seen how to set up the code, it was easy to recreate the others.
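The mapping-block pattern looks roughly like the sketch below. The function and column names here are hypothetical illustrations, not from the post's gist; it assumes dplyr (0.5 or later) and wrapr are installed.

```r
# Sketch of the wrapr::let() pattern: map an all-caps placeholder to the
# real column name, then write ordinary dplyr code using the placeholder.
library(dplyr)
library(wrapr)

count_by <- function(df, groupvar) {
  let(c(GROUPVAR = groupvar),          # mapping block: placeholder -> column
      df %>%
        group_by(GROUPVAR) %>%
        summarise(n = n()))
}

count_by(mtcars, "cyl")                # same as group_by(cyl) %>% summarise(n = n())
```

Because let() works by substituting names before evaluation, the same function body keeps working across dplyr versions, which is exactly the appeal described above.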
Outputs (in same order as above code examples):
Here’s another example of using the underscore versions compared to wrapr.
This time the NSE code works with quoted variables:
And here is the wrapr equivalent using let:
From a real-world deadlines and getting-things-done perspective: if you need to create flexible dplyr functions, and you need to do it soon, take a look at wrapr to see if it fits your needs.
To leave a comment for the author, please follow the link and comment on their blog: HighlandR.
Shiny: data presentation with an extra
(This article was first published on R-posts.com, and kindly contributed to R-bloggers)
A Shiny app with three tabs presenting different sections of the same data.
Shiny is an application based on R/RStudio which enables an interactive exploration of data through a dashboard with dropdown lists and checkboxes, programming-free. The apps can be useful for both the data analyst and the public.
Shiny apps are based on the Internet: This allows for private consultation of the data on one’s own browser as well as for online publication. Free apps can handle quite a lot of data, which can be increased with a subscription.
The target user of Shiny is extremely broad. Let’s take science—open science. At a time when openly archiving all data is becoming standard practice (e.g., OSF.io, Figshare.com, Github.com), Shiny can be used to walk the extra mile by letting people tour the data at once without programming. It’s the right tool for acknowledging all aspects of the data. Needless to say, these apps do not replace raw data archiving.
The apps simply add. For instance, the data in the lower app is a little noisy, right? Indeed it shows none of the succession of waves that characterizes word reading. The app helped me in identifying this issue. Instead of running along a host of participants and electrodes through a heavy code score, I found that the dropdown lists of the app let me seamlessly surf the data. By Participant 7, though, my wave dropped me…
Those data were very poor—systematically poorer than those of any other participants. I then checked the EEG preprocessing logs, and confirmed that those data had to be discarded. So much for the analysis utility of such an app. On the open science utility, what I did on discovering the fault was maintain the discarded data in the app, with a note, so that any colleagues and reviewers could consider it too. Now, although this example of use concerns a rather salient trait in the data, some other Shiny app might help to spot patterns such as individual differences, third variables.
Building a Shiny app is not difficult. Apps basically draw on some data presentation code (tables, plots) that you already have. Then just add a couple of scripts into the folder: one for the user interface (named ui.R), one for the process (named server.R), and perhaps another one compiling the commands for deploying the app and checking any errors.
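The ui/server split can be sketched in a single file for illustration (a real deployment would put the two objects in ui.R and server.R as described above; the input and output names here are made up):

```r
# Minimal single-file sketch of a Shiny app: a dropdown drives a plot.
library(shiny)

ui <- fluidPage(
  selectInput("var", "Variable:", choices = names(mtcars)),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot(hist(mtcars[[input$var]], main = input$var))
}

# shinyApp(ui, server)   # launches the app; left commented for non-interactive use
```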
1: Tutorials. Being open-source software, the best manuals are available through a Google search.
2: User token. Signing up and reading in your private key—just once.
3: GO. Plunge into the ui and server scripts, and deployApp().
4: Bugs and logs. They are not bugs in fact, rather fancies. For instance, some special characters have to get even more special (technically, UTF-8 encoding). For a character such as μ, Shiny prefers Âμ. Just cling to error logs by calling:
showLogs(appPath = getwd(), appFile = NULL, appName = NULL, account = NULL,
         entries = 50, streaming = FALSE)

The log output will mention any typos and unaccepted characters, pointing to specific lines in your code.
It may take a couple of intense days to get a first app running. As usual with programming, it’s easy to run into the traps which are there to spice up the way. The app’s been around for years, so tips and tricks abound on the Internet. For greater companionship, there are dedicated Google groups, and then there’s StackExchange etc., where you can post any needs/despair. Post your code, log, and explanation, and you’ll be rescued out in a couple of days. Long live those contributors.
It will often be enough to upload a bare app, but you might then think it can look better.
5 (optional): Pro up. Use tabs to combine multiple apps in one, use different widgets, etc. Tutorials like this one on Youtube can take you there, especially those that provide the code, as in the description of that video. Use those scripts as templates. For example, see this script in which the function conditionalPanel() is used to modify the app’s sidebar based on which tab is selected. The utility of tabs is illustrated in the upper cover of this article and in the app shown in the text: When having multiple data sections, the tabs allow you to have all in one (cover screenshot), instead of linking to other apps in each (screenshot in text).
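The conditionalPanel() idea mentioned above can be sketched like this: the sidebar content is keyed on the id of the tabsetPanel, so each tab gets its own controls (all names here are illustrative, not from the linked script):

```r
# Sketch: a tabbed app whose sidebar changes with the selected tab,
# via conditionalPanel() conditions on the tabsetPanel id.
library(shiny)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      conditionalPanel(condition = "input.tabs == 'Plot'",
                       sliderInput("bins", "Bins:", min = 5, max = 50, value = 20)),
      conditionalPanel(condition = "input.tabs == 'Table'",
                       numericInput("rows", "Rows:", value = 10))
    ),
    mainPanel(
      tabsetPanel(id = "tabs",
                  tabPanel("Plot", plotOutput("p")),
                  tabPanel("Table", tableOutput("t")))
    )
  )
)

server <- function(input, output) {
  output$p <- renderPlot(hist(faithful$waiting, breaks = input$bins))
  output$t <- renderTable(head(faithful, input$rows))
}

# shinyApp(ui, server)   # launch interactively
```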
Time for logistics. You can include any text in your app's webpage, such as explanations of any length, web links, and author information. Also importantly, the Shiny application allows for the presentation of data in any of the forms available in R, notably plots and tables. Costs: Shiny is free to use, in principle. You may only have to pay if you use your app(s) a lot (commonly from the first month), in which case you may pay 9 euros a month. There are different subscription plans; the free plan allows 25 hours of use per month, distributed among up to five apps.
There do exist alternatives to Shiny. One, called Tableau, is fairly similar; a nice comparison of the two is available here. Then there's a more advanced application called D3.js, which allows for lush graphics but proves more restrictive for newbies.
In sum, if you already program in R or even in Python, but have not tried online data presentation yet, consider it.
Feel free to share your ideas, experiences, or doubts in a comment on the original post.
To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.
Watch presentations from R/Finance 2017
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
It was another great year for the R/Finance conference, held earlier this month in Chicago. This is normally a fairly private affair: with attendance capped at around 300 people every year, it's a somewhat exclusive gathering of the best and brightest minds from industry and academia in financial data analysis with R. But for the first time this year (and with thanks to sponsorship from Microsoft), videos of the presentations are available for viewing by everyone. I've included the complete list (copied from the R/Finance website) below, but here are a few of my favourites:
 NoBullshit Data Science. Szilard Pafka's keynote knocks down some of the myths and misconceptions about realworld data science practices.
 Risk Fast and Slow. A rare treat: in his keynote presentation Dave DeMers offers his perspectives on risk, financial engineering, and major financial events from his days the Prediction Company, Black Mesa and other investment houses.
 Syberia: A development framework for R (Robert Krzyzanowski). Syberia is a new operationalization framework for R scripts, applicable for any production workflow using R (not just Finance).
 yuimaGUI: A graphical user interface for the yuima package (Emanuele Guidotti). An impressive frontend to the Yuima project for solving stochastic differential equations.
 Zero-Revelation RegTech: Detecting Risk through Linguistic Analysis of Corporate Emails and News (Seoyoung Kim). A really interesting approach to looking at sentiment (and length!) of emails (here, the Enron corpus) to predict stock prices.
 New Tools for Performing Financial Analysis Within the 'Tidy' Ecosystem (Matt Dancho). This lightning talk shows how you can perform piping (%>%) operations with time series data from zoo and xts.
 The PE package: Modeling private equity in the 21st century (Thomas Harte). An intriguing peek into the mysterious world of private equity finance.
 Project and conquer (Bryan Lewis). A beautifully elegant introduction to the use of projections in Statistics, and a potentially revolutionary application in speeding up calculations with high-dimensional correlations and clusters.
 Detecting Fraud at 1 Million Transactions per Second (David Smith). My own presentation includes a demo of very high-frequency predictions from R models (about which I'll blog more shortly).
You can find an up-to-date version of the table below at the R/Finance website (click on the "Program" tab), and you can also browse the videos at Channel 9. Note that the lightning talk sessions (in orange) are bundled together in a single video, which you can find linked after the first talk in each session.
Friday, May 19th, 2017
09:30 – 09:35 Kickoff (video)
09:35 – 09:40 Sponsor Introduction
09:40 – 10:10 Marcelo Perlin: GetHFData: An R package for downloading and aggregating high frequency trading data from Bovespa (pdf) (video); Jeffrey Mazar: The obmodeling Package (html); Yuting Tan: Return Volatility, Market Microstructure Noise, and Institutional Investors: Evidence from High Frequency Market (pdf); Stephen Rush: Adverse Selection and Broker Execution (pdf); Jerzy Pawlowski: How Can Machines Learn to Trade? (html)
10:10 – 10:30 Michael Hirsch: Revealing High-Frequency Trading Provisions of Liquidity with Visualization in R (html) (video)
10:30 – 10:50 Eric Glass: Equity Factor Portfolio Case Study (html) (video)
10:50 – 11:10 Break
11:10 – 11:30 Seoyoung Kim: Zero-Revelation RegTech: Detecting Risk through Linguistic Analysis of Corporate Emails and News (pdf) (video)
11:30 – 12:10 Szilard Pafka: No-Bullshit Data Science (pdf) (video)
12:10 – 13:30 Lunch
13:30 – 14:00 Francesco Bianchi: Measuring Risk with Continuous Time Generalized Autoregressive Conditional Heteroscedasticity Models (pdf) (video); Eina Ooka: Bunched Random Forest in Monte Carlo Risk Simulation (pdf); Matteo Crimella: Operational Risk Stress Testing: An Empirical Comparison of Machine Learning Algorithms and Time Series Forecasting Methods (pdf); Thomas Zakrzewski: Using R for Regulatory Stress Testing Modeling (pdf); Andy Tang: How much structure is best? (pptx)
14:00 – 14:20 Robert McDonald: Ratings and Asset Allocation: An Experimental Analysis (pdf)
14:20 – 14:50 Break
14:50 – 15:10 Dries Cornilly: Nearest Comoment Estimation with Unobserved Factors and Linear Shrinkage (pdf) (video)
15:10 – 15:30 Bernhard Pfaff: R package: mcrp: Multiple criteria risk contribution optimization (pdf) (video)
15:30 – 16:00 Oliver Haynold: Practical Options Modeling with the sn Package, Fat Tails, and How to Avoid the Ultraviolet Catastrophe (pdf) (video); Shuang Zhou: A Nonparametric Estimate of the Risk-Neutral Density and Its Applications (pdf); Luis Damiano: A Quick Intro to Hidden Markov Models Applied to Stock Volatility; Oleg Bondarenko: Rearrangement Algorithm and Maximum Entropy (pdf); Xin Chen: Risk and Performance Estimator Standard Errors for Serially Correlated Returns (pdf)
16:00 – 16:20 Qiang Kou: Text analysis using Apache MxNet (pdf) (video)
16:20 – 16:40 Robert Krzyzanowski: Syberia: A development framework for R (pdf) (video)
16:40 – 16:52 Matt Dancho: New Tools for Performing Financial Analysis Within the 'Tidy' Ecosystem (pptx) (video); Leonardo Silvestri: ztsdb, a time-series DBMS for R users (pdf)

Saturday, May 20th, 2017
09:05 – 09:35 Stephen Bronder: Integrating Forecasting and Machine Learning in the mlr Framework (pdf) (video); Leopoldo Catania: Generalized Autoregressive Score Models in R: The GAS Package (pdf); Guanhao Feng: Regularizing Bayesian Predictive Regressions (pdf); Jonas Rende: partialCI: An R package for the analysis of partially cointegrated time series (pdf); Carson Sievert: Interactive visualization for multiple time series (pdf)
09:35 – 09:55 Emanuele Guidotti: yuimaGUI: A graphical user interface for the yuima package (pptx) (video)
09:55 – 10:15 Daniel Kowal: A Bayesian Multivariate Functional Dynamic Linear Model (pdf) (video)
10:15 – 10:45 Break
10:45 – 11:05 Jason Foster: Scenario Analysis of Risk Parity using RcppParallel (pdf) (video)
11:05 – 11:35 Michael Weylandt: Convex Optimization for High-Dimensional Portfolio Construction (pdf) (video); Lukas Elmiger: Risk Parity Under Parameter Uncertainty (pdf); Ilya Kipnis: Global Adaptive Asset Allocation, and the Possible End of Momentum (pptx); Vyacheslav Arbuzov: Dividend strategy: towards the efficient market (pdf); Nabil Bouamara: The Alpha and Beta of Equity Hedge UCITS Funds – Implications for Momentum Investing (pdf)
11:35 – 12:15 Dave DeMers: Risk Fast and Slow (pdf) (video)
12:15 – 13:35 Lunch
13:35 – 13:55 Matthew Dixon: MLEMVD: A R Package for Maximum Likelihood Estimation of Multivariate Diffusion Models (pdf) (video)
13:55 – 14:15 Jonathan Regenstein: Reproducible Finance with R: A Global ETF Map (html) (video)
14:15 – 14:35 David Ardia: Markov-Switching GARCH Models in R: The MSGARCH Package (pdf) (video)
14:35 – 14:55 Keven Bluteau: Forecasting Performance of Markov-Switching GARCH Models: A Large-Scale Empirical Study (pdf) (video)
14:55 – 15:07 Riccardo Porreca: Efficient, Consistent and Flexible Credit Default Simulation (pdf) (video); Maisa Aniceto: Machine Learning and the Analysis of Consumer Lending (pdf)
15:07 – 15:27 David Smith: Detecting Fraud at 1 Million Transactions per Second (pptx) (video)
15:27 – 15:50 Break
15:50 – 16:10 Thomas Harte: The PE package: Modeling private equity in the 21st century (pdf) (video)
16:10 – 16:30 Guanhao Feng: The Market for English Premier League (EPL) Odds (pdf) (video)
16:30 – 16:50 Bryan Lewis: Project and conquer (html) (video)
VIF and multicollinearity diagnostics
(This article was first published on R code – Serious Stats, and kindly contributed to R-bloggers)
In the book I use the car package to get VIF and other multicollinearity diagnostics. I’ve occasionally found this breaks down (usually through mixing different versions of R on different machines at work, at home, or on the move). I recently saw the mctest package and thought it would be useful as a backup – and also because it offers a slightly different set of diagnostics.
If you’d like to try it out, I’ve written a helper function that makes it easier to apply these diagnostics directly to a linear model. You can find the function and a simple example here.
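For readers who want to see the idea behind such a helper, here is a minimal sketch (not the author's actual function, which is at the linked page): it computes each predictor's VIF directly from a fitted lm object as 1 / (1 − R²), where R² comes from regressing that predictor on the remaining ones. The function name and threshold default are illustrative choices.

```r
# Sketch of a VIF helper working directly on a fitted lm object.
# VIF_j = 1 / (1 - R^2_j), where R^2_j is from regressing predictor j
# on the other predictors. Values above ~5-10 are commonly taken to
# indicate problematic multicollinearity.
vif_check <- function(model, threshold = 5) {
  stopifnot(inherits(model, "lm"))
  X <- model.matrix(model)[, -1, drop = FALSE]  # drop the intercept column
  vifs <- sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j, drop = FALSE]))$r.squared
    1 / (1 - r2)
  })
  names(vifs) <- colnames(X)
  data.frame(VIF = vifs, flagged = vifs > threshold)
}

# Usage on a built-in dataset:
mod <- lm(mpg ~ wt + disp + hp, data = mtcars)
vif_check(mod)
```

For models with factor terms, car::vif's generalized VIF is the safer choice; the manual version above only makes sense for numeric predictors.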
Bland-Altman/Tukey Mean-Difference Plots using ggplot2
(This article was first published on Environmental Science and Data Analytics, and kindly contributed to R-bloggers)
A very useful data visualisation tool in science, particularly in medical and sports settings, is the Bland-Altman/Tukey mean-difference plot. When comparing two sets of measurements of the same variable made by different instruments, we often need to determine whether the instruments are in agreement or not.
Correlation and linear regression can tell us something about the bivariate relationship between two sets of measurements. We can identify the strength, form and direction of a relationship, but strong correlation does not imply agreement: two instruments can be highly correlated yet systematically disagree, so this approach is not recommended for comparative analyses.
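A toy example (numbers invented purely for illustration) makes the point: a second instrument that reads exactly 10 units higher than the first is perfectly correlated with it, yet the two never agree.

```r
# Perfect correlation does not imply agreement: instrument 2 reads a
# constant 10 units higher than instrument 1.
m1 <- c(5, 10, 15, 20, 25)
m2 <- m1 + 10

cor(m1, m2)    # 1: perfectly correlated
mean(m2 - m1)  # 10: a systematic bias the correlation never reveals
```

The Bland-Altman plot, by contrast, displays exactly this bias (and its spread) directly.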
The Bland-Altman plot was first used in 1983 by J. M. Bland and D. G. Altman, who applied it to medical statistics. The Tukey mean-difference plot was one of many exploratory data visualisation tools created by John Tukey, who, interestingly, also created the beloved boxplot.
A simple implementation of the Bland-Altman/Tukey mean-difference plot in R is described below. The dataset used for this example is one of R’s built-in datasets, Loblolly, which contains data on the growth of Loblolly pine trees.
Prepare the workspace and have a look at the structure of the data.
library(dplyr)
library(ggplot2)

pine_df <- Loblolly
glimpse(pine_df)
Randomly split the data into two samples of equal length. These samples will be used as our two sets of pine tree height measurements. Arrange in order of descending height to mimic realistic dual measurements of single trees and combine into a dataframe. Note that these data are used simply to demonstrate and the measures in reality are NOT of the same tree using different instruments.
sample_df <- data.frame(
  sample_n(pine_df, size = nrow(pine_df) * 0.5) %>% select(height) %>% arrange(desc(height)),
  sample_n(pine_df, size = nrow(pine_df) * 0.5) %>% select(height) %>% arrange(desc(height))
)
Assign sensible names to the two pine tree height samples in the dataframe.
names(sample_df) <- c("Sample_1", "Sample_2")

Each row in the dataframe consists of a pair of measurements. The Bland-Altman plot puts the average of the two measures in a pair on the x-axis and the difference between the two measures in each pair on the y-axis. Add the averages and differences to the dataframe.
sample_df$Avg <- (sample_df$Sample_1 + sample_df$Sample_2) / 2
sample_df$Dif <- sample_df$Sample_1 - sample_df$Sample_2

Finally, code the plot, adding the mean difference (blue line) and the 95% limits of agreement (red lines), computed as the mean difference ± 1.96 times the standard deviation of the differences.
ggplot(sample_df, aes(x = Avg, y = Dif)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = mean(sample_df$Dif), colour = "blue", size = 0.5) +
  geom_hline(yintercept = mean(sample_df$Dif) - (1.96 * sd(sample_df$Dif)), colour = "red", size = 0.5) +
  geom_hline(yintercept = mean(sample_df$Dif) + (1.96 * sd(sample_df$Dif)), colour = "red", size = 0.5) +
  ylab("Diff. Between Measures") +
  xlab("Average Measure")
RStudio Server Pro is now available on AWS Marketplace
(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)
RStudio is excited to announce the availability of its flagship enterprise-ready integrated development environment for R in AWS Marketplace.
RStudio Server Pro AWS is identical to RStudio Server Pro, but with turnkey convenience. It comes preconfigured with multiple versions of R, common systems libraries, and the most popular R packages.
RStudio Server Pro AWS helps you adapt to your unique circumstances. It allows you to choose AWS computing instances of any size whenever a project requires it (flat hourly pricing). Or you can set up a persistent instance of RStudio Server Pro, ready to be used anytime you need it (annual pricing), avoiding the sometimes complicated process of procuring on-premises software.
If the enhanced security, elegant support for multiple R versions and multiple sessions, and commercially licensed and supported features of RStudio Server Pro appeal to you and your enterprise, consider RStudio Server Pro for AWS. It’s ready to go!
Read the FAQ
Try RStudio Server Pro AWS