greybox package for R
(This article was first published on R – Modern Forecasting, and kindly contributed to Rbloggers)
I am delighted to announce a new package on CRAN. It is called “greybox”. I know, what my American friends will say, as soon as they see the name – they will claim that there is a typo, and that it should be “a” instead of “e”. But in fact no mistake was made – I used British spelling for the name, and I totally understand that at some point I might regret this…
So, what is “greybox”? Wikipedia tells us that grey box is a model that “combines a partial theoretical structure with data to complete the model”. This means that almost any statistical model can be considered as a grey box, thus making the package potentially quite flexible and versatile.
But why do we need a new package on CRAN?
First, there were several functions in smooth package that did not belong there, and there are several functions in TStools package that can be united with a topic of model building. They focus on the multivariate regression analysis rather than on statespace models, time series smoothing or anything else. It would make more sense to find them their own home package. An example of such a function is
ro()– Rolling Origin – function that Yves and I wrote in 2016 on our way to the International Symposium on Forecasting. Arguably this function can be used not only for assessing the accuracy of forecasting models, but also for the variables / model selection.
Second, in one of my side projects, I needed to work more on the multivariate regressions, and I had several ideas I wanted to test. One of those is creating a combined multivariate regression from several models using information criteria weights. The existing implementations did not satisfy me, so I ended up writing a function
combiner()that does that. In addition, our research together with Yves Sagaert indicates that there is a nice solution for a fat regression problem (when the number of parameters is higher than the number of observations) using information criteria. Uploading those function in
smoothdid not sound right, but having a
greyboxhelps a lot. There are other ideas that I have in mind, and they don’t fit in the other packages.
Finally, I could not find satisfactory (from my point of view) packages on CRAN that would focus on multivariate model building and forecasting – the usual focus is on analysis instead (including time series analysis). The other thing is the obsession of many packages with pvalues and hypotheses testing, which was yet another motivator for me to develop a package that would be completely hypothesesfree (at 95% level). As a result, if you work with the functions from
greybox, you might notice that they produce confidence intervals instead of pvalues (because I find them more informative and useful). Finally, I needed good instruments for the promotional modelling for several projects, and it was easier to implement them myself than to compile them from different functions from different packages.
Keeping that in mind, it makes sense to briefly discuss what is already available there. I’ve already discussed how
xregExpander()and
stepwise()functions work in one of the previous posts, and these functions are now available in
greyboxinstead of
smooth. However, I have not covered either
combiner()or
ro()functions yet. While
combiner()is still under construction and works only for normal cases (fat regression can be solved, but not 100% efficiently),
ro()has worked efficiently for several years already. So I created a detailed vignette, explaining what is rolling origin, how the function works and how to use it. So, if you are interested in finding out more, check it out on CRAN.
As a wrap up,
greyboxpackage is focused on model building and forecasting and from now on will be periodically updated.
As a final note, I plan to do the following in
greyboxin future releases:
 Move nemenyi()
 Develop functions for promotional modelling;
 Write a function for multiple correlation coefficients (will be used for multicollinearity analysis);
 Implement variables selection based on rolling origin evaluation;
 Stepwise regression and combinations of models, based on Laplace and the other distributions;
 AICc for Laplace and the other distributions;
 Solve fat regression problem via combination of regression models (sounds crazy, right?);

xregTransformer
– Nonlinear transformation of the provided xreg variables;
 Other cool stuff.
If you have any thoughts on what to implement, leave a comment – I will consider your idea.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Modern Forecasting. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Survey books, courses and tools by @ellis2013nz
(This article was first published on free range statistics  R, and kindly contributed to Rbloggers)
Surveys everywhere!Surveys are everywhere, and thinking about them in different ways has dominated my work so far this year. What’s right and wrong with them, how to improve them, do we need them, can we replace this one or that one with administrative data, why do the media (and nearly everyone else) so misunderstand and misinterpret them… those sorts of questions.
Well, they work (mostly), and they’re not going anywhere fast. To take just the most obvious example, have the (unfair) complaints about the pollsters letting down the pundits in the UK and USA in 2016 led to a collapse in the polling industry? No. A quick glance at the Five Thirty Eight data download page shows that there have been more than 1,500 polls with a combined sample size of 2.3 million Americans in the past 18 months just to help people judge “Are Democrats or Republicans Winning the Race for Congress?”
In the world of official statistics they’re still indispensable in most – perhaps all – countries, despite the rise and promise of administrative data. It would be difficult to measure underemployment rates, firmlevel innovation rates, or smoking prevalence from administrative data, to pick three examples at random.
I love the thought and rigour that has gone into designing and analysing surveys, by academics, national statistical offices and (to a certain extent) market researchers. The unifying concept is total survey error
as seen in this diagram. This gives us categories like measurement error, coverage error and adjustment error without which it’s difficult even to think about whether our estimation process is working and how to improve it. The concepts evolved through the second half of the twentieth century and matured in the 1980s and 1990s; I don’t think any single author can (or does) claim ownership of the whole idea.
It’s a simple but brilliant organising framework for design, analysis and just talking about surveys, and it’s one that should be paid more attention by those working with data outside the survey field too. Even with survey professionals there’s far too little attention to sources of error other than the easilyquantified sampling error (inevitably the only acknowledgement of uncertainty quoted in reporting of opinion polls, for example); but in much analysis and discussion in the context of today’s data revolution one sees not even that. I commend the diagram on the right and its many variants to any serious analyst’s attention.
So back to this blog… I thought I’d note down good books on this I’ve read recently and some other stuff.
BooksMy university training emphasised the analysis and sample design aspects of surveys and didn’t cover in any detail issues such as questionnaire development and testing, interviewer effects, mode effects and the like. Can’t do everything. I suspect these are bits of craft more often taught on the job, although there’s no reason for this. I might be wrong about it – certainly there’s good university research going on in these areas.
In my reading this year I emphasised books about what feel to me to be the messier aspects of survey quality and measurement and nonresponse error. These include subtle changes of wording and how mode effects are changing as people change the way they relate to strangers asking them questions (whether face to face, on the phone, or online).
Paul Biemer and others (edited), Total Survey Error in Practice, 2017I gave this five stars. It’s a great (and very indepth – not for the fainthearted) collection of applications and theorising on survey quality. Covers all aspects – design, details of question wording, data collection in various modes (old and new), evaluation and analysis. Very much rooted in the real world of today, and today’s data problems including new data sources beyond the classic survey.
Don Dillman and others, Internet, Phone, Mail, and MixedMode Surveys: The Tailored Design Method, 2014.Another five star entry for me. This is the latest edition of the bible for putting these survey things together (with an emphasis on questionnaires and their delivery rather than sample design) and it’s a very thorough, well evidenced setting out of what works, what doesn’t, and how we know.
Floyd Fowler, Survey Research Methods, 2013Four stars. Nice overview, much more introductory than the other two. Would suit a busy researcher who needs good understanding of the range of the issues rather than a survey specialist.
Massive Open Online CourseI had the pleasure this year of checking out the Coursera Survey Data Collection and Analytics specialization (a collection of seven courses) with an eye to being able to recommend a general introduction for analysts to the design, management, delivery and analysis of surveys. The courses are provided via the Coursera platform, by Maryland and Michigan universities. I rate the overall specialization as four or five stars, and have rated the individual courses as three, four or five stars on Coursera. In my search for affordable online courses in this area, this was the only one that covered the material in both the range and the depth that I was looking for. Other courses available are either too superficial, or focus on just one aspect of surveys, or more commonly both.
The content is spot on, but the delivery can be heavy going – it’s more like a traditional university course than a lot of online courses. That’s a strength in part as well as an issue of course.
The courses cover all aspects of survey management, from a framework in which to plan research, through questionnaire design, the technical aspects of data collection via various modes, processing (weighting, imputation, etc), and analysis. The courses are rigorous and backed up with the latest research into survey effectiveness, including aspects such as the importance of interviewer effects in inflating the variance of estimates; the relative effectiveness of different modes of delivery in different circumstances; and modern software for analysis (Stata and R examples are provided in the later courses, but this is not a course in Stata or R).
There are no prerequisites for the course but it would be helpful to have some background in university level statistics; otherwise some of the material in courses 4, 5 and 6 on sample design, imputation and weighting, and analysis would be a difficult introduction to those topics (but not impossible). Overall the courses are rated “intermediate” and there is a fair amount of both material and rigour (inlecture quizzes, assessed quizzes, and assignments). A student new to the topics might take five or six months to complete the whole specialisation, with 2 – 6 hours of effort per week.
I strongly recommend the first six courses (and optionally the seventh “capstone” project) for anyone working handson on an important survey, particularly one that needs to meet standards of official statistics or of rigorous research. I would recommend the first course (“Framework for Data Collection and Analysis”) for managers of people working on surveys, and for people who need to use surveys in their work and need an introduction to concepts such as “Total Survey Error” and a quality framework for surveys. Other individual courses may be of interest to people with specific needs (eg questionnaire design, sampling issues).
For someone aiming to be a handson analyst of surveys in the depth covered in these course, I would recommend in parallel building general skills in R. The best way to do this will depend on learning style but I recommend either the “data analyst with R” stream on Data Camp or working through the free book R for Data Science . These do not cover surveyspecific issues but would build the background skills needed to make the most of the material in courses 5 and 6 of the survey specialization. Data Camp in particular provides an excellent platform for step by step learning and practice of technical skills; these would need to be placed in context by other training or support.
Tools Survey data collection platformsThe market of suppliers and solutions for professional questionnaire development and delivery is surprisingly crowded. There are in fact many good quality outfits that provide good tools for developing and testing questionnaires, delivery (whether ComputerAssisted Personal Interviewing with accompany tablet software, ComputerAssisted Telephone Interviewing, or webbased) and managing invitations and the like. Those that I’ve looked at all looked to be good quality, and mostly offered to look after data management in the cloud during collection too.
The surprise for me in this space was the World Bank’s very high quality offering – Survey Solutions. It’s very good, very powerful (uses C# for more complex things but very simple to get the knack of), great control over survey logic, validation and soon. It’s definitely up there with or better than the commercial competition, and free for developing in. Growing fast, with a rapid enhancement development cycle, satisfactory videobased training aimed at the different users (interviewers, designers, etc), and a vibrant user community. For some customers (eg member governments) the Bank might host data management during collection too. I definitely recommend looking at Survey Solutions if you’re doing a serious survey, particularly CAPI in difficult conditions.
Processing, analysis and sample designFor analysis of complex surveys, Thomas Lumley’s brilliantly effective and comprehensive R survey package is the day to day workhorse which meets virtually all my surveyspecific needs. That is, on top of the usual R environment and packages I use for data management and graphics of course. His book on complex surveys is my handbook for the analytical stage and I hope there will be a new edition soon.
There are other excellent relevant R packages too; see the CRAN official statistics and survey methodology task view.
Dealing with surveys is also a strength of Stata.
The philosophy of complex survey analysis in Stata is very similar to survey in R. One specifies the data, design (primary sample units, strata, and so on), then uses specialist analysis functions that do things like means, totals, proportions, models in the appropriate way given that design. This abstracts nearly all the complexity away from the analyst and leads to rapid development and readable code.
Q is a nice product for analysis of surveys aimed at people who aren’t specialist statisticians. It’s got much more grunt (better graphics and statistics) than the other crosstabs tools I’ve seen aimed at market researchers, but is still fairly simple to use. It handles complexities like multiple response questions (‘pick as many as apply…’) elegantly and can do statistical inference that appropriately takes the weights into account (not other aspects of sample design though, I think). For data transform, reshape or graphics that get too much for the pointandclick, the user can write R (nice) or JavaScript (yuck – this isn’t what JavaScript was designed for…). The R connection isn’t brilliant – the data and analysis are sent to a distant server and analysed there before coming back to the main Q client, so if you were going to do much of that you’d be better off just using R. Q is pricey per user compared to R or Stata.
If used out of the box, SPSS will interpret weights as frequencies which leads to wildly inappropriate statistical inference. It’s common in market research to adjust for this by scaling weights to add to the sample size rather than the population but this is very crude, only very partially fixes the problem, and makes it much harder to produce population estimates. There’s a “complex survey” module available that adds in the necessary functionality but it costs extra on top of base SPSS. I’ve never seen it in use.
I don’t recommend SPSS for this or any purpose. There’s a nice open source competitor in this space jamovi, built on R but adding SPSSlike GUI control; but at the moment it can’t handle complex surveys at all. This may change, because anything available in R is in principle simple to add to jamovi.
SAS can handle complex surveys appropriately of course. It comes with a price tag, and generally less functionality than available for much less cost in R and Stata.
Microsoft Power BI not for serious statistical inference, but it is an effective exploratory and a great dashboarding tool. In my most recent post a month ago I showed how to workaround the way it doesn’t deal naturally with weighted data.
None of the above is comprehensive, just some of the surveyrelated things I’ve been thinking about recently.
That’s all for today. If I were assigning exercises it would be – go have another look at the total survey error diagram, and think about how it applies to other data analysis!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: free range statistics  R. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
About Risks and SideEffects… Consult your PurrrMacist
(This article was first published on rbloggers – STATWORX, and kindly contributed to Rbloggers)
Capture errors, warnings and messages, but keep your list operations goingIn a recent post about text mining, I discussed some solutions to webscraping the contents of our STATWORX blog using the purrrpackage. However, while preparing the next the episode of my series on text mining, I remembered a little gimmick that I found quite helpful along the way. Thus, a little detour: How do I capture sideeffects and errors when I perform operations on lists with purrr rather than using a loop?
First of all, a quick motivating example: Imagine we will build a parser for the blogsection of our STATWORX website. However, we have no idea how many entries were posted in the meantime (the hive of data scientists in our office is pretty quick with writing these things, usually). Thus, we need such a function to be more robust in the sense that it can endure the cruelties of "404 – Not found" error messages and still continues parsing after running into an error.
How could this possibly() work?So let's use some beautiful purrr adverbs to fearlessly record all our outputs, errors and warnings, rather than stopping and asking the user to handle sideeffects the exact moment errors turn up. These adverbs are reminiscent of try(), however, these are a little more convenient for operations on lists.
Let's consider a more complex motivating example first, but no worries – there are more obvious examples to help explain the nittygritties further down this page. The Rcode below illustrates our use of possibly() for later use with puurr::map(). First, let us have a look at what we tried to achieve with our function. More specifically, what happens between the curly braces below: Our robust_parse() function will simply parse HTMLwebpages for other links using URLs that we provide it with. In this case, we simply use paste0() to create a vector of links to our blog overview pages, extract the weblinks from these each of these pages using XML::xpathSApply(), pipe these weblinks into a data_frame and clean our results from duplicates using dplyr::filter() – there are various overview pages that group our blogs by category – and dplyr::distinct().
robust_parse < possibly(function(value){ htmlParse(paste0("http://www.statworx.com/de/blog/page/", value, "/")) %>% xpathSApply(., "//a/@href") %>% data_frame(.) %>% filter(., grepl("/blog", .)) %>% filter(., !grepl("/blog/$/blog/page//datascience//statistik/", .)) %>% distinct() }, otherwise = NULL)Second, let us inspect how we employ possibly() in this context. possibly() expects a function to be modified from us, as well as the argument otherwise, stating what it is supposed to do when things go south. In this case, we want NULL as an output value. Another popular choice would be NA, signaling that somewhere, we have not produced a string as intended. However, in our example we are happy with NULL, since we only want to parse the pages that exist and do not require a specific listing of pages that do not exist (or what happened when we did not find a page).
webpages < map_df(0:100, ~robust_parse(.)) %>% unlist webpages .1 "https://www.statworx.com/de/blog/strsplitbutkeepingthedelimiter/" .2 "https://www.statworx.com/de/blog/datascienceinpythonvorstellungvonnuetzlichendatenstrukturenteil1/" .3 "https://www.statworx.com/de/blog/burglrstealingcodefromtheweb/" .4 "https://www.statworx.com/de/blog/regularizedgreedyforestthescottishplayacti/" ...Third, we use our new function robust_parse() to operate on a vector or list of integers from 0 to 100 (possible numbers of subpages we want to parse) and have a quick look at the beautiful links we extracted. Just as a reminder, below you find the code to extract and clean the contents of the individual pages, using another map_df()based loop – which is the focus of another post.
tidy_statworx_blogs < map_df(webpages, ~read_html(.) %>% htmlParse(., asText = TRUE) %>% xpathSApply(., "//p", xmlValue) %>% paste(., collapse = "\n") %>% gsub("\n", "", .) %>% data_frame(text = .) %>% unnest_tokens(word, text) %>% anti_join(data_frame(word = stopwords("de"))) %>% anti_join(data_frame(word = stopwords("en"))) %>% mutate(author = .$word[2]))However, we actually want to go back to our purrrhelpers and see what they can do for us. To be more specific, rather than helpers, these are actually called adverbs since we use them to modify the behavior of a function (i.e. a verb). Our current robust_parse() function does not produce entries when the loop does not successfully find a webpage to parse for links. Consider the situation where you intend to keep track of unsuccessfull operations and of errors that arise along the way. Instead of further exploring purrr adverbs using the above code, let us look at a much easier example to realise the possible contexts in which using purrr adverbs might help you out.
A much easier example: Try dividing a character string by 2Suppose there is an element in our list where our amazing division powers are useless: We are going to try to divide all the elements in our list by 2 – but this time, we want purrr to note where the function i_divide_things resists dividing particular elements for us. Again, the otherwise argument helps us defining our output in situations that are beyond the scope of our function.
i_divide_things < possibly(function(value){ value /2}, otherwise = "I won't divide this for you.") # Let's try our new function > purrr::map(list(1, 2, "a", 6), ~ i_divide_things(.)) [[1]] [1] 0.5 [[2]] [1] 1 [[3]] [1] "I won't divide this for you." [[4]] [1] 3However, consider the case where "something did not work out" might not suffice and you want to keep track of possible errors as well as warnings while still retaining the entire output. A job for safely(): As illustrated below, wrapping our function by safely(), helps us output a nested list. For each element of the input, the output provides two components – $result and $error. For all iterations where a list element is numeric, $result includes a numeric output and an empty (= NULL) errorelement. Only for the third list element – where our function stumbled over a character input – we captured an error message, as well as the result we defined using otherwise.
i_divide_things < safely(function(value){ value /2}, otherwise = "This did not quite work out.") purrr::map(list(1, 2, "a", 6), ~ i_divide_things(.)) [[1]] [[1]]$result [1] 0.5 [[1]]$error NULL [[2]] [[2]]$result [1] 1 [[2]]$error NULL [[3]] [[3]]$result [1] "This did not quite work out." [[3]]$error [[4]] [[4]]$result [1] 3 [[4]]$error NULLIn the example above, we have only been revealing our errors once we have looped over all elements of our list, by inspecting the output list. However, safely() also has the quiet argument – by default set to TRUE. If we set this to FALSE, we receive our errors the very moment they occur.
Now, we want to have a quick look at quietly(). We will define a warning, a message and print an output. This is to illustrate where purrr saves the individual components that our function returns. For each element of our input the returned list provides four components:
 $result again returns the result of our operation
 $output returns the output that would be printed in the console.
 $warnings and $message return the strings that we defined.
Last, there is auto_browse(), which allows us to trigger the RStudio browser for debugging and brings the user to the approximate location of the error. This case is illustrated in the screenshot below.
i_divide_things < purrr::auto_browse(function(value){ print(value / 2) }) purrr::map(list(1, "a", 6), ~i_divide_things(.))Splendid  this was a quick wrapup of how to wrap your functions for handling sideeffects in your operations on lists using adverbs of purrr. Happy wrapping everyone!
Über den Autor David SchleppsDer Beitrag About Risks and SideEffects… Consult your PurrrMacist erschien zuerst auf STATWORX.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: rbloggers – STATWORX. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
MLE with General Optimization Functions in R
(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to Rbloggers)
In my previous post (https://statcompute.wordpress.com/2018/02/25/mleinr/), it is shown how to estimate the MLE based on the log likelihood function with the generalpurpose optimization algorithm, e.g. optim(), and that the optimizer is more flexible and efficient than wrappers in statistical packages.
A benchmark comparison are given below showing the use case of other general optimizers commonly used in R, including optim(), nlm(), nlminb(), and ucminf(). Since these optimizers are normally designed to minimize the objective function, we need to add a minus () sign to the log likelihood function that we want to maximize, as shown in the minLL() function below. In addition, in order to speed up the optimization process, we can suppress the hessian in the function call. If indeed the hessian is required to calculate standard errors of estimated parameters, it can be calculated by calling the hessian() function in the numDeriv package.
As shown in the benchmark result, although the ucminf() is the most efficient optimization function, a hessian option can increase the computing time by 70%. In addition, in the second fastest nlminb() function, there is no builtin option to output the hessian. Therefore, sometimes it might be preferable to estimate model parameters first and then calculate the hessian afterwards for the analysis purpose, as demonstrated below.
df < read.csv("Downloads/credit_count.txt") ### DEFINE THE OBJECTIVE FUNCTION ### minLL < function(par) { mu < exp(par[1] + par[2] * df$AGE + par[3] * df$ACADMOS + par[4] * df$MINORDRG + par[5] * df$OWNRENT) return(ll < sum(log(exp(mu) * (mu ^ df$MAJORDRG) / factorial(df$MAJORDRG)))) } ### BENCHMARKING ### import::from("rbenchmark", "benchmark") benchmark(replications = 10, order = "elapsed", relative = "elapsed", columns = c("test", "replications", "elapsed", "relative"), optim = {optim(par = rep(0, 5), fn = minLL, hessian = F)}, nlm = {nlm(f = minLL, p = rep(0, 5), hessian = F)}, nlminb = {nlminb(start = rep(0, 5), objective = minLL)}, ucminf = {ucminf::ucminf(par = rep(0, 5), fn = minLL, hessian = 0)}, hessian = {ucminf::ucminf(par = rep(0, 5), fn = minLL, hessian = 1)} ) # test replications elapsed relative # 4 ucminf 10 4.044 1.000 # 3 nlminb 10 6.444 1.593 # 5 hessian 10 6.973 1.724 # 2 nlm 10 8.292 2.050 # 1 optim 10 12.027 2.974 ### HOW TO CALCULATE THE HESSIAN ### fit < nlminb(start = rep(0, 5), objective = minLL) import::from("numDeriv", "hessian") std < sqrt(diag(solve(hessian(minLL, fit$par)))) est < data.frame(beta = fit$par, stder = std, z_values = fit$par / std) # beta stder z_values # 1 1.379324501 0.0438155970 31.480217 # 2 0.010394876 0.0013645030 7.618068 # 3 0.001532188 0.0001956843 7.829894 # 4 0.461129515 0.0068557359 67.261856 # 5 0.199393808 0.0283222704 7.040177It is worth mentioning that, although these general optimizers are fast, they are less userfriendly than wrappers in statistical packages, such as mle or maxLik. For instance, we have to calculate AIC or BIC based on the log likelihood function or pvalues based on Zscores.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Seventeen Minutes From Tweet To Package
(This article was first published on R – rud.is, and kindly contributed to Rbloggers)
Earlier today, @noamross posted to Twitter:
#rstats #lazyweb What's the R/httr/curl equivalent of
curl F "file=@somefile.html" https://t.co/abbugLz9ZW
— Noam Ross (@noamross) May 3, 2018
The answer was a 1:1 “file upload” curl to httr translation:
httr::POST( url = "https://file.io", encode = "multipart", body = list(file = httr::upload_file("/some/path/to/file")), )but I wanted to do more than that since Noam took 20 minutes out his day this week (with no advance warning) to speak to my introtostats class about his work and R.
The Twitter request was (ultimately) a question on how to use R to post content to https://file.io. They have a really simple API, and the timespan from Noam’s request to the initial commit of a fully functional package was roughly 17 minutes. The end product included the ability to post files, strings and R data (something that seemed like a good thing to add).
Not too long after came a v0.1.0 release complete with tests and passing CRAN checks on all platforms.
Noam also suggested I do a screencast:
You should do a screencast of your process for this.
— Noam Ross (@noamross) May 3, 2018
I don’t normally do screencasts but had some conference call time so folks can follow along at home:
That’s not the best screencast in the world, but it’s very representative of the workflow I used. A great deal of boilerplate package machinations is accomplished with this bash script.
I wasn’t happy with the hurried function name choices I made nor was I thrilled with the package title, description, tests and basic docs, so I revamped all those into another release. That took a while, mostly due to constantly triggering API warnings about being ratelimited.
So, if you have a 5 GB or less file, character vector or inmemory R data you’d like to ephemerally share with others, take the fileio package for a spin:
devtools::install_github("hrbrmstr/fileio") fileio::fi_post_text("TWFrZSBzdXJlIHRvIEAgbWUgb24gVHdpdHRlciBpZiB5b3UgZGVjb2RlIHRoaXM=") ## success key link expiry ## 1 TRUE n18ZSB https://file.io/n18ZSB 14 days(bonus points if you can figure out what that seemingly random sequence of characters says).
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Bayesian Inference with Backfitting MCMC
(This article was first published on R – Stable Markets, and kindly contributed to Rbloggers)
Previous posts in this series on MCMC samplers for Bayesian inference (in order of publication): Bayesian Simple Linear Regression with Gibbs Sampling in R Blocked Gibbs Sampling in R for Bayesian Multiple Linear Regression MetropolisinGibbs Sampling and Runtime Analysis with Profviz Speeding up MetropolisHastings with Rcpp All code for this (and previous) posts are in … Continue reading Bayesian Inference with Backfitting MCMC →
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Stable Markets. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Qualitative Data Science: Using RQDA to analyse interviews
(This article was first published on The Devil is in the Data, and kindly contributed to Rbloggers)
Qualitative data science sounds like a contradiction in terms. Data scientists generally solve problems using numerical solutions. Even the analysis of text is reduced to a numerical problem using Markov chains, topic analysis, sentiment analysis and other mathematical tools.
Scientists and professionals consider numerical methods the gold standard of analysis. There is, however, a price to pay when relying on numbers alone. Numerical analysis reduces the complexity of the social world. When analysing people, numbers present an illusion of precision and accuracy. Giving primacy to quantitative research in the social sciences comes at a high price. The dynamics of reality are reduced to statistics, losing the narrative of the people that the research aims to understand.
Being both an engineer and a social scientist, I acknowledge the importance of both numerical and qualitative methods. My dissertation used a mixedmethod approach to review the relationship between employee behaviour and customer perception in water utilities. This article introduces some aspects of qualitative data science with an example from my dissertation. In this article, I show how I analysed data from interviews using both quantitative and qualitative methods and demonstrate why qualitative data science is better to understand text than numerical methods. The most recent version of the code is available on my GitHub repository.
Qualitative Data ScienceThe often celebrated artificial intelligence of machine learning is impressive but does not come close to human intelligence and ability to understand the world. Many data scientists are working on automated text analysis to solve this issue (the topicmodels package is an example of such an attempt). These efforts are impressive but even the smartest text analysis algorithm is not able to derive meaning from text. To fully embrace all aspects of data science we need to be able to methodically undertake qualitative data analysis.
The capabilities of R in numerical analysis are impressive but it can also assist with Qualitative Data Analysis (QDA). Huang Ronggui from Hong Kong developed the RQDA package to analyse texts in R. RQDA assists with qualitative data analysis using a GUI frontend to analyse collections texts. The video below contains a complete course in using this software. Below the video, I share an example from my dissertation which compares qualitative and quantitative methods for analysing text.
For my dissertation about water utility marketing, I interviewed seven people from various organisations. The purpose of these interviews was to learn about the value proposition for water utilities. The data consists of the transcripts of six interviews which I manually coded using RQDA. For reasons of agreed anonymity, I cannot provide the raw data file of the interviews through GitHub.
Numerical Text AnalysisWord clouds are a popular method for exploratory analysis of texts. The wordcloud is created with the text mining and wordcloud packages. The transcribed interviews are converted to a text corpus (the native format for the tm package) and whitespace, punctuation etc is removed. This code snippet opens the RQDA file and extracts the texts from the database. RQDA stores all text in an SQLite database and the package provides a query command to extract data.
# INIT library(tidyverse) library(RQDA) library(tm) library(wordcloud) library(topicmodels) library(igraph) #Load data library(RQDA) openProject("Hydroinformatics/Macromarketing/stakeholders.rqda") # Word frequency diagram library(tm) interviews < RQDAQuery("SELECT file FROM source") interviews$file < apply(interviews, 1, function(x) gsub("…", "...", x)) interviews$file < apply(interviews, 1, function(x) gsub("’", "", x)) interviews < Corpus(VectorSource(interviews$file)) interviews < tm_map(interviews, stripWhitespace) interviews < tm_map(interviews, content_transformer(tolower)) interviews < tm_map(interviews, removeWords, stopwords("english")) interviews < tm_map(interviews, removePunctuation) interviews < tm_map(interviews, removeNumbers) interviews < tm_map(interviews, removeWords, c("interviewer", "interviewee")) library(wordcloud) set.seed(1969) wordcloud(interviews, min.freq = 10, max.words = 50, rot.per=0.35, colors = brewer.pal(8, "Blues")[1:5])This word cloud makes it clear that the interviews are about water businesses and customers, which is a pretty obvious statement. The interviews are also about the opinion of the interviewees (think). While the word cloud is aesthetically pleasing and provides a quick snapshot of the content of the texts, they cannot inform us about their meaning.
Topic modelling is a more advanced method to extract information from the text by assessing the proximity of words to each other. The topic modelling package provides functions to perform this analysis. I am not an expert in this field and simply followed basic steps using default settings with four topics.
# Topic Analysis library(topicmodels) dtm < DocumentTermMatrix(interviews) dtm < removeSparseTerms(dtm, 0.99) ldaOut <LDA(dtm, k = 4) terms(ldaOut,6)This code converts the corpus created earlier into a DocumentTerm Matrix, which is a matrix of words and documents (the interviews) and the frequency at which each of these words occurs. The LDA function applies a Latent Dietrich Allocation to the matrix to define the topics. The variable k defines the number of anticipated topics. An LDA is similar to clustering in multivariate data. The final output is a table with six words for each topic.
Topics extracted from interviews Topic 1 Topic 2 Topic 3 Topic 4 water water customers water think think water think actually inaudible customer companies customer people think yeah businesses service business issues customers businesses service don’tThis table does not tell me much at all about what was discussed in the interviews. Perhaps it is the frequent use of the word “water” or “think”—I did ask people their opinion about waterrelated issues. To make this analysis more meaningful I could perhaps manually remove the words water, yeah, and so on. This introduces bias in the analysis and reduces the reliability of the topic analysis because I would be interfering with the text.
Numerical text analysis sees a text as a bag of words instead of a set of meaningful words. It seems that any automated text mining needs a lot of manual cleaning to derive anything meaningful. This excursion shows that automated text analysis is not a surefire way to analyse the meaning of a collection of words. After a lot of trial and error to try to make this work, I decided to go back to my roots of qualitative analysis using RQDA as my tool.
Qualitative Data Science Using RQATo use RQDA for qualitative data science, you first need to manually analyse each text and assign codes (topics) to parts of the text. The image below shows a question and answer and how it was coded. All marked text is blue and the codes are shown between markers. Coding a text is an iterative process that aims to extract meaning from a text. The advantage of this method compared to numerical analysis is that the researcher injects meaning into the analysis. The disadvantage is that the analysis will always be biased, which in the social sciences is unavoidable. My list of topics was based on words that appear in a marketing dictionary so that I analysed the interviews from that perspective.
My first step was to look at the occurrence of codes (themes) in each of the interviews.
codings < getCodingTable()[,4:5] categories < RQDAQuery("SELECT filecat.name AS category, source.name AS filename FROM treefile, filecat, source WHERE treefile.catid=filecat.catid AND treefile.fid=source.id AND treefile.status=1") codings < merge(codings, categories, all.y = TRUE) head(codings) # Open coding reorder_size < function(x) { factor(x, levels = names(sort(table(x)))) } ggplot(codings, aes(reorder_size(codename), fill=category)) + geom_bar(stat="count") + facet_grid(~filename) + coord_flip() + theme(legend.position="bottom", legend.title=element_blank()) + ylab("Code frequency in interviews") + xlab("Code") ggsave("Hydroinformatics/Macromarketing/rqda_codes.png", width=9, height=11, dpi = 300)The code uses an internal RQDA function getCodingTable to obtain the primary data. The RQDAQuery function provides more flexibility and can be used to build more complex queries of the data. You can view the structure of the RQDA database using the RQDATables function.
This bar chart helps to explore the topics that interviewees discussed but it does not help to understand how these topic relate to each other. This method provides a better view than the ‘bag of words’ approach because the text has been given meaning.
RQDA provides a facility to assign each code to a code category. This structure can be visualised using a network. The network is visualised using the igraph package and the graph shows how codes relate to each other.
Qualitative data analysis can create value from a text by interpreting it from a given perspective. This article is not even an introduction to the science and art of qualitative data science. I hope it invites you to explore RQA and similar tools.
If you are interested in finding out more about how I used this analysis, then read my dissertation on customer service in water utilities.
# Axial coding edges < RQDAQuery("SELECT codecat.name, freecode.name FROM codecat, freecode, treecode WHERE codecat.catid=treecode.catid AND freecode.id=treecode.cid") g < graph_from_edgelist(as.matrix(edges), directed=FALSE) V(g)$name < gsub(" ", "\n", V(g)$name) c < spinglass.community(g) par(mar=rep(0,4)) set.seed(666) plot(c, g, vertex.size=10, vertex.color=NA, vertex.frame.color=NA, edge.width=E(g)$weight, layout=layout.drl)The post Qualitative Data Science: Using RQDA to analyse interviews appeared first on The Devil is in the Data.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
purrr Like a Kitten till the Lake Pipes RoaR
(This article was first published on R – FinEx.Co, and kindly contributed to Rbloggers)
I really should make a minimal effort to resist opening a data analysis blog post with Beach Boys’ lyrics, but this time the combination is too apt. We use the purrr package to show how to let your pipes roar in R.
The tidyverse GitHub site contains a simple example illustrating how well pipes and purrr work together. For more learning, try Jenny Bryan’s purrr tutorial.
First, load the tidyverse and the purrr package.
library(tidyverse) library(purrr) #devtools::install_github("jennybc/repurrrsive") #library(repurrrsive)If you want to be more adventurous, you can download Jenny’s repurrrsive package from GitHub. The code is hashed out in the chunk above.
You Don’t Know What I Got
The great thing about using pipes is you can tie together a series of steps to produce a single output of exactly what you want. In this case, we are going to start with the base R dataset airquality and apply a number of functions to come up with the adjusted Rsquared value for the Ozone data for each month between May and September. Along the way we will run a linear regression on the data to generate the adjusted Rsquared values.
Here is what the full process looks like, from beginning to end.
airquality %>% split(.$Month) %>% map(~ lm(Ozone ~ Temp, data = .)) %>% map(summary) %>% map_dbl('adj.r.squared') 5 6 7 8 9 0.2781 0.3676 0.5024 0.3307 0.6742The problem, of course, is the output generated by the intermediate steps stays behind the scenes. For a beginner, this can be a bit confusing because it isn’t clear what is going on. So let’s break the full chunk into its constituent pieces so we can see what purrr is doing and how pipes tie the whole thing together.
Start at the beginning and take things stepbystep.
Run the airquality data set to see what we are dealing with. We can see the data is collected daily for five months. Ozone looks to be the target, so it is natural to wonder if there is any relationship between it and the other variables.
airquality Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6 7 23 299 8.6 65 5 7 8 19 99 13.8 59 5 8 9 8 19 20.1 61 5 9 10 NA 194 8.6 69 5 10 11 7 NA 6.9 74 5 11 12 16 256 9.7 69 5 12 13 11 290 9.2 66 5 13 14 14 274 10.9 68 5 14 15 18 65 13.2 58 5 15 16 14 334 11.5 64 5 16 17 34 307 12.0 66 5 17 18 6 78 18.4 57 5 18 19 30 322 11.5 68 5 19 20 11 44 9.7 62 5 20 21 1 8 9.7 59 5 21 22 11 320 16.6 73 5 22 23 4 25 9.7 61 5 23 24 32 92 12.0 61 5 24 25 NA 66 16.6 57 5 25 26 NA 266 14.9 58 5 26 27 NA NA 8.0 57 5 27 28 23 13 12.0 67 5 28 29 45 252 14.9 81 5 29 30 115 223 5.7 79 5 30 31 37 279 7.4 76 5 31 32 NA 286 8.6 78 6 1 33 NA 287 9.7 74 6 2 34 NA 242 16.1 67 6 3 35 NA 186 9.2 84 6 4 36 NA 220 8.6 85 6 5 37 NA 264 14.3 79 6 6 38 29 127 9.7 82 6 7 39 NA 273 6.9 87 6 8 40 71 291 13.8 90 6 9 41 39 323 11.5 87 6 10 42 NA 259 10.9 93 6 11 43 NA 250 9.2 92 6 12 44 23 148 8.0 82 6 13 45 NA 332 13.8 80 6 14 46 NA 322 11.5 79 6 15 47 21 191 14.9 77 6 16 48 37 284 20.7 72 6 17 49 20 37 9.2 65 6 18 50 12 120 11.5 73 6 19 51 13 137 10.3 76 6 20 52 NA 150 6.3 77 6 21 53 NA 59 1.7 76 6 22 54 NA 91 4.6 76 6 23 55 NA 250 6.3 76 6 24 56 NA 135 8.0 75 6 25 57 NA 127 8.0 78 6 26 58 NA 47 10.3 73 6 27 59 NA 98 11.5 80 6 28 60 NA 31 14.9 77 6 29 61 NA 138 8.0 83 6 30 62 135 269 4.1 84 7 1 63 49 248 9.2 85 7 2 64 32 236 9.2 81 7 3 65 NA 101 10.9 84 7 4 66 64 175 4.6 83 7 5 67 40 314 10.9 83 7 6 68 77 276 5.1 88 7 7 69 97 267 6.3 92 7 8 70 97 272 5.7 92 7 9 71 85 175 7.4 89 7 10 72 NA 139 8.6 82 7 11 73 10 264 14.3 73 7 12 74 27 175 14.9 81 7 13 75 NA 291 14.9 91 7 14 76 7 48 14.3 80 7 15 77 48 260 6.9 81 7 16 78 35 274 10.3 82 7 17 79 61 285 6.3 84 7 18 80 79 187 5.1 87 7 19 81 63 220 11.5 85 7 20 82 16 7 6.9 74 7 21 83 NA 258 9.7 81 7 22 84 NA 295 11.5 82 7 23 85 80 294 8.6 86 7 24 86 108 223 8.0 85 7 25 87 20 81 8.6 82 7 26 88 52 82 12.0 86 7 27 89 82 213 7.4 88 7 28 90 50 275 7.4 86 7 29 91 64 253 7.4 83 7 30 92 59 254 9.2 81 7 31 93 39 83 6.9 81 8 1 94 9 24 13.8 81 8 2 95 16 77 7.4 82 8 3 96 78 NA 6.9 86 8 4 97 35 NA 7.4 85 8 5 98 66 NA 4.6 87 8 6 99 122 255 4.0 89 8 7 100 89 229 10.3 90 8 8 101 110 207 8.0 90 8 9 102 NA 222 8.6 92 8 10 103 NA 137 11.5 86 8 11 104 44 192 11.5 86 8 12 105 28 273 11.5 82 8 13 106 65 157 9.7 80 8 14 107 NA 64 11.5 79 8 15 108 22 71 10.3 77 8 16 109 59 51 6.3 79 8 17 110 23 115 7.4 76 8 18 111 31 244 10.9 78 8 19 112 44 190 10.3 78 8 20 113 21 259 15.5 77 8 21 114 9 36 14.3 72 8 22 115 NA 255 12.6 75 8 23 116 45 212 9.7 79 8 24 117 168 238 3.4 81 8 25 118 73 215 8.0 86 8 26 119 NA 153 5.7 88 8 27 120 76 203 9.7 97 8 28 121 118 225 2.3 94 8 29 122 84 237 6.3 96 8 30 123 85 188 6.3 94 8 31 124 96 167 6.9 91 9 1 125 78 197 5.1 92 9 2 126 73 183 2.8 93 9 3 127 91 189 4.6 93 9 4 128 47 95 7.4 87 9 5 129 32 92 15.5 84 9 6 130 20 252 10.9 80 9 7 131 23 220 10.3 78 9 8 132 21 230 10.9 75 9 9 133 24 259 9.7 73 9 10 134 44 236 14.9 81 9 11 135 21 259 15.5 76 9 12 136 28 238 6.3 77 9 13 137 9 24 10.9 71 9 14 138 13 112 11.5 71 9 15 139 46 237 6.9 78 9 16 140 18 224 13.8 67 9 17 141 13 27 10.3 76 9 18 142 24 238 10.3 68 9 19 143 16 201 8.0 82 9 20 144 13 238 12.6 64 9 21 145 23 14 9.2 71 9 22 146 36 139 10.3 81 9 23 147 7 49 10.3 69 9 24 148 14 20 16.6 63 9 25 149 30 193 6.9 70 9 26 150 NA 145 13.2 77 9 27 151 14 191 14.3 75 9 28 152 18 131 8.0 76 9 29 153 20 223 11.5 68 9 30The first step is to break the data up by month, so we make use of base R‘s split() function. Notice all the data is grouped by month.
airquality %>% split(.$Month) $`5` Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6 7 23 299 8.6 65 5 7 8 19 99 13.8 59 5 8 9 8 19 20.1 61 5 9 10 NA 194 8.6 69 5 10 11 7 NA 6.9 74 5 11 12 16 256 9.7 69 5 12 13 11 290 9.2 66 5 13 14 14 274 10.9 68 5 14 15 18 65 13.2 58 5 15 16 14 334 11.5 64 5 16 17 34 307 12.0 66 5 17 18 6 78 18.4 57 5 18 19 30 322 11.5 68 5 19 20 11 44 9.7 62 5 20 21 1 8 9.7 59 5 21 22 11 320 16.6 73 5 22 23 4 25 9.7 61 5 23 24 32 92 12.0 61 5 24 25 NA 66 16.6 57 5 25 26 NA 266 14.9 58 5 26 27 NA NA 8.0 57 5 27 28 23 13 12.0 67 5 28 29 45 252 14.9 81 5 29 30 115 223 5.7 79 5 30 31 37 279 7.4 76 5 31 $`6` Ozone Solar.R Wind Temp Month Day 32 NA 286 8.6 78 6 1 33 NA 287 9.7 74 6 2 34 NA 242 16.1 67 6 3 35 NA 186 9.2 84 6 4 36 NA 220 8.6 85 6 5 37 NA 264 14.3 79 6 6 38 29 127 9.7 82 6 7 39 NA 273 6.9 87 6 8 40 71 291 13.8 90 6 9 41 39 323 11.5 87 6 10 42 NA 259 10.9 93 6 11 43 NA 250 9.2 92 6 12 44 23 148 8.0 82 6 13 45 NA 332 13.8 80 6 14 46 NA 322 11.5 79 6 15 47 21 191 14.9 77 6 16 48 37 284 20.7 72 6 17 49 20 37 9.2 65 6 18 50 12 120 11.5 73 6 19 51 13 137 10.3 76 6 20 52 NA 150 6.3 77 6 21 53 NA 59 1.7 76 6 22 54 NA 91 4.6 76 6 23 55 NA 250 6.3 76 6 24 56 NA 135 8.0 75 6 25 57 NA 127 8.0 78 6 26 58 NA 47 10.3 73 6 27 59 NA 98 11.5 80 6 28 60 NA 31 14.9 77 6 29 61 NA 138 8.0 83 6 30 $`7` Ozone Solar.R Wind Temp Month Day 62 135 269 4.1 84 7 1 63 49 248 9.2 85 7 2 64 32 236 9.2 81 7 3 65 NA 101 10.9 84 7 4 66 64 175 4.6 83 7 5 67 40 314 10.9 83 7 6 68 77 276 5.1 88 7 7 69 97 267 6.3 92 7 8 70 97 272 5.7 92 7 9 71 85 175 7.4 89 7 10 72 NA 139 8.6 82 7 11 73 10 264 14.3 73 7 12 74 27 175 14.9 81 7 13 75 NA 291 14.9 91 7 14 76 7 48 14.3 80 7 15 77 48 260 6.9 81 7 16 78 35 274 10.3 82 7 17 79 61 285 6.3 84 7 18 80 79 187 5.1 87 7 19 81 63 220 11.5 85 7 20 82 16 7 6.9 74 7 21 83 NA 258 9.7 81 7 22 84 NA 295 11.5 82 7 23 85 80 294 8.6 86 7 24 86 108 223 8.0 85 7 25 87 20 81 8.6 82 7 26 88 52 82 12.0 86 7 27 89 82 213 7.4 88 7 28 90 50 275 7.4 86 7 29 91 64 253 7.4 83 7 30 92 59 254 9.2 81 7 31 $`8` Ozone Solar.R Wind Temp Month Day 93 39 83 6.9 81 8 1 94 9 24 13.8 81 8 2 95 16 77 7.4 82 8 3 96 78 NA 6.9 86 8 4 97 35 NA 7.4 85 8 5 98 66 NA 4.6 87 8 6 99 122 255 4.0 89 8 7 100 89 229 10.3 90 8 8 101 110 207 8.0 90 8 9 102 NA 222 8.6 92 8 10 103 NA 137 11.5 86 8 11 104 44 192 11.5 86 8 12 105 28 273 11.5 82 8 13 106 65 157 9.7 80 8 14 107 NA 64 11.5 79 8 15 108 22 71 10.3 77 8 16 109 59 51 6.3 79 8 17 110 23 115 7.4 76 8 18 111 31 244 10.9 78 8 19 112 44 190 10.3 78 8 20 113 21 259 15.5 77 8 21 114 9 36 14.3 72 8 22 115 NA 255 12.6 75 8 23 116 45 212 9.7 79 8 24 117 168 238 3.4 81 8 25 118 73 215 8.0 86 8 26 119 NA 153 5.7 88 8 27 120 76 203 9.7 97 8 28 121 118 225 2.3 94 8 29 122 84 237 6.3 96 8 30 123 85 188 6.3 94 8 31 $`9` Ozone Solar.R Wind Temp Month Day 124 96 167 6.9 91 9 1 125 78 197 5.1 92 9 2 126 73 183 2.8 93 9 3 127 91 189 4.6 93 9 4 128 47 95 7.4 87 9 5 129 32 92 15.5 84 9 6 130 20 252 10.9 80 9 7 131 23 220 10.3 78 9 8 132 21 230 10.9 75 9 9 133 24 259 9.7 73 9 10 134 44 236 14.9 81 9 11 135 21 259 15.5 76 9 12 136 28 238 6.3 77 9 13 137 9 24 10.9 71 9 14 138 13 112 11.5 71 9 15 139 46 237 6.9 78 9 16 140 18 224 13.8 67 9 17 141 13 27 10.3 76 9 18 142 24 238 10.3 68 9 19 143 16 201 8.0 82 9 20 144 13 238 12.6 64 9 21 145 23 14 9.2 71 9 22 146 36 139 10.3 81 9 23 147 7 49 10.3 69 9 24 148 14 20 16.6 63 9 25 149 30 193 6.9 70 9 26 150 NA 145 13.2 77 9 27 151 14 191 14.3 75 9 28 152 18 131 8.0 76 9 29 153 20 223 11.5 68 9 30In the second step, we apply the purrr map() function to the linear regression model we create with lm(). We want the adjusted Rsquared value for each month for Ozone. I played around with the variables a bit to find which one illustrated the adjusted Rsquared values best and settled on Temp, but you can choose any other besides Month and Day.
The map() command applies the lm() function to each monthy group and yields the typical output for each month. We now have five linear regression models, one for each month, but no adjusted Rsquared values.
airquality %>% split(.$Month) %>% map(~ lm(Ozone ~ Temp, data = .)) $`5` Call: lm(formula = Ozone ~ Temp, data = .) Coefficients: (Intercept) Temp 102.16 1.88 $`6` Call: lm(formula = Ozone ~ Temp, data = .) Coefficients: (Intercept) Temp 91.99 1.55 $`7` Call: lm(formula = Ozone ~ Temp, data = .) Coefficients: (Intercept) Temp 372.92 5.15 $`8` Call: lm(formula = Ozone ~ Temp, data = .) Coefficients: (Intercept) Temp 238.86 3.56 $`9` Call: lm(formula = Ozone ~ Temp, data = .) Coefficients: (Intercept) Temp 149.35 2.35To generate the adjusted Rsquared values, we need to map() the summary() command to each group. This is the third step. Again, the results are familiar. We have the typical summary of the linear model, one for each month.
airquality %>% split(.$Month) %>% map(~ lm(Ozone ~ Temp, data = .)) %>% map(summary) $`5` Call: lm(formula = Ozone ~ Temp, data = .) Residuals: Min 1Q Median 3Q Max 30.32 8.62 2.41 5.32 68.26 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 102.159 38.750 2.64 0.0145 * Temp 1.885 0.578 3.26 0.0033 **  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 18.9 on 24 degrees of freedom (5 observations deleted due to missingness) Multiple Rsquared: 0.307, Adjusted Rsquared: 0.278 Fstatistic: 10.6 on 1 and 24 DF, pvalue: 0.00331 $`6` Call: lm(formula = Ozone ~ Temp, data = .) Residuals: Min 1Q Median 3Q Max 12.99 9.34 6.31 11.08 23.27 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 91.991 51.312 1.79 0.116 Temp 1.552 0.653 2.38 0.049 *  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 14.5 on 7 degrees of freedom (21 observations deleted due to missingness) Multiple Rsquared: 0.447, Adjusted Rsquared: 0.368 Fstatistic: 5.65 on 1 and 7 DF, pvalue: 0.0491 $`7` Call: lm(formula = Ozone ~ Temp, data = .) Residuals: Min 1Q Median 3Q Max 32.11 14.52 1.16 7.58 75.29 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 372.92 84.45 4.42 0.00018 *** Temp 5.15 1.01 5.12 3e05 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 22.3 on 24 degrees of freedom (5 observations deleted due to missingness) Multiple Rsquared: 0.522, Adjusted Rsquared: 0.502 Fstatistic: 26.2 on 1 and 24 DF, pvalue: 3.05e05 $`8` Call: lm(formula = Ozone ~ Temp, data = .) Residuals: Min 1Q Median 3Q Max 40.42 17.65 8.07 9.97 118.58 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 238.861 82.023 2.91 0.0076 ** Temp 3.559 0.974 3.65 0.0013 **  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 32.5 on 24 degrees of freedom (5 observations deleted due to missingness) Multiple Rsquared: 0.357, Adjusted Rsquared: 0.331 Fstatistic: 13.4 on 1 and 24 DF, pvalue: 0.00126 $`9` Call: lm(formula = Ozone ~ Temp, data = .) Residuals: Min 1Q Median 3Q Max 27.45 8.59 3.69 11.04 31.39 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 149.347 23.688 6.30 9.5e07 *** Temp 2.351 0.306 7.68 2.9e08 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 13.8 on 27 degrees of freedom (1 observation deleted due to missingness) Multiple Rsquared: 0.686, Adjusted Rsquared: 0.674 Fstatistic: 58.9 on 1 and 27 DF, pvalue: 2.95e08She Blows em Outta the Water Like You Never Seen
Now things get interesting, because we can put it all together.
Step 4 involves using the specialized map_dbl() command to pull the adjusted Rsquared values from each month’s linear model summary and output them in a single line.
How do we know the adjusted Rsquared value is a double? Well, it looks like one since it consists of a floating decimal. But we could guess, too. If we were to use the map_int() command we would get an error that tells us the value is a double. If we guessed character we could use mapZ_chr. That would work but we would have the output in character form, which isn’t what we want.
So we can recognize its data form or we can figure it out through trial and error. Either way we end up at the same place, back where we started.
airquality %>% split(.$Month) %>% map(~ lm(Ozone ~ Temp, data = .)) %>% map(summary) %>% map_dbl('adj.r.squared') 5 6 7 8 9 0.2781 0.3676 0.5024 0.3307 0.6742And if That Ain’t Enough to Make You Flip Your Lid
I don’t know what will.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – FinEx.Co. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
EARL London Keynote Speaker announcement: Garrett Grolemund
(This article was first published on Mango Solutions, and kindly contributed to Rbloggers)
We’re delighted to announce RStudio’s Garrett Grolemund as one of our Keynote Speakers at this year’s EARL London.
He will join Starcount’s Edwina Dunn and a whole host of brilliant speakers for the 5th EARL London Conference on 1113 September at The Tower Hotel.
Garrett specialises in teaching people how to use R, which is why you’ll see his name on some brilliant resources, including video courses on Datacamp.com and O’Reilly media, his series of popular R cheat sheets distributed by RStudio, and as coauthor of R for Data Science and HandsOn Programming with R. He also wrote the lubridate R package and works for RStudio as an advocate who trains engineers to do data science with R and the Tidyverse.
He earned his Phd in Statistics from Rice University in 2012 under the guidance of Hadley Wickham. Before that, he earned a Bachelor’s degree in Psychology from Harvard University and briefly attended law school before wising up.
Garrett is one of the foremost promoters of Shiny, R Markdown, and the Tidyverse, so we’re really looking forward to his keynote.
Don’t miss out on early bird ticketsEarly bird tickets for all EARL Conferences are now available:
London: 1113 September
Seattle: 7 November
Houston: 9 November
Boston: 13 November
To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Statistical Sins: Is Your Classification Model Any Good?
(This article was first published on Deeply Trivial, and kindly contributed to Rbloggers)
.knitr .inline { backgroundcolor: #f7f7f7; border:solid 1px #B0B0B0; } .error { fontweight: bold; color: #FF0000; } .warning { fontweight: bold; } .message { fontstyle: italic; } .source, .output, .warning, .error, .message { padding: 0 1em; border:solid 1px #F7F7F7; } .source { backgroundcolor: #f5f5f5; } .rimage .left { textalign: left; } .rimage .right { textalign: right; } .rimage .center { textalign: center; } .hl.num { color: #AF0F91; } .hl.str { color: #317ECC; } .hl.com { color: #AD95AF; fontstyle: italic; } .hl.opt { color: #000000; } .hl.std { color: #585858; } .hl.kwa { color: #295F94; fontweight: bold; } .hl.kwb { color: #B05A65; } .hl.kwc { color: #55aa55; } .hl.kwd { color: #BC5A65; fontweight: bold; }
Prediction with Binomial Regression April A to Z is complete! We now return to your regularly scheduled statistics blog posts! Today, I want to talk about an issue I touched on during A to Z: using regression to predict values and see how well your model is doing.
Specifically, I talked a couple of times about binomial regression (here and here), which is used to predict (read: recreate with a set of variables significantly related to) a binary outcome. The data example I used involved my dissertation data and the binary outcome was verdict: guilty or not guilty. A regression model returns the linear correction applied to the predictor variables to reproduce the outcome, and will highlight whether a predictor was significantly related to the outcome or not. But a big question you may be asking of your binomial model is: how well does it predict the outcome? Specifically, how can you examine whether your regression model is correctly classifying cases?
We’ll start by loading/setting up the data and rerunning the binomial regression with interactions.
dissertation<read.delim("dissertation_data.txt",header=TRUE)dissertation<dissertation[,1:44]
predictors<c("obguilt","reasdoubt","bettertolet","libertyvorder",
"jurevidence","guilt")
dissertation<subset(dissertation, !is.na(libertyvorder))
dissertation[45:50]<lapply(dissertation[predictors], function(x) {
y<scale(x, center=TRUE, scale=TRUE)
}
)
pred_int<'verdict ~ obguilt.1 + reasdoubt.1 + bettertolet.1 + libertyvorder.1 +
jurevidence.1 + guilt.1 + obguilt.1*guilt.1 + reasdoubt.1*guilt.1 +
bettertolet.1*guilt.1 + libertyvorder.1*guilt.1 + jurevidence.1*guilt.1'
model<glm(pred_int, family="binomial", data=dissertation)
summary(model)
##
## Call:
## glm(formula = pred_int, family = "binomial", data = dissertation)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## 2.6101 0.5432 0.1289 0.6422 2.2805
##
## Coefficients:
## Estimate Std. Error z value Pr(>z)
## (Intercept) 0.47994 0.16264 2.951 0.00317 **
## obguilt.1 0.25161 0.16158 1.557 0.11942
## reasdoubt.1 0.09230 0.20037 0.461 0.64507
## bettertolet.1 0.22484 0.20340 1.105 0.26899
## libertyvorder.1 0.05825 0.21517 0.271 0.78660
## jurevidence.1 0.07252 0.19376 0.374 0.70819
## guilt.1 2.31003 0.26867 8.598 < 2e16 ***
## obguilt.1:guilt.1 0.14058 0.23411 0.600 0.54818
## reasdoubt.1:guilt.1 0.61724 0.29693 2.079 0.03764 *
## bettertolet.1:guilt.1 0.02579 0.30123 0.086 0.93178
## libertyvorder.1:guilt.1 0.27492 0.29355 0.937 0.34899
## jurevidence.1:guilt.1 0.27601 0.36181 0.763 0.44555
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 490.08 on 354 degrees of freedom
## Residual deviance: 300.66 on 343 degrees of freedom
## AIC: 324.66
##
## Number of Fisher Scoring iterations: 6
The predict function, which I introduced here, can also be used for the binomial model. Let’s have R generate predicted scores for everyone in the dissertation sample:
dissertation$predver<predict(model)dissertation$predver
## [1] 0.3907097456 4.1351129605 2.1820478279 2.8768390246 2.5804618523
## [6] 0.4244692909 2.3065468369 2.7853434926 0.3504760502 0.2747339639
## [11] 1.8506160725 0.6956240161 4.7860574839 0.3875950731 2.4955679446
## [16] 0.3941516951 4.5831011509 1.6185480937 0.4971923298 4.1581842900
## [21] 0.6320531052 4.8447046319 2.3974890696 1.8566258698 0.0360685822
## [26] 2.2151040131 2.3477149003 2.4493726369 0.2253481404 4.8899805287
## [31] 1.7789459288 0.0978703861 3.5541042186 3.6009218603 0.1568318789
## [36] 3.7866003489 0.6371816898 0.7047761441 0.7529742376 0.0302759317
## [41] 0.1108055330 1.9751810033 0.2373614802 0.0424471071 0.4018757856
## [46] 0.0530272726 1.0763759980 0.0099577637 0.3128581222 1.4806679691
## [51] 1.7468626219 0.2998282372 3.6359162016 2.2200774510 0.3192366472
## [56] 3.0103216033 2.0625775984 6.0179845235 2.0300503627 2.3676828409
## [61] 2.8971753746 3.2131490026 2.1349358889 3.0215336139 1.2436192890
## [66] 0.2885535375 0.2141821004 1.9480686936 0.0438751446 1.9368013875
## [71] 0.2931258287 0.5319938265 0.0177643261 3.3724920900 0.0332949791
## [76] 2.5935500970 0.7571810150 0.7131757400 2.5411073339 2.8499853550
## [81] 2.8063291084 0.4500738791 1.4700679077 0.8659309719 0.0870492258
## [86] 0.5728074322 0.1476797509 2.4697257261 2.5935500970 2.2200774510
## [91] 0.0941827753 1.3708676633 1.4345235392 0.2407209578 2.4662700339
## [96] 1.9687731888 6.7412580522 0.0006224018 4.4132951092 2.8543032695
## [101] 1.2295635352 2.8194173530 0.1215689324 3.8258079371 1.8959803882
## [106] 4.5578801595 2.3754402614 0.0826808026 1.5112359711 3.5402060466
## [111] 0.2556657363 0.7054183194 1.4675797244 2.3974890696 2.6955929822
## [116] 0.3123518919 4.8431862346 2.0132721372 0.4673405434 2.3053405270
## [121] 1.9498822386 0.5164183930 1.8277820872 0.0134750769 2.3013547136
## [126] 0.2498730859 4.4281010683 0.0134750769 0.2604532514 0.1476797509
## [131] 2.3392939519 2.0625775984 3.5541042186 1.5087477879 4.6453051124
## [136] 2.0616474606 3.2691362859 7.3752231145 1.6666447439 1.0532964013
## [141] 2.0625775984 0.3355312717 2.2481601983 2.2200774510 4.3276959075
## [146] 0.8685972087 0.7727065311 1.7511589809 0.4774548995 0.0008056357
## [151] 1.7022334970 0.4202625135 0.2902646169 2.4409712692 0.0008056357
## [156] 0.0008056357 3.6009218603 0.8567788439 0.4528474822 0.3517462520
## [161] 0.1307210605 3.7843118182 2.8419024763 3.5191098774 0.1460684795
## [166] 1.8809888141 2.8194173530 2.4656469123 1.0589888029 0.1659840070
## [171] 1.4345235392 2.3676828409 1.5749534339 0.1681557545 2.6406620359
## [176] 0.1476797509 2.2135177411 1.9168260534 3.4993205379 0.4557086940
## [181] 3.8136089417 0.1121510987 3.9772095600 1.3849234171 0.3504760502
## [186] 2.3807710856 3.0667307601 2.3040586537 1.7599138086 0.2083894500
## [191] 0.6844579761 0.3552635652 1.9459392035 0.6075281598 2.1663310490
## [196] 2.3676828409 1.9205271122 2.2334295071 4.4265826710 1.0117771483
## [201] 0.0161530548 0.3072233074 0.0161530548 0.7451676752 7.0351269313
## [206] 2.6406620359 3.7523234832 0.2498730859 2.0222929422 3.2886316225
## [211] 1.6221457956 2.4749949634 1.7570711677 0.0904873650 4.7332807307
## [216] 0.1568318789 0.0302759317 0.5127229828 1.3097316594 6.9309218514
## [221] 0.0515992352 0.4514194447 0.2253481404 4.7652690656 0.4279866041
## [226] 4.4136563866 3.7618312672 0.0156676181 0.2590252139 2.6076058507
## [231] 1.6420333133 3.9985172969 6.2076483227 0.1632104039 0.1829426974
## [236] 4.7652690656 4.4212844958 1.6001906117 0.8579971472 3.8699110198
## [241] 0.3022779567 0.1679979189 1.9421248181 0.6592738895 1.6132788564
## [246] 0.0366544567 3.4818233673 3.9422152187 0.3473613776 0.4321933815
## [251] 0.7480288869 0.2498730859 1.9861068488 2.2297920164 0.7621263656
## [256] 1.2966434147 0.1632104039 0.2048721368 1.7789459288 0.4926393080
## [261] 0.4096285430 1.7794744955 2.5822853071 2.0413250624 6.6574350219
## [266] 0.1277642235 2.1972434657 2.5075677545 0.4482774141 0.6943740757
## [271] 0.7821891015 6.3289445390 0.1568318789 0.1165981835 1.4781797859
## [276] 4.2287015488 3.6157278195 0.1511970641 0.7047761441 2.0935344484
## [281] 3.8258079371 4.4231102471 1.3097316594 3.4081542651 0.4996175382
## [286] 2.0534397824 0.9783975145 2.2562634924 3.7196170683 1.1110084017
## [291] 2.1661785291 4.2138955896 1.9421248181 2.3065468369 0.7139282722
## [296] 4.1431023472 2.0854115837 2.9389399956 1.7711269214 0.0302759317
## [301] 2.6458711124 0.5856241187 0.1199576611 1.8566258698 2.2383553905
## [306] 2.3807710856 0.2838860920 3.1176953128 2.8499853550 2.8063291084
## [311] 0.0034011417 0.4683781352 3.0377484314 1.3833686805 1.7764577456
## [316] 1.7842151661 3.4081542651 0.1165981835 4.6988069009 2.6013721641
## [321] 2.0616474606 0.2498730859 4.2207121622 4.1705330009 5.2103776377
## [326] 4.5406977837 1.5080855068 2.5232652805 5.7259789038 2.5211393933
## [331] 0.3487069432 2.5035573312 2.2764097339 5.8364854607 1.8694684539
## [336] 1.3402996614 0.5728074322 0.3663267540 0.1603491921 2.1690805453
## [341] 1.4105339689 3.0768201201 5.1065624241 4.5966850670 4.5498907729
## [346] 1.3078399029 1.0882592824 0.3128581222 0.3644156933 0.3100845191
## [351] 2.4774831467 1.0763759980 2.2151040131 0.0952748801 4.6864864366
Now, remember that the outcome variable is not guilty (0) and guilty (1), so you might be wondering – what’s with these predicted values? Why aren’t they 0 or 1?
Binomial regression is used for nonlinear outcomes. Since the outcome is 0/1, it’s nonlinear. But binomial regression is based on the general linear model. So how can we apply the general linear model to a nonlinear outcome? Answer: by transforming scores. Specifically, it transforms the outcome into a log odds ratio; the log transform makes the outcome variable behave somewhat linearly and symmetrically. The predicted outcome, then, is also a log odds ratio.
ordvalues<dissertation[order(dissertation$predver),]ordvalues<ordvalues[,51]
ordvalues<data.frame(1:355,ordvalues)
colnames(ordvalues)<c("number","predver")
library(ggplot2)
ggplot(data=ordvalues, aes(number,predver))+geom_smooth()
## `geom_smooth()` using method = 'loess'
Log odds ratios are great for analysis, but when trying to understand how well your model is predicting values, we want to convert them into a metric that’s easier to understand in isolation and when compared to the observed values. We can convert them into probabilities with the following equation:
dissertation$verdict_predicted<exp(predict(model))/(1+exp(predict(model)))This gives us a value ranging from 0 to 1, which is the probability that a particular person will select guilty. We can use this value in different ways to see how well our model is doing. Typically, we’ll divide at the 50% mark, so anyone with a probability of 0.5 or greater is predicted to select guilty, and anyone with a probability less than 0.5 would be predicted to select not guilty. We then compare this new variable with the observed results to see how well the model did.
dissertation$vpred_rounded<round(dissertation$verdict_predicted,digits=0)library(expss)
## Warning: package 'expss' was built under R version 3.4.4
dissertation< apply_labels(dissertation,
verdict = "Actual Verdict",
verdict = c("Not Guilty" = 0,
"Guilty" = 1),
vpred_rounded = "Predicted Verdict",
vpred_rounded = c("Not Guilty" = 0,
"Guilty" = 1)
)
cro(dissertation$verdict,list(dissertation$vpred_rounded, total()))
Predicted Verdict #Total Not Guilty Guilty Actual Verdict Not Guilty 152 39 191 Guilty 35 129 164 #Total cases 187 168 355
One thing we could look at regarding this table, which when dealing with actual versus predicted categories is known as a confusion matrix, is how well the model did at correctly categorizing cases – which we get by adding together the number of people with both observed and predicted not guilty, and people with observed and predicted guilty, then dividing that sum by the total.
accuracy<(152+129)/355accuracy
## [1] 0.7915493
Our model correctly classified 79% of the cases. However, this is not the only way we can determine how well our model did. There are a variety of derivations you can make from the confusion matrix. But two you should definitely include when doing this kind of analysis are sensitivity and specificity. Sensitivity refers to the true positive rate, and specificity refers to the true negative rate.
When you’re working with confusion matrices, you’re often trying to diagnose or identify some condition, one that may be deemed positive or present, and the other that may be deemed negative or absent. These derivations are important because they look at how well your model identifies these different states. For instance, if most of my cases selected not guilty, I could get a high accuracy rate by simply predicting that everyone will select not guilty. But then my model lacks sensitivity – it only identifies negative cases (not guilty) and fails to identify any positive cases (guilty). If I were dealing with something even higher stakes, like whether a test result indicates the presence of a condition, I want to make certain my classification is sensitive to those positive cases. And vice versa, I could keep from missing any positive cases by just classifying everyone as positive, but then my model lacks specificity and I may subject people to treatment they don’t need (and that could be harmful).
Just like accuracy, sensitivity and specificity are easy to calculate. As I said above, I’ll consider not guilty to be negative and guilty to be positive. Sensitivity is simply the number of true positives (observed and predicted guilty) divided by the sum of true positives and false negatives (people who selected guilty but were classified as not guilty).
sensitivity<129/164sensitivity
## [1] 0.7865854
And specificity is the number of true negatives (observed and predicted not guilty) divided by the sum of true negatives and false positives (people who selected not guilty but were classified as guilty).
specificity<152/191specificity
## [1] 0.7958115
So the model correctly classifies 79% of the positive cases and 80% of the negative cases. The model could be improved, but it’s functioning equally well across positive and negative cases, which is good.
It should be pointed out that you can select any cutpoint you want for your probability variable. That is, if I want to be very conservative in identifying positive cases, I might want there to be a higher probability that it is a positive case before I classify it as such – perhaps I want to use a cutpoint like 75%. I can easily do that.
dissertation$vpred2[dissertation$verdict_predicted < 0.75]<0dissertation$vpred2[dissertation$verdict_predicted >= 0.75]<1
dissertation< apply_labels(dissertation,
vpred2 = "Predicted Verdict (0.75 cut)",
vpred2 = c("Not Guilty" = 0,
"Guilty" = 1)
)
cro(dissertation$verdict,list(dissertation$vpred2, total()))
Predicted Verdict (0.75 cut) #Total Not Guilty Guilty Actual Verdict Not Guilty 177 14 191 Guilty 80 84 164 #Total cases 257 98 355 accuracy2<(177+84)/355
sensitivity2<84/164
specificity2<177/191
accuracy2
## [1] 0.7352113
sensitivity2
## [1] 0.5121951
specificity2
## [1] 0.9267016
Changing the cut score improves specificity but at the cost of sensitivity, which makes sense, because our model was predicting equally well (or poorly, depending on how you look at it) across positives and negatives. In this case, a different cut score won’t improve our model. We would need to go back and see if there are better variables to use for prediction. And to keep us from fishing around in our data, we’d probably want to use a training and testing set for such exploratory analysis.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Moving to blogdown
(This article was first published on R – Stat Bandit, and kindly contributed to Rbloggers)
I’ve been in the process of transferring my blog (along with creating a personal website) to blogdown, which is hosted on Github Pages. The new blog, or rather, the continuation of this blog, will be at webbedfeet.github.io/posts, and it went live today.
I’ll be crossposting here for a while, at least until Tal gets my new blog address included in RBloggers. I’m enjoying the RMarkdown blogging experience now, which is quite nice, and any code or analyses I want to include isn’t “lost in translation” when on WP. Since I live in R most of my days, it is also allowing a rather free flow of ideas onto the virtual page.
Hope you’ll come visit
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: R – Stat Bandit. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
How efficient are multifactorial experiments?
(This article was first published on ouR data generation, and kindly contributed to Rbloggers)
I recently described why we might want to conduct a multifactorial experiment, and I alluded to the fact that this approach can be quite efficient. It is efficient in the sense that it is possible to test simultaneously the impact of multiple interventions using an overall sample size that would be required to test a single intervention in a more traditional RCT. I demonstrate that here, first with a continuous outcome and then with a binary outcome.
In all of the examples that follow, I am assuming we are in an exploratory phase of research, so our alpha levels are relaxed a bit to \(\alpha = 0.10\). In addition, we make no adjustments for multiple testing. This might be justifiable, since we are not as concerned about making a Type 1 error (concluding an effect is real when there isn’t actually one). Because this is a screening exercise, the selected interventions will be reevaluated. At the same time, we are setting desired power to be 90%. This way, if an effect really exists, we are more likely to select it for further review.
Two scenarios with a continuous outcomeTo start, I have created two sets of underlying assumptions. In the first, the effects of the four interventions (labeled fac1, fac2, fac3, and fac4) are additive. (The factor variables are parameterized using effectstyle notation, where the value 1 represents no intervention and 1 represents the intervention.) So, with no interventions the outcome is 0, and each successive intervention adds 0.8 to the observed outcome (on average), so that individuals exposed to all four factors will have an average outcome \(4 \times 0.8 = 3.2\).
cNoX < defReadCond("DataMF/FacSumContNoX.csv") cNoX ## condition formula variance dist link ## 1: (fac1 + fac2 + fac3 + fac4) == 4 0.0 9.3 normal identity ## 2: (fac1 + fac2 + fac3 + fac4) == 2 0.8 9.3 normal identity ## 3: (fac1 + fac2 + fac3 + fac4) == 0 1.6 9.3 normal identity ## 4: (fac1 + fac2 + fac3 + fac4) == 2 2.4 9.3 normal identity ## 5: (fac1 + fac2 + fac3 + fac4) == 4 3.2 9.3 normal identityIn the second scenario, each successive exposure continues to add to the effect, but each additional intervention adds a little less. The first intervention adds 0.8, the second adds 0.6, the third adds 0.4, and the fourth adds 0.2. This is a form of interaction.
cX < defReadCond("DataMF/FacSumContX.csv") cX ## condition formula variance dist link ## 1: (fac1 + fac2 + fac3 + fac4) == 4 0.0 9.3 normal identity ## 2: (fac1 + fac2 + fac3 + fac4) == 2 0.8 9.3 normal identity ## 3: (fac1 + fac2 + fac3 + fac4) == 0 1.4 9.3 normal identity ## 4: (fac1 + fac2 + fac3 + fac4) == 2 1.8 9.3 normal identity ## 5: (fac1 + fac2 + fac3 + fac4) == 4 2.0 9.3 normal identityThis is what a plot of the means might look like for each of the scenarios. The straight line represents the additive (noninteractive) scenario, and the bent line is the interaction scenario:
Sample size requirement for a single intervention compared to controlIf we were to conduct a more traditional randomized experiment with two groups – treatment and control – we would need about 500 total subjects under the assumptions that we are using:
power.t.test(power = 0.90, delta = .8, sd = 3.05, sig.level = 0.10) ## ## Twosample t test power calculation ## ## n = 249.633 ## delta = 0.8 ## sd = 3.05 ## sig.level = 0.1 ## power = 0.9 ## alternative = two.sided ## ## NOTE: n is number in *each* groupTo take a look at the sample size requirements for a multifactorial study, I’ve written this function that repeatedly samples data based on the definitions and fits the appropriate model, storing the results after each model estimation.
library(simstudy) iterFunc < function(dc, dt, seed = 464653, iter = 1000, binary = FALSE) { set.seed(seed) res < list() for (i in 1:iter) { dx < addCondition(dc, dt, "Y") if (binary == FALSE) { fit < lm(Y~fac1*fac2*fac3*fac4, data = dx) } else { fit < glm(Y~fac1*fac2*fac3*fac4, data = dx, family = binomial) } # A simple function to pull data from the fit res < appendRes(res, fit) } return(res) }And finally, here are the results for the sample size requirements based on no interaction across interventions. (I am using function genMultiFac to generate replications of all the combinations of four factors. This function is now part of simstudy, which is available on github, and will hopefully soon be up on CRAN.)
dt < genMultiFac(32, nFactors = 4, coding = "effect", colNames = paste0("fac", c(1:4))) res < iterFunc(cNoX, dt) apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean) ## fac1 fac2 fac3 fac4 ## 0.894 0.895 0.905 0.902A sample size of \(32 \times 16 = 512\) gives us 90% power that we are seeking. In case you don’t believe my simulation, we can compare the estimate provided by the MOST package, created by the Methodology Center at Penn State:
library(MOST) FactorialPowerPlan(alpha = 0.10, model_order = 1, nfactors = 4, ntotal = 500, sigma_y = 3.05, raw_main = 0.8)$power ## [1] "" ## [1] "FactorialPowerPlan Macro" ## [1] "The Methodology Center" ## [1] "(c) 2012 Pennsylvania State University" ## [1] "" ## [1] "Assumptions:" ## [1] "There are 4 dichotomous factors." ## [1] "There is independent random assignment." ## [1] "Analysis will be based on main effects only." ## [1] "Twosided alpha: 0.10" ## [1] "Total number of participants: 500" ## [1] "Effect size as unstandardized difference in means: 0.80" ## [1] "Assumed standard deviation for the response variable is 3.05" ## [1] "Attempting to calculate the estimated power." ## [1] "" ## [1] "Results:" ## [1] "The calculated power is 0.9004" ## [1] 0.9004 InteractionA major advantage of the multifactorial experiment over the traditional RCT, of course, is that it allows us to investigate if the interventions interact in any interesting ways. However, in practice it may be difficult to generate sample sizes large enough to measure these interactions with much precision.
In the next pair of simulations, we see that even if we are only interested in exploring the main effects, underlying interaction reduces power. If there is actually interaction (as in the second scenario defined above), the original sample size of 500 may be inadequate to estimate the main effects:
dt < genMultiFac(31, nFactors = 4, coding = "effect", colNames = paste0("fac", c(1:4))) res < iterFunc(cX, dt) apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean) ## fac1 fac2 fac3 fac4 ## 0.567 0.556 0.588 0.541Here, a total sample of about 1300 does the trick:
dt < genMultiFac(81, nFactors = 4, coding = "effect", colNames = paste0("fac", c(1:4))) res < iterFunc(cX, dt) apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean) ## fac1 fac2 fac3 fac4 ## 0.898 0.893 0.908 0.899But this sample size is not adequate to estimate the actual second degree interaction terms:
apply(res$p[, .(`fac1:fac2`, `fac1:fac3`, `fac1:fac4`, `fac2:fac3`, `fac2:fac4`, `fac3:fac4`)] < 0.10, 2, mean) ## fac1:fac2 fac1:fac3 fac1:fac4 fac2:fac3 fac2:fac4 fac3:fac4 ## 0.144 0.148 0.163 0.175 0.138 0.165You would actually need a sample size of about 32,000 to be adequately powered to estimate the interaction! Of course, this requirement is driven by the size of the interaction effects and the variation, so maybe this is a bit extreme:
dt < genMultiFac(2000, nFactors = 4, coding = "effect", colNames = paste0("fac", c(1:4))) res < iterFunc(cX, dt) apply(res$p[, .(`fac1:fac2`, `fac1:fac3`, `fac1:fac4`, `fac2:fac3`, `fac2:fac4`, `fac3:fac4`)] < 0.10, 2, mean) ## fac1:fac2 fac1:fac3 fac1:fac4 fac2:fac3 fac2:fac4 fac3:fac4 ## 0.918 0.902 0.888 0.911 0.894 0.886 A binary outcomeThe situation with the binary outcome is really no different than the continuous outcome, except for the fact that sample size requirements might be much more sensitive to the strength of underlying interaction.
Again, we have two scenarios – one with interaction and one without. When I talk about an additive (noninteraction) model in this context, the additivity is on the logodds scale. This becomes apparent when looking at a plot.
I want to reiterate here that we have interaction when there are limits to how much marginal effect an additional intervention can have conditional on the presence of other interventions. In a recent project (one that motivated this pair of blog entries), we started with the assumption that a single intervention would have a 5 percentage point effect on the outcome (which was smoking cessation), but a combination of all four interventions might only get a 10 percentage point reduction. This cap generates severe interaction which dramatically affected sample size requirements, as we see below (using even less restrictive interaction assumptions).
No interaction:
## condition formula variance dist link ## 1: (fac1 + fac2 + fac3 + fac4) == 4 0.10 NA binary identity ## 2: (fac1 + fac2 + fac3 + fac4) == 2 0.18 NA binary identity ## 3: (fac1 + fac2 + fac3 + fac4) == 0 0.30 NA binary identity ## 4: (fac1 + fac2 + fac3 + fac4) == 2 0.46 NA binary identity ## 5: (fac1 + fac2 + fac3 + fac4) == 4 0.63 NA binary identityInteraction:
## condition formula variance dist link ## 1: (fac1 + fac2 + fac3 + fac4) == 4 0.10 NA binary identity ## 2: (fac1 + fac2 + fac3 + fac4) == 2 0.18 NA binary identity ## 3: (fac1 + fac2 + fac3 + fac4) == 0 0.24 NA binary identity ## 4: (fac1 + fac2 + fac3 + fac4) == 2 0.28 NA binary identity ## 5: (fac1 + fac2 + fac3 + fac4) == 4 0.30 NA binary identityThe plot highlights that additivity is on the logodds scale only:
The sample size requirement for a treatment effect of 8 percentage points for a single intervention compared to control is about 640 total participants:
power.prop.test(power = 0.90, p1 = .10, p2 = .18, sig.level = 0.10) ## ## Twosample comparison of proportions power calculation ## ## n = 320.3361 ## p1 = 0.1 ## p2 = 0.18 ## sig.level = 0.1 ## power = 0.9 ## alternative = two.sided ## ## NOTE: n is number in *each* groupSimulation shows that the multifactorial experiment requires only 500 participants, a pretty surprising reduction:
dt < genMultiFac(31, nFactors = 4, coding = "effect", colNames = paste0("fac", c(1:4))) res < iterFunc(bNoX, dt, binary = TRUE) apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean) ## fac1 fac2 fac3 fac4 ## 0.889 0.910 0.916 0.901But, if there is a cap to how much we can effect the outcome (i.e. there is underlying interaction), estimated power is considerably reduced:
dt < genMultiFac(31, nFactors = 4, coding = "effect", colNames = paste0("fac", c(1:4))) res < iterFunc(bX, dt, binary = TRUE) apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean) ## fac1 fac2 fac3 fac4 ## 0.398 0.409 0.405 0.392We need to increase the sample size to about \(125 \times 16 = 2000\) just to explore the main effects:
dt < genMultiFac(125, nFactors = 4, coding = "effect", colNames = paste0("fac", c(1:4))) res < iterFunc(bX, dt, binary = TRUE) apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean) ## fac1 fac2 fac3 fac4 ## 0.910 0.890 0.895 0.887I think the biggest take away from all of this is that multifactorial experiments are a super interesting option when exploring possible interventions or combinations of interventions, particularly when the outcome is continuous. However, this approach may not be as feasible when the outcome is binary, as sample size requirements may quickly become prohibitive, given the number of factors, sample sizes, and extent of interaction.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Get your tracks from the Strava API and plot them on Leaflet maps
(This article was first published on Rcrastinate, and kindly contributed to Rbloggers)
Here is some updated R code from my previous post. It doesn’t throw any warnings when importing tracks with and without heart rate information. Also, it is easier to distinguish types of tracks now (e.g., when you want to plot runs and rides separately). Another thing I changed: You get very basic information on the track when you click on it (currently the name of the track and the total length).
Have fun and leave a comment if you have any questions.
options(stringsAsFactors = F)
rm(list=ls())
library(httr)
library(rjson)
library(leaflet)
library(dplyr)
token < “”
# Functions —————————————————————
get.coord.df.from.stream < function (stream.obj) {
data.frame(lat = sapply(stream.obj[[1]]$data, USE.NAMES = F, FUN = function (x) x[[1]]),
lon = sapply(stream.obj[[1]]$data, USE.NAMES = F, FUN = function (x) x[[2]]))
}
get.stream.from.activity < function (act.id, token) {
stream < GET(“https://www.strava.com/”,
path = paste0(“api/v3/activities/”, act.id, “/streams/latlng”),
query = list(access_token = token))
content(stream)
}
get.activities2 < function (token) {
activities < GET(“https://www.strava.com/”, path = “api/v3/activities”,
query = list(access_token = token, per_page = 200))
activities < content(activities, “text”)
activities < fromJSON(activities)
res.df < data.frame()
for (a in activities) {
values < sapply(c(“name”, “distance”, “moving_time”, “elapsed_time”, “total_elevation_gain”,
“type”, “id”, “start_date_local”,
“location_country”, “average_speed”, “max_speed”, “has_heartrate”, “elev_high”,
“elev_low”, “average_heartrate”, “max_heartrate”), FUN = function (x) {
if (is.null(a[[x]])) {
NA } else { a[[x]] }
})
res.df < rbind(res.df, values)
}
names(res.df) < c(“name”, “distance”, “moving_time”, “elapsed_time”, “total_elevation_gain”,
“type”, “id”, “start_date_local”,
“location_country”, “average_speed”, “max_speed”, “has_heartrate”, “elev_high”,
“elev_low”, “average_heartrate”, “max_heartrate”)
res.df
}
get.multiple.streams < function (act.ids, token) {
res.list < list()
for (act.id.i in 1:length(act.ids)) {
if (act.id.i %% 5 == 0) cat(“Actitivy no.”, act.id.i, “of”, length(act.ids), “\n”)
stream < get.stream.from.activity(act.ids[act.id.i], token)
coord.df < get.coord.df.from.stream(stream)
res.list[[length(res.list) + 1]] < list(act.id = act.ids[act.id.i],
coords = coord.df)
}
res.list
}
activities < get.activities2(token)
stream.list < get.multiple.streams(activities$id, token)
# Leaflet —————————————————————–
lons.range < c(9.156572, 9.237580)
lats.range < c(48.74085, 48.82079)
map < leaflet() %>%
addProviderTiles(“OpenMapSurfer.Grayscale”, # nice: CartoDB.Positron, OpenMapSurfer.Grayscale, CartoDB.DarkMatterNoLabels
options = providerTileOptions(noWrap = T)) %>%
fitBounds(lng1 = min(lons.range), lat1 = max(lats.range), lng2 < max(lons.range), lat2 = min(lats.range))
add.run < function (act.id, color, act.name, act.dist, strlist = stream.list) {
act.ind < sapply(stream.list, USE.NAMES = F, FUN = function (x) {
x$act.id == act.id
})
act.from.list < strlist[act.ind][[1]]
map << addPolylines(map, lng = act.from.list$coords$lon,
lat = act.from.list$coords$lat,
color = color, opacity = 1/3, weight = 2,
popup = paste0(act.name, “, “, round(as.numeric(act.dist) / 1000, 2), ” km”))
}
# plot all
for (i in 1:nrow(activities)) {
add.run(activities[i, “id”], ifelse(activities[i, “type”] == “Run”, “red”,
ifelse(activities[i, “type”] == “Ride”, “blue”, “black”)),
activities[i, “name”], activities[i, “distance”])
}
map
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Rcrastinate. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RcppArmadillo 0.8.500.0
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
RcppArmadillo release 0.8.500.0, originally prepared and uploaded on April 21, has hit CRAN today (after having already been available via the RcppCore drat repo). A corresponding Debian release will be prepared as well. This RcppArmadillo release contains Armadillo release 8.500.0 with a number of rather nice changes (see below for details), and continues our normal bimonthly CRAN release cycle.
Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use with a syntax deliberately close to a Matlab. RcppArmadillo integrates this library with the R environment and language–and is widely used by (currently) 472 other packages on CRAN.
A highlevel summary of changes follows.
Changes in RcppArmadillo version 0.8.500.0 (20180421)
Upgraded to Armadillo release 8.500 (Caffeine Raider)

faster handling of sparse matrices by kron() and repmat()

faster transpose of sparse matrices

faster element access in sparse matrices

faster row iterators for sparse matrices

faster handling of compound expressions by trace()

more efficient handling of aliasing in submatrix views

expanded normalise() to handle sparse matrices

expanded .transform() and .for_each() to handle sparse matrices

added reverse() for reversing order of elements

added repelem() for replicating elements

added roots() for finding the roots of a polynomial


Fewer LAPACK compiletime guards are used, new unit tests for underlying features have been added (Keith O’Hara in #211 addressing #207).

The configure check for LAPACK features has been updated accordingly (Keith O’Hara in #214 addressing #213).

The compiletime check for g++ is now more robust to minimal shell versions (#217 addressing #216).

Compiler tests to were added for macOS (Keith O’Hara in #219).
Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcppdevel mailing list off the RForge page.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
17 Jobs for R users from around the world (20180430)
Just visit this link and post a new R job to the R community.
You can post a job for free (and there are also “featured job” options available for extra exposure).
Current R jobsJob seekers: please follow the links below to learn more and apply for your R job of interest:
Featured Jobs
 Freelance
 Freelance Project: R Libraries Install on Amazon AWS EC2 WS
 Anywhere
 27 Apr 2018

 FullTime
 Associate, Data Analytics National Network for Safe Communities (NNSC>) – Posted by nnsc
 New York New York, United States
 18 Apr 2018

 FullTime
 Senior Research Associate Kellogg Research Support – Posted by lorbos
 Evanston Illinois, United States
 17 Apr 2018

 FullTime
 Empirical Research Fellow Kellogg School of Management (Northwestern University)– Posted by lorbos
 Evanston Illinois, United States
 17 Apr 2018

 PartTime
 Data Scientist Alchemy – Posted by Andreas Voniatis
 Anywhere
 7 Apr 2018

 FullTime
 Solutions Engineer RStudio – Posted by nwstephens
 Anywhere
 3 Apr 2018

 FullTime
 Lead Data Scientist @ Washington, District of Columbia, U.S. AFLCIO – Posted by carterkalchik
 Washington District of Columbia, United States
 27 Apr 2018

 FullTime
 Data Scientist R @ Medellín, Antioquia, Colombia IDATA S.A.S – Posted by vmhoyos
 Medellín Antioquia, Colombia
 27 Apr 2018

 FullTime
 Data Scientist @ Annapolis, Maryland, United States The National SocioEnvironmental Synthesis Center – Posted by sesync
 Annapolis Maryland, United States
 27 Apr 2018

 Freelance
 Freelance Project: R Libraries Install on Amazon AWS EC2 WS
 Anywhere
 27 Apr 2018

 FullTime
 Associate, Data Analytics National Network for Safe Communities (NNSC>) – Posted by nnsc
 New York New York, United States
 18 Apr 2018

 FullTime
 Senior Research Associate Kellogg Research Support – Posted by lorbos
 Evanston Illinois, United States
 17 Apr 2018

 FullTime
 Empirical Research Fellow Kellogg School of Management (Northwestern University)– Posted by lorbos
 Evanston Illinois, United States
 17 Apr 2018

 FullTime
 Discrete Choice Modeler @ Chicago, Illinois, U.S. Resource Systems Group – Posted by patricia.holland@rsginc.com
 Chicago Illinois, United States
 13 Apr 2018

 FullTime
 R&D Database Developer @ Toronto, Canada Crescendo Technology Ltd – Posted by Crescendo
 Toronto Ontario, Canada
 12 Apr 2018

 FullTime
 Applied Statistics Position @ Florida U.S. New College of Florida – Posted by Jobs@New College
 Sarasota Florida, United States
 9 Apr 2018

 FullTime
 PhD Fellowship @ Wallace University, US Ubydul Haque – Posted by Ubydul
 Khon Kaen Chang Wat Khon Kaen, Thailand
 9 Apr 2018

 PartTime
 Data Scientist Alchemy – Posted by Andreas Voniatis
 Anywhere
 7 Apr 2018

 FullTime
 Product Owner (Data Science) (m/f) Civey GmbH – Posted by Civey
 Berlin Berlin, Germany
 6 Apr 2018

 FullTime
 Solutions Engineer RStudio – Posted by nwstephens
 Anywhere
 3 Apr 2018

 FullTime
 PostDoctoral Fellow – Data Scientist @ CIMMYT (eastern Africa) International Maize and Wheat Improvement Center – Posted by cimmytjobs
 Marsabit County Kenya
 27 Mar 2018

 Freelance
 Statistician / R Developer – for Academic Statistical Research Academic Research – Posted by empiricus
 Anywhere
 27 Mar 2018

 FullTime
 Graduate Data Scientist JamieAi – Posted by JamieAi
 Anywhere
 27 Mar 2018
In Rusers.com you can see all the R jobs that are currently available.
Rusers Resumes
Rusers also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.
(you may also look at previous R jobs posts ).
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Microsoft R Open 3.4.4 now available
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
An update to Microsoft R Open (MRO) is now available for download on Windows, Mac and Linux. This release upgrades the R language engine to version 3.4.4, which addresses some minor issues with timezone detection and some edge cases in some statistics functions. As a maintenance release, it's backwardscompatible with scripts and packages from the prior release of MRO.
MRO 3.4.4 points to a fixed CRAN snapshot taken on April 1 2018, and you can see some highlights of new packages released since the prior version of MRO on the Spotlights page. As always, you can use the builtin checkpoint package to access packages from an earlier date (for reproducibility) or a later date (to access new and updated packages).
Looking ahead, the next update based on R 3.5.0 has started the build and test process. Microsoft R Open 3.5.0 is scheduled for release on May 31.
We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. You can follow the development of Microsoft R Open at the MRO Github repository. To download Microsoft R Open, simply follow the link below.
MRAN: Download Microsoft R Open
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Make a sculpture in LEGO from a photo, with R
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
The entrance to our office in Redmond in is adorned with this sculpture of our department logo, rendered in LEGO:
We had fun with LEGO bricks at work this week. APEX is our internal team name, this was fun. Oh and we're hiring for all roles in Azure! pic.twitter.com/VlNNaTexA5
— Jeff Sandquist (@jeffsand) March 30, 2017
Our team, the Cloud Developer Advocates, has a logo as well, created by the multitalented Ashley Macnamara. (The mascot's name is Bit: he's a raccoon because, like developers, he's into everything.) It would be nice to have a LEGO rendition of Bit for the wall as well, but converting an image into LEGO bricks isn't easy … until now.
This R script by Ryan Timpe provides everything you need render an image in LEGO. It will downscale the image to a size that meets your bricks budget, convert the colors to those available as LEGO bricks, and divide the image up into LEGOsized pieces, ready to lay out on a flat tray. The script is super easy to use: just source a file of utility functions and then:
(You can also use readJPEG to read in JPG images; I just loaded in the png package and used readPNG which works just as well.) Here's what the output looks like. (Click to see the original, for comparison.)
The script also provides a shopping list of the bricks you need by color and size: this particular project will require 1842 LEGO bricks in 19 different colors to create the 48×48 image. It will even provide a series of stepbystep instructions showing how the project will look in various stages of completion:
The R script is available on GitHub, here, and works with any recent version of R and with uptodate tidyverse installation. (I used R 3.5.0.) You can find a complete walkthrough of using the scripts in the blog post at the link below.
Ryan Timple: How To: LEGO mosaics from photos using R & the tidyverse
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Z is for ZScores and Standardizing
(This article was first published on Deeply Trivial, and kindly contributed to Rbloggers)
.knitr .inline { backgroundcolor: #f7f7f7; border:solid 1px #B0B0B0; } .error { fontweight: bold; color: #FF0000; } .warning { fontweight: bold; } .message { fontstyle: italic; } .source, .output, .warning, .error, .message { padding: 0 1em; border:solid 1px #F7F7F7; } .source { backgroundcolor: #f5f5f5; } .rimage .left { textalign: left; } .rimage .right { textalign: right; } .rimage .center { textalign: center; } .hl.num { color: #AF0F91; } .hl.str { color: #317ECC; } .hl.com { color: #AD95AF; fontstyle: italic; } .hl.opt { color: #000000; } .hl.std { color: #585858; } .hl.kwa { color: #295F94; fontweight: bold; } .hl.kwb { color: #B05A65; } .hl.kwc { color: #55aa55; } .hl.kwd { color: #BC5A65; fontweight: bold; }
Z is for ZScores and Standardizing Last April, I wrapped up the A to Z of Statistics with a post about Zscores. It seems only fitting that I’m wrapping this April A to Z with the same topic. Zscores are frequently used, sometimes when you don’t even realize it. When you take your child to the doctor and they say he’s at the x percentile on height, or when you take a standardized test and are told you scored in the y percentile, those values are being derived from Zscores. You create a Zscore when you subtract the population mean from a value and divide that result by the population standard deviation.
Of course, we often will standardize variables in statistics, and the results are similar to Zscores (though technically not the same if the mean and standard deviation aren’t population values). In fact, when I demonstrated the GLM function earlier this month, I skipped a very important step when conducting an analysis with interactions. I should have standardized my continuous predictors first, which means subtracting the variable mean and dividing by the variable standard deviation, creating a new variable with a mean of 0 and a standard deviation of 1 (just like the normal distribution).
Let’s revisit that GLM analysis. I was predicting verdict (guilty, not guilty) with binomial regression. I did one analysis where I used a handful of attitude items and the participant’s guilt rating, and a second analysis where I created interactions between each attitude item and the guilt rating. The purpose was to see if an attitude impacts the threshold – how high the guilt rating needed to be before a participant selected “guilty”. When working with interactions, the individual variables are highly correlated with the interaction variables based on them, so we solve that problem, and make our analysis and output a bit cleaner, by centering our variables and using those centered values to create interactions.
I’ll go ahead and load my data. Also, since I know I have some missing values, which caused an error when I tried to work with predicted values and residuals yesterday, I’ll also go ahead and identify that case/those cases.
dissertation<read.delim("dissertation_data.txt",header=TRUE)predictors<c("obguilt","reasdoubt","bettertolet","libertyvorder",
"jurevidence","guilt")
library(psych)
## Warning: package 'psych' was built under R version 3.4.4
describe(dissertation[predictors])
## vars n mean sd median trimmed mad min max range skew
## obguilt 1 356 3.50 0.89 4 3.52 0.00 1 5 4 0.50
## reasdoubt 2 356 2.59 1.51 2 2.68 1.48 9 5 14 3.63
## bettertolet 3 356 2.36 1.50 2 2.38 1.48 9 5 14 3.41
## libertyvorder 4 355 2.74 1.31 3 2.77 1.48 9 5 14 3.89
## jurevidence 5 356 2.54 1.63 2 2.66 1.48 9 5 14 3.76
## guilt 6 356 4.80 1.21 5 4.90 1.48 2 7 5 0.59
## kurtosis se
## obguilt 0.55 0.05
## reasdoubt 26.92 0.08
## bettertolet 25.47 0.08
## libertyvorder 34.58 0.07
## jurevidence 25.39 0.09
## guilt 0.54 0.06
dissertation<subset(dissertation, !is.na(libertyvorder))
R has a builtin function that will do this for you: scale. The scale function has 3 main arguments – the variable or variables to be scaled, and whether you want those variables centered (subtract mean) and/or scaled (divided by standard deviation). For regression with interactions, we want to both center and scale. For instance, to center and scale the guilt rating:
dissertation$guilt_c<scale(dissertation$guilt, center=TRUE, scale=TRUE)I have a set of predictors I want to do this to, so I want to apply a function across those specific columns:
dissertation[46:51]<lapply(dissertation[predictors], function(x) {y<scale(x, center=TRUE, scale=TRUE)
}
)
Now, let’s rerun that binomial regression, this time using the centered variables in the model.
pred_int<'verdict ~ obguilt.1 + reasdoubt.1 + bettertolet.1 + libertyvorder.1 +jurevidence.1 + guilt.1 + obguilt.1*guilt.1 + reasdoubt.1*guilt.1 +
bettertolet.1*guilt.1 + libertyvorder.1*guilt.1 + jurevidence.1*guilt.1'
model2<glm(pred_int, family="binomial", data=dissertation)
summary(model2)
##
## Call:
## glm(formula = pred_int, family = "binomial", data = dissertation)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## 2.6101 0.5432 0.1289 0.6422 2.2805
##
## Coefficients:
## Estimate Std. Error z value Pr(>z)
## (Intercept) 0.47994 0.16264 2.951 0.00317 **
## obguilt.1 0.25161 0.16158 1.557 0.11942
## reasdoubt.1 0.09230 0.20037 0.461 0.64507
## bettertolet.1 0.22484 0.20340 1.105 0.26899
## libertyvorder.1 0.05825 0.21517 0.271 0.78660
## jurevidence.1 0.07252 0.19376 0.374 0.70819
## guilt.1 2.31003 0.26867 8.598 < 2e16 ***
## obguilt.1:guilt.1 0.14058 0.23411 0.600 0.54818
## reasdoubt.1:guilt.1 0.61724 0.29693 2.079 0.03764 *
## bettertolet.1:guilt.1 0.02579 0.30123 0.086 0.93178
## libertyvorder.1:guilt.1 0.27492 0.29355 0.937 0.34899
## jurevidence.1:guilt.1 0.27601 0.36181 0.763 0.44555
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 490.08 on 354 degrees of freedom
## Residual deviance: 300.66 on 343 degrees of freedom
## AIC: 324.66
##
## Number of Fisher Scoring iterations: 6
The results are essentially the same; the constant and slopes of the predictors variables are different but the variables that were significant before still are. So standardizing doesn’t change the results, but it is generally recommended because it makes results easier to interpret, because the variables are centered around the mean. So negative numbers are below the mean, and positive numbers are above the mean.
Hard to believe A to Z is over! Don’t worry, I’m going to keep blogging about statistics, R, and whatever strikes my fancy. I almost kept this post going by applying the prediction work from yesterday to the binomial model, but decided that would make for a fun future post. And I’ll probably sprinkle in posts in the near future on topics I didn’t have room for this month or that I promised to write a future post on. Thanks for reading and hope you keep stopping by, even though April A to Z is officially over!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
201804 Extreme Makeover: R Graphics Edition
(This article was first published on R – Stat Tech, and kindly contributed to Rbloggers)
This report describes a complex R graphics customisation example using functions from the ‘grid’ and ‘gridGraphics’ packages and introduces two new functions in ‘grid’: deviceLoc and deviceDim.
Paul Murrell
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Stat Tech. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Comparing dependencies of popular machine learning packages with `pkgnet`
(This article was first published on Shirin's playgRound, and kindly contributed to Rbloggers)
When looking through the CRAN list of packages, I stumbled upon this little gem:
pkgnet is an R library designed for the analysis of R libraries! The goal of the package is to build a graph representation of a package and its dependencies.
And I thought it would be fun to play around with it. The little analysis I ended up doing was to compare dependencies of popular machine learning packages.
 I first loaded the packages:
 I then created a function that will
 create the package report with pkgnet::CreatePackageReport
 convert the edge (report$DependencyReporter$edges) and node (report$DependencyReporter$nodes) data into a graph object with tidygraph::as_tbl_graph

To create a vector of machine learning packages from R I looked at CRAN’s machine learning task view

These are the packages I ended up including:
Note: I wanted to include other packages, like tensorflow, randomFores, gbm, etc. but for those, pkgnet threw an error:
Error in data.table::data.table(node = names(igraph::V(self$pkg_graph)), : column or argument 1 is NULL
 Next, I ran them through my function from before and assigned them each a unique name.
 These individual objects I combined with tidygraph and calculated node centrality as the number of outgoing edges.
 Finally, I plotted the dependency network with ggraph:
The bigger the node labels (package names), the higher their centrality. Seems like the more basic utilitarian packages have the highest centrality (not really a surprise…).
graph %>% ggraph(layout = 'nicely') + geom_edge_link(arrow = arrow()) + geom_node_point() + geom_node_label(aes(label = name, fill = color, size = centrality), show.legend = FALSE, repel = TRUE) + theme_graph() + scale_fill_brewer(palette = "Set1") Because the complete network is a bit hard to make sense of, I plotted it again with only the packages I wanted to analyze plus dependencies that had at least 1 outgoing edge; now it is easier to see shared dependencies.
For example, methods and stats are dependencies of caret, mlr and e1071 but not h2o, while utils is a dependency of all four.
graph %>% filter(centrality > 1  color == "a") %>% ggraph(layout = 'nicely') + geom_edge_link(arrow = arrow()) + geom_node_point() + geom_node_label(aes(label = name, fill = color, size = centrality), show.legend = FALSE, repel = TRUE) + theme_graph() + scale_fill_brewer(palette = "Set1")It would of course be interesting to analyse a bigger network with more packages. Maybe someone knows how to get these other packages to work with pkgnet?
sessionInfo() ## R version 3.5.0 (20180423) ## Platform: x86_64appledarwin15.6.0 (64bit) ## Running under: macOS High Sierra 10.13.4 ## ## Matrix products: default ## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib ## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib ## ## locale: ## [1] de_DE.UTF8/de_DE.UTF8/de_DE.UTF8/C/de_DE.UTF8/de_DE.UTF8 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] bindrcpp_0.2.2 ggraph_1.0.1 ggplot2_2.2.1 tidygraph_1.1.0 ## [5] pkgnet_0.2.0 ## ## loaded via a namespace (and not attached): ## [1] Rcpp_0.12.16 RColorBrewer_1.12 plyr_1.8.4 ## [4] compiler_3.5.0 pillar_1.2.2 formatR_1.5 ## [7] futile.logger_1.4.3 bindr_0.1.1 viridis_0.5.1 ## [10] futile.options_1.0.1 tools_3.5.0 digest_0.6.15 ## [13] viridisLite_0.3.0 gtable_0.2.0 jsonlite_1.5 ## [16] evaluate_0.10.1 tibble_1.4.2 pkgconfig_2.0.1 ## [19] rlang_0.2.0 igraph_1.2.1 ggrepel_0.7.0 ## [22] yaml_2.1.18 blogdown_0.6 xfun_0.1 ## [25] gridExtra_2.3 stringr_1.3.0 dplyr_0.7.4 ## [28] knitr_1.20 htmlwidgets_1.2 grid_3.5.0 ## [31] rprojroot_1.32 glue_1.2.0 data.table_1.10.43 ## [34] R6_2.2.2 rmarkdown_1.9 bookdown_0.7 ## [37] udunits2_0.13 tweenr_0.1.5 tidyr_0.8.0 ## [40] purrr_0.2.4 lambda.r_1.2.2 magrittr_1.5 ## [43] units_0.51 MASS_7.349 scales_0.5.0 ## [46] backports_1.1.2 mvbutils_2.7.4.1 htmltools_0.3.6 ## [49] assertthat_0.2.0 ggforce_0.1.1 colorspace_1.32 ## [52] labeling_0.3 stringi_1.1.7 visNetwork_2.0.3 ## [55] lazyeval_0.2.1 munsell_0.4.3 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...