Tutorial: An app in R shiny visualizing biopsy data — in a pharmaceutical company
(This article was first published on Stories by Sebastian Wolf on Medium, and kindly contributed to R-bloggers)
Learn how to build a Shiny app for the visualization of clustering results. The app helps to better identify patient data samples, e.g. during a clinical study. This tutorial is a joint effort: it was presented by Olaf Menzer in a workshop at the ODSC West Conference in San Francisco in 2018, and Sebastian Wolf co-implemented the application as an expert in biopharmaceutical web applications.
What is it about?
The story behind this app comes from a real application inside departments evaluating clinical studies in diagnostics.
Due to fast recruiting for clinical studies, the patient cohorts seemed to be inhomogeneous. A researcher and a clinical study statistician therefore wanted to find out which parameters could identify patients that do not seem to fit their assigned class. Maybe there was a mistake in the labeling of a patient’s disease status? Maybe one or two measurements can be used to easily find such patients?
The example data used here is real data from a 1990s study, known as the biopsy data set, also hosted on the UCI Machine Learning Repository. The app built in this tutorial is preliminary and was created especially for the tutorial, but pieces of it were applied in real-world biostatistical applications.
Starting point
Please start by forking the tutorial at: https://github.com/zappingseb/biopharmaapp
and afterwards run the installation of dependencies in your R session:
install.packages(c("ape", "dplyr", "magrittr", "igraph", "viridisLite", "MASS", "shiny"))
If you have all the packages installed, you will have one file to work on. This file is app.R. This app.R allows you to build a Shiny application.
We will use this file to insert the right visualizations and the right table for the researcher to fulfill the task named above. To check what the app will finally look like, you can already run runApp() in the console and your browser will open the app.
The app already contains:
A sidebarPanel that has all the inputs we need
 Slider for the # of patients
 Slider for the # of desired clusters/groups
 Empty input to choose measurements — shall be done by you
 Dropdown field for the clustering method
A server function that will provide
 The empty input to choose measurements
 A Heatmap to see the outcome of clustering
 A Phylogenetic tree plot to see the outcome of clustering
 A table to see the outcome of the clustering
It already contains functions to provide you with the input data sets biopsy and biopsy_numeric(); biopsy_numeric is a reactive.
In this tutorial we will go through steps 1–4 to enable building the app
The input data set
The patients inside the data set were obtained from the University of Wisconsin Hospitals, Madison, by Dr. William H. Wolberg. He assessed biopsies of breast tumours for 699 patients up to 15 July 1992; each of nine attributes was scored on a scale of 1 to 10, and the outcome is also known. There are 699 rows and 11 columns.
The data set can be called via the variable biopsy inside the app. Columns 2–10 are stored inside the reactive biopsy_numeric(), which is filtered by the input$patients input so that not all 699 patients are used, but between 1 and 100.
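For orientation, you can inspect the raw data directly. This is a sketch using the biopsy data set as shipped in the MASS package; note that the tutorial's app renames the columns in R/utils.R, so the names below are those of the raw MASS version:

```r
library(MASS)   # recommended package shipped with R; contains the biopsy data set
data(biopsy)

dim(biopsy)     # [1] 699  11
names(biopsy)   # "ID", the measurements "V1" to "V9", and "class"

# Columns 2-10 hold the nine numeric measurements; taking the first
# 100 rows mirrors the input$patients filter used by biopsy_numeric()
biopsy_subset <- biopsy[1:100, 2:10]
dim(biopsy_subset)  # [1] 100   9
```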
1) Construction of a SelectInput
The selectInput shall allow the user to not use all 9 measured variables, but just the ones they desire. This shall help find the measurement that is necessary to classify patients. What is a Shiny selectInput? We can look at the description of selectInput via
?shiny::selectInput
and see:
We now need to build the choices as the column names of the biopsy data set from columns 2–10. The selected values will be the same. We shall allow multiple selections, so multiple will be set to TRUE. Additionally we shall name the inputId "vars". So we can replace the part output$variables inside the app.R file with this:
output$variables <- renderUI({
  selectInput(inputId = "vars",
              label = "Variables to use",
              choices = names(biopsy)[2:10],
              multiple = TRUE,
              selected = names(biopsy)[2:10]
  )
})
And you’re done.
2) A Heatmap to see the outcome of clustering
The basic heatmap function allows you to draw a heat map. In this case we would like to change a few things. We would like to change the clustering method inside the hclust function to a method defined by the user. We can grab the user-defined method by using input$method, as we already defined this input field as a drop-down menu. We have to overwrite the default hclust method with our method:
my_hclust <- function(...){
  hclust(method = my_method, ...)
}
my_method <<- input$method
Be aware that you define a global variable my_method here, which suffices within the scope of this tutorial. However, keep in mind that global variables can be problematic in many other contexts; do your own research on what best fits your application.
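Outside of Shiny the override can be tried directly. This is a minimal sketch in which the literal "ward.D2" stands in for the value of input$method, and a small built-in data set stands in for biopsy_numeric():

```r
# Stand-in for input$method chosen in the drop-down menu
my_method <- "ward.D2"

# Forward all arguments to hclust, fixing only the agglomeration method
my_hclust <- function(...) {
  hclust(method = my_method, ...)
}

# Behaves exactly like hclust(d, method = "ward.D2")
clust <- my_hclust(dist(USArrests[1:20, ]))
clust$method  # "ward.D2"
```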
Now for the heatmap call we basically need to change a few inputs. Please see the result:
heatmap(x = t(as.matrix(biopsy_numeric())), Rowv = NA, hclustfun = my_hclust,
        labCol = biopsy$"disease status",
        col = viridis(15)
)
We need to transpose the biopsy_numeric matrix, as we would like to have the patients in columns. As there is just one-dimensional clustering, we can switch off the row dendrogram by setting Rowv to NA. The hclustfun is overwritten by our function my_hclust.
For the coloring of the plot we use the viridis palette, as it is a color-blind-friendly palette. The labels of our columns shall now be not the patient IDs but the disease status. You can see the names of all columns defined in the file R/utils.R; there you see that the last column of biopsy is called "disease status". This will be used to label each patient. Now we get:
output$plot1 <- renderPlot({
  my_hclust <- function(...){
    hclust(method = my_method, ...)
  }
  my_method <<- input$method
  heatmap(x = t(as.matrix(biopsy_numeric())), Rowv = NA, hclustfun = my_hclust,
          labCol = biopsy$"disease status",
          col = viridis(15)
  )
})
Part 2 is done
3) Plot a phylogenetic tree
To allow plotting a phylogenetic tree, we provide you with a function called phyltree. You can read the whole code of the function inside R/utils.R. This function takes as inputs:
 a numeric matrix > biopsy_numeric() CHECK
 The clustering method > input$method CHECK
 The number of clusters > input$nc CHECK
 A color function > viridis CHECK
You can read why to use () behind biopsy_numeric here.
The hard part is now the labels. The biopsy_numeric data set is filtered by the # of patients, so we have to filter the labels, too. We therefore use:
labels = biopsy %>%
  dplyr::select("disease status") %>%
  filter(row_number() <= input$patients) %>%
  mutate_all(as.character) %>%
  pull("disease status")
This is a workflow using functional programming with the R package dplyr. The function select allows us to select just the "disease status" column. The filter function restricts the number of rows. The mutate_all function applies as.character to all columns, and finally we export the labels as a vector using pull.
 labels for the tree nodes > biopsy %>% … CHECK
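The same filtering can be spelled out in base R for comparison. This is a sketch using a tiny hypothetical data frame standing in for biopsy, with n_patients standing in for input$patients:

```r
# Toy stand-in for the biopsy data: ID, one measurement, disease status
biopsy_toy <- data.frame(
  ID = 1:5,
  V1 = c(5, 1, 3, 8, 4),
  `disease status` = c("benign", "benign", "malignant", "benign", "malignant"),
  check.names = FALSE
)
n_patients <- 3  # stand-in for input$patients

# select + filter + mutate_all + pull, written out in base R:
# pick the column, keep the first n_patients rows, coerce to character
labels <- as.character(biopsy_toy[["disease status"]][seq_len(n_patients)])
labels  # "benign" "benign" "malignant"
```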
The final result looks like this
output$plot2 <- renderPlot({
  phyltree(x = biopsy_numeric(),
           method = input$method,
           nc = input$nc,
           color_func = "viridis",
           labels = biopsy %>% dplyr::select("disease status") %>%
             filter(row_number() <= input$patients) %>%
             mutate_all(as.character) %>%
             pull("disease status")
  )
})
4) Create a table from clustering results
Now we would also like to see, for each patient, which cluster she was assigned to. We therefore perform the clustering and tree cutting ourselves:
clust <- hclust(dist(biopsy_numeric()), method = input$method)
cluster_assigment <- cutree(clust, k = input$nc)  # cluster assignment
The vector cluster_assigment now contains the cluster number for each patient, such as c(1, 2, 1, 1, 1, 2, 2, 1, …). This information is helpful when combined with the patient ID and the disease status that was recorded on the patients' forms.
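On a small built-in data set, hclust plus cutree look like this. This is a sketch: the app uses biopsy_numeric(), input$method and input$nc instead of the stand-ins below:

```r
# Hierarchical clustering, then cut the tree into 2 clusters
clust <- hclust(dist(USArrests[1:20, ]), method = "ward.D2")
cluster_assigment <- cutree(clust, k = 2)

# One cluster number (1 or 2) per observation
length(cluster_assigment)  # 20
table(cluster_assigment)   # how many observations fell into each cluster
```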
The task will be performed using the cbind function of R:
out_table <- cbind(
  cluster_assigment,
  biopsy %>% filter(row_number() <= length(cluster_assigment)) %>% select(c(1, 11))
)  # cbind
Now this table shall be sorted by cluster_assigment to get a faster view of which patients landed in the wrong cluster.
out_table %>% arrange(cluster_assigment)
The final code:
output$cluster_table <- renderTable({
  # Perform clustering via hierarchical clustering
  clust <- hclust(dist(biopsy_numeric()), method = input$method)
  cluster_assigment <- cutree(clust, k = input$nc)  # cluster assignment
  # Create a table with the clusters, patient IDs and disease status
  out_table <- cbind(
    cluster_assigment,
    biopsy %>% filter(row_number() <= length(cluster_assigment)) %>% select(c(1, 11))
  )  # cbind
  # Order by cluster_assigment
  out_table %>% arrange(cluster_assigment)
})
Done
What to do now?
Now you can run the runApp() function.
If you choose 100 patients, 2 clusters, “ward.D2” clustering and all variables, you will quickly see that the patients:
 1002945
 1016277
 1018099
 1096800
could be identified as the patients that were clustered wrong. Now you can search for problems in the clustering or look at the sheets of those patients. By changing the labeling, e.g. using patient IDs inside the phyltree function call, you can even check which other patients show measurements close to these patients. Explore and play!
Further reading
 Get to know other packages important for pharmaceutical applications
 Visual Elements of Data Science
 Learn how to build custom shiny inputs
 Learn how to start fast with a team running R
Tutorial: An app in R shiny visualizing biopsy data — in a pharmaceutical company was originally published in Data Driven Investor on Medium, where people are continuing the conversation by highlighting and responding to this story.
To leave a comment for the author, please follow the link and comment on their blog: Stories by Sebastian Wolf on Medium. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
RTest: pretty testing of R packages
(This article was first published on Stories by Sebastian Wolf on Medium, and kindly contributed to R-bloggers)
The SpecFlow and cucumber.io for R: enabling non-coders to interpret test reports for R packages, and moreover allowing non-coders to create test cases. A step towards simple R package validation.
Table of contents
 Why RTest?
 What’s special about RTest?
 An example of a test implementation with RTest.
 Further Reading
Testing in R seems simple. Start by using usethis::use_test("name") and off you go, coding your tests in testthat with functions like expect_equal. You can find a lot of tutorials online; there is even a whole book on “Testing R Code”. Sadly, this is not the way I can go. As I mentioned a few times, I work in a strongly regulated environment. Inside such environments your tests are not only checked by coders, but also by people who cannot code. Some of your tests will even be written by people who cannot code. Something like SpecFlow or Cucumber would really help them to write such tests, but those do not exist in R. Additionally, those people cannot read command-line test reports. You can train them to do it, but we decided it is easier to provide them and us with a pretty environment for testing, called RTest.
If you want to know more about the reasons for developing such an environment, you can read the article: Why do we need human readable test for a programming language.
What’s special about RTest?
To explain which features we put into RTest, I will start by describing a basic testing workflow.
1 Testing code starts with writing code. Your R package will contain functions, classes and methods. These shall be tested.
2 Writing the tests now mostly includes calls like this:
my_function <- function(x, y){
  z <- sums_up(x, y)
  return(z)
}
x <- 3; y <- 4; z <- 7
stopifnot(my_function(x, y) == z)
It is easy to see that your test will break if your function my_function cannot sum up two values. You will create a bunch of such tests and store them in a separate folder of your package, normally called tests.
3 Afterwards you can run all such tests. You can include a script in the tests folder, or use testthat and run testthat::test_dir().
4 If one of your tests fails, the script will stop and tell you in the console which test failed. This describes the 4 steps shown in the figure below.
What’s now special about RTest are two major steps.
 The definition of the tests
 The reporting of the test execution
For the definition of the tests we decided on XML. Why XML? XML is not just easier to read than pure R code; it comes with a feature called XSD, the “XML Schema Definition”. Each XML test case we create can immediately be checked against a schema designed by the developer. It can also be checked against our very own RTest.xsd. This means the tester can double-check the created test cases before even executing them. This saves us a lot of time and gives a fixed structure to all test cases.
The reporting was implemented in HTML, due to the many features HTML offers for reporting. It allows coloring of test results, linking to test cases and including images. The main difference in the HTML reporting between RTest and testthat is that RTest reports every test that shall be executed, not only the failed ones. The test report also includes the value created by the function call and the one given as a reference, so the reader can see whether the comparison really went right. Thereby the test report contains far more information than the testthat console log.
An example of a test implementation with RTest
Please note the whole example is stored in a GitHub gist. Please star the gist if you like this example.
1. Given a function that sums up two columns:
2. We want to have one successful and one unsuccessful test. Both will have three parts in the XML file:
params accounts for input parameters
reference for the output data.frame
testspec for whether the test shall run silently and what the tolerance is
For the successful test our test would look like this:
https://medium.com/media/70c6ccf0679bf3ba46da37ada1604890/href
You can immediately see one special feature of RTest: it allows using data sets for multiple tests; we store those data sets in the inputdata tag. This saves space in the file. The data set test01 will be used here. Moreover, a test description can be given for each test. For each data.frame stored in XML, the types of the columns can be given in coldefs; here those are all numeric.
The inputdata is now given here:
https://medium.com/media/5fd8414d53e7e01ae262deb90b8c0d03/href
It’s a data frame with the x column carrying just 1 and the y column carrying just 2. The test shall create a data.frame with the sum column being 3 in each row.
We can easily make the test fail by changing the reference tag: instead of having just 3 in the sum column, we add a 3.5. The whole test case can be found inside the GitHub gist, with 90 rows.
3. The execution of the test case is just one line of code. Your working directory shall be the directory containing the XML file, and my_function shall be defined in the global environment.
RTest.execute(getwd(), "RTest_medium.xml")
4. The test report now contains one successful and one failed test. Both will be visualized:
General test outcome in an RTest test report
Additional information is given on all tests. For the test that failed, we caused the failure by setting the sum to be 3.5 instead of 3. It is reported at the end of the table:
Example of a failed data frame comparison in RTest
Moreover, the report contains information on the environment where the test ran:
System information for an RTest test report
That’s it. Now you can test any package with RTest.
Further reading
 RTest GitHub repository
 RTest documentation website
 Why do we need human readable test for a programming language?
 Author’s LinkedIn page
RTest: pretty testing of R packages was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
7 Reasons for Policy Professionals to Get Pumped About R Programming in 2019
(This article was first published on R Programming – Giles Dickenson-Jones, and kindly contributed to R-bloggers)
Note: A version of this article was also published via LinkedIn here.
With the rise of ‘Big Data’, ‘Machine Learning’ and the ‘Data Scientist’ has come an explosion in the popularity of using open-source programming tools for data analysis.
This article
provides a short summary of some of the evidence of these tools overtaking
commercial alternatives and why, if you work with data, adding an open
programming language, like R or Python, to your professional repertoire is
likely to be a worthwhile career investment for 2019 and beyond.
Like most faithful public policy wonks,
I’ve spent more hours than I can count dragging numbers across a screen to understand,
analyse or predict whatever segment of the world I have data on.
Exploring where the money
was flowing in the world’s youngest democracy; analysing which
government program was delivering the biggest impact; or predicting which roads were
likely to disappear first as a result of climate change.
New policy questions, new approaches to
answer them and a fresh set of data.
Yet, every silver lining has a cloud. And in
my experience with data it’s often the need to scale a new learning curve to adhere
to legacy systems and fulfil an organizational fetish for using their statistical
software of choice.
Excel, SAS, SPSS, Eviews, Minitab, Stata
and the list goes on.
Which is why I’ve decided this article
needed to be written:
Because not only am I tired of talking to fellow
analytical wonks about why they’re limiting themselves by only being able to work
on data with spreadsheets, but also that there are distinct professional
advantages to unshackling yourself from the professional tyranny of proprietary
tools:
1. Open-Source Statistics is Becoming the Global Standard
Firstly, if you haven’t been watching, the
world is increasingly full of data. So much data that the world is chasing after nerds to analyse it. As a result, the demand for a ‘jack of all trades’ data person, or “data scientist”, has been outstripping that of a more vanilla-flavoured ‘statistician’:
% Job Advertisements with term “data scientist” vs. “statistician”
(Credit:
Bob Muenchen – r4stats.com)
And although you
might not have aspirations to work in what the Harvard Business Review called
the ‘Sexiest Job of the
21st Century’, the data gold rush has had implications far beyond the
sex appeal of nerds.
For one, online communities
like Stackoverflow,
Kaggle
and Data
for Democracy have flourished, providing practical avenues for learning
how to do some science with data and driving demand for tools that make
applying this science accessible to everyone, like R and Python.
So much so that some of the best evidence suggests that not only is demand for quants with R and Python skills booming, but the practical use of open-source statistical tools like R and Python is starting to eclipse their proprietary relatives:
Statistical software by Google Scholar Hits:
(More credit to Bob Muenchen –
r4stats.com)
Of course, I’m not here to conclusively make the point that a particular piece of software is a ‘silver bullet’. Only that something has happened in the world of data that the quantitatively inclined shouldn’t ignore: Not only are R and Python becoming programming languages for the masses, but they’re increasingly proving themselves as powerful complements to more traditional business analysis tools like Excel and SAS.
2. R is for Renaissance Wo(Man)
For those watching the news, you’ll no doubt have heard of the great battle being waged between the R and Python languages that has tragically left the internet strewn with the blood of programmers and their pocket protectors.
But I’m going to goosestep right over the issue as in my
opinion much of what I say for R, is increasingly applicable to Python.
For those of you
unfamiliar with R, in essence it’s a programming language made to use computers
to do stuff with numbers.
Enter: “10*10”
and it will tell you ‘100’
Enter: “print(‘Sup?’)”
and the computer will speak to you like that kid loitering on your lawn.
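In an actual R session those two exchanges look like this (a trivial sketch):

```r
10 * 10         # the console answers: [1] 100
print("Sup?")   # the console prints: [1] "Sup?"

# One step beyond a calculator: store the result and reuse it
result <- 10 * 10
result == 100   # TRUE
```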
Developed
around 25 years ago, the idea behind R was
in essence to develop a simpler, more open and extendible programming language
for statisticians. Something which allowed you greater power and flexibility
than a ‘point and click’ interface, but that was quicker than punch cards or manually
keying in 1s and 0s to tell the computer what to do.
The result: R –
A free statistical tool whose sustained growth in use has led to one of the most
flexible statistical tools in existence.
So much growth
in fact, that in 2014 enough new functionality was added to R by the community
that “R
added more functions/procs than the SAS Institute has written in its entire
history.” And while it’s
not the quantity of your software packages that counts, the speed of
development is impressive and a good indication of the likely future trajectory
of R’s functionality. Particularly as many heavy hitters including the likes of
Microsoft,
IBM and Google are already using R and making their own
contributions to the ecosystem:
Using R for Analytics – Get in Before George Clooney Does:
Image
source. Also, see here
Not only that, but with much of this growth being driven by user contributions, it is also a great reminder of the active and supportive community you have access to as an R and Python user. Making it easier to get help, access free resources and find example code to base your analysis on.
3. R is Data and Discipline Agnostic
(Source: xkcd)
One of the first
things that motivated me to learn R, was the observation that many of the most interesting
questions I encountered went unanswered because they crossed disciplines, involved
obscure analytical techniques, or were locked away in a long-forgotten format. It
therefore seemed logical to me that if I could become a data analytics “MacGyver”,
I’d have greater opportunities to work on interesting problems.
Which is how I
discovered R. You see, as somebody that is interested in almost everything, R’s adoption by such a diverse range of fields
made it nearly impossible to overlook. With extensions being freely available to
work with a wide variety of data formats (proprietary
or otherwise) and apply a range of nerdy methods, R made a lot of
sense.
I think it was Richard
Branson that once said “If somebody offers you a problem but you are not sure
you can do it, say yes. R probably has a package for it”:
 Want to forecast sales?
 Need to create a beautiful graph to impress that guy you like?
 Want to see your data on a map so you can see if it relates to geography?
 Want to create a regression model to test whether Bill from accounting is stealing your lunch?
 Want to streamline that statistical summary your boss keeps asking for?
 Need to turn your analysis into an interactive web dashboard?
 Want to radically shift a nation’s budget transparency?
Then R (and
increasingly Python) has you covered.
Yet there is perhaps a subtler reason adopting R made sense, and that’s the simple fact that by being ‘discipline agnostic’ it’s well-suited for multidisciplinary teams, applied multipotentialites and anyone uncertain about exactly where their career might take them.
4. R Helps Avoid Fitting the Problem to the Tool
As an economist, I love a good echo chamber. Not only does everybody speak my language and get my jokes, but my diagnosis of the problem is always spot-on. Unfortunately, thanks to the errors of others, I’m aware that such cosy teams of specialists aren’t always a good idea, with homogeneous specialist teams risking developing solutions which aren’t fit for purpose by too narrowly defining a problem and misunderstanding the scope of the system it’s embedded in.
While good
organizations are doing their best to address
this, creating teams that are multidisciplinary and have more
diverse networks can be a useful means to protect against these
risks while also driving better performance. Which of course stands
to be another useful advantage of using more general statistical tools with a
diverse user base like R: as you can more fluidly collaborate across
disciplines while being better able to pick the right technique for your
problem, reducing the risk that everything looks like a nail, merely because you have a hammer.
5. Programming Encourages Reproducibility
Yet programming languages also hold an additional advantage over more typical ‘point and click’ interfaces for conducting analysis: transparency and reproducibility.
For instance, because software like R encourages
you to write down each step in your analysis, your work is more likely to be ‘reproducible’
than had it been done using more traditional ‘point and click’ solutions. This
is because you’re encouraged to record each step needed to achieve the final
result making it easier for your colleagues to understand what the hell you’re
doing and increasing the likelihood you’ll be able to reproduce the results when
you need to (or
somebody else will).
In addition to this being practically useful for tracing your journey down the data-analysis maze, for analytical teams it can also serve as a means of encouraging collaboration by allowing others to more easily understand your work and replicate your results. Assisting with
organizational knowledge retention and providing an incentive for ensuring
analysis is accurate by often making it easier to spot errors before they
impact your analysis or soil
your reputation.
Finally, while the use of scripting isn’t unique to open-source programming languages, by being free, R and Python come with an additional advantage: should you decide to release your analysis, the potential audience is likely to be greater and more diverse than had it been written using proprietary software. Which is why, in a world of the “Open Government Partnership”, open-source programming languages make a lot of sense, providing a means of easing the transition towards governments publicly releasing their policy models.
6. R Helps Make Bytes Beautiful
As data-driven-everything becomes all the
rage, making data pretty is becoming an increasingly important skill. R is great at
this, with virtually unlimited options for unleashing your
creativity on the world and communicating your results to the masses. Bar
graphs, scatter diagrams, histograms and heat maps. Easy.
Just not pie graphs. They’re
terrible.
But R’s visualization tools don’t finish at your desk, with the ‘Shiny’ package allowing you to take your pie graphs to the big time by publishing interactive dashboards for the web. Boss asking you to redo a graph 20 times each day? Outsource your work to the web by automating it through a dashboard and send them a link while you sip cocktails at the beach.
7. R and Python are Free, but the Cost of Ignoring the Trend Towards Open-Source Statistics Won’t Be
Finally, R and Python are free, meaning not
only can you install it wherever you want, but that you can take it with you
throughout your career:
 Statistics lecturers prescribing you textbooks that are trying to get you hooked on expensive software that likely won’t exist when you graduate? Tell them it’s not 1999 and send them a link to this.
 Working for a not-for-profit organization that needs statistical software but can’t afford the costs of proprietary software? Install R and show them how to install Swirl’s free interactive lessons.
 Want to install it at home? No problem. You can even give a copy to your cat.
 Got a promotion and been gifted a new team of statisticians? Swap the Christmas bonuses for the gift that keeps on giving: R!
But I’m not here to tell you R (or Python) are
perfect. After all, there are good
reasons some companies are reluctant to switch their analysis to R or Python.
Nor am I interested in convincing you that it can, or should, replace every
proprietary tool you’re using. As I’m an avid spreadsheeter and programs
like Excel have distinct advantages.
Rather,
I’d like to suggest that for all the immediate costs involved in learning an
open-source programming language, whether it be R or Python, the long-term benefits
are more than likely to surpass them.
(Source)
Not only that, but as a new generation of data scientists continues to push for the use of open-source tools, it’s reasonable to expect R and Python will become as pervasive a business tool as the spreadsheet, and as important to your career as laughing at your boss’ terrible jokes.
Interested in learning R? Check out this
link here for a range of free resources.
You can also read my review of the online specialization I took to scale the R learning curve here.
Part 2, further comments on OfS grade-inflation report
(This article was first published on R – Let's Look at the Figures, and kindly contributed to R-bloggers)
Update, 2019-01-07: I am pleased to say that the online media article that I complained about in Section 1 below has now been amended by its author(s), to correct the false attributions. I am grateful to Chris Parr for helping to sort this out.
In my post a few days ago (which I’ll now call “Part 1”) I looked at aspects of the statistical methods used in a report by the UK government’s Office for Students, about “grade inflation” in English universities. This second post continues on the same topic.
In this Part 2 I will do two things:
 Set the record straight, in relation to some incorrect reporting of Part 1 in the specialist media.
 Suggest a new statistical method that (in my opinion) is better than the one used in the OfS report.
The more substantial stuff will be the second bullet there (and of course I wish I didn’t need to do the first bullet at all). In this post (at section 2 below) I will just outline a better method, by using the same artificial example that I gave in Part 1: hopefully that will be enough to give the general idea, to both specialist and non-specialist readers. Later I will follow up (in my intended Part 3) with a more detailed description of the suggested better method; that Part 3 post will be suitable mainly for readers with more specialist background in Statistics.
1. For the record

I am aware of two places where the analysis I gave in Part 1 has been reported:
 At https://www.researchprofessional.com/0/rr/he/agencies/ofs/2019/OfSgradeinflationanalysisnotfitforpurpose–saysexpert.html, an article entitled “OfS grade inflation analysis not fit for purpose, says expert”
 At https://www.researchresearch.com/news/article/?articleId=1379083, which seems to be a straight copy of the same article (I have not checked in detail).
The first link there is to a paywalled site, I think. The second one appears to be in the public domain. I do not recommend following either of those links, though! If anyone reading this wants to know about what I wrote in Part 1, then my advice is just to read Part 1 directly.
Here I want to mention three specific ways in which that article misrepresents what I wrote in Part 1. Points 2 and 3 here are the more important ones, I think (but #1 is also slightly troubling, to me):
 The article refers to my blog post as “a review commissioned by HE”. The reality is that a journalist called Chris Parr had emailed me just before Christmas. In the email Chris introduced himself as “I’m a journalist at Research Fortnight”, and the request he made in the email (in relation to the newly published OfS report) was “Would you or someone you know be interested in taking a look?”. I had heard of Research Fortnight. And I was indeed interested in taking a look at the methods used in the OfS report. But until the above-mentioned article came to my attention, I had never even heard of a publication named HE. Possibly I am mistaken in this, but to my mind the phrase “a review commissioned by HE” indicates some kind of formal arrangement between HE and me, with specified deliverables and perhaps even payment for the work. There was in fact no such “commission” for the work that I did. I merely spent some time during the Christmas break thinking about the methods used in the OfS report, and then I wrote a blog post (and told Chris Parr that I had done that). And let me repeat: I had never even heard of HE (nor of the article’s apparent author, which was not Chris Parr). No payment was offered or demanded. I mention all this here only in case anyone who has read that article got a wrong impression from it.
 The article contains this false statement: “The data is too complex for a reliable statistical method to be used, he said”. The “he” there refers to me, David Firth. I said no such thing, neither in my blog post nor in any email correspondence with Chris Parr. Indeed, it is not something I ever would say: the phrase “data…too complex for a reliable statistical method” is a nonsense.
 The article contains this false statement: “He calls the OfS analysis an example of Simpson’s paradox”. Again, the “he” in that statement refers to me. But I did not call the OfS analysis an example of Simpson’s paradox, either in my blog post or anywhere else. (And nor could I have, since I do not have access to the OfS dataset.) What I actually wrote in my blog post was that my own artificial, specially-constructed example was an instance of Simpson’s paradox — which is not even close to the same thing!
The article mentioned above seems to have had an agenda that was very different from giving a faithful and informative account of my comments on the OfS report. I suppose that’s journalistic license (although I would naively have expected better from a specialist publication to which my own university appears to subscribe). The false attribution of misleading statements is not something I can accept, though, and that is why I have written specifically about that here.
To be completely clear:
 The article mentioned above is misleading. I do not recommend it to anyone.
 All of my posts in this blog are my own work, not commissioned by anyone. In particular, none of what I’ll continue to write below (and also in Part 3 of this extended blog post, when I get to that), about the OfS report, was requested by any journalist.
I have to admit that in Part 1 I ran out of steam at one point, specifically where — in response to my own question about what would be a better way than the method used in the OfS report — I wrote “I do not have an answer”. I could have and should have done better than that.
Below I will outline a fairly simple approach that overcomes the specific pitfall I identified in Part 1, i.e., the fact that measurement at too high a level of aggregation can give misleading answers. I will demonstrate my suggested new approach through the same, contrived example that I used in Part 1. This should be enough to convey the basic idea, I hope. [Full generality for the analysis of real data will demand a more detailed and more technical treatment of a hierarchical statistical model; I’ll do that later, when I come to write Part 3.]
On reflection, I think a lot of the criticism attracted by the OfS report since its publication relates to the use of the word “explain” in that report. And indeed, that was a factor also in my own (mentioned above) “I do not have an answer” comment. It seems obvious — to me, anyway — that any serious attempt to explain apparent increases in the awarding of First Class degrees would need to take account of a lot more than just the attributes of students when they enter university. With the data used in the OfS report I think the best that one can hope to do is to measure those apparent increases (or decreases), in such a way that the measurement is a “fair” one that appropriately takes account of incoming student attributes and their fluctuation over time. If we take that attitude — i.e., that the aim is only to measure things well, not to explain them — then I do think it is possible to devise a better statistical analysis, for that purpose, than the one that was used in the OfS report.
(I fully recognise that this actually was the attitude taken in the OfS work! It is just unfortunate that the OfS report’s use of the word “explain”, which I think was intended there mainly as a technical word with its meaning defined by a statistical regression model, inevitably leads readers of the report to think more broadly about substantive explanations for any apparent changes in degree-class distributions.)
2.1 Those “toy” data again, and a better statistical model

Recall the setup of the simple example from Part 1: two academic years, two types of university, two types of student. The data are as follows:

2010-11      University A         University B
             Firsts  Other        Firsts  Other
        h      1000      0      h    500    500
        i         0   1000      i    500    500

2016-17      University A         University B
             Firsts  Other        Firsts  Other
        h      1800    200      h      0      0
        i         0      0      i    500   1500

Our measurement (of change) should reflect the fact that, for each type of student within each university, where information is available, the percentage awarded Firsts actually decreased (in this example).

Change in percent awarded Firsts:
  University A, student type h: 100% -> 90%
  University A, student type i: no data
  University B, student type h: no data
  University B, student type i: 50% -> 25%

This provides the key to specification of a suitable (statistical) measurement model:
 measure the changes at the lowest level of aggregation possible;
 then, if aggregate conclusions are wanted, combine the separate measurements in some sensible way.
In our simple example, “lowest level of aggregation possible” means that we should measure the change separately for each type of student within each university. (In the real OfS data, there’s a lower level of aggregation that will be more appropriate, since different degree courses within a university ought to be distinguished too — they have different student intakes, different teaching, different exam boards, etc.)
In Statistics this kind of analysis is often called a stratified analysis. The quantity of interest (which here is the change in % awarded Firsts) is measured separately in several pre-specified strata, and those measurements are then combined if needed (either through a formal statistical model, or less formally by simple or weighted averaging).
In our simple example above, there are 4 strata (corresponding to 2 types of student within each of 2 universities). In our specific dataset there is information about the change in just 2 of those strata, and we can summarize that information as follows:
 in University A, students of type h saw their percentage of Firsts reduced by 10% (from 100% down to 90%);
 in University B, students of type i saw their percentage of Firsts reduced by 50% (from 50% down to 25%).
That’s all the information in the data, about changes in the rate at which Firsts are awarded. (It was a deliberately small dataset!)
If a combined, “sector-wide” measure of change is wanted, then the separate, stratum-specific measures need to be combined somehow. To some extent this is arbitrary, and the choice of a combination method ought to depend on the purpose of such a sector-wide measure and (especially) on the interpretation desired for it. I might find time to write more about this later in Part 3.
For now, let me just recall what was the “sector-wide” measurement that resulted from analysis (shown in Part 1) of the above dataset using the OfS report’s method. The result obtained by that method was a sector-wide increase of 7.5% in the rate at which Firsts are awarded — which is plainly misleading in the face of data that shows substantial decreases in both universities. Whilst I do not much like the OfS report’s “compare with 2010” approach, it does have the benefit of transparency, and in my “toy” example it is easy to apply to the stratified analysis:
2016-17         Expected Firsts         Actual
                (based on 2010-11)
University A              2000            1800
University B              1000             500
Total                     3000            2300

— from which we could report a sector-wide decrease of 700/3000 = 23.3% in the awarding of Firsts, once student attributes are taken properly into account. (This could be viewed as just a suitably weighted average of the 10% and 50% decreases seen in University A and University B respectively.)
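The stratified arithmetic just described is simple enough to sketch in a few lines of R. (This is only an illustration of the calculation above, not the code from the linked R Markdown document; the object names are mine.)

```r
# Strata with data in both years: (University A, type h) and (University B, type i).
rate_2010   <- c(A_h = 1000 / 1000, B_i = 500 / 1000)  # proportion of Firsts in 2010-11
cohort_2016 <- c(A_h = 2000, B_i = 2000)               # students in 2016-17
actual_2016 <- c(A_h = 1800, B_i = 500)                # Firsts awarded in 2016-17

# Firsts expected in 2016-17 if each stratum had kept its 2010-11 rate
expected_2016 <- rate_2010 * cohort_2016               # 2000 and 1000

# Sector-wide change, stratified "compare with 2010" style
(sum(actual_2016) - sum(expected_2016)) / sum(expected_2016)  # -700/3000, i.e. -23.3%
```

The weighting by expected counts is what makes the combined figure a weighted average of the within-stratum decreases.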
As before, I have made the full R code available (as an update to my earlier R Markdown document). For those who don’t use R, I attach here also a PDF copy of that: gradeinflationexample.pdf
2.2 Generalising the better model: More strata, more time-points

The essential idea of a better measurement model is presented above in the context of a small “toy” example, but the real data are of course much bigger and more complex.
The key to generalising the model will simply be to recognise that it can be expressed in the form of a logistic regression model (that’s the same kind of model that was used in the OfS report; but the “better” logistic regression model structure is different, in that it needs to include a term that defines the strata within which measurement takes place).
This will be developed further in Part 3, which will be more technical in flavour than Parts 1 and 2 of this blog-post thread have been. Just by way of a taster, let me show here the mathematical form of the logistic-regression representation of the “toy” data analysis shown above. With notation
 u for providers (universities); u is either A or B in the toy example
 t for type of student; t is either h or i in the toy example
 y for years; y is either 2010-11 or 2016-17 in the toy example
 \(\pi_{uty}\) for the probability of a First in year y, for students of type t in university u
the logistic regression model corresponding to the analysis above is
\[
\log \frac{\pi_{uty}}{1 - \pi_{uty}} = \alpha_{ut} + \beta_{y}.
\]
This is readily generalized to situations involving more strata (more universities u and student types t, and also degreecourses within universities). There were just 4 stratum parameters in the above example, but more strata are easily accommodated.
The model is readily generalized also, in a similar way, to more than 2 years of data.
For comparison, the corresponding logistic regression model as used in the OfS report looks like this:
\[
\log \frac{\pi_{uty}}{1 - \pi_{uty}} = \alpha_{u} + \gamma_{t} + \beta_{y}.
\]
So it is superficially very similar. But the all-important term that determines the necessary strata within universities is missing from the OfS model.
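In R's model-formula notation the structural difference between the two analyses can be sketched as follows. (This is a sketch of the model structure only, not the OfS code; the variable names firsts, others, university, type and year are illustrative, and either formula would be passed to glm() with family = binomial.)

```r
# Stratified structure: a separate intercept for each university-by-type
# stratum, plus a common year effect.
strat_formula <- cbind(firsts, others) ~ university:type + year

# OfS-style structure: additive university and student-type effects only;
# the university:type stratum term is absent.
ofs_formula <- cbind(firsts, others) ~ university + type + year
```

The only difference is the interaction term, but it is exactly that term which makes the measurement happen within strata.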
I will aim to flesh this out a bit in a new Part 3 post within the next few days, if time permits. For now I suppose the model I’m suggesting here needs a name (i.e., a name that identifies it more clearly than just “my better model”!) Naming things is not my strong point, unfortunately! But, for now at least, I will term the analysis introduced above “stratified by available student attributes” — or “SASA model” for short.
(The key word there is “stratified”.)
© David Firth, January 2019
To cite this entry:
Firth, D. (2019). Part 2, further comments on OfS grade-inflation report. Weblog entry at URL https://statgeek.net/2019/01/07/part2furthercommentsonofsgradeinflationreport/
To leave a comment for the author, please follow the link and comment on their blog: R – Let's Look at the Figures.
RcppStreams 0.1.2
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
A maintenance release of RcppStreams arrived on CRAN this afternoon. RcppStreams brings the excellent Streamulus C++ template library for event stream processing to R.
Streamulus, written by Irit Katriel, uses very clever template metaprogramming (via Boost Fusion) to implement an embedded domainspecific event language created specifically for event stream processing.
This release provides a minor update, motivated by the upcoming BH release 1.69, for which we made a pre-release – and RcppStreams is one of three packages identified by that pre-release as needing minor accommodations for the new Boost release. We also updated the packaging to current CRAN standards, but made no other changes.
The NEWS file entries follow below:

Changes in version 0.1.2 (2019-01-05)

 Added symbol registration
 Converted a few files from CRLF to CR [CRAN checks]
 Added a few final newlines [CRAN checks]
 Disabled one optional output to enable BH 1.69 builds
Courtesy of CRANberries, there is also a copy of the DESCRIPTION file for this release. More detailed information is on the RcppStreams page and of course on the Streamulus page.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.
Dow Jones Stock Market Index (2/4): Trade Volume Exploratory Analysis
(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)
This is the second part of a four-part series of articles about the Dow Jones Stock Market. To read the first part, go to this link. In this part, I am going to analyze the Dow Jones Industrial Average (DJIA) trade volume.
Packages

The packages being used in this post series are herein listed.

suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(fBasics))
suppressPackageStartupMessages(library(lmtest))
suppressPackageStartupMessages(library(urca))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(quantmod))
suppressPackageStartupMessages(library(PerformanceAnalytics))
suppressPackageStartupMessages(library(rugarch))
suppressPackageStartupMessages(library(FinTS))
suppressPackageStartupMessages(library(forecast))
suppressPackageStartupMessages(library(strucchange))
suppressPackageStartupMessages(library(TSA))

Getting Data

We upload the environment status as saved at the end of part 1.

load(file='DowEnvironment.RData')

Daily Volume Exploratory Analysis

From the saved environment, we can find back our DJI object. We plot the daily volume.

dj_vol <- DJI[,"DJI.Volume"]
plot(dj_vol)

The level jump at the beginning of 2017 is remarkable, something that we will investigate in part 4.
We transform the volume time series data and timeline index into a dataframe.
dj_vol_df <- xts_to_dataframe(dj_vol)
head(dj_vol_df)
##   year     value
## 1 2007 327200000
## 2 2007 259060000
## 3 2007 235220000
## 4 2007 223500000
## 5 2007 225190000
## 6 2007 226570000
tail(dj_vol_df)
##      year     value
## 3015 2018 900510000
## 3016 2018 308420000
## 3017 2018 433080000
## 3018 2018 407940000
## 3019 2018 336510000
## 3020 2018 288830000

Basic statistics summary

(dj_stats <- dataframe_basicstats(dj_vol_df))
##                     2007         2008         2009         2010
## nobs        2.510000e+02 2.530000e+02 2.520000e+02 2.520000e+02
## NAs         0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## Minimum     8.640000e+07 6.693000e+07 5.267000e+07 6.840000e+07
## Maximum     4.571500e+08 6.749200e+08 6.729500e+08 4.598900e+08
## 1. Quartile 2.063000e+08 2.132100e+08 1.961850e+08 1.633400e+08
## 3. Quartile 2.727400e+08 3.210100e+08 3.353625e+08 2.219025e+08
## Mean        2.449575e+08 2.767164e+08 2.800537e+08 2.017934e+08
## Median      2.350900e+08 2.569700e+08 2.443200e+08 1.905050e+08
## Sum         6.148432e+10 7.000924e+10 7.057354e+10 5.085193e+10
## SE Mean     3.842261e+06 5.965786e+06 7.289666e+06 3.950031e+06
## LCL Mean    2.373901e+08 2.649672e+08 2.656970e+08 1.940139e+08
## UCL Mean    2.525248e+08 2.884655e+08 2.944104e+08 2.095728e+08
## Variance    3.705505e+15 9.004422e+15 1.339109e+16 3.931891e+15
## Stdev       6.087286e+07 9.489163e+07 1.157199e+08 6.270480e+07
## Skewness    9.422400e-01 1.203283e+00 1.037015e+00 1.452082e+00
## Kurtosis    1.482540e+00 2.064821e+00 6.584810e-01 3.214065e+00
##                     2011         2012         2013         2014
## nobs        2.520000e+02 2.500000e+02 2.520000e+02 2.520000e+02
## NAs         0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## Minimum     8.410000e+06 4.771000e+07 3.364000e+07 4.287000e+07
## Maximum     4.799800e+08 4.296100e+08 4.200800e+08 6.554500e+08
## 1. Quartile 1.458775e+08 1.107150e+08 9.488000e+07 7.283000e+07
## 3. Quartile 1.932400e+08 1.421775e+08 1.297575e+08 9.928000e+07
## Mean        1.804133e+08 1.312606e+08 1.184434e+08 9.288516e+07
## Median      1.671250e+08 1.251950e+08 1.109250e+08 8.144500e+07
## Sum         4.546415e+10 3.281515e+10 2.984773e+10 2.340706e+10
## SE Mean     3.897738e+06 2.796503e+06 2.809128e+06 3.282643e+06
## LCL Mean    1.727369e+08 1.257528e+08 1.129109e+08 8.642012e+07
## UCL Mean    1.880897e+08 1.367684e+08 1.239758e+08 9.935019e+07
## Variance    3.828475e+15 1.955108e+15 1.988583e+15 2.715488e+15
## Stdev       6.187468e+07 4.421660e+07 4.459353e+07 5.211034e+07
## Skewness    1.878239e+00 3.454971e+00 3.551752e+00 6.619268e+00
## Kurtosis    5.631080e+00 1.852581e+01 1.900989e+01 5.856136e+01
##                     2015         2016         2017         2018
## nobs        2.520000e+02 2.520000e+02 2.510000e+02 2.510000e+02
## NAs         0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## Minimum     4.035000e+07 4.589000e+07 1.186100e+08 1.559400e+08
## Maximum     3.445600e+08 5.734700e+08 6.357400e+08 9.005100e+08
## 1. Quartile 8.775250e+07 8.224250e+07 2.695850e+08 2.819550e+08
## 3. Quartile 1.192150e+08 1.203550e+08 3.389950e+08 4.179200e+08
## Mean        1.093957e+08 1.172089e+08 3.112396e+08 3.593710e+08
## Median      1.021000e+08 9.410500e+07 2.996700e+08 3.414700e+08
## Sum         2.756772e+10 2.953664e+10 7.812114e+10 9.020213e+10
## SE Mean     2.433611e+06 4.331290e+06 4.376432e+06 6.984484e+06
## LCL Mean    1.046028e+08 1.086786e+08 3.026202e+08 3.456151e+08
## UCL Mean    1.141886e+08 1.257392e+08 3.198590e+08 3.731270e+08
## Variance    1.492461e+15 4.727538e+15 4.807442e+15 1.224454e+16
## Stdev       3.863238e+07 6.875709e+07 6.933572e+07 1.106550e+08
## Skewness    3.420032e+00 3.046742e+00 1.478708e+00 1.363823e+00
## Kurtosis    1.612326e+01 1.122161e+01 3.848619e+00 3.277164e+00

In the following, we make specific comments on some of the relevant metrics shown above.
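The comments below rely on helper functions such as filter_dj_stats, which were defined in part 1 of this series. As a rough sketch of what filter_dj_stats presumably does (an assumption for illustration, not the part-1 code), it selects the years whose value for a given metric row exceeds a threshold:

```r
# Hypothetical re-implementation: return the years (column names) for which
# the given metric row of the basic-statistics table exceeds a threshold.
filter_dj_stats <- function(stats, metric, threshold) {
  colnames(stats)[which(stats[metric, ] > threshold)]
}
```

With dj_stats as built above, filter_dj_stats(dj_stats, "Mean", 0) would then return the years with positive mean.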
Mean

Years when Dow Jones daily volume has positive mean are:

filter_dj_stats(dj_stats, "Mean", 0)
##  [1] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016"
## [11] "2017" "2018"

All Dow Jones daily volume mean values in ascending order.

dj_stats["Mean",order(dj_stats["Mean",,])]
##          2014      2015      2016      2013      2012      2011      2010
## Mean 92885159 109395714 117208889 118443373 131260600 180413294 201793373
##           2007      2008      2009      2017      2018
## Mean 244957450 276716364 280053730 311239602 359371036

Median

Years when Dow Jones daily volume has positive median are:

filter_dj_stats(dj_stats, "Median", 0)
##  [1] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016"
## [11] "2017" "2018"

All Dow Jones daily volume median values in ascending order.

dj_stats["Median",order(dj_stats["Median",,])]
##            2014     2016      2015      2013      2012      2011      2010
## Median 81445000 94105000 102100000 110925000 125195000 167125000 190505000
##             2007      2009      2008      2017      2018
## Median 235090000 244320000 256970000 299670000 341470000

Skewness

Years when Dow Jones daily volume has positive skewness are:

filter_dj_stats(dj_stats, "Skewness", 0)
##  [1] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016"
## [11] "2017" "2018"

All Dow Jones daily volume skewness values in ascending order.

dj_stats["Skewness",order(dj_stats["Skewness",,])]
##             2007     2009     2008     2018     2010     2017     2011
## Skewness 0.94224 1.037015 1.203283 1.363823 1.452082 1.478708 1.878239
##              2016     2015     2012     2013     2014
## Skewness 3.046742 3.420032 3.454971 3.551752 6.619268

Excess Kurtosis

Years when Dow Jones daily volume has positive excess kurtosis are:

filter_dj_stats(dj_stats, "Kurtosis", 0)
##  [1] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016"
## [11] "2017" "2018"

All Dow Jones daily volume excess kurtosis values in ascending order.

dj_stats["Kurtosis",order(dj_stats["Kurtosis",,])]
##              2009    2007     2008     2010     2018     2017    2011
## Kurtosis 0.658481 1.48254 2.064821 3.214065 3.277164 3.848619 5.63108
##              2016     2015     2012     2013     2014
## Kurtosis 11.22161 16.12326 18.52581 19.00989 58.56136

Boxplots

dataframe_boxplot(dj_vol_df, "DJIA daily volume box plots 2007-2018")

The trade volume starts to decrease from 2010, and in 2017 a remarkable increase occurred. The 2018 volume has been even larger than that of 2017 and of the other years as well.
Density plots

dataframe_densityplot(dj_vol_df, "DJIA daily volume density plots 2007-2018")

Shapiro Tests

dataframe_shapirotest(dj_vol_df)
##            result
## 2007 6.608332e-09
## 2008 3.555102e-10
## 2009 1.023147e-10
## 2010 9.890576e-13
## 2011 2.681476e-16
## 2012 1.866544e-20
## 2013 6.906596e-21
## 2014 5.304227e-27
## 2015 2.739912e-21
## 2016 6.640215e-23
## 2017 4.543843e-12
## 2018 9.288371e-11

The null hypothesis of normality is rejected for all years.

QQ plots

dataframe_qqplot(dj_vol_df, "DJIA daily volume QQ plots 2007-2018")

QQ-plots visually confirm the non-normality of the daily trade volume distribution.
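dataframe_shapirotest is another of the part-1 helpers. A plausible minimal version (assumed for illustration, not the original code) applies shapiro.test to each year's values and collects the p-values, where a small p-value leads to rejecting normality:

```r
# Hypothetical sketch: Shapiro-Wilk normality test per year, returning a
# one-column data frame of p-values indexed by year.
dataframe_shapirotest <- function(df) {
  p <- tapply(df$value, df$year, function(x) shapiro.test(x)$p.value)
  data.frame(result = as.vector(p), row.names = names(p))
}
```

Note that shapiro.test requires between 3 and 5000 observations per group, which the roughly 250 trading days per year easily satisfy.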
Daily Volume Log-Ratio Exploratory Analysis

Similarly to log-returns, we can define the trade volume log-ratio as
\[
v_{t} := \ln \frac{V_{t}}{V_{t-1}}
\]
We can compute it by CalculateReturns within the PerformanceAnalytics package and plot it.
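Concretely, the log-ratio is just the first difference of the log volume, so base R's diff(log(.)) gives the same result as CalculateReturns(dj_vol, method = "log"). A quick check using the first few 2007 daily volumes listed earlier:

```r
# First 2007 daily volumes from the head(dj_vol_df) output above
v <- c(327200000, 259060000, 235220000, 223500000)

# v_t = ln(V_t / V_{t-1}) is the first difference of log(V)
round(diff(log(v)), 9)
# e.g. the first value is ln(259060000/327200000), approximately -0.233511910
```

Since 2007 opened with several declining volume days, the first few log-ratios come out negative.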
Mapping the trade volume log-ratio time series data and timeline index into a dataframe.
dj_vol_df <- xts_to_dataframe(dj_vol_log_ratio)
head(dj_vol_df)
##   year        value
## 1 2007 -0.233511910
## 2 2007 -0.096538449
## 3 2007 -0.051109832
## 4 2007  0.007533076
## 5 2007  0.006109458
## 6 2007  0.144221282
tail(dj_vol_df)
##      year       value
## 3014 2018  0.44563907
## 3015 2018 -1.07149878
## 3016 2018  0.33945998
## 3017 2018 -0.05980236
## 3018 2018 -0.19249224
## 3019 2018 -0.15278959

Basic statistics summary

(dj_stats <- dataframe_basicstats(dj_vol_df))
##                   2007       2008       2009       2010       2011
## nobs        250.000000 253.000000 252.000000 252.000000 252.000000
## NAs           0.000000   0.000000   0.000000   0.000000   0.000000
## Minimum      -1.606192  -1.122526  -1.071225  -1.050181  -2.301514
## Maximum       0.775961   0.724762   0.881352   1.041216   2.441882
## 1. Quartile  -0.123124  -0.128815  -0.162191  -0.170486  -0.157758
## 3. Quartile   0.130056   0.145512   0.169233   0.179903   0.137108
## Mean         -0.002685   0.001203  -0.001973  -0.001550   0.000140
## Median       -0.010972   0.002222  -0.031748  -0.004217  -0.012839
## Sum          -0.671142   0.304462  -0.497073  -0.390677   0.035162
## SE Mean       0.016984   0.016196   0.017618   0.019318   0.026038
## LCL Mean     -0.036135  -0.030693  -0.036670  -0.039596  -0.051141
## UCL Mean      0.030766   0.033100   0.032725   0.036495   0.051420
## Variance      0.072112   0.066364   0.078219   0.094041   0.170850
## Stdev         0.268536   0.257612   0.279677   0.306661   0.413341
## Skewness     -0.802037  -0.632586   0.066535  -0.150523   0.407226
## Kurtosis      5.345212   2.616615   1.500979   1.353797  14.554642
##                   2012       2013       2014       2015       2016
## nobs        250.000000 252.000000 252.000000 252.000000 252.000000
## NAs           0.000000   0.000000   0.000000   0.000000   0.000000
## Minimum      -2.158960  -1.386215  -2.110572  -1.326016  -1.336471
## Maximum       1.292956   1.245202   2.008667   1.130289   1.319713
## 1. Quartile  -0.152899  -0.145444  -0.144280  -0.143969  -0.134011
## 3. Quartile   0.144257   0.149787   0.134198   0.150003   0.141287
## Mean          0.001642  -0.002442   0.000200   0.000488   0.004228
## Median       -0.000010  -0.004922   0.013460   0.004112  -0.002044
## Sum           0.410521  -0.615419   0.050506   0.123080   1.065480
## SE Mean       0.021293   0.019799   0.023514   0.019010   0.019089
## LCL Mean     -0.040295  -0.041435  -0.046110  -0.036952  -0.033367
## UCL Mean      0.043579   0.036551   0.046510   0.037929   0.041823
## Variance      0.113345   0.098784   0.139334   0.091071   0.091826
## Stdev         0.336667   0.314299   0.373274   0.301780   0.303028
## Skewness     -0.878227  -0.297951  -0.209417  -0.285918   0.083826
## Kurtosis      8.115847   4.681120   9.850061   4.754926   4.647785
##                   2017       2018
## nobs        251.000000 251.000000
## NAs           0.000000   0.000000
## Minimum      -0.817978  -1.071499
## Maximum       0.915599   0.926101
## 1. Quartile  -0.112190  -0.119086
## 3. Quartile   0.110989   0.112424
## Mean         -0.000017   0.000257
## Median       -0.006322   0.003987
## Sum          -0.004238   0.064605
## SE Mean       0.013446   0.014180
## LCL Mean     -0.026500  -0.027671
## UCL Mean      0.026466   0.028185
## Variance      0.045383   0.050471
## Stdev         0.213032   0.224658
## Skewness      0.088511  -0.281007
## Kurtosis      3.411036   4.335748

In the following, we make specific comments on some of the relevant metrics shown above.
Mean

Years when Dow Jones daily volume log-ratio has positive mean are:

filter_dj_stats(dj_stats, "Mean", 0)
## [1] "2008" "2011" "2012" "2014" "2015" "2016" "2018"

All Dow Jones daily volume log-ratio mean values in ascending order.

dj_stats["Mean",order(dj_stats["Mean",,])]
##           2007      2013      2009     2010     2017    2011  2014
## Mean -0.002685 -0.002442 -0.001973 -0.00155 -1.7e-05 0.00014 2e-04
##          2018     2015     2008     2012     2016
## Mean 0.000257 0.000488 0.001203 0.001642 0.004228

Median

Years when Dow Jones daily volume log-ratio has positive median are:

filter_dj_stats(dj_stats, "Median", 0)
## [1] "2008" "2014" "2015" "2018"

All Dow Jones daily volume log-ratio median values in ascending order.

dj_stats["Median",order(dj_stats["Median",,])]
##             2009      2011      2007      2017      2013      2010
## Median -0.031748 -0.012839 -0.010972 -0.006322 -0.004922 -0.004217
##             2016   2012     2008     2018     2015    2014
## Median -0.002044 -1e-05 0.002222 0.003987 0.004112 0.01346

Skewness

Years when Dow Jones daily volume log-ratio has positive skewness are:

filter_dj_stats(dj_stats, "Skewness", 0)
## [1] "2009" "2011" "2016" "2017"

All Dow Jones daily volume log-ratio skewness values in ascending order.

dj_stats["Skewness",order(dj_stats["Skewness",,])]
##               2012      2007      2008      2013      2015      2018
## Skewness -0.878227 -0.802037 -0.632586 -0.297951 -0.285918 -0.281007
##               2014      2010     2009     2016     2017     2011
## Skewness -0.209417 -0.150523 0.066535 0.083826 0.088511 0.407226

Excess Kurtosis

Years when Dow Jones daily volume log-ratio has positive excess kurtosis are:

filter_dj_stats(dj_stats, "Kurtosis", 0)
##  [1] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016"
## [11] "2017" "2018"

All Dow Jones daily volume log-ratio excess kurtosis values in ascending order.

dj_stats["Kurtosis",order(dj_stats["Kurtosis",,])]
##              2010     2009     2008     2017     2018     2016    2013
## Kurtosis 1.353797 1.500979 2.616615 3.411036 4.335748 4.647785 4.68112
##              2015     2007     2012     2014     2011
## Kurtosis 4.754926 5.345212 8.115847 9.850061 14.55464

Boxplots

dataframe_boxplot(dj_vol_df, "DJIA daily volume box plots 2007-2018")

The most positive extreme values can be spotted in years 2011, 2014 and 2016; the most negative, in years 2007, 2011, 2012 and 2014.
Density plots

dataframe_densityplot(dj_vol_df, "DJIA daily volume density plots 2007-2018")

Shapiro Tests

dataframe_shapirotest(dj_vol_df)
##            result
## 2007 3.695053e-09
## 2008 6.160136e-07
## 2009 2.083475e-04
## 2010 1.500060e-03
## 2011 3.434415e-18
## 2012 8.417627e-12
## 2013 1.165184e-10
## 2014 1.954662e-16
## 2015 5.261037e-11
## 2016 7.144940e-11
## 2017 1.551041e-08
## 2018 3.069196e-09

Based on the reported p-values, we can reject the null hypothesis of normal distribution for all years.

QQ plots

dataframe_qqplot(dj_vol_df, "DJIA daily volume QQ plots 2007-2018")

Departure from normality can be spotted for all reported years.

Saving the current environment for further analysis.

save.image(file='DowEnvironment.RData')

If you have any questions, please feel free to comment below.
References

 Dow Jones Industrial Average
 Skewness
 Kurtosis
 An Introduction to Analysis of Financial Data with R, Wiley, Ruey S. Tsay
 Time Series Analysis and Its Applications, Springer, R.H. Shumway, D.S. Stoffer
 Applied Econometric Time Series, Wiley, W. Enders, 4th ed.
 Forecasting: Principles and Practice, OTexts, R.J. Hyndman
 Options, Futures and Other Derivatives, Pearson, J.C. Hull
 An introduction to the rugarch package
 Applied Econometrics with R, Springer, Achim Zeileis, Christian Kleiber
 GARCH modeling: diagnostic tests
Any securities or databases referred to in this post are solely for illustration purposes, and under no regard should the findings presented here be interpreted as investment advice or promotion of any particular security or source.
Related Post
 Dow Jones Stock Market Index (1/4): Log Returns Exploratory Analysis
 Six Sigma DMAIC Series in R – Part 4
 NYC buses: company level predictors with R
 Visualizations for correlation matrices in R
 Interpretation of the AUC
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.
February 21st & 22nd: End2End from a Keras/TensorFlow model to production
(This article was first published on Shirin's playgRound, and kindly contributed to Rbloggers)
Registration is now open for my 1.5-day workshop on how to develop end2end from a Keras/TensorFlow model to production.
It will take place on February 21st & 22nd in Berlin, Germany.
Please register by sending an email to shirin.glander@gmail.com with the following information:
 name
 company/institute/affiliation
 address for invoice
 phone number
 reference to this blog
The course material will be in English and we will speak a mix of German and English, depending on the participants’ preferences.
(The same course will also be given on 11th & 12th of April, 2019 in Münster, Germany.)
In this workshop, you will learn
 the basics of deep learning
 what cross-entropy and loss are
 about activation functions
 how to optimize weights and biases with backpropagation and gradient descent
 how to build (deep) neural networks with Keras and TensorFlow
 how to save and load models and model weights
 how to visualize models with TensorBoard
 how to make predictions on test data
 how to deploy models with Spotify Luigi and TensorFlow Serving
Timing the Same Algorithm in R, Python, and C++
(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)
While developing the RcppDynProg R package I took a little extra time to port the core algorithm from C++ to both R and Python.
This means I can time the exact same algorithm implemented nearly identically in each of these three languages. So I can extract some comparative “apples to apples” timings. Please read on for a summary of the results.
The algorithm in question is the general dynamic programming solution to the "minimum cost partition into intervals" problem. As coded in C++ it uses one-time allocation of large tables and then for-loops and index chasing to fill in the dynamic programming table solution. The C++ code is given here.
I then directly transliterated (or line-for-line translated) this code into R (code here) and Python (code here). Both of these implementations are very direct translations of the C++ solution, so they are possibly not what somebody starting in R or Python would design. So really we are coding in an imperative C style in R and Python. To emphasize the shallowness of the port I deliberately left the semicolons from the C++ in the R port. The Python can be taken to be equally "un-Pythonic" (for example, we are using for loops and not list comprehensions).
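To make the imperative, index-chasing style concrete, here is my own toy version of a "minimum cost partition into intervals" dynamic program in base R. It is a sketch under assumed costs (within-segment squared error plus a fixed per-segment penalty), not the RcppDynProg implementation:

```r
# Toy interval-partition dynamic program (my own sketch, not the package code).
# Cost of a segment x[i..j]: sum of squared deviations from the segment mean,
# plus a fixed penalty so that fewer segments are preferred.
partition_cost <- function(x, penalty = 1) {
  n <- length(x)
  seg_cost <- matrix(0, n, n)
  for (i in 1:n) {
    for (j in i:n) {
      s <- x[i:j]
      seg_cost[i, j] <- sum((s - mean(s))^2) + penalty
    }
  }
  # best[j]: minimum total cost of partitioning x[1..j] into intervals
  best <- rep(Inf, n)
  for (j in 1:n) {
    for (i in 1:j) {
      prev <- if (i > 1) best[i - 1] else 0
      best[j] <- min(best[j], prev + seg_cost[i, j])
    }
  }
  best[n]
}

# Two flat runs separated by a jump: the optimum is two segments,
# each costing only the per-segment penalty.
partition_cost(c(1, 1, 1, 10, 10, 10))
```

Note the C-style nested loops and explicit index arithmetic: this is exactly the pattern that R and Python interpreters execute slowly, as the timings below illustrate.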
That being said, we now have very similar code to compare in all three languages. We can summarize the timings (details here and here) as follows.
problem | solution | language | time in seconds
500 point partition into intervals | dynamic program | R | 21
500 point partition into intervals | dynamic program | C++ (from R via Rcpp) | 0.088
500 point partition into intervals | dynamic program | Python | 39
Notice that for this example C++ is about 240 times faster than R, and R is almost twice as fast as Python.
Neither R nor Python is optimized for the type of index-chasing this dynamic programming solution depends on. So we also took a look at a simpler problem: computing the PRESS statistic, which is easy to vectorize (a preferred way of writing efficient code in both R and Python). When we compare all three languages on this problem we see the following.
problem | solution method | time in seconds
3,000,000 point PRESS statistic calculation | R scalar code | 3.4
3,000,000 point PRESS statistic calculation | Rcpp scalar code | 0.26
3,000,000 point PRESS statistic calculation | R vectorized code | 0.35
3,000,000 point PRESS statistic calculation | Python vectorized (numpy) | 0.21
The timing details can be found here and here.
Ignoring the R scalar solution (which is too direct a translation from C++ to R, but a stepping stone to the R vectorized solution, as we discuss here), we see that vectorized Python is now about 1.6 times faster than the vectorized R, and even 1.2 times faster than the C++ (probably not due to Rcpp, but instead driven by my choice of container class in the C++ code).
Obviously different code (and per-language tuning and optimization) will give different results. But the above is consistent with our general experience with R, Python, and C++ in production.
In conclusion: R and Python are in fact much slower than C++ for direct scalar manipulation (single values, indexes, and pointers). However, R and Python are effective glue languages that can be fast when they are orchestrating operations over higher level abstractions (vectors, databases, data frames, and Spark).
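The scalar-versus-vectorized distinction can be seen in a toy version (my own sketch, not the authors' benchmark code) of a PRESS-style sum of squared prediction errors written both ways:

```r
# Scalar style: C-like index chasing, slow in interpreted R.
press_scalar <- function(y, pred) {
  total <- 0
  for (i in seq_along(y)) {
    d <- y[[i]] - pred[[i]]
    total <- total + d * d
  }
  total
}

# Vectorized style: one expression over whole vectors, fast in R.
press_vectorized <- function(y, pred) {
  sum((y - pred)^2)
}

set.seed(42)
y    <- rnorm(1e5)
pred <- y + rnorm(1e5, sd = 0.1)

# Both give the same answer; only the execution strategy differs.
isTRUE(all.equal(press_scalar(y, pred), press_vectorized(y, pred)))
```

Timing the two with `system.time()` on a few million points reproduces the qualitative gap in the table above: the vectorized form dispatches the loop to compiled code, the scalar form pays interpreter overhead per element.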
2018 Volatility Recap
(This article was first published on R – Quintuitive, and kindly contributed to R-bloggers)
2018 brought more volatility to the markets, which so far has spilled into 2019. Let's take a look at the long-term volatility history using the Dow Jones Industrial Average:
Indeed, 2018 was the most volatile year since 2011. Relatively speaking, however, the volatility is on the low end for a bear market, which I believe started in late December.
The above chart was produced using the following R code:
library(quantmod)
library(ggplot2)
library(ggthemes)
library(grid)

dji.annual.volatility = function(dji, year) {
   dates = paste("/", as.character(year), sep="")
   dji = na.exclude(dji[dates])
   djiVol = aggregate(
      dji,
      as.numeric(format(index(dji), "%Y")),
      function(ss) coredata(tail(TTR:::volatility(ss, n=NROW(ss), calc="close"), 1)))
   xx = ecdf(as.vector(djiVol))(as.numeric(tail(djiVol, 1)))
   print(xx)

   absRets = na.exclude(abs(ROC(dji[dates], type="discrete")))
   yy = as.numeric(format(index(absRets), "%Y"))
   zz = aggregate(absRets, yy, function(ss) tail(cumprod(1+ss), 1))
   print(as.vector(tail(zz, 1)))

   df = cbind(as.data.frame(index(djiVol)), coredata(djiVol))
   colnames(df) = c("Year", "Volatility")
   gg = qplot(x=Year, y=Volatility, data=df, geom="line", colour=Volatility,
              xlab="Year", ylab="Volatility")
   gg = gg + theme_solarized(base_size=16, base_family="verdana", light=TRUE)
   return(list(plot=gg, dji=dji, dji.vol=djiVol, crystal.ball=zz, df=df))
}
The post 2018 Volatility Recap appeared first on Quintuitive.
Scaling H2O analytics with AWS and p(f)urrr (Part 1)
(This article was first published on Digital Age Economist on Digital Age Economist, and kindly contributed to R-bloggers)
H2O + AWS + purrr (Part I)
In these small tutorials to follow over the next 3 weeks, I go through the steps of using an AWS[1] AMI Rstudio instance to run a toy machine learning example on a large AWS instance. I have to admit that you have to know a little bit about AWS to follow the next couple of steps, but be assured it doesn't take too much googling to find your way if you get confused at any stage by some of the jargon.
 Part I of this three-part series will show you how to get the infrastructure working on AWS.
 Part II will showcase how we can combine purrr and h2o to train and stack ML models[2].
 Part III, the final installment in the series, will show you how we can use furrr, reticulate/boto3 and AWS to spin up remote machines from your local work environment, but train remotely. Have you ever dreamt of running ML models on a 150-core, 2TB-RAM machine? Well, for a couple of dollars, I am going to show you how.
If you haven't got an AWS account already, set one up now if you want to follow along with the rest of this series.
Why I moved to H2O from caret
I have always been a big fan of the caret package. It was my introduction to machine learning in R. Its modular framework allowed for a very experimental environment. It allows for the exploration of effects such as class imbalance on the outcome, with options such as SMOTE, ROSE and Up and Down sampling, while easily allowing for K-fold (and repeated K-fold) cross-validation. It also lets the programmer configure a training grid for multiple hyperparameter specifications, allowing for very fine tuning of the models. Lastly, let's not forget the fact that there are over 200 models to play with.
H2O does most of this, but allows for a much more "programmatic" approach to the problems. After the first couple of interactions with the API, you start to enjoy the bullet-proof engineering of the underlying code. It does not offer the wide array of features of caret, but it has the big 5: glm(net), rf, gbm, xgboost, deeplearning, and also stacking of the models. H2O allows for a much easier programming journey to a production ML system. The robust engineering strategy allows us to scale the ML training for a much larger experimental design.
I think caret is very good if you only need to run a single model that you need to tune and experiment with – i.e. academia and small-scale ML applications. H2O on the other hand does a lot of the nitty gritty in the background for you, which is great for production:
 It removes constant columns
 It doesn’t break if it sees an unknown factor variable
 It has early stopping which is easily implemented
 It exploits multi-core and GPU processing in a much more effective manner
 It does crazy amounts of logging for any errors that might creep up[3]
Lastly, and certainly important if you're going to train a couple of models: it's fast!
In my case, I was looking for a profit optimisation model over multiple markets (~80), for which I wanted to introduce a rolling-window design, check the performance of rf, gbm and xgboost, and in the end put the model in production with a short delay. This is a lot of modeling, and H2O made all of this 'easy'ish.
Introducing AWS
If you haven't started migrating your analytics to the cloud, then hopefully this will convince you to start reconsidering. The opportunity to have access to a 64-, 96- or even 128-core machine with 2TB of RAM rarely crosses the path of most Data Scientists. This can mostly be credited to the fact that most of us don't really need such a large machine for what we need to achieve; see Szilard's twitter posts if you need convincing. Another reason we don't use these big machines is simply that we don't have access to such machines within our working environments. Luckily for us, access to cloud computing has become more accessible and, well, let's be honest, cheap as chips.
Using prebuilt AMIs
An AMI, or Amazon Machine Image, is a pre-built environment available to us as an all-inclusive environment containing the operating system and any other applications we might have installed, e.g. Rstudio, MariaDB, Docker etc.
One of these AMIs, specifically built for analytics and deep learning in Rstudio, is maintained by Louis Aslett. This AMI is designed to facilitate most applications using MCMC sampling such as Stan or JAGS. It also takes away the headache of setting up an instance that is able to accommodate the Keras and TensorFlow interfaces on a Linux machine. In addition, the image contains LaTeX and rmarkdown, along with various other installations to allow for a plug-and-play setup.
If you have ever tried to do any of the above yourself, then you will know what an achievement this is. So to Louis:
I have used Louis Aslett's AMI and added rJava, as well as installed the caret and h2o libraries for us to play with. This AMI is publicly available. We can use the search functionality to look for the AMI under public images: ami-0157656a8c5b46458. For security reasons, I advise that you change the default port on which the Rstudio instance listens from port 80 to something random such as port 8765 if you're going to run this in production. If you want to do this, you will need to reconfigure the nginx server conf under /etc/nginx/sites-available.
To access the Rstudio setup once you get to the end of this post, navigate to the Public IP as indicated in the screenshot at the end and log in using the predefined credentials. For now the image uses[4]:
 user: rstudio
 pass: rstudioh2ocaret
To use one of these AMIs, we need to log into Amazon and navigate to the EC2 dashboard. I use the Ohio region, so make sure that you set this region in the left-hand corner if you want to follow the setup of the AMI.
Well-documented instructions on how to set up an AWS account are easily available on multiple blogs, but I will show you how to find the AMI using the search function on the AWS website. Firstly, access your EC2 dashboard and go to the AMI tab on the left:
Setting up the AMI on a Spot instance
Whenever I use AWS, I prefer using Spot Instances:
A Spot Instance is an unused EC2 instance that is available for less than the On-Demand price. Because Spot Instances enable you to request unused EC2 instances at steep discounts, you can lower your Amazon EC2 costs significantly. The hourly price for a Spot Instance is called a Spot price.
I typically use these machines for anything from an hour to a couple of hours, so on-demand has never been a need. Never have the instances been a pain to get, nor have I been restricted in using an instance because usage suddenly spiked. These spot instances are the perfect way to keep costs low, while having access to HUGE machines.
Choose instance and configure
Right-click on the H2O_ami and select Spot Request. This will bring you to a screen showing the machine menu. For now, we won't get too carried away, so let's select the r3.4xlarge image. This machine is one of my favourites; although it's not part of the current generation of machines, it is a memory-optimised machine which works well for ML training:
 16 cores
 122GB RAM
 1 x 320GB SSD
 Spot Price: $0.1517/h
You can also configure the instance to shut down when a maximum spot price is reached. In steps 4 and 5 you can add storage and tags if you want. I don't show this, as we don't want to get too bogged down in the detail; we want to get to the part where we play with H2O.
Add security group
In step 6, you have to configure the security group of the instance. For now, we are not going to set up any crazy configurations. We will open port 22 (ssh) and 80 (http), which will allow us to access the machine and Rstudio from anywhere. If safety is a concern, you can be a lot more specific in the configurations.
The final step before the instance is launched is confirming you have the appropriate key pair to access the machine:
Once you have done this, you should be able to see your new instance being initialized in the Instance tab of your ec2 dashboard:
Well done, you have now spun up a machine that is probably around 3-4 times bigger than your current working environment on a laptop or PC. Remember to terminate this machine when you are done with it!
To check that the machine has the 16 cores and 122GB RAM as promised, we can SSH[5] into it and use the htop command to confirm:
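If you prefer plain commands to htop, a couple of standard utilities report the same numbers (a sketch assuming a stock Linux AMI; `free` is Linux-specific):

```shell
# Count logical CPUs (expect 16 on an r3.4xlarge)
getconf _NPROCESSORS_ONLN

# Total and available memory (expect ~122G total; Linux only)
free -h | head -n 2
```
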
To access the Rstudio IDE, open a browser and go to the public IP provided. Once you have entered the username and password, you should be given access to a welcome screen.
Well done! You now have access to an analytical system that is CPU/GPU enabled and can handle training of ML models using either caret or H2O.
I also recommend you read the Welcome.R file which contains information on the very helpful RStudioAMI package.
In the next post we're going to explore running models using the H2O API and combining it with purrr to create multiple-model tibbles for easy evaluation.
[1] Disclaimer: I don't work for AWS, it is just the platform that I ended up using over the past year or so. I am sure Google Cloud (or any other cloud service) is just as good.
[2] For those excited to see OLS in action, sorry to disappoint, those aren't ML models…
[3] I am looking at you, Error in nominalTrainWorkflow...
[4] In future I would like to adjust this to the specifications that Louis had. In the original AMI, the password is the instance ID, which is very clever – unfortunately I haven't been able to figure it out for this H2O AMI.
[5] If you do not know about SSH you are probably working on a Windows machine – follow these instructions to install PuTTY.
Cointegration and Mean Reverting Portfolio
(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)
In the previous post, https://statcompute.wordpress.com/2018/07/29/cointegration-and-pairs-trading, it was shown how to identify two cointegrated stocks for a pair trade. In the example below, I will show how to form a mean reverting portfolio with three or more cointegrated stocks, and also how to find the linear combination of these stocks that is stationary.
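Before the real prices, the idea can be seen on simulated data: series that share a common stochastic trend are individually non-stationary, but a suitable linear combination cancels the trend. This base-R sketch (my own toy example, not part of the original analysis) mimics that:

```r
set.seed(1)
n     <- 2000
trend <- cumsum(rnorm(n))   # shared random-walk ("unit root") component

# Three "prices" loading on the same trend, plus independent noise
p1 <- 1.00 * trend + rnorm(n, sd = 0.5)
p2 <- 0.50 * trend + rnorm(n, sd = 0.5)
p3 <- 0.25 * trend + rnorm(n, sd = 0.5)

# Weights chosen so the trend cancels: 1 - (1 * 0.5) - (2 * 0.25) = 0
spread <- p1 - 1 * p2 - 2 * p3

# The individual series wander; the spread hovers around a constant level
c(sd(p1), sd(spread))
```

The cointegration tests below (Phillips-Ouliaris, Johansen) do exactly this weight-finding on real price data, with formal inference attached.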
First of all, we downloaded series of three stock prices from finance.yahoo.com.
### GET DATA FROM YAHOO FINANCE
quantmod::getSymbols("FITB", from = "2010-01-01")
FITB <- get("FITB")[, 6]
quantmod::getSymbols("MTB", from = "2010-01-01")
MTB <- get("MTB")[, 6]
quantmod::getSymbols("BAC", from = "2010-01-01")
BAC <- get("BAC")[, 6]
For the residual-based cointegration test, we can utilize the Pu statistic in the Phillips-Ouliaris test to identify the cointegration among the three stocks. As shown below, the null hypothesis of no cointegration is rejected, indicating that these three stocks are cointegrated and therefore form a mean reverting portfolio. The test regression used to derive the residual for the statistical test is also given.
k <- trunc(4 + (length(FITB) / 100) ^ 0.25)
po.test <- urca::ca.po(cbind(FITB, MTB, BAC), demean = "constant", lag = "short", type = "Pu")
# Value of test-statistic is: 62.7037
# Critical values of Pu are:
#                   10pct    5pct    1pct
# critical values 33.6955 40.5252 53.8731
po.test@testreg
#                      Estimate Std. Error t value Pr(>|t|)
# (Intercept)         -1.097465   0.068588   16.00   <2e-16 ***
# z[, 1]MTB.Adjusted   0.152637   0.001487  102.64   <2e-16 ***
# z[, 1]BAC.Adjusted   0.140457   0.007930   17.71   <2e-16 ***
Based on the test regression output, a linear combination can be derived as [FITB + 1.097465 - 0.152637 * MTB - 0.140457 * BAC]. The ADF test result confirms that this linear combination of the three stocks is indeed stationary.
ts1 <- FITB + 1.097465 - 0.152637 * MTB - 0.140457 * BAC
tseries::adf.test(ts1, k = k)
# Dickey-Fuller = -4.1695, Lag order = 6, p-value = 0.01
Alternatively, we can utilize the Johansen test, which is based upon the likelihood ratio, to identify the cointegration. While the null hypothesis of no cointegration (r = 0) is rejected, the null hypothesis of r <= 1 cannot be rejected, suggesting that there exists one cointegration equation at the 5% significance level.
js.test <- urca::ca.jo(cbind(FITB, MTB, BAC), type = "trace", K = k, spec = "longrun", ecdet = "const")
#           test 10pct  5pct  1pct
# r <= 2 |  3.26  7.52  9.24 12.97
# r <= 1 | 19.72 17.85 19.96 24.60
# r = 0  | 45.88 32.00 34.91 41.07
js.test@V
#                  FITB.Adjusted.l6 MTB.Adjusted.l6 BAC.Adjusted.l6  constant
# FITB.Adjusted.l6        1.0000000        1.000000        1.000000 1.0000000
# MTB.Adjusted.l6        -0.1398349        0.542546        0.522351 0.1380191
# BAC.Adjusted.l6        -0.1916826        1.548169        3.174651 0.9654671
# constant                0.6216917       17.844653       20.329085 6.8713179
Similarly, based on the above eigenvectors, a linear combination can be derived as [FITB + 0.6216917 - 0.1398349 * MTB - 0.1916826 * BAC]. The ADF test result also confirms that this linear combination of the three stocks is stationary.
ts2 <- FITB + 0.6216917 - 0.1398349 * MTB - 0.1916826 * BAC
tseries::adf.test(ts2, k = k)
# Dickey-Fuller = -4.0555, Lag order = 6, p-value = 0.01
Making Original Bingo – Heart Theme
(This article was first published on R on Chi's Impe[r]fect Blog, and kindly contributed to R-bloggers)
I have learned how to draw a heart with a mathematical equation; in fact there are so many "curves" you can draw with equations. Amazing, right?!? You can find all sorts of different curves on the Wolfram MathWorld site. I'm really curious how people found the equations for some of these shapes.
Also, at the last family reunion, I played "Bingo", which I hadn't played for ages! It's a great game when you have a wide range of age groups, because the kids were having tons of fun, but so were the adults and grandparents!
So I've decided to create a little twist on bingo…! Instead of just drawing numbers between 1 and 75, you can play bingo with words or pictures. Since I just really wanted to use my new heart-shape drawing function, I decided I'd make up a bingo with some words related to Valentine's Day.
Drawing Heart Shape with ggplot2
You can look at a past entry on how to draw cannabis here. Same concept.
There were 6 different heart shape examples on the Wolfram site, but I liked the 6th heart best. You can look at the equation here.
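For reference, that sixth MathWorld heart in its plain form is x = 16 sin^3(t), y = 13 cos(t) - 5 cos(2t) - 2 cos(3t) - cos(4t); the helper function below adds centering and scaling on top of these equations. A quick base-R sanity check:

```r
# The sixth MathWorld heart curve, before any centering or scaling.
t <- seq(-pi, pi, length.out = 201)
x <- 16 * sin(t)^3
y <- 13 * cos(t) - 5 * cos(2 * t) - 2 * cos(3 * t) - cos(4 * t)

# The heart is symmetric about the y-axis:
# x is an odd function of t, y is an even one.
isTRUE(all.equal(x, -rev(x))) && isTRUE(all.equal(y, rev(y)))
```
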
library(tidyverse)
library(patchwork)

## Function to generate a heart shape around point xc and yc with some sizing.
## The output of the function is a data frame with a bunch of points which you
## can use to draw a heart!
# http://mathworld.wolfram.com/HeartCurve.html
heart_vertices <- function(xc, yc, size, npoints=100, ...){
  # area = pi*r*r for a circle... this heart has area of 180 = r*r
  radius <- size*0.05   ## I'm not set on this... I just wanted to make size smaller
  yc <- yc + 0.1*radius ## Just adjusting the center of the heart a bit
  t <- seq(-pi, pi, length.out=npoints+1)
  x <- xc + 16 * radius * (sin(t))^3
  y <- yc + radius*13*cos(t) - radius*5*cos(2*t) - radius*2*cos(3*t) - radius*cos(4*t)
  df <- tibble(theta=t, x=x, y=y)
  return(df)
}

## The above function generates the points you need to draw a heart. If you set
## npoints higher, you can draw a smoother shape.
heart_vertices(0,0,1) %>%
  ggplot() +
  geom_line(aes(x=theta, y=x), color="#D22042de") +
  geom_line(aes(x=theta, y=-x), color="#D22042de", linetype=3) + ## just to make a design
  geom_line(aes(x=theta, y=y), color="#30C4C9de") +
  geom_line(aes(x=theta, y=-y), color="#30C4C9de", linetype=3) + ## just to make a design
  geom_polygon(aes(x=x, y=y), fill="#615375de", ## to draw the heart use x=x, y=y
               color="#61537520", size=10) +    ## playing around with the line around the filled part
  theme_minimal(base_family="Roboto Condensed") +
  scale_x_continuous(breaks=c(-pi, -pi/2, 0, pi/2, pi),
                     labels=c("-pi","","0","","pi")) +
  coord_fixed() +
  labs(x="", y="", caption="pink solid line = x value & blue solid line = y value")
Drawing Bunch of Hearts on Grid
Now that I know how to draw a single heart, I want to be able to plot a bunch of hearts on a grid. Since the function heart_vertices creates a data frame for a single heart around point xc and yc, I can create a grid with coordinates xc and yc.
my_grid <- tibble(
  xc = rep(c(1:15), times=5),
  yc = rep(c(1:5), each=15),
  size = 0.6,
  id = c(1:75)
)

my_grid %>%
  ggplot(aes(x=xc, y=yc)) +
  geom_point(shape=21, size=10) +
  geom_text(aes(label=id), family="Roboto Condensed", fontface="bold") +
  theme_minimal() +
  coord_fixed()

## For each point on the grid, generate the points to draw a heart
print_out_base <- my_grid %>%
  mutate(v=pmap(., heart_vertices)) %>%
  unnest(v) %>%
  ggplot(aes(x=x, y=y, group=id)) +
  geom_polygon(aes(fill=factor(yc))) +
  coord_fixed() +
  theme_minimal(base_family="Roboto Condensed") +
  scale_x_continuous(breaks=seq(0.5, 16.5, by=1), labels=NULL) +
  scale_y_continuous(breaks=seq(0.5, 5.5, by=1), labels=NULL) +
  scale_fill_manual(values=c("#30bcad","#57a337","#f8b620","#e03426","#eb73b3"),
                    guide="none")

print_out_base
Putting Words On Top of Hearts
Now that I have the hearts placed on a grid, I want some words on top. I took inspiration from the candies with sayings that I often see during Valentine's Day week. I didn't know the name of the candy, but I think it's called Necco Sweethearts.
## Needs at least 75 words... since there are 75 hearts, each with a word placed on it.
## Some are not from those candies, I just made some up.
love_msgs <- tibble(
  msgs = c("143", "#1 FAN", "#LOVE", "1-800\nCUPID", "HUG ME", "KISSES", "BE MINE",
           "CRAZY\n4 U", "HOLD\nHANDS", "UR\nLOVED", "PURR FECT", "WOO", "QT PIE",
           "RECIPE\n4 LOVE", "RISING STAR", "TABLE\nFOR 2", "TOO SWEET", "TWEET",
           "TWO HEARTS", "TXT ME", "UR HOT", "WHATS UP", "DESTINY", "WICKED COOL",
           "WINK\nWINK", "STUNNING", "XOXO", "YOU&ME", "YUM\nYUM", "SOUL\nMATE", "BABE",
           "SAY YES", "HELLO", "DREAM\nBIG", "BFF", "HIGH\nFIVE", "AWESOME",
           "SMILE", "UR\nGR8", "PHONE\nME", "LOVE\nBIRD", "BE TRUE", "SURE LOVE",
           "MY BABY", "HI GORGEOUS", "HOT\nSTUFF", "ADORE\nME", "FUN", "LOL", "CALL ME",
           "PICK ME", "DEAR\nONE", "EVER\nAFTER", "LOVER", "ALL\nMINE", "ANGEL",
           "RU SHY", "SWEET PEA", "LOVE\nBUG", "ADORABLE", "EMBRACE", "FLOWERS",
           "CHERISH", "CHOCOLATE", "CUPCAKES", "CRUSH", "SECRET\nADMIRER",
           "VALENTINE", "DOVES", "LOVEBIRDS", "DIAMONDS", "PAARTY", "HONEY",
           "PASSION", "AWWW")
)

## Here you want to make sure you have at least 75 words!
love_msgs <- love_msgs %>%
  arrange(msgs) %>% ## sort them alphabetically... it makes it easier to find the words that were picked out
  mutate(idx=row_number()-1, row_group=floor(idx/15)+1)

print_out_base +
  geom_text(aes(x=xc, y=yc, label=love_msgs$msgs),
            color="#ffffffde", family="Roboto Condensed",
            size=3, fontface="bold", lineheight=0.8,
            data=. %>% filter(theta==pi)) +
  labs(title="Print & Cut Them Into Pieces & Draw Them Out of Hat or Box",
       x="", y="", caption="")
Making Bingo Cards
Similarly, now I want to make the 5 x 5 bingo cards with which each person gets to participate in the game. I've generated 4 cards as an example.
## Making Bingo Cards (Base Design)
bingo_base <- tibble(
  xc = rep(c(1:5), times=5),
  yc = rep(c(1:5), each=5),
  size = 0.6,
  id = c(1:25)
) %>%
  mutate(v=pmap(., heart_vertices)) %>%
  unnest(v) %>%
  ggplot(aes(x=x, y=y, group=id)) +
  geom_polygon(aes(fill=factor(xc))) +
  geom_polygon(fill="#000000de", data=. %>% filter(xc==3, yc==3)) +
  theme_minimal(base_family="Roboto Condensed") +
  scale_x_continuous(breaks=c(1,2,3,4,5), labels=c("B","I","N","G","O"), position="top") +
  scale_y_continuous(labels=NULL) +
  scale_fill_manual(values=c("#30bcad","#57a337","#f8b620","#e03426","#eb73b3"), guide="none") +
  labs(x="", y="") +
  coord_fixed()

## Just to make the cards a little more fun, let's add some quotes about love on each card.
library(rvest)
love_quotes <- read_html("https://lifehacks.io/inspirational-love-quotes-sayings/") %>%
  html_nodes("h2") %>%
  html_text()
love_quotes <- love_quotes[2:64]

## Function to create one bingo card with randomly selected words on each row.
make_card <- function(name="") {
  love_msgs_list <- love_msgs %>% split(.$row_group)
  unique_card <- tibble(
    xc = rep(c(1:5), each=5),
    yc = rep(c(1:5), times=5),
    ## from each list I want to select 5 randomly
    msg = love_msgs_list %>% map(., "msgs") %>% map(., sample, 5) %>% unlist()
  )
  unique_card <- unique_card %>% mutate(msg=ifelse(xc==3 & yc==3, "FREE", msg))
  bingo_card <- bingo_base +
    geom_text(data=unique_card,
              aes(x=xc, y=yc, label=msg, group=NULL),
              family="Roboto Condensed", fontface="bold", color="white", size=3) +
    labs(title=str_c(name), caption=sample(love_quotes, size=1))
  bingo_card
}

## Using patchwork, print 4 cards
make_card("BINGO CARD 1") + make_card("BINGO CARD 2") +
  make_card("BINGO CARD 3") + make_card("BINGO CARD 4") +
  patchwork::plot_layout(ncol=2)
Bonus: Drawing Flowers To Go With Hearts
Just thought it would also be nice to draw flowers too. After all, flowers go with hearts :). You can read more about the rose curve here or here.
flower_vertices <- function(xc, yc, radius, k=5, npoints=300, ...){
  t <- seq(0, 2*pi, length.out=npoints+1)
  m <- sqrt(radius) * cos(k * t)
  x <- xc + m * cos(t)
  y <- yc + m * sin(t)
  df <- tibble(t=t, x=x, y=y, r=m)
  return(df)
}

flower_vertices(0, 0, 1, 7) %>%
  ggplot(aes(x=t)) +
  geom_line(aes(y=x), color="red", linetype=3) +
  geom_line(aes(y=y), color="blue", linetype=3) +
  geom_polygon(aes(x=x, y=y), alpha=0.5) +
  theme_minimal(base_family="Roboto Condensed") +
  coord_fixed() +
  labs(title="Rose Curve with K=7 - Flower with 7 Petals")

tibble(
  xc=rep(c(1:5), times=5),
  yc=rep(c(1:5), each=5),
  radius=0.1,
  k = c(1:25),
  id=c(1:25)
) %>%
  mutate(v=pmap(., flower_vertices)) %>%
  unnest(v) %>%
  ggplot(aes(x=x, y=y, group=id)) +
  geom_polygon(aes(fill=id %% 2)) +
  geom_point(aes(x=xc, y=yc), data=. %>% count(id, xc, yc), size=3, shape=19, alpha=0.7) +
  geom_text(aes(x=x, y=y, label=k), family="Roboto Condensed", size=8, vjust=1,
            fontface="bold", color="#000000ae",
            data=. %>% group_by(id) %>% filter(max(t)==t)) +
  theme_void(base_family="Roboto Condensed") +
  coord_fixed() +
  scale_y_reverse() +
  scale_fill_viridis_c(begin=0.2, end=0.7, option="magma", guide="none", alpha=0.8) +
  labs(title="Rose Curves with different k", subtitle="r = cos(k * theta)")
Flower Needs Butterfly Too..
There's also another curve, called the "butterfly curve".
I think a flower deserves a butterfly… So here's the butterfly curve, drawn in a similar manner as the above.
butterfly_vertices <- function(xc, yc, npoints=1000, ...){
  t <- seq(0, 12*pi, length.out=npoints+1)
  x <- xc + sin(t)*(exp(cos(t)) - 2*cos(4*t) - sin(t/12)^5)
  y <- yc + cos(t)*(exp(cos(t)) - 2*cos(4*t) - sin(t/12)^5)
  df <- tibble(x=x, y=y, t=t) %>% mutate(pos=row_number())
  return(df)
}

ggplot() +
  geom_path(data=butterfly_vertices(1,1), aes(x=x, y=y, color=pos)) +
  geom_polygon(data=butterfly_vertices(8,1), aes(x=x, y=y, fill=factor(floor(t/pi))),
               color="#000000de") +
  coord_fixed() +
  theme_void() +
  scale_fill_viridis_d(alpha=0.3, guide="none") +
  scale_color_viridis_c(option="magma", guide="none")
gganimation for the nation
(This article was first published on HighlandR, and kindly contributed to R-bloggers)
gganimate hits CRAN –
At the NHS_R conference back in October, I showed a few ways of building animations using gifski – I also wrote up the method I used in an earlier post right here.
And, the source code for all this stuff is available from my github here.
However – the gganimate file in the repo is now out of date, so use the code down below.
Now at the time of the conference, the method I used to build animations involved a fair bit of slicing, dicing and a bit of purrr to build up individual plots and stitch them together. It was a workaround because gganimate was on github at the time, and installing packages from github onto NHS computers is likely to be highly problematic, to put it mildly.
But, the New Year comes, and despite the fact the country is going completely to pot, there are still things to celebrate because gganimate has now hit CRAN.
So, join me as I take my mind off it all by updating some of the code I used with this wonderful new package.
Fair Warning – regular reader(s) will have seen these plots before.
Sorry.
However, I still think it’s (vaguely) worthwhile writing this because while there were / are quite a few examples of how to use gganimate with points, I couldn’t find one that shows how to keep the old points on screen while new points are added (Disclaimer – I haven’t put too much effort into additional checking of the latest examples before writing this).
If you are using lines and/or segments in your plots, then transition_reveal keeps the history of the line over time, but it did not work with geom_point.
Here’s the setup :
source("1_setup.R")
source("2_data_wrangling.R")
library(gganimate)

# Change "Transfer In" or "Transfer Out" to "Transfer"
plot_data$Movement_Type <- gsub("Transfer.*", "Transfer", x = plot_data$Movement_Type)

# Convert Movement_Type to factor, as the first sequence of dots turns red instead of green
plot_data$Movement_Type <- as_factor(plot_data$Movement_Type)

# Check the levels
levels(plot_data$Movement_Type)
plot_data$Movement_Type <- forcats::fct_rev(plot_data$Movement_Type)
levels(plot_data$Movement_Type)

lims <- as.POSIXct(strptime(c("2014-09-03 00:00", "2014-09-04 01:00"),
                            format = "%Y-%m-%d %H:%M"))

p <- ggplot(plot_data, aes(Movement15, Movement_15_SEQNO, colour = Movement_Type)) +
  geom_jitter(width = 0.10) +
  scale_colour_manual(values = plot_colours) +
  facet_grid(Staging_Post ~ ., switch = "y") +
  # facet_wrap(vars(Staging_Post), ncol = 1) +
  scale_x_datetime(date_labels = "%H:%M", date_breaks = "3 hours",
                   limits = lims, timezone = "GMT", expand = c(0, 0)) +
  ggtitle(label = "Anytown General Hospital - Wednesday 3rd September 2014 00:00 to 23:59\n",
          subtitle = "A&E AND INPATIENT ARRIVALS, DEPARTURES AND TRANSFERS") +
  labs(x = NULL, y = NULL,
       caption = "@_johnmackintosh | johnmackintosh.com | Source: Neil Pettinger | @kurtstat | kurtosis.co.uk") +
  theme_minimal(base_size = 11) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  theme(axis.text.x = element_text(size = 7)) +
  theme(axis.ticks.x = element_blank()) +
  theme(legend.position = "bottom") +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_blank()) +
  theme(strip.text.y = element_text(angle = 180)) +
  guides(color = guide_legend("Movement Type")) +
  ggExtra::removeGrid()

You will need to hit up my repo to grab the source files, and obviously you'll want to change the titles and labels.
Now, here is the gganimate bit:
p <- p +
  labs(title = 'Time: {frame_time}', x = 'Time In/Out', y = NULL) +
  transition_time(MovementDateTime) +
  shadow_mark(past = TRUE, future = FALSE) +
  ease_aes('linear')

p

# OR
# animate(p, fps = 5, width = 1000, height = 600)

# Save it if you want to:
# anim_save("newgganimate.gif")

transition_time(), by itself, plotted each dot where I expected it, but did not keep the previous points on screen.
I experimented with some other options (i.e. shadow_wake and shadow_trail), but it seems shadow_mark is the magic sauce for this particular task.
I could be completely wrong about that, and maybe there is a better way, in which case, let me know in the comments.
Here is the final gif:
Also at the event, I didn’t get a chance to demonstrate the Shiny app I built.
It was my first ever attempt, and while admittedly it’s nothing earth shattering, you have to start somewhere right?
It’s over here, but please lower your expectations. There is only one dropdown to interact with:
https://johnm.shinyapps.io/NHS_R_PatientFlow/
And with that, I stop thinking about R for a bit.
See you next month.
Dow Jones Stock Market Index (1/4): Log Returns Exploratory Analysis
(This article was first published on R Programming – DataScience+, and kindly contributed to Rbloggers)
In this four-post series, I am going to analyze the Dow Jones Industrial Average (DJIA) index over the years 2007-2018. The Dow Jones Industrial Average (DJIA) is a stock market index that indicates the value of thirty large, publicly owned companies based in the United States. The value of the DJIA is based upon the sum of the price of one share of stock for each component company. The sum is corrected by a factor which changes whenever one of the component stocks has a stock split or stock dividend. See ref. [1] for further details.
The main questions I will try to answer are:
- How have the returns and trade volume changed over the years?
- How has the volatility of the returns and trade volume changed over the years?
- How can we model the returns volatility?
- How can we model the trade volume volatility?
Part 1: getting data, summaries, and plots of daily and weekly log-returns
Part 2: getting data, summaries, and plots of daily trade volume and its log-ratio
Part 3: daily log-returns analysis and GARCH model definition
Part 4: daily trade volume analysis and GARCH model definition
The packages being used in this post series are herein listed.
suppressPackageStartupMessages(library(lubridate))            # dates manipulation
suppressPackageStartupMessages(library(fBasics))              # enhanced summary statistics
suppressPackageStartupMessages(library(lmtest))               # coefficients significance tests
suppressPackageStartupMessages(library(urca))                 # unit root tests
suppressPackageStartupMessages(library(ggplot2))              # visualization
suppressPackageStartupMessages(library(quantmod))             # getting financial data
suppressPackageStartupMessages(library(PerformanceAnalytics)) # calculating returns
suppressPackageStartupMessages(library(rugarch))              # GARCH modeling
suppressPackageStartupMessages(library(FinTS))                # ARCH test
suppressPackageStartupMessages(library(forecast))             # ARMA modeling
suppressPackageStartupMessages(library(strucchange))          # structural changes
suppressPackageStartupMessages(library(TSA))                  # ARMA order identification

To the right, I added a short comment to highlight the main functionality I took advantage of. Of course, those packages offer much more than what I indicated.
Package versions are listed below.
packages <- c("lubridate", "fBasics", "lmtest", "urca", "ggplot2", "quantmod",
              "PerformanceAnalytics", "rugarch", "FinTS", "forecast", "strucchange", "TSA")
version <- lapply(packages, packageVersion)
version_c <- do.call(c, version)
data.frame(packages = packages, version = as.character(version_c))
##                packages version
## 1             lubridate   1.7.4
## 2               fBasics 3042.89
## 3                lmtest  0.9.36
## 4                  urca   1.3.0
## 5               ggplot2   3.1.0
## 6              quantmod  0.4.13
## 7  PerformanceAnalytics   1.5.2
## 8               rugarch   1.4.0
## 9                 FinTS   0.4.5
## 10             forecast     8.4
## 11          strucchange   1.5.1
## 12                  TSA     1.2

The R language version running on Windows 10 is:
R.version
##                _
## platform       x86_64-w64-mingw32
## arch           x86_64
## os             mingw32
## system         x86_64, mingw32
## status
## major          3
## minor          5.2
## year           2018
## month          12
## day            20
## svn rev        75870
## language       R
## version.string R version 3.5.2 (2018-12-20)
## nickname       Eggshell Igloo

Getting Data

Taking advantage of the getSymbols() function made available within the quantmod package, we get the Dow Jones Industrial Average from 2007 up to the end of 2018.
suppressMessages(getSymbols("^DJI", from = "2007-01-01", to = "2019-01-01"))
## [1] "DJI"
dim(DJI)
## [1] 3020    6
class(DJI)
## [1] "xts" "zoo"

Let us have a look at the DJI xts object, which provides six time series, as we can see.
head(DJI)
##            DJI.Open DJI.High  DJI.Low DJI.Close DJI.Volume DJI.Adjusted
## 2007-01-03 12459.54 12580.35 12404.82  12474.52  327200000     12474.52
## 2007-01-04 12473.16 12510.41 12403.86  12480.69  259060000     12480.69
## 2007-01-05 12480.05 12480.13 12365.41  12398.01  235220000     12398.01
## 2007-01-08 12392.01 12445.92 12337.37  12423.49  223500000     12423.49
## 2007-01-09 12424.77 12466.43 12369.17  12416.60  225190000     12416.60
## 2007-01-10 12417.00 12451.61 12355.63  12442.16  226570000     12442.16
tail(DJI)
##            DJI.Open DJI.High  DJI.Low DJI.Close DJI.Volume DJI.Adjusted
## 2018-12-21 22871.74 23254.59 22396.34  22445.37  900510000     22445.37
## 2018-12-24 22317.28 22339.87 21792.20  21792.20  308420000     21792.20
## 2018-12-26 21857.73 22878.92 21712.53  22878.45  433080000     22878.45
## 2018-12-27 22629.06 23138.89 22267.42  23138.82  407940000     23138.82
## 2018-12-28 23213.61 23381.88 22981.33  23062.40  336510000     23062.40
## 2018-12-31 23153.94 23333.18 23118.30  23327.46  288830000     23327.46

More precisely, we have available the OHLC (Open, High, Low, Close) index values, the adjusted close value and the trade volume. Here we can see the corresponding chart as produced by chartSeries within the quantmod package.
chartSeries(DJI, type = "bars", theme = "white")

We herein analyse the adjusted close value.
dj_close <- DJI[, "DJI.Adjusted"]

Simple and log returns
Simple returns are defined as:
\[
R_{t} := \frac{P_{t}}{P_{t-1}} - 1
\]
Log returns are defined as:
\[
r_{t} := \ln \frac{P_{t}}{P_{t-1}} = \ln(1+R_{t})
\]
See ref. [4] for further details.
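The identity between the two definitions is easy to check numerically. A base-R sketch with made-up prices (independent of the packages used in this post):

```r
# Toy price series (hypothetical values, for illustration only)
p <- c(100, 102, 101, 105)

simple_ret <- p[-1] / p[-length(p)] - 1   # R_t = P_t / P_{t-1} - 1
log_ret    <- diff(log(p))                # r_t = ln(P_t / P_{t-1})

# r_t = ln(1 + R_t) holds term by term
all.equal(log_ret, log(1 + simple_ret))   # TRUE
```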
We compute log returns by taking advantage of CalculateReturns within the PerformanceAnalytics package.
dj_ret <- CalculateReturns(dj_close, method = "log")
dj_ret <- na.omit(dj_ret)
head(dj_ret)
##             DJI.Adjusted
## 2007-01-04  0.0004945580
## 2007-01-05 -0.0066467273
## 2007-01-08  0.0020530973
## 2007-01-09 -0.0005547987
## 2007-01-10  0.0020564627
## 2007-01-11  0.0058356461
tail(dj_ret)
##            DJI.Adjusted
## 2018-12-21 -0.018286825
## 2018-12-24 -0.029532247
## 2018-12-26  0.048643314
## 2018-12-27  0.011316355
## 2018-12-28 -0.003308137
## 2018-12-31  0.011427645

This gives the following plot.
plot(dj_ret)

Sharp increases and decreases in volatility can be eyeballed. That will be verified in depth in part 3.
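One simple way to quantify what the eye picks up is a rolling standard deviation of the returns. This is a base-R sketch on simulated data (not the DJIA series); `roll_sd` is a hypothetical helper, not from the original post:

```r
# Simulated returns: a calm regime followed by a turbulent one
set.seed(7)
ret <- c(rnorm(100, sd = 0.01), rnorm(100, sd = 0.03))

# Rolling k-observation standard deviation
roll_sd <- function(x, k) {
  vapply(seq_len(length(x) - k + 1),
         function(i) sd(x[i:(i + k - 1)]),
         numeric(1))
}

rs <- roll_sd(ret, k = 20)
# The rolling estimate jumps when the turbulent regime starts
mean(rs[1:50]) < mean(rs[150:181])   # TRUE
```

GARCH models, introduced in part 3, formalize exactly this kind of time-varying volatility.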
Helper Functions

We need some helper functions to ease some basic data conversions, summaries and plots. Here below are the details.
1. Conversion from xts to a data frame with a year and a value column, which allows for summaries and plots on a yearly basis.
xts_to_dataframe <- function(data_xts) {
  df_t <- data.frame(year = factor(year(index(data_xts))),
                     value = coredata(data_xts))
  colnames(df_t) <- c("year", "value")
  df_t
}

2. Enhanced summary statistics for data stored as data frame columns.
# Gathering the basicStats data frame output row names that get lost with tapply()
bs_rn <- rownames(basicStats(rnorm(10, 0, 1)))

dataframe_basicstats <- function(dataset) {
  result <- with(dataset, tapply(value, year, basicStats))
  df_result <- do.call(cbind, result)
  rownames(df_result) <- bs_rn
  as.data.frame(df_result)
}

3. Filtering the basic statistics data frame on a row threshold to return the associated column names.
filter_dj_stats <- function(dataset_basicstats, metricname, threshold) {
  r <- which(rownames(dataset_basicstats) == metricname)
  colnames(dataset_basicstats[r, which(dataset_basicstats[r, ] > threshold), drop = FALSE])
}

4. Boxplots with faceting based on years.
dataframe_boxplot <- function(dataset, title) {
  p <- ggplot(data = dataset, aes(x = year, y = value)) +
    theme_bw() +
    theme(legend.position = "none") +
    geom_boxplot(fill = "lightblue")
  p <- p + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    ggtitle(title) + ylab("year")
  p
}

5. Density plots with faceting based on years.
dataframe_densityplot <- function(dataset, title) {
  p <- ggplot(data = dataset, aes(x = value)) +
    geom_density(fill = "lightblue")
  p <- p + facet_wrap(. ~ year)
  p <- p + theme_bw() + theme(legend.position = "none") + ggtitle(title)
  p
}

6. QQ plots with faceting based on years.
dataframe_qqplot <- function(dataset, title) {
  p <- ggplot(data = dataset, aes(sample = value)) +
    stat_qq(colour = "blue") + stat_qq_line()
  p <- p + facet_wrap(. ~ year)
  p <- p + theme_bw() + theme(legend.position = "none") + ggtitle(title)
  p
}

7. Shapiro test
shp_pvalue <- function(v) {
  shapiro.test(v)$p.value
}

dataframe_shapirotest <- function(dataset) {
  result <- with(dataset, tapply(value, year, shp_pvalue))
  as.data.frame(result)
}

Daily Log-returns Exploratory Analysis

We transform our original Dow Jones time series into a data frame with year and value columns. That will ease plots and summaries by year.
dj_ret_df <- xts_to_dataframe(dj_ret)
head(dj_ret_df)
##   year         value
## 1 2007  0.0004945580
## 2 2007 -0.0066467273
## 3 2007  0.0020530973
## 4 2007 -0.0005547987
## 5 2007  0.0020564627
## 6 2007  0.0058356461
tail(dj_ret_df)
##      year        value
## 3014 2018 -0.018286825
## 3015 2018 -0.029532247
## 3016 2018  0.048643314
## 3017 2018  0.011316355
## 3018 2018 -0.003308137
## 3019 2018  0.011427645

Basic statistics summary

A basic statistics summary is given.
(dj_stats <- dataframe_basicstats(dj_ret_df))
##                   2007       2008       2009       2010       2011       2012
## nobs        250.000000 253.000000 252.000000 252.000000 252.000000 250.000000
## NAs           0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
## Minimum      -0.033488  -0.082005  -0.047286  -0.036700  -0.057061  -0.023910
## Maximum       0.025223   0.105083   0.066116   0.038247   0.041533   0.023376
## 1. Quartile  -0.003802  -0.012993  -0.006897  -0.003853  -0.006193  -0.003896
## 3. Quartile   0.005230   0.007843   0.008248   0.004457   0.006531   0.004924
## Mean          0.000246  -0.001633   0.000684   0.000415   0.000214   0.000280
## Median        0.001098  -0.000890   0.001082   0.000681   0.000941  -0.000122
## Sum           0.061427  -0.413050   0.172434   0.104565   0.053810   0.070054
## SE Mean       0.000582   0.001497   0.000960   0.000641   0.000837   0.000470
## LCL Mean     -0.000900  -0.004580  -0.001207  -0.000848  -0.001434  -0.000645
## UCL Mean      0.001391   0.001315   0.002575   0.001678   0.001861   0.001206
## Variance      0.000085   0.000567   0.000232   0.000104   0.000176   0.000055
## Stdev         0.009197   0.023808   0.015242   0.010182   0.013283   0.007429
## Skewness     -0.613828   0.224042   0.070840  -0.174816  -0.526083   0.027235
## Kurtosis      1.525069   3.670796   2.074240   2.055407   2.453822   0.842890
##                   2013       2014       2015       2016       2017       2018
## nobs        252.000000 252.000000 252.000000 252.000000 251.000000 251.000000
## NAs           0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
## Minimum      -0.023695  -0.020988  -0.036402  -0.034473  -0.017930  -0.047143
## Maximum       0.023263   0.023982   0.038755   0.024384   0.014468   0.048643
## 1. Quartile  -0.002812  -0.002621  -0.005283  -0.002845  -0.001404  -0.005017
## 3. Quartile   0.004750   0.004230   0.005801   0.004311   0.003054   0.005895
## Mean          0.000933   0.000288  -0.000090   0.000500   0.000892  -0.000231
## Median        0.001158   0.000728  -0.000211   0.000738   0.000655   0.000695
## Sum           0.235068   0.072498  -0.022586   0.125884   0.223790  -0.057950
## SE Mean       0.000403   0.000432   0.000613   0.000501   0.000263   0.000714
## LCL Mean      0.000139  -0.000564  -0.001298  -0.000487   0.000373  -0.001637
## UCL Mean      0.001727   0.001139   0.001118   0.001486   0.001410   0.001175
## Variance      0.000041   0.000047   0.000095   0.000063   0.000017   0.000128
## Stdev         0.006399   0.006861   0.009738   0.007951   0.004172   0.011313
## Skewness     -0.199407  -0.332766  -0.127788  -0.449311  -0.189808  -0.522618
## Kurtosis      1.275821   1.073234   1.394268   2.079671   2.244076   2.802996

In the following, we comment on some of the relevant metrics shown above.
Mean

Years when the Dow Jones daily log-returns have a positive mean:
filter_dj_stats(dj_stats, "Mean", 0)
## [1] "2007" "2009" "2010" "2011" "2012" "2013" "2014" "2016" "2017"

All Dow Jones daily log-returns mean values in ascending order:
dj_stats["Mean", order(dj_stats["Mean",,])]
##           2008      2018   2015     2011     2007    2012     2014
## Mean -0.001633 -0.000231 -9e-05 0.000214 0.000246 0.00028 0.000288
##          2010  2016     2009     2017     2013
## Mean 0.000415 5e-04 0.000684 0.000892 0.000933

Median

Years when the Dow Jones daily log-returns have a positive median:
filter_dj_stats(dj_stats, "Median", 0)
## [1] "2007" "2009" "2010" "2011" "2013" "2014" "2016" "2017" "2018"

All Dow Jones daily log-returns median values in ascending order:
dj_stats["Median", order(dj_stats["Median",,])]
##            2008      2015      2012     2017     2010     2018     2014
## Median -0.00089 -0.000211 -0.000122 0.000655 0.000681 0.000695 0.000728
##            2016     2011     2009     2007     2013
## Median 0.000738 0.000941 0.001082 0.001098 0.001158

Skewness

A distribution has positive skewness when it tends to produce more extreme positive values than negative ones. Negative skew commonly indicates that the tail is on the left side of the distribution, and positive skew indicates that the tail is on the right. Here is a sample picture to explain.
Years when the Dow Jones daily log-returns have positive skewness:
filter_dj_stats(dj_stats, "Skewness", 0)
## [1] "2008" "2009" "2012"

All Dow Jones daily log-returns skewness values in ascending order:
dj_stats["Skewness", order(dj_stats["Skewness",,])]
##               2007      2011      2018      2016      2014      2013
## Skewness -0.613828 -0.526083 -0.522618 -0.449311 -0.332766 -0.199407
##               2017      2010      2015     2012    2009     2008
## Skewness -0.189808 -0.174816 -0.127788 0.027235 0.07084 0.224042

Excess Kurtosis

The kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. The kurtosis of any univariate normal distribution is 3. The excess kurtosis is equal to the kurtosis minus 3 and eases the comparison with the normal distribution. The basicStats function within the fBasics package reports the excess kurtosis. Here is a sample picture to explain.
Years when the Dow Jones daily log-returns have positive excess kurtosis:
filter_dj_stats(dj_stats, "Kurtosis", 0)
##  [1] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016"
## [11] "2017" "2018"

All Dow Jones daily log-returns excess kurtosis values in ascending order:
dj_stats["Kurtosis", order(dj_stats["Kurtosis",,])]
##             2012     2014     2013     2015     2007     2010    2009
## Kurtosis 0.84289 1.073234 1.275821 1.394268 1.525069 2.055407 2.07424
##              2016     2017     2011     2018     2008
## Kurtosis 2.079671 2.244076 2.453822 2.802996 3.670796

Year 2018 has the closest excess kurtosis to 2008.
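Both shape metrics can be reproduced by hand with the plain moment definitions, which is a useful cross-check against basicStats. A base-R sketch on simulated data (note: basicStats may apply slightly different sample corrections; the functions below are simple, uncorrected estimators):

```r
# Moment-based sample skewness and excess kurtosis
skewness_m    <- function(x) mean((x - mean(x))^3) / sd(x)^3
ex_kurtosis_m <- function(x) mean((x - mean(x))^4) / sd(x)^4 - 3

set.seed(42)
z <- rnorm(1e5)   # normal draws: skewness ~ 0, excess kurtosis ~ 0
u <- runif(1e5)   # uniform draws: excess kurtosis ~ -1.2 (thin tails)

c(skew_z = skewness_m(z), exkurt_z = ex_kurtosis_m(z),
  exkurt_u = ex_kurtosis_m(u))
```

The uniform sample illustrates why excess kurtosis can be negative: distributions with thinner tails than the normal score below zero, exactly as several of the weekly yearly samples do later in this post.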
Boxplots

dataframe_boxplot(dj_ret_df, "DJIA daily log-returns box plots 2007-2018")

We can see how the most extreme values occurred in 2008. Starting in 2009 the range of values narrows, with the exception of 2011 and 2015. However, comparing 2017 with 2018, the increased tendency to produce extreme values in the last year is remarkable.
Density plots

dataframe_densityplot(dj_ret_df, "DJIA daily log-returns density plots 2007-2018")

Year 2007 has remarkable negative skewness. Year 2008 is characterized by flatness and extreme values. The 2017 peakedness is in contrast with the flatness and left skewness of 2018.
Shapiro Tests
dataframe_shapirotest(dj_ret_df)
##            result
## 2007 5.989576e-07
## 2008 5.782666e-09
## 2009 1.827967e-05
## 2010 3.897345e-07
## 2011 5.494349e-07
## 2012 1.790685e-02
## 2013 8.102500e-03
## 2014 1.750036e-04
## 2015 5.531137e-03
## 2016 1.511435e-06
## 2017 3.304529e-05
## 2018 1.216327e-07

The null hypothesis of normality is rejected for all years 2007-2018.
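For readers unfamiliar with the test, here is a minimal sketch of shapiro.test on simulated data (simulated, not the DJIA series); a clearly skewed sample yields a tiny p-value and normality is rejected:

```r
set.seed(1)
x_skewed <- rexp(500)        # right-skewed, clearly non-normal sample

st <- shapiro.test(x_skewed)
st$p.value < 0.05            # TRUE: the null of normality is rejected
```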
QQ plots
dataframe_qqplot(dj_ret_df, "DJIA daily log-returns QQ plots 2007-2018")

Strong departure from normality of the daily log returns (and hence log-normality of the discrete returns) is visible in years 2008, 2009, 2010, 2011 and 2018.
Weekly Log-returns Exploratory Analysis

The weekly log returns can be computed starting from the daily ones. Suppose we analyse the trading week spanning days {t-4, t-3, t-2, t-1, t} and know the closing price at day t-5 (the last day of the previous week). We define the weekly log-return as:
\[
r_{t}^{w} := \ln \frac{P_{t}}{P_{t-5}}
\]
That can be rewritten as:
\[
\begin{equation}
\begin{aligned}
r_{t}^{w} &= \ln \frac{P_{t} P_{t-1} P_{t-2} P_{t-3} P_{t-4}}{P_{t-1} P_{t-2} P_{t-3} P_{t-4} P_{t-5}} \\
&= \ln \frac{P_{t}}{P_{t-1}} + \ln \frac{P_{t-1}}{P_{t-2}} + \ln \frac{P_{t-2}}{P_{t-3}} + \ln \frac{P_{t-3}}{P_{t-4}} + \ln \frac{P_{t-4}}{P_{t-5}} \\
&= r_{t} + r_{t-1} + r_{t-2} + r_{t-3} + r_{t-4}
\end{aligned}
\end{equation}
\]
Hence the weekly log return is the sum of daily log returns applied to the trading week window.
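This additivity is easy to verify numerically. A base-R sketch with made-up closing prices:

```r
# Six consecutive closes: P_{t-5}, ..., P_t (hypothetical values)
p <- c(100, 101, 99, 102, 104, 103)

daily_log_ret <- diff(log(p))        # r_{t-4}, ..., r_t
weekly_direct <- log(p[6] / p[1])    # ln(P_t / P_{t-5})

all.equal(sum(daily_log_ret), weekly_direct)   # TRUE
```

This is the property apply.weekly(dj_ret, sum) relies on in the next step.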
We take a look at the Dow Jones weekly log-returns.
dj_weekly_ret <- apply.weekly(dj_ret, sum)
plot(dj_weekly_ret)

The plot shows sharp increases and decreases in volatility.
We transform the original time series data and index into a dataframe.
dj_weekly_ret_df <- xts_to_dataframe(dj_weekly_ret)
dim(dj_weekly_ret_df)
## [1] 627   2
head(dj_weekly_ret_df)
##   year         value
## 1 2007 -0.0061521694
## 2 2007  0.0126690596
## 3 2007  0.0007523559
## 4 2007 -0.0062677053
## 5 2007  0.0132434177
## 6 2007 -0.0057588519
tail(dj_weekly_ret_df)
##     year       value
## 622 2018  0.05028763
## 623 2018 -0.04605546
## 624 2018 -0.01189714
## 625 2018 -0.07114867
## 626 2018  0.02711928
## 627 2018  0.01142764

Basic statistics summary
(dj_stats <- dataframe_basicstats(dj_weekly_ret_df))
##                  2007      2008      2009      2010      2011      2012
## nobs        52.000000 52.000000 53.000000 52.000000 52.000000 52.000000
## NAs          0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
## Minimum     -0.043199 -0.200298 -0.063736 -0.058755 -0.066235 -0.035829
## Maximum      0.030143  0.106977  0.086263  0.051463  0.067788  0.035316
## 1. Quartile -0.009638 -0.031765 -0.015911 -0.007761 -0.015485 -0.010096
## 3. Quartile  0.014808  0.012682  0.022115  0.016971  0.014309  0.011887
## Mean         0.001327 -0.008669  0.003823  0.002011  0.001035  0.001102
## Median       0.004244 -0.006811  0.004633  0.004529  0.001757  0.001166
## Sum          0.069016 -0.450811  0.202605  0.104565  0.053810  0.057303
## SE Mean      0.002613  0.006164  0.004454  0.003031  0.003836  0.002133
## LCL Mean    -0.003919 -0.021043 -0.005115 -0.004074 -0.006666 -0.003181
## UCL Mean     0.006573  0.003704  0.012760  0.008096  0.008736  0.005384
## Variance     0.000355  0.001975  0.001051  0.000478  0.000765  0.000237
## Stdev        0.018843  0.044446  0.032424  0.021856  0.027662  0.015382
## Skewness    -0.680573 -0.985740  0.121331 -0.601407 -0.076579 -0.027302
## Kurtosis    -0.085887  5.446623 -0.033398  0.357708  0.052429 -0.461228
##                  2013      2014      2015      2016      2017      2018
## nobs        52.000000 52.000000 53.000000 52.000000 52.000000 53.000000
## NAs          0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
## Minimum     -0.022556 -0.038482 -0.059991 -0.063897 -0.015317 -0.071149
## Maximum      0.037702  0.034224  0.037693  0.052243  0.028192  0.050288
## 1. Quartile -0.001738 -0.006378 -0.012141 -0.007746 -0.002251 -0.011897
## 3. Quartile  0.011432  0.010244  0.009620  0.012791  0.009891  0.019857
## Mean         0.004651  0.001756 -0.000669  0.002421  0.004304 -0.001093
## Median       0.006360  0.003961  0.000954  0.001947  0.004080  0.001546
## Sum          0.241874  0.091300 -0.035444  0.125884  0.223790 -0.057950
## SE Mean      0.001828  0.002151  0.002609  0.002436  0.001232  0.003592
## LCL Mean     0.000981 -0.002563 -0.005904 -0.002470  0.001830 -0.008302
## UCL Mean     0.008322  0.006075  0.004567  0.007312  0.006778  0.006115
## Variance     0.000174  0.000241  0.000361  0.000309  0.000079  0.000684
## Stdev        0.013185  0.015514  0.018995  0.017568  0.008886  0.026154
## Skewness    -0.035175 -0.534403 -0.494963 -0.467158  0.266281 -0.658951
## Kurtosis    -0.200282  0.282354  0.665460  2.908942 -0.124341 -0.000870

In the following, we comment on some of the relevant metrics shown above.
Mean

Years when the Dow Jones weekly log-returns have a positive mean:
filter_dj_stats(dj_stats, "Mean", 0)
## [1] "2007" "2009" "2010" "2011" "2012" "2013" "2014" "2016" "2017"

All mean values in ascending order:
dj_stats["Mean", order(dj_stats["Mean",,])]
##           2008      2018      2015     2011     2012     2007     2014
## Mean -0.008669 -0.001093 -0.000669 0.001035 0.001102 0.001327 0.001756
##          2010     2016     2009     2017     2013
## Mean 0.002011 0.002421 0.003823 0.004304 0.004651

Median

Years when the Dow Jones weekly log-returns have a positive median:
filter_dj_stats(dj_stats, "Median", 0)
##  [1] "2007" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017"
## [11] "2018"

All median values in ascending order:
dj_stats["Median", order(dj_stats["Median",,])]
##            2008     2015     2012     2018     2011     2016     2014
## Median -0.006811 0.000954 0.001166 0.001546 0.001757 0.001947 0.003961
##           2017     2007     2010     2009    2013
## Median 0.00408 0.004244 0.004529 0.004633 0.00636

Skewness

Years when the Dow Jones weekly log-returns have positive skewness:
filter_dj_stats(dj_stats, "Skewness", 0)
## [1] "2009" "2017"

All skewness values in ascending order:
dj_stats["Skewness", order(dj_stats["Skewness",,])]
##              2008      2007      2018      2010      2014      2015
## Skewness -0.98574 -0.680573 -0.658951 -0.601407 -0.534403 -0.494963
##               2016      2011      2013      2012     2009     2017
## Skewness -0.467158 -0.076579 -0.035175 -0.027302 0.121331 0.266281

Excess Kurtosis

Years when the Dow Jones weekly log-returns have positive excess kurtosis:
filter_dj_stats(dj_stats, "Kurtosis", 0)
## [1] "2008" "2010" "2011" "2014" "2015" "2016"

All excess kurtosis values in ascending order:
dj_stats["Kurtosis", order(dj_stats["Kurtosis",,])]
##               2012      2013      2017      2007      2009     2018
## Kurtosis -0.461228 -0.200282 -0.124341 -0.085887 -0.033398 -0.00087
##              2011     2014     2010    2015     2016     2008
## Kurtosis 0.052429 0.282354 0.357708 0.66546 2.908942 5.446623

Year 2008 also has the highest weekly kurtosis. However, in this scenario 2017 has negative kurtosis, and 2016 has the second-highest value.
Boxplots
dataframe_boxplot(dj_weekly_ret_df, "DJIA weekly log-returns box plots 2007-2018")

Density plots

dataframe_densityplot(dj_weekly_ret_df, "DJIA weekly log-returns density plots 2007-2018")

Shapiro Tests

dataframe_shapirotest(dj_weekly_ret_df)
##            result
## 2007 0.0140590311
## 2008 0.0001397267
## 2009 0.8701335006
## 2010 0.0927104389
## 2011 0.8650874270
## 2012 0.9934600084
## 2013 0.4849043121
## 2014 0.1123139646
## 2015 0.3141519756
## 2016 0.0115380989
## 2017 0.9465281164
## 2018 0.0475141869

At the 5% level, the null hypothesis of normality is rejected for years 2007, 2008, 2016 and, marginally, 2018.
QQ plots
dataframe_qqplot(dj_weekly_ret_df, "DJIA weekly log-returns QQ plots 2007-2018")

Strong departure from normality is particularly visible in year 2008.
Saving the current environment for further analysis.
save.image(file = 'DowEnvironment.RData')

If you have any questions, please feel free to comment below.
References

1. Dow Jones Industrial Average
2. Skewness
3. Kurtosis
4. An Introduction to Analysis of Financial Data with R, Wiley, Ruey S. Tsay
5. Time Series Analysis and Its Applications, Springer, R.H. Shumway, D.S. Stoffer
6. Applied Econometric Time Series, Wiley, W. Enders, 4th ed.
7. Forecasting: Principles and Practice, OTexts, R.J. Hyndman
8. Options, Futures and Other Derivatives, Pearson, J.C. Hull
9. An introduction to the rugarch package
10. Applied Econometrics with R, Springer, Achim Zeileis, Christian Kleiber
11. GARCH modeling: diagnostic tests
Any securities or databases referred in this post are solely for illustration purposes, and under no regard should the findings presented here be interpreted as investment advice or promotion of any particular security or source.
Structural Analysis of Bayesian VARs with an example using the Brazilian Development Bank
(This article was first published on R – insightR, and kindly contributed to Rbloggers)
By Gabriel Vasconcelos
Introduction

Vector Autoregressive (VAR) models are very popular in economics because they can model a system of economic variables and their relations. Bayesian VARs are receiving a lot of attention due to their ability to deal with larger systems and the smart use of priors. For example, in this old post I showed an example of large Bayesian VARs to forecast covariance matrices. In this post I will show how to use the same model to obtain impulse response coefficients and perform structural analysis. The estimation follows Bańbura et al. (2010) and the empirical application is from Barboza and Vasconcelos (2019). The objective is to measure the effects of the Brazilian Development Bank on investment. Therefore, we will measure how investment responds over time to an increase in loans.
VAR

A structural VAR is described by the following equation:

\[
B_{0} y_{t} = c_{0} + B_{1} y_{t-1} + \dots + B_{p} y_{t-p} + \varepsilon_{t}
\]
where the \(B_{i}\) are square matrices of coefficients, \(c_{0}\) is a vector of constants, \(\varepsilon_{t}\) is a vector of errors and \(y_{t}\) is the vector with all variables in the system. In this sense, \(y_{t}\) is not a single value as in a univariate autoregressive model; it contains several variables that are used at the same time in the model. For example, \(y_{t}\) could contain the GDP, investment and the interest rate. In that case each matrix \(B_{i}\) would be \(3 \times 3\).
In VAR models we do not estimate the equation above in a single step. First we estimate the VAR in its reduced form (this is the equation we use to compute forecasts) and then we use some identification strategy to recover the structural form. The reduced form is presented below:

\[
y_{t} = c + A_{1} y_{t-1} + \dots + A_{p} y_{t-p} + u_{t}
\]
where \(c = B_{0}^{-1} c_{0}\), \(A_{i} = B_{0}^{-1} B_{i}\) and \(u_{t} = B_{0}^{-1} \varepsilon_{t}\). When we do structural analysis we want to know how some variable responds over time to a shock on some other variable. However, shocks in the reduced form are correlated between equations and cannot be isolated. Therefore, an important difference between the reduced form and the structural form is that in the latter the error covariance matrix is diagonal, which means uncorrelated shocks.
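The recursive identification used below can be sketched in a few lines of base R: given an estimated reduced-form residual covariance matrix, its Cholesky factor gives a lower-triangular impact matrix whose implied structural shocks are uncorrelated. The covariance matrix here is hypothetical, and the identification() function in lbvar implements the actual procedure:

```r
# Hypothetical 3x3 reduced-form residual covariance matrix (positive definite)
Sigma <- matrix(c(1.0, 0.3, 0.2,
                  0.3, 0.5, 0.1,
                  0.2, 0.1, 0.4), nrow = 3)

# Lower-triangular impact matrix B with Sigma = B %*% t(B)
B <- t(chol(Sigma))

# Structural shocks would be e_t = solve(B) %*% u_t; by construction
# their covariance is the identity matrix (uncorrelated, unit variance)
round(solve(B) %*% Sigma %*% t(solve(B)), 10)
```

The triangular structure is what encodes the recursive ordering: a shock to the first variable can hit all variables contemporaneously, while a shock to the last one only hits itself on impact.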
Application

First we must install the lbvar package from my GitHub:
library(devtools)
install_github("gabrielrvsc/lbvar")
library(lbvar)
library(tseries)

The dataset is included in the package and is called BNDESdata. I removed column 17 in this example because it contains a variable (capital goods prices) used only for robustness checks in other examples in the paper. The data is already treated and ready for the model. We are going to evaluate the effects of Brazilian Development Bank loans (BL) on gross fixed capital formation (GFCF), the GFCF fraction of machinery and equipment (GFCFme) and the GFCF fraction of machinery and equipment manufactured in Brazil (GFCFmeBR).
data("BNDESdata")
BNDESdata = BNDESdata[, -17]
colnames(BNDESdata) = c("TT", "CRB", "CDS", "WPROD", "BL", "GFCF", "GFCFme", "GFCFmeBR",
                        "GDP", "IP", "ULC", "CUR", "IR", "ICI", "UNCERT", "ER", "IBOV")

Before estimating the VAR, we must define some parameters for the priors. Our prior is a random walk for non-stationary variables and white noise for stationary variables. We used a Phillips-Perron test to determine which variables are stationary and which are not. We set a value of 1 for the first autoregressive term of each non-stationary equation and 0 for each stationary equation.
There is another parameter we must define, called \(\lambda\) (\(\lambda > 0\)). This parameter controls the importance we give to the prior versus the importance we give to the data: smaller values give more importance to the prior. Bańbura et al. (2010) use a strategy to determine \(\lambda\) which consists of estimating a small VAR by OLS and Bayesian VARs for several values of \(\lambda\). The chosen \(\lambda\) is the one that makes the big model most similar to the small model in terms of squared error. Note that the small model does not use all variables; therefore, we compare only the equations included in both models.
The code for the Phillips-Perron test and the λ estimation is presented below. I used a small range for λ to save computational time, given that I already knew the best value was in this interval.
## = Phillips-Perron test
prior = apply(BNDESdata, 2, function(x) pp.test(x)$p.value)
prior[prior > 0.05] = 1
prior[prior <= 0.05] = 0
## = lambda estimation
lambda = fitLambda(BNDESdata, c("GFCF", "BL", "IR"), seq(0.6, 0.65, 0.0001), p = 13, p.reduced = 13, delta = prior)

The VAR we want to estimate has 17 variables and 13 lags. This accounts for 3774 parameters in the reduced form if we include the intercept. The name Large Bayesian VAR is very much appropriate. The code below estimates the model, computes the recursive identification (see the full article for more details) and the impulse responses. The last step requires some simulation and it may take a few minutes to run.
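The parameter count quoted above is easy to verify: each of the 17 equations has one coefficient per variable per lag, plus an intercept.

```r
k <- 17  # number of variables
p <- 13  # number of lags
# Each of the k equations has k*p lag coefficients plus an intercept:
k * (k * p + 1)
# [1] 3774
```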
## == Estimate the Reduced Form == ##
model = lbvar(BNDESdata, 13, delta = prior, lambda = lambda)
## == Identification == ##
ident = identification(model)
## == Impulse Responses == ##
set.seed(123)
ir = irf(model, ident, 48, unity.shock = FALSE, M = 1000)

Now we can plot the impulse responses and see the results. The figures below show the effects of a shock in the bank loans on the three variations of GFCF.
par(mfrow = c(1, 3))
plot(ir, "BL", "GFCF", alpha = c(0.05, 0.1), xlim = c(0, 36), ylim = c(-0.01, 0.015), ylab = "GFCF", xlab = "Time", main = "(a)")
plot(ir, "BL", "GFCFme", alpha = c(0.05, 0.1), xlim = c(0, 36), ylim = c(-0.01, 0.015), ylab = "GFCF ME", xlab = "Time", main = "(b)")
plot(ir, "BL", "GFCFmeBR", alpha = c(0.05, 0.1), xlim = c(0, 36), ylim = c(-0.01, 0.015), ylab = "GFCF MEBR", xlab = "Time", main = "(c)")

The effects of the bank loans are bigger as we move to the machinery and equipment measures because they represent a large portion of the bank's portfolio. The responses are significant in some horizons. The shock was of one standard deviation, which accounts for a 20% increase in loans. This shock increases the GFCF by less than 2% if we look at the sum of the first 6 months. However, to understand the magnitude of this effect we must look at the fraction of the Brazilian Development Bank on total investment. Our results show that an increase of 1 BRL in loans accounts for an increase of approximately 0.46 BRL in the total GFCF. This may look like a small value at first sight. We explain these results with two hypotheses. First, there is some degree of fund substitution, where companies use the government bank just because interest rates are lower, but they do not suffer from significant credit constraints. Second, there is a crowding-out effect, because the government tends to increase the interest rate when the bank puts more money into the economy.
Finally, a good way to evaluate a VAR model in a macroeconomic framework such as this one is to look at how the variables behave in response to the monetary policy. The figures below show how the GFCF behaves when the government increases interest rates (IR).
par(mfrow = c(1, 3))
plot(ir, "IR", "GFCF", alpha = c(0.05, 0.1), xlim = c(0, 36), ylim = c(-0.02, 0.01), ylab = "GFCF", xlab = "Time", main = "(a)")
plot(ir, "IR", "GFCFme", alpha = c(0.05, 0.1), xlim = c(0, 36), ylim = c(-0.02, 0.01), ylab = "GFCF ME", xlab = "Time", main = "(b)")
plot(ir, "IR", "GFCFmeBR", alpha = c(0.05, 0.1), xlim = c(0, 36), ylim = c(-0.02, 0.01), ylab = "GFCF MEBR", xlab = "Time", main = "(c)")

References

To leave a comment for the author, please follow the link and comment on their blog: R – insightR. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Here’s why 2019 is a great year to start with R: A story of 10 year old R code then and now
(This article was first published on Jozef's Rblog, and kindly contributed to Rbloggers)
Introduction

It has been more than ten years since I wrote my first R code. And in those years, the R world has changed dramatically, and mostly for the better. I believe that the current time may be one of the best times to start working with R.
In this new year’s post we will look at the R world 10 years ago and today, and provide links to many tools that helped it become a great language to solve and present everyday tasks with a welcoming community of users and developers.
Contents My first exposure to R, more than ten years ago
 The growth of R
 R now versus then – A much better world
 If you really came for that ugly old code
My first exposure to R, more than ten years ago

The year was 2007 and I was studying probability and mathematical statistics at my faculty when one of the professors introduced us to R, a free programming language that we could use to solve many statistical tasks, from simple matrix operations and fitting models to data visualization. This sounded great, as many of the traditionally used alternatives, such as Stata or SPSS, were not even free to use, let alone open source.
Now, to get a bit of context: my most recent exposures to programming at that time were using Borland's Delphi 7 and C++ Builder, both mature IDEs with very pleasant and advanced user interfaces and features, where you could literally have a Windows application with a nice UI ready, compiled and running in an hour.
RGui times

When I first opened the RGui it felt, well, slightly underwhelming:
But why did you not use RStudio?
Well, the first beta version of RStudio was released about 3 years later, in February 2011. By the way, those RStudio blog posts from 2011 still have comment sections available below them and I really enjoy reading through them. Anyway, I was stuck with the RGui and it was not a pleasant experience. At that time, I disliked that experience so much that I still wrote some of the code in Delphi or C++ Builder.
Which currently popular packages existed?

But dplyr syntax makes everything so easy, why not use that?
Looking at the CRAN snapshots from the beginning of 2008, the latest released R version at that time was R 2.6.1 and there were around 1,200 packages available on CRAN. At the time of writing of this post, the number of packages available on CRAN has reached 13,600.
Looking at the top 40 most downloaded packages in the past month, only two of those packages existed on CRAN at that time – ggplot2 and digest – no filter, summarize or group_by for me back then.
StackOverflow, GitHub and Twitter communities

Why did you not just ask StackOverflow, Twitter or check GitHub?
According to Wikipedia, StackOverflow was launched on 15th September 2008 and GitHub on 10th April 2008, so at the beginning of 2008 neither of these two giants even existed.
Not that I was using Twitter at that time, but the first #rstats tweet I was able to find is from 4th April 2009:
RT @ChrisAlbon @drewconway #rstats is the official R statistical language hashtag. #rstats (because #R doesn’t cut it)
— brendan o’connor (@brendan642) April 4, 2009
For comparison, R 1.0.0 was released on 29th of February 2000, a date easily remembered.
The growth of R

There are many ways to look at the growth of a programming language, and this is not meant to be a comprehensive and objective growth assessment. I rather took a look at two metrics I found interesting that show some trends in the R world.
If you are interested in the topic of programming language popularity, there are indices such as PYPL and TIOBE, and of course they have their criticisms.
Downloads of R packages
RStudio’s CRAN mirror provides a REST API from which we can look at and visualize the number of monthly downloads of R packages in the past 5 years. The chart speaks for itself:
[Interactive chart: monthly package downloads from RStudio's CRAN mirror, January 2013 through December 2018, growing from about 1.1 million downloads per month to almost 75 million at its peak in late 2018.]
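If you want to pull these numbers yourself, the cranlogs package wraps the same RStudio CRAN mirror logs API. A minimal sketch (assumes the cranlogs package is installed and the API is reachable):

```r
# Fetch daily download counts across all CRAN packages for one month,
# then sum them to get the monthly total shown in the chart.
library(cranlogs)

dl <- cran_downloads(from = "2018-12-01", to = "2018-12-31")
sum(dl$count)
```

Passing a package name as the first argument instead restricts the counts to that package, which is how per-package popularity rankings are built.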
Interest on StackOverflow

Another interesting point of view is the statistics on trends on StackOverflow; paraphrasing their blog post:
When we see a rapid growth in the number of questions about a technology, it usually reflects a real change in what developers are using and learning.
And how does R look within the StackOverflow trends compared to other languages? Looks like the growth of R is so remarkable, even the data scientists at StackOverflow itself noticed and wrote a blogpost about it in 2017:
R now versus then – A much better world

Going back to that story of my first R code, I think time has made working with R much better than it was before in many ways. I will list just a few of the many reasons, with links to relevant resources to follow:
Availability of free information and support is great: the amazing amount of free information readily available, such as the (tidyverse-oriented) R for Data Science or Advanced R books, makes R more accessible to learn and use
 Communities of R users such as the one on StackOverflow make it easy to ask questions and get answers, the #rstats hashtag on Twitter is a good way to interact with the community
 Many user and developer blogs on R-bloggers.com and curated selections of content on RWeekly.org can serve as an inspiration and overview of the news in the community
 Tools like RStudio make using R a much more pleasant experience compared to the original RGui, with many useful features and a Server version running in browser
 Well documented R packages that make common data science tasks easier and/or more performant such as the popular tidyverse or data.table make it easier to start
 R packages that support development, testing and documentation such as devtools, testthat and roxygen2 make R code efficient to develop, test and document
 For portability, reproducibility and dependency management, tools such as packrat can make life less painful
 Code repository managers such as GitHub, GitLab or others make it easy to share code, collaborate and even perform CI/CD tasks where necessary
 Tools like RMarkdown, Bookdown, Blogdown and others make it easy to publish the results of your work, be it an interactive dashboard, a paper in pdf, a presentation, even a book or a blog (such as this one)
 Many packages for generating interactive charts, maps and animations such as highcharter, leaflet and more help create amazing data visualizations
 Shiny takes it to the next level allowing for advanced interactive web applications
 R now has mature interfaces to many programming languages, software libraries, database systems and file formats, just a few examples include Rcpp, rJava, httr, openxlsx, XLConnect, highcharter, jsonlite, xml2, sparklyr and DBI
 CRAN task views provide guidance on R packages per topic, such as Web Technologies and Services, High-Performance and Parallel Computing, Machine Learning & Statistical Learning and many more
I hope this post motivated you to dive a bit deeper into the R world and check some of the many amazing contributions created by developers and users in the R community mentioned above.
But if you really feel like having a good laugh first, feel free to check some of the oldest R scripts I was able to find, unedited, on GitLab here.
They date to somewhere around the end of 2007/beginning of 2008 and, for what it's worth, should still be runnable.
Thank you for reading and
have a happy new yeaR
To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog.
My Activities in 2018 with R and ShinyApp
(This article was first published on R – Networkx, and kindly contributed to Rbloggers)
What better way to analyze your activity data than pulling it from Apple Health and Runkeeper into R and generating some visualizations and counters? After that I will wrap it all together into a Shiny app.
What do I want to achieve for now?
 Number of activities, steps, kilometers, etc.
 Heatmap of the number of activities over the last X years, colored by amount.
Export the Runkeeper data; the option is available after login > account settings > export data > download your data.
To export and convert your Apple Health data, follow my previous post Analysing your Apple Health Data with Splunk.
runkeeper < read.csv("data/cardioActivities.csv", stringsAsFactors=FALSE) steps < read.csv("data/StepCount.csv", stringsAsFactors = FALSE) dim(runkeeper) ## [1] 676 14 dim(steps) ## [1] 47790 9 head(runkeeper) ## Activity.Id Date Type ## 1 47f590499bd54b9aa63cd51fca7e0bf8 20181231 10:22:33 Running ## 2 7099bd747685453ea96262ee711dc18e 20181227 20:24:09 Running ## 3 c2250f5b856747ef97d7504b174809e1 20181223 13:28:04 Running ## 4 787c8438d27949fb8babd1f348935264 20181218 21:07:42 Boxing / MMA ## 5 2c41d494204b4be39b4ed7acac5a3187 20181216 13:04:44 Running ## 6 86b6429a13a84122b200c3aadef271f5 20181204 20:24:27 Boxing / MMA ## Route.Name Distance..km. Duration Average.Pace Average.Speed..km.h. ## 1 8.12 50:49 6:15 9.59 ## 2 15.99 1:27:36 5:29 10.95 ## 3 8.31 50:04 6:01 9.96 ## 4 0.00 1:20:00 NA ## 5 10.51 56:47 5:24 11.10 ## 6 0.00 1:00:00 NA ## Calories.Burned Climb..m. Average.Heart.Rate..bpm. Friend.s.Tagged Notes ## 1 772.0 83 134 ## 2 1485.0 57 147 ## 3 788.0 88 134 ## 4 1032.4 0 NA ## 5 979.0 41 145 ## 6 774.3 0 NA ## GPX.File ## 1 20181231102233.gpx ## 2 20181227202409.gpx ## 3 20181223132804.gpx ## 4 ## 5 20181216130444.gpx ## 6 head(steps) ## sourceName sourceVersion ## 1 John’s iPhone 10.1.1 ## 2 John’s iPhone 10.1.1 ## 3 John’s iPhone 10.1.1 ## 4 John’s iPhone 10.1.1 ## 5 John’s iPhone 10.1.1 ## 6 John’s iPhone 10.1.1 ## device ## 1 <<HKDevice: 0x2836d9310>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1> ## 2 <<HKDevice: 0x2836db070>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1> ## 3 <<HKDevice: 0x2836d9d10>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1> ## 4 <<HKDevice: 0x2836dad50>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1> ## 5 <<HKDevice: 0x2836da080>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1> ## 6 <<HKDevice: 0x2836dabc0>, name:iPhone, manufacturer:Apple, 
model:iPhone, hardware:iPhone8,1, software:10.1.1> ## type unit creationDate startDate ## 1 StepCount count 20161124 09:13:55 +0100 20161124 08:55:10 +0100 ## 2 StepCount count 20161124 09:13:55 +0100 20161124 09:10:09 +0100 ## 3 StepCount count 20161124 09:13:55 +0100 20161124 09:11:11 +0100 ## 4 StepCount count 20161124 09:13:55 +0100 20161124 09:12:13 +0100 ## 5 StepCount count 20161124 09:49:04 +0100 20161124 09:13:19 +0100 ## 6 StepCount count 20161124 09:49:04 +0100 20161124 09:33:57 +0100 ## endDate value ## 1 20161124 08:55:33 +0100 26 ## 2 20161124 09:11:11 +0100 114 ## 3 20161124 09:12:13 +0100 105 ## 4 20161124 09:13:19 +0100 25 ## 5 20161124 09:22:42 +0100 108 ## 6 20161124 09:42:33 +0100 89 str(runkeeper) ## 'data.frame': 676 obs. of 14 variables: ## $ Activity.Id : chr "47f590499bd54b9aa63cd51fca7e0bf8" "7099bd747685453ea96262ee711dc18e" "c2250f5b856747ef97d7504b174809e1" "787c8438d27949fb8babd1f348935264" ... ## $ Date : chr "20181231 10:22:33" "20181227 20:24:09" "20181223 13:28:04" "20181218 21:07:42" ... ## $ Type : chr "Running" "Running" "Running" "Boxing / MMA" ... ## $ Route.Name : chr "" "" "" "" ... ## $ Distance..km. : num 8.12 15.99 8.31 0 10.51 ... ## $ Duration : chr "50:49" "1:27:36" "50:04" "1:20:00" ... ## $ Average.Pace : chr "6:15" "5:29" "6:01" "" ... ## $ Average.Speed..km.h. : num 9.59 10.95 9.96 NA 11.1 ... ## $ Calories.Burned : num 772 1485 788 1032 979 ... ## $ Climb..m. : int 83 57 88 0 41 0 62 0 0 0 ... ## $ Average.Heart.Rate..bpm.: int 134 147 134 NA 145 NA 145 NA NA NA ... ## $ Friend.s.Tagged : chr "" "" "" "" ... ## $ Notes : chr "" "" "" "" ... ## $ GPX.File : chr "20181231102233.gpx" "20181227202409.gpx" "20181223132804.gpx" "" ... str(steps) ## 'data.frame': 47790 obs. of 9 variables: ## $ sourceName : chr "John’s iPhone" "John’s iPhone" "John’s iPhone" "John’s iPhone" ... ## $ sourceVersion: chr "10.1.1" "10.1.1" "10.1.1" "10.1.1" ... 
## $ device : chr "<<HKDevice: 0x2836d9310>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1>" "<<HKDevice: 0x2836db070>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1>" "<<HKDevice: 0x2836d9d10>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1>" "<<HKDevice: 0x2836dad50>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1>" ... ## $ type : chr "StepCount" "StepCount" "StepCount" "StepCount" ... ## $ unit : chr "count" "count" "count" "count" ... ## $ creationDate : chr "20161124 09:13:55 +0100" "20161124 09:13:55 +0100" "20161124 09:13:55 +0100" "20161124 09:13:55 +0100" ... ## $ startDate : chr "20161124 08:55:10 +0100" "20161124 09:10:09 +0100" "20161124 09:11:11 +0100" "20161124 09:12:13 +0100" ... ## $ endDate : chr "20161124 08:55:33 +0100" "20161124 09:11:11 +0100" "20161124 09:12:13 +0100" "20161124 09:13:19 +0100" ... ## $ value : int 26 114 105 25 108 89 329 284 89 18 ...From the Runkeeper data we need Date and from Apple Health StepCount we need the endDate (thats when your step is done). Both has type chr. I could convert it as date, but I leave the data what it is and will do convert it when necessary.
Create new variables

First we load the lubridate package.
library(lubridate)

I will create some new variables and convert Date and endDate to Date objects, so I can extract the year with the year() function of the lubridate package.
runkeeper$year <- year(as.Date(runkeeper$Date, origin = '1900-01-01'))
steps$year <- year(as.Date(steps$endDate, origin = '1900-01-01'))
summary(runkeeper$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2010    2013    2016    2015    2017    2018
summary(steps$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2016    2017    2017    2017    2018    2019

As you can see in the above summary of both data frames, Runkeeper covers more years than the steps data. That doesn't matter because we are now only looking at 2018 in the Shiny app and I will group it by year. In the app I create a year slider.
I will parse the Duration variable into hours, minutes and seconds, and calculate the duration in minutes. Some of the Duration values have no leading zero for the hour, so I cannot use them with the hms function directly.
Add a leading zero if it does not exist.
runkeeper$Duration[1:5]
## [1] "50:49"   "1:27:36" "50:04"   "1:20:00" "56:47"
runkeeper$Duration <- ifelse(nchar(runkeeper$Duration) < 6, paste0("0:", runkeeper$Duration), runkeeper$Duration)
runkeeper$Duration[1:5]
## [1] "0:50:49" "1:27:36" "0:50:04" "1:20:00" "0:56:47"

Now I can parse the Duration with hms.
runkeeper$lub <- hms(runkeeper$Duration)
runkeeper$time_minutes <- hour(runkeeper$lub)*60 + minute(runkeeper$lub) + second(runkeeper$lub)/60
runkeeper$lub[1:5]
## [1] "50M 49S"    "1H 27M 36S" "50M 4S"     "1H 20M 0S"  "56M 47S"
runkeeper$time_minutes[1:5]
## [1] 50.81667 87.60000 50.06667 80.00000 56.78333

Group the data by year
First we load the package dplyr.
library(dplyr)

Group both data frames by year and compute some summaries, like counts and sums of kilometers, climb, calories and duration.
grouped_runkeeper_year <- runkeeper %>%
  group_by(year) %>%
  summarise(cnt = n(),
            km = sum(Distance..km.),
            climb = sum(Climb..m.),
            calories = sum(Calories.Burned),
            duration = sum(time_minutes, na.rm = TRUE))

grouped_steps_year <- steps %>%
  group_by(year) %>%
  summarise(cnt = sum(value))

grouped_runkeeper_year
## # A tibble: 9 x 6
##    year   cnt    km climb calories duration
##   <dbl> <int> <dbl> <int>    <dbl>    <dbl>
## 1  2010     3  10.8     0      901     71.8
## 2  2011    96  657.  1951    56333    3864.
## 3  2012    46  410.  1458    36876    2278.
## 4  2013    46  258.   488    22654    1456.
## 5  2014    24  175.   698    14382     992.
## 6  2015   117 1202.  5910    98518    7537.
## 7  2016    93  661.  3076   65275.    5397.
## 8  2017   128  408.  1569   89872.    7404.
## 9  2018   123  366.  1818   86792.    7062.
grouped_steps_year
## # A tibble: 4 x 2
##    year     cnt
##   <dbl>   <int>
## 1  2016  328759
## 2  2017 3360951
## 3  2018 4366289
## 4  2019    3692

Some simple visualizations
library(ggplot2)
library(RColorBrewer)

grouped_runkeeper_year %>%
  ggplot(aes(x = year, y = cnt)) +
  geom_bar(stat = "identity", col = "white", fill = "#ee8300") +
  geom_text(aes(label = cnt), hjust = 1.2, color = "white", size = 3.5) +
  labs(x = "", y = "", title = "Number activities by Year") +
  scale_x_continuous(breaks = seq(min(grouped_runkeeper_year$year), max(grouped_runkeeper_year$year), 1)) +
  coord_flip() +
  theme_bw()

grouped_steps_year %>%
  ggplot(aes(x = year, y = cnt)) +
  geom_bar(stat = "identity", col = "white", fill = "#ee8300") +
  geom_text(aes(label = cnt), hjust = 1.2, color = "white", size = 3.5) +
  labs(x = "", y = "", title = "Number of Steps by Year") +
  scale_x_continuous(breaks = seq(min(grouped_runkeeper_year$year), max(grouped_runkeeper_year$year), 1)) +
  coord_flip() +
  theme_bw()
Add a heatmap of the number of activities for the last 3 years.
calendar_heatmap <- runkeeper %>%
  select(Date, time_minutes, year) %>%
  filter(year >= year(now()) - 3) %>%
  mutate(
    week = week(as.Date(Date)),
    wday = wday(as.Date(Date), week_start = 1),
    month = month(as.Date(Date), label = TRUE, abbr = TRUE),
    day = weekdays(as.Date(Date))
  )

cols <- rev(rev(brewer.pal(7, "Oranges"))[1:5])

calendar_heatmap %>%
  ggplot(aes(month, reorder(day, wday), fill = time_minutes)) +
  geom_tile(colour = "white") +
  scale_fill_gradientn('Activity \nMinutes', colours = cols) +
  facet_wrap(~ year, ncol = 1) +
  theme_classic() +
  theme(strip.text.x = element_text(size = 16, face = "bold", colour = "orange")) +
  ylab("") +
  xlab("")

Screenshot Shiny App

SessionInfo

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.14.2
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RWordPress_0.2-3 knitr_1.20
## [3] bindrcpp_0.2.2 RColorBrewer_1.1-2
## [5] ggplot2_3.0.0 shinydashboardPlus_0.6.0
## [7] shinydashboard_0.7.1 shiny_1.2.0
## [9] dplyr_0.7.8 lubridate_1.7.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.19 prettyunits_1.0.2 ps_1.2.1
## [4] utf8_1.1.4 assertthat_0.2.0 rprojroot_1.3-2
## [7] digest_0.6.17 mime_0.5 R6_2.2.2
## [10] plyr_1.8.4 backports_1.1.2 evaluate_0.12
## [13] highr_0.7 pillar_1.3.0 rlang_0.3.0.1
## [16] lazyeval_0.2.1 curl_3.2 rstudioapi_0.8
## [19] callr_3.0.0 rmarkdown_1.10 desc_1.2.0
## [22] labeling_0.3 devtools_2.0.1 stringr_1.3.1
## [25] RCurl_1.95-4.11 munsell_0.5.0 compiler_3.5.1
## [28] httpuv_1.4.5.1 pkgconfig_2.0.2 base64enc_0.1-3
## [31] pkgbuild_1.0.2 htmltools_0.3.6 tidyselect_0.2.5
## [34] tibble_1.4.2 XML_3.98-1.16 fansi_0.4.0
## [37] crayon_1.3.4 withr_2.1.2 later_0.7.5
## [40] bitops_1.0-6 grid_3.5.1 jsonlite_1.5
## [43] xtable_1.8-3 gtable_0.2.0 magrittr_1.5
## [46] scales_1.0.0 cli_1.0.1 stringi_1.2.4
## [49] fs_1.2.6 promises_1.0.1 remotes_2.0.2
## [52] testthat_2.0.0 tools_3.5.1 glue_1.3.0
## [55] purrr_0.2.5 hms_0.4.2 processx_3.2.0
## [58] pkgload_1.0.2 yaml_2.2.0 colorspace_1.3-2
## [61] sessioninfo_1.1.1 memoise_1.1.0 bindr_0.1.1
## [64] XMLRPC_0.3-1 usethis_1.4.0
We'll wrap it up into a Shiny app, which can be found on my GitHub.
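A minimal sketch of what such a wrapper can look like, using the grouped data from above and the year slider mentioned earlier (the widget names and year range here are illustrative, not the author's actual app):

```r
library(shiny)
library(dplyr)
library(ggplot2)

# Assumes grouped_runkeeper_year (built above) is available in the session.
grouped_runkeeper_year <- data.frame(year = 2010:2018,
                                     cnt = c(3, 96, 46, 46, 24, 117, 93, 128, 123))

ui <- fluidPage(
  sliderInput("years", "Years", min = 2010, max = 2018,
              value = c(2010, 2018), sep = ""),
  plotOutput("activities")
)

server <- function(input, output) {
  output$activities <- renderPlot({
    grouped_runkeeper_year %>%
      filter(year >= input$years[1], year <= input$years[2]) %>%
      ggplot(aes(x = year, y = cnt)) +
      geom_col(fill = "#ee8300") +
      coord_flip() +
      theme_bw()
  })
}

# shinyApp(ui, server)  # uncomment to launch the app
```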
The post My Activities in 2018 with R and ShinyApp appeared first on Networkx.
To leave a comment for the author, please follow the link and comment on their blog: R – Networkx.
In case you missed it: December 2018 roundup
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
In case you missed them, here are some articles from December of particular interest to R users.
Roundup of AI, Machine Learning and Data Science news from December 2018.
AzureStor, a new R package to interface with Azure Storage.
How to use the "plumber" package to create an R service as a container with the AzureContainers package.
How to give money to the R Foundation in support of the R project.
The Revolutions blog is 10 years old, and I reflect on some milestones in a talk given at SatRDays DC.
Reshama Shaikh compares gender diversity within the R and Python communities.
AzureVM, a new package for managing virtual machines in Azure from R.
And some general interest stories (not necessarily related to R):
 Santa Claus and compliance with GDPR
 A parody on the use of CGI in film
 The story behind Fleetwood Mac's Go Your Own Way
As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
Displaying our “R – Quality Control Individual Range Chart Made Nice” inside a Java web App using AJAX – How To.
(This article was first published on RAnalytics, and kindly contributed to Rbloggers)
 Prerequisites:
 What you should have installed:
 Java, it can be OpenJDK, you can get it from here: https://github.com/ojdkbuild/ojdkbuild
 Tomcat, any version from 8 up.
 Eclipse EE: Eclipse IDE for Java EE.
 Spring Tools Suite For Eclipse: https://spring.io/tools. You can install it from Eclipse Marketplace.
 R: https://www.r-project.org. Inside R, at least these packages:
 install.packages("tidyverse", dependencies = TRUE)
 install.packages("rmarkdown", dependencies = TRUE)
 install.packages("dygraphs", dependencies = TRUE)
 install.packages("qcc", dependencies = TRUE)
 install.packages("rattle", dependencies = TRUE)
 install.packages("Rcmdr", dependencies = TRUE)
 install.packages("stlplus", dependencies = TRUE)
 RStudio, to edit your RMarkdown files: https://www.rstudio.com.
 Rtools: https://cran.r-project.org/bin/windows/Rtools/index.html.
 Pandoc, this is to convert markdown files to html: https://pandoc.org.
 Fork git utility: https://git-fork.com. This is optional, but it may make it easier to download the projects from github.com: https://github.com/LaranIkal/RANALYTICS and https://github.com/LaranIkal/rchart.
 Knowledge of:
 Java web development with Spring.
 Ajax, at least the concept.
 Thymeleaf, at least the template concept: https://www.thymeleaf.org.
 R and the R related tools mentioned above.
 SPC, it is important for you to read my article: R – Quality Control Individual Range Chart Made Nice.
 Once you are ready with everything mentioned above, make sure you can generate the html file on your computer from the RMarkdown file provided in the github repository mentioned above (https://github.com/LaranIkal/RANALYTICS), using RStudio.
 Download the zip file: RANALYTICS-master.zip
 Unzip it to a local folder.
 Double-click on the file RANALYTICS.Rproj; it will start RStudio.
 In the lower-right pane, click on Files; you will see your files. Click on the file name QualityControl_IndividualRangeChart.Rmd and RStudio will open the file:
 To generate the html file from the RMarkdown file, click on Knit > Knit to HTML.
 Now you should be able to see your Individual Range Chart html file, like this:
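The Knit button is just a front end for rmarkdown::render(); the Java application later does essentially the same thing by shelling out to Rscript. A self-contained sketch (it builds a throwaway .Rmd so it runs anywhere pandoc is available; for the tutorial you would point it at QualityControl_IndividualRangeChart.Rmd instead):

```r
# Scripted equivalent of Knit > Knit to HTML.
# Write a tiny demo .Rmd file so this snippet is self-contained
# (assumes pandoc is installed and on the PATH).
rmd <- tempfile(fileext = ".Rmd")
writeLines(c("---",
             "title: \"Demo report\"",
             "output: html_document",
             "---",
             "",
             "Some **markdown** content."), rmd)

html <- rmarkdown::render(rmd, quiet = TRUE)
file.exists(html)
```

From a shell, the one-liner the web app effectively runs is: Rscript -e 'rmarkdown::render("QualityControl_IndividualRangeChart.Rmd")'.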
 Creating the project in eclipse EE IDE:
 Using your file explorer, create a folder called rchart inside your eclipse EE workspace folder.
 Open the file rchart-master.zip
 Extract the content of the rchart-master folder inside the zip file into the new rchart folder you just created.
 Start Eclipse EE IDE and go to File > New > Java Project.
 Enter the name of the project as rchart and click Finish:
 Once the project has been added, open it and double click on your pom.xml file to edit the Tomcat path according to your Tomcat path and save your changes:
 Right-click on your pom.xml file and select Run As > Maven build:
 When Eclipse asks to edit the launch configuration, enter eclipse:eclipse as the Goals and click Run:
 Now let's create the war file: right-click on your pom.xml file and select Run As > Maven install:
 Go to your web browser and point it to http://localhost:8080/RChartFromJava, verify the Rscript path and set it accordingly – "C:\\Program Files\\R\\R-3.5.1\\bin\\Rscript" – and click on the Display Chart button. Now you should be able to see the Individual Range Chart of the data in the text area:
Carlos Kassab
https://www.linkedin.com/in/carloskassab48b40743/
More information about R:
https://www.r-bloggers.com
To leave a comment for the author, please follow the link and comment on their blog: R-Analytics. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: data science, big data, R jobs, visualization (ggplot2, boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, web scraping), statistics (regression, PCA, time series, trading) and more...
What does it mean to write “vectorized” code in R?
(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)
One often hears that R cannot be fast (false), or, more correctly, that for fast code in R you may have to consider “vectorizing.”
A lot of knowledgeable R users are not comfortable with the term “vectorize”, and not really familiar with the method.
“Vectorize” is just a slightly high-handed way of saying:
R naturally stores data in columns (or in column major order), so if you are not coding to that pattern you are fighting the language.
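As a toy illustration of that pattern (my own example, not from the original post): summing squares one element at a time fights the language, while the whole-column form does the same work in a couple of vector operations.

```r
x <- as.numeric(1:100000)

# Scalar style: one R-level multiply and add per element.
sum_sq_loop <- function(x) {
  s <- 0
  for (xi in x) s <- s + xi * xi
  s
}

# Vectorized style: two operations over the whole column.
sum_sq_vec <- function(x) sum(x * x)

stopifnot(all.equal(sum_sq_loop(x), sum_sq_vec(x)))
```

On vectors of any real size, the second form is dramatically faster because the looping happens inside compiled code rather than in the R interpreter.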
In this article we will make the above clear by working through a non-trivial example of writing vectorized code.
For our example problem we will take on the task of computing the PRESS statistic (a statistic we use in our new RcppDynProg package). The “predicted residual error sum of squares” or PRESS statistic is an estimate of the out-of-sample quality of a fit. We motivate the PRESS statistic as follows.
Suppose we are fitting a simple linear model. In such a case it is natural to examine the model residuals (differences between the actual values and the matching predictions):
d <- data.frame(
  x = 1:6,
  y = c(1, 1, 2, 2, 3, 3))

lm(y ~ x, data = d)$residuals
#          1          2          3          4          5          6
#  0.1428571 -0.3142857  0.2285714 -0.2285714  0.3142857 -0.1428571

However, because the model has seen the data it is being applied to, these residuals are not representative of the residuals we would see on new data (there is a towards-zero bias in this estimate). An improved estimate is the PRESS statistic: for each point the model is fit on all points except the point in question, and then the residual is estimated. This is a form of cross-validation and is easy to calculate:
xlin_fits_lm <- function(x, y) {
  n <- length(y)
  d <- data.frame(x = x, y = y)
  vapply(
    seq_len(n),
    function(i) {
      m <- lm(y ~ x, data = d[-i, ])  # fit on all points except i
      predict(m, newdata = d[i, ])    # predict the held-out point
    },
    numeric(1))
}

d$y - xlin_fits_lm(d$x, d$y)
# [1]  0.3000000 -0.4459459  0.2790698 -0.2790698  0.4459459 -0.3000000

Notice these values tend to be further from zero. They also better represent how an overall model might perform on new data.
At this point, from a statistical point of view, we are done. However, rebuilding an entire linear model for each point is computationally inefficient. Each of the models we are calculating shares many training points, so we should be able to build a much faster hand-rolled calculation.
That faster calculation is given in the rather formidable function xlin_fits_R() found here. This function computes the summary statistics needed to solve the linear regression for all of the data. Then for each point in turn, it subtracts that point’s contribution out of the summaries and then performs the algebra needed to solve for the model and then apply it. The advantage of this approach is that taking the point out and solving the model is only a very small (constant) number of steps independent of how many points are in the summary. So this code is, in principle, very efficient.
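To make that idea concrete, here is a hedged, simplified sketch of the summary-downdating approach (a stand-in for illustration, not the actual xlin_fits_R() from the post): the sums are computed once for the whole data set, and each point's contribution is subtracted back out before solving the one-variable least-squares problem for that point.

```r
# Leave-one-out PRESS residuals for y ~ x without refitting lm() n times:
# compute the summary sums once, then downdate them for each held-out point.
xlin_fits_sketch <- function(x, y) {
  n   <- length(y)
  sx  <- sum(x);     sy  <- sum(y)
  sxx <- sum(x * x); sxy <- sum(x * y)
  vapply(seq_len(n), function(i) {
    m    <- n - 1                          # points left after removing i
    sxi  <- sx  - x[i];   syi  <- sy  - y[i]
    sxxi <- sxx - x[i]^2; sxyi <- sxy - x[i] * y[i]
    slope     <- (m * sxyi - sxi * syi) / (m * sxxi - sxi^2)
    intercept <- (syi - slope * sxi) / m
    y[i] - (intercept + slope * x[i])      # held-out residual
  }, numeric(1))
}

d <- data.frame(x = 1:6, y = c(1, 1, 2, 2, 3, 3))
round(xlin_fits_sketch(d$x, d$y), 7)
# [1]  0.3000000 -0.4459459  0.2790698 -0.2790698  0.4459459 -0.3000000
```

Removing a point and re-solving costs a constant number of arithmetic operations here, instead of a full model refit.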
And in fact timings on a small problem (300 observations) show that while the simple “call lm() a bunch of times” approach of xlin_fits_lm() takes 0.28 seconds, the more complicated xlin_fits_R() takes 0.00043 seconds, a speedup of over 600 times!
However this code is performing a separate calculation for each scalar datapoint. As we mentioned above, this is fighting R, which is specialized for performing calculations over large vectors. The exact same algorithm written in C++, instead of R, takes 0.000055 seconds, almost another multiple of 10 faster!
The timings are summarized below.
This sort of difference, scalar-oriented C++ being so much faster than scalar-oriented R, is often distorted into “R is slow.”
This is just not the case. If we adapt the algorithm to be vectorized we get an R algorithm with performance comparable to the C++ implementation!
Not all algorithms can be vectorized, but this one can, and in an incredibly simple way. The original algorithm itself (xlin_fits_R()) is a bit complicated, but the vectorized version (xlin_fits_V()) is literally derived from the earlier one by crossing out the indices. That is: in this case we can move from working over very many scalars (slow in R) to working over a small number of vectors (fast in R).
Let’s take a look at the code transformation.
We are not saying that xlin_fits_R() or xlin_fits_V() are easy to understand; we felt pretty slick when we derived them and added a lot of tests to confirm they calculate the same thing as xlin_fits_lm(). What we are saying is that the transform from xlin_fits_R() to xlin_fits_V() is simple: just cross out the for-loop and all of the “[k]” indices!
Performing the exact same operation on every entry in a structure (but with different values) is the essence of “vectorized code.” When we wrote a for-loop in xlin_fits_R() to perform the same steps for each index, we were in fact fighting the language. Crossing out the for-loop and indices mechanically turned the scalar code into faster vector code.
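In that spirit, here is a hedged, simplified sketch of what such a vectorized leave-one-out PRESS calculation can look like (again a stand-in for illustration, not the actual xlin_fits_V() from the post): the per-index downdates of the summary sums become whole-vector expressions, and the explicit loop disappears.

```r
# Vectorized leave-one-out PRESS residuals for y ~ x:
# the summary downdates are computed for all points at once.
xlin_fits_sketch_v <- function(x, y) {
  n    <- length(y)
  m    <- n - 1                     # points left after removing any one
  sxi  <- sum(x)     - x            # all n downdated sums, as vectors
  syi  <- sum(y)     - y
  sxxi <- sum(x * x) - x^2
  sxyi <- sum(x * y) - x * y
  slope     <- (m * sxyi - sxi * syi) / (m * sxxi - sxi^2)
  intercept <- (syi - slope * sxi) / m
  y - (intercept + slope * x)       # held-out residuals, no for-loop
}

d <- data.frame(x = 1:6, y = c(1, 1, 2, 2, 3, 3))
round(xlin_fits_sketch_v(d$x, d$y), 7)
# [1]  0.3000000 -0.4459459  0.2790698 -0.2790698  0.4459459 -0.3000000
```

The body is the scalar version with the loop and every `[i]` index crossed out; R's elementwise arithmetic does the iteration internally.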
And that is our example of how and why to vectorize code in R.
To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.