
Tutorial: An app in R shiny visualizing biopsy data —  in a pharmaceutical company

Mon, 01/07/2019 - 17:39

(This article was first published on Stories by Sebastian Wolf on Medium, and kindly contributed to R-bloggers)

Learn how to build a Shiny app for the visualization of clustering results. The app helps to better identify patient data samples, e.g. during a clinical study.

This tutorial is a joint effort. It was presented by Olaf Menzer in a workshop at the ODSC West Conference in San Francisco in 2018. Sebastian Wolf co-implemented the application as an expert in bio-pharmaceutical web applications.

What is it about?

The story behind this app comes from a real application inside departments evaluating clinical studies in diagnostics.

Due to fast recruiting for clinical studies, the patient cohorts seemed to be inhomogeneous. A researcher and a clinical study statistician therefore wanted to find out which parameters could identify patients that do not seem to fit their assigned class. Maybe there was a mistake in the labeling of a patient's disease status? Maybe one or two measurements could be used to spot such patients easily?

The example data used here is real data from a 1990s study, known as the biopsy data set, which is also hosted in the UCI ML data repository. The app built in this tutorial is preliminary and was created especially for the tutorial, but pieces of it have been applied in real-world biostatistical applications.

Starting point

Please start by forking the tutorial at: https://github.com/zappingseb/biopharma-app

and afterwards run the installation of dependencies in your R session:

install.packages(c("ape","dplyr","magrittr","igraph","viridisLite","MASS","shiny"))

If you have all the packages installed, you will have one file to work on: app.R. This file allows you to build a Shiny application.

We will use this file to insert the right visualizations and the right table for the researcher to fulfil the task described above. To check what the app will finally look like, you can already run runApp() in the console and the app will open in your browser.

The app already contains:

A sidebarPanel that has all the inputs we need

  • Slider for the # of patients
  • Slider for the # of desired clusters/groups
  • Empty input to choose measurements — shall be done by you
  • Dropdown field for the clustering method

A server function that will provide

  1. The empty input to choose measurements
  2. A Heatmap to see the outcome of clustering
  3. A Phylogenetic tree plot to see the outcome of clustering
  4. A table to see the outcome of the clustering

It already contains functionality to provide you with the input data sets biopsy and biopsy_numeric(); note that biopsy_numeric is a reactive.

In this tutorial we will go through steps 1-4 to build the app.

The input data set

The patients inside the data set were obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. He assessed biopsies of breast tumours for 699 patients up to 15 July 1992; each of nine attributes has been scored on a scale of 1 to 10, and the outcome is also known. There are 699 rows and 11 columns.

The data set is available as the variable biopsy inside the app. Columns 2-10 are stored inside the reactive biopsy_numeric(), which is filtered by the input$patients slider so that not all 699 patients are used, but between 1 and 100.
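For orientation, a reactive of this kind could look roughly like the following sketch (the actual definition in the repository's app.R may differ in detail):

biopsy_numeric <- reactive({
  # keep only the first input$patients rows and the nine measurement columns 2-10
  biopsy %>%
    dplyr::filter(dplyr::row_number() <= input$patients) %>%
    dplyr::select(2:10)
})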

1) Construction of a SelectInput

The selectInput shall allow the user not to use all 9 measured variables, but just the ones they desire. This shall help find the measurements that are necessary to classify patients. What is a shiny selectInput? We can look at its description with

?shiny::selectInput

and read the argument list in the help page.

We now need to build the choices from the column names 2-10 of the biopsy data set; the selected values will be the same. We shall allow multiple selections, so multiple will be set to TRUE. Additionally we shall name the inputId "vars". We can therefore replace the output$variables part inside the app.R file with this:

output$variables <- renderUI({
  selectInput(inputId = "vars",
              label = "Variables to use",
              choices = names(biopsy)[2:10],
              multiple = TRUE,
              selected = names(biopsy)[2:10]
  )
})
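For completeness: a renderUI output like this is displayed in the UI by a matching uiOutput() placeholder. The tutorial's app.R already ships with that placeholder in the sidebar, so the following is only an illustrative sketch:

sidebarPanel(
  # placeholder that shows the control rendered by output$variables above
  uiOutput("variables")
)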

And you’re done.

2) A Heatmap to see the outcome of clustering

The basic heatmap function allows you to draw a heat map. In this case we would like to change a few things. We would like to change the clustering method inside the hclust function to a method defined by the user. We can grab the user-defined method via input$method, as we already defined this input field as a drop-down menu. We have to overwrite the default hclust method with our method:

my_hclust <- function(...){
  hclust(method = my_method, ...)
}
my_method <<- input$method

Be aware that you define a global variable my_method here, which suffices within the scope of this tutorial. However, please keep in mind that global variables can be problematic in many other contexts and do your own research what best fits your application.

Now for the heatmap call we basically need to change a few inputs. Please see the result:

heatmap(x = t(as.matrix(biopsy_numeric())),
        Rowv = NA,
        hclustfun = my_hclust,
        labCol = biopsy$"disease status",
        col = viridis(15)
)

We need to transpose the biopsy_numeric matrix, as we would like to have the patients in columns. As the clustering is just one-dimensional, we can switch off the row dendrogram by setting Rowv to NA. The hclustfun is overwritten by our function my_hclust.

For the coloring of the plot we use the viridis palette, as it is a color-blind-friendly palette. The labels of our columns shall now be not the patient IDs but the disease status. You can see the names of all columns we defined in the file R/utils.R. There you see that the last column of biopsy is called "disease status". This will be used to label each patient. Now we get:

output$plot1 <- renderPlot({
  my_hclust <- function(...){
    hclust(method = my_method, ...)
  }
  my_method <<- input$method
  heatmap(x = t(as.matrix(biopsy_numeric())),
          Rowv = NA,
          hclustfun = my_hclust,
          labCol = biopsy$"disease status",
          col = viridis(15)
  )
})
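As a side note on the global-variable caveat above, one possible alternative (a sketch, not the tutorial's code) is to capture the selected method in a local variable, so that no global assignment is needed:

output$plot1 <- renderPlot({
  local_method <- input$method  # captured locally instead of via a global variable
  my_hclust <- function(...) hclust(..., method = local_method)
  heatmap(x = t(as.matrix(biopsy_numeric())),
          Rowv = NA,
          hclustfun = my_hclust,
          labCol = biopsy$"disease status",
          col = viridis(15)
  )
})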

Part 2 is done

3) Plot a phylogenetic tree

To allow plotting a phylogenetic tree we provided you with a function called phyltree. You can read the whole code of the function inside R/utils.R. This function takes as inputs

  • a numeric matrix > biopsy_numeric() CHECK
  • The clustering method > input$method CHECK
  • The number of clusters > input$nc CHECK
  • A color function > viridis CHECK

You can read why to use () behind biopsy_numeric here.

The hard part now is the labels. The biopsy_numeric data set is filtered by the # of patients, so we have to filter the labels, too. We therefore use

labels = biopsy %>% dplyr::select("disease status") %>%
  filter(row_number() <= input$patients) %>%
  mutate_all(as.character) %>%
  pull("disease status")

This is a workflow using functional programming with the R package dplyr. The select function lets us select just the "disease status" column. The filter function restricts the number of rows. The mutate_all function applies as.character to all columns, and finally we extract the labels as a vector using pull.

  • labels for the tree nodes > biopsy %>% … CHECK
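For readers who prefer base R, the same label vector could also be built without dplyr (a sketch, assuming as above that biopsy_numeric() keeps the first input$patients rows):

labels_base <- as.character(biopsy[seq_len(input$patients), "disease status"])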

The final result looks like this

output$plot2 <- renderPlot({
  phyltree(x = biopsy_numeric(),
           method = input$method,
           nc = input$nc,
           color_func = "viridis",
           labels = biopsy %>% dplyr::select("disease status") %>%
             filter(row_number() <= input$patients) %>%
             mutate_all(as.character) %>%
             pull("disease status")
  )
})

4) Create a table from clustering results

Now we would also like to see, for each patient, which cluster they were assigned to. Therefore we perform the clustering and tree cutting ourselves:

clust <- hclust(dist(biopsy_numeric()), method = input$method)
cluster_assignment <- cutree(clust, k = input$nc)  # cluster assignment

The cluster_assignment is now a vector with the cluster number for each patient, such as c(1,2,1,1,1,2,2,1,…). This information becomes helpful once we combine it with the patient ID and the disease status recorded in the patients' forms.

The task will be performed using the cbind function of R:

out_table <- cbind(
  cluster_assignment,
  biopsy %>% filter(row_number() <= length(cluster_assignment)) %>% select(c(1, 11))
) # cbind

Now this table shall be sorted by cluster_assignment to get a faster view of which patients landed in the wrong cluster.

out_table %>% arrange(cluster_assignment)

The final code:

output$cluster_table <- renderTable({

  # --------- perform clustering ----------------

  # Clustering via hierarchical clustering
  clust <- hclust(dist(biopsy_numeric()), method = input$method)
  cluster_assignment <- cutree(clust, k = input$nc)  # cluster assignment

  # Create a table with the clusters, patient IDs and disease status
  out_table <- cbind(
    cluster_assignment,
    biopsy %>% filter(row_number() <= length(cluster_assignment)) %>% select(c(1, 11))
  ) # cbind

  # Order by cluster_assignment
  out_table %>% arrange(cluster_assignment)
})

Done

What to do now?

Now you can run the runApp() function.

If you choose 100 patients, 2 clusters, "ward.D2" clustering and all variables, you will see pretty fast that the patients:

  • 1002945
  • 1016277
  • 1018099
  • 1096800

could be identified as patients that were clustered incorrectly. Now you can search for problems in the clustering or look at the records of those patients. By changing the labeling, e.g. using patient IDs inside the phyltree function call, you can even check which other patients show measurements close to these patients. Explore and play!
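As a sketch of that last idea (assuming, as the cluster table above suggests, that the first column of biopsy holds the patient IDs), the labels passed to phyltree() could be swapped for IDs like this:

id_labels <- biopsy %>%
  dplyr::filter(dplyr::row_number() <= input$patients) %>%
  dplyr::pull(1) %>%   # first column: patient IDs
  as.character()

Passing id_labels as the labels argument of phyltree() then shows which patients sit next to the suspicious ones in the tree.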

Further reading

Tutorial: An app in R shiny visualizing biopsy data —  in a pharmaceutical company was originally published in Data Driven Investor on Medium, where people are continuing the conversation by highlighting and responding to this story.



RTest: pretty testing of R packages

Mon, 01/07/2019 - 17:03

(This article was first published on Stories by Sebastian Wolf on Medium, and kindly contributed to R-bloggers)

The SpecFlow and cucumber.io for R: enabling non-coders to interpret test reports for R packages, and moreover allowing non-coders to create test cases. A step towards simple R package validation.

(Image by startupstockphotos, http://startupstockphotos.com/post/143841899156)

Why RTest?

Testing in R seems simple. Start by using usethis::use_test("name") and off you go, coding your tests in testthat with functions like expect_equal. You can find a lot of tutorials online; there is even a whole book on the topic, Testing R Code (Chapman & Hall/CRC The R Series). Sadly, this is not the way I can go. As I mentioned a few times, I work in a strongly regulated environment. Inside such environments your tests are not only checked by coders, but also by people who cannot code. Some of your tests will even be written by people who cannot code. Something like SpecFlow or Cucumber would really help them to write such tests. But those do not exist in R. Additionally, those people cannot read command-line test reports. You can train them to do it, but we decided it is easier to provide us and them with a pretty environment for testing, called RTest.

If you want to know more about the reasons for developing such an environment, you can read the article: Why do we need human readable test for a programming language.

What’s special about RTest?

To explain which features we put into RTest, I will start by describing a basic testing workflow.

1 Testing code starts with writing code. Your R-package will contain functions, classes and methods. These shall be tested.

2 Writing the tests now mostly includes calls like this:

my_function <- function(x, y) sum(x, y)
x <- 3; y <- 4; z <- 7
stopifnot(my_function(x, y) == z)

It is easy to see that your test will break if your function my_function cannot sum up two values. You will create a bunch of such tests and store them in a separate folder of your package, normally called tests.

3 Afterwards you can run all such tests. You can include a script in the tests folder, or use testthat and run testthat::test_dir().

4 If one of your tests fails, the script will stop and tell you in the console which test failed. This describes the 4 steps shown in the figure below.
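For comparison, the same kind of check written as a conventional testthat test might look like this; the file name and function here are illustrative only:

# tests/testthat/test-my_function.R
library(testthat)

my_function <- function(x, y) sum(x, y)

test_that("my_function sums two values", {
  expect_equal(my_function(3, 4), 7)
})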

What’s now special about RTest are two major steps.

  1. The definition of the tests
  2. The reporting of the test execution

For the definition of the tests we decided on XML. Why XML? XML is not just easier to read than pure R code, it also comes with a feature called XSD, the "XML schema definition". Each XML test case we create can immediately be checked against a schema designed by the developer. It can also be checked against our very own RTest.xsd. This means the tester can double-check the created test cases before even executing them. This saves us a lot of time and gives a fixed structure to all test cases.

The reporting was implemented in HTML. This is due to the many features HTML comes with for reporting. It allows coloring of test results, linking to test cases and including images. The main difference for the reporting in HTML between RTest and testthat is that RTest reports every test that shall be executed, not only the failed ones. The test report will also include the value created by the function call and the one given as a reference. The reader can see if the comparison really went right. By this the test report contains way more information than the testthat console log.

An example of a test implementation with RTest

Please note the whole example is stored in a github gist. Please star the gist if you like this example.

  1. Given a function that sums up two columns:
my_function <- function(data = data.frame(x = c(1, 2), y = c(1, 2))) {
  stopifnot(dim(data)[2] == 2)
  data[, "sum"] <- apply(data, 1, function(x) { sum(x) })
  return(data)
}

2. We want to have one successful and one failing test. Both will have three parts in the XML file:

params accounts for input parameters

reference for the output data.frame

testspec for whether the test shall run silently and what the tolerance is

For the successful test our test would look like this:

(XML test case, embedded from the GitHub gist: https://medium.com/media/70c6ccf0679bf3ba46da37ada1604890/href)

You can immediately see one special feature of RTest: it allows data sets to be reused across multiple tests; we store those data sets in the input-data tag. This saves space in the file. The data set test01 will be used here. Moreover, a test description can be given for each test. For each data.frame stored in XML, the types of the columns can be given in col-defs. Here those are all numeric.

The input-data is now given here:

(input-data definition, embedded from the GitHub gist: https://medium.com/media/5fd8414d53e7e01ae262deb90b8c0d03/href)

It’s a data frame with the x column just carrying 1 and the y column just carrying 2. The test shall create a data.frame with the sum column being 3 in each row.

We can easily make the test fail by changing the reference tag: instead of having just 3 in the sum column, we put a 3.5 there. The whole test case, about 90 lines long, can be found inside the GitHub gist.

3. The execution of the test case is just one line of code. Your working directory must be the directory containing the XML file, and my_function must be defined in the global environment.

RTest.execute(getwd(),"RTest_medium.xml")

4. The test report now contains one successful and one failed test. Both will be visualized:

General test outcome in RTest test report

Additional information is given on all tests. The failure of the second test was caused by setting the sum to be 3.5 instead of 3. It's reported at the end of the table:

example of a failed data frame comparison in RTest

Moreover the Report contains information on the environment where the test ran:

System information for an RTest test report

That’s it. Now you can test any package with RTest.

Further reading

RTest: pretty testing of R packages was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.



7 Reasons for Policy Professionals to Get Pumped About R Programming in 2019

Mon, 01/07/2019 - 11:04

(This article was first published on R-Programming – Giles Dickenson-Jones, and kindly contributed to R-bloggers)

Note: A version of this article was also published via LinkedIn here.

With the rise of ‘Big Data’, ‘Machine Learning’ and the ‘Data Scientist’ has come an explosion in the popularity of using open-source programming tools for data analysis.

This article provides a short summary of some of the evidence of these tools overtaking commercial alternatives and why, if you work with data, adding an open programming language, like R or Python, to your professional repertoire is likely to be a worthwhile career investment for 2019 and beyond.

Like most faithful public policy wonks, I've spent more hours than I can count dragging numbers across a screen to understand, analyse or predict whatever segment of the world I have data on.

Exploring where the money was flowing in the world's youngest democracy; analysing which government program was delivering the biggest impact; or predicting which roads were likely to disappear first as a result of climate change.

New policy questions, new approaches to answer them and a fresh set of data.

Yet, every silver-lining has a cloud. And in my experience with data it's often the need to scale a new learning curve to adhere to legacy systems and fulfil an organizational fetish for using their statistical software of choice.

Excel, SAS, SPSS, Eviews, Minitab, Stata, and the list goes on.

Which is why I've decided this article needed to be written:

Because not only am I tired of talking to fellow analytical wonks about why they're limiting themselves by only being able to work on data with spreadsheets, but also that there are distinct professional advantages to unshackling yourself from the professional tyranny of proprietary tools:

1. Open-Source Statistics is Becoming the Global Standard

Firstly, if you haven't been watching, the world is increasingly full of data. So much data, that the world is chasing after nerds to analyse it. As a result, the demand for a 'jack of all trades' data person, or "data scientist", has been outstripping that of a more vanilla-flavoured 'statistician':

% Job Advertisements with term “data scientist” vs. “statistician”

(Credit: Bob Muenchen – r4stats.com)

And although you might not have aspirations to work in what the Harvard Business Review called the 'Sexiest Job of the 21st Century', the data gold rush has had implications far beyond the sex appeal of nerds.

For one, online communities like Stackoverflow, Kaggle and Data for Democracy have flourished. Providing practical avenues for learning how to do some science with data and driving demand for tools that make applying this science accessible to everyone, like R and Python.

So much so, that some of the best evidence suggests that not only is demand for quants with R and Python skills booming, but the practical use of open-source statistical tools like R and Python is starting to eclipse their proprietary relatives:

Statistical software by Google Scholar Hits:

(More credit to Bob Muenchen – r4stats.com)

Of course, I’m not here to conclusively make the point that a particular piece of software is a ‘silver bullet’. Only that something has happened in the world of data that the quantitatively inclined shouldn’t ignore: Not only are R and Python becoming programming languages for the masses, but they’re increasingly proving themselves as powerful complements to more traditional business analysis tools like Excel and SAS.

2. R is for Renaissance Wo(Man)

For those watching the news, you’ll no doubt have heard of the great battle being waged between the R and Python languages that has tragically left the internet strewn with the blood of programmers and their pocket protectors.

But I'm going to goosestep right over the issue, as in my opinion much of what I say for R is increasingly applicable to Python.

For those of you unfamiliar with R, in essence it's a programming language made to use computers to do stuff with numbers.

Enter: "10*10" and it will tell you '100'.

Enter: "print('Sup?')" and the computer will speak to you like that kid loitering on your lawn.
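The two toy commands from above, exactly as typed at the R prompt:

10 * 10
## [1] 100
print("Sup?")
## [1] "Sup?"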

Developed around 25 years ago, the idea behind R was in essence to develop a simpler, more open and extendible programming language for statisticians. Something which allowed you greater power and flexibility than a 'point and click' interface, but that was quicker than punch cards or manually keying in 1s and 0s to tell the computer what to do.

The result: R – a free statistical tool whose sustained growth in use has led to one of the most flexible statistical tools in existence.

So much growth in fact, that in 2014 enough new functionality was added to R by the community that "R added more functions/procs than the SAS Institute has written in its entire history." And while it's not the quantity of your software packages that counts, the speed of development is impressive and a good indication of the likely future trajectory of R's functionality. Particularly as many heavy hitters including the likes of Microsoft, IBM and Google are already using R and making their own contributions to the ecosystem:

Using R for Analytics – Get in Before George Clooney Does:

Image source. Also, see here.

Not only that, but with much of this growth being driven by user contributions, it is also a great reminder of the active and supportive community you have access to as an R and Python user. Making it easier to get help, access free resources and find example code to base your analysis on.

3. R is Data and Discipline Agnostic

(Source: xkcd)

One of the first things that motivated me to learn R was the observation that many of the most interesting questions I encountered went unanswered because they crossed disciplines, involved obscure analytical techniques, or were locked away in a long-forgotten format. It therefore seemed logical to me that if I could become a data analytics "MacGyver", I'd have greater opportunities to work on interesting problems.

Which is how I discovered R. You see, as somebody that is interested in almost everything, R's adoption by such a diverse range of fields made it nearly impossible to overlook. With extensions being freely available to work with a wide variety of data formats (proprietary or otherwise) and apply a range of nerdy methods, R made a lot of sense.

I think it was Richard Branson that once said "If somebody offers you a problem but you are not sure you can do it, say yes. R probably has a package for it":

Then R (and increasingly Python) has you covered.

Yet there is perhaps a subtler reason adopting R made sense and that’s the simple fact that by being ‘discipline agnostic’ it’s well-suited for multidisciplinary teams, applied multi-potentialites and anyone uncertain about exactly where their career might take them.

4. R Helps Avoid Fitting the Problem to the Tool

As an economist, I love a good echo chamber. Not only does everybody speak my language and get my jokes, but my diagnosis of the problem is always spot-on. Unfortunately, thanks to errors of others, I’m aware that such cosy teams of specialists, isn’t always a good idea – with homogeneous specialist teams risking developing solutions which aren’t fit for purpose by too narrowly defining a problem and misunderstanding the scope of the system it’s embedded in.

(Source: chainsawsuit.com)

While good organizations are doing their best to address this, creating teams that are multidisciplinary and have more diverse networks can be a useful means to protect against these risks while also driving better performance. Which of course stands to be another useful advantage of using more general statistical tools with a diverse user base like R: you can more fluidly collaborate across disciplines while being better able to pick the right technique for your problem, reducing the risk that everything looks like a nail merely because you have a hammer.

5. Programming Encourages Reproducibility

Yet programming languages also hold an additional advantage over more typical 'point and click' interfaces for conducting analysis – transparency and reproducibility.

For instance, because software like R encourages you to write down each step in your analysis, your work is more likely to be 'reproducible' than had it been done using more traditional 'point and click' solutions. This is because you're encouraged to record each step needed to achieve the final result, making it easier for your colleagues to understand what the hell you're doing and increasing the likelihood you'll be able to reproduce the results when you need to (or somebody else will).

In addition to this being practically useful for tracing your journey down the data-analysis maze, for analytical teams it can also serve as a means of encouraging collaboration by allowing others to more easily understand your work and replicate your results. Assisting with organizational knowledge retention and providing an incentive for ensuring analysis is accurate by often making it easier to spot errors before they impact your analysis or soil your reputation.

Finally, while the use of scripting isn't unique to open-source programming languages, by being free, R and Python come with an additional advantage: in the instance you decide to release your analysis, the potential audience is likely to be greater and more diverse than had it been written using proprietary software. Which is why, in a world of the "Open Government Partnership", open-source programming languages make a lot of sense, providing a means of easing the transition towards governments publicly releasing their policy models.

6. R Helps Make Bytes Beautiful  

As data-driven-everything becomes all the rage, making data pretty is becoming an increasingly important skill. R is great at this, with virtually unlimited options for unleashing your creativity on the world and communicating your results to the masses. Bar graphs, scatter diagrams, histograms and heat maps. Easy.

Just not pie graphs. They're terrible.

But R’s visualization tools don’t finish at your desk, with the ‘Shiny’ package allowing you to take your pie graphs to the bigtime by publishing interactive dashboards for the web. Boss asking you to redo a graph 20 times each day? Outsource your work to the web by automating it through a dashboard and send them a link while you sip cocktails at the beach.

7. R and Python are free, but the Cost of Ignoring the Trend Towards Open-Source Statistics Won’t Be

Finally, R and Python are free, meaning not only can you install them wherever you want, but that you can take them with you throughout your career:

  • Statistics lecturers prescribing you textbooks that are trying to get you hooked on expensive software that likely won't exist when you graduate? Tell them it's not 1999 and send them a link to this.
  • Working for a not-for-profit organization that needs statistical software but can't afford the costs of proprietary software? Install R and show them how to install Swirl's free interactive lessons.
  • Want to install it at home? No problem. You can even give a copy to your cat.
  • Got a promotion and been gifted a new team of statisticians? Swap the Christmas bonuses for the gift that keeps on giving: R!

But I'm not here to tell you R (or Python) are perfect. After all, there are good reasons some companies are reluctant to switch their analysis to R or Python. Nor am I interested in convincing you that it can, or should, replace every proprietary tool you're using. As I'm an avid spreadsheeter, and programs like Excel have distinct advantages.

Rather, I'd like to suggest that for all the immediate costs involved in learning an open-source programming language, whether it be R or Python, the long-term benefits are more than likely to surpass them.

(Source)

Not only that, but as a new generation of data scientists continue to push for the use of open-source tools, it’s reasonable to expect R and Python will become as pervasive a business tool as the spreadsheet and as important to your career as laughing at your boss’ terrible jokes.  

Interested in learning R? Check out this link here for a range of free resources.

You can also read my review of the online specialization I took to scale the R learning curve here.



Part 2, further comments on OfS grade-inflation report

Mon, 01/07/2019 - 10:15

(This article was first published on R – Let's Look at the Figures, and kindly contributed to R-bloggers)

Update, 2019-01-07: I am pleased to say that the online media article that I complained about in Sec 1 below has now been amended by its author(s), to correct the false attributions.  I am grateful to Chris Parr for helping to sort this out.

In my post a few days ago (which I’ll now call “Part 1”) I looked at aspects of the statistical methods used in a report by the UK government’s Office for Students, about “grade inflation” in English universities.  This second post continues on the same topic.

In this Part 2 I will do two things:

  1. Set the record straight, in relation to some incorrect reporting of Part 1 in the specialist media.
  2. Suggest a new statistical method that (in my opinion) is better than the one used in the OfS report.

The more substantial stuff will be the second bullet there (and of course I wish I didn’t need to do the first bullet at all).   In this post (at section 2 below) I will just outline a better method, by using the same artificial example that I gave in Part 1: hopefully that will be enough to give the general idea, to both specialist and non-specialist readers.  Later I will follow up (in my intended Part 3) with a more detailed description of the suggested better method; that Part 3 post will be suitable mainly for readers with more specialist background in Statistics.

1.  For the record

I am aware of two places where the analysis I gave in Part 1 has been reported:

The first link there is to a paywalled site, I think.  The second one appears to be in the public domain.  I do not recommend following either of those links, though!  If anyone reading this wants to know about what I wrote in Part 1, then my advice is just to read Part 1 directly.

Here I want to mention three specific ways in which that article misrepresents what I wrote in Part 1.  Points 2 and 3 here are the more important ones, I think (but #1 is also slightly troubling, to me):

  1. The article refers to my blog post as “a review commissioned by HE”.  The reality is that a journalist called Chris Parr had emailed me just before Christmas.  In the email Chris introduced himself as “I’m a journalist at Research Fortnight”, and the request he made in the email (in relation to the newly published OfS report) was “Would you or someone you know be interested in taking a look?”.  I had heard of Research Fortnight.  And I was indeed interested in taking a look at the methods used in the OfS report.  But until the above-mentioned article came to my attention, I had never even heard of a publication named HE.  Possibly I am mistaken in this, but to my mind the phrase “a review commissioned by HE” indicates some kind of formal arrangement between HE and me, with specified deliverables and perhaps even payment for the work.  There was in fact no such “commission” for the work that I did.  I merely spent some time during the Christmas break thinking about the methods used in the OfS report, and then I wrote a blog post (and told Chris Parr that I had done that).  And let me repeat: I had never even heard of HE (nor of the article’s apparent author, which was not Chris Parr).  No payment was offered or demanded.  I mention all this here only in case anyone who has read that article  got a wrong impression from it.
  2. The article contains this false statement: “The data is too complex for a reliable statistical method to be used, he said”.  The “he” there refers to me, David Firth.  I said no such thing, neither in my blog post nor in any email correspondence with Chris Parr.  Indeed, it is not something I ever would say: the phrase “data…too complex for a reliable statistical method” is a nonsense.
  3. The article contains this false statement: “He calls the OfS analysis an example of Simpson’s paradox”.  Again, the “he” in that statement refers to me.  But I did not call the OfS analysis an example of Simpson’s paradox, either in my blog post or anywhere else.  (And nor could I have, since I do not have access to the OfS dataset.)  What I actually wrote in my blog post was that my own artificial, specially-constructed example was an instance of Simpson’s paradox — which is not even close to the same thing!

The article mentioned above seems to have had an agenda that was very different from giving a faithful and informative account of my comments on the OfS report.  I suppose that’s journalistic license (although I would naively have expected better from a specialist publication to which my own university appears to subscribe).  The false attribution of misleading statements is not something I can accept, though, and that is why I have written specifically about that here.

To be completely clear:

  • The article mentioned above is misleading.  I do not recommend it to anyone.
  • All of my posts in this blog are my own work, not commissioned by anyone.  In particular, none of what I’ll continue to write below (and also in Part 3 of this extended blog post, when I get to that), about the OfS report, was requested by any journalist.
2.  Towards a better (statistical) measurement model

I have to admit that in Part 1 I ran out of steam at one point, specifically where — in response to my own question about what would be a better way than the method used in the OfS report — I wrote “I do not have an answer“.  I could have and should have done better than that.

Below I will outline a fairly simple approach that overcomes the specific pitfall I identified in Part 1, i.e., the fact that measurement at too high a level of aggregation can give misleading answers.  I will demonstrate my suggested new approach through the same, contrived example that I used in Part 1.  This should be enough to convey the basic idea, I hope.  [Full generality for the analysis of real data will demand a more detailed and more technical treatment of a hierarchical statistical model; I’ll do that later, when I come to write Part 3.]

On reflection, I think a lot of the criticism seen by the OfS report since its publication relates to the use of the word “explain” in that report.  And indeed, that was a factor also in my own (mentioned above) “I do not have an answer” comment.  It seems obvious — to me, anyway — that any serious attempt to explain apparent increases in the awarding of First Class degrees would need to take account of a lot more than just the attributes of students when they enter university.  With the data used in the OfS report I think the best that one can hope to do is to measure those apparent increases (or decreases), in such a way that the measurement is a “fair” one that appropriately takes account of incoming student attributes and their fluctuation over time.  If we take that attitude — i.e, that the aim is only to measure things well, not to explain them — then I do think it is possible to devise a better statistical analysis, for that purpose, than the one that was used in the OfS report.

(I fully recognise that this actually was the attitude taken in the OfS work!  It is just unfortunate that the OfS report’s use of the word “explain”, which I think was intended there mainly as a technical word with its meaning defined by a statistical regression model, inevitably leads readers of the report to think more broadly about substantive explanations for any apparent changes in degree-class distributions.)

2.1  Those “toy” data again, and a better statistical model

Recall the setup of the simple example from Part 1:  Two academic years, two types of university, two types of student.  The data are as follows:

2010-11        University A            University B
               Firsts   Other          Firsts   Other
          h      1000       0      h      500     500
          i         0    1000      i      500     500

2016-17        University A            University B
               Firsts   Other          Firsts   Other
          h      1800     200      h        0       0
          i         0       0      i      500    1500

Our measurement (of change) should reflect the fact that, for each type of student within each university, where information is available, the percentage awarded Firsts actually decreased (in this example).

Change in percent awarded Firsts:

  University A, student type h:  100% --> 90%
  University A, student type i:  no data
  University B, student type h:  no data
  University B, student type i:  50% --> 25%

This provides the key to specification of a suitable (statistical) measurement model:

  • measure the changes at the lowest level of aggregation possible;
  • then, if aggregate conclusions are wanted, combine the separate measurements in some sensible way.

In our simple example, “lowest level of aggregation possible” means that we should measure the change separately for each type of student within each university.  (In the real OfS data, there’s a lower level of aggregation that will be more appropriate, since different degree courses within a university ought to be distinguished too — they have different student intakes, different teaching, different exam boards, etc.)

In Statistics this kind of analysis is often called a stratified analysis.  The quantity of interest (which here is the change in % awarded Firsts) is measured separately in several pre-specified strata, and those measurements are then combined if needed (either through a formal statistical model, or less formally by simple or weighted averaging).

In our simple example above, there are 4 strata (corresponding to 2 types of student within each of 2 universities).  In our specific dataset there is information about the change in just 2 of those strata, and we can summarize that information as follows:

  • in University A, student type h saw their percentage of Firsts reduced by 10% (from 100% to 90%);
  • in University B, student type i saw their percentage of Firsts reduced by 50% (from 50% to 25%).

That’s all the information in the data, about changes in the rate at which Firsts are awarded.  (It was a deliberately small dataset!)

If a combined, “sector-wide” measure of change is wanted, then the separate, stratum-specific measures need to be combined somehow.  To some extent this is arbitrary, and the choice of a combination method ought to depend on the purpose of such a sector-wide measure and (especially) on the interpretation desired for it.  I might find time to write more about this later in Part 3.

For now, let me just recall what was the “sector-wide” measurement that resulted from analysis (shown in Part 1) of the above dataset using the OfS report’s method.  The result obtained by that method was a sector-wide increase of 7.5% in the rate at which Firsts are awarded — which is plainly misleading in the face of data that shows substantial decreases in both universities.  Whilst I do not much like the OfS Report’s “compare with 2010” approach, it does have the benefit of transparency and in my “toy” example it is easy to apply to the stratified analysis:

2016-17           Expected Firsts       Actual
                  based on 2010-11
University A                  2000         1800
University B                  1000          500
------------------------------------------------
Total                         3000         2300

— from which we could report a sector-wide decrease of 700/3000 = 23.3% in the awarding of Firsts, once student attributes are taken properly into account.  (This could be viewed as just a suitably weighted average of the 10% and 50% decreases seen in University A and University B respectively.)
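For readers who like to check the arithmetic, the same sector-wide figure can be reproduced in a few lines of R; this is only a sketch using the toy numbers above, not code from the attached R Markdown document:

expected <- c(A = 2000, B = 1000)   # Firsts expected in 2016-17 at 2010-11 rates
actual   <- c(A = 1800, B =  500)   # Firsts actually awarded in 2016-17
1 - sum(actual) / sum(expected)                      # 0.2333..., i.e. a 23.3% decrease
weighted.mean(c(A = 0.10, B = 0.50), w = expected)   # the same figure as a weighted average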

As before, I have made the full R code available (as an update to my earlier R Markdown document).  For those who don’t use R, I attach here also a PDF copy of that: grade-inflation-example.pdf

2.2  Generalising the better model: More strata, more time-points

The essential idea of a better measurement model is presented above in the context of a small “toy” example, but the real data are of course much bigger and more complex.

The key to generalising the model will simply be to recognise that it can be expressed in the form of a logistic regression model (that’s the same kind of model that was used in the OfS report; but the “better” logistic regression model structure is different, in that it needs to include a term that defines the strata within which measurement takes place).

This will be developed further in Part 3, which will be more technical in flavour than Parts 1 and 2 of this blog-post thread have been.  Just by way of a taster, let me show here the mathematical form of the logistic-regression representation of the “toy” data analysis shown above.  With notation

  • u for providers (universities); u is either A or B in the toy example
  • t for type of student; t is either h or i in the toy example
  • y for years; y is either 2010-11 or 2016-17 in the toy example
  • \(\pi_{uty}\) for the probability of a First in year y, for students of type t in university u

the logistic regression model corresponding to the analysis above is

\[
\log\left(\frac{\pi_{uty}}{1-\pi_{uty}}\right) = \alpha_{ut} + \beta_{y} .
\]

This is readily generalized to situations involving more strata (more universities u and student types t, and also degree-courses within universities).  There were just 4 stratum parameters in the above example, but more strata are easily accommodated.

The model is readily generalized also, in a similar way, to more than 2 years of data.

For comparison, the corresponding logistic regression model as used in the OfS report looks like this:

\[
\log\left(\frac{\pi_{uty}}{1-\pi_{uty}}\right) = \alpha_{u} + \gamma_{t} + \beta_{y} .
\]

So it is superficially very similar.  But the all-important term that determines the necessary strata within universities is missing from the OfS model.
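To make the comparison concrete, here is a minimal sketch (not the author's published code) of how both model structures could be fitted to the toy data with glm(). Strata with no students in a year simply contribute no rows; note that glm() may warn about the stratum whose observed rate of Firsts is 0%.

toy <- data.frame(
  uni    = c("A", "A", "B", "B", "A", "B"),
  type   = c("h", "i", "h", "i", "h", "i"),
  year   = c("2010-11", "2010-11", "2010-11", "2010-11", "2016-17", "2016-17"),
  firsts = c(1000,    0,  500,  500, 1800,  500),
  other  = c(   0, 1000,  500,  500,  200, 1500)
)

# "SASA"-style model: uni * type gives one parameter per (university, student-type)
# stratum, plus a year effect
m_sasa <- glm(cbind(firsts, other) ~ uni * type + year, family = binomial, data = toy)

# OfS-style model: university and student type enter only as main effects,
# so the within-university strata are not represented
m_ofs  <- glm(cbind(firsts, other) ~ uni + type + year, family = binomial, data = toy)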

I will aim to flesh this out a bit in a new Part 3 post within the next few days, if time permits.  For now I suppose the model I’m suggesting here needs a name (i.e., a name that identifies it more clearly than just “my better model”!)  Naming things is not my strong point, unfortunately!  But, for now at least, I will term the analysis introduced above “stratified by available student attributes” — or “SASA model” for short.

(The key word there is “stratified”.)

© David Firth, January 2019

To cite this entry:
Firth, D (2019). Part 2, further comments on OfS grade-inflation report. Weblog entry at URL https://statgeek.net/2019/01/07/part-2-further-comments-on-ofs-grade-inflation-report/



RcppStreams 0.1.2

Mon, 01/07/2019 - 03:12

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance release of RcppStreams arrived on CRAN this afternoon. RcppStreams brings the excellent Streamulus C++ template library for event stream processing to R.

Streamulus, written by Irit Katriel, uses very clever template meta-programming (via Boost Fusion) to implement an embedded domain-specific event language created specifically for event stream processing.

This release provides a minor update, motivated by the upcoming BH release 1.69 for which we made a pre-release – and RcppStreams is one of three packages identified by that pre-release as needing minor accommodations for the new Boost release. We also updated the packaging to current CRAN standards, but made no other changes.

The NEWS file entries follows below:

Changes in version 0.1.2 (2019-01-05)
  • Added symbol registration

  • Converted a few files from CRLF to CR [CRAN checks]

  • Added a few final newlines [CRAN checks]

  • Disabled one optional output to enable BH 1.69 builds

Courtesy of CRANberries, there is also a copy of the DESCRIPTION file for this release. More detailed information is on the RcppStreams page and of course on the Streamulus page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



Dow Jones Stock Market Index (2/4): Trade Volume Exploratory Analysis

Mon, 01/07/2019 - 01:54

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    Categories

    1. Basic Statistics

    Tags

    1. Data Visualisation
    2. Import Data
    3. R Programming

    This is the second part of the 4-series articles about Dow Jones Stock Market. To read the first part go to this link. In this part, I am going to analyze the Dow Jones Industrial Average (DJIA) trade volume.

    Packages

    The packages being used in this post series are herein listed.

    suppressPackageStartupMessages(library(lubridate))
    suppressPackageStartupMessages(library(fBasics))
    suppressPackageStartupMessages(library(lmtest))
    suppressPackageStartupMessages(library(urca))
    suppressPackageStartupMessages(library(ggplot2))
    suppressPackageStartupMessages(library(quantmod))
    suppressPackageStartupMessages(library(PerformanceAnalytics))
    suppressPackageStartupMessages(library(rugarch))
    suppressPackageStartupMessages(library(FinTS))
    suppressPackageStartupMessages(library(forecast))
    suppressPackageStartupMessages(library(strucchange))
    suppressPackageStartupMessages(library(TSA))

    Getting Data

    We upload the environment status as saved at the end of part 1.

    load(file='DowEnvironment.RData')

    Daily Volume Exploratory Analysis

    From the saved environment, we can find back our DJI object. We plot the daily volume.

    dj_vol <- DJI[,"DJI.Volume"]
    plot(dj_vol)

    The level jump at the beginning of 2017 is remarkable, something that we will investigate in part 4.

    We transform the volume time series data and timeline index into a dataframe.

    dj_vol_df <- xts_to_dataframe(dj_vol) head(dj_vol_df) ## year value ## 1 2007 327200000 ## 2 2007 259060000 ## 3 2007 235220000 ## 4 2007 223500000 ## 5 2007 225190000 ## 6 2007 226570000 tail(dj_vol_df) ## year value ## 3015 2018 900510000 ## 3016 2018 308420000 ## 3017 2018 433080000 ## 3018 2018 407940000 ## 3019 2018 336510000 ## 3020 2018 288830000 Basic statistics summary (dj_stats <- dataframe_basicstats(dj_vol_df)) ## 2007 2008 2009 2010 ## nobs 2.510000e+02 2.530000e+02 2.520000e+02 2.520000e+02 ## NAs 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ## Minimum 8.640000e+07 6.693000e+07 5.267000e+07 6.840000e+07 ## Maximum 4.571500e+08 6.749200e+08 6.729500e+08 4.598900e+08 ## 1. Quartile 2.063000e+08 2.132100e+08 1.961850e+08 1.633400e+08 ## 3. Quartile 2.727400e+08 3.210100e+08 3.353625e+08 2.219025e+08 ## Mean 2.449575e+08 2.767164e+08 2.800537e+08 2.017934e+08 ## Median 2.350900e+08 2.569700e+08 2.443200e+08 1.905050e+08 ## Sum 6.148432e+10 7.000924e+10 7.057354e+10 5.085193e+10 ## SE Mean 3.842261e+06 5.965786e+06 7.289666e+06 3.950031e+06 ## LCL Mean 2.373901e+08 2.649672e+08 2.656970e+08 1.940139e+08 ## UCL Mean 2.525248e+08 2.884655e+08 2.944104e+08 2.095728e+08 ## Variance 3.705505e+15 9.004422e+15 1.339109e+16 3.931891e+15 ## Stdev 6.087286e+07 9.489163e+07 1.157199e+08 6.270480e+07 ## Skewness 9.422400e-01 1.203283e+00 1.037015e+00 1.452082e+00 ## Kurtosis 1.482540e+00 2.064821e+00 6.584810e-01 3.214065e+00 ## 2011 2012 2013 2014 ## nobs 2.520000e+02 2.500000e+02 2.520000e+02 2.520000e+02 ## NAs 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ## Minimum 8.410000e+06 4.771000e+07 3.364000e+07 4.287000e+07 ## Maximum 4.799800e+08 4.296100e+08 4.200800e+08 6.554500e+08 ## 1. Quartile 1.458775e+08 1.107150e+08 9.488000e+07 7.283000e+07 ## 3. Quartile 1.932400e+08 1.421775e+08 1.297575e+08 9.928000e+07 ## Mean 1.804133e+08 1.312606e+08 1.184434e+08 9.288516e+07 ## Median 1.671250e+08 1.251950e+08 1.109250e+08 8.144500e+07 ## Sum 4.546415e+10 3.281515e+10 2.984773e+10 2.340706e+10 ## SE Mean 3.897738e+06 2.796503e+06 2.809128e+06 3.282643e+06 ## LCL Mean 1.727369e+08 1.257528e+08 1.129109e+08 8.642012e+07 ## UCL Mean 1.880897e+08 1.367684e+08 1.239758e+08 9.935019e+07 ## Variance 3.828475e+15 1.955108e+15 1.988583e+15 2.715488e+15 ## Stdev 6.187468e+07 4.421660e+07 4.459353e+07 5.211034e+07 ## Skewness 1.878239e+00 3.454971e+00 3.551752e+00 6.619268e+00 ## Kurtosis 5.631080e+00 1.852581e+01 1.900989e+01 5.856136e+01 ## 2015 2016 2017 2018 ## nobs 2.520000e+02 2.520000e+02 2.510000e+02 2.510000e+02 ## NAs 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ## Minimum 4.035000e+07 4.589000e+07 1.186100e+08 1.559400e+08 ## Maximum 3.445600e+08 5.734700e+08 6.357400e+08 9.005100e+08 ## 1. Quartile 8.775250e+07 8.224250e+07 2.695850e+08 2.819550e+08 ## 3. Quartile 1.192150e+08 1.203550e+08 3.389950e+08 4.179200e+08 ## Mean 1.093957e+08 1.172089e+08 3.112396e+08 3.593710e+08 ## Median 1.021000e+08 9.410500e+07 2.996700e+08 3.414700e+08 ## Sum 2.756772e+10 2.953664e+10 7.812114e+10 9.020213e+10 ## SE Mean 2.433611e+06 4.331290e+06 4.376432e+06 6.984484e+06 ## LCL Mean 1.046028e+08 1.086786e+08 3.026202e+08 3.456151e+08 ## UCL Mean 1.141886e+08 1.257392e+08 3.198590e+08 3.731270e+08 ## Variance 1.492461e+15 4.727538e+15 4.807442e+15 1.224454e+16 ## Stdev 3.863238e+07 6.875709e+07 6.933572e+07 1.106550e+08 ## Skewness 3.420032e+00 3.046742e+00 1.478708e+00 1.363823e+00 ## Kurtosis 1.612326e+01 1.122161e+01 3.848619e+00 3.277164e+00

    In the following, we comment on some of the relevant metrics shown above.

    Mean

    Years when Dow Jones daily volume has positive mean are:

    filter_dj_stats(dj_stats, "Mean", 0)
    ## [1] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016"
    ## [11] "2017" "2018"

    All Dow Jones daily volume mean values in ascending order.

    dj_stats["Mean",order(dj_stats["Mean",,])]
    ## 2014 2015 2016 2013 2012 2011 2010
    ## Mean 92885159 109395714 117208889 118443373 131260600 180413294 201793373
    ## 2007 2008 2009 2017 2018
    ## Mean 244957450 276716364 280053730 311239602 359371036

    Median

    Years when Dow Jones daily volume has positive median are:

    filter_dj_stats(dj_stats, "Median", 0)
    ## [1] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016"
    ## [11] "2017" "2018"

    All Dow Jones daily volume median values in ascending order.

    dj_stats["Median",order(dj_stats["Median",,])]
    ## 2014 2016 2015 2013 2012 2011 2010
    ## Median 81445000 94105000 102100000 110925000 125195000 167125000 190505000
    ## 2007 2009 2008 2017 2018
    ## Median 235090000 244320000 256970000 299670000 341470000

    Skewness

    Years when Dow Jones daily volume has positive skewness are:

    filter_dj_stats(dj_stats, "Skewness", 0)
    ## [1] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016"
    ## [11] "2017" "2018"

    All Dow Jones daily volume skewness values in ascending order.

    dj_stats["Skewness",order(dj_stats["Skewness",,])]
    ## 2007 2009 2008 2018 2010 2017 2011
    ## Skewness 0.94224 1.037015 1.203283 1.363823 1.452082 1.478708 1.878239
    ## 2016 2015 2012 2013 2014
    ## Skewness 3.046742 3.420032 3.454971 3.551752 6.619268

    Excess Kurtosis

    Years when Dow Jones daily volume has positive excess kurtosis are:

    filter_dj_stats(dj_stats, "Kurtosis", 0)
    ## [1] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016"
    ## [11] "2017" "2018"

    All Dow Jones daily volume excess kurtosis values in ascending order.

    dj_stats["Kurtosis",order(dj_stats["Kurtosis",,])]
    ## 2009 2007 2008 2010 2018 2017 2011
    ## Kurtosis 0.658481 1.48254 2.064821 3.214065 3.277164 3.848619 5.63108
    ## 2016 2015 2012 2013 2014
    ## Kurtosis 11.22161 16.12326 18.52581 19.00989 58.56136

    Box-plots

    dataframe_boxplot(dj_vol_df, "DJIA daily volume box plots 2007-2018")

    The trade volume starts to decrease from 2010, and in 2017 a remarkable increase occurred. The 2018 volume has been even larger than in 2017 and in the other years as well.

    Density plots

    dataframe_densityplot(dj_vol_df, "DJIA daily volume density plots 2007-2018")

    Shapiro Tests

    dataframe_shapirotest(dj_vol_df)
    ## result
    ## 2007 6.608332e-09
    ## 2008 3.555102e-10
    ## 2009 1.023147e-10
    ## 2010 9.890576e-13
    ## 2011 2.681476e-16
    ## 2012 1.866544e-20
    ## 2013 6.906596e-21
    ## 2014 5.304227e-27
    ## 2015 2.739912e-21
    ## 2016 6.640215e-23
    ## 2017 4.543843e-12
    ## 2018 9.288371e-11

    The null hypothesis of normality is rejected for all years.

    QQ plots

    dataframe_qqplot(dj_vol_df, "DJIA daily volume QQ plots 2007-2018")

    QQ plots visually confirm the non-normality of the daily trade volume distribution.

    Daily volume log-ratio Exploratory Analysis

    Similarly to log-returns, we can define the trade volume log-ratio as

    \[
    v_{t} := \ln \frac{V_{t}}{V_{t-1}}
    \]
    We can compute it by CalculateReturns within the PerformanceAnalytics package and plot it.

    dj_vol_log_ratio <- CalculateReturns(dj_vol, "log")
    dj_vol_log_ratio <- na.omit(dj_vol_log_ratio)
    plot(dj_vol_log_ratio)
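    The same series can also be obtained without PerformanceAnalytics, since the log-ratio is simply the first difference of the logged volumes (an equivalent sketch, not code from the original post):

    dj_vol_log_ratio_alt <- na.omit(diff(log(dj_vol)))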

    Mapping the trade volume log-ratio time series data and timeline index into a dataframe.

    dj_vol_df <- xts_to_dataframe(dj_vol_log_ratio) head(dj_vol_df) ## year value ## 1 2007 -0.233511910 ## 2 2007 -0.096538449 ## 3 2007 -0.051109832 ## 4 2007 0.007533076 ## 5 2007 0.006109458 ## 6 2007 0.144221282 tail(dj_vol_df) ## year value ## 3014 2018 0.44563907 ## 3015 2018 -1.07149878 ## 3016 2018 0.33945998 ## 3017 2018 -0.05980236 ## 3018 2018 -0.19249224 ## 3019 2018 -0.15278959 Basic statistics summary (dj_stats <- dataframe_basicstats(dj_vol_df)) ## 2007 2008 2009 2010 2011 ## nobs 250.000000 253.000000 252.000000 252.000000 252.000000 ## NAs 0.000000 0.000000 0.000000 0.000000 0.000000 ## Minimum -1.606192 -1.122526 -1.071225 -1.050181 -2.301514 ## Maximum 0.775961 0.724762 0.881352 1.041216 2.441882 ## 1. Quartile -0.123124 -0.128815 -0.162191 -0.170486 -0.157758 ## 3. Quartile 0.130056 0.145512 0.169233 0.179903 0.137108 ## Mean -0.002685 0.001203 -0.001973 -0.001550 0.000140 ## Median -0.010972 0.002222 -0.031748 -0.004217 -0.012839 ## Sum -0.671142 0.304462 -0.497073 -0.390677 0.035162 ## SE Mean 0.016984 0.016196 0.017618 0.019318 0.026038 ## LCL Mean -0.036135 -0.030693 -0.036670 -0.039596 -0.051141 ## UCL Mean 0.030766 0.033100 0.032725 0.036495 0.051420 ## Variance 0.072112 0.066364 0.078219 0.094041 0.170850 ## Stdev 0.268536 0.257612 0.279677 0.306661 0.413341 ## Skewness -0.802037 -0.632586 0.066535 -0.150523 0.407226 ## Kurtosis 5.345212 2.616615 1.500979 1.353797 14.554642 ## 2012 2013 2014 2015 2016 ## nobs 250.000000 252.000000 252.000000 252.000000 252.000000 ## NAs 0.000000 0.000000 0.000000 0.000000 0.000000 ## Minimum -2.158960 -1.386215 -2.110572 -1.326016 -1.336471 ## Maximum 1.292956 1.245202 2.008667 1.130289 1.319713 ## 1. Quartile -0.152899 -0.145444 -0.144280 -0.143969 -0.134011 ## 3. Quartile 0.144257 0.149787 0.134198 0.150003 0.141287 ## Mean 0.001642 -0.002442 0.000200 0.000488 0.004228 ## Median -0.000010 -0.004922 0.013460 0.004112 -0.002044 ## Sum 0.410521 -0.615419 0.050506 0.123080 1.065480 ## SE Mean 0.021293 0.019799 0.023514 0.019010 0.019089 ## LCL Mean -0.040295 -0.041435 -0.046110 -0.036952 -0.033367 ## UCL Mean 0.043579 0.036551 0.046510 0.037929 0.041823 ## Variance 0.113345 0.098784 0.139334 0.091071 0.091826 ## Stdev 0.336667 0.314299 0.373274 0.301780 0.303028 ## Skewness -0.878227 -0.297951 -0.209417 -0.285918 0.083826 ## Kurtosis 8.115847 4.681120 9.850061 4.754926 4.647785 ## 2017 2018 ## nobs 251.000000 251.000000 ## NAs 0.000000 0.000000 ## Minimum -0.817978 -1.071499 ## Maximum 0.915599 0.926101 ## 1. Quartile -0.112190 -0.119086 ## 3. Quartile 0.110989 0.112424 ## Mean -0.000017 0.000257 ## Median -0.006322 0.003987 ## Sum -0.004238 0.064605 ## SE Mean 0.013446 0.014180 ## LCL Mean -0.026500 -0.027671 ## UCL Mean 0.026466 0.028185 ## Variance 0.045383 0.050471 ## Stdev 0.213032 0.224658 ## Skewness 0.088511 -0.281007 ## Kurtosis 3.411036 4.335748

In the following, we comment on some of the relevant metrics shown above.

    Mean

    Years when Dow Jones daily volume log-ratio has positive mean are:

    filter_dj_stats(dj_stats, "Mean", 0) ## [1] "2008" "2011" "2012" "2014" "2015" "2016" "2018"

    All Dow Jones daily volume log-ratio mean values in ascending order.

    dj_stats["Mean",order(dj_stats["Mean",,])] ## 2007 2013 2009 2010 2017 2011 2014 ## Mean -0.002685 -0.002442 -0.001973 -0.00155 -1.7e-05 0.00014 2e-04 ## 2018 2015 2008 2012 2016 ## Mean 0.000257 0.000488 0.001203 0.001642 0.004228 Median

    Years when Dow Jones daily volume log-ratio has positive median are:

    filter_dj_stats(dj_stats, "Median", 0) ## [1] "2008" "2014" "2015" "2018"

    All Dow Jones daily volume log-ratio median values in ascending order.

    dj_stats["Median",order(dj_stats["Median",,])] ## 2009 2011 2007 2017 2013 2010 ## Median -0.031748 -0.012839 -0.010972 -0.006322 -0.004922 -0.004217 ## 2016 2012 2008 2018 2015 2014 ## Median -0.002044 -1e-05 0.002222 0.003987 0.004112 0.01346 Skewness

    Years when Dow Jones daily volume log-ratio has positive skewness are:

    filter_dj_stats(dj_stats, "Skewness", 0) ## [1] "2009" "2011" "2016" "2017"

All Dow Jones daily volume log-ratio skewness values in ascending order.

    dj_stats["Skewness",order(dj_stats["Skewness",,])] ## 2012 2007 2008 2013 2015 2018 ## Skewness -0.878227 -0.802037 -0.632586 -0.297951 -0.285918 -0.281007 ## 2014 2010 2009 2016 2017 2011 ## Skewness -0.209417 -0.150523 0.066535 0.083826 0.088511 0.407226 Excess Kurtosis

Years when Dow Jones daily volume log-ratio has positive excess kurtosis are:

    filter_dj_stats(dj_stats, "Kurtosis", 0) ## [1] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016" ## [11] "2017" "2018"

    All Dow Jones daily volume log-ratio excess kurtosis values in ascending order.

    dj_stats["Kurtosis",order(dj_stats["Kurtosis",,])] ## 2010 2009 2008 2017 2018 2016 2013 ## Kurtosis 1.353797 1.500979 2.616615 3.411036 4.335748 4.647785 4.68112 ## 2015 2007 2012 2014 2011 ## Kurtosis 4.754926 5.345212 8.115847 9.850061 14.55464 Box-plots dataframe_boxplot(dj_vol_df, "DJIA daily volume box plots 2007-2018")

The most positive extreme values can be spotted in years 2011, 2014 and 2016; the most negative extreme values in years 2007, 2011, 2012 and 2014.

    Density plots dataframe_densityplot(dj_vol_df, "DJIA daily volume density plots 2007-2018")

    Shapiro Tests dataframe_shapirotest(dj_vol_df) ## result ## 2007 3.695053e-09 ## 2008 6.160136e-07 ## 2009 2.083475e-04 ## 2010 1.500060e-03 ## 2011 3.434415e-18 ## 2012 8.417627e-12 ## 2013 1.165184e-10 ## 2014 1.954662e-16 ## 2015 5.261037e-11 ## 2016 7.144940e-11 ## 2017 1.551041e-08 ## 2018 3.069196e-09

Based on the reported p-values, we can reject the null hypothesis of normality for all years.

    QQ plots dataframe_qqplot(dj_vol_df, "DJIA daily volume QQ plots 2007-2018")

    Departure from normality can be spotted for all reported years.

Saving the current environment for further analysis.

    save.image(file='DowEnvironment.RData')

    If you have any questions, please feel free to comment below.

    References
    1. Dow Jones Industrial Average
    2. Skewness
    3. Kurtosis
    4. An introduction to analysis of financial data with R, Wiley, Ruey S. Tsay
    5. Time series analysis and its applications, Springer ed., R.H. Shumway, D.S. Stoffer
    6. Applied Econometric Time Series, Wiley, W. Enders, 4th ed.
    7. Forecasting – Principle and Practice, Texts, R.J. Hyndman
    8. Options, Futures and other Derivatives, Pearson ed., J.C. Hull
    9. An introduction to rugarch package
    10. Applied Econometrics with R, Achim Zeileis, Christian Kleiber – Springer Ed.
    11. GARCH modeling: diagnostic tests
    Disclaimer

Any securities or databases referred to in this post are solely for illustration purposes, and in no way should the findings presented here be interpreted as investment advice or promotion of any particular security or source.

    Related Post

    1. Dow Jones Stock Market Index (1/4): Log Returns Exploratory Analysis
    2. Six Sigma DMAIC Series in R – Part4
    3. NYC buses: company level predictors with R
    4. Visualizations for correlation matrices in R
    5. Interpretation of the AUC


    February 21st & 22nd: End-2-End from a Keras/TensorFlow model to production

    Mon, 01/07/2019 - 01:00

    (This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

    Registration is now open for my 1.5-day workshop on how to develop end-2-end from a Keras/TensorFlow model to production.

    It will take place on February 21st & 22nd in Berlin, Germany.

    Please register by sending an email to shirin.glander@gmail.com with the following information:

    • name
    • company/institute/affiliation
    • address for invoice
    • phone number
    • reference to this blog

    The course material will be in English and we will speak a mix of German and English, depending on the participants’ preferences.

    (The same course will also be given on 11th & 12th of April, 2019 in Münster, Germany.)

    In this workshop, you will learn

    • the basics of deep learning
• what cross-entropy and loss are
    • about activation functions
    • how to optimize weights and biases with backpropagation and gradient descent
    • how to build (deep) neural networks with Keras and TensorFlow
    • how to save and load models and model weights
    • how to visualize models with TensorBoard
    • how to make predictions on test data
    • how to deploy models with Spotify Luigi and TensorFlow Serving


    Timing the Same Algorithm in R, Python, and C++

    Sun, 01/06/2019 - 20:55

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    While developing the RcppDynProg R package I took a little extra time to port the core algorithm from C++ to both R and Python.

    This means I can time the exact same algorithm implemented nearly identically in each of these three languages. So I can extract some comparative “apples to apples” timings. Please read on for a summary of the results.

    The algorithm in question is the general dynamic programming solution to the “minimum cost partition into intervals” problem. As coded in C++ it uses one-time allocation of large tables and then for-loops and index chasing to fill in the dynamic programming table solution. The C++ code is given here.
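For intuition, here is a minimal R sketch of that generic dynamic program (my own illustration, not the RcppDynProg code): given a cost matrix costs[i, j] for treating points i..j as a single interval, it fills a prefix table and then walks back-pointers to recover the interval starts.

# best_cost[j + 1] = cheapest partition of the first j points into intervals
min_cost_partition <- function(costs) {
  n <- nrow(costs)
  best_cost <- c(0, rep(Inf, n))        # best_cost[1] is the empty prefix
  best_prev <- rep(NA_integer_, n + 1)  # start index of the last interval
  for (j in 1:n) {
    for (i in 1:j) {                    # last interval covers points i..j
      cand <- best_cost[i] + costs[i, j]
      if (cand < best_cost[j + 1]) {
        best_cost[j + 1] <- cand
        best_prev[j + 1] <- i
      }
    }
  }
  cuts <- integer(0)                    # recover interval starts by back-tracking
  k <- n + 1
  while (k > 1) { cuts <- c(best_prev[k], cuts); k <- best_prev[k] }
  list(cost = best_cost[n + 1], starts = cuts)
}
min_cost_partition(matrix(1, 5, 5))     # toy costs: a single interval (start 1) is optimal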

I then directly transliterated (or line-for-line translated) this code into R (code here) and Python (code here). Both of these implementations are very direct translations of the C++ solution, so they are possibly not what somebody starting in R or Python would design. So really we are coding in an imperative C style in R and Python. To emphasize the shallowness of the port I deliberately left the semi-colons from the C++ in the R port. The Python can be taken to be equally “un-Pythonic” (for example, we are using for loops and not list comprehensions).

    That being said we now have very similar code to compare in all three languages. We can summarize the timings (details here and here) as follows.

problem                              solution          language                time in seconds
500 point partition into intervals   dynamic program   R                       21
500 point partition into intervals   dynamic program   C++ (from R via Rcpp)   0.088
500 point partition into intervals   dynamic program   Python                  39

Notice that for this example C++ is about 240 times faster than R, and R is almost twice as fast as Python.

Neither R nor Python is optimized for the type of index-chasing this dynamic programming solution depends on. So we also took a look at a simpler problem: computing the PRESS statistic, which is easy to vectorize (the preferred way of writing efficient code in both R and Python). When we compare all three languages on this problem we see the following.

problem                                       solution   method               time in seconds
3,000,000 point PRESS statistic calculation   R          scalar code          3.4
3,000,000 point PRESS statistic calculation   Rcpp       scalar code          0.26
3,000,000 point PRESS statistic calculation   R          vectorized code      0.35
3,000,000 point PRESS statistic calculation   Python     vectorized (numpy)   0.21

    The timing details can be found here and here.
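As a side note on why this case vectorizes so well: for an ordinary lm() fit, the whole PRESS statistic can be computed in one vectorized pass via the leave-one-out identity e_(i) = e_i / (1 - h_ii). The snippet below is a hedged illustration of that idea on simulated data, not the benchmark code compared above.

# Vectorized PRESS for a linear model via the hat-matrix shortcut.
set.seed(2019)
d <- data.frame(x = rnorm(3000))
d$y <- 2 * d$x + rnorm(3000)
fit <- lm(y ~ x, data = d)
press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)  # sum of squared leave-one-out residuals
press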

Ignoring the R scalar solution (which is too direct a translation from C++ to R, but a stepping stone to the R vectorized solution, as we discuss here), we see that vectorized Python is now about 1.6 times faster than the vectorized R and even 1.2 times faster than the C++ (probably not due to Rcpp, but instead driven by my choice of container class in the C++ code).

    Obviously different code (and per-language tuning and optimization) will give different results. But the above is consistent with our general experience with R, Python, and C++ in production.

    In conclusion: R and Python are in fact much slower than C++ for direct scalar manipulation (single values, indexes, and pointers). However, R and Python are effective glue languages that can be fast when they are orchestrating operations over higher level abstractions (vectors, databases, data frames, and Spark).



    2018 Volatility Recap

    Sun, 01/06/2019 - 20:39

    (This article was first published on R – Quintuitive, and kindly contributed to R-bloggers)

    2018 brought more volatility to the markets, which so far has spilled into 2019. Let’s take a look at the long term volatility history picture using the Dow Jones Industrial Average:

    Indeed, 2018 was the most volatile year since 2011. Relatively speaking however, the volatility is on the low end for a bear market, which I believe started in late December.

    The above chart was produced using the following R code:

    library(quantmod) library(ggplot2) library(ggthemes) library(grid) dji.annual.volatility = function(dji, year) { dates = paste("/", as.character(year), sep="") dji = na.exclude(dji[dates]) djiVol = aggregate(dji, as.numeric(format(index(dji), "%Y")), function(ss) coredata(tail(TTR:::volatility( ss, n=NROW(ss), calc="close"), 1))) xx = ecdf(as.vector(djiVol))(as.numeric(tail(djiVol,1))) print(xx) absRets = na.exclude(abs(ROC(dji[dates], type="discrete"))) yy = as.numeric(format(index(absRets), "%Y")) zz = aggregate(absRets, yy, function(ss) tail(cumprod(1+ss),1)) print(as.vector(tail(zz,1))) df = cbind(as.data.frame(index(djiVol)), coredata(djiVol)) colnames(df) = c("Year", "Volatility") gg = qplot(x=Year, y=Volatility, data=df, geom="line", colour=Volatility, xlab="Year", ylab="Volatility") gg = gg + theme_solarized(base_size=16, base_family="verdana", light=TRUE) return(list(plot=gg, dji=dji, dji.vol=djiVol, crystal.ball=zz, df=df)) }

    The post 2018 Volatility Recap appeared first on Quintuitive.



    Scaling H2O analytics with AWS and p(f)urrr (Part 1)

    Sun, 01/06/2019 - 10:00

    (This article was first published on Digital Age Economist on Digital Age Economist, and kindly contributed to R-bloggers)

    H2O + AWS + purrr (Part I)

    In these small tutorials to follow over the next 3 weeks, I go through the steps of using an AWS1 AMI Rstudio instance to run a toy machine learning example on a large AWS instance. I have to admit that you have to know a little bit about AWS to follow the next couple of steps, but be assured it doesn’t take too much googling to find your way if you get confused at any stage by some of the jargon.

    • Part I of this III part series will show you how to get the infrastructure working on AWS.
    • Part II will showcase how we can combine purrr and h2o to train and stack ML models2.
    • Part III, the final installment in the series, will show you how we can use furrr, reticulate – boto3 and AWS to spin up remote machines from your local work environment, but train remotely. Have you ever dreamt of running ML models on a 150 core 2TB RAM machine? Well, for a couple of dollars, I am going to show you how.

If you haven't got an AWS account already, set one up now if you want to follow along with the rest of this series.

    Why I moved to H2O from caret

I have always been a big fan of the caret package. It was my introduction to machine learning in R. Its modular framework allows for a very experimental environment. It allows for the exploration of effects such as class imbalance on the outcome, with options such as SMOTE, ROSE and up- and down-sampling, while easily allowing for K-fold (and repeated K-fold) cross-validation. It also lets the programmer configure a training grid for multiple hyperparameter specifications, allowing for very fine tuning of the models. Lastly, let's not forget the fact that there are over 200 models to play with.
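As a quick illustration of that modular workflow, here is a minimal caret sketch (my own toy example, not code from this post): repeated K-fold cross-validation, up-sampling to counter class imbalance, and an explicit tuning grid. The simulated twoClassSim data and the rf method (which needs the randomForest package installed) are just convenient stand-ins.

# Minimal caret sketch (illustrative only): repeated K-fold CV, up-sampling
# for class imbalance, and an explicit hyperparameter grid for a random forest.
library(caret)
set.seed(42)
d <- twoClassSim(500, intercept = -12)            # simulated, imbalanced two-class data
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2,
                     sampling = "up")             # alternatives: "down", "smote", "rose"
fit <- train(Class ~ ., data = d, method = "rf",
             trControl = ctrl,
             tuneGrid = expand.grid(mtry = c(2, 5, 10)))
fit$bestTune                                      # best mtry found by cross-validation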

H2O does most of this, but allows for a much more “programmatic” approach to the problems. After the first couple of interactions with the API, you start to enjoy the bulletproof engineering of the underlying code. It does not offer the wide array of features of caret, but it has the big 5: glm(net), rf, gbm, xgboost, deeplearning, and also stacking of the models. H2O allows for a much easier programming journey to a production ML system. The robust engineering strategy allows us to scale the ML training for a much larger experimental design.

I think caret is very good if you only need to run a single model that you want to tune and experiment with, i.e. academia and small-scale ML applications. H2O, on the other hand, does a lot of the nitty-gritty in the background for you, which is great for production:

• It removes constant columns
    • It doesn’t break if it sees an unknown factor variable
    • It has early stopping which is easily implemented
    • Exploits multicore and gpu processing in a much more effective manner
    • Does crazy amounts of logging for any errors that might creep up3

Lastly, and certainly important if you are going to train a couple of models: it's fast!

In my case, I was looking for a profit optimisation model over multiple markets (~80), for which I wanted to introduce a rolling-window design, check the performance of rf, gbm and xgboost, and in the end put the model into production with a short delay. This is a lot of modeling, and H2O made all of this ‘easy’-ish.
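To make the comparison concrete, here is a hedged h2o sketch (a toy example of mine, not the code of the upcoming Part II): start a local cluster, train a GBM with early stopping against a validation frame, and evaluate it. The iris split is only a placeholder dataset.

# Minimal h2o sketch (illustrative only): local cluster, GBM with early stopping.
library(h2o)
h2o.init(nthreads = -1)                           # start (or connect to) a local H2O cluster
iris_h2o <- as.h2o(iris)
splits <- h2o.splitFrame(iris_h2o, ratios = 0.8, seed = 42)
fit <- h2o.gbm(x = 1:4, y = "Species",
               training_frame = splits[[1]],
               validation_frame = splits[[2]],
               ntrees = 200,
               stopping_rounds = 5, stopping_metric = "logloss",  # early stopping
               seed = 42)
h2o.performance(fit, newdata = splits[[2]])       # evaluate on the held-out frame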

    Introducing AWS

If you haven’t started migrating your analytics to the cloud, then hopefully this will convince you to start reconsidering. The opportunity to have access to 64-, 96- or even 128-core machines with 2TB of RAM rarely crosses the path of most Data Scientists. This can mostly be attributed to the fact that most of us don’t really need such a large machine for what we need to achieve; see Szilard’s twitter posts if you need convincing. Another reason we don’t use these big machines is simply that we don’t have access to such machines within our working environments. Luckily for us, cloud computing has become more accessible and, well, let’s be honest, cheap as chips.

    Using prebuilt AMIs

An AMI, or Amazon Machine Image, is a pre-built, all-inclusive environment containing the operating system and any other applications we might have installed, e.g. Rstudio, MariaDB, Docker etc.

One of these AMIs, specifically built for analytics and deep learning in Rstudio, is maintained by Louis Aslett. This AMI is designed to facilitate most applications using MCMC sampling, such as Stan or Jags. It also takes away the headache of setting up an instance that is able to accommodate the Keras and Tensorflow interfaces on a Linux machine. In addition, the image contains LaTeX and rmarkdown, along with various other installations to allow for a plug-and-play setup.
If you have ever tried to do any of the above yourself, then you will know what an achievement this is. So to Louis:

I have used Louis Aslett’s AMI and added rJava, as well as installed the caret and h2o libraries for us to play with. This AMI is publicly available. We can use the search functionality to look for the AMI under public images: ami-0157656a8c5b46458. For security reasons, I advise that you change the default port on which the Rstudio instance listens from port 80 to something random, such as port 8765, if you are going to run this in production. If you want to do this, you will need to reconfigure the nginx server conf under /etc/nginx/sites-available.

    To access the Rstudio setup once you get to the end of this post, navigate to the Public IP as indicated in the screenshot at the end and log in using the predefined credentials. For now the image uses4:

    • user: rstudio
    • pass: rstudioh2ocaret

To use one of these AMIs, we need to log into Amazon and navigate to the EC2 dashboard. I use the Ohio region, so make sure that you set this region in the left-hand corner if you want to follow the setup of the AMI.

Well-documented instructions on how to set up an AWS account are easily available on multiple blogs, but I will show you how to find the AMI using the search function on the AWS website. Firstly, access your ec2 dashboard and go to the AMI tab on the left:

    Setting up the AMI on a Spot instance

Whenever I use AWS, I prefer using Spot Instances:

    A Spot Instance is an unused EC2 instance that is available for less than the On-Demand price. Because Spot Instances enable you to request unused EC2 instances at steep discounts, you can lower your Amazon EC2 costs significantly. The hourly price for a Spot Instance is called a Spot price

I typically use these machines for anything from an hour to a couple of hours, so on-demand has never been a need. The instances have never been a pain to get, nor have I been restricted in using an instance because usage suddenly spiked. These spot instances are the perfect way to keep costs low while having access to HUGE machines.

    Choose instance and configure:

Right-click on the H2O_ami and select Spot Request. This will bring you to a screen showing the machine menu. For now, we won’t get too carried away, so let’s select the r3.4xlarge image. This machine is one of my favourites; although it’s not part of the current generation of machines, it is a memory-optimised machine which works well for ML training:

    • 16 cores
    • 122GB RAM
    • 1 x 320 SSD
    • Spot Price: $0.1517/h

You can also configure the instance to shut down when a maximum spot price is reached. In steps 4 and 5 you can add storage and tags if you want. I don’t show this, as we don’t want to get too bogged down in the detail; we want to get to the part where we play with H2O.

    Add security group

In step 6, you have to configure the security group of the instance. For now, we are not going to set up any crazy configurations. We will open port 22 (ssh) and 80 (http), which will allow us to access the machine and Rstudio from anywhere. If safety is a concern, you can be a lot more specific about the configuration.

    The final step before the instance is launched is confirming you have the appropriate key pair to access the machine:

    Once you have done this, you should be able to see your new instance being initialized in the Instance tab of your ec2 dashboard:

Well done, you have now spun up a machine that is probably around 3-4 times bigger than your current working environment on a laptop or PC. Remember to terminate this machine when you are done with it!

    To check that the machine has the 16 cores and 122GB RAM as promised, we can SSH5 into it and use the htop command to confirm:

    To access the Rstudio IDE, open a browser and go the the public IP provided. Once you have entered the username and password you should be given access to a welcome screen.

    Well done! You now have access to an analytical system that is CPU/GPU enabled and can handle training of ML models using either caret or H2O.

    I also recommend you read the Welcome.R file which contains information on the very helpful RStudioAMI package.

In the next post we are going to explore running models using the H2O API and combining it with purrr to create multiple-model tibbles for easy evaluation.

1. Disclaimer: I don’t work for AWS, it is just the platform that I ended up using over the past year or so. I am sure Google Cloud (or any other cloud service) is just as good ^
2. For those excited to see OLS in action, sorry to disappoint, those aren’t ML models… ^
    3. I am looking at you Error in nominalTrainWorkflow... ^
    4. In future I would like to adjust this to the specifications that Louis had. In the original AMI, the password is the instance ID, which is very clever – unfortunately I haven’t been able to figure it out for this H2O AMI ^
    5. If you do not know about SSH you are probably working on a Windows machine – follow these instructions to install putty ^


    Co-integration and Mean Reverting Portfolio

    Sun, 01/06/2019 - 06:41

    (This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

In the previous post https://statcompute.wordpress.com/2018/07/29/co-integration-and-pairs-trading, it was shown how to identify two co-integrated stocks in the pair trade. In the example below, I will show how to form a mean-reverting portfolio with three or more stocks, i.e. stocks that are co-integrated, and also how to find the linear combination of these stocks that is stationary.

First of all, we download the price series of three stocks from finance.yahoo.com.

    ### GET DATA FROM YAHOO FINANCE quantmod::getSymbols("FITB", from = "2010-01-01") FITB <- get("FITB")[, 6] quantmod::getSymbols("MTB", from = "2010-01-01") MTB <- get("MTB")[, 6] quantmod::getSymbols("BAC", from = "2010-01-01") BAC <- get("BAC")[, 6]

For the residual-based co-integration test, we can utilize the Pu statistic in the Phillips-Ouliaris test to identify the co-integration among the three stocks. As shown below, the null hypothesis of no co-integration is rejected, indicating that these three stocks are co-integrated and therefore form a mean-reverting portfolio. The test regression used to derive the residual for the statistical test is also given.

    k <- trunc(4 + (length(FITB) / 100) ^ 0.25) po.test <- urca::ca.po(cbind(FITB, MTB, BAC), demean = "constant", lag = "short", type = "Pu") #Value of test-statistic is: 62.7037 #Critical values of Pu are: # 10pct 5pct 1pct #critical values 33.6955 40.5252 53.8731 po.test@testreg # Estimate Std. Error t value Pr(|t|) #(Intercept) -1.097465 0.068588 -16.00 <2e-16 *** #z[, -1]MTB.Adjusted 0.152637 0.001487 102.64 <2e-16 *** #z[, -1]BAC.Adjusted 0.140457 0.007930 17.71 <2e-16 ***

Based on the test regression output, a linear combination can be derived by [FITB + 1.097465 – 0.152637 * MTB – 0.140457 * BAC]. The ADF test result confirms that this linear combination of the three stocks is indeed stationary.

    ts1 <- FITB + 1.097465 - 0.152637 * MTB - 0.140457 * BAC tseries::adf.test(ts1, k = k) #Dickey-Fuller = -4.1695, Lag order = 6, p-value = 0.01

Alternatively, we can also utilize the Johansen test, which is based upon the likelihood ratio, to identify the co-integration. While the null hypothesis of no co-integration (r = 0) is rejected, the null hypothesis of r <= 1 cannot be rejected at the 5% significance level, suggesting that there exists one co-integration relationship.

    js.test <- urca::ca.jo(cbind(FITB, MTB, BAC), type = "trace", K = k, spec = "longrun", ecdet = "const") # test 10pct 5pct 1pct #r <= 2 | 3.26 7.52 9.24 12.97 #r <= 1 | 19.72 17.85 19.96 24.60 #r = 0 | 45.88 32.00 34.91 41.07 js.test@V # FITB.Adjusted.l6 MTB.Adjusted.l6 BAC.Adjusted.l6 constant #FITB.Adjusted.l6 1.0000000 1.000000 1.000000 1.0000000 #MTB.Adjusted.l6 -0.1398349 -0.542546 -0.522351 -0.1380191 #BAC.Adjusted.l6 -0.1916826 1.548169 3.174651 -0.9654671 #constant 0.6216917 17.844653 -20.329085 6.8713179

Similarly, based on the above eigenvectors, a linear combination can be derived by [FITB + 0.6216917 – 0.1398349 * MTB – 0.1916826 * BAC]. The ADF test result also confirms that this linear combination of the three stocks is stationary.

ts2 <- FITB + 0.6216917 - 0.1398349 * MTB - 0.1916826 * BAC tseries::adf.test(ts2, k = k) #Dickey-Fuller = -4.0555, Lag order = 6, p-value = 0.01


    Making Original Bingo – Heart Theme

    Sun, 01/06/2019 - 01:00

    (This article was first published on R on Chi's Impe[r]fect Blog, and kindly contributed to R-bloggers)

I have learned how to draw a heart with a mathematical equation; in fact, there are so many “curves” you can draw with equations. Amazing, right?!? You can find all sorts of different curves on the Wolfram MathWorld site. I'm really curious how people found the equations for some of these shapes.

Also, at the last family reunion, I played “Bingo”, which I hadn't played for ages! It's a great game when you have a wide range of age groups: the kids were having tons of fun, but so were the adults and grandparents!

So I've decided to create a little twist on bingo…! Instead of just drawing numbers between 1 and 75, you can play bingo with words or pictures. Since I just really wanted to use my new heart-shape drawing function, I decided to make up a bingo with some words related to Valentine's Day.

    Drawing Heart Shape with ggplot2

You can look at my past entry on how to draw a cannabis leaf here. Same concept.

There were 6 different heart curve examples on the Wolfram site, but I liked the 6th heart best for its shape. You can look at the equation here.

    library(tidyverse) library(patchwork) ## Function to generate heart shape around point xc and rc with some sizing. Output of function is data frame with bunch of points which you can use to draw a heart! # http://mathworld.wolfram.com/HeartCurve.html heart_vertices <- function(xc,yc,size,npoints=100,...){ #area = pi*r*r for circle... this heart has area of 180 = r*r radius = size*0.05 ## I'm not set on this... I just wanted to make size smaller yc = yc + 0.1*radius ## Just adjusting center of heart bit t = seq(-pi,pi, length.out=npoints+1) x = xc + 16 * radius *(sin(t))^3 y = yc + radius*13*cos(t) - radius*5*cos(2*t) - radius*2*cos(3*t) - radius*cos(4*t) df <- tibble(theta=t,x=x,y=y) return(df) } ## Above function with generate points you'd need to generate heart. If you adjust npoints to be higher, then you can draw smoother shape. heart_vertices(0,0,1) %>% ggplot() + geom_line(aes(x=theta, y=x), color="#D22042de") + geom_line(aes(x=theta, y= -x),color="#D22042de",linetype=3) + ## this is just to make a design geom_line(aes(x=theta, y=y), color="#30C4C9de") + geom_line(aes(x=theta, y= -y), color="#30C4C9de", linetype=3) + ## this is just to make a design geom_polygon(aes(x=x,y=y), fill="#615375de", ## to draw heart use x=x,y=y color="#61537520",size=10) + ## I'm just playing around with line around filled part. theme_minimal(base_family="Roboto Condensed") + scale_x_continuous(breaks=c(-pi,-pi/2,0,pi/2,pi), labels=c("-pi","","0","","pi")) + coord_fixed() + labs(x="",y="", caption="pink solid line = x value & blue solid line = y value")

    Drawing Bunch of Hearts on Grid

Now that I know how to draw a single heart, I want to be able to plot a bunch of hearts on a grid. Since the function heart_vertices creates a data frame for a single heart around the point (xc, yc), I can create a grid with coordinates xc and yc.

    my_grid <- tibble( xc=rep(c(1:15), times=5), yc=rep(c(1:5), each=15), size=0.6, id = c(1:75) ) my_grid %>% ggplot(aes(x=xc,y=yc)) + geom_point(shape=21, size=10) + geom_text(aes(label=id), family="Roboto Condensed", fontface="bold") + theme_minimal() + coord_fixed()

    ## For each points on grid generate points to draw heart print_out_base <-my_grid %>% mutate(v=pmap(.,heart_vertices)) %>% unnest(v) %>% ggplot(aes(x=x,y=y,group=id)) + geom_polygon(aes(fill=factor(yc))) + coord_fixed() + theme_minimal(base_family="Roboto Condensed") + scale_x_continuous(breaks=seq(0.5,16.5, by=1), labels=NULL) + scale_y_continuous(breaks=seq(0.5,5.5,by=1), labels=NULL) + scale_fill_manual(values=c("#30bcad","#57a337","#f8b620","#e03426","#eb73b3"), guide="none") print_out_base

    Putting Words On Top of Hearts

Now that I have the hearts placed on the grid, I want some words on top. I took inspiration from the candies with sayings that I often see during the week of Valentine's Day; I didn't know the name of the candy, but I think it's called Necco Sweethearts.

    ## Needs at least 75 words.... since there are 75 heart with some word placed on it. ## Some are not from those candies, I just made some up. love_msgs <- tibble( msgs = c("143", "#1 FAN", "#LOVE", "1-800\nCUPID", "HUG ME", "KISSES", "BE MINE", "CRAZY\n4 U", "HOLD\nHANDS", "UR\nLOVED", "PURR FECT", "WOO", "QT PIE", "RECIPE\n4 LOVE", "RISING STAR", "TABLE\nFOR 2", "TOO SWEET", "TWEET", "TWO HEARTS", "TXT ME", "UR HOT", "WHATS UP", "DESTINY", "WICKED COOL", "WINK\nWINK", "STUNNING", "XOXO", "YOU&ME", "YUM\nYUM","SOUL\nMATE","BABE", "SAY YES","HELLO","DREAM\nBIG","BFF","HIGH\nFIVE","AWESOME", "SMILE","UR\nGR8","PHONE\nME","LOVE\nBIRD","BE TRUE","SURE LOVE", "MY BABY","HI GORGEOUS","HOT\nSTUFF","ADORE\nME","FUN","LOL","CALL ME", "PICK ME","DEAR\nONE","EVER\nAFTER","LOVER","ALL\nMINE","ANGEL", "RU SHY","SWEET PEA","LOVE\nBUG","ADORABLE","EMBRACE","FLOWERS", "CHERISH","CHOCOLATE","CUPCAKES","CRUSH","SECRET\nADMIRER", "VALENTINE","DOVES","LOVEBIRDS","DIAMONDS","PAARTY","HONEY", "PASSION","AWWW") ) ## Here you want to make sure you have at leat 75 words! love_msgs <- love_msgs %>% arrange(msgs) %>% ## sort them alphabetically.... It makes it easier to find word that were picked out. mutate(idx=row_number()-1, row_group=floor(idx/15)+1) print_out_base + geom_text(aes(x=xc,y=yc, label=love_msgs$msgs), color="#ffffffde",family="Roboto Condensed", size=3, fontface="bold", lineheight=0.8, data=. %>% filter(theta==pi)) + labs(title="Print & Cut Them Into Pieces & Draw Them Out of Hat or Box",x="",y="", caption="")

    Making Bingo Cards

Similarly, now I want to make the 5 x 5 bingo cards that each person gets so they can participate in the game. I've generated 4 cards as an example.

    ## Making Bingo Cards (Base Design) bingo_base <-tibble( xc = rep(c(1:5),times=5), yc = rep(c(1:5),each=5), size=0.6, id = c(1:25) ) %>% mutate(v=pmap(., heart_vertices)) %>% unnest(v) %>% ggplot(aes(x=x,y=y,group=id)) + geom_polygon(aes(fill=factor(xc))) + geom_polygon(fill="#000000de", data=. %>% filter(xc==3,yc==3))+ theme_minimal(base_family="Roboto Condensed") + scale_x_continuous(breaks=c(1,2,3,4,5),labels=c("B","I","N","G","O"), position="top") + scale_y_continuous(labels=NULL) + scale_fill_manual(values=c("#30bcad","#57a337","#f8b620","#e03426","#eb73b3"), guide="none") + labs(x="",y="") + coord_fixed() ## Just to make card little more fun, let's add some quotes about love on each cards. library(rvest) love_quotes <- read_html("https://lifehacks.io/inspirational-love-quotes-sayings/") %>% html_nodes("h2") %>% html_text() love_quotes <- love_quotes[2:64] ## Creating function to create one bingo card with randomly selected words on each rows. make_card <- function(name="") { love_msgs_list <- love_msgs %>% split(.$row_group) unique_card <- tibble( xc = rep(c(1:5),each=5), yc = rep(c(1:5),times=5), ## from each lists i want to select 5 randomly. msg = love_msgs_list %>% map(.,"msgs") %>% map(.,sample,5) %>% unlist() ) unique_card <- unique_card %>% mutate(msg=ifelse(xc==3&yc==3,"FREE",msg)) bingo_card <- bingo_base + geom_text(data=unique_card, aes(x=xc,y=yc,label=msg, group=NULL), family="Roboto Condensed", fontface="bold", color="white", size=3) + labs(title=str_c(name),caption=sample(love_quotes,size=1)) bingo_card } ## using patchwork, I want to print 4 cards make_card("BINGO CARD 1") + make_card("BINGO CARD 2") + make_card("BINGO CARD 3") + make_card("BINGO CARD 4") + patchwork::plot_layout(ncol=2)

    Bonus : Drawing Flowers To Go With Hearts

I just thought it would be nice to draw flowers too. After all, flowers go with hearts :). You can read more about the rose curve here or here.

    flower_vertices <- function(xc,yc,radius,k=5,npoints=300,...){ t = seq(0,2*pi, length.out=npoints+1) m = sqrt(radius) * cos(k * t) x = xc + m * cos(t) y = yc + m * sin(t) df <- tibble(t=t,x=x,y=y,r=m) return(df) } flower_vertices(0,0,1,7) %>% ggplot(aes(x=t)) + geom_line(aes(y=x), color="red", linetype=3) + geom_line(aes(y=y), color="blue",linetype=3) + geom_polygon(aes(x=x,y=y), alpha=0.5) + theme_minimal(base_family="Roboto Condensed") + coord_fixed() + labs(title="Rose Curve with K=7 - Flower with 7 Petals")

    tibble( xc=rep(c(1:5),time=5), yc=rep(c(1:5),each=5), radius=0.1, k = c(1:25), id=c(1:25) ) %>% mutate(v=pmap(.,flower_vertices)) %>% unnest(v) %>% ggplot(aes(x=x,y=y,group=id)) + geom_polygon(aes(fill=id%%2)) + geom_point(aes(x=xc,y=yc), data=. %>% count(id,xc,yc), size=3,shape=19, alpha=0.7) + geom_text(aes(x=x,y=y, label=k), family="Roboto Condensed", size=8, vjust=1, fontface="bold", color="#000000ae", data=. %>% group_by(id) %>% filter(max(t)==t)) + theme_void(base_family="Roboto Condensed") + coord_fixed() + scale_y_reverse() + scale_fill_viridis_c(begin=0.2,end=0.7,option="magma", guide="none", alpha=0.8) + labs(title="Rose Curves with differnt k ", subtitle="r = cos(k * theta) ")

    Flower Needs Butterfly Too..

There's also another curve, called the “butterfly curve”.

I think a flower deserves a butterfly… So here's the butterfly curve drawn in a similar manner to the above.

    butterfly_vertices <- function(xc,yc,npoints=1000,...){ t = seq(0,12*pi, length.out=npoints+1) x = xc + sin(t)*(exp(cos(t))-2*cos(4*t)-sin(t/12)^5) y = yc + cos(t)*(exp(cos(t))-2*cos(4*t)-sin(t/12)^5) df <- tibble(x=x,y=y,t=t) %>% mutate(pos=row_number()) return(df) } ggplot() + geom_path(data=butterfly_vertices(1,1),aes(x=x,y=y, color=pos)) + geom_polygon(data=butterfly_vertices(8,1), aes(x=x,y=y,fill=factor(floor(t/pi))), color="#000000de") + coord_fixed() + theme_void() + scale_fill_viridis_d(alpha=0.3, guide="none") + scale_color_viridis_c(option="magma", guide="none")



    gganimation for the nation

    Sun, 01/06/2019 - 01:00

    (This article was first published on HighlandR, and kindly contributed to R-bloggers)

    gganimate hits CRAN –

    At the NHS_R conference back in October, I showed a few ways of building animations using gifski – I also wrote up the method I used in an earlier post right here.

    And, the source code for all this stuff is available from my github here.

    However – the gganimate file in the repo is now out of date, so use the code down below.

    Now at the time of the conference, the method I used to build animations involved a fair bit of slicing, dicing and a bit of purrr to build up individual plots and stitch them together. It was a workaround because gganimate was on github at the time, and installing packages from github onto NHS computers is likely to be highly problematic, to put it mildly.

    But, the New Year comes, and despite the fact the country is going completely to pot, there are still things to celebrate because gganimate has now hit CRAN.

    So, join me as I take my mind off it all by updating some of the code I used with this wonderful new package.

    Fair Warning – regular reader(s) will have seen these plots before.
    Sorry.
    However, I still think it’s (vaguely) worthwhile writing this because while there were / are quite a few examples of how to use gganimate with points, I couldn’t find one that shows how to keep the old points on screen while new points are added (Disclaimer – I haven’t put too much effort into additional checking of the latest examples before writing this).

If you are using lines and/or segments in your plots, then transition_reveal keeps the history of the line over time, but it did not work with geom_point.

    Here’s the setup :

    source("1_setup.R") source("2_data_wrangling.R") library(gganimate) # Change "Tranfer In" or "Transfer Out" to "Transfer" plot_data$Movement_Type <- gsub("Transfer.*","Transfer",x = plot_data$Movement_Type) #convert Movement Type to factor, as the first sequence of dots turns red instead of green plot_data$Movement_Type <- as_factor(plot_data$Movement_Type) #check the levels levels(plot_data$Movement_Type) plot_data$Movement_Type <- forcats::fct_rev(plot_data$Movement_Type) levels(plot_data$Movement_Type) lims <- as.POSIXct(strptime(c("2014-09-03 00:00","2014-09-04 01:00") , format = "%Y-%m-%d %H:%M")) p <- ggplot(plot_data,aes(Movement15,Movement_15_SEQNO, colour = Movement_Type)) + geom_jitter(width = 0.10) + scale_colour_manual(values = plot_colours) + facet_grid(Staging_Post~.,switch = "y") + # facet_wrap(vars(Staging_Post), ncol = 1) + scale_x_datetime(date_labels = "%H:%M",date_breaks = "3 hours", limits = lims, timezone = "GMT", expand = c(0,0)) + ggtitle(label = "Anytown General Hospital | Wednesday 3rd September 2014 00:00 to 23:59\n", subtitle = "A&E AND INPATIENT ARRIVALS, DEPARTURES AND TRANSFERS") + labs(x = NULL, y = NULL, caption = "@_johnmackintosh | johnmackintosh.com Source: Neil Pettinger | @kurtstat | kurtosis.co.uk") + theme_minimal(base_size = 11) + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) + theme(axis.text.x = element_text(size = 7)) + theme(axis.ticks.x = element_blank()) + theme(legend.position = "bottom") + theme(panel.grid.minor = element_blank(), panel.grid.major = element_blank()) + theme(strip.text.y = element_text(angle = 180)) + guides(color = guide_legend("Movement Type")) + ggExtra::removeGrid()

    You will need to hit up my repo to grab the source files, and obviously you’ll want to change the titles and labels.

    Now, here is the gganimate bit:

    p <- p + labs(title = 'Time: {frame_time}', x = 'Time In/ Out', y = NULL) + transition_time(MovementDateTime) + shadow_mark(past = TRUE, future = FALSE) + ease_aes('linear') p #OR #animate(p, fps = 5, width = 1000, height = 600) #save it if you want to: #anim_save("newgganimate.gif")

    Transition_time, by itself, plotted each dot where I expected it, but did not keep the previous points on screen.
    I experimented with some other options (ie. shadow_wake and shadow_trail), but it seems

    shadow_mark(past = TRUE, future = FALSE)

    is the magic sauce for this particular task.
    I could be completely wrong about that, and maybe there is a better way, in which case, let me know in the comments.

    Here is the final gif:

    Also at the event, I didn’t get a chance to demonstrate the Shiny app I built.
    It was my first ever attempt, and while admittedly it’s nothing earth shattering, you have to start somewhere right?
    It’s over here, but please lower your expectations. There is only one dropdown to interact with:
    https://johnm.shinyapps.io/NHS_R_PatientFlow/

    And with that, I stop thinking about R for a bit.

    See you next month.



    Dow Jones Stock Market Index (1/4): Log Returns Exploratory Analysis

    Sun, 01/06/2019 - 00:15

    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

      Categories

      1. Basic Statistics

      Tags

      1. Data Visualisation
      2. Import Data
      3. R Programming

In this four-post series, I am going to analyze the Dow Jones Industrial Average (DJIA) index over the years 2007-2018. The Dow Jones Industrial Average (DJIA) is a stock market index that indicates the value of thirty large, publicly owned companies based in the United States. The value of the DJIA is based upon the sum of the price of one share of stock for each component company. The sum is corrected by a factor which changes whenever one of the component stocks has a stock split or stock dividend. See ref. [1] for further details.
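In formula terms (a standard statement of a price-weighted index; the exact divisor value is maintained by the index provider):

\[
DJIA\ =\ \frac{\sum_{i=1}^{30} P_{i}}{d}
\]
where P_i is the price of one share of component company i and d is the Dow divisor, which is adjusted for stock splits and dividends.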

      The main questions I will try to answer are:

• How have the returns and trade volume changed over the years?
• How has the volatility of the returns and trade volume changed over the years?
• How can we model the returns volatility?
• How can we model the trade volume volatility?
For this purpose, the post series is split as follows:

      Part 1: getting data, summaries, and plots of daily and weekly log-returns
      Part 2: getting data, summaries, and plots of daily trade volume and its log-ratio
      Part 3: daily log-returns analysis and GARCH model definition
      Part 4: daily trade volume analysis and GARCH model definition

      Packages

      The packages being used in this post series are herein listed.

suppressPackageStartupMessages(library(lubridate)) # dates manipulation
suppressPackageStartupMessages(library(fBasics)) # enhanced summary statistics
suppressPackageStartupMessages(library(lmtest)) # coefficients significance tests
suppressPackageStartupMessages(library(urca)) # unit root test
suppressPackageStartupMessages(library(ggplot2)) # visualization
suppressPackageStartupMessages(library(quantmod)) # getting financial data
suppressPackageStartupMessages(library(PerformanceAnalytics)) # calculating returns
suppressPackageStartupMessages(library(rugarch)) # GARCH modeling
suppressPackageStartupMessages(library(FinTS)) # ARCH test
suppressPackageStartupMessages(library(forecast)) # ARMA modeling
suppressPackageStartupMessages(library(strucchange)) # structural changes
suppressPackageStartupMessages(library(TSA)) # ARMA order identification

To the right of each package, I added a short comment to highlight the main functionality I took advantage of. Of course, those packages offer much more than just what I indicated.

Package versions are herein listed.

      packages <- c("lubridate", "fBasics", "lmtest", "urca", "ggplot2", "quantmod", "PerformanceAnalytics", "rugarch", "FinTS", "forecast", "strucchange", "TSA") version <- lapply(packages, packageVersion) version_c <- do.call(c, version) data.frame(packages=packages, version = as.character(version_c)) ## packages version ## 1 lubridate 1.7.4 ## 2 fBasics 3042.89 ## 3 lmtest 0.9.36 ## 4 urca 1.3.0 ## 5 ggplot2 3.1.0 ## 6 quantmod 0.4.13 ## 7 PerformanceAnalytics 1.5.2 ## 8 rugarch 1.4.0 ## 9 FinTS 0.4.5 ## 10 forecast 8.4 ## 11 strucchange 1.5.1 ## 12 TSA 1.2

      The R language version running on Windows-10 is:

R.version ## _ ## platform x86_64-w64-mingw32 ## arch x86_64 ## os mingw32 ## system x86_64, mingw32 ## status ## major 3 ## minor 5.2 ## year 2018 ## month 12 ## day 20 ## svn rev 75870 ## language R ## version.string R version 3.5.2 (2018-12-20) ## nickname Eggshell Igloo

Getting Data

Taking advantage of the getSymbols() function made available within the quantmod package, we get the Dow Jones Industrial Average from 2007 up to the end of 2018.

      suppressMessages(getSymbols("^DJI", from = "2007-01-01", to = "2019-01-01")) ## [1] "DJI" dim(DJI) ## [1] 3020 6 class(DJI) ## [1] "xts" "zoo"

Let us have a look at the DJI xts object, which provides six time series, as we can see.

      head(DJI) ## DJI.Open DJI.High DJI.Low DJI.Close DJI.Volume DJI.Adjusted ## 2007-01-03 12459.54 12580.35 12404.82 12474.52 327200000 12474.52 ## 2007-01-04 12473.16 12510.41 12403.86 12480.69 259060000 12480.69 ## 2007-01-05 12480.05 12480.13 12365.41 12398.01 235220000 12398.01 ## 2007-01-08 12392.01 12445.92 12337.37 12423.49 223500000 12423.49 ## 2007-01-09 12424.77 12466.43 12369.17 12416.60 225190000 12416.60 ## 2007-01-10 12417.00 12451.61 12355.63 12442.16 226570000 12442.16 tail(DJI) ## DJI.Open DJI.High DJI.Low DJI.Close DJI.Volume DJI.Adjusted ## 2018-12-21 22871.74 23254.59 22396.34 22445.37 900510000 22445.37 ## 2018-12-24 22317.28 22339.87 21792.20 21792.20 308420000 21792.20 ## 2018-12-26 21857.73 22878.92 21712.53 22878.45 433080000 22878.45 ## 2018-12-27 22629.06 23138.89 22267.42 23138.82 407940000 23138.82 ## 2018-12-28 23213.61 23381.88 22981.33 23062.40 336510000 23062.40 ## 2018-12-31 23153.94 23333.18 23118.30 23327.46 288830000 23327.46

More precisely, we have available the OHLC (Open, High, Low, Close) index values, the adjusted close value and the trade volume. Here we can see the corresponding chart as produced by chartSeries within the quantmod package.

      chartSeries(DJI, type = "bars", theme="white")

      We herein analyse the adjusted close value.

      dj_close <- DJI[,"DJI.Adjusted"]

      Simple and log returns

      Simple returns are defined as:

      \[
      R_{t}\ :=\ \frac{P_{t}}{P_{t-1}}\ \ -\ 1
      \]

      Log returns are defined as:

      \[
      r_{t}\ :=\ ln\frac{P_{t}}{P_{t-1}}\ =\ ln(1+R_{t})
      \]

      See ref. [4] for further details.

      We compute log returns by taking advantage of CalculateReturns within PerformanceAnalytics package.

      dj_ret <- CalculateReturns(dj_close, method = "log") dj_ret <- na.omit(dj_ret)

      Let us have a look.

      head(dj_ret) ## DJI.Adjusted ## 2007-01-04 0.0004945580 ## 2007-01-05 -0.0066467273 ## 2007-01-08 0.0020530973 ## 2007-01-09 -0.0005547987 ## 2007-01-10 0.0020564627 ## 2007-01-11 0.0058356461 tail(dj_ret) ## DJI.Adjusted ## 2018-12-21 -0.018286825 ## 2018-12-24 -0.029532247 ## 2018-12-26 0.048643314 ## 2018-12-27 0.011316355 ## 2018-12-28 -0.003308137 ## 2018-12-31 0.011427645
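As a quick sanity check (using the dj_close and dj_ret objects defined above), log returns are simply the first difference of log prices, so the following should agree with CalculateReturns(..., "log") up to numerical noise.

dj_ret_check <- na.omit(diff(log(dj_close)))
max(abs(coredata(dj_ret_check) - coredata(dj_ret)))  # expected to be ~0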

      This gives the following plot.

      plot(dj_ret)

Sharp increases and decreases in volatility can be eyeballed. That will be verified in depth in part 3.

      Helper Functions

We need some helper functions to ease basic data conversions, summaries and plots. Here below are the details.

1. Conversion from xts to a dataframe with year and value columns. That allows for summaries and plots on a yearly basis.

      xts_to_dataframe <- function(data_xts) { df_t <- data.frame(year = factor(year(index(data_xts))), value = coredata(data_xts)) colnames(df_t) <- c( "year", "value") df_t }

      2. Enhanced summaries statistics for data stored as data frame columns.

      bs_rn <- rownames(basicStats(rnorm(10,0,1))) # gathering the basic stats dataframe output row names that get lost with tapply() dataframe_basicstats <- function(dataset) { result <- with(dataset, tapply(value, year, basicStats)) df_result <- do.call(cbind, result) rownames(df_result) <- bs_rn as.data.frame(df_result) }

      3. Basic statistics dataframe row threshold filtering to return the associated column names.

      filter_dj_stats <- function(dataset_basicstats, metricname, threshold) { r <- which(rownames(dataset_basicstats) == metricname) colnames(dataset_basicstats[r, which(dataset_basicstats[r,] > threshold), drop = FALSE]) }

      4. Box-plots with faceting based on years.

      dataframe_boxplot <- function(dataset, title) { p <- ggplot(data = dataset, aes(x = year, y = value)) + theme_bw() + theme(legend.position = "none") + geom_boxplot(fill = "lightblue") p <- p + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle(title) + ylab("year") p }

      5. Density plots with faceting based on years.

      dataframe_densityplot <- function(dataset, title) { p <- ggplot(data = dataset, aes(x = value)) + geom_density(fill = "lightblue") p <- p + facet_wrap(. ~ year) p <- p + theme_bw() + theme(legend.position = "none") + ggtitle(title) p }

      6. QQ plots with faceting based on years.

      dataframe_qqplot <- function(dataset, title) { p <- ggplot(data = dataset, aes(sample = value)) + stat_qq(colour = "blue") + stat_qq_line() p <- p + facet_wrap(. ~ year) p <- p + theme_bw() + theme(legend.position = "none") + ggtitle(title) p }

      7. Shapiro Test

shp_pvalue <- function (v) { shapiro.test(v)$p.value } dataframe_shapirotest <- function(dataset) { result <- with(dataset, tapply(value, year, shp_pvalue)) as.data.frame(result) }

Daily Log-returns Exploratory Analysis

      We transform our original Dow Jones time series into a dataframe with year and value columns. That will ease plots and summaries by year.

dj_ret_df <- xts_to_dataframe(dj_ret) head(dj_ret_df) ## year value ## 1 2007 0.0004945580 ## 2 2007 -0.0066467273 ## 3 2007 0.0020530973 ## 4 2007 -0.0005547987 ## 5 2007 0.0020564627 ## 6 2007 0.0058356461 tail(dj_ret_df) ## year value ## 3014 2018 -0.018286825 ## 3015 2018 -0.029532247 ## 3016 2018 0.048643314 ## 3017 2018 0.011316355 ## 3018 2018 -0.003308137 ## 3019 2018 0.011427645

Basic statistics summary

      Basic statistics summary is given.

      (dj_stats <- dataframe_basicstats(dj_ret_df)) ## 2007 2008 2009 2010 2011 ## nobs 250.000000 253.000000 252.000000 252.000000 252.000000 ## NAs 0.000000 0.000000 0.000000 0.000000 0.000000 ## Minimum -0.033488 -0.082005 -0.047286 -0.036700 -0.057061 ## Maximum 0.025223 0.105083 0.066116 0.038247 0.041533 ## 1. Quartile -0.003802 -0.012993 -0.006897 -0.003853 -0.006193 ## 3. Quartile 0.005230 0.007843 0.008248 0.004457 0.006531 ## Mean 0.000246 -0.001633 0.000684 0.000415 0.000214 ## Median 0.001098 -0.000890 0.001082 0.000681 0.000941 ## Sum 0.061427 -0.413050 0.172434 0.104565 0.053810 ## SE Mean 0.000582 0.001497 0.000960 0.000641 0.000837 ## LCL Mean -0.000900 -0.004580 -0.001207 -0.000848 -0.001434 ## UCL Mean 0.001391 0.001315 0.002575 0.001678 0.001861 ## Variance 0.000085 0.000567 0.000232 0.000104 0.000176 ## Stdev 0.009197 0.023808 0.015242 0.010182 0.013283 ## Skewness -0.613828 0.224042 0.070840 -0.174816 -0.526083 ## Kurtosis 1.525069 3.670796 2.074240 2.055407 2.453822 ## 2012 2013 2014 2015 2016 ## nobs 250.000000 252.000000 252.000000 252.000000 252.000000 ## NAs 0.000000 0.000000 0.000000 0.000000 0.000000 ## Minimum -0.023910 -0.023695 -0.020988 -0.036402 -0.034473 ## Maximum 0.023376 0.023263 0.023982 0.038755 0.024384 ## 1. Quartile -0.003896 -0.002812 -0.002621 -0.005283 -0.002845 ## 3. Quartile 0.004924 0.004750 0.004230 0.005801 0.004311 ## Mean 0.000280 0.000933 0.000288 -0.000090 0.000500 ## Median -0.000122 0.001158 0.000728 -0.000211 0.000738 ## Sum 0.070054 0.235068 0.072498 -0.022586 0.125884 ## SE Mean 0.000470 0.000403 0.000432 0.000613 0.000501 ## LCL Mean -0.000645 0.000139 -0.000564 -0.001298 -0.000487 ## UCL Mean 0.001206 0.001727 0.001139 0.001118 0.001486 ## Variance 0.000055 0.000041 0.000047 0.000095 0.000063 ## Stdev 0.007429 0.006399 0.006861 0.009738 0.007951 ## Skewness 0.027235 -0.199407 -0.332766 -0.127788 -0.449311 ## Kurtosis 0.842890 1.275821 1.073234 1.394268 2.079671 ## 2017 2018 ## nobs 251.000000 251.000000 ## NAs 0.000000 0.000000 ## Minimum -0.017930 -0.047143 ## Maximum 0.014468 0.048643 ## 1. Quartile -0.001404 -0.005017 ## 3. Quartile 0.003054 0.005895 ## Mean 0.000892 -0.000231 ## Median 0.000655 0.000695 ## Sum 0.223790 -0.057950 ## SE Mean 0.000263 0.000714 ## LCL Mean 0.000373 -0.001637 ## UCL Mean 0.001410 0.001175 ## Variance 0.000017 0.000128 ## Stdev 0.004172 0.011313 ## Skewness -0.189808 -0.522618 ## Kurtosis 2.244076 2.802996

In the following, we comment on some of the relevant metrics shown above.

      Mean

Years when the Dow Jones daily log-returns have a positive mean are:

      filter_dj_stats(dj_stats, "Mean", 0) ## [1] "2007" "2009" "2010" "2011" "2012" "2013" "2014" "2016" "2017"

      All Dow Jones daily log-returns mean values in ascending order.

      dj_stats["Mean",order(dj_stats["Mean",,])] ## 2008 2018 2015 2011 2007 2012 2014 ## Mean -0.001633 -0.000231 -9e-05 0.000214 0.000246 0.00028 0.000288 ## 2010 2016 2009 2017 2013 ## Mean 0.000415 5e-04 0.000684 0.000892 0.000933 Median

Years when the Dow Jones daily log-returns have a positive median are:

      filter_dj_stats(dj_stats, "Median", 0) ## [1] "2007" "2009" "2010" "2011" "2013" "2014" "2016" "2017" "2018"

      All Dow Jones daily log-returns median values in ascending order.

      dj_stats["Median",order(dj_stats["Median",,])] ## 2008 2015 2012 2017 2010 2018 2014 ## Median -0.00089 -0.000211 -0.000122 0.000655 0.000681 0.000695 0.000728 ## 2016 2011 2009 2007 2013 ## Median 0.000738 0.000941 0.001082 0.001098 0.001158 Skewness

A distribution has positive skewness when it tends to produce more extreme positive values than negative ones. Negative skew commonly indicates that the tail is on the left side of the distribution, and positive skew indicates that the tail is on the right. Here is a sample picture to explain.
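As a quick numerical illustration to complement the picture (not part of the original analysis), sample skewness can be computed from the standardized third moment with a minimal base-R sketch:

sample_skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)  # standardize the sample
  mean(z^3)                   # average cubed deviation = sample skewness
}
sample_skewness(rexp(10000))  # right-skewed data: value close to 2
sample_skewness(rnorm(10000)) # symmetric data: value close to 0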

Years when the Dow Jones daily log-returns have positive skewness are:

      filter_dj_stats(dj_stats, "Skewness", 0) ## [1] "2008" "2009" "2012"

      All Dow Jones daily log-returns skewness values in ascending order.

      dj_stats["Skewness",order(dj_stats["Skewness",,])] ## 2007 2011 2018 2016 2014 2013 ## Skewness -0.613828 -0.526083 -0.522618 -0.449311 -0.332766 -0.199407 ## 2017 2010 2015 2012 2009 2008 ## Skewness -0.189808 -0.174816 -0.127788 0.027235 0.07084 0.224042 Excess Kurtosis

      The Kurtosis is a measure of the “tailedness” of the probability distribution of a real-valued random variable. The kurtosis of any univariate normal distribution is 3. The excess kurtosis is equal to the kurtosis minus 3 and eases the comparison to the normal distribution. The basicStats function within the fBasics package reports the excess kurtosis. Here is a sample picture to explain.
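Analogously, here is a minimal base-R sketch of excess kurtosis (again just an illustration, not the fBasics implementation):

sample_excess_kurtosis <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^4) - 3   # subtract 3 so a normal sample scores roughly 0
}
sample_excess_kurtosis(rnorm(10000))      # close to 0
sample_excess_kurtosis(rt(10000, df = 5)) # heavy tails: clearly positive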

Years when the Dow Jones daily log-returns have positive excess kurtosis are:

      filter_dj_stats(dj_stats, "Kurtosis", 0) ## [1] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016" ## [11] "2017" "2018"

      All Dow Jones daily log-returns excess kurtosis values in ascending order.

      dj_stats["Kurtosis",order(dj_stats["Kurtosis",,])] ## 2012 2014 2013 2015 2007 2010 2009 ## Kurtosis 0.84289 1.073234 1.275821 1.394268 1.525069 2.055407 2.07424 ## 2016 2017 2011 2018 2008 ## Kurtosis 2.079671 2.244076 2.453822 2.802996 3.670796

Year 2018 has the excess kurtosis closest to that of 2008.

Box-plots

dataframe_boxplot(dj_ret_df, "DJIA daily log-returns box plots 2007-2018")

We can see that the most extreme values occurred in 2008. Starting in 2009 the range of values narrows, with the exception of 2011 and 2015. However, comparing 2017 with 2018, there is a remarkable increase in the tendency to produce extreme values in the more recent year.

Density plots

dataframe_densityplot(dj_ret_df, "DJIA daily log-returns density plots 2007-2018")

Year 2007 has remarkable negative skewness. Year 2008 is characterized by flatness and extreme values. The peakedness of 2017 contrasts with the flatness and left skewness of the 2018 results.

      Shapiro Tests

      dataframe_shapirotest(dj_ret_df) ## result ## 2007 5.989576e-07 ## 2008 5.782666e-09 ## 2009 1.827967e-05 ## 2010 3.897345e-07 ## 2011 5.494349e-07 ## 2012 1.790685e-02 ## 2013 8.102500e-03 ## 2014 1.750036e-04 ## 2015 5.531137e-03 ## 2016 1.511435e-06 ## 2017 3.304529e-05 ## 2018 1.216327e-07

      The null hypothesis of normality is rejected for all years 2007-2018.

      QQ plots

      dataframe_qqplot(dj_ret_df, "DJIA daily log-returns QQ plots 2007-2018")

Strong departure from normality of daily log-returns (and hence log-normality of discrete returns) is visible in years 2008, 2009, 2010, 2011 and 2018.

      Weekly Log-returns Exploratory Analysis

The weekly log-returns can be computed starting from the daily ones. Suppose we analyse the trading week covering days {t-4, t-3, t-2, t-1, t} and that we know the closing price at day t-5 (the last day of the previous week). We define the weekly log-return as:

      \[
      r_{t}^{w}\ :=\ ln \frac{P_{t}}{P_{t-5}}
      \]

      That can be rewritten as:

\[
\begin{aligned}
r_{t}^{w}\ &=\ ln \frac{P_{t}P_{t-1} P_{t-2} P_{t-3} P_{t-4}}{P_{t-1} P_{t-2} P_{t-3} P_{t-4} P_{t-5}} \\
&=\ ln \frac{P_{t}}{P_{t-1}} + ln \frac{P_{t-1}}{P_{t-2}} + ln \frac{P_{t-2}}{P_{t-3}} + ln \frac{P_{t-3}}{P_{t-4}} + ln \frac{P_{t-4}}{P_{t-5}} \\
&=\ r_{t} + r_{t-1} + r_{t-2} + r_{t-3} + r_{t-4}
\end{aligned}
\]

Hence the weekly log-return is the sum of the daily log-returns over the trading-week window.
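The identity is easy to verify numerically; the sketch below uses made-up prices rather than the Dow Jones data:

P <- c(100, 101.2, 100.7, 102.3, 103.0, 102.1)  # hypothetical closes P_{t-5}, ..., P_t
r_daily <- diff(log(P))                         # the five daily log-returns
log(P[6] / P[1])                                # direct weekly log-return
sum(r_daily)                                    # identical, up to floating point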

      We take a look at Dow Jones weekly log-returns.

dj_weekly_ret <- apply.weekly(dj_ret, sum)
plot(dj_weekly_ret)

That plot shows sharp increases and decreases in volatility.

      We transform the original time series data and index into a dataframe.

      dj_weekly_ret_df <- xts_to_dataframe(dj_weekly_ret) dim(dj_weekly_ret_df) ## [1] 627 2 head(dj_weekly_ret_df) ## year value ## 1 2007 -0.0061521694 ## 2 2007 0.0126690596 ## 3 2007 0.0007523559 ## 4 2007 -0.0062677053 ## 5 2007 0.0132434177 ## 6 2007 -0.0057588519 tail(dj_weekly_ret_df) ## year value ## 622 2018 0.05028763 ## 623 2018 -0.04605546 ## 624 2018 -0.01189714 ## 625 2018 -0.07114867 ## 626 2018 0.02711928 ## 627 2018 0.01142764

      Basic statistics summary

      (dj_stats <- dataframe_basicstats(dj_weekly_ret_df)) ## 2007 2008 2009 2010 2011 2012 ## nobs 52.000000 52.000000 53.000000 52.000000 52.000000 52.000000 ## NAs 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ## Minimum -0.043199 -0.200298 -0.063736 -0.058755 -0.066235 -0.035829 ## Maximum 0.030143 0.106977 0.086263 0.051463 0.067788 0.035316 ## 1. Quartile -0.009638 -0.031765 -0.015911 -0.007761 -0.015485 -0.010096 ## 3. Quartile 0.014808 0.012682 0.022115 0.016971 0.014309 0.011887 ## Mean 0.001327 -0.008669 0.003823 0.002011 0.001035 0.001102 ## Median 0.004244 -0.006811 0.004633 0.004529 0.001757 0.001166 ## Sum 0.069016 -0.450811 0.202605 0.104565 0.053810 0.057303 ## SE Mean 0.002613 0.006164 0.004454 0.003031 0.003836 0.002133 ## LCL Mean -0.003919 -0.021043 -0.005115 -0.004074 -0.006666 -0.003181 ## UCL Mean 0.006573 0.003704 0.012760 0.008096 0.008736 0.005384 ## Variance 0.000355 0.001975 0.001051 0.000478 0.000765 0.000237 ## Stdev 0.018843 0.044446 0.032424 0.021856 0.027662 0.015382 ## Skewness -0.680573 -0.985740 0.121331 -0.601407 -0.076579 -0.027302 ## Kurtosis -0.085887 5.446623 -0.033398 0.357708 0.052429 -0.461228 ## 2013 2014 2015 2016 2017 2018 ## nobs 52.000000 52.000000 53.000000 52.000000 52.000000 53.000000 ## NAs 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ## Minimum -0.022556 -0.038482 -0.059991 -0.063897 -0.015317 -0.071149 ## Maximum 0.037702 0.034224 0.037693 0.052243 0.028192 0.050288 ## 1. Quartile -0.001738 -0.006378 -0.012141 -0.007746 -0.002251 -0.011897 ## 3. Quartile 0.011432 0.010244 0.009620 0.012791 0.009891 0.019857 ## Mean 0.004651 0.001756 -0.000669 0.002421 0.004304 -0.001093 ## Median 0.006360 0.003961 0.000954 0.001947 0.004080 0.001546 ## Sum 0.241874 0.091300 -0.035444 0.125884 0.223790 -0.057950 ## SE Mean 0.001828 0.002151 0.002609 0.002436 0.001232 0.003592 ## LCL Mean 0.000981 -0.002563 -0.005904 -0.002470 0.001830 -0.008302 ## UCL Mean 0.008322 0.006075 0.004567 0.007312 0.006778 0.006115 ## Variance 0.000174 0.000241 0.000361 0.000309 0.000079 0.000684 ## Stdev 0.013185 0.015514 0.018995 0.017568 0.008886 0.026154 ## Skewness -0.035175 -0.534403 -0.494963 -0.467158 0.266281 -0.658951 ## Kurtosis -0.200282 0.282354 0.665460 2.908942 -0.124341 -0.000870

In the following, we comment on some of the relevant metrics shown above.

      Mean

Years when the Dow Jones weekly log-returns have a positive mean are:

      filter_dj_stats(dj_stats, "Mean", 0) ## [1] "2007" "2009" "2010" "2011" "2012" "2013" "2014" "2016" "2017"

      All mean values in ascending order.

      dj_stats["Mean",order(dj_stats["Mean",,])] ## 2008 2018 2015 2011 2012 2007 2014 ## Mean -0.008669 -0.001093 -0.000669 0.001035 0.001102 0.001327 0.001756 ## 2010 2016 2009 2017 2013 ## Mean 0.002011 0.002421 0.003823 0.004304 0.004651 Median

Years when the Dow Jones weekly log-returns have a positive median are:

      filter_dj_stats(dj_stats, "Median", 0) ## [1] "2007" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017" ## [11] "2018"

      All median values in ascending order.

      dj_stats["Median",order(dj_stats["Median",,])] ## 2008 2015 2012 2018 2011 2016 2014 ## Median -0.006811 0.000954 0.001166 0.001546 0.001757 0.001947 0.003961 ## 2017 2007 2010 2009 2013 ## Median 0.00408 0.004244 0.004529 0.004633 0.00636 Skewness

Years when the Dow Jones weekly log-returns have positive skewness are:

      filter_dj_stats(dj_stats, "Skewness", 0) ## [1] "2009" "2017"

      All skewness values in ascending order.

      dj_stats["Skewness",order(dj_stats["Skewness",,])] ## 2008 2007 2018 2010 2014 2015 ## Skewness -0.98574 -0.680573 -0.658951 -0.601407 -0.534403 -0.494963 ## 2016 2011 2013 2012 2009 2017 ## Skewness -0.467158 -0.076579 -0.035175 -0.027302 0.121331 0.266281 Excess Kurtosis

Years when the Dow Jones weekly log-returns have positive excess kurtosis are:

      filter_dj_stats(dj_stats, "Kurtosis", 0) ## [1] "2008" "2010" "2011" "2014" "2015" "2016"

      All excess kurtosis values in ascending order.

      dj_stats["Kurtosis",order(dj_stats["Kurtosis",,])] ## 2012 2013 2017 2007 2009 2018 ## Kurtosis -0.461228 -0.200282 -0.124341 -0.085887 -0.033398 -0.00087 ## 2011 2014 2010 2015 2016 2008 ## Kurtosis 0.052429 0.282354 0.357708 0.66546 2.908942 5.446623

Year 2008 also has the highest weekly excess kurtosis. However, in this scenario 2017 has negative excess kurtosis and 2016 has the second highest.

      Box-plots

      dataframe_boxplot(dj_weekly_ret_df, "DJIA weekly log-returns box plots 2007-2018")

Density plots

dataframe_densityplot(dj_weekly_ret_df, "DJIA weekly log-returns density plots 2007-2018")

Shapiro Tests

dataframe_shapirotest(dj_weekly_ret_df)
## result
## 2007 0.0140590311
## 2008 0.0001397267
## 2009 0.8701335006
## 2010 0.0927104389
## 2011 0.8650874270
## 2012 0.9934600084
## 2013 0.4849043121
## 2014 0.1123139646
## 2015 0.3141519756
## 2016 0.0115380989
## 2017 0.9465281164
## 2018 0.0475141869

The null hypothesis of normality is rejected at the 5% level for years 2007, 2008, 2016 and, marginally, 2018 (p = 0.0475).

      QQ plots

      dataframe_qqplot(dj_weekly_ret_df, "DJIA weekly log-returns QQ plots 2007-2018")

Strong departure from normality is particularly visible in year 2008.

Saving the current environment for further analysis.

      save.image(file='DowEnvironment.RData')

      If you have any questions, please feel free to comment below.

      References
      1. Dow Jones Industrial Average
      2. Skewness
      3. Kurtosis
      4. An introduction to analysis of financial data with R, Wiley, Ruey S. Tsay
      5. Time series analysis and its applications, Springer ed., R.H. Shumway, D.S. Stoffer
6. Applied Econometric Time Series, Wiley, W. Enders, 4th ed.

7. Forecasting: Principles and Practice, OTexts, R.J. Hyndman
      8. Options, Futures and other Derivatives, Pearson ed., J.C. Hull
      9. An introduction to rugarch package
      10. Applied Econometrics with R, Achim Zeileis, Christian Kleiber – Springer Ed.
      11. GARCH modeling: diagnostic tests
      Disclaimer

      Any securities or databases referred in this post are solely for illustration purposes, and under no regard should the findings presented here be interpreted as investment advice or promotion of any particular security or source.



Structural Analysis of Bayesian VARs with an example using the Brazilian Development Bank

      Sat, 01/05/2019 - 20:24

      (This article was first published on R – insightR, and kindly contributed to R-bloggers)

      By Gabriel Vasconcelos

      Introduction

Vector Autoregressive (VAR) models are very popular in economics because they can model a system of economic variables and their relations. Bayesian VARs are receiving a lot of attention due to their ability to deal with larger systems and the smart use of priors. For example, in this old post I showed an example of large Bayesian VARs to forecast covariance matrices. In this post I will show how to use the same model to obtain impulse response coefficients and perform structural analysis. The type of estimation is based on Bańbura et al. (2010) and the empirical application is from Barboza and Vasconcelos (2019). The objective is to measure the effects of the Brazilian Development Bank on investment. Therefore, we will measure how investment responds to an increase in loans over time.

      VAR

A structural VAR is described by the following equation:

\[
A_0 y_t = c_0 + A_1 y_{t-1} + \dots + A_p y_{t-p} + \varepsilon_t
\]

where the \(A_i\) are square matrices of coefficients, \(c_0\) is a vector of constants, \(\varepsilon_t\) is a vector of errors and \(y_t\) is the vector with all variables in the system. In this sense, \(y_t\) is not a single value as in a univariate autoregressive model; it contains several variables that are used at the same time in the model. For example, \(y_t\) could contain GDP, investment and the interest rate, in which case each matrix \(A_i\) would be \(3 \times 3\).

In VAR models we do not estimate the equation above in a single step. First we estimate the VAR in its reduced form (this is the equation we use to compute forecasts) and then we use some identification strategy to recover the structural form. The reduced form is presented below.

\[
y_t = v + B_1 y_{t-1} + \dots + B_p y_{t-p} + u_t
\]

where \(B_i = A_0^{-1} A_i\), \(v = A_0^{-1} c_0\) and \(u_t = A_0^{-1} \varepsilon_t\). When we do structural analysis we want to know how some variable responds over time to a shock on some other variable. However, shocks in the reduced form are correlated between equations and cannot be isolated. Therefore, an important difference between the reduced form and the structural form is that in the latter the error covariance matrix is diagonal, which means uncorrelated shocks.
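A common way to obtain such uncorrelated shocks is a recursive identification based on the Cholesky factor of the reduced-form error covariance. The sketch below illustrates only the principle, using simulated errors; it is not the lbvar package's internal code:

set.seed(1)
# simulate correlated reduced-form errors u (hypothetical data)
mix <- matrix(c(1, 0, 0,
                0.5, 1, 0,
                0.3, 0.2, 1), nrow = 3, byrow = TRUE)
u <- matrix(rnorm(3000), ncol = 3) %*% t(mix)
Sigma <- cov(u)          # reduced-form error covariance (not diagonal)
P <- t(chol(Sigma))      # lower-triangular contemporaneous impact matrix
e <- u %*% t(solve(P))   # shocks implied by the recursive ordering
round(cov(e), 2)         # the identity matrix: uncorrelated, unit-variance shocks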

      Application

      First we must install the lbvar package from my github:

library(devtools)
install_github("gabrielrvsc/lbvar")
library(lbvar)
library(tseries)

      The dataset is included in the package and it is called BNDESdata. I removed column 17 in this example because it contains a variable (capital goods prices) used only for robustness in other examples in the paper. The data is already treated and ready for the model. We are going to evaluate the effects of the Brazilian Development Bank Loans (BL) on the Gross Fixed Capital formation (GFCF), the GFCF fraction of machinery and equipment (GFCFme) and the GFCF fraction of machinery and equipment manufactured in Brazil (GFCFmeBR).

      data("BNDESdata") BNDESdata = BNDESdata[,-17] colnames(BNDESdata) = c("TT", "CRB", "CDS", "WPROD", "BL", "GFCF", "GFCFme", "GFCFmeBR", "GDP", "IP", "ULC", "CUR", "IR", "ICI", "UNCERT", "ER", "IBOV")

Before estimating the VAR, we must define some parameters for the priors. Our prior is a random walk for nonstationary variables and white noise for stationary variables. We just used a Phillips-Perron test to determine which variables are stationary and which are not. We set a value of 1 for the first autoregressive term of each nonstationary equation and 0 for each stationary equation.

There is another parameter we must define, called \(\lambda > 0\). This parameter controls the importance we give to the prior relative to the data. Smaller values give more importance to the prior. Bańbura et al. (2010) use a strategy to determine \(\lambda\) which consists of estimating a small VAR by OLS and Bayesian VARs for several values of \(\lambda\). The chosen \(\lambda\) is the one that makes the big model most similar to the small model in terms of squared error. Note that the small model does not use all variables; therefore we compare only the equations included in both models.

The code for the Phillips-Perron test and the \(\lambda\) estimation is presented below. I used a small range for \(\lambda\) to save computational time, given that I already knew that the best value was in this interval.

## = Phillips-Perron test
prior = apply(BNDESdata, 2, function(x) pp.test(x)$p.value)
prior[prior > 0.05] = 1
prior[prior <= 0.05] = 0

## = lambda estimation
lambda = fitLambda(BNDESdata, c("GFCF", "BL", "IR"), seq(0.6, 0.65, 0.0001),
                   p = 13, p.reduced = 13, delta = prior)

      The VAR we want to estimate has 17 variables and 13 lags. This accounts for 3774 parameters in the reduced form if we include the intercept. The name Large Bayesian VARs is very much appropriate. The code below estimates the model, computes the recursive identification (see the full article for more details) and the impulse responses. The last step requires some simulation and it may take a few minutes to run.

## == Estimate the Reduced Form == ##
model = lbvar(BNDESdata, 13, delta = prior, lambda = lambda)

## == identification == ##
ident = identification(model)

## == Impulse Responses == ##
set.seed(123)
ir = irf(model, ident, 48, unity.shock = FALSE, M = 1000)

      Now we can plot the impulse responses and see the results. The figures below show the effects of a shock in the bank loans on the three variations of GFCF.

      par(mfrow=c(1,3)) plot(ir,"BL","GFCF",alpha=c(0.05,0.1),xlim=c(0,36),ylim=c(-0.01,0.015),ylab="GFCF",xlab="Time",main="(a)") plot(ir,"BL","GFCFme",alpha=c(0.05,0.1),xlim=c(0,36),ylim=c(-0.01,0.015),ylab="GFCF ME",xlab="Time",main="(b)") plot(ir,"BL","GFCFmeBR",alpha=c(0.05,0.1),xlim=c(0,36),ylim=c(-0.01,0.015),ylab="GFCF ME-BR",xlab="Time",main="(c)")

The effects of the bank are bigger if we move to the machinery and equipment measures because they represent a large portion of the bank’s portfolio. The responses are significant at some horizons. The shock was of one standard deviation, which accounts for a 20% increase in loans. This shock increases the GFCF by less than 2% if we look at the sum of the first 6 months. However, to understand the magnitude of this effect we must look at the fraction of the Brazilian Development Bank in total investment. Our results show that an increase of 1 BRL in loans accounts for an increase of approximately 0.46 BRL in total GFCF. This may look like a small value at first sight. We explain this result with two hypotheses. First, there is some degree of fund substitution, where companies use the government bank just because interest rates are lower but do not suffer from significant credit constraints. Second, there is a crowding-out effect because the government tends to increase the interest rate when the bank puts more money into the economy.

      Finally, a good way to evaluate a VAR model in a macroeconomic framework such as this one is to look at how the variables behave in response to the monetary policy. The figures below show how the GFCF behaves when the government increases interest rates (IR).

      par(mfrow=c(1,3)) plot(ir,"IR","GFCF",alpha=c(0.05,0.1),xlim=c(0,36),ylim=c(-0.02,0.01),ylab="GFCF",xlab="Time",main="(a)") plot(ir,"IR","GFCFme",alpha=c(0.05,0.1),xlim=c(0,36),ylim=c(-0.02,0.01),ylab="GFCF ME",xlab="Time",main="(b)") plot(ir,"IR","GFCFmeBR",alpha=c(0.05,0.1),xlim=c(0,36),ylim=c(-0.02,0.01),ylab="GFCF ME-BR",xlab="Time",main="(c)")

      References

      Bańbura, Marta, Domenico Giannone, and Lucrezia Reichlin. “Large Bayesian vector auto regressions.” Journal of Applied Econometrics 25.1 (2010): 71-92.

      de Menezes Barboza, R., & Vasconcelos, G. F. (2019). Measuring the aggregate effects of the Brazilian Development Bank on investment. The North American Journal of Economics and Finance.



      Here’s why 2019 is a great year to start with R: A story of 10 year old R code then and now

      Sat, 01/05/2019 - 13:00

      (This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)

      Introduction

It has been more than ten years since I wrote my first R code. And in those years, the R world has changed dramatically, and mostly for the better. I believe that the current time may be one of the best times to start working with R.

      In this new year’s post we will look at the R world 10 years ago and today, and provide links to many tools that helped it become a great language to solve and present everyday tasks with a welcoming community of users and developers.

      Contents
      1. My first exposure to R, more than ten years ago
      2. The growth of R
      3. R now versus then – A much better world
      4. If you really came for that ugly old code
      My first exposure to R, more than ten years ago

The year was 2007 and I was studying probability and mathematical statistics at my faculty when one of the professors introduced us to R – a free programming language that we could use to solve many statistical tasks, from simple matrix operations and model fitting to data visualization. This sounded great, as many of the traditionally used alternatives such as Stata or SPSS were not even free to use, let alone open source.

Now to get a bit of context, my most recent exposure to programming at that time was using Borland's Delphi 7 and C++ Builder, both mature IDEs with very pleasant and advanced user interfaces and features, where you could literally have a Windows application with a nice UI ready, compiled and running in an hour.

Delphi 7, released in 2002

      Rgui times

      When I first opened the RGui it felt, well, slightly underwhelming:

      Rgui rocking R version 2.6.1, released 26th Nov 2007

      But why did you not use RStudio?

Well, the first beta version of RStudio was released about 3 years later, in February 2011. By the way, those RStudio blog posts from 2011 still have comment sections available below them and I really enjoy reading through them. Anyway, I was stuck with the RGui and it was not a pleasant experience. At that time, I disliked that experience so much that I still wrote some of the code in Delphi or C++ Builder.

      Which currently popular packages existed?

      But dplyr syntax makes everything so easy, why not use that?

      Looking at the CRAN snapshots from the beginning of 2008, the latest released R version at that time was R-2.6.1 and there were around 1200 packages available on CRAN. At the time of writing of this post the number of packages available on CRAN reached 13600.

      Looking at the top 40 most downloaded packages in the past month, only two of those packages existed on CRAN at that time – ggplot2 and digest – no filter, summarize or group_by for me back then.

      StackOverflow, GitHub and Twitter communities

      Why did you not just ask StackOverflow, Twitter or check GitHub ?

According to Wikipedia, StackOverflow was launched on 15th September 2008 and GitHub on 10th April 2008, so at the beginning of 2008 neither of today's two giants even existed.

      Not that I was using Twitter at that time, but the first #rstats tweet I was able to find is from 4th April 2009:

      RT @ChrisAlbon @drewconway #rstats is the official R statistical language hashtag. #rstats (because #R doesn’t cut it)

      — brendan o’connor (@brendan642) April 4, 2009

      For comparison, R itself was first released 29th of February 2000, a date easily remembered.

      The growth of R

There are many ways to look at the growth of a programming language and this is not meant to be a comprehensive and objective growth assessment. I rather took a look at two metrics I found interesting that show some trends in the R world.

      If you are interested in the topic of programming language popularity, there are indices such as PYPL and TIOBE, and of course they have their criticisms.

      Downloads of R packages



      RStudio’s CRAN mirror provides a REST API from which we can look at and visualize the number of monthly downloads of R packages in the past 5 years. The chart speaks for itself:

[Interactive chart: monthly R package downloads from RStudio's CRAN mirror, January 2013 through December 2018.]
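As a rough way to pull similar numbers yourself, here is a sketch assuming the cranlogs package (which wraps the download-log API of RStudio's CRAN mirror); this is not the code behind the chart above:

library(cranlogs)  # wraps the RStudio CRAN mirror download-log API
library(dplyr)

daily <- cran_downloads(from = "2018-01-01", to = "2018-12-31")  # all packages combined
monthly <- daily %>%
  mutate(month = format(date, "%Y-%m")) %>%
  group_by(month) %>%
  summarise(downloads = sum(count))
monthly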

      Interest on StackOverflow

      Another interesting point of view is the statistics on trends on StackOverflow, paraphrasing their blogpost:

      When we see a rapid growth in the number of questions about a technology, it usually reflects a real change in what developers are using and learning.

      And how does R look within the StackOverflow trends compared to other languages? Looks like the growth of R is so remarkable, even the data scientists at StackOverflow itself noticed and wrote a blogpost about it in 2017:

      R now versus then – A much better world

Going back to that story of my first R code, I think time has made working with R much better than it was before in many ways. I will list just a few of the many reasons why, with links to relevant resources to follow:

      Availability of free information and support is great
      • The amazing amount of free information readily available such as (tidyverse oriented) R for Data Science, or Advanced R books make R more accessible to learn and use
      • Communities of R users such as the one on StackOverflow make it easy to ask questions and get answers, the #rstats hashtag on Twitter is a good way to interact with the community
      • Many user and developer blogs on r-bloggers.com and curated selections of content on RWeekly.org can serve as an inspiration and overview of the news in the community
      Software tools that make working with R efficient
      • Tools like RStudio make using R a much more pleasant experience compared to the original RGui, with many useful features and a Server version running in browser
      • Well documented R packages that make common data science tasks easier and/or more performant such as the popular tidyverse or data.table make it easier to start
      • R packages that support development, testing and documentation such as devtools, testthat and roxygen2 make R code efficient to develop, test and document
      • For portability, reproducibility and dependency management, tools such as packrat can make life less painful
      • Code repository managers such as GitHub, GitLab or others make it easy to share code, collaborate and even perform CI/CD tasks where necessary
      Professionally presenting and publishing R results is simple
      • Tools like RMarkdown, Bookdown, Blogdown and others make it easy to publish the results of your work, be it an interactive dashboard, a paper in pdf, a presentation, even a book or a blog (such as this one)
      • Many packages for generating interactive charts, maps and animations such as highcharter, leaflet and more help create amazing data visualizations
      • Shiny takes it to the next level allowing for advanced interactive web applications
Mature interfaces to programming languages, file formats and more
Guidance on packages per topic on CRAN
If you really came for that ugly old code

      I hope this post motivated you to dive a bit deeper into the R world and check some of the many amazing contributions created by developers and users in the R community mentioned above.

But if you really feel like having a good laugh first, feel free to check some of the oldest R scripts I was able to find, unedited, on GitLab here.

They date to somewhere around the end of 2007/beginning of 2008 and, for what it's worth, should still be runnable.

      Thank you for reading and
      have a happy new yeaR

Did you find the article helpful or interesting? Help others find it by sharing.


      My Activities in 2018 with R and ShinyApp

      Fri, 01/04/2019 - 23:26

      (This article was first published on R – Networkx, and kindly contributed to R-bloggers)

What better way to analyze your activity data from Apple Health and Runkeeper than to load it into R and generate some visualizations and counters. After that I will wrap it all together into a Shiny app.

What do I want to achieve for now?

• Number of activities, steps, kilometers etc.
• Heatmap of the number of activities over the last X years, colored by amount.
Loading and having a look at the data

Export your Runkeeper data; the option is available after login > account settings > export data > download your data.

      For export and convert your Apple Health data follow my previous post Analysing your Apple Health Data with Splunk.

runkeeper <- read.csv("data/cardioActivities.csv", stringsAsFactors = FALSE)
steps <- read.csv("data/StepCount.csv", stringsAsFactors = FALSE)
dim(runkeeper)

      ## [1] 676 14

      dim(steps)

      ## [1] 47790 9

      head(runkeeper)

      ## Activity.Id Date Type ## 1 47f59049-9bd5-4b9a-a63c-d51fca7e0bf8 2018-12-31 10:22:33 Running ## 2 7099bd74-7685-453e-a962-62ee711dc18e 2018-12-27 20:24:09 Running ## 3 c2250f5b-8567-47ef-97d7-504b174809e1 2018-12-23 13:28:04 Running ## 4 787c8438-d279-49fb-8bab-d1f348935264 2018-12-18 21:07:42 Boxing / MMA ## 5 2c41d494-204b-4be3-9b4e-d7acac5a3187 2018-12-16 13:04:44 Running ## 6 86b6429a-13a8-4122-b200-c3aadef271f5 2018-12-04 20:24:27 Boxing / MMA ## Route.Name Distance..km. Duration Average.Pace Average.Speed..km.h. ## 1 8.12 50:49 6:15 9.59 ## 2 15.99 1:27:36 5:29 10.95 ## 3 8.31 50:04 6:01 9.96 ## 4 0.00 1:20:00 NA ## 5 10.51 56:47 5:24 11.10 ## 6 0.00 1:00:00 NA ## Calories.Burned Climb..m. Average.Heart.Rate..bpm. Friend.s.Tagged Notes ## 1 772.0 83 134 ## 2 1485.0 57 147 ## 3 788.0 88 134 ## 4 1032.4 0 NA ## 5 979.0 41 145 ## 6 774.3 0 NA ## GPX.File ## 1 2018-12-31-102233.gpx ## 2 2018-12-27-202409.gpx ## 3 2018-12-23-132804.gpx ## 4 ## 5 2018-12-16-130444.gpx ## 6

      head(steps)

      ## sourceName sourceVersion ## 1 John’s iPhone 10.1.1 ## 2 John’s iPhone 10.1.1 ## 3 John’s iPhone 10.1.1 ## 4 John’s iPhone 10.1.1 ## 5 John’s iPhone 10.1.1 ## 6 John’s iPhone 10.1.1 ## device ## 1 <<HKDevice: 0x2836d9310>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1> ## 2 <<HKDevice: 0x2836db070>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1> ## 3 <<HKDevice: 0x2836d9d10>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1> ## 4 <<HKDevice: 0x2836dad50>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1> ## 5 <<HKDevice: 0x2836da080>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1> ## 6 <<HKDevice: 0x2836dabc0>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1> ## type unit creationDate startDate ## 1 StepCount count 2016-11-24 09:13:55 +0100 2016-11-24 08:55:10 +0100 ## 2 StepCount count 2016-11-24 09:13:55 +0100 2016-11-24 09:10:09 +0100 ## 3 StepCount count 2016-11-24 09:13:55 +0100 2016-11-24 09:11:11 +0100 ## 4 StepCount count 2016-11-24 09:13:55 +0100 2016-11-24 09:12:13 +0100 ## 5 StepCount count 2016-11-24 09:49:04 +0100 2016-11-24 09:13:19 +0100 ## 6 StepCount count 2016-11-24 09:49:04 +0100 2016-11-24 09:33:57 +0100 ## endDate value ## 1 2016-11-24 08:55:33 +0100 26 ## 2 2016-11-24 09:11:11 +0100 114 ## 3 2016-11-24 09:12:13 +0100 105 ## 4 2016-11-24 09:13:19 +0100 25 ## 5 2016-11-24 09:22:42 +0100 108 ## 6 2016-11-24 09:42:33 +0100 89

      str(runkeeper)

      ## 'data.frame': 676 obs. of 14 variables: ## $ Activity.Id : chr "47f59049-9bd5-4b9a-a63c-d51fca7e0bf8" "7099bd74-7685-453e-a962-62ee711dc18e" "c2250f5b-8567-47ef-97d7-504b174809e1" "787c8438-d279-49fb-8bab-d1f348935264" ... ## $ Date : chr "2018-12-31 10:22:33" "2018-12-27 20:24:09" "2018-12-23 13:28:04" "2018-12-18 21:07:42" ... ## $ Type : chr "Running" "Running" "Running" "Boxing / MMA" ... ## $ Route.Name : chr "" "" "" "" ... ## $ Distance..km. : num 8.12 15.99 8.31 0 10.51 ... ## $ Duration : chr "50:49" "1:27:36" "50:04" "1:20:00" ... ## $ Average.Pace : chr "6:15" "5:29" "6:01" "" ... ## $ Average.Speed..km.h. : num 9.59 10.95 9.96 NA 11.1 ... ## $ Calories.Burned : num 772 1485 788 1032 979 ... ## $ Climb..m. : int 83 57 88 0 41 0 62 0 0 0 ... ## $ Average.Heart.Rate..bpm.: int 134 147 134 NA 145 NA 145 NA NA NA ... ## $ Friend.s.Tagged : chr "" "" "" "" ... ## $ Notes : chr "" "" "" "" ... ## $ GPX.File : chr "2018-12-31-102233.gpx" "2018-12-27-202409.gpx" "2018-12-23-132804.gpx" "" ...

      str(steps)

      ## 'data.frame': 47790 obs. of 9 variables: ## $ sourceName : chr "John’s iPhone" "John’s iPhone" "John’s iPhone" "John’s iPhone" ... ## $ sourceVersion: chr "10.1.1" "10.1.1" "10.1.1" "10.1.1" ... ## $ device : chr "<<HKDevice: 0x2836d9310>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1>" "<<HKDevice: 0x2836db070>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1>" "<<HKDevice: 0x2836d9d10>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1>" "<<HKDevice: 0x2836dad50>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:10.1.1>" ... ## $ type : chr "StepCount" "StepCount" "StepCount" "StepCount" ... ## $ unit : chr "count" "count" "count" "count" ... ## $ creationDate : chr "2016-11-24 09:13:55 +0100" "2016-11-24 09:13:55 +0100" "2016-11-24 09:13:55 +0100" "2016-11-24 09:13:55 +0100" ... ## $ startDate : chr "2016-11-24 08:55:10 +0100" "2016-11-24 09:10:09 +0100" "2016-11-24 09:11:11 +0100" "2016-11-24 09:12:13 +0100" ... ## $ endDate : chr "2016-11-24 08:55:33 +0100" "2016-11-24 09:11:11 +0100" "2016-11-24 09:12:13 +0100" "2016-11-24 09:13:19 +0100" ... ## $ value : int 26 114 105 25 108 89 329 284 89 18 ...

From the Runkeeper data we need Date and from the Apple Health StepCount we need endDate (that's when your step is done). Both have type chr. I could convert them to dates now, but I leave the data as it is and will convert it when necessary.

      Create new variables

      First we load the lubridate package.

      library(lubridate)

I will create some new variables and convert Date and endDate to Date so I can extract the year with the year() function of the lubridate package.

runkeeper$year <- year(as.Date(runkeeper$Date, origin = '1900-01-01'))
steps$year <- year(as.Date(steps$endDate, origin = '1900-01-01'))
summary(runkeeper$year)

      ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 2010 2013 2016 2015 2017 2018

      summary(steps$year)

      ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 2016 2017 2017 2017 2018 2019

As you can see in the summary of both data frames above, Runkeeper covers more years than the steps data. That doesn't matter, because we are only looking at 2018 in the Shiny app and I will group the data by year. In the app I create a year slider (a minimal sketch follows below).
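A minimal sketch of such a year slider; the input id and default value are my own choice, not necessarily the app's actual UI code:

library(shiny)

# UI side: a slider spanning the years present in the Runkeeper data
sliderInput("year", "Year",
            min = min(runkeeper$year), max = max(runkeeper$year),
            value = 2018, step = 1, sep = "")

# server side: filter to the selected year, e.g.
# runkeeper %>% filter(year == input$year)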

I will parse the Duration variable into hours, minutes and seconds, and calculate the duration in minutes. Some of the Duration values have no leading zero for the hour, so I cannot use them directly with the hms function.

Add a leading zero if it does not exist.

      runkeeper$Duration[1:5]

      ## [1] "50:49" "1:27:36" "50:04" "1:20:00" "56:47"

runkeeper$Duration <- ifelse(nchar(runkeeper$Duration) < 6, paste0("0:", runkeeper$Duration), runkeeper$Duration)
runkeeper$Duration[1:5]

      ## [1] "0:50:49" "1:27:36" "0:50:04" "1:20:00" "0:56:47"

      Now I can parse the Duration with hms.

runkeeper$lub <- hms(runkeeper$Duration)
runkeeper$time_minutes <- hour(runkeeper$lub)*60 + minute(runkeeper$lub) + second(runkeeper$lub)/60
runkeeper$lub[1:5]

      ## [1] "50M 49S" "1H 27M 36S" "50M 4S" "1H 20M 0S" "56M 47S"

      runkeeper$time_minutes[1:5]

      ## [1] 50.81667 87.60000 50.06667 80.00000 56.78333

      Group the data by year

      First we load the package dplyr.

      library(dplyr)

Group both data frames by year and compute some summaries, such as counts and sums of kilometers, climb, calories and duration.

grouped_runkeeper_year <- runkeeper %>%
  group_by(year) %>%
  summarise(cnt = n(),
            km = sum(Distance..km.),
            climb = sum(Climb..m.),
            calories = sum(Calories.Burned),
            duration = sum(time_minutes, na.rm = TRUE))

grouped_steps_year <- steps %>%
  group_by(year) %>%
  summarise(cnt = sum(value))

grouped_runkeeper_year

      ## # A tibble: 9 x 6 ## year cnt km climb calories duration ## <dbl> <int> <dbl> <int> <dbl> <dbl> ## 1 2010 3 10.8 0 901 71.8 ## 2 2011 96 657. 1951 56333 3864. ## 3 2012 46 410. 1458 36876 2278. ## 4 2013 46 258. 488 22654 1456. ## 5 2014 24 175. 698 14382 992. ## 6 2015 117 1202. 5910 98518 7537. ## 7 2016 93 661. 3076 65275. 5397. ## 8 2017 128 408. 1569 89872. 7404. ## 9 2018 123 366. 1818 86792. 7062.

      grouped_steps_year

      ## # A tibble: 4 x 2 ## year cnt ## <dbl> <int> ## 1 2016 328759 ## 2 2017 3360951 ## 3 2018 4366289 ## 4 2019 3692

      Some simple visualizations

library(ggplot2)
library(RColorBrewer)

grouped_runkeeper_year %>%
  ggplot(aes(x = year, y = cnt)) +
  geom_bar(stat = "identity", col = "white", fill = "#ee8300") +
  geom_text(aes(label = cnt), hjust = 1.2, color = "white", size = 3.5) +
  labs(x = "", y = "", title = "Number activities by Year") +
  scale_x_continuous(breaks = seq(min(grouped_runkeeper_year$year), max(grouped_runkeeper_year$year), 1)) +
  coord_flip() +
  theme_bw()

grouped_steps_year %>%
  ggplot(aes(x = year, y = cnt)) +
  geom_bar(stat = "identity", col = "white", fill = "#ee8300") +
  geom_text(aes(label = cnt), hjust = 1.2, color = "white", size = 3.5) +
  labs(x = "", y = "", title = "Number of Steps by Year") +
  scale_x_continuous(breaks = seq(min(grouped_runkeeper_year$year), max(grouped_runkeeper_year$year), 1)) +
  coord_flip() +
  theme_bw()

Add a heatmap of the number of activities over the last 3 years.

calendar_heatmap <- runkeeper %>%
  select(Date, time_minutes, year) %>%
  filter(year >= year(now()) - 3) %>%
  mutate(
    week = week(as.Date(Date)),
    wday = wday(as.Date(Date), week_start = 1),
    month = month(as.Date(Date), label = TRUE, abbr = TRUE),
    day = weekdays(as.Date(Date))
  )

cols <- rev(rev(brewer.pal(7, "Oranges"))[1:5])

calendar_heatmap %>%
  ggplot(aes(month, reorder(day, -wday), fill = time_minutes)) +
  geom_tile(colour = "white") +
  scale_fill_gradientn('Activity \nMinutes', colours = cols) +
  facet_wrap(~ year, ncol = 1) +
  theme_classic() +
  theme(strip.text.x = element_text(size = 16, face = "bold", colour = "orange")) +
  ylab("") + xlab("")

      Screenshot Shiny App

      SessionInfo sessionInfo() ## R version 3.5.1 (2018-07-02) ## Platform: x86_64-apple-darwin15.6.0 (64-bit) ## Running under: macOS 10.14.2 ## ## Matrix products: default ## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib ## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib ## ## locale: ## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] RWordPress_0.2-3 knitr_1.20 ## [3] bindrcpp_0.2.2 RColorBrewer_1.1-2 ## [5] ggplot2_3.0.0 shinydashboardPlus_0.6.0 ## [7] shinydashboard_0.7.1 shiny_1.2.0 ## [9] dplyr_0.7.8 lubridate_1.7.4 ## ## loaded via a namespace (and not attached): ## [1] Rcpp_0.12.19 prettyunits_1.0.2 ps_1.2.1 ## [4] utf8_1.1.4 assertthat_0.2.0 rprojroot_1.3-2 ## [7] digest_0.6.17 mime_0.5 R6_2.2.2 ## [10] plyr_1.8.4 backports_1.1.2 evaluate_0.12 ## [13] highr_0.7 pillar_1.3.0 rlang_0.3.0.1 ## [16] lazyeval_0.2.1 curl_3.2 rstudioapi_0.8 ## [19] callr_3.0.0 rmarkdown_1.10 desc_1.2.0 ## [22] labeling_0.3 devtools_2.0.1 stringr_1.3.1 ## [25] RCurl_1.95-4.11 munsell_0.5.0 compiler_3.5.1 ## [28] httpuv_1.4.5.1 pkgconfig_2.0.2 base64enc_0.1-3 ## [31] pkgbuild_1.0.2 htmltools_0.3.6 tidyselect_0.2.5 ## [34] tibble_1.4.2 XML_3.98-1.16 fansi_0.4.0 ## [37] crayon_1.3.4 withr_2.1.2 later_0.7.5 ## [40] bitops_1.0-6 grid_3.5.1 jsonlite_1.5 ## [43] xtable_1.8-3 gtable_0.2.0 magrittr_1.5 ## [46] scales_1.0.0 cli_1.0.1 stringi_1.2.4 ## [49] fs_1.2.6 promises_1.0.1 remotes_2.0.2 ## [52] testthat_2.0.0 tools_3.5.1 glue_1.3.0 ## [55] purrr_0.2.5 hms_0.4.2 processx_3.2.0 ## [58] pkgload_1.0.2 yaml_2.2.0 colorspace_1.3-2 ## [61] sessioninfo_1.1.1 memoise_1.1.0 bindr_0.1.1 ## [64] XMLRPC_0.3-1 usethis_1.4.0

We'll wrap it up into a Shiny app, which can be found on my GitHub.

      The post My Activities in 2018 with R and ShinyApp appeared first on Networkx.



      In case you missed it: December 2018 roundup

      Fri, 01/04/2019 - 18:31

      (This article was first published on Revolutions, and kindly contributed to R-bloggers)

      In case you missed them, here are some articles from December of particular interest to R users.

      R 3.5.2 is now available.

      Roundup of AI, Machine Learning and Data Science news from December 2018.

      AzureStor, a new R package to interface with Azure Storage.

      How to use the "plumber" package to create an R service as a container with the AzureContainers package.

      How to give money to the R Foundation in support of the R project.

      The Revolutions blog is 10 years old, and I reflect on some milestones in a talk given at SatRDays DC.

      Reshama Shaikh compares gender diversity within the R and Python communities.

      AzureVM, a new package for managing virtual machines in Azure from R.

      And some general interest stories (not necessarily related to R):

      As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here



      Displaying our “R – Quality Control Individual Range Chart Made Nice” inside a Java web App using AJAX – How To.

      Fri, 01/04/2019 - 18:22

      (This article was first published on R-Analytics, and kindly contributed to R-bloggers)

      • Prerequisites:
• Once you are ready with everything mentioned above, it is important that you are able to generate the html file on your computer from the RMarkdown file provided in the github repository mentioned above ( https://github.com/LaranIkal/R-ANALYTICS ) by using RStudio.
        • Download the zip file: R-ANALYTICS-master.zip
        • Unzip it to a local folder.
        • Double-click on file: R-ANALYTICS.Rproj, it will start rstudio.
• In the lower-right pane, click on Files; you will see your files. Click on the file name QualityControl_IndividualRangeChart.Rmd and RStudio will open the file:
        • To generate the html file from the Rmarkdown file, click on Knit->Knit to HTML.
        • Now you should be able to see your Intermediate Range Chart html file, like this:
– Now you must be able to generate the html file from the command line. If you are on Windows, the command would be like this: “C:\Program Files\R\R-3.5.1\bin\Rscript” -e “rmarkdown::render(‘C:/Development/R/R-ANALYTICS/QualityControl_IndividualRangeChart.Rmd’)”. The path to Rscript and the path to the Rmd file may be different; just change them to match your computer. Note: this project was created using Windows 10 and it works on a Windows server as well.
– The next step is to get the files from the master repository on github: https://github.com/LaranIkal/rchart.git, or you can also go directly to https://github.com/LaranIkal/rchart and download the zip file; it will download a file called rchart-master.zip.
      • Creating the project in eclipse EE IDE:
        • Using your file explorer, create a folder called rchart inside your eclipse EE workspace folder.
        • Open file rchart-master.zip
        • Extract the content of rchart-master folder inside the zip file into the new rchart folder you just created. 
        • Start Eclipse EE IDE and go to File-New-Java Project:.
        • Enter the name of the project as rchart and click Finish:
        • Once the project has been added, open it and double click on your pom.xml file to edit the Tomcat path according to your Tomcat path and save your changes:
        • Right-Click on your pom.xml file and Select Run As -> Maven build:
        • When Eclipse asks to edit the launch configuration, enter eclipse:eclipse as the Goals and click Run:
      This will update all dependencies for the project.
        • Now let’s create the war file, Right-Click on your pom.xml file and Select Run As -> Maven install:
This should generate a war file under your Tomcat installation folder/webapps. When you run the file Tomcat installation folder/bin/startup.bat, the war file will be expanded, so you will see something like this:
• Go to your web browser and point it to http://localhost:8080/RChartFromJava, verify the Rscript path and set it accordingly – “C:\\Program Files\\R\\R-3.5.1\\bin\\Rscript” – and click on the Display Chart button. Now you should be able to see the Intermediate Range Chart of the data in the text area:
How is it working?

– When you point your web browser to http://localhost:8080/RChartFromJava, the MainController Java file sends the homepage.html template to the browser; homepage.html loads the Javascript file DisplayChart.js under the folder rchart\src\main\resources\static.
– When you click the Display Chart button, it runs the Javascript code because the template is configured this way: onclick=’JavaScript:xmlhttpPostScriptData( “DisplayChart”, this.form, “iframeChartDisplay”, “Display Chart” )’>. The Javascript code posts the data to the Java backend app using AJAX and waits for the response.
– The MainController in the Java backend app receives the data and opens the RMarkdown template under the folder rchart\src\main\resources\static\ChartTemplates. It replaces the values and creates a temporary RMarkdown document, processes that temporary document to create the html document, opens the html file and returns it as a string to the web browser (a sketch of the R side of that rendering step is shown below).

If you have any questions, I might be able to help if you leave a comment on this article. Enjoy it!!!
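As a rough illustration of the R side of that flow, the template file name and the "##DATA_PLACEHOLDER##" token below are hypothetical, not the repository's actual code:

library(rmarkdown)

# fill the RMarkdown template with the posted data
template <- readLines("ChartTemplates/QualityControl_IndividualRangeChart.Rmd")
filled <- gsub("##DATA_PLACEHOLDER##", "1.2, 1.4, 1.1, 1.3", template, fixed = TRUE)
tmp_rmd <- tempfile(fileext = ".Rmd")
writeLines(filled, tmp_rmd)

# render it to HTML and read the result back as one string to return to the browser
html_file <- render(tmp_rmd, output_format = "html_document", quiet = TRUE)
html_text <- paste(readLines(html_file, warn = FALSE), collapse = "\n")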

      Carlos Kassab
      https://www.linkedin.com/in/carlos-kassab-48b40743/

      More information about R:
https://www.r-bloggers.com


      What does it mean to write “vectorized” code in R?

      Fri, 01/04/2019 - 01:00

      (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

      One often hears that R can not be fast (false), or more correctly that for fast code in R you may have to consider “vectorizing.”

A lot of knowledgeable R users are not comfortable with the term “vectorize”, and not really familiar with the method.

      “Vectorize” is just a slightly high-handed way of saying:

      R naturally stores data in columns (or in column major order), so if you are not coding to that pattern you are fighting the language.
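A trivial warm-up illustration of that point (a sketch, separate from the article's example): summing squares element by element versus letting R operate on the whole vector at once.

x <- runif(1e6)

# scalar style: touch one element at a time
s <- 0
for (xi in x) {
  s <- s + xi^2
}

# vectorized style: one operation over the whole column
s_vec <- sum(x^2)

all.equal(s, s_vec)  # TRUE, but the vectorized form is dramatically faster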

      In this article we will make the above clear by working through a non-trivial example of writing vectorized code.

      For our example problem we will take on the task of computing the PRESS statistic (a statistic we use in our new RcppDynProg package). The “predicted residual error sum of squares” or PRESS statistic is an estimate of the out of sample quality of a fit. We motivate the PRESS statistic as follows.

      Suppose we are fitting a simple linear model. In such a case it is natural to examine the model residuals (differences between the actual values and the matching predictions):

d <- data.frame(
  x = 1:6,
  y = c(1, 1, 2, 2, 3, 3))

lm(y ~ x, data = d)$residuals
#          1          2          3          4          5          6
#  0.1428571 -0.3142857  0.2285714 -0.2285714  0.3142857 -0.1428571

However, because the model has seen the data it is being applied to, these residuals are not representative of the residuals we would see on new data (there is a towards-zero bias in this estimate). An improved estimate is the PRESS statistic: for each point a model is fit on all points except the point in question, and then that model's residual on the held-out point is computed. This is a form of cross-validation and is easy to calculate:

xlin_fits_lm <- function(x, y) {
  n <- length(y)
  d <- data.frame(x = x, y = y)
  vapply(
    seq_len(n),
    function(i) {
      m <- lm(y ~ x, data = d[-i, ])
      predict(m, newdata = d[i, ])
    }, numeric(1))
}

d$y - xlin_fits_lm(d$x, d$y)
# [1]  0.3000000 -0.4459459  0.2790698 -0.2790698  0.4459459 -0.3000000

      Notice these values tend to be further from zero. They also better represent how an overall model might perform on new data.
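The PRESS statistic itself is just the sum of squares of these leave-one-out residuals; the post works with the residuals directly, so the one-liner below is our own addition for reference.

sum((d$y - xlin_fits_lm(d$x, d$y))^2)  # roughly 0.73 on the toy data above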

At this point, from a statistical point of view, we are done. However, re-building an entire linear model for each point is computationally inefficient. Each of the models we are calculating shares many training points with the others, so we should be able to build a much faster hand-rolled calculation.

That faster calculation is given in the rather formidable function xlin_fits_R() found here. This function computes the summary statistics needed to solve the linear regression for all of the data. Then, for each point in turn, it subtracts that point's contribution from the summaries, performs the algebra needed to solve for the model, and applies it. The advantage of this approach is that taking the point out and solving the model is only a very small (constant) number of steps, independent of how many points are in the summary. So this code is, in principle, very efficient.
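The real xlin_fits_R() is linked above; as a stand-in, here is a compact sketch of the same idea (our own simplified code and names, not the package's): keep running sums for the full data, subtract one point's contribution, and solve the one-variable least squares problem in closed form.

# Simplified illustration of the summary-statistics trick
# (not the package's xlin_fits_R(); names are our own).
xlin_fits_sketch <- function(x, y) {
  n   <- length(y)
  Sx  <- sum(x);     Sy  <- sum(y)
  Sxx <- sum(x * x); Sxy <- sum(x * y)
  preds <- numeric(n)
  for (k in seq_len(n)) {
    nk  <- n - 1                          # summaries with point k removed
    sx  <- Sx  - x[k];   sy  <- Sy  - y[k]
    sxx <- Sxx - x[k]^2; sxy <- Sxy - x[k] * y[k]
    b <- (nk * sxy - sx * sy) / (nk * sxx - sx^2)  # slope without point k
    a <- (sy - b * sx) / nk                        # intercept without point k
    preds[k] <- a + b * x[k]
  }
  preds
}

d$y - xlin_fits_sketch(d$x, d$y)
# [1]  0.3000000 -0.4459459  0.2790698 -0.2790698  0.4459459 -0.3000000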

And in fact, timings on a small problem (300 observations) show that while the simple “call lm() a bunch of times” approach of xlin_fits_lm() takes 0.28 seconds, the more complicated xlin_fits_R() takes 0.00043 seconds, a speedup of over 600 times!

However, this code performs a separate calculation for each scalar data point. As we mentioned above, this is fighting R, which is specialized for performing calculations over large vectors. The exact same algorithm written in C++ instead of R takes 0.000055 seconds, almost another factor of 10 faster!

The timings are summarized below.

  method                            time (seconds)
  xlin_fits_lm()  (lm() per point)  0.28
  xlin_fits_R()   (hand-rolled R)   0.00043
  C++ implementation                0.000055
This sort of difference, scalar-oriented C++ being so much faster than scalar-oriented R, is often distorted into “R is slow.”

      This is just not the case. If we adapt the algorithm to be vectorized we get an R algorithm with performance comparable to the C++ implementation!

      Not all algorithms can be vectorized, but this one can, and in an incredibly simple way. The original algorithm itself (xlin_fits_R()) is a bit complicated, but the vectorized version (xlin_fits_V()) is literally derived from the earlier one by crossing out the indices. That is: in this case we can move from working over very many scalars (slow in R) to working over a small number of vectors (fast in R).

Let’s take a look at the code transformation (the full functions are in the original post; a simplified sketch of the same transformation appears a little further below).

      We are not saying that xlin_fits_R() or xlin_fits_V() are easy to understand; we felt pretty slick when we derived them and added a lot of tests to confirm they calculate the same thing as xlin_fits_lm(). What we are saying is that the transform from xlin_fits_R() to xlin_fits_V() is simple: just cross out the for-loop and all of the “[k]” indices!
      Performing the exact same operation on every entry in a structure (but with different values) is the essence of “vectorized code.” When we wrote a for-loop in xlin_fits_R() to perform the same steps for each index, we were in fact fighting the language. Crossing out the for-loop and indices mechanically turned the scalar code into faster vector code.
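Applied to our earlier sketch (again our own illustration, not the package's xlin_fits_R()/xlin_fits_V()), the same mechanical change looks like this:

# The sketch from before with the for-loop and the "[k]" indices crossed out:
# every per-point quantity becomes a vector, and R computes all n fits at once.
xlin_fits_sketch_V <- function(x, y) {
  n   <- length(y)
  Sx  <- sum(x);     Sy  <- sum(y)
  Sxx <- sum(x * x); Sxy <- sum(x * y)
  nk  <- n - 1                          # each fit leaves out exactly one point
  sx  <- Sx  - x;    sy  <- Sy  - y     # vectors of leave-one-out summaries
  sxx <- Sxx - x^2;  sxy <- Sxy - x * y
  b <- (nk * sxy - sx * sy) / (nk * sxx - sx^2)  # vector of slopes
  a <- (sy - b * sx) / nk                        # vector of intercepts
  a + b * x                                      # leave-one-out predictions
}

d$y - xlin_fits_sketch_V(d$x, d$y)
# [1]  0.3000000 -0.4459459  0.2790698 -0.2790698  0.4459459 -0.3000000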

And that is our example of how and why to vectorize code in R.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.
