R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Extending slackr

Wed, 10/25/2017 - 02:00

(This article was first published on Rstats on Reproducible Random Thoughts, and kindly contributed to R-bloggers)

The slackr package, by Bob Rudis, contains functions that make it possible to interact with the Slack messaging platform. A great feature of Slack is its mobile application interface, and we take advantage of this by extending slackr’s interaction with the server:

  • Write to a channel (current) – slackr
    • Compile TeX snippets directly to a channel – tex_slackr
    • tex_slackr(xtable::xtable(mtcars))
    • This function uses the texPreview package to compile and generate the TeX output.
  • Read channel history (new) – history_slackr
  • Edit messages on a channel (new) – edit_slackr
  • Delete channel messages (new) – delete_slackr

This lets us interact with R and Slack in new ways: you can get live updates on long simulations sent directly to your (and your team’s) mobile devices and multitask away from your computer.

devtools::install_github('hrbrmstr/slackr')

Progress Bars

Create a text progress bar that is sent directly to a Slack channel.

progress_slackr <- function(x){
  text_slackr('0%')
  s <- c(0,x,1)
  for(i in 1:length(s)){
    Sys.sleep(0.5)
    edit_slackr(sprintf('%s%%',round(100*i/length(s),1)))
  }
  Sys.sleep(0.5)
}

set.seed(1234)
progress_slackr(sort(runif(10)))

This also solves a long-standing problem of tracking the progress of parallel jobs on a server. We use slackr with the qapply package, which runs jobs on an Open Grid Scheduler/Engine, so we can track the progress of each node.

Onexit

Attach on.exit expressions to any function in R; when the original function finishes, its output is sent to the Slack channel.

This is useful for letting you know when a simulation is done, and also for sending a relevant summary to the Slack channel so you can check that the simulation ran as intended.

ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)

#pass a message to Slack channel 'general'
register_onexit(lm,'bazinga!',channel="#general")
lm.D9 <- slack_lm(weight ~ group)

#test that output keeps inheritance
summary(lm.D9)

#pass a message to Slack channel 'general' with a header message to begin output
register_onexit(lm,'bazinga!',
                channel="#general",
                header_msg='This is a message to begin')
lm.D9 <- slack_lm(weight ~ group)

#onexit with an expression that calls lm.plot
register_onexit(lm,{
  par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
  plot(z)
},
channel="#general",
header_msg = 'This is a plot just for this output',
use_device = TRUE)
lm.D9 <- slack_lm(weight ~ group)

#clean up slack channel from examples
delete_slackr(count = 6, channel = '#general')


Create editable Microsoft Office charts from R

Wed, 10/25/2017 - 00:12

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R has a rich and infinitely flexible graphics system, and you can easily embed R graphics into Microsoft Office documents like PowerPoint or Word. The one thing I dread hearing after delivering such a document, though, is "how can I tweak that graphic?". I could change the colors or fonts or dimensions in R, of course, but sometimes people just want to tweak graphics to their hearts' content. If you're in that situation, you have a couple of options for using R to create Office documents with graphics, and make those graphics editable. Both options work in conjunction with the "officer" package, which lets you create Word and PowerPoint documents from R.

The first option — the rvg package — lets you use traditional R graphics commands, but allows the PowerPoint or Word user to modify the components of those graphics. Labels become text elements; lines, points and bars become shapes; and so on. The recipient can then use the standard Office tools to change fonts, resize, rotate, recolor, or add other annotations. For example, this code creates a 1-slide PowerPoint document with a standard R bar chart: 

library(rvg)
library(ggplot2)
library(officer)

doc <- read_pptx()
doc <- add_slide(doc, layout = "Title and Content", master = "Office Theme")
doc <- ph_with_vg(doc,
                  code = barplot(sample(1:20,10), xlab="Day", ylab="Widgets"),
                  type = "body")
print(doc, target = "my_plot.pptx")

Out of the box, the slide looks like this:

If the recipient wants to color the bars and change the label font to Papyrus, they can use the standard tools in PowerPoint. These changes took me about 20 seconds.

The downside is that the chart here is just a collection of objects. It's easy to make mistakes (say, to move a single bar to a new position), and charts with many components (like scatterplots with thousands of points) can be unwieldy. It's not possible to change the structure of the chart either, say to switch from a bar chart to a line chart.

Another option is the mschart package, which allows you to use data in R to create Microsoft Office charts in Word or PowerPoint documents. In this case, you won't be using the standard R graphics commands; instead, you'll be using new R functions to generate the standard Office chart types:

  • Bar charts, including grouped bars and vertical and horizontal stacked bars
  • Line charts, including stacked line charts
  • Area charts
  • Scatter plots

You can see several examples in the mschart package vignette. While you lose the flexibility of R's graphics system, using the mschart package does mean the recipient can use the standard Office chart tools to reformat the chart while keeping the overall presentation structure. For example, I was able to change the style of this bar chart in Word with a couple of clicks, and even change the labels from French to English using the Edit Data tool.
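A minimal sketch of that workflow, assuming mschart's ms_barchart() and body_add_chart() helpers together with officer (the data frame below is made up for illustration):

library(officer)
library(mschart)

# a small made-up dataset: one bar series per group
dat <- data.frame(
  month  = rep(c("Jan", "Feb", "Mar"), 2),
  series = rep(c("A", "B"), each = 3),
  value  = c(10, 15, 12, 8, 11, 14)
)

# build a native Office bar chart from the data
chart <- ms_barchart(data = dat, x = "month", y = "value", group = "series")

# embed the chart in a Word document; the recipient can restyle it
# with the standard Office chart tools
doc <- read_docx()
doc <- body_add_chart(doc, chart = chart, style = "Normal")
print(doc, target = "my_chart.docx")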

Like the officer package, the mschart and rvg packages are maintained by David Gohel; Bob Rudis, and Francois Brunetti also contributed. Both packages are available now on CRAN, and you can install them into your R session with the install.packages function.



Some Announcements

Tue, 10/24/2017 - 20:13

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Some Announcements:

  • Dr. Nina Zumel will be presenting “Myths of Data Science: Things you Should and Should Not Believe”,
    Sunday, October 29, 2017
    10:00 AM to 12:30 PM at the She Talks Data Meetup (Bay Area).
  • ODSC West 2017 is soon. It is our favorite conference and we will be giving both a workshop and a talk.
    • Thursday Nov 2 2017,
      2:00 PM,
      Room T2,
      “Modeling big data with R, Sparklyr, and Apache Spark”,
      Workshop/Training intermediate, 4 hours,
      by Dr. John Mount (link).

    • Friday Nov 3 2017,
      4:15 PM,
      Room TR2
      “Myths of Data Science: Things you Should and Should Not Believe”,
      Data Science lecture beginner/intermediate, 45 minutes,
      by Dr. Nina Zumel (link, length, abstract, and title to be corrected).

    • We really hope you can make these talks.

  • On the “R for big data” front we have some big news: the replyr package now implements pivot/un-pivot (or what tidyr calls spread/gather) for big data (databases and Sparklyr). This data shaping ability adds a lot of user power. We call the theory “coordinatized data” and the work practice “fluid data”.
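For readers who have not met the operation before, here is what pivot/un-pivot means on a small local data frame, illustrated with tidyr's spread/gather; replyr supplies the analogous operations for remote big-data backends, so this local sketch is only for intuition:

library(tidyr)

# "long" data: one measurement per row
d <- data.frame(index    = c(1, 2, 3),
                meastype = c("meas1", "meas2", "meas1"),
                meas     = c(4.1, 7.2, 5.3))

# pivot (spread): move the measurement types into their own columns
wide <- spread(d, key = meastype, value = meas)
wide

# un-pivot (gather): back to the long form
gather(wide, key = meastype, value = meas, -index)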


Meeting Titans of Open Data

Tue, 10/24/2017 - 19:52

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

The recent Association of Public Data Users (APDU) Conference gave me the opportunity to meet some people who have made tremendous contributions to the world of Open Data.

Jon Sperling

The author with Jon Sperling

One of the most popular R packages I’ve published is choroplethrZip. This package contains the US Census Bureau’s Zip Code Tabulation Area (ZCTA) map, as well as metadata and visualization functions. It literally never occurred to me to think about who created the first ZCTA map, the methodology for creating it, and so on.

It turns out that one of the people who created the ZCTA map – Jon Sperling – was at the conference. We had a fascinating conversation, and I even got to take a selfie with him! You can learn more about Jon’s role in the creation of the TIGER database in his 1992 paper Development and Maintenance of the TIGER Database: Experiences in Spatial Data Sharing at the U.S. Bureau of the Census.

Jon currently works at the Department of Housing and Urban Development (HUD). You can learn more about their data here.

Nancy Potok

Nancy Potok is the Chief Statistician of the United States, and she gave a fascinating keynote. Before her talk I did not know that the country even had a Chief Statistician. Her talk taught me about the Federal Statistical System as well as the Interagency Council on Statistical Policy.

The Q&A portion of her talk was also interesting. A significant portion of the audience worked at federal agencies. I believe it was during this session that someone said “I can’t tell how much of my time should be dedicated to supporting the data and analyses which I publish. In a very real sense, my job ends when I publish it.” This comment helped me understand why it is sometimes difficult to get help with government datasets: there simply aren’t incentives for data publishers to support users of that data.

Andrew Reamer

Sometimes at a conference you can tell that there’s a celebrity there just by how people act towards them.

In this case it seemed that everyone except me knew who Andrew Reamer was. During the Q&A portion of talks, Andrew would raise his hand and speakers would say “Hi Andrew! What’s your question?”. Or they would get a question from someone else and say “I’m actually not sure what the answer is. Maybe Andrew knows. Andrew?”

Note that Andrew didn’t actually have a speaking slot at the event. And yet he still did a lot of speaking!

During lunch I worked up the courage to go up to Andrew and ask him about his work. It turns out that he is a Research Professor at George Washington University’s Institute of Public Policy and a well-known social scientist. As I’m quickly learning, social scientists tend to be major users of public datasets.

I previously published a quote from the Census Bureau that the American Community Survey (a major survey they run) impacts how over $400 billion is allocated. However, I was never able to get any more granularity than that. If you’re interested, it turns out that Andrew wrote a piece on exactly that question: Counting for Dollars 2020: The Role of the Decennial Census in the Geographic Distribution of Federal Funds.

Sarah Cohen

Sarah Cohen is a Pulitzer Prize winning journalist who is now the Assistant Editor for Computer-Assisted Reporting at the New York Times. Before Sarah’s talk I only had passing familiarity with data journalism, and had a very narrow view of what it entailed. Sarah’s keynote gave examples of different ways that public data is shaping journalism. She also discussed common problems that arise when journalists communicate with data publishers.

Key Takeaway

The 2017 APDU Conference was a great chance for me (an R trainer and consultant with a strong interest in open data) to meet people who have made major contributions to the field of public data. If you are interested in learning more about the Association of Public Data Users, I recommend visiting their website here.

The post Meeting Titans of Open Data appeared first on AriLamstein.com.



CRAN search based on natural language processing

Tue, 10/24/2017 - 15:30

(This article was first published on bnosac :: open analytical helpers, and kindly contributed to R-bloggers)

As of October 2017, CRAN contains more than 11,500 R packages. If you wanted to scroll through all of these, you would probably need to spend a few days, assuming you need 5 seconds per package and there are 8 hours in a day.
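The back-of-the-envelope arithmetic behind that claim:

# 11500 packages at 5 seconds each
11500 * 5 / 3600        # roughly 16 hours of scrolling
11500 * 5 / (3600 * 8)  # about 2 full 8-hour days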

Since R version 3.4, we can also get a dataset with all packages, their dependencies, the package title, the description and even the installation errors which the packages have. This makes the CRAN database with all packages an excellent dataset for text mining. If you want to get that dataset, just do as follows in R:

library(tools)
crandb <- CRAN_package_db()

Based on that data, the CRAN NLP searcher app shown below was built. It’s available for inspection at http://datatailor.be:9999/app/cran_search and is a tiny wrapper around the result of annotating the package title and package description using the udpipe R package: https://github.com/bnosac/udpipe

If you want to easily extract what is written in text without reading it, a common approach is to do Parts of Speech tagging, extract the nouns and/or the verbs, and then plot all co-occurrences / correlations and frequencies of the lemmata. The udpipe package allows you to do exactly that. Annotating with Parts of Speech tagging is pretty easy with the udpipe_annotate function from the udpipe R package (https://github.com/bnosac/udpipe). Note that this takes a while (roughly 30 minutes) and is probably something you want to run as a web service or integrated stored procedure.

library(udpipe)
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
crandb_annotated <- udpipe_annotate(ud_model,
                                    x = paste(crandb$Title, crandb$Description, sep = " \n "),
                                    doc_id = crandb$Package)
crandb_annotated <- as.data.frame(crandb_annotated)
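As a rough sketch of the kind of post-processing the app performs on this annotation (assuming the crandb_annotated data frame from above and udpipe's cooccurrence() helper):

# keep only the nouns and count how often lemmas occur together
# within the same package (doc_id)
crandb_nouns <- subset(crandb_annotated, upos == "NOUN")
crandb_cooc  <- cooccurrence(crandb_nouns, term = "lemma", group = "doc_id")
head(crandb_cooc)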

Once we have that data annotated, making a web application which allows you to visualise, structure and display the CRAN packages content is pretty easy with tools like flexdashboard. That’s exactly what the web application available at http://datatailor.be:9999/app/cran_search does. The application allows you to:

  • List all packages which are part of a CRAN Task View
  • To search for CRAN packages based on what the author has written in the package title and description
  • For the CRAN packages found by the search: visualise the nouns and verbs in the package titles and descriptions using
    • Word co-occurrence graphs indicating how many times each lemma occurs in the same package as another lemma
    • Word-correlation graphs showing the positive correlations between the top n most occurring lemmas in the packages
    • Word clouds indicating the frequency of nouns/verbs or consecutive nouns/verbs (bigrams) in the package descriptions
    • Build a topic model (Latent Dirichlet Allocation) to cluster packages and visualise them

The web application (flexdashboard) was launched on a small shinyproxy server and is available here: http://datatailor.be:9999/app/cran_search. Can you find topics which are not yet covered by the CRAN task views? Can you find the content of the Rcpp universe or the sp package universe?

If you are interested in these techniques, you can always subscribe for our text mining with R course at the following dates:

  • 08+10/11/2017: Text mining with R. Leuven (Belgium). Subscribe here
  • 27-28/11/2017: Text mining with R. Brussels (Belgium). http://di-academy.com/bootcamp + send mail to training@di-academy.com
  • 19-20/12/2017: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 20-21/02/2018: Advanced R programming. Leuven (Belgium). Subscribe here
  • 08-09/03/2018: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 22-23/03/2018: Text Mining with R. Leuven (Belgium). Subscribe here
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));


Machine Learning Explained: Kmeans

Tue, 10/24/2017 - 15:20

(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

Kmeans is one of the most popular and simplest algorithms for discovering underlying structure in your data. The goal of kmeans is simple: split your data into k different groups, each represented by its mean. The mean of each group is assumed to be a good summary of every observation in that cluster.

Kmeans Algorithm

We assume that we want to split the data into k groups, so we need to find and assign k centers. How to define and find these centers?

They are the solution to the minimization problem

$$\min_{c_1,\dots,c_k,\;a} \sum_{i=1}^{n} \sum_{j=1}^{k} a_{ij}\,\lVert x_i - c_j \rVert^2,$$

where $a_{ij} = 1$ if observation $i$ is assigned to center $j$ and $a_{ij} = 0$ otherwise.

Basically, this equation means that we are looking for the k centers which minimize the distance between them and the points of their cluster. This is an optimization problem, but since the function we want to minimize is not convex and some variables are binary, it cannot be solved in the classic way with gradient descent.
The usual way to solve it is the following:

  1. Initialize randomly the centers by selecting k-observations
  2. While some convergence criterion is not met
    1. Assign each observation to its closest center
    2. Update each center. The new centers are the means of the observations in each group.
    3. Update the convergence criterion.
K-means from scratch with R

Now that we have the algorithm in pseudocode, let’s implement kmeans from scratch in R. First, we’ll create some toy data based on five 2D Gaussian distributions.

require(MASS)
require(ggplot2)
set.seed(1234)

set1=mvrnorm(n = 300, c(-4,10),  matrix(c(1.5,1,1,1.5),2))
set2=mvrnorm(n = 300, c(5,7),    matrix(c(1,2,2,6),2))
set3=mvrnorm(n = 300, c(-1,1),   matrix(c(4,0,0,4),2))
set4=mvrnorm(n = 300, c(10,-10), matrix(c(4,0,0,4),2))
set5=mvrnorm(n = 300, c(3,-3),   matrix(c(4,0,0,4),2))

DF=data.frame(rbind(set1,set2,set3,set4,set5),cluster=as.factor(c(rep(1:5,each=300))))

ggplot(DF,aes(x=X1,y=X2,color=cluster))+geom_point()

On this dataset, kmeans will work well since each distribution has a circular shape. Here is what the data look like:

Now that we have a dataset, let’s implement kmeans.

Initialisation of the centroids

The initialisation of the centroids is crucial and will change how the algorithm behaves. Here, we will simply take K random points from the data.

#Centroids initialisation
centroids=data[sample.int(nrow(data),K),]
##Stopping criterion initialisation
current_stop_crit=10e10
##Vector where the assigned centers of each point will be saved
cluster=rep(0,nrow(data))
##Has the algorithm converged?
converged=F
it=1

Assigning points to their clusters

At each iteration, every point is assigned to its closest cluster. To do so, the Euclidean distance between each point and each center is computed, and the lowest distance, together with the center for which it is reached, is saved.

###Iterating over observations
for (i in 1:nrow(data))
{
  ##Setting a high minimum distance
  min_dist=10e10
  ##Iterating over centroids
  for (centroid in 1:nrow(centroids))
  {
    ##Computing the L2 distance
    distance_to_centroid=sum((centroids[centroid,]-data[i,])^2)
    ##This centroid is the closest centroid to the point
    if (distance_to_centroid<=min_dist)
    {
      ##The point is assigned to this centroid/cluster
      cluster[i]=centroid
      min_dist=distance_to_centroid
    }
  }
}

Centroids update

Once each point has been assigned to the closest centroid, the coordinates of each centroid are updated. The new coordinates are the means of the observations which belong to the cluster.

##For each centroid
for (i in 1:nrow(centroids))
{
  ##The new coordinates are the means of the points in the cluster
  centroids[i,]=apply(data[cluster==i,],2,mean)
}

Stopping criterion

We do not want the algorithm to run indefinitely, hence we need a stopping criterion to stop the algorithm when we are close enough to a minimum. The criterion is simply that the algorithm should stop when the centroids stop moving.

while(current_stop_crit>=stop_crit & converged==F)
{
  it=it+1
  if (current_stop_crit<=stop_crit)
  {
    converged=T
  }
  old_centroids=centroids
  ###Run the assignment and centroid update steps shown above
  ###Recompute stop criterion
  current_stop_crit=mean((old_centroids-centroids)^2)
}

Complete function

kmeans=function(data,K=4,stop_crit=10e-5)
{
  #Initialisation of clusters
  centroids=data[sample.int(nrow(data),K),]
  current_stop_crit=1000
  cluster=rep(0,nrow(data))
  converged=F
  it=1
  while(current_stop_crit>=stop_crit & converged==F)
  {
    it=it+1
    if (current_stop_crit<=stop_crit)
    {
      converged=T
    }
    old_centroids=centroids
    ##Assigning each point to a centroid
    for (i in 1:nrow(data))
    {
      min_dist=10e10
      for (centroid in 1:nrow(centroids))
      {
        distance_to_centroid=sum((centroids[centroid,]-data[i,])^2)
        if (distance_to_centroid<=min_dist)
        {
          cluster[i]=centroid
          min_dist=distance_to_centroid
        }
      }
    }
    ##Updating each centroid with the mean of its points
    for (i in 1:nrow(centroids))
    {
      centroids[i,]=apply(data[cluster==i,],2,mean)
    }
    current_stop_crit=mean((old_centroids-centroids)^2)
  }
  return(list(data=data.frame(data,cluster),centroids=centroids))
}

You can easily run the code to see your clusters:

res=kmeans(DF[1:2],K=5)
res$centroids$cluster=1:5
res$data$isCentroid=F
res$centroids$isCentroid=T
data_plot=rbind(res$centroids,res$data)

ggplot(data_plot,aes(x=X1,y=X2,color=as.factor(cluster),size=isCentroid,alpha=isCentroid))+geom_point()

Exploring kmeans results

Now let’s try the algorithm on two different datasets. First, on the five Gaussian distributions:

The centroids move and split the data into clusters which are very close to the original ones. Kmeans is doing a great job here.
Now, instead of having nice Gaussian distributions, we will build three rings, one inside another.

##Building three sets
set1=data.frame(r=runif(300,0.1,0.5),theta=runif(300,0,360),set='1')
set2=data.frame(r=runif(300,1,1.5),  theta=runif(300,0,360),set='2')
set3=data.frame(r=runif(300,3,5),    theta=runif(300,0,360),set='3')

##Transformation in rings
data_2=rbind(set1,set2,set3)
data_2$x=data_2$r*cos(2*3.14*data_2$theta)
data_2$y=(data_2$r)*sin(2*3.14*data_2$theta)

Kmeans performs very poorly on these new data. The Euclidean distance is simply not adapted to this kind of problem, since the clusters are rings rather than compact, roughly circular groups.

So before using kmeans, you should ensure that the clusters have an appropriate shape; if not, you can apply transformations to the data or change the distance used in the kmeans.
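As one possible transformation (an illustration on the toy data above, using base R's stats::kmeans rather than the hand-rolled function): since the three rings differ only in radius, clustering on the radius recovers them.

# cluster on the radius instead of the (x, y) coordinates;
# the stats:: prefix avoids the clash with the kmeans() defined above
res_rings <- stats::kmeans(data_2$r, centers = 3)
table(res_rings$cluster, data_2$set)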

The importance of a good initialisation

The kmeans algorithm only looks for a local minimum, which is often not a global optimum. Hence, different initialisations can lead to very different results.

We ran the kmeans algorithm from more than 60 different starting positions. As you can see, sometimes the algorithm ends up with poor centroids due to an unlucky initialisation. The solution to this is simply to run kmeans several times and to take the best set of centroids. The quality of initialisation can also be improved with kmeans++, which selects starting points that are less likely to perform poorly.
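Base R's stats::kmeans builds this multi-start strategy in through its nstart argument; as an illustration on the toy data from above:

# run 25 random initialisations and keep the best solution
res_multi <- stats::kmeans(DF[, 1:2], centers = 5, nstart = 25)
res_multi$centers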

Want to learn more on Machine Learning ? Here is a selection of Machine Learning Explained posts:
Dimensionality reduction
Supervised vs unsupervised vs reinforcement learning
Regularization in machine learning

The post Machine Learning Explained: Kmeans appeared first on Enhance Data Science.



A Shiny App to Create Sentimental Tweets Based on Project Gutenberg Books

Tue, 10/24/2017 - 09:00

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

There was something about them that made me uneasy, some longing and at the same time some deadly fear – Dracula (Stoker, Bram)

Twitter is a very good source of inspiration. Some days ago I came across this:

My presentation on text analysis with R for the first @RLadiesSantiago meetup: https://t.co/KPU89vyeND #rladies #rstatsES

— Riva Quiroga (@rivaquiroga) September 25, 2017

The tweet refers to a presentation (in Spanish) available here, which is a very concise and well illustrated document about the state of the art of text mining in R. There I discovered several libraries that I will try to use in the future. In this experiment I have used one of them: the syuzhet package. As can be read in the documentation:

this package extracts sentiment and sentiment-derived plot arcs from text using three sentiment dictionaries conveniently packaged for consumption by R users. Implemented dictionaries include syuzhet (default) developed in the Nebraska Literary Lab, afinn developed by Finn Arup Nielsen, bing developed by Minqing Hu and Bing Liu, and nrc developed by Mohammad, Saif M. and Turney, Peter D.
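As a quick, made-up illustration of this basic workflow:

library(syuzhet)

text <- "I love this place. It was a dark and terrible night. Everything ended wonderfully."
sentences <- get_sentences(text)
get_sentiment(sentences, method = "syuzhet")
# one score per sentence: negative values mean darker sentiment, positive mean brighter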

You can find a complete explanation of the package in its vignette. A very interesting application of these techniques is the Sentiment Graph of a book, which represents how sentiment changes over time. This is the Sentiment Graph of Romeo and Juliet, by William Shakespeare, taken from Project Alexandria:

The darkest sentiments can be seen at the end of the book, where the tragedy reaches its highest level. It is also nice to see how the sentiments are cyclical. These graphs can be very useful for people who just want to read books with happy endings (my sister is one of those).

Inspired by this analysis, I have done another experiment in which I download a book from Project Gutenberg and measure the sentiment of all its sentences. Based on this measurement, I filter the top 5% of sentences (by positive or negative sentiment) to build tweets. I have built a Shiny app where all these steps are explained. The app is available here.

From a technical point of view, I used the selectize JavaScript library to filter books in a flexible way. I also customized the appearance with a CSS Bootstrap theme from Bootswatch, as explained here.

This is the code of the experiment.

UI.R:

library(shiny)

fluidPage(theme = "bootstrap.css",
  titlePanel(
    h1("Sentimental Tweets from Project Gutenberg Books", align="center"),
    windowTitle="Tweets from Project Gutenberg"),
  sidebarLayout(
    sidebarPanel(
      selectInput('book', 'Choose a book:',
                  multiple=FALSE,
                  selectize = TRUE,
                  choices=c("Enter some words of title or author" = "",
                            gutenberg_works$searchstr)),
      radioButtons(inputId = "sent",
                   label = "Choose sentiment:",
                   choices = c("Dark"="1", "Bright"="20"),
                   selected="1",
                   inline=TRUE),
      radioButtons(inputId = "meth",
                   label = "Choose a method to measure sentiment:",
                   choices = c("syuzhet", "bing", "afinn", "nrc"),
                   selected="syuzhet",
                   inline=TRUE),
      radioButtons(inputId = "char",
                   label = "Number of characters (max):",
                   choices = list("140", "280"),
                   inline=TRUE),
      checkboxInput(inputId = "auth", label = "Add author", value=FALSE),
      checkboxInput(inputId = "titl", label = "Add title", value=FALSE),
      checkboxInput(inputId = "post", label = "Add link to post (thanks!)", value=TRUE),
      textInput(inputId = "adds", label = "Something else?", placeholder="Maybe a #hastag?"),
      actionButton('do','Go!', class="btn btn-success action-button", css.class="btn btn-success")
    ),
    mainPanel(
      tags$br(),
      p("First of all, choose a book entering some keywords of its title or author and doing dropdown navigation. Books are downloaded from Project Gutenberg. You can browse the complete catalog",
        tags$a(href = "https://www.gutenberg.org/catalog/", "here.")),
      p("After that, choose the sentiment of tweets you want to generate. There are four possible methods than can return slightly different results. All of them assess the sentiment of each word of a sentence and sum up the result to give a scoring for it. The more negative is this scoring, the", em("darker"), "is the sentiment. The more positive, the ", em("brighter."), " You can find a nice explanation of these techniques",
        tags$a(href = "http://www.matthewjockers.net/2017/01/12/resurrecting/", "here.")),
      p("Next parameters are easy: you can add the title and author of the book where sentence is extracted as well as a link to my blog and any other string you want. Clicking on the lower button you will get after some seconds a tweet below. Click as many times you want until you like the result."),
      p("Finally, copy, paste and tweet. ", strong("Enjoy it!")),
      tags$br(),
      tags$blockquote(textOutput("tweet1")),
      tags$br()
    )))

Server.R:

library(shiny)

function(input, output) {

  values <- reactiveValues(default = 0)

  observeEvent(input$do,{
    values$default <- 1
  })

  book <- eventReactive(input$do, {
    GetTweet(input$book, input$meth, input$sent, input$char,
             input$auth, input$titl, input$post, input$adds)
  })

  output$tweet1 <- renderText({
    if(values$default == 0){
      "Your tweet will appear here ..."
    }
    else{
      book()
    }
  })
}

Global.R:

library(gutenbergr)
library(dplyr)
library(stringr)
library(syuzhet)

x <- tempdir()

# Read the Project Gutenberg catalog and filter english works. I also create a column with
# title and author to make searchings
gutenberg_metadata %>%
  filter(has_text, language=="en", gutenberg_id>0, !is.na(author)) %>%
  mutate(searchstr=ifelse(is.na(author), title, paste(title, author, sep= " - "))) %>%
  mutate(searchstr=str_replace_all(searchstr, "[\r\n]" , "")) %>%
  group_by(searchstr) %>%
  summarize(gutenberg_id=min(gutenberg_id)) %>%
  ungroup() %>%
  na.omit() %>%
  filter(str_length(searchstr)<100) -> gutenberg_works

# This function generates a tweet according the UI settings (book, method, sentiment and
# number of characters). It also appends some optional strings at the end
GetTweet = function (string, method, sentim, characters,
                     author, title, link, hastag)
{
  # Obtain gutenberg_id from book
  gutenberg_works %>%
    filter(searchstr == string) %>%
    select(gutenberg_id) %>%
    .$gutenberg_id -> result

  # Download text, divide into sentences and score sentiment. Save results to do it once and
  # optimize performance
  if(!file.exists(paste0(x,"/","book",result,"_",method,".RDS")))
  {
    book=gutenberg_download(result)
    book[,2] %>%
      as.data.frame() %>%
      .$text %>%
      paste(collapse=" ") -> text

    sentences_v <- get_sentences(text)
    sentiment_v <- get_sentiment(sentences_v, method=method)
    data.frame(sentence=sentences_v, sentiment=sentiment_v) %>%
      mutate(length=str_length(sentence)) -> results
    saveRDS(results, paste0(x,"/","book",result,"_",method,".RDS"))
  }
  results=readRDS(paste0(x,"/","book",result,"_",method,".RDS"))
  book_info=gutenberg_metadata %>% filter(gutenberg_id==result)

  # Paste optional strings to append at the end
  post=""
  if (title) post=paste("-", book_info[,"title"], post, sep=" ")
  if (author) post=paste0(post, " (", str_trim(book_info[,"author"]), ")")
  if (link) post=paste(post, "https://wp.me/p7VZWY-16S", sep=" ")
  post=paste(post, hastag, sep=" ")
  length_post=nchar(post)

  # Calculate 5% quantiles
  results %>%
    filter(length<=(as.numeric(characters)-length_post)) %>%
    mutate(sentiment=jitter(sentiment)) %>%
    mutate(group = cut(sentiment,
                       include.lowest = FALSE,
                       labels = FALSE,
                       breaks = quantile(sentiment, probs = seq(0, 1, 0.05)))) -> results

  # Obtain a sample sentence according sentiment and append optional string to create tweet
  results %>%
    filter(group==as.numeric(sentim)) %>%
    sample_n(1) %>%
    select(sentence) %>%
    .$sentence %>%
    as.character() %>%
    str_replace_all("[.]", "") %>%
    paste(post, sep=" ") -> tweet

  return(tweet)
}


RStudio Connect v1.5.8

Tue, 10/24/2017 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re pleased to announce RStudio Connect: version 1.5.8. This release enables reconnects for Shiny applications, more consistent and trustworthy editing of user information, and various LDAP improvements.

The major changes this release include:

  • Enabled support for Shiny reconnects. Users of Shiny applications are less likely to be interrupted during brief network hiccups. The Client.ReconnectTimeout property specifies how long that session is maintained when there is connectivity trouble. The default setting is 15s. See https://shiny.rstudio.com/articles/reconnecting.html to learn more about reconnecting to Shiny applications. Disable this feature by giving the Client.ReconnectTimeout property a value of 0.
  • Greater consistency around editing user information. Authentication providers that expect user information to come in externally (like LDAP and OAuth) will by default forbid users from editing their information and will automatically refresh user profile information when the user logs in. Other providers now more consistently allow information that was specified when the user created their account to be edited by the user later.
  • The browseURL R function is disabled when executing deployed content. Use techniques like the Shiny shiny::tags$a function to expose links to application visitors.
  • Support more flexibility when searching for LDAP users and groups with the [LDAP].UserFilterBase and [LDAP].GroupFilterBase settings.
  • LDAP configuration’s BindDN password can now be stored in an external file using the new BindPasswordFile field. Also made improvements to LDAP group membership lookups.
  • Previously, usernames could not be edited when using the LDAP authentication provider by default or if the Authentication.RequireExternalUsernames flag was set to true. Now, user email, first name, and last name are also not editable for this configuration.
  • Connect administrators now receive an email as license expiration nears. Email is sent when the license is sixty days from expiring. Disable this behavior through the Licensing.Expiration setting.
  • Resolved a bug in the version of the rebuild-packrat command-line tool that was released in v1.5.6. Previously, the migration utility would render static content inaccessible. This release fixes this behavior and adds support for running this CLI tool while the RStudio Connect server is online. However, due to the discovery of new defects, the utility is disabled by default and is not recommended for production use until further notice. Those wishing to attempt to use the utility anyway should do so on a staging server that can be safely lost, and all content should be thoroughly tested after it has completed. http://docs.rstudio.com/connect/1.5.8/admin/cli.html#migration-cli
  • Fixed an issue with account confirmations and password resets for servers using non-UTC time zones.
  • LDAP now updates user email, first name, and last name every time a user logs in.
  • Fix an issue when performing the LOGIN SMTP authentication mechanism.
  • BREAKING: Changed the default value for PAM.AuthenticatedSessionService to su. Previously, on some distributions of Linux, setting PAM.ForwardPassword to true could present PAM errors to users when running applications as the current user if the AuthenticatedSessionService was not configured. System administrators who had previously edited the rstudio-connect PAM service for use in ForwardPassword mode should update the PAM.AuthenticatedSessionService configuration option. See: http://docs.rstudio.com/connect/1.5.8/admin/process-management.html#pam-credential-caching-kerberos
  • BREAKING: The format of the RStudio Connect package file names have changed. Debian package file names have the form rstudio-connect_1.2.3-7_amd64.deb. RPM package file names have the form rstudio-connect-1.2.3-7.x86_64.rpm. In addition, the RPM meta-data will have a “version” of 1.2.3 and a “release” of 7 for this file name. Previously, the RPM would have had a “version” of 1.2.3-7.

You can see the full release notes for RStudio Connect 1.5.8 here.

Upgrade Planning

There are no special precautions to be aware of when upgrading from v1.5.6. You can expect the installation and startup of v1.5.8 to be complete in under a minute.

If you’re upgrading from a release older than v1.5.6, be sure to consider the “Upgrade Planning” notes from those other releases, as well.

If you haven’t yet had a chance to download and try RStudio Connect we encourage you to do so. RStudio Connect is the best way to share all the work that you do in R (Shiny apps, R Markdown documents, plots, dashboards, Plumber APIs, etc.) with collaborators, colleagues, or customers.

You can find more details or download a 45 day evaluation of the product at https://www.rstudio.com/products/connect/. Additional resources can be found below.



jsTree htmlwidget

Tue, 10/24/2017 - 02:00

(This article was first published on Rstats on Reproducible Random Thoughts, and kindly contributed to R-bloggers)




jsTree is an R package that is a standalone htmlwidget for the jsTree JavaScript library. It can be run from the R console directly into the RStudio Viewer, be used in a Shiny application, or be part of an R Markdown HTML output.

Installation

# CRAN
install.packages('jsTree')

# Github
devtools::install_github('metrumresearchgroup/jsTree')

The motivation for the package came from the shinyTree by Jeff Allen, which has an early version of the JavaScript library hardcoded into a Shiny application.

The input for the htmlwidget is a character vector, and the hierarchy is set by the ‘/’, so as to mimic the delimiter of a file path.

The main purpose of this package is to visualize folder hierarchies, much like the Files tab in RStudio. But instead of looking at local directories, it will be used for remote repositories, such as GitHub, Bitbucket and SVN repositories. This is implemented in the vcs package, and a future post will show the utility that package provides.

To show the basic functionality of jsTree we will use a dataset that contains all superfund sites from the EPA. You can access the tibble here. The first rows of the data look like this:

library(htmlTable)
library(jsTree)
library(widgetframe)

# Read in the data
superfundsite <- readRDS('../../data/superfundsite.Rds')
htmlTable::htmlTable(head(superfundsite), rnames=FALSE)

Region | State         | City        | County     | Zip Code | Site Name
01     | Massachusetts | ACTON       | MIDDLESEX  | 01720    | W.R. GRACE & CO., INC. (ACTON PLANT)
01     | Massachusetts | AMESBURY    | ESSEX      | 01913    | MICROFAB INC (FORMER)
01     | Massachusetts | ASHLAND     | MIDDLESEX  | 01721    | NYANZA CHEMICAL WASTE DUMP
01     | Massachusetts | ATTLEBORO   | BRISTOL    | 02703    | WALTON & LONSBURY INC.
01     | Maine         | AUGUSTA     | KENNEBEC   | 04330    | O’CONNOR CO.
01     | Connecticut   | BARKHAMSTED | LITCHFIELD | 06063    | BARKHAMSTED-NEW HARTFORD LANDFILL

Basic Usage

# collapse columns to be delimited by '/'
myx <- apply(superfundsite, 1, function(x) paste(x, collapse = '/'))

# call jsTree
j0 <- jsTree(obj = myx)

# place widget in iframe to use in blog post
# (not needed in regular usage)
frameWidget(j0)

{"x":{"url":"/post/2017-10-24-jstree_files/figure-html//widgets/widget_ex0.html","options":{"xdomain":"*","allowfullscreen":false,"lazyload":false}},"evals":[],"jsHooks":[]} Initialize tree with checked boxes nodestate1 <- superfundsite$State=='California' j1 <- jsTree(myx,nodestate=nodestate1) frameWidget(j1)

{"x":{"url":"/post/2017-10-24-jstree_files/figure-html//widgets/widget_ex1.html","options":{"xdomain":"*","allowfullscreen":false,"lazyload":false}},"evals":[],"jsHooks":[]} Tooltips # Use the zipcode as a tooltip on the county name myx2 <- apply(superfundsite[,-c(5)],1,paste,collapse='/') zipcode_label <- superfundsite$`Zip Code` names(zipcode_label) <- superfundsite$County j2 <- jsTree(myx2,tooltips = zipcode_label) frameWidget(j2)

{"x":{"url":"/post/2017-10-24-jstree_files/figure-html//widgets/widget_ex2.html","options":{"xdomain":"*","allowfullscreen":false,"lazyload":false}},"evals":[],"jsHooks":[]} Shiny

jsTree can be used in Shiny applications and supplies observers so that Shiny can react to the tree.


ui <- shiny::fluidPage(
  shiny::column(width=4, jsTree::jsTreeOutput('tree')),
  shiny::column(width=8, shiny::uiOutput('chosen'))
)

server <- function(input, output, session) {

  network <- shiny::reactiveValues()

  shiny::observeEvent(input$tree_update,{
    current_selection <- input$tree_update$.current_tree
    if(!is.null(current_selection))
      network$tree <- jsonlite::fromJSON(current_selection)
  })

  shiny::observeEvent(network$tree,{
    output$results <- shiny::renderPrint({
      str.out <- ''
      if(length(network$tree)>0) str.out=network$tree
      return(str.out)
    })
  })

  output$chosen <- shiny::renderUI({
    list(shiny::h3('Selected Items'),
         shiny::verbatimTextOutput(outputId = "results"))
  })

  output$tree <- jsTree::renderJsTree({
    nested_string <- myx
    jsTree(nested_string)
  })
}

shinyApp(ui, server)


Loading R Packages: library() or require()?

Mon, 10/23/2017 - 23:18

(This article was first published on R – Gerald Belton, and kindly contributed to R-bloggers)

When I was an R newbie, I was taught to load packages by using the command library(package). In my Linear Models class, the instructor likes to use require(package). This made me wonder, are the commands interchangeable? What’s the difference, and which command should I use?

Interchangeable commands . . .

The way most users will use these commands, most of the time, they are actually interchangeable. That is, if you are loading a library that has already been installed, and you are using the command outside of a function definition, then it makes no difference if you use “require” or “library.” They do the same thing.

… Well, almost interchangeable

There are, though, a couple of important differences. The first one, and the most obvious, is what happens if you try to load a package that has not previously been installed. If you use library(foo) and foo has not been installed, your program will stop with the message “Error in library(foo) : there is no package called ‘foo’.” If you use require(foo), you will get a warning, but not an error. Your program will continue to run, only to crash later when you try to use a function from the library “foo.”
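A quick illustration of the difference, assuming a package named "foo" that is not installed (the messages are paraphrased):

library(foo)        # Error in library(foo) : there is no package called 'foo' -- the script stops here
ok <- require(foo)  # Warning message: there is no package called 'foo' -- the script keeps running
ok                  # FALSE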

This can make troubleshooting more difficult, since the error is going to happen in, say, line 847 of your code, while the actual mistake was in line 4 when you tried to load a package that wasn’t installed.

I have come to prefer using library(package) for this reason.

There is another difference, and it makes require(package) the preferred choice in certain situations. Unlike library, require returns a logical value. That is, it returns (invisibly) TRUE if the package is available, and FALSE if the package is not.  This can be very valuable for sharing code. If you are loading a package on your own machine, you may know for sure that it is already installed. But if you are loading it in a program that someone else is going to run, it’s a good idea to check! You can do this with, for example:

if (!require(package)) install.packages('package')
library(package)

This will install “package” if it isn’t already installed, and then load it.

One final note: if you look at the source code, you can see that require calls library. This suggests that it is more efficient to use library. As a practical matter, it seems unlikely that this would make a difference on any reasonably modern computer.

 



Statistical Machine Learning with Microsoft ML

Mon, 10/23/2017 - 22:20

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

MicrosoftML is an R package for machine learning that works in tandem with the RevoScaleR package. (In order to use the MicrosoftML and RevoScaleR libraries, you need an installation of Microsoft Machine Learning Server or Microsoft R Client.) A great way to see what MicrosoftML can do is to take a look at the online book Machine Learning with the MicrosoftML Package by Ali Zaidi.
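As a small, hedged sketch of what fitting a MicrosoftML model looks like (assuming the rxLogisticRegression() and rxPredict() interface that ships with Microsoft R Client; check the package documentation for the exact arguments):

library(MicrosoftML)

# binary classification on the built-in iris data: setosa vs. the rest
iris2 <- transform(iris, is_setosa = factor(Species == "setosa"))
model <- rxLogisticRegression(
  is_setosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iris2
)
scores <- rxPredict(model, data = iris2)
head(scores)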

The book includes worked examples on several topics:

  • Exploratory data analysis and feature engineering
  • Regression models
  • Classification models for computer vision
  • Convolutional neural networks for computer vision
  • Natural language processing
  • Transfer learning with pre-trained DNNs

The book is part of Ali's in-person workshop "Statistical Machine Learning with MicrosoftML", and you can find further materials including data and scripts at this Github repository. If you'd like to experience the workshop in person, Ali will be presenting it at the EARL Boston conference on November 1.

 

Github (Azure): Machine Learning with the MicrosoftML Package



The Return of Free Data and Possible Volatility Trading Subscription

Mon, 10/23/2017 - 17:37

(This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

This post will be about pulling free data from AlphaVantage, and gauging interest for a volatility trading subscription service.

So first off, ever since the yahoos at Yahoo decided to turn off their free data, the world of free daily data has been in somewhat of a dark age. Well, thanks to Josh Ulrich, Paul Teetor, and other R/Finance individuals (see http://blog.fosstrading.com/2017/10/getsymbols-and-alpha-vantage.html), the latest edition of quantmod (which can be installed from CRAN) now contains a way to get free financial data from AlphaVantage going back to the year 2000, which is usually enough for most backtests, as that date predates the inception of most ETFs.

Here’s how to do it.

First off, you need to go to AlphaVantage, register, and get an API key (https://www.alphavantage.co/support/#api-key).

Once you do that, downloading data is simple, if slightly slow. Here’s how to do it.

require(quantmod)
getSymbols('SPY',
           src = 'av',
           adjusted = TRUE,
           output.size = 'full',
           api.key = YOUR_KEY_HERE)

And the results:

> head(SPY)
           SPY.Open SPY.High SPY.Low SPY.Close SPY.Volume SPY.Adjusted
2000-01-03   148.25   148.25 143.875  145.4375    8164300     104.3261
2000-01-04   143.50   144.10 139.600  139.8000    8089800     100.2822
2000-01-05   139.90   141.20 137.300  140.8000    9976700     100.9995
2000-01-06   139.60   141.50 137.800  137.8000    6227200      98.8476
2000-01-07   140.30   145.80 140.100  145.8000    8066500     104.5862
2000-01-10   146.30   146.90 145.000  146.3000    5741700     104.9448

This means that if any of my old posts on asset allocation had become somewhat defunct thanks to bad Yahoo data, they will now work again with a slight modification to the data-input code.

Beyond demonstrating this routine, one other thing I’d like to do is to gauge interest for a volatility signal subscription service, for a system I have personally started trading a couple of months ago.

Simply, I have seen other websites with subscription services with worse risk/reward than the strategy I currently trade, which switches between XIV, ZIV, and VXX. Currently, the equity curve, in log 10, looks like this:

That is, $1000 in 2008 would have become approximately $1,000,000 today, if one was able to trade this strategy since then.

Since 2011 (around the time of inception for XIV), the performance has been:

                          Performance
Annualized Return           0.8265000
Annualized Std Dev          0.3544000
Annualized Sharpe (Rf=0%)   2.3319000
Worst Drawdown              0.2480087
Calmar Ratio                3.3325450

Considering that some websites out there charge upwards of $50 a month for either a single tactical asset rotation strategy (and a lot more for a combination) with inferior risk/return profiles, or a volatility strategy that may have had a massive and historically record-breaking drawdown, I was hoping to gauge a price point for what readers would consider paying for signals from a better strategy than those.

Thanks for reading.

NOTE: I am currently interested in networking and am seeking full-time opportunities related to my skill set. My LinkedIn profile can be found here.



14 data science Jobs for R users from around the world (2017-10-23)

Mon, 10/23/2017 - 17:17
To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs
More New R Jobs
  1. Freelance
    Statistician / Econometrician / R Programmer for Academic Statistical Research Academic Research – Posted by empiricus
    Anywhere
    21 Oct 2017
  2. Full-Time
    Postdoc – Norwegian institute of public health – Dept. Genetics and bioinformatics Norwegian institute of public health – Posted by universitypositions
    Oslo Oslo, Norway
    20 Oct 2017
  3. Full-Time
    Evaluation Data Analyst @ Hadley, Massachusetts, U.S. VentureWell – Posted by djscottnc
    Hadley Massachusetts, United States
    18 Oct 2017
  4. Full-Time
    Portfolio Analyst @ Dallas, Texas, United States HD Vest – Posted by jhickey94
    Dallas Texas, United States
    13 Oct 2017
  5. Full-Time
    Data Scientist Cincinnati Reds – Posted by MichaelSchatz
    Anywhere
    13 Oct 2017
  6. Freelance
    The R Foundation is looking for a volunteer to organize R’s contributed documentation R: The R Foundation – Posted by Tal Galili
    Anywhere
    13 Oct 2017
  7. Full-Time
    Lead Statistician Genomics and Biomarkers @ New Jersey, U.S. Bayer – Posted by KarenJ
    Hanover New Jersey, United States
    12 Oct 2017
  8. Full-Time
    Rays Research & Development Analyst Tampa Bay Rays – Posted by kferris10
    Saint Petersburg Florida, United States
    11 Oct 2017
  9. Full-Time
    PROGRAMMER/SOFTWARE DEVELOPMENT ENGINEER/COMPUTATIONAL AND MACHINE LEARNING SPECIALIST fchiava
    Cambridge Massachusetts, United States
    10 Oct 2017
  10. Freelance
    R Shiny Developer – Work From Home Data Science Talent – Posted by datasciencetalent
    Frankfurt am Main Hessen, Germany
    6 Oct 2017
  11. Full-Time
    Quantitative Econometrician @ San Francisco, California, U.S. Concentric Marketing and Search – Posted by Philip Palmer
    San Francisco California, United States
    1 Oct 2017
  12. Full-Time
    Postdoctoral Research Fellow in Healthcare Systems Engineering @ Maryland, U.S. Johns Hopkins University – Posted by diegomartinezcea
    Baltimore Maryland, United States
    27 Sep 2017
  13. Freelance
    R Programmer & Statistician for Academic Research Academic Research – Posted by empiricus
    Anywhere
    25 Sep 2017
  14. Full-Time
    Data Scientist Edelweiss Business Services – Posted by Aakash Gupta
    Mumbai Maharashtra, India
    20 Sep 2017

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(You may also look at previous R jobs posts.)


„One function to rule them all“ – visualization of regression models in #rstats w/ #sjPlot

Mon, 10/23/2017 - 16:27

(This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

I’m pleased to announce the latest update of my sjPlot-package on CRAN. Besides some bug fixes and minor new features, the major update is a new function, plot_model(), which is both an enhancement of and a replacement for sjp.lm(), sjp.glm(), sjp.lmer(), sjp.glmer() and sjp.int(). The latter functions will become deprecated in the next updates and will be removed at some point in the future.

plot_model() is a „generic“ plot function that accepts many model objects, like lm, glm, lme, lmerMod etc. It offers various plot types, like estimate/coefficient plots (aka forest or dot-whisker plots), marginal effect plots, plots of interaction terms, and several diagnostic plots.

In this blog post, I want to describe how to plot estimates as forest plots.

Plotting Estimates (Fixed Effects) of Regression Models

The plot-type is defined via the type-argument. The default is type = "fe", which means that fixed effects (model coefficients) are plotted. First, we fit a model that will be used in the following examples. The examples work in the same way for any other model as well.

library(sjPlot)
library(sjlabelled)
library(sjmisc)
library(ggplot2)

data(efc)
theme_set(theme_sjplot())

# create binary response
y <- ifelse(efc$neg_c_7 < median(na.omit(efc$neg_c_7)), 0, 1)

# create data frame for fitting model
df <- data.frame(
  y = to_factor(y),
  sex = to_factor(efc$c161sex),
  dep = to_factor(efc$e42dep),
  barthel = efc$barthtot,
  education = to_factor(efc$c172code)
)

# set variable label for response
set_label(df$y) <- "High Negative Impact"

# fit model
m1 <- glm(y ~ ., data = df, family = binomial(link = "logit"))

The simplest function call is just passing the model object as argument. By default, estimates are sorted in descending order, with the highest effect at the top.

plot_model(m1)

The “neutral” line, i.e. the vertical intercept that indicates no effect (x-axis position 1 for most glm’s and position 0 for most linear models), is drawn slightly thicker than the other grid lines. You can change the line color with the vline.color-argument.

plot_model(m1, vline.color = "red")

Sorting Estimates

By default, the estimates are sorted in the same order as they were introduced into the model. Use sort.est = TRUE to sort estimates in descending order, from highest to lowest value.

plot_model(m1, sort.est = TRUE)

Another way to sort estimates is to use the order.terms-argument. This is a numeric vector, indicating the order of estimates in the plot. In the summary, we see that “sex2” is the first term, followed by the three dependency-categories (position 2-4), the Barthel-Index (5) and two levels for intermediate and high level of education (6 and 7).

summary(m1)
#> 
#> Call:
#> glm(formula = y ~ ., family = binomial(link = "logit"), data = df)
#> 
#> Deviance Residuals:
#>     Min       1Q   Median       3Q      Max
#> -2.2654  -0.9275   0.4610   0.9464   2.0215
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)
#> (Intercept)  0.700232   0.576715   1.214 0.224682
#> sex2         0.649136   0.186186   3.486 0.000489 ***
#> dep2         0.485259   0.361498   1.342 0.179480
#> dep3         1.125130   0.361977   3.108 0.001882 **
#> dep4         0.910194   0.441774   2.060 0.039368 *
#> barthel     -0.029802   0.004732  -6.298 3.02e-10 ***
#> education2   0.226525   0.200298   1.131 0.258081
#> education3   0.283600   0.249327   1.137 0.255346
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1122.16  on 814  degrees of freedom
#> Residual deviance:  939.77  on 807  degrees of freedom
#>   (93 observations deleted due to missingness)
#> AIC: 955.77
#> 
#> Number of Fisher Scoring iterations: 4

Now we want the educational levels (6 and 7) first, then gender (1), followed by dependency (2-4) and finally the Barthel-Index (5). Use this order as a numeric vector for the order.terms-argument.

plot_model(m1, order.terms = c(6, 7, 1, 2, 3, 4, 5))

Estimates on the untransformed scale

By default, plot_model() automatically exponentiates coefficients, if appropriate (e.g. for models with log or logit link). You can explicitly prevent transformation by setting the transform-argument to NULL, or apply any transformation (whether or not it makes sense) by passing the function name as a character string.

plot_model(m1, transform = NULL)

plot_model(m1, transform = "plogis")

Showing value labels

By default, just the dots and error bars are plotted. Use show.values = TRUE to show the value labels with the estimate values, and use show.p = FALSE to suppress the asterisks that indicate the significance level of the p-values. Use value.offset to adjust the positioning of the value labels relative to the dots and lines.

plot_model(m1, show.values = TRUE, value.offset = .3)
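
The show.p argument is not used in the example above; a variant of the same call that also suppresses the significance asterisks might look like this:

# value labels without significance asterisks
plot_model(m1, show.values = TRUE, show.p = FALSE, value.offset = .3)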

Labelling the plot

As seen in the above examples, by default, the plotting-functions of sjPlot retrieve value and variable labels if the data is labelled, using the sjlabelled-package. If the data is not labelled, the variable names are used. In such cases, use the arguments title, axis.labels and axis.title to annotate the plot title and axes. If you want variable names instead of labels, even for labelled data, use "" as argument-value, e.g. axis.labels = "", or set auto.label to FALSE.

Furthermore, plot_model() applies case-conversion to all labels by default, using the snakecase-package. This converts labels into human-readable versions. Use case = NULL to turn case-conversion off, or refer to the package-vignette of the snakecase-package for further options.

data(iris)
m2 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, data = iris)

# variable names as labels, but made "human readable"
# separating dots are removed
plot_model(m2)
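
To switch the case-conversion off, as mentioned above, a call along these lines should work (same model m2):

# keep variable names as they are, without case-conversion
plot_model(m2, case = NULL)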

# to use variable names even for labelled data
plot_model(m1, axis.labels = "", title = "my own title")

Pick or remove specific terms from plot

Use the terms argument to select specific terms that should be plotted, or rm.terms to remove specific terms from the plot.

# keep only coefficients sex2, dep2 and dep3
plot_model(m1, terms = c("sex2", "dep2", "dep3"))

# remove coefficients sex2, dep2 and dep3
plot_model(m1, rm.terms = c("sex2", "dep2", "dep3"))

Bayesian models (fitted with Stan)

plot_model() also supports stan-models fitted with the rstanarm or brms packages. However, there are a few differences compared to the previous plot examples.

First, of course, there are no confidence intervals, but uncertainty intervals – high density intervals, to be precise.

Second, there’s not just one interval range, but an inner and an outer probability. By default, the inner probability is fixed at .5 (50%), while the outer probability is specified via ci.lvl (which defaults to .89 (89%) for Bayesian models). However, you can also use the arguments prob.inner and prob.outer to define the interval boundaries.

Third, the point estimate is by default the median, but can also be another value, like mean. This can be specified with the bpe-argument.

library(rstanarm)
data(mtcars)

m <- stan_glm(mpg ~ wt + am + cyl + gear, data = mtcars, chains = 1)

# default model
plot_model(m)

# same model, with mean point estimate, dot-style for point estimate
# and different inner/outer probabilities of the HDI
plot_model(
  m,
  bpe = "mean",
  bpe.style = "dot",
  prob.inner = .4,
  prob.outer = .8
)
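
The outer interval can also be widened via the ci.lvl argument mentioned above, for example to a 95% outer probability:

# same model, with a 95% outer uncertainty interval
plot_model(m, ci.lvl = .95)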

Tweaking plot appearance

There are several options to customize the plot appearance:

  • The colors-argument either takes the name of a valid colorbrewer palette (see also the related vignette), "bw" or "gs" for black/white or greyscaled colors, or a string with a color name.
  • value.offset and value.size adjust the positioning and size of value labels, if shown.
  • dot.size and line.size change the size of dots and error bars.
  • vline.color changes the neutral “intercept” line.
  • width, alpha and scale are passed down to certain ggplot-geoms, like geom_errorbar() or geom_density_ridges().
plot_model(
  m1,
  colors = "Accent",
  show.values = TRUE,
  value.offset = .4,
  value.size = 4,
  dot.size = 3,
  line.size = 1.5,
  vline.color = "blue",
  width = 1.5
)

Final words

In a similar manner, marginal effects (predictions) of models can be plotted. One of the next blog posts will show some examples.

plot_model() makes it easy to „summarize“ models of (m)any classes as a plot. On the one hand, you no longer need different functions for different models (like lm, glm, lmer etc.); on the other hand, even more model types are now supported in the latest sjPlot-update. Forthcoming updates will continue this „design philosophy“: for instance, a generic tab_model() function will produce tabular (HTML) outputs of models and will replace the current functions sjt.lm(), sjt.glm(), sjt.lmer() and sjt.glmer(). tab_model() is planned to support more model types as well, and to offer more output options…

Tagged: Bayes, data visualization, R, regression analysis, rstats, sjPlot, Stan, Statistik


To leave a comment for the author, please follow the link and comment on their blog: R – Strenge Jacke!

9th MilanoR meeting on November 20th: call for presentations!

Mon, 10/23/2017 - 12:42

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

MilanoR Staff is happy to announce the 9th MilanoR Meeting!

The meeting will take place on November 20th, from 7pm to about 9:30 pm, in Mikamai (close to the Pasteur metro station) [save the date, more info soon]

This time we want to focus on a specific topic: data visualization with R. 

We are curious to see if there are interesting contributions about this topic within the community. So: have you built a gorgeous and smart visualization with R, or developed a package that handles some data viz task in a new way? Have you created a Shiny HTML widget or a dashboard that has something new to say? Do you feel you have something to contribute, or can you recommend someone?

Send your contribution to admin[at]milanor[dot]net: you may present it at the 9th MilanoR meeting!

If you want to contribute but cannot attend the meeting, you can send your contribution as a short video (3-4 minutes long) to admin[at]milanor[dot]net. Videos may be published on the blog and played at the meeting before or after the presentations.

MilanoR community grows with every R user contribution!

(If you want to get inspired, here you can find all the presentations from our past meetings)

 

What is a MilanoR Meeting?

A MilanoR meeting is an occasion to bring together the R users in the Milano area to share knowledge and experiences. The meeting is open to beginners as well as expert R users. Usually we run two MilanoR meetings each year, one in Autumn and one in Spring. We are now running the 9th MilanoR meeting edition.

A MilanoR meeting consists of 2-3 R talks and a free buffet offered by our sponsors, to give plenty of room for discussions and exchange of ideas: the event is free for everyone, but a seat reservation is needed. Registration will open soon, stay tuned!

The post 9th MilanoR meeting on November 20th: call for presentations! appeared first on MilanoR.


To leave a comment for the author, please follow the link and comment on their blog: MilanoR.

Economic time series data quiz as a shiny app for mobile phones

Mon, 10/23/2017 - 07:00

(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

Nowadays, a lot of interesting time series data is freely available that allows us to compare important economic, social and environmental trends across countries.

I feel that one can learn a lot by surfing through the data sections on the websites of institutions like the Gapminder Foundation, the World Bank, or the OECD.

At the same time, I am quite a big fan of learning facts with quiz questions. Since my internet search did not yield any apps or websites that present these interesting time series in the form of quizzes, I coded a bit in R and generated a Shiny app that creates such quizzes based on OECD data and some Eurostat data.

Here is a screenshot:



The quiz is hosted here:

http://econ.mathematik.uni-ulm.de:4501/dataquiz/

Why not do some supervised learning for your biological neural network, and take a look? I would recommend visiting the site with your mobile phone, though, as it looks nicer.

From my own experience, there is a bit of guessing involved at first. Yet, after having solved some questions, my performance quickly improved, as did my understanding of plausible magnitudes, time trends and likely differences among countries for the various measures.

The app is based on a corresponding R package https://github.com/skranz/dataquiz. The package is more complex than required for the current app, as I was also experimenting with different forms of quizzes, e.g. questions that ask you to rank different countries or sectors. Yet, I ended up preferring the current form of the quiz. The app runs in a docker container specified here: https://github.com/skranz/dataquizDocker.

There exists an API and even an R package to download OECD data. Yet, I found it more convenient to retrieve the underlying OECD data via some web scraping using RSelenium. This also got me some detailed descriptions of the different time series. The data sets used in my app and the web-scraping scripts are given in the following Github repository: https://github.com/skranz/dataquizResources.
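
For comparison, the package route mentioned (but not used) above would look roughly like this; the function names come from the CRAN OECD package, and the dataset id below is purely illustrative:

# Hedged sketch of the OECD-package approach; "DUR_D" is only an example id.
library(OECD)

datasets <- get_datasets()   # data frame of available dataset ids and descriptions
head(datasets)
df <- get_dataset("DUR_D")   # download one dataset by id
str(df)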


To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts.

The SeaClass R Package

Mon, 10/23/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

The SeaClass R Package

The Operations Technology and Advanced Analytics Group (OTAAG) at Seagate Technology has decided to share an internal project that helps accelerate development of classification models. The interactive SeaClass tool is contained in an R-based package built using shiny and other CRAN packages commonly used for binary classification. The package is free to use and develop further, but any analysis mistakes are the sole responsibility of the user. Check out the demo video below:

Origin Story

Manufacturer data sets used for pass/fail analysis are typically highly imbalanced with many more passing cases than failing cases. In some situations, the failure examples may be scarce or nonexistent. In these extreme cases complex modeling techniques are likely inadvisable. Data scientists commonly consider data scarcity, class imbalance, and data dimensionality when discriminating between competing candidate approaches, such as anomaly detection, simple models, and complex models. Standard approaches are easily identified within each of these analysis categories, and can be exploited as reasonable initial classification analysis baselines.

The SeaClass application was created to generate template R code for the commonly encountered classification problems described above. The application allows data analysts to explore multiple models quickly with essentially no programming required. SeaClass provides an option to download the corresponding R code for further model training/testing. This workflow enables our data analysts to jump-start their modeling, saving time and initial hassles.

The Advanced Analytics group decided to open-source the package for several reasons. Firstly, we encourage other contributors to suggest improvements to the SeaClass application. Additionally, we are hopeful our application will inspire other code-generating projects within the R community. Lastly, our group benefits greatly from open-source tools, and it’s nice to give back every once in a while.

Package Overview

The SeaClass R package provides tools for analyzing classification problems. In particular, specialized tools are available for addressing the problem of imbalanced data sets. The SeaClass application provides an easy-to-use interface that requires only minimal R programming knowledge to get started, and can be launched using the RStudio Addins menu. The application allows the user to explore numerous methods by simply clicking on the available options and interacting with the generated results. The user can choose to download the codes for any procedures they wish to explore further. SeaClass was designed to jump-start the analysis process for both novice and advanced R users. See screenshots below for one demonstration.

Installation Instructions

The SeaClass application depends on multiple R packages. To install SeaClass and its dependencies, run:

install.packages('devtools')
devtools::install_github('ChrisDienes/SeaClass')

Usage Instructions

Step 1. Begin by loading and preparing your data in R. Some general advice:

  • Your data set must be saved as an R data frame object.
  • The data set must contain a binary response variable (0/1, PASS/FAIL, A/B, etc.)
  • All other variables must be predictor variables.
  • Predictor variables can be numeric, categorical, or factors.
  • Including too many predictors may slow down the application and weaken performance.
  • Categorical predictors are often ignored when the number of levels exceeds 10, since they tend to have improper influences.
  • Missing values are not allowed and will throw a flag. Please remove or impute NAs prior to starting the app (a short base-R sketch follows this list).
  • Keep the number of observations (rows) to a medium or small size.
  • Data sets with many rows (>10,000) or many columns (>30) may slow down the app’s interactive responses.
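
As a minimal illustration of the missing-value point in the list above, either of the following base-R approaches should do; df stands in for your own data frame and is not part of SeaClass:

# Hedged sketch -- 'df' is a placeholder data frame, not a SeaClass object.
# Option 1: drop incomplete rows before launching the app.
df_complete <- na.omit(df)

# Option 2: impute numeric NAs with the column median.
num_cols <- sapply(df, is.numeric)
df[num_cols] <- lapply(df[num_cols], function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
})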

Step 2. After data preparation, start the application by either loading SeaClass from the RStudio Addins drop-down menu, or by loading the SeaClass function from the command line. For example:

library(SeaClass)

### Make some fake data:
X <- matrix(rnorm(10000,0,1),ncol=10,nrow=1000)
X[1:100,1:2] <- X[1:100,1:2] + 3
Y <- c(rep(1,100), rep(0,900))
Fake_Data <- data.frame(Y = Y , X)

### Load the SeaClass rare failure data:
data("rareFailData")

### Start the interactive GUI:
SeaClass()

If the application fails to load, you may need to specify your favorite browser path first. For example:

options(browser = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")

Step 3. The user has various options for configuring their analysis within the GUI. Once the analysis runs, the user can view the results, interact with the results (module-dependent), save the underlying R script, or start over. Additional help is provided within the application. See above screenshots for one depiction of these steps.

Step 4. Besides the SeaClass function, several other functions are contained within the library. For example:

### List available functions:
ls("package:SeaClass")
### Note this is a sample data set:
# data(rareFailData)
### Note code_output is a support function for SeaClass, not for general use.

### View help:
?accuracy_threshold

### Run example from help file:
### General Use: ###
set.seed(123)
x <- c(rnorm(100,0,1),rnorm(100,2,1))
group <- c(rep(0,100),rep(2,100))
accuracy_threshold(x=x, group=group, pos_class=2)
accuracy_threshold(x=x, group=group, pos_class=0)

### Bagged Example ###
set.seed(123)
replicate_function = function(index){accuracy_threshold(x=x[index], group=group[index], pos_class=2)[[2]]}
sample_cuts <- replicate(100, {
  sample_index = sample.int(n=length(x),replace=TRUE)
  replicate_function(index=sample_index)
})
bagged_scores <- sapply(x, function(x) mean(x > sample_cuts))
unbagged_cut <- accuracy_threshold(x=x, group=group, pos_class=2)[[2]]
unbagged_scores <- ifelse(x > unbagged_cut, 1, 0)

# Compare AUC:
PRROC::roc.curve(scores.class0 = bagged_scores, weights.class0 = ifelse(group==2,1,0))[[2]]
PRROC::roc.curve(scores.class0 = unbagged_scores, weights.class0 = ifelse(group==2,1,0))[[2]]

bagged_prediction <- ifelse(bagged_scores > 0.50, 2, 0)
unbagged_prediction <- ifelse(x > unbagged_cut, 2, 0)

# Compare Confusion Matrix:
table(bagged_prediction, group)
table(unbagged_prediction, group)



To leave a comment for the author, please follow the link and comment on their blog: R Views.

Who knew likelihood functions could be so pretty?

Mon, 10/23/2017 - 02:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

I just released a new iteration of simstudy (version 0.1.6), which fixes a bug or two and adds several spline related routines (available on CRAN). The previous post focused on using spline curves to generate data, so I won’t repeat myself here. And, apropos of nothing really – I thought I’d take the opportunity to do a simple simulation to briefly explore the likelihood function. It turns out if we generate lots of them, it can be pretty, and maybe provide a little insight.

If a probability density (or mass) function is more or less forward-looking – answering the question of how probable some future outcome is under a known probability model – the likelihood function is essentially backward-looking. The likelihood takes the data as given, or already observed, and allows us to assess how likely that outcome was under different assumptions about the underlying probability model. While the form of the model is not necessarily in question (normal, Poisson, binomial, etc.) – though it certainly should be – the specific values of the parameters that define the location and shape of that distribution are not known. The likelihood function provides a guide as to how this backward-looking probability varies across different values of the distribution’s parameters for a given data set.

We are generally most interested in finding out where the peak of that curve is, because the parameters associated with that point (the maximum likelihood estimates) are often used to describe the “true” underlying data generating process. However, we are also quite interested in the shape of the likelihood curve itself, because that provides information about how certain we can be about our conclusions about the “true” model. In short, a function that has a more clearly defined peak provides more information than one that is pretty flat. When you are climbing Mount Everest, you are pretty sure you know when you reach the peak. But when you are walking across the rolling hills of Tuscany, you can never be certain if you are at the top.

The setup

A likelihood curve is itself a function of the observed data. That is, if we were able to draw different samples of data from a single population, the curves associated with each of those samples would vary. In effect, the function is a random variable. For this simulation, I repeatedly make draws from an underlying known model – in this case a very simple linear model with only one unknown slope parameter – and plot the likelihood function for each data set across a range of possible slopes, along with the maximum point for each curve.

In this example, I am interested in understanding the relationship between a variable \(X\) and some outcome \(Y\). In truth, there is a simple relationship between the two:

\[ Y_i = 1.5 \times X_i + \epsilon_i \ ,\] where \(\epsilon_i \sim Normal(0, \sigma^2)\). In this case, we have \(n\) individual observations, so that \(i \in (1,…n)\). Under this model, the likelihood where we do know \(\sigma^2\) but don’t know the coefficient \(\beta\) can be written as:

\[L(\beta;y_1, y_2,\ldots, y_n, x_1, x_2,\ldots, x_n,\sigma^2) = (2\pi\sigma^2)^{-n/2}\text{exp}\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta x_i)^2\right)\]

Since it is much easier to work with sums than products, we generally work with the log-likelihood function:

\[l(\beta;y_1, y_2,\ldots, y_n, x_1, x_2,\ldots, x_n, \sigma^2) = -\frac{n}{2}\text{ln}(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta x_i)^2\] In the log-likelihood function, \(n\), the \(x_i\)’s, the \(y_i\)’s, and \(\sigma^2\) are all fixed and known – we are trying to estimate \(\beta\), the slope. That is, the likelihood (or log-likelihood) is a function of \(\beta\) only. Typically, we will have more than one unknown parameter – say multiple regression coefficients, or an unknown variance parameter (\(\sigma^2\)) – but then visualizing the likelihood function gets very hard or impossible; I am not great at imagining (or plotting) in \(p\) dimensions, which is what we need to do if we have \(p\) parameters.

The simulation

To start, here is a one-line function that returns the log-likelihood of a data set (containing \(x\)’s and \(y\)’s) based on a specific value of \(\beta\).

library(data.table)

ll <- function(b, dt, var) {
  dt[, sum(dnorm(x = y, mean = b*x, sd = sqrt(var), log = TRUE))]
}

test <- data.table(x = c(1, 1, 4), y = c(2.0, 1.8, 6.3))

ll(b = 1.8, test, var = 1)
## [1] -3.181816

ll(b = 0.5, test, var = 1)
## [1] -13.97182

Next, I generate a single draw of 200 observations of \(x\)’s and \(y\)’s:

library(simstudy)

b <- c(seq(0, 3, length.out = 500))
truevar = 1

defX <- defData(varname = "x", formula = 0, variance = 9, dist = "normal")
defA <- defDataAdd(varname = "y", formula = "1.5*x", variance = truevar, dist = "normal")

set.seed(21)
dt <- genData(200, defX)
dt <- addColumns(defA, dt)
dt
##       id         x          y
##   1:   1  2.379040  4.3166333
##   2:   2  1.566754  0.9801416
##   3:   3  5.238667  8.4869651
##   4:   4 -3.814008 -5.6348268
##   5:   5  6.592169  9.6706410
##  ---
## 196: 196  3.843341  4.5740967
## 197: 197 -1.334778 -1.5701510
## 198: 198  3.583162  5.0193182
## 199: 199  1.112866  1.5506167
## 200: 200  4.913644  8.2063354

The likelihood function is traced out with a series of calls to the function ll using sapply. Each iteration uses one value of the b vector. What we end up with is a likelihood value for each potential value of \(\beta\) given the data.

loglik <- sapply(b, ll, dt = dt, var = truevar)
bt <- data.table(b, loglike = loglik)
bt
##                b   loglike
##   1: 0.000000000 -2149.240
##   2: 0.006012024 -2134.051
##   3: 0.012024048 -2118.924
##   4: 0.018036072 -2103.860
##   5: 0.024048096 -2088.858
##  ---
## 496: 2.975951904 -2235.436
## 497: 2.981963928 -2251.036
## 498: 2.987975952 -2266.697
## 499: 2.993987976 -2282.421
## 500: 3.000000000 -2298.206

In a highly simplified approach to maximizing the likelihood, I simply select the \(\beta\) that has the largest likelihood based on my calls to ll (I am limiting my search to values between 0 and 3, just because I happen to know the true value of the parameter). Of course, this is not how things work in the real world, particularly when you have more than one parameter to estimate – the estimation process requires more elaborate algorithms. In the case of a normal regression model, it is actually the case that the ordinary least squares estimate of the regression parameters is the maximum likelihood estimate (you can see in the above equations that maximizing the likelihood amounts to minimizing the sum of the squared differences between the observed and expected values).

maxlik <- bt[, max(loglike)]
lmfit <- lm(y ~ x - 1, data = dt)     # OLS estimate

(maxest <- bt[loglike == maxlik, b])  # value of beta that maximizes the likelihood
## [1] 1.472946

The plot below on the left shows the data and the estimated slope using OLS. The plot on the right shows the likelihood function. The \(x\)-axis represents the values of \(\beta\), and the \(y\)-axis is the log-likelihood as a function of those \(\beta's\):

library(ggplot2)

slopetxt <- paste0("OLS estimate: ", round(coef(lmfit), 2))

p1 <- ggplot(data = dt, aes(x = x, y = y)) +
  geom_point(color = "grey50") +
  theme(panel.grid = element_blank()) +
  geom_smooth(method = "lm", se = FALSE, size = 1, color = "#1740a6") +
  annotate(geom = "text", label = slopetxt, x = -5, y = 7.5, family = "sans")

p2 <- ggplot(data = bt) +
  scale_y_continuous(name = "Log likelihood") +
  scale_x_continuous(limits = c(0, 3), breaks = seq(0, 3, 0.5), name = expression(beta)) +
  theme(panel.grid.minor = element_blank()) +
  geom_line(aes(x = b, y = loglike), color = "#a67d17", size = 1) +
  geom_point(x = maxest, y = maxlik, color = "black", size = 3)

library(gridExtra)
grid.arrange(p1, p2, nrow = 1)

Adding variation

Now, for the pretty part. Below, I show plots of multiple likelihood functions under three scenarios. The only thing that differs across those scenarios is the variance of the error term, \(\sigma^2\). (I have not included the code here, since it essentially loops through the process described above. If you want the code, just let me know and I will make sure to post it. I do want to highlight that I used the package randomcoloR to generate the colors in the plots.)
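
Since that looping code is not shown, here is a minimal sketch of the idea, reusing defX, b and the ll function from above; the number of replicates and the plain ggplot styling are my own choices (the author used randomcoloR for the colors), so treat this as an approximation rather than the original code.

# For each variance scenario: repeatedly simulate a data set, evaluate the
# log-likelihood over the grid b, and draw one curve per simulated data set.
likelihood_curves <- function(truevar, reps = 20, n = 200) {
  defA <- defDataAdd(varname = "y", formula = "1.5*x",
                     variance = truevar, dist = "normal")
  curves <- rbindlist(lapply(seq_len(reps), function(i) {
    dd <- addColumns(defA, genData(n, defX))
    data.table(rep = i, b = b, loglike = sapply(b, ll, dt = dd, var = truevar))
  }))
  ggplot(curves, aes(x = b, y = loglike, group = rep, color = factor(rep))) +
    geom_line(show.legend = FALSE) +
    labs(x = expression(beta), y = "Log likelihood",
         title = paste("Variance =", truevar))
}

likelihood_curves(truevar = 1)
likelihood_curves(truevar = 4)
likelihood_curves(truevar = 9)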

What we can see here is that as the variance increases, we move away from Mt. Everest towards the Tuscan hills. The variance of the underlying process clearly has an impact on the uncertainty of the maximum likelihood estimates. The likelihood functions flatten out and the MLEs have more variability with increased underlying variance of the outcomes \(y\). Of course, this is all consistent with maximum likelihood theory.


To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

linl 0.0.1: linl is not Letter

Sun, 10/22/2017 - 23:06

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Aaron Wolen and I are pleased to announce the availability of the initial 0.0.1 release of our new linl package on the CRAN network. It provides a simple yet powerful Markdown (and R Markdown) wrapper around the venerable LaTeX letter class. Aaron had done the legwork in the underlying pandoc-letter repository, upon which we build via proper rmarkdown integration.

The package also includes a LaTeX trick or two: optional header and signature files, nicer font, better size, saner default geometry and more. See the following screenshot which shows the package vignette—itself a simple letter—along with (most of) its source:

The initial (short) NEWS entry follows:

Changes in linl version 0.0.1 (2017-10-17)
  • Initial CRAN release

The date is a little off; it took a little longer than usual for the good folks at CRAN to process the initial submission. We expect future releases to be more timely.

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Markets Performance after Election: Day 239

Sun, 10/22/2017 - 00:53

(This article was first published on R – Quintuitive, and kindly contributed to R-bloggers)

When I wrote the original post, I wasn’t planning on writing a follow-up. Certainly not the week after. But what a difference a week can make in a dynamic system like the US stock market.

While re-running the computations testing the latest version of RStudio, I noticed something surprising – President Trump’s rally has advanced to 2nd place!

A mere week ago, that seemed unthinkable. Something abnormal must have happened. Two things happened. First, the current stock market advanced another 2%. Nothing to brag about, just another positive vote of confidence in the direction of the economy.

The more impactful reason behind this sudden switch became clear when I took a look at President H.W. Bush’s rally. Even from the above chart, it’s clear that the rally lost a significant amount of steam within a day, or two at most. Is this a problem with the data? A bit of forensics:

election.date = "1988-11-08"

require(quantmod)

dj = getSymbols("^DJI", from="1900-01-01", auto.assign=F)
id = findInterval(as.Date(election.date), index(dj))
tail(ROC(Cl(dj[id:(id+239)]), type="discrete", na.pad=F))

And the truth reveals itself:

Date         Return
1989-10-12   -0.49%
1989-10-13   -6.91%
1989-10-16    3.43%
1989-10-17   -0.70%
1989-10-18    0.19%
1989-10-19    1.50%

Lo and behold, there is no problem with the data, just a 7% drop on October 13th, 1989. A quick search reveals that this was indeed the Friday the 13th mini-crash! Friday the 13th … a mini-crash … What a coincidence!

I will keep it brief and wrap up this post here. There were a few improvements and changes I made to the R code used to perform this analysis – the Gist contains all of them.

It’s all optimism in the air, judging by the market behavior at least.

The post Markets Performance after Election: Day 239 appeared first on Quintuitive.


To leave a comment for the author, please follow the link and comment on their blog: R – Quintuitive.
