R-bloggers: R news and tutorials contributed by 750 R bloggers

Earthquakes & Maps – Report and shinyapp from R-Lab#2

Thu, 04/20/2017 - 18:18

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Our board of ideas, tasks to do, doing and done <3

Hey R-users and lovers!

In the last few weeks we have had many events going on for the R community in Milano!

After the launching event for our new R-labs and the 8th MilanoR meeting, we had our second R-Lab.

This was the first R-Lab to follow its intended structure: an organisation or company proposes a specific topic to study, presents it, and we all work together on the problem, solving it with R and having fun.

In #RLab2 we talked about earthquakes and shiny with several participants with different backgrounds and different interests: from students to researchers and professionals, from statistics to geology and physics, from R beginners to advanced users… we all got together to learn and share our abilities!

The meeting opened with the newly born EarthCloud association, which provided us with interesting data on the Bove-Vettore (Umbria) fault and its movements over the years 1980-2016. We continued with Andrea and Mariachiara, who gave us some input on the use of shiny and ggmap: here are both presentations.

We then started to work in three heterogeneous groups: one group was in charge of producing plots to be integrated into the shiny app, another of producing maps for the app, and the last of putting together the actual shiny app, integrating everyone’s functions.

The ggplot group produced univariate and bivariate plots giving an overview of the magnitude, depth and location of the earthquakes recorded in the years 1980-2016.

The ggmap group produced a number of maps of the Bove-Vettore fault system highlighting the areas with more intense magnitude.

While the data visualization groups were busy finding the best representation for the data, the Shiny group was setting up a friendly shiny app where all the useful info could be presented and easily navigated.

By the end of the evening, after a couple of hours of work, we were tired but happy! All groups came up with working functions and great ideas while learning about earthquakes from the geologists!

Check here for the functioning app:

https://r-lab-milano.shinyapps.io/earthquakemaps/

Download the source code and the data from our dedicated GitHub repo. It is still in development! (@rlabbers: there are still some graphs to integrate, and a few small changes to make)

Thank you very much to all the participants, to EarthCloud for their proposal and contribution, and see you next month for the third Rlab!

(hint: the third R-Lab is Saturday 13th of May, and we will work together with the Comune di Milano for an entire day! We’ll have access to the municipality data of the city budget planning, for designing and realizing an R Shiny app that allows citizens to visualize and explore the city budget data. Subscriptions will open soon here, stay tuned!)

 

The post Earthquakes & Maps – Report and shinyapp from R-Lab#2 appeared first on MilanoR.

To leave a comment for the author, please follow the link and comment on their blog: MilanoR.

SQL Server 2017 to add Python support

Thu, 04/20/2017 - 18:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

One of the major announcements from yesterday's Data Amp event was that SQL Server 2017 will add Python as a supported language. Just as with the continued R support, SQL Server 2017 will allow you to process data in the database using any Python function or package without needing to export the data from the database, and use SQL Server itself as an operationalization platform for production applications using Python code. In addition, the high-performance and distributed statistical and machine learning functions from the RevoScaleR and MicrosoftML packages in Microsoft R will be available as Python functions for use within SQL Server.

The video below provides more details on the capabilities and the underlying architecture, and includes a demo of deploying and running Python code within SQL Server 2017.

SQL Server 2017 will also be available for the first time on Linux (including within Docker containers). The new version is currently in public preview, and R and Python integration will be available for both the Linux and Windows editions. In case you missed it yesterday, check out our post on the many new features supporting R in Microsoft R Server 9.1. And for more on the many updates coming in SQL Server 2017, take a look at the official blog post linked below.

SQL Server Blog: Delivering AI with data: the next generation of Microsoft’s data platform

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

User Defined Functions in R Exercises (Part 1)

Thu, 04/20/2017 - 17:50

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In these exercises, we will discuss user-defined functions in R.

Answers to the exercises are available here.
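Before attempting the exercises, here is a minimal sketch (not one of the exercise answers) of the general shape of a user-defined function in R, with a default argument and an explicit return:

square <- function(x, verbose = FALSE) {
  result <- x^2
  if (verbose) cat("The square of", x, "is", result, "\n")
  return(result)
}

square(4)                 # returns 16
square(4, verbose = TRUE) # prints a message and returns 16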

Exercise 1

Create a function that prints the square of a number.

Exercise 2

Create a function that raises one number to the power of another, where one argument has a default value.

Exercise 3

Create a function that prints the class of its argument.

Exercise 4

Create a function that accepts two matrix arguments and performs matrix operations on them.

Exercise 5

Create a user-defined function that accepts a name from the user.

Exercise 6

Create a user-defined function that accepts values from the user using scan() and returns them.

Exercise 7

Create a user-defined function that creates a matrix and returns it.

Exercise 8

Create a function that takes two arguments, a vector of student marks and a vector of student names, and plots a graph based on them.

Exercise 9

Create a function that accepts an employee data frame (Name, Gender, Age, Designation & SSN) and prints the first and fifth employees as well as the names and designations of all employees.

Exercise 10

Create a function that creates an employee data frame (Name, Gender, Age, Designation & SSN) and returns the Name, Age & Designation of all employees.

Related exercise sets:
  1. Matrix exercises
  2. Functions exercises vol. 2
  3. Data frame exercises Vol. 2
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Announcing Datazar v2.0

Thu, 04/20/2017 - 09:08

(This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to R-bloggers)

Seven months, two weeks and three days ago, we “officially” released Datazar version 1.0 to the general public. I put officially in quotes because Datazar has always been available to the general public. We develop with our community to ensure that what we’re building is essential to the workflow we always strive to perfect. Innovation and transparency are the two pillars of who we are, therefore we pride ourselves in our ability to innovate openly.

So before I get into any of the details, I would like to thank our community for helping us get here. It would have been impossible without your continuous feedback and criticism. The release of version 2 is a massive leap from version 1. Around 80% of the code base is entirely new. So let’s go ahead and unwrap the toys!

The Workspace

Something that’s entirely new is the Workspace. Before the workspace was introduced, the only thing you could do on Datazar was upload your data, map it with its related files/analysis (manually) and allow other people to download it. Pretty simple. But what’s the point of having a dataset if you can’t use it (visualize/analyze)? That extra step of downloading the dataset, analyzing it and re-uploading whatever the result is just doesn’t cut it (based on our data). So how about the ability to analyze the dataset right then and there using tools you’re already familiar with? Like, R and Python.

We developed a notebook and console interface for both R and Python; the two most popular analysis languages out there (open source).

Now you can analyze any dataset using R and Python with the notebook or console interfaces right in your browser. All the computation is done on Datazar’s servers so you can literally do it using a Chromebook.

We didn’t want to re-invent the wheel, so we decided to stick to interfaces everyone is already comfortable with, namely the console and the notebook. We just made slight modifications; for example, the images in the console interface appear in-line with the text results instead of in another window (as they would in your terminal). Using these interfaces also allows for maximum reproducibility.

You can install any packages/modules you want, create any models, charts, visualizations etc… Just as you would in your local machine with your favorite editor.

In the case of R, you can also create RMarkdown files and not just consoles and notebooks.

One-Click Charts

Sometimes you don’t need to launch R or Python to explore a dataset. Sometimes a simple chart is enough to get you started. So for that exact reason we also introduced One-Click Charts. These are very simple, exploratory charts like scatter-plots, line-plots etc… If you update the dataset, the charts will also update accordingly so they can be used to keep track of things as a function of time.

Redesigned Project Interface

In version 1.0, the Project was an afterthought. The entire platform was centered around the File or dataset. We completely re-designed entire flows to make the platform centered around the Project. Why? Datasets by themselves don’t have the same pull as datasets with supporting documents such as analysis notebooks, notes and publication files.

This change completely changed how you work on the platform. The new interface that’s based on the Project pushes you to collaborate further with quick access to things like Project Discussions, Project Metrics, Project Activity etc…

Publishing

With the new Project interface, you can now Publish your projects once you’re done analyzing your data. Whether you’re using a MarkDown file or a LaTeX file, you can re-direct your readers to your publication document so it’s the first thing they see.

Replication

Projects can now be replicated. Replication in Datazar works just as it does in research: it lets you create an exact copy of a fellow researcher’s project. Once the copy is complete, you can re-run all the analysis and go through the datasets and methods without affecting or altering the original documents. We completely embedded the scientific process into the platform.

Pricing

Along with the launch of version 2.0, we’re also officially releasing our Plans and Pricing.


https://www.datazar.com/pricing/

The plans are differentiated based on how much data you want to compute, calls to the API and project privacy.

The first factor is computation. Anyone with any plan can upload as much data as they want and create analysis files as big as they want. The only difference is how much data you want to compute in the cloud using R, Python etc…. With the Pro and Team plans, you can compute any file size you want, just like in your own computer*.

The second factor in the pricing structure is access to the API. Anyone with a Datazar account can access the API using tokens. Getting on a Student, Pro or Team plan gives you higher API rate limits.

The third factor is project privacy. All free accounts can create as many public projects as they want while uploading any file type or size. The paid plans allow accounts to create private projects where only invited collaborators have access.

We also created a discounted version of the Pro plan for students with relatively high computational and API limits.

*considering hardware limitations

Miscellaneous

Faster Rendering: D3 visualizations and charts now load 3.5X faster than before due to the redesign of the rendering engine.

Bulk Upload: Files can now be uploaded in bulk instead of 1 by 1. Datazar will also now automatically detect what kind of file it is (raw data, prepared data, analysis, visualization…)

Community Contributions Control: Project owners and maintainers can now control whether the community can contribute datasets and analysis even if the projects are public.

Documentation: Several guides have been uploaded to docs.datazar.com/guides. In a few weeks, we’ll be opening up the Docs system to everyone so anyone can write guides.

External Sharing: Sharing to social media is now available to all files.

Thanks again to everyone who participated in making this happen, we’re one step closer to re-designing how research is done. If you’re thinking about moving your research to Datazar or just getting into research, don’t hesitate to contact support@datazar.com. The Explore and Product pages are great resources if you’re on the fence, but as always, the best way to find out is to get right to it.

Join the discussion on ProductHunt.


Announcing Datazar v2.0 was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

To leave a comment for the author, please follow the link and comment on their blog: R Language in Datazar Blog on Medium.

Random GeoJSON and WKT with randgeo

Thu, 04/20/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

randgeo generates random points and shapes in GeoJSON and WKT formats for
use in examples, teaching, or statistical applications.

Points and shapes are generated in the long/lat coordinate system and with
appropriate spherical geometry; random points are distributed evenly across
the globe, and random shapes are sized according to a maximum great-circle
distance from the center of the shape.

randgeo was adapted from https://github.com/tmcw/geojson-random to have
a pure R implementation without any dependencies as well as appropriate
geometry. Data generated by randgeo may be processed or displayed with
packages such as sf, wicket, geojson, wellknown, geojsonio, or lawn.

Package API:

  • rg_position – random position (lon, lat)
  • geo_point – random GeoJSON point
  • geo_polygon – random GeoJSON polygon
  • wkt_point – random WKT point
  • wkt_polygon – random WKT polygon
setup

Install randgeo – and we'll need a few other packages for examples below.

install.packages("randgeo") install.packages(c('leaflet', 'lawn')) library(randgeo) GeoJSON

Functions that start with geo are for creating GeoJSON data in JSON format.
If you want to create an R list or data.frame, you can use jsonlite::fromJSON.

Random point

Evenly distributed across the sphere. The bbox option allows
you to limit points to within long/lat bounds.

geo_point()
#> $type
#> [1] "FeatureCollection"
#>
#> $features
#> $features[[1]]
#> $features[[1]]$type
#> [1] "Feature"
#>
#> $features[[1]]$geometry
#> $features[[1]]$geometry$type
#> [1] "Point"
#>
#> $features[[1]]$geometry$coordinates
#> [1] 105.95999 -46.58477
#>
#>
#> $features[[1]]$properties
#> NULL
#>
#>
#>
#> attr(,"class")
#> [1] "geo_list"

Random polygon

Centered on a random point, with default maximum size

geo_polygon()
#> $type
#> [1] "FeatureCollection"
#>
#> $features
#> $features[[1]]
#> $features[[1]]$type
#> [1] "Feature"
#>
#> $features[[1]]$geometry
#> $features[[1]]$geometry$type
#> [1] "Polygon"
#>
#> $features[[1]]$geometry$coordinates
#> $features[[1]]$geometry$coordinates[[1]]
#> $features[[1]]$geometry$coordinates[[1]][[1]]
#> [1] -138.49434 -25.11895
#>
#> $features[[1]]$geometry$coordinates[[1]][[2]]
#> [1] -145.95566 -28.17623
#>
#> $features[[1]]$geometry$coordinates[[1]][[3]]
#> [1] -145.87817 -28.74364
#>
#> $features[[1]]$geometry$coordinates[[1]][[4]]
#> [1] -146.61325 -28.59748
#>
#> $features[[1]]$geometry$coordinates[[1]][[5]]
#> [1] -139.18167 -31.07703
#>
#> $features[[1]]$geometry$coordinates[[1]][[6]]
#> [1] -140.88748 -31.24708
#>
#> $features[[1]]$geometry$coordinates[[1]][[7]]
#> [1] -143.50402 -33.93551
#>
#> $features[[1]]$geometry$coordinates[[1]][[8]]
#> [1] -146.48114 -30.43185
#>
#> $features[[1]]$geometry$coordinates[[1]][[9]]
#> [1] -144.68315 -35.45465
#>
#> $features[[1]]$geometry$coordinates[[1]][[10]]
#> [1] -157.58084 -24.52897
#>
#> $features[[1]]$geometry$coordinates[[1]][[11]]
#> [1] -138.49434 -25.11895
#>
#>
#>
#> $features[[1]]$properties
#> NULL
#>
#>
#>
#> attr(,"class")
#> [1] "geo_list"

Visualize your shapes with lawn.

lawn::view(jsonlite::toJSON(unclass(geo_polygon(count = 4)), auto_unbox = TRUE))

WKT

Functions prefixed with wkt create random Well-Known Text (WKT) data. These functions
wrap the GeoJSON versions, but then convert the data to WKT.

Random point

wkt_point()
#> [1] "POINT (179.8795330 -29.1106238)"

Random polygon

wkt_polygon()
#> [1] "POLYGON ((-60.0870329 -12.9315478, -61.5073816 -25.3204334, -62.6987366 -24.5766272, -64.1853669 -24.0497260, -67.7152546 -27.4752321, -68.4190340 -26.9510818, -67.6018452 -21.5489551, -64.3083560 -21.6772242, -63.1471630 -21.9415438, -64.1137279 -14.2398013, -60.0870329 -12.9315478))"

Use case

Example of geospatial data manipulation, using randgeo, leaflet and
lawn.

Steps:

  • Generate random overlapping polygons
  • Calculate a single polygon from overlapping polygons
  • Map polygon
  • Generate random locations (points)
  • Clip locations to the polygon
  • Overlay locations (more random points) on the polygon
library(randgeo)
library(lawn)
library(leaflet)

Generate random data

set.seed(5)
polys <- randgeo::geo_polygon(count = 2, num_vertices = 4, bbox = c(-120, 40, -100, 50))

Get intersection of polygons

polysinter <- lawn::lawn_intersect(polys$features[[1]], polys$features[[2]])

Map polygons

polysinter %>% lawn::view()

Generate random points – clip points to polygon

pts <- randgeo::geo_point(count = 500, bbox = c(-120, 40, -100, 50))
pts <- lawn::lawn_within(
  points = lawn_featurecollection(pts),
  polygons = lawn_featurecollection(polysinter)
)

Draw polygon + points on map

polysinter %>%
  view() %>%
  addGeoJSON(geojson = jsonlite::toJSON(unclass(pts)))

Feedback

Let us know what you think! randgeo doesn't have any revdeps on CRAN yet, but
is being used in one package on GitHub.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.

Bland-Altman Plot for Age Comparisons?

Thu, 04/20/2017 - 07:00

(This article was first published on fishR Blog, and kindly contributed to R-bloggers)

Last week I posted about a modified age bias plot. In this post I look more deeply at an alternative plot called the Bland-Altman plot. Below, I describe this plot, demonstrate how to construct it in R, give a mild critique of its use for comparing fish age estimates, and develop an alternative that is meant to correct what I see as some of the shortcomings of using the Bland-Altman plot for comparing age estimates. This is largely a “thinking out loud” exercise, so I am open to any suggestions that you may have.

Bland-Altman Plot

The Bland-Altman plot (Bland and Altman 1986) is commonly used in medical and chemistry research to assess agreement between two measurement or assay methods (Giavarina 2015). McBride (2015) used the Bland-Altman plot in his simulation study of the effects of accuracy and precision on the ability to diagnose agreement between sets of fish age estimates. McBride (2015) noted that Bland-Altman plots “readily depict both bias and imprecision” and that this was summarized for “the entire sample, rather than specific age classes.” Despite this, I am aware of only two entries in the fisheries literature that used the Bland-Altman plot to compare fish age estimates (one in the grey literature, one in a thesis). Below, I describe the Bland-Altman plot and then offer a modified version for comparing estimates of fish age.

The Bland-Altman plot is a scatterplot where the differences between two age estimates are on the y-axis and the means of the two age estimates are on the x-axis. The plot may be augmented with several horizontal lines at the following locations (Figure 1):

  • Zero,
  • Mean difference in ages (heavy red dashed line),
  • Upper and lower 95% confidence limits for the mean difference in ages (dark red dashed lines),
  • Upper and lower 95% “agreement limits” (usually 1.96 times the standard deviation for the difference in ages above and below the mean difference in ages; heavy dashed blue lines), and
  • Upper and lower 95% confidence limits for the upper and lower “agreement limits” (dashed dark blue lines).

Figure 1: Bland-Altman plot for comparing scale to otolith age estimates of Lake Champlain Lake Whitefish. This was constructed with BlandAltmanLeh.

As a general rule, a 95% confidence interval for the mean difference that does not contain zero suggests a difference (or a bias) between the two age estimates. For example, a bias is evident in Figure 1. In addition, one would expect 95% of the points to fall within the “agreement limits.” Points that fall outside this range may be considered further as possible outliers. Log differences have been used if the differences are not normally distributed, and the percentage difference (where the difference is divided by the mean age) has also been used (Giavarina 2015).

The Bland-Altman plot in Figure 1 was created with bland.altman.plot() from the BlandAltmanLeh package (Lehnert 2015b). Other R functions exist for creating Bland-Altman plots (or the equivalent “Tukey’s Mean Difference Plot”). However, this is a simple plot that can be easily constructed from “scratch” as shown next. I then provide a mild critique of the Bland-Altman plot for use in age comparisons and offer an alternative (that is not an age bias plot).

Constructing a Bland-Altman Plot

In this example, a Bland-Altman plot is created to compare consensus (between two readers) scale (scaleC) and otolith (otolithC) age estimates for Lake Champlain Lake Whitefish.

library(FSA)   # provides WhitefishLC, col2rgbt()
data(WhitefishLC)
str(WhitefishLC)
## 'data.frame': 151 obs. of 11 variables:
##  $ fishID  : int 1 2 3 4 5 6 7 8 9 10 ...
##  $ tl      : int 345 334 348 300 330 316 508 475 340 173 ...
##  $ scale1  : int 3 4 7 4 3 4 6 4 3 1 ...
##  $ scale2  : int 3 3 5 3 3 4 7 5 3 1 ...
##  $ scaleC  : int 3 4 6 4 3 4 7 5 3 1 ...
##  $ finray1 : int 3 3 3 3 4 2 6 9 2 2 ...
##  $ finray2 : int 3 3 3 2 3 3 6 9 3 1 ...
##  $ finrayC : int 3 3 3 3 4 3 6 9 3 1 ...
##  $ otolith1: int 3 3 3 3 3 6 9 11 3 1 ...
##  $ otolith2: int 3 3 3 3 3 5 10 12 4 1 ...
##  $ otolithC: int 3 3 3 3 3 6 10 11 4 1 ...

The mean and differences between the two age estimates are easily computed and added to the original data.frame. One way (of many ways) to do that is with mutate() from dplyr (as shown below).

library(dplyr)
WhitefishLC <- mutate(WhitefishLC,meanSO=(scaleC+otolithC)/2,diffSO=scaleC-otolithC)
head(WhitefishLC)
##   fishID  tl scale1 scale2 scaleC finray1 finray2 finrayC otolith1
## 1      1 345      3      3      3       3       3       3        3
## 2      2 334      4      3      4       3       3       3        3
## 3      3 348      7      5      6       3       3       3        3
## 4      4 300      4      3      4       3       2       3        3
## 5      5 330      3      3      3       4       3       4        3
## 6      6 316      4      4      4       2       3       3        6
##   otolith2 otolithC meanSO diffSO
## 1        3        3    3.0      0
## 2        3        3    3.5      1
## 3        3        3    4.5      3
## 4        3        3    3.5      1
## 5        3        3    3.0      0
## 6        5        6    5.0     -2

A scatterplot is constructed with plot() with a formula of the form y~x. In this case, the points are white (col="white") so that the points will not be evident.

plot(diffSO~meanSO,data=WhitefishLC,col="white", xlab="Mean Age",ylab="Scale-Otolith Age")

In this way, items can be added “behind” the points. For example, abline() is used with h=0 to add a horizontal reference line at zero.

abline(h=0,lwd=2,col="gray70")

In addition, horizontal lines at the mean and approximate lower and upper 95% confidence limits for the differences may be added.

mndiff <- mean(WhitefishLC$diffSO)
sediff <- se(WhitefishLC$diffSO)
abline(h=mndiff+c(-1.96,0,1.96)*sediff,lty=2,col="red")

Agreement lines (without confidence limits) may also be added. [These could, of course, be eliminated to address the second issue below.]

sddiff <- sd(WhitefishLC$diffSO)
abline(h=mndiff+c(-1.96,1.96)*sddiff,lty=3,col="blue")

The data points are then added to this base plot with points(). “Solid dots” (pch=19) that are a transparent black such that when five points are overplotted the “point” will appear solid black (col2rgbt("black",1/5)) are used to address the first issue below.

points(diffSO~meanSO,data=WhitefishLC,pch=19,col=col2rgbt("black",1/5))

Figure 2: Bland-Altman plot for comparing scale to otolith age estimates of Lake Champlain Lake Whitefish. This was constructed in parts.

My Critique of the Bland-Altman Plot for Age Comparisons

I like that Bland-Altman plots (relative to age bias plots) do not require that one of the variables be designated as the “reference” group. This may be more useful when comparing age estimates where one set of estimates is not clearly a priori considered to be more accurate (e.g., comparing readers with similar levels of experience).

However, I don’t like the following characteristics of (default) Bland-Altman plots.

  • There may be considerable overlap of the plotted points because of the discrete nature of most age data. Various authors have dealt with this by adding a “petal” to the point for each overplotted point to make a so-called “sunflower plot” (Lehnert, 2015a) or using bubbles that are proportional to the number of overplotted points (McBride 2015). However, I find the “sunflowers” and “bubbles” to be distracting. I addressed this issue with transparent points above.
  • The “agreement lines” are not particularly useful. They may be useful for identifying outliers, but an egregious outlier will likely stand out without these lines.
  • The single confidence interval for the mean difference suggests that any bias between the sets of estimates is “constant” across the range of mean ages. This can be relaxed somewhat if the percentage difference is plotted on the y-axis. However, neither of these allows for more complex situations where the bias is nonlinear with age. For example, a common situation of little difference between the estimates at young ages, but increasing differences with increasing ages (e.g., Figure 1) is not well-represented by this single confidence interval.
A Modified Bland-Altman Plot for Age Comparisons

The third issue above has been addressed with some Bland-Altman plots by fitting a linear regression that describes the difference in age estimates as a function of mean age (Giavarina 2015). However, this only allows for a linear relationship, which may not represent or reveal more interesting nonlinear relationships. A generalized additive model (GAM) could be used to estimate a “smoothed” potentially nonlinear relationship between the differences in ages and the means of the ages.

A GAM may be fit in R with gam() from the mgcv package (Wood 2006). A thin-plate regression spline (Wood 2003) is used as the smoother by default when s() is used on the right-hand-side of the formula in gam(). The degree of smoothing is controlled by k= in s(). I have found (with minimal experience) that k=5 may be adequate for most age comparison situations (note that smaller values provide more smoothing, larger values provide less smoothing).

library(mgcv)
mod1 <- gam(diffSO~s(meanSO,k=5),data=WhitefishLC)

Summary of the model fit is seen by submitting the gam object to summary(). In this example, the approximate p-value for the smoother term suggests that there is a relationship between the differences and means.

summary(mod1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## diffSO ~ s(meanSO, k = 5)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -1.7815     0.1525  -11.68   <2e-16
##
## Approximate significance of smooth terms:
##             edf Ref.df     F p-value
## s(meanSO) 3.438  3.827 33.53  <2e-16
##
## R-sq.(adj) = 0.461   Deviance explained = 47.4%
## GCV = 3.617   Scale est. = 3.5107   n = 151

The smoother “line” with 95% confidence limits is added to the plot by first using the smoother to predict the differences in ages (along with the SE of those predictions) across the entire range of mean ages. Approximate confidence limits are then derived (using normal theory) from the predicted values and their SEs.

tmp <- seq(0,18,0.1)
pred1 <- data.frame(age=tmp,
                    predict(mod1,data.frame(meanSO=tmp),type="response",se=TRUE)) %>%
  mutate(LCI=fit-1.96*se.fit,UCI=fit+1.96*se.fit)
head(pred1)
##   age        fit    se.fit       LCI      UCI
## 1 0.0 -0.2479616 0.7058686 -1.631464 1.135541
## 2 0.1 -0.2333355 0.6846088 -1.575169 1.108498
## 3 0.2 -0.2187095 0.6634582 -1.519088 1.081669
## 4 0.3 -0.2040835 0.6424276 -1.463241 1.055075
## 5 0.4 -0.1894574 0.6215291 -1.407654 1.028740
## 6 0.5 -0.1748314 0.6007766 -1.352354 1.002691

The smoother “line” is then added to the plot with lines().

lines(fit~age,data=pred1,lwd=2,lty=2,col="red")

The 95% intervals for the smoother line are added as lines with lines()

lines(LCI~age,data=pred1,lty=3,col="red")
lines(UCI~age,data=pred1,lty=3,col="red")

Alternatively, the 95% intervals for the smoother line may be added as a shaded region with polygon(). Note that rev() reverses the order of the elements in a vector and is a “trick” used to properly construct the boundaries of the polygon for plotting.

with(pred1,polygon(c(age,rev(age)),c(LCI,rev(UCI)), border=NA,col=col2rgbt("red",1/10)))

Putting this all together results in the plot shown in Figure 3. These results suggest that there is little difference between scale and otolith age estimates up to a mean age estimate of approximately five, after which age estimates from scales are less than age estimates from otoliths, with the difference between the two generally increasing with increasing mean age.

plot(diffSO~meanSO,data=WhitefishLC,col="white",
     xlab="Mean Age",ylab="Scale-Otolith Age")
abline(h=0,lwd=2,col="gray70")
lines(fit~age,data=pred1,lwd=2,lty=2,col="red")
with(pred1,polygon(c(age,rev(age)),c(LCI,rev(UCI)),
                   border=NA,col=col2rgbt("red",1/10)))
points(diffSO~meanSO,data=WhitefishLC,pch=19,col=col2rgbt("black",1/5))

Figure 3: Scatterplot of the difference in scale and otolith age estimates versus the mean of the scale and otolith age estimates of Lake Champlain Lake Whitefish with a thin-plate regression spline smoother and 95% confidence band shown.

A similar plot is shown for the comparison of otolith age estimates between two readers in Figure 4. Also note (see below) that the smoother term is not significant for the between-reader comparison of otolith age estimates, which suggests no relationship between the differences in ages and the mean age. In addition, the intercept term is not significantly different from zero, which indicates that there is not a constant bias between the two readers.

##
## Family: gaussian
## Link function: identity
##
## Formula:
## diffOO ~ s(meanOO, k = 5)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.0000     0.0611       0        1
##
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value
## s(meanOO)   1      1 0.013   0.909
##
## R-sq.(adj) = -0.00662   Deviance explained = 0.00881%
## GCV = 0.57128   Scale est. = 0.56371   n = 151

Figure 4: Scatterplot of the difference between otolith age estimates for two readers and the mean otolith age estimates of Lake Champlain Lake Whitefish with a thin-plate regression spline smoother and 95% confidence band shown.

Adding the GAM to the Age Bias Plot

In situations where one of the age estimates could be considered a priori to be more accurate, it seems to make more sense to put that age estimate on the x-axis rather than the mean between it and the less accurate estimate. In other words, return to the concept, though not the exact structure, of the age bias plot. The GAM smoother can also be added to this plot (Figures 5 and 6).

mod1a <- gam(diffSO~s(otolithC,k=15),data=WhitefishLC)
plot(diffSO~otolithC,data=WhitefishLC,col="white",
     xlab="Otolith Age",ylab="Scale - Otolith Age")
abline(h=0,lwd=2,col="gray70")
tmp <- seq(0,25,0.1)
pred1a <- data.frame(age=tmp,
                     predict(mod1a,data.frame(otolithC=tmp),type="response",se=TRUE)) %>%
  mutate(LCI=fit-1.96*se.fit,UCI=fit+1.96*se.fit)
lines(fit~age,data=pred1a,lwd=2,lty=2,col="red")
with(pred1a,polygon(c(age,rev(age)),c(LCI,rev(UCI)),
                    border=NA,col=col2rgbt("red",1/10)))
points(diffSO~otolithC,data=WhitefishLC,pch=19,col=col2rgbt("black",1/5))

Figure 5: Scatterplot of the difference in scale and otolith age estimates versus otolith age estimates of Lake Champlain Lake Whitefish with a thin-plate regression spline smoother and 95% confidence band shown.

Figure 6: Scatterplot of the difference otolith age estimates for two readers and the otolith age estimates for the first reader of Lake Champlain Lake Whitefish with a thin-plate regression spline smoother and 95% confidence band shown.

Things To Do

If this all seems plausible then consider:

  • Add histogram to right-side (show distribution of differences).
  • Perhaps run McBride’s simulations through this.
References

To leave a comment for the author, please follow the link and comment on their blog: fishR Blog.

mabbles: The Missing Piece of a Unified Theory of Tidyness

Thu, 04/20/2017 - 04:05

(This article was first published on R – 4D Pie Charts, and kindly contributed to R-bloggers)

R programming has seen a big shift in the last couple of years. All those packages that RStudio have been creating to solve this or that problem suddenly started to cohere into a larger ecosystem of packages. Once it was given a name, the tidyverse, it became possible to start thinking about the structure of the ecosystem and how packages relate to each other and what new packages were needed. At this point, the tidyverse is already the dominant ecosystem on CRAN. Five of the top ten downloaded packages are tidyverse packages, and most of the packages in the ecosystem are in the top one hundred.

As the core tidyverse packages like dplyr mature, the most exciting developments are its expansion into new fields. Notably tidytext is taking over text mining, tidyquant is poised to conquer financial time analyses, and sf is getting the spatial stats community excited.

There is one area that remains stubbornly distinct from the tidyverse. Bioconductor dominates biological research, particularly ‘omics fields (genomics, transcriptomics, proteomics, and metabolomics). Thanks to the heavy curation of packages by Bioconductor Core, the two and a half thousand packages in the Bioconductor repository also form a coherent ecosystem.

In the same way that the general theory of relativity and quantum mechanics are incredibly powerful by themselves but are currently irreconcilable when it comes to thinking about gravity, the tidyverse and Bioconductor are more or less mutually exclusive ecosystems of R packages for data analysis. The fundamental data structure of the tidyverse is the data frame, but for Bioconductor it is the ExpressionSet.

If you’ve not come across ExpressionSets before, they essentially consist of a data frame of feature data, a data frame of response data, and matrix of measurements. This data type is marvelously suited to dealing with data from ‘omics experiments and has served Bioconductor well for years.
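For readers who have not met ExpressionSets, here is a minimal sketch of how those three pieces fit together using Biobase; the gene and sample names and the values are made up purely for illustration.

library(Biobase)

# Measurement matrix: features (genes) in rows, samples in columns
exprs_mat <- matrix(rnorm(200), nrow = 20, ncol = 10,
                    dimnames = list(paste0("gene", 1:20), paste0("sample", 1:10)))

# Sample (response) data and feature data, wrapped as AnnotatedDataFrames
pheno <- AnnotatedDataFrame(data.frame(treatment = rep(c("control", "drug"), each = 5),
                                       row.names = colnames(exprs_mat)))
feats <- AnnotatedDataFrame(data.frame(chromosome = sample(1:22, 20, replace = TRUE),
                                       row.names = rownames(exprs_mat)))

eset <- ExpressionSet(assayData = exprs_mat, phenoData = pheno, featureData = feats)

exprs(eset)[1:3, 1:3]  # the matrix of measurements
pData(eset)            # sample-level data
fData(eset)            # feature-level data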

However, over the last decade, biological datasets have been growing exponentially, and for many experiments it is now no longer practical to store them in RAM, which means that an ExpressionSet is impractical. There are some very clever workarounds, but it strikes me that what Bioconductor needs is a trick from the tidyverse.

My earlier statement that the data frame is the fundamental data structure in the tidyverse isn’t quite true. It’s actually the tibble, an abstraction of the data frame. From a user point of view, tibbles behave like data frames with a slightly nicer print method. From a technical point of view, they have one huge advantage: they don’t care where their data is. tibbles can store their data in a regular data.frame, a data.table, a database, or on Spark. The user gets to write the same dplyr code to manipulate them, but the analysis can scale beyond the limits of RAM.
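As a rough illustration of that abstraction (a sketch only; the exact helper functions have shifted between dplyr releases), the same verbs can run on an in-memory tibble or on a table living in a database, with only collect() pulling results back into R:

library(dplyr)

# In memory: a plain tibble
local_tbl <- tibble::as_tibble(mtcars)
local_tbl %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg))

# In a database: copy the same data into SQLite and run identical verbs;
# dplyr translates them to SQL behind the scenes.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
remote_tbl <- copy_to(con, mtcars, "mtcars")
remote_tbl %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg)) %>% collect()

DBI::dbDisconnect(con)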

If Bioconductor could have a similar abstracted ExpressionSet object, its users and developers could stop worrying about the rapidly expanding sizes of biological data.

Swapping out the data frame parts of an ExpressionSet is simple – you can just use tibbles already. The tricky part is what to do with the matrix. What is needed is an object that behaves like a matrix to the user, but acts like a tibble underneath.

I call such a theoretical object a mabble.

Unfortunately, right now, it doesn’t exist. This is where you come in. I think that there is plenty of fame and fortune for the person or team that can develop such an object, so I urge you to have a go.

The basic idea seems reasonably simple. You store the mabble as a tibble, with three columns for row, column, and value. Here’s a very simple implementation.

library(tibble)
library(dplyr)
library(assertive)  # for the assert_* checks below

mabble <- function(x, nrow = NULL, ncol = NULL) {
  # Decide on dimensions
  n <- length(x)
  if(is.null(nrow)) {
    if(is.null(ncol)) {
      # Default to column vector
      nrow <- n
      ncol <- 1
    } else {
      # only ncol known
      nrow <- n / ncol
      assert_all_are_whole_numbers(nrow)
    }
  } else {
    if(is.null(ncol)) {
      # only nrow known
      ncol <- n / nrow
      assert_all_are_whole_numbers(ncol)
    } else {
      # both known
      # Not allowing recycling for now; may change my mind later
      assert_all_are_equal_to(nrow * ncol, length(x))
    }
  }
  m <- tibble(
    r = rep.int(seq_len(nrow), times = ncol),
    c = rep(seq_len(ncol), each = nrow),
    v = x
  )
  class(m) <- c("mbl", "tbl_df", "tbl", "data.frame")
  m
}

Then you need a print method so it displays like a matrix. Here’s a simple solution, though ideally only a limited number of rows and column would be displayed.

as.matrix.mbl <- function(x) {
  reshape2::acast(x, r ~ c, value.var = "v")
}

print.mbl <- function(x) {
  print(as.matrix(x))
}

(m <- mabble(1:12, 3, 4))
##   1 2 3  4
## 1 1 4 7 10
## 2 2 5 8 11
## 3 3 6 9 12

The grunt work is to write methods for all the things that matrices can do. Transposing is easy – you just swap the r and c columns.

t.mbl <- function(x) {
  x %>% dplyr::select(r = c, c = r, v)
}

t(m)
##    1  2  3
## 1  1  2  3
## 2  4  5  6
## 3  7  8  9
## 4 10 11 12

There are a lot of things that need to be worked out. Right now, I have no idea how you implement linear algebra with a mabble. I don’t have time to make this thing myself but I’d be happy to advise you if you are interested in creating something yourself.
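To give a flavour of what that could look like, here is one hypothetical sketch of matrix multiplication on the long format: each entry of the product is a join on the inner index followed by a grouped sum (the function name and details are mine, not an existing implementation).

# C[i, k] = sum over j of A[i, j] * B[j, k]
matmul_mbl <- function(a, b) {
  out <- a %>%
    dplyr::rename(i = r, j = c, av = v) %>%
    dplyr::inner_join(dplyr::rename(b, j = r, k = c, bv = v), by = "j") %>%
    dplyr::group_by(i, k) %>%
    dplyr::summarise(v = sum(av * bv)) %>%
    dplyr::ungroup() %>%
    dplyr::rename(r = i, c = k)
  class(out) <- c("mbl", "tbl_df", "tbl", "data.frame")
  out
}

matmul_mbl(m, t(m)) # a 3 x 3 matrix, printed via print.mbl()

Because everything here is ordinary dplyr, the same code could in principle run against a database- or Spark-backed tibble, which is the whole point of the abstraction.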

Tagged: Bioconductor, programming, r, tidyverse

To leave a comment for the author, please follow the link and comment on their blog: R – 4D Pie Charts.

RcppQuantuccia 0.0.1

Thu, 04/20/2017 - 03:38

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

New package! And, as it happens, effectively a subset or variant of one of my oldest packages, RQuantLib.

Fairly recently, Peter Caspers started to put together a header-only subset of QuantLib. He called this Quantuccia, and, upon me asking, said that it stands for "little sister" of QuantLib. Very nice.

One design goal is to keep Quantuccia header-only. This makes distribution and deployment much easier. In the fifteen years that we have worked with QuantLib by providing the R bindings via RQuantLib, it has always been a concern to provide current QuantLib libraries on all required operating systems. Many people helped over the years but it is still an issue, and e.g. right now we have no Windows package as there is no library to build it against.

Enter RcppQuantuccia. It only depends on R, Rcpp (for seamless R and C++ integrations) and BH bringing Boost headers. This will make it much easier to have Windows and macOS binaries.

So what can it do right now? We started with calendaring: you can compute dates pertaining to different (ISDA and other) business day conventions, and compute holiday schedules. Here is one example computing, inter alia, under the NYSE holiday schedule common for US equity and futures markets:

R> library(RcppQuantuccia)
R> fromD <- as.Date("2017-01-01")
R> toD <- as.Date("2017-12-31")
R> getHolidays(fromD, toD)          # default calender ie TARGET
[1] "2017-04-14" "2017-04-17" "2017-05-01" "2017-12-25" "2017-12-26"
R> setCalendar("UnitedStates")
R> getHolidays(fromD, toD)          # US aka US::Settlement
 [1] "2017-01-02" "2017-01-16" "2017-02-20" "2017-05-29" "2017-07-04" "2017-09-04"
 [7] "2017-10-09" "2017-11-10" "2017-11-23" "2017-12-25"
R> setCalendar("UnitedStates::NYSE")
R> getHolidays(fromD, toD)          # US New York Stock Exchange
[1] "2017-01-02" "2017-01-16" "2017-02-20" "2017-04-14" "2017-05-29" "2017-07-04"
[7] "2017-09-04" "2017-11-23" "2017-12-25"
R>

The GitHub repo already has a few more calendars, and more are expected. Help is of course welcome for both this, and for porting over actual quantitative finance calculations.

More information is on the RcppQuantuccia page. Issues and bugreports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

R-Cade Games: Simulating the Legendary Game of Pong

Thu, 04/20/2017 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Pong is one of the earliest arcade games on the market, first released in 1972. From the day I first saw this miracle box, I wanted to know more about computers.

I learnt how to write code from the 1983 book Dr. C. Wacko’s Miracle Guide to Designing and Programming your own Atari Computer Arcade Games. This book explains in a very clear and humorous way how to write computer games in Atari basic. I devoured this book and spent many hours developing silly games. This article is an ode to Dr Wacko, a computer geek’s midlife-crisis and an attempt to replicate the software I developed thirty years ago.

I showed in a previous post that R can be used for board games. The question is whether we can create arcade games in R. My challenge is to recreate the look and feel of 1980s arcade games, or R-Cade games, using R. The code shown below simulates the legendary game of Pong.

Playing Pong in R

The code is based on Wacko's Boing Program in the above-mentioned book. The R code is fully commented and speaks for itself. Please note that the animation is very clunky when you run it in RStudio. Only the native R Terminal displays the animation correctly.

Perhaps somebody can help me perfect this little ditty. I would love to know how to read real-time USB input to control the game, so we can get a step closer to the first R-Cade game.


http://r.prevos.net/wp-content/uploads/sites/11/2017/04/Rpong.mp4

The Pong Code

# Sound library
library(beepr)

# Game parameters
skill <- 0.87 # Skill (0-1)
score <- 0
high.score <- 0

# Define playing field
par(mar = rep(1,4), bg = "black")
plot.new()
plot.window(xlim = c(0, 30), ylim = c(0, 30))
lines(c(1, 30, 30, 1), c(0, 0, 30, 30), type = "l", lwd = 5, col = "white")

# Playing field boundaries (depend on cex)
xmin <- 0.5
xmax <- 29.4
ymin <- 0.5
ymax <- 29.4

# initial position
x <- sample(5:25, 1)
y <- sample(5:25, 1)
points(x, y, pch = 15, col = "white", cex = 2)

# Paddle control
psize <- 4
ypaddle <- y

# Set direction
dx <- runif(1, .5, 1)
dy <- runif(1, .5, 1)

# Game play
while (x > xmin - 1) {
  sound <- 0 # Silence
  Sys.sleep(.05) # Pause screen. Reduce to increase speed
  points(x, y, pch = 15, col = "black", cex = 2) # Erase ball
  # Move ball
  x <- x + dx
  y <- y + dy
  # Collision detection
  if (x > xmax) {
    dx <- -dx * runif(1, .9, 1.1) # Bounce
    if (x > xmin) x <- xmax # Boundary
    sound <- 10 # Set sound
  }
  if (y < ymin | y > ymax) {
    if (y < ymin) y <- ymin
    if (y > ymax) y <- ymax
    dy <- -dy * runif(1, .9, 1.1)
    sound <- 10
  }
  # Caught by paddle?
  if (x < xmin & (y > ypaddle - (psize / 2)) & y < ypaddle + (psize / 2)) {
    if (x < xmin) x <- xmin
    dx <- -dx * runif(1, .9, 1.1)
    sound <- 2
    score <- score + 1
  }
  # Draw ball
  points(x, y, pch = 15, col = "white", cex = 2)
  if (sound != 0) beep(sound)
  # Move paddle
  if (runif(1, 0, 1) < skill) ypaddle <- ypaddle + dy # Imperfect follow
  # Draw paddle
  # Erase back line
  lines(c(0, 0), c(0, 30), type = "l", lwd = 8, col = "black")
  # Keep paddle inside window
  if (ypaddle < (psize / 2)) ypaddle <- (psize / 2)
  if (ypaddle > 30 - (psize / 2)) ypaddle <- 30 - (psize / 2)
  # Draw paddle
  lines(c(0, 0), c(ypaddle - (psize / 2), ypaddle + (psize / 2)), type = "l", lwd = 8, col = "white")
}
beep(8)
text(15, 15, "GAME OVER", cex = 5, col = "white")
s <- ifelse(score == 1, "", "s")
text(15, 5, paste0(score, " Point", s), cex = 3, col = "white")

The post R-Cade Games: Simulating the Legendary Game of Pong appeared first on The Devil is in the Data.

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.

readxl 1.0.0

Thu, 04/20/2017 - 01:12

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

I’m pleased to announce that readxl 1.0.0 is available on CRAN. readxl makes it easy to bring tabular data out of Excel and into R, for modern .xlsx files and the legacy .xls format. readxl does not have any tricky external dependencies, such as Java or Perl, and is easy to install and use on Mac, Windows, and Linux.

You can install it with:

install.packages("readxl")

As well as fixing many bugs, this release:

  • Allows you to target specific cells for reading, in a variety of ways
  • Adds two new column types: "logical" and "list", for data of disparate type
  • Is more resilient to the wondrous diversity in spreadsheets, e.g., those written by 3rd party tools

You can see a full list of changes in the release notes. This is the first release maintained by Jenny Bryan.

Specifying the data rectangle

In an ideal world, data would live in a neat rectangle in the upper left corner of a spreadsheet. But spreadsheets often serve multiple purposes for users with different priorities. It is common to encounter several rows of notes above or below the data, for example. The new range argument provides a flexible interface for describing the data rectangle, including Excel-style ranges and row- or column-only ranges.

library(readxl)
read_excel(
  readxl_example("deaths.xlsx"),
  range = "arts!A5:F15"
)
#> # A tibble: 10 × 6
#>            Name Profession   Age `Has kids` `Date of birth`
#>
#> 1   David Bowie   musician    69       TRUE      1947-01-08
#> 2 Carrie Fisher      actor    60       TRUE      1956-10-21
#> 3   Chuck Berry   musician    90       TRUE      1926-10-18
#> 4   Bill Paxton      actor    61       TRUE      1955-05-17
#> # ... with 6 more rows, and 1 more variables: `Date of death`

read_excel(
  readxl_example("deaths.xlsx"),
  sheet = "other",
  range = cell_rows(5:15)
)
#> # A tibble: 10 × 6
#>           Name Profession   Age `Has kids` `Date of birth`
#>
#> 1   Vera Rubin  scientist    88       TRUE      1928-07-23
#> 2  Mohamed Ali    athlete    74       TRUE      1942-01-17
#> 3 Morley Safer journalist    84       TRUE      1931-11-08
#> 4 Fidel Castro politician    90       TRUE      1926-08-13
#> # ... with 6 more rows, and 1 more variables: `Date of death`

There is also a new argument n_max that limits the number of data rows read from the sheet. It is an example of readxl’s evolution towards a readr-like interface. The Sheet Geometry vignette goes over all the options.

Column typing

The new ability to target cells for reading means that readxl’s automatic column typing will “just work” for most sheets, most of the time. Above, the Has kids column is automatically detected as logical, which is a new column type for readxl.

You can still specify column type explicitly via col_types, which gets a couple new features. If you provide exactly one type, it is recycled to the necessary length. The new type "guess" can be mixed with explicit types to specify some types, while leaving others to be guessed.

read_excel(
  readxl_example("deaths.xlsx"),
  range = "arts!A5:C15",
  col_types = c("guess", "skip", "numeric")
)
#> # A tibble: 10 × 2
#>            Name   Age
#>
#> 1   David Bowie    69
#> 2 Carrie Fisher    60
#> 3   Chuck Berry    90
#> 4   Bill Paxton    61
#> # ... with 6 more rows

The new argument guess_max limits the rows used for type guessing. Leading and trailing whitespace is trimmed when the new trim_ws argument is TRUE, which is the default. Finally, thanks to Jonathan Marshall, multiple na values are accepted. The Cell and Column Types vignette has more detail.
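A small illustrative example of these arguments used together (the particular na strings here are arbitrary):

read_excel(
  readxl_example("deaths.xlsx"),
  range = "arts!A5:F15",
  trim_ws = TRUE,           # the default: strip leading/trailing whitespace in cells
  na = c("", "NA", "n/a"),  # strings to treat as missing
  guess_max = 1000          # number of rows used when guessing column types
)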

"list" columns

Thanks to Greg Freedman Ellis we now have a "list" column type. This is useful if you want to bring truly disparate data into R without the coercion required by atomic vector types.

(df <- read_excel(
  readxl_example("clippy.xlsx"),
  col_types = c("text", "list")
))
#> # A tibble: 4 × 2
#>                   name      value
#>                  <chr>     <list>
#> 1                 Name  <chr [1]>
#> 2              Species  <chr [1]>
#> 3 Approx date of death <dttm [1]>
#> 4      Weight in grams  <dbl [1]>

tibble::deframe(df)
#> $Name
#> [1] "Clippy"
#>
#> $Species
#> [1] "paperclip"
#>
#> $`Approx date of death`
#> [1] "2007-01-01 UTC"
#>
#> $`Weight in grams`
#> [1] 0.9

Everything else

To learn more, read the vignettes and articles or release notes. Highlights include:

  • General rationalization of sheet geometry, including detection and treatment of empty rows and columns.
  • Improved behavior and messaging around coercion and mismatched cell and column types.
  • Improved handling of datetimes with respect to 3rd party software, rounding, and the Lotus 1-2-3 leap year bug.
  • read_xls() and read_xlsx() are now exposed, so that files without an .xls or .xlsx extension can be read. Thanks Jirka Lewandowski!
  • readxl Workflows showcases patterns that reduce tedium and increase reproducibility when raw data arrives in a spreadsheet.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

Microsoft R Server 9.1 now available

Wed, 04/19/2017 - 22:29

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

During today's Data Amp online event, Joseph Sirosh announced the new Microsoft R Server 9.1, which is available for customers now. In addition the updated Microsoft R Client, which has the same capabilities for local use, is available free for everyone on both Windows and — new to this update — Linux.

This release adds many new capabilities to Microsoft R, including: 

Based on R 3.3.3. This update is based on the latest release of open source R from the R Foundation, bringing many improvements to the core language engine.

Microsoft R Client now available on Linux. The Microsoft R Client is available free, and has all of the same capabilities as Microsoft R Server for a standalone machine (or, you can use it to push heavy workloads to a remote Microsoft R Server, SQL Server or compute cluster). With this release it is now also available on Linux (specifically, Ubuntu, Red Hat / CentOS and SUSE) in addition to Windows.

New function for "pleasingly parallel" R computations on data partitions. The new rxExecBy function allows you to apply any R function to partitions of a data set, and compute on the partitions in parallel. (You don't have to manually split the data first; this happens automatically at the data source, without the need to move or copy the data.) You can use any parallel platform supported in Microsoft R: multiple cores in a local context, multiple threads in SQL Server, or multiple nodes in a Spark cluster.
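A rough sketch of the pattern described above; the exact signature of rxExecBy() and of the user function, and the sample file name, are assumptions on my part and should be checked against the Microsoft R Server 9.1 documentation.

library(RevoScaleR)

# User function run on each partition; assumed to receive the key values and
# a data source for that partition's rows (signature is an assumption)
meanArrDelay <- function(keys, data) {
  df <- rxImport(data)
  c(mean = mean(df$ArrDelay, na.rm = TRUE), n = nrow(df))
}

# Partition the airline sample data by DayOfWeek and apply the function to
# each partition in parallel under the current compute context
airXdf <- RxXdfData(file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.xdf"))
res <- rxExecBy(inData = airXdf, keys = c("DayOfWeek"), func = meanArrDelay)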

New functions for sentiment scoring and image featurization have been added to the built-in MicrosoftML package. The new getSentiment function returns a sentiment score (on a scale from "very negative" to "very positive") for a piece of English text. (You can also featurize text in English, French, German, Dutch, Italian, Spanish and Japanese as you build your own models.) The new featurizeImage function decomposes an image into a few numeric variables (using a selection of ResNet recognizers) which can then be used as the basis of a predictive model. Both functions are based on deep neural network models generated from thousands of compute-hours of training by Microsoft Research.

Interoperability between Microsoft R Server and sparklyr. You can now use RStudio's sparklyr package in tandem with Microsoft R Server in a single Spark session.

New machine learning models in Hadoop and Spark. The new machine learning functions introduced with Version 9.0 (such as FastRank gradient-boosted trees and GPU-accelerated deep neural networks) are now available in the Hadoop and Spark contexts in addition to standalone servers and within SQL Server. 

Production-scale deployment of R models. The new publishService function creates a real-time web-service for certain RevoScaleR and MicrosoftML functions that is independent of any underlying R interpreter, and can return results with millisecond latency. There are also new tools for launching and managing asynchronous batch R jobs.

Updates for R in SQL Server. There are also improvements for R Services in SQL Server 2016, in addition to the new R functions described above. There are new tools to manage installed R packages on SQL Server, and faster single-row scoring. Even more new capabilities are coming in SQL Server 2017 (now in preview for both Windows and Linux), including the ability to use both R and Python for in-database computations.

For more on the updates in Microsoft R Server 9.1, check the official blog post by Nagesh Pabbisetty linked below.

SQL Server Blog: Introducing Microsoft R Server 9.1 release

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Machine Learning Using Support Vector Machines

Wed, 04/19/2017 - 21:06

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Support Vector Machines (SVM) is a data classification method that separates data using hyperplanes. The concept of SVM is intuitive and easy to understand: if we have labeled data, SVM can be used to generate multiple separating hyperplanes so that the data space is divided into segments, each containing only one kind of data. The SVM technique is generally useful for data with non-regularity, that is, data whose distribution is unknown.

Let’s take a simple example to understand how SVM works. Say we have only two kinds of values, which we can represent as in the figure:

This data is simple to classify, and one can see that it is clearly separated into two segments. Any line that separates the red and blue items can be used to classify the data. Had this been multi-dimensional data, any plane could separate and successfully classify it. However, we want to find the “most optimal” solution. What, then, characterizes this optimal line? We have to remember that this is just the training data; we may get more data points that can lie anywhere in the subspace. If our line is too close to any of the data points, noisy test data is more likely to be classified in the wrong segment. We have to choose the line that lies between these groups and is at the farthest distance from each of the segments.

To find this classifier line, we write the line as y = ax + b and make it equidistant from the respective data points closest to it. In other words, we want to maximize the margin (m).

We know that all points on the line ax + b = 0 satisfy the equation. If we draw two parallel lines, ax + b = -1 for one segment and ax + b = 1 for the other, such that each passes through the data point in its segment nearest to our line, then the distance between these two lines is our margin: m = 2/||a||. Looked at another way, if we have a training dataset (x1, x2, x3, ..., xn) and want to produce an outcome y that is either -1 or 1 (depending on which segment the data point belongs to), then a classifier based on the sign of ax + b should classify every point correctly, which means y(ax + b) ≥ 1 for all data points. Since we have to maximize the margin, we solve the problem under this constraint by maximizing 2/||a|| or, equivalently, minimizing ||a||^2/2. Solving this can be easy or complex depending on the dimensionality of the data, but we can do it quickly in R and try it out on a sample dataset.

To use SVM in R, I created some random data with two features, x and y, in Excel. I took the values of x as a simple sequence from 1 to 20 and derived the corresponding values of y using the formula y(t) = y(t-1) + r(-1, 9), where r(a, b) generates a random integer between a and b; I took y(1) as 3. The following R code contains the set of sample values generated this way:

x = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
y = c(3,4,5,4,8,10,10,11,14,20,23,24,32,34,35,37,42,48,53,60)

#Create a data frame of the data
train = data.frame(x, y)

Let's see what our data looks like. For this we use the plot function:

#Plot the dataset
plot(train, pch = 16)


This seems to be fairly good data. Looking at the plot, it also seems like a linear regression could be a good fit. We'll use both models and compare them. First, the code for linear regression:

#Linear regression
model <- lm(y ~ x, train)

#Plot the model using abline
abline(model)

SVM

To use SVM in R, we have the package e1071. It is not preinstalled, so you first need to run install.packages("e1071") and then load the package with the library command. The syntax of the svm function is quite similar to that of lm. We use the svm function here.

#SVM
library(e1071)

#Fit a model. The function syntax is very similar to lm function
model_svm <- svm(y ~ x, train)

#Use the predictions on the data
pred <- predict(model_svm, train)

#Plot the predictions and the plot to see our model fit
points(train$x, pred, col = "blue", pch = 4)

The points follow the actual values much more closely than the abline. Can we verify that? Let's calculate the RMSE for both models:

#Linear model has a residuals part which we can extract and directly calculate rmse
error <- model$residuals
lm_error <- sqrt(mean(error^2))   # 3.832974

#For svm, we have to manually calculate the difference between actual values (train$y) and our predictions (pred)
error_2 <- train$y - pred
svm_error <- sqrt(mean(error_2^2))   # 2.696281

In this case, the RMSE for the linear model is ~3.83, whereas our SVM model has a lower error of ~2.70. A straightforward application of SVM therefore already beats the linear regression model. But the SVM model can go further: we can tune it so that the error is even lower. For that we will look more closely at the svm function and the tune function. We can specify values for the cost parameter and for epsilon, which is 0.1 by default. A simple approach is to try every value of epsilon between 0 and 1 (I will take steps of 0.01) and a range of cost values from 4 to 2^9 (I will take exponential steps of 2 here). That gives 101 values of epsilon and 8 values of cost, so I will be testing 808 models to see which one performs best. The code may take a short while to run all the models and find the best version. The corresponding code is:

svm_tune <- tune(svm, y ~ x, data = train,
                 ranges = list(epsilon = seq(0, 1, 0.01), cost = 2^(2:9)))
print(svm_tune)

#Printing gives the output:
#
#Parameter tuning of ‘svm’:
# - sampling method: 10-fold cross validation
# - best parameters:
#    epsilon cost
#       0.09  256
# - best performance: 2.569451
#
#This best performance denotes the MSE. The corresponding RMSE is 1.602951, which is its square root.

An advantage of tuning in R is that it lets us extract the best model directly: we don't have to refit anything, we just pull the best model out of the svm_tune object. We can now see the improvement by calculating the RMSE of this model with the following code:

#The best model
best_mod <- svm_tune$best.model
best_mod_pred <- predict(best_mod, train)

error_best_mod <- train$y - best_mod_pred

# this value can be different on your computer
# because the tune method randomly shuffles the data
best_mod_RMSE <- sqrt(mean(error_best_mod^2))   # 1.290738

This tuning method is known as a grid search: R fits a model for every combination of epsilon and cost in the specified ranges and returns the one with the lowest error. We can also plot the tuning object to see the performance of all the models together:

plot(svm_tune)

This plot shows the performance of the various models using color coding: darker regions imply better accuracy (lower error). The plot helps us determine the range to which we can narrow down our search and run further tuning if required. For instance, it shows that I could re-run the tuning for epsilon in the narrower range of 0 to 0.2 and, while I'm at it, move in even smaller steps (say 0.002); going much further, however, may lead to overfitting, so I stop at this point. With this model we have improved on our initial SVM error of 2.69 and brought it down to 1.29, roughly half of the original. We have come very far in our model accuracy. Let's see what the best model looks like when plotted.

plot(train, pch = 16)
points(train$x, best_mod_pred, col = "blue", pch = 4)

Visually, the points predicted by our tuned model almost follow the data exactly. This is the power of SVM, and we are only seeing it for data with two features. Imagine the abilities of the model with a larger number of more complex features!
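As a side note, the narrower second grid search suggested above would look roughly like this (a sketch; it was not run in the original analysis, so the best parameters it finds may differ from those reported here):

# Finer grid over the narrowed epsilon range indicated by the tuning plot
svm_tune2 <- tune(svm, y ~ x, data = train,
                  ranges = list(epsilon = seq(0, 0.2, 0.002), cost = 2^(2:9)))
print(svm_tune2)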

 

Summary

SVM is a powerful technique, and it is especially useful for data whose distribution is unknown (also referred to as non-regularity in the data). Because the example considered here had only two features and a roughly linear relationship, the fitted SVM behaves much like a linear SVM. SVM is powered by a kernel for dealing with various kinds of data, and the kernel can also be set during model tuning; a common choice is the Gaussian (radial basis) kernel. Hence, SVM can also be used for non-linear data and does not require any assumptions about its functional form. Because we separate the data with the maximum possible margin, the model is robust and can deal with incongruencies such as noisy test data or biased training data. We can also interpret the results produced by SVM through visualization.

A common disadvantage of SVM is associated with its tuning: the level of accuracy expected on the training data has to be chosen with care. Because our example used custom-generated data, we went ahead and pushed the model accuracy as high as possible by reducing the error. However, in business situations, when one needs to train a model and then continually predict on new test data, SVM may fall into the trap of overfitting. This is why SVM needs to be modeled carefully; otherwise the accuracy on new data may not be satisfactory.

As the example showed, the SVM technique is closely related to regression. For linear data we can compare SVM with linear regression, while non-linear SVM is comparable to logistic regression. As the data becomes more and more linear in nature, linear regression becomes more and more accurate, and a tuned SVM can fit the data as well. However, noisiness and bias can severely impact the ability of regression; in such cases, SVM comes in really handy!

With this article, I hope you now understand the basics of SVM and its implementation in R. You can dig deeper into the technique through model tuning. The full R code used in the article is laid out below:

x = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
y = c(3,4,5,4,8,10,10,11,14,20,23,24,32,34,35,37,42,48,53,60)

#Create a data frame of the data
train = data.frame(x, y)

#Plot the dataset
plot(train, pch = 16)

#Linear regression
model <- lm(y ~ x, train)

#Plot the model using abline
abline(model)

#SVM
library(e1071)

#Fit a model. The function syntax is very similar to lm function
model_svm <- svm(y ~ x, train)

#Use the predictions on the data
pred <- predict(model_svm, train)

#Plot the predictions and the plot to see our model fit
points(train$x, pred, col = "blue", pch = 4)

#Linear model has a residuals part which we can extract and directly calculate rmse
error <- model$residuals
lm_error <- sqrt(mean(error^2))   # 3.832974

#For svm, we have to manually calculate the difference between actual values (train$y) and our predictions (pred)
error_2 <- train$y - pred
svm_error <- sqrt(mean(error_2^2))   # 2.696281

# perform a grid search
svm_tune <- tune(svm, y ~ x, data = train,
                 ranges = list(epsilon = seq(0, 1, 0.01), cost = 2^(2:9)))
print(svm_tune)

#Parameter tuning of ‘svm’:
# - sampling method: 10-fold cross validation
# - best parameters:
#    epsilon cost
#          0    8
# - best performance: 2.872047

#The best model
best_mod <- svm_tune$best.model
best_mod_pred <- predict(best_mod, train)

error_best_mod <- train$y - best_mod_pred

# this value can be different on your computer
# because the tune method randomly shuffles the data
best_mod_RMSE <- sqrt(mean(error_best_mod^2))   # 1.290738

plot(svm_tune)

plot(train, pch = 16)
points(train$x, best_mod_pred, col = "blue", pch = 4)

Bio: Chaitanya Sagar is the Founder and CEO of Perceptive Analytics. Perceptive Analytics has been chosen as one of the top 10 analytics companies to watch out for by Analytics India Magazine. It works on Marketing Analytics for e-commerce, Retail and Pharma companies.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

quantmod 0.4-8 on CRAN

Wed, 04/19/2017 - 17:45

(This article was first published on FOSS Trading, and kindly contributed to R-bloggers)

I pushed a bug-fix release of quantmod to CRAN last night. The major changes were to

  • getSymbols.FRED (#141)
  • getSymbols.oanda (#144)
  • getSymbols.yahoo (#149)

All three providers made breaking changes to their URLs/interfaces.

getSymbols.google also got some love. It now honors all arguments set via setSymbolLookup (#138), and it correctly parses the date column in non-English locales (#140).
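For example (a sketch, not from the release notes; the symbol and extra arguments are illustrative), per-symbol defaults set once are now picked up in full:

library(quantmod)
# store a source and a start date for one symbol...
setSymbolLookup(SPY = list(src = "google", from = "2016-01-01"))
# ...and getSymbols() will honor both when called without extra arguments
getSymbols("SPY")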

There’s a handy new argument to getDividends: split.adjust. It allows you to request dividends unadjusted for splits (#128). Yahoo provides split-adjusted dividends, so you previously had to manually unadjust them for splits if you wanted the original raw values. To import the raw unadjusted dividends, just call:

rawDiv <- getDividends("IBM", split.adjust = FALSE)

Note that the default is split.adjust = TRUE to maintain backward-compatibility.

To leave a comment for the author, please follow the link and comment on their blog: FOSS Trading. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R Quick Tip: Table parameters for rmarkdown reports

Wed, 04/19/2017 - 10:50

(This article was first published on R – Locke Data, and kindly contributed to R-bloggers)

The recent(ish) advent of parameters in rmarkdown reports is pretty nifty, but there's a little bit of behaviour that can come in handy and doesn't come across in the documentation: you can use table parameters for rmarkdown reports.

Previously, if you wanted to produce multiple reports based off a dataset, you would make the dataset available and then perform filtering in the report. Now we can pass the filtered data directly to the report, which keeps all the filtering logic in one place.

It’s actually super simple to add table parameters for rmarkdown reports.

View the code on Gist.

We specify a params section and we create a parameter for our table. I use the !r iris bit to add default contents to the report so that I can generate the report easily.
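For reference, a minimal sketch of what that fragment might look like (the file name MWE.Rmd and the summary chunk are my own illustration; the parameter name mytbl and the !r iris default follow the description above):

---
title: "MWE"
output: html_document
params:
  mytbl: !r iris
---

```{r}
summary(params$mytbl)
```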

With this fragment now, I could easily pass in a new dataset to this report.

rmarkdown::render("MWE.Rmd", params = list(mytbl=mtcars))

By being able to pass in tables to the report in the params argument directly, any filtering logic can now be kept external to the report and kept close to the iterator. This is pretty darn cool in my opinion!

The post R Quick Tip: Table parameters for rmarkdown reports appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.

To leave a comment for the author, please follow the link and comment on their blog: R – Locke Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R for Enterprise: Understanding R’s Startup

Wed, 04/19/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)



R’s startup behavior is incredibly powerful. R sets environment variables, loads base packages, and understands whether you’re running a script, an interactive session, or even a build command.

Most R users will never have to worry about changing R’s startup process. In fact, for portability and reproducibility of code, we recommend that users do not modify R’s startup profile. But, for system administrators, package developers, and R enthusiasts, customizing the launch process can provide a powerful tool and help avoid common gotchas. R’s behavior is thoroughly documented in R’s base documentation: “Initialization at Start of an R Session”. This post will elaborate on the official documentation and provide some examples. Read on if you’ve ever wondered how to:

  • Tell R about a local CRAN-like repository to host and share R packages internally
  • Use a different version of Python, e.g., to support a Tensorflow project
  • Define a proxy so R can reach the internet in locked-down environments
  • Understand why Packrat creates a .Rprofile
  • Automatically run code at the end of a session to capture and log sessionInfo()

We’ll also discuss how RStudio starts R. Spoiler: it’s a bit different than you might expect!

.Rprofile, .Renviron, and R*.site oh my!

R’s startup process follows three steps: starting R, setting environment variables, and sourcing profile scripts. In the last two steps, R looks for site-wide files and user- or project-specific files. The R documentation explains this process in detail.

Common Gotchas and Tricks:
  1. The Renviron file located at R_HOME/etc is unique and different from Renviron.site and the user-specific .Renviron files. Do not edit the Renviron file!

  2. A site-wide file, and either a project file or a user file, can be loaded at the same time. It is not possible to use both a user file and a project file. If the project file exists, it will be used instead of the user file.

  3. The environment files are plain-text files in the form name=value. The profile files contain R code (a short sketch of both follows this list).

  4. To double check what environment variables are defined in the R environment, run Sys.getenv().

  5. Do not place things in a profile that limit the reproducibility or portability of your code. For example, setting options(stringsAsFactors = FALSE) is discouraged because it will cause your code to break in mysterious ways in other environments. Other bad ideas include: reading in data, loading packages, and defining functions.
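As a tiny sketch of the distinction in item 3 above (file names and values are purely illustrative):

# ~/.Renviron -- plain name=value pairs, no R code
R_LIBS_USER=~/R/library
MY_API_KEY=not-a-real-key

# ~/.Rprofile -- plain R code, sourced at startup
options(repos = c(CRAN = "https://cran.rstudio.com"))
.First <- function() message("Loaded the user .Rprofile")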

Where to put what?

The R Startup process is very flexible, which means there are different ways to achieve the same results. For example, you may be wondering which environment variables to set in .Renviron versus Renviron.site. (Don’t even think about calling Sys.setenv() in an Rprofile…)

A simple rule of thumb is to answer the question: “When else do I want this variable to be set?”

For example, if you’re on a shared server and you want the settings every time you run R, place .Renviron or .Rprofile in your home directory. If you’re a system admin and you want the settings to take effect for every user, modify Renviron.site or Rprofile.site.

The best practice is to scope these settings as narrowly as possible. That means if you can place code in .Rprofile instead of Rprofile.site, you should! This practice complements the previous warnings about modifying R’s startup. The narrowest scope is to set up the environment within the code, not the profile.

Quiz

What is the best way to modify the path? The answer depends on the desired scope for the change.

For example, in an R project using the Tensorflow package, I might want R to use the version of Python installed in /usr/local/bin instead of /usr/bin. This change is best implemented by reordering the PATH using PATH=/usr/local/bin:${PATH}. This is a change I only want for this project, so I’d place the line in a .Renviron file in the project directory.

On the other hand, I may want to add the JAVA SDK to the path so that any R session can use the rJava package. To do so, I’d add a line like PATH=${PATH}:/opt/jdk1.7.0_75/bin:/opt/jdk1.7.0_75/jre/bin to Renviron.site.

R Startup in RStudio

A common misconception is that R and RStudio are one and the same. RStudio runs on top of R and requires R to be installed separately. If you look at the process list while running RStudio, you’ll see at least two different processes: usually one called RStudio and one called rsession.

RStudio starts R a bit differently than running R from the terminal. Technically, RStudio doesn’t “start” R, it uses R as a library, either as a DLL on Windows or as a shared object on Mac and Linux.

The main difference is that the script wrapped around R’s binary is not run, and any customization to that script will not take effect. To see the script, try:

cat $(which R)

For most people, this difference won’t be noticeable. Any settings in the startup files will still take effect. For users who build R from source, it is important to include the --enable-R-shlib flag to ensure R also builds the shared libraries used by RStudio.

R Startup in RStudio Server Pro

RStudio Server Pro acts differently from R and the open-source version of RStudio. Prior to starting R, RStudio Server Pro uses PAM to create a session, and sources the rsession-profile. In addition, RStudio Server Pro launches R from bash, which means settings defined in the user’s bash profile are available.

In short, RStudio Server Pro provides more ways to customize the environment used by R. You might ask why you’d ever want more options. Recall our rule of thumb: “When else do I want this variable to be set?”

In server environments, there are often environment variables set every time a user interacts with the server. These environment variables are placed in a user’s bash profile by a system admin. Normally R wouldn’t pick up these settings. RStudio Server Pro allows R to make use of the work the system admin has already done by picking up these profiles.

Likewise, there may be some actions that take place on the server when a user logs in that have to happen before R starts. For example, a Kerberos ticket used by the R session to access a data source must exist before R is started. RStudio Server Pro uses PAM sessions to enable these actions.

There may also be actions or variables that should only be defined for RStudio, and not any other time R is run. To facilitate this use case, RStudio Server Pro provides the rsession-profile. For example, if your environment makes use of RStudio Server Pro’s support for multiple versions of R, you’d place any environment variables that should be defined for all versions of R inside of rsession-profile.

Examples: Define proxy settings in Renviron.site

Renviron.site is commonly used to tell R how to access the internet in environments with restricted network access. Renviron.site is used so the settings take effect for all R sessions and users. For example,

http_proxy=http://proxy.mycompany.com
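Depending on the environment, the HTTPS variant and an exception list are often needed as well (the host names below are placeholders):

https_proxy=http://proxy.mycompany.com
no_proxy=localhost,127.0.0.1,.mycompany.com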

This article contains more details on how to configure RStudio to use a proxy.

Add a local CRAN repository for all users

Organizations with offline environments often use local CRAN repositories instead of installing packages directly from a CRAN mirror. Local CRAN repositories are also useful for sharing internally developed R packages among colleagues.

To use a local CRAN repository, it is necessary to add the repository to R’s list of repos. This setting is important for all sessions and users, so Rprofile.site is used.

old_repos <- getOption("repos")
local_CRAN_URI <- paste0("file://", normalizePath("path_to_local_CRAN_repo"))
options(repos = c(old_repos, my_repo = local_CRAN_URI))
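After restarting R, a quick way to check that the internal repository is visible is, for example (a sketch; the my_repo name matches the snippet above):

getOption("repos")   # should now include the my_repo entry
available.packages(contriburl = contrib.url(getOption("repos")["my_repo"]))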

More information on setting up a local CRAN repository is available here.

Record sessionInfo automatically

Reproducibility is a critical part of any analysis done in R. One challenge for reproducible scripts and documents is tracking the version of R packages used during an analysis.

The following code can be added to a .Rprofile file within an RStudio project to automatically log the sessionInfo() after every RStudio session.

This log could be referenced if an analysis needs to be run at a later date and fails due to a package discrepancy.

.Last <- function() {
  if (interactive()) {

    ## check to see if we're in an RStudio project (requires the rstudioapi package)
    if (!requireNamespace("rstudioapi"))
      return(NULL)
    pth <- rstudioapi::getActiveProject()
    if (is.null(pth))
      return(NULL)

    ## append date + sessionInfo to a file called sessionInfoLog
    cat("Recording session info into the project's sessionInfoLog file...")
    info <- capture.output(sessionInfo())
    info <- paste("\n----------------------------------------------",
                  paste0('Session Info for ', Sys.time()),
                  paste(info, collapse = "\n"),
                  sep = "\n")
    f <- file.path(pth, "sessionInfoLog")
    cat(info, file = f, append = TRUE)
  }
}

Automatically turn on packrat

Packrat is an automated tool for package management and reproducible research. Packrat acts as a super-set of the previous example. When a user opts in to using packrat with an RStudio project, one of the things packrat automatically does is create (or modify) a project-specific .Rprofile. Packrat uses the .Rprofile to ensure that each time the project opens, Packrat mode is turned on.

To Wrap Up

R’s startup behavior can be complex, sometimes quirky, but always powerful. At RStudio, we’ve worked hard to ensure that R starts and stops correctly whether you’re running RStudio Desktop, serving a Shiny app on shinyapps.io, rendering a report in RStudio Connect, or supporting hundreds of users and thousands of sessions in a load balanced configuration of RStudio Server Pro.

To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Group-By Modeling in R Made Easy

Tue, 04/18/2017 - 22:29

(This article was first published on R – r4stats.com, and kindly contributed to R-bloggers)

There are several aspects of the R language that make it hard to learn, and repeating a model for groups in a data set used to be one of them. Here I briefly describe R’s built-in approach, show a much easier one, then refer you to a new approach described in the superb book,  R for Data Science, by Hadley Wickham and Garrett Grolemund.

For ease of comparison, I’ll use some of the same examples in that book. The gapminder data set contains a few measurements for countries around the world every five years from 1952 through 2007.

> library("gapminder")
> gapminder
# A tibble: 1,704 × 6
       country continent  year lifeExp      pop gdpPercap
        <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
1  Afghanistan      Asia  1952  28.801  8425333  779.4453
2  Afghanistan      Asia  1957  30.332  9240934  820.8530
3  Afghanistan      Asia  1962  31.997 10267083  853.1007
4  Afghanistan      Asia  1967  34.020 11537966  836.1971
5  Afghanistan      Asia  1972  36.088 13079460  739.9811
6  Afghanistan      Asia  1977  38.438 14880372  786.1134
7  Afghanistan      Asia  1982  39.854 12881816  978.0114
8  Afghanistan      Asia  1987  40.822 13867957  852.3959
9  Afghanistan      Asia  1992  41.674 16317921  649.3414
10 Afghanistan      Asia  1997  41.763 22227415  635.3414
# ... with 1,694 more rows

Let’s create a simple regression model to predict life expectancy from year. We’ll start by looking at just New Zealand.

> library("tidyverse")
> nz <- filter(gapminder,
+              country == "New Zealand")
> nz_model <- lm(lifeExp ~ year, data = nz)
> summary(nz_model)

Call:
lm(formula = lifeExp ~ year, data = nz)

Residuals:
     Min       1Q   Median       3Q      Max
-1.28745 -0.63700  0.06345  0.64442  0.91192

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -307.69963   26.63039  -11.55 4.17e-07 ***
year           0.19282    0.01345   14.33 5.41e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8043 on 10 degrees of freedom
Multiple R-squared:  0.9536,    Adjusted R-squared:  0.9489
F-statistic: 205.4 on 1 and 10 DF,  p-value: 5.407e-08

If we had just a few countries, and we wanted to simply read the output (rather than processing it further) we could write a simple function and apply it using R’s built-in by() function. Here’s what that might look like:

my_lm <- function(df) {
  summary(lm(lifeExp ~ year, data = df))
}

by(gapminder, gapminder$country, my_lm)
...
-----------------------------------------------
gapminder$country: Zimbabwe

Call:
lm(formula = lifeExp ~ year, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-10.581  -4.870  -0.882   5.567  10.386

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 236.79819  238.55797   0.993    0.344
year         -0.09302    0.12051  -0.772    0.458

Residual standard error: 7.205 on 10 degrees of freedom
Multiple R-squared:  0.05623,   Adjusted R-squared:  -0.03814
F-statistic: 0.5958 on 1 and 10 DF,  p-value: 0.458

Since we have so many countries, that wasn’t very helpful. Much of the output scrolled out of sight (I’m showing only the results for the last one, Zimbabwe). But in a simpler case, that might have done just what you needed. It’s a bit more complex than how SAS or SPSS would do it since it required the creation of a function, but it’s not too difficult.

In our case, it would be much more helpful to save the output to a file for further processing. That’s when things get messy. We could use the str() function to study the structure of the output, then write another function to extract the pieces we need, then apply that function, then continue to process the result until we finally end up with a useful data frame of results. Altogether, that is a lot of work! To make matters worse, what you learned from all that is unlikely to generalize to a different function. The output’s structure, parameter names, and so on, are often unique to each of R’s modeling functions.

Luckily, David Robinson made a package called broom that simplifies all that. It has three ways to “clean up” a model, each diving more deeply into its details. Let’s see what it does with our model for New Zealand.

> library("broom")
> glance(nz_model)
  r.squared adj.r.squared     sigma statistic      p.value df
1 0.9535846     0.9489431 0.8043472  205.4459 5.407324e-08  2
     logLik      AIC    BIC deviance df.residual
1 -13.32064 32.64128 34.096 6.469743          10

The glance() function gives us information about the entire model, and it puts it into a data frame with just one line of output. As we head towards doing a model for each country, you can imagine this will be a very convenient format.

To get a bit more detail, we can use broom’s tidy() function to clean up the parameter-level view.

> tidy(nz_model)
         term    estimate   std.error statistic      p.value
1 (Intercept) -307.699628 26.63038965 -11.55445 4.166460e-07
2        year    0.192821  0.01345258  14.33339 5.407324e-08

Now we have a data frame with two rows, one for each model parameter, but getting this result was just as simple to do as the previous example.

The greatest level of model detail is provided by broom’s augment() function. This function adds observation-level detail to the original model data:

> augment(nz_model)
   lifeExp year  .fitted   .se.fit      .resid
1   69.390 1952 68.68692 0.4367774  0.70307692
2   70.260 1957 69.65103 0.3814859  0.60897203
3   71.240 1962 70.61513 0.3306617  0.62486713
4   71.520 1967 71.57924 0.2866904 -0.05923776
5   71.890 1972 72.54334 0.2531683 -0.65334266
6   72.220 1977 73.50745 0.2346180 -1.28744755
7   73.840 1982 74.47155 0.2346180 -0.63155245
8   74.320 1987 75.43566 0.2531683 -1.11565734
9   76.330 1992 76.39976 0.2866904 -0.06976224
10  77.550 1997 77.36387 0.3306617  0.18613287
11  79.110 2002 78.32797 0.3814859  0.78202797
12  80.204 2007 79.29208 0.4367774  0.91192308
         .hat    .sigma      .cooksd  .std.resid
1  0.29487179 0.8006048 0.2265612817  1.04093898
2  0.22494172 0.8159022 0.1073195744  0.85997661
3  0.16899767 0.8164883 0.0738472863  0.85220295
4  0.12703963 0.8475929 0.0004520957 -0.07882389
5  0.09906760 0.8162209 0.0402635628 -0.85575882
6  0.08508159 0.7194198 0.1302005662 -1.67338100
7  0.08508159 0.8187927 0.0313308824 -0.82087061
8  0.09906760 0.7519001 0.1174064153 -1.46130610
9  0.12703963 0.8474910 0.0006270092 -0.09282813
10 0.16899767 0.8451201 0.0065524741  0.25385073
11 0.22494172 0.7944728 0.1769818895  1.10436232
12 0.29487179 0.7666941 0.3811504335  1.35014569

Using those functions was easy. Let’s now get them to work repeatedly for each country in the data set. The dplyr package, by Hadley Wickham and Romain Francois, provides an excellent set of tools for group-by processing. The dplyr package was loaded into memory as part of the tidyverse package used above. First we prepare the gapminder data set by using the group_by() function and telling it what variable(s) make up our groups:

> by_country <-
+   group_by(gapminder, country)

Now any other function in the dplyr package will understand that the by_country data set contains groups, and it will process the groups separately when appropriate. However, we want to use the lm() function, and that does not understand what a grouped data frame is. Luckily, the dplyr package has a do() function that takes care of that problem, feeding any function only one group at a time. It uses the period “.” to represent each data frame in turn. The do() function wants the function it’s doing to return a data frame, but that’s exactly what broom’s functions do.

Let’s repeat the three broom functions, this time by country. We’ll start with glance().

> do(by_country,
+    glance(
+      lm(lifeExp ~ year, data = .)))
Source: local data frame [142 x 12]
Groups: country [142]

       country r.squared adj.r.squared     sigma
        <fctr>     <dbl>         <dbl>     <dbl>
1  Afghanistan 0.9477123     0.9424835 1.2227880
2      Albania 0.9105778     0.9016355 1.9830615
3      Algeria 0.9851172     0.9836289 1.3230064
4       Angola 0.8878146     0.8765961 1.4070091
5    Argentina 0.9955681     0.9951249 0.2923072
6    Australia 0.9796477     0.9776125 0.6206086
7      Austria 0.9921340     0.9913474 0.4074094
8      Bahrain 0.9667398     0.9634138 1.6395865
9   Bangladesh 0.9893609     0.9882970 0.9766908
10     Belgium 0.9945406     0.9939946 0.2929025
# ... with 132 more rows, and 8 more variables:
#   statistic <dbl>, p.value <dbl>, df <int>,
#   logLik <dbl>, AIC <dbl>, BIC <dbl>,
#   deviance <dbl>, df.residual <int>

Now rather than one row of output, we have a data frame with one row per country. Since it’s a data frame, we already know how to manage it. We could sort by R-squared, or correct the p-values for the number of models done using p.adjust(), and so on.
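For example, a quick sketch of that kind of follow-up (the object name country_fits is mine, not from the original post):

country_fits <- do(by_country, glance(lm(lifeExp ~ year, data = .)))

arrange(country_fits, desc(r.squared))           # which countries fit best?
p.adjust(country_fits$p.value, method = "holm")  # adjust p-values for 142 models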

Next let’s look at the grouped parameter-level output that tidy() provides. This will be the same code as above, simply substituting tidy() where glance() had been.

> do(by_country,
+    tidy(
+      lm(lifeExp ~ year, data = .)))
Source: local data frame [284 x 6]
Groups: country [142]

       country        term      estimate    std.error
        <fctr>       <chr>         <dbl>        <dbl>
1  Afghanistan (Intercept)  -507.5342716 40.484161954
2  Afghanistan        year     0.2753287  0.020450934
3      Albania (Intercept)  -594.0725110 65.655359062
4      Albania        year     0.3346832  0.033166387
5      Algeria (Intercept) -1067.8590396 43.802200843
6      Algeria        year     0.5692797  0.022127070
7       Angola (Intercept)  -376.5047531 46.583370599
8       Angola        year     0.2093399  0.023532003
9    Argentina (Intercept)  -389.6063445  9.677729641
10   Argentina        year     0.2317084  0.004888791
# ... with 274 more rows, and 2 more variables:
#   statistic <dbl>, p.value <dbl>

Again, this is a simple data frame allowing us to do whatever we need without learning anything new. We can easily search for models that contain a specific parameter that is significant. In our organization, we search through salary models that contain many parameters to see if gender is an important predictor (hoping to find none, of course).

Finally, let’s augment the original model data by adding predicted values, residuals and so on. As you might expect, it’s the same code, this time with augment() replacing the tidy() function.

> do(by_country,
+    augment(
+      lm(lifeExp ~ year, data = .)))
Source: local data frame [1,704 x 10]
Groups: country [142]

       country lifeExp  year  .fitted   .se.fit
        <fctr>   <dbl> <int>    <dbl>     <dbl>
1  Afghanistan  28.801  1952 29.90729 0.6639995
2  Afghanistan  30.332  1957 31.28394 0.5799442
3  Afghanistan  31.997  1962 32.66058 0.5026799
4  Afghanistan  34.020  1967 34.03722 0.4358337
5  Afghanistan  36.088  1972 35.41387 0.3848726
6  Afghanistan  38.438  1977 36.79051 0.3566719
7  Afghanistan  39.854  1982 38.16716 0.3566719
8  Afghanistan  40.822  1987 39.54380 0.3848726
9  Afghanistan  41.674  1992 40.92044 0.4358337
10 Afghanistan  41.763  1997 42.29709 0.5026799
# ... with 1,694 more rows, and 5 more variables:
#   .resid <dbl>, .hat <dbl>, .sigma <dbl>,
#   .cooksd <dbl>, .std.resid <dbl>

If we were to pull out just the results for New Zealand, we would see that we got exactly the same answer in the group_by result as we did when we analyzed that country by itself.

We can save that augmented data to a file to reproduce one of the residual plots from R for Data Science.

> gapminder_augmented <-
+   do(by_country,
+      augment(
+        lm(lifeExp ~ year, data = .)))
> ggplot(gapminder_augmented, aes(year, .resid)) +
+   geom_line(aes(group = country), alpha = 1 / 3) +
+   geom_smooth(se = FALSE)
`geom_smooth()` using method = 'gam'

This plots the residuals of each country’s model by year by setting “group = country” in geom_line, and then follows it with a smoothed fit (geom_smooth) across all countries (the blue line) by leaving “group = country” out of that layer. That’s a clever approach that I hadn’t thought of before!

The broom package has done several very helpful things. As we have seen, it contains all the smarts needed to extract the important parts of models at three different levels of detail. It doesn’t just do this for linear regression though. R’s methods() function will show you what types of models broom’s functions are currently capable of handling:

> methods(tidy)
 [1] tidy.aareg*            [2] tidy.acf*              [3] tidy.anova*            [4] tidy.aov*
 [5] tidy.aovlist*          [6] tidy.Arima*            [7] tidy.betareg*          [8] tidy.biglm*
 [9] tidy.binDesign*       [10] tidy.binWidth*        [11] tidy.boot*            [12] tidy.brmsfit*
[13] tidy.btergm*          [14] tidy.cch*             [15] tidy.character*       [16] tidy.cld*
[17] tidy.coeftest*        [18] tidy.confint.glht*    [19] tidy.coxph*           [20] tidy.cv.glmnet*
[21] tidy.data.frame*      [22] tidy.default*         [23] tidy.density*         [24] tidy.dgCMatrix*
[25] tidy.dgTMatrix*       [26] tidy.dist*            [27] tidy.ergm*            [28] tidy.felm*
[29] tidy.fitdistr*        [30] tidy.ftable*          [31] tidy.gam*             [32] tidy.gamlss*
[33] tidy.geeglm*          [34] tidy.glht*            [35] tidy.glmnet*          [36] tidy.glmRob*
[37] tidy.gmm*             [38] tidy.htest*           [39] tidy.kappa*           [40] tidy.kde*
[41] tidy.kmeans*          [42] tidy.Line*            [43] tidy.Lines*           [44] tidy.list*
[45] tidy.lm*              [46] tidy.lme*             [47] tidy.lmodel2*         [48] tidy.lmRob*
[49] tidy.logical*         [50] tidy.lsmobj*          [51] tidy.manova*          [52] tidy.map*
[53] tidy.matrix*          [54] tidy.Mclust*          [55] tidy.merMod*          [56] tidy.mle2*
[57] tidy.multinom*        [58] tidy.nlrq*            [59] tidy.nls*             [60] tidy.NULL*
[61] tidy.numeric*         [62] tidy.orcutt*          [63] tidy.pairwise.htest*  [64] tidy.plm*
[65] tidy.poLCA*           [66] tidy.Polygon*         [67] tidy.Polygons*        [68] tidy.power.htest*
[69] tidy.prcomp*          [70] tidy.pyears*          [71] tidy.rcorr*           [72] tidy.ref.grid*
[73] tidy.ridgelm*         [74] tidy.rjags*           [75] tidy.roc*             [76] tidy.rowwise_df*
[77] tidy.rq*              [78] tidy.rqs*             [79] tidy.sparseMatrix*    [80] tidy.SpatialLinesDataFrame*
[81] tidy.SpatialPolygons* [82] tidy.SpatialPolygonsDataFrame*                   [83] tidy.spec*            [84] tidy.stanfit*
[85] tidy.stanreg*         [86] tidy.summary.glht*    [87] tidy.summary.lm*      [88] tidy.summaryDefault*
[89] tidy.survexp*         [90] tidy.survfit*         [91] tidy.survreg*         [92] tidy.table*
[93] tidy.tbl_df*          [94] tidy.ts*              [95] tidy.TukeyHSD*        [96] tidy.zoo*
see '?methods' for accessing help and source code
>

Each of those models contains similar information, but it is often stored in a completely different data structure and named slightly different things, even when the pieces are nearly identical. While that covers a lot of model types, R has hundreds more. David Robinson, the package’s developer, encourages people to request additional ones by opening an issue here.

I hope I’ve made a good case that doing group-by analyses in R can be done easily through the combination of dplyr’s do() function and broom’s three functions. That approach handles the great majority of group-by problems that I’ve seen in my 35-year career. However, if your needs are not met by this approach, then I encourage you to read Chapter 25 of R for Data Science. But as the chapter warns, it will “stretch your brain!”

If your organization is interested in a hands-on workshop that covers many similar topics, please drop me a line. Have fun with your data analyses!

To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Creating Interactive Charts with R, Shiny, MySQL and AnyChart JS via Template

Tue, 04/18/2017 - 21:00

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Data visualization and charting are actively evolving as a more and more important field of web development. In fact, people perceive information much better when it is represented graphically rather than numerically as raw data. As a result, various business intelligence apps, reports, and so on widely implement graphs and charts to visualize and clarify data and, consequently, to speed up and facilitate its analysis for further decision making.

While there are many ways you can follow to handle data visualization in R, today let’s see how to create interactive charts with the help of popular JavaScript (HTML5) charting library AnyChart. It has recently got an official R, Shiny and MySQL template that makes the whole process pretty straightforward and easy. (Disclaimer: I am the CTO at the AnyChart team. The template I am talking about here is released under the Apache 2.0 license; the library itself can be used for free in any personal, educational and other non-profit projects and is open on GitHub, but for commercial purposes it requires a commercial license though is fully functional when taken on a free trial.)

In this step-by-step tutorial, we will take a closer look at the template and the basic pie chart example it includes, and then I will show you how to quickly modify it to get some different data visualization, e.g. a 3D column (vertical bar) chart.

Briefly about AnyChart

AnyChart is a flexible, cross-browser JS charting library for adding interactive charts to websites and web apps in quite a simple way. Basically, it does not require any installation and works with any platform and database. Some more of AnyChart’s features include (but are not limited to):

Templates for popular technology stacks, like R, Shiny and MySQL in the present case, can further facilitate AnyChart’s integration.

Getting started

First of all, let’s make sure the R language is installed. If not, you can visit the official R website and follow the instructions.

If you have worked with R before, most likely you already have RStudio. Then you are welcome to create a project in it now, because the part devoted to R can be done there. If you do not currently have RStudio, you are welcome to install it from the official RStudio website. But, actually, using RStudio is not mandatory, and a plain text editor will be enough in our case.

After that, we should check if MySQL is properly installed. To do that, you can open a terminal window and enter the next command:

$ mysql --version
mysql  Ver 14.14 Distrib 5.7.16, for Linux (x86_64) using  EditLine wrapper

You should receive the response above (or a similar one) to be sure all is well. Please follow these instructions to install MySQL, if you do not have it at the moment.

Now that all the required components have been installed, we are ready to write some code for our example.

Basic template

First, to download the R, Shiny and MySQL template for AnyChart, type the next command in the terminal:

$ git clone https://github.com/anychart-integrations/r-shiny-mysql-template.git

The folder you are getting here features the following structure:

r-shiny-mysql-template/
    www/
        css/
            style.css        # css style
    app.R                    # main application code
    database_backup.sql      # MySQL database dump
    LICENSE
    README.md
    index.html               # html template

Let’s take a look at the project files and examine how this sample works. We’ll run the example first.

Open the terminal and go to the repository folder:

$ cd r-shiny-mysql-template

Set up the MySQL database. To specify your username and password, make use of -u and -p flags:

$ mysql < database_backup.sql

Then run the R command line, using the command below:

$ R

And install the Shiny and RMySQL packages as well as initialize the Shiny library at the end:

> install.packages("shiny")
> install.packages("RMySQL")
> library(shiny)

If you face any problems during the installation of these dependencies, carefully read error messages, e.g. you might need sudo apt-get install libmysqlclient-dev for installing RMySQL.

Finally, run the application:

> runApp("{PATH_TO_TEMPLATE}") # e.g. runApp("/workspace/r-shiny-mysql-template")

And the new tab that should have just opened in your browser shows you the example included in the template:

Basic template: code

Now, let’s go back to the folder with our template to see how it works.

Files LICENSE and README.md contain information about the license and the template (how to run it, technologies, structure, etc.) respectively. They are not functionally important to our project, and therefore we will not explore them here. Please check these files by yourself for a general understanding.

The style.css file is responsible for the styles of the page.

The database_backup.sql file contains the code for the MySQL table and user creation and for writing data to the table. You can use your own table or change the data in this one.

Let’s move on to the code. First, open the app.R file. This file ensures the connection to the MySQL database, reads the data, and passes it to the index.html file, which contains the main code using AnyChart. The following is the part of the app.R code containing the htmlTemplate function; here we specify the name of the file the data will be passed to, the titles of our page and chart, and the JSON-encoded chart data from the MySQL database.

htmlTemplate("index.html",
  title = "Anychart R Shiny template",
  chartTitle = shQuote("Top 5 fruits"),
  chartData = toJSON(loadData())
)

The main thing here is the index.html file, which is actually where the template for creating charts is. As you see, the first part of this file simply connects all necessary files to the code, including the AnyChart library, the CSS file with styles, and so on. I’ll skip this for now and proceed directly to the script tag and the anychart.onDocumentReady (function () {...}) function.

anychart.onDocumentReady(function() {
  var chart = anychart.pie({{ chartData }});
  chart.title({{ chartTitle }});
  chart.container("container");
  chart.draw();
});

This pattern works as follows. We create a pie chart by using the pie() function and pass it the data that have already been read and prepared by the R code. Please note that the names of the variables containing the data are the same in the app.R and index.html files. Then we set the chart title via chart.title({{ chartTitle }}) and specify the ID of the element that will contain the chart, which in this case is a div with id = "container". Finally, to render everything that was coded, we call chart.draw().

Modifying the template to create a custom chart

Now that we’ve explored the basic example included in the template, we can move forward and create our own, custom interactive chart. To do that, we simply need to change the template a little bit and add some features if needed. Let’s see how it works.

First, we create all the necessary files by ourselves or make a new project using RStudio.

Second, we add a project folder named anychart. Its structure should look as illustrated below. Please note that some differences are possible (and acceptable) if you are using a new project in RStudio.

anychart/
    www/
        css/
            style.css        # css style
    ui.R                     # main application code
    server.R                 # sub code
    database_backup.sql      # data set
    index.html               # html template

Now you know what files you need. If you’ve made a project with the studio, the ui.R and server.R files are created automatically. If you’ve made a project by yourself, just create empty files with the same names and extensions as specified above.

The main difference from the original example included in the template is that we should change the file index.html and divide app.R into parts. You can copy the rest of the files or create new ones for your own chart.

Please take a look at the file server.R. If you’ve made a project using the studio, it was created automatically and you don’t need to change anything in it. However, if you’ve made it by yourself, open it in the Notepad and add the code below, which is standard for the Shiny framework. You can read more about that here.

The file structure of ui.R is similar to the one of app.R, so you can copy app.R from the template and change/add the following lines:

loadData = dbGetQuery(db, "SELECT name, value FROM fruits")

data1 <- character()

#data preparation
for (var in 1:nrow(loadData)) {
  c = c(as.character(loadData[var, 1]), loadData[var, 2])
  data1 <- c(data1, c)
}
data = matrix(data1, nrow = nrow(loadData), ncol = 2, byrow = TRUE)

ui = function(){
  htmlTemplate("index.html",
    title = "Anychart R Shiny template",
    chartTitle = shQuote("Fruits"),
    chartData = toJSON(data)
  )}

Since we are going to change the chart type, from pie to 3D vertical bar (column), the data needs some preparation before being passed to index.html. The main difference is that we will use the entire data from the database, not simply the top 5 positions.

We will slightly modify and expand the basic template. Let’s see the resulting code of the index.html first (the script tag) and then explore it.

anychart.onDocumentReady(function() {
  var chart = anychart.column3d({{ chartData }});
  chart.title({{ chartTitle }});
  chart.animation(true);

  var xAxis = chart.xAxis();
  xAxis.title("fruits");

  var yAxis = chart.yAxis();
  yAxis.title("pounds, t");

  var yScale = chart.yScale();
  yScale.minimum(0);
  yScale.maximum(120);

  chart.container("container");
  chart.draw();
});

With the help of var chart = anychart.column3d({{chartData}}), we are creating a 3D column chart by using the function column3d(). Here you can choose any other chart type you need; consider getting help from Chartopedia if you are unsure which one works best in your situation.

Next, we are adding animation to the column chart via chart.animation (true) to make it appear on page load gradually.

In the following section, we create two variables, xAxis and yAxis. These are required if you want to give the chart’s coordinate axes captions: you create a variable for each axis and then call its title() method with the value you want to see.

The next unit is basically optional. We are explicitly specifying the maximum and minimum values for the Y axis, or else AnyChart will independently calculate these values. You can do that the same way for the X axis.

And that’s it! Our 3D column chart is ready, and all seems to be fine for successfully running the code. The only thing left to do before that is to change the MySQL table to make it look as follows:

('apple',100), ('orange',58), ('banana',81), ('lemon',42), ('melon',21), ('kiwi',66), ('mango',22), ('pear',48), ('coconut',29), ('cherries',65), ('grapes',31), ('strawberries',76),

To see what you’ve got, follow the same steps as for running the R, Shiny and MySQL template example, but do not forget to change the path to the folder and the folder name to anychart. So, let’s open the terminal and command the following:

$ cd anychart
$ mysql < database_backup.sql
$ R
> install.packages("shiny")
> install.packages("RMySQL")
> library(shiny)
> runApp("{PATH_TO_TEMPLATE}") # e.g. runApp("/workspace/anychart")

For consistency purposes, I am including the code of ui.R and server.R below. The full source code of this example can be found on GitHub.

ui.R:

library(shiny)
library(RMySQL)
library(jsonlite)

data1 <- character()

db = dbConnect(MySQL(),
               dbname = "anychart_db",
               host = "localhost",
               port = 3306,
               user = "anychart_user",
               password = "anychart_pass")

loadData = dbGetQuery(db, "SELECT name, value FROM fruits")

#data preparation
for (var in 1:nrow(loadData)) {
  c = c(as.character(loadData[var, 1]), loadData[var, 2])
  data1 <- c(data1, c)
}
data = matrix(data1, nrow = nrow(loadData), ncol = 2, byrow = TRUE)

server = function(input, output){}

ui = function(){
  htmlTemplate("index.html",
    title = "Anychart R Shiny template",
    chartTitle = shQuote("Fruits"),
    chartData = toJSON(data)
  )}

shinyApp(ui = ui, server = server)

server.R:

library(shiny)

shinyServer(function(input, output) {
  output$distPlot <- renderPlot({

    # generate bins based on input$bins from ui.R
    x    <- faithful[, 2]
    bins <- seq(min(x), max(x), length.out = input$bins + 1)

    # draw the chart with the specified number of bins
    hist(x, breaks = bins, col = 'darkgray', border = 'white')
  })
})

Conclusion

When your technology stack includes R, Shiny and MySQL, using AnyChart JS with the integration template we were talking about in this tutorial requires no big effort and allows you to add beautiful interactive JavaScript-based charts to your web apps quite quickly. It is also worth mentioning that you can customize the look and feel of charts created this way as deeply as needed by using some of the library’s numerous out-of-the-box features: add or remove axis labels, change the background color and how the axis is positioned, leverage interactivity, and so on.

The scope of this tutorial is likely to be actually even broader, because the process I described here not only applies to the AnyChart JS charting library, but also is mostly the same for its sister libraries AnyMap (geovisualization in maps), AnyStock (date/time graphs), and AnyGantt (charts for project management). All of them are free for non-profit projects but – I must put it clearly here again just in case – require a special license for commercial use.

I hope you find this article helpful in your activities when it comes to interactive data visualization in R. Now ask your questions, please, if any.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Warren Buffett Shareholder Letters: Sentiment Analysis in R

Tue, 04/18/2017 - 20:38

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Warren Buffett — known as the "Oracle of Omaha" — is one of the most successful investors of all time. Wherever the winds of the market may blow, he always seems to find a way to deliver impressive returns for his investors and his company, Berkshire Hathaway. Every year he authors his famous "shareholder letter" with his musings about the market and investment strategy, and, perhaps reflecting his continued success, this sentiment analysis of his letters by data scientist Michael Toth shows that the tone has been generally positive over time. Only five of the forty years of letters show an average negative sentiment: those correspond to market downturns in 1987, 1990, 2001/2002 and 2008.

Michael used the R language to generate a sentiment score for each letter, and the process was surprisingly simple (you can find the R code here). The letters are published as PDF documents, from which the text can be extracted using the pdf_text function in the pdftools package. Then you can use the tidytext package to decompose the letters into individual words, whose Bing sentiment score can be calculated using its get_sentiments function. From there, a simple ggplot2 bar chart is used to show the average sentiment scores for each letter.
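A minimal sketch of that pipeline for a single letter, assuming it is saved locally as letter.pdf (this is not Michael Toth's actual code; see the link below for that):

library(pdftools)
library(dplyr)
library(tidytext)

# read the PDF: one character string per page (hypothetical local file name)
pages <- pdf_text("letter.pdf")

# one row per word
words <- data.frame(text = pages, stringsAsFactors = FALSE) %>%
  unnest_tokens(word, text)

# net Bing sentiment: positive words minus negative words
words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  summarise(score = sum(sentiment == "positive") - sum(sentiment == "negative"))

Repeating this over all forty letters and plotting the scores with a ggplot2 bar chart then gives the picture described above.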

For more on the sentiment of Warren Buffett's shareholder letters, including an analysis of the most-used positive and negative words, follow the link to the complete blog post below.

Michael Toth: Sentiment Analysis of Warren Buffett's Letters to Shareholders

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

MODIStsp (v 1.3.2) is on CRAN !

Tue, 04/18/2017 - 17:32

(This article was first published on Spatial Processing in R, and kindly contributed to R-bloggers)

We are glad to report that MODIStsp is now also available on CRAN! From now on, you can install it simply with:

install.packages("MODIStsp")

In v1.3.2 we also added the ability to automatically apply scale and offset coefficients to the original MODIS values, according to the specifications of each MODIS product. When the new “Scale output values” option is set to “Yes”, scale factors and offsets are applied (where they exist).

In this case, for example, Land Surface Temperature values in the output rasters will be in kelvin (K), and spectral indices will be floating-point values (e.g., NDVI will range between -1 and 1 instead of between -10000 and 10000). We also fixed a few bugs, in particular affecting FTP download, and renamed some output layers to shorten and homogenize output file names and correct a few errors.
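
For context, this is the kind of conversion the option performs for you. Done by hand on an NDVI layer it would look roughly like this (the file name is made up; 0.0001 is the documented MOD13 NDVI scale factor):

library(raster)

# Hypothetical MODIStsp output: NDVI stored as scaled integers (-10000..10000)
ndvi_int <- raster("MOD13Q1_NDVI_2016_001.tif")

# Apply the MOD13 scale factor to get NDVI in its native -1..1 range
ndvi <- ndvi_int * 0.0001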

The changelog for v1.3.2 can be found HERE

We hope you will find the new version useful and that we didn’t introduce too many bugs! Please report any problems on our GitHub issues page.

The `development` version of MODIStsp, containing the latest updates and bug fixes, will still be available on GitHub. It can be installed using:

library(devtools)
install_github("lbusett/MODIStsp", ref = "master")


“MODIStsp” is an R package that allows automatic download and preprocessing of MODIS Land Products time series – you can find additional information here
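
As a quick orientation for new users, a minimal sketch of getting started (processing options such as products, tiles and dates are then chosen interactively in the GUI):

library(MODIStsp)

# Launch the MODIStsp interactive GUI
MODIStsp()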

To leave a comment for the author, please follow the link and comment on their blog: Spatial Processing in R.

A novel method for modelling interaction between categorical variables

Tue, 04/18/2017 - 10:00

(This article was first published on Rense Nieuwenhuis » R-Project, and kindly contributed to R-bloggers)

We have been developing weighted effect coding in an ongoing series of publications (hint: a publication in the R Journal will follow). To include nominal and ordinal variables as predictors in regression models, their categories first have to be transformed into so-called ‘dummy variables’. Many transformations are available; a popular one is ‘dummy coding’, in which the estimates represent deviations from a preselected ‘reference category’.

To avoid choosing a reference category, weighted effect coding provides estimates representing deviations from the sample mean. This is particularly useful when the data are unbalanced (i.e., when categories hold different numbers of observations). The basics of this technique, with applications in R, were detailed here.

In a new publication, we show that weighted effect coding can also be applied to regression models with interaction effects (also commonly referred to as moderation). The weighted effect coded interactions represent the additional effects over and above the main effects obtained from the model without these interactions.

To apply the weighted effect coding procedures introduced in these papers, implementations are available for R, SPSS, and Stata. For R, we created the ‘wec’ package, which can be installed by typing:

install.packages("wec")
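
For illustration, a minimal sketch contrasting dummy coding with weighted effect coding. The data are simulated and the variable names are made up; I am assuming the package's contr.wec() function, called with the factor and the omitted category:

library(wec)

# Simulated outcome and an unbalanced categorical predictor (illustrative only)
set.seed(1)
d <- data.frame(
  y     = rnorm(300),
  group = factor(sample(c("A", "B", "C"), 300, replace = TRUE, prob = c(0.6, 0.3, 0.1)))
)

# Default dummy coding: coefficients are deviations from the reference category "A"
summary(lm(y ~ group, data = d))

# Weighted effect coding: coefficients are deviations from the (weighted) sample mean
contrasts(d$group) <- contr.wec(d$group, "A")
summary(lm(y ~ group, data = d))
# (the package also provides helpers for weighted-effect-coded interactions; see the references)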

References

Grotenhuis, M., Pelzer, B., Eisinga, R., Nieuwenhuis, R., Schmidt-Catran, A., & Konig, R. (2017). A novel method for modelling interaction between categorical variables. International Journal of Public Health, 62(3), 427–431. https://doi.org/10.1007/s00038-016-0902-0

Grotenhuis, M., Pelzer, B., Eisinga, R., Nieuwenhuis, R., Schmidt-Catran, A., & Konig, R. (2017). When size matters: advantages of weighted effect coding in observational studies. International Journal of Public Health, 62(1), 163–167. https://doi.org/10.1007/s00038-016-0901-1

To leave a comment for the author, please follow the link and comment on their blog: Rense Nieuwenhuis » R-Project.
