
See you in Barcelona this summer

Thu, 03/07/2019 - 01:00

(This article was first published on Ilya Kashnitsky, and kindly contributed to R-bloggers)

Have you been feeling lately that you are missing out on the coolest skill set in academia?

Here is your chance to cut in and dive into R.

In July, the BaRcelona Summer School of Demography welcomes dedicated scholars, aspiring or established, to help them migrate to the world of new oppoRtunities.

The school consists of 4 modules. You can take them all or choose specific ones. The first, instructed by Tim Riffe, introduces the basics of R. The second, instructed by myself, focuses on data visualization in general, with a slight tilt towards population data. The third, instructed by Marie-Pier Bergeron Boucher, presents the foundations of demographic analysis with R. Even if you are not (yet) a demographer, these methods are very general and usable in a wide range of social science disciplines. Finally, the fourth course, instructed by Juan Galeano, teaches powerful ways to unleash the spatial dimension of data analysis.

There are still several places available, the call closes on March 31st.

I’ll be happy to see you in sunny Catalonia!



76th Tokyo.R Users Meetup Roundup!

Thu, 03/07/2019 - 01:00

(This article was first published on R by R(yo), and kindly contributed to R-bloggers)



The 76th Tokyo R User Meetup happened on
March 2nd, graciously hosted by DeNA (an entertainment and
e-commerce company) in their lovely headquarters located in Shibuya.





(Photo courtesy of Takashi Minoda)

On this day another R User Meetup was also happening up in Sapporo,
Hokkaido. You can check them out
here. Although
this was the second Tokyo.R of 2019 I wasn’t able to attend the one in
January as I was at the RStudio::Conf in Austin, Texas… a long way from
home! Similar to my roundup blog
post

of the talks at Japan.R I
will be going through around half of all the talks. Hopefully, my
efforts will help spread the vast knowledge of Japanese R users to the
wider R community. Throughout I will also post helpful blog posts and
links from other sources if you are interested in learning more about
the topic of a certain talk. You can follow Tokyo.R by searching for the
#TokyoR hashtag on
Twitter.

Unlike at most R meetups, a lot of people present using just their Twitter
handles, so I’ll mostly be referring to them by those instead. I’ve been
going to events here in Japan for about a year but even now sometimes
I’m like, “Whoahh that’s what
@very_recognizable_twitter_handle_in_the_japan_r_community actually
looks like?!” Anyways…

Let’s get started!

Beginner Tutorials

Every Tokyo.R session starts off with three talks given by members of the
organizing team, who go over some of the very basic aspects of R for
beginner users. These talks are given by very experienced R users and
are a way to let newbies feel comfortable before diving into the real-world
applications of R in the main talks and LTs happening later on.

In this edition of Tokyo.R:

  • ktatyamtema gave a talk on R
    basics, from downloading and installing packages to reading data
    files into R and saving outputs from R.

  • Slides

  • Next, kilometer00 gave a talk on
    data pipelines, specifically focusing on proper coding style and
    the thought process behind good programming. He showed some really great
    examples, like in the picture below, of writing code that is easy to
    understand by using the tidyverse verbs and pipes (a minimal sketch of
    this style follows after this list). His entire slideshow has tons of
    great images on how to visualize programming in R, which you really
    don’t need Japanese to understand, so I recommend beginners have a look
    through them!

  • Slides



  • Finally, koriakane gave a talk on
    plotting and visualizing data using both base R and ggplot2. She
    carefully explained step-by-step the process of creating different
    types of graphs and working with colors and scales. At the end of
    the slides there is a huge list of ggplot2 resources in Japanese
    that would be very helpful to Japanese R users.

  • Slides
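For readers completely new to R, here is a minimal sketch (my own, not taken from the slides) of the kind of read/transform/save workflow with tidyverse verbs and pipes that these beginner sessions cover; the file and column names are hypothetical:

library(dplyr)
library(readr)

# read a data file into R (hypothetical file and columns)
measurements <- read_csv("my_data.csv")

# transform it step by step with tidyverse verbs and pipes
summary_tbl <- measurements %>%
  filter(!is.na(value)) %>%             # drop incomplete rows
  group_by(group) %>%                   # one row per group in the result
  summarise(mean_value = mean(value))

# save the output from R
write_csv(summary_tbl, "my_summary.csv")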

Talks

kato_kohaku: Model-Agnostic Explanations

In the first main talk of the day, @kato_kohaku dived deep into
model-agnostic explanations using the DALEX, iml, and mlr
packages. One of the problems seen in the ML field is the growing
complexity of models as researchers have been able to push the limits of
what they can do with increased computational power and the consequent
discovery of new methods. The high performance of these complex models
has come at a high cost, with interpretability reduced dramatically and
many of these newer models being called “black boxes”
for that very reason. A model-agnostic method is preferable to
model-specific methods mainly due to their flexibility, as typically
data scientists evaluate many different types of ML models to solve a
task. A model-agnostic method allows you to compare these types of
models using the same method in a way that a model-specific method
can’t.



@kato_kohaku went over the workflow for performing model-agnostic
interpretation and covered partial dependence plots (PDP), individual
conditional expectation (ICE), permutation importance, accumulated local
effects plot (ALE), feature interaction, LIME, Shapley values, and more.
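As a rough idea of what such a model-agnostic workflow looks like in practice, here is a minimal sketch (my own, not from the talk) using the iml package, one of the packages covered; the random forest on iris is only a stand-in model:

library(iml)
library(randomForest)

# fit any "black box" model; a random forest on iris is just a stand-in here
rf <- randomForest(Species ~ ., data = iris)

# wrap the model and data in a model-agnostic Predictor object
predictor <- Predictor$new(rf, data = iris[, -5], y = iris$Species)

# permutation feature importance (loss = classification error)
imp <- FeatureImp$new(predictor, loss = "ce")
plot(imp)

# partial dependence plot plus ICE curves for a single feature
eff <- FeatureEffect$new(predictor, feature = "Petal.Length", method = "pdp+ice")
plot(eff)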



The topics he covered are well explained in Christoph
Molnar’s
excellent book,
Interpretable Machine
Learning
which
@kato_kohaku referred to throughout the presentation. There is a HUGE
number of slides (146 of them!) filled with a ton of great info that you
can read (a lot of the slides have explanations taken straight from
the documentation in English), so I highly recommend taking a look
through them if you are interested in what the DALEX and iml
packages have to offer for interpreting models.

A great code-through explanation of using DALEX with mlr in English
can be found
here
using the same data set as seen in @kato_kohaku's slides.

y_mattu: Operators/Objects in R

One of the organizers of Tokyo.R, @y_mattu, presented on objects in R.
Specifically he went over using the pryr and lobstr packages to dig inside R
objects and see what is happening “under the hood” of your everyday R
operations.

“Every object in R is a function, every function in R is an object”

The above maxim means that even an operator such as + can be called like
any ordinary function, by quoting its name with backticks and placing all the arguments in parentheses:
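A quick sketch (not from the slides) of what that looks like at the console:

# calling the + operator like an ordinary function, with its arguments in parentheses
`+`(1, 2)
## [1] 3

# the same trick works for other operators, e.g. assignment
`<-`(x, 10)   # equivalent to x <- 10
x
## [1] 10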



Looking deeper @y_mattu used the ast() function from the lobstr
package to see the abstract syntax tree of the R expression that was
shown above, 1 + 2.

library(lobstr)
lobstr::ast(1 + 2)
## o-`+`
## +-1
## \-2

The above shows the exact order in which the functions are being run by
R. To now understand what is happening when we run this operation we
need to look at the R environment. To check which environment holds the
`+` operator:

library(pryr)
##
## Attaching package: 'pryr'
## The following objects are masked from 'package:lobstr':
##
##     ast, mem_used
pryr::where("+")
## <environment: base>

We find that the base package holds this operator and that it is called
from the base environment. In the final part, @y_mattu looked into
the + operator itself via .Primitive() as well as
pryr::show_c_source(), which shows the C source code that allows R
to run +.
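As a quick illustration (not from the talk slides), simply printing the operator already reveals that it is a primitive, i.e. implemented in C inside base R:

`+`
## function (e1, e2)  .Primitive("+")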

This was a very technical topic (for me) but it piqued my interest on
what’s actually happening whenever you run a line of R code!

bob3bob3: DeNA

At every Tokyo.R the hosting company is given time to talk about their
own company, how they use R, and hopefully provide some information for
any interested job seekers. For DeNA,
@bob3bob3 gave this talk, providing some details about what
exactly DeNA does as well as his own LT on SEM using lavaan. DeNA is an
entertainment/e-commerce firm best known for its cellphone
platform, Mobage. Interestingly, they also took ownership of
MyAnimeList a few years back (probably one of
the largest anime/manga database communities in the world). For job
seekers, he talked about the large variety of positions DeNA has
available in the “Kaggler” category as well as open positions in the
automobile, healthcare, sports analytics, HR analytics, marketing
research departments, and more…!

Following his elevator pitch about DeNA, he gave a small talk about using
lavaan to plot out path analyses for structural equation modeling.
@bob3bob3 explained how he ended up creating his own plotting function,
using the DiagrammeR package (which wraps Graphviz) to visualize the lavaan
output, as he did not like the default plotting method.



Lightning Talks

flaty13: Tidy Time-Series Analysis

@flaty13, who has also recently presented at
Japan.R
and SportsAnalyst
Meetup

on tennis analytics, gave a talk on analyzing time-series data with R.
He first talked about how packages like lubridate and dplyr, while
useful, may not be the best way to handle time series data. The solution
@flaty13 talked about was the tsibble package created by Earo Wang.
At RStudio::Conf 2019, Earo gave a
talk on this package and using tidy data principles with time series
data which you can watch
here.

@flaty13 used his own pedometer data from a healthcare app on his
iPhone for his demonstration. After reading the data in and performing
the usual tidyverse operations on it, the data frame was turned into a
tsibble object and then visualized as a calendar plot using the
sugrrants package (also by Earo Wang).
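For a rough idea of that workflow, here is a minimal sketch (my own, with made-up hourly step counts and column names standing in for the pedometer export):

library(dplyr)
library(lubridate)
library(tsibble)
library(sugrrants)
library(ggplot2)

# made-up hourly step counts standing in for the health app export
steps <- tibble(
  time  = seq(as.POSIXct("2019-01-01 00:00"), by = "hour", length.out = 24 * 56),
  count = rpois(24 * 56, lambda = 300)
)

# declare the time index to get a tsibble
steps_ts <- as_tsibble(steps, index = time)

# calendar plot: one small panel per day
steps_ts %>%
  mutate(date = as.Date(time), hour = hour(time)) %>%
  ggplot(aes(x = hour, y = count)) +
  geom_line() +
  facet_calendar(~ date)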



saltcooky: Organizing an R Study Group at My Company!

@saltcooky took the time to talk to us about something that doesn’t
usually get mentioned at Tokyo.R, as he reported on the success of an
intra-company R workshop series he hosted. At @saltcooky's company the
majority of his co-workers are Pythonistas, with only him and three
other co-workers being R users. Hoping to change this dynamic,
especially as their company does a lot of data analytics, @saltcooky
set out to create some workshops. What he came up with were three
separate sessions heavily inspired by the Tokyo.R format that I talked
about in the “Beginner Tutorials” section.

  • The first session was around an hour on R basics and on what exactly
    you can do with R, which slowly got the Pythonistas interested in
    using R for various analytical tasks.
  • The second was a tidyverse data handling/processing session with some
    hands-on exercises, with help from @y_mattu.
  • The third session was on using ggplot2 for visualization.

Throughout the workshops @saltcooky was asked some peculiar questions
like “Is there a difference in using . vs. _ in separating words in
a function/object name?” and “Why are there so many packages/functions
with the same functionality!?”.

One of the major hurdles that @saltcooky faced was in installing R for
all the different OSes that his co-workers used. The solution he came up
with was to use RStudio Cloud. This eased the burden for him as he
didn’t need to set up or manage any servers while the students did not
need to install any software at all! There was actually a great talk on
using “RStudio Cloud for
Education”

by Mel Gregory at RStudio::Conf 2019 a few months ago, and it’s
a great resource for others thinking about holding workshops.

@saltcooky concluded that his workshops were a mild success, as he was
able to get a couple more people using R casually at his workplace.
Although Python remains dominant, he looks forward to convincing more
people to use R in the future.

moratoriamuo271: Topic Modeling Cooking Recipes!

Continuing the theme of “tidy” data analysis, @moratoriamuo271 applied
the concept to text analysis. The motivation for this talk came from the
difficulty and hassle of figuring out a nice set of meals to eat over
the course of a week. To solve this problem he sought to create a
recommendation engine for recipes!



As seen in the above flowchart @moratoriamuo271:

  1. Web scraped recipes using rvest
  2. Created some word-clouds for some EDA
  3. Used the RMeCab and tm packages to create an organized document-term matrix
    (RMeCab is a package specifically for Japanese text analysis)
  4. Ran Latent Dirichlet Allocation with the topicmodels and ldatuning packages
  5. Finally, split recipes into categories with tidytext

Before he showed us the results of his work, @moratoriamuo271 took us
through a crash course on various topic modeling techniques, from the
basic unigram model, to mixtures of unigram models, and finally to
Latent Dirichlet Allocation (LDA).
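If you want to try the basic LDA step yourself, here is a minimal, self-contained sketch (not the talk’s code) using the topicmodels package with its bundled AssociatedPress document-term matrix as a stand-in for the recipe data:

library(topicmodels)

# a toy document-term matrix standing in for the recipe DTM
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:100, ]

# fit an LDA model with, say, 5 topics
lda_fit <- LDA(dtm, k = 5, control = list(seed = 1234))

# inspect the top terms per topic and the per-document topic probabilities
terms(lda_fit, 5)
head(posterior(lda_fit)$topics)

# perplexity on held-out documents can help choose the number of topics
perplexity(lda_fit, newdata = AssociatedPress[101:150, ])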



He also went over the process in which he decided on the optimal number
of topics for his recommendation engine. This was done by looking at the
perplexity values from the ldatuning package.
Here is a
great blog post by Peter Ellis on
using cross-validation on perplexity to determine the optimal number
of topics. Below is the final finished product that gives you recipes
for nutritious and balanced meals for seven dinners!



@moratoriamuo271 has also released a blog post with ALL the code that
you can check out
here!

Other Talks

I couldn’t go through all of the talks, but I will provide their slides
below (if/when they become available).

Conclusion

After the talks, everyone got together for a little after-party over
food and drinks. Usually pizza is served but this time was a bit more
fancy with kara-age and cheese-on-crackers being served. As the night
wore on R users from all over Tokyo talked about their successes and
struggles with R.

Unfortunately, there is only so much I can do to translate the talks,
especially as Tokyo.R doesn’t do recordings anymore, but I hope that I
could be of some help and maybe you’ll be inspired by a code snippet
there or a certain package name elsewhere, etc.! Tokyo.R happens almost
monthly and it’s a great way to mingle with Japanese R users as it is
the largest regular meetup here in Japan. Talks in English are also
welcome so if you’re ever in Tokyo come join us!



The ggforce Awakens (again)

Thu, 03/07/2019 - 01:00

(This article was first published on Data Imaginist, and kindly contributed to R-bloggers)

After what seems like a lifetime (at least to me), a new feature release of
ggforce is available on CRAN. ggforce is my general purpose extension package
for ggplot2, my first early success, what got me on twitter in the first place,
and ultimately instrumental in my career move towards full-time software/R
development. Despite this pedigree, ggforce hasn’t really received much love in
the form of a feature release since, well, since it was released. One of the
reasons for this is that after the first release I began pushing changes to
ggplot2 that allowed for different stuff I wanted to do in ggforce, so the
release of the next ggforce version became tied to the release of ggplot2. This
doesn’t happen every day, and when it eventually transpired, I was deep in
patchwork and gganimate development, and couldn’t take time off to run the last
mile with ggforce. In the future I’ll probably be more conservative with my
ggplot2 version dependency, or at least keep it out of the main branch until a
ggplot2 release is in sight.

Enough excuses though, a new version is finally here and it’s a glorious one.
Let’s celebrate! This version brings both a slew of refinements to existing
functionality and a vast expanse of new features, so there’s plenty to
dig into.

New features

This is why we’re all here, right? The new and shiny! Let’s get going; the list
is pretty long.

The Shape of Geoms

Many of the new and current geoms and stats in ggforce are really there to allow
you to draw different types of shapes easily. This means that the workhorse of
these has been geom_polygon(), while ggforce provided the means to describe
the shapes in meaningful ways (e.g. wedges, circles, thick arcs). With the new
release all of these geoms (as well as the new ones) will use the new
geom_shape() under the hood. The shape geom is an extension of the polygon one
that allows a bit more flourish in how the final shape is presented. It does
this by providing two additional parameters: expand and radius, which will
allow fixed unit expansion (and contraction) of the polygons as well as rounding
of the corners based on a fixed unit radius. What do I mean by fixed unit?
In the same way that the points in geom_point stay the same size when the plot
is resized, so do the corner radius and the expansion of the polygon.

Let us modify the geom_polygon() example to use geom_shape() to see what it
is all about:

library(ggforce)
ids <- factor(c("1.1", "2.1", "1.2", "2.2", "1.3", "2.3"))
values <- data.frame(
  id = ids,
  value = c(3, 3.1, 3.1, 3.2, 3.15, 3.5)
)
positions <- data.frame(
  id = rep(ids, each = 4),
  x = c(2, 1, 1.1, 2.2, 1, 0, 0.3, 1.1, 2.2, 1.1, 1.2, 2.5,
        1.1, 0.3, 0.5, 1.2, 2.5, 1.2, 1.3, 2.7, 1.2, 0.5, 0.6, 1.3),
  y = c(-0.5, 0, 1, 0.5, 0, 0.5, 1.5, 1, 0.5, 1, 2.1, 1.7,
        1, 1.5, 2.2, 2.1, 1.7, 2.1, 3.2, 2.8, 2.1, 2.2, 3.3, 3.2)
)
datapoly <- merge(values, positions, by = c("id"))

# Standard look
ggplot(datapoly, aes(x = x, y = y)) +
  geom_polygon(aes(fill = value, group = id))

# Contracted and rounded
ggplot(datapoly, aes(x = x, y = y)) +
  geom_shape(aes(fill = value, group = id),
             expand = unit(-2, 'mm'), radius = unit(5, 'mm'))

If you’ve never needed this, it may be the kind of thing that makes you go
“why even bother”, but if you’ve ever needed to venture into Adobe Illustrator to add
this kind of flourish, it is definitely something where you’ll appreciate avoiding
that round-trip. And remember: you can use geom_shape() in anything that expects a
geom_polygon — not just the geoms from ggforce.

More shape primitives

While geom_shape() is the underlying engine for drawing, ggforce adds a bunch
of new shape parameterisations, which we will quickly introduce:

geom_ellipse makes, you guessed it, ellipses. Apart from standard ellipses it
also offers the possibility of making super-ellipses, so if you’ve been dying to
draw those with ggplot2, now is your time to shine.

# Not an ordinary ellipse — a super-ellipse
ggplot() +
  geom_ellipse(aes(x0 = 0, y0 = 0, a = 6, b = 3, angle = -pi / 3, m1 = 3)) +
  coord_fixed()

geom_bspline_closed allows you to draw closed b-splines. It takes the same
type of input as geom_polygon but calculates a closed b-spline from the corner
points instead of just connecting them.

# Create 6 random control points
controls <- data.frame(
  x = runif(6),
  y = runif(6)
)
ggplot(controls, aes(x, y)) +
  geom_polygon(fill = NA, colour = 'grey') +
  geom_point(colour = 'red') +
  geom_bspline_closed(alpha = 0.5)

geom_regon draws regular polygons of a set radius and number of sides.

ggplot() + geom_regon(aes(x0 = runif(8), y0 = runif(8), sides = sample(3:10, 8), angle = 0, r = runif(8) / 10)) + coord_fixed()

geom_diagonal_wide draws thick diagonals (quadratic bezier paths with the two
control points pointing towards each other but perpendicular to the same axis)

data <- data.frame(
  x = c(1, 2, 2, 1, 2, 3, 3, 2),
  y = c(1, 2, 3, 2, 3, 1, 2, 5),
  group = c(1, 1, 1, 1, 2, 2, 2, 2)
)
ggplot(data) +
  geom_diagonal_wide(aes(x, y, group = group))

Is it a Sankey? Is it an Alluvial? No, It’s a Parallel Set

Speaking of diagonals, one of the prime uses of this is for creating parallel
sets visualizations. There’s a fair bit of nomenclature confusion with this,
so you may know this as Sankey diagrams, or perhaps alluvial plots. I’ll insist
that Sankey diagrams are specifically for following flows (and often employ a
looser positioning of the axes) and alluvial plots are for following
temporal changes, but we can all be friends no matter what you call it. ggforce
allows you to create parallel sets plots with a standard layered geom approach
(for another approach to this problem, see
the ggalluvial package). The main
problem is that data for parallel sets plots are usually not represented very
well in the tidy format expected by ggplot2, so ggforce further provides a
reshaping function to get the data in line for plotting:

titanic <- reshape2::melt(Titanic)
# This is how we usually envision data for parallel sets
head(titanic)
##   Class    Sex   Age Survived value
## 1   1st   Male Child       No     0
## 2   2nd   Male Child       No     0
## 3   3rd   Male Child       No    35
## 4  Crew   Male Child       No     0
## 5   1st Female Child       No     0
## 6   2nd Female Child       No     0

# Reshape for putting the first 4 columns as axes in the plot
titanic <- gather_set_data(titanic, 1:4)
head(titanic)
##   Class    Sex   Age Survived value id     x    y
## 1   1st   Male Child       No     0  1 Class  1st
## 2   2nd   Male Child       No     0  2 Class  2nd
## 3   3rd   Male Child       No    35  3 Class  3rd
## 4  Crew   Male Child       No     0  4 Class Crew
## 5   1st Female Child       No     0  5 Class  1st
## 6   2nd Female Child       No     0  6 Class  2nd

# Do the plotting
ggplot(titanic, aes(x, id = id, split = y, value = value)) +
  geom_parallel_sets(aes(fill = Sex), alpha = 0.3, axis.width = 0.1) +
  geom_parallel_sets_axes(axis.width = 0.1) +
  geom_parallel_sets_labels(colour = 'white')

As can be seen, the parallel sets plot consists of several layers, which is
something required for many more involved, composite plot types. Separating
them into multiple layers gives you more freedom without over-polluting the
argument and aesthetic list.

The markings of a great geom

If there is one thing of general utility lacking in ggplot2 it is probably the
ability to annotate data cleanly. Sure, there’s geom_text()/geom_label(), but
using them requires a fair bit of fiddling to get the best placement and,
further, they are mainly relevant for labeling and not for longer text. ggrepel
has improved immensely on the fiddling part, but the lack of support for longer
text annotation as well as annotating whole areas is still an issue.

In order to at least partly address this, ggforce includes a family of geoms
under the geom_mark_*() moniker. They all behave equivalently except for how
they encircle the given area(s). The 4 different geoms are:

  • geom_mark_rect() encloses the data in the smallest enclosing rectangle
  • geom_mark_circle() encloses the data in the smallest enclosing circle
  • geom_mark_ellipse() encloses the data in the smallest enclosing ellipse
  • geom_mark_hull() encloses the data with a concave or convex hull

All the enclosures are calculated at draw time and so respond to resizing (most are
susceptible to changing aspect ratios). Further, they use geom_shape() with a
default expansion and radius set, so that the enclosure is always slightly
larger than the data it needs to enclose.

Just to give a quick sense of it, here’s an example of geom_mark_ellipse()

ggplot(iris, aes(Petal.Length, Petal.Width)) + geom_mark_ellipse(aes(fill = Species)) + geom_point()

If you simply want to show the area where different classes appear, we’re pretty
much done now, as the shapes along with the legend tell the story. But I
promised you some more: textual annotation. So how does this fit into it all?

In addition to the standard aesthetics for shapes, the mark geoms also take a
label and description aesthetic. When used, things get interesting:

ggplot(iris, aes(Petal.Length, Petal.Width)) + geom_mark_ellipse(aes(fill = Species, label = Species)) + geom_point()

The text is placed automatically so that it does not overlap with any data used
in the layer, and it responds once again to resizing, always trying to find the
optimal placement for the text. If it is not possible to place the desired
text it elects to not show it at all.

Anyway, in the plot above we have an overabundance of annotation. Both the
legend and the labels. Further, we often want to add annotations to specific
data in the plot, not all of it. We can put focus on setosa by ignoring the
other groups:

desc <- 'This iris species has a markedly smaller petal than the others.'
ggplot(iris, aes(Petal.Length, Petal.Width)) +
  geom_mark_ellipse(aes(filter = Species == 'setosa', label = 'Setosa',
                        description = desc)) +
  geom_point()

We are using another one of the mark geom family’s tricks here, which is the
filter aesthetic. It makes it quick to specify the data you want to annotate,
but in addition the remaining data is remembered so that any annotation doesn’t
overlap with it even if it is not getting annotated (you wouldn’t get this if
you pre-filtered the data for the layer). Another thing that happens behind the
lines is that the description text automatically gets word wrapping, based on
a desired width of the text-box (defaults to 5 cm).

The mark geoms offer a wide range of possibilities for styling the annotation,
too many to go into detail with here, but rest assured that you have full
control over text appearance, background, line, distance between data and
text-box etc.

Lost in Tessellation

The last of the big additions in this release is a range of geoms for creating
and plotting Delaunay triangulation and Voronoi tessellation. How often do you
need that, you ask? Maybe never… Does it look wicked cool? Why, yes!

Delaunay triangulation is a way to connect points to their nearest neighbors
without any connections overlapping. By nature, this results in triangles being
created. This data can either be thought of as a set of triangles, or a set of
line segments, and ggforce provides both through the geom_delaunay_tile() and
geom_delaunay_segment() geoms. Further, a geom_delaunay_segment2() version
exists that mimics geom_link2 in allowing aesthetic interpolation between
endpoints.

As we are already quite acquainted with the Iris dataset, let’s take it for a
whirl again:

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_delaunay_tile(alpha = 0.3) + geom_delaunay_segment2(aes(colour = Species, group = -1), size = 2, lineend = 'round')

The triangulation is not calculated at draw time and is thus susceptible to
range differences on the x and y axes. To combat this it is possible to
normalize the position data before calculating the triangulation.

Voronoi tessellation is sort of an inverse of Delaunay triangulation. It draws
perpendicular segments in the middle of all the triangulation segments and
connects the neighboring ones. The end result is a tile around each point
marking the area where the point is the closest one. In parallel to the
triangulation, Voronoi also comes with both a tile and a segment version.

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_voronoi_tile(aes(fill = Species, group = -1L)) + geom_voronoi_segment() + geom_point()

We need to set the group aesthetic to a scalar in order to force all points to
be part of the same tessellation. Otherwise each group would get its own:

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_voronoi_tile(aes(fill = Species), colour = 'black')

Let’s quickly move on from that…

As a Voronoi tessellation can in theory expand forever, we need to define a
bounding box. The default is to expand an enclosing rectangle 10% to each side,
but you can supply your own rectangle, or even an arbitrary polygon. Further,
it is possible to set a radius bound for each point instead:

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_voronoi_tile(aes(fill = Species, group = -1L), max.radius = 0.2, colour = 'black')

This functionality is only available for the tile geom, not the segment, but
this will hopefully change with a later release.

A last point, just to beat a dead horse, is that the tile geoms of course
inherit from geom_shape(), so if you like those rounded corners you can have it
your way:

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_voronoi_tile(aes(fill = Species, group = -1L), max.radius = 1, colour = 'black', expand = unit(-0.5, 'mm'), radius = unit(0.5, 'mm'), show.legend = FALSE)

Zoom

Not a completely new feature like the ones above, but facet_zoom() has gained
enough new power to warrant a mention. The gist of the facet is that it allows
you to zoom in on an area of the plot while keeping the original view as a
separate panel. The old version only allowed specifying the zoom region by
providing a logical expression that indicated what data should be part of the
zoom, but it now has dedicated xlim and ylim arguments to set them
directly.

ggplot(diamonds) + geom_histogram(aes(x = price), bins = 50) + facet_zoom(xlim = c(3000, 5000), ylim = c(0, 2500), horizontal = FALSE)

The example above shows a shortcoming in simply zooming in on a plot. Sometimes
the resolution (here, the bins) isn’t really meaningful for zooming. Because of
this, facet_zoom() has gotten a zoom.data argument to indicate what data to
put on the zoom panel and what to put on the overview panel (and what to put in
both places). It takes a logical expression to evaluate on the data and if it
returns TRUE the data is put in the zoom panel, if it returns FALSE it is
put on the overview panel, and if it returns NA it is put in both. To improve
the visualization above, we’ll add two layers with different numbers of bins and
use zoom.data to put them in the right place:

ggplot() + geom_histogram(aes(x = price), dplyr::mutate(diamonds, z = FALSE), bins = 50) + geom_histogram(aes(x = price), dplyr::mutate(diamonds, z = TRUE), bins = 500) + facet_zoom(xlim = c(3000, 5000), ylim = c(0, 300), zoom.data = z, horizontal = FALSE) + theme(zoom.y = element_blank(), validate = FALSE)

The last flourish we did above was to remove the zoom indicator for the y axis
zoom by using the zoom.y theme element. We currently need to turn off
validation for this to work as ggplot2 by default doesn’t allow unknown theme
elements.

All the rest

The above are just the most noteworthy additions, but the release also includes a slew of
other features and improvements. Notable mentions are:

  • geom_sina() rewrite to allow dodging and follow the shape of geom_violin()
  • position_jitternormal() that jitters points based on a normal distribution
    instead of a uniform one
  • facet_stereo() to allow for faux 3D plots

See the NEWS.md file for
the full list.

Further, ggforce now has a website at https://ggforce.data-imaginist.com, with
full documentation overview etc. This is something I plan to roll out to all my
major packages during the next release cycle. I’ve found that it provides a great
incentive to improve the examples in the documentation!

I do hope that it won’t take another two years before ggforce sees the next big
update. It is certainly a burden off my shoulders to get this out the door, and
I hope I can adhere to smaller, more frequent releases in the future.

Now go get plotting!



Efficient landscape metrics calculations for buffers around sampling points

Thu, 03/07/2019 - 01:00

(This article was first published on Rstats on Jakub Nowosad's website, and kindly contributed to R-bloggers)

Landscape metrics are algorithms that quantify physical characteristics of landscape mosaics (aka categorical rasters) in order to connect them to ecological processes.
Many different landscape metrics exist and they can provide three main levels of information: (i) landscape level, (ii) class level, and (iii) patch level.
A landscape level metric gives just one value describing a certain property of a landscape mosaic, such as its diversity.
A class level metric returns one value for each class (category) present in a landscape. A patch level metric separates neighboring cells belonging to the same class, and calculates a value for each patch.1

Landscape metrics are often applied to better understand a landscape structure around some sampling points.
In this post, I show how to prepare input data and how to calculate selected landscape metrics based on different buffer sizes and shapes.
All calculations were performed in R, using gdalUtils, sf, and raster packages for spatial data handling, purrr to ease some calculations, and most importantly the landscapemetrics package was used to calculate landscape metrics.

Data preparation

Every software package for calculating landscape metrics, such as FRAGSTATS or landscapemetrics, expects the input data in a projected coordinate reference system instead of a geographic coordinate reference system.
This is due to the fact that geographic coordinate reference systems are expressed in degrees, and one degree at the equator has a different length in meters than one degree at mid-latitudes.
Projected coordinate reference systems, on the other hand, have a linear unit of measurement such as meters. This allows for proper calculations of distances or areas.2

In this case, the input datasets (cci_lc_australia.tif and sample_points.csv) are in a geographic coordinate reference system called WGS84.
Therefore, it is important to reproject the data before calculating any landscape metrics. Reprojection of raster objects is possible with an R package called gdalUtils:

library(gdalUtils)
dir.create("data")
# downloading the land cover data
download.file("https://github.com/Nowosad/lsm-bp1/raw/master/data/cci_lc_australia.zip",
              destfile = "data/cci_lc_australia.zip")
# unzipping the file
unzip("data/cci_lc_australia.zip", "data/cci_lc_australia.tif")
# reprojecting the raster data to a local projected coordinate reference system
# read more at http://spatialreference.org/ref/epsg/gda94-australian-albers/
gdalwarp(srcfile = "data/cci_lc_australia.tif",
         dstfile = "data/cci_lc_australia_aea.vrt",
         of = "VRT",
         t_srs = "+proj=aea +lat_1=-18 +lat_2=-36 +lat_0=0 +lon_0=132 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs")
# compressing the output file
gdal_translate(src_dataset = "data/cci_lc_australia_aea.vrt",
               dst_dataset = "data/cci_lc_australia_aea.tif",
               co = "COMPRESS=LZW")

Reading and reprojecting vector (points) objects can be done with the sf package:

library(sf)
library(dplyr)   # provides the pipe operator used below
# downloading the points
download.file("https://raw.githubusercontent.com/Nowosad/lsm-bp1/master/data/sample_points.csv",
              destfile = "data/sample_points.csv")
# reading the point data and reprojecting it
points = read.csv("data/sample_points.csv") %>%
  st_as_sf(coords = c("longitude", "latitude")) %>%
  st_set_crs(4326) %>%
  st_transform("+proj=aea +lat_1=-18 +lat_2=-36 +lat_0=0 +lon_0=132 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs ")
# saving the new object to a spatial data format GPKG
st_write(points, "data/sample_points.gpkg")

Location of three sample points on a land cover map of Australia.

Calculations of landscape metrics in a buffer

Now, we can read the new files – the new raster with the raster() function and the new points file with the st_read() function:

# attaching the packages used in the next examples
library(raster)
library(sf)
library(purrr)
library(landscapemetrics)
# reading the raster and vector datasets
lc = raster("data/cci_lc_australia_aea.tif")
sample_points = st_read("data/sample_points.gpkg", quiet = TRUE)

The landscapemetrics package has a function designed for calculations of landscape metrics for a given buffer.
It is called sample_lsm() and it expects at least three input arguments – a raster, a vector (points), and a buffer size. By default, it calculates all of the available metrics in a square buffer, where the buffer size is the side length in map units (e.g. meters).

square_all = sample_lsm(lc, points = sample_points, size = 3000)
square_all
## # A tibble: 1,282 x 8
##    layer level class    id metric   value plot_id percentage_inside
##
##  1     1 class    11    NA ai        92.1       1              109.
##  2     1 class   100    NA ai       100         1              109.
##  3     1 class   122    NA ai        NA         1              109.
##  4     1 class   130    NA ai        50         1              109.
##  5     1 class   150    NA ai        71.1       1              109.
##  6     1 class    11    NA area_cv   NA         1              109.
##  7     1 class   100    NA area_cv   NA         1              109.
##  8     1 class   122    NA area_cv   NA         1              109.
##  9     1 class   130    NA area_cv   33.3       1              109.
## 10     1 class   150    NA area_cv  119.        1              109.
## # … with 1,272 more rows

The output, square_all, is a data frame containing values of all of the landscape metrics available in the landscapemetrics package.
The structure of the output is stable and contains the following columns:

  1. layer – id of the input raster (landscape). It is possible to calculate landscape metrics on several rasters at the same time.
  2. level – a landscape metrics level (“landscape”, “class”, or “patch”).
  3. class – a number of a class for class and patch level metrics.
  4. id – a number of a patch for patch level metrics.
  5. metric – a metric name.
  6. value – an output value of the metric.
  7. plot_id – id of the input vector (point).
  8. percentage_inside – the size of the actual sampled landscape in %.
    It is not always 100% due to two reasons: (i) clipping raster cells using a buffer that does not directly intersect a raster cell center leads to inaccuracies, and (ii) sample plots can exceed the landscape boundary.

Square buffers with the side-length of 3000 meters around the three sample points.

The sample_lsm() function also allows for selecting metrics using a level name ("landscape", "class", or "patch"), a metric abbreviation, full name of a metric, or a metric type.
For example, SHDI (Shannon’s diversity index) on a landscape level ("lsm_l_shdi") can be calculated using the code below.

square_shdi = sample_lsm(lc, points = sample_points, size = 3000,
                         what = "lsm_l_shdi")
square_shdi
## # A tibble: 3 x 8
##   layer level     class    id metric value plot_id percentage_inside
##
## 1     1 landscape    NA    NA shdi   0.870       1             109.
## 2     1 landscape    NA    NA shdi   1.49        2              98.9
## 3     1 landscape    NA    NA shdi   0.665       3              98.9

The result shows that the area around the second sample point is the most diverse (SHDI value of 1.49), and the area around the third sample point is the least diverse (SHDI value of 0.665).

By default, sample_lsm() uses a square buffer, but this can be changed to a circular one by setting the shape argument to "circle".
In this case, the size argument represents the radius in map units.

circle_shdi = sample_lsm(lc, points = sample_points, size = 3000,
                         what = "lsm_l_shdi", shape = "circle")
circle_shdi
## # A tibble: 3 x 8
##   layer level     class    id metric value plot_id percentage_inside
##
## 1     1 landscape    NA    NA shdi   1.03        1              99.6
## 2     1 landscape    NA    NA shdi   1.65        2             100.
## 3     1 landscape    NA    NA shdi   0.689       3              99.9

Circle buffers with the radius of 3000 meters around the three sample points.

Calculations of landscape metrics for many buffers

Finally, it is also possible to calculate landscape metrics on many different buffer sizes at the same time.
You just need to select what sizes you are interested in and use the sample_lsm() function inside a map_dfr() function.

# creating buffers of 3000 and 6000 meters sizes
sizes = c(3000, 6000)
# calculating shdi (landscape metric) for each buffer
two_sizes_output = sizes %>%
  set_names() %>%
  map_dfr(~sample_lsm(lc, points = sample_points, what = "lsm_l_shdi", size = .),
          .id = "buffer")

It calculates all of the selected metrics for all of the buffers.

Square buffers with the side-length of 3000 and 6000 meters around the three sample points.

The result is a data frame, where the first column describes the buffer size, and the rest is a standard landscapemetrics output.

two_sizes_output
## # A tibble: 6 x 9
##   buffer layer level     class    id metric value plot_id percentage_inside
##
## 1 3000       1 landscape    NA    NA shdi   0.870       1             109.
## 2 3000       1 landscape    NA    NA shdi   1.49        2              98.9
## 3 3000       1 landscape    NA    NA shdi   0.665       3              98.9
## 4 6000       1 landscape    NA    NA shdi   1.02        1              99.1
## 5 6000       1 landscape    NA    NA shdi   1.62        2              99.1
## 6 6000       1 landscape    NA    NA shdi   0.731       3              99.1

Summary

This post gave a few examples of how to calculate selected landscape metrics for buffers around sampling points using the landscapemetrics package and its function sample_lsm().
At the same time, the landscapemetrics package has much more to offer, including a lot of implemented landscape metrics, a number of utility functions (e.g. extract_lsm() or get_patches()), and several C++ functions designed for the development of new metrics.

The landscapemetrics package is also an open-source collaborative project.
It enables community contributions, including documentation improvements, bug reports, and ideas to add new metrics and functions.
To learn more you can visit the package website at https://r-spatialecology.github.io/landscapemetrics/.

  1. For a more detailed explanation see https://r-spatialecology.github.io/landscapemetrics/articles/articles/general-background.html. [return]
  2. More information about coordinate reference systems can be found at https://geocompr.robinlovelace.net/spatial-class.html#crs-intro. [return]


At the end of the rainbow

Thu, 03/07/2019 - 00:00

(This article was first published on Achim Zeileis, and kindly contributed to R-bloggers)

Fully saturated RGB rainbow colors are still widely used in scientific visualizations despite their widely-recognized disadvantages. A recent wild-caught example is presented, showing its limitations along with a better HCL-based alternative palette.

#endrainbow

The go-to palette in many software packages is – or used to be until rather recently – the so-called rainbow: a palette created by changing the hue in highly-saturated RGB colors. This has been widely recognized as having a number of disadvantages including: abrupt shifts in brightness, misleading for viewers with color vision deficiencies, too flashy to look at for a longer time. As part of our R software project colorspace we therefore started collecting typical (ab-)uses of the RGB rainbow palette on our web site http://colorspace.R-Forge.R-project.org/articles/endrainbow.html and suggest better HCL-based color palettes.
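For a quick impression of the difference in R, you can compare the base RGB rainbow with one of the HCL-based sequential palettes (here the "Purple-Yellow" palette used later in this post; swatchplot() and sequential_hcl() are from colorspace, and this sketch assumes a recent package version that ships this palette):

library("colorspace")

# fully saturated RGB rainbow vs. a perceptually based HCL sequential palette
swatchplot(
  "RGB rainbow"    = rainbow(7),
  "HCL sequential" = sequential_hcl(7, "Purple-Yellow")
)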

Here, we present the most recent addition to that example collection, a map of influenza severity in Germany, published by the influenza working group of the Robert Koch-Institut. Along with the original map and its poor choice of colors we:

  • highlight its problems by desaturation to grayscale and by emulating color vision deficiencies,
  • suggest a proper sequential HCL-based palette, and
  • provide the R code that can extract and replace the color palette in a PNG graphics file.
Influenza in Germany

The shaded map below was taken from the web site of the Robert Koch-Institut (Arbeitsgemeinschaft Influenza) and it shows the severity of influenza in Germany in week 8, 2019. The original color palette (left) is the classic rainbow ranging from “normal” (blue) to “strongly increased” (red). As all colors in the palette are very flashy and highly-saturated it is hard to grasp intuitively which areas are most affected by influenza. Also, the least interesting “normal” areas stand out as blue is the darkest color in the palette.

As an alternative, a proper multi-hue sequential HCL palette is used on the right. This has smooth gradients and the overall message can be grasped quickly, giving focus to the high-risk regions depicted with dark/colorful colors. However, the extremely sharp transitions between “normal” and “strongly increased” areas (e.g., in the North and the East) might indicate some overfitting in the underlying smoothing for the map.


Converting all colors to grayscale brings out even more clearly why the overall picture is so hard to grasp with the original palette: The gradients are discontinuous switching several times between bright and dark. Thus, it is hard to identify the high-risk regions while this is more natural and straightforward with the HCL-based sequential palette.


Emulating green-deficient vision (deuteranopia) emphasizes the same problems as the desaturated version above but shows even more problems with the original palette: The wrong areas in the map “pop out”, making the map extremely hard to use for viewers with red-green deficiency. The HCL-based palette on the other hand is equally accessible for color-deficient viewers as for those with full color vision.


Replication in R

The desaturated and deuteranope versions of the original image influenza-rainbow.png (a screenshot of the RKI web page) are relatively easy to produce using the colorspace function cvd_emulator("influenza-rainbow.png"). Internally, this reads the RGB colors for all pixels in the PNG, converts them with the colorspace functions desaturate() and deutan(), respectively, and saves the PNG again. Below we also do this “by hand”.

What is more complicated is the replacement of the original rainbow palette with a properly balanced HCL palette (without access to the underlying data). Luckily the image contains a legend from which the original palette can be extracted. Subsequently, it is possible to index all colors in the image, replace them, and write out the PNG again.

As a first step we read the original PNG image using the R package png, returning a height x width x 4 array containing the three RGB (red/green/blue) channels plus a channel for alpha transparency. Then, this is turned into a height x width matrix containing color hex codes using the base rgb() function:

img <- png::readPNG("influenza-rainbow.png")
img <- matrix(
  rgb(img[,,1], img[,,2], img[,,3]),
  nrow = nrow(img), ncol = ncol(img)
)

Using a manual search we find a column of pixels from the palette legend (column 630) and thin it to obtain only 99 colors:

pal_rain <- img[96:699, 630]
pal_rain <- pal_rain[seq(1, length(pal_rain), length.out = 99)]

For replacement we use a slightly adapted sequential_hcl() that was suggested by Stauffer et al. (2015) for a precipitation warning map. The "Purple-Yellow" palette is currently only in version 1.4-1 of the package on R-Forge but other sequential HCL palettes could also be used here.

library("colorspace") pal_hcl <- sequential_hcl(99, "Purple-Yellow", p1 = 1.3, c2 = 20)

Now for replacing the RGB rainbow colors with the sequential colors, the following approach is taken: The original image is indexed by matching the color of each pixel to the closest of the 99 colors from the rainbow palette. Furthermore, to preserve the black borders and the gray shadows, 50 shades of gray are also offered for the indexing. To match pixel colors to palette colors a simple Manhattan distance (sum of absolute distances) is used in the CIELUV color space:

# 50 shades of gray
pal_gray <- gray(0:50/50)

## HCL coordinates for image and palette
img_luv <- coords(as(hex2RGB(as.vector(img)), "LUV"))
pal_luv <- coords(as(hex2RGB(c(pal_rain, pal_gray)), "LUV"))

## Manhattan distance matrix
dm <- matrix(NA, nrow = nrow(img_luv), ncol = nrow(pal_luv))
for(i in 1:nrow(pal_luv)) dm[, i] <- rowSums(abs(t(t(img_luv) - pal_luv[i,])))
idx <- apply(dm, 1, which.min)

Now each element of the img hex color matrix can be easily replaced by indexing a new palette with 99 colors (plus 50 shades of gray) using the idx vector. This is what the pal_to_png() function below does, writing the resulting matrix to a PNG file. The function is somewhat quick and dirty, makes no sanity checks, and assumes img and idx are in the calling environment.

pal_to_png <- function(pal = pal_hcl, file = "influenza.png", rev = FALSE) {
  ret <- img
  pal <- if(rev) c(rev(pal), rev(pal_gray)) else c(pal, pal_gray)
  ret[] <- pal[idx]
  ret <- coords(hex2RGB(ret))
  dim(ret) <- c(dim(img), 3)
  png::writePNG(ret, target = file)
}

With this function, we can easily produce the PNG graphics with the desaturated palette and the deuteranope version:

pal_to_png(desaturate(pal_rain), "influenza-rainbow-gray.png")
pal_to_png(    deutan(pal_rain), "influenza-rainbow-deutan.png")

The analogous graphics for the HCL-based "Purple-Yellow" palette are generated by:

pal_to_png(           pal_hcl,  "influenza-purpleyellow.png")
pal_to_png(desaturate(pal_hcl), "influenza-purpleyellow-gray.png")
pal_to_png(    deutan(pal_hcl), "influenza-purpleyellow-deutan.png")

Further remarks

Given that we have now extracted the pal_rain palette and set up the pal_hcl alternative we can also use the colorspace function specplot() to understand how the perceptual properties of the colors change across the two palettes. For the HCL-based palette the hue/chroma/luminance changes smoothly from dark/colorful purple to a light yellow. In contrast, in the original RGB rainbow chroma and, more importantly, luminance change non-monotonically and rather abruptly:

specplot(pal_rain)
specplot(pal_hcl)


Given that the colors in the image are indexed now and the gray shades are in a separate subvector, we can now easily rev-erse the order in both subvectors. This yields a black background with white letters and we can use the "Inferno" palette that works well on dark backgrounds:

pal_to_png(sequential_hcl(99, "Inferno"), "influenza-inferno.png", rev = TRUE)

For more details on the limitations of the rainbow palette and further pointers see “The End of the Rainbow” by Hawkins et al. (2014) or “Somewhere over the Rainbow: How to Make Effective Use of Colors in Meteorological Visualizations” by Stauffer et al. (2015) as well as the #endrainbow hashtag on Twitter.



Starting With Data Science: A Rigorous Hands-On Introduction to Data Science for Engineers

Wed, 03/06/2019 - 21:12

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Starting With Data Science

A rigorous hands-on introduction to data science for engineers.

Win Vector LLC is now offering a 4 day on-site intensive data science course. The course targets engineers familiar with Python and introduces them to the basics of current data science practice. This is designed as an interactive in-person (not remote or video) course.

The course includes lectures, hands-on labs, and optional homework exercises. Students are expected to attend 4 full days, and will come out with a basic understanding of some of the most important tools for supervised learning in data science. We share all course materials, which the students can use as starting points for their own projects. Students are expected to bring a laptop with a Python 3 configured version of Anaconda (https://www.anaconda.com/download/) so they can work along with the lectures, and also work on labs and exercises. The topics are designed for an engineering audience. This is for an audience comfortable with Python, and interested in learning the basics of data manipulation and starting with supervised machine learning.

Topics covered:

  • Day1:
    Starting with data and statistics

    • Starting with Python and JupyterLab.
    • Working with data using Pandas.
    • Basic concepts of probability and statistics.
    • Predicting quantities (regression metrics and methods).
    • Overfit and test/train split.
  • Day 2:
    Starting with machine learning

    • Advanced regression methods: polynomial regression and generalized additive models.
    • Predicting classes (supervised classification) using logistic regression.
    • Fixing modeling problems using regularization.
    • Advanced classification metrics
    • Basic feature engineering (missing values, categorical variables, sessionization).
  • Day 3:
    Advanced machine learning

    • tree based methods
    • Decision trees
    • Random forest methods
    • Gradient boosted trees
    • Dimension reduction
    • Variable screening
    • Principal components reduction.
  • Day 4:
    Unsupervised methods and advanced topics

    • Clustering
    • Principal components lab
    • Introduction to Neural Nets and Deep Learning
    • Embeddings
    • Keras
    • Model explainability (LIME)

The course is designed in terms of “daily takeaways” or “daily victories”. The goals by day are:

  • Day 1: an understanding of sampling error, modeling numeric values (regression) and evaluating models (metrics).
  • Day 2: understanding classification and ranking problems and how to solve them. The students are encouraged to use logistic regression as their go-to tool for these problems.
  • Day 3: Current machine learning methods for complex relations: ensembles of trees methods (Random Forest and Gradient Boosted Trees). These are powerful machine learning methods that routinely win machine learning and data science contests.
  • Day 4: Deep learning. The students will work with the Keras package and use Neural Net methods to encode complex data such as images and text (basic natural language processing). We will end on model explainability, which attempts to link the powerful “black box” methods of days 3 and 4 with the more inspectable methods of days 1 and 2.

The software used in the course is all open-source. Primarily the course will use:

  • Python 3: the primary programming language for the course.
  • The Anaconda distribution: the main software distribution and package manager for the course.
  • JupyterLab: current interactive worksheet system for Python and data science.
  • Pandas: a state of the art data manipulation system for Python.
  • scikit-learn: a state of the art machine learning system for Python.
  • seaborn: a best of breed plotting and visualization package.
  • Keras: a powerful deep learning package.

All engineers will eventually work with data, or work with people who work with data. This course will help prepare your engineers to quickly become effective in this reality.

Win Vector LLC is a data science consulting and training organization. The senior partners are authors of numerous data science packages, R and Python training courses, and a best-selling data science book. Please enquire for course rates and scheduling. The above Python course has been presented with great success at large companies, and it can be adapted to R on request.



International Summer School on Geospatial Data Science with R in Jena

Wed, 03/06/2019 - 14:28

(This article was first published on R – jannesm, and kindly contributed to R-bloggers)

We are proud to announce our “International Summer School on Geospatial Data Science with R” taking place at the GIScience department of Jena (Germany) from 25 August to 1 September 2019.

The summer school will consist of a mixture of lectures and lab classes and will teach advanced geospatial techniques for analyzing and predicting global change phenomena on the command line (mainly using R) in a fully reproducible workflow. This includes predictive modeling using statistical and machine learning techniques, model assessment and data science challenges with geospatial data. The summer school is generously funded by the German Academic Exchange Service, which means that each accepted international participant will receive at least 450 Euros in funding upon successful completion of the summer school [1]. The application deadline is 15 May 2019.

Please share this fantastic opportunity with all interested parties and refer to the official website for more information on how to apply, the program and the invited lecturers.

We are looking forward to welcoming you in Jena!

  1. Applicants from Germany are welcome to apply but only a limited number can be admitted and no funding can be granted.



Dataiku with RStudio integration

Wed, 03/06/2019 - 05:00

(This article was first published on R – Longhow Lam's Blog, and kindly contributed to R-bloggers)

Introduction

Some time ago I wrote about the support for R users within Dataiku. The software can really boost the productivity of R users by having:

  • Managed R code environments,
  • Support for publishing of R markdown files and shiny apps in Dataiku dashboards,
  • The ability to easily create and deploy APIs based on R functions.

See the blog post here.

With the 5.1 release of Dataiku two of my favorite data science tools are brought closely together :-). Within Dataiku there is now a nice integration with RStudio. And besides this, there is much more new functionality in the system, see the 5.1 release notes.

RStudio integration

From time to time, the visual recipes for performing certain tasks in Dataiku are not enough for what you want to do. In that case you can make use of code recipes. Different languages are supported: Python, Scala, SQL, shell and R. To edit the R code you are provided with an editor in Dataiku; it is, however, not the best editor I’ve seen, especially when you are used to the RStudio IDE.

There is no need to create a new R editor; that would be a waste of time and resources. In Dataiku you can now easily use the RStudio environment in your Dataiku workflow.

Suppose you are using an R code recipe in your workflow, then in the code menu in Dataiku you can select ‘RStudio server’. It will bring you to an RStudio session (embedded) in the Dataiku interface (either on the same Dataiku server or on a different server).

In your (embedded) RStudio session you can easily ‘communicate’ with the Dataiku environment. There is a ‘dataiku’ R package that exposes functions and RStudio add-ins to make it easy to communicate with Dataiku. You can use it to bring over R code recipes and / or data from Dataiku. This will allow you to edit and test the R code recipe in RStudio with all the features that a modern IDE should have.

Once you are done editing and testing in RStudio you can then upload the R code recipe back to the Dataiku workflow. If you’re happy with the complete Dataiku workflow, put it into production, i.e. run it in-database, or deploy it as an API. See this short video for a demo.
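To give a feel for that round-tripping, here is a minimal sketch of pulling a Dataiku dataset into an R session and writing a result back. The function names dkuReadDataset() and dkuWriteDataset() are how I recall the dataiku package exposing dataset I/O, and the dataset and column names are made up for illustration, so treat all of them as assumptions and check the package documentation.

# Minimal sketch -- dkuReadDataset()/dkuWriteDataset() are assumed names
# from the 'dataiku' package; dataset and column names are hypothetical.
library(dataiku)
library(dplyr)

# read a managed dataset from the current Dataiku project into a data frame
trips <- dkuReadDataset("raw_trips")

# do the actual recipe work with ordinary tidyverse code inside RStudio
summary_by_day <- trips %>%
  group_by(day) %>%
  summarise(n_trips = n())

# write the result back so the downstream Dataiku flow can pick it up
dkuWriteDataset(summary_by_day, "trips_per_day")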

Stay tuned for more Dataiku and R updates….. Cheers, Longhow.



Building a Shiny app to show the impact of vaccines

Wed, 03/06/2019 - 01:00

(This article was first published on R blog posts on sandsynligvis.dk, and kindly contributed to R-bloggers)


Debates about vaccines are ongoing in many countries, and the debate has reblossomed in Denmark after we’ve had five recent occurrences of measles. While that is nothing compared to the measles outbreak currently ravaging Japan, it is still enough to worry the health authorities that it might result in an epidemic. Here we’ll use Shiny to create an app that shows the impact of contagious diseases and the influence of vaccination. Wrapping the computations in a Shiny app will allow non-R-users to tweak the input parameters themselves and observe the consequences of an outbreak. Hopefully, this can lead to a more informed discussion about vaccination.

The SIR compartmental model

The susceptible-infectious-recovered (SIR) model is one of the simplest compartmental models, and it has previously been successfully used to describe outbreaks of measles, the flu, smallpox, mumps, and rubella. The SIR model is easily extended to accommodate immunities due to earlier infections or vaccinations. Implementing the SIR model in R has previously been documented, so in this post we will extend that previous work by adding an additional component to accommodate previously vaccinated individuals, and wrap everything

We will use a 4-compartment model where each individual in the population initially can be classified into 4 categories: susceptible (S), the infected (I), the recovered (R), and a group of previously vaccinated/immune (V) individuals. The number of individuals starting in compartment I will be very small since that is just the initial number of infected individuals, the initial number of persons in R will be zero since no one has yet had the disease and recovered from it as part of the current outbreak. The remaining individuals will be in either S (have not had the disease before) or V (vaccinated / immune from earlier infection). The setup is sketched in the diagram below. Adding the group V to the SIR model reduces the spread of the disease since the disease cannot infect individuals that it comes into contact with if they are already immune or vaccinated.

[Diagram: the four compartments, Already vaccinated and/or immune (V), Susceptible (S), Infected (I), and Recovered (R), with transitions S → I at rate β and I → R at rate γ.]

This variant of the SIR model is useful for modeling airborne diseases, and in the model we disregard individuals who die from other causes than the disease, new vaccinations, and demographic changes in the population. Consequently, we are also assuming that the total population size is constant over the time period, that the transition rates \(\beta\) and \(\gamma\) are constant over the period, and that anyone in the population can get in contact with anyone else.

To compute the consequences of an outbreak we need to set some initial parameters for model. The parameters directly influencing the model are

  • \(\beta\) – the transition rate from compartment \(S\) to \(I\). This rate is defined as the basic reproductive number, \(R0\), divided by the infection period (i.e., the average number of individuals each infected person will infect in an unprotected population divided by the number of days that the person can pass on the disease)
  • \(\gamma\) – the transition rate from \(I\) to \(R\). This is equal to the inverse of the disease period since once the disease period is over, a person automatically transfers to the \(R\) group.

Consequently we need to allow the user to set the following

  • The reproductive number \(R0\) – the average number of individuals that each infected person may infect, i.e., how contagious is the disease,
  • the infection period,
  • the population size,
  • the number of individuals initially infected,
  • the proportion of individuals in the immune/vaccine group \(V\). This percentage should be multiplied by the vaccine effectiveness if it is not 100%.
  • the time frame to consider.

An example of these initial parameters is shown in the code below

# Set parameters
timeframe <- 200      # Look at development over 200 days
pinf <- 5             # Initial number of infected persons
popsize <- 5700000    # Population size (5.7 mio in Denmark)
pvac <- .90           # Proportion vaccinated
vaceff <- .95         # Effectiveness of vaccine
connum <- 15          # R0, the reproductive number. 15 is roughly measles
infper <- 14          # Infection period, 14 days

# Compute the proportions in each of the compartments at
# the initial outbreak
init <- c(S = 1 - pinf / popsize - pvac * vaceff,
          I = pinf / popsize,
          R = 0,
          V = pvac * vaceff)

# First set the parameters for beta and gamma
parameters <- c(beta = connum / infper, gamma = 1 / infper)

## Time frame
times <- seq(0, timeframe, by = .2)

Wonderful!

Based on the model outlined above we can now set up the following set of coupled differential equations. (Without loss of generality we have divided by the total number of individuals in the population, \(N\), to obtain the proportion of individuals in each compartment.)

\[\begin{array}{l}
\frac{dS}{dt} = -\beta I(t)S(t) \\
\frac{dI}{dt} = \beta I(t)S(t) - \gamma I(t) = [\beta S(t) - \gamma] I(t)\\
\frac{dR}{dt} = \gamma I(t) \\
\frac{dV}{dt} = 0
\end{array}\]

If \(\beta S(t) - \gamma > 0\) then the number of infected individuals will increase and an epidemic will ensue. If \(\beta S(t) - \gamma < 0\) then the number of infected will decrease and the disease will die out by itself. Note that since \(S(t)\) decreases with time, the outbreak will eventually die out by itself, since the majority of the current population will at one point be in the recovered group, where they cannot get the disease again.
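As a small sanity check (my own addition, reusing the init and parameters objects defined in the chunk above), the sign of \(\beta S(0) - \gamma\) can be evaluated directly:

# Threshold at the start of the outbreak (uses 'init' and 'parameters'
# from the chunk above); a positive value means the epidemic takes off
with(as.list(c(init, parameters)), beta * S - gamma)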

Now we are ready to define the coupled differential equations that describe the transitions between the compartments. This is formulated as a function that takes the time points, time, the current state (i.e., the distribution of the individuals in the 4 compartments), and the set of parameters.

## Create a SIR function with an extra V component
sirv <- function(time, state, parameters) {
  with(as.list(c(state, parameters)), {
    dS <- -beta * S * I
    dI <- beta * S * I - gamma * I
    dR <- gamma * I
    dV <- 0
    return(list(c(dS, dI, dR, dV)))
  })
}

Once we have the sirv() function we can use the ode() function from the deSolve package to solve the coupled differential equations.

library("deSolve")

## Solve the set of coupled differential equations
out <- ode(y = init, times = times, func = sirv, parms = parameters)

head(as.data.frame(out))
  time         S            I            R     V
1  0.0 0.1449991 8.771930e-07 0.000000e+00 0.855
2  0.2 0.1449991 8.920417e-07 1.263739e-08 0.855
3  0.4 0.1449991 9.071419e-07 2.548870e-08 0.855
4  0.6 0.1449990 9.224976e-07 3.855756e-08 0.855
5  0.8 0.1449990 9.381132e-07 5.184763e-08 0.855
6  1.0 0.1449990 9.539932e-07 6.536268e-08 0.855

The value of the basic reproductive number, \(R0\), depends on the type of disease, and typical numbers can be seen in the table below. (It is not easy to provide good estimates of the reproductive number, \(R0\), since it is difficult to gauge how many individuals a person has been in contact with and since we hardly have a population containing only susceptible individuals. Also, check out The failure of R0 by Li, Blakeley and Smith for other comments about \(R0\).) A large value of \(R0\) means that the disease is highly contagious.

Table 1: R0 for common diseases that can be used with the SIR model. The estimates of \(R0\) for the different diseases are obtained from Wikipedia and from these guidelines.

Disease             R0
Measles             12-18
Chickenpox          10-12
Smallpox            5-7
Rubella             5-7
Mumps               4-7
SARS                2-5
Flu (Spanish flu)   2-3

Now we have the full functionality and we just need to wrap it in a shiny app to make it easily accessible to anyone and to allow non-R-users to try out different parameter values and see their consequences. One way to see the results is to plot the proportion of the population in each of the four compartments:

library("ggplot2")
library("tidyverse")

ldata <- as.data.frame(out) %>% gather(key, value, -time)

head(ldata)
  time key     value
1  0.0   S 0.1449991
2  0.2   S 0.1449991
3  0.4   S 0.1449991
4  0.6   S 0.1449990
5  0.8   S 0.1449990
6  1.0   S 0.1449990

ggplot(data = ldata, aes(x = time, y = value, group = key, col = key)) +
  ylab("Proportion of full population") +
  xlab("Time (days)") +
  geom_line(size = 2) +
  scale_colour_manual(values = c("red", "green4", "black", "blue")) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1))


The graph shows the result of the model and the distribution of individuals in the four groups over time. The interesting part is how large (and how quick) a drop in the proportion of susceptible individuals is observed. The blue line shows the actual proportion of vaccinated / previously immune individuals, and it will be constant in this model. The new individuals in the recovered group will be added to the immune group by the time of the next outbreak. (The reason why the recovered and vaccinated groups are kept separate is that it might be interesting to add a societal cost, for the current outbreak, of ending up in the recovered group, so we need to be able to distinguish the two groups for this outbreak.) The black curve shows how the susceptible individuals are moved to the infected (and subsequently to the recovered) group.

It is also possible to compute simple statistics that show the impact of the outbreak.

# Proportion of full population that got the
# disease by end of time frame
ldata %>%
  filter(time == max(time), key == "R") %>%
  mutate(prop = round(100 * value, 2))
  time key     value  prop
1  200   R 0.1137451 11.37

Here, 11.37% of the population will have had the disease as part of this outbreak. We can also compute the proportion of the susceptibles that will have had the disease by the end of the time frame.

# Proportion of susceptible that will get
# the disease by end of time frame
as.data.frame(out) %>%
  filter(row_number() == n()) %>%
  mutate(res = round(100 * (R + I) / (S + I + R), 2)) %>%
  pull(res)
[1] 81.84

What does it take to prevent an epidemic?

\(R0\) denotes the number of persons that an infected person will infect. If this number is less than 1 then the outbreak will die out by itself, and if \(R0>1\) then an epidemic starts. Herd immunity is the percentage of vaccinated/immune individuals that is necessary to keep the effective \(R0\) less than one from the initial outbreak. We can compute these numbers directly.

# Proportion of population that needs to be
# immune for herd immunity
round(100 * (1 - 1 / connum), 2)
[1] 93.33

So 93.33% immunity is needed for herd immunity from day 1 when \(R0 = 15\). The effective \(R0\) at day 1 is the \(R0\) of the disease scaled down according to the number of people already immune.

# Effective R0 (for the population at outbreak,
# when prior immunity is taken into account)
round(connum * (1 - pvac * vaceff), 2)
[1] 2.18

Even though the basic reproductive number here was \(R0 = 15\), the effective number of individuals that an infected person will infect is only 2.18, due to the number of immune and vaccinated individuals. However, since this number is larger than 1 it will still turn into an outbreak.

When evaluating the impact of vaccination it is necessary to take into account the frequencies, risks and costs of getting the disease compared to the costs and risks of vaccination. For example, roughly 1-2 out of 1000 persons infected with measles will die, even with the best treatment, and 1 out of 1000 persons infected with measles is at risk of swelling of the brain. If just 1% of the 5.7 million people in Denmark are infected with measles then that corresponds to around 57,000 persons. Consequently, roughly 80 people are likely to die because of measles and around 60 will be at risk of swelling of the brain.
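To make that arithmetic explicit, here is a back-of-the-envelope check (my own addition, using only the per-1000 risk figures quoted above):

# Back-of-the-envelope check of the numbers above (illustration only;
# the per-1000 risk figures are the ones quoted in the text)
infected <- 0.01 * 5700000     # 1% of the Danish population
infected                       # about 57,000 persons
infected * c(1, 2) / 1000      # expected deaths at 1-2 per 1000 cases
infected * 1 / 1000            # persons at risk of swelling of the brain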

It should be stressed that we have used ordinary differential equations for all of the computations. Thus, we have not taken the stochastic fluctuations that might occur in a real situation into account. The numbers that we see here should therefore be regarded as average effects whereas the effect in practice could be milder or worse than what we see here.

Wrapping it in a Shiny app

Now we have all the machinery in place and we need to wrap it inside a Shiny app. The final layout is shown below

The app can be run directly from your own computer using

shiny::runGitHub("ekstroem/Shiny-Vaccine")

or the full code can be downloaded from Github. Below I’ve shown the full code for app.R that wraps the computations shown above into the shiny app for easy exploration by anyone.

#
# Shiny app for SIR model with vaccinations
#
# Created by Claus Ekstrøm 2019
# @ClausEkstrom
#
library("shiny")
library("deSolve")
library("cowplot")
library("ggplot2")
library("tidyverse")
library("ggrepel")
library("shinydashboard")

## Create an SIR function
sir <- function(time, state, parameters) {
  with(as.list(c(state, parameters)), {
    dS <- -beta * S * I
    dI <- beta * S * I - gamma * I
    dR <- gamma * I
    dV <- 0
    return(list(c(dS, dI, dR, dV)))
  })
}

#
# Define UI
#
ui <- dashboardPage(
  dashboardHeader(disable = TRUE),
  dashboardSidebar(
    sliderInput("popsize", "Population size (millions):", min = 1, max = 300, value = 6),
    sliderInput("connum", "Basic reproductive number (R0, # persons):", min = .5, max = 20, value = 5),
    sliderInput("pinf", "# infected at outbreak:", min = 1, max = 50, value = 2),
    sliderInput("pvac", "Proportion vaccinated / immune (%):", min = 0, max = 100, value = 75),
    sliderInput("vaceff", "Vaccine effectiveness (%):", min = 0, max = 100, value = 85),
    sliderInput("infper", "Infection period (days):", min = 1, max = 30, value = 7),
    sliderInput("timeframe", "Time frame (days):", min = 1, max = 400, value = 200)
  ),
  dashboardBody(
    tags$head(tags$style(HTML('
      /* body */
      .content-wrapper, .right-side {
        background-color: #fffff8;
      }
    '))),
    # mainPanel(
    fluidRow(plotOutput("distPlot")),
    br(),
    fluidRow(
      # Dynamic valueBoxes
      valueBoxOutput("progressBox", width = 6),
      valueBoxOutput("approvalBox", width = 6),
      valueBoxOutput("BRRBox", width = 6),
      valueBoxOutput("HIBox", width = 6)
    ),
    br(),
    br()
  )
)

#
# Define server
#
server <- function(input, output) {
  # Create reactive input
  dataInput <- reactive({
    init <- c(
      S = 1 - input$pinf / (input$popsize * 1000000) - input$pvac / 100 * input$vaceff / 100,
      I = input$pinf / (input$popsize * 1000000),
      R = 0,
      V = input$pvac / 100 * input$vaceff / 100
    )
    ## beta: infection parameter; gamma: recovery parameter
    parameters <- c(beta = input$connum * 1 / input$infper,  # * (1 - input$pvac/100*input$vaceff/100),
                    gamma = 1 / input$infper)
    ## Time frame
    times <- seq(0, input$timeframe, by = .2)
    ## Solve using ode (General Solver for Ordinary Differential Equations)
    out <- ode(y = init, times = times, func = sir, parms = parameters)
    # out
    as.data.frame(out)
  })

  output$distPlot <- renderPlot({
    out <- dataInput() %>%
      gather(key, value, -time) %>%
      mutate(
        id = row_number(),
        key2 = recode(key,
          S = "Susceptible (S)", I = "Infected (I)",
          R = "Recovered (R)", V = "Vaccinated / immune (V)"
        ),
        keyleft = recode(key,
          S = "Susceptible (S)", I = "", R = "", V = "Vaccinated / immune (V)"
        ),
        keyright = recode(key,
          S = "", I = "Infected (I)", R = "Recovered (R)", V = ""
        )
      )
    ggplot(data = out, aes(x = time, y = value, group = key2, col = key2,
                           label = key2, data_id = id)) +
      # ylim(0, 1) +
      ylab("Proportion of full population") +
      xlab("Time (days)") +
      geom_line(size = 2) +
      geom_text_repel(
        data = subset(out, time == max(time)),
        aes(label = keyright), size = 6, segment.size = 0.2,
        segment.color = "grey50", nudge_x = 0, hjust = 1, direction = "y"
      ) +
      geom_text_repel(
        data = subset(out, time == min(time)),
        aes(label = keyleft), size = 6, segment.size = 0.2,
        segment.color = "grey50", nudge_x = 0, hjust = 0, direction = "y"
      ) +
      theme(legend.position = "none") +
      scale_colour_manual(values = c("red", "green4", "black", "blue")) +
      scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
      theme(
        rect = element_rect(size = 0),
        legend.position = "none",
        panel.background = element_rect(fill = "transparent", colour = NA),
        plot.background = element_rect(fill = "transparent", colour = NA),
        legend.key = element_rect(fill = "transparent", colour = "transparent")
      )
  })

  output$progressBox <- renderValueBox({
    valueBox(
      dataInput() %>%
        filter(time == max(time)) %>%
        select(R) %>%
        mutate(R = round(100 * R, 2)) %>%
        paste0("%"),
      "Proportion of full population that got the disease by end of time frame",
      icon = icon("thumbs-up", lib = "glyphicon"),
      color = "black"
    )
  })

  output$approvalBox <- renderValueBox({
    valueBox(
      paste0(round(
        100 * (dataInput() %>%
                 filter(row_number() == n()) %>%
                 mutate(res = (R + I) / (S + I + R)) %>%
                 pull("res")), 2), "%"),
      "Proportion of susceptibles that will get the disease by end of time frame",
      icon = icon("thermometer-full"),
      color = "black"
    )
  })

  output$BRRBox <- renderValueBox({
    valueBox(
      paste0(round(input$connum * (1 - input$pvac / 100 * input$vaceff / 100), 2), ""),
      "Effective R0 (for the population at outbreak, when immunity is taken into account)",
      icon = icon("arrows-alt"),
      color = "red"
    )
  })

  output$HIBox <- renderValueBox({
    valueBox(
      paste0(round(100 * (1 - 1 / (input$connum)), 2), "%"),
      "Proportion of population that needs to be immune for herd immunity",
      icon = icon("medkit"),
      color = "blue"
    )
  })
}

# Run the application
shinyApp(ui = ui, server = server)


Getting started with RMarkdown & trying to make it in the world of Kaggle. Join MünsteR for our next meetup!

Wed, 03/06/2019 - 01:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

In our next MünsteR R-user group meetup on Tuesday, April 9th, 2019, we will have two exciting talks: Getting started with RMarkdown and Trying to make it in the world of Kaggle!

You can RSVP here: http://meetu.ps/e/Gg5th/w54bW/f

Getting started with RMarkdown

First, Niklas Wulms from the University Hospital, Münster will give an introduction to RMarkdown:

He started using R in 2018 and learned the advantages of using only one framework of free software and code. Now he is able to develop a clear, logical processing and analysis structure which can be updated at any time. The possibility of using interactive documents that can be shared with collaborators, e.g. via e-mail, GitHub or cloud services, is presented in his “Introduction to RMarkdown”.

This talk is for everybody who is interested in learning a method that takes hardly a minute to set up and improves your code by adding emphasis on documentation, making it reproducible and easy to collaborate on and review (even for inexperienced users).

Trying to make it in the world of Kaggle

Then, Thomas Nowicki from viadee AG will present his entry in the Kaggle Petfinder competition:

The internet was made for cats. In this talk Thomas will present his findings for a Kaggle challenge posted by petfinder.my in which he participated.

From the Petfinder competition description: Millions of stray animals suffer on the streets or are euthanized in shelters every day around the world. If homes can be found for them, many precious lives can be saved — and more happy families created. PetFinder.my has been Malaysia’s leading animal welfare platform since 2008, with a database of more than 150,000 animals. PetFinder collaborates closely with animal lovers, media, corporations, and global organizations to improve animal welfare.

The key question that he was trying to resolve was: How quickly is a pet adopted?

About Niklas:

Niklas Wulms is a biologist (MSc) doing his PhD (Dr. rer. medic.) in the junior research group “Population Imaging of the Brain” at the Institute of Epidemiology and Social Medicine of the University Hospital, Münster. Since 2016 he has focused on answering neurobiological and clinical questions using image processing, batch programming and statistics. He started programming using MATLAB interchangeably with SPSS, Statistica and spreadsheets. In his research regarding the interaction and early development of depression and cardiovascular disease he compares clinical and behavioral parameters with structural and functional MRI.

About Thomas:

Thomas Nowicki got his Master’s degree in Artificial Intelligence in Maastricht. Upon completion he noticed a lack of data science jobs in Germany, which is why he went for a career in software development. Now, after a few years, he has opened up a new branch in his consulting company, and they try to learn as much as they can about data science. Kaggle is one of the platforms they identified as most interesting for starting the learning process.



Graph analysis using the tidyverse

Wed, 03/06/2019 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

It is because I am not a graph analysis expert that I thought it important to write this article. For someone who thinks in terms of single rectangular data sets, it is a bit of a mental leap to understand how to apply tidy principles to a more robust object, such as a graph table. Thankfully, there are two packages that make this work much easier:

  • tidygraph – Provides a way for dplyr to interact with graphs

  • ggraph – Extension to ggplot2 for graph analysis

Quick intro

Simply put, graph theory studies relationships between objects in a group. Visually, we can think of a graph as a series of interconnected circles, each representing a member of a group, such as people in a Social Network. Lines drawn between the circles represent a relationship between the members, such as friendships in a Social Network. Graph analysis helps with figuring out things such as the influence of a certain member, or how many friends are in between two members. A more formal definition and detailed explanation of Graph Theory can be found in Wikipedia here.

Example

Using an example, this article will introduce concepts of graph analysis work, and how tidyverse and tidyverse-adjacent tools can be used for such analysis.

Data source

The tidytuesday weekly project encourages new and experienced users to use the tidyverse tools to analyze data sets that change every week. I have been using that opportunity to learn new tools and techniques. One of the most recent data sets relates to French trains; it contains aggregate daily total trips per connecting stations.

library(readr)

url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-26/small_trains.csv"

small_trains <- read_csv(url)

head(small_trains)
## # A tibble: 6 x 13
##    year month service  departure_stati… arrival_station journey_time_avg
## 1  2017     9 Nation…  PARIS EST        METZ                        85.1
## 2  2017     9 Nation…  REIMS            PARIS EST                   47.1
## 3  2017     9 Nation…  PARIS EST        STRASBOURG                 116.
## 4  2017     9 Nation…  PARIS LYON       AVIGNON TGV                161.
## 5  2017     9 Nation…  PARIS LYON       BELLEGARDE (AI…            164.
## 6  2017     9 Nation…  PARIS LYON       BESANCON FRANC…            129.
## # … with 7 more variables: total_num_trips, avg_delay_all_departing,
## #   avg_delay_all_arriving, num_late_at_departure, num_arriving_late,
## #   delay_cause, delayed_number

Data Preparation

Even though it was meant to analyze delays, I thought it would be interesting to use the data to understand how stations connect with each other. A new summarized data set is created, called routes, which contains a single entry for each connected station. It also includes the average journey time it takes to go between stations.

library(dplyr)

routes <- small_trains %>%
  group_by(departure_station, arrival_station) %>%
  summarise(journey_time = mean(journey_time_avg)) %>%
  ungroup() %>%
  mutate(from = departure_station, to = arrival_station) %>%
  select(from, to, journey_time)

routes
## # A tibble: 130 x 3
##    from                        to                 journey_time
##  1 AIX EN PROVENCE TGV         PARIS LYON                186.
##  2 ANGERS SAINT LAUD           PARIS MONTPARNASSE         97.5
##  3 ANGOULEME                   PARIS MONTPARNASSE        146.
##  4 ANNECY                      PARIS LYON                225.
##  5 ARRAS                       PARIS NORD                 52.8
##  6 AVIGNON TGV                 PARIS LYON                161.
##  7 BARCELONA                   PARIS LYON                358.
##  8 BELLEGARDE (AIN)            PARIS LYON                163.
##  9 BESANCON FRANCHE COMTE TGV  PARIS LYON                131.
## 10 BORDEAUX ST JEAN            PARIS MONTPARNASSE        186.
## # … with 120 more rows

The next step is to transform the tidy data set, into a graph table. In order to prepare routes for this transformation, it has to contain two variables specifically named: from and to, which are the names that tidygraph expects to see. Those variables should contain the name of each member (e.g., “AIX EN PROVENCE TGV”), and the relationship (“AIX EN PROVENCE TGV” -> “PARIS LYON”) .

In graph terminology, a member of the group is called a node (or vertex) in the graph, and a relationship between nodes is called an edge.

library(tidygraph)

graph_routes <- as_tbl_graph(routes)

graph_routes
## # A tbl_graph: 59 nodes and 130 edges
## #
## # A directed simple graph with 1 component
## #
## # Node Data: 59 x 1 (active)
##   name
## 1 AIX EN PROVENCE TGV
## 2 ANGERS SAINT LAUD
## 3 ANGOULEME
## 4 ANNECY
## 5 ARRAS
## 6 AVIGNON TGV
## # … with 53 more rows
## #
## # Edge Data: 130 x 3
##   from    to journey_time
## 1    1    39        186.
## 2    2    40         97.5
## 3    3    40        146.
## # … with 127 more rows

The as_tbl_graph() function splits the routes table into two:

  1. Node Data – Contains all of the unique values found in the from and to variables. In this case, it is a table with a single column containing the names of all of the stations.

  2. Edge Data – Is a table of all relationships between from and to. A peculiarity of tidygraph is that it uses the row position of the node as the identifier for from and to, instead of its original name.

Another interesting thing about tidygraph is that it allows us to attach more information about the node or edge in an additional column. In this case, journey_time is not really needed to create the graph table, but it may be needed for the analysis we plan to perform. The as_tbl_graph() function automatically created the column for us.
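As an aside (my own illustration, not from the original post), if you ever want to see the edge list with station names instead of the integer indices, you can join the edge table back to the node table; this uses the activate() verb that is introduced just below.

# Replace the integer from/to indices in the edge table with station names
# (illustration only; 'graph_routes' is the graph table built above)
node_names <- graph_routes %>%
  activate(nodes) %>%
  as_tibble() %>%
  mutate(idx = row_number())

graph_routes %>%
  activate(edges) %>%
  as_tibble() %>%
  left_join(node_names, by = c("from" = "idx")) %>%
  left_join(node_names, by = c("to" = "idx"), suffix = c("_from", "_to"))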

Thinking about graph_routes as two tibbles inside a larger table graph was one of the two major mental breakthroughs I had during this exercise. At that point, it became evident that dplyr needs a way to know which of the two tables (nodes or edges) to perform the transformations on. In tidygraph, this is done using the activate() function. To showcase this, the nodes table will be “activated” in order to add two new string variables derived from name.

library(stringr)

graph_routes <- graph_routes %>%
  activate(nodes) %>%
  mutate(
    title = str_to_title(name),
    label = str_replace_all(title, " ", "\n")
  )

graph_routes
## # A tbl_graph: 59 nodes and 130 edges
## #
## # A directed simple graph with 1 component
## #
## # Node Data: 59 x 3 (active)
##   name                 title                label
## 1 AIX EN PROVENCE TGV  Aix En Provence Tgv  "Aix\nEn\nProvence\nTgv"
## 2 ANGERS SAINT LAUD    Angers Saint Laud    "Angers\nSaint\nLaud"
## 3 ANGOULEME            Angouleme            Angouleme
## 4 ANNECY               Annecy               Annecy
## 5 ARRAS                Arras                Arras
## 6 AVIGNON TGV          Avignon Tgv          "Avignon\nTgv"
## # … with 53 more rows
## #
## # Edge Data: 130 x 3
##   from    to journey_time
## 1    1    39        186.
## 2    2    40         97.5
## 3    3    40        146.
## # … with 127 more rows

It was really impressive how easy it was to manipulate the graph table, because once one of the two tables are activated, all of the changes can be made using tidyverse tools. The same approach can be used to extract data from the graph table. In this case, a list of all the stations is pulled into a single character vector.

stations <- graph_routes %>%
  activate(nodes) %>%
  pull(title)

stations
##  [1] "Aix En Provence Tgv"            "Angers Saint Laud"
##  [3] "Angouleme"                      "Annecy"
##  [5] "Arras"                          "Avignon Tgv"
##  [7] "Barcelona"                      "Bellegarde (Ain)"
##  [9] "Besancon Franche Comte Tgv"     "Bordeaux St Jean"
## [11] "Brest"                          "Chambery Challes Les Eaux"
## [13] "Dijon Ville"                    "Douai"
## [15] "Dunkerque"                      "Francfort"
## [17] "Geneve"                         "Grenoble"
## [19] "Italie"                         "La Rochelle Ville"
## [21] "Lausanne"                       "Laval"
## [23] "Le Creusot Montceau Montchanin" "Le Mans"
## [25] "Lille"                          "Lyon Part Dieu"
## [27] "Macon Loche"                    "Madrid"
## [29] "Marne La Vallee"                "Marseille St Charles"
## [31] "Metz"                           "Montpellier"
## [33] "Mulhouse Ville"                 "Nancy"
## [35] "Nantes"                         "Nice Ville"
## [37] "Nimes"                          "Paris Est"
## [39] "Paris Lyon"                     "Paris Montparnasse"
## [41] "Paris Nord"                     "Paris Vaugirard"
## [43] "Perpignan"                      "Poitiers"
## [45] "Quimper"                        "Reims"
## [47] "Rennes"                         "Saint Etienne Chateaucreux"
## [49] "St Malo"                        "St Pierre Des Corps"
## [51] "Strasbourg"                     "Stuttgart"
## [53] "Toulon"                         "Toulouse Matabiau"
## [55] "Tourcoing"                      "Tours"
## [57] "Valence Alixan Tgv"             "Vannes"
## [59] "Zurich"

Visualizing

In graphs, the absolute position of each node is not as relevant as it is with other kinds of visualizations. A very minimal ggplot2 theme is set to make it easier to view the plotted graph.

library(ggplot2)

thm <- theme_minimal() +
  theme(
    legend.position = "none",
    axis.title = element_blank(),
    axis.text = element_blank(),
    panel.grid = element_blank(),
    panel.grid.major = element_blank()
  )

theme_set(thm)

To create the plot, start with ggraph() instead of ggplot2(). The ggraph package contains geoms that are unique to graph analysis. The package contains geoms to specifically plot nodes, and other geoms for edges.

As a first basic test, the point geom will be used, but instead of calling geom_point(), we call geom_node_point(). The edges are plotted using geom_edge_diagonal().

library(ggraph)

graph_routes %>%
  ggraph(layout = "kk") +
  geom_node_point() +
  geom_edge_diagonal()

To make it easier to see where each station is placed in this plot, the geom_node_text() is used. Just as with regular geoms in ggplot2, other attributes such as size, color, and alpha can be modified.

graph_routes %>%
  ggraph(layout = "kk") +
  geom_node_text(aes(label = label, color = name), size = 3) +
  geom_edge_diagonal(color = "gray", alpha = 0.4)

Morphing time!

The second mental leap was understanding how a graph algorithm is applied. Typically, the output of a model function is a model object, not a data object. With tidygraph, the process begins and ends with a graph table. The steps are these:

  1. Start with a graph table
  2. Temporarily transform the graph to comply with the model that is requested (morph())
  3. Add additional transformations to the morphed data using dplyr (optional)
  4. Restore the original graph table, but modified to keep the changes made during the morph

The shortest path algorithm defines the “length” as the number of edges in between two nodes. There may be multiple routes to get from point A to point B, but the algorithm chooses the one with the fewest number of “hops”. The way to call the algorithm is inside the morph() function. Even though to_shortest_path() is a function in itself, and it is possible to run it without morph(), it is not meant to be used that way. In the example, journey_time is used as weights to help the algorithm find an optimal route between the Arras and Nancy stations. The print output of the morphed graph will not be like the original graph table.

from <- which(stations == "Arras")
to <- which(stations == "Nancy")

shortest <- graph_routes %>%
  morph(to_shortest_path, from, to, weights = journey_time)

shortest
## # A tbl_graph temporarily morphed to a shortest path representation
## #
## # Original graph is a directed simple graph with 1 component
## # consisting of 59 nodes and 130 edges

It is possible to make more transformations with the use of activate() and dplyr functions. The results can be previewed, or committed back to the original R variable using unmorph(). By default, nodes are active in a morphed graph, so there is no need to set that explicitly.

shortest %>%
  mutate(selected_node = TRUE) %>%
  unmorph()
## # A tbl_graph: 59 nodes and 130 edges
## #
## # A directed simple graph with 1 component
## #
## # Node Data: 59 x 4 (active)
##   name                 title                label                     selected_node
## 1 AIX EN PROVENCE T…   Aix En Provence T…   "Aix\nEn\nProvence\n…     NA
## 2 ANGERS SAINT LAUD    Angers Saint Laud    "Angers\nSaint\nLaud"     NA
## 3 ANGOULEME            Angouleme            Angouleme                 NA
## 4 ANNECY               Annecy               Annecy                    NA
## 5 ARRAS                Arras                Arras                     TRUE
## 6 AVIGNON TGV          Avignon Tgv          "Avignon\nTgv"            NA
## # … with 53 more rows
## #
## # Edge Data: 130 x 3
##   from    to journey_time
## 1    1    39        186.
## 2    2    40         97.5
## 3    3    40        146.
## # … with 127 more rows

While it was morphed, only the few nodes that make up the connections between the Arras and Nancy stations were selected. A simple mutate() adds a new variable called selected_node, which tags those nodes with TRUE. The new variable and value is retained once the rest of the nodes are restored via the unmorph() command.

To keep the change, the shortest variable is updated with the changes made to both edges and nodes.

shortest <- shortest %>%
  mutate(selected_node = TRUE) %>%
  activate(edges) %>%
  mutate(selected_edge = TRUE) %>%
  unmorph()

The next step is to coerce each NA into a 1, and the shortest route into a 2. This will allow us to easily re-arrange the order that the edges are drawn in the plot, ensuring that the route will be drawn at the top.

shortest <- shortest %>%
  activate(nodes) %>%
  mutate(selected_node = ifelse(is.na(selected_node), 1, 2)) %>%
  activate(edges) %>%
  mutate(selected_edge = ifelse(is.na(selected_edge), 1, 2)) %>%
  arrange(selected_edge)

shortest
## # A tbl_graph: 59 nodes and 130 edges
## #
## # A directed simple graph with 1 component
## #
## # Edge Data: 130 x 4 (active)
##   from    to journey_time selected_edge
## 1    1    39        186.              1
## 2    2    40         97.5             1
## 3    3    40        146.              1
## 4    4    39        225.              1
## 5    6    39        161.              1
## 6    7    39        358.              1
## # … with 124 more rows
## #
## # Node Data: 59 x 4
##   name                 title                label                     selected_node
## 1 AIX EN PROVENCE T…   Aix En Provence T…   "Aix\nEn\nProvence\n…     1
## 2 ANGERS SAINT LAUD    Angers Saint Laud    "Angers\nSaint\nLaud"     1
## 3 ANGOULEME            Angouleme            Angouleme                 1
## # … with 56 more rows

A simple way to plot the route is to use the selected_ variables to modify the alpha. This will highlight the shortest path, without completely removing the other stations. This is a personal design choice, so experimenting with different ways of highlighting the results is always recommended.

shortest %>%
  ggraph(layout = "kk") +
  geom_edge_diagonal(aes(alpha = selected_edge), color = "gray") +
  geom_node_text(aes(label = label, color = name, alpha = selected_node), size = 3)

The selected_ fields can also be used in other dplyr functions to analyze the results. For example, to know the aggregate information about the trip, selected_edge is used to filter the edges, and then the totals can be calculated. There is no summarise() function for graph tables; this makes sense, because the graph table would stop being a graph table after such a function. Since the end result we seek is a total rather than another graph table, a simple as_tibble() command will coerce the edges, which will then allow us to finish the calculation.

shortest %>%
  activate(edges) %>%
  filter(selected_edge == 2) %>%
  as_tibble() %>%
  summarise(
    total_stops = n() - 1,
    total_time = round(sum(journey_time) / 60)
  )
## # A tibble: 1 x 2
##   total_stops total_time
## 1           8         23

Re-using the code

To compile most of the code in a single chunk, here is an example of how to re-run the shortest path for a different set of stations: the Laval and Montpellier stations.

from <- which(stations == "Montpellier")
to <- which(stations == "Laval")

shortest <- graph_routes %>%
  morph(to_shortest_path, from, to, weights = journey_time) %>%
  mutate(selected_node = TRUE) %>%
  activate(edges) %>%
  mutate(selected_edge = TRUE) %>%
  unmorph() %>%
  activate(nodes) %>%
  mutate(selected_node = ifelse(is.na(selected_node), 1, 2)) %>%
  activate(edges) %>%
  mutate(selected_edge = ifelse(is.na(selected_edge), 1, 2)) %>%
  arrange(selected_edge)

shortest %>%
  ggraph(layout = "kk") +
  geom_edge_diagonal(aes(alpha = selected_edge), color = "gray") +
  geom_node_text(aes(label = label, color = name, alpha = selected_node), size = 3)

Additionally, the same code can be recycled to obtain the summarized trip data.

shortest %>%
  activate(edges) %>%
  filter(selected_edge == 2) %>%
  as_tibble() %>%
  summarise(
    total_stops = n() - 1,
    total_time = round(sum(journey_time) / 60)
  )
## # A tibble: 1 x 2
##   total_stops total_time
## 1           3         10

Shiny app

To see how to use this kind of analysis inside Shiny, please refer to this application. It lets the user select two stations, and it returns the route, plus the summarized data. The source code is embedded in the app.




Cancer clusters and the Poisson distributions

Tue, 03/05/2019 - 22:18

(This article was first published on The Princess of Science, and kindly contributed to R-bloggers)

On March 1, 2019, an article was published in Israel’s Ynetnews website, under the title “The curious case of the concentration of cancer”. The story reports on a concentration of cancer cases in the town of Rosh Ha’ayin in central Israel.

In the past few years dozens of cases of cancer have been discovered in the center of Rosh Ha’ayin. About 40 people have already died of the disease. Residents are sure that the cause of the disease is the cellular antennas on the roof of a building belonging to the municipality. “For years we cry and no one listens,” they say. “People die one after the other.”

I do not underestimate the pain of the residents of Rosh Ha’ayin. I also do not intend to discuss the numbers mentioned in the article. I accept them as they are. I just want to relate to the claim that the cause of the disease is cellular antennas. It is easy (at least for me) to explain why this claim is at least questionable: there are many more cellular antennas in many other places, and around them there is no high rate of cancer morbidity. If the antennas were carcinogenic, then they would cause cancer everywhere, not just in Rosh Ha’ayin. On the other hand, I can’t blame the residents for blaming the antennas. People try to rationalize what they see and find a cause.

I also must emphasize that the case of Rosh Ha’ayin must not be neglected. If it is not because of the cellular antennas, it is possible that there is some other risk factor, and the state authorities must investigate.

So why does Rosh Ha’ayin have such a large cluster of cancer morbidity? One possible answer is that there is an environmental factor (other than the antennas) that does not exist elsewhere. Another possible answer is that there may be a non-environmental factor that does not exist elsewhere, maybe a genetic factor (most of the town’s residents are immigrants from Yemen and their descendants). A third and particularly sad possibility is that the local residents suffer from really bad luck. Statistics rules can be cruel.

Clusters happen: if there are no local (environmental or other) risk factors that cause cancer (or any other disease), and the disease spreads randomly across the whole country, then clusters are formed. If the country is divided into units of equal size, then the number of cases in a given area unit follows a Poisson distribution. Then there is a small, but not negligible possibility that one of these units will contain a large number of cases. The problem is that there is no way of knowing in advance where it will happen.

The opposite is also true: If the distribution of the number of cases in a given area unit is a Poisson distribution, then it can be concluded that the dispersion on the surface is random.
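To put a number on “small, but not negligible” (my own addition, using the same setup as the simulation below: 400 cases spread over 100 equal squares, so roughly Poisson with mean 4 per square, and ignoring the mild dependence between squares that comes from fixing the total at 400):

# Probability calculation for the cluster sizes discussed below
# (illustration only; counts per square are approximately Poisson(4))
lambda <- 4

# chance that one particular square ends up with 11 or more cases
p_single <- ppois(10, lambda, lower.tail = FALSE)
p_single

# chance that at least one of the 100 squares does
1 - (1 - p_single)^100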

I will demonstrate the phenomenon using simulation.

Consider a hypothetical country that has a perfect square shape, and its size is 100 x 100 kilometers. I randomized 400 cases of morbidity across the country:

> # generate n random points on 0:100 x 0:100
> set.seed(21534)
> n=400
> x=runif(n)*100
> y=runif(n)*100
> dat=data.frame(x,y)
> head(dat)
         x         y
1 15.73088  8.480265
2 12.77018 78.652808
3 45.50406 31.316797
4 86.46181  6.669138
5 27.25488 48.164316
6 17.42388 98.429575

I plotted the 400 cases.

> # plot the points
> plot(dat$x, dat$y, ylim=c(0,100), xlim=c(0,100),
+      asp=1, frame.plot=FALSE, axes=FALSE,
+      xlab=' ', ylab=' ', col='aquamarine', pch=16,
+      las=1, xaxs="i", yaxs="i",
+ )
> axis(1, at=0:10*10)
> axis(2, at=0:10*10, las=1, pos=0)
> axis(3, at=0:10*10, labels=FALSE)
> axis(4, at=0:10*10, labels=FALSE, pos=100)

Here’s the map I got.

Next, I divided the map into 100 squares, each 10 x 10 kilometers:

> # draw gridlines
> for (j in 1:9){
+   lines(c(0,100), j*c(10,10))
+   lines(j*c(10,10), c(0,100))
+ }
>

In order to count the cases in each square, I assigned row and column numbers for each of the squares, and recorded the position (row and column) for every case/dot:

> # row and column numbers and cases positions
> dat$row=0
> dat$col=0
> dat$pos=0
> for (j in 1:nrow(dat)){
+   dat$row=ceiling(dat$y/10)
+   dat$col=ceiling(dat$x/10)
+   dat$pos=10*(dat$row-1)+dat$col
+ }
>

Now I can count the number of points/cases in each square:

> # calculate number of points for each position
> # ppp=points per position
> dat$count=1
> ppp=aggregate(count~pos, dat, sum)
> dat=dat[,-6]

But of course, it is possible that there are squares with zero cases (actually, the data frame ppp has only 97 rows). Let’s identify them:

> # add positions with zero counts, if any
> npp=nrow(ppp)
> if(npp<100){
+   w=which(!(1:100 %in% ppp$pos))
+   addrows=(npp+1):(npp+length(w))
+   ppp[addrows,1]=w
+   ppp[addrows,2]=0
+   ppp=ppp[order(ppp$pos),]
+ }
>

And now we can get the distribution of number of cases in each of the 100 squares:

> # distribution of number of points/cases in each position
> tb=table(ppp$count)
> print(tb)
 0  1  2  3  4  5  6  7  8  9 11
 3  9 12 21 15 17 13  5  1  3  1
>

We see that there is one very unlucky cluster with 11 cases, and there are also 3 squares with 9 cases each. Let’s paint them on the map:

> # identify largest cluster
> mx=max(ppp$count)
> loc=which(ppp$count==11)
> clusters=dat[dat$pos %in% loc,]
> points(clusters$x, clusters$y, col='red', pch=16)
>
> # identify second largest cluster/s
> loc=which(ppp$count==9)
> clusters=dat[dat$pos %in% loc,]
> points(clusters$x, clusters$y, col='blue', pch=16)
>

Let’s also mark the squares with zero points/cases. In order to do this, we first need to identify the row and column locations of these squares:

> # identify squares without cases
> # find row and column locations
> loc=which(ppp$count==0)
> zeroes=data.frame(loc)
> zeroes$row=ceiling(zeroes$loc/10)
> zeroes$col=zeroes$loc %% 10
> w=which(zeroes$col==0)
> if(length(w)>0){
+   zeroes$col[w]=10
+ }
> print(zeroes)
  loc row col
1   8   1   8
2  31   4   1
3  99  10   9
>

So there is one empty square in the 8th column of the first row, one in the first column of the 4th row, and one in the 9th column of the 10th row. Let’s mark them. To do that, we need to know the coordinates of each of the four vertices of these squares:

> # mark squares with zero cases
> for (j in 1:nrow(zeroes)){
+   h1=(zeroes$col[j]-1)*10
+   h2=h1+10
+   v1=(zeroes$row[j]-1)*10
+   v2=v1+10
+   lines(c(h1,h2), c(v1,v1), lwd=3, col='purple')
+   lines(c(h1,h2), c(v2,v2), lwd=3, col='purple')
+   lines(c(h1,h1), c(v1,v2), lwd=3, col='purple')
+   lines(c(h2,h2), c(v1,v2), lwd=3, col='purple')
+ }

Do you see any pattern?

How well does the data fit the Poisson distribution? We can perform a goodness of fit test.
Let’s do the log-likelihood chi-square test (also known as the G-test):

> # log likelihood chi square to test the goodness of fit
> # of the poisson distribution to the data
>
> # the observed data
> observed=as.numeric(tb)
> values=as.numeric(names(tb))
>
> # estimate the poisson distribution parameter lambda
> # it is the mean number of cases per square
> lambda=nrow(dat)/100
> print(lambda)
[1] 4
>
> # calculate the expected values according to
> # a poisson distribution with mean lambda
> expected=100*dpois(values, lambda)
>
> # view the data for the chi-square test
> poisson_data=data.frame(values, observed, expected)
> print(poisson_data)
   values observed   expected
1       0        3  1.8315639
2       1        9  7.3262556
3       2       12 14.6525111
4       3       21 19.5366815
5       4       15 19.5366815
6       5       17 15.6293452
7       6       13 10.4195635
8       7        5  5.9540363
9       8        1  2.9770181
10      9        3  1.3231192
11     11        1  0.1924537
>
> # calculate the degrees of freedom
> df=max(values)
> print(df)
[1] 11
>
> # calculate the test statistic and p-value
> g2=sum(observed*log(observed/expected))
> pvalue=1-pchisq(g2,df)
> log_likelihood_chi_square_test=data.frame(g2, df, pvalue)
> print(log_likelihood_chi_square_test)
        g2 df    pvalue
1 4.934042 11 0.9343187
>

We cannot reject the hypothesis that the data follows the Poisson distribution. This does not imply, of course, that the data follows the Poisson distribution, but we can say that the Poisson model fits the data well.
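As a quick cross-check (my own addition, not part of the original analysis), base R’s chisq.test() can be run on the same observed and expected values; it should point to the same conclusion, although the small expected counts in the tail mean its p-value is only approximate.

> # cross-check with Pearson's chi-square test;
> # rescale.p = TRUE rescales the expected proportions so they sum to 1,
> # and a warning about small expected counts is to be expected here
> chisq.test(x = observed, p = expected, rescale.p = TRUE)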



Head’s Up! Roll Your Own HTTP Headers Investigations with the ‘hdrs’ Package

Tue, 03/05/2019 - 21:50

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I blathered alot about HTTP headers in the last post.

In the event you wanted to dig deeper I threw together a small package that will let you grab HTTP headers from a given URL and take a look at them. The README has examples for most things but we’ll go through a bit of them here as well.

For those that just want to play, you can do:

install.packages("hdrs", repos = "https://cinc.rud.is/")

hdrs::explore_app()

and use the diminutive Shiny app to explore a site’s security headers or look at all the headers they return. (Oh, yeah…if you read the previous post then looked at the above screenshot you’ll notice how completely useless IP blocking is to determined individuals.)

NOTE: There are binaries for macOS and Windows at my CINC repo for hdrs so you’ll be getting those if you use the above method. Use type='source' on that call or use various remotes package functions to install the source package (after reading it b/c you really shouldn’t trust any package, ever) from:

Moving Ahead

Let’s use the command-line to poke at my newfound most favorite site to use in security-related examples:

library(hdrs)

assess_security_headers("https://cran.r-project.org/") %>%
  dplyr::select(-url)
## # A tibble: 13 x 4
##    header                            value                  status_code message
##  1 access-control-allow-origin       NA                     WARN        Header not set
##  2 content-security-policy           NA                     WARN        Header not set
##  3 expect-ct                         NA                     WARN        Header not set
##  4 feature-policy                    NA                     WARN        Header not set
##  5 public-key-pins                   NA                     WARN        Header not set
##  6 referrer-policy                   NA                     WARN        Header not set
##  7 server                            Apache/2.4.10 (Debian) NOTE        Server header found
##  8 strict-transport-security         NA                     WARN        Header not set
##  9 x-content-type-options            NA                     WARN        Header not set
## 10 x-frame-options                   NA                     WARN        Header not set
## 11 x-permitted-cross-domain-policies NA                     WARN        Header not set
## 12 x-powered-by                      NA                     WARN        Header not set
## 13 x-xss-protection                  NA                     WARN        Header not set

Ouch. Not exactly a great result (so, perhaps it matters little how poorly maintained the downstream mirrors are after all, or maybe it’s perfectly fine to run a five year old web server with some fun vulns).

Anyway…

The assess_security_headers() function looks at 13 modern “security-oriented” HTTP headers, performs a very light efficacy assessment and returns the results.

  • access-control-allow-origin
  • content-security-policy
  • expect-ct
  • feature-policy
  • server
  • public-key-pins
  • referrer-policy
  • strict-transport-security
  • x-content-type-options
  • x-frame-options
  • x-permitted-cross-domain-policies
  • x-powered-by
  • x-xss-protection

Since you likely do not have every HTTP header’s name, potential values, suggested values, and overall purpose memorized, you can also pass in include_ref = TRUE to the function to get more information with decent textual descriptions like you saw in the screenshot (the Shiny app omits many fields).
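For example, the call from earlier becomes (a one-line sketch of the option just described; the extra reference columns it returns are not reproduced here):

# Same assessment as above, but with the reference metadata joined in
assess_security_headers("https://cran.r-project.org/", include_ref = TRUE)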

The full reference is available in a data element:

data("http_headers")

dplyr::glimpse(http_headers)
## Observations: 184
## Variables: 14
## $ header_field_name    "A-IM", "Accept", "Accept-Additions", "Accept-Charset", "Accept-Datetime", "Accept-Encoding…
## $ type_1               "Permanent", "Permanent", "Permanent", "Permanent", "Permanent", "Permanent", "Permanent", …
## $ protocol             "http", "http", "http", "http", "http", "http", "http", "http", "http", "http", "http", "ht…
## $ status               "", "standard", "", "standard", "informational", "standard", "", "standard", "", "standard"…
## $ reference            "https://tools.ietf.org/html/rfc3229#section-10.5.3", "https://tools.ietf.org/html/rfc7231#…
## $ type_2               "Request", "Request", "Request", "Request", "Request", "Request", "Request", "Request", "Re…
## $ enable               FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FAL…
## $ required             NA, NA, NA, NA, NA, NA, NA, NA, TRUE, TRUE, NA, TRUE, NA, NA, NA, TRUE, NA, NA, NA, NA, NA,…
## $ https                NA, NA, NA, NA, NA, NA, NA, NA, TRUE, TRUE, NA, TRUE, NA, NA, NA, TRUE, NA, NA, NA, NA, NA,…
## $ security_description "", "", "", "", "", "", "", "", "Sometimes an HTTP intermediary might try to detect viruses…
## $ security_reference   "", "", "", "", "", "", "", "", "https://tools.ietf.org/html/rfc5789#section-5", "https://t…
## $ recommendations      "", "", "", "", "", "", "", "", "Antivirus software scans for viruses or worms.", "Servers …
## $ cwe                  "", "", "", "", "", "", "", "", "CWE-509: Replicating Malicious Code (Virus or Worm)", "CWE…
## $ cwe_url              "\r", "\r", "\r", "\r", "\r", "\r", "\r", "\r", "https://cwe.mitre.org/data/definitions/509…

There will eventually be a lovely vignette with well-formatted sections that include the above information so you can reference it at your leisure (it’s great bedtime reading).

The http_headers object is fully documented but here’s what those fields mean:

  • header_field_name: header field
  • type_1: Permanent (in a standard); Provisional (experimental); Personal (unofficial)
  • protocol: should always be http for now but may be different (e.g. quic)
  • status: blank == unknown; otherwise the value describes the status well
  • reference: where to look for more info
  • type_2: Request (should only be found in requests); Response (should only be found in responses); Request/Response found in either; Reserved (not in use yet)
  • enable: should you have this enabled
  • required: Is this header required
  • https: HTTPS-specific header?
  • security_description: Information on the header
  • security_reference: Extra external reference information on the header
  • recommendations: Recommended setting(s)
  • cwe: Associated Common Weakness Enumeration (CWE) identifier
  • cwe_url: Associated CWE URL
Even Moar Headers

HTTP servers can spit out tons of headers and we can catch’em all with hdrs::explain_headers(). That function grabs the headers, merges in the full metadata from http_headers and returns a big ol’ data frame. We’ll only pull out the security reference URL for this last example:

explain_headers("https://community.rstudio.com/") %>% dplyr::select(header, value, security_reference) ## # A tibble: 18 x 3 ## header value security_reference ## ## 1 cache-control no-cache, no-store https://tools.ietf.org/html/rfc7234#… ## 2 connection keep-alive "" ## 3 content-encoding gzip https://en.wikipedia.org/wiki/BREACH… ## 4 content-security-po… base-uri 'none'; object-src 'none'; script-src 'unsafe-eval'… https://www.owasp.org/index.php/List… ## 5 content-type text/html; charset=utf-8 https://tools.ietf.org/html/rfc7231#… ## 6 date Tue, 05 Mar 2019 20:40:31 GMT "" ## 7 referrer-policy strict-origin-when-cross-origin NA ## 8 server nginx https://tools.ietf.org/html/rfc7231#… ## 9 strict-transport-se… max-age=31536000 https://tools.ietf.org/html/rfc6797 ## 10 vary Accept-Encoding "" ## 11 x-content-type-opti… nosniff https://www.owasp.org/index.php/List… ## 12 x-discourse-route list/latest NA ## 13 x-download-options noopen NA ## 14 x-frame-options SAMEORIGIN https://tools.ietf.org/html/rfc7034 ## 15 x-permitted-cross-d… none NA ## 16 x-request-id 12322c6e-b47e-4960-b384-32138097886c NA ## 17 x-runtime 0.106664 NA ## 18 x-xss-protection 1; mode=block https://www.owasp.org/index.php/List… FIN

Have some fun and poke at some headers. Perhaps even use this to do a survey of key web sites in your field of work/study and see how well they rate. As usual, post PRs & issues at your fav social coding site.


How to become a Mango

Tue, 03/05/2019 - 17:33

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

At Mango, we talk a lot about going on a ‘data-driven journey’ with your business. We’re passionate about data and getting the best use out of it. But for now, instead of looking at business journeys, I wanted to talk to the Mango team and find out how they started on their own ‘data journey’ – what attracted them to a career in data science and what they enjoy about their day-to-day work. (It’s not just typing in random numbers?! What?!)

We are hugely fortunate to have a wonderful team of data scientists who are always generous in sharing their skills or don’t mind teaching the Marketing and Events Coordinator (me) R for uber beginners. So let’s see what they have to say on becoming a Mango…

Jack Talboys

Jack joined us last year as a year-long placement student.

“I actually had no idea what Data Science was until I discovered Mango about a year and a half ago. I was at the university career fair – not really impressed by the prospect of working in finance or as a statistician for a large company. I stumbled across Liz Matthews and Owen Jones who were there representing Mango, drawn in by the title “Data Science” we started talking. Data Science seemed to tick all of my boxes, being able to use my knowledge of statistics and probability while doing lots of coding in R.

I’m now 6 months in at Mango and it couldn’t be going better. I’ve greatly improved my proficiency in R, alongside learning new skills like Git, SQL and Python. I’ve been given a great deal of responsibility, with assisting in delivering training to a client and attending the EARL 2018 conference making up some of my highlights. There have also been opportunities for me to be client-facing, giving me a deeper understanding of what it takes to be a Data Science Consultant.

Working at Mango hasn’t just developed my technical skills however, without really noticing I’ve found that I have become a better communicator. Whether organising tasks with the other members of the ValidR team or talking to clients I have discovered a new sense of confidence and trust in myself. Even as a relative newbie I can see that Data Science as an industry is growing massively – and I’m excited to be part of this growth and make the most of the exciting opportunities it presents with Mango.”

Beth Ashlee, Data Scientist

“I got into data science after applying for a summer internship at Mango. I didn’t really know much about the data science community previously, but spent the next few weeks learning more technical and practical skills than I had in 3 years at university.

I’ve been working as a Data Science Consultant for nearly 3 years and due to the wide variety of projects I’ve never had a dull moment. I have had amazing opportunities to travel worldwide teaching training courses and interacting with customers from all industries. The variety is my favourite part of the job, you could be building a Shiny application to help a pharmaceutical company visualise their assay data one week and the next teaching a training course at the head offices of large companies such as Screwfix.”

Owen Jones, Data Scientist

“To be honest, it rarely feels like work… since we’re a consultancy, there’s always a wide variety of projects on the go, and you can get yourself involved in the areas you find most interesting! Plus, you have the opportunity to visit new places, and you’re always meeting and working with new people – which means new conversations, new ideas and new understanding. I love it.”

Nick Howlett, Data Scientist

Nick is currently working on a client project in Italy.

“During my time creating simulations in academic contexts I found myself more motivated to meet my supervisor’s requirements than pursuing niche research topics. Towards the end of my studies, I discovered data science and realised that the client-consultant relationship was a situation very similar to this.

Working at Mango has allowed me to develop personal relationships with clients across many sectors – and get to know their motivations and individual data requirements. Mango has also given me the opportunity to travel on both short term training projects and more long term projects abroad.”

Karina Marks, Data Scientist

If you’d like to join the Mango team, take a look at the positions we have currently.


dqrng v0.1.0: breaking changes

Tue, 03/05/2019 - 14:59

(This article was first published on R – daqana Blog, and kindly contributed to R-bloggers)

A new version of dqrng has made it onto the CRAN servers. This version brings two breaking changes, hence the “larger than usual” change in version number:

  • An integer vector instead of a single int is used for seeding (Aaron Lun in #10)
    • Single integer seeds lead to a different RNG state than before.
    • dqrng::dqset_seed() expects a Rcpp::IntegerVector instead of an int
  • Support for Mersenne-Twister has been removed, Xoroshiro128+ is now the default.

The first change is motivated by the desire to provide more than 32 bits of randomness as seed to the RNG. With this possibility in place, the previously used scrambling of the single 32 bit integer did not make much sense anymore and was therefore removed. The new method generateSeedVectors(), which generates a list of random integer vectors from R’s RNG, can be used to create such a seed vector.
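For instance, a minimal sketch of seeding with such a vector (the R-level dqset.seed() and dqrunif() calls are based on the package’s documented interface, so double-check the help pages):

library(dqrng)

seed <- generateSeedVectors(1)[[1]]  # one integer vector drawn from R's RNG
dqset.seed(seed)                     # seed the default generator (Xoroshiro128+)
dqrunif(5)                           # five uniform draws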

The second change is related to a statement in R’s manual: Nor should the C++11 random number library be used …. I think that relates to the implementation-defined distributions and not the generators, but in general one should follow WRE by the letter. So std::mt19937_64 has to go, and unfortunately it cannot be replaced by boost::random::mt19937_64 due to a not-merged pull request. Instead of shipping a fixed version of MT I opted for removal since:

  • MT is known to fail certain statistical tests.
  • MT is slower than the other generators.
  • MT was the only generator that did not support the multiple streams of random numbers necessary for parallel operations.

The other usage of random from C++11 was the default initialization, which used std::random_device. This is now unnecessary since the initial state of the default RNG is now based on R’s RNG, using the techniques developed for generateSeedVectors().


Evolution works!

Tue, 03/05/2019 - 10:00

(This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)

Source: Wikimedia

Hamlet: Do you see yonder cloud that’s almost in shape of a camel?
Polonius: By the mass, and ’tis like a camel, indeed.
Hamlet: Methinks it is like a weasel.
from Hamlet by William Shakespeare

The best way to see how evolution works is to watch it in action! You can watch the evolution of cars live in this application (but be careful, it’s addictive): BoxCar 2D

It is fascinating to see how those cars get better and better over time, sometimes finding very impressive solutions:

To understand how evolution works even better, let us create an artificial evolution in R!

The famous evolutionary biologist Richard Dawkins gave in his book “The Blind Watchmaker” the following thought experiment:

I don’t know who it was first pointed out that, given enough time, a monkey bashing away at random on a typewriter could produce all the works of Shakespeare. The operative phrase is, of course, given enough time. Let us limit the task facing our monkey somewhat. Suppose that he has to produce, not the complete works of Shakespeare but just the short sentence ‘Methinks it is like a weasel’, and we shall make it relatively easy by giving him a typewriter with a restricted keyboard, one with just the 26 (capital) letters, and a space bar. How long will he take to write this one little sentence?

We are now going to put this idea into practice! The following outline is from the Wikipedia article on the weasel program (Weasel program):

  1. Start with a random string of 28 characters.
  2. Make 100 copies of the string (reproduce).
  3. For each character in each of the 100 copies, with a probability of 5%, replace (mutate) the character with a new random character.
  4. Compare each new string with the target string “METHINKS IT IS LIKE A WEASEL”, and give each a score (the number of letters in the string that are correct and in the correct position).
  5. If any of the new strings has a perfect score (28), halt. Otherwise, take the highest scoring string, and go to step 2.

So let us first define some variables and helper functions for reproduction, mutation and fitness calculation:

target <- unlist(strsplit("METHINKS IT IS LIKE A WEASEL", "")) # assign target string to "target"
pop_sz <- 100  # assign population size 100 to "pop_sz"
mt_rt <- 0.05  # assign mutation rate 5% to "mt_rt"

reproduce <- function(string) {
  # input: vector "string"
  # output: matrix with "pop_sz" columns, where each column is vector "string"
  matrix(string, nrow = length(string), ncol = pop_sz)
}

mutate <- function(pop) {
  # input: matrix of population "pop"
  # output: matrix of population where each character, with a probability of mt_rt per cent (= 5%),
  #         is replaced with a new random character
  mt_pos <- runif(length(pop)) <= mt_rt
  pop[mt_pos] <- sample(c(LETTERS, " "), sum(mt_pos), replace = TRUE)
  pop
}

fitness <- function(pop) {
  # input: matrix of population "pop"
  # output: vector of the number of letters that are correct (= equal to target) for each column
  colSums(pop == target)
}

After that we are going through all five steps listed above:

# 1. Start with a random string of 28 characters. set.seed(70) start <- sample(c(LETTERS, " "), length(target), replace = TRUE) # 2. Make 100 copies of this string (reproduce). pop <- reproduce(start) # 3. For each character in each of the 100 copies, with a probability of 5%, replace (mutate) the character with a new random character. pop <- mutate(pop) # 4. Compare each new string with the target "METHINKS IT IS LIKE A WEASEL", and give each a score (the number of letters in the string that are correct and in the correct position). score <- fitness(pop) # 5. If any of the new strings has a perfect score (28), halt. Otherwise, take the highest scoring string, and go to step 2. highscorer <- pop[ , which.max(score)] # assign string to "highscorer" which has max. score in the population gen_no <- 1 #assign 1 to generation counter "gen_no" while (max(score) < length(target)) { cat("No. of generations: ", gen_no, ", best so far: ", highscorer, " with score: ", max(score), "\n", sep = "") pop <- reproduce(highscorer) # 2. select the highest scoring string for reproduction pop <- mutate(pop) # 3. mutation score <- fitness(pop) # 4. fitness calculation highscorer <- pop[ , which.max(score)] # assign string to "highscorer" which has max. score in the population gen_no <- gen_no + 1 # increment generation counter } ## No. of generations: 1, best so far: BZRDXXINEIMYQVJWBFZKFCVUPFYL with score: 2 ## No. of generations: 2, best so far: BZRDXNINEIMYQVJWBFZKFCVUPFYL with score: 3 ## No. of generations: 3, best so far: BZRDXNINEIMYQVJWBFZKACVEPFYR with score: 4 ## No. of generations: 4, best so far: BZRDININEIMYQBJWBFZKACVEPFYR with score: 5 ## No. of generations: 5, best so far: BZRDININEIMYIBJWBFZKACVEPFYR with score: 6 ## No. of generations: 6, best so far: BZRDININEIMYIBJLBFZKACVEPFYR with score: 7 ## No. of generations: 7, best so far: BRRDININEIMYIBJLOFZKACVEPFYL with score: 8 ## No. of generations: 8, best so far: BRRDININEIMYIZJLOFZKACVEAFYL with score: 9 ## No. of generations: 9, best so far: BRRDINKNEIMYIZJLOFZKAT EAFYL with score: 10 ## No. of generations: 10, best so far: BRRDINKNEIMYIZJLOFZKATVEASYL with score: 11 ## No. of generations: 11, best so far: BRRDINKNEIMYIZJLOFEKATVEASYL with score: 12 ## No. of generations: 12, best so far: BRRUINKNEIMYIZJLOFEKATVEASEL with score: 13 ## No. of generations: 13, best so far: BERUINKNEIMYIZJLOFEKATVEASEL with score: 14 ## No. of generations: 14, best so far: BERHINKNEIMYIZJLVFEKATVEASEL with score: 15 ## No. of generations: 15, best so far: BERHINKNEIMQIZJLVFE ATVEASEL with score: 16 ## No. of generations: 16, best so far: BERHINKNEIMQIZ LVFE ATVEASEL with score: 17 ## No. of generations: 17, best so far: BERHINKNEIMQIZ LVFE ATVEASEL with score: 17 ## No. of generations: 18, best so far: BERHINKNEIMQIZ LVFE ATVEASEL with score: 17 ## No. of generations: 19, best so far: TERHINKNEIMQIZ LVFE ATDEASEL with score: 17 ## No. of generations: 20, best so far: TERHINKNEIMQIZ LVFE ATDEASEL with score: 17 ## No. of generations: 21, best so far: TERHINKNJISQIZ LVFE ATDEASEL with score: 17 ## No. of generations: 22, best so far: TERHINKNJISQIZ LVFE A DEASEL with score: 18 ## No. of generations: 23, best so far: TERHINKNJISQIZ LVFE A DEASEL with score: 18 ## No. of generations: 24, best so far: TERHINKNJITQIZ LVFE A YEASEL with score: 19 ## No. of generations: 25, best so far: TERHINKNJITQIZ LPFE A YEASEL with score: 19 ## No. of generations: 26, best so far: TERHINKN ITQIZ LPFE A YEASEL with score: 20 ## No. 
of generations: 27, best so far: MERHINKN ITQIZ LPFE A YEASEL with score: 21 ## No. of generations: 28, best so far: MERHINKN IT IZ LPFE A YEASEL with score: 22 ## No. of generations: 29, best so far: MERHINKN IT IS LPFE A YEASEL with score: 23 ## No. of generations: 30, best so far: MERHINKN IT IS LPFE A YEASEL with score: 23 ## No. of generations: 31, best so far: MERHINKN IT IS LPFE A YEASEL with score: 23 ## No. of generations: 32, best so far: MERHINKN IT IS LAFE A WEASEL with score: 24 ## No. of generations: 33, best so far: METHINKN IT IS LAFE A WEASEL with score: 25 ## No. of generations: 34, best so far: METHINKN IT IS LAFE A WEASEL with score: 25 ## No. of generations: 35, best so far: METHINKN IT IS LAFE A WEASEL with score: 25 ## No. of generations: 36, best so far: METHINKN IT IS LAFE A WEASEL with score: 25 ## No. of generations: 37, best so far: METHINKN IT IS LAFE A WEASEL with score: 25 ## No. of generations: 38, best so far: METHINKU IT IS LIFE A WEASEL with score: 26 ## No. of generations: 39, best so far: METHINKU IT IS LIFE A WEASEL with score: 26 ## No. of generations: 40, best so far: METHINKU IT IS LIFE A WEASEL with score: 26 ## No. of generations: 41, best so far: METHINKU IT IS LIKE A WEASEL with score: 27 ## No. of generations: 42, best so far: METHINKU IT IS LIKE A WEASEL with score: 27 ## No. of generations: 43, best so far: METHINKU IT IS LIKE A WEASEL with score: 27 ## No. of generations: 44, best so far: METHINKU IT IS LIKE A WEASEL with score: 27 ## No. of generations: 45, best so far: METHINKU IT IS LIKE A WEASEL with score: 27 cat("Mission accomplished in ", gen_no, " generations: ", highscorer, sep = "") ## Mission accomplished in 46 generations: METHINKS IT IS LIKE A WEASEL

As you can see, the algorithm arrived at the target phrase pretty quickly. Now you can tweak different parameter settings, like the population size or the mutation rate, and see what happens. You can of course also change the target phrase. A small wrapper that makes such experiments easier is sketched below.
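Here is a minimal sketch of such a wrapper, reusing target, reproduce(), mutate() and fitness() from above (run_weasel() is a hypothetical helper, not part of the original code):

run_weasel <- function(pop_size = 100, mut_rate = 0.05, seed = 70) {
  # the helper functions above read pop_sz and mt_rt from the global environment,
  # so we overwrite them for the duration of the experiment
  pop_sz <<- pop_size
  mt_rt  <<- mut_rate
  set.seed(seed)
  highscorer <- sample(c(LETTERS, " "), length(target), replace = TRUE)
  gen_no <- 0
  repeat {
    gen_no <- gen_no + 1
    pop <- mutate(reproduce(highscorer))   # reproduce, then mutate
    score <- fitness(pop)
    highscorer <- pop[, which.max(score)]
    if (max(score) == length(target)) break
  }
  gen_no
}

run_weasel(mut_rate = 0.1)   # e.g. compare a 10% mutation rate with the 5% default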

A minority of (often very religious) people reject the fact of evolution because they miss a crucial step: selection based on fitness. Selection gives evolution direction towards solutions that are better able to solve a certain problem. It is the exact opposite of pure randomness which many people still suspect behind evolution.

To see the difference, the only thing we have to do is comment out the line
pop <- reproduce(highscorer), which selects the highest scoring string for reproduction. Without selection there is no improvement and the algorithm would run “forever”:

## No. of generations: 1, best so far: UJGGZYOEDJMRADTQUXFWAVWPBGFX with score: 2 ## No. of generations: 2, best so far: UHGGZQOEDJERAD QBXFSBRWPBGFX with score: 2 ## No. of generations: 3, best so far: UNGDZYOEDSERADTQIXFSBVWPAGFX with score: 3 ## No. of generations: 4, best so far: UHGGZQNEDJERAG QBXFSBRWPBGWX with score: 2 ## No. of generations: 5, best so far: IDGGTJOELJERAETQBDFSBVWEBGFX with score: 2 ## No. of generations: 6, best so far: IDGGTJOELJERNETQBDFSBVWEBGFX with score: 2 ## No. of generations: 7, best so far: FNJGZYOESJERERTQGXGSBVWEBSFX with score: 3 ## No. of generations: 8, best so far: UJGWZBOERJMUAQTQUXFVAVWKKSFX with score: 3 ## No. of generations: 9, best so far: VETGRYOEYVVSAOTQBKOSTVPPGGFM with score: 3 ## No. of generations: 10, best so far: VETGRYOEYVVSAOTQBKOSTVPPGGFM with score: 3 ## No. of generations: 11, best so far: VETGRYOEYVVSAKTQBKOSTVPPGGFM with score: 3 ## No. of generations: 12, best so far: IETGRYOTYVVDAKTQBKOCTVPPGGFM with score: 3 ## No. of generations: 13, best so far: TTVVZOKDJERADELYXFKWGWXKGYO with score: 3 ## No. of generations: 14, best so far: UNGWCYOZDEWRAD WKXKSBVWECGFX with score: 3 ## No. of generations: 15, best so far: UNGWCYOZDEWRBD WKXKSBVWECGFX with score: 3 ## No. of generations: 16, best so far: UNGSCYOZDEWRBD WKXKSAVCECGFX with score: 3 ## No. of generations: 17, best so far: MXKGZYOMSJ RIOTQBLJSBVNPAGDL with score: 4 ## No. of generations: 18, best so far: MXKGZYOMSJ RIOTQBLJSBVNPAGDL with score: 4 ## No. of generations: 19, best so far: MXKGZYOMZJ RIOTQBLJSVVNPAGDL with score: 4 ## No. of generations: 20, best so far: TTVVJGKDDERADELYJXKRGWEKGYU with score: 4 ## No. of generations: 21, best so far: TTVVJGKDDERADELYDXBRGWEKGYU with score: 4 ## No. of generations: 22, best so far: TTWVJGKDQERADELYDXBRGWEKGYU with score: 4 ## No. of generations: 23, best so far: MXKGOYOMCJ RIOTQBLJIVVAPAJDG with score: 3 ## No. of generations: 24, best so far: MXKGOYOMCJ RIOTQBLJIVVAPAJDG with score: 3 ## No. of generations: 25, best so far: MXKGOYOMCJ RIOTQBLJIVVAPAJDG with score: 3 ## No. of generations: 26, best so far: MXKGOYOMCJ RIOTQBLJIVVAPAJDG with score: 3 ## No. of generations: 27, best so far: TNTUXYKJPJNDAITLAJTYBAWPMGGB with score: 4 ## No. of generations: 28, best so far: MXKGOYOMCJ RIOTLBLJIVVAPAJDX with score: 4 ## No. of generations: 29, best so far: MXKGOYOMCJ RIOTLBLJIVVAJAJDX with score: 4 ## No. of generations: 30, best so far: TUTUYYKNPJNDAITLAJTYBAAPMOGB with score: 3 ## No. of generations: 31, best so far: NGAFULYDZELWD QDPRSMPWYAPZH with score: 3 ## No. of generations: 32, best so far: HKUOZSJSXDERS TLBHASAVGPBEJT with score: 3 ## No. of generations: 33, best so far: NGAFULYDTELWD QDPRSMPWYAPZH with score: 3 ## No. of generations: 34, best so far: HKUYMSJAXDERS TLBHA AVGPBEJT with score: 3 ## No. of generations: 35, best so far: HKUYMSJAXDSRS TLBHA AVGPBEJT with score: 3 ## No. of generations: 36, best so far: HKXYMSJYXDSRS TLBHA AVGPNEJT with score: 3 ## No. of generations: 37, best so far: KNGABULYDTELWD QDORSFPWYAPZH with score: 3 ## No. of generations: 38, best so far: LLCIZN EOISJ DHFIEGPXNWYMYOX with score: 4 ## No. of generations: 39, best so far: LLCIZN EOISJ DHFIEXPXNWYMYOX with score: 4 ## No. of generations: 40, best so far: MZN KMIESQRRILELIIILFIGRYRZZ with score: 4 ## No. of generations: 41, best so far: ITQXZEKK SENLSCJXAKQ EKNCNUJ with score: 3 ## No. of generations: 42, best so far: MELBV VEUBRKXSNHWGILBU JVLZX with score: 3 ## No. 
of generations: 43, best so far: DZNAKMIEOQRRILELIVILKIGVYRZZ with score: 3 ## No. of generations: 44, best so far: DZNAKMIEOQRRILELIVILKIGVYRZZ with score: 3 ## No. of generations: 45, best so far: LRPDILXMGCWDD ZQD BKANWHMKFI with score: 3 ## No. of generations: 46, best so far: KEGAMRLYDAELDDUXLORSFPWOAPLH with score: 3 ## No. of generations: 47, best so far: KEGAMRLYDAELDDUXLORSFPWOAPLH with score: 3 ## No. of generations: 48, best so far: KEGAMRLYDAELDZUXLORHFPWOAPLH with score: 3 ## No. of generations: 49, best so far: KEGAMRLYDAEWDZUXLORHFPWOAPLH with score: 3 ## No. of generations: 50, best so far: KEGAMRLYDAEWDZDXLORHFPWOAPLH with score: 3

If this was how evolution really worked it wouldn’t work at all.

Because evolution is a very powerful optimization method there are also real world applications of so called genetic algorithms (GA). In the following example we want to find the global optimum of the so called Rastrigin function. What makes this task especially difficult for this popular test problem is the large number of local minima, as can be seen when plotting the function:

library(GA)
## Loading required package: foreach
## Loading required package: iterators
## Package 'GA' version 3.2
## Type 'citation("GA")' for citing this R package in publications.
##
## Attaching package: 'GA'
## The following object is masked from 'package:utils':
##
##     de

Rastrigin <- function(x1, x2) {
  20 + x1^2 + x2^2 - 10*(cos(2*pi*x1) + cos(2*pi*x2))
}
x1 <- x2 <- seq(-5.12, 5.12, by = 0.1)
f <- outer(x1, x2, Rastrigin)
persp3D(x1, x2, f, theta = 50, phi = 20)

filled.contour(x1, x2, f, color.palette = bl2gr.colors)

To find the global minimum (spoiler: it is at x1 = x2 = 0) we use the GA package (because GA only maximizes we use the minus sign in front of the fitness function):

set.seed(70) GA <- ga(type = "real-valued", fitness = function(x) -Rastrigin(x[1], x[2]), lower = c(-5.12, -5.12), upper = c(5.12, 5.12), maxiter = 1000) summary(GA) ## -- Genetic Algorithm ------------------- ## ## GA settings: ## Type = real-valued ## Population size = 50 ## Number of generations = 1000 ## Elitism = 2 ## Crossover probability = 0.8 ## Mutation probability = 0.1 ## Search domain = ## x1 x2 ## lower -5.12 -5.12 ## upper 5.12 5.12 ## ## GA results: ## Iterations = 1000 ## Fitness function value = -3.630204e-07 ## Solution = ## x1 x2 ## [1,] 2.81408e-05 3.221658e-05 plot(GA)

filled.contour(x1, x2, f, color.palette = bl2gr.colors, plot.axes = { axis(1); axis(2); points(GA@solution[ , 1], GA@solution[ , 2], pch = 3, cex = 2, col = "white", lwd = 2) } )

Quite impressive, isn’t it! Evolution just works!

In an upcoming post we will use evolutionary methods to find a nice functional form for some noisy data with a method called symbolic regression or genetic programming – so stay tuned!


shapper is on CRAN, it’s an R wrapper over SHAP explainer for black-box models

Tue, 03/05/2019 - 08:27

(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)

Written by: Alicja Gosiewska

In applied machine learning, there are opinions that we need to choose between interpretability and accuracy. However in field of the Interpretable Machine Learning, there are more and more new ideas for explaining black-box models. One of the best known method for local explanations is SHapley Additive exPlanations (SHAP).

The SHAP method is used to calculate influences of variables on the particular observation. This method is based on Shapley values, a technique borrowed from the game theory. SHAP was introduced by Scott M. Lundberg and Su-In Lee in A Unified Approach to Interpreting Model Predictions NIPS paper. Originally it was implemented in the Python library shap.

The R package shapper is a port of the Python library shap. In this post we show the functionalities of shapper. The examples are provided on the titanic_train data set for classification.

While shapper is a port of the Python library shap, there are also pure R implementations of the SHAP method, e.g. iml or shapleyR.

Installation

The shapper package wraps up the Python library, therefore installation requires a bit more effort than the installation of an ordinary R package.

Install the R package shapper

First of all we need to install shapper. This may be the stable release from CRAN

install.packages("shapper")

or the development version from GitHub.

devtools::install_github("ModelOriented/shapper")

Install the Python library shap

Before you run shapper, make sure that you have installed Python.

The Python library shap is required to use shapper. It can be installed either from Python or from R. To install it through R, you can use the function install_shap() from the shapper package.

library("shapper") install_shap()

If you experience any problems related to the installation of Python libraries or the evaluation of Python code, see the reticulate documentation. shapper accesses Python through reticulate, so the solution to the problem is likely to be found there ;-).

Would you survive sinking of the RMS Titanic?

The example usage is presented on the titanic_train dataset from the R package titanic. We will predict the Survived status. The other variables used by the model are: Pclass, Sex, Age, SibSp, Parch, Fare and Embarked.

library("titanic") titanic <- titanic_train[,c("Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")] titanic$Survived <- factor(titanic$Survived) titanic$Sex <- factor(titanic$Sex) titanic$Embarked <- factor(titanic$Embarked) titanic <- na.omit(titanic) head(titanic) ## Survived Pclass Sex Age SibSp Parch Fare Embarked ## 1 0 3 male 22 1 0 7.2500 S ## 2 1 1 female 38 1 0 71.2833 C ## 3 1 3 female 26 0 0 7.9250 S ## 4 1 1 female 35 1 0 53.1000 S ## 5 0 3 male 35 0 0 8.0500 S ## 7 0 1 male 54 0 0 51.8625 S

Let’s build a model

Let’s see what are our chances assessed by the random forest model.

library("randomForest") set.seed(123) model_rf <- randomForest(Survived ~ . , data = titanic) model_rf ## ## Call: ## randomForest(formula = Survived ~ ., data = titanic) ## Type of random forest: classification ## Number of trees: 500 ## No. of variables tried at each split: 2 ## ## OOB estimate of error rate: 18.63% ## Confusion matrix: ## 0 1 class.error ## 0 384 40 0.09433962 ## 1 93 197 0.32068966

Prediction to be explained

Let’s assume that we want to explain the prediction for a particular observation (male, 8 years old, traveling 1st class, embarked at C, without parents or siblings).

new_passanger <- data.frame( Pclass = 1, Sex = factor("male", levels = c("female", "male")), Age = 8, SibSp = 0, Parch = 0, Fare = 72, Embarked = factor("C", levels = c("","C","Q","S")) )

The model prediction for this observation is 0.558 for survival.

predict(model_rf, new_passanger, type = "prob") ## 0 1 ## 1 0.442 0.558 ## attr(,"class") ## [1] "matrix" "votes"

Here shapper starts

To use the shap() function (an alias for individual_variable_effect()) we need four elements:

  • a model,
  • a data set,
  • a function that calculated scores (predict function),
  • an instance (or instances) to be explained.

The shap() function can be used directly with these four arguments, but for simplicity we are using the DALEX package with pre-implemented predict functions here.
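For reference, a direct call might look roughly like the sketch below (the argument names data, predict_function and new_observation are assumptions, so check ?individual_variable_effect for the exact interface). In the rest of the post we follow the DALEX route:

p_fun <- function(model, newdata) predict(model, newdata, type = "prob")   # hypothetical scoring function

ive_direct <- shap(model_rf,                          # 1. the model
                   data = titanic[, -1],              # 2. the data set
                   predict_function = p_fun,          # 3. the predict function
                   new_observation = new_passanger)   # 4. the instance to explain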

library("DALEX") exp_rf <- explain(model_rf, data = titanic[,-1])

The explainer is an object that wraps up a model and meta-data. The meta-data consists of, at least, the data set used to fit the model and the observations to explain.

And now it’s enough to generate SHAP attributions with explainer for RF model.

library("shapper") ive_rf <- shap(exp_rf, new_observation = new_passanger) ive_rf ## Pclass Sex Age SibSp Parch Fare Embarked _id_ _ylevel_ _yhat_ ## 1 1 male 8 0 0 72 C 1 0.558 ## 1.1 1 male 8 0 0 72 C 1 0.558 ## 1.2 1 male 8 0 0 72 C 1 0.558 ## 1.3 1 male 8 0 0 72 C 1 0.558 ## 1.4 1 male 8 0 0 72 C 1 0.558 ## 1.5 1 male 8 0 0 72 C 1 0.558 ## _yhat_mean_ _vname_ _attribution_ _sign_ _label_ ## 1 0.3672941 Pclass 0.070047752 + randomForest ## 1.1 0.3672941 Sex -0.154519708 - randomForest ## 1.2 0.3672941 Age 0.143046212 + randomForest ## 1.3 0.3672941 SibSp 0.003154522 + randomForest ## 1.4 0.3672941 Parch -0.018111585 - randomForest ## 1.5 0.3672941 Fare 0.086728705 + randomForest

Plotting results

To generate a plot of the Shapley values you can simply pass an object of class importance_variable_effect to the plot() function. Since we are interested in the class Survived = 1, we may add an additional parameter that filters only the selected classes.

plot(ive_rf)

Labels on the y-axis show the values of the variables for this particular observation. Black arrows show the predictions of the model, in this case, the probabilities of each status. Other arrows show the effect of each variable on this prediction. Effects may be positive or negative and they sum up to the value of the prediction.

On this plot we can see that the model predicts that the passenger will survive. Chances are higher due to the young age and the 1st class, and only Sex = male decreases the chances of survival for this observation.

More models

It is useful to contrast prediction of two models. Here we will show how to use shapper for such contrastive explanations.

We will compare randomForest with svm implemented in the e1071 package.

library("e1071") model_svm <- svm(Survived~. , data = titanic, probability = TRUE) model_svm ## ## Call: ## svm(formula = Survived ~ ., data = titanic, probability = TRUE) ## ## ## Parameters: ## SVM-Type: C-classification ## SVM-Kernel: radial ## cost: 1 ## gamma: 0.1 ## ## Number of Support Vectors: 338

This model predicts a 32.5% chance of survival.

attr(predict(model_svm, newdata = new_passanger, probability = TRUE), "probabilities")
## 0 1
## 1 0.6748768 0.3251232

Again, we create an explainer that wraps the model, the data and the predict function.

exp_svm <- explain(model_svm, data = titanic[,-1])
ive_svm <- shap(exp_svm, new_passanger)

The Shapley values plot may be modified. To show more than one model, you can pass more individual_variable_plot objects.

plot(ive_rf, ive_svm)

To see only attributions use option show_predcited = FALSE.

plot(ive_rf, show_predcited = FALSE)

More

Documentation and more examples are available at https://modeloriented.github.io/shapper/. The stable version of the package is on CRAN, the development version is on GitHub (https://github.com/ModelOriented/shapper). Shapper is a part of the DALEX universe.


Getting Help in R

Tue, 03/05/2019 - 01:00

(This article was first published on Rsquared Academy Blog, and kindly contributed to R-bloggers)

Introduction

In this post, we will learn about the different methods of getting help in R.
Often, we get stuck while doing some analysis as either we do not know the
correct function to use or its syntax. It is important for anyone who is new
to R to know the right place to look for help. There are two ways to look for
help in R:

  • built in help system
  • online

In the first section, we will look at various online resources that can
supplement the built in help system. In the second section, we will look at
various ways to access the built in help system of R. Let us get started!

Online Resources R Bloggers

R Bloggers aggregates blogs written in English from across the globe. This is the first place you want to visit if you want help with R, data analysis, visualization and machine learning. There are blogs on a wide range of topics and the latest content is delivered to your inbox if you subscribe. If you are a blogger yourself, share your posts with the R community by adding your blog to R Bloggers.

Stack Overflow

Stack Overflow is a great place to visit if you are having trouble with R code or packages. Chances are high that someone has already encountered the same or similar problem and you can use the answers given by R experts. In case you have encountered a new problem or issue, you can ask for help by providing a reproducible example of your analysis along with the R code. Use the reprex package to create reproducible examples.
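For instance, a minimal sketch of wrapping a snippet with {reprex} before posting it:

# install.packages("reprex")
library(reprex)

reprex({
  x <- c(1, 2, NA)
  mean(x)   # returns NA because na.rm = TRUE was forgotten
})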

Twitter

The R community is very active on Twitter and there are lot of R experts who are willing to help those who are new to R. Use the hashtag #rstats if you are asking for help or guidance on Twitter.

RStudio Community

RStudio Community is similar to Stack Overflow. You can ask questions
related to RStudio, Shiny, tidyverse and other RStudio products.

r4ds Community

An online learning community that brings learners and mentors together on a single platform. You can learn more about the community here.

RStudio Resources

RStudio has very good resources including cheatsheets, webinars and blogs.

Reddit

Reddit is another place where you can look for help. The discussions are moderated by R experts. There are subreddits for Rstats, Rlanguage, Rstudio and Rshiny.

R Weekly

Visit RWeekly to get regular updates about the R community. You can find information about new packages, blogs, conferences, workshops, tutorials and R jobs.

R User Groups

There are several R User Groups active across the globe. You can find the list here. Join the local user group to meet, discuss and learn from other R enthusiasts and experts.

Data Helpers

Data Helpers is a list of data analysts, scientists and engineers willing to offer guidance put together by Angela Bassa. Visit the website to learn more about how you can approach for help and guidance.

Internal

In this section, we will look at the following functions:

  • help.start()
  • help()
  • ?
  • ??
  • help.search()
  • demo()
  • example()
  • library()
  • vignette()
  • browseVignettes()
help.start

The help.start() function opens the documentation page in your browser. Here you can find manuals, references and other materials.

help.start()
## starting httpd help server ... done
## If nothing happens, you should open
## 'http://127.0.0.1:24726/doc/html/index.html' yourself

help

Use help() to access the documentation of functions and data sets. ? is a shortcut for help() and returns the same information.

help(plot)
?plot

help.search

help.search() will search all sources of documentation and return those that match the search string. ?? is a shortcut for help.search() and returns the same information.

help.search('regression')
??regression

demo

demo() displays an interactive demonstration of certain topics provided in an R package. Typing demo() in the console will list the demos available in all the installed R packages.

demo() demo(scoping) ## ## ## demo(scoping) ## ---- ~~~~~~~ ## ## > ## Here is a little example which shows a fundamental difference between ## > ## R and S. It is a little example from Abelson and Sussman which models ## > ## the way in which bank accounts work. It shows how R functions can ## > ## encapsulate state information. ## > ## ## > ## When invoked, "open.account" defines and returns three functions ## > ## in a list. Because the variable "total" exists in the environment ## > ## where these functions are defined they have access to its value. ## > ## This is even true when "open.account" has returned. The only way ## > ## to access the value of "total" is through the accessor functions ## > ## withdraw, deposit and balance. Separate accounts maintain their ## > ## own balances. ## > ## ## > ## This is a very nifty way of creating "closures" and a little thought ## > ## will show you that there are many ways of using this in statistics. ## > ## > # Copyright (C) 1997-8 The R Core Team ## > ## > open.account <- function(total) { ## + ## + list( ## + deposit = function(amount) { ## + if(amount <= 0) ## + stop("Deposits must be positive!\n") ## + total <<- total + amount ## + cat(amount,"deposited. Your balance is", total, "\n\n") ## + }, ## + withdraw = function(amount) { ## + if(amount > total) ## + stop("You don't have that much money!\n") ## + total <<- total - amount ## + cat(amount,"withdrawn. Your balance is", total, "\n\n") ## + }, ## + balance = function() { ## + cat("Your balance is", total, "\n\n") ## + } ## + ) ## + } ## ## > ross <- open.account(100) ## ## > robert <- open.account(200) ## ## > ross$withdraw(30) ## 30 withdrawn. Your balance is 70 ## ## ## > ross$balance() ## Your balance is 70 ## ## ## > robert$balance() ## Your balance is 200 ## ## ## > ross$deposit(50) ## 50 deposited. Your balance is 120 ## ## ## > ross$balance() ## Your balance is 120 ## ## ## > try(ross$withdraw(500)) # no way.. ## Error in ross$withdraw(500) : You don't have that much money! example

example() displays examples of the specified topic if available.

example('mean')
##
## mean> x <- c(0:10, 50)
##
## mean> xm <- mean(x)
##
## mean> c(xm, mean(x, trim = 0.10))
## [1] 8.75 5.50

Package Documentation

library

Access the documentation of a package using help() inside library(). The package need not be loaded (attached) to access its documentation.

library(help = 'ggplot2')

vignette

A vignette is a long-form guide to an R package. You can access the available vignettes using vignette(). It will display a list of vignettes available in the installed packages.

vignette()

To access a specific vignette from a package, specify the name of the vignette and the package.

vignette('dplyr', package = 'dplyr')

browseVignettes

browseVignettes() is another way to access the vignettes in installed packages. It will list the vignettes in each package along with links to the web page and R code.

browseVignettes()

RSiteSearch

RSiteSearch() will search for a specified topic in help pages, vignettes and task views using the search engine at this link and return the results in a browser.

RSiteSearch('glm')
## A search query has been submitted to http://search.r-project.org
## The results page should open in your browser shortly

Summary

To sum it up, the R community is very beginner-friendly and we hope you will find the above resources, both the internal help system and the online ones, useful.


Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit, part 2

Tue, 03/05/2019 - 01:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)


In part 1 of this series I set
up Vowpal Wabbit to classify newspapers content. Now, let’s use the model to make predictions and
see how and if we can improve the model. Then, let’s train the model on the whole data.

Step 1: prepare the data

The first step consists in importing the test data and preparing it. The test data need not be large
and thus can be imported and worked on in R.

I need to remove the target column from the test set, or else it will be used to make predictions.
If you do not remove this column the accuracy of the model will be very high, but it will be wrong
since, of course, you do not have the target column at running time… because it is the column
that you want to predict!

library("tidyverse") library("yardstick") small_test <- read_delim("data_split/small_test.txt", "|", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE) small_test %>% mutate(X1= " ") %>% write_delim("data_split/small_test2.txt", col_names = FALSE, delim = "|")

I wrote the data in a file called small_test2.txt and can now use my model to make predictions:

system2("/home/cbrunos/miniconda3/bin/vw", args = "-t -i vw_models/small_oaa.model data_split/small_test2.txt -p data_split/small_oaa.predict")

The predictions get saved in the file small_oaa.predict, which is a plain text file. Let’s add these
predictions to the original test set:

small_predictions <- read_delim("data_split/small_oaa.predict", "|", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE) small_test <- small_test %>% rename(truth = X1) %>% mutate(truth = factor(truth, levels = c("1", "2", "3", "4", "5"))) small_predictions <- small_predictions %>% rename(predictions = X1) %>% mutate(predictions = factor(predictions, levels = c("1", "2", "3", "4", "5"))) small_test <- small_test %>% bind_cols(small_predictions) Step 2: use the model and test data to evaluate performance

We can use the several metrics included in {yardstick} to evaluate the model’s performance:

conf_mat(small_test, truth = truth, estimate = predictions) accuracy(small_test, truth = truth, estimate = predictions) Truth Prediction 1 2 3 4 5 1 51 15 2 10 1 2 11 6 3 1 0 3 0 0 0 0 0 4 0 0 0 0 0 5 0 0 0 0 0 # A tibble: 1 x 3 .metric .estimator .estimate 1 accuracy multiclass 0.570

We can see that the model never predicted class 3, 4 or 5. Can we improve by adding some
regularization? Let’s find out!

Step 3: adding regularization

Before trying regularization, let’s try changing the cost function from the logistic function to the
hinge function:

# Train the model hinge_oaa_fit <- system2("/home/cbrunos/miniconda3/bin/vw", args = "--oaa 5 -d data_split/small_train.txt --loss_function hinge -f vw_models/hinge_oaa.model", stderr = TRUE) system2("/home/cbrunos/miniconda3/bin/vw", args = "-i vw_models/hinge_oaa.model -t -d data_split/small_test2.txt -p data_split/hinge_oaa.predict") predictions <- read_delim("data_split/hinge_oaa.predict", "|", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE) test <- test %>% select(-predictions) predictions <- predictions %>% rename(predictions = X1) %>% mutate(predictions = factor(predictions, levels = c("1", "2", "3", "4", "5"))) test <- test %>% bind_cols(predictions) conf_mat(test, truth = truth, estimate = predictions) accuracy(test, truth = truth, estimate = predictions) Truth Prediction 1 2 3 4 5 1 411 120 45 92 1 2 355 189 12 17 0 3 11 2 0 0 0 4 36 4 0 1 0 5 3 0 3 0 0 # A tibble: 1 x 3 .metric .estimator .estimate 1 accuracy multiclass 0.462

Well, didn’t work out so well, but at least we now know how to change the loss function. Let’s go
back to the logistic loss and add some regularization. First, let’s train the model:

regul_oaa_fit <- system2("/home/cbrunos/miniconda3/bin/vw", args = "--oaa 5 --l1 0.005 --l2 0.005 -d data_split/small_train.txt -f vw_models/small_regul_oaa.model", stderr = TRUE)

Now we can use it for prediction:

system2("/home/cbrunos/miniconda3/bin/vw", args = "-i vw_models/small_regul_oaa.model -t -d data_split/test2.txt -p data_split/small_regul_oaa.predict") predictions <- read_delim("data_split/small_regul_oaa.predict", "|", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE) test <- test %>% select(-predictions) predictions <- predictions %>% rename(predictions = X1) %>% mutate(predictions = factor(predictions, levels = c("1", "2", "3", "4", "5"))) test <- test %>% bind_cols(predictions)

We can now use it for predictions:

conf_mat(test, truth = truth, estimate = predictions) accuracy(test, truth = truth, estimate = predictions) Truth Prediction 1 2 3 4 5 1 816 315 60 110 1 2 0 0 0 0 0 3 0 0 0 0 0 4 0 0 0 0 0 5 0 0 0 0 0 # A tibble: 1 x 3 .metric .estimator .estimate 1 accuracy multiclass 0.627

So accuracy improved, but the model only predicts class 1 now… let’s try other hyper-parameter values:

regul_oaa_fit <- system2("/home/cbrunos/miniconda3/bin/vw", args = "--oaa 5 --l1 0.00015 --l2 0.00015 -d data_split/small_train.txt -f vw_models/small_regul_oaa.model", stderr = TRUE) conf_mat(test, truth = truth, estimate = predictions) accuracy(test, truth = truth, estimate = predictions) Truth Prediction 1 2 3 4 5 1 784 300 57 108 1 2 32 14 3 2 0 3 0 1 0 0 0 4 0 0 0 0 0 5 0 0 0 0 0 # A tibble: 1 x 3 .metric .estimator .estimate 1 accuracy multiclass 0.613

So accuracy is lower than previously, but at least more categories get correctly predicted. Depending
on your needs, you should consider different metrics. Especially for classification problems, you might
not be interested in accuracy, in particular if the data is severely unbalanced.
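For instance, {yardstick} also provides class-balance-aware metrics that can be computed on the same test set (a sketch; both functions use the same truth/estimate interface as accuracy()):

bal_accuracy(test, truth = truth, estimate = predictions)
f_meas(test, truth = truth, estimate = predictions, estimator = "macro")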

Anyhow, to finish this blog post, let’s train the model on the whole data and measure the time it
takes to run the full model.

Step 4: Training on the whole data

Let’s first split the whole data into a training and a testing set:

nb_lines <- system2("cat", args = "text_fr.txt | wc -l", stdout = TRUE) system2("split", args = paste0("-l", floor(as.numeric(nb_lines)*0.995), " text_fr.txt data_split/")) system2("mv", args = "data_split/aa data_split/train.txt") system2("mv", args = "data_split/ab data_split/test.txt")

The whole data contains 260247 lines, and the training set weighs 667MB, which is quite large. Let’s train
the simple multiple classifier on the data and see how long it takes:

tic <- Sys.time()
oaa_fit <- system2("/home/cbrunos/miniconda3/bin/vw", args = "--oaa 5 -d data_split/train.txt -f vw_models/oaa.model", stderr = TRUE)
Sys.time() - tic

Time difference of 4.73266 secs

Yep, you read that right. Training the classifier on 667MB of data took less than 5 seconds!

Let’s take a look at the final object:

oaa_fit [1] "final_regressor = vw_models/oaa.model" [2] "Num weight bits = 18" [3] "learning rate = 0.5" [4] "initial_t = 0" [5] "power_t = 0.5" [6] "using no cache" [7] "Reading datafile = data_split/train.txt" [8] "num sources = 1" [9] "average since example example current current current" [10] "loss last counter weight label predict features" [11] "1.000000 1.000000 1 1.0 2 1 253" [12] "0.500000 0.000000 2 2.0 2 2 499" [13] "0.250000 0.000000 4 4.0 2 2 6" [14] "0.250000 0.250000 8 8.0 1 1 2268" [15] "0.312500 0.375000 16 16.0 1 1 237" [16] "0.250000 0.187500 32 32.0 1 1 557" [17] "0.171875 0.093750 64 64.0 1 1 689" [18] "0.179688 0.187500 128 128.0 2 2 208" [19] "0.144531 0.109375 256 256.0 1 1 856" [20] "0.136719 0.128906 512 512.0 4 4 4" [21] "0.122070 0.107422 1024 1024.0 1 1 1353" [22] "0.106934 0.091797 2048 2048.0 1 1 571" [23] "0.098633 0.090332 4096 4096.0 1 1 43" [24] "0.080566 0.062500 8192 8192.0 1 1 885" [25] "0.069336 0.058105 16384 16384.0 1 1 810" [26] "0.062683 0.056030 32768 32768.0 2 2 467" [27] "0.058167 0.053650 65536 65536.0 1 1 47" [28] "0.056061 0.053955 131072 131072.0 1 1 495" [29] "" [30] "finished run" [31] "number of examples = 258945" [32] "weighted example sum = 258945.000000" [33] "weighted label sum = 0.000000" [34] "average loss = 0.054467" [35] "total feature number = 116335486"

Let’s use the test set and see how the model fares:

conf_mat(test, truth = truth, estimate = predictions) accuracy(test, truth = truth, estimate = predictions) Truth Prediction 1 2 3 4 5 1 537 175 52 100 1 2 271 140 8 9 0 3 1 0 0 0 0 4 7 0 0 1 0 5 0 0 0 0 0 # A tibble: 1 x 3 .metric .estimator .estimate 1 accuracy multiclass 0.521

Better accuracy can certainly be achieved with hyper-parameter tuning… maybe the subject for a
future blog post? In any case I am very impressed with Vowpal Wabbit and am certainly looking forward
to future developments of {RVowpalWabbit}!

Hope you enjoyed! If you found this blog post useful, you might want to follow
me on twitter for blog post updates and
buy me an espresso or paypal.me.



Building Big Shiny Apps — A Workflow 2/2

Mon, 03/04/2019 - 22:18

(This article was first published on (en) The R Task Force, and kindly contributed to R-bloggers)

Second part of the blog transcription of the talk I gave during the e-poster session of rstudio::conf(2019).

Read the first part here.

Building Big Shiny Apps: step by step Step 1: Designing

Don’t rush into coding. I know you want to, because it’s what we like to do and what we are good at. But before entering the coding marathon, take time to think about the application and the way it will be deployed and used. Take a pen and a piece of paper and draw the app. Talk about it with the people who will use the app, just to decipher what they actually need. Take a moment to talk with the IT department. Here are some questions you can ask:

  • “Who are the end users of the app?” — This will help you know whether the end users are tech literate or not, and what they aim to achieve with the app.
  • “How frequently will they use the app?” — The small details of the design and the UI matter more for an app used on a daily basis than for one used once a month.
  • “What level of complexity and personalization do the users really need?” — People writing app specifications sometimes want more functionality than the users actually need.
  • “What level of interactivity do you want, and to what extent is it central?” — People love interactive graphs and like it when things automatically sync with each other. Yet both can make the app slower without any significant gain. For example, reacting directly to a selectInput() or a sliderInput() can lead to too much computation: the user may not land on the right value on the first, second or third try… So let them make their choice, and add a button so they can validate it when they are ready (see the sketch after this list).
  • “How important is it that the app is fast?” — Should you spend a lot of time optimizing the little things?
  • etc.
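
To illustrate the validation-button point above, a minimal sketch of reacting to an actionButton() with eventReactive() rather than recomputing on every input change (the input and output names here are arbitrary):

    library(shiny)

    ui <- fluidPage(
      selectInput("var", "Variable", choices = names(mtcars)),
      sliderInput("bins", "Bins", min = 5, max = 50, value = 20),
      actionButton("go", "Update plot"),
      plotOutput("hist")
    )

    server <- function(input, output, session) {
      # eventReactive() is only invalidated by the button click, so changing
      # the select or the slider alone does not trigger any computation
      plot_params <- eventReactive(input$go, {
        list(values = mtcars[[input$var]], bins = input$bins, title = input$var)
      })

      output$hist <- renderPlot({
        p <- plot_params()
        hist(p$values, breaks = p$bins, main = p$title)
      })
    }

    shinyApp(ui, server)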

Asking questions, taking notes, and sketching the app give you a good idea of what is expected and what you need to do.

So, next step!

Step 2: Prototype

I like to go “UI first”. For two main reasons:

  • Once the UI is set, there is no “surprise implementation”. Once we agree on what elements there are in the app, there is no sudden “oh the app needs to do that now”.
  • A pre-defined UI allows every person involved in the coding to know which part of the app they are working on. In other words, when you start working on the backend, it’s much easier to work on a piece you can visually identify and integrate into a complete app scenario.

So yes, spend time writing a front-end prototype in lorem ipsum. And good news, we’ve got a tool for you: it’s called {shinipsum}. The main goal of this package is to create random Shiny elements that can be used to draw a UI, without actually doing any heavy lifting in the backend.

Hence, once you’ve got a draft of your app on a piece of paper, you can then move to the “ipsum-UI” stage: building the front-end of the app, and filling it with random Shiny elements, with functions like random_ggplot() or random_DT().
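
A minimal sketch of that stage, assuming a plain {shiny} layout (no modules, arbitrary input and output names); random_ggplot() and random_DT() come from {shinipsum}:

    library(shiny)
    library(DT)
    library(shinipsum)

    ui <- fluidPage(
      titlePanel("Prototype, ipsum-UI stage"),
      sidebarLayout(
        sidebarPanel(
          selectInput("dataset", "Dataset", choices = c("A", "B", "C")),
          actionButton("go", "Update")
        ),
        mainPanel(
          plotOutput("main_plot"),
          DTOutput("main_table")
        )
      )
    )

    server <- function(input, output, session) {
      # No backend yet: every output is filled with random content so the
      # UI can be discussed and agreed upon before any heavy lifting
      output$main_plot  <- renderPlot(random_ggplot())
      output$main_table <- renderDT(random_DT(5, 3))  # 5 rows, 3 columns
    }

    shinyApp(ui, server)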

Another package that can be used to do that is {fakir}. This package is designed to create fake data frames, primarily for teaching purposes, but it can be used for inserting data into a Shiny prototype.

Step 3: Build

Now that the UI and the features are set, it’s time to work on the backend.

This part is pretty standard — everybody can now work on the implementation of the functions that process the app inputs, in their own modules. As the UI, functionalities and modules have been defined in the previous steps, everyone (well, in theory) knows what they have to work on.

And also, as said before, there should be no “surprise implementation”, as the app has been well defined before.

Step 4: Secure

Securing your app means two things: testing, and locking the application environment.

So first, be sure to include tests all along the building process — just like any other R code. As the app is contained in a package, you can use standard testing tools for testing the business logic of your app — as said in the first part, it’s important to split the backend functions and algorithms from the user interface. That means that these backend functions can run outside of the application. And yes, if they can run outside of the app, they can be tested the standard way, using {testthat}.
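
A short sketch of what that looks like, where mean_by_group() stands for a hypothetical business-logic function living in the R/ folder of the app package:

    library(testthat)

    # Hypothetical business-logic function that would live in R/ of the package
    mean_by_group <- function(df) {
      aggregate(value ~ group, data = df, FUN = mean)
    }

    test_that("mean_by_group() returns one row per group", {
      df <- data.frame(group = c("a", "a", "b"), value = c(1, 2, 3))
      res <- mean_by_group(df)
      expect_true(is.data.frame(res))
      expect_equal(nrow(res), 2)
      expect_equal(res$value, c(1.5, 3))
    })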

When it comes to testing the front end, you can try the {shinytest} package from RStudio, if you need to be sure there is no visual regression all along the project development. {shinyloadtest}, on the other hand, tests how an application behaves when one, two, three, twenty, one hundred users connect to the app, and gives you a visual report about the connections and response time of each session.
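
A sketch of the {shinytest} side, assuming the application lives in an "app/" directory:

    library(shinytest)

    # Record a test interactively: the app opens next to a recorder that
    # saves your clicks and the expected snapshots under the app directory
    recordTest("app/")

    # Later (for instance on CI), replay every recorded test and compare the
    # results against the saved snapshots, flagging any regression
    testApp("app/")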

One other tool I like to use is Katalon Studio. It’s not R related, and can be used with any kind of web app. How it works is quite simple: it opens your browser where the Shiny app runs, and records everything that happens. Once you stop the recording, you can relaunch the app and it will replay all the events it has recorded. And of course, you can specify your own scenario, define your own events, etc. It’s not that straightforward to use, but once you get a good grasp of how it works, it’s a very powerful tool.

Secondly, securing your app means it can be deployed again at any time in the future. In other words, you need a proper handle on the required R version and on the versions of the packages the app needs to run. Be aware that upgrading a package might break your app, so provide an environment that prevents the app from breaking when a package gets updated. For that, there is of course Docker, but also R-specific tools like {packrat}, custom CRAN repositories, or a package manager.
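
A minimal {packrat} sketch of that idea (Docker or a custom CRAN repository can complement or replace it):

    # Create a private, project-specific library for the app
    packrat::init()

    # After installing or upgrading the packages the app needs,
    # record their exact versions in packrat.lock
    packrat::snapshot()

    # On the deployment machine, reinstall exactly those versions
    packrat::restore()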

Step 5: Deploy

Tools for deployment are not the subject of this blog post so I won’t talk about this in detail (remember, we are talking about building), but our two tools of choice are Docker & ShinyProxy, and RStudio Connect.

Building Big Shiny Apps: an introduction to {golem}

Ok, that’s a lot of things to process. Is there a tool that can help us simplify this workflow? Of course there is, and it’s called {golem}.

It can be found at https://github.com/ThinkR-open/golem

{golem} is an R package that implements an opinionated framework for building production-ready Shiny apps. It all starts with an RStudio project, which contains a predefined setup for building your app. The idea is that with {golem}, you don’t have to focus on the foundation of your app, and can spend your time thinking about what you want to do, not about how to do it. It’s built on top of the working process we’ve developed at ThinkR, and tries to gather in one place the functions and tools we’ve created for building applications designed for production.

When you open a golem project, you’ll start with a dev-history file, which contains a series of functions that will guide you through the whole process of starting, building, and deploying your app. The newly created package contains an app_ui.R and app_server.R waiting to be filled, and a run_app() function that will launch your application. Any new module can be added with golem::add_module(), a function that creates a new file with the required skeleton for a Shiny module. As I said, you don’t need to think about the technical things.
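
A short sketch of that workflow; only add_module() and run_app() are cited above, the module name "import" and the package name myapp are placeholders:

    # From the dev-history file of a golem project:

    # Create a new module skeleton: this generates a file in R/ with the
    # UI and server stubs for a module called "import" (placeholder name)
    golem::add_module(name = "import")

    # Once app_ui.R and app_server.R are filled, launch the packaged app
    # ("myapp" stands for the name of your golem package)
    myapp::run_app()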

You can also find a series of UI, server, and prod-related tools, functions for creating deployment scripts, and other cool stuff. Check the README for more information.

Building Big Shiny Apps: the Tools

Package with {golem}

We believe that Shiny apps should be put into a package. Why? Because a package allows them to be documented, tested, and installed in several environments.

Also, think about your last Shiny app that wasn’t in a package. It’s an app.R, maybe with a folder you’re sourcing that contains functions (let’s say an R/ folder). Maybe you’ve written some meta information (let’s call it a DESCRIPTION), and some tests you’ve put in a tests/ folder. Also, as you want to do things right, you’ve put documentation in your functions. Do you see where I’m heading? Yes, you’ve written an R package.

Git

Friends don’t let friends work on a coding project without version control.

Shiny modules

Shiny modules are crucial tools when it comes to building big Shiny apps: they enable collaboration, split the work into pieces, facilitate testing, and make it easier to implement new features.
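
A minimal module sketch (using the callModule() style, with arbitrary names), just to show how a module isolates one piece of the app:

    library(shiny)

    # UI part: namespacing the ids with NS() keeps the module self-contained
    mod_hist_ui <- function(id) {
      ns <- NS(id)
      tagList(
        sliderInput(ns("bins"), "Bins", min = 5, max = 50, value = 20),
        plotOutput(ns("hist"))
      )
    }

    # Server part: receives a reactive carrying the values to plot
    mod_hist_server <- function(input, output, session, values) {
      output$hist <- renderPlot(
        hist(values(), breaks = input$bins)
      )
    }

    # In the app:
    ui <- fluidPage(mod_hist_ui("hist_1"))
    server <- function(input, output, session) {
      callModule(mod_hist_server, "hist_1", values = reactive(mtcars$mpg))
    }
    shinyApp(ui, server)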

Prototyping with {shinipsum} and {fakir}

These two tools allow you to prototype a Shiny app and to go “UI first”. Learn more:

CI and testing

Testing is central to making your application survive in the long run. The {testthat} package can be used to test the “business logic” side of your app, while the application features can be tested with packages like {shinytest}, or software like Katalon.

Deploy

Docker, ShinyProxy, RStudio Connect… Don’t just put your app on the server and let it live. Use tools that are specifically designed for deploying and scaling web applications. But this will be the topic of another blog post.

Want to know more? Keep an eye on our bookdown “Building Big Shiny Apps — A Workflow”!

The post Building Big Shiny Apps — A Workflow 2/2 appeared first on (en) The R Task Force.


To leave a comment for the author, please follow the link and comment on their blog: (en) The R Task Force. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
