R-bloggers
R news and tutorials contributed by hundreds of R bloggers

A primer in using Java from R – part 2

Sat, 07/07/2018 - 15:00

(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)

Introduction

In this part of the primer we discuss creating and using custom .jar archives within our R scripts and packages, handling Java exceptions from R, and a quick performance comparison between the low- and high-level interfaces provided by rJava.

In the first part we talked about using the rJava package to create objects, call methods and work with arrays; we also examined the various ways to call Java methods and how to call Java code from R directly via execution of shell commands.

R <3 Java, or maybe not?

Contents
  1. Using rJava with custom built classes
  2. Very quick look at performance
  3. Usage of jars in R packages
  4. Handling Java exceptions in R
  5. References
Using rJava with custom built classes

Preparing a .jar archive for use

Getting back to our example with running the main method of our HelloWorldDummy class from the first part of this primer, in practice we most likely want to actually create objects and invoke methods for such classes rather than simply call the main method.

For our resources to be available to rJava, we need to create a .jar archive and add it to the class path. An example of the process can be as follows. Compile our code to create the class file, and jar it:

$ javac DummyJavaClassJustForFun/HelloWorldDummy.java
$ cd DummyJavaClassJustForFun/
$ jar cvf HelloWorldDummy.jar HelloWorldDummy.class

Adding the .jar file to the class path

Within R, attach rJava, initialize the JVM and investigate our current class path using .jclassPath:

library(rJava)
.jinit()
.jclassPath()

Now, we add our newly created .jar to the class path using .jaddClassPath:

.jaddClassPath(paste0(jardir, "HelloWorldDummy.jar"))

If this worked, we can see the added jar(s) in the class path if we call .jclassPath() again.
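A quick sanity check (a minimal sketch) to confirm the archive is now on the class path:

# the new archive should now show up when we inspect the class path
any(grepl("HelloWorldDummy.jar", .jclassPath(), fixed = TRUE))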

Creating objects, investigating methods and fields

Now that we have our .jar in the class path, we can create a new Java object from our class:

dummyObj <- .jnew("DummyJavaClassJustForFun/HelloWorldDummy")
str(dummyObj)
## Formal class 'jobjRef' [package "rJava"] with 2 slots
##   ..@ jobj  : <externalptr>
##   ..@ jclass: chr "DummyJavaClassJustForFun/HelloWorldDummy"

We can also investigate the available constructors, methods and fields for our class (or provide the object as an argument, in which case its class will be queried):

  • .jconstructors returns a character vector with all constructors for a given class or object
  • .jmethods returns a character vector with all methods for a given class or object
  • .jfields returns a character vector with all fields (aka attributes) for a given class or object
  • .DollarNames returns all fields and methods associated with the object. Method names are followed by ( or () depending on arity.
# Requesting vectors of methods, constructors and fields by class
.jmethods("DummyJavaClassJustForFun/HelloWorldDummy")
##  [1] "public static void DummyJavaClassJustForFun.HelloWorldDummy.main(java.lang.String[])"
##  [2] "public java.lang.String DummyJavaClassJustForFun.HelloWorldDummy.SayMyName()"
##  [3] "public final void java.lang.Object.wait(long,int) throws java.lang.InterruptedException"
##  [4] "public final native void java.lang.Object.wait(long) throws java.lang.InterruptedException"
##  [5] "public final void java.lang.Object.wait() throws java.lang.InterruptedException"
##  [6] "public boolean java.lang.Object.equals(java.lang.Object)"
##  [7] "public java.lang.String java.lang.Object.toString()"
##  [8] "public native int java.lang.Object.hashCode()"
##  [9] "public final native java.lang.Class java.lang.Object.getClass()"
## [10] "public final native void java.lang.Object.notify()"
## [11] "public final native void java.lang.Object.notifyAll()"

.jconstructors("DummyJavaClassJustForFun/HelloWorldDummy")
## [1] "public DummyJavaClassJustForFun.HelloWorldDummy()"

.jfields("DummyJavaClassJustForFun/HelloWorldDummy")
## NULL

# Requesting vectors of methods, constructors and fields by object
.jmethods(dummyObj)
##  [1] "public static void DummyJavaClassJustForFun.HelloWorldDummy.main(java.lang.String[])"
##  [2] "public java.lang.String DummyJavaClassJustForFun.HelloWorldDummy.SayMyName()"
##  [3] "public final void java.lang.Object.wait(long,int) throws java.lang.InterruptedException"
##  [4] "public final native void java.lang.Object.wait(long) throws java.lang.InterruptedException"
##  [5] "public final void java.lang.Object.wait() throws java.lang.InterruptedException"
##  [6] "public boolean java.lang.Object.equals(java.lang.Object)"
##  [7] "public java.lang.String java.lang.Object.toString()"
##  [8] "public native int java.lang.Object.hashCode()"
##  [9] "public final native java.lang.Class java.lang.Object.getClass()"
## [10] "public final native void java.lang.Object.notify()"
## [11] "public final native void java.lang.Object.notifyAll()"

.jconstructors(dummyObj)
## [1] "public DummyJavaClassJustForFun.HelloWorldDummy()"

.jfields(dummyObj)
## NULL

Calling methods 3 different ways

We can now invoke our SayMyName method on this object in the three ways discussed in the first part of this primer:

# low level
lres <- .jcall(dummyObj, "Ljava/lang/String;", "SayMyName")

# high level
hres <- J(dummyObj, method = "SayMyName")

# convenient $ shorthand
dres <- dummyObj$SayMyName()

c(lres, hres, dres)
## [1] "My name is DummyJavaClassJustForFun.HelloWorldDummy"
## [2] "My name is DummyJavaClassJustForFun.HelloWorldDummy"
## [3] "My name is DummyJavaClassJustForFun.HelloWorldDummy"

Very quick look at performance

The low-level .jcall is much faster, since J has to use reflection to find the most suitable method. The $ shorthand seems to be the slowest, but it is also very convenient, as it supports code completion:

microbenchmark::microbenchmark(times = 1000
  , .jcall(dummyObj, "Ljava/lang/String;", "SayMyName")
  , J(dummyObj, "SayMyName")
  , dummyObj$SayMyName()
)
## Unit: microseconds
##                                                 expr     min       lq
##  .jcall(dummyObj, "Ljava/lang/String;", "SayMyName")  34.992  48.0385
##                             J(dummyObj, "SayMyName") 696.860 748.9630
##                                 dummyObj$SayMyName() 894.576 963.1035
##        mean    median        uq       max neval
##    62.13235   61.3425   66.6470   723.719  1000
##   903.54615  786.2715  841.2835 66191.562  1000
##  1093.67310 1009.4020 1075.2565  6743.220  1000

Usage of jars in R packages

To use rJava within an R package, Simon Urbanek, the author of rJava, even provides a convenience function for this purpose, which initializes the JVM and registers the Java classes and native code contained in the package. A quick step-by-step guide to using .jars within a package is as follows:

  1. place our .jars into inst/java/
  2. add Depends: rJava and SystemRequirements: Java into our DESCRIPTION file
  3. add a call to .jpackage(pkgname, lib.loc = libname) into our .onLoad function (or .First.lib for packages without a namespace), for example like so:
.onLoad <- function(libname, pkgname) {
  .jpackage(pkgname, lib.loc = libname)
}
  4. if possible, add the .java source files into the java/ folder of our package (a minimal wrapper sketch follows below)
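For illustration, a typical exported wrapper function in such a package might look like the sketch below; the class and method names are the dummy ones from this primer, and in a real package they would be your own:

# sketch of an R wrapper around the Java method, for use inside a package
say_my_name <- function() {
  obj <- rJava::.jnew("DummyJavaClassJustForFun/HelloWorldDummy")
  rJava::.jcall(obj, "Ljava/lang/String;", "SayMyName")
}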

If you are interested in more detail than provided in this super-quick overview, Tobias Verbeke created a Hello Java World! package with a vignette providing a verbose step-by-step tutorial for interfacing to Java archives inside R packages.

Setting java.parameters

The .jpackage function calls .jinit with the default parameters = getOption("java.parameters"), so if we want to set some of the java parameters, we can do it for example like so:

.onLoad <- function(libname, pkgname) {
  options(java.parameters = c("-Xmx1000m"))
  .jpackage(pkgname, lib.loc = libname)
}

Note that the options call needs to be made before the call to .jpackage, as Java parameters can only be used during JVM initialization. Consequently, this will only work if another package has not initialized the JVM already.
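One way to check that the parameter was picked up is to ask the JVM for its maximum heap size after the package has loaded; a minimal sketch using the low-level interface:

rt <- .jcall("java/lang/Runtime", "Ljava/lang/Runtime;", "getRuntime")
.jcall(rt, "J", "maxMemory") / 1024^2  # maximum heap in MB, should reflect -Xmx1000m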

Handling Java exceptions in R

rJava maps Java exceptions to R conditions signalled via the stop function, so we can use standard R mechanisms such as tryCatch to handle the exceptions.

The R condition object, which we will call e here, is actually an S3 object (a list) that contains:

  • call – a language object containing the call resulting in the exception
  • jobj – an S4 object containing the actual exception object, so we can for example investigate its class: e[["jobj"]]@jclass
tryCatch(
  iOne <- .jnew(class = "java/lang/Integer", 1),
  error = function(e) {
    message("\nLets look at the condition object:")
    str(e)
    message("\nClass of the jobj item:")
    print(e[["jobj"]]@jclass)
    message("\nClasses of the condition object: ")
    class(e)
  }
)
##
## Lets look at the condition object:
## List of 3
##  $ message: chr "java.lang.NoSuchMethodError: "
##  $ call   : language .jnew(class = "java/lang/Integer", 1)
##  $ jobj   :Formal class 'jobjRef' [package "rJava"] with 2 slots
##   .. ..@ jobj  : <externalptr>
##   .. ..@ jclass: chr "java/lang/NoSuchMethodError"
##  - attr(*, "class")= chr [1:9] "NoSuchMethodError" "IncompatibleClassChangeError" "LinkageError" "Error" ...
##
## Class of the jobj item:
## [1] "java/lang/NoSuchMethodError"
##
## Classes of the condition object:
## [1] "NoSuchMethodError"            "IncompatibleClassChangeError"
## [3] "LinkageError"                 "Error"
## [5] "Throwable"                    "Object"
## [7] "Exception"                    "error"
## [9] "condition"

Since class(e) is a vector of simple Java class names, which allows the R code to use direct handlers, we can handle the different exception classes differently:

withCallingHandlers(
  iOne <- .jnew(class = "java/lang/Integer", 1)
  , error = function(e) {
      message("Meh, just a boring error")
  }
  , NoSuchMethodError = function(e) {
      message("We have a NoSuchMethodError")
  }
  , IncompatibleClassChangeError = function(e) {
      message("We also have a IncompatibleClassChangeError - lets recover")
      recover()
      # recovering here and looking at
      # 2: .jnew(class = "java/lang/Integer", 1)
      # we see that the issue is in
      # str(list(...))
      # List of 1
      #  $ : num 1
      # We actually passed a numeric, not integer
      # To fix it, just do
      # .jnew(class = "java/lang/Integer", 1L)
  }
  , LinkageError = function(e) {
      message("Ok, this is getting a bit overwhelming, lets smile and end here :o)")
  }
)
## Meh, just a boring error
## We have a NoSuchMethodError
## We also have a IncompatibleClassChangeError - lets recover
## recover called non-interactively; frames dumped, use debugger() to view
## Ok, this is getting a bit overwhelming, lets smile and end here :o)
## Error in .jnew(class = "java/lang/Integer", 1): java.lang.NoSuchMethodError:

References
  1. Hello Java World! vignette – a tutorial for interfacing to Java archives inside R packages by Tobias Verbeke
  2. rJava basic crashcourse – at the rJava site on rforge, scroll down to the Documentation section
  3. The JNI Type Signatures – at Oracle JNI specs
  4. rJava documentation on CRAN
  5. Calling Java code from R by prof. Darren Wilkinson


HavanaR Workshop 2018

Sat, 07/07/2018 - 02:00

(This article was first published on Blog on FORWARDS, and kindly contributed to R-bloggers)

The first meeting of R users in Cuba, HavanaR! Club 2018, will take place from 29-10-2018 to 02-11-2018. The workshop is organized jointly by the Group of Probability and Statistics of Universidad de La Habana, the >eR-Biostat initiative, the Center for Statistics, CenStat & I-Biostat, Hasselt University, Belgium and Forwards.

The workshop will focus on visualisation methods and exploratory analysis of high dimensional and multivariate datasets.

Program:

29-Oct: Basic concepts of exploratory data analysis using R. Exploratory Analysis of high dimensional data using R (Adetayo Kasim, Durham University, UK, a.s.kasim@durham.ac.uk and Ziv Shkedy, Hasselt University, Belgium, ziv.shkedy@uhasselt.be).

30-Oct: Exploratory Analysis of high dimensional data using R (Adetayo Kasim and Ziv Shkedy).

31-Oct: HavanaR! Club 2018: The Cuban R users workshop with presentations of the participants.

01-Nov: Exploratory Multivariate Data Analysis using R (Julie Josse, Ecole Polytechnique, France, julie.josse@polytechnique.edu).

02-Nov: Exploratory Multivariate Data Analysis using R (Julie Josse).

The workshop is free but registration is mandatory. To register, please send an email to erbiostat@gmail.com or cbouza2002@gmail.com.

More details and workshop materials will be available online at https://er-biostat.github.io/Courses/.
All updates about the workshop will be announced on the Facebook page (https://www.facebook.com/ER-BioStat-1463845487001786/) and Twitter (@erbiostat).



Visualizing macOS App Usage with a Little Help from osqueryr & mactheknife

Fri, 07/06/2018 - 20:34

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Both my osqueryr and mactheknife packages have had a few updates and I wanted to put together a fun example (it being Friday, and all) of what you can do with them. All my packages are now on GitHub and GitLab and I’ll be maintaining them on both so I can accommodate the comfort-level of any and all contributors but will be prioritizing issues and PRs on GitLab ahead of any other platform. Having said that, I’ll mark non-CRAN packages with a # notcran comment in the source views so you know you need to install it from wherever you like to grab sketch packages from.

One table that osquery makes available under macOS is an inventory of all “apps” that macOS knows about. Previous posts have shown how to access these tables via the dplyr interface I built for osquery, but they involved multiple steps and as I started to use it more regularly (especially to explore the macOS 10.14 beta I’m running) I noticed that it could use some helper functions. One in particular — osq_expose_tables() — is pretty helpful in that it handles all the dplyr boilerplate code and makes table(s) available in the global environment by name. It takes a single table name or regular expression and then exposes all matching entities. While the function has a help page, it’s easier just to see it in action. Let’s expose the apps table:

library(osqueryr) # notcran
library(tidyverse)

osq_expose_tables("apps")

apps
## # Source:   table [?? x 19]
## # Database: OsqueryConnection
##    applescript_enab… bundle_executable    bundle_identifier   bundle_name  bundle_package_…
##  1 0                 1Password 6          com.agilebits.onep… 1Password 6  APPL
##  2 0                 2BUA8C4S2C.com.agil… 2BUA8C4S2C.com.agi… 1Password m… APPL
##  3 1                 Adium                com.adiumX.adiumX   Adium        APPL
##  4 1                 Adobe Connect        com.adobe.adobecon… Adobe Conne… APPL
##  5 1                 Adobe Illustrator    com.adobe.illustra… Illustrator… APPL
##  6 ""                AIGPUSniffer         com.adobe.AIGPUSni… AIGPUSniffer APPL
##  7 ""                CEPHtmlEngine Helper com.adobe.cep.CEPH… CEPHtmlEngi… APPL
##  8 ""                CEPHtmlEngine        com.adobe.cep.CEPH… CEPHtmlEngi… APPL
##  9 ""                LogTransport2        com.adobe.headligh… LogTranspor… APPL
## 10 ""                droplet              ""                  Analyze Doc… APPL
## # ... with more rows, and 14 more variables: bundle_short_version, bundle_version,
## #   category, compiler, copyright, development_region, display_name, element,
## #   environment, info_string, last_opened_time, minimum_system_version, name, path

There’s tons of info on all the apps macOS knows about, some of which are system services and “helper” apps (like Chrome’s auto-updater). One field — last_opened_time — caught my eye and I thought it would be handy to see which apps had little use (i.e. ones that haven’t been opened in a while) and which apps I might use more frequently (i.e. ones with more recent “open” times). That last_opened_time is a fractional POSIX timestamp and, due to the way osquery created the schemas, it’s in a character field. That’s easy enough to convert and then arrange() the whole list in descending order to let you see what you use most frequently.
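For example, the conversion just described might look something like this quick sketch against the exposed apps table (column names as shown in the output above):

library(dplyr)

select(apps, name, last_opened_time) %>%
  collect() %>%                                  # bring the osquery rows into R
  mutate(last_opened = as.POSIXct(as.numeric(last_opened_time), origin = "1970-01-01")) %>%
  arrange(desc(last_opened))                     # most recently opened apps first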

But, this is R and we can do better than a simple table or even a DT::datatable().

I recently added the ability to read macOS property lists (a.k.a. “plists”) to mactheknife by wrapping a Python module (plistlib). Since all (OK, “most”) macOS apps have an icon, I thought it would be fun to visualize the last opened frequency for each app using the app icons and ggplot2. Unfortunately, ImageMagick (and, thus, the magick package) cannot read macOS icns files, so you’ll need to do a brew install libicns before working with any of the remaining code since we’ll be relying on a command-line utility from that formula.

Let’s get the frontmatter out of the way:

library(sys)
library(magick)
library(osqueryr)    # notcran
library(mactheknife) # notcran
library(ggimage)
library(hrbrthemes)
library(ggbeeswarm)
library(tidyverse)

osq_expose_tables("apps")

# macOS will use a generic app icon when none is present in an app bundle; this is the location and we'll
# need to use it when our plist app spelunking comes up short
default_app <- "/System/Library/CoreServices/CoreTypes.bundle/Contents/Resources/GenericApplicationIcon.icns"

Next, we'll:

  • collect the apps table locally
  • filter out system-ish things (which we really don't care about for this post)
  • convert the last used time to something useful (and reduce it to a day resolution)
  • try to locate the property list for the app and read the path to the app icon file, substituting the generic one if not found (or other errors pop up):
select(apps, name, path, last_opened_time) %>%
  collect() %>%
  filter(!str_detect(path, "(^/System|usr|//System|/Library/|Helper|/Contents/|\\.service$)")) %>%
  mutate(lop_day = as.Date(anytime::anytime(as.numeric(last_opened_time)))) %>%
  mutate(icon = map_chr(path, ~{
    p <- read_plist(file.path(.x, "Contents", "Info.plist"))
    icns <- p$CFBundleIconFile[1]
    if (is.null(icns)) return(default_app)
    if (!str_detect(icns, "\\.icns$")) icns <- sprintf("%s.icns", icns)
    file.path(.x, "Contents", "Resources", icns)
  })) -> apps_df

apps_df
## # A tibble: 274 x 5
##    last_opened_time name                       path                      lop_day    icon
##  1 1529958322.11297 1Password 6.app            /Applications/1Password … 2018-06-25 /Applications/1Password 6.…
##  2 1523889402.80918 Adium.app                  /Applications/Adium.app   2018-04-16 /Applications/Adium.app/Co…
##  3 1516307513.7606  Adobe Connect.app          /Applications/Adobe Conn… 2018-01-18 /Applications/Adobe Connec…
##  4 1530044681.76677 Adobe Illustrator.app      /Applications/Adobe Illu… 2018-06-26 /Applications/Adobe Illust…
##  5 -1.0             Analyze Documents.app      /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
##  6 -1.0             Make Calendar.app          /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
##  7 -1.0             Contact Sheets.app         /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
##  8 -1.0             Export Flash Animation.app /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
##  9 -1.0             Web Gallery.app            /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
## 10 -1.0             Adobe InDesign CC 2018.app /Applications/Adobe InDe… 1969-12-31 /Applications/Adobe InDesi…
## # ... with 264 more rows

Since I really didn't feel like creating a package wrapper for libicns, we're going to use the sys package to make system calls to convert the icns files to png files. We really don't want to do this repeatedly for the same files if we ever run this again, so we'll setup a cache directory to hold our converted pngs.

Apps can (and, usually do) have multiple icons with varying sizes and are not guaranteed to have every common size available. So, we'll have the libicns icns2png utility extract all the icons and use the highest resolution one, using magick to reduce it to a 32x32 png bitmap.

# setup the cache dir -- use whatever you want
cache_dir <- path.expand("~/.r-icns-cache")
dir.create(cache_dir)

# create a unique name hash for more compact names
mutate(apps_df, icns_png = map_chr(icon, ~{
  hash <- digest::digest(.x, serialize = FALSE)
  file.path(cache_dir, sprintf("%s.png", hash))
})) -> apps_df

# find the icns2png program
icns2png <- unname(Sys.which("icns2png"))

# go through each icon file
pb <- progress_estimated(length(apps_df$icns_png))
walk2(apps_df$icon, apps_df$icns_png, ~{

  pb$tick()$print() # progress!

  if (!file.exists(.y)) { # don't create it if it already exists

    td <- tempdir()

    # no icon file == use default one
    if (!file.exists(.x)) .x <- default_app

    # convert all of them to pngs
    sys::exec_internal(
      cmd = icns2png,
      args = c("-x", "-o", td, .x),
      error = FALSE
    ) -> res

    rawToChar(res$stdout) %>%          # go through icns2png output
      str_split("\n") %>%
      flatten_chr() %>%
      keep(str_detect, " Saved") %>%   # find all the extracted icons
      last() %>%                       # use the last one
      str_replace(".* to /", "/") %>%  # clean up the filename so we can read it in
      str_replace("\\.$", "") -> png

    # read and convert
    image_read(png) %>%
      image_resize(geometry_area(32, 32)) %>%
      image_write(.y)

  }

})

You can open up that cache directory with the macOS finder to find all the extracted/converted pngs.

Now, we're on the final leg of our app-use visualization journey.

Some system/utility apps have start-of-epoch dates due to the way the macOS installer tags them. We only want "recent" ones so I set an arbitrary cutoff date of the year 2000. Since many apps would have the same last opened date, I wanted to get a spread out layout "for free". One way to do that is to use a ggbeeswarm position adjustment such as position_quasirandom():

filter(apps_df, lop_day > as.Date("2000-01-01")) %>%
  ggplot() +
  geom_image(
    aes(x = "", lop_day, image = icns_png),
    size = 0.033,
    position = position_quasirandom(width = 0.5)
  ) +
  geom_text(
    data = data_frame(
      x = c(0.6, 0.6),
      y = as.Date(c("2018-05-01", "2017-09-15")),
      label = c("More recently used ↑", "Not used in a while ↓")
    ),
    aes(x, y, label = label),
    family = font_an, size = 5, hjust = 0, color = "lightslategray"
  ) +
  labs(x = NULL, y = "Last Opened Time") +
  labs(
    x = NULL, y = NULL,
    title = "macOS 'Last Used' App History"
  ) +
  theme_ipsum_rc(grid = "Y") +
  theme(axis.text.x = element_blank())

There are tons of other ways to look at this data and you can use the osquery daemon to log this data regularly so you can get an extra level of detail. An interesting offshoot project would be to grab the latest RStudio dailies and see if you can wrangle a sweet D3 visualization from the app data we collected. Make sure to share your creations in the comments. You can find the full code in this snippet.



Setting up RStudio Server, Shiny Server and PostgreSQL by @ellis2013nz

Fri, 07/06/2018 - 14:00

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

Motivation

For a variety of reasons I wanted a Linux server in the cloud with data science software on it. In no particular order:

  1. so I could access a reliable setup from anywhere with a web browser, including on networks without the software I need (which happens a lot in corporate/government environments)
  2. so I could do round-the-clock data gathering when necessary
  3. to run my pet HappyRrobot twitterbot, having retired the old physical linux box it used to be hosted from
  4. to make sure I stay familiar with what’s needed to set up such a system from scratch.

I couldn’t tell you which of these are the more important.

In this blog post I’ll be noting down the steps necessary for setting up this server, and illustrating purpose 2 with a Shiny app, hosted on the server and drawing data from a database on the server, which is collecting a round-the-clock random sample of Twitter. The end result looks like this (click on the image to go to the actual working app):

If you’re interested in Twitter data but not in setting up a Linux server with R, Shiny and PostgreSQL on it you can skip straight to where I talk about getting the data in and analysing it.

Setting up a server for data and stats

I chose Amazon Web Services Elastic Cloud to host the server. I used the Elastic IP service (fairly cheap) to allocate a permanent IP address to the instance.

I was keen on using Red Hat or similar flavoured Linux, because of purpose #4 noted above; I already knew a bit about Ubuntu and wanted to expand my knowledge sphere, and Red Hat is the flavour that seems to pop up on corporate and government servers. In the end I opted for CentOS, which is “a Linux distribution that provides a free, enterprise-class, community-supported computing platform functionally compatible with its upstream source, Red Hat Enterprise Linux (RHEL)”. In fact, installing R is a bit easier on CentOS than it was in my Red Hat experiments.

I want the following on this machine:

  • R, RStudio Server and Python 3 for analysis and for data munging
  • PostgreSQL for storing data and supporting analysis
  • Shiny Server for dissemination
  • A webserver (I chose Nginx) to support RStudio Server and Shiny Server through reverse proxy so I can access them on regular web browser ports rather than ports 8787 and 3838, which will often not be available from a corporate network
  • all the other utilities and extras to support all this, such as curl, mail and fonts

This was non-trivial, which is one of the reasons why I’m blogging about it! There are lots of fishhooks and little things to sort through and while there are some excellent blog posts and tutorials out there to help do it, none of them covered quite the end-to-end setup I needed. One of my outputs from the process was a set of notes – not quite a single run-and-forget configuration script as you’d want in a professional setup, but fairly easy to use – that makes it easy to do similar in the future.

R and a few basics

Here’s where I started, once having started up the server (plenty of tutorials on how to do that provided by Amazon themselves and others). I start by installing R, including the epel “Extra Packages for Enterprise Linux” that are needed beforehand.

# we'll want mail later
sudo yum install -y mailx

#-----------------R and its dependencies------------------------
# install R. We need epel first
sudo yum update -y
sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install -y texlive
sudo yum install -y texinfo

# clean up
rm *.rpm

sudo yum install -y R

# and version control of course
sudo yum install -y git

Database

Next thing was to install PostgreSQL. I found it useful to install this before I started installing R packages, because some R packages that speak to PostgreSQL behave differently on installation depending on whether PostgreSQL is found on the machine or not

#=====================postgresql=======================
# install postgresql. Good to do this before we start installing R packages
sudo yum install -y postgresql-server postgresql-contrib postgresql-devel
sudo postgresql-setup initdb

# stop to give postgres account a password for the operating system
sudo passwd postgres

# start the postgresql service
sudo systemctl start postgresql

# I think this next line means the database service restarts when the machine is rebooted
sudo systemctl enable postgresql

Extending R via packages and their dependencies

I find installing all the R packages I regularly use a harder job on Linux than on Windows. I’m sorry, but I do. In particular, Windows installations of packages like gdal seem to look after upstream dependencies seamlessly and quietly. Not so on Linux. Here’s what I needed to do at the command line to get all the R packages I wanted installed.

#======================miscellaneous dependencies needed by R packages===============
# Best to do this on a large instance, even if you only start it as big
# during the install. 8GB seems a minimum

#----------------tidyverse and Cairo---------
# First, some dependencies that rvest, devtools, Cairo need:
sudo yum install -y libcurl-devel libxml2-devel openssl-devel
sudo yum install -y cairo-devel libXt-devel udunits2-devel gdal-devel poppler-cpp-devel

#------------The gdal epic---------------------
# needed for spatial stuff and in particular sf which needs version > 2 (currently 2.2).
# This should work according to the sf github page (excluding udunits2 which we look after later):
sudo yum install -y gdal-devel proj-devel proj-epsg proj-nad geos-devel

# but that installs the wrong version of gdal! We have to install it by hand.
# see https://gis.stackexchange.com/questions/263495/how-to-install-gdal-on-centos-7-4
# adapted from https://gist.github.com/simondobner/f859b2db15ad65090c3c316d3c224f45
wget http://download.osgeo.org/gdal/2.2.4/gdal-2.2.4.tar.gz
tar zxvf gdal-2.2.4.tar.gz
cd gdal-2.2.4/
./configure --prefix=/usr/ --with-sfcgal=no
make -j4
sudo make install

# should have a test here to only do the next two things if installed correctly!
rm *.tar.gz
rm gdal-2.2.4 -r

#======================R packages=====================
# Now we can install R packages that need all the above system dependencies first.
#
# udunits2 needs special configuration when installing in R so let's do that first and get it out of the way
sudo R -e "install.packages('udunits2',configure.args='--with-udunits2-include=/usr/include/udunits2', repos='http://cran.rstudio.com/')"

# these are a bunch of packages that are heavily used and that I want installed up front and available
# for all users (hence installing them as super user)
sudo R -e "install.packages(c('Rcpp', 'rlang', 'bindrcpp', 'dplyr', 'digest', 'htmltools', 'tidyverse', 'shiny', 'leaflet', 'sf', 'scales', 'Cairo', 'forecast', 'forcats', 'h2o', 'seasonal', 'data.table', 'extrafont', 'survey', 'forecastHybrid', 'ggseas', 'treemap', 'glmnet', 'ranger', 'RPostgres', 'igraph', 'ggraph', 'nzelect', 'tm', 'wordcloud', 'praise', 'showtext', 'ngram', 'pdftools', 'rtweet', 'GGally', 'ggExtra', 'lettercase', 'xgboost'), repos='http://cran.rstudio.com/')"

# fonts
sudo yum install dejavu-sans-fonts
sudo yum install -y google-droid-*-fonts
sudo yum install -y gnu-free-*-fonts
sudo R -e "extrafont::font_import(prompt = FALSE)"

I did all of this with my server set up as a “large” instance with 8GB of RAM. This particularly makes a difference when installing Rcpp. After all the initial setup is done you can stop the instance, downsize it to something cheaper, and restart it.

Note that I am using sudo to install R packages so they are available to all users (which will include the shiny user down the track), not just to me. I wanted everyone using this server to have the same set of packages available; obviously whether this is desirable or not depends on the purpose of the setup.

Server-related stuff

Next I want to get RStudio Server and Shiny Server working, and accessible via a web browser that just talks to standard port 80. There is a step here where the Nginx configuration file gets edited by hand; the links to RStudio support for RStudio Server and for Shiny Server contain instructions on what needs to go where.

Also note that the actual versions of RStudio Server and of Shiny Server below are date-specific (because they are installed via local install), and probably the links are already out of date.

#-------------webby stuff------------
# install a web server so we can deliver things through it via reverse proxy
# see https://support.rstudio.com/hc/en-us/articles/200552326-Running-RStudio-Server-with-a-Proxy
# and https://support.rstudio.com/hc/en-us/articles/213733868-Running-Shiny-Server-with-a-Proxy
sudo yum install -y nginx

# install RStudio-Server (2018-04-23)
wget https://download2.rstudio.org/rstudio-server-rhel-1.1.447-x86_64.rpm
sudo yum localinstall -y --nogpgcheck rstudio-server-rhel-1.1.447-x86_64.rpm

# install shiny and shiny-server (2018-04-23)
wget https://download3.rstudio.org/centos6.3/x86_64/shiny-server-1.5.7.907-rh6-x86_64.rpm
sudo yum localinstall -y --nogpgcheck shiny-server-1.5.7.907-rh6-x86_64.rpm
rm *.rpm

# now go make the necessary edits to /etc/nginx/nginx.conf
# note that the additions are made in two different bits of that file, you don't just paste the whole
# lot in.
sudo nano /etc/nginx/nginx.conf
sudo systemctl restart nginx

# go to yr.ip.number/shiny/ and yr.ip.number/rstudio/ to check all working

# add some more users if wanted at this point
# sudo useradd ellisp
# sudo passwd ellisp

# not sure if all these are needed:
sudo systemctl enable nginx
sudo systemctl enable rstudio-server
sudo systemctl enable shiny-server

# set the ownership of the directory we're going to keep apps in so the `shiny`
# user can access it
sudo chown -R shiny:shiny /srv/shiny-server

Python

CentOS currently comes with Python 2.7, but I wanted to be using Python 3. My Python skills are halting at best but I want them to be as future-proofed as possible. Anaconda seems a relatively straightforward way to manage Python.

#---------------Anaconda / python----------------
# go to https://repo.continuum.io/archive/ or https://www.anaconda.com/download/#linux to see the latest version
# Anaconda3 is with python 3.X, Anaconda2 is with python 2.7. Note
# that python 2.7 is part of the CentOS linux distribution and shouldn't be
# overwritten ie python xxx.py should run python 2.7. But doing the process below does this;
# watch out for if this causes problems later...
# wget https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
sudo bash Anaconda3-5.1.0-Linux-x86_64.sh
# agree to the license, and specify /opt/anaconda3 as location when asked

# we want to give all users anaconda on their path, so I snitched this from:
# https://www.vultr.com/docs/how-to-install-jupyter-notebook-on-a-vultr-centos-7-server-instance
sudo cp /etc/profile /etc/profile_backup
echo 'export PATH=/opt/anaconda3/bin:$PATH' | sudo tee -a /etc/profile
source /etc/profile
echo $PATH

sudo /opt/anaconda3/bin/conda
conda install psycopg2

# as far as I can tell this makes python3.6 the default python, which is surely going to cause problems down
# the track...

Configuring PostgreSQL

I installed PostgreSQL and started its database service early in this process, but in the next step need to actually set up some database and users for use. The PostgreSQL security model is thorough and comprehensive but with lots of fishhooks. Here’s how I set it up for this particular (very simple) use case. First, I enter the psql environment as the postgres user (currently the only user with any access to the database server)

sudo -u postgres psql

Now we can set up the users we want to be accessing our databases; some databases for them to use; and schemas within those databases. In this case, I set up two databases for now:

  • survey_microdata
  • twitter

and three different users, in addition to postgres:

  • ellisp (ie me, in development mode)
  • external_analyst (ie me or others, in read-only mode)
  • shiny (the Shiny Server’s id on the server, needed so Shiny apps can access the database)
-- you are now in psql as user postgres. Although default is to use unix's identification of you,
-- and you don't need a password to access the database from the local host, it's good to have a
-- password if you want to set up other connections later
\password postgres

CREATE DATABASE survey_microdata;
CREATE DATABASE twitter;

CREATE ROLE ellisp;
\password ellisp;
ALTER ROLE ellisp WITH LOGIN;

CREATE ROLE shiny;
-- no need for a password for shiny, it can only access the db from this machine

CREATE ROLE external_analyst;
\password external_analyst;

GRANT ALL PRIVILEGES ON DATABASE twitter TO ellisp;
GRANT ALL PRIVILEGES ON DATABASE survey_microdata TO ellisp;

\c survey_microdata;
CREATE SCHEMA nzivs;
CREATE SCHEMA nzis2011;
GRANT ALL PRIVILEGES ON SCHEMA nzivs TO ellisp;
GRANT ALL PRIVILEGES ON SCHEMA nzis2011 TO ellisp;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA nzivs TO ellisp;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA nzis2011 TO ellisp;
GRANT SELECT ON ALL TABLES IN SCHEMA nzis2011 to external_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA nzivs to external_analyst;

\c twitter
CREATE SCHEMA tweets;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO ellisp;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA tweets TO ellisp;
GRANT SELECT ON ALL TABLES IN SCHEMA tweets TO shiny;
GRANT CONNECT ON DATABASE twitter TO shiny;

\q

We also need to tweak the configuration so the PostgreSQL database is accessible from the outside world (if that’s what we want, which I do).

# follow instructions at https://blog.bigbinary.com/2016/01/23/configure-postgresql-to-allow-remote-connection.html if
# you want to remotely access eg from DBeaver on your laptop. Definitely need a password then

# first add in listen_addresses = 'localhost' just above the commented out version of
# #listen_addresses = 'localhost'
sudo nano /var/lib/pgsql/data/postgresql.conf

# now the client authentication file about how individuals can actually log on.
# Add the below two lines (not the # at beginning) to the bottom of the table.
# lets users log on via password from anywhere. If this doesn't suit your
# risk profile, find something more restrictive...
# host    all    all    0.0.0.0/0    md5
# host    all    all    ::/0         md5
sudo nano /var/lib/pgsql/data/pg_hba.conf

sudo systemctl restart postgresql

A Twitter sample stream database

Collecting data

OK, that wasn’t so bad was it (or was it…). I now have my server running (I stopped it and restarted it as a smaller cheaper instance than the 8GB of RAM I used during that setup) and available to do Useful Stuff. Like collect Twitter data for analysis and dissemination in Shiny:

For a while, I’ve been mildly exercised by the problem of sampling from Twitter. See for example this earlier post where I was reduced to using a snowballing network method to find users, and searching for tweets with the letter “e” in them to get a sample of tweets. Both of these methods have obvious problems if you want to do inference about how people as a whole are using Twitter.

On the other hand, Twitter make available several sets of public Tweets that are fully representative:

  • The free Sample Tweets API “returns a small random sample of all public Tweets.”
  • The Decahose stream provides a 10% sample of all public Tweets
  • The Firehose provides all public tweets

The latter two services are for paying customers only. My interest in Twitter is curiosity at most, so I’m only interested in the free sample, which is thought to be around 1% of the Firehose (exactly what proportion it is of the Firehose isn’t publicly known, and is a question of some inferential interest).

So I was interested in the sample stream, but I wanted to collect a sample over a period of time, not just from the day I was going to do some analysis. Even this 1% sample was more than I wanted to pay disk space to store if I were to collect over time, so I decided I would collect 30 seconds of sample streaming data every hour, at a random time within the hour to avoid problems associated with doing the sampling at the same time each day.
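A minimal sketch of what one such hourly collection run could look like with rtweet (assuming it is kicked off once an hour, e.g. from cron; the function that writes to the database is a hypothetical placeholder, the real code is linked below):

library(rtweet)

# start at a random point within the hour, then listen to the free sample stream for 30 seconds
Sys.sleep(runif(1, 0, 3500))
sample_tweets <- stream_tweets(q = "", timeout = 30)

# write_sample_to_db(sample_tweets)  # hypothetical helper that inserts the rows into the PostgreSQL 'twitter' database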

I designed a data model to capture the data I was most interested in while discarding attached video and images (this was about saving me disk space; I think serious Twitter analysis would have to do better than just collecting text). It looks like this:

BTW that diagram (and much of the database development) was done with the excellent universal SQL editor and database admin tool, DBeaver. It works with different flavours of relational database and is awesome.

The code that creates and populates that database is available on GitHub:

This has now been running smoothly since 17 May 2018, apart from one day last week when I botched an R upgrade and it all went down for half a day before I noticed (lesson learned – run update.packages(ask = FALSE, checkBuilt = TRUE) to ensure your R packages all keep working after the R internals change). So far the database is about 3GB in size, and I’m quite happy to let it grow quite a bit more than that.

What do we find out?

So far the main use I’ve put this data to is the Shiny app that I’ve scattered a few screenshots of in this blog post. The source code is on GitHub of course. That Shiny app writes its own SQL based on the inputs provided by the user (eg date range), queries the database and produces charts.
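The general pattern is roughly the sketch below (the table and input names here are hypothetical; the real queries live in the app's GitHub code):

library(DBI)
library(RPostgres)

con <- dbConnect(Postgres(), dbname = "twitter")

# build the SQL from the user's date-range input and let the database do the aggregation
sql <- sprintf(
  "SELECT created_at::date AS day, count(*) AS n
     FROM tweets.tweet                      -- hypothetical table name
    WHERE created_at BETWEEN '%s' AND '%s'
    GROUP BY 1 ORDER BY 1",
  input$date_range[1], input$date_range[2]
)
daily_counts <- dbGetQuery(con, sql)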

So what have I learned about Twitter (as opposed to about Linux administration) from the exercise? No time to explore in much depth right now, but some of the interesting things include:

  • Tweets have a daily cycle, peaking at around 15:30 UTC each day (this assumes that the sampling ratio in the Twitter sample stream is roughly constant; which I think is likely as otherwise why would we see this seasonality).
  • The most tweeted hashtags all relate to teen-oriented popular music. I had to look up TeenChoice just to find out what it was… My filter bubble isn’t so much a liberal-v-conservative one as something relating to different interests to most people in the world altogether. The things that dominate my own Twitter feed are not even faintly representative of Twitter as a whole (I expected this with regard to statistical computing of course, but it was interesting to find out that even US politics hardly makes a dent in the most common tweets/retweets in any particular day, compared to popular retweets such as “If the Cleveland Cavaliers win the 2018 NBA finals I’ll buy everyone who retweet’s this a jersey…” (sic) – 1.1 million retweets – and several suspiciously similar variants)
  • If you ignore tweets in Japanese, Korean, Thai and Arabic script you are missing three of the top seven languages on Twitter. Any serious analysis needs to find a way to bring them on board (my first obstacle in this was getting a font that could represent as many different scripts and emojis as possible without knowing in advance the language; in the end I opted for GNU FreeFont, as described in my own answer to my question on StackOverflow about this problem)
  • Only six of the ten most prolific Tweeters listed on Twitter Counter are currently prolifically tweeting. In particular, @venethis @AmexOffers @BEMANISoundTeam and @__Scc__ seem to have gone either completely quiet or just much lower frequency tweeting (code for analysis).
  • The currently most prolific tweeter is @akiko_lawson, which I think (I don’t read Japanese) is an account associated with Lawson convenience stores. This single account issues around 1 tweet for every 5,000 tweets by anyone on the planet.
  • The currently second most prolific tweeter is probably a spam bot, called @test5f1798, that amongst other things tweets pictures of Spam. Very meta.

There’s some interesting statistical challenges with using this database for inference that I might come back to. For example, I could use a small amount of auxiliary information such as the 1.1 million retweets of that Cleveland Cavaliers jersey tweet and compare it to the 105 times I found the tweet in my sample; and deduce that my sample is about 1 in 10,000 of the full population of tweets. This is consistent with the sample stream being a genuine 1% sample, of which I collect 1/120th (30 seconds every hour). So I should be able to treat my sample as a 1/12,000 sample of the whole population, clustered by the 30 second window they are in. Something for later.
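To make that back-of-envelope arithmetic concrete (numbers as quoted above):

105 / 1.1e6        # observed fraction of the jersey retweets, roughly 1 in 10,000
0.01 * (1 / 120)   # a 1% sample listened to for 30 seconds per hour, roughly 1 in 12,000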



Lyric Analysis with NLP and Machine Learning using R: Part One – Text Mining

Fri, 07/06/2018 - 12:48

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

June 22
By Debbie Liske

This is Part One of a three part tutorial series originally published on the DataCamp online learning platform in which you will use R to perform a variety of analytic tasks on a case study of musical lyrics by the legendary artist, Prince. The three tutorials cover the following:

Musical lyrics may represent an artist’s perspective, but popular songs reveal what society wants to hear. Lyric analysis is no easy task. Because it is often structured so differently than prose, it requires caution with assumptions and a uniquely discriminant choice of analytic techniques. Musical lyrics permeate our lives and influence our thoughts with subtle ubiquity. The concept of Predictive Lyrics is beginning to buzz and is more prevalent as a subject of research papers and graduate theses. This case study will just touch on a few pieces of this emerging subject.

Prince: The Artist

To celebrate the inspiring and diverse body of work left behind by Prince, you will explore the sometimes obvious, but often hidden, messages in his lyrics. However, you don’t have to like Prince’s music to appreciate the influence he had on the development of many genres globally. Rolling Stone magazine listed Prince as the 18th best songwriter of all time, just behind the likes of Bob Dylan, John Lennon, Paul Simon, Joni Mitchell and Stevie Wonder. Lyric analysis is slowly finding its way into data science communities as the possibility of predicting “Hit Songs” approaches reality.

Prince was a man bursting with music – a wildly prolific songwriter, a virtuoso on guitars, keyboards and drums and a master architect of funk, rock, R&B and pop, even as his music defied genres. – Jon Pareles (NY Times)

In this tutorial, Part One of the series, you’ll utilize text mining techniques on a set of lyrics using the tidy text framework. Tidy datasets have a specific structure in which each variable is a column, each observation is a row, and each type of observational unit is a table. After cleaning and conditioning the dataset, you will create descriptive statistics and exploratory visualizations while looking at different aspects of Prince’s lyrics.
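As a tiny taste of that tidy text structure, here is a sketch using the tidytext package and a made-up two-line data frame (not the dataset used in the tutorial):

library(dplyr)
library(tidytext)

lyrics <- tibble::tibble(
  line = 1:2,
  text = c("Dearly beloved, we are gathered here today",
           "To get through this thing called life")
)

unnest_tokens(lyrics, word, text) %>%  # one row per word: the tidy text format
  count(word, sort = TRUE)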

Check out the article here!

(reprint by permission of DataCamp online learning platform)



Ghosts of Animals Haunt Portland’s Overpriced Apartments

Fri, 07/06/2018 - 02:00

(This article was first published on Dan Garmat's Blog -- R, and kindly contributed to R-bloggers)

Apartment hunting in an expensive city is leading me to curses and exclamations. Following are some outstanding examples of insanely priced apartments in Portland, OR, run through Google Deep Dream in hopes of my understanding why people pay so much for a small box. These listings will be gone in no time (I’m sure) so including some captions for posterity.

Let’s start with this one. Indeed, it appears $1899 for 1 bedroom grants access to this clubhouse haunted by some floating apparition:

The Deep Dream InceptionV3 algorithm here is trained on ImageNet, then makes changes to the image that increase confidence in the predicted category. Looped several times with the num_octave hyperparameter, it starts to look a good bit trippy and helps give some intuition about what a neural network “sees” as prototypical examples of a predicted class. Apparently there is no “view of apartment” class, as it keeps seeing ghastly animals. Perhaps it is no coincidence that even before running InceptionV3 this clubhouse already looks like it could work in The Shining:

$1850 / 1br – 697ft2 – BRAND NEW! Enjoy Luxury Uban [Urban?] Living at The Franklin Flats!

“NEW ON THE MARKET!

“The Franklin Flats is the newest addition to this desirable part of town! Built with the urban adventurer in mind, our small community offers luxury appeal with a neighborhood feel. Boasting a walkability score of 86 out of 100, you can’t beat the location! [unless an 87+?] Our close proximity to Mt. Tabor, food carts, [because you won’t have anything left over for restaurants] shopping and eateries gives you the classic Northwest experience you crave. Email or call to schedule a personal tour today!”

Apparently the Attack on Titan seals make this a desirable part of town.

Perhaps those seals are why walkability doesn’t crack 90. If you survive the seals on a midday stroll, there are titan polar bears who can walk through walls:

$4250 / 2br – 1900ft2 – Condo on the Willamette

“Breathtaking views of the city and the Willamette River, located in the elegant Atwater. This condo has two bedrooms, living room, dining room, gourmet kitchen, gas fireplace, small office, two balconies, utility room and underground parking. Includes concierge desk, card-accessed security.”

Something tells me this view of the Willamette River would be complete if a cocker spaniel is staring at me…

But this view is what you really pay for: look at all the suckers in the two identical buildings who massively overpaid for – how the heck did those get up here?!

$3900 / 2br – 1004ft2 – Portland’s most well-appointed two bedroom apartments now available

“Portland’s premier rental community is now pre-leasing. Find everything you desire in one place: The finest dining, distinguished boutiques, and most-beloved haunts. Experience the perfect merger of luxury and livability with our timeless brick exterior, stunning marble lobby, tiled bathrooms, tall ceilings, ample light, extensive storage, concierge services, and even a dog washing station.

“Proudly situated in Portland’s [S]Nob Hill/23rd Ave neighborhood, 21 Astor boasts a “97” walk score and a “96” bike score. [Beat that Franklin Flats!] Life is great in the epicenter of Portland’s charm. Pick up your locally-grown produce at City Market. Grab happy hour with the gang at Seratto. Sweat it out at Corepower Yoga.

“Our greater than 1:1 parking ratio assures a space for your car in our garage. Imagine never having to find parking on NW 23rd again.”

I work on NW 21st, that’s almost the cost of parking alone. People walk by on Oregon ‘pot tours’ and this may be how they see the building as well:

Whoa! Is that a pig in the sky? Far out!

At least their new $3.88/sqft/month kitchens aren’t yet haunted – oh, god! Are those giant psychedelic worms!

Code:

Started with JJ Allaire, François Chollet, RStudio, and Google’s keras code based on Google DeepDream. Added a little looping to try different parameters over these 7 images:

library(keras) library(tensorflow) library(purrr) # Function Definitions ---------------------------------------------------- preprocess_image <- function(image_path){ image_load(image_path) %>% image_to_array() %>% array_reshape(dim = c(1, dim(.))) %>% inception_v3_preprocess_input() } deprocess_image <- function(x){ x <- x[1,,,] # Remove zero-center by mean pixel x <- x/2. x <- x + 0.5 x <- x * 255 # 'BGR'->'RGB' x <- x[,,c(3,2,1)] # Clip to interval 0, 255 x[x > 255] <- 255 x[x < 0] <- 0 x[] <- as.integer(x)/255 x } # Parameters -------------------------------------------------------- ## list of images to process -- list_images <- list.files('images/deep dream apartments/orig/', full.names = TRUE) ## list of settings to try -- list_settings <- list( settings = list( features = list( mixed2 = 0.2, mixed3 = 0.5, mixed4 = 2., mixed5 = 1.5), hyperparams = list( # Playing with these hyperparameters will also allow you to achieve new effects step = 0.01, # Gradient ascent step size num_octave = 6, # Number of scales at which to run gradient ascent octave_scale = 1.4, # Size ratio between scales iterations = 20, # Number of ascent steps per scale max_loss = 10 ) ), settings = list( features = list( mixed2 = 0.5, mixed3 = 0.2, mixed4 = 1.1, mixed5 = 1.5), hyperparams = list( step = 0.01, num_octave = 9, octave_scale = 1.1, iterations = 20, max_loss = 5 ) ), settings = list( features = list( mixed2 = 0.02, mixed3 = 0.05, mixed4 = 0.01, mixed5 = 0.05), hyperparams = list( step = 0.01, num_octave = 11, octave_scale = 1.1, iterations = 20, max_loss = 20 ) ), settings = list( features = list( mixed2 = 0.2, mixed3 = 0.5, mixed4 = 2., mixed5 = 1.5), hyperparams = list( step = 0.01, num_octave = 8, octave_scale = 1.4, iterations = 20, max_loss = 10 ) ), settings = list( features = list( mixed2 = 0.2, mixed3 = 0.1, mixed4 = 0.4, mixed5 = 0.3), hyperparams = list( step = 0.01, num_octave = 8, octave_scale = 1.4, iterations = 20, max_loss = 10 ) ), settings = list( features = list( mixed2 = 1.2, mixed3 = 1.5, mixed4 = 3., mixed5 = 2.5), hyperparams = list( step = 0.01, num_octave = 8, octave_scale = 1.4, iterations = 20, max_loss = 25 ) ), settings = list( features = list( mixed2 = 0.2, mixed3 = 2.5, mixed4 = 2., mixed5 = 3.5), hyperparams = list( step = 0.05, num_octave = 6, octave_scale = 1.4, iterations = 20, max_loss = 13 ) ), settings = list( features = list( mixed2 = 0.2, mixed3 = 2.5, mixed4 = 2., mixed5 = 3.5), hyperparams = list( step = 0.05, num_octave = 8, octave_scale = 1.4, iterations = 20, max_loss = 13 ) ) ) # Model Definition -------------------------------------------------------- k_set_learning_phase(0) # Build the InceptionV3 network with our placeholder. # The model will be loaded with pre-trained ImageNet weights. model <- application_inception_v3(weights = "imagenet", include_top = FALSE) # This will contain our generated image dream <- model$input # Get the symbolic outputs of each "key" layer (we gave them unique names). 
layer_dict <- model$layers names(layer_dict) <- map_chr(layer_dict ,~.x$name) ## just loop on images - bottleneck is training anyway -- for(image_path in list_images){ image <- preprocess_image(image_path) #res <- predict(model,image) #reduce(dim(res)[-1], .f = `*`) %>% # as.numeric() %>% #array_reshape(res, c(2, 51200)) %>% # imagenet_decode_predictions() ## get model prediction: #x <- image_to_array(image) #x <- array_reshape(image, c(1, dim(image))) #x <- imagenet_preprocess_input(image) #preds = predict(model, x) #imagenet_decode_predictions(preds, top = 3) #predict_classes(model, image) # need to try again later setting_counter <- 0 ## then nested loop on settings -- for(settings in list_settings){ setting_counter <- setting_counter + 1 print(paste0('starting ', image_path, ', setting # ', setting_counter)) # reload image each time image <- preprocess_image(image_path) # Define the loss loss <- k_variable(0.0) for(layer_name in names(settings$features)){ # Add the L2 norm of the features of a layer to the loss coeff <- settings$features[[layer_name]] x <- layer_dict[[layer_name]]$output scaling <- k_prod(k_cast(k_shape(x), 'float32')) # Avoid border artifacts by only involving non-border pixels in the loss loss <- loss + coeff*k_sum(k_square(x)) / scaling } # Compute the gradients of the dream wrt the loss grads <- k_gradients(loss, dream)[[1]] # Normalize gradients. grads <- grads / k_maximum(k_mean(k_abs(grads)), k_epsilon()) # Set up function to retrieve the value # of the loss and gradients given an input image. fetch_loss_and_grads <- k_function(list(dream), list(loss,grads)) # this function will crash the R session if too many octaves on too small an image eval_loss_and_grads <- function(image){ outs <- fetch_loss_and_grads(list(image)) list( loss_value = outs[[1]], grad_values = outs[[2]] ) } gradient_ascent <- function(x, iterations, step, max_loss = NULL) { for (i in 1:iterations) { out <- tryCatch(eval_loss_and_grads(x), error = function(e) NA) # need to add this for negative gradients if(is.na(out$loss_value)){ print(paste0('NA loss_value on setting # ', setting_counter)) break } else if (!is.null(max_loss) & out$loss_value > max_loss) { break } print(paste("Loss value at", i, ':', out$loss_value)) x <- x + step * out$grad_values } x } original_shape <- dim(image)[-c(1, 4)] successive_shapes <- list(original_shape) for (i in 1:settings$hyperparams$num_octave) { successive_shapes[[i+1]] <- as.integer(original_shape/settings$hyperparams$octave_scale^i) } successive_shapes <- rev(successive_shapes) original_image <- image shrunk_original_img <- image_array_resize( image, successive_shapes[[1]][1], successive_shapes[[1]][2] ) shpnum <- 0 # for debugging for (shp in successive_shapes) { shpnum <- shpnum + 1 # for debugging image <- image_array_resize(image, shp[1], shp[2]) print(paste0('running octave ', shpnum))# for debugging image <- gradient_ascent(image, settings$hyperparams$iterations, settings$hyperparams$`step`, settings$hyperparams$max_loss) print(paste0('finished octave ', shpnum))# for debugging upscaled_shrunk_original_img <- image_array_resize(shrunk_original_img, shp[1], shp[2]) same_size_original <- image_array_resize(original_image, shp[1], shp[2]) lost_detail <- same_size_original - upscaled_shrunk_original_img image <- image + lost_detail shrunk_original_img <- image_array_resize(original_image, shp[1], shp[2]) } image_path %>% gsub('/orig/', '/dream/', .) %>% gsub('.jpg', paste0('_dream', setting_counter, '.png'), .) %>% png(filename = .) 
    plot(as.raster(deprocess_image(image)))
    dev.off()

    print(paste0('finished ', image_path, ', setting # ', setting_counter))
  }
}

To leave a comment for the author, please follow the link and comment on their blog: Dan Garmat's Blog -- R.

New package packagefinder – Search for packages from the R console

Fri, 07/06/2018 - 01:52

(This article was first published on Topics in R, and kindly contributed to R-bloggers)

There are over 12,700 R packages on CRAN. How do you find the right one for you? The new package ‘packagefinder‘ helps you search for packages on CRAN right from the R console. With ‘packagefinder’ you can search for multiple keywords in the name, title and description of a CRAN package, either case-sensitively or case-insensitively, and define your own weighting scheme for the search results, if you like. Once you have found a promising package, you can use the simple function go() to go to the package’s CRAN webpage or view its PDF manual, directly from the R console without having to install the package first. Of course, you can also install the package easily, if you want to try it out. Check out ‘packagefinder’ on CRAN: https://cloud.r-project.org/web/packages/packagefinder/index.html
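As a rough sketch of that workflow (findPackage() is the search function from the package's own documentation rather than this post; check ?findPackage and ?go for the exact arguments):

library(packagefinder)

# search package names, titles and descriptions on CRAN for a keyword
findPackage("survey calibration")

# see ?go for opening a result's CRAN page or PDF manual straight from the console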

And leave your comments on GitHub (https://github.com/jsugarelli/packagefinder) or contact me via Twitter or e-mail. Your ideas are highly appreciated!


To leave a comment for the author, please follow the link and comment on their blog: Topics in R.

seven digit addition

Fri, 07/06/2018 - 00:18

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

Another quick riddle from the riddler: solve the equation

EXMREEK + EHKREKK = ?K?H?X?E

which involves every digit between 0 and 9. While the puzzle can be unravelled by considering first E and K, which must be equal to 6 and 3, a simple R code also leads to the conclusion

isok <- function(a,b){
  s=as.numeric(unlist(strsplit(as.character(sum(10^(6:0)*a)+
    sum(10^(6:0)*b)),"")))
  if (length(s)==7){
    goal=FALSE}else{
    goal=(length(unique(c(a,b,s)))==10)&(a[2]==s[6])&
      (s[8]==a[6])&(s[2]==a[7])&(b[2]==s[4])}
  return(goal)}

pasok <- function(T=1e3){
  a=b=rep(0,7) # initialise the digit vectors so the assignments below work
  for (t in 1:T){
    a[1]=a[5]=a[6]=6;a[7]=3
    digs=sample(c(0:2,4,5,7:9),4)
    a[2:4]=digs[1:3]
    b[1]=a[1];b[2]=digs[4];
    b[3]=a[7];b[4]=a[4];b[5]=a[1];b[6:7]=a[7]
    if (isok(a=a,b=b)) print(rbind(a,b))}}

> pasok()
  [,1] [,2] [,3] [,4] [,5] [,6] [,7]
a    6    2    4    7    6    6    3
b    6    8    3    7    6    3    3

which sum is 13085296.


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

Convex Regression Model

Thu, 07/05/2018 - 23:23

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

This morning during the lecture on nonlinear regression, I mentioned (very) briefly the case of convex regression. Since I forgot to mention the codes in R, I will publish them here. Assume that y_i = m(x_i) + ε_i, where m is some convex function.

Recall that m is convex if and only if, for all x_1, x_2 and t ∈ [0,1], m(t·x_1 + (1−t)·x_2) ≤ t·m(x_1) + (1−t)·m(x_2). Hildreth (1954) proved that if m̂ minimises the sum of squared residuals over the set of convex functions, then the fitted values m̂(x_i) are unique.

Let θ = (m(x_1), …, m(x_n)); the problem is then to minimise Σ_i (y_i − θ_i)² over the set K of vectors compatible with a convex function, i.e. the fit θ* is the projection of y onto the (closed) convex cone K. The projection theorem gives existence and uniqueness.

For convenience, in the application, we will consider the real-valued case, i.e. x_i ∈ ℝ. Assume that observations are ordered, x_1 ≤ x_2 ≤ … ≤ x_n. Here K is the set of vectors θ whose successive slopes (θ_{i+1} − θ_i)/(x_{i+1} − x_i) are non-decreasing in i.

Hence, we have a quadratic program with linear constraints.

The solution m̂ is a piecewise linear function (the interpolation of the consecutive pairs (x_i, θ*_i)).

If m is differentiable, m is convex if m(x) + ∇m(x)·(y − x) ≤ m(y) for all x and y.

More generally, if m is convex, then there exists ξ_x such that m(x) + ξ_x·(y − x) ≤ m(y) for all y; such a ξ_x is a subgradient of m at x.

Hence, the fitted values θ* can be obtained by solving this constrained least-squares problem. Now, to do it for real, use the cobs package for constrained (b)splines regression,

library(cobs)

To get a convex regression, use

plot(cars)
x = cars$speed
y = cars$dist
rc = conreg(x,y,convex=TRUE)
lines(rc, col = 2)


Here we can get the values of the knots

rc

Call:
conreg(x = x, y = y, convex = TRUE)

Convex regression: From 19 separated x-values, using 5 inner knots,
     7, 8, 9, 20, 23.
RSS = 1356; R^2 = 0.8766; needed (5,0) iterations

and actually, if we use them in a linear-spline regression, we get the same output here

library(splines) # for bs()
reg = lm(dist~bs(speed,degree=1,knots=c(4,7,8,9,20,23,25)),data=cars)
u = seq(4,25,by=.1)
v = predict(reg,newdata=data.frame(speed=u))
lines(u,v,col="green")

Let us add vertical lines for the knots

abline(v=c(4,7,8,9,20,23,25),col="grey",lty=2)


To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

New Course: Support Vector Machines in R

Thu, 07/05/2018 - 20:28

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Here is the course link.

Course Description

Support Vector Machines in R will help students develop an understanding of the SVM model as a classifier and gain practical experience using R’s libsvm implementation from the e1071 package. Along the way, students will gain an intuitive understanding of important concepts, such as hard and soft margins, the kernel trick, different types of kernels, and how to tune SVM parameters. Get ready to classify data with this impressive model.

Chapter 1: Introduction

This chapter introduces some key concepts of support vector machines through a simple 1-dimensional example. Students are also walked through the creation of a linearly separable dataset that is used in the subsequent chapter.

Chapter 2: Support Vector Classifiers – Linear Kernels

Chapter 2 introduces students to the basic concepts of support vector machines by applying the svm algorithm to a dataset that is linearly separable. Key concepts are illustrated through ggplot visualizations that are built from the outputs of the algorithm and the role of the cost parameter is highlighted via a simple example. The chapter closes with a section on how the algorithm deals with multi-class problems.
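For readers who want to experiment before taking the course, a minimal sketch along these lines with e1071 might look as follows (the toy dataset here is made up purely for illustration):

library(e1071)

# toy, roughly linearly separable data
set.seed(42)
df <- data.frame(
  x1 = c(rnorm(50, -2), rnorm(50, 2)),
  x2 = c(rnorm(50, -2), rnorm(50, 2)),
  y  = factor(rep(c("A", "B"), each = 50))
)

# linear kernel; 'cost' controls how soft the margin is
fit <- svm(y ~ x1 + x2, data = df, kernel = "linear", cost = 1)
table(predicted = predict(fit, df), actual = df$y)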

Chapter 3: Polynomial Kernels

This chapter provides an introduction to polynomial kernels via a dataset that is radially separable (i.e. has a circular decision boundary). After demonstrating the inadequacy of linear kernels for this dataset, students will see how a simple transformation renders the problem linearly separable thus motivating an intuitive discussion of the kernel trick. Students will then apply the polynomial kernel to the dataset and tune the resulting classifier.

Chapter 4: Radial Basis Function Kernels

Chapter 4 builds on the previous three chapters by introducing the highly flexible Radial Basis Function (RBF) kernel. Students will create a "complex" dataset that shows up the limitations of polynomial kernels. Then, following an intuitive motivation for the RBF kernel, students see how it addresses the shortcomings of the other kernels discussed in this course.

The course requires the following prerequisites:

Introduction to R

Introduction to Machine Learning


To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming.

First Class R Support in Binder / Binderhub – Shiny Apps As Well as R-Kernels and RStudio

Thu, 07/05/2018 - 19:59

(This article was first published on Rstats – OUseful.Info, the blog…, and kindly contributed to R-bloggers)

I notice from the binder-examples/r repo that Binderhub now appears to offer all sorts of R goodness out of the can, if you specify a particular R build.

From the same repo root, you can get an R Jupyter kernel, an RStudio session, and Shiny apps.

And from previously, here’s a workaround for displaying R/HTMLwidgets in a Jupyter notebook.

OpenRefine is also available from a simple URL – https://mybinder.org/v2/gh/betatim/openrefineder/master?urlpath=openrefine – courtesy of betatim/openrefineder:

Perhaps it’s time for me to try to get my head round what the Jupyter notebook proxy handlers are doing…

PS see also Scripted Forms for a simple markdown script way of building interactive Jupyter widget powered UIs.


To leave a comment for the author, please follow the link and comment on their blog: Rstats – OUseful.Info, the blog….

New Course: Interactive Maps with leaflet in R

Thu, 07/05/2018 - 19:19

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Here is the course link.

Course Description

Get ready to have some fun with maps! Interactive Maps with leaflet in R will give you the tools to make attractive and interactive web maps using spatial data and the tidyverse. In this course, you will create maps using the IPEDS dataset, which contains data on U.S. colleges and universities. Along the way, you will customize your maps using labels, popups, and custom markers, and add layers to enhance interactivity. Following the course, you will be able to create and customize your own interactive web maps to reveal patterns in your data.

Chapter 1: Setting Up Interactive Web Maps (Free)

This chapter will introduce students to the htmlwidgets package and the leaflet package. Following this introduction, students will build their first interactive web map using leaflet. Through the process of creating this first map students will be introduced to many of the core features of the leaflet package, including adding different map tiles, setting the center point and zoom level, plotting single points based on latitude and longitude coordinates, and storing leaflet maps as objects.
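As a flavour of what that first map looks like in code, here is a minimal sketch (the coordinates are arbitrary and not taken from the course data):

library(leaflet)

m <- leaflet() %>%
  addTiles() %>%                                         # default OpenStreetMap tiles
  setView(lng = -122.4194, lat = 37.7749, zoom = 12) %>%  # center point and zoom level
  addMarkers(lng = -122.4194, lat = 37.7749, popup = "A single point")

m  # print the stored map object to render it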

Chapter 2: Plotting Points

In this chapter students will build on the leaflet map they created in chapter 1 to create an interactive web map of every four-year college in California. After plotting hundreds of points on an interactive leaflet map, students will learn to customize the markers on their leaflet map. This chapter will also show how to color-code markers based on a factor variable.

Chapter 3: Groups, Layers, and Extras

In chapter 3, students will expand on their map of all four year colleges in California to create a map of all American colleges. First, in section 3.1 students will review and build on the material from Chapter 2 to create a map of all American colleges. Then students will re-plot the colleges on their leaflet map by sector (public, private, or for-profit) using groups to enable users to toggle the colleges that are displayed on the map. In section 3.3 students will learn to add multiple base maps so that users can toggle between multiple map tiles.

Chapter 4: Plotting Polygons

In Chapter 4, students will learn to map polygons, which can be used to define geographic regions (e.g., zip codes, states, countries, etc.). Chapter 4 will start by plotting the zip codes in North Carolina that fall in the top quartile of mean family incomes. Students will learn to customize the polygons with color palettes and labels. Chapter 4 will conclude with adding a new layer to the map of every college in America that displays every zip code with a mean income of $200,000 or more during the 2015 tax year. Through the process of mapping zip codes students will learn about spatial data generally, geoJSON data, the @ symbol, and the addPolygons() function.
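A hedged sketch of that polygon workflow is below; the polygon layer zip_shapes and its mean_income attribute are hypothetical names used only for illustration:

library(leaflet)

# zip_shapes: a spatial polygons object (e.g. read from geoJSON) with a numeric
# mean_income column -- hypothetical data, substitute your own layer
pal <- colorNumeric("Blues", domain = zip_shapes$mean_income)

leaflet(zip_shapes) %>%
  addTiles() %>%
  addPolygons(
    fillColor = ~pal(mean_income),
    fillOpacity = 0.7,
    weight = 1,
    label = ~paste0("Mean income: ", mean_income)
  )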

This course requires the following prerequisites:

Introduction to R

Introduction to the Tidyverse


To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming.

Using R to Generate Live World Cup Notifications

Thu, 07/05/2018 - 13:00

(This article was first published on Open Analytics, and kindly contributed to R-bloggers)

Here in Belgium, World Cup fever is at fever pitch, but with matches starting
during work hours, how is a desk worker supposed to follow along? By leaving
the R environment? Blasphemy.

Today we show how to use R to generate live desktop notifications for The
Beautiful Game.

A notification system preview, free of local bias.

Overview

We break the process of producing a live score notification into the following
steps:

  1. Get the score
  2. Check if the score has changed
  3. If yes, show a notification
  4. Repeat steps 1-3 every minute
Step 1: Getting the Score

FIFA provides an API
with detailed information about matches. The API provides a list of each day’s
matches in JSON. A full description of the fields is provided in the API
documentation.

For the purposes of this exercise, we need the scores
(AggregateHomeTeamScore, AggregateAwayTeamScore, HomeTeamPenaltyScore,
AwayTeamPenaltyScore), and team names (HomeTeam.TeamName,
AwayTeam.TeamName).
Additionally, we subset the data to the active World Cup matches by filtering to
matches with IdSeason of 254645 (the World Cup competition
ID) and MatchStatus of 3 (the live match status ID).

As functions, this looks like:

readLiveMatchScore <- function() {
  # reading in the API data
  worldcupDataDF <- jsonlite::fromJSON("https://api.fifa.com/api/v1/live/football/")$Results
  # which World Cup match is currently active?
  worldcupMatchIdx <- which(worldcupDataDF$IdSeason == 254645 &
      worldcupDataDF$MatchStatus == 3)
  if (length(worldcupMatchIdx) != 1) {
    # no matches or more than 1 match
    liveScore <- NULL
  } else {
    # get the score
    liveScore <- unlist(worldcupDataDF[worldcupMatchIdx,
        c("AggregateHomeTeamScore", "AggregateAwayTeamScore",
          "HomeTeamPenaltyScore", "AwayTeamPenaltyScore")])
    homeTeam <- worldcupDataDF$HomeTeam$TeamName[[worldcupMatchIdx]]$Description
    awayTeam <- worldcupDataDF$AwayTeam$TeamName[[worldcupMatchIdx]]$Description
    names(liveScore) <- rep(c(homeTeam, awayTeam), 2)
  }
  liveScore
}

scoreAsString <- function(matchScore, penalties = FALSE) {
  out <- paste(names(matchScore)[1], " - ", names(matchScore)[2], ":",
               matchScore[1], " - ", matchScore[2])
  if (penalties && matchScore[1] == matchScore[2])
    out <- paste0(out, " (pen. ", matchScore[3], " - ", matchScore[4], ")")
  out
}

Step 2: Check If the Score Has Changed

To check if the score has changed, we store the previous score and
check if it differs from the current score. If there is a change, we send a
notification.

checkScoreAndNotify <- function(prevScore = NULL) { newScore <- readLiveMatchScore() if (is.null(newScore) && is.null(prevScore)) { # nothing to do here } else if (is.null(newScore) && !is.null(prevScore)) { # end of the game sendNotification(title = "Match ended", message = scoreAsString(prevScore, TRUE)) } else if (is.null(prevScore) && !is.null(newScore)) { # start of the game sendNotification(title = "Match started", message = scoreAsString(newScore)) } else if (!is.null(prevScore) && !is.null(newScore) && !identical(newScore, prevScore)) { # change in the score sendNotification(title = "GOAL!", message = scoreAsString(newScore)) } return(newScore) } Step 3: Display Notification

To create a notification, we use the notifier R
package (now archived on CRAN).
It can be installed via devtools:

devtools::install_version("notifier")

or via the CRAN Archive by giving the URL:

url <- "https://cran.rstudio.com/src/contrib/Archive/notifier/notifier_1.0.0.tar.gz" install.packages(url, type = "source", repos = NULL)

To spice up the notification, we add the World Cup logo in the
notification area.

# get the logo from FIFA website download.file("https://api.fifa.com/api/v1/picture/tournaments-sq-4/254645_w", "logo.png") sendNotification <- function(title = "", message) { notifier::notify(title = title, msg = message, image = normalizePath("logo.png")) } Step 4: Repeat Every Minute

We use the later package to
query the scores API repeatedly without blocking the R session. Taking
inspiration from Yihui Xie’s blog
Schedule R code to Be Executed Periodically in the
Current R Session,
we write a recursive function to query the scores. The previous score is
tracked using a global variable.

getScoreUpdates <- function() { prevScore <<- checkScoreAndNotify(prevScore) later::later(getScoreUpdates, delay = 60) } All Together Now

To run this entire process, we simply initialize the
global prevScore variable and launch the recursive function
getScoreUpdates:

prevScore <- NULL getScoreUpdates() Complete script ## 0. preparatory steps if (!require("notifier", character.only = TRUE)) { url <- "https://cloud.r-project.org/src/contrib/Archive/notifier/notifier_1.0.0.tar.gz" install.packages(url, type = "source", repos = NULL) } if (!require("later", character.only = TRUE)) { install.packages("later") } download.file("https://api.fifa.com/api/v1/picture/tournaments-sq-4/254645_w", "logo.png") ## 1. get match score readLiveMatchScore <- function() { # reading in the API data worldcupDataDF <- jsonlite::fromJSON("https://api.fifa.com/api/v1/live/football/")$Results # which World Cup match is currently active? worldcupMatchIdx <- which(worldcupDataDF$IdSeason == 254645 & worldcupDataDF$MatchStatus == 3) if (length(worldcupMatchIdx) != 1) { # no matches or more than 1 match liveScore <- NULL } else { # get the score liveScore <- unlist(worldcupDataDF[worldcupMatchIdx, c("AggregateHomeTeamScore", "AggregateAwayTeamScore", "HomeTeamPenaltyScore", "AwayTeamPenaltyScore")]) homeTeam <- worldcupDataDF$HomeTeam$TeamName[[worldcupMatchIdx]]$Description awayTeam <- worldcupDataDF$AwayTeam$TeamName[[worldcupMatchIdx]]$Description names(liveScore) <- rep(c(homeTeam, awayTeam), 2) } liveScore } scoreAsString <- function(matchScore, penalties = FALSE) { out <- paste(names(matchScore)[1], " - ", names(matchScore)[2], ":", matchScore[1], " - ", matchScore[2]) if (penalties && matchScore[1] == matchScore[2]) out <- paste0(out, " (pen. ", matchScore[3], " - ", matchScore[4], ")" ) out } ## 2. check for score changes checkScoreAndNotify <- function(prevScore = NULL) { newScore <- readLiveMatchScore() if (is.null(newScore) && is.null(prevScore)) { # nothing to do here } else if (is.null(newScore) && !is.null(prevScore)) { # end of the game sendNotification(title = "Match ended", message = scoreAsString(prevScore, TRUE)) } else if (is.null(prevScore) && !is.null(newScore)) { # start of the game sendNotification(title = "Match started", message = scoreAsString(newScore)) } else if (!is.null(prevScore) && !is.null(newScore) && !identical(newScore, prevScore)) { # change in the score sendNotification(title = "GOAL!", message = scoreAsString(newScore)) } return(newScore) } ## 3. send notification sendNotification <- function(title = "", message) { notifier::notify(title = title, msg = message, image = normalizePath("logo.png")) } ## 4. check score every minute getScoreUpdates <- function() { prevScore <<- checkScoreAndNotify(prevScore) later::later(getScoreUpdates, delay = 60) } ## 5. launch everything prevScore <- NULL getScoreUpdates() Wrap-Up

That’s our quick take on generating live score notifications using R. By using
a different API or alternative competition codes, this approach can be
generalized to generate notifications for other settings.

Oh, and if you’re looking to predict the winner of the World Cup using
statistics and historical trends, the BBC has you
covered. May the loudest vuvuzela win!


To leave a comment for the author, please follow the link and comment on their blog: Open Analytics.

Basic Generalised Additive Model In Ecology; Exercise

Thu, 07/05/2018 - 09:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Generalised Additive Models (GAMs) are non-parametric models that fit a smoother to the data. In this exercise, we will look at GAMs with a cubic spline basis, using the mgcv package. The dataset used can be downloaded here. It contains the results of an experiment on grassland richness over time in Yellowstone National Park (Sikkink et al. 2007).
Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the dataset and the required package before running the exercises.

Exercise 1
Observe the dataset and try to classify the response and explanatory variables. We will focus on ROCK as the explanatory variable.

Exercise 2
Do some scatter plots.

Exercise 3
Since the relationship is not linear, try fitting a GAM with the ROCK variable.

Exercise 4
Check the result. What can be inferred?

Exercise 5
Do some validation plots.

Exercise 6
Plot the base graph.

Exercise 7
Add predictions across the data and add some lines.

Exercise 8
Plot the fitted values.

Why do we only use the ROCK variable? Because it proved to give the best fit without incorporating all the explanatory variables. Try playing around with other explanatory variables to see the difference; a starting sketch for the GAM fit is shown below.
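If you want a head start on the exercises, a sketch of the GAM fit with mgcv might look like the code below. The data frame veg and the column names RICHNESS and ROCK are assumptions about the downloaded dataset, so adjust them to the actual file.

library(mgcv)

# veg: the Yellowstone grassland data loaded from the downloaded file;
# RICHNESS (response) and ROCK (covariate) are assumed column names
# veg <- read.table("<downloaded file>", header = TRUE)

m1 <- gam(RICHNESS ~ s(ROCK, bs = "cr"), data = veg)  # cubic regression spline
summary(m1)
plot(m1, residuals = TRUE, pch = 16)
gam.check(m1)  # basic validation plots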

Related exercise sets:
  1. Advanced Techniques With Raster Data – Part 3: Exercises
  2. Spatial Data Analysis: Introduction to Raster Processing (Part 1)
  3. Spatial Data Analysis: Introduction to Raster Processing: Part-3
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Weighting tricks for machine learning with Icarus – Part 1

Thu, 07/05/2018 - 02:17

(This article was first published on R – NC233, and kindly contributed to R-bloggers)

Calibration in survey sampling is a wonderful tool, and today I want to show you how we can use it in some Machine Learning applications, using the R package Icarus. And because ’tis the season, what better than a soccer dataset to illustrate this? The data and code are located on this gitlab repo: https://gitlab.com/haroine/weighting-ml


First, let’s start by installing and loading icarus and nnet, the two packages needed in this tutorial, from CRAN (if necessary):

install.packages(c("icarus","nnet")) library(icarus) library(nnet)

Then load the data:

load("data/weighting_ML_part1.RData")

The RData file contains two dataframes, one for the training set and one for the test set. They contain results of some international soccer games, from 01/2008 to 12/2016 for the training set, and from 01/2017 to 11/2017 for the test. Along with the team names and goals scored for each side, they include a few descriptive variables that we're going to use as features of our ML models:

> head(train_soccer) Date team opponent_team home_field elo_team 1 2010-10-12 Belarus Albania 1 554 2 2010-10-08 Bosnia and Herzegovina Albania 0 544 3 2011-06-07 Bosnia and Herzegovina Albania 0 594 4 2011-06-20 Argentina Albania 1 1267 5 2011-08-10 Montenegro Albania 0 915 6 2011-09-02 France Albania 0 918 opponent_elo importance goals_for goals_against outcome year 1 502 1 2 0 WIN 2010 2 502 1 1 1 DRAW 2010 3 564 1 2 0 WIN 2011 4 564 1 4 0 WIN 2011 5 524 1 2 3 LOSS 2011 6 546 1 2 1 WIN 2011

elo_team and opponent_elo are quantitative variables indicative of the level of each team at the date of the game; importance is a measure of how high-profile the game was (a friendly match rates 1 while a World Cup game rates 4). The other variables are imo self-descriptive.

Then we can train a multinomial logistic regression, with outcome being the predicted variable, and compute the predictions from the model:

outcome_model_unw <- multinom(outcome ~ elo_team + opponent_elo + home_field + importance, data = train_soccer) test_soccer$pred_outcome_unw <- predict(outcome_model_unw, newdata = test_soccer)

The sheer accuracy of this predictor is kinda good:

> ## Accuracy > sum(test_soccer$pred_outcome_unw == test_soccer$outcome) / nrow(test_soccer) [1] 0.5526316

but it has a problem: it never predicts draws!

> summary(test_soccer$pred_outcome_unw) DRAW LOSS WIN 0 208 210

And indeed, draws being less common than other results, it seems more profitable for an algorithm that optimizes accuracy never to predict them. As a consequence, the probability of a game being a draw is always lower than the probability of one team winning it. We could show that the probabilities are not well calibrated.

A common solution to this problem is to use reweighting to correct the imbalances in the sample, which we’ll now tackle. It is important to note that the weighting trick has to happen in the training set to avoid “data leaks”. A very good piece on this subject has been written by Max Kuhn in the documentation of caret.

https://topepo.github.io/caret/

Commonly, you would do:

train_soccer$weight <- 1 train_soccer[train_soccer$outcome == "DRAW",]$weight <- (nrow(train_soccer)/table(train_soccer$outcome)[1]) * 1/3 train_soccer[train_soccer$outcome == "LOSS",]$weight <- (nrow(train_soccer)/table(train_soccer$outcome)[2]) * 1/3 train_soccer[train_soccer$outcome == "WIN",]$weight <- (nrow(train_soccer)/table(train_soccer$outcome)[3]) * 1/3 > table(train_soccer$weight) 0.916067146282974 1.22435897435897 3336 1248

The draws are reweighted with a factor greater than 1 and the other games with a factor lesser than 1. This balances the predicted outcomes and thus improves the quality of the probabilities …

outcome_model <- multinom(outcome ~ elo_team + opponent_elo + home_field + importance, data = train_soccer, weights = train_soccer$weight) test_soccer$pred_outcome <- predict(outcome_model, newdata = test_soccer) > summary(test_soccer$pred_outcome) DRAW LOSS WIN 96 167 155

… though at a loss in accuracy:

> ## Accuracy > sum(test_soccer$pred_outcome == test_soccer$outcome) / nrow(test_soccer) [1] 0.5263158

Now let’s look at the balance of our training sample on other variables:

> round(table(test_soccer$importance) / nrow(test_soccer),2) 1 2 3 4 0.26 0.08 0.54 0.12 > round(table(train_soccer$importance) / nrow(train_soccer),2) 1 2 3 4 0.56 0.08 0.23 0.12

It seems that the test set features a lot more important matches than the training set. Let’s look further, in particular at the dates the matches of the training set were played:

> round(table(train_soccer$year) / nrow(train_soccer),2) 2008 2009 2010 2011 2012 2013 2014 2015 2016 0.10 0.11 0.11 0.10 0.11 0.13 0.11 0.11 0.12

Thus the matches of each year between 2008 and 2016 have the same influence on the final predictor. A better idea would be to give the most recent games a slightly higher influence, for example by increasing their weight and thus reducing the weights of the older games:

nyears <- length(unique(train_soccer$year)) year_tweak <- rep(1/nyears,nyears) * 1:nyears year_tweak <- year_tweak * 1/sum(year_tweak) ## Normalization > year_tweak [1] 0.02222222 0.04444444 0.06666667 0.08888889 0.11111111 0.13333333 [7] 0.15555556 0.17777778 0.20000000

We determine it is thus a good idea to balance on these two additional variables (year and importance). Now how should we do this? A solution could be to create an indicator variable containing all the values of the cross product between the variables outcome, year and importance, and use the same reweighting technique as before. But this would not be very practical and more importantly, some of the sub-categories would be nearly empty, making the procedure not very robust. A better solution is to use survey sampling calibration and Icarus

train_soccer$weight_cal <- 1
importance_pct_test <- unname(
  table(test_soccer$importance) / nrow(test_soccer)
)

marginMatrix <- matrix(, nrow = 0, ncol = 1) %>% ## Will be replaced by newMarginMatrix() in icarus 0.3.2
  addMargin("outcome", c(0.333,0.333,0.333)) %>%
  addMargin("importance", importance_pct_test) %>%
  addMargin("year", year_tweak)

train_soccer$weight_cal <- calibration(data=train_soccer, marginMatrix=marginMatrix,
  colWeights="weight_cal", pct=TRUE, description=TRUE,
  popTotal = nrow(train_soccer), method="raking")

outcome_model_cal <- multinom(outcome ~ elo_team + opponent_elo + home_field + importance,
  data = train_soccer,
  weights = train_soccer$weight_cal)

test_soccer$pred_outcome_cal <- predict(outcome_model_cal, newdata = test_soccer)

icarus gives a summary of the calibration procedure in the log (too long to reproduce here). We then observe a slight improvement in accuracy compared to the previous reweighting technique:

> sum(test_soccer$pred_outcome_cal == test_soccer$outcome) / nrow(test_soccer) [1] 0.5478469

But more importantly, we have reason to believe that we improved the quality of the probabilities assigned to each event (we could check this using metrics such as the Brier score or calibration plots).
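As an illustration of such a check, a simple multiclass Brier score can be computed from the predicted probabilities; brier_score below is a helper written here for illustration, not a function from icarus or nnet (lower values are better):

## Brier score: mean squared distance between predicted class probabilities
## and the one-hot encoded observed outcome
brier_score <- function(probs, actual){
  # probs: matrix of predicted probabilities (columns = classes)
  # actual: factor of observed outcomes
  obs <- sapply(colnames(probs), function(cl) as.numeric(actual == cl))
  mean(rowSums((probs - obs)^2))
}

probs_cal <- predict(outcome_model_cal, newdata = test_soccer, type = "probs")
brier_score(probs_cal, test_soccer$outcome)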

It is also worth noting that some algorithms (especially those who rely on bagging, boosting, or more generally on ensemble methods) naturally do a good job at balancing samples. You could for example rerun the whole code and replace the logit regressions by boosted algorithms. You would then observe fewer differences between the unweighted algorithm and its weighted counterparts.

Stay tuned for the part 2, where we’ll show a trick to craft better probabilities (particularly for simulations) using external knowledge on probabilities.


To leave a comment for the author, please follow the link and comment on their blog: R – NC233.

Why R Conference

Thu, 07/05/2018 - 02:00

(This article was first published on mlr-org, and kindly contributed to R-bloggers)

This July we had the great honor of presenting mlr and its ecosystem at Why R? 2018 (http://whyr2018.pl/) in Wroclaw, Poland.
You can find the slides here.
We want to thank the organizers for inviting us and providing us with great food and coffee, and many thanks to all participants for showing great interest in mlr.

Hopefully, this will spark further improvements and inspire new features as a result of the many interesting discussions we had.
This really encouraged us to work harder, in order to create even better software.
We also hope the event created interest in participating and contributing to our project, as the community really thrives on knowledge and experience of a very diverse set of people and backgrounds.


To leave a comment for the author, please follow the link and comment on their blog: mlr-org.

A First Look at NIMBLE

Thu, 07/05/2018 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Writing a domain-specific language (DSL) is a powerful and fairly common method for extending the R language. Both ggplot2 and dplyr, for example, are DSLs. (See Hadley’s chapter in Advanced R for some elaboration.) In this post, I take a first look at NIMBLE (Numerical Inference for statistical Models for Bayesian and Likelihood Estimation), a DSL for formulating and efficiently solving statistical models in general, and Bayesian hierarchical models in particular. The latter comprise a class of interpretable statistical models useful for both inference and prediction. (See Gelman’s 2006 Technometrics paper for what these models can and cannot do.)

Most of what I describe here can be found in the comprehensive and very readable paper by de Valpine et al., or in the extensive NIMBLE User Manual.

At the risk of oversimplification, it seems to me that the essence of the NIMBLE is that it is a system for developing models designed around two dichotomies. The first dichotomy is that NIMBLE separates the specification of a model from the implementation of the algorithms that express it. A user can formulate a model in the BUGS language, and then use different NIMBLE functions to solve the model. Or, looking at things the other way round, a user can write a NIMBLE function to implement some algorithm or another efficiently, and then use this function in different models.

The second dichotomy separates model setup from model execution. NIMBLE programming is done by writing nimbleFunctions (see Chapter 11 of the Manual), a subset of the R language that is augmented with additional data structures and made compilable. nimbleFunctions come in two flavors. Setup functions run once for each model, noting any model structure used to define the model. Run functions can be executed by R or compiled into C++ code.
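To give a flavour of what a nimbleFunction looks like, here is a minimal run-only example in the spirit of the User Manual (a sketch, not code from this post):

library(nimble)

# a trivial run-only nimbleFunction: the type declarations in the arguments
# and returnType() are what allow it to be compiled to C++
timesTwo <- nimbleFunction(
  run = function(x = double(1)) {  # x is a numeric vector
    returnType(double(1))
    return(2 * x)
  }
)

timesTwo(1:3)                          # executed in R
cTimesTwo <- compileNimble(timesTwo)   # compiled via generated C++
cTimesTwo(1:3)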

NIMBLE is actually more complicated, or should I say “richer”, than this. There are many advanced programming concepts to wrap your head around. Nevertheless, it is pretty straightforward to start using NIMBLE for models that already have BUGS implementations. The best way to get started for anyone new to Bayesian statistics, or whose hierarchical model building skills may be a bit rusty, is to take the advice of the NIMBLE developers and work through the Pump Model example in Chapter 2 of the manual. In the rest of this post, I present an abbreviated and slightly reworked version of that model.

Before getting started, note that although NIMBLE is an R package listed on CRAN, and it installs like any other R package, you must have a C++ compiler and the standard make utility installed on your system before installing NIMBLE. (See Chapter 4 of the manual for platform-specific installation instructions.)

Pump Failure Model

The Pump Failure model is discussed by George et al. in their 1993 paper: Conjugate Likelihood Distributions. The paper examines Bayesian models that use conjugate priors, but which do not have closed form solutions when prior distributions are associated with the hyperparameters. The BUGS solution of the problem is given here.

The data driving the model are x[i], the number of failures for pump i in a time interval t[i], where i goes from 1 to 10.

library(nimble) library(igraph) library(tidyverse) pumpConsts <- list(N = 10, t = c(94.3, 15.7, 62.9, 126, 5.24,31.4, 1.05, 1.05, 2.1, 10.5)) pumpData <- list(x = c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22))

The number of failures x[i] follows a Poisson distribution with parameter lambda[i] = theta[i] * t[i], where the failure rate theta[i] is itself modeled as a Gamma distribution with hyperparameters alpha and beta.

The model is expressed in the BUGS language wrapped inside the NIMBLE function nimbleCode(), which turns the BUGS code into a object that can be operated on by nimbleModel()

pumpCode <- nimbleCode( { for (i in 1:N){ theta[i] ~ dgamma(alpha,beta) lambda[i] <- theta[i]*t[i] x[i] ~ dpois(lambda[i]) } alpha ~ dexp(1.0) beta ~ dgamma(0.1,1.0) }) pumpInits <- list(alpha = 1, beta = 1, theta = rep(0.1, pumpConsts$N))

nimbleModel() produces the model object that can be executed by R or compiled.

pump <- nimbleModel(code = pumpCode, name = "pump", constants = pumpConsts, data = pumpData, inits = pumpInits)

The following command lets us look at the nodes that comprise the model’s directed graph, and plot it.

pump$getNodeNames() [1] "alpha" "beta" "lifted_d1_over_beta" [4] "theta[1]" "theta[2]" "theta[3]" [7] "theta[4]" "theta[5]" "theta[6]" [10] "theta[7]" "theta[8]" "theta[9]" [13] "theta[10]" "lambda[1]" "lambda[2]" [16] "lambda[3]" "lambda[4]" "lambda[5]" [19] "lambda[6]" "lambda[7]" "lambda[8]" [22] "lambda[9]" "lambda[10]" "x[1]" [25] "x[2]" "x[3]" "x[4]" [28] "x[5]" "x[6]" "x[7]" [31] "x[8]" "x[9]" "x[10]" pump$plotGraph()

We can look at the values stored at each node. The node for x contains the initial values we entered into the model and the nodes for theta and lambda contain the initial calculated values

pump$x [1] 5 1 5 14 3 19 1 1 4 22 pump$theta [1] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 pump$lambda [1] 9.430 1.570 6.290 12.600 0.524 3.140 0.105 0.105 0.210 1.050

We can also look at the log probabilities of the likelihoods.

pump$logProb_x [1] -2.998011 -1.118924 -1.882686 -2.319466 -4.254550 -20.739651 [7] -2.358795 -2.358795 -9.630645 -48.447798

Next, we use the model to simulate new values for theta and update the variables.

set.seed(1) pump$simulate("theta") pump$theta [1] 0.15514136 1.88240160 1.80451250 0.83617765 1.22254365 1.15835525 [7] 0.99001994 0.30737332 0.09461909 0.15720154

These new values of theta will lead to new values of lambda and the log probabilities once the dependent nodes are recalculated; note that in the output below they have not yet been updated.

pump$lambda [1] 9.430 1.570 6.290 12.600 0.524 3.140 0.105 0.105 0.210 1.050 pump$logProb_x [1] -2.998011 -1.118924 -1.882686 -2.319466 -4.254550 -20.739651 [7] -2.358795 -2.358795 -9.630645 -48.447798

The model can also be compiled. The C++ code generated is loaded back into R with an object that can be examined like the uncompiled model.

Cpump <- compileNimble(pump) Cpump$theta [1] 0.15514136 1.88240160 1.80451250 0.83617765 1.22254365 1.15835525 [7] 0.99001994 0.30737332 0.09461909 0.15720154

Note that since a wide variety of NIMBLE models can be compiled, NIMBLE should be generally useful for developing production R code. Have a look at the presentation by Christopher Paciorek for a nice overview of NIMBLE’s compilation process.

Next, the default NIMBLE MCMC algorithm is used to generate posterior samples from the distributions for the model parameters alpha, beta, theta and lambda, along with summary statistics and the value of Watanabe's WAIC statistic.

mcmc.out <- nimbleMCMC(code = pumpCode, constants = pumpConsts, data = pumpData, inits = pumpInits, monitors=c("alpha","beta","theta","lambda"), nchains = 2, niter = 10000, summary = TRUE, WAIC = TRUE) |-------------|-------------|-------------|-------------| |-------------------------------------------------------| |-------------|-------------|-------------|-------------| |-------------------------------------------------------| names(mcmc.out) [1] "samples" "summary" "WAIC"

The output object contains a summary of the posterior samples,

mcmc.out$summary $chain1 Mean Median St.Dev. 95%CI_low 95%CI_upp alpha 0.69804352 0.65835063 0.27037676 0.287898244 1.3140461 beta 0.92862598 0.82156847 0.54969128 0.183699137 2.2872696 lambda[1] 5.67617535 5.35277649 2.39989346 1.986896247 11.3087435 lambda[2] 1.59476464 1.28802618 1.24109695 0.126649837 4.7635134 lambda[3] 5.58222072 5.28139948 2.36539331 1.961659802 11.1331869 lambda[4] 14.57540796 14.23984600 3.79587390 8.085538041 22.9890092 lambda[5] 3.16402849 2.87859865 1.63590766 0.836991007 7.1477641 lambda[6] 19.21831706 18.86685281 4.33423677 11.703168568 28.6847447 lambda[7] 0.94776605 0.74343559 0.77658191 0.077828283 2.8978174 lambda[8] 0.93472103 0.74313533 0.76301563 0.076199581 2.9598715 lambda[9] 3.31124086 3.03219017 1.61332897 0.955909813 7.2024472 lambda[10] 20.89018329 20.59798130 4.45302924 13.034529765 30.4628012 theta[1] 0.06019274 0.05676327 0.02544956 0.021069950 0.1199230 theta[2] 0.10157737 0.08203988 0.07905076 0.008066869 0.3034085 theta[3] 0.08874755 0.08396502 0.03760562 0.031186960 0.1769982 theta[4] 0.11567784 0.11301465 0.03012598 0.064170937 0.1824525 theta[5] 0.60382223 0.54935089 0.31219612 0.159731108 1.3640771 theta[6] 0.61204831 0.60085518 0.13803302 0.372712375 0.9135269 theta[7] 0.90263434 0.70803389 0.73960182 0.074122175 2.7598261 theta[8] 0.89021051 0.70774794 0.72668155 0.072571029 2.8189252 theta[9] 1.57678136 1.44390008 0.76825189 0.455195149 3.4297368 theta[10] 1.98954127 1.96171250 0.42409802 1.241383787 2.9012192 $chain2 Mean Median St.Dev. 95%CI_low 95%CI_upp alpha 0.69101961 0.65803654 0.26548378 0.277195564 1.2858148 beta 0.91627273 0.81434426 0.53750825 0.185772263 2.2702428 lambda[1] 5.59893463 5.29143956 2.32153991 1.976164994 10.9564380 lambda[2] 1.57278293 1.27425268 1.23323878 0.129781580 4.7262566 lambda[3] 5.60321125 5.27780200 2.32992322 1.970709123 10.9249502 lambda[4] 14.60674094 14.30971897 3.86145332 8.013012004 23.0526313 lambda[5] 3.13119513 2.84685917 1.67006926 0.782262325 7.2043337 lambda[6] 19.17917926 18.82326155 4.33456380 11.661139736 28.5655803 lambda[7] 0.94397897 0.74446578 0.76576887 0.080055678 2.8813517 lambda[8] 0.94452324 0.74263433 0.77013200 0.074813473 2.9457364 lambda[9] 3.30813061 3.04512049 1.58008544 0.986914665 7.0355869 lambda[10] 20.88570471 20.60384141 4.44130984 13.092562597 30.5574424 theta[1] 0.05937364 0.05611283 0.02461866 0.020956151 0.1161870 theta[2] 0.10017726 0.08116259 0.07855024 0.008266343 0.3010355 theta[3] 0.08908126 0.08390782 0.03704170 0.031330829 0.1736876 theta[4] 0.11592652 0.11356920 0.03064645 0.063595333 0.1829574 theta[5] 0.59755632 0.54329373 0.31871551 0.149286703 1.3748728 theta[6] 0.61080189 0.59946693 0.13804343 0.371373877 0.9097319 theta[7] 0.89902759 0.70901502 0.72930369 0.076243503 2.7441445 theta[8] 0.89954594 0.70727079 0.73345905 0.071250926 2.8054633 theta[9] 1.57530029 1.45005738 0.75242164 0.469959364 3.3502795 theta[10] 1.98911473 1.96227061 0.42298189 1.246910723 2.9102326 $all.chains Mean Median St.Dev. 
95%CI_low 95%CI_upp alpha 0.69453156 0.65803654 0.26795776 0.28329854 1.2999319 beta 0.92244935 0.81828160 0.54365539 0.18549077 2.2785444 lambda[1] 5.63755499 5.32462511 2.36129857 1.98294712 11.1597721 lambda[2] 1.58377379 1.28149065 1.23719199 0.12734396 4.7382079 lambda[3] 5.59271599 5.28024565 2.34769002 1.96451069 11.0072923 lambda[4] 14.59107445 14.27080924 3.82874035 8.04541916 23.0250158 lambda[5] 3.14761181 2.86460377 1.65311690 0.80506062 7.1718837 lambda[6] 19.19874816 18.84484055 4.33433610 11.68198222 28.6445471 lambda[7] 0.94587251 0.74395440 0.77117739 0.07927988 2.8911629 lambda[8] 0.93962214 0.74299160 0.76657858 0.07571751 2.9470742 lambda[9] 3.30968573 3.03910484 1.59675456 0.97386482 7.1120082 lambda[10] 20.88794400 20.60051118 4.44706278 13.05509616 30.5216406 theta[1] 0.05978319 0.05646474 0.02504028 0.02102807 0.1183433 theta[2] 0.10087731 0.08162361 0.07880204 0.00811108 0.3017967 theta[3] 0.08891440 0.08394667 0.03732417 0.03123228 0.1749967 theta[4] 0.11580218 0.11326039 0.03038683 0.06385253 0.1827382 theta[5] 0.60068928 0.54668011 0.31548032 0.15363752 1.3686801 theta[6] 0.61142510 0.60015416 0.13803618 0.37203765 0.9122467 theta[7] 0.90083096 0.70852800 0.73445465 0.07550465 2.7534885 theta[8] 0.89487822 0.70761105 0.73007484 0.07211191 2.8067373 theta[9] 1.57604083 1.44719278 0.76035931 0.46374515 3.3866706 theta[10] 1.98932800 1.96195345 0.42352979 1.24334249 2.9068229

and the value of the WAIC statistic.

mcmc.out$WAIC
[1] 48.69896

Here, we select sample values for the parameters for pump 1 in the first chain, and put them into a data frame for plotting.

df <- data.frame(mcmc.out$samples$chain1) df_l <- df %>% select(alpha, beta, theta.1., lambda.1.) %>% gather(key="parameter", value="value")

We plot the sample values.

ps <- df_l %>% ggplot(aes(x=seq_along(value), y = value)) + geom_line() ps + facet_wrap(~parameter, scales = "free")

And, we plot histograms.

p <- ggplot(df_l,aes(value)) + geom_histogram(aes( y= ..density..),bins = 60) p + facet_wrap(~parameter, scales = "free")

Note that it is also possible to perform the MCMC simulation using the compiled model.

mcmc.out_c<- nimbleMCMC(model=Cpump, monitors=c("alpha","beta","theta","lambda"), nchains = 2, niter = 10000, summary = TRUE, WAIC = TRUE) |-------------|-------------|-------------|-------------| |-------------------------------------------------------| |-------------|-------------|-------------|-------------| |-------------------------------------------------------| mcmc.out_c$summary $chain1 Mean Median St.Dev. 95%CI_low 95%CI_upp alpha 0.70269965 0.65474666 0.27690796 0.288871669 1.3525877 beta 0.92892181 0.82320341 0.54874194 0.186215812 2.2756643 lambda[1] 5.65646492 5.32568604 2.38108302 1.991839453 11.1329833 lambda[2] 1.58848917 1.28392445 1.25676948 0.133001650 4.7163618 lambda[3] 5.62720720 5.30681963 2.36776285 1.986194548 11.1264464 lambda[4] 14.61256770 14.29447077 3.75584085 8.106535985 22.8405075 lambda[5] 3.16521167 2.88771869 1.65132178 0.785161558 7.0181185 lambda[6] 19.12824948 18.77541823 4.27045427 11.733960316 28.4815908 lambda[7] 0.94353548 0.75312906 0.75111813 0.079570796 2.8410369 lambda[8] 0.93661525 0.74764821 0.75397145 0.078424425 2.8536890 lambda[9] 3.33178422 3.05974419 1.63035789 1.019006182 7.3163945 lambda[10] 20.90388784 20.58355995 4.45456152 12.997984788 30.2815949 theta[1] 0.05998372 0.05647599 0.02525009 0.021122370 0.1180592 theta[2] 0.10117765 0.08177863 0.08004901 0.008471443 0.3004052 theta[3] 0.08946275 0.08436915 0.03764329 0.031577020 0.1768910 theta[4] 0.11597276 0.11344818 0.02980826 0.064337587 0.1812739 theta[5] 0.60404803 0.55109135 0.31513774 0.149839992 1.3393356 theta[6] 0.60917992 0.59794326 0.13600173 0.373693004 0.9070570 theta[7] 0.89860522 0.71726577 0.71535060 0.075781711 2.7057495 theta[8] 0.89201452 0.71204591 0.71806805 0.074689928 2.7177991 theta[9] 1.58656391 1.45702104 0.77636090 0.485241039 3.4839974 theta[10] 1.99084646 1.96033904 0.42424395 1.237903313 2.8839614 $chain2 Mean Median St.Dev. 95%CI_low 95%CI_upp alpha 0.70184646 0.65773527 0.27237859 0.297108329 1.3420954 beta 0.92323539 0.82124257 0.53880496 0.194879601 2.2590201 lambda[1] 5.59702813 5.25201276 2.35832632 2.029382518 10.9327321 lambda[2] 1.62105397 1.31199418 1.26269004 0.137977536 4.8652461 lambda[3] 5.62874314 5.33611797 2.37576774 1.953756972 11.1296452 lambda[4] 14.53507135 14.22210300 3.84823087 8.042950272 22.9287741 lambda[5] 3.17647361 2.88816158 1.67257096 0.807468793 7.2500264 lambda[6] 19.13576117 18.82011366 4.34294395 11.646448559 28.4716352 lambda[7] 0.94656373 0.74705570 0.76192793 0.084445304 2.9245815 lambda[8] 0.94754678 0.75725106 0.75136514 0.083985118 2.8740656 lambda[9] 3.34514300 3.04989975 1.64818642 0.974288761 7.2961225 lambda[10] 20.97154230 20.61713159 4.45405753 13.260753885 30.5614115 theta[1] 0.05935343 0.05569473 0.02500876 0.021520493 0.1159357 theta[2] 0.10325185 0.08356651 0.08042612 0.008788378 0.3098883 theta[3] 0.08948717 0.08483494 0.03777055 0.031061319 0.1769419 theta[4] 0.11535771 0.11287383 0.03054151 0.063832939 0.1819744 theta[5] 0.60619725 0.55117587 0.31919293 0.154097098 1.3835928 theta[6] 0.60941915 0.59936668 0.13831032 0.370906005 0.9067400 theta[7] 0.90148927 0.71148162 0.72564565 0.080424099 2.7853158 theta[8] 0.90242550 0.72119149 0.71558585 0.079985826 2.7372053 theta[9] 1.59292524 1.45233321 0.78485068 0.463947029 3.4743440 theta[10] 1.99728974 1.96353634 0.42419596 1.262928941 2.9106106 $all.chains Mean Median St.Dev. 
95%CI_low 95%CI_upp alpha 0.70227305 0.65580594 0.27464608 0.292312894 1.3489184 beta 0.92607860 0.82237274 0.54378998 0.191611561 2.2684771 lambda[1] 5.62674653 5.28611334 2.36985910 2.007150471 11.0415843 lambda[2] 1.60477157 1.29777398 1.25980697 0.135465213 4.8015059 lambda[3] 5.62797517 5.32327502 2.37170950 1.967656654 11.1291530 lambda[4] 14.57381952 14.26247905 3.80241886 8.071801955 22.9002186 lambda[5] 3.17084264 2.88801833 1.66194832 0.798927322 7.1492800 lambda[6] 19.13200533 18.78653361 4.30674559 11.682136284 28.4767857 lambda[7] 0.94504960 0.75003915 0.75652494 0.081865294 2.8859048 lambda[8] 0.94208101 0.75177975 0.75267045 0.080734142 2.8644982 lambda[9] 3.33846361 3.05488924 1.63926902 0.990402148 7.3041714 lambda[10] 20.93771507 20.60497943 4.45432662 13.116940079 30.4162660 theta[1] 0.05966857 0.05605635 0.02513106 0.021284735 0.1170900 theta[2] 0.10221475 0.08266076 0.08024248 0.008628358 0.3058284 theta[3] 0.08947496 0.08463076 0.03770603 0.031282300 0.1769341 theta[4] 0.11566523 0.11319428 0.03017793 0.064061920 0.1817478 theta[5] 0.60512264 0.55114854 0.31716571 0.152467046 1.3643664 theta[6] 0.60929953 0.59829725 0.13715750 0.372042557 0.9069040 theta[7] 0.90004724 0.71432300 0.72049994 0.077966947 2.7484808 theta[8] 0.89722001 0.71598072 0.71682900 0.076889659 2.7280935 theta[9] 1.58974458 1.45470916 0.78060429 0.471620071 3.4781768 theta[10] 1.99406810 1.96237899 0.42422158 1.249232389 2.8967872 Monte Carlo Expectation Analysis

To illustrate that NIMBLE is more than just an MCMC engine, the manual example uses NIMBLE’s built-in Monte Carlo Expectation algorithm to maximize the marginal likelihood for alpha and beta. The first step is to set up “box constraints” for the model. Then, the buildMCEM() function is used to construct an MCEM algorithm from a NIMBLE model.

pump2 <- pump$newModel() box = list( list(c("alpha","beta"), c(0, Inf))) pumpMCEM <- buildMCEM(model = pump2, latentNodes = "theta[1:10]", boxConstraints = box)

This is how to run the Monte Carlo Expectation model. Note that the authors of the NIMBLE manual point out that these results are within 0.01 of the values obtained by George et al.

pumpMLE <- pumpMCEM$run() |-------------|-------------|-------------|-------------| |-------------------------------------------------------| Iteration Number: 1. Current number of MCMC iterations: 1000. Parameter Estimates: alpha beta 0.8130146 1.1125783 Convergence Criterion: 1.001. |-------------|-------------|-------------|-------------| |-------------------------------------------------------| Iteration Number: 2. Current number of MCMC iterations: 1000. Parameter Estimates: alpha beta 0.8148887 1.2323211 Convergence Criterion: 0.02959269. |-------------|-------------|-------------|-------------| |-------------------------------------------------------| Iteration Number: 3. Current number of MCMC iterations: 1000. Parameter Estimates: alpha beta 0.8244363 1.2797908 Convergence Criterion: 0.005418668. |-------------|-------------|-------------|-------------| |-------------------------------------------------------| Monte Carlo error too big: increasing MCMC sample size. |-------------|-------------|-------------|-------------| |-------------------------------------------------------| Iteration Number: 4. Current number of MCMC iterations: 1250. Parameter Estimates: alpha beta 0.8280353 1.2700022 Convergence Criterion: 0.001180319. |-------------|-------------|-------------|-------------| |-------------------------------------------------------| Monte Carlo error too big: increasing MCMC sample size. |-------------|-------------|-------------|-------------| |-------------------------------------------------------| Monte Carlo error too big: increasing MCMC sample size. |-------------|-------------|-------------|-------------| |-------------------------------------------------------| Iteration Number: 5. Current number of MCMC iterations: 2188. Parameter Estimates: alpha beta 0.8268224 1.2794244 Convergence Criterion: 0.000745466.

Like the tidyverse, NIMBLE is an ecosystem of DSLs. The BUGS language is extended and used as a DSL for formulating models. nimbleFunctions is a language for writing algorithms that may be used with both BUGS and R. But, unlike the tidyverse, the NIMBLE DSLs are not distributed across multiple packages; instead, they are wrapped up into the NIMBLE package code.

For more on the design and use of DSLs with R, have a look at the chapter in Advanced R, or Thomas Mailund’s new book, Domain-Specific Languages in R.



To leave a comment for the author, please follow the link and comment on their blog: R Views.

Life (expectancy), animated

Thu, 07/05/2018 - 02:00

(This article was first published on Rstats on Jakub Nowosad's website, and kindly contributed to R-bloggers)

Global socio-economic data is easily accessible nowadays.
Just type the indicator of interest and the name of a country into your preferred search engine and you will find its value, sometimes with an additional plot or map.
But what if you want to go further and, for example:

  • Compare many countries?
  • Get data just for a specific year?
  • See changes over time?
  • Create a very specific plot or map?

In this blog post, I will show how to download, process, and present global socio-economic data, using the following packages:

library(wbstats)       # the World Bank data downloading
library(dplyr)         # data manipulation
library(tidyr)         # data manipulation
library(purrr)         # data manipulation
library(ggplot2)       # data visualization
library(sf)            # spatial data manipulation
library(rnaturalearth) # access to spatial data
library(tmap)          # spatial data visualization

World Bank Data

The World Bank Data is a source of various global socio-economic indicators with a large database of historical records.
Its API is accessible from R through, for instance, the wbstats package.
The two most important functions of this package are wbsearch() and wb().
The first one searches the available indicators, and the second one downloads the data.
Read the vignette “wbstats: An R package for searching and downloading data from the World Bank API” to learn more about this package.
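For instance, finding indicators related to life expectancy could look roughly like this (a minimal sketch based on the package documentation; the search term is just an example):

library(wbstats)

# search the cached list of indicators; returns a data frame with
# matching indicator IDs and descriptions
le_indicators <- wbsearch(pattern = "life expectancy at birth")
head(le_indicators)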

Here, I would like to create a time series of a selected indicator.
The function below downloads data for a given indicator and year.
It then selects only the variables of interest and reshapes them into a wide format.

wb_data_create = function(indicator, new_name, year, ...){
  df = wb(indicator = indicator, startdate = year, enddate = year, ...) %>%
    select(iso_a2 = iso2c, value, year = date) %>%
    mutate(indicator = new_name) %>%
    spread(indicator, value) %>%
    as_data_frame()
  # return the dataframe
  df
}

Importantly, this function can also be used to download data for many years using the map_dfr() function.
As an example, let’s download life expectancy at birth at the country level for every fifth year from 1963 to 2013:

data_life_exp = seq(1963, 2013, by = 5) %>%
  map_dfr(wb_data_create,
          indicator = "SP.DYN.LE00.IN",
          new_name = "life_exp",
          country = "countries_only")

This new dataset, data_life_exp, already allows us to analyze temporal changes in life expectancy.
For example, we can calculate the mean life expectancy across all countries for each year:

data_life_exp_avg = data_life_exp %>%
  group_by(year) %>%
  summarise(mean_life_exp = mean(life_exp, na.rm = TRUE))

Let’s visualize the life expectancy.
Each grey line represents one country, and the red one shows the mean life expectancy.

ggplot() +
  geom_line(data = data_life_exp,
            aes(year, life_exp, group = iso_a2),
            color = "gray40") +
  geom_line(data = data_life_exp_avg,
            aes(year, mean_life_exp, group = 1),
            color = "red", size = 2) +
  labs(x = "Year", y = "Life expectancy") +
  theme_minimal()

Spatialize it

Overall, we can see a stable increase in life expectancy over the last 55 years.
However, are there any differences between countries?
What is the spatial distribution of life expectancy?
To answer that, we can download spatial data of the world using the rnaturalearth package:

world = ne_countries(returnclass = "sf") %>% select(iso_a2, name_long, continent)

Next, we need to combine the new spatial data with our non-spatial information of life expectancy.
First, the non-spatial data is reshaped into a wide form.
This way we fill missing data with NA instead of removing it.
As a result, we have the same number of countries (borders) for each year in our database.
Secondly, we are joining the spatial world dataset with the non-spatial data_life_exp_wide using the variable "iso_a2" common to both datasets.
The last step is to convert the data back into a long form, where each pair of country and year corresponds to one row.
This also means that the spatial geometry for each country will be repeated for each time step.

data_life_exp_wide = data_life_exp %>%
  spread(year, life_exp)

world_temporal = world %>%
  left_join(data_life_exp_wide, by = c("iso_a2")) %>%
  gather(year, life_exp, `1963`:`2013`)

Facet map

How can we create a map showing temporal changes?
One of the most straightforward ways is to use a facet plot:

my_map = tm_shape(world_temporal, projection = "robin") +
  tm_fill("life_exp",
          title = "Life expectancy",
          palette = "viridis") +
  tm_facets(by = "year", ncol = 3)

my_map

Each map facet represents a different year.
We can see that, overall, the map colors shift from darker to brighter, indicating an increase in life expectancy.
There is a noticeable spatial similarity (“spatial autocorrelation”) – European countries tend to have a longer life expectancy, while the lowest life expectancy is in Africa.
This visualization also reveals (or at least helps to see) a lot more, for example that life expectancy in China increased visibly faster than in neighboring countries, or that there are missing values for Greenland before 1978.

There is, however, a downside to facet maps – each facet is relatively small and it is hard to see any details without zooming in.
Alternatively, an animated map can be created, where each facet becomes one frame of the animation.

Animated map

The code to create an animated map is almost identical to the facet map code.
There is only one important difference – the along parameter should be used instead of by:

my_ani = tm_shape(world_temporal, projection = "robin") +
  tm_fill("life_exp",
          title = "Life expectancy",
          palette = "viridis") +
  tm_facets(along = "year")

The output, my_ani, is a set of maps – one map per year.
It can be further converted into a proper animation with the tmap_animation() function:

tmap_animation(my_ani, filename = "life_exp_animation.gif", width = 1500, height = 600, delay = 60)

The output presents the animated changes in life expectancy.
The overall trend is still visible, but smaller changes can now be seen as well.
These include, for example, the decreases in life expectancy in Cambodia in the 1970s and in Rwanda in the 1990s.

Bonus

It is also possible to combine the two above methods – facet and animated map.
Let’s try it on the data from South America:

world_temporal_sa = filter(world_temporal, continent == "South America") %>% na.omit()

To create an animated facet map, both arguments – by and along – need to be set.
Here, countries are presented on facets and each animation frame is a different year.
Feel free to test what happens if you swap the by and along arguments.

my_ani2 = tm_shape(world_temporal_sa) +
  tm_polygons("life_exp",
              title = "Life expectancy",
              palette = "viridis") +
  tm_facets(by = "name_long", along = "year", nrow = 3)

tmap_animation(my_ani2, filename = "life_exp_sa_animation.gif",
               width = 1600, height = 1000, delay = 60)

This way of presenting the data has a nice property – it lets you focus on the values instead of the countries’ locations or sizes.
You can easily see the values of the smallest countries and compare them to the largest ones.

I must admit that the bonus visualization example might not be the most appropriate one.
This visualization method is best suited for comparing continuous features.
For example, imagine animated facet maps comparing continuous surface phenomena (e.g. temperature or air quality) for each country through time.
Try this method for yourself!

Final words

This blog post builds on the content in Chapter 8 of the book Geocomputation with R.
To learn more about spatial data analysis, visualization and modeling using R visit the online version of the Geocomputation with R book at https://geocompr.robinlovelace.net/.

Acknowledgement

I’d like to thank Tal Galili (creator of https://www.r-bloggers.com/ and https://www.r-users.com/), and Maëlle Salmon, Bruce Zhao, Colin Fay, Eric Nantz, Hao Zhu, Jasmine Dumas, Jon Calder, Jonathan Carroll, Kun Ren, Tracy Shen, Wolfram Qin (team members of https://rweekly.org) for doing a great job of gathering and spreading blog posts, tutorials, and many other R-related resources. Keep up the awesome work!


To leave a comment for the author, please follow the link and comment on their blog: Rstats on Jakub Nowosad's website.

Survey your audience and visualise the results with R and Google forms

Thu, 07/05/2018 - 02:00

(This article was first published on Bluecology blog, and kindly contributed to R-bloggers)


I wanted to make my presentation on dataviz at the UQ School of Bioinformatics more interactive.

A quiz is a good way to engage your audience. Given that I was giving a
talk about R data visualisation, I thought it would be fun to visualise
the quiz results live with the audience using R. To top it off, we
posted the results to Twitter.

This blog post describes how I did that.

You could also use this system to survey your audience and share the
results live. Just prepare your R code and set it to run at a certain
time during your talk with a task scheduler.

Setting up the survey

I used Google Forms to build my quiz. You can take it here.
I posed a few questions that challenged the audience to think about the
best way to visualise data.

It is pretty easy to set up a survey if you have a Gmail account. A few
tips:

  • You can add images, which is great for posing questions about results.

  • I used the ‘short answer’ input for numeric answers. If you click
    the validation tab at the bottom of each ‘short answer’ question, you
    can require users to enter certain types of numbers (e.g. within a
    range).

  • Think carefully about limiting required inputs if you want to avoid
    bugs that might arise from unexpected answers.

  • There is a green button at the top of the form that lets you link
    it to a Google Sheet. Do this.

  • You can make the sheet public, so other people can use it, by
    changing the sharing settings.

Connecting to your survey answers in R

I used the googlesheets package to read my survey answers from the
(public) spreadsheet. You will need to authenticate yourself first:

library(googlesheets)
gs_ls()

This will prompt you to log in to your Google account and authenticate
an app that allows the connection to happen.

Now we can load our data:

sheet_url <- "https://docs.google.com/spreadsheets/d/10i3v3NIVpgmURyLVzsiadPAMGeqa7dLFcDb9sqFe8KA/edit#gid=1513779153"
dataviz <- gs_url(sheet_url) # creates the connection

If you want to keep your sheet private, you can use gs_ls() to list all
your sheets and then read one in by its title, e.g. like this:
dataviz <- gs_title("Dataviz quiz 2018 v2 (Responses)")
dat <- gs_read(dataviz)

Analysing your data

The object dat we just read in is a data-frame-like object (actually a
tibble) where each column is a question and each row is a response. The
first column is a time stamp.

All other columns are titled with your questions.

It will make life easier if we rename the columns to shorter (but still
descriptive) names.

newnames <- c("timestamp", "shopping", "bar_percent", "pie_percent", "room", "cb_age")
names(dat) <- newnames

Now let’s create some dataviz

library(ggplot2)

datplot <- na.omit(dat)

ggplot(datplot, aes(x = room, y = cb_age)) +
  geom_boxplot() +
  xlab("Position in room") +
  ylab("Guess at CB's age") +
  ylim(0, 75) +
  theme_bw()

A boxplot of the audience’s guesses at my age by their position in the
room. I limited the y-axis because there were some outrageously large
numbers!

Share the results

We could show the audience the results on our screen. But why not let
Twitter know too!

For this, I used the rtweet package. rtweet is pretty simple to use
once you've set up an app on Twitter's API and authorised R to access
it. So get rtweet, then read its authentication vignette with
vignette("auth"). Follow the instructions to the letter and you
shouldn't have any problems.

Once authorisation is done, it's a simple matter to save our plot as a
png to use in a tweet:

myplot <- ggplot(datplot, aes(x = room, y = cb_age)) +
  geom_boxplot()

ggsave(filename = "myplot.png", myplot)

Now just write your tweet and send it off to Twitter.

library(rtweet)

newstatus = "Chris age as surveyed at #UQwinterSchool @DoktrNick @UQwinterSchool"

post_tweet(status = newstatus, media = "myplot.png")

Next steps?

So I tried this as a way of doing a live R tutorial. The next step would
be to integrate it into a talk without showing the R coding. For that
you would either need to get a friend to run the code, or use a
scheduler like the taskscheduleR R package – see the sketch below.
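A scheduled run with taskscheduleR might look roughly like this (a sketch only: taskscheduleR is Windows-only, and the task name, script path and time are placeholders of mine, not from the original post; on Linux or macOS the cronR package plays a similar role):

library(taskscheduleR)

# schedule an R script (which would hold the googlesheets + ggplot +
# post_tweet() code above) to run once at a set time during the talk
taskscheduler_create(taskname  = "tweet_quiz_results",
                     rscript   = "C:/talks/tweet_quiz_results.R",
                     schedule  = "ONCE",
                     starttime = "14:30")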

Be careful though! You never know what answers people may give if
allowed. So design your code to be robust to strange answers (like the
claim that I am 100 years old).
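One simple guard, assuming the renamed columns from above and treating the *_percent columns as percentages (the cut-offs here are arbitrary choices of mine, not from the original post), is to filter out implausible answers before plotting:

library(dplyr)

# keep only answers in a plausible range before plotting
datplot <- dat %>%
  filter(between(cb_age, 1, 99),
         between(bar_percent, 0, 100),
         between(pie_percent, 0, 100)) %>%
  na.omit()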


To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog.

New rWind release on CRAN! (v1.0.2)

Thu, 07/05/2018 - 01:13

(This article was first published on long time ago..., and kindly contributed to R-bloggers)

Hi there! Just a few lines to inform you about the new release of the rWind R package (v1.0.2). This version has several new features. Here is an example of one of them: the function wind.dl_2, which downloads time series of wind data. In this example we create a bunch of PNG files to be converted later into a GIF of wind speed over the east Asia region. Enjoy!

# We will create a gif of wind speed for east Asia during the end of June (2018)

# First, we should load the packages that we will use. Use install.packages() to
# download and install the new version of rWind (v1.0.2)

install.packages("rWind")

library(rWind)

library(fields)
library(shape)
library(rworldmap)
library(lubridate)

# Now, we use lubridate package to create a sequence of dates/times (each three
# hours)

dt <- seq(ymd_hms(paste(2018, 6, 25, 00, 00, 00, sep = "-")),
          ymd_hms(paste(2018, 7, 4, 21, 00, 00, sep = "-")),
          by = "3 hours")

# Now, we use the new function wind.dl_2 to download the whole time series of
# wind data. We use the "dt" object created with lubridate to provide the input
# to wind.dl_2. Since it's a large area and many days, it could take a while...

wind_series <- wind.dl_2(dt,90,150,5,40)

# Next, we can use the wind2raster function from the rWind package directly on
# the output of wind.dl_2, as it has been adapted to work with these lists of
# wind data.

wind_series_layer <- wind2raster(wind_series)

# Finally, we will create a bunch of PNG files to be converted later into a GIF

id <- 0
for (i in 1:72) {
  id <- sprintf("%03d", i)
  png(paste("asia", id, ".png", sep = ""), width = 1000, height = 600,
      units = "px", pointsize = 18)
  image.plot(wind_series_layer[[i]]$wind.speed, col = bpy.colors(1000),
             zlim = c(0, 18), main = wind_series[[i]]$time[1])
  lines(getMap(resolution = "low"), lwd = 3)
  dev.off()
}
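The post stops at the PNG files; one way to stitch them into the final GIF (my own sketch using the magick package – the frame rate and output file name are arbitrary choices, and an ImageMagick command-line call would work just as well) could be:

library(magick)

# read the numbered PNG frames in order, animate them and write a GIF
png_files <- sprintf("asia%03d.png", 1:72)
frames <- image_read(png_files)
gif <- image_animate(frames, fps = 4)  # fps must divide 100 (e.g. 2, 4, 5, 10)
image_write(gif, "asia_wind.gif")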


To leave a comment for the author, please follow the link and comment on their blog: long time ago....
