R-bloggers
R news and tutorials contributed by hundreds of R bloggers

API Instructions

Tue, 06/12/2018 - 15:32

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

The Fantasy Football Analytics API allows developers to programmatically access our data in JSON format. Users must pay for a standalone subscription to obtain a unique API key granting access. You must have an account on the Fantasy Football Analytics web app in order to subscribe to the API (simply click the “Register” tab in the navigation bar at http://apps.fantasyfootballanalytics.net/ to set up an account if you don’t already have one).

There are two API subscription types, “Basic” and “Premier”, both of which allow access to seasonal and weekly projections for previous seasons and to seasonal projections for the current season. Only the “Premier” subscription allows access to weekly projections for the current season. Both subscriptions also allow unrestricted access to ADP data for all seasons to date.

To subscribe to the API, sign into the Fantasy Football Analytics web app, click the “Subscribe” tab in the navigation bar, and use the select box to choose your subscription type. After you enter your credit card information and click the “Subscribe” button, the “Subscribe” modal will disappear. If you have successfully subscribed to the API, your API key will appear at the bottom of the “My Account” modal.

The API has two endpoints, “proj” and “adp”, which allow the user to get averaged projections and ADP data, respectively. API requests are expected to be made in the form of a simple HTTP GET, with users specifying various query parameters. Specifically, the optional query parameters for the “proj” endpoint are:

• pos: One of “qb” (default), “rb”, “wr”, “te”, “k”, “dst”, “dl”, “lb”, or “db”.
• season: A year between 2008 and the most recent NFL season (default), inclusive.
• week: An integer between 0 and 20, inclusive. The option week = 0 (default) corresponds to seasonal projections, whereas all other options correspond to weekly projections. The options week > 17 are used to designate playoff weeks. Weekly projections are not available for seasons prior to 2015.
• type: Scheme used to average individual projection sources; one of “average” (simple average; default), “robust” (robust average, which is resistant to outliers), or “weight” (weighted average, whereby the weights are based on historical projection accuracy).
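For intuition, the three schemes correspond roughly to the following R calculations on a vector of individual source projections (a hypothetical illustration of the concepts, not the API's internal code; the numbers and weights are made up):

# Hypothetical projections for one player from five sources (one outlier)
proj <- c(12.1, 11.8, 12.4, 19.0, 12.0)

mean(proj)                        # "average": simple mean, pulled up by the outlier
mean(proj, trim = 0.2)            # "robust": a trimmed mean resists the outlier
w <- c(0.3, 0.2, 0.2, 0.1, 0.2)   # hypothetical historical-accuracy weights
weighted.mean(proj, w)            # "weight": more accurate sources count more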

The optional query parameters for the “adp” endpoint are:

• season: A year between 2015 and the most recent NFL season (default), inclusive.
• type: League type; either “std” (standard; default) or “ppr” (points-per-reception).

API calls can be made using a programming language of your choice. In particular, the example below demonstrates how to get weighted average projections for team defenses for the 2016 season using R. (Be sure to replace “YOUR_API_KEY” with your actual unique API key for the Authorization header.) Analogously, ADP data may be retrieved by changing the “endpoint” parameter to “adp” and setting the query parameters appropriately.

# query parameters
pos <- "dst"
season <- 2016
week <- 0
type <- "weight"

# construct API URL
protocol <- "http"
host <- "api.fantasyfootballanalytics.net/api"
endpoint <- "proj"
query.string <- paste0("?pos=", pos, "&season=", season, "&week=", week, "&type=", type)
URL <- paste0(protocol, "://", host, "/", endpoint, query.string)

# API key (replace with your actual unique API key)
api.key <- "YOUR_API_KEY"

# call API to get JSON
json <- httr::content(
  httr::GET(URL, httr::add_headers(Authorization = paste("Basic", api.key))),
  type = "text"
)

# convert JSON to data frame
df <- jsonlite::fromJSON(json)
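For example, a sketch of the analogous ADP request for standard-scoring leagues in 2016 (only the endpoint and query parameters change; api.key is the same key as above):

# ADP example: same request pattern, "adp" endpoint with its own query parameters
season <- 2016
type <- "std"
URL <- paste0("http://api.fantasyfootballanalytics.net/api/adp",
              "?season=", season, "&type=", type)
json <- httr::content(
  httr::GET(URL, httr::add_headers(Authorization = paste("Basic", api.key))),
  type = "text"
)
adp <- jsonlite::fromJSON(json)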

Please note that empty/null values are intentionally omitted from the returned JSON, following Google’s JSON style guide.

The post API Instructions appeared first on Fantasy Football Analytics.


To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

Another Prediction for the FIFA World Cup 2018

Tue, 06/12/2018 - 13:36

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Ava Yang, Data Scientist

Given that the UEFA Champion League final a few weeks ago between Real Madrid and Liverpool is the only match I’ve watched properly in over ten years, how dare I presume I can guess that Brazil is going to lift the trophy in the 2018 FIFA World Cup? Well, here goes…

By the way, if you find the text below dry to read, it is because of my limited natural language on the subject matter… data science tricks to the rescue!

This blogpost is largely based on the prediction framework from an eRum 2018 talk by Claus Thorn Ekstrøm. For first hand materials please take a look at the slides, video and code.

The idea is that in each simulation run of the tournament we find the winner, the runner-up, third place, fourth place and so on. Repeating the simulation many times, e.g. 10,000 runs, returns a list of the teams with the highest probability of finishing top.
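Conceptually, the whole scheme boils down to something like the sketch below. Here simulate_one() is a stand-in that returns just the winning team of one simulated tournament; the real functions appear later in the post.

# Conceptual sketch only: tally winners over many simulated tournaments
nsim <- 10000
winners <- replicate(nsim, simulate_one())       # stand-in: one run -> one winner
sort(table(winners) / nsim, decreasing = TRUE)   # estimated probability of finishing top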

Apart from the winner question, this post seeks to answer which team will be the top scorer and how many goals they will score. After following Claus's analysis RMarkdown file, I collected new data, put the functions into a package and tried another modelling approach. Whilst the model is too simplistic to be correct, it captures the trend and is a fair starting point for adding more complex layers on top.

Initialization

To begin with, we load packages, including the accompanying R package worldcup where my utility functions reside. A package is a convenient way to share code, seal utility functions and speed up iteration. The global parameters normalgoals (the average number of goals scored in a World Cup match) and nsim (the number of simulations) are declared in the YAML section at the top of the RMarkdown document.

Next we load three datasets that have been tidied up from open-source resources or updated from their original versions. Plenty of time was spent gathering the data, aligning team names and cleaning up features.

  • team_data contains features associated with each team
  • group_match_data is the public match schedule
  • wcmatches_train is a match dataset available from this Kaggle competition, used as the training set to estimate the parameter lambda, i.e. the average number of goals scored in a match by a single team. Records from 1994 up to 2014 are kept in the training set.
library(tidyverse)
library(magrittr)
devtools::load_all("worldcup")

normalgoals <- params$normalgoals
nsim <- params$nsim

data(team_data)
data(group_match_data)
data(wcmatches_train)

Play game

Claus proposed three working models to calculate a single match outcome. The first is based on two independent Poisson distributions, where the two teams are treated as equal, so the result is random regardless of their actual skill and talent. The second assumes the scoring events in a match are two Poisson processes; the difference of the two is believed to follow a Skellam distribution. That result turns out to be much more reliable, as the parameters are estimated from actual betting odds. The third is based on the World Football ELO Ratings rules. From the current ELO ratings we calculate the expected result for one side in a match, which can be seen as the probability of success in a binomial distribution. This approach seems to overlook draws, due to the binary nature of the binomial distribution.
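For reference, the usual ELO win-expectancy formula behind the third model looks like this (my own sketch of the standard formula, not Claus's exact code; the example ratings are the ones shown in the team_data output further below):

# Standard ELO expectancy: probability that team A beats team B
elo_expected <- function(elo_a, elo_b) {
  1 / (1 + 10 ^ ((elo_b - elo_a) / 400))
}
elo_expected(2131, 1975)  # e.g. Brazil (ELO 2131) vs Portugal (ELO 1975)
# This expectancy can be used as the success probability of a single binomial draw,
# which is why a purely win/lose (binary) outcome cannot produce a draw.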

The fourth model presented here is my first attempt. To spell it out: we assume two independent Poisson events, with lambdas predicted from a trained Poisson model. The predicted goals are then simulated with rpois.
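A minimal sketch of that idea, assuming a fitted Poisson GLM like the mod shown further below and hypothetical one-row data frames of team features (not the package's actual play_fun_double_poisson code):

# Sketch of one double-Poisson game
lambda_a <- predict(mod, newdata = team_a_features, type = "response")  # expected goals, team A
lambda_b <- predict(mod, newdata = team_b_features, type = "response")  # expected goals, team B
c(Agoals = rpois(1, lambda_a), Bgoals = rpois(1, lambda_b))
# team_a_features / team_b_features are hypothetical data frames holding elo and fifa_start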

Each model candidate has its own function, which is specified via the play_fun parameter and passed to the higher-level wrapper function play_game.

# Specify team Spain and Portugal
play_game(play_fun = "play_fun_simplest", team1 = 7, team2 = 8,
          musthavewinner = FALSE, normalgoals = normalgoals)
##      Agoals Bgoals
## [1,]      0      1

play_game(team_data = team_data, play_fun = "play_fun_skellam",
          team1 = 7, team2 = 8, musthavewinner = FALSE, normalgoals = normalgoals)
##      Agoals Bgoals
## [1,]      1      4

play_game(team_data = team_data, play_fun = "play_fun_elo", team1 = 7, team2 = 8)
##      Agoals Bgoals
## [1,]      0      1

play_game(team_data = team_data, train_data = wcmatches_train,
          play_fun = "play_fun_double_poisson", team1 = 7, team2 = 8)
##      Agoals Bgoals
## [1,]      2      2

Estimate Poisson mean from training

Let’s have a quick look at the core of my training function. The target variable in the glm function is the number of goals a team scored in a match. The predictors are the FIFA and ELO ratings at a point before the 2014 tournament started. Both are popular ranking systems; the difference is that the FIFA rating is the official one, whereas the ELO rating is a community rating adapted from the chess ranking methodology.

mod <- glm(goals ~ elo + fifa_start, family = poisson(link = log),
           data = wcmatches_train)
broom::tidy(mod)
##          term      estimate    std.error  statistic      p.value
## 1 (Intercept) -3.5673415298 0.7934373236 -4.4960596 6.922433e-06
## 2         elo  0.0021479463 0.0005609247  3.8292949 1.285109e-04
## 3  fifa_start -0.0002296051 0.0003288228 -0.6982638 4.850123e-01

From the model summary, the ELO rating is statistically significant whereas the FIFA rating is not. More interestingly, the estimate for the FIFA rating is negative, implying a multiplicative effect of 0.9997704 on expected goals per rating point. Overall, the FIFA rating appears to be less predictive of the goals a team may score than the ELO rating. One possible reason is that only the 2014 ratings were collected; it may be worth going further back into history in future work. Challenges to the FIFA ratings' predictive power are not new, after all.
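The 0.9997704 figure is simply the exponentiated coefficient; the full set of multiplicative effects can be read off directly (values below are derived from the fit shown above, rounded):

# Effects on the expected-goals scale (rate ratios per one-unit change in each predictor)
exp(coef(mod))
# (Intercept)         elo  fifa_start
#      0.0282   1.0021503   0.9997704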

The training set wcmatches_train has a home column, representing whether team X in match Y was the home team. However, when matches are played in a third country, it is hard to say whether home/away status makes as much difference as it does in league competitions. Also, I didn’t find an explicit home/away split for the Russian World Cup. We could derive a similar feature, host advantage, indicating the host nation or continent, in a future model iteration. For the time being, home advantage is left out.

Group and kickout stages

Presented below are examples showing how to find winners at the various stages: from the group stage to the round of 16, quarter-finals, semi-finals and final.

find_group_winners(team_data = team_data, group_match_data = group_match_data,
                   play_fun = "play_fun_double_poisson",
                   train_data = wcmatches_train)$goals %>%
  filter(groupRank %in% c(1, 2)) %>%
  collect()
## Warning: package 'bindrcpp' was built under R version 3.4.4
## # A tibble: 16 x 11
##    number name         group rating   elo fifa_start points goalsFore
##  1      2 Russia       A       41.0  1685        493   7.00         5
##  2      3 Saudi Arabia A     1001    1582        462   5.00         4
##  3      7 Portugal     B       26.0  1975       1306   7.00         6
##  4      6 Morocco      B      501    1711        681   4.00         2
##  5     12 Peru         C      201    1906       1106   5.00         3
##  6     11 France       C        7.50 1984       1166   5.00         6
##  7     13 Argentina    D       10.0  1985       1254   9.00         8
##  8     15 Iceland      D      201    1787        930   6.00         4
##  9     17 Brazil       E        5.00 2131       1384   7.00         8
## 10     20 Serbia       E      201    1770        732   6.00         4
## 11     21 Germany      F        5.50 2092       1544   6.00         8
## 12     24 Sweden       F      151    1796        889   6.00         5
## 13     27 Panama       G     1001    1669        574   5.00         3
## 14     25 Belgium      G       12.0  1931       1346   5.00         4
## 15     31 Poland       H       51.0  1831       1128   4.00         2
## 16     29 Colombia     H       41.0  1935        989   4.00         1
## # ... with 3 more variables: goalsAgainst, goalsDifference, groupRank

find_knockout_winners(team_data = team_data,
                      match_data = structure(c(3L, 8L, 10L, 13L), .Dim = c(2L, 2L)),
                      play_fun = "play_fun_double_poisson",
                      train_data = wcmatches_train)$goals
##   team1 team2 goals1 goals2
## 1     3    10      2      2
## 2     8    13      1      2

Run the tournament

Here comes the most exciting part. We made a function, simulate_one(), to play the tournament once and then replicate() it (literally) many, many times. To run an ideal number of simulations, for example 10k, you might want to turn on parallel processing. I am staying at 1,000 for simplicity.

Finally, simulate_tournament() is the ultimate wrapper for all of the above. The returned resultX object is a 32-by-params$nsim matrix, each column holding the predicted rankings from one simulation run. set.seed() is used to ensure the results of this blog post are reproducible.

# Run nsim number of times world cup tournament
set.seed(000)
result  <- simulate_tournament(nsim = nsim, play_fun = "play_fun_simplest")
result2 <- simulate_tournament(nsim = nsim, play_fun = "play_fun_skellam")
result3 <- simulate_tournament(nsim = nsim, play_fun = "play_fun_elo")
result4 <- simulate_tournament(nsim = nsim, play_fun = "play_fun_double_poisson",
                               train_data = wcmatches_train)

Get winner list

get_winner() reports a winner list showing which teams have the highest probability of winning. Apart from the random Poisson model, Brazil is clearly the winner in the three other models, with the top two spots contested between Brazil and Germany. With different seeds, the third and fourth places (in darker blue) in my model are more likely to change. The variance might be an interesting point to look at.

get_winner(result) %>% plot_winner()

get_winner(result2) %>% plot_winner()

get_winner(result3) %>% plot_winner()

get_winner(result4) %>% plot_winner()

Who will be the top-scoring team?

The Skellam model seems more reliable; my double Poisson model gives a systematically lower scoring frequency than the likely actual values. They both favour Brazil, though.

get_top_scorer(nsim = nsim, result_data = result2) %>% plot_top_scorer()

get_top_scorer(nsim = nsim, result_data = result4) %>% plot_top_scorer()

Conclusion

The framework is pretty clear: all you need to do is customise the play_game function, e.g. game_fun_simplest, game_fun_skellam or game_fun_elo.

Tick-tock… Don’t hesitate to send a pull request to ekstroem/socceR2018 on GitHub. Who is winning the guess-who-wins-worldcup2018 game?

If you like this post, please leave your star, fork, issue or banana on the GitHub repository of the post, including all code (https://github.com/MangoTheCat/blog_worldcup2018). The analysis couldn’t have been done without help from Rich, Doug, Adnan and all others who have kindly shared ideas. I have passed on your knowledge to the algorithm.

Notes
  1. Data collection. I didn’t get to feed the models with the most up-to-date betting odds and ELO ratings in the team_data dataset. If you would like to, they are available via the three sources below. The FIFA rating is the easiest and can be scraped with rvest in the usual way. The ELO ratings and betting odds tables seem to be rendered by JavaScript, and I haven’t found a working solution. For betting information, Betfair, an online betting exchange, has an API and an R package, abettor, which helps to pull those odds; these are definitely interesting for anyone who is after strategy beyond prediction.

  2. Model enhancement. This is probably where it matters most. For example, previous research has suggested various bivariate Poisson models for football predictions.
  3. Feature engineering. Economic factors such as national GDP, market information like total player value or insurance value, and player injury data may be useful for improving accuracy.
  4. Model evaluation. One way to understand whether our model has good predictive capability is to evaluate the predictions against actual outcomes after 15 July 2018. Current odds from bookmakers can also be used as a reference. It is not impossible to run the whole thing on historical data, e.g. the 2014 tournament, and perform model selection and tuning.
  5. Functions and package could be better parameterized; code to be tidied up.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.

The Popularity of Point-and-Click GUIs for R

Tue, 06/12/2018 - 13:22

(This article was first published on R – r4stats.com, and kindly contributed to R-bloggers)

 

Point-and-click graphical user interfaces (GUIs) for R allow people to analyze data using the R software, without having to learn how to program in the R language. This is a brief look at how popular each one is. Knowing that a GUI is popular doesn’t mean it will meet your needs, but it does mean that it’s meeting the needs of many others. This may be helpful information when selecting the appropriate GUI for you, if programming is not your primary interest. For detailed information regarding what each GUI can do for you, and how it works, see my series of comparative reviews, which is currently in progress.

There are many ways to estimate the popularity of data science software, but one of the most accurate is by counting the number of downloads (see appendix for details). Figure 1 shows the monthly downloads of four of the six R GUIs that I’m reviewing (i.e. all that exist as far as I know).  We can see that the R Commander (Rcmdr) is the most popular GUI, and it has had steady growth since its introduction. Next comes Rattle, which is more oriented towards machine learning tasks. It too, has shown high popularity and steady growth.

The three lines at the bottom could use more “breathing room” so let’s look at them in their own plot.

Figure 1. Number of times each software was downloaded by month.

 

Figure 2 shows the same data as Figure 1, but with the two most popular GUIs removed to make room to study the remaining data. From it we can see that Deducer has been around for many more years than the other two. Downloads for Deducer grew steadily for a couple of years, then they leveled off. Its downloads appear to be declining slightly in recent years. jamovi (its name is not capitalized) has only been around for a brief period, and its growth has been very rapid. As you can see from my recent review, jamovi has many useful features.

Figure 2. Number of times the less popular GUIs were downloaded. (Same as Fig. 1, with the R Commander and rattle removed).

The lowest (blue) line shows downloads for the jmv package, that contains all the functions used by the jamovi GUI. It allows programmers to write code instead of using the jamovi GUI. People who point-and-click their way through an analysis in jamovi can send their code to any R user, who would then use the jmv package to run it. Since most jamovi users would prefer to point-and-click their way through analyses, it makes sense that the jmv package has been downloaded many fewer times than jamovi itself.

Two GUIs are missing from this plot: RKWard and BlueSky Statistics. Neither of those are downloaded from CRAN, and I was unable to obtain data from the developers of those GUIs. However, knowing that RKWard has a similar number of point-and-click features as Deducer, one can deduce (heh!) that it might have a similar level of popularity. The BlueSky software has only recently appeared on the scene, especially with its current level of features, so I expect it too will be towards the bottom, but growing rapidly.

I’m nearly done with all my reviews, so stay tuned to see what the other GUIs offer.

Acknowledgements

Thanks to Guangchuang Yu for making the dlstats package which allowed me to collect data so easily. Thanks also to Jonathon Love, who provided the download data for jamovi, and to Josh Price for his helpful editorial advice.

Appendix: Where the Data Came From

I used R’s dlstats package, which makes quick work of gathering counts of monthly downloads of R packages from the Comprehensive R Archive Network (CRAN). CRAN consists of sites around the world called “mirrors” from which people can download R packages. When starting the download process, R asks you to choose a mirror that is close to your location. In the popular RStudio development environment for R, the default mirror is set to their own server, which is actually a worldwide network of mirrors. Since it’s the default download location in a very popular tool for R, its download data will give us a good idea of the relative popularity of each GUI. The absolute popularity will be greater, but to get that data I would have to gather data from all the other servers around the world. If you have time to do that, please send me the results!
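For anyone who wants to reproduce the counts, the core of the data gathering is a single dlstats call. The sketch below is mine; the exact package list is an assumption based on the GUIs discussed above, and the jamovi application's own download numbers came separately from its author rather than from CRAN.

library(dlstats)
library(ggplot2)

# Monthly downloads (RStudio CRAN mirror) for the CRAN-hosted GUI packages
gui_pkgs <- c("Rcmdr", "rattle", "Deducer", "jmv")  # jamovi itself is not on CRAN
dl <- cran_stats(gui_pkgs)

ggplot(dl, aes(x = end, y = downloads, color = package)) +
  geom_line() +
  labs(x = "Month", y = "Downloads per month", color = "Package")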


To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com.

Second Edition of “Processamento e Analise de Dados Financeiros com o R”

Tue, 06/12/2018 - 09:00

(This article was first published on Marcelo S. Perlin, and kindly contributed to R-bloggers)

It is with great pleasure that I announce the second edition of the Portuguese version of my book, Processing and Analyzing Financial Data with R. This edition updates the material significantly. The Portuguese version is now not only on par with the international version of the book, but goes well beyond it!

Here are the main changes:

  • The structure of the chapters now follows the stages of a research project: obtaining the raw data, cleaning it, manipulating it and, finally, reporting tables and figures.

  • Many new packages for obtaining data, including my own GetDFPData and others such as rbcb.

  • A new chapter on reporting results and exporting tables, including a whole section about using RMarkdown.

  • Alignment with the tidyverse. I have no doubt that the packages from the tidyverse are here to stay. While the native functions are presented, the emphasis is on using the tidyverse, especially for reading local data, manipulating data frames and functional programming.

  • Exercises are available at the end of each chapter, including hard questions that will challenge your programming ability.

You can find the new edition of the book on Amazon. The TOC (table of contents) is available here. As usual, an online (and free) version of the book is available at http://www.msperlin.com/padfeR/.

It was a lot of work (and fun) to write the new edition. I’m very happy with the result. I hope you enjoy it!

Best,


To leave a comment for the author, please follow the link and comment on their blog: Marcelo S. Perlin.

R 3.5.0 on Debian and Ubuntu: An Update

Tue, 06/12/2018 - 03:27

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)


Overview

R 3.5.0 was released a few weeks ago. As it changes some (important) internals, packages installed with a previous version of R have to be rebuilt. This was known and expected, and we took several measured steps to get R binaries to everybody without breakage.

The question of “but how do I upgrade without breaking my system?” was asked a few times, e.g., on the r-sig-debian list as well as in this StackOverflow question.

Debian

Core Distribution As usual, we packaged R 3.5.0 as soon as it was released – but only for the experimental distribution, awaiting a green light from the release masters to start the transition. A one-off repository drr35 (https://github.com/eddelbuettel/drr35) was created to provide R 3.5.0 binaries more immediately; this was used, e.g., by the r-base Rocker Project container / the official R Docker container which we also update after each release.

The actual transition was started last Friday, June 1, and concluded this Friday, June 8. Well over 600 packages have been rebuilt under R 3.5.0, and are now ready in the unstable distribution from which they should migrate to testing soon. The Rocker container r-base was also updated.

So if you use Debian unstable or testing, these are ready now (or will be soon once migrated to testing). This should include most Rocker containers built from Debian images.

Contributed CRAN Binaries Johannes also provided backports with a –cran35 suffix in his CRAN-mirrored Debian backport repositories, see the README.

Ubuntu

Core (Upcoming) Distribution Ubuntu, for the upcoming 18.10, has undertaken a similar transition. Few users access this release yet, so the next section may be more important.

Contributed CRAN and PPA Binaries Two new Launchpad PPA repositories were created as well. Given the rather large scope of thousands of packages, multiplied by several Ubuntu releases, this too took a moment but is now fully usable and should get mirrored to CRAN ‘soon’. It covers the most recent and still supported LTS releases as well as the current release 18.04.

One PPA contains base R and the recommended packages, RRutter3.5. This is source of the packages that will soon be available on CRAN. The second PPA (c2d4u3.5) contains over 3,500 packages mainly derived from CRAN Task Views. Details on updates can be found at Michael’s R Ubuntu Blog.

This can be used for, e.g., Travis if you manage your own sources as Dirk’s r-travis does. We expect to use this relatively soon, possibly as an opt-in via a variable upon which run.sh selects the appropriate repository set. It will also be used for Rocker releases built based off Ubuntu.

In both cases, you may need to adjust the sources list for apt accordingly.

Others

There may also be ongoing efforts within Arch and other Debian-derived distributions, but we are not really aware of what is happening there. If you use those, and coordination is needed, please feel free to reach out via the r-sig-debian list.

Closing

In case of questions or concerns, please consider posting to the r-sig-debian list.

Dirk, Michael and Johannes, June 2018


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

The ssh Package: Secure Shell (SSH) Client for R

Tue, 06/12/2018 - 02:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Have you ever needed to connect to a remote server over SSH to transfer files via SCP or to set up a secure tunnel, and wished you could do so from R itself? The new rOpenSci ssh package provides a native ssh client in R that allows you to do exactly that, and even more, like running a command or script on the host while streaming stdout and stderr directly to the client. The package is based on libssh, a powerful C library implementing the SSH protocol.

install.packages("ssh")

Because the ssh package is based on libssh it does not need to shell out. Therefore it works natively on all platforms without any runtime dependencies. Even on Windows.

The package is still work in progress, but the core functionality should work. Below some examples to get you started from the intro vignette.

Connecting to an SSH server

First create an ssh session by connecting to an SSH server.

session <- ssh_connect("jeroen@dev.opencpu.org")
print(session)
##
## connected: jeroen@dev.opencpu.org:22
## server: 1e:28:44:af:84:91:e5:88:fe:82:ca:34:d7:c8:cf:a8:0d:2f:ec:af

Once established, a session is closed automatically by the garbage collector when the object goes out of scope or when R quits. You can also manually close it using ssh_disconnect() but this is not strictly needed.

Authentication

The client attempts to use the following authentication methods (in this order) until one succeeds:

  1. try the key from the privkey argument in ssh_connect(), if specified (see the sketch after this list)
  2. if ssh-agent is available, try private key from ssh-agent
  3. try user key specified in ~/.ssh/config or any of the default locations: ~/.ssh/id_ed25519, ~/.ssh/id_ecdsa, ~/.ssh/id_rsa, or .ssh/id_dsa.
  4. Try challenge-response password authentication (if permitted by the server)
  5. Try plain password authentication (if permitted by the server)
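As a sketch of step 1, a private key can be supplied explicitly via the privkey argument named in the list above (the path here is a placeholder):

# Explicitly point ssh_connect() at a private key (placeholder path)
session <- ssh_connect("jeroen@dev.opencpu.org", privkey = "~/.ssh/id_rsa")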

To debug authentication set verbosity to at least level 2 or 3:

session <- ssh_connect("jeroen@dev.opencpu.org", verbose = 2)
## ssh_socket_connect: Nonblocking connection socket: 7
## ssh_connect: Socket connecting, now waiting for the callbacks to work
## socket_callback_connected: Socket connection callback: 1 (0)
## ssh_client_connection_callback: SSH server banner: SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.4
## ssh_analyze_banner: Analyzing banner: SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.4
## ssh_analyze_banner: We are talking to an OpenSSH client version: 7.2 (70200)
## ssh_packet_dh_reply: Received SSH_KEXDH_REPLY
## ssh_client_curve25519_reply: SSH_MSG_NEWKEYS sent
## ssh_packet_newkeys: Received SSH_MSG_NEWKEYS
## ssh_packet_newkeys: Signature verified and valid
## ssh_packet_userauth_failure: Access denied. Authentication that can continue: publickey
## ssh_packet_userauth_failure: Access denied. Authentication that can continue: publickey
## ssh_agent_get_ident_count: Answer type: 12, expected answer: 12
## ssh_userauth_publickey_auto: Successfully authenticated using /Users/jeroen/.ssh/id_rsa

Execute Script or Command

Run a command or script on the host and block while it runs. By default, stdout and stderr are streamed directly back to the client. This function returns the exit status of the remote command (hence it does not automatically error for an unsuccessful exit status).

out <- ssh_exec_wait(session, command = 'whoami')
## jeroen
print(out)
## [1] 0

You can also run a script that consists of multiple commands.

ssh_exec_wait(session, command = c(
  'curl -O https://cran.r-project.org/src/contrib/Archive/jsonlite/jsonlite_1.4.tar.gz',
  'R CMD check jsonlite_1.4.tar.gz',
  'rm -f jsonlite_1.4.tar.gz'
))
##   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
##                                  Dload  Upload   Total   Spent    Left  Speed
## 100 1071k  100 1071k    0     0   654k      0  0:00:01  0:00:01 --:--:--  654k
## * using log directory '/home/jeroen/jsonlite.Rcheck'
## * using R version 3.4.3 (2017-11-30)
## * using platform: x86_64-pc-linux-gnu (64-bit)
## * using session charset: ASCII
## * checking for file 'jsonlite/DESCRIPTION' ... OK
## * this is package 'jsonlite' version '1.4'
## * checking package namespace information ... OK
## * checking package dependencies ... ...

Capturing output

ssh_exec_internal() is a convenient wrapper for ssh_exec_wait() which buffers the output streams and returns them as raw vectors. It also raises an error by default when the remote command was not successful.

out <- ssh_exec_internal(session, "R -e 'rnorm(10)'")
print(out$status)
## [1] 0
cat(rawToChar(out$stdout))
## R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
## Copyright (C) 2018 The R Foundation for Statistical Computing
## Platform: x86_64-pc-linux-gnu (64-bit)
##
## R is free software and comes with ABSOLUTELY NO WARRANTY.
## You are welcome to redistribute it under certain conditions.
## Type 'license()' or 'licence()' for distribution details.
##
## R is a collaborative project with many contributors.
## Type 'contributors()' for more information and
## 'citation()' on how to cite R or R packages in publications.
##
## Type 'demo()' for some demos, 'help()' for on-line help, or
## 'help.start()' for an HTML browser interface to help.
## Type 'q()' to quit R.
##
## > rnorm(10)
##  [1]  0.14301778 -0.26873489  0.83931307  0.22034917  0.87214122 -0.13655736
##  [7] -0.08793867 -0.68616146  0.23469591  0.93871035

This function is very useful if you are running a remote command and want to use its output as if you had executed it locally.

Using sudo

Note that the exec functions are non-interactive, so they cannot prompt for a sudo password. A trick is to use sudo -S, which reads the password from stdin:

command <- 'echo "mypassword!" | sudo -s -S apt-get update -y'
out <- ssh_exec_wait(session, command)

Be very careful with hardcoding passwords!

Transfer Files via SCP

Upload and download files via SCP. Directories are automatically traversed as in scp -r.

# Upload a file to the server
file_path <- R.home("COPYING")
scp_upload(session, file_path)
## [100%] /Library/Frameworks/R.framework/Versions/3.5/Resources/COPYING

This will upload the file to the home directory on your server. Let’s download it back:

# Download the file back and verify it is the same
scp_download(session, "COPYING", to = tempdir())
## 18011 /var/folders/l8/bhmtp25n2lx0q0dgv1x4gf1w0000gn/T//Rtmpldz4eO/COPYING

We can compare the checksums to verify that the files are identical:

tools::md5sum(file_path)
## "eb723b61539feef013de476e68b5c50a"
tools::md5sum(file.path(tempdir(), "COPYING"))
## "eb723b61539feef013de476e68b5c50a"

Hosting a Tunnel

Opens a port on your machine and tunnel all traffic to a custom target host via the SSH server.

ssh_tunnel(session, port = 5555, target = "ds043942.mongolab.com:43942")

This function blocks while the tunnel is active. Use the tunnel by connecting to localhost:5555 from a separate process. The tunnel can only be used once and will automatically be closed when the client disconnects.
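For example, while the tunnel above is blocking in one R session, a second R session could reach the remote MongoDB instance through the local end of the tunnel. A hypothetical sketch using the mongolite package (the database and collection names are placeholders):

# Run this in a *separate* R process while ssh_tunnel() is active
library(mongolite)

# The remote mongolab host is reached via the local tunnel endpoint
con <- mongo(collection = "test", db = "test",
             url = "mongodb://localhost:5555")
con$count()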

tried it for 2 weeks and absolutely love it! Thanks a lot for developing such nice packages

— aurelien ginolhac ♮ (@kingsushigino) June 5, 2018


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

Merging spatial buffers in R

Mon, 06/11/2018 - 21:55

(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

I’m sure there’s a better way out there, but I struggled to find a way to dissolve polygons that touch or overlap each other (the special case being buffers). For example, using the osmdata package, we can download the polygons representing hospital buildings in Bern, Switzerland.

library(osmdata)
library(rgdal) ; library(maptools) ; library(rgeos)

q0 <- opq(bbox = "Bern, Switzerland", timeout = 60)
q1 <- add_osm_feature(q0, key = 'building', value = "hospital")
x <- osmdata_sp(q1)

library(leaflet)
spChFIDs(x$osm_polygons) <- 1:nrow(x$osm_polygons@data)
cent <- gCentroid(x$osm_polygons, byid = TRUE)
leaflet(cent) %>% addTiles() %>% addCircles()

Here we plot the building centroids.

Each point represents a hospital building. We don’t particularly care about the buildings themselves though. We want to create hospitals. To do so, we try a 150m buffer around each centroid.

buff <- gBuffer(cent, byid = TRUE, width = 0.0015)
leaflet(cent) %>% addTiles() %>% addPolygons(data = buff, col = "red") %>% addCircles()

We then want to merge the buffers into, in this case, four groups. This is the step that doesn’t seem to be implemented anywhere that I could find (I also tried QGIS, but that just created one big feature rather than many small ones). My approach is to get the unique sets of intersections, add them as a variable to the buffer and unify the polygons.

buff <- SpatialPolygonsDataFrame(buff, data.frame(row.names = names(buff), n = 1:length(buff)))
gt <- gIntersects(buff, byid = TRUE, returnDense = FALSE)
ut <- unique(gt)
nth <- 1:length(ut)
buff$n <- 1:nrow(buff)
buff$nth <- NA
for(i in 1:length(ut)){
  x <- ut[[i]]
  buff$nth[x] <- i
}
buffdis <- gUnaryUnion(buff, buff$nth)
leaflet(cent) %>% addTiles() %>% addPolygons(data = buffdis, col = "red") %>% addCircles()

As you can see, it almost worked. The lower-left group is composed of three polygons. Running the same process again clears it up (only the code is shown). Large jobs might need more iterations (or larger buffers). The final job is to get the hospital centroids.

gt <- gIntersects(buffdis, byid = TRUE, returnDense = FALSE)
ut <- unique(gt)
nth <- 1:length(ut)
buffdis <- SpatialPolygonsDataFrame(buffdis, data.frame(row.names = names(buffdis), n = 1:length(buffdis)))
buffdis$nth <- NA
for(i in 1:length(ut)){
  x <- ut[[i]]
  buffdis$nth[x] <- i
}
buffdis <- gUnaryUnion(buffdis, buffdis$nth)
leaflet(cent) %>% addTiles() %>% addPolygons(data = buffdis, col = "red") %>% addCircles()
buffcent <- gCentroid(buffdis, byid = TRUE)

Code here.


To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD.

Machine Learning in R with H2O and LIME: A free workshop!

Mon, 06/11/2018 - 18:47

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

 

We’ve told you about it, and now it’s happening. It’s the workshop about Machine Learning in R, with H2O and LIME!

The General Data Protection Regulation (GDPR) recently got approved. Are you and your organization ready to explain your models?

Jo-Fai Chow from H2O will lead us on a journey through the construction and interpretation of machine learning models, in a hands-on experience open to everybody (previous knowledge of ML is not required). We will discover the use of two R packages, h2o & LIME, for automatic and interpretable machine learning. We will learn how to build regression and classification models quickly with H2O’s AutoML. Then we will be guided through explaining the model outcomes with a framework called Local Interpretable Model-Agnostic Explanations (LIME).

The workshop is free, but it is limited to a maximum of 48 participants. That’s not many, so make sure to be fast.

When you’re sure you can be there, register on the Eventbrite: https://www.eventbrite.it/e/machine-learning-in-r-with-h2o-lime-the-workshop-tickets-46692800423

If you change your mind, please unsubscribe from Eventbrite! This way you’ll leave space for others.

Also keep an eye on your email, because shortly before the workshop we’ll ask you to confirm your participation (confirmation is mandatory to keep your seat!) and we’ll send instructions for installing all the required tools, so that you are fully ready for the workshop.

Agenda
19:00 – Welcome presentation (+ a presentation of Data Hack Italia)
19:30 – Machine Learning in R with H2O & LIME: The Workshop, part 1
20:30 – Break: Free pizza!
21:00 – Machine Learning in R with H2O & LIME: The Workshop, part 2
22:00 – Bye bye and see you soon!
Something more about the speaker and the topics:

  • The speaker will be Jo-Fai (or Joe) Chow, who works at H2O.ai as a data science evangelist/community manager. Before joining H2O, he was on the business intelligence team at Virgin Media in the UK, where he developed data products to enable quick and smart business decisions. He also worked remotely for Domino Data Lab in the US as a data science evangelist, promoting products via blogging and giving talks at meetups. He also holds an MSc in Environmental Management and a BEng in Civil Engineering.
  • H2O is open-source software for big-data analysis. It is used for exploring and analyzing datasets held in cloud computing systems and in the Apache Hadoop Distributed File System, as well as on the conventional operating systems Linux, macOS, and Microsoft Windows. The software is written in Java, Python, and R, and it’s compatible with most browsers. The aim of H2O.ai (H2O’s developers) is to develop an analytical interface for cloud computing and to provide all users with tools for data analysis.
  • LIME is about explaining what machine learning classifiers (or models) are doing. LIME (short for Local Interpretable Model-Agnostic Explanations) is useful for interpreting black-box classifiers with two or more classes, at the level of individual predictions.

The appointment is at Mikamai, Milan, on the 25th of June from 7 pm to 10 pm. See you there!

 

 

The post Machine Learning in R with H2O and LIME: A free workshop! appeared first on MilanoR.


To leave a comment for the author, please follow the link and comment on their blog: MilanoR.

R Tip: use isTRUE()

Mon, 06/11/2018 - 17:55

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

R Tip: use isTRUE().

A lot of R functions are type unstable, which means they return different types or classes depending on details of their values.

For example, consider all.equal(): it returns the logical value TRUE when the items being compared are equal:

all.equal(1:3, c(1, 2, 3))
# [1] TRUE

However, when the items being compared are not equal all.equal() instead returns a message:

all.equal(1:3, c(1, 2.5, 3))
# [1] "Mean relative difference: 0.25"

This can be inconvenient when using functions similar to all.equal() as tests in if()-statements and other program control structures.
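For example, feeding the message form of the result straight into if() fails (error shown approximately):

if (all.equal(1:3, c(1, 2.5, 3))) {
  print("equal")
}
# Error in if (all.equal(1:3, c(1, 2.5, 3))) { :
#   argument is not interpretable as logical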

The saving function is isTRUE(). isTRUE() returns TRUE if its argument value is equivalent to TRUE, and returns FALSE otherwise. isTRUE() makes R programming much easier.

Some examples of isTRUE() are given below:

isTRUE(TRUE)
# [1] TRUE

isTRUE(FALSE)
# [1] FALSE

isTRUE(NULL)
# [1] FALSE

isTRUE(NA)
# [1] FALSE

isTRUE(all.equal(1:3, c(1, 2.5, 3)))
# [1] FALSE

isTRUE(all.equal(1:3, c(1, 2, 3)))
# [1] TRUE

lst <- list(x = 5)
isTRUE(lst$y == 7)
# [1] FALSE

lst$y == 7
# logical(0)

isTRUE(logical(0))
# [1] FALSE

Using isTRUE() one can write safe and legible code such as the following:

# Pretend this assignment was performed by somebody else.
lst <- list(x = 5)

# Our own sanitization code.
if(!isTRUE(lst$y > 3)) {
  lst$y = lst$x
}

print(lst)
# $x
# [1] 5
#
# $y
# [1] 5

R now also has isFALSE(), but by design it does not mean the same thing as !isTRUE(). The idea is: for a value v, at most one of isTRUE(v) and isFALSE(v) returns TRUE, and each returns TRUE only for a non-NA, unnamed, scalar logical value (example: isTRUE(5), isFALSE(5)).
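Continuing that example:

isTRUE(5)
# [1] FALSE
isFALSE(5)
# [1] FALSE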

Or as help(isTRUE) puts it:

… if(isTRUE(cond)) may be preferable to if(cond) …

Note: prior to R 3.5 (the current version!), isTRUE was defined as “isTRUE <- function(x) identical(x, TRUE)” (please see the change log here). This seemed clever, but failed on named logical values (violating the principle of least surprise):

oldTRUE <- function(x) identical(x, TRUE)

v <- TRUE
oldTRUE(v)
# [1] TRUE
isTRUE(v)
# [1] TRUE

names(v) <- "condition"
oldTRUE(v)
# [1] FALSE
isTRUE(v)
# [1] TRUE

This caused a lot of problems (example taken from R3.5.0 NEWS):

x <- rlnorm(99)
isTRUE(median(x) == quantile(x)["50%"])
# [1] TRUE
oldTRUE(median(x) == quantile(x)["50%"])
# [1] FALSE

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Anomaly Detection in R

Mon, 06/11/2018 - 11:30

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

The World of Anomalies

Imagine you are a credit card selling company and you know about a particular customer who makes a purchase of $25 every week. You guess these purchases are his fixed weekly rations, but one day this customer makes a different purchase of $700. This development will not just startle you but also compel you to talk to the customer and find out the reason, so you can approve the transaction. This is because the behavior of the customer had become fixed and the change was so different that it was not expected. Hence we call this event an anomaly.

Anomalies are hard to detect because they can also be genuine phenomena. Let’s say that the customer in the example above made the usual purchases while he was living alone and is starting a family this week. This could mean it is the first of many future purchases of similar magnitude, or that he is throwing a party this week and this was a one-time large purchase. In all these cases, the customer will be classified as making an ‘abnormal’ choice. We, as the credit card seller, need to know which of these cases are genuine and which are mistakes that can be corrected if we reconfirm them with the customer. Detecting such anomalies is very useful, especially in the BFSI industry, with the primary use being credit card transactions. Such anomalies can be signs of fraud or theft: someone making multiple small transactions from the same credit card, making one very large transaction that is a few orders of magnitude larger than the average, or making transactions from an unfamiliar location are examples that can be caused by fraudsters and must be caught. With that motivation, let’s study the ways we can detect anomalies.

Detecting The Pattern To Find Anomalies

Anomalies are essentially the outliers in our data. If something happens regularly, it is not an anomaly but a trend. Things which happen only once or twice and deviate from the usual behavior, whether continuously or with lags, are anomalies. So it all boils down to the definition of outliers for our data. R provides a lot of packages with different approaches to anomaly detection. We will use the AnomalyDetection package in R to understand the concept of anomalies using one such method. However, the package needs to be installed directly from GitHub, which requires the install_github() function from the devtools package. We will also use the Rcpp package, which helps us integrate R with C++ functions. Another GitHub package used in this article is the wikipediatrend package, which provides access to Wikipedia page-view statistics and lets us create data for the anomaly detection analysis.

The package is capable of identifying outliers in the presence of seasonality and trend in the data. It uses an algorithm known as Seasonal Hybrid ESD (S-H-ESD), which finds outliers globally as well as locally in a time series or a vector of data. The package has a lot of features, including visualization graphs, choosing the type of anomalies (positive or negative) and specifying the window of interest.

#Install the devtools package then github packages
install.packages("devtools")
install.packages("Rcpp")
library(devtools)
install_github("petermeissner/wikipediatrend")
install_github("twitter/AnomalyDetection")

#Loading the libraries
library(Rcpp)
library(wikipediatrend)
library(AnomalyDetection)

The first step is data preparation. We will use the page views of the Wikipedia page on FIFA (link: https://en.wikipedia.org/wiki/FIFA), starting from 18th March 2013. The wp_trend function gives us the access statistics for the page, with the ability to filter the data from within the function. We will use this data to model day-wise page views and understand anomalies in the pattern of those view counts.

#Download wikipedia webpage "fifa"
fifa_data_wikipedia = wp_trend("fifa", from = "2013-03-18", lang = "en")

This gives us a dataset of about 1022 observations and 8 columns. Looking at the data reveals some redundant information captured

#First_look
fifa_data_wikipedia
    project   language article access     agent      granularity date       views
197 wikipedia en       Fifa    all-access all-agents daily       2016-01-13   116
546 wikipedia en       Fifa    all-access all-agents daily       2016-12-27    64
660 wikipedia en       Fifa    all-access all-agents daily       2017-04-20   100
395 wikipedia en       Fifa    all-access all-agents daily       2016-07-29    70
257 wikipedia en       Fifa    all-access all-agents daily       2016-03-13    75
831 wikipedia en       Fifa    all-access all-agents daily       2017-10-08   194
229 wikipedia en       Fifa    all-access all-agents daily       2016-02-14    84
393 wikipedia en       Fifa    all-access all-agents daily       2016-07-27   140
293 wikipedia en       Fifa    all-access all-agents daily       2016-04-18   105
420 wikipedia en       Fifa    all-access all-agents daily       2016-08-23   757

We see that project, language, article, access, agent and granularity appear to be the same for all rows and are irrelevant for us. We are only concerned with date and views as the features to work on. Let’s plot the views against date.

#Plotting data
library(ggplot2)
ggplot(fifa_data_wikipedia, aes(x = date, y = views, color = views)) + geom_line()


We see some huge spikes at different intervals. There are a lot of anomalies in this data. Before we process them further, let’s keep only the relevant columns.

# Keep only date & page views and discard all other variables
columns_to_keep = c("date", "views")
fifa_data_wikipedia = fifa_data_wikipedia[, columns_to_keep]

We will now perform anomaly detection using the Seasonal Hybrid ESD test. The technique treats the data as a series, captures the seasonality and points out data which does not follow the seasonal pattern. The AnomalyDetectionTs() function finds the anomalies in the data. It will basically narrow down the peaks, keeping in mind that no more than 10% of the data can be anomalies (by default). We can reduce this number by changing the max_anoms parameter. We can also specify which kind of anomalies are to be identified using the direction parameter. Here, we ask for only positive-direction anomalies, which means that sudden dips in the data are not considered.

#Apply anomaly detection and plot the results
anomalies = AnomalyDetectionTs(fifa_data_wikipedia, direction = "pos", plot = TRUE)
anomalies$plot



Our data has 5.68% anomalies in the positive direction at a 95% confidence level (alpha = 0.05). Since we had a total of 1022 observations, 5.68% amounts to about 58 observations. We can look at the specific dates which are pointed out by the algorithm.

# Look at the anomaly dates
anomalies$anoms
    timestamp anoms
1  2015-07-01   269
2  2015-07-02   233
3  2015-07-04   198
4  2015-07-05   330
5  2015-07-06   582
6  2015-07-07   276
7  2015-07-08   211
8  2015-07-09   250
9  2015-07-10   198
10 2015-07-20   315
11 2015-07-21   209
12 2015-07-25   202
13 2015-07-26   217
14 2015-09-18   278
15 2015-09-25   234
16 2015-09-26   199
17 2015-10-03   196
18 2015-10-07   242
19 2015-10-08   419
20 2015-10-09   240
21 2015-10-11   204
22 2015-10-12   223
23 2015-10-13   237
24 2015-10-18   204
25 2015-10-28   213
26 2015-12-03   225
27 2015-12-21   376
28 2015-12-22   212
29 2016-02-24   240
30 2016-02-26   826
31 2016-02-27   516
32 2016-02-29   199
33 2016-04-04   330
34 2016-05-13   217
35 2016-05-14   186
36 2016-06-10   196
37 2016-06-11   200
38 2016-06-12   258
39 2016-06-13   245
40 2016-06-14   204
41 2016-06-22   232
42 2016-06-27   273
43 2016-06-28   212
44 2016-07-10   221
45 2016-07-11   233
46 2016-08-22   214
47 2016-08-23   757
48 2016-08-24   244
49 2016-09-18   250
50 2016-09-19   346
51 2017-01-10   237
52 2017-03-29   392
53 2017-06-03   333
54 2017-06-21   365
55 2017-10-08   194
56 2017-10-09   208
57 2017-10-11   251
58 2017-10-14   373

We have the exact dates and the anomaly values for each date. In a typical anomaly detection process, each of these dates is examined case by case and the reason for the anomaly is identified. For instance, the page views can be higher on these dates if there were FIFA matches or page updates on those particular days. Another reason could be big news about FIFA players. However, if the page views on a date do not correspond to any special event, then that day is a true anomaly and should be flagged. In other situations, such as credit card transactions, such anomalies can indicate fraud, and quick action must be taken upon identification.

The ‘Anomaly Way’

Anomalies are a kind of outlier, so SH-ESD (Seasonal Hybrid ESD) is not the only way to detect them. Moreover, AnomalyDetection is not the only package we will look at. Let’s try the anomalize package, which is available on CRAN. However, it is always recommended to update the package from GitHub, as the authors keep the most recent versions there and it takes time and testing for changes to move into standard repositories such as CRAN. We will first install the package from CRAN so that the dependencies are also installed, then update the package using devtools.

#Installing anomalize
install.packages('anomalize')

#Update from github
library(devtools)
install_github("business-science/anomalize")

#Load the package
library(anomalize)

# We will also use tidyverse package for processing and coindeskr to get bitcoin data
library(tidyverse)
library(coindeskr)

I am also using the tidyverse and coindeskr packages. The coindeskr package is used to download the bitcoin price data and tidyverse is used for speedy data processing. We will now download bitcoin data from 1st January 2017.

#Get bitcoin data from 1st January 2017
bitcoin_data <- get_historic_price(start = "2017-01-01")

This data indicates the price per date. Let’s convert it into a time series

#Convert bitcoin data to a time series
bitcoin_data_ts = bitcoin_data %>%
  rownames_to_column() %>%
  as.tibble() %>%
  mutate(date = as.Date(rowname)) %>%
  select(-one_of('rowname'))

In this conversion, we are actually converting the data to a tbl_df, which the package requires; we could alternatively have converted the data into a tibbletime object. Since it is a time series now, we should also expect seasonality and trend patterns in the data. It is important to remove them so that anomaly detection is not affected. We will now decompose the series and also plot it:

#Decompose data using time_decompose() function in anomalize package. We will use stl method which extracts seasonality

bitcoin_data_ts %>%
  time_decompose(Price, method = "stl", frequency = "auto", trend = "auto") %>%
  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>%
  plot_anomaly_decomposition()

Converting from tbl_df to tbl_time.
Auto-index message: index = date
frequency = 7 days
trend = 90.5 days


We get some beautiful plots: the first shows the overall observed data, the second the seasonal component, the third the trend, and the final plot the remainder analysed for anomalies. The red points indicate anomalies according to the anomalize function. However, this is not the plot we are after; we only want the anomalies plot, with trend and seasonality removed. Let’s plot the data again using the recomposed data. This can be done by adding the time_recompose() step and setting time_recomposed = TRUE in plot_anomalies():

#Plot the data again by recomposing data

bitcoin_data_ts %>%
  time_decompose(Price) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)

Converting from tbl_df to tbl_time.
Auto-index message: index = date
frequency = 7 days
trend = 90.5 days


This is a better plot and clearly shows the anomalies. We all know how bitcoin prices shot up in late 2017 and early 2018. The grey band shows the expected range. Let’s see what these red points are.

#Extract the anomalies
anomalies = bitcoin_data_ts %>%
  time_decompose(Price) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  filter(anomaly == 'Yes')

The anomalies dataset now consists of the data points which were identified as anomalies by the algorithm.
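
A few quick ways to inspect what was extracted (a sketch; column names such as observed and anomaly follow the output of anomalize/time_recompose()):

nrow(anomalies)  # how many days were flagged
anomalies %>%
  select(date, observed, remainder, anomaly) %>%
  arrange(desc(observed)) %>%
  head(10)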

Conclusion: Are You An Anomaly?

We have Twitter’s AnomalyDetection package based on Seasonal Hybrid ESD (SH-ESD), as well as CRAN’s anomalyDetection package based on factor analysis, Mahalanobis distance, Horn’s parallel analysis or principal component analysis. We also have the tsoutliers and anomalize packages in R, and many more besides. They all share the same concept but differ in the underlying algorithm used to detect anomalies. Hence, one can take a general idea from all such packages: anomalies are data points which do not follow the general trend or do not lie within the expected behaviour of the rest of the data. The question that follows is what the criteria are for a data point to follow the expected behaviour; everything else is an anomaly. One can also have different types of anomalies, such as direction-based anomalies as described by the AnomalyDetection package (positive or negative), or anomalies that do not correspond to events such as matches in the FIFA data. One can similarly plug in other logic for anomaly classification and treat the points accordingly.

Here is the entire code used in this article
#Install the devtools package then github packages
install.packages("devtools")
install.packages("Rcpp")
library(devtools)
install_github("petermeissner/wikipediatrend")
install_github("twitter/AnomalyDetection")

#Loading the libraries
library(Rcpp)
library(wikipediatrend)
library(AnomalyDetection)

# Download wikipedia webpage "fifa"
fifa_data_wikipedia = wp_trend("fifa", from="2013-03-18", lang = "en")

#First_look
fifa_data_wikipedia

# Plotting data
library(ggplot2)
ggplot(fifa_data_wikipedia, aes(x=date, y=views, color=views)) + geom_line()

# Keep only date & page views and discard all other variables
columns_to_keep=c("date","views")
fifa_data_wikipedia=fifa_data_wikipedia[,columns_to_keep]

#Apply anomaly detection and plot the results
anomalies = AnomalyDetectionTs(fifa_data_wikipedia, direction="pos", plot=TRUE)
anomalies$plot

# Look at the anomaly dates
anomalies$anoms

#Installing anomalize
install.packages('anomalize')

#Update from github
library(devtools)
install_github("business-science/anomalize")

#Load the package
library(anomalize)

# We will also use tidyverse package for processing and coindeskr to get bitcoin data
library(tidyverse)
library(coindeskr)

#Get bitcoin data from 1st January 2017
bitcoin_data = get_historic_price(start = "2017-01-01")

#Convert bitcoin data to a time series
bitcoin_data_ts = bitcoin_data %>%
  rownames_to_column() %>%
  as.tibble() %>%
  mutate(date = as.Date(rowname)) %>%
  select(-one_of('rowname'))

#Decompose data using time_decompose() function in anomalize package. We will use stl method which extracts seasonality
bitcoin_data_ts %>%
  time_decompose(Price, method = "stl", frequency = "auto", trend = "auto") %>%
  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>%
  plot_anomaly_decomposition()

#Plot the data again by recomposing data
bitcoin_data_ts %>%
  time_decompose(Price) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)

#Extract the anomalies
anomalies = bitcoin_data_ts %>%
  time_decompose(Price) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  filter(anomaly == 'Yes')

Author Bio:

This article was contributed by Perceptive Analytics. Madhur Modi, Prudhvi Potuganti, Saneesh Veetil and Chaitanya Sagar contributed to this article.

Perceptive Analytics provides Tableau Consulting, data analytics, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

Top Tip: Don’t keep your data prep in the same project as your Shiny app

Mon, 06/11/2018 - 10:04

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Mark Sellors, Head of Data Engineering

If you use RStudio Connect to publish your Shiny app (and even if you don’t), take care with how you arrange your projects. If you have a single project that includes both your data prep and your Shiny app, packrat (which RSConnect uses to resolve package dependencies for your project) will assume the packages you used for both parts are required on the RSConnect server and will try to install them all.

This means that if your Shiny app uses three packages and your data prep uses six, packrat and RSconnect will attempt to install all nine on the server. This can be time consuming as packages are often built from source in Connect-based environments, so this will increase the deployment time considerably. Furthermore, some packages may require your server admin to resolve system-level package dependency issues, which may even be for packages that your app doesn’t use while it’s running.

Keeping data prep and your app within a single project can also confuse people who come on to your project as collaborators later in the development process, since the scope of the project will be less clear. Plus, documenting the pieces separately also helps to improve clarity.

Lastly, separating the two will make your life easier if you ever get to the stage where you want to start automating parts of your workflow as the data prep stage will already be separate from the rest of the project.
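
A minimal sketch of that separation (the file names here are made up for illustration): the data-prep project writes a processed object to disk, and the Shiny app project only reads it.

# data-prep project: prep_data.R (run on a schedule or by hand)
library(dplyr)
raw   <- read.csv("raw/sales.csv")            # hypothetical raw extract
clean <- raw %>%
  filter(!is.na(amount)) %>%
  group_by(region) %>%
  summarise(total = sum(amount))
saveRDS(clean, "sales_summary.rds")

# Shiny app project: app.R only loads what the app itself needs
library(shiny)
sales <- readRDS("sales_summary.rds")         # bundled/published with the app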

Clear separation of individual projects (and by extension, source code repositories) may cause some short term pain, but the long term benefits are hard to overstate:

  • Smoother and faster RStudio Connect deployments
  • Easier collaboration
  • More straightforward automation (easier to build out into a pipeline)
  • Simpler to document – one set for the app, another for your data prep

Of course, if your Shiny app actually does data prep as part of the app’s internal processing, then all bets are off!


To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.

Customizing time and date scales in ggplot2

Mon, 06/11/2018 - 09:11

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

In the last post of this series we dealt with axis systems. In this post we are also dealing with axes, but this time we are taking a look at the position scales of dates, times and datetimes. Since we at STATWORX are often forecasting – and thus plotting – time series, this is an important issue for us. The choice of axis ticks and labels can make the message conveyed by a plot clearer. Oftentimes, some points in time are – e.g. due to their business implications – more important than others and should be easily identified. Unequivocal, yet parsimonious labeling is key to the readability of any plot. Luckily, ggplot2 enables us to do so for dates and times with almost no effort at all.

We are using ggplot2’s economics data set. Our base plot looks like this:

# base plot
base_plot <- ggplot(data = economics) +
  geom_line(aes(x = date, y = unemploy),
            color = "#09557f", alpha = 0.6, size = 0.6) +
  labs(x = "Date",
       y = "US Unemployed in Thousands",
       title = "Base Plot") +
  theme_minimal()

Scale Types

As of now, ggplot2 supports three date and time classes: POSIXct, Date and hms.
Depending on the class at hand, axis ticks and labels can be controlled by using scale_*_datetime, scale_*_date or scale_*_time, respectively. Depending on whether one wants to modify the x or the y axis, scale_x_* or scale_y_* is to be employed. For the sake of simplicity, only scale_x_date is employed in the examples, but all discussed arguments work just the same for all mentioned scales.

Minor Modifications

Let’s start easy. With the argument limits the range of the displayed dates or time can be set. Two values of the correct date or time class have to be supplied.

base_plot + scale_x_date(limits = as.Date(c("1980-01-01","2000-01-01"))) + ggtitle("limits = as.Date(c(\"1980-01-01\",\"2000-01-01\"))")

The expand argument ensures that there is some distance between the displayed data and the axes. It takes two numeric values: the first is the multiplicative expansion constant, the second the additive expansion constant. The multiplicative constant is multiplied by the range of the displayed data, the additive constant by one unit of the depicted data; the larger of the two resulting distances is used in the plot. The resulting empty space is added at the left and right end of the x-axis or at the top and bottom of the y-axis.

base_plot +
  scale_x_date(expand = c(0, 5000)) + #5000/365 = 13.69863 years
  ggtitle("expand = c(0, 5000)")

The position argument defines where the labels are displayed: either “left” or “right” of the y-axis, or on the “top” or “bottom” of the x-axis.

base_plot +
  scale_x_date(position = "top") +
  ggtitle("position = \"top\"")

Axis Ticks and Grid Lines

More essential than the cosmetic modifications discussed so far are the axis ticks. There are several ways to define the axis ticks of dates and times. There are the labelled major breaks and, further, the minor breaks, which are not labelled but marked by grid lines. These can be customized with the arguments breaks and minor_breaks, respectively. Both breaks and minor_breaks can be defined by a vector of exact positions or by a function with the axis limits as inputs and the breaks as outputs. Alternatively, the arguments can be set to NULL to display no (minor) breaks at all. These options are especially handy if irregular intervals between breaks are desired.

base_plot + scale_x_date(breaks = as.Date(c("1970-01-01", "2000-01-01")), minor_breaks = as.Date(c("1975-01-01", "1980-01-01", "2005-01-01", "2010-01-01"))) + ggtitle("(minor_)breaks = fixed Dates")

base_plot + scale_x_date(breaks = function(x) seq.Date(from = min(x), to = max(x), by = "12 years"), minor_breaks = function(x) seq.Date(from = min(x), to = max(x), by = "2 years")) + ggtitle("(minor_)breaks = custom function")

base_plot + scale_x_date(breaks = NULL, minor_breaks = NULL) + ggtitle("(minor_)breaks = NULL")

Another and very convenient way to define regular breaks are the date_breaks and date_minor_breaks arguments. As input, both take a character string combining an integer with a time unit (either "sec", "min", "hour", "day", "week", "month" or "year"), which together specify the break intervals.

base_plot + scale_x_date(date_breaks = "10 years", date_minor_breaks = "2 years") + ggtitle("date_(minor_)breaks = \"x years\"")

If both are given, date(_minor)_breaks overrules (minor_)breaks.

Axis Labels

Similar to the axis ticks, the format of the displayed labels can be defined via either the labels or the date_labels argument. The labels argument can be set to NULL if no labels should be displayed, or to a function with the breaks as inputs and the labels as outputs. Alternatively, a character vector with labels for all the breaks can be supplied to the argument. This can be very useful, since virtually any character vector can be used to label the breaks. The number of labels must be the same as the number of breaks. If the breaks are defined by a function, by date_breaks, or by the default, the labels must be defined by a function as well.

base_plot + scale_x_date(date_breaks = "15 years", labels = function(x) paste((x-365), "(+365 days)")) + ggtitle("labels = custom function")

base_plot + scale_x_date(breaks = as.Date(c("1970-01-01", "2000-01-01")), labels = c("~ '70", "~ '00")) + ggtitle("labels = character vector")

Furthermore and very conveniently, the format of the labels can be controlled via the argument date_labels set to a string of formatting codes, defining order, format and elements to be displayed:

Code   Meaning
%S     second (00-59)
%M     minute (00-59)
%l     hour, in 12-hour clock (1-12)
%I     hour, in 12-hour clock (01-12)
%H     hour, in 24-hour clock (00-23)
%a     day of the week, abbreviated (Mon-Sun)
%A     day of the week, full (Monday-Sunday)
%e     day of the month (1-31)
%d     day of the month (01-31)
%m     month, numeric (01-12)
%b     month, abbreviated (Jan-Dec)
%B     month, full (January-December)
%y     year, without century (00-99)
%Y     year, with century (0000-9999)

Source: Wickham 2009 p. 99

base_plot + scale_x_date(date_labels = "%Y (%b)") + ggtitle("date_labels = \"%Y (%b)\"")
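
The same format codes work for datetimes via scale_x_datetime(); here is a small sketch with a made-up hourly series (not part of the economics data):

hourly <- data.frame(
  time  = seq(as.POSIXct("2018-06-11 00:00"), by = "hour", length.out = 48),
  value = cumsum(rnorm(48))
)
ggplot(hourly, aes(x = time, y = value)) +
  geom_line() +
  scale_x_datetime(date_breaks = "6 hours", date_labels = "%H:%M")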

The choice of axis ticks and labels might seem trivial. However, one should not underestimate the amount of confusion that can be caused by too many, too few or poorly positioned axis ticks and labels. Further, economical yet clear labelling of axis ticks can increase the readability and visual appeal of any time series plot immensely. Since it is so easy to tweak the date and time axes in ggplot2, there is simply no excuse not to do so.

References
  • Wickham, H. (2009). ggplot2: elegant graphics for data analysis. Springer.
About the author

Lea Waniek

Lea is a member of the Data Science team and also provides support in the area of statistics.

The post Customizing time and date scales in ggplot2 first appeared on STATWORX.


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX.

tiler 0.2.0 CRAN release: Create map tiles from R

Mon, 06/11/2018 - 02:00

(This article was first published on R on Matthew Leonawicz | Blog, and kindly contributed to R-bloggers)

The tiler package provides a map tile-generator function for creating map tile sets for use with packages such as leaflet.
In addition to generating map tiles based on a common raster layer source, it also handles the non-geographic edge case, producing map tiles from arbitrary images. These map tiles, which have a “simple CRS”, a non-geographic simple Cartesian coordinate reference system, can also be used with leaflet when applying the simple CRS option.
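
As a rough sketch of the intended workflow (the argument names here are from memory and may differ slightly from the current tiler API): generate a tile directory from an image or raster, then point leaflet at it with the simple-CRS option.

library(tiler)
library(leaflet)

# write tiles for zoom levels 0-3 into the "tiles" directory
tile("heatmap.png", tiles = "tiles", zoom = "0-3")

# view the non-geographic tiles with leaflet's simple CRS
leaflet(options = leafletOptions(crs = leafletCRS(crsClass = "L.CRS.Simple"))) %>%
  addTiles(urlTemplate = "tiles/{z}/{x}/{y}.png")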


To leave a comment for the author, please follow the link and comment on their blog: R on Matthew Leonawicz | Blog.

Create outstanding dashboards with the new semantic.dashboard package

Mon, 06/11/2018 - 02:00

(This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers)

We all know that Shiny is great for interactive data visualisations. But sometimes even the best attempts to fit all your graphs into a single Shiny page are not enough. In our experience, almost every project with a growing number of KPIs struggles with messy, hard-to-read final reports. This is where dashboards come in handy. Dashboards let you structure your reports intuitively by breaking them down into sections, panels and tabs, which makes it much easier for the end user to navigate your work.

shinydashboard does a decent job here. However, once you have created a bunch of apps with it, you quickly realize that they all look the same and are simply boring. In this tutorial, I will show you how to take advantage of the semantic.dashboard package. It is an alternative to shinydashboard that makes use of Semantic UI, so you can introduce beautiful Semantic components into your app and select from many available themes.

Before we start: if you don’t have semantic.dashboard installed yet, visit this GitHub page for detailed instructions.

How to start?

Let’s begin with creating an empty dashboard:

library(shiny)
library(semantic.dashboard)

ui <- dashboardPage(
  dashboardHeader(),
  dashboardSidebar(),
  dashboardBody()
)

server <- shinyServer(function(input, output, session) {
})

shinyApp(ui, server)

For comparison you might check what happens if you change library(semantic.dashboard) to library(shinydashboard).

What you should see is something like this:

With almost no effort we have created the skeleton for our first semantic.dashboard app.

Now it is time to discuss basic components of a dashboard. Each dashboardPage consists of three elements:

  • header,
  • sidebar,
  • body.

Currently our header is quite boring, so let’s add some title and change its colour:

dashboardHeader(color = "blue", title = "Dashboard Demo", inverted = TRUE)

We don’t need the sidebar to be so wide, and to make it more functional let’s add two menu elements:

dashboardSidebar(
  size = "thin", color = "teal",
  sidebarMenu(
    menuItem(tabName = "main", "Main"),
    menuItem(tabName = "extra", "Extra")
  )
)

This is the result you should expect to see:

Not bad! We can do even better by adding some icons to our menuItems. Can you do it yourself?

Hint! Use semantic.dashboard documentation, eg. by typing ?menuItem in RStudio console.
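
For reference, one way to do it, matching the full code at the end of this post:

sidebarMenu(
  menuItem(tabName = "main", "Main", icon = icon("car")),
  menuItem(tabName = "extra", "Extra", icon = icon("table"))
)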

Adding content

In this section we will start filling our app with some content. We will use popular dataset “mtcars”, extracted from the 1974 Motor Trend US magazine, which comprises some parameters of 32 old-school automobiles. Before that, let’s make sure that we know how to change the content of our body using tabs. Look at the following piece of code:

dashboardBody(
  tabItems(
    selected = 1,
    tabItem(
      tabName = "main",
      fluidRow(
        h1("main")
      )
    ),
    tabItem(
      tabName = "extra",
      fluidRow(
        h1("extra")
      )
    )
  )
)

We created two tabItems with tabNames exactly matching those of the menuItems. The selected parameter of tabItems tells us which tabItem should be selected and displayed when the app starts.

Equipped with that knowledge, we can finally implement something really functional. We start from creating a simple plot describing the relationship between the car’s gearbox type and their miles per gallon parameter. In the shinyServer function we call:

data("mtcars") mtcars$am <- factor(mtcars$am,levels=c(0,1), labels=c("Automatic","Manual")) output$boxplot1 <- renderPlot({ ggplot(mtcars, aes(x = am, y = mpg)) + geom_boxplot(fill = semantic_palette[["green"]]) + xlab("gearbox") + ylab("Miles per gallon") })

Since we are using ggplot2 don’t forget to attach this package at the beginning of the script.

We are going to make another plot from this dataset, so let’s divide the main page into two sections. We can replace the content of the fluidRow function in our first tabItem with:

box(width = 8, title = "Graph 1", color = "green", ribbon = TRUE, title_side = "top right", column(8, plotOutput("boxplot1") ) ), box(width = 8, title = "Graph 2", color = "red", ribbon = TRUE, title_side = "top right" )

This is a good moment to make sure that your app is running. If the answer is yes, proceed to the next section. In case it’s not, verify if you followed all the previous steps.

Interactive items

How about adding a more interactive plot to our dashboard? Let’s use plotly to present the relation between the weight of a car and miles per gallon. I decided on a point plot here since I can include some extra information in it: colour marks the number of cylinders and the size of the dots the quarter-mile time. Here is the code to achieve it:

colscale <- c(semantic_palette[["red"]], semantic_palette[["green"]], semantic_palette[["blue"]])

output$dotplot1 <- renderPlotly({
  ggplotly(ggplot(mtcars, aes(wt, mpg)) +
    geom_point(aes(colour = factor(cyl), size = qsec)) +
    scale_colour_manual(values = colscale)
  )
})

Note that we used semantic_palette here for the graph colours in order to stay consistent with the Semantic UI layout of the app.

To insert it into the dashboard, we add by analogy to the second box:

box(width = 8, title = "Graph 2", color = "red", ribbon = TRUE, title_side = "top right", column(width = 8, plotlyOutput("dotplot1") ) )

That’s how our main page looks right now:

I would say that’s enough here. Let’s switch to another tab.

Some curious user will probably want to see a complete list of car properties in the “Extra” tab. Let’s give them that chance. For that we will use the DT package to render the table on the server side:

output$carstable <- DT::renderDataTable(mtcars)

and display it in the second tab:

tabItem( tabName = "extra", fluidRow( dataTableOutput("carstable") ) )

I’m pretty satisfied with what we have achieved so far…

But you know what? I changed my mind at this point and decided to modify the theme of this dashboard. I reviewed my options on the Semantic Forest website and decided on cerulean. It’s very easy to change the theme now, as it requires changing only one parameter in the dashboardPage function.

dashboardPage( header, sidebar, body, theme = "cerulean")

Let’s see how this app works:

Full code

Okay, so that’s it! If you got to this point, it means that you created your first dashboard with semantic.dashboard package. For training, I encourage you to customize it by adding more tabs with new fancy plots. If you missed some step, here you can check the complete code for the dashboard:

library(shiny)
library(semantic.dashboard)
library(ggplot2)
library(plotly)
library(DT)

ui <- dashboardPage(
  dashboardHeader(color = "blue", title = "Dashboard Demo", inverted = TRUE),
  dashboardSidebar(
    size = "thin", color = "teal",
    sidebarMenu(
      menuItem(tabName = "main", "Main", icon = icon("car")),
      menuItem(tabName = "extra", "Extra", icon = icon("table"))
    )
  ),
  dashboardBody(
    tabItems(
      selected = 1,
      tabItem(
        tabName = "main",
        fluidRow(
          box(width = 8, title = "Graph 1", color = "green", ribbon = TRUE, title_side = "top right",
              column(width = 8,
                plotOutput("boxplot1")
              )
          ),
          box(width = 8, title = "Graph 2", color = "red", ribbon = TRUE, title_side = "top right",
              column(width = 8,
                plotlyOutput("dotplot1")
              )
          )
        )
      ),
      tabItem(
        tabName = "extra",
        fluidRow(
          dataTableOutput("carstable")
        )
      )
    )
  ),
  theme = "cerulean"
)

server <- shinyServer(function(input, output, session) {
  data("mtcars")
  colscale <- c(semantic_palette[["red"]], semantic_palette[["green"]], semantic_palette[["blue"]])
  mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
  output$boxplot1 <- renderPlot({
    ggplot(mtcars, aes(x = am, y = mpg)) +
      geom_boxplot(fill = semantic_palette[["green"]]) +
      xlab("gearbox") + ylab("Miles per gallon")
  })
  output$dotplot1 <- renderPlotly({
    ggplotly(ggplot(mtcars, aes(wt, mpg)) +
      geom_point(aes(colour = factor(cyl), size = qsec)) +
      scale_colour_manual(values = colscale)
    )
  })
  output$carstable <- renderDataTable(mtcars)
})

shinyApp(ui, server)

Resources:

Read the original post at
Appsilon Data Science Blog.

Follow Appsilon Data Science

To leave a comment for the author, please follow the link and comment on their blog: Appsilon Data Science Blog.

Sankey Diagram for the 2018 FIFA World Cup Forecast

Mon, 06/11/2018 - 00:00

(This article was first published on Achim Zeileis, and kindly contributed to R-bloggers)

The probabilistic forecast from the bookmaker consensus model for the 2018 FIFA World Cup is visualized in an interactive Sankey diagram, highlighting the teams’ most likely progress through the tournament.

Bookmaker consensus model

Two weeks ago we published our Probabilistic Forecast for the 2018 FIFA World Cup: By adjusting quoted bookmakers’ odds for the profit margins of the bookmakers (also known as overrounds), transforming and averaging them, a predicted winning probability for each team was obtained. By employing millions of tournament simulations in combination with a model for pairwise comparisons (or matches) we could also obtain forecasted probabilities for each team to progress through the tournament. In our original study, we visualized these by “survival” curves. See the working paper for more details and references.
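
As a simplified illustration of the overround adjustment (the actual model also log-transforms the implied probabilities and averages them across bookmakers before simulating the tournament; see the working paper for details), here is what removing the margin looks like for a single hypothetical match:

# hypothetical decimal odds for one match (home win / draw / away win)
odds  <- c(home = 2.10, draw = 3.30, away = 3.80)
p_raw <- 1 / odds
sum(p_raw)                  # slightly above 1: the excess is the bookmaker's margin (overround)
p_adj <- p_raw / sum(p_raw)
round(p_adj, 3)             # adjusted probabilities that sum to 1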

Sankey diagram

Here, we present another display that highlights the likely flow of all teams through the tournament simultaneously. Click on the image to obtain an interactive full-width version of this Sankey diagram produced by Plotly.

Compared to the survival curves shown in our original study this visualization brings out more clearly at which stages of the tournament the strong teams are most likely to meet. Its usage was inspired by the nice working paper On Elo based prediction models for the FIFA Worldcup 2018 by Lorenz A. Gilch and Sebastian Müller.

In a few days we will start learning which of these paths will actually come true. Enjoy the 2018 FIFA World Cup!


To leave a comment for the author, please follow the link and comment on their blog: Achim Zeileis.

RcppZiggurat 0.1.5

Sun, 06/10/2018 - 20:27

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance release 0.1.5 of RcppZiggurat is now on the CRAN network for R.

The RcppZiggurat package updates the code for the Ziggurat generator which provides very fast draws from a Normal distribution. The package provides a simple C++ wrapper class for the generator improving on the very basic macros, and permits comparison among several existing Ziggurat implementations. This can be seen in the figure where Ziggurat from this package dominates accessing the implementations from the GSL, QuantLib and Gretl—all of which are still way faster than the default Normal generator in R (which is of course of higher code complexity).
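
For a quick sense of the speed claim, here is a sketch assuming the package's exported zrnorm() and zsetseed() helpers:

library(RcppZiggurat)
library(microbenchmark)

zsetseed(42)   # seed the Ziggurat generator
set.seed(42)   # seed R's default RNG for comparison
microbenchmark(rnorm(1e6), zrnorm(1e6), times = 20)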

Per a request from CRAN, we changed the vignette to accommodate pandoc 2.* just as we did with the most recent pinp release two days ago. Other changes that had been pending are a minor rewrite of DOIs in DESCRIPTION, a corrected state setter thanks to a PR by Ralf Stubner, and a tweak for function registration to have user_norm_rand() visible.

The NEWS file entry below lists all changes.

Changes in version 0.1.5 (2018-06-10)
  • Description rewritten using doi for references.

  • Re-setting the Ziggurat generator seed now correctly re-sets state (Ralf Stubner in #7 fixing #3)

  • Dynamic registration reverts to manual mode so that user_norm_rand() is visible as well (#7).

  • The vignette was updated to accomodate pandoc 2* [CRAN request].

Courtesy of CRANberries, there is also a diffstat report for the most recent release. More information is on the RcppZiggurat page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

RcppGSL 0.3.6

Sun, 06/10/2018 - 20:20

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance update 0.3.6 of RcppGSL is now on CRAN. The RcppGSL package provides an interface from R to the GNU GSL using the Rcpp package.

Per a request from CRAN, we changed the vignette to accommodate pandoc 2.* just as we did with the most recent pinp release two days ago. No other changes were made. The (this time really boring) NEWS file entry follows:

Changes in version 0.3.6 (2018-06-10)
  • The vignette was updated to accomodate pandoc 2* [CRAN request].

Courtesy of CRANberries, a summary of changes to the most recent release is available.

More information is on the RcppGSL page. Questions, comments etc should go to the issue tickets at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

RcppClassic 0.9.10

Sun, 06/10/2018 - 18:36

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance release RcppClassic 0.9.10 is now at CRAN. This package provides a maintained version of the otherwise deprecated first Rcpp API; no new projects should use it.

Per a request from CRAN, we changed the vignette to accommodate pandoc 2.* just as we did with the most recent pinp release two days ago. No other changes were made.

CRANberries also reports the changes relative to the previous release.

Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Statistics Sunday: Creating Wordclouds

Sun, 06/10/2018 - 15:30

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


Cloudy with a Chance of Words

Lots of fun projects in the works, so today’s post will be short – a demonstration of how to create wordclouds, both with and without sentiment analysis results. While I could use song lyrics again, I decided to use a different dataset that comes with the quanteda package: all 58 Inaugural Addresses, from Washington’s first speech in 1789 to Trump’s in 2017.

library(quanteda) #install with install.packages("quanteda") if needed
data(data_corpus_inaugural)
speeches <- data_corpus_inaugural$documents
row.names(speeches) <- NULL

As you can see, this dataset has each Inaugural Address in a column called “texts,” with year and President’s name as additional variables. To do analysis on the words in speeches, and generate a wordcloud, we’ll want to unnest the words in the texts column.

library(tidytext)
library(tidyverse)
speeches_tidy <- speeches %>%
unnest_tokens(word, texts) %>%
anti_join(stop_words)
## Joining, by = "word"

For our first wordcloud, let’s see what are the most common words across all speeches.

library(wordcloud) #install.packages("wordcloud") if needed
speeches_tidy %>%
count(word, sort = TRUE) %>%
with(wordcloud(word, n, max.words = 50))

While the language used by Presidents certainly varies by time period and the national situation, these speeches refer often to the people and the government; in fact, most of the larger words directly reference the United States and Americans. The speeches address the role of “president” and likely the “duty” that role entails. The word “peace” is only slightly larger than “war,” and one could probably map out which speeches were given during wartime and which weren’t.

We could very easily create a wordcloud for one President specifically. For instance, let’s create one for Obama, since he provides us with two speeches worth of words. But to take things up a notch, let’s add sentiment information to our wordcloud. To do that, we’ll use the comparison.cloud function; we’ll also need the reshape2 library.

library(reshape2) #install.packages("reshape2") if needed
obama_words <- speeches_tidy %>%
filter(President == "Obama") %>%
count(word, sort = TRUE)

obama_words %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))) %>%
filter(n > 1) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("red","blue"))
## Joining, by = "word"

The acast statement reshapes the data, putting our sentiments of positive and negative as separate columns. Setting fill = 0 is important, since a negative word will be missing a value for the positive column and vice versa; without fill = 0, it would drop any row with NA in one of the columns (which would be every word in the set). As a sidenote, we could use the comparison cloud to compare words across two documents, such as comparing two Presidents. The columns would be counts for each President, as opposed to count by sentiment.
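
As a sketch of that President-vs-President variant (using the same objects as above and assuming both names appear in the President column):

speeches_tidy %>%
  filter(President %in% c("Obama", "Trump")) %>%
  count(word, President, sort = TRUE) %>%
  acast(word ~ President, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("blue", "red"), max.words = 75)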

Interestingly, the NRC classifies “government” and “words” as negative. But even if we ignore those two words, which are Obama’s most frequent, the negatively-valenced words are much larger than most of his positively-valenced words. So while he uses many more positively-valenced words than negatively-valenced words – seen by the sheer number of blue words – he uses the negatively-valenced words more often. If you were so inclined, you could probably run a sentiment analysis on his speeches and see if they tend to be more positive or negative, and/or if they follow arcs of negativity and positivity. And feel free to generate your own wordcloud: all you’d need to do is change the filter(President == “”) to whatever President you’re interested in examining (or whatever text data you’d like to use, if President’s speeches aren’t your thing).


To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.

Getting data from pdfs using the pdftools package

Sun, 06/10/2018 - 02:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

It is often the case that data is trapped inside pdfs, but thankfully there are ways to extract
it from the pdfs. A very nice package for this task is
pdftools (Github link)
and this blog post will describe some basic functionality from that package.

First, let’s find some pdfs that contain interesting data. For this post, I’m using the diabetes
country profiles from the World Health Organization. You can find them here.
If you open one of these pdfs, you are going to see this:


I’m interested in this table here in the middle:


I want to get the data from different countries, put it all into a nice data frame and make a
simple plot.

Let’s first start by loading the needed packages:

library("pdftools") library("glue") library("tidyverse") ## ── Attaching packages ────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ── ## ✔ ggplot2 2.2.1 ✔ purrr 0.2.5 ## ✔ tibble 1.4.2 ✔ dplyr 0.7.5 ## ✔ tidyr 0.8.1 ✔ stringr 1.3.1 ## ✔ readr 1.1.1 ✔ forcats 0.3.0 ## ── Conflicts ───────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::collapse() masks glue::collapse() ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() library("ggthemes") country <- c("lux", "fra", "deu", "usa", "prt", "gbr") url <- "http://www.who.int/diabetes/country-profiles/{country}_en.pdf?ua=1"

The first 4 lines load the needed packages for this exercise: pdftools is the package that I
described in the beginning of the post, glue is optional but offers a nice alternative to the
paste() and paste0() functions. Take a closer look at the url: you’ll see that I wrote {country}.
This is not in the original links; the original links look like this (for example for the USA):

"http://www.who.int/diabetes/country-profiles/usa_en.pdf?ua=1"

So because I’m interested in several countries, I created a vector with the country codes
of the countries I’m interested in. Now, using the glue() function, something magical happens:

(urls <- glue(url))
## http://www.who.int/diabetes/country-profiles/lux_en.pdf?ua=1
## http://www.who.int/diabetes/country-profiles/fra_en.pdf?ua=1
## http://www.who.int/diabetes/country-profiles/deu_en.pdf?ua=1
## http://www.who.int/diabetes/country-profiles/usa_en.pdf?ua=1
## http://www.who.int/diabetes/country-profiles/prt_en.pdf?ua=1
## http://www.who.int/diabetes/country-profiles/gbr_en.pdf?ua=1

This created a vector with all the links where {country} is replaced by each of the codes
contained in the variable country.

I use the same trick to create the names of the pdfs that I will download:

pdf_names <- glue("report_{country}.pdf")

And now I can download them:

walk2(urls, pdf_names, download.file, mode = "wb")

walk2() is a function from the purrr package that is similar to map2(). You could use map2()
for this, but walk2() is cleaner here, because download.file() is a function with a so-called
side effect; it downloads files. map2() is used for functions without side effects.
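
A tiny illustration of that difference (a sketch unrelated to the pdfs): map2() returns a list of results, while walk2() calls the function only for its side effect and returns its first input invisibly.

map2(1:3, 4:6, ~ .x + .y)                      # list(5, 7, 9)
out <- walk2(1:3, 4:6, ~ cat(.x + .y, "\n"))   # prints 5, 7, 9
identical(out, 1:3)                            # TRUE: walk2() passes .x through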

Now, I can finally use the pdf_text() function from the pdftools function to get the text
from the pdfs:

raw_text <- map(pdf_names, pdf_text)

raw_text is a list of where each element is the text from one of the pdfs. Let’s take a look:

str(raw_text)
## List of 6
##  $ : chr "Luxembourg "| __truncated__
##  $ : chr "France "| __truncated__
##  $ : chr "Germany "| __truncated__
##  $ : chr "United States Of America "| __truncated__
##  $ : chr "Portugal "| __truncated__
##  $ : chr "United Kingdom "| __truncated__

Let’s take a look at one of these elements, which is nothing but a very long character:

raw_text[[1]] ## [1] "Luxembourg Total population: 567 000\n Income group: High\nMortality\nNumber of diabetes deaths Number of deaths attributable to high blood glucose\n males females males females\nages 30–69 <100 <100 ages 30–69 <100 <100\nages 70+ <100 <100 ages 70+ <100 <100\nProportional mortality (% of total deaths, all ages) Trends in age-standardized prevalence of diabetes\n Communicable,\n maternal, perinatal Injuries 35%\n and nutritional 6% Cardiovascular\n conditions diseases\n 6% 33%\n 30%\n 25%\n % of population\n Other NCDs\n 16% 20%\n No data available 15% No data available\n Diabetes 10%\n 2%\n 5%\n Respiratory\n diseases\n 6% 0%\n Cancers\n 31%\n males females\nPrevalence of diabetes and related risk factors\n males females total\nDiabetes 8.3% 5.3% 6.8%\nOverweight 70.7% 51.5% 61.0%\nObesity 28.3% 21.3% 24.8%\nPhysical inactivity 28.2% 31.7% 30.0%\nNational response to diabetes\nPolicies, guidelines and monitoring\nOperational policy/strategy/action plan for diabetes ND\nOperational policy/strategy/action plan to reduce overweight and obesity ND\nOperational policy/strategy/action plan to reduce physical inactivity ND\nEvidence-based national diabetes guidelines/protocols/standards ND\nStandard criteria for referral of patients from primary care to higher level of care ND\nDiabetes registry ND\nRecent national risk factor survey in which blood glucose was measured ND\nAvailability of medicines, basic technologies and procedures in the public health sector\nMedicines in primary care facilities Basic technologies in primary care facilities\nInsulin ND Blood glucose measurement ND\nMetformin ND Oral glucose tolerance test ND\nSulphonylurea ND HbA1c test ND\nProcedures Dilated fundus examination ND\nRetinal photocoagulation ND Foot vibration perception by tuning fork ND\nRenal replacement therapy by dialysis ND Foot vascular status by Doppler ND\nRenal replacement therapy by transplantation ND Urine strips for glucose and ketone measurement ND\nND = country did not respond to country capacity survey\n〇 = not generally available ● = generally available\nWorld Health Organization – Diabetes country profiles, 2016.\n"

As you can see, this is a very long character string with some line breaks (the "\n" character).
So first, we need to split this string into a character vector by the "\n" character. Also, it might
be difficult to see, but the table starts at the line with the following string:
"Prevalence of diabetes" and ends with "National response to diabetes". Also, we need to get
the name of the country from the text and add it as a column. As you can see, a whole lot
of operations are needed, so what I do is put all these operations into a function that I will apply
to each element of raw_text:

clean_table <- function(table){
  table <- str_split(table, "\n", simplify = TRUE)
  country_name <- table[1, 1] %>%
    stringr::str_squish() %>%
    stringr::str_extract(".+?(?=\\sTotal)")
  table_start <- stringr::str_which(table, "Prevalence of diabetes")
  table_end <- stringr::str_which(table, "National response to diabetes")
  table <- table[1, (table_start + 1):(table_end - 1)]
  table <- str_replace_all(table, "\\s{2,}", "|")
  text_con <- textConnection(table)
  data_table <- read.csv(text_con, sep = "|")
  colnames(data_table) <- c("Condition", "Males", "Females", "Total")
  dplyr::mutate(data_table, Country = country_name)
}

I advise you to go through all these operations and understand what each does. However, I will
describe some of the lines, such as this one:

stringr::str_extract(".+?(?=\\sTotal)")

This uses a very bizarre looking regular expression: ".+?(?=\\sTotal)". This extracts everything
before a space, followed by the string "Total". This is because the first line, the one that contains
the name of the country looks like this: "Luxembourg Total population: 567 000\n". So everything
before a space followed by the word "Total" is the country name. Then there’s these lines:

table <- str_replace_all(table, "\\s{2,}", "|")
text_con <- textConnection(table)
data_table <- read.csv(text_con, sep = "|")

The first line replaces 2 or more spaces ("\\s{2,}") with "|". The reason I do this is that
then I can read the table back into R as a data frame by specifying the separator as the “|” character.
On the second line, I define table as a text connection, that I can then read back into R using
read.csv(). On the second to the last line I change the column names and then I add a column
called "Country" to the data frame.
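
Coming back to the country-name regular expression for a moment, a quick check shows what the lookahead extracts from the first line of a profile:

stringr::str_extract("Luxembourg Total population: 567 000", ".+?(?=\\sTotal)")
## [1] "Luxembourg"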

Now, I can map this useful function to the list of raw text extracted from the pdfs:

diabetes <- map_df(raw_text, clean_table) %>%
  gather(Sex, Share, Males, Females, Total) %>%
  mutate(Share = as.numeric(str_extract(Share, "\\d{1,}\\.\\d{1,}")))

I reshape the data with the gather() function (see what the data looks like before and after
reshaping). I then convert the "Share" column into a numeric (it goes from something that looks
like "12.3 %" into 12.3) and then I can create a nice plot. But first let’s take a loot at
the data:

diabetes ## Condition Country Sex Share ## 1 Diabetes Luxembourg Males 8.3 ## 2 Overweight Luxembourg Males 70.7 ## 3 Obesity Luxembourg Males 28.3 ## 4 Physical inactivity Luxembourg Males 28.2 ## 5 Diabetes France Males 9.5 ## 6 Overweight France Males 69.9 ## 7 Obesity France Males 25.3 ## 8 Physical inactivity France Males 21.2 ## 9 Diabetes Germany Males 8.4 ## 10 Overweight Germany Males 67.0 ## 11 Obesity Germany Males 24.1 ## 12 Physical inactivity Germany Males 20.1 ## 13 Diabetes United States Of America Males 9.8 ## 14 Overweight United States Of America Males 74.1 ## 15 Obesity United States Of America Males 33.7 ## 16 Physical inactivity United States Of America Males 27.6 ## 17 Diabetes Portugal Males 10.7 ## 18 Overweight Portugal Males 65.0 ## 19 Obesity Portugal Males 21.4 ## 20 Physical inactivity Portugal Males 33.5 ## 21 Diabetes United Kingdom Males 8.4 ## 22 Overweight United Kingdom Males 71.1 ## 23 Obesity United Kingdom Males 28.5 ## 24 Physical inactivity United Kingdom Males 35.4 ## 25 Diabetes Luxembourg Females 5.3 ## 26 Overweight Luxembourg Females 51.5 ## 27 Obesity Luxembourg Females 21.3 ## 28 Physical inactivity Luxembourg Females 31.7 ## 29 Diabetes France Females 6.6 ## 30 Overweight France Females 58.6 ## 31 Obesity France Females 26.1 ## 32 Physical inactivity France Females 31.2 ## 33 Diabetes Germany Females 6.4 ## 34 Overweight Germany Females 52.7 ## 35 Obesity Germany Females 21.4 ## 36 Physical inactivity Germany Females 26.5 ## 37 Diabetes United States Of America Females 8.3 ## 38 Overweight United States Of America Females 65.3 ## 39 Obesity United States Of America Females 36.3 ## 40 Physical inactivity United States Of America Females 42.1 ## 41 Diabetes Portugal Females 7.8 ## 42 Overweight Portugal Females 55.0 ## 43 Obesity Portugal Females 22.8 ## 44 Physical inactivity Portugal Females 40.8 ## 45 Diabetes United Kingdom Females 6.9 ## 46 Overweight United Kingdom Females 62.4 ## 47 Obesity United Kingdom Females 31.1 ## 48 Physical inactivity United Kingdom Females 44.3 ## 49 Diabetes Luxembourg Total 6.8 ## 50 Overweight Luxembourg Total 61.0 ## 51 Obesity Luxembourg Total 24.8 ## 52 Physical inactivity Luxembourg Total 30.0 ## 53 Diabetes France Total 8.0 ## 54 Overweight France Total 64.1 ## 55 Obesity France Total 25.7 ## 56 Physical inactivity France Total 26.4 ## 57 Diabetes Germany Total 7.4 ## 58 Overweight Germany Total 59.7 ## 59 Obesity Germany Total 22.7 ## 60 Physical inactivity Germany Total 23.4 ## 61 Diabetes United States Of America Total 9.1 ## 62 Overweight United States Of America Total 69.6 ## 63 Obesity United States Of America Total 35.0 ## 64 Physical inactivity United States Of America Total 35.0 ## 65 Diabetes Portugal Total 9.2 ## 66 Overweight Portugal Total 59.8 ## 67 Obesity Portugal Total 22.1 ## 68 Physical inactivity Portugal Total 37.3 ## 69 Diabetes United Kingdom Total 7.7 ## 70 Overweight United Kingdom Total 66.7 ## 71 Obesity United Kingdom Total 29.8 ## 72 Physical inactivity United Kingdom Total 40.0

Now let’s go for the plot:

ggplot(diabetes) +
  theme_fivethirtyeight() +
  scale_fill_hc() +
  geom_bar(aes(y = Share, x = Sex, fill = Country),
           stat = "identity", position = "dodge") +
  facet_wrap(~Condition)

That was a whole lot of work for such a simple plot!

If you found this blog post useful, you might want to follow me on twitter
for blog post updates.


To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.
