R news and tutorials contributed by hundreds of R bloggers

### R⁶ Series — Random Sampling From Apache Drill Tables With R & sergeant

Wed, 12/20/2017 - 14:09

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

(For first-timers, R⁶-tagged posts are short & sweet with minimal exposition; R⁶ feed)

At work-work I mostly deal with medium-to-large-ish data. I often want to poke at new or existing data sets w/o working across billions of rows. I also use Apache Drill for much of my exploratory work.

Here’s how to uniformly sample data from Apache Drill using the sergeant package:

```r
library(sergeant)

db  <- src_drill("sonar")
tbl <- tbl(db, "dfs.dns.aaaa.parquet")

summarise(tbl, n=n())
## # Source:   lazy query [?? x 1]
## # Database: DrillConnection
##          n
## 1 19977415

mutate(tbl, r=rand()) %>% filter(r <= 0.01) %>% summarise(n=n())
## # Source:   lazy query [?? x 1]
## # Database: DrillConnection
##        n
## 1 199808

mutate(tbl, r=rand()) %>% filter(r <= 0.50) %>% summarise(n=n())
## # Source:   lazy query [?? x 1]
## # Database: DrillConnection
##         n
## 1 9988797
```
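A quick back-of-the-envelope check, in plain R arithmetic using the counts from the output above: filtering on `rand() <= p` should keep roughly a fraction `p` of the rows.

```r
# rand() <= 0.01 kept 199808 of 19977415 rows; the realised fraction
# should sit very close to the 1% target
n        <- 19977415
observed <- 199808
observed / n  # very close to 0.01
```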

And, for groups (using a different/larger “database”):

```r
fdns <- tbl(db, "dfs.fdns.201708")

summarise(fdns, n=n())
## # Source:   lazy query [?? x 1]
## # Database: DrillConnection
##            n
## 1 1895133100

filter(fdns, type %in% c("cname", "txt")) %>% count(type)
## # Source:   lazy query [?? x 2]
## # Database: DrillConnection
##    type        n
## 1 cname 15389064
## 2 txt   67576750

filter(fdns, type %in% c("cname", "txt")) %>%
  group_by(type) %>%
  mutate(r=rand()) %>%
  ungroup() %>%
  filter(r <= 0.15) %>%
  count(type)
## # Source:   lazy query [?? x 2]
## # Database: DrillConnection
##    type        n
## 1 cname  2307604
## 2 txt   10132672
```

I will (hopefully) get better at cranking out these bite-sized posts in 2018.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### eRum 2018 to be held in Budapest, May 14-16

Wed, 12/20/2017 - 11:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The main international R user conference, useR!, is typically hosted in Europe every other year. Back in 2016 when useR! was held in the USA, an alternative conference was held in Poland. Next year (when the useR! conference will be in Australia) the tradition continues, and the 2018 eRum (European R Users Meeting) will be held in Budapest, Hungary on May 14-16.

Around 400-500 attendees are expected, and the program committee is sure to put together an excellent collection of workshops and talks. If you'd like to contribute, workshop proposals are due on January 14, and abstracts for talks and posters are due on February 25. You can submit your proposals here.

Conference registration will open in early January, but you can follow @erum2018 on Twitter to be notified or simply check at the URL below for ticket information.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

### A Data Science Lab for R

Wed, 12/20/2017 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

In a previous post I described the role of analytic administrator as a data scientist who: onboards new tools, deploys solutions, supports existing standards, and trains other data scientists. In this post I will describe how someone in that role might set up a data science lab for R.

Architecture

A data science lab is an environment for developing code and creating content. It should enhance the productivity of your data scientists and integrate with your existing systems. Your data science lab might live on your premises or in the cloud. It might be built with hardware, virtual machines, or containers. You may use it to support a single data scientist or hundreds of R developers. Here is one reference architecture of a data science lab based on server instances.

Key components of this setup include: authentication; load balancing; a testing environment; data connectivity; and a publishing platform. In this server-based architecture, data scientists use a web browser to access the data science lab. High performance compute and live data reside securely behind a firewall.

Instance Sizing

The size of your server instance depends on how many concurrent sessions you run and how large your sessions are. Keep in mind that R is single threaded by default and holds data in memory. Here is a list of example server sizes:

| Instance Size | Cores | RAM | Description |
|---|---|---|---|
| Minimum recommended | 2 | 4G | This server will be for lightweight jobs, testing, and sandboxing. |
| Small | 4 | 8G | This server will support one or two analysts with small data. |
| Large | 16 | 256G | This server will support 15 analysts with a blend of large and small sessions. Alternatively, it will support dozens of analysts with small sessions. |
| Jumbo | 32+ | 1T+ | May be useful for heavier workloads. |
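These sizes follow from R's in-memory model. As a back-of-the-envelope sketch (the numbers here are illustrative assumptions, not from the original article): a numeric value is stored as an 8-byte double, so you can estimate a data set's RAM footprint before picking an instance.

```r
# Assumption: an all-numeric table, 8 bytes per value, ignoring overhead
rows <- 1e7   #10 million rows
cols <- 20
gb <- rows * cols * 8 / 1024^3
round(gb, 2)  #~1.49 GB just to hold the data, before any copies
```

In practice you want headroom for intermediate copies, so multiplying that estimate by a small factor per concurrent session is a sensible starting point.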

Open-Source R

If you haven’t done so already, I recommend you make R a legitimate part of your organization by officially recognizing it as an analytic standard. You should be familiar with installing and managing R and its packages.

You can install R as a pre-compiled binary from a repository, or you can install R from source. Installing R from source allows you to install multiple versions of R side by side. If you compile R from source, I recommend you link to the BLAS libraries so that you can speed up certain low-level math computations.
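A minimal sketch of a side-by-side source install; the version number, prefix, and paths are examples only (check CRAN for the current release and configure flags):

```shell
# Build R into a versioned prefix so multiple versions can coexist.
# --with-blas/--with-lapack link external math libraries to speed up
# low-level computations; adjust paths to your environment.
curl -O https://cran.r-project.org/src/base/R-3/R-3.4.3.tar.gz
tar xzf R-3.4.3.tar.gz && cd R-3.4.3
./configure --prefix=/opt/R/3.4.3 --enable-R-shlib --with-blas --with-lapack
make
sudo make install
```

Installing each version under its own prefix (e.g. `/opt/R/3.4.3`) is what makes side-by-side upgrades painless later.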

Data science labs tend to require a modern toolkit. You should expect to upgrade R at least once a year. You should also keep your operating system up to date. New and improved R packages tend to work better when you use them with recent versions of R and updated system libraries.

RStudio Server Pro

Building a data science lab involves installing, configuring, and managing tools. In this section I will describe how to administer RStudio Server Pro, which has features for authentication, security, and admin controls.

1. Installation

Once you have installed R, you can install RStudio Server Pro by downloading the binaries and following the instructions. You will need root privileges to install and run the software. You will also need to create local system accounts for all of your R developers.

2. Configuration

Authentication. The first thing you will want to do after you install RStudio Server Pro is to configure it with your authentication system. RStudio Server Pro supports LDAP via PAM sessions. If you use single sign-on or another system, you can configure RStudio Server Pro to work in proxied auth mode. You can also authenticate via Google accounts and local system accounts.

Data Connectivity. Most data scientists use R with databases. The RStudio Pro Drivers are ODBC drivers that will connect R to some of the most popular databases today. These drivers are a free add-on for RStudio Server Pro. If you are using a data source that is not supported, or if you are using the open source version of RStudio Server, you can bring your own ODBC driver.
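Once a driver is installed and a DSN is configured, a connection from R typically goes through the DBI and odbc packages. A minimal sketch; "warehouse" is a hypothetical DSN name, not something from the original post:

```r
library(DBI)

#"warehouse" is a placeholder DSN defined in your odbc.ini
con <- dbConnect(odbc::odbc(), dsn = "warehouse")
dbGetQuery(con, "SELECT 1 AS ok")
dbDisconnect(con)
```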

Load Balancing. If you want to load balance your server instances, you can use the load balancer that is built into RStudio Server Pro or you can bring your own load balancer. Load balancing is designed to balance user sessions seamlessly across the cluster and provide high availability. It requires a shared home drive that is mounted to each one of the instances.

More Features. RStudio Server Pro has a list of features that you can configure. You should decide which features you want to enable or disable. For more information on configuring each of these features, see the admin guide.

Authentication

• LDAP, Active Directory, Google Accounts, and system accounts
• Full support for Pluggable Authentication Modules, Kerberos via PAM, and custom authentication via proxied HTTP header

Data Connectivity

• RStudio Professional Drivers are ODBC data connectors that help you connect to some of the most popular databases

Load balancing

• Load balance R sessions across two or more servers
• Ensure high availability using multiple masters

Enhanced security

• Encrypt traffic using SSL and restrict client IP addresses

Session management

• Monitor active sessions and their CPU and memory utilization
• Suspend, forcibly terminate, or assume control of any active session
• Review historical usage and server logs

Auditing and monitoring

• Monitor server resources (CPU, memory, etc.) on both a per-user and system-wide basis
• Send metrics to external systems with the Graphite/Carbon plaintext protocol
• Health check with configurable output (custom XML, JSON)
• Audit all R console activity by writing input and output to a central location

Resource management

• Tailor the version of R, reserve CPU, prioritize scheduling, and limit resources by user and group
• Provision accounts and mount home directories dynamically via the PAM Session API
• Automatically execute per-user profile scripts for database and cluster connectivity

Project sharing

• Share projects & edit code files simultaneously with others
3. Management

Once RStudio Server Pro is installed and configured, you’ll need to manage it over time. RStudio Server Pro comes with a variety of tools for workspace and server management that will help keep your environment organized. For example, you can kill sessions, set session timeouts, and broadcast notifications to user sessions in real-time. You can manage product licenses for both online and offline environments. If your instances start and stop frequently you can opt for using a floating license manager.
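For example, idle-session timeouts are set in the server's configuration files. A sketch with an illustrative value (consult the admin guide for the authoritative option names):

```
# /etc/rstudio/rsession.conf -- suspend idle R sessions after 60 minutes
session-timeout-minutes=60
```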

Next Steps

Your data science lab for R should be designed to scale. That might mean adding more people, more systems, or more tools. It also might mean creating more content. Shiny is an R package that makes it easy to build interactive web apps straight from R. R Markdown is an R package that makes it easy to author reports and build dashboards. You can publish your Shiny apps or R Markdown reports with the push of a button to RStudio Connect. RStudio Connect lets you share and manage content in one convenient place. You can also publish Shiny apps to shinyapps.io, which allows you to share your Shiny apps online.
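For instance, a minimal Shiny app of the kind you might publish to RStudio Connect looks like this (the content is illustrative):

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Data science lab demo"),
  sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(input$n)))
}

# shinyApp(ui, server)  # run locally, then push-button deploy
```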



To leave a comment for the author, please follow the link and comment on their blog: R Views.

### Mining Game of Thrones Scripts with R

Wed, 12/20/2017 - 01:00

(This article was first published on Computational Social Science, and kindly contributed to R-bloggers)

Quantitative Text Analysis Part II

I meant to showcase the quanteda package in my previous post on the Weinstein Effect but had to switch to tidytext at the last minute. Today I will make good on that promise. quanteda is developed by Ken Benoit and maintained by Kohei Watanabe – go LSE! On that note, the first 2018 LondonR meeting will be taking place at the LSE on January 16, so do drop by if you happen to be around. quanteda v1.0 will be unveiled there as well.

Given that I have already used the data I had in mind, I have been trying to identify another interesting (and hopefully less depressing) dataset for this particular calling. Then it snowed in London, and the dire consequences of this supernatural phenomenon were covered extensively on r/CasualUK. One thing led to another, and before you know it I was analysing Game of Thrones scripts:

He looks like he also hates the lack of native parallelism in R.

Getting the Scripts

Mandatory spoilers tag: the rest of the post contains (surprise) spoilers, although only up until the end of the sixth season.

I intend to keep to the organic three-step structure I have developed lately in my posts: obtaining data, showcasing a package, and visualising the end result. With GoT, there are two obvious avenues: full-text books or the show scripts. I decided to go with the show because I’m a filthy casual fan. A wise man once quipped: ‘Never forget what you are. The rest of the world will not. Wear it like armor, and it can never be used to hurt you.’ It’s probably a Chinese proverb.

Nowadays it’s really easy to scrape interesting stuff online, and the rvest package is especially convenient. You feed it a URL, it reads the HTML, you locate which HTML tag/class contains the information you want to extract, and finally it lets you clean up the text by removing the HTML bits. Let’s do an example.

Mighty Google told me that this website has GoT scripts online. Cool, let’s fire up the very first episode. With any modern browser, you should be able to inspect the page to see the underlying code. If you hover where the text is located in inspection mode, you’ll find that it’s wrapped in ‘scrolling-script-container’ tags. This is not a general rule, so you’ll probably have to do this every time you try to scrape a new website.

```r
library(rvest)
library(tidyverse)

url <- "https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=game-of-thrones&episode=s01e01"
webpage <- read_html(url)

#note the dot before the node
script <- webpage %>% html_node(".scrolling-script-container")
full.text <- html_text(script, trim = TRUE)
glimpse(full.text)
## chr "Easy, boy. What do you expect? They're savages. One lot steals a goat from another lot, before you know it they"| __truncated__
```

Alright, that got us the first episode. Sixty-something more to go! Let’s set up and execute a for-loop in R because we like to live dangerously:

```r
#Loop for getting all GoT scripts
baseurl <- "https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=game-of-thrones&episode="
s <- c(rep(1:6, each = 10), rep(7, 7))
season <- paste0("s0", s)
ep <- c(rep(1:10, 6), 1:7)
episode <- ifelse(ep < 10, paste0("e0", ep), paste0("e", ep))
all.scripts <- NULL

#Only the first 6 seasons
for (i in 1:60) {
  url <- paste0(baseurl, season[i], episode[i])
  webpage <- read_html(url)
  script <- webpage %>% html_node(".scrolling-script-container")
  all.scripts[i] <- html_text(script, trim = TRUE)
}
```

Now, the setup was done in a way to get all aired episodes, but the website does not currently have S07E01 (apparently they had an incident and are still recovering data). We could find it somewhere else, of course, but the point is not to analyse GoT exhaustively; it is to practice data science with R. So I’ll just cut the loop short at the end of the sixth season. Let’s see what we got:

```r
got <- as.data.frame(all.scripts, stringsAsFactors = FALSE)
counter <- paste0(season, episode)
row.names(got) <- counter[1:60]
colnames(got) <- "text"
as.tibble(got)
## # A tibble: 60 x 1
##    text
##  *
##  1 "Easy, boy. What do you expect? They're savages. One lot steals a goat from
##  2 "You need to drink, child. And eat. lsn't there anything else? The Dothraki
##  3 "Welcome, Lord Stark. Grand Maester Pycelle has called a meeting of the Sma
##  4 "The little lord's been dreaming again. - We have visitors. - I don't want
##  5 "Does Ser Hugh have any family in the capital? No. I stood vigil for him my
##  6 "Your pardon, Your Grace. I would rise, but Do you know what your wife has
##  7 "\"Summoned to court to answer for the crimes \"of your bannerman Gregor Cl
##  8 "Yah! Left high, left low. Right low, lunge right. You break anything, the
##  9 You've seen better days, my lord. Another visit? lt seems you're my last fr
## 10 "Look at me. Look at me! Do you remember me now, boy, eh? Remember me? Ther
## # ... with 50 more rows
```

Those are the first sentences of the first ten GoT episodes – looks good! We won’t worry about the backslash on line 7 for now. One quirk of this website is that it seems to use a lowercase L in place of a capital I in some places (e.g. “lsn’t” in line 2 above). You can easily fix those with a string-replacement solution; I’ll let them be. Right, let’s generate some numbers to go along with all this text.
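If you did want to fix them, a string-replacement sketch along these lines would do; the word list baked into the pattern is illustrative, not exhaustive:

```r
#Replace a word-initial lowercase l where a capital I belongs,
#e.g. "lsn't" -> "Isn't", "lt" -> "It" (covers only a few common words)
fix_l <- function(x) gsub("\\bl(sn't|t|s|f|n)\\b", "I\\1", x, perl = TRUE)
fix_l("lsn't there anything else? lt seems you're my last friend.")
## [1] "Isn't there anything else? It seems you're my last friend."
```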

Text Analysis with Quanteda

As I covered n-grams in my previous post, I will try to diversify a bit. That should be doable – quanteda offers a smooth ride, and it has a nicely documented website. Which is great; otherwise I don’t think I’d have gotten into it! Let’s transform our scripts dataset into a corpus. The showmeta argument should cut off the additional information you get at the end of a summary; however, it doesn’t work on my computer. Still, we can manipulate the metadata manually:

```r
library(quanteda)
## quanteda version 0.99.22
## Using 7 of 8 threads for parallel computing
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
##     View

got.corpus <- corpus(got)
metacorpus(got.corpus, "source") <- "No peaking!"
summary(got.corpus, 10, showmeta = FALSE)
## Corpus consisting of 60 documents, showing 10 documents:
##
##    Text Types Tokens Sentences
##  s01e01   844   3388       392
##  s01e02  1071   4177       398
##  s01e03  1163   4436       468
##  s01e04  1398   6074       512
##  s01e05  1388   6434       597
##  s01e06  1064   4333       481
##  s01e07  1157   4704       526
##  s01e08  1001   3993       400
##  s01e09  1213   5335       510
##  s01e10  1094   4804       418
##
## Source: No peaking!
## Created: Wed Dec 20 14:18:43 2017
## Notes:
```

I intentionally turned on the message option in the above chunk so that you can see quanteda is thoughtful enough to leave you with a single core for your other computational purposes. The Night King certainly approves. You could also pass a compress = TRUE argument while creating a corpus, which is basically a trade-off between memory space and computation speed. We don’t have that much text so it’s not a necessity for us, but it’s good to know that the option exists.

When you do things for the first couple of times, it’s good practice to conduct a couple of sanity checks. The kwic function, standing for ‘keywords-in-context’, returns a list of such words in their immediate context. This context is formally defined by the window argument, which is bi-directional and includes punctuation. If only there were sets of words in the GoT universe that are highly correlated with certain houses…

```r
#Money money money
kwic(got.corpus, phrase("always pays"), window = 2)
##
## [s01e05, 931:932]  a Lannister | always pays | his debts
## [s01e05, 1275:1276] A Lannister | always pays | his debts
## [s01e06, 1755:1756] a Lannister | always pays | his debts
## [s01e06, 3479:3480] A Lannister | always pays | his debts
## [s02e08, 3654:3655] a Lannister | always pays | her debts
## [s04e07, 1535:1536] A Lannister | always pays | his debts

#What's coming
kwic(got.corpus, "winter", window = 3)
##
## [s01e01,  383]      forever. And | winter | is coming.
## [s01e01, 3180]        the King. | Winter | is coming.
## [s01e03,  577]           king?- | Winter | may be coming
## [s01e03, 1276]      And when the | winter | comes, the
## [s01e03, 1764]        our words. | Winter | is coming.
## [s01e03, 1784]        . But now | winter | is truly coming
## [s01e03, 1792]        And in the | winter | , we must
## [s01e03, 1968]        is for the | winter | , when the
## [s01e04, 5223] remember the last | winter | ? How long
## [s01e04, 5289]   during the last | winter | . It was
## [s01e04, 5559]      And come the | winter | you will die
## [s01e10, 4256]        Wall! And | winter | is coming!
## [s02e01,  566]    an even longer | winter | . A common
## [s02e01,  579]   for a five-year | winter | . If it
## [s02e01,  614]        . And with | winter | coming, it'll
## [s02e01, 1140]     not stand the | winter | . The stones
## [s02e01, 2719]    cold breath of | winter | will freeze the
## [s02e02, 5504]   will starve when | winter | comes. The
## [s02e03, 1095]      of summer and | winter | is coming.
## [s02e05, 2343] The Starks understand | winter | better than we
## [s02e05, 3044]      half of last | winter | beyond the Wall
## [s02e05, 3051]       . The whole | winter | . He was
## [s02e10, 4609]     them, through | winter | , summer,
## [s02e10, 4613]        , summer, | winter | again. Across
## [s03e01, 3276]      Wait out the | winter | where it's beautiful
## [s03e03, 5029]     from home and | winter | is coming.
## [s03e04,  196]     from home and | winter | is coming.
## [s03e04, 3336]          house." | Winter | is coming!
## [s03e04, 4092]      for a short | winter | . Boring and
## [s03e04, 4860]   make it through | winter | ? Enough!
## [s03e05, 1844]  might survive the | winter | . A million
## [s03e05, 5436]      Wait out the | winter | .- Winter
## [s03e05, 5439]         winter.- | Winter | could last five
## [s03e07, 5545]       be dead by | winter | . She'll be
## [s04e01, 3493]   your balls till | winter | ? We wait
## [s04e03, 2282]      be dead come | winter | .- You
## [s04e03, 2309]      be dead come | winter | . Dead men
## [s04e10,  474]    both know that | winter | is coming.
## [s05e01, 4301]  will survive the | winter | , not if
## [s05e01, 4507]      hero. Until | winter | comes and the
## [s05e03, 2723] prisoners indefinitely. | Winter | is coming.
## [s05e04,  651]      afford? With | winter | coming, half
## [s05e04, 2664]      a crown of | winter | roses in Lyanna's
## [s05e04, 2796] Landing before the | winter | snows block his
## [s05e05,  656]        Jon Snow. | Winter | is almost upon
## [s05e05, 1496]         you. But | winter | is coming.
## [s05e05, 3382]     could turn to | winter | at any moment
## [s05e07, 1305]              --- | Winter | is coming.
## [s05e07, 1329]        Black, we | winter | at Castle Black
## [s05e07, 1342]   many years this | winter | will last?
## [s06e05,  912]     the winds of | winter | as they lick
## [s06e06, 2533]     . Don't fear | winter | . Fear me
## [s06e10, 2675]      white raven. | Winter | is here.
## [s06e10, 4427]         is over. | Winter | has come.
```

We find that these Lannister folks sound like they are the paying-back sort and this winter business had a wild ride before it finally arrived. However, our findings indicate many saw this coming. Moving on, let’s look at tokens. We’ll get words, including n-grams up to three, and remove punctuation:

```r
got.tokens <- tokens(got.corpus, what = "word", ngrams = 1:3, remove_punct = TRUE)
head(got.tokens[[7]], 15)
##  [1] "Summoned"  "to"        "court"     "to"        "answer"
##  [6] "for"       "the"       "crimes"    "of"        "your"
## [11] "bannerman" "Gregor"    "Clegane"   "the"       "Mountain"
```

See, we didn’t have to worry about the backslash after all.

Tokens are good, however for the nitty-gritty, we want to convert our corpus into a document-feature matrix using the dfm function. After that, we can populate the top n features by episode:

```r
got.dfm <- dfm(got.corpus, remove = stopwords("SMART"), remove_punct = TRUE)
top.words <- topfeatures(got.dfm, n = 5, groups = docnames(got.dfm))

#S06E05
top.words[55]
## $s06e05
## hodor  door  hold   men  bran
##    42    33    31    21    20
```

Sad times. One quick note: we removed stopwords using the SMART dictionary that comes with quanteda. We could also use stopwords("english") and several other languages. SMART differs from English somewhat; both are arbitrary by design. You can call stopwords("dictionary_name") to see which words will be ignored. Sometimes you might want to tweak the dictionary if it happens to include words you would rather keep.

Let’s repeat the previous chunk, but this time group by season. Recycle the season variable and re-do the corpus:

```r
#Include the season variable we constructed earlier
got$season <- s[1:60]
got.group.corpus <- corpus(got)
got.group.dfm <- dfm(got.group.corpus, ngrams = 1:3, groups = "season",
                     remove = stopwords("SMART"), remove_punct = TRUE)
```

One convenient feature of having a grouped corpus is that we can analyse temporal trends. Say, you are known by many names and/or happen to be fond of titles:

```r
dany <- c("daenerys", "stormborn", "khaleesi", "the_unburnt",
          "mhysa", "mother_of_dragons", "breaker_of_chains")
titles <- got.group.dfm[, colnames(got.group.dfm) %in% dany]
titles <- as.data.frame(titles)

#Divide all cells with their row sums and round them up
round(titles / rowSums(titles), 2)
##   daenerys stormborn khaleesi mhysa mother_of_dragons the_unburnt
## 1     0.20      0.02     0.78  0.00              0.00        0.00
## 2     0.13      0.08     0.54  0.00              0.25        0.00
## 3     0.11      0.11     0.35  0.35              0.05        0.00
## 4     0.19      0.04     0.30  0.44              0.04        0.00
## 5     0.46      0.04     0.04  0.31              0.15        0.00
## 6     0.41      0.16     0.09  0.06              0.16        0.03
##   breaker_of_chains
## 1              0.00
## 2              0.00
## 3              0.03
## 4              0.00
## 5              0.00
## 6              0.09
```

Khaleesi dominates the first season (~80%), and it is her most one-sided title usage of any season. In S2, she gets the moniker of ‘mother of dragons’ in addition to khaleesi (25% and 54%, respectively). Seasons 3 and 4 are the most balanced, when she was known as khaleesi and mhysa somewhat equally (~35% both). In the last two seasons (in our dataset, at least), she is most commonly (>40%) called/mentioned by her actual name. This particular exercise would have definitely benefited from S7 scripts. You can refer to the titles object to see the raw counts rather than the row-wise percentages.

Yet another thing we can calculate is term similarity and distance. Using textstat_simil, we can get the top n words associated with a given term:

```r
sim <- textstat_simil(got.dfm, diag = TRUE, c("throne", "realm", "walkers"),
                      method = "cosine", margin = "features")
lapply(as.list(sim), head)
## $throne
##      iron      lord       men    father  kingdoms      fire
## 0.7957565 0.7707637 0.7573764 0.7493148 0.7343086 0.7336560
##
## $realm
## protector  kingdoms    robert     honor   hundred shadowcats
## 0.7237571 0.6604497 0.6558736 0.6417062 0.6396021 0.6192188
##
## $walkers
##     white  deserter    detail   corners guardsman      pups
## 0.8206750 0.7774816 0.7774816 0.7774816 0.7774816 0.7774816
```

Shadowcats? White Walker pups? Finally, one last thing before we move on to the visualisations. We will model topic similarities and call it a package. We’ll need topicmodels, and might as well write another for-loop (double-trouble). The below code is not evaluated here, but if you run it, you’ll find that GoT consistently revolves around lords, kings, the realm, men, and fathers, with the occasional khaleesi thrown in.

```r
library(topicmodels)

x <- 1  #episode counter; advances ten episodes per season
for (i in 1:6) {
  got.LDA <- LDA(convert(got.dfm[x:(x + 9), ], to = "topicmodels"),
                 k = 3, method = "Gibbs")
  topics <- get_terms(got.LDA, 4)
  print(paste0("Season", i))
  print(topics)
  x <- x + 10
}
```

Joy Plots

Numbers and Greek letters are cool, but you’ll find that a well-made graph can convey a lot at a glance. quanteda readily offers several statistics that lend themselves well to joy plots. When you call summary on a corpus, it reports descriptives on types, tokens, and sentences. These are all counts; the difference between a type and a token is that the former counts distinct tokens: (a, b, c, c) is four tokens but three types. Let’s recycle our corpus as a data frame and clean it up. After that, we’ll get rid of the redundant first column, rename the contents of the season variable, and make sure it’s a factor. Then we’ll calculate the average length of a sentence by dividing the token count by the sentence count.
Finally, we shall gather the spread-out variables of type, tokens, and sentences into a single ‘term’ and store their counts under ‘frequency’. Usually (i.e. when working with uncurated data) one does the transformation the other way around: you spread a single variable into many to tidy it up, so it’s good to utilise this lesser-used form from time to time. Also, we are doing all of this just to be able to use the facet_grid argument: you could manually plot four separate graphs and display them together, but that’s not how we roll around here.

```r
#Setup; first two lines are redundant if you populated them before
got$season <- s[1:60]
got.group.corpus <- corpus(got)

got.stats <- as.data.frame(summary(got.group.corpus), row.names = 1:60)
got.stats <- got.stats[, 2:5]
got.stats$season <- paste0("Season ", got.stats$season)
got.stats$season <- as.factor(got.stats$season)
got.stats$`Average Sentence Length` <- got.stats$Tokens / got.stats$Sentences
got.stats <- gather(got.stats, term, frequency, -season)
means <- got.stats %>%
  group_by(season, term) %>%
  summarise(mean = floor(mean(frequency)))

#Plot
library(ggplot2)
library(ggridges)
library(viridis)
#Refer to previous post for installing the below two
library(silgelib)
theme_set(theme_roboto())

#counts by season data
ggplot(got.stats, aes(x = frequency, y = season)) +
  #add facets for type, tokens, sentences, and average
  facet_grid(~term, scales = "free") +
  #add densities
  geom_density_ridges(aes(fill = season)) +
  #assign colour palette; reversed legend if you decide to include one
  scale_fill_viridis(discrete = TRUE, option = "D", direction = -1,
                     guide = guide_legend(reverse = TRUE)) +
  #add season means at the bottom
  geom_rug(data = means, aes(x = mean, group = season), alpha = .5, sides = "b") +
  labs(title = "Game of Thrones (Show) Corpus Summary",
       subtitle = "Episode Statistics Grouped by Season
Token: Word Count | Type: Unique Word Count | Sentence: Sentence Count | Sentence Length: Token / Sentence",
       x = "Frequency", y = NULL) +
  #hide the colour palette legend and the grid lines
  theme(legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
```

Larger PDF version here.

Some remarks. Each ridge represents a season and contains counts from ten episodes. These are distributions, so sharp peaks indicate clustering, and multiple peaks or gradual changes signal diffusion. For example, in the first column (sentence length), we see that S1 has three peaks: some episodes cluster around 9, some at 10.5, and others at slightly less than 12. In contrast, the S5 average sentence length is very specific: nearly all episodes have a mean of 9 tokens per sentence. Moving on to the second column, we find that the number of sentences per episode rises from S1 to S3 and then gradually goes back down to S1 levels by the end of S6. Token and type counts follow similar trends. In other words, if we flip the coordinates, we would see a single peak between S3 and S4: increasing counts of individual terms as you approach the peak from both directions (i.e. from S1 to S3 and from S6 to S4), but also shorter average sentence lengths. We should be cautious about making strong inferences, however; we don’t really have the means to account for the quality of writing. Longer sentences do not necessarily imply an increase in complexity, even coupled with higher numbers of types (unique words).

WesteRos

In case you have seen cool ggridges plots before, or generally are a not-so-easily-impressed (that counts as one token, by the way) type, let’s map Westeros in R. If you are also wondering why there is a shapefile for Westeros in the first place, that makes two of us. But don’t let these kinds of things stop you from doing data science. The zip file contains several shapefiles; I will only read in ‘political’ and ‘locations’. You will need these files (all of them sharing the same name, not just the .shp file) in your working directory so that you can call it with ".".
The spatial data come as factors, and I made some arbitrary modifications to them (mostly for aesthetics). First, in the original file the Night’s Watch controls two regions: New Gift and Bran’s Gift. I removed one and renamed the other “The Wall”. Spatial data frames are S4 objects, so you need to call @data$ instead of the regular $. Second, let’s identify the capitals of the regions and set a custom .png icon so that we can differentiate them on the map. At this point, I realised the shapefile does not have an entry for Casterly Rock; maybe they haven’t paid back the creator yet? We’ll have to do without it for now. Third, let’s manually add in some of the cool places by placing them in a vector called ‘interesting’. Conversely, we shall get rid of some so that they do not overlap with region names (‘intheway’). I’m using the %nin% operator (not in) that comes with Hmisc, but there are other ways of doing it. Finally, using RColorBrewer I assigned a bunch of reds and blues; viridis looked a bit odd next to the colour of the sea.
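If you would rather skip the Hmisc dependency, the %nin% operator is easy to reproduce yourself. A minimal base-R sketch (the place names below are just illustrative values):

```r
# Base-R stand-in for Hmisc's %nin% ("not in") operator
`%nin%` <- Negate(`%in%`)

c("Sarsfield", "Winterfell") %nin% c("Sarsfield", "Hornvale")
# [1] FALSE  TRUE
```

Negate() wraps %in% and flips its logical result, which is all %nin% does.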
```r
library(Hmisc)
library(rgdal)
library(tmap)
library(RColorBrewer)

#Read in two shapefiles
westeros <- readOGR(".", "political")
locations <- readOGR(".", "locations")

#Cleaning factor levels
westeros@data$name <- `levels<-`(addNA(westeros@data$name),
                                 c(levels(westeros@data$name),
                                   "The Lands of Always Winter"))
levels(westeros@data$name)[1] <- "The Wall"
levels(westeros@data$name)[4] <- ""
levels(westeros@data$ClaimedBy)[11] <- "White Walkers"

#Identify capitals
places <- as.character(locations@data$name)
places <- gsub(" ", "_", places)
capitals <- c("Winterfell", "The Eyrie", "Harrenhal", "Sunspear",
              "King's Landing", "Castle Black", "Pyke", "Casterly Rock",
              "Storm's End", "Highgarden")
holds <- locations[locations@data$name %in% capitals, ]

#Castle icon
castle <- tmap_icons(file = "https://image.ibb.co/kykHfR/castle.png",
                     keep.asp = TRUE)

#Locations we rather keep
interesting <- c("Fist of the First Men", "King's Landing",
                 "Craster's Keep", "Tower of Joy")
#Locations we rather get rid of
intheway <- c("Sarsfield", "Hornvale", "Cider Hall",
              "Hayford Castle", "Griffin's Roost")

#Subsetting
locations <- locations[locations@data$type == "Castle" |
                       locations@data$name %in% interesting, ]
locations <- locations[locations@data$name %nin% intheway, ]

#Colour palettes - the hard way
blues <- brewer.pal(6, "Blues")
reds <- brewer.pal(7, "Reds")
#vale, stormlands, iron islands, westerlands, dorne,
#wall, winterfell, crownsland, riverlands, reach, beyond the wall
sorted <- c(blues[3], reds[4], blues[4], reds[2], reds[6],
            blues[6], blues[5], reds[3], reds[1], reds[5], blues[1])

#Map
tm_shape(westeros) +
  #Colour regions using the sorted palette and plot their names
  tm_fill("ClaimedBy", palette = sorted) +
  tm_text("name", fontfamily = "Game of Thrones", size = .4, alpha = .6) +
  #Plot location names and put a dot above them
  tm_shape(locations) +
  tm_text("name", size = .2, fontfamily = "Roboto Condensed", just = "top") +
  tm_dots("name", size = .01, shape = 20, col = "black", ymod = .1) +
  #Plot capitals and add custom shape
  tm_shape(holds) +
  tm_dots("name", size = .05, alpha = .5, shape = castle,
          border.lwd = NA, ymod = .2) +
  #Fluff
  tm_compass(type = "8star", position = c("right", "top"), size = 1.5) +
  tm_layout(bg.color = "lightblue", main.title = "Westeros",
            frame.lwd = 2, fontfamily = "Game of Thrones") +
  tm_legend(show = FALSE)
```

Woo! Okay, let’s go over what happened before wrapping this up. tmap operates with a grammar similar to ggplot’s, so it should be relatively easy to follow. We are calling three shapefiles here: ‘westeros’ for the regions, ‘locations’ for the castles and the manually added/subtracted places, and ‘holds’ for the capitals (which is really just a subset of locations). The tm_* parameters (fill, text, dots) under these shapes handle the actual plotting. For example, under westeros, we fill the regions by ‘ClaimedBy’, which would normally return the names of the Houses. However, that’s only the fill argument; the text parameter in the next line calls ‘name’, which is the name of the regions (and what gets plotted). You can download GoT fonts for added ambiance. We pass our custom castle shape by calling shape = castle and remove the square borders around the .png with border.lwd = NA. Finally, the ymod argument helps us overcome overlapping labels by slightly moving them up along the y-axis. Feel free to fork the code for this post on GitHub and mess around! Idea: calculate term frequencies of location names using quanteda first and then pass them to tm_bubble with the argument size = frequency so that it gives you a visual representation of their relative importance in the show.


To leave a comment for the author, please follow the link and comment on their blog: Computational Social Science.

### Time Series Forecasting with Recurrent Neural Networks

Wed, 12/20/2017 - 01:00

(This article was first published on TensorFlow for R, and kindly contributed to R-bloggers)

Deep Learning with R

This post is an excerpt from Chapter 5 of François Chollet’s and J.J. Allaire’s book, Deep Learning with R (Manning Publications). Use the code fccallaire for a 42% discount on the book at manning.com.

Time Series Forecasting with Recurrent Neural Networks

In this section, we’ll review three advanced techniques for improving the performance and generalization power of recurrent neural networks. By the end of the section, you’ll know most of what there is to know about using recurrent networks with Keras. We’ll demonstrate all three concepts on a temperature-forecasting problem, where you have access to a time series of data points coming from sensors installed on the roof of a building, such as temperature, air pressure, and humidity, which you use to predict what the temperature will be 24 hours after the last data point. This is a fairly challenging problem that exemplifies many common difficulties encountered when working with time series.

We’ll cover the following techniques:

• Recurrent dropout — This is a specific, built-in way to use dropout to fight overfitting in recurrent layers.
• Stacking recurrent layers — This increases the representational power of the network (at the cost of higher computational loads).
• Bidirectional recurrent layers — These present the same information to a recurrent network in different ways, increasing accuracy and mitigating forgetting issues.
A temperature-forecasting problem

Until now, the only sequence data we’ve covered has been text data, such as the IMDB dataset and the Reuters dataset. But sequence data is found in many more problems than just language processing. In all the examples in this section, you’ll play with a weather timeseries dataset recorded at the Weather Station at the Max Planck Institute for Biogeochemistry in Jena, Germany.

In this dataset, 14 different quantities (such as air temperature, atmospheric pressure, humidity, wind direction, and so on) were recorded every 10 minutes, over several years. The original data goes back to 2003, but this example is limited to data from 2009–2016. This dataset is perfect for learning to work with numerical time series. You’ll use it to build a model that takes as input some data from the recent past (a few days’ worth of data points) and predicts the air temperature 24 hours in the future.

Let’s look at the data.

```r
library(tibble)
library(readr)

data_dir <- "~/Downloads/jena_climate"
fname <- file.path(data_dir, "jena_climate_2009_2016.csv")
data <- read_csv(fname)
glimpse(data)
```

```
Observations: 420,551
Variables: 15
$ `Date Time`       "01.01.2009 00:10:00", "01.01.2009 00:20:00", "...
$ `p (mbar)`        996.52, 996.57, 996.53, 996.51, 996.51, 996.50,...
$ `T (degC)`        -8.02, -8.41, -8.51, -8.31, -8.27, -8.05, -7.62...
$ `Tpot (K)`        265.40, 265.01, 264.91, 265.12, 265.15, 265.38,...
$ `Tdew (degC)`     -8.90, -9.28, -9.31, -9.07, -9.04, -8.78, -8.30...
$ `rh (%)`          93.3, 93.4, 93.9, 94.2, 94.1, 94.4, 94.8, 94.4,...
$ `VPmax (mbar)`    3.33, 3.23, 3.21, 3.26, 3.27, 3.33, 3.44, 3.44,...
$ `VPact (mbar)`    3.11, 3.02, 3.01, 3.07, 3.08, 3.14, 3.26, 3.25,...
$ `VPdef (mbar)`    0.22, 0.21, 0.20, 0.19, 0.19, 0.19, 0.18, 0.19,...
$ `sh (g/kg)`       1.94, 1.89, 1.88, 1.92, 1.92, 1.96, 2.04, 2.03,...
$ `H2OC (mmol/mol)` 3.12, 3.03, 3.02, 3.08, 3.09, 3.15, 3.27, 3.26,...
$ `rho (g/m**3)`    1307.75, 1309.80, 1310.24, 1309.19, 1309.00, 13...
$ `wv (m/s)`        1.03, 0.72, 0.19, 0.34, 0.32, 0.21, 0.18, 0.19,...
$ `max. wv (m/s)`   1.75, 1.50, 0.63, 0.50, 0.63, 0.63, 0.63, 0.50,...
$ `wd (deg)`        152.3, 136.1, 171.6, 198.0, 214.3, 192.7, 166.5...
```

Here is the plot of temperature (in degrees Celsius) over time. On this plot, you can clearly see the yearly periodicity of temperature.

```r
library(ggplot2)
ggplot(data, aes(x = 1:nrow(data), y = `T (degC)`)) + geom_line()
```

Here is a narrower plot of the first 10 days of temperature data (see figure 6.15). Because the data is recorded every 10 minutes, you get 144 data points per day.

```r
ggplot(data[1:1440,], aes(x = 1:1440, y = `T (degC)`)) + geom_line()
```

On this plot, you can see daily periodicity, especially evident for the last 4 days. Also note that this 10-day period must be coming from a fairly cold winter month. If you were trying to predict average temperature for the next month given a few months of past data, the problem would be easy, due to the reliable year-scale periodicity of the data.
But looking at the data over a scale of days, the temperature looks a lot more chaotic. Is this time series predictable at a daily scale? Let’s find out.

Preparing the data

The exact formulation of the problem will be as follows: given data going as far back as lookback timesteps (a timestep is 10 minutes) and sampled every steps timesteps, can you predict the temperature in delay timesteps? You’ll use the following parameter values:

• lookback = 720 — Observations will go back 5 days.
• steps = 6 — Observations will be sampled at one data point per hour.
• delay = 144 — Targets will be 24 hours in the future.

To get started, you need to do two things:

• Preprocess the data to a format a neural network can ingest. This is easy: the data is already numerical, so you don’t need to do any vectorization. But each time series in the data is on a different scale (for example, temperature is typically between -20 and +30, but atmospheric pressure, measured in mbar, is around 1,000). You’ll normalize each time series independently so that they all take small values on a similar scale.
• Write a generator function that takes the current array of float data and yields batches of data from the recent past, along with a target temperature in the future. Because the samples in the dataset are highly redundant (sample N and sample N + 1 will have most of their timesteps in common), it would be wasteful to explicitly allocate every sample. Instead, you’ll generate the samples on the fly using the original data.

NOTE: Understanding generator functions

A generator function is a special type of function that you call repeatedly to obtain a sequence of values. Often generators need to maintain internal state, so they are typically constructed by calling yet another function which returns the generator function (the environment of the function which returns the generator is then used to track state).
For example, the sequence_generator() function below returns a generator function that yields an infinite sequence of numbers:

```r
sequence_generator <- function(start) {
  value <- start - 1
  function() {
    value <<- value + 1
    value
  }
}

gen <- sequence_generator(10)
gen()
# [1] 10
gen()
# [1] 11
```

The current state of the generator is the value variable that is defined outside of the function. Note that superassignment (<<-) is used to update this state from within the function.

Generator functions can signal completion by returning the value NULL. However, generator functions passed to Keras training methods (e.g. fit_generator()) should always return values infinitely (the number of calls to the generator function is controlled by the epochs and steps_per_epoch parameters).

First, you’ll convert the R data frame which we read earlier into a matrix of floating point values (we’ll discard the first column which included a text timestamp):

```r
data <- data.matrix(data[,-1])
```

You’ll then preprocess the data by subtracting the mean of each time series and dividing by the standard deviation. You’re going to use the first 200,000 timesteps as training data, so compute the mean and standard deviation for normalization only on this fraction of the data.

```r
train_data <- data[1:200000,]
mean <- apply(train_data, 2, mean)
std <- apply(train_data, 2, sd)
data <- scale(data, center = mean, scale = std)
```

The code for the data generator you’ll use is below. It yields a list (samples, targets), where samples is one batch of input data and targets is the corresponding array of target temperatures. It takes the following arguments:

• data — The original array of floating-point data, which you normalized in listing 6.32.
• lookback — How many timesteps back the input data should go.
• delay — How many timesteps in the future the target should be.
• min_index and max_index — Indices in the data array that delimit which timesteps to draw from.
This is useful for keeping a segment of the data for validation and another for testing.

• shuffle — Whether to shuffle the samples or draw them in chronological order.
• batch_size — The number of samples per batch.
• step — The period, in timesteps, at which you sample data. You’ll set it to 6 in order to draw one data point every hour.

```r
generator <- function(data, lookback, delay, min_index, max_index,
                      shuffle = FALSE, batch_size = 128, step = 6) {
  if (is.null(max_index))
    max_index <- nrow(data) - delay - 1
  i <- min_index + lookback
  function() {
    if (shuffle) {
      rows <- sample(c((min_index+lookback):max_index), size = batch_size)
    } else {
      if (i + batch_size >= max_index)
        i <<- min_index + lookback
      rows <- c(i:min(i+batch_size, max_index))
      i <<- i + length(rows)
    }
    samples <- array(0, dim = c(length(rows),
                                lookback / step,
                                dim(data)[[-1]]))
    targets <- array(0, dim = c(length(rows)))
    for (j in 1:length(rows)) {
      indices <- seq(rows[[j]] - lookback, rows[[j]],
                     length.out = dim(samples)[[2]])
      samples[j,,] <- data[indices,]
      targets[[j]] <- data[rows[[j]] + delay, 2]
    }
    list(samples, targets)
  }
}
```

The i variable contains the state that tracks the next window of data to return, so it is updated using superassignment (e.g. i <<- i + length(rows)).

Now, let’s use the abstract generator function to instantiate three generators: one for training, one for validation, and one for testing. Each will look at a different temporal segment of the original data: the training generator looks at the first 200,000 timesteps, the validation generator looks at the following 100,000, and the test generator looks at the remainder.
```r
lookback <- 1440
step <- 6
delay <- 144
batch_size <- 128

train_gen <- generator(
  data,
  lookback = lookback,
  delay = delay,
  min_index = 1,
  max_index = 200000,
  shuffle = TRUE,
  step = step,
  batch_size = batch_size
)

val_gen <- generator(
  data,
  lookback = lookback,
  delay = delay,
  min_index = 200001,
  max_index = 300000,
  step = step,
  batch_size = batch_size
)

test_gen <- generator(
  data,
  lookback = lookback,
  delay = delay,
  min_index = 300001,
  max_index = NULL,
  step = step,
  batch_size = batch_size
)

# How many steps to draw from val_gen in order to see the entire validation set
val_steps <- (300000 - 200001 - lookback) / batch_size

# How many steps to draw from test_gen in order to see the entire test set
test_steps <- (nrow(data) - 300001 - lookback) / batch_size
```

A common-sense, non-machine-learning baseline

Before you start using black-box deep-learning models to solve the temperature-prediction problem, let’s try a simple, common-sense approach. It will serve as a sanity check, and it will establish a baseline that you’ll have to beat in order to demonstrate the usefulness of more-advanced machine-learning models. Such common-sense baselines can be useful when you’re approaching a new problem for which there is no known solution (yet). A classic example is that of unbalanced classification tasks, where some classes are much more common than others. If your dataset contains 90% instances of class A and 10% instances of class B, then a common-sense approach to the classification task is to always predict “A” when presented with a new sample. Such a classifier is 90% accurate overall, and any learning-based approach should therefore beat this 90% score in order to demonstrate usefulness. Sometimes, such elementary baselines can prove surprisingly hard to beat.

In this case, the temperature time series can safely be assumed to be continuous (the temperatures tomorrow are likely to be close to the temperatures today) as well as periodical with a daily period.
Thus a common-sense approach is to always predict that the temperature 24 hours from now will be equal to the temperature right now. Let’s evaluate this approach, using the mean absolute error (MAE) metric:

```r
mean(abs(preds - targets))
```

Here’s the evaluation loop.

```r
evaluate_naive_method <- function() {
  batch_maes <- c()
  for (step in 1:val_steps) {
    c(samples, targets) %<-% val_gen()
    preds <- samples[, dim(samples)[[2]], 2]
    mae <- mean(abs(preds - targets))
    batch_maes <- c(batch_maes, mae)
  }
  print(mean(batch_maes))
}

evaluate_naive_method()
```

This yields an MAE of 0.29. Because the temperature data has been normalized to be centered on 0 and have a standard deviation of 1, this number isn’t immediately interpretable. It translates to an average absolute error of 0.29 x temperature_std degrees Celsius: 2.57˚C.

```r
celsius_mae <- 0.29 * std[[2]]
```

That’s a fairly large average absolute error. Now the game is to use your knowledge of deep learning to do better.

A basic machine-learning approach

In the same way that it’s useful to establish a common-sense baseline before trying machine-learning approaches, it’s useful to try simple, cheap machine-learning models (such as small, densely connected networks) before looking into complicated and computationally expensive models such as RNNs. This is the best way to make sure any further complexity you throw at the problem is legitimate and delivers real benefits.

The following listing shows a fully connected model that starts by flattening the data and then runs it through two dense layers. Note the lack of activation function on the last dense layer, which is typical for a regression problem. You use MAE as the loss. Because you evaluate on the exact same data and with the exact same metric you did with the common-sense approach, the results will be directly comparable.
```r
library(keras)

model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(lookback / step, dim(data)[-1])) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)

history <- model %>% fit_generator(
  train_gen,
  steps_per_epoch = 500,
  epochs = 20,
  validation_data = val_gen,
  validation_steps = val_steps
)
```

Let’s display the loss curves for validation and training. Some of the validation losses are close to the no-learning baseline, but not reliably. This goes to show the merit of having this baseline in the first place: it turns out to be not easy to outperform. Your common sense contains a lot of valuable information that a machine-learning model doesn’t have access to.

You may wonder, if a simple, well-performing model exists to go from the data to the targets (the common-sense baseline), why doesn’t the model you’re training find it and improve on it? Because this simple solution isn’t what your training setup is looking for. The space of models in which you’re searching for a solution – that is, your hypothesis space – is the space of all possible two-layer networks with the configuration you defined. These networks are already fairly complicated. When you’re looking for a solution within a space of complicated models, the simple, well-performing baseline may be unlearnable, even if it’s technically part of the hypothesis space. That is a pretty significant limitation of machine learning in general: unless the learning algorithm is hardcoded to look for a specific kind of simple model, parameter learning can sometimes fail to find a simple solution to a simple problem.

A first recurrent baseline

The first fully connected approach didn’t do well, but that doesn’t mean machine learning isn’t applicable to this problem. The previous approach first flattened the time series, which removed the notion of time from the input data.
Let’s instead look at the data as what it is: a sequence, where causality and order matter. You’ll try a recurrent-sequence-processing model – it should be the perfect fit for such sequence data, precisely because it exploits the temporal ordering of data points, unlike the first approach.

Instead of the LSTM layer introduced in the previous section, you’ll use the GRU layer, developed by Chung et al. in 2014. Gated recurrent unit (GRU) layers work using the same principle as LSTM, but they’re somewhat streamlined and thus cheaper to run (although they may not have as much representational power as LSTM). This trade-off between computational expensiveness and representational power is seen everywhere in machine learning.

```r
model <- keras_model_sequential() %>%
  layer_gru(units = 32, input_shape = list(NULL, dim(data)[[-1]])) %>%
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)

history <- model %>% fit_generator(
  train_gen,
  steps_per_epoch = 500,
  epochs = 20,
  validation_data = val_gen,
  validation_steps = val_steps
)
```

The results are plotted below. Much better! You can significantly beat the common-sense baseline, demonstrating the value of machine learning as well as the superiority of recurrent networks compared to sequence-flattening dense networks on this type of task.

The new validation MAE of ~0.265 (before you start significantly overfitting) translates to a mean absolute error of 2.35˚C after denormalization. That’s a solid gain on the initial error of 2.57˚C, but you probably still have a bit of a margin for improvement.

Using recurrent dropout to fight overfitting

It’s evident from the training and validation curves that the model is overfitting: the training and validation losses start to diverge considerably after a few epochs.
You’re already familiar with a classic technique for fighting this phenomenon: dropout, which randomly zeros out input units of a layer in order to break happenstance correlations in the training data that the layer is exposed to. But how to correctly apply dropout in recurrent networks isn’t a trivial question. It has long been known that applying dropout before a recurrent layer hinders learning rather than helping with regularization. In 2015, Yarin Gal, as part of his PhD thesis on Bayesian deep learning, determined the proper way to use dropout with a recurrent network: the same dropout mask (the same pattern of dropped units) should be applied at every timestep, instead of a dropout mask that varies randomly from timestep to timestep. What’s more, in order to regularize the representations formed by the recurrent gates of layers such as layer_gru and layer_lstm, a temporally constant dropout mask should be applied to the inner recurrent activations of the layer (a recurrent dropout mask). Using the same dropout mask at every timestep allows the network to properly propagate its learning error through time; a temporally random dropout mask would disrupt this error signal and be harmful to the learning process. Yarin Gal did his research using Keras and helped build this mechanism directly into Keras recurrent layers. Every recurrent layer in Keras has two dropout-related arguments: dropout, a float specifying the dropout rate for input units of the layer, and recurrent_dropout, specifying the dropout rate of the recurrent units. Let’s add dropout and recurrent dropout to the layer_gru and see how doing so impacts overfitting. Because networks being regularized with dropout always take longer to fully converge, you’ll train the network for twice as many epochs. 
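To make the “temporally constant mask” idea concrete, here is a small base-R illustration (not Keras internals, and it skips the 1/(1 - rate) rescaling that real dropout applies): a recurrent dropout mask is drawn once and reused at every timestep, so the same units are zeroed throughout the sequence, whereas naive dropout would redraw the mask per timestep.

```r
set.seed(42)
timesteps <- 4; units <- 6; rate <- 0.5

# One mask drawn once and reused across all timesteps (recurrent-dropout style)
constant_mask <- rbinom(units, 1, 1 - rate)
x <- matrix(1, nrow = timesteps, ncol = units)       # dummy activations
dropped_constant <- sweep(x, 2, constant_mask, `*`)  # same units zeroed each step

# Naive dropout: an independent mask per timestep, which would disrupt the
# error signal propagated through time
dropped_naive <- x * matrix(rbinom(timesteps * units, 1, 1 - rate),
                            nrow = timesteps)

# With the constant mask, every timestep (row) drops exactly the same columns
all(apply(dropped_constant, 2, function(col) length(unique(col)) == 1))
# [1] TRUE
```

The final check confirms that each unit is either kept or dropped consistently across all timesteps, which is the property Gal’s analysis requires.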
```r
model <- keras_model_sequential() %>%
  layer_gru(units = 32, dropout = 0.2, recurrent_dropout = 0.2,
            input_shape = list(NULL, dim(data)[[-1]])) %>%
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)

history <- model %>% fit_generator(
  train_gen,
  steps_per_epoch = 500,
  epochs = 40,
  validation_data = val_gen,
  validation_steps = val_steps
)
```

The plot below shows the results. Success! You’re no longer overfitting during the first 20 epochs. But although you have more stable evaluation scores, your best scores aren’t much lower than they were previously.

Stacking recurrent layers

Because you’re no longer overfitting but seem to have hit a performance bottleneck, you should consider increasing the capacity of the network. Recall the description of the universal machine-learning workflow: it’s generally a good idea to increase the capacity of your network until overfitting becomes the primary obstacle (assuming you’re already taking basic steps to mitigate overfitting, such as using dropout). As long as you aren’t overfitting too badly, you’re likely under capacity.

Increasing network capacity is typically done by increasing the number of units in the layers or adding more layers. Recurrent layer stacking is a classic way to build more-powerful recurrent networks: for instance, what currently powers the Google Translate algorithm is a stack of seven large LSTM layers – that’s huge.

To stack recurrent layers on top of each other in Keras, all intermediate layers should return their full sequence of outputs (a 3D tensor) rather than their output at the last timestep. This is done by specifying return_sequences = TRUE.
```r
model <- keras_model_sequential() %>%
  layer_gru(units = 32,
            dropout = 0.1,
            recurrent_dropout = 0.5,
            return_sequences = TRUE,
            input_shape = list(NULL, dim(data)[[-1]])) %>%
  layer_gru(units = 64, activation = "relu",
            dropout = 0.1,
            recurrent_dropout = 0.5) %>%
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)

history <- model %>% fit_generator(
  train_gen,
  steps_per_epoch = 500,
  epochs = 40,
  validation_data = val_gen,
  validation_steps = val_steps
)
```

The figure below shows the results. You can see that the added layer does improve the results a bit, though not significantly. You can draw two conclusions:

• Because you’re still not overfitting too badly, you could safely increase the size of your layers in a quest for validation-loss improvement. This has a non-negligible computational cost, though.
• Adding a layer didn’t help by a significant factor, so you may be seeing diminishing returns from increasing network capacity at this point.

Using bidirectional RNNs

The last technique introduced in this section is called bidirectional RNNs. A bidirectional RNN is a common RNN variant that can offer greater performance than a regular RNN on certain tasks. It’s frequently used in natural-language processing – you could call it the Swiss Army knife of deep learning for natural-language processing.

RNNs are notably order dependent, or time dependent: they process the timesteps of their input sequences in order, and shuffling or reversing the timesteps can completely change the representations the RNN extracts from the sequence. This is precisely the reason they perform well on problems where order is meaningful, such as the temperature-forecasting problem.
A bidirectional RNN exploits the order sensitivity of RNNs: it consists of two regular RNNs, such as the layer_gru and layer_lstm you’re already familiar with, each of which processes the input sequence in one direction (chronologically and antichronologically), and then merges their representations. By processing a sequence both ways, a bidirectional RNN can catch patterns that may be overlooked by a unidirectional RNN.

Remarkably, the fact that the RNN layers in this section have processed sequences in chronological order (older timesteps first) may have been an arbitrary decision. At least, it’s a decision we made no attempt to question so far. Could the RNNs have performed well enough if they processed input sequences in antichronological order, for instance (newer timesteps first)? Let’s try this in practice and see what happens. All you need to do is write a variant of the data generator where the input sequences are reversed along the time dimension (replace the last line with list(samples[,ncol(samples):1,], targets)). Training the same one-GRU-layer network that you used in the first experiment in this section, you get the results shown below.

The reversed-order GRU underperforms even the common-sense baseline, indicating that in this case, chronological processing is important to the success of your approach. This makes perfect sense: the underlying GRU layer will typically be better at remembering the recent past than the distant past, and naturally the more recent weather data points are more predictive than older data points for the problem (that’s what makes the common-sense baseline fairly strong). Thus the chronological version of the layer is bound to outperform the reversed-order version. Importantly, this isn’t true for many other problems, including natural language: intuitively, the importance of a word in understanding a sentence isn’t usually dependent on its position in the sentence.
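The reversal trick above can be sanity-checked on a toy array. A minimal sketch with assumed shapes (batch, timesteps, features), mirroring the list(samples[,ncol(samples):1,], targets) line from the generator variant:

```r
# Toy batch: (batch = 2, timesteps = 3, features = 4)
samples <- array(seq_len(24), dim = c(2, 3, 4))

# Reverse along the time dimension only; for an array, ncol() returns dim[2]
reversed <- samples[, ncol(samples):1, ]

# The first timestep of the reversed batch is the last timestep of the original
identical(reversed[, 1, ], samples[, 3, ])
# [1] TRUE
```

Only the time axis is flipped; the batch and feature axes (and the targets) are left untouched.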
Let’s try the same trick on the LSTM IMDB example from section 6.2.

```r
library(keras)

# Number of words to consider as features
max_features <- 10000
# Cuts off texts after this number of words
maxlen <- 500

imdb <- dataset_imdb(num_words = max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb

# Reverses sequences
x_train <- lapply(x_train, rev)
x_test <- lapply(x_test, rev)

# Pads sequences
x_train <- pad_sequences(x_train, maxlen = maxlen)
x_test <- pad_sequences(x_test, maxlen = maxlen)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_features, output_dim = 128) %>%
  layer_lstm(units = 32) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("acc")
)

history <- model %>% fit(
  x_train, y_train,
  epochs = 10,
  batch_size = 128,
  validation_split = 0.2
)
```

You get performance nearly identical to that of the chronological-order LSTM. Remarkably, on such a text dataset, reversed-order processing works just as well as chronological processing, confirming the hypothesis that, although word order does matter in understanding language, which order you use isn’t crucial. Importantly, an RNN trained on reversed sequences will learn different representations than one trained on the original sequences, much as you would have different mental models if time flowed backward in the real world – if you lived a life where you died on your first day and were born on your last day. In machine learning, representations that are different yet useful are always worth exploiting, and the more they differ, the better: they offer a new angle from which to look at your data, capturing aspects of the data that were missed by other approaches, and thus they can help boost performance on a task. This is the intuition behind ensembling, a concept we’ll explore in chapter 7. A bidirectional RNN exploits this idea to improve on the performance of chronological-order RNNs.
It looks at its input sequence both ways, obtaining potentially richer representations and capturing patterns that may have been missed by the chronological-order version alone.

To instantiate a bidirectional RNN in Keras, you use the bidirectional() function, which takes a recurrent layer instance as an argument. bidirectional() creates a second, separate instance of this recurrent layer and uses one instance to process the input sequences in chronological order and the other to process them in reversed order. Let’s try it on the IMDB sentiment-analysis task.

```r
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_features, output_dim = 32) %>%
  bidirectional(
    layer_lstm(units = 32)
  ) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("acc")
)

history <- model %>% fit(
  x_train, y_train,
  epochs = 10,
  batch_size = 128,
  validation_split = 0.2
)
```

It performs slightly better than the regular LSTM you tried in the previous section, achieving over 89% validation accuracy. It also seems to overfit more quickly, which is unsurprising because a bidirectional layer has twice as many parameters as a chronological LSTM. With some regularization, the bidirectional approach would likely be a strong performer on this task.

Now let’s try the same approach on the temperature-prediction task.

```r
model <- keras_model_sequential() %>%
  bidirectional(
    layer_gru(units = 32),
    input_shape = list(NULL, dim(data)[[-1]])
  ) %>%
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)

history <- model %>% fit_generator(
  train_gen,
  steps_per_epoch = 500,
  epochs = 40,
  validation_data = val_gen,
  validation_steps = val_steps
)
```

This performs about as well as the regular layer_gru.
It’s easy to understand why: all the predictive capacity must come from the chronological half of the network, because the antichronological half is known to severely underperform on this task (again, because the recent past matters much more than the distant past in this case).

Going even further

There are many other things you could try, in order to improve performance on the temperature-forecasting problem:

• Adjust the number of units in each recurrent layer in the stacked setup. The current choices are largely arbitrary and thus probably suboptimal.
• Adjust the learning rate used by the RMSprop optimizer.
• Try using layer_lstm instead of layer_gru.
• Try using a bigger densely connected regressor on top of the recurrent layers: that is, a bigger dense layer or even a stack of dense layers.
• Don’t forget to eventually run the best-performing models (in terms of validation MAE) on the test set! Otherwise, you’ll develop architectures that are overfitting to the validation set.

As always, deep learning is more an art than a science. We can provide guidelines that suggest what is likely to work or not work on a given problem, but, ultimately, every problem is unique; you’ll have to evaluate different strategies empirically. There is currently no theory that will tell you in advance precisely what you should do to optimally solve a problem. You must iterate.

Wrapping up

Here’s what you should take away from this section:

• As you first learned in chapter 4, when approaching a new problem, it’s good to first establish common-sense baselines for your metric of choice. If you don’t have a baseline to beat, you can’t tell whether you’re making real progress.
• Try simple models before expensive ones, to justify the additional expense. Sometimes a simple model will turn out to be your best option.
• When you have data where temporal ordering matters, recurrent networks are a great fit and easily outperform models that first flatten the temporal data.
• To use dropout with recurrent networks, you should use a time-constant dropout mask and a recurrent dropout mask. These are built into Keras recurrent layers, so all you have to do is use the dropout and recurrent_dropout arguments of recurrent layers.
• Stacked RNNs provide more representational power than a single RNN layer. They’re also much more expensive and thus not always worth it. Although they offer clear gains on complex problems (such as machine translation), they may not always be relevant to smaller, simpler problems.
• Bidirectional RNNs, which look at a sequence both ways, are useful on natural-language processing problems. But they aren’t strong performers on sequence data where the recent past is much more informative than the beginning of the sequence.

NOTE: Markets and machine learning

Some readers are bound to want to take the techniques we’ve introduced here and try them on the problem of forecasting the future price of securities on the stock market (or currency exchange rates, and so on). Markets have very different statistical characteristics than natural phenomena such as weather patterns. Trying to use machine learning to beat markets, when you only have access to publicly available data, is a difficult endeavor, and you’re likely to waste your time and resources with nothing to show for it. Always remember that when it comes to markets, past performance is not a good predictor of future returns – looking in the rear-view mirror is a bad way to drive. Machine learning, on the other hand, is applicable to datasets where the past is a good predictor of the future.

Deep Learning with R

This post is an excerpt from Chapter 5 of François Chollet’s and J.J. Allaire’s book, Deep Learning with R (Manning Publications). Use the code fccallaire for a 42% discount on the book at manning.com.
To leave a comment for the author, please follow the link and comment on their blog: TensorFlow for R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Using blogdown with an existing Hugo site

Wed, 12/20/2017 - 01:00

(This article was first published on R on Locke Data Blog, and kindly contributed to R-bloggers)

If you decide you want to use R in your existing Hugo blog, it’s really easy to convert over. There’s a single command you need to know from blogdown, and the rest is working out your deployment process. To create content, use the blogdown RStudio add-in to quickly get started. It niftily reads all tags and categories from past posts to help you get going. You can then write R Markdown as usual.

To leave a comment for the author, please follow the link and comment on their blog: R on Locke Data Blog.
### Registration now open for workshop on Deep Learning with Keras and TensorFlow using R

Wed, 12/20/2017 - 01:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

Recently, I announced my workshop on Deep Learning with Keras and TensorFlow. The next dates for it are January 18th and 19th in Solingen, Germany. You can register now by following this link: https://www.codecentric.de/schulung/deep-learning-mit-keras-und-tensorflow

If any non-German-speaking people want to attend, I’m happy to give the course in English! Contact me if you have further questions.

As a little bonus, I am also sharing my sketch notes from a podcast I listened to when first getting into Keras: Sketchnotes: Software Engineering Daily – Podcast from Jan 29th 2016. Links from the notes:

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

### ASA Police Data Challenge student visualization contest winners

Tue, 12/19/2017 - 18:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The winners of the American Statistical Association Police Data Challenge have been announced.
The ASA teamed up with the Police Data Initiative, which provides open data from local law enforcement agencies in the US, to create a competition for high school and college students to analyze crime data from Baltimore, Seattle and Cincinnati. In this post, we take a look at the winners of the visualization category.

The winners of the Best Visualization for college students were Julia Nguyen, Katherine Qian, Youbeen Shim and Catherine Sun from the University of Virginia. Their entry included several visualizations of crime data in Baltimore, including the crime density map shown below. The team used R for all of the data cleaning, manipulation, and visualizations. The tidyverse suite of packages was used for data pipelining (including stringr and lubridate for merging data on latitude/longitude and date), and the ggmap package for visualization.

The winners of the Best Visualization for high school students were Alex Lapuente, Ana Kenefick and Sara Kenefick from Charlotte Latin School (Charlotte, N.C.). They used Microsoft Excel to look at overall trends in Seattle crime data, the impact of employment and poverty on crime, and this visualization of the frequency of traffic-related incidents (note the "pedestrian violation" segment — I can attest from experience that jaywalking is strictly enforced in Seattle!).

For more on the Police Data Challenge and the winners in the Overall and Best Use of External Data categories, follow the link below.

This is Statistics: Police Data Challenge: Congratulations to our Winners!

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
### Read and write using fst & feather for large data files

Tue, 12/19/2017 - 16:07

(This article was first published on Coastal Econometrician Views, and kindly contributed to R-bloggers)

For the past few years, I was using feather as my favorite data writing and reading option in R (one reason was its cross-platform compatibility across Julia, Python and R). Recently, however, I observed that its read
and write times were not at all effective with large files of size > 5 GB. I found the fst format to be good for both reading and writing large files in R; its only disadvantage is that it is not cross-platform compatible the way feather is. In particular, fst compression at 100% is good for storing large files, and it is more efficient at reading them back into the R environment. I also see no point in re-testing, as benchmarks are already available at the fst site.

Happy R programming, and let me know your experiences.

To leave a comment for the author, please follow the link and comment on their blog: Coastal Econometrician Views.

### Data Wonderland: Christmas songs from the viewpoint of a data scientist

Tue, 12/19/2017 - 15:54

(This article was first published on eoda english R news, and kindly contributed to R-bloggers)

Whether “Driving Home for Christmas”, “Winter Wonderland”, “Let it snow!” or “Last Christmas” – every year, Christmas songs take over the charts again. While the average Joe joyfully puts on the next Christmas song, the data scientist starts a journey of discovery through snowy music history. The data set comes from 55000+ Song Lyrics, which contains over 55,000 songs.
It is a data frame with 55,000+ rows and four columns:

• artist
• song
• link
• text

```r
library(igraph)
library(ggraph)
library(ggplot2)
library(wordcloud2)
library(dplyr)
library(widyr)
library(tidytext)
library(tm)
library(stringr)
library(topicmodels)
library(reshape2)
library(quanteda)
library(Rtsne)
library(DT)
library(knitr)
library(animation)
library(ldatuning)

set.seed(201712)

# Preparation -------------------------------------------------------------
songs <- read.csv("songdata.csv")
songs$song <- songs$song %>% as.character()
songs$artist <- songs$artist %>% as.character()
songs$link <- songs$link %>% as.character()
songs$text <- songs$text %>% as.character()
```

Our goal is to perform a comprehensive analysis of the song texts to identify the Christmas songs. To do so, we first add an additional column to the data frame that labels each song as either Christmas or Not Christmas: every song whose title contains the word Christmas, Xmas or X-mas is labeled Christmas, and otherwise Not Christmas.

```r
# Initialization of the labels
label <- character(dim(songs)[1])
for(i in 1:dim(songs)[1]){
  if(str_detect(songs$song[i], "Christmas") |
     str_detect(songs$song[i], "X-mas") |
     str_detect(songs$song[i], "Xmas")){
    label[i] <- "Christmas"
  } else{
    label[i] <- "Not Christmas"
  }
}
songs <- songs %>% mutate(Label = label)
```
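As an aside, the labeling loop can also be written without an explicit for loop, since str_detect is vectorized over its input. A sketch, illustrated on a tiny made-up vector of titles (in the post this would be songs$song):

```r
library(stringr)

# Vectorized equivalent of the labeling loop: one regex with alternation
# instead of three str_detect calls per song. The titles are invented.
titles <- c("White Christmas", "Merry Xmas Everybody", "Yesterday")
label <- ifelse(str_detect(titles, "Christmas|X-mas|Xmas"),
                "Christmas", "Not Christmas")
label  # "Christmas" "Christmas" "Not Christmas"
```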

This is just the initialization of the labels; later, we will apply Naive Bayes to a training set to identify the other Christmas songs. First of all, though, we will explore the data set by means of some intuitive descriptive approaches.

```r
D3Vis <- function(edgeList, directed){
  colnames(edgeList) <- c("SourceName", "TargetName", "Weight")

  # Min-max & inverse scaling, because the weights should represent distance/similarity
  edgeList$Weight <- 1 - edgeList$Weight
  weight.min <- edgeList$Weight %>% min
  weight.max <- edgeList$Weight %>% max
  edgeList$Weight <- (edgeList$Weight - weight.min)/(weight.max - weight.min)

  # Create a graph. Use simplify to ensure that there are no duplicated edges or self loops
  gD <- igraph::simplify(igraph::graph.data.frame(edgeList, directed = directed))

  # Create a node list (a data frame) that will contain information about nodes
  nodeList <- data.frame(
    ID = c(0:(igraph::vcount(gD) - 1)),  # networkD3 requires IDs to start at 0
    nName = igraph::V(gD)$name
  )

  # Map node names from the edge list to node IDs
  getNodeID <- function(x){
    which(x == igraph::V(gD)$name) - 1  # to ensure that IDs start at 0
  }

  # And add them to the edge list
  edgeList <- plyr::ddply(
    edgeList, .variables = c("SourceName", "TargetName", "Weight"),
    function(x) data.frame(SourceID = getNodeID(x$SourceName),
                           TargetID = getNodeID(x$TargetName))
  )

  # Calculate some node properties and node similarities that will be used to
  # illustrate different plotting abilities, and add them to the edge and node lists

  # Calculate degree for all nodes
  nodeList <- cbind(nodeList,
                    nodeDegree = igraph::degree(gD, v = igraph::V(gD), mode = "all"))

  # Calculate betweenness for all nodes; scale by 100 for visualization
  # purposes only (to create larger nodes)
  betAll <- igraph::betweenness(gD, v = igraph::V(gD), directed = directed) /
    (((igraph::vcount(gD) - 1) * (igraph::vcount(gD) - 2)) / 2)
  betAll.norm <- (betAll - min(betAll))/(max(betAll) - min(betAll))
  nodeList <- cbind(nodeList, nodeBetweenness = 100 * betAll.norm)
  rm(betAll, betAll.norm)

  # Calculate Dice similarities between all pairs of nodes
  dsAll <- igraph::similarity.dice(gD, vids = igraph::V(gD), mode = "all")
  F1 <- function(x) {data.frame(diceSim = dsAll[x$SourceID + 1, x$TargetID + 1])}
  edgeList <- plyr::ddply(
    edgeList,
    .variables = c("SourceName", "TargetName", "Weight", "SourceID", "TargetID"),
    function(x) data.frame(F1(x))
  )
  rm(dsAll, F1, getNodeID, gD)

  # Create a set of colors for each edge, based on the Dice similarity values,
  # interpolated with colorRampPalette (bias = total number of edges)
  F2 <- colorRampPalette(c("#FFFF00", "#FF0000"), bias = nrow(edgeList),
                         space = "rgb", interpolate = "linear")
  colCodes <- F2(length(unique(edgeList$diceSim)))
  edges_col <- sapply(edgeList$diceSim,
                      function(x) colCodes[which(sort(unique(edgeList$diceSim)) == x)])
  rm(colCodes, F2)

  # Reverse the transformation of the weights
  edgeList$Weight <- -(edgeList$Weight * (weight.max - weight.min) + weight.min - 1)

  # Create the network
  D3_network_LM <- networkD3::forceNetwork(
    Links = edgeList,              # data frame with info about edges
    Nodes = nodeList,              # data frame with info about nodes
    Source = "SourceID",           # ID of source node
    Target = "TargetID",           # ID of target node
    Value = "Weight",              # edge weight used to value the relationship
    NodeID = "nName",              # node label
    Nodesize = "nodeBetweenness",  # node size
    Group = "nodeDegree",          # node color
    fontSize = 20,
    linkDistance = networkD3::JS("function(d) { return 10*d.value; }"),
    opacity = 0.85,
    arrows = directed,
    zoom = TRUE,                   # ability to zoom when clicking on a node
    legend = F,
    linkColour = edges_col         # edge colors
  )

  # Plot network
  D3_network_LM
}
```

Exploration of the initial Christmas songs

Cleaning & Tokenization

We should start with the data cleaning and tokenization. Afterwards, the Christmas songs will be selected and saved as a variable.

```r
songs.unnest <- songs %>%
  unnest_tokens(word, text) %>%
  anti_join(tibble(word = stop_words$word)) %>%
  filter(!str_detect(word, "\\d+"))

xmas.unnest <- songs.unnest %>%
  filter(Label == "Christmas")
```

Correlation Analysis

Now we can start analyzing the initial Christmas songs by means of correlations from different perspectives. In the following, we visualize the correlations with the networkD3 HTML widget, where nodes with the same total number of connections are given the same color, and the color of an edge reflects the number of common neighbors shared by its two nodes. The size of a node indicates its centrality, defined by the betweenness, i.e. the number of shortest paths going through it. The distance between two nodes is the min-max transformation of 1 minus the correlation, which makes sense because, intuitively, the higher the correlation, the nearer two nodes should be. Moreover, the shorter the distance, the wider the edge.
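The distance transform described above (mirroring what the D3Vis helper does internally) can be sketched in a few lines; the function name cor_to_dist is ours, for illustration only:

```r
# Sketch of the edge-distance transform: distance is the min-max scaling
# of (1 - correlation), so a higher correlation gives a smaller distance.
cor_to_dist <- function(correlation) {
  d <- 1 - correlation
  (d - min(d)) / (max(d) - min(d))
}

cor_to_dist(c(0.9, 0.5, 0.1))  # the most-correlated pair gets distance 0
```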

Note that the correlations are always based on lyrics.

Correlation between words

The correlation between words which appeared more than 100 times and are correlated with at least one other word with a correlation greater than 0.55.

```r
correlation.words <- xmas.unnest %>%
  group_by(word) %>%
  filter(n() > 100) %>%
  ungroup() %>%
  pairwise_cor(word, song, sort = T)

# Network visualization
correlation.words %>%
  filter(correlation > 0.55) %>%
  D3Vis(directed = F)
```

Correlation between songs

The correlation between songs which are correlated with at least 3 other songs with a correlation greater than 0.75. With this, we may detect similar or just slightly modified songs.

```r
correlation.songs <- xmas.unnest %>%
  pairwise_cor(song, word, sort = T)

# Network visualization
correlation.songs %>%
  filter(correlation > 0.75) %>%
  group_by(item1) %>%
  filter(n() >= 3) %>%
  ungroup() %>%
  D3Vis(directed = F)
```

Correlation between certain words

The correlation between certain words

```r
correlation.words %>%
  filter(item1 %in% c("christus", "jesus", "snow", "reindeer", "home",
                      "holy", "love", "tree", "white", "christmas"),
         correlation > 0.4) %>%
  D3Vis(directed = F)
```

Correlation between artists

The correlation between artists

```r
correlation.artists <- xmas.unnest %>%
  pairwise_cor(artist, word, sort = T)

# Network visualization
correlation.artists %>%
  filter(correlation > 0.8) %>%
  group_by(item1) %>%
  filter(n() >= 3) %>%
  ungroup() %>%
  D3Vis(directed = F)
```

Word Cloud

Word cloud of the initial Christmas songs

```r
xmas.cloud <- xmas.unnest %>%
  count(word) %>%
  as.data.frame()

xmas.cloud %>%
  wordcloud2(minSize = 3, shape = 'star')
```

Naive Bayes

Naive Bayes is a popular supervised machine-learning algorithm for classification problems with a huge number of features. It is “naive” in the sense that, conditional on a class, the features are assumed to be independently distributed. In our case, we would like to know whether, given a bunch of features, i.e. the tf-idf values of the words in a document, a song should be classified as a Christmas song or not by Naive Bayes.
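As a quick reminder of what tf-idf itself does, here is a toy sketch in base R. It is not the quanteda implementation used later (weighting schemes differ in detail), and the three “documents” are invented:

```r
# Toy tf-idf: weight = term frequency * log(N / document frequency)
docs <- list(c("snow", "snow", "love"),
             c("love", "tree"),
             c("snow", "tree", "tree"))
vocab <- sort(unique(unlist(docs)))

tf  <- t(sapply(docs, function(d) table(factor(d, levels = vocab))))  # term counts
df  <- colSums(tf > 0)                  # in how many documents each term occurs
idf <- log(length(docs) / df)           # rarer terms get a larger weight
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 2)
```

Terms that occur in every document get idf = 0, so tf-idf down-weights words that carry no discriminating information.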

Generally, given features $\mathbf{x} = (x_1, \dots, x_p)$ we have

$$\mathbb{P}(C_k \mid \mathbf{x}) = \frac{\mathbb{P}(C_k)\,\mathbb{P}(\mathbf{x} \mid C_k)}{\mathbb{P}(\mathbf{x})} = \frac{\mathbb{P}(C_k)\prod_{i=1}^{p}\mathbb{P}(x_i \mid C_k)}{\mathbb{P}(\mathbf{x})} \propto \mathbb{P}(C_k)\prod_{i=1}^{p}\mathbb{P}(x_i \mid C_k),$$

where $\mathbb{P}(C_k)$ is called the prior, $\mathbb{P}(C_k \mid \mathbf{x})$ the posterior and $\mathbb{P}(\mathbf{x} \mid C_k)$ the likelihood. The classifier picks the class maximizing the posterior:

$$\hat{C} := \underset{k}{\arg\max}\ \mathbb{P}(C_k)\prod_{i=1}^{p}\mathbb{P}(x_i \mid C_k)$$

Because we assume that the features are independent conditional on an arbitrary class, we may estimate $\mathbb{P}(x_i \mid C_k)$ for all $i = 1, \dots, p$ independently of the other features using a training set, which makes the whole thing much easier. Popular assumptions for the likelihood are Gaussian, multinomial or Bernoulli. The harder part of constructing the estimator is the choice of the prior distribution, i.e. the probability distribution of the classes; it is usually assumed to be uniform or estimated from the class frequencies. In our case, the multinomial distribution for the likelihood and the uniform distribution for the prior are used, which means we have no prejudice regarding the categorization of the songs without further information.
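To make the posterior computation concrete, here is a toy numeric sketch: two classes, two binary word features, uniform prior. For simplicity it uses a Bernoulli likelihood rather than the multinomial one used below, and all numbers are invented for illustration:

```r
# Toy Bernoulli Naive Bayes posterior (all probabilities are made up)
prior <- c(christmas = 0.5, not_christmas = 0.5)          # uniform prior
lik <- rbind(christmas     = c(snow = 0.8, love = 0.3),   # P(word | class)
             not_christmas = c(snow = 0.1, love = 0.6))
x <- c(snow = 1, love = 0)   # the document contains "snow" but not "love"

# Unnormalized posterior: P(C_k) * prod_i P(x_i | C_k)
post <- prior * apply(lik, 1, function(p) prod(ifelse(x == 1, p, 1 - p)))
post / sum(post)   # christmas ≈ 0.93, not_christmas ≈ 0.07
```

The arg-max over the (normalized or unnormalized) posterior gives the predicted class, here "christmas".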

Identify the hidden Christmas songs

```r
# Document-feature matrix
songs.dfm.tfidf <- corpus(songs, text_field = "text", docid_field = "song") %>%
  dfm(tolower = T, stem = TRUE, remove_punct = TRUE, remove = stopwords("english")) %>%
  dfm_trim(min_count = 5, min_docfreq = 3) %>%
  dfm_weight(type = "tfidf")

# Determine the indices for the training set
christmas.index <- which(label == "Christmas")
not_christmas.index <- which(label == "Not Christmas")
christmas.train.index <- christmas.index
not_christmas.train.index <- sample(not_christmas.index, length(christmas.index))
train.index <- c(christmas.train.index, not_christmas.train.index)
label.train <- label[train.index]
trainning.set <- songs.dfm.tfidf[train.index, ]

# Train the model
classifier_NB <- textmodel_NB(trainning.set, label.train)

# Prediction
predictions <- classifier_NB %>% predict(newdata = songs.dfm.tfidf)

# Confusion matrix
confusion <- table(predictions$nb.predicted, label)
confusion
```

So we have identified 2965 hidden Christmas songs, and 2 songs out of the initial 500 Christmas songs are rejected by Naive Bayes as Christmas songs.

Explore the hidden Christmas songs

```r
# Determine the indices for the hidden (not) Christmas songs
hidden.index <- (predictions$nb.predicted == "Christmas") &
  (songs$Label == "Not Christmas")
hidden_not.index <- (predictions$nb.predicted == "Not Christmas") &
  (songs$Label == "Christmas")

# Change the labels
label[hidden.index] <- "Hidden Christmas"
label[hidden_not.index] <- "Hidden Not Christmas"
songs$Label <- label
songs.dfm.tfidf@docvars$Label <- label

# Word cloud for the hidden Christmas songs
hidden.xmas <- songs[hidden.index, ]
hidden.unnest <- hidden.xmas %>%
  unnest_tokens(word, text) %>%
  anti_join(tibble(word = stop_words$word)) %>%
  filter(!str_detect(word, "\\d+"))

hidden.unnest %>%
  count(word) %>%
  filter(n >= 5) %>%
  as.data.frame() %>%
  wordcloud2(shape = "star", minSize = 5)

# Correlation
hidden.correlation.words <- hidden.unnest %>%
  group_by(word) %>%
  filter(n() > 15) %>%
  ungroup() %>%
  pairwise_cor(word, song, sort = T)

# Network visualization
hidden.correlation.words %>%
  filter(correlation > 0.65) %>%
  group_by(item1) %>%
  filter(n() >= 20) %>%
  ungroup() %>%
  D3Vis(directed = F)
```

We have therefore successfully identified a bunch of religious Christmas songs whose titles usually do not contain the word “Christmas” or “X-mas”.

Latent Dirichlet Allocation & t-Distributed Stochastic Neighbor Embedding

Data Preparation

Only the top 300 features of the Christmas songs, including the hidden ones, will be used for the Rtsne and LDA calculations; otherwise, the memory space will not be sufficient.

```r
xmas.dfm.tfidf <- songs.dfm.tfidf %>%
  dfm_subset(Label == "Christmas" | Label == "Hidden Christmas")

songs.dfm.tfidf_300 <- songs.dfm.tfidf %>%
  dfm_select(pattern = xmas.dfm.tfidf %>% topfeatures(300) %>% names(),
             selection = "keep")

xmas.dfm.tfidf_300 <- xmas.dfm.tfidf %>%
  dfm_select(pattern = xmas.dfm.tfidf %>% topfeatures(300) %>% names(),
             selection = "keep")
```

LDA

LDA stands for Latent Dirichlet Allocation, introduced by Blei, Ng and Jordan (2003). It is a generative probabilistic model of a corpus in which documents are represented as random mixtures over latent topics, and for a single document there are usually only a few topics with non-negligible probabilities. Each topic is in turn characterized by a distribution over words, where usually only a small set of words is assigned significant probability for a given topic. Either the variational expectation-maximization algorithm or Gibbs sampling is used for statistical inference of the parameters.

LDA requires a fixed number of topics, i.e. it assumes that the number of topics is known before the algorithm is applied. However, the optimal number of topics can be determined with different performance metrics (see Nikita) by using the package ldatuning.

```r
OptimalNumber <- FindTopicsNumber(
  xmas.dfm.tfidf_300 %>% convert(to = "topicmodels"),
  topics = seq(2, 8, by = 1),
  mc.cores = 2,
  metrics = c("CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "VEM",
  verbose = T
)

FindTopicsNumber_plot(OptimalNumber)
```

Therefore, we will choose 8 as the optimal number of topics.

```r
LDA_xmas <- xmas.dfm.tfidf_300 %>%
  convert(to = "topicmodels") %>%
  LDA(k = 8)
```

We may use the package tidytext to inspect the topic probability distribution of each document, i.e. for each document the sum of the probabilities that it belongs to a topic from 1 to 8 is equal to 1.

```r
LDA_xmas %>%
  tidy(matrix = "gamma") %>%
  datatable(rownames = F)
```

Analogously, we can also obtain the probability distribution of words for each topic, i.e. for each topic the sum of probabilities that it generates different words is equal to 1.

```r
LDA_xmas %>%
  tidy(matrix = "beta") %>%
  datatable(rownames = F)
```

The top terms for each topic are:

```r
# LDA for the Christmas songs
terms(LDA_xmas, 10)
```

t-SNE

Developed by van der Maaten and Hinton (2008), t-SNE stands for t-Distributed Stochastic Neighbor Embedding, a dimensionality-reduction technique formulated to capture the local clustering structure of the original data points. It is non-linear and non-deterministic.

Generally speaking, we have data points of high dimensionality $x_1, \dots, x_n \in \mathbb{R}^N$ and would like to calculate counterparts $y_1, \dots, y_n \in \mathbb{R}^M$ in a low-dimensional space, i.e. with $M < N$.

First of all, we define the probability that $x_i$ would pick $x_j$ as its neighbor as

$$p_{j|i} = \frac{\exp(-\lVert x_j - x_i\rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_k - x_i\rVert^2 / 2\sigma_i^2)},$$

i.e. it is proportional to a Gaussian centered at $x_i$, where the variance $\sigma_i$ is determined by a binary search such that the perplexity

$$\mathrm{Perp}(p_i) = 2^{H(p_i)} = 2^{-\sum_{j \neq i} p_{j|i}\log_2 p_{j|i}}$$

is as close as possible to a perplexity predefined by the user. However, the conditional probability is not symmetric. In order to measure the similarity between $x_i$ and $x_j$, we define the metric to be

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2}.$$

The similarity metric for $y_1, \dots, y_n$ is defined via the Student-t distribution with one degree of freedom, i.e. the similarity between $y_i$ and $y_j$ is

$$q_{ij} = \frac{(1 + \lVert y_j - y_i\rVert^2)^{-1}}{\sum_{k \neq i}(1 + \lVert y_k - y_i\rVert^2)^{-1}}.$$

The goal of t-SNE is to find the counterparts $\mathfrak{Y} = (y_1, \dots, y_n)$ of $\mathfrak{X} = (x_1, \dots, x_n)$ such that the Kullback-Leibler divergence

$$C = D_{KL}(P \Vert Q) = \sum_{i \neq j} p_{ij}\log\frac{p_{ij}}{q_{ij}},$$

i.e. our loss function $C$, which can be understood as the information loss from using $\mathfrak{Y}$ to represent $\mathfrak{X}$, is minimized. Obviously, there will be a relatively high loss if we use a far-apart pair $(y_i, y_j)$ to represent a nearby pair $(x_i, x_j)$; therefore, the local clustering/neighborhood structure of $\mathfrak{X}$ is preserved. It can be shown that the gradient of the loss function has the relatively simple form

$$\frac{dC}{d\mathfrak{Y}} = \left(\frac{\partial C}{\partial y_1}, \dots, \frac{\partial C}{\partial y_n}\right), \qquad \frac{\partial C}{\partial y_i} = 4\sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \lVert y_i - y_j\rVert^2)^{-1}.$$
Gradient descent with momentum is applied to minimize the loss function: $$\mathfrak{Y}^{(t)} = \mathfrak{Y}^{(t – 1)} + \eta\frac{dC}{d\mathfrak{Y}} + \alpha(t)(\mathfrak{Y}^{(t-1)} – \mathfrak{Y}^{(t – 2)}),$$ where $\eta$ is called the learning rate and $\alpha(t)$ the momentum. $\mathfrak{Y}^{(0)}$ is sampled from an isotropic Gaussian with small variance.

The following computation takes about 30 minutes.

# t-Distributed Stochastic Neighbor Embedding --------------------------------
index.unique.songs <- !songs.dfm.tfidf_300 %>%
  as.matrix() %>%
  duplicated()
songs.unique <- songs.dfm.tfidf_300[index.unique.songs, ] %>%
  as.matrix()
tsne.all <- Rtsne(songs.unique)

songs_2d <- tsne.all$Y %>%
  as.data.frame() %>%
  mutate(Label = label[index.unique.songs])

songs_2d %>%
  ggplot(aes(x = V1, y = V2, color = Label)) +
  geom_point(size = 0.25) +
  scale_color_manual(values = c("Not Christmas" = "#a6a6a6",
                                "Christmas" = "#88ab33",
                                "Hidden Christmas" = "#F98948",
                                "Hidden Not Christmas" = "#437F97")) +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "black"))

What if we repeat the procedure for more than one iteration?

So far we have only run the Naive Bayes for one iteration. However, we may repeat this procedure for more than one iteration, i.e. train a Naive Bayes classifier and relabel all the false positives as Hidden Christmas/Christmas and all the false negatives as Hidden Not Christmas/Not Christmas over and over again.

First of all, we prepare the data again so that the labels start from their initial state.

songs <- read.csv("songdata.csv")
songs$song <- songs$song %>% as.character()
songs$artist <- songs$artist %>% as.character()
songs$link <- songs$link %>% as.character()
songs$text <- songs$text %>% as.character()

# Initialization of the labels
label <- character(dim(songs)[1])
for(i in 1:dim(songs)[1]){
  if(str_detect(songs$song[i], "Christmas") |
     str_detect(songs$song[i], "X-mas") |
     str_detect(songs$song[i], "Xmas")){
    label[i] <- "Christmas"
  } else{
    label[i] <- "Not Christmas"
  }
}
songs <- songs %>% mutate(Label = label)

songs.dfm.tfidf <- corpus(songs, text_field = "text", docid_field = "song") %>%
  dfm(tolower = TRUE, stem = TRUE, remove_punct = TRUE,
      remove = stopwords("english")) %>%
  dfm_trim(min_count = 5, min_docfreq = 3) %>%
  dfm_weight(type = "tfidf")

results <- data.frame(precision = numeric(10),
                      recall = numeric(10),
                      f1_score = numeric(10))

Run 10 iterations.

for(i in 1:10){
  # Determine the indices
  christmas.index <- which(label == "Christmas")
  not_christmas.index <- which(label == "Not Christmas")
  if(length(christmas.index) < length(not_christmas.index)){
    christmas.train.index <- christmas.index
    not_christmas.train.index <- sample(not_christmas.index, length(christmas.index))
  } else{
    not_christmas.train.index <- not_christmas.index
    christmas.train.index <- sample(christmas.index, length(not_christmas.index))
  }
  train.index <- c(christmas.train.index, not_christmas.train.index)
  label.train <- label[train.index]
  training.set <- songs.dfm.tfidf[train.index, ]

  # Train the model
  classifier_NB <- textmodel_NB(training.set, label.train)

  # Prediction
  predictions <- classifier_NB %>% predict(newdata = songs.dfm.tfidf)

  # Confusion matrix
  confusion <- table(predictions$nb.predicted, label)
  precision <- confusion[1, 1]/sum(confusion[1, ])
  recall <- confusion[1, 1]/sum(confusion[, 1])
  f1_score <- 2*precision*recall/(precision + recall)

  # The hidden (not) Christmas songs
  hidden.index <- (predictions$nb.predicted == "Christmas") &
    (songs$Label == "Not Christmas")
  hidden_not.index <- (predictions$nb.predicted == "Not Christmas") &
    (songs$Label == "Christmas")
  hidden.xmas <- songs[hidden.index, ]
  hidden.not_xmas <- songs[hidden_not.index, ]
  label[hidden.index] <- "Hidden Christmas"
  label[hidden_not.index] <- "Hidden Not Christmas"

  songs_2d <- tsne.all$Y %>%
    as.data.frame() %>%
    mutate(Label = label[index.unique.songs])

  iteration.plot <- songs_2d %>%
    ggplot(aes(x = V1, y = V2, color = Label)) +
    geom_point(size = 0.25) +
    scale_color_manual(values = c("Not Christmas" = "#a6a6a6",
                                  "Christmas" = "#88ab33",
                                  "Hidden Christmas" = "#F98948",
                                  "Hidden Not Christmas" = "#437F97")) +
    guides(color = guide_legend(override.aes = list(size = 5))) +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_rect(fill = "black")) +
    ggtitle(paste("Iteration:", i))

  # Change the labels back for the next training round
  label[hidden.index] <- "Christmas"
  label[hidden_not.index] <- "Not Christmas"
  songs$Label <- label
  songs.dfm.tfidf@docvars$Label <- label

  results[i, ] <- c(precision, recall, f1_score)
  plot(iteration.plot)
}

results %>%
  mutate(index = 1:10) %>%
  melt(id = "index") %>%
  ggplot(aes(x = index, y = value, color = variable)) +
  geom_line()

The precision as well as the F1 score grow monotonically at first and then converge to a value around 0.95, which means there are not many "Hidden Christmas Songs" and "Hidden Not Christmas Songs" left to be detected. However, in this procedure we always assume that the Naive Bayes classifier is 100% accurate, which is hardly possible. Thus, in each iteration there are some songs falsely classified by Naive Bayes as "Christmas", which will then be used in the training set of the next iteration. With this accumulating error we might worry that the results actually get worse with more iterations.
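The precision, recall and F1 score computed inside the loop follow the standard definitions. As a quick sanity check, here is the same computation on a made-up toy confusion matrix (illustrative Python, not part of the original post):

```python
def prf1(confusion):
    # confusion laid out as in the R code: rows = predicted, cols = actual,
    # with the positive class ("Christmas") first
    tp, fp = confusion[0]
    fn, tn = confusion[1]
    precision = tp / (tp + fp)   # confusion[1,1] / sum(confusion[1, ])
    recall = tp / (tp + fn)      # confusion[1,1] / sum(confusion[, 1])
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(prf1([[40, 10], [20, 30]]))  # precision 0.8, recall 2/3, F1 = 8/11
```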
confusion

At the end we have roughly half of the songs classified as "Christmas" and the other half as "Not Christmas", which seems very implausible. This raises the question of whether there is an optimal number of iterations; however, we simply cannot manually check whether all of the 57,650 songs are correctly classified. This remains an open question to be answered.

To leave a comment for the author, please follow the link and comment on their blog: eoda english R news.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### 5 interesting subtle insights from TED videos data analysis in R

Tue, 12/19/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

This post aims to bring out some not-so-obvious, subtle insights from analyzing the TED videos posted on TED.com. For those who do not know what TED is, here is the summary from Wikipedia: TED (Technology, Entertainment, Design) is a media organization which posts talks online for free distribution, under the slogan "ideas worth spreading". This analysis uses the TED Talks dataset posted on Kaggle Datasets.

Data Description

Once the data is downloaded from the above link and unzipped, the two files – transcripts.csv and ted_main.csv – can be read into R as below:

transcripts <- read.csv('../transcripts.csv', stringsAsFactors = F, header = T)
main <- read.csv('../ted_main.csv', stringsAsFactors = F, header = T)

ted_main.csv contains information like speaker name, talk name, event name, talk duration, comments, video views and much more for the videos made available on TED, while transcripts.csv contains the entire transcripts of the same talks.

Loading libraries:

library(dplyr)
library(ggplot2)
library(ggthemes)

Extract some basic info regarding the dataset:

nrow(main)
## 2550

Video details are available in the main data frame. Since not all TED videos are extremely popular, let us see how many of them have more than 1M views.

paste0('Total Number of videos with more than 1M views: ',
       main %>% filter(views > 1000000) %>% count())
paste0('% of videos with more than 1M views: ',
       round((main %>% filter(views > 1000000) %>% count() / nrow(main)) * 100, 2), '%')
## 'Total Number of videos with more than 1M views: 1503'
## '% of videos with more than 1M views: 58.94%'

Being a one-trick pony is very easy in any business, so let us explore the best of the not-so-one-trick ponies.

main %>%
  filter(views > 1000000) %>%
  group_by(main_speaker) %>%
  count() %>%
  filter(n > 2) %>%
  arrange(desc(n)) %>%
  head(20) %>%
  ggplot() +
  geom_bar(aes(reorder(main_speaker, -n), n), stat = 'identity') +
  theme_solarized() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab('Speakers') +
  ggtitle('Top 20 Frequently Appeared Speakers in all videos with 1M+ views')

Gives this plot:

And the winner is none other than Mr. Hans Rosling, whose Gapminder TED talk is always an inspiration for any data wiz. Many a time, the amount of effort put in gets translated into the effectiveness of the outcome, but the best is always getting things done with less effort – which in our case means more views with less time.
main %>%
  filter(views > 1000000) %>%
  arrange(duration) %>%
  slice(1:10) %>%
  select('name', 'duration', 'views', 'event')

name | duration | views | event
Derek Sivers: Weird, or just different? | 162 | 2835976 | TEDIndia 2009
Paolo Cardini: Forget multitasking, try monotasking | 172 | 2324212 | TEDGlobal 2012
Mitchell Joachim: Don't build your home, grow it! | 176 | 1332785 | TED2010
Arthur Benjamin: Teach statistics before calculus! | 178 | 2175141 | TED2009
Terry Moore: How to tie your shoes | 179 | 6263759 | TED2005
Malcolm London: "High School Training Ground" | 180 | 1188177 | TED Talks Education
Bobby McFerrin: Watch me play ... the audience! | 184 | 3302312 | World Science Festival
Derek Sivers: How to start a movement | 189 | 6475731 | TED2010
Bruno Maisonnier: Dance, tiny robots! | 189 | 1193896 | TEDxConcorde
Dean Ornish: Your genes are not your fate | 192 | 1384333 | TED2008

And the winner this time is Derek Sivers, who happens to appear twice on the same list, with two popular TED talks that are each only about three minutes long.

Malcolm Gladwell in his book Outliers presents an interesting case of how date of birth plays a role in hockey team selection, so let us see if there is any magical first letter of the name that stands out among TED speakers.

main$first_letter <- substr(main$main_speaker, 1, 1)
main %>%
  group_by(first_letter = toupper(first_letter)) %>%
  count() %>%
  arrange(desc(n)) %>%
  ggplot() +
  geom_bar(aes(reorder(first_letter, -n), n), stat = 'identity') +
  theme_solarized() +
  xlab('Speaker First Letter') +
  ylab('Count') +
  ggtitle('Popular First Letter of Author Names appearing in TED Talks')

Gives this plot:

And 'J' is the outstanding winner of the first-letter race: its name holders appear most frequently in TED talks (remember, correlation doesn't mean causation!). While TED primarily publishes videos from the global TED events, some great TEDx videos also get featured on TED, so let us analyze which TEDx chapter made it big.
# tedx is assumed to be the subset of main restricted to TEDx events
# (its construction is not shown in the original post)
tedx %>%
  filter(views > 1000000) %>%
  group_by(event) %>%
  count() %>%
  filter(n > 2) %>%
  arrange(desc(n)) %>%
  head(20) %>%
  ggplot() +
  geom_bar(aes(reorder(event, -n), n), stat = 'identity') +
  theme_solarized() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab('TEDx Events') +
  ggtitle('Top 20 TEDx Events with the most 1M+ view talks on TED.com')

Gives this plot:

And finally, let us analyze the magical (first) word that TED speakers usually start their talk with.

transcripts$first_word <- unlist(lapply(transcripts$transcript,
                                        function(x) strsplit(x, " ")[[1]][1]))
transcripts %>%
  group_by(first_word) %>%
  count() %>%
  arrange(desc(n)) %>%
  head(25) %>%
  ggplot() +
  geom_bar(aes(reorder(first_word, -n), n), stat = 'identity') +
  theme_solarized() +
  xlab('First Word of the Talk') +
  ylab('Count') +
  ggtitle('Top First Word of the Talk') +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Gives this plot:

And like most humans on the planet, most TED speakers seem to start with 'I' (narcissism, maybe?), and strangely Chris – the first name of Chris Anderson, the owner of TED – appears on the same list too (gratitude, maybe!). This dataset still has a lot more interesting, subtle insights to be unveiled. The code used here is available on my Github.

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.
### ANNOUNCEMENT: EARL London 2018 + abstract submissions open!

Tue, 12/19/2017 - 11:34

We are thrilled to announce the dates for next year's EARL London: 11-13 September 2018 (make sure you pop these dates in your calendars!).

Call for abstracts

Abstract submissions are now open! We'd love to hear from you if you want to share your R story. Every year, R users from all over the world submit fantastic abstracts around really cool commercial R usage and we want you to be one of them in 2018!

We're accepting lightning talks…

For 2018, we're mixing things up and accepting abstracts for lightning talks as well as the usual 30-minute slots. Lightning talk slots are 10 minutes long, so it's about concentrating the awesomeness of R into a bite-sized chunk.

Seven reasons you should submit an abstract

In case you're 'umming' and 'ahhing', we've put together the top seven reasons why you should submit an abstract:

1. Our attendees – our audience are amongst the most passionate about R in the world. They come from across the globe to discuss, share and network with their fellow R users. Our audience are also incredibly friendly, which makes networking fun and easy. You shouldn't worry about attending solo either, because many of our attendees do too.
2. The other speakers – hopefully after your talk you'll be able to catch some of the other presentations. At EARL, speaker slots aren't sponsored, so we're not charging companies to share their stories. This helps us get fascinating speakers from a huge range of industries and different-sized companies.
3. Forefront of R usage in business – on the topic of speakers, having so many industries represented means EARL is a great place to get a live snapshot of what's going on in the industry right now and what trends and technologies are starting to emerge.
4. The supportive atmosphere – maybe you think you don't know enough to present a talk, but it's not about being the best or knowing the most. People come to EARL to hear from people just like them. They want to hear about the lessons you've learnt, your successes, and your mistakes. EARL is about inspiring others to solve problems with R.
5. The evening networking event – we've been to HMS Belfast, the Tower of London, Tower Bridge and on a cruise down the Thames. We can't reveal our location for 2018 just yet, but it's lining up to be another incredible venue. The evening reception is a great opportunity to catch someone you've been wanting to talk to, or just to have a more relaxed discussion over a couple of drinks.
6. The range of topics – if you're unsure whether anyone will be interested in your talk, we can promise you that there will be others there that love what you love. Our audience always have plenty of interesting questions to ask, no matter what the talk topic is.
7. Free conference ticket – speakers at EARL receive a free conference pass for the day they are speaking and the evening networking event.

Deadline

Abstract submissions for EARL London close on 9 March 2018. Please do get in contact if you have any questions about your topic, submitting abstracts, or if you just need some encouragement! Submit your abstract here.

Tickets

Tickets will be available in January 2018, so keep your eyes peeled! We look forward to reading your abstracts!

### Insurance Data Science Conference 2018

Tue, 12/19/2017 - 01:00

(This article was first published on R on mages' blog, and kindly contributed to R-bloggers)

Following five R in Insurance conferences, we are organising the first Insurance Data Science conference at Cass Business School London, 16 July 2018.
In 2013, we started with the aim of bringing practitioners from industry and academia together to discuss and exchange ideas and needs from both sides. R was and is a perfect glue between the two groups, a tool which both sides embrace and which has fostered the knowledge transfer between the two. However, R is just one example, and other languages serve this purpose equally well: Python is another popular language, and Julia and Stan have also gained momentum. For that reason we have rebranded our conference series to "Insurance Data Science". We believe that by removing the explicit link to "R" we have more freedom to stay relevant and embrace whatever technology may evolve in the future.

We expect contributions on topics related to risk assessment, customer analytics, pricing, reserving, capital management, catastrophe and econometric modelling.

The keynote speakers are:

The conference will be followed by a full-day Stan in insurance workshop with Eric Novik and Paul-Christian Bürkner.

Key dates

• 1 February 2018: Registration opens
• 9 April 2018: Abstract submission deadline
• 31 May 2018: Early-bird registration deadline
• 16 July 2018: Insurance Data Science Conference
• 17 July 2018: Stan in Insurance workshop

For more details visit https://insurancedatascience.org

To leave a comment for the author, please follow the link and comment on their blog: R on mages' blog.
### Le Monde puzzle [#1033]

Tue, 12/19/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A simple Le Monde mathematical puzzle after two geometric ones I did not consider:

1. Bob gets a 2×3 card with three integer entries on the first row and two integer entries on the second row such that (i) entry (1,1) is 1, and (ii) summing up subsets of adjacent entries produces all integers from 1 to 21. (Adjacent means sharing an index.) Deduce Bob's voucher.
2. Alice gets Bob's voucher completed into a 2×4 card with further integer entries. What is the largest value of N such that all integers from 1 to N are available through summing up subsets of entries?

The first question only requires a few attempts, but it can be solved by brute-force simulation. Here is R code that leads to the solution:

alsum <- function(sol){
  c(sol,
    sum(sol[1:2]), sum(sol[2:3]), sum(sol[4:5]),
    sum(sol[c(1,4)]), sum(sol[c(1,5)]), sum(sol[1:3]),
    sum(sol[c(1,4,5)]), sum(sol[c(1,2,5)]), sum(sol[c(2,4,5)]),
    sum(sol[c(1,2,3,5)]), sum(sol[2:5]),
    sum(sol[c(1,2,4)]), sum(sol[c(1,2,4,5)]), sum(sol[1:4]),
    sum(sol), sum(sol[c(2,3,5)]))
}

which produces (1,8,7,3,2) as the only case for which

length(unique(alsum(sol))) == 21

The second puzzle means considering all sums and checking that there exists a solution covering all subsets. There is no constraint in this second question, hence in principle this could produce N=2⁸-1=255, but I have been unable to exceed 175 through brute-force simulation. (Which entitled me to use the as.logical(intToBits(i)) R command!)
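As a quick cross-check of the voucher (illustrative Python, mine rather than the post's: it ignores the adjacency constraint and only tests the weaker necessary condition that subset sums of the five entries cover 1 to 21):

```python
from itertools import combinations

entries = (1, 8, 7, 3, 2)  # Bob's voucher found by the R search

# all sums of non-empty subsets of the five entries
sums = {sum(c) for r in range(1, len(entries) + 1)
               for c in combinations(entries, r)}

print(sums == set(range(1, 22)))  # → True: every total from 1 to 21 appears
```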
Filed under: Books, Kids, R Tagged: Alice and Bob, Le Monde, mathematical puzzle, partition, R

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

### How to Create Notebooks on Datazar

Tue, 12/19/2017 - 00:07

(This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to R-bloggers)

Getting started with notebooks is as easy as clicking a button on Datazar. Notebooks are great tools that make creating and sharing reproducible analysis super easy. To create R or Python notebooks, follow these directions:

I. Navigate to your project, or create a new one if you don't have one yet. Click on the "Create File" button and a box containing several options will pop up. Let's go with "R Notebook" for this one. You can choose "Python Notebook" and the steps will be exactly the same. And just like that, you'll be redirected to your freshly printed notebook! The first thing you'll see is the first cell (a code text box). If you created an R Notebook, type in message("hello world!") and then hit Shift+Enter. If you've created a Python Notebook, type in print "hello world!" and then Shift+Enter. You'll see the results for the cell displayed right below it, and a new cell created right below the result. On and on; you get the gist.

II.
If you want to load data that's in your project and access it from your notebook for your analysis, all you have to do is click on the "Load Files" button and then click on the button next to the file you want to load. If you're using the R Notebook, you can then import the dataset with data <- read.csv("ExampleData.csv"). That's it! The data is now loaded into the data variable. You can do the same with script files using the source function in R. Whatever you already do in R (in your terminal or other programs), you can replicate in these notebooks, including plots/visualizations, loading external libraries, etc.

Visit www.datazar.com to create notebooks for free and show us what you built @datazarhq

How to Create Notebooks on Datazar was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

To leave a comment for the author, please follow the link and comment on their blog: R Language in Datazar Blog on Medium.

### Data Manipulation in R

Mon, 12/18/2017 - 22:29

(This article was first published on R on Locke Data Blog, and kindly contributed to R-bloggers)

Data Manipulation in R is now generally available on Amazon. All book links will attempt geo-targeting so you end up at the right Amazon. Prices are in USD as most readers are American, and the price will be the equivalent in local currency.
Data Manipulation in R is the second book in my R Fundamentals series, which takes folks from no programming knowledge through to being an experienced R user.

To leave a comment for the author, please follow the link and comment on their blog: R on Locke Data Blog.

### How to Perform Hierarchical Clustering using R

Mon, 12/18/2017 - 22:01

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

What is Hierarchical Clustering?

Clustering is a technique to group similar data points into one cluster and separate out dissimilar observations into different clusters. In hierarchical clustering, clusters are created such that they have a predetermined ordering, i.e. a hierarchy. For example, consider the concept hierarchy of a library: a library has many sections, each section has many books, and the books are grouped according to their subject. This forms a hierarchy. In hierarchical clustering, this hierarchy of clusters can be created either from top to bottom or vice versa. Hence there are two types, namely divisive and agglomerative. Let's discuss them in detail.

Divisive Method

In the divisive method, we assume that all of the observations belong to a single cluster and then divide the cluster into the two least similar clusters. This is repeated recursively on each cluster until there is one cluster for each observation.
This technique is also called DIANA, an acronym for Divisive Analysis.

Agglomerative Method

It is also known as Hierarchical Agglomerative Clustering (HAC) or AGNES (an acronym for Agglomerative Nesting). In this method, each observation is first assigned to its own cluster. Then, the similarity (or distance) between each pair of clusters is computed and the two most similar clusters are merged into one. These steps are repeated until there is only one cluster left.

Please note that the divisive method is good for identifying large clusters, while the agglomerative method is good for identifying small clusters. We will proceed with agglomerative clustering for the rest of the article, since HACs account for the majority of hierarchical clustering algorithms while divisive methods are rarely used.

Now that we have a general overview of hierarchical clustering, let's get familiar with the algorithm.

HAC Algorithm

Given a set of N items to be clustered and an N×N distance (or similarity) matrix, the basic process of Johnson's (1967) hierarchical clustering is:

1. Assign each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one less cluster.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

In steps 2 and 3, the algorithm needs a notion of similarity among clusters. So, before any clustering is performed, the distance matrix specifying the distance between each pair of data points must be determined using some distance function (Euclidean, Manhattan, Minkowski, etc.).
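The four steps above can be sketched in a few lines. The snippet below is an illustrative Python toy (1-D points, single linkage, names of my choosing), not the implementation behind hclust or agnes:

```python
from itertools import combinations

def single_linkage(points, k):
    # Johnson's procedure: start with one cluster per item, then repeatedly
    # merge the two closest clusters until k clusters remain
    clusters = [[p] for p in points]
    d = lambda a, b: abs(a - b)  # 1-D Euclidean distance for simplicity
    while len(clusters) > k:
        # single linkage: cluster distance = shortest pairwise distance
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: min(d(a, b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

print(single_linkage([1.0, 1.1, 5.0, 5.2, 9.9], 3))
# → [[1.0, 1.1], [5.0, 5.2], [9.9]]
```

Running the full loop down to k = 1 and recording each merge would give exactly the dendrogram that hclust draws.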
Then, the matrix is updated to specify the distance between the different clusters that are formed as a result of merging. But how do we measure the distance (or similarity) between two clusters of observations? For this, we have the three methods described below; they differ in how the distance between two clusters is measured.

Single Linkage

Also known as the connectedness or minimum method. Here, the distance between one cluster and another is taken to be the shortest distance from any data point of one cluster to any data point of the other. That is, distance is based on the similarity of the closest pair of data points, as shown in the figure. It tends to produce long, "loose" clusters. A disadvantage of this method is that it can cause premature merging of groups with close pairs, even if those groups are quite dissimilar overall.

Complete Linkage

This method is also called the diameter or maximum method. Here we consider the similarity of the furthest pair. That is, the distance between one cluster and another is taken to be the longest distance from any member of one cluster to any member of the other. It tends to produce more compact clusters. One drawback is that outliers can cause close groups to be merged later than is optimal.

Average Linkage

In the average linkage method, we take the distance between one cluster and another to be the average distance from any member of one cluster to any member of the other. A variation on average-link clustering is the UCLUS method of D'Andrade (1978), which uses the median distance instead of the mean distance.

Ward's Method

Ward's method aims to minimize the total within-cluster variance. At each step, the pair of clusters with minimum between-cluster distance is merged. In other words, it forms clusters in a manner that minimizes the loss associated with each merge.
At each step, the union of every possible cluster pair is considered, and the two clusters whose merger results in the minimum increase in information loss are combined. Here, information loss is defined by Ward in terms of an error sum-of-squares criterion (ESS). If you want a mathematical treatment of this, visit this link. The following table describes the mathematical equations for each of the methods, where:

• X1, X2, …, Xk = observations from cluster 1
• Y1, Y2, …, Yl = observations from cluster 2
• d(x, y) = distance between a subject with observation vector x and a subject with observation vector y
• ||.|| = Euclidean norm

Implementing Hierarchical Clustering in R

Data Preparation

To perform clustering in R, the data should be prepared as per the following guidelines:

1. Rows should contain observations (or data points) and columns should contain variables.
2. Check if your data has any missing values; if yes, remove or impute them.
3. Data across columns must be standardized or scaled, to make the variables comparable.

We'll use a data set called 'Freedman' from the 'car' package. The 'Freedman' data frame has 110 rows and 4 columns. The observations are U.S. metropolitan areas with 1968 populations of 250,000 or more. There are some missing data. (Make sure you install the car package before proceeding.)

data <- car::Freedman

To remove any missing values that might be present in the data, type this:

data <- na.omit(data)

This simply removes any row that contains missing values. You can use more sophisticated methods for imputing missing values; however, we'll skip this as it is beyond the scope of the article. Also, we will be using numeric variables here for the sake of simply demonstrating how clustering is done in R. Categorical variables would require special treatment, which is also not within the scope of this article. Therefore, we have selected a data set with numeric variables alone for conciseness.
Next, we scale all the numeric variables. Scaling gives each variable mean zero and standard deviation one. Ideally, you want a unit in each coordinate to represent the same degree of difference; scaling makes the standard deviation the unit of measurement in each coordinate, so that the clustering algorithm does not depend on arbitrary variable units. You can scale/standardize the data using the R function scale:

data <- scale(data)

Implementing Hierarchical Clustering in R

There are several functions available in R for hierarchical clustering. Commonly used ones include:

• 'hclust' (stats package) and 'agnes' (cluster package) for agglomerative hierarchical clustering
• 'diana' (cluster package) for divisive hierarchical clustering

Agglomerative Hierarchical Clustering

The 'hclust' function requires distance values, which can be computed in R with the 'dist' function. The default measure for dist is Euclidean, but you can change it with the method argument. We also need to specify the linkage method we want to use (e.g. "complete", "average", "single", "ward.D"):

# Dissimilarity matrix
d <- dist(data, method = "euclidean")

# Hierarchical clustering using complete linkage
hc1 <- hclust(d, method = "complete")

# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1)

An alternative is the agnes function. The two functions are quite similar, but agnes also reports the agglomerative coefficient, which measures the amount of clustering structure found (values closer to 1 suggest a strong clustering structure).
# Compute with agnes (make sure the cluster package is installed)
hc2 <- agnes(data, method = "complete")

# Agglomerative coefficient
hc2$ac
## [1] 0.9317012

Let's compare the methods discussed:

# vector of methods to compare
m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")

# function to compute the coefficient
ac <- function(x) {
  agnes(data, method = x)$ac
}

map_dbl(m, ac)  ## from the 'purrr' package
##   average    single  complete      ward
## 0.9241325 0.9215283 0.9317012 0.9493598

Ward's method gives the highest agglomerative coefficient, so let's look at its dendrogram:

hc3 <- agnes(data, method = "ward")
pltree(hc3, cex = 0.6, hang = -1, main = "Dendrogram of agnes")

Divisive Hierarchical Clustering

The 'diana' function in the cluster package performs divisive hierarchical clustering. It works much like 'agnes', but there is no method argument, and instead of an agglomerative coefficient it reports a divisive coefficient.

# compute divisive hierarchical clustering
hc4 <- diana(data)

# Divisive coefficient
hc4$dc
## [1] 0.9305939

# plot dendrogram
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of diana")

A dendrogram is a cluster tree in which each group is linked to two or more successor groups; these groups are nested and organized as a tree. You can manipulate dendrograms, and do much more with them, using the "dendextend" package; check out the package's vignette online.

Great! Now we understand how to perform clustering and produce dendrograms. Let us move to the final step: assigning clusters to the data points.

This can be done with the R function cutree. It cuts a tree (dendrogram), as produced by hclust (or diana/agnes), into several groups, either by specifying the desired number of groups (k) or the cut height (h). At least one of k or h must be specified; k overrides h if both are given.
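The k-versus-h behaviour of cutree can be sketched on toy data (random points of our own, not the Freedman set):

```r
# Toy data: 10 random 2-D points, clustered with default complete linkage
set.seed(1)
toy <- matrix(rnorm(20), ncol = 2)
hc  <- hclust(dist(toy))

by_k <- cutree(hc, k = 3)    # cut so that exactly 3 groups result
by_h <- cutree(hc, h = 1.5)  # cut at height 1.5; group count follows from the tree

length(unique(by_k))  # 3, by construction
```

Cutting by k is the common choice when you have a target number of clusters in mind; cutting by h is handy when the dendrogram itself suggests a natural height at which to slice.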

Following our demo, let's assign clusters for the tree obtained with the diana function (under Divisive Hierarchical Clustering above):

clust <- cutree(hc4, k = 5)

We can also use the fviz_cluster function from the factoextra package to visualize the result in a scatter plot.

fviz_cluster(list(data = data, cluster = clust)) ## from ‘factoextra’ package

You can also visualize the clusters inside the dendrogram itself by drawing borders around them:

pltree(hc4, hang=-1, cex = 0.6)

rect.hclust(hc4, k = 9, border = 2:10)

dendextend

The dendextend package supports many other manipulations of dendrograms, such as changing the color of labels, changing the size of the label text, and changing the type and colors of branch lines. You can also change or sort the label text. The package's online vignette covers the many options.
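A few of those manipulations can be sketched as follows (this assumes the dendextend package is installed; set and color_branches are from its documented API, and the built-in USArrests data set is used purely for illustration):

```r
library(dendextend)

# Build a small dendrogram from the first 15 rows of a built-in data set
dend <- as.dendrogram(hclust(dist(USArrests[1:15, ])))

dend <- set(dend, "labels_cex", 0.6)   # shrink the label text
dend <- set(dend, "branches_lwd", 2)   # thicken the branch lines
dend <- color_branches(dend, k = 3)    # color branches by a 3-cluster cut

plot(dend)
```

Each set() call returns a modified copy of the dendrogram, so the tweaks chain naturally.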

Another important function that the dendextend package offers is tanglegram. It compares two dendrograms (with the same set of labels), drawing one facing the other with their labels connected by lines.

As an example, let's compare the single and complete linkage methods with the agnes function.

library(dendextend)

hc_single   <- agnes(data, method = "single")
hc_complete <- agnes(data, method = "complete")

# convert to dendrogram objects, as dendextend works with dendrograms
hc_single   <- as.dendrogram(hc_single)
hc_complete <- as.dendrogram(hc_complete)

tanglegram(hc_single, hc_complete)

A tanglegram is useful for comparing two methods: the figure makes it easy to see where the two linkage methods agree and where they place observations in different parts of the tree.
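dendextend can also quantify such a comparison: its entanglement function returns a score between 0 (perfectly aligned trees) and 1 (maximally tangled). A sketch on a built-in data set (our own example, not from the article):

```r
library(dendextend)

# Two dendrograms over the same 10 labels, built with different linkages
d1 <- as.dendrogram(hclust(dist(USArrests[1:10, ]), method = "single"))
d2 <- as.dendrogram(hclust(dist(USArrests[1:10, ]), method = "complete"))

e <- entanglement(d1, d2)  # lower means the two layouts agree more
e
```

This gives a single number to report alongside the visual tanglegram comparison.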
You can find more functionalities of the dendextend package here.

End Notes

Hopefully you now have a better understanding of clustering algorithms than when you started. We discussed divisive and agglomerative clustering techniques and four linkage methods: single, complete, average, and Ward's method. We then implemented these techniques in R on a numeric data set; note that the data contained no categorical variables, which would need special treatment before they could be incorporated into a clustering algorithm. Lastly, we discussed a couple of plots for visualising the resulting clusters/groups. Note that we assumed the number of clusters 'k' was known; this is not always the case. There are a number of heuristics and rules of thumb for picking the number of clusters, and a given heuristic will work better on some data sets than others. It's best to use domain knowledge to help set the number of clusters, if that's possible; otherwise, try a variety of heuristics, and perhaps a few different values of k.
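One such heuristic can be sketched with the cluster package's silhouette function — the average silhouette width over a range of candidate k values (our own example on the built-in USArrests data, not part of the article):

```r
library(cluster)

# Distance matrix and a Ward-linkage tree on scaled built-in data
d  <- dist(scale(USArrests))
hc <- hclust(d, method = "ward.D2")

# Average silhouette width for k = 2..6 cluster cuts
avg_sil <- sapply(2:6, function(k) {
  cl <- cutree(hc, k = k)
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

which.max(avg_sil)  # the candidate k with the best average silhouette
```

Larger average silhouette widths indicate that points sit well inside their own cluster relative to the nearest other cluster, so the k that maximizes this value is a reasonable default choice.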

Consolidated code

install.packages('cluster')
install.packages('purrr')
install.packages('factoextra')
install.packages('dendextend')

library(cluster)
library(purrr)
library(factoextra)

data <- car::Freedman
data <- na.omit(data)
data <- scale(data)

d <- dist(data, method = "euclidean")
hc1 <- hclust(d, method = "complete")
plot(hc1, cex = 0.6, hang = -1)

hc2 <- agnes(data, method = "complete")
hc2$ac

m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")
ac <- function(x) {
  agnes(data, method = x)$ac
}
map_dbl(m, ac)

hc3 <- agnes(data, method = "ward")
pltree(hc3, cex = 0.6, hang = -1, main = "Dendrogram of agnes")

hc4 <- diana(data)
hc4$dc
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of diana")

clust <- cutree(hc4, k = 5)
fviz_cluster(list(data = data, cluster = clust))

pltree(hc4, hang = -1, cex = 0.6)
rect.hclust(hc4, k = 9, border = 2:10)

library(dendextend)
hc_single <- agnes(data, method = "single")
hc_complete <- agnes(data, method = "complete")
# convert to dendrogram objects, as dendextend works with dendrograms
hc_single <- as.dendrogram(hc_single)
hc_complete <- as.dendrogram(hc_complete)
tanglegram(hc_single, hc_complete)

Author Bio: This article was contributed by Perceptive Analytics. Amanpreet Singh, Chaitanya Sagar, Jyothirmayee Thondamallu and Saneesh Veetil contributed to this article. Perceptive Analytics provides data analytics, business intelligence and reporting services to e-commerce, retail and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE-listed companies in the USA and India.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

### 9th MilanoR Meeting – Dataviz with R | Photos and presentations!

Mon, 12/18/2017 - 18:25

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Hello R-users,

On November 20th we had the pleasure to host the 9th MilanoR meeting in Mikamai! Lots of you participated with passion and enthusiasm, and we're so glad of the outcome. Thank you all, from the bottom of our hearts!

This time the meeting focused on a thrilling and very useful subject: Data Visualization. This post is for those of you who participated and want to look through the materials again, and for everyone who could not make it.

As you know, MilanoR meetings have a defined structure: two talks and the free buffet. But this time the structure was different! After the speech from our sponsors, we had an introductory talk, a presentation by Andrea Cirillo, and then four lightning talks of 5 minutes each. But don't worry: the free buffet and networking time were still there!

The introductory talk was hosted by Mariachiara Fortuna, from MilanoR; she gave us a brief overview of the history of Data Visualization with R, how it changed over time and how it probably will change, as well as what Data Visualization with R actually is now and how we can use it. The presentation frames the R world of dataviz, from Shiny to ggplot2, to htmlwidgets and more.

Dataviz with R Overview by MilanoR – Mariachiara Fortuna

When it was Andrea Cirillo's turn to take the stage, we were transported into the middle of the world of Data Visualization. Have you ever dreamt of giving the colors of the Mona Lisa to your graphic? Now you can, with Andrea's package paletter!
paletter takes the colors from your chosen image and optimizes them to give you the perfect graphic palette. You can download the presentation and the package itself via the links below.

Automatically build the perfect palette for your plot with paletteR + paletteR package, by Andrea Cirillo

Then it was the turn of the lightning talks! Stefano Biffani explained the process of getting from modeling to visualization, with a great example: the fertility of bunnies! The second talk was from Stefano Barberis, who showed us how to map data with R using Google Earth and the plotKML package.

From modelling to visualization by Stefano Biffani

Interactive maps and spatial visualizations with R by Stefano Barberis

Nicola Pesenti introduced us to Highchart, a wonderful library for creating interactive charts and plots. We highly suggest giving it a look! At last, Moreno Riccardi closed the meeting with his account of the use of Data Visualization in his work. Moreno is a data scientist working with the well-known ESP MailUp; he uses Shiny to monitor customers' behavior and prevent abuse of the platform.

by Nicola Pesenti

by Moreno Riccardi

During the buffet and networking time we had the chance to see many smiling faces; it's wonderful to see so many people brought together by a shared passion. Of course none of this would have been possible without our sponsors: Quantide, a provider of consulting services and training courses specialized in R, which has supported MilanoR meetings since the first edition (here is Quantide's presentation at the meeting), and Open Search Network, a headhunting company focused on Italian Data & Digital Experts, which has chosen to support us for the first time. Our many thanks also go to Le Madeleine Catering, to our photographer Alberto Morici and of course to Mikamai, which once again hosted our meeting free of charge.
We really had fun, and if you did too, please stay in touch with us: many other dates and meeting occasions are coming! Our next event will focus on the fascinating subject of personalized healthcare with R [R-Lab#5, hands on code, on January 16th in Mikamai – subscription should open here in a few days, if we manage not to sink into the Christmas funnel of things to do]. Until then, thanks again and happy holidays!

The post 9th MilanoR Meeting – Dataviz with R | Photos and presentations! appeared first on MilanoR.

To leave a comment for the author, please follow the link and comment on their blog: MilanoR.

### New Course: Working with Dates & Times in R

Mon, 12/18/2017 - 16:58

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Hello R users! We just launched another course: Working with Dates and Times in R by Charlotte Wickham!

Dates and times are abundant in data and essential for answering questions that start with when, how long, or how often. However, they can be tricky, as they come in a variety of formats and can behave in unintuitive ways. This course teaches you the essentials of parsing, manipulating, and computing with dates and times in R. By the end, you'll have mastered the lubridate package, a member of the tidyverse, specifically designed to handle dates and times.
You'll also have applied your new skills to explore how often R versions are released, when the weather is good in Auckland (the birthplace of R), and how long monarchs ruled in Britain. Take me to chapter 1!

Working with Dates and Times in R features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you an expert in dates and times in R!

What you'll learn

1. Dates and Times in R — Dates and times come in a huge assortment of formats, so your first hurdle is often to parse the format you have into an R datetime. This chapter teaches you to import dates and times with the lubridate package. You'll also learn how to extract parts of a datetime. You'll practice by exploring the weather in R's birthplace, Auckland, NZ.

2. Parsing and Manipulating Dates and Times with lubridate — Dates and times come in a huge assortment of formats, so your first hurdle is often to parse the format you have into an R datetime. This chapter teaches you to import dates and times with the lubridate package. You'll also learn how to extract parts of a datetime. You'll practice by exploring the weather in R's birthplace, Auckland, NZ.

3. Arithmetic with Dates and Times — Getting datetimes into R is just the first step. Now that you know how to parse datetimes, you need to learn how to do calculations with them. In this chapter, you'll learn the different ways of representing spans of time with lubridate and how to leverage them to do arithmetic on datetimes. By the end of the chapter, you'll have calculated how long it's been since the first man stepped on the moon, generated sequences of dates to help schedule reminders, calculated when an eclipse occurs, and explored the reigns of monarchs of England (and which ones might have seen Halley's comet!).

4.
Problems in Practice — You now know most of what you need to tackle data that includes dates and times, but there are a few other problems you might encounter in practice. In this final chapter you'll learn a little more about these problems by returning to some of the earlier data examples and learning how to handle time zones, deal with times when you don't care about dates, parse dates quickly, and output dates and times.

Master dates and times in R with our latest course!

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

### Conference Cost

Mon, 12/18/2017 - 01:00

(This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers)

In last week's post we tantalised you with upcoming R & data science conferences, but from a cost point of view, not all R conferences are the same. Using the R conference site, it's fairly easy to compare the cost of previous R conferences. I selected the main conferences from the last few years and obtained the cost of the full ticket (including any tutorial days, but ignoring any discounts). Next, I converted all prices to dollars and calculated the cost per day.

Conference          Cost ($)  #Days  $ per day
eRum 2016                 30      3         10
SatRday 2017              82      3         27
useR! 2017               770      4        192
R Finance 2017           600      2        300
rstudio::conf 2018       995      3        331
New York R               725      2        362
Earl London             1191      3        397

The conferences fall into two camps, business-oriented conferences and more general R conferences; useR! is somewhere in the middle. A simple bar plot highlights the extreme difference in cost per day.
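The bar plot can be sketched from the table's figures (our reconstruction, assuming ggplot2 is available; the original post's exact code is not shown):

```r
library(ggplot2)

# Cost-per-day figures taken from the table above
conf <- data.frame(
  conference = c("eRum 2016", "SatRday 2017", "useR! 2017", "R Finance 2017",
                 "rstudio::conf 2018", "New York R", "Earl London"),
  per_day    = c(10, 27, 192, 300, 331, 362, 397)
)

# Horizontal bars, ordered by cost per day
p <- ggplot(conf, aes(reorder(conference, per_day), per_day)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Cost per day ($)")
p
```

Ordering the bars makes the gap between the community-run events and the business-oriented ones immediately visible.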

I think the organisers of eRum and SatRdays should be proud of themselves for keeping the cost down; it would also be really useful if those organisers wrote a blog post with general tips for keeping costs down.

I'm going to resist commenting on the conferences, since I have only ever attended useR! (which is excellent). In terms of speakers, most conferences have the same keynotes, so this year I'll be looking at eRum 2018 (sadly useR! is a bit too far away for me).


To leave a comment for the author, please follow the link and comment on their blog: R on The Jumping Rivers Blog.