
Edward Tufte’s Slopegraphs and political fortunes in Ontario

Mon, 05/21/2018 - 16:21

(This article was first published on eKonometrics, and kindly contributed to R-bloggers)

With fewer than three weeks left before the June 7 provincial election in Ontario, Canada's most populous province with 14.2 million people, the expected outcome is far from certain. The weekly opinion polls reflect the volatility in public opinion. The Progressive Conservatives (PC), one of the main opposition parties, are in the lead with the support of roughly 40 percent of the electorate. The incumbent Ontario Liberals are trailing, with their support hovering in the low 20s. The real story in this election is the unexpected rise in the fortunes of the New Democratic Party (NDP), which has seen a sustained increase in its popularity from less than 20 percent a few weeks ago to the mid-30s.

As a data scientist/journalist, I have been concerned with how best to represent this information. A scatter plot of sorts would do. However, I would like to demonstrate the change in political fortunes over time with the x-axis representing time. Hence, a time series chart would be more appropriate. Ideally, I would like to plot what Edward Tufte called a Slopegraph. Tufte, in his 1983 book The Visual Display of Quantitative Information, explained that "Slopegraphs compare changes usually over time for a list of nouns located on an ordinal or interval scale".

But here's the problem. No software offers a readymade solution to draw a Slopegraph. Luckily, I found a way, in fact two ways, around the challenge with help from colleagues at Stata and R (plotrix). So, what follows in this blog is the story of the elections in Ontario described with data visualized as Slopegraphs. I tell the story first with Stata and then with the plotrix package in R.

My interest in Slopegraphs grew when I wanted to demonstrate the steep increase in highly leveraged mortgage loans in Canada from 2014 to 2016. I generated the chart in Excel and sent it to Stata requesting help to recreate it. Stata assigned my request to Derek Wagner, whose excellent programming skills resulted in the following chart. Derek built the chart on the linkplot command written by the uber Stata guru, Professor Nicholas J. Cox. However, a straightforward application of linkplot still required a lot of tweaks that Derek very ably managed. For comparison, see the initial version of the chart generated by linkplot below.

We made the following modifications to the base linkplot:

1. Narrow the plot by reducing the space between the two time periods.
2. Label the entities and their respective values at the primary and secondary y-axes.
3. Add a title and footnotes (if necessary).
4. Label time periods with custom names.
5. Colour lines and symbols to match preferences.

Once we apply these tweaks, a Slopegraph with the latest poll data for Ontario's election is drawn as follows. Notice that in fewer than two weeks, the NDP has jumped from 29 percent to 34 percent, almost tying with the leading PC party, whose support has remained steady at 35 percent. The incumbent Ontario Liberals appear to be in free fall, from 29 percent to 24 percent.

I must admit that I have sort of cheated in the above chart. Note that both the Liberals and the NDP secured 29 percent of the support in the poll conducted on May 06. In the original chart drawn with Stata's code, their labels overlapped, resulting in unintelligible text. I fixed this manually by manipulating the image in PowerPoint.

I wanted to replicate the above chart in R. I tried a few packages, but nothing really worked until I landed on the plotrix package, which carries the bumpchart command.
In fact, Edward Tufte in Beautiful Evidence (2006) mentions that bumpcharts may be considered as slopegraphs. A straightforward application of bumpchart from the plotrix package labelled the party names but not the respective percentages of support each party commanded. Dr. Jim Lemon authored bumpchart, so I turned to him for help. Jim was kind enough to write a custom function, bumpchart2, that I used to create a Slopegraph like the one I generated with Stata. For comparison, see the chart below. As with the Slopegraph generated with Stata, I manually manipulated the labels to prevent the NDP and Liberal labels from overlapping. A rough sketch of what a basic plotrix-based attempt might look like follows.
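For readers who want to experiment themselves, here is a minimal, hedged sketch (not Jim Lemon's bumpchart2 and not the code behind the charts above) that feeds the poll numbers quoted in this post to plotrix::bumpchart(); the label for the second poll date is generic because the exact date is not given here, and the argument names are assumed to behave as documented in plotrix.

# A minimal sketch: rank = FALSE is assumed to plot the raw support values
# rather than ranks; rows are parties, columns are the two poll dates.
library(plotrix)

polls <- matrix(c(35, 29, 29,    # PC, Liberal, NDP on May 06
                  35, 24, 34),   # PC, Liberal, NDP in the later poll
                nrow = 3,
                dimnames = list(c("PC", "Liberal", "NDP"),
                                c("May 06", "Later poll")))

bumpchart(polls, rank = FALSE,
          top.labels = colnames(polls),
          labels = rownames(polls),
          col = c("blue", "red", "orange"))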

Data scientists must dig even deeper

The job of a data scientist, unlike a computer scientist or a statistician, is not done by estimating models and drawing figures. A data scientist must tell a story with all the caveats that might apply. So, here's the story about what can go wrong with polls.

The most important lesson about forecasting from Brexit and the last US Presidential election is that one cannot rely on polls to determine future electoral outcomes. Most polls in the UK predicted a NO vote for Brexit. In the US, most polls forecast Hillary Clinton to be the winner. Both forecasts went horribly wrong. When it comes to polls, one must determine who sponsored the poll, what methods were used, and how representative the sample is of the underlying population. Asking the wrong question to the right people, or posing the right question to the wrong people (a non-representative sample), can deliver problematic results.

Polling is as much science as it is art. The late Warren Mitofsky, who pioneered exit polls and innovated political survey research, remains a legend in political polling. His painstakingly cautious approach to polling is why he remains a respected name in market research. Today, advances in communication and information technologies have made survey research easier to conduct but more difficult to make precise. No longer can one rely on random digit dialling, a Mitofsky innovation, to reach a representative sample. Younger cohorts sparingly subscribe to landline telephones, and attempts to catch them online pose the risk of fishing for opinions in echo chambers. Add political polarization to the technological challenges, and one realizes the true scope of the difficulties inherent in taking the political pulse of an electorate, where a motivated pollster may be after not the truth, but a convenient version of it.

Polls also differ by survey instrument, methodology, and sample size. The Abacus Data poll presented above is essentially an online poll of 2,326 respondents. In comparison, a poll by Mainstreet Research used an Interactive Voice Response (IVR) system with a sample size of 2,350 respondents. IVR uses automated computerized responses over the telephone to record answers. Abacus Data and Mainstreet Research use quite different methods with similar sample sizes.

Professor Dan Cassino of Fairleigh Dickinson University explained the challenges with polling techniques in a 2016 article in the Harvard Business Review. He favours live telephone interviewers who "are highly experienced and college educated and paying them is the main cost of political surveys." Professor Cassino believes that techniques like IVR make "polling faster and cheaper," but these systems are hardly foolproof, with lower response rates, and they cannot legally reach cellphones. "IVR may work for populations of older, whiter voters with landlines, such as in some Republican primary races, but they're not generally useful," explained Professor Cassino. Similarly, online polls are limited in the sense that in the US alone 16 percent of Americans don't use the Internet.

With these caveats in mind, a plot of Mainstreet Research data reveals quite a different picture, one in which the NDP doesn't seem to pose an immediate and direct challenge to the PC party.

So, here's the summary. A Slopegraph is a useful tool to summarize change over time between distinct entities. Ontario is likely to have a new government on June 7. It is, though, far from certain whether the PC Party or the NDP will assume office. Nevertheless, Slopegraphs generate visuals that expose the uncertainty in the forthcoming elections. Note: To generate the charts in this blog, you can download data and code for Stata and Plotrix (R) by clicking HERE.



ML models: What they can’t learn?

Sun, 05/20/2018 - 23:03

(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)

What I love about conferences are the people who come up after your talk and say: it would be cool to add XYZ to your package/method/theorem.

After eRum (a great conference, by the way) I was lucky to hear from Tal Galili: it would be cool to use DALEX for teaching, to show how different ML models learn relations.

Cool idea. So let's see what can and what cannot be learned by the most popular ML models. Here we will compare random forests against linear models and SVMs.
Find the full example here. We simulate variables from a uniform U[0,1] distribution and calculate y from the following equation.

In all figures below we compare PDP model responses against the true relation between the variable x and the target variable y (pink colour). All these plots are created with the DALEX package.
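The post links to the full example; as a rough, hedged sketch of what such a comparison could look like, the code below uses the current DALEX interface (explain() plus model_profile()); the original post may have relied on an older DALEX function, and the simulated data here is only illustrative.

# Minimal sketch: fit three models to a quadratic relation and compare their
# partial-dependence (PDP) curves for x1 with DALEX.
library(DALEX)
library(randomForest)
library(e1071)

set.seed(1)
n  <- 1000
x1 <- runif(n)
y  <- (x1 - 0.5)^2 + rnorm(n, sd = 0.05)   # quadratic relation, as discussed below
df <- data.frame(y = y, x1 = x1)

m_lm  <- lm(y ~ x1, data = df)
m_rf  <- randomForest(y ~ x1, data = df)
m_svm <- svm(y ~ x1, data = df)

e_lm  <- explain(m_lm,  data = df["x1"], y = df$y, label = "lm",  verbose = FALSE)
e_rf  <- explain(m_rf,  data = df["x1"], y = df$y, label = "rf",  verbose = FALSE)
e_svm <- explain(m_svm, data = df["x1"], y = df$y, label = "svm", verbose = FALSE)

# One PDP curve per model, overlaid in a single plot.
plot(model_profile(e_lm,  variables = "x1"),
     model_profile(e_rf,  variables = "x1"),
     model_profile(e_svm, variables = "x1"))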

For x1 we can check how different models deal with a quadratic relation. The linear model fails without prior feature engineering, the random forest guesses the shape, but the best fit is found by the SVM.

With sine-like oscillations the story is different. SVMs are not that flexible, while the random forest gets much closer.

It turns out that monotonic relations are not easy for these models. The random forest is close, but even here we cannot guarantee monotonicity.

The linear model is the best one when it comes to a truly linear relation. But the other models are not that far off.

The abs(x) relation is not an easy case for any of these models.

Find the R codes here.

Of course, the behaviour of all these models depends on the number of observations, the noise-to-signal ratio, the correlation among variables, and interactions.
Yet it may be educational to use PDP curves to see how different models learn relations: what they can grasp easily and what they cannot.



Rcpp 0.12.17: More small updates

Sun, 05/20/2018 - 16:26

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Another bi-monthly update and the seventeenth release in the 0.12.* series of Rcpp landed on CRAN late on Friday, following nine (!!) days of gestation in the incoming/ directory of CRAN. And no complaints: we just wish CRAN were a little more forthcoming about what is happening when, and/or would let us help by supplying additional test information. I do run a fairly insane amount of backtests prior to releases; having to then wait another week or more is … not ideal. But again, we all owe CRAN an immense amount of gratitude for all they do, and do so well.

So once more, this release follows the 0.12.0 release from July 2015, the 0.12.1 release in September 2015, the 0.12.2 release in November 2015, the 0.12.3 release in January 2016, the 0.12.4 release in March 2016, the 0.12.5 release in May 2016, the 0.12.6 release in July 2016, the 0.12.7 release in September 2016, the 0.12.8 release in November 2016, the 0.12.9 release in January 2017, the 0.12.10 release in March 2017, the 0.12.11 release in May 2017, the 0.12.12 release in July 2017, the 0.12.13 release in late September 2017, the 0.12.14 release in November 2017, the 0.12.15 release in January 2018 and the 0.12.16 release in March 2018, making it the twenty-first release at the steady and predictable bi-monthly release frequency.

Rcpp has become the most popular way of enhancing GNU R with C or C++ code. As of today, 1362 packages on CRAN depend on Rcpp for making analytical code go faster and further, along with another 138 in the current BioConductor release 3.7.
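As a quick, hedged illustration (not part of the release notes) of what that enhancement looks like in practice, Rcpp::cppFunction() compiles a small C++ function and makes it callable from R in a few lines:

# Minimal sketch: compile a C++ sum function and call it from R.
library(Rcpp)

cppFunction('
double sumC(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); ++i) total += x[i];
  return total;
}
')

sumC(c(1, 2, 3.5))  # returns 6.5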

Compared to other releases, this release again contains a relatively small change set, but between them Kevin and Romain cleaned a few things up. Full details are below.

Changes in Rcpp version 0.12.17 (2018-05-09)
  • Changes in Rcpp API:

    • The random number Generator class no longer inherits from RNGScope (Kevin in #837 fixing #836).

    • A spurious parenthesis was removed to please gcc8 (Dirk fixing #841)

    • The optional Timer class header now undefines FALSE which was seen to have side-effects on some platforms (Romain in #847 fixing #846).

    • Optional StoragePolicy attributes now also work for string vectors (Romain in #850 fixing #849).

Thanks to CRANberries, you can also look at a diff to the previous release. As always, details are on the Rcpp Changelog page and the Rcpp page which also leads to the downloads page, the browseable doxygen docs and zip files of doxygen output for the standard formats. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



Statistics Sunday: Welcome to Sentiment Analysis with “Hotel California”

Sun, 05/20/2018 - 15:30

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


Welcome to the Hotel California

As promised in last week's post, this week: sentiment analysis, also with song lyrics.

Sentiment analysis is a method of natural language processing that involves classifying words in a document based on whether a word is positive or negative, or whether it is related to a set of basic human emotions; the exact results differ based on the sentiment analysis method selected. The tidytext R package has 4 different sentiment analysis methods (a quick way to peek at the underlying lexicons is sketched after this list):

  • “AFINN” for Finn Årup Nielsen – which classifies words from -5 to +5 in terms of negative or positive valence
  • “bing” for Bing Liu and colleagues – which classifies words as either positive or negative
  • “loughran” for Loughran-McDonald – mostly for financial and nonfiction works, which classifies as positive or negative, as well as topics of uncertainty, litigious, modal, and constraining
  • “nrc” for the NRC lexicon – which classifies words into eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as positive or negative sentiment
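As a quick aside (not part of the original walkthrough), each lexicon can be pulled directly with tidytext::get_sentiments(); the sketch below assumes the lexicons are available locally (newer tidytext versions may prompt you to download some of them via the textdata package).

# Peek at the word/sentiment pairs the analysis below relies on.
library(tidytext)
library(dplyr)

get_sentiments("bing") %>% head()   # word + positive/negative
get_sentiments("nrc") %>% head()    # word + emotion or positive/negative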

Sentiment analysis works on unigrams – single words – but you can aggregate across multiple words to look at sentiment across a text.

To demonstrate sentiment analysis, I’ll use one of my favorite songs: “Hotel California” by the Eagles.

I know, I know.

Using similar code as last week, let’s pull in the lyrics of the song.

library(geniusR)
library(tidyverse)
hotel_calif <- genius_lyrics(artist = "Eagles", song = "Hotel California") %>%
mutate(line = row_number())

First, we’ll chop up these 43 lines into individual words, using the tidytext package and unnest_tokens function.

library(tidytext)
tidy_hc <- hotel_calif %>%
unnest_tokens(word,lyric)

This is also probably the point I would remove stop words with anti_join. But these common words are very unlikely to have a sentiment attached to them, so I’ll leave them in, knowing they’ll be filtered out anyway by this analysis. We have 4 lexicons to choose from. Loughran is more financial and textual, but we’ll still see how well it can classify the words anyway. First, let’s create a data frame of our 4 sentiment lexicons.

new_sentiments <- sentiments %>%
mutate( sentiment = ifelse(lexicon == "AFINN" & score >= 0, "positive",
ifelse(lexicon == "AFINN" & score < 0,
"negative", sentiment))) %>%
group_by(lexicon) %>%
mutate(words_in_lexicon = n_distinct(word)) %>%
ungroup()

Now, we’ll see how well the 4 lexicons match up with the words in the lyrics. Big thanks to Debbie Liske at Data Camp for this piece of code (and several other pieces used in this post):

my_kable_styling <- function(dat, caption) {
kable(dat, "html", escape = FALSE, caption = caption) %>%
kable_styling(bootstrap_options = c("striped", "condensed", "bordered"),
full_width = FALSE)
}


library(kableExtra)
library(formattable)
library(yarrr)
tidy_hc %>%
mutate(words_in_lyrics = n_distinct(word)) %>%
inner_join(new_sentiments) %>%
group_by(lexicon, words_in_lyrics, words_in_lexicon) %>%
summarise(lex_match_words = n_distinct(word)) %>%
ungroup() %>%
mutate(total_match_words = sum(lex_match_words),
match_ratio = lex_match_words/words_in_lyrics) %>%
select(lexicon, lex_match_words, words_in_lyrics, match_ratio) %>%
mutate(lex_match_words = color_bar("lightblue")(lex_match_words),
lexicon = color_tile("lightgreen","lightgreen")(lexicon)) %>%
my_kable_styling(caption = "Lyrics Found In Lexicons")
## Joining, by = "word"
Lyrics Found In Lexicons

lexicon   lex_match_words  words_in_lyrics  match_ratio
AFINN     18               175              0.1028571
bing      18               175              0.1028571
loughran  1                175              0.0057143
nrc       23               175              0.1314286

NRC offers the best match, classifying about 13% of the words in the lyrics. (It’s not unusual to have such a low percentage. Not all words have a sentiment.)

hcsentiment <- tidy_hc %>%
inner_join(get_sentiments("nrc"), by = "word")

hcsentiment
## # A tibble: 103 x 4
## track_title line word sentiment
##
## 1 Hotel California 1 dark sadness
## 2 Hotel California 1 desert anger
## 3 Hotel California 1 desert disgust
## 4 Hotel California 1 desert fear
## 5 Hotel California 1 desert negative
## 6 Hotel California 1 desert sadness
## 7 Hotel California 1 cool positive
## 8 Hotel California 2 smell anger
## 9 Hotel California 2 smell disgust
## 10 Hotel California 2 smell negative
## # ... with 93 more rows

Let’s visualize the counts of different emotions and sentiments in the NRC lexicon.

theme_lyrics <- function(aticks = element_blank(),
pgminor = element_blank(),
lt = element_blank(),
lp = "none")
{
theme(plot.title = element_text(hjust = 0.5), #Center the title
axis.ticks = aticks, #Set axis ticks to on or off
panel.grid.minor = pgminor, #Turn the minor grid lines on or off
legend.title = lt, #Turn the legend title on or off
legend.position = lp) #Turn the legend on or off
}

hcsentiment %>%
group_by(sentiment) %>%
summarise(word_count = n()) %>%
ungroup() %>%
mutate(sentiment = reorder(sentiment, word_count)) %>%
ggplot(aes(sentiment, word_count, fill = -word_count)) +
geom_col() +
guides(fill = FALSE) +
theme_minimal() + theme_lyrics() +
labs(x = NULL, y = "Word Count") +
ggtitle("Hotel California NRC Sentiment Totals") +
coord_flip()

Most of the words appear to be positively-valenced. How do the individual words match up?

library(ggrepel)

plot_words <- hcsentiment %>%
group_by(sentiment) %>%
count(word, sort = TRUE) %>%
arrange(desc(n)) %>%
ungroup()

plot_words %>%
ggplot(aes(word, 1, label = word, fill = sentiment)) +
geom_point(color = "white") +
geom_label_repel(force = 1, nudge_y = 0.5,
direction = "y",
box.padding = 0.04,
segment.color = "white",
size = 3) +
facet_grid(~sentiment) +
theme_lyrics() +
theme(axis.text.y = element_blank(), axis.line.x = element_blank(),
axis.title.x = element_blank(), axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
panel.grid = element_blank(), panel.background = element_blank(),
panel.border = element_rect("lightgray", fill = NA),
strip.text.x = element_text(size = 9)) +
xlab(NULL) + ylab(NULL) +
ggtitle("Hotel California Words by NRC Sentiment") +
coord_flip()

It looks like some words are being misclassified. For instance, “smell” as in “warm smell of colitas” is being classified as anger, disgust, and negative. But that doesn’t explain the overall positive bent being applied to the song. If you listen to the song, you know it’s not really a happy song. It starts off somewhat negative – or at least, ambiguous – as the narrator is driving on a dark desert highway. He’s tired and having trouble seeing, and notices the Hotel California, a shimmering oasis on the horizon. He stops in and is greeted by a “lovely face” in a “lovely place.” At the hotel, everyone seems happy: they dance and drink, they have fancy cars, they have pretty “friends.”

But the song is in a minor key. Though not always a sign that a song is sad, it is, at the very least, a hint of something ominous, lurking below the surface. Soon, things turn bad for the narrator. The lovely-faced woman tells him they are “just prisoners here of our own device.” He tries to run away, but the night man tells him, “You can check out anytime you like, but you can never leave.”

The song seems to be a metaphor for something, perhaps fame and excess, which was also the subject of another song on the same album, “Life in the Fast Lane.” To someone seeking fame, life is dreary, dark, and deserted. Fame is like an oasis – beautiful and shimmering, an escape. But it isn’t all it appears to be. You may be surrounded by beautiful people, but you can only call them “friends.” You trust no one. And once you join that lifestyle, you might be able to check out, perhaps through farewell tour(s), but you can never leave that life – people know who you are (or were) and there’s no disappearing. And it could be about something even darker that it’s hard to escape from, like substance abuse. Whatever meaning you ascribe to the song, the overall message seems to be that things are not as wonderful as they appear on the surface.

So if we follow our own understanding of the song’s trajectory, we’d say it starts off somewhat negatively, becomes positive in the middle, then dips back into the negative at the end, when the narrator tries to escape and finds he cannot.

We can chart this, using the line number, which coincides with the location of the word in the song. We’ll stick with NRC since it offered the best match, but for simplicity, we’ll only pay attention to the positive and negative sentiment codes.

hcsentiment_index <- tidy_hc %>%
inner_join(get_sentiments("nrc")%>%
filter(sentiment %in% c("positive",
"negative"))) %>%
count(index = line, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"

This gives us a data frame that aggregates sentiment by line. If a line contains more positive than negative words, its overall sentiment is positive, and vice versa. Because not every word in the lyrics has a sentiment, not every line has an associated aggregate sentiment. But it gives us a sort of trajectory over the course of the song. We can visualize this trajectory like this:

hcsentiment_index %>%
ggplot(aes(index, sentiment, fill = sentiment > 0)) +
geom_col(show.legend = FALSE)

As the chart shows, the song starts somewhat positive, with a dip soon after into the negative. The middle of the song is positive, as the narrator describes the decadence of the Hotel California. But it turns dark at the end, and stays that way as the guitar solo soars in.

Sources

This awesome post by Debbie Liske, mentioned earlier, provided the code and custom functions used to make my charts pretty.

Text Mining with R: A Tidy Approach by Julia Silge and David Robinson



Beautiful and Powerful Correlation Tables in R

Sun, 05/20/2018 - 02:00

(This article was first published on Dominique Makowski, and kindly contributed to R-bloggers)

Another correlation function?!

Yes, the correlation function from the psycho package.

devtools::install_github("neuropsychology/psycho.R")  # Install the newest version

library(psycho)
library(tidyverse)

cor <- psycho::affective %>%
  correlation()

This function automatically selects numeric variables and runs a correlation analysis. It returns a psychobject.

A table

We can then extract a formatted table that can be saved and pasted into reports and manuscripts by using the summary function.

summary(cor)
# write.csv(summary(cor), "myformattedcortable.csv")

                   Age    Life_Satisfaction  Concealing  Adjusting
Age
Life_Satisfaction  0.03
Concealing         -0.05  -0.06
Adjusting          0.03   0.36***            0.22***
Tolerating         0.03   0.15***            0.07        0.29***

A Plot

It integrates a plot done with ggcorrplot.

plot(cor)

A print

It also includes a pairwise correlation printing method.

print(cor)

Pearson Full correlation (p value correction: holm):
- Age / Life_Satisfaction: Results of the Pearson correlation showed a non significant and weak negative association between Age and Life_Satisfaction (r(1249) = 0.030, p > .1).
- Age / Concealing: Results of the Pearson correlation showed a non significant and weak positive association between Age and Concealing (r(1249) = -0.050, p > .1).
- Life_Satisfaction / Concealing: Results of the Pearson correlation showed a non significant and weak positive association between Life_Satisfaction and Concealing (r(1249) = -0.063, p > .1).
- Age / Adjusting: Results of the Pearson correlation showed a non significant and weak negative association between Age and Adjusting (r(1249) = 0.027, p > .1).
- Life_Satisfaction / Adjusting: Results of the Pearson correlation showed a significant and moderate negative association between Life_Satisfaction and Adjusting (r(1249) = 0.36, p < .001***).
- Concealing / Adjusting: Results of the Pearson correlation showed a significant and weak negative association between Concealing and Adjusting (r(1249) = 0.22, p < .001***).
- Age / Tolerating: Results of the Pearson correlation showed a non significant and weak negative association between Age and Tolerating (r(1249) = 0.031, p > .1).
- Life_Satisfaction / Tolerating: Results of the Pearson correlation showed a significant and weak negative association between Life_Satisfaction and Tolerating (r(1249) = 0.15, p < .001***).
- Concealing / Tolerating: Results of the Pearson correlation showed a non significant and weak negative association between Concealing and Tolerating (r(1249) = 0.074, p = 0.05°).
- Adjusting / Tolerating: Results of the Pearson correlation showed a significant and weak negative association between Adjusting and Tolerating (r(1249) = 0.29, p < .001***).

Options

You can also customize the type (pearson, spearman or kendall), the p value correction method (holm (default), bonferroni, fdr, none…) and run partial, semi-partial or glasso correlations.

psycho::affective %>%
  correlation(method = "pearson", adjust = "bonferroni", type = "partial") %>%
  summary()

                   Age    Life_Satisfaction  Concealing  Adjusting
Age
Life_Satisfaction  0.01
Concealing         -0.06  -0.16***
Adjusting          0.02   0.36***            0.25***
Tolerating         0.02   0.06               0.02        0.24***

Fun with p-hacking

In order to prevent people from running many uncorrected correlation tests (promoting p-hacking and result-fishing), we included the i_am_cheating parameter. If FALSE (the default), the function will help you find interesting results!

df_with_11_vars <- data.frame(replicate(11, rnorm(1000)))
cor <- correlation(df_with_11_vars, adjust = "none")

## Warning in correlation(df_with_11_vars, adjust = "none"): We've detected that you are running a lot (> 10) of correlation tests without adjusting the p values. To help you in your p-fishing, we've added some interesting variables: You never know, you might find something significant!
## To deactivate this, change the 'i_am_cheating' argument to TRUE.

summary(cor)

                            X1       X2       X3       X4       X5       X6       X7       X8       X9       X10      X11
X1
X2                        -0.04
X3                        -0.04    -0.02
X4                         0.02     0.05    -0.02
X5                        -0.01    -0.02     0.05    -0.03
X6                        -0.03     0.03     0.08*    0.02     0.02
X7                         0.03    -0.01    -0.02    -0.04    -0.03    -0.04
X8                         0.01    -0.07*    0.04     0.02    -0.01    -0.01     0.00
X9                        -0.02     0.03    -0.03    -0.02     0.00    -0.04     0.03    -0.02
X10                       -0.03     0.00     0.00     0.01     0.01    -0.01     0.01    -0.02     0.02
X11                        0.01     0.01    -0.03    -0.05     0.00     0.05     0.01     0.00    -0.01     0.07*
Local_Air_Density          0.26*** -0.02    -0.44*** -0.15*** -0.25*** -0.50***  0.57*** -0.11***  0.47***  0.06     0.01
Reincarnation_Cycle       -0.03    -0.02     0.02     0.04     0.01     0.00     0.05    -0.04    -0.05    -0.01     0.03
Communism_Level            0.58*** -0.44***  0.04     0.06    -0.10**  -0.18***  0.10**   0.46*** -0.50*** -0.21*** -0.14***
Alien_Mothership_Distance  0.00    -0.03     0.01     0.00    -0.01    -0.03    -0.04     0.01     0.01    -0.02     0.00
Schopenhauers_Optimism     0.11***  0.31*** -0.25***  0.64*** -0.29*** -0.15*** -0.35*** -0.09**   0.08*   -0.22*** -0.47***
Hulks_Power                0.03     0.00     0.02     0.03    -0.02    -0.01    -0.05    -0.01     0.00     0.01     0.03

As we can see, Schopenhauer’s Optimism is strongly related to many variables!!!

Credits

This package was useful? You can cite psycho as follows:

  • Makowski, (2018). The psycho Package: an Efficient and Publishing-Oriented Workflow for Psychological Science. Journal of Open Source Software, 3(22), 470. https://doi.org/10.21105/joss.00470


R/exams @ eRum 2018

Sun, 05/20/2018 - 00:00

(This article was first published on R/exams, and kindly contributed to R-bloggers)

Keynote lecture about R/exams at eRum 2018 (European R Users Meeting) in Budapest: Slides, video, e-learning, replication materials.

Keynote lecture at eRum 2018

R/exams was presented in a keynote lecture by Achim Zeileis at eRum 2018, the European R Users Meeting, this time organized by a team around Gergely Daróczi in Budapest. It was a great event with many exciting presentations, reflecting the vibrant R community in Europe (and beyond).

This blog post provides various resources accompanying the presentation which may be of interest to those who did not attend the meeting as well as those who did and who want to explore the materials in more detail.

Most importantly the presentation slides are available in PDF format (under CC-BY):

Video

The eRum organizers did a great job in making the meeting accessible to those useRs who could not make it to Budapest. All presentations were available in a livestream on YouTube, where videos of all lectures were also made available after the meeting (Standard YouTube License):

E-Learning

To illustrate the e-learning capabilities supported by R/exams, the presentation started with a live quiz using the audience response system ARSnova. The original version of the quiz was hosted on the ARSnova installation at Universität Innsbruck. To encourage readers to try out ARSnova for their own purposes, a copy of the quiz was also posted on the official ARSnova server at Technische Hochschule Mittelhessen (where ARSnova is developed under the General Public License, GPL):

The presentation briefly also showed an online test generated by R/exams and imported into OpenOLAT, an open-source learning management system (available under the Apache License). The online test is made available again here for anonymous guest access. (Note however, that the system only has one guest user so that when you start the test there may already be some test results from a previous guest session. In that case you can finish the test and also start it again.)

Replication code

The presentation slides show how to set up an exam using the R package and then rendering it into different output formats. In order to allow the same exam to be rendered into a wide range of different output formats, only single-choice and multiple-choice exercises were employed (see the choice list below). However, in the e-learning test shown in OpenOLAT all exercises types are supported (see the elearn list below). All these exercises are readily provided in the package and also introduced online: deriv/deriv2, fruit/fruit2, ttest, boxplots, cholesky, lm, function. The code below uses the R/LaTeX (.Rnw) version but the R/Markdown version (.Rmd) could also be used instead.

## package
library("exams")

## single-choice and multiple-choice only
choice <- list("deriv2.Rnw", "fruit2.Rnw", c("ttest.Rnw", "boxplots.Rnw"))

## e-learning test (all exercise types)
elearn <- c("deriv.Rnw", "fruit.Rnw", "ttest.Rnw", "boxplots.Rnw",
  "cholesky.Rnw", "lm.Rnw", "function.Rnw")

First, the exam with the choice-based questions can be easily turned into a PDF exam in NOPS format using exams2nops, here using Hungarian language for illustration. Exams in this format can be easily scanned and evaluated within R.

set.seed(2018-05-16)
exams2nops(choice, institution = "eRum 2018", language = "hu")

Second, the choice-based exam version can be exported into the JSON format for ARSnova: Rexams-1.json. This contains an entire ARSnova session that can be directly imported into the ARSnova system as shown above. It employs a custom exercise set up just for eRum (conferences.Rmd) as well as a slightly tweaked exercise (fruit3.Rmd) that displays better in ARSnova.

set.seed(2018-05-16)
exams2arsnova(list("conferences.Rmd", choice[[1]], "fruit3.Rmd", choice[[3]]),
  name = "R/exams", abstention = FALSE, fix_choice = TRUE)

Third, the e-learning exam can be generated in QTI 1.2 format for OpenOLAT, as shown above: eRum-2018.zip. The exams2openolat command below is provided starting from the current R/exams version 2.3-1. It essentially just calls exams2qti12 but slightly tweaks the MathJax output from pandoc so that it is displayed properly by OpenOLAT.

set.seed(2018-05-16)
exams2openolat(elearn, name = "eRum-2018", n = 10, qti = "1.2")

What else?

In the last part of the presentation a couple of new and ongoing efforts within the R/exams project are highlighted. First, the natural language support in NOPS exams is mentioned which was recently described in more detail in this blog. Second, the relatively new “stress tester” was illustrated with the following example. (A more detailed blog post will follow soon.)

s <- stresstest_exercise("deriv2.Rnw")
plot(s)

Finally, a psychometric analysis illustrated how to examine exams regarding: Exercise difficulty, student performance, unidimensionality, fairness. The replication code for the results from the slides is included below (omitting some graphical details for simplicity, e.g., labeling or color).

## load data and exclude extreme scorers
library("psychotools")
data("MathExam14W", package = "psychotools")
mex <- subset(MathExam14W, nsolved > 0 & nsolved < 13)

## raw data
plot(mex$solved)

## Rasch model parameters
mr <- raschmodel(mex$solved)
plot(mr, type = "profile")

## points per student
MathExam14W <- transform(MathExam14W,
  points = 2 * nsolved - 0.5 * rowSums(credits == 1)
)
hist(MathExam14W$points, breaks = -4:13 * 2 + 0.5, col = "lightgray")
abline(v = 12.5, lwd = 2, col = 2)

## person-item map
plot(mr, type = "piplot")

## principal component analysis
pr <- prcomp(mex$solved, scale = TRUE)
plot(pr)
biplot(pr, col = c("transparent", "black"),
  xlim = c(-0.065, 0.005), ylim = c(-0.04, 0.065))

## differential item functioning
mr1 <- raschmodel(subset(mex, group == 1)$solved)
mr2 <- raschmodel(subset(mex, group == 2)$solved)
ma <- anchortest(mr1, mr2, adjust = "single-step")

## anchored item difficulties
plot(mr1, parg = list(ref = ma$anchor_items), ref = FALSE, ylim = c(-2, 3), pch = 19)
plot(mr2, parg = list(ref = ma$anchor_items), ref = FALSE, add = TRUE, pch = 19, border = 4)
legend("topleft", paste("Group", 1:2), pch = 19, col = c(1, 4), bty = "n")

## simultaneous Wald test for pairwise differences
plot(ma$final_tests)


RcppGSL 0.3.5

Sat, 05/19/2018 - 22:11

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance update of RcppGSL just brought version 0.3.5 to CRAN, a mere twelve days after the RcppGSL 0.3.4 release. Just like yesterday's upload of inline 0.3.15, it was prompted by a CRAN request to update the per-package manual page; see the inline post for details.

The RcppGSL package provides an interface from R to the GNU GSL using the Rcpp package.

No user-facing new code or features were added. The NEWS file entries follow below:

Changes in version 0.3.5 (2018-05-19)
  • Update package manual page using references to DESCRIPTION file [CRAN request].

Courtesy of CRANberries, a summary of changes to the most recent release is available.

More information is on the RcppGSL page. Questions, comments etc should go to the issue tickets at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



openrouteservice – geodata!

Sat, 05/19/2018 - 20:56

(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

The openrouteservice provides a new method to get geodata into R. It has an API (or a set of them), and an R package has been written to communicate with said API(s); it is available from GitHub. I've just been playing around with the examples on this page, with the thought of using it for a project (more on that later if I get anywhere with it).

Anyways…onto the code…which is primarily a modification from the examples page I mentioned earlier (see that page for more examples).

devtools::install_github("GIScience/openrouteservice-r")

Load some libraries

library(openrouteservice)
library(leaflet)

Set the API key

ors_api_key("your-key-here")

Define the locations of interest and send the request to the API, asking for the region that is accessible within a 15-minute drive of each coordinate.

coordinates <- list(c(8.55, 47.23424), c(8.34234, 47.23424), c(8.44, 47.4))

x <- ors_isochrones(coordinates,
                    range = 60*15,          # maximum time to travel (15 mins)
                    interval = 60*15,       # results in bands of 60*15 seconds (15 mins)
                    intersections = FALSE)  # no intersection of polygons

By changing the interval to, say, 60*5, three regions per coordinate are returned, representing the regions accessible within 5, 10 and 15 minutes of driving (see the sketch below). Changing the intersections argument would produce a separate polygon for any overlapping regions. The information in the intersected polygons is limited though, so it might be better to do the intersection with other tools afterwards.
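As a sketch of that variant, the same call as above with only the interval and intersections arguments changed as just described:

# 5-minute bands up to 15 minutes, keeping overlaps as separate polygons.
x5 <- ors_isochrones(coordinates,
                     range = 60*15,         # maximum time to travel (15 mins)
                     interval = 60*5,       # bands of 5 minutes
                     intersections = TRUE)  # separate polygons for overlaps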

The results can be plotted with leaflet…

leaflet() %>%
  addTiles() %>%
  addGeoJSON(x) %>%
  fitBBox(x$bbox)

The blue regions are the three regions accessible within 15 minutes. A few overlapping regions are evident, each of which would be saved to a unique polygon had we set intersections to TRUE.

The results from the API come down in a GeoJSON format that is given a custom class, in this case ors_isochrones, which isn't recognized by many other tools, so you might want to convert it to an sp object, giving access to all of the tools for those formats. That's easy enough to do via the geojsonio package…

library(geojsonio)
class(x) <- "geo_list"
y <- geojson_sp(x)

library(sp)
plot(y)

You can also derive coordinates from (partial) addresses. Here is an example for a region of Bern in Switzerland, using the postcode.

coord <- ors_geocode("3012, Switzerland")

This resulted in 10 hits, the first of which was correct…the others were in different countries…

unlist(lapply(coord$features, function(x) x$properties$label))

 [1] "3012, Bern, Switzerland"
 [2] "A1, Bern, Switzerland"
 [3] "Bremgartenstrasse, Bern, Switzerland"
 [4] "131 Bremgartenstrasse, Bern, Switzerland"
 [5] "Briefeinwurf Bern, Gymnasium Neufeld, Bern, Switzerland"
 [6] "119 Bremgartenstrasse, Bern, Switzerland"
 [7] "Gym Neufeld, Bern, Switzerland"
 [8] "131b Bremgartenstrasse, Bern, Switzerland"
 [9] "Gebäude Nord, Bern, Switzerland"
[10] "113 Bremgartenstrasse, Bern, Switzerland"

The opposite (coordinate to address) is also possible, again returning multiple hits…

address <- ors_geocode(location = c(7.425898, 46.961598))
unlist(lapply(address$features, function(x) x$properties$label))

 [1] "3012, Bern, Switzerland"
 [2] "A1, Bern, Switzerland"
 [3] "Bremgartenstrasse, Bern, Switzerland"
 [4] "131 Bremgartenstrasse, Bern, Switzerland"
 [5] "Briefeinwurf Bern, Gymnasium Neufeld, Bern, Switzerland"
 [6] "119 Bremgartenstrasse, Bern, Switzerland"
 [7] "Gym Neufeld, Bern, Switzerland"
 [8] "131b Bremgartenstrasse, Bern, Switzerland"
 [9] "Gebäude Nord, Bern, Switzerland"
[10] "113 Bremgartenstrasse, Bern, Switzerland"

Other options are distances/times/directions between points and places of interest (POI) near a point or within a region.
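As a heavily hedged sketch (not from the original post), assuming the package exposes ors_directions() for routing between coordinates, a request might look roughly like this; check the package documentation for the exact interface before relying on it.

# Hypothetical routing request between two of the coordinates used above,
# plotted on the same kind of leaflet map.
route <- ors_directions(list(c(8.34234, 47.23424), c(8.55, 47.23424)),
                        profile = "driving-car")

leaflet() %>%
  addTiles() %>%
  addGeoJSON(route)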

Hope that helps someone! Enjoy!

 

 



Create Code Metrics with cloc

Sat, 05/19/2018 - 20:31

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

The cloc Perl script (yes, Perl!) by Al Danial (https://github.com/AlDanial/cloc) has been one of the go-to tools for generating code metrics. Given a single file, directory tree, archive, or git repo, cloc can speedily give you metrics on the count of blank lines, comment lines, and physical lines of source code in a vast array of programming languages.

I don’t remember the full context, but someone in the R community asked about this type of functionality and I had tossed together a small script-turned-package to thinly wrap the Perl cloc utility. Said package was and is unimaginatively named cloc. Thanks to some collaborative input from @ma_salmon, the package gained more features. Recently I added the ability to process R Markdown (Rmd) files (i.e. only count lines in code chunks) to the main cloc Perl script and was performing some general cleanup when the idea to create some RStudio addins hit me.

cloc Basics

As noted, you can cloc just about anything. Here’s some metrics for dplyr::group_by:

cloc("https://raw.githubusercontent.com/tidyverse/dplyr/master/R/group-by.r")

## # A tibble: 1 x 10
##   source language file_count file_count_pct   loc loc_pct blank_lines blank_line_pct comment_lines comment_line_pct
## 1 group… R                 1             1.    44      1.          13             1.           110               1.

and, here’s a similar set of metrics for the whole dplyr package:

cloc_cran("dplyr")

## # A tibble: 7 x 11
##   source language file_count file_count_pct   loc loc_pct blank_lines blank_line_pct comment_lines comment_line_pct
## 1 dplyr… R               148        0.454   13216  0.442         2671        0.380            3876          0.673
## 2 dplyr… C/C++ H…        125        0.383    6687  0.223         1836        0.261             267          0.0464
## 3 dplyr… C++              33        0.101    4724  0.158          915        0.130             336          0.0583
## 4 dplyr… HTML             11        0.0337   3602  0.120          367        0.0522             11          0.00191
## 5 dplyr… Markdown          2        0.00613  1251  0.0418         619        0.0880              0          0.
## 6 dplyr… Rmd               6        0.0184    421  0.0141         622        0.0884           1270          0.220
## 7 dplyr… C                 1        0.00307    30  0.00100          7        0.000995            0          0.
## # ... with 1 more variable: pkg

We can also measure (in bulk) from afar, such as the measuring the dplyr git repo:

cloc_git("git://github.com/tidyverse/dplyr.git")

## # A tibble: 12 x 10
##    source    language     file_count file_count_pct   loc loc_pct  blank_lines blank_line_pct comment_lines
##  1 dplyr.git HTML                108        0.236   21467 0.335           3829       0.270             1114
##  2 dplyr.git R                   156        0.341   13648 0.213           2682       0.189             3736
##  3 dplyr.git Markdown             12        0.0263  10100 0.158           3012       0.212                0
##  4 dplyr.git C/C++ Header        126        0.276    6891 0.107           1883       0.133              271
##  5 dplyr.git CSS                   2        0.00438  5684 0.0887          1009       0.0711              39
##  6 dplyr.git C++                  33        0.0722   5267 0.0821          1056       0.0744             393
##  7 dplyr.git Rmd                   7        0.0153    447 0.00697          647       0.0456            1309
##  8 dplyr.git XML                   1        0.00219   291 0.00454            0       0.                   0
##  9 dplyr.git YAML                  6        0.0131    212 0.00331           35       0.00247             12
## 10 dplyr.git JavaScript            2        0.00438    44 0.000686          10       0.000705             4
## 11 dplyr.git Bourne Shell          3        0.00656    34 0.000530          15       0.00106             10
## 12 dplyr.git C                     1        0.00219    30 0.000468           7       0.000493             0
## # ... with 1 more variable: comment_line_pct

All in on Addins

The Rmd functionality made me realize that some interactive capabilities might be handy, so I threw together three of them.

Two of them handle extraction of code chunks from Rmd documents. One uses cloc; the other uses knitr::purl() (h/t @yoniceedee). The knitr-based one adds some very nice functionality if you want to preserve chunk options and have "eval=FALSE" chunks commented out.

The final one will gather up code metrics for all the sources in an active project.

FIN

If you’d like additional features or want to contribute, give (https://github.com/hrbrmstr/cloc) a visit and drop an issue or PR.



An East-West less divided?

Sat, 05/19/2018 - 19:41

(This article was first published on R – thinkr, and kindly contributed to R-bloggers)

With tensions heightened recently at the United Nations, one might wonder whether we’ve drawn closer, or farther apart, over the decades since the UN was established in 1945.

We’ll see if we can garner a clue by performing cluster analysis on the General Assembly voting of five of the founding members. We’ll focus on the five permanent members of the Security Council. Then later on we can look at whether Security Council vetoes corroborate our findings.

A prior article, entitled the “cluster of six“, employed unsupervised machine learning to discover the underlying structure of voting data. We’ll use related techniques here to explore the voting history of the General Assembly, the only organ of the United Nations in which all 193 member states have equal representation.

By dividing the voting history into two equal parts, which we’ll label as the “early years” and the “later years”, we can assess how our five nations cluster in the two eras.
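As a rough, hedged sketch of the kind of clustering involved (not the author's actual code or data), one could run partitioning around medoids on a country-by-vote matrix for each era and visualize the result with the cluster and factoextra packages listed in the R toolkit further down:

# Minimal sketch with simulated votes; rows are the five permanent members,
# columns are roll-call votes coded 1 = yes, 2 = abstain, 3 = no (an assumption,
# not the actual coding of the UN voting data).
library(cluster)
library(factoextra)

set.seed(42)
votes <- matrix(sample(1:3, 5 * 200, replace = TRUE), nrow = 5,
                dimnames = list(c("US", "UK", "France", "Russia", "China"), NULL))

pm <- pam(votes, k = 2)   # partition the five members into two clusters
fviz_cluster(pm)          # plot the clusters on a PCA projection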

During the early years, France, the UK and the US formed one cluster, whilst Russia stood apart.

Although the Republic of China (ROC) joined the UN at its founding in 1945, it’s worth noting that the People’s Republic of China (PRC), commonly called China today, was admitted into the UN in 1971. Hence its greater distance in the clustering.

Through the later years, France and the UK remained close. Not surprising given our EU ties. Will Brexit have an impact going forward?

The US is slightly separated from its European allies, but what's more striking is the shorter distance between these three and China / Russia. Will globalization continue to bring us closer together, or is the tide about to turn?

The cluster analysis above focused on General Assembly voting. By web-scraping the UN’s Security Council Veto List, we can acquire further insights on the voting patterns of our five nations.
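As a hedged sketch of that scraping step (the URL below is a placeholder, not the actual page address), the veto table could be pulled with XML::readHTMLTable(), which appears in the R toolkit list further down:

# Fetch the page, parse it, and read the first HTML table.
library(httr)
library(XML)

veto_url <- "https://example.org/un-security-council-veto-list"  # placeholder URL
page     <- content(GET(veto_url), as = "text")
doc      <- htmlParse(page, asText = TRUE)
veto_raw <- readHTMLTable(doc, stringsAsFactors = FALSE)
veto_tbl <- veto_raw[[1]]  # assume the veto list is the first table on the page
head(veto_tbl)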

Russia dominated the early vetoes before these dissipated in the late 60s. Vetoes picked up again in the 70s with the US dominating through to the 80s. China has been the most restrained throughout.

Since the 90s, there would appear to be less dividing us, supporting our finding from the General Assembly voting. But do the vetoes in 2017, and so far in 2018, suggest a turning of the tide? Or just a temporary divergence?

R toolkit

R packages and functions (excluding base) used throughout this analysis.

purrr: map_dbl[3]; map[1]; map2_df[1]; possibly[1]; set_names[1]
XML: readHTMLTable[1]
dplyr: if_else[15]; mutate[9]; filter[6]; select[5]; group_by[3]; summarize[3]; distinct[2]; inner_join[2]; slice[2]; arrange[1]; as_data_frame[1]; as_tibble[1]; data_frame[1]; desc[1]; rename[1]
tibble: as_data_frame[1]; as_tibble[1]; data_frame[1]; enframe[1]; rowid_to_column[1]
stringr: str_c[8]; str_detect[6]; str_replace[3]; fixed[2]; str_remove[2]; str_count[1]
rebus: dgt[1]; literal[1]; lookahead[1]; lookbehind[1]
lubridate: year[7]; dmy[1]; today[1]; ymd[1]
dummies: dummy.data.frame[2]
tidyr: spread[3]; gather[2]; unnest[1]
cluster: pam[3]
ggplot2: aes[6]; ggplot[5]; ggtitle[5]; scale_x_continuous[5]; element_blank[4]; geom_text[4]; geom_line[3]; geom_point[3]; ylim[3]; element_rect[2]; geom_col[2]; labs[2]; scale_fill_manual[2]; theme[2]; coord_flip[1]
factoextra: fviz_cluster[3]; fviz_dend[1]; fviz_silhouette[1]; hcut[1]
cowplot: draw_plot[2]; ggdraw[1]
ggthemes: theme_economist[1]
kableExtra: kable[1]; kable_styling[1]
knitr: kable[1]

View the code here.

Citations / Attributions

R Development Core Team (2008). R: A language and environment for
statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

Erik Voeten “Data and Analyses of Voting in the UN General Assembly” Routledge Handbook of International Organization, edited by Bob Reinalda (published May 27, 2013)

The post An East-West less divided? appeared first on thinkr.



Do Clustering by “Dimensional Collapse”

Sat, 05/19/2018 - 13:39

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Problem

Imagine that someone in a bank wants to find out whether some of the bank's credit card holders are actually the same person. According to his experience, he sets a rule: people who share either the same address or the same phone number can reasonably be regarded as the same person. Just as in the example:

library(tidyverse)

a <- data_frame(id = 1:16,
                addr = c("a", "a", "a", "b", "b", "c", "d", "d", "d", "e", "e", "f", "f", "g", "g", "h"),
                phone = c(130L, 131L, 132L, 133L, 134L, 132L, 135L, 136L, 137L, 136L, 138L, 138L, 139L, 140L, 141L, 139L),
                flag = c(1L, 1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 3L))

head(a)
## id   addr   phone   flag
## 1    a      130     1
## 2    a      131     1
## 3    a      132     1
## 4    b      133     2
## 5    b      134     2
## 6    c      132     1

In the dataframe a, the letters in column addr stand for address information, the numbers in column phone stand for phone numbers, and the integers in column flag are what he wants: the CLUSTER flag, which marks "really" different persons.

In the above plot, each point stands for an "identity" that has an address, which you can read off the horizontal axis, and a phone number, which you can see on the vertical axis. The red dotted lines present the "connections" between identities, which actually mean a shared address or phone number. So the wanted result is the blue rectangles, which circle out the different flags that represent really different persons.

Goal

"Finding the same person" is typically a clustering process, and I am sure there are plenty of ways to do it, such as a disjoint-set data structure. But I cannot help thinking that maybe we can do it in a simple way with R. That is my goal.

“Dimensional Collapse”

When I stared at the plot, I asked myself: why not map the x-axis information of each point to the very first mark of its group, following the y-axis "connections"? When everything goes well, all the grey points are mapped along the red arrows to the first marks of their groups, and only 4 marks remain on the x-axis: a, b, d and g, instead of the 9 marks there were in the first place. The y-axis information, after contributing all the "connection rules", can then be put away, since the remaining x-axis marks are exactly what I want: the final flags. That is why I like to call it "Dimensional Collapse".

Furthermore, in order to take advantage of R properties, I also:
1. Treat both dimensions as integers by factoring them.
2. Use "integer subsetting" to map and collapse (a tiny illustration follows).
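Here is a tiny, self-contained illustration of what "integer subsetting" means in this context (my own toy example, not part of the original post): a rule vector maps each old integer label to a collapsed label, and subsetting the rule with a vector of labels applies the whole mapping in one step.

rule <- c(1L, 1L, 2L, 2L)  # labels 1 and 2 collapse to 1; labels 3 and 4 collapse to 2
x    <- c(3L, 1L, 4L, 2L)  # some original integer labels
rule[x]                    # mapped labels: 2 1 2 1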

axis_collapse <- function(df, .x, .y) {
    .x <- enquo(.x)
    .y <- enquo(.y)

    # Turn the address and phone number into integers.
    df <- mutate(df,
                 axis_x = c(factor(!!.x)),
                 axis_y = c(factor(!!.y)))

    oldRule <- seq_len(max(df$axis_x))

    mapRule <- df %>%
      select(axis_x, axis_y) %>%
      group_by(axis_y) %>%
      arrange(axis_x, .by_group = TRUE) %>%
      mutate(collapse = axis_x[1]) %>%
      ungroup() %>%
      select(-axis_y) %>%
      distinct() %>%
      group_by(axis_x) %>%
      arrange(collapse, .by_group = TRUE) %>%
      slice(1) %>%
      ungroup() %>%
      arrange(axis_x) %>%
      pull(collapse)

    # Use integer subsetting to collapse x-axis.
    # In case of indirect "connections", we should do it recursively.
    while (TRUE) {
        newRule <- mapRule[oldRule]
        if (identical(newRule, oldRule)) {
            break
        } else {
            oldRule <- newRule
        }
    }

    df <- df %>%
      mutate(flag = newRule[axis_x],
             flag = c(factor(flag))) %>%
      select(-starts_with("axis_"))

    df
}

Let's see the result.

a %>%
  rename(flag_t = flag) %>%
  axis_collapse(addr, phone) %>%
  mutate_at(.vars = vars(addr:flag), factor) %>%
  ggplot(aes(factor(addr), factor(phone), shape = flag_t, color = flag)) +
  geom_point(size = 3) +
  labs(x = "Address", y = "Phone Number", shape = "Target Flag:", color = "Cluster Flag:")

Not bad so far.

Calculation Complexity

Let's make a simple test of time complexity.

test1 <- data_frame(addr = sample(1:1e4, 1e4), phone = sample(1:1e4, 1e4))
test2 <- data_frame(addr = sample(1:1e5, 1e5), phone = sample(1:1e5, 1e5))

bm <- microbenchmark::microbenchmark(n10k  = axis_collapse(test1, addr, phone),
                                     n100k = axis_collapse(test2, addr, phone),
                                     times = 30)
summary(bm)
## expr        min         lq        mean     median        uq        max  neval  cld
## n10k   249.2172   259.918    277.0333   266.9297   279.505   379.4292     30    a
## n100k 2489.1834  2581.731   2640.9394  2624.5741  2723.390  2839.5180     30    b

It seems that run time grows roughly linearly with the size of the data, holding the other conditions unchanged. That is acceptable.

More Dimensions?

Since this method collapses one dimension by transferring its clustering information to the other dimension, it should be usable recursively on more than 2 dimensions. But I am not 100% sure. Let's do a simple test.

a %>%
  # I deliberately add a column which connects groups 2 and 4 only.
  mutate(other = c(LETTERS[1:14], "D", "O")) %>%
  # use axis_collapse recursively
  axis_collapse(other, phone) %>%
  axis_collapse(flag, addr) %>%
  ggplot(aes(x = factor(addr), y = factor(phone), color = factor(flag))) +
  geom_point(size = 3) +
  labs(x = "Address", y = "Phone Number", color = "Cluster Flag:")


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

Decision Modelling in R Workshop in The Netherlands!

Sat, 05/19/2018 - 13:39

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

The Decision Analysis in R for Technologies in Health (DARTH) workgroup is hosting a two-day workshop on decision analysis in R in Leiden, The Netherlands from June 7-8, 2018. A one-day introduction to R course will also be offered the day before the workshop, on June 6th.

Decision models are mathematical simulation models that are increasingly being used in health sciences to simulate the impact of policy decisions on population health. New methodological techniques around decision modeling are being developed that rely heavily on statistical and mathematical techniques. R is becoming increasingly popular in decision analysis as it provides a flexible environment where advanced statistical methods can be combined with decision models of varying complexity. Also, the fact that R is freely available improves model transparency and reproducibility.

The workshop will guide participants on building probabilistic decision trees, Markov models and microsimulations, creating publication-quality tabular and graphical output, and will provide a basic introduction to value of information methods and model calibration using R.

For more information and to register, please visit: http://www.cvent.com/d/mtqth1


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

wrapr 1.4.1 now up on CRAN

Sat, 05/19/2018 - 04:57

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

wrapr 1.4.1 is now available on CRAN. wrapr is a really neat R package for organizing, meta-programming, and debugging R code. This update generalizes the dot-pipe feature's dot S3 features.

Please give it a try!

wrapr is an R package that supplies powerful tools for writing and debugging R code.

Introduction

Primary wrapr services include:

  • let() (let block)
  • %.>% (dot arrow pipe)
  • build_frame()/draw_frame()
  • := (named map builder)
  • DebugFnW() (function debug wrappers)
  • λ() (anonymous function builder)
let()

let() allows execution of arbitrary code with substituted variable names (note this is subtly different from binding values for names, as with base::substitute() or base::with()).

The function is simple and powerful. It treats strings as variable names and re-writes expressions as if you had used the denoted variables. For example the following block of code is equivalent to having written "a + a".

library("wrapr") a <- 7 let( c(VAR = 'a'), VAR + VAR ) # [1] 14

This is useful in re-adapting non-standard evaluation interfaces (NSE interfaces) so one can script or program over them.

We are trying to make let() self-teaching and self-documenting (to the extent that makes sense). For example, try the argument eval = FALSE to prevent execution and see what would have been executed, or debugPrint = TRUE to have the replaced code printed in addition to being executed:

let(
  c(VAR = 'a'),
  eval = FALSE,
  {
    VAR + VAR
  }
)
# {
#   a + a
# }

let(
  c(VAR = 'a'),
  debugPrint = TRUE,
  {
    VAR + VAR
  }
)
# $VAR
# [1] "a"
#
# {
#   a + a
# }
# [1] 14

Please see vignette('let', package='wrapr') for more examples. Some formal documentation can be found here.

For working with dplyr 0.7.* we strongly suggest wrapr::let() (or even an alternate approach called seplyr).

%.>% (dot pipe or dot arrow)

%.>% dot arrow pipe is a pipe with intended semantics:

"a %.>% b" is to be treated approximately as if the user had written "{ . <- a; b };" with "%.>%" being treated as left-associative.

Other R pipes include magrittr and pipeR.

The following two expressions should be equivalent:

cos(exp(sin(4)))
# [1] 0.8919465

4 %.>% sin(.) %.>% exp(.) %.>% cos(.)
# [1] 0.8919465

The notation is quite powerful as it treats pipe stages as expressions parameterized over the variable ".". This means you do not need to introduce functions to express stages. The following is a valid dot-pipe:

1:4 %.>% .^2
# [1] 1 4 9 16

The notation is also very regular as we show below.

1:4 %.>% sin
# [1] 0.8414710 0.9092974 0.1411200 -0.7568025

1:4 %.>% sin(.)
# [1] 0.8414710 0.9092974 0.1411200 -0.7568025

1:4 %.>% base::sin
# [1] 0.8414710 0.9092974 0.1411200 -0.7568025

1:4 %.>% base::sin(.)
# [1] 0.8414710 0.9092974 0.1411200 -0.7568025

1:4 %.>% function(x) { x + 1 }
# [1] 2 3 4 5

1:4 %.>% (function(x) { x + 1 })
# [1] 2 3 4 5

1:4 %.>% { .^2 }
# [1] 1 4 9 16

1:4 %.>% ( .^2 )
# [1] 1 4 9 16

Regularity can be a big advantage in teaching and comprehension. Please see "In Praise of Syntactic Sugar" for more details. Some formal documentation can be found here.

  • Some obvious "dot-free" right-hand sides are rejected. Pipelines are meant to move values through a sequence of transforms, not just to trigger side-effects. Example: 5 %.>% 6 deliberately stops, as 6 is a right-hand side that obviously does not use its incoming value. This check is only applied to values, not functions on the right-hand side.
  • Trying to pipe into a "zero argument function evaluation expression" such as sin() is prohibited, as it looks too much like the user declaring that sin() takes no arguments. One must pipe into either a function, a function name, or a non-trivial expression (such as sin(.)). A useful error message is returned to the user: wrapr::pipe does not allow direct piping into a no-argument function call expression (such as "sin()" please use sin(.)).
  • Some reserved words can not be piped into. One example is 5 %.>% return(.) is prohibited as the obvious pipe implementation would not actually escape from user functions as users may intend.
  • Obvious de-references (such as $, ::, @, and a few more) on the right-hand side are performed (example: 5 %.>% base::sin(.)).
  • Outer parenthesis on the right-hand side are removed (example: 5 %.>% (sin(.))).
  • Anonymous function constructions are evaluated so the function can be applied (example: 5 %.>% function(x) {x+1} returns 6, just as 5 %.>% (function(x) {x+1})(.) does).
  • Checks and transforms are not performed on items inside braces (example: 5 %.>% { function(x) {x+1} } returns function(x) {x+1}, not 6).
build_frame()/draw_frame()

build_frame() is a convenient way to type in a small example data.frame in natural row order. This can be very legible and saves having to perform a transpose in one's head. draw_frame() is the complementary function that formats a given data.frame (and is a great way to produce neatened examples).

x <- build_frame(
   "measure"                   , "training", "validation" |
   "minus binary cross entropy", 5         , -7           |
   "accuracy"                  , 0.8       , 0.6          )

print(x)
#                      measure training validation
# 1 minus binary cross entropy      5.0       -7.0
# 2                   accuracy      0.8        0.6

str(x)
# 'data.frame': 2 obs. of 3 variables:
#  $ measure   : chr "minus binary cross entropy" "accuracy"
#  $ training  : num 5 0.8
#  $ validation: num -7 0.6

cat(draw_frame(x))
# build_frame(
#    "measure"                   , "training", "validation" |
#    "minus binary cross entropy", 5         , -7           |
#    "accuracy"                  , 0.8       , 0.6          )

:= (named map builder)

:= is the "named map builder". It allows code such as the following:

'a' := 'x'
#   a
# "x"

The important property of named map builder is it accepts values on the left-hand side allowing the following:

name <- 'variableNameFromElsewhere'
name := 'newBinding'
# variableNameFromElsewhere
#              "newBinding"

A nice property is := commutes (in the sense of algebra or category theory) with R‘s concatenation function c(). That is the following two statements are equivalent:

c('a', 'b') := c('x', 'y')
#   a   b
# "x" "y"

c('a' := 'x', 'b' := 'y')
#   a   b
# "x" "y"

The named map builder is designed to synergize with seplyr.

DebugFnW()

DebugFnW() wraps a function for debugging. If the function throws an exception the execution context (function arguments, function name, and more) is captured and stored for the user. The function call can then be reconstituted, inspected and even re-run with a step-debugger. Please see our free debugging video series and vignette('DebugFnW', package='wrapr') for examples.
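Here is a small sketch of the idea. The saveDest-first argument order and the .RDS destination reflect my reading of the wrapr documentation rather than anything stated in this post, so check vignette('DebugFnW', package='wrapr') before relying on it:

# A function that fails for some inputs.
f <- function(x) {
  if (x > 5) stop("x is too big")
  x * 2
}

# Wrap it so a failing call's context is saved for later inspection.
fd <- DebugFnW("lastError.RDS", f)

fd(2)    # works as usual: returns 4
# fd(10) # would signal the error and save the failing call to lastError.RDS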

λ() (anonymous function builder)

λ() is a concise abstract function creator or "lambda abstraction". It is a placeholder that allows the use of the λ-character for very concise function abstraction.

Example:

# Make sure the lambda function builder is in our environment.
wrapr::defineLambda()

# square numbers 1 through 4
sapply(1:4, λ(x, x^2))
# [1] 1 4 9 16

Installing

Install with either:

install.packages("wrapr")

or

# install.packages("devtools") devtools::install_github("WinVector/wrapr") More Information

More details on wrapr capabilities can be found in the following two technical articles:

Note

Note: wrapr is meant only for "tame names", that is: variables and column names that are also valid simple (without quotes) R variables names.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

inline 0.3.15

Sat, 05/19/2018 - 03:04

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance release of the inline package arrived on CRAN today. inline facilitates writing code in-line in simple string expressions or short files. The package is mature and in maintenance mode: Rcpp used it heavily for several years but then moved on to Rcpp Attributes, so we now have a much more limited need for extensions to inline. But a number of other packages have a hard dependence on it, so we do of course look after it as part of the open source social contract (which is a name I just made up, but you get the idea…)
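For readers who have never used it, here is a minimal illustration of the sort of thing inline does (my own sketch, not from the release notes; it requires a working C toolchain):

library(inline)

# Compile a tiny C function from a string and call it from R.
add_one <- cfunction(signature(x = "numeric"), body = "
  SEXP out = PROTECT(duplicate(x));
  REAL(out)[0] = REAL(x)[0] + 1;
  UNPROTECT(1);
  return out;
")

add_one(41)  # [1] 42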

This release was triggered by a (as usual very reasonable) CRAN request to update the per-package manual page, which had become stale. We now use Rd macros; you can see the diff for just that file at GitHub, and I also include it below. My pkgKitten package-creation helper uses the same scheme; I wholeheartedly recommend it — as the diff shows, it makes things a lot simpler.

Some other changes reflect two user-contributed pull requests, as well as standard minor package update issues. See below for a detailed list of changes extracted from the NEWS file.

Changes in inline version 0.3.15 (2018-05-18)
  • Correct requireNamespace() call (thanks to Alexander Grueneberg in #5).

  • Small simplification to .travis.yml; also switch to https.

  • Use seq_along instead of seq(along=...) (Watal M. Iwasaki in #6).

  • Update package manual page using references to DESCRIPTION file [CRAN request].

  • Minor packaging updates.

Courtesy of CRANberries, there is a comparison to the previous release.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

What Makes a Song (More) Popular

Fri, 05/18/2018 - 16:42

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

Earlier this week, the Association for Psychological Science sent out a press release about a study examining what makes a song popular:

Researchers Jonah Berger of the University of Pennsylvania and Grant Packard of Wilfrid Laurier University were interested in understanding the relationship between similarity and success. In a recent study published in Psychological Science, the authors describe how a person’s drive for stimulation can be satisfied by novelty. Cultural items that are atypical, therefore, may be more liked and become more popular.

“Although some researchers have argued that cultural success is impossible to predict,” they explain, “textual analysis of thousands of songs suggests that those whose lyrics are more differentiated from their genres are more popular.”

The study, which was published online ahead of print, used a method of topic modeling called latent Dirichlet allocation. (Side note: this analysis is available in the R topicmodels package as the function LDA. It requires a document-term matrix, which can be created in R. Perhaps a future post! A minimal sketch of the LDA call appears after the topic list below.) The LDA extracted 10 topics from the lyrics of songs spanning seven genres (Christian, country, dance, pop, rap, rock, and rhythm and blues):

  • Anger and violence
  • Body movement
  • Dance moves
  • Family
  • Fiery love
  • Girls and cars
  • Positivity
  • Spiritual
  • Street cred
  • Uncertain love
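For the curious, here is a minimal sketch of fitting an LDA topic model with the topicmodels package; the toy lyrics, the tm-based document-term matrix, and k = 2 are my own illustration, not the study's data or code:

library(tm)          # to build a document-term matrix
library(topicmodels) # for the LDA() function

# Toy "lyrics" -- purely illustrative.
lyrics <- c("love you baby hold me tonight",
            "street money fast cars gold chains",
            "pray for grace and walk in faith")

dtm <- DocumentTermMatrix(Corpus(VectorSource(lyrics)))

# The study extracted 10 topics; 2 is plenty for a toy corpus.
fit <- LDA(dtm, k = 2, control = list(seed = 1234))

terms(fit, 3)  # top 3 terms per topic
topics(fit)    # most likely topic for each document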

Overall, they found that songs with lyrics that differentiated them from other songs in their genre were more popular. However, this wasn't the case for the specific genres of pop and dance, where lyrical differentiation appeared to be harmful to popularity. Finally, being lyrically different by being more similar to a different genre (a genre to which the song wasn't assigned) had no impact. So it isn't about writing a rock song that sounds like a rap song to gain popularity; it's about writing a rock song that sounds different from other rock songs.

I love this study idea, especially since I've started doing some text and lyric analysis on my own. (Look for another one Sunday, tackling the concept of sentiment analysis!) But I do have a criticism. This research used songs listed in the Billboard Top 50 by genre. While it would be impossible to analyze every single song that comes out at a given time, this study doesn't really answer the question of what makes a song popular, but rather what determines how popular an already popular song is. The advice in the press release (To Climb the Charts, Write Lyrics That Stand Out) may be true for established artists who are already popular, but it doesn't help the young artist trying to break onto the scene. They're probably already writing lyrics to try to stand out. They just haven't been noticed yet.


To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.

How To Plot With Dygraphs: Exercises

Fri, 05/18/2018 - 15:41

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

INTRODUCTION

The dygraphs package is an R interface to the dygraphs JavaScript charting library. It provides rich facilities for charting time-series data in R, including the features below (a minimal example follows the list):

1. Automatically plots xts time-series objects (or any object convertible to xts.)

2. Highly configurable axis and series display (including optional second Y-axis.)

3. Rich interactive features, including zoom/pan and series/point highlighting.

4. Display upper/lower bars (ex. prediction intervals) around the series.

5. Various graph overlays, including shaded regions, event lines, and point annotations.

6. Use at the R console, just like conventional R plots (via RStudio Viewer.)

7. Seamless embedding within R Markdown documents and Shiny web applications.
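As a quick taste before the exercises, here is a minimal example using the built-in ldeaths monthly series; the title and range selector here are my own choices, not the exercise solutions:

library(dygraphs)

# ldeaths is a built-in monthly time series of UK lung-disease deaths.
dygraph(ldeaths, main = "Monthly lung-disease deaths in the UK") %>%
  dyRangeSelector()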

Before proceeding, please follow our short tutorial.

Look at the examples given and try to understand the logic behind them. Then, try to solve the exercises below using R, without looking at the answers. Finally, check the solutions to see how your answers compare.

Exercise 1

Unite the two time series data-sets mdeaths and fdeaths and create a time-series dygraph of the new data-set.

Exercise 2

Insert a date range selector into the dygraph you just created.

Exercise 3

Change the label names of “mdeaths” and “fdeaths” to “Male” and “Female.”

Exercise 4

Make the graph stacked.

Exercise 5

Set the date range selector height to 20.

Exercise 6

Add a main title to your graph.

Exercise 7

Use the tutorial’s predicted data-set to create a dygraph of “lwr”, “fit”, and “upr”, but display the label as the summary of them.

Exercise 8

Set the colors to red.

Exercise 9

Remove the x-axis grid lines from your graph.

Exercise 10

Remove the y-axis grid lines from your graph.

Related exercise sets:
  1. Spatial Data Analysis: Introduction to Raster Processing (Part 1)
  2. Advanced Techniques With Raster Data: Part 1 – Unsupervised Classification
  3. Spatial Data Analysis: Introduction to Raster Processing: Part-3
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

‘LMX ot NOSJ!’ Interchanging Classic Data Formats With Single `blackmagic` Incantations

Fri, 05/18/2018 - 14:20

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

The D.C. Universe magic hero Zatanna used spells (i.e. incantations) to battle foes and said spells were just sentences said backwards, hence the mixed up jumble in the title. But, now I’m regretting not naming the package zatanna and reversing the function names to help ensure they’re only used deliberately & carefully. You’ll see why in a bit.

Just like their ore-seeking speleological counterparts, workers in our modern day data mines process a multitude of mineralic data formats to achieve our goals of world domination finding meaning, insight & solutions to hard problems.

Two formats in particular are common occurrences in many of our $DAYJOBs: XML and JSON. The rest of this (hopefully short-ish) post is going to assume you have at least a passing familiarity with — if not full-on battle scars from working with — them.

XML and JSON are, in many ways, very similar. This similarity is on purpose, since JSON was originally created to make it easier to process data in browsers and help make data more human-readable. If your $DAYJOB involves processing small or large streams of nested data, you likely prefer JSON over XML.

There are times, though, that — even if one generally works with only JSON data — one comes across a need to ingest some XML and turn it into JSON. This was the case for a question-poser on Stack Overflow this week (I won’t point-shill with a direct link but it’ll be easy to find if you are interested in this latest SODD package).

Rather than take on the potentially painful task of performing the XML to JSON transformation on their own the OP wished for a simple incantation to transform the entirety of the incoming XML into JSON.

We'll switch comic universes for a moment to issue a warning that all magic comes with a price. And, the cost for automatic XML<->JSON conversion can be quite high. XML has namespaces, attributes, tags and values, and requires schemas to convey data types and help validate content structure. JSON has no attributes, implicitly conveys types and is generally schema-less (though folks have bolted on that concept).

If one is going to use magic for automatic data conversion there must be rules (no, not those kind of Magic rules), otherwise how various aspects of XML become encoded into JSON (and the reverse) will generate inconsistency and may even result in significant data corruption. Generally speaking, you are always better off writing your own conversion utility vs relying on specific settings in a general conversion script/function. However, if your need is a one-off (which anyone who has been doing this type of work for a while knows is also generally never the case) you may have cause to throw caution to the wind, get your quick data fix, and move on. If that is the case, the blackmagic package may be of use to you.

gnitrevnoC eht ANAI sserddA ecapS yrtsigeR ot NOSJ

One file that’s in XML that I only occasionally have to process is the IANA IPv4 Address Space Registry. If you visited that link you may have uttered “Hey! That’s not XML it’s HTML!”, to wit — I would respond — “Well, HTML is really XML anyway, but use the View Source, Luke! and see that it is indeed XML with some clever XSL style sheet processing being applied in-browser to make the gosh awful XML human readable.”.

With blackmagic we can make quick work of converting this monstrosity into JSON.

The blackmagic package itself uses even darker magic to accomplish its goals. The package is just a thin V8 wrapper around the xml-js javascript library. Because of this, it is recommended that you do not try to process gigabytes of XML with it as there is a round trip of data marshalling between R and the embedded v8 engine.

requireNamespace("jsonlite") # jsonlite::flatten clobbers purrr::flatten in the wrong order so I generally fully-qualify what I need ## Loading required namespace: jsonlite library(xml2) library(blackmagic) # devtools::install_github("hrbrmstr/blackmagic") library(purrr) requireNamespace("dplyr") # I'm going to fully qualify use of dplyr:data_frame() below ## Loading required namespace: dplyr

You can thank @yoniceedee for the URL processing capability in blackmagic:

source_url <- "https://www.iana.org/assignments/ipv4-address-space/ipv4-address-space.xml" iana_json <- blackmagic::xml_to_json(source_url) # NOTE: cat the whole iana_json locally to see it — perhaps to file="..." vs clutter your console cat(substr(iana_json, 1800, 2300)) ## me":"prefix","elements":[{"type":"text","text":"000/8"}]},{"type":"element","name":"designation","elements":[{"type":"text","text":"IANA - Local Identification"}]},{"type":"element","name":"date","elements":[{"type":"text","text":"1981-09"}]},{"type":"element","name":"status","elements":[{"type":"text","text":"RESERVED"}]},{"type":"element","name":"xref","attributes":{"type":"note","data":"2"}}]},{"type":"element","name":"record","elements":[{"type":"element","name":"prefix","elements":[{"type":"

By the hoary hosts of Hoggoth, that's not very "human readable"! And, it looks super-verbose. Thankfully, Yousuf Almarzooqi knew we'd want to fine-tune the output, and we can use those options to make this a bit better:

blackmagic::xml_to_json(
  doc = source_url,
  spaces = 2,               # Number of spaces to be used for indenting XML output
  compact = FALSE,          # Whether to produce detailed object or compact object
  ignoreDeclaration = TRUE  # No declaration property will be generated.
) -> iana_json

# NOTE: cat the whole iana_json locally to see it — perhaps to file="..." vs clutter your console
cat(substr(iana_json, 3000, 3300))
## pe": "element",
##           "name": "prefix",
##           "elements": [
##             {
##               "type": "text",
##               "text": "000/8"
##             }
##           ]
##         },
##         {
##           "type": "element",
##           "name": "designation",
##

One "plus side" for doing the mass-conversion is that we don't really need to do much processing to have it be "usable" data in R:

blackmagic::xml_to_json(
  doc = source_url,
  compact = FALSE,
  ignoreDeclaration = TRUE
) -> iana_json

# NOTE: consider taking some more time to explore this monstrosity than this
str(processed <- jsonlite::fromJSON(iana_json), 3)
## List of 1
##  $ elements:'data.frame': 3 obs. of 5 variables:
##   ..$ type       : chr [1:3] "instruction" "instruction" "element"
##   ..$ name       : chr [1:3] "xml-stylesheet" "oxygen" "registry"
##   ..$ instruction: chr [1:3] "type=\"text/xsl\" href=\"ipv4-address-space.xsl\"" "RNGSchema=\"ipv4-address-space.rng\" type=\"xml\"" NA
##   ..$ attributes :'data.frame': 3 obs. of 2 variables:
##   .. ..$ xmlns: chr [1:3] NA NA "http://www.iana.org/assignments"
##   .. ..$ id   : chr [1:3] NA NA "ipv4-address-space"
##   ..$ elements   :List of 3
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ :'data.frame': 280 obs. of 4 variables:

compact(processed$elements$elements[[3]]$elements) %>%
  head(6) %>%
  str(3)
## List of 6
##  $ :'data.frame': 1 obs. of 2 variables:
##   ..$ type: chr "text"
##   ..$ text: chr "IANA IPv4 Address Space Registry"
##  $ :'data.frame': 1 obs. of 2 variables:
##   ..$ type: chr "text"
##   ..$ text: chr "Internet Protocol version 4 (IPv4) Address Space"
##  $ :'data.frame': 1 obs. of 2 variables:
##   ..$ type: chr "text"
##   ..$ text: chr "2018-04-23"
##  $ :'data.frame': 3 obs. of 4 variables:
##   ..$ type      : chr [1:3] "text" "element" "text"
##   ..$ text      : chr [1:3] "Allocations to RIRs are made in line with the Global Policy published at " NA ". \nAll other assignments require IETF Review."
##   ..$ name      : chr [1:3] NA "xref" NA
##   ..$ attributes:'data.frame': 3 obs. of 2 variables:
##   .. ..$ type: chr [1:3] NA "uri" NA
##   .. ..$ data: chr [1:3] NA "http://www.icann.org/en/resources/policy/global-addressing" NA
##  $ :'data.frame': 3 obs. of 4 variables:
##   ..$ type      : chr [1:3] "text" "element" "text"
##   ..$ text      : chr [1:3] "The allocation of Internet Protocol version 4 (IPv4) address space to various registries is listed\nhere. Origi"| __truncated__ NA " documents most of these allocations."
##   ..$ name      : chr [1:3] NA "xref" NA
##   ..$ attributes:'data.frame': 3 obs. of 2 variables:
##   .. ..$ type: chr [1:3] NA "rfc" NA
##   .. ..$ data: chr [1:3] NA "rfc1466" NA
##  $ :'data.frame': 5 obs. of 4 variables:
##   ..$ type      : chr [1:5] "element" "element" "element" "element" ...
##   ..$ name      : chr [1:5] "prefix" "designation" "date" "status" ...
##   ..$ elements  :List of 5
##   .. ..$ :'data.frame': 1 obs. of 2 variables:
##   .. ..$ :'data.frame': 1 obs. of 2 variables:
##   .. ..$ :'data.frame': 1 obs. of 2 variables:
##   .. ..$ :'data.frame': 1 obs. of 2 variables:
##   .. ..$ : NULL
##   ..$ attributes:'data.frame': 5 obs. of 2 variables:
##   .. ..$ type: chr [1:5] NA NA NA NA ...
##   .. ..$ data: chr [1:5] NA NA NA NA ...

As noted previously, all magic comes with a price and we just traded XML processing for some gnarly list processing. This isn't the case for all XML files and you can try to tweak the parameters to xml_to_json() to make the output more usable (NOTE: key name transformation parameters still need to be implemented in the package), but this seems a whole lot easier (to me):

doc <- read_xml(source_url)
xml_ns_strip(doc)

dplyr::data_frame(
  prefix      = xml_find_all(doc, ".//record/prefix") %>% xml_text(),
  designation = xml_find_all(doc, ".//record/designation") %>% xml_text(),
  date        = xml_find_all(doc, ".//record/date") %>%
                  xml_text() %>%
                  sprintf("%s-01", .) %>%
                  as.Date(),
  whois       = xml_find_all(doc, ".//record") %>%
                  map(xml_find_first, "./whois") %>%
                  map_chr(xml_text),
  status      = xml_find_all(doc, ".//record/status") %>% xml_text()
)
## # A tibble: 256 x 5
##    prefix designation                      date       whois        status
##
##  1 000/8  IANA - Local Identification     1981-09-01               RESERVED
##  2 001/8  APNIC                           2010-01-01 whois.apnic… ALLOCAT…
##  3 002/8  RIPE NCC                        2009-09-01 whois.ripe.… ALLOCAT…
##  4 003/8  Administered by ARIN            1994-05-01 whois.arin.… LEGACY
##  5 004/8  Level 3 Parent, LLC             1992-12-01 whois.arin.… LEGACY
##  6 005/8  RIPE NCC                        2010-11-01 whois.ripe.… ALLOCAT…
##  7 006/8  Army Information Systems Center 1994-02-01 whois.arin.… LEGACY
##  8 007/8  Administered by ARIN            1995-04-01 whois.arin.… LEGACY
##  9 008/8  Administered by ARIN            1992-12-01 whois.arin.… LEGACY
## 10 009/8  Administered by ARIN            1992-08-01 whois.arin.… LEGACY
## # ... with 246 more rows

NIF

xml_to_json() has a sibling function --- json_to_xml() for the reverse operation and you're invited to fill in the missing parameters with a PR as there is a fairly consistent and straightforward way to do that. Note that a small parameter tweak can radically change the output, which is one of the aforementioned potentially costly pitfalls of this automagic conversion.

Before using either function, seriously consider taking the time to write a dedicated, small package that exposes a function or two to perform the necessary conversions.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R Improvements for Bio7 2.8

Fri, 05/18/2018 - 13:48

(This article was first published on R – Bio7 Website, and kindly contributed to R-bloggers)

18.05.2018

The next release of Bio7 adds a lot of new R features and improvements. One minor change is that the default perspective after the startup of Bio7 is now the R perspective, to emphasize the importance of R within this software.

The R-Shell view has been simplified, and the R templates have been moved into their own simple view for improved usability (see the screenshot of the R perspective below).

In addition, the context menu has been enhanced to allow the creation of submenus from scripts found in folders and subfolders (added recursively), so you can build your own menu structure.
Scripts can be added in R, JavaScript, Groovy, Jython, BeanShell and ImageJ Macros.
Java (with dependent classes) can be dynamically compiled and executed like a script, too.

Several improvements have also been added to the R-Shell and the R editor for easier generation of valid R code. The R-Shell and the R editor now display R workspace objects with their class and structure in the code completion dialog (marked with a new workspace icon – see below).

R-Shell:

R editor:

In the R editor, a new quick-fix function has been added to detect and install missing packages (from a scanned default packages folder of an R installation; this has to be enabled in the Bio7 R code analysis preferences).

The detection of missing package imports is also fixable, i.e. when a function is called and the package that provides it is installed, but the package declaration is missing from the code.

The code assistance in the R-Shell and in the R editor now offers completions, e.g., for dataframe columns inside the %>% operator of piped function calls:

In addition, code assistance is available for lists, vectors, dataframes and arrays with named rows and columns, etc., when they are available in the current R environment.

Code completion for package functions can now easily be added with the R-Shell or the R editor, which loads the package function help for both interfaces. The editor will automatically be updated (see the updated editor marking unknown functions in the screencast below).

Numerous other features, improvements and bugfixes have been added, too.

Bio7 2.8 will hopefully be available soon at:

https://bio7.org

Overview videos on YouTube

 

 


To leave a comment for the author, please follow the link and comment on their blog: R – Bio7 Website.

NYC restaurants reviews and inspection scores

Fri, 05/18/2018 - 02:57

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

 

If you ever pass outside a restaurant in New York City, you’ll notice a prominently displayed letter grade. Since July 2010, the Health Department has required restaurants to post letter grades showing sanitary inspection results.

An A grade attests to top marks for health and safety, so you can feel secure about eating there. But you don't necessarily know that you will enjoy the food and experience courteous service. To find that out, you'd refer to the restaurant reviews. For this project, I carried out a simple data analysis and visualization of NYC restaurant reviews and inspection scores to find out whether there is any correlation between the two. The data also shows which types of cuisines and which NYC locations tend to attract higher ratings.

Nowadays, reviews, ratings and grades are the measures any business relies on to gauge its quality, popularity and future success. For the restaurant business, ratings, hygiene, and cleanliness are essential. A popular site for reviews, Yelp, offers many individual ratings for restaurants. The New York City Department of Health and Mental Hygiene (DOHMH) conducts unannounced restaurant inspections annually. Inspectors check whether the food handling, food temperature, personal hygiene of workers and vermin control of the restaurants are in compliance with hygienic standards. The scoring and grading process can be found here.

The restaurant ratings and location information used in this project come from Yelp's API. The inspection data was downloaded from the NYC open data website. I merged the Yelp restaurant review data with the inspection data and removed rows that lack either an inspection score or reviews. I also recoded the inspection scores into the grades A, B, and C, as this measure is widely used and displayed on restaurants. There were other scores, primarily P or Z or some version of grade pending, which we ignore in this analysis. Restaurants with a score between 0 and 13 points earn an A, those with 14 to 27 points receive a B, and those with 28 or more a C.
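A minimal sketch of that merge-and-recode step is shown below. The object and column names are hypothetical stand-ins (the actual Yelp and DOHMH field names differ), so treat it as an illustration of the logic only:

library(dplyr)

# Hypothetical toy stand-ins for the Yelp and DOHMH data.
yelp_reviews <- tibble::tibble(restaurant_id = 1:4, rating = c(4.5, 3.0, NA, 5.0))
inspections  <- tibble::tibble(restaurant_id = 1:4, inspection_score = c(10, 30, 12, 20))

restaurants <- yelp_reviews %>%
  inner_join(inspections, by = "restaurant_id") %>%
  filter(!is.na(rating), !is.na(inspection_score)) %>%  # drop rows missing either measure
  mutate(grade = case_when(
    inspection_score <= 13 ~ "A",  # 0-13 points
    inspection_score <= 27 ~ "B",  # 14-27 points
    TRUE                   ~ "C"   # 28 or more points
  ))

restaurants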


 

The data shows that an A is the most commonly assigned inspection grade for restaurants of all types in all locations. I plotted various bar plots to visualize the inspection scores and ratings by borough and cuisine type.

With respect to location, the borough bar plot shows that Manhattan has the highest number of restaurants in every grade compared to the other boroughs. This is unsurprising, as it has the highest number of restaurants in general. Staten Island has the lowest number of restaurants with grades A, B and C.

As for cuisine types, the cuisine plot shows the 15 cuisines with the highest restaurant counts. It indicates that American cuisine has the highest number of A grades, which suggests that American restaurants focus more on hygiene and cleanliness than other types of restaurants.

 

The review plot indicates that most restaurants achieve a top rating of 4 stars. Again, Manhattan has the highest number of restaurants rated four stars, while Staten Island has the lowest number of highly rated restaurants. It also shows that almost all boroughs have a low number of 2-star restaurants. Moreover, the cuisine review plot indicates that American cuisine tends to have the highest ratings compared to other cuisines. The reason could simply be that more restaurants fall under the American category than any other.

 

The scatter plot shows the relationship between inspection score and rating. It indicates that there is no clear correlation between the two variables. It is fairly common for a restaurant with a C-grade inspection score to achieve a 4-5 star rating in reviews, and it is also possible to find A-grade restaurants that only have 1-2 stars. This could be because, as long as the food is tasty, people rate the restaurant well without paying much attention to cleanliness and hygiene. The scatter plot also shows that some restaurants maintain a very high level of cleanliness and hygienic food conditions yet fail to get good ratings, which could be due to bad service or less-than-tasty food. We could extend the analysis on both ends by analyzing review comments to find out why some restaurants have good reviews but low inspection scores and vice versa; this would require review comment data and further analysis using NLP.

 

 

The cluster map of NYC restaurants helps visualize locations and filter the restaurants by cuisine type. The color of each point indicates the rating, and each point includes a description of the featured restaurant. The heat map shows the density of restaurants based on the selected borough or cuisine, indicating which areas have the greatest number of restaurants. This could help business people make informed decisions about where to open new restaurants, based on the types of restaurants already in place.

Finally, this app can be useful for filtering the data by borough, cuisine, rating, and inspection grade. People who want to eat out with specific criteria in mind can filter the restaurants and visit their favorites based on top marks for both ratings and inspection grades. The shiny app link is here.

 


To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.

drake’s improved high-performance computing power

Fri, 05/18/2018 - 02:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

The drake R package is not only a reproducible research solution, but also a serious high-performance computing engine. The Get Started page introduces drake, and this technical note draws from the guides on high-performance computing and timing.

You can help!

Some of these features are brand new, and others are newly refactored. The GitHub version has all the advertised functionality, but it needs more testing and development before I can submit it to CRAN in good conscience. New issues such as r-lib/processx#113 and HenrikBengtsson/future#226 seem to affect drake, and more may emerge. If you use drake for your own work, please consider supporting the project by field-testing the claims below and posting feedback here.

Let drake schedule your targets.

A typical workflow is a sequence of interdependent data transformations. Consider the example from the Get Started page.


When you call make() on this project, drake takes care of "raw_data.xlsx", then raw_data, and then data in sequence. Once data completes, fit and hist can launch in parallel, and then "report.md" begins once everything else is done. It is drake’s responsibility to deduce this order of execution, hunt for ways to parallelize your work, and free you up to focus on the substance of your research.
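For concreteness, a plan along these lines produces that dependency structure. This is a hedged paraphrase of the Get Started example rather than its exact code, and it assumes raw_data.xlsx and report.Rmd exist in the working directory:

library(drake)

plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data     = dplyr::mutate(raw_data, Species = forcats::fct_inorder(Species)),
  hist     = ggplot2::ggplot(data, ggplot2::aes(x = Petal.Width, fill = Species)) +
               ggplot2::geom_histogram(),
  fit      = lm(Sepal.Width ~ Petal.Width + Species, data = data),
  report   = rmarkdown::render(knitr_in("report.Rmd"),
                               output_file = file_out("report.md"),
                               quiet = TRUE)
)

make(plan)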

Activate parallel processing.

Simply set the jobs argument to an integer greater than 1. The following make() recruits multiple processes on your local machine.

make(plan, jobs = 2)

For parallel deployment to a computing cluster (SLURM, TORQUE, SGE, etc.) drake calls on packages future, batchtools, and future.batchtools. First, create a batchtools template file to declare your resource requirements and environment modules. There are built-in example files in drake, but you will likely need to tweak your own by hand.

drake_batchtools_tmpl_file("slurm") # Writes batchtools.slurm.tmpl.

Next, tell future.batchtools to talk to the cluster.

library(future.batchtools)
future::plan(batchtools_slurm, template = "batchtools.slurm.tmpl")

Finally, set make()’s parallelism argument equal to "future" or "future_lapply".

make(plan, parallelism = "future", jobs = 8) Choose a scheduling algorithm.

The parallelism argument of make() controls not only where to deploy the workers, but also how to schedule them. The following table categorizes the 7 options.

                      Deploy: local                           Deploy: remote
Schedule: persistent  “mclapply”, “parLapply”                 “future_lapply”
Schedule: transient                                           “future”, “Makefile”
Schedule: staged      “mclapply_staged”, “parLapply_staged”

Staged scheduling

drake’s first custom parallel algorithm was staged scheduling. It was easier to implement than the other two, but the workers run in lockstep. In other words, all the workers pick up their targets at the same time, and each worker has to finish its target before any worker can move on. The following animation illustrates the concept.



But despite weak parallel efficiency, staged scheduling remains useful because of its low overhead. Without the bottleneck of a formal master process, staged scheduling blasts through armies of tiny conditionally independent targets (example here). Consider it if the bulk of your work is finely diced and perfectly parallel, maybe if your dependency graph is tall and thin.


Persistent scheduling

Persistent scheduling is brand new to drake. Here, make(jobs = 2) deploys three processes: two workers and one master. Whenever a worker is idle, the master assigns it the next target whose dependencies are fully ready. The workers keep running until no more targets remain. See the animation below.





Transient scheduling

If the time limits of your cluster are too strict for persistent workers, consider transient scheduling, another new arrival. Here, make(jobs = 2) starts a brand new worker for each individual target. See the following video.



How many jobs should you choose?

The predict_runtime() function can help. Let’s revisit the mtcars example.


Let’s also

  1. Plan for non-staged scheduling,
  2. Assume each non-file target (black circle) takes 2 hours to build, and
  3. Rest assured that everything else is super quick.

When we declare the runtime assumptions with the known_times argument and cycle over a reasonable range of jobs, predict_runtime() paints a clear picture.
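In code, that experiment might look roughly like the sketch below. The known_times and jobs arguments are named in the post; the exact input format (here assumed to be a named vector of seconds per target) and the surrounding loop are my own guesses, so consult the drake timing guide before copying:

library(drake)

config <- drake_config(plan)  # the same plan passed to make()

# Assumption: every target takes 2 hours (7200 seconds); tweak to taste.
known <- setNames(rep(7200, length(plan$target)), plan$target)

# Cycle over a range of job counts and compare predicted runtimes.
predictions <- lapply(1:8, function(jobs) {
  predict_runtime(config, jobs = jobs, known_times = known)
})
predictions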

jobs = 4 is a solid choice. Any fewer would slow us down, and the next 2-hour speedup would take double the jobs and the hardware to back it up. Your choice of jobs for make() ultimately depends on the runtime you can tolerate and the computing resources at your disposal.

Thanks!

When I attended RStudio::conf(2018), drake relied almost exclusively on staged scheduling. Kirill Müller spent hours on site and hours afterwards helping me approach the problem and educating me on priority queues, message queues, and the knapsack problem. His generous help paved the way for drake’s latest enhancements.

Disclaimer

This post is a product of my own personal experiences and opinions and does not necessarily represent the official views of my employer. I created and embedded the Powtoon videos only as explicitly permitted in the Terms and Conditions of Use, and I make no copyright claim to any of the constituent graphics.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.
