R news and tutorials contributed by (750) R bloggers
Updated: 2 hours 16 min ago

### Analyzing NetHack data, part 2: What players kill the most

Sat, 11/10/2018 - 01:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

Introduction

This is the third blog post that deals with data from the game NetHack, and oh boy, did a lot of
things happen since the last blog post! Here’s a short timeline of the events:

• I scraped data from alt.org/nethack and made a package with the data available on Github
(that package was too big for CRAN)
• Then, I analyzed the data, focusing on what monsters kill the players the most, and also where
players die the most
• @GridSageGames, developer of the roguelike Cogmind and moderator of the roguelike subreddit,
posted the blog post on reddit
• I noticed that actually, by scraping the data like I did, I only got a sample of 100 daily games
• This point was also discussed on Reddit, and bhhak, an UnNetHack developer (UnNetHack is a fork
of NetHack) suggested I used the xlogfiles instead
• xlogfiles are log files generated by NetHack, and are also available on alt.org/nethack
• I started scraping them, and getting a lot more data
• I got contacted on twitter by @paxed, an admin of alt.org/nethack:

There was no need for scraping, we'll happily give out the whole database if you want.

— Pasi Kallinen (@paxed) November 5, 2018

• The admins of alt.org/nethack will release all the data to the public!

So, I will now continue with the blog post I wanted to do in the first place; focusing now on what
roles players choose to play the most, and also which monsters they kill the most. BUT! Since all the data
will be released to the public, my {nethack} package that contains data that I scraped is not
that useful anymore. So I changed the nature of the package.
Now the package contains some functions: a function to parse and prepare the xlogfiles from NetHack that
you can download from alt.org/nethack (or from any other public server), a function
to download dumplogs such as this one. These dumplogs contain a lot of
info that I will extract in this blog post, using another function included in the {nethack} package.
The package also contains a sample of 6000 runs from NetHack version 3.6.1.

You can install the package with the following command line:

devtools::install_github("b-rodrigues/nethack") The {nethack} package

In part 1 I showed what killed
players the most. Here, I will focus on what monsters players kill the most.

library(tidyverse) library(lubridate) library(magrittr) library(ggridges) library(brotools) library(rvest) library(nethack)

Let’s first describe the data:

brotools::describe(nethack) %>% print(n = Inf) ## # A tibble: 23 x 17 ## variable type nobs mean sd mode min max q05 ## ## 1 deathdn… Nume… 6.00e3 8.45e-1 1.30e+0 2 0. 7.00e0 0. ## 2 deathlev Nume… 6.00e3 4.32e+0 3.69e+0 10 -5.00e0 4.50e1 1.00e0 ## 3 deaths Nume… 6.00e3 8.88e-1 3.54e-1 1 0. 5.00e0 0. ## 4 endtime Nume… 6.00e3 1.53e+9 4.72e+6 1534… 1.52e9 1.54e9 1.53e9 ## 5 hp Nume… 6.00e3 6.64e+0 4.96e+1 -1 -9.40e1 1.79e3 -8.00e0 ## 6 maxhp Nume… 6.00e3 3.82e+1 5.29e+1 57 2.00e0 1.80e3 1.10e1 ## 7 maxlvl Nume… 6.00e3 5.52e+0 6.36e+0 10 1.00e0 5.30e1 1.00e0 ## 8 points Nume… 6.00e3 4.69e+4 4.18e+5 10523 0. 9.92e6 1.40e1 ## 9 realtime Nume… 6.00e3 4.42e+3 1.60e+4 4575 0. 3.23e5 6.90e1 ## 10 startti… Nume… 6.00e3 1.53e+9 4.72e+6 1534… 1.52e9 1.54e9 1.53e9 ## 11 turns Nume… 6.00e3 3.60e+3 9.12e+3 6797 3.10e1 1.97e5 9.49e1 ## 12 align Char… 6.00e3 NA NA Cha NA NA NA ## 13 align0 Char… 6.00e3 NA NA Cha NA NA NA ## 14 death Char… 6.00e3 NA NA kill… NA NA NA ## 15 gender Char… 6.00e3 NA NA Fem NA NA NA ## 16 gender0 Char… 6.00e3 NA NA Fem NA NA NA ## 17 killed_… Char… 6.00e3 NA NA fain… NA NA NA ## 18 name Char… 6.00e3 NA NA drud… NA NA NA ## 19 race Char… 6.00e3 NA NA Elf NA NA NA ## 20 role Char… 6.00e3 NA NA Wiz NA NA NA ## 21 dumplog List 1.33e6 NA NA NA NA NA ## 22 birthda… Date 6.00e3 NA NA NA NA NA ## 23 deathda… Date 6.00e3 NA NA NA NA NA ## # ... with 8 more variables: q25 , median , q75 , ## # q95 , n_missing , n_unique , starting_date , ## # ending_date

All these columns are included in xlogfiles. The data was prepared using two functions, included
in {nethack}:

xlog <- read_delim("~/path/to/nethack361_xlog.csv", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE) xlog_df <- clean_xlog(xlog)

nethack361_xlog.csv is the raw xlogfiles that you can get from NetHack public servers.
clean_xlog() is a function that parses an xlogfile and returns a clean data frame.
xlog_df will be a data frame that will look just as the one included in {nethack}. It is
then possible to get the dumplog from each run included in xlog_df using get_dumplog():

xlog_df <- get_dumplog(xlog_df)

This function adds a column called dumplog with the dumplog of that run. I will now analyze
the dumplog file, by focusing on monsters vanquished, genocided or extinct. In a future blogpost
I will focus on other achievements.

Roles played (and other starting stats)

I will take a look at the races, roles, gender and alignment players start with the most. I will do
pie charts to visualize these variable, so first, let’s start by writing a general function that
allows me to do just that:

create_pie <- function(dataset, variable, repel = FALSE){ if(repel){ geom_label <- function(...){ ggrepel::geom_label_repel(...) } } variable <- enquo(variable) dataset %>% count((!!variable)) %>% mutate(total = sum(n), freq = n/total, labels = scales::percent(freq)) %>% arrange(desc(freq)) %>% ggplot(aes(x = "", y = freq, fill = (!!variable))) + geom_col() + geom_label(aes(label = labels), position = position_stack(vjust = 0.25), show.legend = FALSE) + coord_polar("y") + theme_blog() + scale_fill_blog() + theme(legend.title = element_blank(), panel.grid = element_blank(), axis.text = element_blank(), axis.title = element_blank()) }

Now I can easily plot the share of races chosen:

create_pie(nethack, race)

or the share of alignment:

create_pie(nethack, align0)

Same for the share of gender:

create_pie(nethack, gender0)

and finally for the share of roles:

create_pie(nethack, role, repel = TRUE)

create_pie() is possible thanks to tidy evaluation in {ggplot2},
which makes it possible to write a function that passes data frame columns down to aes(). Before
version 3.0 of {ggplot2} this was not possible, and writing such a function would have been a bit
more complicated. Now, it’s as easy as pie, if I dare say.

Something else I want to look at, is the distribution of turns by role:

nethack %>% filter(turns < quantile(turns, 0.98)) %>% ggplot(aes(x = turns, y = role, group = role, fill = role)) + geom_density_ridges(scale = 6, size = 0.25, rel_min_height = 0.01) + theme_blog() + scale_fill_blog() + theme(axis.text.y = element_blank(), axis.title.y = element_blank()) ## Picking joint bandwidth of 486

I use the very cool {ggridges} package for that. The distribution seems to mostly be the same
(of course, one should do a statistical test to be sure), but the one for the role “Valkyrie”
seems to be quite different from the others. It is known that it is easier to win the game playing
as a Valkyrie, but a question remains: is it really easier as a Valkyrie, or do good players tend
to play as Valkyries more often?

Creatures vanquished, genocided or extinct

The dumplog lists which, and how many of which, creatures were vanquished during the run, as well
as creatures that were genocided and extinct. The player can genocide an entire species by reading
a scroll of genocide (or by sitting on a throne). A species gets extinct if the player manages to
kill every monster from that species (there’s other ways too, but for the sake of simplicity, let’s
just say that when the players kills every monster from a species, the species is extinct).
The following lines are an extract of a dumplog:

"Vanquished creatures:" " Baalzebub" " Orcus" " Juiblex" "the Wizard of Yendor (4 times)" " Pestilence (thrice)" " Famine" " Vlad the Impaler" " 4 arch-liches" " an arch-lich" " a high priest" "..." "..." "..." "2873 creatures vanquished."

If I want to analyze this, I have to first solve some problems:

• Replace “a” and “an” by “1”
• Put the digit in the string “(4 times)” in front of the name of the monster (going from “the Wizard of Yendor (4 times)” to “4 the Wizard of Yendor”)
• Do something similar for “twice” and “thrice”
• Put everything into singular (for example, arch-liches into arch-lich)
• Trim whitespace
• Extract the genocided or extinct status from the dumplog too
• Finally, return a data frame with all the needed info

I wrote a function called extracted_defeated_monsters() and included it in the {nethack} package.
I discuss this function in appendix, but what it does is extracting information from dumplog files
about vanquished, genocided or extinct monsters and returns a tidy dataframe with that info. This
function has a lot of things going on inside it, so if you’re interested in learning more about
regular expression and other {tidyverse} tricks, I really encourage you to read its source code.

I can now easily add this info to my data:

nethack %<>% mutate(monsters_destroyed = map(dumplog, ~possibly(extract_defeated_monsters, otherwise = NA)(.)))

Let’s take a look at one of them:

nethack$monsters_destroyed[[117]] ## # A tibble: 285 x 3 ## value monster status ## ## 1 1 baalzebub ## 2 1 orcu ## 3 1 juiblex ## 4 4 the wizard of yendor ## 5 3 pestilence ## 6 1 famine ## 7 1 vlad the impaler ## 8 4 arch-lich ## 9 1 high priest ## 10 1 medusa ## # ... with 275 more rows nethack$monsters_destroyed[[117]] %>% count(status) ## # A tibble: 3 x 2 ## status n ## ## 1 extinct 2 ## 2 genocided 7 ## 3 276

The status variable tells us if that monster was genocided or extinct during that run. status
equal to “NA” means vanquished.

It is now possible to look at, say, the top 15 vanquished monsters (normalized):

nethack %>% filter(!is.na(monsters_destroyed)) %>% pull(monsters_destroyed) %>% bind_rows %>% group_by(monster) %>% summarise(total = sum(value)) %>% top_n(15) %>% ungroup() %>% mutate(norm_total = (total - min(total))/(max(total) - min(total))) %>% mutate(monster = fct_reorder(monster, norm_total, .desc = FALSE)) %>% ggplot() + geom_col(aes(y = norm_total, x = monster)) + coord_flip() + theme_blog() + scale_fill_blog() + ylab("Ranking") + xlab("Monster") ## Selecting by total

In this type of graph, the most vanquished monster, “gnome” has a value of 1, and the least
vanquished one, 0. This normalization step is also used in the pre-processing step of machine learning
algorithms. This helps convergence of the gradient descent algorithm for instance.

Monsters can also get genocided or extinct. Let’s make a pie chart of the proportion of genocided
and extinct monsters (I lump monsters that are genocided or extinct less than 5% of the times
into a category called other). Because I want two pie charts, I nest the data after having grouped
it by the status variable. This is a trick I discussed in this blog post
and that I use very often:

nethack %>% filter(!is.na(monsters_destroyed)) %>% pull(monsters_destroyed) %>% bind_rows %>% filter(!is.na(status)) %>% group_by(status) %>% count(monster) %>% mutate(monster = fct_lump(monster, prop = 0.05, w = n)) %>% group_by(status, monster) %>% summarise(total_count = sum(n)) %>% mutate(freq = total_count/sum(total_count), labels = scales::percent(freq)) %>% arrange(desc(freq)) %>% group_by(status) %>% nest() %>% mutate(pie_chart = map2(.x = status, .y = data, ~ggplot(data = .y, aes(x = "", y = freq, fill = (monster))) + geom_col() + ggrepel::geom_label_repel(aes(label = labels), position = position_stack(vjust = 0.25), show.legend = FALSE) + coord_polar("y") + theme_blog() + scale_fill_blog() + ggtitle(.x) + theme(legend.title = element_blank(), panel.grid = element_blank(), axis.text = element_blank(), axis.title = element_blank()) )) %>% pull(pie_chart) ## Warning in mutate_impl(.data, dots): Unequal factor levels: coercing to ## character ## Warning in mutate_impl(.data, dots): binding character and factor vector, ## coercing into character vector ## Warning in mutate_impl(.data, dots): binding character and factor vector, ## coercing into character vector ## [[1]]

## ## [[2]]

That was it for this one, the graphs are not that super sexy, but the amount of work that went
into making them was quite consequent. The main reason was that parsing xlogfiles was a bit tricky, but
the main challenge was extracting information from dumplog files. This proved to be a bit more
complicated than expected (just take a look at the source code of extract_defeated_monsters()
to get an idea…).

If you found this blog post useful, you might want to follow me on twitter for blog post updates.

Bonus plot Correct number of daily games

The daily number of games are available here. Let’s
extract this info and remake the plot that shows the number of runs per day:

games <- read_html("https://alt.org/nethack/dailygames_ct.html") %>% html_nodes(xpath = '//table') %>% html_table(fill = TRUE)

This extracts all the tables and puts them into a list. Let’s take a look at one:

head(games[[1]]) ## 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ## 1 NA 1 2 3 4 5 6 7 8 9 10 11 12 ## 2 Jan 11639 275 370 394 363 392 276 288 324 297 411 413 430 ## 3 Feb 10819 375 384 359 376 440 345 498 457 416 376 421 416 ## 4 Mar 12148 411 403 421 392 447 391 451 298 350 309 309 369 ## 5 Apr 13957 456 513 482 516 475 490 397 431 436 438 541 493 ## 6 May 13361 595 509 576 620 420 443 407 539 440 446 404 282 ## 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ## 1 13 14 15 16 17 18 19 20 21 22 23 24 25 26 ## 2 331 341 318 483 408 424 464 412 371 430 348 315 359 375 ## 3 385 367 443 324 283 341 385 398 361 379 399 276 455 460 ## 4 390 358 362 345 388 360 411 382 371 400 410 417 328 431 ## 5 593 537 396 578 403 435 526 448 339 377 476 492 528 393 ## 6 265 358 419 564 483 429 423 299 424 404 450 408 355 409 ## 2018 2018 2018 2018 2018 ## 1 27 28 29 30 31 ## 2 432 371 385 440 399 ## 3 353 347 NA NA NA ## 4 386 484 493 486 395 ## 5 407 421 463 477 NA ## 6 417 433 360 391 389

Let’s clean this up.

clean_table <- function(df){ # Promotes first row to header colnames(df) <- df[1, ] df <- df[-1, ] # Remove column with total from the month df <- df[, -2] # Name the first column "month" colnames(df)[1] <- "month" # Now put it in a tidy format df %>% gather(day, games_played, -month) }

Now I can clean up all the tables. I apply this function to each element of the list games. I

games <- map(games, clean_table) %>% map2_dfr(.x = ., .y = seq(2018, 2001), ~mutate(.x, year = .y))

Now I can easily create the plot I wanted

games %<>% mutate(date = lubridate::ymd(paste(year, month, day, sep = "-"))) ## Warning: 122 failed to parse. ggplot(games, aes(y = games_played, x = date)) + geom_point(colour = "#0f4150") + geom_smooth(colour = "#82518c") + theme_blog() + ylab("Total games played") ## geom_smooth() using method = 'gam' and formula 'y ~ s(x, bs = "cs")' ## Warning: Removed 502 rows containing non-finite values (stat_smooth). ## Warning: Removed 502 rows containing missing values (geom_point).

There’s actually a lot more games than 50 per day being played!

Appendix Fuzzy matching

If you take a look at the extract_defeated_monsters() source code, you’ll see that at some point
I “singularize” monster names. I decided to deal with this singular/plural issue, “by hand”,
but also explored other possibilities, such as matching the plural nouns with the singular
nouns fuzzily. In the end it didn’t work out so well, but here’s the code for future reference.

monster_list <- read_html("https://nethackwiki.com/wiki/Monsters_(by_difficulty)") %>% html_nodes(".prettytable") %>% .[[1]] %>% html_table(fill = TRUE) monster_list %<>% select(monster = Name) head(monster_list) ## monster ## 1 Demogorgon ## 2 Asmodeus ## 3 Baalzebub ## 4 Dispater ## 5 Geryon ## 6 Orcus library(fuzzyjoin) test_vanquished <- extract_defeated_monsters(nethackdumplog[[117]]) head(test_vanquished) ## # A tibble: 6 x 3 ## value monster status ## ## 1 1 baalzebub ## 2 1 orcu ## 3 1 juiblex ## 4 4 the wizard of yendor ## 5 3 pestilence ## 6 1 famine You can take a look at the result by expanding: Click to expand stringdist_left_join(test_vanquished, monster_list) %>% count(monster.y) %>% print(n = Inf) ## Joining by: "monster" ## # A tibble: 297 x 2 ## monster.y n ## ## 1 acid blob 1 ## 2 air elemental 2 ## 3 Aleax 1 ## 4 aligned priest 1 ## 5 Angel 1 ## 6 ape 2 ## 7 arch-lich 1 ## 8 Baalzebub 1 ## 9 baby black dragon 1 ## 10 baby crocodile 1 ## 11 baby gray dragon 1 ## 12 baby green dragon 1 ## 13 baby long worm 1 ## 14 baby orange dragon 1 ## 15 baby white dragon 1 ## 16 baby yellow dragon 1 ## 17 balrog 1 ## 18 baluchitherium 1 ## 19 barbed devil 1 ## 20 barrow wight 1 ## 21 bat 1 ## 22 black dragon 1 ## 23 black light 1 ## 24 black naga 1 ## 25 black pudding 1 ## 26 black unicorn 1 ## 27 blue dragon 1 ## 28 blue jelly 1 ## 29 bone devil 1 ## 30 brown mold 1 ## 31 brown pudding 1 ## 32 bugbear 1 ## 33 captain 1 ## 34 carnivorous ape 1 ## 35 cave spider 1 ## 36 centipede 1 ## 37 chameleon 1 ## 38 chickatrice 2 ## 39 clay golem 1 ## 40 cobra 1 ## 41 cockatrice 2 ## 42 couatl 1 ## 43 coyote 1 ## 44 crocodile 1 ## 45 demilich 1 ## 46 dingo 1 ## 47 disenchanter 1 ## 48 dog 1 ## 49 doppelganger 1 ## 50 dust vortex 1 ## 51 dwarf 2 ## 52 dwarf king 1 ## 53 dwarf lord 1 ## 54 dwarf mummy 1 ## 55 dwarf zombie 1 ## 56 earth elemental 1 ## 57 electric eel 1 ## 58 elf 1 ## 59 elf mummy 1 ## 60 elf zombie 1 ## 61 elf-lord 1 ## 62 Elvenking 1 ## 63 energy vortex 1 ## 64 erinys 1 ## 65 ettin 1 ## 66 ettin mummy 1 ## 67 ettin zombie 1 ## 68 Famine 1 ## 69 fire ant 2 ## 70 fire elemental 2 ## 71 fire giant 2 ## 72 fire vortex 2 ## 73 flaming sphere 1 ## 74 flesh golem 1 ## 75 floating eye 1 ## 76 fog cloud 1 ## 77 forest centaur 1 ## 78 fox 1 ## 79 freezing sphere 1 ## 80 frost giant 1 ## 81 gargoyle 1 ## 82 garter snake 1 ## 83 gas spore 1 ## 84 gecko 1 ## 85 gelatinous cube 1 ## 86 ghost 2 ## 87 ghoul 2 ## 88 giant ant 3 ## 89 giant bat 3 ## 90 giant beetle 1 ## 91 giant eel 1 ## 92 giant mimic 1 ## 93 giant mummy 1 ## 94 giant rat 3 ## 95 giant spider 1 ## 96 giant zombie 1 ## 97 glass piercer 1 ## 98 gnome 1 ## 99 gnome king 1 ## 100 gnome lord 1 ## 101 gnome mummy 1 ## 102 gnome zombie 1 ## 103 gnomish wizard 1 ## 104 goblin 1 ## 105 gold golem 2 ## 106 golden naga 1 ## 107 golden naga hatchling 1 ## 108 gray ooze 1 ## 109 gray unicorn 1 ## 110 Green-elf 1 ## 111 gremlin 1 ## 112 Grey-elf 1 ## 113 grid bug 1 ## 114 guardian naga 1 ## 115 guardian naga hatchling 1 ## 116 hell hound 1 ## 117 hell hound pup 1 ## 118 hezrou 1 ## 119 high priest 1 ## 120 hill giant 1 ## 121 hill orc 1 ## 122 hobbit 1 ## 123 hobgoblin 1 ## 124 homunculus 1 ## 125 horned devil 1 ## 126 horse 2 ## 127 housecat 1 ## 128 human 1 ## 129 human mummy 1 ## 130 human zombie 1 ## 131 ice devil 1 ## 132 ice troll 1 ## 133 ice vortex 2 ## 134 iguana 1 ## 135 imp 1 ## 136 incubus 1 ## 137 iron golem 1 ## 138 iron piercer 1 ## 139 jabberwock 1 ## 140 jackal 1 ## 141 jaguar 1 ## 142 jellyfish 1 ## 143 Juiblex 1 ## 144 Keystone Kop 1 ## 145 ki-rin 1 ## 146 killer bee 1 ## 147 kitten 1 ## 148 kobold 1 ## 149 kobold lord 1 ## 150 kobold mummy 1 ## 151 kobold shaman 1 ## 152 kobold zombie 1 ## 153 Kop Lieutenant 1 ## 154 Kop Sergeant 1 ## 155 kraken 2 ## 156 large cat 1 ## 157 large dog 1 ## 158 large kobold 1 ## 159 large mimic 1 ## 160 leather golem 1 ## 161 leocrotta 1 ## 162 leprechaun 1 ## 163 lich 2 ## 164 lichen 2 ## 165 lieutenant 1 ## 166 little dog 1 ## 167 lizard 1 ## 168 long worm 1 ## 169 Lord Surtur 1 ## 170 lurker above 1 ## 171 lynx 1 ## 172 manes 1 ## 173 marilith 1 ## 174 master lich 1 ## 175 master mind flayer 1 ## 176 Medusa 1 ## 177 mind flayer 1 ## 178 minotaur 1 ## 179 monk 2 ## 180 monkey 1 ## 181 Mordor orc 1 ## 182 mountain centaur 1 ## 183 mountain nymph 1 ## 184 mumak 1 ## 185 nalfeshnee 1 ## 186 Nazgul 1 ## 187 newt 1 ## 188 Norn 1 ## 189 nurse 2 ## 190 ochre jelly 1 ## 191 ogre 1 ## 192 ogre king 1 ## 193 ogre lord 1 ## 194 Olog-hai 1 ## 195 orange dragon 1 ## 196 orc 3 ## 197 orc mummy 1 ## 198 orc shaman 1 ## 199 orc zombie 1 ## 200 orc-captain 1 ## 201 Orcus 1 ## 202 owlbear 1 ## 203 page 2 ## 204 panther 1 ## 205 paper golem 1 ## 206 Pestilence 1 ## 207 piranha 1 ## 208 pit fiend 1 ## 209 pit viper 1 ## 210 plains centaur 1 ## 211 pony 1 ## 212 purple worm 1 ## 213 pyrolisk 1 ## 214 python 1 ## 215 quantum mechanic 1 ## 216 quasit 1 ## 217 queen bee 1 ## 218 quivering blob 1 ## 219 rabid rat 1 ## 220 ranger 1 ## 221 raven 2 ## 222 red dragon 1 ## 223 red mold 1 ## 224 red naga 1 ## 225 rock mole 1 ## 226 rock piercer 1 ## 227 rock troll 1 ## 228 rogue 2 ## 229 rope golem 1 ## 230 roshi 1 ## 231 rothe 1 ## 232 rust monster 1 ## 233 salamander 1 ## 234 sandestin 1 ## 235 sasquatch 1 ## 236 scorpion 1 ## 237 sergeant 1 ## 238 sewer rat 1 ## 239 shade 3 ## 240 shark 2 ## 241 shocking sphere 1 ## 242 shrieker 1 ## 243 silver dragon 1 ## 244 skeleton 1 ## 245 small mimic 1 ## 246 snake 2 ## 247 soldier 1 ## 248 soldier ant 1 ## 249 spotted jelly 1 ## 250 stalker 1 ## 251 steam vortex 1 ## 252 stone giant 2 ## 253 stone golem 1 ## 254 storm giant 2 ## 255 straw golem 1 ## 256 succubus 1 ## 257 tengu 1 ## 258 tiger 1 ## 259 titanothere 1 ## 260 trapper 1 ## 261 troll 1 ## 262 umber hulk 1 ## 263 Uruk-hai 1 ## 264 vampire 1 ## 265 vampire bat 1 ## 266 vampire lord 1 ## 267 violet fungus 1 ## 268 Vlad the Impaler 1 ## 269 vrock 1 ## 270 warg 2 ## 271 warhorse 1 ## 272 water elemental 1 ## 273 water moccasin 1 ## 274 water nymph 1 ## 275 werejackal 2 ## 276 wererat 2 ## 277 werewolf 2 ## 278 white dragon 1 ## 279 white unicorn 1 ## 280 winged gargoyle 1 ## 281 winter wolf 1 ## 282 winter wolf cub 1 ## 283 wizard 1 ## 284 wolf 1 ## 285 wood golem 2 ## 286 wood nymph 1 ## 287 Woodland-elf 1 ## 288 wraith 1 ## 289 wumpus 1 ## 290 xan 3 ## 291 xorn 2 ## 292 yellow dragon 1 ## 293 yellow light 1 ## 294 yellow mold 1 ## 295 yeti 1 ## 296 zruty 1 ## 297 1 As you can see, some matches fail, especially for words that end in “y” in the singular, so “ies” in plural, or “fire vortices” that does not get matched to “fire vortex”. I tried all the methods but it’s either worse, or marginally better. Extracting info from dumplogfiles Click here to take a look at the source code from extract_defeated_monsters #' Extract information about defeated monsters from an xlogfile #' @param xlog A raw xlogfile #' @return A data frame with information on vanquished, genocided and extincted monsters #' @importFrom dplyr mutate select filter bind_rows full_join #' @importFrom tidyr separate #' @importFrom tibble as_tibble tibble #' @importFrom magrittr "%>%" #' @importFrom purrr map2 possibly is_empty modify_if simplify discard #' @importFrom readr read_lines #' @importFrom stringr str_which str_replace_all str_replace str_trim str_detect str_to_lower str_extract_all str_extract #' @export #' @examples #' \dontrun{ #' get_dumplog(xlog) #' } extract_defeated_monsters <- function(dumplog){ if(any(str_detect(dumplog, "No creatures were vanquished."))){ return(NA) } else { start <- dumplog %>% # <- dectect the start of the list str_which("Vanquished creatures") end <- dumplog %>% # <- detect the end of the list str_which("\\d+ creatures vanquished.") if(is_empty(end)){ # This deals with the situation of only one vanquished creature end <- start + 2 } list_creatures <- dumplog[(start + 1):(end - 1)] %>% # <- extract the list str_replace_all("\\s+an? ", "1 ") %>% # <- replace a or an by 1 str_trim() # <- trim white space # The following function first extracts the digit in the string (123 times) # and replaces the 1 with this digit # This means that: "1 the Wizard of Yendor (4 times)" becomes "4 the Wizard of Yendor (4 times)" str_extract_replace <- function(string){ times <- str_extract(string, "\\d+(?=\\stimes)") str_replace(string, "1", times) } result <- list_creatures %>% # If a string starts with a letter, add a 1 # This means that: "Baalzebub" becomes "1 Baalzebub" modify_if(str_detect(., "^[:alpha:]"), ~paste("1", .)) %>% # If the string "(twice)" is detected, replace "1" (that was added the line before) with "2" modify_if(str_detect(., "(twice)"), ~str_replace(., "1", "2")) %>% # Same for "(thrice)" modify_if(str_detect(., "(thrice)"), ~str_replace(., "1", "3")) %>% # Exctract the digit in "digit times" and replace the "1" with digit modify_if(str_detect(., "(\\d+ times)"), str_extract_replace) %>% # Replace "(times)" or "(twice)" etc with "" str_replace_all("\$$.*\$$", "") %>% str_trim() %>% simplify() %>% # Convert the resulting list to a tibble. This tibble has one column: # value # 1 Baalzebub # 2 dogs #... as_tibble() %>% # Use tidyr::separate to separate the "value" column into two columns. The extra pieces get merged # So for example "1 Vlad the Impaler" becomes "1" "Vlad the Impaler" instead of "1" "Vlad" which # would be the case without "extra = "merge"" separate(value, into = c("value", "monster"), extra = "merge") %>% mutate(value = as.numeric(value)) %>% mutate(monster = str_to_lower(monster)) # This function singularizes names: singularize_monsters <- function(nethack_data){ nethack_data %>% mutate(monster = str_replace_all(monster, "mummies", "mummy"), monster = str_replace_all(monster, "jellies", "jelly"), monster = str_replace_all(monster, "vortices", "vortex"), monster = str_replace_all(monster, "elves", "elf"), monster = str_replace_all(monster, "wolves", "wolf"), monster = str_replace_all(monster, "dwarves", "dwarf"), monster = str_replace_all(monster, "liches", "lich"), monster = str_replace_all(monster, "baluchiteria", "baluchiterium"), monster = str_replace_all(monster, "homunculi", "homonculus"), monster = str_replace_all(monster, "mumakil", "mumak"), monster = str_replace_all(monster, "sasquatches", "sasquatch"), monster = str_replace_all(monster, "watchmen", "watchman"), monster = str_replace_all(monster, "zruties", "zruty"), monster = str_replace_all(monster, "xes", "x"), monster = str_replace_all(monster, "s$", "")) } result <- singularize_monsters(result) } # If a player did not genocide or extinct any species, return the result: if(any(str_detect(dumplog, "No species were genocided or became extinct."))){ result <- result %>% mutate(status = NA_character_) return(result) } else { # If the player genocided or extincted species, add this info: start <- dumplog %>% # <- dectect the start of the list str_which("Genocided or extinct species:") # <- sometimes this does not appear in the xlogfile end <- dumplog %>% # <- detect the end of the list str_which("Voluntary challenges") if(is_empty(start)){# This deals with the situation start does not exist start <- end - 2 } list_creatures <- dumplog[(start + 1):(end - 1)] %>% # <- extract the list str_trim() # <- trim white space extinct_species <- list_creatures %>% str_extract_all("[:alpha:]+\\s(?=\$$extinct\$$)", simplify = T) %>% str_trim %>% discard(==(., "")) extinct_species_df <- tibble(monster = extinct_species, status = "extinct") genocided_species_index <- list_creatures %>% str_detect(pattern = "extinct|species") %>% ! genocided_species <- list_creatures[genocided_species_index] genocided_species_df <- tibble(monster = genocided_species, status = "genocided") genocided_or_extinct_df <- singularize_monsters(bind_rows(extinct_species_df, genocided_species_df)) result <- full_join(result, genocided_or_extinct_df, by = "monster") %>% filter(monster != "") # <- this is to remove lines that were added by mistake, for example if start was empty return(result) } } var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### simmer 4.1.0 Fri, 11/09/2018 - 19:36 (This article was first published on R – Enchufa2, and kindly contributed to R-bloggers) The 4.1.0 release of simmer, the Discrete-Event Simulator for R, is on CRAN. As per request in the mailing list, now get_global() is able to work inside a generator function. Moreover, the new add_global() method attaches a global attribute to a simulator. library(simmer) env <- simmer() hello_sayer <- trajectory() %>% log_("hello world!") %>% set_global("interarrival", 1, mod="+") generator <- function() get_global(env, "interarrival") env %>% add_global("interarrival", 1) %>% add_generator("dummy", hello_sayer, generator) %>% run(7) %>% get_global("interarrival") ## 1: dummy0: hello world! ## 3: dummy1: hello world! ## 6: dummy2: hello world! ## [1] 4 Compared to plain global variables, these ones are automatically managed and thus reinitialised if the environment is reset. env %>% reset() %>% get_global("interarrival") ## [1] 1 env %>% run(7) %>% get_global("interarrival") ## 1: dummy0: hello world! ## 3: dummy1: hello world! ## 6: dummy2: hello world! ## [1] 4 There has been a small refactoring of some parts of the C++ core, which motivates the minor version bump, but this shouldn’t be noticeable to the users. Finally, several bug fixes and improvements complete this release. See below for a complete list. New features: • New getter get_selected() retrieves names of selected resources via the select() activity (#172 addressing #171). • Source and resource getters have been vectorised to retrieve parameters from multiple entities (as part of #172). • Simplify C++ Simulator interface for adding processes and resources (#162). The responsibility of building the objects has been moved to the caller. • New add_global() method to attach global attributes to a simulation environment (#174 addressing #158). Minor changes and fixes: • Remove 3.8.0 and 4.0.1 deprecations (#170 addressing #165). • Fix get_global() to work outside trajectories (#170 addressing #165). • Fix rollback() with an infinite amount (#173). • Fix and improve schedules and managers (as part of #174). • Fix reset() to avoid overwriting the simulation environment (#175). Article originally published in Enchufa2.es: simmer 4.1.0. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### T-mobile uses R for Customer Service AI Fri, 11/09/2018 - 14:50 (This article was first published on Revolutions, and kindly contributed to R-bloggers) T-Mobile, the global telecommunication company, is using R in production to automatically classify text messages to customer service and route them to an agent that can help. The AI@T-mobile team used the keras library in R to build a natural language processing engine with Tensorflow, and deployed it to production as a docker container. The MRAN Time Machine ensures the container gets fixed R package versions for reproducibility. While you may think first of Python for building production AI models, the keras library in R has already proven to be a first-class interface for building AI models. The T-mobile team was able to build this system in just four months and with a small budget because they were able to explore in R and immediately deploy in R: "Despite being an incredibly popular language for exploratory analysis, data scientists are repeatedly told that R is not sufficient for machine learning – especially if those ML models are destined for production. Though much exploratory analysis and modeling is done in R, these models must be rebuilt in Python to meet DevOps and enterprise requirements. Our team doesn’t see the value in this double work. We explore in R and immediately deploy in R. Our APIs are neural network models using R and TensorFlow in docker containers that are small and maintainable enough to make our DevOps team happy!" If you'd like to try this out yourself, the T-mobile team has published an open source version of their message-classification container on Github, and you can read the details in the blog post by team leads at the link below. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Escaping the macOS 10.14 (Mojave) Filesystem Sandbox with R / RStudio Fri, 11/09/2018 - 13:01 (This article was first published on R – rud.is, and kindly contributed to R-bloggers) If you’re an R/RStudio user who has migrated to Mojave (macOS 10.14) or are contemplating migrating to it, you will likely eventually run into an issue where you’re trying to access resources that are in Apple’s new hardened filesystem sandboxes. Rather than reinvent the wheel by blathering about what that means, give these links a visit if you want more background info: and/or Google/DuckDuckGo for macos privacy tcc full disk access as there have been a ton of blogs on this topic. The TLDR for getting your R/RStudio bits to work in these sandboxed areas is to grant them “Full Disk Access” (if that sounds scary, it is) permissions. You can do that by adding both the R executable and RStudio executable (drag their icons) to the Full Disk Access pane under the Privacy tab of System Preferences => Security & Privacy: I also used the Finder’s “Go To” command to head on over /usr/local/bin and use the “Show Original” popup on both R and Rscript and dragged their fully qualified path binaries into the pane as well (you can’t see them b/c the pane is far too small). The symbolic links might be sufficient, but I’ve been running this way since the betas and just re-drag the versioned R/Rscript binaries each time I upgrade (or rebuild) R. If you do grant FDA to R/RStudio just make sure be even more careful about trusting R code you run from the internet or R packages you install from untrusted sources (like GitHub) since R/RStudio are now potential choice conduits for malicious code that wants to get at your seekret things. Photo by Alexander Dummer on Unsplash var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Coding Regression trees in 150 lines of R code Fri, 11/09/2018 - 11:25 (This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers) Motivation There are dozens of machine learning algorithms out there. It is impossible to learn all their mechanics, however, many algorithms sprout from the most established algorithms, e.g. ordinary least squares, gradient boosting, support vector machines, tree-based algorithms and neural networks. At STATWORX we discuss algorithms daily to evaluate their usefulness for a specific project. In any case, understanding these core algorithms is key to most machine learning algorithms in the literature. While I like reading machine learning research papers, the maths is sometimes hard to follow. That is why I like implementing the algorithms in R by myself. Of course this means digging through the maths and the algorithms as well. However, you can challenge your understanding of the algorithm directly. In my two subsequent blog post I will introduce two machine learning algorithms in 150 lines of R Code. This blog post will be about regression trees, which are the foundation of most tree-based algorithms. You can find the other blog post about coding gradient boosted machines from scratch on our blog. The algorithms will cover all core mechanics, while being very generic. You can find all code on my GitHub. Gathering all puzzle pieces Surely, there are tons of great articles out there which explain regression trees theoretically accompanied with a hands-on example. This is not the objective of this blog post. If you are interested in a hands-on tutorial with all necessary theory I strongly recommend this tutorial. The objective of this blog post is to establish the theory of the algorithm by writing simple R code. You do not need any prior knowledge of the algorithm to follow. The only thing you need to know is our objective: We want to estimate our real-valued target (y) with a set of real-valued features (X). Most probably you are already familiar with decision trees, which is a machine learning algorithm to solve classification tasks. As the name itself states regression trees solve regressions, i.e. estimation with continuous scaled targets. These kind of trees are the key part of every tree-based method, since the way you grow a tree is more of the same really. The differing parts among the implementations are mostly about the splitting rule. In this tutorial we will program a very simple, but generic implementation of a regression tree. Fortunately, we do not have to cover much maths in this tutorial, because the algorithm itself is rather a technical than a mathematical challenge. With that said, the technical path I have chosen might not be the most efficient way, but I tried to trade-off efficiency with simplicity. Anyway, as most of you might know decision or regression trees are rule-based approaches. Meaning, we are trying to split the data into partitions conditional to our feature space. The data partitioning is done with the help of a splitting criterion. There is no common ground on how to do those splits, there are rather multiple different splitting criteria with different pros and cons. We will focus on a rather simple criterion in this tutorial. Bear with me, here comes some maths. Ok, so what does this state? This is the sum of squared errors determined in two different subsets ( and ). As the name suggests, that should be something we want to minimize. In fact, it is the squared distance between the mean and the target within this data subset. In every node of our regression tree we calculate the SSE for every potential split we could do in our data for every feature we have to figure out the best split we can achieve. Let us have a look at the R Code: # This is the splitting criterion we minimize (SSE [Sum Of Squared Errors]): # SSE = \sum{i \in S_1} (y_i - \bar(y)_1)^2 + \sum{i \in S_2} (y_i - \bar(y)_2)^2 sse_var <- function(x, y) { splits <- sort(unique(x)) sse <- c() for (i in seq_along(splits)) { sp <- splits[i] sse[i] <- sum((y[x < sp] - mean(y[x < sp]))^2) + sum((y[x >= sp] - mean(y[x >= sp]))^2) } split_at <- splits[which.min(sse)] return(c(sse = min(sse), split = split_at)) } The function takes two inputs our numeric feature x and our target real-valued y. We then go ahead and calculate the SSE for every unique value of our x. This means we calculate the SSE for every possible data subset we could obtain conditional on the feature. Often we want to cover more than one feature in our problem, which means that we have to run this function for every feature. As a result, the best splitting rule has the lowest SSE among all possible splits of all features. Once we have determined the best splitting rule we can split our data into these two subsets according to our criterion, which is nothing else than feature x <= split_at and x > split_at. We call these two subsets children and they again can be split into subsets again. Let us lose some more words on the SSE though, because it reveals our estimator. In this implementation our estimator in the leaf is simply the avarage value of our target within this data subset. This is the simplest version of a regression tree. However, with some additional work you can apply more sophisticated models, e.g. an ordinary least squares fit. The Algorithm Enough with the talking, let's get to the juice. In the following you will see the algorithm in all of its beauty. Afterwards we will breakdown the algorithm in easy-to-digest code chunks. #' reg_tree #' Fits a simple regression tree with SSE splitting criterion. The estimator function #' is the mean. #' #' @param formula an object of class formula #' @param data a data.frame or matrix #' @param minsize a numeric value indicating the minimum size of observations #' in a leaf #' #' @return \itemize{ #' \item tree - the tree object containing all splitting rules and observations #' \item fit - our fitted values, i.e. X %*% theta #' \item formula - the underlying formula #' \item data - the underlying data #' } #' @export #' #' @examples # Complete runthrough see: www.github.com/andrebleier/cheapml reg_tree <- function(formula, data, minsize) { # coerce to data.frame data <- as.data.frame(data) # handle formula formula <- terms.formula(formula) # get the design matrix X <- model.matrix(formula, data) # extract target y <- data[, as.character(formula)[2]] # initialize while loop do_splits <- TRUE # create output data.frame with splitting rules and observations tree_info <- data.frame(NODE = 1, NOBS = nrow(data), FILTER = NA, TERMINAL = "SPLIT", stringsAsFactors = FALSE) # keep splitting until there are only leafs left while(do_splits) { # which parents have to be splitted to_calculate <- which(tree_info$TERMINAL == "SPLIT") for (j in to_calculate) { # handle root node if (!is.na(tree_info[j, "FILTER"])) { # subset data according to the filter this_data <- subset(data, eval(parse(text = tree_info[j, "FILTER"]))) # get the design matrix X <- model.matrix(formula, this_data) } else { this_data <- data } # estimate splitting criteria splitting <- apply(X, MARGIN = 2, FUN = sse_var, y = y) # get the min SSE tmp_splitter <- which.min(splitting[1,]) # define maxnode mn <- max(tree_info$NODE) # paste filter rules tmp_filter <- c(paste(names(tmp_splitter), ">=", splitting[2,tmp_splitter]), paste(names(tmp_splitter), "<", splitting[2,tmp_splitter])) # Error handling! check if the splitting rule has already been invoked split_here <- !sapply(tmp_filter, FUN = function(x,y) any(grepl(x, x = y)), y = tree_info$FILTER) # append the splitting rules if (!is.na(tree_info[j, "FILTER"])) { tmp_filter <- paste(tree_info[j, "FILTER"], tmp_filter, sep = " & ") } # get the number of observations in current node tmp_nobs <- sapply(tmp_filter, FUN = function(i, x) { nrow(subset(x = x, subset = eval(parse(text = i)))) }, x = this_data) # insufficient minsize for split if (any(tmp_nobs <= minsize)) { split_here <- rep(FALSE, 2) } # create children data frame children <- data.frame(NODE = c(mn+1, mn+2), NOBS = tmp_nobs, FILTER = tmp_filter, TERMINAL = rep("SPLIT", 2), row.names = NULL)[split_here,] # overwrite state of current node tree_info[j, "TERMINAL"] <- ifelse(all(!split_here), "LEAF", "PARENT") # bind everything tree_info <- rbind(tree_info, children) # check if there are any open splits left do_splits <- !all(tree_info$TERMINAL != "SPLIT") } # end for } # end while # calculate fitted values leafs <- tree_info[tree_info$TERMINAL == "LEAF", ] fitted <- c() for (i in seq_len(nrow(leafs))) { # extract index ind <- as.numeric(rownames(subset(data, eval(parse(text = leafs[i, "FILTER"]))))) # estimator is the mean y value of the leaf fitted[ind] <- mean(y[ind]) } # return everything return(list(tree = tree_info, fit = fitted, formula = formula, data = data)) }

Uff, that was a lot of scrolling. Ok, so what do we have here. At first glance we see two loops a while loop and a for loop, which are essential to the algorithm. But let us start bit by bit. Let's have a look at the first code chunk.

# coerce to data.frame data <- as.data.frame(data) # handle formula formula <- terms.formula(formula) # get the design matrix X <- model.matrix(formula, data) # extract target y <- data[, as.character(formula)[2]] # initialize while loop do_splits <- TRUE # create output data.frame with splitting rules and observations tree_info <- data.frame(NODE = 1, NOBS = nrow(data), FILTER = NA, TERMINAL = "SPLIT", stringsAsFactors = FALSE)

Essentially, this is everything before the while loop. Here, we see some input handling in the beginning and the extraction of our design matrix X and our target y. All our featuers are within this design matrix. do_splits is the condition for our while loop, which we will cover in a bit. The data frame tree_info is the key storage element in our algorithm, because it will contain every information we need about our tree. The essential piece of this object is the filter column. This column saves the paths (filter) we have to take through (to apply to) our data to get to a leaf (a terminal node) in our regression tree. We initiate this data.frame with NODE = 1, since at the beginning we are in the root node of our tree with the whole data set at our expense. Furthermore, there is a column called TERMINAL, which controls the state of the node. We have three different states SPLIT, LEAF and PARENT. When we describe a node with the SPLIT state, we mark it for a potential split. The state PARENT indicates, that we have already split this node. Lastly, the state LEAF marks terminal nodes of our regression tree.

Reaching the treetop

When do we reach such a terminal node? In many implementations there is a minimum size parameter, where we determine valid splits by the amount of observations of its children. If the children have lower data points than the minimum size the split is invalid and will not be done. Imagine not having this parameter in our case, we could end up with leafs covering only one observation. Another termination rule is if the lowest SSE is at a split we have already invoked within the branch and hence has already been invoked in this branch. The split at such a point would be nonsense, since we would end up with the same subset over and over again.

Basically, this is everything there is to the algorithm. The while and for loop just ensure, that we estimate and create every node in our tree_info. Thus, you can perceive the tree_info as sort of a job list, since we create new jobs within this data frame. Let us walk through the actual R code.

# keep splitting until there are only leafs left while(do_splits) { # which parents have to be splitted to_calculate <- which(tree_info$TERMINAL == "SPLIT") for (j in to_calculate) { # handle root node if (!is.na(tree_info[j, "FILTER"])) { # subset data according to the filter this_data <- subset(data, eval(parse(text = tree_info[j, "FILTER"]))) # get the design matrix X <- model.matrix(formula, this_data) } else { this_data <- data } # estimate splitting criteria splitting <- apply(X, MARGIN = 2, FUN = sse_var, y = y) # get the min SSE tmp_splitter <- which.min(splitting[1,]) # define maxnode mn <- max(tree_info$NODE) # paste filter rules tmp_filter <- c(paste(names(tmp_splitter), ">=", splitting[2,tmp_splitter]), paste(names(tmp_splitter), "<", splitting[2,tmp_splitter])) # Error handling! check if the splitting rule has already been invoked split_here <- !sapply(tmp_filter, FUN = function(x,y) any(grepl(x, x = y)), y = tree_info$FILTER) # append the splitting rules if (!is.na(tree_info[j, "FILTER"])) { tmp_filter <- paste(tree_info[j, "FILTER"], tmp_filter, sep = " & ") } # get the number of observations in current node tmp_nobs <- sapply(tmp_filter, FUN = function(i, x) { nrow(subset(x = x, subset = eval(parse(text = i)))) }, x = this_data) # insufficient minsize for split if (any(tmp_nobs <= minsize)) { split_here <- rep(FALSE, 2) } # create children data frame children <- data.frame(NODE = c(mn+1, mn+2), NOBS = tmp_nobs, FILTER = tmp_filter, TERMINAL = rep("SPLIT", 2), row.names = NULL)[split_here,] # overwrite state of current node tree_info[j, "TERMINAL"] <- ifelse(all(!split_here), "LEAF", "PARENT") # bind everything tree_info <- rbind(tree_info, children) # check if there are any open splits left do_splits <- !all(tree_info$TERMINAL != "SPLIT") } # end for } # end while

The while loop covers our tree depth and our for loop all splits within this certain depth. Within the while loop we seek every row in our tree_info, which we still have to estimate, i.e. all nodes in the "SPLIT" state. In the first iteration it would be the first row, our root node. The for loop iterates over all potential splitting nodes. The first if condition ensures that we filter our data according to the tree depth. Of course, there is no filter in the root node that is why we take the data as is. But imagine calculating possible splits for a parent in depth level two. The filter would look similar to this one: "feature_1 > 5.33 & feature_2 <= 3.22".

How do we apply splitting?

Afterwards we seek the minimum SSE by applying the sse_var function to every feature. Note, that in this version we can only handle numeric features. Once we have found the best splitting variable, here named tmp_splitter, we build the according filter rule in object tmp_filter.

We still have to check if this is a valid split, i.e. we have not invoked this split for this branch yet and we have sufficient observations in our children. Our indicator split_here rules whether we split our node. Well, that is about it. In the last passage of the loop we prepare the output for this node. That is, handling the state of the calculated node and adding the children information to our job list tree_info. After the for loop we have to check whether our tree is fully grown. The variable do_splits checks whether there are any nodes left we have to calculate. We terminate the calculation if there are no "SPLIT" nodes left in our tree_info.

Extracting our estimates # calculate fitted values leafs <- tree_info[tree_info$TERMINAL == "LEAF", ] fitted <- c() for (i in seq_len(nrow(leafs))) { # extract index ind <- as.numeric(rownames(subset(data, eval(parse( text = leafs[i, "FILTER"]))))) # estimator is the mean y value of the leaf fitted[ind] <- mean(y[ind]) } At the end of our calculation we have a filter rule for every leaf in our tree. With the help of these filters we can easily calculate the fitted values by simply applying the filter on our data and calculating our fit, i.e. the mean of our target in this leaf. I am sure by now you can think of a way to implement more sophisticated estimators, which I would leave up to you. Well, that's a regression tree with minimum size restriction. I have created a little runthrough with data from my simulation package on my GitHub, which you can checkout and try everything on your own. Make sure to checkout my other blog post about coding gradient boosted machines from scratch. Über den Autor André Bleier The most exciting part of being a data scientist at STATWORX is to find this unique solution to a problem by fusing machine learning, statistics and business knowledge. ABOUT US STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. .button {background-color: #0085af;} Der Beitrag Coding Regression trees in 150 lines of R code erschien zuerst auf STATWORX. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Image segmentation based on Superpixels and Clustering Fri, 11/09/2018 - 01:00 (This article was first published on mlampros, and kindly contributed to R-bloggers) In this blog post, I’ll explain the new functionality of the OpenImageR package, SLIC and SLICO superpixels (Simple Linear Iterative Clustering) and their applicability based on an IJSR article. The author of the article uses superpixel (SLIC) and Clustering (Affinity Propagation) to perform image segmentation. The article was reproduced (and extended with Kmeans) using the latest versions of the OpenImageR and ClusterR packages. Superpixels “Introduced by Ren and Malik in 2003, superpixels group pixels similar in color and other low-level properties. In this respect, superpixels address two problems inherent to the processing of digital images: firstly, pixels are merely a result of discretization; and secondly, the high number of pixels in large images prevents many algorithms from being computationally feasible. Ren and Malik introduce superpixels as more natural entities – grouping pixels which perceptually belong together while heavily reducing the number of primitives for subsequent algorithms.” You can read more about superpixels in this arxiv article. The superpixels() function of the OpenImageR package is based on the open source C++ code of the IMAGE AND VISUAL REPRESENTATION LAB IVRL and allows users to receive labels and output slic images. The following code chunk illustrates how this can be done using either the “slic” or the “slico” method, library(OpenImageR) path = system.file("tmp_images", "slic_im.png", package = "OpenImageR") im = readImage(path) #-------------- # "slic" method #-------------- res_slic = superpixels(input_image = im, method = "slic", superpixel = 200, compactness = 20, return_slic_data = TRUE, return_labels = TRUE, write_slic = "", verbose = TRUE) str(res_slic) List of 2$ slic_data: num [1:360, 1:360, 1:3] 255 255 255 255 255 255 255 255 255 255 ... $labels : num [1:360, 1:360] 0 0 0 0 0 0 0 0 0 0 ... #--------------- # "slico" method [ the "slico" method does not require the "compactness" parameter ] #--------------- res_slico = superpixels(input_image = im, method = "slico", superpixel = 200, return_slic_data = TRUE, return_labels = TRUE, write_slic = "", verbose = TRUE) str(res_slico) List of 2$ slic_data: num [1:360, 1:360, 1:3] 255 255 255 255 255 255 255 255 255 255 ... $labels : num [1:360, 1:360] 0 0 0 0 0 0 0 0 0 0 ... #----------------------------- # plot both "slic" and "slico" #----------------------------- par(mfrow=c(1,2), mar = c(0.2, 0.2, 0.2, 0.2)) plot_slic = OpenImageR::NormalizeObject(res_slic$slic_data) plot_slic = grDevices::as.raster(plot_slic) graphics::plot(plot_slic) plot_slico = OpenImageR::NormalizeObject(res_slico$slic_data) plot_slico = grDevices::as.raster(plot_slico) graphics::plot(plot_slico) According to the authors of the slic method, “SLIC uses the same compactness parameter (chosen by user) for all superpixels in the image. If the image is smooth in certain regions but highly textured in others, SLIC produces smooth regular-sized superpixels in the smooth regions and highly irregular superpixels in the textured regions. So, it becomes tricky choosing the right parameter for each image. SLICO does away with this problem completely. The user no longer has to set the compactness parameter or try different values of it. SLICO adaptively chooses the compactness parameter for each superpixel differently. This generates regular shaped superpixels in both textured and non textured regions alike. The improvement comes with hardly any compromise on the computational efficiency – SLICO continues to be as fast as SLIC.” More information about the parameters of the superpixels() function can be found in the package documentation. Image segmentation “In computer vision, image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as super-pixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.” You can read more about image segmentation in this wikipedia article. Affinity Propagation “Affinity propagation is a new algorithm that takes as input measures of similarity between pairs of data points and simultaneously considers all data points as potential exemplars. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We have used affinity propagation to solve a variety of clustering problems and we found that it uniformly found clusters with much lower error than those found by other methods, and it did so in less than one-hundredth the amount of time. Because of its simplicity, general applicability, and performance, we believe affinity propagation will prove to be of broad value in science and engineering.” The Affinity Propagation algorithm is explained in detail in the authors web page. Image segmentation based on superpixels (SLIC, SLICO) and Affinity Propagation (AP) The author of the article, that I mentioned earlier, uses a method named SLICAP (SLIC + AP) to perform image segmentation. The methodology is the following: • First, the SLICAP technique uses the SLIC superpixel algorithm to form an over-segmentation of an image • Then, a similarity is constructed based on the features of superpixels • Finally, the AP algorithm clusters these superpixels with the similarities obtained The advantages of this method compared to other methods are the following: • similarities are not computed based on all data (image), but on the resulted superpixels (reduction of the computational burden) • the user does not have to specify the number of clusters (AP algorithm) as is the case with other clustering methods such as Kmeans The author of the article utilizes, • the CIELAB colour space which is designed to be perceptually uniform with respect to human color vision • the negative Euclidean distance (although the author uses 3 different similarity measures (page 1526), I’ll use the first one in my code because it gives better results) • weights for the 3 colour channels (wL, wA, wB) because they keep balance so as to be consistent with human perception • a colorradius to adjust the number of clusters. If its value is low, the number of targets would increase, which leads to more detailed segmentation results • the Berkeley Segmentation Database (BSD) (it can be downloaded also from my Github repository) I bundled the code in an R package, SuperpixelImageSegmentation, which can be downloaded from Github (consult the README.md file of the package for more information). The following code snippet first reads the input image and then performs image segmentation based on SLIC superpixels and AP clustering, library(SuperpixelImageSegmentation) path = system.file("images", "BSR_bsds500_image.jpg", package = "SuperpixelImageSegmentation") im = OpenImageR::readImage(path) init = Image_Segmentation$new() spx = init$spixel_segmentation(input_image = im, superpixel = 600, AP_data = TRUE, use_median = TRUE, sim_wL = 3, sim_wA = 10, sim_wB = 10, sim_color_radius = 10, verbose = TRUE) Sequential computation starts ... WARNING: The input data has values between 0.000000 and 1.000000. The image-data will be multiplied by the value: 255! The super-pixel slic method as pre-processing step was used / completed! The similarity matrix based on super-pixels was computed! It took 179 iterations for affinity propagation to complete! 6 clusters were chosen based on super-pixels and affinity propagation! Image data based on Affinity Propagation clustering ('AP_image_data') will be returned! Elapsed time: 0 hours and 0 minutes and 2 seconds. str(spx) List of 4$ KMeans_image_data: num[0 , 0 , 0 ] $masks : list() ..- attr(*, "dim")= int [1:2] 0 0$ centr : num[0 , 0 ] $AP_image_data : num [1:481, 1:321, 1:3] 1 1 1 1 1 1 1 1 1 1 ... By using a colorradius of 10 (the author uses by default 20) I receive the following output, OpenImageR::imageShow(spx$AP_image_data)

Decreasing the number of superpixels (from 600 to 200) or modifying the colorradius (from 10 to 20) will affect the quality of the output image, but it will also reduce the computation time.

Superpixels, AP and Kmeans (or Mini-Batch-Kmeans)

As a follow-up to the author’s method, one might take advantage of the automatic way the AP algorithm determines the number of clusters in order to perform vector quantization using kmeans (this will give qualitative better results at the cost of increasing computation time). I included additional parameters in the spixel_segmentation() method of the Image_Segmentation R6-class for this purpose and I’ll use the following image, which appears also in the article,

path_km = system.file("images", "airplane.jpg", package = "SuperpixelImageSegmentation") im_km = OpenImageR::readImage(path_km) OpenImageR::imageShow(im_km)

spx_km = init$spixel_segmentation(input_image = im_km, superpixel = 600, AP_data = TRUE, use_median = TRUE, sim_wL = 3, sim_wA = 10, sim_wB = 10, sim_color_radius = 10, kmeans_method = "kmeans", kmeans_initializer = "kmeans++", kmeans_num_init = 3, kmeans_max_iters = 100, verbose = TRUE) Sequential computation starts ... WARNING: The input data has values between 0.000000 and 1.000000. The image-data will be multiplied by the value: 255! The super-pixel slic method as pre-processing step was used / completed! The similarity matrix based on super-pixels was computed! It took 146 iterations for affinity propagation to complete! 10 clusters were chosen based on super-pixels and affinity propagation! Image data based on Affinity Propagation clustering ('AP_image_data') will be returned! The kmeans algorithm based on the number of affinity-propagation-clusters was completed! Pre-processing of the kmeans-output-image is completed! Elapsed time: 0 hours and 0 minutes and 8 seconds. str(spx_km) List of 4$ KMeans_image_data: num [1:321, 1:481, 1:3] 0.1 0.1 0.26 0.389 0.432 ... $masks :List of 10 ..$ : num[0 , 0 , 0 ] ..$: num[0 , 0 , 0 ] ..$ : num[0 , 0 , 0 ] ..$: num[0 , 0 , 0 ] ..$ : num[0 , 0 , 0 ] ..$: num[0 , 0 , 0 ] ..$ : num[0 , 0 , 0 ] ..$: num[0 , 0 , 0 ] ..$ : num[0 , 0 , 0 ] ..$: num[0 , 0 , 0 ] ..- attr(*, "dim")= int [1:2] 10 1$ centr : num [1:10, 1:3] 0.432 0.251 0.863 0.552 0.26 ... $AP_image_data : num [1:321, 1:481, 1:3] 0.404 0.404 0.404 0.404 0.404 ... The following image shows the output based on Superpixels + AP, OpenImageR::imageShow(spx_km$AP_image_data)

and the following based on Superpixels + AP + Kmeans (vector quantization),

OpenImageR::imageShow(spx_km$KMeans_image_data) The spixel_segmentation() takes as method parameter the Mini-Batch-Kmeans too, which can decrease the computation time at the cost of a reduced image quality. However, the advantage of using Superpixels + AP + Kmeans (or Mini-Batch-Kmeans) lies in the fact that the user does not have to specify the number of clusters beforehand, spx_mbkm = init$spixel_segmentation(input_image = im_km, superpixel = 600, AP_data = TRUE, use_median = TRUE, sim_wL = 3, sim_wA = 10, sim_wB = 10, sim_color_radius = 10, kmeans_method = "mini_batch_kmeans", kmeans_initializer = "kmeans++", kmeans_num_init = 3, kmeans_max_iters = 100, minib_kmeans_batch = 10, minib_kmeans_init_fraction = 0.75, verbose = TRUE) Sequential computation starts ... WARNING: The input data has values between 0.000000 and 1.000000. The image-data will be multiplied by the value: 255! The super-pixel slic method as pre-processing step was used / completed! The similarity matrix based on super-pixels was computed! It took 146 iterations for affinity propagation to complete! 10 clusters were chosen based on super-pixels and affinity propagation! Image data based on Affinity Propagation clustering ('AP_image_data') will be returned! The mini-batch-kmeans algorithm based on the number of affinity-propagation-clusters was completed! Pre-processing of the mini-batch-kmeans-output-image is completed! Elapsed time: 0 hours and 0 minutes and 3 seconds.

The following image is based on Superpixels + AP + Mini-Batch-Kmeans,

OpenImageR::imageShow(spx_mbkm$KMeans_image_data) Further parameters which can be adjusted in the spixel_segmentation() method are: • method (“slic”, “slico”) • colour_type (“RGB”, “LAB” or “HSV”) More details can be found in the SuperpixelImageSegmentation R package documentation. Image Masks The spixel_masks_show() method in combination with the adjust_centroids_and_return_masks parameter (set to TRUE), allows users to also plot the masks belonging to each separate cluster, spx_masks = init$spixel_segmentation(input_image = im_km, superpixel = 600, AP_data = TRUE, use_median = TRUE, sim_wL = 3, sim_wA = 10, sim_wB = 10, sim_color_radius = 10, kmeans_method = "kmeans", kmeans_initializer = "kmeans++", adjust_centroids_and_return_masks = TRUE, verbose = TRUE) Sequential computation starts ... WARNING: The input data has values between 0.000000 and 1.000000. The image-data will be multiplied by the value: 255! The super-pixel slic method as pre-processing step was used / completed! The similarity matrix based on super-pixels was computed! It took 146 iterations for affinity propagation to complete! 10 clusters were chosen based on super-pixels and affinity propagation! Image data based on Affinity Propagation clustering ('AP_image_data') will be returned! The kmeans algorithm based on the number of affinity-propagation-clusters was completed! NOTE: The 'KMeans_image_data' will be returned in black & white colour due to the fact that the 'adjust_centroids_and_return_masks' parameter was set to TRUE! Pre-processing of the kmeans-output-image is completed! The centroids were adjusted and the image-masks will be returned! Elapsed time: 0 hours and 0 minutes and 8 seconds. str(spx_masks) List of 4 $KMeans_image_data: num [1:321, 1:481, 1:3] 5 5 4 7 0 0 7 7 7 7 ...$ masks :List of 10 ..$: num [1:321, 1:481, 1:3] 0 0 0 0 0.404 ... ..$ : num [1:321, 1:481, 1:3] 0 0 0 0 0 0 0 0 0 0 ... ..$: num [1:321, 1:481, 1:3] 0 0 0 0 0 0 0 0 0 0 ... ..$ : num [1:321, 1:481, 1:3] 0 0 0 0 0 0 0 0 0 0 ... ..$: num [1:321, 1:481, 1:3] 0 0 0.294 0 0 ... ..$ : num [1:321, 1:481, 1:3] 0.0784 0.1216 0 0 0 ... ..$: num [1:321, 1:481, 1:3] 0 0 0 0 0 0 0 0 0 0 ... ..$ : num [1:321, 1:481, 1:3] 0 0 0 0.392 0 ... ..$: num [1:321, 1:481, 1:3] 0 0 0 0 0 0 0 0 0 0 ... ..$ : num [1:321, 1:481, 1:3] 0 0 0 0 0 0 0 0 0 0 ... ..- attr(*, "dim")= int [1:2] 10 1 $centr : num [1:10, 1:3] 0 1 2 3 4 5 6 7 8 9 ...$ AP_image_data : num [1:321, 1:481, 1:3] 0.404 0.404 0.404 0.404 0.404 ...

the display_all parameter enables plotting of masks either all at once or sequentially,

library(AzureRMR) az <- az_rm$new(tenant="{tenant_id}", app="{app_id}", password="{password}") # get a subscription and resource group sub <- az$get_subscription("{subscription_id}") rg <- sub$get_resource_group("rgName") # get a resource (storage account) stor <- rg$get_resource(type="Microsoft.Storage/storageAccounts", name="myStorage") # method chaining works too stor <- az$get_subscription("{subscription_id}")$ get_resource_group("rgName")$get_resource(type="Microsoft.Storage/storageAccounts", name="myStorage") # create a new resource group and resource rg2 <- sub$create_resource_group("newRgName", location="westus") stor2 <- rg2$create_resource(type="Microsoft.Storage/storageAccounts", name="myNewStorage", kind="Storage", sku=list(name="Standard_LRS", tier="Standard")) # delete them stor2$delete(confirm=FALSE) rg2$delete(confirm=FALSE) You can use AzureRMR to work with any Azure service, but a key idea behind it is that it's extensible: you can write new packages that provide extra functionality relevant to specific Azure offerings. This deals with one of the main shortcomings of AzureSMR: as a monolithic package, it became steadily more difficult to add new features as time went on. The packages that extend AzureRMR in this way constitute the AzureR family. Currently, the AzureR family includes the following packages: • AzureVM is a package for working with virtual machines and virtual machine clusters. It allows you to easily deploy, manage and delete a VM from R. A number of templates are provided based on the Data Science Virtual Machine, or you can also supply your own template. • AzureStor is a package for working with storage accounts. In addition to a Resource Manager interface for creating and deleting storage accounts, it also provides a client interface for working with the storage itself. You can upload and download files and blobs, list file shares and containers, list files within a share, and so on. It supports authenticated access to storage via either a key or a shared access signature (SAS). • AzureContainers is a package for working with containers in Azure: specifically, it provides an interface to Azure Container Instances, Azure Container Registry and Azure Kubernetes Service. You can easily create an image, push it to an ACR repo, and then deploy it as a service to AKS. It also provides lightweight shells to docker, kubectl and helm (assuming these are installed). I'll be writing future blog posts to describe each of these in further detail. All the AzureR packages are part of the CloudyR Project, which aims to make R work better with cloud services of all kinds. They are not (yet) on CRAN, but you can obtain them with devtools::install_github. library(devtools) install_github("cloudyr/AzureRMR") install_github("cloudyr/AzureVM") install_github("cloudyr/AzureStor") install_github("cloudyr/AzureContainers") If you use Azure with R, I hope you'll find these packages to be of significant benefit to your workflow. If you have any questions or comments, please contact me at hongooi@microsoft.com. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Practical statistics books for software engineers Thu, 11/08/2018 - 04:50 (This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers) So you have read my (draft) book on evidence-based software engineering and want to learn more about the statistical techniques used, but are not interested lots of detailed mathematics. What books do I suggest? All the following books are sitting on the shelf next to where I write (not that they get read that much these days). Before I took the training wheels off my R usage, my general go to book was (I still look at it from time to time): “The R Book” by Crawley, second edition; “R in Action” by Kabacoff is a good general read. In alphabetical subject order: Categorical data: “Categorical Data Analysis” by Agresti, the third edition is a weighty tomb (in content and heaviness). Plenty of maths+example; more of a reference. Compositional data: “Analyzing compositional data with R” by van den Boogaart and Tolosana-Delgado, is more or less the only book of its kind. Thankfully, it is quite good. Count data: “Modeling count data” by Hilbe, may be more than you want to know about count data. Readable. Circular data: “Circular statistics in R” by Pewsey, Neuhauser and Ruxton, is the only non-pure theory book available. The material seems to be there, but is brief. Experiments: “Design and analysis of experiments” by Montgomery. General: “Applied linear statistical models” by Kutner, Nachtsheim, Neter and Li, covers a wide range of topics (including experiments) using a basic level of mathematics. Mixed-effects models: “Mixed-effects models in S and S-plus” by Pinheiro and Bates, is probably the book I prefer; “Mixed effects models and extensions in ecology with R” by Zuur, Ieno, Walker, Saveliev and Smith, is another view on an involved topic (plus lots of ecological examples). Modeling: “Statistical rethinking” by McElreath, is full of interesting modeling ideas, using R and Stan. I wish I had some data to try out some of these ideas. Regression analysis: “Applied Regression Analysis and Generalized Linear Models” by Fox, now in its third edition (I also have the second edition). I found this the most useful book, of those available, for a more detailed discussion of regression analysis. Some people like “Regression modeling strategies” by Harrell, but this does not appeal to me. Survival analysis: “Introducing survival and event history analysis” by Mills, is a readable introduction covering everything; “Survival analysis” by Kleinbaum and Klein, is full of insights but more of a book to dip into. Time series: The two ok books are: “Time series analysis and its application: with R examples” by Shumway and Stoffler, contains more theory, while “Time series analysis: with applications in R” by Cryer and Chan, contains more R code. There are lots of other R/statistics books on my shelves (just found out I have 31 of Springer’s R books), some ok, some not so. I have a few ‘programming in R’ style books; if you are a software developer, R the language is trivial to learn (its library is another matter). Suggestions for books covering topics I have missed welcome, or your own preferences (as a software developer). var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: The Shape of Code » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Rcpp 1.0.0: The Tenth Birthday Release Thu, 11/08/2018 - 01:47 (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers) As mentioned here two days ago, the Rcpp package turned ten on Monday—and we used to opportunity to mark the current version as 1.0.0! Thanks to everybody who liked and retweeted our tweet about this. And of course, once more a really big Thank You! to everybody who helped along this journey: Rcpp Core team, contributors, bug reporters, workshop and tutorial attendees and last but not least all those users – we did well. So let’s enjoy and celebrate this moment. As indicated in Monday’s blog post, we had also planned to upload this version to CRAN, and this 1.0.0 release arrived on CRAN after the customary inspection and is now available. I will build the Debian package in a moment, it will find its way to Ubuntu and of the CRAN-mirrored backport that Michael looks after so well. While this release is of course marked as 1.0.0 signifying the feature and release stability we have had for some time, it also marks another regular release at the now-common bi-monthly schedule following nineteen releases since July 2016 in the 0.12.* series as well as another five in the preceding 0.11.* series. Rcpp has become the most popular way of enhancing GNU R with C or C++ code. As of today, 1493 packages on CRAN depend on Rcpp for making analytical code go faster and further, along with another 150 in the (very recent) BioConductor release 3.8. Per the (partial) logs of CRAN downloads, we were reaching more than 900,000 downloads a month of late. Once again, we have a number of nice pull requests from the usual gang of contributors in there, see below for details. Changes in Rcpp version 1.0.0 (2018-11-05) • Happy tenth birthday to Rcpp, and hello release 1.0 ! • Changes in Rcpp API: • The empty destructor for the Date class was removed to please g++-9 (prerelease) and -Wdeprecated-copy (Dirk). • The constructor for NumericMatrix(not_init(n,k)) was corrected (Romain in #904, Dirk in #905, and also Romain in #908 fixing #907). • Rcpp::String no longer silently drops embedded NUL bytes in strings but throws new Rcpp exception embedded_nul_in_string. (Kevin in #917 fixing #916). • Changes in Rcpp Deployment: • The Dockerfile for Continuous Integration sets the required test flag (for release versions) inside the container (Dirk). • Correct the R CMD check call to skip vignettes (Dirk). • Changes in Rcpp Attributes: • A new [[Rcpp::init]] attribute allows function registration for running on package initialization (JJ in #903). • Sort the files scanned for attributes in the C locale for stable output across systems (JJ in #912). • Changes in Rcpp Documentation: • The ‘Rcpp Extending’ vignette was corrected and refers to EXPOSED rather than EXPORTED (Ralf Stubner in #910). • The ‘Unit test’ vignette is no longer included (Dirk in #914). Thanks to CRANberries, you can also look at a diff to the previous release. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page. This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Why R? 2018 Conference – After Movie and Summary Wed, 11/07/2018 - 09:00 (This article was first published on http://r-addict.com, and kindly contributed to R-bloggers) For us, R is not about packages and data analysis. It is about a vibrant community that creates all those wonderful toys. As the Why R? Foundation we want to take part in the growth of the R user base. One of our first efforts is the Why R? conference series with a unique community-oriented vibe. Before the main event, we co-organized twelve R meetups in European cities. It wouldn’t be possible without a wonderful support of local R communities. Aside from the promotion of Why R?, we had a great opportunity to interact with other R enthusiasts. Why R? 2018 happened in Wrocław in Poland. We started with workshops on July the 2nd with workshops and ended on July the 5th with Special Interest Groups and mlr hackathon. Check out the after-movie! The participants of the conference have a chance to take part in 8 workshops. We are especially grateful to R-Ladies Warsaw who conducted the R-Ladies workshop. Thanks to their support we managed to keep the R-Ladies workshop free for all participants. We are happy that our conference hosts also free side-events aiding in the further promotion of R. All 45 presentations and 8 keynotes were great, but we have to highlight the work of the mlr-team. Bernd Bischl and his coworkers did an amazing job animating Why R? with workshops, the mlr session, and a hackathon. The hackathon was quite a challenge, as it goal was to improve documentation of the mlr package. I am aware how compelling perspective is to work with developers of mlr on their package. Still, the massive participation in the hackathon was beyond our expectations. We hope that contributions of Why R? participants will be a valuable addition to the package. The last part of the conference were special interest groups. The idea is simple: gather together R users with a common interest and let them talk. We are very happy to see the active participation in all three SIGs: Diversity in Data Science, Teaching Data Science and Career Development in Data Science. Again, these events wouldn’t be as successful as they were without their moderators. We are already missing the atmosphere of enthusiasm that we encountered during Why R. Thus, we are already planning the next even. The Why R? 2019 conference happens in Warsaw (dates to be announced), but in the meantime stay tuned for new pre-meetings. See you in Warsaw! The event could not be hosted without the support of our sponsors whom we say big thank you from the whole organizing committee! var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: http://r-addict.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### The “probability to win” is hard to estimate… Wed, 11/07/2018 - 03:09 (This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers) Real-time computation (or estimation) of the “probability to win” is difficult. We’ve seem that in soccer games, in elections… but actually, as a professor, I see that frequently when I grade my students. Consider a classical multiple choice exam. After each question, imagine that you try to compute the probability that the student will pass. Consider here the case where we have 50 questions. Students pass when they have 25 correct answers, or more. Just for simulations, I will assume that students just flip a coin at each question… I have students, and 50 questions 1 2 3 set.seed(1) n=10 M=matrix(sample(0:1,size=n*50,replace=TRUE),50,n) Let denote the score of student at question . Let denote the cumulated score, i.e. . At step , I can get some sort of prediction of the final score, using . Here is the code 1 2 SM=apply(M,2,cumsum) NB=SM*50/(1:50) We can actually plot it 1 2 3 4 plot(NB[,1],type="s",ylim=c(0,50)) abline(h=25,col="blue") for(i in 2:n) lines(NB[,i],type="s",col="light blue") lines(NB[,3],type="s",col="red") But that’s simply the prediction of the final score, at each step. That’s not the computation of the probability to pass ! Let’s try to see how we can do it… If after questions, the students has 25 correct answer, the probability should be 1 – i.e. if – since he cannot fail. Another simple case is the following : if after questions, the number of points he can get with all correct answers until the end is not sufficient, he will fail. That means if S_{i,j}+(50-i+1)< 25[/latex] the probability should be 0. Otherwise, to compute the probability to sucess, it is quite straightforward. It is the probability to obtain at least $25-S_{i,j}$ correct answers, out of $50-j$ questions, when the probability of success is actually $S_{i,j}/j$. We recognize the survival probability of a binomial distribution. The code is then simply 1 2 3 4 5 6 7 PB=NB*NA for(i in 1:50){ for(j in 1:n){ if(SM[i,j]&gt;=25) PB[i,j]=1 if(SM[i,j]+(50-i+1)&lt;25) PB[i,j]=0 if((SM[i,j]&lt;25)&amp;(SM[i,j]+(50-i+1)&gt;=25)) PB[i,j]=1-pbinom(25-SM[i,j],size=(50-i),prob=SM[i,j]/i) }} So if we plot it, we get 1 2 3 4 plot(PB[,1],type="s",ylim=c(0,1)) abline(h=25,col="red") for(i in 2:n) lines(PB[,i],type="s",col="light blue") lines(PB[,3],type="s",col="red") which is much more volatile than the previous curves we obtained ! So yes, computing the “probability to win” is a complicated exercice ! Don’t blame those who try to find it hard to do ! Of course, things are slightly different if my students don’t flip a coin… this is what we obtain if half of the students are good (2/3 probability to get a question correct) and half is not good (1/3 chance), If we look at the probability to pass, we usually do not have to wait until the end (the 50 questions) to know who passed and who failed PS : I guess a less volatile solution can be obtained with a Bayesian approach… if I find some spare time this week, I will try to code it… var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### anytime 0.3.2 Wed, 11/07/2018 - 01:23 (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers) A new minor release of the anytime package arrived on CRAN this morning. This is the thirteenth release, and the first since July as the package has gotten feature-complete. anytime is a very focused package aiming to do just one thing really well: to convert anything in integer, numeric, character, factor, ordered, … format to either POSIXct or Date objects – and to do so without requiring a format string. See the anytime page, or the GitHub README.md for a few examples. This release adds a nice new vignette, solidifies some code in response to the rchk tests by Tomas Kalibera and updates some tests. We had some last-minute issues with the vignette on Windows, and Josh Ulrich was very helpful with some additional tests. For now, the new vignette is added pre-built. Changes in anytime version 0.3.2 (2018-11-05) • Added a new vignette introducing the anytime package. Seemingly it cannot be compiled on Windows so included prebuilt. • Some more tests for anydate were added along with so code coverage tags. • The C++ code was robustified in two places to not trigger rchk warnings (#79). • Three unit test files which fail on Solaris are now skipping this os as we cannot reproduce or test on this OS (#80). Courtesy of CRANberries, there is a comparison to the previous release. More information is on the anytime page. For questions or comments use the issue tracker off the GitHub repo. This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### UI Update — Datazar Wed, 11/07/2018 - 01:17 (This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to R-bloggers) UI Update — Datazar As your projects get more and more files added to it, keeping track of your datasets and analysis can become tedious. We’ve updated your project file list so you can see what type of files you have in your project before you click them or try to read the extension. Notebooks, consoles and scripts are now represented by the file type they analyze. The R logo for R notebooks and consoles and so on. We’re always making improvements on the UI/UX based on user feedback we get everyday, so use that “Feedback” button found at the bottom of every to let us know what you think can be improved and added! We’ve also previously added the language icons to the analysis pages for notebooks and consoles. That way when you’re switching from notebook to notebook (or console), you can be sure what flavor you’re using. UI Update — Datazar was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R Language in Datazar Blog on Medium. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Working with US Census Data in R Wed, 11/07/2018 - 01:03 (This article was first published on Revolutions, and kindly contributed to R-bloggers) If you need data about the American populace, there's no source more canonical than the US Census Bureau. The bureau publishes a wide range of public sets, and not just from the main Census conducted every 10 years: there are more than 100 additional surveys and programs published as well. To help R users access this rich source of data, Ari Lamstein and Logan Powell have published A Guide to Working with US Census Data in R, a publication of the R Consortium Census Working Group. The guide provides an overview of the data available from the US census bureau and various tools available in R to access and analyze it. The guide notes that there are 22 R packages for working with census data, and cites as being particularly useful: • tigris, for working with shape files of census regions (census data is may be aggregated to any of a number of levels as shown int the diagram below) • acs, for downloading and managing data from the decennial census and the American Community Survey • choroplethr and choroplethrMaps, for mapping data (including census data) by region • tidycensus, to extract census data as tidy data frames • censusapi, for extracting data using the Census API • ipumsr, to extract US census data in a form that can be compared with data from other countries Standard Hierarchy of Census Geographic Entities, US Census Bureau. You can find the complete guide at the link below. R Consortium: A Guide to Working with US Census Data in R var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### In-database xgboost predictions with R Wed, 11/07/2018 - 01:00 (This article was first published on R Views, and kindly contributed to R-bloggers) Moving predictive machine learning algorithms into large-scale production environments can present many challenges. For example, problems arise when attempting to calculate prediction probabilities (“scores”) for many thousands of subjects using many thousands of features located on remote databases. xgboost (docs), a popular algorithm for classification and regression, and the model of choice in many winning Kaggle competitions, is no exception. However, to run xgboost, the subject-features matrix must be loaded into memory, a cumbersome and expensive process. Available solutions require using expensive high-memory machines, or implementing external memory across distributed machines (expensive and in beta). Both solutions still require transferring all feature data from the database to the local machine(s), loading it into memory, calculating the probabilities for the subjects, and then transferring the probabilities back to the database for storage. I have seen this take, 20-50 minutes for ~1MM subjects. In this post, we will consider in-database scoring, a simple alternative for calculating batch predictions without having to transfer features stored in a database to the machine where the model is located. Instead, we will convert the model predictions into SQL commands and thereby transfer the scoring process to the database. We will convert the xgboost model prediction process into a SQL query, and thereby accomplish the same task while leveraging a cloud database’s scalability to efficiently calculate the predictions. To accomplish this, we’ll need to work through a few steps. First, we’ll import the model as a list of nested tree structures that we can iterate through recursively. Then, we’ll create a function that will recursively descend through a tree and translate it into a SQL CASE statement. After that, we’ll create a query that sums the CASE statements for all trees before logit-transforming it to calculate a probability. This first block of code loads the required packages and converts the model object to a list of trees that we can work with: library(xgboost) library(jsonlite) library(whisker) # our model exists in the variable xgb_model: # dump the list of trees as JSON and import it as model_trees using jsonlite model_trees <- jsonlite::fromJSON( xgb.dump(xgb_model, with_stats = FALSE, dump_format='json'), simplifyDataFrame = FALSE) Now, we need to translate each tree into a SQL CASE statement. Each tree represents a set of decisions based on whether a variable (the ‘split’) is less than a threshold value (the ‘split_condition’). The result of the decision could be ‘yes’, ‘no’, or ‘missing’. In each case, the tree provides the ‘node_id’ of the next decision to evaluate. When we reach a leaf, no decision needs to be made and instead a value is returned. An example tree is shown below: We’ll also need a dictionary that maps an integer to its associated feature name, since the trees themselves refer to 0-indexed integers instead of the feature names. We can accomplish that by creating the following list: feature_dict <- as.list(xgb_model$feature_names)

Using our feature_dict object, we can recursively descend through the tree and translate each node into a CASE statement, producing a sequence of nested CASE statements. The following function does just that:

xgb_tree_sql <- function(tree, feature_dict, sig=5){ # split variables must exist to generate subquery for tree children sv <- c("split", "split_condition", "yes", "no", "missing", "children") # we have a leaf, just return the leaf value if("leaf" %in% names(tree)){ return(round(tree[['leaf']],sig)) } else if(all(sv %in% names(tree))){ tree$split_long <- feature_dict[[tree$split+1]] # +1 because xgboost is 0-indexed cs <- c(tree$yes, tree$no, tree$missing) cd <- data.frame( k = c(min(cs), max(cs)), v = c(1,2) ) tree$missing_sql <- xgb_tree_sql(tree$children[[cd$v[cd$k==tree$missing]]], feature_dict) tree$yes_sql <- xgb_tree_sql(tree$children[[cd$v[cd$k==tree$yes]]], feature_dict) tree$no_sql <- xgb_tree_sql(tree$children[[cd$v[cd$k==tree$no]]], feature_dict) q <- " CASE WHEN {{{split_long}}} IS NULL THEN {{{missing_sql}}} WHEN {{{split_long}}} < {{{split_condition}}} THEN {{{yes_sql}}} ELSE {{{no_sql}}} END " return(whisker.render(q,tree)) } }

When we transform one tree into a sequence of nested CASE statements, we are producing a statement that yields that tree’s contribution to the total score. We now need to sum the output of each tree and then calculate the total probability prediction. In other words, we need to add up a list of nested CASE statements and then logit-transform the result.

Note that below we make use of the R whisker package. This logic-less templating language is a great way to easily transform associative-arrays into SQL that contains easily identifiable labels as placeholders. We find this more readable than sequences of paste statements.

xgb_sql_score_query <- function(list_of_trees, features_table, feature_dict, key_field = "id"){ # a swap list to render queries via whisker swap <- list( key_field = key_field, features_table = features_table ) # score_queries contains the score query for each tree in the list_of_trees score_queries <- lapply(list_of_trees, function(tree){ xgb_tree_sql(tree, feature_dict) }) # the query clause to sum the scores from each tree swap$sum_of_scores <- paste(score_queries, collapse=' + ') # score query that logit-transforms the sum_of_scores q <- " SELECT {{{key_field}}}, 1/(1+exp(-1*( {{{sum_of_scores}}} ))) AS score FROM {{{features_table}}} " return(whisker.render(q,swap)) } We are now ready to generate the score query from our model: queries <- xgb_sql_score_query( model trees, 'mydataset.my_feature_table', feature_dict ) for(q in queries){ # example: run the query with the R bigrquery package bq_project_query('my_project', q) } In summary, production models typically calculate predictions for all subjects on a daily, hourly, or even more frequent basis; however, moving feature data between a database and a local “scoring” machine is expensive and slow. Transferring the scoring calculations to run within the database, as we’ve shown above, can significantly reduce both cost and run time. The astute reader may notice that, depending on the database, this will only work for a limited number of trees. When that becomes a problem, it is possible to add another layer that stores the summed scores for batches of trees as views or tables, and then aggregates their results. Beyond that, when queries with views become too long, it is possible to add an additional layer than aggregates batches of views into tables. We will save all of this for a future post. Roland Stevenson is a data scientist and consultant who may be reached on Linkedin _____='https://rviews.rstudio.com/2018/11/07/in-database-xgboost-predictions-with-r/'; var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Using httr to Detect HTTP(s) Redirects Wed, 11/07/2018 - 00:00 (This article was first published on petermeissner, and kindly contributed to R-bloggers) The Summary In this short note I will write about the httr package and my need to detect whether or not an HTTP request had been redirected or not – it turns out this is quite easy. Along the way I will also show how to access information of an HTTP-conversation other than the actual content to be retrieved. The Problem I am the creator and maintainer of the robotstxt package an R package that enables users to retrieve and parse robots.txt files and ultimately is designed to do access permission checking for web resources. Recently a discussion came up about how to interpret permissions in case of sub-domains and HTTP redirects. Long story short: In case of robots.txt files redirects are suspicious and users should at least be informed about it happening so they might take appropriate action. So, I set out to find a way to check whether or not a robots.txt files requested via the httr package has gone through one or more redirects prior to its retrieval. httr’s automatic handling of redirects is one of its many wonderful features and happens silently in the background. Despite the fact that httr hides this process it turns out that the location negotiation process is transparently logged within the return value of httr’s HTTP functions (e.g. httr::GET()). Furthermore it is easy to tap into that information for further processing. The Request Now let’s get our hands dirty with some real HTTP-requests done via httr. For this we use httpbin.org a service allowing to test numerous HTTP-interaction scenarios – e.g. simple HTTP GET requests with redirection. When executing an HTTP GET request against the httpbin.org/redirect/2 endpoint it leads to two redirects before finally providing a resource. At first glance the result and status code looks pretty normal … # executing HTTP request r <- httr::GET("httpbin.org/redirect/2") … the status is 200 (everything OK)… # HTTP status code r$status_code ## [1] 200

… and we get some content back.

# parsed content retrieved httr::content(r) ## $args ## named list() ## ##$headers ## $headers$Accept ## [1] "application/json, text/xml, application/xml, */*" ## ## $headers$Accept-Encoding ## [1] "gzip, deflate" ## ## $headers$Connection ## [1] "close" ## ## $headers$Host ## [1] "httpbin.org" ## ## $headers$User-Agent ## [1] "libcurl/7.59.0 r-curl/3.2 httr/1.3.1" ## ## ## $origin ## [1] "91.249.21.24" ## ##$url ## [1] "http://httpbin.org/get" The Solution

So far, so good. If we look further into the response object we see that among its items there are two of particular interest: headers and all_headers. While the former only gives back headers and response meta information about the last response, the latter is a list of headers and response information for all responses.

# items of response object names(r) ## [1] "url" "status_code" "headers" "all_headers" "cookies" ## [6] "content" "date" "times" "request" "handle" # contents of all_headers item names(r$all_headers[[1]]) ## [1] "status" "version" "headers" The solution to the initial problem now can be written down as neat little function which (1) extracts all status codes logged in all_headers and (2) checks if any of them equals some 3xx status code (3xx is mainly but not exclusively about redirection but can be considered always suspicious in the problem at hand). #' http_was_redirected #' #' @param response an httr response object, e.g. from a call to httr::GET() #' #' @return logical of length 1 indicating whether or not any redirect happened #' during the HTTP request #' #' @export #' http_was_redirected <- function(response){ # extract status status <- vapply( X = response$all_headers, FUN = [[, FUN.VALUE = integer(1), "status" ) # check status and return any(status >= 300 & status < 400) } The Bonus

A more specific question in regard to redirects is whether or not a redirect
did not only change the path but also entailed a domain change – robots.txt conventions clearly state that each domain and subdomain have to provide their own robots.txt files.

The following function makes use of the all_headers item from the response object again. In addition it uses the domain() function provided by the urltools package to extract the domain part of an URL. If any location header shows a domain not equal to that of the original URL requested a domain change must have happened along the way.

#' http_domain_changed #' #' @param response an httr response object, e.g. from a call to httr::GET() #' #' @return logical of length 1 indicating whether or not any domain change #' happened during the HTTP request #' #' @export #' http_domain_changed <- function(response){ # get domain of origignal HTTP request orig_domain <- urltools::domain(response$request$url) # extract location headers location <- unlist( lapply( X = response$all_headers, FUN = function(x){ x$headers$location } ) ) # new domains new_domains <- urltools::domain(location) # check domains in location against original domain any( !is.na(new_domains) & new_domains != orig_domain ) } Some Resources var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: petermeissner. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### R plus Magento 2 REST API revisited: part 1- authentication and universal search Tue, 11/06/2018 - 23:39 (This article was first published on R – Alex Levashov – eCommerce Consultant (Melbourne, Australia), and kindly contributed to R-bloggers) I wrote a post about getting Magento 2 data to R using REST API last year. Now I provide more examples of use and a wrapper over API that you can re-use to get data from Magento 2 to R in a bit more convenient way. Prerequisites Magento 2 I assume you have admin access to Magento 2 store where you want to pull data from. You’ll need to create API user with access to the resources you plan to query. You need to write down: • Store URL (top level, where Magento app sits) • API user username • API user password R We used the next R packages: • httr – essential for work with API • jsonlite -to work with JSON data • dplyr – for some data manipulation in samples later (not in this post). Not essential one, if you prefer to use something else, but nice to have Authentication For ease of reuse let’s define a function that will get a token. We have to pass that token with most API requests. getm2authtoken <- function (url, username, password){ myquery = list (username=username, password=password) url <- paste0(urlbase, "index.php/rest/V1/integration/admin/token") myqueryj = toJSON(myquery, pretty = TRUE, auto_unbox = TRUE) req <- POST(url, add_headers ("Content-Type" = "application/json"), body = myquery, encode = "json") token <-rawToChar(req$content[2:33])
auth <- paste0("Bearer ", token)
return (auth)
}

urlbase <- "https://yourdomain.com.au/"

Now we can call the function and save the token for future use

If you print auth you should get something like [1] "Bearer 9r8ish02l5zz372ivp7rt8tzndd123yk" as a value if it works correctly.

Universal search

Now define function that does universal search in Magento 2 data.

getm2objects <- function(urlbase, endpoint, query, auth){
url <- paste0(urlbase, endpoint, "?")
request <- GET(url, add_headers ("Content-Type" = "application/json", "Authorization" = auth), query = query)
object <- content(request, as ="parsed")
return (object)
}

The function has the next parameters:

• urlbase – base URL of your Magento store, similar with you used for authentication function. Typically your domain. Important to have trailing slash!
• endpoint – specific Magento 2 REST API endpoint.  Skip first slash in the endpoint. You may find some endpoints in the examples below, all of them available in Magento  2 Documentation
• query – your search query. It is list of triplets and pairs of parameters, which is explained below and in the examples
• auth is authentication token we’ve already got and saved
Query

The simplest case of query is a list with 3 name-value pairs:

myquery <- list("searchCriteria[filter_groups][0][filters][0][field]"="created_at",
"searchCriteria[filter_groups][0][filters][0][value]"="2017-01-01 00:00:00",
"searchCriteria[filter_groups][0][filters][0][condition_type]"="gt")

The query above look for all data created later than specified date (01-01-2017).

Third pair is condition, where ‘gt’ means ‘greater than’

You may find all possible conditions and sample use in general (not R specific) in Magento Documentation.

Examples of using universal search Simple search – invoices issued after given date/time

Define parameters, here and later we assume that authentication is stored in ‘auth’ variable.

endpoint <- "rest/V1/invoices"
# query
myquery <- list("searchCriteria[filter_groups][0][filters][0][field]"="created_at",
"searchCriteria[filter_groups][0][filters][0][value]"="2017-01-01 00:00:00",
"searchCriteria[filter_groups][0][filters][0][condition_type]"="gt")

Calling function

invoices <- getm2objects(urlbase = urlbase, endpoint = endpoint, query = myquery, auth=auth)

The resulting object is a complex list with many levels of nesting that stores data from many Magento 2 database tables.

It looks complex enought and there is no surprise. But having all this data in one object is one of the reason to use API over direct query to SQL. If you have tried the latter you know that it may be extremely complex because Magento has hundreds of tables with complex relationships.

Search with OR logic and wildcard use in products

In this example we’ll use 2 search conditions – look for products those name include either ‘Dress’ or ‘Throw’

endpoint <- "rest/V1/products"
myquery <- list("searchCriteria[filter_groups][0][filters][0][field]"="name",
"searchCriteria[filter_groups][0][filters][0][value]"="%Dress%",
"searchCriteria[filter_groups][0][filters][0][condition_type]"="like",
"searchCriteria[filter_groups][0][filters][1][field]"="name",
"searchCriteria[filter_groups][0][filters][1][value]"="%Throw%",
"searchCriteria[filter_groups][0][filters][1][condition_type]"="like")

Note that we have 2 triplets of criteria, one per product name and the second has 1 in search criteria group.

Also percentage char used to indicate that there may be any number of characters before and after both terms. That syntax comes from SQL and other SQL wildcards can be used

Now call the function

products <- getm2objects(urlbase = urlbase, endpoint = endpoint, query = myquery, auth=auth)

The resulting list contains all the products with either “Dress” or “Throw” in name. Register ignored.

To quickly check how many results in the list use the next command

products[["total_count"]]

It returns number of search results founds.

Logical AND and OR search with invoices

Let’s get invoices for certain period of time and for specific grand total amounts above OR below certain thresholds

# get invoices for 2013 year and for any time with grand total under $100 OR over$1000
endpoint <- "rest/V1/invoices"
myquery <- list("searchCriteria[filter_groups][0][filters][0][field]"="created_at",
"searchCriteria[filter_groups][0][filters][0][value]"="2013-01-01 00:00:00",
"searchCriteria[filter_groups][0][filters][0][condition_type]"="gt",
"searchCriteria[filter_groups][1][filters][0][field]"="created_at",
"searchCriteria[filter_groups][1][filters][0][value]"="2014-01-01 00:00:00",
"searchCriteria[filter_groups][1][filters][0][condition_type]"="lt",
"searchCriteria[filter_groups][2][filters][0][field]"="base_grand_total",
"searchCriteria[filter_groups][2][filters][0][value]"="1000",
"searchCriteria[filter_groups][2][filters][0][condition_type]"="gt",
"searchCriteria[filter_groups][2][filters][1][field]"="base_grand_total",
"searchCriteria[filter_groups][2][filters][1][value]"="100",
"searchCriteria[filter_groups][2][filters][1][condition_type]"="lt")
Note that here we have 3 filter groups that stands for AND condition. The last group has 2 conditions (filters) that are connected as OR.
Calling the function
invoices2 <- getm2objects(urlbase = urlbase, endpoint = endpoint, query = myquery, auth=auth)
Let’s also check how big is our invoice in average

object.size(invoices2)/invoices2[["total_count"]]

In my case one invoice in average was over 17 KB. It isn’t a problem for limited data, but if you have massive number of objects it may end up not good.

Hence a word of caution!

Be mindful with your API requests, especially if you query production Magento 2 instance. Getting too wild may slow it down significantly and you can run out of memory on your local machine.

That is it for now. In the next post or two, I’ll write about getting filtered results. It is useful if you don’t need all the data, only part of it and don’t want to consume more than necessary resources, which may be important in many situations. Stay tuned!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Alex Levashov – eCommerce Consultant (Melbourne, Australia). R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Source and List: Organizing R Shiny Apps

Tue, 11/06/2018 - 23:00

(This article was first published on R – R-BAR, and kindly contributed to R-bloggers)

Keeping R Shiny code organized can be a challenge. One method to organize your Shiny UI and Server code is to use a combination of R’s list and source functions. Another method to organize you’re Shiny code is through modularization techniques. Here though, we’re going concentrate on the list and source options.

If you feel comfortable with Shiny and want to get strait to the answer, check out the Solution section. Otherwise, read on!

Outline Difficulty Organizing Shiny UIs

If you have a Shiny application with a lot of controls, it’s easy to get lost in the braces, parentheses and brackets. Using R’s list() and source() functions, we can make UI code much more manageable by moving components out to other files.

So how do you do it?

To get us started, we are going to work with the starter app provided in R studio. Click File →New →Shiny Web App… . A dialogue box will pop up. My setup details are shown below in the R-Studio dialogue box. Choose a working directory appropriate for you and press Create.

R-Studio will then show the code for the starter Shiny App. The demo code has 4 parts.

• Define the UI
• Define the Server
• Run the app.

We are going to start on the UI portion of the Shiny app shown in the screen capture below. This particular UI can be broken down into two parts the “Side Bar Panel” [lines 9-13] and the “Main Panel” [lines 15-17]. What we are going to do is factor out the content in the Side Bar Panel to a file named Side_Panel_CTRL.R and the Main Panel content to a file named Main_Panel_OUTPUT.R.

So, make the two new R scripts named Side_Panel_CTRL.R and Main_Panel_OUTPUT.R. From the app.R file, MOVE the sliderInput (lines 10-12 above) to the Side_Panel_Control.R file and the plotOutput (line 16) to the Main_Panel_OUTPUT.R. Make sure these two new files are saved in the same folder as your app.R file.

A UI Source Example that Usually Doesn’t Work

Once you have your files setup, add the source() statements into you app.R file as shown in the code snippet below (lines 8 & 12).

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 # Define UI for application that draws a histogram ui <- fluidPage( titlePanel("List and Source UI"), # Title Page   # Sidebar with a slider input for number of bins sidebarLayout( sidebarPanel( # The Side Bar source(file = "Side_Panel_CTRL.R", local = T) ), # Show a plot of the generated distribution mainPanel( source(file="Main_Panel_OUTPUT.R", local=T) ) ) )

Press the Run App Button in R Studio and the app starts as you might expect. You’ll notice that the word TRUE pops up at the bottom of the slider control and the plot. Not sure why this happens, but you can remove it by adding “[1]” to the end of the source() statement as shown in the screen capture below summarizing where we are right now.

Cool, but this is about as far as you’re going to get with source command alone.

When Source Fails in the UI

The source command alone can fail in the UI sections of a Shiny app. To demonstrate, we are going to add just one additional control to our side panel control file. First, though, let me show you that adding one additional control works when I add it directly to the app.R file in the UI section:

To the right of the code, you can see that the app works. Now let’s see what happens when we move the following control code back to the Side_panel_CTRL.r file.

1 2 3 4 5 6 7 #Slider Input #Slider Control sliderInput("bins", "Number of bins:", min = 1, max = 50, value = 30 ), #text box control textInput("textBox1", label = "Enter Value")

OK, moved the code. Here is what the three files should look like and the result of trying to run them.

hmm… We had an error, and it seems to involve the syntax of the code we are sourcing… perhaps the comma. How do we fix this?

Solution

The simple fix is to wrap the code in your UI files with the list command as shown in the following few screen shots: Look specifically at lines 1 and 9 of the Side_Panel_CTRL.r file and 1 and 5 of Main_Panel_OUTPUT.r file.

Usage Note

I find the source() and list() combo to be particularly useful when I work with tabbed panels. A quick example is shown below with some tabPanel code on lines 11-12. The controls for these two panels are coded in the source files as described previously.

Organizing the Shiny Server with Source

Server side Shiny is good old fashioned R-code and can be easily sourced from an external file using the source() command without list.

source(
file = “[source file path]”,
local = True #
)

The only trick with server side Shiny is dealing with the reactive elements like “reactive({…})”. For these you need to put the source() command in place of the “…”. To illustrate consider the following two code snippets.

Out-of-the-box sever code looks something like this:

1 2 3 4 5 6 7 8 9 10 # Define server logic required to draw a histogram server <- function(input, output) { output$distPlot <- renderPlot({ # generate bins based on input$bins from ui.R x <- faithful[, 2] bins <- seq(min(x), max(x), length.out = input$bins + 1) # draw the histogram with the specified number of bins hist(x, breaks = bins, col = 'darkgray', border = 'white') }) } If you source out lines 3 through 9 after the output$distPlot assignment to a file called Server_Plot.R, and leave the server code like this:

1 2 3 4 server <- function(input, output) { output$distPlot <- source(file="Server_Plot.R", local=T) } You’ll get an error: “Error in .subset2(x, “impl”)$defineOutput(name, value, label) :
Unexpected list output for distPlot”

To avoid this error, keep the renderPlot({}) reactive element in the app.R file and source from within like so:

1 2 3 4 5 server <- function(input, output) { output\$distPlot <- renderPlot({ source(file="Server_Plot.R", local=T) }) }

For the server side, that’s it. You should also notice that compared to the UI side, we did not need to place the “[1]” at the end of the source command.

Summary

Shiny is a great tool for communicating data, but its easy to get lost in the braces, brackets, and parentheses. To help simplify your next Shiny application, try using the source-list combo and don’t forget the “[1]” at the end of the source statement, or you’ll have a while bunch of TRUEs running around. On the server side, we need to be careful around reactive elements.

The post Source and List: Organizing R Shiny Apps appeared first on R-BAR.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – R-BAR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...