Subscribe to R bloggers feed R bloggers
R news and tutorials contributed by hundreds of R bloggers
Updated: 6 hours 52 min ago

R-Ladies global tour

Fri, 10/06/2017 - 02:00

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

It was recently brought to my attention by Hannah Frick that there are now sooo many R-Ladies chapters around the world! R-Ladies is a world-wide organization to promote gender diversity in the R community, and I’m very grateful to be part of this community through which I met so many awesome ladies! Since we’re all connected, it has now happened quite a few times that R-Ladies gave talks at chapters outside of their hometowns. An R-Lady from Taiwan giving a talk in Madrid while on a trip in Europe and another one doing the same in Lisbon, an R-Lady from San Francisco presenting at the London and Barcelona chapters thanks to a conference on the continent, an R-Lady from Uruguay sharing her experience with the New York City and San Francisco chapters… It’s like rock stars on tour!

Therefore we R-Ladies often joke about doing an exhaustive global tour. Hannah made me think about this tour again… If someone were to really visit all of the chapters, what would be the shortest itinerary? And could we do a cool gif with the results? These are the problems we solve here.

Getting the chapters

To find all chapters, I’ll use Meetup information about meetups whose topics include “r-ladies”, although it means forgetting a few chapters that maybe haven’t updated their topics yet. Thus, I’ll scrape this webpage because I’m too impatient to wait for the cool meetupr package to include the Meetup API topic endpoint and because I’m too lazy to include it myself. I did open an issue though. Besides, I was allowed to scrape the page:

robotstxt::paths_allowed("https://www.meetup.com/topics/")

## [1] TRUE

Yesss. So let’s scrape!

library("rvest")

link <- "https://www.meetup.com/topics/r-ladies/all/"
page_content <- read_html(link)

css <- 'span[class="text--secondary text--small chunk"]'
chapters <- html_nodes(page_content, css) %>%
  html_text(trim = TRUE)
chapters <- stringr::str_replace(chapters, ".*\\|", "")
chapters <- trimws(chapters)
head(chapters)

## [1] "London, United Kingdom" "San Francisco, CA"
## [3] "Istanbul, Turkey"       "Melbourne, Australia"
## [5] "New York, NY"           "Madrid, Spain"

# Montenegro
chapters[chapters == "HN\\, Montenegro"] <- "Herceg Novi, Montenegro"

Geolocating the chapters

Here I decided to use a nifty package for the awesome OpenCage API. Ok, this is my own package. But hey, it’s really a good geocoding API. And the package was reviewed for rOpenSci by Julia Silge! In the docs of the package you’ll see how to save your API key so that you don’t have to input it as a function parameter every time.

Given that there are many chapters but not that many (41 to be exact), I could inspect the results and check them.

geolocate_chapter <- function(chapter){
  # query the API
  results <- opencage::opencage_forward(chapter)$results
  # deal with Strasbourg
  if(chapter == "Strasbourg, France"){
    results <- dplyr::filter(results, components.city == "Strasbourg")
  }
  # get a CITY
  results <- dplyr::filter(results, components._type == "city")
  # sort the results by confidence score
  results <- dplyr::arrange(results, desc(confidence))
  # choose the first line among those with highest confidence score
  results <- results[1,]
  # return only long and lat
  tibble::tibble(long = results$geometry.lng,
                 lat = results$geometry.lat,
                 chapter = chapter,
                 formatted = results$formatted)
}

chapters_df <- purrr::map_df(chapters, geolocate_chapter)
# add an index variable
chapters_df <- dplyr::mutate(chapters_df, id = 1:nrow(chapters_df))
knitr::kable(chapters_df[1:10,])

| long | lat | chapter | formatted | id |
|---|---|---|---|---|
| -0.1276473 | 51.50732 | London, United Kingdom | London, United Kingdom | 1 |
| -122.4192362 | 37.77928 | San Francisco, CA | San Francisco, San Francisco City and County, California, United States of America | 2 |
| 28.9651646 | 41.00963 | Istanbul, Turkey | Istanbul, Fatih, Turkey | 3 |
| 144.9631608 | -37.81422 | Melbourne, Australia | Melbourne VIC 3000, Australia | 4 |
| -73.9865811 | 40.73060 | New York, NY | New York City, United States of America | 5 |
| -3.7652699 | 40.32819 | Madrid, Spain | Leganés, Community of Madrid, Spain | 6 |
| -77.0366455 | 38.89495 | Washington, DC | Washington, District of Columbia, United States of America | 7 |
| -83.0007064 | 39.96226 | Columbus, OH | Columbus, Franklin County, Ohio, United States of America | 8 |
| -71.0595677 | 42.36048 | Boston, MA | Boston, Suffolk, Massachusetts, United States of America | 9 |
| -79.0232050 | 35.85030 | Durham, NC | Durham, NC, United States of America | 10 |

Planning the trip

I wanted to use the ompr package inspired by this fantastic use case, “Boris Johnson’s fully global itinerary of apology” – be careful, the code of this use case is slightly outdated, but it is up-to-date in the traveling salesperson vignette. The ompr package supports modeling and solving Mixed Integer Linear Programs. I got a decent notion of what this means by looking at this collection of use cases. Sadly, the traveling salesperson problem is a hard problem whose solving time increases exponentially with the number of stops… in this case it became really too long for plain mixed integer linear programming, as in “more than 24 hours later and still not done” too long.
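For readers who have not used ompr before, here is a minimal, hypothetical sketch of the modelling style it supports, on a tiny binary knapsack rather than the full traveling salesperson model (the toy data and the choice of the GLPK solver via ompr.roi/ROI.plugin.glpk are my own assumptions, not part of this post):

library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)
library(magrittr)

# toy knapsack: pick items to maximise value under a weight limit of 9
values  <- c(10, 13, 7, 4)
weights <- c(5, 6, 3, 2)

model <- MIPModel() %>%
  add_variable(x[i], i = 1:4, type = "binary") %>%
  set_objective(sum_expr(values[i] * x[i], i = 1:4), "max") %>%
  add_constraint(sum_expr(weights[i] * x[i], i = 1:4) <= 9)

result_knapsack <- solve_model(model, with_ROI(solver = "glpk"))
get_solution(result_knapsack, x[i])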

Therefore, I decided to use a specific R package for traveling salesperson problems, TSP. Dirk, ompr’s maintainer, actually used it once as seen in this gist and then in this newspaper piece about how to visit all 78 Berlin museums during the night of the museums. Quite cool!

We first need to compute the distances between chapters, in kilometers and rounded, since that’s enough precision.

convert_to_km <- function(x){
  round(x/1000)
}

distance <- geosphere::distm(as.matrix(dplyr::select(chapters_df, long, lat)),
                             fun = geosphere::distGeo) %>%
  convert_to_km()

I used methods that do not find the optimal tour. This means my solution probably isn’t the best one, but let’s say it’s ok for this time. Otherwise, the best thing is to ask Concorde’s maintainer whether one can use their algorithm, which is the best one out there; see its terms of use here.

library("TSP")
set.seed(42)
result0 <- solve_TSP(TSP(distance), method = "nearest_insertion")
result <- solve_TSP(TSP(distance), method = "two_opt", control = list(tour = result0))

And here is how to link the solution to our initial chapters data.frame.

paths <- tibble::tibble(from = chapters_df$chapter[as.integer(result)],
                        to = chapters_df$chapter[c(as.integer(result)[2:41],
                                                   as.integer(result)[1])],
                        trip_id = 1:41)
paths <- tidyr::gather(paths, "property", "chapter", 1:2)
paths <- dplyr::left_join(paths, chapters_df, by = "chapter")
knitr::kable(paths[1:3,])

| trip_id | property | chapter | long | lat | formatted | id |
|---|---|---|---|---|---|---|
| 1 | from | Charlottesville, VA | -78.55676 | 38.08766 | Charlottesville, VA, United States of America | 38 |
| 2 | from | Washington, DC | -77.03665 | 38.89495 | Washington, District of Columbia, United States of America | 7 |
| 3 | from | New York, NY | -73.98658 | 40.73060 | New York City, United States of America | 5 |

Plotting the tour, boring version

I’ll start by plotting the trips as it is done in the vignette, i.e. in a static way. Note: I used Dirk’s code in the Boris Johnson use case for the map, and had to use a particular branch of ggalt to get coord_proj working.

library("ggplot2")
library("ggalt")
library("ggthemes")
library("ggmap")

world <- map_data("world") %>%
  dplyr::filter(region != "Antarctica")

ggplot(data = paths, aes(long, lat)) +
  geom_map(data = world, map = world,
           aes(long, lat, map_id = region),
           fill = "white", color = "darkgrey",
           alpha = 0.8, size = 0.2) +
  geom_path(aes(group = trip_id), color = "#88398A") +
  geom_point(data = chapters_df, color = "#88398A", size = 0.8) +
  theme_map(base_size = 20) +
  coord_proj("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") +
  ggtitle("R-Ladies global tour",
          subtitle = paste0(tour_length(result), " km"))

Dirk told me the map would look better with great circles instead of straight lines, so I googled around a bit and asked for help on Twitter before finding this post.

library("geosphere")

# find points on great circles between chapters
gc_routes <- gcIntermediate(paths[1:length(chapters), c("long", "lat")],
                            paths[(length(chapters) + 1):(2*length(chapters)), c("long", "lat")],
                            n = 360, addStartEnd = TRUE,
                            sp = TRUE, breakAtDateLine = TRUE)
gc_routes <- SpatialLinesDataFrame(gc_routes,
                                   data.frame(id = paths$id,
                                              stringsAsFactors = FALSE))
gc_routes_df <- fortify(gc_routes)

p <- ggplot() +
  geom_map(data = world, map = world,
           aes(long, lat, map_id = region),
           fill = "white", color = "darkgrey",
           alpha = 0.8, size = 0.2) +
  geom_path(data = gc_routes_df, aes(long, lat, group = group),
            alpha = 0.5, color = "#88398A") +
  geom_point(data = chapters_df, color = "#88398A", size = 0.8,
             aes(long, lat)) +
  theme_map(base_size = 20) +
  coord_proj("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")

p + ggtitle("R-Ladies global tour",
            subtitle = paste0(tour_length(result), " km"))

Ok this is nicer, it was worth the search.

Plotting the tour, magical version

And now I’ll use magick because I want to add a small star flying around the world. By the way, if this global tour were to happen, I reckon one would need to donate a lot of money to rainforest charities or the like, because it’d have a huge carbon footprint! Too bad really, I don’t want my gif to promote planet-destroying behaviours.

To make the gif I used code similar to that shared in this post, but in a better version thanks to Jeroen, who told me to read the vignette again. Not saving PNGs saves time!

I first wanted to really show the emoji flying along the route and even created data for that, with a number of rows between chapters proportional to the distance between them. It’d have looked nice and smooth. But making a gif with hundreds of frames ended up taking too long for me at the moment. So I came up with another idea; I’ll have to hope you like it!

library("emojifont")
load.emojifont('OpenSansEmoji.ttf')
library("magick")

plot_one_moment <- function(chapter, size, p, chapters_df){
  print(p +
          ggtitle(paste0("R-Ladies global tour, ",
                         chapters_df[chapters_df$chapter == chapter,]$chapter),
                  subtitle = paste0(tour_length(result), " km")) +
          geom_text(data = chapters_df[chapters_df$chapter == chapter,],
                    aes(x = long, y = lat, label = emoji("star2")),
                    family = "OpenSansEmoji", size = size))
}

img <- image_graph(1000, 800, res = 96)
out <- purrr::walk2(rep(chapters[as.integer(result)], each = 2),
                    rep(c(5, 10), length = length(chapters)*2),
                    p = p, plot_one_moment,
                    chapters_df = chapters_df)
dev.off()

## png
##   2

image_animate(img, fps = 1) %>%
  image_write("rladiesglobal.gif")

At least I made a twinkling star. I hope Hannah will be happy with the gif, because now I’d like to just dream of potential future trips! Or learn a bit of geography by looking at the gif.


To leave a comment for the author, please follow the link and comment on their blog: Maëlle.

Checking residual distributions for non-normal GLMs

Fri, 10/06/2017 - 02:00

(This article was first published on Bluecology blog, and kindly contributed to R-bloggers)

Checking residual distributions for non-normal GLMs

Quantile-quantile plots

If you are fitting a linear regression with Gaussian (normally
distributed) errors, then one of the standard checks is to make sure the
residuals are approximately normally distributed.

It is a good idea to do these checks for non-normal GLMs too, to make sure your residuals approximate the distribution the model assumes.

Here I explain how to create quantile-quantile plots for non-normal data, using an example of fitting a GLM using Student-t distributed errors. Such models can be appropriate when the residuals are overdispersed.

First let’s create some data. We will make a linear predictor (i.e. the true regression line) eta and then simulate some data by adding residuals. We will simulate two data sets that have the same linear predictor, but the first will have normally distributed errors and the second will have t-distributed errors:

n <- 100
phi <- 0.85
mu <- 0.5
set.seed(23)
x <- rnorm(n)
eta <- mu + phi * x
nu <- 2.5
tau <- 3
y_tdist <- eta + (rt(n, df=nu)/sqrt(tau))
y_normdist <- eta + rnorm(n, sd = 1/sqrt(tau))

plot(x, y_tdist)
points(x, y_normdist, col = "red", pch = 16, cex = 0.8)
legend("topleft",
       legend = c("t-distributed errors", "normally distributed errors"),
       pch = c(1, 16), col = c("black", "red"))

Notice how the t-distributed data are more spread out. The df parameter, here nu = 2.5, controls how dispersed the data are. Lower values give data that are more dispersed; larger values approach a normal distribution.
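As a quick check of that statement (my own addition, not from the post), you can compare the spread of simulated t draws for different df values:

set.seed(23)
sd(rt(10000, df = 2.5))  # heavy tails, sd well above 1
sd(rt(10000, df = 30))   # close to a standard normal
sd(rnorm(10000))         # reference: standard normal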

Now let’s fit a Gaussian glm (just a linear regression really) to both of these data sets:

m1_norm <- glm(y_normdist ~ x)
m1_t <- glm(y_tdist ~ x)

We should check whether the two models meet the normal assumption, using
the standard ‘qq’ (quantile-quantile) plot:

par(mfrow = c(1,2))
plot(m1_norm, 2)
plot(m1_t, 2)

These plots compare the theoretical quantiles to the actual quantiles of the residuals. If the points fall on the straight line, then the theoretical and realised quantiles are very similar and the assumption is met. Clearly this is not the case for the second model, which is overdispersed.

We know it is overdispersed because the theoretical quantiles are much smaller than the actual ones at the tails (notice how the ends bend down and then up).

The p-values (or CIs if you use them) for m1_t are therefore likely biased and too narrow, potentially leading to type I errors (saying that x affects y when in fact it does not). In this case we know we haven’t made a type I error, because we made up the data. However, if you were using real data you wouldn’t be so sure.

Doing our own quantile-quantile plot

To better understand the QQ plot it helps to generate it yourself,
rather than using R’s automatic checks.

First we calculate the model residuals (in plot(m1_t) R did this
internally):

m1_t_resid <- y_tdist - predict(m1_t)

Then we can plot the quantiles for the residuals against theoretical
quantiles generated using qnorm. Below we also plot the original QQ
plot from above, so you can see that our version is the same as R’s
automatic one:

par(mfrow = c(1,2))
qqplot(qnorm(ppoints(n), sd = 1), m1_t_resid)
qqline(m1_t_resid, lty = 2)
plot(m1_t, 2)

I added the qqline for comparative purposes. It just puts a line
through the 25th and 75th quantiles.
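To see exactly what qqline does, here is a small sketch (my own addition) that draws the same line by hand through those two quantiles:

probs <- c(0.25, 0.75)
y_q <- quantile(m1_t_resid, probs)  # sample quantiles of the residuals
x_q <- qnorm(probs)                 # matching theoretical normal quantiles
slope <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]

qqplot(qnorm(ppoints(n), sd = 1), m1_t_resid)
abline(intercept, slope, col = "blue", lty = 3)  # reproduces qqline(m1_t_resid)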

QQ plot for a non-normal GLM

Now that we have learned how to write our own custom QQ plot, we can use it to check other types of non-normal data.

Here we will fit a GLM to the y_tdist data using Student-t distributed errors. I do this using the Bayesian package INLA.

library(INLA)
data <- list(y = y_tdist, x = x)
mod_tdist <- inla(y ~ x, family = "T", data = data,
                  control.predictor = list(compute = TRUE),
                  control.family = list(
                    hyper = list(prec = list(prior = "loggamma", param = c(1, 0.5)),
                                 dof = list(prior = "loggamma", param = c(1, 0.5))
                    )
                  )
)

The family = "T" argument tells INLA to use the t-distribution rather than the Normal distribution. Note also that I have specified the priors using the control.family argument. This is best practice. We need a prior for the precision (1/variance) and a prior for the dof (degrees of freedom, which has to be > 2 in INLA).

It is sometimes helpful to visualise the priors, so we can check that they look sensible. Here we visualise the prior for the dof (which in INLA has a minimum value of 2):

xgamma <- seq(0.01, 10, length.out = 100)
plot(xgamma + 2, dgamma(xgamma, 1, 0.5), type = 'l',
     xlab = "Quantile", ylab = "Density")

We don’t really expect values much greater than 10, so this prior makes
sense. If we used an old-school prior that was flat in 2-1000 we might
get issues with model fitting.

Now enough about priors. Let’s look at the estimated coefficients:

mod_tdist$summary.fixed

##                  mean         sd 0.025quant  0.5quant 0.975quant      mode
## (Intercept) 0.5324814 0.07927198  0.3773399 0.5321649  0.6891779 0.5315490
## x           0.7229362 0.08301006  0.5565746 0.7239544  0.8835630 0.7259817
##                      kld
## (Intercept) 3.067485e-12
## x           6.557565e-12

Good: the CIs contain our true values, and the means are close to the true values too. What about the hyper-parameters (the precision and DF)? We need to get INLA to run some more calculations to get accurate estimates of these:

h_tdist <- inla.hyperpar(mod_tdist)
h_tdist$summary.hyperpar[,3:5]

##                                           0.025quant  0.5quant 0.975quant
## precision for the student-t observations  0.2663364 0.6293265   1.163440
## degrees of freedom for student-t          2.2404966 2.7396391   4.459057

The estimate for the DF might be somewhat off the mark. That is ok; we expect that, because you need lots of really good data to get accurate estimates of hyper-parameters.

Now, let’s use our skills in creating QQ plots to make a QQ plot using theoretical quantiles from the t distribution.

The first step is to extract INLA’s predictions of the data, so we can calculate the residuals:

preds <- mod_tdist$summary.fitted.values
resids <- y_tdist - preds[,4]

The next step is to extract the marginal estimates of the DF and precision to use when generating our QQ plot (the quantiles will change with the DF):

tau_est <- h_tdist$summary.hyperpar[1,4]
nu_est <- h_tdist$summary.hyperpar[2,4]

Now we can use qt() to generate theoretical quantiles and the
residuals for our realised quantiles:

qqplot(qt(ppoints(n), df = nu_est), resids * sqrt(tau_est),
       xlab = "Theoretical quantile", ylab = "residuals")
qqline(resids * sqrt(tau_est), lty = 2)

Note that I multiply the residuals by the sqrt of the precision estimate. This is how INLA fits a t-distributed GLM. I do the same for the qqline.

Our residuals are now falling much closer to the line. The model is
doing a much better job of fitting the data. You could also calculate
the WAIC for this model and a Gaussian one, to compare the fits. The
t-distributed GLM should have a lower WAIC (better fit).
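A sketch of how that comparison could be done (my own addition; WAIC is requested through INLA's control.compute option when fitting each model):

mod_gauss <- inla(y ~ x, family = "gaussian", data = data,
                  control.compute = list(waic = TRUE))
mod_t2 <- inla(y ~ x, family = "T", data = data,
               control.family = list(
                 hyper = list(prec = list(prior = "loggamma", param = c(1, 0.5)),
                              dof = list(prior = "loggamma", param = c(1, 0.5)))),
               control.compute = list(waic = TRUE))
# lower WAIC indicates the better-fitting model
c(gaussian = mod_gauss$waic$waic, student_t = mod_t2$waic$waic)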

We can now be confident that our CIs are accurate.


To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog.

The BayesianTools R package with general-purpose MCMC and SMC samplers for Bayesian statistics

Thu, 10/05/2017 - 14:59

(This article was first published on Submitted to R-bloggers – theoretical ecology, and kindly contributed to R-bloggers)

This is a somewhat belated introduction of a package that we published on CRAN at the beginning of the year already, but I hadn’t found the time to blog about this earlier. In the R environment and beyond, a large number of packages exist that estimate posterior distributions via MCMC sampling, either for specific statistical models (e.g.…
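For readers who have not seen the package, here is a minimal sketch of the typical BayesianTools workflow as I understand it (the toy likelihood, parameter bounds and sampler settings below are illustrative assumptions, not taken from the package announcement):

library(BayesianTools)

# toy data and a simple normal log-likelihood with mean and log-sd parameters
obs <- rnorm(50, mean = 2, sd = 1)
loglik <- function(par) sum(dnorm(obs, mean = par[1], sd = exp(par[2]), log = TRUE))

setup <- createBayesianSetup(likelihood = loglik,
                             lower = c(-10, -5), upper = c(10, 5))
out <- runMCMC(bayesianSetup = setup, sampler = "DEzs",
               settings = list(iterations = 10000))
summary(out)
plot(out)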


To leave a comment for the author, please follow the link and comment on their blog: Submitted to R-bloggers – theoretical ecology.

The [R] Kenntnis-Tage 2017: A special event

Thu, 10/05/2017 - 11:52

(This article was first published on eoda english R news, and kindly contributed to R-bloggers)

Days are getting shorter and the leaves begin to change: autumn is here, and this means only one more month until the [R] Kenntnis-Tage 2017 on November 8 and 9. A diverse agenda, cross-industry networking and a clear business context – many aspects make the [R] Kenntnis-Tage a special event and are reasons for you to participate:

Inspiring Insights

From “googleVis” author and finance expert Markus Gesmann to “Algorithm Watch” co-founder Prof. Dr. Katharina Anna Zweig – the [R] Kenntnis-Tage 2017 offer a stage for people with a strong passion for data science and R. Whether in the sessions or the networking: participants will gain detailed insights into the various fields of application of the free programming language as well as inspiration for the daily business.

Clear Business Context

Unlike many other events related to R, the [R] Kenntnis-Tage have a distinct business focus. They serve as a platform for the German-speaking R community to exchange challenges and solutions in the daily work with R, making it an event which is unique in this form. In this context, eoda will present an innovation which takes the work with data in the daily business to a new level.

Topical Diversity

Forecasting models for the energy sector, R in the automotive industry or in data journalism – the topics of the [R] Kenntnis-Tage 2017 are as diverse as R itself. Compared to events with a focus on specific industries, eoda promotes the concept of interdisciplinarity at their event. Cross-industry transfer is one of the cornerstones of continuous development and enables participants to learn about new approaches and solutions. Very often data science presents itself as one thing in particular: a journey of discovery.

Insightful Tutorials

For the third time, the [R] Kenntnis-Tage combine data science success stories with practical R tutorials. Experienced eoda trainers will share their methodical know-how with the participants in intense tutorial sessions. Topics include building a Shiny app, the data mining workflow with “caret” and data management with data.table, among others. These tutorials offer an opportunity to learn new approaches and re-evaluate established ones.

Enthusiastic Participants

In the past two years participants from all industries and companies were more than satisfied with the event’s concept, the exchange with other participants and the insights they could gather at the event. The [R] Kenntnis-Tage have become a firmly established event for R users with a business background.

Flair of the documenta City Kassel

The documenta, the most significant exhibition of contemporary art, has just come to an end, but its special flair is still perceptible in the city. The event location of the [R] Kenntnis-Tage 2017 in the heart of the city as well as the dinner location in the UNESCO World Heritage Site Bergpark Wilhelmshöhe provide an appropriate setting for a successful event.

Convinced? Then register for the [R] Kenntnis-Tage 2017 and take advantage of the reduced price – available only until October 6.

Further information is available here.


To leave a comment for the author, please follow the link and comment on their blog: eoda english R news.

Working with R

Thu, 10/05/2017 - 11:29

(This article was first published on R on Locke Data Blog, and kindly contributed to R-bloggers)

I’ve been pretty quiet on the blog front recently. That’s because I overhauled my site, migrating it to Hugo (the foundation of blogdown). And as if doing one extra thing on top of my usual workload weren’t enough, I also did another thing: I wrote a book!
I’m a big book fan, and I’m especially a Kindle Unlimited fan (all the books you can read for £8 per month, heck yeah!) so I wanted to make books that I could publish and see on Kindle Unlimited.


To leave a comment for the author, please follow the link and comment on their blog: R on Locke Data Blog.

Is udpipe your new NLP processor for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing?

Thu, 10/05/2017 - 10:45

(This article was first published on bnosac :: open analytical helpers, and kindly contributed to R-bloggers)

If you work on natural language processing in a day-to-day setting which involves statistical engineering, at a certain point you need to process your text with a number of text mining procedures. The following are steps you must do before you can get useful information about your text:

  • Tokenisation (splitting your full text into words/terms)
  • Parts of Speech (POS) tagging (assigning each word a syntactic tag, i.e. is the word a verb/noun/adverb/number/…)
  • Lemmatisation (replacing each term with its lemma, e.g. “are” is replaced by the verb “be”; more information: https://en.wikipedia.org/wiki/Lemmatisation)
  • Dependency Parsing (finding relationships between words, namely between “head” words and the words which modify those heads, allowing you to look at words which may be far away from each other in the raw text but influence each other)

If you do this in R, there aren’t many tools available for it. In fact, there are none which

  1. do this for multiple languages
  2. do not depend on external software dependencies (java/python)
  3. also allow you to train your own parsing & tagging models.

Except the R package udpipe (https://github.com/bnosac/udpipe, https://CRAN.R-project.org/package=udpipe), which satisfies all three criteria.

If you are interested in doing the annotation, pre-trained models are available for 50 languages (see ?udpipe_download_model for details). Let’s show how this works on some Dutch text and what you get out of it.

library(udpipe)
dl <- udpipe_download_model(language = "dutch")
dl

language                                                                      file_model
   dutch C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dutch-ud-2.0-170801.udpipe

udmodel_dutch <- udpipe_load_model(file = "dutch-ud-2.0-170801.udpipe")
x <- udpipe_annotate(udmodel_dutch,
                     x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.")
x <- as.data.frame(x)
x

The result of this is a dataset where the text has been split into paragraphs, sentences and words, words are replaced by their lemma (ging > ga, nam > neem), and you get the universal parts of speech tags, the detailed parts of speech tags and the morphological features of each word. With the head_token_id we see which words influence other words in the text, as well as the dependency relationship between these words.

Going from that dataset to meaningful visualisations like this one is then just a matter of a few lines of code. The following visualisation shows the co-occurrence of nouns in customer feedback on Airbnb apartment stays in Brussels (open data available at http://insideairbnb.com/get-the-data.html).
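As an illustration of those few lines, here is a sketch of how such co-occurrence counts can be computed with udpipe's cooccurrence() helper (the annotated data frame x comes from the Dutch example above; the grouping columns are an assumption, and the Airbnb data themselves are not reproduced here):

# keep only the nouns from the annotated data frame
nouns <- subset(x, upos == "NOUN")
# count how often two lemmas occur within the same sentence
cooc <- cooccurrence(nouns, term = "lemma",
                     group = c("doc_id", "paragraph_id", "sentence_id"))
head(cooc)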

In a next post, we’ll show how to train your own tagging models.

If you like this type of analysis or if you are interested in text mining with R, we have 3 upcoming courses planned on text mining. Feel free to register at the following links.

    • 18-19/10/2017: Statistical machine learning with R. Leuven (Belgium). Subscribe here
    • 08+10/11/2017: Text mining with R. Leuven (Belgium). Subscribe here
    • 27-28/11/2017: Text mining with R. Brussels (Belgium). http://di-academy.com/bootcamp + send mail to training@di-academy.com
    • 19-20/12/2017: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
    • 20-21/02/2018: Advanced R programming. Leuven (Belgium). Subscribe here
    • 08-09/03/2018: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
    • 22-23/03/2018: Text Mining with R. Leuven (Belgium). Subscribe here

For business questions on text mining, feel free to contact BNOSAC by sending us a mail here.


To leave a comment for the author, please follow the link and comment on their blog: bnosac :: open analytical helpers.

Writing academic articles using R Markdown and LaTeX

Thu, 10/05/2017 - 01:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

One of my favourite activities in R is using Markdown to create business reports. Most of my work I export to MS Word to communicate analytical results with my colleagues. For my academic work and eBooks, I prefer LaTeX to produce great typography. This article explains how to write academic articles and essays combining R Markdown and LaTeX. The article is formatted in accordance with the APA (American Psychological Association) requirements.

To illustrate the principles of using R Markdown and LaTeX, I recycled an essay about problems with body image that I wrote for a psychology course many years ago. You can find the completed paper and all necessary files on my GitHub repository.

Body Image

Body image describes the way we feel about the shape of our body. The literature on this topic demonstrates that many people, especially young women, struggle with their body image. A negative body image has been strongly associated with eating disorders. Psychologists measure body image using a special scale, shown in the image below.

My paper measures the current and ideal body shape of the subject and the body shape of the most attractive other sex. The results confirm previous research which found that body dissatisfaction for females is significantly higher than for men. The research also found a mild positive correlation between age and ideal body shape for women and between age and the female body shape found most attractive by men. You can read the full paper on my personal website.

Body shape measurement scale.

R Markdown and LaTeX

The R Markdown file for this essay uses Sweave to integrate R code with LaTeX. The first two code chunks create a table to summarise the respondents using the xtable package. This package creates LaTeX or HTML tables from data generated by R code.

The first lines of the code read and prepare the data, while the second set of lines creates a table in LaTeX code. The code chunk uses results=tex to ensure the output is interpreted as LaTeX. This approach is used in most of the other chunks. The image is created within the document, saved as a PDF file and then integrated back into the document as an image with an appropriate label and caption.
<>=
body <- read.csv("body_image.csv")

# Respondent characteristics
body$Cohort <- cut(body$Age, c(0, 15, 30, 50, 99),
                   labels = c("<16", "16--30", "31--50", ">50"))
body$Date <- as.Date(body$Date)
body$Current_Ideal <- body$Current - body$Ideal

library(xtable)
respondents <- addmargins(table(body$Gender, body$Cohort))
xtable(respondents,
       caption = "Age profile of survey participants",
       label = "gender-age", digits = 0)
@

Configuration

I created this file in R Studio, using the Sweave and knitr functionality. To knit the R Markdown file for this paper you will need to install the apa6 and ccicons packages in your LaTeX distribution. The apa6 package provides macros to format papers in accordance with the requirements of the American Psychological Association.

The post Writing academic articles using R Markdown and LaTeX appeared first on The Devil is in the Data.


To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.

Introducing the Deep Learning Virtual Machine on Azure

Wed, 10/04/2017 - 21:33

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

A new member has just joined the family of Data Science Virtual Machines on Azure: The Deep Learning Virtual Machine. Like other DSVMs in the family, the Deep Learning VM is a pre-configured environment with all the tools you need for data science and AI development pre-installed. The Deep Learning VM is designed specifically for GPU-enabled instances, and comes with a complete suite of deep learning frameworks including Tensorflow, PyTorch, MXNet, Caffe2 and CNTK. It also comes with example scripts and data sets to get you started on deep learning and AI problems, including:

The DLVM along with all the DSVMs also provides a complete suite of data science tools including R, Python, Spark, and much more:

There have also been some updates and additions to the tools provided in the entire DSVM family, including:

All Data Science Virtual Machines, including the Deep Learning Virtual Machine, are available as Windows and Ubuntu Linux instances, and are free of any software charges: pay only for the infrastructure charge according to the power and size of the instance you choose. An Azure account is required, but you can get started with $200 in free Azure credits here.

Microsoft Azure: Data Science Virtual Machines

 


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Time Series Analysis in R Part 3: Getting Data from Quandl

Wed, 10/04/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

This is part 3 of a multi-part guide on working with time series data in R. You can find the previous parts here: Part 1, Part 2.

Generated data like that used in Parts 1 and 2 is great for the sake of example, but not very interesting to work with. So let’s get some real-world data that we can work with for the rest of this tutorial. There are countless sources of time series data that we can use, including some that are already included in R and some of its packages. We’ll use some of this data in examples. But I’d like to expand our horizons a bit.

Quandl has a great warehouse of financial and economic data, some of which is free. We can use the Quandl R package to obtain data using the API. If you do not have the package installed in R, you can do so using:

install.packages('Quandl')

You can browse the site for a series of interest and get its API code. Below is an example of using the Quandl R package to get housing price index data. This data originally comes from the Yale Department of Economics and is featured in Robert Shiller’s book “Irrational Exuberance”. We use the Quandl function and pass it the code of the series we want. We also specify “ts” for the type argument so that the data is imported as an R ts object. We can also specify start and end dates for the series. This particular data series goes all the way back to 1890. That is far more than we need so I specify that I want data starting in January of 1990. I do not supply a value for the end_date argument because I want the most recent data available. You can find this data on the web here.

library(Quandl)

hpidata <- Quandl("YALE/NHPI", type="ts", start_date="1990-01-01")
plot.ts(hpidata, main = "Robert Shiller's Nominal Home Price Index")

Gives this plot:

While we are here, let’s grab some additional data series for later use. Below, I get data on US GDP and US personal income, and the University of Michigan Consumer Survey on selling conditions for houses. Again I obtained the relevant codes by browsing the Quandl website. The data are located on the web here, here, and here.

gdpdata <- Quandl("FRED/GDP", type="ts", start_date="1990-01-01")
pidata <- Quandl("FRED/PINCOME", type="ts", start_date="1990-01-01")
umdata <- Quandl("UMICH/SOC43", type="ts")[, 1]

plot.ts(cbind(gdpdata, pidata), main="US GDP and Personal Income, billions $")

Gives this plot:

plot.ts(umdata, main = "University of Michigan Consumer Survey, Selling Conditions for Houses")

Gives this plot:

The Quandl API also has some basic options for data preprocessing. The US GDP data is in quarterly frequency, but assume we want annual data. We can use the collapse argument to collapse the data to a lower frequency. Here we convert the data to annual as we import it.

gdpdata_ann <- Quandl("FRED/GDP", type="ts", start_date="1990-01-01", collapse="annual")
frequency(gdpdata_ann)

[1] 1

We can also transform our data on the fly as it is imported. The Quandl function has an argument transform that allows us to specify the type of data transformation we want to perform. There are five options – "diff", "rdiff", "normalize", "cumul", "rdiff_from". Specifying the transform argument as "diff" returns the simple difference, "rdiff" yields the percentage change, "normalize" gives an index where each value is that value divided by the first period value and multiplied by 100, "cumul" gives the cumulative sum, and "rdiff_from" gives each value as the percent difference between itself and the last value in the series. For more details on these transformations, check the API documentation here.

For example, here we get the data in percent change form:

gdpdata_pc <- Quandl("FRED/GDP", type="ts", start_date="1990-01-01", transform="rdiff")
plot.ts(gdpdata_pc * 100, ylab = "% change", main = "US Gross Domestic Product, % change")

Gives this plot:
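For comparison, a similar call using the "normalize" transform described above would look like this (a sketch, not shown in the original post):

# index the GDP series so that the first observation equals 100
gdpdata_idx <- Quandl("FRED/GDP", type="ts", start_date="1990-01-01",
                      transform="normalize")
plot.ts(gdpdata_idx, ylab = "Index (first period = 100)",
        main = "US Gross Domestic Product, indexed")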

You can find additional documentation on using the Quandl R package here. I’d also encourage you to check out the vast amount of free data that is available on the site. The API allows a maximum of 50 calls per day from anonymous users. You can sign up for an account and get your own API key, which will allow you to make as many calls to the API as you like (within reason of course).
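If you do register, the key can be set once per session before making any calls (a sketch; replace the placeholder with the key from your account page):

library(Quandl)
Quandl.api_key("YOUR_API_KEY")  # placeholder key, not a real one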

In Part 4, we will discuss visualization of time series data. We’ll go beyond the base R plotting functions we’ve used up until now and learn to create better-looking and more functional plots.

    Related Post

    1. Pulling Data Out of Census Spreadsheets Using R
    2. Extracting Tables from PDFs in R using the Tabulizer Package
    3. Extract Twitter Data Automatically using Scheduler R package
    4. An Introduction to Time Series with JSON Data
    5. Get Your Data into R: Import Data from SPSS, Stata, SAS, CSV or TXT

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

    nzelect 0.4.0 on CRAN with results from 2002 to 2014 and polls up to September 2017 by @ellis2013nz

    Wed, 10/04/2017 - 13:00

    (This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

    More nzelect New Zealand election data on CRAN

    Version 0.4.0 of my nzelect R package is now on CRAN. The key changes from version 0.3.0 are:

    • election results by voting place are now available back to 2002 (was just 2014)
    • polling place locations, presented as consistent values of latitude and longitude, now available back to 2008 (was just 2014)
    • voting intention polls are now complete up to the 2017 general election (previously stopped about six months ago on CRAN, although the GitHub version was always kept up to date)
    • a few minor bug fixes, e.g. allocate_seats now takes integer arguments, not just numeric.

    The definitive source of New Zealand election statistics is the Electoral Commission. If there are any discrepancies between their results and those in nzelect, it’s a bug, and please file an issue on GitHub. The voting intention polls come from Wikipedia.
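    For orientation, here is a quick sketch (my own addition, not from the original post) of how to peek at the data used below, assuming the nzge results data frame and the polls data frame exported by the package:

    library(nzelect)
    head(nzge)   # detailed results by voting place, 2002 to 2014
    tail(polls)  # voting intention polls, up to the 2017 general election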

    Example – special and early votes

    Currently, while we wait for the counting of the “special” votes from the 2017 election, there’s renewed interest in the differences between special votes and those counted on election night. Special votes are those cast by anyone voting outside their electorate, anyone enrolled in the month before the election, or anyone in one of a few other categories. Most importantly, it’s people who are on the move or who are very recently enrolled, and obviously such people are different from the run-of-the-mill voter.

    Here’s a graphic created with the nzelect R package that shows how the “special votes” in the past have been disproportionately important for Greens and Labour:

    Here’s the code to create that graphic:

    # get the latest version from CRAN:
    install.packages("nzelect")

    library(nzelect)
    library(tidyverse)
    library(ggplot2)
    library(scales)
    library(ggrepel)
    library(forcats)

    # palette of colours for the next couple of charts:
    palette <- c(parties_v, Other = "pink2", `Informal Votes` = "grey")

    # special votes:
    nzge %>%
      filter(voting_type == "Party") %>%
      mutate(party = fct_lump(party, 5)) %>%
      mutate(dummy = grepl("special", voting_place, ignore.case = TRUE)) %>%
      group_by(electorate, party, election_year) %>%
      summarise(prop_before = sum(votes[dummy]) / sum(votes),
                total_votes = sum(votes)) %>%
      ungroup() %>%
      mutate(party = gsub(" Party", "", party),
             party = gsub("ACT New Zealand", "ACT", party),
             party = gsub("New Zealand First", "NZ First", party)) %>%
      mutate(party = fct_reorder(party, prop_before)) %>%
      ggplot(aes(x = prop_before, y = party, size = total_votes, colour = party)) +
      facet_wrap(~election_year) +
      geom_point(alpha = 0.1) +
      ggtitle("'Special' votes proportion by party, electorate and year",
              "Each point represents the proportion of a party's vote in each electorate that came from special votes") +
      labs(caption = "Source: www.electionresults.govt.nz, collated in the nzelect R package",
           y = "") +
      scale_size_area("Total party votes", label = comma) +
      scale_x_continuous("\nPercentage of party's votes that were 'special'", label = percent) +
      scale_colour_manual(values = palette, guide = FALSE)

    Special votes are sometimes confused with advance voting in general. While many special votes are advance votes, the relationship is far from one to one. We see this particularly acutely by comparing the previous graphic to one that is identical except that it identifies all advance votes (those with the phrase “BEFORE” in the Electoral Commission’s description of polling place):

    While New Zealand First are the party that gains least proportionately from special votes, they gain the most from advance votes, although the difference between parties is fairly marginal. New Zealand First voters are noticeably more likely to be in an older age bracket than the voters for other parties. My speculation on their disproportionate share of advance voting is that it is related to that, although I’m not an expert in that area and am interested in alternative views.

    This second graphic also shows nicely just how much advance voting is becoming a feature of the electoral landscape. Unlike the proportion of votes that are “special”, which has been fairly stable, the proportion of votes that are cast in advance has increased very substantially over the past decade, and increased further in the 2017 election (for which final results come out on Saturday).

    Here’s the code for the second graphic; it’s basically the same as the previous chunk of code, except filtering on a different character string in the voting place name:

    nzge %>%
      filter(voting_type == "Party") %>%
      mutate(party = fct_lump(party, 5)) %>%
      mutate(dummy = grepl("before", voting_place, ignore.case = TRUE)) %>%
      group_by(electorate, party, election_year) %>%
      summarise(prop_before = sum(votes[dummy]) / sum(votes),
                total_votes = sum(votes)) %>%
      ungroup() %>%
      mutate(party = gsub(" Party", "", party),
             party = gsub("ACT New Zealand", "ACT", party),
             party = gsub("New Zealand First", "NZ First", party)) %>%
      mutate(party = fct_reorder(party, prop_before)) %>%
      ggplot(aes(x = prop_before, y = party, size = total_votes, colour = party)) +
      facet_wrap(~election_year) +
      geom_point(alpha = 0.1) +
      ggtitle("'Before' votes proportion by party, electorate and year",
              "Each point represents the proportion of a party's vote in each electorate that were cast before election day") +
      labs(caption = "Source: www.electionresults.govt.nz, collated in the nzelect R package",
           y = "") +
      scale_size_area("Total party votes", label = comma) +
      scale_x_continuous("\nPercentage of party's votes that were before election day", label = percent) +
      scale_colour_manual(values = palette, guide = FALSE)

    ‘Party’ compared to ‘Candidate’ vote

    Looking for something else to showcase, I thought it might be interesting to pool all five elections for which I have the detailed results and compare the party vote (i.e. the proportional representation choice out of the two votes New Zealanders get) to the candidate vote (i.e. the representative member choice). Here’s a graphic that does just that:

    We see that New Zealand First and the Greens are the two parties that are most noticeably above the diagonal line indicating equality between party and candidate votes. This isn’t a surprise – these are minority parties that appeal to (different) demographic and issues-based communities that are dispersed across the country, and generally have little chance of winning individual electorates. Hence the practice of voters is often to split their votes. This is all perfectly fine and is exactly how mixed-member proportional voting systems are meant to work.

    Here’s the code that produced the scatter plot:

    nzge %>%
      group_by(voting_type, party) %>%
      summarise(votes = sum(votes)) %>%
      spread(voting_type, votes) %>%
      ggplot(aes(x = Candidate, y = Party, label = party)) +
      geom_abline(intercept = 0, slope = 1, colour = "grey50") +
      geom_point() +
      geom_text_repel(colour = "steelblue") +
      scale_x_log10("Total 'candidate' votes", label = comma,
                    breaks = c(1, 10, 100, 1000) * 1000) +
      scale_y_log10("Total 'party' votes", label = comma,
                    breaks = c(1, 10, 100, 1000) * 1000) +
      ggtitle("Lots of political parties: total votes from 2002 to 2014",
              "New Zealand general elections") +
      labs(caption = "Source: www.electionresults.govt.nz, collated in the nzelect R package") +
      coord_equal()

    What next?

    Obviously, the next big thing for nzelect is to get the 2017 results in once they are announced on Saturday. I should be able to do this for the GitHub version by early next week. I would have delayed the CRAN release until the 2017 results were available, but unfortunately I had a bug in some of the examples in my helpfiles that stopped them working after 23 September 2017, so I had to rush a fix and the latest enhancements to CRAN to avoid problems with the CRAN maintainers (whose policies I fully endorse, and whom I thank, by the way).

    My other plans for nzelect over the next months to a year include:

    • reliable point locations for voting places back to 2002
    • identify consistent/duplicate voting places over time to make it easier to analyse comparative change by micro location
    • add detailed election results for the 1999 and 1996 elections (these are saved under a different naming convention to those from 2002 onwards, which is why they need a bit more work)
    • add high level election results for prior to 1994

    The source code for cleaning the election data and packaging it into nzelect is on GitHub. The package itself is on CRAN and installable in the usual way.


    To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R.

    RProtoBuf 0.4.11

    Wed, 10/04/2017 - 02:25

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    RProtoBuf provides R bindings for the Google Protocol Buffers ("ProtoBuf") data encoding and serialization library used and released by Google, and deployed fairly widely in numerous projects as a language and operating-system agnostic protocol.

    A new release, RProtoBuf 0.4.11, appeared on CRAN earlier today. Not unlike the other recent releases, it is mostly a maintenance release which switches two of the vignettes over to using the pinp package and its template for vignettes.

    Changes in RProtoBuf version 0.4.11 (2017-10-03)
    • The RProtoBuf-intro and RProtoBuf-quickref vignettes were converted to Rmarkdown using the templates and style file from the pinp package.

    • A few minor internal upgrades

    CRANberries also provides a diff to the previous release. The RProtoBuf page has copies of the (older) package vignette, the ‘quick’ overview vignette, a unit test summary vignette, and the pre-print for the JSS paper. Questions, comments etc should go to the GitHub issue tracker off the GitHub repo.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

    Create Powerpoint presentations from R with the OfficeR package

    Tue, 10/03/2017 - 23:30

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    For many of us data scientists, whatever the tools we use to conduct research or perform an analysis, our superiors are going to want the results as a Microsoft Office document. Most likely it's a Word document or a PowerPoint presentation, and it probably has to follow the corporate branding guidelines to boot. The OfficeR package, by David Gohel, addresses this problem by allowing you to take a Word or PowerPoint template and programmatically insert text, tables and charts generated by R into the template to create a complete document. (The OfficeR package also represents a leap forward from the similar ReporteRs package: it's faster, and no longer has a dependency on a Java installation.)

    At his blog, Len Kiefer takes the OfficeR package through its paces, demonstrating how to create a PowerPoint deck using R. The process is pretty simple:

    • Create a template PowerPoint presentation to host the slides. You can use the Slide Master mode in PowerPoint to customize the style of the slides you will create using R, and you can use all of PowerPoint's features to control layout, select fonts and colors, and include images like logos. 
    • For each slide you wish to create, you can either reference a template slide included in your base presentation, or add a new slide based on one of your custom layouts.
    • For each slide, you will reference one or more placeholders (regions where content goes) in the layout using their unique names. You can then use R functions to fill them with text, hyperlinks, tables or images. The formatting of each will be controlled by the formatting you specified within PowerPoint (see the sketch after this list).
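    Putting those steps together, here is a minimal sketch using the officer package (the template file, layout and placeholder names are assumptions that depend on your own template, and the function names reflect the package's current API, which may differ in detail from the version discussed here):

    library(officer)

    # start from a branded template and add a slide based on one of its layouts
    deck <- read_pptx("corporate_template.pptx")            # assumed template file
    deck <- add_slide(deck, layout = "Title and Content",   # assumed layout name
                      master = "Office Theme")

    # fill named placeholders with content generated in R
    deck <- ph_with(deck, value = "Housing inventory update",
                    location = ph_location_type(type = "title"))
    deck <- ph_with(deck, value = head(mtcars),
                    location = ph_location_type(type = "body"))

    print(deck, target = "inventory_deck.pptx")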

    Len used this process to convert the content of a blog post on housing inventory into the PowerPoint deck you see embedded below.

    You can find details of how to create Microsoft Office documents in the OfficeR vignette on PowerPoint documents, and there's a similar guide for Word documents, too. Or just take a look at the blog post linked above for a worked example.

    Len Kiefer: Crafting a PowerPoint Presentation with R


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    R live class – Statistics for Data Science

    Tue, 10/03/2017 - 17:52

    (This article was first published on R blog | Quantide - R training & consulting, and kindly contributed to R-bloggers)

     

    Statistics for Data Science is our second course of the autumn term. It takes place on October 11-12 in Milano Lima.

    In this two-day course you will learn how to develop a wide variety of linear and generalized linear models with R. The course follows a step-by-step approach: starting from the simplest linear regression we will add complexity to the model up to the most sophisticated GLM, looking at the statistical theory along with its R implementation. Supported by plenty of examples, the course will give you a wide overview of the R capabilities to model and to make predictions.

    Statistics for Data Science Outlines
    • t-test, ANOVA
    • Linear, Polynomial and Multiple Regression
    • More Complex Linear Models
    • Generalized Linear Models
    • Logistic Regression
    • Poisson Dep. Var. Regression
    • Gamma Dep. Var. Regression
    • Check of models assumptions
    • Brief outlines of GAM, Mixed Models, Neural Networks, Tree-based Modelling
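To give a flavour of the progression this outline describes, a few of the core models can be fitted in base R along the following lines (using a built-in dataset purely for illustration; this is not the course material itself):

# linear regression and ANOVA
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit_lm)
anova(fit_lm)

# logistic regression (binomial GLM)
fit_logit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Poisson regression for a count-dependent variable
fit_pois <- glm(carb ~ wt, data = mtcars, family = poisson)

# quick check of model assumptions for the linear model
par(mfrow = c(2, 2))
plot(fit_lm)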

     

    Statistics for Data Science is organized by the R training and consulting company Quantide and is taught in Italian, while all the course materials are in English.

    The course is limited to a maximum of 6 attendees.

    Location

    The course location is 550 m (a 7-minute walk) from Milano central station and just 77 m (a 1-minute walk) from the Lima subway station.

    Registration

    If you want to reserve a seat, go to: FAQ, detailed program and tickets.

    Other R courses | Autumn term

    You can find an overview of all our courses here. Next dates will be:

    • October 25-26: Machine Learning with R. Find patterns in large data sets using the R tools for Dimensionality Reduction, Clustering, Classification and Prediction. Reserve now!
    • November 7-8: Data Visualization and Dashboard with R. Show the story behind your data: create beautiful effective visualizations and interactive Shiny dashboards. Reserve now!
    • November 21-22: R with Database and Big Data. From databases to distributed infrastructure, master the R techniques to handle and query Big Data. Reserve now!
    • November 29-30: Professional R Programming. Organise, document and test your code: write efficient functions, improve the code reproducibility and build R packages. Reserve now!

    If you are a group of people interested in more than one class, write to us at training[at]quantide[dot]com! Together we can arrange a tailor-made course, picking all the topics that are interesting for your organization and dropping the rest.

    The post R live class – Statistics for Data Science appeared first on Quantide – R training & consulting.


    Dashboard Design: 8 Types of Online Dashboards

    Tue, 10/03/2017 - 13:31

    (This article was first published on R – Displayr, and kindly contributed to R-bloggers)

    What type of online dashboard will work best for your data? This post reviews eight types of online dashboards to assist you in choosing the right approach for your next dashboard. Note that there may well be more than eight types of dashboards; I am sure I have missed a few. If so, please tell me in the comments section of this post.

    KPI Online Dashboards

    The classic dashboards are designed to report key performance indicators (KPIs). Think of the dashboard of a car or the cockpit of an airplane. The KPI dashboard is all about dials and numbers. Typically, these dashboards are live and show the latest numbers. In a business context, they typically show trend data as well.

    A very simple example of a KPI Dashboard is below. Such dashboards can, of course, be huge, with lots of pages crammed with numbers and charts covering all manner of operational and strategic data.

    Click the image for an interactive version

    Geographic Online Dashboards

    The most attractive dashboards are often geographic. The example below was created by Iaroslava Mizai in Tableau. Because people are inspired by such dashboards, I imagine that a lot of money has been spent on Tableau licenses.

    While visually attractive, such dashboards tend to make up a tiny proportion of the dashboards in widespread use. Outside of sales, geography, and demography, few people spend much time exploring geographic data.

    Click the image for an interactive version

    Catalog Online Dashboards

    A catalog online dashboard is based around a menu, from which the viewer selects the results they are interested in. It is a much more general type of dashboard, used for displaying any kind of data rather than just geography, and any variable can be used to cut the data. For example, the Catalog Dashboard below gives the viewer a choice of country to investigate.

    Click the image for an interactive version

    The dashboard below has the same basic idea, except the user navigates by clicking the control box on the right-side of the heading. In this example of a brand health dashboard, the control box is currently set to IGA (but you could click on it to change it to another supermarket).

     

    Click the image for an interactive version

    The PowerPoint Alternative Dashboard

    A story dashboard consists of a series of pages specifically ordered for the reader. This type of online dashboard is used as a powerful alternative to PowerPoint, with the additional benefits of being interactive, updatable and live. Typically, a user either navigates through such a dashboard using navigation buttons (i.e., forward and backward) or uses the navigation bar on the left, as shown in the online dashboard example below.

    Drill-Down Online Dashboards

    A drill-down is an online dashboard (or dashboard control) where the viewer can "drill" into the data to get more information. The whole dashboard is organized in a hierarchical fashion.

    There are five common ways of facilitating drill-downs in dashboards: zoom, graphical filtering, control-based drill-downs, filtered drill-downs, and hyperlink drill-downs (via landing pages). The choice of which to use is partly technological and partly related to the structure of the data.

    1. Zoom

    Zooming is perhaps the most widely used technique for permitting users to drill-down. The user can typically achieve the zoom via mouse, touch, + buttons, and draggers. For example, the earlier Microsoft KPI dashboard permitted the viewer to change the time series window by dragging on the bottom of each chart.

    While zooming is the most aesthetically pleasing way of drilling into data, it is also the least general. This approach to dashboarding only works when there is a strong and obvious ordering of the data. This is typically only the case with geographic and time series data, although sometimes data is forced into a hierarchy to make zooming possible. This is the case in the Zooming example below, which shows survival rates for the Titanic (double-click on it to zoom).

    Unless you are writing everything from scratch, the ability to add zoom to a dashboard will depend on whether the components being used support zoom.

    Click the image for an interactive version

    2. Graphical filtering

    Graphical filtering allows the user to explore data by clicking on graphical elements of the dashboard. For example, in this QLik dashboard, I clicked on the Ontario pie slice (on the right of the screen) and all the other elements on the page automatically updated to show data relating to Ontario.

    Graphical filtering is cool. However, it requires both highly structured data and quite a bit of time spent figuring out how to design and implement the user interface, which makes these dashboards the most challenging to build. The most amazing examples tend to be bespoke websites created by data journalists (e.g., http://www.poppyfield.org/). The most straightforward way of creating such dashboards with graphical filtering tends to be using business intelligence tools, like Qlik and Tableau. Typically, there is a lot of effort required to structure the data up front; you then get the graphical filtering "for free". If you are more the DIY type, wanting to build your own dashboards and pay nothing, RStudio's Shiny is probably the most straightforward option.

    Click the image for an interactive version

    3. Control-based drill-downs

    A quicker and easier way of implementing drill-downs is to give the user controls that they can use to select data. From a user interface perspective, the appearance is essentially the same as with the Supermarket Brand Health dashboard (example a few dashboards above). Here, a user chooses from the available options (or uses sliders, radio buttons, etc.).
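The dashboards in this post are built in Displayr, but for the DIY route mentioned earlier, a control-based drill-down can be sketched in Shiny roughly as follows (the dataset and variables are chosen purely for illustration):

library(shiny)
library(ggplot2)

# one combo box drives every output on the page
ui <- fluidPage(
  selectInput("maker", "Manufacturer", choices = sort(unique(mpg$manufacturer))),
  plotOutput("hwy_plot"),
  tableOutput("summary_tbl")
)

server <- function(input, output, session) {
  # the drilled-down subset reacts to the control
  drilled <- reactive(subset(mpg, manufacturer == input$maker))

  output$hwy_plot <- renderPlot(
    ggplot(drilled(), aes(displ, hwy)) + geom_point()
  )
  output$summary_tbl <- renderTable(
    aggregate(hwy ~ class, data = drilled(), FUN = mean)
  )
}

shinyApp(ui, server)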

    4. Filtered drill-downs

    When drilling down involves restricting the data to a subset of the observations (e.g., to a subset of respondents in a survey), users can zoom in using filtering tools. For example, you can filter the Supermarket Brand Health dashboard by various demographic groups. While using filters to zoom is the least sexy of the ways of permitting users to drill into data, it is usually the most straightforward to implement. Furthermore, it is also a lot more general than any of the other styles of drill-downs considered so far. For example, the picture below illustrates drilling into the data of women aged 35 or more (using the Filters drop-down menu in the top right corner).

    Click the image for an interactive version

    5. Hyperlink drill-downs

    The most general approach for creating drill-downs is to link together multiple pages with hyperlinks. While all of the other approaches involve some aspect of filtering, hyperlinks enable the user to drill into qualitatively different data. Typically, there is a landing page that contains a summary of key data, and the user clicks on the data of interest to drill down and get more information. In the example of a hyperlinked dashboard below, the landing page shows the performance of different departments in a supermarket. The viewer clicks on the result for a department (e.g., CHECK OUT), which takes them to a screen showing more detailed results.

    Click the image for an interactive version

     

    Interactive Infographic Dashboard

    Infographic dashboards present viewers with a series of closely related charts, text, and images. Here is an example of an interactive infographic on Gamers, where the user can change the country at the top and the dashboard automatically updates.

    Click the image for an interactive version

    Visual Confections Dashboard

    A visual confection is an online dashboard that layers multiple visual elements on top of each other, whereas an infographic presents a series of related visualizations. The dashboard below overlays time series information with exercise and diet information.

    Click the image for an interactive version

    Simulator Dashboards

    The final type of dashboard that I can think of is a simulator. The simulator dashboard example below is from a latent class logit choice model of the egg market. The user can select different properties for each of the competitors and the dashboard predicts market share.

    Click the image for an interactive version

     Create your own Online Dashboards

    I have mentioned a few specific apps for creating online dashboards, including Tableau, QLik, and Shiny. All the other online dashboards in this post used R from within Displayr (you can even just use Displayr to see the underlying R code for each online dashboard). To explore or replicate the Displayr dashboards, just follow the links below for Edit mode for each respective dashboard, and then click on each of the visual elements.

    Microsoft KPI

    Overview: A one-page dashboard showing stock price and Google Trends data for Microsoft.
    Interesting features: Automatically updated every 24 hours, pulling in data from Yahoo Finance and Google Trends.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the dashboard.

     

    Europe and Immigration

    Overview: Attitudes of Europeans to Immigration
    Interesting features: Based on 213,308 survey responses collected over 13 years. Custom navigation via images and hyperlinks.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the online dashboard.

     

    Supermarket Brand Health

    Overview: Usage and attitudes towards supermarkets
    Interesting features: Uses a control (combo box) to update the calculations for the chosen supermarket brand.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the online dashboard.

     

    Supermarket Department NPS

    Overview: Performance by department of supermarkets.
    Interesting features: Color-coding of circles based on underlying data (they change when the data is filtered using the Filters menu in the top right). Custom navigation, whereby the user clicks on the circle for a department and gets more information about that department.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the dashboard.

     

    Blood Glucose Confection

    Overview: Blood glucose measurements and food diary.
    Interesting features: The fully automated underlying charts that integrate data from a wearable blood glucose implant and a food diary. See Layered Data Visualizations Using R, Plotly, and Displayr for more about this dashboard.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the online dashboards.

     

    Interactive infographic

    Overview: An infographic that updates based on the viewer’s selection of country.
    Interesting features: Based on an infographic created in Canva. The data is pasted in from a spreadsheet  (i.e., no hookup to a database).
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the dashboard.

     

     

    Presidential MaxDiff

    Overview: A story-style dashboard showing an analysis of what Americans desire in their Commander-in-Chief.
    Interesting features: A revised data file can be used to automatically update the visualizations, text, and the underlying analysis (a MaxDiff model)(i.e., it is an automated report).
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the online dashboards.

     

    Choice Simulator

    Overview: A decision-support system
    Interesting features: The simulator is hooked up directly to an underlying latent class model. See How to Create an Online Choice Simulator for more about this dashboard.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the dashboard.

     

     


    googleLanguageR – Analysing language through the Google Cloud Machine Learning APIs

    Tue, 10/03/2017 - 09:00

    (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

    One of the greatest assets human beings possess is the power of speech and language, from which almost all our other accomplishments flow. To be able to analyse communication offers us a chance to gain a greater understanding of one another.

    To help you with this, googleLanguageR is an R package that allows you to perform speech-to-text transcription, neural net translation and natural language processing via the Google Cloud machine learning services.

    An introduction to the package is below, but you can find out more details at the googleLanguageR website.

    Google's bet

    Google predicts that machine learning will be a fundamental feature of business, and so it is looking to become the infrastructure that makes machine learning possible. Metaphorically speaking: if machine learning is electricity, then Google wants to be the pylons carrying it around the country.

    Google may not be the only company with such ambitions, but one advantage Google has is the amount of data it possesses. Twenty years of web crawling has given it an unprecedented corpus to train its models. In addition, its recent moves into voice and video give it one of the biggest audio and speech datasets, all of which have been used to help create machine learning applications within its products such as search and Gmail. Further investment in machine learning is shown by Google's purchase of Deepmind, a UK-based A.I. research firm that was recently in the news for defeating the top Go champion with its neural-network-trained Go bot. Google has also taken an open-source route with the creation and publication of Tensorflow, a leading machine learning framework.

    Whilst you can create your own machine learning models, for those users who haven't the expertise, data or time to do so, Google also offers an increasing range of machine learning APIs that are pre-trained, such as image and video recognition or job search. googleLanguageR wraps the subset of those machine learning APIs that are language flavoured – Cloud Speech, Translation and Natural Language.

    Since the three APIs produce complementary outputs that can be used as each other's inputs, all of them are included in one package. For example, you can transcribe a recording of someone speaking in Danish, translate that to English, identify how positive or negative the speaker felt about the content (sentiment analysis), and then identify the most important concepts and objects within it (entity analysis).
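As a rough sketch of that chain (authentication is assumed to be set up already; the audio file name and language codes are made up, and the per-API examples later in this post show the real outputs):

library(googleLanguageR)

# transcribe a Danish recording...
speech <- gl_speech("danish_interview.wav", languageCode = "da-DK")

# ...translate the transcript to English...
english <- gl_translate(speech$transcript, target = "en")

# ...then analyse sentiment and entities
nlp <- gl_nlp(english$translatedText)
nlp$documentSentiment # how positive or negative the content is
nlp$entities          # the most important concepts and objects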

    Motivations

    Fake news

    One reason why I started looking at this area was the growth of 'fake news' and its effect on political discourse on social media. I wondered if there was some way to put metrics on how much a news story fuels your own bias within your own filter bubble. The entity API provides a way to perform entity and sentiment analysis at scale on tweets; by then comparing the preferences of different users and news sources, the hope is to be able to judge how much they agree with your own biases, views and trusted sources.

    Make your own Alexa

    Another motivating application is the growth of voice commands, which may become the primary way we interact with technology. Already, Google reports that up to 20% of searches in its app are made by voice. I'd like to be able to say "R, print me out that report for client X". A Shiny app that records your voice, uploads it to the API and then parses the returned text into actions gives you a chance to create your very own Alexa-like infrastructure.

    The voice activated internet connected speaker, Amazon's Alexa – image from www.amazon.co.uk

    Translate everything

    Finally, I live and work in Denmark. As Danish is spoken by fewer than 6 million people, applications that work in English may not be available in Danish very quickly, if at all. The API's translation service is the one that made the news in 2016 for "inventing its own language"; it offers much better English-to-Danish translations than the free web version and may make services available in Denmark sooner.

    Using the library

    To use these APIs within R, you first need to do a one-time setup to create a Google Project, add a credit card and authenticate, all of which is detailed on the package website.

    After that, you feed in the R objects you want to operate upon. The rOpenSci review helped to ensure that this can scale up easily, so that you can feed in large character vectors which the library will parse and rate limit as required. The functions also work within tidyverse pipe syntax.

    Speech-to-text

    The Cloud Speech API is exposed via the gl_speech function.

    It supports multiple audio formats and languages, and you can either feed in a sub-60-second audio file directly or perform asynchronous requests for longer audio files.

    Example code:

    library(googleLanguageR)

    my_audio <- "my_audio_file.wav"
    gl_speech(my_audio)
    # A tibble: 1 x 3
    #  transcript confidence words
    #*
    #1  Hello Mum  0.9227779

    Translation

    The Cloud Translation API lets you translate text via the gl_translate function.

    As you are charged per character, one tip if you are working with lots of different languages is to detect the language offline first using another rOpenSci package, cld2. That way you can avoid charges for text that is already in your target language (English, in this case).

    library(googleLanguageR)
    library(cld2)
    library(purrr)

    my_text <- c("Katten sidder på måtten", "The cat sat on the mat")

    ## offline detect language via cld2
    detected <- map_chr(my_text, detect_language)
    # [1] "DANISH"  "ENGLISH"

    ## get non-English text
    translate_me <- my_text[detected != "ENGLISH"]

    ## translate
    gl_translate(translate_me)
    ## A tibble: 1 x 3
    #                 translatedText detectedSourceLanguage                    text
    #*
    #1 The cat is sitting on the mat                     da Katten sidder på måtten

    Natural Language Processing

    The Natural Language API reveals the structure and meaning of text, accessible via the gl_nlp function.

    It returns several analyses:

    • Entity analysis – finds named entities (currently proper names and common nouns) in the text along with entity types, salience, mentions for each entity, and other properties. If possible, will also return metadata about that entity such as a Wikipedia URL.
    • Syntax – analyzes the syntax of the text and provides sentence boundaries and tokenization along with part of speech tags, dependency trees, and other properties.
    • Sentiment – the overall sentiment of the text, represented by a magnitude [0, +inf] and score between -1.0 (negative sentiment) and 1.0 (positive sentiment)

    These are all useful for getting an understanding of the meaning of a sentence, and this API has potentially the greatest number of applications of the three featured. With entity analysis, auto-categorisation of text is possible; the syntax output lets you pull out nouns and verbs for parsing into other actions; and the sentiment analysis allows you to get a feeling for the emotion within text.

    A demonstration is below which gives an idea of what output you can generate:

    library(googleLanguageR)

    quote <- "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
    nlp <- gl_nlp(quote)

    str(nlp)
    #List of 6
    # $ sentences :List of 1
    # ..$ :'data.frame': 1 obs. of 4 variables:
    # .. ..$ content : chr "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
    # .. ..$ beginOffset: int 0
    # .. ..$ magnitude : num 0.6
    # .. ..$ score : num -0.6
    # $ tokens :List of 1
    # ..$ :'data.frame': 20 obs. of 17 variables:
    # .. ..$ content : chr [1:20] "Two" "things" "are" "infinite" ...
    # .. ..$ beginOffset : int [1:20] 0 4 11 15 23 25 29 38 42 48 ...
    # .. ..$ tag : chr [1:20] "NUM" "NOUN" "VERB" "ADJ" ...
    # .. ..$ aspect : chr [1:20] "ASPECT_UNKNOWN" "ASPECT_UNKNOWN" "ASPECT_UNKNOWN" "ASPECT_UNKNOWN" ...
    # .. ..$ case : chr [1:20] "CASE_UNKNOWN" "CASE_UNKNOWN" "CASE_UNKNOWN" "CASE_UNKNOWN" ...
    # .. ..$ form : chr [1:20] "FORM_UNKNOWN" "FORM_UNKNOWN" "FORM_UNKNOWN" "FORM_UNKNOWN" ...
    # .. ..$ gender : chr [1:20] "GENDER_UNKNOWN" "GENDER_UNKNOWN" "GENDER_UNKNOWN" "GENDER_UNKNOWN" ...
    # .. ..$ mood : chr [1:20] "MOOD_UNKNOWN" "MOOD_UNKNOWN" "INDICATIVE" "MOOD_UNKNOWN" ...
    # .. ..$ number : chr [1:20] "NUMBER_UNKNOWN" "PLURAL" "NUMBER_UNKNOWN" "NUMBER_UNKNOWN" ...
    # .. ..$ person : chr [1:20] "PERSON_UNKNOWN" "PERSON_UNKNOWN" "PERSON_UNKNOWN" "PERSON_UNKNOWN" ...
    # .. ..$ proper : chr [1:20] "PROPER_UNKNOWN" "PROPER_UNKNOWN" "PROPER_UNKNOWN" "PROPER_UNKNOWN" ...
    # .. ..$ reciprocity : chr [1:20] "RECIPROCITY_UNKNOWN" "RECIPROCITY_UNKNOWN" "RECIPROCITY_UNKNOWN" "RECIPROCITY_UNKNOWN" ...
    # .. ..$ tense : chr [1:20] "TENSE_UNKNOWN" "TENSE_UNKNOWN" "PRESENT" "TENSE_UNKNOWN" ...
    # .. ..$ voice : chr [1:20] "VOICE_UNKNOWN" "VOICE_UNKNOWN" "VOICE_UNKNOWN" "VOICE_UNKNOWN" ...
    # .. ..$ headTokenIndex: int [1:20] 1 2 2 2 2 6 2 6 9 6 ...
    # .. ..$ label : chr [1:20] "NUM" "NSUBJ" "ROOT" "ACOMP" ...
    # .. ..$ value : chr [1:20] "Two" "thing" "be" "infinite" ...
    # $ entities :List of 1
    # ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 6 obs. of 9 variables:
    # .. ..$ name : chr [1:6] "human stupidity" "things" "universe" "universe" ...
    # .. ..$ type : chr [1:6] "OTHER" "OTHER" "OTHER" "OTHER" ...
    # .. ..$ salience : num [1:6] 0.1662 0.4771 0.2652 0.2652 0.0915 ...
    # .. ..$ mid : Factor w/ 0 levels: NA NA NA NA NA NA
    # .. ..$ wikipedia_url: Factor w/ 0 levels: NA NA NA NA NA NA
    # .. ..$ magnitude : num [1:6] NA NA NA NA NA NA
    # .. ..$ score : num [1:6] NA NA NA NA NA NA
    # .. ..$ beginOffset : int [1:6] 42 4 29 86 29 86
    # .. ..$ mention_type : chr [1:6] "COMMON" "COMMON" "COMMON" "COMMON" ...
    # $ language : chr "en"
    # $ text : chr "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
    # $ documentSentiment:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 2 variables:
    # ..$ magnitude: num 0.6
    # ..$ score : num -0.6

    Acknowledgements

    This package is 10 times better due to the efforts of the rOpenSci reviewers Neal Richardson and Julia Gustavsen, who have whipped the documentation, outputs and test cases into the form they are today in 0.1.0. Many thanks to them.

    Hopefully, this is just the beginning and the package can be further improved by its users – if you do give the package a try and find a potential improvement, raise an issue on GitHub and we can try to implement it. I'm excited to see what users can do with these powerful tools.


    Mapping ecosystems of software development

    Tue, 10/03/2017 - 02:00

    (This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

    I have a new post on the Stack Overflow blog today about the complex, interrelated ecosystems of software development. On the data team at Stack Overflow, we spend a lot of time and energy thinking about tech ecosystems and how technologies are related to each other. One way to get at this idea of relationships between technologies is tag correlations, how often technology tags at Stack Overflow appear together relative to how often they appear separately. One place we see developers using tags at Stack Overflow is on their Developer Stories. If we are interested in how technologies are connected and how they are used together, developers’ own descriptions of their work and careers are a great place to get that.

    I released the data for this network structure as a dataset on Kaggle so you can explore it for yourself! For example, the post for Stack Overflow includes an interactive visualization created using the networkD3 package, but we can create other kinds of visualizations using the ggraph package. Either way, trusty igraph comes into play.

    library(readr)
    library(igraph)
    library(ggraph)

    stack_network <- graph_from_data_frame(read_csv("stack_network_links.csv"),
                                           vertices = read_csv("stack_network_nodes.csv"))

    set.seed(2017)
    ggraph(stack_network, layout = "fr") +
        geom_edge_link(alpha = 0.2, aes(width = value)) +
        geom_node_point(aes(color = as.factor(group), size = 10 * nodesize)) +
        geom_node_text(aes(label = name), family = "RobotoCondensed-Regular", repel = TRUE) +
        theme_graph(base_family = "RobotoCondensed-Regular") +
        theme(plot.title = element_text(family="Roboto-Bold"),
              legend.position="none") +
        labs(title = "Stack Overflow Tag Network",
             subtitle = "Tags correlated on Developer Stories")

    We have explored these kinds of network structures using all kinds of data sources at Stack Overflow, from Q&A to traffic, and although we see similar relationships across all of them, we really like Developer Stories as a data source for this particular question. Let me know if you have any comments or questions!


    Comparing assault death rates in the US to other advanced democracies

    Mon, 10/02/2017 - 23:28

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    In an effort to provide context to the frequent mass shootings in the United States, Kieran Healy (Associate Professor of Sociology at Duke University) created this updated chart comparing assault death rates in the US to that of 23 other advanced democracies. The chart shows the rate (per 100,000 citizens) of death caused by assaults (stabbings, gunshots, etc. by a third party). Assaults are used rather than gun deaths specifically, as that's the only statistic for which readily comparable data is available. The data come from the OECD Health Status Database through 2015, the most recent complete year available.

    The goal of this chart is to "set the U.S. in some kind of longitudinal context with broadly comparable countries", and to that end OECD countries Estonia and Mexico are not included. (Estonia suffered a spike of violence in the mid-90's, and Mexico has been embroiled in drug violence for decades. See the chart with Estonia and Mexico included here.) Healy provides a helpful FAQ justifying this decision and other issues related to the data and their presentation.

    Healy used the R language (and, specifically the ggplot2 graphics package) to create this chart, and the source code is available on Github.
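The real code is on Github, but a minimal sketch of this kind of small-multiples chart in ggplot2 might look like the following, assuming a hypothetical data frame assault_rates with country, year and rate (deaths per 100,000) columns:

    library(ggplot2)

    # faceted time series, with the United States panel highlighted
    ggplot(assault_rates, aes(x = year, y = rate)) +
      geom_line(colour = "grey60") +
      geom_line(data = subset(assault_rates, country == "United States"),
                colour = "firebrick", size = 1) +
      facet_wrap(~ country) +
      labs(x = NULL, y = "Assault deaths per 100,000")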

    For more context around this chart, follow the link below, and also see his prior renderings and commentary related to the same data through 2013 and through 2010.

    Kieran Healy: Assault deaths to 2015


    Processing Rmarkdown documents with Eclipse and StatET

    Mon, 10/02/2017 - 23:06

    (This article was first published on R-project – lukemiller.org, and kindly contributed to R-bloggers)

    Processing R markdown (Rmd) documents with Eclipse/StatET external tools requires a different setup than processing ‘regular’ knitr documents (Rnw). I was having problems getting the whole rmarkdown -> pandoc workflow working on Eclipse, but the following fix seems to have resolved it, and I can generate Word or HTML documents from a single .Rmd file with a YAML header (see image below).

    A .Rmd document with YAML header set to produce a Microsoft Word .docx output file.

    For starters, I open the Run > External Tools > External Tools Configurations window. The Rmarkdown document is considered a type of Wikitext, so create a new Wikitext + R Document Processing entry. To do that, hit the little blank-page-with-a-plus icon found in the upper left corner of the window, above where it says “type filter text”.

    Using the Auto (YAML) single-step preset.

    Enter a name for this configuration in the Name field (where it says knitr_RMD_yaml in my image above). I then used the Load Preset/Example menu in the upper right to choose the “Auto (YAML) using RMarkdown, single-step” option. This processes your .Rmd document directly using the rmarkdown::render() function, skipping over a separate knitr::knit() step, but the output should look the same.
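For reference, the single-step preset is roughly equivalent to running the following from an R console (the file name is just an example); the output format is taken from the .Rmd file's YAML header:

    # render an .Rmd whose YAML header requests, e.g., word_document output
    rmarkdown::render("my_report.Rmd")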

    Next go to the “2) Produce Output” tab (you are skipping over the “1) R-Weave” tab because you chose the 1-step process). By default the entry in the File field here was causing errors for me. The change is pictured below, so that the File entry reads "${file_name_base:${source_file_path}}.${out_file_ext}". This change allowed my setup to actually find the .Rmd and output .md files successfully, so that the .md document could then be passed on to pandoc.

    Modify the File entry from the default so that instead reads the same as what’s pictured here.

    This all assumes that you’ve previously downloaded and installed pandoc so that it can be found on the Windows system $PATH.
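If you are unsure whether that is the case, a quick check from R (using helpers from the rmarkdown package) is:

    rmarkdown::pandoc_available() # TRUE if a usable pandoc was found on the PATH
    rmarkdown::pandoc_version()   # which version will be used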


    Marketing Multi-Channel Attribution model based on Sales Funnel with R

    Thu, 09/28/2017 - 06:00

    (This article was first published on R language – AnalyzeCore – data is beautiful, data is a story, and kindly contributed to R-bloggers)

    This is the last post in the series of articles about using Multi-Channel Attribution in marketing. In previous two articles (part 1 and part 2), we’ve reviewed a simple and powerful approach based on Markov chains that allows you to effectively attribute marketing channels.

    In this article, we will review another fascinating approach that marries heuristic and probabilistic methods. Again, the core idea is straightforward and effective.

    Sales Funnel

    Usually, companies have some kind of idea of how their clients move along the user journey from first visiting a website to closing a purchase. This sequence of steps is called a Sales (purchasing or conversion) Funnel. Classically, the Sales Funnel includes at least four steps:
    • Awareness – the customer becomes aware of the existence of a product or service (“I didn’t know there was an app for that”),
    • Interest – actively expressing an interest in a product group (“I like how your app does X”),
    • Desire – aspiring to a particular brand or product (“Think I might buy a yearly membership”),
    • Action – taking the next step towards purchasing the chosen product (“Where do I enter payment details?”).

    For an e-commerce site, we can come up with one or more conditions (events/actions) that serve as evidence of passing each step of the Sales Funnel.

    For some extra information about Sales Funnel, you can take a look at my (rather ugly) approach of Sales Funnel visualization with R.

    Companies naturally lose some share of visitors at each subsequent step of the Sales Funnel as it gets narrower; that's why it looks like a string of bottlenecks. We can calculate the probability of transition from one step to the next based on the recorded history of transitions. At the same time, customer journeys are sequences of sessions (visits), and these sessions are attributed to different marketing channels.

    Therefore, we can link marketing channels with the probability of a customer passing through each step of the Sales Funnel. And here is the core idea of the concept: the probability of moving through each “bottleneck” represents the value of the marketing channel that leads a customer through it. The higher the probability of passing a “neck”, the lower the value of the channel that provided the transition; and vice versa, the lower the probability, the higher the value of the marketing channel in question.
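As a toy numeric illustration of this idea (the step-to-step probabilities below are invented), the channel values would be computed like this:

    # made-up probabilities of passing each funnel step
    p_step <- c(site_visit = 1.00, two_pages = 0.20, product_page = 0.25,
                add_to_cart = 0.90, purchase = 0.98)

    importance <- 1 - p_step                            # harder steps carry more value
    importance_weighted <- importance / sum(importance) # rescale so the weights sum to 1
    round(importance_weighted, 2)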

    Let’s study the concept with the following example. First off, we’ll define the Sales Funnel and a set of conditions which will register as customer passing through each step of the Funnel.

    • 0 step (necessary condition) – customer visits a site for the first time
    • 1st step (awareness) – visits two site’s pages
    • 2nd step (interest) – reviews a product page
    • 3rd step (desire) –  adds a product to the shopping cart
    • 4th step (action) – completes purchase

    Second, we need to extract the data that includes sessions where corresponding events occurred. We’ll simulate this data with the following code:

    library(tidyverse)
    library(purrrlyr)
    library(reshape2)

    ##### simulating the "real" data #####
    set.seed(454)
    df_raw <- data.frame(customer_id = paste0('id', sample(c(1:5000), replace = TRUE)),
                         date = as.POSIXct(rbeta(10000, 0.7, 10) * 10000000, origin = '2017-01-01', tz = "UTC"),
                         channel = paste0('channel_', sample(c(0:7), 10000, replace = TRUE,
                                                             prob = c(0.2, 0.12, 0.03, 0.07, 0.15, 0.25, 0.1, 0.08))),
                         site_visit = 1) %>%
            mutate(two_pages_visit = sample(c(0,1), 10000, replace = TRUE, prob = c(0.8, 0.2)),
                   product_page_visit = ifelse(two_pages_visit == 1,
                                               sample(c(0, 1),
                                                      length(two_pages_visit[which(two_pages_visit == 1)]),
                                                      replace = TRUE, prob = c(0.75, 0.25)),
                                               0),
                   add_to_cart = ifelse(product_page_visit == 1,
                                        sample(c(0, 1),
                                               length(product_page_visit[which(product_page_visit == 1)]),
                                               replace = TRUE, prob = c(0.1, 0.9)),
                                        0),
                   purchase = ifelse(add_to_cart == 1,
                                     sample(c(0, 1),
                                            length(add_to_cart[which(add_to_cart == 1)]),
                                            replace = TRUE, prob = c(0.02, 0.98)),
                                     0)) %>%
            dmap_at(c('customer_id', 'channel'), as.character) %>%
            arrange(date) %>%
            mutate(session_id = row_number()) %>%
            arrange(customer_id, session_id)

    df_raw <- melt(df_raw, id.vars = c('customer_id', 'date', 'channel', 'session_id'),
                   value.name = 'trigger', variable.name = 'event') %>%
            filter(trigger == 1) %>%
            select(-trigger) %>%
            arrange(customer_id, date)

    And the data sample looks like:

    Next up, the data needs to be preprocessed. For example, it would be useful to replace NA/direct channel with the previous one or separate first-time purchasers from current customers, or even create different Sales Funnels based on new and current customers, segments, locations and so on. I will omit this step but you can find some ideas on preprocessing in my previous blogpost.

    The important thing about this approach is that we only attribute the initial marketing channel, the one that first led the customer through a given step. For instance, a customer initially reviews a product page (step 2, interest) and is brought there by channel_1. That means any future product page visits from other channels won't be attributed until the customer makes a purchase and starts a new Sales Funnel journey.

    Therefore, we will filter records for each customer and save the first unique event of each step of the Sales Funnel using the following code:

    ### removing not first events ###
    df_customers <- df_raw %>%
            group_by(customer_id, event) %>%
            filter(date == min(date)) %>%
            ungroup()

    Note that in this way we assume that all customers were first-time buyers; therefore, every subsequent purchase event will be removed by the code above.

    Now we can use the obtained data frame to compute the Sales Funnel's transition probabilities, the importance of each Sales Funnel step, and their weighted importance. According to the method, the higher the probability, the lower the value of the channel. Therefore, we will calculate the importance of each step as 1 minus its transition probability. After that, we need to weight the importances because their sum will be higher than 1. We will do these calculations with the following code:

    ### Sales Funnel probabilities ###
    sf_probs <- df_customers %>%
            group_by(event) %>%
            summarise(customers_on_step = n()) %>%
            ungroup() %>%
            mutate(sf_probs = round(customers_on_step / customers_on_step[event == 'site_visit'], 3),
                   sf_probs_step = round(customers_on_step / lag(customers_on_step), 3),
                   sf_probs_step = ifelse(is.na(sf_probs_step) == TRUE, 1, sf_probs_step),
                   sf_importance = 1 - sf_probs_step,
                   sf_importance_weighted = sf_importance / sum(sf_importance)
            )

    A hint: it can be a good idea to compute the Sales Funnel probabilities over a limited prior period, for example 1-3 months. The reason is that the customer flow, or the capacity of the "necks", could vary due to changes on a company's site, changes in marketing campaigns and so on. Therefore, you can analyze the dynamics of the Sales Funnel's transition probabilities in order to find the appropriate time period.

    I can't publish a blog post without a visualization. This time I suggest another approach to Sales Funnel visualization, one that represents all customer journeys through the Funnel, produced with the following code:

    ### Sales Funnel visualization ###
    df_customers_plot <- df_customers %>%
            group_by(event) %>%
            arrange(channel) %>%
            mutate(pl = row_number()) %>%
            ungroup() %>%
            mutate(pl_new = case_when(
                    event == 'two_pages_visit' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'two_pages_visit'])) / 2),
                    event == 'product_page_visit' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'product_page_visit'])) / 2),
                    event == 'add_to_cart' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'add_to_cart'])) / 2),
                    event == 'purchase' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'purchase'])) / 2),
                    TRUE ~ 0
            ),
            pl = pl + pl_new)

    df_customers_plot$event <- factor(df_customers_plot$event,
                                      levels = c('purchase',
                                                 'add_to_cart',
                                                 'product_page_visit',
                                                 'two_pages_visit',
                                                 'site_visit'
                                      ))

    # color palette
    cols <- c('#4e79a7', '#f28e2b', '#e15759', '#76b7b2', '#59a14f',
              '#edc948', '#b07aa1', '#ff9da7', '#9c755f', '#bab0ac')

    ggplot(df_customers_plot, aes(x = event, y = pl)) +
            theme_minimal() +
            scale_colour_manual(values = cols) +
            coord_flip() +
            geom_line(aes(group = customer_id, color = as.factor(channel)), size = 0.05) +
            geom_text(data = sf_probs,
                      aes(x = event, y = 1, label = paste0(sf_probs*100, '%')),
                      size = 4, fontface = 'bold') +
            guides(color = guide_legend(override.aes = list(size = 2))) +
            theme(legend.position = 'bottom',
                  legend.direction = "horizontal",
                  panel.grid.major.x = element_blank(),
                  panel.grid.minor = element_blank(),
                  plot.title = element_text(size = 20, face = "bold", vjust = 2, color = 'black', lineheight = 0.8),
                  axis.title.y = element_text(size = 16, face = "bold"),
                  axis.title.x = element_blank(),
                  axis.text.x = element_blank(),
                  axis.text.y = element_text(size = 8, angle = 90, hjust = 0.5, vjust = 0.5, face = "plain")) +
            ggtitle("Sales Funnel visualization - all customers journeys")

    OK, it seems we now have everything to make the final calculations. In the following code, we will remove all users who didn't make a purchase. Then we'll link the weighted importances of the Sales Funnel steps with sessions by event and, finally, summarize them.

    ### computing attribution ###
    df_attrib <- df_customers %>%
            # removing customers without purchase
            group_by(customer_id) %>%
            filter(any(as.character(event) == 'purchase')) %>%
            ungroup() %>%
            # joining step's importances
            left_join(., sf_probs %>% select(event, sf_importance_weighted), by = 'event') %>%
            group_by(channel) %>%
            summarise(tot_attribution = sum(sf_importance_weighted)) %>%
            ungroup()

    As a result, we've obtained the number of conversions distributed across the marketing channels:

    In the same way, you can distribute revenue by channel, as sketched below.
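A hedged sketch of that revenue version, assuming a hypothetical df_purchases data frame with customer_id and revenue columns, could look like:

    ### distributing revenue instead of conversions (sketch) ###
    df_revenue_attrib <- df_customers %>%
            group_by(customer_id) %>%
            filter(any(as.character(event) == 'purchase')) %>%
            ungroup() %>%
            left_join(sf_probs %>% select(event, sf_importance_weighted), by = 'event') %>%
            left_join(df_purchases, by = 'customer_id') %>% # hypothetical revenue data
            group_by(channel) %>%
            summarise(attributed_revenue = sum(sf_importance_weighted * revenue)) %>%
            ungroup()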

    At the end of the article, I want to share OWOX company’s blog where you can read more about the approach: Funnel Based Attribution Model.

    In addition, OWOX provides an automated system for Marketing Multi-Channel Attribution based on BigQuery. Therefore, if you are not familiar with R or don't have a suitable data warehouse, I recommend testing their service.


    The post Marketing Multi-Channel Attribution model based on Sales Funnel with R appeared first on AnalyzeCore – data is beautiful, data is a story.


    RcppZiggurat 0.1.4

    Thu, 09/28/2017 - 04:06

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A maintenance release of RcppZiggurat is now on the CRAN network for R. It switched the vignette to our new pinp package and its two-column pdf default.

    The RcppZiggurat package updates the code for the Ziggurat generator, which provides very fast draws from a Normal distribution. The package provides a simple C++ wrapper class for the generator, improving on the very basic macros, and permits comparison among several existing Ziggurat implementations. This can be seen in the figure, where the Ziggurat implementation from this package dominates those accessed from the GSL, QuantLib and Gretl, all of which are still far faster than the default Normal generator in R (which is, of course, of higher code complexity).
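As a quick hedged sketch of the basic interface (function names as documented in the package):

    library(RcppZiggurat)

    zsetseed(42)      # seed the Ziggurat generator (independent of set.seed)
    x <- zrnorm(1e6)  # fast standard-normal draws
    c(mean = mean(x), sd = sd(x))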

    The NEWS file entry below lists all changes.

    Changes in version 0.1.4 (2017-07-27)
    • The vignette now uses the pinp package in two-column mode.

    • Dynamic symbol registration is now enabled.

    Courtesy of CRANberries, there is also a diffstat report for the most recent release. More information is on the RcppZiggurat page.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

