R news and tutorials contributed by (750) R bloggers

JAGS 4.3.0 is released

Tue, 07/18/2017 - 23:13

(This article was first published on R – JAGS News, and kindly contributed to R-bloggers)

The source tarball for JAGS 4.3.0 is now available from Sourceforge. Binary distributions will be available later. See the updated manual for details of the features in the new version. This version is fully compatible with the current version of rjags on CRAN.


To leave a comment for the author, please follow the link and comment on their blog: R – JAGS News.

Securely store API keys in R scripts with the "secret" package

Tue, 07/18/2017 - 20:40

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you use an API key to access a secure service, or need to use a password to access a protected database, you'll need to provide these "secrets" in your R code somewhere. That's easy to do if you just include those keys as strings in your code — but it's not very secure. This means your private keys and passwords are stored in plain-text on your hard drive, and if you email your script they're available to anyone who can intercept that email. It's also really easy to inadvertently include those keys in a public repo if you use Github or similar code-sharing services.

To address this problem, Gábor Csárdi and Andrie de Vries created the secret package for R. The secret package integrates with OpenSSH, providing R functions that allow you to create a vault for your keys on your local machine, define trusted users who can access those keys, and then include encrypted keys in R scripts or packages that can only be decrypted by you or by people you trust. You can see how it works in the vignette secret: Share Sensitive Information in R Packages, and in this presentation by Andrie de Vries at useR!2017:

 

To use the secret package, you'll need access to your private key, which you'll also need to store securely. For that, you might also want to take a look at the in-progress keyring package, which allows you to access secrets stored in Keychain on macOS, Credential Store on Windows, and the Secret Service API on Linux.

The secret package is available now on CRAN, and you can also find the latest development version on Github.
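As a rough idea of the workflow, here is a minimal sketch based on the package vignette; the exact argument names, key paths and e-mail address below are illustrative assumptions rather than a definitive API reference.

# Sketch of a secret workflow (assumed arguments; see the vignette for details)
library(secret)

# Create a vault inside a project directory
vault <- file.path(tempdir(), "vault")
create_vault(vault)

# Add yourself as a user, identified by your public key (path is an assumption)
key_dir <- file.path(Sys.getenv("HOME"), ".ssh")
add_user("you@example.com",
         public_key = file.path(key_dir, "id_rsa.pub"),
         vault = vault)

# Store a secret that only the listed users can decrypt
add_secret("db_password",
           value = "s3cr3t!",
           users = "you@example.com",
           vault = vault)

# Later, in a script, decrypt it with your private key
get_secret("db_password",
           key = file.path(key_dir, "id_rsa"),
           vault = vault)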

 


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Multiple Factor Analysis to analyse several data tables

Tue, 07/18/2017 - 20:21

(This article was first published on François Husson, and kindly contributed to R-bloggers)

How can you take into account and compare information from different sources? Multiple Factor Analysis is a principal component method that deals with datasets containing quantitative and/or categorical variables structured by groups.

Here is a course with videos that present the method named Multiple Factor Analysis.

Multiple Factor Analysis (MFA) allows you to study complex data tables, where a group of individuals is characterized by variables structured as groups, possibly coming from different information sources. Our interest in the method lies in its ability to analyze a data table as a whole, as well as to compare the information provided by the various sources.

Four videos present a course on MFA, highlighting the way to interpret the data. Then you will find videos presenting how to implement MFA in FactoMineR.

With this course, you will be able to perform MFA on your own and interpret the results you obtain.
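For orientation before watching the videos, here is a rough sketch of an MFA call in FactoMineR, using the wine dataset shipped with the package; the group sizes, types and names below follow the package documentation and should be treated as assumptions to adapt to your own data.

# Sketch of MFA on the wine dataset bundled with FactoMineR
library(FactoMineR)

data(wine)
res.mfa <- MFA(wine,
               group = c(2, 5, 3, 10, 9, 2),          # number of variables in each group
               type = c("n", rep("s", 5)),            # 1st group categorical, others scaled
               ncp = 5,                               # number of components kept
               name.group = c("origin", "odor", "visual",
                              "odor.after.shaking", "taste", "overall"),
               num.group.sup = c(1, 6))               # supplementary groups

summary(res.mfa)
plot(res.mfa, choix = "group")   # compare the different information sources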

 

 


To leave a comment for the author, please follow the link and comment on their blog: François Husson.

Animating a spinner using ggplot2 and ImageMagick

Tue, 07/18/2017 - 20:00

(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

It’s Sunday, and I [Bob] am just sitting on the couch peacefully ggplotting to illustrate basic sample spaces using spinners (a trick I’m borrowing from Jim Albert’s book Curve Ball). There’s an underlying continuous outcome (i.e., where the spinner lands) and a quantization into a number of regions to produce a discrete outcome (e.g., “success” and “failure”). I’m quite pleased with myself for being able to use polar coordinates to create the spinner and arrow. ggplot works surprisingly well in polar coordinates once you figure them out; almost everything people have said about them online is confused and the doc itself assumes you’re a bit more of a ggplotter and geometer than me.

I’m so pleased with it that I show the plot to Mitzi. She replies, “Why don’t you animate it?” I don’t immediately say, “What a waste of time,” then get back to what I’m doing. Instead, I boast, “It’ll be done when you get back from your run.” Luckily for me, she goes for long runs—I just barely had the prototype working as she got home. And then I had to polish it and turn it into a blog post. So here it is, for your wonder and amazement.



Here’s the R magic.

library(ggplot2)

draw_curve <- function(angle) {
  df <- data.frame(outcome = c("success", "failure"), prob = c(0.3, 0.7))
  plot <-
    ggplot(data = df, aes(x = factor(1), y = prob, fill = outcome)) +
    geom_bar(stat = "identity", position = "fill") +
    coord_polar(theta = "y", start = 0, direction = 1) +
    scale_y_continuous(breaks = c(0.12, 0.7), labels = c("success", "failure")) +
    geom_segment(aes(y = angle/360, yend = angle/360, x = -1, xend = 1.4),
                 arrow = arrow(type = "closed"), size = 1) +
    theme(axis.title = element_blank(),
          axis.ticks = element_blank(),
          axis.text.y = element_blank()) +
    theme(panel.grid = element_blank(),
          panel.border = element_blank()) +
    theme(legend.position = "none") +
    geom_point(aes(x = -1, y = 0), color = "#666666", size = 5)
  return(plot)
}

ds <- c()
pos <- 0
for (i in 1:66) {
  pos <- (pos + (67 - i)) %% 360
  ds[i] <- pos
}
ds <- c(rep(0, 10), ds)
ds <- c(ds, rep(ds[length(ds)], 10))

for (i in 1:length(ds)) {
  ggsave(filename = paste("frame", ifelse(i < 10, "0", ""), i, ".png", sep = ""),
         plot = draw_curve(ds[i]),
         device = "png", width = 4.5, height = 4.5)
}

I probably should've combined theme functions. Ben would've been able to define ds in a one-liner and then map ggsave. I hope it's at least clear what my code does (just decrements the number of degrees moved each frame by one---no physics involved).

After producing the frames in alphabetical order (all that ifelse and paste mumbo-jumbo), I went to the output directory and ran the results through ImageMagick (which I'd previously installed on my now ancient Macbook Pro) from the terminal, using

> convert *.png -delay 3 -loop 0 spin.gif

That took a minute or two. Each of the pngs is about 100KB, but the final output is only 2.5MB or so. Maybe I should've gone with less delay (I don't even know what the units are!) and fewer rotations and maybe a slower final slowing down (maybe study the physics). How do the folks at Pixar ever get anything done?

P.S. I can no longer get the animation package to work in R, though it used to work in the past. It just wraps up those calls to ImageMagick.

P.P.S. That salmon and teal color scheme is the default!

The post Animating a spinner using ggplot2 and ImageMagick appeared first on Statistical Modeling, Causal Inference, and Social Science.


To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science.

Plants

Tue, 07/18/2017 - 19:14

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Blue dragonflies dart to and fro
I tie my life to your balloon and let it go
(Warm Foothills, Alt-J)

In my last post I did some drawings based on L-Systems. These drawings are done sequentially. At any step, the state of the drawing can be described by the position (coordinates) and the orientation of the pencil. In that case I only used two kinds of operators: drawing a straight line and turning a constant angle. Today I used two more symbols to do stack operations:

  • "[" Push the current state (position and orientation) of the pencil onto a pushdown operations stack
  • "]" Pop a state from the stack and make it the current state of the pencil (no line is drawn)

These operators allow you to return to a previous state and continue drawing from there. Using them you can draw plants like these:

Each image corresponds to a different set of axiom, rules, angle and depth parameters; I described these terms in my previous post. If you want to reproduce the plants, you can find the code below. Change colors, add noise to angles, try your own plants … I am sure you will find nice images:

library(gsubfn)
library(stringr)
library(dplyr)
library(ggplot2)

# Plant 1
axiom = "F"
rules = list("F" = "FF-[-F+F+F]+[+F-F-F]")
angle = 22.5
depth = 4

# Plant 2
axiom = "X"
rules = list("X" = "F[+X][-X]FX", "F" = "FF")
angle = 25.7
depth = 7

# Plant 3
axiom = "X"
rules = list("X" = "F[+X]F[-X]+X", "F" = "FF")
angle = 20
depth = 7

# Plant 4
axiom = "X"
rules = list("X" = "F-[[X]+X]+F[+FX]-X", "F" = "FF")
angle = 22.5
depth = 5

# Plant 5
axiom = "F"
rules = list("F" = "F[+F]F[-F]F")
angle = 25.7
depth = 5

# Plant 6
axiom = "F"
rules = list("F" = "F[+F]F[-F][F]")
angle = 20
depth = 5

for (i in 1:depth) axiom = gsubfn(".", rules, axiom)

actions = str_extract_all(axiom, "\\d*\\+|\\d*\\-|F|L|R|\\[|\\]|\\|") %>% unlist

status = data.frame(x = numeric(0), y = numeric(0), alfa = numeric(0))
points = data.frame(x1 = 0, y1 = 0, x2 = NA, y2 = NA, alfa = 90, depth = 1)

for (action in actions) {
  if (action == "F") {
    x = points[1, "x1"] + cos(points[1, "alfa"] * (pi/180))
    y = points[1, "y1"] + sin(points[1, "alfa"] * (pi/180))
    points[1, "x2"] = x
    points[1, "y2"] = y
    data.frame(x1 = x, y1 = y, x2 = NA, y2 = NA,
               alfa = points[1, "alfa"],
               depth = points[1, "depth"]) %>%
      rbind(points) -> points
  }
  if (action %in% c("+", "-")) {
    alfa = points[1, "alfa"]
    points[1, "alfa"] = eval(parse(text = paste0("alfa", action, angle)))
  }
  if (action == "[") {
    data.frame(x = points[1, "x1"], y = points[1, "y1"], alfa = points[1, "alfa"]) %>%
      rbind(status) -> status
    points[1, "depth"] = points[1, "depth"] + 1
  }
  if (action == "]") {
    depth = points[1, "depth"]
    points[-1, ] -> points
    data.frame(x1 = status[1, "x"], y1 = status[1, "y"], x2 = NA, y2 = NA,
               alfa = status[1, "alfa"], depth = depth - 1) %>%
      rbind(points) -> points
    status[-1, ] -> status
  }
}

ggplot() +
  geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2),
               lineend = "round", colour = "white",
               data = na.omit(points)) +
  coord_fixed(ratio = 1) +
  theme(legend.position = "none",
        panel.background = element_rect(fill = "black"),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        axis.text = element_blank())

To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin.

Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-3)

Tue, 07/18/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Statistics are often taught in school by and for people who like mathematics. As a consequence, in those classes the emphasis is put on learning equations, solving calculus problems and creating mathematical models instead of building an intuition for probabilistic problems. But, if you are reading this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistical problems by writing simple R code and how to think in probabilistic terms.

In the first two parts of this series, we’ve seen how to identify the distribution of a random variable by plotting the distribution of a sample and by estimating statistics. We also saw that it can be tricky to identify a distribution from a small sample of data. Today, we’ll see how to estimate the confidence interval of a statistic in this situation by using a powerful method called bootstrapping.

Answers to the exercises are available here.

Exercise 1
Load this dataset and draw the histogram, the ECDF of this sample and the ECDF of a density that is a good fit for the data.

Exercise 2
Write a function that takes a dataset and a number of iterations as parameters. For each iteration this function must create a sample with replacement of the same size as the dataset, calculate the mean of the sample and store it in a matrix, which the function must return.
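One possible sketch of such a function (an illustration, not the official solution linked above) could look like this:

# Sketch: draw `n_iter` resamples with replacement and store each mean
# in a one-column matrix (function name is hypothetical)
boot_mean <- function(data, n_iter) {
  means <- matrix(NA_real_, nrow = n_iter, ncol = 1)
  for (i in seq_len(n_iter)) {
    resample <- sample(data, size = length(data), replace = TRUE)
    means[i, 1] <- mean(resample)
  }
  means
}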

Exercise 3
Use the t.test() to compute the 95% confidence interval estimate for the mean of your dataset.

Learn more about bootstrapping functions in the online course Structural equation modeling (SEM) with lavaan. In this course you will learn how to:

  • Develop bootstrapped confidence intervals
  • Go in depth into the lavaan package for modelling equations
  • And much more

Exercise 4
Use the function you just wrote to estimate the mean of your sample 10,000 times. Then draw the histogram of the results and the sampling mean of the data.

The probability distribution of the estimate of a mean is a normal distribution centered around the real value of the mean. In other words, if we take a lot of samples from a population and compute the mean of each sample, the histogram of those means will look like that of a normal distribution centered around the real value of the mean we are trying to estimate. We have recreated this process artificially by creating a bunch of new samples from the dataset, resampling it with replacement, and now we can make a point estimate of the mean by computing the average of the sample of means, or compute the confidence interval by finding the correct percentiles of this distribution. This process is basically what is called bootstrapping.
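In code, that idea boils down to something like the following sketch, where boot_means stands for the hypothetical output of the function sketched under Exercise 2:

# `boot_means` is assumed to be the matrix/vector of bootstrap means
mean(boot_means)                         # point estimate of the mean
quantile(boot_means, c(0.025, 0.975))    # 95% percentile confidence interval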

Exercise 5
Calculate the 2.5th and 97.5th percentiles of your sample of 10,000 estimates of the mean, as well as the mean of this sample. Compare this last value to the sample mean of your data.

Exercise 6
Bootstrapping can be used to compute the confidence interval of any statistic of interest, but you don’t have to write a function for each of them! You can use the boot() function from the library of the same name and pass the statistic as an argument to compute the bootstrapped sample. Use this function with 10,000 replicates to compute the median of the dataset.
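For reference, a call to boot() typically looks like the sketch below; my_data is a placeholder for the dataset loaded in Exercise 1.

# Sketch of a boot() call; `my_data` is a placeholder object
library(boot)

med_fun <- function(data, indices) median(data[indices])

boot_median <- boot(data = my_data, statistic = med_fun, R = 10000)
boot_median                           # point estimate, bias and std. error
boot.ci(boot_median, type = "perc")   # percentile confidence interval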

Exercise 7
Look at the structure of your result and plot its histogram. On the same plot, draw the value of the sample median of your dataset and show the 95% confidence interval of this statistic by adding two vertical green lines at the lower and upper bounds of the interval.

Exercise 8
Write functions to compute by bootstrapping the following statistics:

  • Variance
  • kurtosis
  • Max
  • Min

Exercise 9
Use the functions from last exercise and the boot function with 10,000 replicates to compute the following statistics:

  • Variance
  • kurtosis
  • Max
  • Min

Then draw the histogram of the bootstrapped sample and plot the 95% confidence interval of the statistics.

Exercise 10
Generate 1000 points from a normal distribution of mean and standard deviation equal to the one of the dataset. Use the bootstrap method to estimate the 95% confidence interval of the mean, the variance, the kurtosis, the min and the max of this density. Then plot the histograms of the bootstrap samples for each of the variable and draw the 95% confidence interval as two red vertical line.

Two bootstrap estimates of the same statistic from two samples drawn from the same density should be pretty similar. When we compare those last plots with the confidence intervals we drew before, we see that they are. More importantly, the confidence intervals computed in exercise 10 overlap the confidence intervals of the statistics of the first dataset. As a consequence, we can’t conclude that the two samples come from different distributions, and in practice we could use a normal distribution with a mean of 0.4725156 and a standard deviation of 1.306665 to simulate this random variable.

Related exercise sets:
  1. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-2)
  2. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-1)
  3. Data Science for Doctors – Part 3 : Distributions
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Multiple Correspondence Analysis with FactoMineR

Tue, 07/18/2017 - 13:06

(This article was first published on François Husson, and kindly contributed to R-bloggers)

Here is a course with videos that present Multiple Correspondence Analysis in a French way. The most well-known use of Multiple Correspondence Analysis is the analysis of surveys.

Four videos present a course on MCA, highlighting the way to interpret the data. Then you will find videos presenting how to implement MCA in FactoMineR, how to deal with missing values in MCA thanks to the package missMDA, and lastly a video on drawing interactive graphs with Factoshiny. Finally, you will see that the new package FactoInvestigate allows you to automatically obtain an interpretation of your MCA results.

With this course, you will be able to perform MCA on your own and interpret the results you obtain.
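As a starting point, here is a rough sketch of an MCA call in FactoMineR on the tea survey data shipped with the package; the supplementary-variable indices follow the package documentation and are assumptions to adapt to your own survey.

# Sketch of MCA on the tea survey bundled with FactoMineR
library(FactoMineR)

data(tea)
res.mca <- MCA(tea,
               quanti.sup = 19,       # age as a supplementary quantitative variable
               quali.sup = 20:36)     # descriptive questions as supplementary variables

summary(res.mca)
plot(res.mca, invisible = "ind")      # focus on the categories rather than individuals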

 

For more information, you can see the book below. Here are some reviews of the book and a link to order it.


To leave a comment for the author, please follow the link and comment on their blog: François Husson.

constants 0.0.1

Tue, 07/18/2017 - 10:00

(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)

The new constants package is available on CRAN. This small package provides the CODATA 2014 internationally recommended values of the fundamental physical constants (universal, electromagnetic, physicochemical, atomic…), provided as symbols for direct use within the R language. Optionally, the values with errors and/or the values with units are also provided if the errors and/or the units packages are installed as well.

But, what is CODATA? The Committee on Data for Science and Technology (CODATA) is an interdisciplinary committee of the International Council for Science. The Task Group on Fundamental Constants periodically provides the internationally accepted set of values of the fundamental physical constants. The version currently in force is the “2014 CODATA”, published on 25 June 2015.

This package wraps the codata dataset, defines unique symbols for each one of the 237 constants, and provides them enclosed in three sets of symbols: syms, syms_with_errors and syms_with_units.

library(constants)

# the speed of light
with(syms, c0)
## [1] 299792458

# explore which constants are available
lookup("planck constant", ignore.case=TRUE)
##                  quantity  symbol            value      unit
## 7         Planck constant       h  6.626070040e-34       J s
## 8         Planck constant    h_eV  4.135667662e-15      eV s
## 9         Planck constant    hbar         h/(2*pi)       J s
## 10        Planck constant hbar_eV      h_eV/(2*pi)      eV s
## 11        Planck constant hbar.c0      197.3269788    MeV fm
## 212 molar Planck constant    Na.h 3.9903127110e-10  J s mol-1
## 213 molar Planck constant Na.h.c0   0.119626565582  J m mol-1
##     rel_uncertainty            type
## 7           1.2e-08       universal
## 8           6.1e-09       universal
## 9           1.2e-08       universal
## 10          6.1e-09       universal
## 11          6.1e-09       universal
## 212         4.5e-10 physicochemical
## 213         4.5e-10 physicochemical

# symbols can also be attached to the search path
attach(syms)
# the Planck constant
hbar
## [1] 1.054572e-34

If the errors/units package is installed in your system, constants with errors/units are available:

attach(syms_with_errors)
# the Planck constant with error
hbar
## 1.05457180(1)e-34

attach(syms_with_units)
# the Planck constant with units
hbar
## 1.054572e-34 J*s

The dataset is available for lazy loading:

data(codata)
head(codata)
##                             quantity    symbol        value        unit
## 1           speed of light in vacuum        c0    299792458       m s-1
## 2                  magnetic constant       mu0    4*pi*1e-7       N A-2
## 3                  electric constant  epsilon0 1/(mu0*c0^2)       F m-1
## 4 characteristic impedance of vacuum        Z0       mu0*c0           Ω
## 5  Newtonian constant of gravitation         G  6.67408e-11 m3 kg-1 s-2
## 6  Newtonian constant of gravitation G_hbar.c0  6.70861e-39    GeV-2 c4
##   rel_uncertainty      type
## 1         0.0e+00 universal
## 2         0.0e+00 universal
## 3         0.0e+00 universal
## 4         0.0e+00 universal
## 5         4.7e-05 universal
## 6         4.7e-05 universal

dplyr::count(codata, type, sort=TRUE)
## # A tibble: 15 x 2
##                          type     n
##                         <chr> <int>
##  1    atomic-nuclear-electron    31
##  2      atomic-nuclear-proton    26
##  3     atomic-nuclear-neutron    24
##  4            physicochemical    24
##  5      atomic-nuclear-helion    18
##  6        atomic-nuclear-muon    17
##  7            electromagnetic    17
##  8                  universal    16
##  9    atomic-nuclear-deuteron    15
## 10     atomic-nuclear-general    11
## 11         atomic-nuclear-tau    11
## 12      atomic-nuclear-triton    11
## 13                    adopted     7
## 14       atomic-nuclear-alpha     7
## 15 atomic-nuclear-electroweak     2

Article originally published in Enchufa2.es: constants 0.0.1.


To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2.

The Value of #Welcome

Tue, 07/18/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

I’m participating in the AAAS Community Engagement Fellows Program (CEFP), funded by the Alfred P. Sloan Foundation. The inaugural cohort of Fellows is made up of 17 community managers working in a wide range of scientific communities. This is cross-posted from the Trellis blog as part of a series of reflections that the CEFP Fellows are sharing.

In my training as a AAAS Community Engagement Fellow, I hear repeatedly about the value of extending a personal welcome to your community members. This seems intuitive, but recently I put this to the test. Let me tell you about my experience creating and maintaining a #welcome channel in a community Slack group.

"Welcome" by Nathan under CC BY-SA 2.0

I listen in on and occasionally participate in a Slack group for R-Ladies community organizers (R-Ladies is a global organization with local meetup chapters around the world, for women who do/want to do programming in R). Their Slack is incredibly well-organized and has a #welcome channel where new joiners are invited to introduce themselves in a couple of sentences. The leaders regularly jump in to add a wave emoji and ask people to introduce themselves if they have not already.

At rOpenSci, where I am the Community Manager, when people joined our 150+ person Slack group, they used to land in the #general channel where people ask and answer questions. Often, new people joining went unnoticed among the conversations. So recently I copied R-Ladies and created a #welcome channel in our Slack and made sure any new people got dropped in there, as well as in the #general channel. The channel purpose is set as "A place to welcome new people and for new people to introduce themselves. We welcome participation and civil conversations that adhere to our code of conduct."

I pinged three new rOpenSci community members to join and introduce themselves, and in the #general channel said “Hey, come say hi to people over here at #welcome!”. One day later, we already had 33 people in #welcome. I spent that morning nurturing it, noting some people’s activities and contributions that they might not otherwise highlight themselves e.g. Bea just did her first open peer software review for rOpenSci, or Matt has been answering people’s questions about meta data, or Julie, Ben and Jamie are co-authors on this cool new paper about open data science tools. And I gave a shoutout to R-Ladies stating clearly that I copied their fantastic #welcome channel.

People are introducing themselves, tagging with emoji and thanking “the community” saying things like:

“…I feel super lucky to be a part of the rOpenSci community, which has … had a great positive impact on my life!”

“I love rOpenSci for building this community and helping people like me become confident data scientists!”

“[I’m] hoping to contribute more to the group moving forward”

“…thank you for having me part of this community!”

Such is the value of #welcome.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.

Automatically Fitting the Support Vector Machine Cost Parameter

Tue, 07/18/2017 - 07:32

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

In an earlier post I discussed how to avoid overfitting when using Support Vector Machines. This was achieved using cross validation. In cross validation, prediction accuracy is maximized by varying the cost parameter. Importantly, prediction accuracy is calculated on a different subset of the data from that used for training.

In this blog post I take that concept a step further, by automating the manual search for the optimal cost.

The data set I’ll be using describes different types of glass based upon physical attributes and chemical composition.  You can read more about the data here, but for the purposes of my analysis all you need to know is that the outcome variable is categorical (7 types of glass) and the 4 predictor variables are numeric.

Creating the base support vector machine model

I start, as in my earlier analysis, by splitting the data into a larger 70% training sample and a smaller 30% testing sample. Then I train a support vector machine on the training sample with the following code:

library(flipMultivariates)

svm = SupportVectorMachine(Type ~ RefractiveIndex + Ca + Ba + Fe,
                           subset = training,
                           cost = 1)

This produces output as shown below. There are 2 reasons why we can largely disregard the 64.67% accuracy:

  1. We used the training data (and not the independent testing data) to calculate accuracy.
  2. We have used a default value for the cost of 1 and not attempted to optimize.

Amending the R code

I am going to amend the code above in order to loop over a range of values of cost. For each value, I will calculate the accuracy on the test sample. The updated code is as follows:

library(flipMultivariates)
library(flipRegression)

costs = c(0.1, 1, 10, 100, 1000, 10000)
i = 1
accuracies = rep(0, length(costs))

for (cost in costs) {
    svm = SupportVectorMachine(Type ~ RefractiveIndex + Ca + Ba + Fe,
                               subset = training,
                               cost = cost)
    accuracies[i] = attr(ConfusionMatrix(svm, subset = (testing == 1)), "accuracy")
    i = i + 1
}

plot(costs, accuracies, type = "l", log = "x")

The first 5 lines set things up. I load libraries required to run the Support Vector Machine and calculate the accuracy. Next I choose a range of costs, initialize a loop counter i and an empty vector accuracies, where I store the results.

Then I add a loop around the code that created the base model to iterate over costs. The next line calculates and stores the accuracy on the testing sample. Finally, I plot the results, which tell me that the greatest accuracy appears around a cost of 100. This allows us to go back and update costs to a more granular range around this value.

Re-running the code again using the new costs (10, 20, 50, 75, 100, 150, 200, 300, 500, 1000) I get the final chart shown below. This indicates that a cost of 50 gives best performance.

TRY IT OUT
The analysis in this post used R in Displayr. The flipMultivariates package (available on GitHub), which uses the e1071 package, performed the calculations. You can try automatically fitting the Support Vector Machine Cost Parameter yourself using the data in this example. 


To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.

R Programming Notes – Part 2

Tue, 07/18/2017 - 06:34

In an older post, I discussed a number of functions that are useful for programming in R. I wanted to expand on that topic by covering other functions, packages, and tools that are useful. Over the past year, I have been working as an R programmer and these are some of the new learnings that have become fundamental in my work.

IS TRUE and IS FALSE

isTRUE is a logical operator that can be very useful in checking whether a condition or variable has been set to true. Let's say that we are writing a script that will run a generalized linear regression when the parameter run_mod is set to true. The conditional portion of the script can be written as either if(isTRUE(run_mod)) or if(run_mod). I am partial to isTRUE, but this is entirely a matter of personal preference. Users should also be aware of the isFALSE function, which is part of the BBmisc package.

run_mod = TRUE

if (isTRUE(run_mod)) {
  tryCatch(
    GLM_Model(full_data = full.df, train_data = train.df, test_data = test.df),
    error = function(e) {
      print("error occurred")
      print(e)
    })
}

if (BBmisc::isFALSE(check_one) & BBmisc::isFALSE(check_two)) {
  data_output.tmp$score = 0.8
}

INVISIBLE

The invisible function can be used to return an object that is not printed out and can be useful in a number of circumstances. For example, it’s useful when you have helper functions that will be utilized within other functions to do calculations. In those cases, it’s often not desirable to print those results. I generally use invisible when I’m checking function arguments. For example, consider a function that takes two arguments and you need to check whether the input is a vector.

if (!check_input(response, type = 'character', length = 1)) {
  stop('something is wrong')
}

The check_input function is something I created and has a few lines which contain invisible. The idea is for check_input to return true or false based on the inputs so that it'll stop the execution when needed.

if (is.null(response) & !length(response) == 0) {
  return(FALSE)
} else if (!is.null(response)) {
  return(invisible(TRUE))
}

DEBUG

When I’m putting together new classes or have multiple functions that interact with one another, I ensure that the code includes a comprehensive debugging process. This means that I’m checking my code at various stages so that I can identify when issues arise. Consider that I’m putting together a function that will go through a number of columns in a data frame, summarize those variables, and save the results as a nested list. To effectively put together code without issues, I ensure that the function takes a debug argument that triggers these checks when it’s set to true. In the code below, it will print out values at different stages of the code. Furthermore, the final lines of the code will check the resulting data structure.

DSummary_Variable <- function(data_obj, var, debug=TRUE){
  ......
}

if (debug) message('|==========>>> Processing of the variable. \n')

if (debug) {
  if (!missing(var_summary)) {
    message('|==========>>> var_summary has been created and has a length of ',
            length(var_summary),
            ' and the nested list has a length of ',
            length(var_summary[['var_properties']]), ' \n')
  } else {
    stop("var_summary is missing. Please investigate")
  }
}

If you have multiple functions that interact with one another, it’s a good idea to preface the printed message with the name of the function.

add_func <- function(a, b) a + b
mult_func <- function(a, b) a * b

main_func <- function(data, cola, colb, debug=TRUE){

  if (debug) {
    message("mult_func: checking inputs to be used")
  }

  mult_func(data[, cola], data[, colb])

  if (debug) {
    message("mult_add: checking inputs to be used")
  }
}

Stay tuned for part three, where I’ll talk about the testthat and assertive packages.


Investigating Cryptocurrencies (Part II)

Tue, 07/18/2017 - 02:36

(This article was first published on R-Chart, and kindly contributed to R-bloggers)

This is the second in a series of posts designed to show how the R programming language can be used with cryptocurrency related data sets.  A number of R packages are great for analyzing stocks and bonds and similar financial instruments.  These can also be applied to working with cryptocurrencies.  In this post we will focus on Bitcoin.

Bitcoin has garnered enough attention that it is available through Yahoo’s finance data under the symbol BTCUSD=X.  The quantmod package comprises a set of functions and utilities geared towards time series analysis traditionally associated with stocks.  You can load Bitcoin along with other stock symbols using the loadSymbols function.  In this example we will also load AMD, which makes graphics cards used by cryptocurrency miners.

library(quantmod)
loadSymbols(c('BTCUSD=X', 'AMD'))

If you have any issue downloading the data, make sure you update to the latest version of quantmod.  If all goes well, you will have two objects in your global environment named AMD and BTCUSD=X.

ls()
[1] "AMD"      "BTCUSD=X"

You can plot AMD by simply passing it to the plot function.

plot(AMD)

Bitcoin is slightly different simply because the symbol in use includes an equal sign.  To ensure that R evaluates the code properly, the symbol must be surrounded in back ticks.

plot(`BTCUSD=X`)

There is data missing for certain days.  There are other sources for cryptocurrency data which can be substituted if needed.  We will ignore this anomaly for the remainder of this post.  There is data for the last 4 weeks. We can construct a candle chart that focuses on this subset of data.

chartSeries(`BTCUSD=X`, subset = 'last 4 weeks')

This chart can then be modified to include technical analysis – for instance Bollinger Bands.

addBBands()


The capabilities of the quantmod package were covered in an earlier post (see http://www.r-chart.com/2010/06/stock-analysis-using-r.html), which includes a listing of other functions that can be applied.
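As a taste of those other functions, here is a small sketch that reuses the BTCUSD=X object created above; it computes daily returns and adds a MACD indicator to the candle chart (treat the exact chart options as assumptions rather than recommendations).

# Daily returns of Bitcoin from the xts object loaded earlier
btc_ret <- dailyReturn(`BTCUSD=X`)
head(btc_ret)

# Rebuild the candle chart and overlay a MACD indicator
chartSeries(`BTCUSD=X`, subset = 'last 4 weeks')
addMACD()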

Inasmuch as cryptocurrencies behave like traditional equities, they lend themselves to similar types of analysis.  The quantmod package is a great place to start when analyzing cryptocurrencies.   



To leave a comment for the author, please follow the link and comment on their blog: R-Chart.

Young people neither in employment nor in education and training in Europe, 2000-2016

Tue, 07/18/2017 - 02:00

(This article was first published on Ilya Kashnitsky, and kindly contributed to R-bloggers)

R Documentation at Stack Overflow

One of the nice features of R is the ease of data acquisition. I am now working on examples of data acquisition from different sources within an R session. Soon I am going to publish a long-read with an overview of demographic data acquisition in R.

Please consider contributing your examples to the Data acquisition topic.

NEET in Europe

As an example of Eurostat data usage I chose to show the dynamics of NEET (young people neither in employment nor in education and training) in European countries. The example uses the brilliant geofacet package.

library(tidyverse)
library(lubridate)
library(forcats)
library(eurostat)
library(geofacet)
library(viridis)
library(ggthemes)
library(extrafont)

# Find the needed dataset code
# http://ec.europa.eu/eurostat/web/regions/data/database

# download NEET data for countries
neet <- get_eurostat("edat_lfse_22")

# if the automated download does not work, the data can be grabbed manually at
# http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing

neet %>%
        filter(geo %>% paste %>% nchar == 2,
               sex == "T",
               age == "Y18-24") %>%
        group_by(geo) %>%
        mutate(avg = values %>% mean()) %>%
        ungroup() %>%
        ggplot(aes(x = time %>% year(), y = values)) +
        geom_path(aes(group = 1)) +
        geom_point(aes(fill = values), pch = 21) +
        scale_x_continuous(breaks = seq(2000, 2015, 5),
                           labels = c("2000", "'05", "'10", "'15")) +
        scale_y_continuous(expand = c(0, 0), limits = c(0, 40)) +
        scale_fill_viridis("NEET, %", option = "B") +
        facet_geo(~ geo, grid = "eu_grid1") +
        labs(x = "Year",
             y = "NEET, %",
             title = "Young people neither in employment nor in education and training in Europe",
             subtitle = "Data: Eurostat Regional Database, 2000-2016",
             caption = "ikashnitsky.github.io") +
        theme_few(base_family = "Roboto Condensed", base_size = 15) +
        theme(axis.text = element_text(size = 10),
              panel.spacing.x = unit(1, "lines"),
              legend.position = c(0, 0),
              legend.justification = c(0, 0))

The whole code may be downloaded from the gist


To leave a comment for the author, please follow the link and comment on their blog: Ilya Kashnitsky.

Ecosystems chapter added to “Empirical software engineering using R”

Tue, 07/18/2017 - 01:05

(This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers)

The Ecosystems chapter of my Empirical software engineering book has been added to the draft pdf (download here).

I don’t seem to be able to get away from rewriting everything, despite working on the software engineering material for many years. Fortunately the sparsity of the data keeps me in check, but I keep finding new and interesting data (not a lot, but enough to slow me down).

There is still a lot of work to be done on the ecosystems chapter, not least integrating all the data I have been promised. The basic threads are there, they just need filling out (assuming the promised data sets arrive).

I did not get any time to integrate in the developer and economics data received since those draft chapters were released; there has been some minor reorganization.

As always, if you know of any interesting software engineering data, please tell me.

I’m looking to rerun the workshop on analyzing software engineering data. If anybody has a venue in central London, that holds 30 or so people+projector, and is willing to make it available at no charge for a series of free workshops over several Saturdays, please get in touch.

Projects chapter next.


To leave a comment for the author, please follow the link and comment on their blog: The Shape of Code » R.

Ten-HUT! The Apache Drill R interface package — sergeant — is now on CRAN

Tue, 07/18/2017 - 00:54

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I’m extremely pleased to announce that the sergeant package is now on CRAN or will be hitting your local CRAN mirror soon.

sergeant provides JDBC, DBI and dplyr/dbplyr interfaces to Apache Drill. I’ve also wrapped a few goodies into the dplyr custom functions that work with Drill and if you have Drill UDFs that don’t work “out of the box” with sergeant‘s dplyr interface, file an issue and I’ll make a special one for it in the package.

I’ve written about drill on the blog before so check out those posts for some history and stay tuned for more examples. The README should get you started using sergeant and/or Drill (if you aren’t running Drill now, take a look and you’ll likely get hooked).
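As a rough sketch of what getting started looks like (following the README; the host name and the sample cp.`employee.json` table bundled with Drill are assumptions about your Drill setup):

# Minimal dplyr-over-Drill sketch with sergeant
library(sergeant)
library(dplyr)

ds <- src_drill("localhost")            # connect to a drillbit on the default port

db <- tbl(ds, "cp.`employee.json`")     # a sample data source shipped with Drill

db %>%
  group_by(position_title) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  collect()                             # pull the aggregated result into R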

I’d like to take a moment to call out special thanks to Edward Visel for bootstrapping the dbplyr update to sergeant when the dplyr/dbplyr interfaces split. It saved me loads of time and really helped the progress of this package move faster towards a CRAN release.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

Revisiting the useR!2017 conference: Recordings now available

Mon, 07/17/2017 - 22:12

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The annual useR!2017 conference took place July 4-7 in Brussels, and in every dimension it was the best yet. It was the largest (with over 1,100 R users from around the world in attendance), and yet still very smoothly run with many amazing talks and lots of fun for everyone. If you weren't able to make it to Brussels, take a look at these recaps from Nick Strayer & Lucy D'Agostino McGowan, Once Upon Data and DataCamp to get a sense of what it was like, or simply take a look at this recap video:

From my personal point of view, if I were to try and capture useR!2017 in just one word, it would be: vibrant. With so many first-time attendees, an atmosphere of excitement was everywhere, and the conference was noticeably much more diverse than in prior years — a really positive development. Kudos to the organizers for their focus on making useR!2017 a welcoming and inclusive conference, and a special shout-out to the R-Ladies community for encouraging and inspiring so many. I especially enjoyed meeting the diversity scholars and being a part of the special beginner's session held before the conference officially began (and so sadly unrecorded). Judging from the 200+ attendees' reactions there, many welcomed getting a jump-start on the R project, its community, and how best to participate and contribute.

The diversity was reflected in the content, too, with a great mix of tutorials, keynotes and talks on R packages, R applications, the R community and ecosystem, and the R project itself. With thanks to Microsoft, all of this material was recorded, and is now available to view on Channel 9:

useR!2017 Recordings: useR! International R User 2017 Conference

All recordings are streamable and downloadable, and are shared under a Creative Commons license. (Note: a few talks are still in the editing room awaiting posting, but all the content should be available at the link above by July 21.) In many cases, you can also find slides in the sessions listed in the useR!2017 schedule

With around 300 videos it might be tricky to find the one you want, but you can use the Filters button to reveal a search tool, and you can also filter by specific speakers:

Here are a few searches you might find useful:

Next year's useR! conference, useR!2018, will be held July 10-13 in Brisbane, Australia. The organizers have opened a survey on useR!2018 to give the R community an opportunity to make suggestions on the content. If you have ideas for tutorial topics and presenters, keynote speakers, services like child care, or sign language interpreters, or how scholarships should be awarded, please do contribute your ideas.

Looking even further out, useR!2019 will be in Toulouse (France), and useR!2020 will be in Boston (USA). That's a lot to be looking forward to, and with useR!2017 setting such a high bar I'm sure these will be outstanding conferences as well. See you there!

 


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Twitter analysis using R (Semantic analysis of French elections)

Mon, 07/17/2017 - 19:51

(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

Last month, the post The French elections viewed through Twitter: a semantic analysis showed how the two contenders were perceived on Twitter during three key events of the campaign (the Macron leaks, the presidential debate and election day). The goal of this post is to show how to perform this Twitter analysis using R.

Collecting tweets in real time with streamR (Twitter streaming API)

To perform the analysis, I needed a large number of tweets and I wanted to use all of the tweets concerning the election. The Twitter search API is limited since you only have access to a sample of tweets. On the other hand, the streaming API allows you to collect the data in real time and to collect almost all tweets. Hence, I used the streamR package.

So, I collected tweets in 60-second batches and saved them to .json files. Using batches instead of one large file reduces RAM consumption (instead of reading and then subsetting one large file, you can subset each of the batches and then merge them). Here is the code to collect the data with streamR.

### Loading my twitter credentials
load("oauth.Rdata")

## Collecting data
require('streamR')

i = 1
while (TRUE) {
  i = i + 1
  filterStream(file = paste0("tweet_macronleaks/tweets_rstats", i, ".json"),
               track = c("#MacronLeaks"),
               timeout = 60,
               oauth = my_oauth,
               language = 'fr')
}

The code runs an infinite loop (stopped manually); the filterStream function filters the Twitter stream according to the defined filter. Here, we only keep tweets in French that contain #MacronLeaks.

Tweets cleaning and pre-processing

Now that the tweets are collected, they need to be cleaned and pre-processed. A raw tweet will contain links, tabulations, @, #, double spaces, … that will influence the analysis. It will also contain stop words (stop words are very frequent words in the language such as ‘and’, ‘or’, ‘with’, …).
In addition to this, some tweets are retweeted (sometimes a lot) and may distort the word and text distribution. Enough of the RTs are kept to show that some tweets are more popular than others, but most of them are removed to prevent them from standing out too much.

First, the saved tweets need to be read and merged:

require(data.table)

data.tweet = NULL
i = 1
while (TRUE) {
  i = i + 1
  print(i)
  print(paste0("tweet_macronleaks/tweets_rstats", i, ".json"))
  if (is.null(data.tweet))
    data.tweet = data.table(parseTweets(paste0("tweet_macronleaks/tweets_rstats", i, ".json")))
  else
    data.tweet = rbind(data.tweet,
                       data.table(parseTweets(paste0("tweet_macronleaks/tweets_rstats", i, ".json"))))
}

Then we only keep some of the RTs. The retweet count indicates how many times a given tweet has been retweeted, hence we only keep about log(1+n) of the RTs.

data.tweet[, min_RT := min(retweet_count), by = text]
data.tweet[, max_RT := max(retweet_count), by = text]
data.tweet = data.tweet[lang == 'fr', ]
data.tweet = data.tweet[retweet_count <= min_RT + log(max_RT - min_RT + 1), ]

Then, the text can be cleaned using functions from the tm package.

### Unaccent and clean the text
Unaccent <- function(x) {
  x = tolower(x)
  x = gsub("@\\w+", "", x)
  x = gsub("[[:punct:]]", " ", x)
  x = gsub("[ |\t]{2,}", " ", x)
  x = gsub("^ ", " ", x)
  x = gsub("http\\w+", " ", x)
  x = tolower(x)
  x = gsub('_', ' ', x, fixed = T)
  x
}

require(tm)

### Remove accents
data.tweet$text = Unaccent(iconv(data.tweet$text, from = "UTF-8", to = "ASCII//TRANSLIT"))
## Remove stop words
data.tweet$text = removeWords(data.tweet$text, c('rt', 'a', stopwords('fr'), 'e', 'co', 'pr'))
## Remove double whitespaces
data.tweet$text = stripWhitespace(data.tweet$text)

Tokenization and creation of the vocabulary

Now that the tweets have been cleaned, they can be tokenized. During this step, each tweet is split into tokens; here, each word corresponds to a token.

require(text2vec)  # provides space_tokenizer(), itoken() and the vocabulary/embedding functions below

# Create iterator over tokens
tokens <- space_tokenizer(data.tweet$text)
it = itoken(tokens, progressbar = FALSE)

Now a vocabulary can be created (it is a “summary” of the word distribution) based on the corpus. Then the vocabulary is pruned (very common and very rare words are removed).

vocab = create_vocabulary(it)
vocab = prune_vocabulary(vocab,
                         term_count_min = 5,
                         doc_proportion_max = 0.4,
                         doc_proportion_min = 0.0005)
vectorizer = vocab_vectorizer(vocab,
                              grow_dtm = FALSE,
                              skip_grams_window = 5L)
tcm = create_tcm(it, vectorizer)

Now we can create the word embedding. In this example, I used a GloVe embedding to learn vector representations of the words. The new vector space has 200 dimensions.

glove = GlobalVectors$new(word_vectors_size = 200, vocabulary = vocab, x_max = 100)
glove$fit(tcm, n_iter = 200)
word_vectors <- glove$get_word_vectors()

How to finish our Twitter analysis with t-SNE

Now that the words are vectors, we would like to plot them in two dimensions to show the meaning of the words in an appealing (and understandable) way. The number of dimensions needs to be reduced to two; to do so, we will use t-SNE. t-SNE is a non-parametric dimensionality reduction algorithm and tends to perform well on word embeddings. R has a package (actually two) to perform t-SNE; we will use the most recent one, Rtsne.
To avoid overcrowding the plot and to reduce computing time, only words with more than 50 appearances will be used.

require('Rtsne')
set.seed(123)

word_vectors_sne = word_vectors[which(vocab$vocab$doc_counts > 50 &
                                      !rownames(word_vectors) %in% stopwords('fr')), ]
tsne_out = Rtsne(word_vectors_sne, perplexity = 2, initial_dims = 200, dims = 2)
DF_proj = data.frame(x = tsne_out$Y[, 1],
                     y = tsne_out$Y[, 2],
                     word = rownames(word_vectors_sne))

Now that the projection in two dimensions has been done, we'd like to know which contender each word should be assigned to, so the plot can be colored accordingly. To do so, a dictionary is created with the names and handles of each contender, and the distance from every word to each of these handles is computed.
For instance, to assign a candidate to the word 'democracy', the minimum distance between 'democracy' and 'mlp', 'marine', 'fn' is computed. The same is done between 'democracy' and 'macron', 'em', 'enmarche'. If the first distance is the smaller of the two, 'democracy' is assigned to Marine Le Pen; otherwise, it is assigned to Emmanuel Macron.

require(ggplot2)
require(ggrepel)
require(plotly)

DF_proj <- data.table(DF_proj)
## The filter below must match the one used to build word_vectors_sne (doc_counts > 50)
DF_proj$count <- vocab$vocab$doc_counts[which(vocab$vocab$doc_counts > 50 &
                                                !(rownames(word_vectors) %in% stopwords('fr')))]
DF_proj <- DF_proj[word != 'NA']

## Similarity between a word and the list of words associated with a candidate
distance_to_candidat <- function(word_vectors, words_list, word_in) {
  max(sim2(word_vectors[words_list, , drop = FALSE],
           word_vectors[word_in, , drop = FALSE]))
}

## Assign the closest candidate to a word
closest_candidat <- function(word_vectors, mot_in) {
  mot_le_pen <- c('marine', 'pen', 'lepen', 'fn', 'mlp')
  mot_macron <- c('macron', 'emmanuel', 'em', 'enmarche', 'emmanuelmacron')
  dist_le_pen <- distance_to_candidat(word_vectors, mot_le_pen, mot_in)
  dist_macron <- distance_to_candidat(word_vectors, mot_macron, mot_in)
  if (dist_le_pen > dist_macron) 'Le Pen' else 'Macron'
}

DF_proj[, word := as.character(word)]
DF_proj <- DF_proj[word != ""]
DF_proj[, Candidat := closest_candidat(word_vectors, word), by = word]

gg <- ggplot(DF_proj, aes(x, y, label = word, color = Candidat)) +
  geom_text(aes(size = sqrt(count + 1)))
ggplotly(gg)

You can get our latest news on Twitter.

The post Twitter analysis using R (Semantic analysis of French elections) appeared first on Enhance Data Science.


To leave a comment for the author, please follow the link and comment on their blog: Enhance Data Science.

No time wasting

Mon, 07/17/2017 - 18:42

(This article was first published on Blog, and kindly contributed to R-bloggers)

Ensuring your analytic IP is given the attention it deserves

It is now widely recognised that data is the key to making informed business decisions. As such, models and code are the tools used to extract insight and should be considered very valuable IP for an organisation.

If code and models are treated as valuable assets, it's important to store them so that others within the organisation can easily find, reuse and repurpose them in other projects and areas of the business.

However, there are some key challenges:

Losing code

Even if sharing code is actively encouraged inside an organisation, traditional storage platforms treat analytical code in the same way as any other file. This means that —without prior knowledge of a particular script’s existence— they can be hard to find, and in some cases lost forever in a mass of other files and scripts in the same platform.

Reproducibility

Have you ever been asked to reproduce a piece of analysis from six months ago? Or how about two years ago? For many, reproducing an older piece of analysis can be a huge task. Finding the script is one thing, but knowing which version of the script was used, the data it was run against, and the versions of the software it was originally run in makes this a more complex problem than you might originally think.

Wasting time

How many times have you written a script, only to find out a colleague has already written code which does the exact same thing? If this has happened to you, then it has probably happened to your colleagues.

ModSpace offers the solutions to these problems and much more.

Developed by the Mango team with modellers and statisticians in mind, it is a safe place to store analytical code and models. Because ModSpace understands analytical code, key information is collected when code is loaded into the system, making it easily discoverable by other users.

By linking ModSpace to multiple repositories around your organisation and integrating it with analysts’ preferred tools —such as R, SAS, Python, MATLAB, NONMEM and many others— you’re providing your teams with an enhanced workflow and analytic hub.

If analytical code and models are a vital part of your business, please join us on 20 July for a FREE demonstration of ModSpace.

Register now.

To leave a comment for the author, please follow the link and comment on their blog: Blog.

Top reasons to send your team to EARL

Mon, 07/17/2017 - 18:21

(This article was first published on Blog, and kindly contributed to R-bloggers)

It’s easy to get stuck in the day-to-day at the office and there’s never time to upskill or even think about career development. However, to really grow and develop your organisation, it’s important to grow and develop your team.

While there are many ways to develop teams, including training and providing time to complete personal (and relevant) projects, conferences provide a range of benefits.

Spark innovation

Some of the best in the business present their projects, ideas and solutions at EARL each year. It's the perfect opportunity to see what's trending and what's really working. Topics at EARL Conferences this year include best-practice SAS-to-R migration; Shiny applications; using social media data; and web scraping, plus presentations on R in marketing, healthcare, finance, insurance and transport. Take a look at the agenda for London here.

A cross-sector conference like EARL can help your organisation think outside the box because learnings are transferable, regardless of industry.

Imbue knowledge

This brings us to knowledge. Learning from the best in the business helps employees expand their knowledge base. This can keep them motivated and engaged in what they're doing, and a wider knowledge base can also inform their everyday tasks, enabling them to improve the way they do their job.

When employees feel like you want to invest in them, they stay engaged and are more likely to remain in the same organisation for longer.

Encourage networking

EARL attracts R users from all levels and industries, and not just as speakers. The agenda offers plenty of opportunities to network with some of the industry's most engaged R users. This is beneficial for a number of reasons, including knowledge exchange and sharing your organisation's values.

Boost inspiration

We often see delegates who have come to an EARL Conference with a specific business challenge in mind. By attending, they get access to the current innovations, knowledge and networking mentioned above, and can return to their team —post-conference— with a renewed vigour to solve those problems using their new-found knowledge.

Making the most out of attending EARL

After all of that, the next step is making sure your organisation makes the most out of attending EARL. We recommend:

Setting goals

Do you have a specific challenge you’re trying to solve in your organisation? Going with a set challenge in mind means your team can plan which sessions to sit in and who they should talk to during the networking sessions.

De-briefing

This is two-fold:

1) Writing a post-conference report will help your team put what they have learnt at EARL into action.

2) Not everyone can attend, so those who do can share their new-found knowledge with peers, who can learn second-hand from their colleague's experience.

Following up

We're all guilty of going to a conference, coming back inspired and then getting lost in the day-to-day. Assuming you've set goals and de-briefed, it should be easy to develop a follow-up plan.

You can make the most of inspired team members to put in place new strategies, technologies and innovations through further training, contact follow-ups and new procedure development.

EARL Conference offers a range of discounts, including package deals for organisations looking to send more than 2 delegates.

Buy tickets now or contact the EARL Team


To leave a comment for the author, please follow the link and comment on their blog: Blog.

Volatility modelling in R exercises (Part-4)

Mon, 07/17/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

This is the fourth part of the series on volatility modelling. For other parts of the series follow the tag volatility.

In this exercise set we will explore GARCH-M and E-GARCH models. We will also use these models to generate rolling window forecasts, bootstrap forecasts and perform simulations.

Answers to the exercises are available here.

Exercise 1
Load the rugarch and the FinTS packages. Next, load the m.ibmspln dataset from the FinTS package. This dataset contains monthly excess returns of the S&P500 index and IBM stock from Jan-1926 to Dec-1999 (Ruey Tsay (2005) Analysis of Financial Time Series, 2nd ed., Wiley, chapter 3).
Also, load the forecast package which we will use for autocorrelation graphs.
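
A minimal sketch of this setup (the column names of m.ibmspln are not listed in the exercise, so inspect the object after loading):

library(rugarch)
library(FinTS)
library(forecast)
data(m.ibmspln)   # monthly excess returns of the S&P 500 and IBM, Jan-1926 to Dec-1999
str(m.ibmspln)    # check the class and column names before modelling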

Exercise 2
Estimate a GARCH(1,1)-M model for the S&P500 excess returns series. Determine if the effect of volatility on asset returns is significant.
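
One way to set this up in rugarch is through the archm option of the mean model. A minimal sketch, assuming the S&P 500 column of m.ibmspln is called "sp" (adjust to the actual name):

spec_garchm <- ugarchspec(
  variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
  mean.model     = list(armaOrder = c(0, 0), include.mean = TRUE,
                        archm = TRUE, archpow = 1))
fit_garchm <- ugarchfit(spec_garchm, data = m.ibmspln[, "sp"])
fit_garchm   # the archm coefficient and its p-value measure the volatility-in-mean effect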

Exercise 3
Excess IBM stock returns are defined as a regular zoo variable. Convert this to a time series variable with correct dates.
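
A sketch of one possible conversion, assuming the IBM column is called "ibm" and the monthly sample starts in January 1926:

library(zoo)
# strip the zoo index and rebuild a ts object with explicit monthly dates
ibm <- ts(coredata(m.ibmspln[, "ibm"]), start = c(1926, 1), frequency = 12)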

Learn more about Model Evaluation in the online course Regression Machine Learning with R. In this course you will learn how to:

  • Avoid model over-fitting using cross-validation for optimal parameter selection
  • Explore maximum-margin methods, such as choosing the best error-term penalty for support vector machines with linear and non-linear kernels.
  • And much more

Exercise 4
Plot the absolute and squared excess IBM stock returns along with their ACF and PACF graphs and determine the appropriate model configuration.
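
A minimal sketch of these diagnostics, reusing the ibm series built in the Exercise 3 sketch; ggAcf() and ggPacf() come from the forecast package:

plot(abs(ibm), main = "Absolute excess IBM returns")
plot(ibm^2, main = "Squared excess IBM returns")
ggAcf(abs(ibm)); ggPacf(abs(ibm))
ggAcf(ibm^2);    ggPacf(ibm^2)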

Exercise 5
The exponential GARCH model incorporates asymmetric effects for positive and negative asset returns. Estimate an AR(1)-EGARCH(1,1) model for the IBM series.
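
A sketch of the specification, again using the ibm series from the Exercise 3 sketch:

spec_egarch <- ugarchspec(
  variance.model = list(model = "eGARCH", garchOrder = c(1, 1)),
  mean.model     = list(armaOrder = c(1, 0)))
fit_egarch <- ugarchfit(spec_egarch, data = ibm)
fit_egarch   # the asymmetry (leverage) parameters are reported in the fit summary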

Exercise 6
Using the results from exercise-5, get rolling window forecasts starting from the 800th observation and refit the model after every three observations.
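
ugarchroll() handles the refitting automatically. A sketch under the stated settings, reusing spec_egarch from the Exercise 5 sketch:

roll_egarch <- ugarchroll(spec_egarch, data = ibm,
                          n.start = 800,           # forecasts start after the 800th observation
                          refit.every = 3,         # refit the model every three observations
                          refit.window = "moving")
head(as.data.frame(roll_egarch))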

Exercise 7
Estimate an AR(1)-GARCH(1,1) model for the IBM series and get a bootstrap forecast for the next 50 periods with 500 replications.
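
A sketch using ugarchboot(); the object names are my own, and the "Partial" method (which ignores parameter uncertainty) is one of the available options:

spec_ar_garch <- ugarchspec(
  variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
  mean.model     = list(armaOrder = c(1, 0)))
fit_ar_garch  <- ugarchfit(spec_ar_garch, data = ibm)
boot_ar_garch <- ugarchboot(fit_ar_garch, method = "Partial",
                            n.ahead = 50,      # forecast horizon
                            n.bootpred = 500)  # bootstrap replications
boot_ar_garch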

Exercise 8
Plot the forecasted returns and sigma with bootstrap error bands.

Exercise 9
We can use Monte-Carlo simulation to get a distribution of the parameter estimates. Using the fitted model from exercise-7, run 500 simulations, each with a horizon of 2000 periods.
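
One way to obtain such a distribution is ugarchdistribution(), which simulates paths from the fitted model and refits it on each simulated path. A sketch under the stated settings, reusing fit_ar_garch from the Exercise 7 sketch:

dist_ar_garch <- ugarchdistribution(fit_ar_garch, n.sim = 2000, m.sim = 500)
dist_ar_garch
# calling plot() on the result includes, among other panels, densities of the simulated parameter estimates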

Exercise 10
Plot the density functions of the parameter estimates.

Related exercise sets:
  1. Volatility modelling in R exercises (Part-3)
  2. Volatility modelling in R exercises (Part-2)
  3. Forecasting: ARIMAX Model Exercises (Part-5)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
