R news and tutorials contributed by hundreds of R bloggers

Newer to R? rstudio::conf 2018 is for you! Early bird pricing ends August 31.

Fri, 08/25/2017 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Immersion is among the most effective ways to learn any language. Immersing where new and advanced users come together to improve their use of the R language is a rare opportunity. rstudio::conf 2018 is that time and place!

REGISTER TODAY

Be an Early Bird! Discounts for early conference registration expire August 31.
Immerse as a team! Ask us about group discounts for 5 or more from the same organization.

rstudio::conf 2018 is a two-day conference with optional two-day workshops. One of the conference tracks will focus on topics for newer R users, who will learn the best ways to use R, how to avoid common pitfalls, and how to accelerate their proficiency. Several workshops are also designed specifically for those newer to R.

Intro to R & RStudio
  • Are you new to R & RStudio and do you learn best in person? You will learn the basics of R and data science, and practice using the RStudio IDE (integrated development environment) and R Notebooks. We will have a team of TAs on hand to show you the ropes and help you out when you get stuck.

    This course is taught by well-known R educator and friend of RStudio, Amelia McNamara, a Smith College Visiting Assistant Professor of Statistical and Data Sciences & Mass Mutual Faculty Fellow.

Data Science in the Tidyverse
  • Are you ready to begin applying the book, R for Data Science? Learn how to achieve your data analysis goals the “tidy” way. You will visualize, transform, and model data in R and work with date-times, character strings, and untidy data formats. Along the way, you will learn and use many packages from the tidyverse including ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, lubridate, and forcats.

    This course is taught by friend of RStudio, Charlotte Wickham, a professor and award winning teacher and data analyst at Oregon State University.

Intro to Shiny & R Markdown
  • Do you want to share your data analysis with others in effective ways? For people who know their way around the RStudio IDE and R at least a little, this workshop will help you become proficient in Shiny application development and using R Markdown to communicate insights from data analysis to others.

    This course is taught by Mine Çetinkaya-Rundel, Duke professor and RStudio professional educator. Mine is well known for her open education efforts, and her popular data science MOOCs.

Whether you are new to the R language or as advanced as many of our speakers and educators, rstudio::conf 2018 is the place and time to focus on all things R & RStudio.

We hope to see you in San Diego!


Gradient boosting in R

Thu, 08/24/2017 - 21:04

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Boosting is another famous ensemble learning technique. Unlike bagging, it is not primarily concerned with reducing the variance of learners; in bagging, the aim is to reduce the high variance of learners, and thus avoid overfitting, by averaging lots of models fitted on bootstrapped samples drawn with replacement from the training data.

Another major difference between the two techniques is that in bagging the generated models are independent of each other and carry equal weight, whereas boosting is a sequential process in which each new model is added so as to improve a bit on the previous one. Simply put, each model added to the mix aims to improve on the performance of the previous collection of models; in boosting we do weighted averaging.

What both ensemble techniques have in common is that they generate lots of models on the training data and use their combined power to increase the accuracy of the final model formed by combining them.

Boosting, however, is aimed more at reducing bias, i.e. it works with simple learners, or more specifically weak learners. A weak learner is a learner that always learns something, i.e. it does better than chance and has an error rate of less than 50%. The best example of a weak learner is a decision tree, which is why we generally apply ensemble techniques to decision trees to improve their accuracy and performance.

In boosting, each tree or model is grown or trained using the hard examples. By hard I mean all the training examples \((x_i, y_i)\) for which a previous model produced an incorrect output \(Y\). Boosting boosts the performance of a simple base learner by iteratively shifting the focus towards problematic training observations that are difficult to predict. Information from the previous model is fed to the next one, and every new tree added to the mix does better than the previous tree because it learns from the mistakes of the previous models and tries not to repeat them. This technique eventually converts a weak learner into a strong learner that generalizes better and more accurately to unseen test examples.

An important thing to remember in boosting is that the base learner being boosted should not be a complex, high-variance learner, e.g. a neural network with lots of nodes and large weight values. For such learners boosting has the opposite effect.

So I will explain boosting with respect to decision trees in this tutorial, because they can usually be regarded as weak learners. We will generate a gradient boosting model.

Gradient boosting generates learners using the same general boosting process. It first builds a learner to predict the values/labels of the samples and calculates the loss (the difference between the outcome of the first learner and the real value). It then builds a second learner to predict the loss after the first step, and continues with a third, a fourth, and so on until a certain threshold is reached. Gradient boosting identifies the hard examples through the large residuals \((y_{actual} - y_{pred})\) computed in the previous iterations.
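To make the idea concrete before turning to gbm, here is a minimal hand-rolled sketch of boosting with squared-error loss (not from the original post; the data split and settings are illustrative only), where each small rpart tree is fitted to the residuals of the current ensemble:

library(rpart)
library(MASS)   # Boston housing data

set.seed(1)
train <- sample(1:nrow(Boston), 374)
dat   <- Boston[train, ]

n.trees   <- 100
shrinkage <- 0.1
pred      <- rep(mean(dat$medv), nrow(dat))   # start from the mean prediction
trees     <- vector("list", n.trees)

for (b in 1:n.trees) {
  res        <- dat$medv - pred               # current residuals = the "hard" part
  trees[[b]] <- rpart(res ~ . - medv, data = dat,
                      control = rpart.control(maxdepth = 4, cp = 0))
  pred <- pred + shrinkage * predict(trees[[b]], dat)   # small step towards the residuals
}

mean((dat$medv - pred)^2)   # training MSE keeps shrinking as trees are added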

Implementing Gradient Boosting

Let’s use the gbm package in R to fit a gradient boosting model.

require(gbm)
require(MASS)  # package with the Boston housing dataset

# separating training and test data
train = sample(1:506, size = 374)

We will use the Boston housing data to predict the median value of the houses.

Boston.boost = gbm(medv ~ ., data = Boston[train, ], distribution = "gaussian",
                   n.trees = 10000, shrinkage = 0.01, interaction.depth = 4)
Boston.boost
summary(Boston.boost)  # Summary gives a table of Variable Importance and a plot of Variable Importance

gbm(formula = medv ~ ., distribution = "gaussian", data = Boston[-train, ],
    n.trees = 10000, interaction.depth = 4, shrinkage = 0.01)
A gradient boosted model with gaussian loss function.
10000 iterations were performed.
There were 13 predictors of which 13 had non-zero influence.

> summary(Boston.boost)
            var     rel.inf
rm           rm 36.96963915
lstat     lstat 24.40113288
dis         dis 10.67520770
crim       crim  8.61298346
age         age  4.86776735
black     black  4.23048222
nox         nox  4.06930868
ptratio ptratio  2.21423811
tax         tax  1.73154882
rad         rad  1.04400159
indus     indus  0.80564216
chas       chas  0.28507720
zn           zn  0.09297068

The above boosted model is a gradient boosted model that generates 10,000 trees, with the shrinkage parameter \(\lambda = 0.01\), which is also a sort of learning rate. The next parameter is the interaction depth, which is the number of splits performed in each tree, so here each tree is a small tree with only 4 splits.
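As an aside not covered in the original post, gbm can also choose the number of trees itself via cross-validation, which guards against adding too many trees; a hedged sketch:

# Aside (not in the original post): let gbm pick n.trees via 5-fold cross-validation.
Boston.cv <- gbm(medv ~ ., data = Boston[train, ], distribution = "gaussian",
                 n.trees = 10000, shrinkage = 0.01, interaction.depth = 4,
                 cv.folds = 5)
best.iter <- gbm.perf(Boston.cv, method = "cv")   # CV-estimated optimal number of trees
best.iter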
The summary of the model gives a feature importance plot: in the list above, the most important variable is at the top and the least important at the bottom.
The two most important features, which explain most of the variance in the data set, are lstat, the percentage of the population with lower socioeconomic status, and rm, the average number of rooms per dwelling.

Variable importance plot:

Plotting the Partial Dependence Plot

The partial dependence plots show the relationship and dependence of the variables \(X_i\) with the response variable \(Y\).

# Plot of response variable with the lstat variable
plot(Boston.boost, i = "lstat")  # inverse relation with the lstat variable

plot(Boston.boost, i = "rm")  # as the average number of rooms increases, the price increases

The above plots show the relation between the variable on the x-axis and the mapping function \(f(x)\) on the y-axis. The first plot shows that lstat is negatively correlated with the response medv, whereas the second shows that rm is somewhat directly related to medv.

cor(Boston$lstat, Boston$medv)  # negative correlation coefficient r
cor(Boston$rm, Boston$medv)     # positive correlation coefficient r

> cor(Boston$lstat,Boston$medv)
[1] -0.7376627
> cor(Boston$rm,Boston$medv)
[1] 0.6953599

Prediction on Test Set

We will compute the Test Error as a function of number of Trees.

n.trees = seq(from = 100, to = 10000, by = 100)  # number of trees - a vector of 100 values

# Generating a prediction matrix for each tree
predmatrix <- predict(Boston.boost, Boston[-train, ], n.trees = n.trees)
dim(predmatrix)  # dimensions of the prediction matrix

# Calculating the mean squared test error
test.error <- with(Boston[-train, ], apply((predmatrix - medv)^2, 2, mean))
head(test.error)  # contains the mean squared test error for each of the 100 tree counts

# Plotting the test error vs number of trees
plot(n.trees, test.error, pch = 19, col = "blue", xlab = "Number of Trees",
     ylab = "Test Error", main = "Performance of Boosting on Test Set")

# Adding the minimum error line for a random forest trained on the same data with similar parameters
abline(h = min(test.err), col = "red")  # test.err is the test error of a random forest fitted on the same data
legend("topright", c("Minimum Test error Line for Random Forests"), col = "red", lty = 1, lwd = 1)

dim(predmatrix)
[1] 206 100

head(test.error)
      100       200       300       400       500       600 
26.428346 14.938232 11.232557  9.221813  7.873472  6.911313

In the above plot, the red line represents the minimum error obtained from training a random forest with the same data, the same parameters, and the same number of trees. Boosting outperforms random forests on the same test data set, with a lower mean squared test error.

Conclusion

From the above plot we can see that if boosting is done properly, by selecting appropriate tuning parameters such as the shrinkage parameter \(\lambda\), the number of splits, and the number of trees \(n\), then it can generalize really well and convert a weak learner into a strong learner. Ensemble techniques tend to outperform a single learner, which is prone to either overfitting or underfitting, by generating hundreds or thousands of learners and then combining them to produce a better and stronger model.

Hope you guys liked the article; make sure to like and share. Cheers!

    Related Post

    1. Radial kernel Support Vector Classifier
    2. Random Forests in R
    3. Network analysis of Game of Thrones
    4. Structural Changes in Global Warming
    5. Deep Learning with R

    Linear Congruential Generator in R

    Thu, 08/24/2017 - 20:00

    (This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

Part 1 of the series Random Number Generation

    A Linear congruential generator (LCG) is a class of pseudorandom number generator (PRNG) algorithms used for generating sequences of random-like numbers. The generation of random numbers plays a large role in many applications ranging from cryptography to Monte Carlo methods. Linear congruential generators are one of the oldest and most well-known methods for generating random numbers primarily due to their comparative ease of implementation and speed and their need for little memory. Other methods such as the Mersenne Twister are much more common in practical use today.

Linear congruential generators are defined by the recurrence relation:

$$X_{n+1} = \left(a X_n + c\right) \bmod m$$

There are many choices for the parameters \(m\), the modulus, \(a\), the multiplier, and \(c\), the increment. Wikipedia has a seemingly comprehensive list of the parameters currently in use in common programs.
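To see the recurrence in action, here is a tiny worked example with deliberately small, purely illustrative parameters:

# One step at a time: X_{n+1} = (a * X_n + c) mod m, with toy parameters.
m <- 9; a <- 2; c <- 0
x <- 1                       # seed X_0
for (i in 1:6) {
  x <- (a * x + c) %% m
  cat(x, "")                 # prints: 2 4 8 7 5 1 -- after which the cycle repeats
}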

    Aside: ‘Pseudorandom’ and Selecting a Seed Number

    Random number generators such as LCGs are known as ‘pseudorandom’ as they require a seed number to generate the random sequence. Due to this requirement, random number generators today are not truly ‘random.’ The theory and optimal selection of a seed number are beyond the scope of this post; however, a common choice suitable for our application is to take the current system time in microseconds.

    A Linear Congruential Generator Implementation in R

The parameters we will use for our implementation of the linear congruential generator are the same as in the ANSI C implementation (Saucier, 2000): \(m = 2^{32}\), \(a = 1103515245\) and \(c = 12345\).

    The following function is an implementation of a linear congruential generator with the given parameters above.

lcg.rand <- function(n = 10) {
  rng <- vector(length = n)
  m <- 2 ** 32
  a <- 1103515245
  c <- 12345

  # Set the seed using the current system time in microseconds
  d <- as.numeric(Sys.time()) * 1000

  for (i in 1:n) {
    d <- (a * d + c) %% m
    rng[i] <- d / m
  }
  return(rng)
}

We can use the function to generate random numbers on the half-open interval \([0, 1)\).

# Print 10 random numbers on the half-open interval [0, 1)
lcg.rand()

## [1] 0.4605103 0.6643705 0.6922703 0.4603930 0.1842995 0.6804419 0.8561535
## [8] 0.2435846 0.8236771 0.9643965

We can also demonstrate how apparently ‘random’ the LCG is by plotting a sample generation in 3 dimensions. To do this, we generate three random vectors using our LCG above and plot them. The plot3D package is used to create the scatterplot, and the animation package is used to animate each scatterplot as the length of the random vectors, \(n\), increases.

library(plot3D)
library(animation)

n <- c(3, 10, 20, 100, 500, 1000, 2000, 5000, 10000, 20000)

saveGIF({
  for (i in 1:length(n)) {
    x <- lcg.rand(n[i])
    y <- lcg.rand(n[i])
    z <- lcg.rand(n[i])
    scatter3D(x, y, z, colvar = NULL, pch = 20, cex = 0.5, theta = 20,
              main = paste('n = ', n[i]))
  }
}, movie.name = 'lcg.gif')

As \(n\) increases, the LCG appears to be ‘random’ enough, as demonstrated by the cloud of points.

    Linear Congruential Generators with Poor Parameters

The values chosen for the parameters largely determine how ‘random’ the values generated by a linear congruential generator appear; that is why so many different parameter sets are in use today, as there is not yet a clear consensus on the ‘best’ parameters to use.

    We can demonstrate how choosing poor parameters for our LCG leads to not so random generated values by creating a new LCG function.

lcg.poor <- function(n = 10) {
  rng <- vector(length = n)

  # Parameters taken from https://www.mimuw.edu.pl/~apalczew/CFP_lecture3.pdf
  m <- 2048
  a <- 1229
  c <- 1

  d <- as.numeric(Sys.time()) * 1000

  for (i in 1:n) {
    d <- (a * d + c) %% m
    rng[i] <- d / m
  }
  return(rng)
}

Generating successively longer vectors using the ‘poor’ LCG and plotting them as we did previously, we see the generated points are strongly sequentially correlated, and there doesn’t appear to be any ‘randomness’ at all as \(n\) increases.

n <- c(3, 10, 20, 100, 500, 1000, 2000, 5000, 10000, 20000)

saveGIF({
  for (i in 1:length(n)) {
    x <- lcg.poor(n[i])
    y <- lcg.poor(n[i])
    z <- lcg.poor(n[i])
    scatter3D(x, y, z, colvar = NULL, pch = 20, cex = 0.5, theta = 20,
              main = paste('n = ', n[i]))
  }
}, movie.name = 'lcg_poor.gif')

    References

    Saucier, R. (2000). Computer Generation of Statistical Distributions (1st ed.). Aberdeen, MD. Army Research Lab.

    The post Linear Congruential Generator in R appeared first on Aaron Schlegel.


    Calculating a fuzzy kmeans membership matrix with R and Rcpp

    Thu, 08/24/2017 - 18:14

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    by Błażej Moska, computer science student and data science intern 

Suppose that we have performed K-means clustering in R and are satisfied with our results, but later realize that it would also be useful to have a membership matrix. Of course, it would be easier to repeat the clustering using one of the fuzzy k-means functions available in R (like fanny, for example), but since that is a slightly different implementation the results could also be different, and for some reason we don’t want them to change. Knowing the equation, we can construct this matrix on our own after using the kmeans function. The equation is defined as follows (source: Wikipedia):

$$w_{ij} = \frac{1}{ \sum_{k=1}^c \left( \frac{ \| x_{i} - c_{j} \| }{ \| x_{i} - c_{k} \| } \right)^{ \frac{2}{m-1} } }$$

\(w_{ij}\) denotes to what extent the \(i\)th object belongs to the \(j\)th cluster. The total number of rows of this matrix therefore equals the number of observations, and the total number of columns equals the number of clusters. \(m\) is a parameter, typically set to \(m = 2\). \(w_{ij}\) values range between 0 and 1, so they are easy and convenient to compare. In this example I will use \(m = 2\), with the Euclidean distance as \(\| x_i - c_j \|\).
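To make the formula concrete, here is a tiny worked example (not from the original post): with \(m = 2\), a point at distance 1 from the first centre and distance 2 from the second gets memberships 0.8 and 0.2.

# Worked example of the membership formula with m = 2 and two clusters.
d <- c(1, 2)                                   # distances from one point to the two centres
w <- sapply(d, function(dj) 1 / sum((dj / d)^2))
w                                              # 0.8 0.2 -- the closer centre dominates
sum(w)                                         # memberships sum to 1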

To make computations faster I also used the Rcpp package, and then compared the speed of execution of the function written in R with the one written in C++.

In both implementations for loops were used. Although this is not a commonly recommended approach in R (see this blog post for more information and alternatives), in this case I find it more convenient.

Rcpp (C++ version)

#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix fuzzyClustering(NumericMatrix data, NumericMatrix centers, int m) {
  /* data is a matrix with observations (rows) and variables,
     centers is a matrix with cluster center coordinates,
     m is a parameter of the equation, c is the number of clusters. */
  int c = centers.rows();
  int rows = data.rows();
  int cols = data.cols();     /* number of columns equals the number of variables,
                                 the same as in the centers matrix */
  double tempDist = 0;        /* dist and tempDist store temporary Euclidean distances */
  double dist = 0;
  double denominator = 0;     // denominator of the "main" equation
  NumericMatrix result(rows, c);   // declaration of the matrix of results

  for (int i = 0; i < rows; i++) {
    for (int j = 0; j < c; j++) {
      for (int k = 0; k < c; k++) {
        for (int p = 0; p < cols; p++) {
          // in the innermost loop the squared coordinate differences are accumulated
          tempDist = tempDist + pow(centers(j, p) - data(i, p), 2);
          dist = dist + pow(centers(k, p) - data(i, p), 2);
          /* tempDist is the numerator inside the sum operator in the equation,
             dist is the denominator inside the sum operator in the equation */
        }
        tempDist = sqrt(tempDist);
        dist = sqrt(dist);
        denominator = denominator + pow((tempDist / dist), (2.0 / (m - 1)));  // 2.0 avoids integer division
        tempDist = 0;
        dist = 0;
      }
      result(i, j) = 1 / denominator;   // 1 / denominator in the main equation
      denominator = 0;
    }
  }
  return result;
}

    We can save this in a file with .cpp extension. To compile it from R we can write:

    sourceCpp("path_to_cpp_file")

    If everything goes right our function fuzzyClustering will then be available from R.

R version

fuzzyClustering = function(data, centers, m) {
  c <- nrow(centers)
  rows <- nrow(data)
  cols <- ncol(data)
  result <- matrix(0, nrow = rows, ncol = c)  # defining the membership matrix
  denominator <- 0

  for (i in 1:rows) {
    for (j in 1:c) {
      # Euclidean distance, numerator inside the sum operator
      tempDist <- sqrt(sum((centers[j, ] - data[i, ])^2))
      for (k in 1:c) {
        # Euclidean distance, denominator inside the sum operator
        Dist <- sqrt(sum((centers[k, ] - data[i, ])^2))
        denominator <- denominator + ((tempDist / Dist)^(2 / (m - 1)))  # denominator of the equation
      }
      result[i, j] <- 1 / denominator  # inserting the value into the membership matrix
      denominator <- 0
    }
  }
  return(result)
}
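A hypothetical usage sketch (not part of the original post), showing how either implementation could be applied to the centers from an ordinary kmeans() fit; the simulated data mirror the dimensions used in the benchmark below:

# Hypothetical usage: membership matrix from a kmeans() fit on simulated data.
set.seed(123)
X  <- matrix(rnorm(30000 * 3), ncol = 3)      # 30000 observations, 3 variables
km <- kmeans(X, centers = 10, iter.max = 50)

membership <- fuzzyClustering(X, km$centers, m = 2)
dim(membership)                               # 30000 rows, 10 columns (one per cluster)
rowSums(membership)[1:5]                      # each row sums to (approximately) 1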

The result looks as follows. Columns are cluster numbers (in this case 10 clusters were created) and rows are our objects (observations). Values were rounded to the third decimal place, so the row sums can differ slightly from 1:

          1     2     3     4     5     6     7     8     9    10
 [1,] 0.063 0.038 0.304 0.116 0.098 0.039 0.025 0.104 0.025 0.188
 [2,] 0.109 0.028 0.116 0.221 0.229 0.080 0.035 0.116 0.017 0.051
 [3,] 0.067 0.037 0.348 0.173 0.104 0.066 0.031 0.095 0.018 0.062
 [4,] 0.016 0.015 0.811 0.049 0.022 0.017 0.009 0.023 0.007 0.031
 [5,] 0.063 0.048 0.328 0.169 0.083 0.126 0.041 0.079 0.018 0.045
 [6,] 0.069 0.039 0.266 0.226 0.102 0.111 0.037 0.084 0.017 0.048
 [7,] 0.045 0.039 0.569 0.083 0.060 0.046 0.025 0.071 0.015 0.046
 [8,] 0.070 0.052 0.399 0.091 0.093 0.054 0.034 0.125 0.022 0.062
 [9,] 0.095 0.037 0.198 0.192 0.157 0.088 0.038 0.121 0.019 0.055
[10,] 0.072 0.024 0.132 0.375 0.148 0.059 0.025 0.081 0.015 0.067

Performance comparison

Shown below is the output of system.time for the C++ and R versions, running against a simulated matrix with 30,000 observations, 3 variables and 10 clusters.
The hardware I used was a low-cost Asus R556L notebook with an Intel Core i3-5010 2.1 GHz processor and 8 GB of DDR3 1600 MHz RAM.

    C++ version:

   user  system elapsed 
   0.32    0.00    0.33

    R version:

   user  system elapsed 
  15.75    0.02   15.94

    In this example, the function written in C++ executed about 50 times faster than the equivalent function written in pure R.


    Reticulating Readability

    Thu, 08/24/2017 - 18:01

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

    I needed to clean some web HTML content for a project and I usually use hgr::clean_text() for it and that generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non-“main text content” from an HTML document and it usually does a good job but there are some pages that it fails miserably on since it’s more of a brute-force method than one that uses any real “intelligence” when performing the text node targeting.

    Most modern browsers have inherent or plugin-able “readability” capability, and most of those are based — at least in part — on the seminal Arc90 implementation. Many programming languages have a package or module that use a similar methodology, but I’m not aware of any R ports.

What do I mean by “clean text”? Well, I can’t show the URL I was having trouble processing but I can show an example using a recent rOpenSci blog post. Here’s what the raw HTML looks like after retrieving it:

    library(xml2) library(httr) library(reticulate) library(magrittr) res <- GET("https://ropensci.org/blog/blog/2017/08/22/visdat") content(res, as="text", endoding="UTF-8") ## [1] "\n \n\n\n \n \n \n \n \n \n\n \n\n \n \n\n \n \n \n \n \n \n \n try{Typekit.load();}catch(e){}\n \n hljs.initHighlightingOnLoad();\n\n Onboarding visdat, a tool for preliminary visualisation of whole dataframes\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n

    (it goes on for a bit, best to run the code locally)

    We can use the reticulate package to load the Python readability module to just get the clean, article text:

    readability <- import("readability") # pip install readability-lxml doc <- readability$Document(httr::content(res, as="text", endoding="UTF-8")) doc$summary() %>% read_xml() %>% xml_text() # [1] "Take a look at the dataThis is a phrase that comes up when you first get a dataset.It is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?Starting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.These growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.The visdat package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".Making visdat was fun, and it was easy to use. But I couldn't help but think that maybe visdat could be more. I felt like the code was a little sloppy, and that it could be better. I wanted to know whether others found it useful.What I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.Too much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci onboarding process provides.rOpenSci onboarding basicsOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with JOSS.What's in it for the author?Feedback on your packageSupport from rOpenSci membersMaintain ownership of your packagePublicity from it being under rOpenSciContribute something to rOpenSciPotentially a publicationWhat can rOpenSci do that CRAN cannot?The rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN 1. Here's what rOpenSci does that CRAN cannot:Assess documentation readability / usabilityProvide a code review to find weak points / points of improvementDetermine whether a package is overlapping with another.

    (again, it goes on for a bit, best to run the code locally)

    That text is now in good enough shape to tidy.

    Here’s the same version with clean_text():

    # devtools::install_github("hrbrmstr/hgr") hgr::clean_text(content(res, as="text", endoding="UTF-8")) ## [1] "Onboarding visdat, a tool for preliminary visualisation of whole dataframes\n \n \n \n \n  \n \n \n \n \n August 22, 2017 \n \n \n \n \nTake a look at the data\n\n\nThis is a phrase that comes up when you first get a dataset.\n\nIt is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?\n\nStarting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.\n\nThese growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.\n\nThe package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".\n\nMaking was fun, and it was easy to use. But I couldn't help but think that maybe could be more.\n\n I felt like the code was a little sloppy, and that it could be better.\n I wanted to know whether others found it useful.\nWhat I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.\n\nToo much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci provides.\n\nrOpenSci onboarding basics\n\nOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with .\n\nWhat's in it for the author?\n\nFeedback on your package\nSupport from rOpenSci members\nMaintain ownership of your package\nPublicity from it being under rOpenSci\nContribute something to rOpenSci\nPotentially a publication\nWhat can rOpenSci do that CRAN cannot?\n\nThe rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN . Here's what rOpenSci does that CRAN cannot:\n\nAssess documentation readability / usability\nProvide a code review to find weak points / points of improvement\nDetermine whether a package is overlapping with another.

    (lastly, it goes on for a bit, best to run the code locally)

    As you can see, even though that version is usable, readability does a much smarter job of cleaning the text.

The Python code is quite — heh — readable, and R could really use a native port (i.e. this would be a ++gd project for an aspiring package author to take on).


    Big Data analytics with RevoScaleR Exercises

    Thu, 08/24/2017 - 18:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)


In this set of exercises, you will explore how to handle big data with the RevoScaleR package from Microsoft R (previously Revolution Analytics). It comes with Microsoft R Client, which you can get from here. Get the credit card fraud data set from Revolution Analytics and let’s get started.
Answers to the exercises are available here. Please check the documentation before starting this exercise set.

    If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

    Exercise 1
The heart of RevoScaleR is the xdf file format. Convert the credit card fraud data set into xdf format.

    Exercise 2

Use the newly created xdf file to get information about the variables, and print 10 rows to check the data.

    Learn more about importing big data in the online course Data Mining with R: Go from Beginner to Advanced. In this course you will learn how to

    • work with different data import techniques,
• know how to import data and transform it for a specific modelling or analysis goal,
    • and much more.

    Exercise 3
Use rxSummary to get the summary of the variables gender, balance and cardholder where numTrans is greater than 10.

    Exercise 4
Use rxDataStep to create a variable avgbalpertran, defined as balance / (numTrans + numIntlTrans). Use rxGetInfo to check that your changes are reflected in the xdf data.

    Exercise 5
Use rxCor to find the correlation between the newly created variable and fraudRisk.
Exercise 6
Use rxLinMod to construct a linear regression of fraudRisk on gender, balance and cardholder. Don’t forget to check the summary of the model.

    Exercise 7
Find the contingency table of fraudRisk and gender using rxCrossTabs. Hint: figure out how to include factors in the formula.

    Exercise 8
Use rxCube to find the mean balance for each of the two genders.

    Exercise 9
Create a histogram of balance from the xdf file, showing relative frequencies.

    Exercise 10
Create a two-panel histogram with gender and fraudRisk as explanatory variables to show the relative frequency of fraudRisk for the two genders.

    Related exercise sets:
    1. Vector exercises
    2. Cross Tabulation with Xtabs exercises
    3. Data Exploration with Tables exercises
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory

    Introducing ‘powerlmm’ an R package for power calculations for longitudinal multilevel models

    Thu, 08/24/2017 - 17:30

    (This article was first published on R Psychologist - R, and kindly contributed to R-bloggers)

Over the years I’ve produced quite a lot of code for power calculations and simulations of different longitudinal linear mixed models. Over the summer I bundled together these calculations for the designs I most typically encounter into an R package. The purpose of powerlmm is to help design longitudinal treatment studies, with or without higher-level clustering (e.g. by therapists, groups, or physicians), and with missing data. Currently, powerlmm supports two-level models, nested three-level models, and partially nested models. Additionally, unbalanced designs and missing data can be accounted for in the calculations. Power is calculated analytically, but simulation methods are also provided in order to evaluate bias, type 1 error, and the consequences of model misspecification. For novice R users, the basic functionality is also provided as a Shiny web application.

The package can be installed from GitHub here: github.com/rpsychologist/powerlmm. Currently, the package includes three vignettes that show how to set up your studies and calculate power.

A basic example

library(powerlmm)

# dropout per treatment group
d <- per_treatment(control = dropout_weibull(0.3, 2),
                   treatment = dropout_weibull(0.2, 2))

# Setup design
p <- study_parameters(n1 = 11, # time points
                      n2 = 10, # subjects per cluster
                      n3 = 5,  # clusters per treatment arm
                      icc_pre_subject = 0.5,
                      icc_pre_cluster = 0,
                      icc_slope = 0.05,
                      var_ratio = 0.02,
                      dropout = d,
                      cohend = -0.8)

# Power
get_power(p)

## 
## Power calculation for longitudinal linear-mixed model (three-level)
## with missing data and unbalanced designs
## 
##               n1 = 11
##               n2 = 10 (treatment)
##                    10 (control)
##               n3 = 5 (treatment)
##                    5 (control)
##                    10 (total)
##          total_n = 50 (treatment)
##                    50 (control)
##                    100 (total)
##          dropout = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 (time)
##                    0, 0, 1, 3, 6, 9, 12, 16, 20, 25, 30 (%, control)
##                    0, 0, 1, 2, 4, 5, 8, 10, 13, 17, 20 (%, treatment)
## icc_pre_subjects = 0.5
## icc_pre_clusters = 0
##        icc_slope = 0.05
##        var_ratio = 0.02
##           cohend = -0.8
##            power = 0.68

Feedback

    I appreciate all types of feedback, e.g. typos, bugs, inconsistencies, feature requests, etc. Open an issue on github.com/rpsychologist/powerlmm/issues or via my contact info here.


    New R Course: Sentiment Analysis in R – The Tidy Way

    Thu, 08/24/2017 - 16:21

    (This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

Hello, R users! This week we’re continuing to bridge the gap between computers and human language with the launch of Sentiment Analysis in R: The Tidy Way by Julia Silge!

    Text datasets are diverse and ubiquitous, and sentiment analysis provides an approach to understand the attitudes and opinions expressed in these texts. In this course, you will develop your text mining skills using tidy data principles. You will apply these skills by performing sentiment analysis in several case studies, on text data from Twitter to TV news to Shakespeare. These case studies will allow you to practice important data handling skills, learn about the ways sentiment analysis can be applied, and extract relevant insights from real-world data.
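For readers wondering what sentiment analysis “the tidy way” looks like in practice, here is a minimal, hedged sketch using the tidytext package (illustrative only; the data and code are not course material):

# Minimal tidy sentiment analysis: tokenize, join a sentiment lexicon, count.
library(dplyr)
library(tidytext)

texts <- tibble(id = 1:2,
                text = c("I love this wonderful package",
                         "This bug is terrible and ugly"))

texts %>%
  unnest_tokens(word, text) %>%                       # one row per word (tidy text format)
  inner_join(get_sentiments("bing"), by = "word") %>% # match words to the Bing lexicon
  count(id, sentiment)                                # positive/negative word counts per text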

    Take me to chapter 1!

    Sentiment Analysis in R: The Tidy Way features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you an expert in sentiment analysis!

    What you’ll learn:

    Chapter 1: Tweets across the United States

    In this chapter, you will implement sentiment analysis using tidy data principles using geocoded Twitter data.

    Chapter 2: Shakespeare gets Sentimental

    Your next real-world text exploration uses tragedies and comedies by Shakespeare to show how sentiment analysis can lead to insight into differences in word use. You will learn how to transform raw text into a tidy format for further analysis.

    Chapter 3: Analyzing TV News

    Text analysis using tidy principles can be applied to diverse kinds of text, and in this chapter, you will explore a dataset of closed captioning from television news. You will apply the skills you have learned so far to explore how different stations report on a topic with different words, and how sentiment changes with time.

    Chapter 4: Singing a Happy Song (or Sad?!)

    In this final chapter on sentiment analysis using tidy principles, you will explore pop song lyrics that have topped the charts from the 1960s to today. You will apply all the techniques we have explored together so far, and use linear modeling to find what the sentiment of song lyrics can predict.

    Learn all there is to know about Sentiment Analysis in R, the Tidy Way!


    Practical Guide to Principal Component Methods in R

    Thu, 08/24/2017 - 14:04

    (This article was first published on Easy Guides, and kindly contributed to R-bloggers)

    Introduction

    Although there are several good books on principal component methods (PCMs) and related topics, we felt that many of them are either too theoretical or too advanced.

This book provides solid practical guidance for summarizing, visualizing and interpreting the most important information in large multivariate data sets, using principal component methods in R.

    Where to find the book:

    The following figure illustrates the type of analysis to be performed depending on the type of variables contained in the data set.

    There are a number of R packages implementing principal component methods. These packages include: FactoMineR, ade4, stats, ca, MASS and ExPosition.

However, the results are presented differently depending on the package used.

    To help in the interpretation and in the visualization of multivariate analysis – such as cluster analysis and principal component methods – we developed an easy-to-use R package named factoextra (official online documentation: http://www.sthda.com/english/rpkgs/factoextra).

No matter which package you decide to use for computing principal component methods, the factoextra R package can help you easily extract the analysis results from the different packages mentioned above, in a human-readable data format. factoextra also provides convenient solutions for creating beautiful ggplot2-based graphs.

Methods whose outputs can be visualized using the factoextra package are shown in the figure below:

    In this book, we’ll use mainly:

    • the FactoMineR package to compute principal component methods;
    • and the factoextra package for extracting, visualizing and interpreting the results.

The other packages – ade4, ExPosition, etc. – will also be presented briefly.
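As a brief illustration of that division of labour (a sketch based on the two packages’ documented functions, not an excerpt from the book):

# Sketch: compute a PCA with FactoMineR and visualize it with factoextra.
library(FactoMineR)
library(factoextra)

res.pca <- PCA(iris[, 1:4], graph = FALSE)      # PCA on the numeric iris columns

fviz_eig(res.pca)                               # eigenvalues / explained variance
fviz_pca_var(res.pca, col.var = "contrib")      # variables, colored by contribution
fviz_pca_ind(res.pca, col.ind = "cos2")         # individuals, colored by cos2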

    How this book is organized

    This book contains 4 parts.

    Part I provides a quick introduction to R and presents the key features of FactoMineR and factoextra.

    Part II describes classical principal component methods to analyze data sets containing, predominantly, either continuous or categorical variables. These methods include:

    • Principal Component Analysis (PCA, for continuous variables),
    • Simple correspondence analysis (CA, for large contingency tables formed by two categorical variables)
    • Multiple correspondence analysis (MCA, for a data set with more than 2 categorical variables).

    In Part III, you’ll learn advanced methods for analyzing a data set containing a mix of variables (continuous and categorical) structured or not into groups:

    • Factor Analysis of Mixed Data (FAMD) and,
    • Multiple Factor Analysis (MFA).

Part IV covers hierarchical clustering on principal components (HCPC), which is useful for performing clustering with a data set containing only categorical variables or with mixed data of categorical and continuous variables.

    Key features of this book

This book presents the basic principles of the different methods and provides many examples in R, offering solid guidance in data mining for students and researchers.

    Key features:

    • Covers principal component methods and implementation in R
    • Highlights the most important information in your data set using ggplot2-based elegant visualization
    • Short, self-contained chapters with tested examples that allow for flexibility in designing a course and for easy reference

    At the end of each chapter, we present R lab sections in which we systematically work through applications of the various methods discussed in that chapter. Additionally, we provide links to other resources and to our hand-curated list of videos on principal component methods for further learning.

    Examples of plots

    Some examples of plots generated in this book are shown hereafter. You’ll learn how to create, customize and interpret these plots.

1. Eigenvalues/variances of principal components. Proportion of information retained by each principal component.

2. PCA – Graph of variables:
    • Control variable colors using their contributions to the principal components.
    • Highlight the most contributing variables to each principal dimension.

3. PCA – Graph of individuals:
    • Control automatically the color of individuals using the cos2 (the quality of the individuals on the factor map).
    • Change the point size according to the cos2 of the corresponding individuals.

4. PCA – Biplot of individuals and variables.

5. Correspondence analysis. Association between categorical variables.

6. FAMD/MFA – Analyzing mixed and structured data.

7. Clustering on principal components.

    Book preview

    Download the preview of the book at: Principal Component Methods in R (Book preview)

    Order now

    About the author

Alboukadel Kassambara holds a PhD in Bioinformatics and Cancer Biology. He has worked for many years on genomic data analysis and visualization (read more: http://www.alboukadel.com/).

He has work experience in statistical and computational methods for identifying prognostic and predictive biomarker signatures through integrative analysis of large-scale genomic and clinical data sets.

He created GenomicScape (www.genomicscape.com), an easy-to-use web tool for gene expression data analysis and visualization.

He also developed a training website on data science named STHDA (Statistical Tools for High-throughput Data Analysis, www.sthda.com/english), which contains many tutorials on data analysis and visualization using R software and packages.

    He is the author of many popular R packages for:

    Recently, he published three books on data analysis and visualization:

    1. Practical Guide to Cluster Analysis in R (https://goo.gl/DmJ5y5)
    2. Guide to Create Beautiful Graphics in R (https://goo.gl/vJ0OYb).
    3. Complete Guide to 3D Plots in R (https://goo.gl/v5gwl0).



    Notice: Changes to the site

    Thu, 08/24/2017 - 13:51

    (This article was first published on R – Locke Data, and kindly contributed to R-bloggers)

    I wanted to give everyone a heads up about a major rebrand and some probable downtime happening over the weekend.

    I’m going to be consolidating my Locke Data consulting company materials, the blog, the talks, and my package documentation into a single site. The central URL itsalocke.com won’t be changing but there will be a ton of changes happening.

    I think I’ve got the redirects and the RSS feeds pretty much sorted, but I’ve converted more than 300 pages of content to new systems – I’ve likely gotten things wrong. You might notice some issues when clicking through from R-Bloggers, on twitter, or from other people’s sites.

I hope you’ll like the changes I’ve made, but I’m going to have a “bug bounty” in place. If you find a broken link, a blog post that isn’t rendered correctly, or some other bug, then report it. Filling in the form is easy, and if you provide your name and address I’ll send you a sticker as thanks!

    If you want to get a preview of the site, check it out on its temporary home lockelife.com.

    The post Notice: Changes to the site appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.


    Boston EARL Keynote speaker announcement: Tareef Kawaf

    Thu, 08/24/2017 - 13:25

    (This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

    Mango Solutions are thrilled to announce that Tareef Kawaf, President of RStudio, will be joining us at EARL Boston as our third Keynote Speaker.

    Tareef is an experienced software startup executive and a member of teams that built up ATG’s eCommerce offering and Brightcove’s Online Video Platform, helping both companies grow from early startups to publicly traded companies. He joined RStudio in early 2013 to help define its commercial product strategy and build the team. He is a software engineer by training, and an aspiring student of advanced analytics and R.

    This will be Tareef’s second time speaking at EARL Boston and we’re big supporters of RStudio’s mission to provide the most widely used open source and enterprise-ready professional software for the R statistical computing environment, so we’re looking forward to him taking to the podium again this year.

Want to join Tareef at EARL Boston?

Speak

    Abstract submissions close on 31 August, so time is running out to share your R adventures and innovations with fellow R users.

    All accepted speakers receive a 1-day Conference pass and a ticket to the evening networking reception.

    Submit your abstract here.

    Buy a ticket

    Early bird tickets are now available! Save more than $100 on a Full Conference pass.

    Buy tickets here.


    Analyzing Google Trends Data in R

    Thu, 08/24/2017 - 05:48

    Google Trends shows the changes in the popularity of search terms over a given time (i.e., number of hits over time). It can be used to find search terms with growing or decreasing popularity or to review periodic variations from the past such as seasonality. Google Trends search data can be added to other analyses, manipulated and explored in more detail in R.

This post describes how you can use R to download data from Google Trends and then include it in a chart or other analysis. We’ll discuss first how you can get overall (global) data on a search term (query), how to plot it as a simple line chart, and then how you can break the data down by geographical region. The first example I will look at is the rise and fall of the Blu-ray.

    Analyzing Google Trends in R

    I have never bought a Blu-ray disc and probably never will. In my world, technology moved from DVDs to streaming without the need for a high definition physical medium. I still see them in some shops, but it feels as though they are declining. Using Google Trends we can find out when interest in Blu-rays peaked.

    The following R code retrieves the global search history since 2004 for Blu-ray.

library(gtrendsR)
library(reshape2)

google.trends = gtrends(c("blu-ray"), gprop = "web", time = "all")[[1]]
google.trends = dcast(google.trends, date ~ keyword + geo, value.var = "hits")
rownames(google.trends) = google.trends$date
google.trends$date = NULL

The first argument to the gtrends function is a list of up to 5 search terms. In this case, we have just one item. The second argument gprop is the medium searched on and can be any of web, news, images or youtube. The third argument time can be any of now 1-d, now 7-d, today 1-m, today 3-m, today 12-m, today+5-y or all (which means since 2004). A final possibility for time is to specify a custom date range e.g. 2010-12-31 2011-06-30.

    Note that I am using gtrendsR version 1.9.9.0. This version improves upon the CRAN version 1.3.5 (as of August 2017) by not requiring a login. You may see a warning if your timezone is not set – this can be avoided by adding the following line of code:

    Sys.setenv(TZ = "UTC")

    After retrieving the data from Google Trends, I format it into a table with dates for the row names and search terms along the columns. The table below shows the result of running this code.

    Plotting Google Trends data: Identifying seasonality and trends

    Plotting the Google Trends data as an R chart we can draw two conclusions. First, interest peaked around the end of 2008. Second, there is a strong seasonal effect, with significant spikes around Christmas every year.

    Note that results are relative to the total number of searches at each time point, with the maximum being 100. We cannot infer anything about the volume of Google searches. But we can say that as a proportion of all searches Blu-ray was about half as frequent in June 2008 compared to December 2008. An explanation about Google Trend methodology is here.
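The original post shows the resulting chart rather than the plotting code; a minimal sketch is below (the column index is an assumption — after the dcast() step each remaining column holds one search term, so check names(google.trends) first):

# Minimal plotting sketch: dates are the row names, the first column is the "blu-ray" series.
plot(as.Date(rownames(google.trends)), google.trends[, 1],
     type = "l", xlab = "", ylab = "Relative search interest (max = 100)",
     main = "Google Trends: blu-ray")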

    Google Trends by geographic region

    Next, I will illustrate the use of country codes. To do so I will find the search history for skiing in Canada and New Zealand. I use the same code as previously, except modifying the gtrends line as below.

    google.trends = gtrends(c("skiing"), geo = c("CA", "NZ"), gprop = "web", time = "2010-06-30 2017-06-30")[[1]]

    The new argument to gtrends is geo, which allows the users to specify geographic codes to narrow the search region. The awkward part about geographical codes is that they are not always obvious. Country codes consist of two letters, for example, CA and NZ in this case. We could also use region codes such as US-CA for California. I find the easiest way to get these codes is to use this Wikipedia page.

    An alternative way to find all the region-level codes for a given country is to use the following snippet of R code. In this case, it retrieves all the regions of Italy (IT).

library(gtrendsR)
geo.codes = sort(unique(countries[substr(countries$sub_code, 1, 2) == "IT", ]$sub_code))

Plotting the ski data below, we note the contrast between northern and southern hemisphere winters. Skiing is also relatively more popular in Canada than in New Zealand. The 2014 Winter Olympics caused a notable spike in both countries, but particularly in Canada.
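The comparison chart itself is again produced in Displayr; a minimal ggplot2 sketch of the same idea, working from the long-format data frame returned by the gtrends() call above (i.e. before the dcast step; the column names and the type of the hits column are assumptions based on the gtrendsR output described earlier), might look like this:

library(ggplot2)

# Sketch: one line per country from the long-format gtrends() output
# (hits can be character in some gtrendsR versions, hence as.numeric)
ggplot(google.trends, aes(x = as.Date(date), y = as.numeric(hits), colour = geo)) +
  geom_line() +
  labs(x = NULL, y = "Relative search interest", colour = "Country")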

    Create your own analysis

In this post I have shown how to import data from Google Trends using the R package gtrendsR. You can click on this link to explore the examples used in this post or create your own analysis (just sign in to Displayr first).


Hard-nosed Indian Data Scientist Gospel Series – Part 1: Incertitude around Tools and Technologies

    Thu, 08/24/2017 - 05:34

    (This article was first published on Coastal Econometrician Views, and kindly contributed to R-bloggers)

Before the recession, a single commercial tool was popular in the country, so there was little uncertainty around tools and technology; after the recession, however, incertitude (i.e. uncertainty) around tools and technology has preoccupied, and continues to occupy, data science learning, delivery and deployment.

While Python carried on as a general-purpose programming language, R was the best remaining choice (it became more popular with the advent of an IDE, i.e. RStudio), and the author still sees its popularity among data scientists from non-programming backgrounds (i.e. other than computer scientists). Yet, in local meetups, panel discussions and webinars, the author still notices aspirants to data science asking, as an everyday concern, for clarity on which is better, as shown in the image below.

The author has undertaken projects, courses and programs in data science for more than a decade; the views expressed here are drawn from his industry experience. He can be reached at mavuluri.pradeep@gmail or besteconometrician@gmail.com for more details.
Find out more about the author at http://in.linkedin.com/in/pradeepmavuluri


    Digit fifth powers: Euler Problem 30

    Thu, 08/24/2017 - 02:00

    (This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Euler problem 30 is another number-crunching problem that deals with numbers to the power of five. Two other Euler problems dealt with raising numbers to a power: the previous problem looked at permutations of powers, and problem 16 asks for the sum of the digits of 2^1000.

    Numberphile has a nice video about a trick to quickly calculate the fifth root of a number that makes you look like a mathematical wizard.

    Euler Problem 30 Definition

Surprisingly there are only three numbers that can be written as the sum of fourth powers of their digits:

1634 = 1^4 + 6^4 + 3^4 + 4^4
8208 = 8^4 + 2^4 + 0^4 + 8^4
9474 = 9^4 + 4^4 + 7^4 + 4^4

As 1 = 1^4 is not a sum, it is not included.

The sum of these numbers is 1634 + 8208 + 9474 = 19316. Find the sum of all the numbers that can be written as the sum of fifth powers of their digits.

    Proposed Solution

The problem calls for a brute-force solution, but we face a halting problem: how far do we need to go before we can be certain there are no more sums of fifth-power digits? The highest digit is 9, and 9^5 = 59049, which has five digits. If we then look at 6 × 9^5 = 354294, this has six digits and is a good endpoint for the loop (a seven-digit number is at least 1,000,000, but the largest possible sum of seven fifth-power digits is 7 × 9^5 = 413343, so no larger number can qualify). The loop itself cycles through the digits of each number and tests whether the sum of the fifth powers equals the number.

largest <- 6 * 9^5
answer <- 0
for (n in 2:largest) {
    power.sum <- 0
    i <- n
    # peel off the digits of n one by one and add their fifth powers
    while (i > 0) {
        d <- i %% 10
        i <- floor(i / 10)
        power.sum <- power.sum + d^5
    }
    if (power.sum == n) {
        print(n)
        answer <- answer + n
    }
}
print(answer)

    View the most recent version of this code on GitHub.
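As an aside (not part of the original post), the same search can be written without the explicit digit-peeling loop by splitting each number into its digits; a sketch:

# Sketch of an alternative: split each candidate into digits and compare
digit.power.sum <- function(n, p = 5) {
  digits <- as.integer(strsplit(as.character(n), "")[[1]])
  sum(digits^p)
}
candidates <- 2:(6 * 9^5)
sums <- vapply(candidates, digit.power.sum, numeric(1))
sum(candidates[sums == candidates])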

    The post Digit fifth powers: Euler Problem 30 appeared first on The Devil is in the Data.



    Sentiment analysis using tidy data principles at DataCamp

    Thu, 08/24/2017 - 02:00

    (This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

    I’ve been developing a course at DataCamp over the past several months, and I am happy to announce that it is now launched!

    The course is Sentiment Analysis in R: the Tidy Way and I am excited that it is now available for you to explore and learn from. This course focuses on digging into the emotional and opinion content of text using sentiment analysis, and it does this from the specific perspective of using tools built for handling tidy data. The course is organized into four case studies (one per chapter), and I don’t think it’s too much of a spoiler to say that I wear a costume for part of it. I’m just saying you should probably check out the course trailer.

    Course description

    Text datasets are diverse and ubiquitous, and sentiment analysis provides an approach to understand the attitudes and opinions expressed in these texts. In this course, you will develop your text mining skills using tidy data principles. You will apply these skills by performing sentiment analysis in four case studies, on text data from Twitter to TV news to Shakespeare. These case studies will allow you to practice important data handling skills, learn about the ways sentiment analysis can be applied, and extract relevant insights from real-world data.

    Learning objectives
    • Learn the principles of sentiment analysis from a tidy data perspective
    • Practice manipulating and visualizing text data using dplyr and ggplot2
    • Apply sentiment analysis skills to several real-world text datasets

    Check the course out, have fun, and start practicing those text mining skills!



    Recreating and updating Minard with ggplot2

    Wed, 08/23/2017 - 23:59

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    Minard's chart depicting Napoleon's 1812 march on Russia is a classic of data visualization that has inspired many homages using different time-and-place data. If you'd like to recreate the original chart, or create one of your own, Andrew Heiss has created a tutorial on using the ggplot2 package to re-envision the chart in R:

The R script provided in the tutorial is driven by historical data on the location and size of Napoleon's armies during the 1812 campaign, but you could adapt the script to use new data as well. Andrew also shows how to combine the chart with a geographical or satellite map, which is how the cities appear in Andrew's version (unlike in Minard's original).

    The data behind the Minard chart is available from Michael Friendly and you can find the R scripts at this Github repository. For the complete tutorial, follow the link below.

    Andrew Heiss: Exploring Minard’s 1812 plot with ggplot2 (via Jenny Bryan)

     



    Basics of data.table: Smooth data exploration

    Wed, 08/23/2017 - 18:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    The data.table package provides perhaps the fastest way for data wrangling in R. The syntax is concise and is made to resemble SQL. After studying the basics of data.table and finishing this exercise set successfully you will be able to start easing into using data.table for all your data manipulation needs.
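As a quick orientation before the exercises (a toy sketch, not part of the exercise set), data.table queries follow the general form DT[i, j, by]: filter rows with i, compute j, grouped by by.

library(data.table)

# Toy illustration of DT[i, j, by] using mtcars (not the Census data below):
# mean mpg and count of 4-cylinder cars, grouped by number of gears
dt <- as.data.table(mtcars)
dt[cyl == 4, .(mean_mpg = mean(mpg), n_cars = .N), by = gear]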

We will use data drawn from the 1980 US Census on married women aged 21–35 with two or more children. The data include the gender of the first and second child, whether the woman had more than two children, as well as her race, age and the number of weeks she worked in 1979. For more information, please refer to the reference manual of the AER package.

    Answers are available here.

    Exercise 1
Load the data.table package. Furthermore, (install and) load the AER package and run the command data("Fertility"), which loads the dataset Fertility into your workspace. Turn it into a data.table object.

    Exercise 2
Select rows 35 to 50 and print their age and work entries to the console.

    Exercise 3
Select the last row in the dataset and print it to the console.

    Exercise 4
    Count how many women proceeded to have a third child.

    Learn more about the data.table package in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to

    • work with different data manipulation packages,
    • know how to import, transform and prepare your dataset for modelling,
    • and much more.

    Exercise 5
    There are four possible gender combinations for the first two children. Which is the most common? Use the by argument.

    Exercise 6
By racial composition, what is the proportion of women who worked four weeks or less in 1979?

    Exercise 7
Use %between% to get a subset of women aged between 22 and 24, and calculate the proportion who had a boy as their firstborn.

    Exercise 8
    Add a new column, age squared, to the dataset.

    Exercise 9
Out of all the racial compositions in the dataset, which had the lowest proportion of boys as the firstborn? With the same command, also display the number of observations in each category.

    Exercise 10
Calculate the proportion of women who had a third child, by gender combination of the first two children.

    Related exercise sets:
    1. Vector exercises
    2. Data frame exercises Vol. 2
    3. Instrumental Variables in R exercises (Part-1)
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory


    Going Bayes #rstats

    Wed, 08/23/2017 - 15:07

    (This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

Some time ago I started working with Bayesian methods, using the great rstanarm-package. Besides the fantastic package vignettes, and books like Statistical Rethinking or Doing Bayesian Data Analysis, I also found the resources from Tristan Mahr helpful for better understanding both Bayesian analysis and rstanarm. This motivated me to implement tools for Bayesian analysis in my packages as well.

Due to the latest tidyr update, I had to update some of my packages to make them work again, so – besides some other features – some Bayes-related functionality is now available in my packages on CRAN.

    Finding shape or location parameters from distributions

The following functions are included in the sjstats-package. Given some known quantiles or percentiles, or a certain value or ratio and its standard error, the functions find_beta(), find_normal() or find_cauchy() help find the parameters of a distribution. Taking the example from here, the plot indicates that the mean value of the normal distribution is somewhat above 50. We can find the exact parameters with find_normal(), using the information given in the text:

library(sjstats)
find_normal(x1 = 30, p1 = .1, x2 = 90, p2 = .8)
#> $mean
#> [1] 53.78387
#>
#> $sd
#> [1] 30.48026

High Density Intervals for MCMC samples

The hdi()-function computes the high density interval for posterior samples. This is nothing special, since other packages offer such functions as well – however, you can use this function not only on vectors, but also on stanreg-objects (i.e. the results of models fitted with rstanarm). And, if required, you can also transform the HDI values, e.g. if you need these intervals on an exponentiated scale.

library(rstanarm)

fit <- stan_glm(mpg ~ wt + am, data = mtcars, chains = 1)
hdi(fit)
#>          term   hdi.low  hdi.high
#> 1 (Intercept) 32.158505 42.341421
#> 2          wt -6.611984 -4.022419
#> 3          am -2.567573  2.343818
#> 4       sigma  2.564218  3.903652

# fit logistic regression model
fit <- stan_glm(
  vs ~ wt + am,
  data = mtcars,
  family = binomial("logit"),
  chains = 1
)
hdi(fit, prob = .89, trans = exp)
#>          term      hdi.low     hdi.high
#> 1 (Intercept) 4.464230e+02 3.725603e+07
#> 2          wt 6.667981e-03 1.752195e-01
#> 3          am 8.923942e-03 3.747664e-01

Marginal effects for rstanarm-models

The ggeffects-package creates tidy data frames of model predictions, which are ready to use with ggplot (though there's a plot()-method as well). ggeffects supports a wide range of models and makes it easy to plot marginal effects for specific predictors, including interaction terms. In the past updates, support for more model types was added, for instance polr (pkg MASS), hurdle and zeroinfl (pkg pscl), betareg (pkg betareg), truncreg (pkg truncreg), coxph (pkg survival) and stanreg (pkg rstanarm).

ggpredict() is the main function that computes marginal effects. Predictions for stanreg-models are based on the posterior distribution of the linear predictor (posterior_linpred()), mostly for convenience reasons. It is recommended to use the posterior predictive distribution (posterior_predict()) for inference and model checking, and you can do so with the ppd-argument when calling ggpredict(); however, especially for binomial or Poisson models, it is harder (and much slower) to compute the "confidence intervals". That's why relying on posterior_linpred() is the default for stanreg-models in ggpredict().
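For instance, switching to the posterior predictive distribution only requires that argument (a sketch, not from the original post, assuming the model m fitted in the example below):

library(ggeffects)

# Sketch: predictions from the posterior predictive distribution via ppd = TRUE,
# using the stan_glm model m from the example that follows
dat_ppd <- ggpredict(m, terms = c("c12hour", "c161sex"), ppd = TRUE)
plot(dat_ppd)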

    Here is an example with two plots, one without raw data and one including data points:

library(sjmisc)
library(rstanarm)
library(ggeffects)

data(efc)

# make categorical
efc$c161sex <- to_label(efc$c161sex)

# fit model
m <- stan_glm(neg_c_7 ~ c160age + c12hour + c161sex, data = efc)

dat <- ggpredict(m, terms = c("c12hour", "c161sex"))
dat
#> # A tibble: 128 x 5
#>        x predicted conf.low conf.high  group
#>  1     4  10.80864 10.32654  11.35832   Male
#>  2     4  11.26104 10.89721  11.59076 Female
#>  3     5  10.82645 10.34756  11.37489   Male
#>  4     5  11.27963 10.91368  11.59938 Female
#>  5     6  10.84480 10.36762  11.39147   Male
#>  6     6  11.29786 10.93785  11.61687 Female
#>  7     7  10.86374 10.38768  11.40973   Male
#>  8     7  11.31656 10.96097  11.63308 Female
#>  9     8  10.88204 10.38739  11.40548   Male
#> 10     8  11.33522 10.98032  11.64661 Female
#> # ... with 118 more rows

plot(dat)
plot(dat, rawdata = TRUE)

As you can see, if you work with labelled data, the model-fitting functions from the rstanarm-package preserve all value and variable labels, making it easy to create annotated plots. The "confidence bands" are actually high density intervals, computed with the hdi()-function mentioned above.

    Next…

Next I will integrate ggeffects into my sjPlot-package, making sjPlot more generic and supporting more model types. Furthermore, sjPlot shall get a generic plot_model()-function which will replace former single functions like sjp.lm(), sjp.glm(), sjp.lmer() or sjp.glmer(). plot_model() should then produce a plot, whether marginal effects, forest plots, interaction terms and so on, and accept (m)any model classes. This should make sjPlot more convenient to work with, more stable and easier to maintain…

    Tagged: Bayes, data visualization, ggplot, R, rstanarm, rstats, sjPlot, Stan

    var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    To leave a comment for the author, please follow the link and comment on their blog: R – Strenge Jacke!. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Rcpp now used by 10 percent of CRAN packages

    Wed, 08/23/2017 - 13:18

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    Over the last few days, Rcpp passed another noteworthy hurdle. It is now used by over 10 percent of packages on CRAN (as measured by Depends, Imports and LinkingTo, but excluding Suggests). As of this morning 1130 packages use Rcpp out of a total of 11275 packages. The graph on the left shows the growth of both outright usage numbers (in darker blue, left axis) and relative usage (in lighter blue, right axis).
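As an aside (a sketch, not Dirk's own measurement code), a share of this kind can be computed directly from the CRAN package database that ships with base R's tools package:

# Count CRAN packages whose Depends, Imports or LinkingTo mention Rcpp
db <- tools::CRAN_package_db()
uses_rcpp <- grepl("\\bRcpp\\b", paste(db$Depends, db$Imports, db$LinkingTo))
sum(uses_rcpp)                              # packages using Rcpp
round(100 * sum(uses_rcpp) / nrow(db), 1)   # share of CRAN packages, in percent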

Older posts on this blog took note when Rcpp passed earlier round-number milestones, most recently in April for 1000 packages. The growth rates of both Rcpp and, of course, CRAN are still staggering. A big thank you to everybody who makes this happen, from R Core and CRAN to all package developers, contributors, and of course all users driving this. We have built ourselves a rather impressive ecosystem.

    So with that a heartfelt Thank You! to all users and contributors of R, CRAN, and of course Rcpp, for help, suggestions, bug reports, documentation, encouragement, and, of course, code.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



    Gender roles in film direction, analyzed with R

    Tue, 08/22/2017 - 23:33

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    What do women do in films? If you analyze the stage directions in film scripts — as Julia Silge, Russell Goldenberg and Amber Thomas have done for this visual essay for ThePudding — it seems that women (but not men) are written to snuggle, giggle and squeal, while men (but not women) shoot, gallop and strap things to other things.  

    This is all based on an analysis of almost 2,000 film scripts mostly from 1990 and after. The words come from pairs of words beginning with "he" and "she" in the stage directions (but not the dialogue) in the screenplays — directions like "she snuggles up to him, strokes his back" and "he straps on a holster under his sealskin cloak". The essay also includes an analysis of words by the writer and character's gender, and includes lots of lovely interactive elements (including the ability to see examples of the stage directions).

The analysis, including the chart above, was created using the R language, and the R code is available on GitHub. The screenplay analysis makes use of the tidytext package, which simplifies the process of handling the text-based data (the screenplays), extracting the stage directions, and tabulating the word pairs.
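As a toy illustration of the general approach (a sketch using tidytext bigrams, with the two stage directions quoted above as the only input; it is not the authors' pipeline), the "he"/"she" word pairs can be extracted like this:

library(dplyr)
library(tidyr)
library(tidytext)

# Two example stage directions (quoted in the text above)
directions <- tibble::tibble(
  text = c("She snuggles up to him, strokes his back",
           "He straps on a holster under his sealskin cloak")
)

# Split into bigrams, keep pairs starting with "he" or "she", and count them
directions %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% c("he", "she")) %>%
  count(word1, word2, sort = TRUE)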

    You can find the complete essay linked below, and it's well worth checking out to experience the interactive elements.

    ThePudding: She Giggles, He Gallops


