Tips and tricks on using R to query data in Power BI
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
In Power BI, the dashboarding and reporting tool, you can use R to filter, transform, or restructure data via the Query Editor. For example, you could use the mice package to impute missing values, or use the tidytext package to assign sentiment scores to text inputs. As Imke Feldmann explains, there are lots of useful tricks you can accomplish using R in the query editor, including:
 Passing multiple Power BI tables into the R script
 Parameterizing R scripts
 Modifying file names and output data types for use in Power BI
 Generating multiple table outputs from an R script
For the details on these tips, follow the link below. You can also find tips on using R in the Query Editor here.
The Biccountant: Tips and Tricks for R scripts in the query editor in Power BI
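For a sense of what such a step looks like: when you add a "Run R script" step in the Query Editor, Power BI passes the current table in as a data frame named dataset, and data frames left in the environment can be loaded as output tables. Below is a minimal sketch of an imputation step with mice; it is only an illustration and assumes dataset has numeric columns with missing values.
# "Run R script" step in the Power BI Query Editor
# 'dataset' is the data frame Power BI hands the script from the previous step
library(mice)
imp     <- mice(dataset, m = 1, maxit = 5, seed = 123)  # impute missing values
imputed <- complete(imp)                                # data frames left here become output tables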
How much will that Texas rain be
PUTTING 35 INCHES OF RAIN IN PERSPECTIVE
We are always interested in putting numbers into perspective, so we were interested in this article, in which they put Hurricane Harvey’s rain into perspective.
They’re predicting 30-40 inches of rain in a few days in Texas. They asked an expert to put that into perspective and he said:
Let’s put it in context. Much of the Northeast Corridor — Washington to New York and Boston — maybe receives maybe between 40 and 45 inches of rain a year. Think of all the rain you get in July through Christmas and put that in a couple days. It’s a lot of rain.
It’s easy for us to think in terms of New York City, so we looked up some weather data. See the table at the top of this post (all figures are in inches).
The first thing we notice is that the expert understated things, for New York at least. Thirty-five inches would be equivalent to all the rain in NYC from April (not July) to December, inclusive.
But we agree that it’s a lot of rain.
We’ve always had trouble putting rain forecasts into perspective, so here are some rules of thumb we figured out from the data that we’re going to memorize. If you live in the corridor from DC to Boston, you may find these useful.
 The average amount of rain per rainy day in NYC is 0.38 inches, which conveniently is about 1 cm.
 When you hear it’s going to rain 1 cm or 3/8 inch, you can think “no big deal, that’s a typical NYC rainy day”.
 If you hear it’s going to rain an inch, you can think “oh darn, that’s like three rainy days worth”.
Here’s Rmarkdown code if you want to play around:
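The post’s R Markdown source is not reproduced here; as a rough stand-in, this minimal sketch computes the same rules of thumb, assuming a daily precipitation file with a PRCP column in inches (the file name and column names are hypothetical).
# Hypothetical daily precipitation export (e.g., from NOAA) with a PRCP column in inches
daily <- read.csv("nyc_central_park_daily_precip.csv")
rainy_days <- daily$PRCP[daily$PRCP > 0]   # keep only days with measurable rain
mean(rainy_days)           # average rain per rainy day, in inches
mean(rainy_days) * 2.54    # the same figure in centimeters
1 / mean(rainy_days)       # how many typical rainy days a one-inch forecast corresponds to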
The post How much will that Texas rain be appeared first on Decision Science News.
How to prepare and apply machine learning to your dataset
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
Dear reader,
If you are a newbie in the world of machine learning, then this tutorial is exactly what you need in order to introduce yourself to this exciting new part of the data science world.
This post includes a full machine learning project that will guide you step by step to create a “template,” which you can use later on other datasets.
In this step-by-step tutorial you will:
1. Use one of the most popular machine learning packages in R.
2. Explore a dataset by using statistical summaries and data visualization.
3. Build 5 machine learning models, pick the best, and build confidence that the accuracy is reliable.
The process of a machine learning project may not be exactly the same, but there are certain standard and necessary steps:
1. Define Problem.
2. Prepare Data.
3. Evaluate Algorithms.
4. Improve Results.
5. Present Results.
1. PACKAGE INSTALLATION & DATA SET
The first thing you have to do is install and load the “caret” package with:
install.packages("caret")
library(caret)
Moreover, we need a dataset to work with. The dataset we chose in our case is “iris,” which contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers belong to one of three species. To attach it to the environment, use:
data(iris)
dataset <- iris  # the rest of the tutorial refers to the data as 'dataset'
1.1 Create a Validation Dataset
First of all, we need to know whether the models we create are any good. Later, we will use statistical methods to estimate the accuracy of the models we create on unseen data. To be sure about the accuracy of the best model on unseen data, we will also evaluate it on actual unseen data. To do this, we will hold back some data that the algorithms will not see, and use it later to get a second, independent idea of how accurate the best model really is.
We will split the loaded dataset into two, 80% of which we will use to train our models and 20% of which we will hold back as a validation dataset. Look at the example below:
#create a list of 80% of rows in the original dataset to use them for training
validation_index <- createDataPartition(dataset$Species, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- dataset[-validation_index,]
# use the remaining 80% of the data to train and test the models
dataset <- dataset[validation_index,]
You now have training data in the dataset variable and a validation set that will be used later in the validation variable.
Learn more about machine learning in the online course Beginner to Advanced Guide on Machine Learning with R Tool. In this course you will learn how to:
 Create a machine learning algorithm from a beginner's point of view
 Quickly dive into more advanced methods at an accessible pace, with more explanations
 And much more
This course shows a complete workflow from start to finish. It is a great introduction, and a useful fallback once you have some experience.
2. DATASET SUMMARY
In this step, we are going to explore our data set. More specifically, we need to know certain features of our dataset, like:
1. Dimensions of the dataset.
2. Types of the attributes.
3. Details of the data.
4. Levels of the class attribute.
5. Analysis of the instances in each class.
6. Statistical summary of all attributes.
2.1 Dimensions of Dataset
We can see how many instances (rows) and how many attributes (columns) the data contains with the dim function. Look at the example below:
dim(dataset)
2.2 Types of Attributes
Knowing the types is important, as it can help you summarize the data you have and identify the transformations you might need to apply to prepare the data before modeling. The attributes could be doubles, integers, strings, factors, or other types. You can find them with:
sapply(dataset, class)
2.3 Details of the Data
You can take a look at the first six rows of the data with:
head(dataset)
2.4 Levels of the Class
The class variable is a factor that has multiple class labels or levels. Let’s look at the levels:
levels(dataset$Species)
There are two types of classification problems: multinomial, like this one, and binary, which would be the case if there were only two levels.
2.5 Class Distribution
Let’s now take a look at the number of instances that belong to each class. We can view this as an absolute count and as a percentage with:
percentage <- prop.table(table(dataset$Species)) * 100
cbind(freq=table(dataset$Species), percentage=percentage)
2.6 Statistical Summary
This includes the mean, the min and max values, as well as some percentiles. Look at the example below:
summary(dataset)
3. DATASET VISUALIZATION
We now have a basic idea about the data. We need to extend that with some visualizations, and for that reason we are going to use two types of plots:
1. Univariate plots to understand each attribute.
2. Multivariate plots to understand the relationships between attributes.
3.1 Univariate Plots
We can look at the input attributes and the output attribute separately. Let's set that up and call the input attributes x and the output attribute y.
x <- dataset[,1:4]
y <- dataset[,5]
Since the input variables are numeric, we can create box and whisker plots of each one with:
par(mfrow=c(1,4))
for(i in 1:4) {
boxplot(x[,i], main=names(iris)[i])
}
We can also create a barplot of the Species class variable to graphically display the class distribution.
plot(y)
3.2 Multivariate Plots
First, we create scatterplots of all pairs of attributes and color the points by class. Then, we can draw ellipses around them to make them more easily separated.
You have to install and call the “ellipse” package to do this.
install.packages("ellipse")
library(ellipse)
featurePlot(x=x, y=y, plot="ellipse")
We can also create box and whisker plots of each input variable, but this time they are broken down into separate plots for each class.
featurePlot(x=x, y=y, plot="box")
Next, we can get an idea of the distribution of each attribute. We will use some probability density plots to give smooth lines for each distribution.
scales < list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x, y=y, plot="density", scales=scales)
4. ALGORITHMS EVALUATION
Now it is time to create some models of the data and estimate their accuracy on unseen data.
1. Set up a test harness that uses 10-fold cross-validation.
2. Build 5 different models to predict species from flower measurements.
3. Select the best model.
4.1 Test Harness
This will split our dataset into 10 parts, train on 9, test on 1, and repeat for all combinations of train-test splits.
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
We are using "Accuracy" as the metric to evaluate models: the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage.
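For example, the same calculation done by hand on two tiny hypothetical vectors (not part of the caret workflow):
actual    <- factor(c("setosa", "setosa", "versicolor", "virginica"))
predicted <- factor(c("setosa", "versicolor", "versicolor", "virginica"))
mean(predicted == actual) * 100   # 3 of 4 correct -> 75 (percent accuracy)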
4.2 Build Models
We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that we created earlier.
Algorithms evaluation:
1. Linear Discriminant Analysis (LDA)
2. Classification and Regression Trees (CART).
3. k-Nearest Neighbors (kNN).
4. Support Vector Machines (SVM) with a radial kernel, as used in the code below.
5. Random Forest (RF)
This is a good mixture of simple linear (LDA), non-linear (CART, kNN) and complex non-linear methods (SVM, RF). We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. This ensures the results are directly comparable.
NOTE: To proceed, first install and load the following packages: “rpart”, “kernlab”, “e1071” and “randomForest”.
Let’s build our five models:
# a) linear algorithms
set.seed(7)
fit.lda <- train(Species~., data=dataset, method="lda", metric=metric, trControl=control)
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(Species~., data=dataset, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(Species~., data=dataset, method="knn", metric=metric, trControl=control)
# c) advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(Species~., data=dataset, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(Species~., data=dataset, method="rf", metric=metric, trControl=control)
4.3 Select the Best Model
We now have 5 models and accuracy estimations for each so we have to compare them.
It is a good idea to create a list of the created models and use the summary function.
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
Moreover, we can create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times.
dotplot(results)
You can summarize the results for just the LDA model, which seems to be the most accurate.
print(fit.lda)
5. Make Predictions
The LDA was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.
We can run the LDA model directly on the validation set and summarize the results in a confusion matrix.
predictions <- predict(fit.lda, validation)
confusionMatrix(predictions, validation$Species)
 Vector exercises
 Evaluate your model with R Exercises
 Neural networks Exercises (Part2)
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
BH 1.65.0-1
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
The BH package on CRAN was updated today to version 1.65.0-1. BH provides a sizeable portion of the Boost C++ libraries as a set of template headers for use by R, possibly with Rcpp as well as other packages.
This release upgrades the version of Boost to the rather new upstream version Boost 1.65.0 released earlier this week, and adds two new libraries: align and sort.
I had started the upgrade process a few days ago under release 1.64.0. Rigorous checking of reverse dependencies showed that mvnfast needed a small change (which was trivial: just seeding the RNG prior to running tests), which Matteo did in no time with a fresh CRAN upload. rstan needs a bit more work but should be ready real soon now, and we are awaiting a new version. And once I switched to the just-released Boost 1.65.0, it became apparent that Cyclops no longer needs its embedded copy of Boost iterator—and Marc already made that change with yet another fresh CRAN upload. It is a true pleasure to work in such a responsive and collaborative community.
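As a quick illustration of how BH is typically consumed (a minimal sketch, not taken from the post): because BH ships Boost as headers only, an R session or package just declares the dependency and includes the Boost headers it wants.
library(Rcpp)
# Compile a small C++ snippet against the BH headers; boost::lexical_cast is one
# of the header-only Boost libraries provided by BH.
cppFunction(depends = "BH",
            includes = "#include <boost/lexical_cast.hpp>",
            code = '
              double lexcast_to_double(std::string s) {
                return boost::lexical_cast<double>(s);   // string -> double via Boost
              }')
lexcast_to_double("42.5")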
Changes in version 1.65.0-1 (2017-08-24)
 Upgraded to Boost 1.64 and then 1.65, installed directly from upstream source with several minor tweaks (as before)
 Fourth tweak corrects a misplaced curly brace (see the Boost ublas GitHub repo and its issue #40)
 Added Boost multiprecision by fixing a script typo (as requested in #42)
 Updated Travis CI support via newer run.sh
Via CRANberries, there is a diffstat report relative to the previous release.
Comments and suggestions are welcome via the mailing list or the issue tracker at the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
Newer to R? rstudio::conf 2018 is for you! Early bird pricing ends August 31.
(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)
Immersion is among the most effective ways to learn any language. Immersing yourself where new and advanced users come together to improve their use of the R language is a rare opportunity. rstudio::conf 2018 is that time and place!
Be an Early Bird! Discounts for early conference registration expire August 31.
Immerse as a team! Ask us about group discounts for 5 or more from the same organization.
rstudio::conf 2018 is a two-day conference with optional two-day workshops. One of the conference tracks will focus on topics for newer R users. Newer R users will learn the best ways to use R, how to avoid common pitfalls, and how to accelerate proficiency. Several workshops are also designed specifically for those newer to R.
Intro to R & RStudio
Are you new to R & RStudio and do you learn best in person? You will learn the basics of R and data science, and practice using the RStudio IDE (integrated development environment) and R Notebooks. We will have a team of TAs on hand to show you the ropes and help you out when you get stuck.
This course is taught by well-known R educator and friend of RStudio, Amelia McNamara, a Smith College Visiting Assistant Professor of Statistical and Data Sciences & Mass Mutual Faculty Fellow.

Are you ready to begin applying the book, R for Data Science? Learn how to achieve your data analysis goals the “tidy” way. You will visualize, transform, and model data in R and work with datetimes, character strings, and untidy data formats. Along the way, you will learn and use many packages from the tidyverse including ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, lubridate, and forcats.
This course is taught by friend of RStudio, Charlotte Wickham, a professor and award-winning teacher and data analyst at Oregon State University.

Do you want to share your data analysis with others in effective ways? For people who know their way around the RStudio IDE and R at least a little, this workshop will help you become proficient in Shiny application development and using R Markdown to communicate insights from data analysis to others.
This course is taught by Mine Çetinkaya-Rundel, Duke professor and RStudio professional educator. Mine is well known for her open education efforts and her popular data science MOOCs.
Whether you are new to the R language or as advanced as many of our speakers and educators, rstudio::conf 2018 is the place and time to focus on all things R & RStudio.
We hope to see you in San Diego!
Gradient boosting in R
(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)
Boosting is another famous ensemble learning technique. Unlike Bagging, it is not primarily concerned with reducing the variance of learners; in Bagging, the aim is to reduce the high variance of learners by averaging lots of models fitted on bootstrapped data samples generated with replacement from the training data, so as to avoid overfitting.
Another major difference between the two techniques is that in Bagging the various models are generated independently of each other and have equal weight, whereas Boosting is a sequential process in which each new model is added so as to improve a bit on the previous one. Simply put, each model added to the mix tries to improve on the performance of the previous collection of models. In Boosting we do weighted averaging.
Both ensemble techniques have in common that they generate lots of models on the training data and use their combined power to increase the accuracy of the final model, which is formed by combining them.
Boosting, however, is geared more towards reducing bias, i.e., it works with simple learners, or more specifically weak learners. A weak learner is a learner that always learns something, i.e., it does better than chance and has an error rate of less than 50%. The best example of a weak learner is a decision tree. This is the reason we generally use ensemble techniques on decision trees to improve their accuracy and performance.
In Boosting each tree or model is grown or trained using the hard examples; by hard I mean all the training examples \( (x_i, y_i) \) for which a previous model produced incorrect output \(Y\). Boosting boosts the performance of a simple base learner by iteratively shifting the focus towards problematic training observations that are difficult to predict. Information from the previous model is fed to the next model, and every new tree added to the mix will do better than the previous tree because it learns from the mistakes of the previous models and tries not to repeat them. Hence this technique eventually converts a weak learner into a strong learner, which is better and more accurate at generalizing to unseen test examples.
An important thing to remember in boosting is that the base learner being boosted should not be a complex, high-variance learner, e.g., a neural network with lots of nodes and large weight values. For such learners boosting will have the inverse effect.
So I will explain Boosting with respect to decision trees in this tutorial, because they can usually be regarded as weak learners. We will generate a gradient boosting model.
Gradient boosting generates learners using the same general boosting process. It first builds a learner to predict the values/labels of the samples and calculates the loss (the difference between the outcome of the first learner and the real value). It then builds a second learner to predict the loss after the first step, and the process continues with a third, fourth… learner until a certain threshold is reached. Gradient boosting identifies hard examples via the large residuals \( (y_{actual} - y_{pred}) \) computed in the previous iterations.
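Before turning to a package, the idea can be sketched by hand for squared-error loss, where the negative gradient is just the residual. The following is only a rough illustration of the mechanism (not the gbm implementation), using small rpart trees on the same Boston data used below:
library(rpart)
library(MASS)   # Boston housing data, also used later in this post
n_trees   <- 200
shrinkage <- 0.05
pred      <- rep(mean(Boston$medv), nrow(Boston))       # start from the mean
boost_df  <- Boston[, setdiff(names(Boston), "medv")]   # predictors only
for (b in seq_len(n_trees)) {
  boost_df$resid <- Boston$medv - pred                  # current residuals = the "hard" part
  tree <- rpart(resid ~ ., data = boost_df,
                control = rpart.control(maxdepth = 4, cp = 0))
  pred <- pred + shrinkage * predict(tree, boost_df)    # add a shrunken correction
}
mean((Boston$medv - pred)^2)                            # training MSE shrinks as trees are added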
Implementing Gradient Boosting
Let's use the gbm package in R to fit a gradient boosting model.
require(gbm)
require(MASS)  # package with the Boston housing dataset
# separating training and test data
train = sample(1:506, size = 374)
We will use the Boston housing data to predict the median value of the houses.
Boston.boost = gbm(medv ~ ., data = Boston[train,], distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 4)
Boston.boost
summary(Boston.boost)  # Summary gives a table of Variable Importance and a plot of Variable Importance

gbm(formula = medv ~ ., distribution = "gaussian", data = Boston[train, ], n.trees = 10000, interaction.depth = 4, shrinkage = 0.01)
A gradient boosted model with gaussian loss function.
10000 iterations were performed.
There were 13 predictors of which 13 had non-zero influence.

> summary(Boston.boost)
            var     rel.inf
rm           rm 36.96963915
lstat     lstat 24.40113288
dis         dis 10.67520770
crim       crim  8.61298346
age         age  4.86776735
black     black  4.23048222
nox         nox  4.06930868
ptratio ptratio  2.21423811
tax         tax  1.73154882
rad         rad  1.04400159
indus     indus  0.80564216
chas       chas  0.28507720
zn           zn  0.09297068

The above boosted model is a gradient boosted model which generates 10000 trees with the shrinkage parameter \(\lambda = 0.01\), which is also a sort of learning rate. The next parameter is the interaction depth, which is the total number of splits we want to do. So here each tree is a small tree with only 4 splits.
The summary of the model gives a feature importance plot. In the list above, the most important variable is at the top and the least important at the bottom.
The two most important features, which explain the most variance in the data set, are lstat, i.e., the lower status of the population (percent), and rm, the average number of rooms per dwelling.
Partial dependence plots tell us about the relationship and dependence of the variables \(X_i\) with the response variable \(Y\).
# Plot of response variable with lstat variable
plot(Boston.boost, i = "lstat")  # inverse relation with lstat variable
plot(Boston.boost, i = "rm")     # as the average number of rooms increases, the price increases

The above plots simply show the relation between the variable on the x-axis and the mapping function \(f(x)\) on the y-axis. The first plot shows that lstat is negatively correlated with the response medv, whereas the second one shows that rm is somewhat directly related to medv.
cor(Boston$lstat, Boston$medv)  # negative correlation coefficient
cor(Boston$rm, Boston$medv)     # positive correlation coefficient

> cor(Boston$lstat, Boston$medv)
[1] -0.7376627
> cor(Boston$rm, Boston$medv)
[1] 0.6953599

Prediction on the Test Set
We will compute the test error as a function of the number of trees.
n.trees = seq(from = 100, to = 10000, by = 100)  # number of trees - a vector of 100 values

# Generating a Prediction matrix for each Tree
predmatrix <- predict(Boston.boost, Boston[-train,], n.trees = n.trees)
dim(predmatrix)  # dimensions of the Prediction Matrix

# Calculating the Mean squared Test Error
test.error <- with(Boston[-train,], apply((predmatrix - medv)^2, 2, mean))
head(test.error)  # contains the Mean squared test error for each of the 100 trees averaged

# Plotting the test error vs number of trees
plot(n.trees, test.error, pch = 19, col = "blue", xlab = "Number of Trees", ylab = "Test Error", main = "Performance of Boosting on Test Set")

# adding the Random Forests Minimum Error line trained on same data and similar parameters
abline(h = min(test.err), col = "red")  # test.err is the test error of a Random Forest fitted on the same data
legend("topright", c("Minimum Test error Line for Random Forests"), col = "red", lty = 1, lwd = 1)

dim(predmatrix)
[1] 206 100
head(test.error)
      100       200       300       400       500       600
26.428346 14.938232 11.232557  9.221813  7.873472  6.911313

In the above plot the red line represents the least error obtained from training a Random Forest with the same data, the same parameters, and the same number of trees. Boosting outperforms Random Forests on the same test dataset with lower mean squared test errors.
Conclusion
From the above plot we can see that if boosting is done properly, by selecting appropriate tuning parameters such as the shrinkage parameter \(\lambda\), the number of splits we want, and the number of trees \(n\), then it can generalize really well and convert a weak learner into a strong learner. Ensembling techniques generate hundreds or thousands of learners and then combine them to produce a better and stronger model, and they tend to outperform a single learner, which is prone to either overfitting or underfitting.
Hope you guys liked the article, make sure to like and share. Cheers!!.
Related Post
 Radial kernel Support Vector Classifier
 Random Forests in R
 Network analysis of Game of Thrones
 Structural Changes in Global Warming
 Deep Learning with R
Linear Congruential Generator in R
(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)
Part 1 in the series Random Number Generation
A linear congruential generator (LCG) is a class of pseudorandom number generator (PRNG) algorithms used for generating sequences of random-like numbers. The generation of random numbers plays a large role in many applications ranging from cryptography to Monte Carlo methods. Linear congruential generators are one of the oldest and most well-known methods for generating random numbers, primarily due to their comparative ease of implementation and speed and their need for little memory. Other methods, such as the Mersenne Twister, are much more common in practical use today.
Linear congruential generators are defined by the recurrence relation
$$X_{n+1} = (a X_n + c) \bmod m$$
There are many choices for the parameters \(m\), the modulus, \(a\), the multiplier, and \(c\), the increment. Wikipedia has a seemingly comprehensive list of the parameters currently in use in common programs.
Aside: ‘Pseudorandom’ and Selecting a Seed Number
Random number generators such as LCGs are known as ‘pseudorandom’ as they require a seed number to generate the random sequence. Due to this requirement, random number generators today are not truly ‘random.’ The theory and optimal selection of a seed number are beyond the scope of this post; however, a common choice suitable for our application is to take the current system time in microseconds.
A Linear Congruential Generator Implementation in R
The parameters we will use for our implementation of the linear congruential generator are the same as in the ANSI C implementation (Saucier, 2000).
The following function is an implementation of a linear congruential generator with the given parameters above.
lcg.rand <- function(n = 10) {
  rng <- vector(length = n)

  # ANSI C parameters (Saucier, 2000)
  m <- 2 ** 32
  a <- 1103515245
  c <- 12345

  # Set the seed using the current system time in microseconds
  d <- as.numeric(Sys.time()) * 1000

  for (i in 1:n) {
    d <- (a * d + c) %% m
    rng[i] <- d / m
  }

  return(rng)
}

We can use the function to generate random numbers on the half-open interval [0, 1).
# Print 10 random numbers on the half-open interval [0, 1)
lcg.rand()
## [1] 0.4605103 0.6643705 0.6922703 0.4603930 0.1842995 0.6804419 0.8561535
## [8] 0.2435846 0.8236771 0.9643965

We can also demonstrate how apparently ‘random’ the LCG is by plotting a sample generation in 3 dimensions. To do this, we generate three random vectors using our LCG above and plot them. The plot3D package is used to create the scatterplot, and the animation package is used to animate each scatterplot as the length of the random vectors, \(n\), increases.
library(plot3D)
library(animation)

n <- c(3, 10, 20, 100, 500, 1000, 2000, 5000, 10000, 20000)

saveGIF({
  for (i in 1:length(n)) {
    x <- lcg.rand(n[i])
    y <- lcg.rand(n[i])
    z <- lcg.rand(n[i])
    scatter3D(x, y, z, colvar = NULL, pch = 20, cex = 0.5, theta = 20, main = paste('n = ', n[i]))
  }
}, movie.name = 'lcg.gif')

As \(n\) increases, the LCG appears to be ‘random’ enough, as demonstrated by the cloud of points.
Linear Congruential Generators with Poor Parameters
The values chosen for the parameters are very important in determining how ‘random’ the values generated by the linear congruential generator are; that is why there are so many different parameter sets in use today, as there has not yet been a clear consensus on the ‘best’ parameters to use.
We can demonstrate how choosing poor parameters for our LCG leads to not so random generated values by creating a new LCG function.
lcg.poor <- function(n = 10) {
  rng <- vector(length = n)

  # Parameters taken from https://www.mimuw.edu.pl/~apalczew/CFP_lecture3.pdf
  m <- 2048
  a <- 1229
  c <- 1

  d <- as.numeric(Sys.time()) * 1000

  for (i in 1:n) {
    d <- (a * d + c) %% m
    rng[i] <- d / m
  }

  return(rng)
}

Generating successively longer vectors using the ‘poor’ LCG and plotting as we did previously, we see the generated points are very sequentially correlated, and there doesn’t appear to be any ‘randomness’ at all as \(n\) increases.
n <- c(3, 10, 20, 100, 500, 1000, 2000, 5000, 10000, 20000)

saveGIF({
  for (i in 1:length(n)) {
    x <- lcg.poor(n[i])
    y <- lcg.poor(n[i])
    z <- lcg.poor(n[i])
    scatter3D(x, y, z, colvar = NULL, pch = 20, cex = 0.5, theta = 20, main = paste('n = ', n[i]))
  }
}, movie.name = 'lcg_poor.gif')

References
Saucier, R. (2000). Computer Generation of Statistical Distributions (1st ed.). Aberdeen, MD: Army Research Lab.
The post Linear Congruential Generator in R appeared first on Aaron Schlegel.
Calculating a fuzzy kmeans membership matrix with R and Rcpp
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
by Błażej Moska, computer science student and data science intern
Suppose that we have performed k-means clustering in R and are satisfied with our results, but later we realize that it would also be useful to have a membership matrix. Of course it would be easier to repeat the clustering using one of the fuzzy k-means functions available in R (like fanny, for example), but since that is a slightly different implementation the results could also be different, and for some reason we don’t want them to change. Knowing the equation, we can construct this matrix on our own after using the kmeans function. The equation is defined as follows (source: Wikipedia):
$$w_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{\| x_i - c_j \|}{\| x_i - c_k \|} \right)^{\frac{2}{m-1}}}$$
\(w_{ij}\) denotes to what extent the \(i\)th object belongs to the \(j\)th cluster. So the total number of rows of this matrix equals the number of observations, and the total number of columns equals the number of clusters. \(m\) is a parameter, typically set to \(m = 2\). \(w_{ij}\) values range between 0 and 1, so they are easy and convenient to compare. In this example I will use \(m = 2\), and the Euclidean distance will be calculated.
To make the computations faster I also used the Rcpp package, and then compared the speed of execution of the function written in R with the one written in C++.
In the implementations, for loops were used; although this is not a commonly used approach in R (see this blog post for more information and alternatives), in this case I find it more convenient.
Rcpp (C++ version)
#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix fuzzyClustering(NumericMatrix data, NumericMatrix centers, int m) {
  /* data is a matrix with observations (rows) and variables,
     centers is a matrix with cluster center coordinates,
     m is a parameter of the equation, c is the number of clusters */
  int c = centers.rows();
  int rows = data.rows();
  int cols = data.cols();  /* number of columns equals number of variables, the same as in the centers matrix */
  double tempDist = 0;     /* dist and tempDist store temporary euclidean distances */
  double dist = 0;
  double denominator = 0;  // denominator of the "main" equation
  NumericMatrix result(rows, c);  // declaration of the matrix of results

  for(int i = 0; i < rows; i++){
    for(int j = 0; j < c; j++){
      for(int k = 0; k < c; k++){
        for(int p = 0; p < cols; p++){
          // in the innermost loop the euclidean distances are accumulated
          tempDist = tempDist + pow(centers(j,p) - data(i,p), 2);
          dist = dist + pow(centers(k,p) - data(i,p), 2);
          /* tempDist is the numerator inside the sum operator in the equation,
             dist is the denominator inside the sum operator in the equation */
        }
        tempDist = sqrt(tempDist);
        dist = sqrt(dist);
        denominator = denominator + pow((tempDist/dist), (2/(m-1)));
        tempDist = 0;
        dist = 0;
      }
      result(i,j) = 1/denominator;  // numerator/denominator of the main equation
      denominator = 0;
    }
  }
  return result;
}

We can save this in a file with a .cpp extension. To compile it from R we can write:
sourceCpp("path_to_cpp_file")

If everything goes right, our function fuzzyClustering will then be available from R.
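A minimal usage sketch, before looking at the pure-R version below (the simulated data here just mirrors the benchmark at the end of the post): run kmeans() first, then pass the data and the fitted centers to the compiled function.
set.seed(1)
data <- matrix(rnorm(30000 * 3), ncol = 3)     # simulated data, as in the benchmark below
km   <- kmeans(data, centers = 10)

W <- fuzzyClustering(data, km$centers, m = 2)  # membership matrix: rows = observations, columns = clusters
rowSums(W)[1:5]                                # each row sums to (approximately) 1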
R version
fuzzyClustering = function(data, centers, m) {
  c <- nrow(centers)
  rows <- nrow(data)
  cols <- ncol(data)
  result <- matrix(0, nrow = rows, ncol = c)  # defining membership matrix
  denominator <- 0

  for (i in 1:rows) {
    for (j in 1:c) {
      tempDist <- sqrt(sum((centers[j,] - data[i,])^2))  # euclidean distance, numerator inside the sum operator
      for (k in 1:c) {
        Dist <- sqrt(sum((centers[k,] - data[i,])^2))    # euclidean distance, denominator inside the sum operator
        denominator <- denominator + ((tempDist / Dist)^(2 / (m - 1)))  # denominator of the equation
      }
      result[i, j] <- 1 / denominator  # inserting value into membership matrix
      denominator <- 0
    }
  }
  return(result)
}

The result looks as follows. Columns are cluster numbers (in this case 10 clusters were created), rows are our objects (observations). Values were rounded to the third decimal place, so the sums of rows can be slightly different from 1:
          1     2     3     4     5     6     7     8     9    10
 [1,] 0.063 0.038 0.304 0.116 0.098 0.039 0.025 0.104 0.025 0.188
 [2,] 0.109 0.028 0.116 0.221 0.229 0.080 0.035 0.116 0.017 0.051
 [3,] 0.067 0.037 0.348 0.173 0.104 0.066 0.031 0.095 0.018 0.062
 [4,] 0.016 0.015 0.811 0.049 0.022 0.017 0.009 0.023 0.007 0.031
 [5,] 0.063 0.048 0.328 0.169 0.083 0.126 0.041 0.079 0.018 0.045
 [6,] 0.069 0.039 0.266 0.226 0.102 0.111 0.037 0.084 0.017 0.048
 [7,] 0.045 0.039 0.569 0.083 0.060 0.046 0.025 0.071 0.015 0.046
 [8,] 0.070 0.052 0.399 0.091 0.093 0.054 0.034 0.125 0.022 0.062
 [9,] 0.095 0.037 0.198 0.192 0.157 0.088 0.038 0.121 0.019 0.055
[10,] 0.072 0.024 0.132 0.375 0.148 0.059 0.025 0.081 0.015 0.067

Performance comparison
Shown below is the output of Sys.time for the C++ and R versions, running against a simulated matrix with 30,000 observations, 3 variables and 10 clusters.
The hardware I used was a low-cost notebook, an Asus R556L with an Intel Core i3-5010 2.1 GHz processor and 8 GB of DDR3 1600 MHz RAM.
C++ version:
   user  system elapsed
   0.32    0.00    0.33
R version:
   user  system elapsed
  15.75    0.02   15.94
In this example, the function written in C++ executed about 50 times faster than the equivalent function written in pure R.
Reticulating Readability
(This article was first published on R – rud.is, and kindly contributed to R-bloggers)
I needed to clean some web HTML content for a project, and I usually use hgr::clean_text() for that; it generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non-“main text content” from an HTML document. It usually does a good job, but there are some pages it fails miserably on, since it’s more of a brute-force method than one that uses any real “intelligence” when performing the text node targeting.
Most modern browsers have inherent or pluginable “readability” capability, and most of those are based — at least in part — on the seminal Arc90 implementation. Many programming languages have a package or module that use a similar methodology, but I’m not aware of any R ports.
What do I mean by “clean txt”? Well, I can’t show the URL I was having trouble processing but I can show an example using a recent rOpenSci blog post. Here’s what the raw HTML looks like after retrieving it:
library(xml2)
library(httr)
library(reticulate)
library(magrittr)

res <- GET("https://ropensci.org/blog/blog/2017/08/22/visdat")

content(res, as="text", encoding="UTF-8")
## [1] "\n \n\n\n \n \n \n \n \n \n\n \n\n \n \n\n \n \n \n \n \n \n \n try{Typekit.load();}catch(e){}\n \n hljs.initHighlightingOnLoad();\n\n Onboarding visdat, a tool for preliminary visualisation of whole dataframes\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
(it goes on for a bit, best to run the code locally)
We can use the reticulate package to load the Python readability module to just get the clean, article text:
readability < import("readability") # pip install readabilitylxml doc < readability$Document(httr::content(res, as="text", endoding="UTF8")) doc$summary() %>% read_xml() %>% xml_text() # [1] "Take a look at the dataThis is a phrase that comes up when you first get a dataset.It is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?Starting down either path, you often encounter the nontrivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.These growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.The visdat package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".Making visdat was fun, and it was easy to use. But I couldn't help but think that maybe visdat could be more. I felt like the code was a little sloppy, and that it could be better. I wanted to know whether others found it useful.What I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.Too much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci onboarding process provides.rOpenSci onboarding basicsOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with JOSS.What's in it for the author?Feedback on your packageSupport from rOpenSci membersMaintain ownership of your packagePublicity from it being under rOpenSciContribute something to rOpenSciPotentially a publicationWhat can rOpenSci do that CRAN cannot?The rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN 1. Here's what rOpenSci does that CRAN cannot:Assess documentation readability / usabilityProvide a code review to find weak points / points of improvementDetermine whether a package is overlapping with another.(again, it goes on for a bit, best to run the code locally)
That text is now in good enough shape to tidy.
Here’s the same version with clean_text():
# devtools::install_github("hrbrmstr/hgr") hgr::clean_text(content(res, as="text", endoding="UTF8")) ## [1] "Onboarding visdat, a tool for preliminary visualisation of whole dataframes\n \n \n \n \n \n \n \n \n \n August 22, 2017 \n \n \n \n \nTake a look at the data\n\n\nThis is a phrase that comes up when you first get a dataset.\n\nIt is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?\n\nStarting down either path, you often encounter the nontrivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.\n\nThese growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.\n\nThe package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".\n\nMaking was fun, and it was easy to use. But I couldn't help but think that maybe could be more.\n\n I felt like the code was a little sloppy, and that it could be better.\n I wanted to know whether others found it useful.\nWhat I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.\n\nToo much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci provides.\n\nrOpenSci onboarding basics\n\nOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with .\n\nWhat's in it for the author?\n\nFeedback on your package\nSupport from rOpenSci members\nMaintain ownership of your package\nPublicity from it being under rOpenSci\nContribute something to rOpenSci\nPotentially a publication\nWhat can rOpenSci do that CRAN cannot?\n\nThe rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN . Here's what rOpenSci does that CRAN cannot:\n\nAssess documentation readability / usability\nProvide a code review to find weak points / points of improvement\nDetermine whether a package is overlapping with another.(lastly, it goes on for a bit, best to run the code locally)
As you can see, even though that version is usable, readability does a much smarter job of cleaning the text.
The Python code is quite — heh — readable, and R could really use a native port (i.e. this would be a ++gd project for an aspiring package author to take on).
Big Data analytics with RevoScaleR Exercises
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
In this set of exercises, you will explore how to handle big data with the RevoScaleR package from Microsoft R (previously Revolution Analytics). It comes with Microsoft R Client, which you can get from here. Get the credit card fraud data set from Revolution Analytics and let's get started.
Answers to the exercises are available here. Please check the documentation before starting this exercise set.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
The heart of RevoScaleR is the xdf file format. Convert the credit card fraud data set into xdf format.
Exercise 2
Use the newly created xdf file to get information about the variables and print 10 rows to check the data.
Learn more about importing big data in the online course Data Mining with R: Go from Beginner to Advanced. In this course you will learn how to:
 work with different data import techniques,
 import data and transform it for a specific modeling or analysis goal,
 and much more.
Exercise 3
Use rxSummary to get the summary for the variables gender, balance, and cardholder where numTrans is greater than 10.
Exercise 4
Use rxDataStep to create a variable avgbalpertran, which is balance / (numTrans + numIntlTrans). Use rxGetInfo to check whether your changes are reflected in the xdf data.
Exercise 5
Use rxCor to find the correlation between the newly created variable and fraudRisk.
Exercise 6
Use rxLinMod to construct the linear regression of fraudRisk on gender, balance, and cardholder. Don't forget to check the summary of the model.
Exercise 7
Find the contingency table of fraudRisk and gender using rxCrossTabs. Hint: figure out how to include factors in the formula.
Exercise 8
Use rxCube to find the mean balance for each of the two genders.
Exercise 9
Create a histogram of balance from the xdf file, showing the relative frequencies.
Exercise 10
Create a two-panel histogram with gender and fraudRisk as explanatory variables to show the relative frequency of fraudRisk for the two genders.
 Vector exercises
 Cross Tabulation with Xtabs exercises
 Data Exploration with Tables exercises
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
Introducing ‘powerlmm’ an R package for power calculations for longitudinal multilevel models
(This article was first published on R Psychologist – R, and kindly contributed to R-bloggers)
Over the years I’ve produced quite a lot of code for power calculations and simulations of different longitudinal linear mixed models. Over the summer I bundled together these calculations for the designs I most typically encounter into an R package. The purpose of powerlmm is to help design longitudinal treatment studies, with or without higher-level clustering (e.g. by therapists, groups, or physicians), and missing data. Currently, powerlmm supports two-level models, nested three-level models, and partially nested models. Additionally, unbalanced designs and missing data can be accounted for in the calculations. Power is calculated analytically, but simulation methods are also provided in order to evaluate bias, type 1 error, and the consequences of model misspecification. For novice R users, the basic functionality is also provided as a Shiny web application.
The package can be installed from GitHub here: github.com/rpsychologist/powerlmm. Currently, the package includes three vignettes that show how to set up your studies and calculate power.
A basic example
library(powerlmm)

# dropout per treatment group
d <- per_treatment(control = dropout_weibull(0.3, 2),
                   treatment = dropout_weibull(0.2, 2))

# Setup design
p <- study_parameters(n1 = 11,  # time points
                      n2 = 10,  # subjects per cluster
                      n3 = 5,   # clusters per treatment arm
                      icc_pre_subject = 0.5,
                      icc_pre_cluster = 0,
                      icc_slope = 0.05,
                      var_ratio = 0.02,
                      dropout = d,
                      cohend = 0.8)

# Power
get_power(p)
##
## Power calculation for longitudinal linear-mixed model (three-level)
## with missing data and unbalanced designs
##
## n1 = 11
## n2 = 10 (treatment)
##      10 (control)
## n3 = 5 (treatment)
##      5 (control)
##      10 (total)
## total_n = 50 (treatment)
##           50 (control)
##           100 (total)
## dropout = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 (time)
##           0, 0, 1, 3, 6, 9, 12, 16, 20, 25, 30 (%, control)
##           0, 0, 1, 2, 4, 5, 8, 10, 13, 17, 20 (%, treatment)
## icc_pre_subjects = 0.5
## icc_pre_clusters = 0
## icc_slope = 0.05
## var_ratio = 0.02
## cohend = 0.8
## power = 0.68

Feedback
I appreciate all types of feedback, e.g. typos, bugs, inconsistencies, feature requests, etc. Open an issue on github.com/rpsychologist/powerlmm/issues or via my contact info here.
New R Course: Sentiment Analysis in R – The Tidy Way
(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)
Hello, R users! This week we’re continuing to bridge the gap between computers and human language with the launch of Sentiment Analysis in R: The Tidy Way by Julia Silge!
Text datasets are diverse and ubiquitous, and sentiment analysis provides an approach to understand the attitudes and opinions expressed in these texts. In this course, you will develop your text mining skills using tidy data principles. You will apply these skills by performing sentiment analysis in several case studies, on text data from Twitter to TV news to Shakespeare. These case studies will allow you to practice important data handling skills, learn about the ways sentiment analysis can be applied, and extract relevant insights from realworld data.
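If you want a taste of the approach before starting, here is a minimal sketch of sentiment analysis with tidy data principles (not course material; the two-line corpus is made up):
library(dplyr)
library(tidytext)

# a tiny made-up corpus standing in for tweets, captions, or lyrics
docs <- tibble(id = 1:2,
               text = c("I love this sunny day",
                        "This traffic is terrible and sad"))

docs %>%
  unnest_tokens(word, text) %>%                         # one row per word (tidy format)
  inner_join(get_sentiments("bing"), by = "word") %>%   # attach positive/negative labels
  count(id, sentiment)                                  # sentiment counts per document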
Sentiment Analysis in R: The Tidy Way features interactive exercises that combine highquality video, inbrowser coding, and gamification for an engaging learning experience that will make you an expert in sentiment analysis!
What you’ll learn:
Chapter 1: Tweets across the United States
In this chapter, you will implement sentiment analysis using tidy data principles on geocoded Twitter data.
Chapter 2: Shakespeare gets Sentimental
Your next real-world text exploration uses tragedies and comedies by Shakespeare to show how sentiment analysis can lead to insight into differences in word use. You will learn how to transform raw text into a tidy format for further analysis.
Chapter 3: Analyzing TV News
Text analysis using tidy principles can be applied to diverse kinds of text, and in this chapter, you will explore a dataset of closed captioning from television news. You will apply the skills you have learned so far to explore how different stations report on a topic with different words, and how sentiment changes with time.
Chapter 4: Singing a Happy Song (or Sad?!)
In this final chapter on sentiment analysis using tidy principles, you will explore pop song lyrics that have topped the charts from the 1960s to today. You will apply all the techniques we have explored together so far, and use linear modeling to find what the sentiment of song lyrics can predict.
Learn all there is to know about Sentiment Analysis in R, the Tidy Way!
Practical Guide to Principal Component Methods in R
(This article was first published on Easy Guides, and kindly contributed to R-bloggers)
Introduction
Although there are several good books on principal component methods (PCMs) and related topics, we felt that many of them are either too theoretical or too advanced.
This book provides solid practical guidance for summarizing, visualizing and interpreting the most important information in large multivariate data sets, using principal component methods in R.
Where to find the book:
 Download the PDF through payhip
 Read the ebook on google play
 Order a physical copy from amazon
 (Download the book preview)
The following figure illustrates the type of analysis to be performed depending on the type of variables contained in the data set.
There are a number of R packages implementing principal component methods. These packages include: FactoMineR, ade4, stats, ca, MASS and ExPosition.
However, the results are presented differently depending on the package used.
To help in the interpretation and in the visualization of multivariate analysis – such as cluster analysis and principal component methods – we developed an easytouse R package named factoextra (official online documentation: http://www.sthda.com/english/rpkgs/factoextra).
No matter which package you decide to use for computing principal component methods, the factoextra R package can easily extract the analysis results from any of the packages mentioned above in a human-readable format. factoextra also provides convenient functions for creating elegant ggplot2-based graphs.
The methods whose outputs can be visualized using the factoextra package are shown in the figure below:
In this book, we’ll use mainly:
 the FactoMineR package to compute principal component methods;
 and the factoextra package for extracting, visualizing and interpreting the results.
The other packages – ade4, ExPosition, etc. – will also be presented briefly.
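As a quick illustration of this workflow, here is a minimal sketch (not an excerpt from the book) that computes a PCA on the built-in iris data with FactoMineR and visualizes it with factoextra:

library(FactoMineR)
library(factoextra)

# PCA on the four numeric columns of the built-in iris data
res.pca <- PCA(iris[, -5], graph = FALSE)

fviz_eig(res.pca)                            # eigenvalues/variances of the principal components
fviz_pca_var(res.pca, col.var = "contrib")   # graph of variables, colored by contribution
fviz_pca_ind(res.pca, col.ind = "cos2")      # graph of individuals, colored by cos2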
How this book is organized
This book contains four parts.
Part I provides a quick introduction to R and presents the key features of FactoMineR and factoextra.
Part II describes classical principal component methods to analyze data sets containing, predominantly, either continuous or categorical variables. These methods include:
 Principal Component Analysis (PCA, for continuous variables),
 Simple correspondence analysis (CA, for large contingency tables formed by two categorical variables)
 Multiple correspondence analysis (MCA, for a data set with more than 2 categorical variables).
In Part III, you’ll learn advanced methods for analyzing a data set containing a mix of variables (continuous and categorical) structured or not into groups:
 Factor Analysis of Mixed Data (FAMD) and,
 Multiple Factor Analysis (MFA).
Part IV covers hierarchical clustering on principal components (HCPC), which is useful for clustering a data set that contains only categorical variables or a mix of categorical and continuous variables.
Key features of this book
This book presents the basic principles of the different methods and provides many examples in R. It offers solid guidance in data mining for students and researchers.
Key features:
 Covers principal component methods and implementation in R
 Highlights the most important information in your data set using elegant, ggplot2-based visualizations
 Short, selfcontained chapters with tested examples that allow for flexibility in designing a course and for easy reference
At the end of each chapter, we present R lab sections in which we systematically work through applications of the various methods discussed in that chapter. Additionally, we provide links to other resources and to our handcurated list of videos on principal component methods for further learning.
Examples of plots
Some examples of plots generated in this book are shown hereafter. You’ll learn how to create, customize and interpret these plots.
 Eigenvalues/variances of principal components. Proportion of information retained by each principal component.
 PCA – Graph of variables:
 Control variable colors using their contributions to the principal components.
 Highlight the most contributing variables to each principal dimension:
 PCA – Graph of individuals:
 Automatically control the color of individuals using their cos2 values (the quality of representation of the individuals on the factor map)
 Change the point size according to the cos2 of the corresponding individuals:
 PCA – Biplot of individuals and variables
 Correspondence analysis. Association between categorical variables.
 FAMD/MFA – Analyzing mixed and structured data
 Clustering on principal components
Download the preview of the book at: Principal Component Methods in R (Book preview)
Order now
About the author
Alboukadel Kassambara holds a PhD in Bioinformatics and Cancer Biology. He has worked for many years on genomic data analysis and visualization (read more: http://www.alboukadel.com/).
He has experience in statistical and computational methods for identifying prognostic and predictive biomarker signatures through integrative analysis of large-scale genomic and clinical data sets.
He created GenomicScape (www.genomicscape.com), an easy-to-use web tool for gene expression data analysis and visualization.
He also developed a training website on data science named STHDA (Statistical Tools for High-throughput Data Analysis, www.sthda.com/english), which contains many tutorials on data analysis and visualization using R software and packages.
He is the author of many popular R packages for:
 multivariate data analysis (factoextra, http://www.sthda.com/english/rpkgs/factoextra),
 survival analysis (survminer, http://www.sthda.com/english/rpkgs/survminer/),
 correlation analysis (ggcorrplot, http://www.sthda.com/english/wiki/ggcorrplotvisualizationofacorrelationmatrixusingggplot2),
 creating publication ready plots in R (ggpubr, http://www.sthda.com/english/rpkgs/ggpubr).
Recently, he published three books on data analysis and visualization:
 Practical Guide to Cluster Analysis in R (https://goo.gl/DmJ5y5)
 Guide to Create Beautiful Graphics in R (https://goo.gl/vJ0OYb).
 Complete Guide to 3D Plots in R (https://goo.gl/v5gwl0).
To leave a comment for the author, please follow the link and comment on their blog: Easy Guides.
Notice: Changes to the site
(This article was first published on R – Locke Data, and kindly contributed to Rbloggers)
I wanted to give everyone a heads up about a major rebrand and some probable downtime happening over the weekend.
I’m going to be consolidating my Locke Data consulting company materials, the blog, the talks, and my package documentation into a single site. The central URL itsalocke.com won’t be changing but there will be a ton of changes happening.
I think I’ve got the redirects and the RSS feeds pretty much sorted, but I’ve converted more than 300 pages of content to new systems – I’ve likely gotten things wrong. You might notice some issues when clicking through from RBloggers, on twitter, or from other people’s sites.
I hope you’ll like the changes I’ve made, but I’m going to have a “bug bounty” in place. If you find a broken link, a blog post that isn’t rendered correctly, or some other bug, then report it. Filling in the form is easy, and if you provide your name and address, I’ll send you a sticker as thanks!
If you want to get a preview of the site, check it out on its temporary home lockelife.com.
The post Notice: Changes to the site appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.
To leave a comment for the author, please follow the link and comment on their blog: R – Locke Data.
Boston EARL Keynote speaker announcement: Tareef Kawaf
(This article was first published on Mango Solutions, and kindly contributed to Rbloggers)
Mango Solutions are thrilled to announce that Tareef Kawaf, President of RStudio, will be joining us at EARL Boston as our third Keynote Speaker.
Tareef is an experienced software startup executive and a member of teams that built up ATG’s eCommerce offering and Brightcove’s Online Video Platform, helping both companies grow from early startups to publicly traded companies. He joined RStudio in early 2013 to help define its commercial product strategy and build the team. He is a software engineer by training, and an aspiring student of advanced analytics and R.
This will be Tareef’s second time speaking at EARL Boston and we’re big supporters of RStudio’s mission to provide the most widely used open source and enterpriseready professional software for the R statistical computing environment, so we’re looking forward to him taking to the podium again this year.
Want to join Tareef at EARL Boston?
Speak
Abstract submissions close on 31 August, so time is running out to share your R adventures and innovations with fellow R users.
All accepted speakers receive a 1day Conference pass and a ticket to the evening networking reception.
Buy a ticket
Early bird tickets are now available! Save more than $100 on a Full Conference pass.
To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.
Analyzing Google Trends Data in R
Google Trends shows the changes in the popularity of search terms over a given time (i.e., number of hits over time). It can be used to find search terms with growing or decreasing popularity or to review periodic variations from the past such as seasonality. Google Trends search data can be added to other analyses, manipulated and explored in more detail in R.
This post describes how you can use R to download data from Google Trends and then include it in a chart or other analysis. We’ll first discuss how to get overall (global) data on a search term (query), how to plot it as a simple line chart, and then how you can break the data down by geographical region. The first example I will look at is the rise and fall of the Blu-ray.
Analyzing Google Trends in R
I have never bought a Blu-ray disc and probably never will. In my world, technology moved from DVDs to streaming without the need for a high-definition physical medium. I still see them in some shops, but it feels as though they are declining. Using Google Trends, we can find out when interest in Blu-rays peaked.
The following R code retrieves the global search history since 2004 for Blu-ray.
library(gtrendsR)
library(reshape2)
google.trends = gtrends(c("bluray"), gprop = "web", time = "all")[[1]]
google.trends = dcast(google.trends, date ~ keyword + geo, value.var = "hits")
rownames(google.trends) = google.trends$date
google.trends$date = NULL

The first argument to the gtrends function is a list of up to 5 search terms; in this case, we have just one item. The second argument, gprop, is the medium searched on and can be any of "web", "news", "images" or "youtube". The third argument, time, can be any of "now 1-d", "now 7-d", "today 1-m", "today 3-m", "today 12-m", "today+5-y" or "all" (which means since 2004). A final possibility for time is to specify a custom date range, e.g. "2010-12-31 2011-06-30".
Note that I am using gtrendsR version 1.9.9.0. This version improves upon the CRAN version 1.3.5 (as of August 2017) by not requiring a login. You may see a warning if your timezone is not set – this can be avoided by adding the following line of code:
Sys.setenv(TZ = "UTC")

After retrieving the data from Google Trends, I format it into a table with dates for the row names and search terms along the columns. The table below shows the result of running this code.
Plotting Google Trends data: Identifying seasonality and trends
Plotting the Google Trends data as an R chart, we can draw two conclusions. First, interest peaked around the end of 2008. Second, there is a strong seasonal effect, with significant spikes around Christmas every year.
Note that results are relative to the total number of searches at each time point, with the maximum being 100. We cannot infer anything about the absolute volume of Google searches. But we can say that, as a proportion of all searches, Blu-ray was about half as frequent a query in June 2008 as in December 2008. An explanation of the Google Trends methodology is here.
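As a rough sketch (not code from the original post), one way to draw such a line chart from the google.trends table built above is with base R graphics. The single column is selected by position because its exact name depends on the search term and gtrendsR version, and the hits are coerced to numeric in case they come back as character:

dates <- as.Date(rownames(google.trends))
plot(dates, as.numeric(google.trends[[1]]), type = "l",
     xlab = "Date", ylab = "Relative search interest (max = 100)",
     main = "Google Trends: Blu-ray, worldwide")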
Google Trends by geographic region
Next, I will illustrate the use of country codes. To do so, I will find the search history for skiing in Canada and New Zealand. I use the same code as before, modifying only the gtrends line as shown below.
google.trends = gtrends(c("skiing"), geo = c("CA", "NZ"), gprop = "web", time = "20100630 20170630")[[1]]The new argument to gtrends is geo, which allows the users to specify geographic codes to narrow the search region. The awkward part about geographical codes is that they are not always obvious. Country codes consist of two letters, for example, CA and NZ in this case. We could also use region codes such as USCA for California. I find the easiest way to get these codes is to use this Wikipedia page.
An alternative way to find all the region-level codes for a given country is to use the following snippet of R code. In this case, it retrieves all the regions of Italy (IT).
library(gtrendsR)
# 'countries' is a lookup table of country and region codes bundled with gtrendsR
geo.codes = sort(unique(countries[substr(countries$sub_code, 1, 2) == "IT", ]$sub_code))

Plotting the ski data below, we note the contrast between northern and southern hemisphere winters. Skiing is also relatively more popular in Canada than in New Zealand. The 2014 Winter Olympics caused a notable spike in both countries, but particularly in Canada.
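Again as a hedged sketch rather than the post's own code, and assuming the dcast step shown earlier yields one column per country (for example skiing_CA and skiing_NZ; the exact names depend on your gtrendsR version), the two series can be overlaid like this:

dates <- as.Date(rownames(google.trends))
plot(dates, as.numeric(google.trends[[1]]), type = "l", col = "blue",
     xlab = "Date", ylab = "Relative search interest", main = "Google Trends: skiing")
lines(dates, as.numeric(google.trends[[2]]), col = "red")
legend("topright", legend = colnames(google.trends), col = c("blue", "red"), lty = 1)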
Create your own analysis
In this post I have shown how to import data from Google Trends using the R package gtrendsR. Anyone can click on this link to explore the examples used in this post or create your own analysis (just sign into Displayr first).
Hardnosed Indian Data Scientist Gospel Series – Part 1: Incertitude around Tools and Technologies
(This article was first published on Coastal Econometrician Views, and kindly contributed to Rbloggers)
Before the recession, a particular commercial tool was popular in the country, so there was little uncertainty around tools and technology; after the recession, however, incertitude (i.e. uncertainty) around tools and technology has come to preoccupy data science learning, delivery and deployment.
While Python continued as a general-purpose programming language, R was the best remaining choice (it became more popular with the advent of the RStudio IDE), and the author still sees its popularity among data scientists from non-programming backgrounds (i.e. other than computer scientists). Yet in local meetups, panel discussions and webinars, the author still notices aspiring data scientists asking for clarity on which is better for everyday use, as shown in the image below.
The author has undertaken several projects, courses and programs in data science over more than a decade; the views expressed here are drawn from his industry experience. He can be reached at mavuluri.pradeep@gmail or besteconometrician@gmail.com for more details.
Find more about author at http://in.linkedin.com/in/pradeepmavuluri
To leave a comment for the author, please follow the link and comment on their blog: Coastal Econometrician Views.
Digit fifth powers: Euler Problem 30
(This article was first published on The Devil is in the Data, and kindly contributed to Rbloggers)
Euler problem 30 is another number-crunching problem that deals with numbers raised to the power of five. Two other Euler problems dealt with raising numbers to a power: the previous problem looked at permutations of powers, and problem 16 asks for the sum of the digits of 2^1000.
Numberphile has a nice video about a trick to quickly calculate the fifth root of a number that makes you look like a mathematical wizard.
Euler Problem 30 Definition
Surprisingly there are only three numbers that can be written as the sum of fourth powers of their digits:
1634 = 1^4 + 6^4 + 3^4 + 4^4
8208 = 8^4 + 2^4 + 0^4 + 8^4
9474 = 9^4 + 4^4 + 7^4 + 4^4
As 1 = 1^4 is not a sum, it is not included.
The sum of these numbers is 1634 + 8208 + 9474 = 19316. Find the sum of all the numbers that can be written as the sum of fifth powers of their digits.
Proposed Solution
The problem asks for a brute-force solution, but we have a halting problem: how far do we need to go before we can be certain there are no more sums of fifth-power digits? The highest digit is 9, and 9^5 = 59049, which has five digits. If we then look at 6 × 9^5 = 354294, which has six digits, we have a good endpoint for the loop. The loop itself cycles through the digits of each number and tests whether the sum of the fifth powers equals the number.
largest < 6 * 9^5 answer < 0 for (n in 2:largest) { power.sum <0 i < n while (i > 0) { d < i %% 10 i < floor(i / 10) power.sum < power.sum + d^5 } if (power.sum == n) { print(n) answer < answer + n } } print(answer)View the most recent version of this code on GitHub.
The post Digit fifth powers: Euler Problem 30 appeared first on The Devil is in the Data.
To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.
Sentiment analysis using tidy data principles at DataCamp
(This article was first published on Rstats on Julia Silge, and kindly contributed to Rbloggers)
I’ve been developing a course at DataCamp over the past several months, and I am happy to announce that it is now launched!
The course is Sentiment Analysis in R: the Tidy Way and I am excited that it is now available for you to explore and learn from. This course focuses on digging into the emotional and opinion content of text using sentiment analysis, and it does this from the specific perspective of using tools built for handling tidy data. The course is organized into four case studies (one per chapter), and I don’t think it’s too much of a spoiler to say that I wear a costume for part of it. I’m just saying you should probably check out the course trailer.
Course description
Text datasets are diverse and ubiquitous, and sentiment analysis provides an approach to understand the attitudes and opinions expressed in these texts. In this course, you will develop your text mining skills using tidy data principles. You will apply these skills by performing sentiment analysis in four case studies, on text data from Twitter to TV news to Shakespeare. These case studies will allow you to practice important data handling skills, learn about the ways sentiment analysis can be applied, and extract relevant insights from real-world data.
Learning objectives
 Learn the principles of sentiment analysis from a tidy data perspective
 Practice manipulating and visualizing text data using dplyr and ggplot2
 Apply sentiment analysis skills to several realworld text datasets
Check the course out, have fun, and start practicing those text mining skills!
To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.
Recreating and updating Minard with ggplot2
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
Minard's chart depicting Napoleon's 1812 march on Russia is a classic of data visualization that has inspired many homages using different time-and-place data. If you'd like to recreate the original chart, or create one of your own, Andrew Heiss has created a tutorial on using the ggplot2 package to re-envision the chart in R:
The R script provided in the tutorial is driven by historical data on the location and size of Napoleon's armies during the 1812 campaign, but you could adapt the script to use new data as well. Andrew also shows how to combine the chart with a geographical or satellite map, which is how the cities appear in the version above (unlike in Minard's original).
The data behind the Minard chart is available from Michael Friendly and you can find the R scripts at this Github repository. For the complete tutorial, follow the link below.
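As a hedged sketch of the core idea (this is not Andrew Heiss's script), the essential aesthetic mapping can be reproduced with ggplot2 using the Minard.troops data frame from Michael Friendly's HistData package, assuming its long, lat, survivors, direction and group columns:

library(ggplot2)
library(HistData)

data(Minard.troops)   # troop positions, sizes and march direction

ggplot(Minard.troops, aes(x = long, y = lat, group = group,
                          colour = direction, size = survivors)) +
  geom_path(lineend = "round") +   # the flow band: line width maps to army size
  scale_size(range = c(0.5, 10)) +
  theme_void()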
Andrew Heiss: Exploring Minard’s 1812 plot with ggplot2 (via Jenny Bryan)
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.