WoE and IV Variable Screening with {Information} in R
(This article was first published on The Exactness of Mind, and kindly contributed to Rbloggers)
A short note on informationtheoretic variable screening in R w. {information}.
Variable screening comes as an important step in the
contemporary EDA for predictive modeling: what can we tell about the
nature of the relationships between a set of predictors and the
dependent before entering the modeling phase? Can we infer something about the predictive power of the independent variables before we
start rolling them into a predictive model? In this blog post I will
discuss two informationtheoretic measures that are common in variable
screening for binary classification and regression models in the credit
risk arena (the fact being completely unrelated to the simple truth that
they could do you some good in any other application of predictive
modeling as well). I will first introduce the Weight of Evidence (WoE) and Information Value
(IV) of a variable in respect to a binary outcome. Then I will
illustrate their computation (it’s fairly easy, in fact) from the
{Information} package in R.
Weight of Evidence
Take the common Bayesian hypothesis test (or a Bayes factor, if you prefer):
and assume your models M1, M2 of the world*
are simply two discrete possible states of a binary variable Y, while
the data are given by discrete distributions of some predictor X (or, X
stands for a binned continuous distribution); for every category j in X, j = 1, 2,.. n, take the log:
and you will get to simple a measure of evidence in favor of M1 against M2 that Good has described as Weight of Evidence
(WoE). In theory, any monotonic transformation of the odds would do,
but the logarithm brings an intuitive advantage of obtaining a negative
WoE when the odds are less than one and a positive one when they are
higher than one. To simplify the setting where the analysis under
consideration encompasses a binary dependent Y and a discrete (or binned
continuous) predictor X, we are simply inspecting the conditional distribution of X given Y:
where f denotes counts.
Let’s illustrate the computation of WoE in this setting for a variable from a wellknown dataset**. We have one categorical, binary dependent:
dataSet < read.table(‘bankadditionalfull.csv’,
header = T,
strip.white = F,
sep = “;”)
str(dataSet)
table(dataSet$y)
dataSet$y < recode(dataSet$y,
‘yes’ = 1,
‘no’ = 0)
and we want to compute the WoE for, say, the age variable. Here it goes:
# – compute WOE for: dataSet$age
bins < 10
q < quantile(dataSet$age,
probs = c(1:(bins – 1)/bins),
na.rm = TRUE,
type = 3)
cuts < unique(q)
aggAge < table(findInterval(dataSet$age,
vec = cuts,
rightmost.closed = FALSE),
dataSet$y)
aggAge < as.data.frame.matrix(aggAge)
aggAge$N < rowSums(aggAge)
aggAge$WOE < log((aggAge$`1`*sum(aggAge$`0`))/(aggAge$`0`*sum(aggAge$`1`)))
In the previous example I have used exactly the approach to bin X (age, in this case) that is used in the R package {Information} whose application I want to illustrate later. The table()
call provides for the conditional distributions like the ones shown in
the table above. The computation of WoE is then straightforward – as
exemplified in the last line. However, you want to spare yourself from
computing the WoE in this way for many variables in the dataset, and
that’s where {Information} in R comes handy; for the same dataset:
# – Information value: all variables
infoTables < create_infotables(data = dataSet,
y = “y”,
bins = 10,
parallel = T)
# – WOE table:
infoTables$Tables$age$WOE
with the respective data frames in infoTables$Tables standing for the variables in the dataset.
Information Value
A straightforward definition of the Information Value (IV)of a variable is provided in the {Information} package vignette:
In effect, this means that we are summing across the individual WoE values (i.e. for each bin j of X) and weighting them by the respective differences between P(xjY=1) and P(xjY=0). The IV of a variable measures its predictive power, and variables with IV < .05 are generally considered to have a low predictive power.
Using {Information} in R, for the dataset under consideration:
# – Information value: all variables
infoTables < create_infotables(data = dataSet,
y = “y”,
bins = 10,
parallel = T)
# – Plot IV
plotFrame < infoTables$Summary[order(infoTables$Summary$IV), ]
plotFrame$Variable < factor(plotFrame$Variable,
levels = plotFrame$Variable[order(plotFrame$IV)])
ggplot(plotFrame, aes(x = Variable, y = IV)) +
geom_bar(width = .35, stat = “identity”, color = “darkblue”, fill = “white”) +
ggtitle(“Information Value”) +
theme_bw() +
theme(plot.title = element_text(size = 10)) +
theme(axis.text.x = element_text(angle = 90))
You may have noted the usage of parallel = T in the create_infotables()
call; the {Information} package will try to use all available cores to
speed up the computations by default. Besides the basic package
functionality that I have illustrated, the package provides a natural
way of dealing with uplift models, where the computation of the IVs for
the control vs. treatment designs is nicely automated. Crossvalidation
procedures are also builtin.
However, now that we know that we have a nice,
working package for WoE and IV estimation in R, let’s restrain ourselves
from using it to perform automatic feature
selection for models like binary logistic regression. While the
informationtheoretic measures discussed here truly assess the
predictive power of a predictor in binary classification, building a
model that encompasses multiple terms model is another story. Do not get
disappointed if you start figuring out how the AICs for the full models
are still lower then those for the nested models obtained by feature
selection based on the IV values; although they can provide useful
guidelines, WoE and IV are not even meant to be used that way (I’ve
tried… once with the dataset used in the previous examples, and then
with the two {Information} builtin datasets; not too much of a success,
as you may have guessed).
References
* For parametric models, you would need to integrate over the full
parameter space, of course; taking the MLEs would result in obtaining
the standard LR test.
** The dataset is considered in S. Moro, P. Cortez and P. Rita
(2014). A DataDriven Approach to Predict the Success of Bank
Telemarketing. Decision Support Systems, Elsevier, 62:2231, June 2014. I
have obtained it from: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
(N.B. https://archive.ics.uci.edu/ml/machinelearningdatabases/00222/,
file: bankadditional.zip); a nice description of the dataset is found
at:
http://www2.1010data.com/documentationcenter/beta/Tutorials/MachineLearningExamples/BankMarketingDataSet.html)
Goran S. Milovanović, Phd
Data Science Consultant, SmartCat
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: The Exactness of Mind. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
NYC Citi Bike Visualization – A Hint of Future Transportation
(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to Rbloggers)
Introduction
Like all other sharing systems, Airbnb the housing sharing system, Uber the car sharing system, Citi Bike is the network of bicycle rental stations intended for pointtopoint transportation.
Citi Bike is New York City’s largest bike sharing system. It’s a convenient solution for trips that are too far to walk but too short for a taxi or the subway. The bike sharing system is combined with all other transportation methods available in the area for commuters.
Currently, there are about a million trips on average per month by Citi Bike riders. The system has 10,000 bicycles and 610 stations. By end of 2017, the total size of Citi Bike system will be 12,000 bikes and 750 stations. The grey area is the current service area. The yellow and blue areas represent the sections to be covered by end of 2017.
The Optimization Questions
Any Citi Bike client has come up against two frustrating scenarios: the empty dock at the start and full dock at the end of the trip. Researchers call this as “rebalancing” problem as part of “fleet optimization” questions. This problem has attracted the attention of data scientists to develop complex methodologies to optimize the available bikes and open docks.
Following I attempt to utilize the shiny visualization app to provide a hint for the 3 questions:
 Fleet Routing Pattern Detection: what are the most popular routes during peak hours and offpeak? What is the direction of the flow?
 Station Balance Prediction: what is the average volume of imbalance in the distributed system? What is the stationlevel inflow and outflow? Is it sensitive to the time? How does it look like in a time series?
 Reducing rebalancing demand: What are the riders’ activities like? Is it possible to rebalance through pricing schemes?
The visualization app is intended to provide a way to explore different comparative measures at the route, station and system levels with spatial attributes and time series.
The Data
Citi published Citi Bike Trip Histories – downloadable files of Citi Bike trip data. I used the Citi Bike data for the month of March 2017 (approximately 1 million observations). The data includes:
 Trip Duration (in seconds)
 Start Time and Date
 Stop Time and Date
 Start Station Name
 End Station Name
 Station ID
 Station Lat/Long
 Bike ID
 User Type (Customer = 24hour pass or 7day pass user; Subscriber = Annual Member)
 Gender (Zero=unknown; 1=male; 2=female)
 Year of Birth
Before moving ahead with building the app, I was interested in exploring the data and identifying patterns of rebalancing.
Bar Chart 1 – Time wise imbalance (Peak/Off peak)
Bar Chart 2 Location wise imbalance (Top 10 popular Station)
Insights
On the interactive map, each dot presents a station. The visualization will also provide options to identify popular routes by selecting date and hour range. The top popular routes are marked in orange as the lines between the spatial points. The direction of the routes is indicated by moving from the more red towards the more green dots.
The Patterns
Interesting patterns are observed. The most popular routes on the west side run through Central Park and Chelsea Pier. Grand central/Penn Station centered routes are also in the hottest route list. Outside Manhattan there are centers in Queen and Brooklyn initiating lots of popular routes. Riders bike more along the west and east streets than along north and south avenue. That makes sense in light of the fact that there are more uptown and downtown subways than crosstown ones, and riders do utilize the Citi Bike as a an alternative transportation option.
While not enough bikes available in hot pick up stations, the docks are lacking in hot drop off stations. The red dots are where outflow of bikes exceeds the inflow of bikes The green dots are where inflow of bikes exceeds the outflow of bikes. In the other words, the green dots are the hot spot to pick up a bike(more inbound bikes) and the red(more empty docks) to drop them off. And The more extreme the color of dot is, the higher percentage change of the flows this stations has. The size and transparency of the dot is represented by the volume of both inflow and outflow of the stations. The more obvious the dot is, the hotter spot the station is.
What caused the balancing problem? The map based interactive app provides an insight for predicting demand. The information displayed is the accumulated hourly variables based on dates selected. Details of statistic numbers is also available for each stations by zooming in.
New York has a classic peak commuter flow pattern. Most commuters ride bikes towards the center of the town from its edge in the morning. At the end of the day, they ride the reverse way to the edge where they live, especially at the edge sections with fewer public transportations options.
The Rider’s Activities
What about the rider’s activities. Is there any pattern involved? The app provides insights of rider’s performance for reducing rebalancing demands. By studying rider’s activities, it will provides suggestions for potential solutions.
Below each bubble represents an age and gender group. The age is represented as the number on each bubble. A negative correlation is observed between age and speed. The younger the rider is, the faster he/she rides. In similarity, the group in the thirties shows similar miles per trip. The performance between female and male group are also different. The male groups in blue perform a higher speed level than female groups in red.
The Balancing Solutions
Is there solutions for rebalancing to cut the cost and improve the efficiency, instead of manually moving bikes via trucks, bikedraw trailers and sprinter vans from full stations to empty stations? The moving will take crews travel in pairs 45 minutes to load a truck.
Citi Bike sought a way to get the riders to move the bikes themselves. In May it started the pilot Bike Angel. The reversecommuter would be perfect target member of the Angel program. What is so appealing about the program is the bike sharing system could self distribute its fleet with the proper incentives. The member can easily make 10 Amazon gift card with a few reverse trips. As a result, the demand of manually moving bike around would decrease.
The Conclusions
The Visualization app provides the real time status of fleets: popular routes, inbound/outbound, net change, time series of stations, hot spot analysis and rider’s activities. It supports the self distributed fleet by establishing a baseline for identifying “healthy” rebalancing within the bike share system. It provides a hint for a future transportation solutions.
The App
The interactive app is available on Shiny.io.
The post NYC Citi Bike Visualization – A Hint of Future Transportation appeared first on NYC Data Science Academy Blog.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Why I find tidyeval useful
(This article was first published on Econometrics and Free Software, and kindly contributed to Rbloggers)
First thing’s first: maybe you shouldn’t care about tidyeval. Maybe you don’t need it. If you exclusively work interactively, I don’t think that learning about tidyeval is important. I can only speak for me, and explain to you why I personally find tidyeval useful.
I wanted to write this blog post after reading this twitter thread and specifically this question.
Mara Averick then wrote this blogpost linking to 6 other blog posts that give some tidyeval examples.
To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part7)
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
Statistics are often taught in school by and for people who like Mathematics. As a consequence, in those class emphasis is put on leaning equations, solving calculus problems and creating mathematics models instead of building an intuition for probabilistic problems. But, if you read this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistic problems by writing simple R query and how to think in probabilistic terms.
Until now, we used random variable simulation and bootstrapping to test hypothesis and compute statistics of a single sample. In today’s set, we’ll learn how to use permutation to test hypothesis about two different samples and how to adapt bootstrapping to this situation.
Answers to the exercises are available here.
Exercise 1
 Generate 500 points from a beta distribution of parameter a=2 and b=1.5, then store the result in a vector named beta1.
 Generate 500 points from the same distribution and store those points in a vector named beta2.
 Concatenate both vectors to create a vector called beta.data.
 Plot the ecdf of beta1 and beta2.
 Sample 500 points from beta.data and plot the ecdf of this sample. Repeat this process 5 times.
 Does all those samples share the same distribution and if the answer is yes, what is the distribution?
Exercise 2
When we test an hypothesis, we suppose that this hypothesis is true, we simulate what would happen if that’s the case and if our initial observation happen less that α percent of the time we reject the hypothesis. Now, from the first exercise, we know that if two samples share the same distribution, we can assume that any sample drawn from those samples will follow the same distribution. In particular, if we shuffle the observations from a sample of size n1 and those of a sample of size n2, shuffle them and draw two new samples of size n1 and n2, they all should have a similar CDF. We can use this fact to test the hypothesis that two samples have the same distribution. This is process is called a permutation test.
Load this dataset where each column represents a variable and we want to know if they are identically distributed. Each exercise below follow a step of a permutation test.
 What are the null and alternative hypotheses for this test?
 Concatenate both samples into a new vector called data.ex.2.
 Write a function that take data.ex.2 and the size of both sample as arguments, create a temporary vector by permuting data.ex.2 and return two new samples. The first sample has the same number of observations than the first column of the dataset, the second is made from the rest of the observations. Name this function permutation.sample (we will used it in the next exercise.) Why do we want the function to return samples of those size?
 Plot the ECDF of both initial variables in black.
 Use the function permutation.sample 100 times to generate permuted samples, then compute the ECDF of those samples and add the plot of those curve to the previous plot. Use the color red for the first batch of samples and green for the second batch.
 By looking at the plot, can you tell if the null hypothesis is true?
Exercise 3
A business analyst think that the daily returns of the apple stocks follow a normal distribution with mean of 0 and a standard deviation of 0.1. Use this dataset of the daily return of those stocks for the last 10 years to test this hypothesis.
 Work with about different binomial and logistic regression techniques
 Know how to compare regression models and choose the right fit
 And much more
Exercise 4
Permutation test can help us verify if two samples come from the same distribution, but if this is true, we can conclude that both sample share the same statistics. As a consequence permutation test can also be used to test if statistic of two sample are the same. One really useful application of this is to test if two mean are the same or significantly different (as you have probably realized by now, statistician are obsessed with mean and love to spend time studying it!). In this situation, the question is to determine if the difference of mean in two sample are random or a consequence of a difference of distribution.
You should be quite familiar with tests by now, so how would you proceed to do a permutation test to verify if two means are equals? Used that process to test the equality of the mean of both sample in this dataset.
Exercise 5
Looking at the average annual wage of the United States and Switzerland both country have relatively the same level of wealth since those statistics are of 60154 and 60124 US dollar respectively. In this dataset, you will find simulated annual wage from citizen of both countries. Test the hypothesis that both the American and the Swiss have the same average annual wage based on those samples at a level of 5%.
Exercise 6
To test if two samples from different distribution have the same statistics, we cannot use the permutation test: we instead will use bootstrapping. To test if two sample as the same mean, for example, you should follow those steps:
 Formulate a null and an alternative hypothesis.
 Set a significance level.
 Compute the difference of mean of both samples. This will be the reference value we will use to compute the pvalue.
 Concatenate both samples and compute the mean of this new dataset.
 Shift both samples so that they share the mean of the concatenated dataset.
 Use bootstrap to generate an estimate of the mean of both shifted samples.
 Compute the difference of both means.
 Repeat the last two steps at least 1000 times.
 Compute the pvalue and draw a conclusion.
Use the dataset from last exercise to see if the USA and Switzerland have the same average wage at a level of 5%.
Exercise 7
Test the hypothesis that both samples in this dataset have the same mean.
Exercise 8
R have functions that use analytic methods to test if two samples have an equal mean.
 Use the t.test() function to test the equality of the mean of the samples of the last exercise.
 Use this function to test the hypothesis that the average wage in the US are bigger than in Switzerland.
Exercise 9
The globular cluster luminosity dataset list measurement about the luminosity of cluster of stars in different region of the milky way galaxy and the Andromeda galaxy. Test the hypothesis that the average luminosity in both galaxy have a difference of 24,78.
Exercise 10
A company that mold aluminum for auto parts has bought a smaller company to increase the amount of parts they can produce each year. In their factory, the smaller company used the standard equipment, but used a different factory layout, had a different supply line and managed their employees work schedules in a completely different manner that their new parent company. Before changing the company culture, the engineer in the parent company are interested to know which of the approach is the more effective. To do so they measure the time it took to make an auto part in each factory, 150 times and created this dataset where the first column represent the sample of the small factory.
 Does the average time it takes to make a part is the same in both factory?
 Does the production time follow the same distribution in both factory?
 If the engineer want to minimize the percentage of part that take more than one hour to be made, which setup they should implement in both their factory: the one of the parent company or the one of the smaller company?
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part6)
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part2)
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part5)
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
wrapr: R Code Sweeteners
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
wrapr is an R package that supplies powerful tools for writing and debugging R code.
Primary wrapr services include:
 let()
 %.>% (dot arrow pipe)
 := (named map builder)
 λ() (anonymous function builder)
 DebugFnW()
let() allows execution of arbitrary code with substituted variable names (note this is subtly different than binding values for names as with base::substitute() or base::with()).
The function is simple and powerful. It treats strings as variable names and rewrites expressions as if you had used the denoted variables. For example the following block of code is equivalent to having written "a + a".
library("wrapr") a < 7 let( c(VAR = 'a'), VAR + VAR ) # [1] 14This is useful in readapting nonstandard evaluation interfaces (NSE interfaces) so one can script or program over them.
We are trying to make let() self teaching and self documenting (to the extent that makes sense). For example try the arguments "eval=FALSE" prevent execution and see what would have been executed, or debug=TRUE to have the replaced code printed in addition to being executed:
let( c(VAR = 'a'), eval = FALSE, { VAR + VAR } ) # { # a + a # } let( c(VAR = 'a'), debugPrint = TRUE, { VAR + VAR } ) # { # a + a # } # [1] 14Please see vignette('let', package='wrapr') for more examples. For working with dplyr 0.7.* we suggest also taking a look at an alternate approach called seplyr.
%.>% (dot arrow pipe)%.>% dot arrow pipe is a strict pipe with intended semantics:
"a %.>% b" is to be treated as if the user had written "{ . < a; b };" with "%.>%" being treated as leftassociative, and .side effects removed.
That is: %.>% does not alter any function arguments that are not explicitly named. It is not defined as a %.% b ~ b(a) (roughly dplyr‘s original pipe) or as the large set of differing cases constituting magrittr::%>%. %.>% is designed to be explicit and simple.
The effect looks is show below.
The following two expressions should be equivalent:
cos(exp(sin(4))) # [1] 0.8919465 4 %.>% sin(.) %.>% exp(.) %.>% cos(.) # [1] 0.8919465The notation is quite powerful as it treats pipe stages as expression parameterized over the variable ".". This means you do not need to introduce functions to express stages. The following is a valid dotpipe:
1:4 %.>% .^2 # [1] 1 4 9 16The notation is also very regular in that expressions have the same iterpretation be then surrounded by parenthesis, braces, or asis:
1:4 %.>% { .^2 } # [1] 1 4 9 16 1:4 %.>% ( .^2 ) # [1] 1 4 9 16Regularity can be a big advantage in teaching and comprehension. Please see "In Praise of Syntactic Sugar" for more details.
:=:= is the "named map builder". It allows code such as the following:
'a' := 'x' # a # "x"The important property of named map builder is it accepts values on the lefthand side allowing the following:
name < 'variableNameFromElseWhere' name := 'newBinding' # variableNameFromElseWhere # "newBinding"A nice property is := commutes (in the sense of algebra or category theory) with R‘s concatenation function c(). That is the following two statements are equivalent:
c('a', 'b') := c('x', 'y') # a b # "x" "y" c('a' := 'x', 'b' := 'y') # a b # "x" "y" λ()λ() is a concise abstract function creator. It is a placeholder that allows the use of the λcharacter for very concise function abstraction.
Example:
# Make sure lambda function builder is in our enironment. if(exists('defineLambda', envir=as.environment('package:wrapr'), mode='function')) { # Note: prior to version 0.4.2 wrapr # loaded a lambdadefinition during # package load. The following explicit # wrapr::defineLambda() is more polite. wrapr::defineLambda() } # square numbers 1 through 4 sapply(1:4, λ(x, x^2)) # [1] 1 4 9 16 DebugFnW()DebugFnW() wraps a function for debugging. If the function throws an exception the execution context (function arguments, function name, and more) is captured and stored for the user. The function call can then be reconstituted, inspected and even rerun with a stepdebugger. Please see our free debugging video series and vignette('DebugFnW', package='wrapr') for examples.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Unbottling “.msg” Files in R
(This article was first published on R – rud.is, and kindly contributed to Rbloggers)
There was a discussion on Twitter about the need to read in “.msg” files using R. The “MSG” file format is one of the many binary abominations created by Microsoft to lock folks and users into their platform and tools. Thankfully, they (eventually) provided documentation for the MSG file format which helped me throw together a small R package — msgxtractr — that can read in these ‘.msg’ files and produce a list as a result.
I had previously creatred a quick version of this by wrapping a Python module, but that’s a path fraught with peril and did not work for one of the requestors (yay, notsocrossplatform UTF woes). So, I cobbled together some bits and pieces from the C to provide a singular function read_msg() that smashes open bottled up msgs, grabs sane/useful fields and produces a list() with them all wrapped up in a bow (an example is at the end and in the GH README).
Thanks to rhub, WinBuilder and Travis the code works on macOS, Linux and Windows and even has pretty decent code coverage for a quick project. That’s a resounding testimony to the work of many members of the R community who’ve gone to great lengths to make testing virtually painless for package developers.
Now, I literally have a singular ‘.msg’ file to test with, so if folks can kick the tyres, file issues (with errors or feature suggestions) and provide some more ‘.msg’ files for testing, it would be most appreciated.
devtools::install_github("hrbrmstr/msgxtractr") library(msgxtractr) print(str(read_msg(system.file("extdata/unicode.msg", package="msgxtractr")))) ## List of 7 ## $ headers :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 18 variables: ## ..$ Returnpath : chr "" ## ..$ Received :List of 1 ## .. ..$ : chr [1:4] "from st11p00mmsmtpin007.mac.com ([17.172.84.240])\nby ms06561.mac.com (Oracle Communications Messaging Server " __truncated__ "from mailvc0f182.google.com ([209.85.220.182])\nby st11p00mmsmtpin007.mac.com\n(Oracle Communications Messag" __truncated__ "by mailvc0f182.google.com with SMTP id ie18so3484487vcb.13 for\n; Mon, 18 Nov 2013 00:26:25 0800 (PST)" "by 10.58.207.196 with HTTP; Mon, 18 Nov 2013 00:26:24 0800 (PST)" ## ..$ Originalrecipient : chr "rfc822;brianzhou@me.com" ## ..$ ReceivedSPF : chr "pass (st11p00mmsmtpin006.mac.com: domain of brizhou@gmail.com\ndesignates 209.85.220.182 as permitted sender)\" __truncated__ ## ..$ DKIMSignature : chr "v=1; a=rsasha256; c=relaxed/relaxed; d=gmail.com;\ns=20120113; h=mimeversion:date:messageid:subject:f" __truncated__ ## ..$ MIMEversion : chr "1.0" ## ..$ XReceived : chr "by 10.221.47.193 with SMTP id ut1mr14470624vcb.8.1384763184960;\nMon, 18 Nov 2013 00:26:24 0800 (PST)" ## ..$ Date : chr "Mon, 18 Nov 2013 10:26:24 +0200" ## ..$ Messageid : chr "" ## ..$ Subject : chr "Test for TIF files" ## ..$ From : chr "Brian Zhou " ## ..$ To : chr "brianzhou@me.com" ## ..$ Cc : chr "Brian Zhou " ## ..$ Contenttype : chr "multipart/mixed; boundary=001a113392ecbd7a5404eb6f4d6a" ## ..$ Authenticationresults : chr "st11p00mmsmtpin007.mac.com; dkim=pass\nreason=\"2048bit key\" header.d=gmail.com header.i=@gmail.com\nheader." __truncated__ ## ..$ xicloudspamscore : chr "33322\nf=gmail.com;e=gmail.com;pp=ham;spf=pass;dkim=pass;wl=absent;pwl=absent" ## ..$ XProofpointVirusVersion: chr "vendor=fsecure\nengine=2.50.10432:5.10.8794,1.0.14,0.0.0000\ndefinitions=20131118_02:20131118,20131117,19" __truncated__ ## ..$ XProofpointSpamDetails : chr "rule=notspam policy=default score=0 spamscore=0\nsuspectscore=0 phishscore=0 bulkscore=0 adultscore=0 classifie" __truncated__ ## $ sender :List of 2 ## ..$ sender_email: chr "brizhou@gmail.com" ## ..$ sender_name : chr "Brian Zhou" ## $ recipients :List of 2 ## ..$ :List of 3 ## .. ..$ display_name : NULL ## .. ..$ address_type : chr "SMTP" ## .. ..$ email_address: chr "brianzhou@me.com" ## ..$ :List of 3 ## .. ..$ display_name : NULL ## .. ..$ address_type : chr "SMTP" ## .. ..$ email_address: chr "brizhou@gmail.com" ## $ subject : chr "Test for TIF files" ## $ body : chr "This is a test email to experiment with the MS Outlook MSG Extractor\r\n\r\n\r\n \r\n\r\n\r\nKind regards\r\n" __truncated__ ## $ attachments :List of 2 ## ..$ :List of 4 ## .. ..$ filename : chr "importOl.tif" ## .. ..$ long_filename: chr "import OleFileIO.tif" ## .. ..$ mime : chr "image/tiff" ## .. ..$ content : raw [1:969674] 49 49 2a 00 ... ## ..$ :List of 4 ## .. ..$ filename : chr "raisedva.tif" ## .. ..$ long_filename: chr "raised value error.tif" ## .. ..$ mime : chr "image/tiff" ## .. ..$ content : raw [1:1033142] 49 49 2a 00 ... ## $ display_envelope:List of 2 ## ..$ display_cc: chr "Brian Zhou" ## ..$ display_to: chr "brianzhou@me.com" ## NULLNOTE: Don’t try to read those TIFF images with magick or evan the tiff package. It seems to have some strange tags. But, saving it (use writeBin()) and opening with Preview (or your favorite image viewer) should work (it did for me and produces the following image that I’ve converted to png):
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Tidy evaluation, most common actions
(This article was first published on That’s so Random, and kindly contributed to Rbloggers)
Tidy evaluation is a bit challenging to get your head around. Even after reading programming with dplyr several times, I still struggle when creating functions from time to time. I made a small summary of the most common actions I perform, so I don’t have to dig in the vignettes and on stackoverflow over and over. Each is accompanied with a minimal example on how to implement it. I thought others might find this useful too, so here it is in a blog post. This list is meant to be a living thing so additions and improvements are most welcome. Please do a PR on this file or send an email.
library(tidyverse) bare to quosure: quo bare_to_quo < function(x, var){ x %>% select(!!var) %>% head(1) } bare_to_quo(mtcars, quo(cyl)) ## cyl ## Mazda RX4 6 bare to quosure in function: enquo bare_to_quo_in_func < function(x, var) { var_enq < enquo(var) x %>% select(!!var_enq) %>% head(1) } bare_to_quo_in_func(mtcars, mpg) ## mpg ## Mazda RX4 21 quosure to a name: quo_name bare_to_name < function(x, nm) { nm_name < quo_name(nm) x %>% mutate(!!nm_name := 42) %>% head(1) %>% select(!!nm) } bare_to_name(mtcars, quo(this_is_42)) ## this_is_42 ## 1 42 quosure to text: quo_text quo_to_text < function(x, var) { var_enq < enquo(var) ggplot(x, aes_string(rlang::quo_text(var_enq))) + geom_density() } plt < quo_to_text(mtcars, cyl)Note that tidy evaluation is not yet implemented in ggplot2, but this will be in future versions. This is a workaround for the meantime, when combining dplyr and ggplot2.
character to quosure: sym char_to_quo < function(x, var) { var_enq < rlang::sym(var) x %>% select(!!var_enq) %>% head(1) } char_to_quo(mtcars, "vs") ## vs ## Mazda RX4 0 multiple bares to quosure: quos bare_to_quo_mult < function(x, ...) { grouping < quos(...) x %>% group_by(!!!grouping) %>% summarise(nr = n()) } bare_to_quo_mult(mtcars, vs, cyl) ## # A tibble: 5 x 3 ## # Groups: vs [?] ## vs cyl nr ## ## 1 0 4 1 ## 2 0 6 3 ## 3 0 8 14 ## 4 1 4 10 ## 5 1 6 4 multiple characters to quosure: syms bare_to_quo_mult_chars < function(x, ...) { grouping < rlang::syms(...) x %>% group_by(!!!grouping) %>% summarise(nr = n()) } bare_to_quo_mult_chars(mtcars, list("vs", "cyl")) ## # A tibble: 5 x 3 ## # Groups: vs [?] ## vs cyl nr ## ## 1 0 4 1 ## 2 0 6 3 ## 3 0 8 14 ## 4 1 4 10 ## 5 1 6 4 quoting full expressionsAltough quoting column names is most often used, it is by no means the only option. We can use the above to quote full expressions.
filter_func < function(x, filter_exp) { filter_exp_enq < enquo(filter_exp) x %>% filter(!!filter_exp_enq) } filter_func(mtcars, hp == 93) ## mpg cyl disp hp drat wt qsec vs am gear carb ## 1 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: That’s so Random. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Stan Weekly Roundup, 25 August 2017
(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to Rbloggers)
This week, the entire Columbia portion of the Stan team is out of the office and we didn’t have an inperson/online meeting this Thursday. Mitzi and I are on vacation, and everyone else is either teaching, TAing, or attending the Stan course. Luckily for this report, there’s been some great activity out of the meeting even if I don’t have a report of what everyone around Columbia has been up to. If a picture’s really worth a thousand words, this is the longest report yet.
 Ari Hartikainen has produced some absolutely beautiful parallel coordinate plots of HMC divergences* for multiple parameters. The divergent transitions are shown in green and the lines connect a single draw. The top plot is unnormalized, whereas the bottom scales all parameters to a [0, 1] range.
You can follow the ongoing discussion on the forum thread. There are some further plots for larger models and some comparisons with the pairs plots that Michael Betancourt has been recommending for the same purpose (the problem with pairs is that it’s very very slow, at least in RStan, because it has to draw quadratically many plots).
 Sebastian Weber has a complete working prototype of the MPI (multicore parallelization) in place and has some beautiful results to report. The first graph is the speedup he achieved on a 20core server (all in one box with shared memory):
The second graph shows what happens when the problem size grows (those bottom numbers on the xaxis are the number of ODE systems being solved, whereas the top number remains the number of cores used).
As with Ari’s plots, you can follow the ongoing disussion on the forum thread. And if you know something about MPI, you can even help out. Sebastian’s been asking if anyone who knows MPI would like to check his work—he’s learning it as he goes (and doing a bangup job of it, I might add!).
These lists are incomplete
After doing a handful of these reports, I’m sorry to say you’re only seeing a very biased selection of activity around Stan. For the full story, I’d encourage you to jump onto our forums or GitHub (warning: very high traffic, even if you focus).
* Divergences in Stan arise when the Hamiltonian, which should be conserved across a trajectory, diverges—it’s basically a numerical simulation problem—if we could perfectly follow the Hamiltonian through complex geometries, there wouldn’t be any divergences. This is a great diagnostic mechanism to signal something’s going wrong and resulting estimates might be biased. It may seem to make HMC more fragile, but the problem is that Gibbs and Metropolis will fail silently in a lot of these situations (though BUGS will often help you out of numerical issues by crashing).
The post Stan Weekly Roundup, 25 August 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Tips and tricks on using R to query data in Power BI
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
In Power BI, the dashboarding and reporting tool, you can use R to filter, transform, or restructure data via the Query Editor. For example, you could use the mice package to impute missing values, or use the tidytext package to assign sentiment scores to text inputs. As Imke Feldmann explains, there are lots of useful tricks you can accomplish using R in the query editor, including:
 Passing multiple Power BI tables into the R script
 Parameterizing R scripts
 Modifying file names and output data types for use in Power BI
 Generating multiple table outputs from an R script
For the details on these tips, follow the link below. You can also find tips on using R in the Query Editor here.
The Biccountant: Tips and Tricks for R scripts in the query editor in Power BI
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
How much will that Texas rain be
PUTTING 35 INCHES OF RAIN IN PERSPECTIVE
We are always interested in putting numbers into perspective, so we were interested in this article in which they put the Hurricane Harvey’s rain into perspective.
They’re predicting 3040 inches of rain in a few days in Texas. They asked an expert to put that into perspective and he said:
Let’s put it in context. Much of the Northeast Corridor — Washington to New York and Boston — maybe receives maybe between 40 and 45 inches of rain a year. Think of all the rain you get in July through Christmas and put that in a couple days. It’s a lot of rain.
It’s easy for us to think in terms of New York City, so we looked up some weather data. See the table at the top of this post (all figures are in inches).
First thing we can notice is that the expert understated things, for New York at least. Thirty five inches would be equivalent to all the rain in NYC from April (not July) to December, inclusive.
But we agree that it’s a lot of rain.
We’ve always had trouble putting rain forecasts into perspective, so here are some rules of thumb we figured out from the data that we’re going to memorize. If you live in the corridor from DC to Boston, you may find these useful.
 The average amount of rain per rainy day in NYC is .38 inches, which conveniently is about 1 cm.
 When you hear it’s going to rain 1 cm or 3/8 inch, you can think “no big deal, that’s a typical NYC rainy day”.
 If you hear it’s going to rain an inch, you can think “oh darn, that’s like three rainy days worth”.
Here’s Rmarkdown code if you want to play around:
The post How much will that Texas rain be appeared first on Decision Science News.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));How to prepare and apply machine learning to your dataset
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
Dear reader,
If you are a newbie in the world of machine learning, then this tutorial is exactly what you need in order to introduce yourself to this exciting new part of the data science world.
This post includes a full machine learning project that will guide you step by step to create a “template,” which you can use later on other datasets.
In this stepbystep tutorial you will:
1. Use one of the most popular machine learning packages in R.
2. Explore a dataset by using statistical summaries and data visualization.
3. Build 5 machinelearning models, pick the best, and build confidence that the accuracy is reliable.
The process of a machine learning project may not be exactly the same, but there are certain standard and necessary steps:
1. Define Problem.
2. Prepare Data.
3. Evaluate Algorithms.
4. Improve Results.
5. Present Results.
1. PACKAGE INSTALLATION & DATA SET
The first thing you have to do is install and load the “caret” package with:
install.packages("caret")
library(caret)
Moreover, we need a dataset to work with. The dataset we chose in our case is “iris,” which contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers belong to one of three species. To attach it to the environment, use:
data(iris)
1.1 Create a Validation Dataset
First of all, we need to validate that our data set is good. Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. To be sure about the accuracy of the best model on unseen data, we will evaluate it on actual unseen data. To do this, we will “deposit” some data that the algorithms will not find and use this data later to get a second and independent idea of how accurate the best model really is.
We will split the loaded dataset into two, 80% of which we will use to train our models and 20% of which we will hold back as a validation dataset. Look at the example below:
#create a list of 80% of rows in the original dataset to use them for training
validation_index < createDataPartition(dataset$Species, p=0.80, list=FALSE)
# select 20% of the data for validation
validation < dataset[validation_index,]
# use the remaining 80% of data to training and testing the models
dataset < dataset[validation_index,]
You now have training data in the dataset variable and a validation set that will be used later in the validation variable.
Learn more about machine learning in the online course Beginner to Advanced Guide on Machine Learning with R Tool. In this course you will learn how to: Create a machine learning algorhitm from a beginner point of view
 Quickly dive into more advanced methods in an accessible pace and with more explanations
 And much more
This course shows a complete workflow start to finish. It is a great introduction and fallback when you have some experience.
2. DATASET SUMMARY
In this step, we are going to explore our data set. More specifically, we need to know certain features of our dataset, like:
1. Dimensions of the dataset.
2. Types of the attributes.
3. Details of the data.
4. Levels of the class attribute.
5. Analysis of the instances in each class.
6. Statistical summary of all attributes.
2.1 Dimensions of Dataset
We can see of how many instances (rows) and how many attributes (columns) the data contains with the dim function. Look at the example below:
dim(dataset)
2.2 Types of Attributes
Knowing the types is important as it can help you summarize the data you have and possible transformations you might need to use to prepare the data before modeilng. They could be doubles, integers, strings, factors and other types. You can find it with:
sapply(dataset, class)
2.3 Details of the Data
You can take a look at the first 5 rows of the data with:
head(dataset)
2.4 Levels of the Class
The class variable is a factor that has multiple class labels or levels. Let’s look at the levels:
levels(dataset$Species)
There are two types of classification problems: the multinomial like this one and the binary if there were two levels.
2.5 Class Distribution
Let’s now take a look at the number of instances that belong to each class. We can view this as an absolute count and as a percentage with:
percentage < prop.table(table(dataset$Species)) * 100
cbind(freq=table(dataset$Species), percentage=percentage)
2.6 Statistical Summary
This includes the mean, the min and max values, as well as some percentiles. Look at the example below:
summary(dataset)
3. DATASET VISUALIZATION
We now have a basic idea about the data. We need to extend that with some visualizations, and for that reason we are going to use two types of plots:
1. Univariate plots to understand each attribute.
2. Multivariate plots to understand the relationships between attributes.
3.1 Univariate Plots
We can visualize just the input attributes and just the output attributes. Let’s set that up and call the input attributes x and the output attributes y.
x < dataset[,1:4]
y < dataset[,5]
Since the input variables are numeric, we can create box and whisker plots of each one with:
par(mfrow=c(1,4))
for(i in 1:4) {
boxplot(x[,i], main=names(iris)[i])
}
We can also create a barplot of the Species class variable to graphically display the class distribution.
plot(y)
3.2 Multivariate Plots
First, we create scatterplots of all pairs of attributes and color the points by class. Then, we can draw ellipses around them to make them more easily separated.
You have to install and call the “ellipse” package to do this.
install.packages("ellipse")
library(ellipse)
featurePlot(x=x, y=y, plot="ellipse")
We can also create box and whisker plots of each input variable, but this time they are broken down into separate plots for each class.
featurePlot(x=x, y=y, plot="box")
Next, we can get an idea of the distribution of each attribute. We will use some probability density plots to give smooth lines for each distribution.
scales < list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x, y=y, plot="density", scales=scales)
4. ALGORITHMS EVALUATION
Now it is time to create some models of the data and estimate their accuracy on unseen data.
1. Use the test harness to use 10fold cross validation.
2. Build 5 different models to predict species from flower measurements.
3. Select the best model.
4.1 Test Harness
This will split our dataset into 10 parts, train in 9, test on 1, and release for all combinations of traintest splits.
control < trainControl(method="cv", number=10)
metric < "Accuracy"
We are using the metric of “Accuracy” to evaluate models. This is: (number of correctly predicted instances / divided by the total number of instances in the dataset)*100 to give a percentage.
4.2 Build Models
We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that we created earlier.
Algorithms evaluation:
1. Linear Discriminant Analysis (LDA)
2. Classification and Regression Trees (CART).
3. kNearest Neighbors (kNN).
4. Support Vector Machines (SVM) with a linear kernel.
5. Random Forest (RF)
This is a good mixture of simple linear (LDA), nonlinear (CART, kNN) and complex nonlinear methods (SVM, RF). We reset the random number seed before reach run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. It ensures the results are directly comparable.
NOTE: To proceed, first install and load the following packages: “rpart”, “kernlab”, “e1071” and “randomForest”.
Let’s build our five models:
# a) linear algorithms
set.seed(7)
fit.lda < train(Species~., data=dataset, method="lda", metric=metric, trControl=control)
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart < train(Species~., data=dataset, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn < train(Species~., data=dataset, method="knn", metric=metric, trControl=control)
# c) advanced algorithms
# SVM
set.seed(7)
fit.svm < train(Species~., data=dataset, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf < train(Species~., data=dataset, method="rf", metric=metric, trControl=control)
4.3 Select the Best Model
We now have 5 models and accuracy estimations for each so we have to compare them.
It is a good idea to create a list of the created models and use the summary function.
results < resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
Moreover, we can create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times.
dotplot(results)
You can summarize the results for just the LDA model that seems to be the most accurate.
print(fit.lda)
5. Make Predictions
The LDA was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.
We can run the LDA model directly on the validation set and summarize the results in a confusion matrix.
predictions < predict(fit.lda, validation)
confusionMatrix(predictions, validation$Species)
 Vector exercises
 Evaluate your model with R Exercises
 Neural networks Exercises (Part2)
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
BH 1.65.01
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
The BH package on CRAN was updated today to version 1.65.0. BH provides a sizeable portion of the Boost C++ libraries as a set of template headers for use by R, possibly with Rcpp as well as other packages.
This release upgrades the version of Boost to the rather new upstream version Boost 1.65.0 released earlier this week, and adds two new libraries: align and sort.
I had started the upgrade process a few days ago under release 1.64.0. Rigorous checking of reverse dependencies showed that mvnfast needed a small change (which was trivial: just seeding the RNG prior to running tests), which Matteo did in no time with a fresh CRAN upload. rstan is needing a bit more work but should be ready real soon now and we are awaiting a new version. And once I switched to the just release Boost 1.65.0 it became apparent that Cyclops no longer needs its embedded copy of Boost iterator—and Marc already made that change with yet another fresh CRAN upload. It is a true pleasure to work in such a responsive and collaborative community.
Changes in version 1.65.01 (20170824)
Upgraded to Boost 1.64 and then 1.65 installed directly from upstream source with several minor tweaks (as before)

Fourth tweak corrects a misplaced curly brace (see the Boost ublas GitHub repo and its issue #40)

Added Boost multiprecision by fixing a script typo (as requested in #42)

Updated Travis CI support via newer run.sh
Via CRANberries, there is a diffstat report relative to the previous release.
Comments and suggestions are welcome via the mailing list or the issue tracker at the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Newer to R? rstudio::conf 2018 is for you! Early bird pricing ends August 31.
(This article was first published on RStudio Blog, and kindly contributed to Rbloggers)
Immersion is among the most effective ways to learn any language. Immersing where new and advanced users come together to improve their use of the R language is a rare opportunity. rstudio::conf 2018 is that time and place!
Be an Early Bird! Discounts for early conference registration expire August 31.
Immerse as a team! Ask us about group discounts for 5 or more from the same organization.
Rstudio::conf 2018 is a two day conference with optional two day workshops. One of the conference tracks will focus on topics for newer R users. Newer R users will learn about the best ways to use R, to avoid common pitfalls and accelerate proficiency. Several workshops are also designed specifically for those newer to R.
Intro to R & RStudio
Are you new to R & RStudio and do you learn best in person? You will learn the basics of R and data science, and practice using the RStudio IDE (integrated development environment) and R Notebooks. We will have a team of TAs on hand to show you the ropes and help you out when you get stuck.
This course is taught by wellknown R educator and friend of RStudio, Amelia McNamara, a Smith College Visiting Assistant Professor of Statistical and Data Sciences & Mass Mutual Faculty Fellow.

Are you ready to begin applying the book, R for Data Science? Learn how to achieve your data analysis goals the “tidy” way. You will visualize, transform, and model data in R and work with datetimes, character strings, and untidy data formats. Along the way, you will learn and use many packages from the tidyverse including ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, lubridate, and forcats.
This course is taught by friend of RStudio, Charlotte Wickham, a professor and award winning teacher and data analyst at Oregon State University.

Do you want to share your data analysis with others in effective ways? For people who know their way around the RStudio IDE and R at least a little, this workshop will help you become proficient in Shiny application development and using R Markdown to communicate insights from data analysis to others.
This course is taught by Mine ÇetinkayaRundel, Duke professor and RStudio professional educator. Mine is well known for her open education efforts, and her popular data science MOOCs.
Whether you are new to the R language or as advanced as many of our speakers and educators, rstudio::conf 2018 is the place and time to focus on all things R & RStudio.
We hope to see you in San Diego!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Gradient boosting in R
(This article was first published on R Programming – DataScience+, and kindly contributed to Rbloggers)
Boosting is another famous ensemble learning technique in which we are not concerned with reducing the variance of learners like in Bagging where our aim is to reduce the high variance of learners by averaging lots of models fitted on bootstrapped data samples generated with replacement from training data, so as to avoid overfitting.
Another major difference between both the techniques is that in Bagging the various models which are generated are independent of each other and have equal weightage .Whereas Boosting is a sequential process in which each next model which is generated is added so as to improve a bit from the previous model.Simply saying each of the model that is added to mix is added so as to improve on the performance of the previous collection of models.In Boosting we do weighted averaging.
Both the ensemble techniques are common in terms of generating lots of models on training data and using their combined power to increase the accuracy of the final model which is formed by combining them.
But Boosting is more towards Bias i.e simple learners or more specifically Weak learners. Now a weak learner is a learner which always learns something i.e does better than chance and also has error rate less then 50%.The best example of a weak learner is a Decision tree.This is the reason we generally use ensemble technique on decision trees to improve its accuracy and performance.
In Boosting each tree or Model is grown or trained using the hard examples.By hard I mean all the training examples \( (x_i,y_i) \) for which a previous model produced incorrect output \(Y\).Boosting boosts the performance of a simple baselearner by iteratively shifting the focus towards problematic training observations that are difficult to predict.Now that information from the previous model is fed to the next model.And the thing with boosting is that every new tree added to the mix will do better than the previous tree because it will learn from the mistakes of the previous models and try not to repeat them.Hence by this technique it will eventually convert a weak learner to a strong learner which is better and more accurate in generalization for unseen test examples.
An important thing to remember in boosting is that the base learner which is being boosted should not be a complex and complicated learner which has high variance for e.g a neural network with lots of nodes and high weight values.For such learners boosting will have inverse effects.
So I will explain Boosting with respect to decision trees in this tutorial because they can be regarded as weak learners most of the times.We will generate a gradient boosting model.
Gradient boosting generates learners using the same general boosting learning process. It first builds learner to predict the values/labels of samples, and calculate the loss (the difference between the outcome of the first learner and the real value). It will build a second learner to predict the loss after the first step. The step continues to learn the third, forth… until certain threshold.Gradient boosting identifies hard examples by calculating large residuals\( (y_{actual}y_{pred} ) \) computed in the previous iterations.
Implementing Gradient BoostingLet’s use gbm package in R to fit gradient boosting model.
require(gbm) require(MASS)#package with the boston housing dataset #separating training and test data train=sample(1:506,size=374)We will use the Boston housing data to predict the median value of the houses.
Boston.boost=gbm(medv ~ . ,data = Boston[train,],distribution = "gaussian",n.trees = 10000, shrinkage = 0.01, interaction.depth = 4) Boston.boost summary(Boston.boost) #Summary gives a table of Variable Importance and a plot of Variable Importance gbm(formula = medv ~ ., distribution = "gaussian", data = Boston[train, ], n.trees = 10000, interaction.depth = 4, shrinkage = 0.01) A gradient boosted model with gaussian loss function. 10000 iterations were performed. There were 13 predictors of which 13 had nonzero influence. >summary(Boston.boost) var rel.inf rm rm 36.96963915 lstat lstat 24.40113288 dis dis 10.67520770 crim crim 8.61298346 age age 4.86776735 black black 4.23048222 nox nox 4.06930868 ptratio ptratio 2.21423811 tax tax 1.73154882 rad rad 1.04400159 indus indus 0.80564216 chas chas 0.28507720 zn zn 0.09297068The above Boosted Model is a Gradient Boosted Model which generates 10000 trees and the shrinkage parametet (\lambda= 0.01\) which is also a sort of learning Rate. Next parameter is the interaction depth which is the total splits we want to do.So here each tree is a small tree with only 4 splits.
The summary of the Model gives a feature importance plot.In the above list is on the top is the most important variable and at last is the least important variable.
And the 2 most important features which explain the maximum variance in the Data set is lstat i.e lower status of the population (percent) and rm which is average number of rooms per dwelling.
The partial Dependence Plots will tell us the relationship and dependence of the variables \(X_i\) with the Response variable \(Y\).
#Plot of Response variable with lstat variable plot(Boston.boost,i="lstat") #Inverse relation with lstat variable plot(Boston.boost,i="rm") #as the average number of rooms increases the the price increasesThe above plot simply shows the relation between the variables in the xaxis and the mapping function \(f(x)\) on the yaxis.First plot shows that lstat is negatively correlated with the response mdev, whereas the second one shows that rm is somewhat directly related to mdev.
cor(Boston$lstat,Boston$medv)#negetive correlation coeffr cor(Boston$rm,Boston$medv)#positive correlation coeffr > cor(Boston$lstat,Boston$medv) [1] 0.7376627 > cor(Boston$rm,Boston$medv) [1] 0.6953599 Prediction on Test SetWe will compute the Test Error as a function of number of Trees.
n.trees = seq(from=100 ,to=10000, by=100) #no of treesa vector of 100 values #Generating a Prediction matrix for each Tree predmatrix<predict(Boston.boost,Boston[train,],n.trees = n.trees) dim(predmatrix) #dimentions of the Prediction Matrix #Calculating The Mean squared Test Error test.error<with(Boston[train,],apply( (predmatrixmedv)^2,2,mean)) head(test.error) #contains the Mean squared test error for each of the 100 trees averaged #Plotting the test error vs number of trees plot(n.trees , test.error , pch=19,col="blue",xlab="Number of Trees",ylab="Test Error", main = "Perfomance of Boosting on Test Set") #adding the RandomForests Minimum Error line trained on same data and similar parameters abline(h = min(test.err),col="red") #test.err is the test error of a Random forest fitted on same data legend("topright",c("Minimum Test error Line for Random Forests"),col="red",lty=1,lwd=1) dim(predmatrix) [1] 206 100 head(test.error) 100 200 300 400 500 600 26.428346 14.938232 11.232557 9.221813 7.873472 6.911313In the above plot the red line represents the least error obtained from training a Random forest with same data and same parameters and number of trees.Boosting outperforms Random Forests on same test dataset with lesser Mean squared Test Errors.
ConclusionFrom the above plot we can notice that if boosting is done properly by selecting appropriate tuning parameters such as shrinkage parameter \(lambda\) ,the number of splits we want and the number of trees \(n\), then it can generalize really well and convert a weak learner to strong learner. Ensembling techniques are really well and tend to outperform a single learner which is prone to either overfitting or underfitting or generate thousands or hundreds of them,then combine them to produce a better and stronger model.
Hope you guys liked the article, make sure to like and share. Cheers!!.
Related Post
 Radial kernel Support Vector Classifier
 Random Forests in R
 Network analysis of Game of Thrones
 Structural Changes in Global Warming
 Deep Learning with R
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Linear Congruential Generator in R
(This article was first published on R – Aaron Schlegel, and kindly contributed to Rbloggers)
Part of 1 in the series Random Number GenerationA Linear congruential generator (LCG) is a class of pseudorandom number generator (PRNG) algorithms used for generating sequences of randomlike numbers. The generation of random numbers plays a large role in many applications ranging from cryptography to Monte Carlo methods. Linear congruential generators are one of the oldest and most wellknown methods for generating random numbers primarily due to their comparative ease of implementation and speed and their need for little memory. Other methods such as the Mersenne Twister are much more common in practical use today.
Linear congruential generators are defined by a recurrence relation:
There are many choices for the parameters , the modulus, , the multiplier, and the increment. Wikipedia has a seemingly comprehensive list of the parameters currently in use in common programs.
Aside: ‘Pseudorandom’ and Selecting a Seed NumberRandom number generators such as LCGs are known as ‘pseudorandom’ as they require a seed number to generate the random sequence. Due to this requirement, random number generators today are not truly ‘random.’ The theory and optimal selection of a seed number are beyond the scope of this post; however, a common choice suitable for our application is to take the current system time in microseconds.
A Linear Congruential Generator Implementation in RThe parameters we will use for our implementation of the linear congruential generator are the same as the ANSI C implementation (Saucier, 2000.).
The following function is an implementation of a linear congruential generator with the given parameters above.
lcg.rand < function(n=10) { rng < vector(length = n) m < 2 ** 32 a < 1103515245 c < 12345 # Set the seed using the current system time in microseconds d < as.numeric(Sys.time()) * 1000 for (i in 1:n) { d < (a * d + c) %% m rng[i] < d / m } return(rng) }We can use the function to generate random numbers .
# Print 10 random numbers on the halfopen interval [0, 1) lcg.rand() ## [1] 0.4605103 0.6643705 0.6922703 0.4603930 0.1842995 0.6804419 0.8561535 ## [8] 0.2435846 0.8236771 0.9643965We can also demonstrate how apparently ‘random’ the LCG is by plotting a sample generation in 3 dimensions. To do this, we generate three random vectors using our LCG above and plot. The plot3d package is used to create the scatterplot, and the animation package is used to animate each scatterplot as the length of the random vectors, , increases.
library(plot3D) library(animation) n < c(3, 10, 20, 100, 500, 1000, 2000, 5000, 10000, 20000) saveGIF({ for (i in 1:length(n)) { x < lcg.rand(n[i]) y < lcg.rand(n[i]) z < lcg.rand(n[i]) scatter3D(x, y, z, colvar = NULL, pch=20, cex = 0.5, theta=20, main = paste('n = ', n[i])) } }, movie.name = 'lcg.gif')As increases, the LCG appears to be ‘random’ enough as demonstrated by the cloud of points.
Linear Congruential Generators with Poor ParametersThe values chosen for the parameters are very important in driving how ‘random’ the generated values from the linear congruential estimator; hence that is why there are so many different parameters in use today as there has not yet been a clear consensus on the ‘best’ parameters to use.
We can demonstrate how choosing poor parameters for our LCG leads to not so random generated values by creating a new LCG function.
lcg.poor < function(n=10) { rng < vector(length = n) # Parameters taken from https://www.mimuw.edu.pl/~apalczew/CFP_lecture3.pdf m < 2048 a < 1229 c < 1 d < as.numeric(Sys.time()) * 1000 for (i in 1:n) { d < (a * d + c) %% m rng[i] < d / m } return(rng) }Generating successively longer vectors using the ‘poor’ LCG and plotting as we did previously, we see the generated points are very sequentially correlated, and there doesn’t appear to be any ‘randomness’ at all as increases.
n < c(3, 10, 20, 100, 500, 1000, 2000, 5000, 10000, 20000) saveGIF({ for (i in 1:length(n)) { x < lcg.poor(n[i]) y < lcg.poor(n[i]) z < lcg.poor(n[i]) scatter3D(x, y, z, colvar = NULL, pch=20, cex=0.5, theta=20, main = paste('n = ', n[i])) } }, movie.name = 'lcg_poor.gif') ReferencesSaucier, R. (2000). Computer Generation of Statistical Distributions (1st ed.). Aberdeen, MD. Army Research Lab.
The post Linear Congruential Generator in R appeared first on Aaron Schlegel.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Aaron Schlegel. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Calculating a fuzzy kmeans membership matrix with R and Rcpp
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
by Błażej Moska, computer science student and data science intern
Suppose that we have performed clustering Kmeans clustering in R and are satisfied with our results, but later we realize that it would also be useful to have a membership matrix. Of course it would be easier to repeat clustering using one of the fuzzy kmeans functions available in R (like fanny, for example), but since it is slightly different implementation the results could also be different and for some reasons we don’t want them to be changed. Knowing the equation we can construct this matrix on our own, after using the kmeans function. The equation is defined as follows (source: Wikipedia):
$$w_{ij} = \frac{1}{ \sum_ {k=1}^c ( \frac{ \ x_{i} – c_{j} \ }{ \ x_{i} – c_{k} \ }) ^{ \frac{2}{m1} } } $$
\(w_{ij}\) denotes to what extent the \(i\)th object belongs to the \(j\)th cluster. So the total number of rows of this matrix equals number of observation and total number of columns equals number of variables included in clustering. \(m\) is a parameter, typically set to \(m=2\). \(w_{ij}\) values range between 0 and 1 so they are easy and convenient to compare. In this example I will use \(m = 2\) so the Euclidean distance will be calculated.
To make computations faster I also used the Rcpp package, then I compared speed of execution of function written in R with this written in C++.
In implementations for loops were used, although it is not a commonly used method in R (see this blog post for more information and alternatives), but in this case I find it more convenient.
Rcpp (C++ version) #include #include using namespace Rcpp; // [[Rcpp::export]] NumericMatrix fuzzyClustering(NumericMatrix data, NumericMatrix centers, int m) { /* data is a matrix with observations(rows) and variables, centers is a matrix with cluster centers coordinates, m is a parameter of equation, c is a number of clusters */ int c=centers.rows(); int rows = data.rows(); int cols = data.cols(); /*number of columns equals number of variables, the same as is in centers matrix*/ double tempDist=0; /*dist and tempDist are variables storing temporary euclidean distances */ double dist=0; double denominator=0; //denominator of “main” equation NumericMatrix result(rows,c); //declaration of matrix of results for(int i=0;i<rows;i++){ for(int j=0;j<c;j++){ for(int k=0;k<c;k++){ for(int p=0;p<cols;p++){ tempDist = tempDist+pow(centers(j,p)data(i,p),2); //in innermost loop an euclidean distance is calculated. dist = dist + pow(centers(k,p)data(i,p),2); /*tempDist is nominator inside the sum operator in the equation, dist is the denominator inside the sum operator in the equation*/ } tempDist = sqrt(tempDist); dist = sqrt(dist); denominator = denominator+pow((tempDist/dist),(2/(m1))); tempDist = 0; dist = 0; } result(i,j) = 1/denominator; // nominator/denominator in the main equation denominator = 0; } } return result; }We can save this in a file with .cpp extension. To compile it from R we can write:
sourceCpp("path_to_cpp_file")If everything goes right our function fuzzyClustering will then be available from R.
R version fuzzyClustering=function(data,centers,m){ c < nrow(centers) rows < nrow(data) cols < ncol(data) result < matrix(0,nrow=rows,ncol=c) #defining membership matrix denominator < 0 for(i in 1:rows){ for(j in 1:c){ tempDist < sqrt(sum((centers[j,]data[i,])^2)) #euclidean distance, nominator inside a sum operator for(k in 1:c){ Dist < sqrt(sum((centers[k,]data[i,])^2)) #euclidean distance, denominator inside a sum operator denominator < denominator +((tempDist/Dist)^(2/(m1))) #denominator of an equation } result[i,j] < 1/denominator #inserting value into membership matrix denominator < 0 } } return(result); }Result looks as follows. Columns are cluster numbers (in this case 10 clusters were created), rows are our objects (observations). Values were rounded to the third decimal place, so the sums of rows can be slightly different than 1:
1 2 3 4 5 6 7 8 9 10 [1,] 0.063 0.038 0.304 0.116 0.098 0.039 0.025 0.104 0.025 0.188 [2,] 0.109 0.028 0.116 0.221 0.229 0.080 0.035 0.116 0.017 0.051 [3,] 0.067 0.037 0.348 0.173 0.104 0.066 0.031 0.095 0.018 0.062 [4,] 0.016 0.015 0.811 0.049 0.022 0.017 0.009 0.023 0.007 0.031 [5,] 0.063 0.048 0.328 0.169 0.083 0.126 0.041 0.079 0.018 0.045 [6,] 0.069 0.039 0.266 0.226 0.102 0.111 0.037 0.084 0.017 0.048 [7,] 0.045 0.039 0.569 0.083 0.060 0.046 0.025 0.071 0.015 0.046 [8,] 0.070 0.052 0.399 0.091 0.093 0.054 0.034 0.125 0.022 0.062 [9,] 0.095 0.037 0.198 0.192 0.157 0.088 0.038 0.121 0.019 0.055 [10,] 0.072 0.024 0.132 0.375 0.148 0.059 0.025 0.081 0.015 0.067 Performance comparisonShown below is the output of Sys.time for the C++ and R versions, running against a simulated matrix with 30000 observations, 3 variables and 10 clusters.
The hardware I used was a lowcost notebook Asus R556L with Intel Core i35010 2.1 GHz processor and 8 GB DDR3 1600 MHz RAM memory.
C++ version:
user system elapsed 0.32 0.00 0.33R version:
user system elapsed 15.75 0.02 15.94In this example, the function written in C++ executed about 50 times faster than the equivalent function written in pure R.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Reticulating Readability
(This article was first published on R – rud.is, and kindly contributed to Rbloggers)
I needed to clean some web HTML content for a project and I usually use hgr::clean_text() for it and that generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non“main text content” from an HTML document and it usually does a good job but there are some pages that it fails miserably on since it’s more of a bruteforce method than one that uses any real “intelligence” when performing the text node targeting.
Most modern browsers have inherent or pluginable “readability” capability, and most of those are based — at least in part — on the seminal Arc90 implementation. Many programming languages have a package or module that use a similar methodology, but I’m not aware of any R ports.
What do I mean by “clean txt”? Well, I can’t show the URL I was having trouble processing but I can show an example using a recent rOpenSci blog post. Here’s what the raw HTML looks like after retrieving it:
library(xml2) library(httr) library(reticulate) library(magrittr) res < GET("https://ropensci.org/blog/blog/2017/08/22/visdat") content(res, as="text", endoding="UTF8") ## [1] "\n \n\n\n \n \n \n \n \n \n\n \n\n \n \n\n \n \n \n \n \n \n \n try{Typekit.load();}catch(e){}\n \n hljs.initHighlightingOnLoad();\n\n Onboarding visdat, a tool for preliminary visualisation of whole dataframes\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n(it goes on for a bit, best to run the code locally)
We can use the reticulate package to load the Python readability module to just get the clean, article text:
readability < import("readability") # pip install readabilitylxml doc < readability$Document(httr::content(res, as="text", endoding="UTF8")) doc$summary() %>% read_xml() %>% xml_text() # [1] "Take a look at the dataThis is a phrase that comes up when you first get a dataset.It is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?Starting down either path, you often encounter the nontrivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.These growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.The visdat package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".Making visdat was fun, and it was easy to use. But I couldn't help but think that maybe visdat could be more. I felt like the code was a little sloppy, and that it could be better. I wanted to know whether others found it useful.What I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.Too much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci onboarding process provides.rOpenSci onboarding basicsOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with JOSS.What's in it for the author?Feedback on your packageSupport from rOpenSci membersMaintain ownership of your packagePublicity from it being under rOpenSciContribute something to rOpenSciPotentially a publicationWhat can rOpenSci do that CRAN cannot?The rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN 1. Here's what rOpenSci does that CRAN cannot:Assess documentation readability / usabilityProvide a code review to find weak points / points of improvementDetermine whether a package is overlapping with another.(again, it goes on for a bit, best to run the code locally)
That text is now in good enough shape to tidy.
Here’s the same version with clean_text():
# devtools::install_github("hrbrmstr/hgr") hgr::clean_text(content(res, as="text", endoding="UTF8")) ## [1] "Onboarding visdat, a tool for preliminary visualisation of whole dataframes\n \n \n \n \n \n \n \n \n \n August 22, 2017 \n \n \n \n \nTake a look at the data\n\n\nThis is a phrase that comes up when you first get a dataset.\n\nIt is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?\n\nStarting down either path, you often encounter the nontrivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.\n\nThese growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.\n\nThe package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".\n\nMaking was fun, and it was easy to use. But I couldn't help but think that maybe could be more.\n\n I felt like the code was a little sloppy, and that it could be better.\n I wanted to know whether others found it useful.\nWhat I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.\n\nToo much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci provides.\n\nrOpenSci onboarding basics\n\nOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with .\n\nWhat's in it for the author?\n\nFeedback on your package\nSupport from rOpenSci members\nMaintain ownership of your package\nPublicity from it being under rOpenSci\nContribute something to rOpenSci\nPotentially a publication\nWhat can rOpenSci do that CRAN cannot?\n\nThe rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN . Here's what rOpenSci does that CRAN cannot:\n\nAssess documentation readability / usability\nProvide a code review to find weak points / points of improvement\nDetermine whether a package is overlapping with another.(lastly, it goes on for a bit, best to run the code locally)
As you can see, even though that version is usable, readability does a much smarter job of cleaning the text.
The Python code is quite — heh — readable, and R could really use a native port (i.e. this would be a ++gd project or an aspiring package author to take on).
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Big Data analytics with RevoScaleR Exercises
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
In this set of exercise , you will explore how to handle bigdata with RevoscaleR package from Microsoft R (previously Revolution Analytics).It comes with Microsoft R client . You can get it from here . get the Credit card fraud data set from revolutionanalytics and lets get started
Answers to the exercises are available here.Please check the documentation before starting these exercise set
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
The heart of RevoScaleR is the xdf file format , convert the creditcardfraud data set into xdf format .
Exercise 2
use the newly created xdf file to get information about the variables and print 10 rows to check the data .
Learn more about importing big data in the online course Data Mining with R: Go from Beginner to Advanced. In this course you will learn how to work with different data import techniques,
 know how to import data and transform it for a specific moddeling or analysis goal,
 and much more.
Exercise 3
use rxSummary ,get the summary for variables gender, balance ,cardholder where numTrans is greater than 10
Exercise 4
use rxDataStep and create a variable avgbalpertran which is balance /numTran+numIntlTran .use rxGetInfo to check if your changes being reflected in the xdf data
Exercise 5
use rxCor and find the correlation between the newly created variable and fraudRisk
Exercise 6
use rxLinMod to construct the linear regression of fraudRisk on gender,balance and cardholder. Dont forget to check the summary of the model .
Exercise 7
Find the contingency table of fraudRisk and Gender , use rxCrossTab .Hint : Figure out how to include factors in the formula .
Exercise 8
use rxCube to find the mean balance for each of the two genders .
Exercise 9
Create a histogram from the xdf file on balance to show the relative frequency histogram .
Exercise 10
Create a two panel histogram with gender and fradurisk as explanatory variable to show the relative frequency of fraudrisk in two genders .
 Vector exercises
 Cross Tabulation with Xtabs exercises
 Data Exploration with Tables exercises
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Introducing ‘powerlmm’ an R package for power calculations for longitudinal multilevel models
(This article was first published on R Psychologist  R, and kindly contributed to Rbloggers)
Over the years I’ve produced quite a lot of code for power calculations and simulations of different longitudinal linear mixed models. Over the summer I bundled together these calculations for the designs I most typically encounter into an R package. The purpose of powerlmm is to help design longitudinal treatment studies, with or without higherlevel clustering (e.g. by therapists, groups, or physician), and missing data. Currently, powerlmm supports twolevel models, nested threelevel models, and partially nested models. Additionally, unbalanced designs and missing data can be accounted for in the calculations. Power is calculated analytically, but simulation methods are also provided in order to evaluated bias, type 1 error, and the consequences of model misspecification. For novice R users, the basic functionality is also provided as a Shiny web application.
The package can be install from GitHub here: github.com/rpsychologist/powerlmm. Currently, the packages includes three vignettes that show how to setup your studies and calculate power.
A basic example library(powerlmm) # dropout per treatment group d < per_treatment(control = dropout_weibull(0.3, 2), treatment = dropout_weibull(0.2, 2)) # Setup design p < study_parameters(n1 = 11, # time points n2 = 10, # subjects per cluster n3 = 5, # clusters per treatment arm icc_pre_subject = 0.5, icc_pre_cluster = 0, icc_slope = 0.05, var_ratio = 0.02, dropout = d, cohend = 0.8) # Power get_power(p) ## ## Power calculation for longitudinal linearmixed model (threelevel) ## with missing data and unbalanced designs ## ## n1 = 11 ## n2 = 10 (treatment) ## 10 (control) ## n3 = 5 (treatment) ## 5 (control) ## 10 (total) ## total_n = 50 (treatment) ## 50 (control) ## 100 (total) ## dropout = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 (time) ## 0, 0, 1, 3, 6, 9, 12, 16, 20, 25, 30 (%, control) ## 0, 0, 1, 2, 4, 5, 8, 10, 13, 17, 20 (%, treatment) ## icc_pre_subjects = 0.5 ## icc_pre_clusters = 0 ## icc_slope = 0.05 ## var_ratio = 0.02 ## cohend = 0.8 ## power = 0.68 FeedbackI appreciate all types of feedback, e.g. typos, bugs, inconsistencies, feature requests, etc. Open an issue on github.com/rpsychologist/powerlmm/issues or via my contact info here.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R Psychologist  R. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
New R Course: Sentiment Analysis in R – The Tidy Way
(This article was first published on DataCamp Blog, and kindly contributed to Rbloggers)
Hello, R users! This week we’re continuing to bridge the gap between computers and human language with the launch Sentiment Analysis in R: The Tidy Way by Julia Silge!
Text datasets are diverse and ubiquitous, and sentiment analysis provides an approach to understand the attitudes and opinions expressed in these texts. In this course, you will develop your text mining skills using tidy data principles. You will apply these skills by performing sentiment analysis in several case studies, on text data from Twitter to TV news to Shakespeare. These case studies will allow you to practice important data handling skills, learn about the ways sentiment analysis can be applied, and extract relevant insights from realworld data.
Sentiment Analysis in R: The Tidy Way features interactive exercises that combine highquality video, inbrowser coding, and gamification for an engaging learning experience that will make you an expert in sentiment analysis!
What you’ll learn:Chapter 1: Tweets across the United States
In this chapter, you will implement sentiment analysis using tidy data principles using geocoded Twitter data.
Chapter 2: Shakespeare gets SentimentalYour next realworld text exploration uses tragedies and comedies by Shakespeare to show how sentiment analysis can lead to insight into differences in word use. You will learn how to transform raw text into a tidy format for further analysis.
Chapter 3: Analyzing TV NewsText analysis using tidy principles can be applied to diverse kinds of text, and in this chapter, you will explore a dataset of closed captioning from television news. You will apply the skills you have learned so far to explore how different stations report on a topic with different words, and how sentiment changes with time.
Chapter 4: Singing a Happy Song (or Sad?!)In this final chapter on sentiment analysis using tidy principles, you will explore pop song lyrics that have topped the charts from the 1960s to today. You will apply all the techniques we have explored together so far, and use linear modeling to find what the sentiment of song lyrics can predict.
Learn all there is to know about Sentiment Analysis in R, the Tidy Way!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...