R news and tutorials contributed by hundreds of R bloggers

Multiplicative Congruential Generators in R

Thu, 08/31/2017 - 20:00

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

Part 2 of 2 in the series Random Number Generation

Multiplicative congruential generators, also known as Lehmer random number generators, are a type of linear congruential generator for generating pseudorandom numbers in (0, 1). The multiplicative congruential generator, often abbreviated as MLCG or MCG, is defined by a recurrence relation similar to the LCG, but with the increment c set to 0: X_{k+1} = a X_k mod m.

Unlike the LCG, the parameters a and m for multiplicative congruential generators are more restricted, and the initial seed X_0 must be relatively prime to the modulus m (the greatest common divisor of X_0 and m is 1). The parameters in common use are the Mersenne prime modulus m = 2^31 – 1 and the multiplier a = 7^5 = 16807. However, in a correspondence from the Communications of the ACM, Park, Miller and Stockmeyer changed the value of the parameter a, stating:

The minimal standard Lehmer generator we advocated had a modulus of m = 2^31 – 1 and a multiplier of a = 16807. Relative to this particular choice of multiplier, we wrote “… if this paper were to be written again in a few years it is quite possible that we would advocate a different multiplier ….” We are now prepared to do so. That is, we now advocate a = 48271 and, indeed, have done so “officially” since July 1990. This new advocacy is consistent with the discussion on page 1198 of [10]. There is nothing wrong with 16807; we now believe, however, that 48271 is a little better (with q = 44488, r = 3399).

Multiplicative Congruential Generators with Schrage’s Method

When using a large prime modulus such as 2^31 – 1, the multiplicative congruential generator can overflow. Schrage’s method was invented to overcome the possibility of overflow, and it applies when a(m mod a) < m. We can check that the parameters in use satisfy this condition:

a <- 48271
m <- 2 ** 31 - 1
a * (m %% a) < m
## [1] TRUE

Schrage’s method restates the modulus m as the decomposition m = aq + r, where q = m div a (the integer part of m/a) and r = m mod a.
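With a = 48271 and m = 2^31 – 1 as above, the two pieces of this decomposition can be computed directly in R, and they reproduce the q = 44488 and r = 3399 quoted by Park, Miller and Stockmeyer:

m <- 2147483647   # 2^31 - 1
a <- 48271
m %/% a           # q
## [1] 44488
m %% a            # r
## [1] 3399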

Multiplicative Congruential Generator in R

We can implement a Lehmer random number generator in R using the parameters mentioned earlier.

lehmer.rng <- function(n = 10) {
  rng <- vector(length = n)

  m <- 2147483647
  a <- 48271
  q <- 44488
  r <- 3399

  # Set the seed using the current system time in microseconds.
  # The initial seed value must be coprime to the modulus m,
  # which we are not really concerning ourselves with for this example.
  d <- as.numeric(Sys.time())

  for (i in 1:n) {
    h <- d %/% q          # integer part of d / q, as Schrage's method requires
    l <- d %% q
    t <- a * l - r * h
    if (t < 0) {
      d <- t + m          # add m back when the intermediate result is negative
    } else {
      d <- t
    }
    rng[i] <- d / m
  }
  return(rng)
}

# Print the first 10 randomly generated numbers
lehmer.rng()
## [1] 0.68635675 0.12657390 0.84869106 0.16614698 0.08108171 0.89533896
## [7] 0.90708773 0.03195725 0.60847522 0.70736551

Plotting our multiplicative congruential generator in three dimensions allows us to visualize the apparent ‘randomness’ of the generator. As before, we generate three random vectors with our Lehmer RNG function and plot the points. The plot3D package is used to create the scatterplots and the animation package is used to animate each scatterplot as the length of the random vectors, n, increases.

library(plot3D)
library(animation)

n <- c(3, 10, 20, 100, 500, 1000, 2000, 5000, 10000, 20000)

saveGIF({
  for (i in 1:length(n)) {
    x <- lehmer.rng(n[i])
    y <- lehmer.rng(n[i])
    z <- lehmer.rng(n[i])

    scatter3D(x, y, z, colvar = NULL, pch = 20, cex = 0.5, theta = 20,
              main = paste('n = ', n[i]))
  }
}, movie.name = 'lehmer.gif')

The generator appears to be generating suitably random numbers, demonstrated by the increasing swarm of points as n increases.

References

Anne Gille-Genest (March 1, 2012). Implementation of the Pseudo-Random Number Generators and the Low Discrepancy Sequences.

Saucier, R. (2000). Computer Generation of Statistical Distributions (1st ed.). Aberdeen, MD. Army Research Lab.

Stephen K. Park; Keith W. Miller; Paul K. Stockmeyer (1993). “Technical Correspondence”. Communications of the ACM. 36 (7): 105–110.

The post Multiplicative Congruential Generators in R appeared first on Aaron Schlegel.


Probability functions intermediate

Thu, 08/31/2017 - 18:57

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In this set of exercises, we are going to explore some of the probability functions in R by using practical applications. Basic probability knowledge is required. In case you are not familiar with the function apply, check the R documentation.

Note: We are going to use random number functions and random process functions in R such as runif. A problem with these functions is that every time you run them, you will obtain a different value. To make your results reproducible you can specify the value of the seed using set.seed(‘any number’) before calling a random function. (If you are not familiar with seeds, think of them as the tracking number of your random number process.) For this set of exercises, we will use set.seed(1). Don’t forget to specify it before every exercise that includes random numbers.
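For instance, resetting the seed makes the same call reproduce exactly the same draws (a quick illustration of the idea, not part of the exercises):

set.seed(1)
x1 <- runif(3)
set.seed(1)
x2 <- runif(3)
identical(x1, x2)
## [1] TRUE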

 

 

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Generating dice rolls: set your seed to 1 and generate 30 random numbers using runif. Save them in an object called random_numbers. Then use the ceiling function to round the values. These values represent rolled dice values.
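If you want to check the shape of your answer afterwards, one possible approach (a sketch only; the official solutions are linked above) is:

set.seed(1)
random_numbers <- runif(30, min = 0, max = 6)
dice_values <- ceiling(random_numbers)   # integers from 1 to 6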

Exercise 2

Simulate one dice roll using the function rmultinom. Make sure n = 1 is inside the function, and save the result in an object called die_result. The matrix die_result is a collection of one 1 and five 0s, with the 1 indicating which value was obtained during the process. Use the function which to create an output that shows only the value obtained after the die is rolled.
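Again, one possible shape of the answer (a sketch only, not the official solution):

set.seed(1)
die_result <- rmultinom(n = 1, size = 1, prob = rep(1/6, 6))
which(die_result == 1)   # the face that came up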

Exercise 3

Using rmultinom, simulate 30 dice rolls. Save the result in a variable called dice_result and use apply to transform the matrix into a vector with the outcome of each roll.

Exercise 4

Some gambling games use 2 dice, and the values are summed after each roll. Simulate throwing 2 dice 30 times and record the sum of the values of each pair. Use rmultinom to simulate throwing 2 dice 30 times, and use the function apply to record the sum of the values of each experiment.

Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to

  • work with different binomial and logistic regression techniques,
  • know how to compare regression models and choose the right fit,
  • and much more.

Exercise 5

Simulate normal distribution values. Imagine a population in which the average height is 1.70 m with a standard deviation of 0.1. Using rnorm, simulate the height of 100 people and save it in an object called heights.

To get an idea of the values of heights, use the function summary.

Exercise 6

90% of the population is shorter than ____________?

Exercise 7

What percentage of the population is taller than 1.60 m?
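Exercises 6 and 7 ask for a quantile and a tail probability of this distribution; besides working with the simulated heights vector, they can also be answered directly with qnorm and pnorm (a sketch only, not the official solution):

qnorm(0.90, mean = 1.70, sd = 0.1)      # the height below which 90% of the population falls
1 - pnorm(1.60, mean = 1.70, sd = 0.1)  # the share of the population taller than 1.60 m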

Exercise 8

Run the following lines of code before this exercise. They will load a library required for the exercise.

if (!'MASS' %in% installed.packages()) install.packages('MASS')
library(MASS)

Simulate 1000 people with height and weight using the function mvrnorm with mu = c(1.70, 60) and
Sigma = matrix(c(.1,3.1,3.1,100), nrow = 2)

Exercise 9

How many people from the simulated population are taller than 1.70 m and heavier than 60 kg?

Exercise 10

How many people from the simulated population are taller than 1.75 m and lighter than 60 kg?

Related exercise sets:
  1. Lets Begin with something sample
  2. Probability functions beginner
  3. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-6)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

DEADLINE EXTENDED: Last call for Boston EARL abstracts

Thu, 08/31/2017 - 18:56

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Are you solving problems and innovating with R?
Are you working with R in a commercial setting?
Do you enjoy sharing your knowledge?

If you said yes to any of the above, we want your abstract!

Share your commercial R stories with your peers at EARL Boston this November.

EARL isn’t about knowing the most or being the best in your field – it’s about taking what you’ve learnt and sharing it with others, so they can learn from your wins (and sometimes your failures, because we all have them!).

As long as your proposed presentation is focused on the commercial use of R, any topic from any industry is welcome!

The abstract submission deadline has been extended to Sunday 3 September.

Join David Robinson, Mara Averick and Tareef Kawaf on 1-3 November 2017 at The Charles Hotel in Cambridge.

To submit your abstract on the EARL website, click here.

See you in Boston!


Text featurization with the Microsoft ML package

Thu, 08/31/2017 - 17:23

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Last week I wrote about how you can use the MicrosoftML package in Microsoft R to featurize images: reduce an image to a vector of 4096 numbers that quantify the essential characteristics of the image, according to an AI vision model. You can perform a similar featurization process with text as well, but in this case you have a lot more control of the features used to represent the text.

Tsuyoshi Matsuzaki demonstrates the process in a post at the MSDN Blog. The post explores the Multi-Domain Sentiment Dataset, a collection of product reviews from Amazon.com. The dataset includes reviews from 975,194 products on Amazon.com from a variety of domains, and for each product there is a text review and a star rating of 1, 2, 4, or 5. (There are no 3-star rated reviews in the data set.) Here's one example, selected at random:

What a useful reference! I bought this book hoping to brush up on my French after a few years of absence, and found it to be indispensable. It's great for quickly looking up grammatical rules and structures as well as vocabulary-building using the helpful vocabulary lists throughout the book. My personal favorite feature of this text is Part V, Idiomatic Usage. This section contains extensive lists of idioms, grouped by their root nouns or verbs. Memorizing one or two of these a day will do wonders for your confidence in French. This book is highly recommended either as a standalone text, or, preferably, as a supplement to a more traditional textbook. In either case, it will serve you well in your continuing education in the French language.

The review contains many positive terms (“useful”, “indispensable”, “highly recommended”), and in fact is associated with a 5-star rating for this book. The goal of the blog post was to find the terms most associated with positive (or negative) reviews. One way to do this is to use the featurizeText function in the MicrosoftML package included with Microsoft R Client and Microsoft R Server. Among other things, this function can be used to extract ngrams (sequences of one, two, or more words) from arbitrary text. In this example, we extract all of the one- and two-word sequences represented at least 500 times in the reviews. Then, to assess which have the most impact on ratings, we use their presence or absence as predictors in a linear model:

transformRule = list(
    featurizeText(
        vars = c(Features = "REVIEW_TEXT"),
        # ngramLength=2: include not only "Azure", "AD", but also "Azure AD"
        # skipLength=1 : "computer" and "compuuter" is the same
        wordFeatureExtractor = ngramCount(
            weighting = "tfidf",
            ngramLength = 2,
            skipLength = 1),
        language = "English"
    ),
    selectFeatures(
        vars = c("Features"),
        mode = minCount(500)
    )
)

# train using transforms!
model <- rxFastLinear(
    RATING ~ Features,
    data = train,
    mlTransforms = transformRule,
    type = "regression"   # not binary (numeric regression)
)

We can then look at the coefficients associated with these features (presence of n-grams) to assess their impact on the overall rating. By this standard, the top 10 words or word-pairs contributing to a negative rating are:

boring        -7.647399
waste         -7.537471
not           -6.355953
nothing       -6.149342
money         -5.386262
bad           -5.377981
no            -5.210301
worst         -5.051558
poorly        -4.962763
disappointed  -4.890280

Similarly, the top 10 words or word-pairs associated with a positive rating are:

will        3.073104
the|best    3.265797
love        3.290348
life        3.562267
wonderful   3.652950
,|and       3.762862
you         3.889580
excellent   3.902497
my          4.454115
great       4.552569

Another option is simply to look at the sentiment score for each review, which can be extracted using the getSentiment function. 

sentimentScores <- rxFeaturize(data = data,
    mlTransforms = getSentiment(vars = list(SentimentScore = "REVIEW_TEXT")))

As we expect, a negative sentiment (in the 0-0.5 range) is associated with 1- and 2-star reviews, while a positive sentiment (0.5-1.0) is associated with the 4- and 5-star reviews.

You can find more details on this analysis, including the Microsoft R code, at the link below.

Microsoft Technologies Blog for Enterprise Developers: Analyze your text in R (MicrosoftML)


Why to use the replyr R package

Thu, 08/31/2017 - 16:48

Recently I noticed that the R package sparklyr had the following odd behavior:

suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'

sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))

dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA

This means user code or user analyses that depend on one of dim(), ncol() or nrow() may break. nrow() used to return something other than NA, so older work may not be reproducible.

In fact: where I actually noticed this was deep in debugging a client project (not in a trivial example, such as above).



Tron: fights for the users.

In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both sparklyr and dbplyr users.

The explanation is: “tibble::truncate uses nrow()” and “print.tbl_spark is too slow since dbplyr started using tibble as the default way of printing records”.

A little digging gets us to this:

The above might make sense if tibble and dbplyr were the only users of dim(), ncol() or nrow().

Frankly if I call nrow() I expect to learn the number of rows in a table.

The suggestion is for all user code to adapt to use sdf_dim(), sdf_ncol() and sdf_nrow() (instead of tibble adapting). Even if practical (there are already a lot of existing sparklyr analyses), this prohibits the writing of generic dplyr code that works the same over local data, databases, and Spark (by generic code, we mean code that does not check the data source type and adapt). The situation is possibly even worse for non-sparklyr dbplyr users (i.e., databases such as PostgreSQL), as I don’t see any obvious convenient “no please really calculate the number of rows for me” (other than “d %>% tally %>% pull“).
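That tally-based workaround could be wrapped in a small helper of your own (a sketch using only dplyr verbs, so dbplyr can translate it; the replyr functions shown below are the more complete solution):

n_rows <- function(d) {
  d %>%
    dplyr::tally() %>%   # becomes a COUNT(*)-style query on database backends
    dplyr::pull() %>%
    as.numeric()
}
# n_rows(d)  # 2 for the small Spark table created above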

I admit, calling nrow() against an arbitrary query can be expensive. However, I am usually calling nrow() on physical tables (not on arbitrary dplyr queries or pipelines). Physical tables often deliberately carry explicit meta-data to make it possible for nrow() to be a cheap operation.

Allowing the user to write reliable generic code that works against many dplyr data sources is the purpose of our replyr package. Being able to use the same code many places increases the value of the code (without user facing complexity) and allows one to rehearse procedures in-memory before trying databases or Spark. Below are the functions replyr supplies for examining the size of tables:

library("replyr")
packageVersion("replyr")
#> [1] '0.5.4'

replyr_hasrows(d)
#> [1] TRUE
replyr_dim(d)
#> [1] 2 1
replyr_ncol(d)
#> [1] 1
replyr_nrow(d)
#> [1] 2

spark_disconnect(sc)

Note: the above is only working properly in the development version of replyr, as I only found out about the issue and made the fix recently.

replyr_hasrows() was added as I found in many projects the primary use of nrow() was to determine if there was any data in a table. The idea is: user code uses the replyr functions, and the replyr functions deal with the complexities of dealing with different data sources. This also gives us a central place to collect patches and fixes as we run into future problems. replyr accretes functionality as our group runs into different use cases (and we try to put use cases first, prior to other design considerations).

The point of replyr is to provide re-usable work arounds of design choices far away from our influence.


Pulling Data Out of Census Spreadsheets Using R

Thu, 08/31/2017 - 14:22

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

In this post, I show a method for extracting small amounts of data from somewhat large Census Bureau Excel spreadsheets, using R.  The objects of interest are expenditures of state and local governments on hospital capital in Iowa for the years 2004 to 2014. The data can be found at http://www2.census.gov/govs/local/. The files at the site are yearly files.

The files to be used are those named ‘yrslsstab1a.xls’, where ‘yr‘ is replaced by the two digits of the year for a given year, for example, ’04’ or ’11’. The individual yearly files contain data for the whole country and for all of the states, over all classes of state and local government revenue and expenditures. The task is to extract three data points from each file – state and local expenditures, state expenditures, and local expenditures – for the state of Iowa.

The structure of the files varies from year to year, so first reviewing the files is important. I found two patterns for the expenditure data – data with and data without margins of error. The program locates the columns for Iowa and the row for hospital capital expenditures. Then, the data are extracted and put in a matrix for outputting.

First, character strings of the years are created, to be used in referencing the data sets, and a data frame is created to contain the final result.

years = c(paste("0", 4:9, sep=""), paste(10:14))
hospital.capital.expend <- data.frame(NA, NA, NA)

Second, the library ‘gdata’ is opened. The library ‘gdata’ contains functions useful for manipulating data in R and provides for reading data into R from a URL containing an Excel file.

library(gdata)

Third, a loop is run through the eleven years to fill in the ‘hospital.capital.expend’ data frame with the data from each year. The object ‘fn’ contains the URL of the Excel file for a given year. The function ‘paste’ concatenates the three parts of the URL. Note that ‘sep’ must be set to “” in the function.

for (i in 1:11) {
    fn = paste("http://www2.census.gov/govs/local/", years[i], "slsstab1a.xls", sep="")

Next, the Excel file is read into the object ‘ex’. The argument ‘header’ is set to ‘F’ so that all of the rows are input. Also, since all of the columns contain some character data, all of the data is forced to be character by setting ‘stringsAsFactors’ to ‘F’.  The function used to read the spreadsheet is ‘read.xls’ in the package ‘gdata’.

ex = read.xls(fn, sheet=1, header=F, stringsAsFactors=F)

Next, the row and column indices of the data are found using the functions ‘grepl’ and ‘which’. The first argument in ‘grepl’ is a pattern to be matched. For a data frame, the ‘grepl’ function returns a logical vector of ‘T’s and ‘F’s of length equal to the number of columns in the data frame – giving ‘T’ if the column contains the pattern and ‘F’ if not. Note that ‘*’ can be used as a wild card in the pattern.  For a character vector, ‘grepl’ returns ‘T’ if an element of the vector matches the pattern and ‘F’ otherwise.

The ‘which’ function returns the indices of a logical vector which have the value ‘T’. So, ‘ssi1’ contains the index of the column containing ‘Hospital’ and ‘ssi2’ contains the index of the column containing ‘Iowa’. The object ‘ssi4’ contains the rows containing ‘Hospital’, since ‘ex[,ssi1]’ is a character vector instead of a data frame.   For all of the eleven years, the second incidence of ‘Hospital’ in the ‘Hospital’ column contains hospital expenditures.
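As a toy illustration of this data frame versus character vector behaviour of ‘grepl’ (made-up data, not the census file):

toy <- data.frame(a = c("x", "Hospital"), b = c("Iowa", "y"), stringsAsFactors = FALSE)
which(grepl("Iowa", toy))        # 2: the second column contains 'Iowa'
which(grepl("Hospital", toy$a))  # 2: the second element of the vector matches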

ssi1 = which(grepl("*Hospital*", ex, ignore.case=T))
ssi2 = which(grepl("Iowa", ex, ignore.case=T))
ssi4 = which(grepl("Hospital", ex[,ssi1], ignore.case=T))[2]

Next, the data are extracted, and the temporary files are removed. If the column index of ‘Iowa’ is less than 80, no margin of error was included and the data points are in the column of ‘Iowa’ and in the next two columns. If the column index of ‘Iowa’ is larger than 79, a margin of error was included and the data are in the column of ‘Iowa’ and the second and third columns to the right.

The capital expenditures are found one row below the ‘Hospital’ row, so one is added to ‘ssi4’ to get the correct row index. The data are put in the data frame ‘df.1’ which is row bound to the data frame ‘hospital.capital.expend’. The names of the columns in ‘df.1’ are set to ‘NA’ so that the row bind will work.  Then the temporary files are removed and the loop ends.

if (ssi2 < 80) ssi5 = ssi2 + 0:2 else ssi5 = ssi2 + c(0,2,3)
df.1 = data.frame(ex[ssi4+1, ssi5], stringsAsFactors = F)
names(df.1) = c(NA, NA, NA)
hospital.capital.expend = rbind(hospital.capital.expend, df.1)
rm(fn, ex, df.1, ssi1, ssi2, ssi4, ssi5)
}

There are just a few steps left to clean things up. The first row of ‘hospital.capital.expend’, which just contains ‘NA’s, is removed. Then, the commas within the numbers, as extracted from the census file, are removed from the character strings using the function ‘gsub’ and the data frame is converted to a numeric matrix. Next, the eleven years are column bound to the matrix. Last, the columns are given names and the matrix is printed out.

hospital.capital.expend = as.matrix(hospital.capital.expend[-1,])
hospital.capital.expend = matrix(as.numeric(gsub(",", "", hospital.capital.expend)), ncol=3)
hospital.capital.expend = cbind(2004:2014, hospital.capital.expend)
colnames(hospital.capital.expend) = c("Year", "State.Local", "State", "Local")
print(hospital.capital.expend)

That’s it!!!

    Related Post

    1. Extracting Tables from PDFs in R using the Tabulizer Package
    2. Extract Twitter Data Automatically using Scheduler R package
    3. An Introduction to Time Series with JSON Data
    4. Get Your Data into R: Import Data from SPSS, Stata, SAS, CSV or TXT
    5. Exporting Data from R to TXT, CSV, SPSS or Stata

    Community Call – rOpenSci Software Review and Onboarding

    Thu, 08/31/2017 - 09:00

    (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

    Are you thinking about submitting a package to rOpenSci's open peer software review? Considering volunteering to review for the first time? Maybe you're an experienced package author or reviewer and have ideas about how we can improve.

    Join our Community Call on Wednesday, September 13th. We want to get your feedback and we'd love to answer your questions!

    Agenda
    1. Welcome (Stefanie Butland, rOpenSci Community Manager, 5 min)
    2. guest: Noam Ross, editor (15 min)
      Noam will give an overview of the rOpenSci software review and onboarding, highlighting the role editors play and how decisions are made about policies and changes to the process.
    3. guest: Andee Kaplan, reviewer (15 min)
      Andee will give her perspective as a package reviewer, sharing specifics about her workflow and her motivation for doing this.
    4. Q & A (25 min, moderated by Noam Ross)
    Speaker bios

    Andee Kaplan is a Postdoctoral Fellow at Duke University. She is a recent PhD graduate from the Iowa State University Department of Statistics, where she learned a lot about R and reproducibility by developing a class on data stewardship for Agronomists. Andee has reviewed multiple (two!) packages for rOpenSci, iheatmapr and getlandsat, and hopes to one day be on the receiving end of the review process.

    Andee on GitHub, Twitter

    Noam Ross is one of rOpenSci's four editors for software peer review. Noam is a Senior Research Scientist at EcoHealth Alliance in New York, specializing in mathematical modeling of disease outbreaks, as well as training and standards for data science and reproducibility. Noam earned his Ph.D. in Ecology from the University of California-Davis, where he founded the Davis R Users' Group.

    Noam on GitHub, Twitter

Resources

    Create and Update PowerPoint Reports using R

    Thu, 08/31/2017 - 02:50

    (This article was first published on R – Displayr, and kindly contributed to R-bloggers)

In my sordid past, I was a data science consultant. One thing about data science that they don’t teach you at school is that senior managers in most large companies require reports to be in PowerPoint. Yet, I like to do my more complex data science in R – PowerPoint and R are not natural allies. As a result, creating and updating PowerPoint reports using R can be painful.

In this post, I discuss how to make R and PowerPoint work efficiently together. The underlying assumption is that R is your computational engine and that you are trying to get outputs into PowerPoint. I compare and contrast three tools for creating and updating PowerPoint reports using R: the free ReporteRs package and two commercial products, Displayr and Q.

    Option 1: ReporteRs

    The first approach to getting R and PowerPoint to work together is to use David Gohel’s ReporteRs. To my mind, this is the most “pure” of the approaches from an R perspective. If you are an experienced R user, this approach works in pretty much the way that you will expect it to work.

    The code below creates 250 crosstabs, conducts significance tests, and, if the p-value is less than 0.05, presents a slide containing each. And, yes, I know this is p-hacking, but this post is about how to use PowerPoint and R, not how to do statistics…

library(devtools)
devtools::install_github('davidgohel/ReporteRsjars')
devtools::install_github('davidgohel/ReporteRs')
install.packages(c('ReporteRs', 'haven', 'vcd', 'ggplot2', 'reshape2'))
library(ReporteRs)
library(haven)
library(vcd)
library(ggplot2)
library(reshape2)

dat = read_spss("http://wiki.q-researchsoftware.com/images/9/94/GSSforDIYsegmentation.sav")
filename = "c://delete//Significant crosstabs.pptx" # the document to produce
document = pptx(title = "My significant crosstabs!")
alpha = 0.05 # The level at which the statistical testing is to be done.
dependent.variable.names = c("wrkstat", "marital", "sibs", "age", "educ")
all.names = names(dat)[6:55] # The first 50 variables in the file.
counter = 0
for (nm in all.names)
    for (dp in dependent.variable.names)
    {
        if (nm != dp)
        {
            v1 = dat[[nm]]
            if (is.labelled(v1)) v1 = as_factor(v1)
            v2 = dat[[dp]]
            l1 = attr(v1, "label")
            l2 = attr(v2, "label")
            if (is.labelled(v2)) v2 = as_factor(v2)
            # Only performing tests if 10 or fewer rows and columns.
            if (length(unique(v1)) <= 10 && length(unique(v2)) <= 10)
            {
                x = xtabs(~v1 + v2)
                x = x[rowSums(x) > 0, colSums(x) > 0]
                ch = chisq.test(x)
                p = ch$p.value
                if (!is.na(p) && p <= alpha)
                {
                    counter = counter + 1
                    # Creating the outputs.
                    crosstab = prop.table(x, 2) * 100
                    melted = melt(crosstab)
                    melted$position = 100 - as.numeric(apply(crosstab, 2, cumsum) - 0.5 * crosstab)
                    p = ggplot(melted, aes(x = v2, y = value, fill = v1)) + geom_bar(stat = 'identity')
                    p = p + geom_text(data = melted, aes(x = v2, y = position, label = paste0(round(value, 0), "%")), size = 4)
                    p = p + labs(x = l2, y = l1)
                    colnames(crosstab) = paste0(colnames(crosstab), "%")
                    #bar = ggplot() + geom_bar(aes(y = v1, x = v2), data = data.frame(v1, v2), stat = "identity")
                    # Writing them to the PowerPoint document.
                    document = addSlide(document, slide.layout = "Title and Content")
                    document = addTitle(document, paste0("Standardized residuals and chart: ", l1, " by ", l2))
                    document = addPlot(doc = document, fun = print, x = p, offx = 3, offy = 1, width = 6, height = 5)
                    document = addFlexTable(doc = document, FlexTable(round(ch$stdres, 1), add.rownames = TRUE), offx = 8, offy = 2, width = 4.5, height = 3)
                }
            }
        }
    }
writeDoc(document, file = filename)
cat(paste0(counter, " tables and charts exported to ", filename, "."))

    Below we see one of the admittedly ugly slides created using this code. With more time and expertise, I am sure I could have done something prettier. A cool aspect of the ReporteRs package is that you can then edit the file in PowerPoint. You can then get R to update any charts and other outputs originally created in R.

     

     

    Option 2: Displayr

    A completely different approach is to author the report in Displayr, and then export the resulting report from Displayr to PowerPoint.

    This has advantages and disadvantages relative to using ReporteRs. First, I will start with the big disadvantage, in the hope of persuading you of my objectivity (disclaimer: I have no objectivity, I work at Displayr).

    Each page of a Displayr report is created interactively, using a mouse and clicking and dragging things. In my earlier example using ReporteRs, I only created pages where there was a statistically significant association. Currently, there is no way of doing such a thing in Displayr.

The flip side of using a graphical user interface like Displayr is that it is a lot easier to create attractive visualizations. As a result, the user has much greater control over the look and feel of the report. For example, the screenshot below shows a PowerPoint document created by Displayr. All but one of the charts has been created using R, and the first two are based on a moderately complicated statistical model (a latent class rank-ordered logit model).

    You can access the document used to create the PowerPoint report with R here (just sign in to Displayr first) – you can poke around and see how it all works.

    A benefit of authoring a report using Displayr is that the user can access the report online, interact with it (e.g., filter the data), and then export precisely what they want. You can see this document as it is viewed by a user of the online report here.

     

     

    Option 3: Q

    A third approach for authoring and updating PowerPoint reports using R is to use Q, which is a Windows program designed for survey reporting (same disclaimer as with Displayr). It works by exporting and updating results to a PowerPoint document. Q has two different mechanisms for exporting R analyses to PowerPoint. First, you can export R outputs, including HTMLwidgets, created in Q directly to PowerPoint as images. Second, you can create tables using R and then have these exported as native PowerPoint objects, such as Excel charts and PowerPoint tables.


    In Q, a Report contains a series of analyses. Analyses can either be created using R, or, using Q’s own internal calculation engine, which is designed for producing tables from survey data.

    The map above (in the Displayr report) is an HTMLwidget created using the plotly R package. It draws data from a table called Region, which would also be shown in the report. (The same R code in the Displayr example can be used in an R object within Q). So when exported into PowerPoint, it creates a page, using the PowerPoint template, where the title is Responses by region and the map appears in the middle of the page.

    The screenshot below is showing another R chart created in PowerPoint. The data has been extracted from Google Trends using the gtrendsR R package. However, the chart itself is a standard Excel chart, attached to a spreadsheet containing the data. These slides can then be customized using all the normal PowerPoint tools and can be automatically updated when the data is revised.

     

    Explore the Displayr example

    You can access the Displayr document used to create and update the PowerPoint report with R here (just sign in to Displayr first). Here, you can poke around and see how it all works or create your own document.


    Pacific Island Hopping using R and iGraph

    Thu, 08/31/2017 - 02:00

    (This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Last month I enjoyed a relaxing holiday in the tropical paradise of Vanuatu. One rainy day I contemplated how to go island hopping across the Pacific ocean, visiting as many island nations as possible. The Pacific ocean is a massive body of water between Asia and the Americas, covering close to half of the earth's water surface. The southern Pacific is strewn with island nations from Australia to Chile. In this post, I describe how to use R to plan your next Pacific island hopping journey.

    The Pacific Ocean.

    Listing all airports

My first step was to create a list of flight connections between each of the island nations in the Pacific ocean. I am not aware of a publicly available data set of international flights, so unfortunately I created a list manually (if you do know of such a data set, then please leave a comment).

    My manual research resulted in a list of international flights from or to island airports. This list might not be complete, but it is a start. My Pinterest board with Pacific island airline route maps was the information source for this list.

    The first code section reads the list of airline routes and uses the ggmap package to extract their coordinates from Google maps. The data frame with airport coordinates is saved for future reference to avoid repeatedly pinging Google for the same information.

# Init
library(tidyverse)
library(ggmap)
library(ggrepel)
library(geosphere)

# Read flight list and airport list
flights <- read.csv("Geography/PacificFlights.csv", stringsAsFactors = FALSE)
f <- "Geography/airports.csv"
if (file.exists(f)) {
    airports <- read.csv(f)
} else airports <- data.frame(airport = NA, lat = NA, lon = NA)

# Lookup coordinates for new airports
all_airports <- unique(c(flights$From, flights$To))
new_airports <- all_airports[!(all_airports %in% airports$airport)]
if (length(new_airports) != 0) {
    coords <- geocode(new_airports)
    new_airports <- data.frame(airport = new_airports, coords)
    airports <- rbind(airports, new_airports)
    airports <- subset(airports, !is.na(airport))
    write.csv(airports, "Geography/airports.csv", row.names = FALSE)
}

# Add coordinates to flight list
flights <- merge(flights, airports, by.x = "From", by.y = "airport")
flights <- merge(flights, airports, by.x = "To", by.y = "airport")

Create the map

To create a map, I modified the code to create flight maps I published in an earlier post. This code had to be changed to centre the map on the Pacific. Mapping the Pacific ocean is problematic because the -180 and +180 degree meridians meet around the date line. Longitudes west of the antimeridian are positive, while longitudes east are negative.

    The world2 data set in the borders function of the ggplot2 package is centred on the Pacific ocean. To enable plotting on this map, all negative longitudes are made positive by adding 360 degrees to them.

# Pacific centric
flights$lon.x[flights$lon.x < 0] <- flights$lon.x[flights$lon.x < 0] + 360
flights$lon.y[flights$lon.y < 0] <- flights$lon.y[flights$lon.y < 0] + 360
airports$lon[airports$lon < 0] <- airports$lon[airports$lon < 0] + 360

# Plot flight routes
worldmap <- borders("world2", colour = "#efede1", fill = "#efede1")
ggplot() + worldmap +
    geom_point(data = airports, aes(x = lon, y = lat), col = "#970027") +
    geom_text_repel(data = airports, aes(x = lon, y = lat, label = airport),
                    col = "black", size = 2, segment.color = NA) +
    geom_curve(data = flights,
               aes(x = lon.x, y = lat.x, xend = lon.y, yend = lat.y, col = Airline),
               size = .4, curvature = .2) +
    theme(panel.background = element_rect(fill = "white"),
          axis.line = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank()) +
    xlim(100, 300) + ylim(-40, 40)

Pacific Island Hopping

    This visualisation is aesthetic and full of context, but it is not the best visualisation to solve the travel problem. This map can also be expressed as a graph with nodes (airports) and edges (routes). Once the map is represented mathematically, we can generate travel routes and begin our Pacific Island hopping.

The igraph package converts the flight list to a graph that can be analysed and plotted. The shortest_paths function can then be used to plan routes. If I wanted to travel from Auckland to Saipan in the Northern Mariana Islands, I would have to go through Port Vila, Honiara, Port Moresby, Chuuk and Guam, and then on to Saipan. I am pretty sure there are quicker ways to get there, but this would be an exciting journey through the Pacific.

library(igraph)
g <- graph_from_edgelist(as.matrix(flights[, 1:2]), directed = FALSE)
par(mar = rep(0, 4))
plot(g, layout = layout.fruchterman.reingold, vertex.size = 0)
V(g)
shortest_paths(g, "Auckland", "Saipan")

    View the latest version of this code on GitHub.

    The post Pacific Island Hopping using R and iGraph appeared first on The Devil is in the Data.


    Probably more likely than probable

    Wed, 08/30/2017 - 23:33

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

What kind of probability are people talking about when they say something is "highly likely" or has "almost no chance"? The chart below, created by Reddit user zonination, visualizes the responses of 46 other Reddit users to "What probability would you assign to the phrase: <phrase>" for various statements of probability. Each set of responses has been converted to a kernel density estimate and presented as a joyplot using R.

Somewhat surprisingly, the results from the Redditors hew quite closely to a similar study of 23 NATO intelligence officers in 2007. In that study, the officers, who were accustomed to reading intelligence reports with assertions of likelihood, were given a similar task with the same descriptions of probability. The results, here presented as a dotplot, are quite similar.

    For details on the analysis of the Redditors, including the data and R code behind the joyplot chart, check out the Github repository linked below.

    Github (zonination): Perceptions of Probability and Numbers


    Finding distinct rows of a tibble

    Wed, 08/30/2017 - 22:30

    (This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

    I’ve been using R or its predecessors for about 30 years, which means I know a lot about R, and also that I don’t necessarily know how to use modern R tools. Lately, I’ve been trying to unlearn some old approaches, and to re-learn them using the tidyverse approach to data analysis. I agree that it is much better, but old dogs and new tricks…
    Recently, I was teaching a class where I needed to extract some rows of a data set.
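As a minimal illustration of the idea in the title (my own sketch, not code from the post), dplyr's distinct() keeps only the unique rows of a tibble:

library(dplyr)
df <- tibble(x = c(1, 1, 2), y = c("a", "a", "b"))
distinct(df)      # drops the duplicated first row
distinct(df, x)   # distinct values of x only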


    R in the Data Science Stack at ODSC

    Wed, 08/30/2017 - 22:05

    (This article was first published on R-posts.com, and kindly contributed to R-bloggers)


Register now for ODSC West in San Francisco, November 2-4, and save 60% with code RB60 until September 1st.

    R continues to hold its own in the data science landscape thanks in no small part to its flexibility.  That flexibility allows R to integrate with some of the most popular data science tools available.

Given R’s memory bounds, it’s no surprise that deep learning tools like Tensorflow are on that list. Comprehensive, intuitive, and well documented, Tensorflow has quickly become one of the most popular deep learning platforms, and RStudio released a package to integrate with the Tensorflow API. Not to be outdone, MXNet, another popular and powerful deep learning framework, has native support for R with an API interface.

It doesn’t stop with deep learning. Data science is moving real-time and the streaming analytics platform, Apache Kafka, is rapidly gaining traction with the community. The kafka package allows one to use the Kafka messaging queue via R. Spark is now one of the dominant machine learning platforms, and thus we see multiple R integrations in the form of the sparklyr package and the SparkR package. The list will continue to grow, with package integrations released for H2O.ai, Druid etc. and more on the way.

    At the  Open Data Science Conference, R has long been one of the most popular data science languages and ODSC West 2017 is no exception. We have a strong lineup this year that includes:

    • R Tools for Data Science
    • Modeling Big Data with R, sparklyr, and Apache Spark
    • Machine Learning with R
    • Introduction to Data Science with R
    • Modern Time-Series with Prophet
    • R4ML: A Scalable and Distributed framework in R for Machine Learning
    • Databases Using R
    • Geo-Spatial Data Visualization using R

    From an R user perspective, one of the most exciting things about ODSC West 2017 is that it offers an excellent opportunity to do a deep dive into some of the most popular data science tools you can now leverage with R. Talks and workshops on the conference schedule include:

    • Deep learning from Scratch WIth Tensorflow
    • Apache Kafka for Real-time analytics
    • Deep learning with MXNet
    • Effective TensorFlow
    • Building an Open Source Analytics Solution with Kafka and Druid
    • Deep Neural Networks with Keras
    • Robust Data Pipelines with Apache Airflow
    • Apache Superset – A Modern, Enterprise-Ready Business Intelligence Web Application

Over 3 packed days, ODSC West 2017 also offers a great opportunity to brush up on your modeling skills, including predictive analytics, time series, NLP, machine learning, image recognition, deep learning, autonomous vehicles, and AI chatbot assistants. Here are just a few of the data science workshops and talks scheduled:

    • Feature Selection from High Dimensions
    • Interpreting Predictions from Complex Models
    • Deep Learning for Recommender Systems
    • Natural Language Processing in Practice – Do’s and Don’ts
    • Machine Imaging recognition
    • Training a Prosocial Chatbot
    • Anomaly Detection Using Deep Learning
    • Myths of Data Science: Practical Issues You Can and Can Not Ignore.
    • Playing Detective with CNNs
    • Recommendation System Architecture and Algorithms
    • Driver and Occupants Monitoring AI for Autonomous Vehicles
    • Solving Impossible Problems by Collaborating with an AI
    • Dynamic Risk Networks: Mapping Risk in the Financial System

With over 20 full training sessions, 50 workshops and 100 speakers, ODSC West 2017 is ideal for everyone from beginners to experts looking to understand the latest in R tools and topics in data science and AI.

    Register now and save 60% with code RB60 until September 1st.

    Sheamus McGovern, CEO of ODSC


    Web Scraping Influenster: Find a Popular Hair Care Product for You

    Wed, 08/30/2017 - 19:28

    (This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Are you a person who likes to try new products? Are you curious about which hair products are popular and trendy? If you’re excited about getting your hair glossy and eager to find a suitable shampoo, conditioner or hair oil, using my ‘Shiny (Hair) App’ could help you find what you seek in less time. My code is available on GitHub.

     

    Research Questions

    What are popular hair care brands?

What is the user behavior on Influenster.com?

What kinds of factors may have a critical influence on customer satisfaction?

Is it possible to create a search engine which takes search phrases and returns related products?

     

    Data Collection

To obtain the most up-to-date hair care information, I decided to web scrape Influenster, a product discovery and review platform. It has over 14 million reviews and over 2 million products for users to choose from.

In order to narrow down my research scope, I focused on 3 categories: shampoo, hair conditioner, and hair oil, and gathered the top 54 products for each one. For the product dataset, I scraped brand name, product name, overall product rating, rank and reviews. In addition, the review dataset includes author name, author location, content, rating score, and hair profile.

     

    Results

    Top Brands Graph

Firstly, the “other” category represents brands which have only one or two popular products. Thus, judging from the popular brands’ pie chart, we can see that most of the popular products belong to big, well-known brands.

     

    Rating Map

To check users’ behavior on Influenster across the United States, I decided to make two maps to see whether there are any interesting results linked to location. Since I scraped the top 54 products for each category, the overall rating score is high across the country. As a result, it is difficult to see regional differences.

     

    Reviews Map

However, if we take a look at the number of hair care product reviews on Influenster.com across the nation, we see that there are 4,740, 3,898, 3,787 and 2,818 reviews in California, Florida, Texas and New York respectively.

     

    Analysis of Rating and Number of Reviews

There is a negative relationship between rating and number of reviews. As you can see, Pureology receives the highest score, 4.77 out of 5, but it only has 514 reviews. On the other hand, OGX is scored 4.4 out of 5, yet it has over 5,167 reviews.

     

    Wordcloud & Comparison Cloud

As we may be interested in what factors customers care about most and what contributes to their satisfaction with a product, I decided to inspect the most frequently mentioned words in those 77 thousand reviews. For the first try, I created word clouds for each category and for the overall reviews. However, there is no significant difference among the four graphs. Therefore, I created a comparison cloud to collate the most common words popping up in the reviews. From the comparison cloud, we can infer that customers regard the functionality of products and their fragrance as most important. In addition, the word “recommend” shows up as a commonly used word in the reviews dataset. Consequently, from my perspective, word of mouth is a great marketing strategy for brands to focus on.

     

    Search Engine built in my Shiny App (NLP: TF-IDF, cosine similarity)


    http://blog.nycdatascience.com/wp-content/uploads/2017/08/engine_demo.mp4

    TF-IDF

TF-IDF, which stands for “Term Frequency–Inverse Document Frequency”, is an NLP technique that produces a numerical statistic intended to reflect how important a word is to a document in a corpus.

For my search engine, I utilize the “tm” package and employ the weightSMART “nnn” weighting schema for term frequency. Basically, weightSMART “nnn” is a natural (raw count) weighting: it counts how many times each individual word appears in each document of the dataset. If you would like to read more details and check other weighting schemas, please feel free to take a look at the R documentation.
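As a rough sketch of what this looks like with tm (toy reviews standing in for the scraped Influenster data, not the actual project code):

library(tm)

reviews <- c("great shampoo, smells great", "this oil made my hair greasy")
corpus <- VCorpus(VectorSource(reviews))

# Term-document matrix using the weightSMART "nnn" (natural term frequency) scheme
tdm <- TermDocumentMatrix(corpus,
         control = list(weighting = function(x) weightSMART(x, spec = "nnn")))
inspect(tdm)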

    Cosine Similarity

With TF-IDF measurements in place, products are recommended according to their cosine similarity score with the query. To elaborate on how cosine similarity works: it is a measure of similarity between two non-zero vectors of an inner product space, given by the cosine of the angle between them. In the case of information retrieval, as in a search engine, the cosine similarity of two documents ranges from 0 to 1, because the term frequencies (TF-IDF weights) cannot be negative; in other words, the angle between two term frequency vectors cannot be greater than 90 degrees. The closer the cosine value is to 1, the higher the similarity between the two vectors (products). The cosine similarity formula is shown below.
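For two term frequency vectors A and B, the formula is:

cos(θ) = (A · B) / (‖A‖ ‖B‖) = Σ A_i B_i / ( √(Σ A_i²) √(Σ B_i²) )

In R this is a one-liner (a generic helper for illustration, not the project’s actual code):

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))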

     

     

    Insights

    Most of the products belong to household brands.

The most active users of the site are from California, Florida, Texas and New York.

    There is a negative relationship between the number of reviews and rating score.

The functionality and scent of hair care products are of great importance.

Even though “recommend” is a commonly used word, in this project it is difficult to tell whether it appears in positive or negative feedback. Thus, I may conduct sentiment analysis in the future.

The self-developed search engine, built on TF-IDF and cosine similarity, would work even better if I included product descriptions. With descriptions added, users would have a higher probability of matching their queries not only to product names but also to product descriptions, so they could retrieve more related merchandise and explore new product features.

    The post Web Scraping Influenster: Find a Popular Hair Care Product for You appeared first on NYC Data Science Academy Blog.



    Data wrangling : Cleansing – Regular expressions (2/3)

    Wed, 08/30/2017 - 18:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)


Data wrangling is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process, estimated to take about 60–80% of an analyst’s time. In this series we go through this process. It will be a brief series with the goal of sharpening the reader’s skills on the data wrangling task. This is the fourth part of the series and it covers the cleaning of the data used. In the previous parts we learned how to import, reshape and transform data. The rest of the series is dedicated to the data cleansing process. In this post we go through regular expressions: sequences of characters that define a search pattern, mainly for use in pattern matching with text strings. In particular, we cover the foundations of regular expression syntax.

Before proceeding, it might be helpful to look over the help pages for grep and gsub.

Moreover, please run the following command to create the strings that we will work on.
    bio <- c('24 year old', 'data scientist', '1992', 'A.I. enthusiast', 'R version 3.4.0 (2017-04-21)',
    'r-exercises author', 'R is cool', 'RR')
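As a quick, non-spoiler reminder of the two basic calls (the pattern syntax is what the exercises are really about):

grep("R", bio, value = TRUE)   # elements of bio that contain a capital R
gsub("R", "r", bio)            # replace every capital R with a lower-case r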

    Answers to the exercises are available here.

    If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

    Exercise 1

Find the strings with numeric values between 3 and 6.

    Exercise 2

    Find the strings with the character ‘A’ or ‘y’.

    Exercise 3

    Find any strings that have non-alphanumeric characters.

    Exercise 4

    Remove lower case letters.

Learn more about text analysis in the online course Text Analytics/Text Mining Using R. In this course you will learn how to create, analyse and finally visualize your text-based data source. Having all the steps easily outlined will be a great reference source for future work.

    Exercise 5

    Remove space or tabs.

    Exercise 6

    Remove punctuation and replace it with white space.

    Exercise 7

    Remove alphanumeric characters.

    Exercise 8

    Match sentences that contain ‘M’.

    Exercise 9

    Match states with two ‘o’.

    Exercise 10

    Match cars with one or two ‘e’.

    Related exercise sets:
    1. Regular Expressions Exercises – Part 1
    2. Data wrangling : Cleansing – Regular expressions (1/3)
    3. Data wrangling : Transforming (2/3)
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory


    The one function call you need to know as a data scientist: h2o.automl

    Wed, 08/30/2017 - 14:49
    Introduction

Two things that recently came to my attention were AutoML (Automatic Machine Learning) by h2o.ai and the Fashion-MNIST dataset by Zalando Research. So, as a test, I ran AutoML on the Fashion-MNIST data set.

    H2o AutoML

As you all know, a large part of the work in predictive modeling is preparing the data. But once you have done that, ideally you don’t want to spend too much work trying out many different machine learning models. That’s where AutoML from h2o.ai comes in: with one function call you automate the process of training a large, diverse selection of candidate models.

AutoML trains and cross-validates a Random Forest, an Extremely Randomized Forest, GLMs, Gradient Boosting Machines (GBMs) and Neural Nets. Then, as a “bonus”, it trains a Stacked Ensemble using all of the models. The function to use in the h2o R interface is h2o.automl. (There is also a Python interface.)

FashionMNIST_Benchmark = h2o.automl(
  x                = 1:784,
  y                = 785,
  training_frame   = fashionmnist_train,
  validation_frame = fashionmnist_test
)

So the first 784 columns in the data set are used as inputs and column 785 is the column with the labels. There are more input arguments that you can use, for example the maximum running time, the maximum number of models to train, or a stopping metric.
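A sketch of such a call, with argument names as in the h2o R interface (check ?h2o.automl for the version you run; the specific values here are only illustrative):

FashionMNIST_Benchmark = h2o.automl(
  x                = 1:784,
  y                = 785,
  training_frame   = fashionmnist_train,
  validation_frame = fashionmnist_test,
  max_runtime_secs = 3600,       # stop the whole run after one hour
  max_models       = 20,         # or after 20 base models
  stopping_metric  = "logloss",
  seed             = 1
)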

    It can take some time to run all these models, so I have spun up a so-called high CPU droplet on Digital Ocean: 32 dedicated cores ($0.92 /h).

    h2o utilizing all 32 cores to create models

The output in R is an object containing the models and a ‘leaderboard’ ranking the different models. I got the following accuracies on the Fashion-MNIST test set.

    1. Gradient Boosting (0.90)
    2. Deep learning (0.89)
    3. Random forests (0.89)
    4. Extremely randomized forests (0.88)
    5. GLM (0.86)
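The leaderboard and the best model can be pulled out of the returned object; a minimal sketch, with slot names as in the h2o R package:

lb <- FashionMNIST_Benchmark@leaderboard
print(lb)                                    # all models ranked by the default metric

best_model <- FashionMNIST_Benchmark@leader  # the top-ranked model
pred <- h2o.predict(best_model, fashionmnist_test)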

There is no ensemble model because stacked ensembles are not yet supported for multi-class classification. The deep learning models in h2o are fully connected hidden layers; for this specific Zalando image data set you are better off pursuing fancier convolutional neural networks. As a comparison, I ran a simple 2-layer CNN with keras, resulting in a test accuracy of 0.92, which outperforms all the models here!

    Conclusion

    If you have prepared your modeling data set, the first thing you can always do now is to run h2o.automl.

    Cheers, Longhow.


    Layered Data Visualizations Using R, Plotly, and Displayr

    Wed, 08/30/2017 - 08:03

    (This article was first published on R – Displayr, and kindly contributed to R-bloggers)

    If you have tried to communicate research results and data visualizations using R, there is a good chance you will have come across one of its great limitations. R is painful when you need to create visualizations by layering multiple visual elements on top of each other. In other words, R can be painful if you want to assemble many visual elements, such as charts, images, headings, and backgrounds, into one visualization.

    The good: R can create awesome charts 

    R is great for creating charts. It gives you a lot of control and makes it easy to update charts with revised data. As an example, the chart below was created in R using the plotly package. It has quite a few nifty features that cannot be achieved in, say, Excel or Tableau.

    The data visualization below measures blood sugar, exercise intensity, and diet. Each dot represents a blood glucose (BG) measurement for a patient over the course of a day. Note that the blood sugar measurements are not collected at regular intervals so there are gaps between some of the dots. In addition, the y-axis label spacings are irregular because this chart needs to emphasize the critical point of a BG of 8.9. The dots also get larger the further they are from a BG of 6 and color is used to emphasize extreme values. Finally, green shading is used to indicate the intensity of the patient’s physical activity, and readings from a food diary have been automatically added to this chart.
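As a rough idea of how such encodings look in plotly (hypothetical data and column names, not the original chart's code), markers can carry both a size and a colour mapping:

library(plotly)

d <- data.frame(
  time = seq(as.POSIXct("2017-08-01 07:00"), by = "2 hours", length.out = 8),
  bg   = c(5.1, 6.2, 8.9, 10.4, 7.0, 5.8, 4.6, 9.5)   # made-up BG readings
)

plot_ly(
  d, x = ~time, y = ~bg,
  type = "scatter", mode = "markers",
  size = ~abs(bg - 6),      # dots grow with distance from a BG of 6
  color = ~bg,              # colour encodes the BG value itself
  sizes = c(5, 25)
) %>%
  layout(yaxis = list(title = "Blood glucose (BG)"))

The original chart adds irregular y-axis breaks, shading for exercise intensity and food-diary annotations on top of this basic mapping.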

    While this R visualization is awesome, it can be made even more interesting by overlaying visual elements such as images and headings.

    You can look at this R visualization live, and you can hover your mouse over points to see the dates and times of individual readings. 

     

    The bad: It is very painful to create visual confections in R

    In his book, Visual Explanations, Edward Tufte coins the term visual confections to describe visualizations that are created by overlaying multiple visual elements (e.g., combining charts with images or joining multiple visualizations into one). The document below is an example of a visual confection.

    The chart created in R above has been incorporated into the visualization below, along with another chart, images, background colors, headings and more – this is a visual confection.

    In addition to all information contained in the original chart, the patient’s insulin dose for each day is shown in a syringe and images of meals have also been added. The background has been colored, and headings and sub-headings included. While all of this can be done in R, it cannot be done easily.

    Even if you know all the relevant functions to programmatically insert images, resize them, deal with transparency, and control their order, you still have to go through a painful trial and error process of guesstimating the coordinates where things need to appear. That is, R is not WYSIWYG, and you really feel this when creating visual confections. Whenever I have done such things, I end up having to print the images, use a ruler, and create a simple model to estimate the coordinates!

     

    The solution: How to assemble many visual layers into one data visualization

    The standard way that most people create visual confections is using PowerPoint. However, PowerPoint and R are not great friends, as resizing R charts in PowerPoint causes problems, and PowerPoint cannot support any of the cool hover effects or interactivity in HTMLwidgets like plotly.

    My solution was to build Displayr, which is a bit like a PowerPoint for the modern age, except that charts can be created in the app using R. The app is also online and can have its data updated automatically.

    Click here to create your own layered visualization (just sign into Displayr first). Here you can access and edit the document that I used to create the visual confection example used in this post. This document contains all the raw data and the R code (as a function) used to automatically create the charts in this post. You can see the published layered visualization as a web page here.

     



    RcppArmadillo 0.7.960.1.2

    Wed, 08/30/2017 - 04:20

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A second fix-up release is needed following on the recent bi-monthly RcppArmadillo release as well as the initial follow-up as it turns out that OS X / macOS is so darn special that it needs an entire separate treatment for OpenMP. Namely to turn it off entirely…

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 384 other packages on CRAN, an increase of 54 since the CRAN release in June!

    Changes in RcppArmadillo version 0.7.960.1.2 (2017-08-29)
    • On macOS, OpenMP support is now turned off (#170).

    • The package is now compiling under the C++11 standard (#170).

    • The vignette dependency is correctly set (James and Dirk in #168 and #169)

    Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



    RStudio 1.1 Preview – I Only Work in Black

    Wed, 08/30/2017 - 02:00

    (This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

    Today, we’re continuing our blog series on new features in RStudio 1.1. If you’d like to try these features out for yourself, you can download a preview release of RStudio 1.1.

    I Only Work in Black

For those of us who like to work in black or very, very dark grey, a dark theme can be enabled from the ‘Global Options’ menu by selecting the ‘Appearance’ tab and choosing an ‘Editor theme’ that is dark.

Icons are now high-DPI, and ‘Modern’ and ‘Sky’ themes have also been added; read more about them under Using RStudio Themes.

All panes support themes: Code editor, Console, Terminal, Environment, History, Files, Connections, Packages, Help, Build and VCS. Other features like Notebooks, Debugging, Profiling, Menus and the Object Explorer support this theme as well.

However, the Plots and Viewer panes render with the default colors of your content and therefore require additional packages to switch to dark themes. For instance, shinythemes provides the darkly theme for Shiny, and ggthemes provides support for light = FALSE under ggplot. If you are a package author, consider using rstudioapi::getThemeInfo() when generating output for these panes.
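A sketch of how that check could look (field names as documented in rstudioapi; the fallback colours are arbitrary):

if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
  theme <- rstudioapi::getThemeInfo()
  plot_bg <- if (isTRUE(theme$dark)) "#2b2b2b" else "#ffffff"   # pick a background to match
}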

    Enjoy!



    IMDB Genre Classification using Deep Learning

    Wed, 08/30/2017 - 02:00

    (This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

The Internet Movie Database (IMDb) is a great source of information about movies. Keras provides access to part of this data in cleaned form (e.g. for sentiment classification). While sentiment classification is an interesting topic, I wanted to see if it is possible to identify a movie’s genre from its description. The image illustrates the task.

To see whether that is possible, I downloaded the raw data from an FU Berlin FTP server. Most movies have multiple genres assigned (e.g. Action and Sci-Fi); I chose to randomly pick one genre in the case of multiple assignments.

So the task at hand is to use a lengthy description to infer a (noisy) label. Hence, the task is similar to the Reuters news categorization task, and I used that code as a guideline for the model. However, looking at that code, it becomes clear that the data preprocessing part is skipped. In order to make it easy for practitioners to create their own applications, I will try to detail the necessary preprocessing. The texts are represented as vectors of integers (indexes), so basically one builds a dictionary in which each index refers to a particular word.

require(caret)
require(keras)

max_words <- 1500

### create a balanced dataset with equal numbers of observations for each class
### (note: down_train is created here, but the lines below tokenize the full mm data)
down_train <- caret::downSample(x = mm, y = mm$GenreFact)

### preprocessing ---
tokenizer <- keras::text_tokenizer(num_words = max_words)
keras::fit_text_tokenizer(tokenizer, mm$descr)
sequences <- tokenizer$texts_to_sequences(mm$descr)

## split in training and test set
train <- sample(1:length(sequences), size = 0.95 * length(sequences), replace = FALSE)
x_test <- sequences[-train]
x_train <- sequences[train]

### labels
y_train <- mm[train, ]$GenreFact
y_test <- mm[-train, ]$GenreFact

### how many classes do we have?
num_classes <- length(unique(y_train)) + 1
cat(num_classes, '\n')

## vectorize the sequence data into a matrix that can be used as an input matrix
x_train <- sequences_to_matrix(tokenizer, x_train, mode = 'binary')
x_test <- sequences_to_matrix(tokenizer, x_test, mode = 'binary')
cat('x_train shape:', dim(x_train), '\n')
cat('x_test shape:', dim(x_test), '\n')

## convert the class vectors to binary class matrices (for use with categorical_crossentropy)
y_train <- to_categorical(y_train, num_classes)
y_test <- to_categorical(y_test, num_classes)

In order to get trainable data, we first balance the dataset so that all classes have the same frequency. Then we preprocess the raw text descriptions into this index-based representation. As always, we split the dataset into training and test data (95% training). Finally, we transform the index-based representation into a matrix representation and one-hot-encode the classes.

After setting up the data, we can define the model. I tried different combinations (depth, dropouts, regularizers and input units) and the following layout seems to work best:

batch_size <- 64
epochs <- 200

model <- keras_model_sequential()
model %>%
  layer_dense(units = 512, input_shape = c(max_words), activation = "relu") %>%
  layer_dropout(rate = 0.6) %>%
  layer_dense(units = 64, activation = 'relu',
              kernel_regularizer = regularizer_l1(l = 0.15)) %>%
  layer_dropout(rate = 0.8) %>%
  layer_dense(units = num_classes, activation = 'softmax')

summary(model)

model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)

hist <- model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  verbose = 1,
  validation_split = 0.1
)

## using the holdout dataset!
score <- model %>% evaluate(
  x_test, y_test,
  batch_size = batch_size,
  verbose = 1
)
cat('Test score:', score[[1]], '\n')
cat('Test accuracy:', score[[2]], '\n')

    Finally, we plot the training progress and conclude that it is possible to train a classifier without too much effort.
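With the keras R package this is a one-liner, since the object returned by fit() has a plot method:

plot(hist)   # loss and accuracy per epoch, for training and validation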

I hope this short tutorial illustrated how to preprocess text in order to build a text-based deep-learning classifier. I am pretty sure there are better parameters to tune the model. If you want to implement such a model in a production environment, I would recommend playing with the text-preprocessing parameters; the text_tokenizer and texts_to_sequences functions hold a lot of untapped value.

    Good luck!



    New CRAN Package Announcement: splashr

    Wed, 08/30/2017 - 00:26

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

    I’m pleased to announce that splashr is now on CRAN.

    (That image was generated with splashr::render_png(url = "https://cran.r-project.org/web/packages/splashr/")).

The package is an R interface to the Splash JavaScript rendering service. It works in a similar fashion to Selenium but is far more geared to web scraping and has quite a bit of power under the hood.

    I’ve blogged about splashr before:

and the package comes with three vignettes that (hopefully) provide a solid introduction to using the web scraping framework.

    More features — including additional DSL functions — will be added in the coming months, but please kick the tyres and file an issue with problems or suggestions.

    Many thanks to all who took it for a spin and provided suggestions and even more thanks to the CRAN team for a speedy onboarding.


