R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Web Scraping Influenster: Find a Popular Hair Care Product for You

Wed, 08/30/2017 - 19:28

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Are you a person who likes to try new products? Are you curious about which hair products are popular and trendy? If you’re excited about getting your hair glossy and eager to find a suitable shampoo, conditioner, or hair oil product, my ‘Shiny (Hair) App’ could help you find what you seek in less time. My code is available on GitHub.

 

Research Questions

What are popular hair care brands?

What is user behavior like on Influenster.com?

What factors may have a critical influence on customer satisfaction?

Is it possible to create a search engine that takes a search phrase and returns related products?

 

Data Collection

To obtain the most up-to-date hair care information, I decided to scrape Influenster, a product discovery and review platform. It has over 14 million reviews and over 2 million products for users to choose from.

In order to narrow down my research scope, I focused on 3 categories: shampoo, hair conditioner, and hair oil, and gathered the top 54 choices for each one. For the product dataset, I scraped brand name, product name, overall product rating, rank, and reviews. In addition, the review dataset includes author name, author location, content, rating score, and hair profile.

 

Results

Top Brands Graph

Firstly, the “other” category represents brands that have only one or two popular products. Judging from the popular brands’ pie chart, we can see that most of the popular products belong to major brands.

 

Rating Map

To examine users’ behavior on Influenster in the United States, I decided to make two maps to see whether there are any interesting results linked to location. Since I scraped the top 54 products for each category, the overall rating score is high across the country. As a result, it is difficult to see regional differences.

 

Reviews Map

However, if we take a look at the number of hair care product reviews on Influenster.com across the nation, we see that there are 4,740, 3,898, 3,787, and 2,818 reviews in California, Florida, Texas, and New York, respectively.

 

Analysis of Rating and Number of Reviews

There is a negative relationship between rating and number of reviews. As you can see, Pureology receives the highest score, 4.77 out of 5, but it only has 514 reviews. On the other hand, OGX is rated 4.4 out of 5, yet it has over 5,167 reviews.

 

Wordcloud & Comparison Cloud

Since we may be interested in what factors customers care about most and what contributes to their satisfaction with a product, I decided to inspect the most frequently mentioned words in those 77 thousand reviews. For the first try, I created word clouds for each category and for the overall reviews. However, there was no significant difference among the four graphs. Therefore, I created a comparison cloud to collate the most common words popping up in reviews. From the comparison cloud, we can infer that customers regard the functionality and fragrance of products as most important. In addition, the word “recommend” shows up as a commonly used word in the reviews dataset. Consequently, from my perspective, word of mouth is a great marketing strategy for brands to focus on.
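As an illustration of how such clouds can be built (a minimal sketch using the tm and wordcloud packages, not the exact code behind the figures; the review vectors named below are hypothetical):

library(tm)
library(wordcloud)

# reviews_shampoo, reviews_conditioner and reviews_hairoil are assumed to be
# character vectors of review text, one element per review (hypothetical names)
docs <- c(shampoo     = paste(reviews_shampoo, collapse = " "),
          conditioner = paste(reviews_conditioner, collapse = " "),
          hair_oil    = paste(reviews_hairoil, collapse = " "))

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

tdm <- as.matrix(TermDocumentMatrix(corpus))
colnames(tdm) <- names(docs)

# one word cloud per category ...
wordcloud(words = rownames(tdm), freq = tdm[, "shampoo"], max.words = 100)
# ... and a comparison cloud across the three categories
comparison.cloud(tdm, max.words = 100, title.size = 1)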

 

Search Engine built in my Shiny App (NLP: TF-IDF, cosine similarity)


http://blog.nycdatascience.com/wp-content/uploads/2017/08/engine_demo.mp4

TF-IDF

TF-IDF, which stands for “Term Frequency–Inverse Document Frequency,” is an NLP technique: a numerical statistic intended to reflect how important a word is to a document in a corpus.

For my search engine, I utilize the “tm” package and employ the weightSMART “nnn” weighting scheme for term frequency. Basically, weightSMART “nnn” is a natural weighting that simply counts how many times each individual word appears in each document in the dataset. If you would like to read more details and check other weighting schemes, please feel free to take a look at the R documentation.
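As a rough sketch of how that weighting can be requested from tm (the descriptions vector below is a placeholder for the scraped review text):

library(tm)

# descriptions: hypothetical character vector holding the product review text
corpus <- VCorpus(VectorSource(descriptions))

# "nnn" = natural term frequency, with no document-frequency or normalization component
dtm <- DocumentTermMatrix(
  corpus,
  control = list(weighting = function(x) weightSMART(x, spec = "nnn"))
)

inspect(dtm[1:5, 1:10])  # peek at the raw term counts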

Cosine Similarity

With TF-IDF measurements in place, products are recommended according to their cosine similarity score with the query. To elaborate, cosine similarity is a measure of similarity between two non-zero vectors of an inner product space: the cosine of the angle between them. In the case of information retrieval, such as a search engine, the cosine similarity of two documents ranges from 0 to 1, because the term frequencies (TF-IDF weights) cannot be negative; in other words, the angle between two term-frequency vectors cannot be greater than 90 degrees. The closer the cosine value is to 1, the higher the similarity between the two vectors (products). The cosine similarity formula is shown below.
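In symbols, cos(θ) = (A · B) / (‖A‖ ‖B‖) for two term-frequency vectors A and B. A minimal sketch of scoring a query against the product term matrix might look like this (dtm_mat and query_vec are assumed names sharing the same vocabulary):

# dtm_mat: hypothetical products-by-terms matrix; query_vec: term vector of the search phrase
cosine_sim <- function(query_vec, doc_mat) {
  dot_products <- as.numeric(doc_mat %*% query_vec)                  # A . B for every product
  norms        <- sqrt(rowSums(doc_mat^2)) * sqrt(sum(query_vec^2))  # ||A|| * ||B||
  dot_products / norms
}

scores <- cosine_sim(query_vec, dtm_mat)
head(sort(scores, decreasing = TRUE), 10)  # ten most similar products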

 

 

Insights

Most of the products belong to household brands.

The more active users of the site are from California, Florida, Texas and New York.

There is a negative relationship between the number of reviews and rating score.

Functions and the scent of hair care products are of great importance.

Even though “recommend” is a commonly used word, in this project it is difficult to tell whether it appears in positive or negative feedback. Thus, I may conduct sentiment analysis in the future.

The self-developed search engine, built on the TF-IDF and cosine similarity concepts, would work even better if I included product descriptions. By adding product descriptions, users would have a higher probability of matching their input not only to product names but also to product descriptions, so they could retrieve more related merchandise and explore new features of products.

The post Web Scraping Influenster: Find a Popular Hair Care Product for You appeared first on NYC Data Science Academy Blog.


Data wrangling : Cleansing – Regular expressions (2/3)

Wed, 08/30/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


Data wrangling is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process, estimated to take about 60-80% of an analyst’s time. In this series we will go through this process. It will be a brief series whose goal is to sharpen the reader’s skills in the data wrangling task. This is the fourth part of the series and it aims to cover the cleaning of the data used. In the previous parts we learned how to import, reshape and transform data. The rest of the series will be dedicated to the data cleansing process. In this post we will go through regular expressions: sequences of characters that define a search pattern, mainly for use in pattern matching with text strings. In particular, we will cover the foundations of regular expression syntax.

Before proceeding, it might be helpful to look over the help pages for grep and gsub.

Moreover, please run the following commands to create the strings that we will work on.

bio <- c('24 year old', 'data scientist', '1992', 'A.I. enthusiast',
         'R version 3.4.0 (2017-04-21)', 'r-exercises author', 'R is cool', 'RR')
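As a quick refresher before you start (these are generic illustrations, not the exercise answers), grep, grepl and gsub work on the bio vector like this:

grep("[0-9]", bio, value = TRUE)   # return the strings that contain any digit
grepl("^R", bio)                   # logical: does the string start with "R"?
gsub("[[:space:]]", "_", bio)      # replace spaces and tabs with underscores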

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Find the strings with Numeric values between 3 and 6.

Exercise 2

Find the strings with the character ‘A’ or ‘y’.

Exercise 3

Find any strings that have non-alphanumeric characters.

Exercise 4

Remove lower case letters.

Learn more about text analysis in the online course Text Analytics/Text Mining Using R. In this course you will learn how to create, analyse, and finally visualize your text-based data source. Having all the steps easily outlined will be a great reference source for future work.

Exercise 5

Remove space or tabs.

Exercise 6

Remove punctuation and replace it with white space.

Exercise 7

Remove alphanumeric characters.

Exercise 8

Match sentences that contain ‘M’.

Exercise 9

Match states with two ‘o’.

Exercise 10

Match cars with one or two ‘e’.

Related exercise sets:
  1. Regular Expressions Exercises – Part 1
  2. Data wrangling : Cleansing – Regular expressions (1/3)
  3. Data wrangling : Transforming (2/3)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

The one function call you need to know as a data scientist: h2o.automl

Wed, 08/30/2017 - 14:49
Introduction

Two things that recently came to my attention were AutoML (Automatic Machine Learning) by h2o.ai and the fashion MNIST by Zalando Research. So as a test, I ran AutoML on the fashion mnist data set.

H2o AutoML

As you all know, a large part of the work in predictive modeling is in preparing the data. But once you have done that, ideally you don’t want to spend too much time trying many different machine learning models. That’s where AutoML from h2o.ai comes in. With one function call you automate the process of training a large, diverse selection of candidate models.

AutoML trains and cross-validates a Random Forest, an Extremely Randomized Forest, GLMs, Gradient Boosting Machines (GBMs) and Neural Nets. Then, as a “bonus”, it trains a Stacked Ensemble using all of the models. The function to use in the h2o R interface is h2o.automl. (There is also a Python interface.)

FashionMNIST_Benchmark <- h2o.automl(
  x = 1:784,
  y = 785,
  training_frame = fashionmnist_train,
  validation_frame = fashionmnist_test
)

So the first 784 columns in the data set are used as inputs and column 785 is the column with the labels. There are more input arguments that you can use; for example, the maximum running time, the maximum number of models to train, or a stopping metric, as sketched below.
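For instance, a more constrained run could look roughly like this (the argument values are purely illustrative):

FashionMNIST_Benchmark <- h2o.automl(
  x                = 1:784,
  y                = 785,
  training_frame   = fashionmnist_train,
  validation_frame = fashionmnist_test,
  max_runtime_secs = 3600,       # stop the whole run after one hour
  max_models       = 20,         # train at most 20 base models
  stopping_metric  = "logloss"   # metric used for early stopping
)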

It can take some time to run all these models, so I have spun up a so-called high CPU droplet on Digital Ocean: 32 dedicated cores ($0.92 /h).

h2o utilizing all 32 cores to create models

The output in R is an object containing the models and a ‘leaderboard’ ranking the different models. I got the following accuracies on the Fashion MNIST test set.

  1. Gradient Boosting (0.90)
  2. Deep learning (0.89)
  3. Random forests (0.89)
  4. Extremely randomized forests (0.88)
  5. GLM (0.86)

There is no ensemble model because it is not yet supported for multi-class classifiers. The deep learning models in h2o are fully connected hidden layers; for this specific Zalando image data set, you’re better off pursuing fancier convolutional neural networks. As a comparison, I ran a simple 2-layer CNN with keras, resulting in a test accuracy of 0.92. It outperforms all the models here!

Conclusion

If you have prepared your modeling data set, the first thing you can always do now is to run h2o.automl.

Cheers, Longhow.


Layered Data Visualizations Using R, Plotly, and Displayr

Wed, 08/30/2017 - 08:03

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

If you have tried to communicate research results and data visualizations using R, there is a good chance you will have come across one of its great limitations. R is painful when you need to create visualizations by layering multiple visual elements on top of each other. In other words, R can be painful if you want to assemble many visual elements, such as charts, images, headings, and backgrounds, into one visualization.

The good: R can create awesome charts 

R is great for creating charts. It gives you a lot of control and makes it easy to update charts with revised data. As an example, the chart below was created in R using the plotly package. It has quite a few nifty features that cannot be achieved in, say, Excel or Tableau.

The data visualization below measures blood sugar, exercise intensity, and diet. Each dot represents a blood glucose (BG) measurement for a patient over the course of a day. Note that the blood sugar measurements are not collected at regular intervals so there are gaps between some of the dots. In addition, the y-axis label spacings are irregular because this chart needs to emphasize the critical point of a BG of 8.9. The dots also get larger the further they are from a BG of 6 and color is used to emphasize extreme values. Finally, green shading is used to indicate the intensity of the patient’s physical activity, and readings from a food diary have been automatically added to this chart.

While this R visualization is awesome, it can be made even more interesting by overlaying visual elements such as images and headings.

You can look at this R visualization live, and you can hover your mouse over points to see the dates and times of individual readings. 

 

The bad: It is very painful to create visual confections in R

In his book, Visual Explanations, Edward Tufte coins the term visual confections to describe visualizations that are created by overlaying multiple visual elements (e.g., combining charts with images or joining multiple visualizations into one). The document below is an example of a visual confection.

The chart created in R above has been incorporated into the visualization below, along with another chart, images, background colors, headings and more – this is a visual confection.

In addition to all information contained in the original chart, the patient’s insulin dose for each day is shown in a syringe and images of meals have also been added. The background has been colored, and headings and sub-headings included. While all of this can be done in R, it cannot be done easily.

Even if you know all the relevant functions to programmatically insert images, resize them, deal with transparency, and control their order, you still have to go through a painful trial and error process of guesstimating the coordinates where things need to appear. That is, R is not WYSIWYG, and you really feel this when creating visual confections. Whenever I have done such things, I end up having to print the images, use a ruler, and create a simple model to estimate the coordinates!

 

The solution: How to assemble many visual layers into one data visualization

The standard way that most people create visual confections is using PowerPoint. However, PowerPoint and R are not great friends, as resizing R charts in PowerPoint causes problems, and PowerPoint cannot support any of the cool hover effects or interactivity in HTMLwidgets like plotly.

My solution was to build Displayr, which is a bit like a PowerPoint for the modern age, except that charts can be created in the app using R. The app is also online and can have its data updated automatically.

Click here to create your own layered visualization (just sign into Displayr first). Here you can access and edit the document that I used to create the visual confection example used in this post. This document contains all the raw data and the R code (as a function) used to automatically create the charts in this post. You can see the published layered visualization as a web page here.

 


RcppArmadillo 0.7.960.1.2

Wed, 08/30/2017 - 04:20

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A second fix-up release is needed following on the recent bi-monthly RcppArmadillo release as well as the initial follow-up as it turns out that OS X / macOS is so darn special that it needs an entire separate treatment for OpenMP. Namely to turn it off entirely…

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 384 other packages on CRAN, an increase of 54 since the CRAN release in June!

Changes in RcppArmadillo version 0.7.960.1.2 (2017-08-29)
  • On macOS, OpenMP support is now turned off (#170).

  • The package is now compiling under the C++11 standard (#170).

  • The vignette dependency is correctly set (James and Dirk in #168 and #169).

Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


RStudio 1.1 Preview – I Only Work in Black

Wed, 08/30/2017 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Today, we’re continuing our blog series on new features in RStudio 1.1. If you’d like to try these features out for yourself, you can download a preview release of RStudio 1.1.

I Only Work in Black

For those of us who like to work in black or very, very dark grey, the dark theme can be enabled from the ‘Global Options’ menu by selecting the ‘Appearance’ tab and choosing an ‘Editor theme’ that is dark.

Icons are now high-DPI, and ‘Modern’ and ‘Sky’ themes have also been added; read more about them under Using RStudio Themes.

All panels support themes: Code editor, Console, Terminal, Environment, History, Files, Connections, Packages, Help, Build and VCS. Other features like Notebooks, Debugging, Profiling, Menus and the Object Explorer support this theme as well.

However, the Plots and Viewer panes render with the default colors of your content and therefore require additional packages to switch to dark themes. For instance, shinythemes provides the darkly theme for Shiny, and ggthemes provides support for light = FALSE under ggplot. If you are a package author, consider using rstudioapi::getThemeInfo() when generating output for these panes.
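For example, a package author could adapt plot colors to the current theme roughly like this (a sketch; getThemeInfo() returns, among other things, a dark flag):

# pick base-graphics colors depending on whether the active RStudio theme is dark
theme_info <- rstudioapi::getThemeInfo()
bg <- if (isTRUE(theme_info$dark)) "#2b2b2b" else "white"
fg <- if (isTRUE(theme_info$dark)) "white" else "black"

par(bg = bg, fg = fg, col.axis = fg, col.lab = fg, col.main = fg)
plot(pressure, main = "Theme-aware base plot")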

Enjoy!


IMDB Genre Classification using Deep Learning

Wed, 08/30/2017 - 02:00

(This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

The Internet Movie Database (IMDb) is a great source of information about movies. Keras provides access to part of the cleaned dataset (e.g. for sentiment classification). While sentiment classification is an interesting topic, I wanted to see if it is possible to identify a movie’s genre from its description.
The image below illustrates the task.

To see whether that is possible, I downloaded the raw data from an FU-Berlin FTP server. Most movies have multiple genres assigned (e.g. Action and Sci-Fi). I chose to randomly pick one genre in case of multiple assignments.

So the task at hand is to use a lengthy description to infer a (noisy) label. Hence, the task is similar to the Reuters news categorization task, and I used that code as a guideline for the model.
However, looking at the code, it becomes clear that the data preprocessing part is skipped. In order to make it easy for practitioners to create their own applications, I will try to detail the necessary preprocessing.
The texts are represented as a vector of integers (indexes). So basically one builds a dictionary in which each index refers to a particular word.

require(caret)
require(keras)

max_words <- 1500

### create a balanced dataset with equal numbers of observations for each class
down_train <- caret::downSample(x = mm, y = mm$GenreFact)

### preprocessing ---
tokenizer <- keras::text_tokenizer(num_words = max_words)
keras::fit_text_tokenizer(tokenizer, mm$descr)
sequences <- tokenizer$texts_to_sequences(mm$descr)

## split in training and test set
train   <- sample(1:length(sequences), size = 0.95 * length(sequences), replace = F)
x_test  <- sequences[-train]
x_train <- sequences[train]

### labels!
y_train <- mm[train, ]$GenreFact
y_test  <- mm[-train, ]$GenreFact

########## how many classes do we have?
num_classes <- length(unique(y_train)) + 1
cat(num_classes, '\n')

## vectorize the sequence data into a matrix which can be used as an input matrix
x_train <- sequences_to_matrix(tokenizer, x_train, mode = 'binary')
x_test  <- sequences_to_matrix(tokenizer, x_test, mode = 'binary')
cat('x_train shape:', dim(x_train), '\n')
cat('x_test shape:', dim(x_test), '\n')

## convert the class vectors to binary class matrices (for use with categorical_crossentropy)
y_train <- to_categorical(y_train, num_classes)
y_test  <- to_categorical(y_test, num_classes)

In order to get trainable data, we first balance the dataset so that all classes have the same frequency.
Then we preprocess the raw text descriptions into such an index-based representation. As always, we split the dataset into training and test sets (95% training). Finally, we transform the index-based representation into a matrix representation and one-hot-encode the classes.

After setting up the data, we can define the model. I tried different combinations (depth, dropouts, regularizers and input units) and the following layout seems to work the best:

batch_size <- 64
epochs <- 200

model <- keras_model_sequential()
model %>%
  layer_dense(units = 512, input_shape = c(max_words), activation = "relu") %>%
  layer_dropout(rate = 0.6) %>%
  # the regularizer must be passed by name; unnamed it would be matched to use_bias
  layer_dense(units = 64, activation = 'relu', kernel_regularizer = regularizer_l1(l = 0.15)) %>%
  layer_dropout(rate = 0.8) %>%
  layer_dense(units = num_classes, activation = 'softmax')

summary(model)

model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)

hist <- model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  verbose = 1,
  validation_split = 0.1
)

## evaluate on the holdout dataset!
score <- model %>% evaluate(
  x_test, y_test,
  batch_size = batch_size,
  verbose = 1
)

cat('Test score:', score[[1]], '\n')
cat('Test accuracy:', score[[2]], '\n')

Finally, we plot the training progress and conclude that it is possible to train a classifier without too much effort.

I hope this short tutorial illustrated how to preprocess text in order to build a text-based deep-learning classifier. I am pretty sure there are better parameters to tune the model.
If you want to implement such a model in a production environment, I would recommend playing with the text-preprocessing parameters. The text_tokenizer and texts_to_sequences functions hold a lot of untapped value.

Good luck!


New CRAN Package Announcement: splashr

Wed, 08/30/2017 - 00:26

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I’m pleased to announce that splashr is now on CRAN.

(That image was generated with splashr::render_png(url = "https://cran.r-project.org/web/packages/splashr/")).

The package is an R interface to the Splash javascript rendering service. It works in a similar fashion to Selenium but is far more geared to web scraping and has quite a bit of power under the hood.
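Basic usage looks roughly like this (a sketch that assumes a Splash instance is already running locally, e.g. via Docker):

library(splashr)
library(rvest)

splash_active()   # is the local Splash instance reachable?

# render the JavaScript-executed page, then scrape it with rvest as usual
pg <- render_html(url = "https://cran.r-project.org/web/packages/splashr/")
html_text(html_nodes(pg, "h2"))

# or grab a full-page screenshot, as in the image above
render_png(url = "https://cran.r-project.org/web/packages/splashr/")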

I’ve blogged about splashr before:

and the package comes with three vignettes that (hopefully) provide a solid introduction to using the web scraping framework.

More features — including additional DSL functions — will be added in the coming months, but please kick the tyres and file an issue with problems or suggestions.

Many thanks to all who took it for a spin and provided suggestions and even more thanks to the CRAN team for a speedy onboarding.


3-D animations with R

Tue, 08/29/2017 - 23:42

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R is often used to visualize and animate 2-dimensional data. (Here are just a few examples.)  But did you know you can create 3-dimensional animations as well? 

As Thomas Lin Pedersen explains in a recent blog post, the trick is in using the persp function to translate points in 3-D space into a 2-D projection. This function is normally used to render a 3-D scatterplot or wireframe plot, but if you instead capture its return value, you get a transformation matrix. You can then use the trans3d function with this matrix to transform points in 3-D space. Thomas demonstrates how you can pass the transformed 2-D coordinates to plot a 3-D cube, and even animate it from two slightly different perspectives to create a 3-D stereo pair:
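In base R the core mechanics look roughly like this (a minimal sketch, not Thomas's actual cube code):

# persp() invisibly returns the 4x4 projection matrix;
# trans3d() then maps 3-D coordinates onto the 2-D plot.
x <- y <- seq(-2, 2, 0.1)
z <- outer(x, y, function(x, y) exp(-(x^2 + y^2)))
pmat <- persp(x, y, z, theta = 30, phi = 25, col = "lightblue")

# project a single 3-D point (0, 0, 1) and overlay it on the plot
pt <- trans3d(0, 0, 1, pmat)
points(pt$x, pt$y, col = "red", pch = 19)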

Rendering 3-D images isn't just for fun: there's plenty of 3-D data to analyze and visualize, too. Giora Simchoni used R to visualize data from the Carnegie-Mellon Graphics Lab Motion Capture Database. This data repository provides the output of human figures in motion-capture suits performing actions like walking, jumping, and even dancing. Since the motion-capture suits include multiple sensors measured over a time-period, the data structures are quite complex. To make things simpler, Giora created the mocap package (available on Github) to read these motion data files and generate 3-D animations to visualize them. For example, here's the output from two people performing the Charleston together:

You can find complete details behind both animations, including the associated R code, at the links below.

Data Imaginist: I made a 3D movie with ggplot2 once – here's how I did it
Giora Simchoni: Lambada! (The mocap Package)


Clean or shorten Column names while importing the data itself

Tue, 08/29/2017 - 19:05

(This article was first published on Coastal Econometrician Views, and kindly contributed to R-bloggers)

When it comes to clumsy column headers, namely wide ones with spaces and special characters, I see many people panic and change the headers in the source file, which is an awkward option given the variety of alternatives that exist in R for handling them.

One easy way to handle such scenarios is the janitor package, which, as the name suggests, can be employed for cleaning and maintaining data. janitor has a function named clean_names() which can be applied while importing the data itself, as shown in the example below:

library(janitor)
library(dplyr)  # for the %>% pipe

newdataobject <- read.csv("yourcsvfilewithpath.csv", header = TRUE) %>%
  clean_names()

The author has undertaken several projects, courses and programs in data science for more than a decade; the views expressed here are from his industry experience. He can be reached at mavuluri.pradeep@gmail or besteconometrician@gmail.com for more details.
Find out more about the author at http://in.linkedin.com/in/pradeepmavuluri

Working with air quality and meteorological data Exercises (Part-2)

Tue, 08/29/2017 - 18:08

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Atmospheric air pollution is one of the most important environmental concerns in many countries around the world, and it is strongly affected by meteorological conditions. Accordingly, in this set of exercises we use the openair package to work with and analyze air quality and meteorological data. This package provides tools to directly import data from the air quality measurement networks across the UK, as well as tools to analyse the data and produce reports.

In the previous exercise set we used data from the MY1 station to see how to import data and extract basic statistical information from it. In this exercise set we will use some basic and useful functions available in the openair package to analyze and visualize the MY1 data.

Answers to the exercises are available here.

For other parts of this exercise set follow the tag openair

Please load the package openair before starting the exercises.
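If you no longer have the MY1 data from Part 1 in your workspace, it can be re-imported directly; a minimal sketch (assuming the hourly AURN feed for Marylebone Road and an arbitrary year) is:

library(openair)

# MY1 = Marylebone Road; the year is just an example
my1 <- importAURN(site = "my1", year = 2015)
head(my1)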

Exercise 1
Use the summaryPlot function to plot time series and histograms for pm10 and o3.

Exercise 2
Use the windRose function to plot monthly wind roses.

You can use air quality data and weather patterns in combination with spatial data visualization. Learn more about spatial data in the online course
[Intermediate] Spatial Data Analysis with R, QGIS & More.
In this course you will learn how to:

  • Work with Spatial data and maps
  • Learn about different tools to develop spatial data next to R
  • And much more

Exercise 3
Use the pollutionRose function to plot monthly pollution roses for:
a. pm10
b. pm2.5
c. nox
d. no
e. o3

Exercise 4
Use pollutionRose to plot seasonal pollution roses for:
a. pm10
b. pm2.5
c. nox
d. no
e. o3

Exercise 5
Use the percentileRose function to plot monthly percentile roses for:
a. pm10
b. pm2.5
c. nox
d. no
e. o3

Exercise 6
Use the polarCluster function to plot cluster rose plots for:
a. pm10
b. pm2.5
c. nox
d. no
e. o3

Related exercise sets:
  1. Working with air quality and meteorological data Exercises (Part-1)
  2. Forecasting: Time Series Exploration Exercises (Part-1)
  3. Data table exercises: keys and subsetting
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

The Cycling Accident Map of Madrid City

Tue, 08/29/2017 - 18:00

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Far away, this ship has taken me far away (Starlight, Muse)

Madrid City has an Open Data platform where around 300 data sets on a number of topics can be found. One of these sets is the one I used for this experiment. It contains information about cycling accidents that happened in the city from January to July 2017. I have made a map to locate where the accidents took place. This experiment shows how R makes it very easy to create professional maps with Leaflet (in this case I use Carto basemaps).

To locate the accidents, the data set only contains the address where they happened, so the first thing I did was obtain their geographical coordinates using the geocode function from the ggmap package. There were 431 accidents during the first 7 months of 2017 (such a big number!) and I got coordinates for 407, so I can locate 94% of the accidents.

Obviously, the number of accidents in a given place depends on how many bikers circulate there as well as on its infrastructure. Neither of these things can be seen in the map: it only shows the number of accidents.

The categorization of accidents is:

  • Double collision (Colisión doble): Traffic accident occurred between two moving vehicles.
  • Multiple collision (Colisión múltiple): Traffic accident occurred between more than two moving vehicles.
  • Fixed object collision (Choque con objeto fijo): Accident occurred between a moving vehicle with a driver and an immovable object that occupies the road or a separated area of the same, whether a parked vehicle, tree, street lamp, etc.
  • Accident (Atropello): Accident occurred between a vehicle and a pedestrian that occupies the road or travels by sidewalks, refuges, walks or zones of the public road not destined to the circulation of vehicles.
  • Overturn (Vuelco): Accident suffered by a vehicle with more than two wheels which by some circumstance loses contact with the road and ends supported on one side or on its roof.
  • Motorcycle fall (Caída motocicleta): Accident suffered by a motorcycle, which at some moment loses balance, because of the driver or due to the conditions of the road.
  • Moped fall (Caída ciclomotor): Accident suffered by a moped, which at some moment loses balance, because of the driver or due to the conditions of the road.
  • Bicycle fall (Caída bicicleta): Accident suffered by a bicycle, which at some moment loses balance, because of the driver or due to the conditions of the road.

These categories are redundant (e.g. Double and Multiple collision), difficult to understand (e.g. Overturn), or both at the same time (e.g. Motorcycle fall and Moped fall). This categorization also ignores the human injuries caused by the accident.

Taking all these things in mind, this is the map:


Here is a full-screen version of the map.

My suggestions to the city council of Madrid are:

  1. Add geographical coordinates to data (I guess many of the analysis will need them)
  2. Rethink the categorization to make it clearer and more informative
  3. Add more cycling data sets to the platform (detail of bikeways, traffic …) to understand accidents better
  4. Judging just by the number of accidents, put the focus around Parque del Retiro, especially on its west side, from Plaza de Cibeles to Plaza de Carlos V: more warning signals, more (or better) bikeways …

I add the code below to update the map (if someone asks me to, I can do it myself regularly):

library(dplyr)
library(stringr)
library(ggmap)
library(leaflet)

# First, getting the data
download.file(paste0("http://datos.madrid.es/egob/catalogo/", file),
              destfile = "300110-0-accidentes-bicicleta.csv")
data <- read.csv("300110-0-accidentes-bicicleta.csv", sep = ";", skip = 1)

# Prepare data for geolocation
data %>%
  mutate(direccion = paste(str_trim(Lugar), str_trim(Numero), "MADRID, SPAIN", sep = ", ") %>%
           str_replace("NA, ", "") %>%
           str_replace(" - ", " CON ")) -> data

# Geolocation (takes some time ...)
coords <- c()
for (i in 1:nrow(data)) {
  coords %>% rbind(geocode(data[i, "direccion"])) -> coords
  Sys.sleep(0.5)
}

# Save data, just in case
data %>% cbind(coords) %>% saveRDS(file = "bicicletas.RDS")
data <- readRDS(file = "bicicletas.RDS")

# Remove non-successful geolocations
data %>% filter(!is.na(lon)) %>% droplevels() -> data

# Create date and popup fields
data %>%
  mutate(Fecha = paste0(as.Date(data$Fecha, "%d/%m/%Y"), " ", TRAMO.HORARIO),
         popup = paste0("Dónde:", direccion,
                        "Cuándo:", Fecha,
                        "Qué pasó:", Tipo.Accidente)) -> data

# Do the map
data %>% split(data$Tipo.Accidente) -> data.df
l <- leaflet() %>% addProviderTiles(providers$CartoDB.Positron)
names(data.df) %>%
  purrr::walk(function(df) {
    l <<- l %>%
      addCircleMarkers(data = data.df[[df]],
                       lng = ~lon, lat = ~lat, popup = ~popup,
                       color = "red", stroke = FALSE, fillOpacity = 0.8,
                       group = df,
                       clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = F))
  })
l %>% addLayersControl(
  overlayGroups = names(data.df),
  options = layersControlOptions(collapsed = FALSE)
)


Rpad Domain Repurposed To Deliver Creepy (and potentially malicious) Content

Tue, 08/29/2017 - 16:14

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I was about to embark on setting up a background task to sift through R package PDFs for traces of functions that “omit NA values” as a surprise present for Colin Fay and Sir Tierney:

[Please RT]#RStats folks, @nj_tierney & I need your help for {naniar}!
When does R silently drop/omit NA? https://t.co/V5elyGcG8Z pic.twitter.com/VScLXFCl2n

— Colin Fay (@_ColinFay) August 29, 2017

When I got distracted by a PDF in the CRAN doc/contrib directory: Short-refcard.pdf. I’m not a big reference card user but students really like them and after seeing what it was I remembered having seen the document ages ago, but never associated it with CRAN before.

I saw:

by Tom Short, EPRI PEAC, tshort@epri-peac.com 2004-11-07 Granted to the public domain. See www. Rpad. org for the source and latest version. Includes material from R for Beginners by Emmanuel Paradis (with permission).

at the top of the card. The link (which I’ve made unclickable for reasons you’ll see in a sec — don’t visit that URL) was clickable and I tapped it as I wanted to see if it had changed since 2004.

You can open that image in a new tab to see the full, rendered site and take a moment to see if you can find the section that links to objectionable — and, potentially malicious — content. It’s easy to spot.

I made a likely correct assumption that Tom Short had nothing to do with this and wanted to dig into it a bit further to see when this may have happened. So, don your bestest deerstalker and follow along as we see when this may have happened.

Digging In Domain Land

We’ll need some helpers to poke around this data in a safe manner:

library(wayback)      # devtools::install_github("hrbrmstr/wayback")
library(ggTimeSeries) # devtools::install_github("AtherEnergy/ggTimeSeries")
library(splashr)      # devtools::install_github("hrbrmstr/splashr")
library(passivetotal) # devtools::install_github("hrbrmstr/passivetotal")
library(cymruservices)
library(magick)
library(tidyverse)

(You’ll need to get a RiskIQ PassiveTotal key to use those functions. Also, please donate to Archive.org if you use the wayback package.)

Now, let’s see if the main Rpad content URL is in the wayback machine:

glimpse(archive_available("http://www.rpad.org/Rpad/"))
## Observations: 1
## Variables: 5
## $ url        "http://www.rpad.org/Rpad/"
## $ available  TRUE
## $ closet_url "http://web.archive.org/web/20170813053454/http://ww...
## $ timestamp  2017-08-13
## $ status     "200"

It is! Let’s see how many versions of it are in the archive:

x <- cdx_basic_query("http://www.rpad.org/Rpad/")

ts_range <- range(x$timestamp)

count(x, timestamp) %>%
  ggplot(aes(timestamp, n)) +
  geom_segment(aes(xend=timestamp, yend=0)) +
  labs(x=NULL, y="# changes in year",
       title="rpad.org Wayback Change Timeline") +
  theme_ipsum_rc(grid="Y")

count(x, timestamp) %>%
  mutate(Year = lubridate::year(timestamp)) %>%
  complete(timestamp=seq(ts_range[1], ts_range[2], "1 day")) %>%
  filter(!is.na(timestamp), !is.na(Year)) %>%
  ggplot(aes(date = timestamp, fill = n)) +
  stat_calendar_heatmap() +
  viridis::scale_fill_viridis(na.value="white", option = "magma") +
  facet_wrap(~Year, ncol=1) +
  labs(x=NULL, y=NULL, title="rpad.org Wayback Change Timeline") +
  theme_ipsum_rc(grid="") +
  theme(axis.text=element_blank()) +
  theme(panel.spacing = grid::unit(0.5, "lines"))

There’s a big span between 2008/9 and 2016/17. Let’s poke around there a bit. First 2016:

tm <- get_timemap("http://www.rpad.org/Rpad/")

(rurl <- filter(tm, lubridate::year(anytime::anydate(datetime)) == 2016))
## # A tibble: 1 x 5
##       rel                                                                    link  type
##
## 1 memento http://web.archive.org/web/20160629104907/http://www.rpad.org:80/Rpad/
## # ... with 2 more variables: from , datetime

(p2016 <- render_png(url = rurl$link))

Hrm. Could be server or network errors.

Let’s go back to 2009.

(rurl <- filter(tm, lubridate::year(anytime::anydate(datetime)) == 2009))
## # A tibble: 4 x 5
##       rel                                                                   link  type
##
## 1 memento http://web.archive.org/web/20090219192601/http://rpad.org:80/Rpad
## 2 memento http://web.archive.org/web/20090322163146/http://www.rpad.org:80/Rpad
## 3 memento http://web.archive.org/web/20090422082321/http://www.rpad.org:80/Rpad
## 4 memento http://web.archive.org/web/20090524155658/http://www.rpad.org:80/Rpad
## # ... with 2 more variables: from , datetime

(p2009 <- render_png(url = rurl$link[4]))

If you poke around that, it looks like the original Rpad content, so it was “safe” back then.

(rurl <- filter(tm, lubridate::year(anytime::anydate(datetime)) == 2017))
## # A tibble: 6 x 5
##       rel                                                                link  type
##
## 1 memento http://web.archive.org/web/20170323222705/http://www.rpad.org/Rpad
## 2 memento http://web.archive.org/web/20170331042213/http://www.rpad.org/Rpad/
## 3 memento http://web.archive.org/web/20170412070515/http://www.rpad.org/Rpad/
## 4 memento http://web.archive.org/web/20170518023345/http://www.rpad.org/Rpad/
## 5 memento http://web.archive.org/web/20170702130918/http://www.rpad.org/Rpad/
## 6 memento http://web.archive.org/web/20170813053454/http://www.rpad.org/Rpad/
## # ... with 2 more variables: from , datetime

(p2017 <- render_png(url = rurl$link[1]))

I won’t break your browser and add another giant image, but that one has the icky content. So, it’s a relatively recent takeover and it’s likely that whoever added the icky content links did so to try to ensure those domains and URLs have both good SEO and a positive reputation.

Let’s see if they were dumb enough to make their info public:

rwho <- passive_whois("rpad.org")
str(rwho, 1)
## List of 18
##  $ registryUpdatedAt: chr "2016-10-05"
##  $ admin            :List of 10
##  $ domain           : chr "rpad.org"
##  $ registrant       :List of 10
##  $ telephone        : chr "5078365503"
##  $ organization     : chr "WhoisGuard, Inc."
##  $ billing          : Named list()
##  $ lastLoadedAt     : chr "2017-03-14"
##  $ nameServers      : chr [1:2] "ns-1147.awsdns-15.org" "ns-781.awsdns-33.net"
##  $ whoisServer      : chr "whois.publicinterestregistry.net"
##  $ registered       : chr "2004-06-15"
##  $ contactEmail     : chr "411233718f2a4cad96274be88d39e804.protect@whoisguard.com"
##  $ name             : chr "WhoisGuard Protected"
##  $ expiresAt        : chr "2018-06-15"
##  $ registrar        : chr "eNom, Inc."
##  $ compact          :List of 10
##  $ zone             : Named list()
##  $ tech             :List of 10

Nope. #sigh

Is this site considered “malicious”?

(rclass <- passive_classification("rpad.org"))
## $everCompromised
## NULL

Nope. #sigh

What’s the hosting history for the site?

rdns <- passive_dns("rpad.org") rorig <- bulk_origin(rdns$results$resolve) tbl_df(rdns$results) %>% type_convert() %>% select(firstSeen, resolve) %>% left_join(select(rorig, resolve=ip, as_name=as_name)) %>% arrange(firstSeen) %>% print(n=100) ## # A tibble: 88 x 3 ## firstSeen resolve as_name ## ## 1 2009-12-18 11:15:20 144.58.240.79 EPRI-PA - Electric Power Research Institute, US ## 2 2016-06-19 00:00:00 208.91.197.132 CONFLUENCE-NETWORK-INC - Confluence Networks Inc, VG ## 3 2016-07-29 00:00:00 208.91.197.27 CONFLUENCE-NETWORK-INC - Confluence Networks Inc, VG ## 4 2016-08-12 20:46:15 54.230.14.253 AMAZON-02 - Amazon.com, Inc., US ## 5 2016-08-16 14:21:17 54.230.94.206 AMAZON-02 - Amazon.com, Inc., US ## 6 2016-08-19 20:57:04 54.230.95.249 AMAZON-02 - Amazon.com, Inc., US ## 7 2016-08-26 20:54:02 54.192.197.200 AMAZON-02 - Amazon.com, Inc., US ## 8 2016-09-12 10:35:41 52.84.40.164 AMAZON-02 - Amazon.com, Inc., US ## 9 2016-09-17 07:43:03 54.230.11.212 AMAZON-02 - Amazon.com, Inc., US ## 10 2016-09-23 18:17:50 54.230.202.223 AMAZON-02 - Amazon.com, Inc., US ## 11 2016-09-30 19:47:31 52.222.174.253 AMAZON-02 - Amazon.com, Inc., US ## 12 2016-10-24 17:44:38 52.85.112.250 AMAZON-02 - Amazon.com, Inc., US ## 13 2016-10-28 18:14:16 52.222.174.231 AMAZON-02 - Amazon.com, Inc., US ## 14 2016-11-11 10:44:22 54.240.162.201 AMAZON-02 - Amazon.com, Inc., US ## 15 2016-11-17 04:34:15 54.192.197.242 AMAZON-02 - Amazon.com, Inc., US ## 16 2016-12-16 17:49:29 52.84.32.234 AMAZON-02 - Amazon.com, Inc., US ## 17 2016-12-19 02:34:32 54.230.141.240 AMAZON-02 - Amazon.com, Inc., US ## 18 2016-12-23 14:25:32 54.192.37.182 AMAZON-02 - Amazon.com, Inc., US ## 19 2017-01-20 17:26:28 52.84.126.252 AMAZON-02 - Amazon.com, Inc., US ## 20 2017-02-03 15:28:24 52.85.94.225 AMAZON-02 - Amazon.com, Inc., US ## 21 2017-02-10 19:06:07 52.85.94.252 AMAZON-02 - Amazon.com, Inc., US ## 22 2017-02-17 21:37:21 52.85.63.229 AMAZON-02 - Amazon.com, Inc., US ## 23 2017-02-24 21:43:45 52.85.63.225 AMAZON-02 - Amazon.com, Inc., US ## 24 2017-03-05 12:06:32 54.192.19.242 AMAZON-02 - Amazon.com, Inc., US ## 25 2017-04-01 00:41:07 54.192.203.223 AMAZON-02 - Amazon.com, Inc., US ## 26 2017-05-19 00:00:00 13.32.246.44 AMAZON-02 - Amazon.com, Inc., US ## 27 2017-05-28 00:00:00 52.84.74.38 AMAZON-02 - Amazon.com, Inc., US ## 28 2017-06-07 08:10:32 54.230.15.154 AMAZON-02 - Amazon.com, Inc., US ## 29 2017-06-07 08:10:32 54.230.15.142 AMAZON-02 - Amazon.com, Inc., US ## 30 2017-06-07 08:10:32 54.230.15.168 AMAZON-02 - Amazon.com, Inc., US ## 31 2017-06-07 08:10:32 54.230.15.57 AMAZON-02 - Amazon.com, Inc., US ## 32 2017-06-07 08:10:32 54.230.15.36 AMAZON-02 - Amazon.com, Inc., US ## 33 2017-06-07 08:10:32 54.230.15.129 AMAZON-02 - Amazon.com, Inc., US ## 34 2017-06-07 08:10:32 54.230.15.61 AMAZON-02 - Amazon.com, Inc., US ## 35 2017-06-07 08:10:32 54.230.15.51 AMAZON-02 - Amazon.com, Inc., US ## 36 2017-07-16 09:51:12 54.230.187.155 AMAZON-02 - Amazon.com, Inc., US ## 37 2017-07-16 09:51:12 54.230.187.184 AMAZON-02 - Amazon.com, Inc., US ## 38 2017-07-16 09:51:12 54.230.187.125 AMAZON-02 - Amazon.com, Inc., US ## 39 2017-07-16 09:51:12 54.230.187.91 AMAZON-02 - Amazon.com, Inc., US ## 40 2017-07-16 09:51:12 54.230.187.74 AMAZON-02 - Amazon.com, Inc., US ## 41 2017-07-16 09:51:12 54.230.187.36 AMAZON-02 - Amazon.com, Inc., US ## 42 2017-07-16 09:51:12 54.230.187.197 AMAZON-02 - Amazon.com, Inc., US ## 43 2017-07-16 09:51:12 54.230.187.185 AMAZON-02 - Amazon.com, Inc., US ## 44 2017-07-17 13:10:13 54.239.168.225 AMAZON-02 - Amazon.com, Inc., 
US ## 45 2017-08-06 01:14:07 52.222.149.75 AMAZON-02 - Amazon.com, Inc., US ## 46 2017-08-06 01:14:07 52.222.149.172 AMAZON-02 - Amazon.com, Inc., US ## 47 2017-08-06 01:14:07 52.222.149.245 AMAZON-02 - Amazon.com, Inc., US ## 48 2017-08-06 01:14:07 52.222.149.41 AMAZON-02 - Amazon.com, Inc., US ## 49 2017-08-06 01:14:07 52.222.149.38 AMAZON-02 - Amazon.com, Inc., US ## 50 2017-08-06 01:14:07 52.222.149.141 AMAZON-02 - Amazon.com, Inc., US ## 51 2017-08-06 01:14:07 52.222.149.163 AMAZON-02 - Amazon.com, Inc., US ## 52 2017-08-06 01:14:07 52.222.149.26 AMAZON-02 - Amazon.com, Inc., US ## 53 2017-08-11 19:11:08 216.137.61.247 AMAZON-02 - Amazon.com, Inc., US ## 54 2017-08-21 20:44:52 13.32.253.116 AMAZON-02 - Amazon.com, Inc., US ## 55 2017-08-21 20:44:52 13.32.253.247 AMAZON-02 - Amazon.com, Inc., US ## 56 2017-08-21 20:44:52 13.32.253.117 AMAZON-02 - Amazon.com, Inc., US ## 57 2017-08-21 20:44:52 13.32.253.112 AMAZON-02 - Amazon.com, Inc., US ## 58 2017-08-21 20:44:52 13.32.253.42 AMAZON-02 - Amazon.com, Inc., US ## 59 2017-08-21 20:44:52 13.32.253.162 AMAZON-02 - Amazon.com, Inc., US ## 60 2017-08-21 20:44:52 13.32.253.233 AMAZON-02 - Amazon.com, Inc., US ## 61 2017-08-21 20:44:52 13.32.253.29 AMAZON-02 - Amazon.com, Inc., US ## 62 2017-08-23 14:24:15 216.137.61.164 AMAZON-02 - Amazon.com, Inc., US ## 63 2017-08-23 14:24:15 216.137.61.146 AMAZON-02 - Amazon.com, Inc., US ## 64 2017-08-23 14:24:15 216.137.61.21 AMAZON-02 - Amazon.com, Inc., US ## 65 2017-08-23 14:24:15 216.137.61.154 AMAZON-02 - Amazon.com, Inc., US ## 66 2017-08-23 14:24:15 216.137.61.250 AMAZON-02 - Amazon.com, Inc., US ## 67 2017-08-23 14:24:15 216.137.61.217 AMAZON-02 - Amazon.com, Inc., US ## 68 2017-08-23 14:24:15 216.137.61.54 AMAZON-02 - Amazon.com, Inc., US ## 69 2017-08-25 19:21:58 13.32.218.245 AMAZON-02 - Amazon.com, Inc., US ## 70 2017-08-26 09:41:34 52.85.173.67 AMAZON-02 - Amazon.com, Inc., US ## 71 2017-08-26 09:41:34 52.85.173.186 AMAZON-02 - Amazon.com, Inc., US ## 72 2017-08-26 09:41:34 52.85.173.131 AMAZON-02 - Amazon.com, Inc., US ## 73 2017-08-26 09:41:34 52.85.173.18 AMAZON-02 - Amazon.com, Inc., US ## 74 2017-08-26 09:41:34 52.85.173.91 AMAZON-02 - Amazon.com, Inc., US ## 75 2017-08-26 09:41:34 52.85.173.174 AMAZON-02 - Amazon.com, Inc., US ## 76 2017-08-26 09:41:34 52.85.173.210 AMAZON-02 - Amazon.com, Inc., US ## 77 2017-08-26 09:41:34 52.85.173.88 AMAZON-02 - Amazon.com, Inc., US ## 78 2017-08-27 22:02:41 13.32.253.169 AMAZON-02 - Amazon.com, Inc., US ## 79 2017-08-27 22:02:41 13.32.253.203 AMAZON-02 - Amazon.com, Inc., US ## 80 2017-08-27 22:02:41 13.32.253.209 AMAZON-02 - Amazon.com, Inc., US ## 81 2017-08-29 13:17:37 54.230.141.201 AMAZON-02 - Amazon.com, Inc., US ## 82 2017-08-29 13:17:37 54.230.141.83 AMAZON-02 - Amazon.com, Inc., US ## 83 2017-08-29 13:17:37 54.230.141.30 AMAZON-02 - Amazon.com, Inc., US ## 84 2017-08-29 13:17:37 54.230.141.193 AMAZON-02 - Amazon.com, Inc., US ## 85 2017-08-29 13:17:37 54.230.141.152 AMAZON-02 - Amazon.com, Inc., US ## 86 2017-08-29 13:17:37 54.230.141.161 AMAZON-02 - Amazon.com, Inc., US ## 87 2017-08-29 13:17:37 54.230.141.38 AMAZON-02 - Amazon.com, Inc., US ## 88 2017-08-29 13:17:37 54.230.141.151 AMAZON-02 - Amazon.com, Inc., US

Unfortunately, I expected this. The owner keeps moving it around on AWS infrastructure.

So What?

This was an innocent link in a document on CRAN that went to a site that looked legit. A clever individual or organization found the dead domain and saw an opportunity to legitimize some fairly nasty stuff.

Now, I realize nobody is likely using “Rpad” anymore, but this type of situation can happen to any registered domain. If this individual or organization were doing more than trying to make objectionable content legit, they likely could have succeeded, especially if they enticed you with a shiny new devtools::install_…() link with promises of statistically sound animated cat emoji gif creation tools. They did an eerily good job of making this particular site still seem legit.

There’s nothing most folks can do to “fix” that site or have it removed. I’m not sure CRAN should remove the helpful PDF, but with a clickable link, it might be a good thing to suggest.

You’ll see that I used the splashr package (which has been submitted to CRAN but not there yet). It’s a good way to work with potentially malicious web content since you can “see” it and mine content from it without putting your own system at risk.

After going through this, I’ll see what I can do to put some bows on some of the devel-only packages and get them into CRAN so there’s a bit more assurance around using them.

I’m an army of one when it comes to fielding R-related security issues, but if you do come across suspicious items (like this or icky/malicious in other ways) don’t hesitate to drop me an @ or DM on Twitter.


Tidyverse practice: mapping large European cities

Tue, 08/29/2017 - 13:59

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)



As noted in several recent posts, when you’re learning R and R’s Tidyverse packages, it’s important to break everything down into small units that you can learn.

What that means is that you need to identify the most important tools and functions of the Tidyverse, and then practice them until you are fluent.

But once you have mastered the essential functions as isolated units, you need to put them together. By putting the individual pieces together, you not only solidify your knowledge of how they work individually but also begin to learn how you can combine small tools to create novel effects.

With that in mind, I want to show you another small project. Here, we’re going to use a fairly small set of functions to create a map of the largest cities in Europe.

As we do this, pay attention:

  • How many packages and functions do you really need?
  • Evaluate: how long would it really take to memorize each individual function? (Hint: it’s much, much less time than you think.)
  • Which functions have you seen before? Are some of the functions and techniques used more often than others (if you look across many different analyses)?

Ok, with those questions in mind, let’s get after it.

First we’ll just load a few packages.

#==============
# LOAD PACKAGES
#==============

library(rvest)
library(tidyverse)
library(ggmap)
library(stringr)

Next, we’re going to use the rvest package to scrape data from Wikipedia. The data that we are gathering is data about the largest cities in Europe. You can read more about the data on Wikipedia.

#===========================
# SCRAPE DATA FROM WIKIPEDIA
#===========================

html.population <- read_html('https://en.wikipedia.org/wiki/List_of_European_cities_by_population_within_city_limits')

df.euro_cities <- html.population %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table()

# inspect
df.euro_cities %>% head()
df.euro_cities %>% names()

Here at Sharp Sight, we haven’t worked with the rvest package in too many examples, so you might not be familiar with it.

Having said that, just take a close look. How many functions did we use from rvest? Could you memorize them? How long would it take?

Ok. Now we’ll do a little data cleaning.

First, we’re going to remove some of the variables using dplyr::select(). We are using the minus sign (‘-‘) in front of the names of the variables that we want to remove.

#============================
# REMOVE EXTRANEOUS VARIABLES
#============================

df.euro_cities <- select(df.euro_cities, -Date, -Image, -Location, -`Ref.`,
                         -`2011 Eurostat\npopulation[3]`)

# inspect
df.euro_cities %>% names()

After removing the variables that we don’t want, we only have four variables. These remaining raw variable names could be cleaned up a little.

Ideally, we want names that are lower case (because they are easier to type). We also want variable names that are brief and descriptive.

In this case, renaming these variables to be brief, descriptive, and lower-case is fairly straightforward. Here, we will use very simple variable names: like rank, city, country, and population.

To add these new variable names, we can simply assign them by using the colnames() function.

#===============
# RENAME COLUMNS
#===============

colnames(df.euro_cities) <- c("rank", "city", "country", "population")

# inspect
df.euro_cities %>% names()
df.euro_cities %>% head()

Now that we have clean variable names, we will do a little modification of the data itself.

When we scraped the data from Wikipedia, some extraneous characters appeared in the population variable. Essentially, there were some leading digits and special characters that appear to be useless artifacts of the scraping process. We want to remove these extraneous characters and parse the population data into a proper numeric.

To do this, we will use a few functions from the stringr package.

First, we use str_extract() to extract the population data. When we do this, we are extracting everything from the ‘♠’ character to the end of the string (note: to do this, we are using a regular expression in str_extract()).

This is a quick way to get the numbers at the end of the string, but we actually don’t want to keep the ‘♠’ character. So, after we extract the population numbers (along with the ‘♠’), we then strip off the ‘♠’ character by using str_replace().
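If you want to see what those two steps do in isolation, here is a toy illustration on a single made-up string (the raw value below is fabricated just to mimic the scraped format; the real transformation happens inside mutate() in the next block):

# Toy illustration of extract-then-strip on a fabricated raw value
library(tidyverse)   # already loaded above; provides stringr, readr and %>%

raw_value <- "0028♠1,234,567"

str_extract(raw_value, "♠.*$")      # "♠1,234,567"

str_extract(raw_value, "♠.*$") %>%
  str_replace("♠", "") %>%
  parse_number()                    # 1234567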

#========================================================================
# CLEAN UP VARIABLE: population
# - when the data are scraped, there are some extraneous characters
#   in the "population" variable.
#   ... you can see leading numbers and some other items
# - We will use stringr functions to extract the actual population data
#   (and remove the stuff we don't want)
# - We are executing this transformation inside dplyr::mutate() to
#   modify the variable inside the dataframe
#========================================================================

df.euro_cities <- df.euro_cities %>%
  mutate(population = str_extract(population, "♠.*$") %>%
                        str_replace("♠", "") %>%
                        parse_number())

df.euro_cities %>% head()

We will also do some quick data wrangling on the city names. Two of the city names on the Wikipedia page (Istanbul and Moscow) had footnotes. Because of this, those two city names had extra bracket characters when we read them in (e.g. “Istanbul[a]”).

We want to strip off those footnotes. To do this we will once again use str_replace() to strip away the information that we don’t want.

#==========================================================================
# REMOVE "notes" FROM CITY NAMES
# - two cities had extra characters for footnotes
#   ... we will remove these using stringr::str_replace and dplyr::mutate()
#==========================================================================

df.euro_cities <- df.euro_cities %>%
  mutate(city = str_replace(city, "\\[.\\]", ""))

df.euro_cities %>% head()

For the sake of making the data a little easier to explain, we’re going to filter the data to records where the population is over 1,000,000.

Keep in mind: this is a straightforward use of dplyr::filter(); this is the sort of thing that you should be able to do with your eyes closed.

#=========================
# REMOVE CITIES UNDER 1 MM
#=========================

df.euro_cities <- filter(df.euro_cities, population >= 1000000)

#=================
# COERCE TO TIBBLE
#=================

df.euro_cities <- df.euro_cities %>% as_tibble()

Before we can put the cities on a map, we need geospatial information. That is, we need to geocode these records.

To do this, we will use the geocode() function to get the longitude and latitude.

After obtaining the geo data, we will join it back to the original data using cbind().

#========================================================
# GEOCODE
# - here, we're just getting longitude and latitude data
#   using ggmap::geocode()
#========================================================

data.geo <- geocode(df.euro_cities$city)

df.euro_cities <- cbind(df.euro_cities, data.geo)

# inspect
df.euro_cities

To map the data points, we also need a map that will sit in the background, underneath the points.

We will use the function map_data() to get a world map.

#==============
# GET WORLD MAP
#==============

map.europe <- map_data("world")

Now that the data are clean, and we have a world map, we will plot the data.

#=================================
# PLOT BASIC MAP
# - this map is "just the basics"
#=================================

ggplot() +
  geom_polygon(data = map.europe, aes(x = long, y = lat, group = group)) +
  geom_point(data = df.euro_cities, aes(x = lon, y = lat, size = population),
             color = "red", alpha = .3) +
  coord_cartesian(xlim = c(-9, 45), ylim = c(32, 70))

This first plot is a “first iteration.” In this version, we haven’t done any serious formatting. It’s just a “first pass” to make sure that the data are in the right format. If we had found anything “out of line,” we would go back to an earlier part of the analysis and modify our code to correct any problems in the data.



Based on this plot, it looks like the data are essentially correct.

Now, we just want to “polish” the visualization by changing colors, fonts, sizes, etc.

#====================================================
# PLOT 'POLISHED' MAP
# - this version is formatted and cleaned up a little
#   just to make it look more aesthetically pleasing
#====================================================

#-------------
# CREATE THEME
#-------------

theme.maptheme <-
  theme(text = element_text(family = "Gill Sans", color = "#444444")) +
  theme(plot.title = element_text(size = 32)) +
  theme(plot.subtitle = element_text(size = 16)) +
  theme(panel.grid = element_blank()) +
  theme(axis.text = element_blank()) +
  theme(axis.ticks = element_blank()) +
  theme(axis.title = element_blank()) +
  theme(legend.background = element_blank()) +
  theme(legend.key = element_blank()) +
  theme(legend.title = element_text(size = 18)) +
  theme(legend.text = element_text(size = 10)) +
  theme(panel.background = element_rect(fill = "#596673"))

#------
# PLOT
#------

ggplot() +
  geom_polygon(data = map.europe, aes(x = long, y = lat, group = group),
               fill = "#DEDEDE", colour = "#818181", size = .15) +
  geom_point(data = df.euro_cities, aes(x = lon, y = lat, size = population),
             color = "red", alpha = .3) +
  geom_point(data = df.euro_cities, aes(x = lon, y = lat, size = population),
             color = "red", shape = 1) +
  coord_cartesian(xlim = c(-9, 45), ylim = c(32, 70)) +
  labs(title = "European Cities with Large Populations",
       subtitle = "Cities with over 1MM population, within city limits") +
  scale_size_continuous(range = c(.7, 15),
                        breaks = c(1100000, 4000000, 8000000, 12000000),
                        name = "Population",
                        labels = scales::comma_format()) +
  theme.maptheme



Not too bad.

Keep in mind that as a reader, you get to see the finished product: the finalized visualization and the finalized code.

But as always, the process for creating a visualization like this is highly iterative. If you work on a similar project, expect to change your code dozens of times. You’ll change your data-wrangling code as you work with the data and identify new items you need to change or fix. You’ll also change your ggplot() visualization code multiple times as you try different colors, fonts, and settings.

If you master the basics, the hard things never seem hard

Creating this visualization is actually not terribly hard to do, but if you’re somewhat new to R, it might seem rather challenging.

If you look at this and it seems difficult, then you need to understand: once you master the basics, the hard things never seem hard.

What I mean by that is that this visualization is nothing more than a careful application of a few dozen simple tools, arranged in a way that creates something new.

Once you master individual tools from ggplot2, dplyr, and the rest of the Tidyverse, projects like this become very easy to execute.

Sign up now, and discover how to rapidly master data science

To rapidly master data science, you need to master the essential tools.

You need to know what tools are important, which tools are not important, and how to practice.

Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible.

Sign up now for our email list, and you’ll receive regular tutorials and lessons.

You’ll learn:

  • What data science tools you should learn (and what not to learn)
  • How to practice those tools
  • How to put those tools together to execute analyses and machine learning projects
  • … and more

If you sign up for our email list right now, you’ll also get access to our “Data Science Crash Course” for free.

SIGN UP NOW

The post Tidyverse practice: mapping large European cities appeared first on SHARP SIGHT LABS.


rtimicropem: Using an *R* package as platform for harmonized cleaning of data from RTI MicroPEM air quality sensors

Tue, 08/29/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

As you might remember from my blog post about ropenaq, I work as a data manager and statistician for an epidemiology project called CHAI, for Cardio-vascular health effects of air pollution in Telangana, India. One of our interests in CHAI is determining exposure, and sources of exposure, to PM2.5, very small particles in the air that have diverse adverse health effects. You can find more details about CHAI in our recently published protocol paper. In this blog post, which partly corresponds to the content of my useR! 2017 lightning talk, I'll present a package we wrote for dealing with the output of a scientific device, which might remind you of similar issues in your experimental work.

Why write the rtimicropem package?

Part of the CHAI project is a panel study involving about 40 people wearing several devices, as you see above. The devices include a GPS, an accelerometer, a wearable camera, and a PM2.5 monitor outputting time-resolved data (the grey box on the left). Basically, with this device, the RTI MicroPEM, we get one PM2.5 exposure value every 10 seconds. This is quite exciting, right? Except that we have two main issues with it…

First of all, the output of the device, a file with a ".csv" extension corresponding to a session of measurements (in our case, 24 hours of measurements), is not really a csv: the header contains information about the settings of the device for that session, and only then comes the actual table with measurements.

Second, since the RTI MicroPEMs are nice devices but also a work-in-progress, we had some problems with the data, such as negative relative humidity. Because of these issues, we decided to write an R package whose three goals were to:

  • Transform the output of the device into something more usable.

  • Allow the exploration of individual files after a day in the field.

  • Document our data cleaning process.

We chose R because everything else in our project (data processing, documentation, and analysis) was to be implemented in R, and because we wanted other teams to be able to use our package.

Features of rtimicropem: transform, explore and learn about data cleaning

First things first, our package lives here and is on CRAN. It has a nice documentation website thanks to pkgdown.

Transform and explore single files

In rtimicropem, after using the convert_output function, one gets an object of the micropem R6 class. Its fields include the settings and measurements as two data.frames, and it has methods such as summary and plot, for which you see the static output below (no unit on this exploratory plot).

The plot method can also output an interactive graph thanks to rbokeh.
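For readers who want to see what that looks like in code, here is a minimal sketch of the single-file workflow; the file path is made up and the exact argument name of convert_output() is an assumption, so check the package documentation.

# Minimal single-file sketch; the path and argument form are assumptions
library(rtimicropem)

chai_file <- convert_output("data/participant_42_session.csv")

chai_file$summary()   # overview table of the measurements
chai_file$plot()      # static exploratory plot of the PM2.5 time series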

While these methods can be quite helpful for exploring single files as an R user, they don't help non-R users much. Because we wanted members of our team working in the field to be able to explore and check files with no R knowledge, we created a Shiny app that lets users upload individual files and then look at different tabs, including one with a plot, one with the summary of measurements, etc. This way, it was easy to spot a device failure, for instance, and to plan a new measurement session with the corresponding participant.
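The app itself is specific to CHAI, but the general upload-and-inspect pattern looks roughly like the sketch below; this is not the team's actual app, and the file parsing is left as a placeholder.

# Rough sketch of an upload-and-inspect Shiny app (not the CHAI app itself)
library(shiny)

ui <- fluidPage(
  fileInput("micropem_file", "Upload a MicroPEM output file"),
  tabsetPanel(
    tabPanel("Plot", plotOutput("exposure_plot")),
    tabPanel("Summary", verbatimTextOutput("exposure_summary"))
  )
)

server <- function(input, output) {
  measurements <- reactive({
    req(input$micropem_file)
    # placeholder: in practice this would use the package's conversion function
    read.csv(input$micropem_file$datapath)
  })
  output$exposure_plot <- renderPlot({ plot(measurements()) })
  output$exposure_summary <- renderPrint({ summary(measurements()) })
}

shinyApp(ui, server)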

Transform a bunch of files

At the end of the CHAI data collection, we had more than 250 MicroPEM files. In order to prepare them for further processing, we wrote the batch_convert function, which saves the content of any number of MicroPEM files as two (real!) csv files: one with the measurements, one with the settings.
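If you were to roll that batch step yourself, it would look roughly like the purrr sketch below; batch_convert() packages this (plus the CSV writing) for you, and the paths and the measurement field name used here are assumptions rather than documented names.

# Hand-rolled batch sketch; paths and the `measures` field name are assumptions
library(rtimicropem)
library(tidyverse)

files <- list.files("data/micropem/", pattern = "\\.csv$", full.names = TRUE)

all_measurements <- files %>%
  map(convert_output) %>%        # one micropem object per file
  map(~ .x$measures) %>%         # pull out the measurement data.frame
  bind_rows(.id = "file_index")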

Learn about data cleaning

As mentioned previously, we experienced issues with MicroPEM data quality. Although we had heard other teams complain of similar problems, there were very few details about data cleaning in the literature. We decided to gather information from other teams and the manufacturer and to document our own decisions, e.g. removing entire files based on some criteria, in a vignette of the package. This is our transparent answer to the question “What was your experience with MicroPEMs?”, which we get often enough from other scientists interested in PM2.5 exposure.

Place of rtimicropem in the R package ecosystem

When preparing the rtimicropem submission to rOpenSci, I started wondering whether one would like to have one R package for each scientific device out there. In our case, having the weird output to deal with, and the lack of a central place to document data issues, were enough of a motivation. But maybe one could hope that manufacturers of scientific devices would focus a bit more on making the output format analysis-friendly, and that the open documentation of data issues would be language-agnostic and managed by the manufacturers themselves. In the meantime, we're quite proud to have taken the time to create and share our experience with rtimicropem, and we have already heard back from a few users, including one who found the package by googling "RTI MicroPEM data"! Another argument I personally have for writing R packages to deal with scientific data is that it might motivate people to learn R, but this is maybe a bit evil.

What about the place of rtimicropem in the rOpenSci package collection? After very useful reviews by Lucy D'Agostino McGowan and Kara Woo, our package got onboarded, which we were really thankful for and happy about. Another package I can think of off the top of my head that deals with the output of a scientific tool is plater. Let me switch roles from CHAI team member to rOpenSci onboarding co-editor here and do some advertisement… Such packages are unlikely to become the new ggplot2, but their specialization doesn't make them less useful, and they fit very well in the "data extraction" onboarding category. So if you have written such a package, please consider submitting it! It'll get better thanks to review and might get more publicity as part of a larger software ecosystem. For the rtimicropem submission we took advantage of the joint submission process of rOpenSci and the Journal of Open Source Software, JOSS, so now our piece of software has its own JOSS paper with a DOI. And hopefully, having more submissions of packages for scientific hardware might inspire R users to package up the code they wrote to use the output of their scientific tools!


RcppSMC 0.2.0

Tue, 08/29/2017 - 04:38

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version 0.2.0 of the RcppSMC package arrived on CRAN earlier today (as a very quick pretest-publish within minutes of submission).

RcppSMC provides Rcpp-based bindings to R for the Sequential Monte Carlo Template Classes (SMCTC) by Adam Johansen described in his JSS article.

This release 0.2.0 is chiefly the work of Leah South, a Ph.D. student at Queensland University of Technology, who was a Google Summer of Code student during the last few months, mentored by Adam and myself. It was a pleasure to work with Leah on this and to see her progress. Our congratulations to Leah for a job well done!

Changes in RcppSMC version 0.2.0 (2017-08-28)
  • Also use .registration=TRUE in useDynLib in NAMESPACE

  • Multiple Sequential Monte Carlo extensions (Leah South as part of Google Summer of Code 2017)

    • Switching to population level objects (#2 and #3).

    • Using Rcpp attributes (#2).

    • Using automatic RNGscope (#4 and #5).

    • Adding multiple normalising constant estimators (#7).

    • Static Bayesian model example: linear regression (#10 addressing #9).

    • Adding a PMMH example (#13 addressing #11).

    • Framework for additional algorithm parameters and adaptation (#19 addressing #16; also #24 addressing #23).

    • Common adaptation methods for static Bayesian models (#20 addressing #17).

    • Supporting MCMC repeated runs (#21).

    • Adding adaptation to linear regression example (#22 addressing #18).

Courtesy of CRANberries, there is a diffstat report for this release.

More information is on the RcppSMC page. Issues and bug reports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


RStudio::Conf 2018

Tue, 08/29/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

It’s not even Labor Day, so it seems a bit early to start planning for next year’s R conferences. But early-bird pricing for RStudio::Conf 2018 ends this Thursday.

The conference, which will be held in San Diego between January 31st and February 3rd, promises to match and even surpass this year’s event. In addition to keynotes from Di Cook (Monash University and Iowa State University), J.J. Allaire (RStudio Founder, CEO & Principal Developer), Shiny creator Joe Cheng, and Chief Scientist Hadley Wickham, a number of knowledgeable (and entertaining) speakers have already committed, including quant, long-time R user and Twitter humorist JD Long (@CMastication), Stack Overflow’s David Robinson (@drob), and ProPublica editor Olga Pierce (@olgapierce).

Making the deadline for early-bird pricing will get you a significant savings. The “full seat” price for the conference is $495 before midnight EST August 31, 2017 and $695 thereafter. You can register here.

For a good idea of the kinds of things you can expect to learn at RStudio::Conf 2018 have a look at the videos from this year’s event.


Shiny Dev Center gets a shiny new update

Tue, 08/29/2017 - 02:00

I am excited to announce the redesign and reorganization of shiny.rstudio.com, also known as the Shiny Dev Center. The Shiny Dev Center is the place to go to learn about all things Shiny and to keep up to date with it as it evolves.

The goal of this refresh is to provide a clear learning path for those who are just starting off with developing Shiny apps, as well as to make advanced Shiny topics easily accessible to those building large and complex apps. The articles overview, which we designed to help navigate the wealth of information on the Shiny Dev Center, aims to achieve this goal.

Other highlights of the refresh include:

  • A brand new look!
  • New articles
  • Updated articles with modern Shiny code examples
  • Explicit linking, where relevant, to other RStudio resources like webinars, support docs, etc.
  • A prominent link to our ever growing Shiny User Showcase
  • A guide for contributing to Shiny (inspired by the Tidyverse contribute guide)

Stay tuned for more updates to the Shiny Dev Center in the near future!


6 R Jobs for R users (2017-08-28) – from all over the world

Tue, 08/29/2017 - 00:23
To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs

 

More New Jobs
  1. Freelance
    Optimization Expert IdeaConnection – Posted by Sherri Ann O’Gorman
    Anywhere
    22 Aug 2017
  2. Full-Time
    Data journalist for The Economist @London The Economist – Posted by cooberp
    London England, United Kingdom
    18 Aug 2017
  3. Full-Time
    Technical Business Analyst Investec Asset Management – Posted by IAM
    London England, United Kingdom
    14 Aug 2017
  4. Full-Time
    Senior Data Scientist @ Dallas, Texas, U.S. Colaberry Data Analytics – Posted by Colaberry_DataAnalytics
    Dallas Texas, United States
    8 Aug 2017
  5. Full-Time
    Financial Analyst/Modeler @ Mesa, Arizona, U.S. MD Helicopters – Posted by swhalen
    Mesa Arizona, United States
    31 Jul 2017
  6. Full-Time
    Research volunteer in Cardiac Surgery @ Philadelphia, Pennsylvania, U.S. Thomas Jefferson University – Posted by CVSurgery
    Philadelphia Pennsylvania, United States
    31 Jul 2017

 

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts).


Le Monde puzzle [#1018]

Tue, 08/29/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

An arithmetic Le Monde mathematical puzzle (which at first did not seem to involve R programming, because of the large number of digits in the quantity involved):

An integer x with less than 100 digits is such that adding the digit 1 on both sides of x produces the integer 99x.  What are the last nine digits of x? And what are the possible numbers of digits of x?

The integer x satisfies the identity

10^(ω+1) + 10x + 1 = 99x,

where ω is the number of digits of x. This amounts to

10^(ω+1) + 1 = 89x, that is, 10….01 = 89 x,

where there are ω zeros. Working with long integers in R could bring an immediate solution, but I went for a pedestrian version, handling each digit at a time and starting from the final one, which is necessarily 9:

# multiply by 9
rap=0; row=NULL
for (i in length(x):1){
  prud=rap+x[i]*9
  row=c(prud%%10,row)
  rap=prud%/%10}
row=c(rap,row)

# multiply by 80
rep=raw=0
for (i in length(x):1){
  prud=rep+x[i]*8
  raw=c(prud%%10,raw)
  rep=prud%/%10}

# find next digit
y=(row[1]+raw[1]+(length(x)>1))%%10

returning

7 9 7 7 5 2 8 0 9

as the (only) last digits of x. The same code can be exploited to check that the complete multiplication produces a number of the form 10….01, hence to deduce that the length of x is either 21 or 65, with solutions

 [1] 1 1 2 3 5 9 5 5 0 5 6 1 7 9 7 7 5 2 8 0 9

 [1] 1 1 2 3 5 9 5 5 0 5 6 1 7 9 7 7 5 2 8 0 8 9 8 8 7 6 4 0 4 4 9 4 3 8 2 0 2 2
[39] 4 7 1 9 1 0 1 1 2 3 5 9 5 5 0 5 6 1 7 9 7 7 5 2 8 0 9

The maths question behind this is to figure out the powers k of 10 such that

10^k ≡ 88 ≡ -1 mod (89),

with ω = k - 1.
For instance, 10²≡11 mod (89) and 11¹¹≡88 mod (89) leads to the first solution ω=21. And then, since 10⁴⁴≡1 mod (89), ω=21+44=65 is another solution…
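As a quick sanity check of those congruences, a few lines of base R modular exponentiation reproduce them (pow_mod is a small helper written for this note, not part of the original code):

# square-and-multiply modular exponentiation, base R only
pow_mod <- function(base, exp, m) {
  r <- 1
  b <- base %% m
  while (exp > 0) {
    if (exp %% 2 == 1) r <- (r * b) %% m
    b <- (b * b) %% m
    exp <- exp %/% 2
  }
  r
}

pow_mod(10, 22, 89)   # 88, i.e. -1 mod 89, giving the solution ω = 21
pow_mod(10, 44, 89)   # 1, so adding 44 to the exponent changes nothing
pow_mod(10, 66, 89)   # 88 again, giving the second solution ω = 65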

Filed under: Books, Kids, R Tagged: arithmetics, competition, Le Monde, long division, mathematical puzzle, R

