R-bloggers: R news and tutorials contributed by hundreds of R bloggers

Tutorial: Launch a Spark and R cluster with HDInsight

Fri, 09/22/2017 - 17:52

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you'd like to get started using R with Spark, you'll need to set up a Spark cluster and install R and all the other necessary software on the nodes. A really easy way to achieve that is to launch an HDInsight cluster on Azure, which is just a managed Spark cluster with some useful extra components. You just need to configure the components you need: in our case, R with Microsoft R Server, and RStudio Server.

This tutorial explains how to launch an HDInsight cluster for use with R. It explains how to size and launch the cluster, connect to it via SSH, install Microsoft R Server (with R) on each of the nodes, and install RStudio Server community edition to use as an IDE on the edge node. (If you find you need a larger or smaller cluster after you've set it up, it's easy to resize the cluster dynamically.) Once you have the cluster up and running, there are a number of things you can try.
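
For example, once the cluster is running and RStudio Server is available on the edge node, a first connection from R to Spark might look like the sketch below. This is not part of the linked tutorial (which focuses on Microsoft R Server); it assumes the sparklyr package is installed and serves only as a smoke test:

library(sparklyr)
library(dplyr)

# On HDInsight the Spark master is typically YARN; adjust if your setup differs.
sc <- spark_connect(master = "yarn-client")

# Copy a small built-in data set to Spark and run a simple aggregation.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

spark_disconnect(sc)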

To get started with R and Spark, follow the instructions for setting up an HDInsight cluster at the link below.

Microsoft Azure Documentation: Get started using R Server on HDInsight



Multi-Dimensional Reduction and Visualisation with t-SNE

Fri, 09/22/2017 - 14:54

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

t-SNE is a very powerful technique that can be used for visualising multi-dimensional data and looking for patterns in it. Great things have been said about this technique. In this blog post I did a few experiments with t-SNE in R to learn about this technique and its uses.

Its power to visualise complex multi-dimensional data is apparent, as well as its ability to cluster data in an unsupervised way.

What’s more, it is also quite clear that t-SNE can aid machine learning algorithms when it comes to prediction and classification. But the inclusion of t-SNE in machine learning algorithms and ensembles has to be ‘crafted’ carefully, since t-SNE was not originally intended for this purpose.

All in all, t-SNE is a powerful technique that merits due attention.

t-SNE

Let’s start with a brief description. t-SNE stands for t-Distributed Stochastic Neighbor Embedding, and its main aim is dimensionality reduction: given some complex dataset with many dimensions, t-SNE projects this data into a 2D (or 3D) representation while preserving the ‘structure’ (patterns) in the original dataset. Visualising high-dimensional data in this way allows us to look for patterns in the dataset.

t-SNE has become so popular because:

  • it was demonstrated to be useful in many situations,
  • it’s incredibly flexible, and
  • it can often find structure where other dimensionality-reduction algorithms cannot.

A good introduction on t-SNE can be found here. The original paper on t-SNE and some visualisations can be found at this site. In particular, I like this site which shows how t-SNE is used for unsupervised image clustering.

While t-SNE itself is computationally heavy, a faster version exists that uses what is known as the Barnes-Hut approximation. This faster version allows t-SNE to be applied on large real-world datasets.

t-SNE for Exploratory Data Analysis

Because t-SNE is able to provide a 2D or 3D visual representation of high-dimensional data that preserves the original structure, we can use it during initial data exploration. We can use it to check for the presence of clusters in the data and as a visual check to see if there is some ‘order’ or some ‘pattern’ in the dataset. It can aid our intuition about what we think we know about the domain we are working in.

Apart from the initial exploratory stage, visualising information via t-SNE (or any other algorithm) is vital throughout the entire analysis process – from those first investigations of the raw data, during data preparation, as well as when interpreting the outputs of machine learning algorithms and presenting insights to others. We will see further on that we can use t-SNE even during the prediction/classification stage itself.

Experiments on the Optdigits dataset

In this post, I will apply t-SNE to a well-known dataset, called optdigits, for visualisation purposes.

The optdigits dataset has 64 dimensions. Can t-SNE reduce these 64 dimensions to just 2 dimensions while preserving structure in the process? And will this structure (if present) allow handwritten digits to be correctly clustered together? Let’s find out.

tsne package

We will use the tsne package, which provides an exact implementation of t-SNE (not the Barnes-Hut approximation), to reduce the dimensionality of the optdigits data to 2 dimensions. The final output of t-SNE will essentially be an array of 2D coordinates, one per row (image), and we can then plot these coordinates to get the final 2D map (visualisation). The algorithm runs in iterations (called epochs) until the system converges. Every \(K\) iterations, and upon convergence, t-SNE can call a user-supplied callback function, passing it the current list of 2D coordinates. In our callback function, we plot the 2D points (one per image) labelled with the corresponding class labels, and colour-code everything by class label.

traindata <- read.table("optdigits.tra", sep=",")
trn <- data.matrix(traindata)

require(tsne)

cols <- rainbow(10)

# This is the epoch callback function used by tsne.
# x is an NxK table where N is the number of data rows passed to tsne,
# and K is the dimension of the map. Here, K is 2, since we use tsne
# to map the rows to a 2D representation (map).
ecb = function(x, y){
  plot(x, t='n')
  text(x, labels=trn[,65], col=cols[trn[,65] + 1])
}

tsne_res = tsne(trn[,1:64], epoch_callback = ecb, perplexity=50, epoch=50)

The images below show how the clustering improves as more epochs pass.

As one can see from the above diagrams (especially the last one, for epoch 1000), t-SNE does a very good job in clustering the handwritten digits correctly.

But the algorithm takes some time to run. Let’s try out the more efficient Barnes-Hut version of t-SNE, and see if we get equally good results.

Rtsne package

The Rtsne package can be used as shown below. The perplexity parameter is crucial for t-SNE to work correctly – this parameter determines how the local and global aspects of the data are balanced. A more detailed explanation of this parameter and other aspects of t-SNE can be found in this article, but a perplexity value between 30 and 50 is recommended.

traindata <- read.table("optdigits.tra", sep=",")
trn <- data.matrix(traindata)

require(Rtsne)

# Perform dimensionality reduction from 64D to 2D.
tsne <- Rtsne(as.matrix(trn[,1:64]), check_duplicates = FALSE, pca = FALSE,
              perplexity=30, theta=0.5, dims=2)

# Display the results of t-SNE.
cols <- rainbow(10)
plot(tsne$Y, t='n')
text(tsne$Y, labels=trn[,65], col=cols[trn[,65] + 1])

Gives this plot:

Note how clearly-defined and distinctly separable the clusters of handwritten digits are. We have only a minimal amount of incorrect entries in the 2D map.

Of Manifolds and the Manifold Assumption

How can t-SNE achieve such a good result? How can it ‘drop’ 64-dimensional data into just 2 dimensions and still preserve enough (or all) structure to allow the classes to be separated?

The reason has to do with the mathematical concept of manifolds. A manifold is a \(d\)-dimensional surface that lives in a \(D\)-dimensional space, where \(d < D\). For the 3D case, imagine a 2D piece of paper that is embedded within 3D space. Even if the piece of paper is crumpled up extensively, it can still be ‘unwrapped’ (uncrumpled) back into the flat 2D plane that it really is. This 2D piece of paper is a manifold in 3D space. Or think of an entangled string in 3D space – this is a 1D manifold in 3D space.

Now there’s what is known as the Manifold Hypothesis, or the Manifold Assumption, which states that natural data (like images, etc.) forms lower-dimensional manifolds in its embedding space. If this assumption holds (there is theoretical and experimental evidence for this hypothesis), then t-SNE should be able to find this lower-dimensional manifold, ‘unwrap it’, and present it to us as a lower-dimensional map of the original data.

t-SNE vs. SOM

t-SNE is actually a tool that does something similar to Self-Organising Maps (SOMs), though the underlying process is quite different. We have used a SOM on the optdigits dataset in a previous blog and obtained the unsupervised clustering shown below.

The t-SNE map is clearer than the one obtained via a SOM, and the clusters are much better separated.

t-SNE for Shelter Animals dataset

In a previous blog, I applied machine learning algorithms for predicting the outcome of shelter animals. Let’s now apply t-SNE to the dataset – I am using the cleaned and modified data as described in this blog entry.

trn <- read.csv('train.modified.csv.zip')
trn <- data.matrix(trn)

require(Rtsne)

# Scale the data first prior to running t-SNE.
trn[,-1] <- scale(trn[,-1])

tsne <- Rtsne(trn[,-1], check_duplicates = FALSE, pca = FALSE,
              perplexity=50, theta=0.5, dims=2)

# Display the results of t-SNE.
cols <- rainbow(5)
plot(tsne$Y, t='n')
text(tsne$Y, labels=as.numeric(trn[,1]), col=cols[trn[,1]])

Gives this plot:

Here, while it is evident that there is some structure and patterns to the data, clustering by OutcomeType has not happened.

Let’s now use t-SNE to perform dimensionality reduction to 3D on the same dataset. We just need to set the dims parameter to 3 instead of 2. And we will use package rgl for plotting the 3D map produced by t-SNE.

# Reduce to 3 dimensions this time (dims = 3).
tsne <- Rtsne(trn[,-1], check_duplicates = FALSE, pca = FALSE,
              perplexity=50, theta=0.5, dims=3)

# Display the results of t-SNE in a 3D plot.
require(rgl)
plot3d(tsne$Y, col=cols[trn[,1]])
legend3d("topright", legend = 1:5, pch = 16, col = cols)  # one entry per outcome class

Gives this plot:

This 3D map has a richer structure than the 2D version, but again the resulting clustering is not done by OutcomeType.

One possible reason could be that the Manifold Assumption is failing for this dataset when trying to reduce to 2D and 3D. Please note that the Manifold Assumption cannot always be true, for a simple reason. If it were, we could take an arbitrary set of \(N\)-dimensional points and conclude that they lie on some \(M\)-dimensional manifold (where \(M < N\)). We could then take those \(M\)-dimensional points and conclude that they lie on some other lower-dimensional manifold (with, say, \(K\) dimensions, where \(K < M\)). We could repeat this process indefinitely and eventually conclude that our arbitrary \(N\)-dimensional dataset lies on a 1D manifold. Because this cannot be true for all datasets, the manifold assumption cannot always be true. In general, if you have \(N\)-dimensional points being generated by a process with \(N\) degrees of freedom, your data will not lie on a lower-dimensional manifold. So the Animal Shelter dataset probably has more than three degrees of freedom.

So we can conclude that t-SNE did not aid the initial exploratory analysis much for this dataset.

t-SNE and machine learning?

One disadvantage of t-SNE is that there is currently no incremental version of this algorithm. In other words, it is not possible to run t-SNE on a dataset, then gather a few more samples (rows), and “update” the t-SNE output with the new samples. You would need to re-run t-SNE from scratch on the full dataset (previous dataset + new samples). Thus t-SNE works only in batch mode.

This disadvantage appears to make it difficult for t-SNE to be used in a machine learning system. But as we will see in a future post, it is still possible to use t-SNE (with care) in a machine learning solution. And the use of t-SNE can improve classification results, sometimes markedly. The limitation is the extra non-realtime processing brought about by t-SNE’s batch-mode nature.

Stay tuned.

    Related Post

    1. Comparing Trump and Clinton’s Facebook pages during the US presidential election, 2016
    2. Analyzing Obesity across USA
    3. Can we predict flu deaths with Machine Learning and R?
    4. Graphical Presentation of Missing Data; VIM Package
    5. Creating Graphs with Python and GooPyCharts


    My advice on dplyr::mutate()

    Fri, 09/22/2017 - 14:22

    There are substantial differences between ad-hoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.

    Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.


    “Character is what you are in the dark.”

    John Whorfin quoting Dwight L. Moody.

    I have found: to deliver production worthy data science and predictive analytic systems, one has to develop per-team and per-project field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.

    What I want to do is share a single small piece of Win-Vector LLC’s current guidance on using the R package dplyr.

    • Disclaimer: Win-Vector LLC has no official standing with RStudio, or dplyr development.
    • However:

      “One need not have been Caesar in order to understand Caesar.”

      Alternately: Georg Simmel or Max Weber.

      Win-Vector LLC, as a consultancy, has experience helping large companies deploy enterprise big data solutions involving R, dplyr, sparklyr, and Apache Spark. Win-Vector LLC, as a training organization, has experience in how new users perceive, reason about, and internalize how to use R and dplyr. Our group knows how to help deploy production grade systems, and how to help new users master these systems.

    From experience we have distilled a lot of best practices. And below we will share one.

    From: “R for Data Science; Wickham, Grolemund; O’Reilly, 2017” we have:

    Note that you can refer to columns that you’ve just created:

    mutate(flights_sml,
           gain = arr_delay - dep_delay,
           hours = air_time / 60,
           gain_per_hour = gain / hours
    )

    Let’s try that with database backed data:

    suppressPackageStartupMessages(library("dplyr"))

    packageVersion("dplyr")
    # [1] ‘0.7.3’

    db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
    flights <- copy_to(db, nycflights13::flights, 'flights')

    mutate(flights,
           gain = arr_delay - dep_delay,
           hours = air_time / 60,
           gain_per_hour = gain / hours
    )
    # # Source:   lazy query [?? x 22]
    # # Database: sqlite 3.19.3 [:memory:]
    #    year month   day dep_time sched_dep_time ...
    #   ...
    # 1  2013     1     1      517            515 ...
    # ...

    That worked. One of the selling points of dplyr is that much of it is source-generic or source-agnostic, meaning the same code can be run against different data providers (in-memory, databases, Spark).

    However, if a new user tries to extend such an example (say adding gain_per_minute) they run into this:

    mutate(flights,
           gain = arr_delay - dep_delay,
           hours = air_time / 60,
           gain_per_hour = gain / hours,
           gain_per_minute = 60 * gain_per_hour
    )
    # Error in rsqlite_send_query(conn@ptr, statement) :
    #   no such column: gain_per_hour

    It is hard for experts to understand how frustrating the above is to a new R user or to a part time R user. It feels like any variation on the original code causes it to fail. None of the rules they have been taught anticipate this, or tell them how to get out of this situation.

    This quickly leads to strong feelings of learned helplessness and anxiety.

    Our rule for dplyr::mutate() has been for some time:

    Each column name used in a single mutate must appear only on the left-hand-side of a single assignment, or otherwise on the right-hand-side of any number of assignments (but never both sides, even if it is different assignments).

    Under this rule neither of the above mutate() calls is allowed. The second should be written as (switching to pipe notation):

    flights %>%
      mutate(gain = arr_delay - dep_delay,
             hours = air_time / 60) %>%
      mutate(gain_per_hour = gain / hours) %>%
      mutate(gain_per_minute = 60 * gain_per_hour)

    And the above works.

    If we teach this rule we can train users to be properly cautious, and hopefully avoid them becoming frustrated, scared, anxious, or angry.

    dplyr documentation (such as “help(mutate)”) does not strongly commit to what order mutate expressions are executed in, or to the visibility and durability of intermediate results (i.e., a full description of intended semantics). Our rule intentionally limits the user to a set of circumstances where none of those questions matter.

    Now the error we saw above is a mere bug that one expects will be fixed some day (in fact it is dplyr issue 3095; we looked a bit at the generated queries here). It can be a bit unfair to criticize a package for having a bug.

    However, confusion around re-use of column names has been driving dplyr issues for quite some time.

    It makes sense to work in a reliable and teachable sub-dialect of dplyr that will serve users well (or barring that, you can use an adapter, such as seplyr). In production you must code to what systems are historically reliably capable of, not just the specification.


    Mining USPTO full text patent data – Analysis of machine learning and AI related patents granted in 2017 so far – Part 1

    Fri, 09/22/2017 - 07:00

    (This article was first published on Alexej's blog, and kindly contributed to R-bloggers)

    The United States Patent and Trademark office (USPTO) provides immense amounts of data (the data I used are in the form of XML files). After coming across these datasets, I thought that it would be a good idea to explore where and how my areas of interest fall into the intellectual property space; my areas of interest being machine learning (ML), data science, and artificial intelligence (AI).

    I started this exploration by downloading the full text data (excluding images) for all patents that were assigned by the USPTO within the year 2017 up to the time of writing (Patent Grant Full Text Data/XML for the year 2017 through the week of Sept 12 from the USPTO Bulk Data Storage System).

    In this blog post – “Part 1” – I address questions such as: How many ML and AI related patents were granted? Who are the most prolific inventors? The most frequent patent assignees? Where are inventions made? And when? Is the number of ML and AI related patents increasing over time? How long does it take to obtain a patent for a ML or AI related invention? Is the patent examination time shorter for big tech companies? etc.

    “Part 2” will be an in-depth analysis of the language used in patent titles, descriptions, and claims, and “Part 3” will be on experimentation with deep neural nets applied to the patents’ title, description, and claims text.

    Identifying patents related to machine learning and AI

    First, I curated a patent full text dataset consisting of “machine learning and AI related” patents.
    I am not just looking for instances where actual machine learning or AI algorithms were patented; I am looking for inventions which are related to ML or AI in any/some capacity. That is, I am interested in patents where machine learning, data mining, predictive modeling, or AI is utilized as a part of the invention in any way whatsoever. The subset of relevant patents was determined by a keyword search as specified by the following definition.

    Definition: For the purposes of this blog post, a machine learning or AI related patent is a patent that contains at least one of the keywords
    “machine learning”, “deep learning”, “neural network”, “artificial intelligence”, “statistical learning”, “data mining”, or “predictive model”
    in its invention title, description, or claims text (while of course accounting for capitalization, pluralization, etc.).1
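
    As a rough sketch of how such a keyword filter could be implemented (this is not the author's actual code; the data frame patents and its columns title, description, and claims are hypothetical names for the parsed XML fields):

    keywords <- c("machine learning", "deep learning", "neural network",
                  "artificial intelligence", "statistical learning",
                  "data mining", "predictive model")
    pattern <- paste(keywords, collapse = "|")

    # Combine the three text fields and keep patents matching any keyword,
    # ignoring capitalization; pluralization would need a slightly looser regex.
    full_text <- paste(patents$title, patents$description, patents$claims)
    ml_ai_patents <- patents[grepl(pattern, full_text, ignore.case = TRUE), ]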

    With this keyword matching approach a total of 6665 patents have been selected. The bar graph below shows how many times each keyword got matched.

    Interestingly the term “neural network” is even more common than the more general terms “machine learning” and “artificial intelligence”.

    Some example patents

    Here are three (randomly chosen) patents from the resulting dataset. For each, the invention title, the patent assignee, and one instance of the keyword match within the patent text are printed.

    ## $`2867`
    ## [1] "Fuselage manufacturing system"
    ## [2] "THE BOEING COMPANY"
    ## [3] "... using various techniques. For example, at least
    ##      one of an artificial intelligence program, a
    ##      knowledgebase, an expert ..."
    ##
    ## $`1618`
    ## [1] "Systems and methods for validating wind farm
    ##      performance measurements"
    ## [2] "General Electric Company"
    ## [3] "... present disclosure leverages and fuses
    ##      accurate available sensor data using machine
    ##      learning algorithms. That is, the more ..."
    ##
    ## $`5441`
    ## [1] "Trigger repeat order notifications"
    ## [2] "Accenture Global Services Limited"
    ## [3] "... including user data obtained from a user
    ##      device; obtaining a predictive model that
    ##      estimates a likelihood of ..."

    And here are three examples of (randomly picked) patents that contain the relevant keywords directly in their invention title.

    ## $`5742`
    ## [1] "Adaptive demodulation method and apparatus using an
    ##      artificial neural network to improve data recovery
    ##      in high speed channels"
    ## [2] "Kelquan Holdings Ltd."
    ## [3] "... THE INVENTION\nh-0002\n1 The present invention
    ##      relates to a neural network based integrated
    ##      demodulator that mitigates ..."
    ##
    ## $`3488`
    ## [1] "Cluster-trained machine learning for image processing"
    ## [2] "Amazon Technologies, Inc."
    ## [3] "... BACKGROUND\nh-0001\n1 Artificial neural networks,
    ##      especially deep neural network ..."
    ##
    ## $`3103`
    ## [1] "Methods and apparatus for machine learning based
    ##      malware detection"
    ## [2] "Invincea, Inc."
    ## [3] "... a need exists for methods and apparatus that can
    ##      use machine learning techniques to reduce the amount ..."

    Who holds these patents (inventors and assignees)?

    The first question I would like to address is who files most of the machine learning and AI related patents.

    Each patent specifies one or several inventors, i.e., the individuals who made the patented invention, and a patent assignee which is typically the inventors’ employer company that holds the rights to the patent. The following bar graph visualizes the top 20 most prolific inventors and the top 20 most frequent patent assignees among the analyzed ML and AI related patents.

    It isn’t surprising to see this list of companies. The likes of IBM, Google, Amazon, Microsoft, Samsung, and AT&T rule the machine learning and AI patent space. I have to admit that I don’t recognize any of the inventors’ names (but it might just be me not being familiar enough with the ML and AI community).

    There are a number of interesting follow-up questions which for now I leave unanswered (hard to answer without additional data):

    • What is the count of ML and AI related patents by industry or type of company (e.g., big tech companies vs. startups vs. research universities vs. government)?
    • Who is deriving the most financial benefit by holding ML or AI related patents (either through licensing or by driving out the competition)?
    Where do these inventions come from (geographically)?

    Even though the examined patents were filed in the US, some of the inventions may have been made outside of the US.
    In fact, the data includes specific geographic locations for each patent, so I can map in which cities within the US and the world inventors are most active.
    The following figure is based on where the inventors are from, and shows the most active spots. Each point corresponds to the total number of inventions made at that location (though note that the color axis is a log10 scale, and so is the point size).

    The results aren’t that surprising.
    However, we see that most (ML and AI related) inventions patented with the USPTO were made in the US. I wonder if inventors in other countries prefer to file patents in their home countries’ patent offices rather than in the US.

    Alternatively, we can also map the number of patents per inventors’ origin countries.

    Sadly, entire groups of countries (e.g., almost the entire African continent) seem to be excluded from the USPTO’s patent game, at least with respect to machine learning and AI related inventions.
    Whether it is a lack of access, infrastructure, education, political treaties or something else is an intriguing question.

    Patent application and publication dates, and duration of patent examination process

    Each patent has a date of filing and an assignment date attached to it. Based on the provided dates one can try to address questions such as:
    When were these patents filed? Is the number of ML and AI related patents increasing over time? How long did it usually take from patent filing to assignment? And so on.
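
    A sketch of the underlying date arithmetic (the column names filing_date and grant_date are hypothetical stand-ins for the corresponding fields in the XML):

    ml_ai_patents$examination_years <-
      as.numeric(difftime(as.Date(ml_ai_patents$grant_date),
                          as.Date(ml_ai_patents$filing_date),
                          units = "days")) / 365.25

    # Average, spread and range of the filing-to-grant duration.
    mean(ml_ai_patents$examination_years)
    sd(ml_ai_patents$examination_years)
    range(ml_ai_patents$examination_years)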

    For the set of ML and AI related patents that were granted between Jan 3 and Sept 12 2017 the following figure depicts…

    • …in the top panel: number of patents (y-axis) per their original month of filing (x-axis),
    • …in the bottom panel: the number of patents (y-axis) that were assigned (approved) per week (x-axis) in 2017 so far.

    The patent publication dates plot suggests that the number of ML and AI related patents seems to be increasing slightly throughout the year 2017.
    The patent application dates plot suggests that the patent examination phase for the considered patents takes about 2.5 years. In fact the average time from patent filing to approval is 2.83 years with a standard deviation of 1.72 years in this dataset (that is, among the considered ML and AI related patents in 2017). However, the range is quite extensive spanning 0.24-12.57 years.

    The distribution of the duration from patent filing date to approval is depicted by the following figure.

    So, what are some of the inventions that took longest to get approved? Here are the five patents with the longest examination periods:

    • “Printing and dispensing system for an electronic gaming device that provides an undisplayed outcome” (~12.57 years to approval; assignee: Aim Management, Inc.)
    • “Apparatus for biological sensing and alerting of pharmaco-genomic mutation” (~12.24 years to approval; assignee: NA)
    • “System for tracking a player of gaming devices” (~12.06 years to approval; assignee: Aim Management, Inc.)
    • “Device, method, and computer program product for customizing game functionality using images” (~11.72 years to approval; assignee: NOKIA TECHNOLOGIES OY)
    • “Method for the spectral identification of microorganisms” (~11.57 years to approval; assignee: MCGILL UNIVERSITY, and HEALTH CANADA)

    Each of these patents is related to either gaming or biotech. I wonder if that’s a coincidence…

    We can also look at the five patents with the shortest approval time:

    • “Home device application programming interface” (~91 days to approval; assignee: ESSENTIAL PRODUCTS, INC.)
    • “Avoiding dazzle from lights affixed to an intraoral mirror, and applications thereof” (~104 days to approval; assignee: DENTAL SMARTMIRROR, INC.)
    • “Media processing methods and arrangements” (~106 days to approval; assignee: Digimarc Corporation)
    • “Machine learning classifier that compares price risk score, supplier risk score, and item risk score to a threshold” (~111 days to approval; assignee: ACCENTURE GLOBAL SOLUTIONS LIMITED)
    • “Generating accurate reason codes with complex non-linear modeling and neural networks” (~111 days to approval; assignee: SAS INSTITUTE INC.)

    Interestingly, the patent approved in the shortest amount of time among all 6665 analysed (ML and AI related) patents is some smart home thingy from Andy Rubin’s hyped up company Essential.

    Do big tech companies get their patents approved faster than other companies (e.g., startups)?

    The following figure separates the patent approval times according to the respective assignee company, considering several of the most well known tech giants.

    Indeed some big tech companies, such as AT&T or Samsung, manage to push their patent application through the USPTO process much faster than most other companies. However, there are other tech giants, such as Microsoft, which on average take longer to get their patent applications approved than even the companies in category “Other”. Also noteworthy is the fact that big tech companies tend to have fewer outliers regarding the patent examination process duration than companies in the category “Other”.

    Of course it would also be interesting to categorize all patent assignees into categories like “Startup”, “Big Tech”, “University”, or “Government”, and compare the typical duration of the patent examination process between such groups. However, it’s not clear to me how to establish such categories without collecting additional data on each patent assignee, which at this point I don’t have time for :stuck_out_tongue:.

    Conclusion

    There is definitely a lot of promise in the USPTO full text patent data.
    Here I have barely scratched the surface, and I hope that I will find the time to play around with these data some more.
    The end goal is, of course, to replace the patent examiner with an AI trained on historical patent data. :stuck_out_tongue_closed_eyes:


    This work (blog post and included figures) is licensed under a Creative Commons Attribution 4.0 International License.

    1. There are two main aspects to my reasoning as to this particular choice of keywords. (1) I wanted to keep the list relatively short in order to have a more specific search, and (2) I tried to avoid keywords which may generate false positives (e.g., the term “AI” would match all sorts of codes present in the patent text, such as “123456789 AI N1”). In no way am I claiming that this is a perfect list of keywords to identify ML and AI related patents, but I think that it’s definitely a good start. 



    Will Stanton hit 61 home runs this season?

    Fri, 09/22/2017 - 01:33

    (This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

    [edit: Juho Kokkala corrected my homework. Thanks! I updated the post. Also see some further elaboration in my reply to Andrew’s comment. As Andrew likes to say …]

    So far, Giancarlo Stanton has hit 56 home runs in 555 at bats over 149 games. Miami has 10 games left to play. What’s the chance he’ll hit 61 or more home runs? Let’s make a simple back-of-the-envelope Bayesian model and see what the posterior event probability estimate is.

    Sampling notation

    A simple model assumes a constant home run rate per at bat, with a uniform (conjugate) prior on that rate.

    The data we’ve seen so far is 56 home runs in 555 at bats, so that gives us our likelihood.

    Now we need to simulate the rest of the season and compute event probabilities. We start by assuming the number of at bats in the rest of the season is Poisson.

    We then take the number of home runs to be binomial given the number of at bats and the home run rate.

    Finally, we define an indicator variable that takes the value 1 if the total number of home runs is 61 or greater and the value of 0 otherwise.
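
    In sampling notation (a reconstruction based on the description above and the R code below; \(\theta\) denotes the home run rate, \(y = 56\) and \(K = 555\) are the observed home runs and at bats, and \(K^{\text{rest}}\), \(y^{\text{rest}}\) are the at bats and home runs in the remaining 10 games):

    \[
    \theta \sim \textrm{Beta}(1, 1), \qquad
    y \sim \textrm{Binomial}(K, \theta),
    \]
    \[
    K^{\text{rest}} \sim \textrm{Poisson}\left(10 \cdot \tfrac{555}{149}\right), \qquad
    y^{\text{rest}} \sim \textrm{Binomial}(K^{\text{rest}}, \theta),
    \]
    \[
    I = \mathbf{1}\left[\, y + y^{\text{rest}} \ge 61 \,\right].
    \]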

    Event probability

    The probability Stanton hits 61 or more home runs (conditioned on our model and his performance so far) is then the posterior expectation of that indicator variable.
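
    In symbols (a reconstruction of the missing display, using the notation above and \(M\) posterior simulation draws):

    \[
    \Pr\left[\, y + y^{\text{rest}} \ge 61 \mid y \,\right]
      = \mathbb{E}\left[\, I \mid y \,\right]
      \approx \frac{1}{M} \sum_{m=1}^{M} I^{(m)},
    \]

    where \(I^{(m)}\) is the indicator computed from the \(m\)-th posterior simulation draw.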

    Computation in R

    The posterior for the home run rate is analytic because the prior is conjugate, letting us simulate the posterior chance of success given the observed successes (56) and number of trials (555). The number of at bats is independent and also easy to simulate with a Poisson random number generator. We then simulate the number of hits on the outside as a random binomial, and finally, we compare it to the total and then report the fraction of simulations in which the simulated number of home runs put Stanton at 61 or more:

    > sum(rbinom(1e5, rpois(1e5, 10 * 555 / 149), rbeta(1e5, 1 + 56, 1 + 555 - 56)) >= (61 - 56)) / 1e5
    [1] 0.34

    That is, I’d give Stanton about a 34% chance conditioned on all of our assumptions and what he’s done so far.

    Disclaimer

    The above is intended for recreational use only and is not intended for serious bookmaking.

    Exercise

    You guessed it—code this up in Stan. You can do it for any batter, any number of games left, etc. It really works for any old statistics. It’d be even better hierarchically with more players (that’ll pull the estimate of Stanton’s home run rate down toward the league average). Finally, the event probability can be done with an indicator variable in the generated quantities block.

    The basic expression looks like you need discrete random variables, but we only need them here for the posterior calculation in generated quantities. So we can use Stan’s random number generator functions to do the posterior simulations right in Stan.

    The post Will Stanton hit 61 home runs this season? appeared first on Statistical Modeling, Causal Inference, and Social Science.



    Pirating Pirate Data for Pirate Day

    Thu, 09/21/2017 - 22:36

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    This past Tuesday was Talk Like A Pirate Day, the unofficial holiday of R (aRRR!) users worldwide. In recognition of the day, Bob Rudis used R to create this map of worldwide piracy incidents from 2013 to 2017.

    The post provides a useful and practical example of extracting data from a website without an API, otherwise known as "scraping" data. In this case the website was the International Chamber of Commerce, and the provided R code demonstrates several useful R packages for scraping:

    • rvest, for extracting data from a web page's HTML source
    • purrr, and specifically the safely function, for streamlining the process of iterating over pages that may return errors
    • purrrly, for iterating over the rows of scraped data
    • splashr, for automating the process of taking a screenshot of a webpage
    • and robotstxt, for checking whether automated downloads of the website content are allowed 

    That last package is an important one, because while it's almost always technically possible to automate the process of extracting data from a website, it's not always allowed or even legal. Inspecting the robots.txt file is one check you should definitely make, and you should also check the terms of service of the website which may also prohibit the practice. Even then, you should be respectful and take care not to overload the server with frequent and/or repeated calls, as Bob demonstrates by spacing requests by 5 seconds. Finally — and most importantly! — even if scraping isn't expressly forbidden, using scraped data may not be ethical, especially when the data is about people who are unable to give their individual consent to your use of the data. This Forbes article about analyzing data scraped from a dating website offers an instructive tale in that regard.
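
    As a minimal sketch of that "check first, then scrape politely" pattern (not Bob's actual code; the URL is a placeholder):

    library(robotstxt)
    library(rvest)

    url <- "https://www.example.com/piracy-reports"

    if (paths_allowed(url)) {        # is automated access to this path permitted?
      page   <- read_html(url)
      tables <- html_table(page)     # extract any HTML tables on the page
      Sys.sleep(5)                   # space out successive requests
    }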

    This piracy data example, however, provides a case study of using a website and the data it provides in the right way. Follow the link below for all the details.

    rud.is: Pirating Web Content Responsibly With R



    Exploratory Data Analysis of Tropical Storms in R

    Thu, 09/21/2017 - 20:41

    (This article was first published on R-Projects – Stoltzmaniac, and kindly contributed to R-bloggers)

    Exploratory Data Analysis of Tropical Storms in R

    The disastrous impact of recent hurricanes, Harvey and Irma, generated a large influx of data within the online community. I was curious about the history of hurricanes and tropical storms so I found a data set on data.world and started some basic exploratory data analysis (EDA).

    EDA is crucial to starting any project. Through EDA you can start to identify errors & inconsistencies in your data, find interesting patterns, see correlations and start to develop hypotheses to test. For most people, basic spreadsheets and charts are handy and provide a great place to start. They are an easy-to-use method to manipulate and visualize your data quickly. Data scientists may cringe at the idea of using a graphical user interface (GUI) to kick-off the EDA process but those tools are very effective and efficient when used properly. However, if you’re reading this, you’re probably trying to take EDA to the next level. The best way to learn is to get your hands dirty, let’s get started.

    The original source of the data can be found at DHS.gov.

    Step 1: Take a look at your data set and see how it is laid out

          FID YEAR MONTH DAY AD_TIME BTID     NAME  LAT   LONG WIND_KTS PRESSURE CAT
         2001 1957     8   8   1800Z   63 NOTNAMED 22.5 -140.0       50        0  TS
         2002 1961    10   3   1200Z  116  PAULINE 22.1 -140.2       45        0  TS
         2003 1962     8  29   0600Z  124        C 18.0 -140.0       45        0  TS
         2004 1967     7  14   0600Z  168   DENISE 16.6 -139.5       45        0  TS
         2005 1972     8  16   1200Z  251    DIANA 18.5 -139.8       70        0  H1
         2006 1976     7  22   0000Z  312    DIANA 18.6 -139.8       30        0  TD

    Fortunately, this is a tidy data set which will make life easier and appears to be cleaned up substantially. The column names are relatively straightforward with the exception of “ID” columns.

    The description as given by DHS.gov:

    This dataset represents Historical North Atlantic and Eastern North Pacific Tropical Cyclone Tracks with 6-hourly (0000, 0600, 1200, 1800 UTC) center locations and intensities for all subtropical depressions and storms, extratropical storms, tropical lows, waves, disturbances, depressions and storms, and all hurricanes, from 1851 through 2008. These data are intended for geographic display and analysis at the national level, and for large regional areas. The data should be displayed and analyzed at scales appropriate for 1:2,000,000-scale data.

    Step 2: View some descriptive statistics

          YEAR          MONTH             DAY           WIND_KTS         PRESSURE
     Min.   :1851   Min.   : 1.000   Min.   : 1.00   Min.   : 10.00   Min.   :   0.0
     1st Qu.:1928   1st Qu.: 8.000   1st Qu.: 8.00   1st Qu.: 35.00   1st Qu.:   0.0
     Median :1970   Median : 9.000   Median :16.00   Median : 50.00   Median :   0.0
     Mean   :1957   Mean   : 8.541   Mean   :15.87   Mean   : 54.73   Mean   : 372.3
     3rd Qu.:1991   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.: 70.00   3rd Qu.: 990.0
     Max.   :2008   Max.   :12.000   Max.   :31.00   Max.   :165.00   Max.   :1024.0

    We can confirm that this particular data had storms from 1851 – 2010, which means the data goes back roughly 100 years before naming storms started! We can also see that the minimum pressure values are 0, which likely means it could not be measured (due to the fact zero pressure is not possible in this case). We can see that there are recorded months from January to December along with days extending from 1 to 31. Whenever you see all of the dates laid out that way, you can smile and think to yourself, “if I need to, I can put dates in an easy to use format such as YYYY-mm-dd (2017-09-12)!”
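
    For example, a date column could be assembled from the existing columns along these lines (a sketch; storms is a placeholder name for the loaded data frame):

    storms$DATE <- as.Date(paste(storms$YEAR, storms$MONTH, storms$DAY, sep = "-"))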

    Step 3: Make a basic plot

    This is a great illustration of our data set and we can easily notice an upward trend in the number of storms over time. Before we go running to tell the world that the number of storms per year is growing, we need to drill down a bit deeper. This could simply be because more types of storms (hurricanes, tropical storms, waves, etc.) were being recorded and added to the data set over time. However, this could be a potential starting point for developing a hypothesis for time-series data.

    You will notice the data starts at 1950 rather than 1851. I made this choice because storms were not named until this point so it would be difficult to try and count the unique storms per year. It could likely be done by finding a way to utilize the “ID” columns. However, this is a preliminary analysis so I didn’t want to dig too deep.

    Step 4: Make some calculations

     YEAR Distinct_Storms Distinct_Storms_Change Distinct_Storms_Pct_Change
     1951              10                     -3                      -0.23
     1952               6                     -4                      -0.40
     1953               8                      2                       0.33
     1954               8                      0                       0.00
     1955              10                      2                       0.25
     1956               7                     -3                      -0.30
     1957               9                      2                       0.29
     1958              10                      1                       0.11
     1959              13                      3                       0.30
     1960              13                      0                       0.00

    In this case, we can see the number of storms, nominal change and percentage change per year. These calculations help to shed light on what the growth rate looks like each year. So we can use another summary table:

     Distinct_Storms Distinct_Storms_Change Distinct_Storms_Pct_Change
     Min.   : 6.00   Min.   :-15.0000       Min.   :-0.42000
     1st Qu.:15.75   1st Qu.: -3.0000       1st Qu.:-0.12000
     Median :25.00   Median :  1.0000       Median : 0.04000
     Mean   :22.81   Mean   :  0.3448       Mean   : 0.05966
     3rd Qu.:28.00   3rd Qu.:  3.7500       3rd Qu.: 0.25000
     Max.   :43.00   Max.   : 16.0000       Max.   : 1.14000

    From the table we can state the following for the given time period:

    • The mean number of storms is 23 per year (with a minimum of 6 and maximum of 43)
    • The mean change in the number of storms per year is 0.34 (with a minimum of -15 and maximum of 16)
    • The mean percent change in the number of storms per year is 6% (with a minimum of -42% and maximum of 114%)

    Again, we have to be careful because these numbers are in aggregate and may not tell the whole story. Dividing these into groups of storms is likely much more meaningful.
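
    For reference, the per-year counts and changes above could be computed along these lines (a sketch, not the original code; storms_named stands for the post-1950 named-storm subset):

    library(dplyr)

    yearly <- storms_named %>%
      group_by(YEAR) %>%
      summarise(Distinct_Storms = n_distinct(NAME)) %>%
      mutate(Distinct_Storms_Change = Distinct_Storms - lag(Distinct_Storms),
             Distinct_Storms_Pct_Change = round(Distinct_Storms_Change / lag(Distinct_Storms), 2))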

    Step 5: Make a more interesting plot

    Because I was most interested in hurricanes, I filtered out only the data which was classified as “H (1-5).” By utilizing a data visualization technique called “small multiples” I was able to pull out the different types and view them within the same graph. While this is possible to do in tables and spreadsheets, it’s much easier to visualize this way. By holding the axes constant, we can see the majority of the storms are classified as H1 and then it appears to consistently drop down toward H5 (with very few actually being classified as H5). We can also see that most have an upward trend from 1950 – 2010. The steepest appears to be H1 (but it also flattens out over the last decade).
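
    A small-multiples plot of this kind can be sketched with ggplot2 roughly as follows (assuming the hurricane-only subset is in a data frame called hurricanes, with the category in CAT; not the original code):

    library(dplyr)
    library(ggplot2)

    hurricanes %>%
      distinct(YEAR, NAME, CAT) %>%   # one row per storm, year and category
      count(YEAR, CAT) %>%
      ggplot(aes(x = YEAR, y = n)) +
      geom_line() +
      facet_wrap(~ CAT)               # one panel per hurricane category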

    Step 6: Make a filtered calculation

     Distinct_Storms Distinct_Storms_Change Distinct_Storms_Pct_Change
     Min.   : 4.00   Min.   :-11.00000      Min.   :-0.56000
     1st Qu.: 9.25   1st Qu.: -4.00000      1st Qu.:-0.25750
     Median :13.50   Median :  0.00000      Median : 0.00000
     Mean   :12.76   Mean   :  0.05172      Mean   : 0.08379
     3rd Qu.:15.00   3rd Qu.:  3.00000      3rd Qu.: 0.35250
     Max.   :24.00   Max.   : 10.00000      Max.   : 1.80000

    Now we are looking strictly at hurricane data (classified as H1-H5):

    • The mean number of hurricanes is 13 per year (with a minimum of 4 and maximum of 24)
    • The mean change in the number of hurricanes per year is 0.05 (with a minimum of -11 and maximum of 10)
    • The mean percent change in the number of hurricanes per year is 8% (with a minimum of -56% and maximum of 180%)

    While it doesn’t really make sense to say “we had an average growth of 0.05 hurricanes per year between 1950 and 2010” … it may make sense to say “we saw an average of growth of 8% per year in the number of hurricanes between 1950 and 2010.”

    That’s a great thing to put in quotes!

    During EDA we discovered an average of growth of 8% per year in the number of hurricanes between 1950 and 2010.

    Side Note: Be ready, as soon as you make a statement like that, you will likely have to explain how you arrived at that conclusion. That’s where having an RMarkdown notebook and data online in a repository will help you out! Reproducible research is all of the hype right now.

    Step 7: Try visualizing your statements

    A histogram and/or density plot is a great way to visualize the distribution of the data you are making statements about. This plot helps to show that we are looking at a right-skewed distribution with substantial variance. Knowing that we have n = 58 (meaning 58 years after being aggregated), it’s not surprising that our histogram looks sparse and our density plot has an unusual shape. At this point, you can make a decision to jot this down, research it in depth and then attack it with full force.

    However, that’s not what we’re covering in this post.

    Step 8: Plot another aspect of your data

    60K pieces of data can get out of hand quickly, so we need to back this down into manageable chunks. Building on the knowledge from our last exploration, we should be able to think of a way to cut this down to get some better information. The concept of small multiples could come in handy again! Splitting the data up by type of storm could prove to be valuable. We can also tell that we are missing

    After filtering the data down to hurricanes and utilizing a heat map rather than plotting individual points we can get a better handle on what is happening where. The H4 and H5 sections are probably the most interesting. It appears as if H4 storms are more frequent on the West coast of Mexico whereas the H5 are most frequent in the Gulf of Mexico.

    Because we’re still in EDA mode, we’ll continue with another plot.

    Here are some of the other storms from the data set. We can see that TD, TS and L have large geographical spreads. The E, SS, and SD storms are concentrated further North toward New England.

    Digging into this type of data and building probabilistic models is a fascinating field. The actuarial sciences are extremely difficult and insurance companies really need good models. Having mapped this data, it’s pretty clear you could dig in and find out what parts of the country should expect what types of storms (and you’ve also known this just from being alive for 10+ years). More hypotheses could be formed about location at this stage and could be tested!

    Step 9: Look for a relationship

    What is the relationship between WIND_KTS and PRESSURE? This chart helps us to see that PRESSURE and WIND_KTS are likely negatively correlated (lower pressure tends to accompany higher wind speeds). We can also see that WIND_KTS is essentially the predictor in the data set which can perfectly predict how a storm is classified. Well, it turns out, that’s basically the distinguishing feature when scientists are determining how to categorize these storms!
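
    A sketch of such a plot (dropping the PRESSURE == 0 rows noted in Step 2 as unmeasured; storms is again a placeholder name for the data frame):

    library(dplyr)
    library(ggplot2)

    storms %>%
      filter(PRESSURE > 0) %>%
      ggplot(aes(x = PRESSURE, y = WIND_KTS, color = CAT)) +
      geom_point(alpha = 0.3)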

    Step N……

    We have done some basic EDA and identified some patterns in the data. While doing some EDA can simply be just for fun, in most data analysis, it’s important to find a place to apply these discoveries by making and testing hypotheses! There are plenty of industries where you could find a use for this:

    • Insurance – improve statistical modeling and risk analysis
    • Real Estate Development – identify strategic investment locations
    • Agriculture – crop selection

    The rest is up to you! This is a great data set and there are a lot more pieces of information lurking within it. I want people to do their own EDA and send me anything interesting!

    Some fun things to check out:

    • What was the most common name for a hurricane?
    • Do the names actually follow an alphabetical pattern through time? (This is one is tricky)
    • If the names are in alphabetical order, how often does a letter get repeated in a year?
    • Can we merge this data with FEMA, charitable donations, or other aid data?

    To get you started on the first one, here’s the Top 10 most common names for tropical storms. Why do you think it’s Florence?
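
    One way to tabulate the name counts yourself (a sketch; it drops the placeholder used before storms were named):

    library(dplyr)

    storms %>%
      filter(NAME != "NOTNAMED") %>%
      distinct(YEAR, NAME) %>%        # count each storm once per year
      count(NAME, sort = TRUE) %>%
      head(10)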

    Thank you for reading; I hope this helps you with your own data. The code is all written in R and is located on my GitHub. You can also find other data visualization posts and usages of ggplot2 on my blog Stoltzmaniac.



    Gold-Mining – Week 3 (2017)

    Thu, 09/21/2017 - 17:25

    (This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

    Week 3 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

    The post Gold-Mining – Week 3 (2017) appeared first on Fantasy Football Analytics.



    Don’t teach students the hard way first

    Thu, 09/21/2017 - 16:00

    (This article was first published on Variance Explained, and kindly contributed to R-bloggers)

    Imagine you were going to a party in an unfamiliar area, and asked the host for directions to their house. It takes you thirty minutes to get there, on a path that takes you on a long winding road with slow traffic. As the party ends, the host tells you “You can take the highway on your way back, it’ll take you only ten minutes. I just wanted to show you how much easier the highway is.”

    Wouldn’t you be annoyed? And yet this kind of attitude is strangely common in programming education.

    I was recently talking to a friend who works with R and whose opinions I greatly respect. He was teaching some sessions to people in his company who hadn’t used R before, where he largely followed my philosophy on teaching the tidyverse to beginners. I agreed with his approach, until he said something far too familiar to me:

    “I teach them dplyr’s group_by/summarize in the second lesson, but I teach them loops first just to show them how much easier dplyr is.”

    I talk to people about teaching a lot, and that phrase keeps popping up: “I teach them X just to show them how much easier Y is”. It’s a trap: a trap I’ve fallen into before when teaching, and one that I’d like to warn others against.

    Students don’t share your nostalgia

    First, why do some people make this choice? I think because when we teach a class, we accidentally bring in all of our own history and context.

    For instance, I started programming with Basic and Perl in high school, then Java in college, then got really into Python, then got even more into R. Along the way I built up habits in each language that had to be painfully undone afterwards. I worked in an object-oriented style in Python, then switched to thinking in data frames and functional operations. I wrote too many loops when I started R, then I grew accustomed to the apply family of functions. Along the way there were thousands of little frustrations and little epiphanies, tragedies and triumphs in microcosm.

    But when I’m teaching someone how to use R… they don’t care about any of that. They weren’t there! They didn’t make my mistakes, they don’t need to be talked out of them. Going back to the party host, perhaps the highway was built only last year, and maybe the host had to talk his friends, stuck in their habits, into taking the highway by explaining how much faster it is. But that doesn’t make any difference to the guest, who has never taken either route.

    @ucfagls I think ppl teach base stuff preferentially because it’s what they still use or they’re reliving their own #rstats “journey” @drob

    — Jenny Bryan (@JennyBryan) January 16, 2015

    It’s true that I learned a lot about programming in the path I described above. I learned how to debug, how to compare programming paradigms, and when to switch from one approach to another. I think some teachers hope that by walking students through this “journey”, we can impart some of that experience. But it doesn’t work: feeding students two possible solutions in a row is nothing like the experience of them comparing solutions for themselves, and doesn’t grant them any of the same skills. Besides which, students will face plenty of choices and challenges as they continue their programming career, without us having to invent artificial ones.

Students should absolutely learn multiple approaches (there are usually advantages to each one). But not in this order, from hardest to easiest, and not because it happened to be the order we learned them ourselves.

    Bandwidth and trust

    There are two reasons I recommend against teaching a harder approach first. One is educational bandwidth, and one is trust.

    One of the most common mistakes teachers make (especially inexperienced ones) is to think they can teach more material than they can.

    .@minebocek talks philosophy of Data Carpentry.

    Highlight: don't take on too much material. Rushing discourages student questions #UseR2017 pic.twitter.com/txAcSd3ND3

    — David Robinson (@drob) July 6, 2017

    This comes down to what I sometimes call educational bandwidth: the total amount of information you can communicate to students is limited, especially since you need to spend time reinforcing and revisiting each concept. It’s not just about the amount of time you have in the lesson, either. Learning new ideas is hard work: think of the headache you can get at the end of a one-day workshop. This means you should make sure every idea you get across is valuable. If you teach them a method they’ll never have to use, you’re wasting time and energy.

    The other reason is trust. Think of the imaginary host who gave poor directions before giving better ones. Wouldn’t it be unpleasant to be tricked like that? When a student goes through the hard work of learning a new method, telling them “just kidding, you didn’t need to learn that” is obnoxious, and hurts their trust in you as a teacher.

    In some cases there’s a tradeoff between bandwidth and trust. For instance, as I’ve described before, I teach dplyr and the %>% operator before explaining what a function is, or even how to do a variable assignment. This conserves bandwidth (it gets students doing powerful things quickly) but it’s a minor violation of their trust (I’m hiding details of how R actually works). But there’s no tradeoff in teaching a hard method before an easier one.

    Exceptions

    Is there any situation where you might want to show students the hard way first? Yes: one exception is if the hard solution is something the student would have been tempted to do themselves.

    One good example is in R for Data Science, by Hadley Wickham and Garrett Grolemund. Chapter 19.2 describes “When should you write a function”, and gives an example of rescaling several columns by copying and pasting code:

df$a <- (df$a - min(df$a, na.rm = TRUE)) / (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))

    By showing them what this approach looks like (and even including an intentional typo in the second line, to show how copying and pasting code is prone to error), the book guides them towards the several steps involved in writing a function.

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

This educational approach makes sense because copying and pasting code is a habit beginners would fall into naturally, especially if they've never programmed before. It doesn't take up any educational bandwidth because students already know how to do it, and it's upfront about its approach (when I teach this way, I usually use the words "you might be tempted to…").

However, teaching a loop (or split()/lapply(), or aggregate(), or tapply()) isn't something beginners would do accidentally, and it's therefore not something you need to be teaching them.

In conclusion: teaching programming is hard, so don't make it harder. Next time you're teaching a course or workshop, writing a tutorial, or just helping a colleague get set up in R, try teaching them your preferred method first, instead of meandering through subpar solutions. I think you'll find that it's worth it.


    To leave a comment for the author, please follow the link and comment on their blog: Variance Explained. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    ggformula: another option for teaching graphics in R to beginners

    Thu, 09/21/2017 - 15:03

    (This article was first published on SAS and R, and kindly contributed to R-bloggers)

A previous entry (http://sas-and-r.blogspot.com/2017/07/options-for-teaching-r-to-beginners.html) describes an approach to teaching graphics in R that also "get[s] students doing powerful things quickly", as David Robinson suggested.

    In this guest blog entry, Randall Pruim offers an alternative way based on a different formula interface. Here’s Randall: 

    For a number of years I and several of my colleagues have been teaching R to beginners using an approach that includes a combination of

    • the lattice package for graphics,
    • several functions from the stats package for modeling (e.g., lm(), t.test()), and
    • the mosaic package for numerical summaries and for smoothing over edge cases and inconsistencies in the other two components.

    Important in this approach is the syntactic similarity that the following “formula template” brings to all of these operations.  


      goal ( y ~ x , data = mydata, … )

    Many data analysis operations can be executed by filling in four pieces of information (goal, y, x, and mydata) with the appropriate information for the desired task. This allows students to become fluent quickly with a powerful, coherent toolkit for data analysis.
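For concreteness, here is a minimal sketch of the template in action (an illustration using the built-in iris data, not the course's own materials), with three different goals filling the same slots:

library(lattice)  # graphics
library(mosaic)   # formula-aware numerical summaries

bwplot(Sepal.Length ~ Species, data = iris)     # goal: a boxplot (lattice)
mean(Sepal.Length ~ Species, data = iris)       # goal: group means (mosaic)
lm(Sepal.Length ~ Petal.Length, data = iris)    # goal: a linear model (stats)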

    Trouble in paradise
As the earlier post noted, the use of lattice has some drawbacks. While basic graphs like histograms, boxplots, scatterplots, and quantile-quantile plots are simple to make with lattice, it is challenging to combine these simple plots into more complex plots or to plot data from multiple data sources. Splitting data into subgroups and either overlaying with multiple colors or separating into sub-plots (facets) is easy, but the labeling of such plots is less convenient (and takes more space) than in the equivalent plots made with ggplot2. And in our experience, students generally find the look of ggplot2 graphics more appealing.

On the other hand, introducing ggplot2 into a first course is challenging. The syntax tends to be more verbose, so it takes up more of the limited space on projected images and course handouts. More importantly, the syntax is entirely unrelated to the syntax used for other aspects of the course. For those adopting a "Less Volume, More Creativity" approach, ggplot2 is tough to justify.

ggformula: the third-and-a-half way

Danny Kaplan and I recently introduced ggformula, an R package that provides a formula interface to ggplot2 graphics. Our hope is that it provides the best aspects of lattice (the formula interface and lighter syntax) and ggplot2 (modularity, layering, and better visual aesthetics). For simple plots, the only thing that changes is the name of the plotting function. Each of these functions begins with gf_. Here are two examples, either of which could replace the side-by-side boxplots made with lattice in the previous post. We can even overlay these two types of plots to see how they compare. To do so, we simply place what I call the "then" operator (%>%, also commonly called a pipe) between the two layers and adjust the transparency so we can see both where they overlap.
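The example plots themselves are not reproduced here; as an illustrative sketch (using the built-in iris data rather than the data from the previous post), the gf_ calls look roughly like this:

library(ggformula)

# Two simple one-liners with the lattice-style formula interface
gf_boxplot(Sepal.Length ~ Species, data = iris)
gf_violin(Sepal.Length ~ Species, data = iris)

# Overlaying the two layers with the "then" operator, transparency adjusted
gf_violin(Sepal.Length ~ Species, data = iris, alpha = 0.3) %>%
  gf_boxplot(alpha = 0.5, width = 0.2)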
Comparing groups

Groups can be compared either by overlaying multiple groups distinguishable by some attribute (e.g., color) or by creating multiple plots arranged in a grid rather than overlaying subgroups in the same space. The ggformula package provides two ways to create these facets. The first uses | very much like lattice does. Notice that the gf_lm() layer inherits information from the gf_point() layer in these plots, saving some typing when the information is the same in multiple layers. The second way adds facets with gf_facet_wrap() or gf_facet_grid() and can be more convenient for complex plots or when customization of facets is desired.

Fitting into the tidyverse work flow

ggformula also fits into a tidyverse-style workflow (arguably better than ggplot2 itself does). Data can be piped into the initial call to a ggformula function, and there is no need to switch between %>% and + when moving from data transformations to plot operations.

Summary

The "Less Volume, More Creativity" approach is based on a common formula template that has served well for several years, but the arrival of ggformula strengthens this approach by bringing a richer graphical system into reach for beginners without introducing new syntactical structures. The full range of ggplot2 features and customizations remains available, and the ggformula package vignettes and tutorials describe these in more detail.

    — Randall Pruim


    To leave a comment for the author, please follow the link and comment on their blog: SAS and R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Comparing Trump and Clinton’s Facebook pages during the US presidential election, 2016

    Thu, 09/21/2017 - 14:05

    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    R has a lot of packages for users to analyse posts on social media. As an experiment in this field, I decided to start with the biggest one: Facebook.

    I decided to look at the Facebook activity of Donald Trump and Hillary Clinton during the 2016 presidential election in the United States.

    The winner may be more famous for his Twitter account than his Facebook one, but he still used it to great effect to help pick off his Republican rivals in the primaries and to attack Hillary Clinton in the general election.

    For this work we’re going to be using the Rfacebook package developed by Pablo Barbera, plus his excellent how-to guide.

    The first thing to do is to generate an access token from Facebook’s developer portal. Keep it anonymous (otherwise you’re gifting the world access to your account) and save it in your environment.

library(Rfacebook)
options(scipen = 999)
token <- "Your token goes here"

    The next thing to do is to use the getPage() function to retrieve all the posts from each candidate.

I'm going to start the clock on January 1, 2016 and end it the day after the election, on November 9, 2016 (which means it will stop on election day, the day before).

trump <- getPage("donaldtrump", token, n = 5000, since='2016/01/01', until='2016/11/09')
clinton <- getPage("hillaryclinton", token, n = 5000, since='2016/01/01', until='2016/11/09')

Caveat: The data doesn't seem to contain all Trump and Clinton's Facebook posts

    I ran the commands several times and got 545 posts for Trump and 692 posts for Clinton. However, I think I may have got more results the first time I ran the commands. I also searched their pages via Facebook and came up with some posts that don’t appear in the R datasets. If you have a solution for this, please let me know!

In the meantime, we will work with what we have.

    We want to calculate the average number of likes, comments and shares for each month for both candidates. Again, we will be using Pablo’s code for a while here.

    First up, we will format the date:

format.facebook.date <- function(datestring) {
  date <- as.POSIXct(datestring, format = "%Y-%m-%dT%H:%M:%S+0000", tz = "GMT")
}

    Then we will use his formula to calculate the average likes, comments and shares (metrics) per month:

aggregate.metric <- function(metric) {
  m <- aggregate(page[[paste0(metric, "_count")]], list(month = page$month), mean)
  m$month <- as.Date(paste0(m$month, "-15"))
  m$metric <- metric
  return(m)
}

    Now we run this data for both candidates:

#trump
page <- trump
page$datetime <- format.facebook.date(page$created_time)
page$month <- format(page$datetime, "%Y-%m")
df.list <- lapply(c("likes", "comments", "shares"), aggregate.metric)
trump_months <- do.call(rbind, df.list)

head(trump_months)
       month        x metric
1 2016-01-15 17199.93  likes
2 2016-02-15 15239.63  likes
3 2016-03-15 22616.28  likes
4 2016-04-15 19364.17  likes
5 2016-05-15 14598.30  likes
6 2016-06-15 32760.68  likes

#clinton
page <- clinton
page$datetime <- format.facebook.date(page$created_time)
page$month <- format(page$datetime, "%Y-%m")
df.list <- lapply(c("likes", "comments", "shares"), aggregate.metric)
clinton_months <- do.call(rbind, df.list)

    Before we combine them together, let’s label them so we know who’s who:

trump_months$candidate <- "Donald Trump"
clinton_months$candidate <- "Hillary Clinton"
both <- rbind(trump_months, clinton_months)

    Now we have the data, we can visualise it. This is a neat opportunity to have a go at faceting using ggplot2.

    Faceting is when you display two or more plots side-by-side for easy at-a-glance comparison.

library(ggplot2)
library(scales)

p <- ggplot(both, aes(x = month, y = x, group = metric)) +
  geom_line(aes(color = metric)) +
  scale_x_date(date_breaks = "months", labels = date_format("%m")) +
  ggtitle("Facebook engagement during the 2016 election") +
  labs(y = "Count", x = "Month (2016)", aesthetic = 'Metric') +
  theme(text = element_text(family = "Browallia New", color = "#2f2f2d")) +
  scale_colour_discrete(name = "Metric")

#add in a facet
p <- p + facet_grid(. ~ candidate)
p

Analysis

Clearly Trump's Facebook engagement got far better results than Clinton's. Even during his 'off months' he received more likes per post on average than Clinton managed at the height of the general election.

Trump's comments per post also skyrocketed during October and November as the election neared.

    Hillary Clinton enjoyed a spike in engagement around June. It was a good month for her: she was confirmed as the Democratic nominee and received the endorsement of President Obama.

    Themes

    Trump is famous for using nicknames for his political opponents. We had low-energy Jeb, Little Marco, Lyin’ Ted and then Crooked Hillary.

    The first usage of Crooked Hillary in the data came on April 26. A look through his Twitter feed shows he seems to have decided on Crooked Hillary around this time as well.

#DrainTheSwamp was one of his later hashtags, making its first appearance in the data on October 20, just a few weeks shy of the election on November 8.

    Clinton meanwhile mentioned her rival’s surname in about a quarter of her posts. Her most popular ones were almost all on the eve of the election exhorting her followers to vote.

    Of her earlier ones, only her New Year message and one from March appealing to Trump’s temperament resonated as strongly.

    Conclusion

    It’s frustrating that the API doesn’t seem to retrieve all the data.
    Nonetheless, it’s a decent sample size and shows that the Trump campaign was far more effective than the Clinton one on Facebook during the 2016 election.

      Related Post

      1. Analyzing Obesity across USA
      2. Can we predict flu deaths with Machine Learning and R?
      3. Graphical Presentation of Missing Data; VIM Package
      4. Creating Graphs with Python and GooPyCharts
      5. Visualizing Tennis Grand Slam Winners Performances

      To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      Visualizing the Spanish Contribution to The Metropolitan Museum of Art

      Thu, 09/21/2017 - 08:30

      (This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

      Well I walk upon the river like it’s easier than land
      (Love is All, The Tallest Man on Earth)


The Metropolitan Museum of Art provides here a dataset with information on more than 450,000 artworks in its collection. You can do anything you want with these data: there are no restrictions on use. Each record contains information about the author, title, type of work, dimensions, date, culture and geography of a particular piece.

I can imagine a bunch of things to do with these data, but since I am a big fan of highcharter, I have done a treemap, which is an artistic (as well as efficient) way to visualize hierarchical data. A treemap is useful for visualizing frequencies. It can handle levels, allowing you to drill down into any category. Here you can find a good example of a treemap.

To read the data I use the fread function from the data.table package. I also use this package to do some data wrangling operations on the data set. After that, I filter it, looking for the word SPANISH in the columns Artist Nationality and Culture and for the word SPAIN in the column Country. For me, any piece created by a Spanish artist (like this one), coming from Spanish culture (like this one) or from Spain (like this one) is Spanish (this is my very own definition and may not match any academic one). Once that is done, it is easy to extract some interesting figures:

• There are 5,294 Spanish pieces in The Met, which is 1.16% of the collection
• This percentage varies significantly between departments: it rises to 9.01% in The Cloisters and to 4.83% in The Robert Lehman Collection; on the other hand, it falls to 0.52% in The Libraries and to 0.24% in Photographs.
• The Met is home to 1,895 highlights and 44 of them (2.32%) are Spanish; this means that Spanish art is twice as prominent as could be expected (remember that it represents 1.16% of the entire collection)

      My treemap represents the distribution of Spanish artworks by department (column Department) and type of work (column Classification). There are two important things to know before doing a treemap with highcharter:

      • You have to use treemap function from treemap package to create a list with your data frame that will serve as input for hctreemap function
      • hctreemap fails if some category name is the same as any of its subcategories. To avoid this, make sure that all names are distinct.

      This is the treemap:


      Here you can see a full size version of it.

Several things can be seen at a glance: most of the pieces are drawings and prints and European sculpture and decorative arts (specifically, prints and textiles); there is also a big number of costumes; arms and armor is a very fragmented department ... I think the treemap is a good way to see what kind of works The Met owns.

My favorite Spanish piece in The Met is the stunning Portrait of Juan de Pareja by Velázquez, which illustrates this post: how nice it would be to see it next to El Primo in El Museo del Prado!

      Feel free to use my code to do your own experiments:

library(data.table)
library(dplyr)
library(stringr)
library(highcharter)
library(treemap)

file="MetObjects.csv"

# Download data
if (!file.exists(file)) download.file(paste0("https://media.githubusercontent.com/media/metmuseum/openaccess/master/", file),
                                      destfile=file, mode='wb')

# Read data
data=fread(file, sep=",", encoding="UTF-8")

# Modify column names to remove blanks
colnames(data)=gsub(" ", ".", colnames(data))

# Clean columns to prepare for searching
data[,`:=`(Artist.Nationality_aux=toupper(Artist.Nationality) %>% str_replace_all("\\[\\d+\\]", "") %>% iconv(from='UTF-8', to='ASCII//TRANSLIT'),
           Culture_aux=toupper(Culture) %>% str_replace_all("\\[\\d+\\]", "") %>% iconv(from='UTF-8', to='ASCII//TRANSLIT'),
           Country_aux=toupper(Country) %>% str_replace_all("\\[\\d+\\]", "") %>% iconv(from='UTF-8', to='ASCII//TRANSLIT'))]

# Look for Spanish artworks
data[Artist.Nationality_aux %like% "SPANISH" |
     Culture_aux %like% "SPANISH" |
     Country_aux %like% "SPAIN"] -> data_spain

# Count artworks by Department and Classification
data_spain %>%
  mutate(Classification=ifelse(Classification=='', "miscellaneous", Classification)) %>%
  mutate(Department=tolower(Department),
         Classification1=str_match(Classification, "(\\w+)(-|,|\\|)")[,2],
         Classification=ifelse(!is.na(Classification1), tolower(Classification1), tolower(Classification))) %>%
  group_by(Department, Classification) %>%
  summarize(Objects=n()) %>%
  ungroup %>%
  mutate(Classification=ifelse(Department==Classification, paste0(Classification, "#"), Classification)) %>%
  as.data.frame() -> dfspain

# Do treemap without drawing
tm_dfspain <- treemap(dfspain,
                      index = c("Department", "Classification"),
                      draw = F,
                      vSize = "Objects",
                      vColor = "Objects",
                      type = "index")

# Do highcharter treemap
hctreemap(
  tm_dfspain,
  allowDrillToNode = TRUE,
  allowPointSelect = T,
  levelIsConstant = F,
  levels = list(
    list(
      level = 1,
      dataLabels = list(enabled = T, color = '#f7f5ed', style = list("fontSize" = "1em")),
      borderWidth = 1
    ),
    list(
      level = 2,
      dataLabels = list(enabled = F, align = 'right', verticalAlign = 'top',
                        style = list("textShadow" = F, "fontWeight" = 'light', "fontSize" = "1em")),
      borderWidth = 0.7
    )
  )) %>%
  hc_title(text = "Spanish Artworks in The Met") %>%
  hc_subtitle(text = "Distribution by Department") -> plot

plot

      To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      Pandigital Products: Euler Problem 32

      Thu, 09/21/2017 - 02:00

      (This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Euler Problem 32 returns to pandigital numbers, which are numbers that contain each digit exactly once. Like so many of the Euler Problems, these numbers serve no practical purpose whatsoever, other than some entertainment value. You can find all pandigital numbers in base 10 in the Online Encyclopedia of Integer Sequences (A050278).

The Numberphile video explains everything you ever wanted to know about pandigital numbers but were afraid to ask.

      Euler Problem 32 Definition

      We shall say that an n-digit number is pandigital if it makes use of all the digits 1 to n exactly once; for example, the 5-digit number, 15234, is 1 through 5 pandigital.

      The product 7254 is unusual, as the identity, 39 × 186 = 7254, containing multiplicand, multiplier, and product is 1 through 9 pandigital.

      Find the sum of all products whose multiplicand/multiplier/product identity can be written as a 1 through 9 pandigital.

      HINT: Some products can be obtained in more than one way so be sure to only include it once in your sum.

      Proposed Solution

The pandigital.9 function tests whether a string of digits qualifies as a pandigital number. The pandigital.prod vector is used to store the products.

The only way to solve this problem is brute force: try all multiplications, but we can limit the solution space to a manageable number. The multiplicand, multiplier and product together need to have nine digits. For example, when the first number has two digits, the second number should have three digits so that the product has four digits, e.g. 39 × 186 = 7254. When the first number has only one digit, the second number needs to have four digits.

pandigital.9 <- function(x) # Test if string is 9-pandigital
  (length(x) == 9 & sum(duplicated(x)) == 0 & sum(x == 0) == 0)

t <- proc.time()
pandigital.prod <- vector()
i <- 1
for (m in 2:100) {
  if (m < 10) n_start <- 1234 else n_start <- 123
  for (n in n_start:round(10000 / m)) {
    # List of digits
    digs <- as.numeric(unlist(strsplit(paste0(m, n, m * n), "")))
    # is Pandigital?
    if (pandigital.9(digs)) {
      pandigital.prod[i] <- m * n
      i <- i + 1
      print(paste(m, "*", n, "=", m * n))
    }
  }
}
answer <- sum(unique(pandigital.prod))
print(answer)

      Numbers can also be checked for pandigitality using mathematics instead of strings.
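For instance, one string-free check (an illustrative sketch, not part of the proposed solution above) peels the digits off with modular arithmetic and compares the sorted result with 1:9:

# Test 9-pandigitality without converting to a string
pandigital.9.numeric <- function(n) {
  digits <- integer(0)
  while (n > 0) {
    digits <- c(digits, n %% 10)
    n <- n %/% 10
  }
  length(digits) == 9 && all(sort(digits) == 1:9)
}

pandigital.9.numeric(391867254)  # TRUE: the digits of 39, 186 and 7254 combined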

      You can view the most recent version of this code on GitHub.

      The post Pandigital Products: Euler Problem 32 appeared first on The Devil is in the Data.


      To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      Report from Mexico City

      Thu, 09/21/2017 - 02:00

      (This article was first published on R Views, and kindly contributed to R-bloggers)

Editor's Note: It has been heartbreaking watching the images from México City. Teresa Ortiz, co-organizer of R-Ladies CDMX, reports on the efforts of data scientists to help. Our thoughts are with them, and with the people of México.

      It has been a hard couple of days around here.

In less than two weeks, México has gone through two devastating earthquakes and the damage keeps mounting. Nevertheless, the response from the citizens has been outstanding and Mexican data-driven initiatives have not lagged behind.

An example is codeandoMexico, an open innovation platform that brings together top talent through public challenges. codeandoMexico developed a Citizen's quick response center with a directory of emergency phone numbers, the locations of places that need help, maps, and more. Most of the information on the site is updated from the first-hand experience of volunteers helping throughout the affected areas. Technical collaboration with codeandoMexico is open to everyone by attending to issues on their website; these include extending the services offered by the site or fixing bugs.

Social networks are being used extensively to spread information about where and how to help; however, tweets and Facebook posts are often fake or obsolete. There are several efforts taking place to overcome this difficulty; R-Ladies CDMX co-founder Silvia Gutierrez is working on one such project to map information on shelters, affected buildings and more (http://bit.ly/2xfIwZm).

      The full scope of the damage is yet to be seen, but the support, both from México and other countries, is encouraging.



      To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      Monte Carlo Simulations & the "SimDesign" Package in R

      Wed, 09/20/2017 - 21:43

      (This article was first published on Econometrics Beat: Dave Giles' Blog, and kindly contributed to R-bloggers)

      Past posts on this blog have included several relating to Monte Carlo simulation – e.g., see here, here, and here.

Recently I came across a great article by Matthew Sigal and Philip Chalmers in the Journal of Statistics Education. It's titled "Play it Again: Teaching Statistics With Monte Carlo Simulation", and the full reference appears below. The authors provide a really nice introduction to basic Monte Carlo simulation using R. In particular, they contrast a "for loop" approach with the "SimDesign" R package (Chalmers, 2017). Here's the abstract of their paper:

      “Monte Carlo simulations (MCSs) provide important information about statistical phenomena that would be impossible to assess otherwise. This article introduces MCS methods and their applications to research and statistical pedagogy using a novel software package for the R Project for Statistical Computing constructed to lessen the often steep learning curve when organizing simulation code. A primary goal of this article is to demonstrate how well-suited MCS designs are to classroom demonstrations, and how they provide a hands-on method for students to become acquainted with complex statistical concepts. In this article, essential programming aspects for writing MCS code in R are overviewed, multiple applied examples with relevant code are provided, and the benefits of using a generate–analyze–summarize coding structure over the typical “for-loop” strategy are discussed.”

The SimDesign package provides an efficient and safe template for setting up pretty much any Monte Carlo experiment that you're likely to want to conduct. It's really impressive, and I'm looking forward to experimenting with it. The Sigal-Chalmers paper includes helpful examples with the associated R code and output, so it would be superfluous for me to add that here. Needless to say, the SimDesign package is just as useful for simulations in econometrics as it is for those dealing with straight statistics problems. Try it out for yourself!
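To give a flavour of the generate-analyse-summarise structure the paper advocates, here is a minimal sketch of a rejection-rate study for a two-sample t-test. This is an illustration rather than one of the paper's examples, and the argument names follow the package documentation as I understand it, so check the SimDesign vignette before relying on it:

library(SimDesign)

# Each row of the design is one simulation condition
Design <- expand.grid(n = c(30, 100), effect = c(0, 0.5))

Generate <- function(condition, fixed_objects = NULL) {
  # Simulate two groups whose means differ by `effect`
  with(condition,
       data.frame(g = rep(c("A", "B"), each = n),
                  y = c(rnorm(n, 0), rnorm(n, effect))))
}

Analyse <- function(condition, dat, fixed_objects = NULL) {
  c(reject = as.numeric(t.test(y ~ g, data = dat)$p.value < 0.05))
}

Summarise <- function(condition, results, fixed_objects = NULL) {
  # `results` collects the Analyse output across replications
  c(rejection_rate = mean(results[, "reject"]))
}

runSimulation(design = Design, replications = 500,
              generate = Generate, analyse = Analyse, summarise = Summarise)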
References

Chalmers, R. P., 2017. SimDesign: Structure for Organizing Monte Carlo Simulation Designs. R package version 1.7.

Sigal, M. J. and Chalmers, R. P., 2016. Play it again: Teaching statistics with Monte Carlo simulation. Journal of Statistics Education, 24, 136-156.

© 2017, David E. Giles

      To leave a comment for the author, please follow the link and comment on their blog: Econometrics Beat: Dave Giles' Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      Answer probability questions with simulation (part-2)

      Wed, 09/20/2017 - 18:22

      (This article was first published on R-exercises, and kindly contributed to R-bloggers)

This is the second exercise set on answering probability questions with simulation. Finishing the first exercise set is not a prerequisite. The difficulty level is about the same, so if you are looking for a challenge, aim at writing faster, more elegant algorithms.

      As always, it pays off to read the instructions carefully and think about what the solution should be before starting to code. Often this helps you weed out irrelevant information that can otherwise make your algorithm unnecessarily complicated and slow.

      Answers are available here.

       

      Exercise 1
If you take cards numbered 1-10, shuffle them, and lay them down in order, what is the probability that at least one card matches its position? For example, card 3 comes down third.
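To illustrate the general simulation pattern these exercises expect (an illustrative sketch, not the official answer linked above): shuffle the cards many times with sample() and count how often at least one lands on its own position.

set.seed(1)
n_sim <- 100000
hits <- replicate(n_sim, any(sample(10) == 1:10))
mean(hits)  # close to 1 - 1/e, about 0.63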

      Exercise 2
Consider an election with 3 candidates and 35 voters, each of whom casts his ballot randomly for one of the candidates. What is the probability of a tie for first place?

      Exercise 3
      If you were to randomly find a playing card on the floor every day, how many days would it take on average to find a full standard deck?

      Exercise 4
      Throw two dice. What is the probability the difference between them is 3, 4, or 5 instead of 0, 1, or 2?

      Exercise 5
      What is the expected number of distinct birthdays in a group of 400 people? Assume 365 days and that all are equally likely.

      Exercise 6
What is the probability that a five-card hand from a standard deck of cards has exactly three aces?

      Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to

      • work with different binomial and logistic regression techniques,
      • know how to compare regression models and choose the right fit,
      • and much more.

      Exercise 7
      Randomly select three distinct integers a, b, c from the set {1, 2, 3, 4, 5, 6, 7}.
      What is the probability that a + b > c?

      Exercise 8
Given that a throw of three dice shows three different faces, what is the probability that the sum of the dice is 8?

      Exercise 9
Throw a standard die until you get a 6. What is the expected number of throws (including the throw giving the 6), conditioned on the event that all throws gave even numbers?

      Exercise 10
Choose a two-digit integer at random; what is the probability that it is divisible by each of its digits? This is not exactly a simulation problem, but the concept is similar: make R do the hard work.

       

      (Picture by Gil)

      Related exercise sets:
      1. Answer probability questions with simulation
      2. Probability functions intermediate
      3. Lets Begin with something sample
      4. Explore all our (>1000) R exercises
      5. Find an R course using our R Course Finder directory

      To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      EARL London 2017 – That’s a wrap!

      Wed, 09/20/2017 - 18:04

      (This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

      Beth Ashlee, Data Scientist

After a successful first-ever EARL San Francisco in June, it was time to head back to the birthplace of EARL: London. With more abstracts submitted than ever before, the conference was made up of 54 fantastic talks and 5 keynotes from an impressive selection of industries. With so many talks to pick from, I thought I would summarise a few of my favourites!

      Day 1 highlights:

After brilliant keynotes from Tom Smith (ONS) and RStudio's Jenny Bryan in session 1, Derek Norton and Neera Talbert from Microsoft took us through the Microsoft process of moving a company from SAS to R in session 2. They explained that with the aim of shrinking the 'SAS footprint', it's important to think about the drivers behind a company leaving SAS as well as considering the impact on end users. Their approach focused on converting program logic rather than specific code.

After lunch, Luisa Pires discussed the digital change occurring within the Bank of England. She highlighted the key selling points behind choosing R as a platform and the process behind organizing their journey. They first ran divisional R training before progressing to a data science training programme to enable the use of R as a project tool.

Finishing up the day, Jessica Peterka-Bonetta gave a fascinating talk on sentiment analysis when considering the use of emojis. She demonstrated that even though adding emojis to your sentiment analysis can add complexity to your process, in the right context they can add real value to tracking the sentiment of a trend. It was an engaging talk which prompted some interesting audience questions, such as "What about combinations of emojis; how would they affect sentiment?".

      After a Pimms reception it was all aboard the Symphony Cruiser for a tour of the River Thames. On board we enjoyed food, drinks and live music (which resulted in some impromptu dancing, but what happens at a Conference, stays at the Conference!).

      Day 2 highlights:

Despite a 'lively' evening, EARL kicked off in full swing on Thursday morning. There were three fantastic keynotes, including Hilary Parker's – her talk filled with analogies and movie references describing her reproducible workflow methods – definitely something I could relate to!

The morning session included a talk from Mike Smith from Pfizer. Mike showed us his use of Shiny as a tool for determining wait times when submitting large jobs to Pfizer's high performance compute grid. Mike used real-time results to visualise whether it was beneficial to submit a large job at the current time or wait until later. He outlined some of the frustrations of changing data sources in such a large company and his reluctance to admit he was 'data science-ing'.

      After lunch, my colleague Adnan Fiaz gave an interesting talk on the data pipeline, comparing it to the process involved in oil pipelines. He spoke of the methods and actions that must be taken before being able to process your data and generate results. The comparison showed surprising similarities between the two processes, and clarified the method that we must take at Mango to ensure we can safely, efficiently and aptly analyse data.

The final session of day 2 finished on a high, with a packed room gathering to hear from RStudio's Joe Cheng. Joe highlighted the exciting new asynchronous programming feature that will be available in the next release of Shiny. This new feature is set to revolutionise the responsiveness of Shiny applications, allowing users to overcome the restrictions of R's single-threaded nature.

      I get so much out of EARL each year and this year was no different; I just wish I had a timeturner to get to all the presentations!

      On behalf of the rest of the Mango team, a massive thank you to all our speakers and attendees, we hope you enjoyed EARL as much as we did!

      All slides we have permission to publish are available under the Speakers section on the EARL conference website.


      To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      Preview: ALTREP promises to bring major performance improvements to R

      Wed, 09/20/2017 - 16:55

      (This article was first published on Revolutions, and kindly contributed to R-bloggers)

      Changes are coming to the internals of the R engine which promise to improve performance and reduce memory use, with dramatic impacts in some circumstances. The changes were first proposed by Gabe Becker at the DSC Conference in 2016 (and updated in 2017), and the implementation by Luke Tierney and Gabe Becker is now making its way into the development branches of the R sources.

      ALTREP is an alternative way of representing R objects internally. (The details are described in this draft vignette, and there are a few more examples in this presentation.) It's easiest to illustrate by example.

Today, a vector of numbers in R is represented as a contiguous block of memory in RAM. In other words, if you create the sequence of a million integers 1:1000000, R creates a block of memory 4Mb in size (4 bytes per number) to store each number in the sequence. But what if we could use a more efficient representation just for sequences like this? With ALTREP, a sequence like this is instead represented by just its start and end values, which takes up almost no memory at all. That means, in the future you'll be able to write loops like this:

      for (i in 1:1e10) do_something()

      without getting errors like this:

      Error: cannot allocate vector of size 74.5 Gb
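As a rough point of reference (an illustration, not from the original announcement), you can see what the fully materialised representation costs in today's R:

# In current (pre-ALTREP) R the sequence is fully materialised: roughly
# 4 bytes per integer plus a small header, so about 4 million bytes here.
print(object.size(1:1000000), units = "MB")

# 1:1e10 exceeds the integer range, so R would fall back to doubles
# (8 bytes each), roughly 80 GB, which matches the allocation error above.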

      ALTREP has the potential to make many other operations faster or even instantaneous, even on very large objects. Here are a few examples of functions that could be sped up:

      • is.na(x) — ALTREP will keep track of whether a vector includes NAs or not, so that R no longer has to inspect the entire vector
      • sort(x) — ALTREP will keep track of whether a vector is already sorted, and sorting will be instantaneous in this case
      • x < 5 — knowing that the vector is already sorted, ALTREP can find the breakpoint very quickly (in O(log n) time), and return a "sequence" logical vector that consumes basically no memory
      • match(y,x) — if ALTREP knows that x is already sorted, matching is also much faster
      • as.character(numeric_vector) — ALTREP can defer converting a numeric vector to characters until the character representation is actually needed

      That last benefit will likely have a large impact on the handling of data frames, which carry around a column of character row labels which start out as a numeric sequence. Development builds are already demonstrating a huge performance gain in the linear regression function lm() as a result:

> n <- 10000000
> x <- rnorm(n)
> y <- rnorm(n)

# With R 3.4
> system.time(lm(y ~ x))
   user  system elapsed
  9.225   0.703   9.929

# With ALTREP
> system.time(lm(y ~ x))
   user  system elapsed
  1.886   0.610   2.496

      The ALTREP framework is designed to be extensible, so that package authors can define their own alternative representations of standard R objects. For example, an R vector could be represented as a distributed object as in a system like Spark, while still behaving like an ordinary R vector to the R engine. An example package, simplemmap, illustrates this concept by defining an implementation of vectors as memory-mapped objects that live on disk instead of in RAM.

      There's no definite date yet when ALTREP will be in an official R release, and my guess is that there's likely to be an extensive testing period to shake out any bugs caused by changing the internal representation of R objects. But the fact that the implementation is already making its way into the R sources is hugely promising, and I look forward to testing out the real-world impacts. You can read more about the current state of ALTREP in the draft vignette by Luke Tierney, Gabe Becker and Tomas Kalibera linked below.

      R-Project.org: ALTREP: Alternative Representations for R Objects


      To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      pinp 0.0.2: Onwards

      Wed, 09/20/2017 - 15:00

      (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

      A first update 0.0.2 of the pinp package arrived on CRAN just a few days after the initial release.

      We added a new vignette for the package (see below), extended a few nice features, and smoothed a few corners.

      The NEWS entry for this release follows.

Changes in pinp version 0.0.2 (2017-09-20)
      • The YAML segment can be used to select font size, one-or-two column mode, one-or-two side mode, linenumbering and watermarks (#21 and #26 addressing #25)

• If pinp.cls or jss.bst are not present, they are copied in (#27 addressing #23)

      • Output is now in shaded framed boxen too (#29 addressing #28)

      • Endmatter material is placed in template.tex (#31 addressing #30)

      • Expanded documentation of YAML options in skeleton.Rmd and clarified available one-column option (#32).

      • Section numbering can now be turned on and off (#34)

      • The default bibliography style was changed to jss.bst.

      • A short explanatory vignette was added.

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the pinp page. For questions or comments use the issue tracker off the GitHub repo.

      This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


      To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      Major update of D3partitionR: Interactive viz’ of nested data with R and D3.js

      Wed, 09/20/2017 - 10:00

      (This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

D3partitionR is an R package for interactively visualising nested and hierarchical data using D3.js and HTML widgets. These last few weeks I've been working on a major D3partitionR update, which is now available on GitHub. As soon as enough feedback is collected, the package will be uploaded to CRAN. Until then, you can install it using devtools:

library(devtools)
install_github("AntoineGuillot2/D3partitionR")

      Here is a quick overview of the possibilities using the Titanic data:

      A major update

This is a major update from the previous version and will break code written for 0.3.1.

      New functionalities
• Additional data for nodes: additional data can be added for given nodes. For instance, if a comment or a link needs to be shown in the tooltip or label of some nodes, it can be added through the add_nodes_data function


You can easily add a specific hyperlink or text to the tooltip

• Variable selection and computation: you can now provide a variable for:
        • sizing (i.e. the size of each node)
        • color, any variable from your data.frame or from the nodes data can be used as a color.
        • label, any variable from your data.frame or from the nodes data can be used as a label.
        • tooltip, you can provide several variables to be displayed in the tooltip.
        • aggregation function, when numerical variables are provided, you can choose the aggregation function you want.
• Coloring: The color scale can now be continuous. For instance, you can color each node by the mean survival rate in the Titanic accident, which makes it easy to see quickly that women in 1st class were more likely to survive than men in 3rd class.

      Treemap to show the survival rate to the Titanic accident
• Label: Labels showing the node's name (or any other variable) can now be added to the plot.
• Breadcrumb: To avoid overlapping, the width of each breadcrumb is now variable and dependent on the length of the word

      Variable breadcrumb width
• Legend: By default, the legend now shows all the modalities/levels that are in the plot. To avoid wrapping, enabling the zoom_subset option will only show the modalities in the direct children of the zoomed root.
      API and backend change
• Easy data preprocessing: data preparation was tedious in previous versions. Now you just need to aggregate your data.frame at the right level; it can then be used directly in the D3partitionR functions, so you no longer have to deal with nesting a data.frame, which can be pretty complicated.
##Loading packages
require(data.table)
require(D3partitionR)

##Reading data
titanic_data=fread("train.csv")

##Aggregating data to have unique sequences for the 4 variables
var_names=c('Sex','Embarked','Pclass','Survived')
data_plot=titanic_data[,.N,by=var_names]
data_plot[,(var_names):=lapply(var_names,function(x){data_plot[[x]]=paste0(x,' ',data_plot[[x]]) })]
• The R API is greatly improved: D3partitionR charts are now S3 objects with a clearly named set of functions to add data and to modify the chart's appearance and parameters. Using pipes now makes the D3partitionR syntax look gg-like.
##Treemap
D3partitionR() %>%
  add_data(data_plot, count = 'N', steps = c('Sex','Embarked','Pclass','Survived')) %>%
  set_chart_type('treemap') %>%
  plot()

##Circle treemap
D3partitionR() %>%
  add_data(data_plot, count = 'N', steps = c('Sex','Embarked','Pclass','Survived')) %>%
  set_chart_type('circle_treemap') %>%
  plot()

Style consistency among the different types of chart. It is now easy to switch from a treemap to a circle treemap or a sunburst and keep a consistent styling policy.

Update to d3.js V4 and modularization. Each type of chart now has its own file and function. This function draws the chart at its root level with labels and colors and returns a zoom function. The on-click actions (such as the breadcrumb update or the legend update) and the hover action (tooltips) are defined in a 'global' function.

      Hence, adding new visualizations will be easy, the drawing and zooming script will just need to be adapted to this previous template.

      What’s next

Thanks to the feedback that will be collected over the next few weeks, a stable release should soon be on CRAN. I will also post more resources on D3partitionR, with use cases and examples of Shiny applications built on it.

      The post Major update of D3partitionR: Interactive viz’ of nested data with R and D3.js appeared first on Enhance Data Science.


      To leave a comment for the author, please follow the link and comment on their blog: Enhance Data Science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
