How to install wgrib2 in OSX
(This article was first published on R – Bovine Aerospace, and kindly contributed to Rbloggers)
Prompted by both my own struggles with wgrib2 compilation and a plea on the rNOMADS email listserv, I’m going to describe how to compile and install wgrib2 on Mac OS.
First of all, some background: wgrib2 is an excellent utility written by Wesley Ebisuzaki at NOAA. It allows for a number of swift and stable operations on GRIB2 files (a common file format for weather and climate data). It is also a requirement for using grib files in rNOMADS (the function ReadGrib() in particular).
So here’s how to install it on Mac OS.
1. Get Command Line Tools for Xcode (search for it on DuckDuckGo or your engine of choice) and also make sure gcc is installed. If you don’t know what gcc is, stop now and find someone who does (it will save you a lot of time).
2. Download wgrib2 here (note that the download links are pretty far down the page).
3. Untar the tarball somewhere, and roll up your sleeves. cd into the resulting wgrib directory.
4. In the makefile, uncomment the lines
#export CC=gcc
#export FC=gfortran
Also search for makefile.darwin in the makefile and uncomment the line containing it. You’ll see instructions to this effect in the makefile anyway.
5. Now we have to edit the included libpng package, since it is untarred by the makefile and doesn’t inherit the compiler specifications from step 4. Ensuring that we’re in the wgrib directory:
tar xvf libpng-1.2.57.tar.gz
cd libpng-1.2.57/scripts/
Now edit the makefile.darwin file, changing
CC=cc
to
CC=gcc
Now return to the wgrib directory and retar libpng:
tar cf libpng-1.2.57.tar libpng-1.2.57
gzip libpng-1.2.57.tar
If it asks whether you want to replace the original tar.gz file, say “yes”. What we’ve done here is edit libpng to make sure it uses the right compiler.
6. Finally, type make to build wgrib2.
If you are still having problems (for example, libaec complaining that there is no C compiler), make sure that the compiler commands (gcc, cc, etc.) all point to something other than clang (the default compiler that comes with OSX). You may have to edit your .bash_profile file to ensure this.
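Since most readers will be coming from rNOMADS, here is a minimal sketch of how that check could be done from R. It uses only base R (`Sys.which()` and `system2()`). The helper name `compiler_report` is hypothetical, and the check assumes that a clang-backed command mentions "clang" somewhere in its `--version` banner.

```r
# Hypothetical helper (not part of wgrib2 or rNOMADS): report where each
# compiler command resolves on the PATH and whether it looks like clang.
compiler_report <- function(cmds = c("gcc", "cc", "gfortran")) {
  paths <- Sys.which(cmds)  # named character vector; "" when a command is not found
  is_clang <- vapply(cmds, function(cmd) {
    if (!nzchar(paths[[cmd]])) return(NA)
    # Capture the version banner; assumption: clang identifies itself there
    banner <- tryCatch(
      suppressWarnings(
        system2(paths[[cmd]], "--version", stdout = TRUE, stderr = TRUE)
      ),
      error = function(e) character(0)
    )
    any(grepl("clang", banner, ignore.case = TRUE))
  }, logical(1))
  data.frame(command = cmds,
             path = unname(paths),
             clang = unname(is_clang),
             stringsAsFactors = FALSE)
}

print(compiler_report())
```

A row with `clang = TRUE` for gcc or cc suggests that the command is still an alias for the default OSX compiler; `NA` means the command was not found at all.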
As always, contact me on the form below if you’re having unresolvable issues.
To leave a comment for the author, please follow the link and comment on their blog: R – Bovine Aerospace. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Is dplyr Easily Comprehensible?
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
dplyr is one of the most popular R packages. It is powerful and important. But is it in fact easily comprehensible?
dplyr makes sense to those of us who use it a lot. And we can teach part-time R users a lot of the common good use patterns.
But, is it an easy task to study and characterize dplyr itself?
Please take our advanced dplyr quiz to test your dplyr mettle.
“Pop dplyr quiz, hotshot! There is data in a pipe. What does each verb do?”
simplyR
(This article was first published on Easy Guides, and kindly contributed to Rbloggers)
simplyR is a web space where we’ll be posting practical and easy guides for solving real, important problems using the R programming language.
As we aren’t fans of unnecessary complications, we’ll keep the content of our tutorials / R codes as simple as possible.
Many tutorials are coming soon.
Topics we love include:
 R programming
 Biostatistics
 Genomic data analysis
 Survival analysis
 Machine/statistical learning
 Data visualization
Samples of our recent publications, on R & Data Science, are:
 Correlation matrix : A quick start guide to analyze, format and visualize a correlation matrix using R software
 ggplot2 – Easy way to mix multiple graphs on the same page
 Bar Plots and Modern Alternatives
 Facilitating Exploratory Data Visualization: Application to TCGA Genomic Data
 Add P-values and Significance Levels to ggplots
 fastqcr: An R Package Facilitating Quality Controls of Sequencing Data for Large Numbers of Samples
 Factoextra R Package: Easy Multivariate Data Analyses and Elegant Visualization
 Survival Analysis
 Cluster Analysis
 R xlsx package : A quick start guide to manipulate Excel files in R
 See More…
If you want to contribute, read this: http://www.sthda.com/english/pages/contributetosthda
Zillow Rent Analysis
(This article was first published on R – Journey of Analytics, and kindly contributed to Rbloggers)
Hello Readers,
This is a notification post – Did you realize our website has moved? The blog is live at New JA Blog under the domain http://www.journeyofanalytics.com . You can read about the rent analysis post here.
If you received this post AND an email from anu_analytics, then please disregard this post.
If you received this post update from WordPress, but did NOT receive an email from anu_analytics (via MailChimp email) then please send us an email at anuprv@journeyofanalytics.com . The email from the main site was sent out 4 hours ago. Alternatively, you can sign up using this contact form.
(Email screenshot below)
[Image: JourneyofAnalytics email newsletter]
Again, the latest blog posts are available at blog.journeyofanalytics.com and all code/project files are available under the Projects page.
See you at our new site. Happy Coding!
Filed under: learning resources, Project Updates, R Tagged: inferential statistics, journeyofanalytics, new site, rent analysis
More things with the New Zealand Election Study by @ellis2013nz
(This article was first published on Peter's stats stuff  R, and kindly contributed to Rbloggers)
A new cross tab tool
I recently put up a simple web app, built with R Shiny, to let users explore the relationship between party vote in the 2014 New Zealand general election and a range of demographic and attitudinal questions in the New Zealand Election Study. The image below is a link to the web app:
The original motivation was to answer a question on Twitter for a breakdown of National party vote by gender. I was surprised how interesting I found the resulting tool though. Without a fancy graphic, just a table of numbers, there’s a lot to play around with here. I deliberately kept the functionality narrow, because I wanted to avoid a bewildering array of choices and confusing user interface, so it tries to do only one thing and does it well. The thing it does is show cross tabs of party vote with other variables from the study.
The source code of the Shiny app is available as is the preparation script but they’re quite unremarkable so I won’t reproduce them here; follow the links and read them on GitHub in their natural habitat.
A few interesting statistical points to note:
 I produced a new set of weights so population totals would match the actual party vote. Even after the NZES team did their weighting, the sample wasn’t representative of the population of people on the electoral roll in terms of actual party vote. In fact, people who did not vote were particularly overrepresented. This isn’t that surprising – people who don’t vote for whatever reason (whether it is apathy or being out of the country and busy with other things) are probably also disproportionately likely not to respond to surveys. The web app gives the user the choice of the original NZES weights or my recalibrated ones, defaulting to the latter. I think that’s useful because people might use the app to say “X thousand voters for party Y have Z attitude”, so adding up to the right number of voters by party is important.
 I included an option to see the Pearson residuals, which compare the observed cell count with what would have been expected in the (nearly always implausible) null hypothesis of no relationship at all between the two variables making up the cross tab. I think this is by far the best way to look at which cells of the table are unusual. For example, in the screenshot above, it is highlighted clearly that National voters had unusually strong levels of agreement with the statement “with lower welfare benefits people would learn to stand on their own two feet”, whereas voters for Labour and the Greens were unusually unlikely to agree with that statement (and likely to disagree). This won’t be a surprise for any watchers of New Zealand politics.
 One version I tested had a little Chi square test of the null hypothesis of no relationship between the two. But it was nearly always returning a p value of zero, because of course there is a relationship between these variables. I decided it was uninteresting, and didn’t want to focus people on null hypothesis testing anyway, so left it out.
 I resisted the urge to put multiple numbers in each cell of the table, as is done in some stats package output (eg SPSS). I think tables like these work as visualisations if the eye can sweep across, knowing that every number in the table is somehow comparable. This isn’t possible when you combine values in each cell (eg include both row-wise and column-wise percentages).
 It was interesting to think through what should be the default way of calculating percentages in a table like this. I decided in the end to default to row-wise, which means the user is reading (for example) “Of the people who voted X, what percentage thought Y?” I don’t think there’s a right or wrong, just a contingent guess that this is most likely to be the first want of people.
 An early version of the tool had an option for “margin of error” for each cell and this drew my attention to the difficulty of conceptualising the margin of error in a single cell of a contingency table. I’m going to think more about this one.
 Adding the heatmap colour was a late addition, made easy by the wonder of the easy combination of datatable JavaScript with R via the DT package.
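The row-wise percentages and Pearson residuals described in the points above can be sketched in a few lines of base R. This is a minimal illustration, not the app's code: the counts are made up (they are not NZES figures) and the table is unweighted, whereas the app works with survey weights.

```r
# Made-up counts for illustration only; these are NOT NZES figures.
tab <- matrix(c(30, 10, 20,
                15, 25, 20),
              nrow = 2, byrow = TRUE,
              dimnames = list(party  = c("Party A", "Party B"),
                              answer = c("Agree", "Neutral", "Disagree")))

# Row-wise percentages: "of the people who voted X, what percentage thought Y?"
row_pct <- round(100 * prop.table(tab, margin = 1), 1)

# Expected counts under the null hypothesis of no relationship at all
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson <- (tab - expected) / sqrt(expected)

# chisq.test() exposes the same quantity in its $residuals component
builtin <- chisq.test(tab)$residuals

print(row_pct)
print(round(pearson, 2))
```

Cells with large positive residuals are over-represented relative to the no-relationship baseline and cells with large negative residuals are under-represented, which is what makes them a natural basis for heatmap colouring.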
So I now have two web apps with this data:
 Predict party vote given a combination of demographic characteristics
 Cross tab of your choice of variable with party vote
… and six blog posts. To recap, here are all the blog posts I’ve done so far with this data:
1. Attitudes to the “Dirty Politics” book
In my first post on the data, I did a quick demo analysis of the attitudes of voters for various parties to Nicky Hager’s book “Dirty Politics”.
2. Modelling individual level party vote
I did some reasonably comprehensive modelling of who votes for whom. The main work here was deciding how many degrees of freedom could be spared for the various demographic variables, and clumping/tidying them up into analysis-ready form. This was also a good opportunity for some thinking about modelling strategy, the role of the bootstrap, and multiple imputation, which is essential with this sort of problem.
3. Web app for individual vote
This led to my first web app with the New Zealand Election Study data, which lets you explore the predicted probability of different types of people voting for different parties.
4. Sankey chart of ‘transitions’ from 2011 vote to 2014
This was an interesting experiment in looking at what one survey can tell us about people swapping from party to party.
5. Modelling voter turnout
I adapted my approach of modelling party vote to the perhaps even more important question of who turns out to vote at all.
6. Cross tab tool
The sixth blog post is today’s.
For New Zealand readers, have a good final five weeks up to the 2017 election!
Obstacles to performance in parallel programming
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
Making your code run faster is often the primary goal when using parallel programming techniques in R, but sometimes the effort of converting your code to use a parallel framework leads only to disappointment, at least initially. Norman Matloff, author of Parallel Computing for Data Science: With Examples in R, C++ and CUDA, has shared chapter 2 of that book online, and it describes some of the issues that can lead to poor performance. They include:
 Communications overhead, particularly an issue with fine-grained parallelism consisting of a very large number of relatively small tasks;
 Load balance, where the computing resources aren't contributing equally to the problem;
 Impacts from use of RAM and virtual memory, such as cache misses and page faults;
 Network effects, such as latency and bandwidth, that impact performance and communication overhead;
 Interprocess conflicts and thread scheduling;
 Data access and other I/O considerations.
The chapter is well worth a read for anyone writing parallel code in R (or indeed any programming language). It's also worth checking out Norm Matloff's keynote from the useR!2017 conference, embedded below.
Norm Matloff: Understanding overhead issues in parallel computation
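To make the first item in the list above concrete, here is a minimal sketch (not taken from the book) that runs the same toy computation twice with the base `parallel` package: once sending every element to a worker as its own task, and once sending one chunk per worker. The task size, the two local workers and the use of `clusterApply()` are assumptions chosen for illustration; the timings will differ from machine to machine.

```r
library(parallel)

n  <- 2000
cl <- makeCluster(2)  # assumption: two local worker processes can be started

# Fine-grained: one message per element, so communication dominates the tiny task
fine <- system.time(
  res_fine <- unlist(clusterApply(cl, seq_len(n), function(i) i^2))
)

# Coarse-grained: one contiguous chunk of indices per worker
chunks <- splitIndices(n, length(cl))
coarse <- system.time(
  res_coarse <- unlist(clusterApply(cl, chunks, function(idx) idx^2))
)

stopCluster(cl)

# Same answer either way; only the number of round trips differs
print(identical(res_fine, res_coarse))
print(c(fine = fine[["elapsed"]], coarse = coarse[["elapsed"]]))
```

Functions such as `parLapply()` already chunk their input in this way, which is one reason they are usually preferable to a per-element `clusterApply()` for small tasks.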
Starting a Rmarkdown Blog with Bookdown + Hugo + Github
(This article was first published on R – Tales of R, and kindly contributed to Rbloggers)
Finally, after 24h of failed attempts, I could get my favourite Hugo theme up and running with R Studio and Blogdown.
All the steps I followed are detailed in my new Blogdown entry, which is also a GitHub repo.
After exploring some alternatives, like Shirin’s (with Jekyll) and Amber Thomas’s advice (which involved Git skills beyond my basic abilities), I was able to install Yihui’s hugo-lithium-theme in a new repository.
However, I wanted to explore other blog templates hosted on GitHub, like:
 gcushen/hugo-academic
 jpescador/hugo-future-imperfect and
 kakawait/hugo-tranquilpeak-theme
 Or this one: kishaningithub/hugo-creative-portfolio-theme
The first three themes are currently linked in the blogdown documentation as being the simplest and easiest to set up for inexperienced blog programmers, but I hope the list will grow in the following months. For those who are willing to experiment, the complete list is here.
Finally I chose the hugo-tranquilpeak theme, by Thibaud Leprêtre, for which I mostly followed Tyler Clavelle’s entry on the topic. This approach turned out to be easy and good, given some conditions:
 Contrary to Yihui Xie’s advice, I chose github.io to host my blog, instead of Netlify (I love my desktop integration with GitHub, so it was interesting for me not to move to another service for my static content).
 On my machine, I installed Blogdown & Hugo using RStudio (v 1.1.336).
 In GitHub, it was easier for me to host the blog directly in my main GitHub Pages repository (always named [USERNAME].github.io), in the master branch, following Tyler’s tutorial.
 I have basic knowledge of html, css and javascript, so I didn’t mind tinkering with the theme.
 My custom styles didn’t involve theme rebuilding. At this moment they’re simple cosmetic tricks.
The steps I followed were:
Git & GitHub repos
 Set up a GitHub repo with the name [USERNAME].github.io (in my case auroramareviv.github.io). See this and this.
 Create a git repo in your machine:
 Create manually a new directory called [USERNAME].github.io.
 Run in the terminal (Windows users have to install git first):
 For now, your repo is ready. We will now focus on creating & customising our Blogdown.
 We will open RStudio (v 1.1.336, development version as of today).
 First, you may need to install Blogdown in R:
 In RStudio, select the Menu > File > New Project following the lower half of these instructions. The wizard for setting up a Hugo Blogdown project may not be yet available in your RStudio version (not for much longer probably).
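For readers whose RStudio version lacks the project wizard, the installation and site creation can also be done from the R console. The following is a sketch under assumptions rather than the exact commands used for this post: the wrapper name `setup_blog` is hypothetical, it installs blogdown from CRAN (at the time of writing a development version may have been needed instead), and it assumes the working directory is the empty [USERNAME].github.io folder created earlier. It only defines the function, since every step downloads files.

```r
# Hypothetical wrapper around real blogdown functions; call it interactively
# from inside the (empty) [USERNAME].github.io directory.
setup_blog <- function(theme = "kakawait/hugo-tranquilpeak-theme") {
  # Install blogdown if it is missing (assumption: the CRAN release is sufficient)
  if (!requireNamespace("blogdown", quietly = TRUE)) {
    install.packages("blogdown")
  }
  # Download the Hugo binary that blogdown drives
  blogdown::install_hugo()
  # Scaffold a new site in the working directory using the chosen GitHub theme
  blogdown::new_site(theme = theme)
}

# Usage (not run here because it needs network access):
# setup_blog()
```

Passing a different `user/repo` string as `theme` would scaffold the site with one of the other themes listed earlier.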
Customising paths and styles
Before we build and serve our site, we need to tweak a couple of things in advance, if we want to smoothly deploy our blog into GitHub pages.
Modify config.toml file
To integrate with GitHub Pages, these are the essential modifications at the top of our config.toml file:
 We need to set up the base URL to the “root” of the web page (https://[USERNAME].github.io/ in this case)
 By default, the web page is published in the “public” folder. We need it to be published in the root of the repository, to match the structure of the GitHub master branch:
 Other useful global settings:
We can revisit the config.toml file to make changes to the default settings.
The logo that appears in the corner must be in the root folder. To modify it in the config.toml:
picture = "logo.png" # the path to the logo
The cover (background) image must be located in /themes/hugo-tranquilpeak-theme/static/images. To modify it in the config.toml:
coverImage = "myimage.jpg"
We want some custom css and js. We need to locate them in /static/css and in /static/js respectively.
# Custom CSS. Put here your custom CSS files. They are loaded after the theme CSS;
# they have to be referred from static root. Example
customCSS = ["css/mystyle.css"]
# Custom JS. Put here your custom JS files. They are loaded after the theme JS;
# they have to be referred from static root. Example
customJS = ["js/myjs.js"]
Custom css
We can add arbitrary classes to our css file (see above).
Since I started writing in Bootstrap, I miss it a lot. Since this theme already has bootstrap classes, I brought some others I didn’t find in the theme (they’re available for .md files, but currently not for .Rmd)
Here is my custom css file to date:
/* @import url('https://maxcdn.bootstrapcdn.com/bootswatch/3.3.7/cosmo/bootstrap.min.css'); may conflict with default theme*/
@import url('https://fonts.googleapis.com/icon?family=Material+Icons'); /*google icons*/
@import url('https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css'); /*font awesome icons*/
.input-lg { font-size: 30px; }
.input { font-size: 20px; }
.font-sm { font-size: 0.7em; }
.texttt { font-family: monospace; }
.alert { padding: 15px; margin-bottom: 20px; border: 1px solid transparent; border-radius: 4px; }
.alert-success { color: #3c763d; background-color: #dff0d8; border-color: #d6e9c6; }
.alert-danger, .alert-error { color: #b94a48; background-color: #f2dede; border-color: #eed3d7; }
.alert-info { color: #3a87ad; background-color: #d9edf7; border-color: #bce8f1; }
.alert-gray { background-color: #f2f3f2; border-color: #f2f3f2; }
/*style for printing*/
@media print { .noPrint { display:none; } }
/*link formatting*/
a:link { color: #478ca7; text-decoration: none; }
a:visited { color: #478ca7; text-decoration: none; }
a:hover { color: #82b5c9; text-decoration: none; }
Also, we have fontawesome icons!
Site build with blogdown
Once our theme is ready, we can add some content, modifying or deleting the various examples we will find in /content/post.
We need to make use of Blogdown & Hugo to compile our .Rmd file and create our html post:
blogdown::build_site()
blogdown::serve_site()
In the viewer, at the right side of the IDE, you can examine the resulting html and see if something didn’t go OK.
Deploying the site
Updating the local git repository
This can be done with simple git commands:
cd /Git/[USERNAME].github.io # your path to the repo may be different
git add . # indexes all files that will be added to the local repo
git commit -m "Starting my Hugo blog" # adds all files to the local repo, with a commit message
Pushing to GitHub
git push origin master # we push the changes from the local git repo to the remote repo (GitHub repo)
Just go to the page https://[USERNAME].github.io and enjoy your blog!
R code
Works just the same as in Rmarkdown. R code is compiled into html and published as static web content in a few steps. Welcome to the era of reproducible blogging!
Figure 1 uses the ggplot2 library:
library(ggplot2)
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point()
Rmd source code
You can download it from here
I, for one, welcome the new era of reproducible blogging!
ggvis Exercises (Part2)
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
The ggvis package is used to make interactive data visualizations. The fact that it combines shiny’s reactive programming model and dplyr’s grammar of data transformation makes it a useful tool for data scientists.
This package allows us to implement features like interactivity, but on the other hand every interactive ggvis plot must be connected to a running R session.
Before proceeding, please follow our short tutorial.
Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R and without looking at the answers. Then check the solutions to check your answers.
Exercise 1
Create a list which will include the variables “Horsepower” and “MPG.city” of the “Cars93” data set and make a scatterplot. HINT: Use ggvis() and layer_points().
Exercise 2
Add a slider to the scatterplot of Exercise 1 that sets the point size from 10 to 100. HINT: Use input_slider().
Learn more about using ggvis in the online course R: Complete Data Visualization Solutions. In this course you will learn how to:
 Work extensively with the ggvis package and its functionality
 Learn what visualizations exist for your specific use case
 And much more
Exercise 3
Add a slider to the scatterplot of Exercise 1 that sets the point opacity from 0 to 1. HINT: Use input_slider().
Exercise 4
Create a histogram of the variable “Horsepower” of the “Cars93” data set. HINT: Use layer_histograms().
Exercise 5
Set the width and the center of the histogram bins you just created to 10.
Exercise 6
Add 2 sliders to the histogram you just created, one for width and the other for center with values from 0 to 10 and set the step to 1. HINT: Use input_slider().
Exercise 7
Add the labels “Width” and “Center” to the two sliders respectively. HINT: Use label.
Exercise 8
Create a scatterplot of the variables “Horsepower” and “MPG.city” of the “Cars93” dataset with size = 10 and opacity = 0.5.
Exercise 9
Add to the scatterplot you just created a function which will set the size with the left and right keyboard controls. HINT: Use left_right().
Exercise 10
Add interactivity to the scatterplot you just created using a function that shows the value of the “Horsepower” when you “mouseover” a certain point. HINT: Use add_tooltip().
GoTr – R wrapper for An API of Ice And Fire
(This article was first published on Mango Solutions, and kindly contributed to Rbloggers)
Ava Yang
It’s Game of Thrones time again as the battle for Westeros is heating up. There are tons of ideas, ingredients and interesting analyses out there and I was craving for my own flavour. So step zero, where is the data?
Jenny Bryan’s purrr tutorial introduced the list got_chars, representing character information from the first five books, which seems not much fun beyond exercising list manipulation muscle. However, it led me to An API of Ice and Fire, the world’s greatest source for quantified and structured data from the universe of Ice and Fire, including the HBO series Game of Thrones. I decided to create my own API functions, or better, an R package (inspired by the famous rwars package).
The API resources cover 3 types of endpoint – Books, Characters and Houses. GoTr pulls data in JSON format and parses them to R list objects. httr’s Best practices for writing an API package by Hadley Wickham is another life saver.
The package contains:
 One function got_api()
 Two ways to specify parameters generally, i.e. endpoint type + id or url
 Three endpoint types
## Install GoTr from github
#devtools::install_github("MangoTheCat/GoTr")
library(GoTr)
library(tidyverse)
library(listviewer)
# Retrieve books id 5
books_5 <- got_api(type = "books", id = 5)
# Retrieve characters id 583
characters_583 <- got_api(type = "characters", id = 583)
# Retrieve houses id 378
house_378 <- got_api(type = "houses", id = 378)
# Retrieve pov characters data in book 5
povData <- books_5$povCharacters %>%
  flatten_chr() %>%
  map(function(x) got_api(url = x))
# Helpful functions to check structure of list object
length(books_5)
## [1] 11
names(books_5)
## [1] "url" "name" "isbn" "authors"
## [5] "numberOfPages" "publisher" "country" "mediaType"
## [9] "released" "characters" "povCharacters"
names(house_378)
## [1] "url" "name" "region"
## [4] "coatOfArms" "words" "titles"
## [7] "seats" "currentLord" "heir"
## [10] "overlord" "founded" "founder"
## [13] "diedOut" "ancestralWeapons" "cadetBranches"
## [16] "swornMembers"
str(characters_583, max.level = 1)
## List of 16
## $ url : chr "https://anapioficeandfire.com/api/characters/583"
## $ name : chr "Jon Snow"
## $ gender : chr "Male"
## $ culture : chr "Northmen"
## $ born : chr "In 283 AC"
## $ died : chr ""
## $ titles :List of 1
## $ aliases :List of 8
## $ father : chr ""
## $ mother : chr ""
## $ spouse : chr ""
## $ allegiances:List of 1
## $ books :List of 1
## $ povBooks :List of 4
## $ tvSeries :List of 6
## $ playedBy :List of 1
map_chr(povData, "name")
## [1] "Aeron Greyjoy" "Arianne Martell" "Arya Stark"
## [4] "Arys Oakheart" "Asha Greyjoy" "Brienne of Tarth"
## [7] "Cersei Lannister" "Jaime Lannister" "Samwell Tarly"
## [10] "Sansa Stark" "Victarion Greyjoy" "Areo Hotah"
#listviewer::jsonedit(povData)
Another powerful parameter is query, which allows filtering by a specific attribute such as the name of a character, pagination and so on.
It’s worth knowing about pagination. The first simple request will render a list of 10 elements, since the default number of items per page is 10. The maximum valid pageSize is 50, i.e. if 567 is passed on to it, you still get 50 characters.
# Retrieve character by name
Arya_Stark <- got_api(type = "characters", query = list(name = "Arya Stark"))
# Retrieve characters on page 3, change page size to 20.
characters_page_3 <- got_api(type = "characters", query = list(page = "3", pageSize="20"))
So how do we get ALL books, characters or houses information? The package does not provide the function directly but here’s an implementation.
# Retrieve all books
booksAll <- got_api(type = "books", query = list(pageSize="20"))
# Extract names of all books
map_chr(booksAll, "name")
## [1] "A Game of Thrones" "A Clash of Kings"
## [3] "A Storm of Swords" "The Hedge Knight"
## [5] "A Feast for Crows" "The Sworn Sword"
## [7] "The Mystery Knight" "A Dance with Dragons"
## [9] "The Princess and the Queen" "The Rogue Prince"
## [11] "The World of Ice and Fire" "A Knight of the Seven Kingdoms"
# Retrieve all houses
houses <- 1:9 %>%
  map(function(x) got_api(type = "houses", query = list(page=x, pageSize="50"))) %>%
  unlist(recursive=FALSE)
map_chr(houses, "name") %>% length()
## [1] 444
map_df(houses, `[`, c("name", "region")) %>% head()
## # A tibble: 6 x 2
## name region
##
## 1 House Algood The Westerlands
## 2 House Allyrion of Godsgrace Dorne
## 3 House Amber The North
## 4 House Ambrose The Reach
## 5 House Appleton of Appleton The Reach
## 6 House Arryn of Gulltown The Vale
The houses list is a starting point for a social network analysis: Mirror mirror tell me, who are the most influential houses in the Seven Kingdoms? Stay tuned, for that is the topic of the next blog post.
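The paging arithmetic behind the houses request above (444 houses at 50 per page, hence pages 1 to 9) can be sketched in base R. None of these helpers are part of GoTr: `pages_needed`, `fetch_all` and the stand-in `fake_fetch` are hypothetical names, and the only facts taken from the post are the default page size of 10 and the maximum of 50.

```r
# Hypothetical helpers, not part of GoTr.
max_page_size <- 50  # maximum valid pageSize stated in the post

# Number of requests needed to retrieve n_items at a given page size
pages_needed <- function(n_items, page_size = 10) {
  ceiling(n_items / min(page_size, max_page_size))
}

# Keep requesting pages until one comes back short of a full page
fetch_all <- function(fetch_page, page_size = max_page_size) {
  out <- list()
  page <- 1
  repeat {
    batch <- fetch_page(page, page_size)
    out <- c(out, batch)
    if (length(batch) < page_size) break
    page <- page + 1
  }
  out
}

# Stand-in for the API so the sketch runs offline: 444 dummy items
fake_items <- as.list(seq_len(444))
fake_fetch <- function(page, page_size) {
  start <- (page - 1) * page_size + 1
  if (start > length(fake_items)) return(list())
  fake_items[start:min(start + page_size - 1, length(fake_items))]
}

print(pages_needed(444, 50))
print(length(fetch_all(fake_fetch)))
```

With the real package, `fetch_page` could be a thin wrapper such as `function(page, page_size) got_api(type = "houses", query = list(page = page, pageSize = page_size))`, which removes the need to know the number of pages in advance.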
Thanks to all open resources. Please comment, fork, issue, star the work-in-progress on our GitHub repository.
To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Estimating Gini coefficient when we only have mean income by decile by @ellis2013nz
(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)
Income inequality data
Ideally the Gini coefficient to estimate inequality is based on original household survey data with hundreds or thousands of data points. Often this data isn’t available due to access restrictions from privacy or other concerns, and all that is published is some kind of aggregate measure. Some aggregations include the income at the 80th percentile divided by that at the 20th (or 90 and 10); the number of people at the top of the distribution whose combined income is equal to that of everyone else; or the income of the top 1% as a percentage of all income. I wrote a little more about this in one of my earliest blog posts.
One way aggregated data are sometimes presented is as the mean income in each decile or quintile. This is not the same as the actual quantile values themselves, which are the boundary between categories. The 20th percentile is the value of the 20/100th person when they are lined up in increasing order, whereas the mean income of the first quintile is the mean of all the incomes of a “bin” of everyone from 0/100 to 20/100, when lined up in order.
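To make the distinction concrete, here is a small sketch on made-up data (the integers 1 to 100 standing in for incomes; the variable names are illustrative only). The 20th percentile is a single boundary value, while the mean of the first quintile bin averages everyone below that boundary, so it is necessarily lower.

```r
# Toy "incomes": the integers 1 to 100, already in increasing order
incomes <- 1:100

# The 20th percentile is the boundary between the first and second quintile
p20 <- unname(quantile(incomes, probs = 0.2))

# The mean of the first quintile bin averages everyone at or below that boundary
first_quintile_mean <- mean(incomes[incomes <= p20])

p20                  # 20.8 with R's default quantile type
first_quintile_mean  # 10.5
```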
To explore estimating Gini coefficients from this type of binned data I used data from the wonderful Lakner-Milanovic World Panel Income Distribution database, which is available for free download. This useful collection contains the mean income by decile bin of many countries from 1985 onwards – the result of some careful and doubtless very tedious work with household surveys from around the world. This is an amazing dataset, and amongst other purposes it can be used (as Milanovic and coauthors have pioneered dating back to his World Bank days) in combination with population numbers to estimate “global inequality”, treating everyone on the planet as part of a single economic community regardless of national boundaries. But that’s for another day.
Here’s R code to download the data (in Stata format) and grab the first ten values, which happen to represent Angola in 1995. These particular data are based on consumption, which in poorer economies is often more sensible to measure than income:
library(tidyverse)
library(scales)
library(foreign) # for importing Stata files
library(actuar)  # for access to the Burr distribution
library(acid)
library(forcats)

# Data described at https://www.gc.cuny.edu/CUNY_GC/media/LISCenter/brankoData/LaknerMilanovic2013WorldPanelIncomeDistributionLMWPIDDescription.pdf
# The database has been created specifically for the
# paper “Global Income Distribution: From the Fall of the Berlin Wall to the Great Recession”,
# World Bank Policy Research Working Paper No. 6719, December 2013, published also in World
# Bank Economic Review (electronically available from 12 August 2015).
download.file("https://wfs.gc.cuny.edu/njohnson/www/BrankoData/LM_WPID_web.dta",
              mode = "wb", destfile = "LM_WPID_web.dta")

wpid <- read.dta("LM_WPID_web.dta")

# inc_con whether survey is income or consumption
# group income decile group 1 to 10
# RRinc is average per capita income of the decile in 2005 PPP

# the first 10 rows are Angola in 1995, so let's experiment with them
angola <- wpid[1:10, c("RRinc", "group")]

Here’s the resulting 10 numbers.
And this is the Lorenz curve:
Those graphics were drawn with this code:
ggplot(angola, aes(x = group, y = RRinc)) +
  geom_line() +
  geom_point() +
  ggtitle("Mean consumption by decile in Angola in 1995") +
  scale_y_continuous("Annual consumption for each decile group", label = dollar) +
  scale_x_continuous("Decile group", breaks = 1:10) +
  labs(caption = "Source: Lakner/Milanovic World Panel Income Distribution data") +
  theme(panel.grid.minor = element_blank())

angola %>%
  arrange(group) %>%
  mutate(cum_inc_prop = cumsum(RRinc) / sum(RRinc),
         pop_prop = group / max(group)) %>%
  ggplot(aes(x = pop_prop, y = cum_inc_prop)) +
  geom_line() +
  geom_ribbon(aes(ymax = pop_prop, ymin = cum_inc_prop), fill = "steelblue", alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, colour = "steelblue") +
  labs(x = "Cumulative proportion of population",
       y = "Cumulative proportion of consumption",
       caption = "Source: Lakner/Milanovic World Panel Income Distribution data") +
  ggtitle("Inequality in Angola in 1995", "Lorenz curve based on binned decile mean consumption")

Calculating Gini directly from deciles?
Now, I could just treat these 10 deciles as a sample of 10 representative people (each observation after all represents exactly 10% of the population) and calculate the Gini coefficient directly. But my hunch was that this would underestimate inequality, because of the straight lines in the Lorenz curve above which are a simplification of the real, more curved, reality.
To investigate this issue, I started by creating a known population of 10,000 income observations from a Burr distribution, which is a flexible, continuous nonnegative distribution often used to model income. That looks like this:
population <- rburr(10000, 1, 3, 3)
par(bty = "l", font.main = 1)
plot(density(population), main = "Burr(1,3,3) distribution")

Then I divided the data up into between 2 and 100 bins, took the means of the bins, and calculated the Gini coefficient of the bins. Doing this for 10 bins is the equivalent of calculating a Gini coefficient directly from decile data such as in the Lakner-Milanovic dataset. I got this result, which shows that when you have the means of 10 bins, you are underestimating inequality slightly:
Here’s the code for that little simulation. I make myself a little function to bin data and return the mean values of the bins in a tidy data frame, which I’ll need for later use too:
#' Quantile averages
#'
#' Mean value in binned groups
#'
#' @param y numeric vector to provide summary statistics on
#' @param len number of bins to calculate means for
#' @details this is different from producing the actual quantiles; it is the mean value of y within each bin.
bin_avs <- function(y, len = 10){
  # argument checks:
  if(class(y) != "numeric"){stop("y should be numeric") }
  if(length(y) < len){stop("y should be longer than len")}

  # calculation:
  y <- sort(y)
  bins <- cut(y, breaks = quantile(y, probs = seq(0, 1, length.out = len + 1)))
  tmp <- data.frame(bin_number = 1:len,
                    bin_breaks = levels(bins),
                    mean = as.numeric(tapply(y, bins, mean)))
  return(tmp)
}

ginis <- numeric(99)
for(i in 1:99){
  ginis[i] <- weighted.gini(bin_avs(population, len = i + 1)$mean)$Gini
}
ginis_df <- data.frame(
  number_bins = 2:100,
  gini = ginis
)

ginis_df %>%
  mutate(label = ifelse(number_bins < 11 | round(number_bins / 10) == number_bins / 10, number_bins, "")) %>%
  ggplot(aes(x = number_bins, y = gini)) +
  geom_line(colour = "steelblue") +
  geom_text(aes(label = label)) +
  labs(x = "Number of bins", y = "Gini coefficient estimated from means within bins") +
  ggtitle("Estimating Gini coefficient from binned mean values of a Burr distribution population",
          paste0("Correct Gini is ", round(weighted.gini(population)$Gini, 3), ". Around 25 bins needed for a really good estimate."))

A better method for Gini from deciles?
Maybe I should have stopped there; after all, there is hardly any difference between 0.32 and 0.34; probably much less than the sampling error from the survey. But I wanted to explore if there were a better way. The method I chose was to:
 choose a lognormal distribution that would generate (close to) the 10 decile averages we have;
 simulate individual-level data from that distribution; and
 estimate the Gini coefficient from that simulated data.
I also tried this with a Burr distribution but the results were very unstable. The lognormal approach was quite good at generating data with means of 10 bins very similar to the original data, and gave plausible values of Gini coefficient just slightly higher than when calculated directly from the bins’ means.
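As a cross-check on the simulation step (an addition here, not something the original analysis does): a lognormal distribution has a known closed-form Gini coefficient, G = 2 * pnorm(sdlog / sqrt(2)) - 1, which depends only on the log-scale standard deviation. The sketch below compares that formula with a Gini computed from simulated draws. The helper names gini_lnorm() and gini_sample() are hypothetical, and the value sdlog = 0.8 is arbitrary.

```r
# Closed-form Gini of a lognormal distribution; it does not depend on meanlog
gini_lnorm <- function(sdlog) 2 * pnorm(sdlog / sqrt(2)) - 1

# Sample Gini, treating every observation as equally weighted
gini_sample <- function(x) {
  x <- sort(x)
  n <- length(x)
  2 * sum(seq_len(n) * x) / (n * sum(x)) - (n + 1) / n
}

set.seed(123)
sim <- rlnorm(100000, meanlog = 0, sdlog = 0.8)

# The two numbers should agree closely for a sample this large
c(closed_form = gini_lnorm(0.8), simulated = gini_sample(sim))
```

If the fitted sdlog is available (the second element of the optim() parameters in the approach described above), the closed form could be used in place of the final simulation; whether that is preferable is a judgment call.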
Here’s how I did that:
# algorithm will be iterative
# 1. assume the 10 binned means represent the following quantiles: 0.05, 0.15, 0.25 ... 0.65, 0.75, 0.85, 0.95
# 2. pick the best lognormal distribution that fits those 10 quantile values.
#    Treat as a nonlinear optimisation problem and solve with `optim()`.
# 3. generate data from that distribution and work out what the actual quantiles are
# 4. repeat 2, with these actual quantiles

n <- 10000
x <- angola$RRinc

fn2 <- function(params) {
  sum((x - qlnorm(p, params[1], params[2])) ^ 2 / x)
}

p <- seq(0.05, 0.95, length.out = 10)
fit2 <- optim(c(1,1), fn2)
simulated2 <- rlnorm(n, fit2$par[1], fit2$par[2])
p <- plnorm(bin_avs(simulated2)$mean, fit2$par[1], fit2$par[2])
fit2 <- optim(c(1,1), fn2)
simulated2 <- rlnorm(n, fit2$par[1], fit2$par[2])

And here are the results. The first table shows the means of the bins in my simulated lognormal population (mean) compared to the original data for Angola’s actual deciles in 1995 (x). The next two values, 0.415 and 0.402, are the Gini coefficients from the simulated and original data respectively:
> cbind(bin_avs(simulated2), x)
   bin_number          bin_breaks      mean    x
1           1          (40.6,222]  165.0098  174
2           2           (222,308]  266.9120  287
3           3           (308,393]  350.3674  373
4           4           (393,487]  438.9447  450
5           5           (487,589]  536.5547  538
6           6           (589,719]  650.7210  653
7           7           (719,887]  795.9326  785
8           8      (887,1.13e+03] 1000.8614  967
9           9  (1.13e+03,1.6e+03] 1328.3872 1303
10         10   (1.6e+03,1.3e+04] 2438.4041 2528
> weighted.gini(simulated2)$Gini
          [,1]
[1,] 0.4145321
>
> # compare to the value estimated directly from the data:
> weighted.gini(x)$Gini
         [,1]
[1,] 0.401936

As would be expected from my earlier simulation, the Gini coefficient from the estimated underlying lognormal distribution is very slightly higher than that calculated directly from the means of the decile bins.
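The direct estimate of about 0.402 can be reproduced in base R without the acid package. This is a minimal sketch, assuming the ten rounded decile means printed in the x column above and treating each as one equally weighted observation; gini_from_values() and gini_from_lorenz() are hypothetical names, not functions from the post. The second helper computes the same number as one minus twice the area under the straight-line Lorenz curve, which is the geometric reading behind the earlier hunch about straight lines.

```r
# Rounded decile means for Angola 1995, copied from the printed `x` column
x <- c(174, 287, 373, 450, 538, 653, 785, 967, 1303, 2528)

# Gini treating each value as one equally weighted observation
gini_from_values <- function(x) {
  x <- sort(x)
  n <- length(x)
  2 * sum(seq_len(n) * x) / (n * sum(x)) - (n + 1) / n
}

# Same quantity via the piecewise-linear Lorenz curve (trapezoid rule)
gini_from_lorenz <- function(x) {
  x <- sort(x)
  n <- length(x)
  lorenz <- c(0, cumsum(x) / sum(x))   # Lorenz curve at 0, 1/n, ..., 1
  area <- sum((lorenz[-1] + lorenz[-(n + 1)]) / 2) / n
  1 - 2 * area
}

gini_from_values(x)   # about 0.402
gini_from_lorenz(x)   # same value, computed geometrically
```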
Applying this method to the Lakner-Milanovic inequality data
I rolled up this approach into a function to convert means of deciles into Gini coefficients and applied it to all the countries and years in the World Panel Income Distribution data. Here are the results, first over time:
.. and then as a snapshot
Neither of these is great as a polished data visualisation, but it’s difficult data to present in a static snapshot, and will do for these illustrative purposes.
Here’s the code for that function (which depends on the previously defined ) and drawing those charts. Drawing on the convenience of Hadley Wickham’s dplyr and ggplot2 it’s easy to do this on the fly and in the below I calculate the Gini coefficients twice, once for each chart. Technically this is wasteful, but with modern computers this isn’t a big deal even though there is quite a bit of computationally intensive stuff going on under the hood; the code below only takes a minute or so to run.
#' Convert data that is means of deciles into a Gini coefficient
#'
#' @param x vector of 10 numbers, representing mean income (or whatever) for 10 deciles
#' @param n number of simulated values of the underlying lognormal distribution to generate
#' @details returns an estimate of Gini coefficient that is less biased than calculating it
#' directly from the deciles, which would be slightly biased downwards.
deciles_to_gini <- function(x, n = 1000){
  fn <- function(params) {
    sum((x - qlnorm(p, params[1], params[2])) ^ 2 / x)
  }

  # starting estimate of p based on binned means and parameters
  p <- seq(0.05, 0.95, length.out = 10)
  fit <- optim(c(1,1), fn)

  # calculate a better value of p
  simulated <- rlnorm(n, fit$par[1], fit$par[2])
  p <- plnorm(bin_avs(simulated)$mean, fit$par[1], fit$par[2])

  # new fit with the better p
  fit <- optim(c(1,1), fn)
  simulated <- rlnorm(n, fit$par[1], fit$par[2])
  output <- list(par = fit$par)
  output$Gini <- as.numeric(weighted.gini(simulated)$Gini)
  return(output)
}

# example usage:
deciles_to_gini(x = wpid[61:70, ]$RRinc)
deciles_to_gini(x = wpid[171:180, ]$RRinc)

# draw some graphs:
wpid %>%
  filter(country != "Switzerland") %>%
  mutate(inc_con = ifelse(inc_con == "C", "Consumption", "Income")) %>%
  group_by(region, country, contcod, year, inc_con) %>%
  summarise(Gini = deciles_to_gini(RRinc)$Gini) %>%
  ungroup() %>%
  ggplot(aes(x = year, y = Gini, colour = contcod, linetype = inc_con)) +
  geom_point() +
  geom_line() +
  facet_wrap(~region) +
  guides(colour = FALSE) +
  ggtitle("Inequality over time", "Gini coefficients estimated from decile data") +
  labs(x = "", linetype = "", caption = "Source: Lakner/Milanovic World Panel Income Distribution data")

wpid %>%
  filter(country != "Switzerland") %>%
  mutate(inc_con = ifelse(inc_con == "C", "Consumption", "Income")) %>%
  group_by(region, country, contcod, year, inc_con) %>%
  summarise(Gini = deciles_to_gini(RRinc)$Gini) %>%
  ungroup() %>%
  group_by(country) %>%
  filter(year == max(year)) %>%
  ungroup() %>%
  mutate(country = fct_reorder(country, Gini),
         region = fct_lump(region, 5)) %>%
  ggplot(aes(x = Gini, y = country, colour = inc_con, label = contcod)) +
  geom_text(size = 2) +
  facet_wrap(~region, scales = "free_y", nrow = 2) +
  labs(colour = "", y = "", x = "Gini coefficient",
       caption = "Source: Lakner-Milanovic World Panel Income Distribution") +
  ggtitle("Inequality by country",
          "Most recent year available up to 2008; Gini coefficients are estimated from decile mean income.")

There we go – deciles to Gini fun with world inequality data!
# cleanup
unlink("LM_WPID_web.dta")
Oil leakage… those old BMW’s are bad :)
(This article was first published on R – Longhow Lam's Blog, and kindly contributed to R-bloggers)
Introduction
My first car was a 13-year-old Mitsubishi Colt; I paid 3000 Dutch Guilders for it. I can still remember a friend who did not want me to park this car in front of his house because of possible oil leakage.
Can you get an idea of which cars are likely to leak oil? Well, with open car data from the Dutch RDW you can. RDW is the Netherlands Vehicle Authority in the mobility chain.
RDW Data
There are many data sets that you can download. I have used the following:
 Observed Defects. This set contains 22 mln. records on observed defects at car level (license plate number). Cars in The Netherlands have to be checked yearly, and the findings of each check are submitted to RDW.
 Basic car details. This set contains 9 mln. records, they are all the cars in the Netherlands, license plate number, brand, make, weight and type of car.
 Defects code. This little table provides a description of all the possible defect codes. So I know that code ‘RA02’ in the observed defects data set represents ‘oil leakage’.
I have imported the data into R, and with some simple dplyr statements I have determined, per car make and age (in years), the number of cars with an observed oil leakage defect. I then determined how many cars there are per make and age; dividing the first number by the second gives a so-called oil leak percentage.
For example, in the Netherlands there are 2043 Opel Astra’s that are four years old, there are three observed with an oil leak, so we have an oil leak percentage of 0.15%.
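The calculation can be sketched in a few lines of base R. This is an illustration under assumptions, not the author's code (which uses dplyr on the full RDW files): the data frames fleet and leaks and their column names are hypothetical, the Opel Astra counts are the ones quoted above, and the second row is made up to show a make/age group with no observed leaks.

```r
# Cars on the road per make and age; the second row is made up for illustration
fleet <- data.frame(make   = c("OPEL ASTRA", "EXAMPLE MAKE"),
                    age    = c(4, 15),
                    n_cars = c(2043, 500))

# Cars with an observed oil leak; a group with no leaks has no row here
leaks <- data.frame(make = "OPEL ASTRA", age = 4, n_leaks = 3)

# Left join so that groups without any observed leak are kept with a count of 0
leak_pct <- merge(fleet, leaks, by = c("make", "age"), all.x = TRUE)
leak_pct$n_leaks[is.na(leak_pct$n_leaks)] <- 0

# Oil leak percentage = cars with a leak / all cars, per make and age
leak_pct$pct <- 100 * leak_pct$n_leaks / leak_pct$n_cars
leak_pct
```

For the Opel Astra row this gives 3 / 2043, or roughly 0.15%.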
The graph below shows the oil leak percentages for different car brands and ages. Obviously, the older the car the higher the leak percentage. But look at BMW: waaauwww those old BMW’s are leaking oil like crazy… The few lines of R code can be found here.
Conclusion
There is a lot in the open car data from RDW; you can look at many more aspects and defects of cars. As for my old car: according to this data, Mitsubishis have a low oil leak percentage, even older ones.
Cheers, Longhow
RcppArmadillo 0.7.960.1.0
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
The bimonthly RcppArmadillo release is out with a new version 0.7.960.1.0 which is now on CRAN, and will get to Debian in due course.
And it is a big one. Lots of nice upstream changes from Armadillo, and lots of work on our end as the Google Summer of Code project by Binxiang Ni, plus a few smaller enhancements — see below for details.
Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 379 other packages on CRAN, an increase of 49 since the last CRAN release in June!
Changes in this release relative to the previous CRAN release are as follows:
Changes in RcppArmadillo version 0.7.960.1.0 (2017-08-11)
Upgraded to Armadillo release 7.960.1 (Northern Banana Republic Deluxe)

faster randn() when using OpenMP (NB: usually omitted when used from R)

faster gmm_diag class, for Gaussian mixture models with diagonal covariance matrices

added .sum_log_p() to the gmm_diag class

added gmm_full class, for Gaussian mixture models with full covariance matrices

expanded .each_slice() to optionally use OpenMP for multithreaded execution


Upgraded to Armadillo release 7.950.0 (Northern Banana Republic)

expanded accu() and sum() to use OpenMP for processing expressions with computationally expensive elementwise functions

expanded trimatu() and trimatl() to allow specification of the diagonal which delineates the boundary of the triangular part


Enhanced support for sparse matrices (Binxiang Ni as part of Google Summer of Code 2017)

Add support for dtCMatrix and dsCMatrix (#135)

Add conversion and unit tests for dgT, dtT and dsTMatrix (#136)

Add conversion and unit tests for dgR, dtR and dsRMatrix (#139)

Add conversion and unit tests for pMatrix and ddiMatrix (#140)

Rewrite conversion for dgT, dtT and dsTMatrix, and add file-based tests (#142)

Add conversion and unit tests for indMatrix (#144)

Rewrite conversion for ddiMatrix (#145)

Add a warning message for matrices that cannot be converted (#147)

Add new vignette for sparse matrix support (#152; Dirk in #153)

Add support for sparse matrix conversion from Python SciPy (#158 addressing #141)


Optional return of row or column vectors in collapsed form if appropriate #define is set (Serguei Sokol in #151 and #154)

Correct speye() for nonsymmetric cases (Qiang Kou in #150 closing #149).

Ensure tests using Scientific Python and reticulate are properly conditioned on the packages being present.

Added .aspell/ directory with small local directory now supported by R-devel.
Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
2017 App Update
(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)
As you may have noticed, we have made a few changes to our apps for the 2017 season to bring you a smoother and quicker experience while also adding more advanced and customizable views.
Most visibly, we moved the apps to Shiny so we can continue to build on our use of R and add new features and improvements throughout the season. We expect the apps to handle high traffic loads better this season, particularly during draft season and other traffic peaks.
In addition to the ability to create and save custom settings, you can also choose the columns you view in our Projections tool. We have also added more advanced metrics such as weekly VOR and Projected Points Per Dollar (ROI) for those of you in auction leagues. With a free account, you’ll be able to create and save one custom setting. If you get an FFA Insider subscription, you’ll be able to create and save unlimited custom settings.
Up next is the ability to upload custom auction values to make it easier to use during auction drafts.
We are also always looking to add new features, so feel free to drop us a suggestion in the Comments section below!
The post 2017 App Update appeared first on Fantasy Football Analytics.
Chapman University DataFest Highlights
(This article was first published on R Views, and kindly contributed to R-bloggers)
Editor’s Note: The 2017 Chapman University DataFest was held during the weekend of April 21-23. The 2018 DataFest will be held during the weekend of April 27-29.
DataFest was founded by Rob Gould in 2011 at UCLA with 40 students. In just seven years, it has grown to 31 sites in three countries. Have a look at Mine Çetinkaya-Rundel’s post Growth of DataFest over the years for the details. In recent years, it has been difficult for UCLA to keep up with the growing interest and demand from southern California universities. This year, the Chapman DataFest became the second DataFest site in southern California, and the largest inaugural DataFest in the history of the event. We had 65 students who stayed the whole weekend from seven universities organized into 15 teams.
The event began on a Friday evening with Professor Rob Gould, the “founder” of DataFest, giving advice on goals for the weekend. He then introduced the Expedia dataset: nearly 11 million records representing users’ online searches for hotels, plus an associated file with detailed information about the hotel destinations.
Throughout the weekend, the organizers kept students motivated with data challenges (with cell phone chargers awarded as prizes), a mini-talk on tools for joining and merging data files, and a tutorial from bitScoop on using their API integration platform.
At noon on Sunday, the students submitted their two-slide presentations via email. At 1 pm, each team had five minutes to show their findings to the six-judge panel: Johnny Lin (UCLA), Joe Kurian (Mitsubishi UFG Union Bank, Irvine), Tao Song (Spectrum Pharmaceuticals), Pamela Hsu (Spectrum Pharmaceuticals), Lynn Langit (AWS, GCP IoT), and Brett Danaher (Chapman University).
The judges announced winners in three official categories:
Best Insight: CSU Northridge team “Mean Squares” (Jamie Decker, Matthew Jones, Collin Miller, Ian Postel, and Seyed Sajjadi). [See Seyed’s description of his team’s experience!]
Best Visualization: Chapman University team “Winners ‘); Drop Table” (Dylan Bowman, William Cortes, Shevis Johnson, and Tristan Tran).
Best Use of External Data: Chapman University team “BEST” (Brandon Makin, Sarah Lasman, and Timothy Kristedja).
Additionally, “Judges’ Choice” awards for “Best Use of Statistical Models” went to the USC “Big Data” team (Hsuanpei Lee, Omar Lopez, Yi Yang Tan, Grace Xu, and Xuejia Xu) and the USC “Quants” team (Cheng (Serena) Cheng, Chelsea Lee, and Hossein Shafii).
All winners were given certificates and medallions designed by Chapman’s Ideation Lab and printed on Chapman’s MLAT Lab 3D printer (see photo).
Winners also received free student memberships in the American Statistical Association.
Many thanks go to the Silver Sponsors: Children’s Hospital Orange County Medical Intelligence and Innovation Institute, Southern California Chapter of the American Statistical Association, and Chapman University MLAT Lab; and Bronze Sponsors: Experian, RStudio, Chapman University Computational and Data Sciences and Schmid College of Science and Technology, Orange County Long Beach ASA Chapter, the Missing Variables, USC Stats Club, Luke Thelen, and Google.
Thanks also to the 45 VIP consultants from BitScoop Labs, Chapman University, Compatiko, CSU Fullerton, CSU Long Beach, CSU San Bernardino, Education Management Services, Freelance Data Analysis, Hiner Partners, Mater Dei High School, Nova Technologies, Otoy, Southern California Edison, Sonikpass, Startup, SurEmail, UC Irvine, UCLA, USC, and Woodbridge High School, many of whom spent most of the weekend working with the students.
Overall, participants were enthusiastic about meeting students from other schools and the opportunity to work with the local professionals. (See the two student perspectives below.) DataFest will continue to grow as these students return to their schools and share their enthusiasm with their classmates!
The Mean Squares Perspective
by Seyed Sajjadi
For most of our team, this DataFest was only the first or second hackathon they ever attended, but the group gelled instantly.
Culture is important for a hackathon group, but talent and preparation play key roles in the success or failure. Our group spent more than a month in advance preparing for this competition. We practiced, practiced, and practiced some more for this event. We had weekly workshops where we presented the assignments that we had worked on for the past week.
The next essential for the competition may come as a surprise to most: having an artist design and prepare the presentation took an enormous amount of work off our shoulders. During the entire competition, we had a very talented artist design a fabulous slideshow for the presentation. This may sound boastful, but allowing specialized talent to work on the slideshow the entire competition is a lot better than designing it at the last minute.
The questions that were asked were not specific at all, and it was up to the participants to form and ask the proper questions. We focused on two questions: customer acquisition and retention/conversion. We showed that online targeting and marketing can be optimized using regional historical data, meaning that residents of a given state tend to have similar preferences when it comes to the same destinations. For instance, most Californians go to Las Vegas to gamble, but most people from Texas go to Las Vegas for music events; these analyses can be used to better target potential customers from neighboring regions.
Regarding customer retention and conversion of lookers to bookers, we calculated the optimum point in time where Expedia can offer more special packages; this time frame happened to be around 14 sessions of interaction between the customer and the website. The biggest part of our analysis was achieved via hierarchical clustering.
A big aspect of the event had to do with the atmosphere and the organization. They invited people from industry to come and roam around the halls, which led to a great opportunity to meet professionals in the field of data science. We were situated in a huge room with all of the teams. We ended up crowding around a small table with everyone on their laptops and chairs. The room was big enough to have impromptu meetings, which allowed a lot of room to breathe. This hackathon was a huge growing experience for all of us on “The Mean Squares”.
Team Pineapples’ Perspective
by Annelise Hitchman
On day one, I could tell my enthusiasm to start working on the dataset was matched by the other dozens of students participating. The room was filled with interaction, and not just among the individual teams. I enjoyed talking with all the consultants in the room about the data, our approach, and even just learning about what they did for work. DataFest introduced me to real-world data that I had never seen in my classes. I learned quite a bit about data analysis from both my own team members and nearly everyone else at the event. Watching the final presentations was an inspiring and insightful end to DataFest. I really hope that DataFest is able to continue and be available to universities such as my own, so that all students interested in data analysis can participate.
Michael Fahy is Professor of Mathematics and Computer Science and Associate Dean, Schmid College of Science & Technology at Chapman University
RStudio Server Pro is ready for BigQuery on the Google Cloud Platform
(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)
RStudio is excited to announce the availability of RStudio Server Pro on the Google Cloud Platform.
RStudio Server Pro GCP is identical to RStudio Server Pro, but with additional conveniences for data scientists, including pre-installation of multiple versions of R, common system libraries, and the BigQuery package for R.
RStudio Server Pro GCP adapts to your circumstances: you can choose a different GCP computing instance for RStudio Server Pro, of whatever size, whenever a project requires it (pricing is hourly).
If the enhanced security, support for multiple R versions and multiple sessions, and commercially licensed and supported features of RStudio Server Pro appeal to you, please give RStudio Server Pro for GCP a try. Below are some useful links to get you started:
20 years of the R Core Group
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
The first "official" version of R, version 1.0.0, was released on February 29, 2000. But the R Project had already been underway for several years before then. Sharing this tweet, from yesterday, from R Core member Peter Dalgaard:
It was twenty years ago today, Ross Ihaka got the band to play…. #rstats pic.twitter.com/msSpPz2kyA
— Peter Dalgaard (@pdalgd) August 16, 2017
Twenty years ago, on August 16, 1997, the R Core Group was formed. Before that date, the committers to R were the project's founders Ross Ihaka and Robert Gentleman, along with Luke Tierney, Heiner Schwarte and Paul Murrell. The email above was the invitation for Kurt Hornik, Peter Dalgaard and Thomas Lumley to join as well. With the sole exception of Schwarte, all of the above remain members of the R Core Group, which has since expanded to 21 members. These are the volunteers who implement the R language and its base packages, document, build, test and release it, and manage all the infrastructure that makes that possible.
Thank you to all the R Core Group members, past and present!
Probability functions beginner
In this set of exercises, we are going to explore some of the probability functions in R with practical applications. Basic probability knowledge is required.
Note: We are going to use random number and random process functions in R, such as runif. One problem with these functions is that every time you run them you obtain a different value. To make your results reproducible, specify the seed with set.seed(n), where n is any number, before calling a random function. (If you are not familiar with seeds, think of them as the tracking number of your random numbers.) For this set of exercises we will use set.seed(1); don’t forget to set it before every random exercise.
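As a small illustration of the note above, the sketch below shows how resetting the seed replays the same random draws. The seed value 42 and the object names are arbitrary choices for this example, not part of the exercises.

```r
# Set a seed, draw three uniform numbers, then repeat with the same seed.
set.seed(42)            # 42 is an arbitrary value chosen for this sketch
first <- runif(3)

set.seed(42)            # resetting the same seed replays the same stream
second <- runif(3)

identical(first, second)   # TRUE: same seed, same numbers

# Without resetting the seed, the random stream simply continues.
third <- runif(3)
identical(first, third)    # FALSE: these are the next draws in the stream
```

This is why the exercises ask you to call set.seed(1) again before every random exercise rather than once at the top of the script.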
Answers to the exercises are available here
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Generating random numbers. Set your seed to 1 and generate 10 random numbers using runif and save it in an object called random_numbers.
Exercise 2
Using the function ifelse and the object random_numbers, simulate coin tosses. Hint: If random_numbers is greater than 0.5 then the result is head, otherwise it is tail.
Another way of generating random coin tosses is by using the rbinom function. Set the seed again to 1 and simulate with this function 10 coin tosses. Note: The value you will obtain is the total number of heads of those 10 coin tosses.
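The note about rbinom returning a total depends on how its arguments are set. The sketch below, which uses an arbitrary seed and hypothetical object names, contrasts one experiment of ten tosses with ten experiments of one toss each.

```r
set.seed(123)  # arbitrary seed for this sketch, not the one used in the exercises

# One experiment made of 10 tosses: a single number, the count of heads.
total_heads <- rbinom(n = 1, size = 10, prob = 0.5)

# Ten experiments of 1 toss each: a vector of ten 0/1 outcomes.
tosses <- rbinom(n = 10, size = 1, prob = 0.5)
labels <- ifelse(tosses == 1, "head", "tail")

length(total_heads)  # 1
length(tosses)       # 10
table(labels)        # how many heads and tails were drawn
```

In rbinom, n is the number of values returned and size is the number of trials behind each value, so swapping the two changes what the output means.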
Exercise 3
Use the function rbinom to generate 10 unfair coin tosses with a success probability of 0.3. Set the seed to 1.
Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to work with different binomial and logistic regression techniques, how to compare regression models and choose the right fit, and much more.
Exercise 4
We can simulate rolling a die in R with runif. Save in an object called die_roll 1 random number with min = 1 and max = 6. This means that we will generate a random number between 1 and 6.
Apply the function ceiling to die_roll. Don’t forget to set the seed to 1 before calling runif.
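One detail worth checking when combining runif with ceiling: runif draws from the open interval between min and max, so rounding up values drawn between 1 and 6 gives faces 2 to 6 and practically never 1. The sketch below is an illustration with an arbitrary seed and hypothetical object names. It compares the exercise's bounds with min = 0, which yields all six faces. It is not taken from the solutions page.

```r
set.seed(2024)  # arbitrary seed for this sketch
n <- 10000      # many rolls, so every reachable face shows up

# Bounds as stated in the exercise: values in (1, 6), rounded up.
rolls_min1 <- ceiling(runif(n, min = 1, max = 6))

# Alternative bounds: values in (0, 6), rounded up.
rolls_min0 <- ceiling(runif(n, min = 0, max = 6))

table(rolls_min1)  # faces 2 to 6 only
table(rolls_min0)  # faces 1 to 6, roughly equally often
```

For a fair six-sided die, sample(1:6, n, replace = TRUE) is another common way to get the same result without rounding.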
Exercise 5
Simulate normal distribution values. Imagine a population in which the average height is 1.70 m with a standard deviation of 0.1. Using rnorm, simulate the height of 100 people and save it in an object called heights.
To get an idea of the values of heights, apply the function summary to it.
Exercise 6
a) What’s the probability that a person will be shorter than 1.90 m? Use pnorm.
b) What’s the probability that a person will be taller than 1.60 m? Use pnorm.
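The two questions above differ only in which tail of the distribution they ask about. As a sketch with made-up parameters (a mean of 100 and a standard deviation of 15, unrelated to the heights in the exercise), pnorm returns the lower tail by default and the upper tail with lower.tail = FALSE.

```r
# Hypothetical distribution for illustration only.
mu <- 100
sigma <- 15

# P(X < 115): pnorm gives the lower tail by default.
p_below <- pnorm(115, mean = mu, sd = sigma)

# P(X > 115): either ask for the upper tail directly...
p_above <- pnorm(115, mean = mu, sd = sigma, lower.tail = FALSE)

# ...or subtract the lower tail from 1.
p_above_alt <- 1 - p_below

round(p_below, 4)               # 0.8413 (115 is one sd above the mean)
all.equal(p_above, p_above_alt) # TRUE
```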
Exercise 7
The waiting time (in minutes) at a doctor’s clinic follows an exponential distribution with a rate parameter of 1/50. Use the function rexp to simulate the waiting time of 30 people at the doctor’s office.
Exercise 8
What’s the probability that a person will wait less than 10 minutes? Use pexp
Exercise 9
What’s the average waiting time?
Exercise 10
Let’s assume that patients with a waiting time longer than 60 minutes leave. Out of 30 patients that arrive at the clinic, how many are expected to leave? Use qexp
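To keep the exponential helpers apart, the sketch below uses a hypothetical rate of 1/20 and 50 arrivals, both different from the exercise. In it, pexp turns a waiting time into a probability, qexp turns a probability back into a waiting time, and the theoretical mean is 1/rate. Which of these the solutions page relies on is not stated here.

```r
rate <- 1 / 20   # hypothetical rate: one patient served every 20 minutes on average

# Theoretical mean of an exponential distribution is 1 / rate.
mean_wait <- 1 / rate

# P(wait > 30 minutes), using the upper tail of pexp.
p_over_30 <- pexp(30, rate = rate, lower.tail = FALSE)

# Expected number, out of 50 hypothetical arrivals, who wait more than 30 minutes.
expected_over_30 <- 50 * p_over_30

# qexp is the inverse of pexp: the time by which 90% of patients have been seen.
t_90 <- qexp(0.9, rate = rate)
pexp(t_90, rate = rate)   # 0.9, recovering the probability we started from
```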
Tesseract and Magick: High Quality OCR in R
(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)
Last week we released an update of the tesseract package to CRAN. This package provides R bindings to Google's OCR library Tesseract.
    install.packages("tesseract")

The new version ships with the latest libtesseract 3.05.01 on Windows and MacOS. Furthermore, it includes enhancements for managing language data and using tesseract together with the magick package.

Installing Language Data
The new version has several improvements for installing additional language data. On Windows and MacOS you use the tesseract_download() function to install additional languages:

    tesseract_download("fra")

Language data are now stored in rappdirs::user_data_dir('tesseract'), which makes them persist across updates of the package. To OCR French text:

    french <- tesseract("fra")
    text <- ocr("https://jeroen.github.io/images/french_text.png", engine = french)
    cat(text)

Très Bien! Note that on Linux you should not use tesseract_download but instead install languages using apt-get (e.g. tesseract-ocr-fra) or yum (e.g. tesseract-langpack-fra).

Tesseract and Magick
The tesseract developers recommend cleaning up the image before OCR'ing it to improve the quality of the output. This involves things like cropping out the text area, rescaling, increasing contrast, etc.
The rOpenSci magick package is well suited to this task. The latest version contains a convenient wrapper image_ocr() that works with pipes.

    devtools::install_github("ropensci/magick")

Let's give it a try on some example scans:
    # Requires devel version of magick
    # devtools::install_github("ropensci/magick")

    # Test it
    library(magick)
    library(magrittr)

    text <- image_read("https://courses.cs.vt.edu/csonline/AI/Lessons/VisualProcessing/OCRscans_files/bowers.jpg") %>%
      image_resize("2000") %>%
      image_convert(colorspace = 'gray') %>%
      image_trim() %>%
      image_ocr()

    cat(text)

The Llfe and Work of Fredson Bowers by G. THOMAS TANSELLE N EVERY FIELD OF ENDEAVOR THERE ARE A FEW FIGURES WHOSE ACCOM plishment and inﬂuence cause them to be the symbols of their age; their careers and oeuvres become the touchstones by which the ﬁeld is measured and its history told. In the related pursuits of analytical and descriptive bibliography, textual criticism, and scholarly editing, Fredson Bowers was such a ﬁgure, dominating the four decades after 1949, when his Principles of Bibliographical Description was pub lished. By 1973 the period was already being called “the age of Bowers”: in that year Norman Sanders, writing the chapter on textual scholarship for Stanley Wells's Shakespeare: Select Bibliographies, gave this title to a section of his essay. For most people, it would be achievement enough to rise to such a position in a ﬁeld as complex as Shakespearean textual studies; but Bowers played an equally important role in other areas. Editors of ninetcemhcemury American authors, for example, would also have to call the recent past “the age of Bowers," as would the writers of descriptive bibliographies of authors and presses. His ubiquity in the broad ﬁeld of bibliographical and textual study, his seemingly com plete possession of it, distinguished him from his illustrious predeces sors and made him the personiﬁcation of bibliographical scholarship in his time.
\Vhen in 1969 Bowers was awarded the Gold Medal of the Biblio graphical Society in London, John Carter’s citation referred to the Principles as “majestic," called Bowers's current projects “formidable," said that he had “imposed critical discipline" on the texts of several authors, described Studies in Bibliography as a “great and continuing achievement," and included among his characteristics "uncompromising seriousness of purpose” and “professional intensity." Bowers was not unaccustomed to such encomia, but he had also experienced his share of attacks: his scholarly positions were not universally popular, and he expressed them with an aggressiveness that almost seemed calculated to

Not bad but not perfect. Can you do a better job?
Update on Our ‘revisit’ Package
(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)
On May 31, I made a post here about our R package revisit, which is designed to help remedy the reproducibility crisis in science. The intended user audience includes
 reviewers of research manuscripts submitted for publication,
 scientists who wish to confirm the results in a published paper, and explore alternate analyses, and
 members of the original research team itself, while collaborating during the course of the research.
The package is documented mainly in the README file, but we now also have a paper on arXiv.org, which explains the reproducibility crisis in detail, and how our package addresses it. Reed Davis and I, the authors of the software, are joined in the paper by Prof. Laurel Beckett of the UC Davis Medical School, and Dr. Paul Thompson of Sanford Research.
Visualising Water Consumption using a Geographic Bubble Chart
(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)
A geographic bubble chart is a straightforward method to visualise quantitative information with a geospatial relationship. Last week I was in Vietnam helping the Phú Thọ Water Supply Joint Stock Company with their data science. They asked me to create …
The post Visualising Water Consumption using a Geographic Bubble Chart appeared first on The Devil is in the Data.