R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Using R to study the Yemen Conflict with night light images

Sun, 03/26/2017 - 12:38

The Yemeni civil war has received very little attention despite the growing humanitarian disaster. There is a lack of reliable figures on the extent of the human suffering in Yemen, and the little data that is available suggests that it is immense. According to the UN, from March 2015 to August 2016 over 10,000 people were killed in Yemen, including 3,799 civilians.

This note asks whether high-frequency satellite images can capture the extent to which conflict is ongoing in Yemen and, in particular, whether there is distinct geographic variation indicating which areas are most affected by the ongoing conflict.

Can the effect and the spatial incidence of the conflict in Yemen be traced through satellite images?

Satellite images have been used to study urban sprawl and general economic growth and development. The extent to which satellite images can be used to study man-made disasters such as conflicts is not widely explored.

Many other papers have used various night light data sets to study urbanization, ethnic favoritism, and economic growth (see Henderson et al., 2012; Michalopoulos and Papaioannou, 2013; Hodler and Raschky, 2014).

In related work (Fetzer et al., 2016), I studied the extent to which light emissions in the early 1990s can be used to measure the extent of power rationing in Colombia following El Niño-induced droughts. In another project, we use the DMSP night light images to study the evolution of cities over time and how democratization can change the relative distribution of cities (Fetzer and Shanghavi, 2015).

Since 2012, high-frequency, high-resolution VIIRS satellite images capturing night light emissions have been available from NASA's Earth Observation Group. They have now been made available for analysis on Google's Earth Engine, making them much more accessible to the wider research audience.

Let's have a look at night lights in Yemen before and after the Saudi Arabian military intervention.

Average VIIRS lights after the Saudi intervention in Yemen started.

Average VIIRS lights for the period before the Saudi intervention in Yemen.

The light scales are identical, indicating that, relative to the border with Saudi Arabia, night light emissions from Yemen have dropped dramatically, especially around the capital city Sana'a. The circular blobs are around the main oil- and gas-producing parts of Yemen, where there may be light emissions due to the flaring of natural gas.

A minimum average light emission of 0.5 was imposed.
Zooming in to Sana’a, the figures look as follows.

Average light emissions from Sana’a since the Saudi intervention in Yemen started.

 

Average light emissions from Sana’a for period before the Saudi intervention in Yemen.

Having a look at the data

library(data.table)
library(foreign)
library(plyr)
library(parallel)
options(stringsAsFactors = FALSE)
setwd("/Users/thiemo/Dropbox/Research/Yemen")

# A DATA SET OF 34k populated places (or historically populated places)
YE <- data.table(read.csv(file = "~/Dropbox/Research/Yemen/Yemen-Cities.csv"))

# LIGHTS DATA IS FROM VIIRS images made available on the Google Earth Engine
LIGHTS <- data.table(read.csv(file = "~/Dropbox/Research/Yemen/lightsall.csv"))
LIGHTS[, `:=`(year, as.numeric(substr(system.index, 1, 4)))]
LIGHTS[, `:=`(month, as.numeric(substr(system.index, 5, 6)))]
LIGHTS[, `:=`(.geo, NULL)]
LIGHTS[, `:=`(UFI, NULL)]
LIGHTS[, `:=`(LONGITUDE, NULL)]
LIGHTS[, `:=`(date, strptime(paste(year, month, "01", sep = "-"), "%Y-%m-%d"))]
LIGHTS <- join(LIGHTS, YE)
## Joining by: rownum

Some simple plots are quite suggestive. The following plots the average light emissions around populated places over time, by month. The onset of the intervention, which coincides with the fall of Sana'a, is followed by a dramatic drop in light emissions.

Average lights dropped by almost two-thirds, suggesting a standstill in economic activity. Overall light emissions are still visible, as indicated in the graphs, suggesting that these places do not turn pitch black.

plot(LIGHTS[, mean(list), by = date], type = "l")

The Distributional Effects of the Conflict

The Houthi movement has been gaining influence over a longer period. In particular, since 2012 the Houthis have gained influence, spreading from the north to the south. The European Council on Foreign Relations has produced maps illustrating the spatial expansion of Houthi control in Yemen.

A central question relates to the strategy of the Saudi military intervention. In particular, whether the intervention is aimed at territories that came under Houthi control since 2012 or whether the intervention is targeted at the Houthi-heartland.

A simple exercise that speaks to this question is to look at the evolution of lights in the northern Houthi heartland relative to the populated places in the rest of the country that came under Houthi control after 2012.

What constitutes the Houthi heartland is subject to contention, but a conservative definition may consist of the four governorates Ammran, Sada'ah, Al Jawf and Hajjah.

LIGHTS[, `:=`(HOUTHI, as.numeric(ADM1 %in% c("15", "22", "21", "19")))]
require(ggplot2)
ggplot(LIGHTS[, mean(list), by = c("HOUTHI", "date")],
       aes(date, V1, colour = as.factor(HOUTHI))) +
  geom_line() + geom_point() + theme_bw() + theme(legend.position = "bottom")

 

The summary statistics suggest that, in absolute terms, the drop is much larger in the non-Houthi heartland, though the initial level in the Houthi heartland is much lower, suggesting that that part of the country is much less developed. Given that there is a notional minimum light emission of zero, this truncation of the data is a concern.

One way around this is to dummify the lights measure and look at whether a populated place is lit above a certain threshold.

LIGHTS[, `:=`(anylit, list > 0.25)]
ggplot(LIGHTS[, mean(anylit), by = c("HOUTHI", "date")],
       aes(date, V1, colour = as.factor(HOUTHI))) +
  geom_line() + geom_point() + theme_bw() + theme(legend.position = "bottom")

Again, it is hard to see whether there is any divergence in trends in this dummified measure, but it is naturally less prone to being affected by the truncation inherent to this type of data.

A regression with location and time fixed effects, measuring whether there was a distinct change in night lights in the Houthi heartland relative to the non-Houthi heartland, suggests that there is indeed a marked difference, indicating that the conflict is concentrated in the non-Houthi heartland.

Defining the discrete variable for a difference-in-differences estimation and loading the lfe package, which allows for high-dimensional fixed effects:

LIGHTS[, `:=`(anylit, list > 0.25)]
LIGHTS[, `:=`(postKSAintervention, as.numeric(date > "2015-03-01"))]
library(lfe)
LIGHTS[, `:=`(date, as.factor(date))]

Running the actual difference-in-differences regressions:

# levels
summary(felm(list ~ postKSAintervention:HOUTHI | rownum + date | 0 | ADM1, data = LIGHTS))
##
## Call:
##    felm(formula = list ~ postKSAintervention:HOUTHI | rownum + date | 0 | ADM1, data = LIGHTS)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -74.347  -0.205   0.043   0.194  82.063
##
## Coefficients:
##                            Estimate Cluster s.e. t value Pr(>|t|)
## postKSAintervention:HOUTHI   0.4184       0.1900   2.202   0.0277 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.758 on 1172455 degrees of freedom
## Multiple R-squared(full model): 0.752   Adjusted R-squared: 0.7447
## Multiple R-squared(proj model): 0.003315   Adjusted R-squared: -0.02603
## F-statistic(full model, *iid*): 103 on 34519 and 1172455 DF, p-value: < 2.2e-16
## F-statistic(proj model): 4.848 on 1 and 22 DF, p-value: 0.03846

# dummified measure
summary(felm(anylit ~ postKSAintervention:HOUTHI | rownum + date | 0 | ADM1, data = LIGHTS))
##
## Call:
##    felm(formula = anylit ~ postKSAintervention:HOUTHI | rownum + date | 0 | ADM1, data = LIGHTS)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.12247 -0.10416  0.00593  0.06185  1.06958
##
## Coefficients:
##                            Estimate Cluster s.e. t value Pr(>|t|)
## postKSAintervention:HOUTHI  0.08470      0.02359    3.59  0.00033 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2223 on 1172455 degrees of freedom
## Multiple R-squared(full model): 0.5762   Adjusted R-squared: 0.5637
## Multiple R-squared(proj model): 0.008458   Adjusted R-squared: -0.02073
## F-statistic(full model, *iid*): 46.18 on 34519 and 1172455 DF, p-value: < 2.2e-16
## F-statistic(proj model): 12.89 on 1 and 22 DF, p-value: 0.00163

# taking logs
summary(felm(log(list) ~ postKSAintervention:HOUTHI | rownum + date | 0 | ADM1,
             data = LIGHTS[!is.infinite(log(list))]))
##
## Call:
##    felm(formula = log(list) ~ postKSAintervention:HOUTHI | rownum + date | 0 | ADM1, data = LIGHTS[!is.infinite(log(list))])
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12.8918  -0.3725   0.1060   0.5223   6.5958
##
## Coefficients:
##                            Estimate Cluster s.e. t value Pr(>|t|)
## postKSAintervention:HOUTHI   0.4133       0.1234    3.35 0.000809 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8958 on 844476 degrees of freedom
##   (327294 observations deleted due to missingness)
## Multiple R-squared(full model): 0.6534   Adjusted R-squared: 0.6393
## Multiple R-squared(proj model): 0.01248   Adjusted R-squared: -0.02789
## F-statistic(full model, *iid*): 46.12 on 34519 and 844476 DF, p-value: < 2.2e-16
## F-statistic(proj model): 11.22 on 1 and 22 DF, p-value: 0.002899

An alternative way to study this is by doing a flexible non-parametric estimation to rule out diverging trends prior to the military intervention.

summary(felm(anylit ~ date:HOUTHI | rownum + date | 0 | ADM1, data = LIGHTS))
##
## Call:
##    felm(formula = anylit ~ date:HOUTHI | rownum + date | 0 | ADM1, data = LIGHTS)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.12574 -0.10765  0.00313  0.06437  1.06515
##
## Coefficients:
##                        Estimate Cluster s.e. t value Pr(>|t|)
## date2014-01-01:HOUTHI        NA      0.00000      NA       NA
## date2014-02-01:HOUTHI   0.01095      0.01320   0.830 0.406641
## date2014-03-01:HOUTHI   0.03173      0.02764   1.148 0.250884
## date2014-04-01:HOUTHI   0.11048      0.06028   1.833 0.066814 .
## date2014-05-01:HOUTHI   0.09762      0.05271   1.852 0.063989 .
## date2014-06-01:HOUTHI   0.10249      0.05861   1.749 0.080336 .
## date2014-07-01:HOUTHI   0.07204      0.06053   1.190 0.233987
## date2014-08-01:HOUTHI   0.06338      0.04866   1.302 0.192778
## date2014-09-01:HOUTHI   0.03816      0.04690   0.814 0.415860
## date2014-10-01:HOUTHI   0.04247      0.04359   0.974 0.329930
## date2014-11-01:HOUTHI   0.05621      0.03646   1.542 0.123115
## date2014-12-01:HOUTHI   0.02213      0.03037   0.729 0.466205
## date2015-01-01:HOUTHI  -0.02596      0.02585  -1.004 0.315415
## date2015-02-01:HOUTHI   0.02250      0.05141   0.438 0.661649
## date2015-03-01:HOUTHI   0.06080      0.05740   1.059 0.289437
## date2015-04-01:HOUTHI   0.13514      0.04806   2.812 0.004925 **
## date2015-05-01:HOUTHI   0.15874      0.04647   3.416 0.000635 ***
## date2015-06-01:HOUTHI   0.15493      0.05151   3.008 0.002632 **
## date2015-07-01:HOUTHI   0.12681      0.04697   2.700 0.006944 **
## date2015-08-01:HOUTHI   0.12363      0.04319   2.863 0.004202 **
## date2015-09-01:HOUTHI   0.13972      0.05276   2.648 0.008088 **
## date2015-10-01:HOUTHI   0.13422      0.04697   2.857 0.004273 **
## date2015-11-01:HOUTHI   0.12408      0.04566   2.717 0.006578 **
## date2015-12-01:HOUTHI   0.12125      0.04505   2.691 0.007119 **
## date2016-01-01:HOUTHI   0.11971      0.03905   3.065 0.002176 **
## date2016-02-01:HOUTHI   0.11952      0.04151   2.879 0.003984 **
## date2016-03-01:HOUTHI   0.12721      0.04239   3.001 0.002693 **
## date2016-04-01:HOUTHI   0.12537      0.04532   2.766 0.005669 **
## date2016-05-01:HOUTHI   0.12989      0.05297   2.452 0.014209 *
## date2016-06-01:HOUTHI   0.13070      0.05936   2.202 0.027675 *
## date2016-07-01:HOUTHI   0.14831      0.06597   2.248 0.024573 *
## date2016-08-01:HOUTHI   0.13047      0.04614   2.827 0.004693 **
## date2016-09-01:HOUTHI   0.14481      0.06024   2.404 0.016227 *
## date2016-10-01:HOUTHI   0.11782      0.05255   2.242 0.024959 *
## date2016-11-01:HOUTHI   0.12175      0.04473   2.722 0.006486 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2219 on 1172422 degrees of freedom
## Multiple R-squared(full model): 0.5776   Adjusted R-squared: 0.5652
## Multiple R-squared(proj model): 0.01175   Adjusted R-squared: -0.01738
## F-statistic(full model, *iid*): 46.4 on 34552 and 1172422 DF, p-value: < 2.2e-16
## F-statistic(proj model): 147.2 on 35 and 22 DF, p-value: < 2.2e-16

This suggests that the differential drop in lights occurred only after March 2015, the month in which Saudi Arabia's military intervention commenced.

On average, the regressions suggest that the drop in lights was significantly more pronounced outside the Houthi heartland. This suggests that the conflict and the bombing carried out by Saudi Arabia are mostly concentrated outside the Houthi rebel heartland.

That the dramatic drop in light emissions is associated with the Saudi military intervention is quite clear. The conflict between the Houthi rebels and the government had been ongoing for several years, but only with the intervention of Saudi Arabia do marked differences between Houthi and non-Houthi heartland provinces appear.

This analysis can be refined further by studying the role of the religious makeup of different provinces, as the divide between Shia and Sunni Muslim groups is said to be an important factor driving this conflict.
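
As a sketch of what such a refinement could look like, one could interact the post-intervention dummy with a district-level measure of religious composition. The variable shia_share below is hypothetical and not part of the data set used above; it would have to be merged in from census or survey sources first.

# hypothetical refinement: shia_share is an assumed, not an actual, column of LIGHTS
summary(felm(anylit ~ postKSAintervention:HOUTHI + postKSAintervention:shia_share |
               rownum + date | 0 | ADM1, data = LIGHTS))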

Nevertheless, this analysis suggests that high-frequency satellite images such as these can be useful in assessing the extent to which areas are directly affected by conflict, which may be useful for targeting humanitarian relief.

New book and package pmfdR

Sun, 03/26/2017 - 09:00

(This article was first published on R and Finance, and kindly contributed to R-bloggers)

Processing and modelling financial data with R –

My Portuguese book about finance and R was
published
a couple of months ago and, given its positive feedback, I decided to
work on the English version immediately. You can find details about my
experience in self-publishing the book in this
post.

The English book is not a simple translation of text and examples. This
is a long term project that I always dreamed of doing. With time, I plan
to keep the Portuguese and English version synchronized. The feedback I
got from the Portuguese version was taken into account and I wrote
additional sections covering advanced use of dplyr with list-columns,
storing data in SQLITE, reporting tables with xtable and stargazer,
and many more.

The book is not yet finished. I’m taking my time in reviewing everything
and making sure that it comes out perfectly. I believe it will be ready
in a few months or less. If you are interested in the book, please go to
its website where you can
find its current TOC (table of contents), code and data.

If you want to be notified about the publication of the book, please
sign this form and I’ll let
you know as soon as it is available.

Package pmfdR

Yesterday I released package pmfdR, which provides access to all
material from my book Processing and Modelling Financial Data with
R
, including code, data and exercises.

The exercises are still not complete. I expect to have at least 100
exercises covering all chapters of the book. As soon as the book is
finished, I'll start working on them.

With package pmfdR you can:

  1. Download data and code with function pmfdR_download.code.and.data
  2. Build exercises with function pmfdR_build.exercise
Downloading code and data

All the R code from the book is publicly available in
github. Function
pmfdR_download.code.and.data will download a zip file from the
repository and unzip it in the specified folder. Have a look at its usage:

if (!require(pmfdR)){
  install.packages('pmfdR')
  library(pmfdR)
}

my.lan <- 'en' # language of code and data ('en' or 'pt-br')

# dl may take some time (around 60 mb)
pmfdR_download.code.and.data(lan = my.lan)

dir.out <- 'pmfdR-en-code_data-master'

# list R code
list.files(dir.out, pattern = '*.R')
list.files(paste0(dir.out,'/data'))

Building exercises

All exercises from the book are based on package exams. This means
that every reader will have a different version of the exercise, with
different values and correct answer. I’ve written extensively about the
positive aspects of using exams. You can find the full post
here

You can create your custom exercise file using function
pmfdR_build.exercise. Give it a try, just copy and paste the following
chunk of code in your R prompt.

if (!require(pmfdR)){
  install.packages('pmfdR')
  library(pmfdR)
}

my.lan <- 'en'                           # language of exercises
my.exercise.folder <- 'pmfdR-exercises'  # name of folder with exercise files (will download from github)
my.pdf.folder <- 'PdfOut'                # name of folder to place pdf file and answer sheet

pmfdR_build.exercise(lan = my.lan,
                     exercise.folder = my.exercise.folder,
                     pdf.folder = my.pdf.folder)

list.files(my.pdf.folder)

To leave a comment for the author, please follow the link and comment on their blog: R and Finance.

More book, more cricket! 2nd edition of my books now on Amazon

Sun, 03/26/2017 - 04:54

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

The 2nd edition of both my books

a) Cricket analytics with cricketr
b) Beaten by sheer pace – Cricket analytics with yorkr

are now available on Amazon, in both Paperback and Kindle versions.
Pick up your copies today!!!

A) Cricket analytics with cricketr: Second Edition

B) Beaten by sheer pace: Cricket analytics with yorkr(2nd edition)

Pick up your copies today!!!

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….

R&lt;-Slovakia meetup started to build community in Bratislava

Sun, 03/26/2017 - 01:00

(This article was first published on Peter Laurinec, and kindly contributed to R-bloggers)

On 22 March the first special R-related meetup, called R<-Slovakia, took place. As the name of the meetup group implies, it is based in Slovakia, in its capital, Bratislava. I am very happy to have been the first ever speaker at this event. R<-Slovakia intends to build a community to share the knowledge of people who are dealing with data science, machine learning or related fields. And, of course, of those who want to start learning about R and about the world of data.

Topics of the presentation

The main topic of my presentation was the work on my PhD thesis, which is still in progress. So, what kind of challenges did I tackle in the field of “mining” smart meter data? I focus on clustering consumers in order to produce more accurate forecasts of aggregated electricity consumption. For this task, it is crucial to choose the most suitable forecasting methods and representations of the electricity consumption time series, which are the inputs to the clustering algorithm. I also spoke about R “tips and tricks” that can help to write more effective scripts for this task. Some ideas from the presentation are already published in scientific papers – you can check them in my ResearchGate profile.
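
To give a flavour of the basic idea in a few lines of base R (the data below are simulated and purely illustrative; the actual work uses more refined time series representations and forecasting models):

set.seed(1)
# 200 simulated consumers, one week of half-hourly readings each
load <- matrix(rpois(200 * 48 * 7, lambda = 5), nrow = 200)
# a simple representation: the mean daily profile (48 values) per consumer
profile <- t(apply(load, 1, function(x) tapply(x, rep(1:48, times = 7), mean)))
cl <- kmeans(scale(profile), centers = 5)         # cluster consumers on their profiles
cluster_load <- rowsum(load, group = cl$cluster)  # aggregate consumption per cluster
# each row of cluster_load can now be forecast separately and the forecasts summed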

The headlines of the presentation are as follows:

  • Introduction of R
  • Challenges in smart grids
  • Forecasting electricity consumption
  • Suitable methods for forecasting multiple seasonal time series
  • Cluster analysis of consumers
  • R tips and tricks

Slides are here:

RSlovakia #1 meetup from Peter Laurinec

About the organizer

The organizer of R<-Slovakia is the NGO GapData. They also started meetups called PyData Bratislava, a network of users and developers of data analysis tools who meet on a regular basis to share ideas and learn from each other. The PyData community gathers to discuss how best to apply Python tools, as well as tools using R and Julia. R<-Slovakia and PyData Bratislava are organized with the help of the nonprofit organization NumFocus.

R<-Slovakia is also on the list of R meetups around the world. You can also look more specifically at, for example, European R meetups.

With this large amount of information (see the links), you can start your own meetup (or project) to build a community of data analysts.

Look ahead

I hope you will gain some knowledge from my slides, e.g. get a bigger picture of how to use data mining methods on smart meter data. If some details are not clear, don't worry: I will explain in more detail how to implement regression trees, ensemble learning or cluster analysis of consumers in R. Stay tuned, and in the meantime check my previous blog posts.

Teaser :sunglasses: :chart_with_upwards_trend: :

To leave a comment for the author, please follow the link and comment on their blog: Peter Laurinec.

RApiDatetime 0.0.2

Sat, 03/25/2017 - 23:38

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Two days after the initial 0.0.1 release, a new version of RApiDatetime has just arrived on CRAN.

RApiDatetime provides six entry points for C-level functions of the R API for Date and Datetime calculations. The functions asPOSIXlt and asPOSIXct convert between long and compact datetime representation, formatPOSIXlt and Rstrptime convert to and from character strings, and POSIXlt2D and D2POSIXlt convert between Date and POSIXlt datetime. These six functions are all fairly essential and useful, but not one of them was previously exported by R.
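
At the R level, these entry points roughly correspond to familiar operations; the following is only an illustration of the conversions involved, using the ordinary R functions rather than the C API itself:

lt <- as.POSIXlt(Sys.time())            # asPOSIXlt: compact -> long representation
ct <- as.POSIXct(lt)                    # asPOSIXct: long -> compact representation
s  <- format(lt, "%Y-%m-%d %H:%M:%S")   # formatPOSIXlt: datetime -> character
p  <- strptime(s, "%Y-%m-%d %H:%M:%S")  # Rstrptime: character -> datetime
d  <- as.Date(lt)                       # POSIXlt2D: POSIXlt -> Date
as.POSIXlt(d)                           # D2POSIXlt: Date -> POSIXlt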

Josh Ulrich took one hard look at the package — and added the one line we needed to enable the Windows support that was missing in the initial release. We now build on all platforms supported by R and CRAN. Otherwise, I just added a NEWS file and called it a bugfix release.

Changes in RApiDatetime version 0.0.2 (2017-03-25)
  • Windows support has been added (Josh Ulrich in #1)
Changes in RApiDatetime version 0.0.1 (2017-03-23)
  • Initial release with six accessible functions

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the rapidatetime page.

For questions or comments please use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Linear regression in “The Man who counted”

Sat, 03/25/2017 - 21:15

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

Recently, I got a book by the Brazilian writer Júlio César de Mello e Souza (published under the pen name Malba Tahan), titled The Man Who Counted. The book is a collection of mathematical stories, very similar to Scheherazade's 1001 Nights, where mathematical story-telling is the center of the book.

                                              

In story 5, “In so many words”, Malba describes a simple algebraic problem of proportion between the price of lodging offered and the price of the jewels sold.

“This man,” old Salim said, pointing to the jeweler, “came from Syria to sell precious stones in Baghdad. He promised he would pay 20 dinars for his lodgings if he sold all of his jewels for 100 dinars and 35 dinars if he sold them for 200. After several days of wandering about, he ended up selling all of them for 140 dinars. How much does he owe me according to our agreement?”

Both the jeweler and the lodge owner calculate the result using a percent-proportion argument, and both end up with a wrong result that favors their own side:

  1. Jeweler:

200 : 35 :: 140 : x
x = (35 x 140) / 200 = 24.5

  2. Lodge Owner:

100 : 20 :: 140 : x
x = (20 x 140) / 100 = 28

 

With two different results, the two would end up in an argument, so a third calculation was needed:

Sale Price   Price of Lodgings
   200               35
  -100              -20
  -----             ----
   100               15

 

The difference between the two scenarios forms a proportion for the new case: when the sale price of the jewels is 140 (that is, 40 above 100), the price of lodging rises by 6 above 20, giving 26.

100 : 15 :: 40 : x
x = (15 x 40) / 100 = 6

 

Mathematically speaking, the problem is also interesting to solve using linear regression, since the two points [200, 35] and [100, 20] define a linear prediction function, and we need to predict the price of lodging when the sale price of the jewels is 140.

diamond <- c(100, 200)
sleep   <- c(20, 35)

# regression
sleep_model <- lm(sleep ~ diamond)

plot(x = diamond, y = sleep)
abline(lm(sleep ~ diamond))

Now we can turn this into a prediction, which is what Beremiz does by heart.

predict_data <- data.frame(diamond = 140)
fit <- predict(sleep_model, predict_data, interval = "predict")

# new value for diamond=140
fit[1]

The result is 26, which is the correct result both algebraically and in terms of the R prediction.

In this case, linear regression does the same as the proportion calculation, but what strikes me is which calculation – not mathematically speaking – makes more sense: 26, 24.5 or 28? And which method for calculating the lodging price should satisfy both the jeweler and the lodge owner?

Happy reading!

 

 

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

Perform pairwise Wilcoxon test, classify groups by significance and plot results

Sat, 03/25/2017 - 20:18

(This article was first published on R – Fabio Marroni's Blog, and kindly contributed to R-bloggers)

This post is the result of work performed in collaboration with my colleague Eleonora Paparelli (who actually did most of the work!). We wanted to compare several distributions using Wilcoxon test and summarize results (i.e. indicate the comparisons showing significant differences).
R base includes pairwise.wilcox.test to perform Wilcoxon rank sum test between all pairs of samples in a study.
A common way to represent significance in pairwise comparisons is the use of letters. Samples sharing a letter are not different from each other. Samples not sharing letters are different.
Library multcompView in R can take a square matrix of p-values and return letters for all samples.
Sadly enough, pairwise.wilcox.test returns a triangular matrix, so I had to write a small function – named tri.to.squ – to take the output of pairwise.wilcox.test and convert it in a suitable input for the multcompLetters function of multcompView.
Now we can easily plot the distributions as box plot and add the letters as text.

library(multcompView)

# helper: convert the triangular p-value matrix returned by pairwise.wilcox.test
# into a square matrix; defined first so the whole block can be pasted and run top to bottom
tri.to.squ <- function(x)
{
  rn <- row.names(x)
  cn <- colnames(x)
  an <- unique(c(cn, rn))
  myval <- x[!is.na(x)]
  mymat <- matrix(1, nrow = length(an), ncol = length(an), dimnames = list(an, an))
  for(ext in 1:length(cn))
  {
    for(int in 1:length(rn))
    {
      if(is.na(x[row.names(x) == rn[int], colnames(x) == cn[ext]])) next
      mymat[row.names(mymat) == rn[int], colnames(mymat) == cn[ext]] <- x[row.names(x) == rn[int], colnames(x) == cn[ext]]
      mymat[row.names(mymat) == cn[ext], colnames(mymat) == rn[int]] <- x[row.names(x) == rn[int], colnames(x) == cn[ext]]
    }
  }
  return(mymat)
}

first.set  <- cbind(rnorm(100, mean = 1.8), 1)
second.set <- cbind(rnorm(100, mean = 0.9), 2)
third.set  <- cbind(rnorm(100, mean = 1), 3)
full <- rbind(first.set, second.set, third.set)

pp <- pairwise.wilcox.test(full[,1], full[,2], p.adjust.method = "none", paired = FALSE)
mymat <- tri.to.squ(pp$p.value)
myletters <- multcompLetters(mymat, compare = "<=", threshold = 0.05, Letters = letters)

boxplot(full[,1] ~ full[,2], ylab = "Something nice", names = c("Set 1", "Set 2", "Set 3"),
        ylim = c(min(full[,1]), 0.2 + max(full[,1])))
text(c(1, 2, 3), 0.1 + max(full[,1]),
     c(myletters$Letters[1], myletters$Letters[2], myletters$Letters[3]))

To leave a comment for the author, please follow the link and comment on their blog: R – Fabio Marroni's Blog.

Debugging Pipelines in R with Bizarro Pipe and Eager Assignment

Sat, 03/25/2017 - 16:42

This is a note on debugging magrittr pipelines in R using Bizarro Pipe and eager assignment.



Pipes in R

The magrittr R package supplies an operator called “pipe”, which is written as “%>%”. The pipe operator is partly famous due to its extensive use in dplyr and by dplyr users. It is roughly described as allowing one to write “sin(5)” as “5 %>% sin”, and as being inspired by F#’s pipe-forward operator “|>”, which itself is defined or implemented as:

let (|>) x f = f x

The magrittr pipe doesn’t actually perform the above substitution directly. As a consequence “5 %>% sin” is evaluated in a different environment than “sin(5)” would be (unlike F#’s “|>”), and the actual implementation is fairly involved.
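
For intuition, a pipe that did perform that literal substitution can be written in one line of R; this is only an illustration, not how magrittr is implemented:

`%|>%` <- function(x, f) f(x)  # naive pipe: apply the right-hand function to the left-hand value
5 %|>% sin                     # equivalent to sin(5)
## [1] -0.9589243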

The environment change is demonstrated below:

library("dplyr") f <- function(...) {print(parent.frame())} f(5) ## <environment: R_GlobalEnv> 5 %>% f ## <environment: 0x1032856a8>

Pipes are like any other coding feature: if you code with it you are eventually going to have to debug with it. Exact pipe semantics and implementation details are important when debugging, as one tries to control execution sequence and examine values and environments while debugging.

A Debugging Example

Consider the following example taken from the “Chaining” section of “Introduction to dplyr”.

library("dplyr") library("nycflights13") flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30) ## Adding missing grouping variables: `year`, `month`, `day` ## Source: local data frame [49 x 5] ## Groups: year, month [11] ## ## year month day arr dep ## <int> <int> <int> <dbl> <dbl> ## 1 2013 1 16 34.24736 24.61287 ## 2 2013 1 31 32.60285 28.65836 ## ...

A beginning dplyr user might wonder at the meaning of the warning “Adding missing grouping variables: `year`, `month`, `day`”. Similarly, a veteran dplyr user may wonder why we bother with a dplyr::select(), as selection is implied in the following dplyr::summarise(); but this is the example code as we found it.

Using Bizarro Pipe

We can run down the cause of the warning quickly by performing the mechanical translation from a magrittr pipeline to a Bizarro pipeline. This is simply making all the first arguments explicit with “dot” and replacing the operator “%>%” with the Bizarro pipe glyph: “->.;“.

We can re-run the modified code by pasting into R’s command console and the warning now lands much nearer the cause (even when we paste or execute the entire pipeline at once):

flights ->.;
group_by(., year, month, day) ->.;
select(., arr_delay, dep_delay) ->.;
## Adding missing grouping variables: `year`, `month`, `day`
summarise(.,
  arr = mean(arr_delay, na.rm = TRUE),
  dep = mean(dep_delay, na.rm = TRUE)
) ->.;
filter(., arr > 30 | dep > 30)
## Source: local data frame [49 x 5]
## Groups: year, month [11]
##
##     year month   day      arr      dep
##    <int> <int> <int>    <dbl>    <dbl>
## 1   2013     1    16 34.24736 24.61287
## 2   2013     1    31 32.60285 28.65836
## ...

We can now clearly see the warning was issued by dplyr::select() (even though we just pasted in the whole block of commands at once). This means despite help(select) saying “select() keeps only the variables you mention” this example is depending on the (useful) accommodation that dplyr::select() preserves grouping columns in addition to user specified columns (though this accommodation is not made for columns specified in dplyr::arrange()).

A Caveat

To capture a value from a Bizarro pipe we must make an assignment at the end of the pipe, not the beginning. The following will not work as it would capture only the value after the first line (“flights ->.;“) and not the value at the end of the pipeline.

One must not write:

VARIABLE <- flights ->.; group_by(., year, month, day)

To capture pipeline results we must write:

flights ->.; group_by(., year, month, day) -> VARIABLE

I think the right assignment is very readable if you have the discipline to only use pipe operators as line-enders, making assignments the unique lines without pipes. Also, leaving an extra line break after assignments helps with readability.

Making Things More Eager

A remaining issue is: Bizarro pipe only made the composition eager. For a data structure with additional lazy semantics (such as dplyr’s view of a remote SQL system) we would still not have the warning near the cause.

Unfortunately different dplyr backends give different warnings, so we can’t demonstrate the same warning here. We can, however, deliberately introduce an error and show how to localize errors in the presence of lazy eval data structures. In the example below I have misspelled “month” as “moth”. Notice the error is again not seen until printing, long after we finished composing the pipeline.

s <- dplyr::src_sqlite(":memory:", create = TRUE)
flts <- dplyr::copy_to(s, flights)

flts ->.;
group_by(., year, moth, day) ->.;
select(., arr_delay, dep_delay) ->.;
summarise(.,
  arr = mean(arr_delay, na.rm = TRUE),
  dep = mean(dep_delay, na.rm = TRUE)
) ->.;
filter(., arr > 30 | dep > 30)
## Source: query [?? x 5]
## Database: sqlite 3.11.1 [:memory:]
## Groups: year, moth
## na.rm not needed in SQL: NULL are always droppedFALSE
## na.rm not needed in SQL: NULL are always droppedFALSE
## Error in rsqlite_send_query(conn@ptr, statement) : no such column: moth

We can try to force dplyr into eager evaluation using the eager value landing operator “replyr::`%->%`” (from replyr package) to form the “extra eager” Bizarro glyph: “%->%.;“.



When we re-write the code in terms of the extra eager Bizarro glyph we get the following.

install.packages("replyr") library("replyr") flts %->%.; group_by(., year, moth, day) %->%.; ## Error in rsqlite_send_query(conn@ptr, statement) : no such column: moth select(., arr_delay, dep_delay) %->%.; summarise(., arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %->%.; ## na.rm not needed in SQL: NULL are always droppedFALSE ## na.rm not needed in SQL: NULL are always droppedFALSE filter(., arr > 30 | dep > 30) ## Source: query [?? x 5] ## Database: sqlite 3.11.1 [:memory:]

Notice we have successfully localized the error.

Nota Bene

One thing to be careful with in “dot debugging” is: when a statement such as dplyr::select() errors out, the Bizarro assignment on that line does not occur (normal R exception semantics). Thus “dot” will still be carrying the value from the previous line, and the pasted block of code will continue after the failing line using this older data state found in “dot”. So you may see strange results and additional errors indicated in the pipeline. The debugging advice is: at most the first error message is trustworthy.

The Trick

The trick is to train your eyes to read "->.;" or "%->%.;" as a single atomic or indivisible glyph, and not as a sequence of operators, variables, and separators. I see Bizarro pipe as a kind of strange superhero.



Conclusion

Pipes are a fun notation, and even the original magrittr package experiments with a number of interesting variations of them. I hope you add Bizarro pipe (which, it turns out, has been available in R all along, without requiring any packages!) and the extra eager Bizarro pipe to your debugging workflow.

New Zealand election forecasts

Sat, 03/25/2017 - 12:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Over the weekend I released a new webpage, connected to this blog, with forecasts for the New Zealand 2017 General Election. The aim is to go beyond poll aggregation to something that takes the uncertainty of the future into account, as well as relatively minor issues such as the success (or not) of different polling firms predicting results in the past. The full source code is available in its own repository (and under active development) and I won’t try and discuss it all here, but I did have a couple of reflections.

Randomness on election day

The guts of the model currently being used in that site is a generalized additive model with a multivariate Normal response, which is party vote on the logit scale. Effectively, I have a latent unobserved party support variable, and I treat the various polling data as observations somehow imperfectly manifested by that latent variable. The whole approach is a quick and dirty alternative to a Bayesian state space model which stays on my “to do one day” list.
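
For readers curious what such a model looks like in code, here is a minimal sketch on simulated data (the real model uses the actual polling series, more parties and polling-firm effects; the variable names below are made up):

library(mgcv)
set.seed(1)
# simulated "polls": logit-scale party support observed at irregular dates
polls <- data.frame(date = sort(runif(80)))
polls$lab <- -1.0 + 0.3 * polls$date + rnorm(80, sd = 0.1)
polls$nat <- -0.6 - 0.3 * polls$date + rnorm(80, sd = 0.1)
# GAM with a multivariate normal response: one smooth trend per party
fit <- gam(list(lab ~ s(date), nat ~ s(date)), family = mvn(d = 2), data = polls)
predict(fit, newdata = data.frame(date = 1))  # latent support (logit scale) at "election day"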

When transformed back onto the original scale (proportion of party vote), that model looks like this:

To turn this into a prediction for election day, I need to deal with two sources of randomness:

  • Because predicting the future is hard, I don’t know where the latent support for each party will be on election day (at the time of writing, six months away). The shaded ribbons in the image above are 95% credibility intervals for where this will end up, and hence can be treated as a random variable.
  • Even after taking into account the uncertainty in the underlying latent level of support, there will be some randomness in how it manifests on election day. Will people turn up; will we find out the polls have systematically missed something; and so on.

The aim is to create simulations that take both those forms of randomness into account, including the fact that the random variables in question are negatively correlated (because support for one party increases at the expense of the others). Those simulations look like this, when done:

The first form of variance is easy, and comes straight out of the mgcv function that fits the generalized additive model; as does an estimate of the covariance between the party support. But I had three goes at identifying the second form of randomness, election day randomness:

  1. At first I ignored it altogether. An election is (sort of) a census compared to an opinion poll which is a sample survey, so one might expect the election to come out bang on where the “real” latent support is, so the only uncertainty to worry about is estimating the latent support.
  2. That seemed to create distributions of party vote that were unrealistically narrow and I worried I would be making the classic electoral modelling mistake of underestimating the randomness of the electorate (yes, this has been prominent over the past year). I went in the other direction and said, perhaps an election is just as random as an opinion poll; I estimated the variance of individual opinion polls around the latent party support, and simulated election day outcomes with the same variance. This is actually the first version I published; I liked it because it had great dollops of uncertainty in its suggested outcomes, which is nearly always a good thing that helps compensate for the problem of the real world being messier and more complex than our models, so confidence or credibility intervals based purely on models nearly always fail more than advertised.
  3. Further reflection during the day made me think I’d gone too far. Method #2, even when applied to data one week out from the 2014 election, had an implausibly large range of outcomes; reflecting the fact that individual polls are pretty small and highly variable, much more so than elections. I opted instead to compare the election results for parties in 2005, 2008, 2011 and 2014 to the latent party support estimated by my generalized additive model, and use the mean squared error from that to scale the variance of my simulated 2017 election day.

Here’s where the mean squared error from my model compares to actual election results:

This suggested a simple pragmatic linear model of E(log(Var(logit(vote)))) ~ E(logit(vote)). So now in my forecasting simulation, I take that expected variance, add it to the variance in the estimate of the latent party support itself, and combine it with the correlation of party support from the GAM to estimate a variance-covariance matrix for use with mvrnorm to generate party votes (still on the logit scale). This is the method currently live on the website, having replaced my too-conservative method #2. For the curious, here are the probabilistic predictions for the 2017 New Zealand election:


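A rough sketch of that final simulation step, with purely illustrative numbers standing in for the fitted means and the variance-covariance matrix described above:

library(MASS)
mu    <- c(nat = 1.0, lab = 0.2, grn = -2.0)            # illustrative logit-scale means
Sigma <- matrix(c( 0.05, -0.02, -0.01,
                  -0.02,  0.05, -0.01,
                  -0.01, -0.01,  0.08), nrow = 3,
                dimnames = list(names(mu), names(mu)))  # illustrative covariance
sims  <- mvrnorm(n = 10000, mu = mu, Sigma = Sigma)     # simulated election-day logit votes
votes <- plogis(sims)                                   # back to the proportion scale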
Quality control

I had a little annoyance this morning, with a version of the page containing an error (although only for about 15 minutes) in the part that tested the forecasting methods on the 2014 election data, pretending to have polls only up to six months before the election. A bit of code that was meant to turn “no polling information available” for minor parties into 0% support had slipped through a change in method from the raw proportion scale to the logit scale. The inverse logit of 0 is 0.5, so effectively I was saying that the Conservative party, ACT, and United Future each had a bunch of polls indicating 50% support (for non-New Zealanders, these are all small parties which in fact polled less than 5% support in 2014). I'd noticed some funny graphics with trend lines implausibly high on some dates, but had forgotten to look into it. Once I did look into it I quickly found the bug, but not until after the site had been live for 15 minutes.

That experience, and the experience of changing my mind on how to simulate election day randomness, made me reflect on quality control. Electoral predictions are important things, the more so given that in New Zealand there’s few competing models (and only a couple of poll aggregators). It would be embarrassing, and generally bad, to make many errors in a page like the one I’m writing about here, no matter what caveats and disclaimers I add. Erring on the side of caution helps; but this is the sort of area where sometimes one doesn’t even notice one is making rash assumptions. Full transparency is part of the solution (hence all my code is open to scrutiny) but isn’t really a substitute for professional teamwork.

So some lessons (from this and similar experiences):

  • get a peer reviewer if you can.
  • publish code in full if you can, and take care to make it as reproducible as possible. Working through the situation of “could someone pick up this repository and run it” can be a good way of surfacing some types of problems that otherwise stay hidden.
  • don’t just review code, review thinking processes. Being challenged on the source of individual election variance would have got me to a better outcome earlier; but just reviewing my code wouldn’t have done it (the code worked fine, and did what it was intended – the problem was the sub-optimal method).
  • pay attention to warning signs. Even if they don’t look important themselves, they might indicate something else.

All fairly obvious. Some other day I’ll do a more thorough blog post on quality control.

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R.

On a First Name Basis with Statistics Sweden

Sat, 03/25/2017 - 00:00

(This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)

Abstract

Judging from recent R-Bloggers posts, it appears that many data scientists are concerned with scraping data from various media sources (Wikipedia, Twitter, etc.). However, one should be aware that well-structured and high-quality datasets are available through state and national bureaus of statistics. Increasingly, these are offered to the public through direct database access, e.g., using a REST API. We illustrate the usefulness of such an approach by accessing data from Statistics Sweden.



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license.

Introduction

Scandinavian countries are world-class when it comes to public registries. So when in need of reliable population data, this is the place to look. As an example, we access Statistics Sweden data via their API using the pxweb package developed by @MansMeg, @antagomir and @LCHansson. Love was the first speaker at a Stockholm R-Meetup some years ago, where I also gave a talk. Funny how such R-Meetups become useful many years later!

library(pxweb)

By browsing the Statistics Sweden (in Swedish: Statistiska Centralbyrån (SCB)) data using their web interface one sees that they have two relevant first name datasets: one containing the tilltalsnamn of newborns for each year during 1998-2016 and one for the years 2004-2016. Note: A tilltalsnamn in Sweden is the first name (of several possible first names) by which a person is usually addressed. About 2/3 of the persons in the Swedish name registry indicate which of their first names is their tilltalsnamn. For the remaining persons it is automatically implied that their tilltalsnamn is the first of the first names. Also note: For reasons of data protection the 1998-2016 dataset contains only first names used 10 or more times in a given year, the 2004-2016 dataset contains only first names used 2 or more times in a given year.

Downloading such data through the SCB web-interface is cumbersome, because the downloads are limited to 50,000 data cells per query. Hence, one has to do several manual queries to get hold of the relevant data. This is where their API becomes a real time-saver. Instead of trying to fiddle with the API directly using rjson or RJSONIO we use the specially designed pxweb package to fetch the data. One can either use the web-interface to determine the name of the desired data matrix to query or navigate directly through the api using pxweb:

d <- interactive_pxweb(api = "api.scb.se", version = "v1", lang = "en")

and select Population, followed by Name statistics, and then BE0001T04Ar or BE0001T04BAr, respectively, in order to obtain the relevant data and API download URL. This leads to the following R code for the download:

names10 <- get_pxweb_data(
  url = "http://api.scb.se/OV0104/v1/doris/en/ssd/BE/BE0001/BE0001T04Ar",
  dims = list(Tilltalsnamn = c('*'),
              ContentsCode = c('BE0001AH'),
              Tid = c('*')),
  clean = TRUE) %>% as.tbl

For better usability we rename the columns a little and replace NA counts with zero. For visualization we pick a few random lines of the dataset.

names10 <- names10 %>%
  select(-observations) %>%
  rename(firstname = `first name normally used`, counts = values) %>%
  mutate(counts = ifelse(is.na(counts), 0, counts))

##Look at 10 random lines
names10 %>% slice(sample(seq_len(nrow(names10)), size = 5))
## # A tibble: 5 × 3
##   firstname   year counts
##      <fctr> <fctr>  <dbl>
## 1   Leandro   2011     15
## 2    Marlon   2004      0
## 3    Andrej   2009      0
## 4     Ester   2002     63
## 5   Muhamed   1998      0

Note: Each spelling variant of a name in the data is treated as a unique name. In similar fashion we download the BE0001AL dataset as names2.
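
For completeness, that second download mirrors the first one; the table path below is the BE0001T04BAr matrix mentioned above, and the column handling is assumed to be identical to the names10 case (worth double-checking against the SCB API browser):

names2 <- get_pxweb_data(
  url = "http://api.scb.se/OV0104/v1/doris/en/ssd/BE/BE0001/BE0001T04BAr",
  dims = list(Tilltalsnamn = c('*'),
              ContentsCode = c('BE0001AL'),
              Tid = c('*')),
  clean = TRUE) %>% as.tbl

names2 <- names2 %>% select(-observations) %>%
  rename(firstname = `first name normally used`, counts = values) %>%
  mutate(counts = ifelse(is.na(counts), 0, counts))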

We now join the two datasets into one large data.frame by

names <- rbind(data.frame(names2,type="min02"), data.frame(names10,type="min10"))

and thus have everything in place to compute the name collision probability over time using the birthdayproblem package (as shown in previous posts).

library(birthdayproblem)
collision <- names %>% group_by(year, type) %>% do({
  data.frame(p = pbirthday_up(n = 26L, prob = .$counts / sum(.$counts),
                              method = "mase1992")$prob,
             gini = ineq::Gini(.$counts))
}) %>% ungroup %>% mutate(year = as.numeric(as.character(year)))

And the resulting probabilities based on the two datasets min02 (at least two instances of the name in a given year) and min10 (at least ten instances of the name in a given year) can easily be visualized over time.

ggplot(collision, aes(x = year, y = p, color = type)) +
  geom_line(size = 1.5) +
  scale_y_continuous(label = scales::percent, limits = c(0, 1)) +
  xlab("Year") + ylab("Probability") +
  ggtitle("Probability of a name collision in a class of 26 kids born in year YYYY") +
  scale_colour_discrete(name = "Dataset")

As seen in similar plots for other countries, there is a decline in the collision probability over time. Note also that the two curves are upper limits to the true collision probabilities. The true probabilities, i.e. taking all tilltalsnamn into account, would be based on the hypothetical min1 data set. These probabilities would be slightly, but not substantially, below the min2 line. The same problem occurs, e.g., in the corresponding UK and Wales data. Here, Table 6 is listing all first names with 3 or more uses, but not stating how many newborns have a name occurring once and twice, respectively. With all due respect for the need to anonymise the name statistics, it’s hard to understand why this summary figure is not automatically reported, so one would be able to at least compute correct totals or collision probabilities.

Summary

Altogether, I was still quite happy to get proper individual name data, so that the collision probabilities are – in contrast to some of my previous blog analyses – exact!

To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice....

Comparing subreddits, with Latent Semantic Analysis in R

Fri, 03/24/2017 - 21:38

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

FiveThirtyEight published a fascinating article this week about the subreddits that provided support to Donald Trump during his campaign, and continue to do so today. Reddit, for those not in the know, is a popular online social community organized into thousands of discussion topics, called subreddits (the names all begin with "r/"). Most of the subreddits are a useful forum for interesting discussions by like-minded people, and some of them are toxic. (That toxicity extends to some of the names, which is reflected in some of the screenshots below — apologies in advance.) The article looks at various popular and notorious subreddits and finds those that are most similar to the main subreddit devoted to Donald Trump and also to the main other contenders in the 2016 campaign for president, Hillary Clinton and Bernie Sanders.

The underlying method used to compare subreddits for this purpose is quite ingenious. It's based on a concept you might call "subreddit algebra": you can "add" two subreddits and find a third that reflects the intersection of the two. (One example they give is adding r/nba to r/minnesota gives you r/timberwolves, the subreddit for Minnesota's NBA team.) Then you can apply the same process to subtraction: if you remove all the posts like those in the mainstream r/politics site from those in r/The_Donald, you're left with posts that look like those in several toxic subreddits.

The statistical technique used to identify posts that are "similar" to another is Latent Semantic Analysis, and the article gives this nice illustration of using it to compare subreddits:

The analysis was performed in R, and the code is available in GitHub. The code makes heavy use of the lsa package for R, which provides a number of functions for performing latent semantic analysis. The triangular plot shown above — known as a ternary diagram — was created using the ggtern package.
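
As a toy illustration of the lsa workflow (the matrix below contains made-up term counts, not the actual subreddit data used in the article):

library(lsa)
# toy term-document matrix: rows are terms, columns are "subreddits"
tdm <- matrix(c(3, 0, 2,
                1, 4, 0,
                0, 2, 5),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("term1", "term2", "term3"),
                              c("r_A", "r_B", "r_C")))
space   <- lsa(tdm, dims = 2)     # latent semantic space via a rank-2 SVD
lsa_tdm <- as.textmatrix(space)   # the matrix re-expressed in that space
cosine(lsa_tdm)                   # pairwise cosine similarity between the toy "subreddits"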

For the complete subreddit analysis, and the list of subreddits close to Donald Trump based on the analysis, check out the FiveThirtyEight article linked below.

FiveThirtyEight: Dissecting Trump's Most Rabid Following

 

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Building Shiny App Exercises (part-8)

Fri, 03/24/2017 - 18:10

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Transform your App into Dashboard

Now that we have covered the basic stuff that you need to know in order to build your app, it is time to enhance its appearance and functionality. The interface is very important for the user, as it must not only be friendly but also easy to use.

In this part we will transform your Shiny app into a beautiful Shiny Dashboard: first we will create the interface and then, step by step, we will “move” the app you built in the previous parts into it. In the last two parts we will enhance its appearance even more and, of course, deploy it.

Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let's begin!

Answers to the exercises are available here.

INSTALLATION

The packages that we are going to use are shinydashboard and shiny. To install them, run:

install.packages("shinydashboard")
install.packages("shiny")

Learn more about Shiny in the online course R Shiny Interactive Web Apps – Next Level Data Visualization. In this course you will learn how to create advanced Shiny web apps; embed video, pdfs and images; add focus and zooming tools; and many other functionalities (30 lectures, 3hrs.).

Exercise 1

Install the package shinydashboard and the package shiny in your working directory.

BASICS

A dashboard has three parts: a header, a sidebar, and a body. Here’s the most minimal possible UI for a dashboard page.

## ui.R ##
library(shinydashboard)

dashboardPage(
dashboardHeader(),
dashboardSidebar(),
dashboardBody()
)

Exercise 2

Add a dashboardPage and then Header, Sidebar and Body into your UI. HINT: Use dashboardPage, dashboardHeader, dashboardSidebar, dashboardBody.

First of all, we should give it a title, like below:
## ui.R ##
library(shinydashboard)

dashboardPage(
dashboardHeader(title = "Dashboard"),
dashboardSidebar(),
dashboardBody()
)

Exercise 3

Name your dashboard “Shiny App”. HINT: Use title.

Next, we can add content to the sidebar. For this example we’ll add menu items that behave like tabs. These function similarly to Shiny’s tabPanels: when you click on one menu item, it shows a different set of content in the main body.

There are two parts that need to be done. First, you need to add menuItems to the sidebar, with appropriate tabNames.
## Sidebar content
dashboardSidebar(
sidebarMenu(
menuItem("Dashboard", tabName = "dashboard", icon = icon("dashboard")),
menuItem("Widgets", tabName = "widgets", icon = icon("th"))
)
)

Exercise 4

Create three menuItems and name them "DATA TABLE", "SUMMARY" and "K-MEANS" respectively. Make sure to use a distinct tabName for each one of them. The icon is of your choice. HINT: Use menuItem, tabName and icon.

In the body, add tabItems with corresponding values for tabName:
## Body content
dashboardBody(
tabItems(
tabItem(tabName = "dashboard",
h2("Dashboard"),
fluidRow(
box()
)
),
tabItem(tabName = "widgets",
h2("WIDGETS")
)
)
)

Exercise 5

Add tabItems in dashboardBody. Be sure to give the same tabName to each one to get them linked with your menuItem. HINT: Use tabItems, tabItem, h2.

Obviously, this dashboard isn’t very useful. We’ll need to add components that actually do something. In the body we can add boxes that have content.

Firstly let’s create a box for our dataTable in the tabItem with tabName “dt”.
## Body content
dashboardBody(
tabItems(
tabItem(tabName = "dashboard",
h2("Dashboard"),
fluidRow(
box()
)
),
tabItem(tabName = "widgets",
h2("WIDGETS")
)
)
)

Exercise 6

Specify the fluidRow and create a box inside the "DATA TABLE" tabItem. HINT: Use fluidRow and box.

Exercise 7

Do the same for the other two tabItems. Create one fluidRow and one box in the "SUMMARY" tabItem, and another fluidRow with four boxes in the "K-MEANS" tabItem.

Now just copy and paste the code below, which you used in part 7, to move your dataTable inside the "DATA TABLE" tabItem.
#ui.R
dataTableOutput("Table"),width = 400
#server.R
output$Table <- renderDataTable(
iris,options = list(
lengthMenu = list(c(10, 20, 30,-1),c('10','20','30','ALL')),
pageLength = 10))

Exercise 8

Place the sample code above in the right place in order to add the dataTable “Table” inside the “DATA TABLE” tabItem.

Now just copy and paste the code below, which you used in part 7, to move the dataTable "Table2" inside the "SUMMARY" tabItem.
#ui.R
dataTableOutput("Table2"),width = 400

#server.R
sumiris<-as.data.frame.array(summary(iris))
output$Table2 <- renderDataTable(sumiris)

Exercise 9

Place the sample code above in the right place in order to add the dataTable “Table2” inside the “SUMMARY” tabItem.

Do the same for the last exercise; you just have to put the code from part 7 inside the "K-MEANS" tabItem.

Exercise 10

Place the K-Means plot and the three widgets from part 7 inside the four boxes you created before.

Related exercise sets:
  1. Building Shiny App exercises part 1
  2. Building Shiny App exercises part 2
  3. Building Shiny App exercises part 3
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Superpixels in imager

Fri, 03/24/2017 - 17:23

(This article was first published on R – dahtah, and kindly contributed to R-bloggers)

Superpixels are used in image segmentation as a pre-processing step. Instead of segmenting pixels directly, we first group similar pixels into “super-pixels”, which can then be processed further (and more cheaply).


(image from Wikimedia)

The current version of imager doesn’t implement them, but it turns out that SLIC superpixels are particularly easy to implement. SLIC is essentially k-means applied to pixels, with some bells and whistles.

We could use k-means to segment images based on colour alone. For good results in colour segmentation, the CIELAB colour space is appropriate because it tries to be perceptually uniform.

library(tidyverse)
library(imager)

im <- load.image("https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/Aster_Tataricus.JPG/1024px-Aster_Tataricus.JPG")

#Convert to CIELAB colour space, then create a data.frame with three colour channels as columns
d <- sRGBtoLab(im) %>% as.data.frame(wide="c") %>% dplyr::select(-x,-y)

#Run k-means with 2 centers
km <- kmeans(d,2)

#Turn cluster index into an image
seg <- as.cimg(km$cluster,dim=c(dim(im)[1:2],1,1))
plot(im,axes=FALSE)
highlight(seg==1)

We mostly manage to separate the petals from the rest, with a few errors here and there.
SLIC does pretty much the same thing, except we (a) use many more centers and (b) we add pixel coordinates as features in the clustering. The latter ensures that only adjacent pixels get grouped together.

The code below implements SLIC. It’s mostly straightforward:

#Compute SLIC superpixels
#im: input image
#nS: number of superpixels
#compactness: determines compactness of superpixels;
#low values will result in superpixels with weird shapes
#... further arguments passed to kmeans
slic <- function(im,nS,compactness=1,...)
{
    #If image is in colour, convert to CIELAB
    if (spectrum(im) ==3) im <- sRGBtoLab(im)

    #The pixel coordinates vary over 1...width(im) and 1...height(im)
    #Pixel values can be over a widely different range
    #We need our features to have similar scales, so
    #we compute relative scales of spatial dimensions to colour dimensions
    sc.spat <- (dim(im)[1:2]*.28) %>% max #Scale of spatial dimensions
    sc.col <- imsplit(im,"c") %>% map_dbl(sd) %>% max

    #Scaling ratio for pixel values
    rat <- (sc.spat/sc.col)/(compactness*10)

    X <- as.data.frame(im*rat,wide="c") %>% as.matrix
    #Generate initial centers from a grid
    ind <- round(seq(1,nPix(im)/spectrum(im),l=nS))
    #Run k-means
    km <- kmeans(X,X[ind,],...)
    #Return segmentation as image (pixel values index cluster)
    seg <- as.cimg(km$cluster,dim=c(dim(im)[1:2],1,1))
    #Superpixel image: each pixel is given the colour of the superpixel it belongs to
    sp <- map(1:spectrum(im),~ km$centers[km$cluster,2+.]) %>% do.call(c,.) %>% as.cimg(dim=dim(im))
    #Correct for ratio
    sp <- sp/rat
    if (spectrum(im)==3)
    {
        #Convert back to RGB
        sp <- LabtosRGB(sp)
    }
    list(km=km,seg=seg,sp=sp)
}

Use it as follows:

#400 superpixels
out <- slic(im,400)
#Superpixels
plot(out$sp,axes=FALSE)
#Segmentation
plot(out$seg,axes=FALSE)
#Show segmentation on original image
(im*add.colour(abs(imlap(out$seg)) == 0)) %>% plot(axes=FALSE)


The next step is to segment the superpixels but I’ll keep that for another time.
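As a teaser for that follow-up, here is one possible sketch (my own, not from the post) of segmenting the superpixels: cluster the colour part of the superpixel centres returned by slic() and push the labels back down to the pixels.

centres <- out$km$centers[, -(1:2)]   # drop the two spatial columns, keep the colour columns
grp <- kmeans(centres, 2)$cluster     # cluster the superpixels into 2 colour groups
seg2 <- as.cimg(grp[out$km$cluster], dim = c(dim(im)[1:2], 1, 1))
plot(im, axes = FALSE)
highlight(seg2 == 1)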

To leave a comment for the author, please follow the link and comment on their blog: R – dahtah. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Writing a conference abstract the data science way

Fri, 03/24/2017 - 17:01

(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)

Adnan Fiaz

Conferences are an ideal platform to share your work with the wider community. However, as we all know, conferences require potential speakers to submit abstracts about their talk. And writing abstracts is not necessarily the most rewarding work out there. I have actually never written one so when asked to prepare abstracts for this year’s conferences I didn’t really know where to start.

So, I did what any sane person would do: get data. As Mango has organised a number of EARL conferences, there are a good number of abstracts available, both accepted and not accepted. In this blogpost I'm going to use the tidytext package to analyse these abstracts and see what distinguishes the accepted abstracts from the rest.

Disclaimer: the objective of this blogpost is not to present a rigorous investigation into conference abstracts but rather an exploration of, and potential use for, the tidytext package.

The data

I don’t know what it’s like for other conferences but for EARL all abstracts are submitted through an online form. I’m not sure if these forms are stored in a database but I received them as a PDF. To convert the PDFs to text I make use of the pdftotext program as outlined in this stackoverflow thread.

list_of_files <- list.files("Abstracts Received/", pattern=".pdf", full.names=TRUE)

convert_pdf <- function(fileName){
  outputFile <- paste0('"', gsub('.pdf', '.txt', fileName),'"')
  inputFile <- paste0('"', fileName, '"')
  # I turned the layout option on because the input is a somewhat tabular layout
  command <- paste('"C:/Program Files/xpdf/bin64/pdftotext.exe" -layout', inputFile, outputFile)
  system(command, wait=FALSE)
}

result <- map(list_of_files, convert_pdf)

failures <- length(list_of_files) - sum(unlist(result))
if(failures>0){
  warning(sprintf("%i files failed to convert", c(failures)))
}

Now that I have converted the files I can read them in and extract the relevant fields.

# load functions extract_abstracts(), remove_punctuation()
source('Scripts/extract_abstracts.R')

input_files <- list.files("Abstracts Received/", pattern=".txt", full.names=TRUE)

acceptance_results <- read_csv("Abstracts Received/acceptance_results.csv") %>%
  mutate(Title=remove_punctuation(Title),
         Title=str_sub(Title, end=15)) %>%
  rename(TitleShort=Title)

# the conversion to text created files with an "incomplete final line" for which
# readLines generates a warning, hence the suppressWarnings
input_data <- suppressWarnings(map(input_files, readLines)) %>%
  map_df(extract_abstracts) %>%
  mutate(AbstractID=row_number(),
         TitleShort=str_sub(Title, end=15)) %>%
  left_join(acceptance_results, by="TitleShort") %>%
  filter(!is.na(Accepted)) %>%
  select(AbstractID, Abstract, Accepted)

# a code chunk that ends with a plot is a good code chunk
qplot(map_int(input_data$Abstract, nchar)) +
  labs(title="Length of abstracts", x="Number of characters", y="Number of abstracts")

The analysis

So, now that I have the data ready I can apply some tidytext magic. I will first convert the data into a tidy format, then clean it up a bit and finally create a few visualisations.

data(stop_words)

tidy_abstracts <- input_data %>%
  mutate(Abstract=remove_punctuation(Abstract)) %>%
  unnest_tokens(word, Abstract) %>%                 # abracadabra!
  anti_join(stop_words %>% filter(word!="r")) %>%   # In this case R is a word
  filter(is.na(as.numeric(word)))                   # filter out numbers

# my personal mantra: a code chunk that ends with a plot is a good code chunk
tidy_abstracts %>%
  count(AbstractID, Accepted) %>%
  ggplot() +
  geom_density(aes(x=n, colour=Accepted), size=1) +
  labs(title="Distribution of number of words per abstract", x="Number of words")

The abstracts with a higher number of words have a slight advantage but I wouldn’t bet on it. There is something to be said for being succinct. But what really matters is obviously content so let’s have a look at what words are commonly used.

tidy_abstracts %>%
  count(word, Accepted, sort=TRUE) %>%   # count the number of observations per category and word
  group_by(Accepted) %>%
  top_n(20) %>%                          # select the top 20 counts per category
  ungroup() %>%
  ggplot() +
  geom_col(aes(x=word, y=n, fill=Accepted), show.legend = FALSE) +
  coord_flip() +
  labs(x="", y="Count", title="Wordcount by Acceptance category") +
  facet_grid(~ Accepted)

Certainly an interesting graph! It may have been better to show the proportions instead of counts, as the number of abstracts in each category is not equal. Nevertheless, the conclusion remains the same. The words "r" and "data" are clearly the most common. However, what is more interesting is that abstracts in the "yes" category use certain words significantly more often than abstracts in the "no" category and vice versa (more often because a missing bar doesn't necessarily mean a zero observation). For example, the words "science", "production" and "performance" occur more often in the "yes" category. Vice versa, the words "tools", "product", "package" and "company(ies)" occur more often in the "no" category. Also, the word "application" occurs in its singular form in the "no" category and in its plural form in the "yes" category. Certainly, at EARL we like our applications to be plural; it is in the name, after all.

There is one important caveat with the above analysis and that is to do with the frequency of words within abstracts. The overall frequencies aren’t really that high and one abstract’s usage of a particular word can make it seem more important than it really is. Luckily the tidytext package provides a solution for that as I can now easily calculate the TF-IDF score.

tidy_abstracts %>%
  count(Accepted, word, sort=TRUE) %>%   # count the number of observations per category and word
  bind_tf_idf(word, Accepted, n) %>%     # calculate tf-idf
  group_by(Accepted) %>%
  top_n(10) %>%                          # select the top 10 scores per category
  ungroup() %>%
  ggplot() +
  geom_col(aes(x=word, tf_idf, fill=Accepted), show.legend = FALSE) +
  labs(x="", y="TF-IDF", title="TF-IDF by Acceptance category") +
  coord_flip() +
  facet_grid(~ Accepted)

Note that I have aggregated the counts over the Acceptance category as I'm interested in what words are important within a category and not within a particular abstract. There isn't an obvious pattern visible in the results, but I can certainly hypothesise. Words like "algorithm", "effects", "visualize", "ml" and "optimization" point strongly towards the application side of things, whereas words like "concept", "objects" and "statement" are softer and more generic. XBRL is the odd one out here but interesting in its own right; whoever submitted that abstract should perhaps consider re-submitting, as it's quite unique.

Next Steps

That’s it for this blogpost but here are some next steps I would do if I had more time:

  • Add more abstracts from previous years / other conferences
  • Analyse combinations of words (n-grams) to work towards what kind of sentences should go into an abstract (see the sketch after this list)
  • The content isn’t the only thing that matters. By adding more metadata (time of submission, previously presented, etc.) the model can be made more accurate
  • Try out topic modeling on the accepted abstracts to help with deciding what streams would make sense
  • Train a neural network with all abstracts and generate a winning abstract [insert evil laugh]
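To illustrate the n-gram point from the list above, a minimal sketch (assuming the same input_data and Accepted columns used earlier) could look like this:

library(dplyr)
library(tidytext)

# Tokenise the abstracts into bigrams instead of single words
bigram_abstracts <- input_data %>%
  unnest_tokens(bigram, Abstract, token = "ngrams", n = 2)

# Most frequent bigrams per acceptance category
bigram_abstracts %>%
  count(Accepted, bigram, sort = TRUE) %>%
  group_by(Accepted) %>%
  top_n(10)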
Conclusion

In this blogpost I have explored text data taken from abstract submissions to the EARL conference using the fabulous tidytext package. I analysed words from abstracts that were accepted versus those that weren't and also compared their TF-IDF score. If you want to know more about the tidytext package come to the Web Scraping and Text Mining workshop my colleagues Nic Crane and Beth Ashlee will be giving preceding the LondonR meetup this Tuesday the 28th of March. Also, if this blogpost has made you want to write an abstract, we are still accepting submissions for EARL London and EARL San Francisco (I promise I won't use it for a blogpost).

As always, the code for this post can be found on GitHub.


To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions » R Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Some Random Weekend Reading

Fri, 03/24/2017 - 17:00

(This article was first published on RStudio, and kindly contributed to R-bloggers)

by Joseph Rickert

Few of us have enough time to read, and most of us already have depressingly deep stacks of material that we would like to get through. However, sometimes a random encounter with something interesting is all that it takes to regenerate enthusiasm. Just in case you are not going to get to a book store with a good technical section this weekend, here are a few not-quite-random reads.

Deep Learning by Goodfellow, Bengio and Courville is a solid, self-contained introduction to Deep Learning that begins with Linear Algebra and ends with discussions of research topics such as Autoencoders, Representation Learning, and Boltzmann Machines. The online layout extends an invitation to click anywhere and begin reading. Sampling the chapters, I found the text to be engaging reading; much more interesting and lucid than just an online resource. For some Deep Learning practice with R and H2O, have a look at the post Deep Learning in R by Kutkina and Feuerriegel.

However, if you are under the impression that getting a handle on Deep Learning will get you totally up to speed with neural network buzzwords, you may be disappointed. Extreme Learning Machines, which "aim to break the barriers between the conventional artificial learning techniques and biological learning mechanisms", are sure to take you even deeper into the abyss. For a succinct introduction to ELMs with an application to handwritten digit classification, have a look at the recent paper by Pang and Yang. For more than an afternoon's worth of reading, browse through the IEEE Intelligent Systems issue on Extreme Learning Machines here, and the other resources collected here. See the announcement of the 2014 conference for the full context of the quote above.

For something a little lighter and closer to home, Christopher Gandrud's page on the networkD3 package is sure to set you browsing through Sankey Diagrams and Force Directed Drawing Algorithms.

Finally, if you are like me and think that the weekends are for catching up on things that you should probably already know, but on which you might be a bit shaky, remember that you can never know enough about GitHub. Compliments of GitHub’s Carolyn Shin, here is some online GitHub reading: GitHub Guides, GitHub on Demand Training, and an online version of the Pro Git Book.

Reading recommendations go both ways. Please feel free to comment with some recommendations of your own.

To leave a comment for the author, please follow the link and comment on their blog: RStudio. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R Weekly Bulletin Vol – I

Fri, 03/24/2017 - 14:44

(This article was first published on R programming, and kindly contributed to R-bloggers)

We are starting the R weekly bulletins, which will contain some interesting ways and methods to write code in R and solve nagging problems. We will also cover R functions and shortcut keys for beginners. We understand that there can be more than one way of writing code in R, and the solutions listed in the bulletins may not be the sole reference point for you. Nevertheless, we believe that the solutions listed will be helpful to many of our readers. We hope you like our R weekly bulletins. Enjoy reading them!

Shortcut Keys
  1. To move cursor to R Source Editor – Ctrl+1
  2. To move cursor to R Console – Ctrl+2
  3. To clear the R console – Ctrl+L
Problem Solving Ideas Creating user input functionality

To create the user input functionality in R, we can make use of the readline function. This gives us the flexibility to set the input values for variables of our choice when the code is run.

Example: Suppose that we have coded a backtesting strategy. We want to have the flexibility to choose the backtest period. To do so, we can create a user input “n”, signifying the backtest period in years, and add the line shown below at the start of the code.

When the code is run, it will prompt the user to enter the value for “n”. Upon entering the value, the R code will get executed for the set period and produce the desired output.

n = readline(prompt = "Enter the backtest period in years: ")
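One detail worth keeping in mind (my addition, not from the bulletin): readline returns a character string, so convert it before using it in calculations.

n = readline(prompt = "Enter the backtest period in years: ")
# readline returns a character string; convert it before doing arithmetic
n = as.numeric(n)
if (is.na(n)) stop("Please enter a numeric value for the backtest period")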

Refresh code every x seconds

To refresh a piece of code every x seconds we can use a while loop together with the Sys.sleep function. The while loop keeps executing the enclosed block of commands as long as the condition remains satisfied. We enclose the code in the while statement and keep the condition as TRUE, so it will keep looping. At the end of the code, we add the Sys.sleep function and specify the delay time in seconds. This way the code will get refreshed every "x" seconds.

Example: In this example, we initialize the x value to zero. The code is refreshed every 1 second, and it will keep printing the value of x. One can hit the escape button on the keyboard to terminate the code.

x = 0
while (TRUE) {
  x = x + 1
  print(x)
  Sys.sleep(1)
}

Running multiple R scripts sequentially

To run multiple R scripts, one can have a main script which contains the names of the scripts to be run. Running the main script will lead to the execution of the other R scripts. Assume the name of the main script is "NSE Stocks.R". In this script, we will mention the names of the scripts we wish to run within the source function. In this example, we wish to run the "Top gainers.R" and the "Top losers.R" scripts. These will be part of "NSE Stocks.R", as shown below, and we run the main script to run these two scripts.

source("Top gainers.R") source("Top losers.R")

Enclosing the R script name within the “source” function causes R to accept its input from the named file. Input is read and parsed from that file until the end of the file is reached, then the parsed expressions are evaluated sequentially in the chosen environment. Alternatively, one can also place the R script names in a vector, and use the sapply function.

Example:

filenames = c("Top gainers.R", "Top losers.R") sapply(filenames, source)

Converting a date in the American format to Standard date format

The American date format is of the type mm/dd/yyyy, whereas the ISO 8601 standard format is yyyy-mm-dd. To convert a date from the American format to the standard date format we will use the as.Date function along with its format argument. The example below illustrates the method.

Example:

# date in American format
dt = "07/24/2016"

# If we call the as.Date function on the date, it will
# throw up an error, as the default format assumed by the as.Date function is yyyy-mm-dd.
as.Date(dt)

Error in charToDate(x): character string is not in a standard unambiguous format

# Correct way of formatting the date
as.Date(dt, format = "%m/%d/%Y")

[1] “2016-07-24”

How to remove all the existing files from a folder

To remove all the files from a particular folder, one can use the unlink function. Specify the path of the folder as the argument to the function. A forward slash with an asterisk is added at the end of the path. The syntax is given below.

unlink("path/*")

Example:

unlink("C:/Users/Documents/NSE Stocks/*")

This will remove all the files present in the “NSE Stocks” folder.

Functions Demystified write.csv function

If you want to save a data frame or matrix in a csv file, R provides for the write.csv function. The syntax for the write.csv function is given as:

write.csv(x, file="filename", row.names=FALSE)

If we specify row.names=TRUE, the function prepends each row with a label taken from the row.names attribute of your data. If your data doesn't have row names, then the function just uses the row numbers. A column header line is written by default. Note that write.csv ignores attempts to change col.names; if you do not want the column headers, use write.table with sep = "," and col.names = FALSE, as sketched further below.

Example:

# Create a data frame
Ticker = c("PNB","CANBK","KTKBANK","IOB")
Percent_Change = c(2.30,-0.25,0.50,1.24)
df = data.frame(Ticker,Percent_Change)

write.csv(df, file="Banking Stocks.csv", row.names=FALSE)

This will write the data contained in the “df” dataframe to the “Banking Stocks.csv” file. The file gets saved in the R working directory.
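If you do want to drop the header row, one option (my suggestion, not part of the original bulletin) is to call write.table directly:

# write.csv ignores col.names, so use write.table with sep = "," instead
write.table(df, file = "Banking Stocks.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)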

fix function

The fix function opens the underlying code of the function provided as an argument in an editor, so that you can inspect or modify it.

Example:

fix(sd)

The underlying code for the standard deviation function is as shown below. This is displayed when we execute the fix function with "sd" as the argument.

function (x, na.rm = FALSE) sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm))
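If you only want to inspect a function's code without opening an editor, a couple of base R alternatives (not covered in the bulletin) also work:

sd                 # typing the function name prints its source
body(sd)           # prints just the body of the function
getAnywhere("sd")  # also finds functions hidden in namespaces or registered as S3 methods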

download.file function

The download.file function helps download a file from a website. This could be a webpage, a csv file, an R file, etc. The syntax for the function is given as:

download.file(url, destfile)

where,
url – the Uniform Resource Locator (URL) of the file to be downloaded
destfile – the location to save the downloaded file, i.e. path with a file name

Example: In this example, the function will download the file from the path given in the "url" argument, and save it on the D drive within the "Skills" folder under the name "wacc.xls".

url = "http://www.exinfm.com/excel%20files/betawacc.xls" destfile = "D:/Skills/wacc.xls" download.file(url, destfile)

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

Download the PDF Now!

The post R Weekly Bulletin Vol – I appeared first on .

To leave a comment for the author, please follow the link and comment on their blog: R programming. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Web data acquisition: parsing json objects with tidyjson (Part 3)

Fri, 03/24/2017 - 12:56

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

The collection of example flight data in json format, described in part 2, covered the libraries and the structure of the POST request necessary to collect the data in a json object. Although the process generated and transferred a proper response locally, the data collected were neither in a suitable structure for data analysis nor immediately readable. They appear as just a long string of information, nested and separated according to the JavaScript Object Notation syntax.

Thus, to visualize the deeply nested json object and make it human readable and understandable for further processing, the json content can be copied and pasted into a common online parser. Such a tool allows you to select each node of the tree and observe the data structure down to the variables and data of interest for the statistical analysis. The bulk of the relevant information for the analysis of flight prices is hidden in the tripOption node, as shown in the following figure (only 50 flight solutions were requested).

However, looking deeper into the object, several other elements are provided, such as the distance in miles, the segment, the duration, the carrier, etc. The R parser that transforms the json structure into a usable dataframe requires the dplyr library for the pipe operator (%>%), which streamlines the code and makes the parser more readable. Nevertheless, the library actually wrangling through the lines is tidyjson, with its powerful functions:

  • enter_object: enters and dives into a data object;
  • gather_array: stacks a JSON array;
  • spread_values: creates new columns from values assigning specific type (e.g. jstring, jnumber).
library(dplyr)    # for pipe operator %>% and other dplyr functions
library(tidyjson) # https://cran.r-project.org/web/packages/tidyjson/vignettes/introduction-to-tidyjson.html

data_items <- datajson %>%
  spread_values(kind = jstring("kind")) %>%
  spread_values(trips.kind = jstring("trips","kind")) %>%
  spread_values(trips.rid = jstring("trips","requestId")) %>%
  enter_object("trips","tripOption") %>%
  gather_array %>%
  spread_values(
    id = jstring("id"),
    saleTotal = jstring("saleTotal")) %>%
  enter_object("slice") %>%
  gather_array %>%
  spread_values(slice.kind = jstring("kind")) %>%
  spread_values(slice.duration = jstring("duration")) %>%
  enter_object("segment") %>%
  gather_array %>%
  spread_values(
    segment.kind = jstring("kind"),
    segment.duration = jnumber("duration"),
    segment.id = jstring("id"),
    segment.cabin = jstring("cabin")) %>%
  enter_object("leg") %>%
  gather_array %>%
  spread_values(
    segment.leg.aircraft = jstring("aircraft"),
    segment.leg.origin = jstring("origin"),
    segment.leg.destination = jstring("destination"),
    segment.leg.mileage = jnumber("mileage")) %>%
  select(kind, trips.kind, trips.rid, saleTotal, id, slice.kind, slice.duration,
         segment.kind, segment.duration, segment.id, segment.cabin,
         segment.leg.aircraft, segment.leg.origin, segment.leg.destination,
         segment.leg.mileage)

head(data_items)

  kind                   trips.kind             trips.rid              saleTotal
1 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR178.38
2 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR178.38
3 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR235.20
4 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR235.20
5 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR248.60
6 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR248.60
  id                        slice.kind           slice.duration
1 ftm7QA6APQTQ4YVjeHrxLI006 qpxexpress#sliceInfo 510
2 ftm7QA6APQTQ4YVjeHrxLI006 qpxexpress#sliceInfo 510
3 ftm7QA6APQTQ4YVjeHrxLI009 qpxexpress#sliceInfo 490
4 ftm7QA6APQTQ4YVjeHrxLI009 qpxexpress#sliceInfo 490
5 ftm7QA6APQTQ4YVjeHrxLI007 qpxexpress#sliceInfo 355
6 ftm7QA6APQTQ4YVjeHrxLI007 qpxexpress#sliceInfo 355
  segment.kind           segment.duration segment.id       segment.cabin
1 qpxexpress#segmentInfo 160              GixYrGFgbbe34NsI COACH
2 qpxexpress#segmentInfo 235              Gj1XVe-oYbTCLT5V COACH
3 qpxexpress#segmentInfo 190              Grt369Z0shJhZOUX COACH
4 qpxexpress#segmentInfo 155              GRvrptyoeTfrSqg8 COACH
5 qpxexpress#segmentInfo 100              GXzd3e5z7g-5CCjJ COACH
6 qpxexpress#segmentInfo 105              G8axcks1R8zJWKrN COACH
  segment.leg.aircraft segment.leg.origin segment.leg.destination segment.leg.mileage
1 320                  FCO                IST                      859
2 77W                  IST                LHR                     1561
3 73H                  FCO                ARN                     1256
4 73G                  ARN                LHR                      908
5 319                  FCO                STR                      497
6 319                  STR                LHR                      469

Data are now in an R-friendly structure, though not yet ready for analysis. As can be observed from the first rows, each record has information on a single segment of the selected flight. A further step of aggregation, using some SQL, is needed in order to end up with a dataframe of flight data suitable for statistical analysis.
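As a preview of that aggregation step, here is a possible sketch using dplyr rather than SQL (column names as produced by the parser above; the exact aggregation will depend on the analysis):

flights <- data_items %>%
  group_by(id, saleTotal) %>%
  summarise(n_segments     = n(),
            total_duration = sum(segment.duration),
            total_mileage  = sum(segment.leg.mileage)) %>%
  ungroup()

head(flights)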

Next up, the aggregation, some data analysis and data visualization to complete the journey through the web data acquisition using R.

#R #rstats #maRche #json #curl #tidyjson #Rbloggers

This post is also shared in www.r-bloggers.com and LinkedIn

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R – Change columns names in a spatial dataframe

Fri, 03/24/2017 - 12:32

(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

Ordnance Survey have a great OpenRoads dataset, but unfortunately it contains a column called ‘primary’, which is a keyword in SQL. This makes it challenging/impossible to import the OpenRoads dataset into a SQL database (e.g. GRASS), without changing the offending column name.

Enter R! Or any other capable programming language. The following script reads a folder of shp files, changes a given column name and overwrites the original files. Note the use of the '@' symbol to access a slot of the S4 class object (thanks Stack Exchange).

library(rgdal)

f = list.files("~/Downloads/OS/data", pattern="shp")
f = substr(f, 1, nchar(f) - 4)

lapply(f, function(i){
  x = readOGR("/home/mspencer/Downloads/OS/data", i)
  colnames(x@data)[11] = "prim"
  writeOGR(x, "/home/mspencer/Downloads/OS/data", i, "ESRI Shapefile", overwrite_layer=T)
})
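A slightly safer variant (my suggestion, untested on the OpenRoads data) is to rename the column by name rather than by position, so the script does not depend on 'primary' being the 11th column; names() on a Spatial*DataFrame reaches into the data slot:

lapply(f, function(i){
  x = readOGR("/home/mspencer/Downloads/OS/data", i)
  names(x)[names(x) == "primary"] = "prim"  # rename by name instead of position
  writeOGR(x, "/home/mspencer/Downloads/OS/data", i, "ESRI Shapefile", overwrite_layer=T)
})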

 

To leave a comment for the author, please follow the link and comment on their blog: R – scottishsnow. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Neural Networks for Learning Lyrics

Fri, 03/24/2017 - 12:21

(This article was first published on More or Less Numbers, and kindly contributed to R-bloggers)

I created a Twitter account which was inspired by a couple of Twitter accounts that applied a particular type of machine learning technique to learn how two (at the time) presidential hopefuls spoke. I thought, why not see what a model like this could do with lyrics from my favorite rock n roll artist? Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture that can be used to produce sentences or phrases by learning from text. The two Twitter accounts that inspired this were '@deeplearnthebern' and '@deepdrumpf', which use this technique to produce phrases and sentences. I scraped a little more than 300 of his songs and fed them to an LSTM model using R and the mxnet library. Primarily I used mxnet.io to build and train the model…great site and tools. The tutorials on their site are very helpful, particularly this one.

The repository here contains the code for the scraper and other information. Follow deeplearnbruce for tweets that are hopefully entertaining for Springsteen fans or anyone else.

To leave a comment for the author, please follow the link and comment on their blog: More or Less Numbers. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Lesser known purrr tricks

Fri, 03/24/2017 - 12:00

purrr is a package that extends R's functional programming capabilities. It brings a lot of new stuff to the table, and in this post I show you some of the most useful (at least to me) functions included in purrr.

Getting rid of loops with map()

library(purrr)

numbers <- list(11, 12, 13, 14)

map_dbl(numbers, sqrt)
## [1] 3.316625 3.464102 3.605551 3.741657

You might wonder why this might be preferred to a for loop? It's a lot less verbose, and you do not need to initialise any kind of structure to hold the result. If you google "create empty list in R" you will see that this is very common. However, with the map() family of functions, there is no need for an initial structure. map_dbl() returns an atomic vector of doubles, but if you use map() you will get a list back. Try them all out!

Map conditionally

map_if()

# Create a helper function that returns TRUE if a number is even
is_even <- function(x){
  !as.logical(x %% 2)
}

map_if(numbers, is_even, sqrt)
## [[1]]
## [1] 11
##
## [[2]]
## [1] 3.464102
##
## [[3]]
## [1] 13
##
## [[4]]
## [1] 3.741657

map_at()

map_at(numbers, c(1,3), sqrt)
## [[1]]
## [1] 3.316625
##
## [[2]]
## [1] 12
##
## [[3]]
## [1] 3.605551
##
## [[4]]
## [1] 14

map_if() and map_at() take one more argument than map(): in the case of map_if(), a predicate function (a function that returns TRUE or FALSE), and in the case of map_at(), a vector of positions. This allows you to map your function only when certain conditions are met, which is also something that a lot of people google for.

Map a function with multiple arguments

numbers2 <- list(1, 2, 3, 4)

map2(numbers, numbers2, `+`)
## [[1]]
## [1] 12
##
## [[2]]
## [1] 14
##
## [[3]]
## [1] 16
##
## [[4]]
## [1] 18

You can map two lists to a function which takes two arguments using map2(). You can even map an arbitrary number of lists to any function using pmap().
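As a quick illustration of pmap() (my example, not from the original post), you pass it a list of lists and a function with as many arguments:

numbers3 <- list(10, 20, 30, 40)
pmap(list(numbers, numbers2, numbers3), function(a, b, c) a + b + c)
## returns a list containing 22, 34, 46 and 58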

By the way, try this: `+`(1, 3) and see what happens.

Don't stop execution of your function if something goes wrong

possible_sqrt <- possibly(sqrt, otherwise = NA_real_)

numbers_with_error <- list(1, 2, 3, "spam", 4)

map(numbers_with_error, possible_sqrt)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
##
## [[4]]
## [1] NA
##
## [[5]]
## [1] 2

Another very common issue is to keep running your loop even when something goes wrong. In most cases the loop simply stops at the error, but you would like it to continue and see where it failed. Try to google “skip error in a loop” or some variation of it and you’ll see that a lot of people really just want that. This is possible by combining map() and possibly(). Most solutions involve the use of tryCatch() which I personally do not find very easy to use.

Don't stop execution of your function if something goes wrong and capture the error

safe_sqrt <- safely(sqrt, otherwise = NA_real_)

map(numbers_with_error, safe_sqrt)
## [[1]]
## [[1]]$result
## [1] 1
##
## [[1]]$error
## NULL
##
##
## [[2]]
## [[2]]$result
## [1] 1.414214
##
## [[2]]$error
## NULL
##
##
## [[3]]
## [[3]]$result
## [1] 1.732051
##
## [[3]]$error
## NULL
##
##
## [[4]]
## [[4]]$result
## [1] NA
##
## [[4]]$error
## <simpleError in .f(...): non-numeric argument to mathematical function>
##
##
## [[5]]
## [[5]]$result
## [1] 2
##
## [[5]]$error
## NULL

safely() is very similar to possibly(), but it returns a list of lists. Each element is thus a list containing the result and the accompanying error message. If there is no error, the error component is NULL; if there is an error, it contains the error message.

Transpose a list

safe_result_list <- map(numbers_with_error, safe_sqrt)

transpose(safe_result_list)
## $result
## $result[[1]]
## [1] 1
##
## $result[[2]]
## [1] 1.414214
##
## $result[[3]]
## [1] 1.732051
##
## $result[[4]]
## [1] NA
##
## $result[[5]]
## [1] 2
##
##
## $error
## $error[[1]]
## NULL
##
## $error[[2]]
## NULL
##
## $error[[3]]
## NULL
##
## $error[[4]]
## <simpleError in .f(...): non-numeric argument to mathematical function>
##
## $error[[5]]
## NULL

Here we transposed the above list. This means that we still have a list of lists, but where the first list holds all the results (which you can then access with safe_result_list$result) and the second list holds all the errors (which you can access with safe_result_list$error). This can be quite useful!

Apply a function to a lower depth of a list

transposed_list <- transpose(safe_result_list)

transposed_list %>% at_depth(2, is_null)
## $result
## $result[[1]]
## [1] FALSE
##
## $result[[2]]
## [1] FALSE
##
## $result[[3]]
## [1] FALSE
##
## $result[[4]]
## [1] FALSE
##
## $result[[5]]
## [1] FALSE
##
##
## $error
## $error[[1]]
## [1] TRUE
##
## $error[[2]]
## [1] TRUE
##
## $error[[3]]
## [1] TRUE
##
## $error[[4]]
## [1] FALSE
##
## $error[[5]]
## [1] TRUE

Sometimes working with lists of lists can be tricky, especially when we want to apply a function to the sub-lists. This is easily done with at_depth()!

Set names of list elements

name_element <- c("sqrt()", "ok?")

set_names(transposed_list, name_element)
## $`sqrt()`
## $`sqrt()`[[1]]
## [1] 1
##
## $`sqrt()`[[2]]
## [1] 1.414214
##
## $`sqrt()`[[3]]
## [1] 1.732051
##
## $`sqrt()`[[4]]
## [1] NA
##
## $`sqrt()`[[5]]
## [1] 2
##
##
## $`ok?`
## $`ok?`[[1]]
## NULL
##
## $`ok?`[[2]]
## NULL
##
## $`ok?`[[3]]
## NULL
##
## $`ok?`[[4]]
## <simpleError in .f(...): non-numeric argument to mathematical function>
##
## $`ok?`[[5]]
## NULL

Reduce a list to a single value

reduce(numbers, `*`)
## [1] 24024

reduce() applies the function * iteratively to the list of numbers. There’s also accumulate():

accumulate(numbers, `*`)
## [1] 11 132 1716 24024

which keeps the intermediary results.

This function is very general, and you can reduce anything:

Matrices:

mat1 <- matrix(rnorm(10), nrow = 2)
mat2 <- matrix(rnorm(10), nrow = 2)
mat3 <- matrix(rnorm(10), nrow = 2)

list_mat <- list(mat1, mat2, mat3)

reduce(list_mat, `+`)
##            [,1]       [,2]       [,3]       [,4]      [,5]
## [1,] -0.5228188  0.4813357  0.3808749 -1.1678164 0.3080001
## [2,] -3.8330509 -0.1061853 -3.8315768  0.3052248 0.3486929

even data frames:

df1 <- as.data.frame(mat1)
df2 <- as.data.frame(mat2)
df3 <- as.data.frame(mat3)

list_df <- list(df1, df2, df3)

reduce(list_df, dplyr::full_join)
## Joining, by = c("V1", "V2", "V3", "V4", "V5")
## Joining, by = c("V1", "V2", "V3", "V4", "V5")
##            V1         V2          V3         V4         V5
## 1  0.01587062  0.8570925  1.04330594 -0.5354500  0.7557203
## 2 -0.46872345  0.3742191 -1.88322431  1.4983888 -1.2691007
## 3 -0.60675851 -0.7402364 -0.49269182 -0.4884616 -1.0127531
## 4 -1.49619518  1.0714251  0.06748534  0.6650679  1.1709317
## 5  0.06806907  0.3644795 -0.16973919 -0.1439047  0.5650329
## 6 -1.86813223 -1.5518295 -2.01583786 -1.8582319  0.4468619

Hope you enjoyed this list of useful functions! If you enjoy the content of my blog, you can follow me on twitter.

Pages