A Shiny App to Draw Curves Based on L-System

R bloggers - Mon, 06/26/2017 - 23:46

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Don’t worry about a thing ’cause every little thing gonna be alright (Three Little Birds, Bob Marley)

One of my favourite books is The Computational Beauty of Nature by Gary William Flake, which has a fantastic chapter about fractals in which I discovered L-systems.

L-systems were conceived in 1968 by Aristid Lindenmayer, a Hungarian biologist, as a mathematical description of plant growth. Apart from Wikipedia, there are many places on the Internet where you can read about them. If you are interested, don’t miss The Algorithmic Beauty of Plants, an awesome book by Przemysław Prusinkiewicz that you can obtain here for free.

Roughly speaking, an L-system is a very efficient way to make drawings. In its simplest form it consists of two different actions: drawing a straight line and changing the angle. This is just what you need, for example, to draw a square: draw a straight line of any length, turn 90 degrees (without drawing), draw another straight line of the same length, turn 90 degrees in the same direction, then draw, turn and draw again. Denoting by F the action of drawing a line of length d and by + a 90-degree right turn, the whole process of drawing a square can be represented as F+F+F+F.

L-systems are quite simple to program in R. You only need to substitute the rules iteratively into the axiom (I use the gsubfn function to do it) and split the resulting string into its parts, with str_extract_all for example. The result is a set of very simple actions (draw or turn) that can be visualized with ggplot2 and its path geometry. There are four important parameters in an L-system:

  • The seed of the drawing, called axiom
  • The substitutions to be applied iteratively, called rules
  • How many times to apply substitutions, called depth
  • The angle of each turn, called angle

For example, let’s define the following L-system:

  • Axiom: F-F-F-F
  • Rule: F → F-F+F+FF-F-F+F

The rule means that every F must be replaced by F-F+F+FF-F-F+F, while + means a right turn and - a left one. After one iteration, the axiom becomes F-F+F+FF-F-F+F-F-F+F+FF-F-F+F-F-F+F+FF-F-F+F-F-F+F+FF-F-F+F and, iterating again, the new string is F-F+F+FF-F-F+F-F-F+F+FF-F-F+F+F-F+F+FF-F-F+F+F-F+F+FF-F-F+FF-F+F+FF-F-F+F-F-F+F+FF-F-F+F-F-F+F+FF-F-F+F+F-F+F+FF-F-F+F-F-F+F+FF-F-F+F-F-F+F+FF-F-F+F+F-F+F+FF-F-F+F+F-F+F+FF-F-F+FF-F+F+FF-F-F+F-F-F+F+FF-F-F+F-F-F+F+FF-F-F+F+F-F+F+FF-F-F+F-F-F+F+FF-F-F+F-F-F+F+FF-F-F+F+F-F+F+FF-F-F+F+F-F+F+FF-F-F+FF-F+F+FF-F-F+F-F-F+F+FF-F-F+F-F-F+F+FF-F-F+F+F-F+F+FF-F-F+F-F-F+F+FF-F-F+F-F-F+F+FF-F-F+F+F-F+F+FF-F-F+F+F-F+F+FF-F-F+FF-F+F+FF-F-F+F-F-F+F+FF-F-F+F-F-F+F+FF-F-F+F+F-F+F+FF-F-F+F. As you can see, the length of the string grows exponentially. Converting the last string into actions produces this drawing, called the Koch Island:
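
For readers who want to try this outside the app, here is a minimal, standalone sketch of that pipeline. It is a condensed version of the server code shown at the end of this post and assumes the gsubfn, stringr and ggplot2 packages are installed:

library(gsubfn)
library(stringr)
library(ggplot2)

axiom <- "F-F-F-F"
rules <- list("F" = "F-F+F+FF-F-F+F")
angle <- 90   # turning angle in degrees
depth <- 2    # number of rule substitutions
alfa0 <- 90   # initial heading

# Substitute the rules into the axiom 'depth' times
for (i in seq_len(depth)) axiom <- gsubfn(".", rules, axiom)

# Split the resulting string into single draw/turn actions
actions <- unlist(str_extract_all(axiom, "\\+|\\-|F"))

# Turtle interpretation: move one unit for F, change heading for + and -
points <- data.frame(x = 0, y = 0, alfa = alfa0)
for (a in actions) {
  last <- points[nrow(points), ]
  if (a == "F") {
    points <- rbind(points,
                    data.frame(x = last$x + cos(last$alfa * pi / 180),
                               y = last$y + sin(last$alfa * pi / 180),
                               alfa = last$alfa))
  } else {
    points[nrow(points), "alfa"] <- last$alfa + ifelse(a == "+", angle, -angle)
  }
}

ggplot(points, aes(x, y)) + geom_path() + coord_fixed()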

It is funny how different axioms and rules produce very different drawings. I have built a Shiny app to play with L-systems. Although it is quite simple, it has two interesting features I would like to underline:

  • Delaying reactions with eventReactive so that depth and angle values can be set before refreshing the plot
  • Building a dynamic UI that reacts to user input depending on the curve chosen (a stripped-down sketch of both ideas follows this list)
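
Here is that stripped-down sketch; the inputs and default values are illustrative only, not the ones used in my app:

library(shiny)

ui <- fluidPage(
  selectInput("cur", "Choose a curve:", c("", "Koch Island", "Dragon Curve")),
  uiOutput("Iterations"),     # built by the server, so it can depend on the curve
  actionButton("go", "Go!"),
  plotOutput("curve")
)

server <- function(input, output) {
  # Dynamic UI: the default depth depends on the chosen curve
  output$Iterations <- renderUI({
    default <- if (input$cur == "Dragon Curve") 10 else 2
    numericInput("ite", "Depth:", default, min = 1)
  })
  # Delayed reaction: nothing is recomputed until the button is pressed
  data <- eventReactive(input$go, {
    data.frame(x = seq_len(input$ite), y = seq_len(input$ite))
  })
  output$curve <- renderPlot(plot(data()))
}

shinyApp(ui, server)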

There are twelve curves in the application: the Koch Island (and six variations), the quadratic snowflake, the Sierpinski triangle, the hexagonal Gosper curve, the quadratic Gosper curve and the Dragon curve. These are their plots:

[Plots of the twelve curves]

The definitions of all these curves (axioms and rules) can be found in the first chapter of Prusinkiewicz’s book. The magic comes when you modify angles and colors. These are some examples among the infinite number of possibilities that can be created:

[Examples of the curves with modified angles and colors]

I enjoyed building and playing with the app a lot. You can try it here. If you create a nice drawing, please let me know on Twitter or drop me an email. This is the code of the app:

ui.R:

library(shiny)

shinyUI(fluidPage(
  titlePanel("Curves based on L-systems"),
  sidebarLayout(
    sidebarPanel(
      selectInput("cur", "Choose a curve:",
                  c("", "Koch Island", "Cuadratic Snowflake", "Koch Variation 1", "Koch Variation 2",
                    "Koch Variation 3", "Koch Variation 4", "Koch Variation 5", "Koch Variation 6",
                    "Sierpinsky Triangle", "Dragon Curve", "Hexagonal Gosper Curve", "Quadratic Gosper Curve"),
                  selected = ""),
      conditionalPanel(condition = "input.cur != \"\"", uiOutput("Iterations")),
      conditionalPanel(condition = "input.cur != \"\"", uiOutput("Angle")),
      conditionalPanel(condition = "input.cur != \"\"",
                       selectInput("lic", label = "Line color:", choices = colors(), selected = "black")),
      conditionalPanel(condition = "input.cur != \"\"",
                       selectInput("bac", label = "Background color:", choices = colors(), selected = "white")),
      conditionalPanel(condition = "input.cur != \"\"",
                       actionButton(inputId = "go", label = "Go!",
                                    style = "color: #fff; background-color: #337ab7; border-color: #2e6da4"))
    ),
    mainPanel(plotOutput("curve", height = "550px", width = "100%"))
  )
))

server.R:

library(shiny)
library(gsubfn)
library(stringr)
library(dplyr)
library(ggplot2)
library(rlist)

shinyServer(function(input, output) {

  curves = list(
    list(name = "Koch Island", axiom = "F-F-F-F", rules = list("F" = "F-F+F+FF-F-F+F"), angle = 90, n = 2, alfa0 = 90),
    list(name = "Cuadratic Snowflake", axiom = "-F", rules = list("F" = "F+F-F-F+F"), angle = 90, n = 4, alfa0 = 90),
    list(name = "Koch Variation 1", axiom = "F-F-F-F", rules = list("F" = "FF-F-F-F-F-F+F"), angle = 90, n = 3, alfa0 = 90),
    list(name = "Koch Variation 2", axiom = "F-F-F-F", rules = list("F" = "FF-F-F-F-FF"), angle = 90, n = 4, alfa0 = 90),
    list(name = "Koch Variation 3", axiom = "F-F-F-F", rules = list("F" = "FF-F+F-F-FF"), angle = 90, n = 3, alfa0 = 90),
    list(name = "Koch Variation 4", axiom = "F-F-F-F", rules = list("F" = "FF-F--F-F"), angle = 90, n = 4, alfa0 = 90),
    list(name = "Koch Variation 5", axiom = "F-F-F-F", rules = list("F" = "F-FF--F-F"), angle = 90, n = 5, alfa0 = 90),
    list(name = "Koch Variation 6", axiom = "F-F-F-F", rules = list("F" = "F-F+F-F-F"), angle = 90, n = 4, alfa0 = 90),
    list(name = "Sierpinsky Triangle", axiom = "R", rules = list("L" = "R+L+R", "R" = "L-R-L"), angle = 60, n = 6, alfa0 = 0),
    list(name = "Dragon Curve", axiom = "L", rules = list("L" = "L+R+", "R" = "-L-R"), angle = 90, n = 10, alfa0 = 90),
    list(name = "Hexagonal Gosper Curve", axiom = "L",
         rules = list("L" = "L+R++R-L--LL-R+", "R" = "-L+RR++R+L--L-R"), angle = 60, n = 4, alfa0 = 60),
    list(name = "Quadratic Gosper Curve", axiom = "-R",
         rules = list("L" = "LL-R-R+L+L-R-RL+R+LLR-L+R+LL+R-LR-R-L+L+RR-",
                      "R" = "+LL-R-R+L+LR+L-RR-L-R+LRR-L-RL+L+R-R-L+L+RR"), angle = 90, n = 2, alfa0 = 90))

  output$Iterations <- renderUI({
    if (input$cur != "") curve = list.filter(curves, name == input$cur)
    else curve = list.filter(curves, name == "Koch Island")
    iterations = list.select(curve, n) %>% unlist
    numericInput("ite", "Depth:", iterations, min = 1, max = (iterations + 2))
  })

  output$Angle <- renderUI({
    curve = list.filter(curves, name == input$cur)
    angle = list.select(curve, angle) %>% unlist
    numericInput("ang", "Angle:", angle, min = 0, max = 360)
  })

  data <- eventReactive(input$go, {
    curve = list.filter(curves, name == input$cur)
    axiom = list.select(curve, axiom) %>% unlist
    rules = list.select(curve, rules)[[1]]$rules
    alfa0 = list.select(curve, alfa0) %>% unlist
    for (i in 1:input$ite) axiom = gsubfn(".", rules, axiom)
    actions = str_extract_all(axiom, "\\d*\\+|\\d*\\-|F|L|R|\\[|\\]|\\|") %>% unlist
    points = data.frame(x = 0, y = 0, alfa = alfa0)
    for (i in 1:length(actions)) {
      if (actions[i] == "F" | actions[i] == "L" | actions[i] == "R") {
        x = points[nrow(points), "x"] + cos(points[nrow(points), "alfa"] * (pi / 180))
        y = points[nrow(points), "y"] + sin(points[nrow(points), "alfa"] * (pi / 180))
        alfa = points[nrow(points), "alfa"]
        points %>% rbind(data.frame(x = x, y = y, alfa = alfa)) -> points
      } else {
        alfa = points[nrow(points), "alfa"]
        points[nrow(points), "alfa"] = eval(parse(text = paste0("alfa", actions[i], input$ang)))
      }
    }
    return(points)
  })

  output$curve <- renderPlot({
    ggplot(data(), aes(x, y)) +
      geom_path(color = input$lic) +
      coord_fixed(ratio = 1) +
      theme(legend.position = "none",
            panel.background = element_rect(fill = input$bac),
            panel.grid = element_blank(),
            axis.ticks = element_blank(),
            axis.title = element_blank(),
            axis.text = element_blank())
  })
})

To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin.

Useful tricks when including images in Rmarkdown documents

R bloggers - Mon, 06/26/2017 - 23:46

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Rmarkdown is an enormously useful system for combining text, output and graphics generated by R into a single document. Images, in particular, are a powerful means of communication in a report, whether they are data visualizations, diagrams, or pictures. To maximize the power of those images, Zev Ross has created a comprehensive list of tips and tricks for working with images in R Markdown documents (see the link at the end of this post for the full list).

While Zev's guide focuses on manipulating images as generated by R, this might also be a good time to revisit these tips on making the graphics themselves as attractive and useful as possible. Although written many years ago, most of these tips are still relevant, in particular:

  • If you plan to print (or these days, view on high-resolution displays), use PDF output
  • Use PNG for R graphics output, and use a high resolution if you plan to print. Don't ever use JPG for R graphics output (see the example chunk options after this list).
  • Think about the aspect ratio of your graphics.
  • Remove unused whitespace around your graphics (more of an issue for base R graphics; ggplot2 handles this pretty well already).
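
In an Rmarkdown document these choices mostly translate into knitr chunk options. As an illustration only (the option names are knitr's; the values are arbitrary and not taken from Zev's guide):

```{r pressure-plot, dev = "png", dpi = 300, fig.width = 6, fig.asp = 0.618, out.width = "80%"}
plot(pressure)
```

Here dev and dpi control the graphics device and resolution, fig.width and fig.asp set the size and aspect ratio of the figure R draws, and out.width controls how wide the image appears in the rendered document.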

You may also find the Rmarkdown cheat sheet helpful if you're working with Rmarkdown documents.

ZevRoss Technical Blog: Tips and tricks for working with images and figures in R Markdown documents


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

bigrquery 0.4.0

R bloggers - Mon, 06/26/2017 - 23:43

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

I’m pleased to announce that bigrquery 0.4.0 is now on CRAN. bigrquery makes it possible to talk to Google’s BigQuery cloud database. It provides both DBI and dplyr backends so you can interact with BigQuery using either low-level SQL or high-level dplyr verbs.

Install the latest version of bigrquery with:

install.packages("bigrquery")

Basic usage

Connect to a bigquery database using DBI:

library(dplyr)

con <- DBI::dbConnect(dbi_driver(),
  project = "publicdata",
  dataset = "samples",
  billing = "887175176791"
)

DBI::dbListTables(con)
#> [1] "github_nested"   "github_timeline" "gsod"            "natality"
#> [5] "shakespeare"     "trigrams"        "wikipedia"

(You’ll be prompted to authenticate interactively, or you can use a service token with set_service_token().)
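
For non-interactive use that might look something like this (a sketch; the file path is a placeholder for your own service-account JSON key):

set_service_token("path/to/service-account-key.json")  # placeholder path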

Then you can either submit your own SQL queries or use dplyr to write them for you:

shakespeare <- con %>% tbl("shakespeare")

shakespeare %>%
  group_by(word) %>%
  summarise(n = sum(word_count))
#> 0 bytes processed
#> # Source:   lazy query [?? x 2]
#> # Database: BigQueryConnection
#>            word     n
#>
#>  1   profession    20
#>  2       augury     2
#>  3 undertakings     3
#>  4      surmise     8
#>  5     religion    14
#>  6     advanced    16
#>  7     Wormwood     1
#>  8    parchment     8
#>  9      villany    49
#> 10         digs     3
#> # ... with more rows

New features
  • dplyr support has been updated to require dplyr 0.7.0 and use dbplyr. This means that you can now work more naturally with DBI connections. dplyr now translates to modern BigQuery SQL, which supports a broader set of translations. Along the way I also fixed a variety of SQL generation bugs.
  • New functions: insert_extract_job() makes it possible to extract data and save it in Google Storage, and insert_table() allows you to insert empty tables into a dataset.
  • All POST requests (inserts, updates, copies and query_exec) now take .... This allows you to add arbitrary additional data to the request body, making it possible to use parts of the BigQuery API that are otherwise not exposed. snake_case argument names are automatically converted to camelCase, so you can stick consistently to snake case in your R code.
  • Full support for DATE, TIME, and DATETIME types (#128).

There were a variety of bug fixes and other minor improvements: see the release notes for full details.

Contributors

bigrquery is a community effort: a big thanks goes to Christofer Bäcklin, Jarod G.R. Meng and Akhmed Umyarov for their pull requests. Thank you all for your contributions!


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

Weighted population density

R bloggers - Mon, 06/26/2017 - 21:00

(This article was first published on R – Decision Science News, and kindly contributed to R-bloggers)

DENSITY THE AVERAGE PERSON EXPERIENCES



In Alaska, there’s about one person for each square mile of land.

You might picture a typical Alaskan not being able to see the next house.

But it’s not that way of course.

Most of Alaska is uninhabited. People have crowded into a few areas.

The average Alaskan experiences a population density of about 72 people per square mile.

That’s a lot more than one.

In the R code below, we roughly estimate the weighted population density for each US state, that is, the population density that the average person experiences. We do this for each state by taking the average of its counties’ population densities, weighted by the population of each county. It would be even better to do this for smaller areas, such as census tracts, but we were too lazy to chase down the data.
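
As a toy illustration of the weighting (made-up numbers, not census data): a state with one dense and one nearly empty county has a weighted density close to what most of its residents actually experience.

pop     <- c(10, 990)     # county populations
density <- c(1, 1000)     # county population densities (people per square mile)
sum(pop / sum(pop) * density)
#> [1] 990.01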

In the figure at the top of this post, we see the weighted and unweighted population densities for each state.

Note how New Jersey has a higher population density than New York, but when you look at what the average person experiences, it flips.

Amazingly, the average person in New York State shares a square mile with more than 10,000 other residents!

What states have the biggest ratios of weighted to unweighted densities? The chart below shows states with a 10x or greater ratio in blue.


Here are the states with the biggest ratios:

Alaska – The average person experiences a population density that is 62 times greater than the state’s density.
New York – 39 times
Nebraska – 24 times
Utah – 21 times
Colorado – 16 times
Minnesota – 14 times
Oregon – 13 times
New Mexico – 12 times
Kansas – 12 times
Texas – 12 times
Illinois – 12 times

R CODE FOR THOSE WHO WISH TO FOLLOW ALONG AT HOME

library(tidyverse)
library(ggrepel)
library(scales)
setwd("C:/Dropbox/Projects/20170605_Population_Density")
#source https://factfinder.census.gov/bkmk/table/1.0/en/DEC/10_SF1/GCTPH1.US05PR
df <- read_csv("DEC_10_SF1_GCTPH1.US05PR.csv", skip = 1)
names(df) = c(
"id", "state", "country", "geo_id", "geo_id_suffix",
"geographic_area", "county_name", "population", "housing_units",
"total_area", "water_area", "land_area",
"density_population_sqmi_land", "density_housing_units_sqmi_land"
)
#drop puerto rico and DC. sorry guys!
df = df %>%
filter(geo_id != "0400000US72") %>%
filter(geo_id != "0500000US11001") %>%
filter(geo_id != "0400000US11")
#make a state data frame with just four facts for each state (for later joining)
sdf = df %>%
filter(!is.na(geo_id_suffix)) %>%
filter(stringr::str_length(geo_id_suffix) < 5) %>% #states have short geoids
mutate(
state = stringr::str_sub(geo_id_suffix, 1, 2),
geographic_area = stringr::str_sub(geographic_area, 16, stringr::str_length(geographic_area))
) %>%
select(state,
geographic_area,
population,
density_population_sqmi_land)
names(sdf) = c("state", "geographic_area", "state_pop", "state_density")
#clean up county data, dropping irrelevant cols
df = df %>%
filter(!is.na(geo_id_suffix)) %>%
filter(stringr::str_length(geo_id_suffix) == 5) %>% #counties have geoids of length 5
mutate(state = stringr::str_sub(geo_id_suffix, 1, 2)) %>%
select( #drop unneeded columns
-id,-country,-geo_id,-housing_units,-total_area,
-water_area,-density_housing_units_sqmi_land
)
#join the state data with the county data
result = left_join(df, sdf, by = "state") %>%
group_by(state) %>%
summarise(weighted_density = round(sum(
population / state_pop * density_population_sqmi_land
), 0)) %>%
ungroup() %>%
left_join(sdf, .) %>%
arrange(-weighted_density) %>%
#mark states with weighted density 10x higher than unweighted density
mutate(highlight = weighted_density / state_density > 10)
write_csv(result, "result.csv")
#graphit, Schulte style
p = ggplot(result,
aes(x = state_density, y = weighted_density, color = highlight)) +
theme_bw() +
scale_x_log10(breaks = c(1, 3, 10, 30, 100, 300, 1000, 3000, 10000),
label = comma) +
scale_y_log10(breaks = c(1, 3, 10, 30, 100, 300, 1000, 3000, 10000),
label = comma) +
geom_point() +
geom_text_repel(aes(label = geographic_area)) +
geom_abline(slope = 1) +
theme(legend.position = "none") +
labs(x = "Unweighted Population Density", y = "Weighted Population Density")
p
ggsave(
plot = p,
file = "unweighted_v_weighted_density.png",
height = 8,
width = 8
)
#make a long version of result with two rows per state
result_l = result %>%
mutate(sortval = weighted_density) %>%
gather(measure, density, state_density:weighted_density) %>%
arrange(sortval, measure) %>%
mutate(measure = factor(measure, levels = c("weighted_density", "state_density")))
#graph it
p = ggplot(result_l, aes(
x = density,
# make the rows be states sorted by weighted density
y = reorder(geographic_area, sortval),
color = measure
)) +
theme_bw() +
geom_point(size = 3) +
#connect the two measures for each state with a line
geom_line(aes(group = geographic_area), color = "black") +
scale_x_log10(breaks = c(10, 30, 100, 300, 1000, 3000, 10000),
label = comma) +
theme(legend.position = "bottom") +
labs(x = "Population density", y = "States ranked by weighted population density") +
scale_color_discrete(
name = "",
breaks = c("weighted_density", "state_density"),
labels = c("Weighted Population Density", "Unweighted Population Density")
)
p
ggsave(
plot = p,
file = "state_v_unweighted_and_weighted_density.png",
height = 8,
width = 6
)

H/T Jake Hofman for getting me to do this and talking R.
H/T to Hadley Wickham for creating the tidyverse.

The post Weighted population density appeared first on Decision Science News.


To leave a comment for the author, please follow the link and comment on their blog: R – Decision Science News.

Please Consider Using wrapr::let() for Replacement Tasks

R bloggers - Mon, 06/26/2017 - 19:51

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

From dplyr issue 2916.

The following appears to work.

suppressPackageStartupMessages(library("dplyr"))

COL <- "homeworld"

starwars %>%
  group_by(.data[[COL]]) %>%
  head(n=1)
## # A tibble: 1 x 14
## # Groups:   COL [1]
##             name height  mass hair_color skin_color eye_color birth_year
##
## 1 Luke Skywalker    172    77      blond       fair      blue         19
## # ... with 7 more variables: gender , homeworld , species ,
## #   films , vehicles , starships , COL

Though notice it reports the grouping is by "COL", not by "homeworld". Also the data set now has 14 columns, not the original 13 from the starwars data set.

And this seemingly similar variation (currently) throws an exception:

homeworld <- "homeworld"

starwars %>%
  group_by(.data[[homeworld]]) %>%
  head(n=1)
## Error in mutate_impl(.data, dots): Evaluation error: Must subset with a string.

I know this will cost me what little community good-will I might have left (after already having raised this, unsolicited, many times), but please consider using our package wrapr::let() for tasks such as the above.

library("wrapr")

let(
  c(COL = "homeworld"),
  starwars %>%
    group_by(COL) %>%
    head(n=1)
)
## # A tibble: 1 x 13
## # Groups:   homeworld [1]
##             name height  mass hair_color skin_color eye_color birth_year
##
## 1 Luke Skywalker    172    77      blond       fair      blue         19
## # ... with 6 more variables: gender , homeworld , species ,
## #   films , vehicles , starships

let(
  c(homeworld = "homeworld"),
  starwars %>%
    group_by(homeworld) %>%
    head(n=1)
)
## # A tibble: 1 x 13
## # Groups:   homeworld [1]
##             name height  mass hair_color skin_color eye_color birth_year
##
## 1 Luke Skywalker    172    77      blond       fair      blue         19
## # ... with 6 more variables: gender , homeworld , species ,
## #   films , vehicles , starships

Some explanation can be found here.
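
In short, let() takes a mapping from names to strings and substitutes those names into the supplied expression before it is evaluated, so the pipeline you write contains only ordinary symbols. A tiny sketch away from dplyr, just to show the mechanics (the column choice is arbitrary):

library("wrapr")
let(
  c(COLNAME = "cyl"),
  table(mtcars$COLNAME)   # evaluated as table(mtcars$cyl)
)
#>  4  6  8
#> 11  7 14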


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Visualizing Time Series Data in R

R bloggers - Mon, 06/26/2017 - 19:02

(This article was first published on R – The R Trader, and kindly contributed to R-bloggers)

I’m very pleased to announce my DataCamp course on Visualizing Time Series Data in R. This course is also part of the  Time Series with R skills track. Feel free to have a look, the first chapter is free!

Course Description

As the saying goes, “A chart is worth a thousand words”. This is why visualization is the most used and most powerful way to get a better understanding of your data. After this course you will have a very good overview of R’s time series visualization capabilities and you will be able to better decide which model to choose for subsequent analysis. You will also be able to convey the message you want to deliver in an efficient and beautiful way.

Course Outline

Chapter 1: R Time Series Visualization Tools
This chapter will introduce you to basic R time series visualization tools.

Chapter 2: Univariate Time Series
Univariate plots are designed to learn as much as possible about the distribution, central tendency and spread of the data at hand. In this chapter you will be presented with some visual tools used to diagnose univariate time series.

Chapter 3: Multivariate Time Series
What to do if you have to deal with multivariate time series? In this chapter, you will learn how to identify patterns in the distribution, central tendency and spread over pairs or groups of data.

Chapter 4: Case study: Visually selecting a stock that improves your existing portfolio
Let’s put everything you learned so far in practice! Imagine you already own a portfolio of stocks and you have some spare cash to invest, how can you wisely select a new stock to invest your additional cash? Analyzing the statistical properties of individual stocks vs. an existing portfolio is a good way of approaching the problem.


To leave a comment for the author, please follow the link and comment on their blog: R – The R Trader.

Check Data Quality with padr

R bloggers - Mon, 06/26/2017 - 19:00

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

The padr package was designed to prepare datetime data for analysis; that is, to take raw, timestamped data and quickly convert it into a tidy format that can be analyzed with all the tidyverse tools. Recently, a colleague and I discovered a second use for the package that I had not anticipated: checking data quality. Every analysis should include a check that the data are as expected. In the case of timestamped data, observations are sometimes missing due to technical malfunction of the system that produced them. Here are two examples that show how pad and thicken can be leveraged to quickly detect problems in timestamped data.

Regular observations

Let’s imagine our system produces a value every five minutes. We want to analyse the data of the last couple of months and start with some routine checks. We quickly find that the number of records is not what we expected.

library(tidyverse)
library(padr)

regular_system %>% head
## # A tibble: 6 x 2
##             timestamp  value
##
## 1 2017-03-01 00:00:00 423.69
## 2 2017-03-01 00:05:00 434.51
## 3 2017-03-01 00:10:00 206.01
## 4 2017-03-01 00:15:00 432.83
## 5 2017-03-01 00:20:00 220.07
## 6 2017-03-01 00:25:00 393.44

seq(regular_system$timestamp %>% min,
    regular_system$timestamp %>% max,
    by = "5 min") %>% length()
## [1] 32456

nrow(regular_system)
## [1] 32454

Two observations are missing here; with pad they are located in no time.

regular_system %>%
  mutate(check_var = 1) %>%
  pad() %>%
  filter(is.na(check_var))
## pad applied on the interval: 5 min
## # A tibble: 2 x 3
##             timestamp value check_var
##
## 1 2017-06-08 11:55:00    NA        NA
## 2 2017-06-08 12:00:00    NA        NA

There we are: apparently the system took a lunch break on June 8th.

Irregular observations

Now a more common situation: the system only produces data when it has something to tell us. This makes the observations irregular. This server produces a message each time some event happens.

irregular_system %>% head
## # A tibble: 6 x 2
##            time_stamp  message
##
## 1 2016-10-09 00:02:01  QL2341@
## 2 2016-10-09 00:07:01 #A222IWL
## 3 2016-10-09 00:11:01  QL2341@
## 4 2016-10-09 00:17:00     WW#5
## 5 2016-10-09 00:17:00  QL2341@
## 6 2016-10-09 00:17:01     WW#5

Here, too, our server might be temporarily down; however, this is not so easy to locate. It is helpful to apply thicken, then aggregate, pad, and fill, and finally plot the result. We might want to look at several different intervals, so let’s make the plotting function as generic as possible.

thicken_plot <- function(x, interval) {
  x %>%
    thicken(interval, "ts_thick") %>%
    count(ts_thick) %>%
    pad() %>%
    fill_by_value() %>%
    ggplot(aes(ts_thick, n)) + geom_line()
}

Let’s look at 10-minute intervals.

thicken_plot(irregular_system, "10 min")
## pad applied on the interval: 10 min

That’s not too helpful; maybe a little coarser.

thicken_plot(irregular_system, "30 min")
## pad applied on the interval: 30 min

Now we see a dip in the middle of the day for October 10th, whereas on all the other days there is ample activity during these hours. There must be something wrong here that has to be sorted out. That wraps up these two quick examples of how to use padr for data quality checking.

I will present the package during a lightning talk at useR next week (Wednesday 5:50 pm at 4.02 Wild Gallery). Hope to meet many of you during the conference!


To leave a comment for the author, please follow the link and comment on their blog: That’s so Random.

Volatility modelling in R exercises (Part-1)

R bloggers - Mon, 06/26/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Volatility modelling is typically used for high frequency financial data. Asset returns are typically uncorrelated while the variation of asset prices (volatility) tends to be correlated across time.
In this exercise set we will use the rugarch package (package description: here) to implement the ARCH (Autoregressive Conditional Heteroskedasticity) model in R.

Answers to the exercises are available here.

Exercise 1
Load the rugarch package and the dmbp dataset (Bollerslev, T. and Ghysels, E., 1996, Periodic Autoregressive Conditional Heteroscedasticity, Journal of Business and Economic Statistics, 14, 139–151). This dataset contains daily logarithmic nominal returns for the Deutsche Mark / British Pound exchange rate. There is also a dummy variable to indicate non-trading days.

Exercise 2
Define the daily return as a time series variable and plot the return against time. Notice the unpredictability apparent from the graph.

Exercise 3
Plot the graph of the autocorrelation function of returns. Notice that there is hardly any evidence of autocorrelation of returns.

Exercise 4
Plot the graph of the autocorrelation function of squared returns. Notice the apparent strong serial correlation.

Learn more about Model Evaluation in the online course Regression Machine Learning with R. In this course you will learn how to:

  • Avoid model over-fitting using cross-validation for optimal parameter selection
  • Explore maximum margin methods such as the best penalty of the error term and support vector machines with linear and non-linear kernels.
  • And much more

Exercise 5
We will first simulate and analyze an ARCH process. Use the ugarchspec function to define an ARCH(1) process. The return has a simple mean specification with mean = 0. The variance follows an AR(1) process with constant = 0.2 and AR(1) coefficient = 0.7.

Exercise 6
Simulate the ARCH process for 500 time periods. Exercises 7 to 9 use this simulated data.
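
For orientation only, one way Exercises 5 and 6 could be set up looks roughly like this (a sketch, not the official solution linked above; the function and argument names are rugarch's, the parameter values are those given in Exercise 5):

library(rugarch)
spec <- ugarchspec(
  variance.model = list(model = "sGARCH", garchOrder = c(1, 0)),   # ARCH(1)
  mean.model     = list(armaOrder = c(0, 0), include.mean = FALSE),
  fixed.pars     = list(omega = 0.2, alpha1 = 0.7)
)
sim <- ugarchpath(spec, n.sim = 500)   # 500 simulated time periods
head(fitted(sim))   # simulated returns
head(sigma(sim))    # simulated conditional sigma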

Exercise 7
Plot the returns vs time and note the apparent unpredictability. Plot the path of conditional sigma vs time and note that there is some persistence over time.

Exercise 8
Plot the ACF of returns and squared returns. Note that there is no autocorrelation between returns, but squared returns have significant first-order autocorrelation, as we defined in Exercise 5.

Exercise 9
Test for ARCH effects using the Ljung Box test for the simulated data.

Exercise 10
Test for ARCH effects using the Ljung Box test for the currency returns data.

Related exercise sets:
  1. Forecasting: Multivariate Regression Exercises (Part-4)
  2. Introduction to copulas Exercises (Part-2)
  3. Conditional execution exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Data Munging With R Preview — Storing Values (Assigning)

R bloggers - Mon, 06/26/2017 - 15:40

Since about October last year, I’ve been writing an introduction to R book. It’s
been quite the experience. I’ve finally started making time to document some of
the interesting things I’ve learned (about R, about writing, about how to
bring those two together) along the way.

The book is aimed at proper beginners; people with absolutely no formal coding
experience. This tends to mean people coming from Excel who need to do more than
a spreadsheet can/should.

I'm writing an R book for real beginners (ppl with 0 code XP) via @ManningBooks! What tripped you up most when you first learned R? Pls RT!

— Jonathan Carroll (@carroll_jono) September 27, 2016

Most of the books I’ve looked at which claim to teach programming begin with
some strong assumptions about the reader already knowing how to program, and
teach the specific syntax of some language. That’s no good if this is your first
language, so I’m working towards teaching the concepts, the language, and the
syntax (warts and all).

The book is currently available under the Manning Early Access Program (MEAP)
which means if you buy it you get the draft of the first three chapters right
now. If you find something you still don’t understand, or you don’t like how
I’ve written some/all of it, then jump onto the forum and let me know. I make
more edits and write more chapters, and you get updated versions. Lather, rinse,
repeat until the final version and you get a polished book which (if I’m any
good) contains what you want it to.

I’m genuinely interested in getting this right; I want to help people learn R. I
contribute a bit of time on Stack Overflow answering people’s questions, and
it’s very common to see questions that shouldn’t need asking. I don’t blame the
user for not knowing something (a different answer for not searching, perhaps),
but I can help make the resource they need.

To show that I really want people to contribute, here’s a discount code to
sweeten the deal: use mlcarroll for 50% off here.

Chapter 1 is a free download, so please check that out too! At the moment the
MEAP covers the first three chapters, but the following four aren’t too far
behind.

I’ll document some of the behind-the-scenes process shortly, but for now here’s
an excerpt from chapter 2:

2.2. Storing Values (Assigning)

In order to do something with our data, we will need to tell R what to call
it, so that we can refer to it in our code. In programming in general, we
typically have variables (things that may vary) and values (our data). We’ve
already seen that different data values can have different types, but we
haven’t told R to store any of them yet. Next, we’ll create some variables
to store our data values.

2.2.1. Data (Variables)

If we have the values 4 and 8 and we want to do something with them, we can
use the values literally (say, add them together as 4 + 8). You may be
familiar with this if you frequently use Excel; data values are stored in cells
(groups of which you can opt to name) and you tell the program which values you
wish to combine in some calculation by selecting the cells with the mouse or
keyboard. Alternatively, you can opt to refer to cells by their grid reference
(e.g. A1). Similarly to this second method, we can store values in variables
(things that may vary) and abstract away the values. In R, assigning of
values to variables takes the following form

variable <- value

The assignment operator <- can be thought of as storing the value/thing on
the right hand side into the name/thing on the left hand side. For example, try
typing x <- 4 into the R Console then press Enter:

Figure 2. 1. The variable x has been assigned the value 4.

You could just as easily use the equals sign to achieve this (x = 4), but I
recommend you use <- for reasons that will become clear later.

You’ll notice that the Environment tab of the Workspace pane now lists
x under Values and shows the number 4 next to it, as shown in [fig-x_eq_4]

Figure 2. 2. The variable x has been assigned the value 4.

What happened behind the scenes was that when we pressed Enter, R took
the entire expression that we entered (x <- 4) and evaluated it. Since we
told R to assign the value 4 to the variable x, R converted the value
4 to binary and placed that in the computer’s memory. R then gives us a
reference to that place in the computer’s memory and labels it x. A diagram of
this process is shown in [fig-assign]. Nothing else appeared in the
Console because the action of assigning a value doesn’t return anything
(we’ll cover this more in our section on functions).

Figure 2. 3. Assigning a value to a variable. The value entered is converted to binary, then stored in memory, the reference to which is labelled by the variable.

This is overly simplified, of course. Technically speaking, in R, names have
objects rather than the other way around. This means that R can be quite
memory efficient since it doesn’t create a copy of anything it doesn’t need to.

On “hidden” variables

Variables which begin with a period (e.g. .length) are considered
hidden and do not appear in the Environment tab of the Workspace. They otherwise behave exactly like any other variable; they can be printed and manipulated. An example of one of these is the .Last.value variable, which exists from the moment you load up R (with the value TRUE); it contains the output value of the last statement executed (handy if you forgot to assign it to something).
There are very few reasons you’ll want to use this feature (dot-prefixed hidden variables) on purpose at the moment, so for
now, avoid creating variable names with this pattern. The exception to the
hidden nature of these is again the .Last.value variable which you can
request to be visible in the Environment tab via
Tools ▸ Global Options…​ ▸ General ▸ Show .Last.value in environment listing.
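
A quick illustration (the variable name here is made up):

.secret <- 42
ls()                  # .secret is not listed...
ls(all.names = TRUE)  # ...but it shows up here, along with any other dot-prefixed names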

We can retrieve the value assigned to the variable x by asking R to print
the value of x

print(x = x)
#> [1] 4

for which we have a useful shortcut — if your entire expression is just a
variable, R will assume you mean to print() it, so

x
#> [1] 4

works just the same.

Now, about that [1]: it’s important to know that in R, there’s no such thing
as a single value; every value is actually a vector of values (we’ll cover
these properly in the next chapter, but think of these as collections of values
of the same type).[1]
Whenever R prints a value it allows for the case where the value contains more
than one number. To make this easier on the eye, it labels the first value
appearing on each line by its position in the collection. For collections
(vectors) with just a single value this might appear strange, but it makes
more sense once our variables contain more values:

# Print the column names of the mtcars dataset
names(x = mtcars)
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"

We can assign another value to another variable

y <- 8

There are few restrictions on what we can name our variables, but R will
complain if you try to break them. Variables should start with a letter, not a
number; trying to create the variable 2b will generate an error. Variables can
also start with a dot (.) as long as it's not immediately followed by a
number, although you may wish to avoid doing so. The rest of the variable name
can consist of letters (upper and lower case) and numbers, but no punctuation
or other symbols apart from the dot (.) and the underscore (_).

There are also certain reserved words that you can’t name variables as. Some
are reserved for built-in functions or keywords

if, else, repeat, while, function, for, in, next, and break.

Others are reserved for particular values

TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_,
NA_complex_, and NA_character_.

We’ll come back to what each of these means, but for now you just need to know
that you can’t create a variable with one of those names.
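
Trying it anyway makes the point (the exact wording of the error may differ between R versions):

TRUE <- 1
#> Error: invalid (do_set) left-hand side to assignment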

On overwriting names

What you can do however, which you may wish to take care with, is overwrite
the in-built names of variables and functions. By default, the value pi is
available (π = 3.141593).

If you were translating an equation into code and wanted to enter a value for
pi, you might accidentally assign it to the name pi and in doing so change the
default value, causing all sorts of trouble when you next go to use it or call
a function you’ve written which expects it to still be the default.

The default value can still be accessed by specifying the package in which it is
defined, separated by two colons (::). In the case of pi, this is the base
package.

pi <- 3L    (1)
base::pi    (2)
#> [1] 3.141593
  1. Re-defining pi to be equal to exactly 3

  2. The default, correct value is still available.

This is also an issue for functions, with the same solution; specify the package
in which it is defined to use that definition. We’ll return to this in
a section on ‘scope’.

We’ll cover how to do things to our variables in more detail in the next
section, but for now let’s see what happens if we add our variables x and y
in the same way as we did for our regular numbers

x + y
#> [1] 12

which is what we got when we added these numbers explicitly. Note that since our
expression produces just a number (no assignment), the value is printed. We’ll
cover how to add and subtract values in more depth in our section on basic mathematics.

R has no problems with overwriting these values, and it doesn’t mind what data
you overwrite these with.[2]

y <- 'banana'
y
#> [1] "banana"

R is case-sensitive, which means that it treats a and A as distinct
names. You can have a variable named myVariable and another named MYvariable
and another named myVARIABLE and R will hold the value assigned to each
independently.
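
For example (these names are purely illustrative):

myVariable <- 1
MYvariable <- 2
myVariable
#> [1] 1
MYvariable
#> [1] 2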

On variable names

There are only two hard things in Computer Science: cache invalidation and naming things.

— Phil Karlton
Principal Curmudgeon Netscape Communications Corporation

I said earlier that R won’t keep track of your units so it’s a good idea to
name your variables in a way that makes logical sense, is meaningful, and will
help you remember what it represents. Variables x and y are fine for playing
around with values, but aren’t particularly meaningful if your data represents
speeds, where you may want to use something like speed_kmph for speeds in
kilometers per hour. Underscores (_) are allowed in variable names, but
whether or not you use them is up to you. Some programmers prefer to name
variables in this way (sometimes referred to as ‘snake_case’), others prefer
‘CamelCase’. The use of periods (dots, .) to separate words is discouraged for
reasons beyond the scope of this book.[3]

Naming things

Be careful when naming your variables. Make them meaningful and
concise. In six months from now, will you remember what data_17 corresponds to? Tomorrow, will you remember that newdata was updated twice?

2.2.2. Unchanging Data

If you’re familiar with working with data in a spreadsheet program (such as
Excel), you may expect your variables to behave in a way that they
won’t. Automatic recalculation is a very useful feature of spreadsheet programs,
but it’s not how R behaves.

If we assign our two variables, then add them, we can save that result to
another variable

a <- 4
b <- 8
sum_of_a_and_b <- a + b

This has the value we expect

print(x = sum_of_a_and_b)
#> [1] 12

Now, if we change one of these values

b <- 2

this has no impact on the value of the variable we created to hold the sum
earlier

print(x = sum_of_a_and_b)
#> [1] 12

Once the sum was calculated, and that value stored in a variable, the connection
to the original values was lost. This makes things reliable because you know
for sure what value a variable will have at any point in your calculation by
following the steps that lead to it, whereas a spreadsheet depends much more on
its current overall state.

2.2.3. Assignment Operators (<- vs =)

If you’ve read some R code already, you’ve possibly seen that both <- and
= are used to assign values to objects, and this tends to cause some
confusion. Technically, R will accept either when assigning variables, so in
that respect it comes down to a matter of style (I still highly recommend
assigning with <-). The big difference comes when using functions that take
arguments — there you should only use = to specify the value of an
argument. For example, when we inspected the mtcars data, we could
specify a string with which to indent the output

str(object = mtcars, indent.str = '>> ')
#> 'data.frame':    32 obs. of  11 variables:
#> >> $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> >> $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#> >> $ disp: num  160 160 108 258 360 ...
#> >> $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#> >> $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#> >> $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#> >> $ qsec: num  16.5 17 18.6 19.4 17 ...
#> >> $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#> >> $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#> >> $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#> >> $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

If we had used <- instead of = for either argument, then R would treat that
as creating a new variable, object or indent.str, with the value mtcars or
'>> ' respectively, which isn’t what we want.

Assigning to variables.

score <- 4.8
score
#> [1] 4.8
str(object = score)
#>  num 4.8
fruit <- 'banana'
fruit
#> [1] "banana"
str(object = fruit)
#>  chr "banana"

Note that we didn’t need to tell R that one of these was a number and one was
a string, it figured that out itself. It’s good practice (and easier to read) to
make your <- line up vertically when defining several variables:

first_name <- 'John'
last_name  <- 'Smith'
top_points <- 23

but only if this can be achieved without adding too many spaces (exactly how
many is too many is up to you).

Watch this space!

An extra space can make a big difference to the syntax. Compare:

a <- 3

with

a < - 3
#> [1] FALSE

In the first case we assigned the value 3 to the variable a (which returns
nothing). In the second case, with a wayward space, we compared a to the
value -3 which returns FALSE (I’ll explain why that works at all, later).

Now that we know how to provide some data to R, what if we want to explicitly
tell R that our data should be of a specific type, or we want to convert our
data to a different type? That’s an article for another day.

If you’re interested in seeing more, and hopefully providing feedback on what
you do/don’t like about it, then use the discount code mlcarroll here for 50%
off and get reading!

1. In technical terms, R has no scalar types.
2. This is where the distinction of weakly typed becomes important – in a strongly typed language you would not be able to arbitrarily change the type of a variable.
3. This syntax is already used within R to denote functions acting on a specific class, such as print.Date().

View this post on GitHub


pixel art of ggplot2 faceting using geofacet

R bloggers - Mon, 06/26/2017 - 14:13

(This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers)

I just discovered an interesting ggplot2 extension, geofacet, which supports arranging facet panels so that they mimic a geographic topology.
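
For instance, using the European grid and GDP data that ship with the package (a minimal sketch; eu_grid1 is, to my knowledge, a built-in grid in geofacet, and eu_gdp is the dataset also used in the code further below):

library(geofacet)
library(ggplot2)

ggplot(eu_gdp, aes(year, gdp_pc)) +
  geom_line(color = "steelblue") +
  facet_geo(~ code, grid = "eu_grid1", scales = "free_y")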

After playing with it, I realized that it is not only useful for visualizing geo-related data, but can also be fun for presenting data in a way that mimics pixel art.

Here is an example using the Turkey shape:

Turkey <- read.csv("http://pages.iu.edu/~cdesante/turkey.csv")
Turkey = Turkey[, -3]
colnames(Turkey)[2:1] = c("row", "col")
Turkey$row = max(Turkey$row) - Turkey$row + 1
Turkey$name <- Turkey$code <- paste0('turkey', 1:nrow(Turkey))

require(ggplot2)
require(geofacet)

x <- split(eu_gdp, eu_gdp$code)
x <- x[sample.int(length(x), nrow(Turkey), replace = T)]
for (i in 1:length(x)) {
  x[[i]]$code = Turkey$code[i]
}
y <- do.call(rbind, x)

p1 <- ggplot(y, aes(gdp_pc, year)) +
  geom_line(color = "steelblue") +
  facet_geo(~code, grid = Turkey, scales = 'free')
print(p1)

p1 + theme_void() + theme(strip.text.x = element_blank())


To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang YU.

H2O Benchmark for CSV Import

R bloggers - Mon, 06/26/2017 - 07:24

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

The importFile() function in H2O is extremely efficient due to the parallel reading. The benchmark comparison below shows that it is comparable to the read.df() in SparkR and significantly faster than the generic read.csv().

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local",
                     sparkConfig = list(spark.driver.memory = "10g", spark.driver.cores = "4"))

library(h2o)
h2o.init(max_mem_size = "10g")

library(rbenchmark)
benchmark(replications = 5, order = "elapsed", relative = "elapsed",
  csv = {
    df <- read.csv("Documents/nycflights13.csv")
    print(nrow(df))
    rm(df)
  },
  spk = {
    df <- read.df("Documents/nycflights13.csv", source = "csv", header = "true", inferSchema = "true")
    print(nrow(df))
    rm(df)
  },
  h2o = {
    df <- h2o.importFile(path = "Documents/nycflights13.csv", header = TRUE, sep = ",")
    print(nrow(df))
    rm(df)
  }
)
#   test replications elapsed relative user.self sys.self user.child sys.child
# 3  h2o            5   8.221    1.000     0.508    0.032          0         0
# 2  spk            5   9.822    1.195     0.008    0.004          0         0
# 1  csv            5  16.595    2.019    16.420    0.176          0         0


To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

Re: raster - unrotate?

R-sig-geo - Mon, 06/26/2017 - 07:17
Hi,

Thanks for the offer to help.  I have been redirected for the time being to other tasks, but I expect I'll be back to this in a few weeks.  I think I will try your archived rotate() as I'm starting with whole-earth coverage - nothing tricky.

You have been going really deep with tabularaster.  I have been folding tidyverse into my 'stuff' over the last few months, but I am on a relatively short leash resource-wise so the effort feels feeble.  Most of my work uses rasters and dismo https://cran.r-project.org/web/packages/dismo/index.html with only minor use of non-raster spatial data.  That might be my lame excuse for not diving into sf yet; that and I haven't seen examples I can ape for getting sf and raster to play together (well, until tabularaster).

Thanks!
Ben

> On Jun 22, 2017, at 7:26 PM, Michael Sumner <[hidden email]> wrote:
>
>
>
>
> On Thu, 22 Jun 2017 at 21:28 Ben Tupper <[hidden email]> wrote:
> Hi,
>
> Wow!  A treasure from the past!
>
> Hmmm.   I see what you mean; it might be best to snip-clip-and-smush from the native presentation.  We are testing out the performance of the dineof function in the sinkr package (https://github.com/marchtaylor/sinkr/blob/master/R/dineof.R) to interpolate patchy chlorophyl data.
>
> Thanks for the tip!
>
>
>
> For what it's worth, I'm happy to help, there's no one size fits all here. The dateline is one of those "easy problems" that does not have a general fix and will bite in many different contexts. :)
>
> The best distillation of my tricks is in this project, but it does depend on your workflow and the actual data in use.
>
> https://github.com/hypertidy/tabularaster
>
> If you have performance issues raster's cell abstraction  will help, but they are a bit old-school when it comes to multi-part geometries. (It's identical to standard database indexing concepts as are all "spatial" optimization tricks)
>
> (I see a desperate need for a general API for this kind of problem so that we can build tools from a general framework, which I'm working on - that's some kind of excuse for why this and related projects are quite raw and unfinished.)
>
> Cheers, Mike.  
>
> Cheers,
> Ben
>
>> On Jun 22, 2017, at 4:46 AM, Michael Sumner <[hidden email]> wrote:
>>
>>
>> It used to do the inverse, and I preserved a copy of the old behaviour here:
>>
>> https://github.com/AustralianAntarcticDivision/raadtools/blob/master/R/utils.R#L9
>>
>> You're probably best to learn from how it approaches it rather than just use it, since it's not a general capability, just works for those very specific cases.
>>
>> I just tested it and it still seems to work fine, I used
>>
>> oceandata.sci.gsfc.nasa.gov/MODISA/Mapped/Monthly/9km/chlor/A20021822002212.L3m_MO_CHL_chlor_a_9km.nc
>>
>> obtainable with
>>
>> https://oceandata.sci.gsfc.nasa.gov/cgi/getfile/A20021822002212.L3m_MO_CHL_chlor_a_9km.nc
>>
>> If you're extracting points you might as well just convert them into the "Atlantic" range, but it's painful to get that right for polygons or lines (and very hard to generalize once you get it working anyway).
>>
>> For simple regions like the one you show I'd probably split it into two extents() and use those - but depends on your workflow, all these options are useful here and there.
>>
>> Cheers, Mike.
>>
>>
>> On Thu, 22 Jun 2017 at 18:30 Ben Tupper <[hidden email]> wrote:
>> Hello,
>>
>> We have rasters that span [-180, 180] from NASA's Ocean Color (https://oceancolor.gsfc.nasa.gov/)  datasets.  We are trying to extract a region that spans 100E to 90W, that is 100 to -90.  The region 'wraps' across the edges as shown by the plot at the address below.
>>
>> https://dl.dropboxusercontent.com/u/8433654/sst.png <https://dl.dropboxusercontent.com/u/8433654/sst.png>
>>
>>
>> library(raster)
>> uri <- 'https://oceandata.sci.gsfc.nasa.gov:443/opendap/MODISA/L3SMI/2016/001/A20160012016032.L3m_R32_SST_sst_9km.nc <https://oceandata.sci.gsfc.nasa.gov/opendap/MODISA/L3SMI/2016/001/A20160012016032.L3m_R32_SST_sst_9km.nc>'
>> R <- raster::raster(uri, varname = 'sst')
>>
>> R
>> #class       : RasterLayer
>> #dimensions  : 180, 360, 64800  (nrow, ncol, ncell)
>> #resolution  : 1, 1  (x, y)
>> #extent      : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
>> #coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
>> #data source : in memory
>> #names       : layer
>> #values      : 1.482572e-05, 0.9999928  (min, max)
>>
>> plot(R)
>> lines(c(180, 100, 100, 180), c(80,80,0,0))
>> lines(c(-180,-90,-90,-180), c(80,80,0,0))
>>
>> I see that there is the nice raster::rotate() function to rotate raster coordinates from [0,360] to [-180,180]. That would make extracting the region super easy if the inverse were available.  Is there an equivalent way to implement the inverse of raster::rotate()?  I could dig into the source of raster::rotate() to see how to code one up, but I hate like heck to reinvent the wheel.
>>
>> Thanks!
>> Ben
>>
>>
>> Ben Tupper
>> Bigelow Laboratory for Ocean Sciences
>> 60 Bigelow Drive, P.O. Box 380
>> East Boothbay, Maine 04544
>> http://www.bigelow.org
>>
>> --
>> Dr. Michael Sumner
>> Software and Database Engineer
>> Australian Antarctic Division
>> 203 Channel Highway
>> Kingston Tasmania 7050 Australia
>>
>
> Ben Tupper
> Bigelow Laboratory for Ocean Sciences
> 60 Bigelow Drive, P.O. Box 380
> East Boothbay, Maine 04544
> http://www.bigelow.org
>
>
>
> --
> Dr. Michael Sumner
> Software and Database Engineer
> Australian Antarctic Division
> 203 Channel Highway
> Kingston Tasmania 7050 Australia
Ben Tupper
Bigelow Laboratory for Ocean Sciences
60 Bigelow Drive, P.O. Box 380
East Boothbay, Maine 04544
http://www.bigelow.org





_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

Re: Using gIntersection to split polygons with lines

R-sig-geo - Mon, 06/26/2017 - 06:11


On 26/06/17 12:56, Frederico Mestre wrote:
> Hello,
>
> I'm trying to use gIntersection() to cut a polygon ("SpatialPolygons") with
> a line ("SpatialLinesDataFrame").
>
> I wanted to obtain a "SpatialPolygons" object with the polygons defined by
> the lines. This is what I did:
>
> polygons1 <- gIntersection(spgeom1=mypolygon, spgeom2=mylines)
>
> However I get an "SpatialLines" object as an output instead of a
> "SpatialPolygons".
>
> Am I doing something wrong? Is there any alternative?

Package sf has a function st_split() that cuts a polygon with a line,
similar to postgis' function st_split.

To use it, you'd need to build sf from source after having liblwgeom
installed; the CRAN binary builds do not have support for liblwgeom.

See https://github.com/edzer/sfr/ for instructions, and
https://github.com/edzer/sfr/issues/360 for an example.
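
For illustration (this sketch is not part of the original reply, and assumes an sf build with liblwgeom support as described above), splitting a square with a horizontal line could look like the following; st_collection_extract() is used here as one way to keep only the polygon pieces:

library(sf)
# square polygon and cutting line matching the example below (ring closed explicitly)
pol <- st_sfc(st_polygon(list(rbind(c(1,2), c(2,2), c(2,1), c(1,1), c(1,2)))))
cutline <- st_sfc(st_linestring(rbind(c(0.5,1.5), c(2.5,1.5))))
pieces <- st_split(pol, cutline)                   # a GEOMETRYCOLLECTION of the two halves
polys <- st_collection_extract(pieces, "POLYGON")  # should contain the two half-polygons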

>
> I provide this sample code:
>
> #Creating the polygons
> coords <- rbind(c(1,2),c(2,2),c(2,1),c(1,1))
> pol1 <- Polygon(coords)
> pol2 <- Polygons(srl=list(pol1), ID=1)
> pol3 <- SpatialPolygons(Srl=list(pol2))
>
> #Creating the line
> l1 <- Line(rbind(c(0.5,1.5),c(2.5,1.5)))
> l2 <- Lines(list(l1),ID=1)
> l3 <- SpatialLines(list(l2))
> line1 <- SpatialLinesDataFrame(l3,data=as.data.frame(matrix(ncol=2,nrow=1)))
>
> #Plotting both
> plot(pol3)
> plot(line1, add=TRUE)
>
> #Intersection
> inters1 <- gIntersection(spgeom1=pol3, spgeom2=line1)
> inters1 <- gIntersection(spgeom1=line1, spgeom2=pol3)
> plot(inters1)
> class(inters1)
>
> It only shows the lines (the object is of class "SpatialLines").
>
> Thanks,
> Frederico
>
> [[alternative HTML version deleted]]
>
--
Edzer Pebesma
Institute for Geoinformatics  (ifgi),  University of Münster
Heisenbergstraße 2, 48149 Münster, Germany; +49 251 83 33081
Journal of Statistical Software:   http://www.jstatsoft.org/
Computers & Geosciences:   http://elsevier.com/locate/cageo/


_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

Using gIntersection to split polygons with lines

R-sig-geo - Mon, 06/26/2017 - 05:56
Hello,

I'm trying to use gIntersection() to cut a polygon ("SpatialPolygons") with
a line ("SpatialLinesDataFrame").

I wanted to obtain a "SpatialPolygons" object with the polygons defined by
the lines. This is what I did:

polygons1 <- gIntersection(spgeom1=mypolygon, spgeom2=mylines)

However I get an "SpatialLines" object as an output instead of a
"SpatialPolygons".

Am I doing something wrong? Is there any alternative?

I provide this sample code:

#Creating the polygons
coords <- rbind(c(1,2),c(2,2),c(2,1),c(1,1))
pol1 <- Polygon(coords)
pol2 <- Polygons(srl=list(pol1), ID=1)
pol3 <- SpatialPolygons(Srl=list(pol2))

#Creating the line
l1 <- Line(rbind(c(0.5,1.5),c(2.5,1.5)))
l2 <- Lines(list(l1),ID=1)
l3 <- SpatialLines(list(l2))
line1 <- SpatialLinesDataFrame(l3,data=as.data.frame(matrix(ncol=2,nrow=1)))

#Plotting both
plot(pol3)
plot(line1, add=TRUE)

#Intersection
inters1 <- gIntersection(spgeom1=pol3, spgeom2=line1)
inters1 <- gIntersection(spgeom1=line1, spgeom2=pol3)
plot(inters1)
class(inters1)

It only shows the lines (the object is of class "SpatialLines").

Thanks,
Frederico



_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

Balancing on multiple factors when the sample is too small to stratify

R bloggers - Mon, 06/26/2017 - 02:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

Ideally, a study that uses randomization provides a balance of characteristics that might be associated with the outcome being studied. This way, we can be more confident that any differences in outcomes between the groups are due to the group assignments and not to differences in characteristics. Unfortunately, randomization does not guarantee balance, especially with smaller sample sizes. If we want to be certain that groups are balanced with respect to a particular characteristic, we need to do something like stratified randomization.

When the sample size is small and we want to guarantee balance across multiple characteristics, the task is a bit more challenging. Say we have 20 schools that we are randomizing to two groups, 10 in each, and want to make sure the groups are balanced with respect to 4 characteristics: language, poverty, location, and size. Simple stratification may not work so well. If we assume that these four characteristics are binary (e.g. either “yes” or “no”), there are 16 possible combinations. One or more of these combinations could easily be represented by a single school – so it would be impossible to randomize within each of the 16 combinations. What to do?

One possible approach is to generate all possible randomization schemes of the 20 schools, and keep only those schemes that are balanced with respect to the four characteristics. Once we have a list of acceptable randomization schemes, we can just pick one of those at random. (Of course, it is preferable if each school has close to a 50% chance of being assigned to either intervention group.)

Simulate school-level data

To start, we generate data for our 20 hypothetical schools using simstudy functions:

library(simstudy) set.seed(125) # define data characteristics for schools ddef <- defData(varname = "language", formula = .3, dist = "binary") ddef <- defData(ddef, "poverty", formula = .2, dist = "binary") ddef <- defData(ddef, "location", formula = .5, dist = "binary") ddef <- defData(ddef, "size", formula = .5, dist = "binary") ddef ## varname formula variance dist link ## 1: language 0.3 0 binary identity ## 2: poverty 0.2 0 binary identity ## 3: location 0.5 0 binary identity ## 4: size 0.5 0 binary identity # generate schools dt <- genData(20, ddef) # number of schools in each combination dt[, .N, keyby = .(language,poverty,location,size)] ## language poverty location size N ## 1: 0 0 0 1 5 ## 2: 0 0 1 0 1 ## 3: 0 0 1 1 5 ## 4: 0 1 0 0 1 ## 5: 0 1 1 0 2 ## 6: 1 0 0 0 2 ## 7: 1 0 0 1 1 ## 8: 1 0 1 0 1 ## 9: 1 0 1 1 2

In this case, we have nine different combinations of the four characteristics, four of which include only a single school (rows 2, 4, 7, and 8). Stratification wouldn't necessarily work here if our goal was balance across all four characteristics.

Create randomization scenarios to assess for balance

Ideally, we would generate all possible randomization combinations and check them all for balance. If the number of total units (e.g. schools) is small, this does not pose a challenge (e.g. if N=4, then we only have six possible randomization schemes: TTCC, TCTC, TCCT, CTTC, CTCT, CCTT). However, with N=20 there are 184,756 possible randomization schemes. Depending on the efficiency of the algorithm, it may be impractical to evaluate all the schemes. So, an alternative is to sample a subset of the schemes and evaluate those. For illustration purposes (so that you can understand what I am doing), I am using some very inefficient R code (using loops). As a result, I cannot evaluate all possible schemes in a reasonable period of time to get this post out; instead, I decided to sample and evaluate 1000 possible randomizations. (At the end of this post, I show results using much more efficient code that uses data.table and Rcpp much more effectively – so that we can quickly evaluate millions of randomization schemes.)
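
(As a quick check, not in the original post: the number of ways to assign 10 of 20 schools to the intervention is a binomial coefficient, which R computes directly.)

choose(20, 10)
## [1] 184756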

To start, I create all combinations of randomization schemes:

totalSchools = 20 rxSchools = 10 xRx <- t(combn(totalSchools, rxSchools)) # show 5 randomly sampled combinations sampleRows <- sample(nrow(xRx), 5, replace = FALSE) xRx[sampleRows,] ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## [1,] 2 3 5 6 7 8 10 12 14 19 ## [2,] 5 6 7 8 13 14 15 16 17 18 ## [3,] 1 3 4 5 7 9 12 15 17 20 ## [4,] 2 3 4 5 9 11 14 15 19 20 ## [5,] 3 5 6 7 8 10 11 12 15 16

Below is a function (which I chose to do in Rcpp) that converts the xRx matrix of school ids to a 20-column matrix of 1’s and 0’s indicating whether or not a school is randomized to the intervention in a particular scenario:

#include // [[Rcpp::depends(RcppArmadillo)]] using namespace Rcpp; using namespace arma; // [[Rcpp::export]] NumericMatrix convert01(NumericMatrix xmat, int tcols) { int xrows = xmat.nrow(); int xcols = xmat.ncol(); NumericMatrix pmat(xrows, tcols); for (int i=0; i < xrows; i++) { for (int j=0; j < xcols; j++) { pmat(i, xmat(i,j) - 1) = 1; } } return(pmat); } x01 <- convert01(xRx, totalSchools) # show some rows x01[sampleRows,] ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] ## [1,] 0 1 1 0 1 1 1 1 0 1 0 1 ## [2,] 0 0 0 0 1 1 1 1 0 0 0 0 ## [3,] 1 0 1 1 1 0 1 0 1 0 0 1 ## [4,] 0 1 1 1 1 0 0 0 1 0 1 0 ## [5,] 0 0 1 0 1 1 1 1 0 1 1 1 ## [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] ## [1,] 0 1 0 0 0 0 1 0 ## [2,] 1 1 1 1 1 1 0 0 ## [3,] 0 0 1 0 1 0 0 1 ## [4,] 0 1 1 0 0 0 1 1 ## [5,] 0 0 1 1 0 0 0 0

Because the evaluation code is so inefficient, I draw 1,000 rows at random from this “intervention” matrix x01 (after converting it to a data.table).

# convert matrix to data.table
d01 <- data.table(x01)
d01[, id := .I]
ids <- sample(nrow(d01), 1000, replace = FALSE)
sampleD01 <- d01[id %in% ids]

Now we are ready to evaluate each of the 1,000 schemes. As I mentioned before, this approach is highly inefficient, as the algorithm requires us to literally loop through each combination to find the balanced ones. I have sacrificed efficiency and speed for clarity of code (I hope).

for (i in 1:1000) { dt[, grp:= t(sampleD01[i,1:20])] dx <- dt[ , .N, keyby = .(language, grp)] dc <- dcast(dx, language ~ grp, fill = 0, value.var = "N" ) dc[, diff := abs(`1` - `0`)] # we declare a scheme balanced if counts differ by # no more than 1 school sampleD01[i, language := (sum(dc[, diff > 1]) == 0)] dx <- dt[ , .N, keyby = .(poverty, grp)] dc <- dcast(dx, poverty ~ grp, fill = 0, value.var = "N" ) dc[, diff := abs(`1` - `0`)] sampleD01[i, poverty := (sum(dc[, diff > 1]) == 0)] dx <- dt[ , .N, keyby = .(location, grp)] dc <- dcast(dx, location ~ grp, fill = 0, value.var = "N" ) dc[, diff := abs(`1` - `0`)] sampleD01[i, location := (sum(dc[, diff > 1]) == 0)] dx <- dt[ , .N, keyby = .(size, grp)] dc <- dcast(dx, size ~ grp, fill = 0, value.var = "N" ) dc[, diff := abs(`1` - `0`)] sampleD01[i, size := (sum(dc[, diff > 1]) == 0)] }

The final determination of balance is made if a scheme is balanced across all four characteristics. In this case, 136 of the 1,000 schemes were balanced based on this criterion:

sampleD01[, balanced := all(language, poverty, location, size), keyby = id]
# proportion of sampled combinations that are balanced ...
sampleD01[, mean(balanced)]
## [1] 0.136

And let’s inspect the actual balance of two randomly selected schemes – one which is balanced, and one which is not:

sTrue <- sampleD01[balanced == TRUE] sFalse <- sampleD01[balanced == FALSE] A balanced scheme dtAssigned <- copy(dt) dtAssigned[, group := as.vector(t(sTrue[sample(.N, 1), 1:20]))] dtAssigned[, .N, keyby=.(language, group)] ## language group N ## 1: 0 0 7 ## 2: 0 1 7 ## 3: 1 0 3 ## 4: 1 1 3 dtAssigned[, .N, keyby=.(poverty, group)] ## poverty group N ## 1: 0 0 9 ## 2: 0 1 8 ## 3: 1 0 1 ## 4: 1 1 2 dtAssigned[, .N, keyby=.(location, group)] ## location group N ## 1: 0 0 4 ## 2: 0 1 5 ## 3: 1 0 6 ## 4: 1 1 5 dtAssigned[, .N, keyby=.(size, group)] ## size group N ## 1: 0 0 3 ## 2: 0 1 4 ## 3: 1 0 7 ## 4: 1 1 6 An unbalanced scheme

In this case, language and location are imbalanced, though size and poverty are fine.

dtAssigned <- copy(dt) dtAssigned[, group := as.vector(t(sFalse[sample(.N, 1), 1:20]))] dtAssigned[, .N, keyby=.(language, group)] ## language group N ## 1: 0 0 8 ## 2: 0 1 6 ## 3: 1 0 2 ## 4: 1 1 4 dtAssigned[, .N, keyby=.(poverty, group)] ## poverty group N ## 1: 0 0 8 ## 2: 0 1 9 ## 3: 1 0 2 ## 4: 1 1 1 dtAssigned[, .N, keyby=.(location, group)] ## location group N ## 1: 0 0 3 ## 2: 0 1 6 ## 3: 1 0 7 ## 4: 1 1 4 dtAssigned[, .N, keyby=.(size, group)] ## size group N ## 1: 0 0 4 ## 2: 0 1 3 ## 3: 1 0 6 ## 4: 1 1 7 Fast implementation with data.table and Rcpp

As I alluded to before, if we want to implement this in the real world, it would be preferable to use code that does not bog down when we want to search 100,000+ possible randomization schemes. I have written a set of R and Rcpp functions the facilitate this. (Code is available here.)

# generate all possible schemes
xperm <- xPerms(totalSchools, rxSchools, N=NULL)
nrow(xperm)
## [1] 184756
xperm[sample(nrow(xperm), 5, replace = FALSE)]
##    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
## 1:  0  1  1  1  1  0  1  1  0   0   0   1   0   1   0   0   0   1   0
## 2:  1  1  0  1  1  1  1  1  0   1   0   0   0   0   1   0   1   0   0
## 3:  1  0  1  0  0  1  1  1  1   1   1   0   1   0   0   1   0   0   0
## 4:  1  1  1  0  0  1  0  1  0   1   1   1   1   0   0   1   0   0   0
## 5:  1  1  0  0  1  0  0  1  1   1   1   0   1   0   0   0   1   1   0
##    V20     id
## 1:   1  94784
## 2:   0  19535
## 3:   0  61644
## 4:   0  14633
## 5:   0  35651

# prepare data for evaluation
dtMat <- as.matrix(dt[,-1])
cc <- parse(text=attr(xperm, "varlist"))
cc
## expression(c(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12,
##     V13, V14, V15, V16, V17, V18, V19, V20))

# evaluate each combination
sF <- xperm[, cppChk(eval(cc), dtMat), keyby = id]
sF[sample(nrow(sF), 5, replace = FALSE)]
##        id    V1
## 1:  15924 FALSE
## 2:  68284 FALSE
## 3: 149360 FALSE
## 4:  62924 FALSE
## 5:  14009  TRUE

# keep only the balanced schemes
sFinal <- xperm[sF$V1]
nrow(sFinal)
## [1] 7742

# randomize from the balanced schemes
selectRow <- sample(nrow(sFinal), 1)

# check balance of randomized scheme
dtAssigned <- copy(dt)
dtAssigned[, group := as.vector(t(sFinal[selectRow, -"id"]))]
dtAssigned[, .N, keyby=.(language, group)]
##    language group N
## 1:        0     0 7
## 2:        0     1 7
## 3:        1     0 3
## 4:        1     1 3
dtAssigned[, .N, keyby=.(poverty, group)]
##    poverty group N
## 1:       0     0 9
## 2:       0     1 8
## 3:       1     0 1
## 4:       1     1 2
dtAssigned[, .N, keyby=.(location, group)]
##    location group N
## 1:        0     0 5
## 2:        0     1 4
## 3:        1     0 5
## 4:        1     1 6
dtAssigned[, .N, keyby=.(size, group)]
##    size group N
## 1:    0     0 3
## 2:    0     1 4
## 3:    1     0 7
## 4:    1     1 6

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

Hex stickers for the forecast package

R bloggers - Mon, 06/26/2017 - 02:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

I’ve caved in to the hex sticker craze, and produced some hex stickers for the forecast package for R. If you attend a workshop I teach, I’ll give you one. Otherwise you can order (in bulk) from hexi.pics.


To leave a comment for the author, please follow the link and comment on their blog: R on Rob J Hyndman.

Data visuals notes for my talks in 2017

R bloggers - Mon, 06/26/2017 - 02:00

(This article was first published on Bluecology blog, and kindly contributed to R-bloggers)

Data visuals: notes for my talks in 2017

Supplementary notes for CJ Brown’s talks on dataviz in 2017 for Griffith
University’s honours students and the UQ Winterschool in Bioinformatics.

Visualising sexual dimorphism in elephant seals

I picked this example to demonstrate a simple barchart for representing
the sizes of different things.

mnsize <- c(2800, 650)
sesize <- c(800, 50)
nams <- c('Male', 'Female')
cols <- c('steelblue', 'sienna3')
par(oma = c(1,2,1,1), mgp = c(6,1,0), mar = c(5,8,4,2))
barplot(mnsize, names.arg = '', las = 1, ylab = 'Weight (kg)', cex.axis = 2, cex.lab = 2, col = cols)
legend('topright', legend = nams, pch = 15, col = cols, cex = 2)

Comparing volume and area

Compare these. Note that if we compare circles we should use area, not
the radius or diameter to scale their size.

n <- c(10, 5)
barplot(n, col = 'skyblue', xaxt = 'n', yaxt = 'n')

rad1 <- 1
area1 <- pi*(rad1^2)
area2 <- area1/2
rad2 <- sqrt(area2/pi)
par(mfrow = c(1,2), mar = c(0,0,0,0))
pie(1, col = 'skyblue', labels = NA, border = NA, radius = rad1)
pie(1, col = 'skyblue', labels = NA, border = NA, radius = rad2)

Exploration of data

Let’s create a point cloud to demonstrate some data exploration
techniques

set.seed(42)
x <- rnorm(1000)
y <- 5*x + 3 + rnorm(1000, sd = 15)
plot(x, y, pch = 16, col = grey(0.5,0.5), las = 1)

Can’t see a lot here. A linear model might help us explore whether there is
any trend going on:

mod1 <- lm(y ~ x)
plot(x, y, pch = 16, col = grey(0.5,0.5), las = 1)
abline(mod1, lwd = 2, col = 'red')
xnew <- seq(min(x), max(x), length.out = 100)
pmod <- predict(mod1, newdata = data.frame(x = xnew), se = T)
lines(xnew, pmod$fit + pmod$se.fit, lwd = 2, col = 'red', lty = 2)
lines(xnew, pmod$fit - pmod$se.fit, lwd = 2, col = 'red', lty = 2)

What about identifying extreme points, that may be worth investigating
further? We can pick out points that are greater than 2SDs from the
trend:

modresid <- resid(mod1) sd2 <- sd(modresid)*2 ipt <- which(abs(modresid) > sd2) plot(x,y, pch = 16, col = grey(0.5,0.5), las = 1) abline(mod1, lwd = 2, col = 'red') points(x[ipt], y[ipt], pch = 16, col = rgb(1,0,0, 0.6))

Three ways of visualising the same x-y data

Each of these graphs of the same data has a slightly different
interpretation.

x <- 1:100 y <- x + rnorm(100, sd=30) plot(x,y, pch = 16, col = grey(0.5, 0.5))

mod1 <- lm(y ~ x) plot(x,y, col = 'white') abline(mod1, lwd = 3, col = 'red')

library(MASS) filled.contour(kde2d(x,y), scale = F)

plot(x,y, pch = 16, col = grey(0.5, 0.5)) abline(mod1, lwd = 3, col = 'red')

Make your own Datasaurus dozen

The datasaurus, invented by Alberto Cairo, is a great example of why you
should view your data. See Steph Locke’s code and package on
github for making this in R.

library(datasauRus) datnames <- rev(unique(datasaurus_dozen$dataset)) nlevels <- length(datnames) for (i in 1:nlevels){ i <- which(datasaurus_dozen$dataset == datnames[i]) plot(datasaurus_dozen$x[i], datasaurus_dozen$y[i], xlab = "x", ylab = "y", las = 1) Sys.sleep(1) }

Convince yourself that the mean, sd and correlation are the same in all
of these plots:

library(dplyr) datasaurus_dozen %>% group_by(dataset) %>% summarize(meanx = mean(x), meany = mean(y), sdx = sd(x), sdy = sd(y), corr = cor(x,y)) ## # A tibble: 13 × 6 ## dataset meanx meany sdx sdy corr ## ## 1 away 54.26610 47.83472 16.76982 26.93974 -0.06412835 ## 2 bullseye 54.26873 47.83082 16.76924 26.93573 -0.06858639 ## 3 circle 54.26732 47.83772 16.76001 26.93004 -0.06834336 ## 4 dino 54.26327 47.83225 16.76514 26.93540 -0.06447185 ## 5 dots 54.26030 47.83983 16.76774 26.93019 -0.06034144 ## 6 h_lines 54.26144 47.83025 16.76590 26.93988 -0.06171484 ## 7 high_lines 54.26881 47.83545 16.76670 26.94000 -0.06850422 ## 8 slant_down 54.26785 47.83590 16.76676 26.93610 -0.06897974 ## 9 slant_up 54.26588 47.83150 16.76885 26.93861 -0.06860921 ## 10 star 54.26734 47.83955 16.76896 26.93027 -0.06296110 ## 11 v_lines 54.26993 47.83699 16.76996 26.93768 -0.06944557 ## 12 wide_lines 54.26692 47.83160 16.77000 26.93790 -0.06657523 ## 13 x_shape 54.26015 47.83972 16.76996 26.93000 -0.06558334

We can also save these as .png images to make a .gif image (see also
here)

for (ilvs in 1:nlevels){ i <- which(datasaurus_dozen$dataset == datnames[ilvs]) thiscol <- ifelse(datnames[ilvs] == "dino", "darkseagreen", "grey20") png(filename = paste0("datasaurus/",datnames[ilvs],".png")) plot(datasaurus_dozen$x[i], datasaurus_dozen$y[i], xlab = "x", ylab = "y", las = 1, xlim = c(10, 105), ylim = c(-5, 105), col = thiscol, pch = 16) dev.off() } Effect size

When doing confirmatory analysis, we might want to know how strong an
effect is. Data viz is very useful for this task. Let’s compare two
datasets that have similar p-values, but very different effect sizes.

set.seed(42) x <- rnorm(1000) set.seed(420) y <- 5*x + 3 + rnorm(1000, sd = 15) set.seed(420) y2 <- 5*x + 3 + rnorm(1000, sd = 1) mod1 <- lm(y ~ x) mod2 <- lm(y2 ~ x) #Compare the pvalues on the slopes summary(mod1) ## ## Call: ## lm(formula = y ~ x) ## ## Residuals: ## Min 1Q Median 3Q Max ## -43.201 -10.330 0.395 9.634 46.694 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2.8054 0.4614 6.080 1.71e-09 *** ## x 4.2096 0.4603 9.145 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 14.59 on 998 degrees of freedom ## Multiple R-squared: 0.07732, Adjusted R-squared: 0.07639 ## F-statistic: 83.63 on 1 and 998 DF, p-value: < 2.2e-16 summary(mod2) ## ## Call: ## lm(formula = y2 ~ x) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.88004 -0.68868 0.02634 0.64229 3.11291 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2.98703 0.03076 97.11 <2e-16 *** ## x 4.94731 0.03069 161.21 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.9724 on 998 degrees of freedom ## Multiple R-squared: 0.963, Adjusted R-squared: 0.963 ## F-statistic: 2.599e+04 on 1 and 998 DF, p-value: < 2.2e-16 par(mfrow = c(1,2)) plot(x,y, pch = 16, col = grey(0.5,0.5), las = 1) abline(mod1, lwd = 2, col = 'red') plot(x,y2, pch = 16, col = grey(0.5,0.5), las = 1) abline(mod2, lwd = 2, col = 'red')

Don’t waste digital ink

Plots with fewer ‘add-ons’ tend to communicate the key message more
clearly. For instance, Excel-style plots don’t:

x <- rnorm(100) dat <- data.frame(x = x, y = 0.25*x + rnorm(100, sd = 0.2)) library(ggplot2) library(ggthemes) ggplot(dat, aes(x = x, y = y)) + geom_point() + theme_excel()

You can get additional themes for ggplot2 using the excellent ggthemes package.
A cleaner view:

ggplot(dat, aes(x = x, y = y)) + geom_point() + theme_base()

Or simply:

plot(dat$x, dat$y, xlab = "x", ylab = "y", las = 1)

A good principle is to not use ‘ink’ on figures that isn’t needed to
communicate your message. Tufte
takes the ‘less ink’ philosophy to the extreme:

ggplot(dat, aes(x = x, y = y)) + geom_point() + theme_tufte()+theme(axis.text=element_text(size=20), axis.title=element_text(size=20))

When is ggplot2 appropriate, or when should I use base R?

In general I think ggplot2 is appropriate for problems of intermediate
complexity.

Base R is great if you just want to plot a barplot quickly, or do an x-y
plot. ggplot2 comes into its own for slightly more complex plots, like
having multiple panels for different groups or colouring lines by a 3rd
factor. But once you move to really complex plots, like overlaying a
subplot on a map, ggplot2 becomes very difficult, if not impossible. At
that point it is better to move back to Base R. ggplot2 can also get
very fiddly if you are very specific about your plots, like having
certain colours, or the labels in a certain way.

As an example, ggplot2 is great for data like this:

x1 <- rnorm(30) grps <- letters[c(rep(1, 10), rep(2, 10), rep(3, 10))] y1 <- x1 + c(rep(1, 10), rep(-1, 10), rep(2, 10)) + rnorm(30) dat <- data.frame(x = x1, grps = grps, y = y1) head(dat) ## x grps y ## 1 1.15862941 a 5.015377 ## 2 0.89667985 a 1.109440 ## 3 0.04423754 a 2.122226 ## 4 0.27715868 a 1.863923 ## 5 0.02435753 a 1.087412 ## 6 1.54113178 a 3.839985 ggplot(dat, aes(x = x1, y = y1, color = grps)) + geom_point() + theme_bw()

It is also pretty handy for faceting:

ggplot(dat, aes(x = x1, y = y1)) + geom_point() + facet_wrap(~grps)+ theme_bw()

The key with ggplot2 is to have your data in a data-frame.

In reality both ggplot2 and base R graphics are worth learning, but I
would start with learning the basics of base R graphics and then move
onto ggplot2 if you want to quickly plot lots of structured data-sets.

Pie graphs vs bar graphs

In Mariani et al. they plot rates of seafood fraud for several European
countries. While it’s a foundational study that establishes improvement
in the accuracy of food labelling, their graphics could be improved in
several ways.

First they use perspective pies. This makes it incredibly hard to
compare the two groups (fish that are labelled/mislabelled). Humans are
very bad at comparing angles and pretty bad at comparing areas. With the
perspective you can’t even compare the areas properly. They do provide
the raw numbers, but then, why bother with the pies?
Note that the % pies misrepresent the data slightly because the %
figures are actually odds ratios (mis-labels / correct labels), rather
than percent (mis-labels / total samples).
Second the pies are coloured red/green, which will be hard for red-green
colourblind people to see.
Third, they have coloured land blue on their map, so it appears to be
ocean at first look.
Fourth, the map is not really necessary. There are no spatial patterns
going on that the authors want to draw attention to. I guess having a
map does emphasize that the study is in Europe. Finally, the size of
each pie is scaled to the sample size, but the scale bar for the sample
size shows a sample of only 30, whereas most of their data are for much
larger sample sizes (>200). Do you get the impression from the pies
that the UK has 668 samples, whereas Ireland only has 187? Therefore,
from this graphic we have no idea what sample size was used in each
country.

In fact, all the numbers that are difficult to interpret in the figure
are very nicely presented in Table 1.
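
To see the difference being described, take the UK numbers used in the chart below (647 correctly labelled, 21 mislabelled); this little check is added here purely for illustration:

21 / 647        # odds-style figure: mislabels relative to correct labels, ~3.2%
21 / (21 + 647) # share of all samples that are mislabelled, ~3.1%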

Below is a start at improving the presentation. For instance, you could
do a simple bar chart, ordering by rate of mislabelling.

cnames <- c('Ireland' ,'UK','Germany','France','Spain','Portugal') corrlab <- c(180, 647, 145, 146, 267, 178) mislab <- c(7, 21, 9, 4, 24, 12) misrate <- 100*signif(mislab / (corrlab + mislab),2) corrrate <- 100 - misrate ord <- order(misrate, decreasing = T) y <- rbind(corrrate, misrate) par(mar = c(5,4,4,7)) barplot(y[,ord], names.arg = cnames[ord], col = c('skyblue','tomato'), ylab = 'Labelling rate (%)', las = 2) legend(x=7.5, y = 90, legend = c("Mislabelled", "Correctly labelled"), pch = 16, col = c('tomato','skyblue'), xpd = NA, cex = 0.7)

You could add another subfigure to this, showing the rate by different
species too.

The barplot doesn’t communicate the sample size, but then that is
probably not the main point. The sample sizes are probably best reported
in the table.

If we felt the map was essential, then putting barcharts on it would be
more informative. It is not that easy to add barcharts on top of an
existing map in R, so I would recommend creating the barcharts
separately, then adding them on in Illustrator or Powerpoint.

We can make a basic map like this:

library(maps)
library(maptools)
map('world', xlim = c(-20, 20), ylim = c(35, 60), col = 'grey', fill = T)

Then create some nice barcharts. We write a loop so we get one barchart
for each country.

nc <- length(cnames) par(mfrow = c(2,3), oma = c(1,1,1,3)) for (i in 1:nc){ y <- c(mislab[i], corrlab[i]) barplot(y, names.arg = '', las = 2, col = c('tomato','skyblue'), ylim = c(0, corrlab[i]), main = cnames[i], cex.main = 2.4, yaxt = 'n') byy <- signif(max(y),2)/3 yat <- c(0, min(y), max(y)) axis(2, at = yat, las = 2, cex.axis = 2, labels = F) axis(2, at = yat[2:3], las = 2, cex.axis = 2, labels = T) } legend(x = 2.8, y = 500, legend = c('Fraud', 'Correct'), pch = 15, col = c('tomato','skyblue'), xpd = NA, cex = 2, bty = 'n')

Choosing axes

It can be misleading to present % and proportion data on axes that are
not scaled 0 – 100%. For instance, compare these three graphs:

y <- c(70, 72, 68, 73) x <- 1:4 xnams <- LETTERS[1:4] par(mfrow = c(1,3), oma = c(1,1,1,3), mar = c(5,6,2,2)) plot(x,y, pch = 3, cex = 2, las = 1, xaxt = 'n', xlab = '', ylab = 'Percent', cex.axis = 2, cex.lab = 2, tcl = 0.5, xlim = c(0, 5), col = 'red', lwd = 3) axis(1, at = x, labels = xnams, cex.axis = 2, tcl = 0.5) barplot(y, names.arg = xnams, las = 1, cex.axis = 2, cex.lab = 2, cex.names = 2, ylab = 'Percent') barplot(y, names.arg = xnams, las = 1, cex.axis = 2, cex.lab = 2, cex.names = 2, ylab = 'Percent', ylim = c(0, 100))

Choosing colour scales

A lot of thought should go into choosing colour scales for graphs. For
instance: will it print OK? Will colour-blind people be able to see
this? Does the scale create artificial visual breaks in the data?
Luckily there is a package to help you make the right decision for a
colour scale: RColorBrewer. Check out colorbrewer.org for
a helpful interactive web interface for choosing colours.
Here I recreate the map from Cinner et al. 2016 (Nature) for fish
biomass (using dummy data, as I don’t have their data).

First let’s specify some dummy data

fishbio <- c(3.2, 3.3, 3.4, 6, 9, 3.7,3.9,7, 8, 5, 6, 4) longs <- c(95.59062 ,96.50676,121.05892,128.14529,135.00157,166.68020,156.44645,146.75308, 150.8980, 151.1395, 142.9253, 119.0074) lats <- c(4.3201110,1.9012211, -7.4197367, 0.4173821 , 7.2730141, -19.8772305, -28.3750059, -16.9918706, -21.985131, -9.422199, -2.899138, 1.932759) library(RColorBrewer) library(maps) #Create the colours for locations cols <- brewer.pal(9, 'RdYlGn') colsf <- colorRampPalette(cols) cols100 <- colsf(100) breaks <- seq(min(fishbio)-0.01, max(fishbio)+0.01, length.out = 100) icol <- cut(fishbio, breaks = breaks, labels = F) fishcols <- cols100[icol] #Plot locations map('world', xlim = c(80, 180), ylim = c(-35, 35),interior = F, fill = T, col = 'grey') points(longs, lats, pch = 16, col = fishcols, cex = 2) points(longs, lats, pch = 1, cex = 2)

Using red-green palettes makes it hard for colour blind people. Also,
using a diverging palette makes it look like there is something
important about the middle point (yellow). A better palette to use would
be one of the sequential ones. Using reds (and reversing it using
rev()) emphasises the worst places. We could use greens and emphasise
the best places too. I am using a light grey for the fill so that the
points stand out more.

#Create sequential colours for locations cols <- rev(brewer.pal(9, 'Reds')) colsf <- colorRampPalette(cols) cols100 <- colsf(100) breaks <- seq(min(fishbio)-0.01, max(fishbio)+0.01, length.out = 100) icol <- cut(fishbio, breaks = breaks, labels = F) fishcols <- cols100[icol] #Plot locations map('world', xlim = c(80, 180), ylim = c(-35, 35),interior = F, fill = T, col = 'grey90') points(longs, lats, pch = 16, col = fishcols, cex = 2) points(longs, lats, pch = 1, cex = 2)

To make it easier to understand, let’s look at these again as contour
plots. I will use a more appropriate diverging palette than the red-green
one, though.

z <- matrix(rep(1:10, 10), nrow = 10) filled.contour(z, col = brewer.pal(9, 'Reds'), nlevels = 10)

filled.contour(z, col = brewer.pal(9, 'RdBu'), nlevels = 10)

Notice how the diverging palette creates an artificial split at yellow.

One of the only legitimate uses for pie graphs (I think) is visualising
the colour scales. Here is how:

reds <- brewer.pal(9, 'Reds') greens <- brewer.pal(9, 'Greens') blues <- brewer.pal(9, 'Blues') rdylgn <- brewer.pal(9, 'RdYlGn') rdbu <- brewer.pal(9, 'RdBu') dark2 <- brewer.pal(8, 'Dark2') par(mfrow = c(2,3), mar = c(0,0,0,0), oma = c(0,0,0,0)) pie(rep(1, 9), col = reds) pie(rep(1, 9), col = greens) pie(rep(1, 9), col = blues) pie(rep(1, 9), col = rdylgn) pie(rep(1, 9), col = rdbu) pie(rep(1, 9), col = dark2)

Interpreting rates

People are bad at interpreting rates; we just can’t get our heads around
accumulation very well. Here is a numerical example. Check out the below
figure and ask yourself:

At what time is the number of people in the shopping centre declining?

Would you say it is at point A, B, C or D?

Before you proceed with the code below, take the poll:



Here is how we made the figure and generated the data:
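
The plotting code below uses objects called times, inrate_err and outrate_err that are not defined in this extract, so here is a minimal sketch that generates something similar (the particular rates are made up purely for illustration):

set.seed(1)
times <- seq(8, 17, by = 10/60)            # 8am to 5pm in 10-minute steps
inrate <- 5 * exp(-(times - 12)^2 / 8)     # entries peak around midday
outrate <- 5 * exp(-(times - 16.2)^2 / 8)  # exits peak towards closing time
inrate_err <- pmax(0, inrate + rnorm(length(times), sd = 0.5))
outrate_err <- pmax(0, outrate + rnorm(length(times), sd = 0.5))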

par(mar = c(4,4.5,2,2), mgp = c(3,1,0)) plot(times, inrate_err, type = 'l', xlab = 'Hour of day', ylab = 'People per 10 minutes', las = 1, cex.axis = 2, lwd = 3, col = 'darkblue', cex.lab = 2, ylim = c(0, 12)) lines(times, outrate_err, lwd = 3, col = 'tomato') abline(v = 12, lwd = 2, col = grey(0.5,0.5)) text(12, 13, 'A', xpd = NA, cex = 2) abline(v = 13.5, lwd = 2, col = grey(0.5,0.5)) text(13.5, 13, 'B', xpd = NA, cex = 2) abline(v = 14.2, lwd = 2, col = grey(0.5,0.5)) text(14.2, 13, 'C', xpd = NA, cex = 2) abline(v = 15.8, lwd = 2, col = grey(0.5,0.5)) text(15.8, 13, 'D', xpd = NA, cex = 2) legend('bottomleft', legend = c('Entering', 'Leaving'), lwd = 2, col = c('darkblue','tomato'), cex = 1.5, xpd = NA)

Let’s plot the total number of people:

par(mar = c(5,5.5,2,2), mgp = c(4,1,0)) plot(times, cumsum(inrate_err) - cumsum(outrate_err), type = 'l', xlab = 'Hour of day', ylab = 'People in shopping centre', las = 1, cex.axis = 2, lwd = 3, col = 'darkblue', cex.lab = 2, ylim = c(0, 120)) abline(v = 12, lwd = 2, col = grey(0.5,0.5)) text(12, 130, 'A', xpd = NA, cex = 2) abline(v = 13.5, lwd = 2, col = grey(0.5,0.5)) text(13.5, 130, 'B', xpd = NA, cex = 2) abline(v = 14.1, lwd = 2, col = 'tomato') text(14.2, 130, 'C', xpd = NA, cex = 2) abline(v = 15.8, lwd = 2, col = grey(0.5,0.5)) text(15.8, 130, 'D', xpd = NA, cex = 2)

Hopefully the answer is obvious now. So the right scales can help make
interpretation much easier.

Anatomy of a simple chart

The construction of a simple chart in R can be a surprisingly long piece
of code. Here is an example to get you started. Don’t be afraid to
experiment!

# --------------- # # Make some data # --------------- # set.seed(42) n <- 11 x <- rnorm(n, mean = seq(18, 25, length.out = n)) y <- rnorm(n, mean =seq(26, 18, length.out = n)) z <- rnorm(n, mean = 22) t <- 2005:(2005+n-1) datnames <- c('Almonds', 'Peanuts', 'Hazelnuts') plot(t, x) lines(t, y) lines(t, z)

Which looks terrible. Let’s build a better chart.

# Package for colours library(RColorBrewer) #Set axis limits ymax <- 30 ylim <- c(15, ymax) xlim <- c(min(t), max(t)) #Define colours cols <- brewer.pal(3, 'Dark2') #Parameters for plotting lwd <- 2 xlabels <- seq(min(t), max(t), by = 5) ylabels <- seq(0, ymax, by = 5) #Set the window params par(mar = c(5,5,4,4)) #Build the plot plot(t, x, type = 'l', bty = 'n', xaxt = 'n', yaxt = 'n', ylim = ylim, xlim = xlim, lwd = lwd, col = cols[1], xaxs = 'i', yaxs = 'i', xlab = 'Time (yrs)', ylab = '', main = 'Changing price of nuts ($/kg)') #Add more lines lines(t, y, lwd = lwd, col = cols[2]) lines(t, z, lwd = lwd, col = cols[3]) #Add labels to lines text(t[n], x[n], datnames[1], adj = c(0, 0), xpd = NA, col = cols[1]) text(t[n], y[n], datnames[2], xpd = NA, adj = c(0, 0), col = cols[2]) text(t[n], z[n], datnames[3], xpd = NA, adj = c(0, 0), col = cols[3]) # Add custom axes axis(1, col = 'white', col.ticks = 'black', labels = xlabels, at = xlabels) axis(1, col = 'white', col.ticks = 'black', labels = NA, at = t) axis(2, col = 'white', col.ticks = 'black', las =1, labels = ylabels, at = ylabels)

Resources and further reading

To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog.

Data Visualization with googleVis exercises part 4

R bloggers - Sun, 06/25/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Adding Features to your Charts

We saw in the previous parts some basic and well-known types of charts that googleVis offers to users. Before continuing with other, more sophisticated charts in the next parts, we are going to “dig a little deeper” and see some interesting features of those we already know.

Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let’s begin!

Answers to the exercises are available here.

Package & Data frame

As you already know, the first thing you have to do is install and load the googleVis package with:
install.packages("googleVis")
library(googleVis)

Secondly, we will create an experimental data frame which will be used for plotting our charts. You can create it with:
df=data.frame(name=c("James", "Curry", "Harden"),
Pts=c(20,23,34),
Rbs=c(13,7,9))

NOTE: The charts are created locally by your browser. In case they are not displayed at once press F5 to reload the page.

Customizing Chart

We are going to use the two-axis Line Chart we created in part 1. This is the code we used, in case you forgot it:

LineC2 <- gvisLineChart(df, "name", c("Pts","Rbs"),
options=list(
series="[{targetAxisIndex: 0},
{targetAxisIndex:1}]",
vAxes="[{title:'Pts'}, {title:'Rbs'}]"
))
plot(LineC2)

Colours

To set the color of every line we can use:
series="[{color:'green', targetAxisIndex: 0,

Exercise 1

Change the colours of your line chart to green and yellow respectively and plot the chart.

Line Width

You can determine the line width of every line with:
series="[{color:'green',targetAxisIndex: 0, lineWidth: 3},

Exercise 2

Change the line width of your lines to 3 and 6 respectively and plot the chart.

Dashed lines

You can transform your lines to dashed with:
series="[{color:'green', targetAxisIndex: 0,
lineWidth: 1, lineDashStyle: [2, 2, 20, 2, 20, 2]},

There are many styles and colours available and you can find them here.

Learn more about using GoogleVis in the online course Mastering in Visualization with R programming. In this course you will learn how to:

  • Work extensively with the GoogleVis package and its functionality
  • Learn what visualizations exist for your specific use case
  • And much more

Exercise 3

Choose two different styles of dashed lines for every line of your chart from the link above and plot your chart.

Point Shape

With the pointShape option you can choose from a variety of shapes for your points.
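
It is set inside options like the other chart settings; for example (an illustrative snippet, not taken from the original exercise set, and the object name ScatterShape is only used here):
ScatterShape <- gvisScatterChart(cars,
options=list(pointSize=3, pointShape="diamond"))
plot(ScatterShape)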

We will use the scatter chart we built in part 3 to see how it works. Here is the code:
ScatterCD <- gvisScatterChart(cars,
options=list(
legend="none",
pointSize=3,lineWidth=2,
title="Cars", vAxis="{title:'speed'}",
hAxis="{title:'dist'}",
width=600, height=300))
plot(ScatterCD)

Exercise 4

Change the shape of your scatter chart’s points to ‘square’ and plot it. HINT: Use pointShape.

Exercise 5

Change the shape of your scatter chart’s points to ‘triangle’, their point size to 7 and plot it.

Edit Button

A really useful and easy feature that googleVis provides is the edit button, which gives the user the ability to customize the chart interactively.
options=list(gvis.editor="Edit!"))

Exercise 6

Add an edit button in the scatter chart you just created. HINT: Use gvis.editor.

Chart with more options

Now let’s see how we can create a chart with many features that enhance its appearance. We will again use the two-axis line chart that we used before.
LineCD2 <- gvisLineChart(df, "name", c("Pts","Rbs"),
options=list(
series="[{color:'green',targetAxisIndex: 0, lineWidth: 3,
lineDashStyle: [14, 2, 2, 7]},
{color:'yellow',targetAxisIndex:1,lineWidth: 6,
lineDashStyle: [10, 2]}]",
vAxes="[{title:'Pts'}, {title:'Rbs'}]"
))
plot(LineCD2)

Background color

You can decide the background color of your chart with:
backgroundColor="red",

Exercise 7

Set the background color of your line chart to “lightblue” and plot it. HINT: Use backgroundColor.

Title

To give a title and decide its features you can use:
title="Title",
titleTextStyle="{color:'orange',
fontName:'Courier',
fontSize:14}",

Exercise 8

Give a title of your choice to the line chart and set its font to blue, Courier, of size 16. HINT: Use titleTextStyle.

Curve Type & Legend

Another nice-looking choice that googleVis gives you is to display the lines as curves with:
curveType="function"

You can also move the legend of your chart to the bottom with:
legend="bottom"

Exercise 9

Smooth the lines of your line chart by setting the curveType option to function and move the legend to the bottom. HINT: Use curveType and legend.

Axes features

Finally you can “play” with your axes. This is an example:
vAxis="{gridlines:{color:'green', count:4}}",
hAxis="{title:'City', titleTextStyle:{color:'red'}}",
series="[{color:'yellow', targetAxisIndex: 0},
{color: 'brown',targetAxisIndex:1}]",
vAxes="[{title:'val1'}, {title:'val2'}]",

Exercise 10

Give the title “Name” to your hAxis and color it orange. Separate your vAxis with 3 red gridlines. HINT: Use titleTextStyle and gridlines.

Related exercise sets:
  1. Data Visualization with googleVis exercises part 2
  2. Data Visualization with googleVis exercises part 3
  3. Data Visualization with googleVis exercises part 1
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

R Weekly Bulletin Vol – XII

R bloggers - Sun, 06/25/2017 - 11:57

(This article was first published on R programming, and kindly contributed to R-bloggers)

This week’s R bulletin will cover topics on how to resolve some common errors in R.

We will also cover functions like do.call, rename, and lapply. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. Find and Replace – Ctrl+F
2. Find Next – F3
3. Find Previous – Shift+F3

Problem Solving Ideas

Resolving the ‘: cannot open the connection’ Error

There can be two reasons for this error to show up when we run an R script:
1) A file/connection can’t be opened because R can’t find it (mostly due to an error in the path)
2) Failure in .onLoad() because a package can’t find a system dependency

Example:

symbol = "AXISBANK" noDays = 1 dirPath = paste(getwd(), "/", noDays, " Year Historical Data", sep = "") fileName = paste(dirPath, symbol, ".csv", sep = "") data = as.data.frame(read.csv(fileName))

Warning in file(file, “rt”): cannot open file ‘C:/Users/Madhukar/Documents/
1 Year Historical DataAXISBANK.csv’: No such file or directory
Error in file(file, “rt”): cannot open the connection

We are getting this error because we have specified the wrong path to the “dirPath” object in the code. The right path is shown below. We missed adding a forward slash after “Year Historical Data” in the paste function. This led to the wrong path, and hence the error.

dirPath = paste(getwd(), "/", noDays, " Year Historical Data/", sep = "")

After adding the forward slash, we re-ran the code. Below we can see the right dirPath and fileName printed in the R console.

Example:

symbol = "AXISBANK" noDays = 1 dirPath = paste(getwd(), "/", noDays, " Year Historical Data/", sep = "") fileName = paste(dirPath, symbol, ".csv", sep = "") data = as.data.frame(read.csv(fileName)) print(head(data, 3))

Resolving the ‘could not find function’ Error

This error arises when an R package is not loaded properly or due to the misspelling of the function names.

When we run the code shown below, we get a “could not find function” error in the console. This is because we have misspelled the “ymd” function as “ymed”. Even with the correct spelling, if we do not load the required package (lubridate, which provides ymd), we will also get a “could not find function” error.

Example:

# Read NIFTY price data from the csv file
df = read.csv("NIFTY.csv")
# Format date
dates = ymed(df$DATE)

Error in eval(expr, envir, enclos): could not find function “ymed”
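
The fix (not shown in the original bulletin) is to load the package that provides ymd(), namely lubridate, and to spell the function name correctly:

library(lubridate)
dates = ymd(df$DATE)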

Resolving the “replacement has” Error

This error occurs when one tries to assign a vector of values to an existing object and the lengths do not match up.

In the example below, the stock price data of Axis bank has 245 rows. In the code, we created a sequence “s” of numbers from 1 to 150. When we try to add this sequence to the Axis Bank data set, it throws up a “replacement error” as the lengths of the two do not match. Thus to resolve such errors one should ensure that the lengths match.

Example:

symbol = "AXISBANK" ; noDays = 1 ; dirPath = paste(getwd(),"/",noDays," Year Historical Data/",sep="") fileName = paste(dirPath,symbol,".csv",sep="") df = as.data.frame(read.csv(fileName)) # Number of rows in the dataframe "df" n = nrow(df); print(n); # create a sequence of numbers from 1 to 150 s = seq(1,150,1) # Add a new column "X" to the existing data frame "df" df$X = s print(head(df,3))

Error in $<-.data.frame(*tmp*, “X”, value = c(1, 2, 3, 4, 5, 6, 7, : replacement has 150 rows, data has 245
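
One simple fix, sketched here for illustration, is to pad the shorter vector with NAs so that the lengths match before assigning:

df$X = c(s, rep(NA, n - length(s)))  # pads the 150 values with 95 NAs to reach 245 rows
print(head(df, 3))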

Functions Demystified

do.call function

The do.call function is used for calling other functions. The function which is to be called is provided as the first argument to the do.call function, while the second argument of the do.call function is a list of arguments of the function to be called. The syntax for the function is given as:

do.call (function_name, arguments)

Example: Let us first define a simple function that we will call later in the do.call function.

numbers = function(x, y) {
  sqrt(x^3 + y^3)
}
# Now let us call this 'numbers' function using the do.call function. We provide the function name as
# the first argument to the do.call function, and a list of the arguments as the second argument.
do.call(numbers, list(x = 3, y = 2))

[1] 5.91608

rename function

The rename function is part of the dplyr package, and is used to rename the columns of a data frame. The syntax for the rename function is to have the new name on the left-hand side of the = sign, and the old name on the right-hand side. Consider the data frame “df” given in the example below.

Example:

library(dplyr)
Tic = c("IOC", "BPCL", "HINDPETRO", "ABAN")
OP = c(555, 570, 1242, 210)
CP = c(558, 579, 1248, 213)
df = data.frame(Tic, OP, CP)
print(df)

# Renaming the columns as 'Ticker', 'OpenPrice', and 'ClosePrice'. This can be done in the following
# manner:
renamed_df = rename(df, Ticker = Tic, OpenPrice = OP, ClosePrice = CP)
print(renamed_df)

lapply function

The lapply function is part of the R base package, and it takes a list “x” as an input, and returns a list of the same length as “x”, each element of which is the result of applying a function to the corresponding element of X. The syntax of the function is given as:

lapply(x, Fun)
where,
x is a vector (atomic or list)
Fun is the function to be applied

Example 1:

Let us create a list with 2 elements, OpenPrice and the ClosePrice. We will compute the mean of the values in each element using the lapply function.

x = list(OpenPrice = c(520, 521.35, 521.45), ClosePrice = c(521, 521.1, 522))
lapply(x, mean)

$OpenPrice
[1] 520.9333

$ClosePrice
[1] 521.3667

Example 2:

x = list(a = 1:10, b = 11:15, c = 1:50)
lapply(x, FUN = length)

$a
[1] 10

$b
[1] 5

$c
[1] 50

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.




To leave a comment for the author, please follow the link and comment on their blog: R programming.

Operations Research with R

R bloggers - Sun, 06/25/2017 - 11:14

(This article was first published on R Blog, and kindly contributed to R-bloggers)

Stefan Feuerriegel

This blog entry concerns our course on “Operations Research with R” that we teach as part of our study program. We hope that the materials are of value to lecturers and everyone else working in the field of numerical optimization.

Course outline

The course starts with a review of numerical and linear algebra basics for optimization. Here, students learn how to derive a problem statement that is compatible with solving algorithms. This is followed by an overview of problem classes, such as one- and multi-dimensional problems. Starting with linear and quadratic problems, we also cover convex optimization, followed by non-linear approaches: gradient-based (gradient descent methods), Hessian-based (Newton and quasi-Newton methods) and gradient-free (Nelder-Mead). We finally demonstrate the potent capabilities of R for Operations Research: we show how to solve optimization problems in industry and business, as well as illustrate their use in methods for statistics and data mining (e.g. support vector machines or quantile regression). All examples are supported by appropriate visualizations.

Goals

1. Motivate the use of R for operations research tasks

2. Familiarize with classes of optimization problems

3. Perform numerical optimization tasks in R using suitable packages

Motivation

R is widely taught in business courses and, hence, known by most data scientists with a business background. However, when it comes to Operations Research, many other languages are used. Especially for optimization, solutions range from Microsoft Excel solvers to modeling environments such as Matlab and GAMS. Most of these are non-free and require students to learn yet another language. Because of this, we propose to use R for the optimization problems of Operations Research, since R is open source, comes for free and is widely known. Furthermore, R provides a multitude of numerical optimization packages that are readily available. At the same time, R is widely used in industry, making it a suitable tool for leveraging the potential of numerical optimization.
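
As a small, self-contained taste of this (an illustration added here, not taken from the course materials), base R's optim() already solves a classic unconstrained test problem with the Nelder-Mead method:

# minimise the Rosenbrock function, a standard benchmark for optimizers
rosenbrock <- function(p) {
  (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2
}
fit <- optim(par = c(-1.2, 1), fn = rosenbrock, method = "Nelder-Mead")
fit$par  # should be close to the true minimum at (1, 1)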

Download link to the resources:
https://www.is.uni-freiburg.de/resources/computational-economics?set_language=en


To leave a comment for the author, please follow the link and comment on their blog: R Blog.
