
### Evaluate your model with R Exercises

Wed, 05/31/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

There was a time when statisticians had to crunch numbers by hand to fit their data to a model. Since this process took so long, they usually did a lot of preliminary work: researching models that had worked in the past, or looking for studies in other scientific fields such as psychology or sociology that could inform their model, with the goal of maximizing their chances of building a relevant one. Then they would create a model and an alternative model and choose the one that seemed more effective.

Now that even an average computer gives us incredible computing power, it’s easy to fit multiple models and choose the one that best fits the data. Even though it is better to have good prior knowledge of the process you are trying to analyze and of models used in the past, coming to a conclusion based mostly on the data helps you avoid bias and build better models.

In this set of exercises, we’ll see how to apply the most commonly used error metrics to your models, with the aim of rating them and choosing the one that is most appropriate for the situation.
Most of these error metrics are not part of any R package, so you will have to apply the equations I give you to your data. Personally, I prefer to write a function that I can easily reuse on each of my models, but there are many ways to code these equations. If your code differs from the one in the solution, feel free to post it in the comments.

Answers to the exercises are available here.

Exercise 1
We start by looking at error metrics we can use for regression models. For linear regression problems, the most commonly used metrics are the coefficient of determination R2, which shows what percentage of variance is explained by the model, and the adjusted R2, which penalizes models that use variables without an effective contribution (see this page for more details). Load the attitude dataset from the package of the same name and fit three linear models to explain the rating variable. The first uses all the variables in the dataset; the second uses complaints, privileges, learning and advance as independent variables; and the third uses only complaints, learning and advance. Then use the summary() function to print the R2 and adjusted R2 of each model.
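A minimal sketch of one way to set this up (the model names mod1 to mod3 are my own, not prescribed by the exercise):

```r
# The attitude dataset ships with base R's datasets package.
data(attitude)

mod1 <- lm(rating ~ ., data = attitude)
mod2 <- lm(rating ~ complaints + privileges + learning + advance, data = attitude)
mod3 <- lm(rating ~ complaints + learning + advance, data = attitude)

# summary() prints both metrics; they can also be extracted directly:
summary(mod1)$r.squared      # R^2 of the full model
summary(mod1)$adj.r.squared  # adjusted R^2
```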

Exercise 2
Another way to measure how well your model fits your data is to use the Root Mean Squared Error (RMSE), which is defined as the square root of the average of the squared errors made by your model. You can find the mathematical definition of the RMSE on this page. Calculate the RMSE of the predictions made by the three models.
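One possible RMSE helper, written as a function of the observed and predicted values (the function name is my own):

```r
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# For a fitted lm model from Exercise 1 this would be, e.g.:
# rmse(attitude$rating, fitted(mod1))
```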

Exercise 3
The mean absolute error (MAE) is a good alternative to the RMSE if you don’t want to penalize large estimation errors too heavily. The MAE is the average of the absolute errors made by your model; its mathematical definition can be found here.
Calculate the MAE of the predictions made by the three models.

Exercise 4
Sometimes some prediction errors hurt your model more than others. For example, if you are trying to predict the financial loss of a business over a period of time, underestimating the loss would put the business at risk of bankruptcy, while overestimating it would only result in a conservative model. In such cases, the Root Mean Squared Logarithmic Error (RMSLE) is a useful error metric, since it penalizes underestimation. The RMSLE is given by the equation on this page.
Calculate the RMSLE of the predictions made by the three models.

Exercise 5
Now that we’ve seen some examples of error metrics which can be used in a regression context, let’s see five examples of error metrics which can be used when you perform clustering analysis. But first, we must create a clustering model to test those metrics on. Load the iris dataset and apply the k-means algorithm. Since the iris dataset has three distinct labels, use k-means with three centers. Also, set the maximum number of iterations to 50 and use the “Lloyd” algorithm. Once it’s done, take the time to rearrange the labels of your prediction so they are compatible with the factors in the iris dataset.

Learn more about Model Evaluation in the online course Regression Machine Learning with R. In this course you will learn how to:

• Avoid model over-fitting using cross-validation for optimal parameter selection
• Explore maximum-margin methods such as support vector machines with linear and non-linear kernels, including choosing the best penalty for the error term.
• And much more

Exercise 6
Print the confusion matrix of your model.
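With the relabelled predictions from Exercise 5 stored in a factor (here called pred, a name of my own choosing), base R’s table() does the job:

```r
# Rows are predicted labels, columns are the true species.
conf <- table(predicted = pred, actual = iris$Species)
conf
```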

Exercise 7
The easiest way to measure how well your model categorized the data is to calculate the accuracy, recall and precision of your results. Write three functions which return those individual values, and compute those metrics for your model.
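Sketches of the three metrics, computed from a confusion matrix laid out as in Exercise 6 (rows = predicted, columns = actual; the function names are my own):

```r
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# In a multi-class setting, precision and recall are defined per class;
# these return one value per species:
precision <- function(cm) diag(cm) / rowSums(cm)  # per predicted class
recall    <- function(cm) diag(cm) / colSums(cm)  # per actual class
```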

Exercise 8
The F-measure summarizes the precision and recall of your model by calculating a weighted harmonic mean of those two values.
Write a function which returns the F-measure of your model, and compute this measure twice for your data: once with a parameter of 2 and once with a parameter of 0.5.

Exercise 9
Purity is a measure of the homogeneity of your clusters: if every cluster groups only objects of the same class, you’ll get a purity score of one, and if there’s no majority class in any of the clusters, you’ll get a purity score close to 0. Write a function which returns the purity score of your model and test it on your predictions.
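One way to write it: for each cluster, count the size of its majority class, sum over clusters, and divide by the total number of objects (the function name is my own):

```r
purity <- function(clusters, classes) {
  sum(apply(table(clusters, classes), 1, max)) / length(clusters)
}

# e.g. purity(km$cluster, iris$Species) with the model from Exercise 5
```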

Exercise 10
The last error metric we’ll see today is the Dunn index, which indicates whether the clusters are compact and well separated. You can find the mathematical definition of the Dunn index here. Load the clValid package and use the dunn() function on your model to compute the Dunn index of your classification. Note that this function takes an integer vector representing the cluster partitioning as a parameter.
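A sketch of the call, assuming the km object from Exercise 5:

```r
# install.packages("clValid")  # if not already installed
library(clValid)

# dunn() takes a distance matrix plus the integer vector of cluster
# memberships produced by kmeans():
dunn(distance = dist(iris[, 1:4]), clusters = km$cluster)
```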


To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### New online datacamp course: Forecasting in R

Wed, 05/31/2017 - 17:23

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Forecasting in R is taught by Rob J. Hyndman, author of the forecast package.

Forecasting involves making predictions about the future. It is required in many situations: deciding whether to build another power generation plant in the next ten years requires forecasts of future demand; scheduling staff in a call center next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements. Forecasts can be required several years in advance (for capital investments) or only a few minutes beforehand (for telecommunication routing). Whatever the circumstances or time horizons involved, forecasting is an important aid to effective and efficient planning. This course provides an introduction to time series forecasting using R.

What You’ll Learn
Chapter 1: Exploring and Visualizing Time Series in R
The first thing to do in any data analysis task is to plot the data.
Chapter 2: Benchmark Methods and Forecast Accuracy
In this chapter, you will learn general tools that are useful for many different forecasting situations.
Chapter 3: Exponential Smoothing
This framework generates reliable forecasts quickly and for a wide range of time series.
Chapter 4: Forecasting with ARIMA Models
ARIMA models provide another approach to time series forecasting.
In this chapter, you will look at some methods that handle more complicated seasonality.

You can start the first chapter of Forecasting in R for free.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

### Drilling Into CSVs — Teaser Trailer

Wed, 05/31/2017 - 17:12

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I used reading a directory of CSVs as the foundational example in my recent post on idioms.

During my exchange with Matt, Hadley and a few others — in the crazy Twitter thread that spawned said post — I mentioned that I’d personally “just use Drill”.

I’ll use this post as a bit of a teaser trailer for the actual post (or, more likely, series of posts) that goes into detail on where to get Apache Drill, basic setup of Drill for standalone workstation use and then organizing data with it.

You can get ahead of those posts by doing two things:

2. Review the U.S. EPA annual air quality data archive (they have individual, annual CSVs that are perfect for the example)

My goal for this post is really just to pique your interest enough in Drill and parquet files (yes, I’m ultimately trying to socially engineer you into using parquet files) to convince you to read the future post(s) and show that it’s worth your time to do Step #1 above.

Getting EPA Air Quality Data

The EPA has air quality data going back to 1990 (so, 27 files as of this post). They’re ~1-4MB ZIP compressed and ~10-30MB uncompressed.

You can use the following code to grab them all, with the caveat that the libcurl method of performing simultaneous downloads caused some pretty severe issues — like R crashing — for some of my students who use Windows. There are plenty of examples of doing sequential downloads of a list of URLs out there, so folks should be able to get all the files even if this succinct method does not work on their platform.

```r
dir.create("airq")

urls <- sprintf("https://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/annual_all_%d.zip",
                1990L:2016L)
fils <- sprintf("airq/%s", basename(urls))

download.file(urls, fils, method = "libcurl")
```

I normally shy away from this particular method since it really hammers the remote server, but this is a beefy U.S. government server, the files are relatively small in number and size and I’ve got a super-fast internet connection (no long-lived sockets) so it should be fine.

Putting all those files under the “control” of Drill is what the next post is for. For now, I’m going to show the basic code and benchmarks for reading in all those files and performing a basic query for all the distinct years. Yes, we know that information already, but it’s a nice, compact task that’s easy to walk through and illustrates the file reading and querying in all three idioms: Drill, tidyverse and data.table.

Data Setup

I’ve converted the EPA annual ZIP files into bzip2 format. ZIP is fine for storage and downloads but it’s not a great format for data analysis tasks. gzip would be slightly faster but it’s not easily splittable and — even though I’m not using the data in a Hadoop context — I think it’s wiser to not have to re-process data later on if I ever had to move raw CSV or JSON data into Hadoop. Uncompressed CSVs are the most portable, but there’s no need to waste space.

All the following files are in a regular filesystem directory accessible to both Drill and R:

```r
> (epa_annual_fils <- dir("~/Data/csv/epa/annual", "*.csv.bz2"))
 [1] "annual_all_1990.csv.bz2" "annual_all_1991.csv.bz2" "annual_all_1992.csv.bz2"
 [4] "annual_all_1993.csv.bz2" "annual_all_1994.csv.bz2" "annual_all_1995.csv.bz2"
 [7] "annual_all_1996.csv.bz2" "annual_all_1997.csv.bz2" "annual_all_1998.csv.bz2"
[10] "annual_all_1999.csv.bz2" "annual_all_2000.csv.bz2" "annual_all_2001.csv.bz2"
[13] "annual_all_2002.csv.bz2" "annual_all_2003.csv.bz2" "annual_all_2004.csv.bz2"
[16] "annual_all_2005.csv.bz2" "annual_all_2006.csv.bz2" "annual_all_2007.csv.bz2"
[19] "annual_all_2008.csv.bz2" "annual_all_2009.csv.bz2" "annual_all_2010.csv.bz2"
[22] "annual_all_2011.csv.bz2" "annual_all_2012.csv.bz2" "annual_all_2013.csv.bz2"
[25] "annual_all_2014.csv.bz2" "annual_all_2015.csv.bz2" "annual_all_2016.csv.bz2"
```

Drill can directly read plain or compressed JSON, CSV and Apache web server log files plus can treat a directory tree of them as a single data source. It can also read parquet & avro files (both are used frequently in distributed “big data” setups) and access MySQL, MongoDB and other JDBC resources as well as query data stored in Amazon S3 and HDFS (I’ve already mentioned it works fine in plain ‘ol filesystems, too).

I’ve tweaked my Drill configuration to support reading column header info from .csv files (which I’ll show in the next post). In environments like Drill or even Spark, CSV columns are usually queried with some type of column index (e.g. COLUMN[0]) so having named columns makes for less verbose query code.

I turned those individual bzip2 files into parquet format with one Drill query:

```sql
CREATE TABLE dfs.pq.`/epa/annual.parquet` AS
SELECT * FROM dfs.csv.`/epa/annual/*.csv.bz2`
```

Future posts will explain the dfs... component, but these are likely familiar path specifications for folks used to Spark, and they are pretty straightforward. The first bit (up to the back-tick) is an internal Drill shortcut to the actual storage path (which is a plain directory in this test), followed by the tail-end path spec to the subdirectories and/or target files. That one statement says “take all the CSV files in that directory and make one big table out of them”.

The nice thing about parquet files is that they work much like R data frames in that they can be processed on the column level. We’ll see how that speeds up things in a bit.

Benchmark Setup

The tests were performed on a maxed out 2016 13″ MacBook Pro.

There are 55 columns of data in the EPA annual summary files.

To give both read_csv and fread some benchmark boosts, we’ll define the columns up-front and pass them to each function on data ingestion; I’ll leave the definitions out of this post for brevity (they’re just a cols() specification and a colClasses vector). Drill gets no similar help, at least when it comes to CSV processing.

I’m also disabling progress & verbose reporting in both fread and read_csv, though I’m not stopping Drill from writing out its log messages.

Now, we need some setup code to connect to Drill and read in the list of files, plus we’ll set up the five benchmark functions that read in all the files and get the list of distinct years from each.

```r
library(sergeant)
library(data.table)
library(tidyverse)
library(microbenchmark)

(epa_annual_fils <- dir("~/Data/csv/epa/annual", "*.csv.bz2", full.names = TRUE))

db <- src_drill("localhost")

# Remember, defining ct & ct_dt - the column type specifications -
# has been left out for brevity

mb_drill_csv <- function() {
  epa_annual <- tbl(db, "dfs.csv.`/epa/annual/*.csv.bz2`")
  select(epa_annual, Year) %>%
    distinct(Year) %>%
    collect()
}

mb_drill_parquet <- function() {
  epa_annual_pq <- tbl(db, "dfs.pq.`/epa/annual.parquet`")
  select(epa_annual_pq, Year) %>%
    distinct(Year) %>%
    collect()
}

mb_tidyverse <- function() {
  map_df(epa_annual_fils, read_csv, col_types = ct, progress = FALSE) -> tmp
  unique(tmp$Year)
}

mb_datatable <- function() {
  rbindlist(
    lapply(epa_annual_fils, function(x) {
      fread(sprintf("bzip2 -c -d %s", x),
            colClasses = ct_dt, showProgress = FALSE, verbose = FALSE)
    })) -> tmp
  unique(tmp$Year)
}

mb_rda <- function() {
  read_rds("~/Data/rds/epa/annual.rds") -> tmp
  unique(tmp$Year)
}

microbenchmark(
  csv = { mb_drill_csv() },
  pq  = { mb_drill_parquet() },
  df  = { mb_tidyverse() },
  dt  = { mb_datatable() },
  rda = { mb_rda() },
  times = 5
) -> mb
```

Yep, it’s really as simple as ``tbl(db, "dfs.csv.`/epa/annual/*.csv.bz2`")`` to have Drill treat a directory tree as a single table. It’s also not necessary for all the columns to be in all the files (i.e. you get the bind_rows/map_df/rbindlist behaviour for “free”).

I’m only doing 5 evaluations here since I don’t want to make you wait if you’re going to try this at home now or after the Drill series. I’ve run it with a more robust benchmark configuration and the results are aligned with this one.
```
Unit: milliseconds
 expr        min         lq       mean     median         uq        max neval
  csv 15473.5576 16851.0985 18445.3905 19586.1893 20087.1620 20228.9450     5
   pq   493.7779   513.3704   616.2634   550.5374   732.6553   790.9759     5
   df 41666.1929 42361.1423 42701.2682 42661.9521 43110.3041 43706.7498     5
   dt 37500.9351 40286.2837 41509.0078 42600.9916 43105.3040 44051.5247     5
  rda  9466.6506  9551.7312 10012.8560  9562.9114  9881.8351 11601.1517     5
```

The R data route, which is the closest to the parquet route, is definitely better than slurping up CSVs all the time. Both parquet and R data files require pre-processing, so they’re not as flexible as having individual CSVs (that may get added hourly or daily to a directory).

Drill’s CSV slurping handily beats the other R methods, even with some handicaps the others did not have.

This particular example is gamed a bit, which helped parquet to ultimately “win”. Since Drill can target the singular column (Year) that was asked for, it doesn’t need to read all the extra columns just to compute the final product (the distinct list of years).

IMO both the Drill CSV ingestion and the Drill parquet access provide compelling enough use-cases to choose them over the other three methods, especially since they are easily transferrable to remote Drill servers or clusters with virtually no code changes. A single-node Drillbit (like R) is constrained by the memory on that individual system, so it’s not going to get you out of a memory jam, but it may make it easier to organize and streamline your core data operations before other analysis and visualization tasks.

FIN

I’m sure some member of some other tribe will come up with an example that proves the superiority of their particular tribal computations. I’m hoping one of those tribes is the R/Spark tribe, so that Spark can get added into the mix (using Spark standalone is much like using Drill, but with more stats/ML functions directly available).
I’m hopeful that this post has showcased enough of Drill’s utility to general R users that you’ll give it a go and consider adding it to your R data analysis toolbox. It can be beneficial to have both precision tools and a Swiss Army knife — which is what Drill really is — handy. You can find the sergeant package on GitHub.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

### GSoC 2017 : Parser for Biodiversity Checklists

Wed, 05/31/2017 - 11:24

(This article was first published on Vijay Barve, and kindly contributed to R-bloggers)

Guest post by Qingyue Xu

Compiling taxonomic checklists from varied sources of data is a common task that biodiversity informaticians encounter. In the GSoC 2017 project Parser for Biodiversity Checklists, my overall goal is to extract taxonomic names from given text into a tabular format, so that biodiversity data can easily be aggregated into a structured form for further processing.

I mainly plan to build three major functions which serve different purposes and take various sources of text into account. However, before building the functions, we first need to identify and cover as many different formats of scientific names as possible. The inconsistencies of scientific names make things complicated for us. The common rules for scientific names follow the order of:

genus, [species], [subspecies], [author, year], [location]

Many components are optional, and some components, like author and location, can be one or even more.
Therefore, when we’re parsing the text, we need to analyze its structure, match it against all possible patterns of scientific names, and identify the most likely one. To resolve the problem more accurately, we can even draw on NLTK (Natural Language Toolkit) packages to identify “PERSON” and “LOCATION” entities, so that we can analyze the components of scientific names more efficiently.

Function: find_taxoname(url/input_file, output_file)

• Objective: This function searches for scientific names in supplied texts, especially in situations where the text is not well structured.
• Parameters: The first parameter is the URL of a web page (HTML based) or the file path of a PDF/TXT file, which is the source text to search for taxonomic names. The second parameter is the file path of the output file, which will be in a tabular format with columns for genus, species, subspecies, author and year.
• Approach: Since this function is intended for unstructured text, we can’t rely on a fixed pattern to parse the taxonomic names. What we can do is utilize existing standard dictionaries of scientific names to locate the genus, species and subspecies. By analyzing the surrounding structure and patterns, we can find the corresponding genus, species, subspecies, author and year, if they exist, and output the findings in a tabular format.

Function: parse_taxolist(input_file, filetype, sep, output_file, location)

• Objective: This function parses and extracts taxonomic names from a given structured text file, in which each row must contain exactly one entry of a scientific name. If location information is given, the function can also list and return the exact location (latitude and longitude) of the species. The output is also in a tabular format, with columns for genus, species, subspecies, author(s), year(s), location(s), latitude(s) and longitude(s).
• Parameters: The first parameter is the file path of the input file and the second is the file type, which supports TXT, PDF and CSV files. The third parameter, ‘sep’, indicates the separator used in the input file to separate the words in each row. The fourth parameter is the intended file path of the output file. The last parameter is a Boolean indicating whether the input file contains location information; if ‘true’, the output will contain detailed location information.
• Approach: The function will parse the input file, based on rows and the given separators, into a well-organized tabular format. An extended feature is to pinpoint the exact location of the species if the related information is given. With location info such as “Mirik, West Bengal, India”, the function will return the exact latitude and longitude of this location as 26°53’7.07″N and 88°10’58.06″E. This can be realized by crawling the pages of https://www.distancesto.com/coordinates or by using the Google Maps API. This is also a possible way to identify whether a piece of text represents a location: if we cannot get an exact latitude and longitude, it’s not a location. If a scientific name doesn’t contain location information, the function will return a NULL value for the location part. If it contains multiple locations, the function will return multiple values as a list, along with the latitudes and longitudes.

Function: recursive_crawler(url, htmlnode_taxo, htmlnode_next, num, output_file, location)

• Objective: This function is intended to recursively crawl web pages containing information about taxonomic names. The start URL must be given, and the html_node of the scientific names should also be indicated. If the text contains location info, the output will also include the detailed latitude and longitude.
• Parameters: The first parameter is the start URL, and the following web pages must share the structure of the first page. The second parameter is the html_node of the taxonomic names, such as “.SP .SN > li”. (There are a lot of tools to help users identify the HTML node for certain content.) The third parameter is the html_node of the next-page link, which leads to the page for another genus. The fourth parameter, ‘num’, is the number of web pages the user wants to crawl; if ‘num’ is not given, the function will crawl until htmlnode_next no longer returns a valid URL. The last two parameters are the same as in the two functions above.
• Approach: The parsing and location lookup work as in the functions above. As for the crawling, for a series of structured web pages we can parse out valid scientific names based on the given HTML nodes. Since the HTML node for the next page is also given, we can always obtain the URL of the next page by extracting it from the source code. For example, the web page we used provides a link which leads to the next page. By recursively fetching the info from the current page and jumping to the following ones, we can output a well-organized tabular file covering all the visited pages.

Other possible functionality

Since inconsistencies might exist in the format of scientific names, I also need to construct a function to normalize them. The complication usually lies in the author part, and there are two possible approaches. The first is to keep analyzing the structure of the scientific name and try to capture as many exceptions as possible, such as author names with multiple parts or names with two authors. The second is to draw on the NLTK package to identify possible PERSON names.
However, when it gets too complicated, the parsing result won’t always be accurate. Therefore, we can add a parameter to indicate how reliable the result is, and signal the need for further manual processing when the parser cannot work reliably.

To leave a comment for the author, please follow the link and comment on their blog: Vijay Barve.

### Getting data for every Census tract in the US with purrr and tidycensus

Wed, 05/31/2017 - 10:00

(This article was first published on KYLE WALKER DATA, and kindly contributed to R-bloggers)

Interested in more tips on working with Census data? Click here to join my email list!

Last week, I published the development version of my new R package, tidycensus. You can read through the documentation and some examples at https://walkerke.github.io/tidycensus/. I’m working on getting the package CRAN-ready with better error handling; in the meantime, I’m sharing a few examples to demonstrate its functionality.

If you are working on a national project that includes demographic data as a component, you might be interested in acquiring Census tract data for the entire United States. However, Census tract data are commonly available by state (with the exception of NHGIS, which is a wonderful resource), meaning that an analyst would have to spend time piecing the data together. tidycensus solves this problem directly within R with help from the purrr package, a member of the tidyverse.

In tidycensus, there is a built-in data frame named fips_codes that includes US state and county IDs; tidycensus uses this data frame to handle translations between state/county names and FIPS codes.
However, this data frame can also be used to generate a vector of state codes to be fed to the map_df function in purrr. As such, this is all it takes to get a tibble of total population estimates for all US Census tracts from the 2011-2015 ACS:

```r
library(tidycensus)
library(purrr)

# Un-comment below and set your API key
# census_api_key("YOUR KEY GOES HERE")

us <- unique(fips_codes$state)[1:51]

totalpop <- map_df(us, function(x) {
  get_acs(geography = "tract", variables = "B01003_001", state = x)
})

str(totalpop)
## Classes 'tbl_df', 'tbl' and 'data.frame': 73056 obs. of 5 variables:
##  $ GEOID   : chr "01001020100" "01001020200" "01001020300" "01001020400" ...
##  $ NAME    : chr "Census Tract 201, Autauga County, Alabama" "Census Tract 202, Autauga County, Alabama" "Census Tract 203, Autauga County, Alabama" "Census Tract 204, Autauga County, Alabama" ...
##  $ variable: chr "B01003_001" "B01003_001" "B01003_001" "B01003_001" ...
##  $ estimate: num 1948 2156 2968 4423 10763 ...
##  $ moe     : num 203 268 404 493 624 478 436 281 1000 535 ...
```

Get any ACS or decennial Census data in this way. However – what if you also want tract geometry for mapping? This only requires a few small modifications. map_df in purrr uses the bind_rows function under the hood, which doesn’t work with simple features objects (yet). However, sf does have an rbind method that works for sf objects and can be fed to purrr’s reduce function.

```r
library(sf)

options(tigris_use_cache = TRUE)

totalpop_sf <- reduce(
  map(us, function(x) {
    get_acs(geography = "tract", variables = "B01003_001",
            state = x, geometry = TRUE)
  }),
  rbind
)

str(totalpop_sf)
## Classes 'sf' and 'data.frame': 72843 obs. of 6 variables:
##  $ GEOID   : chr "01003010500" "01003011501" "01009050500" "01015981901" ...
##  $ NAME    : chr "Census Tract 105, Baldwin County, Alabama" "Census Tract 115.01, Baldwin County, Alabama" "Census Tract 505, Blount County, Alabama" "Census Tract 9819.01, Calhoun County, Alabama" ...
##  $ variable: chr "B01003_001" "B01003_001" "B01003_001" "B01003_001" ...
##  $ estimate: num 5321 5771 7007 4 1607 ...
##  $ moe     : num 452 825 556 6 235 309 506 386 425 310 ...
##  $ geometry:sfc_GEOMETRY of length 72843; first list element: List of 1
##   ..$ :List of 1
##   .. ..$ : num [1:55, 1:2] -87.8 -87.8 -87.8 -87.8 -87.8 ...
##   ..- attr(*, "class")= chr "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA
##   ..- attr(*, "names")= chr "GEOID" "NAME" "variable" "estimate" ...
##  - attr(*, "sf_column")= chr "geometry"
```

By declaring geometry = TRUE, tidycensus fetches tract feature geometry using the tigris package and merges it to the ACS data automatically for you. I recommend using the caching feature in the tigris package if you plan to use this workflow multiple times. You might note the discrepancy in tract counts between the geometry-enabled and regular data frames; this is due to the removal of some water-only tracts in the cartographic boundary shapefiles used by tidycensus.

To leave a comment for the author, please follow the link and comment on their blog: KYLE WALKER DATA.

### Training dates for Marketing Analytics course announced

Wed, 05/31/2017 - 01:00

(This article was first published on CYBAEA on R, and kindly contributed to R-bloggers)

Classes 3-7 July 2017: We have updated our very popular training course to take advantage of the latest Microsoft Advanced Analytics technology and re-launched it as Marketing Analytics using Microsoft R and Azure. We have just announced public training dates and are looking forward to giving the course in the week 3-7 July 2017.
This is a live, instructor-led course in an online classroom environment. There are class times to suit everyone: participants in the Middle East and Asia can register here for live classes 08:00-12:00 London time, while participants in North and South America can register here for classes 10:00-14:00 New York time. Please see the listings for additional time zones.

This course has consistently achieved 100% positive promoter scores, both on recommending the course and on recommending the instructor.

This intensive course gives you hands-on experience using Microsoft R and Azure, covering the key techniques for analysing customer data for Sales and Marketing. The focus is on getting to business results, and you will return to your organization with the skills you need to deliver and demonstrate commercial impact from your work. For more information and a detailed syllabus, please see the course page: Marketing Analytics using Microsoft R and Azure.

We are offering the course in partnership with Microsoft (following their acquisition of Revolutions); you can see our listing on Microsoft’s Learn Analytics portal. Hope to see you there! If you have any questions, just contact us.

To leave a comment for the author, please follow the link and comment on their blog: CYBAEA on R.

### Data Science Podcasts

Wed, 05/31/2017 - 00:00

(This article was first published on Jon Calder's R Blog, and kindly contributed to R-bloggers)

Make the most of your commute! Podcasts are awesome, especially when you’re stuck in traffic on the way to work. Below are some podcasts I listen to that relate to data science and statistics.
Each of them has something slightly different to offer, so if this is an area of interest to you then I recommend you give these a try!

Not So Standard Deviations: Roger Peng and Hilary Parker talk about the latest in data science and data analysis in academia and industry.

Data Skeptic: Data Skeptic is your source for a perspective of scientific skepticism on topics in statistics, machine learning, big data, artificial intelligence, and data science.

More or Less: Behind the Stats: Tim Harford and the More or Less team from BBC Radio 4 try to make sense of the statistics that surround us.

The R-Podcast: Giving practical advice on how to use R for powerful and innovative data analyses. The host of the R-Podcast is Eric Nantz, a statistician working in the life sciences industry who has been using R since 2004.

Partially Derivative: Hosted by Jonathon, Vidya, and Chris, Partially Derivative is a podcast about data science in the world around us. Episodes are a mix of explorations into the techniques used in data science and discussions with the field’s leading experts.

Linear Digressions: Hosts Katie Malone and Ben Jaffe explore machine learning and data science through interesting (and often very unusual) applications.

Are there other data science podcasts missing from this list that you can recommend? Feel free to comment below and let me know!

To leave a comment for the author, please follow the link and comment on their blog: Jon Calder's R Blog.
### Reflections on ROpenSci Unconference 2017

Tue, 05/30/2017 - 23:37

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Last week I attended the ROpenSci Unconference in Los Angeles, and it was fantastic. Now in its fourth year, the ROpenSci team brought together a talented and diverse group of about 70 R developers from around the world to work on R-related projects in an intense 2-day hackathon. Not only did everyone have a lot of fun, make new connections and learn from others, but the event also advanced the ROpenSci mission of creating packages, tools and resources to support scientific endeavours using R. During the two-day workshop, the attendees self-organized into teams of 4-8 to work on projects. There were 20 projects started at the ROpenSci conference, and all of them produced a working package. You can find details on all the projects on GitHub, but here is one example, via Twitter, to give you a sense of what the teams built:

@LucyStats @rdpeng Including an R-generated maze with an R-generated bot that can find his way out! @Inchio @revodavid @kwbroman @daroczig @alikzaidi pic.twitter.com/92SQg4xmcu — Brooke Anderson (@gbwanderson) May 30, 2017

I was very fortunate to be part of that team: we had a blast connecting R to Minecraft, like creating a procedurally-generated maze and an AI bot to navigate it. (I plan to write a full blog post about the project in due course.) But for more on #runconf17, take a look at Karthik Ram's storify, which will give you a good sense of the conference as it unfolded in tweets. I also highly recommend checking out these post-conference wrap-ups by Jasmine Dumas, Karl Broman and Bob Rudis. Thanks to the organizers for such a fantastic event, and also to the sponsors (including my own employer, Microsoft) for making it all possible. I'm looking forward to next year already!

ROpenSci: Unconference 2017

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
### Hospital Infection Scores – R Shiny App

Tue, 05/30/2017 - 20:43

(This article was first published on R-Projects – Stoltzmaniac, and kindly contributed to R-bloggers)

Medicare Data – R Shiny App

About two weeks ago I created an extremely rough version of an R Shiny application built around Medicare data. Right after publishing the blog post, I received a lot of input for improvement and help from others. Here’s a look at the latest version of the Medicare Shiny App. This utilizes data from data.gov, found here. I was traveling for two weeks and had very little time to do any work on it. After creating a GitHub repository for it, the user Ginberg played a huge role in cleaning it up and adding a lot more functionality. I found it incredible that a complete stranger would put in such effort on something like this. In fact, he isn’t even a resident of the USA – so Medicare probably isn’t on his radar as often as it is for some of us. Fantastic generosity! Ultimately, I will be looking to keep this project alive and grow it to fully utilize a lot more of the Medicare data available. The infections data set was very simple and easy to use, so I started off with it, but there are a lot more tables listed on data.gov. The purpose of this application is to allow people who don’t want to spend time digging through tables to utilize the information available. This isn’t necessarily just for people seeking care to make a decision; it could also be utilized by others doing research on hospitals in the US. The R Shiny App allows you to filter by location and infection information.
These are important in helping you actually find information on what you care about. Three key tabs were created by @Ginberg:

• Sorting hospitals by infection score
• Maps of hospitals in the area
• Data table of hospital data

Sorting hospital data by score:

• This is a tricky plot because “score” is different for each type of metric
• Higher “scores” aren’t necessarily bad because they can be swayed by more heavily populated areas (or density)
• Notice the use of plotly and its interactivity

Map of the data:

• Location of the hospitals is incredibly important if you happen to live in a certain area
• The radius of the circle allows for easy identification of places with larger “scores”
• Hovering and clicking the circle will display more information

Displaying the data in table format:

• This allows for visibility into the raw data that is populating the other tabs
• It provides the ability to download a CSV or PDF file containing the relevant information
• More information may be added, such as the address and telephone number of the hospital – depending on the direction of the app

Again, I want to give a big thanks to Ginberg for essentially creating 95% of what makes this app great! Here’s the link to the Medicare Shiny App. You can find the code on my GitHub.

To leave a comment for the author, please follow the link and comment on their blog: R-Projects – Stoltzmaniac.

### Summarizing big data in R

Tue, 05/30/2017 - 18:12

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Our next "R and big data tip" is: summarizing big data.
We always say "if you are not looking at the data, you are not doing science" – and for big data you are very dependent on summaries (as you can’t actually look at everything). Simple question: is there an easy way to summarize big data in R? The answer is: yes, but we suggest you use the replyr package to do so. Let’s set up a trivial example.

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.5.0'
library("sparklyr")
packageVersion("sparklyr")
## [1] '0.5.5'
library("replyr")
packageVersion("replyr")
## [1] '0.3.902'

sc <- sparklyr::spark_connect(version = '2.0.2', master = "local")
diris <- copy_to(sc, iris, 'diris')

The usual S3 summary() summarizes the handle, not the data.

summary(diris)
##     Length Class          Mode
## src 1      src_spark      list
## ops 3      op_base_remote list

tibble::glimpse() throws.

packageVersion("tibble")
## [1] '1.3.3'
# errors-out
glimpse(diris)
## Observations: 150
## Variables: 5
## Error in if (width[i] <= max_width[i]) next: missing value where TRUE/FALSE needed

broom::glance() throws.

packageVersion("broom")
## [1] '0.4.2'
broom::glance(diris)
## Error: glance doesn't know how to deal with data of class tbl_sparktbl_sqltbl_lazytbl

replyr_summary() works, and returns results in a data.frame.

replyr_summary(diris) %>%
  select(-nunique, -index, -nrows)
##         column     class nna min max     mean        sd lexmin    lexmax
## 1 Sepal_Length   numeric   0 4.3 7.9 5.843333 0.8280661
## 2  Sepal_Width   numeric   0 2.0 4.4 3.057333 0.4358663
## 3 Petal_Length   numeric   0 1.0 6.9 3.758000 1.7652982
## 4  Petal_Width   numeric   0 0.1 2.5 1.199333 0.7622377
## 5      Species character   0  NA  NA       NA        NA setosa virginica

sparklyr::spark_disconnect(sc)
rm(list = ls())
gc()
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 762515 40.8    1442291 77.1  1168576 62.5
## Vcells 1394407 10.7   2552219 19.5  1820135 13.9

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.
### New package polypoly (helper functions for orthogonal polynomials)

Tue, 05/30/2017 - 07:00

(This article was first published on Higher Order Functions, and kindly contributed to R-bloggers)

Last week, I released a new package called polypoly to CRAN. It wraps up some common tasks for dealing with orthogonal polynomials into a single package. The README shows off the main functionality, as well as the neat “logo” I made for the package. In this post, I use the package on some word recognition data.

Demo: Growth curve analysis

I primarily use orthogonal polynomials to model data from eyetracking experiments where growth curves describe how the probability of looking at an image changes as the image is named. The analysis technique, including orthogonal polynomials and mixed effects models of eyetracking data, is described in Mirman’s 2014 book. In our 2015 paper, toddlers saw two images on a computer screen. The objects in the images started with different consonants: for example, duck and ball. The toddlers heard sentences like “find the ball”, and we measured how their gaze location onscreen changed in response to speech. This setup is a pretty standard procedure for studying spoken word recognition. We manipulated the vowel in the word the. In the facilitating condition, the vowel has acoustic information (via anticipatory coarticulation) which would allow an adult listener to predict the upcoming consonant. In the neutral condition, the vowel provides no cues about the upcoming consonant. The scientific question is whether these kiddos can take advantage of these acoustic cues during word recognition. Here’s how the data look, both in R and in a plot.
library(ggplot2)
library(dplyr)

# The data
d
#> # A tibble: 986 x 6
#>     Subj    Condition  Time ToDistractor ToTarget Proportion
#>  1     1 facilitating   200            9        9  0.5000000
#>  2     1 facilitating   250            9       10  0.5263158
#>  3     1 facilitating   300            6       12  0.6666667
#>  4     1 facilitating   350            6       12  0.6666667
#>  5     1 facilitating   400            6       12  0.6666667
#>  6     1 facilitating   450            6       12  0.6666667
#>  7     1 facilitating   500            6       12  0.6666667
#>  8     1 facilitating   550            6       12  0.6666667
#>  9     1 facilitating   600            4       12  0.7500000
#> 10     1 facilitating   650            3       15  0.8333333
#> # ... with 976 more rows

# Helper dataframe of where to put condition labels on the next plot
df_labs <- data_frame(
  Time = c(650, 800),
  Proportion = c(.775, .625),
  Condition = c("facilitating", "neutral"))

p <- ggplot(d) +
  aes(x = Time, y = Proportion, color = Condition) +
  geom_hline(yintercept = .5, size = 2, color = "white") +
  stat_summary(fun.data = mean_se) +
  geom_text(aes(label = Condition), data = df_labs, size = 6) +
  labs(x = "Time after noun onset [ms]",
       y = "Proportion looks to named image",
       caption = "Mean ± SE. N = 29 children.") +
  guides(color = "none")
p

Early on, children look equal amounts to both images on average (.5), and the proportion of looks to the named image increases as the word unfolds. In the facilitating condition, that rise happens earlier. We fit a mixed-effects logistic regression model to estimate how the probability of looking to the named image changes over time, across conditions, and within children. We use cubic orthogonal polynomials to represent Time. For each time point, we have three predictors available to us: Time^1, Time^2, and Time^3. (Plus, there’s a constant “intercept” term.) Our model’s growth curve will be a weighted combination of these polynomial curves. The code below shows off about half the functionality of the package :bowtie::

poly(unique(d$Time), 3) %>%
  # Force Time^1 term to range from -.5 to .5. Rescale others accordingly.
  polypoly::poly_rescale(scale_width = 1) %>%
  polypoly::poly_plot()

I think people sometimes describe the contributions of these curves to the
overall growth curve as trends: “A negative linear trend”, “a significant
quadratic trend”, etc. I like that word because it makes the terminology a
little less intimidating.

Quick aside: Why orthogonal polynomials?

Why do we use orthogonal polynomial terms? First, note that the simple polynomials
x, x^2 and x^3 are correlated. Orthogonal ones are not
correlated. (Hence, the name.)

# Simple
poly(1:10, 3, raw = TRUE) %>% cor() %>% round(2)
#>      1    2    3
#> 1 1.00 0.97 0.93
#> 2 0.97 1.00 0.99
#> 3 0.93 0.99 1.00

# Orthogonal
poly(1:10, 3, raw = FALSE) %>% cor() %>% round(2)
#>   1 2 3
#> 1 1 0 0
#> 2 0 1 0
#> 3 0 0 1

Adding new correlated predictors to a model is a problem. The parameter
estimates will change as different predictors are added. Here we simulate some
fake data, and fit three models with 1-, 2- and 3-degree raw polynomials.

x <- 1:10
y <- x +
  rnorm(1, mean = 100) * (x) +
  rnorm(1, mean = 0, sd = .01) * (x) ^ 2 +
  rnorm(1, mean = -1) * (x) ^ 3 +
  rnorm(10)

models <- list(
  m1 = lm(y ~ x),
  m2 = lm(y ~ x + I(x^2)),
  m3 = lm(y ~ x + I(x^2) + I(x^3))
)

As expected, the estimates for the effects change from model to model:

models %>%
  lapply(broom::tidy) %>%
  bind_rows(.id = "model") %>%
  select(model:estimate) %>%
  mutate(estimate = round(estimate, 2))
#>   model        term estimate
#> 1    m1 (Intercept)    75.69
#> 2    m1           x    72.91
#> 3    m2 (Intercept)   -23.91
#> 4    m2           x   122.72
#> 5    m2      I(x^2)    -4.53
#> 6    m3 (Intercept)     1.15
#> 7    m3           x   100.48
#> 8    m3      I(x^2)     0.29
#> 9    m3      I(x^3)    -0.29

But with orthogonal polynomials, the parameter estimates don’t change from model
to model.

models2 <- list(
  m1 = lm(y ~ poly(x, 1)),
  m2 = lm(y ~ poly(x, 2)),
  m3 = lm(y ~ poly(x, 3))
)

models2 %>%
  lapply(broom::tidy) %>%
  bind_rows(.id = "model") %>%
  select(model:estimate) %>%
  mutate(estimate = round(estimate, 2))
#>   model        term estimate
#> 1    m1 (Intercept)   476.72
#> 2    m1  poly(x, 1)   662.27
#> 3    m2 (Intercept)   476.72
#> 4    m2 poly(x, 2)1   662.27
#> 5    m2 poly(x, 2)2  -104.03
#> 6    m3 (Intercept)   476.72
#> 7    m3 poly(x, 3)1   662.27
#> 8    m3 poly(x, 3)2  -104.03
#> 9    m3 poly(x, 3)3   -16.24

That’s probably the simplest reason why orthogonal polynomials are preferred. (I
can’t remember any others right now.)
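One way to see the orthogonality directly (a quick base-R sketch, not part of the polypoly package) is that the columns returned by poly() have zero pairwise dot products, so their cross-product matrix is diagonal:

```r
# Columns of an orthogonal polynomial basis are orthonormal:
# the cross-product matrix t(X) %*% X is the identity matrix
# (up to floating-point noise), so no column "overlaps" another.
X <- poly(1:10, 3)
round(crossprod(X), 10)
```

This is the same fact the cor() example above shows, just stated in terms of dot products rather than correlations.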

Back to the data

Before fitting the model, I use poly_add_columns() to add polynomial terms as
columns to the dataframe. (For speed here, I use a simplified random effects
structure, estimating growth curve parameters for each Child x Condition
combination.)

library(lme4)
#> Loading required package: Matrix
#> Loading required package: methods

d <- d %>%
  polypoly::poly_add_columns(Time, degree = 3, prefix = "ot", scale_width = 1) %>%
  # Change the reference level
  mutate(Condition = factor(Condition, c("neutral", "facilitating")))

m <- glmer(
  cbind(ToTarget, ToDistractor) ~
    (ot1 + ot2 + ot3) * Condition +
    (ot1 + ot2 + ot3 | Subj:Condition),
  family = binomial,
  data = d)

We can confirm that the model captures the overall shape of the growth curves.

# The lines here are not quite the overall average, but the averages of 29
# individual fits (for each participant). That's why the caption is a little
# weird.
p +
  stat_summary(aes(y = fitted(m)), fun.y = mean, geom = "line") +
  labs(caption = "Line: Average of model-fitted values. Points: Mean ± SE.")

We can inspect the model summary as well.

arm::display(m)
#> glmer(formula = cbind(ToTarget, ToDistractor) ~ (ot1 + ot2 +
#>     ot3) * Condition + (ot1 + ot2 + ot3 | Subj:Condition), data = d,
#>     family = binomial)
#>                           coef.est coef.se
#> (Intercept)                0.47     0.10
#> ot1                        1.57     0.28
#> ot2                        0.45     0.11
#> ot3                       -0.34     0.09
#> Conditionfacilitating      0.23     0.14
#> ot1:Conditionfacilitating  0.45     0.39
#> ot2:Conditionfacilitating -0.44     0.16
#> ot3:Conditionfacilitating  0.11     0.13
#>
#> Error terms:
#>  Groups         Name        Std.Dev. Corr
#>  Subj:Condition (Intercept) 0.53
#>                 ot1         1.46      0.23
#>                 ot2         0.52     -0.05  0.31
#>                 ot3         0.39     -0.08 -0.64  0.09
#>  Residual                   1.00
#> ---
#> number of obs: 986, groups: Subj:Condition, 58
#> AIC = 4788.2, DIC = -3961.1
#> deviance = 395.6

The model summary indicates a significant Condition × Time^2
interaction, but really, only the intercept and Time^1 can ever be
interpreted directly. To understand the model fit, we visualize how each of the
polynomial terms is weighted.

Here we create a matrix of the polynomial terms plus a column of ones for the
intercept.

time_mat <- poly(sort(unique(d$Time)), 3) %>%
  polypoly::poly_rescale(1) %>%
  cbind(constant = 1, .)
round(time_mat, 2)
#>       constant     1     2     3
#>  [1,]        1 -0.50  0.57 -0.57
#>  [2,]        1 -0.44  0.36 -0.14
#>  [3,]        1 -0.37  0.17  0.14
#>  [4,]        1 -0.31  0.01  0.30
#>  [5,]        1 -0.25 -0.11  0.36
#>  [6,]        1 -0.19 -0.22  0.34
#>  [7,]        1 -0.12 -0.29  0.26
#>  [8,]        1 -0.06 -0.33  0.14
#>  [9,]        1  0.00 -0.34  0.00
#> [10,]        1  0.06 -0.33 -0.14
#> [11,]        1  0.12 -0.29 -0.26
#> [12,]        1  0.19 -0.22 -0.34
#> [13,]        1  0.25 -0.11 -0.36
#> [14,]        1  0.31  0.01 -0.30
#> [15,]        1  0.37  0.17 -0.14
#> [16,]        1  0.44  0.36  0.14
#> [17,]        1  0.50  0.57  0.57

To compute the weighted values, we multiply by a diagonal matrix of the coefficients.

neut_coefs <- fixef(m)[1:4]
faci_coefs <- neut_coefs + fixef(m)[5:8]
faci_coefs
#> (Intercept)         ot1         ot2         ot3
#> 0.699926322 2.014186150 0.006646146 -0.226658408

set_colnames <- `colnames<-`

m_neut <- time_mat %*% diag(neut_coefs) %>%
  set_colnames(c("constant", "ot1", "ot2", "ot3"))
m_faci <- time_mat %*% diag(faci_coefs) %>%
  set_colnames(c("constant", "ot1", "ot2", "ot3"))

# Convince ourselves with an example
round(m_faci, 2)
#>       constant   ot1 ot2   ot3
#>  [1,]      0.7 -1.01   0  0.13
#>  [2,]      0.7 -0.88   0  0.03
#>  [3,]      0.7 -0.76   0 -0.03
#>  [4,]      0.7 -0.63   0 -0.07
#>  [5,]      0.7 -0.50   0 -0.08
#>  [6,]      0.7 -0.38   0 -0.08
#>  [7,]      0.7 -0.25   0 -0.06
#>  [8,]      0.7 -0.13   0 -0.03
#>  [9,]      0.7  0.00   0  0.00
#> [10,]      0.7  0.13   0  0.03
#> [11,]      0.7  0.25   0  0.06
#> [12,]      0.7  0.38   0  0.08
#> [13,]      0.7  0.50   0  0.08
#> [14,]      0.7  0.63   0  0.07
#> [15,]      0.7  0.76   0  0.03
#> [16,]      0.7  0.88   0 -0.03
#> [17,]      0.7  1.01   0 -0.13

Then, we can use the poly_melt() function to get a dataframe from each weighted matrix and then plot each of the effects.
df_neut <- m_neut %>%
  polypoly::poly_melt() %>%
  tibble::add_column(Condition = "neutral")
df_faci <- m_faci %>%
  polypoly::poly_melt() %>%
  tibble::add_column(Condition = "facilitating")

df_both <- bind_rows(df_faci, df_neut) %>%
  mutate(Condition = factor(Condition, c("neutral", "facilitating")))

ggplot(df_both) +
  aes(x = observation, y = value, color = Condition) +
  geom_line() +
  facet_wrap("degree")

Visually, the quadratic effect on the neutral curve pulls down the values during the center (when the curves are most different) and pushes the values in the tails upwards (when the curves are closest). Although only the quadratic effect is nominally significant, the constant and linear terms suggest other, smaller effects, but they are too noisy to pin down. It’s worth noting that the predictors and weights discussed above are on the log-odds/logit scale used inside of the model, rather than the proportion scale used in the plots of the data and model fits. Basically, these weighted values are summed together and then squeezed into the range [0, 1] with a nonlinear transformation. For these data, the two scales produce similar-looking growth curves, but you can notice that the right ends of the curves are pinched slightly closer together in the probability-scale plot:

ggplot(df_both) +
  aes(x = observation, y = value, color = Condition) +
  stat_summary(fun.y = sum, geom = "line") +
  ggtitle("logit scale") +
  guides(color = "none")

ggplot(df_both) +
  aes(x = observation, y = value, color = Condition) +
  stat_summary(fun.y = function(xs) plogis(sum(xs)), geom = "line") +
  ggtitle("probability scale") +
  guides(color = "none")

To leave a comment for the author, please follow the link and comment on their blog: Higher Order Functions.
### Mixed models for ANOVA designs with one observation per unit of observation and cell of the design

Mon, 05/29/2017 - 22:34

(This article was first published on Computational Psychology - Henrik Singmann, and kindly contributed to R-bloggers)

Together with David Kellen I am currently working on an introductory chapter on mixed models for a book edited by Dan Spieler and Eric Schumacher (the current version can be found here). The goal is to provide a theoretical and practical introduction that is targeted mainly at experimental psychologists, neuroscientists, and others working with experimental designs and human data. The practical part focuses, obviously, on R, specifically on lme4 and afex. One part of the chapter was supposed to deal with designs that cannot be estimated with the maximal random effects structure justified by the design, because there is only one observation per participant and cell of the design. Such designs are the classical repeated-measures ANOVA designs, as ANOVA cannot deal with replicates at the cell level (i.e., those are usually aggregated to yield one observation per cell and unit of observation). Based on my previous thoughts, which turned out to be wrong, we wrote the following:

Random Effects Structures for Traditional ANOVA Designs

The estimation of the maximal model is not possible when there is only one observation per participant and cell of a repeated-measures design. These designs are typically analyzed using a repeated-measures ANOVA. Currently, there are no clear guidelines on how to proceed in such situations, but we will try to provide some advice.
If there is only a single random effects grouping factor, for example participants, we feel that instead of a mixed model, it is appropriate to use a standard repeated-measures ANOVA that addresses sphericity violations via the Greenhouse-Geisser correction. One alternative strategy that employs mixed models, and that we *do not recommend*, consists of using the random-intercept-only model or removing the random slopes for the highest within-subject interaction. The resulting model assumes invariance of the omitted random effects across participants. If this assumption is violated, such a model produces results that cannot be trusted. […]

Fortunately, we asked Jake Westfall to take a look at the chapter and Jake responded:

I don’t think I agree with this. In the situation you describe, where we have a single random factor in a balanced ANOVA-like design with 1 observation per unit per cell, personally I am a proponent of the omit-the-highest-level-random-interaction approach. In this kind of design, the random slopes for the highest-level interaction are perfectly confounded with the trial-level error term (in more technical language, the model is only identifiable up to the sum of these two variance components), which is what causes the identifiability problems when one tries to estimate the full maximal model there. (You know all of this of course.) So two equivalent ways to make the model identifiable are to (1) omit the error term, i.e., force the residual variance to be 0, or (2) omit the random slopes for the highest-level interaction. Both of these approaches should (AFAIK) result in a statistically equivalent model, but lme4 does not provide an easy way to do (1), so I generally recommend (2). The important point here is that the standard errors should still be correct in either case — because these two variance components are confounded, omitting e.g.
the random interaction slopes simply causes that omitted variance component to be implicitly added to the residual variance, where it is still incorporated into the standard errors of the fixed effects in the appropriate way (because the standard error of the fixed interaction looks roughly like sqrt[(var_error + var_interaction)/n_subjects]). I think one could pretty easily put together a little simulation that would demonstrate this.

Hmm, that sounds very reasonable, but can my intuition on the random effects structure and mixed models really be that wrong? To investigate this I followed Jake’s advice and coded a short simulation that tested this, and as it turns out, Jake is right and I was wrong. In the simulation we will simulate a simple one-factor repeated-measures design with one factor with three levels. Importantly, each unit of observation will only have one observation per factor level. We will then fit this simulated data with both a repeated-measures ANOVA and a random-intercept-only mixed model and compare their p-values. Note again that for such a design we cannot estimate random slopes for the condition effect. First, we need a few packages and set some parameters for our simulation:

require(afex)
set_sum_contrasts() # for orthogonal sum-to-zero contrasts
require(MASS)

NSIM <- 1e4   # number of simulated data sets
NPAR <- 30    # number of participants per cell
NCELLS <- 3   # number of cells (i.e., groups)

Now we need to generate the data. For this I employed an approach that is clearly not the most parsimonious, but most clearly follows the formulation of a mixed model that has random variability in the condition effect and, on top of this, residual variance (i.e., the two confounded factors). We first create a bare-bones data.frame with participant id and condition columns and a corresponding model.matrix.
Then we create the three random parameters (i.e., intercept and the two parameters for the three conditions) using a zero-centered multivariate normal with a specified variance-covariance matrix. We then loop over the participants and estimate the predictions deriving from the three random effects parameters. Only after this do we add uncorrelated residual variance to the observations for each simulated data set.

dat <- expand.grid(condition = factor(letters[seq_len(NCELLS)]),
                   id = factor(seq_len(NPAR)))
head(dat)
#   condition id
# 1         a  1
# 2         b  1
# 3         c  1
# 4         a  2
# 5         b  2
# 6         c  2

mm <- model.matrix(~condition, dat)
head(mm)
#   (Intercept) condition1 condition2
# 1           1          1          0
# 2           1          0          1
# 3           1         -1         -1
# 4           1          1          0
# 5           1          0          1
# 6           1         -1         -1

Sigma_c_1 <- matrix(0.6, NCELLS, NCELLS)
diag(Sigma_c_1) <- 1
d_c_1 <- replicate(NSIM, mvrnorm(NPAR, rep(0, NCELLS), Sigma_c_1), simplify = FALSE)

gen_dat <- vector("list", NSIM)
for(i in seq_len(NSIM)) {
  gen_dat[[i]] <- dat
  gen_dat[[i]]$dv <- NA_real_
  for (j in seq_len(NPAR)) {
    gen_dat[[i]][(j-1)*3+(1:3), "dv"] <- mm[1:3,] %*% d_c_1[[i]][j,]
  }
  gen_dat[[i]]$dv <- gen_dat[[i]]$dv + rnorm(nrow(mm), 0, 1)
}

Now we only need a function that estimates the ANOVA and mixed model for each data set and returns the p-values; then we loop over the simulated data sets.

## functions returning p-value for ANOVA and mixed model
within_anova <- function(data) {
  suppressWarnings(suppressMessages(
    a <- aov_ez(id = "id", dv = "dv", data, within = "condition",
                return = "univariate", anova_table = list(es = "none"))
  ))
  c(without = a[["univariate.tests"]][2,6],
    gg = a[["pval.adjustments"]][1,2],
    hf = a[["pval.adjustments"]][1,4])
}

within_mixed <- function(data) {
  suppressWarnings(
    m <- mixed(dv ~ condition + (1|id), data, progress = FALSE)
  )
  c(mixed = anova(m)$`Pr(>F)`)
}

p_c1_within <- vapply(gen_dat, within_anova, rep(0.0, 3))
m_c1_within <- vapply(gen_dat, within_mixed, 0.0)

The following graphs show the results (GG is the result using the Greenhouse-Geisser adjustment for sphericity violations).

ylim <- c(0, 700)
par(mfrow = c(1,3))
hist(p_c1_within[1,], breaks = 20, main = "ANOVA (default)", xlab = "p-value", ylim = ylim)
hist(p_c1_within[2,], breaks = 20, main = "ANOVA (GG)", xlab = "p-value", ylim = ylim)
hist(m_c1_within, breaks = 20, main = "Random-Intercept Model", xlab = "p-value", ylim = ylim)

What these graphs clearly show is that the p-value distributions for the standard repeated-measures ANOVA and the random-intercept mixed model are virtually identical. This clearly shows that my intuition was wrong and Jake was right. We also see that for both the ANOVA and the mixed model the rate of significant findings with p < .05 is slightly above the nominal level. More specifically:

mean(p_c1_within[1,] < 0.05) # ANOVA default
# [1] 0.0684
mean(p_c1_within[2,] < 0.05) # ANOVA GG
# [1] 0.0529
mean(p_c1_within[3,] < 0.05) # ANOVA HF
# [1] 0.0549
mean(m_c1_within < 0.05) # random-intercept mixed model
# [1] 0.0701

These additional results indicate that maybe one also needs to adjust the degrees of freedom for mixed models for violations of sphericity. But this is not the topic of today’s post.
To sum this up, this simulation shows that removing the highest-order random slope seems to be the right decision if one wants to use a mixed model for a design with one observation per cell of the design and participant, but wants to implement the ‘maximal random effects structure’. One more thing to note: Ben Bolker raised the same issue and pointed us to one of his example analyses of the starling data that is relevant to the current question. We are very grateful that Jake and Ben took the time to go through our chapter! You can also download the RMarkdown file of the simulation.

To leave a comment for the author, please follow the link and comment on their blog: Computational Psychology - Henrik Singmann.

### Instrumental Variables in R exercises (Part-3)

Mon, 05/29/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

This is the third part of the series on Instrumental Variables. For the other parts of the series, follow the tag instrumental variables. In this exercise set we will use the Generalized Method of Moments (GMM) estimation technique with the examples from part-1 and part-2. Recall that GMM estimation relies on the relevant moment conditions. For OLS we assume that predictors are uncorrelated with the error term. Similarly, IV estimation implies that the instrument(s) is uncorrelated with the error term. Answers to the exercises are available here.

Exercise 1
Load the AER package (package description: here) and the PSID1976 dataset.
This dataset has data regarding labor force participation of married women. Define a new data-frame that holds the data for all married women who were employed. As we did in part-2, this data-frame will be used for the remaining exercises. Next, load the gmm package (package description: here).

Exercise 2
We will start with a simple example. Regress log(wage) on education using the usual OLS technique. Next, use the gmm function to estimate the same model using OLS's moment conditions. Match your results and comment on the standard errors.

Exercise 3
Estimate the return to education for the model from Exercise-2 using feducation as the IV. Use both the ivreg and gmm functions and compare results.

Exercise 4
Regress log(wage) on education, experience and experience^2 using the usual OLS technique. Next, use the gmm function to estimate the same model using OLS's moment conditions. Match your results.

Exercise 5
Estimate the return to education for the model from Exercise-4 using feducation as the IV. Use both the ivreg and gmm functions and compare results.

Exercise 6
We will now use the over-identified case. Estimate the return to education for the model from Exercise-2 using feducation and meducation as IVs. Use both the ivreg and gmm functions and compare results.

Exercise 7
Estimate the return to education for the model from Exercise-4 using feducation and meducation as IVs. Use both the ivreg and gmm functions and compare results.

Exercise 8
The test of over-identifying restrictions can be obtained with the J-test (Sargan test). It is displayed by the summary and specTest functions. Do the over-identified moment conditions match the data well?

Exercise 9
Iterated estimation might offer some advantages over the default two-step method in some cases. Estimate the model from Exercise-7 using the iterative estimation technique.

Exercise 10
Use the plot function to get the graph of log(wage) and fitted values for the model in Exercise-7.
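As a rough sketch of how Exercises 1 through 3 might look (assuming the AER and gmm packages are installed, and that PSID1976 uses the column names participation, wage, education, and feducation as in the AER documentation — verify against your installation):

```r
library(AER)   # provides PSID1976 and ivreg
library(gmm)

data("PSID1976")
# Exercise 1: keep married women who were employed
working <- subset(PSID1976, participation == "yes")

# Exercise 2: OLS vs. gmm with the OLS moment conditions
# (under OLS, each regressor serves as its own instrument)
fit_ols <- lm(log(wage) ~ education, data = working)
fit_gmm <- gmm(log(wage) ~ education, x = ~ education, data = working)

# Exercise 3: father's education as the instrument
fit_iv     <- ivreg(log(wage) ~ education | feducation, data = working)
fit_gmm_iv <- gmm(log(wage) ~ education, x = ~ feducation, data = working)

summary(fit_ols)
summary(fit_gmm)
```

The point estimates from lm and gmm should match, since the moment conditions are the same; gmm, however, reports robust standard errors by default, which is presumably what the exercise asks you to comment on.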
Related exercise sets:

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

### Update to GetHFData (Version 1.3)

Mon, 05/29/2017 - 09:00

(This article was first published on R and Finance, and kindly contributed to R-bloggers)

I just posted a major update to GetHFData. Version 1.3 of GetHFData makes it possible to download and aggregate order data from Bovespa. The data comprises buy and sell orders sent by market operators. The tabular data includes the type of order (buy or sell, new/update/cancel/..), date/time of submission, priority time, prices, order quantity, and much other information.

Be aware that these are very large files. One day of buy and sell orders in the equity market is around 100 MB zipped and close to 1 GB unzipped. If your computer is not suited to store this data in its memory, it will crash.
Here’s an example of usage that will download and aggregate order data for all option contracts related to Petrobras (PETR):

```r
library(GetHFData)

first.time <- '10:00:00'
last.time  <- '17:00:00'

first.date <- '2015-08-18'
last.date  <- '2015-08-18'

type.output <- 'agg'  # aggregates data
agg.diff <- '5 min'   # interval for aggregation

my.assets <- 'PETR'         # all options related to Petrobras (partial matching)
type.matching <- 'partial'  # finds tickers from my.assets using partial matching
type.market <- 'options'    # option market
type.data <- 'orders'       # order data

df.out <- ghfd_get_HF_data(my.assets = my.assets,
                           type.data = type.data,
                           type.matching = type.matching,
                           type.market = type.market,
                           first.date = first.date,
                           last.date = last.date,
                           first.time = first.time,
                           last.time = last.time,
                           type.output = type.output,
                           agg.diff = agg.diff)
```

Minor updates:

• Fixed link to paper and citation
• It is now possible to use partial matching (e.g. use PETR for all stocks or options related to Petrobras)
• Implemented an option for only downloading files (this is helpful if you are dealing with order data and will process the files in another R session or software)
• Muted the message “Using ‘,’ as decimal …” from readr

I’m also now using Github for development. This will make it easier to handle bug reports and version control. Here’s the link.

The new version is already available on Github. Give it a try:

```r
devtools::install_github('msperlin/GetHFData')
```

The update should be in CRAN shortly.

To leave a comment for the author, please follow the link and comment on their blog: R and Finance.
### Manipulate Biological Data Using Biostrings Package Exercises (Part 2)

Sun, 05/28/2017 - 20:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Bioinformatics is an amalgamation of biology and computer science: biological data is manipulated using computers and computer software. Biological data includes DNA, RNA and proteins. DNA and RNA are made of nucleotides and carry our genetic code, while our structure and functions are carried out by proteins, which are built from amino acids.

In this exercise set we explore the relationships between DNA, RNA and protein. Conversion of DNA to RNA is known as transcription; conversion of DNA/RNA to protein is known as translation. We also discuss sequence alignment techniques. Sequence alignment compares the similarity between sequences to check how closely the DNA, RNA or protein sequences are related to each other. There are three types of sequence alignment:

1. Global alignment
2. Local alignment
3. Overlap alignment

In the exercises below we cover how to manipulate biological data using the Biostrings package from Bioconductor.

Install the package Biostrings.

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Create a DNA string and find the complement of the DNA.

Exercise 2
Create an RNA string and find the complement of the RNA.

Exercise 3
Create a DNA string and find the reverse complement of the same.

Exercise 4
Create an RNA string and find the reverse complement of the same.
Exercise 5
Create a DNA string, translate it into amino acids using the standard genetic codon table and print the three-letter codes of the amino acids.

Exercise 6
Create a DNA string, translate it into amino acids using the standard genetic codon table and print the three-letter codes of the amino acids.

Exercise 7
Create two DNA strings, align the sequences using the global alignment technique and print the score of the alignment.

Exercise 8
Create two DNA strings, align the sequences using the global alignment technique and print the score of the alignment after specifying your own scores for match and mismatch between the sequences.

Exercise 9
Create two DNA strings, align the sequences using the local alignment technique and print the score of the alignment.

Exercise 10
Create two DNA strings, align the sequences using the overlap alignment technique and print the score of the alignment.

Related exercise sets:

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

### 5 ways to measure running time of R code

Sun, 05/28/2017 - 06:09

A reviewer asked me to report detailed running times for all (so many :scream:) performed computations in one of my papers, and so I spent a Saturday morning figuring out my favorite way to benchmark R code. This is a quick summary of the options I found to be available.

A quick online search revealed at least three R packages for benchmarking R code (rbenchmark, microbenchmark, and tictoc). Additionally, base R provides at least two methods to measure the running time of R code (Sys.time and system.time).
In the following I briefly go through the syntax of using each of the five options, and present my conclusions at the end.

1. Using Sys.time

The run time of a chunk of code can be measured by taking the difference between the time at the start and at the end of the code chunk. Simple yet flexible :sunglasses:.

```r
sleep_for_a_minute <- function() { Sys.sleep(60) }

start_time <- Sys.time()
sleep_for_a_minute()
end_time <- Sys.time()

end_time - start_time
# Time difference of 1.000327 mins
```

2. Library tictoc

The functions tic and toc are used in the same manner for benchmarking as the just demonstrated Sys.time. However tictoc adds a lot more convenience to the whole.

The most recent development1 version of tictoc can be installed from github:

```r
devtools::install_github("collectivemedia/tictoc")
```

One can time a single code chunk:

```r
library(tictoc)

tic("sleeping")
print("falling asleep...")
sleep_for_a_minute()
print("...waking up")
toc()
# [1] "falling asleep..."
# [1] "...waking up"
# sleeping: 60.026 sec elapsed
```

Or nest multiple timers:

```r
tic("total")
tic("data generation")
X <- matrix(rnorm(50000*1000), 50000, 1000)
b <- sample(1:1000, 1000)
y <- runif(1) + X %*% b + rnorm(50000)
toc()
tic("model fitting")
model <- lm(y ~ X)
toc()
toc()
# data generation: 3.792 sec elapsed
# model fitting: 39.278 sec elapsed
# total: 43.071 sec elapsed
```

3. Using system.time

One can time the evaluation of an R expression using system.time. For example, we can use it to measure the execution time of the function sleep_for_a_minute (defined above) as follows.

```r
system.time({ sleep_for_a_minute() })
#   user  system elapsed
#  0.004   0.000  60.051
```

But what exactly are the reported times user, system, and elapsed? :confused:

Well, clearly elapsed is the wall clock time taken to execute the function sleep_for_a_minute, plus some benchmarking code wrapping it (that’s why it took slightly more than a minute to run, I guess).
As for user and system times, William Dunlap has posted a great explanation to the r-help mailing list:

“User CPU time” gives the CPU time spent by the current process (i.e., the current R session) and “system CPU time” gives the CPU time spent by the kernel (the operating system) on behalf of the current process. The operating system is used for things like opening files, doing input or output, starting other processes, and looking at the system clock: operations that involve resources that many processes must share. Different operating systems will have different things done by the operating system. :grinning:

4. Library rbenchmark

The documentation of the function benchmark from the rbenchmark R package describes it as “a simple wrapper around system.time”. However it adds a lot of convenience compared to bare system.time calls. For example, it requires just one benchmark call to time multiple replications of multiple expressions. Additionally, the returned results are conveniently organized in a data frame.

I installed the development1 version of the rbenchmark package from github:

```r
devtools::install_github("eddelbuettel/rbenchmark")
```

For example purposes, let’s compare the time required to compute linear regression coefficients using three alternative computational procedures:

1. lm,
2. the Moore-Penrose pseudoinverse,
3. the Moore-Penrose pseudoinverse but without explicit matrix inverses.
```r
library(rbenchmark)

benchmark("lm" = {
            X <- matrix(rnorm(1000), 100, 10)
            y <- X %*% sample(1:10, 10) + rnorm(100)
            b <- lm(y ~ X + 0)$coef
          },
          "pseudoinverse" = {
            X <- matrix(rnorm(1000), 100, 10)
            y <- X %*% sample(1:10, 10) + rnorm(100)
            b <- solve(t(X) %*% X) %*% t(X) %*% y
          },
          "linear system" = {
            X <- matrix(rnorm(1000), 100, 10)
            y <- X %*% sample(1:10, 10) + rnorm(100)
            b <- solve(t(X) %*% X, t(X) %*% y)
          },
          replications = 1000,
          columns = c("test", "replications", "elapsed",
                      "relative", "user.self", "sys.self"))

#            test replications elapsed relative user.self sys.self
# 3 linear system         1000   0.167    1.000     0.208    0.240
# 1            lm         1000   0.930    5.569     0.952    0.212
# 2 pseudoinverse         1000   0.240    1.437     0.332    0.612
```

Here, the meaning of elapsed, user.self, and sys.self is the same as described above in the section about system.time, and relative is simply the time ratio relative to the fastest test. Interestingly, lm is by far the slowest here.

5. Library microbenchmark

The most recent development version of microbenchmark can be installed from github:

```r
devtools::install_github("olafmersmann/microbenchmarkCore")
devtools::install_github("olafmersmann/microbenchmark")
```

Much like benchmark from the package rbenchmark, the function microbenchmark can be used to compare running times of multiple R code chunks. But it offers a great deal of convenience and additional functionality.

I find that one particularly nice feature of microbenchmark is the ability to automatically check the results of the benchmarked expressions with a user-specified function. This is demonstrated below, where we again compare three methods of computing the coefficient vector of a linear model.

```r
library(microbenchmark)

set.seed(2017)
n <- 10000
p <- 100
X <- matrix(rnorm(n*p), n, p)
y <- X %*% rnorm(p) + rnorm(100)

check_for_equal_coefs <- function(values) {
  tol <- 1e-12
  max_error <- max(c(abs(values[[1]] - values[[2]]),
                     abs(values[[2]] - values[[3]]),
                     abs(values[[1]] - values[[3]])))
  max_error < tol
}

mbm <- microbenchmark("lm" = { b <- lm(y ~ X + 0)$coef },
                      "pseudoinverse" = {
                        b <- solve(t(X) %*% X) %*% t(X) %*% y
                      },
                      "linear system" = {
                        b <- solve(t(X) %*% X, t(X) %*% y)
                      },
                      check = check_for_equal_coefs)

mbm
# Unit: milliseconds
#           expr      min        lq      mean    median        uq      max neval cld
#             lm 96.12717 124.43298 150.72674 135.12729 188.32154 236.4910   100   c
#  pseudoinverse 26.61816  28.81151  53.32246  30.69587  80.61303 145.0489   100  b
#  linear system 16.70331  18.58778  35.14599  19.48467  22.69537 138.6660   100 a
```

We used the function argument check to check for equality (up to a maximal error of 1e-12) of the results returned by the three methods. If the results weren’t equal, microbenchmark would return an error message.

Another great feature is the integration with ggplot2 for plotting microbenchmark results:

```r
library(ggplot2)
autoplot(mbm)
```

Conclusion

The given demonstration of the different benchmarking functions is surely not exhaustive. Nevertheless, I made some conclusions for my personal benchmarking needs:

• The Sys.time approach as well as the tictoc package can be used for timing (potentially nested) steps of a complicated algorithm (that’s often my use case). However, tictoc is more convenient, and (most importantly) foolproof.
• We saw that microbenchmark returns other types of measurements than benchmark, and I think that in most situations the microbenchmark measurements are of a higher practical significance :stuck_out_tongue:.
• To my knowledge, microbenchmark is the only benchmarking package that has visualizations built in :+1:.

For these reasons I will go with microbenchmark and tictoc. :bowtie:

1. Though the repository does not seem to be very active.
So the github version is probably no different from the stable release on CRAN.

### An Interesting Study: Exploring Mental Health Conditions in the Tech Workplace

Sun, 05/28/2017 - 02:12

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Contributed by Bo Lian. This blog post is the first class project on Exploratory Visualization & Shiny. Please check the Shiny App here.

Background and Motivation

Mental illness is a global health problem. It has critically high prevalence, yielding severe health outcomes. One in four adults suffers from a diagnosable mental illness in any given year (National Alliance on Mental Health, 2013). In the tech industry, 50% of individuals have sought treatment for mental illness, according to Finkler, 2015. This important health issue warrants further investigation.

This study, based on the 2014 Mental Health in Tech Survey from Open Sourcing Mental Illness, performs a complete data visualization and statistical analysis of mental illness and various factors in order to answer the following questions:

1. What are the strongest groups of predictors of mental health illness in the workplace?
2. What might be the causes of the high mental illness prevalence in the tech industry, other than some common assumptions?
3. How representative is the survey, and are other factors involved: geographic locations, ages, etc.?

The Data

The data used in the study is the 2014 Mental Health in Tech Survey from Open Sourcing Mental Illness. With 1259 responses, it is considered the largest survey on mental health in the tech industry. The dataset contains 27 factors that can be segmented into 5 categories, each of which can be tested with a Chi-squared test against the response variable of treatment (whether one has sought treatment for a mental health condition):

1. Demographics: age, gender, country, state etc.
2.
Mental Health Condition: treatment, work interference, family history etc.
3. Employment Background: tech, remote, employee number, size etc.
4. Organizational Policies on Mental Health: benefit, wellness program etc.
5. Openness about Mental Health: self awareness, coworkers, supervisors, openness to discuss etc.

Data Visualization and Analysis

The data is drawn from employees in 46 countries, among which US employees account for 59.2%. Most of those hail from tech-heavy states like California, New York, Washington, etc. The median age of the surveyees is 31. The distribution of ages is right-skewed, which is expected, as the tech industry tends to have younger employees. From the box plots, there is no statistically significant difference in age between the treatment and non-treatment groups.

Below are 6 of the data analysis examples by Chi-squared testing, with the scores and p-values. As expected, mental illness depends strongly on family history and work interference. If a company offers mental illness care options, that also increases the treatment rate, as does gender. What is interesting is that the data shows that some factors people may assume correlate with mental illness are not factors at all: for example, neither remote work nor working at a tech-type company correlates with mental illness, nor does the supervisor's attitude.

Results and Conclusions:

Significant factors in the Chi-squared tests include: gender, family history, work interference, benefits, care options, mental health interview etc.

Insignificant factors in the Chi-squared tests: identifiers like self_employed, remote work, number of employees, tech or not, supervisors etc. did not prove significant.

Among the results, family history, work interference and gender difference are most likely to be the main causal factors.
If a company has provided better mental health care services, the employees might be more likely to seek treatment or raise their awareness, and different genders might tend to seek treatment differently as well. Interestingly, none of the actual work conditions has any statistically significant impact on the mental illness of the employees. Humans are tougher than they think!

One thing to consider is that, as the tech industry is primarily located on the US west and east coasts or in developed countries, as shown in the map, the high living costs, competitive atmosphere, and high population density might contribute greatly to the prevalence of mental illness rather than the working conditions themselves.

Future Work

Keep exploring:

• More data visualization
• Data complexity and subjectivity, location specificity
• Integrating ongoing survey data from 2016
• Interaction among factors
• Predictions with multiple variates

To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.

### Oakland Real Estate – Full EDA

Sat, 05/27/2017 - 07:50

(This article was first published on R Analysis – Analyst at Large, and kindly contributed to R-bloggers)

Living in the Bay Area has led me to think more and more about real estate (and how amazingly expensive it is here…). I’ve signed up for trackers on Zillow and Redfin, but the data analyst in me always wants to dive deeper, to look back historically, to quantify, to visualize the trends, etc. With that in mind, here is my first view of Oakland real estate prices over the past decade. I’ll only be looking at multi-tenant units (duplexes, triplexes, etc.).
The first plot simply looks at the number of sales each month. You can clearly see a strong uptick in the number of units sold from 2003 to 2005 and the following steep decline in sales, bottoming out during the financial crisis in 2008. Interestingly, sales pick up again very quickly in 2009 and 2010 (a time when I expected to see low sales figures) before stabilizing at the current rate of ~30 properties sold per month.

The next plot shows the price distribution for multi-tenant buildings sold in Oakland. Here we can see prices rising pretty steadily right up until 2008, when the financial crisis hit (peaking with median prices around $650K). Prices were rising in 2006 and 2007 even while the number of sales was dropping rapidly. After prices drop to ~$250K, we can see that they stay low from 2009 to 2013 (even though sales picked up dramatically in 2009). This suggests that a number of investors recognized the low prices and bought up a lot of the available properties. Starting in 2013 prices begin rising, and they have continued to the present, where we are looking at record-high median prices of ~$750K.

Does month of year matter?

I’ve started doing some basic analysis on Oakland real estate prices over the past decade (multi-tenant buildings only).  There’s still a lot to unpack here, but I’m only able to investigate this 30 minutes at a time (new dad life), so I’m breaking the analysis into small, manageable pieces.  The first one I wanted to explore quickly is: does month of year matter?  I’ve often heard that summer is the busy season for buying a house because families with children try not to move during the school year.  Does this rule apply to multi-tenant buildings as well (which tend to be purchased by investors)?

I’ve collected the sales over the past decade and grouped them by month of year.  We can see that summer months (May, June, and July) do see more sales than other months.  Interestingly, December also tends to see a large number of sales.  Maybe people have more time over the holidays to check out investment properties?  Not sure what else could be driving this weird spike in sales in December – any ideas?
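The grouping described above can be sketched in base R; the `sales` data frame below and its columns are hypothetical stand-ins for the real dataset.

```r
# Hypothetical sales data: a sale date and a sale price per property
sales <- data.frame(
  sale_date  = as.Date(c("2015-06-01", "2015-06-15", "2015-12-02", "2016-07-09")),
  sale_price = c(500000, 650000, 480000, 720000)
)

# Extract the calendar month (1-12) from each sale date
sales$month <- as.integer(format(sales$sale_date, "%m"))

table(sales$month)                              # number of sales per calendar month
tapply(sales$sale_price, sales$month, median)   # median sale price per calendar month
```

With the real data, the same two lines give the per-month sales counts and price summaries behind the plots above.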

Given that the most sales occur in summer months, I wanted to see if this has any impact on sale price.

There doesn’t seem to be much of a relationship at all between month of year and sale price.  I had expected to see higher prices in the summer months under the assumption that demand is higher during that period, but maybe supply is also higher during that time (and they rise in roughly equal proportion).  This is something that I could theoretically test (since I have the list date as well as the sale date for each property), but I think there are more interesting topics to explore…  It’s enough for me to know that month of year doesn’t appear to be correlated with sale price!

Are multi-tenant houses going above or below asking price?

For many people one of the most difficult aspects of the house-buying process is deciding what to bid.  I decided to take a look at the relationship between list price and sale price.

Since 2000 the average difference between list and sale price has generally been less than $25K.  We can see that in 2004-2006, multi-tenant houses in Oakland were generally selling for an average of $15K - $25K above the asking price.  In the financial crisis in 2008, we can see houses going for an average of $25K less than asking.  From 2010-2013, list price and sale price averaged close to $0 difference.  Starting in 2013 we start to see houses selling for more than asking, with multi-tenant houses in Oakland now averaging $50K more than asking!  I know that housing prices have increased over time, so my next step was to look at these as a percentage of asking price (attempting to control for inflation in housing values over the past two decades).
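The dollar and percentage premiums discussed above are straightforward to compute; the numbers below are made-up illustrations, not the Oakland data.

```r
# Hypothetical list and sale prices for three properties
list_price <- c(500000, 600000, 450000)
sale_price <- c(540000, 590000, 495000)

premium_usd <- sale_price - list_price          # dollar premium over asking
premium_pct <- 100 * premium_usd / list_price   # premium as % of list price

round(premium_pct, 1)
# [1]  8.0 -1.7 10.0
```

Expressing the premium as a percentage of list price is what lets the year-over-year comparison control for the overall rise in housing values.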

The shape of this plot matches the plot showing absolute dollar figures, but it is helpful to see percentages.  The most recent data shows that Oakland multi-tenant houses sell at an average of 7.5% premium over asking price.

How are days on market and sales price vs. list price premium related?

I’ve often heard that the longer a house is on the market, the lower the premium vs. asking price.  This seems as good a dataset as any to test this theory out!

This first plot just shows the relationship between days on market (x-axis) and the difference between sale price and list price (y-axis).  This plot isn’t particularly helpful, but it does show me a couple of things.  I see a handful of outliers (houses on the market for 1000+ days or selling $1M below asking price).  There also seems to be a fairly large cluster of houses selling significantly above asking price between 7-30 days after going on the market.  The next step was to bucket these days on market.

We can see that there does appear to be an inverse relationship between days on market and the sale price - list price premium.  Roughly 75% of houses sold in less than 30 days were sold above asking price (with median premiums of ~$10K).  On the other hand, roughly 75% of houses on the market for 60+ days were sold below asking price (with median discounts of ~$10K).  Next steps here would be converting to percent of list price and then reviewing these trends over time, but that will have to wait for another day!

R code here:

To leave a comment for the author, please follow the link and comment on their blog: R Analysis – Analyst at Large.

### Testing the Hierarchical Risk Parity algorithm

Fri, 05/26/2017 - 21:43

(This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

This post will be a modified backtest of the Adaptive Asset Allocation backtest from AllocateSmartly, using the Hierarchical Risk Parity algorithm from last post, because Adam Butler was eager to see my results.
On the whole, as Adam Butler had told me he had seen, HRP does not generate outperformance when applied to a small, carefully-constructed, diversified-by-selection universe of asset classes, as opposed to a universe of hundreds or even several thousand assets, where its theoretically superior properties result in it being a superior algorithm.

First off, I would like to thank one Matthew Barry for helping me modify my HRP algorithm so as to not use the global environment for recursion. You can find his github here.

Here is the modified HRP code.

```r
covMat <- read.csv('cov.csv', header = FALSE)
corMat <- read.csv('corMat.csv', header = FALSE)

clustOrder <- hclust(dist(corMat), method = 'single')$order

getIVP <- function(covMat) {
  invDiag <- 1/diag(as.matrix(covMat))
  weights <- invDiag/sum(invDiag)
  return(weights)
}

getClusterVar <- function(covMat, cItems) {
  covMatSlice <- covMat[cItems, cItems]
  weights <- getIVP(covMatSlice)
  cVar <- t(weights) %*% as.matrix(covMatSlice) %*% weights
  return(cVar)
}

getRecBipart <- function(covMat, sortIx) {
  w <- rep(1, ncol(covMat))
  w <- recurFun(w, covMat, sortIx)
  return(w)
}

recurFun <- function(w, covMat, sortIx) {
  subIdx <- 1:trunc(length(sortIx)/2)
  cItems0 <- sortIx[subIdx]
  cItems1 <- sortIx[-subIdx]
  cVar0 <- getClusterVar(covMat, cItems0)
  cVar1 <- getClusterVar(covMat, cItems1)
  alpha <- 1 - cVar0/(cVar0 + cVar1)

  # scoping mechanics using w as a free parameter
  w[cItems0] <- w[cItems0] * alpha
  w[cItems1] <- w[cItems1] * (1-alpha)

  if(length(cItems0) > 1) {
    w <- recurFun(w, covMat, cItems0)
  }
  if(length(cItems1) > 1) {
    w <- recurFun(w, covMat, cItems1)
  }
  return(w)
}

out <- getRecBipart(covMat, clustOrder)
out
```

Here, covMat and corMat are from the last post. In fact, this function can be further modified by encapsulating the clustering order within the getRecBipart function, but in the interest of keeping the code as similar to Marcos Lopez de Prado’s code as I could, I’ll leave this here.
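For what it's worth, that encapsulation could look something like the sketch below (my own variation, not code from the post; the function name is invented, and it assumes the recurFun helper from the code above is in scope).

```r
# Hypothetical wrapper: fold the hierarchical-clustering step into the
# top-level call so callers pass only the covariance and correlation matrices.
getRecBipartFromMats <- function(covMat, corMat) {
  # same single-linkage cluster ordering as computed separately above
  sortIx <- hclust(dist(corMat), method = "single")$order
  w <- rep(1, ncol(covMat))
  recurFun(w, covMat, sortIx)  # recurFun as defined in the modified HRP code
}
```

This keeps the quasi-diagonalization and recursive bisection steps behind one interface, at the cost of diverging from Lopez de Prado's original structure.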

Anyhow, the backtest will follow. One thing I will mention is that I’m using Quandl’s EOD database, as Yahoo has really screwed up their financial database (i.e. some sector SPDRs have broken data, dividends not adjusted, etc.). While this database is a $50/month subscription, I believe free users can access it up to 150 times in 60 days, so that should be enough to run backtests from this blog, so long as you save your downloaded time series for later use by using write.zoo.

This code needs the tseries library for the portfolio.optim function for the minimum variance portfolio (Dr. Kris Boudt has a course on this at datacamp), and the other standard packages.

A helper function for this backtest (and really, any other momentum rotation backtest) is the appendMissingAssets function, which simply adds on assets not selected to the final weighting and re-orders the weights by the original ordering.

```r
require(tseries)
require(PerformanceAnalytics)
require(quantmod)
require(Quandl)

Quandl.api_key("YOUR_AUTHENTICATION_HERE") # not displaying my own api key, sorry

# function to append missing (I.E. assets not selected) asset names and sort into original order
appendMissingAssets <- function(wts, allAssetNames, wtsDate) {
  absentAssets <- allAssetNames[!allAssetNames %in% names(wts)]
  absentWts <- rep(0, length(absentAssets))
  names(absentWts) <- absentAssets
  wts <- c(wts, absentWts)
  wts <- xts(t(wts), order.by=wtsDate)
  wts <- wts[,allAssetNames]
  return(wts)
}
```

Next, we make the call to Quandl to get our data.

```r
symbols <- c("SPY", "VGK", "EWJ", "EEM", "VNQ", "RWX", "IEF", "TLT", "DBC", "GLD")

rets <- list()
for(i in 1:length(symbols)) {
  # quandl command to download from EOD database. Free users should use write.zoo in this loop.
  returns <- Return.calculate(Quandl(paste0("EOD/", symbols[i]),
                                     start_date="1990-12-31", type = "xts")$Adj_Close)
  colnames(returns) <- symbols[i]
  rets[[i]] <- returns
}
rets <- na.omit(do.call(cbind, rets))
```

While Josh Ulrich fixed quantmod to actually get Yahoo data after Yahoo broke the API, the problem is that the Yahoo data is now garbage as well, and I’m not sure how much Josh Ulrich can do about that. I really hope some other provider can step up and provide free, usable EOD data so that I don’t have to worry about readers not being able to replicate the backtest, as my policy for this blog is that readers should be able to replicate the backtests so they don’t just nod and take my word for it. If you are or know of such a provider, please leave a comment so that I can let the blog readers know all about you.

Next, we initialize the settings for the backtest.

```r
invVolWts <- list()
minVolWts <- list()
hrpWts <- list()

ep <- endpoints(rets, on = "months")
nMonths = 6 # month lookback (6 as per parameters from allocateSmartly)
nVol = 20   # day lookback for volatility (20 ibid)
```

While the AAA backtest actually uses a 126-day lookback rather than a 6-month one, since the strategy trades at the end of every month, a 6-month lookback is effectively the same thing, give or take a few days out of 126, and the code is less complex this way.

Next, we have our actual backtest.

```r
for(i in 1:(length(ep)-nMonths)) {

  # get returns subset and compute absolute momentum
  retSubset <- rets[c(ep[i]:ep[(i+nMonths)]),]
  retSubset <- retSubset[-1,]
  moms <- Return.cumulative(retSubset)

  # select top performing assets and subset returns for them
  highRankAssets <- rank(moms) >= 6                   # top 5 assets
  posReturnAssets <- moms > 0                         # positive momentum assets
  selectedAssets <- highRankAssets & posReturnAssets  # intersection of the above
  selectedSubset <- retSubset[,selectedAssets]        # subset returns slice

  if(sum(selectedAssets)==0) { # if no qualifying assets, zero weight for period
    wts <- xts(t(rep(0, ncol(retSubset))), order.by=last(index(retSubset)))
    colnames(wts) <- colnames(retSubset)
    invVolWts[[i]] <- minVolWts[[i]] <- hrpWts[[i]] <- wts
  } else if (sum(selectedAssets)==1) { # if one qualifying asset, invest fully into it
    wts <- xts(t(rep(0, ncol(retSubset))), order.by=last(index(retSubset)))
    colnames(wts) <- colnames(retSubset)
    wts[, which(selectedAssets==1)] <- 1
    invVolWts[[i]] <- minVolWts[[i]] <- hrpWts[[i]] <- wts
  } else { # otherwise, use weighting algorithms

    cors <- cor(selectedSubset)             # correlation
    volSubset <- tail(selectedSubset, nVol) # 20 day volatility
    vols <- StdDev(volSubset)
    covs <- t(vols) %*% vols * cors

    # minimum volatility using portfolio.optim from tseries
    minVolRets <- t(matrix(rep(1, sum(selectedAssets))))
    minVolWt <- portfolio.optim(x=minVolRets, covmat = covs)$pw
    names(minVolWt) <- colnames(covs)
    minVolWt <- appendMissingAssets(minVolWt, colnames(retSubset), last(index(retSubset)))
    minVolWts[[i]] <- minVolWt

    # inverse volatility weights
    invVols <- 1/vols
    invVolWt <- invVols/sum(invVols)
    invNames <- colnames(invVolWt)
    invVolWt <- as.numeric(invVolWt)
    names(invVolWt) <- invNames
    invVolWt <- appendMissingAssets(invVolWt, colnames(retSubset), last(index(retSubset)))
    invVolWts[[i]] <- invVolWt

    # hrp weights
    clustOrder <- hclust(dist(cors), method = 'single')$order
    hrpWt <- getRecBipart(covs, clustOrder)
    names(hrpWt) <- colnames(covs)
    hrpWt <- appendMissingAssets(hrpWt, colnames(retSubset), last(index(retSubset)))
    hrpWts[[i]] <- hrpWt
  }
}
```

In a few sentences, this is what happens:

The algorithm takes a subset of the returns (the past six months at every month), and computes absolute momentum. It then ranks the ten absolute momentum calculations, and selects the intersection of the top 5, and those with a return greater than zero (so, a dual momentum calculation).
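The rank-and-filter step can be illustrated in isolation with toy momentum numbers (these values are made up for demonstration, not market data):

```r
# Toy absolute-momentum values for the ten assets
moms <- c(SPY = 0.04, VGK = -0.01, EWJ = 0.02, EEM = 0.07, VNQ = -0.03,
          RWX = 0.01, IEF = 0.03, TLT = 0.05, DBC = -0.02, GLD = 0.06)

highRankAssets <- rank(moms) >= 6                   # top 5 of 10 by relative momentum
posReturnAssets <- moms > 0                         # absolute momentum filter
selectedAssets <- highRankAssets & posReturnAssets  # dual momentum: the intersection
names(moms)[selectedAssets]                         # "SPY" "EEM" "IEF" "TLT" "GLD"
```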

If no assets qualify, the algorithm invests in nothing. If only one asset qualifies, the algorithm invests fully in that one asset. If two or more assets qualify, the algorithm computes a covariance matrix by combining the 20-day volatilities with a 126-day correlation matrix (that is, t(sd_20) %*% sd_20, scaled elementwise by cor_126). It then computes normalized inverse volatility weights using the volatility from the past 20 days, a minimum variance portfolio with the portfolio.optim function, and lastly, the hierarchical risk parity weights using the HRP code above from Marcos Lopez de Prado's paper.
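The covariance construction works out to cov_ij = vol_i * vol_j * cor_ij, which can be shown standalone (the volatility and correlation numbers here are illustrative):

```r
# Outer product of the 20-day vols, scaled elementwise by the 126-day correlations
vols <- matrix(c(0.01, 0.02, 0.015), nrow = 1)
cors <- matrix(c(1.0, 0.5, 0.2,
                 0.5, 1.0, 0.3,
                 0.2, 0.3, 1.0), nrow = 3)
covs <- t(vols) %*% vols * cors
covs[1, 2]  # 0.01 * 0.02 * 0.5 = 1e-04
```

The diagonal recovers the variances (vol_i^2), since the correlation of an asset with itself is 1.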

Lastly, the program puts together all of the weights, and adds a cash investment for any period without any investments.

```r
invVolWts <- round(do.call(rbind, invVolWts), 3) # round for readability
minVolWts <- round(do.call(rbind, minVolWts), 3)
hrpWts <- round(do.call(rbind, hrpWts), 3)

# allocate to cash if no allocation made due to all negative momentum assets
invVolWts$cash <- 0; invVolWts$cash <- 1-rowSums(invVolWts)
hrpWts$cash <- 0; hrpWts$cash <- 1-rowSums(hrpWts)
minVolWts$cash <- 0; minVolWts$cash <- 1-rowSums(minVolWts)

# cash value will be zero
rets$cash <- 0

# compute backtest returns
invVolRets <- Return.portfolio(R = rets, weights = invVolWts)
minVolRets <- Return.portfolio(R = rets, weights = minVolWts)
hrpRets <- Return.portfolio(R = rets, weights = hrpWts)
```

Here are the results:

```r
compare <- cbind(invVolRets, minVolRets, hrpRets)
colnames(compare) <- c("invVol", "minVol", "HRP")
charts.PerformanceSummary(compare)
rbind(table.AnnualizedReturns(compare), maxDrawdown(compare), CalmarRatio(compare))
```

```
                             invVol    minVol       HRP
Annualized Return         0.0872000 0.0724000 0.0792000
Annualized Std Dev        0.1208000 0.1025000 0.1136000
Annualized Sharpe (Rf=0%) 0.7221000 0.7067000 0.6968000
Worst Drawdown            0.1548801 0.1411368 0.1593287
Calmar Ratio              0.5629882 0.5131956 0.4968234
```

In short, in the context of a small, carefully selected, and allegedly diversified (I'll let Adam Butler speak for that one) universe, where the dominant decision is which assets to invest in rather than how much, the theoretical upsides of an algorithm that exploits the covariance structure without needing to invert a covariance matrix can be lost.

However, this test (albeit from 2007 onwards, thanks to ETF inception dates combined with lookback burn-in) confirms what Adam Butler himself told me, which is that HRP hasn't impressed him, and from this backtest, I can see why. That said, in the context of dual momentum rank selection, I'm not convinced that any weighting scheme will realize much better performance than any other.

NOTE: I am always interested in networking and hearing about full-time opportunities related to my skill set. My LinkedIn profile can be found here.

### Canada Immigration: Where to settle in? Exercises

Fri, 05/26/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Many people around the globe would like to immigrate to Canada as Skilled Workers. These candidates must prove language proficiency in French and English, have at least two years of work experience after graduation, and more. But many immigrants who arrive in Canada face unemployment rates that are sometimes even higher than in their countries of origin, so the choice of province to settle in is very important for those hoping for economic success. In these exercises we will use R to analyze some immigration open data from the Canadian government.

Answers to the exercises are available here.

Exercise 1
Download and read into R a data set from Labour force survey estimates (LFS), by immigrant status, age group, Canada, regions, provinces and Montreal, Toronto, Vancouver census metropolitan areas, 3-month moving average, unadjusted for seasonality. Then take a look at it using head.

Exercise 2
Load libraries to manipulate data, like dplyr. Turn Ref_Date into a Date-type variable. (Tip: use as.Date.)
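A minimal illustration of the tip (the "YYYY/MM" pattern and the appended day are assumptions; match the format string to the actual Ref_Date values in the file):

```r
# Monthly reference dates lack a day component, so append one before parsing
ref <- c("2015/01", "2015/02")
dates <- as.Date(paste0(ref, "/01"), format = "%Y/%m/%d")
class(dates)  # "Date"
```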

Exercise 3
Transform the variable “Value” to a numeric format.

Learn more about Data manipulation in the online course R: Complete Data Analysis Solutions. In this course you will learn how to:

• Work in depth with dplyr
• Get a full introduction to the data.table package
• And much more

Exercise 4
Create a numeric vector that contains the column indices 1, 2, 4, 5, 6, and 9, and create a new data frame that stores only those columns.

Exercise 5
Create a text vector that contains the province names. Then create a new data frame that keeps only the rows with valid province names.

Exercise 6
We are interested in comparing the unemployment rate between people born in Canada and recent immigrants. Exclude rows related to other kinds of status.

Exercise 7
Skilled Worker immigrants usually need to have a university degree and at least two years of professional experience. So, exclude rows where the "agegroup" variable is "15 years and over", and then remove this column.

Exercise 8
Take a look at the summary information of the unemployment rate.

Exercise 9
Summarize this data, grouping by status and province, and take the mean of the unemployment rate as the table content.

Exercise 10
Use qplot from ggplot2 to create a plot and find the best province in terms of the difference in unemployment rate between locally born people and recent immigrants.

