R-bloggers: R news and tutorials contributed by R bloggers

Visualizing Portfolio Volatility

Fri, 07/21/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

This is the third post in our series on portfolio volatility, variance and standard deviation. If you want to start at the beginning with calculating portfolio volatility, have a look at the first post here – Intro to Volatility. The second post on calculating rolling standard deviations is here: Intro to Rolling Volatility.

Today we will visualize rolling standard deviations with highcharter using two objects from that second post.

The charts, which are the fun payoff after all of the equations and functions that we have ground out in the previous posts, should highlight any unusual occurrences or volatility spikes/dips that we might want to investigate.

First, load the .RDat file saved from our previous Notebook (or you can run the scripts from the previous posts).

load('rolling-sd.RDat')

We now have two objects in our Global Environment: spy_rolling_sd, an xts object of rolling SPY standard deviations, and roll_portfolio_result, an xts object of rolling portfolio standard deviations. Because both of those are xts objects, we can pass them straight to highcharter with the hc_add_series() function and set a name and a color with the name and color arguments. Nothing too complicated here – we did the hard work in our previous Notebooks.

highchart(type = "stock") %>% hc_title(text = "SPY v. Portfolio Rolling Volatility") %>% hc_add_series(spy_rolling_sd, name = "SPY Volatility", color = "blue") %>% hc_add_series(roll_portfolio_result, name = "Port Volatility", color = "green") %>% hc_navigator(enabled = FALSE) %>% hc_scrollbar(enabled = FALSE)

{"x":{"hc_opts":{"title":{"text":"SPY v. Portfolio Rolling Volatility"},"yAxis":{"title":{"text":null}},"credits":{"enabled":false},"exporting":{"enabled":false},"plotOptions":{"series":{"turboThreshold":0},"treemap":{"layoutAlgorithm":"squarified"},"bubble":{"minSize":5,"maxSize":25}},"annotationsOptions":{"enabledButtons":false},"tooltip":{"delayForDisplay":10},"series":[{"data":[[1375228800000,0.0230087834710489],[1377820800000,0.0310554927963628],[1380499200000,0.0303154391342273],[1383177600000,0.0333351835060359],[1385683200000,0.033679613898303],[1388448000000,0.0288400976907278],[1391126400000,0.0338190396691738],[1393545600000,0.029904383528237],[1396224000000,0.0305942566371719],[1398816000000,0.0275796530645834],[1401408000000,0.0268834182854031],[1404086400000,0.0266169197325657],[1406764800000,0.019606506360967],[1409270400000,0.0178261554082542],[1412035200000,0.0218704877587742],[1414713600000,0.0226129041709753],[1417132800000,0.0230956127375855],[1419984000000,0.0243598131833272],[1422576000000,0.0279750349266381],[1424995200000,0.0322177953429278],[1427760000000,0.0325097911709751],[1430352000000,0.0316790953242732],[1432857600000,0.0302347213514618],[1435622400000,0.0322755984506811],[1438300800000,0.029389462373966],[1440979200000,0.0321307589009705],[1443571200000,0.032984375371157],[1446163200000,0.050852177600668],[1448841600000,0.0505138108788652],[1451520000000,0.0503340252080932],[1454025600000,0.0522341207818885],[1456704000000,0.046219715155029],[1459382400000,0.0504880420143848],[1461888000000,0.0371553096877483],[1464652800000,0.0378912457951088],[1467244800000,0.0361050285668243],[1469750400000,0.0250938017057704],[1472601600000,0.0246028353527571],[1475193600000,0.0153415995353276],[1477872000000,0.0187392260707039],[1480464000000,0.0224796653731647],[1483056000000,0.0220139494031393],[1485820800000,0.018927657566137],[1488240000000,0.0221800497260555],[1490918400000,0.0218674879816177],[1493337600000,0.0159545004191706],[1496188800000,0.0135388212505291],[1498780800000,0.014675445173365],[1500508800000,0.0150096717916259]],"name":"SPY 
Volatility","color":"blue"},{"data":[[1377820800000,0.0241115216143897],[1380499200000,0.0260442402607256],[1383177600000,0.0272674443067543],[1385683200000,0.0288719473982621],[1388448000000,0.0288857557055273],[1391126400000,0.0205191246489011],[1393545600000,0.0260055041878384],[1396224000000,0.0264271524902842],[1398816000000,0.0265816646662814],[1401408000000,0.0249591129630544],[1404086400000,0.0249769037782382],[1406764800000,0.0247281123755299],[1409270400000,0.0199346459519581],[1412035200000,0.0139540306559013],[1414713600000,0.0197192962226576],[1417132800000,0.0198713309717489],[1419984000000,0.0190676195505098],[1422576000000,0.0222013376079352],[1424995200000,0.0230297501738257],[1427760000000,0.0319188848721891],[1430352000000,0.0312580175470481],[1432857600000,0.0321510597602722],[1435622400000,0.031875737738177],[1438300800000,0.031631756790222],[1440979200000,0.0295208333915207],[1443571200000,0.0272649913296038],[1446163200000,0.0278878899612652],[1448841600000,0.0415543963149301],[1451520000000,0.0411460482731848],[1454025600000,0.0412861897878856],[1456704000000,0.0435368174613372],[1459382400000,0.0395236110434442],[1461888000000,0.0461989881404102],[1464652800000,0.0360002783258131],[1467244800000,0.0370928565224689],[1469750400000,0.0348749773237639],[1472601600000,0.0258599246690703],[1475193600000,0.0235703025145043],[1477872000000,0.0130155131890959],[1480464000000,0.0163314552380049],[1483056000000,0.0156270326216257],[1485820800000,0.0141434664632901],[1488240000000,0.0120186849784016],[1490918400000,0.0143345865051231],[1493337600000,0.014532447732245],[1496188800000,0.00807871448547398],[1498780800000,0.00794794549875872],[1500508800000,0.0132875445224473]],"name":"Port Volatility","color":"green"}],"navigator":{"enabled":false},"scrollbar":{"enabled":false}},"theme":{"chart":{"backgroundColor":"transparent"}},"conf_opts":{"global":{"Date":null,"VMLRadialGradientURL":"http =//code.highcharts.com/list(version)/gfx/vml-radial-gradient.png","canvasToolsURL":"http =//code.highcharts.com/list(version)/modules/canvas-tools.js","getTimezoneOffset":null,"timezoneOffset":0,"useUTC":true},"lang":{"contextButtonTitle":"Chart context menu","decimalPoint":".","downloadJPEG":"Download JPEG image","downloadPDF":"Download PDF document","downloadPNG":"Download PNG image","downloadSVG":"Download SVG vector image","drillUpText":"Back to {series.name}","invalidDate":null,"loading":"Loading...","months":["January","February","March","April","May","June","July","August","September","October","November","December"],"noData":"No data to display","numericSymbols":["k","M","G","T","P","E"],"printChart":"Print chart","resetZoom":"Reset zoom","resetZoomTitle":"Reset zoom level 1:1","shortMonths":["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"thousandsSep":" ","weekdays":["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]}},"type":"stock","fonts":[],"debug":false},"evals":[],"jsHooks":[]}

It is interesting to note that from late April 2016 to late October 2016, SPY’s rolling standard deviation dipped below that of the diversified portfolio. The portfolio volatility was plunging at the same time, but SPY’s was falling faster. What happened over the 6 preceding months to explain this?

Maybe we should add a flag to highlight this event. We can also add flags for the maximum SPY volatility, maximum and minimum portfolio rolling volatility and might as well include a line for the mean rolling volatility of SPY to practice adding horizontal lines.

We will use two methods for adding flags. First, we’ll hard-code the date for the flag as “2016-04-29”, the date when rolling SPY volatility dipped below the portfolio’s. Second, we’ll set a flag with the date as.Date(index(roll_portfolio_result[which.max(roll_portfolio_result)]), format = "%Y-%m-%d"), which looks like a convoluted mess but simply returns the date on which the rolling portfolio standard deviation hit its maximum.

This is a bit more ‘dynamic’ because we can change our assets but keep this code the same and it will find the date with the maximum rolling standard deviation. Our first flag is not dynamic in the sense that it is specific to the comparison between SPY and this exact portfolio.

spy_important_date <- as.Date(c("2016-04-29"), format = "%Y-%m-%d")
port_max_date <- as.Date(index(roll_portfolio_result[which.max(roll_portfolio_result)]), format = "%Y-%m-%d")
port_min_date <- as.Date(index(roll_portfolio_result[which.min(roll_portfolio_result)]), format = "%Y-%m-%d")
spy_max_date <- as.Date(index(spy_rolling_sd[which.max(spy_rolling_sd)]), format = "%Y-%m-%d")

highchart(type = "stock") %>%
  hc_title(text = "SPY v. Portfolio Rolling Volatility") %>%
  hc_add_series(spy_rolling_sd, name = "SPY Volatility", color = "blue", id = "SPY") %>%
  hc_add_series(roll_portfolio_result, name = "Portf Volatility", color = "green", id = "Port") %>%
  hc_add_series_flags(spy_important_date,
                      title = c("SPY Vol Dips"),
                      text = c("SPY rolling sd dips below portfolio."),
                      id = "SPY") %>%
  hc_add_series_flags(spy_max_date,
                      title = c("SPY Max "),
                      text = c("SPY max rolling volatility."),
                      id = "SPY") %>%
  hc_add_series_flags(port_max_date,
                      title = c("Portf Max"),
                      text = c("Portfolio maximum rolling volatility."),
                      id = "Port") %>%
  hc_add_series_flags(port_min_date,
                      title = c("Portf Min"),
                      text = c("Portfolio min rolling volatility."),
                      id = "Port") %>%
  hc_yAxis(title = list(text = "Mean SPY rolling Vol"),
           showFirstLabel = FALSE,
           showLastLabel = FALSE,
           plotLines = list(
             list(value = mean(spy_rolling_sd), color = "#2b908f", width = 2))) %>%
  hc_navigator(enabled = FALSE) %>%
  hc_scrollbar(enabled = FALSE)

{"x":{"hc_opts":{"title":{"text":"SPY v. Portfolio Rolling Volatility"},"yAxis":{"title":{"text":"Mean SPY rolling Vol"},"showFirstLabel":false,"showLastLabel":false,"plotLines":[{"value":0.0291554812507092,"color":"#2b908f","width":2}]},"credits":{"enabled":false},"exporting":{"enabled":false},"plotOptions":{"series":{"turboThreshold":0},"treemap":{"layoutAlgorithm":"squarified"},"bubble":{"minSize":5,"maxSize":25}},"annotationsOptions":{"enabledButtons":false},"tooltip":{"delayForDisplay":10},"series":[{"data":[[1375228800000,0.0230087834710489],[1377820800000,0.0310554927963628],[1380499200000,0.0303154391342273],[1383177600000,0.0333351835060359],[1385683200000,0.033679613898303],[1388448000000,0.0288400976907278],[1391126400000,0.0338190396691738],[1393545600000,0.029904383528237],[1396224000000,0.0305942566371719],[1398816000000,0.0275796530645834],[1401408000000,0.0268834182854031],[1404086400000,0.0266169197325657],[1406764800000,0.019606506360967],[1409270400000,0.0178261554082542],[1412035200000,0.0218704877587742],[1414713600000,0.0226129041709753],[1417132800000,0.0230956127375855],[1419984000000,0.0243598131833272],[1422576000000,0.0279750349266381],[1424995200000,0.0322177953429278],[1427760000000,0.0325097911709751],[1430352000000,0.0316790953242732],[1432857600000,0.0302347213514618],[1435622400000,0.0322755984506811],[1438300800000,0.029389462373966],[1440979200000,0.0321307589009705],[1443571200000,0.032984375371157],[1446163200000,0.050852177600668],[1448841600000,0.0505138108788652],[1451520000000,0.0503340252080932],[1454025600000,0.0522341207818885],[1456704000000,0.046219715155029],[1459382400000,0.0504880420143848],[1461888000000,0.0371553096877483],[1464652800000,0.0378912457951088],[1467244800000,0.0361050285668243],[1469750400000,0.0250938017057704],[1472601600000,0.0246028353527571],[1475193600000,0.0153415995353276],[1477872000000,0.0187392260707039],[1480464000000,0.0224796653731647],[1483056000000,0.0220139494031393],[1485820800000,0.018927657566137],[1488240000000,0.0221800497260555],[1490918400000,0.0218674879816177],[1493337600000,0.0159545004191706],[1496188800000,0.0135388212505291],[1498780800000,0.014675445173365],[1500508800000,0.0150096717916259]],"name":"SPY 
Volatility","color":"blue","id":"SPY"},{"data":[[1377820800000,0.0241115216143897],[1380499200000,0.0260442402607256],[1383177600000,0.0272674443067543],[1385683200000,0.0288719473982621],[1388448000000,0.0288857557055273],[1391126400000,0.0205191246489011],[1393545600000,0.0260055041878384],[1396224000000,0.0264271524902842],[1398816000000,0.0265816646662814],[1401408000000,0.0249591129630544],[1404086400000,0.0249769037782382],[1406764800000,0.0247281123755299],[1409270400000,0.0199346459519581],[1412035200000,0.0139540306559013],[1414713600000,0.0197192962226576],[1417132800000,0.0198713309717489],[1419984000000,0.0190676195505098],[1422576000000,0.0222013376079352],[1424995200000,0.0230297501738257],[1427760000000,0.0319188848721891],[1430352000000,0.0312580175470481],[1432857600000,0.0321510597602722],[1435622400000,0.031875737738177],[1438300800000,0.031631756790222],[1440979200000,0.0295208333915207],[1443571200000,0.0272649913296038],[1446163200000,0.0278878899612652],[1448841600000,0.0415543963149301],[1451520000000,0.0411460482731848],[1454025600000,0.0412861897878856],[1456704000000,0.0435368174613372],[1459382400000,0.0395236110434442],[1461888000000,0.0461989881404102],[1464652800000,0.0360002783258131],[1467244800000,0.0370928565224689],[1469750400000,0.0348749773237639],[1472601600000,0.0258599246690703],[1475193600000,0.0235703025145043],[1477872000000,0.0130155131890959],[1480464000000,0.0163314552380049],[1483056000000,0.0156270326216257],[1485820800000,0.0141434664632901],[1488240000000,0.0120186849784016],[1490918400000,0.0143345865051231],[1493337600000,0.014532447732245],[1496188800000,0.00807871448547398],[1498780800000,0.00794794549875872],[1500508800000,0.0132875445224473]],"name":"Portf Volatility","color":"green","id":"Port"},{"data":[{"x":1461888000000,"title":"SPY Vol Dips","text":"SPY rolling sd dips below portfolio."}],"onSeries":"SPY","type":"flags"},{"data":[{"x":1454025600000,"title":"SPY Max ","text":"SPY max rolling volatility."}],"onSeries":"SPY","type":"flags"},{"data":[{"x":1461888000000,"title":"Portf Max","text":"Portfolio maximum rolling volatility."}],"onSeries":"Port","type":"flags"},{"data":[{"x":1498780800000,"title":"Portf Min","text":"Portfolio min rolling volatility."}],"onSeries":"Port","type":"flags"}],"navigator":{"enabled":false},"scrollbar":{"enabled":false}},"theme":{"chart":{"backgroundColor":"transparent"}},"conf_opts":{"global":{"Date":null,"VMLRadialGradientURL":"http =//code.highcharts.com/list(version)/gfx/vml-radial-gradient.png","canvasToolsURL":"http =//code.highcharts.com/list(version)/modules/canvas-tools.js","getTimezoneOffset":null,"timezoneOffset":0,"useUTC":true},"lang":{"contextButtonTitle":"Chart context menu","decimalPoint":".","downloadJPEG":"Download JPEG image","downloadPDF":"Download PDF document","downloadPNG":"Download PNG image","downloadSVG":"Download SVG vector image","drillUpText":"Back to {series.name}","invalidDate":null,"loading":"Loading...","months":["January","February","March","April","May","June","July","August","September","October","November","December"],"noData":"No data to display","numericSymbols":["k","M","G","T","P","E"],"printChart":"Print chart","resetZoom":"Reset zoom","resetZoomTitle":"Reset zoom level 1:1","shortMonths":["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"thousandsSep":" ","weekdays":["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]}},"type":"stock","fonts":[],"debug":false},"evals":[],"jsHooks":[]}

Hover on the flags and you can see the text we added for explanation.

It’s remarkable how rolling volatility has absolutely plunged since early-to-mid 2016. Since August of 2016, both the portfolio and SPY rolling standard deviations have been well below the SPY mean.
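If you want to check that observation against the data rather than eyeballing the chart, here is a quick sketch (assuming the xts package is loaded, since both objects are xts; the inner join drops the one month where only SPY has a value):

library(xts)
# align the two series on common dates, then find when both sit below the SPY mean
both <- merge(spy_rolling_sd, roll_portfolio_result, join = "inner")
below <- as.logical(both[, 1] < mean(spy_rolling_sd) &
                    both[, 2] < mean(spy_rolling_sd))
tail(index(both)[below])  # the most recent dates where both are below the SPY mean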

Thanks for sticking with this three-part introduction to volatility. Next time, we’ll port our work to Shiny and play with different assets and allocations.

Original post: https://rviews.rstudio.com/2017/07/21/visualizing-portfolio-volatility/


To leave a comment for the author, please follow the link and comment on their blog: R Views.

How to make maps with Census data in R

Fri, 07/21/2017 - 02:00

(This article was first published on randomjohn.github.io, and kindly contributed to R-bloggers)

US Census Data

The US Census collects a number of demographic measures and publishes aggregate data through its website. There are several ways to use Census data in R, from the Census API to the USCensus2010 package. If you are interested in geopolitical data in the US, I recommend exploring both these options – the Census API requires a key for each person who uses it, and the package requires downloading a very large dataset. The setups for both require some effort, but once that effort is done you don’t have to do it again.

The acs package in R allows you to access the Census API easily. I highly recommend checking it out, and that’s the method we will use here. Note that I’ve already defined the variable api_key – if you are trying to run this code you will need to first run something like api_key <- "<your Census API key>" before running the rest of this code.

library(acs)
api.key.install(key = api_key)
# now you are ready to run the rest of the acs code

For purposes here, we will use the toy example of plotting median household income by county for every county in South Carolina. First, we obtain the Census data. The first command gives us the table and variable names of what we want. I then use that table number in the acs.fetch command to get the variable I want.

acs.lookup(endyear = 2015, span = 5, dataset = "acs",
           keyword = c("median", "income", "family", "total"),
           case.sensitive = F)
## Warning in acs.lookup(endyear = 2015, span = 5, dataset = "acs", keyword = c("median", :
## XML variable lookup tables for this request seem to be missing from
## 'https://api.census.gov/data/2015/acs5/variables.xml'; temporarily downloading and
## using archived copies instead; since this is *much* slower, recommend running
## acs.tables.install()
## An object of class "acs.lookup"
## endyear= 2015 ; span= 5
##
## results:
##    variable.code table.number
## 1     B10010_001       B10010
## 2     B19126_001       B19126
## 3     B19126_002       B19126
## 4     B19126_005       B19126
## 5     B19126_006       B19126
## 6     B19126_009       B19126
## 7     B19215_001       B19215
## 8     B19215_002       B19215
## 9     B19215_003       B19215
## 10    B19215_006       B19215
## 11    B19215_009       B19215
## 12    B19215_010       B19215
## 13    B19215_013       B19215
## table B10010: Median Family Income for Families with GrndPrnt Householders Living With Own GrndChldrn < 18 Yrs
## table B19126: Median Family Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Family Type by Presence of Own Children Under 18 Years
## table B19215: Median Nonfamily Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Sex of Householder by Living Alone by Age of Householder

my_cnty <- geo.make(state = 45, county = "*")
home_median_price <- acs.fetch(geography = my_cnty, table.number = "B19126",
                               endyear = 2015) # home median prices
## Warning in (function (endyear, span = 5, dataset = "acs", keyword, table.name, :
## XML variable lookup tables for this request seem to be missing from
## 'https://api.census.gov/data/2015/acs5/variables.xml'; temporarily downloading and
## using archived copies instead; since this is *much* slower, recommend running
## acs.tables.install()
## Error in if (url.test["statusMessage"] != "OK") {: missing value where TRUE/FALSE needed

knitr::kable(head(home_median_price@estimate))

                                   B19126_001 B19126_002 B19126_003 B19126_004 B19126_005 B19126_006 B19126_007 B19126_008 B19126_009 B19126_010 B19126_011
Abbeville County, South Carolina        44918      55141      65664      50698      24835      43187      50347      24886      22945      18101      29958
Aiken County, South Carolina            57396      70829      72930      70446      29302      36571      35469      37906      27355      22760      34427
Allendale County, South Carolina           NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA
Anderson County, South Carolina         53169      65881      75444      60166      26608      36694      37254      36297      24384      17835      29280
Bamberg County, South Carolina             NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA
Barnwell County, South Carolina         44224      59467      70542      54030      19864      25143      18633      45714      18317      13827      21315

Plotting the map data

If you have the maps and ggplot2 packages, you already have the data you need to plot. We use the map_data function from ggplot2 to pull in county shape data for South Carolina. (A previous attempt at this blogpost had used the ggmap package, but there is an incompatibility between that and the latest ggplot2 package at the time of this writing.)

library(ggplot2)
## Want to understand how all the pieces fit together? Buy the
## ggplot2 book: http://ggplot2.org/book/

sc_map <- map_data("county", region = "south.carolina")
ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group), data = sc_map,
               colour = "white", fill = "black") +
  theme_minimal()

Merging the demographic and map data

Now we have the demographic data and the map, but merging the two will take a little effort. The reason is that the map data gives a lower case representation of the county and calls it a “subregion”, while the Census data returns the county as “xxxx County, South Carolina”. I use the dplyr and stringr packages (for str_replace) to make short work of this merge.

library(dplyr)
library(stringr)

merged <- as.data.frame(home_median_price@estimate) %>%
  mutate(county_full = rownames(.),
         county = str_replace(county_full, "(.+) County.*", "\\1") %>% tolower) %>%
  select(county, B19126_001) %>%
  rename(med_income = B19126_001) %>%
  right_join(sc_map, by = c("county" = "subregion"))

knitr::kable(head(merged, 10))

county     med_income       long      lat  group  order  region
abbeville       44918  -82.24809 34.41758      1      1  south carolina
abbeville       44918  -82.31685 34.35455      1      2  south carolina
abbeville       44918  -82.31111 34.33163      1      3  south carolina
abbeville       44918  -82.31111 34.29152      1      4  south carolina
abbeville       44918  -82.28247 34.26860      1      5  south carolina
abbeville       44918  -82.25955 34.25142      1      6  south carolina
abbeville       44918  -82.24809 34.21131      1      7  south carolina
abbeville       44918  -82.23663 34.18266      1      8  south carolina
abbeville       44918  -82.24236 34.15401      1      9  south carolina
abbeville       44918  -82.27674 34.10818      1     10  south carolina

It’s now a simple matter to plot this merged dataset. In fact, we only have to tweak a few things from the first time we plotted the map data.

ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group, fill = med_income), data = merged) +
  theme_minimal()

Discussion

It’s pretty easy to plot U.S. Census data on a map. The real power of Census data comes not just from plotting it, but combining with other geographically-based data (such as crime). The acs package in R makes it easy to obtain Census data, which can then be merged with other data using packages such as dplyr and stringr and then plotted with ggplot2. Hopefully the authors of the ggmap and ggplot2 packages can work out their incompatibilities so that the above maps can be created using the Google API map or open street maps.
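As a quick illustration of that idea, here is a hypothetical sketch that joins a made-up county-level crime dataset (crime_by_county, with columns county and crime_rate, both invented for the example) onto the merged object from above and fills the map with it:

crime_by_county <- data.frame(county = c("abbeville", "aiken"),
                              crime_rate = c(2.1, 3.4))

merged_crime <- merged %>%
  left_join(crime_by_county, by = "county")

ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group, fill = crime_rate),
               data = merged_crime) +
  theme_minimal()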

It should be noted that while I obtained county-level information, aggregate data can be obtained at Census block and tract levels as well, if you are looking to do some sort of localized analysis.
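A tract-level request might look something like the sketch below. I haven't run this, and I'm assuming geo.make() accepts a county name and a tract argument as described in the acs documentation; the choice of Greenville County is just an example:

greenville_tracts <- geo.make(state = "SC", county = "Greenville", tract = "*")
tract_income <- acs.fetch(geography = greenville_tracts,
                          table.number = "B19126", endyear = 2015)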


To leave a comment for the author, please follow the link and comment on their blog: randomjohn.github.io.

Package bigstatsr: Statistics with matrices on disk (useR 2017)

Fri, 07/21/2017 - 02:00

In this post, I will talk about my package bigstatsr, which I’ve just presented in a lightning talk of 5 minutes at useR!2017. You can listen to me in action there. I should have chosen a longer talk to explain more about this package, maybe next time. I will use this post to give you a more detailed version of the talk I gave in Brussels.

Motivation behind bigstatsr

I’m a PhD student in predictive human genetics. I’m basically trying to predict someone’s risk of disease based on their DNA mutations. These DNA mutations come in the form of large matrices; I’m currently working with a matrix of 15K rows and 300K columns. This matrix would take approximately 32GB of RAM if stored as a standard R matrix.
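A quick back-of-the-envelope check of that figure, at 8 bytes per double:

15e3 * 300e3 * 8 / 1024^3  # ≈ 33.5 GiB as a standard double-precision R matrix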

When I began studying this dataset, I had only 8GB of RAM on my computer. I now have 64GB of RAM, but copying this matrix just once would be enough to make my computer start swapping and slow down. I found a convenient solution in the big.matrix object provided by the R package bigmemory (Kane, Emerson, and Weston 2013). With this solution, you can access a matrix that is stored on disk almost as if it were a standard R matrix in memory.

Yet, for these special matrices, some useful statistical functions were missing or not fast enough, so I implemented them. It was a good exercise in programming optimized and parallelized algorithms. I’m aware that there are other packages built on bigmemory, such as biganalytics and bigalgebra, that already implement some of the features I will talk about in this post; I will discuss why I don’t use them. However, I do use the work of other packages such as biglasso and RSpectra.

Introduction to bigmemory

# loading package bigstatsr (and bigmemory)
library(bigstatsr)
# initializing some matrix on disk: wrapper to bigmemory::big.matrix()
mat <- FBM(backingroot = "matrix-on-disk", descriptor = FALSE)(5e3, 10e3)
## Creating directory "backingfiles" which didn't exist..

dim(mat)
## [1]  5000 10000

mat[1:5, 1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0

mat[1, 1] <- 2
mat[1:5, 1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0

mat[2:4] <- 3
mat[1:5, 1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    0    0    0    0
## [2,]    3    0    0    0    0
## [3,]    3    0    0    0    0
## [4,]    3    0    0    0    0
## [5,]    0    0    0    0    0

mat[, 2:3] <- rnorm(2 * nrow(mat))
mat[1:5, 1:5]
##      [,1]       [,2]       [,3] [,4] [,5]
## [1,]    2 -0.1611891  3.1218981    0    0
## [2,]    3 -0.6026680 -1.3935111    0    0
## [3,]    3  0.4743703 -0.6566549    0    0
## [4,]    3  0.1514252  0.8247594    0    0
## [5,]    0 -1.5167186  0.7329260    0    0

What we can see is that big matrices (big.matrix objects) can be accessed (read/write) almost as if they were standard R matrices, but you have to be cautious. For example, doing mat[1, ] isn’t recommended. Indeed, big matrices, as standard R matrices, are stored by column so that it is in fact a big vector with columns stored one after the other, contiguously. So, accessing the first row would access elements that are not stored contiguously in memory, which is slow. One should always access columns rather than rows.
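A quick way to see this for yourself with the matrix created above (timings will vary by machine):

# column access reads contiguous memory; row access jumps across the whole backing file
system.time(for (j in 1:500) x <- mat[, j])  # fast
system.time(for (i in 1:500) x <- mat[i, ])  # noticeably slower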

Apply an R function to a big matrix

An easy strategy to apply an R function to a big matrix would be the split-apply-combine strategy (Wickham 2011). For example, you could access only a block of columns at a time, apply a (vectorized) function to this block, and then combine the results of all blocks. This is implemented in function big_apply().

# Compute the sums of the first 1000 columns
colsums_1 <- colSums(mat[, 1:1000])
# Compute the sums of the second block of 1000 columns
colsums_2 <- colSums(mat[, 1001:2000])
# Combine the results
colsums_1_2 <- c(colsums_1, colsums_2)

# Do this automatically with big_apply()
colsums_all <- big_apply(mat, a.FUN = function(X, ind) colSums(X[, ind]),
                         a.combine = 'c')

When the split-apply-combine strategy can be used for a given function, you can use big_apply() to get the results while accessing only small blocks of columns (or rows) at a time. Package biganalytics, by the creators of bigmemory, provides a way to apply an R function to the margins of a big.matrix. Yet, if the big.matrix has a lot of columns, it is much slower to loop through all columns one by one than to apply a vectorized function to blocks of columns. You can find more examples of applications for big_apply() there.

colsums_all2 <- biganalytics::apply(mat, 2, sum)
all.equal(colsums_all2, colsums_all)
## [1] TRUE

Use Rcpp with a big matrix

Using Rcpp with a big.matrix is super easy. Let’s use the previous example, i.e. the computation of the colsums of a big.matrix. We will do it in 3 different ways.

1. Using the bigmemory way

// [[Rcpp::depends(bigmemory, BH)]]
#include <Rcpp.h>
#include <bigmemory/MatrixAccessor.hpp>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector bigcolsums(const S4& BM) {

  XPtr<BigMatrix> xpMat = BM.slot("address");
  MatrixAccessor<double> macc(*xpMat);

  int n = macc.nrow();
  int m = macc.ncol();

  NumericVector res(m); // vector of m zeros
  int i, j;
  for (j = 0; j < m; j++)
    for (i = 0; i < n; i++)
      res[j] += macc[j][i];

  return res;
}

colsums_all3 <- bigcolsums(mat)
all.equal(colsums_all3, colsums_all)
## [1] TRUE

2. Using the bigstatsr way

// [[Rcpp::depends(bigstatsr, bigmemory, BH)]]
#include

// [[Rcpp::export]]
NumericVector bigcolsums2(const S4& BM,
                          const IntegerVector& rowInd,
                          const IntegerVector& colInd) {

  XPtr<BigMatrix> xpMat = BM.slot("address");

  // C++ indices begin at 0
  IntegerVector rows = rowInd - 1;
  IntegerVector cols = colInd - 1;

  // An accessor of only part of the big.matrix
  SubMatAcc<double> macc(*xpMat, rows, cols);

  int n = macc.nrow();
  int m = macc.ncol();

  NumericVector res(m); // vector of m zeros
  int i, j;
  for (j = 0; j < m; j++)
    for (i = 0; i < n; i++)
      res[j] += macc(i, j);

  return res;
}

colsums_all4 <- bigcolsums2(mat, rows_along(mat), cols_along(mat))
all.equal(colsums_all4, colsums_all)
## [1] TRUE

In bigstatsr, most of the functions have parameters for subsetting rows and columns because it is often useful. One of the main reasons why I don’t use package bigalgebra is its lack of subsetting options.
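For instance, the ind.col argument used with big_SVD() and big_cor() later in this post restricts a computation to a subset of columns. A minimal sketch, assuming an analogous ind.row argument also exists for subsetting rows, as the sentence above suggests:

# compute column statistics on a subset of the matrix only
big_colstats(mat, ind.row = 1:1000, ind.col = 1:2000)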

3. Use an already implemented function

str(colsums_all5 <- big_colstats(mat))
## 'data.frame':    10000 obs. of  2 variables:
##  $ sum: num  11 139.6 73.2 0 0 ...
##  $ var: num  0.0062 1.0117 1.0164 0 0 ...

all.equal(colsums_all5$sum, colsums_all)
## [1] TRUE

Principal Component Analysis

Let’s begin by filling the matrix with random numbers in a tricky way.

U <- sweep(matrix(rnorm(nrow(mat) * 10), ncol = 10), 2, 1:10, "/")
V <- matrix(rnorm(ncol(mat) * 10), ncol = 10)

big_apply(mat, a.FUN = function(X, ind) {
  X[, ind] <- tcrossprod(U, V[ind, ]) + rnorm(nrow(X) * length(ind))
  NULL
}, a.combine = 'c')
## NULL

Let’s say we want the first 10 PCs of the (scaled) matrix.

system.time(
  small_svd <- svd(scale(mat[, 1:2000]), nu = 10, nv = 10)
)
##    user  system elapsed
##   8.348   0.063   8.429

system.time(
  small_svd2 <- big_SVD(mat, big_scale(), ind.col = 1:2000)
)
##    user  system elapsed
##   1.900   0.083   1.989

plot(small_svd2$u, small_svd$u)

system.time(
  small_svd3 <- big_randomSVD(mat, big_scale(), ind.col = 1:2000)
)
##    user  system elapsed
##   0.353   0.002   0.355

plot(small_svd3$u, small_svd$u)

system.time(
  svd_all <- big_randomSVD(mat, big_scale())
)
##    user  system elapsed
##   2.267   0.001   2.268

plot(svd_all)

Function big_randomSVD() uses Rcpp and package RSpectra to implement a fast Singular Value Decomposition for a big.matrix that is linear in all dimensions (the standard PCA algorithm is quadratic in the smallest dimension), which makes it very fast even for datasets where both dimensions are large.

Some linear models

M <- 100 # number of causal variables
set <- sample(ncol(mat), M)
y <- mat[, set] %*% rnorm(M)
y <- y + rnorm(length(y), sd = 2 * sd(y))

ind.train <- sort(sample(nrow(mat), size = 0.8 * nrow(mat)))
ind.test <- setdiff(rows_along(mat), ind.train)

mult_test <- big_univLinReg(mat, y[ind.train], ind.train = ind.train,
                            covar.train = svd_all$u[ind.train, ])

library(ggplot2)
plot(mult_test) +
  aes(color = cols_along(mat) %in% set) +
  labs(color = "Causal?")

train <- big_spLinReg(mat, y[ind.train], ind.train = ind.train,
                      covar.train = svd_all$u[ind.train, ],
                      alpha = 0.5)
pred <- predict(train, X. = mat, ind.row = ind.test,
                covar.row = svd_all$u[ind.test, ])
plot(apply(pred, 2, cor, y = y[ind.test]))

The functions big_spLinReg(), big_spLogReg() and big_spSVM() all use lasso (L1) or elastic-net (L1 & L2) regularization in order to limit the number of predictors and to accelerate computations thanks to strong rules (R. Tibshirani et al. 2012). The implementation of these functions is based on modifications of packages sparseSVM and biglasso (Zeng and Breheny 2017). Yet, these models give predictions for a range of 100 different regularization parameters, whereas we are only interested in one prediction.

So, that’s why I came up with the idea of Cross-Model Selection and Averaging (CMSA), whose principle is (a rough sketch in plain R follows the list below):

  1. This function separates the training set into K folds (e.g. 10).
  2. In turn,
    • each fold is considered as an inner validation set and the other (K – 1) folds form an inner training set,
    • the model is trained on the inner training set and the corresponding predictions (scores) for the inner validation set are computed,
    • the vector of scores which maximizes feval is determined,
    • the vector of coefficients corresponding to the previous vector of scores is chosen.
  3. The K resulting vectors of coefficients are then combined into one vector.
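To make the procedure concrete, here is a rough sketch of the same idea written with glmnet on an ordinary in-memory matrix. This is not the bigstatsr implementation, just an illustration of the principle; X, y, and the default feval are placeholders.

library(glmnet)

cmsa_sketch <- function(X, y, K = 10,
                        feval = function(pred, target) cor(pred, target)) {
  folds <- sample(rep_len(1:K, nrow(X)))
  betas <- matrix(0, ncol(X) + 1, K)   # intercept + coefficients, one column per fold
  for (k in 1:K) {
    in_train <- folds != k             # the other (K - 1) folds form the inner training set
    fit <- glmnet(X[in_train, ], y[in_train], alpha = 0.5)
    scores <- predict(fit, X[!in_train, ])              # one column of scores per lambda
    best <- which.max(apply(scores, 2, feval, target = y[!in_train]))
    betas[, k] <- as.numeric(coef(fit)[, best])         # coefficients of the best lambda
  }
  rowMeans(betas)                      # combine the K vectors of coefficients into one
}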
train2 <- big_CMSA(big_spLinReg,
                   feval = function(pred, target) cor(pred, target),
                   X. = mat, y.train = y[ind.train],
                   ind.train = ind.train,
                   covar.train = svd_all$u[ind.train, ],
                   alpha = 0.5, ncores = parallel::detectCores() / 2)

mean(train2 != 0) # percentage of predictors
## [1] 0.1806194

pred2 <- predict(train2, X. = mat, ind.row = ind.test,
                 covar.row = svd_all$u[ind.test, ])
cor(pred2, y[ind.test])
## [1] 0.311157

Some matrix computations

For example, let’s compute the correlation of the first 2000 columns.

system.time(
  corr <- cor(mat[, 1:2000])
)
##    user  system elapsed
##  10.822   0.007  10.824

system.time(
  corr2 <- big_cor(mat, ind.col = 1:2000)
)
##    user  system elapsed
##   0.729   0.005   0.737

all.equal(corr2, corr)
## [1] TRUE

Advantages of using big.matrix objects
  • you can apply algorithms on 100GB of data,
  • you can easily parallelize your algorithms because the data on disk is shared,
  • you write more efficient algorithms,
  • you can use different types of data, for example, in my field, I’m storing my data with only 1 byte per element (rather than 8 bytes for a standard R matrix).

In a future post, I’ll try to talk about good practices for using parallelism in R.

References

Kane, Michael J, John W Emerson, and Stephen Weston. 2013. “Scalable Strategies for Computing with Massive Data.” Journal of Statistical Software 55 (14): 1–19. doi:10.18637/jss.v055.i14.

Tibshirani, Robert, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Taylor, and Ryan J Tibshirani. 2012. “Strong rules for discarding predictors in lasso-type problems.” Journal of the Royal Statistical Society. Series B, Statistical Methodology 74 (2): 245–66. doi:10.1111/j.1467-9868.2011.01004.x.

Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (1): 1–29. doi:10.18637/jss.v040.i01.

Zeng, Yaohui, and Patrick Breheny. 2017. “The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R,” January. http://arxiv.org/abs/1701.05936.


What analysis program do conservation scientists use?

Fri, 07/21/2017 - 02:00

(This article was first published on Bluecology blog, and kindly contributed to R-bloggers)

International Congress for Conservation Biology: What analysis program do conservation scientists use?

With the International Congress for Conservation Biology starting 23rd July I was wondering, what analysis programs are most commonly used by conservation scientists? And, what do they use for spatial analysis and mapping?

To find out, if you are a conservation scientist please participate in these interactive polls:

Favourite analysis programs

[Interactive poll – see the original post to vote.]

Favourite programs for spatial analysis

[Interactive poll – see the original post to vote.]

Please share this post with your friends and colleagues. I will follow up with a discussion once we get some results in, hopefully during ICCB2017 in Colombia.

If you choose one of the ‘other’ options, let me know via twitter what program you do use.


To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog.

Quirks about running Rcpp on Windows through RStudio

Thu, 07/20/2017 - 21:17

(This article was first published on R – Stat Bandit, and kindly contributed to R-bloggers)

Quirks about running Rcpp on Windows through RStudio

This is a quick note about some tribulations I had running Rcpp (v. 0.12.12) code through RStudio (v. 1.0.143) on a Windows 7 box running R (v. 3.3.2). I also have RTools v. 3.4 installed. I fully admit that this may very well be specific to my box, but I suspect not.

I kept running into problems with Rcpp complaining that (a) RTools wasn’t installed, and (b) the C++ compiler couldn’t find Rcpp.h. First, devtools::find_rtools was giving a positive result, so (a) was not true. Second, I noticed that the wrong C++ compiler was being called. Even more frustrating was the fact that everything was working if I worked on a native R console rather than RStudio. So there was nothing inherently wrong with the code or setup, but rather the environment RStudio was creating.

After some searching the interwebs and StackOverflow, the following solution worked for me. I added the following lines to my global .Rprofile file:

Sys.setenv(PATH = paste(Sys.getenv("PATH"),
                        "C:/RBuildTools/3.4/bin/",
                        "C:/RBuildTools/3.4/mingw_64/bin",
                        sep = ";"))
Sys.setenv(BINPREF = "C:/RBuildTools/3.4/mingw_64/bin/")

Note that C:/RBuildTools is the default location suggested when I installed RTools.

This solution is indicated here, but I have the reverse issue of the default setup working in R and not in the latest RStudio. However, the solution still works!!

Note that instead of putting it in the global .Rprofile, you could put it in a project-specific .Rprofile, or even in your R code as long as it is run before loading the Rcpp or derivative packages. Note also that if you use binary packages that use Rcpp, there is no problem. Only when you’re compiling C++ code either for your own code or for building a package from source is this an issue. And, as far as I can tell, only on Windows.
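A quick sanity check after setting those environment variables (paths as above; adjust them if your RTools lives elsewhere):

Sys.which("g++")        # should now resolve to .../mingw_64/bin/g++
Rcpp::evalCpp("2 + 2")  # compiles a trivial snippet; returns 4 if the toolchain works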

Hope this prevents someone else from 3 hours of heartburn trying to make Rcpp work on a Windows box. And, if this has already been fixed in RStudio, please comment and I’ll be happy to update this post.

To leave a comment for the author, please follow the link and comment on their blog: R – Stat Bandit.

Quickly Check your id Variables

Thu, 07/20/2017 - 21:16

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

Virtually every dataset has them: id variables that link a record to a subject and/or time point. Often one column, or a combination of columns, forms the unique id of a record. For instance, the combination of patient_id and visit_id, or ip_address and visit_time. The first step in most of my analyses is almost always checking the uniqueness of a variable, or a combination of variables. If it is not unique, many assumptions about the data may be wrong, or there are data quality issues. Since I do this so often, I decided to make a little wrapper around this procedure. The unique_id function will return TRUE if the evaluated variables indeed are the unique key to a record. If not, it will return all the records for which the id variable(s) are duplicated, so we can pinpoint the problem right away. It uses dplyr v.0.7.1, so make sure that it is loaded.

library(dplyr)

some_df <- data_frame(a = c(1, 2, 3, 3, 4),
                      b = 101:105,
                      val = round(rnorm(5), 1))

some_df %>% unique_id(a)
## # A tibble: 2 x 3
##       a     b   val
##   <dbl> <int> <dbl>
## 1     3   103  -1.8
## 2     3   104   1.1

some_df %>% unique_id(a, b)
## [1] TRUE

Here you find the source code of the function. You can also obtain it by installing the package accompanying this blog, using devtools::install_github("edwinth/thatssorandom").

unique_id <- function(x, ...) {
  id_set <- x %>% select(...)
  id_set_dist <- id_set %>% distinct
  if (nrow(id_set) == nrow(id_set_dist)) {
    TRUE
  } else {
    non_unique_ids <- id_set %>%
      filter(id_set %>% duplicated) %>%
      distinct()
    suppressMessages(
      inner_join(non_unique_ids, x) %>% arrange(...)
    )
  }
}

To leave a comment for the author, please follow the link and comment on their blog: That’s so Random.

Data Analysis for Life Sciences

Thu, 07/20/2017 - 17:00

Rafael Irizarry from the Harvard T.H. Chan School of Public Health has presented a number of courses on R and Biostatistics on EdX, and he recently also provided an index of all of the course modules as YouTube videos with supplemental materials. The EdX courses are linked below, which you can take for free, or simply follow the series of YouTube videos and materials provided in the index. 

Data Analysis for the Life Sciences Series 

A companion book and associated R Markdown documents are also available for download.

Genomics Data Analysis Series

For links to all of the course components, including videos and supplementary materials, follow the link below.

rafalab: HarvardX Biomedical Data Science Open Online Training


Preparing for the Plumber v0.4.0 Release

Thu, 07/20/2017 - 11:13

(This article was first published on Trestle Technology, LLC - R, and kindly contributed to R-bloggers)

Plumber is a package which allows you to create web APIs from your R code. If you’re new to Plumber, you can take a look at www.rplumber.io to learn more about how to use the package to create your APIs.
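For anyone who hasn't seen it, the comment-annotation style looks roughly like this (a minimal sketch with a hypothetical /echo endpoint, not code taken from the release notes):

# plumber.R

#* Echo back the input
#* @param msg The message to echo
#* @get /echo
function(msg = "") {
  list(msg = paste0("The message is: '", msg, "'"))
}

You would then serve it locally with something like plumber::plumb("plumber.R")$run().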

Over the years, we’ve noticed a handful of things that we wished we’d done differently in the development of Plumber. In particular, there were some decisions that prioritized convenience over security which we wanted to roll back. We’ve decided to bite the bullet and make these changes before more users start using Plumber. The v0.4.0 release of Plumber includes a handful of breaking changes that mitigate these issues all at once. Our hope in getting all of these out of the way at once is to ensure that users building on Plumber moving forward have confidence that any breaking change under consideration has already been made and that we shouldn’t have this kind of churn again anytime soon.

If you’re already using Plumber, we strongly encourage you to read the v0.4.0 migration guide below. As you’ll see, most of these changes affect the hosting or low-level interactions in Plumber. The dominant mode in which people use Plumber (adding comments to their existing R functions) has not changed; there are no breaking changes in how those files are interpreted.

v0.4.0 of Plumber will be deployed to CRAN in the coming days. We strongly encourage you to try out the v0.4.0 version of Plumber ahead of time to make sure that you’ve migrated your APIs correctly.

devtools::install_github("trestletech/plumber")

Alternatively, you can continue using the last release of Plumber which didn’t include any of these breaking changes (v0.3.3) until you’re ready to upgrade by using the following command:

devtools::install_github("trestletech/plumber", ref="v0.3.3")

You can see the full release notes for v0.4.0 here.

Plumber v0.4.0 Migration Guide

There are a number of changes that users should consider when preparing to upgrade to plumber v0.4.0.

  1. Plumber no longer accepts external connections by default. The host parameter for the run() method now defaults to 127.0.0.1, meaning that Plumber will only listen for incoming requests from the local machine on which it’s running — not from any other machine on the network. This is done for security reasons so that you don’t accidentally expose a Plumber API that you’re developing to your entire network. To restore the old behavior in which Plumber listened for connections from any machine on the network, use $run(host="0.0.0.0"). Note that if you’re deploying to an environment that includes an HTTP proxy (such as the DigitalOcean servers which use nginx), having Plumber listen only on 127.0.0.1 is likely the right default, as your proxy — not Plumber — is the one receiving external connections.
  2. Plumber no longer sets the Access-Control-Allow-Origin HTTP header to *. This was previously done for convenience but given the security implications we’re reversing this decision. The previous behavior would have allowed web browsers to make requests of your API from other domains using JavaScript if the request used only standard HTTP headers and were a GET, HEAD, or POST request. These requests will no longer work by default. If you wish to allow an endpoint to be accessible from other origins in a web browser, you can use res$setHeader("Access-Control-Allow-Origin", "*") in an endpoint or filter.
  3. Rather than setting the default port to 8000, the port is now randomly selected. This ensures that a shared server (like RStudio Server) will be able to support multiple people developing Plumber APIs concurrently without them having to manually identify an available port. This can be controlled by specifying the port parameter in the run() method or by setting the plumber.port option.
  4. The object-oriented model for Plumber routers has changed. If you’re calling any of the following methods on your Plumber router, you will need to modify your code to use the newer alternatives: addFilter, addEndpoint, addGlobalProcessor, and addAssets. The code around these functions has undergone a major rewrite and some breaking changes have been introduced. These four functions are still supported with a deprecation warning in 0.4.0, but support is only best-effort. Certain parameters on these methods are no longer supported, so you should thoroughly test any Plumber API that leverages any of these methods before deploying version 0.4.0. Updated documentation for using Plumber programmatically is now available.

To leave a comment for the author, please follow the link and comment on their blog: Trestle Technology, LLC - R.

Preparing for the Plumber v0.4.0 Release

Thu, 07/20/2017 - 11:13

(This article was first published on Trestle Technology, LLC - R, and kindly contributed to R-bloggers)

Plumber is a package which allows you to create web APIs from your R code. If you’re new to Plumber, you can take a look at www.rplumber.io to learn more about how to use the package to create your APIs.

Over the years, we’ve noticed a handful of things that we wished we’d done differently in the development of Plumber. In particular, there were some decisions that prioritized convenience over security which we wanted to roll back. We’ve decided to bite the bullet and make these changes before more users start using Plumber. The v0.4.0 release of Plumber includes a handful of breaking changes that mitigate these issues all at once. Our hope in getting all of these out of the way at once is to ensure that users building on Plumber moving forward have confidence that any breaking change under consideration has already been made and that we shouldn’t have this kind of churn again anytime soon.

If you’re already using Plumber, we strongly encourage you to read the v0.4.0 migration guide below. As you’ll see, most of these changes affect the hosting or low-level interactions in Plumber. The dominant mode in which people use Plumber (adding comments to their existing R functions) has not changed; there are no breaking changes in how those files are interpreted.

v0.4.0 of Plumber will be deployed to CRAN in the coming days. We strongly encourage you to try out the v0.4.0 version of Plumber ahead of time to make sure that you’ve migrated your APIs correctly.

devtools::install_github("trestletech/plumber")

Alternatively, you can continue using the last release of Plumber which didn’t include any of these breaking changes (v0.3.3) until you’re ready to upgrade by using the following command:

devtools::install_github("trestletech/plumber", ref="v0.3.3")

You can see the full release notes for v0.4.0 here.

Plumber v0.4.0 Migration Guide

There are a number of changes that users should consider when preparing to upgrade to plumber v0.4.0.

  1. Plumber no longer accepts external connections by default. The host parameter for the run() method now defaults to 127.0.0.1, meaning that Plumber will only listen for incoming requests from the local machine on which it’s running — not from any other machine on the network. This is done for security reasons so that you don’t accidentally expose a Plumber API that you’re developing to your entire network. To restore the old behavior in which Plumber listened for connections from any machine on the network, use $run(host="0.0.0.0"). Note that if you’re deploying to an environment that includes an HTTP proxy (such as the DigitalOcean servers which use nginx), having Plumber listen only on 127.0.0.1 is likely the right default, as your proxy — not Plumber — is the one receiving external connections.
  2. Plumber no longer sets the Access-Control-Allow-Origin HTTP header to *. This was previously done for convenience but given the security implications we’re reversing this decision. The previous behavior would have allowed web browsers to make requests of your API from other domains using JavaScript if the request used only standard HTTP headers and were a GET, HEAD, or POST request. These requests will no longer work by default. If you wish to allow an endpoint to be accessible from other origins in a web browser, you can use res$setHeader("Access-Control-Allow-Origin", "*") in an endpoint or filter.
  3. Rather than setting the default port to 8000, the port is now randomly selected. This ensures that a shared server (like RStudio Server) will be support multiple people developing Plumber APIs concurrently without them having to manually identify an available port. This can be controlled by specifying the port parameter in the run() method or by setting the plumber.port option.
  4. The object-oriented model for Plumber routers has changed. If you’re calling any of the following methods on your Plumber router, you will need to modify your code to use the newer alternatives: addFilter, addEndpoint, addGlobalProcessor, and addAssets. The code around these functions has undergone a major rewrite and some breaking changes have been introduced. These four functions are still supported with a deprecation warning in 0.4.0, but support is only best-effort. Certain parameters on these methods are no longer supported, so you should thoroughly test any Plumber API that leverages any of these methods before deploying version 0.4.0. Updated documentation for using Plumber programmatically is now available.
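To see the first three changes together, here is a minimal sketch under the new defaults; the file name api.R and the /hello endpoint are made up for illustration and are not part of the release notes.

# api.R ----------------------------------------------------------------
#* @get /hello
function(res) {
  # Opt this endpoint back in to cross-origin browser access.
  res$setHeader("Access-Control-Allow-Origin", "*")
  list(msg = "hello")
}

# run.R ----------------------------------------------------------------
library(plumber)
pr <- plumb("api.R")
# Listen on all interfaces and a fixed port, approximating the pre-0.4.0
# behavior; drop these arguments to keep the safer new defaults.
pr$run(host = "0.0.0.0", port = 8000)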

The Viral Recurring Decimal: Euler Problem 26

Thu, 07/20/2017 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Proposed solution to Euler Problem 26 in the R language. Find the value of d < 1000 for which 1/d contains the longest recurring decimal cycle. Continue reading →
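For readers who do not want to click through, here is a hedged sketch of one way to do it (not necessarily the author's approach): the recurring cycle length of 1/d equals the multiplicative order of 10 modulo d once all factors of 2 and 5 are removed from d.

cycle_length <- function(d) {
  # Factors of 2 and 5 only add a non-repeating prefix, so strip them first.
  while (d %% 2 == 0) d <- d / 2
  while (d %% 5 == 0) d <- d / 5
  if (d == 1) return(0)
  # The cycle length is the smallest k with 10^k = 1 (mod d).
  r <- 10 %% d
  k <- 1
  while (r != 1) {
    r <- (r * 10) %% d
    k <- k + 1
  }
  k
}

lengths <- sapply(2:999, cycle_length)
(d <- which.max(lengths) + 1)  # the d < 1000 with the longest cycle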

The post The Viral Recurring Decimal: Euler Problem 26 appeared first on The Devil is in the Data.


Ligature fonts for R

Wed, 07/19/2017 - 21:46

(This article was first published on R – Benomics, and kindly contributed to R-bloggers)

Ligature fonts are fonts which sometimes map multiple characters to a single glyph, either for readability or just because it looks neat. Importantly, this only affects the rendering of the text with said font, while the distinct characters remain in the source.

The Apple Chancery font with and without ligatures enabled.

Maybe ligatures are an interesting topic in themselves if you’re into typography, but it’s the relatively modern monospaced variants which are significantly more useful in the context of R programming.

Two of the most popular fonts in this category are:

  • Fira Code — an extension of Fira Mono which really goes all out providing a wide range of ligatures for obscure Haskell operators, as well as the more standard set which will be used when writing R
  • Hasklig — a fork of Source Code Pro (in my opinion a nicer base font) which is more conservative with the ligatures it introduces

Here’s some code to try out with these ligature fonts, first rendered via bog-standard monospace font:

library(magrittr)
library(tidyverse)

filtered_storms <- dplyr::storms %>%
  filter(category == 5, year >= 2000) %>%
  unite("date", year:day, sep = "-") %>%
  group_by(name) %>%
  filter(pressure == max(pressure)) %>%
  mutate(date = as.Date(date)) %>%
  arrange(desc(date)) %>%
  ungroup() %T>%
  print()

Here’s the same code rendered with Hasklig:

Some of the glyphs on show here are:

  • A single arrow glyph for less-than hyphen (<-)
  • Altered spacing around two colons (::)
  • Joined up double-equals

Fira Code takes this a bit further and also converts >= to a single glyph:

In my opinion these fonts are a nice and clear way of reading and writing R. In particular the single arrow glyph harks back to the APL keyboards with real arrow keys, for which our modern two-character <- is a poor substitute.

One downside could be a bit of confusion when showing your IDE to someone else, or maybe writing slightly longer lines than it appears, but personally I’m a fan and my RStudio is now in Hasklig.


Neville’s Method of Polynomial Interpolation

Wed, 07/19/2017 - 20:00

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

Part 1 of 5 in the series Numerical Analysis

Neville’s method evaluates a polynomial that passes through a given set of x and y points for a particular x value using the Newton polynomial form. Neville’s method is similar to a now-defunct procedure named Aitken’s algorithm and is based on the divided differences recursion relation (“Neville’s Algorithm”, n.d.).

It was stated before in a previous post on Lagrangian polynomial interpolation that there exists a Lagrange polynomial P passing through n + 1 points y_i = f(x_i), where each i = 0, 1, ..., n is a distinct integer, at corresponding x values x_0, x_1, ..., x_n. The points are denoted (x_0, y_0), (x_1, y_1), ..., (x_n, y_n).

Neville’s Method

Neville’s method can be stated as follows:

Let a function f be defined at the points x_0, x_1, ..., x_k, where x_i and x_j are two distinct members of that set. For each k, there exists a Lagrange polynomial P that interpolates the function f at the points x_0, x_1, ..., x_k. The kth Lagrange polynomial can be built from lower-order ones by the recursion:

P(x) = [(x - x_j) * P_{0,1,...,j-1,j+1,...,k}(x) - (x - x_i) * P_{0,1,...,i-1,i+1,...,k}(x)] / (x_i - x_j)

For ease of notation, these interpolating polynomials are usually written as entries Q_{i,j} of a table, where Q_{i,j} denotes the polynomial through the j + 1 points x_{i-j}, ..., x_i; this is exactly the matrix that the R implementation below fills in.

The interpolating polynomials can thus be generated recursively, which we will see in the following example:

Neville’s Method Example

Consider the following table of x and corresponding y values:

x      y
8.1    16.9446
8.3    17.56492
8.6    18.50515
8.7    18.82091

Suppose we are interested in interpolating a polynomial that passes through these points in order to approximate the resulting y value for an x value of 8.4.

We can construct the first column of interpolating polynomial approximations at x = 8.4 using the recursion above:

Q_{1,1} = [(8.4 - 8.1)(17.56492) - (8.4 - 8.3)(16.9446)] / (8.3 - 8.1) = 17.87508
Q_{2,1} = [(8.4 - 8.3)(18.50515) - (8.4 - 8.6)(17.56492)] / (8.6 - 8.3) = 17.87833
Q_{3,1} = [(8.4 - 8.6)(18.82091) - (8.4 - 8.7)(18.50515)] / (8.7 - 8.6) = 17.87363

The approximated values are then used in the next iteration:

Q_{2,2} = [(8.4 - 8.1)(17.87833) - (8.4 - 8.6)(17.87508)] / (8.6 - 8.1) = 17.87703
Q_{3,2} = [(8.4 - 8.3)(17.87363) - (8.4 - 8.7)(17.87833)] / (8.7 - 8.3) ≈ 17.87716

Then the final iteration yields the approximated value for the given x value:

Q_{3,3} = [(8.4 - 8.1)(17.87716) - (8.4 - 8.7)(17.87703)] / (8.7 - 8.1) ≈ 17.87709

Therefore 17.87709 is the approximated y value at the point x = 8.4.

Neville’s Method in R

The following function is an implementation of Neville’s method for interpolating and evaluating a polynomial.

poly.neville <- function(x, y, x0) {
  n <- length(x)
  q <- matrix(data = 0, n, n)
  q[, 1] <- y
  for (i in 2:n) {
    for (j in i:n) {
      q[j, i] <- ((x0 - x[j - i + 1]) * q[j, i - 1] -
                  (x0 - x[j]) * q[j - 1, i - 1]) / (x[j] - x[j - i + 1])
    }
  }
  res <- list('Approximated value' = q[n, n], 'Neville iterations table' = q)
  return(res)
}

Let’s test this function to see if it reports the same result as what we found earlier.

x <- c(8.1, 8.3, 8.6, 8.7)
y <- c(16.9446, 17.56492, 18.50515, 18.82091)
poly.neville(x, y, 8.4)
## $`Approximated value`
## [1] 17.87709
##
## $`Neville iterations table`
##          [,1]     [,2]     [,3]     [,4]
## [1,] 16.94460  0.00000  0.00000  0.00000
## [2,] 17.56492 17.87508  0.00000  0.00000
## [3,] 18.50515 17.87833 17.87703  0.00000
## [4,] 18.82091 17.87363 17.87716 17.87709

The approximated value is reported as 17.87709, the same value we calculated previously (minus a few decimal places). The function also outputs the iteration table that stores the intermediate results.

The pracma package contains the neville() function which also performs Neville’s method of polynomial interpolation and evaluation.

library(pracma)
neville(x, y, 8.4)
## [1] 17.87709

The neville() function reports the same approximated value that we found with our manual calculations and function.

References

Burden, R. L., & Faires, J. D. (2011). Numerical analysis (9th ed.). Boston, MA: Brooks/Cole, Cengage Learning.

Cheney, E. W., & Kincaid, D. (2013). Numerical mathematics and computing (6th ed.). Boston, MA: Brooks/Cole, Cengage Learning.

Neville’s algorithm. (2016, January 2). In Wikipedia, The Free Encyclopedia. From https://en.wikipedia.org/w/index.php?title=Neville%27s_algorithm&oldid=697870140

The post Neville’s Method of Polynomial Interpolation appeared first on Aaron Schlegel.


ArCo Package v 0.2 is on CRAN

Wed, 07/19/2017 - 19:14

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

The ArCo package 0.2 is now available on CRAN. The functions are now more user friendly. The new features are:

  • A default estimation function is used if the user does not supply the functions fn and p.fn. The default model is Ordinary Least Squares.
  • The user can now add extra arguments to the fn function in the call.
  • The data will be automatically coerced when possible.


Data wrangling : Transforming (2/3)

Wed, 07/19/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


Data wrangling is a task of great importance in data analysis. Data wrangling is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process, estimated to take about 60-80% of an analyst's time. In this series we will go through this process. It will be a brief series with the goal of sharpening the reader's skills in data wrangling. This is the third part of the series and it covers the transforming of the data used. This can include filtering, summarizing, and ordering your data by different means, as well as combining various data sets, creating new variables, and many other manipulation tasks. In this post, we will go through a few more advanced transformation tasks on the mtcars data set.
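As a quick refresher on the kinds of verbs these tasks rely on (an illustration only, not one of the exercise solutions), a typical dplyr pipeline chains such steps together:

library(dplyr)

mtcars %>%
  filter(hp > 100) %>%                     # keep only the more powerful cars
  mutate(power_to_weight = hp / wt) %>%    # create a new variable
  arrange(desc(power_to_weight)) %>%       # order the rows
  head()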

Before proceeding, it might be helpful to look over the help pages for group_by, ungroup, summary, summarise, arrange, mutate, and cumsum.

Moreover please load the following libraries.
install.packages("dplyr")
library(dplyr)

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Create a new object named cars_cyl and assign to it the mtcars data frame grouped by the variable cyl
Hint: be careful about the data type of the variable, in order to be used for grouping it has to be a factor.

Exercise 2

Remove the grouping from the object cars_cyl

Exercise 3

Print out the summary statistics of the mtcars data frame using the summary function and pipeline symbols %>%.

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

  • Work with popular libraries such as dplyr
  • Learn about methods such as pipelines
  • And much more

Exercise 4

Make a more descriptive summary statistics output containing the 4 quantiles, the mean, the standard deviation and the count.

Exercise 5

Print out the average *hp* for every cyl category

Exercise 6

Print out the mtcars data frame sorted by hp (ascending order)

Exercise 7

Print out the mtcars data frame sorted by hp (descending order)

Exercise 8

Create a new object named cars_per containing the mtcars data frame along with a new variable called performance and calculated as performance = hp/mpg

Exercise 9

Print out the cars_per data frame, sorted by performance in descending order and create a new variable called rank indicating the rank of the cars in terms of performance.

Exercise 10

To wrap everything up, we will use the iris data set. Print out the mean of every variable for every Species and create two new variables called Sepal.Density and Petal.Density, calculated as Sepal.Density = Sepal.Length * Sepal.Width and Petal.Density = Sepal.Length * Petal.Width respectively.

Related exercise sets:
  1. Building Shiny App exercises part 6
  2. Density-Based Clustering Exercises
  3. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-2)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

New R Course: Writing Efficient R Code

Wed, 07/19/2017 - 16:38

Hello R users, we’ve got a brand new course today: Writing Efficient R Code by Colin Gillespie.

The beauty of R is that it is built for performing data analysis. The downside is that sometimes R can be slow, thereby obstructing our analysis. For this reason, it is essential to become familiar with the main techniques for speeding up your analysis, so you can reduce computational time and get insights as quickly as possible.

Take me to chapter 1!

Writing Efficient R Code features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a master of writing efficient, quick R code!

What you’ll learn: 

Chapter 1: The Art of Benchmarking

In order to make your code go faster, you need to know how long it takes to run.
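For instance (a sketch of the idea, not course material, and assuming the microbenchmark package is installed), you might compare two equivalent expressions like this:

x <- runif(1e6)

system.time(sqrt(x))        # coarse, one-off timing from base R

library(microbenchmark)
microbenchmark(             # repeats each expression and reports the distribution of timings
  sqrt_fun = sqrt(x),
  power    = x^0.5,
  times = 100
)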

Chapter 2: Fine Tuning – Efficient Base R

R is flexible because you can often solve a single problem in many different ways. Some ways can be several orders of magnitude faster than the others.

Chapter 3: Diagnosing Problems – Code Profiling

Profiling helps you locate the bottlenecks in your code.

Chapter 4: Turbo Charged Code – Parallel Programming

Some problems can be solved faster using multiple cores on your machine. This chapter shows you how to write R code that runs in parallel.
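As a rough sketch of the idea (not course material), base R's parallel package can spread independent iterations over a small cluster of local cores:

library(parallel)

slow_task <- function(i) { Sys.sleep(0.1); sqrt(i) }

cl <- makeCluster(2)                       # two local worker processes
results <- parLapply(cl, 1:20, slow_task)  # roughly twice as fast as lapply() here
stopCluster(cl)
unlist(results)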


Learn how to write efficient R code today!



Short course on Bayesian data analysis and Stan 23-25 Aug in NYC!

Wed, 07/19/2017 - 16:00

(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

Jonah “ShinyStan” Gabry, Mike “Riemannian NUTS” Betancourt, and I will be giving a three-day short course next month in New York, following the model of our successful courses in 2015 and 2016.

Before class everyone should install R, RStudio and RStan on their computers. (If you already have these, please update to the latest version of R and the latest version of Stan.) If problems occur please join the stan-users group and post any questions. It’s important that all participants get Stan running and bring their laptops to the course.
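If you want to confirm your setup ahead of time, a minimal smoke test along the following lines should compile and sample a toy model (a sketch of a pre-class check, not the instructors' official instructions):

install.packages("rstan", dependencies = TRUE)
library(rstan)

toy_model <- "
parameters { real y; }
model { y ~ normal(0, 1); }
"

fit <- stan(model_code = toy_model, iter = 1000, chains = 2)  # compiles C++ and samples
print(fit)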

Class structure and example topics for the three days:

Day 1: Foundations
Foundations of Bayesian inference
Foundations of Bayesian computation with Markov chain Monte Carlo
Intro to Stan with hands-on exercises
Real-life Stan
Bayesian workflow

Day 2: Linear and Generalized Linear Models
Foundations of Bayesian regression
Fitting GLMs in Stan (logistic regression, Poisson regression)
Diagnosing model misfit using graphical posterior predictive checks
Little data: How traditional statistical ideas remain relevant in a big data world
Generalizing from sample to population (surveys, Xbox example, etc)

Day 3: Hierarchical Models
Foundations of Bayesian hierarchical/multilevel models
Accurately fitting hierarchical models in Stan
Why we don’t (usually) have to worry about multiple comparisons
Hierarchical modeling and prior information

Specific topics on Bayesian inference and computation include, but are not limited to:
Bayesian inference and prediction
Naive Bayes, supervised, and unsupervised classification
Overview of Monte Carlo methods
Convergence and effective sample size
Hamiltonian Monte Carlo and the no-U-turn sampler
Continuous and discrete-data regression models
Mixture models
Measurement-error and item-response models

Specific topics on Stan include, but are not limited to:
Reproducible research
Probabilistic programming
Stan syntax and programming
Optimization
Warmup, adaptation, and convergence
Identifiability and problematic posteriors
Handling missing data
Ragged and sparse data structures
Gaussian processes

Again, information on the course is here.

The course is organized by Lander Analytics.

The course is not cheap. Stan is open-source, and we organize these courses to raise money to support the programming required to keep Stan up to date. We hope and believe that the course is more than worth the money you pay for it, but we hope you’ll also feel good, knowing that this money is being used directly to support Stan R&D.

The post Short course on Bayesian data analysis and Stan 23-25 Aug in NYC! appeared first on Statistical Modeling, Causal Inference, and Social Science.


Machine Learning Explained: supervised learning, unsupervised learning, and reinforcement learning

Wed, 07/19/2017 - 15:51

(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

Machine learning is often split into three main types of learning: supervised learning, unsupervised learning, and reinforcement learning. Knowing the differences between these three types of learning is necessary for any data scientist.

The big picture

The type of learning is defined by the problem you want to solve and is intrinsic to the goal of your analysis:

  • You have a target, a value or a class to predict. For instance, let’s say you want to predict the revenue of a store from different inputs (day of the week, advertising, promotion). Then your model will be trained on historical data and use them to forecast future revenues. Hence the model is supervised, it knows what to learn.
  • You have unlabelled data and look for patterns or groups in these data. For example, you want to cluster clients according to the type of products they order, how often they purchase your product, their last visit, … Instead of doing it manually, unsupervised machine learning will automatically discriminate between different clients.
  • You want to attain an objective. For example, you want to find the best strategy to win a game with specified rules. Once these rules are specified, reinforcement learning techniques will play this game many times to find the best strategy.
On supervised learning

Supervised learning encompasses different techniques which all share the same principles:

  • The training dataset contains input data (your predictors) and the value you want to predict (which can be numeric or not).
  • The model uses the training data to learn a link between the inputs and the outputs. The underlying idea is that the training data can be generalized and that the model can be used on new data with some accuracy.

Some supervised learning algorithms:

  • Linear and logistic regression
  • Support vector machine
  • Naive Bayes
  • Neural network
  • Gradient boosting
  • Classification trees and random forest

Supervised learning is often used for expert systems in image recognition, speech recognition, forecasting, and in some specific business domains (targeting, financial analysis, …).

On unsupervised learning

Cluster Analysis from Wikipedia

On the other hand, unsupervised learning does not use output data (at least output data that are different from the input). Unsupervised algorithms can be split into different categories:

  • Clustering algorithms, such as K-means, hierarchical clustering or mixture models. These algorithms try to discriminate and separate the observations into different groups.
  • Dimensionality reduction algorithms (which are mostly unsupervised), such as PCA, ICA or autoencoders. These algorithms find the best representation of the data with fewer dimensions.
  • Anomaly detection, to find outliers in the data, i.e. observations which do not follow the data set's patterns.

Most of the time unsupervised learning algorithms are used to pre-process the data, during the exploratory analysis or to pre-train supervised learning algorithms.
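A small illustration of the first two settings in base R (not from the original post): the supervised fit is given a labelled target, while the clustering sees nothing but the inputs.

# Supervised: predict a labelled numeric target from input variables.
supervised_fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
predict(supervised_fit, head(iris))

# Unsupervised: no target at all, just look for groups in the inputs.
clusters <- kmeans(iris[, 1:4], centers = 3)
table(clusters$cluster, iris$Species)   # compare discovered groups to the held-back labels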

On reinforcement learning

Reinforcement learning algorithms try to find the best ways to earn the greatest reward. Rewards can be winning a game, earning more money or beating other opponents. They present state-of-the-art results on very human tasks; for instance, this paper from the University of Toronto shows how a computer can beat humans at old-school Atari video games.

Reinforcement learning algorithms follow these circular steps:


From Wikipedia: Reinforcement Learning

Given its own state and the environment's state, the agent chooses the action that will maximize its reward, or explores a new possibility. These actions change both the environment's state and the agent's state, and they are also interpreted to give a reward to the agent. By performing this loop many times, the agent improves its behavior.

Reinforcement learning already performs well on ‘small’ dynamic systems and is definitely a field to follow in the years to come.

The post Machine Learning Explained: supervised learning, unsupervised learning, and reinforcement learning appeared first on Enhance Data Science.


RcppAPT 0.0.4

Wed, 07/19/2017 - 14:12

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version of RcppAPT — our interface from R to the C++ library behind the awesome apt, apt-get, apt-cache, … commands and their cache powering Debian, Ubuntu and the like — arrived on CRAN yesterday.

We added a few more functions in order to compute on the package graph. A concrete example is shown in this vignette which determines the (minimal) set of remaining Debian packages requiring a rebuild under R 3.4.* to update their .C() and .Fortran() registration code. It has been used for the binNMU request #868558.

As we also added a NEWS file, its (complete) content covering all releases follows below.

Changes in version 0.0.4 (2017-07-16)
  • New function getDepends

  • New function reverseDepends

  • Added package registration code

  • Added usage examples in scripts directory

  • Added vignette, also in docs as rendered copy

Changes in version 0.0.3 (2016-12-07)
  • Added dumpPackages, showSrc
Changes in version 0.0.2 (2016-04-04)
  • Added reverseDepends, dumpPackages, showSrc
Changes in version 0.0.1 (2015-02-20)
  • Initial version with getPackages and hasPackages

A bit more information about the package is available here as well as at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


seplyr update

Wed, 07/19/2017 - 09:06

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

The development version of my new R package seplyr is performing in practical applications with dplyr 0.7.* much better than even I (the seplyr package author) expected.

I think I have hit a very good set of trade-offs, and I have now spent significant time creating documentation and examples.

I wish there had been such a package weeks ago, and that I had started using this approach in my own client work at that time. If you are already a dplyr user I strongly suggest trying seplyr in your own analysis projects.

Please see here for details.


Some Ideas for your Internal R Package

Wed, 07/19/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)



At RStudio, I have the pleasure of interacting with data science teams around the world. Many of these teams are led by R users stepping into the role of analytic admins. These users are responsible for supporting and growing the R user base in their organization and often lead internal R user groups.

One of the most successful strategies to support a corporate R user group is the creation of an internal R package. This article outlines some common features and functions shared in internal packages. Creating an R package is easier than you might expect. A good place to start is this webinar on package creation.

Logos and Custom CSS

Interestingly, one powerful way to increase the adoption of data science outputs – plots, reports, and even slides – is to stick to consistent branding. Having a common look and feel makes it easier for management to recognize the work of the data science team, especially as the team grows. Consistent branding also saves the R user time that would normally be spent picking fonts and color schemes.

It is easy to include logos and custom CSS inside of an R package, and to write wrapper functions that copy the assets from the package to a user’s local working directory. For example, this wrapper function adds a logo from the RStudioInternal package to the working directory:

getLogo <- function(copy_to = getwd()) {
  copy_to <- normalizePath(copy_to)
  file.copy(system.file("logo.png", package = "RStudioInternal"), copy_to)
}

Once available, logos and CSS can be added to Shiny apps and R Markdown documents.

ggplot2 Themes

Similar to logos and custom CSS, many internal R packages include a custom ggplot2 theme. These themes ensure consistency across plots in an organization, making data science outputs easier to recognize and read.

ggplot2 themes are shared as functions. To get started writing a ggplot2 theme, see resource 1. For inspiration, take a look at the ggthemes package.
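As an example of the kind of function such a package might export (the theme name theme_acme and its styling choices here are hypothetical), a shared theme is just a function that returns a theme object:

library(ggplot2)

theme_acme <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold"),
      legend.position = "bottom",
      panel.grid.minor = element_blank()
    )
}

ggplot(mtcars, aes(wt, mpg)) +
  geom_point(colour = "#2c7fb8") +
  ggtitle("Fuel economy vs. weight") +
  theme_acme()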

Data Connections

Internal R packages are also an effective way to share functions that make it easy for analysts to connect to internal data sources. Nothing is more frustrating for a first time R user than trying to navigate the world of ODBC connections and complex database schemas before they can get started with data relevant to their day-to-day job.

If you’re not sure where to begin, look through your own scripts for common database connection strings or configurations. db.rstudio.com can provide more information on how to handle credentials, drivers, and config files.
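A wrapper in the internal package can hide those details. This is a hedged sketch only; the helper name, DSN, and table are all made up, and credentials handling will depend on your environment:

library(DBI)
library(odbc)

connect_warehouse <- function() {
  dbConnect(
    odbc::odbc(),
    dsn = "warehouse",   # a DSN configured centrally by IT (hypothetical)
    timeout = 10
  )
}

con <- connect_warehouse()
sales <- dbGetQuery(con, "SELECT * FROM sales LIMIT 10")
dbDisconnect(con)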

learnr Tutorials

RStudio recently released a new package for creating interactive tutorials in R Markdown called learnr. There are many great resources online for getting started with R, but it can be useful to create tutorials specific to your internal data and domain. learnr tutorials can serve as training wheels for the other components of the internal R package or teach broader concepts and standards accepted across the organization. For example, you might provide a primer that teaches new users your organization’s R style guide.

Sharing an Internal R Package

Internal packages can be built in the RStudio IDE and distributed as tar files. Alternatively, many organizations use RStudio Server or RStudio Server Pro to standardize the R environment in their organization. In addition to making it easy to share an internal package, a standard compute environment keeps new R users from having to spend time installing R, RStudio, and packages. While these are necessary skills, the first interactions with R should get new users to a data insight as fast as possible. RStudio Server Pro also includes IT functions for monitoring and restricting resources.

Wrap Up

If you are leading an R group, an internal R package is a powerful way to support your users and the adoption of R. Imagine how easy it would be to introduce R to co-workers if they could connect to real, internal data and create a useful, beautiful plot in under 10 minutes. Investing in an internal R package makes that onboarding experience possible.

