Visualizing Portfolio Volatility
(This article was first published on R Views, and kindly contributed to R-bloggers)
This is the third post in our series on portfolio volatility, variance and standard deviation. If you want to start at the beginning with calculating portfolio volatility, have a look at the first post here – Intro to Volatility. The second post on calculating rolling standard deviations is here: Intro to Rolling Volatility.
Today we will visualize rolling standard deviations with highcharter using two objects from that second post.
The charts, which are the fun payoff after all of the equations and functions that we have ground out in the previous posts, should highlight any unusual occurrences or volatility spikes/dips that we might want to investigate.
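As a refresher on what those objects contain, here is a minimal base-R sketch of a rolling standard deviation (the toy returns and the 6-observation window are illustrative assumptions, not the exact code from the earlier posts):

```r
# Toy monthly returns; in the earlier posts these came from real price data.
set.seed(123)
returns <- rnorm(60, mean = 0.005, sd = 0.03)

# Rolling standard deviation over a 6-observation window (illustrative width).
roll_sd <- function(x, window) {
  sapply(window:length(x), function(i) sd(x[(i - window + 1):i]))
}

rolling <- roll_sd(returns, window = 6)
length(rolling)  # 55 values: one per window endpoint (60 - 6 + 1)
```

Each point on the charts below is one such windowed standard deviation, which is why spikes and dips line up with turbulent or calm stretches of the return series.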
First, load the .RDat file saved from our previous Notebook (or you can run the scripts from the previous posts).
load('rollingsd.RDat')

We now have 2 objects in our Global Environment:

spy_rolling_sd - an xts object of rolling SPY standard deviations
roll_portfolio_result - an xts object of rolling portfolio standard deviations

Because both of those are xts objects, we can pass them straight to highcharter with the hc_add_series() function and set a name and a color with the name and color arguments. Nothing too complicated here - we did the hard work in our previous Notebooks.
highchart(type = "stock") %>%
  hc_title(text = "SPY v. Portfolio Rolling Volatility") %>%
  hc_add_series(spy_rolling_sd, name = "SPY Volatility", color = "blue") %>%
  hc_add_series(roll_portfolio_result, name = "Port Volatility", color = "green") %>%
  hc_navigator(enabled = FALSE) %>%
  hc_scrollbar(enabled = FALSE)

[Interactive highcharter chart: SPY v. Portfolio Rolling Volatility]
It is interesting to note that from late April 2016 to late October 2016, SPY’s rolling standard deviation dipped below that of the diversified portfolio. The portfolio volatility was plunging at the same time, but SPY’s was falling faster. What happened over the 6 preceding months to explain this?
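Spotting a crossover like that programmatically is straightforward. A base-R sketch with toy aligned series (the dates and values here are illustrative, not the actual rolling data):

```r
# Toy aligned rolling-sd series (illustrative values only).
dates <- as.Date(c("2016-03-31", "2016-04-29", "2016-05-31", "2016-06-30"))
spy   <- c(0.050, 0.037, 0.038, 0.036)
port  <- c(0.040, 0.046, 0.036, 0.037)

# First date on which SPY's rolling sd is below the portfolio's.
crossover <- dates[min(which(spy < port))]
crossover  # "2016-04-29"
```

With the real xts objects, the same `which()` logic applied to the merged series would return the actual crossover dates.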
Maybe we should add a flag to highlight this event. We can also add flags for the maximum SPY volatility and the maximum and minimum portfolio rolling volatility, and we might as well include a line for the mean rolling volatility of SPY, to practice adding horizontal lines.
We will use two methods for adding flags. First, we'll hard-code the flag date as "20160429", the date when rolling SPY volatility dipped below the portfolio. Second, we'll set a flag with the date

as.Date(index(roll_portfolio_result[which.max(roll_portfolio_result)]), format = "%Y%m%d")

which looks like a convoluted mess, but simply returns the date on which the rolling portfolio standard deviation hit its maximum.

This is a bit more 'dynamic' because we can change our assets, keep this code the same, and it will still find the date with the maximum rolling standard deviation. Our first flag is not dynamic in the sense that it is specific to the comparison between SPY and this exact portfolio.
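To see what that which.max() idiom is doing, here is a small self-contained sketch with a plain named vector standing in for the xts object (toy values; with real xts data, index() plays the role of names()):

```r
# Toy rolling volatility series, with dates as names standing in for an xts index.
vol <- c(0.021, 0.035, 0.052, 0.048, 0.019)
names(vol) <- c("2016-01-29", "2016-02-29", "2016-03-31", "2016-04-29", "2016-05-31")

# which.max() returns the position of the maximum; its name gives us the date.
max_date <- as.Date(names(vol)[which.max(vol)])
max_date  # "2016-03-31", the peak of this toy series
```

Swap in a different series and the same expression finds that series' own peak date, which is exactly why the flag is 'dynamic'.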
spy_important_date <- as.Date(c("20160429"), format = "%Y%m%d")
port_max_date <- as.Date(index(roll_portfolio_result[which.max(roll_portfolio_result)]), format = "%Y%m%d")
port_min_date <- as.Date(index(roll_portfolio_result[which.min(roll_portfolio_result)]), format = "%Y%m%d")
spy_max_date <- as.Date(index(spy_rolling_sd[which.max(spy_rolling_sd)]), format = "%Y%m%d")

highchart(type = "stock") %>%
  hc_title(text = "SPY v. Portfolio Rolling Volatility") %>%
  hc_add_series(spy_rolling_sd, name = "SPY Volatility", color = "blue", id = "SPY") %>%
  hc_add_series(roll_portfolio_result, name = "Portf Volatility", color = "green", id = "Port") %>%
  hc_add_series_flags(spy_important_date,
                      title = c("SPY Vol Dips"),
                      text = c("SPY rolling sd dips below portfolio."),
                      id = "SPY") %>%
  hc_add_series_flags(spy_max_date,
                      title = c("SPY Max"),
                      text = c("SPY max rolling volatility."),
                      id = "SPY") %>%
  hc_add_series_flags(port_max_date,
                      title = c("Portf Max"),
                      text = c("Portfolio maximum rolling volatility."),
                      id = "Port") %>%
  hc_add_series_flags(port_min_date,
                      title = c("Portf Min"),
                      text = c("Portfolio min rolling volatility."),
                      id = "Port") %>%
  hc_yAxis(title = list(text = "Mean SPY rolling Vol"),
           showFirstLabel = FALSE,
           showLastLabel = FALSE,
           plotLines = list(
             list(value = mean(spy_rolling_sd), color = "#2b908f", width = 2))) %>%
  hc_navigator(enabled = FALSE) %>%
  hc_scrollbar(enabled = FALSE)

[Interactive highcharter chart with event flags and a horizontal line at the mean SPY rolling volatility]
Hover on the flags and you can see the text we added for explanation.
It’s remarkable how rolling volatility has absolutely plunged since early-to-mid 2016. Since August of 2016, both the portfolio and SPY rolling standard deviations have been well below the SPY mean.
Thanks for sticking with this three-part introduction to volatility. Next time, we’ll port our work to Shiny and play with different assets and allocations.
To leave a comment for the author, please follow the link and comment on their blog: R Views.
How to make maps with Census data in R
(This article was first published on randomjohn.github.io, and kindly contributed to R-bloggers)
US Census Data

The US Census collects a number of demographic measures and publishes aggregate data through its website. There are several ways to use Census data in R, from the Census API to the UScensus2010 package. If you are interested in geopolitical data in the US, I recommend exploring both of these options - the Census API requires a key for each person who uses it, and the package requires downloading a very large dataset. The setups for both require some effort, but once that effort is done you don’t have to do it again.
The acs package in R allows you to access the Census API easily. I highly recommend checking it out, and that’s the method we will use here. Note that I’ve already defined the variable api_key - if you are trying to run this code, you will need to first run something like api_key <- ... (supplying your own key as a string) before running the rest of this code.
library(acs)
api.key.install(key = api_key)
# now you are ready to run the rest of the acs code

For purposes here, we will use the toy example of plotting median household income by county for every county in South Carolina. First, we obtain the Census data. The first command gives us the table and variable names of what we want. I then use that table number in the acs.fetch command to get the variable I want.
acs.lookup(endyear = 2015, span = 5, dataset = "acs",
           keyword = c("median", "income", "family", "total"),
           case.sensitive = F)
## Warning in acs.lookup(endyear = 2015, span = 5, dataset = "acs", keyword = c("median", :
## XML variable lookup tables for this request seem to be missing from
## 'https://api.census.gov/data/2015/acs5/variables.xml';
## temporarily downloading and using archived copies instead;
## since this is *much* slower, recommend running
##   acs.tables.install()
## An object of class "acs.lookup"
## endyear= 2015 ; span= 5
##
## results:
##    variable.code table.number
## 1     B10010_001       B10010
## 2     B19126_001       B19126
## 3     B19126_002       B19126
## 4     B19126_005       B19126
## 5     B19126_006       B19126
## 6     B19126_009       B19126
## 7     B19215_001       B19215
## 8     B19215_002       B19215
## 9     B19215_003       B19215
## 10    B19215_006       B19215
## 11    B19215_009       B19215
## 12    B19215_010       B19215
## 13    B19215_013       B19215
##    table.name
## 1     Median Family Income for Families with GrndPrnt Householders Living With Own GrndChldrn < 18 Yrs
## 2-6   B19126. Median Family Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Family Type by Presence of Own Children Under 18 Years
## 7-13  B19215. Median Nonfamily Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Sex of Householder by Living Alone by Age of Householder
##    variable.name
## 1     Median family income in the past 12 months -- Total:
## 2     Median family income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Total:
## 3     ... -- Married-couple family -- Total
## 4     ... -- Other family -- Total
## 5     ... -- Other family -- Male householder, no wife present -- Total
## 6     ... -- Other family -- Female householder, no husband present -- Total
## 7     Median nonfamily household income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Total (dollars):
## 8     ... -- Male householder -- Total (dollars)
## 9     ... -- Male householder -- Living alone -- Total (dollars)
## 10    ... -- Male householder -- Not living alone -- Total (dollars)
## 11    ... -- Female householder -- Total (dollars)
## 12    ... -- Female householder -- Living alone -- Total (dollars)
## 13    ... -- Female householder -- Not living alone -- Total (dollars)

my_cnty <- geo.make(state = 45, county = "*")
home_median_price <- acs.fetch(geography = my_cnty, table.number = "B19126", endyear = 2015) # home median prices
## Warning in (function (endyear, span = 5, dataset = "acs", keyword, table.name, :
## XML variable lookup tables for this request seem to be missing from
## 'https://api.census.gov/data/2015/acs5/variables.xml';
## temporarily downloading and using archived copies instead;
## since this is *much* slower, recommend running
##   acs.tables.install()
## Error in if (url.test["statusMessage"] != "OK") {: missing value where TRUE/FALSE needed

knitr::kable(head(home_median_price@estimate))

                                  B19126_001 B19126_002 B19126_003 B19126_004 B19126_005 B19126_006 B19126_007 B19126_008 B19126_009 B19126_010 B19126_011
Abbeville County, South Carolina       44918      55141      65664      50698      24835      43187      50347      24886      22945      18101      29958
Aiken County, South Carolina           57396      70829      72930      70446      29302      36571      35469      37906      27355      22760      34427
Allendale County, South Carolina          NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA
Anderson County, South Carolina        53169      65881      75444      60166      26608      36694      37254      36297      24384      17835      29280
Bamberg County, South Carolina            NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA
Barnwell County, South Carolina        44224      59467      70542      54030      19864      25143      18633      45714      18317      13827      21315

Plotting the map data

If you have the maps and ggplot2 packages, you already have the data you need to plot. We use the map_data function from ggplot2 to pull in county shape data for South Carolina. (A previous attempt at this blog post had used the ggmap package, but there is an incompatibility between that and the latest ggplot2 package at the time of this writing.)
library(ggplot2)
## Want to understand how all the pieces fit together? Buy the
## ggplot2 book: http://ggplot2.org/book/

sc_map <- map_data("county", region = "south.carolina")
ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group), data = sc_map, colour = "white", fill = "black") +
  theme_minimal()

Merging the demographic and map data

Now we have the demographic data and the map, but merging the two will take a little effort. The reason is that the map data gives a lower-case representation of the county and calls it a "subregion", while the Census data returns the county as "xxxx County, South Carolina". I use the dplyr and stringr packages (for str_replace) to make short work of this merge.
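The heart of that merge is the county-name cleanup. Here is the same transformation done with base R's sub() on one toy label (the post itself uses stringr::str_replace, but the regex is identical):

```r
# Census-style county label vs. the map data's lower-case "subregion" format.
census_name <- "Abbeville County, South Carolina"

# Strip the " County, ..." suffix and lower-case the result, mirroring
# str_replace(county_full, "(.+) County.*", "\\1") %>% tolower.
county <- tolower(sub("(.+) County.*", "\\1", census_name))
county  # "abbeville"
```

Once both sides use that bare lower-case name, an ordinary join on the county column lines the income values up with the polygon coordinates.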
library(dplyr)
library(stringr)

merged <- as.data.frame(home_median_price@estimate) %>%
  mutate(county_full = rownames(.),
         county = str_replace(county_full, "(.+) County.*", "\\1") %>% tolower) %>%
  select(county, B19126_001) %>%
  rename(med_income = B19126_001) %>%
  right_join(sc_map, by = c("county" = "subregion"))

knitr::kable(head(merged, 10))

county     med_income       long      lat  group  order  region
abbeville       44918  -82.24809 34.41758      1      1  south carolina
abbeville       44918  -82.31685 34.35455      1      2  south carolina
abbeville       44918  -82.31111 34.33163      1      3  south carolina
abbeville       44918  -82.31111 34.29152      1      4  south carolina
abbeville       44918  -82.28247 34.26860      1      5  south carolina
abbeville       44918  -82.25955 34.25142      1      6  south carolina
abbeville       44918  -82.24809 34.21131      1      7  south carolina
abbeville       44918  -82.23663 34.18266      1      8  south carolina
abbeville       44918  -82.24236 34.15401      1      9  south carolina
abbeville       44918  -82.27674 34.10818      1     10  south carolina

It’s now a simple matter to plot this merged dataset. In fact, we only have to tweak a few things from the first time we plotted the map data.
ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group, fill = med_income), data = merged) +
  theme_minimal()

Discussion

It’s pretty easy to plot U.S. Census data on a map. The real power of Census data comes not just from plotting it, but from combining it with other geographically based data (such as crime). The acs package in R makes it easy to obtain Census data, which can then be merged with other data using packages such as dplyr and stringr and then plotted with ggplot2. Hopefully the authors of the ggmap and ggplot2 packages can work out their incompatibilities so that the above maps can be created using the Google API map or open street maps.
It should be noted that while I obtained county-level information, aggregate data can be obtained at Census block and tract levels as well, if you are looking to do some sort of localized analysis.
Package bigstatsr: Statistics with matrices on disk (useR 2017)
In this post, I will talk about my package bigstatsr, which I’ve just presented in a five-minute lightning talk at useR!2017. You can listen to me in action there. I should have chosen a longer talk to explain more about this package; maybe next time. I will use this post to give you a more detailed version of the talk I gave in Brussels.
Motivation behind bigstatsr

I’m a PhD student in predictive human genetics. I’m basically trying to predict someone’s risk of disease based on their DNA mutations. These DNA mutations come in the form of large matrices; I’m currently working with a matrix of 15K rows and 300K columns. This matrix would take approximately 32GB of RAM if stored as a standard R matrix.
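A quick back-of-the-envelope check of that figure (each element of a standard double-precision R matrix takes 8 bytes):

```r
n <- 15e3   # rows
p <- 300e3  # columns

bytes <- n * p * 8   # 8 bytes per double
bytes / 1024^3       # roughly 33.5 GiB, in the same ballpark as the figure above
```

This is why the matrix cannot comfortably live in memory on an ordinary machine, let alone be copied.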
When I began studying this dataset, I had only 8GB of RAM on my computer. I now have 64GB of RAM, but copying this matrix just once would make my computer begin swapping and therefore slow down. I found a convenient solution in the big.matrix object provided by the R package bigmemory (Kane, Emerson, and Weston 2013). With this solution, you can access a matrix that is stored on disk almost as if it were a standard R matrix in memory.
Yet, for these special matrices, some useful statistical functions were missing or not fast enough, so I implemented them. It was a good exercise in programming optimized and parallelized algorithms. I’m aware that there are other packages built around bigmemory, such as biganalytics and bigalgebra, that already implement some of the features I will talk about in this post, and I will discuss why I don’t use those packages. However, I do use the work of other packages such as biglasso and RSpectra.
Introduction to bigmemory

# loading package bigstatsr (and bigmemory)
library(bigstatsr)

# initializing some matrix on disk: wrapper to bigmemory::big.matrix()
mat <- FBM(backingroot = "matrix-on-disk", descriptor = FALSE)(5e3, 10e3)
## Creating directory "backingfiles" which didn't exist..

dim(mat)
## [1]  5000 10000

mat[1:5, 1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0

mat[1, 1] <- 2
mat[1:5, 1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0

mat[2:4] <- 3
mat[1:5, 1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    0    0    0    0
## [2,]    3    0    0    0    0
## [3,]    3    0    0    0    0
## [4,]    3    0    0    0    0
## [5,]    0    0    0    0    0

mat[, 2:3] <- rnorm(2 * nrow(mat))
mat[1:5, 1:5]
##      [,1]      [,2]      [,3] [,4] [,5]
## [1,]    2 0.1611891 3.1218981    0    0
## [2,]    3 0.6026680 1.3935111    0    0
## [3,]    3 0.4743703 0.6566549    0    0
## [4,]    3 0.1514252 0.8247594    0    0
## [5,]    0 1.5167186 0.7329260    0    0

What we can see is that big matrices (big.matrix objects) can be accessed (read/write) almost as if they were standard R matrices, but you have to be cautious. For example, doing mat[1, ] isn’t recommended. Indeed, big matrices, like standard R matrices, are stored by column, so the data is in fact one big vector with columns stored one after the other, contiguously. Accessing the first row would therefore access elements that are not stored contiguously in memory, which is slow. One should always access columns rather than rows.
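The column-major point applies to standard R matrices too. This small sketch shows that a column is one contiguous slice of the underlying vector, while a row is strided:

```r
m <- matrix(1:12, nrow = 3)  # stored column by column: 1, 2, 3, 4, ..., 12

# A column is contiguous in the underlying vector...
identical(m[, 2], 4:6)                  # TRUE

# ...while a row picks every nrow-th element, which is why row access is slow
# for large on-disk matrices.
identical(m[2, ], c(2L, 5L, 8L, 11L))   # TRUE
```

For an in-memory matrix the penalty is small, but for a file-backed big.matrix each strided access can touch a different page on disk.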
Apply an R function to a big matrix

An easy strategy for applying an R function to a big matrix is the split-apply-combine strategy (Wickham 2011). For example, you could access only a block of columns at a time, apply a (vectorized) function to this block, and then combine the results of all blocks. This is implemented in the function big_apply().
# Compute the sums of the first 1000 columns
colsums_1 <- colSums(mat[, 1:1000])
# Compute the sums of the second block of 1000 columns
colsums_2 <- colSums(mat[, 1001:2000])
# Combine the results
colsums_1_2 <- c(colsums_1, colsums_2)

# Do this automatically with big_apply()
colsums_all <- big_apply(mat, a.FUN = function(X, ind) colSums(X[, ind]), a.combine = 'c')

When the split-apply-combine strategy can be used for a given function, you can use big_apply() to get the results while accessing only small blocks of columns (or rows) at a time. Package biganalytics, by the creators of bigmemory, provides a way to apply an R function to the margins of a big.matrix. Yet, for example, if the big.matrix has a lot of columns, it is much slower to loop through all columns than to apply a vectorized function to blocks of columns. You can find more examples of applications for big_apply() there.
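The same split-apply-combine idea can be written generically in base R. This toy version is a simplified stand-in for big_apply() (not its actual implementation) that splits the columns into blocks, applies a function to each block, and combines the results:

```r
# A simplified sketch of the split-apply-combine idea behind big_apply().
block_apply <- function(mat, FUN, block_size = 1000, combine = c) {
  starts <- seq(1, ncol(mat), by = block_size)
  res <- lapply(starts, function(s) {
    ind <- s:min(s + block_size - 1, ncol(mat))
    FUN(mat[, ind, drop = FALSE])   # only this block is materialized at once
  })
  do.call(combine, res)
}

m <- matrix(rnorm(100 * 2500), nrow = 100)
all.equal(block_apply(m, colSums), colSums(m))  # TRUE
```

With a file-backed matrix, each block would be read from disk on demand, so peak memory stays bounded by the block size rather than the full matrix.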
colsums_all2 <- biganalytics::apply(mat, 2, sum)
all.equal(colsums_all2, colsums_all)
## [1] TRUE

Use Rcpp with a big matrix

Using Rcpp with a big.matrix is super easy. Let’s use the previous example, i.e. the computation of the colsums of a big.matrix. We will do it in 3 different ways.
1. Using the bigmemory way

// [[Rcpp::depends(bigmemory, BH)]]
#include <bigmemory/MatrixAccessor.hpp>
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector bigcolsums(const S4& BM) {

  XPtr<BigMatrix> xpMat = BM.slot("address");
  MatrixAccessor<double> macc(*xpMat);

  int n = macc.nrow();
  int m = macc.ncol();

  NumericVector res(m); // vector of m zeros
  int i, j;
  for (j = 0; j < m; j++)
    for (i = 0; i < n; i++)
      res[j] += macc[j][i];

  return res;
}

colsums_all3 <- bigcolsums(mat)
all.equal(colsums_all3, colsums_all)
## [1] TRUE

2. Using the bigstatsr way

// [[Rcpp::depends(bigstatsr, bigmemory, BH)]]
#include <bigstatsr.h>

// [[Rcpp::export]]
NumericVector bigcolsums2(const S4& BM,
                          const IntegerVector& rowInd,
                          const IntegerVector& colInd) {

  XPtr<BigMatrix> xpMat = BM.slot("address");
  // C++ indices begin at 0
  IntegerVector rows = rowInd - 1;
  IntegerVector cols = colInd - 1;
  // An accessor of only part of the big.matrix
  SubMatAcc<double> macc(*xpMat, rows, cols);

  int n = macc.nrow();
  int m = macc.ncol();

  NumericVector res(m); // vector of m zeros
  int i, j;
  for (j = 0; j < m; j++)
    for (i = 0; i < n; i++)
      res[j] += macc(i, j);

  return res;
}

colsums_all4 <- bigcolsums2(mat, rows_along(mat), cols_along(mat))
all.equal(colsums_all4, colsums_all)
## [1] TRUE

In bigstatsr, most of the functions have parameters for subsetting rows and columns because it is often useful. One of the main reasons why I don’t use package bigalgebra is its lack of subsetting options.
3. Use an already implemented function

str(colsums_all5 <- big_colstats(mat))
## 'data.frame': 10000 obs. of 2 variables:
##  $ sum: num 11 139.6 73.2 0 0 ...
##  $ var: num 0.0062 1.0117 1.0164 0 0 ...
all.equal(colsums_all5$sum, colsums_all)
## [1] TRUE

Principal Component Analysis

Let’s begin by filling the matrix with random numbers in a tricky way.
U <- sweep(matrix(rnorm(nrow(mat) * 10), ncol = 10), 2, 1:10, "/")
V <- matrix(rnorm(ncol(mat) * 10), ncol = 10)
big_apply(mat, a.FUN = function(X, ind) {
  X[, ind] <- tcrossprod(U, V[ind, ]) + rnorm(nrow(X) * length(ind))
  NULL
}, a.combine = 'c')
## NULL

Let’s say we want the first 10 PCs of the (scaled) matrix.
system.time(
  small_svd <- svd(scale(mat[, 1:2000]), nu = 10, nv = 10)
)
##    user  system elapsed
##   8.348   0.063   8.429
system.time(
  small_svd2 <- big_SVD(mat, big_scale(), ind.col = 1:2000)
)
##    user  system elapsed
##   1.900   0.083   1.989
plot(small_svd2$u, small_svd$u)
system.time(
  small_svd3 <- big_randomSVD(mat, big_scale(), ind.col = 1:2000)
)
##    user  system elapsed
##   0.353   0.002   0.355
plot(small_svd3$u, small_svd$u)
system.time(
  svd_all <- big_randomSVD(mat, big_scale())
)
##    user  system elapsed
##   2.267   0.001   2.268
plot(svd_all)

The function big_randomSVD() uses Rcpp and the package RSpectra to implement a fast Singular Value Decomposition for a big.matrix that is linear in all dimensions (the standard PCA algorithm is quadratic in the smallest dimension), which makes it very fast even for datasets where both dimensions are large.
Some linear models

M <- 100 # number of causal variables
set <- sample(ncol(mat), M)
y <- mat[, set] %*% rnorm(M)
y <- y + rnorm(length(y), sd = 2 * sd(y))
ind.train <- sort(sample(nrow(mat), size = 0.8 * nrow(mat)))
ind.test <- setdiff(rows_along(mat), ind.train)
mult_test <- big_univLinReg(mat, y[ind.train], ind.train = ind.train,
                            covar.train = svd_all$u[ind.train, ])
library(ggplot2)
plot(mult_test) +
  aes(color = cols_along(mat) %in% set) +
  labs(color = "Causal?")
train <- big_spLinReg(mat, y[ind.train], ind.train = ind.train,
                      covar.train = svd_all$u[ind.train, ],
                      alpha = 0.5)
pred <- predict(train, X. = mat, ind.row = ind.test,
                covar.row = svd_all$u[ind.test, ])
plot(apply(pred, 2, cor, y = y[ind.test]))

The functions big_spLinReg(), big_spLogReg() and big_spSVM() all use lasso (L1) or elastic-net (L1 & L2) regularization in order to limit the number of predictors and to accelerate computations thanks to strong rules (R. Tibshirani et al. 2012). The implementations of these functions are based on modifications of packages sparseSVM and biglasso (Zeng and Breheny 2017). Yet, these models give predictions for a range of 100 different regularization parameters, whereas we are only interested in one prediction.
So, that’s why I came up with the idea of Cross-Model Selection and Averaging (CMSA), whose principle is:
 This function separates the training set into K folds (e.g. 10).
 In turn,
 each fold is considered as an inner validation set and the other (K – 1) folds form an inner training set,
 the model is trained on the inner training set and the corresponding predictions (scores) for the inner validation set are computed,
 the vector of scores which maximizes feval is determined,
 the vector of coefficients corresponding to the previous vector of scores is chosen.
 The K resulting vectors of coefficients are then combined into one vector.
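The steps above can be sketched in R. This is a schematic only: the coefficient field fit$beta, the feval default, and the exact big_spLinReg()/predict() interfaces here are assumptions for illustration, not the package's documented internals:

```r
# Schematic sketch of CMSA, assuming the fitted object exposes a
# coefficient matrix `beta` with one column per regularization parameter
# (hypothetical field name, for illustration only).
cmsa <- function(X, y, ind.train, K = 10,
                 feval = function(pred, y.val) cor(pred, y.val)) {
  folds <- split(sample(ind.train), rep_len(seq_len(K), length(ind.train)))
  betas <- lapply(folds, function(ind.val) {
    ind.in <- setdiff(ind.train, ind.val)              # inner training set
    fit <- big_spLinReg(X, y[ind.in], ind.train = ind.in)
    scores <- predict(fit, X, ind.row = ind.val)       # one column per model
    best <- which.max(apply(scores, 2, feval, y.val = y[ind.val]))
    fit$beta[, best]                                   # best coefficient vector
  })
  Reduce(`+`, betas) / K                               # average the K vectors
}
```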
For example, let’s compute the correlation of the first 2000 columns.
system.time(
  corr <- cor(mat[, 1:2000])
)
##    user  system elapsed
##  10.822   0.007  10.824
system.time(
  corr2 <- big_cor(mat, ind.col = 1:2000)
)
##    user  system elapsed
##   0.729   0.005   0.737
all.equal(corr2, corr)
## [1] TRUE

Advantages of using big.matrix objects
 you can apply algorithms on 100 GB of data,
 you can easily parallelize your algorithms because the data on disk is shared,
 you write more efficient algorithms,
 you can use different types of data, for example, in my field, I’m storing my data with only 1 byte per element (rather than 8 bytes for a standard R matrix).
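The 1-byte storage in the last point maps onto bigmemory's element types; a minimal sketch:

```r
library(bigmemory)
# bigmemory element types: "char" stores 1 byte per element, "double" 8 bytes,
# so the same matrix takes 8x less space when 1 byte per value is enough.
m_char <- big.matrix(1e3, 1e3, type = "char")
m_dbl  <- big.matrix(1e3, 1e3, type = "double")
```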
In a future post, I’ll try to talk about good practices for using parallelism in R.
ReferencesKane, Michael J, John W Emerson, and Stephen Weston. 2013. “Scalable Strategies for Computing with Massive Data.” Journal of Statistical Software 55 (14): 1–19. doi:10.18637/jss.v055.i14.
Tibshirani, Robert, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Taylor, and Ryan J Tibshirani. 2012. “Strong rules for discarding predictors in lasso-type problems.” Journal of the Royal Statistical Society. Series B, Statistical Methodology 74 (2): 245–66. doi:10.1111/j.1467-9868.2011.01004.x.
Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (1): 1–29. doi:10.18637/jss.v040.i01.
Zeng, Yaohui, and Patrick Breheny. 2017. “The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R,” January. http://arxiv.org/abs/1701.05936.
What analysis program do conservation scientists use?
(This article was first published on Bluecology blog, and kindly contributed to R-bloggers)
International Congress for Conservation Biology: What analysis program do conservation scientists use?

With the International Congress for Conservation Biology starting 23rd July I was wondering, what analysis programs are most commonly used by conservation scientists? And, what do they use for spatial analysis and mapping?
To find out, if you are a conservation scientist please participate in these interactive polls:
Favourite analysis programs
Favourite programs for spatial analysis
Please share this post with your friends and colleagues. I will follow up with a discussion once we get some results in, hopefully during ICCB 2017 in Colombia.
If you choose one of the ‘other’ options, let me know via Twitter what program you do use.
To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Quirks about running Rcpp on Windows through RStudio
(This article was first published on R – Stat Bandit, and kindly contributed to R-bloggers)
Quirks about running Rcpp on Windows through RStudio

This is a quick note about some tribulations I had running Rcpp (v. 0.12.12) code through RStudio (v. 1.0.143) on a Windows 7 box running R (v. 3.3.2). I also have RTools v. 3.4 installed. I fully admit that this may very well be specific to my box, but I suspect not.
I kept running into problems with Rcpp complaining that (a) RTools wasn’t installed, and (b) the C++ compiler couldn’t find Rcpp.h. First, devtools::find_rtools was giving a positive result, so (a) was not true. Second, I noticed that the wrong C++ compiler was being called. Even more frustrating was the fact that everything was working if I worked on a native R console rather than RStudio. So there was nothing inherently wrong with the code or setup, but rather the environment RStudio was creating.
After some searching the interwebs and StackOverflow, the following solution worked for me. I added the following lines to my global .Rprofile file:
Sys.setenv(PATH = paste(Sys.getenv("PATH"),
                        "C:/RBuildTools/3.4/bin/",
                        "C:/RBuildTools/3.4/mingw_64/bin",
                        sep = ";"))
Sys.setenv(BINPREF = "C:/RBuildTools/3.4/mingw_64/bin/")

Note that C:/RBuildTools is the default location suggested when I installed RTools.
This solution is indicated here, but I have the reverse issue of the default setup working in R and not in the latest RStudio. However, the solution still works!!
Note that instead of putting it in the global .Rprofile, you could put it in a project-specific .Rprofile, or even in your R code, as long as it is run before loading the Rcpp or derivative packages. Note also that if you use binary packages that use Rcpp, there is no problem. Only when you’re compiling C++ code, either for your own code or for building a package from source, is this an issue. And, as far as I can tell, only on Windows.
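If you want to confirm the fix took effect in a fresh RStudio session, a few generic checks (not specific to this setup) can help:

```r
# Run these inside RStudio after restarting R:
Sys.getenv("PATH")     # should now contain the RBuildTools bin directories
Sys.getenv("BINPREF")  # should point at the mingw_64 bin directory
Sys.which("g++")       # shows which C++ compiler R will actually invoke
```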
Hope this prevents someone else from 3 hours of heartburn trying to make Rcpp work on a Windows box. And, if this has already been fixed in RStudio, please comment and I’ll be happy to update this post.
Quickly Check your id Variables
(This article was first published on That’s so Random, and kindly contributed to R-bloggers)
Virtually every dataset has them: id variables that link a record to a subject and/or time point. Often one column, or a combination of columns, forms the unique id of a record. For instance, the combination of patient_id and visit_id, or ip_address and visit_time. The first step in most of my analyses is almost always checking the uniqueness of a variable, or a combination of variables. If it is not unique, many assumptions about the data may be wrong, or there may be data quality issues. Since I do this so often, I decided to make a little wrapper around this procedure. The unique_id function will return TRUE if the evaluated variables indeed form the unique key to a record. If not, it will return all the records for which the id variable(s) are duplicated, so we can pinpoint the problem right away. It uses dplyr v0.7.1, so make sure that it is loaded.
library(dplyr)
some_df <- data_frame(a = c(1, 2, 3, 3, 4),
                      b = 101:105,
                      val = round(rnorm(5), 1))
some_df %>% unique_id(a)
## # A tibble: 2 x 3
##       a     b   val
##   <dbl> <int> <dbl>
## 1     3   103   1.8
## 2     3   104   1.1
some_df %>% unique_id(a, b)
## [1] TRUE

Here you find the source code of the function. You can also obtain it by installing the package accompanying this blog with devtools::install_github("edwinth/thatssorandom").
unique_id <- function(x, ...) {
  id_set <- x %>% select(...)
  id_set_dist <- id_set %>% distinct()
  if (nrow(id_set) == nrow(id_set_dist)) {
    TRUE
  } else {
    non_unique_ids <- id_set %>%
      filter(id_set %>% duplicated()) %>%
      distinct()
    suppressMessages(
      inner_join(non_unique_ids, x) %>% arrange(...)
    )
  }
}
Data Analysis for Life Sciences
Rafael Irizarry from the Harvard T.H. Chan School of Public Health has presented a number of courses on R and Biostatistics on EdX, and he recently also provided an index of all of the course modules as YouTube videos with supplemental materials. The EdX courses are linked below, which you can take for free, or simply follow the series of YouTube videos and materials provided in the index.
Data Analysis for the Life Sciences Series
 Statistics and R
 Introduction to Linear Models and Matrix Algebra
 Statistical Inference and Modeling for High-throughput Experiments
 High-Dimensional Data Analysis
A companion book and associated R Markdown documents are also available for download.
Genomics Data Analysis Series
 Introduction to Bioconductor: Annotation and Analysis of Genomes and Genomic Assays
 High-performance computing for reproducible genomics with Bioconductor
 Case Studies in Functional Genomics
For links to all of the course components, including videos and supplementary materials, follow the link below.
rafalab: HarvardX Biomedical Data Science Open Online Training
Preparing for the Plumber v0.4.0 Release
(This article was first published on Trestle Technology, LLC – R, and kindly contributed to R-bloggers)
Plumber is a package which allows you to create web APIs from your R code. If you’re new to Plumber, you can take a look at www.rplumber.io to learn more about how to use the package to create your APIs.
Over the years, we’ve noticed a handful of things that we wished we’d done differently in the development of Plumber. In particular, there were some decisions that prioritized convenience over security which we wanted to roll back. We’ve decided to bite the bullet and make these changes before more users start using Plumber. The v0.4.0 release of Plumber includes a handful of breaking changes that mitigate these issues all at once. Our hope in getting all of these out of the way at once is to ensure that users building on Plumber moving forward have confidence that any breaking change under consideration has already been made and that we shouldn’t have this kind of churn again anytime soon.
If you’re already using Plumber, we strongly encourage you to read the v0.4.0 migration guide below. As you’ll see, most of these changes affect the hosting or low-level interactions in Plumber. The dominant mode in which people use Plumber (adding comments to their existing R functions) has not changed; there are no breaking changes in how those files are interpreted.
v0.4.0 of Plumber will be deployed to CRAN in the coming days. We strongly encourage you to try out the v0.4.0 version of Plumber ahead of time to make sure that you’ve migrated your APIs correctly.
devtools::install_github("trestletech/plumber")

Alternatively, you can continue using the last release of Plumber which didn’t include any of these breaking changes (v0.3.3) until you’re ready to upgrade, by using the following command:
devtools::install_github("trestletech/plumber", ref="v0.3.3")

You can see the full release notes for v0.4.0 here.
Plumber v0.4.0 Migration Guide

There are a number of changes that users should consider when preparing to upgrade to Plumber v0.4.0.
 Plumber no longer accepts external connections by default. The host parameter for the run() method now defaults to 127.0.0.1, meaning that Plumber will only listen for incoming requests from the local machine on which it’s running — not from any other machine on the network. This is done for security reasons so that you don’t accidentally expose a Plumber API that you’re developing to your entire network. To restore the old behavior in which Plumber listened for connections from any machine on the network, use $run(host="0.0.0.0"). Note that if you’re deploying to an environment that includes an HTTP proxy (such as the DigitalOcean servers which use nginx), having Plumber listen only on 127.0.0.1 is likely the right default, as your proxy — not Plumber — is the one receiving external connections.
 Plumber no longer sets the Access-Control-Allow-Origin HTTP header to *. This was previously done for convenience but given the security implications we’re reversing this decision. The previous behavior would have allowed web browsers to make requests of your API from other domains using JavaScript if the request used only standard HTTP headers and was a GET, HEAD, or POST request. These requests will no longer work by default. If you wish to allow an endpoint to be accessible from other origins in a web browser, you can use res$setHeader("Access-Control-Allow-Origin", "*") in an endpoint or filter.
 Rather than setting the default port to 8000, the port is now randomly selected. This ensures that a shared server (like RStudio Server) will be able to support multiple people developing Plumber APIs concurrently without them having to manually identify an available port. This can be controlled by specifying the port parameter in the run() method or by setting the plumber.port option.
 The object-oriented model for Plumber routers has changed. If you’re calling any of the following methods on your Plumber router, you will need to modify your code to use the newer alternatives: addFilter, addEndpoint, addGlobalProcessor, and addAssets. The code around these functions has undergone a major rewrite and some breaking changes have been introduced. These four functions are still supported with a deprecation warning in 0.4.0, but support is only best-effort. Certain parameters on these methods are no longer supported, so you should thoroughly test any Plumber API that leverages any of these methods before deploying version 0.4.0. Updated documentation for using Plumber programmatically is now available.
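Taken together, a hedged sketch of opting back into the pre-0.4.0 behaviour might look like this (the endpoint path and file names are made up for illustration):

```r
library(plumber)

# plumber.R -------------------------------------------------------
#* @get /status
function(res) {
  # Re-enable cross-origin access for this one endpoint only:
  res$setHeader("Access-Control-Allow-Origin", "*")
  list(ok = TRUE)
}

# run.R -----------------------------------------------------------
pr <- plumb("plumber.R")
# Listen on all interfaces and a fixed port, as pre-0.4.0 versions did:
pr$run(host = "0.0.0.0", port = 8000)
```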
The Viral Recurring Decimal: Euler Problem 26
(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)
Proposed solution to Euler Problem 26 in the R language. Find the value of d < 1000 for which 1/d contains the longest recurring decimal cycle.
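The post links out to the full write-up; as a hedged sketch (not necessarily the author's approach), the cycle length of 1/d can be found by simulating long division and counting steps until a remainder repeats:

```r
# Length of the recurring cycle of 1/d: track remainders of long division
# and stop when one repeats (or hits 0, i.e. the decimal terminates).
cycle_length <- function(d) {
  seen <- integer(0)
  r <- 1
  while (r != 0 && !(r %in% seen)) {
    seen <- c(seen, r)
    r <- (r * 10) %% d
  }
  if (r == 0) 0L else length(seen) - match(r, seen) + 1L
}

cycle_length(7)  # 1/7 = 0.(142857), so the cycle length is 6
which.max(vapply(2:999, cycle_length, integer(1))) + 1  # d with the longest cycle
```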
The post The Viral Recurring Decimal: Euler Problem 26 appeared first on The Devil is in the Data.
Ligature fonts for R
(This article was first published on R – Benomics, and kindly contributed to R-bloggers)
Ligature fonts are fonts which sometimes map multiple characters to a single glyph, either for readability or just because it looks neat. Importantly, this only affects the rendering of the text with said font, while the distinct characters remain in the source.
Maybe ligatures are an interesting topic in themselves if you’re into typography, but it’s the relatively modern monospaced variants which are significantly more useful in the context of R programming.
Two of the most popular fonts in this category are:
 Fira Code — an extension of Fira Mono which really goes all out providing a wide range of ligatures for obscure Haskell operators, as well as the more standard set which will be used when writing R
 Hasklig — a fork of Source Code Pro (in my opinion a nicer base font) which is more conservative with the ligatures it introduces
Here’s some code to try out with these ligature fonts, first rendered in a bog-standard monospace font:
library(magrittr)
library(tidyverse)

filtered_storms <- dplyr::storms %>%
  filter(category == 5, year >= 2000) %>%
  unite("date", year:day, sep = "-") %>%
  group_by(name) %>%
  filter(pressure == max(pressure)) %>%
  mutate(date = as.Date(date)) %>%
  arrange(desc(date)) %>%
  ungroup() %T>%
  print()

Here’s the same code rendered with Hasklig:
Some of the glyphs on show here are:
 A single arrow glyph for less-than hyphen (<-)
 Altered spacing around two colons (::)
 Joined-up double-equals
Fira Code takes this a bit further and also converts >= to a single glyph:
In my opinion these fonts are a nice and clear way of reading and writing R. In particular, the single arrow glyph harks back to the APL keyboards with real arrow keys, for which our modern two-character <- is a poor substitute.
One downside could be a bit of confusion when showing your IDE to someone else, or maybe writing slightly longer lines than it appears, but personally I’m a fan and my RStudio is now in Hasklig.
Neville’s Method of Polynomial Interpolation
(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)
Part 1 of 5 in the series Numerical Analysis

Neville’s method evaluates a polynomial that passes through a given set of x and y points for a particular x value using the Newton polynomial form. Neville’s method is similar to a now-defunct procedure named Aitken’s algorithm and is based on the divided differences recursion relation (“Neville’s Algorithm”, n.d).
It was stated before, in a previous post on Lagrangian polynomial interpolation, that there exists a Lagrange polynomial that passes through points y_0, y_1, ..., y_k, where each index i is a distinct integer, at corresponding x values x_0, x_1, ..., x_k. The points are denoted (x_0, y_0), (x_1, y_1), ..., (x_k, y_k).
Neville’s Method

Neville’s method can be stated as follows:

Let a function f be defined at points x_0, x_1, ..., x_k, where x_i and x_j are two distinct members. For each k, there exists a Lagrange polynomial P that interpolates the function f at the points x_0, x_1, ..., x_k. The kth Lagrange polynomial is defined as:

P(x) = [(x - x_j) P_{0,1,...,j-1,j+1,...,k}(x) - (x - x_i) P_{0,1,...,i-1,i+1,...,k}(x)] / (x_i - x_j)

For ease of notation, the polynomial P_{i-j,i-j+1,...,i} (interpolating at x_{i-j}, ..., x_i) is often denoted Q_{i,j}, which gives the recursion in a more convenient form:

Q_{i,j}(x) = [(x - x_{i-j}) Q_{i,j-1}(x) - (x - x_i) Q_{i-1,j-1}(x)] / (x_i - x_{i-j})
The interpolating polynomials can thus be generated recursively, which we will see in the following example:
Neville’s Method Example

Consider the following table of x and corresponding y values:

x      y
8.1    16.9446
8.3    17.56492
8.6    18.50515
8.7    18.82091

Suppose we are interested in interpolating a polynomial that passes through these points to approximate the resulting y value at an x value of 8.4.
We can construct the interpolating polynomial approximations using the recursion above. With Q_{i,0} set to the given y values, the first column of approximations at x = 8.4 is:

Q_{1,1} = 17.87508, Q_{2,1} = 17.87833, Q_{3,1} = 17.87363

The approximated values are then used in the next iteration:

Q_{2,2} = 17.87703, Q_{3,2} = 17.87716

Then the final iteration yields the approximated y value for the given x value:

Q_{3,3} = 17.87709

Therefore 17.87709 is the approximated value at the point x = 8.4.
Neville’s Method in R

The following function is an implementation of Neville’s method for interpolating and evaluating a polynomial.
poly.neville <- function(x, y, x0) {
  n <- length(x)
  q <- matrix(data = 0, n, n)
  q[, 1] <- y
  for (i in 2:n) {
    for (j in i:n) {
      q[j, i] <- ((x0 - x[j - i + 1]) * q[j, i - 1] -
                  (x0 - x[j]) * q[j - 1, i - 1]) / (x[j] - x[j - i + 1])
    }
  }
  res <- list('Approximated value' = q[n, n],
              'Neville iterations table' = q)
  return(res)
}

Let’s test this function to see if it reports the same result as what we found earlier.
x <- c(8.1, 8.3, 8.6, 8.7)
y <- c(16.9446, 17.56492, 18.50515, 18.82091)
poly.neville(x, y, 8.4)
## $`Approximated value`
## [1] 17.87709
##
## $`Neville iterations table`
##          [,1]     [,2]     [,3]     [,4]
## [1,] 16.94460  0.00000  0.00000  0.00000
## [2,] 17.56492 17.87508  0.00000  0.00000
## [3,] 18.50515 17.87833 17.87703  0.00000
## [4,] 18.82091 17.87363 17.87716 17.87709

The approximated value is reported as 17.87709, the same value we calculated previously (minus a few decimal places). The function also outputs the iteration table that stores the intermediate results.
The pracma package contains the neville() function which also performs Neville’s method of polynomial interpolation and evaluation.
library(pracma)
neville(x, y, 8.4)
## [1] 17.87709

The neville() function reports the same approximated value that we found with our manual calculations and function.
ReferencesBurden, R. L., & Faires, J. D. (2011). Numerical analysis (9th ed.). Boston, MA: Brooks/Cole, Cengage Learning.
Cheney, E. W., & Kincaid, D. (2013). Numerical mathematics and computing (6th ed.). Boston, MA: Brooks/Cole, Cengage Learning.
Neville’s algorithm. (2016, January 2). In Wikipedia, The Free Encyclopedia. From https://en.wikipedia.org/w/index.php?title=Neville%27s_algorithm&oldid=697870140
The post Neville’s Method of Polynomial Interpolation appeared first on Aaron Schlegel.
To leave a comment for the author, please follow the link and comment on their blog: R – Aaron Schlegel. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
ArCo Package v0.2 is on CRAN
(This article was first published on R – insightR, and kindly contributed to R-bloggers)
The ArCo package 0.2 is now available on CRAN. The functions are now more user friendly. The new features are:
 A default estimation function is used if the user does not supply the fn and p.fn functions. The default model is Ordinary Least Squares.
 The user can now add extra arguments to the fn function in the call.
 The data will be automatically coerced when possible.
Data wrangling : Transforming (2/3)
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
Data wrangling is a task of great importance in data analysis. Data wrangling is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process, estimated to take about 60-80% of an analyst's time. In this series we will go through this process. It will be a brief series with the goal of sharpening the reader's skills on the data wrangling task. This is the third part of the series, and it aims to cover the transforming of the data used. This can include filtering, summarizing, and ordering your data by different means. It also includes combining various data sets, creating new variables, and many other manipulation tasks. In this post, we will go through a few more advanced transformation tasks on the mtcars data set.
Before proceeding, it might be helpful to look over the help pages for the group_by, ungroup, summary, summarise, arrange, mutate, and cumsum functions.
Moreover please load the following libraries.
install.packages("dplyr")
library(dplyr)
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Create a new object named cars_cyl and assign to it the mtcars data frame grouped by the variable cyl
Hint: be careful about the data type of the variable; in order to be used for grouping, it has to be a factor.
Exercise 2
Remove the grouping from the object cars_cyl
Exercise 3
Print out the summary statistics of the mtcars data frame using the summary function and pipeline symbols %>%.
Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:
 Work with popular libraries such as dplyr
 Learn about methods such as pipelines
 And much more
Exercise 4
Make a more descriptive summary statistics output containing the 4 quantiles, the mean, the standard deviation and the count.
Exercise 5
Print out the average hp for every cyl category
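If you get stuck, one possible dplyr approach looks like this (a sketch, not necessarily the official solution from the answers page):

```r
library(dplyr)

# Average hp for every cyl category
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_hp = mean(hp))
```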
Exercise 6
Print out the mtcars data frame sorted by hp (ascending order)
Exercise 7
Print out the mtcars data frame sorted by hp (descending order)
Exercise 8
Create a new object named cars_per containing the mtcars data frame along with a new variable called performance and calculated as performance = hp/mpg
Exercise 9
Print out the cars_per data frame, sorted by performance in descending order and create a new variable called rank indicating the rank of the cars in terms of performance.
Exercise 10
To wrap everything up, we will use the iris data set. Print out the mean of every variable for every Species, and create two new variables called Sepal.Density and Petal.Density, calculated as Sepal.Density = Sepal.Length * Sepal.Width and Petal.Density = Sepal.Length * Petal.Width respectively.
Related exercise sets: Building Shiny App exercises part 6
 Density-Based Clustering Exercises
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part 2)
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
New R Course: Writing Efficient R Code
Hello R users, we’ve got a brand new course today: Writing Efficient R Code by Colin Gillespie.
The beauty of R is that it is built for performing data analysis. The downside is that sometimes R can be slow, thereby obstructing our analysis. For this reason, it is essential to become familiar with the main techniques for speeding up your analysis, so you can reduce computational time and get insights as quickly as possible.
Writing Efficient R Code features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a master at writing efficient, fast R code!
What you’ll learn:
Chapter 1: The Art of Benchmarking
In order to make your code go faster, you need to know how long it takes to run.
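For a taste of what that involves, here is a sketch using only base R's system.time() (the course itself covers dedicated benchmarking tools):

```r
# Compare two ways of building the same vector: growing it inside a
# loop versus allocating it in one vectorized call.
grow <- function(n) {
  x <- numeric(0)
  for (i in seq_len(n)) x <- c(x, i)   # repeated reallocation, slow
  x
}

vectorized <- function(n) as.numeric(seq_len(n))

t_grow <- system.time(grow(1e4))["elapsed"]
t_vec  <- system.time(vectorized(1e4))["elapsed"]
```

On most machines the vectorized version is dramatically faster, even though both produce identical results.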
Chapter 2: Fine Tuning – Efficient Base R
R is flexible because you can often solve a single problem in many different ways. Some ways can be several orders of magnitude faster than others.
Chapter 3: Diagnosing Problems – Code Profiling
Profiling helps you locate the bottlenecks in your code.
Chapter 4: Turbo Charged Code – Parallel Programming
Some problems can be solved faster using multiple cores on your machine. This chapter shows you how to write R code that runs in parallel.
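As a flavor of that chapter, a minimal sketch with base R's parallel package (the cluster size here is arbitrary):

```r
library(parallel)

# Run a simple computation across two worker processes.
cl <- makeCluster(2)
res <- parLapply(cl, 1:4, function(i) i^2)
stopCluster(cl)

unlist(res)
## [1]  1  4  9 16
```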
Learn how to write efficient R code today!
Short course on Bayesian data analysis and Stan 23-25 Aug in NYC!
(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)
Jonah “ShinyStan” Gabry, Mike “Riemannian NUTS” Betancourt, and I will be giving a three-day short course next month in New York, following the model of our successful courses in 2015 and 2016.
Before class everyone should install R, RStudio and RStan on their computers. (If you already have these, please update to the latest version of R and the latest version of Stan.) If problems occur, please join the stan-users group and post any questions. It’s important that all participants get Stan running and bring their laptops to the course.
Class structure and example topics for the three days:
Day 1: Foundations
Foundations of Bayesian inference
Foundations of Bayesian computation with Markov chain Monte Carlo
Intro to Stan with handson exercises
Reallife Stan
Bayesian workflow
Day 2: Linear and Generalized Linear Models
Foundations of Bayesian regression
Fitting GLMs in Stan (logistic regression, Poisson regression)
Diagnosing model misfit using graphical posterior predictive checks
Little data: How traditional statistical ideas remain relevant in a big data world
Generalizing from sample to population (surveys, Xbox example, etc)
Day 3: Hierarchical Models
Foundations of Bayesian hierarchical/multilevel models
Accurately fitting hierarchical models in Stan
Why we don’t (usually) have to worry about multiple comparisons
Hierarchical modeling and prior information
Specific topics on Bayesian inference and computation include, but are not limited to:
Bayesian inference and prediction
Naive Bayes, supervised, and unsupervised classification
Overview of Monte Carlo methods
Convergence and effective sample size
Hamiltonian Monte Carlo and the no-U-turn sampler
Continuous and discrete-data regression models
Mixture models
Measurement-error and item-response models
Specific topics on Stan include, but are not limited to:
Reproducible research
Probabilistic programming
Stan syntax and programming
Optimization
Warmup, adaptation, and convergence
Identifiability and problematic posteriors
Handling missing data
Ragged and sparse data structures
Gaussian processes
Again, information on the course is here.
The course is organized by Lander Analytics.
The course is not cheap. Stan is opensource, and we organize these courses to raise money to support the programming required to keep Stan up to date. We hope and believe that the course is more than worth the money you pay for it, but we hope you’ll also feel good, knowing that this money is being used directly to support Stan R&D.
The post Short course on Bayesian data analysis and Stan 23-25 Aug in NYC! appeared first on Statistical Modeling, Causal Inference, and Social Science.
Machine Learning Explained: supervised learning, unsupervised learning, and reinforcement learning
(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)
Machine learning is often split between three main types of learning: supervised learning, unsupervised learning, and reinforcement learning. Knowing the differences between these three types of learning is necessary for any data scientist.
The big picture

The type of learning is defined by the problem you want to solve and is intrinsic to the goal of your analysis:
 You have a target, a value or a class to predict. For instance, let’s say you want to predict the revenue of a store from different inputs (day of the week, advertising, promotion). The model will then be trained on historical data and use it to forecast future revenues. Hence the model is supervised: it knows what to learn.
 You have unlabelled data and look for patterns or groups in these data. For example, you want to cluster clients according to the type of products they order, how often they purchase your product, their last visit, etc. Instead of doing it manually, unsupervised machine learning will automatically discriminate between different clients.
 You want to attain an objective. For example, you want to find the best strategy to win a game with specified rules. Once these rules are specified, reinforcement learning techniques will play this game many times to find the best strategy.
On supervised learning

Supervised learning encompasses different techniques which all share the same principles:
 The training dataset contains input data (your predictors) and the values you want to predict (which can be numeric or not).
 The model uses the training data to learn a link between the inputs and the outputs. The underlying idea is that the training data can be generalized, and that the model can be used on new data with some accuracy.
Some supervised learning algorithms:
 Linear and logistic regression
 Support vector machine
 Naive Bayes
 Neural network
 Gradient boosting
 Classification trees and random forest
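To make this concrete, here is a small sketch with base R (logistic regression on the built-in mtcars data; the variable choice is illustrative):

```r
# Supervised learning: the target (am, transmission type) is known for
# every row, and the model learns a link from the inputs to it.
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Predicted probability of a manual transmission for each car
pred <- predict(fit, type = "response")
head(round(pred, 2))
```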
Supervised learning is often used for expert systems in image recognition, speech recognition and forecasting, and in some specific business domains (targeting, financial analysis, etc.).
On unsupervised learning

(Image: Cluster Analysis, from Wikipedia)
On the other hand, unsupervised learning does not use output data (at least, no output data different from the input). Unsupervised algorithms can be split into different categories:
 Clustering algorithms, such as k-means, hierarchical clustering or mixture models. These algorithms try to discriminate and separate the observations into different groups.
 Dimensionality reduction algorithms (which are mostly unsupervised), such as PCA, ICA or autoencoders. These algorithms find the best representation of the data with fewer dimensions.
 Anomaly detection, to find outliers in the data, i.e. observations which do not follow the patterns of the data set.
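Two of these categories can be sketched with base R alone (k-means clustering and PCA on the built-in iris measurements, labels ignored):

```r
measurements <- scale(iris[, 1:4])   # unlabelled inputs only

# Clustering: separate the observations into three groups
set.seed(42)
km <- kmeans(measurements, centers = 3)

# Dimensionality reduction: project onto the principal components
pca <- prcomp(measurements)
summary(pca)$importance["Cumulative Proportion", 1:2]
```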
Most of the time, unsupervised learning algorithms are used to preprocess the data, during exploratory analysis, or to pretrain supervised learning algorithms.
On reinforcement learning

Reinforcement learning algorithms try to find the best ways to earn the greatest reward. Rewards can be winning a game, earning more money or beating other opponents. They present state-of-the-art results on very human tasks; for instance, this paper from the University of Toronto shows how a computer can beat humans at old-school Atari video games.
Reinforcement learning algorithms follow a circular loop of steps:
(Image from Wikipedia: Reinforcement Learning)
Given its own state and the environment’s state, the agent chooses the action which will maximize its reward, or explores a new possibility. These actions change both the environment’s state and the agent’s state, and are also interpreted to give a reward to the agent. By performing this loop many times, the agent improves its behavior.
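This loop can be sketched with a toy multi-armed bandit in R (the reward probabilities below are made up for illustration):

```r
set.seed(1)
true_means <- c(0.2, 0.5, 0.8)   # hypothetical reward probability per action
q <- rep(0, 3)                   # agent's reward estimates
n <- rep(0, 3)                   # times each action was taken
eps <- 0.1                       # exploration rate

for (t in 1:5000) {
  # Choose an action: usually exploit the best estimate, sometimes explore
  a <- if (runif(1) < eps) sample(3, 1) else which.max(q)
  r <- rbinom(1, 1, true_means[a])    # environment returns a reward
  n[a] <- n[a] + 1
  q[a] <- q[a] + (r - q[a]) / n[a]    # update estimate (incremental mean)
}

q   # estimates approach true_means for well-explored actions
```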
Reinforcement learning already performs well on ‘small’ dynamic systems and is definitely a field to follow in the years to come.
The post Machine Learning Explained: supervised learning, unsupervised learning, and reinforcement learning appeared first on Enhance Data Science.
RcppAPT 0.0.4
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
A new version of RcppAPT — our interface from R to the C++ library behind the awesome apt, apt-get, apt-cache, … commands and their cache powering Debian, Ubuntu and the like — arrived on CRAN yesterday.
We added a few more functions in order to compute on the package graph. A concrete example is shown in this vignette which determines the (minimal) set of remaining Debian packages requiring a rebuild under R 3.4.* to update their .C() and .Fortran() registration code. It has been used for the binNMU request #868558.
As we also added a NEWS file, its (complete) content covering all releases follows below.
Changes in version 0.0.4 (2017-07-16)

- New function getDepends
- New function reverseDepends
- Added package registration code
- Added usage examples in scripts directory
- Added vignette, also in docs as rendered copy

Changes in earlier versions

- Added dumpPackages, showSrc
- Added reverseDepends, dumpPackages, showSrc
- Initial version with getPackages and hasPackages
A bit more information about the package is available here, as well as at the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
seplyr update
(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)
The development version of my new R package seplyr is performing in practical applications with dplyr 0.7.* much better than even I (the seplyr package author) expected.
I think I have hit a very good set of tradeoffs, and I have now spent significant time creating documentation and examples.
I wish there had been such a package weeks ago, and that I had started using this approach in my own client work at that time. If you are already a dplyr user I strongly suggest trying seplyr in your own analysis projects.
Some Ideas for your Internal R Package
(This article was first published on R Views, and kindly contributed to R-bloggers)
At RStudio, I have the pleasure of interacting with data science teams around the world. Many of these teams are led by R users stepping into the role of analytic admins. These users are responsible for supporting and growing the R user base in their organization and often lead internal R user groups.
One of the most successful strategies to support a corporate R user group is the creation of an internal R package. This article outlines some common features and functions shared in internal packages. Creating an R package is easier than you might expect. A good place to start is this webinar on package creation.
Logos and Custom CSS

Interestingly, one powerful way to increase the adoption of data science outputs – plots, reports, and even slides – is to stick to consistent branding. Having a common look and feel makes it easier for management to recognize the work of the data science team, especially as the team grows. Consistent branding also saves the R user time that would normally be spent picking fonts and color schemes.
It is easy to include logos and custom CSS inside of an R package, and to write wrapper functions that copy the assets from the package to a user’s local working directory. For example, this wrapper function adds a logo from the RStudioInternal package to the working directory:
getLogo <- function(copy_to = getwd()) {
  copy_to <- normalizePath(copy_to)
  file.copy(system.file("logo.png", package = "RStudioInternal"), copy_to)
}

Once available, logos and CSS can be added to Shiny apps and R Markdown documents.
ggplot2 Themes

Similar to logos and custom CSS, many internal R packages include a custom ggplot2 theme. These themes ensure consistency across plots in an organization, making data science outputs easier to recognize and read.
ggplot2 themes are shared as functions. To get started writing a ggplot2 theme, see resource 1. For inspiration, take a look at the ggthemes package.
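A minimal theme sketch might look like this (the theme name and styling choices here are invented for illustration):

```r
library(ggplot2)

# A "corporate" theme shared as a function, built on theme_minimal()
theme_acme <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title      = element_text(face = "bold"),
      legend.position = "bottom"
    )
}

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  labs(title = "Weight vs. MPG") +
  theme_acme()
```

Analysts then add `+ theme_acme()` to any plot and get the house style for free.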
Data Connections

Internal R packages are also an effective way to share functions that make it easy for analysts to connect to internal data sources. Nothing is more frustrating for a first-time R user than trying to navigate the world of ODBC connections and complex database schemas before they can get started with data relevant to their day-to-day job.
If you’re not sure where to begin, look through your own scripts for common database connection strings or configurations. db.rstudio.com can provide more information on how to handle credentials, drivers, and config files.
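As an illustration of such a wrapper (the function name is invented, and an in-memory SQLite database stands in for a real corporate data source):

```r
library(DBI)

# Wrapper that hides connection details from the analyst.
# In practice this would point at the real driver, server and
# credentials; RSQLite's in-memory database is a stand-in here.
connect_warehouse <- function() {
  DBI::dbConnect(RSQLite::SQLite(), ":memory:")
}

con <- connect_warehouse()
DBI::dbWriteTable(con, "mtcars", mtcars)
DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM mtcars")
DBI::dbDisconnect(con)
```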
learnr Tutorials

RStudio recently released learnr, a new package for creating interactive tutorials in R Markdown. There are many great resources online for getting started with R, but it can be useful to create tutorials specific to your internal data and domain. learnr tutorials can serve as training wheels for the other components of the internal R package, or teach broader concepts and standards accepted across the organization. For example, you might provide a primer that teaches new users your organization’s R style guide.
Sharing an Internal R PackageInternal packages can be built in the RStudio IDE and distributed as tar files. Alternatively, many organizations use RStudio Server or RStudio Server Pro to standardize the R environment in their organization. In addition to making it easy to share an internal package, a standard compute environment keeps new R users from having to spend time installing R, RStudio, and packages. While these are necessary skills, the first interactions with R should get new users to a data insight as fast as possible. RStudio Server Pro also includes IT functions for monitoring and restricting resources.
Wrap Up

If you are leading an R group, an internal R package is a powerful way to support your users and the adoption of R. Imagine how easy it would be to introduce R to coworkers if they could connect to real, internal data and create a useful, beautiful plot in under 10 minutes. Investing in an internal R package makes that onboarding experience possible.