
«smooth» package for R. es() function. Part VI. Parameters optimisation

Sat, 04/29/2017 - 20:56

(This article was first published on R – Modern Forecasting, and kindly contributed to R-bloggers)

Now that we have looked into the basics of the es() function, we can discuss how the optimisation mechanism works, how the parameters are restricted and what the initial values of the parameters are in the optimisation. This will be a fairly technical post for researchers who are interested in the inner (darker) parts of es().

NOTE. In this post we will discuss initial values of parameters. Please don't confuse this with initial values of components. The former is a wider term than the latter, because in general it includes the latter. Here I will explain how initialisation is done before the parameters are optimised.

Let’s get started.

Before the optimisation, we need to have some initial values for all the parameters we want to estimate. The number of parameters and the initialisation principles depend on the selected model. Let's go through all of them in detail.

The smoothing parameters \(\alpha\), \(\beta\) and \(\gamma\) (for the level, trend and seasonal components) for the model with an additive error term are set equal to 0.3, 0.2 and 0.1, while for the multiplicative one they are equal to 0.1, 0.05 and 0.01 respectively. The motivation here is that we want to have parameters closer to zero in order to smooth the series (although we don't always get these values), and in the case of multiplicative models the parameters need to be very low, otherwise the model may become too sensitive to the noise.

The next important value is the initial value of the damping parameter \(\phi\), which is set equal to 0.95. We don't want to start from one, because the damped trend model in that case loses the damping property, but we want to be close to one in order not to enforce excessive damping.

As for the vector of states, its initial values are set depending on the type of model. First, the following simple model is fitted to the first 12 observations of the data (if we don't have 12 observations, then to the whole sample we have):
\begin{equation} \label{eq:simpleregressionAdditive}
y_t = a_0 + a_1 t + e_t .
\end{equation}
In the case of a multiplicative trend we use a different model:
\begin{equation} \label{eq:simpleregressionMulti}
\log(y_t) = a_0 + a_1 t + e_t .
\end{equation}
In both cases \(a_0\) is the intercept, which is used as the initial value of the level component, and \(a_1\) is the slope of the trend, which is used as the initial value of the trend component. In the case of the multiplicative model, the exponents of \(a_0\) and \(a_1\) are used. For the case of no trend, a simple average (of the same sample) is used as the initial value of the level component.
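As a rough illustration of this initialisation step (a sketch of the idea only, not the actual internal code of es(); the variable names are made up), the level and trend starting values could be obtained like this:

# Sketch of the initialisation regression described above; names are illustrative.
y   <- as.numeric(AirPassengers)   # any univariate series will do
obs <- min(12, length(y))
t   <- 1:obs

# Additive trend: regress the raw values on time
fitA      <- lm(y[1:obs] ~ t)
levelInit <- coef(fitA)[1]   # intercept -> initial level
trendInit <- coef(fitA)[2]   # slope     -> initial trend

# Multiplicative trend: regress the logs and exponentiate the coefficients
fitM       <- lm(log(y[1:obs]) ~ t)
levelInitM <- exp(coef(fitM)[1])
trendInitM <- exp(coef(fitM)[2])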

In the case of a seasonal model, the classical seasonal decomposition ("additive" or "multiplicative", depending on the type specified by the user) is done using the decompose() function, and the seasonal indices are used as initial values for the seasonal component.
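Again as a sketch only (not the internal code of the package), the seasonal initials can be extracted from decompose() roughly as follows:

# Seasonal indices via classical decomposition; type = "multiplicative"
# would be used for a multiplicative seasonal component.
dec        <- decompose(AirPassengers, type = "additive")
seasonInit <- dec$figure   # one index per season, used as starting values
seasonInit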

All the values are then packed into the vector called C (I know that this is not a good name for the vector of parameters, but that's life) in the following order:

  1. Vector of smoothing parameters \(\mathbf{g}\) (persistence);

  2. Damping parameter \(\phi\) (phi);

  3. Initial values of the non-seasonal part of the vector of states \(\mathbf{v}_t\) (initial);

  4. Initial values of the seasonal part of the vector of states \(\mathbf{v}_t\) (initialSeason).

After that, the parameters of the exogenous variables are added to the vector (we will cover the topic of exogenous variables separately in one of the upcoming posts). Their sequence is:

  5. Vector of parameters of exogenous variables \(\mathbf{a}_t\) (initialX);

  6. Transition matrix for exogenous variables (transitionX);

  7. Persistence vector for exogenous variables (persistenceX).

Obviously, if we use predefined values for some of those elements, then they are not optimised and are skipped during the formation of the vector C. For example, if the user specifies the parameter initial, then step (3) is skipped.

There are some restrictions on the estimated parameter values. They are defined in the vectors CLower and CUpper, which have the same length as C and correspond to the same elements as in C (persistence vector, then damping parameter, etc.). They may vary depending on the value of the parameter bounds. These restrictions are needed in order to find the optimal value of the vector C faster. The majority of them are fairly technical, making sure that the resulting model has meaningful components (for example, a multiplicative component should be greater than zero). The only parameter worth mentioning separately is the damping parameter \(\phi\). It is allowed to take values between zero and one (including the boundary values). In this case the forecasting trajectories do not exhibit explosive behaviour.

Now, the vectors CLower and CUpper define general regions for all the parameters, but the bounds of the smoothing parameters need finer regulation, because they are connected with each other. That is why they are regulated in the cost function itself. If the user defines "usual" bounds, then the parameters are restricted to make sure that:
\begin{equation} \label{eq:boundsUsual}
\alpha \in [0, 1]; \beta \in [0, \alpha]; \gamma \in [0, 1-\alpha]
\end{equation}
This way the exponential smoothing has the property of an averaging model, meaning that the weights are distributed over time in an exponential fashion, they are all positive and add up to one, and the weights of the newer observations are higher than the weights of the older ones.

If the user defines bounds="admissible", then the eigenvalues of the discount matrix are calculated on each iteration. The function makes sure that the selected smoothing parameters guarantee that the eigenvalues lie in the unit circle. This way the model has the property of being stable, which means that the weights decrease over time and add up to one. However, on each separate observation they may become negative or greater than one, meaning that the model is no longer an "averaging" model.

In the extreme case of bounds="none", the bounds of the smoothing parameters are not checked at all.

In case of a violation of the bounds for the smoothing parameters, the cost function returns a very high number, so the optimiser "hits the wall" and moves on to the next value.
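To make the "hitting the wall" idea concrete, here is a minimal sketch of a penalised cost function under the "usual" bounds (illustrative only; the actual implementation in smooth is more involved):

# C = c(alpha, beta, gamma); cf is whatever cost function is being minimised.
penalisedCF <- function(C, cf) {
    alpha <- C[1]; beta <- C[2]; gamma <- C[3]
    if (alpha < 0 || alpha > 1 ||
        beta  < 0 || beta  > alpha ||
        gamma < 0 || gamma > 1 - alpha) {
        return(1e100)   # "hit the wall": a very high number for the optimiser
    }
    cf(C)
}

# toy usage: valid parameters return the cost, invalid ones return the penalty
penalisedCF(c(0.3, 0.2, 0.1), function(C) sum(C^2))
penalisedCF(c(1.5, 0.2, 0.1), function(C) sum(C^2))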

In order to optimise the model we use the function nloptr() from the nloptr package. This function implements non-linear optimisation algorithms written in C. The smooth functions use two algorithms: BOBYQA and Nelder-Mead. This is done in two stages: the parameters are estimated using the former, and the returned parameters are then used as the initial values for the latter. In the case of mixed models we also check whether the parameters returned from the first stage differ from the initial values. If they don't, then the optimisation failed, and BOBYQA is repeated with different initial values of the vector of parameters C (the smoothing parameters that failed during the optimisation are set equal to zero).
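The two-stage optimisation can be sketched roughly as follows; the toy cost function and the nloptr options are illustrative, not the exact ones used inside smooth:

library(nloptr)

# Toy setup so the sketch runs on its own: minimise a simple quadratic.
cf     <- function(C) sum((C - c(0.3, 0.2, 0.1))^2)
C      <- c(0.5, 0.5, 0.5)   # starting values
CLower <- c(0, 0, 0)
CUpper <- c(1, 1, 1)

# Stage 1: BOBYQA
res1 <- nloptr(x0 = C, eval_f = cf, lb = CLower, ub = CUpper,
               opts = list(algorithm = "NLOPT_LN_BOBYQA",
                           xtol_rel = 1e-8, maxeval = 500))

# Stage 2: Nelder-Mead, warm-started from the BOBYQA solution
res2 <- nloptr(x0 = res1$solution, eval_f = cf, lb = CLower, ub = CUpper,
               opts = list(algorithm = "NLOPT_LN_NELDERMEAD",
                           xtol_rel = 1e-8, maxeval = 500))

res2$solution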

Overall this optimisation mechanism guarantees that the parameters are close to the optimal values, lie in the meaningful region and satisfy the predefined bounds.

As an example, we will apply es() to the time series N41 from M3.

First, here's how ETS(A,A,N) with the usual bounds looks on that time series:

es(M3$N0041$x,"AAN",bounds="u",h=6)

Time elapsed: 0.1 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
    0     0
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 397.628
Cost function type: MSE; Cost function value: 101640.73
Information criteria:
     AIC     AICc      BIC
211.1391 218.6391 214.3344

As we see, in this case the optimal smoothing parameters are equal to zero. This means that we do not take any information into account and just produce a straight line (a deterministic trend). See for yourselves:

Series №41 and ETS(A,A,N) with traditional (aka “usual”) bounds

And here’s what happens with the admissible bounds:

es(M3$N0041$x,"AAN",bounds="a",h=6)

Time elapsed: 0.11 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
1.990 0.018
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 327.758
Cost function type: MSE; Cost function value: 69059.107
Information criteria:
     AIC     AICc      BIC
205.7283 213.2283 208.9236

The smoothing parameter of the level, \(\alpha\), is greater than one; it is almost two. This means that the exponential smoothing is no longer an averaging model, but I can assure you that the model is still stable. Such a high value of the smoothing parameter means that the level in the time series changes drastically. This is not common and usually not desired, but it is a possible behaviour of exponential smoothing. Here is how it looks:

Series №41 and ETS(A,A,N) with admissible bounds

Here I would like to note that the model can be stable even with negative smoothing parameters, so don't be scared. If the model is not stable, the function will warn you.

Last but not least, the user can set the values of the C, CLower and CUpper vectors for the first optimisation stage. Model selection does not work with the provided vectors of initial parameters, because the length of the C, CLower and CUpper vectors is fixed, while in the case of model selection it varies from model to model. The user also needs to make sure that the length of each of the vectors is correct and corresponds to the selected model. The values are passed via the ellipsis in the following way:

Cvalues <- c(0.2, 0.1, M3$N0041$x[1], diff(M3$N0041$x)[1])
es(M3$N0041$x,"AAN",C=Cvalues,h=6,bounds="u")

Time elapsed: 0.1 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha beta
    1    0
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 429.923
Cost function type: MSE; Cost function value: 118821.938
Information criteria:
     AIC     AICc      BIC
213.3256 220.8256 216.5209

In this case we reached the boundary values for both the level and trend smoothing parameters. The resulting model has a constantly changing level (random walk level) and a deterministic trend. This is a weird, but possible, combination. The fit and forecast look similar to the model with the admissible bounds, but not as reactive:

Series №41 and ETS(A,A,N) with traditional bounds and non-standard initials

Using this functionality, you may end up with ridiculous and meaningless models, so be aware and be careful. For example, the following does not make any sense from a forecasting perspective:

Cvalues <- c(2.5, 1.1, M3$N0041$x[1], diff(M3$N0041$x)[1])
CLower <- c(1,1, 0, -Inf)
CUpper <- c(3,3, Inf, Inf)
es(M3$N0041$x,"AAN",C=Cvalues, CLower=CLower, CUpper=CUpper, bounds="none",h=6)

Time elapsed: 0.12 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
2.483 1.093
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 193.328
Cost function type: MSE; Cost function value: 24027.222
Information criteria:
     AIC     AICc      BIC
190.9475 198.4475 194.1428
Warning message:
Model ETS(AAN) is unstable! Use a different value of 'bounds' parameter to address this issue!

Although the fit is very good and the model approximates the data better than all the others (an MSE value of 24027 versus 70000–120000 for the other models), the model is unstable (the function warns us about that), meaning that the weights are distributed in an unreasonable way: the older observations become more important than the newer ones. The forecast of such a model is meaningless and most probably biased and inaccurate. Here is how it looks:

Series №41 and ETS(A,A,N) with crazy bounds

So be careful with manual tuning of the optimiser.

Have fun but be reasonable!


Accessing and Manipulating Biological Databases Exercises (Part 1)

Sat, 04/29/2017 - 17:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In the exercises below, we cover how to access and manipulate biological databases through the rentrez and seqinr packages.

Install Packages
rentrez
seqinr
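If the packages are not installed yet, something along these lines should get you set up (entrez_dbs() is just used here as a quick check that rentrez is working):

# Install (once) and load the two packages used in these exercises.
install.packages(c("rentrez", "seqinr"))
library(rentrez)
library(seqinr)

# Quick sanity check: list the NCBI databases that rentrez can query.
entrez_dbs()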

Answers to the exercises are available here

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Print all the available databases which you can access through the rentrez package.

Exercise 2

Print all the searchable terms in a database.

Exercise 3

Display the details of any database of your choice.

Exercise 4

Retrieve and print 10 IDs of nucleotide sequences from the nuccore database for Human.

Exercise 5

Retrieve and print 20 IDs of protein sequences from the protein database for Human.


Exercise 6

Create a FASTA file for a particular human protein sequence from the listed IDs.

Exercise 7

Create a FASTA file for a particular human nucleotide sequence from the listed IDs.

Exercise 8

Open the nucleotide FASTA file and print the details using the seqinr package.

Exercise 9

Open the protein FASTA file and print the details using the seqinr package.

Exercise 10

Open the nucleotide FASTA file and print only the sequence, stripping all other information.



Plotting Data Online via Plotly and Python

Sat, 04/29/2017 - 14:36

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

I don’t do a lot of plotting in my job, but I recently heard about a website called Plotly that provides a plotting service for anyone’s data. They even have a plotly package for Python (among others)! So in this article we will be learning how to plot with their package. Let’s have some fun making graphs!

Getting Started

You will need the plotly package to follow along with this article. You can use pip to get the package and install it:

pip install plotly

Now that you have it installed, you’ll need to go to the Plotly website and create a free account. Once that’s done, you will get an API key. To make things super simple, you can use your username and API key to create a credentials file. Here’s how to do that:

import plotly.tools as tls

tls.set_credentials_file(username="your_username",
                         api_key="your_api_key")

# to get your credentials
credentials = tls.get_credentials_file()

If you don’t want to save your credentials, then you can also sign in to their service by doing the following:

import plotly.plotly as py

py.sign_in('your_username', 'your_api_key')

For the purposes of this article, I’m assuming you have created the credentials file. I found that makes interacting with their service a bit easier to use.

Creating a Graph

Plotly seems to default to a Scatter Plot, so we’ll start with that. I decided to grab some data from a census website. You can download any US state’s population data, along with other pieces of data. In this case, I downloaded a CSV file that contained the population of each county in the state of Iowa. Let’s take a look:

import csv
import plotly.plotly as py

#----------------------------------------------------------------------
def plot_counties(csv_path):
    """
    http://census.ire.org/data/bulkdata.html
    """
    counties = {}
    county = []
    pop = []
    counter = 0

    with open(csv_path) as csv_handler:
        reader = csv.reader(csv_handler)
        for row in reader:
            if counter == 0:
                counter += 1
                continue
            county.append(row[8])
            pop.append(row[9])

    trace = dict(x=county, y=pop)
    data = [trace]
    py.plot(data, filename='ia_county_populations')

if __name__ == '__main__':
    csv_path = 'ia_county_pop.csv'
    plot_counties(csv_path)

If you run this code, you should see a graph that looks like this:

You can also view the graph here. Anyway, as you can see in the code above, all I did was read the CSV file and extract out the county name and the population. Then I put that data into two different Python lists. Finally I created a dictionary of those lists and then wrapped that dictionary in a list. So you end up with a list that contains a dictionary that contains two lists! To make the Scatter Plot, I passed the data to plotly’s plot method.

Converting to a Bar Chart

Now let’s see if we can change the ScatterPlot to a Bar Chart. First off, we’ll play around with the plot data. The following was done via the Python interpreter:

>>> scatter = py.get_figure('driscollis', '0')
>>> print scatter.to_string()
Figure(
    data=Data([
        Scatter(
            x=[u'Adair County', u'Adams County', u'Allamakee County', u'..', ],
            y=[u'7682', u'4029', u'14330', u'12887', u'6119', u'26076', '..' ]
        )
    ])
)

This shows how we can grab the figure using the username and the plot’s unique number. Then we printed out the data structure. You will note that it doesn’t print out the entire data structure. Now let’s do the actual conversion to a Bar Chart:

from plotly.graph_objs import Bar, Data, Figure, Layout

scatter_data = scatter.get_data()
trace_bar = Bar(scatter_data[0])
data = Data([trace_bar])
layout = Layout(title="IA County Populations")
fig = Figure(data=data, layout=layout)
py.plot(fig, filename='bar_ia_county_pop')

This will create a bar chart at the following URL: https://plot.ly/~driscollis/1.

Here’s the image of the graph:

This code is slightly different than the code we used originally. In this case, we explicitly created a Bar object and passed it the scatter plot’s data. Then we put that data into a Data object. Next we created a Layout object and gave our chart a title. Then we created a Figure object using the data and layout objects. Finally we plotted the bar chart.

Saving the Graph to Disk

Plotly also allows you to save your graph to your hard drive. You can save it in the following formats: png, svg, jpeg, and pdf. Assuming you still have the Figure object from the previous example handy, you can do the following:

py.image.save_as(fig, filename='graph.png')

If you want to save using one of the other formats, then just use that format’s extension in the filename.

Wrapping Up

At this point you should be able to use the plotly package pretty well. There are many other graph types available, so be sure to read Plotly’s documentation thoroughly. They also support streaming graphs. As I understand it, Plotly allows you to create 10 graphs for free. After that you would either have to delete some of your graphs or pay a monthly fee.




    Luke-warm about micromaps

    Sat, 04/29/2017 - 14:00

Continuing my exploration of methods for spatial visualisation of data in R, today I'm looking at linked micromaps. Micromaps are a way of showing statistical graphics for a small subset of regions at a time, with a small map indicating which regions are being looked at in each of the small multiples. Alberto Cairo has some nice discussion in this blog post.

    Poverty and education in the USA

    It’s easier to show than explain. Here’s one I’ve adapted from the helpfiles of the R micromaps package by Quinn Payton, Tony Olsen, Marc Weber, Michael McManus and Tom Kincaid of the US Environmental Protection Agency and generously open-sourced by that agency. Under the hood, micromaps uses Hadley Wickham’s ggplot2 to create the graphic assets and (I think) Paul Murrell’s grid to lay them out in aligned fashion.

    Some links for the micromaps package:

    I used this USA example as my test case for understanding the micromaps API. Some of the changes I’ve introduced here include:

    • Set my own font
    • Pale blue background with white gridlines to make them go into the background, ggplot2-style.
    • White borders for the grey-fill states that have been previously drawn, so the border lines don’t distract the eye from the five states being referred to in each cluster.
    • Tidied up the code for readability

    Here’s the code that draws this, and sets up the overall session. Note that the nzcensus (by me) and mbiemaps (by New Zealand’s Ministry of Business, Innovation and Employment, my previous employer) are only available on GitHub, not CRAN. Code to install them is available in a previous post. They’re needed for New Zealand data and maps later on.

library(micromap)
library(nzcensus)
library(extrafont)
library(mbiemaps) # for the territorial authority map
library(tidyverse)
library(testthat)
library(scales)
library(ggrepel)

the_font <- "Calibri"
theme_set(theme_minimal(base_family = the_font))

#============polish up the example from the ?lmplot helpfile===============
data("USstates")
data("edPov")

# convert from SpatialPolygonsDataFrame into a data frame (similar to fortify)
statePolys <- create_map_table(USstates, 'ST')

# draw graphic
lmplot(stat.data = edPov,
       map.data = statePolys,
       panel.types = c('labels', 'dot', 'dot', 'map'),
       panel.data = list('state', 'pov', 'ed', NA),
       ord.by = 'pov',
       grouping = 5,
       colors = brewer.pal(5, "Spectral"),
       median.row = TRUE,
       # how to merge the two data frames:
       map.link = c('StateAb', 'ID'),
       # stroke colour of the borders of previously-drawn areas:
       map.color2 = "white",
       # how to save the result:
       print.file = "0095-eg1.png",
       print.res = 600,
       plot.width = 7,
       plot.height = 9,
       # attributes of the panels:
       panel.att = list(
         list(1, header = "States",
              panel.width = 0.8, text.size = 0.8, align = 'right',
              panel.header.font = the_font, panel.header.face = "bold",
              text.font = the_font, left.margin = 2, right.margin = 1),
         list(2, header = "Percent living below\npoverty level",
              xaxis.title = 'Percent',
              xaxis.ticks = list(10, 15, 20),
              xaxis.labels = list(10, 15, 20),
              graph.bgcolor = 'lightblue', graph.grid.color = "grey99",
              graph.border.color = "white",
              panel.header.font = the_font, panel.header.face = "bold"),
         list(3, header = "Percent adults with\n4+ years of college",
              xaxis.title = 'Percent',
              xaxis.ticks = list(0, 20, 30, 40),
              xaxis.labels = list(0, 20, 30, 40),
              graph.bgcolor = 'lightblue', graph.grid.color = "grey99",
              graph.border.color = "white",
              panel.header.font = the_font, panel.header.face = "bold"),
         list(4, header = 'Light gray means\nhighlighted above',
              panel.width = 0.8,
              panel.header.font = the_font, panel.header.face = "italic")
       ))

    Most of the polishing code is in the list of lists passed to the panel.att argument. Each list refers to the attributes of one of the four panels (state names, poverty dot charts, education dot charts, maps). I had to do a bit of digging to find out how to control things like grid colour; while doing this it was useful to run one of the three lines of code below to see what attributes are within your control for the different panel types:

unlist(labels_att(TRUE))
unlist(dot_att(TRUE))
unlist(map_att(TRUE))

Note that the lmplot function specifies a file to print the graphic to. I don't like this, as it breaks some commonly accepted R workflows. For example, for this blog I usually create in advance all the graphics for each post in SVG format, which scales up nicely if people zoom in on it and is generally (in my view) the best format for web graphics. That can't be done when lmplot restricts you to particular device types. The pattern also doesn't work well with Yixuan Qiu's showtext R package that I normally use for fonts (it lets me access Google fonts, including the Poppins font I use for most graphics).

    New Zealand census example

    To be sure I understood how to use the technique, I wanted to apply it to some New Zealand maps and data. I’m used to presenting data at the Territorial Authority level in New Zealand by means of a choropleth map like this one:

    … which was drawn with this code, using the TA2013 data frame of 2013 census data from my nzcensus package:

# pre-fortified data frame version of TA level map, from mbiemaps package:
data(ta_simpl_gg)

# change name in census data to match the name used in ta_simpl_gg
TA2013$short_name <- gsub(" District", "", TA2013$TA2013_NAM)

# filter out some visually inconvenient data:
ta_data <- TA2013 %>%
   filter(!short_name %in% c("Chatham Islands Territory", "Area Outside Territorial Authority")) %>%
   mutate(PercNoReligion = PropNoReligion2013 * 100)

# draw map:
ta_simpl_gg %>%
   left_join(ta_data, by = c("NAME" = "short_name")) %>%
   arrange(order) %>%
   ggplot(aes(x = long, y = lat, group = group, fill = MedianIncome2013)) +
   geom_polygon(colour = "grey50") +
   ggmap::theme_nothing(legend = TRUE) +
   theme(legend.position = c(0.2, 0.7)) +
   scale_fill_gradientn("Median individual\nincome",
                        colours = brewer.pal(11, "Spectral"), label = dollar) +
   coord_map(projection = "sinusoidal") +
   ggtitle("Example choropleth map") +
   labs(caption = "Source: Statistics New Zealand, 2013 Census")

    Doing this with a linked micromap instead of a choropleth map lets us look at more variables at once, but I can’t say I’m happy with the result:

    My reservations about this graphic:

    • It feels like there are just too many territorial authorities for this to be a really friendly graphic.
    • Also, my map of New Zealand is probably too crinkly and individual districts and cities too small to show up well.
    • There’s an awkwardness of New Zealand being tall rather than wide – a smaller aspect ratio than USA. This seems to make the New Zealand map less well suited to the technique than the USA map.
    • It’s hard for the reader to move their eyes back and forth from the district or city name to the dots and to the map.
    • I couldn’t work out how (if it is possible) to control the projection of the map, and hence New Zealand looks a little bit stretched and rotated.

    Here’s the code that drew this, anyway:

# change names of the map data frame to meet lmplot's expectations
ta_polys2 <- ta_simpl_gg %>%
   rename(coordsx = long, coordsy = lat, ID = NAME) %>%
   mutate(hole = as.numeric(hole),
          plug = 0,
          region = as.numeric(as.factor(ID)))

# check merge will be ok
a <- unique(ta_polys2$ID)       # names of map polygons
b <- unique(ta_data$short_name) # names of TAs with census data
expect_equal(a[!a %in% b], b[!b %in% a])

# draw graphic:
lmplot(stat.data = ta_data,
       map.data = ta_polys2,
       panel.types = c('labels', 'dot', 'dot', 'map'),
       panel.data = list('short_name', 'MedianIncome2013', 'PercNoReligion', NA),
       ord.by = 'MedianIncome2013',
       grouping = 6,
       median.row = FALSE,
       # how to merge the two data frames:
       map.link = c('short_name', 'ID'),
       # stroke colour of the borders of previously-drawn areas:
       map.color2 = "white",
       # how to save the result:
       print.file = "0095-eg2.png",
       print.res = 600,
       plot.width = 14,
       plot.height = 18,
       # attributes of the panels:
       panel.att = list(
         list(1, header = "Territorial Authorities",
              panel.width = 1.2, panel.header.size = 2,
              text.size = 1.6, align = 'right',
              panel.header.font = the_font, panel.header.face = "bold",
              text.font = the_font, left.margin = 2, right.margin = 1,
              xaxis.title.size = 2),
         list(2, header = "Median\nincome",
              panel.header.size = 2, xaxis.title = 'Dollars',
              xaxis.title.size = 2, xaxis.labels.size = 2,
              graph.bgcolor = 'grey80', graph.grid.color = "grey99",
              graph.border.color = "white",
              panel.header.font = the_font, panel.header.face = "bold",
              point.border = FALSE, point.size = 2),
         list(3, header = "Percent individuals\nwith no religion",
              panel.header.size = 2, xaxis.title = 'Percent',
              xaxis.title.size = 2, xaxis.labels.size = 2,
              graph.bgcolor = 'grey80', graph.grid.color = "grey99",
              graph.border.color = "white",
              panel.header.font = the_font, panel.header.face = "bold",
              point.border = FALSE, point.size = 2),
         list(4, header = '\n',
              panel.header.size = 2, panel.width = 0.5,
              panel.header.font = the_font, panel.header.face = "italic",
              active.border.size = 0.2, withdata.border.size = 0.2,
              nodata.border.size = 0.2)
       ))

    Now, there’s a standard way of showing two variables against each other. We lose the spatial element, but for most purposes I think the good old scatter plot is better for this data:

    …drawn with:

ta_data %>%
   ggplot(aes(x = MedianIncome2013, y = PropNoReligion2013, label = short_name)) +
   scale_x_continuous("Median income", label = dollar) +
   scale_y_continuous("No religion", label = percent) +
   geom_smooth(method = "lm", colour = "white") +
   geom_point(aes(size = CensusNightPop2013), shape = 19, colour = "grey60") +
   geom_point(aes(size = CensusNightPop2013), shape = 1, colour = "black") +
   geom_text_repel(colour = "steelblue", size = 2.5) +
   scale_size_area("Census night population", max_size = 10, label = comma) +
   theme(legend.position = "bottom") +
   labs(caption = "Source: Statistics New Zealand, census 2013")

Maybe better with a smaller number of areas?

    New Zealand has 66 districts and cities (not counting Chatham Islands), but only 16 Regional Councils. Perhaps the method works better with a smaller number of areas to show:

… and I think that probably is OK. But really, all we are showing here is 32 numbers. It's an expensive graphic for something that could almost be as meaningfully shown in a table. All up, my reaction to linked micromaps is one of caution. Like any visualisation tool, I think they'll be good in some circumstances, but in others they just won't communicate as easily.

    Code for the regional council micromap:

#================regions example============
# convert from SpatialPolygonsDataFrame into a data frame (similar to fortify)
data(region_simpl) # map from mbiemaps
reg_polys <- create_map_table(region_simpl, 'NAME')

reg_data <- REGC2013 %>%
   mutate(PercNoReligion = PropNoReligion2013 * 100) %>%
   filter(REGC2013_N != "Area Outside Region")

lmplot(stat.data = reg_data,
       map.data = reg_polys,
       panel.types = c('labels', 'dot', 'dot', 'map'),
       panel.data = list('REGC2013_N', 'MedianIncome2013', 'PercNoReligion', NA),
       ord.by = 'MedianIncome2013',
       grouping = 4,
       median.row = FALSE,
       # how to merge the two data frames:
       map.link = c('REGC2013_N', 'ID'),
       # stroke colour of the borders of previously-drawn areas:
       map.color2 = "white",
       # how to save the result:
       print.file = "0095-eg3.png",
       print.res = 600,
       plot.width = 8,
       plot.height = 9,
       # attributes of the panels:
       panel.att = list(
         list(1, header = "Territorial Authorities",
              panel.width = 1.2, text.size = 0.8, align = 'right',
              panel.header.font = the_font, panel.header.face = "bold",
              text.font = the_font, left.margin = 2, right.margin = 1),
         list(2, header = "Median\nincome",
              xaxis.title = 'Dollars', panel.width = 0.8,
              left.margin = .5, right.margin = .5,
              graph.bgcolor = 'grey90', graph.grid.color = "grey99",
              graph.border.color = "white",
              panel.header.font = the_font, panel.header.face = "bold",
              point.border = FALSE, point.size = 2.5),
         list(3, header = "Percent individuals\nwith no religion",
              panel.width = 0.8, left.margin = .5, right.margin = .5,
              xaxis.title = 'Percent',
              graph.bgcolor = 'grey90', graph.grid.color = "grey99",
              graph.border.color = "white",
              panel.header.font = the_font, panel.header.face = "bold",
              point.border = FALSE, point.size = 2.5),
         list(4, header = '\n',
              panel.width = 1.2,
              panel.header.font = the_font, panel.header.face = "italic",
              active.border.size = 0.5, withdata.border.size = 0.5,
              nodata.border.size = 0.5)
       ))

Bonus – area unit cartograms

    Finally, an aside. My last two blog posts have been about what you might call population-weighted carto-choro-pletho-grams… I’ve been gradually enhancing this Shiny web app which lets people play around with visualisations of 2013 census data. Latest addition is data at the detailed area unit level. Here’s a screenshot showing which area units have higher proportions of people with a background from the Pacific islands:

    Check out the full app for more.

    R Weekly Bulletin Vol – VI

    Sat, 04/29/2017 - 09:20

This week's R bulletin will cover topics like how to parse milliseconds in R, how to format dates, and methods to extract specific parts of dates and times.

We will also cover functions like %within% and %m+%, the to.period function, and the period.max and period.min functions. Hope you like this R weekly bulletin. Enjoy reading!

    Shortcut Keys

1. Indent – press the Tab key at the beginning of the line
2. Outdent – Shift+Tab
3. Go to a specific line – Shift+Alt+G

Problem Solving Ideas

How to parse milliseconds in R

When R reads dates from a text or spreadsheet file, it will typically store them as a character vector or a factor. To convert them to dates, we need to parse these strings. Parsing can be done using the strptime function, which returns POSIXlt dates.

    To parse the dates, you must tell strptime which bits of the string correspond to which bits of the date. The date format is specified using a string, with components specified with a percent symbol followed by a letter. See the example below. These components are combined with other fixed characters, such as colons in times, or dashes and slashes in dates to form a full specification.

    Example:

    strptime("25/06/2016 09:50:24", "%d/%m/%Y %H:%M:%S")

    [1] “2016-06-25 09:50:24 IST”

    If a string does not match the format in the format string, it takes the value NA. For example, specifying dashes instead of slashes makes the parsing fail:

    Example:

    strptime("25-06-2016 09:50:24", "%d/%m/%Y %H:%M:%S")

    [1] NA

    To parse milliseconds in R we use the following format string:

    Example:

    strptime("25-06-2016 09:50:24.975", "%d-%m-%Y %H:%M:%OS")

    [1] “2016-06-25 09:50:24.975 IST”
    op = options(digits.secs=3)
    options(op)

    The options function allows the user to set and examine a variety of global options which affect the way in which R computes and displays its results. The argument “digits.secs” to the options function controls the maximum number of digits to print when formatting time values in seconds. Valid values are 0 to 6 with default 0.
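For example, setting digits.secs before printing makes the fractional seconds visible (a small illustration of the option described above; the printed time zone will depend on your locale):

# Show up to 3 digits of fractional seconds when printing date-times.
op <- options(digits.secs = 3)
strptime("25-06-2016 09:50:24.975", "%d-%m-%Y %H:%M:%OS")
options(op)   # restore the previous setting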

    How to format dates

    We can format a date as per our requirement using the strftime (“string format time”) function. This function works on both POSIXct and POSIXlt date classes, and it turns a date variable into a string.

    Example:

now_time = Sys.time()

# Let us first check the class of the now_time variable by calling the
# class function.
class(now_time)

    [1] “POSIXct” “POSIXt”

    As can be seen, it is a “POSIXct” variable. To correctly format a date, we need to use the right naming conventions. For example, %B signifies the full name of the month, %p signifies the AM/PM time indicator, and so on. For the full list of the conventions, check the help page for the function. The required formatting conventions are placed in quotes and form the second argument of the function. An example of the strftime function usage is shown below.

    strftime(now_time, "%I:%M%p %A, %d %B %Y")

    [1] “09:16AM Saturday, 29 April 2017”

    Extract specific parts of a date and time

    R has two standard date-time classes, POSIXct and POSIXlt. The POSIXct class stores dates as the number of seconds since the start of 1970, while the POSIXlt stores dates as a list, with components for seconds, minutes, hours, day of month, etc. One can use list indexing to access individual components of a POSIXlt date. These components can be viewed using the unclass function.

    Example: In this example, we use the Sys.time function, which gives the current date and time. We convert this into POSIXlt class using the as.POSIXlt function. Now we can extract the required components.

time_now = as.POSIXlt(Sys.time())
print(time_now)

    [1] “2017-04-29 09:16:06 IST”

    unclass(time_now)

    $sec
    [1] 6.642435

    $min
    [1] 16

    $hour
    [1] 9

    $mday
    [1] 29

    $mon
    [1] 3

    $year
    [1] 117

    $wday
    [1] 6

    $yday
    [1] 118

    $isdst
    [1] 0

    $zone
    [1] “IST”

    $gmtoff
    [1] 19800

    attr(,”tzone”)
    [1] “” “IST” “IST”

# To extract minutes
time_now$min

    [1] 16

# To extract day of the month
time_now$mday

    [1] 29

Functions Demystified

%within% and %m+% functions

    %within% function: This function from the lubridate package checks whether a date-time object falls in a given date-time interval. It returns a logical TRUE or FALSE as the output. The function requires a date-time interval, which is created using the interval function. We use the ymd function to convert a non-date-time object into a date-time object.

    Example:

library(lubridate)
dates = interval("2016-03-03", "2016-06-03")
d = ymd("2016-04-21")
d %within% dates

    [1] TRUE

    %m+% function: This function is used to add or subtract months to/from a given date-time object.

    Examples:

# To add a month to a date-time object
library(lubridate)
d = ymd("2016-04-21")
d %m+% months(1)

    [1] “2016-05-21”

# To create a sequence of months from a given date-time object
library(lubridate)
d = ymd("2016-04-21")
d %m+% months(1:3)

    [1] “2016-05-21” “2016-06-21” “2016-07-21”

# To subtract a year from a given date-time object
d = ymd("2016-04-21")
d %m+% years(-1)

    [1] “2015-04-21”

    to.period function

    The to.period function is part of the xts package. It converts an OHLC or univariate object to a specified periodicity lower than the given data object.

    For example, the function can convert a daily series to a monthly series, or a monthly series to a yearly one, or a one minute series to an hourly series.

    Example:

library(quantmod)
data = getSymbols("AAPL", src = "yahoo", from = "2016-01-01", to = "2016-01-15", auto.assign = FALSE)
nrow(data)

    [1] 10

# Convert the above daily data series to weekly data.
to.period(data, period = "weeks")

    Valid period character strings include: “seconds”, “minutes”, “hours”, “days”, “weeks”, “months”, “quarters”, and “years”.

    To convert the daily data to monthly data, the syntax will be:

df = to.period(data, period = 'months')

    The result will contain the open and close for the given period, as well as the maximum and minimum over the new period, reflected in the new high and low, respectively. If volume for a period was available, the new volume will also be calculated.

    period.max and period.min functions

    The period.max and period.min functions are part of the xts package. period.max is used to calculate a maximum value per period given an arbitrary index of sections to be calculated over. The period.min function works in a similar manner to compute the minimum values.

    The syntax is given as:

period.max(x, index)
period.min(x, index)

    where,
    x – represents a univariate data object
    index – represents a numeric vector of endpoints

    Example:

library(xts)

# compute the period maximum
period.max(c(1, 1, 4, 2, 2, 6, 7, 8, -1, 20), c(0, 3, 5, 8, 10))

    3 5 8 10
    4 2 8 20

# compute the period minimum
period.min(c(1, 1, 4, 2, 2, 6, 7, 8, -1, 20), c(0, 3, 5, 8, 10))

    3 5 8 10
    1 2 6 -1

    Next Step

    We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.



    Performance differences between RevoScaleR, ColumnStore Table and In-Memory OLTP Table

    Sat, 04/29/2017 - 00:55

    (This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

Running *.XDF files using RevoScaleR computational functions, versus having the dataset available in a ColumnStore table or an In-Memory OLTP table, will be the focus of the comparison in this blog post.

For this test, I will use the AirLines dataset, available here. I have deliberately picked a 200 MB sample (of the 13 GB dataset) in order to properly test the differences and determine which approach works best.

After unzipping the file, I will use the following T-SQL query to import the file into SQL Server.

With this example, you can import an XDF file directly into a SQL Server table (note that I have transformed the CSV file into XDF and imported the XDF file into the SQL table):

-- must have write permissions on the folder: C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                rxOptions(sampleDataDir = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData")
                inFile <- file.path(rxGetOption("sampleDataDir"), "airsample.csv")
                of <- rxDataStep(inData = inFile, outFile = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData/airline20170428_2.xdf",
                             transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
                            ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
                            ,overwrite = TRUE
                            ,maxRowsByCols = 10000000
                            ,rowsPerRead = 200000)
                OutputDataSet <- rxXdfToDataFrame(of)'

DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT 1 AS N'

EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((ArrDelay INT
                   ,CRSDepTime DECIMAL(6,4)
                   ,DofWeek NVARCHAR(20)))
GO

     

So the whole process is done by creating a table, converting the above sp_execute_external_script call into a procedure, and importing the results from the external procedure into the table.

--Complete process
CREATE TABLE AirFlights_small
(id INT IDENTITY(1,1)
,ArrDelay INT
,CRSDepTime DECIMAL(6,4)
,DofWeek NVARCHAR(20)
);
GO

CREATE Procedure ImportXDFtoSQLTable
AS
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                rxOptions(sampleDataDir = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData")
                inFile <- file.path(rxGetOption("sampleDataDir"), "airsample.csv")
                of <- rxDataStep(inData = inFile, outFile = "airline20170428_2.xdf",
                transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
               ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
               ,overwrite = TRUE
               ,maxRowsByCols = 10000000)
                OutputDataSet <- data.frame(rxReadXdf(file=of, varsToKeep=c("ArrDelay", "CRSDepTime","DayOfWeek")))'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT 1 AS N'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((ArrDelay INT, CRSDepTime DECIMAL(6,4), DofWeek NVARCHAR(20)));
GO

INSERT INTO AirFlights_small
EXECUTE ImportXDFtoSQLTable;
GO

     

There you go. The data are in a T-SQL table. Now we can start with the comparisons. I will be measuring the time needed to get the average arrival delay per day of the week.

    RevoScaleR

Using the RevoScaleR package, I will use the rxCrossTabs function with the help of the transforms argument to convert the day of the week into a factor:

#importing data
outFile2 <- rxDataStep(inData = inFile,
            outFile = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData/airline20170428_2.xdf",
            transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
           ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
           ,overwrite = TRUE
           ,maxRowsByCols = 10000000)

of2 <- data.frame(rxReadXdf(file=outFile2, varsToKeep=c("ArrDelay", "CRSDepTime","DayOfWeek")))

summary(rxCrossTabs(ArrDelay~DayOfWeek
                    ,data = of2 #outFile2
                    ,transforms = transforms
                    ,blocksPerRead=300000), output="means")

    Now get those times:

# Getting times
system.time({
  summary(rxCrossTabs(ArrDelay~DayOfWeek
                      ,data = of2
                      ,transforms = transforms
                      ,blocksPerRead=300000), output="means")
})

This results in an elapsed time of about 7.9 seconds and a computation time of about 3.8 seconds.

Rows Read: 8400013, Total Rows Processed: 8400013, Total Chunk Time: 3.825 seconds
Computation time: 3.839 seconds.
   user  system elapsed
   2.89    0.37    7.89

     

    T-SQL query without any specifics

    To have a baseline, let’s run the following query:

SET STATISTICS TIME ON;
SELECT [DofWeek]
,AVG(ArrDelay) AS [means]
FROM
    AirFlights_small
GROUP BY
    [DofWeek]
SET STATISTICS TIME OFF;

    And check these time statistics

     SQL Server Execution Times:
    CPU time = 6124 ms,  elapsed time = 2019 ms.
    Warning: Null value is eliminated by an aggregate or other SET operation.

Obviously the CPU/computation time is higher, although the elapsed time is faster.

    ColumnStore Table

    Let’s create a nonclustered column store index.

CREATE TABLE AirFlights_CS
(id INT IDENTITY(1,1)
,ArrDelay INT
,CRSDepTime DECIMAL(6,4)
,DofWeek NVARCHAR(20)
);
GO

INSERT INTO AirFlights_CS(ArrDelay, CRSDepTime, DofWeek)
SELECT ArrDelay, CRSDepTime, DofWeek FROM AirFlights_small

CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_AirFlight
ON AirFlights_CS
(id, ArrDelay, CRSDepTime, DofWeek);
GO

    With the execution of the same query

SET STATISTICS TIME ON;
SELECT [DofWeek]
,AVG(ArrDelay) AS [means]
FROM
  AirFlights_CS
GROUP BY
    [DofWeek]
SET STATISTICS TIME OFF;

    The following time statistics are in

     SQL Server Execution Times:
    CPU time = 202 ms,  elapsed time = 109 ms.
    Warning: Null value is eliminated by an aggregate or other SET operation.

     

    In-Memory OLTP

To get a memory-optimized table, we need to add a filegroup and create a table with memory optimization turned on:

CREATE TABLE dbo.AirFlight_M
(
 id INT NOT NULL PRIMARY KEY NONCLUSTERED
 ,ArrDelay INT
 ,CRSDepTime DECIMAL(6,4)
 ,DofWeek NVARCHAR(20)
) WITH (MEMORY_OPTIMIZED=ON, DURABILITY = SCHEMA_AND_DATA);
GO

    And insert the data

INSERT INTO AirFlight_M
SELECT * FROM AirFlights_small

    Running the simple query

SET STATISTICS TIME ON;
SELECT [DofWeek]
,AVG(ArrDelay) AS [means]
FROM
    AirFlight_M
GROUP BY
    [DofWeek]
SET STATISTICS TIME OFF;

    results are:

     SQL Server Execution Times:
    CPU time = 6186 ms,  elapsed time = 1627 ms.
    Warning: Null value is eliminated by an aggregate or other SET operation.

These results were somewhat expected, mostly because the ColumnStore table is the only one with an index, and reading from it (as the execution plans also show) is optimised in comparison to the others. The degree of parallelism and clustered and non-clustered indexes could also be pushed further, but the idea was to keep the tests similar to the one in the RevoScaleR and R environment. With R, we cannot push any index onto the XDF file.

    In R we run:

system.time({
  LMResults <- rxLinMod(ArrDelay ~ DayOfWeek, data = outFile2, transforms = transforms)
  LMResults$coefficients
})

    And in SSMS we run:

SET STATISTICS TIME ON;
-- 1. T-SQL
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                OutputDataSet <- data.frame(LMResults$coefficients)'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlights_small]'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
            --DofWeek NVARCHAR(20)
        --    ,
            Coefficient DECIMAL(10,5)
            ));
GO
SET STATISTICS TIME OFF;

SET STATISTICS TIME ON;
-- 2. ColumnStore
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                OutputDataSet <- data.frame(LMResults$coefficients)'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlights_CS]'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
            --DofWeek NVARCHAR(20)
        --    ,
            Coefficient DECIMAL(10,5)
            ));
GO
SET STATISTICS TIME OFF;

SET STATISTICS TIME ON;
-- 3. Memory optimized
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                OutputDataSet <- data.frame(LMResults$coefficients)'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlight_M]'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
            --DofWeek NVARCHAR(20)
        --    ,
            Coefficient DECIMAL(10,5)
            ));
GO
SET STATISTICS TIME OFF;

     

    Conclusion

Gathering statistics on CPU time and elapsed time when running a simple linear regression, this is the comparison:

df_LR_comparison <- data.frame(
  method = c("T-SQL", "ColumnStore", "Memory Optimized", "RevoScaleR")
  ,CPUtime = c(3000,1625,2156,7689)
  ,ElapsedTime = c(14323,10851,10600,7760)
  )

library(ggplot2)

ggplot(df_LR_comparison, aes(method, fill=method)) +
  geom_bar(aes(y=ElapsedTime), stat="identity") +
  geom_line(aes(y=CPUtime, group=1), color="white", size=3) +
  scale_colour_manual(" ", values=c("d1" = "blue", "d2" = "red"))+
  #scale_fill_manual("",values="red")+
  theme(legend.position="none")

This shows that the elapsed time for the R environment with RevoScaleR (reading the data from XDF) is the fastest, whereas the plain T-SQL run with sp_execute_external_script and RevoScaleR gives the slowest response.

In terms of CPU time (the white line), the ColumnStore table with the RevoScaleR call through the external procedure outperforms all the others.

Final conclusion: when running statistical analysis (using RevoScaleR or any other R library), use columnstore and index-optimised tables/views to get the best CPU and elapsed times. It is also important to remember that any aggregations and calculations that can be done within SQL Server are better performed there.

     

    As always, code is available at GitHub.

     

    Happy coding!


    Make pleasingly parallel R code with rxExecBy

    Fri, 04/28/2017 - 22:22

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    Some things are easy to convert from a long-running sequential process to a system where each part runs at the same time, thus reducing the required time overall. We often call these "embarrassingly parallel" problems, but given how easy it is to reduce the time it takes to execute them by converting them into a parallel process, "pleasingly parallel" may well be a more appropriate name.

    Using the foreach package (available on CRAN) is one simple way of speeding up pleasingly parallel problems using R. A foreach loop is much like a regular for loop in R, and by default will run each iteration in sequence (again, just like a for loop). But by registering a parallel "backend" for foreach, you can run many (or maybe even all) iterations at the same time, using multiple processors on the same machine, or even multiple machines in the cloud.
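For example, a minimal foreach setup with a parallel backend could look like the sketch below (using the doParallel package; the number of workers and the toy task are arbitrary choices for illustration):

library(foreach)
library(doParallel)

# Register a parallel backend with 4 workers (adjust to your machine).
cl <- makeCluster(4)
registerDoParallel(cl)

# Each iteration is independent, so %dopar% can run them at the same time.
results <- foreach(i = 1:4, .combine = rbind) %dopar% {
  fit <- lm(mpg ~ wt, data = mtcars[sample(nrow(mtcars), 20), ])
  coef(fit)
}

stopCluster(cl)
results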

    For many applications, though, you need to provide a different chunk of data to each iteration to process. (For example, you may need to fit a statistical model within each country — each iteration will then only need the subset for one country.) You could just pass the entire data set into each iteration and subset it there, but that's inefficient and may even be impractical when dealing with very large datasets sitting in a remote repository. A better idea would be to leave the data where it is, and run R within the data repository, in parallel.

Microsoft R 9.1 introduces a new function, rxExecBy, for exactly this purpose. When your data is sitting in SQL Server or Spark, you can specify a set of keys to partition the data by, and an R function (any R function, built-in or user-defined) to apply to the partitions. The data doesn't actually move: R runs directly on the data platform. You can also run it on local data in various formats.
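Based on Microsoft's documentation, a call looks roughly like the sketch below; the data source name is hypothetical, and the exact argument names should be checked against the rxExecBy help page in your version of Microsoft R Client/Server, since the function is part of RevoScaleR and is not available in open-source R:

# Rough sketch only: requires Microsoft R Client/Server with RevoScaleR.
# The user-defined function receives the key values and the data partition.
fitByDay <- function(keys, data) {
  lm(ArrDelay ~ CRSDepTime, data = data)
}

# 'airlineData' is a placeholder for e.g. an RxSqlServerData or RxXdfData source.
results <- rxExecBy(inData = airlineData,
                    keys   = c("DayOfWeek"),   # partition the data by these columns
                    func   = fitByDay)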

    The rxExecBy function is included in Microsoft R Client (available free) and Microsoft R Server. For some examples of using rxExecBy, take a look at the Microsoft R Blog post linked below.

    Microsoft R Blog: Running Pleasingly Parallel workloads using rxExecBy on Spark, SQL, Local and Localpar compute contexts


    Are the Kids Alright? Trends in Teen Risk Behavior in NYC

    Fri, 04/28/2017 - 21:11

    (This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Contributed by Brandy Freitas as part of the Spring 2017 NYC Data Science Academy 12-Week Data Science Bootcamp. This post is based on the first class project, due Week 3 – Exploratory Visualization & Shiny. A link to the Shiny App is here: Youth Risk Behavior Survey App.

Introduction:

Take a second to think about a teenager in your community, your borough, your neighborhood. Think about who they are, what they look like, what is happening in their life. Take another second to think about how much power we, as voting adults, advertisers, policy makers, have over an entire population with its own culture, norms, issues and struggles. We make the policies and decisions that allocate resources like time, money, and education for them. Often, adults have a notion that “we know best” how to help our youth, but I wonder: do we really know what's going on with them? Are we in touch with the reality of the youth experience in our community?

I was drawn to a particular study, the Youth Risk Behavior Survey (YRBS), conducted by the Centers for Disease Control and Prevention (CDC) since the 1990s. The survey itself looks at eating habits, exposure to violence, sex, drugs, physical activity, and many more variables related to the physical and emotional health of youth. It offers a longitudinal view of the well-being of teens from across the country, in both rural and urban settings, and was designed to determine the prevalence of health behaviors, assess whether they increase, decrease, or stay the same over time, and examine their co-occurrence. The CDC also adds questions over the years in response to technological advancements and trends, such as questions about texting while driving and cyber bullying.

The Data:

I downloaded the combined YRBS dataset in the available ASCII format from the CDC's website, and converted it to a .csv file using an SPSS script and IBM's SPSS software package. Since my questions were framed around understanding my local youth, for whom my voting patterns and preconceptions probably matter most, I wanted to work with data on teens from the five New York City boroughs. Using R, I filtered the data based on location and the variables that I was interested in studying (based on question type), and cleaned them up to remove missing values in key variable columns (grade, age, sex, etc.) that were required for my analysis. Originally, the file contained 320K rows with 210 variables, and it was reduced to 64K rows and 27 variables. I also recoded the data to make it easier to understand, as most of the variables included were coded for simplicity (i.e., 1 = Female, 2 = Male), which was difficult to manipulate and interpret. I used ggplot2, Leaflet, plotly, and the Shiny Dashboard to visualize the data. A sample of the data manipulated in R is below.

The CDC weights the data to make them representative of the population of students from which the sample was drawn. Generally, these adjustments are made by applying a weight based on student sex, grade, and race, so that researchers accessing the data are able to use it as is without worrying about selection bias. Borough data were from seven surveys over twelve years (the odd years from 2003-2015), and there were about two to three thousand respondents from each borough per year. I narrowed my focus down to obesity—which the CDC determines based on height, weight, age and gender—illegal drug use, and suicidal ideation. I was interested in studying differences by gender, borough, and grade level, and the effects over time.
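As an illustration of the kind of filtering and recoding described above (the column names and codes here are hypothetical, not the actual YRBS variable names):

library(dplyr)

# Hypothetical sketch of the cleaning step; 'yrbs_raw' stands in for the
# imported survey data, and the real YRBS columns are named differently.
yrbs_clean <- yrbs_raw %>%
  filter(!is.na(grade), !is.na(age), !is.na(sex)) %>%
  mutate(sex = recode(sex, `1` = "Female", `2` = "Male"))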
The App: My app is designed to offer a visual representation of some of the data from the YRBS, and to encourage the user to interact with the findings.

I. Teen Demographics: The landing page of the app is a Leaflet map of the NYC region, with markers for each borough showing the racial breakdown of the teens going to school there. I wanted the user to be able to explore the demographics of the region and draw their own inferences about variations between the teen population and the Census-reported overall population. For instance, the racial makeup of the students in public and private schools in Manhattan is not representative of the resident population. Does this distribution of teen demographics reflect a changing trend in the resident population of the five boroughs? What are the social and political implications of this kind of population disparity?

II. Drug Use: The purportedly worsening drug habits of urban youth have often been the subject of crime dramas, early morning news shows, and impassioned political speeches. We have also heard that things like the legalization of marijuana in the US will increase the usage of marijuana among youths due to a shifting perspective of morality. I wondered, though, whether teens really are consuming more illegal drugs. In this tab of the app, we can see the use of particular illegal drugs over time, grouped by borough. Below we see the trends in marijuana use over time.

For the drugs that I looked at (Cocaine, Methamphetamines, Heroin, Ecstasy, Marijuana), you can see that there is a downward trend in many of the boroughs, and that the overall percentage of students reporting using Ecstasy, Methamphetamines, and Heroin is very low. Digging deeper, I wonder: could we elucidate whether there is a correlation between drug use and sexual promiscuity (which is surveyed by the CDC as well), or between drug use and measures of depression? As a side note (or, in the ever-popular research science vernacular: 'Data Not Shown'), during an initial look at racial differences in the data, I found that Asian teens were very unlikely to report having used drugs, while white teens typically reported the highest levels of trying illegal drugs. This is particularly interesting given that Staten Island, with the highest percentage of white teens of all of the boroughs, consistently has the highest reported drug use. I am, however, hesitant to place any significance on these findings until I understand more about reporting differences between populations.

III. Depression: The National Suicide Prevention Resource Center estimates that 3.7% of the US adult population has suicidal thoughts, and 0.5% attempt suicide every year. If we look at the percentages among the surveyed teens (the percentage of students reporting either suicidal thoughts or actual attempts), though, you can see that they are significantly higher than those of the adult population. The user can explore many questions in this tab: Is depression in males increasing over time? Which boroughs tend to have higher depression rates? How has mental health trended over the years, and what might cause this? For perspective on the mental health aid available to teens in NYC, there are 231K high school students in NYC, with over 200 high schools in the Bronx alone. However, according to the NYC Department of Education, there are only 200 Mental Health Clinics available in schools in all five boroughs.
A lot of questions came up for me from this, particularly from a policy standpoint. The CDC has been mainly focused, both programmatically and in terms of funding, on HIV, STD, and pregnancy prevention efforts in the nation's schools. Based on this depression and mental health data, I wonder: Is this focus justified? Are mental health issues, which still appear to be under-funded and stigmatized, the basis of some of these risk behaviors? Further questions that I would like to study from this dataset: Are the teens from boroughs that have a lower median income or a higher report of violence more prone to depression? Do students who are overweight or obese, or students who identify as LGBTQ, show more signs of suicidal ideation? Does bullying contribute?

IV. Obesity: For the US teen population, the 5th to 85th BMI percentiles are considered normal weight and lie in the range 17-25. Under 17 is underweight, and over 30 is obese. I wanted to first focus in on this normal range, and designed my chart to present the user with the biologically more interesting section of the data. I wanted to draw attention first to the mass of the distribution and the median of each group. Take Queens in 2015 here, for example. A quick guide to reading this box plot using the highlighted BMI:
    • 42.14 = Greatest BMI value
    • 31.54 = Greatest value, excluding outliers
    • 24.75 = 25% of students have a BMI over this value
    • 22.19 = 50% of the students in the sample have a BMI over this value (median)
    • 19.72 = 25% of the students have a BMI less than this
    • 13.15 = Minimum value, excluding outliers
This means that about 25% of the students in Queens in 11th grade are overweight. If you look across the grade levels, you can see that the median increases consistently throughout high school. Are students getting heavier as they advance in grade level? What is causing this?

I also wanted to make sure that the user was able to interact with the data by being able to view the full range and whiskers. To get a perspective on the data's distribution, we can zoom out to see which way the data sways. Unfortunately, you can see that it leans toward obese.

This portion of the study brought up many avenues for further study. Which populations are most at risk (by race, gender, or some other factor), and can we identify them using the data here? For boroughs with better BMI stats (like Manhattan and Staten Island), what are they doing well, and could this be replicated in other areas? Regarding current policy trends, now that large urban centers are shifting away from the 'free and reduced lunch' model to the 'universal free lunch' model, will we see a shift here, either in a positive or negative way? Could we make provided school lunches more nutritious? What could be done to improve education on weight and exercise in schools?

Final Notes:

    This study offers an interesting snapshot of the mental and physical health of a very vulnerable portion of our society, and I am looking forward to digging deeper into the data to find more coincident variables and health outcomes. It is my strong suggestion, though, that the CDC offer zipped .csv files on their website, so that data enthusiasts would be more likely to access and analyze this study.

     

    The post Are the Kids Alright? Trends in Teen Risk Behavior in NYC appeared first on NYC Data Science Academy Blog.

To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.

    Data science for Doctors: Variable importance Exercises

    Fri, 04/28/2017 - 18:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)


Data science enhances people's decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people who are around the medical field to enhance their data science skills.

We will work with a health-related database, the famous "Pima Indians Diabetes Database". It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the tenth part of the series, and it aims to cover the very basics of the correlation coefficient and principal components analysis; these two methods illustrate how variables are related.
In my opinion, researchers need to have a notion of the relationships between variables in order to be able to find potential cause-and-effect relations (although such a relation remains hypothetical: you can't claim that there is a cause-effect relation only because the correlation between two variables is high), to remove unnecessary variables, and so on. In particular, we will go through the Pearson correlation coefficient, confidence intervals by the bootstrap, and principal component analysis.

Before proceeding, it might be helpful to look over the help pages for ggplot, cor, cor.test, boot.cor, quantile, eigen, princomp, summary, plot, and autoplot.

    Moreover please load the following libraries.
    install.packages("ggplot2")
    library(ggplot2)
    install.packages("ggfortify")
    library(ggfortify)

    Please run the code below in order to load the data set and transform it into a proper data frame format:

    url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
    data <- read.table(url, fileEncoding="UTF-8", sep=",")
    names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
    colnames(data) <- names
    data <- data[-which(data$mass ==0),]

    Answers to the exercises are available here.

    If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

    Exercise 1

    Compute the value of the correlation coefficient for the variables age and preg.

    Exercise 2

    Construct the scatterplot for the variables age and preg.

    Exercise 3

Apply a correlation test to the variables age and preg, with the null hypothesis that the correlation is zero and the alternative that it is different from zero.
    hint: cor.test
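A possible shape for Exercises 1-3 (a sketch only, not the official solutions, using the data frame and libraries loaded above):

cor(data$age, data$preg)                                  # Exercise 1: Pearson correlation
ggplot(data, aes(x = age, y = preg)) + geom_point()       # Exercise 2: scatterplot
cor.test(data$age, data$preg, alternative = "two.sided")  # Exercise 3: test of rho = 0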

    Exercise 4

Construct a 95% confidence interval by the bootstrap. First, find the correlation by bootstrap.
    hint: mean

    Exercise 5

    Now that you have found the correlation, find the 95% confidence interval.
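One way to approach Exercises 4 and 5, assuming a simple percentile bootstrap is intended (a sketch, not the official solution):

set.seed(123)
boot_cor <- replicate(1000, {
  idx <- sample(nrow(data), replace = TRUE)
  cor(data$age[idx], data$preg[idx])
})
mean(boot_cor)                       # bootstrap estimate of the correlation
quantile(boot_cor, c(0.025, 0.975))  # 95% percentile confidence interval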

    Exercise 6

Find the eigenvalues and eigenvectors for the data set (exclude the class.fac variable).

    Exercise 7

    Compute the principal components for the dataset used above.

    Exercise 8

    Show the importance of each principal component.

    Exercise 9

    Plot the principal components using an elbow graph.

    Exercise 10

Construct a scatterplot with the x-axis being the first component and the y-axis the second component. Moreover, if possible, draw the eigenvectors on the plot.
    hint: autoplot
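A sketch of how Exercises 6-10 might look, assuming the class label sits in the last column of the data frame loaded above (not the official solutions):

X <- data[, -ncol(data)]        # drop the class variable
eigen(cor(X))                   # Exercise 6: eigenvalues and eigenvectors
pc <- princomp(X, cor = TRUE)   # Exercise 7: principal components
summary(pc)                     # Exercise 8: importance of each component
screeplot(pc, type = "lines")   # Exercise 9: elbow graph
autoplot(pc, loadings = TRUE)   # Exercise 10: PC1 vs PC2 (needs ggfortify)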

    Related exercise sets:
    1. Data science for Doctors: Inferential Statistics Exercises(Part-4)
    2. Data science for Doctors: Inferential Statistics Exercises (part-3)
    3. Data Science for Doctors – Part 1 : Data Display
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

    Machine Learning Classification Using Naive Bayes

    Fri, 04/28/2017 - 17:26

    (This article was first published on Data R Value, and kindly contributed to R-bloggers)

We will develop a classification exercise using the Naive Bayes algorithm. The exercise was originally published in "Machine Learning in R" by Brett Lantz, PACKT Publishing, 2015 (open source community experience distilled).

    Naive Bayes is a probabilistic classification algorithm that can be applied to problems of text classification such as spam filtering, intrusion detection or network anomalies, diagnosis of medical conditions given a set of symptoms, among others.
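As a toy illustration of the idea (my own sketch, not part of the book exercise), the score for each class is its prior probability times the product of the per-word likelihoods; the numbers below are made up:

prior_spam <- 0.2; prior_ham <- 0.8
p_spam <- prior_spam * 0.40 * 0.05   # assumed P(word | spam) for two words in a message
p_ham  <- prior_ham  * 0.01 * 0.10   # assumed P(word | ham) for the same two words
p_spam / (p_spam + p_ham)            # posterior probability of spam, about 0.83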

    The exercise we will develop is about filtering spam and ham sms messages.

    We will carry out the exercise verbatim as published in the aforementioned reference.

    To develop the Naive Bayes classifier, we will use data adapted from the SMS Spam Collection at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ .

    ### install required packages

install.packages("tm")
install.packages("NLP")
install.packages("SnowballC")
install.packages("wordcloud")
install.packages("e1071")
install.packages("RColorBrewer")
install.packages("gmodels")

    library(tm)
    library(NLP)
    library(SnowballC)
    library(RColorBrewer)
    library(wordcloud)
    library(e1071)
    library(gmodels)

    ### read the data
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)

    ### transform type into factor (ham/spam)
    sms_raw$type <- factor(sms_raw$type)

    ### see frequencies
    table(sms_raw$type)

    ### create volatile (stored in memory) corpus
### VCorpus creates a complex list; we can use list manipulation
### commands to manage it

    sms_corpus <- VCorpus(VectorSource(sms_raw$text))

    ### inspect the two first elements
    inspect(sms_corpus[1:2])

    ### see the first sms (as text)
    as.character(sms_corpus[[1]])

    ### to view multiple documents
    lapply(sms_corpus[1:2], as.character)

    ### cleaning the text in documents (corpus)

    ### tm_map to map all over the corpus
    ### content_transformer to access the corpus
    ### tolower to lowercase all strings

    sms_corpus_clean <- tm_map(sms_corpus,
                               content_transformer(tolower))
    ### check and compare the result of the cleaning
    as.character(sms_corpus[[1]])
    as.character(sms_corpus_clean[[1]])

    ### remove numbers
    sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)

    ### remove “stop words”
    ### see the “stop words” list

    stopwords()

    sms_corpus_clean <- tm_map(sms_corpus_clean,
                               removeWords, stopwords())
    ### remove punctuation
    sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)

### stemming (transform words into their root form)
    sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)

### remove additional whitespace
    sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)

    ### check and compare result of cleaning
    as.character(sms_corpus[[1]])
    as.character(sms_corpus_clean[[1]])

    ### tokenization
    ### document term matrix (DTM)
    ### rows are sms and columns are words

    sms_dtm <- DocumentTermMatrix(sms_corpus_clean)

    ### data preparation (training and test sets)
    ### 75% train 25% test

    sms_dtm_train <- sms_dtm[1:4169, ]
    sms_dtm_test <- sms_dtm[4170:5559, ]

    ### labels for train and test sets
    ### feature to be classified

    sms_train_labels <- sms_raw[1:4169, ]$type
    sms_test_labels <- sms_raw[4170:5559, ]$type

    ### confirm that the subsets are representative of the
    ### complete set of SMS data

    prop.table(table(sms_train_labels))
    prop.table(table(sms_test_labels))

    ### wordcloud
    wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE)


    ### separated wordclouds
spam <- subset(sms_raw, type == "spam")
ham <- subset(sms_raw, type == "ham")

    wordcloud(spam$text, max.words = 40, scale = c(3, 0.5))
    wordcloud(ham$text, max.words = 40, scale = c(3, 0.5))

    ### creating indicator features for frequent words
    findFreqTerms(sms_dtm_train, 5)
    sms_freq_words <- findFreqTerms(sms_dtm_train, 5)
    str(sms_freq_words)

    ### filter DTM by frequent terms
    sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
    sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]

    ### change DTM frequency to factor (categorical)
convert_counts <- function(x) {
  x <- ifelse(x > 0, "Yes", "No")
}

    sms_train <- apply(sms_dtm_freq_train, MARGIN = 2,
                       convert_counts)
    sms_test <- apply(sms_dtm_freq_test, MARGIN = 2,
                      convert_counts)

    ### training a model on the data
### alternative package for Naive Bayes ("klaR")

    sms_classifier <- naiveBayes(sms_train, sms_train_labels)

    ### evaluating model performance
    ### make predictions on test set

    sms_test_pred <- predict(sms_classifier, sms_test)

    ### compare predictions (classifications) with the
    ### true values

CrossTable(sms_test_pred, sms_test_labels,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))



    ### improving model performance?
    ### set Laplace estimator to 1

    sms_classifier2 <- naiveBayes(sms_train, sms_train_labels,
                                  laplace = 1)

    sms_test_pred2 <- predict(sms_classifier2, sms_test)

CrossTable(sms_test_pred2, sms_test_labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))

    You can get the exercise and the data set in:
    https://github.com/pakinja/Data-R-Value 

To leave a comment for the author, please follow the link and comment on their blog: Data R Value.

    Retrieving Reading Levels with R

    Fri, 04/28/2017 - 13:14

    (This article was first published on More or Less Numbers, and kindly contributed to R-bloggers)

For those that don't work in education or aren't aware, there is a measurement for a child's reading level called a Lexile ® Level. There are ways in which this level can be retrieved using different reading assessments. The measurement can be used to match a child's reading level to books at the same level, and the company maintains a database of books to which it has assigned levels. There are other assessments that measure how well a child is reading, but I'm not aware of any systems that assign levels to books as comprehensively. The library at the school I work at lacked Lexile Levels for their books. This is unfortunate because teachers are not able to provide students with books for their respective Lexile Level. Fortunately, though, the library had a list of ISBNs for the books.

    On the Lexile website there is a way to search for books using ISBN numbers to retrieve their Lexile ® Level if the book is available in their database.  Entering every ISBN number available is a task fit for something not human.

    rvest to the rescue.

Below is the script to retrieve the Lexile Levels of books if a list of ISBNs is available. This was an incredible time saver provided by some R code, and hopefully someone else out there can use it.
     
library(rvest)
library(httr)
library(htmltools)
library(dplyr)

## Prep for things used later
url <- "https://www.lexile.com/fab/results/?keyword="
url2 <- "https://lexile.com/book/details/"

## CSV file with ISBN numbers
dat1 <- read.csv("~isbns.csv", header = FALSE)
## dat1 <- data.frame(dat1[203:634,])
dat <- as.character(dat1[,1]) %>% trimws()
## dat <- dat[41:51]

blank <- as.character("NA")
blank1 <- as.character("NA")
## blank2 <- as.character("NA")
## blank3 <- as.character("NA")

all <- data.frame("A", "B", "C")
colnames(all) <- c("name", "lexiledat", "num")
all <- data.frame(all[-1,])

for (i in dat) {
  sites <- paste(url, i, sep = "")
  x <- GET(sites, add_headers('user-agent' = 'r'))
  webpath <- x$url %>% includeHTML %>% read_html()
  ## Book Name
  name <- webpath %>% html_nodes(xpath = "///div[2]/div/div[2]/h4/a") %>% html_text() %>% trimws()
  ## Lexile Range
  lexile <- webpath %>% html_nodes(xpath = "///div[2]/div/div[3]/div[1]") %>% html_text() %>% trimws() %>% as.character()
  ## CSS change sometimes
  lexiledat <- ifelse(is.na(lexile[2]) == TRUE, lexile, lexile[2])
  test1 <- data.frame(lexiledat, NA)
  ## Breaks every now and then when adding Author/Pages
  ## Author Name
  ## author <- webpath %>% html_nodes(xpath = '///div[2]/div/div[2]/span') %>% html_text() %>% as.character() %>% trimws()
  ## author <- sub("by: ", "", author)
  ## Pages
  ## pages <- webpath %>% html_nodes(xpath = '///div[2]/div/div[2]/div/div[1]') %>% html_text() %>% as.character() %>% trimws()
  ## pages <- sub("Pages: ", "", pages)
  ## Some books not found, this excludes them and replaces with NA values
  df <- if (is.na(test1)) data.frame(blank, blank1) else data.frame(name, lexiledat, stringsAsFactors = FALSE)
  colnames(df) <- c("name", "lexiledat")
  df$num <- i
  all <- bind_rows(all, df)
}
master <- rbind(all1, all)

       
    Link to code

To leave a comment for the author, please follow the link and comment on their blog: More or Less Numbers.

    R Quick Tip: Upload multiple files in shiny and consolidate into a dataset

    Fri, 04/28/2017 - 12:27

    In shiny, you can use the fileInput with the parameter multiple = TRUE to enable you to upload multiple files at once. But how do you process those multiple files in shiny and consolidate into a single dataset?

The bit we need from shiny is the datapath column of the fileInput's value, i.e. input$param$datapath when the input's id is param.

We can use lapply() with data.table's fread() to read the multiple CSVs from the fileInput(). Then we can use data.table's rbindlist() to consolidate them into a single dataset, as in the sketch below.
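A minimal sketch of the whole pattern (my own example, not the gist linked below; the input id files is an arbitrary choice, while datapath is the column shiny uses for the uploaded temporary files):

library(shiny)
library(data.table)

ui <- fluidPage(
  fileInput("files", "Upload CSVs", multiple = TRUE, accept = ".csv"),
  tableOutput("preview")
)

server <- function(input, output) {
  combined <- reactive({
    req(input$files)
    # input$files is a data frame describing the uploads; datapath holds the temp paths
    rbindlist(lapply(input$files$datapath, fread))
  })
  output$preview <- renderTable(head(combined()))
}

shinyApp(ui, server)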

    For more info on using data.table for consolidating CSVs, read my post on rbindlist()

If you wanted to process things other than CSVs, then you might consider alternative libraries; and of course, you don't need to put them all into one big table.

    View the code on Gist.

    The post R Quick Tip: Upload multiple files in shiny and consolidate into a dataset appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.

    Beautiful boxplots in base R

    Fri, 04/28/2017 - 11:26

    (This article was first published on R – Opiniomics, and kindly contributed to R-bloggers)

    As many of you will be aware, I like to post some R code, and I especially like to post base R versions of ggplot2 things!

    Well these amazing boxplots turned up on github – go and check them out!

    So I did my own version in base R – check out the code here and the result below.  Enjoy!

To leave a comment for the author, please follow the link and comment on their blog: R – Opiniomics.

    Meetup: Machine Learning in Production with Szilard Pafka

    Fri, 04/28/2017 - 09:55

    Machine Learning in Production
    by Szilard Pafka

    In this talk I will discuss the main steps involved in having machine learning models in production. No hype (“artificial intelligence”), no bullshitting and no marketing of expensive products (the best tools are all open source and free, you just need to know what you are doing).

Bio: Szilard has trained, deployed and maintained machine learning models in production (starting with random forests) since 2007. He's a Chief Scientist and a Physics PhD with 20+ years of experience with data, computing and modeling. He founded the first data meetup in LA in 2009 (by chance) and he has organized several data science events since.

    We’ll also have a lightning talk (10-mins) before the main talk:

    Opening the black box: Attempts to understand the results of machine learning models
    by Michael Tiernay, Data Scientist at Edmunds

    Timeline:

    – 6:30pm arrival, food/drinks and networking
    – 7:30pm talks
    – 9 pm more networking

You must have a confirmed RSVP and please arrive by 7:25pm at the latest. Please RSVP here on Eventbrite.

    Venue: Edmunds, 2401 Colorado Avenue (this is Edmunds’ NEW LOCATION, don’t go to the old one)
    Park underneath the building (Colorado Center), Edmunds will validate.

    Salaries by alma mater – an interactive visualization with R and plotly

    Fri, 04/28/2017 - 06:08

    (This article was first published on Alexej's blog, and kindly contributed to R-bloggers)

Based on an interesting dataset from the Wall Street Journal I made the above visualization of the median starting salary for US college graduates from different undergraduate institutions (I have also looked at the mid-career salaries and the salary increase, but more on that later). However, I thought that it would be a lot more informative if it were interactive. At the very least I wanted to be able to see the school names when hovering over or clicking on the points with the mouse.

    Luckily, this kind of interactivity can be easily achieved in R with the library plotly, especially due to its excellent integration with ggplot2, which I used to produce the above figure. In the following I describe how exactly this can be done.

    Before I show you the interactive visualizations, a few words on the data preprocessing, and on how the map and the points are plotted with ggplot2:

    • I generally use functions from the tidyverse R packages.
    • I save the data in the data frame salaries, and transform the given amounts to proper floating point numbers, stripping the dollar signs and extra whitespaces.
    • The data provide school names. However, I need to find out the exact geographical coordinates of each school to put it on the map. This can be done in a very convenient way, by using the geocode function from the ggmap R package: school_longlat <- geocode(salaries$school) school_longlat$school <- salaries$school salaries <- left_join(salaries, school_longlat)

    • For the visualization I want to disregard the colleges in Alaska and Hawaii to avoid shrinking the rest of the map. The respective rows of salaries can be easily determined with a grep search: grep("alaska", salaries$school, ignore.case = 1) # [1] 206 grep("hawaii", salaries$school, ignore.case = 1) # [1] 226

    • A data frame containing geographical data that can be used to plot the outline of all US states can be loaded using the function map_data from the ggplot2 package: states <- map_data("state")

    • And I load a yellow-orange-red palette with the function brewer.pal from the RColorBrewer library, to use as a scale for the salary amounts: yor_col <- brewer.pal(6, "YlOrRd")

    • Finally the (yet non-interactive) visualization is created with ggplot2: p <- ggplot(salaries[-c(206, 226), ]) + geom_polygon(aes(x = long, y = lat, group = group), data = states, fill = "black", color = "white") + geom_point(aes(x = lon, y = lat, color = starting, text = school)) + coord_fixed(1.3) + scale_color_gradientn(name = "Starting\nSalary", colors = rev(yor_col), labels = comma) + guides(size = FALSE) + theme_bw() + theme(axis.text = element_blank(), axis.line = element_blank(), axis.ticks = element_blank(), panel.border = element_blank(), panel.grid = element_blank(), axis.title = element_blank())

    Now, entering p into the R console will generate the figure shown at the top of this post.

    However, we want to…

    …make it interactive

The function ggplotly immediately generates a plotly interactive visualization from a ggplot object. It's that simple! (Though I must admit that, more often than I would be okay with, some elements of the ggplot visualization disappear or don't look as expected.)

    The function argument tooltip can be used to specify which aesthetic mappings from the ggplot call should be shown in the tooltip. So, the code

    ggplotly(p, tooltip = c("text", "starting"), width = 800, height = 500)

    generates the following interactive visualization.

    Now, if you want to publish a plotly visualization to https://plot.ly/, you first need to communicate your account info to the plotly R package:

Sys.setenv("plotly_username" = "??????")
Sys.setenv("plotly_api_key" = "????????????")

    and after that, posting the visualization to your account at https://plot.ly/ is as simple as:

plotly_POST(filename = "Starting", sharing = "public")

More visualizations

    Finally, based on the same dataset I have generated an interactive visualization of the median mid-career salaries by undergraduate alma mater (the R script is almost identical to the one described above).
    The resulting interactive visualization is embedded below.

    Additionally, it is quite informative to look at a visualization of the salary increase from starting to mid-career.

To leave a comment for the author, please follow the link and comment on their blog: Alexej's blog.

    NY R Conference

    Fri, 04/28/2017 - 02:00

    (This article was first published on R Views, and kindly contributed to R-bloggers)



    The 2017 New York R Conference was held last weekend in Manhattan. For the third consecutive year, the organizers – a partnership including Lander Analytics, The New York Meetup and Work-Bench – pulled off a spectacular event. There was a wide range of outstanding talks, some technical and others more philosophical, a palpable sense of community and inclusiveness, great food, beer and Bloody Marys.

    Fortunately, the talks were recorded. While watching the videos (which I imagine will be available in a couple of weeks) will be no substitute for the experience, I expect that the extended R community will find the talks valuable. In this post, I would like to mention just a couple of the presentations that touched on the professions and practices of data science and statistics.

In his talk, "The Humble (Programmer) Data Scientist: Essence and Accident in (Software Engineering) Data Science", Friederike Schüür invoked an analogy between the current effort to establish data science as a profession, and the events of sixty years ago to obtain professional status for programmers. Schüür called out three works from master programmers: The Humble Programmer by Edsger Dijkstra, Computer Programming as an Art by Donald Knuth, and No Silver Bullet – Essence and Accident in Software Engineering. These are all classic papers, perennially relevant, and well worth reading and re-reading.

In the first paper, Dijkstra writes:

    Another two years later, in 1957, I married and Dutch marriage rites require you to state your profession and I stated that I was a programmer. But the municipal authorities of the town of Amsterdam did not accept it on the grounds that there was no such profession. . . So much for the slowness with which I saw the programming profession emerge in my own country. Since then I have seen more of the world, and it is my general impression that in other countries, apart from a possible shift of dates, the growth pattern has been very much the same.

    Social transformations appear to happen much more quickly in the twenty-first century. Nevertheless, that the process requires many different kinds of people to alter their world views to establish a new profession seems to be an invariant.

    The second talk I would like to mention was by Andrew Gelman. The title listed in the program is “Theoretical Statistics is the Theory of Applied Statistics: How to Think About What We Do”. Since Gelman spoke without slides in front of a dark screen, I am not sure that is actually the talk he gave, but whatever talk he gave, it was spellbinding. Gelman spoke so quickly and threw out so many ideas that my mental buffers were completely overwritten. I will have to wait for the video to sort out my impressions. Nevertheless, there were four quotations that I managed to write down that I think are worth considering:

    • “Taking something that was outside of the realm of statistics and putting it into statistics is a good idea.”
    • “I would like workflows to be more formalized.”
    • “Much of workflow is still outside of the tent of statistical theory.”
    • “I think we need a theory of models.”

    Taken together, and I hope not taken out of context, the sentiment expressed here argues for expanding the domain of theoretical statistics to include theories of inferential models and the workflows in which they are produced.

    The idea of the importance of workflows to science itself seems to be gaining some traction in the statistical community. In his yet-to-be-published but widely circulated paper, 50 years of Data Science, theoretical statistician David Donoho writes:

    A crucial hidden component of variability in science is the analysis workflow. Different studies of the same intervention may follow different workflows, which may cause the studies to get different conclusions… Joshua Carp studied analysis workflows in 241 fMRI studies. He found nearly as many unique workflows as studies! In other words researchers are making up a new workflow for pretty much every study.

My take on things is that the movement to establish the profession of data science already has too much momentum behind it to stop it any time soon. Whether it is the data scientists or the statisticians who come to own the theories of models and workflows doesn't matter all that much. No matter who develops these research areas, we will all benefit. The social and intellectual friction caused by the movement of data science is heating things up in a good way.


To leave a comment for the author, please follow the link and comment on their blog: R Views.

    a secretary problem with maximum ability

    Fri, 04/28/2017 - 00:17

    (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

The Riddler of today has a secretary problem, where one measures sequentially N random variables until one deems the current variable to be the largest of the whole sample. The classical secretary problem has a counter-intuitive solution where one first measures N/e random variables without taking any decision, and then and only then picks the first subsequent outcome larger than the largest in the first group. The added information in the current riddle is that the distribution of those iid random variables is set to be uniform on {1,…,M}, which begs for a modification in the algorithm: for instance, when observing M on the current draw, one should stop right away.

The approach I devised is clearly suboptimal, as I decided to pick the currently observed value if the (conditional) probability that it is the largest of the whole sample exceeds the probability that one of the subsequent draws will be larger. This translates into the following R code:

M=100 #maximum value
N=10  #total number of draws

hyprob=function(m){
  # m is sequence of draws so far
  n=length(m);mmax=max(m)
  if ((m[n]<mmax)||(mmax-n<N-n)){ prob=0
  }else{
    prob=prod(sort((1:mmax)[-m],dec=TRUE)[1:(N-n)]/((M-n):(M-N+1)))}
  return(prob)}

decision=function(draz=sample(1:M,N)){
  i=0
  keepgoin=TRUE
  while ((keepgoin)&(i<N)){
    i=i+1
    keepgoin=(hyprob(draz[1:i])<0.5)}
  return(c(i,draz[i],(draz[i]<max(draz))))}

    which produces a winning rate of around 62% when N=10 and M=100, hence much better than the expected performances of the secretary algorithm, with a winning frequency of 1/e.
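As a quick Monte Carlo check of that figure (my own addition, relying on the decision() function above; the third element it returns is 1 when the selected value is not the overall maximum):

set.seed(1)
wins <- replicate(1e4, decision()[3] == 0)
mean(wins)  # comes out around 0.62 for N=10, M=100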

    Filed under: Kids, R Tagged: mathematical puzzle, R, secretary problem, stopping rule, The Riddler

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

    Where Europe lives, in 14 lines of R Code

    Thu, 04/27/2017 - 23:56

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    Via Max Galka, always a great source of interesting data visualizations, we have this lovely visualization of population density in Europe in 2011, created by Henrik Lindberg:

     

    Impressively, the chart was created with just 14 lines of R code:

(To recreate it yourself, download the GEOSTAT-grid-POP-1K-2011-V2-0-1.zip file from eurostat, and move the two .csv files inside it to somewhere your R script can read them.) The code parses the latitude/longitude of population centers listed in the CSV file, arranges them into a 0.01 by 0.01 degree grid, and plots each row as a horizontal line with population as the vertical axis. Grid cells with zero population cause breaks in the line and leave white gaps in the map. It's quite an elegant effect!
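The 14 lines themselves do not come through in this feed; purely as a rough sketch of the approach just described (not Lindberg's actual code; the file name and the lat, lng and population column names are assumptions about the prepared data):

library(tidyverse)
pop <- read_csv("population.csv") %>%
  mutate(lat = round(lat, 2), lng = round(lng, 2)) %>%   # snap to a 0.01-degree grid
  group_by(lat, lng) %>%
  summarise(population = sum(population)) %>%
  ungroup()
ggplot(pop, aes(x = lng, y = lat + population / 500000, group = lat)) +
  geom_line(size = 0.2, alpha = 0.7) +   # one horizontal line per latitude row
  coord_fixed() +
  theme_void()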

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    Load, Save, and .rda files

    Thu, 04/27/2017 - 16:44

    (This article was first published on The Practical R, and kindly contributed to R-bloggers)

A couple of weeks ago I stumbled across a feature in R that I had never heard of before: the functions save() and load(), and the R file type .rda.

    The .rda files allow a user to save their R data structures such as vectors, matrices, and data frames. The file is automatically compressed, with user options for additional compression. Let’s take a look.

    First, we will grab one of the built-in R datasets. We can view these by calling data(). Let’s use the “Orange” dataset.

# get the Orange data
Orange
   Tree  age circumference
1     1  118            30
2     1  484            58
3     1  664            87
4     1 1004           115
5     1 1231           120
6     1 1372           142
7     1 1582           145
8     2  118            33
9     2  484            69
10    2  664           111
11    2 1004           156
12    2 1231           172
13    2 1372           203
14    2 1582           203
15    3  118            30
16    3  484            51
17    3  664            75
18    3 1004           108
19    3 1231           115
20    3 1372           139
21    3 1582           140
22    4  118            32
23    4  484            62
24    4  664           112
25    4 1004           167
26    4 1231           179
27    4 1372           209
28    4 1582           214
29    5  118            30
30    5  484            49
31    5  664            81
32    5 1004           125
33    5 1231           142
34    5 1372           174
35    5 1582           177

    Next, let’s save each column individually as vectors.

# save the Orange data as vectors
count<-Orange$Tree
age<-Orange$age
circumference<-Orange$circumference

    Now if we look at our variables in the RStudio environment, we can see count, age, and circumference saved there.

    Next, let’s set our R working directory, so the .rda file will save in the correct location. First we’ll use getwd() to find our current working directory, then we’ll adjust it (if needed) using setwd(). I set my working directory to a folder on the D drive.

#get and set working directory
getwd()
[1] "D:/Users"
setwd("D:/r-temp")
getwd()
[1] "D:/r-temp"

    Finally, let’s use the save() command to save our 3 vectors to an .rda file. The “file” name will be the name of the new .rda file.

#save to rda file
save(count, age, circumference, file = "mydata.rda")

    Next we will remove our R environment variables using the command rm().

#remove variables
rm(age, circumference, count)

    Now we can see that we no longer have saved variables in our R workspace.

Now, we can check that our .rda file (mydata.rda) does in fact store our data by using the load() command.
Note: If we had not properly set our working directory, then we would have needed to provide a full path to the .rda file. For example, "C:/Users/Documents/R files/mydata.rda" rather than just "mydata.rda".

#load the rda file
load(file = "mydata.rda")

    Great, now we can see that our variables are back in the R environment for use once more.

    Saving and loading data in R might be very useful when you’re working with large datasets that you want to clear from your memory, but you also would like to save for later. It also might be useful for long, complex R workflows and scripts. You can control the compression of the file using the settings ‘compress’ and ‘compression_level’.
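For example, something like this (my addition; xz usually gives the smallest files at the cost of speed):

save(count, age, circumference,
     file = "mydata.rda",
     compress = "xz", compression_level = 9)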

    That’s all for now!

To leave a comment for the author, please follow the link and comment on their blog: The Practical R.

    Data Science for Operational Excellence (Part-4)

    Thu, 04/27/2017 - 16:26

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

Suppose your friend is a restaurant chain owner (only 3 units) facing some competitive challenges related to low prices; let's call it a price war. Inside his business he knows that there is not much cost left to cut. But he thinks that, if he tries harder to find better suppliers with lower freight and product costs, he could be in a better situation. So he decided to hire you, a recent-grad Data Scientist, to figure out how to solve this problem and to build a tool so that your findings can be incorporated into his daily operations. As a Data Scientist, you know that this problem can be solved through the use of the lpSolve package.

    Our goal here is to expand your knowledge to create custom constraints to be used in real business problems.

    Answers to the exercises are available here.

    Exercise 1
We will solve a transportation problem for your friend's restaurant chain, with 2 different products sold from 4 different suppliers. Create a cost vector that models different costs for each combination of restaurant, supplier and product. Use integer random numbers from 0 to 1000 to fill this vector. In order to be reproducible, set the seed equal to 1234.
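One possible reading of this exercise (a sketch, not the official solution): 3 restaurants x 4 suppliers x 2 products gives 24 cost coefficients.

set.seed(1234)
costs <- sample(0:1000, 3 * 4 * 2, replace = TRUE)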

    Exercise 2
Create the demand constraints. Consider that every restaurant needs a specific quantity of each product. Use integer random numbers from 100 to 500 to define the minimum quantities needed to keep each restaurant open without running out of any supplies.

    Exercise 3
    Create the offer constraints. Consider that every supplier can deliver a specific quantity related to each product. Use integer random numbers from 200 to 700 to define maximum quantities that each supplier can deliver.

    Exercise 4
Prepare the parameters of the lp() function using the variables created above.

    Exercise 5
Now, solve this problem with the constraints created so far.
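For readers new to lpSolve, here is a tiny self-contained transportation example showing the shape of the lp() call (toy numbers, not the exercise data): two restaurants, two suppliers, one product.

library(lpSolve)
costs_toy <- c(4, 6, 8, 5)      # unit costs for (r1,s1), (r1,s2), (r2,s1), (r2,s2)
const_mat <- rbind(
  c(1, 1, 0, 0),                # restaurant 1 demand
  c(0, 0, 1, 1),                # restaurant 2 demand
  c(1, 0, 1, 0),                # supplier 1 capacity
  c(0, 1, 0, 1)                 # supplier 2 capacity
)
const_dir <- c(">=", ">=", "<=", "<=")
const_rhs <- c(100, 150, 180, 120)
sol <- lp("min", costs_toy, const_mat, const_dir, const_rhs)
sol$solution                    # optimal shipment quantities
sol$objval                      # minimum total cost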

    Learn more about geo visualization in the online course R: Complete Data Visualization Solutions. In this course you will learn how to:

    • Build advanced map visualizations
    • Work with different sources for maps
• And many more visualizations

    Exercise 6
    We know that some suppliers have minimum order quantities. Create a new set of constraints to represent that. Use integer random numbers from 50 to 70 to define minimum quantities that we can order from each supplier.

    Exercise 7
Now, solve this problem with the constraints created so far.

    Exercise 8
    We also know that some vehicles have maximum capacity in terms of weight and volume. Create a new set of constraints to represent that. Use integer random numbers from 100 to 500 to define maximum quantities that we can order from each supplier.

    Exercise 9
Prepare the lp() function parameters again, using the variables created above.

    Exercise 10
Now, solve this problem with all the constraints.

    Related exercise sets:
    1. Lets Begin with something sample
    2. Data Science for Operational Excellence (Part-1)
    3. Data Exploration with Tables exercises
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
