R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Nuts and Bolts of Quantstrat, Part V

Fri, 04/14/2017 - 00:33

(This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

This post will be about pre-processing custom indicators in quantstrat–that is, how to add values to your market data that do not arise from the market data itself.

The first four parts of my Nuts and Bolts of Quantstrat series were well received; they are even available as a DataCamp course. For those who want to catch up to today's post, I highly recommend that course.

To motivate this post, suppose you're using alternative data that isn't simply derived from a transformation of the market data itself: you have a proprietary alternative data stream that may predict an asset's price, you want to employ a cross-sectional ranking system, or any number of other things. How do you do this within the context of quantstrat?

The answer is that it’s as simple as binding a new xts to your asset data, as this demonstration will show.

First, let’s get the setup out of the way.

require(quantstrat)
require(PerformanceAnalytics)

initDate="1990-01-01"
from="2003-01-01"
to="2012-12-31"
options(width=70)
options("getSymbols.warning4.0"=FALSE)

currency('USD')
Sys.setenv(TZ="UTC")

symbols <- 'SPY'
suppressMessages(getSymbols(symbols, from=from, to=to, src="yahoo", adjust=TRUE))
stock(symbols, currency="USD", multiplier=1)

Now, we have our non-derived indicator. In this case, it's a toy example: the value is 1 if the year is odd (e.g., 2003, 2005, 2007, 2009), and 0 if it's even. We compute that and simply column-bind (cbind) it to the asset data.

nonDerivedIndicator <- as.numeric(as.character(substr(index(SPY), 1, 4))) %% 2 == 1
nonDerivedIndicator <- xts(nonDerivedIndicator, order.by=index(SPY))

SPY <- cbind(SPY, nonDerivedIndicator)
colnames(SPY)[7] = "nonDerivedIndicator"
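The same column-bind idea extends to any external series. As a rough sketch (the file name and column names here are hypothetical, not from the original post), a proprietary data stream stored in a CSV with a date column could be merged onto the asset data like this:

# hypothetical external data: a CSV with columns "date" and "signalValue"
altData <- read.csv("myAltData.csv", stringsAsFactors = FALSE)
altXts <- xts(altData$signalValue, order.by = as.Date(altData$date))

# merge by index and carry the last value forward so every bar has a signal
SPY <- merge(SPY, altXts, join = "left")
SPY$altXts <- na.locf(SPY$altXts)
colnames(SPY)[ncol(SPY)] <- "myAltSignal"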

Next, we just have a very simple strategy–buy a share of SPY on odd years, sell on even years. That is, buy when the nonDerivedIndicator column crosses above 0.5 (from 0 to 1), and sell when the opposite occurs.

strategy.st <- portfolio.st <- account.st <- "nonDerivedData"
rm.strat(strategy.st)
initPortf(portfolio.st, symbols=symbols, initDate=initDate, currency='USD')
initAcct(account.st, portfolios=portfolio.st, initDate=initDate, currency='USD')
initOrders(portfolio.st, initDate=initDate)
strategy(strategy.st, store=TRUE)

add.signal(strategy.st, name = sigThreshold,
           arguments = list(column = "nonDerivedIndicator", threshold = 0.5,
                            relationship = "gte", cross = TRUE),
           label = "longEntry")

add.signal(strategy.st, name = sigThreshold,
           arguments = list(column = "nonDerivedIndicator", threshold = 0.5,
                            relationship = "lte", cross = TRUE),
           label = "longExit")

tmp <- applySignals(strategy = strategy.st, mktdata=SPY)

add.rule(strategy.st, name="ruleSignal",
         arguments=list(sigcol="longEntry", sigval=TRUE, ordertype="market",
                        orderside="long", replace=FALSE, prefer="Open", orderqty = 1),
         type="enter", path.dep=TRUE)

add.rule(strategy.st, name="ruleSignal",
         arguments=list(sigcol="longExit", sigval=TRUE, orderqty="all", ordertype="market",
                        orderside="long", replace=FALSE, prefer="Open"),
         type="exit", path.dep=TRUE)

#apply strategy
t1 <- Sys.time()
out <- applyStrategy(strategy=strategy.st, portfolios=portfolio.st)
t2 <- Sys.time()
print(t2-t1)

#set up analytics
updatePortf(portfolio.st)
dateRange <- time(getPortfolio(portfolio.st)$summary)[-1]
updateAcct(portfolio.st, dateRange)
updateEndEq(account.st)

And the result:

chart.Posn(portfolio.st, 'SPY')
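For a quick numerical summary of the backtest (an optional add-on, using blotter's tradeStats() helper, which loads with quantstrat), the per-symbol statistics can be pulled from the updated portfolio:

tradeStats(Portfolios = portfolio.st, Symbols = 'SPY')   # summary statistics for the SPY trades
t(tradeStats(Portfolios = portfolio.st))                 # transposed view is easier to read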

In conclusion, you can create signals based on any data in quantstrat. Whether that means volatility ratios, fundamental data, cross-sectional rankings, or whatever proprietary alternative data source you may have access to, this very simple process is how you add all of those things to your systematic trading backtest research in quantstrat.

Thanks for reading.

Note: I am always interested in full-time opportunities which may benefit from my skills. I have experience in data analytics, asset management, and systematic trading research. If you know of any such opportunities, do not hesitate to contact me on my LinkedIn, found here.


optimultiplication [a riddle]

Fri, 04/14/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

The riddle of this week is about optimally positioning the four digits of a multiplication of two two-digit numbers, and is open to a coding resolution:

Four digits are drawn without replacement from {0,1,…,9}, one at a time. What is the optimal strategy to position those four digits, two digits per row, as they are drawn, toward minimising the average product?

Although the problem can be solved algebraically by computing E[X⁴|x¹,..] and E[X⁴X³|x¹,..], I wrote three R codes to “optimise” the location of the first three digits: the first digit ends up as a unit if it is 5 or more and as a multiple of ten otherwise, on the first row. For the second draw, it is slightly more variable: with this R code,

second <- function(i, j, N=1e5){
  # simulate the two remaining digits to estimate the conditional product
  drew = matrix(0, N, 2)
  for (t in 1:N) drew[t,] = sample((0:9)[-c(i+1, j+1)], 2)
  conmean = (45-i-j)/8
  conprod = mean(drew[,1]*drew[,2])
  if (i<5){ #10*i
    pos = c((110*i+11*j)*conmean,
            100*i*j+10*(i+j)*conmean+conprod,
            (100*i+j)*conmean+10*i*j+10*conprod)
  } else {
    pos = c((110*j+11*i)*conmean,
            10*i*j+(100*j+i)*conmean+10*conprod,
            10*(i+j)*conmean+i*j+100*conprod)
  }
  return(order(pos)[1])}

the resulting digit again ends up as a unit if it is 5 or more (except when x¹=7,8,9, where the threshold drops to 4) and as a multiple of ten otherwise, but on the second row. Except when x¹=0 and x²=1,2,3,4, in which case they end up on the first row together, 0 obviously in front.

For the third and last open draw, there is only one remaining random draw, which means that the decision only depends on x¹,x²,x³ and E[X⁴|x¹,x²,x³]=(45-x¹-x²-x³)/7. Attaching x³ to x² or x¹ will then vary monotonically in x³, depending on whether x¹>x² or x¹<x²:

fourth = function(i, j, k){
  comean = (45-i-j-k)/7
  if ((i<1)&(j<5)){ pos = c(10*comean+k, comean+10*k)}
  if ((i<5)&(j>4)){ pos = c(100*i*comean+k*j, j*comean+100*i*k)}
  if ((i>0)&(i<5)&(j<5)){ pos = c(i*comean+k*j, j*comean+i*k)}
  if ((i<7)&(i>4)&(j<5)){ pos = c(i*comean+100*k*j, j*comean+100*i*k)}
  if ((i<7)&(i>4)&(j>4)){ pos = c(i*comean+k*j, j*comean+i*k)}
  if ((i>6)&(j<4)){ pos = c(i*comean+100*k*j, j*comean+100*i*k)}
  if ((i>6)&(j>3)){ pos = c(i*comean+k*j, j*comean+i*k)}
  return(order(pos)[1])}

Running this R code for all combinations of x¹,x² shows that, except for the cases x¹≥5 and x²=0, for which x³ invariably remains in front of x¹, there are always values of x³ associated with each position.
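For completeness, a small driver along these lines (a sketch that assumes the fourth() function above) enumerates the decision over all admissible triples and stores the chosen position:

# enumerate the decision of fourth() over all triples of distinct digits
res = array(NA, dim = c(10, 10, 10))
for (i in 0:9) for (j in 0:9) for (k in 0:9)
  if (length(unique(c(i, j, k))) == 3)
    res[i + 1, j + 1, k + 1] = fourth(i, j, k)

# e.g. positions chosen across x3 when x1 = 6 and x2 = 0
res[7, 1, ]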



QR Decomposition with Householder Reflections

Thu, 04/13/2017 - 22:00

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

The more common approach to QR decomposition is employing Householder reflections rather than utilizing Gram-Schmidt. In practice, the Gram-Schmidt procedure is not recommended as it can lead to cancellation that causes inaccuracy in the computation of q_j, which may result in a non-orthogonal Q matrix. Householder reflections are another method of orthogonal transformation; a Householder reflection maps a vector x onto a vector of the same length lying along a coordinate axis (a multiple of e_1). The Householder reflection matrix with normal vector v takes the form:

H = I - 2vv^T

Thus we need to build H so that Hx = \alpha e_1 for some constant \alpha and e_1 = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}^T.

Since H is orthogonal, ||Hx|| = ||x|| and ||\alpha e_1|| = |\alpha| ||e_1|| = |\alpha|. Therefore, \alpha = \pm ||x||. The sign is selected so it has the opposite sign of x_1 (we’ll use + for the remaining definitions). The vector u we seek is thus:

u = \begin{bmatrix} x_1 + sign(x_1) ||x|| \\\ x_2 \\\ \vdots \\\ x_n \end{bmatrix}

With the unit vector v defined as v = \frac{u}{||u||}. The corresponding Householder reflection is then:

H(x) = I - 2vv^T = I - 2 \frac{uu^T}{u^Tu}
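As a quick numeric sanity check of this construction (a small sketch added here, not part of the original derivation), one can build H from an arbitrary x in R and confirm that Hx lands on a multiple of e_1 with the same norm as x:

x <- c(2, 2, 1)
u <- x
u[1] <- x[1] + sign(x[1]) * sqrt(sum(x^2))     # u = x + sign(x_1) ||x|| e_1
H <- diag(length(x)) - 2 * (u %*% t(u)) / as.numeric(t(u) %*% u)
H %*% x            # approximately (-3, 0, 0), a multiple of e_1
sqrt(sum(x^2))     # 3, matching |alpha|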

QR Decomposition with Householder Reflections

The Householder reflection method of QR decomposition works by finding appropriate H matrices and multiplying them from the left by the original matrix A to construct the upper triangular matrix R. As we saw earlier, unlike the Gram-Schmidt procedure, the Householder reflection approach does not explicitly form the Q matrix. However, the Q matrix can be found by taking the dot product of each successively formed Householder matrix.

Q = H_1 H_2 \cdots H_{m-2}H_{m-1}

Consider the matrix A.

A = \begin{bmatrix} 2 & -2 & 18 \\\ 2 & 1 & 0 \\\ 1 & 2 & 0 \end{bmatrix}

We find the reflection of the first column vector a_1 = \begin{bmatrix}2 & 2 & 1 \end{bmatrix}^T, v_1 = a_1 + sign(a_{11})||a_1||e_1.

v_1 = \begin{bmatrix}2 \\\ 2 \\\ 1 \end{bmatrix} + \sqrt{\sum^m_{k=1}{a_{k1}}^2} \begin{bmatrix}1 \\\ 0 \\\ 0 \end{bmatrix} = \begin{bmatrix} 5 \\\ 2 \\\ 1 \end{bmatrix}

With a corresponding Householder matrix:

H_1 = \begin{bmatrix} 1 & 0 & 0 \\\ 0 & 1 & 0 \\\ 0 & 0 & 1 \end{bmatrix} - 2 \frac{\begin{bmatrix} 5 \\\ 2 \\\ 1 \end{bmatrix}\begin{bmatrix} 5 & 2 & 1 \end{bmatrix}}{\begin{bmatrix} 5 & 2 & 1 \end{bmatrix}\begin{bmatrix} 5 \\\ 2 \\\ 1 \end{bmatrix}}
H_1 = \begin{bmatrix} -\frac{2}{3} & -\frac{2}{3} & -\frac{1}{3} \\\ -\frac{2}{3} & 0.7333 & -0.1333 \\\ -\frac{1}{3} & -0.1333 & 0.9333 \end{bmatrix}

With H_1, we then find R by H_1 A:

H_1 A = \begin{bmatrix} -\frac{2}{3} & -\frac{2}{3} & -\frac{1}{3} \\\ -\frac{2}{3} & 0.7333 & -0.1333 \\\ -\frac{1}{3} & -0.1333 & 0.9333 \end{bmatrix} \begin{bmatrix} 2 & -2 & 18 \\\ 2 & 1 & 0 \\\ 1 & 2 & 0 \end{bmatrix}
H_1 A = \begin{bmatrix} -3 & 0 & -12 \\\ 0 & 1.8 & 12 \\\ 0 & 2.4 & -6 \end{bmatrix}

H_1A replaces A in the next iteration. Now we move to a_2 and H_2. To this end we only consider the submatrix of A:

A^{(1)} = \begin{bmatrix} 1.8 & 12 \\\ 2.4 & -6 \end{bmatrix}

Thus, v_2 is equal to:

\begin{bmatrix} 1.8 \\\ 2.4 \end{bmatrix} + \sqrt{\sum^m_{j=1} {a_{j2}}^2} \begin{bmatrix} 1 \\\ 0 \end{bmatrix} = \begin{bmatrix} 4.8 \\\ 2.4 \end{bmatrix}

Therefore, the corresponding Householder matrix H_2 is equal to:

H_2 = \begin{bmatrix} 1 & 0 \\\ 0 & 1 \end{bmatrix} - 2 \frac{\begin{bmatrix} 4.8 \\\ 2.4 \end{bmatrix} \begin{bmatrix} 4.8 & 2.4 \end{bmatrix}}{\begin{bmatrix} 4.8 & 2.4 \end{bmatrix} \begin{bmatrix} 4.8 \\\ 2.4 \end{bmatrix}}
H_2 = \begin{bmatrix} 1 & 0 & 0 \\\ 0 & -0.6 & -0.8 \\\ 0 & -0.8 & 0.6 \end{bmatrix}

The first column \begin{bmatrix}1 \\\ 0 \\\ 0 \end{bmatrix} and first row \begin{bmatrix}1 & 0 & 0 \end{bmatrix} are added to the resulting H_2 matrix to keep it n \times n.

Then we find H_2 A:

\begin{bmatrix} 1 & 0 & 0 \\\ 0 & -0.6 & -0.8 \\\ 0 & -0.8 & 0.6 \end{bmatrix} \begin{bmatrix} -3 & 0 & -12 \\\ 0 & 1.8 & 12 \\\ 0 & 2.4 & -6 \end{bmatrix}
H_2 A = \begin{bmatrix} -3 & 0 & -12 \\\ 0 & -3 & 12 \\\ 0 & 0 & 6 \end{bmatrix}

Moving to a_3 and H_3, the submatrix of H_2 A is thus [6]. Therefore, v_3 is equal to:

\begin{bmatrix} 6 \end{bmatrix} + \sqrt{\sum^m_{j=1} {a_{j3}}^2} \begin{bmatrix} 1 \end{bmatrix} = 12

The corresponding Householder matrix H_3 is then:

H_3 = \begin{bmatrix} 1 \end{bmatrix} - 2 \frac{\begin{bmatrix} 12 \end{bmatrix}\begin{bmatrix} 12 \end{bmatrix}}{\begin{bmatrix} 12 \end{bmatrix}\begin{bmatrix} 12 \end{bmatrix}} = \begin{bmatrix}1 & 0 & 0 \\\ 0 & 1 & 0 \\\ 0 & 0 & -1 \end{bmatrix}
H_3 A = \begin{bmatrix}1 & 0 & 0 \\\ 0 & 1 & 0 \\\ 0 & 0 & -1 \end{bmatrix} \begin{bmatrix} -3 & 0 & -12 \\\ 0 & -3 & 12 \\\ 0 & 0 & 6 \end{bmatrix}
H_3 A = R = \begin{bmatrix} -3 & 0 & -12 \\\ 0 & -3 & 12 \\\ 0 & 0 & -6 \end{bmatrix}

This is the R factor of the QR decomposition. The Q factor is found by multiplying all the H matrices together, as mentioned earlier.

H_1 H_2 H_3 = Q
Q = \begin{bmatrix} -\frac{2}{3} & -\frac{2}{3} & -\frac{1}{3} \\\ -\frac{2}{3} & 0.7333 & -0.1333 \\\ -\frac{1}{3} & -0.1333 & 0.9333 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\\ 0 & -0.6 & -0.8 \\\ 0 & -0.8 & 0.6 \end{bmatrix} \begin{bmatrix}1 & 0 & 0 \\\ 0 & 1 & 0 \\\ 0 & 0 & -1 \end{bmatrix}
Q = \begin{bmatrix} -\frac{2}{3} & \frac{2}{3} & -\frac{1}{3} \\\ -\frac{2}{3} & -\frac{1}{3} & \frac{2}{3} \\\ -\frac{1}{3} & -\frac{2}{3} & -\frac{2}{3} \end{bmatrix}

Thus we obtain the same result as we did utilizing the Gram-Schmidt procedure.

QR = \begin{bmatrix} -\frac{2}{3} & \frac{2}{3} & -\frac{1}{3} \\\ -\frac{2}{3} & -\frac{1}{3} & \frac{2}{3} \\\ -\frac{1}{3} & -\frac{2}{3} & -\frac{2}{3} \end{bmatrix} \begin{bmatrix} -3 & 0 & -12 \\\ 0 & -3 & 12 \\\ 0 & 0 & -6 \end{bmatrix}

Householder Reflection QR Decomposition in R

The following function implements the Householder reflections approach to QR decomposition. The bdiag() function in the Matrix package is used in constructing the H matrices as seen above in the calculation of H_2.

qr.householder <- function(A) {
  require(Matrix)

  R <- as.matrix(A) # Set R to the input matrix A

  n <- ncol(A)
  m <- nrow(A)
  H <- list() # Initialize a list to store the computed H matrices to calculate Q later

  if (m > n) {
    c <- n
  } else {
    c <- m
  }

  for (k in 1:c) {
    x <- R[k:m, k] # Equivalent to a_1
    e <- as.matrix(c(1, rep(0, length(x) - 1)))
    vk <- sign(x[1]) * sqrt(sum(x^2)) * e + x

    # Compute the H matrix
    hk <- diag(length(x)) - 2 * as.vector(vk %*% t(vk)) / (t(vk) %*% vk)
    if (k > 1) {
      hk <- bdiag(diag(k - 1), hk)
    }

    # Store the H matrix to find Q at the end of iteration
    H[[k]] <- hk

    R <- hk %*% R
  }

  Q <- Reduce("%*%", H) # Calculate Q matrix by multiplying all H matrices
  res <- list('Q' = Q, 'R' = R)
  return(res)
}

Use the function to compute the QR decomposition of the following matrix A with Householder reflections.

A <- rbind(c(2,-2,18), c(2,1,0), c(1,2,0))
A
##      [,1] [,2] [,3]
## [1,]    2   -2   18
## [2,]    2    1    0
## [3,]    1    2    0

qr.householder(A)
## $Q
## 3 x 3 Matrix of class "dgeMatrix"
##            [,1]       [,2]       [,3]
## [1,] -0.6666667  0.6666667 -0.3333333
## [2,] -0.6666667 -0.3333333  0.6666667
## [3,] -0.3333333 -0.6666667 -0.6666667
##
## $R
## 3 x 3 Matrix of class "dgeMatrix"
##               [,1]          [,2] [,3]
## [1,] -3.000000e+00  2.220446e-16  -12
## [2,] -1.165734e-16 -3.000000e+00   12
## [3,]  1.554312e-16  0.000000e+00   -6
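As a quick cross-check (not part of the original post), base R's qr() produces an equivalent factorization (the signs of individual columns of Q may differ), and multiplying the factors back together recovers A:

qr_base <- qr(A)
qr.Q(qr_base)                      # orthogonal factor
qr.R(qr_base)                      # upper triangular factor
qr.Q(qr_base) %*% qr.R(qr_base)    # reproduces A up to floating point error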

The only package that I found that directly implements the Householder reflection approach to QR is the pracma package.

library(pracma)

house <- householder(A)
house
## $Q
##            [,1]       [,2]       [,3]
## [1,] -0.6666667  0.6666667  0.3333333
## [2,] -0.6666667 -0.3333333 -0.6666667
## [3,] -0.3333333 -0.6666667  0.6666667
##
## $R
##               [,1]          [,2] [,3]
## [1,] -3.000000e+00  2.220446e-16  -12
## [2,] -1.165734e-16 -3.000000e+00   12
## [3,] -1.554312e-16  0.000000e+00    6

References

En.wikipedia.org. (2017). QR decomposition. [online] Available at: https://en.wikipedia.org/wiki/QR_decomposition#Using_Householder_reflections [Accessed 10 Apr. 2017].

http://www.math.iit.edu/~fass/477577_Chapter_4.pdf

Trefethen, L. and Bau, D. (1997). Numerical linear algebra. 1st ed. Philadelphia: SIAM.



New features in the checkpoint package, version 0.4.0

Thu, 04/13/2017 - 18:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Andrie de Vries

In 2014 we introduced the checkpoint package for reproducible research. This package makes it easy to use R package versions that existed on CRAN at a given date in the past, and to use varying package versions with different projects.

On April 12, 2017, we published version 0.4.0 of checkpoint to CRAN.

The checkpoint() function enables reproducible research by managing your R package versions. These packages are downloaded into a local .checkpoint folder. If you use checkpoint() for many projects, these local packages can consume some storage space, and this update introduces functions to manage your snapshots. In this post I review:

  • Managing local archives:
    • checkpointArchives(): list checkpoint archives on disk.
    • checkpointRemove(): remove checkpoint archive from disk.
    • getAccessDate(): returns the date the snapshot was last accessed.
  • Other:
    • unCheckpoint(): reset .libPaths to the user library to undo the effect of checkpoint().
Setting up an example project

For illustration, set up a script referencing a single package:

library(MASS)
hist(islands)
truehist(islands)

Next, create the checkpoint:

dir.create(file.path(tempdir(), ".checkpoint"), recursive = TRUE)

## Create a checkpoint by specifying a snapshot date
library(checkpoint)
checkpoint("2015-04-26", project = tempdir(), checkpointLocation = tempdir())

Working with checkpoint archive snapshots

You can query the available snapshots on disk using the checkpointArchives() function. This returns a vector of snapshot folders.

# List checkpoint archives on disk.
checkpointArchives(tempdir())
## [1] "2015-04-26"

You can get the full paths by including the argument full.names=TRUE:

checkpointArchives(tempdir(), full.names = TRUE)
## [1] "C:/Users/adevries/AppData/Local/Temp/RtmpcnciXd/.checkpoint/2015-04-26"

Working with access dates

Every time you use checkpoint() the function places a small marker in the snapshot archive with the access date. In this way you can track when you last actually used the snapshot archive.

# Returns the date the snapshot was last accessed.
getAccessDate(tempdir())
## C:/Users/adevries/AppData/Local/Temp/RtmpcnciXd/.checkpoint/2015-04-26
##                                                           "2017-04-12"

Removing a snapshot from local disk

Since the date of last access is tracked, you can use this to manage your checkpoint archives. The function checkpointRemove() will delete archives from disk. You can use this function in multiple ways. For example, specify a specific archive to remove:

# Remove single checkpoint archive from disk.
checkpointRemove("2015-04-26")

You can also remove a range of snapshot archives older (or more recent) than a snapshot date

# Remove range of checkpoint archives from disk.
checkpointRemove("2015-04-26", allSinceSnapshot = TRUE)
checkpointRemove("2015-04-26", allUntilSnapshot = TRUE)

Finally, you can remove all snapshot archives that have not been accessed since a given date:

# Remove snapshot archives that have not been used recently
checkpointRemove("2015-04-26", notUsedSince = TRUE)

Reading the checkpoint log file

One of the side effects of checkpoint() is to create a log file that contains information about packages that get downloaded, as well as the download size. This file is stored in the checkpoint root folder, and is a csv file with column names, so you can read this with your favourite R function or other tools.

dir(file.path(tempdir(), ".checkpoint"))
## [1] "2015-04-26"         "checkpoint_log.csv" "R-3.3.3"

Inspect the log file:

# read the log file from the checkpoint root folder
log_file <- read.csv(file.path(tempdir(), ".checkpoint", "checkpoint_log.csv"))
log_file
##             timestamp snapshotDate  pkg   bytes
## 1 2017-04-12 15:05:12   2015-04-26 MASS 1084392

Resetting the checkpoint

In older versions of checkpoint() the only way to reset the effect of checkpoint() was to restart your R session. In v0.3.20 and above, you can use the function unCheckpoint(). This will reset your .libPaths to the user folder.

.libPaths()
## [1] "C:/Users/adevries/AppData/Local/Temp/RtmpcnciXd/.checkpoint/2015-04-26/lib/x86_64-w64-mingw32/3.3.3"
## [2] "C:/Users/adevries/AppData/Local/Temp/RtmpcnciXd/.checkpoint/R-3.3.3"
## [3] "C:/R/R-33~1.3/library"

Now use `unCheckpoint()` to reset your library paths:

# Note this is still experimental
unCheckpoint()
.libPaths()
## [1] "C:\\Users\\adevries\\Documents\\R\\win-library"
## [2] "C:/R/R-33~1.3/library"

How to obtain and use checkpoint

Version 0.4.0 of the checkpoint package is available on CRAN now, so you can install it with:

install.packages("checkpoint", repos="https://cloud.r-project.org")

The above command works both for CRAN R, and also for Microsoft R Open (which comes bundled with an older version of checkpoint). For more information on checkpoint, see the vignette Using checkpoint for reproducible research.


R Best Practices: R you writing the R way!

Thu, 04/13/2017 - 13:45

(This article was first published on R programming, and kindly contributed to R-bloggers)

By Milind Paradkar

Any programmer inevitably writes tons of code in their daily work. However, not all programmers inculcate the habit of writing clean code that can easily be understood by others. One reason can be a lack of awareness among programmers of the best practices to follow when writing a program. This is especially the case for novice programmers. In this post, we list some R programming best practices that will lead to improved code readability, consistency, and repeatability. Read on!

Best practices of writing in R

1) Describe your code – When you start coding, describe what the R code does in the very first line. For subsequent blocks of code, follow the same method of describing the blocks. This makes it easy for other people to understand and use the code.

Example:

# This code captures the 52-week high effect in stocks # Code developed by Milind Paradkar

2) Load Packages – After describing your code in the first line, use the library function to list and load all the relevant packages needed to execute your code.

Example:

library(quantmod);  library(zoo); library(xts); library(PerformanceAnalytics); library(timeSeries); library(lubridate);

3) Use Updated Packages – While writing your code ensure that you are using the latest updated R packages. To check the version of any R package you can use the packageVersion function.

Example:

packageVersion("TTR") [1] ‘0.23.1’

4) Organize all source files in the same directory – Store all the necessary files that will be used/sourced in your code in the same directory. You can use the respective relative path to access them.

Example:

# Reading file using relative path
df = read.csv(file = "NIFTY.csv", header = TRUE)

# Reading file using full path
df = read.csv(file = "C:/Users/Documents/NIFTY.csv", header = TRUE)

5) Use a consistent style for data structure types – R programming language allows for different data structures like vectors, factors, data frames, matrices, and lists. Use a similar naming for a particular type of data structure. This will make it easy to recognize the similar data structures used in the code and to spot the problems easily.

Example:
You can name all different data frames used in your code by adding .df as the suffix.

aapl.df = as.data.frame(read.csv(file = "AAPL.csv", header = TRUE))
amzn.df = as.data.frame(read.csv(file = "AMZN.csv", header = TRUE))
csco.df = as.data.frame(read.csv(file = "CSCO.csv", header = TRUE))

6) Indent your code – Indentation makes your code easier to read, especially, if there are multiple nested statements like For-loop and If statement.

Example:

# Computing the Profit & Loss (PL) and the Equity
dt$PL = numeric(nrow(dt))
for (i in 1:nrow(dt)){
  if (dt$Signal[i] == 1) {dt$PL[i+1] = dt$Close[i+1] - dt$Close[i]}
  if (dt$Signal[i] == -1){dt$PL[i+1] = dt$Close[i] - dt$Close[i+1]}
}

7) Remove temporary objects – For long scripts running into thousands of lines, it is a good practice to remove temporary objects after they have served their purpose in the code. This can ensure that R does not run into memory issues.
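For instance, a minimal illustration (reusing the same NIFTY.csv file as the earlier examples):

# create a temporary object, use it, then remove it and free the memory
tmp.df = read.csv(file = "NIFTY.csv", header = TRUE)
summary.stats = summary(tmp.df)
rm(tmp.df)   # remove the temporary object once it has served its purpose
gc()         # ask R to release the freed memory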

8) Time the code – You can time your code using the system.time function. You can also use the same function to find out the time taken by different blocks of code. The function returns the amount of time taken in seconds to evaluate the expression or a block of code. Timing codes will help to figure out any bottlenecks and help speed up your codes by making the necessary changes in the script.

To find the time taken for different blocks we wrapped them in curly braces within the call to the system.time function.

The two important metrics returned by the function include:
i) User time – time charged to the CPU(s) for the code
ii) Elapsed time – the amount of time elapsed to execute the code in entirety

 Example:

# Generating random numbers
system.time({
  mean_1 = rnorm(1e+06, mean = 0, sd = 0.8)
})

   user  system elapsed
   0.40    0.00    0.45

9) Use vectorization – Vectorization results in faster execution of codes, especially when we are dealing with large data sets. One can use statements like the ifelse statement or the with function for vectorization.

Example:
Consider the NIFTY 1-year price series. Let us find the gap opening for each day using both the methods (using for-loop and with function) and time them using the system.time function. Note the time taken to execute the for-loop versus the time to execute the with function in combination with the lagpad function.

library(quantmod)

# Using FOR Loop
system.time({
  df = read.csv("NIFTY.csv")
  df = df[, c(1, 3:6)]
  df$GapOpen = double(nrow(df))
  for (i in 2:nrow(df)) {
    df$GapOpen[i] = round(Delt(df$CLOSE[i-1], df$OPEN[i]) * 100, 2)
  }
  print(head(df))
})

# Using with function + lagpad, instead of FOR Loop
system.time({
  df = read.csv("NIFTY.csv")
  df = df[, c(1, 3:6)]
  lagpad = function(x, k) { c(rep(NA, k), x)[1:length(x)] }
  df$PrevClose = lagpad(df$CLOSE, 1)
  df$GapOpen_ = with(df, round(Delt(df$PrevClose, df$OPEN) * 100, 2))
  print(head(df))
})

10) Folding code – Code folding allows the R programmer to collapse lines or sections of code. This makes it possible to hide blocks of code whenever required, and makes it easier to navigate through lengthy scripts. Code folding can be done in two ways:
i) Automatic folding of codes
ii) User-defined folding of codes

Automatic folding of codes: RStudio automatically provides the flexibility to fold the codes. When a coder writes a function or conditional blocks, RStudio automatically creates foldable codes.

User-defined folding of codes: 
One can also fold a random group of codes by using Edit -> Folding -> Collapse or by simply selecting the group of codes and pressing Alt+L key.

User-defined folding can also be done via Code Sections:
To insert a new code section you can use the Code -> Insert Section command. Alternatively, any comment line which includes at least four trailing dashes (-), equal signs (=) or pound signs (#) automatically creates a code section.
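For instance, both of the following comment lines create named, foldable code sections:

# Load data ---------------------------------------------------------------
df = read.csv(file = "NIFTY.csv", header = TRUE)

# Compute gap opens ====
df$GapOpen = double(nrow(df))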

11) Review and test your code rigorously – Once your code is ready, ensure that you test it rigorously on different input parameters. Ensure that the logic used in statements like for-loops, if statements, and ifelse statements is correct. It is a nice idea to get your code reviewed by a colleague to ensure that the work is of high quality.

12) Don’t save your workspace – When you exit R, it asks whether you want to save your workspace. It is advisable not to save the workspace and to start with a clean workspace for your next R session. Objects from previous R sessions can lead to errors which can be hard to debug.
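For instance, to quit R from the console without saving the workspace:

q(save = "no")   # exit without writing a .RData file

In RStudio the prompt can also be disabled under Tools -> Global Options -> General by setting "Save workspace to .RData on exit" to "Never".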

These were some of the best practices of writing in R that you can follow to make your code easy to read and debug, and to ensure consistency.

 Next Step

 If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to be a successful trader. Enroll now!



Fuzzy string Matching using fuzzywuzzyR and the reticulate package in R

Thu, 04/13/2017 - 02:00

(This article was first published on mlampros, and kindly contributed to R-bloggers)

I recently released another R package on CRAN – fuzzywuzzyR – which ports the fuzzywuzzy Python library to R. “fuzzywuzzy does fuzzy string matching by using the Levenshtein Distance to calculate the differences between sequences (of character strings).”

This is not big news, as similar packages such as the stringdist package already exist in R. Why create the package then? Well, I intend to participate in a recently launched kaggle competition, and one popular method to build features (predictors) is fuzzy string matching, as explained in this blog post. My second aim was to use the (newly released by RStudio) reticulate package, which “provides an R interface to Python modules, classes, and functions” and makes the process of porting Python code to R far less cumbersome.

First, I’ll explain the functionality of the fuzzywuzzyR package and then I’ll give some examples on how to take advantage of the reticulate package in R.

fuzzywuzzyR

The fuzzywuzzyR package includes the following R6 classes and functions for string matching:

classes

  • FuzzExtract: Extract(), ExtractBests(), ExtractWithoutOrder(), ExtractOne()
  • FuzzMatcher: Partial_token_set_ratio(), Partial_token_sort_ratio(), Ratio(), QRATIO(), WRATIO(), UWRATIO(), UQRATIO(), Token_sort_ratio(), Partial_ratio(), Token_set_ratio()
  • FuzzUtils: Full_process(), Make_type_consistent(), Asciidammit(), Asciionly(), Validate_string()
  • SequenceMatcher: ratio(), quick_ratio(), real_quick_ratio(), get_matching_blocks(), get_opcodes()

functions

GetCloseMatches()

The following code chunks / examples are part of the package documentation and give an idea on what can be done with the fuzzywuzzyR package,

FuzzExtract

Each one of the methods in the FuzzExtract class takes a character string and a character string sequence as input ( except for the Dedupe method which takes a string sequence only ) and given a processor and a scorer it returns one or more string match(es) and the corresponding score ( in the range 0 – 100 ). Information about the additional parameters (limit, score_cutoff and threshold) can be found in the package documentation,

library(fuzzywuzzyR) word = "new york jets" choices = c("Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys") #------------ # processor : #------------ init_proc = FuzzUtils$new() # initialization of FuzzUtils class to choose a processor PROC = init_proc$Full_process # processor-method PROC1 = tolower # base R function ( as an example for a processor ) #--------- # scorer : #--------- init_scor = FuzzMatcher$new() # initialization of the scorer class SCOR = init_scor$WRATIO # choosen scorer function init <- FuzzExtract$new() # Initialization of the FuzzExtract class init$Extract(string = word, sequence_strings = choices, processor = PROC, scorer = SCOR) # example output [[1]] [[1]][[1]] [1] "New York Jets" [[1]][[2]] [1] 100 [[2]] [[2]][[1]] [1] "New York Giants" [[2]][[2]] [1] 79 [[3]] [[3]][[1]] [1] "Atlanta Falcons" [[3]][[2]] [1] 29 [[4]] [[4]][[1]] [1] "Dallas Cowboys" [[4]][[2]] [1] 22 # extracts best matches (limited to 2 matches) init$ExtractBests(string = word, sequence_strings = choices, processor = PROC1, scorer = SCOR, score_cutoff = 0L, limit = 2L) [[1]] [[1]][[1]] [1] "New York Jets" [[1]][[2]] [1] 100 [[2]] [[2]][[1]] [1] "New York Giants" [[2]][[2]] [1] 79 # extracts matches without keeping the output order init$ExtractWithoutOrder(string = word, sequence_strings = choices, processor = PROC, scorer = SCOR, score_cutoff = 0L) [[1]] [[1]][[1]] [1] "Atlanta Falcons" [[1]][[2]] [1] 29 [[2]] [[2]][[1]] [1] "New York Jets" [[2]][[2]] [1] 100 [[3]] [[3]][[1]] [1] "New York Giants" [[3]][[2]] [1] 79 [[4]] [[4]][[1]] [1] "Dallas Cowboys" [[4]][[2]] [1] 22 # extracts first result init$ExtractOne(string = word, sequence_strings = choices, processor = PROC, scorer = SCOR, score_cutoff = 0L) [[1]] [1] "New York Jets" [[2]] [1] 100

The dedupe method removes duplicates from a sequence of character strings using fuzzy string matching,

duplicat = c('Frodo Baggins', 'Tom Sawyer', 'Bilbo Baggin', 'Samuel L. Jackson',
             'F. Baggins', 'Frody Baggins', 'Bilbo Baggins')

init$Dedupe(contains_dupes = duplicat, threshold = 70L, scorer = SCOR)

[1] "Frodo Baggins"     "Samuel L. Jackson" "Bilbo Baggins"     "Tom Sawyer"

FuzzMatcher

Each one of the methods in the FuzzMatcher class takes two character strings (string1, string2) as input and returns a score ( in range 0 to 100 ). Information about the additional parameters (force_ascii, full_process and threshold) can be found in the package documentation,

s1 = "Atlanta Falcons" s2 = "New York Jets" init = FuzzMatcher$new() initialization of FuzzMatcher class init$Partial_token_set_ratio(string1 = s1, string2 = s2, force_ascii = TRUE, full_process = TRUE) # example output [1] 31 init$Partial_token_sort_ratio(string1 = s1, string2 = s2, force_ascii = TRUE, full_process = TRUE) [1] 31 init$Ratio(string1 = s1, string2 = s2) [1] 21 init$QRATIO(string1 = s1, string2 = s2, force_ascii = TRUE) [1] 29 init$WRATIO(string1 = s1, string2 = s2, force_ascii = TRUE) [1] 29 init$UWRATIO(string1 = s1, string2 = s2) [1] 29 init$UQRATIO(string1 = s1, string2 = s2) [1] 29 init$Token_sort_ratio(string1 = s1, string2 = s2, force_ascii = TRUE, full_process = TRUE) [1] 29 init$Partial_ratio(string1 = s1, string2 = s2) [1] 23 init$Token_set_ratio(string1 = s1, string2 = s2, force_ascii = TRUE, full_process = TRUE) [1] 29

FuzzUtils

The FuzzUtils class includes a number of utility methods, of which the Full_process method is of greater importance because, besides its main functionality, it can also be used as a secondary function in some of the other fuzzy matching classes,

s1 = 'Frodo Baggins'

init = FuzzUtils$new()

init$Full_process(string = s1, force_ascii = TRUE)
# example output
[1] "frodo baggins"

GetCloseMatches

The GetCloseMatches method returns a list of the best “good enough” matches. The parameter string is a sequence for which close matches are desired (typically a character string), and sequence_strings is a list of sequences against which to match the parameter string (typically a list of strings).

vec = c('Frodo Baggins', 'Tom Sawyer', 'Bilbo Baggin')
str1 = 'Fra Bagg'

GetCloseMatches(string = str1, sequence_strings = vec, n = 2L, cutoff = 0.6)
[1] "Frodo Baggins"

SequenceMatcher

The SequenceMatcher class is based on difflib, which comes installed with Python by default, and includes the following fuzzy string matching methods,

s1 = ' It was a dark and stormy night. I was all alone sitting on a red chair.'
s2 = ' It was a murky and stormy night. I was all alone sitting on a crimson chair.'

init = SequenceMatcher$new(string1 = s1, string2 = s2)

init$ratio()
[1] 0.9127517

init$quick_ratio()
[1] 0.9127517

init$real_quick_ratio()
[1] 0.966443

The get_matching_blocks and get_opcodes return triples and 5-tuples describing matching subsequences. More information can be found in the Python’s difflib module and in the fuzzywuzzyR package documentation.
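For instance, a brief sketch using the SequenceMatcher object created above (the method names come from the class listing earlier; the returned structures mirror Python's difflib):

init$get_matching_blocks()   # triples (i, j, size) marking matching subsequences
init$get_opcodes()           # 5-tuples describing how to turn string1 into string2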

A last thing to note here is that the mentioned fuzzy string matching classes can be parallelized using the base R parallel package. For instance, the following MCLAPPLY_RATIOS function can take two vectors of character strings (QUERY1, QUERY2) and return the scores for each method of the FuzzMatcher class,

MCLAPPLY_RATIOS = function(QUERY1, QUERY2, class_fuzz = 'FuzzMatcher', method_fuzz = 'QRATIO', threads = 1, ...) {

  init <- eval(parse(text = paste0(class_fuzz, '$new()')))
  METHOD = paste0('init$', method_fuzz)

  if (threads == 1) {
    res_qrat = lapply(1:length(QUERY1), function(x)
      do.call(eval(parse(text = METHOD)), list(QUERY1[[x]], QUERY2[[x]], ...)))
  } else {
    res_qrat = parallel::mclapply(1:length(QUERY1), function(x)
      do.call(eval(parse(text = METHOD)), list(QUERY1[[x]], QUERY2[[x]], ...)), mc.cores = threads)
  }

  return(res_qrat)
}

query1 = c('word1', 'word2', 'word3')
query2 = c('similarword1', 'similar_word2', 'similarwor')

quer_res = MCLAPPLY_RATIOS(query1, query2, class_fuzz = 'FuzzMatcher', method_fuzz = 'QRATIO', threads = 1)

unlist(quer_res)
# example output
[1] 59 56 40

reticulate package

My personal opinion is that the newly released reticulate package is good news (for all R-users with minimal knowledge of python) and bad news (for package maintainers whose packages do not cover the full spectrum of a subject in comparison to an existing python library) at the same time. I’ll explain this in the following two examples.

As an R user, I'd always wanted a truncated SVD function similar to the one in the sklearn Python library. So, now in R, using the reticulate package and the mnist data set, one can do,

reticulate::py_module_available('sklearn')   # check that 'sklearn' is available in your OS
[1] TRUE

dim(mnist)   # after downloading and opening the data from the previous link
70000 785

mnist = as.matrix(mnist)   # convert to matrix

trunc_SVD = reticulate::import('sklearn.decomposition')

res_svd = trunc_SVD$TruncatedSVD(n_components = 100L, n_iter = 5L, random_state = 1L)

res_svd$fit(mnist)
# TruncatedSVD(algorithm='randomized', n_components=100, n_iter=5,
#              random_state=1, tol=0.0)

out_svd = res_svd$transform(mnist)

str(out_svd)
# num [1:70000, 1:100] 1752 1908 2289 2237 2236 ...

class(out_svd)
# [1] "matrix"

to receive the desired output ( a matrix with 70000 rows and 100 columns (components) ).

As a package maintainer, I do receive e-mails from users of my packages from time to time. In one of them, a user asked me if the hog function of the OpenImageR package is capable of plotting the hog features. It isn't, but now an R user can, for instance, use the scikit-image Python library to plot the hog features using the following code chunk,

reticulate::py_module_available("skimage") # check that 'sklearn' is available in your OS # [1] TRUE feat <- reticulate::import("skimage.feature") # import module data_sk <- reticulate::import("skimage.data") # import data color <- reticulate::import("skimage.color") # import module to plot tmp_im = data_sk$astronaut() # import specific image data ('astronaut') dim(tmp_im) # [1] 512 512 3 image = color$rgb2gray(tmp_im) # convert to gray dim(image) # [1] 512 512 res = feat$hog(image, orientations = 8L, pixels_per_cell = c(16L, 16L), cells_per_block = c(1L, 1L), visualise=T) str(res) # List of 2 # $ : num [1:8192(1d)] 1.34e-04 1.53e-04 6.68e-05 9.19e-05 7.93e-05 ... # $ : num [1:512, 1:512] 0 0 0 0 0 0 0 0 0 0 ... OpenImageR::imageShow(res[[2]]) # using the OpenImageR to plot the data

As a final word, I think that the reticulate package, although not that popular yet, will make a difference in the R community.

The README.md file of the fuzzywuzzyR package includes the SystemRequirements and detailed installation instructions for each OS.

An updated version of the fuzzywuzzyR package can be found in my Github repository and to report bugs/issues please use the following link, https://github.com/mlampros/fuzzywuzzyR/issues.


Keeping Up with Your Data Science Options

Thu, 04/13/2017 - 00:19

(This article was first published on R – r4stats.com, and kindly contributed to R-bloggers)

The field of data science is changing so rapidly that it’s quite hard to keep up with it all. When I first started tracking The Popularity of Data Science Software in 2010, I followed only ten packages, all of them classic statistics software. The term data science hadn’t caught on yet, data mining was still a new thing. One of my recent blog posts covered 53 packages, and choosing them from a list of around 100 was a tough decision!

To keep up with the rapidly changing field, you can read the information on a package's web site, see what people are saying on blog aggregators such as R-Bloggers.com or StatsBlogs.com, and if it sounds good, download a copy and try it out. What's much harder to do is figure out how they all relate to one another. A helpful source of information on that front is the book Disruptive Analytics, by Thomas Dinsmore.

I was lucky enough to be the technical reviewer for the book, during which time I ended up reading it twice. I still refer to it regularly as it covers quite a lot of material. In a mere 262 pages, Dinsmore manages to describe each of the following packages, how they relate to one another, and how they fit into the big picture of data science:

  • Alluxio
  • Alpine Data
  • Alteryx
  • APAMA
  • Apex
  • Arrow
  • Caffe
  • Cloudera
  • Deeplearning4J
  • Drill
  • Flink
  • Giraph
  • Hadoop
  • HAWQ
  • Hive
  • IBM SPSS Modeler
  • Ignite
  • Impala
  • Kafka
  • KNIME Analytics Platform
  • Kylin
  • MADLib
  • Mahout
  • MapR
  • Microsoft R Server
  • Phoenix
  • Pig
  • Python
  • R
  • RapidMiner
  • Samza
  • SAS
  • SINGA
  • Skytree Server
  • Spark
  • Storm
  • Tajo
  • Tensorflow
  • Tez
  • Theano
  • Trafodion

As you can tell from the title, a major theme of the book is how open source software is disrupting the data science marketplace. Dinsmore’s blog, ML/DL: Machine Learning, Deep Learning, extends the book’s coverage as data science software changes from week to week.

I highly recommend both the book and the blog. Have fun keeping up with the field!


Data Amp: a major on-line Microsoft event, April 19

Wed, 04/12/2017 - 23:55

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

This coming Wednesday, April 19 at 8AM Pacific Time (click for your local time), Microsoft will be hosting a major on-line event of interest to anyone working with big data, analytics, and artificial intelligence: Microsoft Data Amp.

During Data Amp, Executive Vice President Scott Guthrie and Corporate Vice President Joseph Sirosh will share how Microsoft’s latest innovations put data, analytics and artificial intelligence at the heart of business transformation. The event will include exciting announcements that will help you derive even more value from the cloud, enable transformative application development, and ensure you can capitalize on intelligence from any data, any size, anywhere, across Linux and other open source technologies.

When the event begins you can watch the livestream at the link below. You may also like to sign up before the event for early access to sessions on SQL Server vNext, Azure Data Services and Cortana Intelligence Suite.  

Microsoft: Data Amp — where data gets to work


CAS RPM: Installing & running Rattle in RStudio

Wed, 04/12/2017 - 22:10

(This article was first published on triKnowBits, and kindly contributed to R-bloggers)

At the Ratemaking and Product Management (RPM) seminar of the Casualty Actuarial Society (CAS) in San Diego last month, Linda Brobeck, Peggy Brinkmann, and yours truly gave a concurrent session on a machine learning technique called decision tree analysis. See the DSPA-2 session — and other sessions — at this link. Technical issues precluded showing the video of  installing and running Rattle within RStudio. Thus this post.

As Peggy notes, installing and running Rattle takes only three "commands" in R:

install.packages("rattle", dependencies = c("Depends", "Suggests"))
library(rattle)
rattle()

Although not necessary, the "Suggests" option above can avoid Rattle's annoying requests for more packages as you click through its GUI; the downside is the time to download and install about 900 packages. Below is a link to a video of me running through the three lines above:

download/install Rattle

For those whose IT departments restrict installing R and RStudio on company equipment, this presentation used the software on a USB drive. The first minute or so of the video shows me setting up the shortcut on the USB drive so RStudio runs without a bunch of confusing questions, especially for a newbie. If you have R and RStudio installed on your computer already, you can fast forward through the first minute.

Peggy put her three lines of code into a script called 'Rattle.R'. That's the file I double-click around the 1:10 mark. You can see the three lines of the script in the upper left pane of RStudio. I run the first line starting at 8:17 am, which finishes at 9:32 am (the 1:42 mark in the video). Plenty of time to read the newspaper. Then I run the second line which, despite the "Depends" and "Suggests" settings above, has to download and install yet another package, GTK+. (There are reasons why that's necessary, but not important right now.) That takes just a few minutes. Then I run the third line, which fires up the Rattle user interface (GUI). It's a little hard to see, but the Rattle GUI shows up as another process running in the taskbar.

At this point you can start using rattle/rpart per Peggy's presentation. The links to our three presentations are reproduced here for convenience: Linda Brobeck, Peggy Brinkmann, Dan Murphy.
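If you'd rather skip the GUI, the decision trees that Rattle builds come from the rpart package, so a minimal stand-alone sketch (using a built-in data set, not part of the presentation) looks like this:

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")   # grow a classification tree
print(fit)                                                  # text view of the splits
plot(fit); text(fit)                                        # quick base-graphics plot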

Contact me if questions.


New R course: Beginning Bayes in R

Wed, 04/12/2017 - 19:50

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

Hello, R users! Today we’re launching Beginning Bayes in R by Jim Albert.

There are two schools of thought in the world of statistics, the frequentist perspective and the Bayesian perspective. At the core of the Bayesian perspective is the idea of representing your beliefs about something using the language of probability, collecting some data, then updating your beliefs based on the evidence contained in the data. This provides a convenient way of implementing the scientific method for learning about the world we live in. Bayesian statistics is increasingly popular due to recent improvements in computation, the ability to fit a wide range of models and to produce intuitive interpretations of the results.

Start course for free

Beginning Bayes in R features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a master of Bayesian statistics in R!

What you’ll learn

Chapter 1 introduces the idea of discrete probability models and Bayesian learning. You’ll express your opinion about plausible models by defining a prior probability distribution, you’ll observe new information, and then, you’ll update your opinion about the models by applying Bayes’ theorem. Start the first chapter for free.

Chapter 2 describes learning about a population proportion using discrete and continuous models. You’ll use a beta curve to represent prior opinion about the proportion, take a sample and observe the number of successes and failures, and construct a beta posterior curve that combines both the information in the prior and in the sample. You’ll then use the beta posterior curve to draw inferences about the population proportion.
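As a rough sketch of the kind of update described here (a generic beta-binomial example, not taken from the course materials): with a Beta(a, b) prior and s successes in n trials, the posterior is Beta(a + s, b + n - s).

a <- 3; b <- 3      # prior: Beta(3, 3)
s <- 12; n <- 20    # observed: 12 successes in 20 trials
curve(dbeta(x, a + s, b + n - s), from = 0, to = 1,
      xlab = "proportion", ylab = "posterior density")
qbeta(c(0.05, 0.95), a + s, b + n - s)   # 90% credible interval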

Chapter 3 introduces Bayesian learning about a population mean. You’ll sample from a normal population with an unknown mean and a known standard deviation, construct a normal prior to reflect your opinion about the location of the mean before sampling and see that the posterior distribution also has a normal form with updated values of the mean and standard deviation. You’ll also get more practice drawing inferences from the posterior distribution, only this time, about a population mean.
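Similarly, a hedged sketch of the conjugate normal update with known sampling standard deviation (again generic, not from the course): the posterior precision is the prior precision plus n / sigma^2, and the posterior mean is the precision-weighted average of the prior mean and the sample mean.

mu0 <- 10; tau0 <- 2   # prior mean and prior standard deviation
sigma <- 4; n <- 25    # known sampling sd and sample size
ybar <- 12             # observed sample mean

post_var <- 1 / (1 / tau0^2 + n / sigma^2)
post_mean <- post_var * (mu0 / tau0^2 + n * ybar / sigma^2)
c(post_mean, sqrt(post_var))   # posterior mean and standard deviation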

Finally, suppose you’re interested in comparing proportions from two populations. You take a random sample from each population and you want to learn about the difference in proportions. Chapter 4 will illustrate the use of discrete and continuous priors to do this kind of inference. You’ll use a Bayesian regression approach to learn about a mean or the difference in means when the sampling standard deviation is unknown. Enjoy!

Start course for free


Predicting tides in R

Wed, 04/12/2017 - 18:11

(This article was first published on R – R is my friend, and kindly contributed to R-bloggers)


Water movement in estuaries is affected by many processes acting across space and time. Tidal exchange with the ocean is an important hydrodynamic process that can define several characteristics of an estuary. Physical flushing rates and water circulation are often controlled by tidal advection, whereas chemical and biological components are affected by the flux of dissolved or particulate components with changes in the tide. As such, describing patterns of tidal variation is a common objective of coastal researchers and environmental managers.

Tidal predictions are nothing new. A clever analog approach has been around since the late 1800s. The tide-predicting machine represents the tide as the summation of waves with different periods and amplitudes. Think of a continuous line plot where the repeating pattern is linked to a rotating circle: representing the line in two dimensions from the rotating circle creates a sine wave with an amplitude equal to the radius of the circle. A more complex plot can be created by adding the output of two or more rotating disks, where each disk varies in radius and rate of rotation. The tide-predicting machine is nothing more than a set of rotating disks linked to a single graph as the sum of the rotations from all disks. Here's a fantastic digital representation of the tide-predicting machine:

Tides are caused primarily by the gravitational pull of the sun and moon on the earth's surface. The elliptical orbits of both the moon around the earth and the earth around the sun produce periodic but unequal forces that influence water movement. These forces, combined with local surface topography and large-scale circulation patterns from uneven heating of the earth's surface, lead to the variation of tidal patterns across the globe. Although complex, these periodic patterns can be characterized as the summation of sine waves, where one wave represents the effect of a single physical process (e.g., the diurnal pull of the moon). Describing these forces was the objective of the earlier tide-predicting machines. Fortunately for us, modern software (i.e., R) provides us with a simpler and less expensive approach based on harmonic regression.

Approach

We’ll create and sum our own sine waves to demonstrate complexity from addition. All sine waves follow the general form y as a function of x:

y = \alpha + \beta \sin(2\pi f x + \Phi)

where the amplitude of the wave is beta and the frequency (or 1 / period) is f. The parameters alpha and Phi represent scalar shifts in the curve up/down and left/right, respectively. We can easily create a function in R to simulate sine waves with different characteristics. This function takes the parameters from the above equation as arguments and returns a sine wave (y) equal in length to the input time series (x). The alpha and beta are interpreted as units of wave height (e.g., meters) and f and Phi are in hours.

# function for creating sine wave
waves <- function(time_in, alpha = 0, beta = 1, freq = 24, phi = 0){

  # timestep per hour
  time_step <- 60 / unique(diff(time_in))

  # set phi as difference in hours from start of time_in
  phi <- min(time_in) + phi * 3600
  phi <- as.numeric(difftime(phi, min(time_in)))
  phi <- phi / time_step

  # get input values to cos func
  in_vals <- seq(0, length(time_in), length = length(time_in))
  in_vals <- in_vals / time_step
  in_vals <- 2 * pi * in_vals * 1 / freq

  # wave
  y <- alpha + beta * sin(in_vals + phi)

  return(y)

}

The default arguments will return a sine wave with an amplitude of one meter and frequency of one wave per 24 hours. Two additional time series are created that vary these two parameters.

# input time series for two weeks, 15 minute time step
x <- as.POSIXct(c('2017-04-01', '2017-04-15'))
x <- seq(x[1], x[2], by = 60 * 15)

# get three sine waves
# a: default
# b: amplitude 0.5, 48 hour period
# c: amplitude 2, 12 hour period
a <- waves(x)
b <- waves(x, beta = 0.5, f = 48)
c <- waves(x, beta = 2, f = 12)

We can combine all three waves in the same data object, take the summation, and plot to see how it looks.

# for data munging and plotting
library(tidyverse)

# get sum of all y values, combine to single object
yall <- rowSums(cbind(a, b, c))
dat <- data.frame(x, a, b, c, yall) %>%
  gather('var', 'val', -x)

# plot
ggplot(dat, aes(x = x, y = val)) +
  geom_line() +
  facet_wrap(~var, ncol = 1) +
  theme_bw()

The important piece of information we get from the plot is that adding simple sine waves can create complex patterns. As a general rule, about 83% of the variation in tides is created by seven different harmonic components that, when combined, lead to the complex patterns we observe from monitoring data. These components are described as being of lunar or solar origin, with relative periods occurring either once or twice daily. For example, the so-called 'M2' component is typically the dominant tidal wave caused by the moon, twice daily. The periods of tidal components are constant across locations but the relative strengths (amplitudes) vary considerably.
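For quick reference, the approximate periods of these seven constituents (standard published values, in hours) can be kept in a named vector:

# approximate periods (hours) of the main tidal constituents
periods <- c(M2 = 12.42, S2 = 12.00, N2 = 12.66, K2 = 11.97,
             K1 = 23.93, O1 = 25.82, P1 = 24.07)
periods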

The oce package in R has a nifty function for predicting up to 69 different tidal constituents. You’ll typically only care about the main components above but it’s useful to appreciate the variety of components included in a tidal signal. We’ll apply the tidem function from oce to predict the tidal components on a subset of SWMP data. A two-week period from the Apalachicola Bay Dry Bar station is used.

library(SWMPr)
library(oce)

# clean, one hour time step, subset, fill gaps
dat <- qaqc(apadbwq) %>%
  setstep(timestep = 60) %>%
  subset(subset = c('2013-01-01 0:0', '2013-12-31 0:0'), select = 'depth') %>%
  na.approx(maxgap = 1e6)

The tidem function from oce requires a ‘sealevel’ object as input. Plotting the sealevel object using the plot method from oce shows three panels: the first is the complete time series, the second is the first month in the record, and the third is a spectral decomposition of the tidal components as cycles per hour (cph).

datsl <- as.sealevel(elevation = dat$depth, time = dat$datetimestamp)
plot(datsl)

We can create a model to estimate the components from the table above using tidem. Here, we estimate each component separately to extract predictions for each, which we then sum to estimate the complete time series.

# tidal components to estimate
constituents <- c('M2', 'S2', 'N2', 'K2', 'K1', 'O1', 'P1')

# loop through tidal components, predict each with tidem
preds <- sapply(constituents, function(x){

  mod <- tidem(t = datsl, constituent = x)
  pred <- predict(mod)
  pred - mean(pred)

})

# combine prediction, sum, add time data
predall <- rowSums(preds) + mean(datsl[['elevation']])
preds <- data.frame(time = datsl[['time']], preds, Estimated = predall)

head(preds)
##                  time           M2           S2          N2            K2
## 1 2013-01-01 00:00:00 -0.111578526 -0.020833606 0.000215982 -0.0048417234
## 2 2013-01-01 01:00:00 -0.118544835 -0.008940681 0.006428260 -0.0093752262
## 3 2013-01-01 02:00:00 -0.095806627  0.005348532 0.011088593 -0.0113830570
## 4 2013-01-01 03:00:00 -0.049059634  0.018205248 0.013072149 -0.0103243372
## 5 2013-01-01 04:00:00  0.009986414  0.026184523 0.011900172 -0.0064842694
## 6 2013-01-01 05:00:00  0.066540974  0.027148314 0.007855534 -0.0008973087
##              K1         O1            P1 Estimated
## 1  0.0911501572 0.01312209  0.0381700294  1.463683
## 2  0.0646689921 0.03909021  0.0340807303  1.465686
## 3  0.0337560517 0.06274939  0.0276811946  1.491713
## 4  0.0005294868 0.08270543  0.0194051690  1.532812
## 5 -0.0327340223 0.09778235  0.0098135843  1.574727
## 6 -0.0637552642 0.10709170 -0.0004434629  1.601819

Plotting two weeks from the estimated data shows the results. Note the variation in amplitude between the components. The M2, K1, and O1 components are the largest at this location. Also note the clear spring/neap variation in range every two weeks for the combined time series. This complex fortnightly variation is created simply by adding the separate sine waves.

# prep for plot
toplo <- preds %>%
  gather('component', 'estimate', -time) %>%
  mutate(component = factor(component, level = c('Estimated', constituents)))

# plot two weeks
ggplot(toplo, aes(x = time, y = estimate, group = component)) +
  geom_line() +
  scale_x_datetime(limits = as.POSIXct(c('2013-07-01', '2013-07-31'))) +
  facet_wrap(~component, ncol = 1, scales = 'free_y') +
  theme_bw()

All tidal components can of course be estimated together. By default, the tidem function estimates all 69 tidal components. Looking at our components of interest shows the same estimated amplitudes in the plot above.

# estimate all components together
mod <- tidem(t = datsl)

# get components of interest
amps <- data.frame(mod@data[c('name', 'amplitude')]) %>%
  filter(name %in% constituents) %>%
  arrange(amplitude)

amps
##   name  amplitude
## 1   K2 0.01091190
## 2   N2 0.01342395
## 3   S2 0.02904518
## 4   P1 0.04100388
## 5   O1 0.11142455
## 6   M2 0.12005114
## 7   K1 0.12865764

And of course comparing the model predictions with the observed data is always a good idea.

# add predictions to observed data
dat$Estimated <- predict(mod)

# plot one month
ggplot(dat, aes(x = datetimestamp, y = depth)) +
  geom_point() +
  geom_line(aes(y = Estimated), colour = 'blue') +
  scale_x_datetime(limits = as.POSIXct(c('2013-07-01', '2013-07-31'))) +
  scale_y_continuous(limits = c(0.9, 2)) +
  theme_bw()

The fit is not perfect, but this could be due to several reasons, none of which are directly related to the method – instrument drift, fouling, water movement from non-tidal sources, etc. The real value of the model is that we can use it to fill missing observations in tidal time series or to predict future observations. We also get reasonable estimates of the main tidal components, i.e., which physical forces are really driving the tide and how large their contributions are. For example, our data from Apalachicola Bay showed that the tide is driven primarily by the M2, K1, and O1 components, each with a relative amplitude of about 0.1 meter. This is consistent with general patterns of micro-tidal systems in the Gulf of Mexico. Comparing tidal components in other geographic locations would produce very different results, both in the estimated amplitudes and the dominant components.
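As a rough sketch of that idea (not from the original analysis), the predict method for a tidem fit can take a vector of new POSIXct times through its newdata argument, so the harmonic model can be evaluated at unobserved or future times. The dates below are made up; check ?predict.tidem before relying on this:

# sketch: evaluate the fitted harmonic model at future (or missing) times
# assumes `mod` is the tidem fit from above
newtimes <- seq(as.POSIXct('2014-01-01 00:00', tz = 'UTC'),
                as.POSIXct('2014-01-07 00:00', tz = 'UTC'),
                by = 'hour')
future <- data.frame(datetimestamp = newtimes,
                     Estimated = predict(mod, newdata = newtimes))
head(future)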

TL/DR

Here’s how to estimate the tide from an observed time series. The data are from SWMPr and the tidem model is from oce.

library(SWMPr)
library(oce)

# clean input data, one hour time step, subset, fill gaps
dat <- qaqc(apadbwq) %>%
  setstep(timestep = 60) %>%
  subset(., subset = c('2013-01-01 0:0', '2013-12-31 0:0'), select = 'depth') %>%
  na.approx(maxgap = 1e6)

# get model
datsl <- as.sealevel(elevation = dat$depth, time = dat$datetimestamp)
mod <- tidem(t = datsl)

# add predictions to observed data
dat$Estimated <- predict(mod)

# plot
ggplot(dat, aes(x = datetimestamp, y = Estimated)) +
  geom_line() +
  theme_bw()

 

To leave a comment for the author, please follow the link and comment on their blog: R – R is my friend. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Data Structures Exercises (Part 2)

Wed, 04/12/2017 - 17:50

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In these exercises we will cover hashes, factors and zoo objects in R.

Answers to the exercises are available here.
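If you have not worked with these structures before, here is a minimal sketch of what each one looks like (the object names are made up for illustration; these are not the exercise answers):

# a hash can be mimicked in base R with a named list (or an environment)
h <- list(emp1 = "Alice", emp2 = "Bob", emp3 = "Carol")
h[["emp2"]]                                   # look up a value by key

# a factor stores categorical values with a fixed set of levels
marks <- factor(c("A", "B", "A", "C"), levels = c("A", "B", "C"))
summary(marks)

# zoo attaches an ordered index (e.g. dates) to observations
library(zoo)
z <- zoo(c(101.2, 102.5, 103.1), order.by = as.Date("2017-04-10") + 0:2)
z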

Learn more about Data Structures in the online course
R Data Pre-Processing & Data Management – Shape your Data!
In this course you will learn how to:

  • Convert your data in R
  • Use SQL queries embedded in R
  • Work with different data management packages like dplyr
  • And much more

Exercise 1

Create a hash of key-value pairs for three employees, with the employee as the key and the name as the value.

Exercise 2

Create a hash for an employee with Name, Age, Address & SSN as the keys and the employee’s details as the values.

Exercise 3

Create a hash and use the names() function to assign names to the keys.

Exercise 4

Create a factor based on students’ marks and print a summary of it.

Exercise 5

Create a factor based on heights and print the levels and summary of the heights factor.

Exercise 6

Create a factor of 10 random values between 1 and 100, with no repeating values, and check that the variable is a factor.

Exercise 7

Create a factor based on students’ ages and plot a graph of the ages.

Exercise 8

Create a random factor based on 0 & 1, assign FALSE for 0 and TRUE for 1, and check the variable by printing its type.

Exercise 9

Create a random factor based on 0 & 1, assign FALSE for 0 and TRUE for 1, and assign the new 11th value as FALSE.

Exercise 10

Create a data frame of dates and company returns and use zoo to convert the data.

Related exercise sets:
  1. Data frame exercises Vol. 2
  2. Descriptive Analytics-Part 1: Data Formatting Exercises
  3. Data Science for Doctors – Part 1 : Data Display
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

The Periodic Table of Data Science

Wed, 04/12/2017 - 17:30

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

This periodic table can serve as a guide to navigate the key players in the data science space. The resources in the table were chosen by looking at surveys taken from data science users, such as the 2016 Data Science Salary Survey by O’Reilly, the 2017 Magic Quadrant for Data Science Platforms by Gartner, and the KD Nuggets 2016 Software Poll results, among other sources. The categories in the table are not all mutually exclusive.

Check out the full periodic table of data science below:

Navigating The Periodic Table of Data Science

You’ll see that the table’s left-hand section lists companies that have to do with education: here, you’ll find courses, boot camps and conferences. On the right-hand side, on the other hand, you’ll find resources that will keep you up to date with the latest news, hottest blogs and relevant material in the data science community. In the middle, you’ll find tools that you can use to get started with data science: programming languages, projects and challenges, data visualization tools, etc.

The table puts the data science resources, tools and companies in the following 13 categories: 

Courses: for those who are looking to learn data science, there are a bunch of sites (companies) out there that offer data science courses. You’ll find various options here that will probably suit your learning style: DataCamp for learning by doing, MOOCs by Coursera and Edx, and much more!

Boot camps: this section includes resources for those who are looking for more mentored options to learn data science. You’ll see that boot camps like The Data Incubator or Galvanize have been included. 

Conferences: learning is not only something you do in courses or boot camps. Conferences are something that learners often forget, but they also contribute to learning data science: it’s important that you attend them as a data science aspirant, as you’ll get in touch with the latest advancements and the best industry experts. Some of the ones listed in the table are UseR!, Tableau Conference and PyData.

Data: practice makes perfect, and this is also the case for data science. You’ll need to find data sets in order to start practicing what you learned in the courses on real-life data or to build your data science portfolio. Data is the basic building block of data science, and finding the right data can be one of the hardest things. Some of the options that you could consider when you’re looking for cool data sets are data.world, Quandl and Statista.

Projects & Challenges, Competitions: after practicing, you might also consider taking on bigger projects: data science portfolios, competitions, challenges, …. You’ll find all of these in this category of the Periodic Table of Data Science! One of the most popular options is probably Kaggle, but also DrivenData or DataKind are worth checking out!

Programming Languages & Distributions: data scientists generally use not just one but many programming languages. Some languages, like Python, have recently gained a lot of traction in the community, and Python distributions, like Anaconda, seem to find their way to data science aspirants.

Search & Data Management: this enormous category contains all tools that you can use to search and manage your data in some way. You’ll see, on the one hand, a search library like Lucene, but also a relational database management system like Oracle.

Machine Learning & Stats: this category not only offers you libraries to get started with machine learning and stats with programming languages such as Python, but also entire platforms, such as Alteryx or DataRobot.

Data Visualization & Reporting: after you have analyzed and modeled your data, you might be looking to visualize the results and report on what you have been investigating. You can make use of open-source options like Shiny or Matplotlib to do this, or fall back on commercial options such as Qlikview or Tableau.

Collaboration: collaboration is a trending topic in the data science community. As you grow, you’ll also find the need to work in teams (even if it’s just with one other person!) and in those cases, you’ll want to make use of notebooks like Jupyter. But even if you’re working on your own, an IDE can come in handy when you’re just starting out. In such cases, consider Rodeo or Spyder.

Community & Q&A: asking questions and falling back on the community is one of the things that you’ll probably do a lot when you’re learning data science. If you’re ever unsure of where you can find the answer to your data science question, you can be sure to find it in sites such as StackOverflow, Quora, Reddit, etc. 

News, Newsletters & Blogs: you’ll find that the community is evolving and growing rapidly: following the news and the latest trends is a necessity. General newsletters like Data Science Weekly or Data Elixir, or language-specific newsletters like Python Weekly or R Weekly can give you your weekly dose of data science right in your mailbox. But also blogging sites like RBloggers or KD Nuggets are worth following! 

Podcasts: last, but definitely not least, are the podcasts. These are great in many ways, as you’ll get introduced to expert interviews, like in Becoming A Data Scientist or to specific data science topics, like in Data Stories or Talking Machines!

Are you thinking of another resource that should be added to this periodic table?  Leave a comment below and tell us about it!

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Dataviz of the week, 12/4/17

Wed, 04/12/2017 - 15:02

(This article was first published on R – Robert Grant's stats blog, and kindly contributed to R-bloggers)

This week, a chart with some Bayesian polemic behind it. Alexander Etz put this on Twitter:

He is working on an R package to provide easy Bayesian adjustments for reporting bias with a method by Guan & Vandekerckhove. Imagine a study reporting three p-values, all just under the threshold of significance, and with small-ish sample sizes. Sound suspicious?

Sounds like someone’s been sniffing around after any pattern they could find. Trouble is, if they don’t tell you about the other results they threw away (reporting bias), you don’t know whether to believe them or not. Or there are a thousand similar studies but this is the (un)lucky one and this author didn’t do anything wrong in their own study (publication bias).

Well, you have to make some assumptions to do the adjustment, but at least, being Bayesian, you don’t have to assume one number for the bias; you can have a distribution. Here, the orange distribution is the posterior for the true effect once the bias has been added (in this case, p>0.05 has a 0% chance of getting published, which is not unrealistic in some circles!). This is standard probabilistic stuff but it doesn’t get done because the programming seems so daunting to a lot of people. The more easy tools – with nice helpful visualisations – the better.

To leave a comment for the author, please follow the link and comment on their blog: R – Robert Grant's stats blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

tidyverse updates

Wed, 04/12/2017 - 14:36

Over the past couple of months there have been a bunch of smaller releases to packages in the tidyverse. These include:

  • forcats 0.2.0, for working with factors.
  • readr 1.1.0, for reading flat-files from disk.
  • stringr 1.2.0, for manipulating strings.
  • tibble 1.3.0, a modern re-imagining of the data frame.

This blog post summarises the most important new features, and points to the full release notes where you can learn more.

(If you’ve never heard of the tidyverse before, it’s a set of packages that are designed to work together to help you do data science. The best place to learn all about it is R for Data Science.)

forcats 0.2.0

forcats has three new functions:

  • as_factor() is a generic version of as.factor(), which creates factors from character vectors with levels ordered by appearance, rather than alphabetically. This means that as_factor(x) will always return the same result, regardless of the current locale (see the short example after this list).
  • fct_other() makes it easier to convert selected levels to “other”:

    x <- factor(rep(LETTERS[1:6], times = c(10, 5, 1, 1, 1, 1)))

    x %>% fct_other(keep = c("A", "B")) %>% fct_count()
    #> # A tibble: 3 × 2
    #>       f     n
    #>
    #> 1     A    10
    #> 2     B     5
    #> 3 Other     4

    x %>% fct_other(drop = c("A", "B")) %>% fct_count()
    #> # A tibble: 5 × 2
    #>       f     n
    #>
    #> 1     C     1
    #> 2     D     1
    #> 3     E     1
    #> 4     F     1
    #> 5 Other    15
  • fct_relabel() allows programmatic relabeling of levels:

    x <- factor(letters[1:3])
    x
    #> [1] a b c
    #> Levels: a b c

    x %>% fct_relabel(function(x) paste0("-", x, "-"))
    #> [1] -a- -b- -c-
    #> Levels: -a- -b- -c-
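As a quick illustration of the as_factor() behaviour described in the first bullet (a minimal sketch; the vector x here is made up):

x <- c("banana", "apple", "cherry", "apple")
levels(as.factor(x))           # alphabetical: "apple", "banana", "cherry"
levels(forcats::as_factor(x))  # order of appearance: "banana", "apple", "cherry"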

See the full list of other changes in the release notes.

stringr 1.2.0

This release includes a change to the API: str_match_all() now returns NA if an optional group doesn’t match (previously it returned “”). This is more consistent with str_match() and other match failures.

x <- c("a=1,b=2", "c=3", "d=") x %>% str_match("(.)=(\\d)?") #> [,1] [,2] [,3] #> [1,] "a=1" "a" "1" #> [2,] "c=3" "c" "3" #> [3,] "d=" "d" NA x %>% str_match_all("(.)=(\\d)?,?") #> [[1]] #> [,1] [,2] [,3] #> [1,] "a=1," "a" "1" #> [2,] "b=2" "b" "2" #> #> [[2]] #> [,1] [,2] [,3] #> [1,] "c=3" "c" "3" #> #> [[3]] #> [,1] [,2] [,3] #> [1,] "d=" "d" NA

There are three new features:

  • In str_replace(), replacement can now be a function. The function is called once for each match and its return value is used as the replacement.

    redact <- function(x) {
      str_dup("-", str_length(x))
    }

    x <- c("It cost $500", "We spent $1,200 on stickers")
    x %>% str_replace_all("\\$[0-9,]+", redact)
    #> [1] "It cost ----"                "We spent ------ on stickers"
  • New str_which() mimics grep():

    fruit <- c("apple", "banana", "pear", "pinapple")

    # Matching positions
    str_which(fruit, "p")
    #> [1] 1 3 4

    # Matching values
    str_subset(fruit, "p")
    #> [1] "apple"    "pear"     "pinapple"
  • A new vignette (vignette("regular-expressions")) describes the details of the regular expressions supported by stringr. The main vignette (vignette("stringr")) has been updated to give a high-level overview of the package.

See the full list of other changes in the release notes.

readr 1.1.0

readr gains two new features:

  • All write_*() functions now support connections. This means that you can write directly to compressed formats such as .gz, .bz2 or .xz (and readr will automatically do so if you use one of those suffixes).

    write_csv(iris, "iris.csv.bz2")
  • parse_factor(levels = NULL) and col_factor(levels = NULL) will produce a factor column based on the levels in the data, mimicking factor parsing in base R (with the exception that levels are created in the order seen).

    iris2 <- read_csv("iris.csv.bz2", col_types = cols(
      Species = col_factor(levels = NULL)
    ))

See the full list of other changes in the release notes.

tibble 1.3.0

tibble has one handy new function: deframe() is the opposite of enframe(): it turns a two-column data frame into a named vector.

df <- tibble(x = c("a", "b", "c"), y = 1:3)
deframe(df)
#> a b c
#> 1 2 3
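For completeness, enframe() goes the other way; note that it uses its own default column names (name and value), so the round trip is not identical to the original df:

# turn the named vector back into a two-column tibble
enframe(deframe(df))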

See the full list of other changes in the release notes.

pandas “transform” using the tidyverse

Wed, 04/12/2017 - 04:43

(This article was first published on R – Stat Bandit, and kindly contributed to R-bloggers)

Chris Moffit has a nice blog on how to use the transform function in pandas. He provides some (fake) data on sales and asks the question of what fraction of each order is from each SKU.

Being a R nut and a tidyverse fan, I thought to compare and contrast the code for the pandas version with an implementation using the tidyverse.

First the pandas code:

import pandas as pd

dat = pd.read_excel('sales_transactions.xlsx')
dat['Percent_of_Order'] = dat['ext price'] / dat.groupby('order')['ext price'].transform('sum')

A similar implementation using the tidyverse:

library(tidyverse)
library(readxl)

dat <- read_excel('sales_transactions.xlsx')
dat <- dat %>%
  group_by(order) %>%
  mutate(Percent_of_Order = `ext price` / sum(`ext price`))

To leave a comment for the author, please follow the link and comment on their blog: R – Stat Bandit. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Data Science from a Strategic Business Perspective

Wed, 04/12/2017 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Last night I spoke at the Melbourne R User Group (MelbuRn) about data science from a strategic business perspective. It was great to see so many people attending.

My presentation outlined the strategy that I developed and am implementing for my employer Coliban Water. This strategy is based on a common-sense approach that leverages our existing strengths. This strategy was also outlined in an article for the Source journal.

Water utilities are, pardon the pun, awash with data. For decades we have been using a rudimentary version of the Internet of Things called SCADA (Supervisory Control and Data Acquisition). This system controls our assets and provides operators and analysts with the data they need. All this data is used to control our systems and stored for future reference.

There is no such thing as ‘dark data’. All data has been used for its designated purpose when it was created. My job at Coliban Water is to create value from this information.

In this presentation, I explained how Coliban Water is creating more value from data by following a systematic strategy.

Data Science Strategy Presentation

The post Data Science from a Strategic Business Perspective appeared first on The Devil is in the Data.

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Sow the seeds, know the seeds

Wed, 04/12/2017 - 02:00

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

When you do simulations, for instance in R, e.g. drawing samples from a distribution, it’s best to set a random seed via the function set.seed in order to have reproducible results. The function has no default value. I think I mostly use set.seed(1). Last week I received an R script from a colleague in which he used a weird number in set.seed (maybe a phone number? or maybe he let his fingers type randomly?), which made me curious about the usual seed values. As in my blog post about initial commit messages I used the Github API via the gh package to get a very rough answer (an answer seedling from the question seed?).
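For anyone who has never used it, a quick aside showing what set.seed buys you (a minimal base R example, not part of the GitHub analysis):

set.seed(1)
rnorm(3)  # some pseudo-random draws
set.seed(1)
rnorm(3)  # resetting the seed gives exactly the same draws again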

From Github API search endpoint you can get up to 1,000 results corresponding to a query which in the case of set.seed occurrences in R code isn’t the whole picture but hopefully a good sample. I wrote a function to treat the output of a query to the API where I take advantage of the stringr package. I just want the thing inside set.seed() from the text matches returned by the API.

get_seeds_from_matches <- function(item){
  url <- item$html_url
  matches <- item$text_matches
  matches <- unlist(lapply(matches, "[[", "fragment"))
  matches <- stringr::str_split(matches, "\\\n", simplify = TRUE)
  matches <- stringr::str_extract(matches, "set\\.seed\\(.*\\)")
  matches <- stringr::str_replace(matches, "set\\.seed\\(", "")
  seeds <- stringr::str_replace(matches, "\\).*", "")
  seeds <- seeds[!is.na(seeds)]
  tibble::tibble(seed = seeds,
                 url = rep(url, length(seeds)))
}

After that I made the queries themselves, pausing every 30 pages because of the rate limiting, and adding a try around the call in order to stop as soon as I reached the 1,000 results. Not a very elegant solution but I wasn’t in a perfectionist mood.

Note that the header "Accept" = 'application/vnd.github.v3.text-match+json' is very important, without it you wouldn’t get the text fragments in the results.

library("gh") seeds <- NULL ok <- TRUE page <- 1 while(ok){ matches <- try(gh("/search/code", q = "set.seed&language:r", .token = Sys.getenv("GITHUB_PAT"), .send_headers = c("Accept" = 'application/vnd.github.v3.text-match+json'), page = page), silent = TRUE) ok <- !is(matches, "try-error") if(ok){ seeds <- bind_rows(seeds, bind_rows(lapply(matches$items, get_seeds_from_matches))) } page <- page + 1 # wait 2 minutes every 30 pages if(page %% 30 == 1 & page > 1){ Sys.sleep(120) } } save(seeds, file = "data/2017-04-12-seeds.RData.RData") library("magrittr") load("data/2017-04-12-seeds.RData") head(seeds) %>% knitr::kable() seed url 1 https://github.com/berndbischl/ParamHelpers/blob/9d374430701d94639cc78db84f91a0c595927189/tests/testthat/helper_zzz.R 1 https://github.com/TypeFox/R-Examples/blob/d0917dbaf698cb8bc0789db0c3ab07453016eab9/ParamHelpers/tests/testthat/helper_zzz.R 1 https://github.com/cran/ParamHelpers/blob/92a49db23e69d32c8ae52585303df2875d740706/tests/testthat/helper_zzz.R 4.0 https://github.com/ACP-KR/AsanAdvR/blob/0517e88efce94266997d680e8b5a7c2a97c9277d/R-Object-Oriented-Programming-master/chapter4/chapter_4_ex11.R 4.0 https://github.com/ACP-KR/AsanAdvR/blob/0517e88efce94266997d680e8b5a7c2a97c9277d/R-Object-Oriented-Programming-master/chapter4/chapter_4_ex11.R 4.0 https://github.com/KellyBlack/R-Object-Oriented-Programming/blob/efbb0b81063baa30dd9d56d5d74b3f73b12b4926/chapter4/chapter_4_ex11.R

I got 984 entries, not 1,000 so maybe I lost some seeds in the process or the results weren’t perfect. The reason why I also added the URL of the script to the results was to be able to go and look at the code around surprising seeds.

Let’s have a look at the most frequent seeds in the sample.

table(seeds$seed) %>%
  broom::tidy() %>%
  dplyr::arrange(desc(Freq)) %>%
  head(n = 12) %>%
  knitr::kable()

Var1       Freq
seed        312
1           134
123          60
iseed        48
10           47
13121098     28
ss           24
20           21
1234         18
42           18
123456       15
0            14

So the most prevalent seed is a mystery because I’m not motivated enough to go scrape the code to find if the seed gets assigned a value before, like in that tweet I saw today. I was happy that 1 was so popular, maybe it means I belong?

I was surprised by two values. First, 13121098.

dplyr::filter(seeds, seed == "13121098") %>%
  head(n = 10) %>%
  knitr::kable()

seed      url
13121098  https://github.com/DJRumble/Swirl-Course/blob/4e2771141e579904eb6dd32bce51ff6e0d840d44/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R
13121098  https://github.com/swirldev/swirl_courses/blob/b3d432bfdf480c865af1c409ee0ee927c1fdbda0/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R
13121098  https://github.com/1vbutkus/swirl/blob/310874100536e1e7c66861eced9ecb52939a3e0a/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R
13121098  https://github.com/gotitsingh13/swirldev/blob/b7369b974ba76716fbcf6101bcbdc2db2f774d18/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R
13121098  https://github.com/pauloramazza/swirl_courses/blob/4e2771141e579904eb6dd32bce51ff6e0d840d44/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R
13121098  https://github.com/ildesoft/Swirl_Courses/blob/3e7f43cecbeb41e92e4f5972658f9b293e0e4b84/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R
13121098  https://github.com/hrdg/Regression_Models/blob/22f47ecf2ae62f553aa132d3d948cc6b4e1599cc/Residuals_Diagnostics_and_Variation/initLesson.R
13121098  https://github.com/Rizwanabro/Swirl-Course/blob/3e7f43cecbeb41e92e4f5972658f9b293e0e4b84/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R
13121098  https://github.com/mkgiitr/Data-Analytics/blob/1d659db1e9137b1fe595a6ef3356887de431b1be/win-library/3.1/swirl/Courses/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R
13121098  https://github.com/Jutair/R-programming-Coursera/blob/4faeed6ca780ee7f14b224c293cae77293146f37/Swirl/Rsubversion/branches/Writing_swirl_Courses/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R

I went and had a look and it seems most repositories correspond to code learnt in a Coursera course. I have taken a few courses from that specialization and loved it but I don’t remember learning about the special seed, too bad. Well I guess everyone used it to reproduce results but what does this number mean in the first place? Who typed it? A cat walking on the keyboard?

The other number that surprised me was 42, but then I remembered it is the “Answer to the Ultimate Question of Life, the Universe, and Everything”. I’d therefore say that this might be the coolest random seed. Now I can’t tell you whether it produces better results. Maybe it helps when your code actually tries to answer the Ultimate Question of Life, the Universe, and Everything?

To leave a comment for the author, please follow the link and comment on their blog: Maëlle. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

The National Map Base Maps

Wed, 04/12/2017 - 02:00

(This article was first published on The USGS OWI blog , and kindly contributed to R-bloggers)

A number of map services are offered through The National Map (TNM).
There are no use restrictions on these services.
However, map content is limited to the United States and Territories.
This post explains how to integrate TNM services within your own interactive web map using
Leaflet for R.

R packages required for this tutorial include
leaflet,
rgdal, and
dataRetrieval.
Install the required packages from the Comprehensive R Archive Network (CRAN)
using the following commands:

for (pkg in c("leaflet", "rgdal", "dataRetrieval")) {
  if (!pkg %in% rownames(utils::installed.packages()))
    utils::install.packages(pkg, repos = "https://cloud.r-project.org/")
}

The first step is to create a Leaflet map widget:

map <- leaflet::leaflet()

In Leaflet, a map layer is used to display a specific dataset.
Map layers are organized by group.
Many layers can belong to the same group, but each layer can belong to at most one group.
For this example, each map layer belongs to a discrete group.
Create a vector of unique group names identifying the five layers to be added to the map widget:

grp <- c("USGS Topo", "USGS Imagery Only", "USGS Imagery Topo", "USGS Shaded Relief", "Hydrography")

Specify the line of attribution text to display in the map using the Hypertext Markup Language (HTML) syntax:

att <- paste0("<a href='https://www.usgs.gov/'>", "U.S. Geological Survey</a> | ", "<a href='https://www.usgs.gov/laws/policies_notices.html'>", "Policies</a>")

Leaflet supports base maps using map tiles.
TNM base maps are available as Web Map Service (WMS) tiles.
Add tiled layers (base maps) that describe topographic information in TNM to the map widget:

GetURL <- function(service, host = "basemap.nationalmap.gov") {
  sprintf("https://%s/arcgis/services/%s/MapServer/WmsServer", host, service)
}

map <- leaflet::addWMSTiles(map, GetURL("USGSTopo"),
                            group = grp[1], attribution = att, layers = "0")
map <- leaflet::addWMSTiles(map, GetURL("USGSImageryOnly"),
                            group = grp[2], attribution = att, layers = "0")
map <- leaflet::addWMSTiles(map, GetURL("USGSImageryTopo"),
                            group = grp[3], attribution = att, layers = "0")
map <- leaflet::addWMSTiles(map, GetURL("USGSShadedReliefOnly"),
                            group = grp[4], attribution = att, layers = "0")

The content of these layers is described in the
TNM Base Maps document.

An overlay map layer adds information, such as river and lake features, to a base map.
Add the tiled overlay for the National Hydrography Dataset to the map widget:

opt <- leaflet::WMSTileOptions(format = "image/png", transparent = TRUE)
map <- leaflet::addWMSTiles(map, GetURL("USGSHydroCached"),
                            group = grp[5], options = opt, layers = "0")
map <- leaflet::hideGroup(map, grp[5])

Point locations, which appear on the map as icons, may be added to a base map using a marker overlay.
In this example, site locations are included for selected wells in the
USGS Idaho National Laboratory
water-quality observation network.
Create the marker-overlay dataset using the following commands (requires web access):

site_no <- c("USGS 1" = "432700112470801", "USGS 14" = "432019112563201", "USGS 8" = "433121113115801", "USGS 126A" = "435529112471301", "USGS 29" = "434407112285101", "USGS 52" = "433414112554201", "USGS 84" = "433356112574201", "TRA 4" = "433521112574201") dat <- dataRetrieval::readNWISsite(site_no) sp::coordinates(dat) <- c("dec_long_va", "dec_lat_va") sp::proj4string(dat) <- sp::CRS("+proj=longlat +datum=NAD83") dat <- sp::spTransform(dat, sp::CRS("+init=epsg:4326"))

Popups are small boxes containing text that appear when marker icons are clicked.
Specify the text to display in the popups using the HTML syntax:

num <- dat$site_no                          # site number
nam <- names(site_no)[match(num, site_no)]  # local site name
url <- sprintf("https://waterdata.usgs.gov/nwis/inventory/?site_no=%s", num)
pop <- sprintf("<b>Name:</b> %s<br/><b>Site No:</b> <a href='%s'>%s</a>",
               nam, url, num)

Add the marker overlay to the map widget:

opt <- leaflet::markerClusterOptions(showCoverageOnHover = FALSE)
map <- leaflet::addCircleMarkers(map, radius = 10, weight = 3, popup = pop,
                                 clusterOptions = opt, data = dat)

Add a Leaflet control feature that allows users to interactively show and hide base maps:

opt <- leaflet::layersControlOptions(collapsed = FALSE)
map <- leaflet::addLayersControl(map, baseGroups = grp[1:4],
                                 overlayGroups = grp[5], options = opt)

Print the map widget to display it in your web browser:

print(map)

Some users have reported that base maps do not render correctly in the
RStudio viewer.
Until RStudio can address this issue, the following workaround is provided:

options(viewer = NULL)
print(map)

And let’s not forget the R session information.

print(utils::sessionInfo())
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 14393)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10        xml2_1.1.1          knitr_1.15.1
##  [4] magrittr_1.5        hms_0.3             lattice_0.20-35
##  [7] xtable_1.8-2        R6_2.2.0            plyr_1.8.4
## [10] stringr_1.2.0       httr_1.2.1          dplyr_0.5.0
## [13] tools_3.3.3         rgdal_1.2-6         grid_3.3.3
## [16] DBI_0.6-1           htmltools_0.3.5     crosstalk_1.0.0
## [19] yaml_2.1.14         dataRetrieval_2.6.7 leaflet_1.1.0
## [22] digest_0.6.12       assertthat_0.1      tibble_1.3.0
## [25] shiny_1.0.1         reshape2_1.4.2      readr_1.1.0
## [28] htmlwidgets_0.8     curl_2.4            evaluate_0.10
## [31] mime_0.5            sp_1.2-4            stringi_1.1.5
## [34] jsonlite_1.4        lubridate_1.6.0     httpuv_1.3.3

To leave a comment for the author, please follow the link and comment on their blog: The USGS OWI blog . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Prepare real-world data for analysis with the vtreat package

Wed, 04/12/2017 - 00:03

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

As anyone who's tried to analyze real-world data knows, there are any number of problems that may be lurking in the data that can prevent you from being able to fit a useful predictive model:

  • Categorical variables can include infrequently-used levels, which will cause problems if sampling leaves them unrepresented in the training set.
  • Numerical variables can be in wildly different scales, which can cause instability when fitting models.
  • The data set may include several highly-correlated columns, some of which could be pruned from the data without sacrificing predictive power.
  • The data set may include missing values that need to be dealt with before analysis can begin.
  • … and many others

The vtreat package is designed to counter common data problems like these in a statistically sound manner. It's a data frame preprocessor which applies a number of data cleaning processes to the input data before analysis, using techniques such as impact coding and categorical variable encoding (the methods are described in detail in this paper). Further details can be found on the vtreat github page, where authors John Mount and Nina Zumel note:

Even with modern machine learning techniques (random forests, support vector machines, neural nets, gradient boosted trees, and so on) or standard statistical methods (regression, generalized regression, generalized additive models) there are common data issues that can cause modeling to fail. vtreat deals with a number of these in a principled and automated fashion.

One final note: the main function in the package, prepare, is a little like model.matrix in that categorical variables are converted into numeric variables using contrast codings. This means that the output is suitable for many machine-learning functions (like xgboost) that don't accept categorical variables.
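As a rough sketch of that design/prepare workflow (the toy data frame, variable names and outcome below are made up for illustration; see the linked Win-Vector post for the authors' own worked example):

library(vtreat)

# toy data: a categorical variable with a rare level and a missing value
d <- data.frame(x = c("a", "a", "b", "b", "rare", NA),
                z = c(1, 2, 3, 4, 5, 6),
                y = c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE))

# design a treatment plan for a binary outcome, then prepare the data;
# categorical levels become numeric columns and NAs are handled automatically
plan <- designTreatmentsC(d, varlist = c("x", "z"),
                          outcomename = "y", outcometarget = TRUE)
d_treated <- prepare(plan, d)
head(d_treated)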

The vtreat package is available on CRAN now, and you can find a worked example using vtreat in the blog post linked below.

Win-Vector Blog: vtreat: prepare data

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
