
RStudio Connect 1.4.4.1

Tue, 03/28/2017 - 20:42

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re excited to announce the release of RStudio Connect: version 1.4.4.1. This release includes the ability to manage different versions of your work on RStudio Connect.

Rollback / Roll Forward
The most notable feature of this release is the ability to “roll back” to a previously deployed version of your work or “roll forward” to a more recent version.

You can also download a particular version, perhaps as a starting place for a new report or application, and delete old versions that you want to remove from the server.

Other important features allow you to:

  • Specify the number of versions to retain. You can alter the setting Applications.BundleRetentionLimit to specify how many versions of your applications you want to keep on disk. By default, we retain all bundles eternally.
  • Limit the number of scheduled reports that will be run concurrently using the Applications.ScheduleConcurrency setting. This setting will help ensure that your server isn’t overwhelmed by too many reports all scheduled to run at the same time of day. The default is set to 2.
  • Create a printable view of your content with a new “Print” menu option.
  • Notify users of unsaved changes before they take an action in parameterized reports.

The release also includes numerous security and stability improvements.

If you haven’t yet had a chance to download and try RStudio Connect we encourage you to do so. RStudio Connect is the best way to share all the work that you do in R (Shiny apps, R Markdown documents, plots, dashboards, etc.) with collaborators, colleagues, or customers.

You can find more details or download a 45-day evaluation of the product at https://www.rstudio.com/products/connect/. Additional resources can be found below.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

nanotime 0.1.2

Tue, 03/28/2017 - 13:32

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new minor version of the nanotime package for working with nanosecond timestamps arrived yesterday on CRAN.

nanotime uses the RcppCCTZ package for (efficient) high(er) resolution time parsing and formatting up to nanosecond resolution, and the bit64 package for the actual integer64 arithmetic.
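If you have not used the package before, here is a minimal usage sketch (not part of the release notes; the timestamp is just an arbitrary example):

library(nanotime)
library(bit64)

# parse an ISO-8601 timestamp with nanosecond precision (handled via RcppCCTZ)
x <- nanotime("2017-03-27T13:32:00.123456789+00:00")
format(x)

# the underlying representation is an integer64 count of nanoseconds since
# the epoch, courtesy of bit64
as.integer64(x)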

This release just arranges things neatly before Leonardo Silvestri and I may shake things up with a possible shift to doing it all in S4, as we may need the added rigour for nanotime object operations in his ztsdb project.

Changes in version 0.1.2 (2017-03-27)
  • The as.integer64 function is now exported as well.

We also have a diff to the previous version thanks to CRANberries. More details and examples are at the nanotime page; code, issue tickets etc at the GitHub repository.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Introducing Boardgame Recommendation website built with Shiny

Tue, 03/28/2017 - 05:28

My family and I love boardgames.  We are also on the lookout for new ones that would fit with the ones we already like to play.  So I decided to create a boardgame recommendation website

https://larrydag.shinyapps.io/boardgame_reco/

I built it using R. The recommendation engine uses a very simple collaborative filtering algorithm based on correlation scores computed from other boardgame players' collection lists. The collections are gathered using the API from BoardgameGeek.com. It is very much a beta project, as I just wanted to get something working.
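To give a flavour of what correlation-based collaborative filtering looks like, here is a rough illustrative sketch; the 0/1 ownership matrix collections and the game names are assumptions for illustration, not the site's actual code:

# assume `collections` is a user-by-game 0/1 ownership matrix (one row per
# BoardgameGeek user, one column per game) and `my_games` is your collection
my_games <- c("Catan", "Carcassonne")   # hypothetical input
my_row <- as.numeric(colnames(collections) %in% my_games)

# correlate every user's collection with mine
sims <- apply(collections, 1, function(u) cor(u, my_row))

# score games by how often the 50 most similar users own them
neighbours <- collections[order(sims, decreasing = TRUE)[1:50], , drop = FALSE]
scores <- sort(colSums(neighbours), decreasing = TRUE)

# recommend games not already in my collection
head(setdiff(names(scores), my_games))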

I also wanted another project to build in Shiny.  I really like how easy it is to publish R projects with Shiny.

Some of the features include:

  • Ability to enter your own collection
  • Get recommendation on your collection
  • Amazon link to buy boardgame that is recommended
It's a work in progress. There is much to clean up and make more presentable. Please take a look and offer comments to help improve the website.

Le Monde puzzle [#1000…1025]

Tue, 03/28/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

Le Monde mathematical puzzle launched a competition to celebrate its 1000th puzzle! A fairly long-term competition, as it runs over the coming 25 puzzles (and hence weeks), starting with puzzle #1001. Here is the 1000th puzzle, not part of the competition:

Alice & Bob spend five (identical) vouchers in five different shops, each time buying the maximum number of items to get close to the voucher value. In these five shops, they buy sofas at 421 euros each, beds at 347 euros each, kitchen appliances at 289 euros each, tables at 251 euros each and bikes at 211 euros each, respectively. Once the buying frenzy is over, they realise that within a single shop, they would have spent exactly four vouchers for the same products. What is the value of a voucher?
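One way to attack it in R is a brute-force search; the sketch below assumes that, with a voucher of value V, they buy floor(V/price) items in each shop and that the grand total must equal exactly four vouchers:

prices <- c(421, 347, 289, 251, 211)

# total spent across the five shops for a voucher of value v
spent <- function(v) sum(prices * (v %/% prices))

# candidate voucher values for which the total equals exactly four vouchers
Filter(function(v) spent(v) == 4 * v, seq_len(5000))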

Filed under: Kids, R Tagged: Alice and Bob, arithmetics, competition, Le Monde, mathematical puzzle, R, Tangente

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

How to Reach More People with your Next Data Science Talk

Mon, 03/27/2017 - 18:00

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

Speaking publicly about your data science projects is one of the best things you can do for your portfolio.

Writing your presentation will help you refine what your project is, who it is for, and how it relates to other work in the area.

Giving your presentation at a live, in-person event will make a large impact on the audience.

That being said, public speaking has its drawbacks. For example:

  1. Time to prepare. While your talk might take just 30 minutes to deliver, you could easily spend weeks writing and rehearsing it.
  2. Cost. Some talks, such as those at local meetups, might be free to attend. But some will require significant travel. That costs money, even if your conference ticket is free.
  3. Limited upside. When you deliver a conference talk, your upside is limited in two ways:
    1. Reach. Only the people in the room at the time of your talk can hear your message.
    2. Interaction. Your interaction with the audience is normally limited to a few minutes of Q&A, and whatever conversations you have with people during the rest of the conference.

In my experience, this third pain point is the most significant. After all, if you had reached 10x the number of people with your last talk, then you probably would have been willing to spend more time writing it. And you probably would have been willing to travel further to give it.

Imagine if your talk wasn’t limited by the conference itself

For a moment, try to ignore the limits that the conferences place on your talk.

Imagine if members of the audience could ask you questions days, or even weeks, after you gave your talk. Wouldn’t that be a game changer?

And what about all the people who care about your topic but couldn’t make it to the conference? Wouldn’t it be great if they could hear your talk as well?

All of this is possible

I’ve recently worked out a system to solve these problems in my own public speaking events. My talks now reach more people than ever before, and lead to more interactions than ever before.

Bonus: A checklist for making the most out of your data science talks! Click Here

Before giving you the exact system I use, let me tell you the technology that you’ll need in order to implement it. Note that some of the links below are affiliate links. This means that, at no additional cost to you, I may get a commission if you wind up purchasing after clicking these links:

  1. A self-hosted WordPress website. I use BlueHost.
  2. An Email Service so you can follow up with people. I use Drip.
  3. A Webinar Provider. I use Zoom.
  4. A Landing Page Builder. I use Thrive.
Step 1: Let the Audience Download your Slides

Before the conference, register a URL such as <ConferenceName>Slides.com. The URL should be easy for the audience to remember. Set the URL to redirect to a separate page on your website.

This page should be a “landing page”. This means that the only Call-to-Action (CTA) on the page should be for the reader to enter their email address so that you can send them the slides.

I recommend that the last slide in your deck contain just this URL. When the slide appears, say “If you would like to download the slide deck I used for this talk, visit this URL and enter your email address. I will then email you the slides.”

Set up an automated email campaign just for this opt-in. The first email should fire immediately and send them the slides. The second email should fire a few days later, and ask people if they enjoyed the slides or have any questions.

This tip alone should multiply the chances you have to interact with your audience after the talk ends.

Step 2: Host a Post-Conference Webinar

The sad fact is this: most people who would enjoy your talk won’t be in the audience. No conference can get even a fraction of the people who are interested in a topic. The best way around this is to host a live webinar where you go through the exact same slides as your presentation.

The week after your presentation, publish a blog post where you announce that you’ll be giving the same talk, but this time as a live webinar. Drive traffic to the registration page by announcing it on LinkedIn, Facebook and Twitter.

The biggest impact your talk will have is the in-person presentation at the conference. The second biggest impact will be the live webinar.

Step 3: Use the Webinar Recording

After giving the webinar, you will now have a recording of the presentation. This is great because (again) most people who are interested in your talk will not have been at the conference, and will not have attended your live webinar.

You can host the recording on your own website or on YouTube. When you get a new subscriber, point them to the recording. You can also periodically drive traffic to the recording on social media.

Step 4: Post Your Slides Online

A week after your webinar I recommend posting your slides online. I also recommend creating a content upgrade so that readers can download the slides. A “content upgrade” simply means that readers enter their email address, and that you then email them the slides.

This step will probably have the highest volume (i.e. the most people will interact with your content in this form), but the least impact (i.e. people will not get as much out of interacting with your talk in this form).

Closing Notes

For a data scientist with an online portfolio, conference talks are probably the most valuable type of content you create. Without additional effort, though, that value will be limited to the people who are in the audience at the time of your presentation.

With just a little bit of work, though, you can overcome that limitation and have more followup with your audience and reach more people.

Bonus: A checklist for making the most out of your data science talks! Click Here

The post How to Reach More People with your Next Data Science Talk appeared first on AriLamstein.com.

To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.

Data science for Doctors: Inferential Statistics Exercises (part-3)

Mon, 03/27/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people who are in or around the medical field to enhance their data science skills.

We will work with a health-related dataset, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the sixth part of the series, and it aims to partially cover the subject of inferential statistics.
Researchers rarely have the capability of testing many patients, or of trying a new treatment on many patients; therefore, making inferences from a sample is a necessary skill to have. This is where inferential statistics comes into play.
In more detail, in this part we will go through hypothesis testing with Student's t-distribution (Student's t-test), which may be the most used test you will need to apply, since in most cases the standard deviation σ of the population is not known. We will cover the one-sample t-test and the two-sample t-test (both with equal and unequal variances). If you are not familiar with the distributions mentioned, please go here to acquire the necessary background.

Before proceeding, it might be helpful to look over the help pages for t.test().
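As a quick, purely illustrative warm-up (not one of the exercise solutions), t.test() can be applied to simulated data like so:

# one-sample t-test of simulated values against a hypothesised mean of 30
set.seed(1)
x <- rnorm(25, mean = 29, sd = 4)
t.test(x, mu = 30)

# Welch two-sample t-test (unequal variances) against a second simulated group
y <- rnorm(27, mean = 31, sd = 5)
t.test(x, y, var.equal = FALSE)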

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass ==0),]

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Suppose that we take a sample of 25 candidates who tried a diet and who had an average weight of 29 after the experiment (generate 25 normally distributed samples with mean 29 and standard deviation 4).
Find the t-value.

Exercise 2

Find the p-value.

Exercise 3

Find the 95% confidence interval.

Exercise 4

Apply a t-test with the null hypothesis that the true mean of the sample is equal to the mean of the population, at the 5% significance level.

Exercise 5

Apply a t-test with the null hypothesis that the true mean of the sample is equal to the mean of the population, and the alternative that the true mean is less than the mean of the population, at the 5% significance level.

Exercise 6

Suppose that we want to compare the current diet with another one. We assume that we test a different diet on a sample of 27 with an average mass of 31 (generate normally distributed samples with mean 31 and standard deviation 5). Test whether the two diets are significantly different.
Note that the two distributions have different variances.
Hint: this is a two-sample hypothesis test with unequal variances.

Exercise 7

Test whether the first diet is more efficient than the second.

Exercise 8

Assume that the second diet has the same variance as the first one. Is it significantly different?

Exercise 9

Assume that the second diet has the same variance as the first one. Is it significantly better?

Exercise 10

Suppose that you take a sample of 27 with an average mass of 29, and after the diet the average mass is 28 (generate the samples with rnorm(27, average, 4)). Are they significantly different?
Hint: paired-sample t-test.

Related exercise sets:
  1. Data science for Doctors: Inferential Statistics Exercises (part-2)
  2. Data Science for Doctors – Part 4 : Inferential Statistics (1/5)
  3. Data Science for Doctors – Part 2 : Descriptive Statistics
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

An Introduction to Stock Market Data Analysis with R (Part 1)

Mon, 03/27/2017 - 17:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

Around September of 2016 I wrote two articles on using Python for accessing, visualizing, and evaluating trading strategies (see part 1 and part 2). These have been my most popular posts, up until I published my article on learning programming languages (featuring my dad’s story as a programmer), and they have been translated into both Russian (the translation used to be on backtest.ru, at a link that now appears to no longer work) and Chinese (here and here). R has excellent packages for analyzing stock data, so I feel there should be a “translation” of the post for using R for stock data analysis.

This post is the first in a two-part series on stock data analysis using R, based on a lecture I gave on the subject for MATH 3900 (Data Science) at the University of Utah. In these posts, I will discuss basics such as obtaining the data from Yahoo! Finance using quantmod, visualizing stock data, moving averages, developing a moving-average crossover strategy, backtesting, and benchmarking. The final post will include practice problems. This first post discusses topics up to introducing moving averages.

NOTE: The information in this post is of a general nature containing information and opinions from the author’s perspective. None of the content of this post should be considered financial advice. Furthermore, any code written here is provided without any form of guarantee. Individuals who choose to use it do so at their own risk.

Introduction

Advanced mathematics and statistics have been present in finance for some time. Prior to the 1980s, banking and finance were well known for being “boring”; investment banking was distinct from commercial banking, and the primary role of the industry was handling “simple” (at least in comparison to today) financial instruments, such as loans. Deregulation under the Reagan administration, coupled with an influx of mathematical talent, transformed the industry from the “boring” business of banking to what it is today, and since then, finance has joined the other sciences as a motivation for mathematical research and advancement. For example, one of the biggest recent achievements of mathematics was the derivation of the Black-Scholes formula, which facilitated the pricing of stock options (a contract giving the holder the right to purchase or sell a stock at a particular price to the issuer of the option). That said, bad statistical models, including the Black-Scholes formula, hold part of the blame for the 2008 financial crisis.

In recent years, computer science has joined advanced mathematics in revolutionizing finance and trading, the practice of buying and selling financial assets for the purpose of making a profit. Trading has become dominated by computers: algorithms are responsible for making rapid, split-second trading decisions faster than humans could make them (so rapidly that the speed at which light travels is a limitation when designing systems). Additionally, machine learning and data mining techniques are growing in popularity in the financial sector, and likely will continue to do so. In fact, a large part of algorithmic trading is high-frequency trading (HFT). While algorithms may outperform humans, the technology is still new and playing an increasing role in a famously turbulent, high-stakes arena. HFT was responsible for phenomena such as the 2010 flash crash and a 2013 flash crash prompted by a hacked Associated Press tweet about an attack on the White House.

My articles, however, will not be about how to crash the stock market with bad mathematical models or trading algorithms. Instead, I intend to provide you with basic tools for handling and analyzing stock market data with R. We will be using stock data as a first exposure to time series data, which is data considered dependent on the time it was observed (other examples of time series include temperature data, demand for energy on a power grid, Internet server load, and many, many others). I will also discuss moving averages, how to construct trading strategies using moving averages, how to formulate exit strategies upon entering a position, and how to evaluate a strategy with backtesting.

DISCLAIMER: THIS IS NOT FINANCIAL ADVICE!!! Furthermore, I have ZERO experience as a trader (a lot of this knowledge comes from a one-semester course on stock trading I took at Salt Lake Community College)! This is purely introductory knowledge, not enough to make a living trading stocks. People can and do lose money trading stocks, and you do so at your own risk!

Getting and Visualizing Stock Data

Getting Data from Yahoo! Finance with quantmod

Before we analyze stock data, we need to get it into some workable format. Stock data can be obtained from Yahoo! Finance, Google Finance, or a number of other sources, and the quantmod package provides easy access to Yahoo! Finance and Google Finance data, along with other sources. In fact, quantmod provides a number of useful features for financial modelling, and we will be seeing those features throughout these articles. In this lecture, we will get our data from Yahoo! Finance.

# Get quantmod
if (!require("quantmod")) {
    install.packages("quantmod")
    library(quantmod)
}

start <- as.Date("2016-01-01")
end <- as.Date("2016-10-01")

# Let's get Apple stock data; Apple's ticker symbol is AAPL. We use the
# quantmod function getSymbols, and pass a string as a first argument to
# identify the desired ticker symbol, pass 'yahoo' to src for Yahoo!
# Finance, and from and to specify date ranges

# The default behavior for getSymbols is to load data directly into the
# global environment, with the object being named after the loaded ticker
# symbol. This feature may become deprecated in the future, but we exploit
# it now.
getSymbols("AAPL", src = "yahoo", from = start, to = end)

## As of 0.4-0, 'getSymbols' uses env=parent.frame() and
## auto.assign=TRUE by default.
##
## This behavior will be phased out in 0.5-0 when the call will
## default to use auto.assign=FALSE. getOption("getSymbols.env") and
## getOptions("getSymbols.auto.assign") are now checked for alternate defaults
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for more details.

## [1] "AAPL"

# What is AAPL?
class(AAPL)

## [1] "xts" "zoo"

# Let's see the first few rows
head(AAPL)

##            AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume
## 2016-01-04    102.61    105.37   102.00     105.35    67649400
## 2016-01-05    105.75    105.85   102.41     102.71    55791000
## 2016-01-06    100.56    102.37    99.87     100.70    68457400
## 2016-01-07     98.68    100.13    96.43      96.45    81094400
## 2016-01-08     98.55     99.11    96.76      96.96    70798000
## 2016-01-11     98.97     99.06    97.34      98.53    49739400
##            AAPL.Adjusted
## 2016-01-04     102.61218
## 2016-01-05     100.04079
## 2016-01-06      98.08303
## 2016-01-07      93.94347
## 2016-01-08      94.44022
## 2016-01-11      95.96942

Let’s briefly discuss this. getSymbols() created in the global environment an object called AAPL (named automatically after the ticker symbol of the security retrieved) that is of the xts class (which is also a zoo-class object). xts objects (provided in the xts package) are seen as improved versions of the ts object for storing time series data. They allow for time-based indexing and provide custom attributes, along with allowing multiple (presumably related) time series with the same time index to be stored in the same object. (Here is a vignette describing xts objects.) The different series are the columns of the object, with the name of the associated security (here, AAPL) being prefixed to the corresponding series.

Yahoo! Finance provides six series with each security. Open is the price of the stock at the beginning of the trading day (it need not be the closing price of the previous trading day), high is the highest price of the stock on that trading day, low the lowest price of the stock on that trading day, and close the price of the stock at closing time. Volume indicates how many stocks were traded. Adjusted close (abbreviated as “adjusted” by getSymbols()) is the closing price of the stock that adjusts the price of the stock for corporate actions. While stock prices are considered to be set mostly by traders, stock splits (when the company makes each extant stock worth two and halves the price) and dividends (payout of company profits per share) also affect the price of a stock and should be accounted for.
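As an aside (not from the original lecture), quantmod also ships accessor helpers for these series, which can be more convenient than indexing the columns by name:

# extract individual series from the AAPL object
head(Op(AAPL))   # open
head(Cl(AAPL))   # close
head(Ad(AAPL))   # adjusted close
head(Vo(AAPL))   # volume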

Visualizing Stock Data

Now that we have stock data we would like to visualize it. I first use base R plotting to visualize the series.

plot(AAPL[, "AAPL.Close"], main = "AAPL")

A line chart is fine, but there are at least four variables involved for each date (open, high, low, and close), and we would like to have some visual way to see all four variables that does not require plotting four separate lines. Financial data is often plotted with a Japanese candlestick plot, so named because it was first created by 18th-century Japanese rice traders. Use the function candleChart() from quantmod to create such a chart.

candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white")

With a candlestick chart, a black candlestick indicates a day where the closing price was higher than the open (a gain), while a red candlestick indicates a day where the open was higher than the close (a loss). The wicks indicate the high and the low, and the body the open and close (hue is used to determine which end of the body is the open and which the close). Candlestick charts are popular in finance and some strategies in technical analysis use them to make trading decisions, depending on the shape, color, and position of the candles. I will not cover such strategies today.

(Notice that the volume is tracked as a bar chart on the lower pane as well, with the same colors as the corresponding candlesticks. Some traders like to see how many shares are being traded; this can be important in trading.)

We may wish to plot multiple financial instruments together; we may want to compare stocks, compare them to the market, or look at other securities such as exchange-traded funds (ETFs). Later, we will also want to see how to plot a financial instrument against some indicator, like a moving average. For this you would rather use a line chart than a candlestick chart. (How would you plot multiple candlestick charts on top of one another without cluttering the chart?)

Below, I get stock data for some other tech companies and plot their adjusted close together.

# Let's get data for Microsoft (MSFT) and Google (GOOG) (actually, Google is
# held by a holding company called Alphabet, Inc., which is the company
# traded on the exchange and uses the ticker symbol GOOG).
getSymbols(c("MSFT", "GOOG"), src = "yahoo", from = start, to = end)

## [1] "MSFT" "GOOG"

# Create an xts object (xts is loaded with quantmod) that contains closing
# prices for AAPL, MSFT, and GOOG
stocks <- as.xts(data.frame(AAPL = AAPL[, "AAPL.Close"],
                            MSFT = MSFT[, "MSFT.Close"],
                            GOOG = GOOG[, "GOOG.Close"]))
head(stocks)

##            AAPL.Close MSFT.Close GOOG.Close
## 2016-01-04     105.35      54.80     741.84
## 2016-01-05     102.71      55.05     742.58
## 2016-01-06     100.70      54.05     743.62
## 2016-01-07      96.45      52.17     726.39
## 2016-01-08      96.96      52.33     714.47
## 2016-01-11      98.53      52.30     716.03

# Create a plot showing all series as lines; must use as.zoo to use the zoo
# method for plot, which allows for multiple series to be plotted on same
# plot
plot(as.zoo(stocks), screens = 1, lty = 1:3, xlab = "Date", ylab = "Price")
legend("right", c("AAPL", "MSFT", "GOOG"), lty = 1:3, cex = 0.5)

What’s wrong with this chart? While absolute price is important (pricey stocks are difficult to purchase, which affects not only their volatility but your ability to trade that stock), when trading, we are more concerned about the relative change of an asset rather than its absolute price. Google’s stocks are much more expensive than Apple’s or Microsoft’s, and this difference makes Apple’s and Microsoft’s stocks appear much less volatile than they truly are (that is, their price appears to not deviate much).

One solution would be to use two different scales when plotting the data; one scale will be used by Apple and Microsoft stocks, and the other by Google.

plot(as.zoo(stocks[, c("AAPL.Close", "MSFT.Close")]), screens = 1, lty = 1:2,
     xlab = "Date", ylab = "Price")
par(new = TRUE)
plot(as.zoo(stocks[, "GOOG.Close"]), screens = 1, lty = 3, xaxt = "n",
     yaxt = "n", xlab = "", ylab = "")
axis(4)
mtext("Price", side = 4, line = 3)
legend("topleft", c("AAPL (left)", "MSFT (left)", "GOOG"), lty = 1:3, cex = 0.5)

Not only is this solution difficult to implement well, it is seen as a bad visualization method; it can lead to confusion and misinterpretation, and cannot be read easily.

A “better” solution, though, would be to plot the information we actually want: the stock’s returns. This involves transforming the data into something more useful for our purposes. There are multiple transformations we could apply.

One transformation would be to consider the stock’s return since the beginning of the period of interest. In other words, we plot:

$$\text{return}_{t,0} = \frac{\text{price}_t}{\text{price}_0}$$

This will require transforming the data in the stocks object, which I do next.

# Get me my beloved pipe operator!
if (!require("magrittr")) {
    install.packages("magrittr")
    library(magrittr)
}

## Loading required package: magrittr

stock_return <- apply(stocks, 1, function(x) {x / stocks[1, ]}) %>%
    t %>% as.xts

head(stock_return)

##            AAPL.Close MSFT.Close GOOG.Close
## 2016-01-04  1.0000000  1.0000000  1.0000000
## 2016-01-05  0.9749407  1.0045620  1.0009975
## 2016-01-06  0.9558614  0.9863139  1.0023994
## 2016-01-07  0.9155197  0.9520073  0.9791734
## 2016-01-08  0.9203607  0.9549271  0.9631052
## 2016-01-11  0.9352634  0.9543796  0.9652081

plot(as.zoo(stock_return), screens = 1, lty = 1:3, xlab = "Date", ylab = "Return")
legend("topleft", c("AAPL", "MSFT", "GOOG"), lty = 1:3, cex = 0.5)

This is a much more useful plot. We can now see how profitable each stock was since the beginning of the period. Furthermore, we see that these stocks are highly correlated; they generally move in the same direction, a fact that was difficult to see in the other charts.

Alternatively, we could plot the change of each stock per day. One way to do so would be to plot the percentage increase of a stock when comparing day $t$ to day $t + 1$, with the formula:

$$\text{growth}_t = \frac{\text{price}_{t+1} - \text{price}_t}{\text{price}_t}$$

But change could be thought of differently as:

$$\text{increase}_t = \frac{\text{price}_t - \text{price}_{t-1}}{\text{price}_t}$$

These formulas are not the same and can lead to differing conclusions, but there is another way to model the growth of a stock: with log differences.

$$\text{change}_t = \log(\text{price}_t) - \log(\text{price}_{t-1})$$

(Here, $\log$ is the natural log, and our definition does not depend as strongly on whether we use $\log(\text{price}_t) - \log(\text{price}_{t-1})$ or $\log(\text{price}_{t+1}) - \log(\text{price}_t)$.) The advantage of using log differences is that this difference can be interpreted as the percentage change in a stock but does not depend on the denominator of a fraction.

We can obtain and plot the log differences of the data in stocks as follows:

stock_change <- stocks %>% log %>% diff
head(stock_change)

##              AAPL.Close    MSFT.Close   GOOG.Close
## 2016-01-04           NA            NA           NA
## 2016-01-05 -0.025378648  0.0045516693  0.000997009
## 2016-01-06 -0.019763704 -0.0183323194  0.001399513
## 2016-01-07 -0.043121062 -0.0354019469 -0.023443064
## 2016-01-08  0.005273804  0.0030622799 -0.016546113
## 2016-01-11  0.016062548 -0.0005735067  0.002181138

plot(as.zoo(stock_change), screens = 1, lty = 1:3, xlab = "Date",
     ylab = "Log Difference")
legend("topleft", c("AAPL", "MSFT", "GOOG"), lty = 1:3, cex = 0.5)

Which transformation do you prefer? Looking at returns since the beginning of the period makes the overall trend of the securities in question much more apparent. Changes between days, though, are what more advanced methods actually consider when modelling the behavior of a stock, so they should not be ignored.

Moving Averages

Charts are very useful. In fact, some traders base their strategies almost entirely off charts (these are the “technicians”, since trading strategies based off finding patterns in charts is a part of the trading doctrine known as technical analysis). Let’s now consider how we can find trends in stocks.

A $q$-day moving average is, for a series $x_t$ and a point in time $t$, the average of the past $q$ days: that is, if $MA^q_t$ denotes a moving average process, then:

$$MA^q_t = \frac{1}{q} \sum_{i=0}^{q-1} x_{t-i}$$

Moving averages smooth a series and help identify trends. The larger $q$ is, the less responsive a moving average process is to short-term fluctuations in the series $x_t$. The idea is that moving average processes help identify trends from “noise”. Fast moving averages have smaller $q$ and more closely follow the stock, while slow moving averages have larger $q$, resulting in them responding less to the fluctuations of the stock and being more stable.

quantmod allows for easily adding moving averages to charts, via the addSMA() function.

candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white")
addSMA(n = 20)
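If you want the moving-average values themselves rather than just the chart overlay, a quick sketch (not part of the original post) using SMA() from the TTR package, which is loaded alongside quantmod, looks like this:

# 20-day simple moving average of the closing price
sma20 <- SMA(Cl(AAPL), n = 20)
head(sma20, n = 25)   # the first 19 entries are NA; see the note below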

Notice how late the rolling average begins. It cannot be computed until 20 days have passed. This limitation becomes more severe for longer moving averages. Because I would like to be able to compute 200-day moving averages, I’m going to extend out how much AAPL data we have. That said, we will still largely focus on 2016.

start = as.Date("2010-01-01")
getSymbols(c("AAPL", "MSFT", "GOOG"), src = "yahoo", from = start, to = end)

## [1] "AAPL" "MSFT" "GOOG"

# The subset argument allows specifying the date range to view in the chart.
# This uses xts style subsetting. Here, I'm using the idiom
# 'YYYY-MM-DD/YYYY-MM-DD', where the date on the left-hand side of the / is
# the start date, and the date on the right-hand side is the end date. If
# either is left blank, either the earliest date or latest date in the
# series is used (as appropriate). This method can be used for any xts
# object, say, AAPL
candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white",
            subset = "2016-01-04/")
addSMA(n = 20)

You will notice that a moving average is much smoother than the actual stock data. Additionally, it’s a stubborn indicator; a stock needs to be above or below the moving average line in order for the line to change direction. Thus, crossing a moving average signals a possible change in trend, and should draw attention.

Traders are usually interested in multiple moving averages, such as the 20-day, 50-day, and 200-day moving averages. It’s easy to examine multiple moving averages at once.

candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white",
            subset = "2016-01-04/")
addSMA(n = c(20, 50, 200))

The 20-day moving average is the most sensitive to local changes, and the 200-day moving average the least. Here, the 200-day moving average indicates an overall bearish trend: the stock is trending downward over time. The 20-day moving average is at times bearish and at other times bullish, where a positive swing is expected. You can also see that the crossing of moving average lines indicate changes in trend. These crossings are what we can use as trading signals, or indications that a financial security is changing direction and a profitable trade might be made.

Visit next week to read about how to design and test a trading strategy using moving averages.

# Package/system information
sessionInfo()

## R version 3.3.3 (2017-03-06)
## Platform: i686-pc-linux-gnu (32-bit)
## Running under: Ubuntu 15.10
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base
##
## other attached packages:
## [1] magrittr_1.5     quantmod_0.4-7   TTR_0.23-1       xts_0.9-7
## [5] zoo_1.7-14       RWordPress_0.2-3 optparse_1.3.2   knitr_1.15.1
##
## loaded via a namespace (and not attached):
##  [1] lattice_0.20-34 XML_3.98-1.5    bitops_1.0-6    grid_3.3.3
##  [5] formatR_1.4     evaluate_0.10   highr_0.6       stringi_1.1.3
##  [9] getopt_1.20.0   tools_3.3.3     stringr_1.2.0   RCurl_1.95-4.8
## [13] XMLRPC_0.3-0

To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website.

Analyzing Accupedo step count data in R: Part 2 – Adding weather data

Mon, 03/27/2017 - 16:30

(This article was first published on Method Matters, and kindly contributed to R-bloggers)

In my last set of posts, I wrote about analyzing data from the Accupedo step counter app I have on my phone. In this post, I’ll talk about some additional analysis I’ve done by merging the step counter data with weather data from another source.

The website www.wunderground.com has freely available weather data for most parts of the world. According to their website, their data are recorded by a crowd-sourced team of weather enthusiasts who report the weather conditions from personal stations at their homes. Pretty cool project, and it was really easy to search for my city and download the data in .csv format. In this post, we will focus on the Mean Temperature (for the day) value, recorded in degrees Celsius.

Data Preparation

The step count data here are aggregated to the daily level, as I didn’t have more granular weather information. In all honesty, it probably makes more sense to analyze the data at the day level (as I’ve done); it would likely be hard to see hourly evolution in the weather matching hourly evolution in my step counts. My movements are pretty constrained during certain hours of the day (e.g. when I’m at work), and so examining that granular of a relationship is unlikely to be very illuminating.

I was able to match all of the dates that I had step counts for with the daily weather information from www.wunderground.com, and merged the weather data into the main step count dataset described in the previous post. I then created the analysis dataset, a day-level version of the walking data aggregated across the two variables I had created previously – day and day of week (essentially this aggregates to the day-level, while keeping a column describing the day of the week – e.g. Monday, Tuesday, etc.).
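For completeness, a rough sketch of that merge step is given below; the object and column names are assumptions for illustration only (the actual wunderground export and step data may be named differently):

# hypothetical sketch: join daily weather onto the step data by calendar date
weather <- read.csv("daily_weather.csv")        # wunderground export (assumed name)
weather$date <- as.Date(weather$Date)           # assumed date column
step_data$date <- as.Date(step_data$date_time)  # step data from the previous post (assumed name)
walking_weather <- merge(step_data, weather, by = "date", all.x = TRUE)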

# load the required packages
library(plyr); library(dplyr)
library(lubridate)
library(ggplot2)

# make the variables to aggregate across
# the master data set with steps and temperature is called walking_weather
walking_weather$oneday <- cut(walking_weather$date_time, breaks = "1 day")
walking_weather$dow <- wday(walking_weather$date_time, label = TRUE)

# aggregate by day, day of week
# max of step count; mean of temperature
ww_per_day <- walking_weather %>% group_by(oneday, dow) %>%
summarise(steps = max(steps), temp = mean(Mean_TemperatureC))

As before, when aggregating I kept the maximum step count per day (as these counts are cumulative), and computed the average temperature in degrees Celsius per day (because there are multiple observations per day in the non-aggregated data, taking the average across days simply returns the daily average temperature from the daily weather data).

The resulting dataset looks like this:

Data Visualization and Analysis

As a first step, let’s make histograms of the temperature and step count variables:

# histograms of temperature and step count at a daily level
# using the base plotting system
hist(ww_per_day$temp, col ='darkblue',
xlab = 'Daily Average Temperature (Celsius)',
main = 'Histogram of Temperature')

hist(ww_per_day$steps, col ='darkgreen',
xlab = 'Daily Total Step Count',
main = 'Histogram of Step Count')

The temperatures have ranged from -4 degrees to 27 degrees Celsius, with a mean of 11.66 (SD = 5.83). 

 

The daily step counts have ranged from 6 to 65,337, with a mean of 18,294.04 (SD = 7,808.36).

In order to visualize the relationship between my daily step count and the daily average temperature, I made a simple scatterplot, with a linear model line superimposed on the graph (95% confidence intervals shown in light grey):

# plot of steps by temperature
p = ggplot(data = ww_per_day, aes(x = temp, y = steps)) +
geom_point() +
coord_cartesian(ylim = c(0, 25000)) +
geom_smooth(method="lm") +
theme(legend.title=element_blank()) +
labs(x = "Temperature (Celsius)", y = "Steps" )
p

There is a slight positive relationship between temperature and step count, such that on warmer days I tend to walk more. A simple linear regression will estimate the size of this relationship:

summary(lm(steps~temp , data = ww_per_day ))

The estimates given:

             Estimate Std. Error t value Pr(>|t|)
(Intercept)  17277.29     681.76  25.342  < 2e-16 ***
temp            87.21      52.31   1.667    0.096 .

---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7798 on 653 degrees of freedom
Multiple R-squared: 0.004238, Adjusted R-squared: 0.002713
F-statistic: 2.779 on 1 and 653 DF, p-value: 0.09597

The model estimates that I walk an additional 87.21 steps for each one-degree increase in the daily average temperature. However, the p-value associated with this estimate (0.096) indicates only a non-significant trend.

In a previous post we saw that my walking patterns tend to be different during the weekdays and the weekend. I was therefore curious to examine the relationship between temperature and step count across these two different types of days.

I generated the week/weekend variable (as in the previous post) and the plot with the following code:

# generate the weekday vs. weekend variable
ww_per_day$week_weekend[ww_per_day$dow == 'Sun'] <- 'Weekend'
ww_per_day$week_weekend[ww_per_day$dow == 'Sat'] <- 'Weekend'
ww_per_day$week_weekend[ww_per_day$dow == 'Mon'] <- 'Weekday'
ww_per_day$week_weekend[ww_per_day$dow == 'Tues'] <- 'Weekday'
ww_per_day$week_weekend[ww_per_day$dow == 'Wed'] <- 'Weekday'
ww_per_day$week_weekend[ww_per_day$dow == 'Thurs'] <- 'Weekday'
ww_per_day$week_weekend[ww_per_day$dow == 'Fri'] <- 'Weekday'

# plot the relationship between temperature and step count
# with separate colors and lines for weekday vs. weekend
p = ggplot(data = ww_per_day, aes(x = temp, y = steps, color=week_weekend)) +
geom_point() +
coord_cartesian(ylim = c(0, 25000)) +
geom_smooth(method="lm") +
theme(legend.title=element_blank()) +
labs(x = "Temperature (Celsius)", y = "Steps" )
p

The relationship between temperature and step count does indeed appear to be different on weekdays vs. the weekend:

The general pattern is that there’s not much of a relationship between temperature and step count during the week (after all, I have to walk to work whether it’s warm or cold), but on the weekends I walk quite a bit more when it’s warmer out.

Notice how the confidence intervals get wider at the extremes of the x-axis. These are regions where there are simply less data (by definition, extreme temperatures are rare). This pattern is especially noticeable for the weekend days (there are far fewer weekend days than there are week days, and so we have even less data here at the margins).

Although I’m in general not particularly interested in null-hypothesis significance testing of regression coefficients in this type of analysis, I went through the formality of testing the time period (weekday vs. weekend) by temperature interaction.

summary(lm(steps~temp + as.factor(week_weekend)+ temp:as.factor(week_weekend),
data = ww_per_day ))

                                      Estimate Std. Error t value Pr(>|t|)
(Intercept)                           19178.76     774.45   24.76  < 2e-16 ***
temp                                     23.46      59.52    0.39     0.69
as.factor(week_weekend)Weekend        -6982.66    1486.25   -4.70     0.00 ***
temp:as.factor(week_weekend)Weekend     246.76     113.76    2.17     0.03 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7560 on 651 degrees of freedom
Multiple R-squared: 0.06682, Adjusted R-squared: 0.06252
F-statistic: 15.54 on 3 and 651 DF, p-value: 9.005e-10

The results show that the interaction term is statistically significant; we already understood the shape and meaning of the interaction with the above plot.

More interesting for our purposes is the model summary. The adjusted multiple R squared is about 6%, much greater than the .3% from the model above (where we only include temperature as a predictor), but not very high overall. The take-home message is that time period (weekday vs. weekend) and temperature have clear relationships with step count, but together and in combination they don’t explain a huge amount of variance in the amount of steps I walk each day.

I wondered about the importance of the interaction term, actually. Interactions are widely sought after in certain academic contexts, but I enjoy the simplicity and ease-of-interpretation of main effects models for applied problems which require interpretable solutions (e.g. “white box” models). Bluntly put: how important is the interaction term in this context? One might argue that, although the p-value for the interaction term passes the threshold for statistical significance, knowing that the interaction term is (statistically) significantly different from zero doesn’t bring us much further in our understanding of the problem at hand, after accounting for the main effects of time period and temperature.

Model Comparison

In order to more concretely investigate the importance of the time period by temperature interaction term, I computed the model predicting step count with just the main effects of time period and temperature (lm_1 below), and also the model with the main effects and their interaction term (lm_2 below). I then performed a model comparison to see if adding the interaction term results in a statistically significant reduction in the residual sum of squares. In other words, is the difference between the predictive performance of the first and second model greater than zero?

# main effects model
lm_1 <- lm(steps~temp + as.factor(week_weekend), data = ww_per_day )
summary(lm_1)

# main effects + interaction model
lm_2 <- lm(steps~temp + as.factor(week_weekend)+ temp:as.factor(week_weekend),
data = ww_per_day )
summary(lm_2)

# model comparison
anova(lm_1,lm_2)

Which gives the following result:
Model Res.Df      RSS Df Sum of Sq    F Pr(>F)
    1    652 3.75e+10
    2    651 3.72e+10  1 268916758 4.70   0.03 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This comparison suggests that adding the interaction term results in a statistically significant decrease in the residual sum of squares. In other words, we can better predict my step count if we include the interaction term.

Conclusion

To sum up, in this post we took the daily total step counts from my Accupedo step counter app and added data about the average temperature on each day that steps were recorded. We used scatterplots to explore and understand the relationship between these two variables, both overall and according to the time period (e.g. weekday vs. weekend).

We also used linear regression to investigate the relationship between daily temperature and daily step count. When predicting step count from temperature, we saw that there was a non-statistically significant trend such that I walk more on warmer days. When adding time period as a categorical predictor in an interactive regression model, we saw a statistically significant interaction between temperature and time period, such that the relationship between temperature and step count was stronger on weekends vs. weekdays. In other words, I walk more when it’s warmer out, but only during the weekend.

Caveat: Null Hypothesis Significance Testing (NHST) in Exploratory Data Analysis

We used NHST and p-values to interpret regression coefficients and to compare the predictive performance of the main effects vs. interaction regression models. I’m somewhat ambivalent about the use of p-values in this context, honestly. I’ve been very influenced by Andrew Gelman’s work, particularly his wonderful blog, and am very conscious of the limitations of p-values in exploratory contexts like this one. If I had to choose the most interesting test of statistical significance in this post, I would choose the model comparison: it puts an explicit focus on the relative performance of competing (nested) models, rather than simply using significance testing to determine whether individual regression coefficients are significantly different from zero.

The statistical approach we’ve seen in this post conforms very much to the way I was trained in the social sciences. In this tradition, a strong focus is put on NHST (via p-values), and less frequently on interpreting model R square and performing formal model comparisons. In the work I do now as an applied data analyst (or “data scientist”), I often take an approach that borrows heavily from machine learning (though these ideas are also present in traditional statistical thinking): evaluating models via predictive accuracy on a test dataset that was not used in data modeling.
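To make that last idea concrete, here is a minimal sketch (not part of the original analysis) of evaluating the interaction model by its out-of-sample error on a held-out portion of the day-level data:

# hold out 20% of days, fit on the rest, and check prediction error
set.seed(123)
train_idx <- sample(nrow(ww_per_day), size = floor(0.8 * nrow(ww_per_day)))
train <- ww_per_day[train_idx, ]
test <- ww_per_day[-train_idx, ]

fit <- lm(steps ~ temp * as.factor(week_weekend), data = train)  # main effects + interaction
preds <- predict(fit, newdata = test)
sqrt(mean((test$steps - preds)^2))   # out-of-sample RMSE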

Coming Up Next

In the next set of posts, we will analyze a different dataset using an approach that leans more heavily towards this machine-learning perspective. We’ll also change analytical software, using Python/Pandas and the scikit-learn library to analyze the data. Stay tuned!

To leave a comment for the author, please follow the link and comment on their blog: Method Matters.

#0: Introducing R^4

Mon, 03/27/2017 - 15:10

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

So I had been toying with the idea of getting back to the blog and more regularly writing / posting little tips and tricks. I even started taking some notes, but because perfect is always the enemy of the good, it never quite materialized.

But the relatively broad discussion spawned by last week’s short rant on Suggests != Depends made a few things clear. There appears to be an audience. It doesn’t have to be long. And it doesn’t have to be too polished.

So with that, let’s get the blogging back from micro-blogging.

This note forms post zero of what will be a new segment I call R4, which is shorthand for relatively random R rambling.

Stay tuned.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Introducing brotools

Mon, 03/27/2017 - 09:23

I’m happy to announce my first R package, called brotools. This is a package that contains functions specific to my needs, but that you might also find useful. I blogged about some of these functions, so if you follow my blog you might already be familiar with some of them. It is not on CRAN and might very well never be. The code is hosted on bitbucket and you can install the package with

devtools::install_bitbucket("b-rodrigues/brotools")

Hope you’ll find brotools useful!

Fitting Bayesian Linear Mixed Models for continuous and binary data using Stan: A quick tutorial

Mon, 03/27/2017 - 08:48

(This article was first published on Shravan Vasishth's Slog (Statistics blog), and kindly contributed to R-bloggers)

I want to give a quick tutorial on fitting Linear Mixed Models (hierarchical models) with a full variance-covariance matrix for random effects (what Barr et al 2013 call a maximal model) using Stan.

For a longer version of this tutorial, see: Sorensen, Hohenstein, Vasishth, 2016.

Prerequisites: You need to have R installed, and preferably RStudio as well (RStudio is optional). You need to have rstan installed. See here. I am also assuming you have fit lmer models like these before:
lmer(log(rt) ~ 1+RCType+dist+int+(1+RCType+dist+int|subj) + (1+RCType+dist+int|item), dat)

If you don’t know what the above code means, first read chapter 4 of my lecture notes.
The code and data format needed to fit LMMs in Stan

The data

I assume you have a 2×2 repeated measures design with some continuous measure like reading time (rt) data and want to do a main effects and interaction contrast coding. Let’s say your main effects are RCType and dist, and the interaction is coded as int. All these contrast codings are $\pm 1$. If you don’t know what contrast coding is, see these notes and read section 4.3 (although it’s best to read the whole chapter). I am using an excerpt of an example data-set from Husain et al. 2014.
"subj" "item" "rt""RCType" "dist" "int"
1 14 438 -1 -1 1
1 16 531 1 -1 -1
1 15 422 1 1 1
1 18 1000 -1 -1 1
...

Assume that these data are stored in R as a data-frame with name rDat.
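If your data come with factor columns rather than ready-made ±1 columns, a hypothetical sketch of building such codes (level names invented for illustration) is:

# invented level names; replace with the levels in your own data
rDat$RCType <- ifelse(rDat$rc_type == "object", 1, -1)
rDat$dist <- ifelse(rDat$distance == "far", 1, -1)
rDat$int <- rDat$RCType * rDat$dist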
The Stan code

Copy the following Stan code into a text file and save it as the file matrixModel.stan. For continuous data like reading times or EEG, you never need to touch this file again. You will only ever specify the design matrix X and the structure of the data. The rest is all taken care of.
data {
int N; //no trials
int P; //no fixefs
int J; //no subjects
int n_u; //no subj ranefs
int K; //no items
int n_w; //no item ranefs
int subj[N]; //subject indicator
int item[N]; //item indicator
row_vector[P] X[N]; //fixef design matrix
row_vector[n_u] Z_u[N]; //subj ranef design matrix
row_vector[n_w] Z_w[N]; //item ranef design matrix
vector[N] rt; //reading time
}

parameters {
vector[P] beta; //fixef coefs
cholesky_factor_corr[n_u] L_u; //cholesky factor of subj ranef corr matrix
cholesky_factor_corr[n_w] L_w; //cholesky factor of item ranef corr matrix
vector<lower=0>[n_u] sigma_u; //subj ranef std
vector<lower=0>[n_w] sigma_w; //item ranef std
real<lower=0> sigma_e; //residual std
vector[n_u] z_u[J]; //spherical subj ranef
vector[n_w] z_w[K]; //spherical item ranef
}

transformed parameters {
vector[n_u] u[J]; //subj ranefs
vector[n_w] w[K]; //item ranefs
{
matrix[n_u,n_u] Sigma_u; //subj ranef cov matrix
matrix[n_w,n_w] Sigma_w; //item ranef cov matrix
Sigma_u = diag_pre_multiply(sigma_u,L_u);
Sigma_w = diag_pre_multiply(sigma_w,L_w);
for(j in 1:J)
u[j] = Sigma_u * z_u[j];
for(k in 1:K)
w[k] = Sigma_w * z_w[k];
}
}

model {
//priors
L_u ~ lkj_corr_cholesky(2.0);
L_w ~ lkj_corr_cholesky(2.0);
for (j in 1:J)
z_u[j] ~ normal(0,1);
for (k in 1:K)
z_w[k] ~ normal(0,1);
//likelihood
for (i in 1:N)
rt[i] ~ lognormal(X[i] * beta + Z_u[i] * u[subj[i]] + Z_w[i] * w[item[i]], sigma_e);
}
Define the design matrix

Since we want to test the main effects coded as the columns RCType, dist, and int, our design matrix will look like this:
# Make design matrix
X <- unname(model.matrix(~ 1 + RCType + dist + int, rDat))
attr(X, "assign") <- NULL
Prepare data for Stan

Stan expects the data in a list form, not as a data frame (unlike lmer). So we set it up as follows:
# Make Stan data
stanDat <- list(N = nrow(X),
P = ncol(X),
n_u = ncol(X),
n_w = ncol(X),
X = X,
Z_u = X,
Z_w = X,
J = nlevels(rDat$subj),
K = nlevels(rDat$item),
rt = rDat$rt,
subj = as.integer(rDat$subj),
item = as.integer(rDat$item))
Load library rstan and fit Stan model
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

# Fit the model
matrixFit <- stan(file = "matrixModel.stan", data = stanDat,
iter = 2000, chains = 4)
Examine posteriors
print(matrixFit)

This print output is overly verbose. I wrote a simple function to get the essential information quickly.
stan_results <- function(m, params) {
  m_extr <- extract(m, pars = params)  # posterior draws for the requested parameters
  means <- lapply(m_extr, mean)
  quantiles <- lapply(m_extr,
                      function(x) quantile(x, probs = c(0.025, 0.975)))
  means <- data.frame(means)
  quants <- data.frame(quantiles)
  summry <- t(rbind(means, quants))
  colnames(summry) <- c("mean", "lower", "upper")
  summry
}

For example, if I want to see only the posteriors of the four beta parameters, I can write:
stan_results(matrixFit, params=c("beta[1]","beta[2]","beta[3]","beta[4]"))

For more details, such as interpreting the results and computing things like Bayes Factors, see Nicenboim and Vasishth 2016.
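(As a further illustration, not part of the original tutorial: the raw posterior draws returned by rstan::extract() can also be used directly, for example to approximate the posterior probability that a particular coefficient is positive.)

# sketch: posterior probability that the RCType effect (assumed here to be beta[2]) is positive
beta_draws <- rstan::extract(matrixFit, pars = "beta")$beta  # matrix of draws x P
mean(beta_draws[, 2] > 0)                                    # share of posterior draws above zero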
FAQ: What if I don’t want to fit a lognormal?

In the Stan code above, I assume a lognormal function for the reading times:
rt[i] ~ lognormal(X[i] * beta + Z_u[i] * u[subj[i]] + Z_w[i] * w[item[i]], sigma_e);

If this upsets you deeply and you want to use a normal distribution (and in fact, for EEG data this makes sense), go right ahead and change the lognormal to normal:
rt[i] ~ normal(X[i] * beta + Z_u[i] * u[subj[i]] + Z_w[i] * w[item[i]], sigma_e);
FAQ: What if my dependent measure is a binary (0,1) response?

Use this Stan code instead of the one shown above. Here, I assume that you have a column called response in the data, which has 0,1 values. These are the trial level binary responses.
data {
int N; //no trials
int P; //no fixefs
int J; //no subjects
int n_u; //no subj ranefs
int K; //no items
int n_w; //no item ranefs
int subj[N]; //subject indicator
int item[N]; //item indicator
row_vector[P] X[N]; //fixef design matrix
row_vector[n_u] Z_u[N]; //subj ranef design matrix
row_vector[n_w] Z_w[N]; //item ranef design matrix
int response[N]; //response
}

parameters {
vector[P] beta; //fixef coefs
cholesky_factor_corr[n_u] L_u; //cholesky factor of subj ranef corr matrix
cholesky_factor_corr[n_w] L_w; //cholesky factor of item ranef corr matrix
vector<lower=0>[n_u] sigma_u; //subj ranef std (constrained to be positive)
vector<lower=0>[n_w] sigma_w; //item ranef std (constrained to be positive)
vector[n_u] z_u[J]; //spherical subj ranef
vector[n_w] z_w[K]; //spherical item ranef
}

transformed parameters {
vector[n_u] u[J]; //subj ranefs
vector[n_w] w[K]; //item ranefs
{
matrix[n_u,n_u] Sigma_u; //subj ranef cov matrix
matrix[n_w,n_w] Sigma_w; //item ranef cov matrix
Sigma_u = diag_pre_multiply(sigma_u,L_u);
Sigma_w = diag_pre_multiply(sigma_w,L_w);
for(j in 1:J)
u[j] = Sigma_u * z_u[j];
for(k in 1:K)
w[k] = Sigma_w * z_w[k];
}
}

model {
//priors
beta ~ cauchy(0,2.5);
sigma_u ~ cauchy(0,2.5);
sigma_w ~ cauchy(0,2.5);
L_u ~ lkj_corr_cholesky(2.0);
L_w ~ lkj_corr_cholesky(2.0);
for (j in 1:J)
z_u[j] ~ normal(0,1);
for (k in 1:K)
z_w[k] ~ normal(0,1);
//likelihood
for (i in 1:N)
response[i] ~ bernoulli_logit(X[i] * beta + Z_u[i] * u[subj[i]] + Z_w[i] * w[item[i]]);
}
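A minimal sketch of fitting this binary model from R might look as follows; the file name binomialModel.stan and the assumption that rDat has a 0/1 column called response are mine, not part of the original post:

# sketch: data list and sampling call for the bernoulli-logit model
stanDatBin <- list(N = nrow(X), P = ncol(X),
                   n_u = ncol(X), n_w = ncol(X),
                   X = X, Z_u = X, Z_w = X,
                   J = nlevels(rDat$subj), K = nlevels(rDat$item),
                   subj = as.integer(rDat$subj), item = as.integer(rDat$item),
                   response = as.integer(rDat$response))

binFit <- stan(file = "binomialModel.stan", data = stanDatBin,
               iter = 2000, chains = 4)
print(binFit, pars = "beta")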
For reproducible example code

See here.


Updated Shiny app

Mon, 03/27/2017 - 03:58

(This article was first published on Bayes Ball, and kindly contributed to R-bloggers)

A short post to alert the world that my modest Shiny application, showing Major League Baseball run scoring trends since 1901, has been updated to include the 2016 season. The application can be found here:
https://monkmanmh.shinyapps.io/MLBrunscoring_shiny/.

In addition to updating the underlying data, the update moved some of the processing that was happening inside the application into the pre-processing stage. This processing needs to happen only once, and is not related to the reactivity of the application. The change improves the speed of the application: in addition to reducing the processing, it also shrinks the size of the data table loaded into the application.

The third set of changes was a consequence of updates to the Shiny and ggplot2 packages in the two years that have passed since I built the app. In Shiny, the “format” argument of the sliderInput widget was deprecated; in ggplot2, there was a change in the quoting of the “method” specification in stat_smooth(). Little things that took a few minutes to debug! Next up will be some formatting changes, and a different approach to one of the visualizations.
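For readers running into the same two issues, the fixes amount to something like the sketch below; the input id, labels, smoothing method and toy data are assumptions for illustration, not the app’s actual code:

library(shiny)
library(ggplot2)

# sliderInput(): the deprecated 'format' argument is simply dropped
sliderInput("season", "Season:", min = 1901, max = 2016,
            value = c(1901, 2016), sep = "")

# stat_smooth(): the smoothing method is passed as a quoted string
runs <- data.frame(yearID = 1901:2016,
                   runs_per_game = runif(116, 3.5, 5.5))  # toy data for illustration
ggplot(runs, aes(x = yearID, y = runs_per_game)) +
  geom_point() +
  stat_smooth(method = "loess")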

 -30-


Le Monde puzzle [#1001]

Mon, 03/27/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

After a long lag (due to my missing the free copies distributed at Paris-Dauphine!), here is a Sudoku-like Le Monde mathematical puzzle:

A grid of size (n,n) holds integer values such that any entry larger than 1 is the sum of one term in the same column and one term in the same row. What is the maximal possible value observed in such a grid when n=3,4?

This can be solved in R by a random exploration of such possible grids in a simulated annealing spirit:

mat=matrix(1,N,N)
goal=1
targ=function(mat){ #check constraints
  d=0
  for (i in (1:(N*N))[mat>1]){
    r=(i-1)%%N+1;c=(i-1)%/%N+1
    d=d+(min(abs(mat[i]-outer(mat[-r,c],mat[r,-c],"+")))>0)}
  return(d)}
cur=0
for (t in 1:1e6){
  i=sample(1:(N*N),1);prop=mat
  prop[i]=sample(1:(2*goal),1)
  d=targ(prop)
  if (10*log(runif(1))/t<cur-d){
    mat=prop;cur=d}
  if ((d==0)&(max(prop)>goal)){
    goal=max(prop);maxx=prop}}

returning a value of 8 for n=3 and 37 for n=4. However, the method is quite myopic and I tried instead a random filling of the grid, using each time the maximum possible sum for empty cells:

goal=1
for (v in 1:1e6){
  mat=matrix(0,N,N) #one 1 per row/col
  for (i in 1:N) mat[i,sample(1:N,1)]=1
  for (i in 1:N) if (max(mat[,i])==0) mat[sample(1:N,1),i]=1
  while (min(mat)==0){
    parm=sample(1:(N*N)) #random order
    for (i in parm[mat[parm]==0]){
      r=(i-1)%%N+1;c=(i-1)%/%N+1
      if ((max(mat[-r,c])>0)&(max(mat[r,-c])>0)){
        mat[i]=max(mat[-r,c])+max(mat[r,-c])
        break()}}}
  if (goal<max(mat)){
    goal=max(mat);maxx=mat}}

which recovered a maximum of 8 for n=3, but reached 48 for n=4. And 211 for n=5, 647 for n=6… For instance, here is the solution for n=4:

[1,]  1  5 11 10
[2,]  2  4  1  5
[3,] 48  2 24  1
[4,] 24  1 22 11

While the update in the above is random and associated with the first term in the permutation, it may be preferable to favour the largest possible term at each iteration, which I tried as

while (min(mat)==0){
  parm=sample(1:(N*N))
  val=0*parm
  for (i in parm[mat[parm]==0]){
    r=(i-1)%%N+1;c=(i-1)%/%N+1
    if ((max(mat[-r,c])>0)&(max(mat[r,-c])>0)){
      val[i]=max(mat[-r,c])+max(mat[r,-c])}
  }
  #largest term
  i=order(-val)[1];mat[i]=val[i]}

For n=4, I did not recover the maximal value 48, but achieved larger values for n=5 (264) and n=6 (2256).

As an aside, the R code sometimes led to a strange error message linked with the function sample(): too large a bound on the range produces the following

> sample(1:1e10,1)
Error: cannot allocate vector of size 74.5 Gb

meaning that 1:1e10 first creates a vector for all the possible values. The alternative

> sample.int(1e10,1)
[1] 7572058778

works, however. And only breaks down for 10¹².

Filed under: Kids, R Tagged: Le Monde, mathematical puzzle, R, sample, sudoku


RcppTOML 0.1.2

Mon, 03/27/2017 - 00:12

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new release of RcppTOML is now on CRAN. This release fixes a few parsing issues for less frequently-used inputs: vectors of boolean or date(time) types, as well as table array input.

RcppTOML brings TOML to R. TOML is a file format that is most suitable for configurations, as it is meant to be edited by humans but read by computers. It emphasizes strong readability for humans while at the same time supporting strong typing as well as immediate and clear error reports. On small typos you get parse errors, rather than silently corrupted garbage. Much preferable to any and all of XML, JSON or YAML — though sadly these may be too ubiquitous now. TOML is making good inroads with newer and more flexible projects such as the Hugo static blog compiler, or the Cargo system of Crates (aka "packages") for the Rust language.
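As a quick illustration of the package in action (the configuration values below are made up for this sketch), a small TOML file can be parsed into a nested R list with parseTOML():

library(RcppTOML)

# write a tiny, made-up configuration file and parse it back into an R list
writeLines(c('title = "example"',
             '[database]',
             'server  = "192.168.1.1"',
             'ports   = [ 8001, 8001, 8002 ]',
             'enabled = true'),
           "example.toml")
config <- parseTOML("example.toml")
str(config)   # nested named list: title at the top level, database as a sub-list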

Changes in version 0.1.2 (2017-03-26)
  • Dates and Datetimes in arrays in the input now preserve their types instead of converting to numeric vectors (#13)

  • Boolean vectors are also correctly handled (#14)

  • TableArray types are now stored as lists in a single named list (#15)

  • The README.md file was expanded with an example and screenshot.

  • Added file init.c with calls to R_registerRoutines() and R_useDynamicSymbols(); also use .registration=TRUE in useDynLib in NAMESPACE

  • Two example files were updated.

Courtesy of CRANberries, there is a diffstat report for this release.

More information is on the RcppTOML page. Issues and bug reports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Economics chapter added to “Empirical software engineering using R”

Mon, 03/27/2017 - 00:04

(This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers)

The Economics chapter of my Empirical software engineering book has been added to the draft pdf (download here).

This is a slim chapter; it might grow a bit, but I suspect not by a huge amount. Reasons include lots of interesting data being confidential and me not having spent a lot of time on this topic over the years (so my stash of accumulated data is tiny). Also, a significant chunk of the economics data I have is used to discuss issues in the Ecosystems and Projects chapters; perhaps some of this material will migrate back once those chapters are finalized.

You might argue that Economics is more important than Human cognitive characteristics and should have appeared before it (in chapter order). I would argue that hedonism by those involved in producing software is the important factor that pushes (financial) economics into second place (still waiting for data to argue my case in print).

Some of the cognitive characteristics data I have been waiting for arrived, and has been added to this chapter (some still to be added).

As always, if you know of any interesting software engineering data, please tell me.

I am after a front cover. A woodcut of alchemists concocting a potion appeals, perhaps with various software references discreetly included, or something astronomy related (the obvious candidate has already been used). The related modern stuff I have seen does not appeal. Suggestions welcome.

Ecosystems next.


Using R to study the Yemen Conflict with night light images

Sun, 03/26/2017 - 12:38

The Yemeni civil war has received very little attention despite the growing humanitarian disaster. There is a lack of reliable figures on the extent of the human suffering in Yemen; what little data is available suggests that it is immense. According to the UN, from March 2015 to August 2016, over 10,000 people were killed in Yemen, including 3,799 civilians.

This note asks whether high-frequency satellite images capture the extent of the ongoing conflict in Yemen and, in particular, whether there is distinct geographic variation indicating which areas are most affected by the conflict.

Can the effect and the spatial incidence of the conflict in Yemen be traced through satellite images?

Satellite images have been used to study urban sprawl and general economic growth and development. The extent to which satellite images can be used to study man-made disasters such as conflicts is not widely explored.

There are lots of other papers that have used various night light data sets to study urbanization, ethnic favoritism, and economic growth (see Henderson et al., 2012; Michalopoulos and Papaioannou, 2013; Hodler and Raschky, 2014).

In related work (Fetzer et al., 2016), I studied the extent to which light emissions in the early 1990s can be used to obtain a measure of the extent of power rationing in Colombia following El Niño-induced droughts. In another project, we use the DMSP night light images to study the evolution of cities over time and how democratization can change the relative distribution of cities (Fetzer and Shanghavi, 2015).

Since 2012, high-frequency and high-resolution VIIRS satellite images capturing night light emissions have been available from NASA’s Earth Observation Group. They have now been made available for analysis on Google’s Earth Engine, making them much more accessible to the wider research audience.

Let’s have a look at night lights in Yemen before and after the Saudi Arabian military intervention.

Average VIIRS lights after the Saudi intervention in Yemen started.

Average VIIRS lights for the period before the Saudi intervention in Yemen.

The light scales are identical, indicating that, relative to the border with Saudi Arabia, the night light emissions from Yemen have dropped dramatically, especially around the capital city Sana’a. The circular blobs are around the main oil/gas producing parts of Yemen, where there may be light emissions due to flaring of natural gas.

A minimum average light emission of 0.5 was imposed. Zooming in to Sana’a, the figures look as follows.

Average light emissions from Sana’a since the Saudi intervention in Yemen started.

 

Average light emissions from Sana’a for period before the Saudi intervention in Yemen.

Having a look at the data

library(data.table)
library(foreign)
library(plyr)
library(parallel)
options(stringsAsFactors = FALSE)
setwd("/Users/thiemo/Dropbox/Research/Yemen")

# A DATA SET OF 34k populated places (or historically populated places)
YE <- data.table(read.csv(file = "~/Dropbox/Research/Yemen/Yemen-Cities.csv"))

# LIGHTS DATA IS FROM VIIRS Images made available on the Google Earth Engine
LIGHTS <- data.table(read.csv(file = "~/Dropbox/Research/Yemen/lightsall.csv"))
LIGHTS[, `:=`(year, as.numeric(substr(system.index, 1, 4)))]
LIGHTS[, `:=`(month, as.numeric(substr(system.index, 5, 6)))]
LIGHTS[, `:=`(.geo, NULL)]
LIGHTS[, `:=`(UFI, NULL)]
LIGHTS[, `:=`(LONGITUDE, NULL)]
LIGHTS[, `:=`(date, strptime(paste(year, month, "01", sep = "-"), "%Y-%m-%d"))]
LIGHTS <- join(LIGHTS, YE)
## Joining by: rownum

Some simple plots are quite suggestive. The following plots the average light emissions around populated places over time, by month. The onset of the intervention, which coincides with the fall of Sana’a, is associated with a dramatic drop in light emissions.

Average lights dropped by almost two thirds, suggesting a standstill in economic activity. Overall light emissions are still visible, as indicated in the graphs, suggesting that the places do not turn pitch black.

plot(LIGHTS[, mean(list), by = date], type = "l")

The Distributional Effects of the Conflict

The Houthi movement has been gaining influence over a longer time period. In particular, since 2012 the Houthis have gained influence, spreading from north to south. The European Council on Foreign Relations has produced maps illustrating the spatial expansion of Houthi control in Yemen.

A central question relates to the strategy of the Saudi military intervention: in particular, whether the intervention is aimed at territories that have come under Houthi control since 2012, or whether it targets the Houthi heartland.

A simple exercise that allows this study is to look at the evolution of lights in the northern Houthi-heartland relative to the populated places in the rest of the country that came under Houthi control since 2012.

A definition of what constitutes the Houthi heartland is subject to contention, but a conservative definition may consist of the four governorates Ammran, Sada’ah, Al Jawf and Hajjah.

LIGHTS[, `:=`(HOUTHI, as.numeric(ADM1 %in% c("15", "22", "21", "19")))]

require(ggplot2)
ggplot(LIGHTS[, mean(list), by = c("HOUTHI", "date")],
       aes(date, V1, colour = as.factor(HOUTHI))) +
  geom_line() + geom_point() +
  theme_bw() + theme(legend.position = "bottom")

 

The summary statistics suggest that in absolute terms the drop is much larger in the non-Houthi heartland, though the initial level in the Houthi heartland is much lower, suggesting that that part of the country is much less developed. Given that there is a notional minimum light emission of zero, this truncation of the data is a concern.

One way around this is to dummify the lights measure and look at whether a populated place is lit above a certain threshold.

LIGHTS[, `:=`(anylit, list > 0.25)]

ggplot(LIGHTS[, mean(anylit), by = c("HOUTHI", "date")],
       aes(date, V1, colour = as.factor(HOUTHI))) +
  geom_line() + geom_point() +
  theme_bw() + theme(legend.position = "bottom")

Again it is hard to see whether there is any divergence in trends in this dummified measure, but this naturally is less prone to be affected by the truncation inherent to this type of data.

A regression with location and time fixed effects that measures whether there was a distinct change in nightlights in places in the Houthi-heartland relative to the non-Houthi heartland suggests that there is indeed a marked difference, indicating that the conflict is concentrated in the non-Houthi heartland.

Defining the discrete variables for a difference-in-differences estimation and loading the lfe package, which allows for high-dimensional fixed effects:

LIGHTS[, `:=`(anylit, list > 0.25)]
LIGHTS[, `:=`(postKSAintervention, as.numeric(date > "2015-03-01"))]
library(lfe)
LIGHTS[, `:=`(date, as.factor(date))]

Running the actual difference in difference regressions:

# levels
summary(felm(list ~ postKSAintervention:HOUTHI | rownum + date | 0 | ADM1, data = LIGHTS))

##
## Call:
##    felm(formula = list ~ postKSAintervention:HOUTHI | rownum + date | 0 | ADM1, data = LIGHTS)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -74.347  -0.205   0.043   0.194  82.063
##
## Coefficients:
##                            Estimate Cluster s.e. t value Pr(>|t|)
## postKSAintervention:HOUTHI   0.4184       0.1900   2.202   0.0277 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.758 on 1172455 degrees of freedom
## Multiple R-squared(full model): 0.752   Adjusted R-squared: 0.7447
## Multiple R-squared(proj model): 0.003315   Adjusted R-squared: -0.02603
## F-statistic(full model, *iid*): 103 on 34519 and 1172455 DF, p-value: < 2.2e-16
## F-statistic(proj model): 4.848 on 1 and 22 DF, p-value: 0.03846

# dummified measure
summary(felm(anylit ~ postKSAintervention:HOUTHI | rownum + date | 0 | ADM1, data = LIGHTS))

##
## Call:
##    felm(formula = anylit ~ postKSAintervention:HOUTHI | rownum + date | 0 | ADM1, data = LIGHTS)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.12247 -0.10416  0.00593  0.06185  1.06958
##
## Coefficients:
##                            Estimate Cluster s.e. t value Pr(>|t|)
## postKSAintervention:HOUTHI  0.08470      0.02359    3.59  0.00033 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2223 on 1172455 degrees of freedom
## Multiple R-squared(full model): 0.5762   Adjusted R-squared: 0.5637
## Multiple R-squared(proj model): 0.008458   Adjusted R-squared: -0.02073
## F-statistic(full model, *iid*): 46.18 on 34519 and 1172455 DF, p-value: < 2.2e-16
## F-statistic(proj model): 12.89 on 1 and 22 DF, p-value: 0.00163

# taking logs
summary(felm(log(list) ~ postKSAintervention:HOUTHI | rownum + date | 0 | ADM1,
             data = LIGHTS[!is.infinite(log(list))]))

##
## Call:
##    felm(formula = log(list) ~ postKSAintervention:HOUTHI | rownum + date | 0 | ADM1, data = LIGHTS[!is.infinite(log(list))])
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12.8918  -0.3725   0.1060   0.5223   6.5958
##
## Coefficients:
##                            Estimate Cluster s.e. t value Pr(>|t|)
## postKSAintervention:HOUTHI   0.4133       0.1234    3.35 0.000809 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8958 on 844476 degrees of freedom
##   (327294 observations deleted due to missingness)
## Multiple R-squared(full model): 0.6534   Adjusted R-squared: 0.6393
## Multiple R-squared(proj model): 0.01248   Adjusted R-squared: -0.02789
## F-statistic(full model, *iid*): 46.12 on 34519 and 844476 DF, p-value: < 2.2e-16
## F-statistic(proj model): 11.22 on 1 and 22 DF, p-value: 0.002899

An alternative way to study this is by doing a flexible non-parametric estimation to rule out diverging trends prior to the military intervention.

summary(felm(anylit ~ date:HOUTHI | rownum + date | 0 | ADM1, data = LIGHTS))

##
## Call:
##    felm(formula = anylit ~ date:HOUTHI | rownum + date | 0 | ADM1, data = LIGHTS)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.12574 -0.10765  0.00313  0.06437  1.06515
##
## Coefficients:
##                        Estimate Cluster s.e. t value Pr(>|t|)
## date2014-01-01:HOUTHI       NA      0.00000      NA       NA
## date2014-02-01:HOUTHI  0.01095      0.01320   0.830 0.406641
## date2014-03-01:HOUTHI  0.03173      0.02764   1.148 0.250884
## date2014-04-01:HOUTHI  0.11048      0.06028   1.833 0.066814 .
## date2014-05-01:HOUTHI  0.09762      0.05271   1.852 0.063989 .
## date2014-06-01:HOUTHI  0.10249      0.05861   1.749 0.080336 .
## date2014-07-01:HOUTHI  0.07204      0.06053   1.190 0.233987
## date2014-08-01:HOUTHI  0.06338      0.04866   1.302 0.192778
## date2014-09-01:HOUTHI  0.03816      0.04690   0.814 0.415860
## date2014-10-01:HOUTHI  0.04247      0.04359   0.974 0.329930
## date2014-11-01:HOUTHI  0.05621      0.03646   1.542 0.123115
## date2014-12-01:HOUTHI  0.02213      0.03037   0.729 0.466205
## date2015-01-01:HOUTHI -0.02596      0.02585  -1.004 0.315415
## date2015-02-01:HOUTHI  0.02250      0.05141   0.438 0.661649
## date2015-03-01:HOUTHI  0.06080      0.05740   1.059 0.289437
## date2015-04-01:HOUTHI  0.13514      0.04806   2.812 0.004925 **
## date2015-05-01:HOUTHI  0.15874      0.04647   3.416 0.000635 ***
## date2015-06-01:HOUTHI  0.15493      0.05151   3.008 0.002632 **
## date2015-07-01:HOUTHI  0.12681      0.04697   2.700 0.006944 **
## date2015-08-01:HOUTHI  0.12363      0.04319   2.863 0.004202 **
## date2015-09-01:HOUTHI  0.13972      0.05276   2.648 0.008088 **
## date2015-10-01:HOUTHI  0.13422      0.04697   2.857 0.004273 **
## date2015-11-01:HOUTHI  0.12408      0.04566   2.717 0.006578 **
## date2015-12-01:HOUTHI  0.12125      0.04505   2.691 0.007119 **
## date2016-01-01:HOUTHI  0.11971      0.03905   3.065 0.002176 **
## date2016-02-01:HOUTHI  0.11952      0.04151   2.879 0.003984 **
## date2016-03-01:HOUTHI  0.12721      0.04239   3.001 0.002693 **
## date2016-04-01:HOUTHI  0.12537      0.04532   2.766 0.005669 **
## date2016-05-01:HOUTHI  0.12989      0.05297   2.452 0.014209 *
## date2016-06-01:HOUTHI  0.13070      0.05936   2.202 0.027675 *
## date2016-07-01:HOUTHI  0.14831      0.06597   2.248 0.024573 *
## date2016-08-01:HOUTHI  0.13047      0.04614   2.827 0.004693 **
## date2016-09-01:HOUTHI  0.14481      0.06024   2.404 0.016227 *
## date2016-10-01:HOUTHI  0.11782      0.05255   2.242 0.024959 *
## date2016-11-01:HOUTHI  0.12175      0.04473   2.722 0.006486 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2219 on 1172422 degrees of freedom
## Multiple R-squared(full model): 0.5776   Adjusted R-squared: 0.5652
## Multiple R-squared(proj model): 0.01175   Adjusted R-squared: -0.01738
## F-statistic(full model, *iid*): 46.4 on 34552 and 1172422 DF, p-value: < 2.2e-16
## F-statistic(proj model): 147.2 on 35 and 22 DF, p-value: < 2.2e-16

This suggests that the differential drop in lights occurred only after March 2015, the month in which Saudi Arabia’s military intervention commenced.

On average, the regressions suggest that the drop in lights was significantly more pronounced outside the Houthi heartland. This suggests that the conflict and the bombing carried out by Saudi Arabia is mostly concentrated outside the Houthi rebel heartland.

That the dramatic drop in light emissions is associated with the Saudi military intervention is quite clear. The conflict between the Houthi rebels and the government had been ongoing for several years, but only with the intervention of Saudi Arabia do marked differences between Houthi and non-Houthi heartland provinces appear.

This analysis can further be refined by studying the religious make-up of different provinces, as the balance between Shia and Sunni Muslim groups is said to be an important factor driving this conflict.

Nevertheless, this analysis suggests that high-frequency satellite images such as these can be useful in assessing the extent to which areas are directly affected by conflict, which may be useful for targeting humanitarian relief.

New book and package pmfdR

Sun, 03/26/2017 - 09:00

(This article was first published on R and Finance, and kindly contributed to R-bloggers)

Processing and modelling financial data with R –

My Portuguese book about finance and R was
published
a couple of months ago and, given its positive feedback, I decided to
work on the English version immediately. You can find details about my
experience in self-publishing the book in this
post.

The English book is not a simple translation of text and examples. This
is a long term project that I always dreamed of doing. With time, I plan
to keep the Portuguese and English version synchronized. The feedback I
got from the Portuguese version was taken into account and I wrote
additional sections covering advanced use of dplyr with list-columns,
storing data in SQLite, reporting tables with xtable and stargazer,
and many more.

The book is not yet finished. I’m taking my time in reviewing everything
and making sure that it comes out perfectly. I believe it will be ready
in a few months or less. If you are interested in the book, please go to
its website where you can
find its current TOC (table of contents), code and data.

If you want to be notified about the publication of the book, please
sign this form and I’ll let
you know as soon as it is available.

Package pmfdR

Yesterday I released package pmfdR, which provides access to all
material from my book Processing and Modelling Financial Data with
R
, including code, data and exercises.

The exercises are still not complete. I expect to have at least 100
exercises covering all chapters of the book. As soon as the book is
finished, I’ll start working on them.

With package pmfdR you can:

  1. Download data and code with function pmfdR_download.code.and.data
  2. Build exercises with function pmfdR_build.exercise
Downloading code and data

All the R code from the book is publicly available in
github. Function
pmfdR_download.code.and.data will download a zip file from the
repository and unzip it at the specified folder. Have a look at its usage:

if (!require(pmfdR)){
  install.packages('pmfdR')
  library(pmfdR)
}

my.lan <- 'en' # language of code and data ('en' or 'pt-br')

# dl may take some time (around 60 mb)
pmfdR_download.code.and.data(lan = my.lan)

dir.out <- 'pmfdR-en-code_data-master'

# list R code
list.files(dir.out, pattern = '*.R')
list.files(paste0(dir.out,'/data'))

Building exercises

All exercises from the book are based on package exams. This means
that every reader will have a different version of each exercise, with
different values and a different correct answer. I’ve written extensively
about the positive aspects of using exams. You can find the full post
here.
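(For readers who have not used exams before, the randomization works roughly like the sketch below: each exercise is a dynamic .Rnw or .Rmd template, and every call draws fresh random values; the file and folder names here are placeholders.)

library(exams)

# sketch: generate three randomized PDF versions of one exercise template
exams2pdf("exercise_01.Rnw", n = 3, dir = "PdfOut")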

You can create your custom exercise file using function
pmfdR_build.exercise. Give it a try, just copy and paste the following
chunk of code in your R prompt.

if (!require(pmfdR)){
  install.packages('pmfdR')
  library(pmfdR)
}

my.lan <- 'en' # language of exercises
my.exercise.folder <- 'pmfdR-exercises' # name of folder with exercises files (will download from github)
my.pdf.folder <- 'PdfOut' # name of folder to place pdf file and answer sheet

pmfdR_build.exercise(lan = my.lan,
                     exercise.folder = my.exercise.folder,
                     pdf.folder = my.pdf.folder)

list.files(my.pdf.folder)


More book, more cricket! 2nd edition of my books now on Amazon

Sun, 03/26/2017 - 04:54

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

The 2nd edition of both my books

a) Cricket analytics with cricketr
b) Beaten by sheer pace – Cricket analytics with yorkr

is now available on Amazon, as both Paperback and Kindle versions.
Pick up your copies today!!!

A) Cricket analytics with cricketr: Second Edition

B) Beaten by sheer pace: Cricket analytics with yorkr(2nd edition)

Pick up your copies today!!!


R<-Slovakia meetup started to build a community in Bratislava

Sun, 03/26/2017 - 01:00

(This article was first published on Peter Laurinec, and kindly contributed to R-bloggers)

On 22 March, the first special R-related meetup, called R<-Slovakia, took place. As the name of the meetup group implies, it is based in Slovakia, in its capital, Bratislava. I am very happy to have been the first ever speaker at this event. R<-Slovakia intends to build a community to share the knowledge of people who are dealing with data science, machine learning or related fields. And, of course, of those who want to start learning about R and about the world of data.

Topics of the presentation

The main topic of my presentation was the work on my PhD thesis, which is still in progress. So, what kind of challenges did I tackle in the field of “mining” smart meter data? I am focused on clustering consumers to produce more accurate forecasts of aggregated electricity consumption. For this task, it is crucial to choose the most suitable forecasting methods and representations of the time series of electricity consumption, which are inputs to the clustering algorithm. I also spoke about R “tips and tricks” that can help to write more effective scripts for this task. Some ideas from the presentation are already published in scientific papers – you can check them in my ResearchGate profile.

The headlines of the presentation are as follows:

  • Introduction of R
  • Challenges in smart grids
  • Forecasting electricity consumption
  • Suitable methods for forecasting multiple seasonal time series
  • Cluster analysis of consumers
  • R tips and tricks

Slides are here:

RSlovakia #1 meetup from Peter Laurinec

About the organizer

The organizer of R<-Slovakia is the NGO GapData. They also started a meetup called PyData Bratislava, which is a network of users and developers of data analysis tools who meet on a regular basis to share ideas and learn from each other. The PyData community gathers to discuss how best to apply Python tools, as well as tools using R and Julia. R<-Slovakia and PyData Bratislava are organized with the help of the nonprofit organization NumFocus.

R<-Slovakia is also in the list of R meetups around the world. You can also check, more specifically, the European R meetups.

With this large amount of information (see the links), you can initiate your own meetup (or project) to build a community of data analysts.

Look ahead

I hope you will gain some knowledge from my slides, e.g. get a bigger picture of how to use data mining methods on smart meter data. If you don’t get some of the details, don’t worry: I will explain in more detail how to implement regression trees, ensemble learning or cluster analysis of consumers in R. Stay tuned, and in the meantime check my previous blog posts.

Teaser :sunglasses: :chart_with_upwards_trend: :


RApiDatetime 0.0.2

Sat, 03/25/2017 - 23:38

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Two days after the initial 0.0.1 release, a new version of RApiDatetime has just arrived on CRAN.

RApiDatetime provides six entry points for C-level functions of the R API for Date and Datetime calculations. The functions asPOSIXlt and asPOSIXct convert between long and compact datetime representation, formatPOSIXlt and Rstrptime convert to and from character strings, and POSIXlt2D and D2POSIXlt convert between Date and POSIXlt datetime. These six functions are all fairly essential and useful, but not one of them was previously exported by R.

Josh Ulrich took one hard look at the package — and added the one line we needed to enable the Windows support that was missing in the initial release. We now build on all platforms supported by R and CRAN. Otherwise, I just added a NEWS file and called it a bugfix release.

Changes in RApiDatetime version 0.0.2 (2017-03-25)
  • Windows support has been added (Josh Ulrich in #1)
Changes in RApiDatetime version 0.0.1 (2017-03-23)
  • Initial release with six accessible functions

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the rapidatetime page.

For questions or comments please use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

