Hans Gilde’s weblog

Interesting enhancements to MapReduce

Posted in eventprocessing, financial services by Hans on October 19, 2009

Interesting paper about changes to M-R, in part to enable online processing. Ideas like pipelining and better inter-job data flow have been on the radar for a while.

This kind of thing will likely be useful on the Amazon cloud. Rather than uploading data and then running M-R, it might be possible to begin the job as the data is uploading, thus getting results back sooner.

Fun with color maps: visualizing financial time series

Posted in decision making, financial services, R, statistics by Hans on October 2, 2009

Here’s an interesting visualization of daily stock returns for 50 components of the S&P 500. I used the same kind of heat map plot from my previous post.

Again, this plot conveys a lot about the multiple series, but you have to look for a minute to see why. Once you start to look, you see a surprising amount of information come out – to me, more information than from plotting all these series in the usual chart formats.

The plot shows the daily percent change in price for 50 random components of the S&P 500 (on the y-axis, one stock per row) for the 250 periods (time is the x-axis, left to right) prior to October 1, 2009. The 250 periods correspond to just short of one year of trading, so here we see summarized a year of trading in 50 stocks.
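As a small illustration of the quantity in each cell, here is a sketch of the one-period percent change for a single stock. The prices are made up for the example and are not taken from the data set.

```r
# Hypothetical closing prices for one stock (made-up numbers).
close <- c(100, 102, 99.96, 104.958)

# One-period simple return: (p[t] - p[t-1]) / p[t-1].
returns <- diff(close) / head(close, -1)

# n prices give n - 1 returns, which is why 251 closes yield 250 periods.
print(round(returns * 100, 2))
```

The same arithmetic, applied to each stock's closing prices, fills one row of the heat map.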

Note: The returns are capped at the lower 20% quantile and the upper 80% quantile. So while the legend shows -3% to 3%, any value outside that range is really clamped to the nearest of these two limits. This ensures that really high and low values don’t throw off the colors. A better approach would be a custom coloring scheme designed for this kind of data.
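To make the capping concrete, here is a minimal sketch in base R. It uses simulated returns rather than the real data, and it assumes the cap is taken over all cells pooled together; whether heatmap_2 does exactly this internally is an assumption on my part. The last line shows one possible custom diverging palette, purely as an illustration.

```r
set.seed(1)
# Simulated daily returns standing in for the real 50 x 250 matrix (assumption).
r <- matrix(rnorm(50 * 250, sd = 0.03), nrow = 50)

# Pooled 20% and 80% quantiles over every cell of the matrix.
lo <- quantile(r, 0.2, names = FALSE)
hi <- quantile(r, 0.8, names = FALSE)

# Clamp: values below lo become lo, values above hi become hi.
r.capped <- pmin(pmax(r, lo), hi)

# One possible diverging palette: red for losses, white for flat, blue for gains.
pal <- colorRampPalette(c("red", "white", "blue"))(21)
```

After clamping, the color scale only has to cover the range from lo to hi, so a handful of extreme days can no longer wash out the rest of the plot.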

stock_heat

At first, yes, this looks like colorful noise.

But I see many patterns (admittedly, I have good eyesight):

  • October 2008 was a bad period for these stocks, as shown by all the red to the left.
  • In this bad period, we also see plenty of big price movements, as shown by rows having alternating red and blue. That represents sequences of plus or minus almost 3% on alternating days.
  • Also during the bad period, some stocks fared better than others. We see several rows with red on the left that turn green much faster than the others.
  • October 2009 is much more calm, as shown by all the green, representing small changes, on the right.
  • Yet there are some stocks that see much more volatility than others. Just pick out rows of alternating reds and blues from the fields of green. These “colorful” rows are more volatile stocks.
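The "colorful rows are more volatile" reading can be cross-checked numerically. The sketch below ranks rows by the standard deviation of their returns, which is one simple volatility measure and not something the original plot computes. The matrix is simulated and the stock names are hypothetical.

```r
set.seed(42)
# Simulated returns: 5 hypothetical stocks x 250 days, each with a different volatility.
sds <- c(0.005, 0.01, 0.02, 0.03, 0.05)
r <- t(sapply(sds, function(s) rnorm(250, mean = 0, sd = s)))
rownames(r) <- paste0("STOCK", 1:5)

# Per-row standard deviation as a simple volatility measure.
vol <- apply(r, 1, sd)

# Most volatile rows first.
print(sort(vol, decreasing = TRUE))
```

Rows near the top of this ranking are the ones that should stand out as alternating reds and blues in the heat map.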

I’m sure there are plenty of other meaningful patterns in here.

So overall, an interesting technique for visualizing many time series together.

The data comes from here. The R code to process it is below, and you’ll need to install the Heatplus library per my previous post (not from CRAN).

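library(Heatplus)  # provides heatmap_2 (from Bioconductor, not CRAN)

# Daily price history: one row per symbol per day
stocks.raw=read.csv("sp500hst.txt", header=FALSE)
names(stocks.raw)=c("date", "symbol", "open", "high", "low", "close", "volume")

# Closing prices per symbol (this relies on the file being in date order within each symbol);
# keep only the symbols with a full 251 observations
stock.close=tapply(stocks.raw$close, stocks.raw$symbol, function(x){x})
stock.close.cleaned=stock.close[sapply(stock.close, length)==251]

# Reproducible random sample of 50 symbols
set.seed(1234567)
stock.names=sample(names(stock.close.cleaned), 50)

stock.1=stock.close.cleaned[stock.names]

# 251 closes give 250 daily returns; one row per stock
stock.returns=t(sapply(stock.1, function(d) {(d[2:251]-d[1:250])/d[1:250]}, simplify=TRUE))

# trim=.8 caps the color scale as described in the note above
heatmap_2(stock.returns, col=rainbow(ncol(stock.returns), end=4/6), Rowv=NA, Colv=NA,
  do.dendro=c(FALSE,FALSE), scale="none", legend=2,
  main="Stock returns", trim=.8)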

Nifty use of a heat map to show change over time

Posted in programming, R, statistics by Hans on October 1, 2009

All code in this post is for R. To make heat maps, I will use the function heatmap_2 from the Heatplus package, which is distributed through Bioconductor rather than CRAN. At the time I used it (several months before this post), Bioconductor packages had their own installation procedure instead of the usual install.packages() route. So I installed the package as follows:

source("http://bioconductor.org/biocLite.R")
biocLite("Heatplus")

You can use your favorite heat map function. On to the niftiness.

If you want to show observations over time, you can plot them:

d=log(seq(from=.05, to=2, by=.01))
plot(d, xlab="time", ylab="value")

Resulting in:

log

But what to do with many subjects?

set.seed(789478547)
subjects=50
f=function(){sample(c(-1,1), 1) * d*runif(1,.01,2)+runif(1,1,2)}
many = t(replicate(subjects, f(), simplify=TRUE))

Here we have the matrix “many”, where each row is a transform of the above logarithmic data. Let’s say each row holds the observations over time for one subject. The rows have been transformed so that some are increasing, some are decreasing, and all change at different rates. We would like to visualize this set of observations.

There are 50 rows in this matrix and we could plot each row, all overlaid on the same pair of axes. I’ll not bother doing this, because the result is a mess. Even if we use lines rather than points, and give each line a unique color, we get a mess.

So how about this:

library(Heatplus)
heatmap_2(many, col=rainbow(length(d)), Rowv=NA, Colv=NA, do.dendro=c(FALSE,FALSE), scale="none", legend=2, main="Observations over time")

Resulting in this:

heat_map

In this “heat map”, each row is a subject and the x-axis is time (flowing from left to right). I’m sure the colors could be better, but hopefully the more you look at this, the more information you see. I think this image describes the data in a very intuitive way.

I applied this technique to data where some subjects showed a trend over time and some didn’t. I could easily distinguish subjects with a strong trend from subjects with no trend. I will probably use this again in the future, although hopefully I’ll find a heat map function that is more suited to this kind of application.
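For anyone without Heatplus installed, a roughly similar picture can be drawn with image() from base R graphics. This is a sketch of an alternative, not the function used for the plots in this post, and the row order and legend will differ from heatmap_2. It rebuilds the same simulated matrix so that it runs on its own.

```r
set.seed(789478547)
d <- log(seq(from = 0.05, to = 2, by = 0.01))
subjects <- 50
f <- function() sample(c(-1, 1), 1) * d * runif(1, 0.01, 2) + runif(1, 1, 2)
many <- t(replicate(subjects, f()))  # one row per subject, one column per time step

# Draw to a null device so the sketch runs headless; drop this line (and dev.off) interactively.
pdf(NULL)

# image() expects z with one row per x value, so pass the transpose:
# time runs along the x-axis and each subject is a horizontal band (row 1 at the bottom).
image(x = seq_along(d), y = seq_len(subjects), z = t(many),
      col = rainbow(length(d), end = 4/6),
      xlab = "time", ylab = "subject", main = "Observations over time")

invisible(dev.off())
```

Because image() does no clustering or scaling by default, there is nothing to switch off, which is most of what the Rowv, Colv, do.dendro and scale arguments were doing in the heatmap_2 call.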
