Short post today: if you have not already done so, take a look at Google Go.

Interesting paper about changes to M-R, in part to enable online processing. Ideas like pipelining and better inter-job data flow have been on the radar for a while.

This kind of thing will likely be useful on the Amazon cloud. Rather than uploading data and then running M-R, it might be possible to begin the job as the data is uploading, thus getting results back sooner.

Here’s an interesting visualization of daily stock returns for 50 components of the S&P 500. I used the same kind of heat map plot from my previous post.

Again, this plot conveys a lot about the multiple series, but you have to look for a minute to see why. Once you start to look, you see a surprising amount of information come out – to me, more information than from plotting all these series in the usual chart formats.

The plot shows the percent change in price for 50 random components of the S&P 500 (on the y-axis, one stock per row) for the 250 periods (time is the x-axis, left to right) prior to October 1, 2009. The 250 periods correspond to just short of one year of trading, so here we see summarized a year of trading in 50 stocks.

Note: The returns are capped at the lower 20% quantile and the upper 80% quantile. So while the legend shows -3% to 3%, anything beyond those values is really just clipped to them. This ensures that really high and low values don’t throw off the colors. A better approach would be a custom coloring scheme designed for this kind of data.
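
To make the capping concrete, here is a minimal sketch of clipping a vector at its 20% and 80% quantiles. The helper clip_returns and the simulated returns are my own illustration, and I am not claiming this is how heatmap_2 implements its trim argument internally.

```r
# Sketch: clip a vector of returns at its 20% and 80% quantiles.
# clip_returns is a hypothetical helper, not part of Heatplus.
clip_returns <- function(x, lower = 0.2, upper = 0.8) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])  # raise low values to q[1], lower high values to q[2]
}

set.seed(1)
r <- rnorm(1000, mean = 0, sd = 0.03)  # simulated daily returns, not real data
clipped <- clip_returns(r)
range(clipped)  # the 20% and 80% quantiles of r
```

Values inside the two quantiles are untouched, so the color scale is spent on the bulk of the data rather than on a few extreme days.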

[Image: stock_heat]

At first, yes, this looks like colorful noise.

But I see many patterns (admittedly, I have good eyesight):

  • October 2008 was a bad period for these stocks, as shown by all the red to the left.
  • In this bad period, we also see plenty of big price movements, as shown by rows having alternating red and blue. That represents sequences of plus or minus almost 3% on alternating days.
  • Also during the bad period, some stocks fared better than others. We see several rows with red on the left, but becoming green much faster than the others.
  • October 2009 is much more calm, as shown by all the green, representing small changes, on the right.
  • Yet there are some stocks that see much more volatility than others. Just pick out rows of alternating reds and blues from the fields of green. These “colorful” rows are more volatile stocks.

I’m sure there are plenty of other meaningful patterns in here.

So overall, an interesting technique for visualizing many time series together.

The data comes from here. The R code to process it is below, and you’ll need to install the Heatplus library per my previous post (not from CRAN).

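library(Heatplus)

# The file has no header row, so name the columns by hand
stocks.raw=read.csv("sp500hst.txt", header=FALSE)
names(stocks.raw)=c("date", "symbol", "open", "high", "low", "close", "volume")

# One vector of closing prices per symbol (assumes each symbol's rows are in date order);
# keep only the symbols with a full 251 observations
stock.close=tapply(stocks.raw$close, stocks.raw$symbol, function(x){x})
stock.close.cleaned=stock.close[sapply(stock.close, length)==251]

# Pick 50 symbols at random, seeded so the plot is reproducible
set.seed(1234567)
stock.names=sample(names(stock.close.cleaned), 50)

stock.1=stock.close.cleaned[stock.names]

# Daily returns: 250 per stock, one stock per row
stock.returns=t(sapply(stock.1, function(d) {(d[2:251]-d[1:250])/d[1:250]}, simplify=TRUE))

heatmap_2(stock.returns, col=rainbow(ncol(stock.returns), end=4/6), Rowv=NA, Colv=NA,
do.dendro=c(FALSE,FALSE), scale="none", legend=2,
main="Stock returns", trim=.8)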

All code in this post is for R. To make heat maps, I will use the function heatmap_2 from the Heatplus package, which comes from the Bioconductor library. At the time I used it (several months before this post), Bioconductor R packages had a strange installation system (not the usual CRAN modules). So I installed the package as follows:

source("http://bioconductor.org/biocLite.R")
biocLite("Heatplus")

You can use your favorite heat map function. On to the niftiness.

If you want to show observations over time, you can plot them:

d=log(seq(from=.05, to=2, by=.01))
plot(d, xlab="time", ylab="value")

Resulting in:

[Image: log]

But what to do with many subjects?

set.seed(789478547)
subjects=50
f=function(){sample(c(-1,1), 1) * d*runif(1,.01,2)+runif(1,1,2)}
many = t(replicate(subjects, f(), simplify=TRUE))

Here we have the matrix “many”, where each row is a transform of the above logarithmic data. Let’s say each row is observations over time, for one subject. The rows have been transformed, so some are increasing, some are decreasing and all at different rates. We would like to visualize this set of observations.

There are 50 rows in this matrix and we could plot each row, all overlaid on the same x/y axis. I’ll not bother doing this, because the result is a mess. Even if we use lines rather than points, and give each line a unique color, we get a mess.

So how about this:

library(Heatplus)
heatmap_2(many, col=rainbow(length(d)), Rowv=NA, Colv=NA, do.dendro=c(FALSE,FALSE), scale="none", legend=2, main="Observations over time")

Resulting in this:

[Image: heat_map]

In this “heat map”, each row is a subject and the x-axis is time (flowing from left to right). I’m sure the colors could be better, but hopefully the more you look at this, the more information you see. I think this image describes the data in a very intuitive way.

I applied this technique to data where some subjects showed a trend over time and some didn’t. I could easily distinguish subjects with a strong trend from subjects with no trend. I will probably use this again in the future, although hopefully I’ll find a heat map function that is more suited to this kind of application.
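
If you don’t have Heatplus installed, roughly the same picture can be drawn with image() from base R. This is a sketch of an alternative, not the code used for the plot above: it rebuilds the same matrix, has no legend, and draws subject 1 at the bottom rather than the top.

```r
# Rebuild the example matrix from the post.
set.seed(789478547)
d <- log(seq(from = 0.05, to = 2, by = 0.01))
subjects <- 50
f <- function() sample(c(-1, 1), 1) * d * runif(1, 0.01, 2) + runif(1, 1, 2)
many <- t(replicate(subjects, f()))

# image() lays out the rows of z along the x-axis, so transpose the matrix
# to put time on the x-axis and one subject per row on the y-axis.
image(x = seq_along(d), y = seq_len(subjects), z = t(many),
      col = rainbow(length(d)), xlab = "time", ylab = "subject",
      main = "Observations over time")
```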

A few blogs have discussed an intuitive explanation of why multiplying two negatives produces a positive. That is, an explanation that makes sense to a non-mathematician.

The most intuitive argument from those blogs, to my mind, was from The Math Less Traveled and talks about a negative as a reflection about zero on the number line (image of this at The Number Warrior).

Here is how I’d explain it:

The first question to be asked is: What is a negative value? Like what does -3 really mean?

The answer: A negative value, let’s say -x, is exactly a value where x+(-x)=0. That’s formally an additive inverse.

Now let’s look at (-x)(-y). If we can believe that (-x)(-y)=-(-(xy)) then we’re in good shape.

What does -(-(xy)) mean? Per above, it means the number that, when added to -(xy), equals zero. And that is obviously xy.
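
For anyone who wants the step we were asked to believe spelled out, here is one way to get it. It is a sketch that assumes two extra facts beyond the definition of a negative: multiplication distributes over addition, and anything times zero is zero.

```latex
\begin{align*}
xy + (-x)y &= \bigl(x + (-x)\bigr)y = 0 \cdot y = 0
  &&\Rightarrow\quad (-x)y = -(xy), \\
(-x)y + (-x)(-y) &= (-x)\bigl(y + (-y)\bigr) = (-x) \cdot 0 = 0
  &&\Rightarrow\quad (-x)(-y) = -\bigl((-x)y\bigr) = -(-(xy)).
\end{align*}
```

The first line says (-x)y is the additive inverse of xy. The second says (-x)(-y) is the additive inverse of (-x)y, which is exactly -(-(xy)).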

Interesting Wolfram Alpha query

September 13, 2009

I have previously been underwhelmed by Wolfram Alpha. But today I finally saw a real-life query produce good results.

Admittedly, I am probably in the minority of people excited by this: I was wondering how many megawatt hours are produced by 8000 tons of oil. And what do you know, the query on Alpha worked.
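
For the curious, the arithmetic can be checked by hand. This sketch assumes “tons of oil” means tonnes of oil equivalent (toe) under the IEA convention of 41.868 GJ per toe, and it gives thermal energy content, not electricity after generation losses. I don’t know which conventions Alpha used, so its figure may differ.

```r
# Assumption: 1 toe = 41.868 GJ (IEA convention) and 1 MWh = 3.6 GJ.
gj_per_toe <- 41.868
gj_per_mwh <- 3.6

# toe_to_mwh is a hypothetical helper for this post.
toe_to_mwh <- function(toe) toe * gj_per_toe / gj_per_mwh

toe_to_mwh(8000)  # about 93,040 MWh of thermal energy
```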

Thanks, Wolfram researchers.

Just looked at a very cool company: Monkey Analytics

They provide an AJAX interface to an Octave (MATLAB language) session, running on EC2. You upload your data and your scripts, then you run them in interactive mode via the remote interface. This is an alternative to keeping a dedicated high-end box at your desk, or running a session on a shared server. Of course uploading big data is annoying, but that’s a known trade-off.

They now offer interactive Octave sessions, an editor for your .m files (not a great editor yet), and an interface to run Python scripts. They intend to add R in the near future.

This is a really creative idea – the interactive session is very nice, much better than running a remote script. Plus the session persists over time, so you can use it from multiple computers (home, work, the web terminal at your boring vacation resort).

Hopefully, they can provide a convenient way for folks to run their big analytics without the hassle of maintaining additional hardware. Good luck Monkey Analytics!

I’ve recently read the Gartner Hype Cycle report that mentions Complex Event Processing (CEP) (Gartner document number G00168300, 16 July 2009), having first read Opher’s recent post on the subject.

In its first mention of CEP in a Hype Cycle (I think), Gartner puts CEP as a transformational technology, just approaching the Peak of Inflated Expectations and 5 to 10 years from mainstream adoption. CEP community members will naturally be enthused that CEP is slowly but surely getting more press.

But in this report, I see the same old problems with defining CEP and making it relevant to a broad audience. While sales of some CEP products have grown steadily over the past few years, the EPTS has a long way to go in defining and communicating the value of CEP.

For example, here is the best (only) use case that Gartner could come up with to communicate the value of CEP:

sales managers who receive an alert message containing a complex event that says,
“Today’s sales volume is 30% above average” grasp the situation more quickly than if they were
shown the hundreds of individual sales transactions (base events) that contributed to that
complex event.

Has anyone noticed that sales managers already get this information?!?! Are we saying that “CEP will produce sales volume reports”? That’s transformational? This use case doesn’t communicate anything new!
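
To show how little is going on in that use case, here is a sketch in R. The transactions are simulated and the column names are made up; the “complex event” is just a sum per day compared against the average of the earlier days.

```r
# Hypothetical base events: one row per sales transaction (simulated amounts).
set.seed(42)
sales <- data.frame(
  day    = rep(1:10, each = 100),
  amount = c(runif(900, 10, 100), runif(100, 20, 130))  # day 10 is busier
)

daily    <- tapply(sales$amount, sales$day, sum)  # sales volume per day
today    <- daily[[length(daily)]]
baseline <- mean(daily[-length(daily)])           # average of the earlier days
pct      <- 100 * (today / baseline - 1)

# The "complex event" is one derived message built from hundreds of base events.
if (pct > 0) {
  message(sprintf("Today's sales volume is %.0f%% above average", pct))
}
```

That is an ordinary daily report, which is rather the point.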

Here’s another quote:

Business activity monitoring (BAM) relies on CEP to provide the information it needs to operate.

I really wish that I were taking this sentence out of context, as an excuse to rag on how CEP is misunderstood. But I’m not – the report talks about CEP and BAM, without giving one piece of information to tell you what is the difference between the two. Does this mean that anyone who’s got BAM, already has CEP?

I’m sorry to say this, but the EPTS is failing to convey the value of CEP to Gartner. Reading the reports on CEP, there is no easily followed trail to that “ah ha” moment, which convinces users that it will be transformational.

The EPTS is on its way to positive things, but it is making much more (although still fairly slow) progress on forming its internal community than on communicating to the world about CEP.

Until now, TIBCO BusinessEvents has been the only rules-based product focussed on event processing. A rules-based product makes quite a lot of sense in this area: most people find rules to be a familiar and easy to understand way to code some common kinds of event logic (“when such-and-so happens, and so-and-so condition already exists, then do XYZ”). Enter Drools Fusion from the JBoss project. While the initial release will clearly not be up to the feature set of BE, it’s got promise. Combined with Esper, this would make a very interesting event processing platform.

I see this kind of thing as helpful to TIBCO and other CEP vendors. It provides a free demo of event processing capabilities to the masses. Then the vendors can step in with capabilities that the free software currently can’t match and pick up some sales.

I have some fun event processing code examples in store (a hidden Markov model and some on-line categorization and partitioning algorithms written in streaming SQL). But I’ve been busy, so instead I’m giving my two cents on WolframAlpha.

Here is my big problem with WolframAlpha: you don’t know what features it has, so you have to guess. The result is that you waste so much time that you get bored.

For example, the site will give me answers for “5 largest countries by area” or “5 largest countries by number of people” or “5 largest countries by GDP” but not “5 largest countries by number of sheep” or “5 largest countries by export of wheat”.

And it gives me no idea of why. Is it because it does not know how many sheep are in every country (or about wheat exports)? Or because it is hard-coded to understand how to rank countries by only a few terms and I am wasting my time by asking it for anything else? I get the impression that it is the latter, which would make this thing much less revolutionary than we have been told.

If only it would give me some feedback, I would know. Maybe it could tell me how it knows to rank countries. Or how it parsed my query, rather than just a generic “dunno what you want” message.

Similarly, it gives no answer for “median population of 10 largest countries”. So why not? It knows population (since it will rank countries by population), it knows how to calculate a median… what’s wrong? Am I phrasing the question wrong? Does it simply have hard coded functions like “rank countries” and “calculate median of input numbers” but will not chain the median function to the data on population of countries? Again, am I wasting my time by trying queries like this?
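
The chaining I have in mind is trivial in any data language. Here it is as a sketch in R, using a toy table with made-up numbers (not real population figures) and taking “largest” to mean largest by population:

```r
# Toy table with made-up numbers; these are not real population figures.
countries <- data.frame(
  name       = LETTERS[1:12],
  population = c(5, 120, 33, 210, 8, 64, 1300, 47, 19, 300, 76, 2)  # millions
)

# Step 1: rank by population and keep the 10 largest.
top10 <- head(countries[order(-countries$population), ], 10)

# Step 2: feed the ranked result into median().
median(top10$population)  # 70
```

Two built-in operations, one feeding the other. That is all the query asks for.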

I’m getting the feeling that the system has some built in functions to operate over its data store (rank countries, compare cities) and it just tries to match your query up to one of these abilities. In this case, all the “cool examples” that are provided on the site are not “examples” at all, they are a list of everything the site can do. This is very frustrating because Wolfram is giving the impression that I should experiment with all kinds of queries and see what I get.

All in all, I wish it would give me more feedback so I can understand whether it fails to answer my queries because I am phrasing them wrong, because it is missing data, or because it has a limited set of features and I am just wasting my time.