Hans Gilde’s weblog

Advice for starting a real-time analytics project

Posted in eventprocessing by Hans on May 3, 2010

Although unrelated to my daily work, I have recently aided some former colleagues who were called in to help a troubled real-time analytics project. That reminded me of the most important advice I can offer a new real-time analytics project:

The first step is to gather a large sample of the real-time data and demonstrate the analytics over this sample. The next step is building the real-time analytics system.

The first step is sometimes called “back testing”, and it applies just as much to retail sales as it does to financial markets. This step does not involve any kind of real-time analysis: all the analysis is done using regular data processing tools such as SQL, R, MATLAB, or a RETE rules engine.
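As a rough illustration of what back testing can look like, here is a minimal Python sketch that runs a simple spike detector over a stored sample in plain batch mode. The sample values, the function name `flag_spikes`, the window size, and the threshold are all hypothetical, chosen only for the example; a real project would substitute its own data and analytics.

```python
from statistics import mean, pstdev

def flag_spikes(values, window=5, threshold=3.0):
    """Return indices whose value sits more than `threshold` standard
    deviations above the mean of the preceding `window` values."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        # Skip flat windows, where a z-score is undefined.
        if sigma > 0 and (values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical historical sample, e.g. hourly sales counts loaded from a file.
sample = [100, 102, 98, 101, 99, 100, 180, 101, 99, 100]
flagged = flag_spikes(sample)
print("flagged indices:", flagged)
```

Nothing here runs in real time. The point is to see whether the analytic flags the events you care about, and how many false alarms it raises, before any streaming infrastructure exists.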

Do not buy Complex Event Processing software or start building a high-speed data processing system without first trying your ideas on a good sample of historical data. This may seem obvious, but I am surprised by how often projects start by diving right into the real-time components.

This advice applies to both statistical and non-statistical analytics. Even simple rules for monitoring a distributed system should be tested with historical data.
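For example, a monitoring rule can be replayed over logged events before any real-time engine is involved. The sketch below is only an assumption about what such a rule might look like: the rule itself (three or more errors from one host within 60 seconds), the event layout `(timestamp, host, level)`, and the name `replay_rule` are all invented for illustration.

```python
from collections import defaultdict, deque

def replay_rule(events, max_errors=3, window_seconds=60):
    """Replay historical events in time order and return (timestamp, host)
    for every moment a host has `max_errors` or more errors within the
    trailing `window_seconds`."""
    recent = defaultdict(deque)  # host -> timestamps of recent errors
    alerts = []
    for ts, host, level in sorted(events):
        if level != "ERROR":
            continue
        errors = recent[host]
        errors.append(ts)
        # Drop errors that have fallen out of the time window.
        while errors and ts - errors[0] > window_seconds:
            errors.popleft()
        if len(errors) >= max_errors:
            alerts.append((ts, host))
    return alerts

# Hypothetical historical log: (seconds since start, host, level).
history = [
    (0, "web-1", "ERROR"),
    (10, "web-1", "INFO"),
    (20, "web-1", "ERROR"),
    (30, "web-2", "ERROR"),
    (45, "web-1", "ERROR"),
    (200, "web-1", "ERROR"),
]
alerts = replay_rule(history)
print(alerts)
```

Running a rule like this over weeks of real logs shows quickly whether it fires on the incidents you remember and stays quiet the rest of the time.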

Unfortunately, it is sometimes expensive just to gather and process the sample historical data. It may represent a third or more of all the work in the project, or it may involve costly hardware sensors. In these cases, it’s even more important not to jump right into the real-time processing part. Skipping the step of first collecting and analyzing a sample of historical data almost never saves money; it just introduces massive project risk.

To summarize:

  • Step 1: prove the analytics using historical data
  • Step 2: build or buy a real-time analytics system

Do not do these steps in parallel unless you really know what you’re doing (in which case you don’t need my advice).

It’s possible that very standard analytics like web log analysis or simple network monitoring could skip the step of testing with historical data. I don’t work with these kinds of projects, so I can’t offer good advice there.
