There’s been some talk recently on the CEP-Interest group and in various blogs (for example this recent post on the Apama blog) about doing BI and BAM using EP tools. I happen to be a fan of using EP for such applications and my experience is that there are plenty of cases where a pure database BAM/BI application can’t reproduce the functionality of an EP application. Some of these cases involve the speed of processing and some involve queries that cannot be easily expressed as SQL-style set operations.

But doing BAM and BI in “real-time” does bring about a few problems that are worth discussing. One advantage of a database application is that when it produces obviously incorrect results, you can run the query again. Also with a database solution, if some of the data is missing, it can be loaded in later and the query can be run again. These two scenarios are somewhat different with an EP solution.

The two failure scenarios, then, are: (1) a bug in the rules that causes incorrect results and (2) a problem that causes missing data.

Both of these scenarios are likely to produce incorrect output from an EP application. Now if the output from the EP application is only driving a GUI dashboard, we’ll just see strange things in the GUI. In this case, we can alert users that their dashboard isn’t working right at the moment, so please ignore it.

But what if we want to rely on this data in other ways? Here things can get harder. If I am using EP as the only solution to watch for fraud, can I afford to tell my users that “there was a problem on Tuesday, so we might have missed some fraud on that day”? This might not be received too well.

Also, what if the problem results in thousands and thousands of spurious fraud events? Do we simply discard all fraud events for the period in question? Or will some policy force us to look into every event? That will most definitely get a poor reception.

One potential solution may be to capture every event to disk. Those events can, in theory, be replayed through the engine at a later time. But even here, we can find problems. First of all, we're now storing all the events, and in a big environment that will be a lot of data. It will be especially annoying if someone in risk analysis determines that the data is important enough to the operation to be available under a disaster recovery scenario. Now we may need to use special disaster-proof storage for our data, and this can drive up the cost of the project significantly.
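
To make the capture-and-replay idea concrete, here is a minimal sketch in Python. It assumes events can be serialized as JSON and that a single append-only file is enough. The names `capture`, `read_journal` and `replay` are made up for illustration and do not come from any EP product.

```python
import json
from pathlib import Path
from typing import Callable, Iterator


def capture(event: dict, journal: Path) -> None:
    """Append one event to the journal as a single JSON line."""
    with journal.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event, sort_keys=True) + "\n")


def read_journal(journal: Path) -> Iterator[dict]:
    """Yield captured events in the order they were written."""
    with journal.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)


def replay(journal: Path, handle: Callable[[dict], None]) -> int:
    """Feed every captured event to `handle`; return how many were replayed."""
    count = 0
    for event in read_journal(journal):
        handle(event)
        count += 1
    return count
```

Even this toy version shows where the cost comes from: the journal grows with every event, and nothing here addresses retention, replication or disaster recovery.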

Second, note that if the EP engine itself is capturing the input events to disk, we still have not protected against failure scenario 2. If there is some problem such that the EP engine didn’t get certain bus messages but everyone else in the organization did, then our captured data set will not contain these messages. We must rely on someone else to have this data. Of course, the same thing can happen if we capture messages to a database. But EP solutions are newer and have more moving parts than traditional database capture software. So it is only prudent to think that the chances of having a problem where our EP solution misses messages may be greater than the chance that a database capture system will miss messages.
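
One way to at least notice that the capture is incomplete is to check for holes in the captured stream. The sketch below assumes that the publisher stamps every bus message with a consecutive integer sequence number, which not every messaging setup provides. The function name `find_gaps` is hypothetical.

```python
from typing import Iterable, List, Tuple


def find_gaps(seqs: Iterable[int]) -> List[Tuple[int, int]]:
    """Return inclusive (first_missing, last_missing) ranges between captured sequence numbers."""
    gaps = []
    previous = None
    for seq in sorted(set(seqs)):  # tolerate duplicates and out-of-order arrival
        if previous is not None and seq > previous + 1:
            gaps.append((previous + 1, seq - 1))
        previous = seq
    return gaps
```

A check like this only finds holes between messages we did capture. It cannot tell us that messages are missing from the start or the end of the period, and it does not recover the missing data, so we would still need someone else to supply it.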

Even once we have found an acceptable solution to capture all events/messages to disk, we still need to think about what the EP rules are doing with that data. In a failure scenario, we intend to replay a whole batch of data through the EP engine to get correct output. If the EP rules rely on external databases or web services, we need to think about whether these things will be available at the time when we intend to replay our data. And if we intend to replay our data at a higher speed than it arrived in production, these external systems will see queries come from the EP engine at a faster rate. Will they be ok with this? Finally, where will we replay this data? Our development hardware may be much slower than the production hardware. Will we be able to run this processing after hours (there is no such thing in a 24-hour operation), or will we be able to get powerful hardware quickly enough when a problem occurs?
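
The replay-speed question can be sketched as a small pacing loop. This is only an illustration under stated assumptions: each captured event carries its original timestamp in seconds, the events are already in their original order, and `paced_replay`, `speedup` and `min_gap` are names invented for this example. The `min_gap` parameter is a crude way to cap the rate at which the rules hit external systems.

```python
import time
from typing import Any, Callable, Iterable, Tuple


def paced_replay(
    events: Iterable[Tuple[float, Any]],
    handle: Callable[[float, Any], None],
    speedup: float = 1.0,
    min_gap: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Replay (timestamp, payload) pairs in order, scaling the original spacing.

    speedup > 1 replays faster than production; min_gap is a lower bound, in
    seconds, on the pause between events so external systems are not flooded.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    previous_ts = None
    for ts, payload in events:
        if previous_ts is not None:
            delay = max((ts - previous_ts) / speedup, min_gap)
            if delay > 0:
                sleep(delay)
        # Pass the original timestamp so rules can use event time, not wall-clock time.
        handle(ts, payload)
        previous_ts = ts
```

With `speedup=2.0`, a two-second gap in production becomes a one-second gap in replay, so any database or web service the rules call will see roughly twice the query rate unless `min_gap` holds it back.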

EP solutions can certainly implement important functionality that would not be possible with a traditional database-query based BAM or BI solution. But since EP by nature operates in a time-sensitive manner, it’s important to consider failure scenarios carefully. Considerations include how reliable the output from the EP solution needs to be and how the output is going to be used. Also, when designing the solution, it’s important to keep these scenarios in mind not only for the overall architecture but also when coding the business rules (and/or when thinking about what business rules can and should be allowed).

7 Responses to “Failure scenarios for “real-time” BAM and BI using EP”

  1. John Trigg Says:

    Hans, this is an impressive (and of course daunting) set of problems to consider when implementing an EP replay solution. One of the additional things we are very careful about with event replay (a part of our Apama solution) is the issue of time order and determinism. If the same results are to be collected from a replay of the original event sequence, every time, the replay must ensure that all events are channeled according to the original sequence and original temporal spacing. This is key if I want to see what the conditions were when triggering a fraud action (or a trade or a command to alter production or any other action or alert). See a longer discussion on determinism here … http://apama.typepad.com/my_weblog/2007/09/to-be-or-not-to.html

  2. Hans Says:

    I agree that determinism is a key consideration with playback.

    I just wanted to note that I’m not suggesting that every project solve any or all of these problems. They are meant to highlight some of the issues that even simple failure scenarios can bring up, depending on what the data from the EP solution will be used for. I would suggest that architects think about these issues when developing the intended scope of the EP solution.

  3. More on EP products and BAM/BI « talldude Says:

    [...] point about determinism and its role in recovering from a failure scenario (see his comments on my post). Determinism is not something that’s always so easy to code and Apama is rightfully [...]

  4. Determinism and scalability « talldude Says:

    [...] by Hans under eventprocessing   I got some interesting email responses to my post about BAM failure scenarios, some of which would probably do better in a public forum. I sense that hovering around the fringes [...]

  5. Advice on monitoring trade flow with an EP engine « talldude Says:

    [...] then delays in the network can cause matches to time out and result in lots of alerts. I wrote a post on risks of a project like [...]

  6. Compliance, front running and CEP « talldude Says:

    [...] about using CEP for this kind of thing. As I posted here, real-time monitoring is not all rosy. This kind of change to compliance rules is a very big [...]

  7. EP is real-time data mining? « talldude Says:

    [...] I want to move that processing into an EP engine? After all, generating this data in real-time adds risk, so it had better also provide additional [...]
