I’ve posted a few times over the past week and it’s not because I have so much free time. It’s because I’m studying for some really hard classes and posting seems to clear my head. I’m studying something new and taking breaks to think about something that I know well relaxes my brain. But I’ve promised myself that I’ll keep the writing time very short.

My post about failure scenarios got a few very interesting responses. John over at Apama has a good point about determinism and its role in recovering from a failure scenario (see his comments on my post). Determinism is not something that’s always so easy to code and Apama is rightfully interested in positive qualities that an application can derive from deterministic behavior. My experience is that EP engines that offer determinism inevitably introduce certain threading or processing requirements to guarantee it. That ties in with Opher’s post about problems that can happen with event driven data and potential solutions. He points out that unless you take certain precautions, it’s very easy to wind up with unreproducible behavior (the very mention of which is like nails on a chalkboard to development groups everywhere). In a perfect world (from the developer’s POV), every application would be deterministic. So having an EP product produce deterministic applications by default seems like a good thing to me.

It seems to me that, if nothing else, an EP solution should give its user a framework that reinforces certain key application features. When using an EP product, I should be able to write the logic of my application while worrying mostly about the correctness of my business rules. That’s in contrast to many hand written EP applications, where developers spend lots of time worrying about the impact of their code on issues like performance, threading or even determinism. From the perspective of the business, we would prefer developers to spend all their time worrying about business logic and zero time worrying about anything else. Furthering this goal is, IMO, the point of using an EP application. Of course, lacking magical abilities, nothing that lets the user program in any meaningful way will be foolproof. But the EP product should help the cause very significantly.

Opher thinks that EP vendors should provide good debugging capability and as an EP user, I’m delighted to hear this opinion. In my experience, the difficulty in debugging EP software (and I don’t just mean vendor software here, debugging one’s own EP software is even harder) is so daunting that it often limits EP to the realm of highly technical organizations and teams. Every event driven application that lets business users modify the rules has at least one developer guru in the background troubleshooting complex problems. The easier the debugging, the better for everyone. If an EP product is to be adopted in a large project, I will even go so far as to say that debugging features are key to success.

On a related note, there is an interesting aspect of BAM/BI solutions that I think is often overlooked in public discussion: A BAM/BI solution that works in “real-time” is often processing as many messages per second as the entire rest of the distributed application combined, sometimes more. When you think about it, this is remarkable. Any smart development manager, when asked to implement such a high performance application, will immediately begin wondering just how much processing logic they can include and still meet such a high message rate. Yet every day, EP solutions vendors demonstrate to customers that indeed, their product can take messages from all the servers in their environment and process them into useful BAM and/or BI in near real time. I was really amazed to see this initially. I know high performance applications pretty well and seeing a single dual core server humming away at 40% utilization, running complex matching and detection logic and sucking in messages that are spewed out of a combined 10 or 12 multiprocessor servers running at or over 60%… is just plain neat.

This immediately brings up the question of why the EP solutions (meaning the ones that I’ve seen) perform so well. Maybe in the future I’ll have a little time to write about this topic. Of course, there are plenty of things that explain performance of an application and anyone can use an EP solution to write a slow application (and some people can write a fast application without using an EP solution). But from my experience, a good EP product will guide the logic in such a way that the end result is surprisingly fast. Not to say that tuning in both code and in the engine isn’t required, but in terms of performance, I think that some EP products (only being able to speak from experience with a few) do very well at letting the user implement their logic without worrying so much about the performance impact. For a much simplified explanation of this phenomenon: In many cases with an EP product, you don’t have to choose as many data structures and algorithms as you would when coding by hand. Having to choose less, in my experience, generally means fewer wrong choices.

As Opher says, more on this later. Maybe.

Update: I see that the slug for this post is “more-on-ep-products-and-bambi”. Snicker.

There’s been some talk recently on the CEP-Interest group and in various blogs (for example this recent post on the Apama blog) about doing BI and BAM using EP tools. I happen to be a fan of using EP for such applications and my experience is that there are plenty of cases where a pure database BAM/BI application can’t reproduce the functionality of an EP application. Some of these cases involve the speed of processing and some involve queries that cannot be easily expressed as SQL-style set operations.

But doing BAM and BI in “real-time” does bring about a few problems that are worth discussing. One advantage of a database application is that when it produces obviously incorrect results, you can run the query again. Also with a database solution, if some of the data is missing, it can be loaded in later and the query can be run again. These two scenarios are somewhat different with an EP solution.

The two failure scenarios, then, are: (1) a bug in the rules that causes incorrect results and (2) a problem that causes missing data.

Both of these scenarios are likely to produce incorrect output from an EP application. Now if the output from the EP application is only driving a GUI dashboard, we’ll just see strange things in the GUI. In this case, we can alert users that their dashboard isn’t working right at the moment, so please ignore it.

But what if we want to rely on this data in other ways? Here things can get harder. If I am using EP as the only solution to watch for fraud, can I afford to tell my users that “there was a problem on Tuesday, so we might have missed some fraud on that day”? This might not be received too well.

Also, what if the problem results in thousands and thousands of spurious fraud events? Do we simply discard all fraud events for the period in question? Or will some policy force us to look into every event? That will most definitely get a poor reception.

One potential solution may be to capture every event to disk. Those events can, in theory, be replayed through the engine at a later time. But even here, we can find problems. First of all, now we’re storing all the events and in a big environment, that will be a lot of data. It will especially be annoying if someone in risk analysis determines that the data is important enough to the operation to be available under a disaster recovery scenario. Now we may need to use special disaster-proof storage for our data and this can drive up the cost of the project significantly.
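To make the capture-and-replay idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `EventJournal` class, the one-JSON-object-per-line file format, and the `run_rules` function that stands in for a real EP engine. It shows only the basic shape (append every input event to disk, then feed the stored events through a corrected rule later) and ignores the storage volume and disaster recovery concerns raised above.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Iterator, List


class EventJournal:
    """Append-only log of input events, one JSON object per line (hypothetical format)."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, event: dict) -> None:
        # Write, flush and fsync so that a crash loses at most the event in flight.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event, sort_keys=True) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self) -> Iterator[dict]:
        # Yield events in the order in which they were captured.
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)


def run_rules(events: Iterable[dict], rule: Callable[[dict], bool]) -> List[dict]:
    """Stand-in for the EP engine: return the events that the rule flags."""
    return [e for e in events if rule(e)]


def demo() -> List[int]:
    path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
    journal = EventJournal(path)
    for e in [{"id": 1, "amount": 50}, {"id": 2, "amount": 9000}, {"id": 3, "amount": 12000}]:
        journal.append(e)

    # Suppose the production rule had a bug; replay the journal through the fixed rule.
    def fixed_rule(event: dict) -> bool:
        return event["amount"] > 5000

    return [e["id"] for e in run_rules(journal.replay(), fixed_rule)]
```

Calling `fsync` on every event is the simplest way to make the journal durable, but it is also slow. A real capture component would most likely batch writes, which is exactly the kind of trade-off between cost and reliability discussed here.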

Second, note that if the EP engine itself is capturing the input events to disk, we still have not protected against failure scenario 2. If there is some problem such that the EP engine didn’t get certain bus messages but everyone else in the organization did, then our captured data set will not contain these messages. We must rely on someone else to have this data. Of course, the same thing can happen if we capture messages to a database. But EP solutions are newer and have more moving parts than traditional database capture software. So it is only prudent to think that the chances of having a problem where our EP solution misses messages may be greater than the chance that a database capture system will miss messages.

Even once we have found an acceptable solution to capture all events/messages to disk, we still need to think about what the EP rules are doing with that data. In a failure scenario, we intend to replay a whole batch of data through the EP engine to get correct output. If the EP rules rely on external databases or web services, we need to think about whether these things will be available at the time when we intend to replay our data. And if we intend to replay our data at a higher speed than it arrived in production, these external systems will see queries come from the EP engine at a faster rate. Will they be ok with this? Finally, where will we replay this data? Our development hardware may be much slower than the production hardware. Will we be able to run this processing after hours (there is no such thing in a 24 hour operation), or will we be able to get powerful hardware quickly enough in the case of a problem?
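One way to keep a fast replay from overwhelming external databases or web services is to cap the replay rate. The sketch below is only an illustration under my own assumptions: `replay_throttled` and its `max_per_second` parameter are hypothetical names, and `handle` stands in for whatever feeds an event to the engine. The clock and sleep functions are passed in so the behavior can be checked without actually waiting.

```python
import time
from typing import Callable, Iterable


def replay_throttled(
    events: Iterable[dict],
    handle: Callable[[dict], None],
    max_per_second: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Feed events to handle(), never faster than max_per_second. Returns the count."""
    min_gap = 1.0 / max_per_second
    next_allowed = clock()
    count = 0
    for event in events:
        now = clock()
        if now < next_allowed:
            # Too early: wait until the next slot opens.
            sleep(next_allowed - now)
        handle(event)
        count += 1
        # The next slot opens min_gap after this one (or after "now" if we are running late).
        next_allowed = max(now, next_allowed) + min_gap
    return count


class FakeClock:
    """Test helper: time only advances when sleep() is called."""

    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


def demo() -> tuple:
    fake = FakeClock()
    handled = []
    n = replay_throttled(
        [{"id": i} for i in range(5)],
        handled.append,
        max_per_second=10.0,
        clock=fake.now,
        sleep=fake.sleep,
    )
    return n, len(handled), fake.t
```

With five events and a cap of ten per second, the first event goes through immediately and the remaining four each wait a tenth of a second, so the simulated replay takes about 0.4 seconds. The right cap is something to agree with the owners of the external systems, not something the replay tool can guess.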

EP solutions can certainly implement important functionality that would not be possible with a traditional database-query based BAM or BI solution. But since EP by nature operates in a time sensitive manner, it’s important to consider failure scenarios carefully. Considerations include how reliable the output from the EP solution needs to be and how the output is going to be used. Also when designing the solution, it’s important to keep these scenarios in mind not only for the overall architecture but also when coding the business rules (and/or when thinking about what business rules can and should be allowed).

Mark T. from Coral8 writes about how thinking about EP problems in terms of clouds seems to confuse rather than illuminate in a short post on the Coral8 blog. I have also found this to be the case. Since I’m just a user of EP technology, I don’t have nearly as much broad contact with EP technology customers as some might. But I’ve had many discussions on this topic, brainstorming on uses of EP technology. Over all of these discussions, I have found that more progress has been made thinking about the practical side of EP and extrapolating that into limited theory.

I find that talking about an abstract cloud (a theoretical concept) of all events rarely results in meaningful forward motion. Thinking about an abstract cloud of events generally seems to lead to the conclusion that identifying all possible causality patterns in the cloud will take forever. But once you start to think about what’s really happening in the environment, use cases and patterns crop up immediately. And what’s happening in the environment is that lots of systems are sending around message streams.

As Mark points out, thinking about streams does not mean assuming ordering. Anyone who is used to dealing with high speed distributed message (event) processing understands the issues, like threading, message duplexing, clock drift and performance differences among distributed systems, that make ordering assumptions invalid. The point is that thinking about the streams of messages, the uses for those streams and the relationships between the streams is what usually sets the stage for productive use case brainstorming.

This goes hand in hand with the very useful technique of starting with a set of obvious use cases, incrementally improving them, adding additional cases as they become clear and then factoring out common patterns where applicable. Starting with a use case like “Detect fraud patterns in the event cloud on our various buses” seems to result in a discussion that goes around in circles for an hour. But starting with “Detect this particular set of patterns using these particular message streams coming from those particular systems” often gets the process moving quickly.

For example, let’s say we have the abstract events [A,B,C,D]. We find that a pattern of events [A,B,C, missing D] constitutes fraud. Having identified this pattern, we may now want to think about things that can go wrong with detecting this pattern. Rather than discussing abstract issues of causality, we will probably home in on slow systems, backed up message queues, network problems, etc. These things can affect, for example, the stream of messages of type D. And we have to tailor our detection rules to work with these practical issues. If we had started instead by talking about abstract patterns in the event cloud, it would (in my experience) take much longer to reach a consensus on what the EP system should or could do.
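As a rough illustration of tailoring the [A,B,C, missing D] rule to practical issues, here is a small Python sketch. The names (`MissingDDetector`, `grace_seconds`, the per-transaction `key`) are hypothetical and not taken from any EP product. The rule waits a grace period after the last of A, B and C arrives before declaring D missing, so that a slow system or a backed up queue on the D stream does not immediately produce a false fraud alert. It also does not assume that A, B and C arrive in order.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class MissingDDetector:
    """Flags a key when A, B and C were seen but no D arrived within
    grace_seconds of completing the A/B/C set (hypothetical rule)."""

    grace_seconds: float
    seen: Dict[str, Set[str]] = field(default_factory=dict)
    deadline: Dict[str, float] = field(default_factory=dict)

    def on_event(self, key: str, kind: str, ts: float) -> None:
        kinds = self.seen.setdefault(key, set())
        kinds.add(kind)
        if kind == "D":
            # D arrived, possibly late or out of order: no longer a candidate.
            self.deadline.pop(key, None)
        elif {"A", "B", "C"} <= kinds and "D" not in kinds and key not in self.deadline:
            # Start the grace period when the last of A, B, C shows up.
            self.deadline[key] = ts + self.grace_seconds

    def expire(self, now: float) -> List[str]:
        """Return the keys whose grace period ended without a D."""
        overdue = sorted(k for k, d in self.deadline.items() if d <= now)
        for k in overdue:
            del self.deadline[k]
            del self.seen[k]
        return overdue


def demo() -> List[List[str]]:
    det = MissingDDetector(grace_seconds=5.0)
    # Transaction t1: complete, with D arriving a little late.
    for kind, ts in [("A", 0.0), ("B", 1.0), ("C", 2.0), ("D", 4.0)]:
        det.on_event("t1", kind, ts)
    # Transaction t2: A and B arrive out of order and D never shows up.
    for kind, ts in [("B", 0.0), ("A", 1.0), ("C", 3.0)]:
        det.on_event("t2", kind, ts)
    return [det.expire(7.0), det.expire(8.0), det.expire(100.0)]
```

In the demo, `t1` is never flagged because its D arrives inside the grace period, while `t2` is flagged once its deadline (3.0 + 5.0) passes. Choosing `grace_seconds` is the practical question: it has to be based on how slow the D stream can realistically get, which is exactly the kind of discussion that the stream-oriented view encourages.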

Now note that I’m not saying that design patterns are not useful. Design patterns can be great. For example, there are patterns describing the various stages of event processing and they really do help with thinking about EP. But to get the discussion going and to get the use cases flowing, I find it best to begin with a practical discussion and use theory or patterns where obviously applicable. And in practice, all computer systems communicate in streams and not clouds. Computers may use structures to simulate aspects of a cloud, but all the data arrived at that structure as a stream.