Idea for using SQL EPLs in monitoring
June 24, 2008
I was talking yesterday with a colleague about application monitoring, which led me to think about how an EP engine could help. There are many monitoring products on the market that use rules or even a rules engine to help interpret and respond to events. What about embedding an SQL EPL (event processing language) in there?
An SQL EPL could provide some functionality that is tough to achieve using traditional rules. This could complement a rule-based approach, to provide more flexible monitoring.
The first feature that comes to mind is event pattern matching. Some monitoring products already support certain patterns, but AFAIK, SQL EPLs support more. All of the major SQL EPLs now support detecting several kinds of patterns in incoming events, a simple example being “event A followed by event B within X seconds”. For reference, see StreamBase SELECT FROM PATTERN, Coral8 SELECT FROM MATCHING, Esper SELECT FROM PATTERN.
This kind of pattern matching could allow for more flexible scripted monitoring. Let’s say I want to send a request to a message bus and expect a response from an application on that bus. I have seen solutions that allow for this, but configuring an advanced scenario can be tough (although I have not done much work in this area, so there are many products that I don’t know). But suppose we have a monitoring component that sends the message to the bus and a component that gets messages and sends them to an SQL EPL engine. Now we can declare patterns like “a send, not followed by a bus error, not followed by a matching response within 30 seconds”. Or a more complicated version involving several patterns:
- a send followed by a bus error results in an internal bus-down alert
- a bus-down alert, followed by a send, not followed by a bus error results in a bus-up alert
- a send, not followed by a bus error, not followed by a response within 30 seconds results in a service-down alert
- a send, not followed by a bus error, followed by a response within any time-frame results in an internal service-timing event containing the time between the send and the response
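The patterns above are meant to be declared in an SQL EPL, but the matching logic itself can be sketched in plain Python to make the rules concrete. This is only an illustration under stated assumptions: the `Event` type, the event kind names and the `request_id` correlation key are hypothetical, events for a request are assumed to arrive after its send, and the stateful bus-up rule is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str         # "send", "bus_error" or "response" (hypothetical names)
    request_id: str   # assumed key that correlates a send with its follow-ups
    timestamp: float  # seconds


def detect(events, now, timeout=30.0):
    """Apply three of the rules above to the events seen up to time `now`.

    Returns (alerts, timings): alerts are (name, request_id) pairs and
    timings are (request_id, seconds between send and response) pairs.
    """
    # Remember the first time each kind of event was seen per request.
    first_seen = {"send": {}, "bus_error": {}, "response": {}}
    for event in sorted(events, key=lambda e: e.timestamp):
        first_seen[event.kind].setdefault(event.request_id, event.timestamp)

    alerts, timings = [], []
    for request_id, sent_at in first_seen["send"].items():
        # Rule: a send followed by a bus error is a bus-down alert.
        if request_id in first_seen["bus_error"]:
            alerts.append(("bus-down", request_id))
            continue
        # Rule: a response within any time-frame yields a service-timing event.
        responded_at = first_seen["response"].get(request_id)
        if responded_at is not None:
            timings.append((request_id, responded_at - sent_at))
        # Rule: no response within the timeout is a service-down alert.
        deadline = sent_at + timeout
        answered_in_time = responded_at is not None and responded_at <= deadline
        if now >= deadline and not answered_in_time:
            alerts.append(("service-down", request_id))
    return alerts, timings
```

A real EPL engine would evaluate these patterns incrementally as events arrive rather than rescanning a list, which is a large part of its appeal here.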
Now, add in SQL aggregation (see StreamBase GROUP BY, Coral8 GROUP BY, Esper aggregation in SELECT statements):
- First a pattern: a bus send, not followed by a bus error, followed by a response within any time-frame results in an internal service-timing event containing the time between the send and the response as well as the service-ID.
- Now some aggregation: select service-timing events where response-time is greater than 10 seconds, group by service-ID and count the events within the last minute. This would produce one service-delay-count event per minute, per service-ID.
- And then a basic selection rule: select service-delay-count events where the count is greater than 3 and produce an internal service-delayed event.
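The aggregation and selection steps above amount to a filtered GROUP BY over a one-minute window followed by a threshold check. A rough Python equivalent, assuming service-timing events are hypothetical `(service_id, timestamp, response_time)` tuples and that the window is a fixed minute starting at `window_start`:

```python
from collections import Counter


def service_delay_counts(timings, window_start, window_len=60.0, slow=10.0):
    """Count slow service-timing events per service within one window.

    `timings` is an iterable of (service_id, timestamp, response_time)
    tuples; this tuple layout is an assumption made for the sketch.
    """
    window_end = window_start + window_len
    return Counter(
        service_id
        for service_id, timestamp, response_time in timings
        if window_start <= timestamp < window_end and response_time > slow
    )


def service_delayed(delay_counts, max_slow=3):
    """Select the services whose slow-response count exceeds the threshold."""
    return sorted(s for s, count in delay_counts.items() if count > max_slow)
```

Calling `service_delayed(service_delay_counts(timings, window_start))` once per minute would mirror the chain of service-delay-count and service-delayed events described above.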
And then do some simple statistics with aggregation and windows (see StreamBase CREATE WINDOW, Coral8 CREATE WINDOW, Esper CREATE WINDOW):
- Using aggregation as above, create events containing the mean and variance of service-timing over the past hour, updated every minute.
- Keep the last one of these events in a window of size one. Now we have an up-to-the-minute record of service timing, per service, for the past hour.
- Also, use aggregation to produce events containing the per-service timing for the past ten minutes.
- Join the hourly timing with the ten minute timing (see the JOIN statements from previously mentioned products), and use simple statistical methods to detect the case where the service timing is slowing down to a significant degree.
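The post leaves the "simple statistical methods" open, so the following is just one possible reading: compare the ten-minute mean against the hourly mean, and flag a slowdown when it exceeds the hourly mean by more than `k` standard errors. The function names, the `(timestamp, response_time)` tuple layout and the choice of test are all assumptions of this sketch, and it handles the timings of a single service.

```python
from math import sqrt
from statistics import fmean, variance


def in_window(timings, now, length):
    """Response times whose timestamp falls in the last `length` seconds."""
    return [rt for timestamp, rt in timings if now - length < timestamp <= now]


def is_slowing_down(timings, now, k=3.0):
    """Compare the ten-minute mean against the hourly mean and variance.

    `timings` holds (timestamp, response_time) pairs for one service.
    """
    hourly = in_window(timings, now, 3600.0)
    recent = in_window(timings, now, 600.0)
    if len(hourly) < 2 or not recent:
        return False  # not enough data to estimate a variance
    # Standard error of a mean of len(recent) samples under the hourly variance.
    standard_error = sqrt(variance(hourly) / len(recent))
    return fmean(recent) > fmean(hourly) + k * standard_error
```

In the EPL version, the two windows would be maintained by the engine and the comparison would be the join condition between the hourly and ten-minute streams.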
I think that it’s exciting that all of these scenarios could be detected using an SQL scripting language. The events produced by the detection scripts could be fed into an existing event management solution, a rules engine or whatever.
Some of these techniques are already used with these SQL products to detect stock market patterns, so we are using a known approach to event detection. I’ve seen an SQL system used to detect service issues in an electronic trading scenario, but it was created as a custom project and had to be manually integrated with existing monitoring solutions. I wonder what would happen if an SQL EPL were integrated tightly with a monitoring product, which would make it more friendly to the people who are already involved in monitoring.
This sounds like an exciting prospect for SQL EPLs. At the moment, SQL EPLs are used mostly in custom projects or in high-volume, low-latency applications like electronic trading, custom monitoring (see StreamBase’s case studies in monitoring) or RFID. But if this technology could provide value to more general monitoring, and if it could be tightly integrated with a larger monitoring product, then it would move from a niche solution to a central component in the enterprise architecture.
P.S. I would have linked to the equivalent Aleri features in this post, but I could not find linkable online documentation.
P.P.S. SQL EPLs are not the only possibility here. My understanding is that other specialized event processing products like RuleCore support much of this functionality as well.
June 25, 2008 at 8:14 am
Hans, you are right. ruleCore CEP Server solves these kinds of problems in a very natural way. No surprise here, as ruleCore was designed exactly to solve the kind of problem you describe above.
Providing active event pattern detection capability into an enterprise architecture is what we built ruleCore for.
We have a different model compared to the query model of SQL. It was specially designed just for detection of complex patterns, so we would like to believe we are really good at this.
Other things like algo trading do not come as naturally for us. I would actually advise against it. RuleCore is, as you say, a rather specialized event processing solution focused on event pattern detection.
Basically what you do with ruleCore is to define a view which contains a dynamically updated window into the incoming stream of events. The view contains events which have some common properties you are interested in.
Then you define a situation (consisting of multiple event patterns) which should be detected in the context of this view.
The last step is to create a summary event of all the events that contributed to the detection of the situation. This is done using XSLT, so it can contain basically any type of aggregation of the contributing events or their contents.
All of this is done using declarative XML, so no programming is involved.
June 25, 2008 at 9:09 am
Hi Marco. It does sound like ruleCore would be a good candidate for working with an application monitoring solution. First of all, in this case I think it is better to declare situation detection rather than having to translate the situation into the appropriate SQL. This is especially true since the target audience has a primary goal of monitoring applications and may not be developers. Second, the use of known XML technologies like XSLT makes it easier not only to understand how the tool is supposed to work, but also to provide tooling like a graphical utility for summarizing the situation events.
This kind of application monitoring is not my specialty, but I understand that monitoring an SOA environment can be a real pain. First of all, if you have a message bus, then even connectivity issues can be hard to detect. Not to mention properly detecting the case where a failure of one service causes a cascade of others. I know that there are now some products geared toward solving this problem, but since they don’t put as much focus on fully featured situation detection, my guess is that there is an opportunity to improve on the state of the art.
June 27, 2008 at 8:51 am
An alternative to ruleCore would be a systems/network monitoring solution. These commonly have functions to do event correlation; for example, HP has an event correlator. You typically find a set of packaged correlations in these tools which model typical problems faced while monitoring networks, for example transient error condition suppression. Each of these correlations is rather similar to a situation in ruleCore, but with the difference that the correlations are sort of hardcoded and ready to be used. In ruleCore you would need to create a rule to do the same thing. So with ruleCore you could solve more types of situation detection problems in addition to the pre-packaged correlations.
July 3, 2008 at 12:19 pm
[...] – TIBCO BusinessEvents included; hence at least one answer to Hans Gilde’s recently asked question on why rule-based (monitoring) systems do not include some SQL-based Event Processing Langu… is: actually some already [...]