I posted a Knol about CEP here. It was partly a way to try Knol, which I found to be a little disappointing but mostly usable.
But it was also an exercise to see whether the pieces of CEP fit easily together into something coherent. I don’t spend much time these days thinking about CEP. Actually, the term CEP has never meant much to me because my interest was always in using the CEP software, not the terminology.
And honestly, I have never seen an explanation of CEP that does not leave me with more questions than answers. So since Professor Luckham was kind enough to write a bunch of stuff recently (all of which is referenced from the Knol) I thought I’d see if I could summarize CEP into something that I would want to read.
I didn’t post it on Wikipedia because I already had a bad experience with editing CEP content there. But my Knol content could easily be merged back onto Wikipedia and combined with the CEP content that’s already there. At least if someone who’s never heard of CEP cuts out half of it on Wikipedia, I will still have the Knol.
So feel free to edit the Knol as if it were Wikipedia and feel free to copy the content back onto Wikipedia if you want. I may or may not maintain this Knol, so I encourage anyone with an interest in seeing CEP explained (which, strangely enough, does not include me) to contribute to either the Knol or the Wikipedia page.
So I guess that I’ve never done much serious thinking about EP (event processing) as something that deals with actual events. Marco has several posts on this topic (like this one or that one), and the ideas bounce around my head but have never really fit into place.
But I see that there is momentum behind this concept. So here I’m writing a bit to get my brain started on thinking about the issue properly.
My own experience with event processing basically treats messages as events and so in my head, I have always had this idea that EP is just a new way to talk about message processing. Marco would call my way real-time data processing or something like that, not event processing. And actually, I think that this is a good point. If I’ve been doing message processing for so many years, why would I now call it event processing? I guess it really is that vendors that I like started to call it event processing, and I don’t really care what it’s called, so I go right along.
And at the same time, most of my “message processing” takes into account some underlying model or reason for the messages – meaning that I am definitely considering the real-life implications of getting these messages. I’m not just routing them or running some static process that does not care about why I am getting them or where they came from.
David Luckham (among others) has frequently distinguished between the products available today and what you would really want from event processing. His book, The Power of Events (PoE), proposed a moderately high-level language in which you declare conditions such as causality and use them as part of patterns. This is certainly different from something like streaming SQL or ECA (event-condition-action) rules, where you get in some data and what you do with it is your own business.
Opher Etzion is also working his way toward innovations that in part incorporate the concept of an event being more than a message.
And this seems to be the basis for why Marco’s product ruleCore is different from other products.
So ok, I’m starting to get it. I’m not sure where it all fits together, but I agree that there is potential here. Actually I have never thought that there was no potential, I just had not thought about the problem domain in the same way that the above people had.
I think, though, that many things still need to be reconciled. For example, in PoE, Luckham uses event-based syntax to work with financial transactions, some of which Marco would be hard pressed to call events. The fact is that a stock quote absolutely represents a real event: it indicates that someone out there has made a firm commitment to buy or sell shares. It is also an event with a frustratingly low amount of information – often not including the identity of the counterparty and never including information on why they are so willing to make this trade.
There is some clarification available in the EP Glossary, but I’m not sure that this is yet a full reconciliation of all ideas expressed in the community.
Also, there is the question of whether one can differentiate products based on some notion of treatment of messages as “events”.
There has been an attempt to differentiate products by the ability to do “advanced analytics”, which apparently includes something like a Kalman filter. I can implement a Kalman filter in streaming SQL in a mostly natural way (the hardest part is coding certain matrix operations). So for me, differentiating between streaming SQL and something that uses a Kalman filter is like differentiating between Java and Java with classes that implement a Kalman filter. In other words, not a useful distinction.
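To make the point concrete: a one-dimensional Kalman filter is just a couple of state variables updated once per observation, which is exactly the shape of a user-defined streaming aggregate. Here is a Python sketch (the class name and the variance values are my own illustrative choices, not from any product):

```python
# A minimal 1-D Kalman filter written as a per-event update, the kind of
# logic that could sit inside a streaming-SQL user-defined aggregate.
# q and r are illustrative noise variances, not calibrated values.

class Kalman1D:
    def __init__(self, x0=0.0, p0=1.0, q=1e-4, r=1e-2):
        self.x = x0  # current state estimate
        self.p = p0  # current estimate variance
        self.q = q   # process noise variance
        self.r = r   # measurement noise variance

    def update(self, z):
        """Consume one observation z and return the new estimate."""
        # Predict: the state carries over, variance grows by process noise.
        p_pred = self.p + self.q
        # Update: blend prediction and observation by the Kalman gain.
        k = p_pred / (p_pred + self.r)
        self.x = self.x + k * (z - self.x)
        self.p = (1.0 - k) * p_pred
        return self.x

kf = Kalman1D()
estimates = [kf.update(z) for z in [1.0, 1.1, 0.9, 1.05]]
```

The point is that the whole filter is a handful of arithmetic operations per event plus two pieces of carried state – nothing a streaming engine can’t already express.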
But I think that there is more substance to differentiating based on the treatment of messages as “events” or not. Marco and David talk about causality, and things like a chain of causality from derived events back to received events. If that kind of thing were maintained automatically by the EP engine… well this is more than just adding some extra functionality to a programming language. This is adding a whole framework of causality tracking that operates behind the scenes and that is a basis on which I could see differentiating products.
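For illustration, here is a toy sketch of what engine-maintained causality might look like (all names – `Event`, `derive`, `root_causes` – are hypothetical, not from any actual EP product): every derived event records the events that caused it, so the chain back to the raw received events can be walked automatically.

```python
# Toy causality tracking: derived events keep references to their causes,
# so the chain from any derived event back to the raw received events
# can be recovered without the application doing any bookkeeping.

class Event:
    def __init__(self, name, causes=()):
        self.name = name
        self.causes = tuple(causes)  # empty for raw received events

def derive(name, *causes):
    """Create a derived event, recording its causal parents."""
    return Event(name, causes)

def root_causes(event):
    """Walk the causality chain back to the received (cause-less) events."""
    if not event.causes:
        return [event]
    roots = []
    for cause in event.causes:
        roots.extend(root_causes(cause))
    return roots

tick1 = Event("tick1")
tick2 = Event("tick2")
spike = derive("price_spike", tick1, tick2)
alert = derive("alert", spike)
```

An engine that did this behind the scenes – rather than leaving it to application code like the sketch above – is the kind of framework-level difference I mean.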
Just putting down some basic thoughts here, nothing really concrete. Not sure I have a use for it, but I like where this is going. It is at least fun to think about. Thanks to everyone who regularly posts on this topic.
The Protocol Buffers project from Google is similar to several projects that I’ve seen at the NYSE and at big banks. You define messages to be passed around your environment, then compile those messages into language-specific implementations (classes) that serialize and parse the message format. The message format is binary and designed as a compromise between a fixed format struct (best performance) and XML (most flexible). Actually, it is somewhat like FIX but binary and with more features.
To use the protocol, you get the code that was generated for your programming language/environment and use the classes directly in your own code. So to read a message, you read in one message from the network/bus/whatever and use an instance of the generated class to parse it. To send a message, you use a generated class, serialize it and send the resulting data to the network/bus/whatever.
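For a flavor of what a definition looks like, here is a hypothetical message in the protocol buffer definition language (the message name, field names, and field numbers are made up for illustration):

```proto
// A hypothetical quote message -- fields and numbers are illustrative only.
message Quote {
  required string symbol = 1;
  required double price  = 2;
  optional int64 size    = 3;
}
```

Running this through the compiler produces, say, a Java or Python `Quote` class with accessors plus serialize and parse methods, which is what application code actually touches.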
I like this approach for a few reasons:
It is a standard protocol. Not supported by a standards organization, but still more standard than one guy in R&D pushing his home-brewed protocol.
The implementation for each language is supported by a community, not a little team (or more likely one person) in a company. The biggest problem with these protocols is the maintenance, especially after the guy who wrote it is promoted/fired/quits.
It works with a company-wide process. The message definitions can be kept in a central place and support optional fields and extensions of each message. This lends itself to a “protocol committee” that makes decisions on the company-wide use cases, while allowing individual projects to extend the protocol to fit their needs without breaking the ability to interact with others.
It is easy to release new versions of the protocol. Most users will just want the classes for their language – no need to have much expertise on the protocol itself.
It is testable. You can run the generated classes through unit tests. The protocol committee can release not only the protocol classes, but also test data in the form of serialized messages. Client applications can install the new classes and run unit and regression tests over predefined test data. This can make both regression and backward compatibility testing a more automated process.
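As a sketch of the golden-data idea – here using a trivial length-prefixed encoder in Python as a stand-in for real generated protobuf classes:

```python
# The "golden data" regression pattern: the protocol committee ships
# serialized messages alongside the classes, and client builds assert that
# the new classes still decode them. A trivial length-prefixed encoder
# stands in for generated protobuf classes in this sketch.
import struct

def encode_quote(symbol, price):
    sym = symbol.encode("utf-8")
    return struct.pack(">H", len(sym)) + sym + struct.pack(">d", price)

def decode_quote(data):
    (n,) = struct.unpack_from(">H", data, 0)
    sym = data[2:2 + n].decode("utf-8")
    (price,) = struct.unpack_from(">d", data, 2 + n)
    return sym, price

# "Golden" bytes: produced by a previous release, kept under version control.
golden = encode_quote("IBM", 120.5)

# Regression check: the current decoder must still understand the old bytes.
assert decode_quote(golden) == ("IBM", 120.5)
```

With real protobuf classes the golden files would be serialized messages from the previous protocol release, and the assertion would compare parsed message objects.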
So bottom line: I think that everyone who does low latency messaging should look at Protocol Buffers. Even if it’s not exactly what you need, I think that there are plenty of good lessons to be learned from the design.
A friend of mine has lots of R code and we were talking about ways to speed it up (it works with static data sets). By nature, R is not parallel and we decided to look at parallelizing the code.
There are various levels of parallel, of course, ranging from breaking up portions of the processing pipeline, to parallelizing certain algorithms (on a single computer or on a grid), to rewriting the whole thing in a more parallel-friendly way. There are known techniques (see this paper on Map Reduce for multicore machine learning) for parallelizing frequently used operations, so that seemed like a good place to start. We quickly settled on the easiest method: replacing some R functions with versions written with R/parallel or some such. This will at least parallelize the most time consuming operations and it involves the least amount of code rewriting.
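The pattern we ended up with looks roughly like this, sketched in Python (threads stand in here for R/parallel’s workers; for numeric Python you would want processes or a library that releases the interpreter lock, but the chunk-and-map shape is the same):

```python
# The chunk-and-map replacement pattern: keep the surrounding pipeline as-is
# and route only the expensive per-chunk operation through a parallel map.
from concurrent.futures import ThreadPoolExecutor

def heavy_op(chunk):
    # Stand-in for an expensive numeric kernel (a matrix operation, say).
    return sum(x * x for x in chunk)

def chunked_apply(data, fun, chunks=4, workers=4):
    """Split data into chunks and apply fun to each chunk in parallel."""
    size = max(1, len(data) // chunks)
    pieces = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fun, pieces))

partials = chunked_apply(list(range(100)), heavy_op)
total = sum(partials)
```

The appeal is that only `heavy_op` changes hands; the caller’s code shape stays exactly the same, which is why this was the least invasive option.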
This kind of thing comes up all the time as people try to use their algorithms/research in different ways. The commercial statistical software vendors have caught on here with design guidelines for enabling parallel computing, “parallel toolkit” type addons and grid implementations. The pure-play grid vendors mostly (all?) have lots of common operations pre-gridified for your use. There is less available in open source, but that can’t be too far off.
The project got me thinking again about how many requirements and use cases there are in this area. We just wanted to speed up the processing of static data and weren’t dealing with real-time data. Real-time data has latency or throughput requirements with potentially more severe consequences than those for static data (see this post). So just parallelizing a few key algorithms is frequently not an option here.
Then again, there are plenty of problems (such as the one in question here) that can be decomposed into what amount to a processing pipeline with a series of transformations that use a shared state, and only a few numerically intensive operations (like matrix inversion) involved at certain stages. Even if we were processing real-time data, how much could we get from parallelizing the processing pipeline (as opposed to parallelizing just the obviously numerically intensive operations, as we chose to do)? Just think about how much overhead is involved in sharing state, locking and moving data around between processes/threads just to get a parallel processing pipeline.
I’m pretty much used to building a solution, out of as many pre-built components as I can find/afford, to fit each problem.
But I got to thinking: what could CEP contribute to this area? If I were marketing a CEP product, what features would I want to support? Very tough questions.
Semantic event processing seems exciting (see the blog by Opher Etzion). The goal here is to let me describe my problem semantically and to generate an efficient implementation. I love the idea, and though it seems like science fiction right now (will software really be able to semantically describe and then efficiently solve truly complex problems?), it also seems like the kind of thing that will make a huge impact one day. I guess it’s hard for me to judge how far we are from seeing this kind of thing in action.
Or a CEP product could focus on being essentially a real-time data analysis package (real-time SPSS?). But does anyone want that? I think that I would rather do analysis over a static data set rather than always having to work with a stream.
Or it could focus not on being an analysis tool, but on helping people who have some particular analytics worked out but need to implement them in real-time (this seems to be the current trend of streaming SQL vendors as well as Progress Apama). Or it could focus on being the best tool for managing and understanding day-to-day business rules and leave the implementation of analytics to other packages (BusinessEvents is following this model?).
So many options. Who can say for sure which ones will make money?
What would happen if XML data were not huge and slow to parse? It would be Protocol Buffers, a project recently open sourced by Google.
Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the “old” format.
I’ll bet that every big bank has developed at least two message formats that try to accomplish the goal of this project.
What can it be used for? How about to finally kill FIX in low latency applications (and maybe more).
It can be used via a socket, over a message bus or with a custom shared memory bus, across languages and platforms – all the same format. It can be multicast and understood by many subscribers, using many languages. You can add fields to existing messages without breaking backward compatibility.
I love it already and I only just started playing with it.
Apparently I misinterpreted the Oracle announcement; per the comment from Alex below, Oracle may use Esper going forward. Also, as he says, Esper is gaining well deserved traction in areas outside of their OEM relationship. I still think that, given the effort put into making Esper embeddable, it would make a great combination with a Java based rules engine.
Oracle announced that the BEA CEP offering will not be a “strategic” product going forward. Meaning that they will discontinue it as quickly as possible. The alternative will be the Oracle CEP offering, mentioned recently by Paul at TIBCO.
The BEA offering was using Esper, and this presumably leaves Esper a free agent (?).
I say that it should be integrated with BusinessEvents. The ontology in BE can drive the structure of the streams in Esper. Huh? That’s right, I said it! In BE, you first have to declare your ontology of objects that can be asserted and detected using rules. So add Esper and you also get to declare streams of those same objects. You get moving window calculations that, AFAIK, are better than what’s available in BE. You get the pattern language. And you get the ability to code some stuff in Esper (using joins and windows) in a more performant way than could be done in BE. Then you get to integrate the results of that back into the rules, so that you don’t have to do awful SQL constructs to imitate something that rules do naturally!
Let the convergence begin!
Here’s a response to this post from Marc, who wonders how streaming SQL vendors like Coral8, StreamBase or Aleri could work with SPlus/R/Matlab.
Marc wonders whether one of those languages would be a good candidate for a streaming language. I looked at implementing streaming extensions to R and I did not see a way to truly (without klugery) integrate streaming into the language. Doesn’t mean there’s not a way, just that it’s not as straightforward as it might seem.
Take the function aggregateSeries in S+, which has analogs in R and Matlab. This function works as follows: The return value is a vector/array, which is the default data structure in these languages (a scalar variable is an array of size 1). So we pass in an array and aggregateSeries takes a sub-array corresponding to the first window and applies the function FUN to that array. The function FUN must return a scalar value and that scalar value is stuffed into the first entry in the return value. It proceeds like this to calculate the value of the function FUN applied over subsequent windows (sub-arrays), putting the resulting scalars into subsequent values of the return value. So the return value is an array of values that were returned by FUN when applied to each window in turn.
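The batch semantics just described can be sketched in a few lines of Python (assuming non-overlapping windows, which is the simplest of the groupings aggregateSeries supports):

```python
# Batch window aggregation in the style of S+ aggregateSeries: the whole
# input array is already in hand, fun reduces each window to a scalar, and
# the result is the array of those scalars.

def aggregate_series(data, window, fun):
    return [fun(data[i:i + window]) for i in range(0, len(data), window)]

result = aggregate_series([1, 2, 3, 4, 5, 6], window=2, fun=sum)  # [3, 7, 11]
```

Note that the function receives the entire input up front; nothing in this shape knows how to wait for more data.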
This is not really something that works in a streaming world. Notice that the input to aggregateSeries is already an array. There is no declaration to take an infinite stream and break it into windows. That concept of an infinite stream, to which one can subscribe, is the part that requires the kluging.
So here is the kluge: you have some external process/thread pushing data into the engine. It pushes in a data structure and calls an S or Matlab function on that data structure. If that data structure is a single value, like a tick, then you process single events. And if that data structure is a sliding window… you process sliding windows.
This brings me to how simple it should be to integrate S+/R/Matlab with streaming SQL. I suggested this about two years ago: streaming SQL has the concept of a window. So I just want to be able to push the contents of the window into S+/R/Matlab via the above kluge, every time the contents change. This should be dead simple because (1) the window already exists and (2) the method to push in the data and call a function already exists.
Of course, there would also be a library provided to push a value back to streaming SQL (as in, to insert a value into a stream). Again, this is very easy.
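Putting the pieces together, the whole kluge can be sketched in Python (all names here are mine, purely illustrative): a feed pushes values into a sliding window, the window contents go to the analytic function on every change, and the result is emitted back as a derived stream.

```python
# Sketch of the kluge: an external feed pushes events into a sliding-window
# buffer; on each change the window contents are handed to an analysis
# function (standing in for an S+/R/Matlab call), and the result is pushed
# back out as a derived stream.
from collections import deque

class SlidingWindowBridge:
    def __init__(self, size, fun, emit):
        self.window = deque(maxlen=size)  # sliding window over the stream
        self.fun = fun    # analytic applied to each window (the "R side")
        self.emit = emit  # callback that inserts the result into a stream

    def push(self, value):
        self.window.append(value)
        # Re-run the analytic every time the window contents change.
        self.emit(self.fun(list(self.window)))

out = []
bridge = SlidingWindowBridge(size=3,
                             fun=lambda w: sum(w) / len(w),
                             emit=out.append)
for tick in [10.0, 11.0, 12.0, 13.0]:
    bridge.push(tick)
```

In a real integration, `fun` would be a call into the R or Matlab engine interface and `emit` would insert into a stream, but the plumbing is no more complicated than this.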
I don’t know much about Coral8, but this is probably not too hard to implement yourself as a plug-in to the engine. If you’re using R, you can use JRI to interface with the R engine (to push in a data structure and call a function). The other languages come with their own interfaces.
However, at this point I am at a total loss to understand why this stuff doesn’t come built in from the streaming SQL vendors.