Until now, TIBCO BusinessEvents has been the only rules-based product focussed on event processing. A rules-based product makes quite a lot of sense in this area: most people find rules to be a familiar and easy to understand way to code some common kinds of event logic (“when such-and-so happens, and so-and-so condition already exists, then do XYZ”). Enter Drools Fusion from the JBoss project. While the initial release will clearly not be up to the feature set of BE, it’s got promise. Combined with Esper, this would make a very interesting event processing platform.

I see this kind of thing as helpful to TIBCO and other CEP vendors. It provides a free demo of event processing capabilities to the masses. Then the vendors can step in with capabilities that the free software currently can’t match and pick up some sales.

I have some fun event processing code examples in store (a hidden Markov model and some on-line categorization and partitioning algorithms written in streaming SQL). But I’ve been busy, so instead I’m giving my two cents on WolframAlpha.

Here is my big problem with WolframAlpha: you don’t know what features it has, so you have to guess. The result is that you waste so much time that you get bored.

For example, the site will give me answers for “5 largest countries by area” or “5 largest countries by number of people” or “5 largest countries by GDP” but not “5 largest countries by number of sheep” or “5 largest countries by export of wheat”.

And it gives me no idea of why. Is it because it does not know how many sheep are in every country (or about wheat exports)? Or because it is hard-coded to understand how to rank countries by only a few terms and I am wasting my time by asking it for anything else? I get the impression that it is the latter, which would make this thing much less revolutionary than we have been told.

If only it would give me some feedback, I would know. Maybe it could tell me how it knows to rank countries. Or how it parsed my query, rather than just a generic “dunno what you want” message.

Similarly, it gives no answer for “median population of 10 largest countries”. So why not? It knows population (since it will rank countries by population), it knows how to calculate a median… what’s wrong? Am I phrasing the question wrong? Does it simply have hard coded functions like “rank countries” and “calculate median of input numbers” but will not chain the median function to the data on population of countries? Again, am I wasting my time by trying queries like this?

I’m getting the feeling that the system has some built in functions to operate over its data store (rank countries, compare cities) and it just tries to match your query up to one of these abilities. In this case, all the “cool examples” that are provided on the site are not “examples” at all, they are a list of everything the site can do. This is very frustrating because Wolfram is giving the impression that I should experiment with all kinds of queries and see what I get.

All in all, I wish it would give me more feedback so I can understand whether it fails to answer my queries because I am phrasing them wrong, because it is missing data, or because it has a limited set of features and I am just wasting my time.

Math tricks wiki

April 19, 2009

The Tricki is a Wiki project meant to capture the strategies and “tricks” used to solve math problems. Other math Wiki projects start from definitions, which often link to some proofs related to the definition. The Tricki starts from classes of problems, so you first locate the type of problem you are working on as specifically as possible, and then maybe you will find a strategy for solving it. Great idea!

Lots of front office shops are moving (or have long since moved) to build their order management and position keeping (and other stuff) over a distributed cache. Caches are much better than they used to be, as is our understanding of what caches can and should do. For example, modern “caches” should deploy, distribute and partition processing just as well as they do data.

One interesting idea is to add in a rules engine into this mix. Conceptually, this makes sense because both inference and ECA rules are good at a lot (not all) of the logic that we want to build over a cache. For example, logic like “when XYZ state exists” or “when ABC event happens” then update another object, send an event or trigger another activity. The hard part is making the cache and the rules engine work together efficiently.
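As a rough illustration of the “when ABC event happens and XYZ state exists, then update another object” style of logic, here is a minimal Python sketch of event-condition-action rules evaluated against a plain dictionary standing in for the cache. The names (Rule, RuleEngine, build_position_engine, the "fill" event type, the "limit" key) are all hypothetical and do not come from any product mentioned here; a real distributed cache and rules engine would look quite different.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Cache = Dict[str, Any]
Event = Dict[str, Any]


@dataclass
class Rule:
    """Event-condition-action rule: on an event of event_type, if condition holds, run action."""
    event_type: str
    condition: Callable[[Cache, Event], bool]
    action: Callable[[Cache, Event], None]


@dataclass
class RuleEngine:
    """Toy engine; the dict is a stand-in for a distributed cache (assumption)."""
    cache: Cache = field(default_factory=dict)
    rules: List[Rule] = field(default_factory=list)

    def on_event(self, event: Event) -> int:
        """Evaluate rules in order against the event and return how many fired."""
        fired = 0
        for rule in self.rules:
            if rule.event_type == event["type"] and rule.condition(self.cache, event):
                rule.action(self.cache, event)
                fired += 1
        return fired


def build_position_engine() -> RuleEngine:
    """Hypothetical position-keeping rules: apply each fill, then flag limit breaches."""
    engine = RuleEngine()

    def apply_fill(cache: Cache, event: Event) -> None:
        key = "position:" + event["symbol"]
        cache[key] = cache.get(key, 0) + event["qty"]

    def over_limit(cache: Cache, event: Event) -> bool:
        position = cache.get("position:" + event["symbol"], 0)
        return abs(position) > cache.get("limit", 1000)

    def flag_breach(cache: Cache, event: Event) -> None:
        cache.setdefault("alerts", []).append(event["symbol"])

    # Rule order matters here: the position is updated before the limit is checked.
    engine.rules.append(Rule("fill", lambda cache, event: True, apply_fill))
    engine.rules.append(Rule("fill", over_limit, flag_breach))
    return engine
```

Even in this toy version, every rule evaluation reads and writes the cache, which hints at why making the two work together efficiently is the hard part.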

TIBCO seems to be approaching this issue from another direction with BusinessEvents, per this blog post. They’ve built a distributed cache under their rules engine, to help with distributed rules processing.

If they can incorporate the features of a modern distributed cache with a good rules engine and framework for distributed processing, they could wind up with a killer product for front office infrastructure. Something to keep an eye on at least.

I’ve been playing around with Biocep-R, an OSS project that aims to move the R platform into a whole new realm. I humbly suggest that folks in finance keep an eye on this. It has big potential in many other areas as well (e.g. biostatistics).

It still needs work before it’s ready for prime time use, but the vision and ability to execute seems to be there.

A few pieces from the project:

  • Central control over distributed R engines in a multi-server environment. 
  • Amazon cloud support for the above: “Amazon EC2 virtual machines running R servers can be fired up or shut down to scale up or scale down according to the load…”
  • Central R object repository to allow many engines to cooperate on large computation.
  • The R engines can be persistent or run on demand for certain jobs. Maintain your own R session on a server and connect to it remotely; this is great for individuals with long running R jobs.
  • SOAP/other remote access to R-based services. Submit jobs to the R-cloud from the technology of your choice.  You will not want to submit a huge data set over SOAP, so the mechanism for making this work in practice needs some thought.
  • A Java based GUI to work with all this. Not my favorite code editor, but still good.

One big stumbling block is still, IMO, the performance problems in the R runtime. Also the memory model in R is not the best for large computation; the S-PLUS Big Data feature tries to help with this (not available for R as of yet), but in the long run a more invasive solution may be needed.

So this project has a way to go before becoming the next big thing. But I think it gives a good picture of some next steps in statistical computing.

I just found this great comment from Julian Hyde in my WordPress spam folder.  Sorry Julian, didn’t see it until today. It was in response to my previous post about making progress on streaming SQL.  I am reprinting almost his whole comment:

I am chief architect of SQLstream and I do a bit of technical blogging at http://julianhyde.blogspot.com about what SQLstream is capable of. Not enough, I admit – sometimes it’s a choice between blogging and ‘real work’.

I want to improve the usability of streaming SQL languages, but I think that if we stray too far from relational semantics we will end up with something less declarative, more proprietary (and therefore more difficult to understand by the many folks who have a SQL background and would like to process data in flight), and less maintainable.

I actually read that particular Streambase post with some horror. The problems solved by that post are already solved, much better, by standard SQL and implemented in a few database systems. Streambase have introduced concepts similar to standard SQL concepts but have given them different, and misleading, names. Where they use CREATE SCHEMA, the rest of the world would use CREATE TYPE (standard SQL has a SCHEMA but it means something completely different). What they call TUPLE, standard SQL calls ROW. Wildcard attributes might be a quick win to deploy a project quickly, but you will end up with a project that is brittle: you can’t even add a column without the risk that it will be captured by a wildcard rule somewhere in your application.

I’m not one of those relational bigots who believe that we should remain faithful to every word E.F. Codd wrote in 1970. I believe that SQL systems have been effective because they have a small number of basic operations that can be combined in powerful ways, they allow structures and operations to be specified declaratively so that the system can optimize, because there are standards to allow SQL systems to interoperate, and because there are a lot of IT professionals who understand SQL deeply.

Those principles are as important, if not more so, for problems of streaming data. We may need to add one or two new operators, but the basic operations are applicable to streams and can achieve a lot of power. The SQL standard has some newer elements, such as moving totals, nested relations, XML support, user-defined transforms, and SQL/MED, that are perfect for streaming systems but I have not seen any other streaming SQL vendors exploiting them. At SQLstream we have started with these fundamentals, then added a few key extensions for streaming data.
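(An aside from me, not part of Julian’s comment: for readers who have not met a “moving total”, the Python sketch below shows roughly what a windowed SUM(...) OVER (... ROWS n PRECEDING) computes as rows arrive. The function name moving_total and its parameters are my own, and this is only an illustration of the idea, not how SQLstream or any other product implements it.)

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def moving_total(values: Iterable[float], window: int) -> Iterator[float]:
    """Yield, for each incoming value, the sum of the most recent `window` values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent: Deque[float] = deque()
    total = 0.0
    for value in values:
        recent.append(value)
        total += value
        # Drop the oldest value once the window is full.
        if len(recent) > window:
            total -= recent.popleft()
        yield total
```

Because it is a generator, it works on an unbounded stream just as well as on a list.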

So this is a very interesting perspective, basically taking the opposite view to mine and saying that adding features just for the sake of having features is the wrong way to go. Also he comments on the apparent divergence of StreamBase’s naming conventions from SQL standards – something that I will not comment on other than to say that it would surprise me to find that they are not religiously following previous SQL conventions. Anyway, that is a separate point from whether they have made language “improvements” or are going down the wrong path.

Among many opinions on streaming SQL, Opher also frequently says that it is the wrong starting point for a generic event processing language. He sees SQL as being natural for expressing certain parts of event processing but not all. Just search his blog for “SQL”, there is a good amount of perspective there.

I will try to post some of my own thinking on this at some point in the future. But so far, I am still in favor of StreamBase’s new enhancements from an end user perspective.

Many R users don’t know what the line drawn on an R QQ plot actually represents, and they could be misinterpreting the plot.

Here is an example QQ plot from R, generated from an object of type aov named ‘fit’:

plot(fit, which=2)

[Figure: QQ plot produced by plot(fit, which=2)]

Now there is a line on this image, and one might naturally assume that this line represents the perfect match with the standard normal, leading to one interpretation.

But let’s try plotting the 45 degree line representing a perfect match:

plot(fit, which=2)

abline(0,1)

[Figure: the same QQ plot with the abline(0,1) line added]

Whoa, quite a different interpretation.

The line drawn by default is not the 45 degree line, but rather the line “which passes through the first and third quartiles”.

See “?qqplot” for more info.
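To make the difference concrete, here is a small sketch (in Python rather than R, using only the standard library) of how a line through the first and third quartiles can be computed for a normal QQ plot. The function name quartile_line is my own, and I am assuming the “inclusive” quantile method, which should correspond to R’s default quantile type; treat it as an approximation of the idea rather than a reimplementation of R’s plotting code.

```python
from statistics import NormalDist, quantiles
from typing import Sequence, Tuple


def quartile_line(sample: Sequence[float]) -> Tuple[float, float]:
    """Return (intercept, slope) of the line through the first and third
    quartile points of a normal QQ plot of `sample`."""
    # Sample quartiles (y-axis values); "inclusive" is an assumption about the quantile type.
    q1, _, q3 = quantiles(sample, n=4, method="inclusive")
    # Matching standard normal quartiles (x-axis values), about -0.674 and +0.674.
    std_normal = NormalDist()
    x1, x3 = std_normal.inv_cdf(0.25), std_normal.inv_cdf(0.75)
    slope = (q3 - q1) / (x3 - x1)
    intercept = q1 - slope * x1
    return intercept, slope
```

The 45 degree line corresponds to an intercept of 0 and a slope of 1. The quartile line only coincides with it when the sample’s quartiles happen to match those of the standard normal, so the two lines can tell quite different stories about the same points.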

Opher Etzion recently blogged about one of the fun projects from the MIT Media Lab. The project involves a camera hung around the neck, a projector worn on a hat and a cell phone. The projector projects a screen on a surface in front of you and the camera reads the motion of your hands as they interact with that screen. The result is a gesture based interface for a heads up display. Opher concludes that this particular project is cool, but not really event processing. This is probably because in principle, it very much resembles computer inputs such as a tablet, touch pad, touch screen, iPod or Microsoft Surface.

When I first read about this particular project (actually, what was probably a predecessor or previous form of this technology), I immediately thought about synergies with event processing.

Let’s think about a simplified technique for using gestures in this way:

First we must reduce each picture from the camera into a numeric structure that facilitates efficiently recognizing certain patterns. Then we process sequences of these structures to locate movement, shading, shape or other patterns over time. This involves some pattern recognition and probably smoothing or other jitter/noise reduction techniques. Then we compare various patterns to locate gestures. Finally we take action based on the gestures, in the context of the currently running application.
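The steps above can be sketched in a few lines of Python. This is a deliberately crude toy, not how the Media Lab project works: each frame is assumed to be a small grayscale grid, the “numeric structure” is just the centroid of the bright pixels, smoothing is a moving average over a sliding window, and the only gestures recognized are left and right swipes. All names and thresholds (centroid, detect_swipes, window, min_travel) are my own inventions.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Optional, Tuple

Frame = List[List[int]]  # grayscale image as rows of pixel intensities (assumption)


def centroid(frame: Frame, threshold: int = 128) -> Optional[Tuple[float, float]]:
    """Step 1: reduce a frame to the (x, y) centroid of its bright pixels."""
    xs: List[int] = []
    ys: List[int] = []
    for y, row in enumerate(frame):
        for x, pixel in enumerate(row):
            if pixel >= threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return sum(xs) / len(xs), sum(ys) / len(ys)


def detect_swipes(frames: Iterable[Frame], window: int = 3,
                  min_travel: float = 1.5) -> Iterator[str]:
    """Steps 2 and 3: smooth positions over a sliding window, then emit swipe gestures."""
    recent: Deque[float] = deque(maxlen=window)    # raw x positions
    smoothed: Deque[float] = deque(maxlen=window)  # moving averages of x
    for frame in frames:
        point = centroid(frame)
        if point is None:
            # The hand left the picture, so start over.
            recent.clear()
            smoothed.clear()
            continue
        recent.append(point[0])
        smoothed.append(sum(recent) / len(recent))
        if len(smoothed) == window:
            travel = smoothed[-1] - smoothed[0]
            if travel >= min_travel:
                yield "swipe_right"
                smoothed.clear()
            elif travel <= -min_travel:
                yield "swipe_left"
                smoothed.clear()
```

The final step, taking action in the context of the running application, would simply consume the gesture names that detect_swipes yields.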

A little pondering of the steps above shows us that this process actually has quite a lot in common with many event processing applications in capital markets, fraud, security and intelligence gathering.

Now let’s think about a researcher who would like to work on such gesture systems. There are many components to this research: image processing, noise reduction, pattern recognition, user experience design. But there is also a lot of tricky programming: wiring all the pieces together, collecting and processing sliding windows of data structures, interactive state machines, threading and plenty of other stuff that the researcher would probably prefer not to worry about.

Here is a great opportunity for synergy with event processing. If an event processing system can take some of the more annoying and time consuming programming tasks away from a project like this, it can enable the researcher to work more efficiently on the interesting parts. And that is exactly the kind of role that the EP products of today are meant for.

I was happy to see this post by StreamBase on a feature of the latest version of their streaming SQL language. I hope it begins a move back to interesting technical blogging in the CEP space. Coral8 (prior to being bought) had also done some technical blogging. And while Aleri has an interesting blog, they could (IMO) make it better by adding more technical content about their product and use cases.  Edit: Apama is also doing some technical blogging that I missed because they changed their feed to Feedburner late last year.

The biggest complaint about streaming SQL languages is that, while they simplify many tasks in processing network and streaming data, they make certain other tasks mind-numbingly difficult. For example, maybe I would like to build an arbitrary-length list of numbers in one component and pass it along streams to other components. This task would be dead simple in many (most?) programming languages, but is nearly impossible with most streaming SQL products.
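To show just how simple the general-purpose version is, here is a Python sketch in which one component builds lists of arbitrary length and a second component consumes them from the stream. The component names (collect_bursts, summarize) and the convention of using None as a “gap” marker are made up for the example.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def collect_bursts(readings: Iterable[Optional[float]]) -> Iterator[List[float]]:
    """Upstream component: group readings into lists of any length,
    closing the current list at each None marker."""
    burst: List[float] = []
    for reading in readings:
        if reading is None:
            if burst:
                yield burst
                burst = []
        else:
            burst.append(reading)
    if burst:
        yield burst


def summarize(bursts: Iterable[List[float]]) -> Iterator[Dict[str, float]]:
    """Downstream component: receives whole lists from the stream."""
    for burst in bursts:
        yield {"count": len(burst), "max": max(burst)}
```

Chaining them is just summarize(collect_bursts(readings)); the lists flow from one component to the next with no fixed schema or size limit.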

Part of the problem comes from the database roots of streaming SQL languages. Many products that implement streaming SQL languages use database-like structures under the covers, and those structures do not seem to like arbitrary-size collections being passed around streams.

I am still enthused about Esper which, other than being an impressive example of what one motivated programmer can do, mitigates many of the more annoying problems of streaming SQL. Esper is not bound by database-like data structures: its streaming SQL language interacts with streams of POJOs. The result is much more flexible than other streaming SQL implementations. Of course Esper pays a price for this flexibility, including the potential for garbage collector pauses.

So I am also enthused to see that database-rooted streaming SQL vendors are taking steps to make their languages more programmer-friendly. And blogging about it, no less. I think that the last time there was a blog post about such topics was about Aleri’s SPLASH last year.

I do not know where SQLstream fits into this language usability issue, but I will be interested to find out.

Some comments on recent posts about Complex Event Processing (CEP) products.

False statement #1: CEP software is “smart” or inherently mathematical. False. CEP products are programmer tools for handling messages coming in over the network. They are not mathematical marvels and they are definitely not “smart”. They just help you write event processing logic.

False statement #2: CEP software can’t handle probability or it can only handle simple types of probability. Also false. Again, CEP software is a tool. Nothing stops you from implementing probabilistic reasoning with CEP software. Plenty of people do this. Similarly, nothing stops you from implementing very sophisticated probabilistic reasoning with CEP software. For example, many CEP products come with the ability to manipulate the types of data structures that are frequently used in machine learning algorithms. Mileage may vary, given that different products have different capabilities.
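As a tiny example of what probabilistic reasoning over an event stream can look like, here is a Python sketch that keeps a running Bayesian estimate of a failure probability as success and failure events arrive (a Beta-Bernoulli update starting from a uniform prior). The class name and the scenario are hypothetical, and nothing here is specific to any CEP product; the same few lines of logic could be written in whatever language a given product exposes.

```python
from dataclasses import dataclass


@dataclass
class FailureRateEstimator:
    """Running Beta-Bernoulli estimate of a failure probability.

    The defaults of 1.0 and 1.0 are the pseudo-counts of a uniform Beta(1, 1) prior.
    """
    failures: float = 1.0
    successes: float = 1.0

    @property
    def probability(self) -> float:
        """Posterior mean of the failure probability."""
        return self.failures / (self.failures + self.successes)

    def observe(self, failed: bool) -> float:
        """Update the estimate with one event and return the new posterior mean."""
        if failed:
            self.failures += 1
        else:
            self.successes += 1
        return self.probability
```

Each incoming event calls observe, and a rule or query downstream can act once the estimate crosses whatever threshold the application cares about.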

Fun fact about CEP: While CEP is currently a brand name with little underlying, usable theory… the industry is working on changing this. The Event Processing Technical Society aims to organize many aspects of Event Processing into a cohesive field, and they seem to be making good progress of late.