I’ve been playing around with Biocep-R, an OSS project that aims to move the R platform into a whole new realm. I humbly suggest that folks in finance keep an eye on this. It has big potential in many other areas as well (e.g. biostatistics).
It still needs work before it’s ready for prime-time use, but the vision and the ability to execute seem to be there.
A few pieces from the project:
- Central control over distributed R engines in a multi-server environment.
- Amazon cloud support for the above: “Amazon EC2 virtual machines running R servers can be fired up or shut down to scale up or scale down according to the load…”
- Central R object repository to allow many engines to cooperate on large computation.
- The R engines can be persistent or run on demand for certain jobs. You can maintain your own R session on a server and connect to it remotely, which is great for individuals with long-running R jobs (see the sketch just after this list).
- SOAP/other remote access to R-based services. Submit jobs to the R-cloud from the technology of your choice. You will not want to submit a huge data set over SOAP, so the mechanism for making this work in practice needs some thought.
- A Java-based GUI to work with all of this. Not my favorite code editor, but still good.
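Biocep has its own APIs for that remote-session point, which I won’t reproduce here. But just to convey the flavor of keeping a persistent R session on a server, here is a rough sketch using the separate Rserve/RSclient projects instead; the host name is invented, and big_df is assumed to already live in the server session:

library(RSclient)                                # client library for an Rserve server
con <- RS.connect("r-server.example.com")        # hypothetical remote host running Rserve
RS.eval(con, fit <- lm(y ~ x, data = big_df))    # long-running fit executes server-side
result <- RS.eval(con, coef(fit))                # pull back only the small result
RS.close(con)                                    # the session can keep running server-side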
One big stumbling block, IMO, is still the performance of the R runtime. The memory model in R is also not the best for large computations; the S-PLUS Big Data feature tries to help with this (it is not available for R as of yet), but in the long run a more invasive solution may be needed.
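In the meantime, a common workaround is to process data in manageable chunks rather than loading everything at once, which is roughly the style the Big Data feature automates. A minimal base-R sketch (the file name is made up):

con <- file("big_data.csv", open = "r")   # hypothetical large file
invisible(readLines(con, n = 1))          # skip the header row
total <- 0; count <- 0
repeat {
  lines <- readLines(con, n = 10000)      # read the next chunk of rows
  if (length(lines) == 0) break
  values <- as.numeric(sapply(strsplit(lines, ","), `[`, 1))  # first column
  total <- total + sum(values)
  count <- count + length(values)
}
close(con)
total / count   # mean of a column that was never held fully in memory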
So this project has a way to go before becoming the next big thing. But I think it gives a good picture of some next steps in statistical computing.
I just found this great comment from Julian Hyde in my WordPress spam folder. Sorry Julian, didn’t see it until today. It was in response to my previous post about making progress on streaming SQL. I am reprinting almost all of his comment:
I am chief architect of SQLstream and I do a bit of technical blogging about what SQLstream is capable of. Not enough, I admit – sometimes it’s a choice between blogging and ‘real work’.
I want to improve the usability of streaming SQL languages, but I think that if we stray too far from relational semantics we will end up with something less declarative, more proprietary (and therefore more difficult to understand by the many folks who have a SQL background and would like to process data in flight), and less maintainable.
I actually read that particular Streambase post with some horror. The problems solved by that post are already solved, much better, by standard SQL and implemented in a few database systems. Streambase have introduced concepts similar to standard SQL concepts but have given them different, and misleading, names. Where they use CREATE SCHEMA, the rest of the world would use CREATE TYPE (standard SQL has a SCHEMA but it means something completely different). What they call TUPLE, standard SQL calls ROW. Wildcard attributes might be a quick win to deploy a project quickly, but you will end up with a project that is brittle: you can’t even add a column without the risk that it will be captured by a wildcard rule somewhere in your application.
I’m not one of those relational bigots who believe that we should remain faithful to every word E.F. Codd wrote in 1970. I believe that SQL systems have been effective because they have a small number of basic operations that can be combined in powerful ways, they allow structures and operations to be specified declaratively so that the system can optimize, because there are standards to allow SQL systems to interoperate, and because there are a lot of IT professionals who understand SQL deeply.
Those principles are as important, if not more so, for problems of streaming data. We may need to add one or two new operators, but the basic operations are applicable to streams and can achieve a lot of power. The SQL standard has some newer elements, such as moving totals, nested relations, XML support, user-defined transforms, and SQL/MED, that are perfect for streaming systems, but I have not seen any other streaming SQL vendors exploiting them. At SQLstream we have started with these fundamentals, then added a few key extensions for streaming data.
So this is a very interesting perspective, basically taking the opposite view to mine and saying that adding features just for the sake of having features is the wrong way to go. Also he comments on the apparent divergence of StreamBase’s naming conventions from SQL standards – something that I will not comment on other than to say that it would surprise me to find that they are not religiously following previous SQL conventions. Anyway, that is a separate point from whether they have made language “improvements” or are going down the wrong path.
Among many opinions on streaming SQL, Opher also frequently says that it is the wrong starting point for a generic event processing language. He sees SQL as being natural for expressing certain parts of event processing, but not all. Just search his blog for “SQL”; there is a good amount of perspective there.
I will try to post some of my own thinking on this at some point in the future. But so far, I am still in favor of StreamBase’s new enhancements from an end-user perspective.
Many R users don’t know this and they could be misinterpreting the R QQ plot.
Here is an example QQ plot from R, generated from an object of type aov named ‘fit’:
Now there is a line on this image, and one might naturally assume that this line represents the perfect match with the standard normal, leading to one interpretation.
But let’s try plotting the 45-degree line representing a perfect match:
plot(fit, which = 2)  # redraw the Normal Q-Q plot
abline(0, 1)          # add the 45-degree line
Woah, quite a different interpretation.
The line drawn by default is not the 45-degree line, but rather the line “which passes through the first and third quartiles”.
See “?qqline” (which shares a help page with “?qqplot”) for more info.
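For anyone who wants to reproduce the comparison end to end, here is a minimal self-contained example; the data and model are simulated purely for illustration:

set.seed(42)
df <- data.frame(group = gl(3, 30),
                 y = rnorm(90, mean = rep(c(0, 1, 2), each = 30)))
fit <- aov(y ~ group, data = df)   # an aov object like the ‘fit’ above
plot(fit, which = 2)               # default Normal Q-Q plot, with the quartile line
abline(0, 1, col = "red")          # overlay the true 45-degree line for comparison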
Opher Etzion recently blogged about one of the fun projects from the MIT Media Lab. The project involves a camera hung around the neck, a projector worn on a hat and a cell phone. The projector projects a screen on a surface in front of you and the camera reads the motion of your hands as they interact with that screen. The result is a gesture based interface for a heads up display. Opher concludes that this particular project is cool, but not really event processing. This is probably because in principle, it very much resembles computer inputs such as a tablet, touch pad, touch screen, iPod or Microsoft Surface.
When I first read about this particular project (actually, what was probably a predecessor or previous form of this technology), I immediately thought about synergies with event processing.
Let’s think about a simplified technique for using gestures in this way:
First we must reduce each picture from the camera into a numeric structure that facilitates efficiently recognizing certain patterns. Then we process sequences of these structures to locate movement, shading, shape or other patterns over time. This involves some pattern recognition and probably smoothing or other jitter/noise reduction techniques. Then we compare various patterns to locate gestures. Finally we take action based on the gestures, in the context of the currently running application.
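To make that concrete, here is a toy R sketch of the middle steps — smoothing a noisy stream of hand positions and flagging one gesture. This is not how the Media Lab project actually works; every number and threshold is invented for illustration:

set.seed(1)
# Toy stream: horizontal hand position per frame (still, swipe right, still)
x <- cumsum(c(rep(0, 20), rep(3, 15), rep(0, 20))) + rnorm(55, sd = 2)

# Jitter/noise reduction: moving average over a sliding window of 5 frames
smooth_x <- stats::filter(x, rep(1/5, 5), sides = 1)

# Pattern recognition: sustained positive velocity reads as a "swipe right"
velocity <- diff(smooth_x)
if (sum(velocity > 1.5, na.rm = TRUE) >= 10) cat("gesture: swipe right\n")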
A little pondering of the steps above shows us that this process actually has quite a lot in common with many event processing applications in capital markets, fraud, security and intelligence gathering.
Now let’s think about a researcher who would like to work on such gesture systems. There are many components to this research: image processing, noise reduction, pattern recognition, user experience design. But there is also a lot of tricky programming: wiring all the pieces together, collecting and processing sliding windows of data structures, interactive state machines, threading and plenty of other stuff that the researcher would probably prefer not to worry about.
Here is a great opportunity for synergy with event processing. If an event processing system can take some of the more annoying and time consuming programming tasks away from a project like this, it can enable the researcher to work more efficiently on the interesting parts. And that is exactly the kind of role that the EP products of today are meant for.
I was happy to see this post by StreamBase on a feature of the latest version of their streaming SQL language. I hope it begins a move back to interesting technical blogging in the CEP space. Coral8 (prior to being bought) had also done some technical blogging. And while Aleri has an interesting blog, they could (IMO) make it better by adding more technical content about their product and use cases. Edit: Apama is also doing some technical blogging that I missed because they changed their feed to Feedburner late last year.
The biggest complaint about streaming SQL languages is that, while they simplify many tasks in processing network and streaming data, they make certain other tasks mind-numbingly difficult. For example, maybe I would like to build an arbitrary-length list of numbers in one component and pass it along streams to other components. This task would be dead simple in many (most?) programming languages, but is nearly impossible with most streaming SQL products.
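For contrast, accumulating an arbitrary-length collection per event is a one-liner in a general-purpose language. A trivial R illustration, with all names and values invented:

# Grow an arbitrary-length list of prices as events arrive
state <- list(prices = numeric(0))
on_event <- function(state, price) {
  state$prices <- c(state$prices, price)   # append; length is unbounded
  state
}
for (p in c(101.2, 101.5, 100.9)) state <- on_event(state, p)
state$prices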
Part of the problem comes from the database roots of streaming SQL languages. Many products that implement streaming SQL languages use database-like structures under the covers, and those structures do not seem to like arbitrary-size collections being passed around streams.
I am still enthused about Esper, which, besides being an impressive example of what one motivated programmer can do, mitigates many of the more annoying problems of streaming SQL. Esper is not bound by database-like data structures: its streaming SQL language interacts with streams of POJOs. The result is much more flexible than other streaming SQL implementations. Of course Esper pays a price for this flexibility, including the potential for garbage collector pauses.
So I am also enthused to see that database-rooted streaming SQL vendors are taking steps to make their languages more programmer-friendly. And blogging about it, no less. I think that the last time there was a blog post about such topics was about Aleri’s SPLASH last year.
I do not know where SQLstream fits into this language usability issue, but I will be interested to find out.
Some comments on recent posts about Complex Event Processing (CEP) products.
False statement #1: CEP software is “smart” or inherently mathematical. False. CEP products are programmer tools for handling messages coming in over the network. They are not mathematical marvels and they are definitely not “smart”. They just help you write event processing logic.
False statement #2: CEP software can’t handle probability or it can only handle simple types of probability. Also false. Again, CEP software is a tool. Nothing stops you from implementing probabilistic reasoning with CEP software. Plenty of people do this. Similarly, nothing stops you from implementing very sophisticated probabilistic reasoning with CEP software. For example, many CEP products come with the ability to manipulate the types of data structures that are frequently used in machine learning algorithms. Mileage may vary, given that different products have different capabilities.
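As a trivial illustration of the point, here is a sequential Bayesian update over a toy event stream, sketched in R. The event types and likelihood ratios are entirely made up; the point is only that this kind of logic is a few lines in any tool:

prior <- 0.01                                        # P(fraud) before any events
lr <- c(large_tx = 5, foreign_ip = 3, normal = 0.5)  # made-up likelihood ratios
events <- c("large_tx", "foreign_ip", "normal", "large_tx")

posterior <- prior
for (e in events) {
  odds <- posterior / (1 - posterior) * lr[[e]]      # Bayes update in odds form
  posterior <- odds / (1 + odds)
}
posterior   # P(fraud) after the stream; each event refined the estimate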
Fun fact about CEP: While CEP is currently a brand name with little underlying, usable theory… the industry is working on changing this. The Event Processing Technical Society aims to organize many aspects of Event Processing into a cohesive field, and they seem to be making good progress of late.
I guess that a few sites have picked up this article about how maybe Complex Event Processing (CEP) is creating jobs in financial services.
Not true, AFAIK.
What’s creating jobs is the trendless volatility in the stock market. In this environment, the best way to make money in the short term is with automated day trading. So automated trading is on the rise again, and with that comes programmer jobs.
CEP systems are tools that programmers use to build programs. They are not “smart software” as the article says. They are just programmer tools.
If CEP tools were not around, the same automated trading systems would be built with the tools that existed prior to CEP.
So the rise of automated trading is good news for vendors of CEP software. But other than the few jobs created at these vendors, I don’t see how CEP is creating any jobs.