EII, ETL, ELT, EAI... This alphabet soup of integration technologies was created by industry analysts. I've spent much of my career in this Exx soup, and so have become quite expert in what these things are and are not.
EII stands for Enterprise Information Integration, which doesn't say at all what it is.
EII is SQL Database Federation - it is viewing the problem of data integration using the SQL perspective. That is, what if we could make all data look like it was in one big RDBMS, even though it isn't. Then we'd have a uniform way to query it.
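To make the "one big RDBMS" idea concrete, here is a toy sketch using SQLite's ATTACH DATABASE, which lets one connection query two separate databases with a single SQL statement. This is only an analogy and an assumption on my part: real EII products federate across different servers and vendors, and the table names and data here are made up for illustration.

```python
import sqlite3

# Two separate in-memory databases stand in for two independent data sources.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

# Hypothetical tables: customers live in one database, invoices in the other.
conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO main.customers VALUES (?, ?)",
                 [(1, "Dana"), (2, "Lee")])
conn.executemany("INSERT INTO billing.invoices VALUES (?, ?)",
                 [(1, 100.0), (1, 50.0), (2, 25.0)])

# One uniform query spanning both databases, as if they were a single RDBMS.
rows = conn.execute("""
    SELECT c.name, SUM(i.amount)
    FROM main.customers AS c
    JOIN billing.invoices AS i ON i.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)
```

The query itself does not care where each table lives, which is the appeal of the approach.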
This idea really is great: view all data through the lens of SQL. It works well for query and analysis applications. It is not so great for transactional processing, though there are products on the market that federate and even implement distributed two-phase commit. But nobody would claim the result is a high-performance system.
The problems with EII start with expressiveness - SQL's type system doesn't handle hierarchical data. Or rather, standard SQL doesn't. Unfortunately, an industry consensus never formed around nested tables or any of the other ideas for dealing with hierarchical data from SQL.
EII now includes XQuery (which is SQL-like, but does deal with hierarchy), and systems that view all data as XML.
Another problem with EII is incompleteness. EII is about how you access the data once it's been made syntactically compatible, but it doesn't address how that is achieved. You need another integration technology of some sort to turn non-RDBMS or non-XML data sources into ones that an EII environment can manipulate.
Finally, there are performance issues with EII. As good as SQL optimization technology is, you still can't ignore that federated databases aren't one database, so joins across two separate databases are going to be very expensive. Optimizing XQuery is still a research topic.
Given how inexpensive databases and storage are these days, a commercially viable EII approach is to gather all data into a central database. That is, forget about the federation aspect and just centralize so as to bring SQL to bear in the environment where it performs best.
Generally, EII is a good example of an adage I like: "If you've got a really big hammer, everything starts looking like a nail" :-)
So much for EII.
Now, ETL - Extract, Transform, and Load. This is the technology used to populate databases from non-database sources. It is usually batch-oriented, but is being adapted to more real-time uses.
ETL tools are usually based on a principle called streaming dataflow, as are the SQL engines inside databases, but ETL tools are more extensible. For example, in SQL you can aggregate data with a GROUP BY clause and compute sums, counts, averages, etc. But in an ETL tool you can build a best-of-breed aggregator using your own custom, business-specific algorithm: given all the records representing a customer, take fields from one or another so as to get the best record representing that customer. Consider the gender field in 10 records representing the same customer, and suppose it is blank in a few, Male in a few, and Female in a few. The name is "Dana", which doesn't help. A business-specific heuristic can say that if a particular record comes from, say, the brokerage division of a financial house, then it is more likely to have the gender correct, because the brokers actually talk to the customers on the phone. Try expressing that in SQL!
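The gender heuristic might be sketched roughly like this. It is a minimal illustration, not how any particular ETL product does it: the source names, the trust ranking in SOURCE_RANK, and the record layout are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical trust ranking: a lower number means a more trusted source
# for the gender field. Brokers talk to customers, so brokerage ranks first.
SOURCE_RANK = {"brokerage": 0, "retail": 1, "web": 2}
UNKNOWN_RANK = len(SOURCE_RANK)  # sources not listed are trusted least


def best_gender(records):
    """Pick the gender value from the most trusted source that has one."""
    candidates = [r for r in records if r.get("gender")]  # skip blanks
    if not candidates:
        return None
    best = min(SOURCE_RANK.get(r["source"], UNKNOWN_RANK) for r in candidates)
    values = [r["gender"] for r in candidates
              if SOURCE_RANK.get(r["source"], UNKNOWN_RANK) == best]
    # Among equally trusted records, take the most common value.
    return Counter(values).most_common(1)[0][0]


records = [
    {"name": "Dana", "source": "web", "gender": "Male"},
    {"name": "Dana", "source": "web", "gender": "Male"},
    {"name": "Dana", "source": "retail", "gender": ""},
    {"name": "Dana", "source": "brokerage", "gender": "Female"},
]
print(best_gender(records))  # the single brokerage record outvotes the web ones
```

A plain majority vote would have picked "Male" here; the point of the custom aggregator is that the business knows which source to believe.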
The streaming dataflow that ETL tools use as a basis is a natural candidate for parallel processing to increase throughput. This can be on clustered computers, or just a multi-core CPU. This idiom is one candidate for fixing the problem where programmers don't know how to exploit all these cheaply available CPU cores.
ELT is an alternative to ETL that is more SQL-oriented. It stands for Extract, Load, then Transform. You extract the data, do only what is needed to load it into a database, and then transform it using a number of SQL-based operations. This works when SQL is sufficient to express what needs to be done. In that sense it is like EII.
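As a rough sketch of the ELT pattern, the example below loads extracted values into a staging table untouched, then does all of the cleanup and aggregation in SQL. SQLite is used only because it is convenient; the staging and customer_totals tables and the sample rows are hypothetical.

```python
import sqlite3

# Extract: raw values exactly as they came out of some source system.
raw_rows = [("  dana ", "100.50"), ("LEE", "25"), ("dana", "49.50")]

conn = sqlite3.connect(":memory:")

# Load: store the data as-is (everything as text), with no cleanup yet.
conn.execute("CREATE TABLE staging (customer TEXT, amount TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", raw_rows)

# Transform: normalize names, convert types, and aggregate using SQL only.
conn.execute("""
    CREATE TABLE customer_totals AS
    SELECT lower(trim(customer)) AS customer,
           SUM(CAST(amount AS REAL)) AS total
    FROM staging
    GROUP BY lower(trim(customer))
""")

totals = conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
print(totals)
```

Everything in the transform step is something SQL expresses easily, which is exactly the condition under which ELT works.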
All three so far (ETL, EII, ELT) are what we call set-oriented systems. In such systems, aggregating a set of multiple separate records together to compute a collective result from the set makes sense. Sorting a set of records makes sense so that you can examine them in a particular order. All of these are suitable for massive-scale processing when data is large.
Now let's look at EAI - Enterprise Application Integration. This is the stuff of enterprise message buses. It's about processing individual messages so as to perform a transaction not only in one place, but in all the places where the information that transaction contains is needed.
EAI systems are primarily about processing one message at a time. The messages are not necessarily processed sequentially, but the logic of an EAI system isn't about processing groups of messages together. The idea of taking a set of messages, sorting it, and doing best-of-breed record construction is not part of the EAI mindset. Rather, it is about transforming each message and routing it to the right places in the right formats.
EAI's message-by-message approach can also be scaled up massively, but it can't do the same kinds of things that a set-oriented system can do across records/messages. Or rather, it can't do them the same way. To carry out a transformation involving multiple messages in an EAI system, you process one message and keep state in a data store, typically a database. When the next message of the set arrives, you retrieve the state from the database, compute more, and write the state back. Repeat until you've combined all the messages. (Figuring out which is the last one is a big issue.) Ultimately the answer comes from the last message and its interaction with the state, which is where the accumulated knowledge about the collective set of messages sits.
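The retrieve, compute, write-back loop might look something like the sketch below. It assumes, purely for illustration, that every message carries an "expected" count saying how many messages make up its set, which conveniently sidesteps the hard problem of knowing which message is the last one. The message fields and the state table are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE state (order_id TEXT PRIMARY KEY, seen INTEGER, total REAL)"
)


def handle_message(msg):
    """Process one message; return the combined total once the set is complete."""
    # Retrieve the accumulated state for this set of messages, if any.
    row = conn.execute(
        "SELECT seen, total FROM state WHERE order_id = ?", (msg["order_id"],)
    ).fetchone()
    seen, total = row if row else (0, 0.0)

    # Compute more.
    seen += 1
    total += msg["amount"]

    # Assumption: the message itself tells us how many messages to expect.
    if seen == msg["expected"]:
        conn.execute("DELETE FROM state WHERE order_id = ?", (msg["order_id"],))
        return total

    # Write the state back and wait for the next message of the set.
    conn.execute(
        "INSERT OR REPLACE INTO state VALUES (?, ?, ?)",
        (msg["order_id"], seen, total),
    )
    return None


messages = [
    {"order_id": "A", "amount": 10.0, "expected": 3},
    {"order_id": "B", "amount": 5.0, "expected": 1},
    {"order_id": "A", "amount": 20.0, "expected": 3},
    {"order_id": "A", "amount": 12.5, "expected": 3},
]
results = [handle_message(m) for m in messages]
print(results)
```

Note how messages for different sets can interleave freely; the answer for each set only appears when its last message meets the stored state.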
But here's the rub with EAI. When messages contain complex hierarchical data, e.g., an XML document, the transformations on the detail levels of the hierarchy are sometimes set-oriented. E.g., given a purchase order, what is the total of all the parts being ordered? This is not an item-by-item operation on each line-item of the purchase order, it is an operation on the whole set of line-items for a purchase order.
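For instance, a single purchase-order message might be handled as in the sketch below, where the total is an aggregate over the set of line-items inside the one message. The XML layout and element names are invented for the example, and I am assuming "total" means total cost (quantity times unit price).

```python
import xml.etree.ElementTree as ET

# A hypothetical purchase-order message; the element names are made up.
PO_XML = """
<purchaseOrder id="PO-1">
  <lineItem><part>bolt</part><quantity>10</quantity><unitPrice>0.25</unitPrice></lineItem>
  <lineItem><part>nut</part><quantity>10</quantity><unitPrice>0.50</unitPrice></lineItem>
  <lineItem><part>bracket</part><quantity>2</quantity><unitPrice>4.00</unitPrice></lineItem>
</purchaseOrder>
"""


def order_total(xml_text):
    """Aggregate over the whole set of line-items in one message."""
    root = ET.fromstring(xml_text)
    return sum(
        int(item.findtext("quantity")) * float(item.findtext("unitPrice"))
        for item in root.findall("lineItem")
    )


print(order_total(PO_XML))
```

The message is still processed on its own, but the work done inside it is a small set-oriented aggregation.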
So an EAI system needs to be able to transition from the message by message model of transformation and routing into the set-oriented notion for dealing with the hierarchical content of those messages. This set-oriented stuff is much more like what is typically and readily expressed in ETL/EII systems.
Lately some new acronyms have been added to the soup. (They don't begin with "E", thankfully.) One is CEP - complex event processing. This is EAI with a different means of expressing the processing, namely rule languages. Another thing (no acronym) is sensor network streams - this is similar to the real-time ETL systems, but emphasizes that the messages are coming from sensors and the input rate is so fast that it might even be OK to lose a message here or there if needed to keep up overall.
So, my conclusion is that the alphabet soup of EII, ETL, EAI, and ELT isn't really very useful anymore. You have SQL-based approaches, and non-SQL approaches. You need more than the expressive power of just SQL in many cases, and you need more than message-by-message processing as well.
An ideal system has to have the expressive power of the ETL systems, but be designed for always-on online applications, not just batch. It must handle hierarchical data naturally. It must be extensible to allow ad-hoc aggregations, not just the fixed library of SQL or XQuery. It must make simple online message by message processing very simple to express.
But... to my knowledge nobody makes one of those. They're all stuck in one of the Exx boxes and provide only what that one Exx box narrowly provides.
Maybe some open source provider, someone who doesn't care about the Exx categories, will fix this someday.