I chair a standards committee/workgroup within an organization called the Open Grid Forum. The workgroup is on something called Data Format Description Language, or DFDL, which you can pronounce "Daffodil" if you want.
DFDL is about facilitating data interchange, which is critical to most computing and to data integration for BI applications in particular.
Many people ask why DFDL is needed in an era where there are so many standard data formats available (e.g., why not just use XML?). There are a number of social phenomena in the way software is developed which have led to the current situation, where DFDL is needed to standardize the description of diverse data formats.
First, programs are very often written speculatively, that is, without any advance understanding of how important they will become. Appropriately given this situation, little effort is expended on data formats since it remains easier to program the I/O in the most straightforward way possible given the programming tools in use. Even something as simple as using an XML-based data format is harder than simply using the native I/O libraries of a programming language.
At some point however, it is realized that the program is important because either lots of people are using it, or it has become important for business or organizational needs to start using it in larger scale deployments. At that point it is often too late to go back and change the data formats. For example, there may be real or perceived business costs to delaying a deployment of a program for a rewrite just to change the data formats, particularly if such rewriting will reduce performance of the program and increase costs of deployment. (It takes longer to program, but at least it's slower when you are done ;-)
Additionally, the need for data format standardization for interchange with other software may not even be clear at the point where a program first becomes 'important'. Eventually, however, the need for data interchange with the program becomes apparent. At that point, you look back at the data format and maybe it's not too complex yet. So you don't re-engineer it. But add a year or so of evolution of the software with the attendant changes here and there to the data formats and suddenly you have a real problem.
The above phenomena are not going away any time soon. There are, of course, efforts to integrate standardized data format handling much more smoothly into programming languages. But it is very unclear whether these will catch on, and there is, regardless, a role for DFDL since it allows after-the-fact description of a data format.
DFDL is also needed for performance reasons. At the hairy edge of computing people are always trying to process ever more data to gain some competitive advantage. At this edge, the performance penalty from using verbose data formats like XML can become very burdensome.
Lastly, there's this problem with data format debugging. I will elaborate a bit here because this is a hidden tax on every integration project where there is data in files. I get emails from field engineers trying to help customers with data format problems fairly often. Here's a typical one:
Problem: EBCDIC file, no Cobol file definition. Customer is telling me that it contains 5 fields. Also tells me that the first 3 fields are unsigned, fourth field is Packed, 5th field is 'EBCDIC' (which makes no sense). Based on this a row looks like this:
149,13568,0,4,"UUNR LEA "
Customer says the row should look like this:
95,35,48,222038109,"UUNR LEA IN"
Here are a couple of records dumped out in hexadecimal:
0095350000000000000048222038109fe4e4d5d940d3c5c140c9d540000f
0095350000000951500501088322722fd2e4d9e3e94b4bf0d9f64b4b002f
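As a small illustration of where the wrong row probably came from, here is a minimal Python sketch. It assumes the first two fields were read as 2-byte big-endian unsigned binary integers; those widths are an assumption, since the customer's description only says "unsigned". Under that assumption, the first four bytes of the first record give exactly the 149 and 13568 seen in the bad row.

```python
# First four bytes of the first record in the hex dump.
prefix = bytes.fromhex("00953500")

# Assumption: each "unsigned" field was read as a 2-byte big-endian binary integer.
first = int.from_bytes(prefix[0:2], "big")   # bytes 00 95
second = int.from_bytes(prefix[2:4], "big")  # bytes 35 00

print(first, second)  # 149 13568
```

The same bytes read one decimal digit per nibble give 0095 and 35, which is much closer to the 95 and 35 the customer expects.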
Figuring out problems like this is what I call data archeology. Most people can't imagine that this sort of work is still needed routinely in data integration projects. This is truly geeky stuff, I mean hex dumps?!
A tool, a data format debugger, is badly needed to address this.
Looking at the above hex and applying some mental heuristics, together with the information the customer provided, lets me chop up the hex like so:
0095 35 0000000000000048 222038109f e4e4d5d940d3c5c140c9d540 000f
0095 35 0000000951500501 088322722f d2e4d9e3e94b4bf0d9f64b4b 002f
Analysis: Looks like the "unsigned" fields are also packed. I.e., what they meant by "unsigned" was "unsigned packed", not unsigned base 2 binary. Seems like they also didn't tell me about all the fields. There's one more field at the end.
Unsigned packed 4 digits
Unsigned packed 2 digits
Unsigned packed 16 digits
Unsigned packed 9 digits (the final "F" nibble is padding)
Fixed-length 12-character string, EBCDIC character set
Unsigned packed 3 digits (the final "F" nibble is padding)
It's also possible that the fields with the "F" nibbles are signed packed numbers but from a non-IBM mainframe Cobol compiler. (IBM standard Cobol uses C and D as signs, and F as padding for odd-length unsigned packed decimal I believe.)
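The field analysis above can be checked mechanically. What follows is a minimal Python sketch, not a real data format debugger. The field names are hypothetical, the byte widths are read off the chopped hex, and code page 037 (Python's `cp037` codec) is assumed for the EBCDIC text because the customer did not say which EBCDIC variant is in use. A trailing nibble of B or D is treated as a negative sign and any other non-digit trailing nibble as positive or padding; that follows the IBM convention and is also an assumption here.

```python
# Layout inferred from the chopped hex: (hypothetical name, width in bytes, kind).
LAYOUT = [
    ("field1", 2, "packed"),
    ("field2", 1, "packed"),
    ("field3", 8, "packed"),
    ("field4", 5, "packed"),
    ("field5", 12, "ebcdic"),
    ("field6", 2, "packed"),
]

def unpack_decimal(raw: bytes) -> int:
    """Decode packed decimal: one digit per nibble, optional trailing sign/pad nibble."""
    nibbles = [n for byte in raw for n in (byte >> 4, byte & 0x0F)]
    sign = 1
    if nibbles[-1] > 9:
        last = nibbles.pop()
        if last in (0xB, 0xD):  # assumed negative sign nibbles
            sign = -1
    if any(n > 9 for n in nibbles):
        raise ValueError(f"not packed decimal: {raw.hex()}")
    value = 0
    for n in nibbles:
        value = value * 10 + n
    return sign * value

def parse_record(record: bytes) -> list:
    """Split a fixed-length record by LAYOUT and decode each field."""
    expected = sum(width for _, width, _ in LAYOUT)
    if len(record) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(record)}")
    fields, pos = [], 0
    for _name, width, kind in LAYOUT:
        chunk = record[pos:pos + width]
        if kind == "ebcdic":
            fields.append(chunk.decode("cp037"))  # assumed EBCDIC code page
        else:
            fields.append(unpack_decimal(chunk))
        pos += width
    return fields

# The two records, with spaces at the field boundaries (bytes.fromhex skips them).
RECORDS = [
    bytes.fromhex("0095 35 0000000000000048 222038109f e4e4d5d940d3c5c140c9d540 000f"),
    bytes.fromhex("0095 35 0000000951500501 088322722f d2e4d9e3e94b4bf0d9f64b4b 002f"),
]

for rec in RECORDS:
    print(parse_record(rec))
```

For the first record this yields 95, 35, 48, 222038109 and "UUNR LEA IN " (with a trailing space), which matches the row the customer expected, plus a sixth value of 0 from the field they did not mention.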
I hope you are now as sick of this example as I am... phew!
Now, pulling up out of those gritty details... The depth of experience and knowledge needed to pull off this kind of data archeology is pretty expensive to hire, and it is a huge tax on our industry every time this sort of thing comes up, which is far more often than people imagine.
Unfortunately, fixing data format issues is not a sexy, high-value-adding feature for a software product. Yes, these issues are costly for customers. But no customer says "Gee, I'm happy to pay another bunch of dollars to have a data format debugger." It's not in their top 10 new feature requests because it is hard for them to imagine the huge cost that is repeatedly sunk into mundane data archeology in a single major data warehousing or BI initiative.
Most people who are naive to this issue figure that each of the major vendors has a proprietary data format language in its product portfolio. The reality is that the larger vendors have half a dozen or more. The difficulty is that the lack of a standard for data format description means that all these separate software bases are being maintained to do roughly the same thing, and no one product team can afford to invest in a really good data format debugger since it would work only with that one product's proprietary data format system. As a result, nobody gets a data format debugger that is really any good, and we all pay the tax again and again.
A standard for DFDL fixes this. A data format debugger pays dividends across multiple products if they are using a common DFDL, and so the investment to make one is worthwhile even though it's not a sexy new feature that customers are asking for.
I did overstate above when I said no customers ask for this. There are some educated consumers who ask for DFDL. I started work on DFDL when a customer made it clear that they would buy more software from me if the data format description language it used was standard, not proprietary. This is the best reason to promote a standard: not to commoditize some software, but to enable increased usage. A standard DFDL would allow them to interoperate across software from multiple vendors. They could invest safely in tools that leveraged this common data format description language. At the time we were dealing with interoperation with SAS, which has its own data format descriptions. We were struggling with moving data from my software (this was Ascential at the time) to and from SAS (as in The SAS Institute) applications. When you have a record format with 700+ fields in it, and there's an error somewhere in the middle, debugging it is pretty hard. Eventually we found the problem, but the customer learned that the effort involved in using these pieces of software together and dealing with their proprietary data format languages was not worth it. If a standard data format language were shared, then they could use these software packages for more things, and use them together easily.
You can learn more about the DFDL standard here.