There is now an emerging standard intended to solve a pervasive problem in data interchange and data integration. The standard is called DFDL (pronounced "daffodil"), which stands for Data Format Description Language. It is about exposing data so that there is robust programmatic access to it, based on a standard description of the format.
The point of DFDL is to fix the way we solve the problem of the syntax of data, i.e., these 4 bytes are a number, these next 16 are a character string, these next 2 are a date, etc. Even in data formats where the data is all textual, the formats can still be utterly confusing. DFDL describes all these details declaratively, and a DFDL parser takes data and provides access to those numbers and strings, so that applications that need to understand what the data means don't get bogged down in figuring out which bytes mean what. A DFDL "unparser" performs the inverse operation, writing data out in the described format.
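To make that concrete, here is a sketch in Python of the imperative byte-slicing a DFDL description would replace. The record layout here (a 4-byte integer, a 16-byte string, an 8-byte text date) is invented for illustration; with DFDL the same layout would be stated once, declaratively, instead of being buried in code like this:

```python
import struct

# Hypothetical record layout (invented for illustration): a 4-byte
# big-endian integer, a 16-byte ASCII string padded with NULs, and an
# 8-byte date in YYYYMMDD text form -- exactly the kind of "these 4
# bytes are a number, these next 16 are a string" knowledge that
# usually lives only in application code.
RECORD = struct.Struct(">i16s8s")

def parse_record(raw: bytes):
    number, name, date = RECORD.unpack(raw)
    return number, name.decode("ascii").rstrip("\x00"), date.decode("ascii")

raw = struct.pack(">i16s8s", 42, b"WIDGET", b"20240101")
print(parse_record(raw))  # (42, 'WIDGET', '20240101')
```

Every program that touches this data must carry an equivalent of that `struct` format string; a DFDL description centralizes it so parsers can be driven from the description instead.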
Many people ask why DFDL is needed in an era where there are so many standard data formats available (e.g., why not just use XML?). There are a number of social phenomena in the way software is developed which have led to the current situation where DFDL is needed to standardize the description of diverse data formats.
First, programs are very often written speculatively, that is, without any advance understanding of how important they will become. Appropriately, given this situation, little effort is expended on data formats, since it remains easier to program the I/O in the most straightforward way the programming tools in use allow. Even something as simple as using an XML-based data format is harder than simply using the native I/O libraries of a programming language.
At some point however, it is realized that the program is important because either lots of people are using it, or it has become important for business or organizational needs to start using it in larger scale deployments. At that point it is often too late to go back and change the data formats. For example, there may be real or perceived business costs to delaying a deployment of a program for a rewrite just to change the data formats, particularly if such rewriting will reduce performance of the program and increase costs of deployment. (It takes longer to program, but at least it's slower when you are done ;-)
Additionally, the need for data format standardization for interchange with other software may not even be clear at the point where a program first becomes 'important'. Eventually, however, the need for data interchange with the program becomes apparent. At that point, you look back at the data format and maybe it's not too complex yet. So you don't re-engineer it. But add a year or so of evolution of the software with the attendant changes here and there to the data formats and suddenly you have a real problem.
The above phenomena are not going away any time soon. There are of course efforts to much more smoothly integrate standardized data format handling (e.g., XML) into programming languages. But it is very unclear whether these will catch on, and there is, regardless, a role for DFDL since it allows after-the-fact description of a data format.
DFDL is also needed for performance reasons. At the hairy edge of computing people are always trying to process ever more data to gain some competitive advantage. At this edge, the performance penalty from using verbose data formats like XML can become very burdensome. DFDL can be used to describe densely packed bit-optimized formats.
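As a rough, made-up illustration of that density difference, the same three values cost a few times more bytes as XML text than as a packed binary layout:

```python
import struct

# A made-up record of three 32-bit counters, encoded once as packed
# big-endian binary and once as XML text.
values = (1234, 56789, 42)

binary = struct.pack(">3I", *values)
xml = ("<rec><a>%d</a><b>%d</b><c>%d</c></rec>" % values).encode("ascii")

print(len(binary), len(xml))  # 12 43
```

A bit-optimized format can go denser still (packed decimal, sub-byte fields), and DFDL is designed to be able to describe those layouts too.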
Lastly, there's this problem with data format debugging. I will elaborate a bit here because this is a hidden tax on every integration project where there is data in files. I get emails from field engineers trying to help customers with data format problems fairly often. Here's a typical one:
Problem: EBCDIC file, no Cobol file definition. Customer is telling me that it contains 5 fields. Also tells me that the first 3 fields are unsigned, fourth field is Packed, 5th field is 'EBCDIC' (which makes no sense). Based on this a row looks like this:
149,13568,0,4,"UUNR LEA "
Customer says the row should look like this:
95,35,48,222038109,"UUNR LEA IN"
Here are a couple of records dumped out in hexadecimal:
0095350000000000000048222038109fe4e4d5d940d3c5c140c9d540000f
0095350000000951500501088322722fd2e4d9e3e94b4bf0d9f64b4b002f
Figuring out problems like this is what I call data archeology. Most people can't imagine that this sort of work is still needed routinely in data integration projects. This is truly geeky stuff, I mean hex dumps?!
A tool, a data format debugger, is badly needed to address this.
Looking at the above hex, applying some brain heuristics, and combining that with the information the customer provided, I can chop up the hex like so:
0095 35 0000000000000048 222038109f e4e4d5d940d3c5c140c9d540 000f
0095 35 0000000951500501 088322722f d2e4d9e3e94b4bf0d9f64b4b 002f
Analysis: Looks like the "unsigned" fields are also packed. I.e., what they meant by "unsigned" was "unsigned packed decimal", not unsigned base-2 binary. It seems they also didn't tell me about all the fields: there's one more field at the end.
Field 1: unsigned packed, 4 digits
Field 2: unsigned packed, 2 digits
Field 3: unsigned packed, 16 digits
Field 4: unsigned packed, 9 digits (the final "F" nibble is padding)
Field 5: fixed-length 12-character string, EBCDIC character set
Field 6: unsigned packed, 3 digits (the final "F" nibble is padding)
It's also possible that the fields with the "F" nibbles are signed packed numbers but from a non-IBM mainframe Cobol compiler. (IBM standard Cobol uses C and D as signs, and F as padding for odd-length unsigned packed decimal I believe.)
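The layout worked out above can be checked with a short Python sketch. This is a hand-rolled decoder written for this one record, not actual DFDL tooling, and it assumes the IBM sign-nibble convention just discussed (C/F positive, D negative) and the US EBCDIC code page (`cp037`):

```python
def unpack_packed(raw: bytes) -> int:
    """Decode a packed-decimal field. A trailing C or F nibble means
    positive (F also serves as padding on unsigned fields), D means
    negative; any other final nibble is treated as a plain digit."""
    digits = raw.hex()
    sign = 1
    if digits[-1] in "cf":
        digits = digits[:-1]
    elif digits[-1] == "d":
        sign = -1
        digits = digits[:-1]
    return sign * int(digits)

# Field widths in bytes, per the analysis above: four packed fields of
# 2, 1, 8, and 5 bytes, a 12-byte EBCDIC string, then a trailing
# 2-byte packed field the customer didn't mention.
def parse_row(rec: bytes):
    fields = (rec[0:2], rec[2:3], rec[3:11], rec[11:16], rec[16:28], rec[28:30])
    packed = [unpack_packed(f) for f in fields[:4]]
    text = fields[4].decode("cp037").rstrip()  # cp037 = EBCDIC (US) codec
    return (*packed, text, unpack_packed(fields[5]))

rec = bytes.fromhex("0095350000000000000048222038109f"
                    "e4e4d5d940d3c5c140c9d540000f")
print(parse_row(rec))  # (95, 35, 48, 222038109, 'UUNR LEA IN', 0)
```

The first record decodes to exactly the row the customer expected, plus the undisclosed sixth field. A DFDL description would capture the same widths, representations, and encodings declaratively, so no one would have to rediscover them from a hex dump.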
I hope you are now as sick of this example as I am... phew!
Now, pulling up out of those gritty details: the depth of experience and knowledge needed to pull off this kind of data archeology is expensive to hire, and it is a huge tax on our industry every time this sort of thing comes up, which is far more often than people imagine.
Unfortunately, data format issues are not a sexy, high-value adding feature for a software product. Yes, these issues are costly for customers. But no customer says "Gee I'm happy to pay another bunch of dollars to have a data format debugger." It's not in their top 10 new feature requests because it is hard for them to imagine the huge cost that is repeatedly sunk into mundane data archeology in a single major data warehousing or BI initiative.
Most people who are naive to this issue figure that each of the major vendors has one proprietary data format language in its product portfolio. In reality, the larger vendors have a half-dozen or more. The difficulty is that the lack of a standard for data format description means all these separate code bases are maintained to do roughly the same thing, and no one product team can afford to invest in a really good data format debugger, since it would work only with that one product's proprietary data format system. As a result, nobody gets a data format debugger that is really any good, and we all pay the tax again and again.
A standard for DFDL fixes this. A data format debugger pays dividends across multiple products if they share a common DFDL, so the investment to build one is worthwhile even though it's not a sexy new feature that customers are asking for.
I did overstate above when I said no customers ask for this. There are some educated consumers who ask for DFDL. I started work on DFDL when a customer made it clear that they would buy more software from me if the data format description language it used was standard, not proprietary. This is the best reason to promote a standard: not to commoditize some software, but to enable increased usage. A standard DFDL would allow them to interoperate across software from multiple vendors, and they could invest safely in tools that leveraged this common data format description language.

At the time we were dealing with interoperation with SAS (as in the SAS Institute), which, like every software system, has its own data format description system. We were struggling with moving data from my software (this was Ascential at the time) to and from SAS applications. When you have a record format with 700+ fields and there's an error somewhere in the middle, debugging it is pretty hard. Eventually we found the problem, but the customer learned that the effort involved in using these pieces of software together, with their proprietary data format languages, was not worth it. If a standard data format language were shared, they could use these software packages for more things, and use them together easily.
You can learn more about the DFDL standard here.