June 02, 2009

DFDL Accelerated - and GMTFD!

I co-chair a standards effort about data formats: describing them to ensure interoperability and such. I posted about this previously.

I just wanted to advise folks that DFDL (Data Format Description Language) efforts are being accelerated by major vendors of late. So this standard is no longer languishing in the slow efforts of devoted committee members working extra hours outside their day jobs. People are really being paid now to make this reality.

I am frankly thrilled to see this finally happen. DFDL is important to interoperability, high performance computing, and bridging the very rich legacy of computing to the future where the data will be everywhere providing new value to businesses!

There's this whole problem with data in enterprises. IT shops have governed and taken stewardship over it ... to death.  Most enterprises can't actually do data governance well, nor data stewardship - how many people understand the difference? (Hint: Good data governance requires people to be good data stewards. Yeah, ok, but what does that mean?) How many organizations can afford to dedicate people's time and attention to these issues?

Just extracting data from many systems is HARD. That's where DFDL comes in. Forget about being stewards of data.... How about getting it into the hands of people who can properly interpret and infer things from it? Yes, it cannot be trash if people are to interpret it. But it needn't be perfect either!

Reality: most businesses need to give data over to outside organizations who will give back to them information providing valuable business insight. Trying to govern use of data too much is just silly. Failure of imagination - you can't anticipate everything that can be learned from your data.

Warning/Caveats/Alert: Keep data privacy rules and security in mind....but outside of that, let the data go!

GMTFD is my new acronym. By analogy to the now legendary (at least on T-shirts) acronym RTFM,  my acronym stands for "Give Me the Freakin' Data".

Given data, smart outside organizations can provide businesses with just incredible insights that would simply be unavailable if they tried to do this work in-house. (www.oco-inc.com)

Open Data Standards - Relay to New Blog

I remain the Chief Bias Officer, biased because I believe in what I do, but unbiased in that my opinions are not strictly corporate or self-serving. I keep my clients' interests at heart. That said, there are contexts where my opinion is clearly that of my company, where my Chief Bias Officer opinion and my opinion as Oco CTO are aligned. In those cases I'll post on the Oco IT blog. Please check out this post:

Open Standards, SaaS Business Intelligence, SQL, and Cloud Computing

It's about the whole "lock in" fear of SaaS, recently popularized by The Economist. This fear is not really justified so long as you ask your providers to support open data standards.

December 23, 2008

Predictions for 2009 for Business Intelligence

Time to predict the future a bit. The trends we see in the evolution of Business Intelligence are going to be amplified by the current economic climate.

1.     Growth of Complementary BI Solutions

Companies have long attempted to standardize using a single application or approach to address all of their technology needs in a specific area.  However, BI in particular, cannot be effectively implemented using a single standardized approach.  There are too many different capabilities, vertical-specific requirements and deployment models to be restricted to one standard BI tool from a single vendor.  Driven by diverse organizational needs, a growing number of companies will recognize that one size does not fit all, and turn to a set of distinctive BI solutions with varying capabilities from different vendors to address a range of BI requirements. 


2.     SaaS Goes Upmarket

Software-as-a-Service (SaaS) has provided an attractive proposition to small and mid-sized companies, yet large, multinational enterprises are also realizing the benefits of rapidly deployed solutions that require minimal ongoing IT support.  These large firms are turning to SaaS BI solutions to complement their existing BI infrastructure.  The SaaS-based BI approach will significantly proliferate among large, multi-billion-dollar companies in 2009 and beyond.


3.     Information Silos Will Tumble

As the different functions within an organization become more interrelated, the insight needed to make effective business decisions doesn’t come from one single area.    Analytical decision-making requires that information be gathered from many different functions across an enterprise, which is a difficult task when data ownership resides in individual systems.  As decision-making continues to become more cross-functional, data silos will break down.  For example, to gain an understanding of at-risk customer revenue, companies may need to pull actual customer revenue from finance, forecast revenue from sales, customer satisfaction metrics from marketing, and on-time shipment information from supply chain management. 


4.     Collaborative BI Throughout the Extended Enterprise

As analytic solutions become richer and more powerful, they will reach outside the firewall to a company’s customers, suppliers and service providers.  This extended enterprise will both contribute data to the solution, and also use the analytics that it offers.  Reaching into the extended enterprise has been a priority for years—now measuring and making business decisions based on the same set of analytics will become pervasive.   These solutions will be significantly enriched in 2009, driving value throughout the ecosystem. 


5.     Corporate Hot Spots Will Drive BI

In the recession, companies will choose IT initiatives that address specific business “hot spots” and offer a quick payback. Initiatives focusing on increasing revenue, reducing costs or improving customer satisfaction will reign.  BI will provide organizations with visibility into opportunities to cross-sell into existing customers, ensure at-risk revenue is secured, identify new opportunities, rationalize costs and inventory, and meet goals set out by leadership.  Companies are targeting a 3x ROI and payback in 90 days or less. 

October 25, 2008

Solution is to Tool as Business Genomics is to Relational Databases

I struggle to explain to people what it is that is unique about Oco and what we do that is different from all the other "SaaS BI" companies, which are rushing into the playing field.

I always try to say something which, briefly summarized, is "It's the business semantics, stupid!" I've harped on the point that Oco does solutions while everyone else is doing just tools.

Ok, that's a bit polemical, but the point here is that what we do is VERY different, ... but what is the difference between a solution and a tool anyway?

Here goes....

Oco was founded (back in 1999) by a gentleman named George O'Connor (hence the name. ... yup. I'm naming my next company after myself too.)

George was in a very interesting situation which allowed him to make a rather unique insight which is just blindingly obvious in hindsight, but is really quite profound. So in order to completely slap you in the face with it, let me digress.

Suppose you are at the 50,000 foot level. You are IBM, or Oracle. You are trying to decide how to understand all the data in businesses. If you are looking across dozens or hundreds or thousands of businesses, then you would have to adopt a computer science kind of posture and say to yourself "I think we need a relational database system here. All these businesses can represent everything they're doing in that common framework and then we'll be able to exploit commonalities". You would be right of course. Most businesses today use lots of relational database systems, and they are the tool of choice to represent business information.

Now suppose you are looking at just a single business, trying to decide how to understand all its data. You would adopt a very precise focus and say "I think our orders can be a table with columns for the amount of the order, tax, who sold it, what division they are part of..." and so forth. You would create something very tailored to what you need for just that business.

Ok, now here's the point. George O'Connor worked for a company that owned 10 others that he was responsible for. He had exactly 10 companies to look at. Not one, not hundreds, but ten.

This changes everything. There aren't too many jobs where you are responsible for 10.

His insight was this. Among the 10 companies, there are not an infinite number of different ways they need to represent their orders. There are only 3. Their shipments have only 2 variations actually, their pricing policies - 4 different kinds.

At the perspective of 10, you don't invent something as general as the relational database. You look at the common business behaviors, and you invent software that represents each of the different behaviors. You don't need infinite flexibility, you need limited flexibility.

So far so good, but why was this article title about "genomics" anyway?

Well, we all know all life is based on DNA, and there's a gazillion base pairs organized into genes, and there is the potential there for infinite variability.

Yet, how many different human eye colors are there? Answer: it's not an infinite number of variations, it is 6. They are Amber, Blue, Brown, Gray, Green and Hazel. (Sorry Liz. Wikipedia says violet is just not real!)

Just 6.

The database of DNA supports infinite variability, yet we humans boil down to exactly 6 variations.

Could that be because our environment didn't NEED more than 6 variations? Certainly, had it been to evolutionary advantage, there'd be many more colors, but 6 proves to be enough for us humans even in our complex social world. In fact having too many variations would be problematic. It is comforting to us to see similarities across people. Not uniformity, but not total diversity either.

And so it is with businesses. In businesses today, how many different pricing strategies are really out there being practiced? How many different ways are there to manage orders? A relational database can represent, like DNA, an infinite number of variations on business concepts, yet businesses really only do any given business behavior a few different ways. And this is because competition hasn't required more variations. And, if every business did each of these behaviors entirely differently.... well it would be awfully hard to hire experienced help. It's useful for us that there aren't so many variations.

In other words: There are lots of better ways to make money than constantly inventing different pricing strategies for your business. 'nuf said.

Bruce Richardson from AMR coined this idea as "Oco discovers the business genome". It is a very appropriate analogy.

So for pricing, there's some variations, but not so many. For order management, again a few variations. Inventory, again a few. Combining them together there's lots of combined variations, but in each individual area, we could call it each business gene, there's not so many variations.

Just like hair color, eye color, etc.

This is truly compelling and entirely obvious once you see it. Yet people tend to believe that businesses just are full of endless variations of practices and what they do is unique in so many ways. And it simply isn't the case.

We exploit this big time at Oco.

Oh yeah, the ERP companies do also. Funny that.

How do we take advantage at Oco? Well, we have a central database design in our product which can represent a few variations on, say, pricing (to pick that example again). Will we change it if we find another variation? Sure we will, but we don't expect to do that often now. If you can cover the variations we have now, we think it's likely we can already handle many other variations.

Here's the sweet part: All our tools can exploit the fact that this database is quite stable in theme and design. We don't have to transform data from what the business has in hand to whatever database design is created for a particular project. We've got a fixed database design. That whole variable drops out of the equation.

Back to Solution vs. Tool.

A tool is like a relational database. It does whatever you want to do with it.

A solution has more of the genome built up in it. It's closer to a living thing, a living business.

To tell them apart, well if you are considering a BI system, ask the vendor what they know about pricing for your industry? Do they recommend managing orders (i.e., what customers buy from you) and purchases (what you buy from others) the same way, or differently? Heck, can they even comment on these topics? What about transportation logistics? Can they comment on fuel surcharges, or why trucking is different from rail?

If the answer is "however you want to do it", then you are purchasing a tool, not a solution.



 

October 05, 2008

EII, ETL, ELT, EAI ... Enough!

EII, ETL, ELT, EAI,....This alphabet soup for integration technologies got created by some industry analysts. I've spent much of my career in this Exx soup, and so have become quite expert in what these things are and are not.

EII stands for Enterprise Information Integration, which doesn't say at all what it is.

EII is SQL Database Federation - it is viewing the problem of data integration using the SQL perspective. That is, what if we could make all data look like it was in one big RDBMS, even though it isn't. Then we'd have a uniform way to query it.
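
To make the "one big RDBMS" idea concrete, here is a toy sketch. It is not how any EII product works; it just uses SQLite's ATTACH DATABASE so that two separate in-memory databases (standing in for two independent systems) can be joined in one query. The table and column names are made up for illustration.

```python
import sqlite3

# Two separate in-memory databases stand in for two independent systems.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS finance")
conn.execute("ATTACH DATABASE ':memory:' AS sales")

# Hypothetical tables, one per "system".
conn.execute("CREATE TABLE finance.revenue (customer TEXT, actual REAL)")
conn.execute("CREATE TABLE sales.forecast (customer TEXT, expected REAL)")
conn.executemany("INSERT INTO finance.revenue VALUES (?, ?)",
                 [("Acme", 120.0), ("Globex", 80.0)])
conn.executemany("INSERT INTO sales.forecast VALUES (?, ?)",
                 [("Acme", 100.0), ("Globex", 95.0)])

# One query spans both "systems" as if they were a single RDBMS.
federated_rows = conn.execute(
    """
    SELECT r.customer, r.actual, f.expected
    FROM finance.revenue AS r
    JOIN sales.forecast AS f ON f.customer = r.customer
    ORDER BY r.customer
    """
).fetchall()
print(federated_rows)
```

In a real federation the two sides would live on different servers, which is exactly where the cross-database join gets expensive.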

This idea is great really. Viewing all data through the lens of SQL. It is great for query and analysis types of applications. Not so great for transactional processing though there are products in the market which federate and implement distributed two-phase commit even. But nobody would claim it's a high-performance system.

The problems with EII start with expressiveness - SQL's type system doesn't handle hierarchical data. Or rather, standard SQL doesn't. Unfortunately, an industry consensus never formed around nested tables or any of the other ideas for dealing with hierarchical data from SQL.

EII now includes XQuery (which is SQL-like, but does deal with hierarchy), and systems that view all data as XML.

Another problem with EII is incompleteness. EII is about how you access the data once it's been made syntactically compatible, but it doesn't address how that is achieved. You need another integration technology of some sort to turn non-RDBMS or non-XML data sources into ones that an EII environment can manipulate.

Finally there are performance issues with EII. As good as SQL optimization technology is, you still can't ignore that federated databases aren't one database, so joins across two separate databases are going to be very expensive. Optimizing XQuery is a research topic still.

Given how inexpensive databases and storage are these days, a commercially viable EII approach is to gather all data into a central database. That is, forget about the federation aspect and just centralize so as to bring SQL to bear in the environment where it performs best.

Generally, EII is a good example of an adage I like: "If you've got a really big hammer, everything starts looking like a nail" :-)

So much for EII.

Now, ETL - Extract Transform, and Load. This is the technology used to populate databases from non-databases. It is usually batch, but is being adapted to more real-time things.

ETL tools are usually based on a principle called streaming dataflow, as are the SQL-engines inside databases, but ETL tools are more extensible. E.g., in SQL you can aggregate data with a group-by statement and compute sums, counts, averages, etc. But in an ETL tool you can build a best of breed aggregator using your own custom business specific algorithm. E.g., given all the records representing a customer, take fields from one or another so as to get the best record representing the customer. So, consider the gender field in 10 records representing the same  customer, suppose it is blank in a few, Male in a few, Female in a few. The name is "Dana", which doesn't help. So, a business specific heuristic can be defined that says if a particular record comes from say, the brokerage division of a financial house, then that's more likely to have the gender correct because the brokers actually talk to the customers on the phone. Try expressing that in SQL!
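
Here is a minimal sketch of that kind of custom aggregator, written in plain Python rather than in any particular ETL tool. The source names and the trust ranking are assumptions I made up to mirror the brokerage example; a real rule would come from the business.

```python
# Hypothetical trust ranking: higher means the source is more likely correct.
SOURCE_TRUST = {"brokerage": 3, "retail_bank": 2, "web_signup": 1}


def best_gender(records):
    """Pick the gender value from the most trusted source that has one."""
    candidates = [r for r in records if r.get("gender")]
    if not candidates:
        return None  # every record left the field blank
    best = max(candidates, key=lambda r: SOURCE_TRUST.get(r["source"], 0))
    return best["gender"]


# Several records that all represent the same customer.
records = [
    {"name": "Dana", "source": "web_signup", "gender": ""},
    {"name": "Dana", "source": "retail_bank", "gender": "M"},
    {"name": "Dana", "source": "brokerage", "gender": "F"},
    {"name": "Dana", "source": "web_signup", "gender": "M"},
]
print(best_gender(records))
```

Note that a simple majority vote would have picked "M" here; the business rule overrides it because the brokerage record is trusted more.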

The streaming dataflow that ETL tools use as a basis is a natural candidate for parallel processing to increase throughput. This can be on clustered computers, or just a multi-core CPU. This idiom is one candidate for fixing the problem where programmers don't know how to exploit all these cheaply available CPU cores.
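
A rough sketch of the shape of this, assuming made-up record fields: records stream out of a generator and a pool of workers applies the transform step to them. I use a thread pool only to keep the example short; in CPython, threads will not speed up CPU-bound pure-Python work, so a process pool or a cluster would be the real thing.

```python
from concurrent.futures import ThreadPoolExecutor


def extract():
    """Stream records one at a time (stand-in for reading a source)."""
    for i in range(1, 9):
        yield {"id": i, "amount": float(i)}


def transform(record):
    """Per-record step; independent records can be handled by any worker."""
    return {**record, "amount_with_tax": record["amount"] * 1.5}


def run_pipeline(workers=4):
    # pool.map returns results in input order even though the calls
    # themselves may run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform, extract()))


results = run_pipeline()
print(len(results), results[0])
```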

ELT is an alternative to ETL. It's more SQL oriented. It stands for Extract, Load, then Transform.  You extract data. Do to it only what has to be done to it to load it into a database, then transform by using a number of SQL-based operations on it. This works when SQL is sufficient to express what needs to be done. In that sense it is like EII.
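
A small sketch of the ELT order of operations, using SQLite as a stand-in database and an invented CSV extract. The load step does the bare minimum (everything lands as text in a staging table), and the cleanup and aggregation happen afterwards in SQL.

```python
import csv
import io
import sqlite3

# Extract: a hypothetical raw export with untidy whitespace in the amounts.
raw_csv = (
    "order_id,amount,division\n"
    "1, 10.50 ,east\n"
    "2, 4.25 ,west\n"
    "3, 7.00 ,east\n"
)

# Load: store the data as-is, all columns as text, no transformation yet.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging_orders (order_id TEXT, amount TEXT, division TEXT)"
)
rows = list(csv.DictReader(io.StringIO(raw_csv)))
conn.executemany(
    "INSERT INTO staging_orders VALUES (:order_id, :amount, :division)", rows
)

# Transform: clean and aggregate inside the database using SQL only.
conn.execute(
    """
    CREATE TABLE division_totals AS
    SELECT division, SUM(CAST(TRIM(amount) AS REAL)) AS total
    FROM staging_orders
    GROUP BY division
    """
)
division_totals = conn.execute(
    "SELECT division, total FROM division_totals ORDER BY division"
).fetchall()
print(division_totals)
```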

All of the three so far ETL, EII, ELT, are what we call set-oriented systems. Aggregating a set of multiple separate records together to compute a collective result from the set makes sense. Sorting a set of records makes sense so that you can examine them in a particular order. All of these are suitable for massive-scale processing when data is large.

Now let's look at EAI - Enterprise Application Integration. This is the stuff of enterprise message buses. It's about processing individual messages so as to perform a transaction not only in one place, but in all the places that the information that transaction contains is needed.

EAI systems are primarily about the processing of one message at a time. Not necessarily sequentially, but the logic of an EAI system isn't about processing groups of messages together. The idea of taking the set of messages, sorting it, doing best of breed record construction, is not part of the EAI mind set.  Rather, the transformation and routing of a message to the right places and in the right formats is what it is about.

EAI's message by message approach can be scaled up massively also, but can't do the same kinds of things that a set-oriented system can do across records/messages. Or rather it can't do them the same way. To carry out a transformation involving multiple messages in an EAI system you process one record and keep a state in a data store, typically a database. When the next message of the set arrives, you retrieve the data from the database, compute more, write the state back to the database. Repeat until you've combined all the messages. (Figuring out which is the last one is a big issue.) Ultimately the answer comes from the last message and its interaction with the state, which is where the accumulated knowledge sits about the collective set of messages.
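
The pattern above can be sketched as follows. A plain dict stands in for the database that holds the state, and I dodge the "which is the last one" problem by assuming each message carries an expected_count field; that field, like the other names here, is purely hypothetical.

```python
state_store = {}  # stand-in for the database that holds per-group state


def handle_message(msg):
    """Process one message; return the combined result once the set is complete."""
    key = msg["order_id"]
    # Retrieve the accumulated state for this group, or start fresh.
    state = state_store.get(key, {"seen": 0, "total": 0.0})
    state["seen"] += 1
    state["total"] += msg["amount"]
    if state["seen"] == msg["expected_count"]:
        # Last message of the set: the answer comes from it plus the state.
        state_store.pop(key, None)
        return {"order_id": key, "total": state["total"]}
    # Not done yet: write the state back and wait for the next message.
    state_store[key] = state
    return None


messages = [
    {"order_id": "A", "amount": 5.0, "expected_count": 3},
    {"order_id": "A", "amount": 7.5, "expected_count": 3},
    {"order_id": "A", "amount": 2.5, "expected_count": 3},
]
outcomes = [handle_message(m) for m in messages]
print(outcomes)
```

Compare this with a set-oriented system, where the same result is a single group-by.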

But here's the rub with EAI. When you have messages where there is complex hierarchical data e.g., an XML document, then sometimes the transformations on the detail levels of the hierarchy are set-oriented. E.g., given a purchase order, what is the total of all the parts being ordered. This is not an item-by-item operation on each line-item of the purchase order, it is an operation on the whole set of line-items for a purchase order.
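
For instance, with an invented purchase order layout (the element names are mine, not from any standard), the total is a computation over the whole set of line items inside one message:

```python
import xml.etree.ElementTree as ET

# Hypothetical purchase order message; element names are illustrative only.
po_xml = """
<purchaseOrder id="PO-1">
  <lineItem><part>bolt</part><qty>10</qty><unitPrice>0.25</unitPrice></lineItem>
  <lineItem><part>nut</part><qty>10</qty><unitPrice>0.5</unitPrice></lineItem>
  <lineItem><part>bracket</part><qty>2</qty><unitPrice>4.0</unitPrice></lineItem>
</purchaseOrder>
"""


def order_total(xml_text):
    """Sum qty * unitPrice across all line items of one purchase order."""
    root = ET.fromstring(xml_text.strip())
    return sum(
        int(item.findtext("qty")) * float(item.findtext("unitPrice"))
        for item in root.findall("lineItem")
    )


po_total = order_total(po_xml)
print(po_total)
```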

So an EAI system needs to be able to transition from the message by message model of transformation and routing into the set-oriented notion for dealing with the hierarchical content of those messages. This set-oriented stuff is much more like what is typically and readily expressed in ETL/EII systems.

Lately some new acronyms have been added to the soup. (They don't begin with "E", thankfully.) One is CEP - complex event processing. This is EAI with different means of expressing the processing using rule languages. Another thing (no acronym) is sensor network streams - this is similar to the real-time ETL systems, but emphasizes that the messages are coming from sensors and the input rate is so fast that it might even be ok to lose a message here or there if needed to keep up overall.

So, my conclusion is that the alphabet soup of EII, ETL, EAI, ELT, isn't really very useful anymore. You have SQL-based approaches, and non-SQL approaches. You need more than the expressive power of just SQL in many cases, and you need more than message-by-message processing as well.

An ideal system has to have the expressive power of the ETL systems, but be designed for always-on online applications, not just batch. It must handle hierarchical data naturally. It must be extensible to allow ad-hoc aggregations, not just the fixed library of SQL or XQuery. It must make simple online message by message processing very simple to express.

But... to my knowledge nobody makes one of those. They're all stuck in one of the Exx boxes and provide only what that one Exx box narrowly provides.

Maybe some open source provider, someone who doesn't care about the Exx categories, will fix this someday.

October 03, 2008

Internet Data Center vs. Your PC - Which is more wasteful?

My prior post Your Computation Mileage May Vary generated some interesting hallway discussion.

I believe my intended point is still valid: there is no upper bound on the demand for computational services. My article didn't make that point too well; it is really about the fact that computation cost is kind of like rent with utilities included. There's no incentive to conserve.

But, I need to clarify that it was not my intent to accuse the Internet search companies of being wasteful of resources.  Not at all.

One issue is the matter of what Internet searching replaces. For example, Ebay vs. trucking newsprint around in order to occasionally reach my eyeballs with a classified ad or two. Hauling newspaper isn't exactly smart use of resources either.

So by way of setting the record straight on Internet search energy use, colleagues pointed me to some of the facts about this subject. Here are some notes on estimates for energy consumption per search query:

  • Standard server-class machines are generally not used. Instead, much more energy-efficient designs are used (careful power distribution, power supply, motherboard design, etc.).
  • There is a lot of room to do cooling more efficiently than is usually done.
  • The "thousands" of computers involved in serving search queries are not all involved for each query. (This part I already knew.)
  • Searches take far less than 1 second.

Clearly, if it took anywhere near the energy my blog post suggested to run a search engine, companies like Yahoo, Google, etc. wouldn't be able to make (much) money through search advertising.

Google has previously published some information about energy efficiency. Here are a couple of links:

  http://services.google.com/blog_resources/PSU_white_paper.pdf

The above points out that every single PC has an incredibly wasteful power supply in it. Collectively, that's a lot of wasted energy. After reading the above I feel good about buying laptops! 

  http://googleblog.blogspot.com/2008/10/saving-electricity-one-data-center-at.html 

The above says "In the time it takes to do a Google search, your own personal computer will use more energy than Google will use to answer your query."

From my understanding of Internet search technology, and looking at what we can find out publicly about Google, this is entirely credible. Really it is great to know that Google has been successful at lowering the impact of all this great computation capability they're providing.
 

July 19, 2008

Why SaaS is Good for IT vs. Why SaaS Isn't Good for IT

I've heard this argument back and forth from a number of people, so it is worth commenting on, particularly as related to business intelligence systems. That is data warehouses and the like.

Here are the positive themes: cost savings, freeing up resources for other strategic projects, improved quality of BI information and improved distribution of BI information, improved coverage of business units.

Here are the negatives: risk of reduced budget, risk of career disruption/change.

The net is that IT folks have to view SaaS as "good" for them because (a) it is good for the business (b) it is inevitable.

Let me elaborate.

The basic argument for SaaS in general is just the Adam Smith economic division of labor argument, i.e., you should make use of services instead of being self-sufficient for the same reason you buy food at the supermarket instead of farming your own - it is more efficient for a specialized producer to do it than for individuals to each do something they do not have the specialized skills for. That efficiency is returned to us in lowered cost, and increased reliability of supply. For IT systems this argument is presented really well in The Big Switch, which I highly recommend.

Contrary-wise this implies that if you have computing needs for which services aren't available, then by all means you should do those in house. This goes all the way down to writing your own software if software isn't available to do what you need. This is where the "good" part of the SaaS trend for IT comes from. It's the ability to refocus efforts on things that are strategic, not commodity.

For SaaS BI, more specifically, well it is good for the business because it saves cost over doing it in-house, and in most cases we at Oco can do it far better than an in-house project would be able to do. The cost reduction can actually enable BI to be available for the first time at many businesses which could not afford to do it before on the price scales that larger enterprises have been paying.

However, for businesses that replace an in-house effort with SaaS BI, whether it is "good" for IT depends on what is done with the savings. If these are re-invested in other strategic IT projects then it creates ongoing career opportunities for IT folks. However, if those savings are just used to increase profits and/or pay dividends to shareholders, then this good for the business is at best neutral for IT folks working in the trenches, and possibly very negative. Change can be opportunity, but always presents challenges.

If a new IT technology is good for the business, then strategic IT departments have to find a way to embrace it and make it successful for the business. It's untenable for an IT department to push back on important technology for "turf" reasons alone.

The career plans of people that are vested today in in-house creation of BI/DW systems clearly need to evolve to match the new reality. E.g., if you are an IT leader who wants a successful DW/BI deployment on your resume as part of your career plans, well you have to consider whether it would look better to have a SaaS deployment of BI on your resume instead. A SaaS BI deployment should save your organization lots of money, so perhaps you would also be able to feature the other projects you can now afford to do with the budget savings.

If you are a data analyst, then by definition your company already sees the importance of having data analysts, and having a SaaS BI deployment should just let you move up the value chain to deliver the BI information that goes beyond what you can get from your Oco system, or which explores the directions for enhancements of your SaaS-deployed BI solution. 

If you are an IT worker in the trenches, such as a DBA or systems administrator hoping to work on a DW/BI deployment, then there certainly is a risk of a disruption of your plans due to SaaS. I'm not sure I can sweeten this for you. As IT switches over to services and away from in-house deployments, then your job and role is going to change. You really do have to read The Big Switch, and should consider looking for employment on the services side, that is with the service providers. If you stick within your existing business/employer, you should look for something you can work on that is more strategic, i.e., that is related to a system that is very specific to your business, and not common across that business and many others. I can't find the origin of this quote, but the point bluntly is to "get strategic or get outsourced".

There's another way in which an inevitable change can be "good" for IT. The inevitability is basically saying that since it's coming, it's "good" to be a thought leader and embrace the change and get some advantage from it before everyone else has those advantages as well, at which point you are just playing catch-up. This argument is also known as "the best defense is a good offense".

The alternative is to fight the trend, rather than switch, but this is just burying one's head in the sand, to mix in another metaphor.

July 14, 2008

SaaS BI Going Mainstream: What Business Objects OnDemand + Oco Partnership Means to Me

Today Oco announced a partnership with Business Objects. I'm not going to reiterate what the announcement says since you can read that elsewhere. To me the key point here is that a major trusted BI supplier, Business Objects, is (and has been for a while now) heavily investing in OnDemand/SaaS/hosted deployment, and they've recognized that Oco's solution provides value for their customers through the integrated information content Oco creates accessed by way of the Business Objects OnDemand hosted environment.

To Oco this partnership is great because it legitimizes what we do and really reduces the shock-factor that our prospective customers see. Let's face it. Oco is an innovator (read "small" -- though growing!) company which by itself sells a solution which is delivered "shockingly" quickly (6 to 10 weeks), financially "risk free", delivered via a "radical" new deployment model (SaaS), and it provides only a UI targeted at the business-user. This is all pretty unfamiliar stuff for a conservative BI customer. Keep in mind that the track record on Data Warehouse/BI projects is historically pretty poor - hence many people have become conservative about them.

Business Objects changes the balance considerably: You replace the small company risk-factor with a major trusted supplier like Business Objects. You replace this radical SaaS deployment model with the fact that Business Objects OnDemand represents this SaaS trend going mainstream. Finally, you get what I like to call "design headroom" that the broad suite of Business Objects tools offers which caters to the complete user community that traditionally consumes BI spanning from the business users (who depend on Crystal Reports and Xcelsius dashboards) to data analysts performing ad-hoc analysis (for example via Web Intelligence).

What was shocking, new, and risky, is now a pretty safe bet, and customers can focus on what the new value is that the solution provides, not on all the seemingly risky new-ness of the way it is created and delivered.

My thanks go out to my colleagues at Business Objects who examined Oco, saw the value here, and worked to add us into their partnership sphere. Everyone at Oco and myself will be working hard to make sure we deliver great solutions to our joint customers.

 

July 11, 2008

DFDL = Data Format Description Language - The syntax of data

I chair a standards committee/workgroup within an organization called the Open Grid Forum. The workgroup is on something called Data Format Description Language, or DFDL, which you can pronounce "Daffodil" if you want.

DFDL is about facilitating data interchange, critical to most computing, but to data integration for BI applications in particular.

Many people ask why DFDL is needed in an era where there are so many standard data formats available (e.g., why not just use XML?). There are a number of social phenomena in the way software is developed which have led to the current situation where DFDL is needed to standardize description of diverse data formats.

First, programs are very often written speculatively, that is, without any advance understanding of how important they will become. Appropriately given this situation, little effort is expended on data formats since it remains easier to program the I/O in the most straightforward way possible given the programming tools in use. Even something as simple as using an XML-based data format is harder than simply using the native I/O libraries of a programming language.

At some point however, it is realized that the program is important because either lots of people are using it, or it has become important for business or organizational needs to start using it in larger scale deployments. At that point it is often too late to go back and change the data formats. For example, there may be real or perceived business costs to delaying a deployment of a program for a rewrite just to change the data formats, particularly if such rewriting will reduce performance of the program and increase costs of deployment. (It takes longer to program, but at least it's slower when you are done ;-)

Additionally, the need for data format standardization for interchange with other software may not even be clear at the point where a program first becomes 'important'. Eventually, however, the need for data interchange with the program becomes apparent. At that point, you look back at the data format and maybe it's not too complex yet. So you don't re-engineer it. But add a year or so of evolution of the software with the attendant changes here and there to the data formats and suddenly you have a real problem.

The above phenomena are not going away any time soon. There are of course efforts to integrate standardized data format handling much more smoothly into programming languages. But it is very unclear whether these will catch on, and there is, regardless, a role for DFDL since it allows after-the-fact description of a data format.

DFDL is also needed for performance reasons. At the hairy edge of computing people are always trying to process ever more data to gain some competitive advantage. At this edge, the performance penalty from using verbose data formats like XML can become very burdensome.

Lastly, there's this problem with data format debugging. I will elaborate a bit here because this is a hidden tax on every integration project where there is data in files. I get emails from field engineers trying to help customers with data format problems fairly often. Here's a typical one:

Problem: EBCDIC file, no Cobol file definition. Customer is telling me that it contains 5 fields.  Also tells me that the first 3 fields are unsigned, fourth field is Packed, 5th field is 'EBCDIC' (which makes no sense). Based on this a row looks like this:
            
            149,13568,0,4,"UUNR LEA "
            
Customer says the row should look like this:
            
            95,35,48,222038109,"UUNR LEA IN"

Here are a couple of records dumped out in hexadecimal:

0095350000000000000048222038109fe4e4d5d940d3c5c140c9d540000f
0095350000000951500501088322722fd2e4d9e3e94b4bf0d9f64b4b002f

Figuring out problems like this is what I call data archeology. Most people can't imagine that this sort of work is still needed routinely in data integration projects. This is truly geeky stuff, I mean hex dumps?!

A tool, a data format debugger, is badly needed to address this.

Looking at the above hex, using some brain heuristics, coupled with the information the customer provided, allows me to chop up the hex like so:

0095 35 0000000000000048 222038109f e4e4d5d940d3c5c140c9d540 000f
0095 35 0000000951500501 088322722f d2e4d9e3e94b4bf0d9f64b4b 002f

Analysis: Looks like the "unsigned" fields are also packed. I.e., what they meant by "unsigned" was "unsigned packed", not unsigned base 2 binary. Seems like they also didn't tell me about all the fields. There's one more field at the end.
            
            Unsigned packed 4 digits
            Unsigned packed 2 digits
            Unsigned packed 16 digits
            Unsigned packed 9 digits (the final "F" nibble is padding)
            Fixed length 12 character string, EBCDIC character set
            Unsigned packed 3 digits (the final "F" nibble is padding)
            
It's also possible that the fields with the "F" nibbles are signed packed numbers but from a non-IBM mainframe Cobol compiler. (IBM standard Cobol uses C and D as signs, and F as padding for odd-length unsigned packed decimal I believe.)
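
As a rough sketch of the analysis above, here is how the inferred layout could be applied in Python. The field names, the choice of the cp037 EBCDIC code page, and the treatment of a trailing C or F nibble as non-negative and D as negative are assumptions made for illustration; the customer confirmed none of them.

```python
# Layout inferred from the hex dump: (name, width in hex nibbles, kind).
# Field names are hypothetical; the source never names the fields.
LAYOUT = [
    ("f1", 4, "packed"),
    ("f2", 2, "packed"),
    ("f3", 16, "packed"),
    ("f4", 10, "packed_tail"),   # 9 digits plus a trailing nibble
    ("text", 24, "ebcdic"),      # 12 bytes of character data
    ("f6", 4, "packed_tail"),    # 3 digits plus a trailing nibble
]

# The two records from the dump, written field by field for readability.
RECORDS = [
    "0095" "35" "0000000000000048" "222038109f" "e4e4d5d940d3c5c140c9d540" "000f",
    "0095" "35" "0000000951500501" "088322722f" "d2e4d9e3e94b4bf0d9f64b4b" "002f",
]


def decode_packed(nibbles, has_tail):
    """Decode packed decimal given as a hex string, one digit per nibble."""
    sign = 1
    if has_tail:
        nibbles, tail = nibbles[:-1], nibbles[-1].lower()
        if tail == "d":          # assumption: D marks a negative value
            sign = -1
        elif tail not in "cf":   # assumption: C or F means non-negative
            raise ValueError(f"unexpected trailing nibble {tail!r}")
    # int() rejects any nibble in a-f, so a bad digit fails loudly.
    return sign * int(nibbles, 10)


def parse_record(hex_record):
    """Slice one hex-dumped record according to LAYOUT and decode each field."""
    fields = {}
    pos = 0
    for name, width, kind in LAYOUT:
        chunk = hex_record[pos:pos + width]
        pos += width
        if kind == "ebcdic":
            # cp037 is an assumed code page; the source only says "EBCDIC".
            fields[name] = bytes.fromhex(chunk).decode("cp037")
        else:
            fields[name] = decode_packed(chunk, kind == "packed_tail")
    if pos != len(hex_record):
        raise ValueError("record length does not match the layout")
    return fields


def first_two_as_binary(hex_record):
    """Read the first four bytes as two big-endian unsigned 16-bit integers."""
    raw = bytes.fromhex(hex_record)
    return int.from_bytes(raw[0:2], "big"), int.from_bytes(raw[2:4], "big")


for record in RECORDS:
    print(parse_record(record))

print(first_two_as_binary(RECORDS[0]))  # (149, 13568)
```

For the first record this yields 95, 35, 48 and 222038109 for the leading fields, matching what the customer expected. Reading the first four bytes as binary integers instead gives 149 and 13568, the same values as in the wrong row, which is consistent with "unsigned" having been taken to mean base 2 binary.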

I hope you are now as sick of this example as I am... phew!

Now, pulling up out of those gritty details: the depth of experience and knowledge needed to pull off this kind of data archeology is pretty expensive to hire, and it is a huge tax on our industry every time this sort of thing comes up, which is far more often than people imagine.

Unfortunately, dealing with data format issues is not a sexy, high-value-adding feature for a software product. Yes, these issues are costly for customers. But no customer says "Gee, I'm happy to pay another bunch of dollars to have a data format debugger." It's not in their top 10 new feature requests because it is hard for them to imagine the huge cost that is repeatedly sunk into mundane data archeology in a single major data warehousing or BI initiative.

Most people who are naive to this issue figure that each of the major vendors has a proprietary data format language in its product portfolio. Reality is that the larger vendors have half a dozen or more. The difficulty is that the lack of a standard for data format description means that all these separate software bases are being maintained to do roughly the same thing, and no one product team can afford to invest in a really good data format debugger since it works only with that one product's proprietary data format system. As a result, nobody gets a data format debugger that is really any good, and we all pay the tax again and again.

A standard for DFDL fixes this. A data format debugger pays dividends across multiple products if they are using a common DFDL, and so the investment to make one is worthwhile even though it's not a sexy new feature that customers are asking for.

I did overstate above when I said no customers ask for this. There are some educated consumers who ask for DFDL. I started work on DFDL when a customer made it clear that they would buy more software from me if the data format description language it used was standard, not proprietary. This is the best reason to promote a standard: not to commoditize some software, but to enable increased usage. A standard DFDL would allow them to interoperate across software from multiple vendors. They could invest safely in tools that leveraged this common data format description language. At the time we were dealing with interoperation with SAS, which has its own data format descriptions. We were struggling with moving data between my software (this was Ascential at the time) and SAS (as in The SAS Institute) applications. When you have a record format with 700+ fields in it, and there's an error somewhere in the middle, debugging it is pretty hard. Eventually we found the problem, but the customer learned that the effort involved in using these pieces of software together and dealing with their proprietary data format languages was not worth it. If a standard data format language were shared, then they could use these software packages for more things, and use them together easily.

You can learn more about the DFDL standard here.

June 27, 2008

Solutions vs. Tools - Round 1, and BI for the Equipment Servicing Industry

A theme I will undoubtedly hit over and over is this difference between solutions and tools in Business Intelligence. To some, the BI market is entirely populated by tools vendors. If you say you sell a BI solution people will quite literally ask back "what kind of a tool is it?".

Of course the difference between a tool and a solution depends on your perspective. If you are a data analyst, a solution to your particular problem may be a superior tool to help you analyze data in flexible ways. However, if you are a business person looking for solutions to business problems, then a data analysis tool is clearly not a solution. It might be part of one, but much additional business knowledge must be used to turn it into a business solution. So when I use the term 'solution' I generally mean 'business solution' to a business problem for a business person to use, and specifically NOT a tool for a data analyst to use.

At Oco, this is a huge part of what we do. We add business value by facilitating cross-business-unit agreement on common definitions for business concepts, so that our reporting solution really addresses the needs of the business and is usable by business people. Our solution contains quite specialized knowledge about the business problem areas we address.

As an example of this, imagine you work for an equipment manufacturer. The equipment needs servicing, repair, preventative maintenance, and so forth. An important part of your business is revenue from this service business. Optimizing this business for profit, customer satisfaction and such is, naturally, important. So, you don't need a "BI tool". Rather, what you need is a system for understanding and optimizing the way you run this services business. To this point, Oco (with Aberdeen and Qualcomm) is having a webinar titled: Smart Business Intelligence Solutions For Optimizing Your Service Operations. Attend if you are in the equipment/service industry, or really want to understand the difference between a solution and a tool for BI. Once you start to appreciate the value of this embedded business value in a BI solution, you start to use the term "BI tool" rather pejoratively, as in "XYZ product is interesting, but it's just a tool".
