The da Vinci plug-in is a clone of the Microsoft OfficeOpenXML (MS-OOXML) plug-in for MSOffice, otherwise known as the Microsoft Compatibility Pack. When Microsoft released its first series of XML file format plug-ins for the MSOffice 2003 beta, it exposed to developers some of the methodology needed to write such a plug-in.
There are three types of MSOffice plug-in architecture now in common use: internal, external and translator.
External plug-ins such as Sun's ODf plug-in for MSOffice use an external conversion engine re-factored from OpenOffice.org. The Sun ODf plug-in shares the same conversion fidelity as one would get from OpenOffice.org, which is pegged at about 85%.
There are two Translator plug-ins available: the Microsoft-Cleverage-Novell ODf Translator plug-in for MSOffice and the OOXML Plug-in for OpenOffice.org. The conversion process here is based on XSL Transformations between ODf and OOXML, with an average conversion fidelity pegged at about 65%.
The MSOffice Compatibility Pack and our da Vinci plug-in clone are “internal plug-ins”. They use the internal conversion process inherent in MSOffice applications. The use of the existing internal conversion process provides a much higher conversion fidelity, upwards of 98%.
That said, conversion fidelity itself depends on two factors. The first is the actual conversion of binary to xml. The second is that the conversion target file format must be able to capture the richness of information being converted.
The internal conversion engines have a high probability of good to excellent fidelity because they use the application's internal conversion process to break and stage the binary. The target format, however, must still be able to capture the internal structures.
Many people have asked us why we switched from ODf to CDF, the W3C's Compound Document Format. The answer is that we cannot pipe the converted MS binaries into ODf without the use of some additional eXtensions. The richness of MSOffice features and business process development cannot fit into ODf without suffering loss of fidelity (or loss of presentation information). We can, however, pipe from our internal conversion process into a CDF target format.
This strikes everyone as the strangest statement they've ever heard. It's assumed that ODf was developed for desktop office suites, so why wouldn't it work for MSOffice?
The truth is that ODf was not developed for the task of converting MS binaries.
This issue has two parts, both of which we have found to be uncompromising market requirements for our da Vinci plug-in:
Compatibility with existing file formats, including MS binaries.
Interoperability with existing applications, including MSOffice apps.
For short, these issues are referred to as “compatibility and interoperability”, although they are both part of the larger “interoperability” issue.
The hard truth is that ODf was not developed for the task of compatibility with MS binaries, or interoperability with MSOffice. Which is to say, ODf was not designed or intended to meet the demands of what we found to be uncompromising market requirements for our da Vinci plug-in. It would be far more accurate to say that ODf is designed to be implemented by ODf-ready applications.
The OASIS ODf TC might not agree with our direct statement of this problem, but the fact remains that the nearly five years of ODf TC work is replete with comments, votes, and discussions unfailingly pointing out that these issues are “outside the charter and out of scope”. And then there is the always popular sentiment, “if Microsoft wants to implement ODf they can join the OASIS ODf TC like everyone else”.
If anyone on the OASIS ODf TC disagrees with this interpretation of the past five years of activity, there is a simple solution: take the steps necessary to make ODf compatible with existing MS file formats and interoperable with existing MSOffice applications. It isn't difficult. All it takes is five generic elements dealing with the fundamental document structures of lists, tables, fields, sections and page dynamics (and maybe a sixth for graphic objects – but that's a much debated point).
Between July of 2006 and February of 2007, there were no fewer than five interoperability enhancement proposals (called “iX” for short) submitted to the OASIS ODf TC and Metadata SC for discussion and consideration. They did not fare well at all, and if there was ever any doubt about the OASIS ODf members' opinions concerning compatibility and interop with Microsoft technologies, it can be answered in these discussions and votes.
The first three of these iX submissions were supported and signed off by CIO Louis Gutierrez and his Massachusetts ITD staff. They were needed to save ODf in Massachusetts. A fourth was needed by the Translator project. And the fifth represented a sweeping RDF based interoperability framework designed to solve the compatibility-interoperability issues and more.
The iX interoperability enhancements are critical to the problem of fitting MSOffice feature sets, add-ons, and business processes into ODf. Without iX, there is no way da Vinci can meet the high-fidelity market requirements that every government we've ever presented to has asked us to address. Having met these marketplace realities head on, we don't believe it is possible to implement ODf anywhere there are MSOffice-bound workgroups involved in a business process exchange of documents. Leastways, not without using ODf iX!
The first version of da Vinci was written for ODf 1.0, and was presented to Louis Gutierrez, the Massachusetts ITD, and government guests on June 19th, 2006. The event was hosted by IBM's Rob Weir, Don Harbison, and Doug Heintzman at the Lotus International Conference Center in Cambridge, Mass. A second “public” presentation followed in Brussels, Belgium, at an EU-IDABC sponsored conference on July 4th-5th, 2006.
In July of 2006, IBM conducted a professional examination of the da Vinci source code under NDA.
It was clear from the test documents, and from the results of a year-long Pilot Study that Massachusetts had conducted concerning the implementation of ODf, that we had hit a much higher fidelity than ODf 1.0 was capable of. Plus, the MSOffice-bound workgroup factor meant we had to perfect a “round trip” conversion fidelity capable of matching that of the MS OOXML plug-in for MSOffice 2003 and 2000.
At this point we moved to ODf iX with the provision given to Massachusetts and California, that an ODf iX da Vinci would not be released unless and until the iX interoperability enhancement needs were approved by the OASIS ODf TC. This was not a hard decision since Massachusetts, California and the EU-IDABC all insisted that ODf not be forked!
On October 4th, 2006, all work on ODf da Vinci stopped with the resignation of Louis Gutierrez. Unless we could get the iX enhancements through OASIS, there was no sense continuing work on da Vinci.
By April of 2007, it was clear that none of the iX interoperability enhancements would ever make it through OASIS. It is at this point we went back to the W3C Internet Technologies hoping that XHTML 2.0 and CSS 3.0 could somehow be used to solve our market requirements. What we found is CDF :) And it can do the job.
The transition from ODf to CDF is difficult in that CDF isn't a normal file format construct in the same sense as MS binaries, RTF, ODf or MS-OOXML. It's a different, 100% Web – W3C technologies approach, primarily designed for the convergence of devices and web information systems. That we can convert existing MS documents, applications and processes to CDF with the required high fidelity is amazing, since CDF was not designed for this challenge. But it works extremely well. Elegant even. CDF is every bit as effective as Microsoft's own MS-OOXML plug-in for MSOffice (the XML Compatibility Pack).
Note that even though ODf was not designed for the high fidelity conversion of existing MS documents, applications and processes, all we actually needed was five generic eXtensions to get the job done.
One important key to CDF's explosive ubiquity potential is that it is entirely application, platform and vendor independent. ODf was designed for OpenOffice, and to this day, nothing goes into ODf that OpenOffice doesn't implement and support. Similarly, MS-OOXML was designed for MSOffice and the billions of binary legacy documents. (Note that MS-OOXML maintains backwards compatibility through the use of generics!) CDF does not suffer from any sort of application, platform or vendor binding. In fact, it has a Profile model designed to arbitrate these legacy-inspired dilemmas.
What's important to know about CDF is that it fully meets and exceeds our da Vinci market requirements. We can solve the file format compatibility – application interoperability problems with CDF. We also get a bonus with CDF: we get to solve the third market requirement, called “grand convergence”.
Grand convergence is the convergence of desktop, server, device and web information systems regardless of application, platform or vendor specifics. It is exactly what CDF was designed for except that no one ever expected the “desktop” part of this convergence on the scale of complete integration with existing MSOffice installations.
There is a great article from eWEEK's Tiffany Maleshefski, Office Formats Fail to Communicate, where she catches some prize quotes from Sun, and provides an excellent backdrop for the problem of file format interoperability. My follow-up comments to this article are posted at, "Why Can't We All Just Get Along?".
Note well Sun's belief that open standards are determined by big vendors agreeing to sit down and create a standard. We disagree with this. We believe that standards are best created without vendors having a vote. In a more perfect world, where universal interoperability was the primary goal, vendors would only be allowed to submit proposals and comments. The actual standards would be defined by independent experts answering only to market requirements.
As a bit of background, Tiffany is part of the eWEEK Labs team, and is writing a series of reports comparing three XML plug-ins for MSOffice:
Microsoft's OOXML Plug-in for MSOffice (otherwise known as the MS Compatibility Pack)
Sun's ODF Plug-in for MSOffice
The OpenDocument Foundation's ACME 376 Plug-in for MSOffice
While we are still waiting for Tiffany's assessment of ACME 376, I did get a chance to visit her lab and follow the installation of ACME 376. The version she installed was the publicly available version, which does not include "The Magic Key".
The Magic Key is interesting in that we have discovered a secret registry setting that allows for the complete capture of all VBA scripts, macros, OLE, security settings, etc., in a file format. The key is of course reserved for native Microsoft Office file formats (binary and XML), and only one application can own the key, which causes complications for test systems.
One of the important things about the magic key for any file format plug-in is that perfect capture of this binary logic is essential to MSOffice bound workgroups-workflow business processes, line of business integration and many assistive technology add-on applications.
In our opinion, the true "real world" test of an XML file format plug-in for MSOffice is that it can perfect the high fidelity "round trip" conversion demanded by these bound business processes. So we offered to reset the magic key for Tiffany in hopes that she would push the envelope with ACME 376, and go for the real world test of full participation in existing workgroup-workflow business processes, without disruption.
It's a test we know ACME and the OOXML plug-in for MSOffice can pass, but it is an impossible challenge for Sun's ODF plug-in.
Nevertheless, this is the real world challenge that defined the failure of ODF in Massachusetts. And that failure immediately (October 4th, 2006) slammed the door shut on ODF mandate legislation in California.
The thing is that the OpenDocument Foundation's plug-in model is a clone of Microsoft's OOXML plug-in, right down to taking ownership of the magic key. Both are "internal" conversion processes. In fact, the da Vinci plug-in conversion model trips MSOffice into its own OOXML conversion process, intercepting the conversion at a particular moment in time.
The Sun ODf plug-in relies on an “external” conversion process based on the OpenOffice.org conversion engine. While the OOo conversion engine is legendary in its ability to convert MS binary documents with extremely good fidelity, for which the OOo engineers deserve mountains of reverse engineering credit, it's not good enough to clear the high bar of full non-disruptive participation in existing MSOffice-bound business processes. It's also one of the primary motivations behind the incessant ODf big vendor drumbeat for legislative mandates that would put governments in the troublesome position of either the costly and disruptive “ripping out and replacing” of MSOffice, or using an antitrust wrecking ball to force the release of fully documented secret MS binary blueprints.
Governments today find themselves between a rock and a hard place. On the one hand, Microsoft is not about to give up their platform lock-in of documents and business processes. On the other hand, ODf big vendors seek to totally break the monopolist grip with a legislative, antitrust-inspired “mandate, rip out and replace, forced release of the binary blueprints” hard place.
Not liking these options, governments like Massachusetts sought out a third, more palatable alternative, and put out the call for an “internal” clone of the OOXML plug-in for MSOffice that could produce a similar non-disruptive, high fidelity “round trip” conversion to ODf. The result of this triangulation is a Mexican standoff where the first one to move gets shot by the other two, with October 4th, 2006 as a case in point.
We also believe that the Massachusetts RFi, “Request for information concerning the feasibility of an ODf plug-in for MSOffice”, signaled the end of antitrust-inspired “rip out and replace” government mandates. If Massachusetts was unwilling and unable to go forward with a rip out and replace implementation of ODF, we felt that no other government would either. So our choices were simple: either come up with a non-disruptive transitional plug-in solution, or sit back and complain as Microsoft used the non-disruptive transition to MS-OOXML as the means of expanding their monopoly from the desktop to include server, devices and web systems.
Okay, water under the bridge. You take your lessons learned and move on. Which the OpenDocument Foundation is doing with this transition from ODF to CDF.
We chose to distribute ACME 376 as an installable demonstration of the high fidelity the da Vinci conversion process is capable of. ACME 376 demonstrates a conversion fidelity matching that of the MS-OOXML plug-in, but nearly everyone is left wondering whether there is any useful purpose for ACME beyond the demonstration. After all, there are no other applications capable of rendering XML encoded RTF.
It's a fair question that requires an understanding of how the da Vinci conversion process pieces fit together. The simple explanation is that ACME 376 is a component at the center of the da Vinci conversion process. This is where the real conversion of MS binaries takes place, although it's a process of stages.
The ACME 376 files themselves are XML encoded RTF, and can be piped into or extracted back from any file format of sufficient reach. When you unzip an ACME 376 file, what you see is a decoded MS binary restructured as XML encoded RTF. That perhaps sounds a bit ziggy, but there is a method to this madness. It must be done this way because this is within the internal methodology Microsoft has exposed, and that we are able to work with.
The da Vinci conversion chain has two processes. The first is converting the MS binary (or MS-OOXML) to ACME 376. The second is piping the ACME 376 structure into a target file format such as ODf iX, CDF, or UOF (as an example). The quality of conversion fidelity depends on the quality of each of these two processes. You can have excellent conversion of the binary, and lose that fidelity piping into a target file format unable to hold and reflect perfectly the conversion contents.
When we pipe from the ACME 376 conversion stage into an ODf target, we lose upwards of 15% of our fidelity. If we use the five generic elements known as ODf iX, we get much closer to perfect conversion – the burden of results now reflecting on ACME 376 performance and not the target.
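As a rough way to see why both stages matter, here is a minimal Python sketch. It assumes, purely for illustration, that the fidelity of each stage can be expressed as a fraction and that the stages simply multiply. That model, the function name `chain_fidelity`, and the specific numbers are our assumptions for the sake of the example, not measurements of da Vinci.

```python
def chain_fidelity(*stage_fidelities):
    """Overall fidelity of a conversion chain.

    Assumption (illustrative only): each stage keeps a fraction of the
    information it receives, and the fractions multiply.
    """
    overall = 1.0
    for fidelity in stage_fidelities:
        if not 0.0 <= fidelity <= 1.0:
            raise ValueError("fidelity must be between 0 and 1")
        overall *= fidelity
    return overall


# Hypothetical inputs: 0.98 for the internal binary conversion,
# 0.85 for a target format that drops about 15% of what it is handed,
# 1.0 for a target format that can hold everything it is handed.
lossy_target = chain_fidelity(0.98, 0.85)
faithful_target = chain_fidelity(0.98, 1.0)

if __name__ == "__main__":
    print(f"lossy target:    {lossy_target:.3f}")
    print(f"faithful target: {faithful_target:.3f}")
```

Under this simplified model, an excellent conversion of the binary is dragged down by a target that cannot hold the result, while a target that holds everything leaves the outcome resting entirely on the first stage.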
So here is how it works. Let's start with the MS-OOXML conversion process because da Vinci is a clone of the MS Compatibility Pack plug-in. All da Vinci does is intercept this conversion process at the point called “MS-RTF”. MS-RTF is part of an internal process that Microsoft applications use to break the in-memory-binary-representation into a common structure designed to transcend application version variations. If the internal conversion process isn't triggered, the in-memory-binary-representation (imbr) would otherwise be dumped to disk as a MS binary.
::: The MS-OOXML conversion process looks like this:
imbr <> MS-RTF <> OOXML
::: The piping chain for da Vinci CDF+ looks like this:
imbr <> MS-RTF <> ACME <> InfoSet <> CDF+
ACME intercepts what we call MS-RTF. We call it that because it looks like RTF, but it's highly encoded, with lots of implied references and relationships. It's also totally undocumented. ACME decodes MS-RTF, as XML-RTF, and then passes this structure to InfoSet.
InfoSet was designed as a staging area enabling da Vinci to target many formats with the same install. The final stage is that of piping into the target which, in this case, is our favorite: CDF+.
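The two chains above can be written down as plain data, which makes the point of interception easy to see. The following Python sketch is only a model of the stage names, not of the conversion code itself. It reads `<>` as "can be traversed in either direction", which is our interpretation, and the helper names `hops` and `shared_prefix` are made up for this illustration.

```python
# Stage names exactly as they appear in the two chains above.
MS_OOXML_CHAIN = ["imbr", "MS-RTF", "OOXML"]
DA_VINCI_CDF_CHAIN = ["imbr", "MS-RTF", "ACME", "InfoSet", "CDF+"]


def hops(chain, source, target):
    """List the (from, to) steps needed to get from source to target.

    Works in both directions, since each link is treated as two-way.
    """
    start, end = chain.index(source), chain.index(target)
    step = 1 if end >= start else -1
    path = [chain[k] for k in range(start, end + step, step)]
    return list(zip(path, path[1:]))


def shared_prefix(chain_a, chain_b):
    """Stages two chains have in common before they diverge."""
    prefix = []
    for a, b in zip(chain_a, chain_b):
        if a != b:
            break
        prefix.append(a)
    return prefix


if __name__ == "__main__":
    # Saving: in-memory representation out to CDF+.
    print(hops(DA_VINCI_CDF_CHAIN, "imbr", "CDF+"))
    # Opening: CDF+ back into the in-memory representation.
    print(hops(DA_VINCI_CDF_CHAIN, "CDF+", "imbr"))
    # Where da Vinci leaves the MS-OOXML path.
    print(shared_prefix(MS_OOXML_CHAIN, DA_VINCI_CDF_CHAIN))
```

The shared prefix of the two chains ends at MS-RTF, which is the point where ACME takes over from the MS-OOXML conversion.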
If the OASIS ODf TC approves both the iX enhancements and the Universal Interoperability Framework Proposal, we will configure da Vinci to pipe into both CDF+ and ODf iX.
Keep in mind that the MSOffice developer platform is rich with features and integrated applications. Capturing this richness is a challenge that few file formats are capable of, and for sure, few file formats outside of Microsoft's are designed for this challenge. Many at the OASIS ODf TC will tell you that this is so difficult to pull off without also having access to the full documentation of the secret binary blueprints, that it wouldn't make any sense to try to design ODf to be compatible with the billions of existing binary documents. This challenge is thought to be Microsoft's sole responsibility, with the continuing hope that governments will force Microsoft to join the OASIS ODf TC and natively implement ODf.
For us, this hope ended the day Massachusetts released their plug-in RFi.
It must be noted that since da Vinci lets Microsoft do the conversion of the binaries, the quality of conversion is not dependent on having access to the binary blueprints. :) All the magic for da Vinci is in ACME 376.
Let me reiterate once more that we are unable to pipe MS documents into ODF unless and until we have our ODF iX "interoperability enhancements" (the five generic elements) and the UI framework. We will not eXtend ODf.
One very important aspect of the Universal Interoperability Framework Proposal is that the ODF Application Conformance-Compliance Clause would be changed to "MUST preserve" unknown elements and attributes. Without this much-needed change, there is no possibility of round tripping ODf iX documents.
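To see why a "MUST preserve" rule matters for round tripping, consider this minimal Python sketch. It is not ODf and not any real application's behavior: the tiny vocabulary in `KNOWN_TAGS`, the `ix-list` element, and the `rewrite` function are all hypothetical, chosen only to show what happens to an element an application does not recognize when the document is loaded and saved again.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary the consuming application understands.
KNOWN_TAGS = {"doc", "p"}

# Hypothetical document carrying one element outside that vocabulary.
SAMPLE = '<doc><p>Hello</p><ix-list level="2">item</ix-list></doc>'


def rewrite(xml_text, preserve_unknown):
    """Load a document and save it again.

    With preserve_unknown=False the application discards every element
    it does not recognize; with True it carries them through untouched.
    """
    root = ET.fromstring(xml_text)
    if not preserve_unknown:
        # Snapshot the tree first so removal does not disturb iteration.
        for parent in list(root.iter()):
            for child in list(parent):
                if child.tag not in KNOWN_TAGS:
                    parent.remove(child)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(rewrite(SAMPLE, preserve_unknown=False))
    print(rewrite(SAMPLE, preserve_unknown=True))
```

An application that drops what it does not understand hands back a document with the extra structure gone, so the next application in the workflow can never recover it. One that preserves unknown content lets the document make the trip out and back intact.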
imbr = the in-memory-binary-representation of both MS binary documents and OOXML documents.
MS-RTF = the secret internal structure MS applications use whenever they convert imbr, as in the conversion to OOXML.
ACME = the da Vinci plug-in internal structure designed to fully break and capture encoded MS-RTF.
InfoSet = the superset structure where da Vinci stages ACME for piping into targeted file formats such as ODF, UOF, and CDF. The only inhibitor is whether the target file format can "receive" the richness and application-specific implementation model of MSOffice binaries and XML. InfoSet is designed to be portable, enabling servers and Office 2.0 applications to embed InfoSet to perfect CDF conversions of MS binaries and XML.
CDF = the recently released W3C specification for next generation HTML, called the "Compound Document Format". Might as well call it the universal file format because, if we can do this with MSWord, Excel and PowerPoint, converting MS documents into universally accessible portable CDF+, the world changes. CDF+ simply indicates that this is a CDF file with supernatural powers!
openCSS = the key to da Vinci CDF+ is the writing of an MSWord/MSOffice-specific CSS style language, tentatively called "openCSS". An effort will be made to include OpenOffice nuances in openCSS as well. This CSS file will be open, well documented, and universally available.
One last thing. Our excitement over this discovery that we can pipe into CDF is important because CDF can solve all three of our market requirements. Let's list them again, and note that even if we had obtained the needed ODf iX changes from OASIS, ODf iX still would not be able to solve all three challenges. Challenge three is a doozy. And as it currently sits, one has to say that ODf was not designed to meet any of these challenges.
The Universal Document Market Requirements:
The iX “interoperability enhancement” issues are only one part of the difficult to implement ODf story. The other part is basic to any universal file format contender – the ability to solve three "real world" problems. Problems a universal file format must address because the world is not a clean slate:
Compatibility with existing documents - file formats :: including the volumes of MS binary documents.
Interoperability with existing applications :: including the over 500 million MSOffice bound workgroups.
Grand Convergence of desktop, server, device, and web systems as fluid and highly interoperable routers of documents, data, and media.
| Market Requirements | OpenDocument | Office Open XML | Compound Document Formats |
| Open? | Yes | No | Yes |
| Compatibility with legacy MS formats? | No | Yes | Yes |
| Interoperability framework? | No | No | Yes |
| Converges desktop, server, device and web? | No | Yes | Yes |
| Unencumbered by IPR? | No | No | Yes |
| | No | No | Yes |
| | Yes | No | Yes |
| | No | No | Yes |
Resources:
eWEEK's Tiffany Maleshefski, Office Formats Fail to Communicate
Foundation's Universal Interoperability Framework Proposal -- Part 1
Part II of our Universal Interoperability Proposal will reintroduce the ODf iX enhancements using the RDF Metadata model
Our quest for a Universal Document centers on this definition:
The Universal Document must be open, unencumbered, universally interoperable in that it properly reuses existing standards within an interoperability framework conforming to International Trade Agreement requirements, is totally application-platform-vendor independent, with a trusted governance that is not big vendor dominated or consortia controlled.