Office Open XML Overview

From OOXML-Wiki

Throughout all parts, name issue

The name "Office Open XML" is often mistakenly rendered as “Open Office XML”, implying a connection to the OpenOffice project that does not exist. This naming confusion has been documented and has occurred numerous times, including by analysts and even in Microsoft press releases and blogs. Since “Open Office” is the pre-existing name, by six years, Ecma should choose a new name that is less apt to perpetuate this confusion.

Proposed change: Change the name of Office Open XML to a name that cannot be confused with OpenOffice. SC34 standards usually end with 'DL' (description language), 'SL' (schema language), etc. For DIS 29500 a suitable name is RODDL (Run-based Office Document Description Language), which would remedy the fault noted.

Throughout all parts, standard evolution issue

From the overall document contents, it is acutely clear that no effort has been made in OOXML to start from the existing ISO standard for the representation of documents in XML, that is ODF 1.0, ISO/IEC 26300:2006. We can see no reason for that deliberate departure and contend that unneeded differences are harmful, and request that the OOXML proposal be rewritten starting from the existing standard.

Proposed change: Rewrite OOXML starting from ODF 1.0, ISO/IEC 26300:2006, for all matters that apply.

Page 2

“Preserving the financial and intellectual investment in those documents (both existing and new) has become a pressing priority. The emergence of these four forces – extremely broad adoption of the binary formats, technological advances, market forces that demand diverse applications, and the increasing difficulty of long-term preservation – have created an imperative to define an open XML format and migrate the billions of documents to it with as little loss as possible.”

There are three problems with this statement:

  1. It does not adequately explain why the parties to the ECMA TC process could not have instead joined the OASIS TC and driven the extension of ODF so that it could “represent faithfully” the same features that OOXML claims to.
  2. The assertion that there is an imperative to migrate all of the “billions” of legacy documents to the new format is questionable. It is unlikely that any organisation will have the resources to migrate and check the correctness of all of its legacy documents, or that it will have any concrete business case to do so. For a given organisation, a subset of its documents has to be preserved for a variety of statutory and historical purposes, and another subset needs to be continually available because it is part of the operational processing activities of the organisation. The majority of legacy documents are irrelevant to current operational processing, and do not need to be preserved as records. Organisations that are good at managing information will have processes that clear away these legacy documents. Many organisations are unable to dedicate resources to such processes, which is why we have “billions” of legacy documents.
  3. For that subset of documents that must be preserved as records, a further challenge arises to the assertions above. Why is a completely new XML format essential for digital preservation of records? An electronic record is "evidence of an activity or decision and demonstrates accountability" (National Archives, e-Government Policy Framework for Electronic Records Management, p.7 [1]) and, as such, records "need to be captured, managed and preserved in an organised system which maintains their integrity and authenticity" (ref. as before). Organisations with a statutory duty to preserve records in this way will usually have invested in Electronic Records Management systems that contain specific controls to ensure that files in a variety of formats can be protected from changes, even when stored in editable formats. The most economical and effective way to preserve old binary documents in MS formats would be for MS to provide full-featured document readers that can run on multiple platforms, available in perpetuity. In practice, this has been the case for many years, and it also appears to be the approach recommended by MS and the National Archives: creating a set of Virtual PC environments with copies of the original software pre-loaded on the OS images. [2]

Furthermore, if it were proven that an XML format is required for digital preservation, three further issues arise:

  1. There is evidence that translating legacy MS Office file formats into OOXML does not provide a perfect replica of the original file, e.g. in the implementation of charts, breaking the requirement that a record is unchanged from the original.
  2. OOXML as specified in DIS 29500 does not provide an explicit mapping of the legacy binary formats' layout features to the new XML format, instead wrapping them in elements such as 2.15.3.6 autoSpaceLikeWord95 (Emulate Word 95 Full-Width Character Spacing). The “informative” guidance given for elements like this one contains the following statement:
    “To faithfully replicate this behavior, applications must imitate the behavior of that application, which involves many possible behaviors and cannot be faithfully placed into narrative for this Office Open XML Standard. If applications wish to match this behavior, they must utilize and duplicate the output of those applications. It is recommended that applications not intentionally replicate this behavior as it was deprecated due to issues with its output, and is maintained only for compatibility with existing documents from that application.”
    There are two possible readings of this text. Either 1) it is not possible to specify how to replicate this behaviour in any OOXML consumer, and therefore the legacy preservation goal of the standard fails, or 2) it is possible to meet the goal, and this can be proven with MS Office 2007, in which case it is possible for the behaviours to be “faithfully placed into narrative” in some form because MS have done so within their product development group.
    A further implication is that legacy preservation can only be achieved by means of the MS Office 2007 product's conversion capabilities. This leads on to questions about the interoperability goals of the standard.
  3. An XML standard for office documents already exists, ISO 26300 ODF. It is already possible to convert legacy documents into ODF with very little loss of fidelity of layout. The experience of a number of public sector organisations in the UK and EU attests to this. The National Archives guidance on maintaining authenticity and integrity of records clearly accepts that there may be a need to transfer records from old formats to some newer, more sustainable format, and that in doing so some changes may be unavoidable. They say that "...a record has integrity if it remains complete and uncorrupted in all its essential respects throughout the course of its existence. This does not mean that a record must be precisely the same as it was when first created for its integrity to exist and be demonstrated. A record can be considered to be essentially complete and uncorrupted if the message that it is meant to communicate in order to achieve its purpose is unaltered." (The National Archives, Generic requirements for sustaining electronic information over time: 1 Defining the characteristics for authentic records. p. 14) [3] MS, and latterly the ECMA TC, could have proposed extensions to ODF that enabled preservation of features in MS binary formats that are currently unsupported, rather than defining a new and separate format.

4.1 INTEROPERABILITY [p3]

“Foremost, the interoperability of OpenXML has been accomplished through extensive contributions, modification, and review of the Specification by members of the Ecma TC45 committee...”

OOXML has not yet been proven to be interoperable, as no conforming consumers and producers have yet been created. This claim cannot be made until more than one full implementation of an application that produces and consumes conformant OOXML exists. This is made difficult by the problems with the conformance definition in Part 1 - Fundamentals as described elsewhere in these comments.

4.1 INTEROPERABILITY [p4]

References in the following bulleted list refer to the wrong sections:

  • OpenXML contains no restriction on image, audio or video types. For example, images can be in GIF, PNG, TIFF, PICT, JPEG or any other image type (§1:14.2.12).
  • Embedded controls can be of any type, such as Java or ActiveX (§1:15.2.8).
  • WordprocessingML font specifications can include font metrics and PANOSE information to assist in finding a substitution font if the original is not available (§3:2.10.5).

Proposed change: Change the references to

  • OpenXML ... (§1:15.2.13). - for "15.2.13 Image Part"
  • WordprocessingML font ... (§3:2.9.5). - for "2.9.5 Font Substitution Data"

“One of the central requirements for interoperability is independence from any particular type of source content.”

This claim is dubious, and relies on the absence of a clear definition of “interoperability” in the specification. I assume that one of the meanings of interoperability is the ability of Application A to produce an OOXML file that can be consumed by Application B, presented to the user with 100% fidelity, edited and saved, then consumed by Application N, still with 100% fidelity of representation.

If this is the case, it seems logical that a central requirement would be for clear standards-based specification of source content, such that a future consuming application, unknown to the producer, has clear expectations of the valid range of content found within a conforming OOXML file. Interoperability between applications requires rules that impose constraints, whereas “independence from any particular type of source content” implies a lack of determining structure. If a conformant OOXML file can contain any type of source content, conforming consumers will have to support any type of source content - which is clearly impossible.

5.2 WORDPROCESSINGML [p11]

r – run (§3:2.4.2). The description of a run is confused about whether a run is limited to text only and whether it can contain additional markup: "[A run] Can contain multiple types of run content, primarily text ranges. ... A run is a contiguous piece of text with identical properties; a run contains no additional text markup." Part 3 and Part 4 reiterate "...the run, which defines a region of text...". The implied "text-only" restriction is wrong. Part 3 and Part 4 show, with many examples, that a run can contain a range of additional text markup in child elements such as delText, endnoteRef, fldChar, ... (e.g., see §4:2.3.2.23). They also define that a run can contain non-text items such as drawing (DrawingML Object), object (Inline Embedded Object), pict (VML Object), ... (e.g., see §4:2.3.2.23).

Proposed change: Clearly define the general concept of a run as something that can contain multiple types of content, primarily a text range with the same properties. [This wording assumes that the primary intent of a run is text rather than other content types; if it is not, use words like "such as a text range ...".] This also requires changes to the sections in Parts 3 and 4, including section titles that imply runs are only for text. [The text content is defined by a sub-element, t. §4:2.3.3.30]
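
To illustrate the point, the following Python sketch parses a hand-written WordprocessingML fragment and lists what each run contains. The fragment is hypothetical and has not been validated against the DIS 29500 schemas; the element names (t, fldChar, delText) are those cited above, and the helper name run_content_kinds is ours.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical fragment: three runs, only the first of which holds a plain text range.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:rPr><w:b/></w:rPr><w:t>Hello</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:delText>old</w:delText></w:r>
</w:p>"""


def run_content_kinds(paragraph_xml):
    """Return, for each run, the local names of its children other than rPr."""
    root = ET.fromstring(paragraph_xml)
    kinds = []
    for run in root.iter(f"{{{W}}}r"):
        # Strip the "{namespace}" prefix that ElementTree puts on every tag.
        names = [child.tag.split("}", 1)[1] for child in run]
        kinds.append([name for name in names if name != "rPr"])
    return kinds


if __name__ == "__main__":
    for index, kinds in enumerate(run_content_kinds(SAMPLE), start=1):
        print(f"run {index}: {kinds}")
```

Only the first run here is "a contiguous piece of text"; the other two carry markup that the Overview's wording says a run cannot contain.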

5.2 WORDPROCESSINGML [p11]

t – text range (§3:2.4.3.1). The statement about text formatting inheritance from run properties and paragraph properties is too limiting, because it does not account for the entire style hierarchy, as alluded to in the following paragraph in the Overview.

Proposed change: Change the sentence to indicate inheritance from the style hierarchy: "The formatting for the text is inherited from any run properties and paragraph properties, and from the higher style hierarchy as outlined in the following paragraph."
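
As a rough illustration of what inheritance from a style hierarchy involves, the sketch below resolves a run's effective formatting by layering four levels, with the most specific level winning. This is a simplified assumption and not the resolution algorithm of the specification: the level names, property keys and values are all hypothetical, and other sources of formatting (for example table or numbering styles) are ignored.

```python
from collections import ChainMap


def effective_run_properties(direct, character_style, paragraph_style, doc_defaults):
    """Merge formatting levels; mappings listed earlier override later ones."""
    return dict(ChainMap(direct, character_style, paragraph_style, doc_defaults))


# Hypothetical property values for a single run.
props = effective_run_properties(
    direct={"bold": True},                               # set on the run itself
    character_style={"color": "FF0000"},                 # style applied to the run
    paragraph_style={"font": "Arial", "bold": False},    # style of the enclosing paragraph
    doc_defaults={"font": "Times New Roman", "size": 22},
)

if __name__ == "__main__":
    print(props)
```

Here "bold" comes from the run, "color" from the character style, "font" from the paragraph style and "size" from the defaults, so a description that mentions only run and paragraph properties leaves two of the four sources unexplained.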

5.2 WORDPROCESSINGML [p11]

t – text range (§3:2.4.3.1). Is it OK to define OOXML attributes and behaviour within another standard (the separate XML 1.0 specification)?

I believe that preserving whitespace is not "often" needed for routine text runs (perhaps only when several text runs need to be merged). If preserve is "often" used, why is the WordprocessingML default to remove white space?

Proposed change: Change sentence to clarify use of the xml:space="preserve" attribute.
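
For reference, the Python sketch below shows where the attribute sits on a text range and how a consumer might read it. The fragment and the helper name text_ranges are hypothetical. Note that the XML parser returns the trailing space in either case; the attribute, which is defined by XML 1.0 rather than by OOXML, only signals to the consuming application that the whitespace is significant.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace and the qualified name of xml:space.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

# Hypothetical fragment: the first text range ends with a significant space.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Hello </w:t></w:r>
  <w:r><w:t>world</w:t></w:r>
</w:p>"""


def text_ranges(paragraph_xml):
    """Return (text, xml:space value or None) for every t element."""
    root = ET.fromstring(paragraph_xml)
    return [(t.text, t.get(XML_SPACE)) for t in root.iter(f"{{{W}}}t")]


if __name__ == "__main__":
    for text, space in text_ranges(SAMPLE):
        print(repr(text), space)
```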

6 SUMMARY [p13]

"OpenXML ... and its documentation has become both complete (through extensive reference material) ..."

The documentation is not (yet) complete, which is in part the reason for the review process.

Proposed change: Change the sentence to state "OpenXML ... and its documentation includes extensive reference material ..."

6 SUMMARY [p13]

"The compelling need exists for an open document-format standard that is capable of preserving the billions of documents that have been created in the preexisting binary formats,..."

As stated, the need is for an open document-format standard that is capable of preserving the documents. This does not mean that the standard has to be a new XML representation of the preexisting binary formats. There is already an open document-format standard that is capable of preserving the documents, that is already in widespread use, and whose evolution has for some time "enjoyed the checks and balances afforded by an open standards process".

If the Summary needs a statement about the need for an OOXML standard, it should explain whether there is a need for another open document-format standard alongside existing established standards, and how the new standard will interoperate with them.
