The most widespread document formats are word-processor and other office-document formats. When this material was compiled, two of these formats were more or less reliably documented in international standards, namely the Open Document Format and the Office Open XML File Format. For other word-processor formats, technical documentation is rarely available. Technical people of sufficient skill and patience can often reverse engineer a format if well-understood sample documents in the format are available for examination. In such efforts, partial success is often attainable; perfect success is a theoretical possibility.
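As a first step in examining an unfamiliar office document, a curator can check which of the two standardized families it belongs to. The sketch below rests on an assumption that is not stated above: that the document is stored in the common ZIP-packaged form, in which Open Document Format packages carry a `mimetype` member and Office Open XML packages carry a `[Content_Types].xml` member. The function names `sniff_office_package` and `make_package` are invented for illustration.

```python
import io
import zipfile


def sniff_office_package(data):
    """Guess the office-document family of a ZIP package from its member names."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return "not a ZIP package"
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        names = set(package.namelist())
        # Assumption: OOXML packages list their parts in [Content_Types].xml.
        if "[Content_Types].xml" in names:
            return "Office Open XML"
        # Assumption: ODF packages declare their media type in a 'mimetype' member.
        if "mimetype" in names:
            mimetype = package.read("mimetype").decode("ascii", "replace").strip()
            if mimetype.startswith("application/vnd.oasis.opendocument."):
                return "Open Document Format"
    return "unknown ZIP package"


def make_package(members):
    """Build a small in-memory ZIP package for demonstration (hypothetical helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name, text in members.items():
            package.writestr(name, text)
    return buffer.getvalue()


sample = make_package({
    "mimetype": "application/vnd.oasis.opendocument.text",
    "content.xml": "<placeholder/>",
})
print(sniff_office_package(sample))
```

A check of this kind only identifies the container; it says nothing about whether the content inside actually conforms to the relevant standard.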
A second widely used application format is HTML, which is defined for the display of documents on the World Wide Web. In addition to the resources listed below, the W3C has published a number of ancillary documents related to HTML; see the W3C Technical Reports page.
XHTML™ 1.0 : World Wide Web Consortium (W3C).
HTML 5 : World Wide Web Consortium (W3C).
SGML, XML, and their Applications
Less widely used than word-processor formats or HTML, but perhaps more popular among digitization projects concerned with data longevity and reuse, are document formats designed for the software-independent representation of documents. These are of particular interest and importance for data curators because they seek, by design, to make the documents represented independent of any single piece of software. They thus avoid the single most common cause of format obsolescence, which is discontinuation of the software supporting the format. The desire for software independence also forces the designers of such formats to document the meaning of the format somewhat more carefully and completely than is usual among creators of software-specific formats.
While in theory there is no end to the methods that might be used to define document formats in a software-independent way, in practice almost all recent efforts in this direction have used SGML or XML.
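One practical consequence of software independence is that a generic parser, with no knowledge of the vocabulary, can already recover a document's structure. The following is a minimal sketch using the XML parser in Python's standard library; the sample document and its element names (`article`, `section`, `para`) are invented for illustration and do not belong to any of the vocabularies listed here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document; the vocabulary is made up for this example.
SAMPLE = """<article>
  <title>On Format Obsolescence</title>
  <section>
    <title>Background</title>
    <para>Formats can outlive the software that wrote them.</para>
    <para>Documentation helps.</para>
  </section>
</article>"""


def element_inventory(xml_text):
    """Count how often each element type occurs in an XML document."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


for tag, count in sorted(element_inventory(SAMPLE).items()):
    print(tag, count)
```

An inventory like this is often a useful first look at an unfamiliar collection, since it shows which element types are in use before any vocabulary-specific tooling is involved.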
There are a very large number of XML-based vocabularies; the single most useful source of information about them, and more generally about XML and related technologies, is The Cover Pages, compiled by Robin Cover and currently hosted by the Organization for the Advancement of Structured Information Standards (OASIS).
The Cover Pages : Robin Cover, ed.
JATS: Journal Article Tag Suite : NISO: National Information Standards Organization.
DocBook.org : Norman Walsh, ed.
TEI P5 : Lou Burnard and Syd Bauman, eds.
TeX is a batch document-formatting program written by the computer scientist Donald Knuth; its capabilities for formatting mathematical expressions are particularly well regarded. Because the formatting commands intrinsic to TeX operate at an extremely low level, it is customary to use TeX by defining higher-level commands called macros. Over the years, a number of TeX macro sets have been written.
By far the most commonly used set of macros for TeX is LaTeX, originally written by the computer scientist Leslie Lamport.
TeX and LaTeX are in wide use for the creation of technical and scientific documents, particularly among academics. Unfortunately, the data format is defined exclusively in terms of the operational semantics provided by the executable TeX program; while it is possible in principle to define a declarative semantics for most of LaTeX, in practice many LaTeX authors extend the system with macros of their own. For preservation purposes, therefore, TeX and LaTeX documents rely on the continued existence of software to process them. Fortunately, the source code for TeX and LaTeX is publicly available and written with a great deal of care to be device- and system-independent.
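Because author-defined macros are what make a LaTeX document hard to interpret without running TeX, it can be useful to flag documents that define their own. The sketch below is a rough heuristic, not a TeX parser: it looks only for the common definition commands (`\newcommand`, `\renewcommand`, `\providecommand`, `\def` and its variants) and will miss macros defined in other ways or in external files. The function name `author_defined_macros` is invented for illustration.

```python
import re

# Heuristic pattern for the most common ways of defining a macro.
MACRO_DEFINITION = re.compile(
    r"\\(?:newcommand|renewcommand|providecommand)\*?\s*\{?\s*\\([A-Za-z@]+)"
    r"|\\[egx]?def\s*\\([A-Za-z@]+)"
)

# Hypothetical sample source used only to exercise the function.
SAMPLE = r"""
\documentclass{article}
\newcommand{\vect}[1]{\mathbf{#1}}
\def\R{\mathbb{R}}
% \newcommand{\unused}{this line is commented out}
\begin{document}
Let $\vect{x} \in \R^n$.
\end{document}
"""


def author_defined_macros(latex_source):
    """Return the names of macros that the source appears to define itself."""
    names = []
    for line in latex_source.splitlines():
        # Drop comments: everything after a '%' not preceded by a backslash.
        code = re.sub(r"(?<!\\)%.*", "", line)
        for match in MACRO_DEFINITION.finditer(code):
            names.append(match.group(1) or match.group(2))
    return names


print(author_defined_macros(SAMPLE))
```

A non-empty result suggests that the document's meaning depends on definitions beyond standard LaTeX, and that those definitions need to be preserved along with it.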
PostScript is a programming language devised by Adobe Systems Incorporated; PDF (Portable Document Format) is a document format devised by the same organization, which uses a subset of PostScript and provides rules for embedding fonts in a document and for bundling all the pieces of a document together.
Although PDF was originally a proprietary format, it has since been standardized by ISO (as ISO 32000), and Adobe has issued a public license allowing the use of its patented technology in the creation of PDF software that supports the ISO standard definition of PDF.
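When assessing a collection of PDF files, one simple check is to read the version that each file declares for itself. The sketch below assumes the conventional header of the form `%PDF-M.N` at the very start of the file; it does not validate the rest of the file or confirm that the content matches the declared version. The function name `pdf_version` is invented for illustration.

```python
import re

# Assumption: a PDF file starts with a header such as "%PDF-1.7".
PDF_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_version(data):
    """Return the (major, minor) version declared in a PDF header, or None."""
    match = PDF_HEADER.match(data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Usage with in-memory bytes; for a real file, pass open(path, "rb").read(16).
print(pdf_version(b"%PDF-1.7\n1 0 obj\n"))
print(pdf_version(b"<html></html>"))
```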
TeX Users Group Home Page : TeX Users Group.
Document management — Portable document format : ISO: International Organization for Standardization.