Since the set of formats potentially faced by a digital curator is unbounded, it is not feasible to provide a guide to all relevant formats. This section mentions a few of the most common and most important formats and points to other resources for further information.
C.M. Sperberg-McQueen, Black Mesa Technology
David Dubin, University of Illinois, Urbana-Champaign
It is central to the conception of database management systems that the internal data representation of material in the database should not be visible to users of the database, except through a defined application-program interface (API) such as SQL. Discussions of the formats used internally are thus of no particular use to users of database management systems. They are in any case not standardized; competing systems strive to find data representations that allow faster indexing or retrieval and/or more compact storage, and in the case of commercial products the details of the representation are likely to be a closely guarded trade secret.
In order to allow mass imports or exports of data, however, database management systems typically provide one or more dump formats which can be read and written by the system. These are again apt to be implementation-specific, though comma-separated-value (CSV) formats are common. There is no standard definition of the CSV format, however, and implementations vary a good deal in punctuation rules and character sets. Occasional attempts have been made to write out a coherent specification for the CSV format, but these appear not to have any influence on the majority of implementors. The problems inherent in such variation led database vendors to adopt XML for inter-database exchange very early in the life of the XML specification.
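The variation in punctuation rules can be made concrete with a small sketch. It uses Python's standard `csv` module; the two export strings and the helper name `read_rows` are hypothetical, invented only for illustration. The same record is written under two different conventions, and it is recovered correctly only when the reader is told which delimiter and quote character the exporter used.

```python
import csv
import io

# Two hypothetical exports of the same record, written with different punctuation rules.
comma_export = 'name,note\r\n"Smith, Jane","said ""hello"""\r\n'
semicolon_export = "name;note\r\n'Smith, Jane';'said \"hello\"'\r\n"


def read_rows(text, delimiter, quotechar):
    """Parse CSV text using explicitly stated delimiter and quote characters."""
    stream = io.StringIO(text, newline="")
    return list(csv.reader(stream, delimiter=delimiter, quotechar=quotechar))


# Read with the conventions each exporter actually used: both yield the same rows.
rows_a = read_rows(comma_export, ",", '"')
rows_b = read_rows(semicolon_export, ";", "'")

# Read the second export with the first export's conventions: the fields are mis-split.
wrong = read_rows(semicolon_export, ",", '"')

print(rows_a)
print(rows_b)
print(wrong)
```

In practice this suggests recording the delimiter, quoting convention, and character encoding alongside any CSV dump that is kept, since the file itself does not declare them.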
Most widespread are word-processor and other office-document formats. When this material was compiled, two of these formats were more or less reliably documented in international standards, namely the Open Document Format and the Office Open XML File Format. For other word-processor formats, there is rarely any technical documentation. It is often possible for technical people of sufficient skill and patience to reverse-engineer a format, if well-understood sample documents in the format are available for examination. In such efforts, partial success is often attainable; perfect success is a theoretical possibility.
A second widely used application format is defined for the display of documents on the World Wide Web. In addition to the resources listed below, the W3C has published a number of ancillary documents related to HTML; see the W3C Technical Reports page.
XHTML™ 1.0 : World Wide Web Consortium (W3C).
HTML 5 : World Wide Web Consortium (W3C).
SGML, XML, and their Applications :
Less widely used than word-processor formats or HTML, but perhaps more popular among digitization projects concerned with data longevity and reuse, are document formats designed for the software-independent representation of documents. These are of particular interest and importance for data curators because they seek, by design, to make the documents represented independent of any single piece of software. They thus avoid the single most common cause of format obsolescence, which is discontinuation of the software supporting the format. The desire for software independence also forces the designers of such formats to document the meaning of the format somewhat more carefully and completely than is usual among creators of software-specific formats.
While in theory there is no end to the methods that might be used to define document formats in a software-independent way, in practice almost all recent efforts in this direction have used SGML or XML.
There are a very large number of XML-based vocabularies; the single most useful source of information about them, and more generally about XML and related technologies, is The Cover Pages, compiled by Robin Cover and currently hosted by the Organization for the Advancement of Structured Information Standards, OASIS.
The Cover Pages : Robin Cover, ed.
JATS: Journal Article Tag Suite : NISO: National Information Standards Organization.
DocBook.org : Norman Walsh, ed.
TEI P5 : Lou Burnard and Syd Bauman, eds.
TeX is a batch document-formatting program written by the computer scientist Donald Knuth; its capabilities for formatting mathematical expressions are particularly well thought of. Since the formatting commands intrinsic to TeX operate at an extremely low level, it is customary to use TeX by defining higher level commands called macros. Over the years, a number of TeX macro sets have been written.
By far the most commonly used set of macros for TeX is LaTeX, originally written by the computer scientist Leslie Lamport.
TeX and LaTeX are in wide use for the creation of technical and scientific documents, particularly among academics. Unfortunately, the data format is defined exclusively in terms of the operational semantics provided by the executable TeX program; while it is possible in principle to define a declarative semantics for most of LaTeX, in practice many LaTeX authors extend the system with macros of their own. For preservation purposes, therefore, TeX and LaTeX documents rely on the continued existence of software to process them. Fortunately, the source code for TeX and LaTeX is publicly available and written with a great deal of care to be device- and system-independent.
PostScript is a programming language devised by Adobe Systems Incorporated; PDF (Portable Document Format) is a document format devised by the same organization, which uses a subset of PostScript and provides rules for embedding fonts in a document and for bundling all the pieces of a document together.
While originally a proprietary format, PDF has more recently been standardized and Adobe has issued a public license allowing the use of its patented technology in the creation of PDF software that supports the ISO standard definition of PDF.
TeX Users Group Home Page : TeX Users Group.
Document management — Portable document format : ISO: International Organization for Standardization.
Image formats fall into two classes: raster graphics, which represent images as an array of picture elements (pixels) coded for color, and vector graphics, which represent images as sets of geometric shapes (lines, rectangles, circles, ellipses, curves of various degrees of complexity, and text).
Vector graphics are more often used for the creation of new graphics than for the digitization of pre-existing non-digital graphic material, so for curatorial purposes the reader is more likely to encounter raster graphics than vector graphics. Vector graphics nonetheless have a number of properties that make them attractive for the creation of new images: they are often more compact, and they do not degrade when the user zooms in on details. The reader may therefore wish to use vector graphics when creating new materials.
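The difference between the two classes can be sketched in a few lines of Python. The data structures and the function names `scale_raster` and `scale_vector` are illustrative assumptions, not any real image format: a raster is modeled as a grid of pixel values, and a vector image as a list of shape descriptions.

```python
# Raster: a grid of pixel values (0 = white, 1 = black); here a 4x4 diagonal line.
raster = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

# Vector: the same picture described as a single geometric shape.
vector = [{"shape": "line", "x1": 0, "y1": 0, "x2": 4, "y2": 4}]


def scale_raster(pixels, factor):
    """Enlarge by repeating each pixel; no detail is added, the blocks just grow."""
    out = []
    for row in pixels:
        wide = [value for value in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def scale_vector(shapes, factor):
    """Enlarge by multiplying coordinates; the description stays the same size."""
    return [
        {key: (val * factor if isinstance(val, (int, float)) else val)
         for key, val in shape.items()}
        for shape in shapes
    ]


big_raster = scale_raster(raster, 3)
big_vector = scale_vector(vector, 3)

print(len(big_raster), "x", len(big_raster[0]), "pixels after scaling")
print(big_vector)
```

Enlarging the raster multiplies the amount of stored data and turns each original pixel into a visible block, whereas the enlarged vector image is still one line with larger coordinates.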
Several formats can contain graphic elements in either raster or vector format.
Historically, the representation of numbers in electronic form has been a fundamental design question for computer systems, with analog and digital representations competing with each other for adoption. In modern digital systems, four main families of representations can be distinguished:
Those involved with data curation will probably seldom have need for detailed technical understanding of the formats historically or now used to represent numbers in computing. (Exceptions may arise when dealing with material which uses non-standard number formats for any reason.) But it may be worth while to scan the descriptions of number representations in Wikipedia, if only to dispel the notion that computer representations of numbers are somehow natural and thus simpler and less problematic than computer representations of other datatypes. The treatment in Wikipedia is reasonably sound, though by nature it will strike some readers as a bit dry.
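A short sketch may help show why number representations are less "natural" than they seem. It uses only Python's standard library (`struct`, `decimal`, `fractions`), and the variable names are chosen here for illustration. It touches on two familiar pitfalls, inexact binary floating point and byte order, and is not a survey of number formats.

```python
import struct
from decimal import Decimal
from fractions import Fraction

# Binary floating point cannot store 0.1 exactly, so this familiar sum is not exact.
float_sum_equal = (0.1 + 0.2 == 0.3)

# A decimal representation built from strings keeps the values exact.
decimal_sum_equal = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

# The exact rational value actually stored for the float literal 0.1.
stored_value = Fraction(0.1)

# The same 32-bit integer serialized with two different byte orders.
big_endian = struct.pack(">i", 1)
little_endian = struct.pack("<i", 1)

# A negative integer in 32-bit two's complement form.
minus_one = struct.pack(">i", -1)

print(float_sum_equal, decimal_sum_equal)
print(stored_value == Fraction(1, 10))
print(big_endian.hex(), little_endian.hex(), minus_one.hex())
```

A binary file of numbers is therefore only interpretable if the width, byte order, and encoding of each value are documented somewhere, which is the kind of detail a curator may need to record for material in non-standard number formats.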