Data Management for the Humanities

Format Information

Since the set of formats a digital curator may face is unbounded, it is not feasible to provide a guide to all relevant formats. This section mentions a few of the most common and most important formats and points to other resources for further information.

Resources

Adapted from:

“Data Representation” 
C.M. Sperberg-McQueen, Black Mesa Technology
David Dubin, University of Illinois, Urbana-Champaign

Databases

It is central to the conception of database management systems that the internal data representation of material in the database should not be visible to users of the database, except through a defined application-program interface (API) such as SQL. Discussions of the formats used internally are thus of no particular use to users of database management systems. They are in any case not standardized; competing systems strive to find data representations that allow faster indexing or retrieval and/or more compact storage, and in the case of commercial products the details of the representation are likely to be a closely guarded trade secret.

In order to allow mass imports or exports of data, however, database management systems typically provide one or more dump formats which can be read and written by the system. These are again apt to be implementation-specific, though comma-separated-value (CSV) formats are common. There is no universally followed definition of the CSV format, however, and implementations vary a good deal in punctuation rules and character sets. Occasional attempts have been made to write out a coherent specification for the CSV format (RFC 4180 is one), but these appear to have had little influence on the majority of implementors. The problems inherent in such variation led database vendors to adopt XML for inter-database exchange very early in the life of the XML specification.
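
The effect of such variation can be sketched with Python's standard csv module. The sample records, the two export conventions, and the helper names dump and load are assumptions made for illustration; they do not describe the dump format of any particular database system.

```python
import csv
import io

# Two sample records (hypothetical data) to be exported.
rows = [["id", "title", "note"],
        ["1", "Faust, Part One", 'He said "hello"']]

def dump(rows, **fmt):
    """Serialize rows to a CSV string using the given format parameters."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n", **fmt).writerows(rows)
    return buf.getvalue()

def load(text, **fmt):
    """Parse a CSV string back into a list of rows."""
    return list(csv.reader(io.StringIO(text), **fmt))

comma_dump = dump(rows)                     # comma-delimited convention
semicolon_dump = dump(rows, delimiter=";")  # semicolon-delimited convention

# Reading with the matching convention recovers the original records.
round_trip = load(semicolon_dump, delimiter=";")

# Reading with the wrong delimiter raises no error; it silently
# produces differently shaped fields.
mismatch = load(semicolon_dump)
```

The mismatched read is the troublesome case for curators: nothing fails, so the damage is only visible to someone who inspects the resulting fields.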

Documents

Word-Processor Formats: 

Most widespread are word-processor and other office-document formats. When this material was compiled, two of these formats were more or less reliably documented in international standards, namely the Open Document Format (ISO/IEC 26300) and the Office Open XML File Format (ISO/IEC 29500). For other word-processor formats, technical documentation is rarely available. It is often possible for technical people of sufficient skill and patience to reverse engineer a format, if well-understood sample documents in the format are available for examination. In such efforts, partial success is often attainable; perfect success is a theoretical possibility.

HTML: 

A second widely used application format is defined for the display of documents on the World Wide Web. In addition to the resources listed below, the W3C has published a number of ancillary documents related to HTML; see the W3C Technical Reports page.

Resources:  

XHTML™ 1.0 : World Wide Web Consortium (W3C).

HTML 5 : World Wide Web Consortium (W3C).

SGML, XML, and their Applications:

Less widely used than word-processor formats or HTML, but perhaps more popular among digitization projects concerned with data longevity and reuse, are document formats designed for the software-independent representation of documents. These are of particular interest and importance for data curators because they seek, by design, to make the documents represented independent of any single piece of software. They thus avoid the single most common cause of format obsolescence, which is discontinuation of the software supporting the format. The desire for software independence also forces the designers of such formats to document the meaning of the format somewhat more carefully and completely than is usual among creators of software-specific formats.

While in theory there is no end to the methods that might be used to define document formats in a software-independent way, in practice almost all recent efforts in this direction have used SGML or XML.
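
As a minimal sketch of what software independence means in practice, the fragment below is read with the generic XML parser in Python's standard library. The vocabulary (poem, title, line, and the n attribute) is invented for this illustration and is not taken from TEI, DocBook, or any other published schema.

```python
import xml.etree.ElementTree as ET

# A hypothetical document; the element names are illustrative only.
doc = """<poem>
  <title>Example</title>
  <line n="1">First line of verse</line>
  <line n="2">Second line of verse</line>
</poem>"""

# Any conforming XML parser can recover the same structure; nothing here
# depends on the program that originally produced the document.
root = ET.fromstring(doc)
title = root.findtext("title")
lines = [(int(el.get("n")), el.text) for el in root.findall("line")]
```

The same document could be processed equally well by an XML parser in any other language, which is the property that makes such formats attractive for long-term preservation.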

There are a very large number of XML-based vocabularies; the single most useful source of information about them, and more generally about XML and related technologies, is The Cover Pages, compiled by Robin Cover and currently hosted by the Organization for the Advancement of Structured Information Standards, OASIS.

Resources:

The Cover Pages : Robin Cover, ed.

JATS: Journal Article Tag Suite : NISO: National Information Standards Organization.

DocBook.org : Norman Walsh, ed.

TEI P5 : Lou Burnard and Syd Bauman, eds.

Other Formats: 

TeX is a batch document-formatting program written by the computer scientist Donald Knuth; its capabilities for formatting mathematical expressions are particularly highly regarded. Since the formatting commands intrinsic to TeX operate at an extremely low level, it is customary to use TeX by defining higher-level commands called macros. Over the years, a number of TeX macro sets have been written.

By far the most commonly used set of macros for TeX is LaTeX, originally written by the computer scientist Leslie Lamport.

TeX and LaTeX are in wide use for the creation of technical and scientific documents, particularly among academics. Unfortunately, the data format is defined exclusively in terms of the operational semantics provided by the executable TeX program; while it is possible in principle to define a declarative semantics for most of LaTeX, in practice many LaTeX authors extend the system with macros of their own. For preservation purposes, therefore, TeX and LaTeX documents rely on the continued existence of software to process them. Fortunately, the source code for TeX and LaTeX is publicly available and written with a great deal of care to be device- and system-independent.

PostScript is a programming language devised by Adobe Systems Incorporated; PDF (Portable Document Format) is a document format devised by the same organization, which uses a subset of PostScript and provides rules for embedding fonts in a document and for bundling all the pieces of a document together.

While originally a proprietary format, PDF has more recently been standardized and Adobe has issued a public license allowing the use of its patented technology in the creation of PDF software that supports the ISO standard definition of PDF.

Resources:

TeX Users Group Home Page : TeX Users Group.

Document management — Portable document format : ISO: International Organization for Standardization.

Images

Image formats fall into two classes: raster graphics, which represent images as an array of picture elements (pixels) coded for color, and vector graphics, which represent images as sets of geometric shapes (lines, rectangles, circles, ellipses, curves of various degrees of complexity, and text).

Vector graphics are more often used for the creation of new graphics than for the digitization of pre-existing non-digital graphic material, so for curatorial purposes the reader is more likely to encounter raster graphics than vector graphics. Vector graphics have a number of properties that make them attractive for the creation of new images, however (they are often more compact, and they do not degrade when the user zooms in on details), and the reader may wish to use vector graphics when creating new materials.
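
The difference between the two classes can be sketched in a few lines of Python. The dictionary describing a circle and the function names rasterize and show are hypothetical and do not correspond to any real image format; the point is only that one short vector description can be rendered into rasters of any resolution.

```python
# A vector description: one circle in a unit square (hypothetical fields).
circle = {"cx": 0.5, "cy": 0.5, "r": 0.4}

def rasterize(shape, size):
    """Render the circle into a size x size grid of 0/1 pixels."""
    grid = []
    for row in range(size):
        y = (row + 0.5) / size          # sample at the pixel center
        line = []
        for col in range(size):
            x = (col + 0.5) / size
            dist_sq = (x - shape["cx"]) ** 2 + (y - shape["cy"]) ** 2
            line.append(1 if dist_sq <= shape["r"] ** 2 else 0)
        grid.append(line)
    return grid

def show(grid):
    """Format a pixel grid as text, one character per pixel."""
    return "\n".join("".join("#" if p else "." for p in row) for row in grid)

small = rasterize(circle, 8)    # 64 pixels
large = rasterize(circle, 64)   # 4096 pixels from the same three numbers
```

Calling print(show(small)) displays a visibly blocky circle. Enlarging that raster could only repeat its pixels, whereas rendering the vector description again at a larger size yields a smoother outline.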

Several formats can contain graphic elements in either raster or vector format.

Resources : Raster Graphics

Resources : Vector Graphics

Numbers

Historically, the representation of numbers in electronic form has been a fundamental design question for computer systems, with analog and digital representations competing with each other for adoption. In modern digital systems, several main families of representations can be distinguished, including:

  • Integers are typically represented in a fixed-width field of bits, either as unsigned base-2 numbers (so the possible values representable in a field of n bits range from 0 to 2^n - 1) or as signed numbers. Different methods of representing negative numbers are possible; virtually all current systems use the so-called “two's-complement” representation (which will not be explained here).
  • Since binary numbers have rounding properties that differ from those of decimal numbers, they can cause problems for financial applications (which conventionally assume and require rounding behaviors suitable for decimal numbers). For this reason, systems intended for commercial use (most notably mainframe computers manufactured by IBM) often use binary-coded decimal representations of numbers. In this system, groups of four bits are used to represent the decimal digits, and a number is represented as a sequence of such decimal digits. Fractional numbers are handled by conventions at the programming-language level or higher which supply an implicit decimal point at a fixed location. The number of bits used to represent a number may vary. Computer hardware other than IBM mainframes seldom has hardware support for binary-coded decimal arithmetic, but software systems designed to support computation with large numbers often use binary-coded decimal representations.
  • Real numbers pose a particularly thorny problem for digital systems, since one of the fundamental properties of the real number continuum (the fact that given any two real numbers we can identify a third midway between them) is very difficult to model with a digital system. For most purposes, real numbers are represented in modern computer systems using floating-point binary numbers, which use a fixed-width bit field to represent numbers with a range of values and arithmetic precisions. Over the years, the width of the bit field commonly used for floating-point numbers has grown from 32 bits to 64 and 128 bits. The standard representation of floating-point binary is defined by IEEE 754, which is supported by virtually all current hardware; other floating-point binary formats survive in some specialized markets. The 2008 revision of IEEE 754 specifies not only floating-point binary but also floating-point decimal formats.
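
The integer and binary-coded decimal representations described above can be illustrated with a short Python sketch. The function names are hypothetical, and the binary-coded decimal encoding shown is the plain four-bits-per-digit scheme without a sign; actual mainframe packed-decimal layouts differ in detail.

```python
def unsigned_range(n):
    """Smallest and largest values of an n-bit unsigned integer."""
    return 0, 2 ** n - 1

def twos_complement(value, n):
    """Bit string of a signed integer in an n-bit two's-complement field."""
    if not -(2 ** (n - 1)) <= value < 2 ** (n - 1):
        raise ValueError("value does not fit in the field")
    # Masking with 2^n - 1 maps negative values onto the upper half
    # of the unsigned range, which is the two's-complement bit pattern.
    return format(value & (2 ** n - 1), "0{}b".format(n))

def to_bcd(number):
    """Encode a non-negative integer as binary-coded decimal,
    using one group of four bits per decimal digit."""
    return " ".join(format(int(digit), "04b") for digit in str(number))
```

For example, unsigned_range(8) gives (0, 255), twos_complement(-1, 8) gives "11111111", and to_bcd(1995) gives "0001 1001 1001 0101", one four-bit group for each of the digits 1, 9, 9, and 5.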

Those involved with data curation will probably seldom need a detailed technical understanding of the formats historically or now used to represent numbers in computing. (Exceptions may arise when dealing with material which uses non-standard number formats for any reason.) But it may be worthwhile to scan the descriptions of number representations in Wikipedia, if only to dispel the notion that computer representations of numbers are somehow natural and thus simpler and less problematic than computer representations of other datatypes. The treatment in Wikipedia is reasonably sound, though by nature it will strike some readers as a bit dry.
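
A brief Python sketch shows how little is natural about these representations. It relies on the assumption, true of mainstream Python implementations, that the built-in float type is an IEEE 754 64-bit binary number. The decimal module used for comparison is a software implementation of decimal arithmetic, not the IEEE 754 decimal interchange encoding.

```python
import struct
from decimal import Decimal

# In binary floating point, 0.1 and 0.2 are stored only approximately,
# so their sum is not exactly the stored approximation of 0.3.
binary_sum = 0.1 + 0.2
binary_exact = (binary_sum == 0.3)

# Decimal arithmetic represents these particular values exactly.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_exact = (decimal_sum == Decimal("0.3"))

# The 64 bits actually stored for 0.1: 1 sign bit, 11 exponent bits,
# and 52 fraction bits, read here in big-endian byte order.
bits = format(struct.unpack(">Q", struct.pack(">d", 0.1))[0], "064b")
```

Here binary_exact is False while decimal_exact is True, which is the rounding difference that leads financial applications to prefer decimal representations.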