In existing computer systems there is typically a long chain of relations connecting the physical phenomena by which data are represented with the data being represented. Each link in the chain connects two layers of representation: each layer organizes information available at the next lower level into structures at a higher (or at least different) layer of abstraction, and in this way provides information used in turn by the next higher level in the representation. For example, the representation of an email message may involve the following layers:
- Physical layer: holes in cards or tape, magnetic charges, color changes on optical disks or scan codes, tones on a telephone connection, or similar phenomena are interpreted as representing sequences of bits.
- Bit layer: a sequence of bits may itself be interpreted as representing a different sequence of bits (for example, five bits may be written to the physical medium to represent four bits of data, in such a way as to guarantee a minimum and maximum spacing between magnetic flux events on the medium).
- Byte / octet layer: the sequences of bits read from the storage device are grouped into octets: units of eight bits often referred to as bytes. (Historically, different machines had bytes of different sizes, but it has been some decades since any prominent system had bytes of other than eight bits.)
- Character layer: an octet sequence may be interpreted as a sequence of characters. For conventional email, each octet will be interpreted as one character, as defined by the appropriate character-set standard.
- Application-specific data structure layer: the email reader will read the character stream and distinguish the mail header from the message body, and may distinguish multiple alternative representations of the message and attachments within the message body. Within the mail header, mail software will distinguish important fields like date, sender, and addressee.
- Presentation layer: the email reader will display the message on the user's screen.
The human reader of the mail will of course read the screen and (in the normal case) discern letters, words, and sentences, as well as (perhaps) images.
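The chain of layers listed above can be traced in a few lines of Python. What follows is a minimal sketch under stated assumptions, not a description of how any real mail system works. The message, the addresses, and all function names are hypothetical. The 4-to-5 bit table is the 4B5B line code known from FDDI and Fast Ethernet, used here only as a stand-in for whatever group code a given storage medium actually uses. The character set is assumed to be US-ASCII, and the header parsing ignores folded lines and repeated fields.

```python
# Illustrative 4B5B table (assumption: a real medium may use another group code).
# With this table, no more than three consecutive 0 bits occur in the recorded stream.
FOUR_TO_FIVE = {
    "0000": "11110", "0001": "01001", "0010": "10100", "0011": "10101",
    "0100": "01010", "0101": "01011", "0110": "01110", "0111": "01111",
    "1000": "10010", "1001": "10011", "1010": "10110", "1011": "10111",
    "1100": "11010", "1101": "11011", "1110": "11100", "1111": "11101",
}
FIVE_TO_FOUR = {code: data for data, code in FOUR_TO_FIVE.items()}


def octets_to_bits(octets):
    """Octet layer to bit layer: write each octet as eight bits."""
    return "".join(f"{octet:08b}" for octet in octets)


def encode_for_medium(bits):
    """Bit layer, writing: every 4 data bits become 5 recorded bits."""
    return "".join(FOUR_TO_FIVE[bits[i:i + 4]] for i in range(0, len(bits), 4))


def decode_from_medium(recorded):
    """Bit layer, reading: every 5 recorded bits become 4 data bits."""
    return "".join(FIVE_TO_FOUR[recorded[i:i + 5]] for i in range(0, len(recorded), 5))


def bits_to_octets(bits):
    """Bit layer to octet layer: group the bits into units of eight."""
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def octets_to_characters(octets):
    """Character layer: one octet per character (assumed US-ASCII)."""
    return octets.decode("ascii")


def parse_message(text):
    """Application layer: split header from body and pick out header fields."""
    head, _, body = text.partition("\r\n\r\n")
    headers = {}
    for line in head.split("\r\n"):
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers, body


# Hypothetical message used only for this illustration.
RAW = (b"Date: Mon, 1 Jan 2001 12:00:00 +0000\r\n"
       b"From: alice@example.org\r\n"
       b"To: bob@example.org\r\n"
       b"\r\n"
       b"Hello, Bob.\r\n")

recorded = encode_for_medium(octets_to_bits(RAW))        # what the medium holds
octets = bits_to_octets(decode_from_medium(recorded))    # back up to octets
headers, body = parse_message(octets_to_characters(octets))

# Presentation layer: show the result to the reader.
print(headers["from"], "->", headers["to"])
print(body)
```

Each function consumes only the output of the layer directly beneath it. For real mail, Python's standard `email` package handles the header syntax that this sketch ignores.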
This hierarchy of layers of abstraction is characteristic of many information technologies, not just data representations. It has parallels to the structuralist idealization of natural language as organized into phonological, morphological, lexical, syntactic, semantic, and pragmatic layers. In the case of natural language, different layers sometimes interact in ways that conflict with the hierarchical model. Artificial systems of data representation, in contrast, may conform more often than natural languages do to the ideal of a strict hierarchy, in which no layer depends on or interacts with any layer other than those immediately adjacent to it. In the design of technologies, such layering helps limit the complexity of the system and reduces the likelihood of error in its construction. By the metaphor of pieces of software each layered on top of the next, systems constructed in this way are often referred to as a software (or technology) stack. Software that supports network connectivity is often referred to as “the network stack”; the technologies available for working with XML documents are sometimes referred to as “the XML stack”; and so on.
There is no single hierarchy of data representation layers that applies to all data representations; other software running on the same machine may have a chain of representations and layers rather different from the chain described above for email messages. In particular, different applications will almost always have different application-specific data structures. Moreover, proprietary applications frequently use binary data formats which have no distinguishable character-level representation.
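The contrast between a binary format and a character-level format can be made concrete with a small sketch. The record, its field names, and both layouts are hypothetical and invented for this illustration; they do not correspond to any actual application format.

```python
import struct

# Hypothetical record: an item count and a weight.
count, weight = 1024, 3.5

# Binary form (assumed layout): little-endian unsigned 32-bit integer
# followed by a 64-bit IEEE 754 float, with no padding.
binary_form = struct.pack("<Id", count, weight)

# Character-level form of the same record (assumed syntax, US-ASCII).
text_form = f"count={count};weight={weight}".encode("ascii")

print(binary_form.hex())          # octets with no character-level reading
print(text_form.decode("ascii"))  # octets that are meant to be read as characters

# The binary octets can only be interpreted by a reader that knows the layout.
as_written = struct.unpack("<Id", binary_form)   # the intended reading
misread = struct.unpack(">Id", binary_form)      # wrong byte order: different values
print(as_written, misread)
```

Nothing in the twelve binary octets announces their own layout, whereas the character-level form can at least be inspected by any tool that handles text.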
Despite the manifold opportunities for variation, however, some properties are shared by many data representations in wide use today:
- Unlike proprietary applications, non-proprietary applications often define character-level representations as a way of allowing interoperability between different implementations.
- Historically, machines from different manufacturers often used different character encodings. Since the development of the so-called Universal Character Set (UCS) of ISO 10646 and Unicode in the 1990s, however, hardware manufacturers, software developers, and the writers of non-proprietary specifications have been slowly converging on the use of the UCS as a standard character representation level. Examples include the use of the UCS as the fundamental character representation in the Java programming language, in HTML beginning with HTML 4.0, and in XML. In consequence, character-set variation is likely to pose practical problems primarily for formats or data dating from the 1990s or earlier.
- Most applications on most systems share the lowest levels of the representation and diverge only in their treatment of the octet sequence.
- Many applications designed for use on a single computer system also assume that the operating system within which they are used provides a file system (which can be described abstractly as a mapping from file names to octet sequences). This is not a universal property, however: not all computing devices provide file systems, and (in the interests of speed and/or reliability) many database management systems bypass the file system to deal directly with the interface to the hard disk or other storage media. Network protocols, in contrast, typically avoid assuming the existence of a file system and rely instead on concepts of messages, data transmissions, or (especially in the context of the World Wide Web) resources.
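The point about character encodings in the list above can be illustrated briefly: the same characters yield different octets under different encodings, and the same octets yield different characters when read with the wrong one. The sample word is arbitrary, and the three encodings are chosen only as examples (UTF-8 as an encoding of the UCS, ISO 8859-1, and the EBCDIC code page known to Python as `cp037`).

```python
text = "café"

utf8_octets = text.encode("utf-8")      # b'caf\xc3\xa9': two octets for é
latin1_octets = text.encode("latin-1")  # b'caf\xe9': one octet for é
ebcdic_octets = text.encode("cp037")    # an EBCDIC code page: different octets again

# Reading UTF-8 octets as if they were ISO 8859-1 silently yields wrong characters.
misread = utf8_octets.decode("latin-1")  # 'cafÃ©'

# Reading ISO 8859-1 octets as if they were UTF-8 fails outright here.
try:
    latin1_octets.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(utf8_octets, latin1_octets, ebcdic_octets)
print(misread, latin1_is_valid_utf8)
```

The octet sequence alone does not identify its encoding, so that information has to be recorded or inferred at the character layer.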
In practice, the issues of concern for data curation are almost all at or above the octet level. This is partly because lower levels are normally highly reliable (and thus seldom need attention), partly because intervention at lower levels requires specialized engineering knowledge and equipment, and partly because application formats are designed to rely only on the octet level, so as to make them independent of the particular implementation of the lower levels. (But see the discussion of bit preservation below.)
C.M. Sperberg-McQueen, Black Mesa Technology
David Dubin, University of Illinois, Urbana-Champaign