Data Management for the Humanities

Curatorial Requirements for Data Representations

For data curation purposes there are two fundamental requirements; all other requirements derive from these (or are not requirements at all, but negotiable desiderata):

  1. Permanence: the data representation must last a long time without corruption, degradation, decay, or loss.
  2. Usability: it must be possible to use the information being preserved. Using the information provides a check that the information has thus far been successfully preserved without the loss of some crucial bits, and in any case if the data have become unusable there may be little point in spending further resources to preserve them longer.

From these two essential requirements others follow.

  1. Any data representation relied on for long-term preservation of information must have clear, well written, published documentation. If the format is not documented, the likelihood that the information it represents can be preserved without loss across media conversions is small; the likelihood that it can be preserved without loss across format conversions is nil. One of the most effective methods available for confirming that digital objects have been successfully preserved so far is to provide effective intellectual access to the material; active users of the material provide a far better monitor of data quality than any automated system could ever do. But if the format of the data is not documented, it is much harder, if not always impossible, to provide effective intellectual access to the material.
  2. The specification documents for preservation formats should be controlled by public bodies, preferably by consensus-based organizations in the international standardization system or by relevant industry consortia. Proprietary formats are subject to change and abandonment by their owners in ways that make them a poor bet for long-term access to information.
  3. Other things being equal, a data representation that is widely supported has a better chance of long-term utility than one with a much smaller user community. A larger number of users means that the costs of maintenance and development can be shared across a larger pool of resources, that understanding and documentation of the format are likely to be more widespread, and that there are better prospects for commercial support for the format. There are limits to this principle, however: a suitable format used by a small specialized community will often be preferable to a format used by a much larger community that does not provide a suitable representation of the information. (The most widely supported representations of human-readable documents, for example, are those of word-processor software. But many scholars using computers for the analysis of language and literature prefer other formats for the data they work on, because word-processor formats are not oriented to linguistic and literary concerns. It would not be a good idea to translate data from a well-designed XML format into a proprietary word-processor format on the grounds that the word-processor format is more widely used.)

Adapted from:

“Data Representation” 
C.M. Sperberg-McQueen, Black Mesa Technology
David Dubin, University of Illinois, Urbana-Champaign

Bit Preservation and Information Preservation

Practical work on data curation can usefully be divided into two classes: efforts focused on the preservation of information at the bit or octet level (bit preservation) and efforts focused on higher levels. Efforts at both levels are essential to the successful preservation of digital materials; which area more urgently requires the attention and resources of data curators is a matter of active controversy.

Briefly, bit preservation is the act of ensuring that devices in the future will be able to reproduce the sequence of bits, or octets, currently used to represent the information to be conserved. Bit preservation protects against bit rot and media failure, but not against other threats to digital preservation and access.
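
One common way to check that a bit sequence has survived is to record a checksum for each file and recompute it later. The following is a minimal sketch of that idea, not a description of any particular repository's practice. The function names (sha256_of, record_fixity, verify_fixity) and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(paths):
    """Build a manifest mapping each file path to its current digest."""
    return {str(p): sha256_of(p) for p in paths}


def verify_fixity(manifest):
    """Return the paths that are missing or whose bits have changed."""
    damaged = []
    for name, expected in manifest.items():
        path = Path(name)
        if not path.is_file() or sha256_of(path) != expected:
            damaged.append(name)
    return damaged
```

A curator would store the manifest separately from the files and run verify_fixity on a schedule. A non-empty result signals bit rot, media failure, or loss, but says nothing about whether the format can still be read.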

Information preservation is the act of ensuring that the information represented in a resource is preserved, possibly by translating it from an obsolescent format into a more current format. Note that format conversion protects against file-format obsolescence, but not against other possible threats to digital preservation.
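
The difference between preserving bits and preserving information can be seen in a very small migration. The sketch below uses a character encoding as a stand-in for a file format, which is a deliberate simplification: real format migrations are far more involved. The function name migrate_text and the sample word are assumptions made for illustration.

```python
def migrate_text(data, old_encoding, new_encoding="utf-8"):
    """Re-encode text from an older encoding into a current one."""
    text = data.decode(old_encoding)
    migrated = text.encode(new_encoding)
    # Check the information, not the bits: both versions must decode
    # to the same characters. Python's strict codecs would already raise
    # on an unrepresentable character, so this is a second line of defence.
    if migrated.decode(new_encoding) != text:
        raise ValueError("migration changed the text")
    return migrated


legacy = "Überlieferung".encode("latin-1")   # 13 bytes in the old encoding
current = migrate_text(legacy, "latin-1")    # 14 bytes in UTF-8

# The bit sequences differ, but the information they carry is the same.
bits_identical = legacy == current
information_identical = legacy.decode("latin-1") == current.decode("utf-8")
```

After the migration a checksum of the new file will no longer match a checksum of the old one, so a comparison at the level of the information, rather than of the bits, is needed to confirm that nothing was lost.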

Preservation of bits is a necessary part of digital preservation: since the bit sequence is the foundation for all the higher levels in the representation of the information, if the bit sequence is lost, the information will be lost as well. But bit preservation is not sufficient: a future user interested in a WordStar 1.3 document (for example) will be able to make use of the document effectively only if software capable of reading the WordStar 1.3 file format is available. Since WordStar was a very popular program for its day, such software may very possibly be available in practice. For the formats of less popular software, however, the situation looks less promising.
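
The point that intact bits are not enough can be illustrated with a few bytes of text. In this sketch the two encodings stand in for two file formats. The sample bytes are an assumption chosen for illustration and have nothing to do with WordStar itself.

```python
# Four octets, preserved perfectly.
preserved_bits = b"caf\xe9"

# Read with knowledge of the right format, the bits give back the word.
as_latin1 = preserved_bits.decode("latin-1")   # 'café'

# Read under a wrong assumption about the format, the same bits
# yield a different final character, and no error is reported.
as_cp437 = preserved_bits.decode("cp437")

same_reading = as_latin1 == as_cp437
```

Nothing in the bit sequence says which reading is correct. That knowledge has to come from documentation of the format or from software that implements it.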