Data Management for the Humanities

File Formats and Software

The format and software in which research data are created usually depend on how researchers choose to collect and analyze data, often determined by discipline-specific standards and customs.

All digital information is designed to be interpreted by computer programs to make it understandable and is - by nature - software dependent. All digital data are thus endangered by the obsolescence of the hardware and software environment on which access to data depends.

Many software packages can import data created in their earlier versions, and competing popular programs are often interoperable. Even so, the safest way to guarantee long-term access to usable data is to convert the data to standard formats that most software can interpret and that are suitable for data interchange and transformation.

This typically means using open or standard formats, such as OpenDocument Format (ODF), ASCII, tab-delimited text, comma-separated values (CSV) or XML, rather than proprietary ones. Some proprietary formats, such as MS Rich Text Format, MS Excel and SPSS, are widely used and likely to remain accessible for a reasonable, but not unlimited, time.

Thus, whilst researchers use the most suitable data formats and software according to planned analyses, once data analysis is completed and data are prepared for storing, researchers should consider converting their research data to standard, interchangeable and longer-lasting formats, to avoid being unable to use the data in the future.
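
As a rough illustration of the step described above, the following Python sketch flags files in a project listing that are stored in proprietary formats and suggests a more open target format. The names `SUGGESTED_TARGETS` and `conversion_candidates` are hypothetical, and the pairings are only examples drawn from formats named in this guide, not an official conversion policy.

```python
from pathlib import PurePath

# Illustrative pairings only (an assumption, not an official policy):
# proprietary extension -> a more open format mentioned in this guide.
SUGGESTED_TARGETS = {
    ".xls": ".csv",
    ".xlsx": ".csv",
    ".doc": ".rtf",
    ".docx": ".rtf",
    ".sav": ".por",
    ".psd": ".tif",
}


def conversion_candidates(filenames):
    """Return (filename, suggested extension) pairs for files worth converting."""
    candidates = []
    for name in filenames:
        # Compare extensions case-insensitively, so "survey.SAV" matches ".sav".
        suffix = PurePath(name).suffix.lower()
        if suffix in SUGGESTED_TARGETS:
            candidates.append((name, SUGGESTED_TARGETS[suffix]))
    return candidates


if __name__ == "__main__":
    project_files = ["interviews.docx", "survey.SAV", "notes.txt", "scan.tif"]
    for name, target in conversion_candidates(project_files):
        print(f"{name}: consider converting to {target}")
```

The sketch only reports candidates. The conversion itself is best done with the software that created the file, so that labels and other metadata can be checked after export.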

Adapted from the UK Data Archive

File Formats Table

For each type of data, the table lists acceptable formats for sharing, reuse and preservation, followed by other acceptable formats for data preservation.

Quantitative tabular data with extensive metadata (a dataset with variable labels, code labels, and defined missing values, in addition to the matrix of data)

Acceptable formats for sharing, reuse and preservation:
  • SPSS portable format (.por)
  • delimited text and command ('setup') file (SPSS, Stata, SAS, etc.) containing metadata information
  • some structured text or mark-up file containing metadata information, e.g. DDI XML file
Other acceptable formats for data preservation:
  • proprietary formats of statistical packages, e.g. SPSS (.sav), Stata (.dta)
  • MS Access (.mdb/.accdb)

Quantitative tabular data with minimal metadata (a matrix of data with or without column headings or variable names, but no other metadata or labeling)

Acceptable formats for sharing, reuse and preservation:
  • comma-separated values (CSV) file (.csv)
  • tab-delimited file (.tab)
  • delimited text of a given character set, with SQL data definition statements where appropriate
Other acceptable formats for data preservation:
  • delimited text of a given character set, using as delimiters only characters not present in the data (.txt)
  • widely-used formats, e.g. MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf) and OpenDocument Spreadsheet (.ods)

Geospatial data (vector and raster data)

Acceptable formats for sharing, reuse and preservation:
  • ESRI Shapefile (essential: .shp, .shx, .dbf; optional: .prj, .sbx, .sbn)
  • geo-referenced TIFF (.tif, .tfw)
  • CAD data (.dwg)
  • tabular GIS attribute data
Other acceptable formats for data preservation:
  • ESRI Geodatabase format (.mdb)
  • MapInfo Interchange Format (.mif) for vector data
  • Keyhole Mark-up Language (KML) (.kml)
  • Adobe Illustrator (.ai), CAD data (.dxf or .svg)
  • binary formats of GIS and CAD packages

Qualitative data (textual)

Acceptable formats for sharing, reuse and preservation:
  • eXtensible Mark-up Language (XML) text according to an appropriate Document Type Definition (DTD) or schema (.xml)
  • Rich Text Format (.rtf)
  • plain text data, ASCII (.txt)
Other acceptable formats for data preservation:
  • Hypertext Mark-up Language (HTML) (.html)
  • widely-used proprietary formats, e.g. MS Word (.doc/.docx)
  • some proprietary/software-specific formats, e.g. NUD*IST, NVivo and ATLAS.ti

Digital image data

Acceptable formats for sharing, reuse and preservation:
  • TIFF version 6 uncompressed (.tif)
Other acceptable formats for data preservation:
  • JPEG (.jpeg, .jpg), but only if created in this format
  • TIFF (other versions) (.tif, .tiff)
  • Adobe Portable Document Format (PDF/A, PDF) (.pdf)
  • standard applicable RAW image format (.raw)
  • Photoshop files (.psd)

Digital audio data

Acceptable formats for sharing, reuse and preservation:
  • Free Lossless Audio Codec (FLAC) (.flac)
Other acceptable formats for data preservation:
  • MPEG-1 Audio Layer 3 (.mp3), but only if created in this format
  • Audio Interchange File Format (AIFF) (.aif)
  • Waveform Audio Format (WAV) (.wav)

Digital video data

Acceptable formats for sharing, reuse and preservation:
  • MPEG-4 (.mp4)
  • motion JPEG 2000 (.mj2)
Other acceptable formats for data preservation:
  • (none listed)

Documentation and scripts

Acceptable formats for sharing, reuse and preservation:
  • Rich Text Format (.rtf)
  • PDF/A or PDF (.pdf)
  • HTML (.htm)
  • OpenDocument Text (.odt)
Other acceptable formats for data preservation:
  • plain text (.txt)
  • some widely-used proprietary formats, e.g. MS Word (.doc/.docx) or MS Excel (.xls/.xlsx)
  • XML marked-up text (.xml) according to an appropriate DTD or schema, e.g. XHTML 1.0
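
The table's advice that only characters not present in the data should be used as delimiters can be checked mechanically. The following Python sketch, using only the standard library, picks the first candidate delimiter that appears in no field and then writes the rows as delimited text. The function names `pick_delimiter` and `to_delimited_text`, the candidate list and the sample rows are all assumptions made for illustration.

```python
import csv
import io

# Candidate delimiters, tried in order of preference (an illustrative choice).
CANDIDATES = [",", "\t", ";", "|"]


def pick_delimiter(rows, candidates=CANDIDATES):
    """Return the first candidate character that occurs in no field of `rows`."""
    for delimiter in candidates:
        if not any(delimiter in field for row in rows for field in row):
            return delimiter
    raise ValueError("every candidate delimiter occurs in the data")


def to_delimited_text(rows, delimiter):
    """Serialise rows (lists of strings) as delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = [
        ["name", "place"],
        ["Woolf, Virginia", "London"],
        ["Joyce, James", "Dublin"],
    ]
    # The comma is rejected here because it occurs inside the name fields.
    delimiter = pick_delimiter(rows)
    print(repr(delimiter))
    print(to_delimited_text(rows, delimiter))
```

Because the chosen delimiter never occurs in the data, no field needs quoting, and any tool that can split a line on a single character can read the file back. Recording the delimiter and character set in the accompanying documentation keeps the file self-explanatory.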