Text and Data Mining

A guide that goes over the basics of text and data mining.

Data Mining with Adam Matthew Primary Source Collections

Adam Matthew partners with libraries and archives around the world to digitize historical documents and primary sources for the humanities and social sciences. These digital documents are made available through a series of online databases, which together contain millions of pages of original content. UCLA have access to all of these databases.

While developing these collections, the editorial team at Adam Matthew create metadata for each document. This records details such as when it was produced, who the author was, and what topics get mentioned in it. All printed items are also run through an OCR (Optical Character Recognition) program, which produces a searchable text-file version of the writing.
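As a rough illustration of how the two kinds of data fit together, the sketch below pairs a document's metadata with its OCR text and filters on both. The field names (`title`, `year`, `author`, `topics`, `ocr_text`), the helper `find_documents`, and the sample records are all hypothetical placeholders; the actual structure of the Adam Matthew data may differ.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentRecord:
    """One digitized item: metadata plus OCR text (hypothetical field names)."""
    title: str
    year: int
    author: str
    topics: list[str] = field(default_factory=list)
    ocr_text: str = ""


def find_documents(records, term, start_year=None, end_year=None):
    """Return records whose OCR text contains `term`, within an optional year range."""
    term = term.lower()
    hits = []
    for rec in records:
        if start_year is not None and rec.year < start_year:
            continue
        if end_year is not None and rec.year > end_year:
            continue
        # Case-insensitive substring match on the OCR text.
        if term in rec.ocr_text.lower():
            hits.append(rec)
    return hits


# Placeholder records for illustration only; not real collection data.
records = [
    DocumentRecord("Letter A", 1776, "Author One", ["military"],
                   "Orders sent to the garrison at New York."),
    DocumentRecord("Letter B", 1781, "Author Two", ["trade"],
                   "Shipping accounts for the port of New York."),
    DocumentRecord("Report C", 1779, "Author One", ["military"],
                   "Supplies requested for the winter."),
]

matches = find_documents(records, "new york", end_year=1780)
for rec in matches:
    print(rec.year, rec.title, rec.author)
```

Metadata narrows the set of documents (here by year), while the OCR text supports the full-text match. OCR output often contains recognition errors, so exact string matching like this may miss some documents.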

The data can be made available free of charge for text and data mining projects. If you are interested in initiating such a project, please contact a librarian in Scholarly Communication and Licensing to begin the process.

How Can You Access the Data?

There are two main options for accessing the data:

  1. Secure online access via an API (Application Programming Interface). This is an interface that provides access to a collection’s metadata and full-text data on a stand-alone, secure server. You will need client software to interact with the data, as this is a technical interface with no front-end search engine. There is no charge for this access. To find out more about available client software, please contact a librarian in Scholarly Communication and Licensing to help you get started.
  2. An offline copy provided via FTP (File Transfer Protocol) between our server and your server. Under current agreements this is limited to a 3-year storage period, after which a renewal can be requested or, if the project is complete, the original data (not any research material) must be deleted. There is no charge for this access. To find out more, please contact a librarian in Scholarly Communication and Licensing to help you get started.

Both options require a form to be filled in, which a librarian in Scholarly Communication and Licensing will complete. This is for data protection purposes, as Adam Matthew have a responsibility to the source archives to let them know how the data is being stored. Once you have a research project in mind, and know which collections' data you wish to use, you can request a form by emailing a librarian in Scholarly Communication and Licensing. Simply list the collections you wish to use and your preferred access method in the email, and the appropriate form will be sent to you.

Once the form has been completed, the timeline for receiving approval and the data varies depending on the amount of data requested and the method of delivery required. A request for metadata is quicker to turn around than a request for full-text data. Access via API is quicker to set up than FTP.

As a general rule, Adam Matthew will try to provide all data requested via an API within one week of receiving the completed data mining form. Data requested via FTP has a similar turnaround time, but it will depend on the amount of data, especially if the request includes full-text data.

Examples of TDM Projects using Adam Matthew

Many projects and research tools have been created using Adam Matthew content and TDM. Some of these were created by the developers at Adam Matthew, while others were part of academic research projects. To give an idea of the potential for this sort of work, let's take a look at the example below.

This shows that George Germain appears in 3,868 documents (number underlined in dotted blue). The person who most often appears alongside him is Henry Clinton, who appears in 1,011 of those documents (circled in red). Clinton himself appears in 1,586 documents in total (number in the green box), meaning 63.7% of his appearances are alongside George Germain. The darker the purple shading (e.g. number in the green box), the higher the percentage of documents in which the people co-occur.

This is the type of association analysis that can be produced from full-text data. Projects like this help to raise and answer questions about how and why different people were connected to each other. Of course, this is just one example of how TDM can work.
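The co-occurrence figures described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by Adam Matthew: it assumes the names mentioned in each document have already been extracted into a set, and the helper names (`count_cooccurrences`, `share_alongside`) and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(docs):
    """Count how many documents each person, and each pair of people, appears in."""
    appearances = Counter()
    pairs = Counter()
    for people in docs:
        people = set(people)  # count each person once per document
        appearances.update(people)
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(people), 2):
            pairs[(a, b)] += 1
    return appearances, pairs


def share_alongside(person, other, appearances, pairs):
    """Percentage of `person`'s documents in which `other` also appears."""
    total = appearances[person]
    if total == 0:
        return 0.0
    key = tuple(sorted((person, other)))
    return 100 * pairs[key] / total


# Toy data for illustration only: each set lists the people named in one document.
docs = [
    {"George Germain", "Henry Clinton"},
    {"George Germain"},
    {"George Germain", "Henry Clinton", "Person C"},
    {"Henry Clinton"},
]

appearances, pairs = count_cooccurrences(docs)
toy_share = share_alongside("Henry Clinton", "George Germain", appearances, pairs)
print(f"Toy data: {toy_share:.1f}% of Clinton's documents also name Germain")

# The same ratio using the figures quoted in the example above.
print(f"Quoted figures: {100 * 1011 / 1586:.1f}%")
```

The percentage is directional: it divides the shared documents by one person's total, so Clinton's share alongside Germain (1,011 of 1,586) differs from Germain's share alongside Clinton (1,011 of 3,868).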