One Search to Rule Them All

Why you still have to use multiple search engines to find library materials

Metadata: the Good, the Bad, and the Ugly

As mentioned on the previous page, most of the world's library content is not full-text searchable, so we rely on metadata to search for things. When you use most library search engines, you're only searching a tiny subset of the words from the actual materials: the title, the author's name, and a few descriptive terms about the work's subject. For articles, abstracts are also usually included in the metadata.

Advantages of Metadata

Searching only the metadata is often very useful because it ensures that the words you search for are actually important to the work, and not just mentioned in passing. But the power of metadata searching goes well beyond that.

Most metadata is structured, which makes it more than just a bunch of words. Each piece of structured metadata is associated with a field that tells you how it relates to the object. This enables powerful searching options. For example, you can search for "shakespeare" and specify whether you want something called Shakespeare (title), by Shakespeare (author), or about Shakespeare (subject). That's not something you can do in Google!
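
To make the idea of fielded searching concrete, here is a minimal Python sketch. The records, names, and the search function are all invented for illustration (real library systems are far more elaborate), but it shows how the same word gives different results depending on the field you pick.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A toy metadata record with three generic fields."""
    title: str
    author: str
    subjects: list[str] = field(default_factory=list)


# Invented sample records, not taken from any real catalog.
RECORDS = [
    Record("Hamlet", "William Shakespeare", ["Tragedies"]),
    Record("Shakespeare", "A. Biographer", ["Shakespeare, William, 1564-1616"]),
    Record("A Playwright's World", "B. Historian",
           ["Shakespeare, William, 1564-1616", "Theater"]),
]


def search(records, term, field_name):
    """Return titles of records whose chosen field contains the term."""
    term = term.lower()
    hits = []
    for record in records:
        value = getattr(record, field_name)
        # Subjects are a list; title and author are single strings.
        values = value if isinstance(value, list) else [value]
        if any(term in v.lower() for v in values):
            hits.append(record.title)
    return hits


print(search(RECORDS, "shakespeare", "title"))     # called Shakespeare
print(search(RECORDS, "shakespeare", "author"))    # by Shakespeare
print(search(RECORDS, "shakespeare", "subjects"))  # about Shakespeare
```

A plain keyword search would lump all three records together; the field is what lets you tell them apart.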

In addition to generic fields like title, author, and subject, metadata can also include specialized fields useful for certain disciplines. For example, most metadata systems have a date field for date of publication, but metadata schemes for history books and articles often have a separate field for the dates the work is about.

Finally, good metadata will use controlled vocabularies. This is especially useful for subjects, where many variant terms may be in use by writers. The classic example is film, aka movies, motion pictures, or cinema. A controlled vocabulary picks one standard term for each topic, so no matter what words the original author uses, or even what language they write in, the metadata will be consistent.
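
A controlled vocabulary can be pictured as a lookup table from variant terms to one preferred term. The sketch below is only an illustration: the table is a hand-made assumption, and "Motion pictures" is used as the preferred term simply as an example of picking one standard heading.

```python
# Hypothetical mapping from variant terms to a single preferred term.
PREFERRED_TERMS = {
    "film": "Motion pictures",
    "films": "Motion pictures",
    "movies": "Motion pictures",
    "motion pictures": "Motion pictures",
    "cinema": "Motion pictures",
    "cine": "Motion pictures",  # a variant from another language
}


def normalize_subject(term):
    """Map a writer's term to the preferred term, if one is known."""
    key = term.strip().lower()
    # Unknown terms are passed through unchanged.
    return PREFERRED_TERMS.get(key, term)


for word in ["Movies", "cinema", "Film", "Opera"]:
    print(word, "->", normalize_subject(word))
```

In practice a cataloger applies the preferred term when creating the record, so the searcher only needs to know one term to find everything on the topic.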

Disadvantages of Metadata

One of the first things you may have realized about the Advantages of Metadata is that they are only advantages if your search engine is designed to use them. That wouldn't be a problem if there were one consistent standard for metadata. But there isn't. There are just too many types of information content out there to come up with universal rules. Even common metadata elements like title, author, and date can be difficult to define. What's the "title" of a photograph? Who's the "author" of a translated work? What's the "publication date" of a magazine that's regularly released for sale three to four months before the date on the cover? There are many competing metadata schemas out there, and they don't necessarily play well together.

Even when two metadata systems use compatible fields, they almost certainly have different controlled vocabularies. And as soon as you start combining controlled vocabularies, they rapidly become uncontrolled vocabularies.
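
Here is a small Python sketch of that problem. Both sources and their records are made up for illustration: each one is perfectly consistent on its own, but they chose different preferred terms, so the merged subject index splits one topic into two.

```python
from collections import defaultdict

# Two hypothetical sources, each internally consistent.
SOURCE_A = [("Record A1", "Motion pictures"), ("Record A2", "Motion pictures")]
SOURCE_B = [("Record B1", "Films"), ("Record B2", "Films")]


def build_subject_index(*sources):
    """Group record names under the subject term each source assigned."""
    index = defaultdict(list)
    for source in sources:
        for name, subject in source:
            index[subject].append(name)
    return dict(index)


merged = build_subject_index(SOURCE_A, SOURCE_B)
print(sorted(merged))             # two headings for one topic
print(merged["Motion pictures"])  # misses everything from source B
```

Fixing this means mapping one vocabulary onto the other, term by term, which is exactly the kind of work that rarely gets funded.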

Another common problem is level of description. What exactly is the thing that the metadata record is describing? Is it a newspaper like the Los Angeles Times? An individual day's issue of that newspaper? Or an individual article published on that day? Description levels also go the other way: we have archival collections (both digital and microform) that include multiple newspapers inside them. Summing up with a totally different example: searching for "Hamlet" does no good if the database you're searching has a single metadata record for "The Collected Works of William Shakespeare," even though Hamlet is definitely in there!
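
The Hamlet example can be sketched in a few lines of Python. The two record sets below are invented: one describes the collection as a whole, the other describes each play separately. The search function is the same in both cases; only the level of description differs.

```python
COLLECTION = "The Collected Works of William Shakespeare"

# Hypothetical database that catalogs only the collection as a whole.
COLLECTION_LEVEL = [
    {"title": COLLECTION},
]

# Hypothetical database that catalogs each play inside the collection.
WORK_LEVEL = [
    {"title": "Hamlet", "part_of": COLLECTION},
    {"title": "Macbeth", "part_of": COLLECTION},
]


def title_search(records, term):
    """Return titles that contain the search term (case-insensitive)."""
    return [r["title"] for r in records if term.lower() in r["title"].lower()]


print(title_search(COLLECTION_LEVEL, "Hamlet"))  # finds nothing
print(title_search(WORK_LEVEL, "Hamlet"))        # finds the play
```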

Finally, good metadata takes time and effort, usually from a professional cataloger. There's a lot of bad metadata out there, simply because no one has the time and money to do it all well. And even the good metadata is often based on outdated descriptive schemes, because there are even fewer resources available to go back and re-catalog materials that have already been cataloged.

Put 'Em Together and What've You Got?

So the Library has tons of metadata from thousands of sources. Each source has its own metadata based on its own standards designed to work in its own search engine. Actually, much of the metadata predates search engines! Sometimes the differences are so dramatic that you can't even tell whether or not two records are describing the exact same book.

Trying to combine them into one usable search engine is a nightmare of epic proportions! Nonetheless, people have been trying to do so for some time... and may finally have something that's halfway usable!

But note that even if we succeed, by definition we're going to have to give up many of the most useful search features that rely on uncommon metadata fields. Things like...

  • a checkbox to limit results to articles about clinical trials (PubMed)
  • searching by the dates an article is about (Historical Abstracts and America: History & Life)
  • limiting searches by the age range of the people being studied (PsycInfo)
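
One way to picture what gets lost is as a set intersection: a combined search can only rely on the fields every source has. The Python sketch below uses made-up field names as stand-ins for the features listed above; they are not the actual field names used by these databases.

```python
# Hypothetical field lists; the names are invented for illustration.
SOURCE_FIELDS = {
    "PubMed": {"title", "author", "subject", "date", "clinical_trial"},
    "Historical Abstracts": {"title", "author", "subject", "date",
                             "historical_period"},
    "PsycInfo": {"title", "author", "subject", "date", "age_group"},
}

# Fields that a combined search box could support across every source.
common = set.intersection(*SOURCE_FIELDS.values())

# Specialized fields that each source would have to give up.
lost = {name: fields - common for name, fields in SOURCE_FIELDS.items()}

print(sorted(common))
for name, fields in lost.items():
    print(name, "loses", sorted(fields))
```

The more sources you add, the smaller the shared set tends to get, which is why universal search boxes usually offer only the most generic options.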

Until the day we decide to devote a good chunk of the global economy to retrospectively re-cataloging the world's information to a consistent and rich metadata standard, any universal search box is going to sacrifice quality for quantity. Which is why you're likely to need many different search boxes for the foreseeable future.