One Search to Rule Them All

Why you still have to use multiple search engines to find library materials

Google's Advantages

Just to forestall any accusations of Google-bashing, let me assure you: libraries love Google! We're more aware than most people of its limitations, but we still use it all the time and partner regularly with Google on projects like Google Books and Google Scholar, both of which are major steps toward creating the elusive One Search Box.

So please, Google to your heart's content!

Besides the ability to throw lots of money at the problem, Google enjoys four big advantages over the library world.

  1. Scope of what it's searching.
  2. Access to the full text of everything it's searching.
  3. Hyperlinks to help rank results.
  4. Lots of people looking for the same things.

Size Matters

Everyone knows that the web is huge. Really, really huge. But remember, Google indexes and searches only the public web, not the much larger deep web. Furthermore, it searches only the current web, not the two decades' worth of deleted pages. So still huge, but maybe just plain huge.

But the job of libraries is to provide access to the collected, recorded knowledge of mankind produced over the past 5000 years of history. That includes the public web, the deep web, and archived versions of old web sites, plus stone tablets, scrolls, books, journals, documents, newspapers, letters and e-mails, maps, microfilm, audio and video (both digital and analog), statistical data sets--the list goes on and on. Despite massive efforts to digitize older materials, most of that stuff isn't online--and most of what is online is locked away in the deep web.

So if what Google searches is huge, then what libraries have to search is best described as super-mega-uber-hyper-huge (a new technical term which I hope catches on)!

What You See Is What You Search

The public web has one trait that sets it apart from every other source of information. It's public.

That means every word of it is out there, visible to everyone. So when it comes time to index it and build a search engine, the whole shebang is right there for anyone to work with. It's a monumental computing task, but the raw data is available to all.

Libraries don't have that luxury. Much of our content is still in print or other analog formats—we have no way of searching the full text. And oddly, it doesn't get any easier for a lot of digital content. With modern publishing practices, libraries rarely buy e-books and e-journals—we subscribe to them. Which means we don't actually have all the content. It's accessible to our students and faculty, but the terms of our licenses actually prevent us from systematically downloading it all, which we'd need to do in order to index and search it.

Because of this, library search engines don't search the full text of our collections. Instead they search metadata--data about the data. Things like the title, the author, and keywords or subject headings. There are actually many advantages to searching metadata over full text. (And, of course, for non-text content like images and statistics, metadata is the only thing you can search on.) But metadata has one big drawback: somebody has to create it. It takes a lot of time and effort to create good metadata, and while we've got it for most books and journal articles, there are still large chunks of the library's collections with minimal, erratic, or non-standard metadata.
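
To make the contrast concrete, here is a minimal sketch of what metadata-based searching looks like: instead of scanning full text, the search matches query terms against catalog fields such as title, author, and subject headings. The records here are hypothetical, not from any real catalog.

```python
# Toy metadata records: title, author, and subject headings only --
# no full text, because (as described above) we often don't have it.
records = [
    {"title": "The Domesday Book", "author": "unknown",
     "subjects": ["England", "census", "medieval history"]},
    {"title": "Moby-Dick", "author": "Herman Melville",
     "subjects": ["whaling", "American literature"]},
]

def metadata_search(query, records):
    """Return records whose title, author, or subjects mention the query."""
    q = query.lower()
    hits = []
    for rec in records:
        fields = [rec["title"], rec["author"]] + rec["subjects"]
        if any(q in field.lower() for field in fields):
            hits.append(rec)
    return hits

print([r["title"] for r in metadata_search("whaling", records)])
# prints ['Moby-Dick']
```

Note what this sketch cannot do: a search for a phrase that appears only inside the book's text finds nothing, because the text simply isn't in the index. That is exactly the drawback good metadata is meant to offset.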

Metadata is such a big deal when it comes to good searching that it gets its own page.

Built-In User Ranking

One of the biggest challenges in search engine design is ranking results. Most Google searches produce thousands of results, and the real magic of Google is that 90% of the time what you want is in the top few listed.

Google is able to do this because of another unique feature of the web—it's hyperlinked. As part of its indexing process, Google keeps track of how many pages link to a given page and uses that to rank the pages. Web sites and pages which lots of people link to appear higher in the results list, especially if those links are coming from sites which themselves are highly ranked. The ranking system actually gets a lot more sophisticated than that, especially since Google is in an arms race with search engine spammers. But the whole system is built on the bedrock of counting hyperlinks.
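
The core idea of link counting can be sketched in a few lines. This is a toy version in the spirit of the original PageRank formula, run on a made-up four-page web; Google's real system is vastly more elaborate, but the principle is the same: a page linked to by many pages, especially highly ranked ones, ends up with a higher score.

```python
# Hypothetical four-page web: each page maps to the pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def rank(links, damping=0.85, iterations=50):
    """Iteratively pass each page's score along its outgoing links."""
    pages = list(links)
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page keeps a small base score, plus a share of the
        # score of every page that links to it.
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = score[page] / len(outlinks)
            for target in outlinks:
                new[target] += damping * share
        score = new
    return score

scores = rank(links)
print(max(scores, key=scores.get))  # prints "c", the most linked-to page
```

Page "c" wins because three of the four pages link to it; "d", which nothing links to, ends up near the bottom. That is the bedrock the paragraph above describes.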

Once again, this is an advantage libraries don't have. Most of our content, even the digital materials, isn't hyperlinked. Library search engines must rely on far more tenuous metrics to rank results, usually based on some mixture of assumed relevance and recency. However, even if we had hyperlinks, it's not certain they'd improve matters. Hyperlinks basically measure popularity. Which leads directly to the next issue.
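
What does a "mixture of assumed relevance and recency" look like in practice? Here is one hypothetical way such a blend might be computed; this is purely illustrative, not the formula any actual library system uses. The record fields and the 70/30 weighting are assumptions for the sake of the sketch.

```python
from datetime import date

def score(record, query_terms, today=date(2024, 1, 1), recency_weight=0.3):
    """Toy ranking: blend metadata term matches with a recency bonus."""
    text = " ".join([record["title"]] + record["subjects"]).lower()
    # Fraction of the query terms found in the record's metadata.
    relevance = sum(t.lower() in text for t in query_terms) / len(query_terms)
    # Newer publications get a bonus that decays with age.
    age_years = (today - record["published"]).days / 365.25
    recency = 1.0 / (1.0 + age_years)
    return (1 - recency_weight) * relevance + recency_weight * recency

newer = {"title": "Climate Data Analysis", "subjects": ["climate"],
         "published": date(2020, 6, 1)}
older = {"title": "Climate Data Analysis", "subjects": ["climate"],
         "published": date(1990, 6, 1)}
print(score(newer, ["climate"]) > score(older, ["climate"]))  # prints True
```

Notice how tenuous this is compared to link counting: "relevance" here is just term overlap with the metadata, and nothing in the formula knows whether the newer item is actually the better answer.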

Following the Herd

Google enjoys a massive user base that uses it primarily for routine discovery of common or well-known information. The vast majority of Google searches are for information also being sought by thousands of other people. This not only makes the page ranking system work well, it also makes it easy to fine-tune the relevance ranking system, because there's a statistically known "best answer" to most searches. In other words, we tend to ask Google easy questions.

In contrast, higher education, especially at a research university, is mostly about the hard questions. UCLA faculty and students are often working at the cutting edge of their fields, researching topics that few people have looked at, where there's little consensus, where evidence is spotty, or where standard conceptual terminology hasn't even been developed yet. I've helped many people in the library research questions that, fairly obviously, no other human has yet published on. This makes it difficult, if not impossible, to rank results, since even if two researchers enter the exact same search terms, there's a good chance they're actually looking for different things.