Citation
The Weaknesses of full-text searching

Material Information

Title:
The Weaknesses of full-text searching
Series Title:
Beall, Jeffrey. “The Weaknesses of Full-Text Searching.” The Journal of Academic Librarianship 34, no. 5 (September 2008): 438–44. doi:10.1016/j.acalib.2008.06.007.
Creator:
Beall, Jeffrey ( Author, Primary )
Publication Date:
Physical Description:
Journal Article

Notes

Abstract:
This paper provides a theoretical critique of the deficiencies of full-text searching in academic library databases. Because full-text searching relies on matching words in a search query with words in online resources, it is an inefficient method of finding information in a database. This matching fails to retrieve synonyms, and it also retrieves unwanted homonyms. Numerous other problems also make full-text searching an ineffective information retrieval tool. Academic libraries purchase and subscribe to numerous proprietary databases, many of which rely on full-text searching for access and discovery. An understanding of the weaknesses of full-text searching is needed to evaluate the search and discovery capabilities of academic library databases.

Record Information

Source Institution:
Auraria Library
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.

Full Text

PAGE 1

The Weaknesses of Full-Text Searching

By Jeffrey Beall

This is a post-print of an article published in the Journal of Academic Librarianship (Volume 34, Issue 5, September 2008, Pages 438–444). The publisher's version of the article is available here (generic link) and here (authenticated link). doi:10.1016/j.acalib.2008.06.007

Abstract: Because full-text searching relies on matching words in a search query with words in online resources, it is an inefficient method of finding information in a database. This matching fails to retrieve synonyms, and it also retrieves unwanted homonyms. Numerous other problems also make full-text searching an ineffective information retrieval tool.

Jeffrey Beall
Auraria Library
University of Colorado Denver
jeffrey.beall@ucdenver.edu

PAGE 2

1. Introduction

1.1. Definition of full-text searching

Full-text searching is the type of search a computer performs when it matches terms in a search query with terms in individual documents in a database and ranks the results, either by relevance using computer algorithms, or by some property of the individual items retrieved, such as creation date. This type of searching is ubiquitous on the Internet and includes the type of natural language search we typically find in search engines, Web site search boxes, and in many proprietary databases. The term full-text searching has several synonyms and variations, including keyword searching, algorithmic searching, stochastic searching, and probabilistic searching.

1.2. Metadata-enabled searching

There is one other main type of online searching. This is metadata-enabled searching, which is also called deterministic searching. In this type of search, searchers pre-select and search individual facets of an information resource, such as author, title, and subject. In this type of search, the system matches terms in the search with terms in structured metadata and generates results, often a browse display sorted alphanumerically.

1.3. Objective

The purpose of this article is to list and describe the chief weaknesses of full-text searching. We limit the scope of this article to true full-text searching that automatically matches words entered in the search box with words in

PAGE 3

resources a database contains to generate results. This study does not include in its analysis new, semantic search engines such as Hakia, which stores metadata for each Web page indexed and uses that metadata, along with word matching, to generate search results. Indeed, many popular search engines do incorporate metadata into their searches. For example, the Google advanced search allows for limiting search results to a specific language. This search limit is generated by language metadata that the search engine assigns to each Web page it indexes. (The accuracy of this automatically generated language metadata may not always be high.) Still, the great majority of the searches performed on the Internet are of the type this paper seeks to study: full-text searching that matches words in a search box with words in online documents or online text.

This study is not a comparison of full-text searching and metadata-enabled searching. Both types of searching have their various strengths and weaknesses. This article seeks chiefly to describe the weaknesses of full-text searching.

1.4. Previous studies

Most information retrieval and information discovery has transitioned from searching dominated by metadata-enabled searching (library card catalogs) to the present full-text or algorithmic searching (Web search engines). This transition occurred without sufficient analysis of the weaknesses of full-text searching. Perhaps if searchers understood the number of resources they were missing because of full-text searching's reliance on word matching to generate retrieval, they would be less satisfied with it. Generally, books and articles on information retrieval cite one

PAGE 4

or two examples of the weaknesses of full-text searching; few have been comprehensive in their analyses, as this one seeks to be.

Among those to write about the weaknesses of full-text searching was Thomas Mann, a reference librarian at the Library of Congress. He states, “Keyword searching fails to map the taxonomies that alert researchers to unanticipated aspects of their subjects. It fails to retrieve literature that uses keywords other than those the researcher can specify; it misses not only synonyms and variant phrases but also all relevant works in foreign languages. Searching by keywords is not the same as searching by conceptual categories” (Mann, 2005). Here he makes reference to the synonym problem in full-text searching (and he prefers to use the term keyword searching rather than full-text searching, providing yet another example of the synonym problem). Mann (2005b) also states,

    When all is said and done, keyword searching necessarily entails the problem of the unpredictability of the many variant ways the same subject can be expressed, within a single language (“capital punishment,” “death penalty”) and across multiple languages (“peine de mort,” “pena capitale”). And no software algorithm will solve this problem when it is confined to dealing with only the actual words that it can retrieve from within the given documents (or citations or abstracts) themselves. (p. 102)
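The word-matching behavior Mann describes can be illustrated with a minimal sketch. The three-document corpus, the `tokenize` helper, and the `full_text_search` function below are hypothetical, invented only for illustration; they are not the method of any particular search engine, and real systems add ranking and other refinements.

```python
import re

# Hypothetical corpus: three invented sentences about the same concept.
DOCS = {
    "doc1": "The state abolished capital punishment in 1965.",
    "doc2": "Opponents of the death penalty held a vigil.",
    "doc3": "La peine de mort a été abolie en 1981.",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def full_text_search(query, docs):
    """Return the ids of documents that contain every word of the query."""
    terms = set(tokenize(query))
    return sorted(
        doc_id for doc_id, text in docs.items()
        if terms <= set(tokenize(text))
    )

# Each query retrieves only the document that uses its exact words.
print(full_text_search("capital punishment", DOCS))  # ['doc1']
print(full_text_search("death penalty", DOCS))       # ['doc2']
```

Under these assumptions, all three documents concern the same subject, yet no single query retrieves more than one of them, because matching operates on the words alone.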

PAGE 5

Beall (2006, 2006b) presents two brief but more complete looks at the problems of full-text searching. The present paper aims for a more comprehensive analysis. Moreover, Beall (2007) introduces the term “search fatigue” to describe the frustration searchers feel when they are unsuccessful in finding information due to the weaknesses of full-text searching.

A recent study by Hemminger, Saelim, Sullivan, and Vision (2007) compares full-text searching to metadata searching and finds that “it may be time to make the transition to direct full-text searching as the standard” (p. 2250). However, later in the article the authors concede that their study may not be truly representative, for it compared the two searching modes using gene names, which are consistently used in the literature they studied.

2. The synonym problem

Perhaps the biggest and most pervasive weakness of full-text searching is the synonym problem. This problem occurs because there is often more than one way to name or express a given concept, such as a person, place, or thing. There are several different aspects of the synonym problem.

2.1. True synonyms

Synonyms are two words that mean the same thing in one language. In full-text searching, synonyms hinder effective information retrieval when a searcher enters a term in the search box and the system returns only results that match the term and does not return results that refer to the concept only by one of its synonyms. For example, if a searcher seeks information on leprosy, he would likely enter “leprosy” in the search box and expect complete results. However, some online documents refer to this disease as “Hansen’s disease.” While it’s true that many

PAGE 6

documents will contain both terms, thus enabling access regardless of which term is searched, a certain percentage of the documents will contain only one term, which yields an incomplete retrieval.

2.2. Variant spellings

Words that mean the exact same thing can sometimes be spelled differently, as in variant British and American spellings. In full-text searching, a search for “harbour” will miss results that use the spelling “harbor.” It’s true that many full-text search engines have developed methods for overcoming this problem; searchers can use wild card or truncation operators to retrieve multiple spellings of a word. But there are also variant spellings within a single dialect of a language, and these differences are often beyond the scope of the truncation or wild card operators. For example, in American English the spellings “donut” and “doughnut” are both common. Unlike the case of synonyms, where both synonyms may appear in a single document, spelling tends to be consistent within a document. A document about harbors written in the United States is unlikely to also contain the spelling “harbour.” This means that there is a smaller chance of retrieving documents with variant spellings than there is with true synonyms.

2.3. Shortened forms of terms

Abbreviations, acronyms, and initialisms can hinder recall in full-text search systems because a document may contain only the short form of the word or only the long form. When this occurs, someone searching on the short form (PETA) will miss documents that use only the long form (People for the Ethical Treatment of Animals). Alternatively, searching on the long form of

PAGE 7

the term, like Magnetic Resonance Spectroscopy, will miss documents that refer to the concept only by its short form, MRS.

2.4. Different languages or dialects

When searching a term in one language, a searcher will not match documents that contain the foreign-language version of that concept, unless the two terms happen to be cognates. For example, a search on the term “butter” will miss documents that refer to it only by its Spanish equivalent, “mantequilla.” For many searchers, this exclusion is not a problem; they prefer their search results to be in just one language. However, scholarship, such as medical research, or research for a thesis or dissertation, needs to be comprehensive regardless of language.

Additionally, variation occurs within a single language. The phrase “football coach” means different things in British and American English. In the United States, this term refers to a person who directs an American football team, that is, the coach. In British English, a “football coach” refers to a bus (motorcoach) for soccer players. When the words are the same in two or more languages or dialects of a single language, however, such as the word “migration,” which means the same in English and French, the different language problem does not occur.

2.5. Obsolete terms

Linguistic change can also prevent complete information retrieval in full-text searching. For example, the phrase “French distemper” is one of many archaic ways of referring to syphilis. (The term was also used metaphorically by

PAGE 8

the English to refer to the French Revolution.) Someone researching the history of syphilis and using full-text searching would miss resources that use only the term “French distemper.” It is possible in Google Books to find digitized books that use only this term. While it is possible to search every variant term to generate a complete search result in full-text searching, this method is not very efficient and requires that one know all the variant terms, an unlikely possibility.

2.6. Humanities vs. STM

Overall, despite the above example, we should note that the synonym problem probably occurs more frequently in the humanities than it does in science, technology, and medicine (STM). STM scholarship tends to be more consistent in its terminology, even across languages. For example, the scientific names of plants and animals (binominal nomenclature) are the same in all languages (Tyrannosaurus rex, for example). This tendency to use a standard terminology even across languages ameliorates the synonym problem in these fields. (Note, however, that Tyrannosaurus rex is often abbreviated to T-rex, creating an instance of the abbreviation problem described above.)

This is not to say that STM fields always use consistent terminology. There are at least sixty different terms that all mean “Atlantic cod,” for example (Beall, 2007). The variation occurs in the common names and not in the scientific names, though. While scientific names tend to be applied consistently within the scientific domain, popular terms for natural things reflect a diverse terminology.

PAGE 9

Unlike scientific terminology, humanities terminology varies significantly from one language to another and by time and dialect within a single language. Take the term “short stories,” for example. In French it’s “nouvelles,” in Spanish it’s “cuentos,” and in German, “Erzählungen.” The names for languages themselves differ from language to language too. The names for the German language include alemán, Deutsch, and allemand. Perhaps one area in the humanities where there is some cross-language consistency is music. Many languages share terms like “soprano.” Also, as described earlier, regional differences within a single language can lead to problems in information retrieval when using full-text searching. In British English a “solicitor” is a lawyer; in American English, it is someone who goes door to door selling something or asking for contributions for charity.

3. The homonym problem

The homonym problem occurs in full-text searching when a single word or phrase has more than one meaning. Because full-text searching relies on word matching to generate results, a search for a term with several meanings will retrieve documents for all of the meanings, rather than just the one the searcher wants. Homonyms are perhaps the chief cause of low search precision.

3.1. True homonyms

Without metadata, computers do not know the sense of each of a given pair of homonyms. That is, computers cannot effectively disambiguate two concepts when they are called by the same term. For example, a search on “cookies” will pull up documents about both the food and the little files stored on a computer. Searchers are aware of this problem, for it occurs frequently. Many have

PAGE 10

developed strategies to eliminate unwanted hits and increase the probability of search results matching the particular meaning of the homonym they seek. For example, someone looking for information on computer file cookies might add the word “computer” to the search terms (instead of searching only for “cookies”), because the documents about edible cookies are less likely to have this term in them. Alternatively, a sophisticated searcher might use the “not” operator to try to eliminate unwanted homonyms, entering “cookies not recipes” in an effort to increase the search’s precision. While these strategies help, they are not completely effective. Words can have many more meanings than just two, and one often does not anticipate that a search term has homonyms.

3.2. Personal name disambiguation

This problem occurs in both full-text searching and in metadata-enabled search systems where the practice of name disambiguation is not employed. Name disambiguation is the process of making each person’s name unique in a database. The more common a name in a database, the greater the problem. The problem is made worse by names that also function as other parts of speech, like bill, April, miller, and mike. Because name disambiguation necessarily involves adding metadata, virtually all full-text documents lack this value-added feature.

3.3. False cognates

These are two words that are spelled the same (or almost the same) in two languages but, deceptively, do not mean the same thing. In full-text searching, false cognates are only a problem when they are spelled exactly the same.
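The two refinement strategies described in section 3.1 (adding a second term, or excluding a term with “not”) can be sketched as follows. The four sample documents and the `search` function are hypothetical and serve only to show why neither strategy is fully effective; the sketch assumes simple Boolean word matching with no ranking.

```python
import re

# Invented sample documents: "a" is about food, the others about computing.
DOCS = {
    "a": "Bake the cookies for ten minutes. More recipes inside.",
    "b": "Browsers store cookies as small files on your computer.",
    "c": "Our site uses cookies to remember your saved recipes.",
    "d": "Session cookies expire when the browser closes.",
}

def words(text):
    """Return the set of lowercase word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))

def search(docs, must=(), must_not=()):
    """Ids of documents containing all `must` words and none of `must_not`."""
    hits = []
    for doc_id, text in docs.items():
        found = words(text)
        if all(t in found for t in must) and not any(t in found for t in must_not):
            hits.append(doc_id)
    return sorted(hits)

print(search(DOCS, must=["cookies"]))                        # ['a', 'b', 'c', 'd']
print(search(DOCS, must=["cookies", "computer"]))            # ['b']
print(search(DOCS, must=["cookies"], must_not=["recipes"]))  # ['b', 'd']
```

In this toy corpus, adding “computer” removes the food document but also drops two relevant documents that never use that word, and “cookies not recipes” discards document “c,” which is about computer cookies but happens to mention recipes.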

PAGE 11

The problem occurs when a word entered into a search box happens to match a word in a different language that has no semantic relationship to the original search term. For example, the word “location” in French doesn’t mean “location” in English; it means a rental or a lease.

4. Inability to search by facets

Sometimes searchers need to search by only a specific characteristic or attribute of an online resource, such as author, title, subject, or date of creation. These attributes, or facets, help to cluster resources by specific shared characteristics. Clustering, or collocating, is helpful in information retrieval because it helps exclude unwanted resources from a search. Also, clustering matches typical searcher queries, such as “I want all DVDs on agriculture,” or “I want all PDF files on land use planning in Utah published before 2000.” Pure full-text searching fails at these tasks, because the search engine doesn’t know the format (DVD) or the subject (agriculture) or the publication date (2000) of the documents it searches. If a search engine does know these data, then it’s not a pure, full-text search engine. Instead, it is a metadata-enhanced search engine and draws its ability to sort by facets from metadata assigned to each resource it indexes.

4.1. Clustering

Clustering is most helpful when it attempts to solve the homonym problem in subject searches. Here, clustering is the process of grouping and separating out resources by subject. For example, someone searching for information on ocean banks might just enter “banks” as the search term. A search engine with the ability to cluster would then separate out the results that refer to ocean banks from

PAGE 12

those that refer to banks, the financial institutions. It’s probably not uncommon for users who stumble on the homonym problem in a full-text database to do a revised search that includes a second search term, as a strategy for eliminating unwanted documents. For example, a searcher could enter “banks ocean” to eliminate documents in the retrieval that are about banks the financial institutions. This stratagem is not foolproof, however, for there are many resources about financial institutions that contain the words “banks” and “ocean.” Increasingly, proprietary databases are performing this type of cluster analysis algorithmically but with limited success.

5. Unable to sort

Just as full-text search engines lack the ability to cluster search results, they also lack the ability to sort results by facets. Sorting plays an important role in information retrieval and can increase its value because it helps arrange search results in a meaningful and predictable order. For example, sorting search results by publication date (oldest first or most recent first) is helpful to searchers looking for recent or old publications. Traditional full-text search engines cannot perform this type of sort because they don’t know the publication dates of the documents in the database. Search engines that do have the ability to sort by publication date are actually using metadata to do the sort and are not true, full-text search engines. The alphanumeric sorting of information resources’ other main facets, author, title, and subject, also adds great value to information discovery and retrieval, but a true, full-text search engine cannot perform this function. Relevance ranking, a sorting system based on probabilistic analysis, works well when the resource a searcher desires falls on the first screen of

PAGE 13

search retrieval display, but this is increasingly seldom the case in full-text search engines.

6. Spamming

This problem is limited to open databases, such as the Internet, where anyone can upload data that becomes part of the database. In this context, spamming refers to the addition of text to cause documents, such as Web pages, to appear in search results gratuitously. This is sometimes called “keyword stuffing” (Henzinger, 2007, p. 470). The result is that irrelevant search results appear. Most of the major Web search engines are sufficiently able to deal with this problem algorithmically and strategically, but at a cost. Most big search engines ignore subject metadata (often referred to here as “keywords”) added into a document’s meta tags for fear that it is spam. Brooks summarizes:

    We are now just learning that the Web has a different social dynamic. The Web is not a benign, socially cooperative environment, but an aggressive, competitive arena where authors seek to promote their Web content, even by abusing topical metadata. As a result, Web crawlers must act in self defense and regard all keywords and topical metadata as spam. (Brooks, 2003). http://informationr.net/ir/8-3/paper154.html

Thus this potential added value, that is, the value of rich subject metadata, is often lost in the jungle of the World Wide Web.
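To see why added text can push an irrelevant page into search results, consider a minimal sketch that ranks pages by raw term frequency. This scoring rule is a deliberately naive assumption chosen for illustration; the article does not describe any engine's actual ranking, and the two pages and the `term_frequency_rank` function are hypothetical.

```python
import re
from collections import Counter

# Invented pages: "stuffed" repeats the query word without being about it.
PAGES = {
    "guide": "A field guide to growing orchids at home, with notes on orchids in winter.",
    "stuffed": "Cheap watches! orchids orchids orchids orchids orchids buy watches now",
}

def term_frequency_rank(query, pages):
    """Rank pages by how many times the query word occurs (raw term frequency)."""
    term = query.lower()
    scores = {
        name: Counter(re.findall(r"\w+", text.lower()))[term]
        for name, text in pages.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(term_frequency_rank("orchids", PAGES))
# [('stuffed', 5), ('guide', 2)]
```

Under this assumed scoring rule the stuffed page outranks the page that is actually about the topic, which is the kind of abuse that leads search engines to distrust author-supplied keywords.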

PAGE 14

7. The aboutness problem

Language and words do not always convey what a resource is about. Just because a Web site contains a word doesn’t mean it’s about whatever concept that word names. But because full-text search engines rely on word matching to guess at aboutness, searches for information on a topic often fail. Online documents do an inadequate job of providing their own subject metadata. “A classical problem for document retrieval systems is the failure of keywords to identify the conceptual content of documents” (Olsen, Sochats, & Williams, 1998, p. 108).

Searchers have an idea of what information they want to find in their minds. They express this idea as search terms. The problem is that language is often an ineffective means of unambiguously and clearly stating an information need. Garrett (2006) summarizes that, “an extraordinarily subtle and intricate process relates speaker meanings to language output in all natural (i.e., human) language. Individual words and even complete sentences therefore do not necessarily map one-to-one to phenomena of the world.”

7.1. Figurative language

Figurative language also impedes effective information retrieval in full-text searching. Figurative language is language that is not used according to the literal, dictionary definition of the words used. For example, the sentence, “She’s drowning in birthday presents” invokes figurative language, in this case a metaphor. The word “drowning” is not used in a literal sense. But a searcher looking for information on drowning in a full-text database would retrieve the document with this sentence among the search results. In the looking-glass world of full-text search engines, all metaphors become real.

PAGE 15

7.2. Word lists

Individual entries in online dictionaries, glossaries, and word lists also often match search terms in full-text databases and appear in search results. Such lists are the source of many “false drops” or irrelevant hits in full-text searches. Mann (2005) says,

    The Google Print project will be hampered by a further problem: its scanned 15,000,000 books will include tens of thousands of dictionaries. Any keyword searched will thus retrieve all dictionaries in which the word appears–nor could results be "progressively refined" by adding more words because those words, too, will "hit" in the same dictionaries. (This is already a problem for researchers using a much smaller full-text database, the Evans Early American Imprints.) From: "Will Google's Keyword Searching Eliminate the Need for LC Cataloging and Classification?"

Of course, for some searches, the word list may be exactly what the searcher desires. But for the tens of thousands of times the list is not what the searcher desires, the search results will amount to little more than noise.

8. Abstract topics

It’s difficult to search successfully for documents on abstract topics in full-text databases. Subjects such as “health,” “free will,” and “ethics” generate large retrievals in Web search engines, decreasing the probability that the first few screens of search results will contain documents useful to the searcher.

PAGE 16

9. The incognito problem

This problem refers to a person, place or thing not being called by its standard name in at least one step of the search process. Specifically, in order to retrieve information in a full-text database, both the terms the searcher enters in the search box and the terms in the best resources have to match. To understand this, it’s important to understand that searching is a process that involves several steps. The specious statement “Only librarians like to search; everyone else likes to find” (Tennant, 2001) displays an ignorance of information discovery and retrieval as a process. That is to say, there’s more to the process than just the last step, finding.

9.1. Search terms not in resource

It’s not uncommon for a document to describe something and fail to name it. Thus, searches for the concept will not retrieve the resources that do not match the term. Garrett (2006) shares this example: “Michel Foucault's foundational work on meaning and signifying, The Order of Things can be said to be all about the French Revolution, and yet it's possible—and I haven't checked—that the word string "French Revolution" does not occur a single time in the entire book.” Also, Batty (1998) reports, “the golfer Arnold Palmer hit two holes-in-one on the same hole on two successive days in a major tournament, and the article describing this unprecedented feat never mentioned the word GOLF.”

9.2. Searcher doesn’t know term

Frequently the searcher is the source of the problem in full-text searching. When a searcher does not know the correct term for a concept, it can be very difficult for the searcher to find desired information. For example,

PAGE 17

chondromalacia is a painful medical problem involving the cartilage of the knee joint. A person with a sore knee seeking information about the problem might have this precise condition but not know the name of it. In this case, the person will likely look for information using a Web search engine but will search only with terms such as “pain” and “knee.” It’s true that the Web sites themselves will probably help deal with this problem. Some of the Web sites retrieved will describe one form of knee pain as chondromalacia. Then the searcher, provided he can make the connection, is able to do a second, more precise search on the exact term.

Searcher ignorance brings up another point: the insidious nature of full-text search engines. If a searcher does not know a resource exists, he will not know when a full-text search fails to include it in the search results. This often leads searchers into an endless and exhausting search for information.

9.3. Non-textual resources

Full-text search engines are only able to read and index textual information. In the absence of metadata, objects such as pictures, sound files, and video files are not indexed and therefore not searchable, despite the fact that they might contain valuable information.

10. Difficult-to-search paired topics

Because full-text searching lacks pre-coordination, it is frequently difficult to search paired topics in full-text search engines. Pre-coordination is the assigning of subject metadata into strings or phrases that reflect

PAGE 18

a summary of a resource’s content. Often, resources describe or present information on two topics in relation to each other. Here are some examples of paired topics:

Art and mental illness
Architecture and philosophy
Movies and fascism
Libraries and German Americans

Searching paired topics such as these in a full-text database is problematic, however. A full-text search matches documents that happen to contain both terms. Frequently a search on two topics will retrieve resources that do not in fact discuss the two concepts in relation to each other; the resources merely happen to contain both terms. The ability to sort out only the resources that describe the relationship between the two topics is a valuable one, but full-text searching performs poorly at this task.

11. Variability among different search engines

Web search engines are big business, and like makes of automobiles, each one is a little different. Moreover, there are hundreds of proprietary databases, each with its own full-text search methods. Individual Web pages also often offer a full-text search box, and these also differ greatly from one resource to another.

11.1. Lack of standardization in searching

This variability means that searchers have to learn the best way to search each database that they use. For

PAGE 19

example, is the default Boolean operator in a given database “and” or “or”? Or does the search engine search all the terms as a phrase? In each case, in order to ensure effective retrieval, the searcher has to know or learn the particular search rules for the database. Probably many searchers assume that all simple search boxes work the same way as the Google search works, but this is not always the case.

11.2. Variability in result ranking

Searchers also have to adjust to the different ways that full-text search engines rank search results. The so-called relevancy rankings that are frequently found in Web search engines are created according to proprietary algorithms. Thus ranking differs from one system to another. Some less sophisticated full-text search engines might sort by some other aspect rather than a probabilistic calculation of relevancy, such as date the resource was first added to the database, or date the file was last updated.

11.3. Indexing differences

Search engines index text differently. One example is hyphenated words. Different search engines might index the hyphen in “full-text” as a space, or they might ignore the hyphen and index the phrase as “fulltext.” These differences require searchers to be aware of each search engine’s indexing and searching policies to ensure complete search results. Some search engines might get around this problem by indexing hyphenated text both ways, that is, both with and without a space. But this would not resolve the problem of a resource that does not use a hyphen in a compound word (as in “nonstandard”) and a searcher who searches it as two words (as in “non standard”).
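The two hyphen-handling policies described above can be sketched with a pair of hypothetical tokenizers. The function names and the all-terms-must-match rule in `matches` are assumptions made for illustration, not the documented behavior of any particular search engine.

```python
import re

def index_hyphen_as_space(text):
    """Tokenize treating a hyphen as a word break: 'full-text' -> full, text."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_hyphen_dropped(text):
    """Tokenize after deleting hyphens: 'full-text' -> fulltext."""
    return re.findall(r"[a-z0-9]+", text.lower().replace("-", ""))

def matches(query, document, tokenizer):
    """True if every query token appears among the document's tokens."""
    return set(tokenizer(query)) <= set(tokenizer(document))

doc = "Weaknesses of full-text searching"
for tokenizer in (index_hyphen_as_space, index_hyphen_dropped):
    print(
        tokenizer.__name__,
        matches("full text", doc, tokenizer),
        matches("fulltext", doc, tokenizer),
    )
# index_hyphen_as_space True False
# index_hyphen_dropped False True

# Neither policy helps when the document has no hyphen to begin with.
print(matches("non standard", "a nonstandard format", index_hyphen_as_space))  # False
print(matches("non standard", "a nonstandard format", index_hyphen_dropped))   # False
```

Each policy rescues one form of the query and loses the other, and an engine that indexed hyphenated text both ways would still miss the unhyphenated “nonstandard” case.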

PAGE 20

Brooks (1998) also describes the problem of digitized text that includes words originally broken at the end of a line of print with a hyphen. He cites such examples as

    'Europeans' broken into Europe and ans
    'distinguishing' broken into distingu and ishing
    'occurred' broken into occ and urred … (p. 737).

Each of these cases could represent a failed search in a full-text search system. This problem will only worsen as more print works are digitized.

Brooks (1998) also describes the problem of stopwords in full-text indexing and searching. Stopwords are short and common words that normally carry little meaning in a document. Many full-text databases contain a list of stopwords that are not indexed in their systems. One problem is that each database generally has its own unique list of stopwords, and another problem is that these words do occasionally carry substantive information. Two examples include the word "in" in the phrase "mother-in-law," and the word "a" as in vitamin A. When these words are not indexed in a given system, retrieval on "mother-in-law" and "vitamin A" would be made much more difficult.

12. The opaque Web

There is a great amount of information available on the Internet that is hidden behind search interfaces. This information is opaque, or invisible to search engines. Henzinger (2007) states, “A plethora of content is stored in databases rather than in typical Web pages. The pages as well as their URLs are created in response to a user filling out a form on the Web. Because search engines are unable to emulate this behavior, such dynamically generated pages cannot be indexed. There has

PAGE 21

21 been some research on trying to make form-f illing automatic, but the problem remains largely unsolved” (p. 469). A recent report written by two governm ent watchdog groups, OMB Watch and the Center for Democracy & Tec hnology, bemoans the amount of valuable United States government information that is hidden behi nd search interfaces and therefore not indexed in the popular search engines. The report states, “Unfortunately, many important information sources wit hin the federal government ar e essentially hidden from the very search engines that the public is most likely to use” (OMB Watch and the Center for Democracy & Technology, 2007). The report also states, “Often the agencies mentioned operate tens or hundreds of dynamic databases that can not be indexed and searched” (OMB Watch and the Center for Democracy & Technology, 2007). One example of such a database familiar to librari ans is the Library of Congress Authorities Web site. This Web site contains a search interface that leads to hundreds of thousands of name, title, and subject authority record s. These records contain a great deal of valuable biographical, geographical, and subject informat ion, but because Web search engines can’t see them, the information they contain is not accessible without knowing in advance about the Web site and accessing it directly. 13. Conclusion Linguistic problems, the limitations of full-text search engines, and missing data combine to make full-text search ing unreliable, incomplete, and insidiously imprecise, especially for serious information s eeking, such as scholarly research. Many Web-based applications still use basic full-t ext searching as their chief information retrieval mechanism. Over the past fifteen years, most information retrieval has

PAGE 22

22 transitioned from searching bas ed on rich metadata to full-text searching. The result of this transition is an overall decrease in t he quality of information retrieval. References Batty David. 1998. “WWW-Wealth, wearines s or waste: Cont rolled vocabulary and thesauri in support of online informati on access.” D-Lib Magazine, November http://www.dlib.org/dlib/november98/11batty.html Beall, Jeffrey. (2007). Search fatigue: finding a cure for the database blues. American Libraries, vol. 38, no. 3, p. 46-50. Beall, Jeffrey. (2006). The death of metadata. The Serials Librarian vol. 51, no. 2, p. 55-74. Beall, Jeffrey. (2006b). The deat h of full-text searching. PNLA quarterly vol. 70, no. 2, Winter, p. 5-6. Brooks, Terrence A. (2003) "Web Search : how the Web has changed information retrieval" Information Research 8 (3) paper no. 154 [Available at http://InformationR.n et/ir/8-3/paper154.html] Brooks, T.A., 1998. Orthography as a fundam ental impediment to online information retrieval. Journal of the American Society for Information Science 49 8, pp. 731–741.

Garrett, Jeffrey. (2006). KWIC and dirty? Human cognition and the claims of full-text searching. The Journal of Electronic Publishing, vol. 9, no. 1, Winter.

Hemminger, Bradley M., Saelim, Billy, Sullivan, Patrick F., Vision, Todd J. (2007). Comparison of full-text searching to metadata searching for genes in two biomedical literature cohorts. Journal of the American Society for Information Science & Technology, vol. 58, no. 14, p. 2341-2352.

Henzinger, Monika. (2007). Search technologies for the Internet. Science, vol. 317, no. 5837, 27 July, p. 468-471. doi:10.1126/science.1126557

Mann, Thomas. (2005). Will Google’s keyword searching eliminate the need for LC cataloging and classification? http://www.guild2910.org/searching.htm

Mann, Thomas. (2005b). The Oxford Guide to Library Research. Oxford: Oxford University Press.

Olsen, Kai A., Sochats, Kenneth M., Williams, James G. (1998). Full text searching and information overload. International Information & Library Review, vol. 30, no. 2, June, p. 105-122.

OMB Watch and Center for Democracy & Technology. (2007). Hiding in plain sight: Why important government information cannot be found through commercial search engines. http://www.cdt.org/righttoknow/search/Searchability.pdf

Tennant, Roy. (2001). Avoiding unintended consequences. Library Journal, vol. 126, no. 1, p. 38. http://www.libraryjournal.com/article/CA156524.html