《信息检索导论(英文版)》是信息检索的教材,旨在从计算机科学的视角提供一种现代的信息检索方法。书中从基本概念讲解网络搜索以及文本分类和文本聚类等,对收集、索引和搜索文档系统的设计和实现的方方面面、评估系统的方法、机器学习方法在文本收集中的应用等给出了最新的讲解。
书中所有重要的思想都是用示例进行解释,图文并茂。《信息检索导论(英文版)》非常适合作为计算机科学及相关专业的高年级本科生和研究生的“信息检索”课程的入门教材,当然也同样适合研究人员和专业人士阅读。
《信息检索导论(英文版)》从计算机科学领域的角度出发,介绍了信息检索的基础知识,并对当前信息检索的发展做了回顾,重点介绍了搜索引擎的核心技术,如文档分类和文档聚类问题,以及机器学习和数值计算方法。书中所有重要的思想都用示例进行了解释,生动形象,引人入胜,实现了理论与实战的完美结合。 《信息检索导论(英文版)》的三位作者均是信息检索领域的顶级专家,两位来自学术教育界,一位来自硅谷业界,使《信息检索导论(英文版)》既具备深厚的理论基础,又代表了尖端科技水准。因此,该书甫一出版,即被奉为该领域的权威著作,备受瞩目,目前已被众多世界名校采用为信息检索课程的教材。
As recently as the 1990s, studies showed that most people preferred getting information from other people rather than from information retrieval OR) systems. Of course, in that time period, most people also used human travel agents to book their travel. However, during the last decade, relentless opti- mization of information retrieval effectiveness has driven web search engines to new quality levels at which most people are satisfied most of the time, and web search has become a standard and often preferred source of information finding. For example, the 2004 Pew Internet Survey (Fallows 2004) found that "92% of Internet users say the Internet is a good place to go for getting everyday information." To the surprise of many, the feld of information re- trieval has moved from being a primarily academic discipline to being the basis underlying most peoples preferred means of information access. This book presents the scientific underpinnings of this field, at a level accessible to graduate students as well as advanced undergraduates.
Information retrieval did not begin with the Web. In response to various challenges of providing information access, the field of IR evolved to give principled approaches to searching various forms of content. The field be- gan with scientific publications and library records but soon spread to other forms of content, particularly those of information professionals, such as journalists, lawyers, and doctors. Much of the scientific research on IR has occurred in these contexts, and much of the continued practice of IR deals with providing access to unstructured information in various corporate and governmental domains, and this work forms much of the foundation of our book.
Christopher D.Manning,斯坦福大学语言学博士,现任斯坦福大学计算机科学和语言学副教授,主要研究方向是统计自然语言处理、信息提取与表示、文本理解和文本挖掘等。
Prabhakar Raghavan,加州大学伯克利分校博士,现任Yahoo!实验室主任,斯坦福大学计算机科学系顾问教授,是ACM和IEEE会士。主要研究兴趣是文本及Web数据挖掘、算法设计等。此前,他曾任Verity公司CTO,并在旧M研究院担任过管理工作。
Hinrich Schuze斯坦福大学博士,现任斯图加特大学自然语言处理研究所理论计算语言学主任。他在美国硅谷工作过多年,曾在施乐Palo Alto研究中心供职,担任过Outride公司(后被Google公司收购)副总裁,做过Novation生物科技公司CTO和Enkata公司首席科学家。
1 Boolean retrieval 1
1.1 An example information retrieval problem 3
1.2 A first take at building an inverted index 6
1.3 Processing Boolean queries 9
1.4 The extended Boolean model versus ranked retrieval 13
1.5 References and further reading 16
2 The term vocabulary and postings lists 18
2.1 Document delineation and character sequence decoding 18
2.2 Determining the vocabulary of terms 21
2.3 Faster postings list intersection via skip pointers 33
2.4 Positional postings and phrase queries 36
2.5 References and further reading 43
3 Dictionaries and tolerant retrieval 45
3.1 Search structures for dictionaries 45
3.2 Wildcard queries 48
3.3 Spelling correction 52
3.4 Phonetic correction 58
3.5 References and further reading 59
4 Index construction 61
4.1 Hardware basics 62
4.2 Blocked sort-based indexing 63
4.3 Single-pass in-memory indexing 66
4.4 Distributed indexing 68
4.5 Dynamic indexing 71
4.6 Other types of indexes 73
4.7 References and further reading 76
5 Index compression 78
5.1 Statistical properties of terms in information retrieval 79
5.2 Dictionary compression 82
5.3 Postings file compression 87
5.4 References and further reading 97
6 Scoring, term weighting, and the vector space model 100
6.1 Parametric and zone indexes 101
6.2 Term frequency and weighting 107
6.3 The vector space model for scoring 110
6.4 Variant tf–idf functions 116
6.5 References and further reading 122
7 Computing scores in a complete search system 124
7.1 Efficient scoring and ranking 124
7.2 Components of an information retrieval system 132
7.3 Vector space scoring and query operator interaction 136
7.4 References and further reading 137
8 Evaluation in information retrieval 139
8.1 Information retrieval system evaluation 140
8.2 Standard test collections 141
8.3 Evaluation of unranked retrieval sets 142
8.4 Evaluation of ranked retrieval results 145
8.5 Assessing relevance 151
8.6 A broader perspective: System quality and user utility 154
8.7 Results snippets 157
8.8 References and further reading 159
9 Relevance feedback and query expansion 162
9.1 Relevance feedback and pseudo relevance feedback 163
9.2 Global methods for query reformulation 173
9.3 References and further reading 177
10 XML retrieval 178
10.1 Basic XML concepts 180
10.2 Challenges in XML retrieval 183
10.3 A vector space model for XML retrieval 188
10.4 Evaluation of XML retrieval 192
10.5 Text-centric versus data-centric XML retrieval 196
10.6 References and further reading 198
11 Probabilistic information retrieval 201
11.1 Review of basic probability theory 202
11.2 The probability ranking principle 203
11.3 The binary independence model 204
11.4 An appraisal and some extensions 212
11.5 References and further reading 216
12 Language models for information retrieval 218
12.1 Language models 218
12.2 The query likelihood model 223
12.3 Language modeling versus other approaches in information retrieval 229
12.4 Extended language modeling approaches 230
12.5 References and further reading 232
13 Text classification and Naive Bayes 234
13.1 The text classification problem 237
13.2 Naive Bayes text classification 238
13.3 The Bernoulli model 243
13.4 Properties of Naive Bayes 245
13.5 Feature selection 251
13.6 Evaluation of text classification 258
13.7 References and further reading 264
14 Vector space classification 266
14.1 Document representations and measures of relatedness in vector spaces 267
14.2 Rocchio classification 269
14.3 k nearest neighbor 273
14.4 Linear versus nonlinear classifiers 277
14.5 Classification with more than two classes 281
14.6 The bias–variance tradeoff 284
14.7 References and further reading 291
15 Support vector machines and machine learning on documents 293
15.1 Support vector machines: The linearly separable case 294
15.2 Extensions to the support vector machine model 300
15.3 Issues in the classification of text documents 307
15.4 Machine-learning methods in ad hoc information retrieval 314
15.5 References and further reading 318
16 Flat clustering 321
16.1 Clustering in information retrieval 322
16.2 Problem statement 326
16.3 Evaluation of clustering 327
16.4 K-means 331
16.5 Model-based clustering 338
16.6 References and further reading 343
17 Hierarchical clustering 346
17.1 Hierarchical agglomerative clustering 347
17.2 Single-link and complete-link clustering 350
17.3 Group-average agglomerative clustering 356
17.4 Centroid clustering 358
17.5 Optimality of hierarchical agglomerative clustering 360
17.6 Divisive clustering 362
17.7 Cluster labeling 363
17.8 Implementation notes 365
17.9 References and further reading 367
18 Matrix decompositions and latent semantic indexing 369
18.1 Linear algebra review 369
18.2 Term–document matrices and singular valuede compositions 373
18.3 Low-rank approximations 376
18.4 Latent semantic indexing 378
18.5 References and further reading 383
19 Web search basics 385
19.1 Background and history 385
19.2 Web characteristics 387
19.3 Advertising as the economic model 392
19.4 The search user experience 395
19.5 Index size and estimation 396
19.6 Near-duplicates and shingling 400
19.7 References and further reading 404
20 Web crawling and indexes 405
20.1 Overview 405
20.2 Crawling 406
20.3 Distributing indexes 415
20.4 Connectivity servers 416
21 Link analysis 421
21.1 TheWeb as a graph 422
21.2 PageRank 424
21.3 Hubs and authorities 433
21.4 References and further reading 439
Inde 469
Bibliography 441
An example information retrieval problem
A fat book that many people own is Shakespeares Collected Works.Suppose you wanted to determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia.One way to do that is to start at the beginning and to read through all the text,noting for each play whether it contains Brutus and Caesar and excluding it from consideration if it contains Calpurnia.The simplest form of document retrieval is for a computer to do this sort of linear scan through documents.This process is commonly referred to as grepping through text,after the Unix command g r e p,which performs this process.Grepping through text can be a very effective process, especially given the speed of modem computers,and often allows useful possibilities for wildcard pattern matching through the use of regular expressions.With modem computers.for simple querying of modest collections (the size of Shakespeares Collected Works is a bit under one million words of text in total),you really need nothing more.
But for many purposes,you do need more:
1.To process large document collections quickly.The amount of online data has grown at least as quickly as the speed of computers,and we would now like to be able to search collections that total in the order of biHions to trillions of words.
2.To allow more flexible matching operations.For example,it is impractical to perform the query Romans NEAR countrymen with g r e p,where NEAR might be defined as within 5 words or within the same sentence?
3.To allow ranked retrieval.In many cases,you want the best answer to an information need among many documents that contain certain words. The way to avoid linearly scanning the texts for each query is to index the documents in advance.Let us stick with Shakespeares Collected Works,and use it to introduce the basics of the Boolean retrieval model.Suppose we record foreachdocument—here aplayofShakespeare’s—whetheritcontainseach word out of all the words Shakespeare used(Shakespeare used about 32,000 different words).The result is a binary term—document incidence matrix,as in Figure 1.1.Terms are the indexed units(further discussed in Section 2.2);they are usuany words,and for the moment you can think of them as wordsf but the information retrieval literature normally speaks of terms because some of them,such as perhaps I-9 or Hong Kong are not usuaHy thought of as words.