# Week 2 Unit 2: Document and query processing

- IIR section 1.2, chapters 2 and 3

In section 1.2, "A first take at building an inverted index," I learned the major steps in building the index (a minimal Python sketch follows the list):
1) Collect the documents to be indexed.
2) Tokenize the text, turning each document into a list of tokens.
3) Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.
4) Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
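Here is a minimal sketch of these four steps; the toy collection and its document IDs are made up for illustration, and lowercasing stands in for full linguistic preprocessing:

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Toy inverted index: map each term to a sorted postings list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Step 2: tokenize, chopping the text into tokens and dropping punctuation
        # Step 3: lowercase as a crude stand-in for linguistic preprocessing
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            # Step 4: record that this term occurs in this document
            index[term].add(doc_id)
    # The dictionary maps terms to postings lists, each sorted by doc ID
    return {term: sorted(postings) for term, postings in index.items()}

# Step 1: collect the documents to be indexed (a made-up toy collection)
docs = {1: "new home sales top forecasts",
        2: "home sales rise in July",
        3: "increase in home sales in July"}
index = build_inverted_index(docs)
print(index["home"])  # -> [1, 2, 3]
print(index["july"])  # -> [2, 3]
```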
Chapter 2 covers the term vocabulary and postings lists. I learned that the first step of processing is to convert the raw byte sequence of each document into a linear sequence of characters, and the next phase is to determine what the document unit for indexing is. Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Then we drop stop words, typically chosen by sorting the terms by collection frequency and taking the most frequent ones. To make queries match occurrences despite superficial differences in the character sequences, we perform token normalization; the most standard way to normalize is to implicitly create equivalence classes, for example by lowercasing so that "Automobile" and "automobile" match. Moreover, to reduce inflectional forms and sometimes derivationally related forms of a word to a common base, we do stemming and lemmatization. A small sketch of this pipeline follows.
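This sketch assumes a tiny hand-picked stop list, and a toy suffix-stripper stands in for a real stemmer such as Porter's:

```python
import re

STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}  # tiny hand-picked list

def crude_stem(term):
    """Toy suffix-stripper standing in for a real stemmer such as Porter's."""
    for suffix in ("ational", "ation", "ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(text):
    # Tokenization: chop the character sequence into tokens, discarding punctuation
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercasing = one simple equivalence class
    # Drop stop words (in practice chosen from the most frequent collection terms)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Stemming: reduce inflectional forms of each surviving token
    return [crude_stem(t) for t in tokens]

print(preprocess("The houses are selling, and sales keep rising in July."))
# -> ['hous', 'sell', 'sal', 'keep', 'ris', 'july']
```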
Chapter 3 develops the data structures that support searching for terms in the vocabulary of an inverted index, to determine whether each query term exists in the vocabulary and, if so, to identify the pointer to the corresponding postings. Then I learned about wildcard queries, which are useful in these situations: 1) the user is uncertain of the spelling of a query term; 2) the user is aware of multiple variants of spelling a term and seeks documents containing any of the variants; 3) the user seeks documents containing variants of a term that would be caught by stemming, but is unsure whether the search engine performs stemming; 4) the user is uncertain of the correct rendition of a foreign word or phrase.

Afterwards, I learned the two basic principles underlying most spelling correction algorithms, and the chapter focuses on two specific forms of spelling correction: isolated-term correction and context-sensitive correction. It examines two techniques for isolated-term correction, edit distance and k-gram overlap, before proceeding to context-sensitive correction.

The final technique of tolerant retrieval is phonetic correction, which handles misspellings that arise because the user types a query that sounds like the target term. The original soundex algorithm builds on the following scheme: 1) turn every term to be indexed into a 4-character reduced form, and build an inverted index from the reduced forms to the original terms; call this the soundex index; 2) do the same with query terms; 3) when the query calls for a soundex match, search this soundex index. Sketches of edit distance and soundex follow.
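First, a sketch of the classic dynamic-programming computation of Levenshtein edit distance, i.e. the minimum number of character insertions, deletions, and substitutions needed to turn one string into another:

```python
def edit_distance(s, t):
    """Levenshtein edit distance between strings s and t, by dynamic programming."""
    m, n = len(s), len(t)
    # dist[i][j] = edit distance between the prefixes s[:i] and t[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion from s
                             dist[i][j - 1] + 1,         # insertion into s
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[m][n]

print(edit_distance("kitten", "sitting"))  # -> 3
```

Next, a sketch of the soundex reduction and soundex index described above, following the classic letter-to-digit scheme; the sample vocabulary is made up:

```python
from collections import defaultdict

# Classic soundex letter-to-digit scheme; unlisted letters (vowels, h, w, y) map to '0'
CODES = {ch: digit
         for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                ("l", "4"), ("mn", "5"), ("r", "6"))
         for ch in letters}

def soundex(term):
    """Reduce a term to its 4-character soundex form (a letter plus three digits)."""
    term = term.lower()
    digits = [CODES.get(ch, "0") for ch in term[1:]]  # keep the first letter as-is
    # Repeatedly remove one of each pair of consecutive identical digits
    collapsed = []
    for d in digits:
        if not collapsed or d != collapsed[-1]:
            collapsed.append(d)
    # Remove the zeros, then pad with trailing zeros to four characters
    body = "".join(d for d in collapsed if d != "0")
    return (term[0].upper() + body + "000")[:4]

# Step 1: build the soundex index from reduced forms to original terms
vocab = ["herman", "hermann", "harmon", "miller"]  # made-up vocabulary
soundex_index = defaultdict(set)
for word in vocab:
    soundex_index[soundex(word)].add(word)

# Steps 2-3: reduce the query term the same way and search the soundex index
print(soundex("hermann"))                # -> 'H655'
print(soundex_index[soundex("hermon")])  # -> {'herman', 'hermann', 'harmon'} (in some order)
```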





