صفحه 1:
Modern Information Retrieval < 6: ‏موطاون)‎ Rettevd

صفحه 2:
Lecture Overview Ours Types ° Aarideure Orviors ° Aewerted Iedexes Query Provessiay Qvery Opiicvizatiog ۰ Oisrussiva * RePerewes هه چاه مرو ۵0۵00 موه ‎iran‏

صفحه 3:
Lecture Overview * Oats Types هه چاه مرو ۵0۵00 موه ‎Blackened‏

صفحه 4:
Types of data * Oustructured ۰ Gewisiructured * Gtructured هه چاه مرو ۵0۵00 موه ‎Blackened‏

صفحه 5:
Unstructured data * Dypicdly rePers to Pree text * Clows > Kepword quertes iakrdkery operators > Qore sophisticated ‘ovaept” queries &.4., «pad al ae pa eR ig oe ° Chessiv wodel Por searchicy text domueuts هه چاه مرو ۵0۵00 موه ‎iran‏

صفحه 6:
Unstructured (text) vs. structure (database) data in 1996 @Unstructured @ Structured Data volume Market Cap 9 هه چاه مرو ۵0۵00 موه ال

صفحه 7:
Unstructured (text) vs. structure (database) data in 2009 @Unstructured @ Structured Data volume Market Cap ° هه چاه مرو ۵0۵00 موه ‎iran‏

صفحه 8:
Unstructured data in 1680 © Whick phys of Ghokespeare octane words Brute BOO Ces bt DOT Capua? Re Ove onl grep ol of Gkchespewe's phys Por Orit ad Ouse, theo strip out toes oootaicicy Oupuratd? * Oh ts thot ot the cower? > Obw (Por bee corpora) > DOP Cobras & ‏نمی‎ > Otker vperaivas (e.x., Prerd the word Rowe cor powiqy@ed) ot Peasitle > Booked retrieval (best doonrente te retur) ‏سا ترا‎ ‏ص0 عصان موه‎ 3 ۵0۵00 موه ال

صفحه 9:
Semi-structured data ° Ie Pact okerst oo dtc is “costructured” * Cy, this slide kos disttony idectPied 270e5 suck us the Tike ord Bullets * ‏او(‎ “sewi-structured” searck suck a > Pike cxetcices chats PDO Bullets corcteacr search: ... ‏یا گه رای رود و‎ structure

صفحه 10:
Lecture Overview هه چاه مرو ۵0۵00 موه ‎Blackened‏

صفحه 11:
Term-document incidence Ouiovy wed Orepatra duhur Owvar ‘The Trapest ‘Drolet Otkelle Oucbetk ‎q o o o 4‏ 0 تسه ‎o q o o‏ 0 0 00 ‎o q 4 4‏ 0 0 ین ‎q ۰ 0 o o‏ ۰ سوه ‎o o o o‏ ۰ 0 سوت ‎q a 4 4‏ 0 0 330 ‎q 4 o‏ 4 ۰ 0 سس ‎CP ploy cookies word, D‏ ‎vise‏ ‏هه چاه مرو ‎ ‏۵0۵00 موه اس ربص تا ‎ ‎

صفحه 12:
Incidence vectors ° Gowe kwe a OM verter Por cack terw. ° Dp ceswer query: the the veviors Por @rutes, Ovwse ond Oupurat (cowplewesied) [] bitwise .00 ‎GOO 000000 @O”O@ 6200400 <‏ 100000200 * 00۰ 6 هه موه معط iran ‏موه‎ ۵0۵00

صفحه 13:
Bigger collections * Oowsider D = 1 wilica domuceds, ack wit bout DOO words. ۰ Buy O bytes/ word tochidiacy ‏جوم اوه‎ > OBO of date ‏عم صصصیك جملا جز‎ ۰ Go there we 0 = GOOK dstice ‏وه مه‎ 6 هه چاه مرو ۵0۵00 موه ‎iran‏

صفحه 14:
Can’t build the matrix * GOOK x 1D watrix kos holP-c-itliod O's aed (1's. . Osi ‏ات‎ Ne < on? ? ‏جا عضوو‎ Exteel sparse. ۰ Oke's uo better represrotaiod? > De ody revo te ( postions. ‎ae‏ هه چاه مرو ۵0۵00 موه ‎iran‏ ‎

صفحه 15:
Lecture Overview * ‏لصو‎ Iedexes هه چاه مرو ۵0۵00 موه ‎Blackened‏

صفحه 16:
Inverted index ° or euwk terw 4 we weet store o bst oP of domes thot orotic 6 < ‏رتیل‎ rack by ab AAD, ‏لو امنود حول و‎ ۱ ‏ها‎ this? ad {| 90 | ۵ veo] ae} o— 1] 418 1 1 91 9 [ 49] 9 ‎Sse] aad‏ 201 [ 2 ]ت۱۳ ‏سح و للم حول لت 8 ماب ‎ae? ‏هه ی ون‎ 08 ۳ ‏ا ی‎ ‘tie ‏ون‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎

صفحه 17:
Inverted index ۰ Oe weed voricble-size posto lists > On deh, a pvakaumus ruc oF posting te wre ced best > la weno, mr ee baked bets or vortable bert: ory ۶ Gowe tradevPPs ia size/eose ‏موه اه‎ Postery ۱0] 6 ۰ aad {| 90 | ۵ ۳۸۵ ۵۵ 8 ]9 9 6 1 418 إححم uc——> [eS [| Sd] S] aod a

صفحه 18:
Inverted index construction Friends, Romans, countrymen. Friends | [Romans | |Countrymen friend] [roman] [countryman ‎oot > Le i.‏ الم ‏ات ۳ ۳ ‎rower‏ ‏۵ب 49 ]0 | بح ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎

صفحه 19:
06 ‘rats ‘oid you ‘ambitious Indexer steps: Token * ‏مص يوه‎ oP (DodPed then, Doomed هه چاه ۵0۵00 موه sequence 10) privy. Ove 1 ‏مده‎ ‎Blackened

صفحه 20:
Indexer steps: Sort “erm ead “em | dosio 1 1 ambos aa 1 te 2 enact 1 ts 1 ios 1 tus 2 ۰ GS caesar 1 capitol 1 ۲7۷ ferns ! 1 ‏مم‎ 1 1 coat 2 < 0 AO ile ۱ cer 0 the 1 enact 1 capita 1 rath 1 ‘rus 1 1 1 red 1 1 1 3 3 2 1 1 : 2 ‏ی‎ 2 i 2 3 3 rn ۱ 1 2 tied 1 Re 5 pistes 1 ‘with 2 let 2 ‘caesar 2 me 1 = 5 rete 2 = 3 reba 3 one 5 ite ۱ ‘hath 2 a 2 tos 3 ‘od 3 3 you was 2 ‏ليا‎ 5 ambitious 2 ۳ 8 ‏هه هه موه سس‎ ۵0۵00 موه ال

صفحه 21:
Indexer steps: Dictionary & term doc. freq. — postings lists Teen | aoa ambuous ‏و‎ ambitious [1 ] ۰ bes 1 1 ‏صمص جاصنان<1)‎ Eins 2 rots | 2 cept! 1 capitol [1] potries itt caesar 1 4 cesar 2 caesar ‏مت هت رت‎ cess 2 17 0 ‏رل مه‎ 5-2 1 enact [1] bath 1 wath [T ۰ 111 Gpiit ‏جر ۲ سا‎ ‏له تمه‎ has تسچ‎ cn tle 1 ulus [1] i ‏تا‎ fet 4 ‘ater ime 1 Ove. rete 2 wit = 2 me Prequedy tte i noble [1 the 2 mee 0 jae 3 yw 2 the [2 ‏.ساعن‎ ES 3 ‘old [1 vat 2 ۳-۹ was [2 | with [T Oops Oka Overy ea Blackened ‏موه‎ ۵0۵00

صفحه 22:
Where do we pay in storage? ‏اي ب سس‎ عم سرا 5 aoe ieee ‘capitol [1 ] aaa did] i] enact | 1] the [2 wold [1] you [7] was [2 | with [T kak Ooversyy ‏وه‎ ‏بجوت‎ 0002 مرو ‎iran‏

صفحه 23:
Lecture Overview ° Query Provessingy هه چاه مرو ۵0۵00 موه ‎Blackened‏

صفحه 24:
Query processing: AND * ‏جل صنو0‎ provessioy ihe query: ‏سوجعو 2 (00(0) ص8‎ > bovate Onate ta he Dirty; ها موق * ‎ta the Dirty;‏ ویو( مورا خ ها موق ۶ ‎pose:‏ ها ‎٩‏ ظ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎ ‎Brutus‏ ]228[ [64], |32], | 16]ء|8], [4ب[2 7 |34 | [21]ب[ 13لی یدای[ 3 م2 11 هه چاه اس ‎iran Gorn 000 ‎ ‎ ‎ ‎

صفحه 25:
The merge ۶ ‏لور‎ through the tio posticgs sitmulccepusl), ic foe ‏وی اه وا و ما‎ of posticgs ruiries 2|_[4|_[8] [16 |_.[32]_.[64]_.[228] Brutus 34] Caesar 13 21 AP est becghs are oc ced, werge thes O(ety) operatic, ‏)السك روا له مس لسن‎ وه هه چاه ون ز ۵0۵00 موه ‎Blackened‏

صفحه 26:
Boolean queries: Exact match ° Dke (Dovleco retrieval wodel ts betey uble to ush a query thot is o @ovleon expression > Borla Quertes we BOD, OR ard DOT jvia query eras © Oiews cack ‏جه حول‎ a set oP words ° 4s previse! domed watches ooediion or wt. > Perkaps the skoplest wodel w bubd wa WR syste ‏مم‎ rivvary cowswercid retievdl tool Por O deoxndes. QOcuy seurck systews pou stil use ore @ovlead > Boral, throry vetaby, Dar OG X Spottt وه هه چاه مرو ۵0۵00 موه ‎iran‏

صفحه 27:
Boolean queries ۶ Daw proPessioad searchers sill the Bootes search ‏موه تیا نوخ‎ uke pou ore oeticry ° But that doesc't wea it actuchy works better... ‎er‏ هه چاه مرو ۵0۵00 موه ‎iran‏ ‎

صفحه 28:
Boolean queries: More general merges * Cxernise: (Bdupt the were Por the queries! @rnte PDO VDOT Owes @ris OR DOT Oesar Ova we sill ra through the werge in tere O( 4)? Okt cog we uckieve? ‎eo‏ هه چاه ۵0۵00 موه ‎

صفحه 29:
وه Merging Oke obowt os orbirary @ovleas Poraka? (Oras OR Owse) PDD DOT (Boy OR Okvpara) ° On we duos were io “eeu” tee? > Dicer ict wot? ° Ca we do better? هه چاه مرو ۵0۵00 موه ‎iran‏

صفحه 30:
280 Lecture Overview هه چاه مرو ۵0۵00 موه ‎Blackened‏

صفحه 31:
Query optimization * Okotis the best order Por query provessiccy? * ‏وونل‎ a query thot is ot POO ۵۴ ‏,واه‎ ‎* or cock of the otercs, vet its posting, theo @OO tew tovpter. u——>Te | | 0 1 06 | 56 66 266 u——>[aTelel[s][e | © ape ۵ ات۱ Queer’: ®ruts POO Caburata PDD ‏مت‎ 5 0ك ۵0۵00 موه 90 مرو ‎iran‏

صفحه 32:
Query optimization example ° Crovess to onder oP ‏سوه‎ Preg: > start ui swalest set, thea keep vudioy Porter. oS eT Ss Se ares o> 37 37 = ۳ 05 aq] OA] ۱۳] 7 ‎oe‏ هه چاه بو ۵0۵00 موه ‎iran‏ ‎ ‎ ‎ ‎ ‎ ‎

صفحه 33:
More general optimization ۶ Exp, (workday OR prowd) POO (kyaobke OR srfe) * Get doo. Preq.’s Por ull terxvs. 2° Cotkvate the stze of rack OR by the sur , DF its ‏حول‎ Preq.’s (cowwervaiive). a rovess in eoreusiag oder DP OR sizes. وه هه چاه ۵0۵00 موه

صفحه 34:
or 6660© 119 «9 5660© و۲۷۵۲ ۰ وه له ۰ Term (tangerine OR trees) AND ‏هرد‎ ‎(marmalade OR koletdoscope skies) AND 1 (kaleidoscope OR wurwulude eyes) shies (1-۶ trees tte OkorF Oaversty (Bkozvtcrceped ٠ ‏موه‎ ۵0۵00

صفحه 35:
Query processing exercises ° Cxervise: IP the query is Prieads POO rxnas @OVO (WOT vevattywed), kow ood we use the Freq oP povetpoed? ° Exercise: Cxteud the werye to oo orbirary @ovleca query. Cus we doops yucruder executive ic fie foear io the total posticcps size? ° dict: Bega wis the case oP a @ovleod Forwule query where eurk tern uppeurs ody cove ia the query. وه هه چاه مرو ۵0۵00 موه ‎iran‏

صفحه 36:
Exercise * Dry the searck ‏و‎ ot أعمدجموعحلحاه أومه. جدممصو مسو مسمس ]لصا حك ‎searck Pectures pou thick it out‏ تا مرول )۶ ‎better‏ ‏ص0 عصان ‎orm‏ ۵0۵00 موه ال

صفحه 37:
What’s ahead in IR? Beyond term search ° Oke ubout phrases? > CtnPord Oaversiy ° Proxivity: Picd Gates DEPR Oerescft. > Deed iadex te coptre postica Porwoticn ta docs. * Lowes io domed: (Pied docunects wits (wukor = Okew) POO (text coctcics cubautd). م6 و موه نا ‎(Spray QOGQ‏ ل ا

صفحه 38:
Evidence accumulation * dus. O vP a searck ter مسج ) مر 0 ذا > Ove. 5 ‏طه ,صصسسووم‎ > Dovid) wore seews better ° Deed tern Prequeup icPorewation itt dove )30 هه چاه مرو ۵0۵00 موه ‎iran‏

صفحه 39:
Ranking search results ° @ovlecd queries que husivd or exclusion oF dos. ۰ ‏صا من جر مان‎ rook/roup results > Deed @ weuswre proxkiy Prow query te rock db. > Deed to devide whelker devs preseued ‏ها‎ er are stodetvas, ‏تلو ی و ها‎ dees covertay vorivus usperts OF the query. وه هه چاه مرو ۵0۵00 موه ‎iran‏

صفحه 40:
Clustering, classification and ranking © OChosteriog! Bived o set oP dos, your thew into thosters bused va their cutee. ° OlassiPicaticn: Gives o set oP topics, plus a oew doo O, devide whick wpic(s) OD beloops tv. & Rochicg! Owe we teare how to best order a set oP ‎eo‏ هه چاه مرو ۵0۵00 موه ‎iran‏

صفحه 41:
60 Lecture Overview هه چاه مرو ۵0۵00 موه ‎Blackened‏

صفحه 42:
Next Time * 222 * Reads 2222?) > oye & Deed “Phe Phesorus Ppprowk to Ieboraioa Retrieval” (ice Reads bok) > bike “Phe Burwratie Derivation of ‏مه توا‎ Retievd Boaordewerus Pro Dackice-Reakable Texts” (i React) > Dovke “eadentcry cer Pbetratiay by Bosvvkaion, P11" (it Recers) ‎ee‏ هه چاه مرو ۵0۵00 موه ‎iran‏ ‎

صفحه 43:
Lecture Overview هه چاه مرو ۵0۵00 موه ‎Blackened‏

صفحه 44:
Resources for today’s lecture © Fatroduction ‏صا‎ ToPorwaiod Retrieval, chapter (| "0 > hip /howw rye. ver) sbakespeare! > Dev tke aed browse by keyword sequewe Feature! ° Dacniay ‏موجه بطم‎ 0.0 9° Dodera Porras Retrierd, ckopter 8.8 ‎ee‏ هه چاه مرو ۵0۵00 موه ‎iran‏ ‎

Modern Information Retrieval Lecture 3: Boolean Retrieval Lecture Overview • Data Types • Incidence Vectors • Inverted Indexes • Query Processing • Query Optimization • Discussion • References Marjan Ghazvininejad Sharif University Spring 2012 2 Lecture Overview • Data Types • Incidence Vectors • Inverted Indexes • Query Processing • Query Optimization • Discussion • References Marjan Ghazvininejad Sharif University Spring 2012 3 Types of data • Unstructured • Semi-structured • Structured Marjan Ghazvininejad Sharif University Spring 2012 4 Unstructured data • Typically refers to free text • Allows  Keyword queries including operators  More sophisticated “concept” queries e.g., • find all web pages dealing with drug abuse • Classic model for searching text documents Marjan Ghazvininejad Sharif University Spring 2012 5 Unstructured (text) vs. structured (database) data in 1996 Marjan Ghazvininejad Sharif University Spring 2012 6 Unstructured (text) vs. structure (database) data in 2009 Marjan Ghazvininejad Sharif University Spring 2012 7 Sec. 1.1 Unstructured data in 1680 • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? • One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia? • Why is that not the answer?  Slow (for large corpora)  NOT Calpurnia is non-trivial  Other operations (e.g., find the word Romans near countrymen) not feasible  Ranked retrieval (best documents to return) • Later lectures Marjan Ghazvininejad Sharif University Spring 2012 8 Semi-structured data • In fact almost no data is “unstructured” • E.g., this slide has distinctly identified zones such as the Title and Bullets • Facilitates “semi-structured” search such as  Title contains data AND Bullets contain search … to say nothing of linguistic structure 9 Lecture Overview • Data Types • Incidence Vectors • Inverted Indexes • Query Processing • Query Optimization • Discussion • References Marjan Ghazvininejad Sharif University Spring 2012 10 Sec. 1.1 Term-document incidence An t o n y an d Cl eo p at ra Ju li u s Caesar Th e Temp est Haml et Ot h el lo Macb et h An t o n y 1 1 0 0 0 1 Bru t u s 1 1 0 1 0 0 Caesa r 1 1 0 1 1 1 Ca lp u rn i a 0 1 0 0 0 0 Cl eo p at ra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 wo rser 1 0 1 1 1 0 Brutus AND Caesar BUT NOT Calpurnia Marjan Ghazvininejad Sharif University Spring 2012 1 if play contains word, 0 otherwise 11 Sec. 1.1 Incidence vectors • So we have a 0/1 vector for each term. • To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)  bitwise AND. • 110100 AND 110111 AND 101111 = 100100. Marjan Ghazvininejad Sharif University Spring 2012 12 Sec. 1.1 Bigger collections • Consider N = 1 million documents, each with about 1000 words. • Avg 6 bytes/word including spaces/punctuation  6GB of data in the documents. • Say there are M = 500K distinct terms among these. Marjan Ghazvininejad Sharif University Spring 2012 13 Sec. 1.1 Can’t build the matrix • 500K x 1M matrix has half-a-trillion 0’s and 1’s. • But it has no more than one billion 1’s.  matrix is extremely sparse. Why? • What’s a better representation?  We only record the 1 positions. Marjan Ghazvininejad Sharif University Spring 2012 14 Lecture Overview • Data Types • Incidence Vectors • Inverted Indexes • Query Processing • Query Optimization • Discussion • References Marjan Ghazvininejad Sharif University Spring 2012 15 Sec. 1.2 Inverted index • For each term t, we must store a list of all documents that contain t.  Identify each by a docID, a document serial number • Can we used fixed-size arrays for this? Brutus 1 2 4 11 31 45 173 174 Caesar 1 2 4 5 6 16 Calpurnia 2 31 57 132 54 101 What happens if the word Caesar is added to document 14? Marjan Ghazvininejad Sharif University Spring 2012 16 Sec. 1.2 Inverted index • We need variable-size postings lists  On disk, a continuous run of postings is normal and best  In memory, can use linked lists or variable length arrays • Some tradeoffs in size/ease of insertion Posting Brutus 1 2 4 11 31 45 173 174 Caesar 1 2 4 5 6 16 Calpurnia 2 31 Dictionary 57 132 54 101 Postings Sorted by docID (more later on why). 17 Sec. 1.2 Inverted index construction Documents to be indexed Friends, Romans, countrymen. Tokenizer Token stream Friends Romans Countrymen friend roman countryman Linguistic modules Modified tokens Indexer Inverted index friend 2 4 roman 1 2 countryman 13 16 Sec. 1.2 Indexer steps: Token sequence • Sequence of (Modified token, Document ID) pairs. Doc 1 I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. Marjan Ghazvininejad Doc 2 So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious Sharif University Spring 2012 19 Sec. 1.2 Indexer steps: Sort • Sort by terms  And then docID Core indexing step Marjan Ghazvininejad Sharif University Spring 2012 20 Sec. 1.2 Indexer steps: Dictionary & Postings • Multiple term entries in a single document are merged. • Split into Dictionary and Postings • Doc. frequency information is added. Marjan Ghazvininejad Sharif University Spring 2012 21 Sec. 1.2 Where do we pay in storage? Lists of docIDs Later in the course: • How do we index efficiently? • How much storage do we need? Terms and counts Marjan Ghazvininejad Pointers Sharif University Spring 2012 22 Lecture Overview • Data Types • Incidence Vectors • Inverted Indexes • Query Processing • Query Optimization • Discussion • References Marjan Ghazvininejad Sharif University Spring 2012 23 Sec. 1.3 Query processing: AND • Consider processing the query: Brutus AND Caesar  Locate Brutus in the Dictionary; • Retrieve its postings.  Locate Caesar in the Dictionary; • Retrieve its postings.  “Merge” the two postings: 2 4 8 16 1 2 3 5 Marjan Ghazvininejad 32 8 64 13 Sharif University Spring 2012 128 21 34 Brutus Caesar 24 Sec. 1.3 The merge • Walk through the two postings simultaneously, in time linear in the total number of postings entries 2 8 2 4 8 16 1 2 3 5 32 8 64 13 128 21 34 Brutus Caesar If list lengths are x and y, merge takes O(x+y) operations. Crucial: postings sorted by docID. Marjan Ghazvininejad Sharif University Spring 2012 25 Sec. 1.3 Boolean queries: Exact match • The Boolean retrieval model is being able to ask a query that is a Boolean expression:  Boolean Queries use AND, OR and NOT to join query terms • Views each document as a set of words • Is precise: document matches condition or not.  Perhaps the simplest model to build an IR system on • Primary commercial retrieval tool for 3 decades. • Many search systems you still use are Boolean:  Email, library catalog, Mac OS X Spotlight Marjan Ghazvininejad Sharif University Spring 2012 26 Boolean queries • Many professional searchers still like Boolean search  You know exactly what you are getting • But that doesn’t mean it actually works better…. Marjan Ghazvininejad Sharif University Spring 2012 27 Boolean queries: More general merges Sec. 1.3 • Exercise: Adapt the merge for the queries: Brutus AND NOT Caesar Brutus OR NOT Caesar Can we still run through the merge in time O(x+y)? What can we achieve? Marjan Ghazvininejad Sharif University Spring 2012 28 Sec. 1.3 Merging What about an arbitrary Boolean formula? (Brutus OR Caesar) AND NOT (Antony OR Cleopatra) • Can we always merge in “linear” time?  Linear in what? • Can we do better? Marjan Ghazvininejad Sharif University Spring 2012 29 Lecture Overview • Data Types • Incidence Vectors • Inverted Indexes • Query Processing • Query Optimization • Discussion • References Marjan Ghazvininejad Sharif University Spring 2012 30 Sec. 1.3 Query optimization • What is the best order for query processing? • Consider a query that is an AND of n terms. • For each of the n terms, get its postings, then AND them together. Brutus 2 4 8 16 32 Caesar 1 2 3 5 8 Calpurnia 13 16 64 128 16 21 34 Query: Brutus AND Calpurnia AND Caesar Marjan Ghazvininejad Sharif University Spring 2012 31 31 Sec. 1.3 Query optimization example • Process in order of increasing freq:  start with smallest set, then keep cutting further. This is why we kept document freq. in dictionary Brutus 2 Caesar 1 2 13 16 Calpurnia 4 8 16 32 64 128 3 5 8 16 21 34 Execute the query as (Calpurnia AND Brutus) AND Caesar. Marjan Ghazvininejad Sharif University Spring 2012 32 Sec. 1.3 More general optimization • e.g., (madding OR crowd) AND (ignoble OR strife) • Get doc. freq.’s for all terms. • Estimate the size of each OR by the sum of its doc. freq.’s (conservative). • Process in increasing order of OR sizes. Marjan Ghazvininejad Sharif University Spring 2012 33 Exercise • Recommend a query processing order for Term (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes) Marjan Ghazvininejad e ye s kaleidoscope m a rm a l a d e skies t a n g e ri n e t re e s Sharif University Spring 2012 Freq 213312 87009 107913 271658 46653 316812 34 Query processing exercises • Exercise: If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen? • Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size? • Hint: Begin with the case of a Boolean formula query where each term appears only once in the query. Marjan Ghazvininejad Sharif University Spring 2012 35 Exercise • Try the search feature at http://www.rhymezone.com/shakespeare/ • Write down five search features you think it could do better Marjan Ghazvininejad Sharif University Spring 2012 36 What’s ahead in IR? Beyond term search • What about phrases?  Stanford University • Proximity: Find Gates NEAR Microsoft.  Need index to capture position information in docs. • Zones in documents: Find documents with (author = Ullman) AND (text contains automata). Marjan Ghazvininejad Sharif University Spring 2012 37 Evidence accumulation • 1 vs. 0 occurrence of a search term  2 vs. 1 occurrence  3 vs. 2 occurrences, etc.  Usually more seems better • Need term frequency information in docs Marjan Ghazvininejad Sharif University Spring 2012 38 Ranking search results • Boolean queries give inclusion or exclusion of docs. • Often we want to rank/group results  Need to measure proximity from query to each doc.  Need to decide whether docs presented to user are singletons, or a group of docs covering various aspects of the query. Marjan Ghazvininejad Sharif University Spring 2012 39 Clustering, classification and ranking • Clustering: Given a set of docs, group them into clusters based on their contents. • Classification: Given a set of topics, plus a new doc D, decide which topic(s) D belongs to. • Ranking: Can we learn how to best order a set of documents, e.g., a set of search results Marjan Ghazvininejad Sharif University Spring 2012 40 Lecture Overview • Data Types • Incidence Vectors • Inverted Indexes • Query Processing • Query Optimization • Discussion • References Marjan Ghazvininejad Sharif University Spring 2012 41 Next Time • ??? • Readings  Chapter ??? in IR text (?????????????)  Joyce & Needham “The Thesaurus Approach to Information Retrieval” (in Readings book)  Luhn “The Automatic Derivation of Information Retrieval Encodements from Machine-Readable Texts” (in Readings)  Doyle “Indexing and Abstracting by Association, Pt I” (in Readings) Marjan Ghazvininejad Sharif University Spring 2012 42 Lecture Overview • Data Types • Incidence Vectors • Inverted Indexes • Query Processing • Query Optimization • Discussion • References Marjan Ghazvininejad Sharif University Spring 2012 43 Resources for today’s lecture • Introduction to Information Retrieval, chapter 1 • Shakespeare:  http://www.rhymezone.com/shakespeare/  Try the neat browse by keyword sequence feature! • Managing Gigabytes, chapter 3.2 • Modern Information Retrieval, chapter 8.2 Marjan Ghazvininejad Sharif University Spring 2012 44

51,000 تومان