Page 1:
Modern Information Retrieval
Boolean Retrieval
Page 2:
Lecture Overview
* Data Types
* Incidence Vectors
* Inverted Indexes
* Query Processing
* Query Optimization
* Discussion
* References
Page 3:
Lecture Overview
* Data Types
Page 4:
Types of data
* Unstructured
* Semi-structured
* Structured
Page 5:
Unstructured data
* Typically refers to free text
* Allows
> Keyword queries including operators
> More sophisticated "concept" queries, e.g., [example illegible in source]
* Classic model for searching text documents
Page 6:
Unstructured (text) vs. structured (database) data in 1996
[Figure: bar chart comparing unstructured and structured data by data volume and by market cap]
Page 7:
Unstructured (text) vs. structured (database) data in 2009
[Figure: bar chart comparing unstructured and structured data by data volume and by market cap]
Page 8:
Unstructured data in 1680
* Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
* One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia?
* Why is that not the answer?
> Slow (for large corpora)
> NOT Calpurnia is non-trivial
> Other operations (e.g., find the word Romans near countrymen) not feasible
> Ranked retrieval (best documents to return)
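To make the cost of the grep approach concrete, here is a minimal sketch of a linear scan. The three miniature "plays" and the function name `linear_scan` are hypothetical, chosen only for illustration; the point is that every query re-reads every document.

```python
# Hypothetical miniature "plays"; real texts are far longer.
plays = {
    "Julius Caesar": "Brutus and Caesar speak while Calpurnia warns Caesar",
    "Hamlet": "Hamlet recalls that Brutus killed Caesar in the Capitol",
    "The Tempest": "Prospero raises a storm with mercy for none",
}


def linear_scan(plays, required, excluded):
    """Return titles whose text has every required word and no excluded word."""
    hits = []
    for title, text in plays.items():
        # The whole text is re-read and re-split on every query.
        words = set(text.lower().split())
        has_all = all(w in words for w in required)
        has_none = not any(w in words for w in excluded)
        if has_all and has_none:
            hits.append(title)
    return hits


result = linear_scan(plays, required=["brutus", "caesar"], excluded=["calpurnia"])
print(result)
```

The work per query grows with the total size of the collection, which is the "slow for large corpora" problem the slide points at.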
Page 9:
Semi-structured data
* In fact almost no data is "unstructured"
* E.g., this slide has distinctly identified zones such as the Title and Bullets
* Facilitates "semi-structured" search such as
> Title contains data AND Bullets contain search
... to say nothing of linguistic structure
Page 10:
Lecture Overview
* Incidence Vectors
Page 11:
Term-document incidence
[Table: term-document incidence matrix. Columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth. Row labels and cell values are not recoverable.]
1 if play contains word, 0 otherwise
Page 12:
Incidence vectors
* So we have a 0/1 vector for each term.
* To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND.
[Worked bit-vector example not recoverable.]
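The bitwise AND over incidence vectors can be sketched as follows. The three vectors are an assumption: they follow the widely used textbook version of this example (one bit per play, in the column order of the incidence table) and were not read from this slide.

```python
# Column order of the incidence table: one bit per play, leftmost bit first.
plays = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]

# Assumed 0/1 incidence vectors (textbook values, not taken from the slide).
vectors = {
    "brutus": 0b110100,
    "caesar": 0b110111,
    "calpurnia": 0b010000,
}

ALL_PLAYS = (1 << len(plays)) - 1  # six 1-bits, used to bound the complement

# Brutus AND Caesar AND NOT Calpurnia
not_calpurnia = ~vectors["calpurnia"] & ALL_PLAYS
answer = vectors["brutus"] & vectors["caesar"] & not_calpurnia

# Translate the set bits back into play titles.
matching = [
    title
    for i, title in enumerate(plays)
    if (answer >> (len(plays) - 1 - i)) & 1
]
print(format(answer, "06b"), matching)
```

Masking with `ALL_PLAYS` matters because Python integers are unbounded, so a bare `~` would produce a negative number rather than a six-bit complement.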
Page 13:
Bigger collections
* Consider N = 1 million documents, each with about 1000 words.
* Avg 6 bytes/word including spaces/punctuation
> 6GB of data in the documents.
* Say there are M = 500K distinct terms among these.
Page 14:
Can't build the matrix
* 500K x 1M matrix has half-a-trillion 0's and 1's.
* But it has no more than one billion 1's.
> The matrix is extremely sparse.
* What's a better representation?
> We only record the 1 positions.
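The sizes involved can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the figure of about 1000 words per document is an assumed average, and the variable names are illustrative.

```python
n_docs = 1_000_000       # N: documents in the collection
words_per_doc = 1_000    # assumed average document length
bytes_per_word = 6       # average, including spaces and punctuation
n_terms = 500_000        # M: distinct terms

collection_bytes = n_docs * words_per_doc * bytes_per_word  # raw text size
matrix_cells = n_terms * n_docs                             # cells in the 0/1 matrix
max_ones = n_docs * words_per_doc                           # at most one 1 per word occurrence
density = max_ones / matrix_cells                           # upper bound on the share of 1's

print(f"collection: {collection_bytes / 1e9:.0f} GB")
print(f"matrix cells: {matrix_cells:.1e}")
print(f"at most {max_ones:.1e} ones, density <= {density:.3%}")
```

At most 0.2% of the cells can hold a 1, which is why recording only the 1 positions is so much cheaper than storing the full matrix.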
Page 15:
Lecture Overview
* Inverted Indexes
Page 16:
Inverted index
* For each term T, we must store a list of all documents that contain T.
> Identify each by a docID, a document serial number.
> Can we use fixed-size arrays for this?
[Figure: example postings lists, one list of docIDs per term; the entries are not recoverable.]
Page 17:
Inverted index
* We need variable-size postings lists
> On disk, a continuous run of postings is normal and best
> In memory, can use linked lists or variable length arrays
* Some tradeoffs in size/ease of insertion
[Figure: dictionary entries pointing to their postings lists; a single list entry is labeled "Posting".]
Page 18:
Inverted index construction
Friends, Romans, countrymen.
Friends | Romans | Countrymen
friend | roman | countryman
[Figure: pipeline from document text to tokens to normalized tokens to the inverted index; the index entries are not recoverable.]
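The first two stages of this pipeline (tokenizing, then normalizing) can be sketched as below. The tiny `LEMMAS` table is a hypothetical stand-in for real linguistic modules such as a stemmer or lemmatizer; it covers only the three words on the slide.

```python
import re

# Hypothetical hand-written lemma table standing in for the linguistic modules.
LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}


def tokenize(text):
    """Split text into runs of letters (apostrophes allowed), dropping punctuation."""
    return re.findall(r"[A-Za-z']+", text)


def normalize(token):
    """Lowercase the token and map it to its lemma when the table knows one."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


tokens = tokenize("Friends, Romans, countrymen.")
terms = [normalize(t) for t in tokens]
print(tokens)
print(terms)
```

The normalized terms, not the raw tokens, are what the indexer stores.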
Page 19:
Indexer steps: Token sequence
* Sequence of (Modified token, Document ID) pairs.
[Figure: two example documents, Doc 1 and Doc 2, beside the resulting table of (term, docID) pairs; the entries are not reliably recoverable.]
Page 20:
Indexer steps: Sort
[Table: the (term, docID) pairs, in document order on the left and sorted by term on the right; individual entries are not reliably recoverable.]
Page 21:
Indexer steps: Dictionary & Postings
* Multiple term entries in a single document are merged.
* Split into Dictionary and Postings.
* Doc. frequency information is added.
[Table: dictionary (term, doc. freq.) with each entry pointing to its postings list; individual entries are not reliably recoverable.]
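The three indexer steps (token sequence, sort, dictionary and postings) can be sketched end to end. The two document texts are an assumption taken from the usual textbook version of this example, since the slide's own text did not survive; the tokenizer is a deliberately crude illustration.

```python
import re
from itertools import groupby

# Assumed example documents (textbook wording, not recovered from the slides).
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def tokenize(text):
    """Crude tokenizer: lowercase runs of letters and apostrophes."""
    return [t.lower() for t in re.findall(r"[A-Za-z']+", text)]


# Step 1: sequence of (term, docID) pairs in order of appearance.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]

# Step 2: sort by term, then by docID.
pairs.sort()

# Step 3: merge duplicate entries and split into dictionary and postings.
postings = {}
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    postings[term] = sorted({doc_id for _, doc_id in group})

# The dictionary keeps each term's document frequency next to its postings.
doc_freq = {term: len(doc_ids) for term, doc_ids in postings.items()}

for term in ("brutus", "caesar", "killed", "ambitious"):
    print(term, doc_freq[term], postings[term])
```

Note that "killed" occurs twice in document 1 but contributes a single posting, which is the merging step described in the first bullet.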
Page 22:
Where do we pay in storage?
[Figure: the dictionary and postings lists from the previous slide; the storage annotations are not recoverable.]
Page 23:
Lecture Overview
* Query Processing
Page 24:
Query processing: AND
* Consider processing the query:
Brutus AND Caesar
> Locate Brutus in the Dictionary; retrieve its postings.
> Locate Caesar in the Dictionary; retrieve its postings.
> "Merge" the two postings:
Brutus: 2 4 8 16 32 64 128
Caesar: 1 2 3 5 8 13 21 34
Page 25:
The merge
* Walk through the two postings simultaneously, in time linear in the total number of postings entries
Brutus: 2 4 8 16 32 64 128
Caesar: 1 2 3 5 8 13 21 34
If the list lengths are x and y, the merge takes O(x+y) operations.
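A minimal sketch of the merge, using the two postings lists shown on the slide. It assumes both lists are sorted by docID, which is what lets each pointer move forward only; the function name `intersect` is illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID present in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1 is behind, so only p1 can catch up
        else:
            j += 1  # p2 is behind
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

Each loop iteration advances at least one pointer, so the loop runs at most x + y times for lists of length x and y.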
Page 26:
Boolean queries: Exact match
* The Boolean retrieval model is being able to ask a query that is a Boolean expression:
> Boolean queries use AND, OR and NOT to join query terms
* Views each document as a set of words
* Is precise: document matches condition or not.
> Perhaps the simplest model to build an IR system on
* Primary commercial retrieval tool for 3 decades.
* Many search systems you still use are Boolean:
> Email, library catalog, Mac OS X Spotlight
Page 27:
Boolean queries
* Many professional searchers still like Boolean search
> You know exactly what you are getting
* But that doesn't mean it actually works better...
Page 28:
Boolean queries: More general merges
* Exercise: Adapt the merge for the queries:
Brutus AND NOT Caesar
Brutus OR NOT Caesar
Can we still run through the merge in time O(x+y)?
What can we achieve?
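One possible approach to the first query, sketched under the assumption that both postings lists are sorted by docID. The function name `and_not` is illustrative; the OR NOT case is left open, since its answer depends on the whole collection rather than on the two lists alone.

```python
def and_not(p1, p2):
    """Return docIDs in sorted list p1 that do not occur in sorted list p2."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            # p2 is exhausted or already past p1[i], so p1[i] is not in p2.
            answer.append(p1[i])
            i += 1
        elif p1[i] == p2[j]:
            # Present in both lists: excluded from the result.
            i += 1
            j += 1
        else:
            j += 1  # p2 is behind; advance it
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(and_not(brutus, caesar))  # [4, 16, 32, 64, 128]
```

As in the plain intersection, every iteration advances a pointer, so this variant also stays within O(x+y) steps.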
Page 29:
Merging
What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
* Can we always merge in "linear" time?
> Linear in what?
* Can we do better?
Page 30:
Lecture Overview
* Query Optimization
Page 31:
Query optimization
* What is the best order for query processing?
* Consider a query that is an AND of several terms.
* For each of the terms, get its postings, then AND them together.
[Figure: postings lists for the three query terms; the entries are not recoverable.]
Query: Brutus AND Calpurnia AND Caesar
Page 32:
Query optimization example
* Process in order of increasing freq:
> start with smallest set, then keep cutting further.
[Figure: the three postings lists again; the entries are not recoverable.]
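Processing in order of increasing frequency can be sketched as below. The postings for Brutus and Caesar reuse the lists from the merge slides; the Calpurnia list and the helper names are hypothetical, chosen so that the result is non-empty.

```python
# Brutus and Caesar as on the merge slides; the Calpurnia list is hypothetical.
index = {
    "brutus": [2, 4, 8, 16, 32, 64, 128],
    "caesar": [1, 2, 3, 5, 8, 13, 21, 34],
    "calpurnia": [2, 16],
}


def intersect(a, b):
    """Intersection of two docID lists, preserving the order of a."""
    in_b = set(b)
    return [doc_id for doc_id in a if doc_id in in_b]


def and_query(terms, index):
    """AND the terms together, rarest term first."""
    ordered = sorted(terms, key=lambda t: len(index[t]))  # increasing doc. freq.
    result = index[ordered[0]]
    for term in ordered[1:]:
        result = intersect(result, index[term])
        if not result:
            break  # nothing left to cut; later terms cannot add documents
    return ordered, result


order, result = and_query(["brutus", "calpurnia", "caesar"], index)
print(order, result)
```

Starting from the shortest list keeps every intermediate result no larger than that list, so later intersections have less to scan.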
Page 33:
More general optimization
* e.g., (madding OR crowd) AND (ignoble OR strife)
* Get doc. freq.'s for all terms.
* Estimate the size of each OR by the sum of its doc. freq.'s (conservative).
* Process in increasing order of OR sizes.
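The estimation step can be sketched as follows. The document frequencies are hypothetical numbers invented for illustration; only the procedure (sum the frequencies inside each OR, then sort the OR groups by that sum) comes from the slide.

```python
# Hypothetical document frequencies, for illustration only.
doc_freq = {"madding": 10, "crowd": 500, "ignoble": 40, "strife": 120}

# The query as a conjunction of OR groups:
# (madding OR crowd) AND (ignoble OR strife)
query = [("madding", "crowd"), ("ignoble", "strife")]


def or_size_estimate(group):
    """Upper bound on the size of an OR: the sum of its terms' doc. freq.'s."""
    return sum(doc_freq[term] for term in group)


# Process the OR groups in increasing order of estimated size.
ordered = sorted(query, key=or_size_estimate)
for group in ordered:
    print(group, or_size_estimate(group))
```

The sum is conservative because documents containing both terms of a group are counted twice, so the true OR result can only be smaller.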
Page 34:
Exercise
* Recommend a query processing order for
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
[Table: Term and Freq columns for eyes, kaleidoscope, marmalade, skies, tangerine, trees; the frequency values are not recoverable.]
Page 35:
Query processing exercises
* Exercise: If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?
* Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size?
* Hint: Begin with the case of a Boolean formula query where each term appears only once in the query.
Page 36:
Exercise
* Try the search feature at [URL illegible in source]
* Write down search features you think it could do better
Page 37:
What's ahead in IR?
Beyond term search
* What about phrases?
> Stanford University
* Proximity: Find Gates NEAR Microsoft.
> Need index to capture position information in docs.
* Zones in documents: Find documents with (author = Ullman) AND (text contains automata).
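To illustrate why proximity queries need position information, here is a minimal sketch of a positional index. The two documents, the whitespace tokenizer, and the function names are all hypothetical; `k` is the maximum allowed distance in word positions.

```python
from collections import defaultdict

# Hypothetical documents, for illustration only.
docs = {
    1: "gates joined microsoft early",
    2: "the gates of the old city opened long before microsoft existed",
}


def build_positional_index(docs):
    """Map term -> docID -> list of word positions where the term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def near(index, t1, t2, k):
    """docIDs where some occurrence of t1 is within k positions of t2."""
    hits = []
    for doc_id in sorted(set(index[t1]) & set(index[t2])):
        close = any(
            abs(p1 - p2) <= k
            for p1 in index[t1][doc_id]
            for p2 in index[t2][doc_id]
        )
        if close:
            hits.append(doc_id)
    return hits


index = build_positional_index(docs)
print(near(index, "gates", "microsoft", 2))  # [1]
```

A plain docID-only index would return both documents for Gates AND Microsoft; the stored positions are what let the NEAR operator reject the second one.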
Page 38:
Evidence accumulation
* 1 vs. 0 occurrence of a search term
> 2 vs. 1 occurrence
> 3 vs. 2 occurrences, etc.
> Usually more seems better
* Need term frequency information in docs
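A minimal sketch of keeping term frequency information per document. The two documents are hypothetical, and ordering by raw count is only the simplest possible use of the counts, not a recommended ranking function.

```python
from collections import Counter

# Hypothetical documents, for illustration only.
docs = {
    1: "caesar praised caesar and caesar again",
    2: "brutus met caesar once",
}

# Term frequencies: for each document, how often each term occurs.
term_freq = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# Order documents by how often they mention the search term.
ranked = sorted(docs, key=lambda doc_id: term_freq[doc_id]["caesar"], reverse=True)
print(ranked, [term_freq[d]["caesar"] for d in ranked])
```

A Boolean index would treat both documents the same for the query caesar; the counts are the extra evidence that lets one be preferred.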
Page 39:
Ranking search results
* Boolean queries give inclusion or exclusion of docs.
* Often we want to rank/group results
> Need to measure proximity from query to each doc.
> Need to decide whether docs presented to user are singletons, or a group of docs covering various aspects of the query.
Page 40:
Clustering, classification and ranking
* Clustering: Given a set of docs, group them into clusters based on their contents.
* Classification: Given a set of topics, plus a new doc D, decide which topic(s) D belongs to.
* Ranking: Can we learn how to best order a set of documents?
Page 41:
Lecture Overview
Page 42:
Next Time
* ???
* Readings
> Joyce & Needham, "The Thesaurus Approach to Information Retrieval" (in Readings book)
> Luhn, "The Automatic Derivation of Information Retrieval Encodements from Machine-Readable Texts" (in Readings)
> Doyle, "Indexing and Abstracting by Association, Pt I" (in Readings)
Page 43:
Lecture Overview
* References
Page 44:
Resources for today's lecture
* Introduction to Information Retrieval, chapter 1
* Shakespeare:
> http://www.rhymezone.com/shakespeare/
> Try the neat browse by keyword sequence feature!
* Managing Gigabytes, chapter [illegible]
* Modern Information Retrieval, chapter 8.8
