صفحه 1:
TePorwaton Retrisvd :09 مان
صفحه 2:
مه مج100 :06 بان ...ل
Reeve Ruchiory Ostey Nero
Retevoace Dstey Wypertchs
Gps, Wowras, عبط( لو
Aerdextay oP Oorancects:
CPPevivecss 11ل
Orb Gearck Cages
4ePorwotica Retteval ond Grructured Data
Orevtories
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO wo ©Sbervehnts, Cork لح 0
صفحه 3:
+ وست و1 Beireud Oysiews
۲ 1 reed (1(R) syste use و و doit او thos catch
یب
8 4 Porwoton orerized os جلدمصسحصل خاه مصاعصاده د
© Oornveds ore ری ow svhewa
۲ $ohornwuiva retievd locutes relevadt domes, vo the basis oP user iazut suck
se keywords or مج موم
© ey, Prod dom noes oootoictay the words “cotubuse syotews”
۲ Coa be weed eved oo texted desoripioes provided wil: overtextrd dota suds a
ميم
© Web مت پوت ۲ و وی او exaople of IR sysiews
1۱ Ooweytr- O* Oters, Gey 8, OODO wo ©Sbervehnts, Cork لح 0
صفحه 4:
+ ePorcntiva Retrievd Gystews (Ond.)
© OPPerewes Prow dutcbuse وه
۶ 1 syste s dont ded uth inxertvad ures (ieichay pour
اجه سیر recovery)
©) Oxtabuse systews ded wih structured data, with: schewar thot dePice the
مه ول
© systews ded wih sowe queryicy issues wl yeurrdly uddressed by
ال systews
۱ @pproxtrate searckiey by keywords
١ Raokioy oP retteved ceswers by esttvated degree oP وا
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO we ©Sbervehnts, Cork لح 0
صفحه 5:
ای لس +
Bo Bibtex rettevd, lhe words ta wack doomed one omeidered iy be keywords.
© De wee te word tere ty rePer ty he Words to dorrcect
وا وی لو نموت ری نو رل وه روف ۱
oxi ao! لمي هت لیا بط لو
۱ exphelly speoPed
BE Roche of dommes oo he: busty oP eetevated rebvowre tp a query & ortical
© Bekveure rockty ty based va Partory suck a
> Dera Preeay
اه موی ان بو query keyword ta doeanvent
موم ول سم
ها وسححت لمحسييجها بدسي جما عدتتسجصل نوم صملا"
> Power D que لوا و سرت
۱ مرا ty doris
(Dore bobs ty a sharers ماو مت تحص و
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO wo ©Sbervehnts, Cork لح 0
صفحه 6:
+ 3 رسپ را Tera
۲ ۱/۱۵۵) regen! hwerse Ooanved Prequeay) rachis
© bet od) = cxncber of terns ia he سح of
۱ ۷ ( < مج و طلسم of tera fin the ب مسج
۶ واه ول مج هط
TF (d, t) - هد + cy
+ Whe tox Pastor ts to مه excessive weight to Prequect ters
© Rekvowe of donnved to query
_ ۶ ۳0۵
r(d, Q) ‘co nd)
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO wo ©Sbervehnts, Cork لح 0
صفحه 7:
+ Rebvawe Radtoy Doty Perws (Ova)
۲ ost systews odd to the ubove wodel
© ord thet corr tote, cuhor tet, sevion headkrep, et ore ved yrecier
موز
baer ی و ۱۳۹
موز
et ore ekoirated ۱
Oxted stop words ”
© Croxoty: P hepwords i query ooour close iryeher is he doce, the
و لا ها سم thad iP they vorur Por ozort
Oornveds we returced io depreustry order oP سوه عمط
۶ بو رل تراسج Pew dormedis ore returced, ut of
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO or ©Sbervehnts, Cork لح 0
صفحه 8:
+ Orokrty Owed Retievd
Broke) bored reitevd - retieve doanoeds stoke bo محم مجعو
© Gkokriy wy be defied oa the bass of power words
١ و و لا بیج ® uak kihest DP (v1) / 0 (1) ood we hee
سود ny Herd dea eee ob rah de ere
۱ Rebvene Perdbock! Gkokady com be weed و rece cewer set to hepword
ney
© Oser setlevis Pew relevant domunveds Prow hose retrieved by keyword
spery, ond systew Prods viher سا ما اه مج
© Ocvior spare wodel: dePoe on dkrewivad space, where ois the ouber of
words ta the domed se.
سین و ۲ domed dyes Brow pig too potat whose (© corde i
TE (dt) 1 (1)
© یه با of the ood between the veviors oP tive docuceds is werd oo
were of their stort.
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO
صفحه 9:
همهم لا" ووتون) عممورواو)
© Duter of donnveds retevodi too query رای ۶ ع وه 9 موه
Prequecdes ure tohed taty acu
B Ostey ere Prequeuies wohes و Pus
* G.y. oc trvel ogeury coo odd waey porn eues oP the words “travel” to جلا
page te woke tts rook very high
Dost oF the tke peuple ore lovhiey Por pages Prow popukar اه
BE deo! we poriony of Deb ote (e-<. kw cacy peor viet) to rank ote paces thot
جوا وف اس
اه خام وولو اه او دا ۲
Goktion: cext slide ©
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO wo ©Sbervehnts, Cork لح 0
صفحه 10:
+ مسد ساد Delay Uygeride (low)
BO Gohiica: use ovber of hypertoks too site ی و وه oP the populady or prestige oP
the site
© ماع زان نیون Prow euch ste (Why? - see previne skke)
© Poptany wewue ts Por ste, oot Por kicked poe
> ماسجا مه رف ron oP te
١ Obey, وا لاله و چاه ببس dePioe stave 0 (ORD prefix the oe.yoe.eds
وس رصم اه وم لصا نوم وه
© RePreweus
© Oke cowputey presice bused vo bake to sie, ve wore Wweitht tp hobs Prow sites
thot thewselves hove higher prestige
> DePratica لح صا
* Get up oad solve systew of اج تاسلج
© Oboe teu busts of the Borde Page(Raok ranbieg werheraira
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO 16.00 ©Sbervehnts, Cork لح 0
صفحه 11:
+ مسد ساد Delay Uygeride (low)
© Ooweentivs ty serie زاس theories thot racked presive of people
© Coy, the presided of the 0.6.0 hos o high prestive stuce woop people how
مس
۱ by wulipe previyous peuple hus high previce
Bab odd cuherty based rember
© ۵ hab rope thd stores bkhe iy wey panes (oa « trie)
Ou odor © wpe ha pou wad wParwutos pou ee
(Bock page yet 3 hub preske based ox presice of ctkorties thot pois صا
Cork pax wee oo cuboriy presice bused ou previxe of hubs thet porto
Oypic, presiqe dePratices ore opole, od coo be برط اي
موی ص ماو
هو و وا نو راو ان Ose cuhoniy presiee ©
1۱ Ooweytr- O* Oters, Gey 8, OODO لح 0 ا سواه 1 موه 0
صفحه 12:
Opwwws and Wowowws
© Cy. doanvedt: “evra "صمت صاعم امب" اتمصي اموت
eed i rede tra “cookie” ond “repo” oe Sys
© Oppiew ear ented query we “weerovel aad (revar or anndemnee)”
۱ سا
© Cy, “obec! hos dPPered weuckn وراج جه
© Con dob ude eR (7 swe exec) Brow te ooo
قاس بسا مهن موم بت هی ی بطم
© Dery enbyokanl tated wrung kt order موه سل
» Or very syeanos wil wer
۰ اه وه له چا بوه مسق
1۱ Ooweytr- O* Oters, Gey 8, OODO 16.00 ©Sbervehnts, Cork لح 0
صفحه 13:
Ovwsp-®wed Gueryieq
اه ۱
ممم مويذا جد دع مه ١ المعصصصه داز جماص ۱
توس ءسامامت سس سس و( ©
ای وا اوه رسمه موه لاوما(
Cox. he 160 retaiccship trot we saw io the GR wodet *
(iis approock ooo be used to standardize و و یلها
Ontologies carn bobs c7uliple مها
۲ مب of the Grant Deb (oot covered kere)
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO
صفحه 14:
جلمعمعدهوو(]) خام و مرول 7
۱ ۵ و ما ۵ سح و وه ۰ ) تسوا ات سوت سل سس ther
لاسسوا
۶ ای بالط مسج(
۱
ام اما روم تست و تم مت مها انوا
۶ امه ط عون oP و ساوسو 1)
ب مت یکاخ اه مت ما عمط تم لیا
۶ مسا 60 60 ۵
BB opr operation: doranvents thet ovetcts of rust oe oP Ky, Kay ony
ی و ۵ 6۵ 0 A Oy.
B Cok G6 hep را یلم ماه ناه تا له ersten
© of cee dby be ePRivteuly twplewedted by wercery oP sorted bets
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO 16.00 ©Sbervehnts, Cork لح 0
صفحه 15:
Orwwity Rerievd GPPevivewss
۱ سوه ره موه ارو by vein inex sinker trot support
re مر وه Day result!
© Pade centr (Pr drop) - sone rebrnt do neni بوب ww be
retrieved.
© Babe postive - sowe ineckevant domnveds way be retrieved.
۱ should ot perc way Pulse drope, but
way persia Pew Pose postives.
© Reevad perPorwoue wets:
© prevbive - wht peroeutaye oP the retrieved domeveuis ore releveret tr the
ery,
© reodl - whol perveckne of the doves relevad io the query were
retrieved.
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO 16.00 ©Sbervehnts, Cork لح 0
صفحه 16:
+ ewig Rorevd OPPevtvewes (Ova)
۲ Revd us. preveiva treo:
( موه و0 rend by retrevies macy dom uvects (dew ty «tow level of
روت لا لو مرا inrelevodt dornrenis wml be بسكي لس
paver
Bl Deceures of retrevd ef Pecos!
© Recolor a Raxetoa of annober of dooanceuts Petched, or
© اه ماه مس
ای( جح سم اه مخ هچ افو
© x. “provera of PO% a revel oF OO%, and OO% ot a rev oF PO%"
Bl Probie! whick donors oe windy reevend, od whick oe oot
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO 16.00 ©Sbervehnts, Cork لح 0
صفحه 17:
+ Ovb Orack Ouges
© Deb wrawers oe proqras thot locate od gaher Porwatic oo the Deb
© Revursivel, Polow hyperboks presen to hoowe domventy, to Prod other
عمج
۱ عمج و بو لو و مس موی
© Cetcked سمل
۱۳ to oc todextay systeo
* Coc be discarded oPter todextog, or store us a cocked copy
© Onawtay he eotre Deb would the مسجت تا بو و oF tee
© Gearck eats ypicdly cover ody a pod oP the Deb, ot ll oP it
© Deke warts to perPorsn a sted oad
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO لح 0 ا سواه 1 جووه 0
صفحه 18:
+ Web Oravtoy (Ova)
۲ مس طاقن نوا بو صا مایمن va رد ری لت ft parcel
© Get oP takes te be oraded stored foo dotcbose
© ew bok Poured to prowted pages udded ty this eet, to be aroued hte
198 odextay process dev rues vo ای اولح
© Credes 0 vew copy oP fade testead oP wrdP toy old ford
© OW fades ip used ip one wer queries
۶ Ren a orn ی او مسا له یووم جا
۲ نوی و ای ات لش( queries
© لیا way be kept ری و
© Queres لا اما ای سل 0 لمع رون
1۱ Ooweytr- O* Oters, Gey 8, OODO 16.00 ©Sbervehnts, Cork لح 0
صفحه 19:
۲۱ cod Orctired Dots
§ toPorwatca retrievdl spstews orighndly treuied domuredis os وصاصامت د of
words:
§doPorwatics extrusion systews ther structure Prow domes, Fh!
© Cxtrentios oF house utitbutes (size, address, oucvber of bedrovws, et.)
رميق a text advertiser
© Cxtrarton oP topic ord people wowed Proc وه سنج و
© Rebticws or XOL structures used ty store extracted dott
© اس اه مس exevay dots io oosiwer queries
و مس سین ۱
1۱ Ooweytr- O* Oters, Gey 8, OODO 16.00 ©Sbervehnts, Cork لح 0
صفحه 20:
Oreviores
۲ نوت netted dommes together سم و روما و و
© ers con sre ont poly requested domed but dbo rehied vues.
B Orowsteny is Puniitated by chissficuiog systew tho oryeaizes loyicahy vetted
donned together.
© Orcprizaiva is hierachicd: chssPivaion herarchy
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO لح 0 ا سواه 1 موه 0
صفحه 21:
+ © Owe Pouca Weravly Por © borey Gyoew
graph algorithms
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO 16.00 ©Sbervehnts, Cork لح 0
صفحه 22:
008 م0 +
ار مق وله مه و راما و وا ام Qoanveds coo reside in wutiple ©
جوا نت سا ملجوا وان عصد ,مهد
© OkssPiratvg herachy ts عاسوط) لمس) صطا Gropk (DOG)
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO لح 0 ا سواه 1 موه 0
صفحه 23:
6 ملاسان )2)0 ۳۶ 6 Library
ePorcwaiva Retrievd Opsiew
science engineering
computer science
graph algorithms
Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO لح 0 ا سواه 1 موه 0
صفحه 24:
+ Deb Orevores
19 OO eb drove & ist u ches ficdivd dreviory va Deb pes
© بر Yahoo! Orreviory, Opes Direviory proievt
۴ Asses:
* Oka should the directory hierarchy be?
* ved a docuvedt, whic ordes of the موجه مه رس relevorat
حول با و
© OR eu doce wand
١ ChesFicuica of doncrrats tz a hierarchy way be due based vo tera
اد
1۱ Ooweytr- O* Oters, Gey 8, OODO w0e ©Sbervehnts, Cork لح 0
صفحه 25:
Gad oP Okaper
Chapter 19: Information Retrieval
Database System Concepts
©Silberschatz, Korth and Sudarshan
See www.db-book.com for conditions on re-use
1
Chapter 19: Information Retrieval
Relevance Ranking Using Terms
Relevance Using Hyperlinks
Synonyms., Homonyms, and Ontologies
Indexing of Documents
Measuring Retrieval Effectiveness
Web Search Engines
Information Retrieval and Structured Data
Directories
Database System Concepts - 5th Edition, Sep 2, 2005
19.2
©Silberschatz, Korth and Sudarshan
Information Retrieval Systems
Information retrieval (IR) systems use a simpler data model than database
systems
Information organized as a collection of documents
Documents are unstructured, no schema
Information retrieval locates relevant documents, on the basis of user input such
as keywords or example documents
e.g., find documents containing the words “database systems”
Can be used even on textual descriptions provided with non-textual data such as
images
Web search engines are the most familiar example of IR systems
Database System Concepts - 5th Edition, Sep 2, 2005
19.3
©Silberschatz, Korth and Sudarshan
Information Retrieval Systems (Cont.)
Differences from database systems
IR systems don’t deal with transactional updates (including concurrency
control and recovery)
Database systems deal with structured data, with schemas that define the
data organization
IR systems deal with some querying issues not generally addressed by
database systems
Approximate searching by keywords
Ranking of retrieved answers by estimated degree of relevance
Database System Concepts - 5th Edition, Sep 2, 2005
19.4
©Silberschatz, Korth and Sudarshan
Keyword Search
In full text retrieval, all the words in each document are considered to be keywords.
We use the word term to refer to the words in a document
Information-retrieval systems typically allow query expressions formed using keywords
and the logical connectives and, or, and not
Ands are implicit, even if not explicitly specified
Ranking of documents on the basis of estimated relevance to a query is critical
Relevance ranking is based on factors such as
Term frequency
– Frequency of occurrence of query keyword in document
Inverse document frequency
– How many documents the query keyword occurs in
» Fewer give more importance to keyword
Hyperlinks to documents
– More links to a document document is more important
Database System Concepts - 5th Edition, Sep 2, 2005
19.5
©Silberschatz, Korth and Sudarshan
Relevance Ranking Using Terms
TF-IDF (Term frequency/Inverse Document frequency) ranking:
Let n(d) = number of terms in the document d
n(d, t) = number of occurrences of term t in the document d.
Relevance of a document d to a term t
n(d, t)
TF (d, t) = log 1 +
n(d)
The log factor is to avoid excessive weight to frequent terms
Relevance of document to query Q
r (d, Q) = TF (d, t)
tQ n(t)
Database System Concepts - 5th Edition, Sep 2, 2005
19.6
©Silberschatz, Korth and Sudarshan
Relevance Ranking Using Terms (Cont.)
Most systems add to the above model
Words that occur in title, author list, section headings, etc. are given greater
importance
Words whose first occurrence is late in the document are given lower
importance
Very common words such as “a”, “an”, “the”, “it” etc are eliminated
Called stop words
Proximity: if keywords in query occur close together in the document, the
document has higher importance than if they occur far apart
Documents are returned in decreasing order of relevance score
Usually only top few documents are returned, not all
Database System Concepts - 5th Edition, Sep 2, 2005
19.7
©Silberschatz, Korth and Sudarshan
Similarity Based Retrieval
Similarity based retrieval - retrieve documents similar to a given document
Similarity may be defined on the basis of common words
Relevance feedback: Similarity can be used to refine answer set to keyword
query
E.g. find k terms in A with highest TF (d, t ) / n (t ) and use these
terms to find relevance of other documents.
User selects a few relevant documents from those retrieved by keyword
query, and system finds other documents similar to these
Vector space model: define an n-dimensional space, where n is the number of
words in the document set.
Vector for document d goes from origin to a point whose i th coordinate is
TF (d,t ) / n (t )
The cosine of the angle between the vectors of two documents is used as a
measure of their similarity.
Database System Concepts - 5th Edition, Sep 2, 2005
19.8
©Silberschatz, Korth and Sudarshan
Relevance Using Hyperlinks
Number of documents relevant to a query can be enormous if only term
frequencies are taken into account
Using term frequencies makes “spamming” easy
E.g. a travel agency can add many occurrences of the words “travel” to its
page to make its rank very high
Most of the time people are looking for pages from popular sites
Idea: use popularity of Web site (e.g. how many people visit it) to rank site pages that
match given keywords
Problem: hard to find actual popularity of site
Solution: next slide
Database System Concepts - 5th Edition, Sep 2, 2005
19.9
©Silberschatz, Korth and Sudarshan
Relevance Using Hyperlinks (Cont.)
Solution: use number of hyperlinks to a site as a measure of the popularity or prestige of
the site
Count only one hyperlink from each site (why? - see previous slide)
Popularity measure is for site, not for individual page
But, most hyperlinks are to root of site
Also, concept of “site” difficult to define since a URL prefix like cs.yale.edu
contains many unrelated pages of varying popularity
Refinements
When computing prestige based on links to a site, give more weight to links from sites
that themselves have higher prestige
Definition is circular
Set up and solve system of simultaneous linear equations
Above idea is basis of the Google PageRank ranking mechanism
Database System Concepts - 5th Edition, Sep 2, 2005
19.10
©Silberschatz, Korth and Sudarshan
Relevance Using Hyperlinks (Cont.)
Connections to social networking theories that ranked prestige of people
E.g. the president of the U.S.A has a high prestige since many people know
him
Someone known by multiple prestigious people has high prestige
Hub and authority based ranking
A hub is a page that stores links to many pages (on a topic)
An authority is a page that contains actual information on a topic
Each page gets a hub prestige based on prestige of authorities that it points to
Each page gets an authority prestige based on prestige of hubs that point to it
Again, prestige definitions are cyclic, and can be got by
solving linear equations
Use authority prestige when ranking answers to a query
Database System Concepts - 5th Edition, Sep 2, 2005
19.11
©Silberschatz, Korth and Sudarshan
Synonyms and Homonyms
Synonyms
E.g. document: “motorcycle repair”, query: “motorcycle maintenance”
need to realize that “maintenance” and “repair” are synonyms
System can extend query as “motorcycle and (repair or maintenance)”
Homonyms
E.g. “object” has different meanings as noun/verb
Can disambiguate meanings (to some extent) from the context
Extending queries automatically using synonyms can be problematic
Need to understand intended meaning in order to infer synonyms
Or verify synonyms with user
Synonyms may have other meanings as well
Database System Concepts - 5th Edition, Sep 2, 2005
19.12
©Silberschatz, Korth and Sudarshan
Concept-Based Querying
Approach
For each word, determine the concept it represents from context
Use one or more ontologies:
Hierarchical structure showing relationship between concepts
E.g.: the ISA relationship that we saw in the E-R model
This approach can be used to standardize terminology in a specific field
Ontologies can link multiple languages
Foundation of the Semantic Web (not covered here)
Database System Concepts - 5th Edition, Sep 2, 2005
19.13
©Silberschatz, Korth and Sudarshan
Indexing of Documents
An inverted index maps each keyword Ki to a set of documents Si that contain the
keyword
Inverted index may record
Keyword locations within document to allow proximity based ranking
Counts of number of occurrences of keyword to compute TF
and operation: Finds documents that contain all of K1, K2, ..., Kn.
Intersection S1 S2 ..... Sn
or operation: documents that contain at least one of K1, K2, …, Kn
Documents identified by identifiers
union, S1 S2 ..... Sn,.
Each Si is kept sorted to allow efficient intersection/union by merging
“not” can also be efficiently implemented by merging of sorted lists
Database System Concepts - 5th Edition, Sep 2, 2005
19.14
©Silberschatz, Korth and Sudarshan
Measuring Retrieval Effectiveness
Information-retrieval systems save space by using index structures that support
only approximate retrieval. May result in:
false negative (false drop) - some relevant documents may not be
retrieved.
false positive - some irrelevant documents may be retrieved.
For many applications a good index should not permit any false drops, but
may permit a few false positives.
Relevant performance metrics:
precision - what percentage of the retrieved documents are relevant to the
query.
recall - what percentage of the documents relevant to the query were
retrieved.
Database System Concepts - 5th Edition, Sep 2, 2005
19.15
©Silberschatz, Korth and Sudarshan
Measuring Retrieval Effectiveness (Cont.)
Recall vs. precision tradeoff:
Measures of retrieval effectiveness:
Recall as a function of number of documents fetched, or
Precision as a function of recall
Can increase recall by retrieving many documents (down to a low level of
relevance ranking), but many irrelevant documents would be fetched, reducing
precision
Equivalently, as a function of number of documents fetched
E.g. “precision of 75% at recall of 50%, and 60% at a recall of 75%”
Problem: which documents are actually relevant, and which are not
Database System Concepts - 5th Edition, Sep 2, 2005
19.16
©Silberschatz, Korth and Sudarshan
Web Search Engines
Web crawlers are programs that locate and gather information on the Web
Recursively follow hyperlinks present in known documents, to find other
documents
Starting from a seed set of documents
Fetched documents
Handed over to an indexing system
Can be discarded after indexing, or store as a cached copy
Crawling the entire Web would take a very large amount of time
Search engines typically cover only a part of the Web, not all of it
Take months to perform a single crawl
Database System Concepts - 5th Edition, Sep 2, 2005
19.17
©Silberschatz, Korth and Sudarshan
Web Crawling (Cont.)
Crawling is done by multiple processes on multiple machines, running in parallel
Set of links to be crawled stored in a database
New links found in crawled pages added to this set, to be crawled later
Indexing process also runs on multiple machines
Creates a new copy of index instead of modifying old index
Old index is used to answer queries
After a crawl is “completed” new index becomes “old” index
Multiple machines used to answer queries
Indices may be kept in memory
Queries may be routed to different machines for load balancing
Database System Concepts - 5th Edition, Sep 2, 2005
19.18
©Silberschatz, Korth and Sudarshan
Information Retrieval and Structured Data
Information retrieval systems originally treated documents as a collection of
words
Information extraction systems infer structure from documents, e.g.:
Extraction of house attributes (size, address, number of bedrooms, etc.)
from a text advertisement
Extraction of topic and people named from a new article
Relations or XML structures used to store extracted data
System seeks connections among data to answer queries
Question answering systems
Database System Concepts - 5th Edition, Sep 2, 2005
19.19
©Silberschatz, Korth and Sudarshan
Directories
Storing related documents together in a library facilitates browsing
users can see not only requested document but also related ones.
Browsing is facilitated by classification system that organizes logically related
documents together.
Organization is hierarchical: classification hierarchy
Database System Concepts - 5th Edition, Sep 2, 2005
19.20
©Silberschatz, Korth and Sudarshan
A Classification Hierarchy For A Library System
Database System Concepts - 5th Edition, Sep 2, 2005
19.21
©Silberschatz, Korth and Sudarshan
Classification DAG
Documents can reside in multiple places in a hierarchy in an information retrieval
system, since physical location is not important.
Classification hierarchy is thus Directed Acyclic Graph (DAG)
Database System Concepts - 5th Edition, Sep 2, 2005
19.22
©Silberschatz, Korth and Sudarshan
A Classification DAG For A Library
Information Retrieval System
Database System Concepts - 5th Edition, Sep 2, 2005
19.23
©Silberschatz, Korth and Sudarshan
Web Directories
A Web directory is just a classification directory on Web pages
E.g. Yahoo! Directory, Open Directory project
Issues:
What should the directory hierarchy be?
Given a document, which nodes of the directory are categories relevant
to the document
Often done manually
Classification of documents into a hierarchy may be done based on term
similarity
Database System Concepts - 5th Edition, Sep 2, 2005
19.24
©Silberschatz, Korth and Sudarshan
End of Chapter
Database System Concepts
©Silberschatz, Korth and Sudarshan
See www.db-book.com for conditions on re-use
25