صفحه 1:
‎TePorwaton Retrisvd‏ :09 مان

صفحه 2:
مه مج100 :06 بان ...ل Reeve Ruchiory Ostey Nero Retevoace Dstey Wypertchs Gps, Wowras, ‏عبط( لو‎ Aerdextay oP Oorancects: ‎CPPevivecss‏ 11ل ‎Orb Gearck Cages ‎4ePorwotica Retteval ond Grructured Data ‎Orevtories ‎Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO wo ©Sbervehnts, Cork ‏لح‎ 0

صفحه 3:
+ ‏وست و1‎ Beireud Oysiews ۲ 1 reed (1(R) syste use ‏و و‎ doit ‏او‎ thos catch ‏یب‎ 8 4 Porwoton orerized os ‏جلدمصسحصل خاه مصاعصاده د‎ © Oornveds ore ‏ری‎ ow svhewa ۲ $ohornwuiva retievd locutes relevadt domes, vo the basis oP user iazut suck se keywords or ‏مج موم‎ © ey, Prod dom noes oootoictay the words “cotubuse syotews” ۲ Coa be weed eved oo texted desoripioes provided wil: overtextrd dota suds a ‏ميم‎ © Web ‏مت پوت ۲ و وی او‎ exaople of IR sysiews 1۱ Ooweytr- O* Oters, Gey 8, OODO wo ©Sbervehnts, Cork ‏لح‎ 0

صفحه 4:
+ ePorcntiva Retrievd Gystews (Ond.) © OPPerewes Prow dutcbuse ‏وه‎ ۶ 1 syste s dont ded uth inxertvad ures (ieichay pour ‏اجه سیر‎ recovery) ©) Oxtabuse systews ded wih structured data, with: schewar thot dePice the ‏مه ول‎ © systews ded wih sowe queryicy issues wl yeurrdly uddressed by ‏ال‎ systews ۱ @pproxtrate searckiey by keywords ١ Raokioy oP retteved ceswers by esttvated degree oP ‏وا‎ Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO we ©Sbervehnts, Cork ‏لح‎ 0

صفحه 5:
ای لس + Bo Bibtex rettevd, lhe words ta wack doomed one omeidered iy be keywords. © De wee te word tere ty rePer ty he Words to dorrcect وا وی لو نموت ری نو رل وه روف ۱ ‎oxi ao!‏ لمي هت لیا بط لو ۱ exphelly speoPed BE Roche of dommes oo he: busty oP eetevated rebvowre tp a query & ortical © Bekveure rockty ty based va Partory suck a > Dera Preeay ‏اه موی ان بو‎ query keyword ta doeanvent ‏موم ول سم‎ ‏ها وسححت لمحسييجها بدسي جما عدتتسجصل نوم صملا"‎ > Power D que ‏لوا و سرت‎ ۱ ‏مرا‎ ty doris (Dore bobs ty a sharers ‏ماو مت تحص و‎ Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO wo ©Sbervehnts, Cork ‏لح‎ 0

صفحه 6:
+ 3 ‏رسپ را‎ Tera ۲ ۱/۱۵۵) regen! hwerse Ooanved Prequeay) rachis © bet od) = cxncber of terns ia he ‏سح‎ of ۱ ۷ ( < ‏مج و طلسم‎ of tera fin the ‏ب مسج‎ ۶ ‏واه ول مج هط‎ TF (d, t) - ‏هد‎ + cy + Whe tox Pastor ts to ‏مه‎ excessive weight to Prequect ters © Rekvowe of donnved to query _ ۶ ۳0۵ r(d, Q) ‘co nd) Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO wo ©Sbervehnts, Cork ‏لح‎ 0

صفحه 7:
+ Rebvawe Radtoy Doty Perws (Ova) ۲ ost systews odd to the ubove wodel © ord thet corr tote, cuhor tet, sevion headkrep, et ore ved yrecier موز ‎baer‏ ی و ۱۳۹ موز ‎et ore ekoirated‏ ۱ ‎Oxted stop words‏ ” © Croxoty: P hepwords i query ooour close iryeher is he doce, the ‏و لا ها سم‎ thad iP they vorur Por ozort Oornveds we returced io depreustry order oP ‏سوه عمط‎ ۶ ‏بو رل تراسج‎ Pew dormedis ore returced, ut of Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO or ©Sbervehnts, Cork ‏لح‎ 0

صفحه 8:
+ Orokrty Owed Retievd Broke) bored reitevd - retieve doanoeds stoke bo ‏محم مجعو‎ © Gkokriy wy be defied oa the bass of power words ١ ‏و و لا بیج‎ ® uak kihest DP (v1) / 0 (1) ood we hee ‏سود‎ ny Herd dea eee ob rah de ere ۱ Rebvene Perdbock! Gkokady com be weed ‏و‎ rece cewer set to hepword ney © Oser setlevis Pew relevant domunveds Prow hose retrieved by keyword spery, ond systew Prods viher ‏سا ما اه مج‎ © Ocvior spare wodel: dePoe on dkrewivad space, where ois the ouber of words ta the domed se. ‏سین و‎ ۲ domed dyes Brow pig too potat whose (© corde i TE (dt) 1 (1) © ‏یه با‎ of the ood between the veviors oP tive docuceds is werd oo were of their stort. Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO

صفحه 9:
همهم لا" ووتون) عمم‌ورواو) © Duter of donnveds retevodi too query ‏رای ۶ ع وه 9 موه‎ Prequecdes ure tohed taty acu B Ostey ere Prequeuies wohes ‏و‎ Pus * G.y. oc trvel ogeury coo odd waey porn eues oP the words “travel” to ‏جلا‎ ‎page te woke tts rook very high Dost oF the tke peuple ore lovhiey Por pages Prow popukar ‏اه‎ BE deo! we poriony of Deb ote (e-<. kw cacy peor viet) to rank ote paces thot ‏جوا وف اس‎ اه خام وولو اه او دا ۲ ‎Goktion: cext slide‏ © Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO wo ©Sbervehnts, Cork ‏لح‎ 0

صفحه 10:
+ ‏مسد ساد‎ Delay Uygeride (low) BO Gohiica: use ovber of hypertoks too site ‏ی و وه‎ oP the populady or prestige oP the site © ‏ماع زان نیون‎ Prow euch ste (Why? - see previne skke) © Poptany wewue ts Por ste, oot Por kicked poe > ‏ماسجا مه رف‎ ron oP te ١ Obey, ‏وا لاله و چاه ببس‎ dePioe stave 0 (ORD prefix the oe.yoe.eds ‏وس رصم اه وم لصا نوم وه‎ © RePreweus © Oke cowputey presice bused vo bake to sie, ve wore Wweitht tp hobs Prow sites thot thewselves hove higher prestige > DePratica ‏لح صا‎ * Get up oad solve systew of ‏اج تاسلج‎ © Oboe teu busts of the Borde Page(Raok ranbieg werheraira Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO 16.00 ©Sbervehnts, Cork ‏لح‎ 0

صفحه 11:
+ ‏مسد ساد‎ Delay Uygeride (low) © Ooweentivs ty serie ‏زاس‎ theories thot racked presive of people © Coy, the presided of the 0.6.0 hos o high prestive stuce woop people how ‏مس‎ ۱ by wulipe previyous peuple hus high previce Bab odd cuherty based rember © ۵ hab rope thd stores bkhe iy wey panes (oa « trie) Ou odor © wpe ha pou wad wParwutos pou ee (Bock page yet 3 hub preske based ox presice of ctkorties thot pois ‏صا‎ ‎Cork pax wee oo cuboriy presice bused ou previxe of hubs thet porto Oypic, presiqe dePratices ore opole, od coo be ‏برط اي‎ ‏موی ص ماو‎ هو و وا نو راو ان ‎Ose cuhoniy presiee‏ © 1۱ Ooweytr- O* Oters, Gey 8, OODO ‏لح 0 ا سواه 1 موه‎ 0

صفحه 12:
Opwwws and Wowowws © Cy. doanvedt: “evra ‏"صمت صاعم امب" اتمصي اموت‎ eed i rede tra “cookie” ond “repo” oe Sys © Oppiew ear ented query we “weerovel aad (revar or anndemnee)” ۱ ‏سا‎ ‎© Cy, “obec! hos dPPered weuckn ‏وراج جه‎ © Con dob ude eR (7 swe exec) Brow te ooo ‏قاس بسا مهن موم بت هی ی بطم‎ © Dery enbyokanl tated wrung kt order ‏موه سل‎ » Or very syeanos wil wer ۰ ‏اه وه له چا بوه مسق‎ 1۱ Ooweytr- O* Oters, Gey 8, OODO 16.00 ©Sbervehnts, Cork ‏لح‎ 0

صفحه 13:
Ovwsp-®wed Gueryieq اه ۱ ممم مويذا جد دع مه ‎١‏ المعصصصه داز جماص ۱ توس ءسامامت سس سس و( © ای وا اوه رسمه موه لاوما( ‎Cox. he 160 retaiccship trot we saw io the GR wodet‏ * (iis approock ooo be used to standardize ‏و و یلها‎ Ontologies carn bobs c7uliple ‏مها‎ ۲ ‏مب‎ of the Grant Deb (oot covered kere) Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO

صفحه 14:
جلمعمعدهوو(]) خام و مرول 7 ۱ ۵ ‏و ما ۵ سح و وه ۰ ) تسوا ات سوت سل سس‎ ther ‏لاسسوا‎ ‎۶ ‏ای بالط مسج(‎ ۱ ‏ام اما روم تست و تم مت مها انوا‎ ۶ ‏امه ط عون‎ oP ‏و ساوسو‎ 1) ‏ب مت یکاخ اه مت ما عمط تم لیا‎ ۶ ‏مسا‎ 60 60 ۵ BB opr operation: doranvents thet ovetcts of rust oe oP Ky, Kay ony ‏ی و‎ ۵ 6۵ 0 A Oy. B Cok G6 hep ‏را یلم ماه ناه تا له‎ ersten © of cee dby be ePRivteuly twplewedted by wercery oP sorted bets Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO 16.00 ©Sbervehnts, Cork ‏لح‎ 0

صفحه 15:
Orwwity Rerievd GPPevivewss ۱ ‏سوه ره موه ارو‎ by vein inex sinker trot support re ‏مر وه‎ Day result! © Pade centr (Pr drop) - sone rebrnt do neni ‏بوب‎ ww be retrieved. © Babe postive - sowe ineckevant domnveds way be retrieved. ۱ should ot perc way Pulse drope, but way persia Pew Pose postives. © Reevad perPorwoue wets: © prevbive - wht peroeutaye oP the retrieved domeveuis ore releveret tr the ery, © reodl - whol perveckne of the doves relevad io the query were retrieved. Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO 16.00 ©Sbervehnts, Cork ‏لح‎ 0

صفحه 16:
+ ewig Rorevd OPPevtvewes (Ova) ۲ Revd us. preveiva treo: ( ‏موه و0‎ rend by retrevies macy dom uvects (dew ty «tow level of ‏روت لا لو مرا‎ inrelevodt dornrenis wml be ‏بسكي لس‎ paver Bl Deceures of retrevd ef Pecos! © Recolor a Raxetoa of annober of dooanceuts Petched, or © ‏اه ماه مس‎ ‏ای( جح سم اه مخ هچ افو‎ © x. “provera of PO% a revel oF OO%, and OO% ot a rev oF PO%" Bl Probie! whick donors oe windy reevend, od whick oe oot Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO 16.00 ©Sbervehnts, Cork ‏لح‎ 0

صفحه 17:
+ Ovb Orack Ouges © Deb wrawers oe proqras thot locate od gaher Porwatic oo the Deb © Revursivel, Polow hyperboks presen to hoowe domventy, to Prod other ‏عمج‎ ۱ ‏عمج و بو لو و مس موی‎ © Cetcked ‏سمل‎ ‎۱۳ to oc todextay systeo * Coc be discarded oPter todextog, or store us a cocked copy © Onawtay he eotre Deb would the ‏مسجت تا بو و‎ oF tee © Gearck eats ypicdly cover ody a pod oP the Deb, ot ll oP it © Deke warts to perPorsn a sted oad Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO ‏لح 0 ا سواه 1 جووه‎ 0

صفحه 18:
+ Web Oravtoy (Ova) ۲ ‏مس طاقن نوا بو صا مایمن‎ va ‏رد ری لت‎ ft parcel © Get oP takes te be oraded stored foo dotcbose © ew bok Poured to prowted pages udded ty this eet, to be aroued hte 198 odextay process dev rues vo ‏ای اولح‎ © Credes 0 vew copy oP fade testead oP wrdP toy old ford © OW fades ip used ip one wer queries ۶ Ren a orn ‏ی او مسا له یووم جا‎ ۲ ‏نوی و ای ات لش(‎ queries © ‏لیا‎ way be kept ‏ری و‎ © Queres ‏لا اما ای سل 0 لمع رون‎ 1۱ Ooweytr- O* Oters, Gey 8, OODO 16.00 ©Sbervehnts, Cork ‏لح‎ 0

صفحه 19:
۲۱ cod Orctired Dots § toPorwatca retrievdl spstews orighndly treuied domuredis os ‏وصاصامت د‎ of words: §doPorwatics extrusion systews ther structure Prow domes, Fh! © Cxtrentios oF house utitbutes (size, address, oucvber of bedrovws, et.) ‏رميق‎ a text advertiser © Cxtrarton oP topic ord people wowed Proc ‏وه سنج و‎ © Rebticws or XOL structures used ty store extracted dott © ‏اس اه مس‎ exevay dots io oosiwer queries و مس سین ۱ 1۱ Ooweytr- O* Oters, Gey 8, OODO 16.00 ©Sbervehnts, Cork ‏لح‎ 0

صفحه 20:
Oreviores ۲ ‏نوت‎ netted dommes together ‏سم و روما و و‎ © ers con sre ont poly requested domed but dbo rehied vues. B Orowsteny is Puniitated by chissficuiog systew tho oryeaizes loyicahy vetted donned together. © Orcprizaiva is hierachicd: chssPivaion herarchy Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO ‏لح 0 ا سواه 1 موه‎ 0

صفحه 21:
+ © Owe Pouca Weravly Por © borey Gyoew graph algorithms Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO 16.00 ©Sbervehnts, Cork ‏لح‎ 0

صفحه 22:
008 م0 + ار مق وله مه و راما و وا ام ‎Qoanveds coo reside in wutiple‏ © جوا نت سا ملجوا وان عصد ,مهد © OkssPiratvg herachy ts ‏عاسوط) لمس) صطا‎ Gropk (DOG) Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO ‏لح 0 ا سواه 1 موه‎ 0

صفحه 23:
6 ‏ملاسان‎ )2)0 ۳۶ 6 Library ePorcwaiva Retrievd Opsiew science engineering computer science graph algorithms Oredrwe Oyeteer Ooweytr- O* Oters, Gey 8, OODO ‏لح 0 ا سواه 1 موه‎ 0

صفحه 24:
+ Deb Orevores 19 OO eb drove & ist u ches ficdivd dreviory va Deb pes © ‏بر‎ Yahoo! Orreviory, Opes Direviory proievt ۴ Asses: * Oka should the directory hierarchy be? * ved a docuvedt, whic ordes of the ‏موجه مه رس‎ relevorat ‏حول با و‎ © OR eu doce wand ١ ChesFicuica of doncrrats tz a hierarchy way be due based vo tera ‏اد‎ 1۱ Ooweytr- O* Oters, Gey 8, OODO w0e ©Sbervehnts, Cork ‏لح‎ 0

صفحه 25:
Gad oP Okaper

Chapter 19: Information Retrieval Database System Concepts ©Silberschatz, Korth and Sudarshan See www.db-book.com for conditions on re-use 1 Chapter 19: Information Retrieval  Relevance Ranking Using Terms  Relevance Using Hyperlinks  Synonyms., Homonyms, and Ontologies  Indexing of Documents  Measuring Retrieval Effectiveness  Web Search Engines  Information Retrieval and Structured Data  Directories Database System Concepts - 5th Edition, Sep 2, 2005 19.2 ©Silberschatz, Korth and Sudarshan Information Retrieval Systems   Information retrieval (IR) systems use a simpler data model than database systems  Information organized as a collection of documents  Documents are unstructured, no schema Information retrieval locates relevant documents, on the basis of user input such as keywords or example documents  e.g., find documents containing the words “database systems”  Can be used even on textual descriptions provided with non-textual data such as images  Web search engines are the most familiar example of IR systems Database System Concepts - 5th Edition, Sep 2, 2005 19.3 ©Silberschatz, Korth and Sudarshan Information Retrieval Systems (Cont.)  Differences from database systems  IR systems don’t deal with transactional updates (including concurrency control and recovery)  Database systems deal with structured data, with schemas that define the data organization  IR systems deal with some querying issues not generally addressed by database systems  Approximate searching by keywords  Ranking of retrieved answers by estimated degree of relevance Database System Concepts - 5th Edition, Sep 2, 2005 19.4 ©Silberschatz, Korth and Sudarshan Keyword Search  In full text retrieval, all the words in each document are considered to be keywords.   We use the word term to refer to the words in a document Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not Ands are implicit, even if not explicitly specified Ranking of documents on the basis of estimated relevance to a query is critical  Relevance ranking is based on factors such as  Term frequency – Frequency of occurrence of query keyword in document  Inverse document frequency – How many documents the query keyword occurs in » Fewer  give more importance to keyword  Hyperlinks to documents – More links to a document  document is more important   Database System Concepts - 5th Edition, Sep 2, 2005 19.5 ©Silberschatz, Korth and Sudarshan Relevance Ranking Using Terms  TF-IDF (Term frequency/Inverse Document frequency) ranking:  Let n(d) = number of terms in the document d  n(d, t) = number of occurrences of term t in the document d.  Relevance of a document d to a term t n(d, t) TF (d, t) = log 1 + n(d)   The log factor is to avoid excessive weight to frequent terms Relevance of document to query Q r (d, Q) =  TF (d, t) tQ n(t) Database System Concepts - 5th Edition, Sep 2, 2005 19.6 ©Silberschatz, Korth and Sudarshan Relevance Ranking Using Terms (Cont.)  Most systems add to the above model  Words that occur in title, author list, section headings, etc. are given greater importance  Words whose first occurrence is late in the document are given lower importance  Very common words such as “a”, “an”, “the”, “it” etc are eliminated    Called stop words Proximity: if keywords in query occur close together in the document, the document has higher importance than if they occur far apart Documents are returned in decreasing order of relevance score  Usually only top few documents are returned, not all Database System Concepts - 5th Edition, Sep 2, 2005 19.7 ©Silberschatz, Korth and Sudarshan Similarity Based Retrieval  Similarity based retrieval - retrieve documents similar to a given document  Similarity may be defined on the basis of common words   Relevance feedback: Similarity can be used to refine answer set to keyword query   E.g. find k terms in A with highest TF (d, t ) / n (t ) and use these terms to find relevance of other documents. User selects a few relevant documents from those retrieved by keyword query, and system finds other documents similar to these Vector space model: define an n-dimensional space, where n is the number of words in the document set.  Vector for document d goes from origin to a point whose i th coordinate is TF (d,t ) / n (t )  The cosine of the angle between the vectors of two documents is used as a measure of their similarity. Database System Concepts - 5th Edition, Sep 2, 2005 19.8 ©Silberschatz, Korth and Sudarshan Relevance Using Hyperlinks  Number of documents relevant to a query can be enormous if only term frequencies are taken into account  Using term frequencies makes “spamming” easy  E.g. a travel agency can add many occurrences of the words “travel” to its page to make its rank very high  Most of the time people are looking for pages from popular sites  Idea: use popularity of Web site (e.g. how many people visit it) to rank site pages that match given keywords  Problem: hard to find actual popularity of site  Solution: next slide Database System Concepts - 5th Edition, Sep 2, 2005 19.9 ©Silberschatz, Korth and Sudarshan Relevance Using Hyperlinks (Cont.)   Solution: use number of hyperlinks to a site as a measure of the popularity or prestige of the site  Count only one hyperlink from each site (why? - see previous slide)  Popularity measure is for site, not for individual page  But, most hyperlinks are to root of site  Also, concept of “site” difficult to define since a URL prefix like cs.yale.edu contains many unrelated pages of varying popularity Refinements   When computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige  Definition is circular  Set up and solve system of simultaneous linear equations Above idea is basis of the Google PageRank ranking mechanism Database System Concepts - 5th Edition, Sep 2, 2005 19.10 ©Silberschatz, Korth and Sudarshan Relevance Using Hyperlinks (Cont.)   Connections to social networking theories that ranked prestige of people  E.g. the president of the U.S.A has a high prestige since many people know him  Someone known by multiple prestigious people has high prestige Hub and authority based ranking  A hub is a page that stores links to many pages (on a topic)  An authority is a page that contains actual information on a topic  Each page gets a hub prestige based on prestige of authorities that it points to  Each page gets an authority prestige based on prestige of hubs that point to it  Again, prestige definitions are cyclic, and can be got by solving linear equations  Use authority prestige when ranking answers to a query Database System Concepts - 5th Edition, Sep 2, 2005 19.11 ©Silberschatz, Korth and Sudarshan Synonyms and Homonyms  Synonyms  E.g. document: “motorcycle repair”, query: “motorcycle maintenance”     need to realize that “maintenance” and “repair” are synonyms System can extend query as “motorcycle and (repair or maintenance)” Homonyms  E.g. “object” has different meanings as noun/verb  Can disambiguate meanings (to some extent) from the context Extending queries automatically using synonyms can be problematic  Need to understand intended meaning in order to infer synonyms   Or verify synonyms with user Synonyms may have other meanings as well Database System Concepts - 5th Edition, Sep 2, 2005 19.12 ©Silberschatz, Korth and Sudarshan Concept-Based Querying  Approach  For each word, determine the concept it represents from context  Use one or more ontologies:  Hierarchical structure showing relationship between concepts  E.g.: the ISA relationship that we saw in the E-R model  This approach can be used to standardize terminology in a specific field  Ontologies can link multiple languages  Foundation of the Semantic Web (not covered here) Database System Concepts - 5th Edition, Sep 2, 2005 19.13 ©Silberschatz, Korth and Sudarshan Indexing of Documents  An inverted index maps each keyword Ki to a set of documents Si that contain the keyword    Inverted index may record  Keyword locations within document to allow proximity based ranking  Counts of number of occurrences of keyword to compute TF and operation: Finds documents that contain all of K1, K2, ..., Kn.   Intersection S1 S2 .....  Sn or operation: documents that contain at least one of K1, K2, …, Kn   Documents identified by identifiers union, S1 S2 .....  Sn,. Each Si is kept sorted to allow efficient intersection/union by merging  “not” can also be efficiently implemented by merging of sorted lists Database System Concepts - 5th Edition, Sep 2, 2005 19.14 ©Silberschatz, Korth and Sudarshan Measuring Retrieval Effectiveness   Information-retrieval systems save space by using index structures that support only approximate retrieval. May result in:  false negative (false drop) - some relevant documents may not be retrieved.  false positive - some irrelevant documents may be retrieved.  For many applications a good index should not permit any false drops, but may permit a few false positives. Relevant performance metrics:  precision - what percentage of the retrieved documents are relevant to the query.  recall - what percentage of the documents relevant to the query were retrieved. Database System Concepts - 5th Edition, Sep 2, 2005 19.15 ©Silberschatz, Korth and Sudarshan Measuring Retrieval Effectiveness (Cont.)  Recall vs. precision tradeoff:   Measures of retrieval effectiveness:  Recall as a function of number of documents fetched, or  Precision as a function of recall    Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision Equivalently, as a function of number of documents fetched E.g. “precision of 75% at recall of 50%, and 60% at a recall of 75%” Problem: which documents are actually relevant, and which are not Database System Concepts - 5th Edition, Sep 2, 2005 19.16 ©Silberschatz, Korth and Sudarshan Web Search Engines  Web crawlers are programs that locate and gather information on the Web  Recursively follow hyperlinks present in known documents, to find other documents    Starting from a seed set of documents Fetched documents  Handed over to an indexing system  Can be discarded after indexing, or store as a cached copy Crawling the entire Web would take a very large amount of time  Search engines typically cover only a part of the Web, not all of it  Take months to perform a single crawl Database System Concepts - 5th Edition, Sep 2, 2005 19.17 ©Silberschatz, Korth and Sudarshan Web Crawling (Cont.)    Crawling is done by multiple processes on multiple machines, running in parallel  Set of links to be crawled stored in a database  New links found in crawled pages added to this set, to be crawled later Indexing process also runs on multiple machines  Creates a new copy of index instead of modifying old index  Old index is used to answer queries  After a crawl is “completed” new index becomes “old” index Multiple machines used to answer queries  Indices may be kept in memory  Queries may be routed to different machines for load balancing Database System Concepts - 5th Edition, Sep 2, 2005 19.18 ©Silberschatz, Korth and Sudarshan Information Retrieval and Structured Data  Information retrieval systems originally treated documents as a collection of words  Information extraction systems infer structure from documents, e.g.:   Extraction of house attributes (size, address, number of bedrooms, etc.) from a text advertisement  Extraction of topic and people named from a new article Relations or XML structures used to store extracted data  System seeks connections among data to answer queries  Question answering systems Database System Concepts - 5th Edition, Sep 2, 2005 19.19 ©Silberschatz, Korth and Sudarshan Directories  Storing related documents together in a library facilitates browsing  users can see not only requested document but also related ones.  Browsing is facilitated by classification system that organizes logically related documents together.  Organization is hierarchical: classification hierarchy Database System Concepts - 5th Edition, Sep 2, 2005 19.20 ©Silberschatz, Korth and Sudarshan A Classification Hierarchy For A Library System Database System Concepts - 5th Edition, Sep 2, 2005 19.21 ©Silberschatz, Korth and Sudarshan Classification DAG  Documents can reside in multiple places in a hierarchy in an information retrieval system, since physical location is not important.  Classification hierarchy is thus Directed Acyclic Graph (DAG) Database System Concepts - 5th Edition, Sep 2, 2005 19.22 ©Silberschatz, Korth and Sudarshan A Classification DAG For A Library Information Retrieval System Database System Concepts - 5th Edition, Sep 2, 2005 19.23 ©Silberschatz, Korth and Sudarshan Web Directories  A Web directory is just a classification directory on Web pages  E.g. Yahoo! Directory, Open Directory project  Issues:   What should the directory hierarchy be?  Given a document, which nodes of the directory are categories relevant to the document Often done manually  Classification of documents into a hierarchy may be done based on term similarity Database System Concepts - 5th Edition, Sep 2, 2005 19.24 ©Silberschatz, Korth and Sudarshan End of Chapter Database System Concepts ©Silberschatz, Korth and Sudarshan See www.db-book.com for conditions on re-use 25

51,000 تومان