bazyabiye_ettelaate_modern

در نمایش آنلاین پاورپوینت، ممکن است بعضی علائم، اعداد و حتی فونت‌ها به خوبی نمایش داده نشود. این مشکل در فایل اصلی پاورپوینت وجود ندارد.




  • جزئیات
  • امتیاز و نظرات
  • متن پاورپوینت

امتیاز

درحال ارسال
امتیاز کاربر [0 رای]

نقد و بررسی ها

هیچ نظری برای این پاورپوینت نوشته نشده است.

اولین کسی باشید که نظری می نویسد “Modern Information Retrieval”

Modern Information Retrieval

اسلاید 1: Modern Information RetrievalLecture 3: Boolean Retrieval

اسلاید 2: Lecture OverviewData TypesIncidence VectorsInverted IndexesQuery Processing Query OptimizationDiscussionReferencesMarjan Ghazvininejad2Sharif University Spring 2012

اسلاید 3: Lecture OverviewData TypesIncidence VectorsInverted IndexesQuery Processing Query OptimizationDiscussionReferencesMarjan Ghazvininejad3Sharif University Spring 2012

اسلاید 4: Types of dataUnstructuredSemi-structuredStructuredMarjan Ghazvininejad4Sharif University Spring 2012

اسلاید 5: Unstructured dataTypically refers to free textAllowsKeyword queries including operatorsMore sophisticated “concept” queries e.g.,find all web pages dealing with drug abuseClassic model for searching text documents5Marjan GhazvininejadSharif University Spring 2012

اسلاید 6: Unstructured (text) vs. structured (database) data in 19966Marjan GhazvininejadSharif University Spring 2012

اسلاید 7: Unstructured (text) vs. structured (database) data in 20097Marjan GhazvininejadSharif University Spring 2012

اسلاید 8: Unstructured data in 1680Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?Why is that not the answer?Slow (for large corpora)NOT Calpurnia is non-trivialOther operations (e.g., find the word Romans near countrymen) not feasibleRanked retrieval (best documents to return)Later lectures8Sec. 1.1Marjan GhazvininejadSharif University Spring 2012

اسلاید 9: Semi-structured dataIn fact almost no data is “unstructured”E.g., this slide has distinctly identified zones such as the Title and BulletsFacilitates “semi-structured” search such asTitle contains data AND Bullets contain search… to say nothing of linguistic structure9

اسلاید 10: Lecture OverviewData TypesIncidence VectorsInverted IndexesQuery Processing Query OptimizationDiscussionReferencesMarjan Ghazvininejad10Sharif University Spring 2012

اسلاید 11: Term-document incidence1 if play contains word, 0 otherwiseBrutus AND Caesar BUT NOT CalpurniaSec. 1.1Marjan Ghazvininejad11Sharif University Spring 2012

اسلاید 12: Incidence vectorsSo we have a 0/1 vector for each term.To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)  bitwise AND.110100 AND 110111 AND 101111 = 100100. 12Sec. 1.1Marjan GhazvininejadSharif University Spring 2012

اسلاید 13: Bigger collectionsConsider N = 1 million documents, each with about 1000 words.Avg 6 bytes/word including spaces/punctuation 6GB of data in the documents.Say there are M = 500K distinct terms among these.13Sec. 1.1Marjan GhazvininejadSharif University Spring 2012

اسلاید 14: Can’t build the matrix500K x 1M matrix has half-a-trillion 0’s and 1’s.But it has no more than one billion 1’s.matrix is extremely sparse.What’s a better representation?We only record the 1 positions.14Why?Sec. 1.1Marjan GhazvininejadSharif University Spring 2012

اسلاید 15: Lecture OverviewData TypesIncidence VectorsInverted IndexesQuery Processing Query OptimizationDiscussionReferencesMarjan Ghazvininejad15Sharif University Spring 2012

اسلاید 16: Inverted indexFor each term t, we must store a list of all documents that contain t.Identify each by a docID, a document serial numberCan we used fixed-size arrays for this?16What happens if the word Caesar is added to document 14? Sec. 1.2BrutusCalpurniaCaesar12456165713212411314517323117454101Marjan GhazvininejadSharif University Spring 2012

اسلاید 17: Inverted indexWe need variable-size postings listsOn disk, a continuous run of postings is normal and bestIn memory, can use linked lists or variable length arraysSome tradeoffs in size/ease of insertion17DictionaryPostingsSorted by docID (more later on why).PostingSec. 1.2BrutusCalpurniaCaesar12456165713212411314517323117454101

اسلاید 18: TokenizerToken streamFriendsRomansCountrymenInverted index constructionLinguistic modulesModified tokensfriendromancountrymanIndexerInverted indexfriendromancountryman24213161Documents tobe indexedFriends, Romans, countrymen.Sec. 1.2

اسلاید 19: Indexer steps: Token sequenceSequence of (Modified token, Document ID) pairs.I did enact JuliusCaesar I was killed i the Capitol; Brutus killed me.Doc 1So let it be withCaesar. The nobleBrutus hath told youCaesar was ambitiousDoc 2Sec. 1.2Marjan Ghazvininejad19Sharif University Spring 2012

اسلاید 20: Indexer steps: SortSort by termsAnd then docID Core indexing stepSec. 1.2Marjan Ghazvininejad20Sharif University Spring 2012

اسلاید 21: Indexer steps: Dictionary & PostingsMultiple term entries in a single document are merged.Split into Dictionary and PostingsDoc. frequency information is added.Sec. 1.2Marjan Ghazvininejad21Sharif University Spring 2012

اسلاید 22: Where do we pay in storage?22PointersTerms and countsLater in the course:How do we index efficiently?How much storage do we need?Sec. 1.2Lists of docIDsMarjan GhazvininejadSharif University Spring 2012

اسلاید 23: Lecture OverviewData TypesIncidence VectorsInverted IndexesQuery Processing Query OptimizationDiscussionReferencesMarjan Ghazvininejad23Sharif University Spring 2012

اسلاید 24: Query processing: ANDConsider processing the query:Brutus AND CaesarLocate Brutus in the Dictionary;Retrieve its postings.Locate Caesar in the Dictionary;Retrieve its postings.“Merge” the two postings:2412834248163264123581321BrutusCaesarSec. 1.3Marjan GhazvininejadSharif University Spring 2012

اسلاید 25: The mergeWalk through the two postings simultaneously, in time linear in the total number of postings entries253412824816326412358132112834248163264123581321BrutusCaesar28If list lengths are x and y, merge takes O(x+y) operations.Crucial: postings sorted by docID.Sec. 1.3Marjan GhazvininejadSharif University Spring 2012

اسلاید 26: Boolean queries: Exact matchThe Boolean retrieval model is being able to ask a query that is a Boolean expression:Boolean Queries use AND, OR and NOT to join query termsViews each document as a set of wordsIs precise: document matches condition or not.Perhaps the simplest model to build an IR system onPrimary commercial retrieval tool for 3 decades. Many search systems you still use are Boolean:Email, library catalog, Mac OS X Spotlight26Sec. 1.3Marjan GhazvininejadSharif University Spring 2012

اسلاید 27: Boolean queriesMany professional searchers still like Boolean searchYou know exactly what you are gettingBut that doesn’t mean it actually works better….Marjan Ghazvininejad27Sharif University Spring 2012

اسلاید 28: Boolean queries: More general mergesExercise: Adapt the merge for the queries:Brutus AND NOT CaesarBrutus OR NOT CaesarCan we still run through the merge in time O(x+y)?What can we achieve?28Sec. 1.3Marjan GhazvininejadSharif University Spring 2012

اسلاید 29: MergingWhat about an arbitrary Boolean formula?(Brutus OR Caesar) AND NOT(Antony OR Cleopatra)Can we always merge in “linear” time?Linear in what?Can we do better?29Sec. 1.3Marjan GhazvininejadSharif University Spring 2012

اسلاید 30: Lecture OverviewData TypesIncidence VectorsInverted IndexesQuery Processing Query OptimizationDiscussionReferencesMarjan Ghazvininejad30Sharif University Spring 2012

اسلاید 31: Query optimizationWhat is the best order for query processing?Consider a query that is an AND of n terms.For each of the n terms, get its postings, then AND them together.Query: Brutus AND Calpurnia AND Caesar31Sec. 1.3Marjan Ghazvininejad31Sharif University Spring 2012BrutusCalpurniaCaesar123581621342481632641281316

اسلاید 32: Query optimization exampleProcess in order of increasing freq:start with smallest set, then keep cutting further.32This is why we keptdocument freq. in dictionaryExecute the query as (Calpurnia AND Brutus) AND Caesar.Sec. 1.3BrutusCaesarCalpurnia123581621342481632641281316Marjan GhazvininejadSharif University Spring 2012

اسلاید 33: More general optimizatione.g., (madding OR crowd) AND (ignoble OR strife)Get doc. freq.’s for all terms.Estimate the size of each OR by the sum of its doc. freq.’s (conservative).Process in increasing order of OR sizes.33Sec. 1.3Marjan GhazvininejadSharif University Spring 2012

اسلاید 34: ExerciseRecommend a query processing order for34(tangerine OR trees) AND(marmalade OR skies) AND(kaleidoscope OR eyes)Marjan GhazvininejadSharif University Spring 2012

اسلاید 35: Query processing exercisesExercise: If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size?Hint: Begin with the case of a Boolean formula query where each term appears only once in the query.35Marjan GhazvininejadSharif University Spring 2012

اسلاید 36: ExerciseTry the search feature at http://www.rhymezone.com/shakespeare/Write down five search features you think it could do better36Marjan GhazvininejadSharif University Spring 2012

اسلاید 37: What’s ahead in IR? Beyond term searchWhat about phrases?Stanford UniversityProximity: Find Gates NEAR Microsoft.Need index to capture position information in docs.Zones in documents: Find documents with (author = Ullman) AND (text contains automata).37Marjan GhazvininejadSharif University Spring 2012

اسلاید 38: Evidence accumulation1 vs. 0 occurrence of a search term2 vs. 1 occurrence3 vs. 2 occurrences, etc.Usually more seems betterNeed term frequency information in docs38Marjan GhazvininejadSharif University Spring 2012

اسلاید 39: Ranking search resultsBoolean queries give inclusion or exclusion of docs.Often we want to rank/group resultsNeed to measure proximity from query to each doc.Need to decide whether docs presented to user are singletons, or a group of docs covering various aspects of the query.39Marjan GhazvininejadSharif University Spring 2012

اسلاید 40: Clustering, classification and rankingClustering: Given a set of docs, group them into clusters based on their contents.Classification: Given a set of topics, plus a new doc D, decide which topic(s) D belongs to.Ranking: Can we learn how to best order a set of documents, e.g., a set of search results40Marjan GhazvininejadSharif University Spring 2012

اسلاید 41: Lecture OverviewData TypesIncidence VectorsInverted IndexesQuery Processing Query OptimizationDiscussionReferencesMarjan Ghazvininejad41Sharif University Spring 2012

اسلاید 42: Next Time???ReadingsChapter ??? in IR text (?????????????)Joyce & Needham “The Thesaurus Approach to Information Retrieval” (in Readings book)Luhn “The Automatic Derivation of Information Retrieval Encodements from Machine-Readable Texts” (in Readings)Doyle “Indexing and Abstracting by Association, Pt I” (in Readings)Marjan Ghazvininejad42Sharif University Spring 2012

اسلاید 43: Lecture OverviewData TypesIncidence VectorsInverted IndexesQuery Processing Query OptimizationDiscussionReferencesMarjan Ghazvininejad43Sharif University Spring 2012

اسلاید 44: Resources for today’s lectureIntroduction to Information Retrieval, chapter 1Shakespeare: http://www.rhymezone.com/shakespeare/Try the neat browse by keyword sequence feature!Managing Gigabytes, chapter 3.2Modern Information Retrieval, chapter 8.244Marjan GhazvininejadSharif University Spring 2012

18,000 تومان

خرید پاورپوینت توسط کلیه کارت‌های شتاب امکان‌پذیر است و بلافاصله پس از خرید، لینک دانلود پاورپوینت در اختیار شما قرار خواهد گرفت.

در صورت عدم رضایت سفارش برگشت و وجه به حساب شما برگشت داده خواهد شد.

در صورت نیاز با شماره 09353405883 در واتساپ، ایتا و روبیکا تماس بگیرید.

افزودن به سبد خرید