Tecniche di Data Mining
Slide 1: Tecniche di Data Mining (Data Mining Techniques)
Fosca Giannotti and Dino Pedreschi
Pisa KDD Lab, CNUCE-CNR & Univ. Pisa
http://www-kdd.cnuce.cnr.it/
Dipartimento di Informatica, Università di Pisa, academic year 2002/2003
Slide 2: Tecniche di Data Mining
Giannotti & Pedreschi, academic year 2002/2003, Introduction
- AA270 Tecniche di data mining (Data mining techniques): specialist degree programmes in Informatics and in Information Technologies
- 4I117 Basi di dati e sistemi informativi (Databases and information systems): data mining techniques for data analysis; degree programme in Informatics (five-year, old curriculum)
- Analisi dei dati ed estrazione di conoscenza (Data analysis and knowledge extraction): specialist degree programme in Informatics for Economics and Business
Slide 3: Tecniche di Data Mining
- Acronym: TDM
- Schedule: Monday 11-13, room E; Thursday 14-16, room B
- Lecturer: Fosca Giannotti, CNUCE-CNR, f.giannotti@cnuce.cnr.it
- Supplementary course: Dino Pedreschi, Dipartimento di Informatica, pedre@di.unipi.it
- Office hours: Wednesday 14-16, ISTI, Area Ricerca CNR, località San Cataldo, Pisa
Slide 4: Tecniche di Data Mining: bibliography
- Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000. http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-489-8
- U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (editors). Advances in Knowledge Discovery and Data Mining. MIT Press, 1996.
- David J. Hand, Heikki Mannila, Padhraic Smyth. Principles of Data Mining. MIT Press, 2001.
- S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2002. ISBN 1-55860-754-4.
The slides used in the lectures will be made available through the course web site: http://www-kdd.cnuce.cnr.it/
Slide 5: Questionnaire
E-mail message with subject: Corso TDM
Content:
- First and last name: ...
- E-mail: ...
- Year of enrolment: ...
- Degree programme: ...
- Database courses:
  - attended in previous semesters:
  - attended this semester:
Slide 6: Course contents
- Introduction and basic concepts (2 hours)
  - Applications
  - The knowledge discovery process
- Data consolidation and data preparation (4 hours + 2 of exercises)
  - Basics of data warehousing
  - Basics of multidimensional data analysis
- Association rules (6 hours + 4 of exercises)
  - Intra-attribute and inter-attribute rules
  - Efficient computation of association rules: the Apriori algorithm and its variants
  - Extensions of the association rule concept: taxonomies, quantitative rules, predictive rules
  - Association rules and the time factor: cyclic and calendric association rules
  - Sequential patterns and time series
  - Market basket analysis using association rules
Slide 7: Course contents
- Classification with decision trees (6 hours + 2 of exercises)
  - Main classification techniques
  - Bayesian classifiers
  - Decision trees
  - Survey of other methods
  - Application to fraud detection
- Clustering (2 hours + 2 of exercises)
  - Main clustering techniques
  - Application to customer segmentation
- Web mining (4 hours)
- Advanced topics (6 hours of seminars)
Slide 8: Evaluation
- Exercises during the course (or oral exam): 30%
- Seminar (or project): 70% ?
- Students should pair up in teams. Both partners receive the same credit; the division of labor is up to them.
- Presentations should take 50 minutes, including 10 minutes for discussion.
- A presentation normally covers two or three closely related papers.
- Transparencies should be made available to the rest of the class, preferably in PDF or HTML format.
Slide 9: Course Outline
- Introduction and basic concepts
  - Motivations, applications, the KDD process, the techniques
- Deeper into DM technology
  - Association Rules and Market Basket Analysis
  - Decision Trees and Fraud Detection
  - Clustering and Customer Segmentation
- Deeper into Data Preparation
  - Basic notions of data warehousing
  - Selection and preprocessing
- Advanced Topics
  - Scalable DM algorithms
  - Data mining query languages
  - Mining on the Web
Slide 10: Evolution of Database Technology: from data management to data analysis
- 1960s: Data collection, database creation, IMS and network DBMS.
- 1970s: Relational data model, relational DBMS implementation.
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
- 1990s: Data mining and data warehousing, multimedia databases, and Web technology.
Slide 11: Motivations: "Necessity is the Mother of Invention"
- Data explosion problem: automated data collection tools, mature database technology and the Internet lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
- "We are drowning in information, but starving for knowledge!" (John Naisbitt)
- Data warehousing and data mining:
  - On-line analytical processing
  - Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.
Slide 12: Motivations for DM
- Abundance of business and industry data
- Competitive focus: knowledge management
- Inexpensive, powerful computing engines
- Strong theoretical/mathematical foundations
  - machine learning and logic
  - statistics
  - database management systems
Slide 13: Sources of Data
- Business transactions
  - widespread use of bar codes => storage of millions of transactions daily (e.g., Walmart: 2000 stores => 20M transactions per day)
  - most important problem: effective use of the data in a reasonable time frame for competitive decision-making
  - e-commerce data
- Scientific data
  - data generated through a multitude of experiments and observations
  - examples: geological data, satellite imaging data, NASA earth observations
  - the rate of data collection far exceeds the speed at which we can analyze the data
- Financial data
  - company information
  - economic data (GNP, price indexes, etc.)
  - stock markets
Slide 14: Sources of Data
- Personal / statistical data
  - government census
  - medical histories
  - customer profiles
  - demographic data
  - data and statistics about sports and athletes
- World Wide Web and online repositories
  - email, news, messages
  - Web documents, images, video, etc.
  - link structure of the hypertext from millions of Web sites
  - Web usage data (from server logs, network traffic, and user registrations)
  - online databases and digital libraries
Slide 15: Classes of applications
- Database analysis and decision support
  - Market analysis: target marketing, customer relationship management, market basket analysis, cross selling, market segmentation.
  - Risk analysis: forecasting, customer retention, improved underwriting, quality control, competitive analysis.
  - Fraud detection
- Other applications
  - Text (news groups, email, documents) and Web analysis.
  - Intelligent query answering
Slide 16: Market Analysis
- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.
- Target marketing
  - Find clusters of "model" customers who share the same characteristics: interests, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/correlations between product sales
  - Prediction based on the association information.
Slide 17: Market Analysis and Management: Market Analysis (2)
- Customer profiling
  - data mining can tell you what types of customers buy what products (clustering or classification).
- Identifying customer requirements
  - identifying the best products for different customers
  - use prediction to find what factors will attract new customers
- Provides summary information
  - various multidimensional summary reports
  - statistical summary information (data central tendency and variation)
Slide 18: Risk Analysis
- Finance planning and asset evaluation:
  - cash flow analysis and prediction
  - contingent claim analysis to evaluate assets
  - cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
- Resource planning:
  - summarize and compare the resources and spending
- Competition:
  - monitor competitors and market directions (CI: competitive intelligence).
  - group customers into classes and class-based pricing procedures
  - set pricing strategy in a highly competitive market
Slide 19: Fraud Detection
- Applications:
  - widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach:
  - use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.
- Examples:
  - auto insurance: detect a group of people who stage accidents to collect on insurance
  - money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  - medical insurance: detect professional patients, rings of doctors and rings of references
Slide 20: Fraud Detection (2)
More examples:
- Detecting inappropriate medical treatment:
  - The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
- Detecting telephone fraud:
  - Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  - British Telecom identified discrete groups of callers with frequent intra-group calls, especially on mobile phones, and broke a multimillion dollar fraud.
- Retail:
  - Analysts estimate that 38% of retail shrink is due to dishonest employees.
Slide 21: Other applications
- Sports
  - IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat.
- Astronomy
  - JPL and the Palomar Observatory discovered 22 quasars with the help of data mining.
- Internet Web Surf-Aid
  - IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
- Watch for the PRIVACY pitfall!
Slide 22: What is KDD? A process!
- The selection and processing of data for:
  - the identification of novel, accurate, and useful patterns, and
  - the modeling of real-world phenomena.
- Data mining is a major component of the KDD process: automated discovery of patterns and the development of predictive and explanatory models.
Slide 23: The KDD process
[Diagram: Data Sources -> Data Consolidation -> Consolidated Data (Warehouse) -> Selection and Preprocessing -> Prepared Data -> Data Mining -> Patterns & Models (e.g. p(x)=0.02) -> Interpretation and Evaluation -> Knowledge]
Slide 24: The KDD Process in Practice
- KDD steps can be merged or combined
  - Data Cleaning + Data Integration = Data Preprocessing
  - Data Selection + Data Transformation = Data Consolidation
- KDD is an iterative process
  - art + engineering rather than science
Slide 25: The virtuous cycle
[Diagram: a cycle linking "Identify Problem or Opportunity", "Act on Knowledge" and "Measure Effect of Action"; the remaining labels are Problem, Knowledge, Strategy and Results]
Slide 26: The steps of the KDD process
- Learning the application domain: relevant prior knowledge and goals of the application
- Data consolidation: creating a target data set
- Selection and preprocessing
  - Data cleaning (may take 60% of the effort!)
  - Data reduction and projection: find useful features, dimensionality/variable reduction, invariant representation.
- Choosing the functions of data mining: summarization, classification, regression, association, clustering.
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Interpretation and evaluation: analysis of results (visualization, transformation, removing redundant patterns, ...)
- Use of discovered knowledge
Slide 27: Roles in the KDD process
Slide 28: A business intelligence environment
Slide 29: The KDD process
[Diagram: Data Sources -> Data Consolidation -> Consolidated Data (Warehouse) -> Selection and Preprocessing -> Prepared Data -> Data Mining -> Patterns & Models (e.g. p(x)=0.02) -> Interpretation and Evaluation -> Knowledge]
Slide 30: Data consolidation and preparation
- Garbage in, garbage out
- The quality of the results relates directly to the quality of the data
- 50%-70% of the KDD process effort is spent on data consolidation and preparation
- Major justification for a corporate data warehouse
Slide 31: Data consolidation
From data sources to a consolidated data repository
[Diagram: sources (RDBMS, Legacy DBMS, Flat Files, External) -> Data Consolidation and Cleansing -> Warehouse (Object/Relation DBMS, Multidimensional DBMS, Deductive Database, Flat files)]
Slide 32: Data consolidation
- Determine a preliminary list of attributes
- Consolidate data into a working database
  - Internal and external sources
- Eliminate or estimate missing values
- Remove outliers (obvious exceptions)
- Determine prior probabilities of categories and deal with volume bias
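
The cleaning steps listed on this slide can be illustrated with a minimal Python sketch. The function names, the choice of the median as the estimate for missing values, and the 1.5 x IQR outlier rule are assumptions made for illustration; the slides do not prescribe any particular method.

```python
from collections import Counter
from statistics import median, quantiles


def fill_missing(values):
    """Estimate missing values (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (an assumed rule of thumb)."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if low <= v <= high]


def class_priors(labels):
    """Prior probability of each category, estimated as its relative frequency."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}
```

Strongly unequal priors are one sign of the volume bias mentioned on the slide: a rare category may need to be oversampled before mining.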
Slide 33: The KDD process
[Diagram: Warehouse -> Data Consolidation -> Selection and Preprocessing -> Data Mining (e.g. p(x)=0.02) -> Interpretation and Evaluation -> Knowledge]
Slide 34: Data selection and preprocessing
- Generate a set of examples
  - choose a sampling method
  - consider sample complexity
  - deal with volume bias issues
- Reduce attribute dimensionality
  - remove redundant and/or correlated attributes
  - combine attributes (sum, product, difference)
- Reduce attribute value ranges
  - group symbolic discrete values
  - quantize (discretize) continuous numeric values
- Transform data
  - de-correlate and normalize values
  - map time-series data to a static representation
- OLAP and visualization tools play a key role
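
Two of the transformations above, normalizing values and reducing a continuous attribute to a few discrete ranges, can be sketched in Python as follows. Min-max scaling and equal-width binning are assumed here as simple representatives; the slide does not say which normalization or discretization scheme the course uses.

```python
def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def equal_width_bins(values, n_bins):
    """Map each value to the index of one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum falls exactly on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

For example, applying `equal_width_bins` to an age attribute with two bins splits the observed range at its midpoint, turning a numeric field into a two-valued symbolic one.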
Slide 35: The KDD process
[Diagram: Warehouse -> Data Consolidation -> Selection and Preprocessing -> Data Mining (e.g. p(x)=0.02) -> Interpretation and Evaluation -> Knowledge]
Slide 36: Data mining tasks and methods
Directed Knowledge Discovery
- Purpose: explain the value of some field in terms of all the others (goal-oriented)
- Method: select the target field based on some hypothesis about the data; ask the algorithm to tell us how to predict or classify new instances
- Examples:
  - which products show increased sales when cream cheese is discounted
  - which banner ad to use on a web page for a given user coming to the site
Slide 37: Data mining tasks and methods
Undirected Knowledge Discovery (Explorative Methods)
- Purpose: find patterns in the data that may be interesting (no target field)
- Method: clustering, association rules (affinity grouping)
- Examples:
  - which products in the catalog often sell together
  - market segmentation (groups of customers/users with similar characteristics)
Slide 38: Data mining tasks and methods
Alternatively:
- Automated exploration/discovery
  - e.g., discovering new market segments
  - clustering analysis
- Prediction/classification
  - e.g., forecasting gross sales given current factors
  - regression, neural networks, genetic algorithms, decision trees
- Explanation/description
  - e.g., characterizing customers by demographics and purchase history
  - decision trees, association rules
[Figures: scatter plot over x1 and x2; curve f(x) over x; rule "if age > 35 and income < $35k then ..."]
Slide 39: Automated exploration and discovery
- Clustering: partitioning a set of data into a set of classes, called clusters, whose members share some interesting common properties.
- Distance-based numerical clustering
  - metric grouping of examples (K-NN)
  - graphical visualization can be used
- Bayesian clustering
  - search for the number of classes which results in the best fit of a probability distribution to the data
  - AutoClass (NASA) is one of the best examples
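
As a rough illustration of distance-based clustering, the sketch below implements plain k-means in Python. Note the assumption: the slide names only metric grouping (K-NN) and does not mention k-means; it is used here because it is a familiar distance-based method. Taking the first k points as initial centroids is a simplification chosen to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=10):
    """Partition points (tuples of numbers) into k clusters by Euclidean distance."""
    centroids = [points[i] for i in range(k)]  # naive initialization (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters
```

With well-separated groups, such as two points near (1, 1) and two near (9, 9), the procedure settles on one cluster per group within a few iterations.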
Slide 40: Prediction and classification
- Learning a predictive model
- Classification of a new case/sample
- Many methods:
  - Artificial neural networks
  - Inductive decision tree and rule systems
  - Genetic algorithms
  - Nearest neighbor clustering algorithms
  - Statistical (parametric and non-parametric)
Slide 41: Generalization and regression
- The objective of learning is to achieve good generalization to new, unseen cases.
- Generalization can be defined as a mathematical interpolation or regression over a set of training points.
- Models can be validated with a previously unseen test set or using cross-validation methods.
[Figure: curve f(x) fitted over x]
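
The idea of fitting a model on training points and validating it on unseen data can be sketched in Python with a least-squares line and a held-out test set. The straight-line model and the mean squared error measure are assumptions chosen for brevity; the slide speaks of regression and validation only in general terms.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to the training points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, xs, ys):
    """Average squared prediction error of the model on the given points."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Fit on training points, then measure the error on points the model never saw.
model = fit_line([0, 1, 2], [1, 3, 5])
test_error = mean_squared_error(model, [3, 4], [7, 9])
```

A low error on the training points combined with a high error on the test points would indicate poor generalization.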
Slide 42: Classification and prediction
- Classify data based on the values of a target attribute, e.g., classify countries based on climate, or classify cars based on gas mileage.
- Use the obtained model to predict some unknown or missing attribute values based on other information.
Slide 43: Summarizing: inductive modeling = learning
- Objective: develop a general model or hypothesis from specific examples
  - Function approximation (curve fitting)
  - Classification (concept learning, pattern recognition)
[Figures: curve f(x) over x; classes A and B separated in the x1, x2 plane]
Slide 44: Explanation and description
- Learn a generalized hypothesis (model) from selected data
- Description/interpretation of the model provides new knowledge
- Affinity grouping
- Methods:
  - Inductive decision tree and rule systems
  - Association rule systems
  - Link analysis
  - ...
Slide 45: Exception/deviation detection
- Generate a model of normal activity
- Deviation from the model causes an alert
- Methods:
  - Artificial neural networks
  - Inductive decision tree and rule systems
  - Statistical methods
  - Visualization tools
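
A very simple statistical instance of this scheme is sketched below in Python: the "model of normal activity" is just the mean and standard deviation of past observations, and an alert is raised when a new value lies more than a chosen number of standard deviations from the mean. The three-sigma threshold and the function names are assumptions for illustration.

```python
from statistics import mean, stdev


def build_model(normal_values):
    """Summarize normal activity by its mean and sample standard deviation."""
    return mean(normal_values), stdev(normal_values)


def is_deviation(value, model, threshold=3.0):
    """Alert when the value lies more than threshold standard deviations from the mean."""
    mu, sigma = model
    return abs(value - mu) > threshold * sigma
```

Applied, for instance, to hypothetical call durations in minutes, a model built from typical calls of 9 to 13 minutes would flag a 40-minute call while accepting an 11.5-minute one.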
Slide 46: Outlier and exception data analysis
- Time-series analysis (trend and deviation): regression, sequential patterns, similar sequences, e.g., stock analysis.
- Similarity-based pattern-directed analysis
- Full vs. partial periodicity analysis
- Other pattern-directed or statistical analysis
Slide 47: Example: Moviegoer Database
Slide 48: Example: Moviegoer Database
    SELECT moviegoers.name, moviegoers.sex, moviegoers.age,
           sources.source, movies.name
    FROM movies, sources, moviegoers
    WHERE sources.source_ID = moviegoers.source_ID
      AND movies.movie_ID = moviegoers.movie_ID
    ORDER BY moviegoers.name;
Result:
    moviegoers.name  sex  age  source      movies.name
    Amy              f    27   Oberlin     Independence Day
    Andrew           m    25   Oberlin     12 Monkeys
    Andy             m    34   Oberlin     The Birdcage
    Anne             f    30   Oberlin     Trainspotting
    Ansje            f    25   Oberlin     I Shot Andy Warhol
    Beth             f    30   Oberlin     Chain Reaction
    Bob              m    51   Pinewoods   Schindler's List
    Brian            m    23   Oberlin     Super Cop
    Candy            f    29   Oberlin     Eddie
    Cara             f    25   Oberlin     Phenomenon
    Cathy            f    39   Mt. Auburn  The Birdcage
    Charles          m    25   Oberlin     Kingpin
    Curt             m    30   MRJ         T2 Judgment Day
    David            m    40   MRJ         Independence Day
    Erica            f    23   Mt. Auburn  Trainspotting
Slide 49: Example: Moviegoer Database
- Classification
  - determine sex based on age, source, and movies seen
  - determine source based on sex, age, and movies seen
  - determine most recent movie based on past movies, age, sex, and source
- Estimation
  - for prediction, we need a continuous variable (e.g., "age")
  - predict age as a function of source, sex, and past movies
  - if we had a "rating" field for each moviegoer, we could predict the rating a new moviegoer gives to a movie based on age, sex, past movies, etc.
Slide 50: Example: Moviegoer Database
- Clustering
  - find groupings of movies that are often seen by the same people
  - find groupings of people that tend to see the same movies
  - clustering might reveal relationships that are not necessarily recorded in the data (e.g., we may find a cluster that is dominated by people with young children, or a cluster of movies that correspond to a particular genre)
Slide 51: Example: Moviegoer Database
- Association Rules
  - market basket analysis (MBA): "which movies go together?"
  - need to create "transactions" for each moviegoer containing the movies seen by that moviegoer:
        name    TID  Transaction
        Amy     001  {Independence Day, Trainspotting}
        Andrew  002  {12 Monkeys, The Birdcage, Trainspotting, Phenomenon}
        Andy    003  {Super Cop, Independence Day, Kingpin}
        Anne    004  {Trainspotting, Schindler's List}
        ...     ...  ...
  - may result in association rules such as:
        {"Phenomenon", "The Birdcage"} ==> {"Trainspotting"}
        {"Trainspotting", "The Birdcage"} ==> {sex = "f"}
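
How strongly such a rule is backed by the data is usually expressed through its support and confidence. The Python sketch below computes both over the four transactions listed on this slide; the function names are illustrative, and the rows elided on the slide are simply left out.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(set(antecedent) | set(consequent), transactions) / base


# The four transactions shown on the slide (TID 001 to 004).
transactions = [
    {"Independence Day", "Trainspotting"},
    {"12 Monkeys", "The Birdcage", "Trainspotting", "Phenomenon"},
    {"Super Cop", "Independence Day", "Kingpin"},
    {"Trainspotting", "Schindler's List"},
]

rule_support = support({"Phenomenon", "The Birdcage", "Trainspotting"}, transactions)
rule_confidence = confidence({"Phenomenon", "The Birdcage"}, {"Trainspotting"}, transactions)
```

On these four rows the rule {"Phenomenon", "The Birdcage"} ==> {"Trainspotting"} holds in one transaction out of four, and in every transaction where its antecedent appears. An algorithm such as Apriori would enumerate the frequent itemsets efficiently rather than testing rules one at a time.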
Slide 52: Example: Moviegoer Database
- Sequence Analysis
  - similar to MBA, but the order in which items appear in the pattern is important
  - e.g., people who rent "The Birdcage" during a visit tend to rent "Trainspotting" in the next visit.
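
The difference from market basket analysis can be made concrete with a small Python sketch that counts, per customer, which movie is followed by which other movie in the next visit. The visit histories in the usage example are hypothetical, since the slides give no rental data, and counting each ordered pair at most once per customer is an assumed convention.

```python
from collections import Counter


def next_visit_pairs(histories):
    """Count customers who rented movie a in one visit and movie b in the next.

    Each history is a list of visits in time order; each visit is a set of titles.
    """
    counts = Counter()
    for visits in histories:
        seen = set()
        for earlier, later in zip(visits, visits[1:]):
            for a in earlier:
                for b in later:
                    seen.add((a, b))
        counts.update(seen)  # each ordered pair counts once per customer
    return counts


# Hypothetical rental histories of three customers.
histories = [
    [{"The Birdcage"}, {"Trainspotting"}],
    [{"The Birdcage", "Kingpin"}, {"Trainspotting"}, {"Eddie"}],
    [{"Trainspotting"}, {"The Birdcage"}],
]
pair_counts = next_visit_pairs(histories)
```

Here the ordered pair ("The Birdcage", "Trainspotting") is counted for two customers and the reverse pair for one, a distinction that an unordered itemset would lose.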
Slide 53: The KDD process
[Diagram: Warehouse -> Data Consolidation and Warehousing -> Selection and Preprocessing -> Data Mining (e.g. p(x)=0.02) -> Interpretation and Evaluation -> Knowledge]
Slide 54: Are all the discovered patterns interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
- Interestingness measures:
  - easily understood by humans
  - valid on new or test data with some degree of certainty
  - potentially useful
  - novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, etc.
Slide 55: Interpretation and evaluation
- Evaluation
  - Statistical validation and significance testing
  - Qualitative review by experts in the field
  - Pilot surveys to evaluate model accuracy
- Interpretation
  - Inductive tree and rule models can be read directly
  - Clustering results can be graphed and tabulated
  - Code can be automatically generated by some systems (IDTs, regression models)