صفحه 1:
Tecniche di Data Mining Fosca Giannotti and Dino Pedreschi Pisa KDD Lab, CNUCE-CNR & Univ. Pisa http://www-kdd.cnuce.cnrit/ DIPARTIMENTO DI INFORMATICA - Universita di Pisa anno accademico 2002/2003

صفحه 2:
Tecniche di Data Mining 1 ®@CPO Tevacke di deta whic | Corsi d beers Gperidisics tt Toforwuires ‏اس‎ ‎‘Porwutck= ‎0 6۳100۶0 Bust di dat ‏سای :رهب موه و‎ di dota ‏هو بو‎ Pocutst det dat © Corse dt Lowes to ToPorwwutce (quiquecme, verchic prices) Bante det det ed epteuzicae ‏میس ول‎ 1 Opreo dh Lewes Gpecidisice tt FoPorwutce per (Ervarwia pe per !@zenk Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 3:
Tecniche di Data Mining 1 Acronimo: TDM ‏:واعقء0 اش(‎ Lunedi 11-13 aula E, 610۷601 14-6 aula B | Docente: Fosca Giannotti, CNUCE-CNR, f.giannotti@cnuce.cnr.it | Corso Integrativo: Dino Pedreschi, Dipartimento di Informatica, pedre@di.unipi.it Ricevimento: Mercoledi 14-16 | ISTI, Area Ricerca CNR, localita San Cataldo, Pisa Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 4:
Tecniche di Data Mining I RP erioent bolo Pet %deavet Lor, Dicker (Comber, Osta ‏امه سوه( رت(‎ Meche, Dorcas (ener Pubbshers, 0 ‏هط سوام واه منوا‎ 16۵ 0266660- 60-6 KO. Parnn, ©. Pricisty-Shopiry, ®. Gah, 7. Dirnveuny (editors). dbenwer in Komwlecke dovovery ond chia wisien, OTD Press, (890. ead J. Meret, WLebht Dorada, Padhrais Seo, Priecipbes oP Dot ‏رم‎ OW Prove, DOA. ©, Ckohrebart, ren he (Deb: Disrovertiy ‏اوه‎ Prox ‏مجو رلا"‎ Wer0, Dorgaa keaParenm, SBD (-SSE80-PSF-F, CDOS D4 foot waltz zat celle leziodt sarap resi dspoubt utrwersr ft sie web del corsv: ‏ون یی اج( سسمبا ند‎ Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 5:
ONATLO I mMessaggio e-mail con subject:CorsoTDM. | Contenuto | NomeeCognome... 7 e-mail " annoimmatricofazione: 5 Corso diCawrea :, © Corsi dibasidi dati: ۲۰ Frequentati neiprecedenti semestri: (+ mmquestosemestre: annoaccadenicasoos/2008 Introduzione Giannotti & Pedreschi

صفحه 6:
Contenuti del corso I IntroductionandBasicconcepts(2 ore) > Leapplicaziont > Iprocessodi knowledge discovery 1 DataConsolidation &Data Preparation (4 +2 esercitazione) + Nozionibasiche di Data Warehousing + Nozioni basiche dianalisimultidimensionale dei dati 1 Regole Associatives +4 esercitazione) * Regoleintra-attributo,inter-attributo ~ Calcolo efficiente di regoleLassociazione:algoritmo Apriorie varianti * Estensioni del concetto di regola dassociazione:tassonomie,regole quantitative, regolepredittive, Regoleassociativeefattore Tempo: RAA Cicliche eCalendriche > Pattern SequenzialieSerieTemporali 2 Basket Market Analysisutilizzando RIA Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 7:
Contenuti del corso Classificazione con alberi di decisione (6 ore +2 esercitazione) > Principalitecniche di classificazione > Classificatori bayesiant * Aberi di decisione > Rassegnadialtrimetodi > Applicazioneal rilevamento di frodi 1 Clustering (2 ore+z2 esercitazione) © Principalitecniche di clustering > Apylicazioneal Customer segmentation ۲ Web Mining (gore) 1 Temi avanzati (6 oreseminari) Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 8:
Modalit divalutazione | €sercizi durante il corso(oOrale):30% | Seminario (oProgetto):70%? | students shouldpairupinteams. They will receive thesame credit as their partner. Division of Caborisuptothem. Presentations show dtake s5ominutes, including 10minutes for discussion. Apresentation normally covers twoor three closely relatedpapers Transparencies showldbemadeavailable tothe rest of the class---preferablyin PDF or HIML format. Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 9:
Course Outline 1 Imtroductionandbasic concepts | mMotivations,applications, the KDD process, the techniques | DeeperintoDM technology 1 Association Rules and Market Basket Analysis 1 Decision TreesandFraudDetection | ClusteringandCustomer Segmentation | DeeperintoData Preparation 1 Basicnotion of Datawarefiouse | Selectionandpreprocessing 1 AdvancedTopics 1 ScalableDM algorithms 1 Datamining query languages 1 Miningon Web = Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi —

صفحه 10:
fromdatamanagement to data analysis | 1960s: | Data coltection, database creation IMS andnetworkDBMS. | 4970s: | Relational datamodel,relational DBMS implementation. | 4980s: ۱ (extended-relational,00,deductive,etc.) andapplication-oriented DBMS (spatial, scientific, engineering, etc.). “49908: | Datamininganddatawarehousing, multimedia databasesand Web technology. 70 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 11:
“Necessity isthe Mother of Invention” 1 Data explosion problem: | automateddata collection toolsmature database technologyand internet Ceadtotremendous amounts of data storedin databases, data warehouses andother information repositories. | Wearedrowning in information, but starving for knowledge! ohn ‘Naisbett) 9 ‏وم مه‎ 1 On-Cineanalytical processing | Extraction ofinteresting knowledge (rules, regularities, patterns,constraints)fromdatainlargedatabases. ” Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 12:
20 ott watt ons 5 DM | Abundance of businessandindustry data " Competitive focus - Knowledge Management " Imexpensive,powerful computing engines “ Strong theoretical/mathematical foundations ' machinelearning & Cogic ' statistics ' databasemanagement systems 92 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 13:
Sources 0 ‘Data I Business Transactions | widespreaduse of bar codes => storage ofmillions of transactions daily (e.g.,Walmart:2000 stores => 20M transactions per day) | most important problem:effectiveuse of thedatainareasonable timeframe for competitive decision-making | u SclentificData | datageneratedthroughmultitude of experimentsandobservations | examples, geological data,satelliteimaging data, NASAcarth observations ۱ ateofdata collection far exceedsthespeedbywhichweanalyse the ita Financial Data | companyinformation | economic data (GNP,priceindexes,etc.) | stockmarkets a annoaccadenicascesiso0» Introduzione Giannotti & Pedreschi

صفحه 14:
ا ‎Sources of Data‏ I Personal / Statistical Data government census medical histories customer profiles demographic data dataandstatistics about sportsandathletes orld Wide Web andOnline Repositories email, news,messages Web documents, images, video, etc. Cink structure of of the hypertext frommiltions of Web sites Webusage data (fromserver Cogs, network traffic.anduser registrations) online databases,and digital libraries د 7 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 15:
Classes of applications I Databaseanalysisanddecision support | Market analysis * target marketing,customer relation management, market basket anaCysis,cross sefCing, market segmentation, | Riskanalysis * Forecasting, customer retention, improvedunderwriting, quality control, competitiveanalysis. 0 | Other Applications | Text (news group,email,documents)and Web analysis. ! Intelligent Query Answering Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi 0

صفحه 16:
Market analysts I Wherearethe data sourcesfor analysis? ! Credit cardtransactions,Coyalty cards, discount coupons, customer complaint calls,plus (public) lifestyle studies. ‎Target marketing‏ ا ‎| Find clusters of “model” customers who share the same characteristics:interest,income level, spending fabits,etc. ‎“ Determine customer purchasing patterns over time ! Conversion ofsingletoafoint bankaccount:marriage,etc. “ Cross-market analysis ! Associations/co-reCations between product sales | Prediction basedon the associationinformation. ‎Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 17:
OO Market Analysis (2) | Customer profiling | dataminingcan tell youwhat types of customers buy what products (clustering or classification). | Identifying customer requirements ! identifying the best products for different customers | useprediction tofindwhatfactorswill attract new customers | Provides summary information | yariousmultidimensional summary reports! | statistical summary information (data central tendency andvariation) Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 18:
se amatysis | Financeplanningandasset evaluation: | cashflow analysisandprediction | contingent claimanalysistoevaluateassets ! cross-sectional andtimeseriesanalysis (financial-ratio,trend analysis,etc.) | Resourceplanning: ! swmmarizeandcompare the resourcesandspending | Competition: | monitor competitorsandmarket directions (CI: competitive intelligence). | group customersintoclassesandclass-hasedpricing procedures | setpricing strategy ina highly competitivemarket Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 19:
FraudDetecti on “ Applications: | widelyusedin healthcare, retail,credit cardservices, teLecommunications (phonecardfraud),etc. “Approach: | usehistorical data to buildmodels of fraudulent behavior anduse datamining to help identify similar instances. “Examples: | autoinswurance: detect a group of people whostageaccidentsto collect oninsurance | money laundering: detect suspicious money transactions (US ‘Treasury's Financial Crimes Enforcement Network) | medicalinsurance: detect professional patientsandring of doctorsandring of references Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 20:
FraudDetection (2) “ Moreexamples: | Detectinginappropriatemedical treatment: | Australian HealthInsurance Commission identifies thatin many cases blanket screening tests were requested (save Australian $7m/yr). | Detecting telephonefraud: 0 Telephone call model: destination of the call, duration, time of day or week. Analyzepatterns that deviatefromanexpected norm. 0 British TeLecomidentifieddiscretegroupsof callerswith frequent intra-group calls, especially mobile phones,and brokea multimillion dollar fraud. | Retail: Analystsestimate that 38 %of retail shrinkis due to dishonest employees, Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi ‏سح‎

صفحه 21:
Other applications | Sports 1 IBM AdvancedScout analyzed NBA game statistics (shots Blocked, assists,andfouls) to gain competitiveadvantagefor New YorRKnicksandMiami Heat. 1 Astronomy | yrrandthe Palomar Observatory discovered 22 quasars with the help of datamining | Internet Web Surf-Aid 1 IBM Surf-Aidapplies data mining algorithms to Web access Cogsfor market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Webmarketing, improving Web site organization, etc. | Watchfor the PRIVACY pitfa(t! Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 22:
SS... Whatis KDD? Aprocess! | Theselectionandprocessing of datafor: ' theidentification of novel,accurate,and useful patterns,and ' themodeling of real-worldphenomena. ‎Datamining isamajor component of the‏ لأ ‎KDD process - automated discovery of‏ ‎patternsandthe development of predictive‏ ‎andexplanatory models.‏ ‎22 ‎annoaccadenicascesiso0» Introduzione Giannotti & Pedreschi

صفحه 23:
the XDD process Interpretatio: nd Evaluatio) $election an Rreprocessing 3 “ 2 E atterns | | Models + Prepared. Data eg ge Data Sources Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 24:
1 xDD stepscanbemergedor combined ! Data Cleaning +DataIntegration=Data Preprocessing | Data Sefection+Data Transformation =Data Consolidation | KDDisandIterative Process | art +engineering rather than science 24 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 25:
The virtuous cycle Identify Problemor Qpportunit Act on Knowledge Measure effect Strategy of Action Results 25 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 26:
The steps of the KDD process Learning theapplication domain: ! relevant prior Rnowledgeand goals ofapplication Dataconsolidation:Creatinga target dataset Selectionand Preprocessing | Data cleaning: (may take 60%of effort!) | Datareductionandprofection: | finduseful features dimensionality /variable reduction, invariant ‘representation. Choosing functions of datamining summarization, classification, regression,association, clustering, Choosing themining algorithms) Datamining:searchfor patterns ofinterest Interpretationandevaluation:analysis of results. | visualization, transformation, removing redundant patterns, .« ‘Use of discoveredknowledge Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 27:
Domain Experts Mining Specialists Data Administrator] “ Annoaccadenicascezi3000 Introduzione Giannotti & Pedreschi ‏ری‎

صفحه 28:
تأنالا 01011 عانتمالا ععنالات وبأ نأناء عأنااناً 55ع نا اناأكطا5 إل Data Mining Operational Business Business ‏و‎ ‎Data ۲ 1 ~ Basket Analysis Data iformation ۳ Fraud Detection Werehouse Warehouse Target Marketing ga BN. v Business Queries Extraction/ Replication Hs “ae Data Cleaning Meta Data Management Giannotti & Pedreschi ‏سح‎ Co Annoaccademicasoos/2000 Introduzione

صفحه 29:
The xDD process Interpretati d Evaluatio * Patterns & : Models anne Date SOUTCES » troduzione Giannotti & Pedrescht

صفحه 30:
Data consolidationand preparation GarbageinGarbage out I The quality of results relates directly to quality of thedata | 50%-70%of KDD process effort is spent on data consolidation andpreparation | Mafor justification for a corporate data warehouse 30 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 31:
5 7 ‏سم‎ on Fromdata sources toconsolidateddata repository S| — ‏ا‎ Sse Object/Relation DBMS Multidimensional DBMS ‘ ۱ Deductive Datei’ ۱ 2 Flat files 6 Giannotti & Pedreschi Annoaccademicasoos/2000 Introduzione

صفحه 32:
11( 0:60:05 00012011 | Determinepreliminary list ofattributes | Consolidate dataintoworking database ' InternalandExternal sources | Eliminate or estimate missing values | Remove outliers (obvious exceptions) | Determineprior probabilities of categoriesand deal with volume bias و Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 33:
The XDD process Interpretat: d Evaluatio; wd av. ۸ annoacadeicascoines Introduzione Giannotti & Pedreschi bb

صفحه 34:
Data selectionandpreprocessing I Generateaset ofexamples | choosesampling method ' consider sample complexity | dealwithvolumebiasissues | Reduceattribute dimensionality | remove redundant and/or correlating attributes | comBineattributes (sum,multiply, difference) | Reduceattributevalueranges ! group symbolic discretevatues 1! quantify continuous numericvatues | Transformdata | de-correlateandnormatizevalues | map time-series data tostaticrepresentation ۱ tools play key role 7 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 35:
The XDD process Interpretatio: d Evaluatio; election and reprocessing وه Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 36:
Datamining tasks andmethods | DirectedKnowledgeDiscovery ! Purpose: Explainvalue of somefieldin terms of all the others (goal-oriented) | Method: select the target fieldbasedon some hypothesis about thedatalaskthealgorithmtotellus how to predict or classify new instances ۱ Examples: | what products show increasedsalewhen cream cheeseis discounted ‏لقع تدع نه 07[ ونه طز قاع بن1سه :010 ع كننا 0 :نه :0111161 6 بقاع نه سد ذأ‎ ‘user ‏اما ها متام‎ 2 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 37:
OO Datamining tasks andmethods ‎undirected Knowledge Discovery‏ ذأ ‎(Explorative Methods)‏ ‎| Purpose: Findpatternsin the data that may be interesting (notarget filed) ۱3 rules (affinity grouping) ! Examples: J whichproductsin the catalog often sell together ‎| market segmentation (groups of customers/users withsimilar characteristics) ‎37 ‎Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 38:
Alternatively :Dataminingtasksandmethods | Automated Exploration/Discovery ' ¢g.. discovering new market segments | clusteringanalysis | Prediction/Classification | e.g..forecasting gross sales given current factors regression, newral networks,geneticalgorithms, decision trees ft rey \ | Explanation/Description ' eg..characterizing customers by demographics andpurchase history 5 | decision trees,association rules ‏انا‎ ‎andincome < $35k then... و Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 39:
Automatedexplorationand discovery I Clustering:partitioninga set of dataintoa set of classes, called clusters,whosemembers sharesome interesting common properties, ‎Distance-basednumerical clustering‏ لأ ‎| metricgrouping of examples (K-NN) ' graphical visualization can beused ۱ ‏ونتانأمت تأكنانك بلاتصياىء ينه‎ ‎| searchfor the number of classes which result in best fit of aprobability distributiontothedata ‎! AutoClass (NASA) one of best examples ‎39 ‎Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 40:
Predictionandclassification | cearning apredictivemodel ‏لأ‎ Classification of a new case/sample | Many methods: ! Artificial neural networks ! Inductive decisiontreeandrule systems 1١ Geneticalgorithms ! Nearest neighbor clustering algorithms ! Statistical (parametric,andnon-parametric) 40 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 41:
Generalizationandregression | The objective of learning is toachieve good generalizationtonew unseen cases. “ Generalization can be definedasamathematical interpolationor regressionover a set of training points “ Models can bevalidatedwithapreviouslyunseen test set or using cross-validation metfods a Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 42:
02320001 andprediction ! Classify data basedon thevalues ofa target attribute,e.g.,classify countries basedon climate, or classify cars basedon gasmileage. ! Alseobtainedmodel topredict someunknown or missing attributevalues basedon other information. a2 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 43:
Summarizing:inductive modeling > ۵ Objective: Developageneralmodel or hypothesis from specific examples 5 الال تنه وماقمس ووه مم5 0 | Classification (concept ‏ا‎ ‎recognition) x 43 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 44:
Explanationand description | Learna generalized hypothesis (model) from selecteddata | Description/Interpretation of model provides new knowledge | Affinity Grouping | Methods: ' Inductive decision treeandrufe systems ! Association rule systems ' LinkAnalysis 1 a Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 45:
Exception/deviation detection ‎Generateamodel of normal activity‏ لا ‎Deviation frommodel causes alert‏ “ ‎"Methods:‏ ‎| Artificial newral networks ‎۱ Inductive decision treeandrule systems ‎| Statisticalmethods ‎۱ Visualization tools ‎۶ ‎Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi {7

صفحه 46:
Outlier andexception data analysis | Time-series analysis (trendanddeviation): ! Trendanddeviation analysis: regression, sequential pattern, similar sequences,trendand deviation,e.g., stockanalysis. ! Similarity-basedpattern-directedanalysis | Pull vs. partial periodicity analysis | Otherpattern-directedor statisticalanalysis 46 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 47:
“7 Example: Moviegoer Database fmoviegoer_ID movie _10 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 48:
SELECT moviegoers.name, moviegoers.sex, moviegoers.age, sources.source, movies.name FROM movies, sources, moviegoers WHERE sources.source_ID = moviegoers.source ID AND movies.movie ID = moviegoers.movie_ID ORDER BY moviegoers. name; سس — ی سس ‎Day‏ مس ‎oP Obert‏ م ‎Oo‏ ‏موه 1۵ ‎Obert‏ 06 - مه مه 7 ‎ow Obert‏ - هه Ome 3 90 Obert ‏امس‎ اوه باه سق 1 ‎Obert‏ 06 م ‎Owe‏ ‏مه مه ‎vt 0 90 Obert‏ Gob 2 60 ‏.تمصت‎ Scharber’ Let Gram - ee Obert Guger Oop ‎Obert ddr‏ « 0 بسن ‎Caw 3 es Obert ‏و‎ ‎O.Pines The Orde‏ 90 م ون ‎Obert rem‏ ۵ مه ‎Cut 3 6 ‏بسسد ۵ لهه‎ Da) Want - €o ‏هه‎ ۳ Cre 0 29) D.Odwe Prepon 1 ‎Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi 1 ‎ ‎ ‎ ‎ ‎

صفحه 49:
Example: Moviegoer Da tabase | Classification | determine sex basedonage,source,andmovies seen | determine source basedon sex,age,andmovies seen | determinemost recent movie basedon past movies, age, sex,and sowrce | Estimation | forpredict, needa continuousvariable(e.g,,“age”) | predict ageasafunction of source,sex,andpast movies | ifwehada “rating’ fieldfor eachmoviegoer,wecowdpredict therating a new moviegoer gives toa movie basedon age, sex, past movies,etc. a3 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 50:
Example: Moviegoer Database U Clustering | findgroupings of movies that are often seen by the same people findgroupings of people that tend toseethesame movies clustering might reveal relationships that arenot necessarily recordedin the data (e.g.,wemay finda cluster that is dominated by people with young children {or a cluster of movies that correspondtoa particular genre) 50 annoaccadenicascesiso0» Introduzione Giannotti & Pedreschi

صفحه 51:
Example: Moviegoer Database I Association Rules | market basket analysis (MBA): “whichmoviesgotogether?” | needtocreate transactions’ for eachmoviegoer containing movies seen by that moviegoer: MO ‏مس‎ مر روط سم ۵۵0 سانا مهو رم ۰ مه 6 ‎O08‏ ‎Dey, (reps)‏ سم من متا ©0006 004: (Prerepemn, Cchanber's Les} | mayresult in association rules sucha: {“Phenomenon”, “The Birdcage”} {“Trainspotting”} {“Trainspotting”, “The Birdcage”} ==> {sex = “f"} 1111 a7 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 52:
Example: Moviegoer Database I sequence Analysis ۱ similar to MBA, but order inwhichitemsappearin thepatternisimportant | e.g.,peoplewho rent “The Birdcage” during avisit tendtorent “Trainspotting” in the next visit. 32 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

صفحه 53:
The XDD process Interpretatio: nd Evaluatio; Dp and Warehousing \ w ‏ل‎ 5 ia annoaccadenicascesiso0» Introduzione Giannotti & Pedreschi ort

صفحه 54:
Areall the discoveredpatterninteresting? | Adatamining system/query may generate thousands of patterns, not all of themareinteresting. “ Imterestingnessmeasures: | easifyunderstoodby humans | validon new or test data withsome degree of certainty. | potentiallyuseful | novel, orvalidates some hypothesis that auser seeks toconfirm | Objectivevs. subjectiveinterestingnessmeasures ! Offective:basedonstatisticsandstructures of patterns,e.g,, support, confidence, etc. | Subjective:basedonuser’s beliefsin the data,e.g.,unexpectedness, novelty,etc. annoaccadenicascesiso0» Introduzione Giannotti & Pedreschi

صفحه 55:
Interpretationandevaluation Evaluation | Statistical validationandsignificance testing | Qualitative review by expertsinthefield | Pilot surveys toevaluate model accuracy Interpretation | Inductivetreeandrulemodelscanbereaddirectly | Clustering results can begraphedandtabled | Codecanbeautomatically generated by some systems (IDTs, Regression models) 55 Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi

Tecniche di Data Mining Fosca Giannotti and Dino Pedreschi Pisa KDD Lab, CNUCE-CNR & Univ. Pisa http://www-kdd.cnuce.cnr.it/ DIPARTIMENTO DI INFORMATICA - Università di Pisa anno accademico 2002/2003 Tecniche di Data Mining  AA270 Tecniche di data mining  Corsi di Laurea Specialistica in Informatica e Tecnologie Informatiche  4I117 Basi di dati e sistemi informativi: tecniche di data mining per l’analisi dei dati  Corso di Laurea in Informatica (quinquennale, vecchio ordinamento)  Analisi dei dati ed estrazione di conoscenza  Corso di Laurea Specialistica in Informatica per l’Economia e per l’Azienda Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 2 Tecniche di Data Mining  Acronimo: TDM  Orario: Lunedi 11-13 aula E, Giovedi 14-16 aula B  Docente: Fosca Giannotti, CNUCE-CNR, f.giannotti@cnuce.cnr.it  Corso Integrativo: Dino Pedreschi, Dipartimento di Informatica, pedre@di.unipi.it  Ricevimento: Mercoledi 14-16  ISTI, Area Ricerca CNR, località San Cataldo, Pisa Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 3 Tecniche di Data Mining  Riferimenti bibliografici  Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2000 http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860489-8  U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (editors). Advances in Knowledge discovery and data mining, MIT Press, 1996.  David J. Hand, Heikki Mannila, Padhraic Smyth, Principles of Data Mining, MIT Press, 2001.  S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, ISBN 1-55860-754-4, 2002  I lucidi utilizzati nelle lezioni saranno resi disponibili attraverso il sito web del corso: http://www-kdd.cnuce.cnr.it/ Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 4 Questionario  Messaggio e-mail con subject: Corso TDM  Contenuto  Nome e Cognome………………..  e-mail:……………………………  anno immatricolazione:…………….  Corso di laurea :……………………..  Corsi di basi di dati:     Anno accademico, 2002/2003 Frequentati nei precedenti semestri: In questo semestre: Introduzione Giannotti & Pedreschi 5    Contenuti del corso Introduction and Basic concepts (2 ore)   Le applicazioni Il processo di knowledge discovery   Nozioni basiche di Data Warehousing Nozioni basiche di analisi multidimensionale dei dati Data Consolidation & Data Preparation (4 +2 esercitazione) Regole Associative(6 +4 esercitazione)       Regole intra-attributo, inter-attributo Calcolo efficiente di regole d'associazione: algoritmo Apriori e varianti Estensioni del concetto di regola d'associazione: tassonomie, regole quantitative, regole predittive. Regole associative e fattore Tempo: RdA Cicliche e Calendriche Pattern Sequenziali e Serie Temporali Basket Market Analysis utilizzando RdA Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 6 Contenuti del corso  Classificazione con alberi di decisione (6 ore +2 esercitazione)      Principali tecniche di classificazione Classificatori bayesiani Alberi di decisione Rassegna di altri metodi Applicazione al rilevamento di frodi  Clustering (2 ore +2 esercitazione)  Principali tecniche di clustering  Applicazione al Customer segmentation  Web Mining (4 ore)  Temi avanzati (6 ore seminari) Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 7 Modalità di valutazione  Esercizi durante il corso (o Orale): 30%  Seminario (o Progetto): 70% ?  Students should pair up in teams. They will receive the same credit as their partner. Division of labor is up to them.  Presentations should take 50 minutes, including 10 minutes for discussion. A presentation normally covers two or three closely related papers  Transparencies should be made available to the rest of the class---preferably in PDF or HTML format. Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 8 Course Outline Introduction and basic concepts   Motivations, applications, the KDD process, the techniques            Deeper into DM technology Association Rules and Market Basket Analysis Decision Trees and Fraud Detection Clustering and Customer Segmentation Deeper into Data Preparation Basic notion of Datawarehouse Selection and preprocessing Advanced Topics Scalable DM algorithms Data mining query languages Mining on Web Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 9 Evolution of Database Technology: from data management to data analysis  1960s:  Data collection, database creation, IMS and network DBMS.  1970s:  Relational data model, relational DBMS implementation.  1980s:  RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).  1990s:  Data mining and data warehousing, multimedia databases, and Web technology. Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 10 Motivations “Necessity is the Mother of Invention”  Data explosion problem:  Automated data collection tools, mature database technology and internet lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.  We are drowning in information, but starving for knowledge! (John Naisbett)  Data warehousing and data mining :  On-line analytical processing  Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases. Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 11 Motivations for DM  Abundance of business and industry data  Competitive focus - Knowledge Management  Inexpensive, powerful computing engines  Strong theoretical/mathematical foundations  machine learning & logic  statistics  database management systems Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 12 Sources of Data  Business Transactions  widespread use of bar codes => storage of millions of transactions daily (e.g., Walmart: 2000 stores => 20M transactions per day)  most important problem: effective use of the data in a reasonable time frame for competitive decision-making  e-commerce data  Scientific Data  data generated through multitude of experiments and observations  examples, geological data, satellite imaging data, NASA earth observations  rate of data collection far exceeds the speed by which we analyze the data  Financial Data  company information  economic data (GNP, price indexes, etc.)  stock markets Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 13 Sources of Data  Personal / Statistical Data      government census medical histories customer profiles demographic data data and statistics about sports and athletes  World Wide Web and Online Repositories email, news, messages Web documents, images, video, etc. link structure of of the hypertext from millions of Web sites Web usage data (from server logs, network traffic, and user registrations)  online databases, and digital libraries     Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 14 Classes of applications  Database analysis and decision support  Market analysis • target marketing, customer relation management, market basket analysis, cross selling, market segmentation.  Risk analysis • Forecasting, customer retention, improved underwriting, quality control, competitive analysis.  Fraud detection  Other Applications  Text (news group, email, documents) and Web analysis.  Intelligent Query Answering Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 15 Market Analysis  Where are the data sources for analysis?  Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.  Target marketing  Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.  Determine customer purchasing patterns over time  Conversion of single to a joint bank account: marriage, etc.  Cross-market analysis  Associations/co-relations between product sales  Prediction based on the association information. Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi MarketMarket Analysis and Management Analysis (2)  Customer profiling  data mining can tell you what types of customers buy what products (clustering or classification).  Identifying customer requirements  identifying the best products for different customers  use prediction to find what factors will attract new customers  Provides summary information  various multidimensional summary reports;  statistical summary information (data central tendency and variation) Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi Risk Analysis  Finance planning and asset evaluation:  cash flow analysis and prediction  contingent claim analysis to evaluate assets  cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)  Resource planning:  summarize and compare the resources and spending  Competition:  monitor competitors and market directions (CI: competitive intelligence).  group customers into classes and class-based pricing procedures  set pricing strategy in a highly competitive market Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi Fraud Detection  Applications:  widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.  Approach:  use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.  Examples:  auto insurance: detect a group of people who stage accidents to collect on insurance  money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)  medical insurance: detect professional patients and ring of doctors and ring of references Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi  More examples: Fraud Detection (2)  Detecting inappropriate medical treatment:  Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).  Detecting telephone fraud:  Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.  British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.  Retail: Analysts estimate that 38% of retail shrink is due to dishonest employees. Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi Other applications  Sports  IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat.  Astronomy  JPL and the Palomar Observatory discovered 22 quasars with the help of data mining  Internet Web Surf-Aid  IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.  Watch for the PRIVACY pitfall! Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi What is KDD? A process!  The selection and processing of data for:  the identification of novel, accurate, and useful patterns, and  the modeling of real-world phenomena.  Data mining is a major component of the KDD process - automated discovery of patterns and the development of predictive and explanatory models. Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 22 The KDD process Interpretation and Evaluation Data Mining Knowledge Selection and Preprocessing p(x)=0.02 Data Consolidation Warehouse Patterns & Models Prepared Data Consolidated Data Data Sources Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 23 The KDD Process in Practice  KDD steps can be merged or combined  Data Cleaning + Data Integration = Data Preprocessing  Data Selection + Data Transformation = Data Consolidation  KDD is and Iterative Process  art + engineering rather than science Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 24 The virtuous cycle 9 The KDD Process Interpretation and Evaluation Data Mining Problem Selection and Preprocessing Data Consolidation Knowledge p(x)=0.02 Patterns & Models Warehouse Prepared Data Consolidated Data Knowledge Data Sources CogNova Technologies Identify Problem or Opportunity Strategy Anno accademico, 2002/2003 Introduzione Act on Knowledge Measure effect of Action Giannotti & Pedreschi Results 25 The steps of the KDD process  Learning the application domain:  relevant prior knowledge and goals of application  Data consolidation: Creating a target data set  Selection and Preprocessing  Data cleaning : (may take 60% of effort!)  Data reduction and projection:  find useful features, dimensionality/variable reduction, invariant representation.  Choosing functions of data mining  summarization, classification, regression, association, clustering.  Choosing the mining algorithm(s)  Data mining: search for patterns of interest  Interpretation and evaluation: analysis of results.  visualization, transformation, removing redundant patterns, …  Use of discovered knowledge Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi Roles in the KDD process Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 27 A business intelligence environment Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 28 The KDD process Interpretation and Evaluation Data Mining Knowledge Selection and Preprocessing p(x)=0.02 Data Consolidation Warehouse Patterns & Models Prepared Data Consolidated Data Data Sources Introduzione Anno accademico, 2002/2003 Giannotti & Pedreschi 29 Data consolidation and preparation Garbage in Garbage out  The quality of results relates directly to quality of the data  50%-70% of KDD process effort is spent on data consolidation and preparation  Major justification for a corporate data warehouse Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 30 Data consolidation From data sources to consolidated data repository RDBMS Legacy DBMS Flat Files Data Consolidation and Cleansing External Anno accademico, 2002/2003 Introduzione Warehouse Object/Relation DBMS Multidimensional DBMS Deductive Database Flat files 31 Giannotti & Pedreschi Data consolidation  Determine preliminary list of attributes  Consolidate data into working database  Internal and External sources  Eliminate or estimate missing values  Remove outliers (obvious exceptions)  Determine prior probabilities of categories and deal with volume bias Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 32 The KDD process Interpretation and Evaluation Data Mining Selection and Preprocessing Knowledge p(x)=0.02 Data Consolidation Warehouse Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 33 Data selection and preprocessing  Generate a set of examples    choose sampling method consider sample complexity deal with volume bias issues  Reduce attribute dimensionality   remove redundant and/or correlating attributes combine attributes (sum, multiply, difference)  Reduce attribute value ranges   group symbolic discrete values quantify continuous numeric values  Transform data   de-correlate and normalize values map time-series data to static representation  OLAP and visualization tools play key role Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 34 The KDD process Interpretation and Evaluation Data Mining Selection and Preprocessing Knowledge p(x)=0.02 Data Consolidation Warehouse Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 35 Data mining tasks and methods  Directed Knowledge Discovery  Purpose: Explain value of some field in terms of all the others (goal-oriented)  Method: select the target field based on some hypothesis about the data; ask the algorithm to tell us how to predict or classify new instances  Examples: what products show increased sale when cream cheese is discounted  which banner ad to use on a web page for a given user coming to the site Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 36 Data mining tasks and methods  Undirected Knowledge Discovery (Explorative Methods)  Purpose: Find patterns in the data that may be interesting (no target filed)  Method: clustering, association rules (affinity grouping)  Examples:  which products in the catalog often sell together  market segmentation (groups of customers/users with similar characteristics) Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 37 Alternatively:Data mining tasks and methods  Automated Exploration/Discovery e.g.. discovering new market segments  clustering analysis  x2 x1  Prediction/Classification e.g.. forecasting gross sales given current factors  regression, neural networks, genetic algorithms, decision trees   Explanation/Description f(x) x e.g.. characterizing customers by demographics and purchase history if age > 35  decision trees, association rules  and income < $35k then ... Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 38 Automated exploration and discovery  Clustering: partitioning a set of data into a set of classes, called clusters, whose members share some interesting common properties.  Distance-based numerical clustering   metric grouping of examples (K-NN) graphical visualization can be used  Bayesian clustering   search for the number of classes which result in best fit of a probability distribution to the data AutoClass (NASA) one of best examples Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 39 Prediction and classification Learning a predictive model  Classification of a new case/sample  Many methods:      Artificial neural networks Inductive decision tree and rule systems Genetic algorithms Nearest neighbor clustering algorithms Statistical (parametric, and non-parametric) Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 40 Generalization and regression  The objective of learning is to achieve good generalization to new unseen cases.  Generalization can be defined as a mathematical interpolation or regression over a set of training points  Models can be validated with a previously unseen test set or using cross-validation methods f(x) x Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 41 Classification and prediction  Classify data based on the values of a target attribute, e.g., classify countries based on climate, or classify cars based on gas mileage.  Use obtained model to predict some unknown or missing attribute values based on other information. Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 42 Summarizing: inductive modeling = learning Objective: Develop a general model or hypothesis from specific examples  Function approximation (curve fitting) f(x) x  Classification (concept learning, pattern recognition) A x2 B x1 Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 43 Explanation and description  Learn a generalized hypothesis (model) from selected data  Description/Interpretation of model provides new knowledge  Affinity Grouping  Methods:     Inductive decision tree and rule systems Association rule systems Link Analysis … Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 44 Exception/deviation detection  Generate a model of normal activity  Deviation from model causes alert  Methods:     Artificial neural networks Inductive decision tree and rule systems Statistical methods Visualization tools Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 45 Outlier and exception data analysis  Time-series analysis (trend and deviation):  Trend and deviation analysis: regression, sequential pattern, similar sequences, trend and deviation, e.g., stock analysis.  Similarity-based pattern-directed analysis  Full vs. partial periodicity analysis  Other pattern-directed or statistical analysis Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 46 Example: Moviegoer Database Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 47 Example: Moviegoer Database SELECT SELECT moviegoers.name, moviegoers.name, moviegoers.sex, moviegoers.sex, moviegoers.age, moviegoers.age, sources.source, movies.name sources.source, movies.name FROM FROM movies, movies, sources, sources, moviegoers moviegoers WHERE sources.source_ID WHERE sources.source_ID == moviegoers.source_ID moviegoers.source_ID AND AND movies.movie_ID = moviegoers.movie_ID movies.movie_ID = moviegoers.movie_ID ORDER BY ORDER BY moviegoers.name; moviegoers.name; moviegoers.name sex age source movies.name Amy Andrew Andy Anne Ansje Beth Bob Brian Candy Cara Cathy Charles Curt David Erica f m m f f f m m f f f m m m f 27 25 34 30 25 30 51 23 29 25 39 25 30 40 23 Oberlin Oberlin Oberlin Oberlin Oberlin Oberlin Pinewoods Oberlin Oberlin Oberlin Mt. Auburn Oberlin MRJ MRJ Mt. Auburn Independence Day 12 Monkeys The Birdcage Trainspotting I Shot Andy Warhol Chain Reaction Schindler's List Super Cop Eddie Phenomenon The Birdcage Kingpin T2 Judgment Day Independence Day Trainspotting 48 Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi Example: Moviegoer Database  Classification  determine sex based on age, source, and movies seen  determine source based on sex, age, and movies seen  determine most recent movie based on past movies, age, sex, and source  Estimation  for predict, need a continuous variable (e.g., “age”)  predict age as a function of source, sex, and past movies  if we had a “rating” field for each moviegoer, we could predict the rating a new moviegoer gives to a movie based on age, sex, past movies, etc. Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 49 Example: Moviegoer Database  Clustering  find groupings of movies that are often seen by the same people  find groupings of people that tend to see the same movies  clustering might reveal relationships that are not necessarily recorded in the data (e.g., we may find a cluster that is dominated by people with young children; or a cluster of movies that correspond to a particular genre) Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 50 Example: Moviegoer Database  Association Rules  market basket analysis (MBA): “which movies go together?”  need to create “transactions” for each moviegoer containing movies seen by that moviegoer: name TID Transaction Amy Andrew Andy Anne … 001 002 003 004 … {Independence Day, Trainspotting} {12 Monkeys, The Birdcage, Trainspotting, Phenomenon} {Super Cop, Independence Day, Kingpin} {Trainspotting, Schindler's List} ...  may result in association rules such as: {“Phenomenon”, “The Birdcage”} ==> {“Trainspotting”} {“Trainspotting”, “The Birdcage”} ==> {sex = “f”} Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 51 Example: Moviegoer Database  Sequence Analysis  similar to MBA, but order in which items appear in the pattern is important  e.g., people who rent “The Birdcage” during a visit tend to rent “Trainspotting” in the next visit. Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 52 The KDD process Interpretation and Evaluation Data Mining Selection and Preprocessing Knowledge p(x)=0.02 Data Consolidation and Warehousing Warehouse Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 53 Are all the discovered pattern interesting?  A data mining system/query may generate thousands of patterns, not all of them are interesting.  Interestingness measures:     easily understood by humans valid on new or test data with some degree of certainty. potentially useful novel, or validates some hypothesis that a user seeks to confirm  Objective vs. subjective interestingness measures  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.  Subjective: based on user’s beliefs in the data, e.g., unexpectedness, novelty, etc. Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi Interpretation and evaluation Evaluation  Statistical validation and significance testing  Qualitative review by experts in the field  Pilot surveys to evaluate model accuracy Interpretation  Inductive tree and rule models can be read directly  Clustering results can be graphed and tabled  Code can be automatically generated by some systems (IDTs, Regression models) Anno accademico, 2002/2003 Introduzione Giannotti & Pedreschi 55

51,000 تومان