صفحه 1:
Tecniche di Data Mining
Fosca Giannotti and
Dino Pedreschi
Pisa KDD Lab, CNUCE-CNR & Univ. Pisa
http://www-kdd.cnuce.cnrit/
DIPARTIMENTO DI INFORMATICA - Universita di Pisa
anno accademico 2002/2003
صفحه 2:
Tecniche di Data Mining
1 ®@CPO Tevacke di deta whic
| Corsi d beers Gperidisics tt Toforwuires اس
‘Porwutck=
0 6۳100۶0 Bust di dat سای :رهب موه و di dota
هو بو Pocutst det dat
© Corse dt Lowes to ToPorwwutce (quiquecme, verchic
prices)
Bante det det ed epteuzicae میس ول
1 Opreo dh Lewes Gpecidisice tt FoPorwutce per (Ervarwia
pe per !@zenk
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 3:
Tecniche di Data Mining
1 Acronimo: TDM
:واعقء0 اش( Lunedi 11-13 aula E, 610۷601 14-6
aula B
| Docente: Fosca Giannotti, CNUCE-CNR,
f.giannotti@cnuce.cnr.it
| Corso Integrativo: Dino Pedreschi,
Dipartimento di Informatica, pedre@di.unipi.it
Ricevimento: Mercoledi 14-16
| ISTI, Area Ricerca CNR, localita San Cataldo, Pisa
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 4:
Tecniche di Data Mining
I RP erioent bolo Pet
%deavet Lor, Dicker (Comber, Osta امه سوه( رت( Meche,
Dorcas (ener Pubbshers, 0
هط سوام واه منوا 16۵ 0266660-
60-6
KO. Parnn, ©. Pricisty-Shopiry, ®. Gah, 7. Dirnveuny (editors).
dbenwer in Komwlecke dovovery ond chia wisien, OTD Press, (890.
ead J. Meret, WLebht Dorada, Padhrais Seo, Priecipbes oP Dot
رم OW Prove, DOA.
©, Ckohrebart, ren he (Deb: Disrovertiy اوه Prox مجو رلا"
Wer0, Dorgaa keaParenm, SBD (-SSE80-PSF-F, CDOS
D4 foot waltz zat celle leziodt sarap resi dspoubt utrwersr ft sie
web del corsv:
ون یی اج( سسمبا ند
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 5:
ONATLO
I mMessaggio e-mail con subject:CorsoTDM.
| Contenuto
| NomeeCognome...
7 e-mail
" annoimmatricofazione:
5 Corso diCawrea :,
© Corsi dibasidi dati:
۲۰ Frequentati neiprecedenti semestri:
(+ mmquestosemestre:
annoaccadenicasoos/2008 Introduzione Giannotti & Pedreschi
صفحه 6:
Contenuti del corso
I IntroductionandBasicconcepts(2 ore)
> Leapplicaziont
> Iprocessodi knowledge discovery
1 DataConsolidation &Data Preparation (4 +2 esercitazione)
+ Nozionibasiche di Data Warehousing
+ Nozioni basiche dianalisimultidimensionale dei dati
1 Regole Associatives +4 esercitazione)
* Regoleintra-attributo,inter-attributo
~ Calcolo efficiente di regoleLassociazione:algoritmo Apriorie
varianti
* Estensioni del concetto di regola dassociazione:tassonomie,regole
quantitative, regolepredittive,
Regoleassociativeefattore Tempo: RAA Cicliche eCalendriche
> Pattern SequenzialieSerieTemporali
2 Basket Market Analysisutilizzando RIA
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 7:
Contenuti del corso
Classificazione con alberi di decisione (6 ore +2 esercitazione)
> Principalitecniche di classificazione
> Classificatori bayesiant
* Aberi di decisione
> Rassegnadialtrimetodi
> Applicazioneal rilevamento di frodi
1 Clustering (2 ore+z2 esercitazione)
© Principalitecniche di clustering
> Apylicazioneal Customer segmentation
۲ Web Mining (gore)
1 Temi avanzati (6 oreseminari)
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 8:
Modalit divalutazione
| €sercizi durante il corso(oOrale):30%
| Seminario (oProgetto):70%?
| students shouldpairupinteams. They will receive thesame
credit as their partner. Division of Caborisuptothem.
Presentations show dtake s5ominutes, including 10minutes
for discussion. Apresentation normally covers twoor three
closely relatedpapers
Transparencies showldbemadeavailable tothe rest of the
class---preferablyin PDF or HIML format.
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 9:
Course Outline
1 Imtroductionandbasic concepts
| mMotivations,applications, the KDD process, the techniques
| DeeperintoDM technology
1 Association Rules and Market Basket Analysis
1 Decision TreesandFraudDetection
| ClusteringandCustomer Segmentation
| DeeperintoData Preparation
1 Basicnotion of Datawarefiouse
| Selectionandpreprocessing
1 AdvancedTopics
1 ScalableDM algorithms
1 Datamining query languages
1 Miningon Web =
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi —
صفحه 10:
fromdatamanagement to data analysis
| 1960s:
| Data coltection, database creation IMS andnetworkDBMS.
| 4970s:
| Relational datamodel,relational DBMS implementation.
| 4980s:
۱ (extended-relational,00,deductive,etc.)
andapplication-oriented DBMS (spatial, scientific, engineering, etc.).
“49908:
| Datamininganddatawarehousing, multimedia databasesand Web
technology.
70
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 11:
“Necessity isthe Mother of Invention”
1 Data explosion problem:
| automateddata collection toolsmature database technologyand
internet Ceadtotremendous amounts of data storedin databases,
data warehouses andother information repositories.
| Wearedrowning in information, but starving for knowledge! ohn
‘Naisbett)
9 وم مه
1 On-Cineanalytical processing
| Extraction ofinteresting knowledge (rules, regularities,
patterns,constraints)fromdatainlargedatabases.
”
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 12:
20 ott watt ons 5 DM
| Abundance of businessandindustry data
" Competitive focus - Knowledge Management
" Imexpensive,powerful computing engines
“ Strong theoretical/mathematical
foundations
' machinelearning & Cogic
' statistics
' databasemanagement systems
92
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 13:
Sources 0 ‘Data
I Business Transactions
| widespreaduse of bar codes => storage ofmillions of transactions
daily (e.g.,Walmart:2000 stores => 20M transactions per day)
| most important problem:effectiveuse of thedatainareasonable
timeframe for competitive decision-making
|
u SclentificData
| datageneratedthroughmultitude of experimentsandobservations
| examples, geological data,satelliteimaging data, NASAcarth
observations
۱ ateofdata collection far exceedsthespeedbywhichweanalyse the
ita
Financial Data
| companyinformation
| economic data (GNP,priceindexes,etc.)
| stockmarkets
a
annoaccadenicascesiso0» Introduzione Giannotti & Pedreschi
صفحه 14:
ا
Sources of Data
I Personal / Statistical Data
government census
medical histories
customer profiles
demographic data
dataandstatistics about sportsandathletes
orld Wide Web andOnline Repositories
email, news,messages
Web documents, images, video, etc.
Cink structure of of the hypertext frommiltions of Web sites
Webusage data (fromserver Cogs, network traffic.anduser
registrations)
online databases,and digital libraries
د
7
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 15:
Classes of applications
I Databaseanalysisanddecision support
| Market analysis
* target marketing,customer relation management, market basket
anaCysis,cross sefCing, market segmentation,
| Riskanalysis
* Forecasting, customer retention, improvedunderwriting, quality
control, competitiveanalysis.
0
| Other Applications
| Text (news group,email,documents)and Web analysis.
! Intelligent Query Answering
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi 0
صفحه 16:
Market analysts
I Wherearethe data sourcesfor analysis?
! Credit cardtransactions,Coyalty cards, discount coupons,
customer complaint calls,plus (public) lifestyle studies.
Target marketing ا
| Find clusters of “model” customers who share the same
characteristics:interest,income level, spending fabits,etc.
“ Determine customer purchasing patterns over time
! Conversion ofsingletoafoint bankaccount:marriage,etc.
“ Cross-market analysis
! Associations/co-reCations between product sales
| Prediction basedon the associationinformation.
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 17:
OO
Market Analysis (2)
| Customer profiling
| dataminingcan tell youwhat types of customers buy what
products (clustering or classification).
| Identifying customer requirements
! identifying the best products for different customers
| useprediction tofindwhatfactorswill attract new
customers
| Provides summary information
| yariousmultidimensional summary reports!
| statistical summary information (data central tendency
andvariation)
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 18:
se amatysis
| Financeplanningandasset evaluation:
| cashflow analysisandprediction
| contingent claimanalysistoevaluateassets
! cross-sectional andtimeseriesanalysis (financial-ratio,trend
analysis,etc.)
| Resourceplanning:
! swmmarizeandcompare the resourcesandspending
| Competition:
| monitor competitorsandmarket directions (CI: competitive
intelligence).
| group customersintoclassesandclass-hasedpricing procedures
| setpricing strategy ina highly competitivemarket
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 19:
FraudDetecti on
“ Applications:
| widelyusedin healthcare, retail,credit cardservices,
teLecommunications (phonecardfraud),etc.
“Approach:
| usehistorical data to buildmodels of fraudulent behavior anduse
datamining to help identify similar instances.
“Examples:
| autoinswurance: detect a group of people whostageaccidentsto
collect oninsurance
| money laundering: detect suspicious money transactions (US
‘Treasury's Financial Crimes Enforcement Network)
| medicalinsurance: detect professional patientsandring of
doctorsandring of references
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 20:
FraudDetection (2)
“ Moreexamples:
| Detectinginappropriatemedical treatment:
| Australian HealthInsurance Commission identifies thatin
many cases blanket screening tests were requested (save
Australian $7m/yr).
| Detecting telephonefraud:
0 Telephone call model: destination of the call, duration, time of
day or week. Analyzepatterns that deviatefromanexpected
norm.
0 British TeLecomidentifieddiscretegroupsof callerswith
frequent intra-group calls, especially mobile phones,and
brokea multimillion dollar fraud.
| Retail: Analystsestimate that 38 %of retail shrinkis due to
dishonest employees,
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi سح
صفحه 21:
Other applications
| Sports
1 IBM AdvancedScout analyzed NBA game statistics (shots
Blocked, assists,andfouls) to gain competitiveadvantagefor
New YorRKnicksandMiami Heat.
1 Astronomy
| yrrandthe Palomar Observatory discovered 22 quasars with
the help of datamining
| Internet Web Surf-Aid
1 IBM Surf-Aidapplies data mining algorithms to Web access
Cogsfor market-related pages to discover customer preference
and behavior pages, analyzing effectiveness of Webmarketing,
improving Web site organization, etc.
| Watchfor the PRIVACY pitfa(t!
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 22:
SS...
Whatis KDD? Aprocess!
| Theselectionandprocessing of datafor:
' theidentification of novel,accurate,and
useful patterns,and
' themodeling of real-worldphenomena.
Datamining isamajor component of the لأ
KDD process - automated discovery of
patternsandthe development of predictive
andexplanatory models.
22
annoaccadenicascesiso0» Introduzione Giannotti & Pedreschi
صفحه 23:
the XDD process
Interpretatio:
nd Evaluatio)
$election an
Rreprocessing
3 “
2 E atterns
| | Models
+ Prepared. Data
eg ge
Data Sources
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 24:
1 xDD stepscanbemergedor combined
! Data Cleaning +DataIntegration=Data Preprocessing
| Data Sefection+Data Transformation =Data Consolidation
| KDDisandIterative Process
| art +engineering rather than science
24
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 25:
The virtuous cycle
Identify
Problemor
Qpportunit
Act on
Knowledge
Measure effect
Strategy of Action Results
25
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 26:
The steps of the KDD process
Learning theapplication domain:
! relevant prior Rnowledgeand goals ofapplication
Dataconsolidation:Creatinga target dataset
Selectionand Preprocessing
| Data cleaning: (may take 60%of effort!)
| Datareductionandprofection:
| finduseful features dimensionality /variable reduction, invariant
‘representation.
Choosing functions of datamining
summarization, classification, regression,association, clustering,
Choosing themining algorithms)
Datamining:searchfor patterns ofinterest
Interpretationandevaluation:analysis of results.
| visualization, transformation, removing redundant patterns, .«
‘Use of discoveredknowledge
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 27:
Domain
Experts
Mining
Specialists
Data
Administrator]
“
Annoaccadenicascezi3000 Introduzione Giannotti & Pedreschi ری
صفحه 28:
تأنالا 01011 عانتمالا ععنالات وبأ نأناء عأنااناً 55ع نا اناأكطا5 إل
Data Mining
Operational Business Business و
Data ۲ 1 ~ Basket Analysis
Data iformation ۳ Fraud Detection
Werehouse Warehouse Target Marketing
ga BN.
v Business Queries
Extraction/ Replication Hs “ae
Data Cleaning
Meta Data Management
Giannotti & Pedreschi سح
Co
Annoaccademicasoos/2000 Introduzione
صفحه 29:
The xDD process
Interpretati
d Evaluatio
* Patterns &
: Models
anne Date SOUTCES » troduzione Giannotti & Pedrescht
صفحه 30:
Data consolidationand preparation
GarbageinGarbage out
I The quality of results relates directly to quality of
thedata
| 50%-70%of KDD process effort is spent on data
consolidation andpreparation
| Mafor justification for a corporate data warehouse
30
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 31:
5 7 سم on
Fromdata sources toconsolidateddata
repository
S| — ا Sse
Object/Relation DBMS
Multidimensional DBMS
‘ ۱ Deductive Datei’ ۱
2 Flat files 6
Giannotti & Pedreschi
Annoaccademicasoos/2000 Introduzione
صفحه 32:
11( 0:60:05 00012011
| Determinepreliminary list ofattributes
| Consolidate dataintoworking database
' InternalandExternal sources
| Eliminate or estimate missing values
| Remove outliers (obvious exceptions)
| Determineprior probabilities of categoriesand
deal with volume bias
و
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 33:
The XDD process
Interpretat:
d Evaluatio;
wd
av. ۸
annoacadeicascoines Introduzione Giannotti & Pedreschi bb
صفحه 34:
Data selectionandpreprocessing
I Generateaset ofexamples
| choosesampling method
' consider sample complexity
| dealwithvolumebiasissues
| Reduceattribute dimensionality
| remove redundant and/or correlating attributes
| comBineattributes (sum,multiply, difference)
| Reduceattributevalueranges
! group symbolic discretevatues
1! quantify continuous numericvatues
| Transformdata
| de-correlateandnormatizevalues
| map time-series data tostaticrepresentation
۱ tools play key role
7
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 35:
The XDD process
Interpretatio:
d Evaluatio;
election and
reprocessing
وه
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 36:
Datamining tasks andmethods
| DirectedKnowledgeDiscovery
! Purpose: Explainvalue of somefieldin terms of all the
others (goal-oriented)
| Method: select the target fieldbasedon some hypothesis
about thedatalaskthealgorithmtotellus how to
predict or classify new instances
۱ Examples:
| what products show increasedsalewhen cream
cheeseis discounted
لقع تدع نه 07[ ونه طز قاع بن1سه :010 ع كننا 0 :نه :0111161 6 بقاع نه سد ذأ
‘user اما ها متام
2
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 37:
OO
Datamining tasks andmethods
undirected Knowledge Discovery ذأ
(Explorative Methods)
| Purpose: Findpatternsin the data that may be
interesting (notarget filed)
۱3 rules (affinity
grouping)
! Examples:
J whichproductsin the catalog often sell together
| market segmentation (groups of customers/users
withsimilar characteristics)
37
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 38:
Alternatively :Dataminingtasksandmethods
| Automated Exploration/Discovery
' ¢g.. discovering new market segments
| clusteringanalysis
| Prediction/Classification
| e.g..forecasting gross sales given current factors
regression, newral networks,geneticalgorithms,
decision trees ft
rey \
| Explanation/Description
' eg..characterizing customers by demographics
andpurchase history 5
| decision trees,association rules انا
andincome < $35k
then...
و
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 39:
Automatedexplorationand discovery
I Clustering:partitioninga set of dataintoa set of classes,
called clusters,whosemembers sharesome interesting
common properties,
Distance-basednumerical clustering لأ
| metricgrouping of examples (K-NN)
' graphical visualization can beused
۱ ونتانأمت تأكنانك بلاتصياىء ينه
| searchfor the number of classes which result in best fit of
aprobability distributiontothedata
! AutoClass (NASA) one of best examples
39
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 40:
Predictionandclassification
| cearning apredictivemodel
لأ Classification of a new case/sample
| Many methods:
! Artificial neural networks
! Inductive decisiontreeandrule systems
1١ Geneticalgorithms
! Nearest neighbor clustering algorithms
! Statistical (parametric,andnon-parametric)
40
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 41:
Generalizationandregression
| The objective of learning is toachieve good
generalizationtonew unseen cases.
“ Generalization can be definedasamathematical
interpolationor regressionover a set of training
points
“ Models can bevalidatedwithapreviouslyunseen
test set or using cross-validation metfods
a
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 42:
02320001 andprediction
! Classify data basedon thevalues ofa target
attribute,e.g.,classify countries basedon climate,
or classify cars basedon gasmileage.
! Alseobtainedmodel topredict someunknown or
missing attributevalues basedon other
information.
a2
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 43:
Summarizing:inductive modeling > ۵
Objective: Developageneralmodel or
hypothesis from specific examples
5 الال تنه وماقمس ووه مم5 0
| Classification (concept ا
recognition)
x
43
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 44:
Explanationand description
| Learna generalized hypothesis (model) from
selecteddata
| Description/Interpretation of model provides new
knowledge
| Affinity Grouping
| Methods:
' Inductive decision treeandrufe systems
! Association rule systems
' LinkAnalysis
1
a
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 45:
Exception/deviation detection
Generateamodel of normal activity لا
Deviation frommodel causes alert “
"Methods:
| Artificial newral networks
۱ Inductive decision treeandrule systems
| Statisticalmethods
۱ Visualization tools
۶
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi {7
صفحه 46:
Outlier andexception data analysis
| Time-series analysis (trendanddeviation):
! Trendanddeviation analysis: regression,
sequential pattern, similar sequences,trendand
deviation,e.g., stockanalysis.
! Similarity-basedpattern-directedanalysis
| Pull vs. partial periodicity analysis
| Otherpattern-directedor statisticalanalysis
46
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 47:
“7
Example: Moviegoer Database
fmoviegoer_ID
movie _10
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 48:
SELECT moviegoers.name, moviegoers.sex, moviegoers.age,
sources.source, movies.name
FROM movies, sources, moviegoers
WHERE sources.source_ID = moviegoers.source ID AND
movies.movie ID = moviegoers.movie_ID
ORDER BY moviegoers. name;
سس — ی سس
Day مس oP Obert م Oo
موه 1۵ Obert 06 - مه
مه 7 ow Obert - هه
Ome 3 90 Obert امس
اوه باه سق 1 Obert 06 م Owe
مه مه vt 0 90 Obert
Gob 2 60 .تمصت Scharber’ Let
Gram - ee Obert Guger Oop
Obert ddr « 0 بسن
Caw 3 es Obert و
O.Pines The Orde 90 م ون
Obert rem ۵ مه
Cut 3 6 بسسد ۵ لهه Da)
Want - €o هه ۳
Cre 0 29) D.Odwe Prepon 1
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi 1
صفحه 49:
Example: Moviegoer Da tabase
| Classification
| determine sex basedonage,source,andmovies seen
| determine source basedon sex,age,andmovies seen
| determinemost recent movie basedon past movies, age, sex,and
sowrce
| Estimation
| forpredict, needa continuousvariable(e.g,,“age”)
| predict ageasafunction of source,sex,andpast movies
| ifwehada “rating’ fieldfor eachmoviegoer,wecowdpredict
therating a new moviegoer gives toa movie basedon age, sex,
past movies,etc.
a3
Annoaccademicasoos/2000 Introduzione
Giannotti & Pedreschi
صفحه 50:
Example: Moviegoer Database
U Clustering
| findgroupings of movies that are often seen by the
same people
findgroupings of people that tend toseethesame
movies
clustering might reveal relationships that arenot
necessarily recordedin the data (e.g.,wemay finda
cluster that is dominated by people with young
children {or a cluster of movies that correspondtoa
particular genre)
50
annoaccadenicascesiso0» Introduzione Giannotti & Pedreschi
صفحه 51:
Example: Moviegoer Database
I Association Rules
| market basket analysis (MBA): “whichmoviesgotogether?”
| needtocreate transactions’ for eachmoviegoer containing
movies seen by that moviegoer:
MO مس
مر روط سم ۵۵0
سانا مهو رم ۰ مه 6 O08
Dey, (reps) سم من متا ©0006
004: (Prerepemn, Cchanber's Les}
| mayresult in association rules sucha:
{“Phenomenon”, “The Birdcage”} {“Trainspotting”}
{“Trainspotting”, “The Birdcage”} ==> {sex = “f"}
1111
a7
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 52:
Example: Moviegoer Database
I sequence Analysis
۱ similar to MBA, but order inwhichitemsappearin
thepatternisimportant
| e.g.,peoplewho rent “The Birdcage” during avisit
tendtorent “Trainspotting” in the next visit.
32
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
صفحه 53:
The XDD process
Interpretatio:
nd Evaluatio;
Dp
and Warehousing
\
w
ل
5 ia
annoaccadenicascesiso0» Introduzione Giannotti & Pedreschi ort
صفحه 54:
Areall the discoveredpatterninteresting?
| Adatamining system/query may generate thousands of
patterns, not all of themareinteresting.
“ Imterestingnessmeasures:
| easifyunderstoodby humans
| validon new or test data withsome degree of certainty.
| potentiallyuseful
| novel, orvalidates some hypothesis that auser seeks toconfirm
| Objectivevs. subjectiveinterestingnessmeasures
! Offective:basedonstatisticsandstructures of patterns,e.g,,
support, confidence, etc.
| Subjective:basedonuser’s beliefsin the data,e.g.,unexpectedness,
novelty,etc.
annoaccadenicascesiso0» Introduzione Giannotti & Pedreschi
صفحه 55:
Interpretationandevaluation
Evaluation
| Statistical validationandsignificance testing
| Qualitative review by expertsinthefield
| Pilot surveys toevaluate model accuracy
Interpretation
| Inductivetreeandrulemodelscanbereaddirectly
| Clustering results can begraphedandtabled
| Codecanbeautomatically generated by some
systems (IDTs, Regression models)
55
Annoaccademicasoos/2000 Introduzione Giannotti & Pedreschi
Tecniche di Data Mining
Fosca Giannotti and
Dino Pedreschi
Pisa KDD Lab, CNUCE-CNR & Univ. Pisa
http://www-kdd.cnuce.cnr.it/
DIPARTIMENTO DI INFORMATICA - Università di Pisa
anno accademico 2002/2003
Tecniche di Data Mining
AA270 Tecniche di data mining
Corsi di Laurea Specialistica in Informatica e Tecnologie
Informatiche
4I117 Basi di dati e sistemi informativi: tecniche di data
mining per l’analisi dei dati
Corso di Laurea in Informatica (quinquennale, vecchio
ordinamento)
Analisi dei dati ed estrazione di conoscenza
Corso di Laurea Specialistica in Informatica per l’Economia
e per l’Azienda
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
2
Tecniche di Data Mining
Acronimo:
TDM
Orario:
Lunedi 11-13 aula E, Giovedi 14-16
aula B
Docente: Fosca Giannotti, CNUCE-CNR,
f.giannotti@cnuce.cnr.it
Corso Integrativo: Dino Pedreschi,
Dipartimento di Informatica, pedre@di.unipi.it
Ricevimento:
Mercoledi 14-16
ISTI, Area Ricerca CNR, località San Cataldo, Pisa
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
3
Tecniche di Data Mining
Riferimenti bibliografici
Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques,
Morgan Kaufmann Publishers, 2000
http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860489-8
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (editors).
Advances in Knowledge discovery and data mining, MIT Press, 1996.
David J. Hand, Heikki Mannila, Padhraic Smyth, Principles of Data
Mining, MIT Press, 2001.
S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext
Data, Morgan Kaufmann, ISBN 1-55860-754-4, 2002
I lucidi utilizzati nelle lezioni saranno resi disponibili attraverso il sito
web del corso:
http://www-kdd.cnuce.cnr.it/
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
4
Questionario
Messaggio e-mail con subject: Corso TDM
Contenuto
Nome e Cognome………………..
e-mail:……………………………
anno immatricolazione:…………….
Corso di laurea :……………………..
Corsi di basi di dati:
Anno accademico, 2002/2003
Frequentati nei precedenti semestri:
In questo semestre:
Introduzione
Giannotti & Pedreschi
5
Contenuti del corso
Introduction and Basic concepts (2 ore)
Le applicazioni
Il processo di knowledge discovery
Nozioni basiche di Data Warehousing
Nozioni basiche di analisi multidimensionale dei dati
Data Consolidation & Data Preparation (4 +2 esercitazione)
Regole Associative(6 +4 esercitazione)
Regole intra-attributo, inter-attributo
Calcolo efficiente di regole d'associazione: algoritmo Apriori e
varianti
Estensioni del concetto di regola d'associazione: tassonomie, regole
quantitative, regole predittive.
Regole associative e fattore Tempo: RdA Cicliche e Calendriche
Pattern Sequenziali e Serie Temporali
Basket Market Analysis utilizzando RdA
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
6
Contenuti del corso
Classificazione con alberi di decisione (6 ore +2 esercitazione)
Principali tecniche di classificazione
Classificatori bayesiani
Alberi di decisione
Rassegna di altri metodi
Applicazione al rilevamento di frodi
Clustering (2 ore +2 esercitazione)
Principali tecniche di clustering
Applicazione al Customer segmentation
Web Mining (4 ore)
Temi avanzati (6 ore seminari)
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
7
Modalità di valutazione
Esercizi durante il corso (o Orale): 30%
Seminario (o Progetto): 70% ?
Students should pair up in teams. They will receive the same
credit as their partner. Division of labor is up to them.
Presentations should take 50 minutes, including 10 minutes
for discussion. A presentation normally covers two or three
closely related papers
Transparencies should be made available to the rest of the
class---preferably in PDF or HTML format.
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
8
Course Outline
Introduction and basic concepts
Motivations, applications, the KDD process, the techniques
Deeper into DM technology
Association Rules and Market Basket Analysis
Decision Trees and Fraud Detection
Clustering and Customer Segmentation
Deeper into Data Preparation
Basic notion of Datawarehouse
Selection and preprocessing
Advanced Topics
Scalable DM algorithms
Data mining query languages
Mining on Web
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
9
Evolution of Database Technology:
from data management to data analysis
1960s:
Data collection, database creation, IMS and network DBMS.
1970s:
Relational data model, relational DBMS implementation.
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
and application-oriented DBMS (spatial, scientific, engineering, etc.).
1990s:
Data mining and data warehousing, multimedia databases, and Web
technology.
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
10
Motivations
“Necessity is the Mother of Invention”
Data explosion problem:
Automated data collection tools, mature database technology and
internet lead to tremendous amounts of data stored in databases,
data warehouses and other information repositories.
We are drowning in information, but starving for knowledge! (John
Naisbett)
Data warehousing and data mining :
On-line analytical processing
Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases.
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
11
Motivations for DM
Abundance of business and industry data
Competitive focus - Knowledge Management
Inexpensive, powerful computing engines
Strong theoretical/mathematical
foundations
machine learning & logic
statistics
database management systems
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
12
Sources of Data
Business Transactions
widespread use of bar codes => storage of millions of transactions
daily (e.g., Walmart: 2000 stores => 20M transactions per day)
most important problem: effective use of the data in a reasonable
time frame for competitive decision-making
e-commerce data
Scientific Data
data generated through multitude of experiments and observations
examples, geological data, satellite imaging data, NASA earth
observations
rate of data collection far exceeds the speed by which we analyze the
data
Financial Data
company information
economic data (GNP, price indexes, etc.)
stock markets
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
13
Sources of Data
Personal / Statistical Data
government census
medical histories
customer profiles
demographic data
data and statistics about sports and athletes
World Wide Web and Online Repositories
email, news, messages
Web documents, images, video, etc.
link structure of of the hypertext from millions of Web sites
Web usage data (from server logs, network traffic, and user
registrations)
online databases, and digital libraries
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
14
Classes of applications
Database analysis and decision support
Market analysis
• target marketing, customer relation management, market basket
analysis, cross selling, market segmentation.
Risk analysis
• Forecasting, customer retention, improved underwriting, quality
control, competitive analysis.
Fraud detection
Other Applications
Text (news group, email, documents) and Web analysis.
Intelligent Query Answering
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
15
Market Analysis
Where are the data sources for analysis?
Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies.
Target marketing
Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Conversion of single to a joint bank account: marriage, etc.
Cross-market analysis
Associations/co-relations between product sales
Prediction based on the association information.
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
MarketMarket
Analysis
and Management
Analysis
(2)
Customer profiling
data mining can tell you what types of customers buy what
products (clustering or classification).
Identifying customer requirements
identifying the best products for different customers
use prediction to find what factors will attract new
customers
Provides summary information
various multidimensional summary reports;
statistical summary information (data central tendency
and variation)
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
Risk Analysis
Finance planning and asset evaluation:
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis (financial-ratio, trend
analysis, etc.)
Resource planning:
summarize and compare the resources and spending
Competition:
monitor competitors and market directions (CI: competitive
intelligence).
group customers into classes and class-based pricing procedures
set pricing strategy in a highly competitive market
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
Fraud Detection
Applications:
widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach:
use historical data to build models of fraudulent behavior and use
data mining to help identify similar instances.
Examples:
auto insurance: detect a group of people who stage accidents to
collect on insurance
money laundering: detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network)
medical insurance: detect professional patients and ring of
doctors and ring of references
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
More examples:
Fraud Detection (2)
Detecting inappropriate medical treatment:
Australian Health Insurance Commission identifies that in
many cases blanket screening tests were requested (save
Australian $1m/yr).
Detecting telephone fraud:
Telephone call model: destination of the call, duration, time of
day or week. Analyze patterns that deviate from an expected
norm.
British Telecom identified discrete groups of callers with
frequent intra-group calls, especially mobile phones, and
broke a multimillion dollar fraud.
Retail: Analysts estimate that 38% of retail shrink is due to
dishonest employees.
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
Other applications
Sports
IBM Advanced Scout analyzed NBA game statistics (shots
blocked, assists, and fouls) to gain competitive advantage for
New York Knicks and Miami Heat.
Astronomy
JPL and the Palomar Observatory discovered 22 quasars with
the help of data mining
Internet Web Surf-Aid
IBM Surf-Aid applies data mining algorithms to Web access
logs for market-related pages to discover customer preference
and behavior pages, analyzing effectiveness of Web marketing,
improving Web site organization, etc.
Watch for the PRIVACY pitfall!
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
What is KDD? A process!
The selection and processing of data for:
the identification of novel, accurate, and
useful patterns, and
the modeling of real-world phenomena.
Data mining is a major component of the
KDD process - automated discovery of
patterns and the development of predictive
and explanatory models.
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
22
The KDD process
Interpretation
and Evaluation
Data Mining
Knowledge
Selection and
Preprocessing
p(x)=0.02
Data
Consolidation
Warehouse
Patterns &
Models
Prepared Data
Consolidated
Data
Data Sources
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
23
The KDD Process in Practice
KDD steps can be merged or combined
Data Cleaning + Data Integration = Data Preprocessing
Data Selection + Data Transformation = Data Consolidation
KDD is and Iterative Process
art + engineering rather than science
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
24
The virtuous cycle
9
The KDD Process
Interpretation
and Evaluation
Data Mining
Problem
Selection and
Preprocessing
Data
Consolidation
Knowledge
p(x)=0.02
Patterns &
Models
Warehouse
Prepared Data
Consolidated
Data
Knowledge
Data Sources
CogNova
Technologies
Identify
Problem or
Opportunity
Strategy
Anno accademico, 2002/2003
Introduzione
Act on
Knowledge
Measure effect
of Action
Giannotti & Pedreschi
Results
25
The steps of the KDD process
Learning the application domain:
relevant prior knowledge and goals of application
Data consolidation: Creating a target data set
Selection and Preprocessing
Data cleaning : (may take 60% of effort!)
Data reduction and projection:
find useful features, dimensionality/variable reduction, invariant
representation.
Choosing functions of data mining
summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Interpretation and evaluation: analysis of results.
visualization, transformation, removing redundant patterns, …
Use of discovered knowledge
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
Roles in the KDD process
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
27
A business intelligence environment
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
28
The KDD process
Interpretation
and Evaluation
Data Mining
Knowledge
Selection and
Preprocessing
p(x)=0.02
Data
Consolidation
Warehouse
Patterns &
Models
Prepared Data
Consolidated
Data
Data Sources
Introduzione
Anno accademico, 2002/2003
Giannotti & Pedreschi
29
Data consolidation and preparation
Garbage in Garbage out
The quality of results relates directly to quality of
the data
50%-70% of KDD process effort is spent on data
consolidation and preparation
Major justification for a corporate data warehouse
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
30
Data consolidation
From data sources to consolidated data
repository
RDBMS
Legacy
DBMS
Flat Files
Data
Consolidation
and Cleansing
External
Anno accademico, 2002/2003
Introduzione
Warehouse
Object/Relation DBMS
Multidimensional DBMS
Deductive Database
Flat files
31
Giannotti & Pedreschi
Data consolidation
Determine preliminary list of attributes
Consolidate data into working database
Internal and External sources
Eliminate or estimate missing values
Remove outliers (obvious exceptions)
Determine prior probabilities of categories and
deal with volume bias
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
32
The KDD process
Interpretation
and Evaluation
Data Mining
Selection and
Preprocessing
Knowledge
p(x)=0.02
Data
Consolidation
Warehouse
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
33
Data selection and preprocessing
Generate a set of examples
choose sampling method
consider sample complexity
deal with volume bias issues
Reduce attribute dimensionality
remove redundant and/or correlating attributes
combine attributes (sum, multiply, difference)
Reduce attribute value ranges
group symbolic discrete values
quantify continuous numeric values
Transform data
de-correlate and normalize values
map time-series data to static representation
OLAP and visualization tools play key role
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
34
The KDD process
Interpretation
and Evaluation
Data Mining
Selection and
Preprocessing
Knowledge
p(x)=0.02
Data
Consolidation
Warehouse
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
35
Data mining tasks and methods
Directed Knowledge Discovery
Purpose: Explain value of some field in terms of all the
others (goal-oriented)
Method: select the target field based on some hypothesis
about the data; ask the algorithm to tell us how to
predict or classify new instances
Examples:
what products show increased sale when cream
cheese is discounted
which banner ad to use on a web page for a given
user coming to the site
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
36
Data mining tasks and methods
Undirected Knowledge Discovery
(Explorative Methods)
Purpose: Find patterns in the data that may be
interesting (no target filed)
Method: clustering, association rules (affinity
grouping)
Examples:
which products in the catalog often sell together
market segmentation (groups of customers/users
with similar characteristics)
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
37
Alternatively:Data mining tasks and methods
Automated Exploration/Discovery
e.g.. discovering new market segments
clustering analysis
x2
x1
Prediction/Classification
e.g.. forecasting gross sales given current factors
regression, neural networks, genetic algorithms,
decision trees
Explanation/Description
f(x)
x
e.g.. characterizing customers by demographics
and purchase history
if age > 35
decision trees, association rules
and income < $35k
then ...
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
38
Automated exploration and discovery
Clustering: partitioning a set of data into a set of classes,
called clusters, whose members share some interesting
common properties.
Distance-based numerical clustering
metric grouping of examples (K-NN)
graphical visualization can be used
Bayesian clustering
search for the number of classes which result in best fit of
a probability distribution to the data
AutoClass (NASA) one of best examples
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
39
Prediction and classification
Learning a predictive model
Classification of a new case/sample
Many methods:
Artificial neural networks
Inductive decision tree and rule systems
Genetic algorithms
Nearest neighbor clustering algorithms
Statistical (parametric, and non-parametric)
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
40
Generalization and regression
The objective of learning is to achieve good
generalization to new unseen cases.
Generalization can be defined as a mathematical
interpolation or regression over a set of training
points
Models can be validated with a previously unseen
test set or using cross-validation methods
f(x)
x
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
41
Classification and prediction
Classify data based on the values of a target
attribute, e.g., classify countries based on climate,
or classify cars based on gas mileage.
Use obtained model to predict some unknown or
missing attribute values based on other
information.
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
42
Summarizing: inductive modeling = learning
Objective: Develop a general model or
hypothesis from specific examples
Function approximation (curve fitting)
f(x)
x
Classification (concept learning, pattern
recognition)
A
x2
B
x1
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
43
Explanation and description
Learn a generalized hypothesis (model) from
selected data
Description/Interpretation of model provides new
knowledge
Affinity Grouping
Methods:
Inductive decision tree and rule systems
Association rule systems
Link Analysis
…
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
44
Exception/deviation detection
Generate a model of normal activity
Deviation from model causes alert
Methods:
Artificial neural networks
Inductive decision tree and rule systems
Statistical methods
Visualization tools
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
45
Outlier and exception data analysis
Time-series analysis (trend and deviation):
Trend and deviation analysis: regression,
sequential pattern, similar sequences, trend and
deviation, e.g., stock analysis.
Similarity-based pattern-directed analysis
Full vs. partial periodicity analysis
Other pattern-directed or statistical analysis
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
46
Example: Moviegoer Database
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
47
Example: Moviegoer Database
SELECT
SELECT moviegoers.name,
moviegoers.name, moviegoers.sex,
moviegoers.sex, moviegoers.age,
moviegoers.age,
sources.source,
movies.name
sources.source, movies.name
FROM
FROM movies,
movies, sources,
sources, moviegoers
moviegoers
WHERE
sources.source_ID
WHERE sources.source_ID == moviegoers.source_ID
moviegoers.source_ID AND
AND
movies.movie_ID
=
moviegoers.movie_ID
movies.movie_ID = moviegoers.movie_ID
ORDER
BY
ORDER BY moviegoers.name;
moviegoers.name;
moviegoers.name
sex
age
source
movies.name
Amy
Andrew
Andy
Anne
Ansje
Beth
Bob
Brian
Candy
Cara
Cathy
Charles
Curt
David
Erica
f
m
m
f
f
f
m
m
f
f
f
m
m
m
f
27
25
34
30
25
30
51
23
29
25
39
25
30
40
23
Oberlin
Oberlin
Oberlin
Oberlin
Oberlin
Oberlin
Pinewoods
Oberlin
Oberlin
Oberlin
Mt. Auburn
Oberlin
MRJ
MRJ
Mt. Auburn
Independence Day
12 Monkeys
The Birdcage
Trainspotting
I Shot Andy Warhol
Chain Reaction
Schindler's List
Super Cop
Eddie
Phenomenon
The Birdcage
Kingpin
T2 Judgment Day
Independence Day
Trainspotting
48
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
Example: Moviegoer Database
Classification
determine sex based on age, source, and movies seen
determine source based on sex, age, and movies seen
determine most recent movie based on past movies, age, sex, and
source
Estimation
for predict, need a continuous variable (e.g., “age”)
predict age as a function of source, sex, and past movies
if we had a “rating” field for each moviegoer, we could predict
the rating a new moviegoer gives to a movie based on age, sex,
past movies, etc.
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
49
Example: Moviegoer Database
Clustering
find groupings of movies that are often seen by the
same people
find groupings of people that tend to see the same
movies
clustering might reveal relationships that are not
necessarily recorded in the data (e.g., we may find a
cluster that is dominated by people with young
children; or a cluster of movies that correspond to a
particular genre)
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
50
Example: Moviegoer Database
Association Rules
market basket analysis (MBA): “which movies go together?”
need to create “transactions” for each moviegoer containing
movies seen by that moviegoer:
name
TID
Transaction
Amy
Andrew
Andy
Anne
…
001
002
003
004
…
{Independence Day, Trainspotting}
{12 Monkeys, The Birdcage, Trainspotting, Phenomenon}
{Super Cop, Independence Day, Kingpin}
{Trainspotting, Schindler's List}
...
may result in association rules such as:
{“Phenomenon”, “The Birdcage”} ==> {“Trainspotting”}
{“Trainspotting”, “The Birdcage”} ==> {sex = “f”}
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
51
Example: Moviegoer Database
Sequence Analysis
similar to MBA, but order in which items appear in
the pattern is important
e.g., people who rent “The Birdcage” during a visit
tend to rent “Trainspotting” in the next visit.
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
52
The KDD process
Interpretation
and Evaluation
Data Mining
Selection and
Preprocessing
Knowledge
p(x)=0.02
Data Consolidation
and Warehousing
Warehouse
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
53
Are all the discovered pattern interesting?
A data mining system/query may generate thousands of
patterns, not all of them are interesting.
Interestingness measures:
easily understood by humans
valid on new or test data with some degree of certainty.
potentially useful
novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
Subjective: based on user’s beliefs in the data, e.g., unexpectedness,
novelty, etc.
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
Interpretation and evaluation
Evaluation
Statistical validation and significance testing
Qualitative review by experts in the field
Pilot surveys to evaluate model accuracy
Interpretation
Inductive tree and rule models can be read directly
Clustering results can be graphed and tabled
Code can be automatically generated by some
systems (IDTs, Regression models)
Anno accademico, 2002/2003
Introduzione
Giannotti & Pedreschi
55