Page 1:
Data Stream Mining
Page 2:
What is a Data Stream?
* A transient, continuously increasing sequence of data
Page 3:
Data Stream Examples
* Google searches
* Credit card transactions
* Sensor networks
Page 4:
Data Stream Characteristics
* Infinite Volume
* Chronological Order
* Dynamic Changes
Page 5:
[Diagram: the data stream mining process]
* Data stream generators: sensor networks, satellites, Internet traffic, call records
* Single-pass pipeline: selecting some parts of the data stream -> preprocessing of data streams -> incremental learning -> knowledge extraction -> knowledge
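A minimal Python sketch of that single-pass pipeline (illustrative only; the reservoir sample and the running mean are stand-ins for the selection and incremental-learning stages named above):

    import random

    def mine_stream(stream, sample_size=100):
        # Single pass: each element is read once and then discarded.
        sample, count, total = [], 0, 0.0
        for x in stream:
            # Selecting some parts of the stream: reservoir sampling.
            if count < sample_size:
                sample.append(x)
            else:
                j = random.randint(0, count)
                if j < sample_size:
                    sample[j] = x
            count += 1
            # Incremental learning: a running mean stands in for a real model.
            total += x
        # Knowledge extraction: report what was learned in one pass.
        return {"count": count, "mean": total / max(count, 1), "sample_size": len(sample)}

    print(mine_stream(iter(range(10000))))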
Page 6:
Traditional Mining vs. Data Stream Mining

                   Traditional | Data Stream
Number of passes:  Multiple    | Single
Time:              Unlimited   | Real-time
Memory:            Unlimited   | Bounded
Concepts:          One         | Multiple
Results:           Accurate    | Approximate
Page 7:
Data Stream Mining Algorithms

Mining Task           | Approach                              | Algorithm
Classification        | Decision Tree                         | VFDT & CVFDT
Classification        | Based on Class Weights                | LWClass
Classification        | Rule Based                            | SCALLOP
Classification        | Combination of Different Classifiers  | Ensemble-Based
Clustering            | K-Means                               | VFKM
Clustering            | Micro Clustering                      | CluStream
Clustering            | Density-Based Clustering              | D-Stream
Time Series Analysis  | Prediction                            | AWSOM
Page 8:
Classification
Page 9:
Concept Drift
* Changes in the discovered patterns over time
[Figure: decision boundaries drift between old data and new data]
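As an illustration of the idea (not taken from the slides), the sketch below fits a hypothetical threshold rule to an old chunk and evaluates it on a new chunk drawn from a changed concept; the accuracy collapse is the drift.

    import random

    def make_chunk(n, concept):
        # Concept "A": label 1 if x > 0.5; concept "B": label 1 if x <= 0.5.
        data = []
        for _ in range(n):
            x = random.random()
            y = int(x > 0.5) if concept == "A" else int(x <= 0.5)
            data.append((x, y))
        return data

    def accuracy(rule, chunk):
        return sum(rule(x) == y for x, y in chunk) / len(chunk)

    def learned_rule(x):
        # "Model" fit on the old chunk: predicts 1 when x > 0.5 (concept A).
        return int(x > 0.5)

    random.seed(0)
    old_chunk = make_chunk(1000, "A")   # old data follows concept A
    new_chunk = make_chunk(1000, "B")   # new data: the concept has drifted

    print("accuracy on old data:", accuracy(learned_rule, old_chunk))  # -> 1.0
    print("accuracy on new data:", accuracy(learned_rule, new_chunk))  # -> 0.0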
Page 10:
Page 11:
* Incremental Learning vs. Batch Mode
* Very Fast Decision Tree (VFDT)
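VFDT learns incrementally rather than in batch: a leaf is split as soon as the Hoeffding bound says that, with confidence 1 - delta, the best split attribute observed on the n examples seen so far would not change with more data. A small Python sketch of that test (function and parameter names are illustrative):

    import math

    def hoeffding_bound(value_range, delta, n):
        # With probability 1 - delta, the observed mean of n samples is within
        # epsilon of the true mean, for a statistic whose range is value_range.
        return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

    def should_split(gain_best, gain_second, value_range=1.0, delta=1e-7, n=1000):
        # Split when the gap between the two best attributes exceeds epsilon,
        # so the choice would not change after seeing more examples.
        epsilon = hoeffding_bound(value_range, delta, n)
        return (gain_best - gain_second) > epsilon

    print(should_split(gain_best=0.30, gain_second=0.25, n=200))    # -> False
    print(should_split(gain_best=0.30, gain_second=0.25, n=20000))  # -> True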
Page 12:
Challenges
* How do we 'forget' old samples? (one option is sketched below)
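One common answer, shown here only as a sketch (a sliding window is not presented as the slides' own solution): keep the most recent N samples in a bounded buffer so that older samples drop out automatically.

    from collections import deque

    class SlidingWindow:
        def __init__(self, max_size):
            # Oldest samples are dropped automatically once the buffer is full.
            self.buffer = deque(maxlen=max_size)

        def add(self, sample):
            self.buffer.append(sample)

        def current_data(self):
            return list(self.buffer)

    window = SlidingWindow(max_size=3)
    for sample in ["s1", "s2", "s3", "s4", "s5"]:
        window.add(sample)
    print(window.current_data())  # ['s3', 's4', 's5'] -- s1 and s2 have been 'forgotten'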
Page 13:
New Data
Page 14:
Data Expiration Problem
[Figure: positive (filled) and negative (hollow) examples in three chunks: (a) S0 arrived during [t0,t1), (b) S1 arrived during [t1,t2), (c) S2 arrived during [t2,t3); the optimum boundary is drawn as a solid line, the overfitted boundary as a dashed line]
Overfitting!
Page 15:
Data Expiration Problem
[Figure: the optimum boundary (solid line) against earlier chunks whose concepts conflict with it]
Conflicting Concepts!
Page 16:
Method
* The stream is partitioned into sequential chunks
* Train a classifier from each chunk
* Assume y is a test example
* f_c^i(y) => probability given by classifier C_i that y is an instance of class c
* The probability output of the ensemble of k classifiers is given by:

  f_c^E(y) = \frac{1}{k} \sum_{i=1}^{k} f_c^i(y)
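A minimal Python sketch of that averaging; the per-chunk classifiers are represented by toy functions returning {class: probability}, which is an illustrative stand-in:

    def ensemble_probability(classifiers, y, c):
        # f_c^E(y) = (1/k) * sum_i f_c^i(y): average the probability that y
        # belongs to class c over the k chunk-trained classifiers.
        k = len(classifiers)
        return sum(f(y)[c] for f in classifiers) / k

    # Three toy classifiers, each mapping an example to {class: probability}.
    classifiers = [
        lambda y: {"spam": 0.9, "ham": 0.1},
        lambda y: {"spam": 0.6, "ham": 0.4},
        lambda y: {"spam": 0.3, "ham": 0.7},
    ]
    print(ensemble_probability(classifiers, y="some email", c="spam"))  # 0.6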
Page 17:
Accuracy-Weighted Ensemble
* E_k => an ensemble of k classifiers
* G_k => a single classifier learned from the k most recent chunks
* E_k produces a smaller classification error than G_k
Page 18:
Accuracy-Weighted Ensemble
* Divide the data stream into data chunks S_1, ..., S_n
* S_n is the most recent chunk
* For a record (x, c) in S_n:
  f_c^i(x) => probability given by classifier C_i that x is an instance of class c
  Thus 1 - f_c^i(x) is the error of C_i on (x, c)
Page 19:
Accuracy-Weighted Ensemble
* Assign a weight w_i to each classifier based on its expected prediction accuracy
* Only the top k classifiers are kept

  w_i = MSE_r - MSE_i
  MSE_r = \sum_c p(c) (1 - p(c))^2
  MSE_i = \frac{1}{|S_n|} \sum_{(x,c) \in S_n} (1 - f_c^i(x))^2
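A sketch of this weighting scheme in Python, following the formulas above (the toy chunk, the classifier functions, and the helper names are illustrative):

    def mse_i(classifier, chunk):
        # MSE_i = (1/|S_n|) * sum over (x, c) in S_n of (1 - f_c^i(x))^2
        return sum((1.0 - classifier(x)[c]) ** 2 for x, c in chunk) / len(chunk)

    def mse_r(chunk):
        # MSE_r: expected error of a classifier that predicts randomly
        # according to the class distribution p(c) of the chunk.
        counts = {}
        for _, c in chunk:
            counts[c] = counts.get(c, 0) + 1
        total = len(chunk)
        return sum((n / total) * (1.0 - n / total) ** 2 for n in counts.values())

    def top_k_weighted(classifiers, chunk, k):
        # w_i = MSE_r - MSE_i; keep only the k best-weighted classifiers.
        baseline = mse_r(chunk)
        weighted = [(baseline - mse_i(clf, chunk), clf) for clf in classifiers]
        weighted.sort(key=lambda pair: pair[0], reverse=True)
        return weighted[:k]

    # Toy chunk (x, class) and two toy classifiers returning {class: probability}.
    chunk = [(0.9, "pos"), (0.8, "pos"), (0.1, "neg"), (0.2, "neg")]

    def good(x):   # confident and mostly right
        return {"pos": x, "neg": 1 - x}

    def bad(x):    # systematically wrong
        return {"pos": 1 - x, "neg": x}

    for weight, clf in top_k_weighted([good, bad], chunk, k=1):
        print(round(weight, 3))  # -> 0.225 (only the accurate classifier is kept)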
Page 20:
Clustering
Page 21:
Page 22:
Page 23:
K-Means Clustering
Page 24:
Page 25:
[Scatter plot: Height vs. Length]
Page 26:
Page 27:
VFKM (Very Fast K-Means)
* Consists of a number of runs, and each run contains a number of iterations
* Uses only a calculated number of all the available data items
* Uses only a particular number of data items in each step i
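A rough Python sketch of the idea. The sample-size bound below is a simplified Hoeffding-style stand-in for VFKM's actual bound, so treat it as illustrative rather than as the algorithm itself:

    import math
    import random

    def required_sample_size(epsilon, delta, value_range=1.0):
        # Simplified Hoeffding-style bound: n samples keep an estimated mean
        # within epsilon of the true mean with probability 1 - delta.
        return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))

    def one_vfkm_run(data, k, epsilon, delta, iterations=10):
        # Each run works on only the calculated number of data items.
        n = min(required_sample_size(epsilon, delta), len(data))
        sample = random.sample(data, n)
        centroids = random.sample(sample, k)
        for _ in range(iterations):              # a run = several k-means iterations
            clusters = {i: [] for i in range(k)}
            for x in sample:
                i = min(range(k), key=lambda i: abs(x - centroids[i]))
                clusters[i].append(x)
            centroids = [sum(pts) / len(pts) if pts else centroids[i]
                         for i, pts in clusters.items()]
        return centroids, n

    random.seed(1)
    data = [random.random() for _ in range(100000)]
    centroids, used = one_vfkm_run(data, k=3, epsilon=0.05, delta=0.01)
    print(used, [round(c, 2) for c in centroids])  # only 'used' items were touched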
Page 28:
[Figure: new data arrives and the error probability is re-estimated]
Page 29:
// Stop when the error Ei is at most E* with probability at least 1 - δ*
if ((Ei <= E_star) && (delta_i <= delta_star)) {
    print("The END", Ei, delta_i)
}
Page 30:
[Figure; caption fragment: "decrease in memory can potentially lead to an exe..."]
Page 31:
Page 32:
AOG
* Algorithm Output Granularity (AOG) is a generic, resource-aware data stream mining approach that focuses on adapting the algorithm's performance according to the data rate and the available memory.
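A sketch of how such adaptation could look in code; the threshold variable and the adjustment rule are illustrative assumptions, not AOG's published formulas:

    def adapt_granularity(threshold, free_memory_ratio, data_rate, service_rate):
        # Coarser output (larger threshold) when memory is scarce or data arrives
        # faster than it can be processed; finer output when resources recover.
        if free_memory_ratio < 0.2 or data_rate > service_rate:
            return threshold * 2.0                 # merge more outputs together
        if free_memory_ratio > 0.6 and data_rate < 0.5 * service_rate:
            return max(threshold / 2.0, 1e-6)      # afford more detailed knowledge
        return threshold

    threshold = 0.1
    for mem, rate in [(0.8, 50), (0.15, 200), (0.1, 400), (0.7, 40)]:
        threshold = adapt_granularity(threshold, mem, rate, service_rate=100)
        print(round(threshold, 4))  # 0.1, 0.2, 0.4, 0.2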
Page 33:
“Thank you for your attention.”