Page 1:
Language Modeling
Introduction to N-grams
Page 2:
Dan Jurafsky
Probabilistic Language Models
* Today's goal: assign a probability to a sentence
* Machine Translation:
  * P(high winds tonite) > P(large winds tonite)
* Spell Correction
  * The office is about fifteen minuets from my house
  * P(about fifteen minutes from) > P(about fifteen minuets from)
* Speech Recognition
  * P(I saw a van) >> P(eyes awe of an)
Why?
Page 3:
Probabilistic Language Modeling
* Goal: compute the probability of a sentence or sequence of words:
  P(W) = P(w1, w2, w3, w4, w5 ... wn)
* Related task: probability of an upcoming word:
  P(w5 | w1, w2, w3, w4)
* A model that computes either of these:
  P(W) or P(wn | w1, w2 ... wn-1) is called a language model.
Page 4:
Dan Jurafsky
How to compute P(W)
* How to compute this joint probability:
  * P(its, water, is, so, transparent, that)
* Intuition: let’s rely on the Chain Rule of Probability
Page 5:
Dan Jurafsky
Reminder: The Chain Rule
* Recall the definition of conditional probabilities:
  P(B | A) = P(A, B) / P(A)    Rewriting:  P(A, B) = P(A) P(B | A)
* More variables:
  P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)
* The Chain Rule in general:
  P(x1, x2, x3, ..., xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) ... P(xn | x1, ..., xn-1)
Page 6:
Dan Jurafsky
The Chain Rule applied to compute the joint probability of words in a sentence
  P(w1 w2 ... wn) = ∏i P(wi | w1 w2 ... wi-1)
  P("its water is so transparent") =
    P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)
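A minimal sketch of the chain-rule product above, assuming a hypothetical cond_prob(word, history) function that returns P(word | history); the toy model below is illustrative only:

def sentence_prob(words, cond_prob):
    # Chain rule: P(w1..wn) = product over i of P(wi | w1..wi-1)
    p = 1.0
    for i, w in enumerate(words):
        p *= cond_prob(w, tuple(words[:i]))  # history = all preceding words
    return p

# Toy conditional model that ignores its history, just to exercise the function
toy = lambda w, hist: {"its": 0.1, "water": 0.05, "is": 0.2}.get(w, 0.01)
print(sentence_prob(["its", "water", "is"], toy))  # 0.1 * 0.05 * 0.2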
Page 7:
Dan Jurafsky
How to estimate these probabilities
* Could we just count and divide?
  P(the | its water is so transparent that) =
    Count(its water is so transparent that the) / Count(its water is so transparent that)
* No! Too many possible sentences!
* We'll never see enough data for estimating these
Page 8:
Dan Jurafsky
Markov Assumption
* Simplifying assumption:
  P(the | its water is so transparent that) ≈ P(the | that)
  (Andrei Markov)
* Or maybe
  P(the | its water is so transparent that) ≈ P(the | transparent that)
Page 9:
Markov Assumption
  P(w1 w2 ... wn) ≈ ∏i P(wi | wi-k ... wi-1)
* In other words, we approximate each component in the product:
  P(wi | w1 w2 ... wi-1) ≈ P(wi | wi-k ... wi-1)
Page 10:
Dan Jurafsky
Simplest case: Unigram model
  P(w1 w2 ... wn) ≈ ∏i P(wi)
* Automatically generated sentences from a unigram model:
fifth, an, of, futures, the, an, incorporated, a,
a, the, inflation, most, dollars, quarter, in, is,
mass
thrift, did, eighty, said, hard, 'm, july, bullish
that, or, limited, the
Page 11:
Dan Jurafsky
Bigram model
* Condition on the previous word:
  P(wi | w1 w2 ... wi-1) ≈ P(wi | wi-1)
texaco, rose, one, in, this, issue, is, pursuing, growth, in,
a, boiler, house, said, mr., gurria, mexico, 's, motion,
control, proposal, without, permission, from, five, hundred,
fifty, five, yen
outside, new, car, parking, lot, of, the, agreement, reached
this, would, be, a, record, november
Page 12:
Dan Jurafsky
N-gram models
* We can extend to trigrams, 4-grams, 5-grams
* In general this is an insufficient model of language
  * because language has long-distance dependencies:
    "The computer which I had just put into the machine room on the fifth floor crashed."
Page 13:
Language Modeling
Estimating N-gram Probabilities
Page 14:
Dan Jurafsky
Estimating bigram probabilities
* The Maximum Likelihood Estimate:
  P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
               = c(wi-1, wi) / c(wi-1)
Page 15:
Dan Jurafsky
An example
  <s> I am Sam </s>
  <s> Sam I am </s>
  <s> I do not like green eggs and ham </s>
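A small sketch (not from the slides; names are mine) that computes the MLE bigram estimates of the previous slide on this toy corpus:

from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    toks = line.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def p_mle(w, prev):
    # P(w | prev) = c(prev, w) / c(prev)
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("I", "<s>"))     # 2/3
print(p_mle("Sam", "am"))    # 1/2
print(p_mle("</s>", "Sam"))  # 1/2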
Page 16:
Dan Jurafsky
Berkeley Restaurant Project
sentences
* can you tell me about any good cantonese
restaurants close by
* mid priced thai food is what i’m looking for
* tell me about chez panisse
* can you give me a listing of the kinds of food that
are available
* i'm looking for a good place to eat breakfast
* when is caffe venezia open during the day
Page 17:
Dan Jurafsky
Raw bigram counts
* Out of 9222 sentences:
           i   want  to   eat  chinese  food  lunch  spend
  i        5   827   0    9    0        0     0      2
  want     2   0     608  1    6        6     5      1
  to       2   0     4    686  2        0     6      211
  eat      0   0     2    0    16       2     42     0
  chinese  1   0     0    0    0        82    1      0
  food     15  0     15   0    1        4     0      0
  lunch    2   0     0    0    0        1     0      0
  spend    1   0     1    0    0        0     0      0
Page 18:
Dan Jurafsky
Raw bigram probabilities
* Normalize by unigram counts:
            i     want  to    eat  chinese  food  lunch  spend
            2533  927   2417  746  158      1093  341    278
* Result:
           i        want  to      eat     chinese  food    lunch   spend
  i        0.002    0.33  0       0.0036  0        0       0       0.00079
  want     0.0022   0     0.66    0.0011  0.0065   0.0065  0.0054  0.0011
  to       0.00083  0     0.0017  0.28    0.00083  0       0.0025  0.087
  eat      0        0     0.0027  0       0.021    0.0027  0.056   0
  chinese  0.0063   0     0       0       0        0.52    0.0063  0
  food     0.014    0     0.014   0       0.00092  0.0037  0       0
  lunch    0.0059   0     0       0       0        0.0029  0       0
  spend    0.0036   0     0.0036  0       0        0       0       0
Page 19:
Bigram estimates of sentence probabilities
  P(<s> I want english food </s>) =
    P(I | <s>)
    × P(want | I)
    × P(english | want)
    × P(food | english)
    × P(</s> | food)
    = .000031
Page 20:
What kinds of knowledge?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i| <s>) = .25
Page 21:
Dan Jurafsky
Practical Issues
* We do everything in log space
  * Avoid underflow
  * (also adding is faster than multiplying)
  log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
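A quick sketch (my own, not the slides') of the log-space trick: sum log probabilities instead of multiplying raw ones.

import math

probs = [0.25, 0.66, 0.0011, 0.5, 0.28]

# The naive product underflows for long sentences; the log sum does not.
naive = 1.0
for p in probs:
    naive *= p

logprob = sum(math.log(p) for p in probs)  # add instead of multiply
print(naive, math.exp(logprob))            # same value, two routes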
Page 22:
Dan Jurafsky
Language Modeling Toolkits
* SRILM
  * http://www.speech.sri.com/projects/srilm/
Page 23:
Language Modeling
Evaluation and Perplexity
Page 24:
Dan Jurafsky
Evaluation: How good is our
model?
* Does our language model prefer good sentences to
bad ones?
* Assign higher probability to “real” or “frequently observed”
sentences
* Than “ungrammatical” or “rarely observed” sentences?
* We train parameters of our model on a training set.
* We test the model's performance on data we haven't seen.
  * A test set is an unseen dataset that is different from our training set, totally unused.
Page 25:
Dan Jurafsky
Extrinsic evaluation of N-gram
models
* Best evaluation for comparing models A and
B
* Put each model in a task
* spelling corrector, speech recognizer, MT
system
* Run the task, get an accuracy for A and for B
* How many misspelled words corrected properly
* How many words translated correctly
* Compare accuracy for A and B
Page 26:
Dan Jurafsky
Difficulty of extrinsic (in-vivo)
evaluation of N-gram models
¢ Extrinsic evaluation
* Time-consuming; can take days or weeks
* So
* Sometimes use intrinsic evaluation: perplexity
* Bad approximation
¢ unless the test data looks just like the training
data
* So generally only useful in pilot
experiments
* But is helpful to think about.
Page 27:
Dan Jurafsky
Intuition of Perplexity
* The Shannon Game:
  * How well can we predict the next word?
    I always order pizza with cheese and ____
    The 33rd President of the US was ____
    I saw a ____
* Unigrams are terrible at this game. (Why?)
* A better model of a text
  * is one which assigns a higher probability to the word that actually occurs
Page 28:
Dan Jurafsky
Perplexity
* The best language model is one that best predicts an unseen test set
* Perplexity is the inverse probability of the test set, normalized by the number of words:
  PP(W) = P(w1 w2 ... wN)^(-1/N)
        = (1 / P(w1 w2 ... wN))^(1/N)
* Chain rule:
  PP(W) = (∏i 1 / P(wi | w1 ... wi-1))^(1/N)
* For bigrams:
  PP(W) = (∏i 1 / P(wi | wi-1))^(1/N)
* Minimizing perplexity is the same as maximizing probability
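A small illustration of perplexity as the inverse probability of the test set, normalized by length; the cond_prob argument is an assumed stand-in for whatever language model is being evaluated:

import math

def perplexity(test_words, cond_prob):
    # PP(W) = P(w1..wN)^(-1/N), computed in log space for numerical stability
    N = len(test_words)
    log_p = sum(math.log(cond_prob(w, test_words[:i])) for i, w in enumerate(test_words))
    return math.exp(-log_p / N)

# A model that assigns 1/10 to every word has perplexity 10 (see the digit example below)
uniform10 = lambda w, history: 0.1
print(perplexity(["3", "1", "4", "1", "5"], uniform10))  # 10.0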
Page 29:
The Shannon Game intuition
for perplexity
From Josh Goodman
How hard is the task of recognizing digits '0,1,2,3,4,5,6,7,8,9'?
* Perplexity 10
How hard is recognizing (30,000) names at Microsoft?
* Perplexity = 30,000
If a system has to recognize
* Operator (1 in 4)
* Sales (1 in 4)
* Technical Support (1 in 4)
* 30,000 names (1 in 120,000 each)
* Perplexity is 53
Perplexity is weighted equivalent branching factor
Page 30:
Dan Jurafsky
Perplexity as branching factor
* Let's suppose a sentence consisting of random digits
* What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
  PP(W) = P(w1 w2 ... wN)^(-1/N)
        = ((1/10)^N)^(-1/N)
        = (1/10)^(-1)
        = 10
Page 31:
Lower perplexity = better model
* Training 38 million words, test 1.5 million words, WSJ

  N-gram Order   Unigram   Bigram   Trigram
  Perplexity     962       170      109
Page 32:
Language Modeling
Generalization and Zeros
Page 33:
The Shannon Visualization Method
* Choose a random bigram (<s>, w) according to its probability
* Now choose a random bigram (w, x) according to its probability
* And so on until we choose </s>
* Then string the words together

  <s> I
      I want
        want to
             to eat
                eat Chinese
                    Chinese food
                            food </s>
  I want to eat Chinese food
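A sketch of this sampling procedure, assuming bigram probabilities are stored in a nested dict bigram_p[prev][w] (a structure I am assuming, not one given on the slides):

import random

def generate(bigram_p, max_len=20):
    # Start from <s>, repeatedly sample the next word from P(. | prev), stop at </s>
    out, prev = [], "<s>"
    for _ in range(max_len):
        words, probs = zip(*bigram_p[prev].items())
        w = random.choices(words, weights=probs)[0]
        if w == "</s>":
            break
        out.append(w)
        prev = w
    return " ".join(out)

toy = {"<s>": {"I": 1.0}, "I": {"want": 1.0}, "want": {"to": 1.0},
       "to": {"eat": 1.0}, "eat": {"</s>": 1.0}}
print(generate(toy))  # "I want to eat"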
Page 34:
Dan Jurafsky
Approximating Shakespeare
Unigram
  To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
  Every enter now severally so, let
  Hill he late speaks; or! a more to leg less first you enter
  Are where exeunt and sighs have rise excellency took of.. Sleep knave we. near; vile like
Bigram
  What means, sir. I confess she? then all sorts, he is trim, captain.
  Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
  What we, hath got so she that I rest and sent to scold and nature bankrupt, nor the first gentleman?
Trigram
  Sweet prince, Falstaff shall die. Harry of Monmouth's grave.
  This shall forbid it should be branded, if renown made it empty.
  Indeed the duke; and had a very good friend.
  Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, 'tis done.
Quadrigram
  King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv'd in;
  Will you not tell me who I am?
  It cannot be but so.
  Indeed the short and the long. Marry, 'tis a noble Lepidus.
Page 35:
Shakespeare as corpus
* N = 884,647 tokens, V = 29,066
* Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
  * So 99.96% of the possible bigrams were never seen (have zero entries in the table)
* Quadrigrams are worse: What's coming out looks like Shakespeare because it is Shakespeare
Page 36:
Dan Jurafsky
The Wall Street Journal is not Shakespeare (no offense)
Unigram
Months the my and issue of year foreign new exchange’s september were recession ex-
change new endorsed a acquire to six executives
Bigram
Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor
would seem to complete the major central planners one point five percent of U. S. E. has
already old M. X. corporation of living on information such as more frequently fishing to
keep her
Trigram
They also point to ninety nine point six billion dollars from two hundred four oh six three
percent of the rates of interest stores as Mexico and Brazil on market conditions
Page 37:
The perils of overfitting
* N-grams only work well for word prediction if
the test corpus looks like the training corpus
*In real life, it often doesn’t
*We need to train robust models that
generalize!
*One kind of generalization: Zeros!
* Things that don’t ever occur in the
training set
Page 38:
* Training set:
  ... denied the allegations
  ... denied the reports
  ... denied the claims
  ... denied the request
* Test set:
  ... denied the offer
  ... denied the loan

  P("offer" | denied the) = 0
Page 39:
Dan Jurafsky
Zero probability bigrams
* Bigrams with zero probability
* mean that we will assign 0 probability to the test set!
* And hence we cannot compute perplexity (can’t
divide by 0)!
Page 40:
Language Modeling
Smoothing: Add-one (Laplace) smoothing
Page 41:
Dan Jurafsky
The intuition of smoothing (from Dan Klein)
* When we have sparse statistics:
  P(w | denied the)
    3 allegations
    2 reports
    1 claims
    1 request
    7 total
* Steal probability mass to generalize better:
  P(w | denied the)
    2.5 allegations
    1.5 reports
    0.5 claims
    0.5 request
    2 other
    7 total
Page 42:
Dan Jurafsky
Add-one estimation
* Also called Laplace smoothing
* Pretend we saw each word one more time than we did
* Just add one to all the counts!
* MLE estimate:
  P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)
* Add-1 estimate:
  P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
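A sketch of the add-1 estimate with Counter-style count tables (variable names are mine; V is whatever vocabulary size the corpus defines):

from collections import Counter

def p_add1(w, prev, bigrams, unigrams, V):
    # P_Add-1(w | prev) = (c(prev, w) + 1) / (c(prev) + V)
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

# Toy check: an unseen bigram now gets a small non-zero probability
unigrams = Counter({"denied": 5, "the": 7})
bigrams = Counter({("denied", "the"): 5})
V = 1446  # assumed vocabulary size, e.g. the Berkeley Restaurant corpus value in the textbook
print(p_add1("the", "denied", bigrams, unigrams, V))
print(p_add1("offer", "denied", bigrams, unigrams, V))  # > 0 despite a zero count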
Page 43:
Dan Jurafsky
Maximum Likelihood Estimates
* The maximum likelihood estimate
* of some parameter of a model M from a training set T
* maximizes the likelihood of the training set T given the model M
* Suppose the word “bagel” occurs 400 times in a corpus of a
million words
* What is the probability that a random word from some other
text will be “bagel”?
* MLE estimate is 400/1,000,000 = .0004
* This may be a bad estimate for some other corpus
* But it is the estimate that makes it most likely that “bagel” will
occur 400 times in a million word corpus.
Page 44:
Dan Jurafsky
Berkeley Restaurant Corpus: Laplace-smoothed bigram counts
           i   want  to   eat  chinese  food  lunch  spend
  i        6   828   1    10   1        1     1      3
  want     3   1     609  2    7        7     6      2
  to       3   1     5    687  3        1     7      212
  eat      1   1     3    1    17       3     43     1
  chinese  2   1     1    1    1        83    2      1
  food     16  1     16   1    2        5     1      1
  lunch    3   1     1    1    1        2     1      1
  spend    2   1     2    1    1        1     1      1
Page 45:
Dan Jurafsky
Laplace-smoothed bigram probabilities
  P_Add-1(wn | wn-1) = (c(wn-1, wn) + 1) / (c(wn-1) + V)

           i        want     to       eat      chinese  food     lunch    spend
  i        0.0015   0.21     0.00025  0.0025   0.00025  0.00025  0.00025  0.00075
  want     0.0013   0.00042  0.26     0.00084  0.0029   0.0029   0.0025   0.00084
  to       0.00078  0.00026  0.0013   0.18     0.00078  0.00026  0.0018   0.055
  eat      0.00046  0.00046  0.0014   0.00046  0.0078   0.0014   0.02     0.00046
  chinese  0.0012   0.00062  0.00062  0.00062  0.00062  0.052    0.0012   0.00062
  food     0.0063   0.00039  0.0063   0.00039  0.00079  0.002    0.00039  0.00039
  lunch    0.0017   0.00056  0.00056  0.00056  0.00056  0.0011   0.00056  0.00056
  spend    0.0012   0.00058  0.0012   0.00058  0.00058  0.00058  0.00058  0.00058
Page 46:
Dan Jurafsky
Reconstituted counts
  c*(wn-1, wn) = [C(wn-1, wn) + 1] × C(wn-1) / (C(wn-1) + V)

           i     want   to     eat    chinese  food  lunch  spend
  i        3.8   527    0.64   6.4    0.64     0.64  0.64   1.9
  want     1.2   0.39   238    0.78   2.7      2.7   2.3    0.78
  to       1.9   0.63   3.1    430    1.9      0.63  4.4    133
  eat      0.34  0.34   1      0.34   5.8      1     15     0.34
  chinese  0.2   0.098  0.098  0.098  0.098    8.2   0.2    0.098
  food     6.9   0.43   6.9    0.43   0.86     2.2   0.43   0.43
  lunch    0.57  0.19   0.19   0.19   0.19     0.38  0.19   0.19
  spend    0.32  0.16   0.32   0.16   0.16     0.16  0.16   0.16
Page 47:
Dan Jurafsky
Compare with raw bigram counts
* Raw counts:
           i   want  to   eat  chinese  food  lunch  spend
  i        5   827   0    9    0        0     0      2
  want     2   0     608  1    6        6     5      1
  to       2   0     4    686  2        0     6      211
  eat      0   0     2    0    16       2     42     0
  chinese  1   0     0    0    0        82    1      0
  food     15  0     15   0    1        4     0      0
  lunch    2   0     0    0    0        1     0      0
  spend    1   0     1    0    0        0     0      0
* Reconstituted (add-1) counts:
           i     want   to     eat    chinese  food  lunch  spend
  i        3.8   527    0.64   6.4    0.64     0.64  0.64   1.9
  want     1.2   0.39   238    0.78   2.7      2.7   2.3    0.78
  to       1.9   0.63   3.1    430    1.9      0.63  4.4    133
  eat      0.34  0.34   1      0.34   5.8      1     15     0.34
  chinese  0.2   0.098  0.098  0.098  0.098    8.2   0.2    0.098
  food     6.9   0.43   6.9    0.43   0.86     2.2   0.43   0.43
  lunch    0.57  0.19   0.19   0.19   0.19     0.38  0.19   0.19
  spend    0.32  0.16   0.32   0.16   0.16     0.16  0.16   0.16
Page 48:
Dan Jurafsky
Add-1 estimation is a blunt
instrument
* So add-1 isn’t used for N-grams:
* We'll see better methods
° But add-1 is used to smooth other NLP models
¢ For text classification
*In domains where the number of zeros isn’t so
huge.
Page 49:
Language Modeling
Interpolation, Backoff, and Web-Scale LMs
Page 50:
Dan Jurafsky
Backoff and Interpolation
* Sometimes it helps to use less context
  * Condition on less context for contexts you haven't learned much about
* Backoff:
  * use trigram if you have good evidence,
  * otherwise bigram, otherwise unigram
* Interpolation:
  * mix unigram, bigram, trigram
* Interpolation works better
Page 51:
Dan Jurafsky
Linear Interpolation
* Simple interpolation:
  P̂(wn | wn-2 wn-1) = λ1 P(wn | wn-2 wn-1) + λ2 P(wn | wn-1) + λ3 P(wn),   where Σi λi = 1
* Lambdas conditional on context:
  P̂(wn | wn-2 wn-1) = λ1(w_{n-2..n-1}) P(wn | wn-2 wn-1) + λ2(w_{n-2..n-1}) P(wn | wn-1) + λ3(w_{n-2..n-1}) P(wn)
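A minimal sketch of simple interpolation, assuming unigram/bigram/trigram probability functions already exist (the dummy lambdas below are placeholders, not trained models):

def interp_prob(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.5, 0.3, 0.2)):
    # P_hat(w | w1 w2) = l1*P(w | w1 w2) + l2*P(w | w2) + l3*P(w), with l1 + l2 + l3 = 1
    l1, l2, l3 = lambdas
    return l1 * p_tri(w, w1, w2) + l2 * p_bi(w, w2) + l3 * p_uni(w)

print(interp_prob("food", "want", "chinese",
                  p_uni=lambda w: 0.001,
                  p_bi=lambda w, prev: 0.01,
                  p_tri=lambda w, prev2, prev1: 0.2))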
Page 52:
Dan Jurafsky
How to set the lambdas?
* Use a held-out corpus
* Choose λs to maximize the probability of held-out data:
  * Fix the N-gram probabilities (on the training data)
  * Then search for λs that give the largest probability to the held-out set:
    log P(w1 ... wn | M(λ1 ... λk)) = Σi log P_M(λ1...λk)(wi | wi-1)
Page 53:
Unknown words: Open versus
closed vocabulary tasks
* If we know all the words in advance
٠ Vocabulary V is fixed
* Closed vocabulary task
* Often we don’t know this
* Out Of Vocabulary = OOV words
* Open vocabulary task
* Instead: create an unknown word token <UNK>
* Training of <UNK> probabilities
* Create a fixed lexicon L of size V
* At text normalization phase, any training word not in L changed to
<UNK>
* Now we train its probabilities like a normal word
* At decoding time
  * for any word not in training: use the <UNK> probabilities
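A sketch of the <UNK> mapping step described above (the helper is mine, not the slides'): replace any out-of-lexicon token before counting or decoding.

def map_unk(tokens, lexicon):
    # Any token outside the fixed lexicon L is replaced by <UNK>
    return [t if t in lexicon else "<UNK>" for t in tokens]

L = {"i", "want", "chinese", "food"}
print(map_unk("i want szechuan food".split(), L))
# ['i', 'want', '<UNK>', 'food'] -- <UNK> is then trained and scored like any other word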
Page 54:
Dan Jurafsky
Huge web-scale n-grams
* How to deal with, e.g., Google N-gram corpus
* Pruning
* Only store N-grams with count > threshold.
* Remove singletons of higher-order n-grams
* Entropy-based pruning
° Efficiency
* Efficient data structures like tries
* Bloom filters: approximate language models
* Store words as indexes, not strings
* Use Huffman coding to fit large numbers of words into
two bytes
Page 55:
Dan Jurafsky
Smoothing for Web-scale N-grams
* "Stupid backoff" (Brants et al. 2007)
* No discounting, just use relative frequencies
  S(wi | w_{i-k+1..i-1}) = count(w_{i-k+1..i}) / count(w_{i-k+1..i-1})   if count(w_{i-k+1..i}) > 0
                         = 0.4 · S(wi | w_{i-k+2..i-1})                  otherwise
  S(wi) = count(wi) / N
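A sketch of stupid backoff over raw counts, assuming n-gram counts are stored in a dict keyed by tuples (my data layout, not Brants et al.'s implementation); note that S is a score, not a normalized probability:

def stupid_backoff(word, context, counts, total_words, alpha=0.4):
    # S(w | context) = count(context + w) / count(context) if that count > 0,
    # otherwise alpha * S(w | shorter context); base case is count(w) / N.
    if not context:
        return counts.get((word,), 0) / total_words
    num = counts.get(context + (word,), 0)
    den = counts.get(context, 0)
    if num > 0 and den > 0:
        return num / den
    return alpha * stupid_backoff(word, context[1:], counts, total_words, alpha)

counts = {("i",): 5, ("want",): 4, ("i", "want"): 3, ("want", "food"): 2, ("food",): 2}
print(stupid_backoff("food", ("i", "want"), counts, total_words=100))  # 0.4 * 2/4 = 0.2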
Page 56:
N-gram Smoothing Summary
* Add-1 smoothing:
  * OK for text categorization, not for language modeling
* The most commonly used method:
* Extended Interpolated Kneser-Ney
¢ For very large N-grams like the Web:
* Stupid backoff
Page 57:
Dan Jurafsky
Advanced Language Modeling
۰ Discriminative models:
* choose n-gram weights to improve a task, not to
fit the training set
* Parsing-based models
* Caching Models
* Recently used words are more likely to appear
  P_CACHE(w | history) = λ P(wi | wi-2 wi-1) + (1 - λ) · c(w ∈ history) / |history|
* These perform very poorly for speech recognition
(why?)
Page 58:
Language Modeling
Advanced: Good-Turing Smoothing
Page 59:
Dan Jurafsky
Reminder: Add-1 (Laplace) Smoothing
  P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
Page 60:
More general formulations: Add-k
  P_Add-k(wi | wi-1) = (c(wi-1, wi) + k) / (c(wi-1) + kV)
  P_Add-k(wi | wi-1) = (c(wi-1, wi) + m(1/V)) / (c(wi-1) + m)
Page 61:
Dan Jurafsky
Unigram prior smoothing
  P_Add-k(wi | wi-1) = (c(wi-1, wi) + m(1/V)) / (c(wi-1) + m)
  P_UnigramPrior(wi | wi-1) = (c(wi-1, wi) + m·P(wi)) / (c(wi-1) + m)
Page 62:
Dan Jurafsky
Advanced smoothing
algorithms
* Intuition used by many smoothing
algorithms
* Good-Turing
* Kneser-Ney
* Witten-Bell
* Use the count of things we’ve seen
once
* to help estimate the count of things we've never seen
Page 63:
Dan Jurafsky
Notation: Nc = Frequency of frequency c
* Nc = the count of things we've seen c times
* Sam I am I am Sam I do not eat
    I    3
    sam  2
    am   2
    do   1
    not  1
    eat  1
  N1 = 3, N2 = 2, N3 = 1
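A tiny sketch (mine) that computes the Nc table for the corpus on this slide:

from collections import Counter

tokens = "Sam I am I am Sam I do not eat".split()
word_counts = Counter(tokens)        # I:3, Sam:2, am:2, do:1, not:1, eat:1
N_c = Counter(word_counts.values())  # frequency of frequency c
print(N_c)  # Counter({1: 3, 2: 2, 3: 1})  ->  N1 = 3, N2 = 2, N3 = 1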
Page 64:
Dan Jurafsky
Good-Turing smoothing intuition
* You are fishing (a scenario from Josh Goodman), and caught:
  * 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
* How likely is it that the next species is trout?
  * 1/18
* How likely is it that the next species is new (i.e. catfish or bass)?
  * Let's use our estimate of things-we-saw-once to estimate the new things.
  * 3/18 (because N1 = 3)
Page 65:
Dan Jurafsky
Good Turing calculation
  P*_GT(things with zero frequency) = N1 / N        c* = (c + 1) · N_{c+1} / N_c

* Unseen (bass or catfish):
  * c = 0
  * MLE p = 0/18 = 0
  * P*_GT(unseen) = N1 / N = 3/18
* Seen once (trout):
  * c = 1
  * MLE p = 1/18
  * c*(trout) = 2 × N2/N1 = 2 × 1/3 = 2/3
  * P*_GT(trout) = 2/3 / 18 = 1/27
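A short sketch of the two Good-Turing formulas above, checked on the fishing example (my code, not the slides'):

def gt_unseen_mass(N_c, N):
    # P*_GT(things with zero frequency) = N1 / N
    return N_c[1] / N

def gt_count(c, N_c):
    # c* = (c + 1) * N_{c+1} / N_c
    return (c + 1) * N_c[c + 1] / N_c[c]

# Fishing example: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
N_c = {1: 3, 2: 1, 3: 1, 10: 1}
print(gt_unseen_mass(N_c, 18))   # 3/18, the probability mass for brand-new species
c_star_trout = gt_count(1, N_c)  # 2 * N2/N1 = 2/3
print(c_star_trout / 18)         # P*_GT(trout) = 1/27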
Page 66:
Dan Jurafsky
Resulting Good-Turing numbers
* Numbers from Church and Gale (1991)
* 22 million words of AP Newswire
  c* = (c + 1) · N_{c+1} / N_c

  Count c   Good Turing c*
  0         .0000270
  1         0.446
  2         1.26
  3         2.24
  4         3.24
  5         4.22
  6         5.19
  7         6.21
  8         7.24
  9         8.25
Page 67:
Dan Jurafsky
Resulting Good-Turing numbers
* Numbers from Church and Gale (1991)
* 22 million words of AP Newswire
  c* = (c + 1) · N_{c+1} / N_c
* It sure looks like c* = (c - .75)

  Count c   Good Turing c*
  0         .0000270
  1         0.446
  2         1.26
  3         2.24
  4         3.24
  5         4.22
  6         5.19
  7         6.21
  8         7.24
  9         8.25