Improved Inference for Unlexicalized Parsing
Slav Petrov and Dan Klein
[Petrov et al. '06]
Unlexicalized Parsing
Hierarchical, adaptive refinement:
[Figure: each treebank symbol is split in stages, e.g. DT -> DT1, DT2 -> DT1..DT4 -> DT1..DT8]
- 91.2 F1 score on Dev Set (1600 sentences)
- 1,140 nonterminal symbols
- 531,200 rewrites
- 1621 min parsing time
Coarse-to-Fine Parsing  [Goodman '97, Charniak & Johnson '05]
[Figure: from the treebank we estimate both a coarse grammar (NP ... VP) and a refined grammar (NP-apple, NP-dog, NP-eat, NP-cat, NP-1, NP-17, NP-12, ..., VP-run, VP-6, VP-31, ...); the coarse pass decides what the refined pass may build.]
Prune? For each chart item X[i,j], compute its posterior probability:
    P(X, i, j | sentence) = P_inside(X, i, j) * P_outside(X, i, j) / P(sentence)
and prune the item if this posterior is < threshold.
E.g. consider the span 5 to 12:
  coarse pass:  only a few labels (... QP NP VP ...) survive pruning
  refined pass: only refinements of the surviving labels are built
Parsing time: 1621 min -> 111 min (no search error)
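The pruning criterion can be sketched as follows; the names and data layout (log-space inside/outside tables keyed by (label, i, j)) are my assumptions for illustration, not the Berkeley Parser's actual API:

```python
import math

def prune_chart(chart, sent_logprob, threshold=1e-5):
    """Coarse-pass posterior pruning (a sketch): a chart item X[i,j] is
    kept only if its posterior
        P(X, i, j | sentence) = inside * outside / P(sentence)
    is at least `threshold`. `chart` maps (label, i, j) to
    (inside, outside) scores in log space; `sent_logprob` is the
    log inside score of the root over the whole sentence."""
    keep = set()
    for (label, i, j), (log_in, log_out) in chart.items():
        posterior = math.exp(log_in + log_out - sent_logprob)
        if posterior >= threshold:
            keep.add((label, i, j))
    return keep
```

The refined pass would then only build items whose coarse projection is in the returned set.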
Multilevel Coarse-to-Fine Parsing  [Charniak et al. '06]
- Add more rounds of pre-parsing
- Grammars coarser than X-bar
[Figure: hierarchy of grammars, from a very coarse one (A, B, ...) over the treebank symbols (NP ... VP) down to the refined grammar (NP-dog, NP-apple, NP-cat, NP-eat, ..., VP-run, ...)]
Hierarchical Pruning
Consider again the span 5 to 12:
  coarse:         ... QP NP VP ...
  split in two:   ... QP1 QP2 NP1 NP2 VP1 VP2 ...
  split in four:  ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...
  split in eight: ...
At each level, low-posterior items are pruned, and the next level only builds refinements of the items that survived.
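The multilevel loop itself can be sketched like this, with `posteriors` standing in for a full inside-outside pass (a hypothetical callable, not the parser's real interface):

```python
def coarse_to_fine(sentence, grammars, posteriors, threshold=1e-5):
    """Hierarchical pruning loop (a sketch). `posteriors(sentence, g,
    allowed)` is a stand-in for an inside-outside pass with grammar g:
    it returns {(label, i, j): posterior}, building an item only if its
    projection to the previous level is among the `allowed` survivors
    (no restriction when allowed is None). Each pass keeps only the
    items whose posterior clears the threshold."""
    allowed = None  # coarsest pass is unrestricted
    for g in grammars:
        items = posteriors(sentence, g, allowed)
        allowed = {item for item, p in items.items() if p >= threshold}
    return allowed  # survivors of the most refined pass
```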
Intermediate Grammars
Learning: X-Bar=G0 -> G1 -> G2 -> G3 -> G4 -> G5 -> G=G6
[Figure: DT is refined step by step, from DT at G0 via DT1, DT2 to DT1 DT2 DT3 DT4 DT5 DT6 DT7 DT8 at G6]
Parsing time: 1621 min -> 111 min -> 35 min (no search error)
State Drift (DT tag)
[Figure: the most likely words of each DT substate across EM iterations (the, that, this, That, This, some, these, ...); the substates drift during training, so an intermediate grammar is not simply a coarser version of the final one.]
Projected Grammars
Learning: X-Bar=G0 -> G1 -> G2 -> G3 -> G4 -> G5 -> G=G6
Projection π_i: instead of the intermediate grammars, use projections of the final grammar G to each level:
π_0(G), π_1(G), π_2(G), π_3(G), π_4(G), π_5(G)
Estimating Projected Grammars
Nonterminals? Easy:
  Nonterminals in G:      S0, S1, NP0, NP1, VP0, VP1
  Projection π:           S0, S1 -> S;  NP0, NP1 -> NP;  VP0, VP1 -> VP
  Nonterminals in π(G):   S, NP, VP
Estimating Projected Grammars
Rules? Harder:
  Rules in G:               Rules in π(G):
  S1 -> NP1 VP1   0.20      S -> NP VP   ???
  S1 -> NP1 VP2   0.12
  S1 -> NP2 VP1   0.02
  S1 -> NP2 VP2   0.03
  S2 -> NP1 VP1   0.11
  S2 -> NP1 VP2   0.05
  S2 -> NP2 VP1   0.08
  S2 -> NP2 VP2   0.12
Estimating Projected Grammars  [Corazza & Satta '06]
  Rules in G:               Rules in π(G):
  S1 -> NP1 VP1   0.20      S -> NP VP   0.56
  S1 -> NP1 VP2   0.12
  S1 -> NP2 VP1   0.02
  S1 -> NP2 VP2   0.03
  S2 -> NP1 VP1   0.11
  S2 -> NP1 VP2   0.05
  S2 -> NP2 VP1   0.08
  S2 -> NP2 VP2   0.12
Estimate π(G) not from the treebank but from the infinite tree distribution induced by G: weight each refined rule by the expected count of its left-hand side, sum over the refinements, and normalize.
Calculating Expectations
- Nonterminals:
  c_k(X): expected counts up to depth k
  c_{k+1}(X) = root(X) + Σ_{Y->α} P(Y->α) · c_k(Y) · (#X in α)
  Converges within 25 iterations (few seconds)
- Rules: expected rule count = c(X) · P(X -> α); normalize per coarse left-hand side to get the projected rule probabilities.
Parsing time: 1621 min -> 111 min -> 35 min -> 15 min (no search error)
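The fixed-point computation of expected counts, and the rule projection built on it, can be sketched on a toy split grammar (all symbols and probabilities below are hypothetical, chosen only to make the arithmetic visible):

```python
from collections import defaultdict

# Toy split PCFG: rules map (parent, children) -> probability;
# `root` is the distribution over split root symbols.
rules = {
    ("S_1", ("NP_1", "VP_1")): 0.6, ("S_1", ("NP_2", "VP_1")): 0.4,
    ("S_2", ("NP_1", "VP_1")): 0.5, ("S_2", ("NP_2", "VP_1")): 0.5,
    ("NP_1", ("dog",)): 1.0, ("NP_2", ("cat",)): 1.0,
    ("VP_1", ("runs",)): 1.0,
}
root = {"S_1": 0.7, "S_2": 0.3}

def expected_counts(rules, root, iters=25):
    """Fixed-point iteration for expected symbol counts in the infinite
    tree distribution of the grammar:
    c_{k+1}(X) = root(X) + sum over rules Y -> alpha of
                 P(rule) * c_k(Y) * (occurrences of X in alpha)."""
    c = defaultdict(float)
    for _ in range(iters):
        new = defaultdict(float, root)
        for (y, alpha), p in rules.items():
            for x in alpha:
                new[x] += p * c[y]
        c = new
    return c

def project(sym):
    """Map a split symbol like NP_2 back to its coarse symbol NP."""
    return sym.split("_")[0]

# Projected rule probability: expected count of all refined variants of
# a coarse rule, normalized per coarse left-hand side.
c = expected_counts(rules, root)
num, den = defaultdict(float), defaultdict(float)
for (y, alpha), p in rules.items():
    coarse = (project(y), tuple(project(a) for a in alpha))
    num[coarse] += c[y] * p
    den[project(y)] += c[y] * p
projected = {r: num[r] / den[r[0]] for r in num}
```

On this toy grammar the iteration stabilizes after a few rounds, matching the slide's point that convergence is fast.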
Parsing times
Share of total parsing time per pass:
  X-Bar=G0: 60%,  G1: 12%,  G2: 7%,  G3: 6%,  G4: 6%,  G5: 5%,  G=G6: 4%
Bracket Posteriors
[Figures: bracket posterior charts after G0, after G1, and the final chart; also shown as a movie and with the best tree overlaid.]
Parse Selection
[Figure: one parse tree over unsplit symbols corresponds to many derivations over split symbols (each node may be refined to -1 or -2).]
Computing the most likely unsplit tree is NP-hard:
- Settle for best derivation.
- Rerank n-best list.
- Use alternative objective function.  [Titov & Henderson '06]
Parse Risk Minimization
- Expected loss according to our beliefs:
    T_P* = argmin_{T_P} Σ_{T_T} P(T_T | w) · L(T_P, T_T)
  - T_T: true tree
  - T_P: predicted tree
  - L: loss function (0/1, precision, recall, F1)
- Use n-best candidate list and approximate the expectation with samples.
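A minimal sketch of this risk-minimizing selection over an n-best list, approximating the expectation with samples (the tree representation and loss function are left abstract; all names here are my own, not the paper's):

```python
import random

def min_risk_parse(candidates, posterior, loss, n_samples=100, seed=0):
    """Minimum-risk parse selection over an n-best list (a sketch):
    return the candidate tree t minimizing the expected loss
    sum over T of P(T | w) * L(t, T), approximating the expectation
    with samples drawn from the normalized posterior over candidates."""
    rng = random.Random(seed)
    z = sum(posterior)
    samples = rng.choices(candidates,
                          weights=[p / z for p in posterior],
                          k=n_samples)
    def risk(t):
        return sum(loss(t, s) for s in samples) / n_samples
    return min(candidates, key=risk)
```

With a 0/1 loss this approximates picking the most probable parse; with 1 - F1 it approximates maximizing expected F1.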
Reranking Results

Objective                Precision  Recall  F1    Exact
BEST DERIVATION
Viterbi Derivation       89.6       89.4    89.5  37.4
RERANKING
Precision (sampled)      91.1       88.1    89.6  21.4
Recall (sampled)         88.2       91.3    89.7  21.5
F1 (sampled)             90.2       89.3    89.8  27.2
Exact (sampled)          89.5       89.5    89.5  25.8
Exact (non-sampled)      90.8       90.8    90.8  41.7
Exact/F1 (oracle)        95.3       94.4    95.0  63.9
Dynamic Programming
- Approximate posterior parse distribution  [Matsuzaki et al. '05]
- Maximize number of expected correct rules  à la [Goodman '98]
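A simplified sketch of such an expected-correctness dynamic program, reduced from rules to unlabeled brackets for brevity (the span posteriors in `post` are assumed precomputed by inside-outside; this is an illustration of the objective, not the parser's actual max-rule algorithm):

```python
from functools import lru_cache

def best_bracket_tree(n, post):
    """Max-sum dynamic program: among all binary bracketings of a
    length-n sentence, find the one maximizing the total posterior mass
    of its spans. post[(i, j)] is the posterior probability of a
    constituent spanning words i..j; missing spans count as 0."""
    @lru_cache(maxsize=None)
    def best(i, j):
        score = post.get((i, j), 0.0)
        if j - i == 1:
            return score, (i, j)  # single word: a leaf span
        # Try every split point and keep the best pair of subtrees.
        split_score, split_tree = max(
            ((best(i, k)[0] + best(k, j)[0],
              (best(i, k)[1], best(k, j)[1])) for k in range(i + 1, j)),
            key=lambda t: t[0])
        return score + split_score, split_tree
    return best(0, n)
```

Because the objective is a sum (or product) of local posteriors, the best tree decomposes over split points and CKY-style search stays tractable, unlike maximizing the probability of the unsplit tree itself.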
Dynamic Programming Results

Objective                Precision  Recall  F1    Exact
BEST DERIVATION
Viterbi Derivation       89.6       89.4    89.5  37.4
DYNAMIC PROGRAMMING
Variational              90.7       90.9    90.8  41.4
Max-Rule-Sum             90.5       91.3    90.9  40.4
Max-Rule-Product         91.2       91.1    91.2  41.4
Final Results (Efficiency)
- Berkeley Parser:
  - 15 min
  - 91.2 F-score
  - Implemented in Java
- Charniak & Johnson '05 Parser:
  - 19 min
  - 90.7 F-score
  - Implemented in C
Final Results (Accuracy)

     Parser                                 ≤ 40 words F1  all F1
ENG  Charniak & Johnson '05 (generative)    90.1           89.6
     This Work                              90.6           90.1
     Charniak & Johnson '05 (reranked)      92.0           91.4
GER  Dubey '05                              76.3           -
     This Work                              80.8           80.1
CHN  Chiang et al. '02                      80.0           76.6
     This Work                              86.3           83.4
Conclusions
- Hierarchical coarse-to-fine inference
  - Projections
  - Marginalization
- Multi-lingual unlexicalized parsing

Thank You!
Parser available at http://nlp.cs.berkeley.edu