Self-training with Products of Latent
Variable Grammars
Zhongqiang Huang, Mary Harper, and Slav Petrov
Overview
 Motivation and Prior Related Research
 Experimental Setup
 Results
 Analysis
 Conclusions
PCFG-LA Parser
[Matsuzaki et al. '05] [Petrov et al. '06] [Petrov & Klein '07]
[Figure: observed parse tree and sentence; latent derivations and their parameters.]
PCFG-LA Parser
 Hierarchical splitting (& merging)
[Figure: an NP node is split hierarchically — to 2 (NP1–NP2), to 4 (NP1–NP4), to 8 (NP1–NP8) — with increased model complexity at each round, following a typical learning curve.]
n-th grammar: the grammar obtained after the n-th split-merge round
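The splitting step above can be sketched in a few lines. The representation below (unary rules as a `(parent, child) → probability` dict, split in two per round with small noise to break symmetry at EM initialization) is a toy assumption, not the parser's actual data structure:

```python
import random

def split_states(rules, noise=0.01):
    """One split round: every latent state of parent and child is split
    in two, and each rule's probability is divided among the 4 refined
    rules, plus a little random noise to break symmetry."""
    new_rules = {}
    for (parent, child), p in rules.items():
        for i in (0, 1):
            for j in (0, 1):
                eps = random.uniform(-noise, noise)
                new_rules[(f"{parent}_{i}", f"{child}_{j}")] = p / 4 + eps
    return new_rules
```

With `noise=0.0`, the four refined rules split the original mass exactly; EM then re-estimates them on the treebank.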
Grammar Order Selection
 Use the development set
Max-Rule Decoding (Single Grammar)
[Figure: max-rule decoding over a single grammar's parse chart, with constituents such as S, VP, NP.]
[Goodman ’98, Matsuzaki et al. ’05, Petrov & Klein ’07]
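Max-rule decoding scores candidate trees by the posterior probability of each rule application, computed from inside/outside scores. A minimal sketch of that posterior (toy scalar quantities, not the parser's actual API):

```python
def rule_posterior(outside_parent, rule_prob, inside_left, inside_right,
                   sentence_prob):
    """Posterior of applying binary rule A -> B C over a given span:
    outside score of A times the rule probability times the inside
    scores of B and C, normalized by the total sentence probability."""
    return (outside_parent * rule_prob * inside_left * inside_right
            / sentence_prob)
```

The decoder then picks the tree maximizing the product (or sum) of these rule posteriors rather than the single best derivation.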
Variability
[Petrov, ’10]
Max-Rule Decoding (Multiple Grammars)
[Figure: multiple grammars trained from one treebank with different random seeds, combined by max-rule decoding.]
[Petrov, ’10]
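In the product model, a rule's combined score is the product of its posteriors under the independently trained grammars (equivalently a sum in log space). A minimal sketch:

```python
import math

def product_rule_score(posteriors):
    """Combine one rule's posterior probabilities from several grammars
    by multiplying them (done in log space for numerical stability)."""
    return math.exp(sum(math.log(p) for p in posteriors))
```

Because each grammar makes different (largely independent) errors, rules that all grammars agree on dominate the combined score.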
Product Model Results
[Petrov, ’10]
Motivation for Self-Training
Self-training (ST)
[Figure: self-training pipeline — train grammars on hand-labeled data, select one with the dev set, use it to label the unlabeled data, then train on the automatically labeled data.]
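The pipeline in the figure amounts to a short loop. In this sketch, `train` and `evaluate` are hypothetical caller-supplied functions (a parser is any callable from sentence to tree); none of these names come from the slides:

```python
def self_train(train, evaluate, hand_labeled, unlabeled, dev, n_seeds=3):
    """Self-training sketch: train(data, seed) -> parser,
    evaluate(parser, dev) -> F score."""
    # 1. Train candidate grammars on the hand-labeled treebank.
    candidates = [train(hand_labeled, seed=s) for s in range(n_seeds)]
    # 2. Select the grammar that scores best on the development set.
    best = max(candidates, key=lambda g: evaluate(g, dev))
    # 3. Automatically label the unlabeled data with the selected grammar.
    auto_labeled = [best(sentence) for sentence in unlabeled]
    # 4. Retrain on hand-labeled plus automatically labeled data.
    return train(hand_labeled + auto_labeled, seed=0)
```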
Self-Training Curve
[Figure: self-training learning curve (F score).]
WSJ Self-Training Results
[Huang & Harper, ’09]
Self-Trained Grammar Variability
[Figure: variability among self-trained grammars at rounds 6 and 7.]
Summary
 Two issues: variability & over-fitting
 Product model
   Makes use of variability
   Over-fitting remains in individual grammars
 Self-training
   Alleviates over-fitting
   Variability remains in individual grammars
 Next step: combine self-training with product models
Experimental Setup
 Two genres:
   WSJ: Sections 2-21 for training, 22 for dev, 23 for test; 176.9K sentences per self-trained grammar
   Broadcast News: WSJ + 80% of BN for training, 10% for dev, 10% for test (see paper)
 Training scenarios: train 10 models with different seeds and combine using max-rule decoding
   Regular: treebank training with up to 7 split-merge iterations
   Self-training: three methods with up to 7 split-merge iterations
ST-Reg
[Figure: ST-Reg pipeline — grammars are trained on the hand-labeled data and one is selected with the dev set; the unlabeled data is labeled to yield a single automatically labeled set (produced by the round-6 product), on which grammars are trained. Slide annotation: "Multiple Grammars?"]
ST-Prod
[Figure: ST-Prod pipeline — the same single automatically labeled set (produced by the round-6 product) is used to train multiple grammars, which are combined as a product. Slide annotation: "Use more data?"]
ST-Prod-Mult
[Figure: ST-Prod-Mult pipeline — the round-6 product produces 10 different automatically labeled sets; each grammar is trained on its own set, and the grammars are combined as a product.]
Average and Product Model Parser Performance
[Figure: bar chart of F scores for the 6-round average, 6-round product, 7-round average, and 7-round product models under each training scenario (Regular, ST-Reg, ST-Prod, ST-Prod-Mult); plotted scores span roughly 90.1 to 92.8.]
A Closer Look at Regular Results

A Closer Look at Self-Training Results
Analysis of Rule Variance
 To measure the diversity among the grammars, we compute the average empirical variance of the log posterior probabilities of the rules across the learned grammars, over a held-out set S:
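The formula itself did not survive extraction; the following is a plausible reconstruction of the measure just described, assuming n grammars, R(S) the rule applications in the parses of the held-out set S, and P_i(r) the posterior probability of rule r under grammar i:

```latex
\mathrm{Var}(S) \;=\; \frac{1}{|R(S)|}\sum_{r \in R(S)} \frac{1}{n-1}\sum_{i=1}^{n}\Bigl(\log P_i(r) \,-\, \overline{\log P(r)}\Bigr)^{2},
\qquad
\overline{\log P(r)} \;=\; \frac{1}{n}\sum_{i=1}^{n}\log P_i(r)
```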
English Test Set Results (WSJ 23)
[Figure: F scores on WSJ Section 23 (with a Broadcast News panel), grouped into Single Parser, Reranker, Product, and Parser Combination systems; entries include [Charniak '00], [Petrov et al. '06], [Charniak & Johnson '05], [Huang '08], [McClosky et al. '06], [Huang & Harper '08], [Carreras et al. '08], [Sagae & Lavie '06], [Fossum & Knight '09], [Zhang et al. '09], [Petrov '10], and This Work; reported scores range from 89.7 to 92.6.]
Conclusions
 Very high parse accuracies can be achieved by combining self-training and product models on newswire and broadcast news parsing tasks.
 Two important factors:
  1. Accuracy of the model used to parse the unlabeled data
  2. Diversity of the individual grammars