IMS
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
The Significance of
Result Differences
Why Significance Tests?
• everybody knows we have to test
the significance of our results
• but do we really?
• evaluation results are valid for
  • data from specific corpus
  • extracted with specific methods
  • for a particular type of collocations
  • according to the intuitions of one
    particular annotator (or two)
Why Significance Tests?
• significance tests are about
generalisations
• basic question:
"If we repeated the evaluation
experiment (on similar data),
would we get the same results?"
• influence of source corpus,
domain, collocation type and
definition, annotation guidelines, ...
Evaluation of Association Measures
[Two figure slides; plots not preserved in this text version]
A Different Perspective
• pair types are described by
tables (O11, O12, O21, O22)
→ coordinates in 4-D space
• O22 is redundant because
O11 + O12 + O21 + O22 = N
• can also describe pair type by
joint and marginal frequencies
(f, f1, f2) = "coordinates"
→ coordinates in 3-D space
A Different Perspective
• data set = cloud of points in
three-dimensional space
• visualisation is "challenging"
• many association measures
depend on O11 and E11 only
(MI, gmean, t-score, binomial)
• projection to (O11, E11)
→ coordinates in 2-D space
(ignoring the ratio f1 / f2)
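The change of coordinates on these two slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: f1 and f2 are taken to be the row and column totals O11 + O12 and O11 + O21, E11 is the expected frequency f1 · f2 / N, MI uses a base-2 logarithm, and all counts are hypothetical. The function names are made up for the example.

```python
import math

def table_to_coordinates(o11, o12, o21, o22):
    """Map a contingency table to (f, f1, f2) and the sample size N."""
    n = o11 + o12 + o21 + o22
    return o11, o11 + o12, o11 + o21, n

def expected(f1, f2, n):
    """Expected co-occurrence frequency E11 under independence."""
    return f1 * f2 / n

def mi(o11, e11):
    """Mutual information (base 2 assumed here)."""
    return math.log2(o11 / e11)

def t_score(o11, e11):
    """t-score association measure."""
    return (o11 - e11) / math.sqrt(o11)

# Hypothetical pair type, given as a contingency table
f, f1, f2, n = table_to_coordinates(30, 70, 170, 99_730)
e11 = expected(f1, f2, n)

# A second hypothetical pair type with the same O11 and E11
# but a different ratio f1 / f2
e11_other = expected(50, 400, n)

print(f, f1, f2, n, e11, e11_other)
print(mi(f, e11), t_score(f, e11))
```

Both pair types land on the same point of the (O11, E11) plane, so MI and t-score cannot tell them apart. That is the information lost by the projection.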
The Parameter Space of
Collocation Candidates
[Five figure slides; plots not preserved in this text version]
N-best Lists in Parameter Space
• N-best list for an AM includes all
pair types where score ≥ c
(threshold c obtained from data)
• {score ≥ c} describes a subset of the
parameter space
• for a sound association measure
isoline {score = c} is lower boundary
(because scores should increase
with O11 for fixed value of E11)
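A minimal sketch of how an N-best list and its threshold c could be obtained, assuming candidates are given as (O11, E11) points and scored with the t-score. The candidate values and the helper name `n_best` are hypothetical, and ties at the threshold are not handled.

```python
import math

def t_score(o11, e11):
    """t-score association measure on (O11, E11) coordinates."""
    return (o11 - e11) / math.sqrt(o11)

def n_best(candidates, score, n):
    """Return the n highest-scoring candidates and the threshold c."""
    ranked = sorted(candidates, key=lambda oe: score(*oe), reverse=True)
    best = ranked[:n]
    c = score(*best[-1])  # score of the last candidate that made the list
    return best, c

# Hypothetical (O11, E11) points
candidates = [(30, 0.2), (12, 4.0), (5, 0.1), (3, 2.5), (50, 45.0), (8, 1.0)]
best, c = n_best(candidates, t_score, 3)

print(best, c)
```

For this measure the score grows with O11 when E11 is held fixed, so every point above the isoline belongs to the N-best list and every point below it does not.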
N-Best Isolines in the
Parameter Space
[Four figure slides: two for MI, two for t-score; plots not preserved in this text version]
95% Confidence Interval
99% Confidence Interval
[Three figure slides (95%, 99%, 95%); plots not preserved in this text version]
Comparing Precision Values
tbl
        t-score   frequency
TPs       322        283
FPs       678        717
• number of TPs and FPs for
1000-best lists
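The precision values behind this table follow directly from the counts. A short sketch, with the dictionary layout chosen only for illustration:

```python
# TP/FP counts for the two 1000-best lists, copied from the table on the slide
counts = {"t-score": (322, 678), "frequency": (283, 717)}

precision = {am: tp / (tp + fp) for am, (tp, fp) in counts.items()}
difference = precision["t-score"] - precision["frequency"]

for am, p in precision.items():
    print(f"{am}: {p:.1%}")
print(f"difference: {difference * 100:.1f} percentage points")
```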
McNemar's Test
tbl
            – t-score   + t-score
– freq         610          46
+ freq           7         276
+ = in 1000-best list
– = not in 1000-best list
• ideally: all TPs in 1000-best list (possible!)
• H0: differences between AMs are random
> tbl <- matrix(c(610, 7, 46, 276), nrow = 2)
> mcnemar.test(tbl)
• p-value < 0.001 → highly significant
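For readers without R at hand, the same test can be sketched in Python using only the standard library. This assumes the continuity-corrected chi-squared form of McNemar's test, which is the default of R's `mcnemar.test`. The row and column orientation follows the table (rows: frequency, columns: t-score), and the variable names are illustrative.

```python
import math

# Rows: frequency list (not in / in); columns: t-score list (not in / in)
tbl = [[610, 46],
       [7, 276]]

# Consistency check against the TP counts of the two 1000-best lists
tps_t_score = tbl[0][1] + tbl[1][1]
tps_frequency = tbl[1][0] + tbl[1][1]

# Only the discordant cells enter the test
b, c = tbl[0][1], tbl[1][0]
statistic = (abs(b - c) - 1) ** 2 / (b + c)

# Upper tail of the chi-squared distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(statistic / 2))

print(tps_t_score, tps_frequency, statistic, p_value)
```

The marginal sums reproduce the 322 and 283 TPs from the precision table, and the p-value falls below 0.001.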
Significant Differences
[Three figure slides; plots not preserved in this text version. The legend on the last slide distinguishes "significant" from "relevant (2%)" differences]
Lowest-Frequency Data:
Samples
• Too much data for full manual
evaluation → random samples
• AdjN data
  • 965 pairs with f = 1 (15% sample)
  • manually identified 31 TPs (3.2%)
• PNV data
  • 983 pairs with f < 3 (0.35% sample)
  • manually identified 6 TPs (0.6%)
Lowest-Frequency Data:
Samples
• Estimate proportion p of TPs
among all lowest-frequency data
• Confidence set from binomial test
• AdjN: 31 TPs among 965 items
  • p ≤ 5% with 99% confidence
  • at most ≈ 320 TPs
• PNV: 6 TPs among 983 items
  • p ≤ 1.5% with 99% confidence
  • there might still be ≈ 4200 TPs !!
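The upper bounds can be approximated with an exact binomial calculation. The sketch below assumes a one-sided 99% bound (the slides only say "confidence set from binomial test") and finds it by bisection on the binomial distribution function. Function names are illustrative, and the population sizes are extrapolated from the sampling rates given on the previous slide.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def upper_bound(k, n, confidence=0.99):
    """Largest p still compatible with observing at most k successes."""
    alpha = 1 - confidence
    lo, hi = k / n, 1.0
    for _ in range(60):  # bisection: binom_cdf decreases as p grows
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

adjn_ub = upper_bound(31, 965)
pnv_ub = upper_bound(6, 983)

# Extrapolation with the rounded bounds from the slide
adjn_tps = 0.05 * 965 / 0.15      # 15% sample
pnv_tps = 0.015 * 983 / 0.0035    # 0.35% sample

print(adjn_ub, pnv_ub, adjn_tps, pnv_tps)
```

Under these assumptions the bounds come out slightly below the rounded 5% and 1.5% on the slide, and the extrapolated counts agree with the ≈ 320 and ≈ 4200 TPs quoted there.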
N-best Lists for
Lowest-Frequency Data
• evaluate 10,000-best lists
• to reduce manual annotation work,
take 10% sample from each list
(i.e. 1,000 candidates for each AM)
• precision graphs for N-best lists
• up to N = 10,000 for the PNV data
• 95% confidence estimates for
precision of best-performing AM
(from binomial test)
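The sampling procedure can be sketched as follows. Two caveats: the annotation data here is simulated, not taken from the study, and the Wilson score interval is swapped in as a simple closed-form stand-in for the interval derived from the binomial test.

```python
import math
import random

def wilson_interval(successes, n, z=1.96):
    """Two-sided Wilson score interval (z = 1.96 for about 95%)."""
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Simulated 10,000-best list: True marks a candidate an annotator would accept
rng = random.Random(0)
n_best_list = [rng.random() < 0.05 for _ in range(10_000)]

# 10% random sample, i.e. 1,000 candidates to annotate manually
sample = rng.sample(n_best_list, 1_000)
tps = sum(sample)
low, high = wilson_interval(tps, len(sample))

print(f"sample precision {tps / len(sample):.1%}, "
      f"95% interval [{low:.1%}, {high:.1%}]")
```

Annotating 1,000 instead of 10,000 candidates trades a tenfold reduction in manual work for an interval a few percentage points wide around the estimated precision.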
Random Sample Evaluation
[Three figure slides; plots not preserved in this text version]