MUC4 evaluation details

Trying to figure out details in extraction-based MUC4 evaluations

Brendan O’Connor 2012-06

Update 2012-08:

From: Nate Chambers <nchamber@usna.edu>

Date: Thu, 23 Aug 2012 16:51:26 -0400

To: "Brendan O'Connor" <brenocon@cmu.edu>, Ellen Riloff <riloff@cs.utah.edu>,

        Siddharth Patwardhan <sidd@patwardhans.net>

Cc: Dan Jurafsky <jurafsky@stanford.edu>

I wanted to give you an update on this MUC-4 evaluation business. I don't know if you guys are still actively working on this corpus, but in case you are submitting new work, I finally got around to making these evaluation changes. I reran my ACL 2011 system to make sure my results weren't dishonest and you can safely talk about them. The difference was minimal and it actually ticked up. This is still a 4-class experiment (perp ind and org merged), not the harder 5-class:

ACL 2011

* I originally reported F1=0.40

* New fixed score is now F1=0.41

Changes I made as suggested in Brendan's analysis:

1. Included the optional entities that I was ignoring.

2. Changed my string comparison to be rightmost word, instead of rightmost phrase.

3. Fixed a bug where I gave myself too much credit on duplicate mentions. (not Brendan related, but found it as I made these changes!)

4. Checked my string edit distance. It is not giving me improper credit. It only applies a few times when the text has a parenthesis ( and the template has a bracket [.

[End update]

This is not the original MUC4 task at all -- this is all based on string matching of slot-fillers.

PR = Patwardhan and Riloff 2007  (EMNLP)

CJ = Chambers and Jurafsky 2011  (ACL)

Which MUC slots are used?

PR evaluate 5 slots

CJ evaluate 4 slots  (two of PR’s are merged into one)

QUESTION FOR PR AND CJ: this is based on my interpretation of your emails.  Any corrections?

Sid: This is correct.
Nate: This is correct.

PR name      PR MUC slots                 CJ name             CJ MUC slots            notes
(Table 2)                                 (Section 6)
-----------  ---------------------------  ------------------  ----------------------  ------------------
PerpInd      PERP: INDIV ID               “perpetrator”       PERP: INDIV ID and      CJ merge these two
PerpOrg      PERP: ORG ID                                     PERP: ORG ID

Victim       HUM TGT: DESCR               “human target”      HUM TGT: DESCR          questions below
               (without colon clauses)                          (using colon clauses)
             and HUM TGT: NAME

Target       PHYS TGT: ID                 “physical target”   PHYS TGT: ID

Weapon       INCDNT: INSTRMT ID           “instrument”        INCDNT: INSTRMT ID

Version of the data

From here:

http://www-nlpir.nist.gov/related_projects/muc/muc_data/muc_data_index.html

http://www-nlpir.nist.gov/related_projects/muc/muc_data/muc34.tar.gz

As I understand it, you can call this the “MUC4 dataset” rather than “MUC 3/4”, because MUC4 included the MUC3 data as a subset.

Tricky cases parsing the MUC4 keyfile format

The MUC4 keyfiles are grouped by template.  Therefore, a single slot can have multiple entities as slot-fillers, and for one entity, there can be several alternate strings.  I don’t think “entities” is the usual terminology here -- the MUC docs seem to call them “fillers” -- but I personally find it clearer for distinguishing the hierarchy between entities and their alternate strings (i.e. what the ACE and coref worlds call “mentions”).

NATE: I think there is a stronger argument to call them entities and mentions than just personal choice.  The coreference community has clear definitions of entities/mentions, and these are exactly what is labeled :)

Brendan: hm.  Also, when you think about it like this, it looks like in this evaluation recall is computed as a percentage of (gold) ENTITIES, but precision is computed as a percentage of (predicted) MENTIONS.  Weird.

Here is an example of a slot that has one entity, where the entity has multiple alternate strings.  Note this is a single line in the file.  (DEV-MUC3-0498)

9.  PERP: INDIVIDUAL ID             "SPECIAL POLICE UNITS" / "POLICE UNITS" / "POLICEMEN" / "POLICE"

10. [...]

Here is an example of two entities for one slot.  They are represented on separate lines, with the second being a continuation line of the slot (it doesn’t repeat the slot name).  (DEV-MUC3-0437)

12. PHYS TGT: ID                    "BANK BRANCHES"

                                    "RESTAURANT"

13. [...]

Here is an example of two entities for one slot, each having two alternate strings.  (DEV-MUC3-0713)

10. PERP: ORGANIZATION ID           "FASCIST ARMED FORCES" / "HIGH COMMAND OF THE FASCIST ARMED FORCES"

                                    "CRISTIANI GOVERNMENT" / "FASCIST GOVERNMENT"

11. [...]

The “ID” slots have a simple structure for a single entity: it’s a string, or a set of alternate strings.  The other slots, however, have a more complicated internal structure with what I’m calling “colon clauses”.  For the PR and CJ evaluations, it is not completely necessary to delve into this, except for the “HUM TGT: DESCRIPTION” slot.  See the section on it further below.
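To make the ID-slot structure concrete, here is a minimal Python sketch (my own code, not from either system) that turns one slot’s filler lines into entities with their alternate strings:

def parse_id_slot(filler_lines):
    # filler_lines: the slot's filler text, one string per physical line,
    # with the slot number and name already stripped; continuation lines
    # are simply further items in this list.
    # Returns a list of entities; each entity is a list of alternate strings.
    # Minimal sketch: ignores "?" optional markers and colon clauses.
    entities = []
    for line in filler_lines:
        entities.append([s.strip().strip('"') for s in line.split(' / ')])
    return entities

# DEV-MUC3-0713, PERP: ORGANIZATION ID (two entities, two alternates each):
parse_id_slot(['"FASCIST ARMED FORCES" / "HIGH COMMAND OF THE FASCIST ARMED FORCES"',
               '"CRISTIANI GOVERNMENT" / "FASCIST GOVERNMENT"'])
# => [['FASCIST ARMED FORCES', 'HIGH COMMAND OF THE FASCIST ARMED FORCES'],
#     ['CRISTIANI GOVERNMENT', 'FASCIST GOVERNMENT']]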

Note the MUC data doesn’t always use the colon notation for HUM TGT DESC; it sometimes uses slash notation instead.  Those cases are straightforward under the rules above.  (e.g. DEV-MUC3-0038)

19. HUM TGT: DESCRIPTION            "JESUIT PRIESTS" / "JESUITS"

                                    "WOMEN"

20. [...]

CJ’s PERP merge

Example:

9.  PERP: INDIVIDUAL ID             "DEMOCRATIC PATRIOTIC COMMITTEES"

                                    "OUR ORGANIZATION'S DISCIPLINARY TRIBUNAL"

10. PERP: ORGANIZATION ID           "ANTICOMMUNIST ACTION ALLIANCE" / "AAA" / "ALIANZA DE ACCION ANTICOMUNISTA"

11. [...]

In my reading, there are two PERP INDIV ID’s, and one PERP ORG ID (which has several alternate strings).  Under the CJ merge, there are three entities in the new “Perpetrator” slot.  So you need all three true positives to get 100% recall for the merged “Perpetrator” slot on this document.
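Concretely, the merge just concatenates the two slots’ gold entity lists before scoring (my sketch, continuing the parsed representation above):

# Gold entities for CJ's merged "Perpetrator" slot, from the example above:
perp_indiv = [['DEMOCRATIC PATRIOTIC COMMITTEES'],
              ["OUR ORGANIZATION'S DISCIPLINARY TRIBUNAL"]]
perp_org   = [['ANTICOMMUNIST ACTION ALLIANCE', 'AAA',
               'ALIANZA DE ACCION ANTICOMUNISTA']]
perpetrator = perp_indiv + perp_org   # three gold entities to recall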

(This implies the CJ task, when averaging across all slots, is easier than PR’s, because you don’t have to discriminate whether a perpetrator is the INDIV or ORG sub-type of PERP -- for the same reason a 4-way classification problem is easier than a 5-way one.  I’m fine with Chambers’ argument that for this domain it makes substantively more sense to merge them, and he notes some prior work has merged them, but we should be aware of this fact.)

Nate: you know I never realized my 4-way was inherently easier than the 5-way, but obviously it is.  I probably chose the 4 slots because my unsupervised approach would not capture that IND is different from ORG.  However, 4 slots actually matches the deeper history on MUC beyond only looking at Sid’s 2007 paper.  You need to go back further, and the story gets a lot uglier, Brendan:

My initial evaluation was based on Riloff’s 2005 paper (4 slots).  However, in 1996 she actually evaluated on only 3 slots (perp individual, victim, target).  It was her 2005 paper that I modeled my evaluation on, which adds the instrument slot (perp individual, victim, target, instrument).  You will also see that Xiao in 2005 only used 3 slots, probably following Riloff’s 1996 work.

I really don’t remember anymore, but I imagine I added the PERP ORG slot to PERP IND because I saw that Sid’s paper had added this 5th slot, and I saw it as a middle ground between Riloff’s 2005 and Sid’s 2007/2009 sequence.  

That said, I really think ORG and IND should be merged. MUC is about extracting the key entities, and I don’t see why anyone would want them separate.

Optional templates

An entire template can be marked as OPTIONAL in the MESSAGE: TEMPLATE slot.

0.  MESSAGE: ID                     DEV-MUC3-0031 (NCCOSC)

1.  MESSAGE: TEMPLATE               2 (OPTIONAL)

2. [...]

Optional entities

A single entity can be marked as optional via a prefix question mark.  (DEV-MUC3-0149)

12. PHYS TGT: ID                    ? "ELECTRICITY"

                                    ? "TELEPHONE COMMUNICATION"

                                    "HIGHWAYS"

                                    "RAILWAYS"

13. [...]

It seems like the question marks apply to entire entities, not individual strings.  For example:

10. PERP: ORGANIZATION ID           ? "FARABUNDO MARTI NATIONAL LIBERATION FRONT" / "FMLN"

11. [...]

The PR iescorer-0.4 code processes these.

QUESTION FOR CJ: did you process these in this manner?

NATE: I did not.  I wasn’t aware that the question marks meant optional.  I only looked at the template message (“MESSAGE: TEMPLATE”) to see if the whole template was optional.  Nice find, Brendan, yet another evaluation difference...

How optionality is scored:

Let all optional entities be the union of

        (1) Individual slot entities marked as optional

        (2) All entities in templates where the whole template is marked as optional

If you miss getting an optional entity, ignore it when computing recall.

The effect is: you get credit if you get it right, but are not penalized if you miss it.
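A minimal sketch of that rule (my code, not iescorer’s): optional gold entities enter the recall denominator only when the system actually found them.

def recall(gold, matched):
    # gold: list of (entity_id, is_optional) pairs for one slot.
    # matched: set of entity_ids the system extracted correctly.
    # Optional entities count in the denominator only if matched,
    # so missing one costs nothing; getting one still earns credit.
    denom = sum(1 for e, opt in gold if (not opt) or (e in matched))
    hits  = sum(1 for e, opt in gold if e in matched)
    return hits / denom if denom else 1.0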

From PR email:

We get credit if we extract an optional slot. However, we are not penalized if we fail to extract an optional slot. What this means is that the denominator used in computing the recall would vary across different experiments (depending on how many optional slots were extracted in each experiment).

NATE: This is how I computed optional.  But as said above, I did not include the entity-level optionals. I have a nice line in my code that strips out question marks because I thought they were just for the annotators and meaningless to users.

Handling of HUM TGT NAME and HUM TGT DESC and colon clauses

CJ only used the HUM TGT DESC slot to create their “human target” slot.

PR say they merged HUM TGT NAME with HUM TGT DESC.

Whether these are equivalent or not depends on colon processing.  Consider DEV-MUC3-0433:

18. HUM TGT: NAME                   "BERNARDETTE PARDO"

                                    "CARLOS CORRALES"

                                    "JORGE SAENZ"

19. HUM TGT: DESCRIPTION            "U.S. JOURNALIST": "BERNARDETTE PARDO"

                                    "CAMERAMAN": "CARLOS CORRALES"

                                    "MOVIE ACTOR": "JORGE SAENZ"

20. [...]

In CJ’s interpretation, I believe, the colon simply demarcates alternate strings, just like the slash does for the simpler ID slots.  So under the CJ interpretation, this HUM TGT DESC slot has three entities, with two alternate strings each.  QUESTION FOR CJ: is my interpretation correct?

Nate: Correct.  I treat colons as separating the mentions for a single entity (which they are if you look at them in the documents).

But I think PR interprets this very differently.  In iescorer-0.4/src/TerrorFileReader.C:559,

/* Get the list of alternatives for victim descriptions*/

set<string> TerrorFileReader::getDescriptions(const string& instring) const

[...]

  /* Lines containing ":" are discarded */

  if(regex_match(remain, regex(".*:.*")))

    return retSet;

This implies they throw out all the entities in the above example, so there are zero entities for the HUM TGT DESC slot in this case.  QUESTION FOR PR: is my interpretation correct?

Sid: Yes, that is correct. The description field is somewhat strange (and I don’t quite understand its purpose). In the cases where the description contains a colon, it provides an additional description of entities already listed in HUM TGT NAME. But when there is no colon, then the line contains new entities not previously listed in the template. The iescorer is only using the ones that were not previously seen in HUM TGT NAME.

Brendan: Substantively, I think the DESC field is good to use -- it effectively gives another mention of the entity, in a similar fashion as the multiple strings elsewhere.

If PR is discarding all colon lines, and CJ is using them, this could be a discrepancy between CJ and PR.  (They are not rare: in the development keyfiles’ HUM TGT DESC slots, there are 707 entities with colon clauses, versus 975 without.)

Whenever there is a colon clause in HUM TGT DESC, the right side is the same string as one of the HUM TGT NAMEs.  (I think it’s asserting that the LHS is a description string applying to the entity referred to by the RHS.)  Therefore, since PR uses HUM TGT NAME as well, I think the effect is that CJ ends up with twice as many alternate strings in cases where the colon clause is used.

NATE: I didn’t verify your numbers, but your comparison between PR and me sounds correct, assuming everything in this document is correct!  It would be interesting to delve into it a little more (although maybe not enough for a research contribution).  Many of these colons are actually NP NP structures.  For instance, “U.S. Journalist Bernardette Pardo was seen....”.  This is split into two mentions (as any good coref system would do), labeled with a colon between “U.S. Journalist” : “Bernardette Pardo”.  I imagine in these cases, both PR and CJ got them correct or incorrect alike.  Having the 2nd mention doesn’t help if you don’t get the event trigger word here.

Ellen: FYI, the difference between the slots is that the NAME slot was for proper names and the DESC slot for nominal descriptions. So if a document only contains a proper name (e.g., “Barack Obama”) then only the NAME slot will be filled. If a document only contains a description (e.g., “3 tourists”) then only the DESC slot will be filled. The colon happens when a document contains both a proper name and a nominal description (e.g., “U.S. president Barack Obama” --> “Barack Obama” in the NAME field and “U.S. president” : “Barack Obama” in the DESC field).

Our logic (which admittedly is a bit harsh sometimes) is that the NAME is preferable to the DESC when both are available, primarily because there are cases where the description is quite general. For example, extracting only “businessman” when the document says that the person’s name is “Manuel Vallejo Uribe”, doesn’t feel satisfactory.

Bottom line:  if a document only provides a NAME or a DESC, then it’s clear that’s the desired answer string. But if the document contains both a NAME and a DESC, then we considered only the NAME to be an acceptable answer string. I’m not arguing that this is best approach, just explaining why we handled the “colon info” the way we did.
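To make the two readings concrete, here is a sketch (my code; the function names are mine) of each style of parsing a HUM TGT DESC filler line:

def parse_desc_filler_cj(line):
    # CJ-style: a colon joins two mentions of the same entity, e.g.
    #   "U.S. JOURNALIST": "BERNARDETTE PARDO"
    # and slashes separate alternate strings as in the ID slots.
    mentions = []
    for part in line.split(':'):
        for alt in part.split(' / '):
            mentions.append(alt.strip().strip('"'))
    return mentions

def parse_desc_filler_pr(line):
    # PR-style (per TerrorFileReader::getDescriptions): discard the
    # whole line if it contains a colon; otherwise parse alternates.
    if ':' in line:
        return []
    return [s.strip().strip('"') for s in line.split(' / ')]

parse_desc_filler_cj('"U.S. JOURNALIST": "BERNARDETTE PARDO"')
# => ['U.S. JOURNALIST', 'BERNARDETTE PARDO']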

String matching rules

Given a gold alternate string, and a system-produced string, do they match?

PR and CJ handle this in largely the same way.

The rule is:

        Does the system string’s head noun match the gold string’s head noun?

Where PR defines “head noun” as

        Rightmost token of the string,

        UNLESS the word “of” is in the string. Then use the rightmost token before the “of”.

        e.g.

                "FARABUNDO MARTI NATIONAL LIBERATION FRONT" ⇒ “front”

                “bank of america” ⇒ “bank”

Both CJ and PR did the “of” check.
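In code, the head-noun rule looks roughly like this (my sketch; when “of” occurs more than once, neither paper says which occurrence wins, so this takes the first):

def head_noun(s):
    toks = s.lower().split()
    if 'of' in toks and toks.index('of') > 0:
        return toks[toks.index('of') - 1]   # rightmost token before "of"
    return toks[-1]                         # rightmost token

def strings_match(system_string, gold_string):
    return head_noun(system_string) == head_noun(gold_string)

head_noun('FARABUNDO MARTI NATIONAL LIBERATION FRONT')   # => 'front'
head_noun('bank of america')                             # => 'bank'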

Furthermore, Chambers reports he checked against the rightmost matching phrase, not a single token, which makes the task slightly more difficult.  (This introduces a dependency on how the system defines phrases.)

Chambers reports he used a very small edit distance threshold to deal with bugs in the data where the string in the keyfile does not exactly match the text of the article.  I (Brendan) found the only differences are due to the goldtext having whitespace where there’s punctuation in the article text; so for visualizations I use a fuzzy regex matcher:

query = re.escape(goldtext).replace(r'\ ', r'(\s+|[^a-zA-Z0-9]{,5})')
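A self-contained version of the same idea (my wrapper; built per-token so it also works on Python 3.7+, where re.escape no longer escapes spaces):

import re

def fuzzy_find(goldtext, article_text):
    # Let each space in the keyfile string match whitespace or a short
    # run of punctuation in the article text.
    gap = r'(\s+|[^a-zA-Z0-9]{,5})'
    query = gap.join(re.escape(tok) for tok in goldtext.split())
    return re.search(query, article_text)

fuzzy_find('SPECIAL POLICE UNITS', '... THE SPECIAL POLICE UNITS ATTACKED ...')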

NATE: I found some character-swapping misspellings. That was my motivation for doing it, but I don’t recall which template(s) I saw them in.

Cleaning up the MUC4 keyfile format

Some, though not all, of the issues above may stem from the messy MUC4 keyfile format.  I propose translating it into a much less ambiguous form so future work will have stronger agreement about what the data structures mean and how to use them.  It is better for researchers to share well-defined data than for everyone’s separate program to parse the data into potentially different internal data structures.  I have written a preprocessing script and created such output, for the 5 slots under consideration here, at:

        http://brenocon.com/muc4_proc/

In particular, the “.html” files show the original MUC format on the left, and the interpreted output on the right (in a JSON format).  The “jsons.txt” files can be easily loaded by a program using any standard JSON parsing library.
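For instance, loading them in Python might look like this (a sketch assuming one JSON object per line, which the “jsons.txt” naming suggests):

import json

with open('jsons.txt') as f:
    docs = [json.loads(line) for line in f if line.strip()]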

One issue: the colon clauses should be resolved into substructure under the entity they refer to.

However, a nicer data format solves only part of the issues described above.

===========================

obsolete stuff

QUESTION FOR PR: basic question, but does iescorer-0.4 handle continuations?  I’m looking at TerrorFileReader.C, TerrorFileReader::read().  It’s structured around a line-by-line processing loop, but there’s a state machine system that may be handling the continuations... I assume this is the case, but just wanted to confirm because I can’t exactly figure out the code.

Sid: Yes, the state machine handles the continuations.