Task #04: Classification of Semantic Relations between Nominals



Mailing list

Participants and prospective participants in Task 4 are invited to join the Semantic Relations Google Group:

http://groups.google.com/group/semanticrelations

Datasets and Formats

The following list summarizes the main features of our dataset:


  • 7 semantic relations (not exhaustive and possibly overlapping)

  • 140 training sentences per relation (7 × 140 = 980 training sentences)

  • 70 testing sentences per relation (7 × 70 = 490 testing sentences)

  • 210 combined testing and training sentences per relation (7 × 210 = 1,470 sentences)

  • sentence classes will be approximately 50% positive and 50% negative

  • several different search patterns will be used for each semantic relation, to avoid biasing the sample sentences

  • negative examples of a relation will be “near misses”


The following examples illustrate the format of the dataset.  Each example consists of a sentence in which two words, tagged <e1> and <e2>, are marked as candidates for the target semantic relation.  The sentence is followed by the WordNet senses of the two marked words and a true/false label for the relation.

"The <e1>silver</e1> <e2>ship</e2> usually carried silver bullion bars, but sometimes the cargo was gold or platinum." 
WordNet(e1) = "n1", WordNet(e2) = "n1", Content-Container(e1, e2) = "true"


"Summer was over and he knew that the <e1>climate</e1> in the <e2>forest</e2> would only get worse."
WordNet(e1) = "n1", WordNet(e2) = "n1", Content-Container(e1, e2) = "false"
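
As an illustration of the format above, the following is a minimal Python sketch that reads one example (a sentence line plus its annotation line) into a record. It assumes the released files follow exactly the two-line layout of the samples shown here, which may not hold; the names `Example` and `parse_example` are hypothetical and not part of any official task software.

```python
import re
from dataclasses import dataclass

# Patterns derived only from the two sample examples in this section;
# the layout of the released data files may differ.
ENTITY_RE = re.compile(r"<(e[12])>(.*?)</\1>")
WORDNET_RE = re.compile(r'WordNet\((e[12])\)\s*=\s*"([^"]*)"')
RELATION_RE = re.compile(r'([\w-]+)\(e1,\s*e2\)\s*=\s*"(true|false)"')


@dataclass
class Example:
    sentence: str   # sentence text with the <e1>/<e2> tags removed
    e1: str         # first marked word
    e2: str         # second marked word
    sense_e1: str   # WordNet label of e1, e.g. "n1"
    sense_e2: str   # WordNet label of e2
    relation: str   # relation name, e.g. "Content-Container"
    label: bool     # True if the relation holds between e1 and e2


def parse_example(sentence_line: str, annotation_line: str) -> Example:
    """Parse one sentence line and its annotation line (hypothetical helper)."""
    raw = sentence_line.strip().strip('"')
    entities = dict(ENTITY_RE.findall(raw))
    senses = dict(WORDNET_RE.findall(annotation_line))
    match = RELATION_RE.search(annotation_line)
    if match is None or set(entities) != {"e1", "e2"}:
        raise ValueError("input does not match the assumed example layout")
    plain = ENTITY_RE.sub(r"\2", raw)  # keep the words, drop the tags
    return Example(
        sentence=plain,
        e1=entities["e1"],
        e2=entities["e2"],
        sense_e1=senses.get("e1", ""),
        sense_e2=senses.get("e2", ""),
        relation=match.group(1),
        label=match.group(2) == "true",
    )
```

For unlabeled testing data the relation label would be absent, so a real reader would need to treat the label as optional; this sketch covers only the labeled case.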

Evaluation

As with the Senseval-3 Lexical Sample tasks, each team participating in this task will initially have access only to the training data. Later, the teams will have access to unlabeled testing data (that is, there will be WordNet labels, but no relation labels). The teams will submit their algorithms' guesses for the labels for the testing data. When SemEval-1 is over, the labels for the testing data will be released to the public.


Algorithms will be allowed to skip examples that they cannot classify. An algorithm's score for a given relation will be the F score, the harmonic mean of precision and recall. Algorithms will be ranked according to their average F scores for the chosen set of relations. We will also analyze the results to see which relations are most difficult to classify. To assess the effect of varying quantities of training data, we will ask the teams to submit several sets of guesses for the labels for the testing data, using varying fractions of the training data.
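
The scoring described above can be sketched in Python as follows. This is not the official evaluation software. It assumes that a skipped example (represented here as `None`) simply counts as not having been labeled positive, so it leaves precision unchanged and lowers recall when the example is a true positive; the task description does not specify how skipped examples are treated. The function names `f_score` and `average_f` are hypothetical.

```python
from typing import Dict, Optional, Sequence


def f_score(gold: Sequence[bool], guesses: Sequence[Optional[bool]]) -> float:
    """F score (harmonic mean of precision and recall) for one relation.

    A guess of None marks an example the system skipped. Assumption: a
    skipped example is treated as "not labeled positive".
    """
    if len(gold) != len(guesses):
        raise ValueError("gold and guesses must have the same length")
    true_pos = sum(1 for g, p in zip(gold, guesses) if p is True and g)
    predicted_pos = sum(1 for p in guesses if p is True)
    actual_pos = sum(1 for g in gold if g)
    if true_pos == 0:
        return 0.0  # also covers the cases with no predicted or no actual positives
    precision = true_pos / predicted_pos
    recall = true_pos / actual_pos
    return 2 * precision * recall / (precision + recall)


def average_f(per_relation: Dict[str, float]) -> float:
    """Unweighted mean of the per-relation F scores, used for ranking."""
    return sum(per_relation.values()) / len(per_relation)
```

For example, with gold labels `[True, True, False, False]` and guesses `[True, None, True, False]`, precision and recall are both 0.5, so the F score is 0.5.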


Some algorithms (e.g., corpus-based algorithms) may have no use for WordNet annotations. It might also be argued that the WordNet annotation is not practical in a real application. Therefore we will ask teams to indicate, when they submit their answers, whether their algorithms used the WordNet labels. We will group the submitted answers into those that used the WordNet labels and those that did not, and we will rank the answers in each group separately. Teams will be allowed to submit both types of answers, if their algorithms permit it.
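
The separate rankings described above could be produced along these lines. This is only a sketch: the function name `rank_by_group` and the tuple layout `(team, used_wordnet, average_f)` are assumptions, and any team names or scores passed to it are placeholders rather than task results.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rank_by_group(
    submissions: Iterable[Tuple[str, bool, float]]
) -> Dict[bool, List[Tuple[str, float]]]:
    """Rank submissions separately by whether they used the WordNet labels.

    Each submission is (team, used_wordnet, average_f). The result maps
    used_wordnet (True/False) to a list of (team, average_f) pairs sorted
    from the highest average F score to the lowest.
    """
    groups: Dict[bool, List[Tuple[str, float]]] = defaultdict(list)
    for team, used_wordnet, avg_f in submissions:
        groups[used_wordnet].append((team, avg_f))
    return {
        used: sorted(entries, key=lambda entry: entry[1], reverse=True)
        for used, entries in groups.items()
    }
```

A team that submits both types of answers would simply appear once in each group.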

Download area

This section will contain evaluation software, useful scripts, complementary materials, baseline systems, etc., but not the datasets proper. The datasets will be available at the main site for download.

Systems and Results

This section will be completed after the competition.

References

This section will be completed after the competition.



 For more information, visit the SemEval-2007 home page.