Functional Clause Boundary Detection

This page describes the Shared Task for the 1st Computational Systemic Functional Linguistics Conference, held at Sydney University, 16-17 July 2005.

Description

The aim is to train a machine learning system to identify the beginnings and ends of functional clauses. This is similar to a chunking or sentence boundary detection problem, but in this case clauses may also be nested.

The clause is the basic unit of analysis in Systemic Functional Linguistics (SFL). A text must be segmented into clauses before the detailed functional annotation the theory describes can be applied. Usually, a clause consists of a verb phrase and its non-clause arguments.

Because SFL draws syntactic boundaries slightly differently from other linguistic theories, data sets must be created or adapted specifically for it. This task is the first release of such data, and therefore offers a new opportunity to investigate how easily the theory's syntactic model can be computed. Researchers from any background are invited to apply new or existing machine learning approaches to the identification of functional clause boundaries.

Data

The data used in this task is drawn from the Penn Treebank, Version 3, which can be obtained through the Linguistic Data Consortium. Due to license restrictions, the tags are made available separately and must be combined with the Treebank text afterwards using a script.

The data consists of one word per line, with a blank line between sentences. Each word is accompanied by a part-of-speech (POS) tag, a chunk tag, and clause boundary tags.

Note: The clause, POS, and chunk tags may be used without a Penn Treebank license. A non-lexicalised machine learning solution can be built directly from these tags.
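As a rough illustration of the one-word-per-line layout, the following Python sketch groups lines into sentences at each blank line. The column order (word, POS, chunk) and the sample tokens are assumptions made up for this example; check the released files for the actual layout.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, ...]  # whitespace-separated fields for one word


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences from one-word-per-line text; blank lines separate sentences."""
    sentence: List[Token] = []
    for line in lines:
        fields = line.split()
        if not fields:  # a blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(tuple(fields))
    if sentence:  # the input may not end with a blank line
        yield sentence


# Made-up sample; the column order here is an assumption, not the official format.
sample = """The DT B-NP
dog NN I-NP
barked VBD B-VP

It PRP B-NP
ran VBD B-VP
""".splitlines()

sentences = list(read_sentences(sample))
```

Because the reader keeps every field as a string, it works unchanged whether or not the word column has been merged in.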

The POS and chunk information was assigned automatically using the C&C tagger (Curran & Clark, 2003). These tags were used instead of the true tags to test clause boundary detection in a real-world setting, where gold-standard data is not available.

The clause tags mark the beginning and end of each clause. Clauses may be embedded, and a single word may end multiple nested clauses, but clauses do not overlap. The tags were assigned by converting the Treebank parse trees to a systemic parse, using the technique described in Honnibal (2004), and then keeping only the clause boundary information.

There are two data sets in this shared task, each with a training and test set. The Wall Street Journal newswire corpus and the Brown corpus of literature are different enough that we expect significant variation in performance from one set to the other.
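The nesting constraint can be made concrete with a small Python sketch that recovers clause spans using a stack. It assumes a hypothetical encoding in which each word carries a pair (opens, closes) counting the clauses that begin and end at that word; the encoding in the released tag files may differ.

```python
from typing import List, Tuple

Span = Tuple[int, int, int]  # (start word index, end word index, depth)


def clause_spans(boundaries: List[Tuple[int, int]]) -> List[Span]:
    """Convert per-word (opens, closes) counts into nested clause spans.

    Depth 1 is an outermost clause; each level of embedding adds 1.
    """
    stack: List[int] = []  # start indices of clauses that are still open
    spans: List[Span] = []
    for i, (opens, closes) in enumerate(boundaries):
        for _ in range(opens):
            stack.append(i)
        for _ in range(closes):
            if not stack:
                raise ValueError(f"clause closed at word {i} with none open")
            start = stack.pop()  # the innermost open clause closes first
            spans.append((start, i, len(stack) + 1))
    if stack:
        raise ValueError("sentence ended with unclosed clauses")
    return sorted(spans)


# "He said that she left": the last word ends both the embedded and the outer clause.
example = [(1, 0), (0, 0), (1, 0), (0, 0), (0, 2)]
print(clause_spans(example))  # [(0, 4, 1), (2, 4, 2)]
```

Closing the most recently opened clause first is what guarantees that the recovered spans nest and never overlap.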

Downloads

There are two components to the download. The data itself is available as an archive (zip file) that contains tag files matching each file from the Wall Street Journal (wsj) and Brown (brown) sections of the Penn Treebank. There is also a Python script that merges these files with the original Treebank data.

Evaluation

Evaluation is on the basis of correctly-identified clauses. There is an evaluation script that calculates the performance at each clause depth and as a whole.

Download the evaluation script: fcbd_eval.py
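To show what scoring by clause depth might involve, here is a minimal Python sketch. It is not the fcbd_eval.py script. It assumes clauses are represented as (start, end, depth) triples for a single sentence, and that a predicted clause counts as correct only when all three values match a gold clause; the official script may define a match differently.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Span = Tuple[int, int, int]  # (start, end, depth); an assumed representation
Score = Tuple[float, float, float]  # (precision, recall, F1)


def prf(correct: int, predicted: int, gold: int) -> Score:
    """Precision, recall and F1, with 0.0 wherever a ratio is undefined."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def evaluate(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[Dict[int, Score], Score]:
    """Return per-depth scores and the overall score for exact-match clauses."""
    gold_set, pred_set = set(gold), set(predicted)
    hits = gold_set & pred_set
    gold_n = Counter(depth for _, _, depth in gold_set)
    pred_n = Counter(depth for _, _, depth in pred_set)
    hit_n = Counter(depth for _, _, depth in hits)
    per_depth = {
        depth: prf(hit_n[depth], pred_n[depth], gold_n[depth])
        for depth in sorted(set(gold_n) | set(pred_n))
    }
    overall = prf(len(hits), len(pred_set), len(gold_set))
    return per_depth, overall


per_depth, overall = evaluate(
    gold=[(0, 4, 1), (2, 4, 2)],
    predicted=[(0, 4, 1), (3, 4, 2)],
)
print(per_depth)  # {1: (1.0, 1.0, 1.0), 2: (0.0, 0.0, 0.0)}
print(overall)    # (0.5, 0.5, 0.5)
```

To score a whole corpus with this sketch, a sentence index would need to be added to each span so that clauses from different sentences cannot match each other.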

It is not compulsory to use any of the word, POS, or chunk data made available. Gold-standard data from the original PTB distribution may also be used. If any external data sources are used, experiments should be run with and without them for comparison.

Wall Street Journal: training set is 00-20; test set is 21-24.
Brown: training set is cf-cn; test set is cp-cr.
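The split can be expressed as a small Python helper. This is a sketch that assumes the ranges above are inclusive, that WSJ sections are named by two-digit numbers, and that Brown sections are named by two-letter codes; the function name `split_for` is invented for this example.

```python
def split_for(corpus: str, section: str) -> str:
    """Return 'train', 'test', or 'unused' for a Treebank section name."""
    section = section.lower()
    if corpus == "wsj":
        number = int(section)  # e.g. "07" -> 7
        if 0 <= number <= 20:
            return "train"
        if 21 <= number <= 24:
            return "test"
    elif corpus == "brown":
        # Two-letter section codes compare correctly as plain strings.
        if "cf" <= section <= "cn":
            return "train"
        if "cp" <= section <= "cr":
            return "test"
    else:
        raise ValueError(f"unknown corpus: {corpus!r}")
    return "unused"


print(split_for("wsj", "21"))    # test
print(split_for("brown", "ck"))  # train
```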

Submissions

Submission date: June 3, 2005

Email your submissions (PDF) to csfg05@it.usyd.edu.au

Submissions may be up to five pages long and should follow the ACL format. Attendance at the conference is NOT compulsory for acceptance of submissions to the shared task. All submissions will be discussed in a critical analysis paper prepared by the organisers. See the main page for the conference for detailed submission information.

Each paper should contain a description of the boundary detection method used and a summary of results. Each paper is expected to perform the following experiments and analysis:

Acknowledgements

The data was prepared by Matthew Honnibal, Jeremy Fletcher, and Casey Whitelaw, with the assistance of James Curran and Stephen Anthony.

References

Honnibal, M. 2004. Converting the Penn Treebank to Systemic Functional Grammar. Australasian Language Technology Workshop 2004, pp. 147-154.

Curran, J. R. and Clark, S. 2003. Investigating GIS and Smoothing for Maximum Entropy Taggers. Proceedings of the 11th Meeting of the European Chapter of the Association for Computational Linguistics (EACL-03), pp. 91-98, Budapest, Hungary.