Constraint grammar

Constraint grammar (CG) is a methodological paradigm for natural language processing (NLP). Linguist-written, context-dependent rules are compiled into a grammar that assigns grammatical tags ("readings") to words or other tokens in running text. Typical tags address lemmatisation (lexeme or base form), inflexion, derivation, syntactic function, dependency, valency, case roles, semantic type etc. Each rule either adds, removes, selects or replaces a tag or a set of grammatical tags in a given sentence context. Context conditions can be linked to any tag or tag set of any word anywhere in the sentence, either locally (defined distances) or globally (undefined distances). Context conditions in the same rule may be linked, i.e. conditioned upon each other, negated, or blocked by interfering words or tags. Typical CGs consist of thousands of rules, that are applied set-wise in progressive steps, covering ever more advanced levels of analysis. Within each level, safe rules are used before heuristic rules, and no rule is allowed to remove the last reading of a given kind, thus providing a high degree of robustness.

The CG concept was launched by Fred Karlsson in 1990 (Karlsson 1990; Karlsson et al., eds, 1995), and CG taggers and parsers have since been written for a large variety of languages, routinely achieving accuracy F-scores for part of speech (word class) of over 99%.^[1] A number of syntactic CG systems have reported F-scores of around 95% for syntactic function labels. CG systems can be used to create full syntactic trees in other formalisms by adding small, non-terminal based phrase structure grammars or dependency grammars, and a number of Treebank projects have used CG for automatic annotation. CG methodology has also been used in a number of language technology applications, such as spell checkers and machine translation systems.

Rule syntax and format

A Constraint Grammar parser expects as its input a stream of morphologically analysed tokens, typically produced by a Finite-state transducer-based analyser (common ones are the Xerox tools twolc/lexc/xfst, HFST or Apertium's lttoolbox). Each token may be ambiguous and have many readings, the surface form with all its readings is called a cohort. Below is a possible example analysis of ", and X was like “" in the input format expected by VISL CG-3:

"<,>"
	"," cm
"<and>"
	"and" conj
"<X>"
	"X" num pl
	"X" noun prop
"<was>"
	"be" verb past p1 sg
	"be" verb past p3 sg
"<like>"
	"like" adj
	"like" subj
	"like" pr
	"like" verb inf
	"like" verb pres
	"like" verb imp
"<“>"
	"“" lquot

This snippet shows 5 cohorts, each with one or more readings. The surface wordforms are in "<anglequotes>" while the lemmas/baseforms are in regular "quotes" followed by an unquoted set of tags, and we see some cohorts have several readings, ie. are ambiguous ("<like>" being ambiguous between 6 readings). The job of the CG parser is now to 1) remove as many wrong readings as it is safe to do given the context, 2) optionally apply one or more syntactic function labels to each cohort (or even dependency relations) and 3) disambiguate the applied labels/relations.

Below is an example rule (again in VISL CG-3 format) for picking the third person reading of "was" (by removing the first person reading), given that there is no first person pronoun to the left:

REMOVE (verb p1) IF 
  (0C (verb))
  (NEGATE *-1 (prn p1)) ;

Here (verb p1) is a set of tags (order doesn't matter) that must match the reading we're removing. After IF follows a list of zero or more constraints, the first one says that in this cohort (position 0), all readings (the qualifier C, for Careful) have the tag verb. The second constraint says that if there is a cohort that is at least one word to the left (position *-1, the * meaning we may go further than one word and - meaning left) and that cohort is a first person pronoun, then the constraint does *not* match (NEGATE).

In CG-3, rules may also be given names, e.g. SELECT:somename (…) IF, which show up in the trace output.

A rule can also pick a single reading if we're sure that all the other readings must be wrong given the constraints:

SELECT:quoting ("like" subj) IF 
  (-1 ("<was>"))
  (1 (lquot) OR (":"))  ;

In this rule, we see that we can refer to both wordforms and baseforms in tagsets (they are treated just like any other tag, and a reading will always match its wordform). Here the second constraint uses OR to combine two tagsets. If this set is commonly used, we can give it a name and use the name – without the parentheses – like this:

LIST prequote = lquot ":" ; 
SELECT:quoting ("like" subj) IF 
  (-1 ("<was>"))
  (1 prequote) ;

An equivalent definition would be SET prequote = (lquot) OR (":") ; .

After running the above rules, we should end up with this:

"<,>"
	"," cm
"<and>"
	"and" conj
"<X>"
	"X" num pl
	"X" noun prop
"<was>"
	"be" verb past p3 sg
"<like>"
	"like" subj
"<“>"
	"“" lquot

If we used --trace, we would see the removed readings with an initial ;, and the name and line number of the rule wherever it applied to a reading.

The rule syntax for adding syntactic function labels follows a similar scheme of "do this if x, y and z":

LIST nominal = noun prn ;
ADD (@SUBJ) IF 
  (NEGATE *-1 nominal)
  (0C (prop))
  (1C finiteverb) ;

This is called a "mapping rule", and we may end up with multiple such mapping tags per cohort, in which case we can disambiguate using the same SELECT/REMOVE rules.

Implementations

CG-1

The first CG implementation was CGP by Fred Karlsson in the early 1990s. It was purely LISP-based, and the syntax was based on LISP s-expressions (Karlsson 1990).

CG-2

Pasi Tapanainen's CG-2 implementation mdis^[2] removed some of the parentheses in the grammar format and was implemented in C++, interpreting the grammar as a Finite State Transducer for speed.

CG-2 was later reimplemented (with a non-FST method) by the VISL group at Syddansk Universitet as the open source VISL CG [1], keeping the same format as Tapanainen's closed-source mdis.

CG-3

The VISL project later turned into VISL CG-3, which brought further changes and additions to the grammar format, e.g.:

full Unicode support through International Components for Unicode
different interpretation of negation (NOT)
named relations in addition to plain dependency relations
variable-setting
full regex matching
tag/set unification – LIST gen = m f; SELECT (det) + $$gen IF (1 noun) (1 $$gen); will select the determiner that has the same gender as the following noun
wrappers for reading/writing Apertium and HFST formats
support for subreadings (where one reading has several "parts", used for multi-word expressions and compounds)
scanning past point of origin or even window boundaries
use as a library and support for integration with external processes

There is also simple IDE for CG-3 developed by VISL,[2] which provides syntax highlighting and lets you see input and output and possible errors as you write your grammar. There is also an Emacs mode cg.el[3] with similar features and simple code-navigation.

Unlike the Tapanainen implementation, the VISL implementations do not use finite state transducers. Rules are ordered within sections, which gives more predictability when writing grammars, but at the cost of slower parsing and the possibility of endless loops.

There have been experimental open-source FST-based reimplementations of CG-2 that for small grammars reach the speed of VISL CG-3, if not mdis.^[3]

List of systems

Free software

VISL CG-3 CGrammar compiler/parser
North and Lule Sami, Faroese, Komi and Greenlandic from the University of Tromsø (more information, Northern Sami documentation)
- Fred Karlsson's original Finnish FinCG is also available from the University of Tromsø as GPL, both in the original CG1 and in a converted CG3 version.
Estonian [4]
Norwegian Nynorsk and Bokmål online, Oslo-Bergen tagger(source code)
Breton, Welsh, Irish Gaelic and Norwegian (converted from the above) in Apertium (see CG in Apertium)

Non-free software

Basque [5]
Catalan CATCG
Danish DanGram
English ENGCG, ENGCG-2, VISL-ENGCG
Esperanto EspGram
French FrAG
German GerGram
Irish online
Italian ItaGram
Spanish HISPAL
Swedish SWECG Archived 2007-02-05 at the Wayback Machine
Swahili
Portuguese PALAVRAS

References

Scholia has a topic profile for Constraint grammar.

^ For English, see for example Tapanainen and Voutilainen 1994.
^ Tapanainen, Pasi 1996: The Constraint Grammar Parser CG-2. University of Helsinki Publications No. 27.
^ Nemeskey, D. M., Tyers, F. M. and Hulden, M. (2014) "Why Implementation Matters: Evaluation of an Open-source Constraint Grammar Parser". Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014) (to appear)

Bick, Eckhard. 2000. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus: Aarhus University Press. ISBN 87-7288-910-1.
Karlsson, Fred. 1990. Constraint Grammar as a Framework for Parsing Unrestricted Text. H. Karlgren, ed., Proceedings of the 13th International Conference of Computational Linguistics, Vol. 3. Helsinki 1990, 168-173.
Karlsson, Fred, Atro Voutilainen, Juha Heikkilä, and Arto Anttila, editors. 1995. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Natural Language Processing, No 4. Mouton de Gruyter, Berlin and New York. ISBN 3-11-014179-5.
Tapanainen, Pasi and Atro Voutilainen 1994: Tagging accurately: don't guess if you know. ANLC '94 Proceedings of the fourth conference on Applied natural language processing.