Rule-based Dependency Parser Module
The Txala dependency parser [ACM05] takes constituency-parsed sentences as input, that is, sentence objects which have been enriched with a `parse_tree` by the `chart_parser` (or by any other means). The input parse may be shallow. The dependency parser will complete the parse tree if needed, convert it to a dependency tree, and assign a syntactic function to each edge in the dependency tree.
class dep_txala : public dependency_parser {
public:
/// constructor
dep_txala(const std::string &cfgfile, const std::string &start);
/// analyze given sentence.
void analyze(sentence &s) const;
/// analyze given sentences.
void analyze(std::list<sentence> &ls) const;
/// return analyzed copy of given sentence
sentence analyze(const sentence &s) const;
/// return analyzed copy of given sentences
std::list<sentence> analyze(const std::list<sentence> &ls) const;
};
The constructor receives two strings: the name of the file containing the dependency rules to be used, and the start symbol of the grammar used by the `chart_parser` to parse the sentence.
The dependency parser works in three stages:
* In the first stage, the `<GRPAR>` rules are used to complete the shallow parse produced by the chart parser into a complete parse tree. The rules are applied to pairs of adjacent chunks. At each step, the selected pair is fused into a single chunk. The process stops when only one chunk remains.
* The next step is an automatic conversion of the complete parse tree to a dependency tree. Since the parsing grammar encodes information about the head of each rule, the conversion is straightforward.
* The last step is the labeling. Each edge in the dependency tree is labeled with a syntactic function, using the `<GRLAB>` rules.
The syntax and semantics of `<GRPAR>` and `<GRLAB>` rules are described below.
Dependency Parsing Rule File
The dependency rules file contains a set of rules to perform dependency parsing.
The file consists of five sections: `<GRPAR>`, `<GRLAB>`, `<SEMDB>`, `<CLASS>`, and `<PAIRS>`.
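As a rough illustration of this layout, the sketch below groups the lines of a rule file by section. It assumes each section is delimited by a `<NAME>` line and a matching `</NAME>` line, as in the `<SEMDB>`, `<CLASS>`, and `<PAIRS>` examples in this document. The function name `split_sections` and the decision to skip blank lines and text outside any section are assumptions made for this example; this is not FreeLing's actual loader.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: groups the lines of a rule file by section name.
// Blank lines and lines outside any <NAME> ... </NAME> block are skipped.
std::map<std::string, std::vector<std::string>> split_sections(std::istream &in) {
  std::map<std::string, std::vector<std::string>> sections;
  std::string current;  // empty while outside any section
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty()) continue;
    if (current.empty()) {
      // An opening tag such as "<GRPAR>" starts a section.
      if (line.size() > 2 && line.front() == '<' && line.back() == '>' &&
          line[1] != '/') {
        current = line.substr(1, line.size() - 2);
        sections[current];  // register the section even if it stays empty
      }
    } else if (line == "</" + current + ">") {
      current.clear();      // closing tag ends the section
    } else {
      sections[current].push_back(line);
    }
  }
  return sections;
}
```

The function accepts any input stream, so it can be fed a `std::ifstream` opened on the rule file or a `std::istringstream` holding test data.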
Parse-tree completion rules
Section `<GRPAR>` contains rules to complete the partial parse provided by the chart parser. The tree is completed by combining chunk pairs as stated by the rules. Rules are applied from highest priority (lower values) to lowest priority (higher values), and left to right. That is, the pair of adjacent chunks matching the highest-priority rule is found, and the rule is applied, joining both chunks into one. The process is repeated until only one chunk is left.
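The selection loop described above can be sketched as follows. This is a simplified model, not the parser's implementation: the `SimpleRule` type and its `merged` field are hypothetical, chunks are reduced to bare labels, and flags, contexts, constraints, and tree operations are left out. Stopping when no rule matches any pair is also an assumption of the sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified completion rule: only a priority and the two chunk labels.
struct SimpleRule {
  int priority;        // lower value = higher priority
  std::string left;    // label required on the left chunk
  std::string right;   // label required on the right chunk
  std::string merged;  // label of the fused chunk (hypothetical field)
};

// Repeatedly fuses the adjacent pair matched by the best rule.
// Stops when one chunk remains or when no rule matches any pair.
std::vector<std::string> complete(std::vector<std::string> chunks,
                                  const std::vector<SimpleRule> &rules) {
  while (chunks.size() > 1) {
    const SimpleRule *best = nullptr;
    std::size_t best_pos = 0;
    for (std::size_t i = 0; i + 1 < chunks.size(); ++i) {  // left to right
      for (const SimpleRule &r : rules) {
        if (r.left != chunks[i] || r.right != chunks[i + 1]) continue;
        // Strict '<' keeps the leftmost pair among equal priorities.
        if (best == nullptr || r.priority < best->priority) {
          best = &r;
          best_pos = i;
        }
      }
    }
    if (best == nullptr) break;        // nothing applicable
    chunks[best_pos] = best->merged;   // fuse the pair into one chunk
    chunks.erase(chunks.begin() + best_pos + 1);
  }
  return chunks;
}
```

With rules `(10, np, pp, np)`, `(20, vp, np, vp)`, and `(30, np, vp, s)`, the chunk sequence `np pp vp np` is reduced in three steps to the single chunk `s`.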
The rules can be enabled/disabled via the activation of global flags. Each rule may be stated to be enabled only if certain flags are on. If none of its enabling flags are on, the rule is not applied. Each rule may also state which flags have to be toggled on/off after its application, thus enabling/disabling other rule subsets.
Each line in section <GRPAR>
contains a rule, with the format:
priority flags context (lchunk,rchunk) pair-constraints operation op-params flag-ops
where:
* `priority` is a number stating the priority of a rule (the lower the number, the higher the priority).
* `flags` is a list of strings separated by vertical bars (`|`). Each string is the name of a flag that will cause the rule to be enabled. If this field is `-`, the rule will always be enabled.
* `context` is a context limiting the application of the rule to chunk pairs that are surrounded by the appropriate context. A dash (`-`) means no restrictions, and the rule is applied to any matching chunk pair.
* `(lchunk,rchunk)` are the labels of the adjacent pair of chunks the rule may be applied to. The labels are either assigned by the chunk parser, or by a `RELABEL` operation on some other completion rule. The pair must be enclosed in parentheses, separated by a comma, and contain no whitespace.
The chunk labels may be suffixed with one extra condition of the form: `(form)`, `<lemma>`, `[class]`, or `{PoS_regex}`.
For instance,
| The label: | Would match: |
|---|---|
| `np` | any chunk labeled `np` by the chunker |
| `np(cats)` | any chunk labeled `np` by the chunker with a head word with form `cats` |
| `np<cat>` | any chunk labeled `np` by the chunker with a head word with lemma `cat` |
| `np[animal]` | any chunk labeled `np` by the chunker with a head word with a lemma in the `animal` category (see CLASS section below) |
| `np{^N.M[PS]}` | any chunk labeled `np` by the chunker with a head word with a PoS tag matching the regular expression `^N.M[PS]` |
* `pair-constraints` expresses a constraint that must be satisfied by the pair of target chunks `(lchunk,rchunk)`. If no constraints are required, this field must be a dash (`-`). The format of the constraint is `pairclass::(value1,value2)`, where:
  * `pairclass` is the name of a pair class defined in the `<PAIRS>` section (see below).
  * `value1` and `value2` are the two values that must belong to the pair class.
  Each `value` specifies whether the value is to be extracted from `lchunk` (`L`) or `rchunk` (`R`), the path to a node below them (if the target is not the root), and the attribute to extract. For instance,
  * `L.lemma` specifies the lemma of the head word of `lchunk`.
  * `R:sn.pos` specifies the PoS tag of the head word of a node with label `sn` located under `rchunk`.
  * `R:sp:sn.semfile` specifies the semantic file of the head word of a node with label `sn` located under a node with label `sp` located under `rchunk`.
  Valid attributes are: `lemma`, `pos`, `semfile`, `tonto`, `synon`, `asynon`. Their meaning is the same as for dependency labeling rules, and is described below.
* `operation` is the way in which `lchunk` and `rchunk` nodes are to be combined (see below).
* `op-params` has two possible meanings, depending on the `operation` field: `top_left` and `top_right` operations must be followed by the literal `RELABEL` plus the new label(s) to assign to the chunks. Other operations must be followed by the literal `MATCHING` plus the label to be matched.
  For `top_left` and `top_right` operations, the labels following the keyword `RELABEL` state the labels with which each chunk in the pair will be relabelled, in the format `label1:label2`. If specified, `label1` will be the new label for the left chunk, and `label2` the one for the right chunk. A dash (`-`) means no relabelling. If neither chunk is to be relabelled, `-` may be used instead of `-:-`. For example, the rule:
  20 - - (np,pp<of>) top_left RELABEL np-of:- -
  will hang the `pp` chunk as a daughter of the left chunk in the pair (i.e. `np`), then relabel the `np` to `np-of`, and leave the label for the `pp` unchanged.
  For `last_left`, `last_right` and `cover_last_left` operations, the label following the keyword `MATCHING` states the label that a node must have in order to be considered a valid `last` and get the subtree as a new child. This label may carry the same modifying suffixes as the chunk labels. If no node with this label is found in the tree, the rule is not applied. For example, the rule:
  20 - - (vp,pp<of>) last_left MATCHING np -
  will hang the `pp` chunk as a daughter of the last subtree labeled `np` found inside the `vp` chunk.
* The last field, `flag-ops`, is a space-separated list of flags to be toggled on/off. The list may be empty (meaning that the rule doesn't change the status of any flag). If a flag name is preceded by a `+`, it will be toggled on. If the leading symbol is a `-`, it will be toggled off.
For instance, the rule:
20 - - (np,pp<of>) top_left RELABEL - -
states that if two subtrees labelled `np` and `pp` are found contiguous in the partial tree, and the second head word has lemma `of`, then the latter (rightmost) is added as a new child of the former (leftmost), whatever the context is, without requiring any flag to be active, performing no relabelling of the new tree root, and without activating or deactivating any flags.
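To make the chunk label notation used in these rules concrete (`np`, `np(cats)`, `np<cat>`, `np[animal]`, `np{^N.M[PS]}`), here is a minimal parsing sketch. The type and function names are hypothetical, and the sketch assumes that a suffix, if present, runs to the end of the string and that plain labels contain none of the characters `(`, `<`, `[`, `{`.

```cpp
#include <string>

// Kind of extra condition attached to a chunk label.
enum class CondKind { None, Form, Lemma, Class, PosRegex };

struct ChunkPattern {
  std::string label;  // e.g. "np"
  CondKind kind;      // which suffix was used, if any
  std::string value;  // text inside the delimiters
};

// Hypothetical parser for one chunk label with an optional suffix.
// If no well-formed suffix is found, the whole string is the label.
ChunkPattern parse_chunk_label(const std::string &s) {
  ChunkPattern p{s, CondKind::None, ""};
  if (s.empty()) return p;
  std::string::size_type open = s.find_first_of("(<[{");
  if (open == std::string::npos) return p;
  char close;
  CondKind kind;
  switch (s[open]) {
    case '(': close = ')'; kind = CondKind::Form; break;
    case '<': close = '>'; kind = CondKind::Lemma; break;
    case '[': close = ']'; kind = CondKind::Class; break;
    default:  close = '}'; kind = CondKind::PosRegex; break;
  }
  // The suffix must end exactly at the end of the string.
  if (s.back() != close || open + 1 >= s.size()) return p;
  p.label = s.substr(0, open);
  p.kind = kind;
  p.value = s.substr(open + 1, s.size() - open - 2);
  return p;
}
```

Because the first opening delimiter decides the suffix type, a PoS regular expression such as `^N.M[PS]` may itself contain brackets without being mistaken for a class condition.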
The supported tree-building operations are the following:
* `top_left`: The right subtree is added as a daughter of the left subtree. The root of the new tree is the root of the left subtree. If a `label` value other than `-` is specified, the root is relabelled with that string.
* `last_left`: The right subtree is added as a daughter of the last node inside the left subtree matching `label` value (or to the root if none is found). The root of the new tree is the root of the left subtree.
* `top_right`: The left subtree is added as a new daughter of the right subtree. The root of the new tree is the root of the right subtree. If a `label` value other than `-` is specified, the root is relabelled with that string.
* `last_right`: The left subtree is added as a daughter of the last node inside the right subtree matching `label` value (or to the root if none is found). The root of the new tree is the root of the right subtree.
* `cover_last_left`: The left subtree (s) takes the position of the last node (x) inside the right subtree matching `label` value. The node x is hung as a new child of s. The root of the new tree is the root of the right subtree.
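Three of these operations can be sketched on a toy tree as shown below. The `Node` type and all function names are hypothetical. The sketch only relabels the new root, interprets "last node" as the last match in a pre-order traversal, and falls back to the root when nothing matches; each of these choices is an assumption rather than a statement about the parser's behaviour.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Node {
  std::string label;
  std::vector<std::unique_ptr<Node>> children;
  explicit Node(std::string l) : label(std::move(l)) {}
};
using Tree = std::unique_ptr<Node>;

// top_left: the right subtree becomes the last daughter of the left root.
Tree top_left(Tree left, Tree right, const std::string &relabel = "-") {
  if (relabel != "-") left->label = relabel;
  left->children.push_back(std::move(right));
  return left;
}

// top_right: the left subtree becomes a daughter of the right root.
// Inserting at the front keeps the daughters in sentence order.
Tree top_right(Tree left, Tree right, const std::string &relabel = "-") {
  if (relabel != "-") right->label = relabel;
  right->children.insert(right->children.begin(), std::move(left));
  return right;
}

// Returns the last node labeled `label` in pre-order, or nullptr.
Node *find_last(Node *n, const std::string &label) {
  Node *found = (n->label == label) ? n : nullptr;
  for (auto &c : n->children)
    if (Node *f = find_last(c.get(), label)) found = f;
  return found;
}

// last_left: the right subtree hangs from the last node in the left
// subtree matching `label`, or from the left root if none matches.
Tree last_left(Tree left, Tree right, const std::string &label) {
  Node *target = find_last(left.get(), label);
  if (target == nullptr) target = left.get();
  target->children.push_back(std::move(right));
  return left;
}
```

Ownership of both subtrees is transferred into the operation, and the returned pointer owns the combined tree, so a fused pair can be stored back in the place of the two original chunks.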
The context may be specified as a sequence of chunk labels, separated by underscores _
. One of the chunk labels must be $$
, and refers to the pair of chunks which the rule is being applied to.
For instance, the rule:
20 - $$_vp (np,pp<of>) top_left RELABEL -
would add the right chunk in the pair (pp<of>
) under the left one (np
) only if the chunk immediate to the right of the pair is labeled vp
.
Other admitted labels in the context are: ?
(matching exactly one chunk, with any label), *
(matching zero or more chunks with any label), and OUT
(matching a sentence boundary).
For instance the context np_$$_*_vp_?_OUT
would match a sentence in which the focus pair of chunks is immediately after an np
, and the second-to-last chunk in the sentence is labeled vp
.
Context conditions can be globally negated by preceding them with an exclamation mark (`!`). E.g. `!np_$$_*_vp` would cause the rule to be applied only if that particular context is not satisfied.
Context condition components may also be individually negated by preceding them with the symbol `~`. E.g. the context `np_$$_~vp` would be satisfied if the preceding chunk is labeled `np` and the following chunk has any label but `vp`.
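The context notation can be modelled with a small recursive matcher. The sketch below is one possible reading of the description above, not the parser's own code. It assumes that chunk labels contain no underscores, that the pattern is anchored at the focus pair and left open at its far ends unless `OUT` is given, and that neither `?` nor a negated label such as `~vp` matches the sentence boundary. The names `match_from` and `context_matches` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

using Labels = std::vector<std::string>;

// Matches pattern tokens pat[i..] against a prefix of seq[j..].
// seq ends with the sentinel "OUT", standing for the sentence boundary.
static bool match_from(const Labels &pat, std::size_t i,
                       const Labels &seq, std::size_t j) {
  if (i == pat.size()) return true;  // pattern exhausted: success
  const std::string &tok = pat[i];
  if (tok == "*") {                  // zero or more chunks
    for (std::size_t k = j;; ++k) {
      if (match_from(pat, i + 1, seq, k)) return true;
      // '*' consumes chunks only, never the boundary.
      if (k >= seq.size() || seq[k] == "OUT") return false;
    }
  }
  if (j >= seq.size()) return false;
  bool ok;
  if (tok == "?")         ok = seq[j] != "OUT";  // exactly one chunk
  else if (tok[0] == '~') ok = seq[j] != "OUT" && seq[j] != tok.substr(1);
  else                    ok = seq[j] == tok;    // a label, or OUT itself
  return ok && match_from(pat, i + 1, seq, j + 1);
}

// left/right: labels of the chunks before/after the focus pair,
// both given in sentence order.
bool context_matches(std::string ctx, const Labels &left, const Labels &right) {
  if (ctx == "-") return true;  // no restriction
  bool negated = !ctx.empty() && ctx[0] == '!';
  if (negated) ctx.erase(0, 1);

  Labels before, after;         // pattern tokens on each side of "$$"
  bool seen_pair = false;
  std::istringstream in(ctx);
  for (std::string tok; std::getline(in, tok, '_');) {
    if (tok == "$$") seen_pair = true;
    else (seen_pair ? after : before).push_back(tok);
  }
  if (!seen_pair) return false; // "$$" is mandatory

  // Both sides are matched walking outwards from the pair,
  // so the left pattern and the left chunks are reversed.
  std::reverse(before.begin(), before.end());
  Labels lseq(left.rbegin(), left.rend());
  lseq.push_back("OUT");
  Labels rseq(right);
  rseq.push_back("OUT");

  bool ok = match_from(before, 0, lseq, 0) && match_from(after, 0, rseq, 0);
  return negated ? !ok : ok;
}
```

Under these assumptions, `np_$$_*_vp_?_OUT` holds for a pair preceded by `np` and followed by `pp vp adv`, and fails when the pair is followed by `vp` alone, because `?` then has no chunk to consume before the boundary.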
Enabling flags may be defined and used at the grammarian's will. For instance, the rule:
20 INIT|PH1 $$_vp (np,pp<of>) last_left MATCHING nphrase[animal] +PH2 -INIT -PH1
will be applied if either the `INIT` or the `PH1` flag is on, the chunk pair is an `np` followed by a `pp` with head lemma `of`, and the context (one `vp` chunk following the pair) is satisfied. Then, the deepest rightmost node matching the label `nphrase[animal]` will be sought in the left chunk, and the right chunk will be linked as one of its children. If no such node is found, the rule will not be applied.
After applying the rule, the flag PH2
will be toggled on, and the flags INIT
and PH1
will be toggled off.
The only predefined flag is INIT
, which is toggled on when the parsing starts. The grammarian can define any alphanumerical string as a flag, simply toggling it on in some rule.
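The flag mechanism amounts to a set of active flag names that rules consult and update. A minimal sketch, with hypothetical function names and assuming the `flags` and `flag-ops` fields have already been extracted from the rule line as strings:

```cpp
#include <set>
#include <sstream>
#include <string>

using FlagSet = std::set<std::string>;

// A rule is enabled if its flags field is "-" or names an active flag.
bool rule_enabled(const std::string &flags_field, const FlagSet &active) {
  if (flags_field == "-") return true;
  std::istringstream in(flags_field);
  for (std::string name; std::getline(in, name, '|');)
    if (active.count(name) > 0) return true;
  return false;
}

// Applies a flag-ops field such as "+PH2 -INIT -PH1" to the active set.
void apply_flag_ops(const std::string &flag_ops, FlagSet &active) {
  std::istringstream in(flag_ops);
  for (std::string op; in >> op;) {
    if (op.size() < 2) continue;  // ignores a lone "-" or "+"
    if (op[0] == '+') active.insert(op.substr(1));
    else if (op[0] == '-') active.erase(op.substr(1));
  }
}
```

Starting from the set `{INIT}`, a rule with flags `INIT|PH1` is enabled; after applying `+PH2 -INIT -PH1` the set becomes `{PH2}` and the same rule is disabled.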
Dependency function labeling rules
Labelling rules defined in section <GRLAB>
are applied once the tree has been completed and converted to a dependency tree.
For each edge in the tree, the first matching rule is located and applied.
Section `<GRLAB>` contains two kinds of lines.
The first kind are the lines defining UNIQUE
labels, which have the format:
UNIQUE label1 label2 label3 ...
You can specify many `UNIQUE` lines, each with one or more labels. The effect is the same as having all of them in a single line, and the order is not relevant.
Labels in `UNIQUE` lists will be assigned only once per head. That is, if a head has a daughter with a dependency already labeled as `label1`, rules assigning this label will be ignored for all other daughters of the same head (e.g. if a verb has got a `subject` label for one of its dependencies, no other dependency will get that label, even if it meets the conditions to do so).
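The effect of `UNIQUE` can be illustrated with a small sketch. It assumes a hypothetical input format: for each daughter of one head, the list of labels whose rules match it, in rule order. The order in which daughters are visited and the empty string for "no label" are also assumptions of this example.

```cpp
#include <set>
#include <string>
#include <vector>

// Chooses one label per daughter of a single head. A label listed as
// UNIQUE is used at most once among the daughters of that head.
std::vector<std::string> label_daughters(
    const std::vector<std::vector<std::string>> &candidates,
    const std::set<std::string> &unique_labels) {
  std::set<std::string> used;  // UNIQUE labels already assigned
  std::vector<std::string> result;
  for (const auto &options : candidates) {
    std::string chosen;        // stays empty if no rule applies
    for (const std::string &label : options) {
      if (unique_labels.count(label) > 0 && used.count(label) > 0)
        continue;              // rule ignored for this daughter
      chosen = label;
      break;                   // first applicable rule wins
    }
    if (unique_labels.count(chosen) > 0) used.insert(chosen);
    result.push_back(chosen);
  }
  return result;
}
```

If `subj` is declared `UNIQUE` and two daughters both match a `subj` rule followed by an `obj` rule, the first daughter receives `subj` and the second falls through to `obj`.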
The second kind of line states the rules to label the dependences extracted from the full parse tree built with the rules in the previous section.
Each such line contains a rule, with the format:
ancestor-label dependence-label condition1 condition2 ...
where:
* `ancestor-label` is the label of the node which is head of the dependence.
* `dependence-label` is the label to be assigned to the dependence.
* `condition` is a list of conditions that the dependence has to match to satisfy the rule.
Each condition
has one of the forms:
node.attribute = value
node.attribute != value
Where node
is a string describing a node on which the attribute
has to be checked. The value
is a string to be matched, or a set of strings (separated by |
). The strings can be right-wildcarded (e.g. np*
is allowed, but not n*p
). For the pos
attribute, value
can be any valid regular expression.
The node
expresses a path to locate the node to be checked. The path must start with p
(parent node) or d
(descendant node), and may be followed by a colon-separated list of labels. For instance p:sn:n
refers to the first node labeled n
found under a node labeled sn
which is under the dependency parent p
.
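A condition of this kind can be split into its parts with a few string operations. The sketch below uses hypothetical names and handles only simple nodes: a start symbol optionally followed by a colon-separated path, an attribute, `=` or `!=`, and a `|`-separated value list. It does not handle regular expressions for `pos`, which may themselves contain `|`. The second function shows the right-wildcard comparison.

```cpp
#include <sstream>
#include <string>
#include <vector>

struct Condition {
  std::string start;                // first path element, e.g. "p" or "d"
  std::vector<std::string> path;    // labels after the start, e.g. {"sn","n"}
  std::string attribute;            // e.g. "label", "lemma", "tonto"
  bool negated;                     // true for "!=", false for "="
  std::vector<std::string> values;  // alternatives separated by '|'
};

// Splits s on a single-character delimiter.
static std::vector<std::string> split(const std::string &s, char delim) {
  std::vector<std::string> parts;
  std::istringstream in(s);
  for (std::string item; std::getline(in, item, delim);) parts.push_back(item);
  return parts;
}

// Parses conditions such as "d:sn.tonto=Edible" or "p.lemma=eat|gulp".
// Returns false if '=' is missing or no '.' precedes it.
bool parse_condition(const std::string &text, Condition &out) {
  std::string::size_type eq = text.find('=');
  if (eq == std::string::npos || eq == 0) return false;
  out.negated = text[eq - 1] == '!';
  std::string lhs = text.substr(0, out.negated ? eq - 1 : eq);
  std::string::size_type dot = lhs.rfind('.');
  if (dot == std::string::npos) return false;
  out.attribute = lhs.substr(dot + 1);
  out.path = split(lhs.substr(0, dot), ':');
  if (out.path.empty()) return false;
  out.start = out.path.front();
  out.path.erase(out.path.begin());
  out.values = split(text.substr(eq + 1), '|');
  return true;
}

// Right-wildcard match: "np*" matches any string starting with "np".
bool value_matches(const std::string &pattern, const std::string &actual) {
  if (!pattern.empty() && pattern.back() == '*')
    return actual.compare(0, pattern.size() - 1,
                          pattern, 0, pattern.size() - 1) == 0;
  return pattern == actual;
}
```

For `p:sn:n.lemma=cat`, the parsed condition has start `p`, path `sn`, `n`, attribute `lemma`, and the single value `cat`.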
The `node` may also be `As` (All siblings) or `Es` (Exists sibling), which will check the list of all children of the ancestor (`p`), excluding the focus daughter (`d`). `As` and `Es` may be followed by a path, just like `p` and `d`. For instance, `Es:sn:n` will check for a sibling with that path, and `As:sn:n` will check that all siblings have that path.
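The difference between the two quantifiers maps directly onto `std::any_of` and `std::all_of`. In this sketch the siblings are reduced to their labels and the check is passed in as a function; both simplifications, and the function names, are assumptions of the example. Note that with `std::all_of` an `As` condition holds vacuously when there are no siblings, which may or may not match the parser's behaviour.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

using Check = std::function<bool(const std::string &)>;

// Es: at least one sibling satisfies the check.
bool exists_sibling(const std::vector<std::string> &sibling_labels,
                    const Check &check) {
  return std::any_of(sibling_labels.begin(), sibling_labels.end(), check);
}

// As: every sibling satisfies the check (true when there are none).
bool all_siblings(const std::vector<std::string> &sibling_labels,
                  const Check &check) {
  return std::all_of(sibling_labels.begin(), sibling_labels.end(), check);
}
```

The sibling list passed in is expected to exclude the focus daughter already, as described above.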
The `node` may also be a pair of the form `[node1.attribute,node2.attribute]`, where `node1` and `node2` can be a node path (e.g. `p:sn:n`, `p`, etc.) but not `As`, `Es`, or another pair. In this case, the attributes for `node1` and `node2` can only be one of `label`, `lemma`, or `pos`. Finally, the attribute for such a pair node can only be `pairclass`.
Possible attributes to be used:
* `label`: chunk label (or PoS tag) of the node.
* `side`: (left or right) position of the specified node with respect to the other. Only valid for `p` and `d`.
* `lemma`: lemma of the node head word.
* `pos`: PoS tag of the node head word.
* `class`: word class (see below) of the lemma of the node head word.
* `tonto`: EWN Top Ontology properties of the node head word.
* `semfile`: WN semantic file of the node head word.
* `synon`: synonym lemmas of the node head word (according to WN).
* `asynon`: synonym lemmas of the node head word ancestors (according to WN).
* `pairclass`: only applicable to pair nodes. Checks whether the pair is in any of the specified classes.
Note that since no disambiguation is required, the attributes dealing with semantic properties will be satisfied if any of the word senses matches the condition.
For instance, the rule:
verb-phr subj d.label=np* d.side=left
states that if a `verb-phr` node has a daughter to its left, with a label starting with `np`, this dependence is to be labeled as `subj`.
Similarly, the rule:
verb-phr obj d.label=np* d:sn.tonto=Edible p.lemma=eat|gulp
states that if a `verb-phr` node has `eat` or `gulp` as lemma, and a descendant with a label starting with `np` that contains a daughter labeled `sn` with the `Edible` property in the EWN Top Ontology, this dependence is to be labeled as `obj`.
Another example:
verb-phr iobj d.label=pp* d.lemma=to|for Es.label=np*
states that if a `verb-phr` has a descendant with a label starting with `pp` (prepositional phrase) and lemma `to` or `for`, and there is another child of the same parent which is a noun phrase (`np*`), this dependence is to be labeled as `iobj`.
Yet another:
verb-phr dobj d.label=pp* d.lemma=to|for As.label!=np*
states that if a `verb-phr` has a descendant with a label starting with `pp` (prepositional phrase) and lemma `to` or `for`, and none of the other children of the same parent is a noun phrase (`np*`), this dependence is to be labeled as `dobj`.
And some examples for pairclass
:
verb-phr dobj d.label=noun-phr* d.side=right [p.lemma,d.lemma].pairclass=direct
verb-phr loc d.label=pp* d.lemma=in|at d.side=right [p.lemma,d:noun-phr.lemma].pairclass=location
The first rule above states that a `noun-phr` daughter under a `verb-phr` parent will be labeled as `dobj` if it is to the right of its parent, and the pair formed by their lemmas is found in the `direct` pair class (which should be defined in section `<PAIRS>` as described below).
The second rule states that a pp
daughter under a verb-phr
parent will be labeled as loc
if it is to the right of its parent, the preposition heading the pp
is in
or at
, and the pair formed by the verb lemma and the noun heading the noun phrase inside the pp
is found in the location
pair class (which should be defined in section <PAIRS>
as described below).
Semantic database location
Section <SEMDB>
is only necessary if the dependency labeling rules in section <GRLAB>
use conditions on semantic values (that is, any of tonto
, semfile
, synon
, or asynon
). Since it is needed by <GRLAB>
rules, section <SEMDB>
must be defined before section <GRLAB>
. The section must contain a single line specifying a configuration file for a semanticDB object. The filename may be absolute or relative to the location of the dependency rules file.
<SEMDB>
../semdb.dat
</SEMDB>
The configuration file must follow the format described in section Semantic Database.
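The rule that a file name may be absolute or relative to the location of the rule file can be expressed with `std::filesystem` (C++17). This is a sketch of that resolution rule only, not the library's own code; the function name `resolve_relative` is hypothetical.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Resolves a file name found inside a rule file. Absolute names are kept
// as they are; relative names are taken relative to the directory that
// contains the rule file, and "." / ".." components are simplified.
fs::path resolve_relative(const fs::path &rule_file, const fs::path &name) {
  if (name.is_absolute()) return name;
  return (rule_file.parent_path() / name).lexically_normal();
}
```

For a rule file at the made-up location `/data/en/dep/rules.dat`, the entry `../semdb.dat` resolves to `/data/en/semdb.dat` on a POSIX system. The same rule applies to file names given in the `<CLASS>` and `<PAIRS>` sections.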
Class definitions
Section <CLASS>
contains class definitions which may be used as attributes in the dependency labelling rules.
Each line contains a class assignment for a lemma, with two possible formats:
class-name lemma comments
class-name "filename" comments
For instance, the following lines assign to the class `mov` the four listed verbs, and to the class `animal` all lemmas found in the `animals.dat` file. In the latter case, if the file name is not an absolute path, it is interpreted as relative to the location of the rule file.
Anything to the right of the second field is considered a comment and ignored.
<CLASS>
mov go %%prep= to,towards (note that "%%" is not required for comments)
mov come %%prep= from (anything after 2nd field is ignored)
mov walk %%prep= through
mov run %%prep= to,towards D.Obj.
animal "animals.dat"
</CLASS>
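Reading one line of this section only requires the first two whitespace-separated fields. A minimal sketch, with hypothetical names, assuming that a quoted file name contains no whitespace:

```cpp
#include <sstream>
#include <string>

struct ClassEntry {
  std::string class_name;
  std::string target;  // a lemma, or a file name when is_file is true
  bool is_file;
};

// Parses one <CLASS> line. Everything after the second field is ignored.
// Returns false if the line has fewer than two fields.
bool parse_class_line(const std::string &line, ClassEntry &out) {
  std::istringstream in(line);
  std::string second;
  if (!(in >> out.class_name >> second)) return false;
  // A second field wrapped in double quotes names a file of lemmas.
  out.is_file = second.size() >= 2 && second.front() == '"' &&
                second.back() == '"';
  out.target = out.is_file ? second.substr(1, second.size() - 2) : second;
  return true;
}
```

Applied to the lines above, `mov go %%prep= to,towards ...` yields class `mov` with lemma `go`, and `animal "animals.dat"` yields class `animal` with the file name `animals.dat`.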
Pair-class definitions
Section <PAIRS>
contains class definitions of compatible pairs. They can be used as attributes in the tree-completing rules.
Each line contains a class assignment for a pair, with two possible formats:
class-name element1 element2
class-name "filename"
For instance, the following lines assign to the class `material` the first two pairs, to the class `location` the third and fourth pairs, and to the class `food` all pairs found in the `food.dat` file. In the latter case, if the file name is not an absolute path, it is interpreted as relative to the location of the rule file. The contents of the file must be a list of pairs (one per line).
No comments are allowed on the same line as a pair.
<PAIRS>
material table wood
material knife steel
location tree forest
location car road
food "food.dat"
</PAIRS>
These pair classes may be used by tree-completion rules to set constraints that determine whether two chunks must be joined or not, and by labeling rules to check whether a head and its dependent have a certain type of relation.
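A pair-class lookup can be modelled as a map from class names to sets of pairs. The `PairClasses` type below is hypothetical, and treating each pair as ordered (so that `table wood` and `wood table` are different pairs) is an assumption of this sketch.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// Maps each pair class name to the set of (element1, element2) pairs in it.
class PairClasses {
 public:
  void add(const std::string &cls, const std::string &a, const std::string &b) {
    table_[cls].insert(std::make_pair(a, b));
  }
  // True if the ordered pair (a, b) belongs to class cls.
  bool contains(const std::string &cls, const std::string &a,
                const std::string &b) const {
    auto it = table_.find(cls);
    return it != table_.end() && it->second.count(std::make_pair(a, b)) > 0;
  }

 private:
  std::map<std::string, std::set<std::pair<std::string, std::string>>> table_;
};
```

A condition such as `[p.lemma,d.lemma].pairclass=direct` would then reduce to extracting the two lemmas and calling `contains("direct", lemma1, lemma2)`.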