A short description of Dialing Project. Alexey Sokirko, 2001.
A short description of Dialing Project
Introduction
Before the First Level Semantics
Basic Semantic Description
Building Semantic Nodes
Building All Possible Semantic Relations
Building Tree Variants
After the First Level Semantics
Acknowledgments
References
Introduction
This paper gives a short description of a natural language processing system (NLP system) called the Dialing Project (1999-2001). The Dialing Project is a Russian-to-English machine translation system based on a French-to-Russian machine translation system [1] developed by Dr. Leontieva and her group in 1976-84. The author pays attention mainly to the semantic module of the Dialing Project, since he participated in its development.
The Dialing System consists of six modules of text processing:
- Graphematics;
- Morphology;
- Group Syntax;
- First-level Semantics;
- Transfer;
- Synthesis.
At each stage there is a special formal language into which the input text can be translated. First the input text is translated into the graphematical language, then from graphematical into morphological, and so on. Evidently, these languages share some words with each other. In principle it is possible to translate from one language to another without intermediate languages. In practice one should avoid this, because it violates the well-known Law of Demeter ("each component should only know about closely related components"). For example, one should avoid references to the Russian graphematical representation from the English Synthesis stage. Still, this scheme makes it possible to use the modules in different combinations, or even stand-alone; that is why the machine translation system Dialing can be converted into various types of NLP systems.
The main symbols of these six special languages are descriptors: special constants that describe one word or one group of words of the input text. Complex constructions can be built from descriptors, i.e. sets or lists of descriptors, logical formulae and so on. Each descriptor has a meaning and a set of rules that assign it to a particular word or group of words. For example, the descriptor "EOS" means end of sentence, and its set of assignment rules is based on the analysis of full stops, exclamation marks and question marks. When working with several natural languages one can find it helpful to use the same descriptors for different languages. Shared descriptors always have the same meaning, but can differ in their sets of assignment rules.
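As an illustration, an assignment rule for the "EOS" descriptor might be sketched as follows. This is hypothetical code, not the Dialing implementation; the real rules are written in the system's own formal languages.

```python
# Hypothetical sketch of a descriptor assignment rule.
SENTENCE_ENDERS = {".", "!", "?"}

def assign_eos(tokens):
    """Attach the "EOS" descriptor to every token that ends a sentence."""
    descriptors = []
    for i, tok in enumerate(tokens):
        labels = set()
        # A full stop, exclamation or question mark ends a sentence
        # unless it is immediately followed by another sentence ender.
        if tok in SENTENCE_ENDERS and (
            i + 1 == len(tokens) or tokens[i + 1] not in SENTENCE_ENDERS
        ):
            labels.add("EOS")
        descriptors.append(labels)
    return descriptors

print(assign_eos(["It", "works", ".", "Does", "it", "?"]))
```

The result is a list of descriptor sets, one per token, with "EOS" attached to the third and sixth tokens.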
In text processing a descriptor can be attached to a word or to a group. In the Dialing Project any group of words always has a single main word. A group can be continuous or non-continuous. A continuous group can be defined by its lower and upper borders; a non-continuous group can be defined only word by word. A group of two words is called a relation.
Besides the modules of analysis there are dictionaries that hold predefined information about words and groups. This is the most important part of the system, since it is the only source of real knowledge for the computer. When we find an entry in the dictionaries that matches an input word or group, we say that we have found an interpretation. If there are many interpretations for one text item, this is a case of ambiguity. Normally we should choose only one interpretation for a text item. Roughly speaking, the process of analysis can be viewed as a sequence of interpretation choices. In most cases the interpretation of some text item A depends on the interpretations of neighboring items, which in turn depend on the interpretation of A. In other words, ambiguous items are cross-dependent. That is why we have to process all text variants of a word's context and choose the best one. If the number of variants is too large, the programmer should narrow the context or simply cut some variants off. Such narrowing is an unavoidable source of the interpretation errors that a computer usually makes.
Before the First Level Semantics
Now let us describe the Dialing processing stages in more detail. The first processor (Graphematics) divides the input text into words, digital sequences and alphanumeric sequences. It finds electronic addresses, file names and some abbreviations. The graphematical procedure also splits the input text into sentences, paragraphs and headings.
The second processor is Morphology. For each input word form the procedure finds all morphological interpretations in the morphology dictionary. A morphological interpretation is a tuple <Normal form, Part of Speech, Grammems>, for example:
doing -> <do, VERB, ing-form>
In addition, the morphological processor applies heuristic context-dependent rules that resolve some simple cases of morphological ambiguity.
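The interpretation tuple can be modelled as a simple record; the sketch below is illustrative (the toy dictionary and its contents are invented), not the actual morphology module:

```python
from collections import namedtuple

# <Normal form, Part of Speech, Grammems> as in the example above.
MorphInterpretation = namedtuple("MorphInterpretation",
                                 "normal_form pos grammems")

# A toy dictionary mapping word forms to all their interpretations;
# an ambiguous form simply maps to more than one tuple.
MORPH_DICT = {
    "doing": [MorphInterpretation("do", "VERB", ("ing-form",))],
    "saw":   [MorphInterpretation("see", "VERB", ("past",)),
              MorphInterpretation("saw", "NOUN", ("sg",))],
}

def analyze(word_form):
    """Return all morphological interpretations of a word form."""
    return MORPH_DICT.get(word_form, [])

print(analyze("doing"))
print(len(analyze("saw")))  # 2: morphological ambiguity
```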
The next stage is occupied by the syntax and fragmentation modules, which are closely connected. The fragmentation module finds clauses in Russian compound sentences. The syntax module builds continuous syntax groups inside clauses (NP, PP etc.). Every word of a syntax group has a determined morphological interpretation, whereas a clause can include items that are not fully interpreted. Since the Russian language has free word order, the syntax groups of the Dialing Project cover about 70% of the words. We consider that the rest of the syntax structure (30%) should be built at the next stage, i.e. First Level Semantics.
Basic Semantic Description
The First Level Semantic analysis builds a semantic graph for one Russian sentence. Generally speaking, the concept of "semantics" is not well defined in computational linguistics. Theoretical linguists and scholars in the humanities understand this word more deeply than specialists in applied linguistics. Scholars in the humanities usually think that natural language semantics is a science that deals with definitions of words and collocations. Accordingly, they feel that formal linguistics should build formal definitions. However, recent investigations in this domain convince us that full formal descriptions of words cannot be effectively combined to produce a structure for the whole input text. The main reason for this is the size of the descriptions. For example, consider a definition of the word "table" [2]:
"Table" = "a piece of furniture having a smooth flat top supported by one or more vertical legs" (WordNet).
Now imagine what a mess the input text would become after replacing all original words with such definitions.
If formal definitions cannot be used in computational linguistics due to their size, then what is the difference between semantics and syntax? Following Dr. Leontieva [9-10], we assume that the difference is in the relations and descriptors. Semantic relations are universal, while syntax relations are language-dependent. Thus a semantic structure is a graph with semantic relations. The number of nodes in the graph is approximately equal to the number of words in the input text, but there are some exceptions. Certain words can be divided into two or more nodes; for example, the German word "Donaudampfschifffahrtsgesellschaftskapitänswitwenrentenauszahlungsstelle" will be translated into 8 nodes ("Danube", "steamship company", "society", "captain", "widow", "rent", "payment", "office"). On the other hand, some words can be united into one semantic node, for example "one hundred and forty" -> "140". But the most frequent correspondence is "one word -> one node", which makes the whole structure transparent. Consider an example of analysis:
Input Text: I write with a pencil
Syntax Structure: NP -> I; V -> write; PP -> with a pencil
Semantic Structure: Agent(I, write); Instrument(pencil, write)
Although the semantic structure is very simple and therefore very productive, it is necessary to understand that it is very hard to obtain such a structure without errors. In this light one should not overvalue computational semantics as a whole. For example, if you remove the semantic module from the Dialing Project, the translation becomes only about 20% worse.
Now we are ready to describe the semantic machinery used in the Dialing Project. There are three types of objects to be discussed:
- semantic relations;
- semantic descriptors;
- semantic categories.
Most of the semantic relations we use are widespread (see, for example, the Universal Networking Language). Semantic relations are written in the following format: R(A,B), where R is the relation name, A is the slave semantic node and B is the master semantic node. A formula R(A,B) with some R, A and B agrees with the input text iff "A is R for B", i.e.:
R(A,B) <=> "A is R for B".
For example, the formula Author(Leo Tolstoy, War and Peace) agrees with the text "the author of "War and Peace" Leo Tolstoy", since the text implies that Leo Tolstoy is the author of "War and Peace".
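A semantic relation can be represented as a labelled edge between a slave node and a master node. The sketch below is illustrative (the record type and gloss function are invented), but it follows the R(A,B) convention directly:

```python
from collections import namedtuple

# R(A, B): A is the slave node, B is the master node, R the relation name.
Relation = namedtuple("Relation", "name slave master")

def gloss(rel):
    """Read off the agreement gloss 'A is R for B' for a relation."""
    return f"{rel.slave} is {rel.name} for {rel.master}"

# "the author of 'War and Peace' Leo Tolstoy" implies:
r = Relation("Author", "Leo Tolstoy", "War and Peace")
print(gloss(r))  # Leo Tolstoy is Author for War and Peace
```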
The next principal object of our semantics is the semantic descriptor. There are about 40 semantic descriptors in the dictionary, like Animative, Rank, Organization etc. A formula built from logical operations (conjunction, disjunction) with semantic descriptors as atoms is attached to each entry in the dictionary and to each actant pattern. For example:
SemDes(president) = Power & Rank.
Initially such formulae were introduced to choose the right sense of a word when more than one sense exists. Subsequently the semantic descriptor acquired a special meaning:
SemDes(A) = SemDes(B) <=> "The meanings of A and B have something in common".
The third semantic instrument is semantic categories. There are four of them: 1. Situations; 2. Objects; 3. Properties; 4. Operators. Situations (go, explosion, ...) always have two latent valencies: where and when they happened. Objects (keyboard, monitor, ...) have only one latent valency: where they are located. Properties (red, old, ...) modify situations and objects. Operators (not, yet, ...) only modify the semantic structure; they never correspond to separate semantic nodes.
All these instruments are used in the description of word senses. Consider an example:
Title = read
Sense = 1
Category = Situation
SemDes = Process
Valency = Agent(C, A1); Object(C, A2)
SemDes1 = Animative
GramDes1 = subj
SemDes2 = Information & Thing
GramDes2 = obj
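The entry above can be modelled as a record whose fields follow the entry's own labels; everything else in this sketch (class names, the list layout) is invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Valency:
    relation: str   # e.g. "Agent": Agent(C, A1) links clause C and actant A1
    sem_des: str    # semantic constraint on the actant (SemDesN)
    gram_des: str   # surface grammatical constraint (GramDesN)

@dataclass
class DictEntry:
    title: str
    sense: int
    category: str
    sem_des: str
    valencies: list = field(default_factory=list)

# The "read, sense 1" entry from the text, as a record.
read_1 = DictEntry(
    title="read", sense=1, category="Situation", sem_des="Process",
    valencies=[
        Valency("Agent",  "Animative",           "subj"),
        Valency("Object", "Information & Thing", "obj"),
    ],
)
print(read_1.valencies[0].relation)  # Agent
```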
That was a short description of the semantic theory adopted in the Dialing Project. Let us proceed to the description of the program.
The input of the semantic module is a set of syntax linkages. The output is a semantic structure. While building the resulting semantic structure the program faces the following problems:
- Morphological ambiguity, since it cannot be fully resolved at the syntax stage.
- Lexical ambiguity, for example "I saw a bat", where a bat might refer to an animal or, among other things, a table tennis bat.
- Structural ambiguity, for example "John sold the dog apple", where at least two interpretations are possible: "John sold the apple to the dog" or "John sold the apple that somehow resembles a dog".
These types of ambiguity determine the design of the semantic module:
input text -> M1,...,Mn (morphological variants)
each Mi -> L1,...,Ln (lexical variants)
each Li -> S1,...,Sn (structural variants)
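The cascade above can be sketched as nested enumeration, where each level expands every variant of the previous one. The expander functions below are toy stand-ins, not the real modules:

```python
def expand_variants(text, morph, lex, struct):
    """Enumerate structural variants S for every lexical variant L of
    every morphological variant M of the input text."""
    for m in morph(text):
        for l in lex(m):
            for s in struct(l):
                yield (m, l, s)

# Toy expanders: 2 morphological x 2 lexical x 1 structural = 4 variants.
morph  = lambda t: [t + "/M1", t + "/M2"]
lex    = lambda m: [m + "/L1", m + "/L2"]
struct = lambda l: [l + "/S1"]

variants = list(expand_variants("text", morph, lex, struct))
print(len(variants))  # 4
```

The combinatorial growth visible here is exactly why the later sections are concerned with cutting unpromising variants off.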
Building Semantic Nodes
At the first level the procedure works with one morphological variant. In short, this level aims to build semantic nodes and to assign dictionary interpretations to them. The most interesting parts of the level are as follows:
1. Finding time groups;
2. Finding fixed collocations;
3. Getting all semantic interpretations for each semantic node.
Procedures 1) and 2) deal with collocations, so it is time to have a look at collocations in the Dialing Project. Algorithmically, there are two distinctions: conditional/unconditional collocations and closed/open collocations. Conditional collocations differ from unconditional ones in the method of finding them. To find an unconditional collocation in a text one needs to know only its syntactic and morphological properties. On the contrary, there must be strong semantic evidence for conditional collocations; that is why we can find conditional collocations only at the semantic stage. Here are examples:
Unconditional:
- to be a chip off the old block
  (Yes, Tom is just a chip off the old block.)
- come hell or high water
  (Finish your homework come hell or high water!)
Conditional:
- clear the air
  (His apology cleared the air. / Drivers! Clear the air with Flame Gard Filters!)
- climb the wall
  (If I don't go on a vacation soon, I'll climb the wall. / If you can't climb the wall, build a door.)
Unconditional collocations are usually more idiomatic and longer than conditional ones.
The difference between open and closed collocations is more rigid. A closed collocation corresponds to only one semantic node in a semantic structure, while an open collocation has as many semantic nodes as it has words. The words of an open collocation can be connected with other nodes of the semantic structure, but of course there must be some irregularity either in the syntax of the collocation or in its translation. Each node of an open collocation has a separate dictionary interpretation, whereas a closed collocation can be interpreted only as a whole.
Examples:
Open collocations:
- time groups with a special use of the colon
  (at 7:30 / at 7:30 Friday evening)
- crack a smile
  (I finally got him to crack a smile. / He would crack a forced smile.)
Closed collocations:
- cut and dried
  (That's my answer, cut and dried.)
- cross the Rubicon
  (He crossed the Rubicon.)
Obviously, this distinction is not absolute, but it is very practical. It is clear that the computer can process unconditional/closed collocations more quickly than conditional/open ones. If we could forget about optimization issues, we would make all collocations open (just in case) and conditional.
Collocations are stored in the collocation dictionaries, some of which are described below.
The dictionary of fixed functional words consists of complex conjunctions (in order to, so that, ...) and complex prepositions (on behalf of, ...). They are closed and, unfortunately, conditional, since there are examples like this:
(1) to be, or not to be — that is the question // not a conjunction
(2) God has given us complete autonomy, that is self-government // conjunction
The next dictionary is the Dialing Thesaurus. Its structure resembles WordNet. The basic relations are hypernymy/hyponymy and meronymy/holonymy. One thesaurus relation links two synsets; that much is commonplace. However, there are some differences from standard thesauri:
- Only noun phrases can be inserted into the Dialing Thesaurus, no VPs.
- Each thesaurus entry can be supplied with a valency structure that is expressed by means of semantic relations and descriptors.
- Any thesaurus synset can be used as a semantic descriptor to select a word's actant.
By default all thesaurus entries are open and unconditional collocations.
The next collocation dictionary is the Time Group Dictionary. Time groups fill Time valencies, for example:
On Monday we go to the Zoo.
These collocations are open and unconditional, even though non-prepositional time phrases can be conditional, for example:
(1) Sunday morning will be all right (not a Time valency)
(2) We'll meet Sunday morning (a Time valency)
The syntax of time groups is very special; that is why we consider them collocations:
(1) Tuesday September 4 1:34 AM ET
(2) in the 1980s
The last collocation dictionary is called the Fixed Collocation Dictionary. Most entries of this dictionary are verb phrases, like clear the air or climb the wall; some entries are considered unconditional and closed, for example chip off the old block.
Besides the collocation dictionaries there is also one big dictionary for words, called the Russian Common-Purpose Dictionary. Thus, the Dialing Project uses the following Russian dictionaries:
- Russian Common-Purpose Dictionary (main);
- Thesaurus;
- Time Group Dictionary;
- Functional Words Dictionary;
- Fixed Collocations Dictionary.
Each entry of these dictionaries is identified by Title and SenseId. In other words, a dictionary interpretation is the following tuple:
<Dictionary Name, Title, SenseId>
If no dictionary interpretation can be found for a word, a default one is used. For example, there is a default interpretation for transitive verbs. Of course, the program should choose only one dictionary interpretation for one semantic node; that is why the program has to search among all lexical variants (a variant being one way of attaching dictionary interpretations to semantic nodes).
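The interpretation lookup with a default fallback can be sketched like this. The dictionary contents and the shape of the default interpretation are invented for illustration:

```python
# Toy dictionaries holding (Title, SenseId) pairs.
DICTIONARIES = {
    "CommonPurpose": {("read", 1), ("read", 2), ("pencil", 1)},
    "FixedCollocations": {("chip off the old block", 1)},
}

def interpretations(item):
    """All <Dictionary Name, Title, SenseId> interpretations of a text
    item; if none is found, fall back to a single default interpretation."""
    found = [(name, title, sense)
             for name, entries in DICTIONARIES.items()
             for (title, sense) in entries if title == item]
    return found or [("Default", item, 0)]

print(len(interpretations("read")))  # 2: two lexical variants -> ambiguity
print(interpretations("zzz"))        # falls back to the default
```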
Building All Possible Semantic Relations
The main steps the program makes while processing one lexical variant concern semantic relations:
- building the valency structure of each node;
- building all possible relations;
- processing similar nodes.
The first procedure builds the valency structure of each node. Here further complications arise. Firstly, some valencies are not compatible with others, and this information is written in the dictionary in a special format. Secondly, there are default valencies that should be added to some types of nodes in any case. For example, any Situation node has a default Time valency. Thirdly, some valencies are mandatory and some are not.
The second procedure builds all possible semantic relations without checking semantic conditions. It uses morphological and syntactic information about nodes and the information that is written in the GramDes field [3] of a node's article. For example, let X be a node and let An be the n-th valency of word X. If the dictionary article of X includes the formula "GramDesn = of+NP", then each PP with the preposition "of" will be connected to X. Usually this procedure builds many more relations than needed; it is the task of the next procedures to choose the right hypotheses.
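The second procedure can be sketched as matching GramDes patterns against surface groups. The data model below (node dictionaries, surface labels) is invented; the point is only that every matching group is connected, deliberately over-generating relations:

```python
def build_possible_relations(nodes):
    """Connect every group whose surface form matches the GramDes
    pattern of some valency, without checking semantic conditions."""
    relations = []
    for x in nodes:
        for rel_name, pattern in x["valencies"]:
            for y in nodes:
                if y is not x and y["surface"] == pattern:
                    relations.append((rel_name, y["word"], x["word"]))
    return relations

nodes = [
    {"word": "roof", "surface": "NP",
     "valencies": [("Part", "of+NP")]},   # GramDes1 = of+NP
    {"word": "of the house", "surface": "of+NP", "valencies": []},
    {"word": "of the car",   "surface": "of+NP", "valencies": []},
]
# Both of-phrases get attached: more relations than needed, on purpose.
print(build_possible_relations(nodes))
```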
The ratio R/N, where R is the number of all possible relations and N is the number of nodes, can measure the complexity of the input text. If R/N > 1.7, the current version of the program invokes heuristic procedures that delete unpromising relations.
It is well known that the worst enemies of NLP are ambiguity and similarity; that is why the procedure for similar nodes is one of the most important parts of this level. Similar words occupy only one valency place; that is why there should be a proxy (we call it a Multiple Node Actant, or MNA) that represents all the similar words. There are several MNA syntax types:
(1) I like Peter, Mary, and John (basic)
(2) I like neither Peter nor Mary (correlative conjunction)
(3) That will do more harm than good (comparative)
After some kind of MNA is found, we need to identify all possible nodes that should be subdued by this MNA and to find all possible parents of the MNA. Our solution to the last problem resembles the Link Parser [3] similar rule, where the linkage of the sentence
I like Peter and John
is built from the two linkages
I like Peter
I like John.
In short, a node X becomes a parent of an MNA if X subdues two children of the MNA.
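The parent rule can be sketched as counting, for each candidate node X, how many of the MNA's children X already subdues. The relation triples follow the R(A,B) convention from above; the data is invented:

```python
def mna_parents(possible_relations, mna_children):
    """X is a parent candidate of the MNA if the possible relations
    connect X to at least two of the MNA's children
    (cf. 'I like Peter and John' built from 'I like Peter' + 'I like John')."""
    counts = {}
    for rel_name, slave, master in possible_relations:
        if slave in mna_children:
            counts[master] = counts.get(master, 0) + 1
    return [x for x, c in counts.items() if c >= 2]

# "I like Peter and John": "like" subdues both children of the MNA.
relations = [("Object", "Peter", "like"),
             ("Object", "John", "like"),
             ("Agent", "I", "like")]
print(mna_parents(relations, {"Peter", "John"}))  # ['like']
```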
Building Tree Variants
The last procedure searches for the best tree variant, and it is the slowest part of the whole program. Obviously it is impossible to check all tree variants for a sentence of standard length (20-30 words). We know two implemented methods to restrict the search space. The first method, used in the ETAP system [4], chooses the best variant for each node locally. Initially it finds the root of the clause, then it chooses the root's children, and so on. This means that choices on different levels are independent. Evidently that is not a correct method, because it is not clear in which order the program should go through the nodes. But we do not want to criticize it, since the existence of fully correct methods is questionable. The second known method was developed in the Mikrokosmos Project by Stephen Beale [5]. It is based on Constraint Satisfaction [6]. In short, this method divides the graph into subgraphs with a minimal ratio N/M, where N is the number of nodes in the subgraph and M is the number of nodes of the subgraph that are connected to nodes outside it.
After finding the first subgraph division it resolves the ambiguity in each subgraph. Then it finds new subgraphs, and so on. One can see that the Mikrokosmos method is also not fully correct, since choices in different subgraphs are independent, but we know that the only fully correct method is too slow. That is why, to some extent, the Dialing tree variant procedure is a kind of Mikrokosmos method.
Our procedure hinges on the assumption that the number of cycles in the graph of all possible relations is small. For the Russian language this fact has been confirmed experimentally, though there is no evidence that it holds for other languages. If the number of cycles is small, then the following steps are reasonable:
1. Delete one relation in each cycle in all possible ways; this produces a set of acyclic graph variants. For example, if there is one cycle with three nodes, there are three ways to make the graph acyclic.
2. If there is only one relation that points to a node, fix this relation. We are sure that all fixed relations will be in the resulting variant, because the resulting variant should be connected.
3. Find the best tree variants in each maximal connected component that has no fixed relation.
4. Combine all the best tree variants into the resulting tree variant.
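The fixed-relation idea (step 2) can be sketched as follows; cycle deletion and component handling are omitted, and the data model is invented, so this is a toy sketch of the pruning effect, not the Dialing procedure:

```python
from itertools import product

def tree_variants(nodes, relations):
    """Each non-root node must be attached by exactly one relation.
    If only one relation points to a node, it is 'fixed' and occurs in
    every variant; otherwise all alternatives are enumerated."""
    per_node = []
    for node in nodes:
        pointing = [r for r in relations if r[1] == node]
        if pointing:                 # fixed when len(pointing) == 1
            per_node.append(pointing)
    for combo in product(*per_node):
        yield list(combo)

nodes = ["write", "I", "pencil"]
relations = [
    ("Agent", "I", "write"),            # fixed: the only way to attach "I"
    ("Instrument", "pencil", "write"),  # two competing hypotheses
    ("Attrib", "pencil", "I"),          # for attaching "pencil"
]
variants = list(tree_variants(nodes, relations))
print(len(variants))  # 1 x 2 = 2; the fixed relation is in both variants
```

Every fixed relation cuts the branching factor of its node to one, which is why a 20% share of fixed relations already makes the search several times quicker.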
One can see that besides the number of cycles, the number of fixed relations is also crucial for this procedure. If the input graph has no fixed relations, then the whole procedure degenerates into a full search. In practice the share of fixed relations is about 20%, which makes the procedure several times quicker than the full search. This method was tested and no counter-examples were found.
Now the key question is the evaluation of tree variants. There are several parameters of different importance that together produce the value of a tree variant:
- number of connected components;
- projectivity;
- length of relations;
- valency order;
- mandatory and optional valencies;
- matching semantic descriptors;
- ...
The program uses more than twenty parameters. The most important thing about them is that they should be linguistically transparent: no special search tricks are admitted. Thus, the Dialing Project uses a transparent search algorithm (with fixed relations) and a linguistically motivated set of parameters to evaluate variants. We believe any NLP system should be designed in this way:
- the search algorithm should be simple and perform a nearly full search;
- the measures should be linguistically transparent.
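Such an evaluation can be sketched as a weighted sum of transparent parameters. Both the weights and the feature values below are invented for illustration; only the shape of the computation is the point:

```python
def evaluate(variant_features, weights):
    """Score a tree variant as a weighted sum of linguistically
    transparent parameters; a higher score means a better variant."""
    return sum(weights[name] * value
               for name, value in variant_features.items())

weights = {
    "connected_components": -10.0,        # fewer components is better
    "projectivity": 5.0,
    "relation_length": -1.0,              # shorter relations are better
    "filled_mandatory_valencies": 8.0,
    "semdes_matches": 3.0,
}

v1 = {"connected_components": 1, "projectivity": 1, "relation_length": 4,
      "filled_mandatory_valencies": 2, "semdes_matches": 2}
v2 = {"connected_components": 2, "projectivity": 1, "relation_length": 3,
      "filled_mandatory_valencies": 1, "semdes_matches": 1}

print(evaluate(v1, weights) > evaluate(v2, weights))  # True
```

Because every term is a named linguistic property, a bad choice can always be traced back to a concrete parameter rather than to an opaque search trick.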
After the First Level Semantics
The First Level Semantics sends the best semantic structure to the Transfer stage, which performs two main tasks. Firstly, it maps the semantic structure to the English Semantic Dictionary, i.e. finds the best translation for each semantic node. To find the best translation one should consider the semantic node's descriptors and valencies.
Secondly, the program makes some changes in the semantic structure to prepare it for English Synthesis. For example, in the Dialing Project all phrases like:
The girl is beautiful
The beautiful girl
The girl who is beautiful
have the same semantic structure Attrib(beautiful, girl) with different syntax information; hence Transfer should have a procedure which inserts the node "be" into the structure, like this:
[Figure: the structure Attrib(beautiful, girl) before and after the transformation; in the transformed structure a new node "be" links "girl" and "beautiful".]
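This transformation can be sketched as a graph rewrite over relation triples. The relation names in the rewritten output ("Subj", "Prop") are invented for illustration; the Dialing relation inventory may differ:

```python
def insert_be(relations):
    """Rewrite every Attrib(A, B) into two relations through a new
    'be' node, preparing the structure for English Synthesis."""
    out = []
    for name, slave, master in relations:
        if name == "Attrib":
            out.append(("Subj", master, "be"))  # girl attaches to "be"
            out.append(("Prop", slave, "be"))   # beautiful attaches to "be"
        else:
            out.append((name, slave, master))
    return out

print(insert_be([("Attrib", "beautiful", "girl")]))
```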
The last step is English Synthesis. Its main task is to produce a string of English words; therefore all problems of word order are solved here. The word order problem is relatively simple when the semantic graph has only one connected component. Things become worse when the graph is not connected. In this case we have to put all connected components one after another in the output sentence, which usually produces unsound results. Now we can say that in the Dialing Project it is crucial to achieve fully connected graphs at the First Level Semantics stage. To achieve this we need to enlarge our semantic dictionaries.
Acknowledgments
I would like to thank all members and consultants of the Dialing Project, and especially Dialing Company President Eduard Khachukaev, who funded this project. More information about the Project can be found at www.aot.ru.
References
[1] The Concise Oxford Dictionary of Linguistics. Oxford University Press, 1997.
[2] English as 2nd Language. http://esl.about.com
[3] Temperley D., Lafferty J., Sleator D. 1995. Link Grammar Parser. http://www.link.cs.cmu.edu/link
[4] Russian Academy of Sciences, Institute for Information Transmission Problems. ETAP-3 system. http://proling.iitp.ru/Etap/index_e.html
[5] Beale, Stephen. 1996. Hunter-Gatherer: Applying Constraint Satisfaction, Branch-and-Bound and Solution Synthesis to Natural Language Semantics. NMSU CRL Technical Report MCCS-96-292.
[6] Tsang E. 1993. Foundations of Constraint Satisfaction. Academic Press, London.
[7] Blekhman, Michael; Pevzner, Boris. First Steps of Language Engineering in the USSR: The 50s through 70s. http://www.bcs.org.uk/siggroup/nalatran/mtreview/mtr-11/mtr-11-6.htm
[8] Hutchins, W. John. Machine translation over fifty years. http://ourworld.compuserve.com/homepages/WJHutchins/HEL.htm
[9] Leontieva, Nina. Textual Facts as Units of Coherent Text Semantic Analysis // Arbeitspapiere der GMD 671. Darmstadt, 1992.
[10] Leontieva, Nina. ROSS: Semantic Dictionary for Text Understanding and Summarization // META. 1995. 40, No. 1. Montreal, 1995.
Footnotes:
[1] Some information about the system can be found in [7] and [8].
[2] Many examples in this paper are translated into English to help the reader understand problems that exist in Russian.
[3] GramDes = Grammatical Description field.