A short description of Dialing Project. Alexey Sokirko, 2001.

Contents: Introduction; Before the First Level Semantics; Basic Semantic Description; Building Semantic Nodes; Building all possible semantic relations; Building Tree Variants; After the First Level Semantics; Acknowledgments; References.
                          
                          
                            
Introduction

This paper gives
a short description of a natural language processing system (NLP system) called Dialing Project (1999-2001). Dialing Project is a Russian-to-English machine translation
system based on the French-to-Russian machine translation system [1] developed by Dr. Leontieva and her group in 1976-84. The author of this paper focuses mainly on the semantic module of Dialing Project, since he participated in its development.

The Dialing System consists of six modules of text processing:

1. Graphematics;
2. Morphology;
3. Group Syntax;
4. First-Level Semantics;
5. Transfer;
6. Synthesis.

On each stage there is a special formal language into which one can
translate the input text. Firstly, the input text is translated into the graphematical language, then from graphematical to morphological, and so on. Evidently, these languages share some words with each other. In principle it is possible to translate from one language to another without intermediate languages. In practice one should avoid this, because it violates the well-known Law of Demeter ("each component should only know about closely related components"). For example, one should avoid references to the Russian graphematical representation from the English Synthesis stage. Still, this scheme makes it possible to use the modules in different combinations, or even stand-alone, which is why the Dialing machine translation system can be converted into various types of NLP systems.

The main symbols of these six
special languages are descriptors: special constants that describe one word or a group of words of the input text. We can build complex constructions based on descriptors, i.e. sets or lists of descriptors, logical forms, and so on. Each descriptor has its meaning and a set of rules that can assign it to a particular word or group of words. For example, the descriptor "EOS" means end of sentence, and its set of assigning rules is based on the analysis of full stops, exclamation marks, and question marks. While working with natural languages one finds it helpful to use the same descriptors for different natural languages. Shared descriptors always have the same meaning, but can differ in their sets of assigning rules.

In text processing a descriptor can be attached to a word or a group. In Dialing Project any group of words always has a single main word. A group can be continuous or non-continuous. A continuous group can be defined by its lower and upper borders; a non-continuous group can be defined only word by word. A group of two words is called a relation.

Besides the modules of analysis there are dictionaries that hold
predefined information about words and groups. This is the most important part
of the system, since it is the only source of real knowledge for the computer. When we find an entry in the dictionaries that matches the input word or group, we say that we have found an interpretation. If there are many interpretations for one text item, this is a case of ambiguity. Normally we should choose only one interpretation for a text item. Roughly speaking, the process of analysis can be viewed as a sequence of interpretation choices. In most cases the interpretation of some text item A depends on the interpretations of neighboring items, which themselves depend on the interpretation of A. In other words, ambiguous items are cross-dependent. That is why we have to process all variants of a word's context and choose the best one. If the number of variants is too large, then the programmer should narrow the context or simply cut some variants off. Such narrowing is an unavoidable source of the interpretation errors that a computer usually makes.
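The sequence of interpretation choices over cross-dependent items can be sketched as an exhaustive search over combinations of interpretations, scored by a pairwise compatibility function. The items, interpretations, and scores below are invented for illustration; they are not part of Dialing Project:

```python
from itertools import product

# Hypothetical ambiguous items: each text item has several interpretations.
items = {
    "saw": ["see/VERB", "saw/NOUN"],
    "bat": ["animal", "club"],
}

# Illustrative pairwise compatibility scores between interpretations.
SCORES = {
    ("see/VERB", "animal"): 2,
    ("see/VERB", "club"): 1,
    ("saw/NOUN", "animal"): 0,
    ("saw/NOUN", "club"): 1,
}

def best_variant(items, scores):
    """Enumerate all interpretation combinations and pick the best-scoring one.

    Interpretations are cross-dependent, so each combination is scored as a
    whole rather than item by item.
    """
    words = list(items)
    best, best_score = None, float("-inf")
    for combo in product(*(items[w] for w in words)):
        score = sum(scores.get(pair, 0) for pair in zip(combo, combo[1:]))
        if score > best_score:
            best, best_score = dict(zip(words, combo)), score
    return best

print(best_variant(items, SCORES))  # {'saw': 'see/VERB', 'bat': 'animal'}
```

The number of combinations grows multiplicatively, which is exactly why the text warns that the context must sometimes be narrowed or variants cut off.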
                          
                          
                            
Before the First Level Semantics

Now let us describe the Dialing processing stages in more detail. The first
processor (Graphematics) divides the input text into words, digit sequences, and alphanumeric sequences. It finds e-mail addresses, file names, and some abbreviations. The graphematical procedure also splits the input text into sentences, paragraphs, and headings.

The second processor is Morphology. For each input word form the procedure finds all morphological interpretations in the morphology dictionary. A morphological interpretation is a tuple <Normal form, Part of Speech, Grammems>, for example:

  doing -> <do, VERB, ing-form>

In addition, the morphological processor applies heuristic
context-dependent rules that resolve some simple cases of morphological
ambiguity.

The next stage is occupied by the syntax and fragmentation modules, which are closely connected. The fragmentation module finds clauses in Russian compound sentences. The syntax module builds continuous syntax groups inside clauses (NP, PP, etc.). Any word of a syntax group has a determined morphological interpretation, whereas a clause can include some items that are not fully interpreted. Since Russian has free word order, the syntax groups of Dialing Project cover about 70% of words. We consider that the rest of the syntax structure (the remaining 30%) should be built on the next stage, i.e. First Level Semantics.
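The morphological interpretation tuples described above can be sketched as a simple dictionary lookup. The tiny dictionary here is an illustrative stand-in for the real morphology dictionary, not its actual contents:

```python
from typing import NamedTuple

class MorphInterp(NamedTuple):
    normal_form: str
    part_of_speech: str
    grammems: str

# A hypothetical fragment of a morphology dictionary: one word form may
# have several interpretations (morphological ambiguity).
MORPH_DICT = {
    "doing": [MorphInterp("do", "VERB", "ing-form")],
    "saw":   [MorphInterp("see", "VERB", "past"),
              MorphInterp("saw", "NOUN", "sg")],
}

def analyze(word_form: str):
    """Return all morphological interpretations found for a word form."""
    return MORPH_DICT.get(word_form.lower(), [])

print(analyze("doing"))  # [MorphInterp(normal_form='do', part_of_speech='VERB', grammems='ing-form')]
```

The "saw" entry shows why later stages must still cope with residual morphological ambiguity: the dictionary alone cannot decide between the verb and noun readings.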
                          
                          
                            
Basic Semantic Description

The First Level Semantic analysis builds a semantic graph for one
Russian sentence. Generally speaking, the concept of "semantics" is not well defined in computational linguistics. Theoretical linguists and humanities scholars understand this word more deeply than specialists in applied linguistics. Usually pure humanists think that natural language semantics is a science that deals with definitions of words and collocations. Accordingly, they feel that formal linguistics should build formal definitions. However, recent investigations in this domain convince us that full formal descriptions of words cannot be effectively combined to produce a structure for the whole input text. The main reason is the size of the descriptions. For example, consider a definition of the word "table" [2]: "table" = "a piece of furniture having a smooth flat top supported by one or more vertical legs" (WordNet). Now imagine what a mess the input text would become after replacing all original words with such definitions.

If formal definitions cannot be used in computational linguistics due to their size, then what is the difference between semantics and syntax? Following Dr. Leontieva [9, 10] we assume that the difference lies in the relations and descriptors. Semantic relations are universal while syntax relations are
language-dependent. Thus a semantic structure is a graph with semantic relations. The number of nodes in the graph is approximately equal to the number of words in the input text, but there are some exceptions. Certain words can be divided into two or more nodes; for example, the German word "Donaudampfschifffahrtsgesellschaftskapitänswitwenrentenauszahlungsstelle" will be translated into 8 nodes ("Danube", "steamship company", "society", "captain", "widow", "rent", "payment", "office"). On the other hand, some words can be united into one semantic node, for example "one hundred and forty" -> "140". But the most frequent correspondence is "one word -> one node", which makes the whole structure transparent. Consider an example of analysis:

  Input Text: I write with a pencil
  Syntax Structure: NP -> I; V -> write; PP -> with a pencil
  Semantic Structure: Agent(I, write); Instrument(pencil, write)

Although the semantic structure is
very simple and therefore very productive, it is necessary to understand that it is very hard to build such a structure without errors. In this light one should not overvalue computational semantics as a whole. For example, if you remove the semantic module from Dialing Project, the translation becomes only 20% worse.

Now we are ready to describe the semantic machinery used in Dialing Project. There are three types of objects to be discussed:

1. semantic relations;
2. semantic descriptors;
3. semantic categories.

Most of the semantic relations we use are widespread (see, for example, Universal Networking Language). Semantic relations are written in the following format: R(A,B), where R is a relation name, A is the slave semantic node, and B is the master semantic node. A formula R(A,B) with some R, A, and B agrees with the input text iff "A is R for B", i.e.:

  R(A,B) <=> "A is R for B"

For example, the formula Author(Leo Tolstoy, War and Peace)
agrees with the text "the author of "War and Peace" Leo Tolstoy" since
the text implies that Leo Tolstoy is the author of "War and Peace".

The next principal object of our semantics is the semantic descriptor. There are about 40 semantic descriptors in the dictionary, like Animative, Rank, Organization, etc. A formula built from logical operations (conjunction, disjunction) with semantic descriptors as atoms is attached to each entry in the dictionary and to each actant pattern. For example: SemDes(president) = Power & Rank. Initially such formulae were introduced to choose the right sense of a word if more than one sense exists. Subsequently the semantic descriptor acquired a special meaning:

  SemDes(A) = SemDes(B) <=> "Meanings of A and B have something in common"

The third semantic instrument is semantic categories. There are four of them:

1. Situations;
2. Objects;
3. Properties;
4. Operators.

Situations (go, explosion, ...) always have two latent valencies: where and when they happened. Objects (keyboard, monitor, ...) have only one latent valency: where they are located. Properties (red, old, ...) modify situations and objects. Operators (not, yet, ...) only modify the semantic structure; they never correspond to separate semantic nodes. All these instruments are used in the description of word senses. Consider an example:
                          
  Title    = read
  Sense    = 1
  Category = Situation
  SemDes   = Process
  Valency  = Agent(C, A1); Object(C, A2)
  SemDes1  = Animative
  GramDes1 = subj
  SemDes2  = Information & Thing
  GramDes2 = obj
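A dictionary entry like the one above can be represented as a plain record. The field names follow the example in the text; the Python rendering itself, including the helper function, is only an illustrative sketch:

```python
# A sketch of the dictionary entry for "read" shown above.  "C" stands for
# the entry's own node, A1/A2 for its first and second actants.
read_entry = {
    "Title": "read",
    "Sense": 1,
    "Category": "Situation",
    "SemDes": "Process",
    "Valency": ["Agent(C, A1)", "Object(C, A2)"],
    "Actants": [
        {"SemDes": "Animative",           "GramDes": "subj"},  # A1
        {"SemDes": "Information & Thing", "GramDes": "obj"},   # A2
    ],
}

def actant_constraints(entry, n):
    """Return the semantic and grammatical constraints on the n-th actant."""
    a = entry["Actants"][n - 1]
    return a["SemDes"], a["GramDes"]

print(actant_constraints(read_entry, 1))  # ('Animative', 'subj')
```

Per-actant SemDes formulae like "Animative" are what the program later matches against candidate filler nodes when building relations.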
That was a short description of the semantic theory adopted in Dialing Project. Let us proceed with the description of the program.

The input of the semantic module is a set of syntax linkages. The output is a semantic structure. While building the resulting semantic structure the program faces the following problems:

1. Morphological ambiguity, since it cannot be fully resolved on the syntax stage.

2. Lexical ambiguity, for example "I saw a bat", where a bat might refer to an animal or, among others, a table tennis bat.

3. Structural ambiguity, for example "John sold the dog apple", where at least two interpretations are possible: "John sold the apple to the dog" or "John sold the apple that somehow resembles a dog".

These types of ambiguity determine the design of the semantic module:

  input text -> M1, ..., Mn   (morphological variants)
  each Mi    -> L1, ..., Ln   (lexical variants)
  each Li    -> S1, ..., Sn   (structural variants)
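The cascade of variants above can be sketched with nested generators. The three expand functions are placeholders for the real Dialing modules, and the branching factor of 2 is arbitrary:

```python
def morphological_variants(text):
    # Placeholder: the real module yields morphological readings of the text.
    yield from (f"{text}/M{i}" for i in range(2))

def lexical_variants(m):
    # Placeholder: the real module attaches dictionary senses to nodes.
    yield from (f"{m}/L{i}" for i in range(2))

def structural_variants(l):
    # Placeholder: the real module builds candidate semantic structures.
    yield from (f"{l}/S{i}" for i in range(2))

def all_variants(text):
    """Enumerate every (morphological, lexical, structural) combination."""
    for m in morphological_variants(text):
        for l in lexical_variants(m):
            yield from structural_variants(l)

variants = list(all_variants("input text"))
print(len(variants))  # 2 * 2 * 2 = 8 candidate structures
```

Generators keep the enumeration lazy, which matters because the real branching factors multiply quickly and most candidates are pruned before being fully expanded.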
                          
                          
                            
Building Semantic Nodes

On the first level the procedure works with one morphological variant. In short, this level aims to build semantic nodes and to assign dictionary interpretations to them. The most interesting parts of this level are as follows:

1. Finding time groups;

2. Finding fixed collocations;

3. Getting all semantic interpretations for each semantic node.

Procedures 1) and 2) deal with collocations, so it is time to have a look at collocations in Dialing Project. Algorithmically, there are two distinctions: conditional/unconditional collocations and closed/open collocations. Conditional collocations differ from unconditional ones in the method of finding them. To find an unconditional collocation in a text one only needs to know its syntactic and morphological properties. On the contrary, there must be strong semantic evidence for conditional collocations, which is why conditional collocations can be found only on the semantic stage. Here are examples:

Unconditional:
                          
  
- to be a chip off the old block: "Yes, Tom is just a chip off the old block."

- come hell or high water: "Finish your homework come hell or high water!"

Conditional:

- clear the air: "His apology cleared the air." / "Drivers! Clear the air with Flame Gard Filters."

- climb the wall: "If I don't go on a vacation soon, I'll climb the wall." / "If you can't climb the wall, build a door."

Unconditional collocations are
usually more idiomatic and longer than conditional ones.

The difference between open and closed collocations is more rigid. Any closed collocation corresponds to only one semantic node in a semantic structure, while open collocations have as many semantic nodes as they have words. Words of an open collocation can be connected with other nodes of the semantic structure, but of course there must be some irregularity either in the syntax of the collocation or in its translation. Each node of an open collocation has a separate dictionary interpretation, whereas a closed collocation can be interpreted only as a whole. Examples:

Open collocations:

- time groups with a special use of the colon: "at 7:30", "at 7:30 Friday evening";

- crack a smile: "I finally got him to crack a smile." / "He would crack a forced smile."

Closed collocations:

- cut and dried: "That's my answer, cut and dried."

- cross the Rubicon: "He crossed the Rubicon."

Obviously, this distinction is not absolute, but it is very practical. It is clear that the computer can process unconditional/closed collocations more quickly than conditional/open ones. If we wanted to ignore optimization issues, we would make all collocations open (just in case) and conditional.

Collocations are stored in the
collocation dictionaries, some of which will be described below.

The dictionary of fixed functional words consists of complex conjunctions (in order to, so that, ...) and complex prepositions (on behalf of, ...). They are closed and, unfortunately, conditional, since there are examples like these:

  (1) to be, or not to be — that is the question            // not a conjunction
  (2) God has given us complete autonomy, that is self-government   // a conjunction

The next dictionary is the Dialing
Thesaurus. Its structure resembles WordNet. The basic relations are hypernymy/hyponymy and meronymy/holonymy. One thesaurus relation can link two synsets together; that much is commonplace. However, there are some differences from standard thesauri:

1. Only noun phrases can be inserted into the Dialing Thesaurus, no VPs.

2. Each thesaurus entry can be supplied with a valency structure that is expressed by means of semantic relations and descriptors.

3. Any thesaurus synset can be used as a semantic descriptor to select a word's actant.

By default all thesaurus entries are open and unconditional
collocations.

The next collocation dictionary is the Time Group Dictionary. Time groups fill Time valencies, for example:

  On Monday we go to the Zoo.

These collocations are open and unconditional, even though non-prepositional time phrases can be conditional, for example:

  (1) Sunday morning will be all right   (not a Time valency)
  (2) We'll meet Sunday morning          (Time valency)

The syntax of time groups is very special, which is why we consider them collocations:

  (1) Tuesday September 4 1:34 AM ET
  (2) in the 1980s

The last collocation dictionary is called the Fixed Collocation
Dictionary. Most entries of this dictionary are verb phrases, like "clear the air" or "climb the wall", and some are considered unconditional and closed, for example "chip off the old block".

Besides the collocation dictionaries there is also one big dictionary for words, called the Russian Common-Purpose Dictionary. Thus, the Dialing Project uses the following Russian dictionaries:

1. Russian Common-Purpose Dictionary (main);

2. Thesaurus;

3. Time Group Dictionary;

4. Functional Words Dictionary;

5. Fixed Collocations Dictionary.

Each entry of these dictionaries is identified by Title and SenseId. In other words, a dictionary interpretation is the following tuple:

  <Dictionary Name, Title, SenseId>

If no dictionary interpretation can be found for a word, a default one is used. For example, there is a default interpretation for transitive verbs. Of course, the program should choose only one dictionary interpretation for one semantic node, which is why the program has to search among all lexical variants (a lexical variant is a way of attaching dictionary interpretations to semantic nodes).
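A dictionary interpretation lookup with a fallback to a default interpretation, as described above, might be sketched as follows. The dictionary contents and tuple values are invented for illustration:

```python
# Hypothetical contents of the Russian dictionaries, keyed by word.
# Each value maps a word to its (Title, SenseId) entries.
DICTIONARIES = {
    "Thesaurus":              {"Rubicon": [("Rubicon", 1)]},
    "Russian Common-Purpose": {"read": [("read", 1), ("read", 2)]},
}

def interpretations(word):
    """Return all <Dictionary Name, Title, SenseId> tuples for a word.

    If no dictionary contains the word, a default interpretation is used,
    as the text describes for e.g. transitive verbs.
    """
    found = [(dict_name, title, sense)
             for dict_name, entries in DICTIONARIES.items()
             for title, sense in entries.get(word, [])]
    return found or [("Default", word, 0)]

print(interpretations("read"))    # two senses from the common-purpose dictionary
print(interpretations("foobar"))  # [('Default', 'foobar', 0)]
```

A word with several tuples here is exactly the lexical ambiguity that forces the program to enumerate lexical variants.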
                          
                          
                            
Building all possible semantic relations

The main steps the program makes while processing one lexical variant concern semantic relations:
                          
  
1. Building the valency structure of each node;

2. Building all possible relations;

3. Processing similar nodes.

The first procedure builds the valency structure of each node. Here further complications arise. Firstly, some valencies are not compatible with others, and this information is written in the dictionary in a special format. Secondly, there are default valencies that should be added to some types of nodes in any case. For example, any Situation node has a default Time valency. Thirdly, some valencies are mandatory and some are not.

The
second procedure builds all possible semantic relations without checking semantic conditions. It uses morphological and syntactic information about nodes and the information written in the GramDes field [3] of a node's article. For example, let X be a node and let An be the n-th valency of word X. If the dictionary article of X includes the formula "GramDesn = of+NP", then each PP with the preposition "of" will be connected to X. Usually this procedure builds many more relations than needed; it is the task of the next procedures to choose the right hypothesis. The ratio R/N, where R is the number of all possible relations and N is the number of nodes, can measure the complexity of the input text. If R/N > 1.7 then the current version of the program invokes heuristic procedures that delete unpromising relations.

It is well
known that the worst enemies of NLP are ambiguity and similarity, which is why the procedure for similar nodes is one of the most important parts of this level. Similar words occupy only one valency place, so there should be a proxy (we call it a Multiple Node Actant, or MNA) that represents all similar words. There are several MNA syntax types:

  (1) I like Peter, Mary, and John       (basic)
  (2) I like neither Peter nor Mary      (correlative conjunction)
  (3) That will do more harm than good   (comparative)

After some kind of MNA is found, we need to identify all nodes that should be subdued by this MNA and to find all possible parents of this MNA. Our solution to the last problem resembles the Link Parser [3] Similar Rule, where the linkage of the sentence "I like Peter and John" is built from the two linkages "I like Peter" and "I like John". In short, node X becomes a parent of an MNA if X subdues two children of the MNA.

Building Tree Variants

The last procedure searches for the best tree variant, and it is the
slowest part of the whole program. Obviously it is impossible to check all tree variants for a sentence of standard length (20-30 words). We know two implemented methods of restricting the search space. The first method, used in the ETAP System [4], chooses the best variant for each node locally. Initially it finds the root of the clause, then it chooses the root's children, and so on. It means that choices on different levels are independent. Evidently that is not a correct method, because it is not clear in which order the program should go through the nodes. But we do not want to criticize it, since the existence of fully correct methods is questionable. The second known method was developed in the Mikrokosmos Project by Stephen Beale [5]. It is based on Constraint Satisfaction [6]. In short, this method divides the graph into subgraphs with the minimal ratio N/M, where N is the number of nodes in a subgraph and M is the number of its nodes that are connected to nodes outside the subgraph.
                          
After finding the first subgraph division, it resolves ambiguity in each subgraph. Then it finds new subgraphs, and so on. One can see that the Mikrokosmos method is also not fully correct, since choices in different subgraphs are independent, but we know that the only fully correct method is too slow. That is why, to some extent, the Dialing tree variant procedure is a kind of Mikrokosmos method.

Our procedure hinges on the assumption that the number of cycles in the graph of all possible relations is small. For Russian this fact has been established experimentally, though there is no evidence that it is true for other languages. If the number of cycles is small then the following steps are reasonable:
                          
  
1. Delete one relation in each cycle, in all possible ways; this produces a set of acyclic graph variants. For example, if there is one cycle with three nodes, then there are three ways to make the graph acyclic.

2. If there is only one relation that points to a node, then fix this relation. We are sure that all fixed relations will be in the resulting variant, because the resulting variant should be connected.

3. Find the best tree variants in each maximal connected component that has no fixed relation.

4. Combine all best tree variants into the resulting tree variant.

One can see that besides the number of cycles, the number of fixed relations is also crucial for this procedure. If the input graph has no fixed relations, the whole procedure turns into a full search. In practice the share of fixed relations is about 20%, which makes the procedure several times quicker than the full search. This method was tested and no counter-examples were found.

Now the
key question is the evaluation of tree variants. There are several parameters of different importance that together produce the value of a tree variant:

1. number of connected components;

2. projectivity;

3. length of relations;

4. valency order;

5. mandatory and optional valencies;

6. matching semantic descriptors;

7. ...

There are more than twenty parameters the program uses. The most important thing about them is that the parameters should be linguistically transparent: no special search trick is admitted. Thus, Dialing Project uses a transparent search algorithm (with fixed relations) and a linguistically motivated set of parameters to estimate variants. We believe any NLP system should be designed in this way:

1. the search algorithm should be simple and perform a nearly full search;

2. the measures should be linguistically transparent.
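The cycle-breaking and relation-fixing steps of the tree variant procedure can be sketched on a toy graph. The graph, node names, and the two helper functions are illustrative; scoring and the per-component search are omitted:

```python
def fixed_relations(edges, nodes):
    """Relations whose target node has exactly one incoming edge are fixed:
    a connected resulting tree must contain them."""
    incoming = {n: [e for e in edges if e[1] == n] for n in nodes}
    return [rels[0] for rels in incoming.values() if len(rels) == 1]

def break_cycle(edges, cycle_edges):
    """Delete one relation of a cycle in every possible way, producing
    a set of acyclic graph variants."""
    return [[e for e in edges if e != dropped] for dropped in cycle_edges]

# Toy graph: a is the root; b has two competing parents (a and c),
# while c and d each have exactly one incoming relation.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("c", "b"), ("a", "c"), ("a", "d")]
print(fixed_relations(edges, nodes))  # [('a', 'c'), ('a', 'd')]

# A 3-edge cycle can be made acyclic in exactly three ways.
cycle = [("x", "y"), ("y", "z"), ("z", "x")]
print(len(break_cycle(cycle, cycle)))  # 3
```

In this toy graph half of the relations are fixed, which mirrors the observation that fixing roughly 20% of relations in practice already cuts the search down to the remaining ambiguous components.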
                          
                          
                            
After the First Level Semantics

The First
Level Semantics sends the best semantic structure to the Transfer stage, which performs two main tasks. Firstly, it maps the semantic structure to the English Semantic Dictionary, i.e. finds the best translation for each semantic node. To find the best translation one should consider a semantic node's descriptors and valencies. Secondly, the program makes some changes in the semantic structure to prepare it for English Synthesis. For example, in Dialing Project all phrases like:

  The girl is beautiful
  The beautiful girl
  The girl who is beautiful

have the same semantic structure Attrib(beautiful, girl) with different syntax information, hence Transfer should have a procedure that inserts the node "be" into the structure like this:
                          
  girl                        be
   |            ->           /  \
  beautiful               girl    beautiful

The last step is
English Synthesis. Its main task is to produce a string of English words; therefore all problems with word order are solved here. The Word Order Problem is relatively simple when the semantic graph has only one connected component. Things become worse when the graph is not connected. In this case we have to put all connected components one after another in the output sentence, which usually produces unsound results. We can now say that in Dialing Project it is crucial to achieve fully connected graphs at the First Level Semantics stage. To achieve this we need to enlarge our semantic dictionaries.

Acknowledgments

I would like to
thank all members and consultants of Dialing Project and especially Dialing
Company President Eduard Khachukaev, who funded this project.  More information about the Project can be
found on www.aot.ru.

References

[1] The Concise Oxford Dictionary of Linguistics. Oxford University Press, 1997.

[2] English as 2nd Language. http://esl.about.com

[3] Temperley D., Lafferty J., Sleator D. 1995. Link Grammar Parser. http://www.link.cs.cmu.edu/link

[4] Russian Academy of Sciences, Institute for Information Transmission Problems. ETAP-3 system. http://proling.iitp.ru/Etap/index_e.html

[5] Beale, Stephen. 1996. Hunter-Gatherer: Applying Constraint Satisfaction, Branch-and-Bound and Solution Synthesis to Natural Language Semantics. NMSU CRL Technical Report MCCS-96-292.

[6] Tsang, E. 1993. Foundations of Constraint Satisfaction. Academic Press, London.

[7] Blekhman, Michael; Pevzner, Boris. First Steps of Language Engineering in the USSR: The 50s through 70s. http://www.bcs.org.uk/siggroup/nalatran/mtreview/mtr-11/mtr-11-6.htm

[8] Hutchins, W. John. Machine translation over fifty years. http://ourworld.compuserve.com/homepages/WJHutchins/HEL.htm

[9] Leontieva, Nina. Textual Facts as Units of Coherent Text Semantic Analysis // Arbeitspapiere der GMD 671. Darmstadt, 1992.

[10] Leontieva, Nina. ROSS: Semantic Dictionary for Text Understanding and Summarization // META. 1995. Vol. 40, No. 1. Montreal.

Footnotes

[1] Some information about the system can be found in [7] and [8].

[2] Many examples in this paper are translated into English to help the reader understand problems that exist in Russian.

[3] = Grammatical Description field.