Alexey Sokirko, 2003
The main purpose of the DWDS/Dialing Concordance ("DDC") is to search for words or sequences of words together with morphological patterns. DDC was created to help linguists find a particular collocation or word in a given context. Roughly speaking, the functionality of DDC is similar to the search engine "SARA" that is accessible at the site of the British National Corpus (http://sara.natcorp.ox.ac.uk/lookup.html).
Generally, the majority of web search engines were designed to look for information. This means that they must be able to process all text formats and aggregate all textual variance into one representation. Ultimately, they have to give the most relevant hits to the end user. The target user of an internet search engine searches for information, not for the variance of its representation. A web search engine usually uses stop word lists, various dictionaries of synonyms and dictionaries of proper names, and it indexes the corpora by files, not by sentences. All these notions are useless for a linguistic search engine. For example, using a stop word list is senseless, since for a linguist stop words are among the most relevant words of the language (prepositions, conjunctions and so on). Therefore we cannot simply use a web search engine for linguistic purposes, but have to design a special version of a search engine.
There are two possible strategies for searching a corpus. On the one hand, we can create a program like the Unix "grep", which does not use any index at all. It is an attractive solution because it allows a very sophisticated query language: once a sentence (a hit) is found and its morphological interpretation is built, it is possible to go through the sentence many times and check all the conditions that the query contains. Of course, the speed of the process will be equal to the speed of indexing, i.e. to the speed of the linguistic processors (morphology, syntax). Apart from the speed there is one more disadvantage of this solution: when the search goal lies at the end of the corpus, the end user has to wait until the whole corpus is processed. In our opinion, this last consideration is the most important reason to give up grep-like searching.
Obviously, the only alternative we have is to create special indices. The creation of indices, in the way it is done in DDC, is described in Unit 2; the usage of the index is described in Unit 3.
The current version of DDC supports three languages: Russian, English, and German. However, the corpus must be homogeneous, and the language of the corpus should be specified before indexing. DDC does not solve the problem of language recognition (e.g. by code pages). Thus any English words in a German corpus will be treated as German ones.
According to the language, DDC contains three morphological systems: Russian, English, and German. The Russian and English morphology systems were adopted from the Dialing Project; the German morphology was borrowed from the Morphy system.
For each input word form the morphological procedure finds all morphological interpretations in the morphology dictionary. A morphological interpretation is a tuple <Normal form, Part of Speech, Grammems>, for example:
doing -> <do, VERB, ing>
Grammems are special features like "sin" (singular), "plu" (plural) and so on. We call a pair <Part of Speech, Grammems> a morphological pattern.
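As an illustration, a morphological interpretation and a morphological pattern could be represented roughly as follows. This is a minimal C++ sketch; the type and field names are our own and not necessarily those used inside DDC.

#include <string>
#include <vector>

// Sketch of the tuple <Normal form, Part of Speech, Grammems>;
// the names are illustrative, not the actual DDC types.
struct MorphPattern {
    std::string partOfSpeech;              // e.g. "VERB", "SUB"
    std::vector<std::string> grammems;     // e.g. {"sin"}, {"plu"}, {"ing"}
};

struct MorphInterpretation {
    std::string normalForm;                // e.g. "do" for the form "doing"
    MorphPattern pattern;                  // <Part of Speech, Grammems>
};

// Example: doing -> <do, VERB, ing>
MorphInterpretation example = { "do", { "VERB", { "ing" } } };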
Generally, the morphology system is used in two places:
1. in query processing, when it is necessary to build all forms of the given word form;
2. in indexing, when we have to store all possible morphological interpretations of the given word form.
The corpus itself is a list of text or HTML files. This list is stored in a file like this:
C:\test\test1.txt
C:\test\test2.txt
...
DDC supports plain text and standard HTML format. We think that for linguistic purposes these two formats are quite sufficient.
The process of indexing starts with loading the list of input files. Every file must really exist and must be stored on the local hard disk. The total size of all files of the corpus cannot exceed 2.5 gigabytes. This restriction follows from the inner structure of the program, since we use global offsets which can point to any place in the whole corpus, and the maximal value of these offsets cannot exceed 4 gigabytes.
After loading the list of files, the program must be supplied with a language identifier (German, English or Russian). Then the program starts the main cycle of indexing, which looks like this:

for every file from the list of files do
begin
    Create sentence and paragraph index;
    Create word and morphological index;
    if the size of the current index is more than 50 MB
        then do some swapping in order to clean memory;
end;
The first procedure creates the index of sentence borders. Paragraphs are just sequences of sentences and are therefore processed in the same way as sentences. The procedure goes through the input file and for every end of sentence stores two numbers (two offsets). Both offsets are global, i.e. the first offset of a file is equal to the end offset of the previous file. This is a very quick procedure: it does not allocate much memory, since all offsets are written directly to the hard disk.
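For illustration, the sentence border index can be thought of as a flat binary file of global offsets that are written out as soon as they are computed. The record layout and field names below are our assumption; DDC's actual file format may differ.

#include <cstdint>
#include <cstdio>

// Hypothetical record for one sentence border: two global offsets
// (the exact meaning of the two stored numbers is our guess).
struct SentenceBorder {
    uint32_t startOffset;   // global offset where the sentence starts
    uint32_t endOffset;     // global offset where the sentence ends
};

// Append one border record directly to the index file, so that
// almost no memory is needed while the corpus is being scanned.
void writeBorder(FILE* indexFile, SentenceBorder border) {
    fwrite(&border, sizeof(border), 1, indexFile);
}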
The second procedure builds the main index of tokens and morphological patterns. A token is a text item to be indexed. In the current version of the program, a token can be:
All other text items (those that are not tokens) are disregarded. It is quite easy to change the definition of a token and, for example, to index all punctuation marks as well.
While indexing, the program maintains a set of all string representations of tokens and a set of all possible morphological patterns. These two sets are actually implemented in the same way, i.e. there is only one program class for both. Such sets are called index sets. An index set is a set of all possible index items (words or morphological patterns).
Initially the index sets are empty. During indexing they grow whenever a new token or a new morphological pattern occurs. For every index item we store its occurrences:
"Haus" -> 1, 678, 2345 ...
"Hause" -> 12, 38, 234 ...
...
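Conceptually, an index set can be pictured as a map from an index item to its list of global occurrences. The following is a simplified C++ sketch with illustrative names; the real DDC class additionally swaps the occurrence lists to disk, as described below.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// A simplified index set: one instance is used for tokens,
// another one for morphological patterns.
class IndexSet {
public:
    // Register one occurrence of an item at a global corpus offset.
    void addOccurrence(const std::string& item, uint32_t globalOffset) {
        items_[item].push_back(globalOffset);
    }
    // e.g. items_["Haus"]  == {1, 678, 2345, ...}
    //      items_["Hause"] == {12, 38, 234, ...}
private:
    std::map<std::string, std::vector<uint32_t>> items_;
};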
Clearly, while indexing we do not need to keep all occurrences in the corpus in short-term memory; that is why from time to time we can save occurrences to the hard disk. This swapping is done in the main cycle of indexing. Thus we hold in memory only the two sets of index items without occurrences, and these sets usually grow very slowly[1].
We are now going to present some statistics about the indexing process. For testing purposes we used a P4 1.5 GHz machine running Windows 2000. The results depend upon the language of the corpus, since we use different linguistic processors to obtain tokens and sentence divisions.
Corpus Name     | Language | Count of tokens | Size   | Elapsed Time | Memory Usage
DWDS-corpus1    | German   | 11 mln.         | 85 Mb  | 9 min        | 40 Mb
DWDS-corpus2    | German   | 30 mln.         | 160 Mb | 20 min       | 60 Mb
Moshkov-subset1 | Russian  | 15 mln.         | 100 Mb | 13 min       | 60 Mb
Moshkov-subset2 | Russian  | 54 mln.         | 350 Mb | 55 min       | 80 Mb
It may be worth adding that the program always allocates at least 18 Mb for the morphological dictionaries. The Linux version of the program runs approximately 20% faster than the Windows version.
The resulting index contains two huge files with the occurrences of tokens and the occurrences of morphological patterns. Together, these two files are approximately as large as the corpus itself, so the built index has the same size as the corpus.
At the last stage of the indexing process we create an index of reversed tokens in order to run queries with left truncation, like "*nat". We do not need a special index for right truncation, since the set of tokens itself can be used for it.
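The idea can be sketched as follows: a sorted set of tokens answers prefix queries ("Ha*") directly, and a second set of reversed tokens answers suffix queries ("*ken") by reversing the pattern. This is an illustrative C++ sketch with our own function names, not DDC's actual code.

#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Collect all tokens from a sorted set that start with the given prefix.
std::vector<std::string> prefixMatches(const std::set<std::string>& tokens,
                                       const std::string& prefix) {
    std::vector<std::string> result;
    for (auto it = tokens.lower_bound(prefix);
         it != tokens.end() && it->compare(0, prefix.size(), prefix) == 0;
         ++it)
        result.push_back(*it);
    return result;
}

// A query like "*ken" is answered with the same routine applied to a set
// of reversed tokens: reverse the suffix, search it as a prefix,
// then reverse the matches back.
std::vector<std::string> suffixMatches(const std::set<std::string>& reversedTokens,
                                       std::string suffix) {
    std::reverse(suffix.begin(), suffix.end());
    std::vector<std::string> result = prefixMatches(reversedTokens, suffix);
    for (std::string& token : result)
        std::reverse(token.begin(), token.end());
    return result;
}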
We have mentioned some of the restrictions that DDC has. Here is the full list of them:
In the current version of the program, the query language contains the following constructions:
Query Type | Purpose | Example | Result
Word | word description | Hause | All sentences of the corpus which contain some morphological variant of "Hause"
Word* | word description | Ha* | All sentences which contain a word that starts with "Ha"
*Word | word description | *ken | All sentences which contain a word that ends with "ken"
[PartOfSpeech {Features}] | word description | [SUB sin] | "PartOfSpeech" is a lexical category, "Features" is a comma-delimited list of morphological features (see below for complete lists of features and parts of speech)
@Word | word description | @Hause | All sentences of the corpus which contain the word form "Hause"
{H} | word description | {AMTBERUFG} | All sentences of the corpus which contain a subtype of the given thesaurus node (the DWDS thesaurus is used)
"X1 X2 ... XN" | sequence of word descriptions | "mein Haus" | All sentences which contain "mein Haus"
Q1 && Q2 | conjunction of word or sequence descriptions | Hause && [SUB sin] | All sentences of the corpus which contain both Q1 and Q2
Q1 || Q2 | disjunction of word or sequence descriptions | [VER aux,inf] || [SUB sin] | All sentences of the corpus which contain Q1 or Q2
near(Q1;Q2;n) | two near words | NEAR(Hause; [SUB]; 2) | All sentences which contain "Hause" and some noun with a distance between them of at most 2; the order of the occurrences is disregarded
"X1 #D1 X2 #D2 ... XN" | sequence of word descriptions with distances | "mein #1 Haus" | All sentences which contain "mein" and "Haus", which may be divided by one token
At the end of the query the user can add a context size operator of the form
#cntxt N, where N = 1,...,5.
The context size operator tells the program to show not only the sentence where the given query is found but also the N previous and N following sentences. For example, the query
test #cntxt 2
gives us all hits with the word "test" together with a two-sentence context.
DDC can actually process a query in two different ways. Firstly, we can ask the system to give us the count of hits (the count of sentences which match the query). Secondly, we can ask it to give us some examples of the hits. It is quite obvious that the first type of query (we call it a statistical query) takes much more time than the second type (we call it an example query), since a statistical query has to process the whole corpus. The speed of a statistical query depends crucially upon the corpus frequencies of the items which constitute the query. Thus a query like "[ART] [ADJ] [NOUN]" (a pattern for "the smart dog", "a deep pool" and so on) will be processed a thousand times slower than a query like "a red boy", since there are millions of articles, adjectives and nouns in the corpus compared with the counts of "red" or "boy".
It goes without saying that the count of hits is calculated by all web engines, but the problem is that a "web hit" is a document, not a sentence. For example, for a query like "a*" Yahoo gives 642710 hits, while DDC gives 1185283 hits (two times more) for a corpus of 40 mln. tokens. Some search engines (like Google) give only an approximate count of hits, which is of course much simpler; however, it is claimed that Google has more than 1.5 billion indexed pages.
All this leads us to the conclusion that statistical queries should be used only when they are really necessary; in all other cases it is better to use example queries. In both cases (statistical and example) the query has to be parsed. The parsing is done with the help of Yacc&Lex. The grammar is written in a quite straightforward way, except for the parsing of morphological patterns, which is done by a special function. If there is no parsing error, the query is evaluated. The evaluation process is the process of getting hits from the corpus according to the parse tree of the query. Whenever possible we use depth-first search. For example, for a query like
( ( ( (A && B) && C) && D) && (E && F))
the sequence of evaluation steps is as follows:
(A && B)
( (A && B) && C)
( ( (A && B) && C) && D)
(E && F)
( ( ( (A && B) && C) && D) && (E && F))
A depth-first algorithm is preferable, since a breadth-first algorithm consumes more memory. It is very important that we can apply the evaluation procedure to a subset of the corpus, since in this case the usage of short-term memory is under our control. The whole body of the corpus is internally divided into smaller subcorpora. The size of a subcorpus[2] was obtained empirically after several experiments, and it perhaps depends upon the operating system[3].
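The depth-first evaluation of the logical operators can be pictured as a recursive walk over the parse tree, where && intersects and || unites the hit sets of the children (once the occurrences of each leaf have been converted to hits). The following is a schematic C++ sketch with our own node type, not the actual DDC classes.

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <memory>
#include <vector>

// A schematic parse tree node; Leaf stands for any word description,
// whose hits are fetched from the index of the current subcorpus.
struct QueryNode {
    enum Kind { Leaf, And, Or } kind;
    std::vector<uint32_t> leafHits;          // filled for Leaf nodes only
    std::unique_ptr<QueryNode> left, right;  // children of And / Or
};

// Depth-first evaluation: each subtree is fully evaluated before its
// sibling, which keeps the number of live intermediate hit sets small.
std::vector<uint32_t> evaluate(const QueryNode& node) {
    if (node.kind == QueryNode::Leaf)
        return node.leafHits;                // sorted sentence numbers
    std::vector<uint32_t> a = evaluate(*node.left);
    std::vector<uint32_t> b = evaluate(*node.right);
    std::vector<uint32_t> result;
    if (node.kind == QueryNode::And)
        std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                              std::back_inserter(result));
    else
        std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                       std::back_inserter(result));
    return result;
}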
The evaluation steps are carried out without any optimization, directly according to the semantics of the query operators. For instance, let us take a closer look at the near operator. The syntax of this operator is near(Q1;Q2;n), where Q1 and Q2 are token descriptions and n is the maximal distance between them. In order to evaluate this operator we have to get all occurrences of Q1 and Q2 in the current subcorpus. Then the occurrences have to be sorted. It is worth mentioning that the sorting of occurrences takes half of the processor time for queries containing very frequent items, so it is safe to say that for quite a long time DDC is busy regrouping (sorting) occurrences.
Let the occurrences of Q1 be x1, x2, ..., xk, and let the occurrences of Q2 be y1, y2, ..., ym. The following algorithm finds all pairs <xi, yj> such that |xi - yj| <= n (n is the distance):

Start := 1;
for i := 1 to k do
begin
    // skip occurrences of Q2 that lie too far to the left of xi
    j := Start;
    while (j <= m) and (yj + n < xi) do
        j := j + 1;
    Start := j;
    // collect occurrences of Q2 within the distance n
    while (j <= m) and (yj <= xi + n) do
    begin
        <xi, yj> is a new pair;
        j := j + 1;
    end;
end;
The complexity of this algorithm is O(n*k + m), where n is the distance, k is the count of occurrences of Q1 and m is the count of occurrences of Q2.
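The same merge can be written in C++ over two sorted vectors of global offsets. This is a sketch of the algorithm above; the function name and the pair-collecting output are our own choices.

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// For two sorted occurrence lists x (of Q1) and y (of Q2), report every
// pair <x[i], y[j]> with |x[i] - y[j]| <= n.
std::vector<std::pair<uint32_t, uint32_t>>
nearPairs(const std::vector<uint32_t>& x,
          const std::vector<uint32_t>& y,
          uint32_t n) {
    std::vector<std::pair<uint32_t, uint32_t>> pairs;
    size_t start = 0;                       // first candidate in y for x[i]
    for (size_t i = 0; i < x.size(); ++i) {
        // skip occurrences of Q2 lying too far to the left of x[i]
        size_t j = start;
        while (j < y.size() && uint64_t(y[j]) + n < x[i])
            ++j;
        start = j;
        // collect occurrences of Q2 within the distance n
        while (j < y.size() && uint64_t(y[j]) <= uint64_t(x[i]) + n) {
            pairs.push_back({x[i], y[j]});
            ++j;
        }
    }
    return pairs;
}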
The other operators are implemented in the same straightforward way. There is a slight difference between the logical operations (&& and ||) and the other ones: the logical operations deal with hits (sentences which match the query), while the other operations deal with occurrences of tokens. The result of the evaluation is always a set of hits, i.e. we dare say that on the bottom level of the parse tree the program operates only on occurrences, while on the upper levels it deals with hits.
Obviously there must be a function that converts occurrences into hits; this function exists and looks like this:

x1, x2, ..., xk - occurrences
h1, h2, ..., hm - hits (ends of sentences)

i := 0;  // index of occurrence
q := 0;  // index of hit
while (i < k) do
begin
    // binary search for xi in the hits between q and m
    q := lower_bound(q, m, xi)[4];
    // adjusting q
    if (hq == xi) then
        q := q + 1;
    hq is a new hit;
    i := i + 1;
end;
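In C++ terms, the conversion is essentially a sequence of restricted binary searches over the sorted array of sentence ends. The following is an illustrative sketch; the boundary handling (an occurrence coinciding exactly with a sentence end) in the real DDC code may differ slightly.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Map sorted token occurrences to the sentences (hits) containing them;
// sentenceEnds holds the sorted global end offsets of all sentences.
std::vector<size_t> occurrencesToHits(const std::vector<uint32_t>& occurrences,
                                      const std::vector<uint32_t>& sentenceEnds) {
    std::vector<size_t> hits;
    size_t q = 0;                     // current position in sentenceEnds
    for (uint32_t x : occurrences) {
        // binary search only in the not yet visited tail [q, end)
        q = std::lower_bound(sentenceEnds.begin() + q, sentenceEnds.end(), x)
            - sentenceEnds.begin();
        if (q == sentenceEnds.size())
            break;                    // occurrence after the last sentence end
        if (hits.empty() || hits.back() != q)
            hits.push_back(q);        // record each sentence only once
    }
    return hits;
}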
This procedure takes 30 percent of the time for queries containing very frequent items; it is the second slowest procedure of the querying mechanism. Having described the most time-consuming algorithms, we are going to present some statistics:
Query                   | Type of query | Moshkov1 (15 mln) | Moshkov2 (54 mln) | DWDS1 (190 mln)
Mutter                  | Example       | 0.05              | 0.05              | 0.07
Mutter                  | Statistical   | 0.007             | 0.015             | 0.045
Ba*                     | Example       | 0.06              | 0.07              | 0.15
Ba*                     | Statistical   | 0.1               | 0.3               | 1.3
"[ADJ] [SUB] [VER]"[5]  | Example       | 1.1               | 1.1               | 1.25
"[ADJ] [SUB] [VER]"     | Statistical   | 2.5               | 14                | 25

All times are given in seconds.
The Google engine processes any query in 0.1 to 1 second, but there are hundreds of Google servers and thousands of simultaneous users, so the comparison is of limited value. Surely, run on one separate computer, the Google web engine is much faster than DDC.
We should explain why the corpus of 54 mln tokens ("Moshkov Library") is processed ten times slower than the corpus of 15 mln tokens ("Moshkov Library subset"), while the former is only four times bigger than the latter. As far as we can tell, the problem lies in the size of short-term memory. The whole index of the smaller corpus is smaller than the free memory available on the test computer (130 Mb). The Windows system seems to load the whole index into memory, and therefore it works much more quickly. The index of "Moshkov Library" is 450 Mb; this file cannot be loaded into memory, so the system has to swap it to the hard disk. We have tried the same query on a computer with 512 MB of memory, and it was accomplished in 5 seconds. Generally, the following requirement should be fulfilled for DDC: the size of the index should not exceed the size of the short-term memory of the computer, otherwise querying works approximately two times slower.
The restrictions on the speed and size of the corpus can be overcome by using parallel processing. It is possible for DDC to create several corpora, put them on several computers, and run a query against these corpora simultaneously. For network transfer DDC uses Berkeley sockets over TCP/IP. There is one client in the middle of the structure. This client holds only the list of corpus servers, like this:
CorporaName    | IP           | PORT
GermanInternet | 192.168.1.69 | 1
Kollok         | 192.168.1.61 | 2
When a user runs a query, this client simply forwards the query to all corpora; afterwards it sums up the result data and outputs it to the user. In its turn, each corpus server opens a socket in order to receive queries and send back the results. Parallel processing can thus help us to work with a corpus of unlimited size as though it were a small one.
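To make the client's role concrete, the loop below forwards a query to every corpus server over TCP and collects the raw answers for merging. This is only a sketch of the architecture described above: the structure, function names and the wire format (sending the plain query text and reading a plain answer back) are our assumptions, not the actual DDC protocol.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <string>
#include <vector>

// One corpus server from the client's list.
struct CorpusServer {
    std::string name;
    std::string ip;
    uint16_t port;
};

// Send the query to one server over TCP and read its answer back.
std::string askServer(const CorpusServer& server, const std::string& query) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return "";
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(server.port);
    inet_pton(AF_INET, server.ip.c_str(), &addr.sin_addr);
    std::string answer;
    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0) {
        send(fd, query.c_str(), query.size(), 0);
        char buffer[4096];
        ssize_t got;
        while ((got = recv(fd, buffer, sizeof(buffer), 0)) > 0)
            answer.append(buffer, got);
    }
    close(fd);
    return answer;
}

// The client simply forwards the query to every corpus server and
// merges the answers afterwards.
std::vector<std::string> broadcastQuery(const std::vector<CorpusServer>& servers,
                                        const std::string& query) {
    std::vector<std::string> answers;
    for (const CorpusServer& server : servers)
        answers.push_back(askServer(server, query));
    return answers;
}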
In order to reduce the size of the index, we introduce an archiving algorithm for occurrences. A list of occurrences is a long sorted list of positive numbers. The average difference between two neighbouring items is between 10 and 100, depending upon the index set, and the algorithm we use is based on this fact.
Normally, without archiving, each occurrence is stored as one four-byte number, which is too wasteful. The first step consists of replacing the occurrences by the differences of adjacent items:
X1, X2, X3, ..., Xn -> (X1), (X2-X1), (X3-X2), ..., (Xn-Xn-1)
This operation greatly reduces the average value of a list item. The operation itself and its inverse are performed in O(N), where N is the length of the list.
After this we can apply a length-dependent method of storing numbers. According to this method a number is stored as follows:
A1, A2, ..., A5, B1, B2, ..., Bk, where Ai and Bj are bits.
The bit sequence A1, A2, ..., A5 holds the length of the sequence B1, B2, ..., Bk, i.e. it holds the value of k; the maximum value of k is 32. The bit sequence B1, B2, ..., Bk holds the number itself (the item of the list). If the values of the items are comparatively low, this operation reduces the storage space for the list; otherwise the storage space could grow. A precise formula for the conditions under which the storage space begins to grow could be derived, but it is hardly necessary.
The composition of these two operations (replacing an occurrence by the difference of two adjacent items, and the length-dependent method of storing numbers) constitutes the archiving algorithm of DDC. Obviously it works in linear time. It reduces the corpus index by approximately 40%.
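The two steps can be sketched together as follows. This is an illustrative C++ bit-writer; in particular, we assume here that the 5-bit length field stores k-1 (so that k = 32 fits into five bits), and the exact bit layout of the DDC index files may differ.

#include <cstdint>
#include <vector>

// A simple bit stream used to pack numbers of varying bit lengths.
struct BitWriter {
    std::vector<uint8_t> bytes;
    int bitsUsed = 0;                           // bits already used in the last byte
    void writeBits(uint32_t value, int count) { // append the lowest `count` bits
        for (int i = count - 1; i >= 0; --i) {
            if (bitsUsed == 0) bytes.push_back(0);
            bytes.back() |= ((value >> i) & 1u) << (7 - bitsUsed);
            bitsUsed = (bitsUsed + 1) % 8;
        }
    }
};

// Number of bits needed to store value (at least 1, at most 32).
int bitLength(uint32_t value) {
    int length = 1;
    while (value >>= 1) ++length;
    return length;
}

// Archive a sorted occurrence list: first replace each occurrence by the
// difference to its predecessor, then store every difference as a 5-bit
// length followed by that many payload bits.
std::vector<uint8_t> archiveOccurrences(const std::vector<uint32_t>& occurrences) {
    BitWriter writer;
    uint32_t previous = 0;
    for (uint32_t occurrence : occurrences) {
        uint32_t delta = occurrence - previous;  // small for frequent items
        previous = occurrence;
        int length = bitLength(delta);           // k, 1..32
        writer.writeBits(length - 1, 5);         // A1..A5: k, stored as k-1 (assumption)
        writer.writeBits(delta, length);         // B1..Bk: the delta itself
    }
    return writer.bytes;
}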
On the whole we have to accept that DDC is a quite simple system, which to our mind has the following benefits:
On the other hand, there are some evident drawbacks:
We hope that these drawbacks do not outweigh the benefits, and we also hope that DDC will be helpful to the linguistic community.
[1] For a German corpus of 25 mln tokens, the set of morphological patterns contains 6,000 items and the set of tokens 600,000 items.
[2] In the current version the size of a subcorpus is 1 mln. tokens.
[3] DDC was created under Windows 2000.
[4] The lower_bound function determines the lowest value A in the range [q, m) such that hj < xi for every j in the range [q, A).
[5] Any adjective, followed by a noun, which is followed by a verb.