Alexey Sokirko, 2003

sokirko@dwds.de

 

A technical overview of DWDS/Dialing Concordance

 

 

Unit 1. Introduction

 

The main purpose of DWDS/Dialing Concordance ("DDC") is to search for words or sequences of words together with morphological patterns. DDC was created to help linguists find a particular collocation or word in a given context. Roughly speaking, the functionality of DDC is similar to that of the search engine “SARA”, which is accessible at the site of the British National Corpus (http://sara.natcorp.ox.ac.uk/lookup.html).

Generally, the majority of web search engines were designed to look for information. This means that they must be able to process all text formats and aggregate all textual variants into one representation. In the end they have to give the most relevant hits to the end user. The target user of an internet search engine searches for information, not for the variants of its representation. A web search engine usually uses stop word lists, various dictionaries of synonyms and dictionaries of proper names, and it indexes the corpus by files, not by sentences. All these notions are useless for a linguistic search engine. For example, using a stop word list is senseless, since for a linguist stop words are the most relevant words of the language (prepositions, conjunctions and so on). Therefore we cannot simply use a web search engine for linguistic purposes, but have to design a special version of a search engine.

There are two possible strategies to search for something in a corpus. On the one hand, we can create a program like Unix "grep", which doesn't use any index at all. It is a good solution in the sense that we can create a very sophisticated query language: once a sentence (a hit) is found and its morphological interpretation is built, it is possible to go through the sentence many times and to check all the conditions the query contains. Of course, the speed of the search will be equal to the speed of indexing, i.e. to the speed of the linguistic processors (morphology, syntax). Apart from the speed there is one more disadvantage of this solution: when the search goal lies at the end of the corpus, the end user has to wait until the whole corpus is processed. In our opinion, this last consideration is the most important reason to give up grep-like searching.

Obviously, the only alternative we have is to create special indices. The creation of indices, in the way it is done in DDC, is described in Unit 2; the usage of the index is described in Unit 3.

The current version of DDC supports three languages: Russian, English, and German. But the corpus must be homogeneous, and the language of the corpus should be specified before the process of indexing. DDC does not solve the problem of language recognition by code pages. Thus any English words in a German corpus will be treated as German ones.

According to these languages, DDC contains three morphological systems: Russian, English, and German. The Russian and English morphology systems were adopted from the Dialing Project; the German morphology was borrowed from the Morphy system.

For each input word form the morphological procedure finds all morphological interpretations in the morphology dictionary. A morphological interpretation is a tuple:

<Normal form, Part of Speech, Grammems>, for example:

doing -> <do, VERB, ing>

Grammems are special features like "sin" (singular), "plu" (plural) and so on. We call a pair
             <Part of Speech, Grammems> a morphological pattern.

Generally, the morphology system is used in two places:

1.      In query processing, when it is necessary to build all forms of the given word;

2.      In indexing, when we have to store all possible morphological interpretations of the given word form.
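To make the data model concrete, a morphological interpretation and a morphological pattern could be represented by structures like the following (a minimal C++ sketch; the type and field names are ours, not the actual DDC classes):

#include <string>
#include <vector>

// A sketch of the data structures described above; the names are
// illustrative and do not reproduce the real DDC classes.
struct MorphPattern {
    std::string PartOfSpeech;          // e.g. "VERB"
    std::vector<std::string> Grammems; // e.g. {"ing"} or {"sin"}
};

struct MorphInterpretation {
    std::string NormalForm;            // e.g. "do"
    MorphPattern Pattern;              // <Part of Speech, Grammems>
};

// "doing" -> <do, VERB, ing>
MorphInterpretation ExampleDoing() {
    return { "do", { "VERB", { "ing" } } };
}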

The corpus itself is a list of text or html files. This list is stored in a file like this:

C:\test\test1.txt

C:\test\test2.txt

...

 

DDC supports plain text format and standard html format. We think that for linguistic purposes these two formats are quite sufficient.

 

 

Unit 2. Indexing

 

The process of indexing starts with loading the list of input files. Every file should really exist and should be stored on the local hard disk. The size of all files of the corpus cannot be more than 2.5 gigabytes. This restriction follows from the inner structure of the program, since we use something like global offsets, which can point to any place of the whole corpus, and the maximal value of these offsets cannot be more than 4 gigabytes.

After loading the list of files, the program must be supplied with a language identifier (German, English or Russian). Then the program starts the main cycle of indexing, which looks like this:

for every file from the list of files do
begin
    Create sentence and paragraph index;
    Create word and morphological index;
    if the size of the current index is more than 50 MB
        then do some swapping in order to clean memory;
end;

The first procedure creates the index of sentence borders. Paragraphs are just sequences of sentences and therefore they are processed in the same way as sentences. The procedure goes through the input file and for every end of sentence stores two numbers:

  1. The offset of the end of sentence in tokens;
  2. The offset of the end of sentence in bytes.

Both offsets are global, i.e. the first offset of a file is equal to the end offset of the previous file. This is a very quick procedure, because it doesn't allocate much memory, since all offsets are written directly to the hard disk.
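A minimal sketch of how such sentence border records could be written out is given below; the record layout and the names are our assumptions for illustration, not the exact DDC file format:

#include <cstdint>
#include <cstdio>
#include <vector>

// One record per end of sentence: a global token offset and a global byte offset.
struct SentenceEnd {
    uint32_t TokenOffset; // global offset in tokens
    uint32_t ByteOffset;  // global offset in bytes
};

// Appends the sentence ends of one file to the index on disk, shifting the
// local offsets by the end offsets of the previous file.
void AppendSentenceIndex(FILE* index,
                         const std::vector<SentenceEnd>& localEnds,
                         uint32_t tokenBase, uint32_t byteBase) {
    for (SentenceEnd e : localEnds) {
        e.TokenOffset += tokenBase;
        e.ByteOffset  += byteBase;
        fwrite(&e, sizeof(e), 1, index); // written directly to the hard disk
    }
}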

The second procedure builds the main index of tokens and morphological patterns. A token is a text item to be indexed. In the current version of the program a token can be:

  1. A word (depending upon the language);
  2. A number in decimal notation;
  3. A sequence of punctuation marks, which is thought to be the end of sentence.

All other text items (those that are not tokens) are disregarded. It is quite easy to change the definition of a token and to index, for example, all punctuation marks as well.
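The token definition above could be expressed by a classification function like the following sketch; the simple ASCII character tests merely stand in for the language-dependent rules that DDC actually uses:

#include <cctype>
#include <string>

enum class TokenType { Word, Number, SentenceEnd, Ignored };

// A simplified classification of a text item according to the token
// definition above; the real rules in DDC are language-dependent.
TokenType Classify(const std::string& item) {
    if (item.empty()) return TokenType::Ignored;
    bool allDigits = true, allAlpha = true, allSentencePunct = true;
    for (unsigned char c : item) {
        if (!std::isdigit(c)) allDigits = false;
        if (!std::isalpha(c)) allAlpha = false;
        if (c != '.' && c != '!' && c != '?') allSentencePunct = false;
    }
    if (allDigits) return TokenType::Number;              // a number in decimal notation
    if (allAlpha)  return TokenType::Word;                // a word
    if (allSentencePunct) return TokenType::SentenceEnd;  // end-of-sentence punctuation
    return TokenType::Ignored;                            // all other items are disregarded
}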

While indexing, the program maintains a set of all string representations of tokens and a set of all possible morphological patterns. Actually, these two sets are implemented in the same way, i.e. there is only one program class for both. Such sets are called index sets.

An index set is a set of all possible index items (words or morphological patterns).

Initially the index sets are empty. While indexing, they grow when a new token or a new morphological pattern occurs. For every index item we store its occurrences:

"Haus" -> 1, 678, 2345 ...

"Hause" -> 12, 38, 234 ...

...

It is clear that while indexing we need not keep all occurrences in the corpus in short-term memory; that's why from time to time we can save the occurrences to the hard disk. This swapping is done in the main cycle of indexing. Thus we hold in memory only the two sets of index items without occurrences, and these sets usually grow very slowly[1].
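An index set with periodic swapping could look roughly like this (an illustrative C++ sketch with our own naming; the real on-disk layout of DDC is not reproduced):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// A sketch of an index set: every index item (a token string or a
// morphological pattern) is mapped to its not-yet-swapped occurrences.
class IndexSet {
public:
    void AddOccurrence(const std::string& item, uint32_t globalTokenOffset) {
        m_items[item].push_back(globalTokenOffset);
        ++m_bufferedOccurrences;
    }

    // Called from the main indexing cycle when the buffered occurrences grow
    // too large: occurrences are appended to disk and memory is freed, while
    // the set of items itself stays in memory (it grows very slowly).
    void Swap(FILE* occurrenceFile) {
        for (auto& [item, occurrences] : m_items) {
            // In the real program each item has its own region in the file;
            // here we only illustrate that the occurrences leave the memory.
            fwrite(occurrences.data(), sizeof(uint32_t),
                   occurrences.size(), occurrenceFile);
            occurrences.clear();
        }
        m_bufferedOccurrences = 0;
    }

    bool NeedsSwap() const {
        return m_bufferedOccurrences > 50u * 1024 * 1024 / sizeof(uint32_t); // ~50 MB
    }

private:
    std::map<std::string, std::vector<uint32_t>> m_items;
    size_t m_bufferedOccurrences = 0;
};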

We are going to present some statistics about the indexing process. For testing purposes we used a P4 1.5 GHz machine running Windows 2000. The statistical results depend upon the language of the corpus, since we used different linguistic processors to obtain tokens and sentence division.

 

Corpus Name        Language   Count of tokens   Size in bytes   Elapsed Time   Memory Usage
DWDS-corpus1       German     11 mln.           85 Mb           9 min          40 Mb
DWDS-corpus2       German     30 mln.           160 Mb          20 min         60 Mb
Moshkov-subset1    Russian    15 mln.           100 Mb          13 min         60 Mb
Moshkov-subset2    Russian    54 mln.           350 Mb          55 min         80 Mb

 

 

It may be worth adding that the program always allocates at least 18 Mb for the morphological dictionaries. The Linux version of the program works approximately 20% quicker than the Windows version.

The resulting index contains two huge files with occurrences of tokens and occurrences of morphological patterns. The size of these two files together is approximately equal to the size of the corpus in bytes. Thus the built index has the same size as the corpus itself.

At the last stage of the indexing process we create an index of reversed tokens in order to run queries truncated on the left, like "*nat". We do not need a special index for right truncation, since the set of tokens itself can be used for it.
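The idea of the reversed-token index can be sketched as follows (illustrative code, not the DDC implementation): to answer "*ken" we reverse the pattern into "nek" and do a prefix search over the sorted reversed tokens.

#include <algorithm>
#include <string>
#include <vector>

// Suffix search via reversed tokens: sortedReversedTokens holds every token
// spelled backwards, in sorted order.
std::vector<std::string> SuffixSearch(const std::vector<std::string>& sortedReversedTokens,
                                      const std::string& suffix) {
    std::string prefix(suffix.rbegin(), suffix.rend()); // "ken" -> "nek"
    std::vector<std::string> result;
    auto from = std::lower_bound(sortedReversedTokens.begin(),
                                 sortedReversedTokens.end(), prefix);
    for (auto it = from; it != sortedReversedTokens.end()
                         && it->compare(0, prefix.size(), prefix) == 0; ++it) {
        result.emplace_back(it->rbegin(), it->rend()); // restore the original spelling
    }
    return result;
}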

We have mentioned some of the restrictions that DDC has. Here is the full list of them:

  1. The size of the corpus cannot be more than 2.5 Gb. We recommend that the corpus size be less than or equal to the size of short-term memory (see the explanation in Unit 3).
  2. The size of an index set cannot be more than 8 mln. items. For example, this means that we cannot index huge corpora with files in different languages: Russian, English and German together contain more than 8 mln. different words, and that's why we cannot include their complete lists of word forms in one DDC corpus.
  3. The size of one sentence in the corpus cannot be more than 20 Kb (2000 words).

 


Unit 3. Running a query

 

In the current version of the program the query language contains the following constructions:

Query type: Word  (word description)
  Example: Hause
  Result: all sentences of the corpus which contain some morphological variant of "Hause".

Query type: Word*  (word description)
  Example: Ha*
  Result: all sentences which contain a word that starts with "Ha".

Query type: *Word  (word description)
  Example: *ken
  Result: all sentences which contain a word that ends with "ken".

Query type: [PartOfSpeech {Features}]  (word description)
  Examples: [SUB sin], [VER aux,inf]
  Result: all sentences which contain a word matching the given pattern. "PartOfSpeech" is a lexical category, "Features" is a comma-delimited list of morphological features (see below for the complete lists of features and parts of speech).

Query type: @Word  (word description)
  Example: @Hause
  Result: all sentences of the corpus which contain exactly the word form "Hause".

Query type: {H}  (word description)
  Examples: {AMTBERUFG}, {EIG}
  Result: all sentences of the corpus which contain a subtype of the given thesaurus node (the DWDS thesaurus is used).

Query type: "X1 X2 ... XN"  (sequence of word descriptions)
  Examples: "mein Haus", "Haus [VER]"
  Result: all sentences which contain "mein Haus"; all sentences which contain "Haus" followed by a verb.

Query type: Q1 && Q2  (conjunction of word or sequence descriptions)
  Example: Hause && [SUB sin]
  Result: all sentences of the corpus which contain both Q1 and Q2.

Query type: Q1 || Q2  (disjunction of word or sequence descriptions)
  Example: [VER aux,inf] || [SUB sin]
  Result: all sentences of the corpus which contain Q1 or Q2.

Query type: near(Q1;Q2;n)  (two word descriptions near each other, 0 <= n <= 10)
  Example: NEAR (Hause ; [SUB]; 2)
  Result: all sentences which contain "Hause" and some noun such that the distance between them is less than or equal to 2; the order of the occurrences is disregarded.

Query type: "X1 #D1 X2 #D2 ... XN"  (sequence of word descriptions with distances)
  Examples: "mein #1 Haus", "Haus #10 [VER]"
  Result: 1. all sentences which contain "mein" and "Haus", which may be separated by one token; 2. all sentences which contain "Haus" and some verb with up to 10 tokens between "Haus" and the verb. Here the order of the occurrences is important.

At the end of the query the user can add a context size operator of the form:


#cntxt N, where N=1,…,5.

 

The context size operator tells the program to show not only the sentence where the given query is found but also the N previous and following sentences. For example, the query
               test #cntxt 2
gives us all hits with the word “test” together with a two-sentence context.

Actually, DDC can process a query in two different ways. Firstly, we can ask the system to present us with the count of hits (the count of sentences which match the query). Secondly, we can ask it to give us some examples of the hits. It is quite obvious that the first type of query (we call it a statistical query) takes much more time than the second type (we call it an example query), since a statistical query has to process the whole corpus. The speed of a statistical query depends crucially upon the corpus frequencies of the items which constitute this query. Thus a query like "[ART] [ADJ] [NOUN]" (a pattern for "the smart dog", "a deep pool" and so on) will be processed a thousand times slower than a query like "a red boy", since there are millions of articles, adjectives and nouns in the corpus in comparison with the counts of "red" or "boy".

It goes without saying that the count of hits is calculated by all web engines, but the problem is that a "web hit" is a document, not a sentence. For example, for a query like "a*" Yahoo gives 642710 hits, while DDC gives 1185283 hits (two times more) for a corpus of 40 mln. tokens. Some search engines (like Google) give us only an approximate count of hits, which is of course much simpler. However, it is claimed that Google has more than 1.5 billion indexed pages.

All this leads us to the conclusion that statistical queries should be used only when they are really necessary; in other cases it is better to use example queries. In both cases (statistical and example) the query has to be parsed. The parsing is done with the help of Yacc & Lex. The grammar is written in a quite straightforward way, except for the parsing of morphological patterns, which is done by a special function. If there is no parsing error, the query is evaluated. The evaluation process is the process of getting hits from the corpus according to the parse tree of the query. When it is possible we use depth-first search. For example, for a query like:

( ( ( (A && B) && C) && D) && (E && F))

the sequence of evaluation steps is as follows:

(A && B)

( (A && B) && C)

( ( (A && B) && C) && D)

(E && F)

( ( ( (A && B) && C) && D) && (E && F))

A depth-first algorithm is preferable since a breadth-first algorithm is more space-consuming. It is very important that we can apply the evaluation procedure to some subset of the corpus, since in this case the usage of short-term memory is under our control. The whole body of the corpus is divided internally into smaller subcorpora. The size of a subcorpus[2] was obtained empirically after several experiments, and it perhaps depends upon the operating system[3].
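Such a depth-first evaluation over one subcorpus could be sketched like this (the node types and the leaf lookup callback are our assumptions; the real DDC evaluator also handles sequences, near and the other operators):

#include <algorithm>
#include <functional>
#include <iterator>
#include <memory>
#include <set>
#include <string>

using Hits = std::set<long>;
// Lookup of a single token description in one subcorpus; in DDC this comes
// from the occurrence index, here it is an assumed callback.
using LeafLookup = std::function<Hits(const std::string& token, int subcorpus)>;

struct QueryNode {
    enum class Op { Leaf, And, Or } op = Op::Leaf;
    std::string token;                      // used when op == Leaf
    std::unique_ptr<QueryNode> left, right; // used for And / Or
};

// Depth-first evaluation of the parse tree over one subcorpus: the left
// subtree is evaluated completely before the right one, so only a few
// intermediate hit sets are alive at any moment.
Hits Evaluate(const QueryNode& node, int subcorpus, const LeafLookup& lookup) {
    if (node.op == QueryNode::Op::Leaf)
        return lookup(node.token, subcorpus);
    Hits a = Evaluate(*node.left, subcorpus, lookup);
    Hits b = Evaluate(*node.right, subcorpus, lookup);
    Hits result;
    if (node.op == QueryNode::Op::And)
        std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                              std::inserter(result, result.begin()));
    else
        std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                       std::inserter(result, result.begin()));
    return result;
}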

The steps of evaluation are performed without any optimization, according to the semantics of the query operators. For instance, let's take a closer look at the near operator. The syntax of this operator is as follows: near(Q1;Q2;n), where Q1 and Q2 are token descriptions and n is the maximal distance between them. In order to evaluate this operator we have to get all occurrences of Q1 and Q2 in the current subcorpus. Then the occurrences have to be sorted. It is worth mentioning that the sorting of occurrences takes half of the processor time for queries containing very frequent items. Thus it is safe to say that DDC spends quite a long time regrouping (sorting) occurrences.

Let the occurrences of Q1 be x1, x2, ..., xk, and let the occurrences of Q2 be y1, y2, ..., ym. The following algorithm finds all pairs <xi, yj> such that |xi - yj| <= n (n is the distance).

Start := 0;
for i := 0 to k-1 do
begin
    j := Start;
    while (j < m)
    begin
        if (xi <= yj + n)
            then break;
        j := j + 1;
    end;
    Start := j;
    while (j < m)
    begin
        if (xi + n < yj)
            then break
            else <xi, yj> is a new pair;
        j := j + 1;
    end;
end;

The efficiency of this algorithm is O (n*k), where n is a distance and k is the count of occurrences of Q1.
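The same algorithm in runnable form could look like this (a C++ sketch that mirrors the pseudocode above; the function name is ours):

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Finds all pairs <x, y> with |x - y| <= n, where x and y are the sorted
// occurrence lists of Q1 and Q2.
std::vector<std::pair<uint32_t, uint32_t>>
NearPairs(const std::vector<uint32_t>& x, const std::vector<uint32_t>& y, uint32_t n) {
    std::vector<std::pair<uint32_t, uint32_t>> pairs;
    size_t start = 0;
    for (size_t i = 0; i < x.size(); ++i) {
        size_t j = start;
        while (j < y.size() && y[j] + n < x[i]) ++j; // skip occurrences too far to the left
        start = j;                                   // they stay too far left for later x's
        while (j < y.size() && y[j] <= x[i] + n) {   // occurrences within the distance
            pairs.emplace_back(x[i], y[j]);
            ++j;
        }
    }
    return pairs;
}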

The other operators are implemented in the same straightforward way. There is a slight difference between the logical operations (&& and ||) and the other ones. The logical operations deal with hits (sentences which match the query), while the other operations deal with occurrences of tokens. The result of evaluation is always a set of hits, i.e. we dare say that on the bottom level of the parse tree the program operates only on occurrences, while on the top levels it deals with hits.

Obviously there must be a function that converts occurrences to hits, and this function really exists and looks like this:

 

x1, x2, ..., xk - occurrences
h1, h2, ..., hm - hits (ends of sentences)

i := 0; // index of occurrence
q := 0; // index of hit
while (i < k)
{
    // binary search of xi in hits between q and m
    q := lower_bound(q, m, xi)[4];
    // adjusting q
    if (hq == xi)
        q := q + 1;
    hq is a new hit;
    i := i + 1;
};
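In runnable form the conversion could look as follows (a C++ sketch built around std::lower_bound; the treatment of an occurrence that coincides with a sentence end follows the pseudocode above):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Converts sorted occurrences to hits (sentence numbers): sentenceEnds holds
// the global end-of-sentence offsets, and the sentence containing an
// occurrence is found by binary search.
std::set<size_t> OccurrencesToHits(const std::vector<uint32_t>& occurrences,
                                   const std::vector<uint32_t>& sentenceEnds) {
    std::set<size_t> hits;
    auto from = sentenceEnds.begin();
    for (uint32_t x : occurrences) {
        from = std::lower_bound(from, sentenceEnds.end(), x); // first end offset >= x
        if (from == sentenceEnds.end()) break;
        size_t q = static_cast<size_t>(from - sentenceEnds.begin());
        if (*from == x) ++q; // an occurrence at the end offset belongs to the next sentence
        if (q < sentenceEnds.size()) hits.insert(q);
    }
    return hits;
}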

 

This procedure takes 30 percent of the time for queries containing very frequent items; it is the second slowest procedure of the querying mechanism. Having described the most time-consuming algorithms, we are going to present some statistics:

 

Query                      Type of query   Moshkov1 (15 mln)   Moshkov2 (54 mln)   DWDS1 (190 mln)
Mutter                     Example         0.05                0.05                0.07
Mutter                     Statistical     0.007               0.015               0.045
Ba*                        Example         0.06                0.07                0.15
Ba*                        Statistical     0.1                 0.3                 1.3
“[ADJ] [SUB] [VER]” [5]    Example         1.1                 1.1                 1.25
“[ADJ] [SUB] [VER]”        Statistical     2.5                 14                  25

All times are given in seconds.

 

 

The Google engine processes any query in 0.1 to 1 second, but there are hundreds of Google servers and thousands of simultaneous users. That's why the comparison is hardly valuable. Surely, when run on one separate computer, the Google web engine is much faster than DDC.

We should explain why the corpus of 54 mln. tokens ("Moshkov Library") is processed ten times slower than the corpus of 15 mln. tokens ("Moshkov Library subset"), while the former is only four times bigger than the latter. As far as we can understand, the problem lies in the size of short-term memory. The index of the smaller corpus is smaller than the free memory that we have on the test computer (130 Mb). The Windows system seems to load the whole index into memory, and therefore it works much quicker. The size of the index of "Moshkov Library" is 450 Mb; this file cannot be loaded into memory, and that's why the system has to swap it to the hard disk. We have tried the same query on a computer with 512 MB of memory; it was accomplished in 5 seconds. Generally, for DDC the following requirement should be fulfilled:

 

            The size of the index should not exceed the size of short-term memory on the computer; otherwise querying works approximately two times slower.

 

 

The restrictions on speed and size of the corpus can be overcome by using parallel processing. It is possible for DDC to create several corpora, put them on several computers, and run a query against these corpora simultaneously. For network transfer DDC uses Berkeley sockets over TCP/IP. There is one client in the middle of the structure. This client holds only the list of corpus servers, like this:

 

CorporaName      IP             PORT
GermanInternet   192.168.1.69   1
Kollok           192.168.1.61   2

 

When a user runs a query, this client simply redirects the query to all corpora, and after this it sums up the result data and outputs it to the user. In its turn, each corpus server opens a socket in order to receive queries and to send back the results. Parallel processing can help us to work with an unlimited corpus as though it were a small one.
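A much simplified POSIX-socket sketch of such a client is given below. The line-based wire protocol (the query is sent as one line, the server answers with a hit count) is our assumption for illustration; the real DDC protocol and its error handling are not reproduced here.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstdlib>
#include <string>
#include <vector>

struct CorpusServer { std::string name, ip; int port; };

// Sends the query to one corpus server and reads back its hit count
// (assumed protocol: one request line, one numeric answer).
long QueryOneServer(const CorpusServer& s, const std::string& query) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(static_cast<uint16_t>(s.port));
    inet_pton(AF_INET, s.ip.c_str(), &addr.sin_addr);
    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
        close(fd);
        return 0;
    }
    std::string request = query + "\n";
    send(fd, request.data(), request.size(), 0);
    char buffer[64] = {};
    recv(fd, buffer, sizeof(buffer) - 1, 0);
    close(fd);
    return std::atol(buffer);
}

// The middle client: redirects the query to every server and sums up the
// results (in DDC the servers are queried simultaneously).
long QueryAllServers(const std::vector<CorpusServer>& servers, const std::string& query) {
    long total = 0;
    for (const CorpusServer& s : servers)
        total += QueryOneServer(s, query);
    return total;
}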

 

Unit 4. Archiving occurrences

 

In order to reduce the size of the index we introduce an archiving algorithm for occurrences. A list of occurrences is a long sorted list of positive numbers. The average difference between two neighbouring items is between 10 and 100, depending upon the index set. The algorithm we are using is based on that fact.

Normally, without archiving, each occurrence is stored as one four-byte number, which is too wasteful. The first step consists of replacing the occurrences by the differences of adjacent items:

X1, X2, X3, ..., Xn  ->  X1, X2-X1, X3-X2, ..., Xn-X(n-1)

This operation greatly reduces the average value of a list item. The operation and its inverse are performed in O(N), where N is the length of the list.

After this we can apply a length-dependent method of number storage. According to this method a number is stored as follows:

A1, A2, ..., A5, B1, B2, ..., Bk, where Ai and Bj are bits.

The bit sequence A1, A2, ..., A5 holds the length of the sequence B1, B2, ..., Bk, i.e. it holds the value of k. The maximum value of k is 32. The bit sequence B1, B2, ..., Bk holds the number itself (the item of the list). If the values of the items are relatively low, this operation reduces the storage space for the list; otherwise the storage space could grow. The precise formula describing the conditions under which the storage space begins to grow could be derived, but it is hardly necessary.

The composition of these two operations (replacing an occurrence by the difference of two adjacent items and the length-dependent method of number storage) constitutes the archiving algorithm of DDC. Obviously it works with linear speed. The index of the corpus is reduced by approximately 40%.
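Putting the two operations together, the archiving step could be sketched as follows (illustrative C++; the bit order and the storing of k-1 in the five-bit length field are our choices, not necessarily those of DDC):

#include <cstddef>
#include <cstdint>
#include <vector>

// A very simple bit writer used only for illustration.
class BitWriter {
public:
    void Write(uint32_t value, unsigned bitCount) {
        for (unsigned i = 0; i < bitCount; ++i)
            m_bits.push_back((value >> i) & 1u);
    }
    size_t SizeInBits() const { return m_bits.size(); }
private:
    std::vector<bool> m_bits;
};

unsigned BitLength(uint32_t v) {          // number of significant bits, at least 1
    unsigned k = 1;
    while (v >>= 1) ++k;
    return k;
}

// Occurrences are replaced by differences of adjacent items, and every
// difference is stored as a 5-bit length field followed by k significant bits.
void ArchiveOccurrences(const std::vector<uint32_t>& occurrences, BitWriter& out) {
    uint32_t previous = 0;
    for (uint32_t x : occurrences) {
        uint32_t delta = x - previous;    // X1, X2-X1, X3-X2, ...
        previous = x;
        unsigned k = BitLength(delta);
        out.Write(k - 1, 5);              // A1..A5: the length (k-1, so that k = 32 fits into five bits)
        out.Write(delta, k);              // B1..Bk: the value itself
    }
}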

 

Unit 5. Conclusion

 

On the whole we have to admit that DDC is a quite simple system, which to our mind has the following benefits:

  1. Indexing in "constant size" memory;
  2. Indexing by sentences, not by documents;
  3. Querying by subcorpora, which means that querying is also run in "constant size" memory;
  4. Parallel querying, which enables us to deal with unlimited corpora;
  5. The distinction between statistical queries and example queries.

On the other hand there are some evident drawbacks:

  1. The speed of the engine is much slower than the speed of current web search engines;
  2. There is no systematic analysis of the two most time-consuming procedures of the query mechanism;
  3. Without archiving, the index files take up the same space as the corpus itself.

We hope that these drawbacks do not outweigh the benefits, and we also hope that DDC will be helpful for the linguistic community.



[1] For a German corpus of 25 mln. tokens, the set of morphological patterns contains about 6,000 items, and the set of tokens about 600,000.

 

[2] In the current version the size of a subcorpus is 1 mln. tokens.

 

[3] DDC was created under Windows 2000.

 

[4] The lower_bound function determines the lowest value A in the range [q, m) such that for each j in the range [q, A), hj < xi.

[5] Any adjective, followed by a noun, which is followed by a verb.