The thesis gives a quantitative description of two samples of literary texts in Slovenian. From the most important parameters of character n-grams, from single characters up to 24-grams, an upper bound on the entropy of Slovenian literary text is estimated.
The second chapter, which follows an introduction describing the main contributions of the thesis, is dedicated to the sources of the two samples. The first sample contains 60 texts by 41 authors, from Ivan Cankar to Ivan Zorec: 46 original works and 14 translations, published between 1858 and 1996, totalling 17 million characters, 2,7 million words and 200.000 sentences. The second sample comprises the complete opus of Ciril Kosmač (1910-1980), the great Slovenian stylist of the middle of the twentieth century: 52 works published between 1931 and 1988, with 2,5 million characters, 408.000 words and 37.000 sentences. Together the two samples are estimated to cover 0,5 % to 1 % of the complete Slovenian literary production.
The third chapter is devoted to the preparation of the texts. It begins with the text headers and additional symbols introduced to facilitate tagging of the text components down to the level of sentences, including direct speech and quotations, and to enable automatic conversion of the texts from a single file into a set of HTML files for the Internet. The second part of the chapter deals with the question of errors. It describes the procedure which generated a collection of 3,5 million wordforms from 92.000 verb, noun and adjective lemmata of the Dictionary of the Slovenian Literary Language. This collection was used during the cleanup of the texts in both samples.
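As an illustration only (not the procedure used in the thesis), a cleanup of this kind can rely on a simple lookup against the wordform collection; the file names and the tokenization below are assumptions:

```python
import re

# "wordforms.txt" is a hypothetical file standing in for the collection of
# 3,5 million wordforms, one wordform per line.
with open("wordforms.txt", encoding="utf-8") as f:
    known = {line.strip().lower() for line in f if line.strip()}

def suspicious_tokens(line):
    """Return the alphabetic tokens of a line that are missing from the collection."""
    tokens = re.findall(r"[^\W\d_]+", line)
    return [t for t in tokens if t.lower() not in known]

# "text.txt" is a hypothetical file holding one of the literary texts.
with open("text.txt", encoding="utf-8") as f:
    for number, line in enumerate(f, start=1):
        for token in suspicious_tokens(line):
            print(f"{number}: {token}")
```

Flagged tokens would of course include proper names and rare but correct words, so such a list can only serve as candidates for manual checking.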
Part-of-speech tagging is described in chapter 4. The second sample was POS tagged with a stochastic tagger built for the purpose. The tagged texts were verified and corrected at the Institute for the Slovenian Language (part of the Scientific Research Centre of the Slovenian Academy of Sciences and Arts), so the accuracy of the tagger could be measured directly; it turned out to be 92 %. The most frequent lemmata for the 5 main word classes of the second sample are given in table 1. The percentage values which follow the words were computed with respect to the total number of words (407.938): 9,11 % corresponds to a frequency of 37.163 and 0,06 % to a frequency of 245. The verb biti ('to be') has been divided into 3 components: biti p denotes the auxiliary verb (as in John is moving the table.), biti r the verb to be in the relational sense (as in You are beautiful.) and biti o the verb in the narrowest sense of existence in time or space (as in There is a house.).
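The two percentage values mentioned above follow directly from these counts: 37.163 / 407.938 ≈ 0,0911 = 9,11 % and 245 / 407.938 ≈ 0,0006 = 0,06 %.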
Table 1: The most common lemmata for the 5 word classes in sample 2 (frequencies in %)
 | verb | % | noun | % | adjective | % | pronoun | % | adverb | % |
1. | biti p | 9,11 | roka | 0,39 | star | 0,19 | on | 1,60 | tako | 0,39 |
2. | biti r | 1,71 | glava | 0,24 | velik | 0,12 | ki | 0,73 | zdaj | 0,33 |
3. | ne biti p | 0,65 | oči | 0,20 | lep | 0,11 | jaz | 0,70 | nato | 0,24 |
4. | reči | 0,47 | otrok | 0,20 | dolg | 0,09 | ta | 0,69 | spet | 0,21 |
5. | biti o | 0,44 | dan | 0,18 | črn | 0,09 | ona | 0,66 | potem | 0,18 |
6. | imeti | 0,23 | hiša | 0,17 | bel | 0,09 | svoj | 0,55 | počasi | 0,15 |
7. | vedeti | 0,23 | leto | 0,15 | dober | 0,08 | ves | 0,39 | lahko | 0,13 |
8. | videti | 0,20 | vrata | 0,13 | živ | 0,07 | ti | 0,36 | takoj | 0,11 |
9. | iti | 0,19 | beseda | 0,13 | mlad | 0,07 | vse | 0,29 | skoraj | 0,10 |
10. | stopiti | 0,18 | oče | 0,12 | težek | 0,06 | oni | 0,28 | bolj | 0,10 |
11. | začeti | 0,17 | človek | 0,11 | širok | 0,06 | sam | 0,22 | naglo | 0,09 |
12. | pogledati | 0,16 | glas | 0,11 | hud | 0,06 | kako | 0,20 | dobro | 0,08 |
Very approximate English translations are given in table 2. The pronoun svoj, marked by an asterisk, cannot be translated with a single English word: it means 'pertaining to oneself'.
Table 2: Approximate English translations of the words in table 1
 | verb | % | noun | % | adjective | % | pronoun | % | adverb | % |
1. | biti p | 9,11 | hand | 0,39 | old | 0,19 | he | 1,60 | so | 0,39 |
2. | biti r | 1,71 | head | 0,24 | big | 0,12 | which | 0,73 | now | 0,33 |
3. | ne biti p | 0,65 | eyes | 0,20 | beautiful | 0,11 | I | 0,70 | then | 0,24 |
4. | to say | 0,47 | child | 0,20 | long | 0,09 | this | 0,69 | again | 0,21 |
5. | biti o | 0,44 | day | 0,18 | black | 0,09 | she | 0,66 | afterwards | 0,18 |
6. | to have | 0,23 | house | 0,17 | white | 0,09 | svoj* | 0,55 | slowly | 0,15 |
7. | to know | 0,23 | year | 0,15 | good | 0,08 | total | 0,39 | easily | 0,13 |
8. | to see | 0,20 | door | 0,13 | alive | 0,07 | you | 0,36 | at once | 0,11 |
9. | to go | 0,19 | word | 0,13 | young | 0,07 | all | 0,29 | nearly | 0,10 |
10. | to step | 0,18 | father | 0,12 | heavy | 0,06 | they | 0,28 | more | 0,10 |
11. | to begin | 0,17 | man | 0,11 | wide | 0,06 | alone | 0,22 | fast | 0,09 |
12. | to look at | 0,16 | voice | 0,11 | angry | 0,06 | how | 0,20 | well | 0,08 |
The distributions of gender, number and case in the POS tags of the relevant words are also of interest. The ratios for gender are 55 % masculine, 33 % feminine and 12 % neuter. For number the figures are 65 % singular, 33 % plural and 2 % dual. The distribution of cases is 28 % nominative, 13 % genitive, 6 % dative, 27 % accusative, 15 % locative and 11 % instrumental.
The statistical description of both samples is the subject of chapter 5. The common character set is given (168 characters), as well as the character sets and letter distributions of both samples. The distribution of the 25 most common letters is shown in figure 1: the vowels are the most common, with e (7,92 % of all characters) at the top, followed by a (7,74 % of the total).
Figure 1: Distribution of the 25 most common letters in both samples
Character n-grams follow, and the most common ones for n = 1 to n = 14 are given for both samples. Word statistics come next: the average word length in both samples is 4,55 letters, and the 12 most common wordforms are je 'is', in 'and', se reflexive personal pronoun and free morpheme of reflexive verbs, v 'in', da subordinator 'that', na 'on', so 'are' (3rd person plural), ne the particle 'not'/'no', pa 'but', ki relative particle, bi particle of the conditional and z 'with'. The longest words in both samples, as well as in the Bank of English, are listed, followed by the most common word n-grams. A description of sentences, with sentence lengths and the most common sentences (the top 7 in sample 1: Yes. So it is. By all means. Of course. For sure. No. How?), closes the chapter.
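Figures of this kind are straightforward to recompute; a minimal sketch in Python, assuming the sample is available as a plain UTF-8 text file (the file name is hypothetical):

```python
from collections import Counter
import re

# "sample1.txt" is a hypothetical file standing in for one of the samples.
with open("sample1.txt", encoding="utf-8") as f:
    text = f.read()

# Relative frequencies of the 25 most common letters, in %.
letters = Counter(ch.lower() for ch in text if ch.isalpha())
total_letters = sum(letters.values())
for letter, count in letters.most_common(25):
    print(f"{letter}  {100.0 * count / total_letters:.2f} %")

# Average word length and the 12 most common wordforms.
words = Counter(w.lower() for w in re.findall(r"[^\W\d_]+", text))
total_words = sum(words.values())
average_length = sum(len(w) * n for w, n in words.items()) / total_words
print(f"average word length: {average_length:.2f} letters")
print(words.most_common(12))
```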
Table 3: Entropy of character n-grams in the first sample (in bits)
n | H | H/n | F | n | H | H/n | F |
1 | 4,456 | 4,456 | 4,456 | 13 | 23,460 | 1,805 | 0,239 |
2 | 7,994 | 3,997 | 3,538 | 14 | 23,615 | 1,687 | 0,155 |
3 | 11,020 | 3,673 | 3,026 | 15 | 23,715 | 1,581 | 0,100 |
4 | 13,565 | 3,391 | 2,545 | 16 | 23,779 | 1,486 | 0,064 |
5 | 15,739 | 3,148 | 2,174 | 17 | 23,821 | 1,401 | 0,042 |
6 | 17,643 | 2,941 | 1,904 | 18 | 23,848 | 1,325 | 0,027 |
7 | 19,272 | 2,753 | 1,629 | 19 | 23,866 | 1,256 | 0,018 |
8 | 20,587 | 2,573 | 1,315 | 20 | 23,878 | 1,194 | 0,012 |
9 | 21,594 | 2,399 | 1,007 | 21 | 23,886 | 1,137 | 0,008 |
10 | 22,334 | 2,233 | 0,740 | 22 | 23,891 | 1,086 | 0,005 |
11 | 22,861 | 2,078 | 0,527 | 23 | 23,895 | 1,039 | 0,004 |
12 | 23,221 | 1,935 | 0,360 | 24 | 23,898 | 0,996 | 0,003 |
Entropy is the main topic of chapter 6. An algorithm is given which computes entropies up to n = 62 and for sample sizes an order of magnitude larger than the samples of the thesis; its performance rests on special management of the quickly growing database of hapax legomena. The total entropy H, the entropy per character H/n and the conditional entropies F are given for both samples up to n = 24 (table 3). At that point the conditional entropy, i.e. the entropy of the n-th character when the previous (n-1) characters are known, falls below 0,005 bits per character in both samples. From the values of H and from the flow of new hapax legomena from n to (n+1), the upper bound of the entropy of Slovenian literary texts is estimated at 2,2 bits per character. A language model based on the probabilities of those n-grams from the first sample which reached a frequency of at least 2 is also described; a mapping of the text of the second sample based on this model achieved an average length of 2,7 bits per character.
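The quantities in table 3 can in principle be reproduced by plain n-gram counting (the algorithm of the thesis, with its management of hapax legomena, is needed to make this feasible for large n and large samples); a minimal sketch, again with a hypothetical file name:

```python
from collections import Counter
from math import log2

def ngram_entropies(text, max_n):
    """Yield (n, H, H/n, F) for n = 1..max_n.

    H is the entropy of the n-gram distribution, H/n the entropy per
    character and F = H(n) - H(n-1), the conditional entropy of the n-th
    character given the previous n-1 characters.
    """
    previous_h = 0.0
    for n in range(1, max_n + 1):
        counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        total = sum(counts.values())
        h = -sum((c / total) * log2(c / total) for c in counts.values())
        yield n, h, h / n, h - previous_h
        previous_h = h

# "sample1.txt" is a hypothetical file standing in for the first sample; note
# that a naive Counter over millions of characters becomes memory-hungry as n grows.
with open("sample1.txt", encoding="utf-8") as f:
    text = f.read()

for n, h, h_per_char, f_cond in ngram_entropies(text, 8):
    print(f"{n:2d}  H = {h:7.3f}  H/n = {h_per_char:5.3f}  F = {f_cond:5.3f}")
```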
Figure 2 shows how fast the different n-grams, sorted by frequency in descending order (the most frequent first), fill up the entire first sample. The growth curves for n-grams of length 1 to 12 follow one another from left to right. It can be seen from the figure that, to cover half of the entire text sample, one needs the 6 most frequent characters (space, e, a, i, o, n), 50 bigrams, 300 trigrams, 2.000 4-grams, 8.000 5-grams, 32.000 6-grams, 80.000 7-grams, 300.000 8-grams, 800.000 9-grams and 1.500.000 10-grams.
Figure 2: Growth curves for character n-grams (1-12) in the first sample
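The half-coverage counts quoted above can be computed directly from the n-gram frequencies; a minimal sketch, again assuming the hypothetical file sample1.txt:

```python
from collections import Counter

def ngrams_for_half_coverage(text, n):
    """Number of most frequent n-grams needed to cover half of the text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    half = sum(counts.values()) / 2
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= half:
            return rank
    return len(counts)

with open("sample1.txt", encoding="utf-8") as f:
    text = f.read()

for n in range(1, 7):
    print(n, ngrams_for_half_coverage(text, n))
```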
Conclusions, together with possible follow-up research in the field of computational analysis of Slovenian literary corpora, are discussed in the final, seventh chapter.