Primož Jakopin

UPPER BOUND OF ENTROPY IN SLOVENIAN LITERARY TEXTS

Abstract

Quantitative description of two samples of literary texts in Slovenian is given in the thesis. From the most important parameters of character n-grams, from single characters to 24-grams, the upper bound of entropy is estimated.

The second chapter, which follows the introduction with description of the main contributions of the thesis, is dedicated to the sources of both samples. The first sample contains 60 texts of 41 authors, from Ivan Cankar to Ivan Zorec, 46 original works and 14 translations, published between 1858 and 1996, 17 million characters, 2,7 million words and 200.000 sentences in all. The second sample includes the complete opus of the great Slovenian stylist from the middle of the century, Ciril Kosmač (1910-1980), 52 works published between 1931 and 1988, 2,5 million characters, 408.000 words and 37.000 sentences. Both samples together are estimated at 0,5 % - 1 % of the complete Slovenian literary production.

The third chapter is devoted to preparation of texts. It begins with text headers and additional symbols, introduced to facilitate the task of tagging the text components to the level of sentences, including direct speech and quotes, and to enable an automatic conversion of texts from a single file to a set of HTML files for Internet. The second part of the chapter deals with the question of errors. It describes the procedure which generated a collection of 3,5 million wordforms from 92.000 verb, noun and adjective lemmata from the Dictionary of the Slovenian Literary Language. The collection has been used during the cleanup of texts from both samples.

Part-of-speech tagging is described in chapter 4. The second sample has been POS tagged, with a stochastic tagger built for the purpose. Tagged texts were verified and corrected at the Institute for the Slovenian Language (part of Scientific Research Centre of the Slovenian Academy of Sciences and Arts); so the precision of the tagger could be measured directly and it turned out to be 92 %. The most frequent lemmata for the 5 main word classes of the second sample are given in table 1. Percentage values which follow the words were computed with regard to the total number of words (407.938): 9,11 % equals the frequency of 37.163 and 0,06 % the frequency of 245. The verb biti ( to be) has been divided into 3 components. Biti p denotes an auxiliary verb (as in John is moving the table.), biti r the verb to be in relational sense (as in You are beautiful.) and biti o in the most narrow sense of existence in time or space (as in There is a house.)

Table 1: The most common lemmata for the 5 word classes in sample 2 (frequencies in %)

       verb        noun        adjective        pronoun        adverb



1.    biti p 9,11    roka 0,39    star 0,19    on 1,60    tako 0,39

2.    biti r 1,71    glava 0,24    velik 0,12    ki 0,73    zdaj 0,33

3.    ne biti p 0,65    oči 0,20    lep 0,11    jaz 0,70    nato 0,24

4.    reči 0,47    otrok 0,20    dolg 0,09    ta 0,69    spet 0,21

5.    biti o 0,44    dan 0,18    črn 0,09    ona 0,66    potem 0,18

6.    imeti 0,23    hiša 0,17    bel 0,09    svoj 0,55    počasi 0,15

7.    vedeti 0,23    leto 0,15    dober 0,08    ves 0,39    lahko 0,13

8.    videti 0,20    vrata 0,13    živ 0,07    ti 0,36    takoj 0,11

9.    iti 0,19    beseda 0,13    mlad 0,07    vse 0,29    skoraj 0,10

10.    stopiti 0,18    oče 0,12    težek 0,06    oni 0,28    bolj 0,10

11.    začeti 0,17    človek 0,11    širok 0,06    sam 0,22    naglo 0,09

12.    pogledati 0,16    glas 0,11    hud 0,06    kako 0,20    dobro 0,08

Very approximate English translations are given in table 2. The pronoun svoj, marked by an asterisk, is impossible to translate into one word: it means pertaining to oneself.

Table 2: Approximate English translations of the words in table 1

       verb        noun        adjective        pronoun        adverb



1.    biti p 9,11    hand 0,39    old 0,19    he 1,60    so 0,39

2.    biti r 1,71    head 0,24    big 0,12    which 0,73    now 0,33

3.    ne biti p 0,65    eyes 0,20    beautiful 0,11    I 0,70    then 0,24

4.    to say 0,47    child 0,20    long 0,09    this 0,69    again 0,21

5.    biti o 0,44    day 0,18    black 0,09    she 0,66    afterwards 0,18

6.    to have 0,23    house 0,17    white 0,09    svoj* 0,55    slowly 0,15

7.    to know 0,23    year 0,15    good 0,08    total 0,39    easily 0,13

8.    to see 0,20    door 0,13    alive 0,07    you 0,36    at once 0,11

9.    to go 0,19    word 0,13    young 0,07    all 0,29    nearly 0,10

10.    to step 0,18    father 0,12    heavy 0,06    they 0,28    more 0,10

11.    to begin 0,17    man 0,11    wide 0,06    alone 0,22    fast 0,09

12.    to look at 0,16    voice 0,11    angry 0,06    how 0,20    well 0,08

Very interesting are the distributions of gender, number and case in the POS tags of the relevant words. The ratios for gender are 55 % (masculine), 33 % (feminine) and 12 % (neuter). For the number the figures are 65 % for singular, 33 % for plural and 2 % for dual. The distribution of cases is 28 % for nominative, 13 % for genitive, 6 % for dative, 27 % for accusative, 15 % for locative and 11 % for instrumental.

Statistical description of both samples is the subject of chapter 5. The common character set is given (168 characters), as well as character sets and letter distributions for both samples. The distribution of the most common 25 letters is shown in figure 1. The vowels are most common with e (7,92 % of all characters) on top, followed by a (7,74 % of the total). Character n-grams follow and the most common for n = 1 to n = 14 are given for both samples. Word statistics comes next:

Figure 1: Distribution of the most common 25 letters in both samples

the average word length for both samples is 4,55 letters and the 12 most common wordforms are: je 'is', in 'and', se reflexive personal pronoun and free morpheme of reflexive verbs, v 'in', da subordinator 'that', na 'on', so 'are' (3^rd person), ne the particles 'not' and 'no', pa 'but', ki relative particle, bi particle of the conditional and z 'with'. The longest words in both samples as well as in the Bank of English are given, followed by the most common word n-grams. Description of sentences with sentence lengths and the most common sentences (top 7 in sample 1: Yes. So it is. By all means. Of course. For sure. No. How?) come at the end of the chapter.

Table 3: Entropy of character n-grams in the first sample

n    H    H_n    F_n    n    H    H_n    F_n

1     4,456    4,456    4,456        13    23,460    1,805    0,239

2     7,994    3,997    3,538        14    23,615    1,687    0,155

3    11,020    3,673    3,026        15    23,715    1,581    0,100

4    13,565    3,391    2,545        16    23,779    1,486    0,064

5    15,739    3,148    2,174        17    23,821    1,401    0,042

6    17,643    2,941    1,904        18    23,848    1,325    0,027

7    19,272    2,753    1,629        19    23,866    1,256    0,018

8    20,587    2,573    1,315        20    23,878    1,194    0,012

9    21,594    2,399    1,007        21    23,886    1,137    0,008

10    22,334    2,233    0,740        22    23,891    1,086    0,005

11    22,861    2,078    0,527        23    23,895    1,039    0,004

12    23,221    1,935    0,360        24    23,898    0,996    0,003

Entropy is the main topic of chapter 6. An algorithm is given, which computes entropies up to n = 62 and for sample sizes which can be an order of magnitude bigger than the samples of the thesis. Its achievements are based on special management of the quickly expanding database of Hapax Legomena. Total entropy H, entropy per character H_n and conditional entropies F_n are given for both samples, to n = 24 (table 3). At that stage the conditional entropy, i. e. the entropy of n-th character if the previous (n-1) are known, falls below 0,005 bits per character in both samples. From the values of H_n and from the flow of new Hapax Legomena from n to (n+1) the upper bound of entropy in Slovenian literary texts is estimated at 2,2 bits per character. A language model, based on probabilities of n-grams from the first sample which achieved the frequency at least 2, is described. The mapping of the text of the second sample, based on the model, achieved an average length of 2,7 bits per character.

How fast the number of different n-grams, if sorted by frequency in descending order (the most frequent first) fills up the entire set in sample 1 is shown in figure 2. The growth curves for n-grams from 1 to 12 follow one another from the left to the right. From the figure it can be seen that to attain half of the entire text sample it is necessary to have 6 most frequent characters (Space, e, a, i, o, n), 50 bigrams, 300 trigrams 2.000 4-grams, 8.000 5-grams, 32.000 6-grams, 80.000 7-grams, 300.000 8-grams, 800.000 9-grams and 1.500.000 10-grams.

Figure 2: Growth curves for character n-grams (1-12) in the first sample

Conclusions as well as the possible follow-up research in the field of computational analysis of Slovenian literary corpora are discussed in the final, seventh chapter.

Naslov strani: http://www.jakopin.net/primoz/disertacija/abstract.php Datum: 16. maj 1999. Zadnja sprememba: 17. februar 2017. 1457

Naprej: Zahvala Nazaj: Izvleček Kazalo Začetek Konec

	verb		noun		adjective		pronoun		adverb

1.	biti p	9,11	roka	0,39	star	0,19	on	1,60	tako	0,39
2.	biti r	1,71	glava	0,24	velik	0,12	ki	0,73	zdaj	0,33
3.	ne biti p	0,65	oči	0,20	lep	0,11	jaz	0,70	nato	0,24
4.	reči	0,47	otrok	0,20	dolg	0,09	ta	0,69	spet	0,21
5.	biti o	0,44	dan	0,18	črn	0,09	ona	0,66	potem	0,18
6.	imeti	0,23	hiša	0,17	bel	0,09	svoj	0,55	počasi	0,15
7.	vedeti	0,23	leto	0,15	dober	0,08	ves	0,39	lahko	0,13
8.	videti	0,20	vrata	0,13	živ	0,07	ti	0,36	takoj	0,11
9.	iti	0,19	beseda	0,13	mlad	0,07	vse	0,29	skoraj	0,10
10.	stopiti	0,18	oče	0,12	težek	0,06	oni	0,28	bolj	0,10
11.	začeti	0,17	človek	0,11	širok	0,06	sam	0,22	naglo	0,09
12.	pogledati	0,16	glas	0,11	hud	0,06	kako	0,20	dobro	0,08

n	H	H_n	F_n	n	H	H_n	F_n

1	4,456	4,456	4,456	13	23,460	1,805	0,239
2	7,994	3,997	3,538	14	23,615	1,687	0,155
3	11,020	3,673	3,026	15	23,715	1,581	0,100
4	13,565	3,391	2,545	16	23,779	1,486	0,064
5	15,739	3,148	2,174	17	23,821	1,401	0,042
6	17,643	2,941	1,904	18	23,848	1,325	0,027
7	19,272	2,753	1,629	19	23,866	1,256	0,018
8	20,587	2,573	1,315	20	23,878	1,194	0,012
9	21,594	2,399	1,007	21	23,886	1,137	0,008
10	22,334	2,233	0,740	22	23,891	1,086	0,005
11	22,861	2,078	0,527	23	23,895	1,039	0,004
12	23,221	1,935	0,360	24	23,898	0,996	0,003