StatisticsTrees

nevenjovanovic edited this page Oct 18, 2019 · 2 revisions

Statistics on syntactic trees

Alpheios ALDT XML is transformed into nested trees, as proposed by Bozia 2018, to make retrieving child and parent nodes easier. The script ListToTree.xq populates a new database, which is created by createGrcTBGTree.xq. Only sentences with existing @head attribute values are included.

Database: grc-tb-g-tree

Date: 2019-10-18+02:00

Documents: 123

Sentences: 20,830

Words: 500,098

How many subtrees per sentence?

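A subtree is a part of the sentence tree that is governed directly by the root, that is, a word whose @head value is 0 together with everything nested under it. For example, the following sentence has three subtrees: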

<sentence id="152" document_id="urn:cts:greekLit:tlg0008.tlg001.perseus-grc1" subdoc="13.47" span="Χαιρεφῶντος0:.12">
  <word id="2" form="," lemma="punc1" postag="u--------" relation="AuxX" head="0"/>
  <word id="3" form="ἔφη" lemma="φημί" postag="v3siia---" relation="PRED" head="0">
    <word id="6" insertion_id="005f" artificial="elliptic" relation="OBJ" form="[1]" lemma="[1]" postag="v_____---" head="3">
      <word id="1" form="ἔγχαλκα" lemma="ἔγχαλκος" postag="a-p---nn-" relation="PNOM" head="6"/>
      <word id="4" form="παιδίον" lemma="παιδίον" postag="n-s---nv-" relation="ExD" head="6"/>
    </word>
  </word>
  <word id="5" form="." lemma="punc1" postag="u--------" relation="AuxK" head="0"/>
</sentence>

In the original ALDT XML, the same sentence is a flat list of words:

<sentence>
  <word id="1" form="ἔγχαλκα" lemma="ἔγχαλκος" postag="a-p---nn-" relation="PNOM" head="6"/>
  <word id="2" form="," lemma="punc1" postag="u--------" relation="AuxX" head="0"/>
  <word id="3" form="ἔφη" lemma="φημί" postag="v3siia---" relation="PRED" head="0"/>
  <word id="4" form="παιδίον" lemma="παιδίον" postag="n-s---nv-" relation="ExD" head="6"/>
  <word id="5" form="." lemma="punc1" postag="u--------" relation="AuxK" head="0"/>
  <word id="6" insertion_id="005f" artificial="elliptic" relation="OBJ" form="[1]" lemma="[1]" postag="v_____---" head="3"/>
</sentence>
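
The step from the flat list to the nested tree can be sketched as follows. This is a minimal Python illustration, not the author's ListToTree.xq: the names `FLAT` and `list_to_tree` are hypothetical, the word attributes are trimmed for brevity, and the sketch assumes that every @head value is either 0 or the @id of a word in the same sentence and that the heads contain no cycles.

```python
import xml.etree.ElementTree as ET

# Flat ALDT-style sentence from the example above, with attributes trimmed.
FLAT = """<sentence>
  <word id="1" form="ἔγχαλκα" relation="PNOM" head="6"/>
  <word id="2" form="," relation="AuxX" head="0"/>
  <word id="3" form="ἔφη" relation="PRED" head="0"/>
  <word id="4" form="παιδίον" relation="ExD" head="6"/>
  <word id="5" form="." relation="AuxK" head="0"/>
  <word id="6" form="[1]" relation="OBJ" head="3"/>
</sentence>"""


def list_to_tree(flat_sentence):
    """Return a new <sentence> whose <word> elements are nested by @head."""
    nested = ET.Element("sentence", dict(flat_sentence.attrib))
    words = flat_sentence.findall("word")
    # One fresh copy of every word, keyed by its @id.
    copies = {w.get("id"): ET.Element("word", dict(w.attrib)) for w in words}
    for word in words:
        head = word.get("head")
        # head="0" means the word hangs directly off the sentence root.
        parent = nested if head == "0" else copies[head]
        parent.append(copies[word.get("id")])
    return nested


tree = list_to_tree(ET.fromstring(FLAT))
top_level = [w.get("id") for w in tree.findall("word")]
print(top_level)
```

For the example sentence, the words attached directly to the root are 2, 3 and 5, word 6 is nested under word 3, and words 1 and 4 are nested under word 6, which matches the nested version shown earlier.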

Analysis and findings

Outliers are sentences with more than 10 subtrees; the maximum value is 33. Sentences with just one subtree are also unusual (there are only 20 such cases, see the data below), which is to be expected, because sentence punctuation, such as a full stop or question mark, counts as a subtree.

A large majority of sentences (18,649 of the 20,830 in the corpus) have between 2 and 5 subtrees.

Bear in mind that a sentence with two subtrees could have more than two elements; the subtrees themselves can have subtrees.

In fact, we can measure how many of the subtrees that depend directly on the root branch out further, that is, how many have two or more subtrees of their own (using the script CountSubtreeBranches.xq). There are 20,091 such cases.
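
The three measures discussed here can be illustrated on the example sentence. The following Python sketch is an assumption about what is being counted, not a port of CountSubtrees.xq or CountSubtreeBranches.xq: `NESTED` and `subtree_counts` are hypothetical names, and "branching" is read here as "a top-level word with two or more direct children", which may differ from the definition used in the actual scripts.

```python
import xml.etree.ElementTree as ET

# Nested form of the example sentence, with attributes trimmed.
NESTED = """<sentence id="152">
  <word id="2" relation="AuxX" head="0"/>
  <word id="3" relation="PRED" head="0">
    <word id="6" relation="OBJ" head="3">
      <word id="1" relation="PNOM" head="6"/>
      <word id="4" relation="ExD" head="6"/>
    </word>
  </word>
  <word id="5" relation="AuxK" head="0"/>
</sentence>"""


def subtree_counts(sentence):
    """Return (top-level subtrees, all words, branching top-level subtrees)."""
    top = sentence.findall("word")           # direct children of the root
    all_words = sentence.findall(".//word")  # every word at any depth
    # Assumed reading: a subtree branches if its top word has >= 2 children.
    branching = [w for w in top if len(w.findall("word")) >= 2]
    return len(top), len(all_words), len(branching)


counts = subtree_counts(ET.fromstring(NESTED))
print(counts)
```

Under this reading the example sentence has 3 subtrees and 6 words, and none of its top-level subtrees branches, because word 3 has only one direct child (the branching happens one level lower, at word 6).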

Data

Produced by script CountSubtrees.xq.

SUBTREES SENTENCES
33 1
30 1
21 2
19 2
18 3
17 4
16 11
15 10
14 17
13 21
12 39
11 49
10 106
9 151
8 278
7 497
6 969
5 1756
4 3193
3 5133
2 8567
1 20
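
As a consistency check, the figures in the table above can be summed and compared with the corpus totals. This is a small Python sketch with the rows copied by hand from the table; the name `DATA` and the variable names are illustrative only.

```python
# (subtrees per sentence, number of sentences), copied from the table above.
DATA = [
    (33, 1), (30, 1), (21, 2), (19, 2), (18, 3), (17, 4), (16, 11),
    (15, 10), (14, 17), (13, 21), (12, 39), (11, 49), (10, 106),
    (9, 151), (8, 278), (7, 497), (6, 969), (5, 1756), (4, 3193),
    (3, 5133), (2, 8567), (1, 20),
]

total = sum(n for _, n in DATA)                          # all sentences
two_to_five = sum(n for s, n in DATA if 2 <= s <= 5)     # the typical range
over_ten = sum(n for s, n in DATA if s > 10)             # the outliers
single = sum(n for s, n in DATA if s == 1)               # one subtree only

print(total, two_to_five, over_ten, single)
```

The rows add up to 20,830 sentences, which equals the sentence count of the database; 18,649 sentences have between 2 and 5 subtrees, 160 have more than 10, and 20 have exactly one.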