Feature documentation

Here you find a description of the transcriptions of several cuneiform corpora, the Text-Fabric model in general, and the node types and features of these corpora.

See also

Conversion from ATF to TF

Below is a description of document transcriptions in ATF (see below) and an account of how we transform them into Text-Fabric format by means of convert.py.

There are various pages with documentation on ATF at the ORACC site, notably:

One observation made during the conversion of this corpus from ATF to TF is that the patterns described in the docs are often hard to match with the patterns seen in the sources, and the match is not always perfect.

The Text-Fabric model views the text as a series of atomic units, called slots. In this corpus signs are the slots.

On top of that, more complex textual objects can be represented as nodes. In this corpus we have node types for: sign, word, cluster, line, face, document.

The type of every node is given by the feature otype. Every node is linked to a subset of slots by oslots.

Nodes can be annotated with features. See the table below.

Text-Fabric supports up to three customizable section levels. In this corpus we use document, face and line.
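To make the model concrete, here is a toy sketch in plain Python. It is not the Text-Fabric API: the dictionaries `otype` and `oslots` and the helpers `slots_of` and `nodes_containing` are hypothetical stand-ins for the features of the same name, and all node numbers are invented. It assumes the Text-Fabric convention that slot nodes are numbered from 1 upwards and all other nodes follow.

```python
# Toy model: three signs (slots), one word and one line. All values invented.
otype = {1: "sign", 2: "sign", 3: "sign", 4: "word", 5: "line"}

# Non-slot nodes are linked to the set of slots they cover.
oslots = {4: {1, 2}, 5: {1, 2, 3}}


def slots_of(node):
    """Return the sorted slots of a node; a slot covers just itself."""
    if otype[node] == "sign":
        return [node]
    return sorted(oslots[node])


def nodes_containing(slot, node_type):
    """Return the nodes of a given type whose slots include `slot`."""
    return [
        n
        for n, slots in sorted(oslots.items())
        if otype[n] == node_type and slot in slots
    ]


print(slots_of(4))                  # [1, 2]
print(nodes_containing(3, "line"))  # [5]
```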

Unicode Glyphs

In order to map readings and graphemes to cuneiform Unicode characters, we use the file GeneratedSignList.json generated by Auday Hussein.

I generated this JSON file from the original source http://home.zcu.cz/~ksaskova/Sign_List.html using a python script I wrote. The original HTML list is created manually by Dr. Kateřina Šašková from the University of West Bohemia, therefore all credit should go to her.

Auday Hussein in an email to Dirk Roorda

There remain a few dozen unmapped and ambiguously mapped readings. We deal with those cases in the mapReadings notebook.

Reference table of features

(Keep this under your pillow)

Node type sign

Basic unit containing a single reading and/or grapheme and zero or more flags.

There are several types of sign, stored in the feature type.

| type | examples | description |
| --- | --- | --- |
| reading | ma qa2 | normal sign with a reading (lowercase) |
| unknown | x n | representation of an unknown sign, the n stands for an unknown numeral |
| numeral | 5(disz) 5/6(disz) | a numeral, either with a repeat or with a fraction |
| ellipsis | ... | representation of an unknown number of missing signs |
| grapheme | ARAD2 GAN2 | sign given as a grapheme (uppercase) |
| wdiv | / | word divider |
| empty | | empty sign, usually due to an input or conversion error |
| complex | szu!(LI) isx(USZ) | complex sign with reading, operator and given grapheme |
| comment | ($ blank space $) | comment sign to represent an inline comment |
| commentline | $ rest broken | comment sign to represent a line comment; such a sign is the only sign of a comment line |


| feature | values | in ATF | description |
| --- | --- | --- | --- |
| after | - : . / + | ha:a-am a-di ma-di | what comes after a sign before the next sign |
| afterr | | | what comes after a sign before the next sign when represented with rich characters; in those cases where two signs are adjacent ␣ is inserted |
| afteru | : . / + | | what comes after a sign before the next sign when represented with unicode characters |
| atf | qa2 ARAD2 5/6(disz) ba?!(GESZ) | idem | full atf of a sign, also complex signs, with flags but without clustering characters |
| atfpost | }_ | {ki}_ | clustering characters attached at the end of a sign |
| atfpre | { | {ki}_ | clustering characters attached at the start of a sign |
| collated | 1 | _8(gesz2)* | indicates the presence of the collated flag * |
| comment | blank space rest broken | ($ blank space $) $ rest broken | value of a comment; the comment may come from an inline comment, then the sign type is comment; or from a line comment, then the sign type is commentline |
| damage | 1 | isz-ta#-a-lu | indicates the presence of the damage flag # |
| det | 1 | {d}suen asza5{a-sza3} | indicates whether the sign is a determinative gloss, marked by being within braces { } |
| excised | 1 | <<ma>> <<ip-pa-ar-ra-as>> | whether a sign is excised by the editor, marked by being within double angle brackets << >> |
| fraction | 5/6 | 5/6(disz) | the fraction part of a numeral |
| grapheme graphemer graphemeu | ARAD2 GAN2 LI USZ | ARAD2 GAN2 szu!(LI) isx(USZ) | the grapheme name of a sign when its atf is capitalized or when the grapheme is shown between brackets after an operator; the -r variant uses accented letters; the -u variant uses cuneiform unicode |
| langalt | 1 | _{d}suen_ | whether the sign is in the alternate language (in this corpus: Sumerian). See also the document feature lang. ATF marks alternate language by enclosing signs in _ ... _ |
| missing | 1 | [ki-im] | whether a sign is missing, marked by being within square brackets [ ] |
| operator operatorr operatoru | ! x | szu!(LI) isx(USZ) | the type of operator in a complex sign; the -r and -u versions represent them as = and |
| question | 1 | DU6~b? | indicates the presence of the question flag ? |
| reading readingr readingu | suen | idem | reading (lowercase) of a sign; the sign may be simple or complex; the -r variant uses accented letters; the -u variant uses cuneiform unicode |
| remarkable | 1 | lam! | indicates the presence of the remarkable flag ! |
| repeat | 5 | 5(disz) | marks repetition of a grapheme in a numeric sign |
| sym symr symu | | | essential parts of a sign, composed of reading, grapheme, repeat, fraction, operator, also defined for words; the -r variant uses accented letters; the -u variant uses cuneiform unicode |
| supplied | 1 | <pa> i-ba-<asz-szi> | whether a sign is supplied by the editor, marked by being within angle brackets < > |
| type | | | type of sign, see table above |
| uncertain | 1 | [x (x)] [li-(il)-li] | whether a sign is uncertain, marked by being within brackets ( ) |

Node type word

Sequence of signs separated by -. Sometimes the - is omitted. Very rarely there is another character between two signs, such as : or /. Words themselves are separated by spaces.

| feature | values | in ATF | description |
| --- | --- | --- | --- |
| after | | | what comes after a word before the next word, including word dividers (unlike this feature for signs) |
| atf | {disz}sze-ep-_{d}suen | idem | full atf of a word, including flags and clustering characters, but no word dividers |
| sym symr symu | | | essential parts of a word, composed of the sym, symr, symu values of its individual signs; the -r variant uses accented letters; the -u variant uses cuneiform unicode |

Node type cluster

Grouped sequence of signs. There are different types of these bracketings. Clusters may be nested. But clusters of different types need not be nested properly with respect to each other.

The type of a cluster is stored in the feature type.

| type | examples | description |
| --- | --- | --- |
| langalt | _ _ | alternate language |
| det | { } | gloss, determinative |
| uncertain | ( ) | uncertain |
| missing | [ ] | missing |
| supplied | < > | supplied by the editor in order to get a reading |
| excised | << >> | excised by the editor in order to get a reading |

Each cluster induces a sign feature with the same name as the type of the cluster, which gets value 1 precisely when the sign is in that cluster.

Node type line

Subdivision of a containing face. Corresponds to a transcription or comment line in the source data.

| feature | values | in ATF | description |
| --- | --- | --- | --- |
| col | 1 | @column 1 | number of the column in which the line occurs; without prime, see also primecol |
| ln | 1 | 1. [a-na] | ATF line number of a numbered transcription line; without prime, see also primeln; see also lnc |
| lnc | $a $b | $ rest broken | ATF line number of a comment line ($); the value $ plus a, b etc., every new column restarts this numbering; see also ln |
| lnno | | | combination of col, primecol, ln, primeln to identify a line |
| primecol | | 1' | whether the column number has a prime ' |
| primeln | | 1' | whether the line number has a prime ' |
| remarks | reading la-mi! proposed by Von Soden | # reading la-mi! proposed by Von Soden | the contents of a remark targeted to the contents of a transcription line; the remarks feature is present on the line that is being commented; multiple remark lines will be joined with a newline |
| srcLn | 1. [a-na x]-da-a-a | idem | see source data |
| srcLnNum | 29908 | not represented | see source data |
| trans | 1 | | indicates whether a line has a translation (in the form of a following meta line (#)) |
| translation@en | was given (lit. sealed) to me— | #tr.en: was given (lit. sealed) to me— | English translation in the form of a meta line (#) |

Node type face

One of the sides of an object belonging to a document. In most cases, the object is a tablet, but it can also be an envelope, or yet another kind of object.

| feature | values | in ATF | description |
| --- | --- | --- | --- |
| face | obverse reverse seal 1 envelope - seal 1 | @obverse @reverse @seal 1 | type of face, if on an object different from a tablet, the type of object is prepended |
| object | tablet envelope | @tablet @envelope | object on which a face is situated; seals are not objects but faces |
| srcLn | @obverse | idem | see source data |
| srcLnNum | 29907 | not represented | see source data |

Node type document

The main entity of which the corpus is composed, representing the transcription of all objects associated with it.

| feature | values | in ATF | description |
| --- | --- | --- | --- |
| collection | AbB | &P509373 = AbB 01, 059 | the collection of a document |
| docnote | Bu 1888-05-12, 200 | &P365091 = CT 02, pl. 10, Bu 1888-05-12, 200 | additional remarks in the document identification |
| docnumber | 059 | &P509373 = AbB 01, 059 | the identification of a document as number within a collection - volume |
| lang | akk sux | | the language the document is written in. akk = Akkadian, sux = Sumerian. See the sign feature langalt for the language of smaller portions of the document |
| pnumber | P509373 | &P509373 = AbB 01, 059 | the P-number identification of a document |
| srcfile | AbB-primary or AbB-secondary | not represented | see source data |
| srcLn | &P494060 = AbB 14, 226 | idem | see source data |
| srcLnNum | 29904 | not represented | see source data |
| volume | 01 | &P509373 = AbB 01, 059 | the volume of a document as number within a collection |

We also store a number of the metadata fields that precede the transliterations in the source files:

| feature | from metadata field | description |
| --- | --- | --- |
| author | Author(s) | author |
| pubdate | Publication date | publication date |
| museumname | Collection | museum name |
| museumcode | Museum no. | museum code |
| excavation | Excavation no. | excavation number |
| period | Period | period indication |
| material | Material | material indication |
| genre | Genre | genre |
| subgenre | Sub-genre | sub-genre |
| transcriber | ATF source | person who did the encoding into ATF |
| ARK | UCLA Library ARK | persistent identifier of type ARK |

Source data

All nodes that correspond directly to a line in the corpus also get features by which you can retrieve the original transcription.

For documents and faces the line refers to the source line where the encoding starts.

  • srcfile the name of the source file, it does not occur as such in the source data;
  • srcLn the literal contents of the line in the source;
  • srcLnNum the line number of the corresponding line in the source file, not the ATF line number, but n as in the n-th line in the file, it does not occur as such in the source data.

Slots

Slots are the textual positions. They can be occupied by individual signs or inline comments ($ ccc $). We have inserted empty slots on comment lines (starting with $) in order to anchor these lines at the right place in the text sequence and to store the comment itself in the feature comment.

We discuss the node types we are going to construct. A node type corresponds to a textual object. Some node types will be marked as a section level.

Sign

This is the basic unit of writing.

The node type sign is our slot type in the Text-Fabric representation of this corpus.

All signs have the features atf, atfpre, atfpost and after.

Together they are the building blocks by which the complete original ATF sequence for that sign can be reconstructed:

atfpre + atf + atfpost + after

atf contains the encoding of the sign itself, including possible flags.

atfpre and atfpost contain the bracketing characters before and after the sign.

after contains the linking characters with the next sign, usually a - or a .
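The reconstruction formula above can be sketched in a few lines of plain Python. The sign dictionaries below are hypothetical stand-ins for sign nodes, and their feature values for the cluster [ki-im] are an assumption about how the conversion distributes the brackets; the helper names `sign_atf` and `line_atf` are invented for this illustration.

```python
FEATURES = ("atfpre", "atf", "atfpost", "after")


def sign_atf(sign):
    """Rebuild the ATF of one sign as atfpre + atf + atfpost + after."""
    return "".join(sign.get(feature, "") for feature in FEATURES)


def line_atf(signs):
    """Concatenate the rebuilt ATF of a sequence of signs."""
    return "".join(sign_atf(sign) for sign in signs)


# Assumed feature values for the two signs of [ki-im]
signs = [
    {"atfpre": "[", "atf": "ki", "after": "-"},
    {"atf": "im", "atfpost": "]"},
]
print(line_atf(signs))  # [ki-im]
```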

For analytical purposes, there is a host of other features on signs, depending on the type of sign.

Simple signs

The defining trait of a sign is its reading and/or optionally its grapheme.

We will collect the name string of a sign, without variants and flags, and store it in the sign feature reading if it is lowercase, and grapheme if it is uppercase.

The type of such signs is reading or grapheme.

Simple signs may be augmented with flags (see below).

Unknown signs

The letters x and X, n and N in isolation stand for an unknown sign.

The type of such signs is unknown.

If the value is x or n, it will be stored in reading; if it is X or N, in grapheme.

The x and X stand for completely unknown signs, the n and N stand for unknown signs of which it is known that they are numerals.

N.B.: See under numerals below, where n plays a slightly different role.

Ellipsis

The value ... stands for an unknown number of missing signs.

The type of such signs is ellipsis.

The grapheme feature will be filled with ....
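The rules for simple signs, unknown signs and ellipses can be summarized in a small classifier. This is a sketch under stated assumptions, not the logic of convert.py: the function name `classify` is invented, the input is assumed to be a sign name with flags and variants already removed, and any name that is not entirely lowercase is treated as a grapheme.

```python
def classify(name):
    """Classify a bare sign name.

    Returns (type, feature, value), where feature is the sign feature
    ('reading' or 'grapheme') that receives the value.
    """
    if name == "...":
        return ("ellipsis", "grapheme", name)
    # Assumption: lowercase names are readings, everything else is a grapheme.
    feature = "reading" if name == name.lower() else "grapheme"
    if name in {"x", "n", "X", "N"}:
        return ("unknown", feature, name)
    return (feature, feature, name)


for name in ("ma", "ARAD2", "x", "N", "..."):
    print(name, classify(name))
```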

Numerals: repeats and fractions

Signs, especially those with a numeric meaning, may be repeated.

5(disz)

Numeric signs may also be preceded with a fraction:

5/6(disz)

We store the integral number before the brackets in the feature repeat, and the fraction in the feature fraction.

If the repeat is n, it means that the amount of repetition is uncertain or that a repetition is missing. We store it as repeat = -1, so repeats always have an integer value.

In a numeral, within the brackets you find the reading or grapheme, depending on whether it is lowercase or uppercase.

Numeral signs have type numeral.

After the closing bracket the numeral may be augmented with flags.
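As an illustration of the numeral rules, the sketch below splits a numeral into repeat, fraction and reading or grapheme. The regular expression and the name `parse_numeral` are assumptions made for this example. It expects flags to have been stripped already, and it leaves repeat undefined when a fraction is present, which is one possible reading of the description above.

```python
import re

# Either a fraction such as 5/6, or a repeat (digits or n), then (name).
NUMERAL = re.compile(
    r"^(?:(?P<fraction>\d+/\d+)|(?P<repeat>\d+|n))\((?P<name>[^()]+)\)$"
)


def parse_numeral(atf):
    """Return the numeral features of a flag-free sign, or None."""
    match = NUMERAL.match(atf)
    if match is None:
        return None
    repeat = match.group("repeat")
    if repeat is not None:
        # An uncertain or missing repetition (n) is stored as -1.
        repeat = -1 if repeat == "n" else int(repeat)
    name = match.group("name")
    key = "reading" if name == name.lower() else "grapheme"
    return {
        "type": "numeral",
        "repeat": repeat,
        "fraction": match.group("fraction"),
        key: name,
    }


print(parse_numeral("5(disz)"))
print(parse_numeral("5/6(disz)"))
print(parse_numeral("n(disz)"))
```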

Complex signs: operators

There are two constructs that have the same shape, but not the same meaning. Both lead to a complex sign.

Correction:

szu!(LI)

Operator (x):

isx(USZ)

In both cases we see a reading, followed by an operator (! or x), followed by a grapheme.

The type of such signs is complex.

The grapheme might be quite complex: an expression with or without surrounding | |, and with operators . inside. We have not broken down these graphemes in our conversion, they are stored as is in grapheme.
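A complex sign can be taken apart with one regular expression. The pattern and the name `parse_complex` are assumptions for this sketch. It does not handle flags that precede the operator (as in ba?!(GESZ)), and it keeps the grapheme as is, in line with the conversion described above.

```python
import re

# reading, then operator (! or x), then the grapheme between brackets
COMPLEX = re.compile(
    r"^(?P<reading>.+?)(?P<operator>[!x])\((?P<grapheme>.+)\)$"
)


def parse_complex(atf):
    """Return reading, operator and grapheme of a complex sign, or None."""
    match = COMPLEX.match(atf)
    if match is None:
        return None
    return {"type": "complex", **match.groupdict()}


print(parse_complex("szu!(LI)"))
print(parse_complex("isx(USZ)"))
```

A plain numeral such as 5(disz) has no operator before the opening bracket, so this pattern does not match it.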

Comment signs

Within a transcription line, you might encounter expressions of the form ($ ccc $).

These are inline comments, not to be confused with structural line comments ($ lines) or other line comments (# lines) which occupy a line of their own.

Such comments will be converted to single signs, of type comment, and the comment itself goes into the feature comment.

The comment, surrounded by the ($ $) goes into the feature atf.

Commentline signs

Commentline signs have been artificially added to comment lines ($ lines) in order to anchor them to the textual sequence.

The comment text of the line goes into the feature comment of the single commentline sign of that line. It also goes to features sym, symr, symu and atf.

Empty signs

Empty signs may have been generated as the result of faulty inputs. The conversion program detects these errors and issues messages about them. The current run of the conversion has not detected empty signs.

Flags

Signs may have flags. In transcription they show up as a special trailing character. Flags code for signs that are damaged, questionable (in their reading), remarkable, or collated.

collated *

Example:

  1. 8(gesz2)* sze gur i-ib-szu-u2

Here the numeral 8(gesz2) is collated.

remarkable !

Only if the ! is not followed by (GGG)

Example:

8. a-di isz!-ti i-na-an-na

Here the reading isz is remarkable.

question ?

Questionable identification.

Example:

6. sza a-na ti?-bi a-bi-ka be-li szu-um-szu

Here the reading ti is questionable.

damage #

Example:

10. _ma2_ a-na ra-ka-ab s,u2-ha-ar-tim#

Here the reading tim is damaged.
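The four flags can be detected by looking only at the trailing characters of a sign, which also takes care of the rule that a ! followed by a bracketed grapheme is an operator and not a flag. The following is a minimal sketch; the names `FLAGS` and `split_flags` are invented for this illustration.

```python
import re

FLAGS = {"#": "damage", "?": "question", "!": "remarkable", "*": "collated"}
TRAILING_FLAGS = re.compile(r"[#?!*]+$")


def split_flags(atf):
    """Separate trailing flag characters from a sign.

    Returns the sign without flags and a dict of flag features set to 1.
    """
    match = TRAILING_FLAGS.search(atf)
    if match is None:
        return atf, {}
    features = {FLAGS[char]: 1 for char in match.group()}
    return atf[: match.start()], features


print(split_flags("tim#"))       # ('tim', {'damage': 1})
print(split_flags("8(gesz2)*"))  # ('8(gesz2)', {'collated': 1})
print(split_flags("szu!(LI)"))   # ('szu!(LI)', {})
```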

Choice

Sometimes there are choices in the transliteration, e.g.

-ni/am/ka/szum#

We store all choices in separate signs. We mark the fact that they are choices by storing the / character in the feature after.

There are never more than 4 choices in this corpus.
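A choice can be sketched as a split on the / character, where every alternative but the last gets / in its after feature. The helper `split_choices` is hypothetical. It takes the choice without its leading hyphen, and it assumes the input is not a fraction numeral such as 5/6(disz), which also contains a /.

```python
def split_choices(atf):
    """Split a choice such as ni/am/ka/szum# into separate signs."""
    parts = atf.split("/")
    signs = []
    for index, part in enumerate(parts):
        # All alternatives except the last are followed by the / character.
        after = "/" if index < len(parts) - 1 else ""
        signs.append({"atf": part, "after": after})
    return signs


signs = split_choices("ni/am/ka/szum#")
print([sign["atf"] for sign in signs])  # ['ni', 'am', 'ka', 'szum#']
```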

The other nodes

Cluster

One or more signs may be bracketed by _ _ or by { } or by ( ) or by [ ] or by < > or by << >>: together they form a cluster.

Each pair of boundary signs marks a cluster of a certain type. This type is stored in the feature type.

Clusters are not nested in clusters of the same type.

Clusters of one type in general do not respect the boundaries of clusters of other types.

Clusters do not cross line boundaries.

Clusters may contain just one sign.

In Text-Fabric, a cluster node is linked to the signs it contains. So, if c is a cluster, you can get its signs by

L.d(c, otype='sign')

Moreover, every type of cluster corresponds to a numerical feature on signs with the same name as that type. It has value 1 for those signs that are inside a cluster of that type and no value otherwise.
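The relation between clusters and sign features can be illustrated without Text-Fabric. In the sketch below, clusters are given as hypothetical (type, first, last) spans over the signs of a line, with 0-based inclusive positions; the function name `cluster_features` is invented. The example models [li-(il)-li], where an uncertain cluster sits inside a missing cluster.

```python
def cluster_features(n_signs, clusters):
    """Derive per-sign cluster features from (type, first, last) spans.

    A sign gets value 1 for each cluster type it is in and no entry otherwise.
    """
    features = [{} for _ in range(n_signs)]
    for cluster_type, first, last in clusters:
        for position in range(first, last + 1):
            features[position][cluster_type] = 1
    return features


# [li-(il)-li]: three signs, all missing, the middle one also uncertain
features = cluster_features(3, [("missing", 0, 2), ("uncertain", 1, 1)])
print(features)
```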

langalt _ _

Marks a switch to the alternate language. In this corpus, the documents are mainly in Akkadian (akk). The alternate language is Sumerian (sux).

det { }

Marks glosses of the determinative kind.

uncertain ( )

Marks uncertain readings.

missing [ ]

Marks missing signs.

excised << >>

Marks signs that have been excised by the editor in order to arrive at a reading.

supplied < >

Marks signs that have been supplied by the editor in order to arrive at a reading.

Word

Words are sequences of signs joined by - or occasionally : or /. Words themselves are separated by spaces.

Their main feature is atf, which contains the original ATF source, including cluster characters that are glued to the word or occur inside it. The reference table for words above also lists after, sym, symr and symu.

Line

This node type is section level 3.

A node of type line corresponds to a numbered line with transcribed material or to a line with a structural comment (which starts with $).

Lines that start with a # are comments to the previous line or metadata to the document. Their contents are turned into document and line features, but they do not give rise to line nodes.

Lines get a column number from preceding @column i lines (if any), and this gets stored in col.

There is no node type corresponding to columns.

The ATF number at the start of the line goes into ln, without the trailing period.

If primes ' are present on column numbers and line numbers, they will not get stored on col and ln, but instead the features primecol and primeln will receive a 1.
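The treatment of line numbers and primes might look like the sketch below. The regular expression and the name `parse_line_number` are assumptions, as is the choice to store ln as an integer; the second input line is an invented example of a primed line number.

```python
import re

# ATF line number, optional prime, a period, then the transcribed material
LINE = re.compile(r"^(?P<ln>\d+)(?P<prime>')?\.\s*(?P<text>.*)$")


def parse_line_number(src_line):
    """Return (features, text) for a numbered line, or None for other lines."""
    match = LINE.match(src_line)
    if match is None:
        return None
    features = {"ln": int(match.group("ln"))}
    if match.group("prime"):
        # The prime is not kept in ln; primeln records it instead.
        features["primeln"] = 1
    return features, match.group("text")


print(parse_line_number("1. [a-na x]-da-a-a"))  # ({'ln': 1}, '[a-na x]-da-a-a')
print(parse_line_number("3'. a-na"))            # ({'ln': 3, 'primeln': 1}, 'a-na')
```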

The number of the line in the source file is stored in srcLnNum; the unmodified contents of the line, including the ATF line number, go into srcLn.

If the line is a structural comment ($), the contents of the line goes into the comment feature of its sole sign, a sign of type commentline.

If a line has a comment in the form of one or more following lines that start with # , then these lines will be joined with newlines and collectively go into remarks.

If a line has a translation, say in English, marked by a following line starting with #tr.en:, then the contents of the translation will be added to translation@en.

If a line has any translation at all, in whatever language, the feature trans becomes 1.
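The handling of the # lines that follow a transcription line can be sketched as follows. The function name `attach_meta` is hypothetical and the translation text in the example is invented; only the #tr.en: prefix and the remark line are taken from this documentation.

```python
def attach_meta(meta_lines):
    """Turn the # lines following a transcription line into line features."""
    features = {}
    remarks = []
    for line in meta_lines:
        if line.startswith("#tr."):
            # '#tr.en: text' becomes translation@en = 'text'
            language, _, text = line[len("#tr."):].partition(":")
            features[f"translation@{language}"] = text.strip()
            features["trans"] = 1
        elif line.startswith("#"):
            remarks.append(line[1:].strip())
    if remarks:
        # Multiple remark lines are joined with newlines.
        features["remarks"] = "\n".join(remarks)
    return features


print(attach_meta([
    "#tr.en: to my lord",  # invented translation text
    "# reading la-mi! proposed by Von Soden",
]))
```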

Face

This node type is section level 2.

Lines are grouped into faces.

Faces are marked by lines like

@obverse

or

@reverse

or

@seal 1

There are a few other possibilities, such as:

@left edge
@upper edge

A node of type face corresponds to the material after a face specifier and before the next face specifier or the end of an object or document.

Note that objects, such as tablets, envelopes and eyestones, are also marked by @ lines. Whenever the object is not a tablet, the type of object will be prepended to the name of the face:

The obverse of an envelope is

@envelope - obverse

whereas

@obverse

is the obverse of a tablet.

Seals are faces, not objects.

The resulting face type is stored in the feature face.

The object on which a face resides goes to the feature object.
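The naming of faces can be sketched as a single pass over the @ lines of a document. The names `OBJECTS` and `faces` are invented for this example; the sketch only knows the object lines @tablet and @envelope, assumes that faces before any object line belong to a tablet, and expects its input to contain object and face lines only (no @column lines).

```python
OBJECTS = {"tablet", "envelope"}  # assumption: other object types are left out


def faces(at_lines):
    """Return the face and object features for a sequence of @ lines."""
    current_object = "tablet"
    result = []
    for line in at_lines:
        name = line.lstrip("@").strip()
        if name in OBJECTS:
            current_object = name
            continue
        # The object type is prepended unless the object is a tablet.
        face = name if current_object == "tablet" else f"{current_object} - {name}"
        result.append({"face": face, "object": current_object})
    return result


for entry in faces(["@tablet", "@obverse", "@envelope", "@obverse", "@seal 1"]):
    print(entry)
```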

Faces also have features srcLn and srcLnNum, like lines. In this case, they refer to the line where a face starts.

Document

This node type is section level 1.

Faces are grouped into documents.

Documents are started by lines like

&P510635 = AbB 12, 112

Here we collect

  • P510635 as the pnumber of the document,
  • AbB as the collection,
  • 12 as the volume,
  • 112 as the docnumber

If this line has irregular content, we put the irregular material into docnote:

  • &P497776 = Fs Landsberger 235
    • collection, volume, docnumber undefined;
    • docnote = Fs Landsberger 235
  • &P479394 = CT 33, pl. 26, BM 097405
    • collection = CT
    • volume = 33
    • docnumber = 26
    • docnote = BM 097405
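The examples above can be reproduced with two regular expressions. This is only a sketch: the patterns and the name `parse_header` are assumptions made to fit the header lines quoted in this documentation, and convert.py may well decide differently on other irregular headers. Values are kept as strings so that leading zeros survive.

```python
import re

HEADER = re.compile(r"^&(?P<pnumber>P\d+)\s*=\s*(?P<rest>.*)$")
# collection volume, [pl.] docnumber[, docnote]
REGULAR = re.compile(
    r"^(?P<collection>[A-Za-z]+)\s+(?P<volume>\d+),\s*(?:pl\.\s*)?"
    r"(?P<docnumber>\d+)(?:,\s*(?P<docnote>.+))?$"
)


def parse_header(line):
    """Return the document features found in an &P... header line, or None."""
    header = HEADER.match(line)
    if header is None:
        return None
    features = {"pnumber": header.group("pnumber")}
    rest = header.group("rest").strip()
    regular = REGULAR.match(rest)
    if regular is None:
        # Irregular content goes into docnote as a whole.
        features["docnote"] = rest
    else:
        features.update(
            {key: value for key, value in regular.groupdict().items() if value is not None}
        )
    return features


print(parse_header("&P510635 = AbB 12, 112"))
print(parse_header("&P497776 = Fs Landsberger 235"))
print(parse_header("&P479394 = CT 33, pl. 26, BM 097405"))
```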

We also add the name of the source file as a feature srcfile, with possible values:

  • AbB-primary for documents whose primary publication is AbB;
  • AbB-secondary for documents that have AbB among their secondary publications.

This corpus is just a set of documents. The position of a particular document in the whole set is not meaningful. The main identification of documents is by their pnumber, not by any sequence number within the corpus.

Text formats

The following text formats are defined (you can also list them with T.formats).

| format | kind | description |
| --- | --- | --- |
| text-orig-full | plain | the full atf, including flags and cluster characters |
| text-orig-plain | plain | the essential bits: readings, graphemes, repeats, fractions, operators, no clusters, flags, inline comments |
| text-orig-rich | plain | as text-orig-plain but with accented characters |
| text-orig-unicode | plain | as text-orig-plain but with cuneiform unicode characters, hyphens are suppressed |
| layout-orig-rich | layout | as text-orig-rich but the flag and cluster information is visible in layout |
| layout-orig-unicode | layout | as text-orig-unicode but the flag and cluster information is visible in layout |

The formats with text result in strings that are plain text, without additional formatting.

The formats with layout result in pieces of HTML with CSS styles; the richness of layout enables us to encode more information than in the plain representation, e.g. blurry characters when signs are damaged or uncertain.

See also the showcases: