metadata added; better unicode mapping
dirkroorda committed Mar 15, 2019
1 parent f38c4f6 commit 160e4a3
Showing 73 changed files with 2,816,042 additions and 107 deletions.
884 changes: 884 additions & 0 deletions characters/mapping.tsv

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/about.md
@@ -57,7 +57,7 @@ page, we have chosen `Download all text`.

The downloaded files contain metadata and transliterations.

Currently, we are using the transliterations only.
We use a dozen or so of the metadata fields, but the focus is on the transliterations.

We have a [specification](transcription.md) of the transcription format and
how we model the text in Text-Fabric.
17 changes: 17 additions & 0 deletions docs/transcription.md
@@ -214,6 +214,23 @@ feature | values | in ATF | description
**srcLnNum** | 29904 | not represented | see [source data](#source-data)
**volume** | `01` | `&P509373 = AbB 01, 059` | the volume of a [*document*](#document) as number within a collection

We also store several of the metadata fields that precede the transliterations in the source
files:

feature | from metadata field | description
------- | ------ | ------
author | Author(s) | author
pubdate | Publication date | publication date
museumname | Collection | museum name
museumcode | Museum no. | museum code
excavation | Excavation no. | excavation number
period | Period | period indication
material | Material | material indication
genre | Genre | genre
subgenre | Sub-genre | sub-genre
transcriber | ATF source | person who did the encoding into ATF
ARK | UCLA Library ARK | persistent identifier of type ARK
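
To illustrate the table above, here is a minimal sketch of how metadata field labels could be translated into feature names. It is not taken from the conversion program: the function name `extractMeta`, the assumed `Field: value` line format, and the sample values are all hypothetical.

```python
# Mapping from metadata field label (as listed in the table above) to feature name.
META_FIELDS = {
    "Author(s)": "author",
    "Publication date": "pubdate",
    "Collection": "museumname",
    "Museum no.": "museumcode",
    "Excavation no.": "excavation",
    "Period": "period",
    "Material": "material",
    "Genre": "genre",
    "Sub-genre": "subgenre",
    "ATF source": "transcriber",
    "UCLA Library ARK": "ARK",
}


def extractMeta(lines):
    """Collect features from metadata lines of the ASSUMED form 'Field: value'."""
    features = {}
    for line in lines:
        # split on the first colon only, so values may contain colons themselves
        (field, sep, value) = line.partition(":")
        feature = META_FIELDS.get(field.strip())
        value = value.strip()
        # skip unknown fields, lines without a colon, and empty values
        if feature is None or not sep or not value:
            continue
        features[feature] = value
    return features


# made-up sample lines, for illustration only
sample = [
    "Author(s): Hypothetical, A.",
    "Period: Old Babylonian",
    "Unknown field: ignored",
]
result = extractMeta(sample)
print(result)
```

Fields that are not in the table are ignored, which matches the statement that only a selection of the metadata fields is stored.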

Source data
===========

101 changes: 71 additions & 30 deletions programs/mapReadings.ipynb
@@ -8,9 +8,11 @@
"\n",
"## Task\n",
"\n",
"We want to map the *readings* in the Old Babylonian Corpus to unicode strings in the Cuneiform block,\n",
"We want to map *readings* and *graphemes* in cuneiform corpora to cuneiform unicode characters,\n",
"based on extant mapping tables.\n",
"\n",
"We generate a plain mapping that can be used readily by programs that convert from ATF to TF or something else.\n",
"\n",
"## Problem\n",
"\n",
"There are multiple mapping tables, there are several ways to transliterate readings.\n",
@@ -67,7 +69,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 95,
"metadata": {},
"outputs": [],
"source": [
@@ -82,7 +84,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 96,
"metadata": {},
"outputs": [
{
@@ -490,7 +492,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 129,
"metadata": {},
"outputs": [],
"source": [
@@ -505,7 +507,7 @@
"SIGN_FILE = 'GeneratedSignList.json'\n",
"SIGN_PATH = f'{WRITING_DIR}/{SIGN_FILE}'\n",
"\n",
"MAPPING_FILE = 'mapping'"
"MAPPING_FILE = f'{os.path.abspath(\"..\")}/characters/mapping.tsv'"
]
},
{
@@ -519,7 +521,7 @@
},
{
"cell_type": "code",
"execution_count": 75,
"execution_count": 110,
"metadata": {},
"outputs": [
{
@@ -568,7 +570,7 @@
},
{
"cell_type": "code",
"execution_count": 76,
"execution_count": 111,
"metadata": {},
"outputs": [],
"source": [
@@ -590,7 +592,7 @@
},
{
"cell_type": "code",
"execution_count": 77,
"execution_count": 112,
"metadata": {},
"outputs": [
{
@@ -608,7 +610,7 @@
" 'Ḫ': 'H,'}"
]
},
"execution_count": 77,
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
}
@@ -619,7 +621,7 @@
},
{
"cell_type": "code",
"execution_count": 78,
"execution_count": 113,
"metadata": {},
"outputs": [],
"source": [
@@ -640,7 +642,7 @@
},
{
"cell_type": "code",
"execution_count": 88,
"execution_count": 114,
"metadata": {},
"outputs": [],
"source": [
@@ -674,7 +676,7 @@
},
{
"cell_type": "code",
"execution_count": 89,
"execution_count": 115,
"metadata": {},
"outputs": [
{
@@ -718,20 +720,25 @@
},
{
"cell_type": "code",
"execution_count": 90,
"execution_count": 149,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"155 unmapped tokens\n",
"𒀭\n",
"154 unmapped tokens\n",
" 50 ambiguously mapped tokens\n",
"766 uniquely mapped tokens\n"
"767 uniquely mapped tokens\n"
]
}
],
"source": [
"MAPPING_FIXES = {\n",
" 'd': 'dingir',\n",
"}\n",
"\n",
"unmapped = set()\n",
"unique = {}\n",
"multiple = {}\n",
@@ -740,7 +747,8 @@
" if type(t) is tuple:\n",
" unmapped.add(t)\n",
" continue\n",
" tU = t.upper()\n",
" tLookup = MAPPING_FIXES.get(t, t)\n",
" tU = tLookup.upper()\n",
" if tU not in mapping:\n",
" unmapped.add(t)\n",
" continue\n",
@@ -764,14 +772,14 @@
},
{
"cell_type": "code",
"execution_count": 91,
"execution_count": 143,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"155 unmapped tokens\n"
"154 unmapped tokens\n"
]
},
{
@@ -817,7 +825,6 @@
" (6, 'bur3'),\n",
" (8, 'bur3'),\n",
" (9, 'bur3'),\n",
" 'd',\n",
" 'dah',\n",
" (1, 'disz'),\n",
" ('1/2', 'disz'),\n",
@@ -934,7 +941,7 @@
" '|UD.KIB|']"
]
},
"execution_count": 91,
"execution_count": 143,
"metadata": {},
"output_type": "execute_result"
}
@@ -957,7 +964,7 @@
},
{
"cell_type": "code",
"execution_count": 92,
"execution_count": 144,
"metadata": {},
"outputs": [],
"source": [
@@ -970,7 +977,7 @@
},
{
"cell_type": "code",
"execution_count": 93,
"execution_count": 145,
"metadata": {},
"outputs": [],
"source": [
@@ -987,14 +994,14 @@
},
{
"cell_type": "code",
"execution_count": 94,
"execution_count": 146,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fixed 67 out of 155\n",
"fixed 67 out of 154\n",
"FIXED\n",
"\tasal2 => 𒀷\n",
"\t(1, 'asz') => 𒀸\n",
@@ -1083,7 +1090,6 @@
"\t(6, 'bur3') => ?\n",
"\t(8, 'bur3') => ?\n",
"\t(9, 'bur3') => ?\n",
"\td => ?\n",
"\tdah => ?\n",
"\t('1/2', 'disz') => ?\n",
"\t(13, 'disz') => ?\n",
@@ -1227,7 +1233,7 @@
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": 147,
"metadata": {},
"outputs": [
{
@@ -1305,14 +1311,14 @@
},
{
"cell_type": "code",
"execution_count": 25,
"execution_count": 148,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"768 uniquely mapped readings\n",
"767 uniquely mapped readings\n",
" A => 𒀀\n",
" AB => 𒀊\n",
" AD => 𒀜\n",
@@ -1515,6 +1521,7 @@
" bur => 𒁓\n",
" bur3 => 𒌋\n",
" buranun => 𒌓𒄒𒉣\n",
" d => 𒀭\n",
" da => 𒁕\n",
" dab => 𒁳\n",
" dab5 => 𒆪\n",
@@ -1623,7 +1630,6 @@
" geme => 𒊩\n",
" geme2 => 𒊩𒆳\n",
" gesz => 𒄑\n",
" gesz2 => 𒁹\n",
" gesztin => 𒃾\n",
" gesztu2 => 𒄑𒌆𒉿\n",
" gi => 𒄀\n",
@@ -1942,7 +1948,6 @@
" szam => 𒌑\n",
" szam3 => 𒉓\n",
" szar => 𒊬\n",
" szar2 => 𒊹\n",
" szara2 => 𒇋\n",
" sze => 𒊺\n",
" sze3 => 𒂠\n",
@@ -2090,6 +2095,42 @@
" print(f'{r:>10} => {unique[r]}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Write the mapping file"
]
},
{
"cell_type": "code",
"execution_count": 150,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"884 entries written to /Users/dirk/github/Nino-cunei/oldbabylonian/characters/mapping.tsv\n"
]
}
],
"source": [
"pairs = {}\n",
"for (k, vs) in multiple.items():\n",
" pairs[k] = sorted(vs)[0]\n",
"for (t, v) in mapAddition.items():\n",
" k = f'{t[0]}({t[1]})' if type(t) is tuple else t\n",
" pairs[k] = v\n",
"for (k, v) in unique.items():\n",
" pairs[k] = v\n",
"\n",
"with open(MAPPING_FILE, 'w') as mf:\n",
" for (k,v) in sorted(pairs.items()):\n",
" mf.write(f'{k}\\t{v}\\n')\n",
"print(f'{len(pairs)} entries written to {MAPPING_FILE}')"
]
},
{
"cell_type": "code",
"execution_count": null,
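
The last notebook cell in this commit merges `multiple`, `mapAddition` and `unique` into one table and writes it as tab-separated lines. The sketch below restates that merge as a function and adds a hypothetical reader, `readMapping`, to show how a downstream program might load `mapping.tsv`. The function names, the toy key `xx`, and the temporary file path are assumptions for illustration; the other sample pairs are taken from the notebook output shown above.

```python
import os
import tempfile


def buildPairs(multiple, mapAddition, unique):
    """Merge the three sources; later sources override earlier ones."""
    pairs = {}
    # ambiguous readings: pick the first candidate in sorted order
    for (k, vs) in multiple.items():
        pairs[k] = sorted(vs)[0]
    # manual additions: a tuple key (n, reading) is written as 'n(reading)'
    for (t, v) in mapAddition.items():
        k = f"{t[0]}({t[1]})" if isinstance(t, tuple) else t
        pairs[k] = v
    # uniquely mapped readings take precedence over everything else
    pairs.update(unique)
    return pairs


def writeMapping(path, pairs):
    """Write one 'key<TAB>value' line per entry, sorted by key."""
    with open(path, "w", encoding="utf8") as mf:
        for (k, v) in sorted(pairs.items()):
            mf.write(f"{k}\t{v}\n")


def readMapping(path):
    """Hypothetical reader: load the TSV back into a dict."""
    mapping = {}
    with open(path, encoding="utf8") as mf:
        for line in mf:
            (k, v) = line.rstrip("\n").split("\t")
            mapping[k] = v
    return mapping


# toy input: 'xx' is a made-up ambiguous reading
multiple = {"xx": {"𒀊", "𒀀"}}
mapAddition = {"asal2": "𒀷", (1, "asz"): "𒀸"}
unique = {"d": "𒀭", "da": "𒁕"}

pairs = buildPairs(multiple, mapAddition, unique)

with tempfile.TemporaryDirectory() as tmpDir:
    path = os.path.join(tmpDir, "mapping.tsv")
    writeMapping(path, pairs)
    restored = readMapping(path)

print(f"{len(restored)} entries read back")
```

The explicit `encoding="utf8"` is a choice made in this sketch so that the cuneiform characters are written the same way regardless of platform defaults; the notebook cell itself opens the file without an encoding argument.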