Publish a WLC TEI edition that includes macula identifiers #122

Merged
merged 28 commits into from
May 1, 2024
206f8b1
Begin backporting of TEI pipeline from macula-greek
jacobwegner Mar 21, 2024
5f48567
Fetch XML from Tanach.us
jacobwegner Mar 21, 2024
bdcd7af
Add stub for Ruth to help implement parser
jacobwegner Mar 21, 2024
a8f04b2
Process Tanach XML for Ruth 1:1
jacobwegner Mar 21, 2024
9e82cc6
Expand to the entirety of Ruth
jacobwegner Mar 21, 2024
ec2bffe
Map additional v children
jacobwegner Mar 21, 2024
340b746
Restrict mapping to w and k
jacobwegner Mar 21, 2024
c214035
Stitch in content directly from nodes data
jacobwegner Mar 21, 2024
a3d7d2f
Preserve whitespace between w elements
jacobwegner Mar 21, 2024
0408259
Add stylesheet
jacobwegner Mar 21, 2024
d6bd868
Style milestones
jacobwegner Mar 21, 2024
5c4cb84
Refactor for 1:1 between node words and w elements
jacobwegner Mar 21, 2024
ece3f69
Add m elements but preserve whitespace
jacobwegner Mar 21, 2024
b4fcf1b
Expand TEI to all other books
jacobwegner Mar 26, 2024
5a49df4
Render samekh in Amos
jacobwegner Apr 1, 2024
3faa6a0
Render pe in Amos
jacobwegner Apr 1, 2024
32c19b7
Re-generate TEI XML for pe and samekh
jacobwegner Apr 1, 2024
9fda2a4
Tweak samekh and pe styles
jacobwegner Apr 1, 2024
89466cd
Remove debugging styles
jacobwegner Apr 1, 2024
3c41b95
BI: Update Amos TEI to more closely resemble Tanach.us
jacobwegner Apr 1, 2024
5ea1434
Add whitespace around paseq
jacobwegner Apr 1, 2024
51a6c55
Expand paseq fix to other books
jacobwegner Apr 1, 2024
c100811
Backport fix for o190010010022 from https://github.com/Clear-Bible/m…
jacobwegner Apr 1, 2024
9071ed4
Merge branch 'main' into feat/tei
jacobwegner Apr 2, 2024
b7e5807
Update TEI pipeline for PSA 1 fix
jacobwegner Apr 2, 2024
2530a7b
Merge branch 'main' into feat/tei
jacobwegner Apr 2, 2024
d6ebfc4
Merge branch 'main' into feat/tei
jacobwegner Apr 11, 2024
1cc0633
Remove TODOs / add documentation
jacobwegner Apr 15, 2024
1,638 changes: 1,638 additions & 0 deletions WLC/tei/01-genesis.xml

Large diffs are not rendered by default.

1,325 changes: 1,325 additions & 0 deletions WLC/tei/02-exodus.xml

Large diffs are not rendered by default.

918 changes: 918 additions & 0 deletions WLC/tei/03-leviticus.xml

Large diffs are not rendered by default.

1,366 changes: 1,366 additions & 0 deletions WLC/tei/04-numbers.xml

Large diffs are not rendered by default.

1,052 changes: 1,052 additions & 0 deletions WLC/tei/05-deuteronomy.xml

Large diffs are not rendered by default.

711 changes: 711 additions & 0 deletions WLC/tei/06-joshua.xml

Large diffs are not rendered by default.

665 changes: 665 additions & 0 deletions WLC/tei/07-judges.xml

Large diffs are not rendered by default.

98 changes: 98 additions & 0 deletions WLC/tei/08-ruth.xml

Large diffs are not rendered by default.

878 changes: 878 additions & 0 deletions WLC/tei/09-1samuel.xml

Large diffs are not rendered by default.

748 changes: 748 additions & 0 deletions WLC/tei/10-2samuel.xml

Large diffs are not rendered by default.

866 changes: 866 additions & 0 deletions WLC/tei/11-1kings.xml

Large diffs are not rendered by default.

774 changes: 774 additions & 0 deletions WLC/tei/12-2kings.xml

Large diffs are not rendered by default.

1,006 changes: 1,006 additions & 0 deletions WLC/tei/13-1chronicles.xml

Large diffs are not rendered by default.

899 changes: 899 additions & 0 deletions WLC/tei/14-2chronicles.xml

Large diffs are not rendered by default.

305 changes: 305 additions & 0 deletions WLC/tei/15-ezra.xml

Large diffs are not rendered by default.

436 changes: 436 additions & 0 deletions WLC/tei/16-nehemiah.xml

Large diffs are not rendered by default.

192 changes: 192 additions & 0 deletions WLC/tei/17-esther.xml

Large diffs are not rendered by default.

1,159 changes: 1,159 additions & 0 deletions WLC/tei/18-job.xml

Large diffs are not rendered by default.

2,832 changes: 2,832 additions & 0 deletions WLC/tei/19-psalms.xml

Large diffs are not rendered by default.

982 changes: 982 additions & 0 deletions WLC/tei/20-proverbs.xml

Large diffs are not rendered by default.

251 changes: 251 additions & 0 deletions WLC/tei/21-ecclesiastes.xml

Large diffs are not rendered by default.

138 changes: 138 additions & 0 deletions WLC/tei/22-songofsongs.xml

Large diffs are not rendered by default.

1,428 changes: 1,428 additions & 0 deletions WLC/tei/23-isaiah.xml

Large diffs are not rendered by default.

1,473 changes: 1,473 additions & 0 deletions WLC/tei/24-jeremiah.xml

Large diffs are not rendered by default.

169 changes: 169 additions & 0 deletions WLC/tei/25-lamentations.xml

Large diffs are not rendered by default.

1,374 changes: 1,374 additions & 0 deletions WLC/tei/26-ezekiel.xml

Large diffs are not rendered by default.

386 changes: 386 additions & 0 deletions WLC/tei/27-daniel.xml

Large diffs are not rendered by default.

230 changes: 230 additions & 0 deletions WLC/tei/28-hosea.xml

Large diffs are not rendered by default.

86 changes: 86 additions & 0 deletions WLC/tei/29-joel.xml

Large diffs are not rendered by default.

169 changes: 169 additions & 0 deletions WLC/tei/30-amos.xml

Large diffs are not rendered by default.

28 changes: 28 additions & 0 deletions WLC/tei/31-obadiah.xml

Large diffs are not rendered by default.

61 changes: 61 additions & 0 deletions WLC/tei/32-jonah.xml

Large diffs are not rendered by default.

124 changes: 124 additions & 0 deletions WLC/tei/33-micah.xml

Large diffs are not rendered by default.

58 changes: 58 additions & 0 deletions WLC/tei/34-nahum.xml

Large diffs are not rendered by default.

67 changes: 67 additions & 0 deletions WLC/tei/35-habakkuk.xml

Large diffs are not rendered by default.

64 changes: 64 additions & 0 deletions WLC/tei/36-zephaniah.xml

Large diffs are not rendered by default.

47 changes: 47 additions & 0 deletions WLC/tei/37-haggai.xml

Large diffs are not rendered by default.

244 changes: 244 additions & 0 deletions WLC/tei/38-zechariah.xml

Large diffs are not rendered by default.

66 changes: 66 additions & 0 deletions WLC/tei/39-malachi.xml

Large diffs are not rendered by default.

73 changes: 73 additions & 0 deletions WLC/tei/wlc-tei.css
@@ -0,0 +1,73 @@
div[type="book"] {
  direction: rtl;
  font-family: 'SBLBibLit', 'Times New Roman', serif;
  max-width: 660px;
  margin-inline-start: 1.0em;
}

chapter {
  margin-block-start: 2.0em;
  display: block;
}

samekh, pe {
  display: inline;
  margin-inline-start: 0.5em;
}
pe::after {
  content: ' ';
  display: block;
}
samekh::after {
  content: '\00a0\00a0\00a0\00a0\00a0\00a0';
}
title {
  font-size: 2.5rem;
  text-align: center;
  display: block;
}
verse {
  display: inline;
  font-size: 1.5rem;
  line-height: 1.7;
}

milestone::before {
  content: attr(n);
  vertical-align: baseline;
  position: relative;
  top: -0.6em;
  font-size: 0.6em;
  opacity: 50%;
  margin-inline-start: 0.2em;
  margin-inline-end: 0.4em;
}

chapter::before {
  content: attr(n) ':1';
  opacity: 50%;
  margin-inline-start: 0.2em;
  margin-inline-end: 0.2em;
  font-size: 1.2em;
  vertical-align: baseline;
  position: relative;
  top: -0.6em;
}

milestone[n="1"]::before {
  display: none;
}

/* NOTE: Uncomment to debug a particular verse or chapter */
/* chapter {
  display: none;
}
chapter[ref="AMO 3"] {
  display: block;
} */
/* verse {
  display: none;
} */
/* verse[ref="AMO 3:15"] {
  display: block;
} */
16 changes: 16 additions & 0 deletions pipelines/tei-transform/README.md
@@ -0,0 +1,16 @@
# tei-transform pipeline

This pipeline uses Python and lxml to transform the Tanach.us XML into TEI resembling the macula-greek TEI, stitching in word text and identifiers from the macula-hebrew TSV.
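
As a rough illustration of the target shape (not the pipeline's actual code), the sketch below builds one `verse` containing a `milestone`, a `w`, and two `m` children. It uses the standard library's `xml.etree.ElementTree` in place of lxml so it runs without extra dependencies. The `build_verse` helper, the refs, the identifiers, and the unpointed token text are illustrative placeholders, not values read from the macula-hebrew TSV.

```python
import xml.etree.ElementTree as ET

# Clark-notation name for the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical tokens for a single word; real rows come from the macula-hebrew TSV.
tokens = [
    {"xml:id": "o010010010011", "ref": "GEN 1:1!1", "text": "ב"},
    {"xml:id": "o010010010012", "ref": "GEN 1:1!1", "text": "ראשית"},
]


def build_verse(verse_ref, n, word_tokens):
    """Build <verse><milestone/><w><m/>...</w></verse> for one word."""
    verse = ET.Element("verse", {"ref": verse_ref})
    ET.SubElement(verse, "milestone", {"unit": "verse", "ref": verse_ref, "n": n})
    word = ET.SubElement(verse, "w", {"ref": word_tokens[0]["ref"]})
    for token in word_tokens:
        # One <m> per morpheme-level token, carrying the macula identifier.
        m = ET.SubElement(word, "m", {XML_ID: token["xml:id"], "ref": token["ref"]})
        m.text = token["text"]
    return verse


verse = build_verse("GEN 1:1", "1", tokens)
print(ET.tostring(verse, encoding="unicode"))
```

Keeping the identifier on `m` rather than `w` mirrors the note in `main.py` that `w` elements are kept without an id and `m` elements are processed separately downstream.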

## Requirements
- Python 3.10 or higher

## Usage
```
cd pipelines/tei-transform
poetry install
poetry shell
python main.py
```

Pass the `--fetch` flag (e.g. `python main.py --fetch`) to re-download the book-level XML from Tanach.us before processing.
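
`main.py` detects the flag by comparing `sys.argv[1]` to `--fetch` directly. If more options are ever needed, the same behavior could be expressed with `argparse`; the sketch below is one possible shape, and the `parse_args` helper and help text are assumptions rather than part of the pipeline.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; `--fetch` defaults to False when omitted."""
    parser = argparse.ArgumentParser(
        description="Transform Tanach.us XML into TEI (illustrative sketch)."
    )
    parser.add_argument(
        "--fetch",
        action="store_true",
        help="re-download the book-level XML from Tanach.us before processing",
    )
    return parser.parse_args(argv)


args = parse_args(["--fetch"])
print(args.fetch)
```

Unlike the positional check on `sys.argv[1]`, this accepts the flag in any position and rejects unknown options with a usage message.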
207 changes: 207 additions & 0 deletions pipelines/tei-transform/main.py
@@ -0,0 +1,207 @@
import csv
import sys
import concurrent.futures
import multiprocessing
import os
from pathlib import Path

from requests import Session
from lxml import etree
from biblelib import book
from biblelib.word import fromusfm


try:
    REPO_ROOT = Path(__file__).parent.parent.parent
except NameError:
    # __file__ is undefined in an interactive session; fall back to the working directory.
    REPO_ROOT = Path(os.getcwd()).parent.parent

MACULA_NODES_TSV = REPO_ROOT / "WLC/tsv/macula-hebrew.tsv"

XML_NS = "{http://www.w3.org/XML/1998/namespace}"

PIPELINE_ROOT = REPO_ROOT / "pipelines" / "tei-transform"
# Leave one CPU free, but never drop below one worker (ProcessPoolExecutor rejects 0).
MAX_WORKERS = int(os.environ.get("MAX_WORKERS", max(1, multiprocessing.cpu_count() - 1)))
TANACH_BOOK_URL_ROOT = "https://tanach.us/Books/"
MACULA_ID_PREFIX = "o"
ELIGIBLE_V_ELEMS = {"w", "q", "samekh", "pe"}

SAMEKH = "ס"
PE = "פ"
PASEQ = "׀"

BOOK_DATA = book.Books()

XML_PATH = REPO_ROOT / "sources/tanach.us/xml"
TEI_PATH = REPO_ROOT / "WLC/tei"


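def get_source_paths():
    """Yield the Tanach.us book-level XML files already present in the repo."""
    for path in XML_PATH.glob("*.xml"):
        yield path


def get_macula_word_id(bcv, pos):
    return f"{MACULA_ID_PREFIX}{bcv}{str(pos).zfill(3)}"


def build_tokens_by_bcv_lookup():
    """Group macula-hebrew TSV rows by the verse-level prefix of their xml:id."""
    bcv_lookup = {}
    with MACULA_NODES_TSV.open(encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            macula_id = row["xml:id"]
            # First nine characters: the "o" prefix plus the 8-character verse reference.
            bcv = macula_id[0:9]
            bcv_lookup.setdefault(bcv, []).append(row)
    return bcv_lookup


def regroup_tokens_by_bcvw(tokens):
    """Group a verse's tokens by word, keyed on (ref, word-level xml:id prefix)."""
    bcvw_lookup = {}
    for token in tokens:
        key = (token["ref"], token["xml:id"][0:12])
        bcvw_lookup.setdefault(key, []).append(token)
    return bcvw_lookup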


def do_transform(source, tokens_lookup):
    print(f"transforming {source.name}")
    parsed = etree.parse(source)
    book_name = parsed.xpath("//book/names/name")[0].text
    heb_book_name = parsed.xpath("//book/names/hebrewname")[0].text
    book_data = next(iter(filter(lambda x: x.name == book_name, BOOK_DATA.values())), None)
    assert book_data, f"no book data found for {book_name}"
    usfm_ref = book_data.usfmname
    dest_name = f'{book_data.usfmnumber}-{book_data.name.lower().replace(" ", "")}.xml'
    dest = TEI_PATH / dest_name
    book_xml = etree.Element("div", attrib={"type": "book", "ref": usfm_ref, "canonical": "true"})
    tree = etree.ElementTree(book_xml)
    # NOTE: This is largely intended for local debugging
    stylesheet_pi = etree.ProcessingInstruction(
        "xml-stylesheet", 'type="text/css" href="wlc-tei.css"'
    )
    book_xml.addprevious(stylesheet_pi)
    title = etree.Element("title", attrib={"type": "main"})
    title.text = heb_book_name
    book_xml.append(title)
    for c_elem in parsed.xpath("//c"):
        chapter_ref = f'{usfm_ref} {c_elem.attrib["n"]}'
        chapter = etree.Element("chapter", attrib={"ref": chapter_ref, "n": c_elem.attrib["n"]})
        for v_elem in c_elem.xpath("./v"):
            verse_ref = f'{chapter_ref}:{v_elem.attrib["n"]}'
            verse = etree.Element("verse", attrib={"ref": verse_ref})
            verse.append(
                etree.Element(
                    "milestone", attrib={"unit": "verse", "ref": verse_ref, "n": v_elem.attrib["n"]}
                )
            )
            bcv = fromusfm(verse_ref).ID
            key = f"{MACULA_ID_PREFIX}{bcv}"
            tokens = tokens_lookup[key]
            regrouped_tokens = regroup_tokens_by_bcvw(tokens)
            samekh = None
            pe = None
            for [word_ref, _], tokens in regrouped_tokens.items():
                # NOTE: Want to think about consistency across our TEI representations.
                # Keeping a `w` element, but removing the id attribute.
                # We will then process the `m` elements separately in the Symphony Frontend.
                word = etree.Element("w", attrib={"ref": word_ref})
                word.text = ""
                # Reset per word so a word with no text-bearing tokens cannot
                # reuse a stale value from the previous word.
                add_whitespace = False
                for token in tokens:
                    if not token["text"]:
                        continue
                    m_elem = etree.Element(
                        "m", attrib={f"{XML_NS}id": token["xml:id"], "ref": word_ref}
                    )
                    m_elem.text = token["text"]
                    if token["after"]:
                        if token["after"] == PASEQ:
                            m_elem.text += f" {PASEQ} "
                        else:
                            m_elem.text += token["after"]

                    word.append(m_elem)

                    # Move trailing whitespace out of the `m` text and onto the `w` tail.
                    add_whitespace = m_elem.text[-1] == " "
                    m_elem.text = m_elem.text.strip()
                    if token["after"].endswith(SAMEKH):
                        assert v_elem.find("./samekh") is not None
                        assert m_elem.text.endswith(SAMEKH)
                        samekh = etree.Element("samekh")
                        samekh.text = SAMEKH
                        m_elem.text = m_elem.text[0:-1]
                    if token["after"].endswith(PE):
                        assert v_elem.find("./pe") is not None
                        assert m_elem.text.endswith(PE)
                        pe = etree.Element("pe")
                        pe.text = PE
                        m_elem.text = m_elem.text[0:-1]

                if add_whitespace:
                    word.tail = " "
                verse.append(word)
                if samekh is not None:
                    verse.append(samekh)
                    samekh = None
                if pe is not None:
                    verse.append(pe)
                    pe = None

            chapter.append(verse)
        book_xml.append(chapter)

    with dest.open("wb") as f:
        f.write(
            etree.tostring(
                tree,
                pretty_print=True,
                xml_declaration=True,
                encoding="UTF-8",
            )
        )


def serial_transform():
    tokens_by_bcv_lookup = build_tokens_by_bcv_lookup()
    for source_path in get_source_paths():
        do_transform(source_path, tokens_by_bcv_lookup)


def parallel_transform():
    exceptions = []
    tokens_by_bcv_lookup = build_tokens_by_bcv_lookup()
    with concurrent.futures.ProcessPoolExecutor(max_workers=MAX_WORKERS) as executor:
        deferred_tasks = {}
        for source_path in get_source_paths():
            deferred = executor.submit(do_transform, source_path, tokens_by_bcv_lookup)
            deferred_tasks[deferred] = source_path

        # Let every book finish before surfacing the first failure.
        for f in concurrent.futures.as_completed(deferred_tasks):
            try:
                f.result()
            except Exception as exc:
                exceptions.append(exc)

    if exceptions:
        raise exceptions[0]


def fetch_xml():
    s = Session()
    for source_path in get_source_paths():
        book_url = f"{TANACH_BOOK_URL_ROOT}{source_path.name}"
        resp = s.get(book_url)
        # Fail loudly rather than overwrite a source file with an error page.
        resp.raise_for_status()
        with source_path.open("w", encoding="utf-8") as f:
            f.write(resp.content.decode("utf-8"))


def main():
    TEI_PATH.mkdir(parents=True, exist_ok=True)
    fetch = len(sys.argv) > 1 and sys.argv[1] == "--fetch"
    if fetch:
        print("Fetching XML from Tanach.us")
        fetch_xml()

    parallel_transform()


if __name__ == "__main__":
    main()
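
The grouping helpers in `main.py` rely on fixed-width slices of each macula identifier: `[0:9]` for the verse and `[0:12]` for the word. The sketch below spells out that layout using the identifier named in the commit list, `o190010010022`. The `MaculaId` tuple, the function names, and the exact split into two-digit book, three-digit chapter, and three-digit verse are assumptions inferred from those slices and from `get_macula_word_id`, not something the pipeline defines.

```python
from typing import NamedTuple


class MaculaId(NamedTuple):
    prefix: str   # "o" in this pipeline (MACULA_ID_PREFIX)
    book: str     # assumed 2 digits
    chapter: str  # assumed 3 digits
    verse: str    # assumed 3 digits
    word: str     # 3-digit word position, zero-padded
    part: str     # whatever follows the word position


def split_macula_id(macula_id: str) -> MaculaId:
    """Split an identifier into fixed-width fields (assumed layout)."""
    return MaculaId(
        prefix=macula_id[0],
        book=macula_id[1:3],
        chapter=macula_id[3:6],
        verse=macula_id[6:9],
        word=macula_id[9:12],
        part=macula_id[12:],
    )


def bcv_key(macula_id: str) -> str:
    """Verse-level key, matching the [0:9] slice used to group TSV rows."""
    return macula_id[0:9]


def word_key(macula_id: str) -> str:
    """Word-level key, matching the [0:12] slice used to regroup tokens."""
    return macula_id[0:12]


example = "o190010010022"
print(split_macula_id(example))
print(bcv_key(example), word_key(example))
```

Because the slices are positional, every token in a verse shares a `bcv_key` and every token in a word shares a `word_key`, which is what lets the transform emit one `w` per word with one `m` per token.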