Testing grammar via python module (#182)

* update toc rule to follow the pattern of other number markers * start working on to_dict_new to revise the JSON output structure * implement attribute, list, table and milestone handling in node_to_dict_new * fix id_query and ensure bookCode is added * include titles and poetry in JSON output * reduce inner nesting in JSON with a decorator fuction * handle footnotes and cross refs * implement filtering * remove old to_dict code * re-write the to_list function as per the new JSON * update syntax trees in test as per the change in toc rule * add linting for python module on gitactions * fix error in github action script * change the use of filter Enum in CLI * remove unused import * Setup pytest and start testing with committee tests suite * fix grammar: customAttribute rule * fix grammar: not bind toc and toca within hblock * fix grammar: make space or line after verseNumber optional * fix grammar: not treat \b as a paragraph, but a chapter and poetry content * grammar test update: change syntax trees in test as per change in \b rule * fix grammar: have separate rules for xt and xt_standalone * fix grammar: re-write zNameSpace rules * fix grammar: all lemma attribute value to be empty * fix grammar: allow default attribute value to be optionally quoted * fix grammar: permit multiple attributes in jmp and change the rule for userdefine attrib * fix grammar: change intoduction rule, to allow only imt maker to be present * fix grammar: define \+xt to be used inside footnote * Python module: accommodate empty values in attributes * include USFM/X committee's test suite * automated tests in python module with committee's test cases * fix usx conversion: handle nested character markers in same was as regular * fix usx conversion: include comments also in the list of para_style_markers * fix usx conversion: handle ca cp va vp markers * fix usx conversion: handle pi and ph paragraph blocks * fix usx conversion: handle empty attribute values * fix usx conversion: handle empty book with no space or line after bookcode * fix usx conversion: add style v to verse and correct eid in chapter * fix usx conversion: bring last verse end node of a chapter inside the previous paragraph * fix usx conversion: break down node_2usx() into smaller functions * fix usx conversion: nest only character markers, notes and text inside parastyle markers not others * use the usx.rnc schema to validate the generated usx * fix linting issues * run python tests on gitactions * run python tests on gitactions attempt #2 * run python tests on gitactions attempt #3 * run python tests on gitactions attempt #4 * run python tests on gitactions attempt #5 * run python tests on gitactions attempt #6 * run python tests on gitactions attempt #7 * run python tests on gitactions attempt #8 * run python tests on gitactions attempt #9 * run python tests on gitactions attempt #10
Bridgeconn · Oct 21, 2022 · f3c576c · f3c576c
1 parent 28e613a
commit f3c576c
Show file tree

Hide file tree

Showing 771 changed files with 212,394 additions and 297 deletions.
diff --git a/.github/workflows/check-on-push.yml b/.github/workflows/check-on-push.yml
@@ -11,9 +11,9 @@ on:
 jobs:
   # Set the job key. The key is displayed as the job name
   # when a job name is not provided
-  Run-linter-and-tests:
+  Run-Grammar-tests:
     # Name the Job
-    name: Lint n test
+    name: Run Grammar tests
     # Set the type of machine to run on
     runs-on: ubuntu-latest
 
@@ -32,4 +32,44 @@ jobs:
           ./node_modules/.bin/tree-sitter generate
           ./node_modules/.bin/tree-sitter test
 
-        
+  Run-Python-tests:
+    name: Run Python tests
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v2
+      - uses: actions/setup-python@v2
+        with:
+            python-version: '3.10.6'
+
+      - name: Setup node and npm
+        uses: actions/setup-node@v2
+        with:
+          node-version: 14
+
+      - name: Create Virtual Environment
+        run: python -m venv ENV-dev
+
+      - name: Use VENV
+        run: source ENV-dev/bin/activate
+
+      - name: Install dependencies
+        run: pip install -r ./python-usfm-parser/dev-requirements.txt
+
+      - name: Build grammar binary
+        run: |
+          cd tree-sitter-usfm3
+          npm install .
+          ./node_modules/.bin/tree-sitter generate
+          cd ..
+          python python-usfm-parser/src/grammar_rebuild.py ./tree-sitter-usfm3/ python-usfm-parser/src/usfm_grammar/my-languages.so
+
+      - name: Install python module
+        run: |
+          cd python-usfm-parser
+          pip install .
+
+      - name: Run tests
+        working-directory: ./python-usfm-parser
+        run:
+          pytest tests/test_parsing_errors.py
diff --git a/python-usfm-parser/dev-requirements.txt b/python-usfm-parser/dev-requirements.txt
@@ -3,3 +3,4 @@ jupyterlab==3.4.4
 rnc2rng==2.6.6
 lxml==4.9.1
 pylint==2.15.3
+pytest==7.1.3
diff --git a/python-usfm-parser/src/grammar_rebuild.py b/python-usfm-parser/src/grammar_rebuild.py
@@ -3,12 +3,13 @@
 import sys
 from tree_sitter import Language
 
-if len(sys.argv) > 1 :
-  GRAMMAR_PATH = sys.argv[1]
-  OUTPUT_PATH = sys.argv[2]
+if len(sys.argv) == 3 :
+    GRAMMAR_PATH = sys.argv[1]
+    OUTPUT_PATH = sys.argv[2]
 else:
-  GRAMMAR_PATH = '../../tree-sitter-usfm3'
-  OUTPUT_PATH = 'usfm_grammar/my-languages.so'
+    raise Exception('''Usage: python python-usfm-parser/src/grammar_rebuild.py \
+./tree-sitter-usfm3/ python-usfm-parser/src/usfm_grammar/my-languages.so
+from the project root directory''')
 
 Language.build_library(
   # Store the library in the `ext` directory

diff --git a/python-usfm-parser/src/usfm_grammar/usfm_parser.py b/python-usfm-parser/src/usfm_grammar/usfm_parser.py
diff --git a/python-usfm-parser/tests/__init__.py b/python-usfm-parser/tests/__init__.py
@@ -0,0 +1,149 @@
+'''The common methods and objects needed in all tests. To be run before all tests'''
+from glob import glob
+from lxml import etree
+from src.usfm_grammar import USFMParser
+
+TEST_DIR = "../tests"
+
+def initialise_parser(input_usfm_path):
+    '''Open and parse the given file'''
+    with open(input_usfm_path, 'r', encoding='utf-8') as usfm_file:
+        usfm_string = usfm_file.read()
+    test_parser = USFMParser(usfm_string)
+    return test_parser
+
+def is_valid_usfm(input_usfm_path):
+    '''Checks the metadata.xml to see is the USFM is a valid one'''
+    meta_file_path = input_usfm_path.replace("origin.usfm", "metadata.xml")
+    with open(meta_file_path, 'r', encoding='utf-8') as meta_file:
+        meta_xml_string = meta_file.read()
+        if meta_xml_string.startswith("<?xml "):
+            # need to remove the first line containing xml declaration 
+            # because it doesn't have version, which is mandatory
+            meta_xml_string = meta_xml_string.split("\n", 1)[-1] 
+    root = etree.fromstring(meta_xml_string)
+    node = root.find("validated")
+    if node.text == "fail":
+        return False
+    return True
+
+all_usfm_files = glob(f"{TEST_DIR}/*/*/origin.usfm")
+
+exclude_files = [
+    f'{TEST_DIR}/mandatory/v/origin.usfm',
+        # Is V really a must? Can't we have empty chapter stubs?
+    f'{TEST_DIR}/biblica/BlankLinesWithFigures/origin.usfm',
+        # the occurs under doesn't have c or b, in the sty file
+        # https://github.com/ubsicap/usfm/blob/6be0cd1fcedfeac19f354c19791d9f1d66721c5e/sty/usfm.sty#L2975
+        # the desciption on the metadata.xml doesn;t sound veru confident either
+    f'{TEST_DIR}/specExamples/titles/origin.usfm',
+        # \mte# is shown as occuring under c, as per sty. This file has it before c
+        # Also, after a heading(\s etc) shouldn't there be a paragraph marker? Its missing too.
+    f'{TEST_DIR}/specExamples/cross-ref/origin.usfm',
+    f'{TEST_DIR}/special-cases/empty-para/origin.usfm',
+    f'{TEST_DIR}/special-cases/empty-c/origin.usfm',
+    f'{TEST_DIR}/special-cases/sp/origin.usfm',
+    f'{TEST_DIR}/paratextTests/WordlistMarkerMissingFromGlossaryCitationForms/origin.usfm',
+    f'{TEST_DIR}/paratextTests/NestingInCrossReferences/origin.usfm',
+    f'{TEST_DIR}/usfmjsTests/missing_verses/origin.usfm',
+        # excluding temporarily, bacause of \\p expecting a spaceOrline afterwards
+        # Spec says "the space is needed only when text follows the marker...
+        # ... Most paragraph or poetic markers (like \p, \m, \q# etc.)...
+        # ...can be followed immediately by a verse number (\v) on a new line."
+        # DOESN'T THAT MEAN A LINE IS NEEDED AND "\p\v 1 .." usage is not correct?
+    f'{TEST_DIR}/paratextTests/UnmatchedSidebarStart/origin.usfm',
+    f'{TEST_DIR}/paratextTests/CharStyleNotClosed/origin.usfm',
+    f'{TEST_DIR}/paratextTests/CharStyleCrossesVerseNumber/origin.usfm',
+    f'{TEST_DIR}/paratextTests/NestingInFootnote/origin.usfm',
+    f'{TEST_DIR}/paratextTests/FigureNotClosed/origin.usfm',
+    f'{TEST_DIR}/paratextTests/FootnoteNotClosed/origin.usfm',
+    f'{TEST_DIR}/paratextTests/EmptyMarkers/origin.usfm',
+        # temporarily excluding
+        # case of MISSING values not reported as ERROR. 
+        # Problem with tree-sitter, or the way we use it
+    f'{TEST_DIR}/specExamples/character/origin.usfm',
+    f'{TEST_DIR}/usfmjsTests/isa_verse_span/origin.usfm',
+    f'{TEST_DIR}/usfmjsTests/isa_footnote/origin.usfm',
+    f'{TEST_DIR}/usfmjsTests/tit_extra_space_after_chapter/origin.usfm',
+    f'{TEST_DIR}/usfmjsTests/1ch_verse_span/origin.usfm',
+    f'{TEST_DIR}/usfmjsTests/usfmBodyTestD/origin.usfm',
+    f'{TEST_DIR}/usfmjsTests/esb/origin.usfm',
+    f'{TEST_DIR}/usfmjsTests/acts_1_milestone.oldformat/origin.usfm',
+    f'{TEST_DIR}/usfmjsTests/nb/origin.usfm',
+    f'{TEST_DIR}/usfmjsTests/usfmIntroTest/origin.usfm',
+    f'{TEST_DIR}/usfmjsTests/usfm-body-testF/origin.usfm',
+    f'{TEST_DIR}/usfmjsTests/out_of_sequence_verses/origin.usfm',
+    f'{TEST_DIR}/usfmjsTests/acts_1_milestone/origin.usfm',
+    f'{TEST_DIR}/usfmjsTests/luk_quotes/origin.usfm',
+    f'{TEST_DIR}/samples-from-wild/doo43-1/origin.usfm',
+    f'{TEST_DIR}/samples-from-wild/doo43-2/origin.usfm',
+        # excluding becasue no \p (or other paragraph markers)
+        # after \s, table, esbe etc
+        # in most of the above usfmjs cases its \s5 that misses \p after it...
+    f'{TEST_DIR}/special-cases/empty-attributes5/origin.usfm',
+        # just parking for later as this is a low risk corner case
+        # the space in \w ...|<space>\w* get parsed as "default-argument" and test passes
+    f'{TEST_DIR}/paratextTests/WordlistMarkerTextEndsInSpaceWithoutGlossary/origin.usfm',
+    f'{TEST_DIR}/paratextTests/WordlistMarkerTextContainsNonWordformingPunctuation/origin.usfm',
+    f'{TEST_DIR}/paratextTests/GlossaryCitationFormContainsNonWordformingPunctuation/origin.usfm',
+    f'{TEST_DIR}/paratextTests/WordlistMarkerTextEndsInSpaceWithGlossary/origin.usfm',
+    f'{TEST_DIR}/paratextTests/WordlistMarkerTextEndsInPunctuation/origin.usfm',
+    f'{TEST_DIR}/paratextTests/GlossaryCitationFormEndsInSpace/origin.usfm',
+    f'{TEST_DIR}/paratextTests/WordlistMarkerKeywordEndsInSpace/origin.usfm',
+    f'{TEST_DIR}/paratextTests/WordlistMarkerKeywordEndsInPunctuation/origin.usfm',
+    f'{TEST_DIR}/paratextTests/GlossaryCitationFormEndsInPunctuation/origin.usfm',
+    f'{TEST_DIR}/paratextTests/WordlistMarkerTextEndsInSpaceAndMissingFromGlossary/origin.usfm',
+    f'{TEST_DIR}/paratextTests/WordlistMarkerKeywordContainsNonWordformingPunctuation/origin.usfm',
+    f'{TEST_DIR}/paratextTests/CharStyleClosedAndReopened/origin.usfm',
+        # I think it is good to cover these usages also, unless they are wrong USFM! Are they?
+        # these issues look like paratext specific ways of handling spaces and punctuations
+    f'{TEST_DIR}/paratextTests/CustomAttributesAreValid/origin.usfm',
+    f'{TEST_DIR}/paratextTests/ValidMilestones/origin.usfm',
+    f'{TEST_DIR}/paratextTests/LinkAttributesAreValid/origin.usfm',
+        # Correct syntaxes "x-name", "qt-s", "link-href", 
+        # but used are "xname", "qts", "linkhref"
+        # Looks like a bug while writing the text to file
+    f'{TEST_DIR}/paratextTests/EmptyFigure/origin.usfm',
+        # Older usage of multiple pipes, of USFM 2.x.
+    f'{TEST_DIR}/paratextTests/MissingColumnInTable/origin.usfm',
+        # Do we need to check column numbers in tables. What if the UI want merged cells?
+    f'{TEST_DIR}/paratextTests/GlossaryCitationFormContainingWordMedialPunctuation_Pass/'
+        'origin.usfm',
+        # uses \ in text before quote('). Probably a bug while writing the text to file
+    f'{TEST_DIR}/paratextTests/NoErrorsPartiallyEmptyBook/origin.usfm',
+    f'{TEST_DIR}/paratextTests/NoErrorsEmptyBook/origin.usfm',
+        # as per USFM spec makers ide, rem, h etc cannot be empty
+    f'{TEST_DIR}/usfmjsTests/acts-1-20.aligned.crammed.oldformat/origin.usfm',
+        # \q' without space in between and \zaln-s not closed in two palces each
+    f'{TEST_DIR}/usfmjsTests/45-ACT.ugnt.oldformat/origin.usfm',
+        # toc used without space and text. \k used as \k-s which doesn't seem to be right!
+    f'{TEST_DIR}/usfmjsTests/gn_headers/origin.usfm',
+        # as per sty file, \mte# occurs under c. Here given after \mt#. Is that correct usage?
+    f'{TEST_DIR}/usfmjsTests/45-ACT.ugnt/origin.usfm',
+    f'{TEST_DIR}/usfmjsTests/acts_8-37-ugnt-footnote/origin.usfm',
+        # \w used inside footnote without nesting(\+w). Also toc used without space or text
+    f'{TEST_DIR}/usfmjsTests/57-TIT.greek.oldformat/origin.usfm',
+    f'{TEST_DIR}/usfmjsTests/57-TIT.greek/origin.usfm',
+    f'{TEST_DIR}/samples-from-wild/UGNT2/origin.usfm',
+    f'{TEST_DIR}/samples-from-wild/UGNT1/origin.usfm',
+        # toc1 used without text or space
+    f'{TEST_DIR}/usfmjsTests/inline_God/origin.usfm',
+        # nested marker not closed. Is closing not mandatory?
+    f'{TEST_DIR}/samples-from-wild/doo43-4/origin.usfm',
+        # () usage in \ior  is shown as \ior (....) \ior* in the spec
+
+        ########### Temporarily for testing USX conversion ##############
+    f'{TEST_DIR}/specExamples/milestone/origin.usfm',
+    ]
+
+for file in exclude_files:
+    if file in all_usfm_files:
+        all_usfm_files.remove(file)
+
+
+exclude_USX_files = [
+    f'{TEST_DIR}/specExamples/chapter-verse/origin.usx',
+        # ca is added as attribute to cl not chapter node
+    f'{TEST_DIR}/specExamples/milestone/origin.usx',
+        # Znamespace not represented properly. Even no docs of it on https://ubsicap.github.io/usx
+]
diff --git a/python-usfm-parser/tests/test_json_conversion.py b/python-usfm-parser/tests/test_json_conversion.py
@@ -0,0 +1,16 @@
+'''Test the to_dict or json conversion API'''
+import pytest
+
+from tests import all_usfm_files, initialise_parser, is_valid_usfm
+
+
+@pytest.mark.parametrize( 'file_path', all_usfm_files)
+@pytest.mark.timeout(300)
+def test_dict_converions_without_filter(file_path):
+    '''Tests if input parses without errors'''
+    test_parser = initialise_parser(file_path)
+    if is_valid_usfm(file_path):
+        assert not test_parser.errors, test_parser.errors
+        usfm_dict = test_parser.to_dict()
+        assert isinstance(usfm_dict, dict)
+
diff --git a/python-usfm-parser/tests/test_parsing_errors.py b/python-usfm-parser/tests/test_parsing_errors.py
@@ -0,0 +1,15 @@
+'''To test parsing success/errors for USFM/X committee's test suite'''
+import pytest
+
+from tests import all_usfm_files, initialise_parser, is_valid_usfm
+
+
+@pytest.mark.parametrize( 'file_path', all_usfm_files)
+def test_error_less_parsing(file_path):
+    '''Tests if input parses without errors'''
+    test_parser = initialise_parser(file_path)
+    if is_valid_usfm(file_path):
+        assert not test_parser.errors, test_parser.errors
+    else:
+        assert test_parser.errors, "file has errors, but passed\n"+test_parser.to_syntax_tree()
+
diff --git a/python-usfm-parser/tests/test_usx_conversion.py b/python-usfm-parser/tests/test_usx_conversion.py
@@ -0,0 +1,43 @@
+'''Test the to_usx conversion API'''
+from doctest import Example
+from io import StringIO 
+
+import pytest
+from lxml import etree
+from lxml.doctestcompare import LXMLOutputChecker, PARSE_XML
+
+from tests import all_usfm_files, initialise_parser, is_valid_usfm, exclude_USX_files
+
+lxml_object = etree.Element('Root')
+checker = LXMLOutputChecker()
+
+with open("../schemas/usx.rnc", encoding='utf-8') as f:
+    usxrnc_doc  = f.read()
+relaxng = etree.RelaxNG.from_rnc_string(usxrnc_doc)
+
+@pytest.mark.parametrize( 'file_path', all_usfm_files)
+@pytest.mark.timeout(100)
+def test_usx_converions_without_filter(file_path):
+    '''Tests if input parses & converts to usx successfully and validates the usx against schema'''
+    test_parser = initialise_parser(file_path)
+    if is_valid_usfm(file_path):
+        assert not test_parser.errors, test_parser.errors
+        usx_xml = test_parser.to_usx()
+        assert isinstance(usx_xml, type(lxml_object)), test_parser.to_syntax_tree()
+
+        assert relaxng.validate(usx_xml), relaxng.error_log.last_error
+
+        # usx_file_path = file_path.replace("origin.usfm", "origin.xml")
+        # if usx_file_path not in exclude_USX_files:
+        #     origin_xml = etree.parse(usx_file_path)
+        #     if relaxng.validate(origin_xml):
+                # message = checker.output_difference(
+                #                 Example("", etree.tostring(origin_xml).decode('utf-8')), 
+                #                 etree.tostring(usx_xml), PARSE_XML)
+                # assert checker.check_output(etree.tostring(origin_xml), 
+                #                             etree.tostring(usx_xml), PARSE_XML), message
+
+
+
+
+
diff --git a/tests/advanced/custom-attributes/metadata.xml b/tests/advanced/custom-attributes/metadata.xml
@@ -0,0 +1,6 @@
+<?xml encoding="utf-8"?>
+<test-metadata>
+        <description>Link-attributes and custom attributes. Advanced marker usages.</description>
+        <validated>pass</validated>
+        <tags></tags>
+</test-metadata>
diff --git a/tests/advanced/custom-attributes/origin.usfm b/tests/advanced/custom-attributes/origin.usfm
@@ -0,0 +1,10 @@
+\id GEN
+\c 1
+\p
+\v 1 the first verse
+\v 2 the second verse \w gracious|x-myattr="metadata" \w*
+\q1 “Someone is shouting in the desert,
+\q2 ‘Prepare a road for the Lord;
+\q2 make a straight path for him to travel!’ ”
+\s \jmp |link-id="article-john_the_baptist" \jmp*John the Baptist
+\p John is sometimes called...
diff --git a/tests/advanced/custom-attributes/origin.xml b/tests/advanced/custom-attributes/origin.xml
@@ -0,0 +1,13 @@
+<usx version="3.0">
+  <book code="GEN" style="id" />
+  <chapter number="1" style="c" sid="GEN 1" />
+  <para style="p">
+    <verse number="1" style="v" sid="GEN 1:1" />the first verse <verse eid="GEN 1:1" /><verse number="2" style="v" sid="GEN 1:2" />the second verse <char style="w" x-myattr="metadata">gracious</char></para>
+  <para style="q1" vid="GEN 1:2">“Someone is shouting in the desert,</para>
+  <para style="q2" vid="GEN 1:2">‘Prepare a road for the Lord;</para>
+  <para style="q2" vid="GEN 1:2">make a straight path for him to travel!’ ”</para>
+  <para style="s" vid="GEN 1:2">
+    <char style="jmp" link-id="article-john_the_baptist" />John the Baptist</para>
+  <para style="p" vid="GEN 1:2">John is sometimes called...<verse eid="GEN 1:2" /></para>
+  <chapter eid="GEN 1" />
+</usx>
diff --git a/tests/advanced/default-attributes/metadata.xml b/tests/advanced/default-attributes/metadata.xml
@@ -0,0 +1,6 @@
+<?xml encoding="utf-8"?>
+<test-metadata>
+        <description>Markers with default attributes. Advanced marker usages.</description>
+        <validated>pass</validated>
+        <tags></tags>
+</test-metadata>
diff --git a/tests/advanced/default-attributes/origin.usfm b/tests/advanced/default-attributes/origin.usfm
@@ -0,0 +1,5 @@
+\id GEN
+\c 1
+\p
+\v 1 the first verse
+\v 2 the second verse \w gracious|grace\w*
diff --git a/tests/advanced/default-attributes/origin.xml b/tests/advanced/default-attributes/origin.xml
@@ -0,0 +1,7 @@
+<usx version="3.0">
+  <book code="GEN" style="id" />
+  <chapter number="1" style="c" sid="GEN 1" />
+  <para style="p">
+    <verse number="1" style="v" sid="GEN 1:1" />the first verse <verse eid="GEN 1:1" /><verse number="2" style="v" sid="GEN 1:2" />the second verse <char style="w" lemma="grace">gracious</char><verse eid="GEN 1:2" /></para>
+  <chapter eid="GEN 1" />
+</usx>
diff --git a/tests/advanced/header/metadata.xml b/tests/advanced/header/metadata.xml
@@ -0,0 +1,6 @@
+<?xml encoding="utf-8"?>
+<test-metadata>
+        <description>Header section with more markers. Advanced marker usages.</description>
+        <validated>pass</validated>
+        <tags></tags>
+</test-metadata>
diff --git a/tests/advanced/header/origin.usfm b/tests/advanced/header/origin.usfm
@@ -0,0 +1,18 @@
+\id MRK 41MRKGNT92.SFM, Good News Translation, June 2003
+\h John
+\toc1 The Gospel according to John
+\toc2 John
+\mt2 The Gospel
+\mt3 according to
+\mt1 JOHN
+\ip The two endings to the Gospel, which are enclosed in brackets, are regarded as written by someone other than the author of \bk Mark\bk*
+\iot Outline of Contents
+\io1 The beginning of the gospel \ior (1.1-13)\ior*
+\io1 Jesus' public ministry in Galilee \ior (1.14–9.50)\ior*
+\io1 From Galilee to Jerusalem \ior (10.1-52)\ior*
+\c 1
+\ms BOOK ONE
+\mr (Psalms 1–41)
+\p
+\v 1 the first verse
+\v 2 the second verse