Testing grammar via python module (#182)
* update toc rule to follow the pattern of other number markers

* start working on to_dict_new to revise the JSON output structure

* implement attribute, list, table and milestone handling in node_to_dict_new

* fix id_query and ensure bookCode is added

* include titles and poetry in JSON output

* reduce inner nesting in JSON with a decorator function

* handle footnotes and cross refs

* implement filtering

* remove old to_dict code

* re-write the to_list function as per the new JSON

* update syntax trees in test as per the change in toc rule

* add linting for python module on gitactions

* fix error in github action script

* change the use of filter Enum in CLI

* remove unused import

* Setup pytest and start testing with committee tests suite

* fix grammar: customAttribute rule

* fix grammar: not bind toc and toca within hblock

* fix grammar: make space or line after verseNumber optional

* fix grammar: not treat \b as a paragraph, but a chapter and poetry content

* grammar test update: change syntax trees in test as per change in \b rule

* fix grammar: have separate rules for xt and xt_standalone

* fix grammar: re-write zNameSpace rules

* fix grammar: allow lemma attribute value to be empty

* fix grammar: allow default attribute value to be optionally quoted

* fix grammar: permit multiple attributes in jmp and change the rule for userdefine attrib

* fix grammar: change introduction rule, to allow only imt marker to be present

* fix grammar: define \+xt to be used inside footnote

* Python module: accommodate empty values in attributes

* include USFM/X committee's test suite

* automated tests in python module with committee's test cases

* fix usx conversion: handle nested character markers in the same way as regular ones

* fix usx conversion: include comments also in the list of para_style_markers

* fix usx conversion: handle ca cp va vp markers

* fix usx conversion: handle pi and ph paragraph blocks

* fix usx conversion: handle empty attribute values

* fix usx conversion: handle empty book with no space or line after bookcode

* fix usx conversion: add style v to verse and correct eid in chapter

* fix usx conversion: bring last verse end node of a chapter inside the previous paragraph

* fix usx conversion: break down node_2usx() into smaller functions

* fix usx conversion: nest only character markers, notes and text inside parastyle markers not others

* use the usx.rnc schema to validate the generated usx

* fix linting issues

* run python tests on gitactions

* run python tests on gitactions attempt #2

* run python tests on gitactions attempt #3

* run python tests on gitactions attempt #4

* run python tests on gitactions attempt #5

* run python tests on gitactions attempt #6

* run python tests on gitactions attempt #7

* run python tests on gitactions attempt #8

* run python tests on gitactions attempt #9

* run python tests on gitactions attempt #10
kavitharaju authored Oct 21, 2022
1 parent 28e613a commit f3c576c
Showing 771 changed files with 212,394 additions and 297 deletions.
46 changes: 43 additions & 3 deletions .github/workflows/check-on-push.yml
@@ -11,9 +11,9 @@ on:
 jobs:
   # Set the job key. The key is displayed as the job name
   # when a job name is not provided
-  Run-linter-and-tests:
+  Run-Grammar-tests:
     # Name the Job
-    name: Lint n test
+    name: Run Grammar tests
     # Set the type of machine to run on
     runs-on: ubuntu-latest
 
@@ -32,4 +32,44 @@ jobs:
           ./node_modules/.bin/tree-sitter generate
           ./node_modules/.bin/tree-sitter test
+  Run-Python-tests:
+    name: Run Python tests
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v2
+      - uses: actions/setup-python@v2
+        with:
+          python-version: '3.10.6'
+
+      - name: Setup node and npm
+        uses: actions/setup-node@v2
+        with:
+          node-version: 14
+
+      - name: Create Virtual Environment
+        run: python -m venv ENV-dev
+
+      - name: Use VENV
+        run: source ENV-dev/bin/activate
+
+      - name: Install dependencies
+        run: pip install -r ./python-usfm-parser/dev-requirements.txt
+
+      - name: Build grammar binary
+        run: |
+          cd tree-sitter-usfm3
+          npm install .
+          ./node_modules/.bin/tree-sitter generate
+          cd ..
+          python python-usfm-parser/src/grammar_rebuild.py ./tree-sitter-usfm3/ python-usfm-parser/src/usfm_grammar/my-languages.so
+      - name: Install python module
+        run: |
+          cd python-usfm-parser
+          pip install .
+      - name: Run tests
+        working-directory: ./python-usfm-parser
+        run:
+          pytest tests/test_parsing_errors.py
1 change: 1 addition & 0 deletions python-usfm-parser/dev-requirements.txt
@@ -3,3 +3,4 @@ jupyterlab==3.4.4
 rnc2rng==2.6.6
 lxml==4.9.1
 pylint==2.15.3
+pytest==7.1.3
11 changes: 6 additions & 5 deletions python-usfm-parser/src/grammar_rebuild.py
@@ -3,12 +3,13 @@
 import sys
 from tree_sitter import Language
 
-if len(sys.argv) > 1 :
-    GRAMMAR_PATH = sys.argv[1]
-    OUTPUT_PATH = sys.argv[2]
+if len(sys.argv) == 3 :
+    GRAMMAR_PATH = sys.argv[1]
+    OUTPUT_PATH = sys.argv[2]
 else:
-    GRAMMAR_PATH = '../../tree-sitter-usfm3'
-    OUTPUT_PATH = 'usfm_grammar/my-languages.so'
+    raise Exception('''Usage: python python-usfm-parser/src/grammar_rebuild.py \
+./tree-sitter-usfm3/ python-usfm-parser/src/usfm_grammar/my-languages.so
+from the project root directory''')
 
 Language.build_library(
     # Store the library in the `ext` directory
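The stricter argument check introduced in this hunk can be sketched as a small standalone function. This is only an illustration under stated assumptions: `parse_paths` and `USAGE` are hypothetical names that do not appear in the commit, and the sketch raises `SystemExit` where the diff raises a bare `Exception`.

```python
# Hypothetical helper mirroring the "exactly two arguments" rule from the diff.
USAGE = (
    "Usage: python python-usfm-parser/src/grammar_rebuild.py "
    "./tree-sitter-usfm3/ python-usfm-parser/src/usfm_grammar/my-languages.so\n"
    "from the project root directory"
)


def parse_paths(argv):
    """Return (grammar_path, output_path), or exit with the usage text."""
    # argv[0] is the script name, so a valid call has exactly three entries.
    if len(argv) != 3:
        raise SystemExit(USAGE)
    return argv[1], argv[2]
```

A script would call it as `parse_paths(sys.argv)`. Compared with the old `len(sys.argv) > 1` test, a call with a single path now fails with the usage message instead of an `IndexError` on `sys.argv[2]`.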
374 changes: 235 additions & 139 deletions python-usfm-parser/src/usfm_grammar/usfm_parser.py

Large diffs are not rendered by default.

149 changes: 149 additions & 0 deletions python-usfm-parser/tests/__init__.py
@@ -0,0 +1,149 @@
'''The common methods and objects needed in all tests. To be run before all tests'''
from glob import glob
from lxml import etree
from src.usfm_grammar import USFMParser

TEST_DIR = "../tests"

def initialise_parser(input_usfm_path):
    '''Open and parse the given file'''
    with open(input_usfm_path, 'r', encoding='utf-8') as usfm_file:
        usfm_string = usfm_file.read()
    test_parser = USFMParser(usfm_string)
    return test_parser

def is_valid_usfm(input_usfm_path):
    '''Checks the metadata.xml to see if the USFM is a valid one'''
    meta_file_path = input_usfm_path.replace("origin.usfm", "metadata.xml")
    with open(meta_file_path, 'r', encoding='utf-8') as meta_file:
        meta_xml_string = meta_file.read()
    if meta_xml_string.startswith("<?xml "):
        # need to remove the first line containing xml declaration
        # because it doesn't have version, which is mandatory
        meta_xml_string = meta_xml_string.split("\n", 1)[-1]
    root = etree.fromstring(meta_xml_string)
    node = root.find("validated")
    if node.text == "fail":
        return False
    return True

all_usfm_files = glob(f"{TEST_DIR}/*/*/origin.usfm")

exclude_files = [
f'{TEST_DIR}/mandatory/v/origin.usfm',
# Is V really a must? Can't we have empty chapter stubs?
f'{TEST_DIR}/biblica/BlankLinesWithFigures/origin.usfm',
# the "occurs under" list in the sty file doesn't have c or b
# https://github.com/ubsicap/usfm/blob/6be0cd1fcedfeac19f354c19791d9f1d66721c5e/sty/usfm.sty#L2975
# the description in the metadata.xml doesn't sound very confident either
f'{TEST_DIR}/specExamples/titles/origin.usfm',
# \mte# is shown as occurring under c, as per sty. This file has it before c
# Also, after a heading (\s etc.) shouldn't there be a paragraph marker? It's missing too.
f'{TEST_DIR}/specExamples/cross-ref/origin.usfm',
f'{TEST_DIR}/special-cases/empty-para/origin.usfm',
f'{TEST_DIR}/special-cases/empty-c/origin.usfm',
f'{TEST_DIR}/special-cases/sp/origin.usfm',
f'{TEST_DIR}/paratextTests/WordlistMarkerMissingFromGlossaryCitationForms/origin.usfm',
f'{TEST_DIR}/paratextTests/NestingInCrossReferences/origin.usfm',
f'{TEST_DIR}/usfmjsTests/missing_verses/origin.usfm',
# excluding temporarily, because of \\p expecting a spaceOrline afterwards
# Spec says "the space is needed only when text follows the marker...
# ... Most paragraph or poetic markers (like \p, \m, \q# etc.)...
# ...can be followed immediately by a verse number (\v) on a new line."
# DOESN'T THAT MEAN A LINE IS NEEDED AND "\p\v 1 .." usage is not correct?
f'{TEST_DIR}/paratextTests/UnmatchedSidebarStart/origin.usfm',
f'{TEST_DIR}/paratextTests/CharStyleNotClosed/origin.usfm',
f'{TEST_DIR}/paratextTests/CharStyleCrossesVerseNumber/origin.usfm',
f'{TEST_DIR}/paratextTests/NestingInFootnote/origin.usfm',
f'{TEST_DIR}/paratextTests/FigureNotClosed/origin.usfm',
f'{TEST_DIR}/paratextTests/FootnoteNotClosed/origin.usfm',
f'{TEST_DIR}/paratextTests/EmptyMarkers/origin.usfm',
# temporarily excluding
# case of MISSING values not reported as ERROR.
# Problem with tree-sitter, or the way we use it
f'{TEST_DIR}/specExamples/character/origin.usfm',
f'{TEST_DIR}/usfmjsTests/isa_verse_span/origin.usfm',
f'{TEST_DIR}/usfmjsTests/isa_footnote/origin.usfm',
f'{TEST_DIR}/usfmjsTests/tit_extra_space_after_chapter/origin.usfm',
f'{TEST_DIR}/usfmjsTests/1ch_verse_span/origin.usfm',
f'{TEST_DIR}/usfmjsTests/usfmBodyTestD/origin.usfm',
f'{TEST_DIR}/usfmjsTests/esb/origin.usfm',
f'{TEST_DIR}/usfmjsTests/acts_1_milestone.oldformat/origin.usfm',
f'{TEST_DIR}/usfmjsTests/nb/origin.usfm',
f'{TEST_DIR}/usfmjsTests/usfmIntroTest/origin.usfm',
f'{TEST_DIR}/usfmjsTests/usfm-body-testF/origin.usfm',
f'{TEST_DIR}/usfmjsTests/out_of_sequence_verses/origin.usfm',
f'{TEST_DIR}/usfmjsTests/acts_1_milestone/origin.usfm',
f'{TEST_DIR}/usfmjsTests/luk_quotes/origin.usfm',
f'{TEST_DIR}/samples-from-wild/doo43-1/origin.usfm',
f'{TEST_DIR}/samples-from-wild/doo43-2/origin.usfm',
# excluding because there is no \p (or other paragraph marker)
# after \s, table, esbe etc.
# in most of the above usfmjs cases it's \s5 that misses \p after it...
f'{TEST_DIR}/special-cases/empty-attributes5/origin.usfm',
# just parking for later as this is a low risk corner case
# the space in \w ...|<space>\w* get parsed as "default-argument" and test passes
f'{TEST_DIR}/paratextTests/WordlistMarkerTextEndsInSpaceWithoutGlossary/origin.usfm',
f'{TEST_DIR}/paratextTests/WordlistMarkerTextContainsNonWordformingPunctuation/origin.usfm',
f'{TEST_DIR}/paratextTests/GlossaryCitationFormContainsNonWordformingPunctuation/origin.usfm',
f'{TEST_DIR}/paratextTests/WordlistMarkerTextEndsInSpaceWithGlossary/origin.usfm',
f'{TEST_DIR}/paratextTests/WordlistMarkerTextEndsInPunctuation/origin.usfm',
f'{TEST_DIR}/paratextTests/GlossaryCitationFormEndsInSpace/origin.usfm',
f'{TEST_DIR}/paratextTests/WordlistMarkerKeywordEndsInSpace/origin.usfm',
f'{TEST_DIR}/paratextTests/WordlistMarkerKeywordEndsInPunctuation/origin.usfm',
f'{TEST_DIR}/paratextTests/GlossaryCitationFormEndsInPunctuation/origin.usfm',
f'{TEST_DIR}/paratextTests/WordlistMarkerTextEndsInSpaceAndMissingFromGlossary/origin.usfm',
f'{TEST_DIR}/paratextTests/WordlistMarkerKeywordContainsNonWordformingPunctuation/origin.usfm',
f'{TEST_DIR}/paratextTests/CharStyleClosedAndReopened/origin.usfm',
# I think it is good to cover these usages also, unless they are wrong USFM! Are they?
# these issues look like paratext specific ways of handling spaces and punctuations
f'{TEST_DIR}/paratextTests/CustomAttributesAreValid/origin.usfm',
f'{TEST_DIR}/paratextTests/ValidMilestones/origin.usfm',
f'{TEST_DIR}/paratextTests/LinkAttributesAreValid/origin.usfm',
# Correct syntaxes "x-name", "qt-s", "link-href",
# but used are "xname", "qts", "linkhref"
# Looks like a bug while writing the text to file
f'{TEST_DIR}/paratextTests/EmptyFigure/origin.usfm',
# Older usage of multiple pipes, of USFM 2.x.
f'{TEST_DIR}/paratextTests/MissingColumnInTable/origin.usfm',
# Do we need to check column numbers in tables? What if the UI wants merged cells?
f'{TEST_DIR}/paratextTests/GlossaryCitationFormContainingWordMedialPunctuation_Pass/'
'origin.usfm',
# uses \ in text before quote('). Probably a bug while writing the text to file
f'{TEST_DIR}/paratextTests/NoErrorsPartiallyEmptyBook/origin.usfm',
f'{TEST_DIR}/paratextTests/NoErrorsEmptyBook/origin.usfm',
# as per USFM spec, markers ide, rem, h etc. cannot be empty
f'{TEST_DIR}/usfmjsTests/acts-1-20.aligned.crammed.oldformat/origin.usfm',
# \q' without space in between and \zaln-s not closed in two places each
f'{TEST_DIR}/usfmjsTests/45-ACT.ugnt.oldformat/origin.usfm',
# toc used without space and text. \k used as \k-s which doesn't seem to be right!
f'{TEST_DIR}/usfmjsTests/gn_headers/origin.usfm',
# as per sty file, \mte# occurs under c. Here given after \mt#. Is that correct usage?
f'{TEST_DIR}/usfmjsTests/45-ACT.ugnt/origin.usfm',
f'{TEST_DIR}/usfmjsTests/acts_8-37-ugnt-footnote/origin.usfm',
# \w used inside footnote without nesting(\+w). Also toc used without space or text
f'{TEST_DIR}/usfmjsTests/57-TIT.greek.oldformat/origin.usfm',
f'{TEST_DIR}/usfmjsTests/57-TIT.greek/origin.usfm',
f'{TEST_DIR}/samples-from-wild/UGNT2/origin.usfm',
f'{TEST_DIR}/samples-from-wild/UGNT1/origin.usfm',
# toc1 used without text or space
f'{TEST_DIR}/usfmjsTests/inline_God/origin.usfm',
# nested marker not closed. Is closing not mandatory?
f'{TEST_DIR}/samples-from-wild/doo43-4/origin.usfm',
# () usage in \ior is shown as \ior (....) \ior* in the spec

########### Temporarily for testing USX conversion ##############
f'{TEST_DIR}/specExamples/milestone/origin.usfm',
]

for file in exclude_files:
    if file in all_usfm_files:
        all_usfm_files.remove(file)


exclude_USX_files = [
f'{TEST_DIR}/specExamples/chapter-verse/origin.usx',
# ca is added as attribute to cl not chapter node
f'{TEST_DIR}/specExamples/milestone/origin.usx',
# Znamespace not represented properly. There are no docs on it at https://ubsicap.github.io/usx either
]
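The metadata check in `is_valid_usfm` can be tried in isolation with a sketch like the one below. It is an approximation, not the module's code: it uses the standard library's `xml.etree.ElementTree` instead of `lxml`, the name `expects_clean_parse` is hypothetical, and treating a missing `<validated>` tag as valid is an assumption (the original would fail on `node.text` in that case). The sample string is copied from one of the `metadata.xml` files added in this commit.

```python
import xml.etree.ElementTree as ET

# Copied from tests/advanced/default-attributes/metadata.xml in this commit.
SAMPLE_METADATA = '''<?xml encoding="utf-8"?>
<test-metadata>
<description>Markers with default attributes. Advanced marker usages.</description>
<validated>pass</validated>
<tags></tags>
</test-metadata>
'''


def expects_clean_parse(meta_xml_string):
    """Return False only when <validated> holds 'fail'."""
    if meta_xml_string.startswith("<?xml "):
        # Drop the declaration line: it carries no version attribute,
        # which a strict XML parser treats as malformed.
        meta_xml_string = meta_xml_string.split("\n", 1)[-1]
    root = ET.fromstring(meta_xml_string)
    node = root.find("validated")
    # Assumption: a missing <validated> element counts as valid.
    return node is None or node.text != "fail"
```

Calling `expects_clean_parse(SAMPLE_METADATA)` tells a test whether the matching `origin.usfm` should parse without errors or should report some.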
16 changes: 16 additions & 0 deletions python-usfm-parser/tests/test_json_conversion.py
@@ -0,0 +1,16 @@
'''Test the to_dict or json conversion API'''
import pytest

from tests import all_usfm_files, initialise_parser, is_valid_usfm


@pytest.mark.parametrize( 'file_path', all_usfm_files)
@pytest.mark.timeout(300)
def test_dict_converions_without_filter(file_path):
    '''Tests if input parses without errors'''
    test_parser = initialise_parser(file_path)
    if is_valid_usfm(file_path):
        assert not test_parser.errors, test_parser.errors
        usfm_dict = test_parser.to_dict()
        assert isinstance(usfm_dict, dict)

15 changes: 15 additions & 0 deletions python-usfm-parser/tests/test_parsing_errors.py
@@ -0,0 +1,15 @@
'''To test parsing success/errors for USFM/X committee's test suite'''
import pytest

from tests import all_usfm_files, initialise_parser, is_valid_usfm


@pytest.mark.parametrize( 'file_path', all_usfm_files)
def test_error_less_parsing(file_path):
    '''Tests if input parses without errors'''
    test_parser = initialise_parser(file_path)
    if is_valid_usfm(file_path):
        assert not test_parser.errors, test_parser.errors
    else:
        assert test_parser.errors, "file has errors, but passed\n"+test_parser.to_syntax_tree()

43 changes: 43 additions & 0 deletions python-usfm-parser/tests/test_usx_conversion.py
@@ -0,0 +1,43 @@
'''Test the to_usx conversion API'''
from doctest import Example
from io import StringIO

import pytest
from lxml import etree
from lxml.doctestcompare import LXMLOutputChecker, PARSE_XML

from tests import all_usfm_files, initialise_parser, is_valid_usfm, exclude_USX_files

lxml_object = etree.Element('Root')
checker = LXMLOutputChecker()

with open("../schemas/usx.rnc", encoding='utf-8') as f:
    usxrnc_doc = f.read()
relaxng = etree.RelaxNG.from_rnc_string(usxrnc_doc)

@pytest.mark.parametrize( 'file_path', all_usfm_files)
@pytest.mark.timeout(100)
def test_usx_converions_without_filter(file_path):
    '''Tests if input parses & converts to usx successfully and validates the usx against schema'''
    test_parser = initialise_parser(file_path)
    if is_valid_usfm(file_path):
        assert not test_parser.errors, test_parser.errors
        usx_xml = test_parser.to_usx()
        assert isinstance(usx_xml, type(lxml_object)), test_parser.to_syntax_tree()

        assert relaxng.validate(usx_xml), relaxng.error_log.last_error

        # usx_file_path = file_path.replace("origin.usfm", "origin.xml")
        # if usx_file_path not in exclude_USX_files:
        #     origin_xml = etree.parse(usx_file_path)
        #     if relaxng.validate(origin_xml):
        #         message = checker.output_difference(
        #             Example("", etree.tostring(origin_xml).decode('utf-8')),
        #             etree.tostring(usx_xml), PARSE_XML)
        #         assert checker.check_output(etree.tostring(origin_xml),
        #             etree.tostring(usx_xml), PARSE_XML), message
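The commented-out block compares the generated USX against the suite's `origin.xml` with lxml's `LXMLOutputChecker`. A different, standard-library way to get a similar comparison is C14N canonicalization, sketched below. This is a swapped-in technique, not what the commit uses; `same_xml` and `ignore_layout` are hypothetical names. Note that stripping whitespace can hide differences USX may care about (for example the trailing space in `the first verse `), so pass `ignore_layout=False` when whitespace matters.

```python
from xml.etree.ElementTree import canonicalize


def same_xml(expected, actual, ignore_layout=True):
    """Compare two XML strings after C14N canonicalization."""
    # Canonical form sorts attributes and writes <b/> as <b></b>.
    # strip_text also drops leading/trailing whitespace in text content,
    # so pretty-printing differences do not count as mismatches.
    return (canonicalize(expected, strip_text=ignore_layout)
            == canonicalize(actual, strip_text=ignore_layout))
```

For instance, `<book code="GEN" style="id"/>` and `<book style="id" code="GEN"></book>` compare as equal, while a changed `style` value does not.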





6 changes: 6 additions & 0 deletions tests/advanced/custom-attributes/metadata.xml
@@ -0,0 +1,6 @@
<?xml encoding="utf-8"?>
<test-metadata>
<description>Link-attributes and custom attributes. Advanced marker usages.</description>
<validated>pass</validated>
<tags></tags>
</test-metadata>
10 changes: 10 additions & 0 deletions tests/advanced/custom-attributes/origin.usfm
@@ -0,0 +1,10 @@
\id GEN
\c 1
\p
\v 1 the first verse
\v 2 the second verse \w gracious|x-myattr="metadata" \w*
\q1 “Someone is shouting in the desert,
\q2 ‘Prepare a road for the Lord;
\q2 make a straight path for him to travel!’ ”
\s \jmp |link-id="article-john_the_baptist" \jmp*John the Baptist
\p John is sometimes called...
13 changes: 13 additions & 0 deletions tests/advanced/custom-attributes/origin.xml
@@ -0,0 +1,13 @@
<usx version="3.0">
<book code="GEN" style="id" />
<chapter number="1" style="c" sid="GEN 1" />
<para style="p">
<verse number="1" style="v" sid="GEN 1:1" />the first verse <verse eid="GEN 1:1" /><verse number="2" style="v" sid="GEN 1:2" />the second verse <char style="w" x-myattr="metadata">gracious</char></para>
<para style="q1" vid="GEN 1:2">“Someone is shouting in the desert,</para>
<para style="q2" vid="GEN 1:2">‘Prepare a road for the Lord;</para>
<para style="q2" vid="GEN 1:2">make a straight path for him to travel!’ ”</para>
<para style="s" vid="GEN 1:2">
<char style="jmp" link-id="article-john_the_baptist" />John the Baptist</para>
<para style="p" vid="GEN 1:2">John is sometimes called...<verse eid="GEN 1:2" /></para>
<chapter eid="GEN 1" />
</usx>
6 changes: 6 additions & 0 deletions tests/advanced/default-attributes/metadata.xml
@@ -0,0 +1,6 @@
<?xml encoding="utf-8"?>
<test-metadata>
<description>Markers with default attributes. Advanced marker usages.</description>
<validated>pass</validated>
<tags></tags>
</test-metadata>
5 changes: 5 additions & 0 deletions tests/advanced/default-attributes/origin.usfm
@@ -0,0 +1,5 @@
\id GEN
\c 1
\p
\v 1 the first verse
\v 2 the second verse \w gracious|grace\w*
7 changes: 7 additions & 0 deletions tests/advanced/default-attributes/origin.xml
@@ -0,0 +1,7 @@
<usx version="3.0">
<book code="GEN" style="id" />
<chapter number="1" style="c" sid="GEN 1" />
<para style="p">
<verse number="1" style="v" sid="GEN 1:1" />the first verse <verse eid="GEN 1:1" /><verse number="2" style="v" sid="GEN 1:2" />the second verse <char style="w" lemma="grace">gracious</char><verse eid="GEN 1:2" /></para>
<chapter eid="GEN 1" />
</usx>
6 changes: 6 additions & 0 deletions tests/advanced/header/metadata.xml
@@ -0,0 +1,6 @@
<?xml encoding="utf-8"?>
<test-metadata>
<description>Header section with more markers. Advanced marker usages.</description>
<validated>pass</validated>
<tags></tags>
</test-metadata>
18 changes: 18 additions & 0 deletions tests/advanced/header/origin.usfm
@@ -0,0 +1,18 @@
\id MRK 41MRKGNT92.SFM, Good News Translation, June 2003
\h John
\toc1 The Gospel according to John
\toc2 John
\mt2 The Gospel
\mt3 according to
\mt1 JOHN
\ip The two endings to the Gospel, which are enclosed in brackets, are regarded as written by someone other than the author of \bk Mark\bk*
\iot Outline of Contents
\io1 The beginning of the gospel \ior (1.1-13)\ior*
\io1 Jesus' public ministry in Galilee \ior (1.14–9.50)\ior*
\io1 From Galilee to Jerusalem \ior (10.1-52)\ior*
\c 1
\ms BOOK ONE
\mr (Psalms 1–41)
\p
\v 1 the first verse
\v 2 the second verse