Potential bug: Rows from one table appear in the parsing of another when 1 row is added to a third #82

Open
PeterSR opened this issue Sep 2, 2022 · 4 comments


PeterSR commented Sep 2, 2022

This is kind of a weird one. I am not sure if I am doing something wrong or if there is a bug in some of the _start_/_end_ logic (or somewhere else).

Here's the setup: I have a text file with many tables and other values that need to be extracted. For the purpose of this issue I have reduced the text file to only 3 tables and a few other values, as a minimal reproducible example.

Here is the file:

example-output-01.txt

Here is the template file:

<vars>
HASH3 = "\#\#\#"
</vars>

<group name="network">
{{ ignore("HASH3") }} START NETWORK DATA {{ _start_ }}
#Network_nSections {{ num_sections | to_int }}
<group name="section_data" method="table">
#Network_SectionData {{ ignore(".*") }} {{ _start_ }}
<group>
  {{ ignore(" *") }}{{ A }} {{ B }} {{ C }} {{ D }} {{ J }} {{ active }} {{ K }} {{ L }}
</group>
#Application_FileName {{ ignore(".*") }} {{ _end_ }}
</group>
<group name="application">
#Application_FileName {{ filename }}
#Application_SpreadFactor {{ spread_factor }}
</group>
{{ ignore("HASH3") }} END NETWORK DATA {{ _end_ }}
</group>

In this template I only care about extracting values from one table, namely "Network_SectionData" (second table), plus a few other values.
In the text file, we also have a building table (first table) and a summary table (third table).

If I run

python -m ttp.ttp -t example.ttp -d example-output-01.txt -o json > out.json

Then I see the list of expected extracted rows in network.section_data.

However, if the following line is added to the end of the building table, just above ### END BUILDING DATA

  52.422     6.502      22.2       0.0      0.65    2.100E+02    8.086E+03    4.982E+11    3.654E+03

then these values from the third table start to appear in the parsed output:

...
{
    "A": "id",
    "B": "mode",
    "C": "on/off",
    "D": "Light",
    "J": "Freq.",
    "K": "Ship.",
    "L": "[log.dec.]",
    "active": "[Hz]"
},
{
    "A": "2",
    "B": "Found",
    "C": ":",
    "D": "u_z",
    "J": "dynamic",
    "K": "(Hz)",
    "L": "0.000",
    "active": "0.00"
},
...

These are values from the summary table (third table) that happen to match the row pattern:

{{ ignore(" *") }}{{ A }} {{ B }} {{ C }} {{ D }} {{ J }} {{ active }} {{ K }} {{ L }}

I find this very peculiar, because

  1. The change happens in a part of the file that ttp seemingly shouldn't care about.
  2. I have 2 different _end_ indicators and if just one of them found a correct match, it should never look down in the summary table section in the first place.

Note: I know that I could probably find a way around this by making sure that my match variables only match numbers, for instance, but for my use case I need to rely solely on _start_ and _end_ indicators.


Windows 10, python 3.7, ttp 0.9.1

@PeterSR
Author

PeterSR commented Sep 20, 2022

@dmulyalin - Sorry for tagging you directly, but do you have any idea what could be the cause of this?

@dmulyalin
Owner

I would recommend simplifying your template, e.g. this gives the same results as yours but is a bit easier to read IMHO:

<vars>
HASH3 = "\#\#\#"
</vars>

<group name="network">
{{ ignore("HASH3") }} START NETWORK DATA {{ _start_ }}

<group name="section_data">
  {{ ignore(" *") }}{{ A }} {{ B }} {{ C }} {{ D | DIGIT }} {{ J }} {{ active | DIGIT }} {{ K }} {{ L }}
</group>

{{ ignore("HASH3") }} END NETWORK DATA {{ _end_ }}
</group>

<group name="network.application">
#Network_nSections {{ num_sections | to_int }}
#Application_FileName {{ filename }}
#Application_SpreadFactor {{ spread_factor }}
</group>

As for the undesired matches, I was not able to reproduce the problem by doing this:

However, if the following line is added to the end of the building table, just above ### END BUILDING DATA

52.422 6.502 22.2 0.0 0.65 2.100E+02 8.086E+03 4.982E+11 3.654E+03

Still, here are several techniques to avoid unnecessary matches:

  1. Use an end indicator - you are already using it.
  2. Use more specific regexes, e.g. in your template you are using:
    {{ ignore(" *") }}{{ A }} {{ B }} {{ C }} {{ D }} {{ J }} {{ active }} {{ K }} {{ L }}
    while in my template I am using:
    {{ ignore(" *") }}{{ A }} {{ B }} {{ C }} {{ D | DIGIT }} {{ J }} {{ active | DIGIT }} {{ K }} {{ L }}
    My template will only match digits for the D and active variables; that alone should solve the problem with false matches in your case.
  3. Pre-process the input data by removing the parts that do not need to be matched. In other words, provide TTP with data that is as clean as possible, where ideally each line will be matched by some variables.
  4. Do inline filtering using condition functions, e.g. {{ A | contains(".") }} will filter out any unwanted matches that do not contain a dot character.
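
To illustrate technique 2, here is a rough sketch in plain Python `re` rather than TTP itself. It assumes that an unfiltered TTP variable behaves roughly like `\S+` and that `DIGIT` behaves like `\d+`; the regex built below is an approximation for illustration, not the regex TTP actually generates. The two summary lines are rebuilt from the tokens in the unwanted matches reported above, with hypothetical spacing, and the numeric row is made up.

```python
import re

# Column names in the order they appear in the template row.
COLUMNS = ["A", "B", "C", "D", "J", "active", "K", "L"]


def build_row_regex(digit_columns=()):
    """Approximate the 8-column row pattern.

    Columns listed in digit_columns only accept digits; all other
    columns accept any run of non-space characters.
    """
    parts = []
    for name in COLUMNS:
        token = r"\d+" if name in digit_columns else r"\S+"
        parts.append("(?P<{}>{})".format(name, token))
    # Optional leading spaces, columns separated by one or more spaces.
    return re.compile(r"^ *" + r" +".join(parts) + r" *$")


def matches(regex, line):
    """Return True if the whole line fits the row pattern."""
    return regex.match(line) is not None


# Tokens taken from the unwanted matches in the report; spacing is hypothetical.
summary_header = "  id  mode  on/off  Light  Freq.  [Hz]  Ship.  [log.dec.]"
summary_row = "  2  Found  :  u_z  dynamic  0.00  (Hz)  0.000"
# Made-up row where D and active are plain integers.
section_row = "  1  2  3  4  0.5  1  0.1  0.2"

generic = build_row_regex()
strict = build_row_regex(digit_columns=("D", "active"))

for label, line in [("header", summary_header),
                    ("summary", summary_row),
                    ("section", section_row)]:
    print(label, matches(generic, line), matches(strict, line))
```

Under these assumptions the generic pattern accepts all three lines, while the stricter one rejects both summary lines because "Light" and "u_z" are not digits.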

@PeterSR
Author

PeterSR commented Sep 29, 2022

Sorry I haven't gotten back to you yet.

I find it a bit unsettling that you are not able to reproduce the problem. It makes me doubt whether there is a setup issue at my end. But I did test it multiple times and tried to boil it down to the very core before creating the issue.

Regarding the change of template: You are probably right, but this template was taken from a bigger template, perhaps 10 times as large and with a lot of complexity. It might not be possible to make these simplifications in real life. And since the data we are parsing is quite messy, outside our control, and sometimes unpredictable, we need at least some general matching.

I think you might be right that we need to pre-process the input data - I had just hoped that TTP would spare us that because it is such a strong parser. Regardless, I will work more on the template.
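
One possible shape for that pre-processing step is to cut the input down to the block of interest before handing it to TTP. Below is a minimal sketch in plain Python, assuming the block is delimited by lines containing START NETWORK DATA and END NETWORK DATA as in the template. The helper name `extract_block` is hypothetical, and the sample text (including the START BUILDING DATA and SUMMARY DATA marker names) is made up for illustration.

```python
def extract_block(text, start_marker, end_marker):
    """Return the lines from the first line containing start_marker
    through the next line containing end_marker (both inclusive)."""
    kept = []
    inside = False
    for line in text.splitlines():
        if not inside and start_marker in line:
            inside = True
        if inside:
            kept.append(line)
            if end_marker in line:
                break
    return "\n".join(kept)


# Made-up sample input with three delimited blocks.
sample = "\n".join([
    "### START BUILDING DATA",
    "  1.0  2.0",
    "### END BUILDING DATA",
    "### START NETWORK DATA",
    "#Network_nSections 1",
    "### END NETWORK DATA",
    "### START SUMMARY DATA",
    "  id  mode",
    "### END SUMMARY DATA",
])

network_only = extract_block(sample, "START NETWORK DATA", "END NETWORK DATA")
print(network_only)
```

With only the network block left in the input, rows from the other tables cannot be picked up by a general row pattern, at the cost of maintaining this extra step outside the template.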

Regarding reproducing: Before I close this, I would like to make one last effort to see if anyone else can reproduce it. Let me think a bit about how.

@PeterSR
Author

PeterSR commented Oct 3, 2022

I have now reproduced the issue in PythonAnywhere.

main.py

[Screenshot, 2022-10-03 21:06:04: main.py in the PythonAnywhere editor]

I then installed ttp==0.9.1 in the python3.7 environment.

Calling main.py with the input file without the line mentioned above:

[Screenshot, 2022-10-03 21:03:42: PythonAnywhere Bash console output]

Calling main.py with the input file with the problematic line:

[Screenshot, 2022-10-03 21:03:56: PythonAnywhere Bash console output]

In the red circle, I have highlighted a couple of matches that contain data from a different table than they are supposed to, for instance

{
    "A": "id",
    "B": "mode",
    "C": "on/off",
    "D": "Light",
    "J": "Freq.",
    "K": "Ship.",
    "L": "[log.dec.]",
    "active": "[Hz]"
}

Here's the live console you can play around with:
https://www.pythonanywhere.com/shared_console/0e996b6e-fa94-4244-aed7-4c000b7fdd60
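
For anyone who wants to reproduce this locally, the second input variant can be generated from the first with a small helper. This is a sketch in plain Python: the extra row and the ### END BUILDING DATA marker are taken from the original report, while the helper name `add_row_before_marker` and the tiny sample text are hypothetical.

```python
# Row and marker as quoted in the original report.
EXTRA_ROW = ("  52.422     6.502      22.2       0.0      0.65"
             "    2.100E+02    8.086E+03    4.982E+11    3.654E+03")
END_MARKER = "### END BUILDING DATA"


def add_row_before_marker(text, row=EXTRA_ROW, marker=END_MARKER):
    """Insert row on its own line just above the first line equal to marker."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.strip() == marker:
            lines.insert(index, row)
            return "\n".join(lines) + "\n"
    raise ValueError("marker line not found: {!r}".format(marker))


# Tiny made-up input, only to show the effect of the helper.
sample = "  1.0  2.0\n### END BUILDING DATA\nrest of file\n"
variant = add_row_before_marker(sample)
print(variant)
```

The modified text can be written to a second file and parsed with the same command as in the first comment, so that the two outputs can be compared side by side.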
