error when trying to process GEDCOM file with python-gedcom v. 0.2.0.dev #3

Closed
62mkv opened this issue Jul 30, 2018 · 16 comments

@62mkv

62mkv commented Jul 30, 2018

I've downloaded and installed python-gedcom v.0.2.0.dev

I run it as follows:

from gedcom import Gedcom

file_path = '7q4425_661384sh82b72570424am5.ged' # Path to your `.ged` file
gedcom = Gedcom(file_path)

print(gedcom.element_list())

This GEDCOM file starts with

0 HEAD
1 GEDC
2 VERS 5.5.1
2 FORM LINEAGE-LINKED

and I get the following error:

Traceback (most recent call last):
  File "script.py", line 4, in <module>
    gedcom = Gedcom(file_path)
  File "C:\Python3\lib\site-packages\gedcom\__init__.py", line 148, in __init__
    self.__parse(file_path)
  File "C:\Python3\lib\site-packages\gedcom\__init__.py", line 224, in __parse
    last_element = self.__parse_line(line_number, line.decode('utf-8'), last_element)
  File "C:\Python3\lib\site-packages\gedcom\__init__.py", line 263, in __parse_line
    raise SyntaxError(error_message)
SyntaxError: Line `1` of document violates GEDCOM format
See: http://homepages.rootsweb.ancestry.com/~pmcbride/gedcom/55gctoc.htm

What am I doing wrong? This GEDCOM file was recently exported from MyHeritage.

UPD: this is with Python 3.6 under Windows 10 x64

@KeithPetro

KeithPetro commented Jul 31, 2018

What happens if you use gedcom.get_element_list() in place of gedcom.element_list()?

Edit: Also note that the current version of the project may not work with some aspects of GEDCOM 5.5.1 files (however it should definitely be able to deal with that first line).

@62mkv
Author

62mkv commented Jul 31, 2018

@KeithPetro by looking at the stack trace, one can see that the script does not even reach that statement, so I'm pretty sure it will change nothing

Regarding the format, FWIW the first line is identical, so it shouldn't fail there

@KeithPetro

KeithPetro commented Jul 31, 2018

Taking a closer look, I found that the constructor hits this error on the last line of my GEDCOM file if I remove the EOL characters. Perhaps it's an issue with what EOL characters are present in your file and/or how they are handled?

Edit: On a side note though, you should change your code to use gedcom.get_element_list(), as there is no element_list() method for the Gedcom class.

Edit: I made a quick family tree on MyHeritage and exported it for testing. I am experiencing the exact same issue you are, on the first line. What I find odd is that it appears that the line ends in a Carriage-Return and then a Line-Feed, which is completely valid (and I had no issues with a file exported from Ancestry which also used CRLF).

@62mkv
Author

62mkv commented Aug 1, 2018

@nickreynke do you have an example of a GEDCOM file that this version can parse successfully? Could you share it? I would compare it with mine and then either adapt my file or update the code.

@KeithPetro

I've done a bit more testing and found that the first line in a regular GEDCOM file (like the one I have from Ancestry) should be simply (in byte representation):

b'0 HEAD\n'

Whereas the file from MyHeritage has:

b'\xef\xbb\xbf0 HEAD\r\n'

0xEFBBBF is the BOM (Byte Order Mark) for UTF-8. This is outside of the GEDCOM spec, and I expect that any program able to read these files has to implement an out-of-spec workaround specifically for MyHeritage files.
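One way to check a file for this is to read its first line in binary mode and compare the leading bytes against `codecs.BOM_UTF8` from the Python standard library. The snippet below is only a minimal sketch using the two first lines quoted above; the helper name `starts_with_utf8_bom` is made up for illustration and is not part of python-gedcom.

```python
import codecs


def starts_with_utf8_bom(first_line: bytes) -> bool:
    """Return True if the raw bytes begin with the UTF-8 byte order mark."""
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    return first_line.startswith(codecs.BOM_UTF8)


# The two first lines quoted in this thread, as raw bytes.
ancestry_first_line = b'0 HEAD\n'
myheritage_first_line = b'\xef\xbb\xbf0 HEAD\r\n'

print(starts_with_utf8_bom(ancestry_first_line))
print(starts_with_utf8_bom(myheritage_first_line))
```

For a real file, the first line could be obtained with something like `open(path, 'rb').readline()`.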

@62mkv
Author

62mkv commented Aug 1, 2018

Indeed, the .ged file from MyHeritage has BOM

Thanks!! Now it finally begins to parse. Either MyHeritage is shitty on formats or this is allowed in 5.5.1, but the exported .ged file contains multi-line entries, which break the parser (Line 32 of document violates GEDCOM format).

KeithPetro added a commit to KeithPetro/python-gedcom that referenced this issue Aug 1, 2018
Changed decoding from 'utf-8' to 'utf-8-sig' in order to skip any BOM at start of file.
@62mkv
Author

62mkv commented Aug 1, 2018

The MyHeritage export is ridiculous!! It even splits Unicode words in half, so that the first byte is at the end of line N and the second one is at the beginning of line N+1.

@KeithPetro

KeithPetro commented Aug 1, 2018

@nickreynke I have prepared a (very simple) fix for this. All that's required to ignore BOM at the start of a UTF-8 encoded file is to decode with 'utf-8-sig' instead of 'utf-8'.
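To illustrate the difference between the two codecs, here is a small sketch that decodes the MyHeritage first line quoted earlier in this thread both ways. The variable names are only for illustration; the codec names `'utf-8'` and `'utf-8-sig'` are standard Python.

```python
# First line of the MyHeritage export, as raw bytes (quoted earlier in the thread).
raw = b'\xef\xbb\xbf0 HEAD\r\n'

# Plain UTF-8 keeps the BOM as the character U+FEFF at the start of the string.
with_bom = raw.decode('utf-8')

# 'utf-8-sig' skips a leading BOM if one is present.
without_bom = raw.decode('utf-8-sig')

print(repr(with_bom))
print(repr(without_bom))

# Input without a BOM decodes the same way under both codecs.
print(b'0 HEAD\n'.decode('utf-8-sig') == b'0 HEAD\n'.decode('utf-8'))
```

With plain `'utf-8'` the decoded line starts with U+FEFF rather than the level digit, which would plausibly explain why the format check rejected line 1.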

Edit: Some further reading on Byte Order Marks and GEDCOM would be worthwhile, as currently this project seems to only handle UTF-8. In the future, it would be nice to be able to handle ANSEL, UTF-8 as well as UTF-16 in order to be fully compliant with GEDCOM 5.5.1 standards. GEDCOM 5.5 does not have any requirements for UTF-16.

Further reading regarding character sets/encoding in GEDCOM.

@KeithPetro

@62mkv I haven't experienced that issue. How are you testing that?

@62mkv
Author

62mkv commented Aug 1, 2018

For some reason, Ancestry.com was able to import the MyHeritage GEDCOM file without visible defects...

@KeithPetro what do you mean by "how are you testing that"?

@KeithPetro

@62mkv What are you using that is showing you that the words are split?

Ancestry's GEDCOM reading code is likely quite robust and tolerates many variations (both valid and invalid) in GEDCOM files.

@62mkv
Author

62mkv commented Aug 1, 2018

Like this one:

2 NOTE Назвали сына Кирюшей дружно и с надеждой, что потом появится Фёдор Кириллович....  В поколении сына имя его был�
3 CONC � крайне редким и сам он любил имя дружка своего Костя В 2007 у них и появится Костя  И в это время имя Кирилла ста

Cyrillic is weird but not THAT weird )) Those strange icons are just the two parts of a Unicode word split across different lines. If I remove the CRLF and the 3 CONC item, it turns into

2 NOTE Назвали сына Кирюшей дружно и с надеждой, что потом появится Фёдор Кириллович....  В поколении сына имя его было крайне редким и сам он любил имя дружка своего Костя В 2007 у них и появится Костя  И в это время имя Кирилла ста

As you can see, the two nonsensical characters are gone and a normal Cyrillic letter 'о' has appeared in their place.

@62mkv
Author

62mkv commented Aug 1, 2018

(by "Unicode word" I mean a multi-byte Unicode sequence describing a single character; for Cyrillic in UTF-8 it's two bytes)

@KeithPetro

While weird, I think it is actually valid.

According to the GEDCOM standard, the CONC tag signifies concatenation: its value is appended to the value of the preceding line without a space and without the line terminator.

You could split a UTF-16 character mid-way and still properly concatenate it just fine.
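As a rough sketch of that idea, the value and its CONC continuation can be joined as raw bytes first and decoded only once afterwards. This is not python-gedcom's actual parsing code; `join_conc_value` is a hypothetical helper, and the sample bytes are a shortened, hand-built version of the split shown above, with the two bytes of 'о' (0xD0 0xBE) landing on different lines.

```python
def join_conc_value(first_line: bytes, conc_line: bytes) -> str:
    """Join a line with its CONC continuation as bytes, then decode once."""
    # Drop the line terminators; CONC adds neither a space nor a newline.
    head = first_line.rstrip(b'\r\n')
    tail = conc_line.rstrip(b'\r\n')
    # Split off "<level> CONC " and keep only the continuation value.
    _level, _tag, value = tail.split(b' ', 2)
    return (head + value).decode('utf-8')


# 'о' is 0xD0 0xBE in UTF-8; the first byte ends one line, the second starts the next.
note_line = '2 NOTE был'.encode('utf-8') + b'\xd0\r\n'
conc_line = b'3 CONC \xbe' + ' крайне'.encode('utf-8') + b'\r\n'

joined = join_conc_value(note_line, conc_line)
print(joined)
```

The point of the design is the order of operations: concatenate bytes, then decode, so a character split across two lines is whole again by the time the codec sees it.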

@62mkv
Author

62mkv commented Aug 1, 2018

Then it's again an issue with the parser, because it stops on all such lines ("byte .. is not a valid utf-8 character")
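The failure can be reproduced outside the parser. The sketch below assumes, as the traceback in the first comment suggests, that each line is decoded on its own; the byte strings are hand-built stand-ins for the two halves of a split line, and `can_decode` is an illustrative helper, not a library function.

```python
def can_decode(chunk: bytes) -> bool:
    """Return True if the bytes are valid UTF-8 on their own."""
    try:
        chunk.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


# 'о' is 0xD0 0xBE in UTF-8. The first half ends with its lead byte,
# the second half starts with its continuation byte.
first_half = 'был'.encode('utf-8') + b'\xd0'
second_half = b'\xbe' + ' крайне'.encode('utf-8')

print(can_decode(first_half))                # each half alone is invalid
print(can_decode(second_half))
print(can_decode(first_half + second_half))  # joined bytes are valid again
```

Each half raises `UnicodeDecodeError` when decoded separately, while the concatenated bytes decode cleanly.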

joeyaurel added the bug and enhancement labels Aug 2, 2018
joeyaurel self-assigned this Aug 2, 2018
joeyaurel added this to the v1.0.0 milestone Sep 17, 2018
@joeyaurel
Owner

@62mkv and @KeithPetro the bug should be resolved by the current release v0.2.2dev. ✌

joeyaurel pushed a commit that referenced this issue Dec 11, 2018
Merge pull request #8 from nomadyow/develop