error when trying to process GEDCOM file with python-gedcom v. 0.2.0.dev #3

62mkv · 2018-07-30T18:09:42Z

I've downloaded and installed python-gedcom v0.2.0.dev.

I run it as follows:

from gedcom import Gedcom

file_path = '7q4425_661384sh82b72570424am5.ged' # Path to your `.ged` file
gedcom = Gedcom(file_path)

print(gedcom.element_list())

This GEDCOM file starts with

0 HEAD
1 GEDC
2 VERS 5.5.1
2 FORM LINEAGE-LINKED

and I get the following error:

Traceback (most recent call last):
  File "script.py", line 4, in <module>
    gedcom = Gedcom(file_path)
  File "C:\Python3\lib\site-packages\gedcom\__init__.py", line 148, in __init__
    self.__parse(file_path)
  File "C:\Python3\lib\site-packages\gedcom\__init__.py", line 224, in __parse
    last_element = self.__parse_line(line_number, line.decode('utf-8'), last_element)
  File "C:\Python3\lib\site-packages\gedcom\__init__.py", line 263, in __parse_line
    raise SyntaxError(error_message)
SyntaxError: Line `1` of document violates GEDCOM format
See: http://homepages.rootsweb.ancestry.com/~pmcbride/gedcom/55gctoc.htm

What am I doing wrong? This GEDCOM file was recently exported from MyHeritage.

UPD: this is with Python 3.6 under Windows 10 x64


KeithPetro · 2018-07-31T00:53:03Z

What happens if you use gedcom.get_element_list() in place of gedcom.element_list()?

Edit: Also note that the current version of the project may not work with some aspects of GEDCOM 5.5.1 files (however it should definitely be able to deal with that first line).

62mkv · 2018-07-31T03:32:53Z

@KeithPetro by looking at the stack trace, one can see that the script does not even reach that statement, so I'm pretty sure it will change nothing

Regarding the format, FWIW the first line is identical, so it shouldn't fail there

KeithPetro · 2018-07-31T05:24:09Z

Taking a closer look, I found that the constructor hits this error on the last line of my GEDCOM file if I remove the EOL characters. Perhaps it's an issue with what EOL characters are present in your file and/or how they are handled?

Edit: On a side note though, you should change your code to use gedcom.get_element_list(), as there is no element_list() method for the Gedcom class.

Edit: I made a quick family tree on MyHeritage and exported it for testing. I am experiencing the exact same issue you are, on the first line. What I find odd is that it appears that the line ends in a Carriage-Return and then a Line-Feed, which is completely valid (and I had no issues with a file exported from Ancestry which also used CRLF).

62mkv · 2018-08-01T17:14:13Z

@nickreynke do you have an example of a GEDCOM file that this version can parse successfully? Could you share it? I would analyze the difference and then either adapt my file or update the code.

KeithPetro · 2018-08-01T18:31:25Z

I've done a bit more testing and found that the first line in a regular GEDCOM file (like the one I have from Ancestry) should be simply (in byte representation):

b'0 HEAD\n'

Whereas the file from MyHeritage has:

b'\xef\xbb\xbf0 HEAD\r\n'

0xEFBBBF is the BOM (Byte Order Mark) for UTF-8. This is outside of the GEDCOM spec, and I expect that any program able to read these files has to implement an out-of-spec workaround specifically for MyHeritage files.
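
A quick way to check a file for this, as a minimal sketch: compare the raw bytes of the first line against `codecs.BOM_UTF8` from the Python standard library. The helper name `starts_with_utf8_bom` is hypothetical and is not part of python-gedcom.

```python
import codecs


def starts_with_utf8_bom(first_line: bytes) -> bool:
    """Return True if the raw bytes begin with the UTF-8 BOM (EF BB BF)."""
    return first_line.startswith(codecs.BOM_UTF8)


# The two first lines quoted above, as raw bytes.
ancestry_line = b'0 HEAD\n'
myheritage_line = b'\xef\xbb\xbf0 HEAD\r\n'

print(starts_with_utf8_bom(ancestry_line))
print(starts_with_utf8_bom(myheritage_line))
```

On a real file, the first line could be obtained with something like `open(path, 'rb').readline()`.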

62mkv · 2018-08-01T18:42:45Z

Indeed, the .ged file from MyHeritage has BOM

Thanks!! Now it finally begins to parse. Either MyHeritage is shitty with formats or this is allowed in 5.5.1, but there are multiline entries in the exported .ged file, which break the parser (Line 32 of document violates GEDCOM format)

Changed decoding from 'utf-8' to 'utf-8-sig' in order to skip any BOM at start of file.

62mkv · 2018-08-01T18:57:23Z

The MyHeritage export is ridiculous!! It even splits Unicode words in half, so that the first byte is at the end of line N and the second one is at the beginning of line N+1!!

KeithPetro · 2018-08-01T18:58:49Z

@nickreynke I have prepared a (very simple) fix for this. All that's required to ignore BOM at the start of a UTF-8 encoded file is to decode with 'utf-8-sig' instead of 'utf-8'.
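
To illustrate the difference between the two codecs, here is a small sketch using the MyHeritage first line quoted earlier in this thread. It only uses `bytes.decode` with the standard `'utf-8'` and `'utf-8-sig'` codecs; the variable names are illustrative.

```python
# First line of the MyHeritage export, as raw bytes (BOM + CRLF).
raw_first_line = b'\xef\xbb\xbf0 HEAD\r\n'

# Plain 'utf-8' keeps the BOM as the character U+FEFF at the start,
# so the line no longer begins with the level digit '0'.
with_bom = raw_first_line.decode('utf-8')

# 'utf-8-sig' drops a leading BOM if one is present.
without_bom = raw_first_line.decode('utf-8-sig')

print(repr(with_bom))
print(repr(without_bom))

# Lines without a BOM decode the same way under both codecs.
plain_line = b'1 GEDC\r\n'
print(plain_line.decode('utf-8-sig') == plain_line.decode('utf-8'))
```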

Edit: Some further reading on Byte Order Marks and GEDCOM would be worthwhile, as currently this project seems to only handle UTF-8. In the future, it would be nice to be able to handle ANSEL, UTF-8 as well as UTF-16 in order to be fully compliant with GEDCOM 5.5.1 standards. GEDCOM 5.5 does not have any requirements for UTF-16.

Further reading regarding character sets/encoding in GEDCOM.
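
As a rough sketch of what BOM-based encoding detection might look like (an assumption on my part, not something this project implements): inspect the first bytes of the file and pick a Python codec name. The function `sniff_bom_encoding` is hypothetical. ANSEL is not covered, since the Python standard library ships no ANSEL codec, and a file without a BOM simply falls back to the default.

```python
import codecs


def sniff_bom_encoding(head: bytes, default: str = 'utf-8') -> str:
    """Guess a Python codec name from a leading BOM, else return `default`."""
    if head.startswith(codecs.BOM_UTF8):
        # 'utf-8-sig' skips the BOM while decoding.
        return 'utf-8-sig'
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        # The 'utf-16' codec consumes the BOM and uses it to pick the byte order.
        return 'utf-16'
    return default


for sample in (b'\xef\xbb\xbf0 HEAD\r\n', '0 HEAD\n'.encode('utf-16'), b'0 HEAD\n'):
    encoding = sniff_bom_encoding(sample)
    print(encoding, repr(sample.decode(encoding)))
```

Note that the returned codec is meant for decoding the file as a whole; splitting a UTF-16 file into lines at the byte level before decoding would need extra care.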

KeithPetro · 2018-08-01T18:59:49Z

@62mkv I haven't experienced that issue. How are you testing that?

62mkv · 2018-08-01T19:13:16Z

for some reason, Ancestry.com was able to import MyHeritage GEDCOM file without visible defects...

@KeithPetro what do you mean by "how am I testing that"?

KeithPetro · 2018-08-01T19:16:02Z

@62mkv What are you using that is showing you that the words are split?

Ancestry's GEDCOM reading code is likely quite robust and tolerates many variations (both valid and invalid) in GEDCOM files.

62mkv · 2018-08-01T19:26:30Z

Like this one:

2 NOTE Назвали сына Кирюшей дружно и с надеждой, что потом появится Фёдор Кириллович....  В поколении сына имя его был�
3 CONC � крайне редким и сам он любил имя дружка своего Костя В 2007 у них и появится Костя  И в это время имя Кирилла ста

Cyrillic is weird but not THAT weird )) Those strange icons are just the two parts of a Unicode word split across different lines. If I remove the CRLF and the 3 CONC item, it turns into

2 NOTE Назвали сына Кирюшей дружно и с надеждой, что потом появится Фёдор Кириллович....  В поколении сына имя его было крайне редким и сам он любил имя дружка своего Костя В 2007 у них и появится Костя  И в это время имя Кирилла ста

As you can see, the two nonsensical chars are gone and a normal Cyrillic letter 'о' has appeared in their place

62mkv · 2018-08-01T19:29:34Z

(by "Unicode word" I mean a multi-byte Unicode sequence, describing single character; for Cyrillic it's two bytes)

KeithPetro · 2018-08-01T19:38:55Z

While weird, I think it is actually valid.

According to the GEDCOM standard, the CONC tag signifies concatenation: its value is joined to the value of the preceding line without a space and without the EOL characters.

You could split a UTF-16 character mid-way and still properly concatenate it just fine.

62mkv · 2018-08-01T19:44:39Z

Then it's again an issue with the parser, because it stops on all such lines ("byte .. is not a valid utf-8 character")
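
One possible workaround, sketched under assumptions: join each CONC payload to the previous line at the byte level and decode only afterwards, so a character split across two lines becomes whole again before decoding. The helper `join_conc_bytes` is hypothetical, assumes a simplified `<level> <TAG> <value>` line layout with no cross-reference id on CONC lines, and does not describe how python-gedcom itself handles this.

```python
def join_conc_bytes(raw_lines):
    """Append the payload of each CONC line to the previous line, in bytes.

    Line terminators are dropped. Non-CONC lines are kept as separate items.
    """
    merged = []
    for raw in raw_lines:
        line = raw.rstrip(b'\r\n')
        parts = line.split(b' ', 2)  # level, tag, optional value
        if len(parts) >= 2 and parts[1] == b'CONC' and merged:
            payload = parts[2] if len(parts) == 3 else b''
            merged[-1] += payload
        else:
            merged.append(line)
    return merged


# 'о' (U+043E) is b'\xd0\xbe' in UTF-8; here it is split across two lines,
# mimicking the NOTE/CONC pair shown earlier in the thread.
raw_lines = [
    '2 NOTE был'.encode('utf-8') + b'\xd0\r\n',
    b'3 CONC \xbe' + ' крайне'.encode('utf-8') + b'\r\n',
]

try:
    raw_lines[0].decode('utf-8')
except UnicodeDecodeError as error:
    print('per-line decoding fails:', error.reason)

merged = join_conc_bytes(raw_lines)
print(merged[0].decode('utf-8'))
```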

joeyaurel · 2018-11-19T19:32:05Z

@62mkv and @KeithPetro the bug should be resolved by the current release v0.2.2dev. ✌

Merge pull request #8 from nomadyow/develop

KeithPetro added a commit to KeithPetro/python-gedcom that referenced this issue Aug 1, 2018

Allow for BOM at beginning of files (Issue joeyaurel#3)

ab0194d

Changed decoding from 'utf-8' to 'utf-8-sig' in order to skip any BOM at start of file.

joeyaurel added the bug and enhancement labels Aug 2, 2018

joeyaurel self-assigned this Aug 2, 2018

joeyaurel added this to the v1.0.0 milestone Sep 17, 2018

joeyaurel mentioned this issue Nov 19, 2018

Is this an error in the ged? #4

Closed

joeyaurel pushed a commit that referenced this issue Dec 11, 2018

Merge pull request #3 from nickreynke/develop

849ce89

Merge pull request #8 from nomadyow/develop

joeyaurel closed this as completed Dec 15, 2018
