error when trying to process GEDCOM file with python-gedcom v. 0.2.0.dev #3
Comments
What happens if you use Edit: Also note that the current version of the project may not work with some aspects of GEDCOM 5.5.1 files (however it should definitely be able to deal with that first line). |
@KeithPetro Looking at the stack trace, the script does not even reach that statement, so I'm pretty sure it will change nothing. Regarding the format, FWIW the first line is identical, so it shouldn't fail there.
Taking a closer look, I found that the constructor hits this error on the last line of my GEDCOM file if I remove the EOL characters. Perhaps it's an issue with what EOL characters are present in your file and/or how they are handled? Edit: On a side note though, you should change your code to use Edit: I made a quick family tree on MyHeritage and exported it for testing. I am experiencing the exact same issue you are, on the first line. What I find odd is that it appears that the line ends in a Carriage-Return and then a Line-Feed, which is completely valid (and I had no issues with a file exported from Ancestry which also used CRLF). |
@nickreynke maybe you have an example of a GEDCOM file that this version can parse successfully? Can you share it? I would analyze the differences and could possibly adapt my file somehow, or update the code.
I've done a bit more testing and found that the first line in a regular GEDCOM file (like the one I have from Ancestry) should be simply (in byte representation):
Whereas the file from MyHeritage has:
0xEFBBBF is the BOM (Byte Order Mark) for UTF-8. This is outside of the GEDCOM spec, and I expect that any programs able to read these files have to implement out-of-spec workarounds specifically for MyHeritage files.
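For anyone who wants to check their own export, here is a minimal sketch of the BOM test described above. The helper name `starts_with_utf8_bom` and the sample byte strings are hypothetical and only for illustration; they are not part of python-gedcom, and the samples are not the actual bytes from either file.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the data begins with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# Illustrative first lines only: one with a BOM prepended, one without.
sample_with_bom = codecs.BOM_UTF8 + b"0 HEAD\r\n"
sample_plain = b"0 HEAD\r\n"

print(starts_with_utf8_bom(sample_with_bom))
print(starts_with_utf8_bom(sample_plain))
```

To test a real file, read its first few bytes in binary mode (`open(path, "rb").read(3)`) and pass them to the same helper.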
Indeed, the .ged file from MyHeritage has a BOM. Thanks!! Now it finally begins to parse. Either MyHeritage is shitty on formats or this is allowed in 5.5.1, but there are multiline entries in the exported .ged file, which break the parser (Line 32 of document violates GEDCOM format).
Changed decoding from 'utf-8' to 'utf-8-sig' in order to skip any BOM at start of file.
The MyHeritage export is ridiculous!! It even splits Unicode words in half, so that the first byte is at the end of line N and the second one is at the beginning of line N+1.
@nickreynke I have prepared a (very simple) fix for this. All that's required to ignore BOM at the start of a UTF-8 encoded file is to decode with Edit: Some further reading on Byte Order Marks and GEDCOM would be worthwhile, as currently this project seems to only handle UTF-8. In the future, it would be nice to be able to handle ANSEL, UTF-8 as well as UTF-16 in order to be fully compliant with GEDCOM 5.5.1 standards. GEDCOM 5.5 does not have any requirements for UTF-16. Further reading regarding character sets/encoding in GEDCOM. |
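A minimal sketch of the difference between the two standard Python codecs, assuming the file content has already been read as bytes. The sample line is illustrative, not taken from the MyHeritage export.

```python
import codecs

# Illustrative bytes: a UTF-8 BOM followed by a first line.
raw = codecs.BOM_UTF8 + "0 HEAD\r\n".encode("utf-8")

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character,
# so the line no longer starts with the level digit.
with_plain = raw.decode("utf-8")

# 'utf-8-sig' drops a leading BOM if one is present.
with_sig = raw.decode("utf-8-sig")

# 'utf-8-sig' is also safe for files without a BOM.
no_bom = "0 HEAD\r\n".encode("utf-8").decode("utf-8-sig")

print(repr(with_plain))
print(repr(with_sig))
print(repr(no_bom))
```

Because `utf-8-sig` leaves BOM-less input unchanged, it can be used for all UTF-8 files without special-casing MyHeritage exports.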
@62mkv I haven't experienced that issue. How are you testing that? |
For some reason, Ancestry.com was able to import the MyHeritage GEDCOM file without visible defects... @KeithPetro what do you mean by "how am I testing that"?
@62mkv What are you using that is showing you that the words are split? Ancestry's GEDCOM reading code is likely quite robust and allows for many variations (both valid and invalid) in GEDCOM files.
Like this one:
Cyrillic is weird but not THAT weird )) Those strange icons are just parts of a Unicode word split across different lines. If I remove the CRLF and the 3 CONC item, it turns into
As you can see, the two nonsensical chars are gone and a normal Cyrillic letter 'о' has appeared in their place.
(by "Unicode word" I mean a multi-byte Unicode sequence describing a single character; for Cyrillic in UTF-8 it's two bytes)
While weird, I think it is actually valid. According to the GEDCOM standard, the You could split a UTF-16 character mid-way and still properly concatenate it just fine. |
Then it's again an issue with the parser, because it stops on all such lines ("byte .. is not a valid utf-8 character").
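To illustrate why decoding each line separately fails while the concatenated value is fine, here is a rough sketch. It assumes the level and tag have already been stripped, so each fragment holds only the raw value bytes; `decode_fragments` is a hypothetical helper, not how python-gedcom actually handles this, and the surname is made-up sample data.

```python
def decode_fragments(fragments):
    """Join raw value bytes first, then decode the result once."""
    return b"".join(fragments).decode("utf-8")


# Made-up sample: every Cyrillic letter here takes two bytes in UTF-8.
encoded = "Иванов".encode("utf-8")

# Split in the middle of the third letter, like the export described above.
first_part, second_part = encoded[:5], encoded[5:]

# Decoding the first fragment on its own hits an incomplete sequence.
try:
    first_part.decode("utf-8")
    per_line_ok = True
except UnicodeDecodeError:
    per_line_ok = False

# Joining the bytes before decoding restores the original text.
joined = decode_fragments([first_part, second_part])

print(per_line_ok)
print(joined)
```

The point of the design is only the ordering: concatenate bytes, then decode, instead of decoding line by line.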
@62mkv and @KeithPetro the bug should be resolved by the current release |
Merge pull request #8 from nomadyow/develop
I've downloaded and installed python-gedcom v.0.2.0.dev. I run it as follows:
This GEDCOM file starts with
and I get the following error:
What am I doing wrong? This GEDCOM file was exported from MyHeritage recently.
UPD: this is with Python 3.6 under Windows 10 x64