Importer for MS-Word narratives (Phase 1) #727
Comments
Thanks @IanMayo. I've been looking through some of the Java code for this format, and I think it will be helpful when implementing it in Python - particularly if we can translate some of the unit tests. Quick question: it seems like the Debrief code supports imports from .doc, .docx and .pdf files. Do we need to support all of those formats? Or as .doc is very old now, can we ignore that one? (I suspect the answer is that we still need to support it, but I thought I'd ask!)
Sure. Yes - it would be acceptable to not support
@IanMayo Do you know how the hidden text is managed in the PDF versions of these documents? I've just done a simple export of the example document to PDF (from within Word), and we just get the normally visible text. This means that we don't have any information on the date for each line - and no information seems to be given about the date anywhere else either. Even if I change the Word settings to display hidden text, it doesn't seem to be exported into the PDF. Any thoughts?
Also, if you're able to get an example document containing the FCS ('contact') entries then that would be great. I've found various example test entries in the Java unit tests, but it'd be good to have a proper document.
I've just looked at my sample data. I do have a sample PDF, and see that it is missing the hidden data.
Here's a MS-Word document with FCS (Fire Control Solution) data:
@IanMayo How do you suggest we handle PDFs that don't have the hidden fields in them? Without the hidden fields we don't have dates to use for the timestamps.
Hi @IanMayo - a few more questions about this importer:
Thanks @IanMayo. For item 4: yes, the more examples we can get of how location is specified, the better. I can see this being a bit tricky to parse reliably, so lots of examples will help tune the regex.
There may be merit in combining this work with any client-code that parses/handles narrative files. In that way we can combine the Pepys import logic from this work with the special case handling of the client-code. |
Title
Fleet Narrative document.
Supports
#725 File Importers
🐞 Sample
Narrative Example.docx.zip
This example lacks the headers/footers, but includes idiosyncrasies related to wrong dates & missing fields.
test_narrative.doc.zip
💾 Schema
Here is a sample of the document:
Some information is in the document header:
VERY PRIVATE (in this example - we have to use a case-insensitive check against existing values)
The document is a series of time-stamped comments. The timestamp is DDHHMM. But each line has extra metadata present as hidden text.
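As a rough illustration of the DDHHMM timestamp described above, here is a minimal Python sketch. The function name `parse_ddhhmm` and its `year` and `month` arguments are hypothetical: the visible stamp only carries day, hour and minute, so the rest of the date is assumed to come from elsewhere (presumably the hidden metadata).

```python
import re
from datetime import datetime

# Hypothetical helper: the name and the year/month arguments are assumptions.
# The issue only states that the visible timestamp has the form DDHHMM.
DDHHMM = re.compile(r"^(\d{2})(\d{2})(\d{2})$")


def parse_ddhhmm(stamp, year, month):
    """Combine a DDHHMM stamp with a year and month supplied by the caller."""
    match = DDHHMM.match(stamp.strip())
    if match is None:
        raise ValueError(f"Not a DDHHMM timestamp: {stamp!r}")
    day, hour, minute = (int(part) for part in match.groups())
    # datetime() itself raises ValueError for impossible values such as day 32
    return datetime(year, month, day, hour, minute)
```

For example, `parse_ddhhmm("041205", 2019, 7)` yields 12:05 on 4 July 2019. Letting `datetime` reject impossible values is one way to surface the wrong-date idiosyncrasies mentioned for the sample document, rather than silently storing them.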
The hidden text includes these fields:
The body of the document is mostly comment entries, though some also contain other data:
58 34N 001 36W 089@12 - State lat, long, course (degs), speed (knots)
FCS, HMS Nelson, B:158 R:3700 Track:TRK_34 - Contact bearing (degs), range (yards), track number. I'm waiting for an example of this.
😕 Idiosyncrasies
Positional data
Analysts have looked through a range of historic documents, and identified a range of examples for how position is recorded.
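To make the discussion concrete, here is a minimal Python sketch that parses the one state example given in the schema (58 34N 001 36W 089@12). It is a starting point only: the regex, the `parse_state` name and the returned dictionary keys are assumptions, as is the tolerance for decimal minutes, and real documents are expected to vary more than this single pattern allows.

```python
import re

# Assumed layout, taken from the single schema example:
#   DD MM[NS] DDD MM[EW] CCC@SS   (degrees, minutes, course, speed)
STATE = re.compile(
    r"(?P<lat_deg>\d{2})\s+(?P<lat_min>\d{2}(?:\.\d+)?)\s*(?P<lat_hem>[NS])\s+"
    r"(?P<lon_deg>\d{3})\s+(?P<lon_min>\d{2}(?:\.\d+)?)\s*(?P<lon_hem>[EW])\s+"
    r"(?P<course>\d{1,3})\s*@\s*(?P<speed>\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def parse_state(text):
    """Return lat/lon in decimal degrees plus course and speed, or None."""
    match = STATE.search(text)
    if match is None:
        return None
    lat = int(match["lat_deg"]) + float(match["lat_min"]) / 60
    lon = int(match["lon_deg"]) + float(match["lon_min"]) / 60
    # Southern and western hemispheres are negative in decimal degrees
    if match["lat_hem"].upper() == "S":
        lat = -lat
    if match["lon_hem"].upper() == "W":
        lon = -lon
    return {
        "lat": lat,
        "lon": lon,
        "course": float(match["course"]),  # degrees
        "speed": float(match["speed"]),  # knots
    }
```

Returning `None` for non-matching text lets the caller fall back to treating the line as a plain comment entry, which matches the observation that most of the document body is comments.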
Mike has come back to me with guidance on this position handling. The EMAT position is of value to the analysts, but it's handled in a separate process, and doesn't need to go into Pepys as a position. So, we'll just handle ownship positions. The pattern for this is: