Importer for MS-Word narratives (Phase 1) #727

Open
IanMayo opened this issue Dec 18, 2020 · 13 comments · May be fixed by #730

@IanMayo
Member

IanMayo commented Dec 18, 2020

Title

Fleet Narrative document.

Supports

#725 File Importers

🐞 Sample

Narrative Example.docx.zip

This example lacks the headers/footers, but includes idiosyncrasies related to wrong dates & missing fields.
test_narrative.doc.zip

💾 Schema

Here is a sample of the document:
image

Some information is in the document header:

  • Platform name
  • Privacy value (VERY PRIVATE in this example - we have to use a case-insensitive check against existing values)
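
A minimal sketch of that case-insensitive check, in Python. The function name `match_privacy` and the sample list of stored values are hypothetical, and collapsing repeated whitespace is an extra assumption on my part rather than a stated requirement.

```python
from typing import Iterable, Optional


def match_privacy(header_value: str, known_values: Iterable[str]) -> Optional[str]:
    """Return the stored privacy name that matches header_value, ignoring case.

    Returns None when nothing matches, so the caller can prompt the analyst.
    """
    # Collapse runs of whitespace and fold case before comparing.
    wanted = " ".join(header_value.split()).casefold()
    for known in known_values:
        if " ".join(known.split()).casefold() == wanted:
            return known
    return None


# Hypothetical stored values, for illustration only.
stored = ["Public", "Very Private"]
print(match_privacy("VERY PRIVATE", stored))
```

Returning the stored spelling rather than the header spelling keeps the database values as the single source of truth.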

The document is a series of time-stamped comments. Each timestamp is in DDHHMM format, and each line also carries extra metadata as hidden text.

image

The hidden text includes these fields:

  • Day/Month/Year
  • Name of platform
  • Category of comment
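
As a rough illustration of how the visible DDHHMM value and the hidden date could be combined: the sketch below assumes the hidden day/month/year have already been extracted as integers (the hidden-text layout is only shown as an image here, so no parsing of it is attempted). The name `combine_timestamp` is hypothetical, and taking the day from the visible timestamp while flagging a mismatch is one possible policy, not an agreed one.

```python
import re
from datetime import datetime
from typing import Tuple


def combine_timestamp(dtg: str, hidden_day: int, hidden_month: int,
                      hidden_year: int) -> Tuple[datetime, bool]:
    """Build a datetime from a visible DDHHMM string plus the hidden month/year.

    The second item is False when the visible day disagrees with the hidden
    day, so the caller can warn the analyst about a possible copy/paste error.
    """
    if re.fullmatch(r"[0-9]{6}", dtg) is None:
        raise ValueError(f"Expected a six-digit DDHHMM value, got {dtg!r}")
    day, hour, minute = int(dtg[0:2]), int(dtg[2:4]), int(dtg[4:6])
    # Assumption: the visible day wins; month and year come from hidden text.
    timestamp = datetime(hidden_year, hidden_month, day, hour, minute)
    return timestamp, day == hidden_day


stamp, consistent = combine_timestamp("101956", 10, 12, 2020)
print(stamp.isoformat(), consistent)
```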

The body of the document is mostly comment entries, though some also contain other data:

  • 58 34N 001 36W 089@12 - State lat, long, course (degs), speed (knots)
  • FCS, HMS Nelson, B:158 R:3700 Track:TRK_34 - Contact bearing (degs), range (yards), track number. I'm waiting for an example of this.
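
Until a real example arrives, here is a tentative Python sketch for the contact entry, based only on the single string quoted above. The `B:`, `R:` and `Track:` labels come from that string; the tolerance for optional spaces and decimals, and the names `Contact` and `parse_fcs`, are my assumptions.

```python
import re
from typing import NamedTuple, Optional


class Contact(NamedTuple):
    bearing_degs: float
    range_yards: float
    track_number: str


# Pattern inferred from one example only: "B:158 R:3700 Track:TRK_34".
FCS_PATTERN = re.compile(
    r"B:\s*(?P<bearing>\d+(?:\.\d+)?)\s+"
    r"R:\s*(?P<range>\d+(?:\.\d+)?)\s+"
    r"Track:\s*(?P<track>\S+)"
)


def parse_fcs(text: str) -> Optional[Contact]:
    """Extract bearing, range and track number, or None if they are absent."""
    match = FCS_PATTERN.search(text)
    if match is None:
        return None
    return Contact(float(match["bearing"]), float(match["range"]), match["track"])


print(parse_fcs("FCS, HMS Nelson, B:158 R:3700 Track:TRK_34"))
```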

😕 Idiosyncrasies

  • Occasionally this has been exported to PDF.
  • It frequently has tracked changes switched on. Hopefully we can either accept all changes at start of import, or recognise there are "pending" changes and ask the analyst to open the doc in MS-Word and accept the changes themselves.
  • The data is human generated. There are some horrendous idiosyncrasies, particularly where an entry is copied and pasted from one place to another and the time gets updated but the hidden text doesn't.
  • Ian maintains a Java importer for this format. It captures some lessons learned about these idiosyncrasies. Source here
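
On the tracked-changes point, one possible way to recognise "pending" changes in a .docx without opening Word is to look inside the package for revision elements. This is a sketch under assumptions: it only inspects `word/document.xml`, it only looks for insertions (`w:ins`) and deletions (`w:del`), so moves and formatting changes would go unnoticed, and the function name `has_tracked_changes` is hypothetical. It does not apply to .doc or .pdf files.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace used by .docx body XML.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def has_tracked_changes(docx_file) -> bool:
    """Return True if the document body holds pending insertions or deletions.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        body = archive.read("word/document.xml")
    root = ET.fromstring(body)
    for tag in ("ins", "del"):
        # next(..., None) avoids relying on Element truthiness.
        if next(root.iter(f"{{{W_NS}}}{tag}"), None) is not None:
            return True
    return False
```

If this returns True, the importer could stop and ask the analyst to accept the changes in MS-Word before trying again.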

Positional data

Analysts have looked through a range of historic documents, and identified several examples of how position is recorded.

110000  Ships Pos 00 00.0N 000 00.0E C-000 S-00
101956  Ships Position: 0000.0N 0000.0E C-000 S-00
102006 EMAT Position 00 00.0N 000 00.0E
280940  PWO Comment
        RV with US TG 00.0 in position 00 00N 000 00E

Mike has come back to me with guidance on this position handling. The EMAT position is of value to the analysts, but it's handled in a separate process, and doesn't need to go into Pepys as a position. So, we'll just handle ownship positions. The pattern for this is:

Recognise a position when the comment starts with any of:
- Ship, Ships, Ownship
followed by:
- Pos, Position
Then a lat/long expressed in:
- degs, degs-mins, or degs-mins-secs
Then an optional course/speed pair expressed as:
- C-000 S-00
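
The pattern above could be sketched as a Python regex along these lines. Assumptions: matching is case-insensitive, a colon after Pos/Position is optional, and the parts of a coordinate are separated by spaces, so the unspaced `0000.0N 0000.0E` form from the second example is not handled here. The names `OWNSHIP_POSITION` and `parse_ownship_position` are hypothetical.

```python
import re
from typing import Dict, Optional


def _coord(name: str, hemispheres: str) -> str:
    """Regex for degs, degs-mins or degs-mins-secs followed by a hemisphere."""
    return (
        rf"(?P<{name}_deg>\d{{1,3}}(?:\.\d+)?)"
        rf"(?:\s+(?P<{name}_min>\d{{1,2}}(?:\.\d+)?))?"
        rf"(?:\s+(?P<{name}_sec>\d{{1,2}}(?:\.\d+)?))?"
        rf"\s*(?P<{name}_hem>[{hemispheres}])"
    )


OWNSHIP_POSITION = re.compile(
    r"(?:Ownship|Ships?)\s+(?:Position|Pos)\s*:?\s*"
    + _coord("lat", "NS")
    + r"\s+"
    + _coord("lon", "EW")
    + r"(?:\s+C-(?P<course>\d{1,3})\s+S-(?P<speed>\d+(?:\.\d+)?))?",
    re.IGNORECASE,
)


def _to_degrees(match, name: str) -> float:
    """Convert the captured parts to signed decimal degrees."""
    value = float(match[f"{name}_deg"])
    value += float(match[f"{name}_min"] or 0) / 60
    value += float(match[f"{name}_sec"] or 0) / 3600
    return -value if match[f"{name}_hem"].upper() in "SW" else value


def parse_ownship_position(comment: str) -> Optional[Dict[str, Optional[float]]]:
    """Return lat/lon (and course/speed if present), or None if not ownship."""
    match = OWNSHIP_POSITION.match(comment.strip())
    if match is None:
        return None
    return {
        "lat": _to_degrees(match, "lat"),
        "lon": _to_degrees(match, "lon"),
        "course": float(match["course"]) if match["course"] else None,
        "speed": float(match["speed"]) if match["speed"] else None,
    }


print(parse_ownship_position("Ships Pos 00 00.0N 000 00.0E C-000 S-00"))
```

Because the match is anchored at the start of the comment, an "EMAT Position ..." entry is not picked up, which fits the decision to handle only ownship positions.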
@robintw
Collaborator

robintw commented Dec 21, 2020

Thanks @IanMayo. I've been looking through some of the Java code for this format, and I think it will be helpful when implementing it in Python - particularly if we can translate some of the unit tests.

Quick question: it seems like the Debrief code supports imports from .doc, .docx and .pdf files. Do we need to support all of those formats? Or as .doc is very old now, can we ignore that one? (I suspect the answer is that we still need to support it, but I thought I'd ask!)

@IanMayo
Member Author

IanMayo commented Dec 21, 2020

Sure. Yes - it would be acceptable to not support .doc. But, I know some platforms convert to .pdf before submitting their data.

@robintw
Collaborator

robintw commented Jan 6, 2021

@IanMayo Do you know how the hidden text is managed in the PDF versions of these documents? I've just done a simple export of the example document to PDF (from within Word), and we just get the normally visible text:

image

This means that we don't have any information on the date for each line - and no information seems to be given about the date anywhere else either. Even if I change the Word settings to display hidden text, it doesn't seem to be exported into the PDF.

Any thoughts?

@robintw
Collaborator

robintw commented Jan 6, 2021

Also, if you're able to get an example document containing the FCS ('contact') entries then that would be great. I've found various example test entries in the Java unit tests, but it'd be good to have a proper document.

@IanMayo
Member Author

IanMayo commented Jan 6, 2021

> @IanMayo Do you know how the hidden text is managed in the PDF versions of these documents? I've just done a simple export of the example document to PDF (from within Word), and we just get the normally visible text:
> This means that we don't have any information on the date for each line - and no information seems to be given about the date anywhere else either. Even if I change the Word settings to display hidden text, it doesn't seem to be exported into the PDF.

I've just looked at my sample data. I do have a sample PDF, and see that it is missing the hidden data.
FCS_narrative_no_metadata.pdf

@IanMayo
Member Author

IanMayo commented Jan 6, 2021

Here's a MS-Word document with FCS (Fire Control Solution) data:
FCS_extra_narrativetypes.doc.zip

@robintw
Collaborator

robintw commented Jan 11, 2021

@IanMayo How do you suggest we handle PDFs that don't have the hidden fields in them? Without the hidden fields we don't have dates to use for the timestamps.

@IanMayo
Member Author

IanMayo commented Jan 11, 2021

While the unicorn example doesn't include it, I have seen month/year markers.

image

I'll check with the clients if these markers were present in the unicorn example that got sanitised.

robintw linked a pull request on Jan 11, 2021 that will close this issue
6 tasks
@robintw
Collaborator

robintw commented Jan 12, 2021

Hi @IanMayo - a few more questions about this importer:

  1. Some of the example Word documents I found don't seem to have a header. Given that, do we need to be able to handle Word docs without a header? That would mean no information on privacy, or vessel name - so we'd need to prompt for them.

  2. I found some code in the Java implementation that handles the month/year markers that you mentioned in your previous comment. I'll make sure I support those.

  3. The Java code deals with situations where the final comma after the Message Type field is missed out, and therefore the Message Type field runs straight into the actual text of the message. However, it does this by looking for the first space and splitting there (see https://github.com/debrief/debrief/blob/develop/org.mwc.debrief.legacy/src/Debrief/ReaderWriter/Word/ImportNarrativeDocument.java#L485). The example file we have has message types with spaces in them (CO Comment etc), so this won't work properly. In my code I look for a tab instead (as that's what we seem to have splitting the two fields) - but this might need updating in the Java code.

  4. The messages that include the State information (eg. 58 34N 001 36W 089@12) - would this information always be the full content of a message, or could it be included in a longer message (eg. Blah blah we did something 58 34N 001 36W 089@12)? I can't seem to see any Java code that processes this, and I think I'm going to need to play around with regexes to try and match it. Unfortunately there doesn't seem to be a specific Message Type that contains this information - it would be really useful if there was, but oh well.
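
To illustrate item 3, here is a small sketch of splitting on the tab rather than the first space. It assumes the text after the hidden fields looks like `CO Comment<TAB>message`, as in the example file; the function name `split_type_and_text` and the sample message are hypothetical.

```python
from typing import Optional, Tuple


def split_type_and_text(remainder: str) -> Tuple[Optional[str], str]:
    """Split 'Message Type<TAB>message text' on the first tab.

    Returns (None, text) when there is no tab, so the caller can fall back
    to another strategy.
    """
    message_type, separator, text = remainder.partition("\t")
    if not separator:
        return None, remainder.strip()
    # Drop a trailing comma in case it was present after all.
    return message_type.strip().rstrip(",").strip(), text.strip()


line = "CO Comment\tWeather worsening"
print(split_type_and_text(line))   # tab split keeps "CO Comment" intact
print(line.split(" ", 1)[0])       # first-space split truncates it to "CO"
```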

@IanMayo
Member Author

IanMayo commented Jan 12, 2021

  1. Header. I'll ask the question. There's a chance the sample files are missing the header just because they're mock data.

  2. Yes. I had confirmation last night that for each file the analyst looked in, the content we're looking for started with the day/month/year header. I saw documents a couple of years ago where that marker didn't appear until after a hundred pages of other content. But, since the documents are "tidied" by a human before being sent, we have to remain wary that the marker may be missing, in which case we have to ask the user for the month/year.

  3. Tab separator. Thanks for the tip.

  4. I am surprised to see that location in the General Comment. I don't think I've seen it before. Knowing location is really valuable, but normally the ships rely on the (multiple) other systems to capture it. So. Yes, let's grab this State element when it appears in a General Comment. But, we need to be quite relaxed in the formatting, since it's free-typed into MS-Word. (Many other inputs are entered via popup dialogs, which gives some consistency in formatting). So, we need to allow for the presence of a degree symbol, the presence of seconds. The last two elements are course (degs) and speed (kts). I guess we need to allow for Kts or kts being present too. Actually - I'll ask them to look at a wider sample, and see how often location is supplied in this way.
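
A first attempt at a relaxed State pattern for item 4, as a starting point for tuning once more examples arrive. Assumptions: parts of a coordinate are separated by whitespace or by a °, ' or " symbol, course and speed are joined by `@`, and an optional kt/kts suffix in any case is allowed. The names `STATE` and `find_state` are hypothetical.

```python
import re
from typing import Dict, Optional

_NUM = r"\d+(?:\.\d+)?"
_SEP = r"(?:\s*[°'\"]\s*|\s+)"  # a unit symbol or plain whitespace


def _coord(name: str, hemispheres: str) -> str:
    """Degrees, optional minutes, optional seconds, then a hemisphere letter."""
    return (
        rf"(?P<{name}_deg>\d{{1,3}})"
        rf"(?:{_SEP}(?P<{name}_min>{_NUM})"
        rf"(?:{_SEP}(?P<{name}_sec>{_NUM}))?)?"
        rf"\s*[°'\"]?\s*(?P<{name}_hem>[{hemispheres}])"
    )


STATE = re.compile(
    r"(?<!\d)" + _coord("lat", "NS") + r"\s+" + _coord("lon", "EW")
    + rf"\s+(?P<course>\d{{1,3}})\s*°?\s*@\s*(?P<speed>{_NUM})\s*(?:kts?)?",
    re.IGNORECASE,
)


def _to_degrees(match, name: str) -> float:
    value = float(match[f"{name}_deg"])
    value += float(match[f"{name}_min"] or 0) / 60
    value += float(match[f"{name}_sec"] or 0) / 3600
    return -value if match[f"{name}_hem"].upper() in "SW" else value


def find_state(message: str) -> Optional[Dict[str, float]]:
    """Search anywhere in a free-text message for lat, long, course and speed."""
    match = STATE.search(message)
    if match is None:
        return None
    return {
        "lat": _to_degrees(match, "lat"),
        "lon": _to_degrees(match, "lon"),
        "course_degs": float(match["course"]),
        "speed_kts": float(match["speed"]),
    }


print(find_state("Blah blah we did something 58 34N 001 36W 089@12"))
```

Using `search` rather than `match` lets the State appear inside a longer comment, at the cost of a higher risk of false positives in number-heavy text.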

@robintw
Collaborator

robintw commented Jan 12, 2021

Thanks @IanMayo.

For item 4: yes, the more examples we can get of how location is specified, the better. I can see this being a bit tricky to parse reliably, so lots of examples will help tune the regex.

@IanMayo
Member Author

IanMayo commented Jan 18, 2021

  1. Header. All of the narrative documents found had the matching header expressed as a table. But, there were also some "composite" documents, where the narrative is included as an appendix, missing the narrative header.

@IanMayo
Member Author

IanMayo commented Aug 12, 2021

There may be merit in combining this work with any client-code that parses/handles narrative files.

In that way we can combine the Pepys import logic from this work with the special case handling of the client-code.


2 participants