Importer for MS-Word narratives (Phase 1) #727

IanMayo · 2020-12-18T14:22:11Z

Title

Fleet Narrative document.

Supports

🐞 Sample

This example lacks the headers/footers, but includes idiosyncrasies related to wrong dates & missing fields.
test_narrative.doc.zip

💾 Schema

Here is a sample of the document:

Some information is in the document header:

Platform name
Privacy value (VERY PRIVATE in this example; we need a case-insensitive check against existing values)
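
As a rough illustration of the case-insensitive check, here is a minimal sketch. The function name `match_privacy` and the list of known privacy names are hypothetical; in practice the existing values would come from the database rather than a hard-coded list.

```python
def match_privacy(header_value, known_privacies):
    """Return the stored privacy name that matches header_value, ignoring case.

    Returns None when nothing matches, so the caller can prompt the user.
    """
    # Build a lookup keyed on the case-folded name (hypothetical helper)
    lookup = {name.casefold(): name for name in known_privacies}
    return lookup.get(header_value.strip().casefold())


known = ["Public", "Private", "Very Private"]  # made-up existing values
print(match_privacy("VERY PRIVATE", known))
```

Returning the stored spelling rather than the header's spelling avoids creating a second privacy entry that differs only in case.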

The document is a series of time-stamped comments. The timestamp format is DDHHMM, but each line also carries extra metadata as hidden text.

The hidden text includes these fields:

Day/Month/Year
Name of platform
Category of comment
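
To show how the visible DDHHMM stamp and the hidden date could be combined, here is a minimal sketch. It assumes the hidden day, month and year have already been extracted as integers (the exact layout of the hidden text is not shown above), and it assumes the day is taken from the visible stamp. The name `build_timestamp` is hypothetical.

```python
import re
from datetime import datetime

DTG = re.compile(r"^(\d{2})(\d{2})(\d{2})$")


def build_timestamp(dtg, hidden_day, hidden_month, hidden_year):
    """Return (timestamp, days_agree) for a visible DDHHMM stamp and a hidden date.

    The day comes from the visible stamp (an assumption); days_agree is False
    when the hidden day differs, so the caller can warn the analyst.
    """
    match = DTG.match(dtg.strip())
    if match is None:
        raise ValueError(f"not a DDHHMM timestamp: {dtg!r}")
    day, hour, minute = (int(part) for part in match.groups())
    # datetime() raises ValueError for impossible values such as hour 25
    timestamp = datetime(hidden_year, hidden_month, day, hour, minute)
    return timestamp, day == hidden_day


print(build_timestamp("280940", 28, 12, 2020))
```

The `days_agree` flag is one way to surface entries where the visible time and the hidden date disagree, rather than silently trusting either.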

The body of the document is mostly comment entries, though some also contain other data:

58 34N 001 36W 089@12 - State lat, long, course (degs), speed (knots)
FCS, HMS Nelson, B:158 R:3700 Track:TRK_34 - Contact bearing (degs), range (yards), track number. I'm waiting for an example of this.
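
The two entry types above could be picked out with regular expressions along these lines. This is a sketch that only covers the exact layouts shown in the two examples; the function names and the dictionary keys are hypothetical, and real documents are likely to need looser patterns.

```python
import re

# Matches e.g. "58 34N 001 36W 089@12"
STATE = re.compile(
    r"(?P<lat_d>\d{2}) (?P<lat_m>\d{2})(?P<lat_h>[NS]) "
    r"(?P<lon_d>\d{3}) (?P<lon_m>\d{2})(?P<lon_h>[EW]) "
    r"(?P<course>\d{3})@(?P<speed>\d+(?:\.\d+)?)"
)
# Matches e.g. "B:158 R:3700 Track:TRK_34"
FCS = re.compile(
    r"B:(?P<bearing>\d+(?:\.\d+)?)\s+R:(?P<range>\d+(?:\.\d+)?)\s+Track:(?P<track>\S+)"
)


def to_decimal(degs, mins, hemisphere):
    """Convert degrees and minutes to signed decimal degrees."""
    value = int(degs) + int(mins) / 60
    return -value if hemisphere in "SW" else value


def parse_state(text):
    m = STATE.search(text)
    if m is None:
        return None
    return {
        "lat": to_decimal(m["lat_d"], m["lat_m"], m["lat_h"]),
        "lon": to_decimal(m["lon_d"], m["lon_m"], m["lon_h"]),
        "course_degs": float(m["course"]),
        "speed_kts": float(m["speed"]),
    }


def parse_fcs(text):
    m = FCS.search(text)
    if m is None:
        return None
    return {
        "bearing_degs": float(m["bearing"]),
        "range_yds": float(m["range"]),
        "track": m["track"],
    }


print(parse_state("58 34N 001 36W 089@12"))
print(parse_fcs("FCS, HMS Nelson, B:158 R:3700 Track:TRK_34"))
```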

😕 Idiosyncrasies

Occasionally this has been exported to PDF.
It frequently has tracked changes switched on. Hopefully we can either accept all changes at start of import, or recognise there are "pending" changes and ask the analyst to open the doc in MS-Word and accept the changes themselves.
The data is human generated. There are some horrendous idiosyncrasies, particularly where an entry is copied/pasted from one place to another and the visible time gets updated but the hidden text doesn't.
Ian maintains a Java importer for this format. It contains some lessons learned about idiosyncrasies. Source here
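
One possible way to recognise "pending" tracked changes, at least for .docx files, is to look for revision elements in the document XML. The sketch below uses only the standard library and assumes the file is a .docx (a zip archive containing `word/document.xml`); it does not cover the older binary .doc format or PDF exports, and it only looks for tracked insertions, deletions and moves, not formatting changes. The function name `has_pending_changes` is hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# WordprocessingML elements that wrap tracked insertions, deletions and moves
TRACKED = {f"{{{W_NS}}}{name}" for name in ("ins", "del", "moveFrom", "moveTo")}


def has_pending_changes(docx_source):
    """Return True if the main document part contains tracked-change elements.

    docx_source may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return any(element.tag in TRACKED for element in root.iter())
```

If this returns True, the importer could stop and ask the analyst to accept the changes in MS-Word, as suggested above.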

Positional data

Analysts have looked through a range of historic documents, and identified a range of examples for how position is recorded.

110000  Ships Pos 00 00.0N 000 00.0E C-000 S-00
101956  Ships Position: 0000.0N 0000.0E C-000 S-00
102006 EMAT Position 00 00.0N 000 00.0E
280940  PWO Comment
        RV with US TG 00.0 in position 00 00N 000 00E

Mike has come back to me with guidance on this position handling. The EMAT position is of value to the analysts, but it's handled in a separate process, and doesn't need to go into Pepys as a position. So, we'll just handle ownship positions. The pattern for this is:

Recognise a position when the comment starts with any of:
- Ship, Ships, Ownship
followed by:
- Pos, Position
Then a lat/long expressed in:
- degs, degs-mins, or degs-mins-secs
Then an optional course/speed pair expressed as:
- C-000 S-00
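
The pattern above might translate into a regular expression roughly like this. It is a sketch, not the importer's actual code: it assumes the timestamp has already been stripped from the comment, that the lat/long components are separated by spaces, and that an optional colon may follow Pos/Position. The run-together form seen in one of the examples (0000.0N) is not handled. All names are hypothetical.

```python
import re

OWNSHIP_POS = re.compile(
    r"^\s*(?:Ownship|Ships?)\s+Pos(?:ition)?:?\s+"
    r"(?P<lat>\d{1,2}(?:\.\d+)?(?:\s+\d{1,2}(?:\.\d+)?){0,2})\s*(?P<lat_h>[NS])\s+"
    r"(?P<lon>\d{1,3}(?:\.\d+)?(?:\s+\d{1,2}(?:\.\d+)?){0,2})\s*(?P<lon_h>[EW])"
    r"(?:\s+C-(?P<course>\d{1,3})\s+S-(?P<speed>\d+(?:\.\d+)?))?",
    re.IGNORECASE,
)


def _to_decimal(parts, hemisphere):
    """Convert 'D', 'D M' or 'D M S' text to signed decimal degrees."""
    degs, mins, secs = (list(map(float, parts.split())) + [0.0, 0.0])[:3]
    value = degs + mins / 60 + secs / 3600
    return -value if hemisphere.upper() in "SW" else value


def parse_ownship_position(comment):
    """Return a dict for an ownship position comment, or None if it isn't one."""
    m = OWNSHIP_POS.match(comment)
    if m is None:
        return None
    return {
        "lat": _to_decimal(m["lat"], m["lat_h"]),
        "lon": _to_decimal(m["lon"], m["lon_h"]),
        # Course and speed are optional, so these may be None
        "course_degs": float(m["course"]) if m["course"] else None,
        "speed_kts": float(m["speed"]) if m["speed"] else None,
    }


print(parse_ownship_position("Ships Pos 50 30.0N 001 15.0W C-090 S-12"))
print(parse_ownship_position("EMAT Position 00 00.0N 000 00.0E"))
```

Anchoring the match at the start of the comment is what keeps the EMAT position and the free-text PWO comment from being treated as ownship positions.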


robintw · 2020-12-21T10:11:08Z

Thanks @IanMayo. I've been looking through some of the Java code for this format, and I think it will be helpful when implementing it in Python - particularly if we can translate some of the unit tests.

Quick question: it seems like the Debrief code supports imports from .doc, .docx and .pdf files. Do we need to support all of those formats? Or as .doc is very old now, can we ignore that one? (I suspect the answer is that we still need to support it, but I thought I'd ask!)

IanMayo · 2020-12-21T10:46:49Z

Sure. Yes - it would be acceptable to not support .doc. But, I know some platforms convert to .pdf before submitting their data.

robintw · 2021-01-06T11:39:54Z

@IanMayo Do you know how the hidden text is managed in the PDF versions of these documents? I've just done a simple export of the example document to PDF (from within Word), and we just get the normally visible text:

This means that we don't have any information on the date for each line - and no information seems to be given about the date anywhere else either. Even if I change the Word settings to display hidden text, it doesn't seem to be exported into the PDF.

Any thoughts?

robintw · 2021-01-06T11:55:41Z

Also, if you're able to get an example document containing the FCS ('contact') entries then that would be great. I've found various example test entries in the Java unit tests, but it'd be good to have a proper document.

IanMayo · 2021-01-06T12:25:03Z

> @IanMayo Do you know how the hidden text is managed in the PDF versions of these documents? I've just done a simple export of the example document to PDF (from within Word), and we just get the normally visible text:
> This means that we don't have any information on the date for each line - and no information seems to be given about the date anywhere else either. Even if I change the Word settings to display hidden text, it doesn't seem to be exported into the PDF.

I've just looked at my sample data. I do have a sample PDF, and see that it is missing the hidden data.
FCS_narrative_no_metadata.pdf

IanMayo · 2021-01-06T12:54:26Z

Here's a MS-Word document with FCS (Fire Control Solution) data:
FCS_extra_narrativetypes.doc.zip

robintw · 2021-01-11T09:17:02Z

@IanMayo How do you suggest we handle PDFs that don't have the hidden fields in them? Without the hidden fields we don't have dates to use for the timestamps.

IanMayo · 2021-01-11T10:42:07Z

While the unicorn example doesn't include it, I have seen month/year markers.

I'll check with the clients if these markers were present in the unicorn example that got sanitised.

robintw · 2021-01-12T09:27:31Z

Hi @IanMayo - a few more questions about this importer:

1. Some of the example Word documents I found don't seem to have a header. Given that, do we need to be able to handle Word docs without a header? That would mean no information on privacy, or vessel name - so we'd need to prompt for them.
2. I found some code in the Java implementation that handles the month/year markers that you mentioned in your previous comment. I'll make sure I support those.
3. The Java code deals with situations where the final comma after the Message Type field is missed out, and therefore the Message Type field runs straight into the actual text of the message. However, it does this by looking for the first space and splitting there (see https://github.com/debrief/debrief/blob/develop/org.mwc.debrief.legacy/src/Debrief/ReaderWriter/Word/ImportNarrativeDocument.java#L485). The example file we have has message types with spaces in them (CO Comment etc), so this won't work properly. In my code I look for a tab instead (as that's what we seem to have splitting the two fields) - but this might need updating in the Java code.
4. The messages that include the State information (eg. 58 34N 001 36W 089@12) - would this information always be the full content of a message, or could it be included in a longer message (eg. Blah blah we did something 58 34N 001 36W 089@12)? I can't seem to see any Java code that processes this, and I think I'm going to need to play around with regexes to try and match it. Unfortunately there doesn't seem to be a specific Message Type that contains this information - it would be really useful if there was, but oh well.
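
To illustrate the tab-based split described for the Message Type field, here is a minimal sketch. It assumes the fragment passed in starts at the Message Type field and that a tab separates it from the message text; the function name and the sample message text are made up.

```python
def split_type_and_text(fragment):
    """Split 'Message Type<TAB>message text' on the first tab.

    A trailing comma on the message type is dropped if present. Returns None
    when there is no tab, so the caller can decide how to handle the line.
    """
    message_type, tab, text = fragment.partition("\t")
    if not tab:
        return None
    return message_type.strip().rstrip(",").strip(), text.strip()


# Splitting on the first space would give "CO" and "Comment\t..." here
print(split_type_and_text("CO Comment\tReturned to periscope depth"))
```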

IanMayo · 2021-01-12T09:49:39Z

1. Header. I'll ask the question. There's a chance the sample files are missing the header just because they're mock data.
2. Yes. I had confirmation last night that for each file the analyst looked in, the content we're looking for started with the day/month/year header. I saw documents a couple of years ago where that marker didn't appear until after a hundred pages of other content. But, since the documents are "tidied" by a human before being sent, we have to remain wary that the marker may be missing, in which case we have to ask the user for the month/year.
3. Tab separator. Thanks for the tip.
4. I am surprised to see that location in the General Comment. I don't think I've seen it before. Knowing location is really valuable, but normally the ships rely on the (multiple) other systems to capture it. So yes, let's grab this State element when it appears in a General Comment. But we need to be quite relaxed about the formatting, since it's free-typed into MS-Word. (Many other inputs are entered via popup dialogs, which gives some consistency in formatting.) So we need to allow for the presence of a degree symbol and the presence of seconds. The last two elements are course (degs) and speed (kts). I guess we need to allow for Kts or kts being present too. Actually - I'll ask them to look at a wider sample, and see how often location is supplied in this way.
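
A relaxed matcher for the State element might look something like the sketch below. It searches anywhere in the comment text and tolerates an optional degree symbol, optional minute and second marks, optional seconds, and an optional kts/Kts suffix. The exact set of variations to accept is an assumption until a wider sample is available, and all names are hypothetical.

```python
import re

RELAXED_STATE = re.compile(
    r"""
    (?P<lat_d>\d{1,2})\s*°?\s*(?P<lat_m>\d{1,2})\s*'?\s*
    (?:(?P<lat_s>\d{1,2})\s*"?\s*)?(?P<lat_h>[NS])
    \s+
    (?P<lon_d>\d{1,3})\s*°?\s*(?P<lon_m>\d{1,2})\s*'?\s*
    (?:(?P<lon_s>\d{1,2})\s*"?\s*)?(?P<lon_h>[EW])
    \s+
    (?P<course>\d{1,3})\s*@\s*(?P<speed>\d+(?:\.\d+)?)\s*(?:kts?)?
    """,
    re.VERBOSE | re.IGNORECASE,
)


def _decimal(degs, mins, secs, hemisphere):
    """Convert degrees, minutes and optional seconds to signed decimal degrees."""
    value = int(degs) + int(mins) / 60 + int(secs or 0) / 3600
    return -value if hemisphere.upper() in "SW" else value


def find_state(text):
    """Return the first State element found anywhere in text, or None."""
    m = RELAXED_STATE.search(text)
    if m is None:
        return None
    return {
        "lat": _decimal(m["lat_d"], m["lat_m"], m["lat_s"], m["lat_h"]),
        "lon": _decimal(m["lon_d"], m["lon_m"], m["lon_s"], m["lon_h"]),
        "course_degs": float(m["course"]),
        "speed_kts": float(m["speed"]),
    }


print(find_state("Blah blah we did something 58 34N 001 36W 089@12"))
print(find_state("58°34'N 001°36'W 089@12kts"))
```

Using `search` rather than `match` is what lets the State element be picked out of a longer General Comment.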

robintw · 2021-01-12T10:44:41Z

Thanks @IanMayo.

For item 4: yes, the more examples we can get of how location is specified, the better. I can see this being a bit tricky to parse reliably, so lots of examples will help tune the regex.

IanMayo · 2021-01-18T09:52:30Z

Header. All of the narrative documents found had the matching header expressed as a table. But, there were also some "composite" documents, where the narrative is included as an appendix, missing the narrative header.

IanMayo · 2021-08-12T10:11:46Z

There may be merit in combining this work with any client-code that parses/handles narrative files.

In that way we can combine the Pepys import logic from this work with the special case handling of the client-code.

IanMayo added the Importer label Dec 18, 2020

IanMayo assigned robintw Dec 18, 2020

robintw linked a pull request Jan 11, 2021 that will close this issue

Add importer for Word narrative documents #730

Draft

6 tasks

IanMayo added the A43_candidate label Aug 12, 2021

IanMayo removed the A43_candidate label Aug 13, 2021
