Optional Content Groups improvements #51

arklumpus · 2024-09-14T18:00:40Z

This PR solves a number of issues around Optional Content Groups (OCGs, aka "Layers"), including those reported in Bug 707999:

2e3298b — OCMD visibility is now computed correctly for AllOn and AnyOff as well as AnyOn and AllOff (issue 2 in the bug report above).
a313d32 — I fixed a bug in the find_ocg function that caused it to always return 0.
e3dba5c — I added functions to query and select the default OCG configuration (issue 1 in the bug report above), in particular:
- void pdf_default_layer_config_info(fz_context *ctx, pdf_document *doc, pdf_layer_config *info) fetches the name and creator of the default configuration (like pdf_layer_config_info).
- void pdf_select_default_layer_config(fz_context *ctx, pdf_document *doc) activates the default layer configuration (like pdf_select_layer_config).
Furthermore:
- I modified pdf_layer_config_info and pdf_select_layer_config, so that when the config_num/config argument is -1, they act on the default configuration.
- I modified mudraw.c so that mutool draw -y l also lists the default configuration, and mutool draw -y d explicitly selects the default configuration.
2c2cd91 — I added the missing logic to interpret visibility expressions (VE entry) in OCMD dictionaries (issue 3 in the bug report).
- This uses a stack to detect circular references, so there a few new static functions related to this as well (ve_new_stack, ve_stack_pop, ve_stack_push, ve_stack_contains, ve_stack_free).
- If necessary, the stack is initialised in pdf_is_ocg_hidden_imp and passed around to nested calls to pdf_is_ve_hidden or pdf_is_ocg_hidden_imp.
da41599 — Wherever OCG objects are being compared, I replaced pdf_objcmp_resolve with pdf_objcmp (so that the references are compared rather than the contents of the dictionary).
- This solves issues occurring with PDF files where multiple OCGs are defined with the same name (which are allowed by the spec).
- Essentially, this extends the fix for Bug 702261 to other instances where OCGs are compared, thus also fixing Bug 707992.
2c54b56 — The logic to deal with radio button OCG UI elements was quite broken, I fixed it.
2c54b56 — I added functions to check whether link actions are visible and to "activate" actions with subtype SetOCGState.
- int pdf_is_link_hidden(fz_context *ctx, const char *usage, fz_link* link) checks the visibility of the OCG defined by the OC entry of the link dictionary.
- int pdf_is_link_set_ocg_state(fz_link* link) checks whether a pdf_link object represents an action with subtype SetOCGState or a "normal" link.
- void pdf_activate_link_set_ocg_state(fz_context *ctx, pdf_document *doc, fz_link *link) activates a link that represents an action with subtype SetOCGState, thus enabling/disabling/toggling OCGs as specified in the State entry of the action dictionary.
- I modified mupdf-gl and mupdf-x11 to use these new functions, so that hidden links do not react to mouse events and SetOCGState actions do what they are supposed to do when clicking on them.

Here are a few test PDFs:

OptionalContentGroups.pdf: basic example, with three OCGs and objects whose visibility is controlled by a single OCG or a visibility expression.
OptionalContentGroupsTree.pdf: a document with an OCG UI tree with radio buttons.
OptionalContentGroupsControls.pdf: a document with "buttons" and "check boxes" to control OCG visibility (implemented as SetOCGState actions).
NoughtsCrosses.pdf: a document with complex logic that can be used by two humans to play noughts and crosses (tic-tac-toe).

… contents.

…ions.

Also create separate make target to build docs without having to install them.

Use surrogate pairs in output as required. Also (thanks to Tor!) fix fmtquote_xml output. Thanks to Xybei for the report, and the first version of the patch.

Required for use of std::min in some custom methods.

First, add a comment. Second, all outline entries inserted into PDF are closed, regardless of the is_open setting. Thirdly, any insertion into an empty outline object no longer opens it.

The parser encounters an object that it hasn't read from an objstm before, loads it and finds it's a null object. It then thinks that because it's a NULL pointer, that it didn't actually read that object, and throws an error. To fix this, rewrite null 'o' objects as 'f' objects.

…ces." This reverts commit 2586767. The bug that this commit patches over was apparently fixed in another way later in commit 6f8cb56 for bug 699629. The check causes certain (broken) files to extract text worse than they could. Example file has code space ranges and character ranges given with space padded rather than zero padded hex strings: begincodespacerange < 4> <8f> endcodespacerange This is parsed (as per spec) as <40> <8f> which then means we lost all character mappings < 0x40. By ignoring the range checks, we now at least get characters 0x10 and up correct.

Incorrectly changed from MAX_OBJECT_NUMBER to pdf_xref_len in e767bd7 The pdf_xref_len is not valid during repair, because the current object being repaired may extend it! Compare with the implementation limit instead.

Instead of calling pdf_repair_obj_stms implicitly inside pdf_repair_trailer, call it explicitly at the appropriate call sites.

We should (in these cases at least) be checking for the use of the WIN32 API, not the particular compiler.

Currently, we are relying on the accurate bbox for a char and checking to see if a line crosses it. This will go wrong both for chars that are displaced from the origin (e.g. '^') and for underlines that don't cross chars without descenders. As a fix for this, we calculate an 'expanded' rectangle for each char. We extend the rectangle down to the baseline of the glyph, and then 25% of the size of the font below it. This fixes lines such as "Title 1" (underlined!), which contains no descenders, correctly being spotted as underlined chars.

…er range.

The offsets in the loca table must be increasing, all offsets must be filled in, and the length of the glyph data is the difference between an offset and the next offset. The previous code didn't fill in the offsets of unused glyphs between used glyphs' offsets, this caused complaints when opening the font in freetype or fontforge. In addition the length computed for one glyph's data was not applied to the remapped glyph id, but to the glyph id after that one, causing the glyph appearance to be applied to another random glyph.

A hmtx table consists of two parts: * an array of entries with both advance widths and left bearing values * an array of entries with only left bearing values Combined, both arrays are meant to provide horizontal metrics for all the glyphs. The number of glyphs is parsed out from the font file's maxp table. The length of the first array is parsed out from the font file's hhea table. The length of the second array is computed by taking the number of glyphs and subtracting the number of entries in the first array. Previously, the first array length, stored in orig_num_long_hor_metrics and was capped by the hmtx table size, while the second array length, stored in max16, was capped both by the hmtx table size and the number of glyphs. This meant that max16 could end up being a lower value than orig_num_log_hor_metrics, in which case the second array entries were never updated. Three changes are provided in this commit: * Rename array lengths in the code to long_metrics and short_metrics respectively. * Use separate counter to count the number of entries in the first and the second arrays. * Make sure to cap both array lengths by the number of glyphs in the font.

…ing hmtx table.

Both 1487_-_right_margin_cut_off_when_printing.pdf font objects 7 0 R and 14 0 R and PDFIA1.7_SUBSET/CATX1641.pdf font objects 93 0 R, 99 0 R and 105 0 R contain symbolic fonts that all contain a single (3,0) cmap, while MuPDF previously only accepted (1,0) cmaps. The fix is to simply add a check for (3,0) cmaps and these two files can now be subset correctly.

2201_-_transparent_image_covers_background.pdf object 68 0 R has only a single (0,1) cmap using format 4. Now we can cope with that!

This fixes Coverity issue 430251.

Move vectors into line with chars. We now return a consistent representation. 'color' has been deliberately changed to 'argb' as a) it's clearer, and b) it forces people to update their code and not suddenly see failures with alpha being unexpectedly in the top 8 bits.

PyMuPDF builds use build directories containing `Py_LIMITED_API=*` which results in auto-dependency files containing things like: build/PyMuPDF-amd64-shared-tesseract-Py_LIMITED_API=0x030a0000-release/source/fitz/context.o: foo.h Unfortunately the `=` breaks thngs because it causes make to treat this as setting variable `build/PyMuPDF-amd64-shared-tesseract-Py_LIMITED_API` to `0x030a0000-release/source/fitz/context.o: foo.h`, not as a `<target>: <prerequisites>` rule. PyMuPDF will be changed to use build directories without the `=` such as build/PyMuPDF-amd64-shared-tesseract-Py_LIMITED_API_0x030a0000-release/, which we now accept.

Document.search(), Page.search(), and DisplayList.search() all suffered from the same bug where they are declared to return Quad[][] but the last call in the JNI wrappers called ArrayList.toArray() which returns Object[]. Desktop java accepted type of casting without any compilation or runtime complaints. Compiling for Android worked equally well, but searching in the app, the app crashed at runtime with: JNI DETECTED ERROR IN APPLICATION: attempt to return an instance of java.lang.Object[] from fitz.Quad[][] fitz.Page.search(java.lang.String). The fix for this problem is to at the end of the JNI wrappers call <T> T[] ArrayList.toArray(T[]). But even with this code change the app still crashes at runtime with: JNI DETECTED ERROR IN APPLICATION: the return type of CallVoidMethod does not match boolean java.util.ArrayList.add(java.lang.Object) Which was resolved by calling the function using CallBooleanMethod and handling the return value suitably. With these two changes searching now works well in both desktop Java and Android.

I had failed to update the code that calculates bboxes and does bidirectional ordering in the presence of structure tags.

Previously the segment length was represented by a signed integer, so when pdf_parse_jbig2_segment_header() parsed the segment length the value read from the file could be interpreted as negative. Later on in pdf_copy_jbig2_segments() a data pointer is incremented by the segment length, which could lead to an out of bounds pointer and a subsequent crash. This fix is to represent the parsed value as an 32-bit unsigned integer as mentioned in the specification.

This only affects document handlers that insist on file backed streams (i.e. sodochandler, currently). If we ask to convert a stream to be file backed, we may get the same stream back instantly (indicating that it is already file backed). Do NOT drop it in this case.

This enables people to spot 'invisible' text.

fz_convert_pixmap() returns a kept reference, but the C++ wrappers did not know this so did an additional keep, which results in a leak. The fix is to add name 'convert' in function_name_implies_kept_references().

The PDF operator stream in this file does: W <path> n /Image Do The filter code therefore currently tries to deal with the 'W' without a path being defined. Currently it bounds it, finds it is an empty rect, and therefore suppresses the image. According to the PDF spec though, W and W* are not supposed to look at the path at the point they are called. Instead they merely set it up so that a side effect of the next 'path painting operator' (in this case 'n') is to intersect the clip path AFTER than operator has taken place. This commit therefore changes the behaviour of the filter to follow this. All we do when we get a 'W' or a 'W*' is to set a new clip_op field. Whenever we then process a path painting operator, we look at this field, and process W or W* as appropriate. We are at pains to forward the W or W* operator AFTER the path has been processed, but before the path painting operator is sent. This should keep bugged renderers such as Preview happy. The has meant that the culler handling has changed slightly in that the culler now doesn't deal with CLIP paths separately to FILL or STROKE paths, but rather now as a combination (e.g. CLIP_FILL_PATH). Trying to deal differently with CLIP and FILL paths (for example) would probably have gone wrong in the past so there is no real downside here.

arklumpus and others added 30 commits September 14, 2024 11:52

Fixed OCMD visibility computation.

2e3298b

Fixed bug in find_ocg.

a313d32

Added functions to query and select the default OCG configuration.

e3dba5c

Added logic to interpret OCG visibility expressions.

2c2cd91

When comparing OCG objects, compare references rather than dictionary…

da41599

… contents.

Fixed OCG radio buttons.

2c54b56

Added functions to check link visibility and activate SetOCGState act…

93c8ba2

…ions.

Merge branch 'master' into master

d5e470e

Merge branch 'master' into master

bcb2de3

Merge branch 'ArtifexSoftware:master' into master

1295231

Merge branch 'master_artifex'

fdc3c6d

Do not activate the default OCG configuration if there are no OCGs.

271f571

docs: Update list of important changes for the last few releases.

ab0f96e

docs: Fix links to python re-implementations of mutool.

2d4750b

Update HTML doc installation target in Makefile.

6d1ad3e

Also create separate make target to build docs without having to install them.

When setting the graphics AA level, don't overwrite the text AA level!

8ef3a27

Update mutool for revised SoDocHandler API.

ef87bdd

Bug 707984: Extend fmtquote to cope with high range unicode.

9c7b198

Use surrogate pairs in output as required. Also (thanks to Tor!) fix fmtquote_xml output. Thanks to Xybei for the report, and the first version of the patch.

scripts/wrap/cpp.py: added include of <algorithm>.

1f00321

Required for use of std::min in some custom methods.

Bug 707968: mutool poster was splitting r2l instead of l2r by default.

90b6360

Bug 707878: Tweak 'is_open' insertion logic.

75766ce

First, add a comment. Second, all outline entries inserted into PDF are closed, regardless of the is_open setting. Thirdly, any insertion into an empty outline object no longer opens it.

Fix erroneous xref object validity check!

aee22a1

Incorrectly changed from MAX_OBJECT_NUMBER to pdf_xref_len in e767bd7 The pdf_xref_len is not valid during repair, because the current object being repaired may extend it! Compare with the implementation limit instead.

Remove a duplicate pdf_repair_obj_stms call.

146f7f4

Instead of calling pdf_repair_obj_stms implicitly inside pdf_repair_trailer, call it explicitly at the appropriate call sites.

Bug 708011: Replace some _MSC_VER with _WIN32

fd6029b

We should (in these cases at least) be checking for the use of the WIN32 API, not the particular compiler.

Add some more ifdeffery to correctly omit gumbo if not required.

22c2506

mutool: Avoid leaks if output cannot be created.

871c8aa

Ensure that CFF Range1/-2 charset arrays do not extend beyond charact…

cd3566d

…er range.

sebras and others added 29 commits November 15, 2024 14:14

ttf: Avoid out of bounds access of GID renumbering array when subsett…

4191811

…ing hmtx table.

ttf: Add additional unicode cmaps.

6693e82

2201_-_transparent_image_covers_background.pdf object 68 0 R has only a single (0,1) cmap using format 4. Now we can cope with that!

tff: The number of glyphs is a 16-bit unsigned integer.

ddb31cb

cff: Increase charstring stack so we can handle mislablelled CFF2 fonts.

f98e2d4

cff: Check boundaries of values parsed out of FDSelect.

bff5587

This fixes Coverity issue 430251.

ttf: Change unused variable usage so that clang does not warn.

82be72b

mutool run: Add insertEmbeddedFile and deleteEmbeddedFile functions.

62662dc

Update CHANGES with API change to structured text color/argb field.

4171cf1

docs: Update list of important changes for 1.25.0.

faa4e76

Fix SText device when creating structure from tags.

5000782

I had failed to update the code that calculates bboxes and does bidirectional ordering in the presence of structure tags.

Close output before dropping in fz_office_to_html.

f213f92

Bump patch release.

3feae64

Fix error in stext-to-html color conversion!

5d61c9d

Bump patch release.

83811ea

SText: Add bits to char flags word indicating filled/stroked/clipped.

4c325b9

This enables people to spot 'invisible' text.

scripts/wrap/cpp.py: fix leak in C++ wrappers of fz_convert_pixmap().

b4b6040

fz_convert_pixmap() returns a kept reference, but the C++ wrappers did not know this so did an additional keep, which results in a leak. The fix is to add name 'convert' in function_name_implies_kept_references().

Merge commit '53e13b2d0b629d7114f51f26ab6e0a2ed39641f2'

f06bb9c

Added script to create buildable source archives.

3952d6b

Use absolute path for submodules.

292ba67

Updated readme.

c5cf646

arklumpus closed this by deleting the head repository Dec 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optional Content Groups improvements #51

Optional Content Groups improvements #51

arklumpus commented Sep 14, 2024

Optional Content Groups improvements #51

Optional Content Groups improvements #51

Conversation

arklumpus commented Sep 14, 2024