Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optional Content Groups improvements #51

Closed
wants to merge 72 commits into from

Conversation

arklumpus
Copy link

This PR solves a number of issues around Optional Content Groups (OCGs, aka "Layers"), including those reported in Bug 707999:

  • 2e3298b — OCMD visibility is now computed correctly for AllOn and AnyOff as well as AnyOn and AllOff (issue 2 in the bug report above).

  • a313d32 — I fixed a bug in the find_ocg function that caused it to always return 0.

  • e3dba5c — I added functions to query and select the default OCG configuration (issue 1 in the bug report above), in particular:

    • void pdf_default_layer_config_info(fz_context *ctx, pdf_document *doc, pdf_layer_config *info) fetches the name and creator of the default configuration (like pdf_layer_config_info).
    • void pdf_select_default_layer_config(fz_context *ctx, pdf_document *doc) activates the default layer configuration (like pdf_select_layer_config).

    Furthermore:

    • I modified pdf_layer_config_info and pdf_select_layer_config, so that when the config_num/config argument is -1, they act on the default configuration.
    • I modified mudraw.c so that mutool draw -y l also lists the default configuration, and mutool draw -y d explicitly selects the default configuration.
  • 2c2cd91 — I added the missing logic to interpret visibility expressions (VE entry) in OCMD dictionaries (issue 3 in the bug report).

    • This uses a stack to detect circular references, so there a few new static functions related to this as well (ve_new_stack, ve_stack_pop, ve_stack_push, ve_stack_contains, ve_stack_free).
    • If necessary, the stack is initialised in pdf_is_ocg_hidden_imp and passed around to nested calls to pdf_is_ve_hidden or pdf_is_ocg_hidden_imp.
  • da41599 — Wherever OCG objects are being compared, I replaced pdf_objcmp_resolve with pdf_objcmp (so that the references are compared rather than the contents of the dictionary).

    • This solves issues occurring with PDF files where multiple OCGs are defined with the same name (which are allowed by the spec).
    • Essentially, this extends the fix for Bug 702261 to other instances where OCGs are compared, thus also fixing Bug 707992.
  • 2c54b56 — The logic to deal with radio button OCG UI elements was quite broken, I fixed it.

  • 2c54b56 — I added functions to check whether link actions are visible and to "activate" actions with subtype SetOCGState.

    • int pdf_is_link_hidden(fz_context *ctx, const char *usage, fz_link* link) checks the visibility of the OCG defined by the OC entry of the link dictionary.
    • int pdf_is_link_set_ocg_state(fz_link* link) checks whether a pdf_link object represents an action with subtype SetOCGState or a "normal" link.
    • void pdf_activate_link_set_ocg_state(fz_context *ctx, pdf_document *doc, fz_link *link) activates a link that represents an action with subtype SetOCGState, thus enabling/disabling/toggling OCGs as specified in the State entry of the action dictionary.
    • I modified mupdf-gl and mupdf-x11 to use these new functions, so that hidden links do not react to mouse events and SetOCGState actions do what they are supposed to do when clicking on them.

Here are a few test PDFs:

arklumpus and others added 30 commits September 14, 2024 11:52
Also create separate make target to build docs without having to install them.
Use surrogate pairs in output as required.

Also (thanks to Tor!) fix fmtquote_xml output.

Thanks to Xybei for the report, and the first version of the patch.
Required for use of std::min in some custom methods.
First, add a comment.

Second, all outline entries inserted into PDF are closed, regardless
of the is_open setting.

Thirdly, any insertion into an empty outline object no longer opens
it.
The parser encounters an object that it hasn't read from an objstm
before, loads it and finds it's a null object. It then thinks that
because it's a NULL pointer, that it didn't actually read that
object, and throws an error.

To fix this, rewrite null 'o' objects as 'f' objects.
…ces."

This reverts commit 2586767.

The bug that this commit patches over was apparently fixed in another way
later in commit 6f8cb56 for bug 699629.

The check causes certain (broken) files to extract text worse
than they could.

Example file has code space ranges and character ranges given with space
padded rather than zero padded hex strings:

    begincodespacerange
    < 4> <8f>
    endcodespacerange

This is parsed (as per spec) as <40> <8f> which then means we lost
all character mappings < 0x40. By ignoring the range checks, we now
at least get characters 0x10 and up correct.
Incorrectly changed from MAX_OBJECT_NUMBER to pdf_xref_len in
e767bd7

The pdf_xref_len is not valid during repair, because the current object
being repaired may extend it! Compare with the implementation limit
instead.
Instead of calling pdf_repair_obj_stms implicitly inside
pdf_repair_trailer, call it explicitly at the appropriate
call sites.
We should (in these cases at least) be checking for the use
of the WIN32 API, not the particular compiler.
Currently, we are relying on the accurate bbox for a char
and checking to see if a line crosses it.

This will go wrong both for chars that are displaced from
the origin (e.g. '^') and for underlines that don't cross
chars without descenders.

As a fix for this, we calculate an 'expanded' rectangle for
each char. We extend the rectangle down to the baseline
of the glyph, and then 25% of the size of the font below
it.

This fixes lines such as "Title 1" (underlined!), which
contains no descenders, correctly being spotted as
underlined chars.
sebras and others added 29 commits November 15, 2024 14:14
The offsets in the loca table must be increasing, all offsets
must be filled in, and the length of the glyph data is the
difference between an offset and the next offset.

The previous code didn't fill in the offsets of unused glyphs
between used glyphs' offsets, this caused complaints when opening
the font in freetype or fontforge.

In addition the length computed for one glyph's data was not
applied to the remapped glyph id, but to the glyph id after that
one, causing the glyph appearance to be applied to another random
glyph.
A hmtx table consists of two parts:

  * an array of entries with both advance widths and left bearing values
  * an array of entries with only left bearing values

Combined, both arrays are meant to provide horizontal metrics for all
the glyphs. The number of glyphs is parsed out from the font file's
maxp table.

The length of the first array is parsed out from the font file's hhea
table. The length of the second array is computed by taking the number of
glyphs and subtracting the number of entries in the first array.

Previously, the first array length, stored in orig_num_long_hor_metrics
and was capped by the hmtx table size, while the second array length,
stored in max16, was capped both by the hmtx table size and the number of
glyphs. This meant that max16 could end up being a lower value than
orig_num_log_hor_metrics, in which case the second array entries were
never updated.

Three changes are provided in this commit:

  * Rename array lengths in the code to long_metrics and short_metrics
    respectively.
  * Use separate counter to count the number of entries in the first and
    the second arrays.
  * Make sure to cap both array lengths by the number of glyphs in the
    font.
Both 1487_-_right_margin_cut_off_when_printing.pdf font objects
7 0 R and 14 0 R and PDFIA1.7_SUBSET/CATX1641.pdf font objects
93 0 R, 99 0 R and 105 0 R contain symbolic fonts that all contain
a single (3,0) cmap, while MuPDF previously only accepted (1,0)
cmaps.

The fix is to simply add a check for (3,0) cmaps and these two
files can now be subset correctly.
2201_-_transparent_image_covers_background.pdf object 68 0 R has
only a single (0,1) cmap using format 4. Now we can cope with that!
Move vectors into line with chars. We now return a consistent
representation.

'color' has been deliberately changed to 'argb' as a) it's
clearer, and b) it forces people to update their code and not
suddenly see failures with alpha being unexpectedly in the top
8 bits.
PyMuPDF builds use build directories containing `Py_LIMITED_API=*` which
results in auto-dependency files containing things like:

    build/PyMuPDF-amd64-shared-tesseract-Py_LIMITED_API=0x030a0000-release/source/fitz/context.o: foo.h

Unfortunately the `=` breaks thngs because it causes make to treat this
as setting variable `build/PyMuPDF-amd64-shared-tesseract-Py_LIMITED_API`
to `0x030a0000-release/source/fitz/context.o: foo.h`, not as a `<target>:
<prerequisites>` rule.

PyMuPDF will be changed to use build directories without the `=` such as
build/PyMuPDF-amd64-shared-tesseract-Py_LIMITED_API_0x030a0000-release/, which
we now accept.
Document.search(), Page.search(), and DisplayList.search() all suffered
from the same bug where they are declared to return Quad[][] but the
last call in the JNI wrappers called ArrayList.toArray() which returns
Object[].

Desktop java accepted type of casting without any compilation or runtime
complaints. Compiling for Android worked equally well, but searching in
the app, the app crashed at runtime with:

JNI DETECTED ERROR IN APPLICATION: attempt to return an instance of
java.lang.Object[] from fitz.Quad[][] fitz.Page.search(java.lang.String).

The fix for this problem is to at the end of the JNI wrappers call
<T> T[] ArrayList.toArray(T[]). But even with this code change the app
still crashes at runtime with:

JNI DETECTED ERROR IN APPLICATION: the return type of CallVoidMethod
does not match boolean java.util.ArrayList.add(java.lang.Object)

Which was resolved by calling the function using CallBooleanMethod and
handling the return value suitably.

With these two changes searching now works well in both desktop Java
and Android.
I had failed to update the code that calculates bboxes
and does bidirectional ordering in the presence of
structure tags.
Previously the segment length was represented by a signed integer,
so when pdf_parse_jbig2_segment_header() parsed the segment length
the value read from the file could be interpreted as negative. Later
on in pdf_copy_jbig2_segments() a data pointer is incremented by
the segment length, which could lead to an out of bounds pointer and
a subsequent crash.

This fix is to represent the parsed value as an 32-bit unsigned integer
as mentioned in the specification.
This only affects document handlers that insist on file backed
streams (i.e. sodochandler, currently).

If we ask to convert a stream to be file backed, we may get
the same stream back instantly (indicating that it is already
file backed). Do NOT drop it in this case.
This enables people to spot 'invisible' text.
fz_convert_pixmap() returns a kept reference, but the C++ wrappers
did not know this so did an additional keep, which results in a leak.

The fix is to add name 'convert' in function_name_implies_kept_references().
The PDF operator stream in this file does:

 W <path> n /Image Do

The filter code therefore currently tries to deal with the 'W'
without a path being defined. Currently it bounds it, finds it
is an empty rect, and therefore suppresses the image.

According to the PDF spec though, W and W* are not supposed to
look at the path at the point they are called. Instead they
merely set it up so that a side effect of the next 'path painting
operator' (in this case 'n') is to intersect the clip path
AFTER than operator has taken place.

This commit therefore changes the behaviour of the filter to
follow this. All we do when we get a 'W' or a 'W*' is to set a
new clip_op field. Whenever we then process a path painting
operator, we look at this field, and process W or W* as
appropriate.

We are at pains to forward the W or W* operator AFTER the path
has been processed, but before the path painting operator
is sent. This should keep bugged renderers such as Preview
happy.

The has meant that the culler handling has changed slightly
in that the culler now doesn't deal with CLIP paths separately
to FILL or STROKE paths, but rather now as a combination (e.g.
CLIP_FILL_PATH). Trying to deal differently with CLIP and FILL
paths (for example) would probably have gone wrong in the past
so there is no real downside here.
@arklumpus arklumpus closed this by deleting the head repository Dec 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants