Pandoc 2.x renders images' alternative texts in an inaccessible fashion #6491

Closed
jmuheim opened this issue Jun 30, 2020 · 11 comments · Fixed by #6495

Comments

@jmuheim

jmuheim commented Jun 30, 2020

As stated on StackOverflow (https://stackoverflow.com/questions/62639927/pandoc-2-x-renders-images-alternative-texts-in-an-inaccessible-fashion?noredirect=1#comment110781365_62639927), Pandoc 2.x renders images' alternative texts in an inaccessible fashion. I was told there to ask for a bugfix here.


Here's the original post:

Since I upgraded from Pandoc v1.19 to 2.9, decorative images are not exported as expected anymore.

First of all, when generating HTML from ![](test.jpg), in v1.19 a <p class="figure"> structure was wrapped around the image, but now it's only a <p>:

<p>
  <img src="test.jpg">
</p>

This makes it harder to style in line with other images that have an alternative text.

But what's really a problem here: there's no alt="" attribute produced anymore! This means that e.g. screen readers will not recognise this as a decorative image anymore.

So let's see what happens to an image with an actual alternative text, e.g. when generating HTML from ![Hello](test.jpg):

<div class="figure">
  <img src="test.jpg" alt="">
  <p class="caption">Hello</p>
</div>

Here we get a class="figure" on the surrounding element, but now it's a <div> instead of a <p> (I'm not too bothered by this, but again, it makes it harder to style everything the same way).

What again is a big problem though is the fact that the alt attribute is now set empty: this prevents screen readers from perceiving them at all, which is horribly wrong! I guess that Pandoc concludes that having alternative text and caption would be redundant, which is correct, and that the caption below would be the right thing to show - which it is not.

The right structure would look something like this:

<div class="figure">
  <img src="test.jpg" alt="Hello"><!-- Leave the alternative text on the image -->
  <p class="caption" aria-hidden="true">Hello</p><!-- Hide the redundant visual alternative text from screen readers -->
</div>

Any reason why this behaviour would make sense? Can it be changed somehow? Otherwise I will have to fiddle around with some post-processing JavaScript...
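
For illustration, the structure proposed above can be sketched as a small Ruby helper. This is a minimal sketch, not pandoc's implementation: the method name accessible_figure is hypothetical, and rendering a decorative image as a bare img with an empty alt is an assumption based on the first part of this post.

```ruby
require 'erb'

# Hypothetical helper (not part of pandoc): renders the figure markup
# proposed above for an image source and its alternative text.
def accessible_figure(src, alt)
  escaped_src = ERB::Util.html_escape(src)
  escaped_alt = ERB::Util.html_escape(alt)

  # Assumption: a decorative image gets an explicit empty alt and no caption.
  return %(<img src="#{escaped_src}" alt="">) if alt.strip.empty?

  # Described image: the alt text stays on the image, and the visually
  # redundant caption is hidden from screen readers.
  <<~HTML.chomp
    <div class="figure">
      <img src="#{escaped_src}" alt="#{escaped_alt}">
      <p class="caption" aria-hidden="true">#{escaped_alt}</p>
    </div>
  HTML
end

puts accessible_figure('test.jpg', 'Hello')
```

Escaping the alt text keeps quotes or angle brackets in a caption from breaking the attribute.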

@tarleb
Collaborator

tarleb commented Jun 30, 2020

I started to implement this, but was given pause by the fact that this would cause pandoc to produce invalid xhtml when targeting HTML4. @jmuheim, do you know of a good workaround for HTML4?

On the other hand, we already produce invalid xhtml for any document which includes code blocks, as line numbers contain the aria-hidden="true" attribute.

@jmuheim
Copy link
Author

jmuheim commented Jun 30, 2020

Interesting. You mean because aria-hidden has a dash in the attribute name, right?

I don't know of a good technical workaround. I could think of doing something like this, which would work in some situations:

<figure>
  <img src="..." alt="See below" />
  <figcaption>Bla bla bla</figcaption>
</figure>

But this isn't really a general solution.

In my honest opinion, though, it is far more important not to programmatically exclude users (especially users with special needs, who already face plenty of obstacles) than to avoid minor code invalidities. And since you say that aria-hidden is already emitted for code blocks in HTML4, we should definitely not hesitate to add it for alternative texts.

@mb21
Collaborator

mb21 commented Jun 30, 2020

Is this issue only about HTML4 output? I think much of the reason we do things the way we do is that in HTML5 (which is the default), we produce a figure tag...

I guess that Pandoc concludes that having alternative text and caption would be redundant,

yes.

and that the caption below would be the right thing to show - which it is not

well.. why not? HTML5 output is:

<figure>
  <img src="foo.jpg" alt="" />
  <figcaption>bar</figcaption>
</figure>

@jmuheim
Author

jmuheim commented Jun 30, 2020

well.. why not? HTML5 output is:

<figure>
  <img src="foo.jpg" alt="" />
  <figcaption>bar</figcaption>
</figure>

As far as I know, screen readers will always treat images with empty alt attribute as purely decorative, so the user will never know about them. For instance, they will not show them in a list of images or any other functionality that screen readers offer.

While it may seem counterintuitive to non-blind people, blind people also make use of images, e.g. saving them to their hard drive or uploading them to social media portals. So we should never prevent them from accessing the same elements that others can.

@tarleb
Collaborator

tarleb commented Jun 30, 2020

Furthermore, here is what MDN says about the alt attribute.

Omitting alt altogether indicates that the image is a key part of the content and no textual equivalent is available. Setting this attribute to an empty string (alt="") indicates that this image is not a key part of the content (it’s decoration or a tracking pixel), and that non-visual browsers may omit it from rendering. Visual browsers will also hide the broken image icon if the alt is empty and the image failed to display.

Figures are rarely just decoration, and I think leaving users in the dark about the existence of an image is not good.
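
The three cases in the MDN quote can be written down as a small classification. This is only a sketch: the method name alt_status and the symbols it returns are hypothetical, and the image is modelled as a plain Hash of attributes rather than a parsed element.

```ruby
# Hypothetical helper: maps an <img> element's attributes (given as a
# Hash) to the three cases described in the MDN quote above.
def alt_status(attributes)
  # alt omitted: the image is key content with no textual equivalent.
  return :no_text_equivalent unless attributes.key?('alt')
  # alt="": the image is decoration and may be skipped by non-visual browsers.
  return :decorative if attributes['alt'].empty?

  # Non-empty alt: the image has a textual description.
  :described
end

p alt_status({ 'src' => 'test.jpg' })
p alt_status({ 'src' => 'test.jpg', 'alt' => '' })
p alt_status({ 'src' => 'test.jpg', 'alt' => 'Hello' })
```

Under this reading, a figure emitted with alt="" falls into the decorative case even though its caption shows that it carries content.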

@mb21
Collaborator

mb21 commented Jun 30, 2020

Pretty sure we actually changed this to the way it currently is at the request of a blind person generating EPUB... but I cannot find the issue anymore...

@tarleb
Collaborator

tarleb commented Jun 30, 2020

Found the issue: #4737

@jgm
Owner

jgm commented Jun 30, 2020

I didn't know until now that hyphenated attribute names aren't allowed in XHTML. Interesting.
We do try to create polyglot HTML, and this is especially important because we use the HTML writer in creating EPUBs. EPUB contents are supposed to be XHTML. On the other hand, I haven't heard any reports that the hyphenated aria- attributes have caused problems with any e-readers or with epub validation.

tarleb added a commit to tarleb/pandoc that referenced this issue Jul 1, 2020
Screen readers read an image's `alt` attribute and the figure caption,
both of which come from the same source in pandoc. The figure caption is
hidden from screen readers with the `aria-hidden` attribute. This
improves accessibility.

For HTML4, where `aria-hidden` is not allowed, pandoc still uses an
empty `alt` attribute to avoid duplicate contents.

Closes: jgm#6491
@tarleb
Collaborator

tarleb commented Jul 1, 2020

I tried two EPUB2 validators with current pandoc output, and they fail if the input contains a syntax highlighted code block. The PR therefore leaves the HTML4/XHTML output as it was, and just updates HTML5 output to include the suggested changes.
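
The resulting split between the two output families can be sketched as follows. This is an illustration of the decision described in the commit message and the comment above, not pandoc's code (pandoc is written in Haskell); the method name figure_attributes, its html5: keyword, and the returned Hash keys are all hypothetical.

```ruby
# Hypothetical sketch of the per-format decision; the names are
# illustrative and do not mirror pandoc's implementation.
def figure_attributes(caption_text, html5:)
  if html5
    # HTML5: keep the alt text on the image and hide the duplicate caption.
    { img_alt: caption_text, caption_aria_hidden: true }
  else
    # HTML4/XHTML: aria-hidden is not allowed there, so the alt stays
    # empty to avoid announcing the same text twice.
    { img_alt: '', caption_aria_hidden: false }
  end
end

p figure_attributes('Hello', html5: true)
p figure_attributes('Hello', html5: false)
```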

@jmuheim
Author

jmuheim commented Jul 19, 2020

Any news on this? I will fix the issue on my side with some (ugly) JavaScript, looking out for the inaccessible code created by Pandoc and fixing it.

@jmuheim
Author

jmuheim commented Jul 19, 2020

Just for the record: instead of using JavaScript, I decided to put the fix into my markdown method in Ruby. This is faster, cleaner, and better suited for automated testing.

If anyone else needs inspiration for something similar:

require 'nokogiri'
require 'pandoc-ruby'

module MarkdownHelper
  # Converts Markdown to HTML with pandoc, then repairs the image markup.
  def markdown(string)
    html = PandocRuby.convert(string).strip
    fragment = Nokogiri::HTML::DocumentFragment.parse(html)

    clone_alt_into_img_and_hide_figcaption_from_sr(fragment)
    add_empty_alt_to_decorative_img(fragment)

    fragment.to_html.html_safe # html_safe requires Rails (ActiveSupport)
  end

  # Pandoc empties an image's alt attribute because the same text is also
  # available inside figcaption (to avoid screen reader redundancy). This
  # makes the image itself invisible to screen readers, so we copy the
  # caption text back into alt and put aria-hidden on the figcaption.
  #
  # See https://github.com/jgm/pandoc/issues/6491
  def clone_alt_into_img_and_hide_figcaption_from_sr(fragment)
    fragment.css('figure').each do |figure|
      img        = figure.at_css('img')
      figcaption = figure.at_css('figcaption')
      next unless img && figcaption # skip figures missing either element

      img['alt'] = figcaption.text
      figcaption['aria-hidden'] = 'true' # set the value as a string
    end

    fragment
  end

  # Pandoc omits the alt attribute when the alternative text is left empty.
  # Screen readers then announce the file name, so we add an empty alt.
  def add_empty_alt_to_decorative_img(fragment)
    fragment.css('img:not([alt])').each do |img|
      img['alt'] = ''
    end

    fragment
  end
end
