Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for inline image Markdown - attribute fix #661

Merged
merged 4 commits into from
Mar 19, 2024

Conversation

kidroca
Copy link
Contributor

@kidroca kidroca commented Mar 12, 2024

This PR is an optional followup for #658 if we'd like to patch a problem where image attribute content is being parsed to HTML as detailed in: #658 (comment)

Fixed Issues

$ Expensify/App#37246

Tests

  1. Unit/Integration Tests Covering the Change:

    • The changes are covered by unit tests added to ExpensiMark-HTML-test.js and ExpensiMark-Markdown-test.js. These tests validate that MD content designated for the alt attribute is escaped as HTML entities so it isn't converted to html tags by following rules
  2. Tests Performed to Validate Changes:

    • Ran all newly added tests and ensured they pass, verifying the correct conversion of markdown to HTML <img> tags and vice versa.
    • Manually tested the parsing of markdown containing images in various formats and contexts to ensure the output matches expectations.
    • Checked the handling of invalid image URLs to ensure they remain unchanged, as expected.

QA

  1. QA Validation Steps:
    • Test the handling of invalid Markdown image syntax, such as using a broken or incorrectly formatted URL (e.g., ![image](invalid-link)). These should not be parsed and ought to remain displayed as the raw Markdown text, without conversion or rendering as an image.

@kidroca kidroca requested a review from a team as a code owner March 12, 2024 13:59
@melvin-bot melvin-bot bot requested review from MonilBhavsar and removed request for a team March 12, 2024 14:00

// A map of MD characters to their HTML entity equivalent
const entities = {
'*': '&ast;',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A backtick is missing here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've intentionally left out inline code and code blocks. These rules are applied first, converting anything that matches into HTML. Therefore, for input like:

![# fake-heading *bold* _italic_ ~strike~ `code`](https://example.com/image.png)

The result is:

<img src="<a href="https://example.com/image.png" target="_blank" rel="noreferrer noopener">https://example.com/image.png</a>" alt="# fake-heading &ast;bold&ast; &lowbar;italic&lowbar; &#126;strike&#126; <code>code</code>" />

This approach reveals a limitation in our rule application method, but I'm unable to cover it easily in the current PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The aforementioned bug has been resolved by modifying the autolink pattern to recognize self-closing tags such as img. The absence of backticks is due to the fact that when escapeMarkdownEntities is executed, all backticks have already been transformed into <code> or <pre>

@MonilBhavsar
Copy link
Contributor

Puneet, adding you as reviewer as you're assigned to the issue

@MonilBhavsar
Copy link
Contributor

Looks like this commit is not verified caa6cf2

@situchan
Copy link
Contributor

This should be on hold for #658, right?

marcaaron
marcaaron previously approved these changes Mar 18, 2024
Copy link
Contributor

@marcaaron marcaaron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment - not really a blocker but seems worth fixing now since we're in here.


test('Image with alt text containing markdown', () => {
const testString = '![# fake-heading *bold* _italic_ ~strike~ [:-)]](https://example.com/image.png)';
const resultString = '<img src="https://example.com/image.png" alt="&num; fake-heading &ast;bold&ast; &lowbar;italic&lowbar; &#126;strike&#126; &lbrack;:-)&rbrack;" />';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm what about if the alt is:

[cool" onerror="alert('xss')"]

or something lol - I am trying hard to make the xss happen 😂

Copy link
Contributor Author

@kidroca kidroca Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the idea. The general HTML escaping logic, which is enabled by default, appears to cover this case. It results in something like the following:

Given the input:

![test" onerror="alert('xss')](https://example.com/image.png)

I attempted to exploit the alt attribute by prematurely closing it with a " to insert an additional attribute.

The outcome is:

<img src="https://example.com/image.png" alt="test&quot; onerror=&quot;alert(&#x27;xss&#x27;)" />

However, since the quotes are properly escaped, the entire string is treated as part of the alt attribute, preventing any JavaScript execution.

I've added a test just in case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thanks!

name: 'image',
regex: MARKDOWN_IMAGE_REGEX,
replacement: (match, g1, g2) => `<img src="${Str.sanitizeURL(g2)}" alt="${this.escapeMarkdownEntities(g1)}" />`,
rawInputReplacement: (match, g1, g2) => `<img src="${Str.sanitizeURL(g2)}" alt="${this.escapeMarkdownEntities(g1)}" data-raw-href="${g2}" data-link-variant="labeled" />`
Copy link
Contributor

@marcaaron marcaaron Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Escaping is good for the "markdown entities" - but why display them at all? The alt tag will show a bunch of junk if the image fails to load or they hover over it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered keeping the text plain initially, but realized that it wouldn't allow us to revert to the original Markdown message from the HTML. This way, the original format is retained.

Regarding hover text, the alt attribute doesn't display text on hover. For that, Markdown has an extended syntax (![alt](src "title text")) that adds a title attribute to the resulting HTML. However, we don't support this feature, similarly to our approach with links.

If an image fails to load, the fallback is something like:
image

But NewDot and react-native-web don't display this fallback since they render a hidden <img> tag instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah hmm. Quite weird. Perhaps we can filter it before rendering i.e. not actually modify the underlying html field. But we can look into that in the App PR. Thanks for the explanation.

@kidroca kidroca force-pushed the kidroca/md-image-syntax-sanitize branch 3 times, most recently from 8c0b7a1 to eafc7a1 Compare March 19, 2024 11:31
Related to: Expensify#658 (comment)

Content intended for the alt attribute in images is being incorrectly parsed from Markdown to HTML if it contains MD special characters
@kidroca kidroca force-pushed the kidroca/md-image-syntax-sanitize branch from eafc7a1 to 6a95384 Compare March 19, 2024 12:33
@kidroca
Copy link
Contributor Author

kidroca commented Mar 19, 2024

Looks like this commit is not verified caa6cf2

Fixed

This should be on hold for #658, right?

Yes, and #658 was just merged


I found a bug I've described here #661 (comment) and I'm currently trying to fix it

@marcaaron
Copy link
Contributor

Thanks @kidroca!

Let us know when it's ready for review and I can give this a merge. 🙇

@kidroca
Copy link
Contributor Author

kidroca commented Mar 19, 2024

I've addressed the bug related to the autolink rule by adjusting the pattern to exclude self-closing tags, like <img />. The pull request is ready for review. However, I'm cautious about declaring this the definitive fix for HTML within attributes. Given the current order of rule application, where code blocks and inline code precede others, HTML tags such as <code> ... </code> could still appear within the alt attribute.

This PR aimed to find a minimal solution to prevent these occurrences, but it couldn't fully achieve that, so we might consider looking for a different approach.

  • we could merge the current PR as it provides some coverage
  • we can turn to a broader audience for ideas
  • we can test a bit to make certain there are no undetected problems

@marcaaron
Copy link
Contributor

Given the current order of rule application, where code blocks and inline code precede others, HTML tags such as ... could still appear within the alt attribute.

That sounds kind of edge case to me. Can you give a practical example of what a user might do and what weird behavior would result?

@marcaaron
Copy link
Contributor

we could merge the current PR as it provides some coverage

I'd lean towards this unless there are any glaring major regressions or limitations.

@kidroca
Copy link
Contributor Author

kidroca commented Mar 19, 2024

Given the current order of rule application, where code blocks and inline code precede others, HTML tags such as ... could still appear within the alt attribute.

That sounds kind of edge case to me. Can you give a practical example of what a user might do and what weird behavior would result?

The latest test has an example:

    test('No html inside the src attribute', () => {
        const testString = '![`code`](https://example.com/image.png)';
        const resultString = '<img src="https://example.com/image.png" alt="<code>code</code>" />';
        expect(parser.replace(testString)).toBe(resultString);
    })

As you can see this particular input would result in html inside the alt attribute
The part [`code`] is first converted to html [<code>code</code>] and later added to the alt attribute
It doesn't seem to lead to anything breaking, but preferably should be escaped as I'm not sure it's valid
We've we able to escape <strong> and alike because their rules run after the image...

@marcaaron
Copy link
Contributor

Ah I see. Well, I think it's valid HTML - but definitely leaves us with semantically strange alt tags. Tbh, I don't think we need to worry about this right now. We can try to address the alt tag inconsistencies as a follow up.

Copy link
Contributor

@marcaaron marcaaron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm gonna merge this one because I think we've already got some funky alt tags with the escape characters and creating an inline code block link doesn't seem all that common. And it's only the alt tag that gets affected - which seems like a separate problem from allowing inline images.

@marcaaron marcaaron merged commit ed80215 into Expensify:main Mar 19, 2024
5 checks passed
Comment on lines +1935 to +1939
test('No html inside the src attribute', () => {
const testString = '![`code`](https://example.com/image.png)';
const resultString = '<img src="https://example.com/image.png" alt="<code>code</code>" />';
expect(parser.replace(testString)).toBe(resultString);
})
Copy link
Contributor

@tomekzaw tomekzaw Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kidroca Shouldn't it be alt="&lt;code&gt;code&lt;/code&gt;" instead of alt="<code>code</code>"?

It doesn't seem to lead to anything breaking, but preferably should be escaped as I'm not sure it's valid

This breaks Live Markdown library 😢 Not sure if < and > characters are supported in HTML attributes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, the text inside the alt attribute should be alt="`code`" or use escaped entities. However, achieving this proved challenging without significant changes. The code rules execute first, generating a string like [<code>code</code>](...), and subsequent rules then move the text into the alt attribute.

I introduced this test to address a scenario where using the syntax ![code](https://example.com/image.png) incorrectly generated an <img> tag with a src attribute like <img src="<a href="..." />" />". This format, obviously, failed to load the image.

How does this impact Live Markdown functionality?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this impact Live Markdown functionality?

Currently, Live Markdown uses pretty basic algorithm for parsing HTML that alternates between finding < and then >.

The output <img src="..." alt="<code>...</code>" ... /> breaks the parser as the > from <code> gets matched instead of the (self-)closing > from img tag.

According to the HTML standard, > is not a valid character in HTML attribute:

Attributes have a name and a value. Attribute names must consist of one or more characters other than controls, U+0020 SPACE, U+0022 ("), U+0027 ('), U+003E (>), U+002F (/), U+003D (=), and noncharacters.

Is there a chance we could somehow escape the alt attribute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
The text you're referring to in your quote is regarding the attribute name - an attribute name cannot contain the specified characters

According to the HTML standard, > is not a valid character in HTML attribute:

Attributes have a name and a value. Attribute names must consist of one or more characters other than controls, U+0020 SPACE, U+0022 ("), U+0027 ('), U+003E (>), U+002F (/), U+003D (=), and noncharacters.

Attribute values on the other hand, are less restricted as long as the value is wrapped in matching quotes. There's a specific text in the shared spec clearly stating that < and > are supported as long as the attribute value is wrapped in quotes.

The attribute value can remain unquoted if it doesn't contain ASCII whitespace or any of " ' ` = < or >. Otherwise, it has to be quoted using either single or double quotes.

In any case I'll look into it a bit further to see if there's anything I can do to mitigate the issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text you're referring to in your quote is regarding the attribute name

@kidroca oops, you're right, sorry for mistake!

In any case I'll look into it a bit further to see if there's anything I can do to mitigate the issue

I believe we could use a more complex HTML parser in Live Markdown.

cc @robertKozik who adapted ExpensiMark to Live Markdown.

Copy link
Contributor

@robertKozik robertKozik Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But does reverting back when editing work? When I've tested it yesterday I think typing ```test``` as img name, and trying to edit it leads to test string. I may be wrong tho, as I'm not remember exactly nwm I was wrong 😃

Yeah your alternative solution will indeed solve our issues as we wouldn't have an < or > inside attributes. What is ETA for merging it?

Copy link
Contributor Author

@kidroca kidroca Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We seem to be in a bit of a dead lock situation here. To advance my expensify-common PR, I need to demonstrate its effectiveness within the main branch of NewDot. Would the solution I proposed here be acceptable to you: Expensify/App#38952 (comment)?

Feel free to utilize this snapshot of expensify-common in your projects:

"expensify-common": "git+ssh://git@github.com/kidroca/expensify-common.git#3ad1c3322f4ec8999483955afe6f55c95e78172a",

This snapshot reflects the changes in my PR for expensify-common.

  1. Incorporating it into your PR would resolve your issue
  2. It would allow QA to test the MD Image feature.
    • But we should note to test this in the Tests section of your PR.
  3. This in turn would allow approve the merge of the expensify-common PR
  4. Letting you to finally switch back expensify-common to the Expensify's repo

I'll be ready to cover any follow-up work related to the MD image feature.

What are your thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit tricky because we're incorporating the expensify-common version inside react-native-live-markdown as well. So we would need to point to your branch within the new library version, which isn't my preferred approach. If possible, merging your expensify-common PR as it is would be great. If not, we can go with a patch containing your changes.

Here's how I envision the process:

  1. Merge your expensify-common PR (if possible).
  2. Publish a new version of react-native-live-markdown with both our and your fixes.
  3. Create a joint PR for react-native-live-markdown and expensify-common in the E/App repo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me just test your changes in live-markdown beforehand

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kidroca Apart from the one problem I've mentioned here everything looks great. I think we can proceed with the patch inside the react-native-live-markdown so it will unblock you - does it sound good for you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants