Does not escape Reddit formatting characters #19

itsthejoker · 2018-05-13T14:42:47Z

Example: https://old.reddit.com/r/TranscribersOfReddit/comments/8j2hkw/casualuk_image_we_have_confirmation_that_the/dywfq7b/

Instead of being properly escaped, the first character is rendered as formatting, which defeats the purpose. We have two options for this:

a) simply insert the entire transcription into a code block

b) create a list of all Reddit snudown special characters (like _, `, ~) and escape them whenever they appear.
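For illustration, option b) could look roughly like the Python sketch below. The character set is an assumption (the issue only names _, ` and ~), and `escape_markdown` is a hypothetical helper name, not something taken from the bot's code.

```python
import re

# Assumed set of characters to escape; only _, ` and ~ are named in the issue.
# The backslash is included so that existing backslashes are not read as escapes.
SPECIAL_CHARS = "\\`*_~^#>[]()"

# Build a character class that matches any one of the special characters.
_ESCAPE_RE = re.compile("([" + re.escape(SPECIAL_CHARS) + "])")


def escape_markdown(text: str) -> str:
    """Prefix every special character with a backslash."""
    return _ESCAPE_RE.sub(r"\\\1", text)
```

Doing the replacement in a single regex pass means the backslashes added by the function are never escaped a second time.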

This issue is to discuss available fixes and to then enact them.


perryprog · 2018-05-13T16:01:38Z

Oh that's fun. I'm all for putting it in a code block, so users without RES don't have to recreate any characters that weren't displayed for whatever reason. (RES has a button that does that for you.)

perryprog · 2018-05-13T16:35:58Z

Update: after discussion on Discord, we've decided that escaping will be better due to the way copying and pasting works on mobile.

codingJWilliams · 2018-05-13T17:03:47Z

The above PR demonstrates one way we could solve this by escaping snudown characters.

perryprog · 2018-05-13T18:37:20Z

Now I'm not sure escaping is the best option. There are so many different ways to do this, and none of them behaves the same on every platform.

@TheLonelyGhost what do you think?

TheLonelyGhost · 2018-05-14T03:30:27Z

Honestly, you don't want me to weigh in on whether I'm pro-markdown (escaping) or anti-markdown (code block). I don't really like markdown.

Simply put, I haven't seen our OCR bot ever give a transcription that should be interpreted and rendered with a markdown interpreter. Rather, it should be interpreted as plain text. The markup to guarantee it's rendered as such? A code block. Since we don't have a choice with reddit and it will always be interpreted as markdown, we're stuck with code blocks as our only option.
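As a rough sketch of the code block approach in Python: indent every line by four spaces, which markdown treats as a code block. `to_code_block` is a hypothetical helper name, and using the indented form rather than a fenced block is an assumption made here for illustration.

```python
def to_code_block(text: str) -> str:
    """Indent every line by four spaces so markdown renders it verbatim."""
    # Blank lines get the indent too, so the block is not split in two.
    return "\n".join("    " + line for line in text.splitlines())
```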

Rant incoming, feel free to skip.

I merely tolerate markdown as incremental progress. Why?

I'd rather the public center around it instead of 2006-era warez NFO files, full of ASCII art visually differentiating sections and markup in the document in unique and completely different ways compared to the last NFO file you saw... but that leaves us in a lesser-of-two-evils situation, not actually liking one or (god forbid) both options.

Secondly, if we didn't have code block as an option we would have to escape SO MANY CHARACTERS in SO MANY CONTEXTS.

Unescaped markdown (seemingly arbitrarily) interprets certain characters in a markdown, HTML, or even LaTeX context, depending on whether it uses redcarpet, kramdown, pandoc, or some other markdown interpreter. We would have to account for any...

- ampersand (&) because HTML interpretation might screw with it
- xml-like word (<foo>) because it might disappear as invalid HTML, thanks to the browser
- backslashes (\) because it's the escape character itself
- ... any number of other edge-cases due to markdown's inherently designed flexibility with nested syntaxes.
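To make the first three cases concrete, here is a minimal Python sketch. It assumes the target renderer decodes HTML entities, which would need to be verified per platform, and `escape_html_and_backslashes` is a hypothetical name used only for illustration.

```python
import html


def escape_html_and_backslashes(text: str) -> str:
    """Neutralise backslashes, ampersands and XML-like tags."""
    # Double existing backslashes so they are not read as escape characters.
    text = text.replace("\\", "\\\\")
    # html.escape turns &, < and > into entities; quote=False leaves quotes alone.
    return html.escape(text, quote=False)
```

Even this small helper depends on an assumption about how the renderer treats entities, which is the kind of per-interpreter re-evaluation being complained about here.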

Frankly, it's a nightmare. I have to relearn the rules of how multiple syntaxes (HTML + markdown) are allowed to intermingle depending on whether it's snudown, GitHub-flavored markdown, straight-up CommonMark, pandoc, or some other variant. If I have to re-evaluate that every time, it's not ready for automation.

I hate markdown.

/rant

TimJentzsch · 2021-04-11T10:18:28Z

I argue that the text has to be escaped twice - once for the bot post to be displayed correctly and once for the pasted text to be displayed correctly in the transcription. So I think it should first be escaped with backslashes and then put in a code block.

From experience, I can say that the most important things to escape would be lists and headings. E.g.

    - Item 1
    - Item 2
    
    #hashtag

should become

    \- Item 1
    \- Item 2
    
    \#hashtag

which should be simple enough to achieve with regular expressions.
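A minimal Python sketch of that two-step idea, under some assumptions: only `-`, `*`, `+` bullets and `#` headings are handled (numbered lists are not), and the names `escape_line_starts` and `escape_then_block` are hypothetical.

```python
import re

# Optional indent, then either a list bullet followed by whitespace or a '#'.
_LINE_START_RE = re.compile(r"^([ \t]*)([-*+](?=\s)|#)", re.MULTILINE)


def escape_line_starts(text: str) -> str:
    """Backslash-escape list bullets and heading markers at line starts."""
    return _LINE_START_RE.sub(r"\1\\\2", text)


def escape_then_block(text: str) -> str:
    """Escape first, then indent by four spaces to form a code block."""
    escaped = escape_line_starts(text)
    return "\n".join("    " + line for line in escaped.splitlines())
```

Calling `escape_then_block` on the list-and-hashtag input from the example reproduces the escaped, indented output shown there. A hyphen in the middle of a line, as in `a - b`, is left alone because the pattern is anchored to the line start.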

Also, is working correctly on mobile an important requirement? I doubt that many people transcribe on mobile and also use the OCR.

itsthejoker added the labels bug, bot, good first issue, OCR and removed the label bug on May 13, 2018

codingJWilliams mentioned this issue May 13, 2018

Escape formatting from OCR results #20

Closed
