Conversion Guide

This document describes the steps to convert an HTML file to Markdown for MDN content.

Prerequisites ✅

To perform Markdown conversion, you must:

  • Have Git, Node.js (>=v12, <v17), and Yarn installed
  • Have a GitHub account (it's free!)
  • Have a local copy of mdn/markdown: the conversion tool is located in this repo
    • See README for setup instructions
    • This script was originally in mdn/yari, but it has been forked into this repository for easier maintenance and sunsetting when it is no longer needed
  • Have a local copy of mdn/content and/or mdn/translated-content
    • mdn/content is needed for en-US, whereas mdn/translated-content is for all other locales

How-to

Basically, we will perform the following:

  • Perform a dry run of Markdown conversion and:
    • Assess the report generated by the script
    • Update the HTML document to remove problematic elements
  • Repeat the above steps until the results are satisfactory
  • Run the Markdown conversion script for real
  • Submit pull requests with the changes

Perform a dry conversion run 🧑‍🔬

Perform a dry run of the conversion by running yarn run h2m <folder> --locale <locale> --mode dry in your local checkout of mdn/markdown, where <folder> is the specific folder relative to the language root (AKA mdn-translated-content/). This performs a test run of the conversion and generates a report, but does not modify any files.
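If you expect to repeat the dry run many times, the command can be wrapped in a small helper. The following is a minimal Python sketch and not part of mdn/markdown: the function names, the checkout path, and the example folder web/html are hypothetical, while the command line itself is copied as written in this guide.

```python
import subprocess
from pathlib import Path


def build_h2m_command(folder: str, locale: str, mode: str = "dry") -> list[str]:
    """Assemble the h2m command line as it is written in this guide."""
    return ["yarn", "run", "h2m", folder, "--locale", locale, "--mode", mode]


def run_dry_conversion(markdown_checkout: Path, folder: str, locale: str) -> int:
    """Run a dry conversion inside the mdn/markdown checkout; return the exit code."""
    result = subprocess.run(build_h2m_command(folder, locale), cwd=markdown_checkout)
    return result.returncode


if __name__ == "__main__":
    # Hypothetical values: adjust the folder and locale to your own setup.
    print(" ".join(build_h2m_command("web/html", "fr")))
```

Building the command as a list (rather than a single string) avoids shell quoting problems if a folder name contains spaces.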

Once the script completes, it displays a message with the count of HTML elements that could not be converted and the name of the report file (md-conversion-problems-report-[TIMECODE].md), which is created in the script repo's folder. The report describes every element that the script could not handle and therefore left as HTML. Most of these need to be removed from the HTML first (see Common unhandled elements), but some can be ignored.

You can see examples of such reports:

If the message did not appear and there was no new report file, great! That means that the conversion was 100% successful and you can now perform the real conversion.
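After several dry runs, the script repo's folder can hold more than one report. The sketch below shows one possible way to pick the newest one, assuming only the file name pattern given above; the function name latest_report is hypothetical, and the choice of modification time as the ordering is an assumption (the format of [TIMECODE] is not specified here).

```python
from pathlib import Path
from typing import Optional


def latest_report(script_repo: Path) -> Optional[Path]:
    """Return the most recently modified conversion report, or None if none exists."""
    reports = list(script_repo.glob("md-conversion-problems-report-*.md"))
    if not reports:
        # No report file: per this guide, the dry run found nothing unhandled.
        return None
    return max(reports, key=lambda path: path.stat().st_mtime)
```

A return value of None corresponds to the "no new report file" case only if older reports have been cleared out of the folder first.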

Common unhandled elements

There are a number of elements you will often see in the "unhandled elements list". This section lists the most common ones and explains how to fix them for conversion.

  • dl > dt/dd
    • The conversion script has strict expectations for the contents of a dl element. The first child element should be a dt element, and for every dt element, there should be one, and only one, corresponding dd element.
    • If the number of dt elements is not equal to the number of dd elements, the script cannot convert them.
    • Remove any stray dt elements, and combine sibling dd elements together using <br /> tags, then try again.
  • *.hidden
    • The .hidden class was used for content that was visible when editing in the old wiki engine but was not shown to a typical reader. More than likely, these elements should simply be removed, as they are no longer helpful.
    • Either remove the class or remove the entire element and its contents at your own discretion.
  • th/td
    • Table cells may sometimes include lists, code blocks, and other multiline content. Since Markdown does not allow this, tables with these cell contents cannot be converted.
    • Separate the table contents into a multi-paragraph structure if possible.
    • Translators: compare the document to the current English locale for an example of how to handle that specific element.
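Some of the problems listed above can be spotted before running the script. The following Python sketch uses only the standard library to flag dl elements whose dt/dd children do not pair up and elements carrying the .hidden class. It mirrors the rules as described in this guide, not the conversion script's actual logic, and the names ConversionPrecheck and precheck are hypothetical. Table cells are not checked.

```python
from html.parser import HTMLParser


class ConversionPrecheck(HTMLParser):
    """Collect <dl> pairing problems and .hidden elements from an HTML string."""

    def __init__(self) -> None:
        super().__init__()
        self.open_lists: list[list[str]] = []  # one dt/dd sequence per open <dl>
        self.problems: list[str] = []

    def handle_starttag(self, tag, attrs):
        line, _ = self.getpos()
        classes = (dict(attrs).get("class") or "").split()
        if "hidden" in classes:
            self.problems.append(f"line {line}: <{tag}> has the .hidden class")
        if tag == "dl":
            self.open_lists.append([])
        elif tag in ("dt", "dd") and self.open_lists:
            # Record the child against the innermost open <dl>.
            self.open_lists[-1].append(tag)

    def handle_endtag(self, tag):
        if tag == "dl" and self.open_lists:
            line, _ = self.getpos()
            children = self.open_lists.pop()
            # Expected shape: dt, dd, dt, dd, ... starting with dt.
            expected = ["dt", "dd"] * (len(children) // 2)
            if children != expected:
                self.problems.append(
                    f"line {line}: <dl> closes with unpaired children: {' '.join(children)}"
                )


def precheck(html: str) -> list[str]:
    """Return a list of human-readable problems found in the HTML."""
    parser = ConversionPrecheck()
    parser.feed(html)
    parser.close()
    return parser.problems
```

For example, precheck(Path("index.html").read_text()) on a hypothetical index.html returns an empty list when nothing was flagged. The dry-run report remains the authoritative source for what the script could not handle.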

Rinse and repeat from the conversion report

Once you have taken care of a good share of the elements the script reported as unhandled, re-run the script in dry mode. A new report file will be generated describing whatever remains. Repeat the above steps to shorten the list as much as possible. Once the result is satisfactory, it is time to run the conversion script for real.

Optional: submit a PR with the HTML cleanup

To make review easier when the document is converted to Markdown, you may want to submit a pull request containing only the cleanup. If you have removed part of the file (e.g. a .hidden block), this will help convey that the removal was intentional.

You may skip this step and head straight to conversion, but we recommend at least creating a separate commit to track the changes.

Converting for real

Once the preparations are complete, you are ready to perform the conversion. Run yarn md h2m <folder> --locale <locale> --mode replace and open a PR with the changes. The replace mode first renames the HTML files from .html to .md without converting them, then stages those renames, and finally converts the file contents (without staging them). To better retain Git history, we recommend committing the staged changes (the file renames) first, and then creating a second commit with the conversion.
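The two-commit sequence recommended above can be sketched as follows. This is an illustration under stated assumptions, not a prescribed workflow: the function names and commit messages are hypothetical, and it assumes the working tree contains no other modified tracked files, since git add --update stages every change to tracked files.

```python
import subprocess
from pathlib import Path


def commit_plan(rename_message: str, convert_message: str) -> list[list[str]]:
    """Git commands to run, in order, after the replace mode has finished."""
    return [
        # 1. The .html -> .md renames are already staged: commit them alone.
        ["git", "commit", "-m", rename_message],
        # 2. Stage the converted contents (changes to tracked files), then commit.
        ["git", "add", "--update"],
        ["git", "commit", "-m", convert_message],
    ]


def run_plan(content_checkout: Path, plan: list[list[str]]) -> None:
    """Run each command in the content checkout, stopping at the first failure."""
    for command in plan:
        subprocess.run(command, cwd=content_checkout, check=True)
```

Keeping the rename in its own commit lets Git record it as a pure move, so the history of the .md file can still be followed back through the old .html file.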

Check for build errors

Sometimes, characters within macros are unintentionally escaped as part of the conversion. Check for macro issues by running yarn build <files...> and looking for any errors.

Smaller PRs mean faster review

To speed up review time and reduce the chance of merge conflicts while your PR is in review, it is highly recommended to keep the number of files touched to a minimum. Although the changes are created using this script, every PR still needs to be carefully reviewed for accuracy and malicious changes.

Specificities for localization 🌐

  • Typography
    • You may decide to keep some unconverted elements for consistency and/or typographic reasons. For instance, in the French docs, we were already using <sup> consistently for ordinal numbers and <i lang="en"> for English terms that are not code and are sometimes kept untranslated for clarity (e.g. "viewport").
  • Yari translation
  • Using commits from the last HTML state of mdn/content
    • Because the English content is the reference for localized content, when tackling issues in existing content, don't hesitate to browse the mdn/content repo at the last commit before the Markdown conversion. You may then be able to update your localized content's structure to match the most recent, correct English one.
    • List of those commits per section (poking @wbamberg if it may help tagging :)) ⏳

Other resources

Credits

This guide was originally written by @SphinxKnight for the localization team. It has been updated by @queengooborg for the new script updates.