Ambrose Li edited this page Aug 12, 2020 · 14 revisions

mime.c and mime.h form an existing module that supports what trn loosely calls “MIME”, which includes the handling of HTML posts and the MIME decoding of non-ASCII text posts.

Trn’s HTML scanner is a sloppy string-replacement engine similar to w3m’s but generally even sloppier (though in some cases it tries harder). For example, it makes no attempt to render alt text or lay out tables, and it does not detect decorative images or convert HTML links to plain text. On the other hand, it tries to detect whether a BLOCKQUOTE should be rendered as an indent or as a USENET-style “>” attribution.

HTML is handled by scanning, not parsing, and it’s not very meaningful to talk about supported or unsupported tags. A small number of tags are handled (either in the filter_html() main loop or in tag_action()); unhandled tags are simply thrown out.

Level 2 scanning state is controlled by a bitmask that does not nest. Individual inner elements like TITLE or STYLE can be hidden, but you can’t mark HEAD as hidden: when TITLE or STYLE closes, the hidden flag (HF_IN_HIDING) is reset, so the rest of HEAD would become visible again.
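The non-nesting problem can be reproduced in a few lines. This is only a sketch: the value given to HF_IN_HIDING and the helper names are assumptions made for illustration, not trn’s actual code.

```c
#include <stdbool.h>

/* Assumed value, for illustration only; the real definition is in mime.h. */
#define HF_IN_HIDING 0x0004

/* Hypothetical helper: apply an opening or closing "hiding" tag
 * to a flat (non-nesting) bitmask. */
static int apply_hiding_tag(int flags, bool opening)
{
    if (opening)
        return flags | HF_IN_HIDING;  /* every hiding tag sets the same bit */
    return flags & ~HF_IN_HIDING;     /* every closing tag clears it, whatever the depth */
}

/* Simulates <HEAD><TITLE>...</TITLE> followed by more text inside HEAD,
 * and reports whether that later text would still be hidden. */
static bool head_text_hidden_after_title(void)
{
    int flags = 0;
    flags = apply_hiding_tag(flags, true);   /* <HEAD>, if HEAD were marked as hiding */
    flags = apply_hiding_tag(flags, true);   /* <TITLE> */
    flags = apply_hiding_tag(flags, false);  /* </TITLE> clears the shared bit */
    return (flags & HF_IN_HIDING) != 0;
}
```

With a single bit there is no record that HEAD is still open, so head_text_hidden_after_title() returns false. A depth counter instead of a flag would be one way around this, at the cost of changing the state structure.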

Exported types

Type Meaning
struct hblk
struct html_tags Descriptor for a handled HTML tag.
struct mimecap_entry
struct mime_sect

Exported constants

Unknown 1

Constant Meaning
MSF_INLINE
MSF_ALTERNATIVE
MSF_ALTERNADONE

Bitmasks for HTML scanning states

Bitmask Meaning
HF_IN_TAG Currently inside a tag
HF_IN_COMMENT Currently inside a comment (within a tag)
HF_IN_HIDING Any #text found should not be displayed
HF_IN_PRE Currently within a PRE element
HF_IN_DQUOTE Currently inside double quotes (within a tag)
HF_IN_SQUOTE Currently inside single quotes (within a tag)
HF_QUEUED_P
HF_P_OK
HF_QUEUED_NL
HF_NL_OK
HF_NEED_INDENT
HF_SPACE_OK
HF_COMPACT
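To show how the tag and quote flags might interact, here is a minimal scanner that throws out every tag and keeps only the text. The bit values and the function name strip_tags are assumptions, and the real filter_html() main loop does considerably more (comments, hiding, PRE, whitespace handling).

```c
#include <stddef.h>

/* Assumed bit values, for illustration only. */
#define HF_IN_TAG    0x0001
#define HF_IN_DQUOTE 0x0010
#define HF_IN_SQUOTE 0x0020

/* Hypothetical scanner: copies the text found outside tags from `from`
 * to `to` and returns the number of characters written (excluding NUL).
 * `to` must be at least as large as `from`. */
static size_t strip_tags(char *to, const char *from)
{
    int flags = 0;
    char *t = to;

    for (const char *f = from; *f; f++) {
        if (flags & HF_IN_TAG) {
            if (flags & HF_IN_DQUOTE) {
                if (*f == '"')
                    flags &= ~HF_IN_DQUOTE;   /* closing double quote */
            } else if (flags & HF_IN_SQUOTE) {
                if (*f == '\'')
                    flags &= ~HF_IN_SQUOTE;   /* closing single quote */
            } else if (*f == '"') {
                flags |= HF_IN_DQUOTE;
            } else if (*f == '\'') {
                flags |= HF_IN_SQUOTE;
            } else if (*f == '>') {
                flags &= ~HF_IN_TAG;          /* only an unquoted '>' ends the tag */
            }
        } else if (*f == '<') {
            flags |= HF_IN_TAG;
        } else {
            *t++ = *f;                        /* plain text is kept */
        }
    }
    *t = '\0';
    return (size_t)(t - to);
}
```

The point of the quote flags is that a “>” inside an attribute value does not end the tag.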

Unknown 2

Constant Meaning
HTML_MAX_BLOCKS

Unknown 3

Bitmask Meaning
TF_BLOCK
TF_HAS_CLOSE
TF_NL
TF_P
TF_BR
TF_LIST
TF_HIDE
TF_SPACE
TF_TAB

Numeric identifiers for handled tags

These must match tagattr below.

Constant Meaning
TAG_BLOCKQUOTE The BLOCKQUOTE tag.
TAG_BR The BR tag.
TAG_DIV The DIV tag.
TAG_HR The HR tag.
TAG_IMG The IMG tag.
TAG_LI The LI tag.
TAG_OL The OL tag.
TAG_P The P tag.
TAG_PRE The PRE tag.
TAG_SCRIPT The SCRIPT tag.
TAG_STYLE The STYLE tag.
TAG_TD The TD tag.
TAG_TH The TH tag.
TAG_TR The TR tag.
TAG_TITLE The TITLE tag.
TAG_UL The UL tag.
TAG_XML The XML tag (non-standard).
LAST_TAG Total number of handled tags.
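One way to keep numeric identifiers and a descriptor table in step is sketched below. The enum is cut down to three entries, and the struct layout, the flag values, and the names html_tag_sketch, tagattr_sketch and lookup_tag are all hypothetical; the real tagattr may be laid out differently.

```c
#include <ctype.h>
#include <stddef.h>

/* Abbreviated, hypothetical subset of the tag identifiers. */
enum { TAG_BLOCKQUOTE, TAG_BR, TAG_DIV, LAST_TAG };

/* Assumed flag values, for illustration only. */
#define TF_BLOCK     0x01
#define TF_HAS_CLOSE 0x02
#define TF_BR        0x04

struct html_tag_sketch {
    const char *name;   /* lower-case tag name */
    int flags;          /* TF_* bits */
};

/* Designated initializers tie each row to its identifier, so the
 * order of the rows no longer matters. */
static const struct html_tag_sketch tagattr_sketch[] = {
    [TAG_BLOCKQUOTE] = { "blockquote", TF_BLOCK | TF_HAS_CLOSE },
    [TAG_BR]         = { "br",         TF_BR },
    [TAG_DIV]        = { "div",        TF_BLOCK | TF_HAS_CLOSE },
};

/* Fails at compile time if an identifier is added without a table row. */
_Static_assert(sizeof tagattr_sketch / sizeof tagattr_sketch[0] == LAST_TAG,
               "tagattr_sketch must have one row per TAG_* identifier");

/* Case-insensitive lookup; returns the TAG_* index or -1 if unhandled. */
static int lookup_tag(const char *word)
{
    for (int i = 0; i < LAST_TAG; i++) {
        const char *a = word;
        const char *b = tagattr_sketch[i].name;
        while (*a && *b && tolower((unsigned char)*a) == *b) {
            a++;
            b++;
        }
        if (*a == '\0' && *b == '\0')
            return i;
    }
    return -1;
}
```

The static assertion requires C11. The code being documented appears to rely on the two lists being maintained by hand instead.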

Unknown 4

Constant Meaning
CLOSING_TAG
OPENING_TAG

MIME states

Constant Meaning
NOT_MIME Not a MIME post (cf. is_mime).
TEXT_MIME A text/plain attachment.
ISOTEXT_MIME A text/plain attachment in ISO-8859-1 (not used by the UTF-8 patch).
MESSAGE_MIME
MULTIPART_MIME
IMAGE_MIME
AUDIO_MIME
APP_MIME
UNHANDLED_MIME An unknown MIME attachment.
SKIP_MIME
DECODE_MIME
BETWEEN_MIME
END_OF_MIME
HTMLTEXT_MIME
ALTERNATE_MIME

mimecap attributes

Not sure how these are used.

Constant Meaning
MCF_NEEDSTERMINAL
MCF_COPIOUSOUTPUT

Global variables

Global variable Type Meaning
auto_view_inline bool Whether trn should automatically decode inline attachments.
mime_article MIME_SECT
mimecap_list LIST*
mime_getc_line char*
mime_section MIME_SECT* Level 2 MIME scanning state. See HF_* constants above for the html field.
mime_state short Level 1 MIME scanning state. See *_MIME constants above.
multipart_separator char* Label to represent a MIME boundary in article display.
tagattr HTML_TAGS [] The list of handled HTML tags. Must match TAG_* above.

Exported functions and macros

filter_html

  • int filter_html(char* t, char* f)
  • t: pointer to “to” buffer
  • f: pointer to “from” buffer
  • Return value: purpose unknown

Converts the HTML post in f into plain-text form and puts the result in t. Uses tag_action().

Note that the current code strips double and single quotes and looks at only the first 31 characters in a tag, so it’s not possible to handle alt texts, for example.
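The effect of the 31-character limit and the quote stripping can be seen in a small sketch. The helper name collect_tag_word and the macro TAG_WORD_MAX are made up for this example, and stopping at the first “>” is a simplification; the real loop is part of filter_html().

```c
#include <stddef.h>

/* Limit taken from the text above; the buffer needs one extra byte for NUL. */
#define TAG_WORD_MAX 31

/* Hypothetical helper: `tag` points just past the '<'. Copies at most
 * TAG_WORD_MAX characters into `word`, dropping quote characters, and
 * stops at '>' or end of string. Returns the number of characters stored. */
static size_t collect_tag_word(char word[TAG_WORD_MAX + 1], const char *tag)
{
    size_t n = 0;

    for (; *tag && *tag != '>'; tag++) {
        if (*tag == '"' || *tag == '\'')
            continue;               /* quote characters are dropped */
        if (n < TAG_WORD_MAX)
            word[n++] = *tag;       /* anything past the limit is silently lost */
    }
    word[n] = '\0';
    return n;
}
```

For the tag `<img src="a.png" alt="A long description here">` this keeps only `img src=a.png alt=A long descri`: the attribute boundaries are gone with the quotes, and the alt text is cut short.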

mimecap_ptr

  • mimecap_ptr(n)

mime_SetArticle

  • void mime_SetArticle()

Sets up the mime_article structure based on the article’s headers. The function manipulates the global variables htype, is_mime and multimedia_mime directly.

mime_SetArticle re-sets is_mime when Content-Transfer-Encoding is 7bit or 8bit (e.g. CJK). I can’t see how this is justifiable given headers (esp. From and Subject) can still be QP-quoted, but for some reason this seems to be fine.
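To make the concern concrete: a header can carry an RFC 2047 encoded word (`=?charset?Q?...?=` or `=?charset?B?...?=`) regardless of the body’s Content-Transfer-Encoding. The check below is a rough sketch of how such a header could be recognised; the function has_encoded_word is hypothetical and is not part of trn.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true if `value` appears to contain an
 * RFC 2047 encoded word of the form =?charset?Q?text?= or =?charset?B?text?=.
 * This is a loose check, not a validating parser. */
static bool has_encoded_word(const char *value)
{
    const char *p = value;

    while ((p = strstr(p, "=?")) != NULL) {
        const char *q = strchr(p + 2, '?');   /* '?' that ends the charset */
        if (q != NULL && q[1] != '\0' && q[2] == '?') {
            int enc = toupper((unsigned char)q[1]);
            if ((enc == 'Q' || enc == 'B') && strstr(q + 3, "?=") != NULL)
                return true;
        }
        p += 2;                               /* keep looking after this "=?" */
    }
    return false;
}
```

A From header such as `=?ISO-8859-1?Q?Andr=E9?= <a@example.com>` passes this check even when the article body is declared as 7bit.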

Important internal functions

tag_action

  • static char* tag_action(char* t, char* word, bool_int opening_tag)
  • t: pointer to “to” buffer
  • word: the first 31 characters inside the angle brackets
  • opening_tag: TRUE if we’re on an opening tag, FALSE if we’re on a closing tag
  • Return value: an updated pointer to (possibly a different spot in) the “to” buffer

Performs state manipulations based on the tag in word, including possibly modifying the “to” buffer in t, partly based on the tagattr array.
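The “return an updated pointer” convention can be illustrated with a much reduced stand-in. The function tag_action_sketch and the output it produces for each tag are assumptions made for this example; the real tag_action() is driven by tagattr and the scanning state, and may also move the pointer backwards.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for tag_action(): may append to the "to" buffer
 * at `t` and returns the new write position. The caller must make sure
 * there is room for at least two more characters. */
static char *tag_action_sketch(char *t, const char *word, bool opening_tag)
{
    if (strcmp(word, "br") == 0) {
        *t++ = '\n';                 /* line break */
    } else if (strcmp(word, "p") == 0 && opening_tag) {
        *t++ = '\n';                 /* blank line before a paragraph */
        *t++ = '\n';
    }
    /* Unrecognised tags leave the buffer and the pointer unchanged. */
    return t;
}
```

A caller would use it as `t = tag_action_sketch(t, word, opening_tag);` and carry on writing at the returned position.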