Hmmarkdown 2

No AI - Made by Human

Everyone has an opinion on markdown but why stop there? Write your own parser, make those opinions reality! That’s what I did with Hmmarkdown — my HTML-aware markdown library. It has built my website content for the past year.

Turns out parsing markdown (with HTML) isn’t easy. My original approach evolved into a game of whac-a-mole to quash edge case bugs. Last week I began a new parsing experiment I’d been mulling over. The idea proved workable and I finished the job.

Hmmarkdown 2 was born!

The new codebase is still rough around the edges but already an upgrade. I used my original test suite to ensure the same output. My primary goal was a more maintainable, extendable, and faster library, which it should be soon.

The Purpose

Markdown is best in its simplest form. Complex extensions to the syntax make no sense. If HTML is the target and easier to mark up, just write HTML! Existing markdown libraries allow HTML but they skip past it. The purpose of Hmmarkdown is to allow me to write primarily markdown but interweave HTML where it makes sense.

Along with my original example, here’s a common pattern I use:

<figure>
  > blockquote
  <figcaption>[reference](https://example.com)</figcaption>
</figure>

The mix of HTML and markdown above is transformed into the HTML below.

<figure>
  <blockquote>
    <p>blockquote</p>
  </blockquote>
  <figcaption>
    <a href="https://example.com/">reference</a>
  </figcaption>
</figure>

In practice I mix little markup but it’s extremely useful to have the ability. Was this worth the investment? Probably not, but I’m in too deep!

Old and Busted

My old parser separated lines by \n then grouped lines by block: paragraph, blockquote, heading, list, etc. Those blocks were then parsed as HTML (crudely). Text nodes were parsed for inline markdown: links, bold, italic, etc. A subset of block-level HTML elements were passed back through the parser. Regular expressions did the heavy lifting.

New Hotness

That was the old architecture. It worked but it got messy. The new parser begins with a more traditional tokenizer. The tokenizer iterates the input character by character to generate an array of tokens. Those that appear in this post include NEWLINE, ASTERISK, EXCLAMATION, PLUS, TAG, and TEXT.

Wait a minute “+” is not markdown! I’ll explain later…

Every token is a single ASCII character except for Tag and Text. Tag tokens are HTML tags, like <div>, </div>, or <div/>. Text tokens are a unicode string of everything else. From the tokens, I generate a basic DOM-like tree with a “root” node and child tokens.

I ignore carriage returns (macOS user), they’re probably dealt with in Text nodes and eventually trimmed?
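To make the tokenizer idea concrete, here is a minimal sketch in TypeScript. It is not the Hmmarkdown source: the `Token` shape, the `SINGLE` map, and the naive tag detection are all assumptions for illustration, and only the single-character tokens named in this post are included. Carriage returns get no special handling, so they simply end up inside Text tokens.

```typescript
// Hypothetical token shapes; the names mirror the post, not the real library.
type SingleType = "NEWLINE" | "ASTERISK" | "EXCLAMATION" | "PLUS";
type Token =
  | { type: SingleType }
  | { type: "TAG" | "TEXT"; value: string };

// Assumed character-to-token map, limited to tokens mentioned in the post.
const SINGLE = new Map<string, SingleType>([
  ["\n", "NEWLINE"],
  ["*", "ASTERISK"],
  ["!", "EXCLAMATION"],
  ["+", "PLUS"],
]);

function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  let text = "";
  // Emit any buffered text as a single TEXT token.
  const flush = (): void => {
    if (text !== "") tokens.push({ type: "TEXT", value: text });
    text = "";
  };
  let i = 0;
  while (i < input.length) {
    const char = input[i];
    // Naive tag detection: "<", an optional "/", a letter, then up to the next ">".
    const end = char === "<" ? input.indexOf(">", i) : -1;
    if (end !== -1 && /^<\/?[a-zA-Z]/.test(input.slice(i, i + 3))) {
      flush();
      tokens.push({ type: "TAG", value: input.slice(i, end + 1) });
      i = end + 1;
      continue;
    }
    const single = SINGLE.get(char);
    if (single !== undefined) {
      flush();
      tokens.push({ type: single });
    } else {
      text += char; // everything else accumulates into TEXT
    }
    i++;
  }
  flush();
  return tokens;
}
```

A real tokenizer would need to be stricter about what counts as a tag (a ">" inside an attribute value would trip this one up), but the shape is the same: one pass, one character at a time, no regex doing the heavy lifting.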

Using the HTML + markdown input example below:

<aside class="Box">
  This is a **"boxed"** paragraph.
</aside>

End of document!

The initial token tree state could be visualised like this:

TAG(<root>)
  TAG(<aside class="Box">)
    NEWLINE
    TEXT('  This is a ')
    ASTERISK
    ASTERISK
    TEXT('"boxed"')
    ASTERISK
    ASTERISK
    TEXT(' paragraph.')
    NEWLINE
  NEWLINE
  NEWLINE
  TEXT('End of document')
  EXCLAMATION
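One way to get from a flat token array to a tree like the one above is a stack of open tags. The sketch below assumes hypothetical `Token` and `TreeNode` shapes and a `buildTree` helper that are not taken from the library. It also ignores void elements such as `<img>` and does not check that closing tags match their opening tag.

```typescript
// Hypothetical shapes for illustration only.
type Token =
  | { type: "NEWLINE" | "ASTERISK" | "EXCLAMATION" | "PLUS" }
  | { type: "TAG" | "TEXT"; value: string };
type TreeNode = { token: Token; children: TreeNode[] };

function buildTree(tokens: Token[]): TreeNode {
  const root: TreeNode = { token: { type: "TAG", value: "<root>" }, children: [] };
  // The last entry of the stack is the currently open tag.
  const stack: TreeNode[] = [root];
  for (const token of tokens) {
    if (token.type === "TAG" && token.value.startsWith("</")) {
      // Closing tag: step back up to the parent (never pop the root).
      if (stack.length > 1) stack.pop();
      continue;
    }
    const node: TreeNode = { token, children: [] };
    stack[stack.length - 1].children.push(node);
    // Opening tags (but not self-closing ones like <div/>) become the new parent.
    if (token.type === "TAG" && !token.value.endsWith("/>")) stack.push(node);
  }
  return root;
}
```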

With this tree I recursively parse the open tag nodes where the tag name is in an allowed set. This lets me ignore hard-coded HTML tags like <script>, <style>, and <iframe>, which I never want to parse or modify.
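The allowed-set recursion might look roughly like this. The contents of `ALLOWED`, the simplified node shape with a `name` field, and the `parseAllowed` function are assumptions for the sketch; the real set of tag names is not listed in this post.

```typescript
// Simplified, hypothetical node shape: tags carry a name instead of raw markup.
type TreeNode =
  | { type: "TEXT"; value: string }
  | { type: "TAG"; name: string; children: TreeNode[] };

// Assumed allow list; anything else (script, style, iframe...) is left alone.
const ALLOWED = new Set(["root", "aside", "figure", "figcaption"]);

function parseAllowed(
  node: TreeNode,
  parseChildren: (children: TreeNode[]) => TreeNode[],
): TreeNode {
  // Text nodes and tags outside the allowed set pass through untouched.
  if (node.type !== "TAG" || !ALLOWED.has(node.name)) return node;
  // Parse this tag's children, then recurse into any allowed tags among them.
  const children = parseChildren(node.children).map((child) =>
    parseAllowed(child, parseChildren),
  );
  return { ...node, children };
}
```

Here `parseChildren` stands in for whatever markdown parsing happens at each level, so the walker itself stays tiny.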

Using the <aside> node for example, I iterate the children to generate a new array of children. The first two tokens, NEWLINE and TEXT, are appended to the new array. Then an ASTERISK token is found so consumeStrong is called. It will return a <strong> node (or nothing, for false positives). If that fails consumeEmphasis will be tried. If nothing matches, the ASTERISK and TEXT tokens are appended without change. In this example **"boxed"** from the original input matches bold formatting.
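A look-ahead consumer in this style could be sketched as follows. Only the name consumeStrong comes from the post; its signature, the `parseInline` loop, and the rule that bold cannot span a NEWLINE are assumptions. The consumeEmphasis fallback is left out to keep the sketch short.

```typescript
// Hypothetical node shapes for illustration.
type TreeNode =
  | { type: "ASTERISK" | "NEWLINE" }
  | { type: "TEXT"; value: string }
  | { type: "TAG"; value: string; children: TreeNode[] };

// Try to match ASTERISK ASTERISK ... ASTERISK ASTERISK starting at `start`.
// Returns the new <strong> node and the index to resume from, or null.
function consumeStrong(
  tokens: TreeNode[],
  start: number,
): { node: TreeNode; next: number } | null {
  const isStar = (i: number): boolean => tokens[i]?.type === "ASTERISK";
  if (!isStar(start) || !isStar(start + 1)) return null;
  for (let i = start + 2; i < tokens.length - 1; i++) {
    if (tokens[i].type === "NEWLINE") return null; // assumption: no multi-line bold
    if (isStar(i) && isStar(i + 1)) {
      const children = tokens.slice(start + 2, i);
      if (children.length === 0) return null; // "****" is a false positive
      return { node: { type: "TAG", value: "<strong>", children }, next: i + 2 };
    }
  }
  return null;
}

// Build the new children array, replacing matches and copying everything else.
function parseInline(tokens: TreeNode[]): TreeNode[] {
  const out: TreeNode[] = [];
  let i = 0;
  while (i < tokens.length) {
    const match = tokens[i].type === "ASTERISK" ? consumeStrong(tokens, i) : null;
    if (match) {
      out.push(match.node);
      i = match.next;
    } else {
      out.push(tokens[i]); // no match: append without change
      i++;
    }
  }
  return out;
}
```

Because a failed match returns null without touching the input, trying the next consumer after a false positive costs nothing more than another look-ahead.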

The tree state now looks like this:

TAG(<root>)
  TAG(<aside class="Box">)
    NEWLINE
    TEXT('  This is a ')
    TAG(<strong>)
      TEXT('"boxed"')
    TEXT(' paragraph.')
    NEWLINE
  NEWLINE
  NEWLINE
  TEXT('End of document')
  EXCLAMATION

The next step wraps text and inline tags with HTML paragraphs. This is probably the ropiest area of my code but it works (mostly). NEWLINE tokens play a key role and they’re removed at this stage. Excess whitespace is also trimmed.

TAG(<root>)
  TAG(<aside class="Box">)
    TAG(<p>)
      TEXT('This is a ')
      TAG(<strong>)
        TEXT('"boxed"')
      TEXT(' paragraph.')
  TAG(<p>)
    TEXT('End of document')
    EXCLAMATION

Next I merge adjacent text tokens before applying SmartyPants replacement, and finally HTML entities are escaped. In this example, because the EXCLAMATION token did not match markdown image syntax, it is merged as text.

TAG(<root>)
  TAG(<aside class="Box">)
    TAG(<p>)
      TEXT('This is a ')
      TAG(<strong>)
        TEXT('“boxed”')
      TEXT(' paragraph.')
  TAG(<p>)
    TEXT('End of document!')
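The merge, SmartyPants, and escape steps can be sketched in that order. Everything here is hypothetical: `mergeText`, `smartQuotes`, `escapeHtml`, and `finishText` are illustrative names, and `smartQuotes` handles only double quotes, which is far less than SmartyPants proper does.

```typescript
// Hypothetical node shapes; leftover punctuation tokens carry no value.
type TreeNode =
  | { type: "EXCLAMATION" | "ASTERISK" }
  | { type: "TEXT"; value: string }
  | { type: "TAG"; value: string; children: TreeNode[] };

// The literal character each unmatched punctuation token falls back to.
const LITERAL = { EXCLAMATION: "!", ASTERISK: "*" } as const;

// Turn unmatched tokens into text and join runs of adjacent text (one level).
function mergeText(nodes: TreeNode[]): TreeNode[] {
  const out: TreeNode[] = [];
  for (const node of nodes) {
    if (node.type === "TAG") {
      out.push(node);
      continue;
    }
    const value = node.type === "TEXT" ? node.value : LITERAL[node.type];
    const last = out[out.length - 1];
    if (last && last.type === "TEXT") last.value += value;
    else out.push({ type: "TEXT", value });
  }
  return out;
}

// Very reduced smart quotes: opening after start/whitespace/bracket, else closing.
function smartQuotes(text: string): string {
  return text.replace(/(^|[\s(\[])"/g, "$1\u201C").replace(/"/g, "\u201D");
}

function escapeHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Merge first, then smart quotes, then escaping, recursing into tags.
function finishText(nodes: TreeNode[]): TreeNode[] {
  return mergeText(nodes).map((node): TreeNode => {
    if (node.type === "TAG") return { ...node, children: finishText(node.children) };
    if (node.type === "TEXT") return { ...node, value: escapeHtml(smartQuotes(node.value)) };
    return node;
  });
}
```

Merging before the quote replacement matters: a quote that opens in one TEXT token and closes in another can only be paired up once the two are a single string.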

The final HTML output is a simple recursive function over tree nodes to generate a string. I’ve added extra formatting for readability below. HTML attributes are never parsed; they just come along for the ride.

<aside class="Box">
  <p>This is a <strong>“boxed”</strong> paragraph.</p>
</aside>
<p>End of document!</p>
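A recursive renderer of that kind fits in a few lines. This sketch assumes each tag node keeps its raw opening tag as a string (a hypothetical shape), which is one way attributes could come along for the ride without ever being parsed.

```typescript
// Hypothetical node shape: TAG nodes store the raw opening tag, attributes included.
type TreeNode =
  | { type: "TEXT"; value: string }
  | { type: "TAG"; value: string; children: TreeNode[] };

function render(node: TreeNode): string {
  if (node.type === "TEXT") return node.value;
  const inner = node.children.map((child) => render(child)).join("");
  // The synthetic root has no markup of its own.
  if (node.value === "<root>") return inner;
  // Derive the closing tag from the tag name; attributes are copied verbatim.
  const name = node.value.slice(1).split(/[\s>]/)[0];
  return `${node.value}${inner}</${name}>`;
}

const tree: TreeNode = {
  type: "TAG",
  value: "<root>",
  children: [
    {
      type: "TAG",
      value: '<aside class="Box">',
      children: [
        {
          type: "TAG",
          value: "<p>",
          children: [
            { type: "TEXT", value: "This is a " },
            { type: "TAG", value: "<strong>", children: [{ type: "TEXT", value: "\u201Cboxed\u201D" }] },
            { type: "TEXT", value: " paragraph." },
          ],
        },
      ],
    },
    { type: "TAG", value: "<p>", children: [{ type: "TEXT", value: "End of document!" }] },
  ],
};

console.log(render(tree));
```

Run against the final tree from this example, it prints the same markup as above, just without the added line breaks and indentation.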

And that is how Hmmarkdown 2 works!

Or at least should work. We’ll see if any formatting bugs appear on my website. The new tokenizer approach means I can largely avoid regex. The supported markdown syntax is still punishingly strict and opinionated.

The code repo is public but I have plenty to tidy up and optimise. There is no validation nor error reporting. I wouldn’t advise using it unless you’re me!

The Plus Token

So about that PLUS token. Whereas unordered lists start with an ASTERISK token, ordered lists would be written as NUMBER followed by PERIOD.

1. Item one
2. Item two

In fact, the numeric order and value don’t matter to markdown.

9999. Item one
1. Item two

Both examples should output identical items marked 1 and 2 sequentially. (Some libraries do add a start attribute. That’s a thing I don’t need.)

Anyway, this is a pain to parse. I would need eleven additional tokens for digits 0 to 9 and PERIOD. That adds overhead and a lot of false positives in the look-ahead matching. To avoid this entirely I do a cheeky bit of regex pre-processing. Ordered list lines are replaced with PLUS before I tokenize.

+ Item one
+ Item two

Now I can use the exact same logic I use to parse unordered lists.
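The pre-processing step could be as small as one multiline replace. The regex and the function name `preprocessOrderedLists` are guesses for illustration; the actual pattern used by the library is not shown in this post.

```typescript
// Replace "<digits>. " at the start of any line with "+ " before tokenizing.
// Assumption: list markers start at column zero and are followed by one space.
function preprocessOrderedLists(markdown: string): string {
  return markdown.replace(/^\d+\. /gm, "+ ");
}

console.log(preprocessOrderedLists("9999. Item one\n1. Item two"));
// + Item one
// + Item two
```

Numbers elsewhere in a line are untouched because the pattern is anchored to the line start. A paragraph line that happens to begin with a number and a period would still be caught, which is the sort of trade-off a strict, opinionated syntax can live with.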

Hmmarkdown has never supported nested lists because in over a decade of blogging I’ve never nested a list. That saves me another headache.

Tune in next year when I throw this all away and announce Hmmarkdown 3!