
Use native task list parsing code (comrak), and support non-breaking spaces

We want to use the native task list parsing code in glfm-markdown / comrak. This would allow us to get rid of most of the TaskListFilter, simplifying the code and removing some regex parsing.

The MR Draft: Use native task list parsing in `comrak`... (!198758 - closed) does exactly that.

However, there is a problem in recognizing an unchecked item.

The GFM spec indicates that only normal Unicode whitespace characters (space U+0020, tab U+0009, newline U+000A, line tabulation U+000B, form feed U+000C, or carriage return U+000D) are recognized for a task list item. Non-breaking spaces are not supported.

- [ ] NO-BREAK SPACE (U+00A0)
- [ ] FIGURE SPACE (U+2007)
- [ ] NARROW NO-BREAK SPACE (U+202F)
- [ ] THIN SPACE (U+2009)
- [ ] normal SPACE (U+0020)

Of the items above, only the last one should be considered a task list item.

A whitespace character is a space (U+0020), tab (U+0009), newline (U+000A), line tabulation (U+000B), form feed (U+000C), or carriage return (U+000D).

The latest CommonMark Spec now calls that a Unicode whitespace character.
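To make the rule concrete, here is a minimal sketch of the strict check, assuming the whitespace list quoted above is the complete set. The function names `is_gfm_whitespace` and `is_unchecked_marker` are hypothetical and are not part of comrak or glfm-markdown.

```rust
/// Whitespace characters exactly as listed in the spec excerpt quoted above.
fn is_gfm_whitespace(c: char) -> bool {
    matches!(
        c,
        '\u{0020}' | '\u{0009}' | '\u{000A}' | '\u{000B}' | '\u{000C}' | '\u{000D}'
    )
}

/// Hypothetical helper: true if `marker` (for example "[ ]") is an unchecked
/// task list marker under the strict rule, i.e. exactly three characters:
/// '[', one whitespace character from the list above, and ']'.
fn is_unchecked_marker(marker: &str) -> bool {
    let mut chars = marker.chars();
    match (chars.next(), chars.next(), chars.next(), chars.next()) {
        (Some('['), Some(c), Some(']'), None) => is_gfm_whitespace(c),
        _ => false,
    }
}
```

Under this sketch, `is_unchecked_marker("[ ]")` holds for a normal space, while the same call with NO-BREAK SPACE, FIGURE SPACE, NARROW NO-BREAK SPACE, or THIN SPACE between the brackets does not.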

However, our task parsing code allows non-breaking spaces (see https://github.com/deckar01/task_list/commit/dd204f94887103a2190dc075ee86e92fb15b5fe9). These characters can be inserted accidentally when using some non-English keyboard layouts, as mentioned here.

It depends on the keyboard layout. When I use a Mac Belgian AZERTY, I type the [ using ALT+SHIFT+(, then the space. Sometimes, I keep the ALT key pressed a little longer and it type a non-breaking space instead.

Currently, switching to the native task parsing would break this, because comrak follows the GFM spec when parsing task list items.

Options:

  1. Do nothing. This is undesirable because it's better to handle task parsing at the markdown parsing level, rather than looking for task items in the HTML afterwards. It is also desirable to remove the regular expressions, for performance and security.

  2. Use native parsing and do not support non-breaking spaces. This is basically what !198758 (closed) currently does. While I originally considered this the correct option, it's not desirable to break existing customer lists that might use the non-breaking characters accidentally. I also verified that a competitor now supports the non-breaking spaces. Removing our support may move us backwards.

  3. Enhance the native code to recognize non-breaking spaces.

    I first looked at enhancing glfm-markdown. We have the ability to override how HTML gets rendered, and in fact we do this to support inapplicable checkboxes. However, the symbol stored in the AST entry is only a single-byte character (uint8). The non-breaking spaces do not fit in a single byte (in UTF-8 they take two or three bytes), so we can't reliably check for the proper character.

    In comrak, we could add an option to allow the use of non-breaking spaces. We would need to, at the very least, update https://github.com/kivikakk/comrak/blob/main/src/parser/mod.rs#L3026-L3026 and most likely https://github.com/kivikakk/comrak/blob/main/src/scanners.re#L449-L449. The easiest approach might be to have the scanner detect these characters and, if it finds one, return a space as the symbol, so that the remaining processing works as normal. It is not yet clear whether this would have any side effects.
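The idea in option 3 can be sketched as follows. This is a standalone illustration, not comrak's actual scanner or API. The names `is_extra_space` and `task_symbol`, and the `allow_non_breaking` flag, are hypothetical, and the set of extra characters is assumed from the examples listed earlier in this issue.

```rust
/// Assumption: the extra space characters are the four non-standard
/// spaces used as examples earlier in this issue.
fn is_extra_space(c: char) -> bool {
    matches!(c, '\u{00A0}' | '\u{2007}' | '\u{202F}' | '\u{2009}')
}

/// Hypothetical stand-in for the symbol a scanner could report for the
/// character between '[' and ']'. The symbol is a single byte, matching
/// the uint8 stored in the AST entry. Returns None when the character is
/// not a valid task list symbol.
fn task_symbol(c: char, allow_non_breaking: bool) -> Option<u8> {
    if allow_non_breaking && is_extra_space(c) {
        // Report a plain space so that later stages see an ordinary
        // unchecked item and never have to store a multi-byte character.
        return Some(b' ');
    }
    match c {
        // Strict rule: the spec's whitespace characters, or x / X.
        ' ' | '\t' | '\n' | '\u{000B}' | '\u{000C}' | '\r' | 'x' | 'X' => Some(c as u8),
        _ => None,
    }
}
```

With the flag off, `task_symbol('\u{00A0}', false)` yields `None`, which is the strict behavior. With it on, the same character is reported as `b' '`. The mapping is lossy by design: the original character is not preserved in the symbol, which is one place where side effects could show up, for example if anything later relies on the symbol matching the source text.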

!80674 (merged) fixes the problem where we couldn't check boxes that contained these Unicode spaces.

Let's consider whether it makes sense to remove support for the Unicode spaces. Some reasoning is outlined here.
