FrontMatterParser should strip only *one* frontmatter, and adjust BACKSLASH_R (!213304) · Merge requests · GitLab.org / GitLab

What does this MR do and why?

Fixes #359081 (closed) by only stripping off the frontmatter from wiki page content that we actually involved in the parse to get metadata (principally, title); this means doing a single replace, and not a replace_gsub.


Now frontmatter can be added to a wiki page ...
And it's displayed:
And can be further edited, in rich text:
Or source form:

UntrustedRegexp::BACKSLASH_R needs an adjustment

Additionally, fix a problem I discovered immediately after, now that user frontmatter is once again visible. Our FrontMatterFilter replaces the user frontmatter with a ```yaml block so it's displayed nicely with syntax highlighting. When using \r\n line endings (which the wiki editor seems to default to), our regex matches \r in preference to \r\n where possible, meaning if what follows in the regex can accept the \n, it'll match only the \r.

This is what's happening here:

This is with the following source:

---
abc: hello
xyz: How did this ever work?
---

HMMM.

Note this part of Gitlab::FrontMatter::PATTERN_UNTRUSTED:

      # opening front matter marker (optional language specifier)
      "^(?P<delim>#{DELIM_UNTRUSTED})[ \\t]*(?P<lang>\\S*)#{::Gitlab::UntrustedRegexp::BACKSLASH_R}" +

      # front matter block content (not greedy)
      '(?P<front_matter>(?:\n|.)*?)' +

Gitlab::UntrustedRegexp::BACKSLASH_R is defined as:

    BACKSLASH_R = '(\n|\v|\f|\r|\x{0085}|\x{2028}|\x{2029}|\r\n)'

with reference to https://ruby-doc.org/3.2.2/Regexp.html#class-Regexp-label-Character+Classes.

Because we have BACKSLASH_R immediately followed by (?:\n|.)*?, input like ```\r\nhello: world will parse ```\r as everything up to and including BACKSLASH_R, and then give the \n to the start of the front_matter group — causing the extra newline seen above.

Presumably it was written this way to match the Ruby documentation listed above:

/\R/ - A linebreak: \n, \v, \f, \r \u0085 (NEXT LINE), \u2028 (LINE SEPARATOR), \u2029 (PARAGRAPH SEPARATOR) or \r\n.

\r\n is listed last, so put it last, right? But this is an artefact of the documentation: Ruby's \R will actually consume a full \r\n in preference to \r, even if the \n could possibly follow:

irb(main):001:0> /(hello\R)(\n?world)/.match "hello\r\nworld"
=> #<MatchData "hello\r\nworld" 1:"hello\r\n" 2:"world">

If you want to read some C, we can even look at the Ruby (Oniguruma) source that describes it: it's a lot more explicit about preferring \r\n.

regparse.c parses regular expressions, and here we can see the \R parse into a TK_LINEBREAK token:

    case 'R':
      if (IS_SYNTAX_OP2(syn, ONIG_SYN_OP2_ESC_CAPITAL_R_LINEBREAK)) {
	tok->type = TK_LINEBREAK;
      }
      break;

Further down, where a TK_LINEBREAK token is dispatched:

  case TK_LINEBREAK:
    r = node_linebreak(np, env);
    if (r < 0) return r;
    break;

node_linebreak, in turn, opens with this description of its behaviour:

static int
node_linebreak(Node** np, ScanEnv* env)
{
  /* same as (?>\x0D\x0A|[\x0A-\x0D\x{85}\x{2028}\x{2029}]) */

Note that \x0D\x0A (CRLF) comes first, before the character class which includes both of those, and that the whole thing is in an atomic grouping, such that if it matches \r\n, it can't backtrack and try matching only \r later. (RE2 doesn't support this, but it's OK really, this is more just indicating that we really should prefer \r\n where possible.)

You can see an example of this atomic grouping effect here:

irb(main):001:0> /(hello\R)(\nworld)/.match "hello\r\nworld"
=> nil

SO. BACKSLASH_R should preferentially consume \r\n to \r, and this MR reorders the alternatives accordingly. There's a test which fails without this change added to ensure it doesn't regress.

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

FrontMatterParser should strip only one frontmatter, and adjust BACKSLASH_R

What does this MR do and why?

UntrustedRegexp::BACKSLASH_R needs an adjustment

MR acceptance checklist

Merge request reports