Accumulated Feedback on PRI #509

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Fri Dec 06 21:31:48 CST 2024
ReportID: ID20241206213148
Name: Dennis Tan
Report Type: Public Review Issue
Opt Subject: 509


In section 3 (Link Detection Algorithm), subsection "Initiation", the document uses the following 
reference for "top-level domains": https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains. 
Perhaps it would be better to use a more authoritative source, one that is updated regularly — 
e.g., https://www.iana.org/domains/root/db. The wiki page doesn't even list the IDN top-level domains.

Date/Time: Wed Dec 18 02:41:23 CST 2024
ReportID: ID20241218024123
Name: Hank Nussbacher
Report Type: Public Review Issue
Opt Subject: 509


In section 7 - https://www.unicode.org/reports/tr58/#test-data - might I suggest that you include 
test data for bidirectional content for Linkification, such as text in Arabic or Hebrew?

Date/Time: Mon Jan 20 08:04:59 CST 2025
ReportID: ID20250120080459
Name: Arnt Gulbrandsen
Report Type: Public Review Issue
Opt Subject: 509


Hi,

I have compared the UTS58 draft with a few linkifiers.

One omission I noticed: another linkifier tolerates and ignores U+00AD (soft hyphen), which the draft 
does not mention. The commit message is terser than terse, but hints that someone sends text of the 
form "foo example.com/foo/bar bar" with a soft hyphen after the full stop and/or the slashes.

It's not clear to me that this is worth bothering with. Your call.
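A minimal sketch of the tolerate-and-ignore behavior, assuming a simple pre-pass over the candidate 
text; nothing like this is in the current draft:

```python
# Hypothetical pre-pass: strip SOFT HYPHEN before link detection, so a
# U+00AD after "." or "/" neither breaks nor extends the match.
SOFT_HYPHEN = "\u00AD"

def strip_soft_hyphens(candidate: str) -> str:
    return candidate.replace(SOFT_HYPHEN, "")

# strip_soft_hyphens("example.com/\u00ADfoo/bar") == "example.com/foo/bar"
```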

Date/Time: Mon Feb 10 11:17:13 CST 2025
ReportID: ID20250210111713
Name: Jules Bertholet
Report Type: Public Review Issue
Opt Subject: 509


In addition to being used ⸢like⸣ ⸤this⸥,
the half brackets, and possibly also the half parentheses,
can also be used ⸢like⸥ ⸤this⸣
(imitating the East Asian corner brackets).
The Link_Paired_Opener property should be array-valued to reflect this.
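A minimal sketch of what an array-valued Link_Paired_Opener could look like, assuming a dict-based 
representation invented here for illustration (this is not the draft's data format):

```python
LINK_PAIRED_OPENER = {
    "\u2E23": ("\u2E22", "\u2E24"),  # ⸣ closes ⸢ (usual) or ⸤ (corner-bracket style)
    "\u2E25": ("\u2E24", "\u2E22"),  # ⸥ closes ⸤ (usual) or ⸢ (corner-bracket style)
}

def is_paired_opener(closer: str, opener: str) -> bool:
    """True if `opener` is one of the acceptable matches for `closer`."""
    return opener in LINK_PAIRED_OPENER.get(closer, ())
```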

Date/Time: Wed Mar 26 05:59:34 CDT 2025
ReportID: ID20250326055934
Name: Ebrahim Byagowi
Report Type: Public Review Issue
Opt Subject: 509


I'd like to offer a quick drive-by comment about https://www.unicode.org/reports/tr58/tr58-2.html

I think the lack of a standard recommendation on how URLs should actually be displayed has caused 
https://issuetracker.google.com/issues/40665886, essentially breaking Persian text in URL bars and 
misrendering emoji skin tones that use ZWJ in Chrome, as described on the tracker.

The same situation exists in Safari, although URLs are mostly hidden there; but if one tries to edit 
a URL containing ZWNJ, things go wrong: the already-encoded ZWNJ in the URL gets double-encoded, as in 
https://phabricator.wikimedia.org/F58924232. (Unfortunately this isn't always reproducible in Safari, 
but it is annoying enough, and it comes from the same root cause of ZWNJ being displayed as its code.)

I understand that you may consider these browser bugs, but after seeing 
https://www.unicode.org/reports/tr58/tr58-2.html and the lengthy discussion I had in Chromium's bug 
tracker (https://issuetracker.google.com/issues/40665886), which did make the developers understand 
what is going on, I felt that if there were an official recommendation, things could go more smoothly.

Date/Time: Fri Mar 28 11:53:15 CDT 2025
ReportID: ID20250328115315
Name: cketti
Report Type: Public Review Issue
Opt Subject: 509


Step 4.7. of the termination algorithm currently reads "If LT == Open", but it should be "If LT == Close". (Step 4.6. handles "LT == Open")
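For context, a minimal sketch of the stack discipline that step is part of, using hypothetical names 
(the draft's step numbering and variables differ); it shows why the step after the Open case must 
test Close:

```python
def handle_bracket(lt, ch, open_stack, paired_openers):
    """lt is the character's Link_Termination value; paired_openers maps a
    closer to its acceptable openers."""
    if lt == "Open":                  # step 4.6: remember the opener
        open_stack.append(ch)
    elif lt == "Close":               # step 4.7: must test Close, not Open
        if open_stack and open_stack[-1] in paired_openers.get(ch, ()):
            open_stack.pop()          # matched pair: include and continue
        else:
            return "terminate"        # unmatched closer ends the link
    return "continue"
```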

Date/Time: Mon Apr 07 06:31:16 CDT 2025
ReportID: ID20250407063116
Name: Arnt Gulbrandsen
Report Type: Public Review Issue
Opt Subject: 509


Hi,

I ran across a bug today that I think points out a relevant problem in UTS58: A user expected 普遍适用测试。我爱你 to be linkified as
<a href="https://普遍适用测试.我爱你">普遍适用测试。我爱你</a> (note the changing dot).

Chrome and some other web browsers map "。" to "." in domains when you hit enter after typing/pasting into the address bar. I do feel 
that at least U+06D4 and U+3002 ought to be mapped to the ASCII dot in UTS58 since it's such a common mistake. ("。" and "." are even 
on the same key on the Chinese keyboards I've seen.)

I mention U+06D4 and U+3002 because I've seen those mistakenly used in "domain names" in the course of my work. U+FF61 and others 
might also be used mistakenly in theory, but I haven't seen that.
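A minimal sketch of the mapping being requested, limited to the characters named above; none of this 
is in the current draft. (UTS #46 already treats U+3002 and U+FF61 as label separators in IDNA 
processing; U+06D4 would be new.)

```python
LABEL_SEPARATORS = {
    "\u06D4": ".",  # ۔ ARABIC FULL STOP
    "\u3002": ".",  # 。 IDEOGRAPHIC FULL STOP
    "\uFF61": ".",  # ｡ HALFWIDTH IDEOGRAPHIC FULL STOP (theoretical, per above)
}

def map_label_separators(host: str) -> str:
    return "".join(LABEL_SEPARATORS.get(ch, ch) for ch in host)

# map_label_separators("普遍适用测试。我爱你") == "普遍适用测试.我爱你"
```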


Feedback above this line has already been reviewed during UTC #183 in April, 2025.

Date/Time: Mon Sep 15 14:07:47 PST 2025
ReportID: ID20250915140747
Name: Peter Constable
Report Type: Public Review Issue
Opt Subject: PRI 509: suggestions for Summary

Editorial suggestion for the intro summary of DUTS 58:

---

URLs processed in communication protocols are parsed in conformance with particular protocol 
specifications. When URLs appear in text content, however, the character sequences are not always 
intended to be read in exactly the same way they would be parsed in a protocol. Some characters that 
are often used as sentence-level punctuation in text can also be valid characters within a URL. 
Software that applies the protocol rules when parsing URLs in text content often produces the wrong 
results.

When a URL is inserted into text, percent encoding can be used to avoid the above ambiguity, though 
often this is not done. Also, when a URL that includes non-ASCII characters is inserted into text, 
implementations often over-use percent encoding for those characters, resulting in a URL that is 
illegible for a human reader. Thus, percent encoding is often both underused and overused, leading 
to less beneficial results.

This document specifies...

Date/Time: Mon Sep 15 17:27:18 PST 2025
ReportID: ID20250915172718
Name: Peter Constable
Report Type: Public Review Issue
Opt Subject: PRI 509: paired punctuation within path, etc.

In the Link_Termination Property section, the description for Close mentions "subparts" within a path, query or fragment:

"If the character is paired with a previous character in the same Part (path, query, fragment) and in the same subpart 
(that is, not across interior '/' in a path, or across '&' or '=' in a query, it is treated as Include."

(It might also have mentioned "/" and "?" within query and fragment but doesn't.)

Two issues:

First, it seems to make sense that counterpart open/close punctuation is unlikely to be used with an intent of being paired
across segment boundaries within a path. So, not searching for a pair across a "/" boundary within a path seems to make sense. 
However, it seems less obvious that the same can be said for query or fragment elements of a URL. For example, it seems 
conceivable that a "(" ... ")" pair might surround a sequence of key/value pairs that comprise a logical grouping. E.g.,

...(k1=a&k2=b)...

I have no idea if this kind of pairing is done in practice, so perhaps this isn't more than a remote, hypothetical possibility. 
But I do know it is permissible in RFC 3986.

But -- the second point -- the algorithm, as written, does not include anything to recognize such "subpart" segments or to 
incorporate awareness of such subparts into the logic.

Thus, unless there is thought to elaborate the algorithm in some way to support these subparts, it doesn't make sense to mention 
them at all.
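To make the gap concrete, here is a minimal sketch of what subpart-aware pairing would involve; 
every name here is invented for illustration, and nothing like the `stack.clear()` at subpart 
boundaries appears in the algorithm as written:

```python
def pair_within_subparts(part, separators, openers="([{", closers=")]}"):
    """Pair brackets, but never across a subpart boundary ('/' in a path,
    '&' or '=' in a query)."""
    match = dict(zip(closers, openers))
    stack, pairs = [], []
    for i, ch in enumerate(part):
        if ch in separators:
            stack.clear()            # pairing never crosses a subpart boundary
        elif ch in openers:
            stack.append((ch, i))
        elif ch in closers and stack and stack[-1][0] == match[ch]:
            pairs.append((stack.pop()[1], i))
    return pairs

# pair_within_subparts("(k1=a&k2=b)", "&=") finds no pair: the "&" and "="
# boundaries separate "(" from ")" -- exactly the grouping questioned above.
```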

Date/Time: Mon Sep 15 18:15:59 PST 2025
ReportID: ID20250915181559
Name: Peter Constable
Report Type: Public Review Issue
Opt Subject: PRI 509: pairing across path/query/frag boundaries

Given the potential ambiguity of closers like ")" at the end of a candidate URL sequence, it makes sense not to attempt to infer 
a pairing across the boundaries between the top-level elements of a URL: e.g., opener in a path paired with a closer in query. 
The algorithm, as it is intended to be implemented, ensures this by having the steps within the Link-Detection Algorithm section 
applied separately, and in turn, to the path, query, and fragment elements.

However, there is potential for implementations to overlook that detail and to apply the steps in that section to a sequence 
spanning path + query + fragment. The intent for the steps to be applied to those elements separately is not stated in the 
Link-Detection Algorithm section, so if an implementer hasn't read and paid adequate attention to earlier sections, they might 
overlook that very important detail.

Partially related is that the terms "part" and "Part" are used inconsistently throughout the document. So, for instance, use of "part" 
in section 3 "Parts of a URL" includes references to protocol and port elements.

The first mention of this crucial Part-at-a-time logic is in passing, in the second paragraph of the Termination section:

"The key is to be able to determine, given a Part (such as query)..."

The next is buried in the description of Close:

"If the character is paired with a previous character *in the same Part (path query fragment)* ..."

Of course, it is the handling of potential open/close pairs for which the Part-at-a-time logic matters the most. But the doc 
should call this out more clearly.

The clearest statements come in the Termination Algorithm section:

"This algorithm then processes each final Part [path, query, fragment] of the URL in turn. ... A Link_Termination=Close character... 
that does **not** have a matching Open character *in the same Part* of the URL."

If we assume implementers read and pay attention to these sections, that *should* be adequate. Even so, to be most explicit, it makes 
sense to call out this detail directly within the Link-Detection Algorithm section. The following is a suggestion:

## Link-Detection Algorithm

The following steps are performed, logically, over the path, query, and fragment elements of a URL separately. They are applied first 
over the path, if present. If no termination is detected within the path, the steps are repeated for the query, if present, with 
variables reset. If no termination is detected within the query, the steps are repeated for the fragment, if present. Crucially, the 
openStack must be cleared on the transitions from path to query and to fragment.

In the following:

* Part refers to one of {path, query, fragment}.

...
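To illustrate the flow that wording describes, a minimal sketch; `detect_termination` is a 
hypothetical stand-in for the draft's per-Part termination steps, passed in because it is not 
defined here:

```python
def detect_link_end(path, query, fragment, detect_termination):
    for part in (path, query, fragment):   # each Part processed in turn
        if not part:
            continue
        open_stack = []                    # crucially, cleared at each transition
        end = detect_termination(part, open_stack)
        if end is not None:                # termination found within this Part
            return part, end
    return None                            # no termination in any Part
```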

====

As noted earlier, usage of "Part" and "part" is not consistent. E.g.,

- one of {path, query, fragment} — the intended meaning

- other main URL elements: e.g., protocol, host, port (top of section 3); also in section 4, "Process each Part up to the Path, Query, 
and Fragment in the normal fashion..."

- individual characters: e.g., "... when a trailing period should be counted as part of a link or not."

- positions within the URL: e.g., "... the last location in the current Part that is still safely considered part of the link."

- a span of URL characters: e.g., "... vs. URLs that contain a part that is enclosed in parens, etc."


Also inconsistent is casing when referring to path, query and fragment (some instances are capitalized), and in references to the three 
elements as a set---"... Part (path, query, fragment)" vs. "... Part [path, query, fragment]".


Feedback above this line has already been reviewed during UTC #185 in October, 2025.

Date/Time: Fri Jan 09 22:20:39 PT 2026
ReportID: ID20260109222039
Name: From+Unicode
Report Type: Public Review Issue
Opt Subject: 509


I have a few high-level substantive comments. I will submit a long list of text review comments in another post.

Thank you for your work on this. Link detection and formatting are very important, and in my opinion the poor state of implementations 
presents substantial obstacles to the adoption of globally inclusive email addresses and internationalised domain names. This standard 
can potentially be a big help.

Another project, the Universal Acceptance Steering Group, sponsored by ICANN in 2015-2025, did work on linkification of URLs incorporating 
internationalised domain names and email addresses with non-ASCII characters. I was involved in some of the working groups. They published 
some papers which you could at least cite as resources. Perhaps you could also use them to cross-check your work. The project has ended, 
so there is not really a way to get their input anymore.

Specifically, I suggest:

• UASG 004 Test Cases for UA Readiness Evaluation EN https://uasg.tech/download/uasg-004-use-cases-for-ua-readiness-evaluation-en/

This is a document listing URLs and email addresses using non-ASCII characters in hosts, URL paths, and email address local-parts, for a wide 
range of languages. As of mid-2025, the URLs were live and had web servers responding, and the email addresses had repeaters active.

• UASG 004A Test Cases for UA Readiness Evaluation - Data EN https://uasg.tech/download/uasg-004a-use-cases-for-ua-readiness-evaluation-data-en/

This is a text file containing the URLs and email addresses from UASG 004 in an easily parseable format. I know from experience that attempting 
to copy URLs and email addresses from unlinkified PDF text is unreliable; this text file is the solution.

I suggest that, at the very least, you include UASG 004 and UASG 004A in your list of References. Better yet, include some or all of the URLs 
and addresses in UASG 004A in your LinkFormattingTest.txt, and be sure your algorithm works with all of them.

Also, consider reading "UASG 010 Quick Guide to Linkification EN" https://uasg.tech/download/uasg-010-quick-guide-to-linkification-en/. This paper describes some of the issues in linkification. It does not have algorithms. I suspect it will not tell you anything you don't know. However, consider including it in your References as accessible background information for your readers.

None of your URLs in file LinkFormattingTest.txt have internationalised domain names. That seems like a major oversight.

Nowhere in this draft do you discuss right-to-left scripts. I have heard comments that the Unicode Bidi algorithm can make a mess of email addresses 
with right-to-left parts. I have heard that the "@" character, having a neutral direction, can end up at one end or the other of the email address 
instead of between the local part and the host. It seems quite important that TR58 address right-to-left text in the algorithms, the examples, and 
the test cases.
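One mitigation sometimes suggested for display contexts, offered here only to illustrate the concern 
(it is a general bidi technique, not something the draft recommends): wrap the whole address in a 
directional isolate so the neutral "@" keeps its logical position between local part and host.

```python
LRI, PDI = "\u2066", "\u2069"  # LEFT-TO-RIGHT ISOLATE, POP DIRECTIONAL ISOLATE

def isolate_for_display(address: str) -> str:
    """Force an LTR base direction for the address run, so surrounding
    RTL text cannot reorder its neutral characters."""
    return f"{LRI}{address}{PDI}"
```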

I know of IETF work on two related facets of email address syntax beyond ASCII. There are two draft RFCs, by A. Gulbrandsen (ICANN) and J. Yao (CNNIC). 
You might want to reach out to them, and get their review, if you haven't already.

The draft RFCs are:

"Interoperable Email Addresses for SMTPUTF8"

draft-ietf-mailmaint-interoperable-addresses-02

https://datatracker.ietf.org/doc/draft-ietf-mailmaint-interoperable-addresses/

'This document specifies rules for email addresses that are flexible enough to express the addresses typically used with SMTPUTF8, while avoiding elements that 
harm human-to-human interoperation.

'This is one of a pair of documents: this contains recommendations for what addresses humans should use, such that address provisioning systems can restrain themselves 
to addresses that email validators accept. (This set can also be described in other ways, including "simple to cut-and-paste" and "understandable by some community".)'

"SMTPUTF8 address syntax"

draft-ietf-mailmaint-smtputf8-syntax-02

https://datatracker.ietf.org/doc/draft-ietf-mailmaint-smtputf8-syntax/

'This document specifies rules for email addresses that are flexible enough to express the addresses typically used with SMTPUTF8, while avoiding confusing or risky elements.

'This is one of a pair of documents: This is simple to implement, contains only globally viable rules, and is intended to be usable for software such as an MTA.'

Also, I hope you are aware of the discussions in the WhatWG/HTML repo about what email address syntax should be accepted by the HTML `input type=email` element. The current 
HTML spec says the addresses may only contain a subset of ASCII. Several years of discussion over three GitHub issues have failed to reach agreement to extend 
this to globally inclusive email addresses.

The current pull request is https://github.com/whatwg/html/pull/10522 (2024-present). Previous discussion was in pull request https://github.com/whatwg/html/pull/5799 
(2020-2024) and issue https://github.com/whatwg/html/issues/4562 (2019-2024).

Date/Time: Fri Jan 09 22:22:19 PT 2026
ReportID: ID20260109222219
Name: From+Unicode
Report Type: Public Review Issue
Opt Subject: 509

Here is a long list of text review comments. I submitted some high-level substantive comments in another post.

In section 1.2 Email Addresses https://www.unicode.org/reports/tr58/#intro-email, the example contains a URL to ja.wikipedia.org. This surprises me. Since the section is about email addresses, and begins, "Email addresses should also work well….", I would expect the example to be an email address rather than a URL. Maybe there should also be an example of a URL having the `mailto:` scheme, contrasting it with an email address and with a URL having an `http:` scheme.

In section 3 Link Detection https://www.unicode.org/reports/tr58/#link-detection, there are typos in the Notes to table 3-1 Parts of a URL https://www.unicode.org/reports/tr58/#parts-of-a-url, second bullet point regarding internal structure.

1. In the first sub-bullet, there is a pair of parentheses just before the sentence beginning, "The syntax of a URI…". Perhaps the intention was to put the parentheses around that sentence.

2. In that same first sub-bullet, the term "URI" looks incorrect, since the report intends to use "URL" instead.

3. In the third sub-bullet, beginning "The Query…", the semicolon after 'by "&"' seems awkward. A comma would flow better here, in my opinion.

4. In the same third sub-bullet, the final sentence has an opening parenthesis but no closing parenthesis.

Also in Notes to table 3-1 Parts of a URL, second bullet point regarding internal structure, in the final sub-bullet about "The Fragment", I don't know what you mean by the phrase "separated again by". What is separated from what? Maybe you mean that a subsequent fragment directive is separated from the preceding one by the sequence? In that case, I would suggest wording like: 'one or more fragment directives, each preceded with a separator ":~:".' Then perhaps show an example.

In section 3 Link Detection, subsection Initiation https://www.unicode.org/reports/tr58/#initiation, the second paragraph reads, "the end of the domain name is also relatively easy to determine. For more information, see UTS #46, Unicode IDNA Compatibility Processing." I took a quick look at UTS #46, and I did not see an obvious place where it explains how to determine the end of the domain name. Also, the next sentence says that parsing of Scheme and Host is as specified in WHATWG URL 4.4. Doesn't that supersede UTS #46, which is only concerned with domain names? I suggest you flesh out the way UTS #46 helps determine the end of the domain name.

In section 3 Link Detection, subsection Termination https://www.unicode.org/reports/tr58/#termination, I don't know what the phrase "the high-runner cases" means. Do you mean "the following common cases"?

In section 3 Link Detection, subsection Properties, there is a sentence, "The short property names are identical to the long property names." This seems to apply to the three property names listed just above. I do not see separate long and short names. Perhaps this sentence refers to older names which are no longer present?

In Table 3-2. Link_Term Property Values https://www.unicode.org/reports/tr58/#Link_Term-values, entry "Soft", you spring some previously undefined syntax on us: '/\p{Link_Term=Soft}*(\p{Link_Term=Hard}$)/'. I think you mean "zero or more characters with a Link_Term property of Soft, followed immediately by a character with a Link_Term property of Hard". Having just written that out, I can understand the attraction of regexp notation. But then perhaps make it clear what notation you plan to use.
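For what that pattern appears to mean, a minimal sketch of the comment's reading, with a hypothetical 
link_term(ch) lookup; none of these names come from the draft:

```python
def soft_run_then_hard(s, i, link_term):
    """True if s[i:] is zero or more Link_Term=Soft characters followed by
    a single Link_Term=Hard character at the end of the text (the $ anchor)."""
    j = i
    while j < len(s) - 1 and link_term(s[j]) == "Soft":
        j += 1
    return j == len(s) - 1 and link_term(s[j]) == "Hard"
```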

In subsection "Link_Bracket Property" https://www.unicode.org/reports/tr58/#link-bracket-property;, you use the `\p{}` notation in unformatted text, whereas before you used it in fixed-pitch text. I suggest being consistent in this choice: do one or the other always. Fixed-pitch text makes the special notation clearer for me.

In subsection "Termination Algorithm" https://www.unicode.org/reports/tr58/#termination-algorithm;, in the final paragraph juest before Table 3-3, the text says, "UnicodeSet notation is used…". As far as I can see, this is the only mention of the term UnicodeSet. This term should be related to a citation of a document which defines the notation, perhaps with a link to an entry in the References.

In section 4 Minimal Escaping https://www.unicode.org/reports/tr58/#minimal-escaping, list item 3, sublist item 2, I don't know how to interpret this sentence. The list is of goals for the serialised URL form. The phrase "the linkification may extend beyond the bounds of the serialized form" is on the surface a goal, as if extending beyond the bounds is a good thing. But maybe you intend it as a warning: if we don't define the serialisation correctly, one thing that might go wrong is extending beyond the bounds. List item 3 is the case "when isolated". If a URL is "not surrounded by Hard characters", is it actually "isolated"? I suggest clarifying this wording.

In subsection Minimal Escaping Algorithm https://www.unicode.org/reports/tr58/#minimal-escaping-algorithm, the text says, "Formatting of the Scheme, Host, and Port should be done as is customary for those URL Parts." Using the word "customary" is vague and seems to be shirking the duty to provide clear direction. I suggest finding a reference which defines the "customary" algorithm and citing that. Maybe citing ToUnicode in UTS #46 is sufficient?

In section 5 Email Addresses https://www.unicode.org/reports/tr58/#email-addresses, in the algorithm to scan the local part, the algorithm uses symbols `n`, `i`, and `cp` without defining them. I don't see what the initial value of `i` should be in the expression `cp[i]` in step 2. A reader can probably guess that you define `n` and `cp` the same way as for earlier algorithms. I suggest repeating the earlier text defining terms from earlier algorithms, and explicitly defining the step 2 value for `i`.

In that same email address algorithm, step 7, you write, "Return a match for the pair start, end." The algorithms in this paper return strings, not tuples of integers. I think what you actually mean is, "Return all code points between start (inclusive) and n - 1 ([? exclusive or inclusive?])."

In section 5, in the paragraph just before Table 5-1. Email Address Link Detection Examples https://www.unicode.org/reports/tr58/#email-detection-examples, you write, "For linkification, the values in a quoted local-part — while broader than in an unquoted local-part — are more restrictive to prevent accidentally including linkifying more text than intended, especially since those code points are unlikely to be handled by mail servers in any event." I don't understand what you mean. Are you introducing a different algorithm for scanning a quoted local-part? If so, I don't see it here. Or are you saying that this paper doesn't attempt to define link detection for a quoted local-part? That is not what those words say. When you say, "are unlikely to be handled by mail servers in any event", what is your basis for such a claim? Are you saying that you have knowledge that mail servers these days refuse to process quoted local parts? That seems to call for a citation of evidence.

In section 5, subsection "Minimal Quoting Algorithm" https://www.unicode.org/reports/tr58/#minimal-quoting-algorithm, you say, "given that the quoted forms are not supported." That seems to assert a position which the paragraph mentioned above does not clearly say.

In section 6 Property Data https://www.unicode.org/reports/tr58/#property-data, the first paragraph ends with a period after a colon, "the following files:.". I suggest deleting the period.

In subsection "Property Assignments", subsubsection "Link_Term=Hard" https://www.unicode.org/reports/tr58/#link-term-hard-assignment;, The text is a sentence fragment, ending in an ellipsis. It appears to be a placeholder that never got turned into a specific list written as a proper sentence.