feat: Support for nested languages #403

Merged
wetneb merged 16 commits from wetneb/mergiraf:5-nested-languages into main 2025-05-23 17:46:17 +02:00
Owner

Closes #5.

This uses the injection queries defined by tree-sitter parsers to recursively parse the corresponding areas with other parsers.
The failing test case demonstrates a small issue with whitespace handling that I'm interested to investigate further.

Closes #5. This uses the injection queries defined by tree-sitter parsers to recursively parse the corresponding areas with other parsers. The failing test case demonstrates a small issue with whitespace handling that I'm interested to investigate further.
Clippy
All checks were successful
/ test (pull_request) Successful in 55s
f4672a3402
ada4a requested changes 2025-05-22 15:32:29 +02:00
Dismissed
ada4a left a comment
Owner

I normally try to go over the whole PR at once in order to have enough context for my reviews, but since this is a bigger PR (and since I've been struggling to cobble together free time lately 😅), I'd try going with the approach of multiple small reviews this time. And so, here's the first one

I normally try to go over the whole PR at once in order to have enough context for my reviews, but since this is a bigger PR (and since I've been struggling to cobble together free time lately 😅), I'd try going with the approach of multiple small reviews this time. And so, here's the first one
src/ast.rs Outdated
@ -95,0 +116,4 @@
source: &'a str,
lang_profile: &'a LangProfile,
) -> FxHashMap<usize, &'static LangProfile> {
let mut node_id_to_injection_lang = FxHashMap::default();
Owner

I'd like to avoid the indentation caused by if let Some. How about changing the first 2 lines to the following?

let Some(query_str) = lang_profile.injections else {
    return FxHashMap::default();
};
let mut node_id_to_injection_lang = FxHashMap::default();

Maybe it would even make sense to change the return type to an Option? Not sure about this one though

I'd like to avoid the indentation caused by `if let Some`. How about changing the first 2 lines to the following? ``` let Some(query_str) = lang_profile.injections else { return FxHashMap::default(); }; let mut node_id_to_injection_lang = FxHashMap::default(); ``` Maybe it would even make sense to change the return type to an `Option`? Not sure about this one though
Author
Owner

Sure! Is it more convenient for you if I amend the commits and force-push, or add those refactorings as separate commits on top of the branch?

Sure! Is it more convenient for you if I amend the commits and force-push, or add those refactorings as separate commits on top of the branch?
Owner

That's a good question. On one hand, I do prefer a cleaner history. On the other hand, I think Codeberg doesn't quite like it when a change is pulled from under the feet a review comment (since it doesn't keep track of PR versions AFAICT). So let's play it safe and go with the second option

That's a good question. On one hand, I do prefer a cleaner history. On the other hand, I think Codeberg doesn't quite like it when a change is pulled from under the feet a review comment (since it doesn't keep track of PR versions AFAICT). So let's play it safe and go with the second option
ada4a marked this conversation as resolved
@ -95,0 +149,4 @@
.expect("injection.language capture didn't match any node");
source[lang_node.byte_range()].to_owned()
});
m.nodes_for_capture_index(content_capture_index)
Owner
  1. since the result of this if let Some() doesn't depend on the particular node, it can be pulled out of for_each
  2. nodes.for_each( map.insert() ) can be replaced with map.extend(nodes)
if let Some(injected_lang) = LangProfile::find_by_name(&language) {
    node_id_to_injection_lang.extend(
        m.nodes_for_capture_index(content_capture_index)
            .map(|node| (node.id(), injected_lang)),
    );
}
1. since the result of this `if let Some()` doesn't depend on the particular `node`, it can be pulled out of `for_each` 2. `nodes.for_each( map.insert() )` can be replaced with `map.extend(nodes)` ```rs if let Some(injected_lang) = LangProfile::find_by_name(&language) { node_id_to_injection_lang.extend( m.nodes_for_capture_index(content_capture_index) .map(|node| (node.id(), injected_lang)), ); } ```
ada4a marked this conversation as resolved
src/ast.rs Outdated
@ -95,0 +151,4 @@
});
m.nodes_for_capture_index(content_capture_index)
.for_each(|node| {
if let Some(injected_lang) = LangProfile::find_by_name(&language) {
Owner

shouldn't we maybe return an error if we don't support the injected language? Because I think that if we ignore its actual language, we'll end up handling it like the language it's embedded in, which might backfire

shouldn't we maybe return an error if we don't support the injected language? Because I think that if we ignore its actual language, we'll end up handling it like the language it's embedded in, which might backfire
Author
Owner

Because I think that if we ignore its actual language, we'll end up handling it like the language it's embedded in, which might backfire

no, in that case we just don't re-parse the injection recursively at all, and just leave it to how it was parsed by the parent grammar… which feels appropriate to me, given that the parent grammar will generally leave this part un-analyzed.

For instance, we currently don't support CSS, and the HTML grammar sets up injections for <style> elements to be parsed as CSS. Because we don't find a suitable language, the <style> block is left un-analyzed. The presence of a <style> block shouldn't prevent the HTML from parsing, since we can still resolve conflicts in the rest of the HTML tree.

> Because I think that if we ignore its actual language, we'll end up handling it like the language it's embedded in, which might backfire no, in that case we just don't re-parse the injection recursively at all, and just leave it to how it was parsed by the parent grammar… which feels appropriate to me, given that the parent grammar will generally leave this part un-analyzed. For instance, we currently don't support CSS, and the HTML grammar sets up injections for `<style>` elements to be parsed as CSS. Because we don't find a suitable language, the `<style>` block is left un-analyzed. The presence of a `<style>` block shouldn't prevent the HTML from parsing, since we can still resolve conflicts in the rest of the HTML tree.
Owner

Right, I since saw that explained at line 189. Maybe it would make sense to copy that comment into here as well

Right, I since saw that explained at line 189. Maybe it would make sense to copy that comment into here as well
ada4a marked this conversation as resolved
@ -0,0 +2,4 @@
```yml
foo2:
- bar
Owner

It took me longer than I'd like to admit to notice the yaml->yml change introduced by the left side. Maybe add a README like:

left: changed codeblock language from yaml to yml
right: renamed foo to foo2

or, if you want to be more terse:

left: s/yaml/yml
right: s/foo/foo2
It took me longer than I'd like to admit to notice the yaml->yml change introduced by the left side. Maybe add a README like: ``` left: changed codeblock language from yaml to yml right: renamed foo to foo2 ``` or, if you want to be more terse: ``` left: s/yaml/yml right: s/foo/foo2 ```
Author
Owner

My bad, actually I should have kept this test case in the PR that introduced the markdown parser, since it's independent from injections support…

My bad, actually I should have kept this test case in the PR that introduced the markdown parser, since it's independent from injections support…
Owner

I guess that would've stopped me from trying to look inside the codeblock... But all good:)
But I think having the README is helpful even despite that 😅

I guess that would've stopped me from trying to look _inside_ the codeblock... But all good:) But I think having the README is helpful even despite that 😅
Owner

Ah, you did add it! Thanks:)

Ah, you did add it! Thanks:)
ada4a marked this conversation as resolved
@ -0,0 +5,4 @@
from bookmaker import front_page, compile, table_of_contents
def main():
compile(toc=table_of_contents, debug=True)
Owner

Nice reference 😅

Nice reference 😅
ada4a marked this conversation as resolved
src/ast.rs Outdated
@ -95,0 +132,4 @@
let pattern_properties = query.property_settings(m.pattern_index);
// first, check if the language is statically defined in this clause of the query as a property
let lang_property_value = pattern_properties.iter().find_map(|property| {
if property.key == "injection.language".into() {
Owner

This .into() turns the "injection.content" on the right into an owned Box<str>, since that's the type of property.key on the left.

Box<str> is an owned equivalent of &str (the only difference from String being that the latter can be resized, so the triple is isomorphic to Box<[T]> vs &[T] vs Vec<T>), and it implements Deref<Target=str>, so it can be coerced to a &str with a reborrow:

 if &*property.key == "injection.language"

Pretty much the same goes for property.value, except that there the Box<str> is wrapped in an Option, so we need to use Option::as_deref to reborrow the value. The rest should follow, but here's the solution just in case:)

diff --git i/src/ast.rs w/src/ast.rs
index 748fd584..114584e9 100644
--- i/src/ast.rs
+++ w/src/ast.rs
@@ -137,7 +137,7 @@ impl<'a> AstNode<'a> {
                 // first, check if the language is statically defined in this clause of the query as a property
                 let lang_property_value = pattern_properties.iter().find_map(|property| {
                     if &*property.key == "injection.language" {
-                        Some(property.value.clone()?.to_string())
+                        property.value.as_deref()
                     } else {
                         None
                     }
@@ -151,9 +151,9 @@ impl<'a> AstNode<'a> {
                         ))
                         .next()
                         .expect("injection.language capture didn't match any node");
-                    source[lang_node.byte_range()].to_owned()
+                    &source[lang_node.byte_range()]
                 });
-                if let Some(injected_lang) = LangProfile::find_by_name(&language) {
+                if let Some(injected_lang) = LangProfile::find_by_name(language) {
                     node_id_to_injection_lang.extend(
                         m.nodes_for_capture_index(content_capture_index)
                             .map(|node| (node.id(), injected_lang)),
This `.into()` turns the `"injection.content"` on the right into an owned `Box<str>`, since that's the type of `property.key` on the left. `Box<str>` is an owned equivalent of `&str` (the only difference from `String` being that the latter can be resized, so the triple is isomorphic to `Box<[T]>` vs `&[T]` vs `Vec<T>`), and it implements `Deref<Target=str>`, so it can be coerced to a `&str` with a reborrow: ```rs if &*property.key == "injection.language" ``` Pretty much the same goes for `property.value`, except that there the `Box<str>` is wrapped in an `Option`, so we need to use `Option::as_deref` to reborrow the value. The rest should follow, but here's the solution just in case:) ```diff diff --git i/src/ast.rs w/src/ast.rs index 748fd584..114584e9 100644 --- i/src/ast.rs +++ w/src/ast.rs @@ -137,7 +137,7 @@ impl<'a> AstNode<'a> { // first, check if the language is statically defined in this clause of the query as a property let lang_property_value = pattern_properties.iter().find_map(|property| { if &*property.key == "injection.language" { - Some(property.value.clone()?.to_string()) + property.value.as_deref() } else { None } @@ -151,9 +151,9 @@ impl<'a> AstNode<'a> { )) .next() .expect("injection.language capture didn't match any node"); - source[lang_node.byte_range()].to_owned() + &source[lang_node.byte_range()] }); - if let Some(injected_lang) = LangProfile::find_by_name(&language) { + if let Some(injected_lang) = LangProfile::find_by_name(language) { node_id_to_injection_lang.extend( m.nodes_for_capture_index(content_capture_index) .map(|node| (node.id(), injected_lang)), ```
Author
Owner

Thanks a lot, you can tell I've struggled with those boxes :-D

Thanks a lot, you can tell I've struggled with those boxes :-D
ada4a marked this conversation as resolved
src/ast.rs Outdated
@ -95,0 +142,4 @@
let language = lang_property_value.unwrap_or_else(|| {
// otherwise, look if the language is defined via a capture
let lang_node = m
.nodes_for_capture_index(language_capture_index.expect(
Owner

I think extracting this into a variable would be a bit cleaner...

I think extracting this into a variable would be a bit cleaner...
ada4a marked this conversation as resolved
@ -779,0 +809,4 @@
if let Some(injection) = lang_profile.injections {
assert!(
!injection.trim().is_empty(),
"Injection query for language {} set as an empty string, use None instead",
Owner

can use the Display impl for LangProfile here:)

"Injection query for language {lang_profile} set as an empty string, use None instead",
can use the `Display` impl for `LangProfile` here:) ```rs "Injection query for language {lang_profile} set as an empty string, use None instead", ```
ada4a marked this conversation as resolved
src/ast.rs Outdated
@ -133,2 +214,2 @@
(range, local_source)
};
let (range, local_source) =
// don't trim injections, because their children could expand beyond the node itself
Owner

is there an example of this you know of? maybe even a test case that shows an injection being trimmed resulting in an incorrect output...

is there an example of this you know of? maybe even a test case that shows an injection being trimmed resulting in an incorrect output...
Author
Owner

If you remove the condition on the injection, you should get a nice error when running examples/markdown/failing/nested_injections.

If you remove the condition on the injection, you should get a nice error when running `examples/markdown/failing/nested_injections`.
Owner

Interesting. But the fact that the test is failing even with this restriction in place could mean that the latter is not quite correct, no?

The error you mention appears in trailing_whitespace, in particular when (panickingly) slicing into self.source, so I tried replacing that with a get, but then the test output was a whole different kind of wrong... Maybe I'll admit my defeat here, and trust you to debug this in the future 😅

Interesting. But the fact that the test is failing even with this restriction in place could mean that the latter is not _quite_ correct, no? The error you mention appears in `trailing_whitespace`, in particular when (panickingly) slicing into `self.source`, so I tried replacing that with a `get`, but then the test output was a whole different kind of wrong... Maybe I'll admit my defeat here, and trust you to debug this in the future 😅
Owner

But why do children of injection expand beyond the node itself! That feels so wrong to me

But why *do* children of injection expand beyond the node itself! That feels so wrong to me
Author
Owner

The root of a tree as returned by tree-sitter generally (if not always) spans the whole source that you give it to parse. Because we do not to trim roots (condition node.parent().is_some()), this means that the containing node (the one picked out by the injection query) shouln't be trimmed either. This doesn't need to propagate further up the tree since the node picked out by the injection query is the one defining the source area to parse, so we know we're not expanding beyond that.

An alternative (which I would like to explore as a follow up) would be to trim the injected roots as well, and only not trim the real, actual root of the entire document.

The root of a tree as returned by tree-sitter generally (if not always) spans the whole source that you give it to parse. Because we do not to trim roots (condition `node.parent().is_some()`), this means that the containing node (the one picked out by the injection query) shouln't be trimmed either. This doesn't need to propagate further up the tree since the node picked out by the injection query is the one defining the source area to parse, so we know we're not expanding beyond that. An alternative (which I would like to explore as a follow up) would be to trim the injected roots as well, and only not trim the real, actual root of the entire document.
ada4a marked this conversation as resolved
Add comment about why we don't trim injections
All checks were successful
/ test (pull_request) Successful in 56s
432d74619e
ada4a approved these changes 2025-05-23 15:27:46 +02:00
wetneb merged commit 237e4ca79c into main 2025-05-23 17:46:17 +02:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: mergiraf/mergiraf#403
No description provided.