Support for nested languages #5

Closed
opened 2024-11-04 22:27:04 +01:00 by wetneb · 1 comment
Owner

It is common for HTML documents to contain embedded CSS or JS.

It would be nice if Mergiraf could use a different language profile for each part of the document, so that it can resolve JavaScript conflicts in a language-aware way even when the code is embedded in an HTML document.

Author
Owner

I've been considering an implementation strategy for this. Tree-sitter parsers already come with an injections query (https://tree-sitter.github.io/tree-sitter/3-syntax-highlighting.html#language-injection), which defines which nodes of a language contain foreign content that can be parsed in another language (and how to identify that language). It feels natural to reuse this existing functionality.

My goal would be to represent the nested trees directly at the parsing stage (when creating the AstNodes) and then keep the matching and merging heuristics untouched. The only places where our merging heuristics are language-dependent are the dependencies on commutative parents and signatures, which are defined in the LangProfile.

Here's a proposal:

  • Add a field on AstNode pointing to the LangProfile from which the node came. This has a small memory cost, as we're adding a pointer to every node, but I think we can afford it.
  • Use this field to remove a lot of LangProfile parameters from methods, so that they access it from the nodes they work on instead. From my initial investigation, this seems to bring quite a nice simplification in a lot of places.
  • When converting tree-sitter's output to AstNodes, run the injections query associated with the tree-sitter language to check for nodes whose content can be parsed in another language. When the query matches, look up that language by name among our own supported languages, and if there is a match, recursively parse the contents with that language and include the results as children of the outer tree.
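To illustrate the first two points, here is a minimal sketch. The types are heavily simplified, hypothetical stand-ins for the real AstNode and LangProfile (the field names, the slice of commutative parents and the example node kinds are assumptions made for illustration, not the actual definitions):

```rust
/// Hypothetical, simplified stand-in for `LangProfile`.
#[derive(Debug)]
pub struct LangProfile {
    pub name: &'static str,
    /// Node kinds whose children may be reordered freely (illustrative only).
    pub commutative_parents: &'static [&'static str],
}

/// Hypothetical, simplified stand-in for `AstNode`.
/// Each node borrows the profile of the language it was parsed with.
#[derive(Debug)]
pub struct AstNode<'a> {
    pub kind: &'static str,
    pub children: Vec<AstNode<'a>>,
    pub lang_profile: &'a LangProfile,
}

impl<'a> AstNode<'a> {
    /// No `LangProfile` parameter is needed here: the node consults
    /// the profile it carries, so nodes from different languages can
    /// coexist in one tree.
    pub fn is_commutative_parent(&self) -> bool {
        self.lang_profile.commutative_parents.contains(&self.kind)
    }
}
```

With this shape, a JavaScript node nested under an HTML node answers language-dependent questions using its own profile, without the caller having to know which language it is in.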

With this, we should already have basic support for nested languages.
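The recursive step from the third point could look roughly like the sketch below. It does not use the tree-sitter API: the `parse` and `injected_language` function pointers, the toy parsers and the `build` function are all hypothetical stand-ins for the real parser and its injections query, chosen only to show the control flow.

```rust
/// Toy parse output standing in for a tree-sitter tree.
pub struct RawNode {
    pub kind: &'static str,
    pub text: String,
    pub children: Vec<RawNode>,
}

/// Hypothetical, simplified language profile.
pub struct LangProfile {
    pub name: &'static str,
    /// Stand-in for running the language's parser.
    pub parse: fn(&str) -> RawNode,
    /// Stand-in for the injections query: name of the language
    /// embedded in this node, if any.
    pub injected_language: fn(&RawNode) -> Option<&'static str>,
}

pub struct AstNode<'a> {
    pub kind: &'static str,
    pub lang_profile: &'a LangProfile,
    pub children: Vec<AstNode<'a>>,
}

/// Converts a raw tree to `AstNode`s, descending into injected languages
/// when they are found in `registry` (our supported languages).
pub fn build<'a>(
    raw: &RawNode,
    profile: &'a LangProfile,
    registry: &'a [LangProfile],
) -> AstNode<'a> {
    let injected = (profile.injected_language)(raw)
        .and_then(|name| registry.iter().find(|p| p.name == name));
    let children = match injected {
        // Supported injected language: parse the node's text with it and
        // attach the nested tree as a child of the outer node.
        Some(inner) => {
            let inner_raw = (inner.parse)(&raw.text);
            vec![build(&inner_raw, inner, registry)]
        }
        // No injection, or unsupported language: keep the outer children.
        None => raw
            .children
            .iter()
            .map(|child| build(child, profile, registry))
            .collect(),
    };
    AstNode { kind: raw.kind, lang_profile: profile, children }
}

/// Toy HTML "parser": treats the input as a single script element.
pub fn parse_toy_html(src: &str) -> RawNode {
    let body = src
        .strip_prefix("<script>")
        .and_then(|s| s.strip_suffix("</script>"))
        .unwrap_or("");
    let raw_text = RawNode { kind: "raw_text", text: body.to_string(), children: vec![] };
    RawNode { kind: "script_element", text: src.to_string(), children: vec![raw_text] }
}

/// Toy injection rule: script contents are JavaScript.
pub fn html_injection(node: &RawNode) -> Option<&'static str> {
    (node.kind == "raw_text").then_some("javascript")
}

/// Toy JavaScript "parser": a single `program` node.
pub fn parse_toy_js(src: &str) -> RawNode {
    RawNode { kind: "program", text: src.to_string(), children: vec![] }
}

pub fn no_injection(_node: &RawNode) -> Option<&'static str> {
    None
}
```

If the injected language is not in the registry, the node is simply kept as it came out of the outer parser, so unsupported embedded languages would behave as they do today.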

wetneb self-assigned this 2025-05-15 00:36:07 +02:00
Reference: mergiraf/mergiraf#5