diff --git a/doc/src/SUMMARY.md b/doc/src/SUMMARY.md index a7659eb..e2ade67 100644 --- a/doc/src/SUMMARY.md +++ b/doc/src/SUMMARY.md @@ -6,5 +6,8 @@ - [Conflicts solved, limitations](./conflicts.md) - [Supported languages](./languages.md) - [Adding a new language](./adding-a-language.md) + - [Commutative merging](./adding-a-language/enabling-commutative-merging.md) + - [Advanced configuration](./adding-a-language/advanced-configuration.md) + - [Testing](./adding-a-language/testing.md) - [Architecture](./architecture.md) - [Related work](./related-work.md) diff --git a/doc/src/adding-a-language.md b/doc/src/adding-a-language.md index 8920722..a4ea8fe 100644 --- a/doc/src/adding-a-language.md +++ b/doc/src/adding-a-language.md @@ -24,10 +24,12 @@ LangProfile { file_names: vec![], // the full file names which should be handled with this language language: tree_sitter_c_sharp::LANGUAGE.into(), // the tree-sitter parser // optional settings, explained below - atomic_nodes: vec![], commutative_parents: vec![], signatures: vec![], + atomic_nodes: vec![], injections: None, + flattened_nodes: &[], + comment_nodes: &[], }, ``` @@ -40,344 +42,12 @@ You'll find the binary in `./target/debug/mergiraf`, which supports your languag That's all you need to get basic support for this language in Mergiraf. It already enables syntax-aware merging which should already give better results than line-based merging. -## Add commutative parents +## Next steps -You can improve conflict resolution for this language by defining "commutative parents". -A node in a syntax tree is a commutative parent when the order of its children is unimportant. -This knowledge allows Mergiraf to [automatically solve most conflicts involving insertion or deletion of children of such a parent](./conflicts.md#neighbouring-insertions-and-deletions-of-elements-whose-order-does-not-matter). 
+The `commutative_parents` and `signatures` fields of the language profile can be used to +[enable commutative merging](./adding-a-language/enabling-commutative-merging.md) for certain node types, which is recommended to improve the merge results. +The remaining fields are available for [advanced use cases](./adding-a-language/advanced-configuration.md). -Identifying which node types should commutative is easier with some familiarity with the semantics of the language, but there are usual suspects you can consider: -* **import statements** (such as `import` in Java or Go, `use` in Rust…) -* **field or method declarations** in classes (as in most object-oriented programming languages) -* **declarations of sum-types** (such as `union` in C or functional programming languages) -* **dictionary or set objects** (such as JSON objects, struct instantiations in C/C++…) -* **declarative annotations** of various sorts (such as annotation parameters in Java, trait bounds in Rust, tag attributes in XML / HTML…) - -For instance, C# has import statements called `using` declarations and [some IDEs seem to allow sorting them alphabetically](https://stackoverflow.com/questions/30374210/order-of-using-directives-in-c-sharp-alphabetically). This is a good sign that their order is semantically irrelevant, as in many languages, so let's declare that. - -First, write a small sample file which contains the syntactic elements you are interested in, such as: -```csharp -using System; -using System.Collections.Generic; -using System.IO; - -namespace HelloWorld { - - public class SomeName { - - } -} -``` - -You can inspect how this file is parsed with, either with the [Syntax Tree Playground](https://tree-sitter.github.io/tree-sitter/7-playground) if the language is supported there, or directly via Mergiraf: -```console -$ cargo parse test_file.cs -``` - -which gives: -
└compilation_unit - ├using_directive - │ ├using - │ ├identifier System - │ └; - ├using_directive - │ ├using - │ ├qualified_name - │ │ ├qualifier: qualified_name - │ │ │ ├qualifier: identifier System - │ │ │ ├. - │ │ │ └name: identifier Collections - │ │ ├. - │ │ └name: identifier Generic - │ └; - ├using_directive - │ ├using - │ ├qualified_name - │ │ ├qualifier: identifier System - │ │ ├. - │ │ └name: identifier IO - │ └; - └namespace_declaration - ├namespace - ├name: identifier HelloWorld - └body: declaration_list - ├{ - ├class_declaration - │ ├modifier - │ │ └public - │ ├class - │ ├name: identifier SomeName - │ └body: declaration_list - │ ├{ - │ └} - └} -- -This shows us how our source code is parsed into a tree. We see that the `using` statements are parsed as `using_directive` nodes in the tree. - -To let Mergiraf reorder `using` statements to fix conflicts, we declare that their parent is a commutative one, which will by default let them commute with any of their siblings (any other child of their parent in the syntax tree). -In this example, their parent is the root of the tree (with type `compilation_unit`), which means that we'll allow reordering `using` statements with other top-level elements, such as the namespace declaration. -We'll see later how to restrict this commutativity by defining children groups. - -The commutative parent can be defined in the language profile: -```rust -LangProfile { - commutative_parents: vec![ - CommutativeParent::without_delimiters("compilation_unit", "\n"), - ], - .. -}, -``` - -A commutative parent is not only defined by a type of node, but also: -* the expected separator between its children (here, a newline: `"\n"`) -* any delimiters at the beginning and end of the list of children. Here, there are none, but in many cases, such lists start and end with characters such as `(` and `)` or `{` and `}`. 
- -For instance, to declare that a JSON object is a commutative parent, we do so with -```rust -CommutativeParent::new("object", "{", ", ", "}") -``` -Note how we use the separator is `", "` and not simply `","`. The separators and delimiters should come with sensible default whitespace around them. This whitespace is used as last resort, as Mergiraf attempts to imitate the surrounding style by reusing similar whitespace and indentation settings as existing delimiters and separators. - -After having added our commutative parent definition, we can compile it again with `cargo build`. The resulting binary in `target/debug/mergiraf` will now accept to resolve conflicts like the following one: - -
└method_declaration - ├type: predefined_type void - ├name: identifier Run - ├parameters: parameter_list - │ ├( - │ ├parameter - │ │ ├type: predefined_type int - │ │ └name: identifier times - │ ├, - │ ├parameter - │ │ ├type: predefined_type bool - │ │ └name: identifier fast - │ └) - └body: block - ├{ - └} -- -Notice that some nodes have two labels attached to them: -* the field name, such as `name`, indicating which [field](https://tree-sitter.github.io/tree-sitter/creating-parsers#using-fields) of its parent node it belongs to. It is optional: some nodes like `parameter` ones are not associated to any field. -* the kind, such as `identifier`, which is the type of AST node. Every node has one (for separators or keywords, the source text is the kind) - -In general, when descending into a single predetermined child of a given node, one should use a `Field`. If the number of children is variable then we expect to select them by kind using `ChildKind`. - -The grammar of a tree-sitter parser is defined in [a `grammar.js` file](https://github.com/tree-sitter/tree-sitter-c-sharp/blob/master/grammar.js) and reading it directly can be useful, for instance to understand what are the possible children or parent of a given type of node. Note that node types starting with `_` are private, meaning that they are not exposed to Mergiraf. In case of doubt, just parse some small example to check. - -## Atomic nodes - -Sometimes, the parser analyzes certain constructs with a granularity that is finer than what we need for structured merging. To treat a particular type of node as atomic and ignore any further structure in it, one can add its type to the `atomic_nodes` field. - -This is also useful to work around [certain issues with parsers which don't expose the contents of certain string literals in the syntax trees](https://github.com/tree-sitter/tree-sitter-go/issues/150). - -## Injections - -Certain languages can contain text fragments in other languages. 
For instance, HTML can contain inline Javascript or CSS code. -The `injections` field on the `LangProfile` object can be used to provide a [tree-sitter query locating such fragments](https://tree-sitter.github.io/tree-sitter/3-syntax-highlighting.html#language-injection). -Such a query is normally exposed by the Rust crate for the parser as the `INJECTIONS_QUERY` constant if it has been defined by the parser authors, so it just needs wiring up as `injections: Some(tree_sitter_html::INJECTIONS_QUERY)`. - -## Add tests - -We didn't write any code, just declarative things, but it's still worth checking that the merging that they enable works as expected, and that it keeps doing so in the future. - -### Directory structure -You can add test cases to the end-to-end suite by following the directory structure of other such test cases. Create a directory of the form: -``` -examples/csharp/working/add_imports -``` - -The naming of the `csharp` directory does not matter, nor does `add_imports` which describes the test case we are about to write. In this directory go the following files: -``` -Base.cs -Left.cs -Right.cs -Expected.cs -``` - -All files should have an extension which matches what you defined in the language profile, for them to be parsed correctly. The `Base`, `Left` and `Right` files contain the contents of a sample file at all three revisions, and `Expected` contains the expected merge output of the tool (including any conflict markers). - -If the language you're adding is specified using the full file name (`Makefile`/`pyproject.toml`), the test directory should additionally contain a `language` file with one of the `file_names` specified in the language profile. 
- -For example, here's a directory structure of a `Makefile` test: -``` -Base -Left -Right -Expected -language // contains "Makefile" (without the quotes) -``` - -and for `pyproject.toml`: -``` -Base.toml -Left.toml -Right.toml -Expected.toml -language // contains "pyproject.toml" -``` - -### Running the tests -To run an individual test, you can use a helper: -```console -$ helpers/inspect.sh examples/csharp/working/add_imports -``` - -This will show any differences between the expected output of the merge and the actual one. It also saves the result of some intermediate stages -of the merging process in the `debug` directory, such as the matchings between the three trees as Dotty graphs. -Those can be viewed as SVG files by running `helpers/generate_svg.sh`. - - -To run a test with a debugger, you can use the test defined in `tests/integration_tests.rs`: -```rust -// use this test to debug a specific test case by changing the path in it. -#[test] -fn debug_test() { - run_test_from_dir(Path::new("examples/go/working/remove_and_add_imports")) -} -``` -You can then use an IDE (such as Codium with Rust-analyzer) to set up breakpoints to inspect the execution of the test. - -## Add documentation - -The list of supported languages can be updated in `doc/src/languages.md`. +To submit your language configuration for inclusion in Mergiraf, we ask that you [add some tests](./adding-a-language/testing.md) to validate the merge output. The list of supported languages should also be updated in `doc/src/languages.md`. Mergiraf excitedly awaits your pull request! 
diff --git a/doc/src/adding-a-language/advanced-configuration.md b/doc/src/adding-a-language/advanced-configuration.md new file mode 100644 index 0000000..5a8df54 --- /dev/null +++ b/doc/src/adding-a-language/advanced-configuration.md @@ -0,0 +1,92 @@ +# Advanced language configuration + +## Atomic nodes + +Sometimes, the parser analyzes certain constructs with a granularity that is finer than what we need for structured merging. To treat a particular type of node as atomic and ignore any further structure in it, one can add its type to the `atomic_nodes` field. + +This is also useful to work around [certain issues with parsers which don't expose the contents of certain string literals in the syntax trees](https://github.com/tree-sitter/tree-sitter-go/issues/150). + +## Injections + +Certain languages can contain text fragments in other languages. For instance, HTML can contain inline JavaScript or CSS code. +The `injections` field on the `LangProfile` object can be used to provide a [tree-sitter query locating such fragments](https://tree-sitter.github.io/tree-sitter/3-syntax-highlighting.html#language-injection). +Such a query is normally exposed by the Rust crate for the parser as the `INJECTIONS_QUERY` constant if it has been defined by the parser authors, so it just needs wiring up as `injections: Some(tree_sitter_html::INJECTIONS_QUERY)`. + +## Flattened nodes + +Some parsers will represent certain constructs as nested applications of a binary operation. 
For instance, type unions in TypeScript, such as: + +```ts +export interface MyInterface { + level: 'debug' | 'info' | 'warn' | 'error'; +} +``` + +are represented by the grammar as: +``` +└union_type + ├union_type + │ ├union_type + │ │ ├literal_type + │ │ │ └string + │ │ │ ├' + │ │ │ ├string_fragment debug + │ │ │ └' + │ │ ├| + │ │ └literal_type + │ │ └string + │ │ ├' + │ │ ├string_fragment info + │ │ └' + │ ├| + │ └literal_type + │ └string + │ ├' + │ ├string_fragment warn + │ └' + ├| + └literal_type + └string + ├' + ├string_fragment error + └' +``` + +This nested structure prevents commutative merging of changes in such fragments. To work around that, Mergiraf supports flattening such binary operators into the following structure: +``` +└union_type + ├literal_type + │ └string + │ ├' + │ ├string_fragment debug + │ └' + ├| + ├literal_type + │ └string + │ ├' + │ ├string_fragment info + │ └' + ├| + ├literal_type + │ └string + │ ├' + │ ├string_fragment warn + │ └' + ├| + └literal_type + └string + ├' + ├string_fragment error + └' +``` + +This is achieved by specifying `flattened_nodes: &["union_type"]` in the language profile. + +## Comment nodes + +Another tweak that Mergiraf does on top of the parser's output is attaching comment nodes to the syntactic elements they annotate. This eases the commutative merging of such elements, by preventing those comments from getting detached from their elements in the +merged output. + +This heuristic is applied to all nodes that are [marked as "extra" by the tree-sitter grammar](https://tree-sitter.github.io/tree-sitter/creating-parsers/3-writing-the-grammar.html#using-extras) (meaning that the parser allows them to appear anywhere in the tree, even if they are not mentioned in a rule). +In certain cases, it can be useful to extend this heuristic to also attach other nodes, which behave as comments but aren't marked as "extra" in the grammar. 
This can be done by adding their node type to the `comment_nodes` field of the language profile. + diff --git a/doc/src/adding-a-language/enabling-commutative-merging.md b/doc/src/adding-a-language/enabling-commutative-merging.md new file mode 100644 index 0000000..35d1bd0 --- /dev/null +++ b/doc/src/adding-a-language/enabling-commutative-merging.md @@ -0,0 +1,266 @@ +# Commutative merging + +This is the second part of the tutorial on adding support for a new language in Mergiraf, with the example of C#. + +## Add commutative parents + +You can improve conflict resolution for a language by defining "commutative parents". +A node in a syntax tree is a commutative parent when the order of its children is unimportant. +This knowledge allows Mergiraf to [automatically solve most conflicts involving insertion or deletion of children of such a parent](../conflicts.md#neighbouring-insertions-and-deletions-of-elements-whose-order-does-not-matter). + +Identifying which node types should be commutative is easier with some familiarity with the semantics of the language, but there are usual suspects you can consider: +* **import statements** (such as `import` in Java or Go, `use` in Rust…) +* **field or method declarations** in classes (as in most object-oriented programming languages) +* **declarations of sum-types** (such as `union` in C or functional programming languages) +* **dictionary or set objects** (such as JSON objects, struct instantiations in C/C++…) +* **declarative annotations** of various sorts (such as annotation parameters in Java, trait bounds in Rust, tag attributes in XML / HTML…) + +For instance, C# has import statements called `using` declarations and [some IDEs seem to allow sorting them alphabetically](https://stackoverflow.com/questions/30374210/order-of-using-directives-in-c-sharp-alphabetically). This is a good sign that their order is semantically irrelevant, as in many languages, so let's declare that. 
+ +First, write a small sample file which contains the syntactic elements you are interested in, such as: +```csharp +using System; +using System.Collections.Generic; +using System.IO; + +namespace HelloWorld { + + public class SomeName { + + } +} +``` + +You can inspect how this file is parsed, either with the [Syntax Tree Playground](https://tree-sitter.github.io/tree-sitter/7-playground) if the language is supported there, or directly via Mergiraf: +```console +$ cargo parse test_file.cs +``` + +which gives: +
└compilation_unit + ├using_directive + │ ├using + │ ├identifier System + │ └; + ├using_directive + │ ├using + │ ├qualified_name + │ │ ├qualifier: qualified_name + │ │ │ ├qualifier: identifier System + │ │ │ ├. + │ │ │ └name: identifier Collections + │ │ ├. + │ │ └name: identifier Generic + │ └; + ├using_directive + │ ├using + │ ├qualified_name + │ │ ├qualifier: identifier System + │ │ ├. + │ │ └name: identifier IO + │ └; + └namespace_declaration + ├namespace + ├name: identifier HelloWorld + └body: declaration_list + ├{ + ├class_declaration + │ ├modifier + │ │ └public + │ ├class + │ ├name: identifier SomeName + │ └body: declaration_list + │ ├{ + │ └} + └} ++ +This shows us how our source code is parsed into a tree. We see that the `using` statements are parsed as `using_directive` nodes in the tree. + +To let Mergiraf reorder `using` statements to fix conflicts, we declare that their parent is a commutative one, which will by default let them commute with any of their siblings (any other child of their parent in the syntax tree). +In this example, their parent is the root of the tree (with type `compilation_unit`), which means that we'll allow reordering `using` statements with other top-level elements, such as the namespace declaration. +We'll see later how to restrict this commutativity by defining children groups. + +The commutative parent can be defined in the language profile: +```rust +LangProfile { + commutative_parents: vec![ + CommutativeParent::without_delimiters("compilation_unit", "\n"), + ], + .. +}, +``` + +A commutative parent is not only defined by a type of node, but also: +* the expected separator between its children (here, a newline: `"\n"`) +* any delimiters at the beginning and end of the list of children. Here, there are none, but in many cases, such lists start and end with characters such as `(` and `)` or `{` and `}`. 
+ +For instance, to declare that a JSON object is a commutative parent, we do so with +```rust +CommutativeParent::new("object", "{", ", ", "}") +``` +Note how the separator is `", "` and not simply `","`. The separators and delimiters should come with sensible default whitespace around them. This whitespace is used as a last resort, as Mergiraf attempts to imitate the surrounding style by reusing similar whitespace and indentation settings as existing delimiters and separators. + +After having added our commutative parent definition, we can compile it again with `cargo build`. The resulting binary in `target/debug/mergiraf` will now be able to resolve conflicts like the following one: + +
└method_declaration + ├type: predefined_type void + ├name: identifier Run + ├parameters: parameter_list + │ ├( + │ ├parameter + │ │ ├type: predefined_type int + │ │ └name: identifier times + │ ├, + │ ├parameter + │ │ ├type: predefined_type bool + │ │ └name: identifier fast + │ └) + └body: block + ├{ + └} ++ +Notice that some nodes have two labels attached to them: +* the field name, such as `name`, indicating which [field](https://tree-sitter.github.io/tree-sitter/creating-parsers#using-fields) of its parent node it belongs to. It is optional: some nodes like `parameter` ones are not associated with any field. +* the kind, such as `identifier`, which is the type of AST node. Every node has one (for separators or keywords, the source text is the kind). + +In general, when descending into a single predetermined child of a given node, one should use a `Field`. If the number of children is variable then we expect to select them by kind using `ChildKind`. + +The grammar of a tree-sitter parser is defined in [a `grammar.js` file](https://github.com/tree-sitter/tree-sitter-c-sharp/blob/master/grammar.js) and reading it directly can be useful, for instance to understand what the possible children or parents of a given type of node are. Note that node types starting with `_` are private, meaning that they are not exposed to Mergiraf. In case of doubt, just parse some small example to check. diff --git a/doc/src/adding-a-language/testing.md b/doc/src/adding-a-language/testing.md new file mode 100644 index 0000000..d81d6f0 --- /dev/null +++ b/doc/src/adding-a-language/testing.md @@ -0,0 +1,62 @@ +# Testing language configurations + +Adding support for a language in Mergiraf doesn't require any code, just declarative configuration, but it's still worth checking that the merging that it enables works as expected, and that it keeps doing so in the future. 
+ +## Directory structure + +You can add test cases to the end-to-end suite by following the directory structure of other such test cases. Create a directory of the form: +``` +examples/csharp/working/add_imports +``` + +The naming of the `csharp` directory does not matter, nor does `add_imports` which describes the test case we are about to write. In this directory go the following files: +``` +Base.cs +Left.cs +Right.cs +Expected.cs +``` + +All files should have an extension which matches what you defined in the language profile, for them to be parsed correctly. The `Base`, `Left` and `Right` files contain the contents of a sample file at all three revisions, and `Expected` contains the expected merge output of the tool (including any conflict markers). + +If the language you're adding is specified using the full file name (`Makefile`/`pyproject.toml`), the test directory should additionally contain a `language` file with one of the `file_names` specified in the language profile. + +For example, here's a directory structure of a `Makefile` test: +``` +Base +Left +Right +Expected +language // contains "Makefile" (without the quotes) +``` + +and for `pyproject.toml`: +``` +Base.toml +Left.toml +Right.toml +Expected.toml +language // contains "pyproject.toml" +``` + +## Running the tests +To run an individual test, you can use a helper: +```console +$ helpers/inspect.sh examples/csharp/working/add_imports +``` + +This will show any differences between the expected output of the merge and the actual one. It also saves the result of some intermediate stages +of the merging process in the `debug` directory, such as the matchings between the three trees as Dotty graphs. +Those can be viewed as SVG files by running `helpers/generate_svg.sh`. + + +To run a test with a debugger, you can use the test defined in `tests/integration_tests.rs`: +```rust +// use this test to debug a specific test case by changing the path in it. 
+#[test] +fn debug_test() { + run_test_from_dir(Path::new("examples/go/working/remove_and_add_imports")) +} +``` +You can then use an IDE (such as Codium with Rust-analyzer) to set up breakpoints to inspect the execution of the test. +
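
The test-case layout described in the testing guide (`Base`, `Left`, `Right` and `Expected` files, plus an optional `language` file) can also be created programmatically. The following is a minimal sketch under stated assumptions, not something shipped with Mergiraf: the function `scaffold_test_case` and its parameters are hypothetical names introduced here for illustration, and the files it creates are empty placeholders that still need to be filled in by hand.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of Mergiraf): creates the four revision
/// files of an end-to-end test case in `dir`, plus a `language` file when the
/// language is identified by a full file name rather than by an extension.
/// Any existing file with the same name is overwritten with empty content.
fn scaffold_test_case(
    dir: &Path,
    extension: Option<&str>,
    language_file_name: Option<&str>,
) -> io::Result<Vec<PathBuf>> {
    fs::create_dir_all(dir)?;
    let mut created = Vec::new();
    for revision in ["Base", "Left", "Right", "Expected"] {
        // With an extension we get e.g. `Base.cs`, without one just `Base`.
        let file_name = match extension {
            Some(ext) => format!("{revision}.{ext}"),
            None => revision.to_string(),
        };
        let path = dir.join(file_name);
        fs::write(&path, "")?;
        created.push(path);
    }
    if let Some(name) = language_file_name {
        // The `language` file holds the full file name, without quotes.
        let path = dir.join("language");
        fs::write(&path, name)?;
        created.push(path);
    }
    Ok(created)
}

fn main() -> io::Result<()> {
    // Written to a temporary directory so that no real test case is touched.
    let dir = std::env::temp_dir().join("mergiraf_scaffold_demo/csharp/working/add_imports");
    for path in scaffold_test_case(&dir, Some("cs"), None)? {
        println!("created {}", path.display());
    }
    Ok(())
}
```

Calling `scaffold_test_case(dir, None, Some("Makefile"))` would produce the extension-less `Makefile` layout, and `scaffold_test_case(dir, Some("toml"), Some("pyproject.toml"))` the `pyproject.toml` one.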