Concepts in string transformation

2025年2月22日

String similarity and transformation (Edit Script)

A distance of two sequences can be defined, for instance by TLSH.
Such a definition may not derive an edit script.

string transformation (edit script)
code transformation
code refactoring

There are two origins of the term Edit Script.
According to diff, edit script is ed script where ed is the ancient text editor.
diff has an -e option to produce an ed script for the ed command to edit one file into another automatically. Today, with patch, this -e option is rarely used.

Another explanation is that edit script is the most abstract way of transforming text. An edit script contains a number of edit commands, and an edit command can be deletion, addition, substitution, or any transformation you can think of.

[Understanding and Improving Batch Refactoring in Software Systems]
Single refactorings can be grouped into a batch of refactorings, and a batch of refactorings can span multiple revisions.
If a single revision consists of more than one refactorings, there are tools to untangle the revision, or decompose it into multiple smaller changes.

git rerere remembers how users have resolved a hunk conflict so that the next time it sees the same conflict, Git can resolve it automatically.

git-annex resolves a conflicted merge, by adding both conflicting versions of the file to the tree.

[An Approach for Exploring Code-Improving Transformations] can generate patterns for searching optimization points.
Their patterns are based on syntax and use first-order logic to exam types of operators, constness of variables, etc.

An edit script for A and B is a set of insertion and deletion commands that transform A into B.
[An O(ND) Difference Algorithm and Its Variations]

GNU diff and git diff offer a number of output format. With -e it outputs an ed script. With -n, it outputs an RCS format diff. By default or with –normal, the diff tool outputs a normal diff. Hence, the output of diff is also called diff.

What the tool GNU diff takes in is called a patch. A patch can be a post, a file, or any text that has a section of a diff produced by diff.

From viewpoint of string transformation, the output of diff can be called edit script.

Given a source string and a target string, a table be drawn, where the vertical axis is the source and the horizontal axis is the target. This table is named Edit Graph. Beacause the horizontal axis presents the target string, movements from left to right mean adding target characters. Movements from top to down mean deleting source characters. Diagonal movements happen when only source and target have the same character. Any trace from top left corner to the bottom right corner is an edit script from source to target. Calculating the shortest path is a problem. Finding the shortest edit script is a special instance of the single-source shortest path problem. Weights can be added to favor certain kind of alighment.

Common subsequences, edit scripts, traces, and paths from (0,0) to (N,M) in the edit graph are all isomorphic formalisms.

Levenshtein distance, which was published earlier than Myers algorithm, allows substitution operations. In an edit graph, a substitution is a diagonal edge where source and target characters differ.

If another edit operation is invented, it may not be presented in a edit graph.

Myers algorithm is greedy. It does not fully search the space as Levenshtein does. Myers tends to match the first identical unit, then continues to find next identical unit. The unit can be a character or a word. Note, Myers algorithm is not necessarily line-based even though GNU diff uses this algorithm.