String Editing Problem
2025年2月22日
The string editing problem arises in many applications, notably in text editing, molecular sequence comparison, etc. Algorithmic aspects of this problem have been studied rather extensively before 1990s. An edit script is a sequence of edit operations on a string or tree. [1] Each edit operation comes with a cost. The sum of the costs is edit distance.
The most common edit operations are keep, addition, and deletion. The Myers algorithm generates an edit script from string x to string y with the 3 primitive operations. The Levenshtein distance utilizes a fourth operation, substitution, to find minimal edit distance between two strings.
Edit scripts can be thought as a domain-specific language (DSL). The simplest kind of edit scripts, such as Listing , is a regular language. Listing is an equivalent presentation.
keep(0, a) del(1, b) add(1, c)
a -b +c
However, nothing prevents an edit script from being irregular. Listing is a context-free language. It is not regular because the syntax requires balanced curly brackets. In other words, this kind of edit script language must reject Listing . If you use GOTO
instead of looping, the syntax can be regular.
for(i=0;i<100;i+=4) { keep(i+0, a) del(i+1, b) add(i+1, c) }
for(i=0;i<100;i+=4) { keep(i+0, a) del(i+1, b) add(i+1, c) }}
Data transformation
If you think a revision as a datum, then making a commit is transforming data. Hence, the problem becomes a data transformation problem.
A dozen of edit scripts, ie. languages, exist for data transformation, including regular expressions with replacement[2], vimscrpit, and AWK.
GNU diff has an option --ed
to output ed script, which is a script language for the editor named ed.
See https://en.wikipedia.org/wiki/Data_transformation_(computing)

Figure shows a list of string transformations done by a single edit script, the formula "I love " & RIGHT(A2, LEN(A2)- FIND(",",A2)-1)
. The Formula language in Excel is context-free, irregular.
The Flash Fill feature since Excel 2013 is a PBE system. It is few-shot learning with merely one to two positive examples.
String similarity and transformation (Edit Script)
A distance of two sequences can be defined, for instance by TLSH.
Such a definition may not derive an edit script.
string transformation (edit script)
code transformation
code refactoring
There are two origins of the term Edit Script.
According to diff, edit script is ed script where ed is the ancient text editor.
diff has an -e option to produce an ed script for the ed command to edit one file into another automatically. Today, with patch, this -e option is rarely used.
[Understanding and Improving Batch Refactoring in Software Systems]
Single refactorings can be grouped into a batch of refactorings, and a batch of refactorings can span multiple revisions.
If a single revision consists of more than one refactorings, there are tools to untangle the revision, or decompose it into multiple smaller changes.
git rerere remembers how users have resolved a hunk conflict so that the next time it sees the same conflict, Git can resolve it automatically.
git-annex resolves a conflicted merge, by adding both conflicting versions of the file to the tree.
[An Approach for Exploring Code-Improving Transformations] can generate patterns for searching optimization points.
Their patterns are based on syntax and use first-order logic to exam types of operators, constness of variables, etc.
An edit script for A and B is a set of insertion and deletion commands that transform A into B.
[An O(ND) Difference Algorithm and Its Variations]
GNU diff and git diff offer a number of output format. With -e it outputs an ed script. With -n, it outputs an RCS format diff. By default or with --normal, the diff tool outputs a normal diff. Hence, the output of diff is also called diff.
What the tool GNU diff takes in is called a patch. A patch can be a post, a file, or any text that has a section of a diff produced by diff.
Given a source string and a target string, a table be drawn, where the vertical axis is the source and the horizontal axis is the target. This table is named Edit Graph. Beacause the horizontal axis presents the target string, movements from left to right mean adding target characters. Movements from top to down mean deleting source characters. Diagonal movements happen when only source and target have the same character. Any trace from top left corner to the bottom right corner is an edit script from source to target. Calculating the shortest path is a problem. Finding the shortest edit script is a special instance of the single-source shortest path problem. Weights can be added to favor certain kind of alighment.
Common subsequences, edit scripts, traces, and paths from (0,0) to (N,M) in the edit graph are all isomorphic formalisms.
Levenshtein distance, which was published earlier than Myers algorithm, allows substitution operations. In an edit graph, a substitution is a diagonal edge where source and target characters differ.
If another edit operation is invented, it may not be presented in a edit graph.
Myers algorithm is greedy. It does not fully search the space as Levenshtein does. Myers tends to match the first identical unit, then continues to find next identical unit. The unit can be a character or a word. Note, Myers algorithm is not necessarily line-based even though GNU diff uses this algorithm.
String Substitutions
String Substitution is a function from a string to a string. Define shorthand \(f(s)=f(\epsilon, \epsilon, s,0)\) , where \(\epsilon\) means empty string in formal language theory. Take the substitution \(\textrm{hi}\) to \(\textrm{bye}\), function f is define as the following.
If \(s=\epsilon\), \(f(p_1, p_2, s, i)=p_1 \). \(p_1\) and \(p_2\) mean the possibilities of no match and match, respectively.
If \(s \neq \epsilon\), let \(s=xs'\), we have
$$
f(p_1, p_2, xs', i)=
\begin{cases}
f(p_1x, p_2\textrm{b}, s',1) & \text{if } i=0, x=\textrm{h} \\
p_2\textrm{ye} f(\epsilon, \epsilon,s',0) & \text{if } i=1, x=\textrm{i} \\
p_1 f(\epsilon, \epsilon, xs',0) & \text{if } x=\textrm{h} \\
p_1x f(\epsilon, \epsilon, s',0) & \text{otherwise } \\
\end{cases}
$$
For input \(\textrm{hi}\),
\begin{align*}
f(\textrm{hi}) =& f(\epsilon, \epsilon, \textrm{hi},0) \\
=& f(\textrm{h}, \textrm{b}, \textrm{i},1) \\
=& \textrm{bye} f(\epsilon, \epsilon,\epsilon,0) \\
=& \textrm{bye}\\
\end{align*}
For input \(\textrm{hhi}\),
\begin{align*}
f(\textrm{hhi}) =& f(\epsilon, \epsilon, \textrm{hhi},0) \\
=& f(\textrm{h}, \textrm{b}, \textrm{hi},1) \\
=& \textrm{h} f(\epsilon, \epsilon, \textrm{hi},0) \\
=& \textrm{hbye}\\
\end{align*}
For input \(\textrm{hihe}\),
\begin{align*}
f(\textrm{hihe}) =& f(\epsilon, \epsilon, \textrm{hihe},0) \\
=& f(\textrm{h}, \textrm{b}, \textrm{ihe},1) \\
=& \textrm{bye} f(\epsilon, \epsilon, \textrm{he},0) \\
=& \textrm{bye} f(\textrm{h}, \textrm{b}, \textrm{e},0) \\
=& \textrm{byee} f(\epsilon, \epsilon, \epsilon, 0) \\
=& \textrm{byee} \\
\end{align*}
References
- Mikhail J. Atallah and Marina Blanton; . Algorithms and Theory of Computation Handbook. . 2010, General Concepts and Techniques (): 298 [].↑
- Rishabh Singh; . BlinkFill: Semi-supervised Programming By Example for Syntactic String Transformations. Proceedings of the VLDB Endowment, Volume 9, Issue. 2016, (): [].↑
[…] Distance aims to finding the minimal distance between two sequences, it actually derives an edit script (ie., a list of edit operation). Levenshtein distance forms substitution operations. In an edit […]
[…] An edit script is a sequence of edit operations. Hence, an edit operation is an atom in its grammar. As a result, edit scripts can be generated (synthesized) by programming by examples. […]