differencing algorithms

January 30, 2025

Given two (sets of) objects \(o_1, o_2\), the learn function generates a program p. When given \(o_1\), the program p recovers \(o_2\). When given another object \(o_3\), the program p returns \(o_4\) if applicable; p may also reject an input.

$$
\begin{align*}
\operatorname{learn}(o_1,o_2) &=p \\
\operatorname{p}(o_1) &=o_2 \\
\operatorname{p}(o_3) &=o_4
\end{align*}
$$

The way the learn function works is called programming by examples, and the inputs \(o_1\) and \(o_2\) are the examples. These definitions and terms align with machine learning, and the program p has precision and recall because p may not correctly map all features in \(o_1\) to their labels in \(o_2\).

While neural networks require hyperparameters, such as the number of layers and the activation functions, to learn p, other AI techniques need a myriad of heuristics to balance generalization against overfitting.[Inference of Regular Expressions for Text Extraction from Examples]

Problem Statement

The problem input consists of two sets of files. The problem is solved by learning a program p such that \(p(o_1)\) is as close to \(o_2\) as possible. p should avoid overfitting and thus generalize beyond the provided examples. When given a different object, \(p(o_3)\) either returns a failure class or \(o_4\).

Each commit is a problem instance because technically we can derive a different diffing/patching strategy for each commit.

Traditionally, people take the file content as is. If we adopt a kernel function and transform the file content into a different space, a novel solution may be obtained.

Take “1001 -> 1100”. Using a textual differencing algorithm, we get the following edit operations.

  • Insert 1 at index 1 (so the source string becomes 11001)
  • Delete 1 at index 4 (so the string becomes 1100)

If we adopt a kernel function int(), and treat 1001 and 1100 as integers, the learned program might be \(p(x)=x+99\).
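Both views can be reproduced in a few lines. The sketch below uses Python’s difflib as the textual differ; treating int() as the kernel and learning a constant offset is an illustrative assumption, not a general learning procedure.

```python
import difflib

src, dst = "1001", "1100"

# Textual view: character-level edit operations from a generic differ.
ops = difflib.SequenceMatcher(a=src, b=dst).get_opcodes()
edits = [(tag, src[i1:i2], dst[j1:j2])
         for tag, i1, i2, j1, j2 in ops if tag != "equal"]
print(edits)  # an insertion of "1" and a deletion of "1"

# Kernel view: int() maps both strings to numbers, and the "learned"
# program is a constant offset (an illustrative assumption).
offset = int(dst) - int(src)  # 99
p = lambda x: x + offset
print(p(int(src)))  # 1100
print(p(2000))      # 2099: the program generalizes to new inputs
```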

A differencing algorithm finds differences between two objects. A patching algorithm applies the difference to an object.

$$
\begin{align*}
\operatorname{diff}(o_1,o_2) &=d \\
\operatorname{patch}(d, o_1) &=o_2 \\
\operatorname{patch}(d, o_3) &=o_4
\end{align*}
$$

This is essentially an algebra.
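As a minimal sketch of this algebra (assuming difflib opcodes as the representation of d; reapplying d to a different object \(o_3\) by position is one possible patch semantics, not the only one):

```python
import difflib

def diff(o1, o2):
    """diff(o1, o2) = d, an edit script such that patch(d, o1) == o2."""
    sm = difflib.SequenceMatcher(a=o1, b=o2)
    return [(tag, i1, i2, o2[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

def patch(d, o):
    """Apply the edit script from right to left so earlier indices stay valid."""
    for tag, i1, i2, repl in sorted(d, key=lambda e: e[1], reverse=True):
        o = o[:i1] + repl + o[i2:]
    return o

d = diff("1001", "1100")
print(patch(d, "1001"))  # "1100": patch(d, o1) = o2
print(patch(d, "1002"))  # patch(d, o3) = o4, reusing positions blindly
```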

The learned transformation f can be, for example, multiplying by 1.5.

Regular expressions excel at dealing with semi-structured or unstructured text.[Inference of Regular Expressions for Text Extraction from Examples]

Suppose \(t\) is a text line and \(s\) is the substring of \(t\) that needs to be extracted. [Automatic Generation of Regular Expressions from Examples with Genetic Programming] proposes an algorithm to synthesize an expression that extracts \(s\) from \(t\); \(s\) is also called an entity in this context.
The algorithm takes positive and negative examples: if \(s\) is empty, the pair is a negative example.
In this way, precision and recall can be calculated.
The fitness of a regular expression combines the Levenshtein distance between the label and the prediction (in the glossary of machine learning [1]) with a penalty for the length of the regular expression. In this way, the algorithm favors short expressions that match more.
The genetic programming algorithm in this paper produces two regular expressions to match HTML tags and phone numbers, respectively. This algorithm is not feasible for conflict resolution because each commit is a new domain, and we do not have time to run GP for each commit.
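The fitness idea can be sketched as follows. This is a paraphrase, not the paper’s exact formula: the extraction error (Levenshtein distance between the desired entity and what a candidate extracts) is combined with a length penalty, and lower is better. The pattern and examples are made up.

```python
import re

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def fitness(pattern, examples):
    """Lower is better: extraction error plus a length penalty,
    so short expressions that match well win."""
    error = 0
    for text, entity in examples:  # entity == "" marks a negative example
        m = re.search(pattern, text)
        extracted = m.group(0) if m else ""
        error += levenshtein(extracted, entity)
    return error + len(pattern)

examples = [("call 555-1234 now", "555-1234"), ("no number here", "")]
print(fitness(r"\d{3}-\d{4}", examples))  # a short, exact extractor scores well
```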

A differencing algorithm finds differences between two sequences or strings. Such an algorithm faces a number of challenges:

  • find the minimal number of differences
  • present differences in a logical way

Each difference can be referred to as an edit operation. The kinds of edit operations include addition, deletion, substitution, transposition, and so on. The operand of an edit can be a character, a word, a line, or any larger unit.

A differencing algorithm is not necessarily a diff algorithm, especially if you read diff as the tool GNU diff or git diff.

Basic Representation

Given source string s1 and target string s2, our goal is to map s1 to s2. If we place s1 horizontally and s2 vertically to form a grid, our goal is to traverse from the top-left corner to the bottom-right corner, where moving right means deleting a character from the source, and moving down means inserting a character from the target.

Let’s say the two input strings are both of length \(n\). The grid then has \((n+1)\times(n+1)\) vertices; the top-left corner has coordinate (0,0), the bottom-right corner (n,n).

Each movement can be considered an edge from one vertex to another, so the grid becomes a directed graph. Edges can have non-negative weights. In particular, a diagonal edge across two matching characters can have weight 0, because if two strings are identical, no edit operation should be generated.

Based on this graph representation, we can use the well-known Dijkstra’s algorithm to find the shortest path from the starting point (0,0) to the ending point (n,n). With a binary heap this runs in \(O(n^2 \log n)\) over the grid’s \(O(n^2)\) edges; since all weights are 0 or 1, a deque-based 0-1 BFS brings it down to \(O(n^2)\).
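A sketch of this formulation, under the convention that moving right consumes a source character (deletion, cost 1), moving down consumes a target character (insertion, cost 1), and a diagonal move over matching characters is free:

```python
import heapq

def edit_distance_dijkstra(s1, s2):
    """Shortest path in the edit graph from (0, 0) to (len(s1), len(s2))."""
    n, m = len(s1), len(s2)
    dist = {(0, 0): 0}
    heap = [(0, 0, 0)]  # (distance, i, j)
    while heap:
        d, i, j = heapq.heappop(heap)
        if (i, j) == (n, m):
            return d
        if d > dist.get((i, j), float("inf")):
            continue  # stale heap entry
        steps = [(i + 1, j, 1), (i, j + 1, 1)]  # delete / insert, cost 1
        if i < n and j < m and s1[i] == s2[j]:
            steps.append((i + 1, j + 1, 0))     # free diagonal match
        for ni, nj, w in steps:
            if ni <= n and nj <= m and d + w < dist.get((ni, nj), float("inf")):
                dist[(ni, nj)] = d + w
                heapq.heappush(heap, (d + w, ni, nj))
    return dist[(n, m)]

print(edit_distance_dijkstra("1001", "1100"))  # 2: one insertion, one deletion
```

Because the weights are only 0 and 1, replacing the heap with a deque (0-1 BFS) removes the log factor, matching the \(O(n^2)\) bound above.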

Levenshtein Distance

Although Levenshtein distance aims to find the minimal distance between two sequences, it also derives an edit script (i.e., a list of edit operations).

Levenshtein distance is formulated as a recursive mathematical equation. If it is implemented as is, its time complexity is high due to duplicated computation. However, if it is implemented with dynamic programming, its time complexity reaches \(O(n^2)\), which is widely believed to be optimal.
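The dynamic-programming version, keeping only one row of the table at a time:

```python
def levenshtein(a, b):
    """O(len(a) * len(b)) edit distance with substitutions allowed."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

print(levenshtein("1001", "1100"))  # 2: two substitutions
```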

Non-minimal algorithms

A minimal edit distance does not necessarily derive an intuitive diff. Take “1001 -> 1100”. Levenshtein distance (even with the extension of transposition) may report two edits:

  • Insert 1 at index 1 (so the source string becomes 11001)
  • Delete 1 at index 4 (so the string becomes 1100)

See? Inserting and deleting the same character is puzzling. A more intuitive representation is simply “substitute 1001 with 1100”, whose default cost is 4 because four characters are involved.

Moreover, if we do not aim to produce a minimal edit script, it is possible to devise an algorithm with time complexity lower than \(O(n^2)\). Now, let me introduce non-minimal algorithms.

Yusuf Sulistyo Nugroho et al. How different are different diff algorithms in Git? Empirical Software Engineering, 2019. [2025-01-30].

myers

minimal

patience

The patience algorithm is an algorithm for matching lines. By anchoring on lines that are unique in both files, it specializes in handling deletion and insertion of the same line.[2]
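A sketch of the core idea as I understand it (the actual git implementation differs in details, and the name `patience_anchors` is mine): collect lines that occur exactly once in both files, then keep the longest run of them whose order agrees in both files, found by patience sorting.

```python
from bisect import bisect_left
from collections import Counter

def patience_anchors(a_lines, b_lines):
    """Common lines unique in both files, kept in an order consistent
    with both (a longest increasing subsequence of their b-positions)."""
    ca, cb = Counter(a_lines), Counter(b_lines)
    pos_b = {line: i for i, line in enumerate(b_lines)}
    seq = [line for line in a_lines if ca[line] == 1 and cb[line] == 1]

    tails = []    # smallest tail b-position of each pile
    pile_of = []  # pile index each element of seq landed on
    for line in seq:
        k = bisect_left(tails, pos_b[line])
        if k == len(tails):
            tails.append(pos_b[line])
        else:
            tails[k] = pos_b[line]
        pile_of.append(k)

    # Scan backward, taking the most recent element of each pile.
    anchors, want = [], len(tails) - 1
    for line, k in zip(reversed(seq), reversed(pile_of)):
        if k == want:
            anchors.append(line)
            want -= 1
    return anchors[::-1]

a = ["void f()", "{", "x = 1;", "}", "tail"]
b = ["head", "void f()", "{", "y = 2;", "}", "tail"]
print(patience_anchors(a, b))  # the unique braces and lines anchor the diff
```

The real patience diff then recurses between consecutive anchors with an ordinary line differ.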

histogram

word diff

I didn’t find information on git diff --word-diff, but GNU wdiff states that “it works by creating two temporary files, one word per line, and then executes diff on these files.”[3]

With this idea, I can create a character-based diff by splitting a file into one character per line. Previously, a character-based diff could only be achieved through Levenshtein distance.
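A sketch of that trick, with Python’s difflib standing in for diff: each character becomes its own “line”, and an ordinary line-based differ does the rest.

```python
import difflib

def char_diff(a, b):
    """Character-level diff via the wdiff trick: treat each character
    as its own 'line', then run a line-based differ over those lines."""
    return list(difflib.ndiff(list(a), list(b)))

for line in char_diff("1001", "1100"):
    print(line)  # '+ 1', then three unchanged characters, then '- 1'
```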

References

  1. Machine Learning Glossary. [2025-03-04].
  2. Lup Peng. When to Use Each of the Git Diff Algorithms. 2020-10-10 [2025-03-03].
  3. GNU Wdiff. [2025-01-30].