differencing algorithms

2025年1月30日

A differencing algorithm is to find differences between two sequences or strings. There are a number of challenges in such an algorithm:

  • find the minimal number of differences
  • present differences in a logical way

Each difference can be refered as an edit operation. The kinds of edit operations include addition, deletion, substitution, transposition, so on and so forth. The operand of an edit can be a character, a word, a line, or any larger unit.

A differencing algorithm is not necessarily a diff algorithm especially if you recognize diff as the tool GNU diff or git diff.

Basic Representation

Given source string s1 and target string s2, our goal is to map s1 to s2. If we place s1 horizontally and s2 vertically and form a matrix, our goal is to traverse from top left corner to buttom right corner such that moving right means add a character from target, and moving down means delete a character from source.

Let’s say the two input strings are of the same length \(n\), the size of the matrix is \(n^2\), the top left corner has coordinate (0,0), the bottom right corner (n,n).

Each movement can be considered as an edge from one vertex to another, hence the matrix becomes a directed graph. Edges can have non-negative weights. Edges with weight 0 make sense because if two strings are identical, no edit operation should be generated.

Based on this graph representation, we can use the well-known Dijkstra’s algorithm to find shortest path from the starting point (0,0) to the ending point (n,n), which yields the time complexity \(O(n^2)\).

Levenshtein Distance

Although Levenshtein Distance aims to finding the minimal distance between two sequences, it actually derives an edit script (ie., a list of edit operation).

Levenshtein Distance is formulated as mathematical equations. If it’s implemented as it, its time complexity is high due to duplicated compulation. However, if it’s implemented by the best dynamic programming approach, its time complexity reaches the theoratical lower bound \(O(n^2)\).

Non-minimal algorithms

If we do not aim to produce a minimal edit script, it is possible to devise an algorithm with time complexity lower than \(O(n^2)\).

Yusuf Sulistyo Nugroho; . How different are different diff algorithms in Git?. Empirical Software Engineering . 2019, (): [2025-01-30].

When to Use Each of the Git Diff Algorithms

myers

minimal

patience

histogram

word diff

I didn’t find information on git diff --word-diff, but GNU wdiff states that “it works by creating two temporary files, one word per line, and then executes diff on these files.”[1]

With this idea, I can create charachter-based diff by splitting a file one character per line. Originally characher-based diff is only achieved from Levenshtein distance.

References

  1. . GNU Wdiff. . [2025-01-30].