differencing algorithms
2025年1月30日A differencing algorithm is to find differences between two sequences or strings. There are a number of challenges in such an algorithm:
- find the minimal number of differences
- present differences in a logical way
Each difference can be refered as an edit operation. The kinds of edit operations include addition, deletion, substitution, transposition, so on and so forth. The operand of an edit can be a character, a word, a line, or any larger unit.
A differencing algorithm is not necessarily a diff algorithm especially if you recognize diff as the tool GNU diff or git diff.
Basic Representation
Given source string s1 and target string s2, our goal is to map s1 to s2. If we place s1 horizontally and s2 vertically and form a matrix, our goal is to traverse from top left corner to buttom right corner such that moving right means add a character from target, and moving down means delete a character from source.
Let’s say the two input strings are of the same length \(n\), the size of the matrix is \(n^2\), the top left corner has coordinate (0,0), the bottom right corner (n,n).
Each movement can be considered as an edge from one vertex to another, hence the matrix becomes a directed graph. Edges can have non-negative weights. Edges with weight 0 make sense because if two strings are identical, no edit operation should be generated.
Based on this graph representation, we can use the well-known Dijkstra’s algorithm to find shortest path from the starting point (0,0) to the ending point (n,n), which yields the time complexity \(O(n^2)\).
Levenshtein Distance
Although Levenshtein Distance aims to finding the minimal distance between two sequences, it actually derives an edit script (ie., a list of edit operation).
Levenshtein Distance is formulated as mathematical equations. If it’s implemented as it, its time complexity is high due to duplicated compulation. However, if it’s implemented by the best dynamic programming approach, its time complexity reaches the theoratical lower bound \(O(n^2)\).
Non-minimal algorithms
If we do not aim to produce a minimal edit script, it is possible to devise an algorithm with time complexity lower than \(O(n^2)\).
Yusuf Sulistyo Nugroho; . How different are different diff algorithms in Git?. Empirical Software Engineering . 2019, (): [2025-01-30].myers
minimal
patience
histogram
word diff
I didn’t find information on git diff --word-diff
, but GNU wdiff states that “it works by creating two temporary files, one word per line, and then executes diff on these files.”[1]
With this idea, I can create charachter-based diff by splitting a file one character per line. Originally characher-based diff is only achieved from Levenshtein distance.