Edit Distance / Levenshtein Distance

Edit Distance / Levenshtein Distance

Tag: Leetcode Hard

Edit Distance

Edit distance between two words is the minimum number of operations , i.e character insertions, deletions or substitutions, to transform one word to another word. This operation has cost associated with it.

Levenshtein Distance

Levenshtein Distance is edit distance with cost = 1 for the operation , it comes from the family of distance metrics.

Family of Distance Metrics

image.png

Definition

Levenshtein distance - Wikipedia

image.png

Example

  • Edit Distance between EXECUTE and EXPIRE is 4.
  • EXECUTE → EXPCUTE → EXPIUTE → EXPRTE→ EXPIRE
  • Better representation using "gap representation"
    • E X E C U T E
    • E X P I R _ E
    • 0 0 1 1 1 1 0
    • Sum the Cost = 4
    • " empty space _ " to denote insertion or deletion.

Recurrence Relation

ED(s1,s2,i,j) where i - s1[0...m-1] , j - s2[0...n-1]

Insertion: Recur for i and j+1 
Deletion: Recur for i+1 and j 
Substitution: Recur for i+1 and j+1

Algorithm

Brute Force Approach

Brute Force approach is to enumerate all possible matches for edit operations until we find match between two strings. While comparing the characters, when the character match we do not have to perform any operation , so the cost becomes 0. If they do not match, then we check for possibilities of each insertion ,deletion and substitution operation, with cost 1.

Code

    public int minDistance(String word1, String word2) {
        return ED(word1,word2,0,0);
    }

    private int ED(String s1, String s2, int i, int j){
        if(i==s1.length())
            return s2.length()-j; // remaining size difference  

        if(j==s2.length())
            return s1.length()-i; // remaining size difference

        if(s1.charAt(i) == s2.charAt(j))
            return ED(s1,s2,i+1,j+1); //cost = 0
        else
            return Math.min(Math.min(
            ED(s1,s2,i,j+1), //insertion
            ED(s1,s2,i+1,j)), //deletion
            ED(s1,s2,i+1,j+1)) //substitution
            + 1; //cost = 1 for Levenshtein
    }

Visualize Recursion Tree

image.png

Notice :

  • When there is a Prefix Match between two strings, the tree progresses only in one direction. e.g. ED(0,0) and ED(1,1).
  • Otherwise three children (one for each operation) are generated. e.g ED(2,2).
  • Also, notice repeating sub-problems e.g. ED(3,3).

During recursion, we are trying to breakdown our original problem into sub-problems. We keep recursing until we hit empty string. This hints to be a dynamic programming problem.

At a high level, dynamic programming approaches includes below, and we will walk through them in the upcoming blogs for this series.

Let us first find out the complexity of the code above.

Complexity Analysis

  • Time complexity:

    • at least : O(3 min(m,n) )
    • at worst case: O(3n) , occurs when m=n
  • Space complexity:

    • O(n)

Applications of Edit Distance

  • diff (Unix)
  • stemming (NLP)
  • spelling correction
  • DNA sequence

UpNext

The complexity for the brute force approach came up to being exponential. How did we come up with the analysis? Stay tune for the next blog to understand the complexity analysis over a recursion tree.

Below is a Map of things we have written so far. Also, we would like to hang with people who are doing these algorithms. Join us at Discord or follow us over Twitter/Instagram.

image.png