Knuth-Morris-Pratt (KMP)
SummaryCovers the failure function (partial match table) that...
Covers the failure function (partial match table) that...
Covers the failure function (partial match table) that enables O(n+m) pattern matching by never re-scanning matched characters, with detailed construction algorithm and matching procedure.
Knuth-Morris-Pratt (KMP)
You’re searching for a pattern inside a massive text. Every character you re-examine is wasted work. KMP eliminates that waste entirely — it scans the text exactly once, achieving O(n + m) time where n is the text length and m is the pattern length. Let’s see how.
The Problem
Given text T of length n and pattern P of length m, find all positions where P occurs in T.
The Naive Approach
The most straightforward strategy: try every starting position in the text.
List<Integer> naiveSearch(String text, String pattern) {
var matches = new ArrayList<Integer>();
int n = text.length(), m = pattern.length();
for (int i = 0; i <= n - m; i++) {
int j = 0;
while (j < m && text.charAt(i + j) == pattern.charAt(j)) {
j++;
}
if (j == m) {
matches.add(i);
}
}
return matches;
}
For each of the n - m + 1 starting positions, the inner loop compares up to m characters. Worst case: O(n × m).
When does this worst case hit? Consider searching for "AAAB" in "AAAAAAAAAA...". At every position, the algorithm matches three As before failing on B, then shifts by one and matches three As again. The same characters get examined over and over.
The root cause: the text pointer backtracks. After a mismatch, it rewinds to position i + 1 and starts fresh, discarding all the information from the characters it already matched.
The KMP Insight: Never Backtrack in the Text
Here’s the key observation that changes everything: when you match several characters of the pattern and then encounter a mismatch, the characters you already matched are known. They’re part of the pattern itself. You don’t need to re-read them from the text — you already know what they are.
The question becomes: given that you matched pattern[0..j-1] and failed at pattern[j], how far can you shift the pattern without missing a valid match?
The answer lives inside the pattern’s own structure. If some prefix of the pattern matches a suffix of the matched portion, you can align the pattern to that position and continue — without moving the text pointer backward.
The Failure Function (LPS Array)
KMP precomputes an array called the failure function, partial match table, or LPS array (Longest Proper Prefix which is also a Suffix).
Definition: lps[i] = the length of the longest proper prefix of pattern[0..i] that is also a suffix of pattern[0..i].
“Proper” means the prefix cannot equal the entire substring — lps[i] is always less than i + 1.
Example: Pattern "ABACABAD"
Walk through each position:
| i | pattern[0..i] | Longest prefix = suffix | lps[i] |
|---|---|---|---|
| 0 | A | (none — single char) | 0 |
| 1 | AB | (no match) | 0 |
| 2 | ABA | A = A | 1 |
| 3 | ABAC | (no match) | 0 |
| 4 | ABACA | A = A | 1 |
| 5 | ABACAB | AB = AB | 2 |
| 6 | ABACABA | ABA = ABA | 3 |
| 7 | ABACABAD | (no match) | 0 |
Result: lps = [0, 0, 1, 0, 1, 2, 3, 0]
Building the LPS Array
The construction algorithm uses two variables: i scans through the pattern starting at position 1, and len tracks the length of the current longest prefix-suffix.
int[] buildLPS(String pattern) {
int m = pattern.length();
int[] lps = new int[m];
lps[0] = 0; // LPS of single character is always 0
int len = 0; // length of the previous longest prefix-suffix
int i = 1;
while (i < m) {
if (pattern.charAt(i) == pattern.charAt(len)) {
// Extending the current prefix-suffix match
len++;
lps[i] = len;
i++;
} else {
if (len != 0) {
// Fall back: try the next shorter prefix-suffix
// Don't increment i — re-examine this position
len = lps[len - 1];
} else {
// No prefix-suffix exists for pattern[0..i]
lps[i] = 0;
i++;
}
}
}
return lps;
}
The critical insight is the fallback step: len = lps[len - 1]. When a mismatch occurs, instead of restarting from zero, the algorithm jumps to the next candidate prefix-suffix — the same “don’t throw away matched information” principle that drives the overall KMP design.
Visual Walkthrough: Building LPS for "ABABAC"
Pattern: A B A B A C
Index: 0 1 2 3 4 5
i=1, len=0: B ≠ A → lps[1] = 0
i=2, len=0: A == A → len=1, lps[2] = 1
i=3, len=1: B == B → len=2, lps[3] = 2
i=4, len=2: A == A → len=3, lps[4] = 3
i=5, len=3: C ≠ B → fallback: len = lps[2] = 1
len=1: C ≠ B → fallback: len = lps[0] = 0
len=0: C ≠ A → lps[5] = 0
Result: lps = [0, 0, 1, 2, 3, 0]
Notice how position 5 required two fallbacks before settling. The algorithm explored len = 3 → 1 → 0 — each step using previously computed LPS values to find shorter candidate prefix-suffixes.
The KMP Matching Algorithm
With the LPS array in hand, the matching algorithm becomes elegant:
List<Integer> kmpSearch(String text, String pattern) {
var matches = new ArrayList<Integer>();
int n = text.length();
int m = pattern.length();
if (m == 0) return matches;
int[] lps = buildLPS(pattern);
int i = 0; // index in text — never goes backward
int j = 0; // index in pattern
while (i < n) {
if (text.charAt(i) == pattern.charAt(j)) {
i++;
j++;
}
if (j == m) {
// Full match found at position i - m
matches.add(i - m);
// Continue searching: shift pattern using LPS
j = lps[j - 1];
} else if (i < n && text.charAt(i) != pattern.charAt(j)) {
if (j != 0) {
// Mismatch after partial match: use LPS to shift pattern
j = lps[j - 1];
} else {
// Mismatch at first character: advance text pointer
i++;
}
}
}
return matches;
}
Two pointers, two rules:
- Match: advance both
iandj. - Mismatch at j > 0: set
j = lps[j - 1]— shift the pattern to the next valid alignment without touchingi. - Mismatch at j == 0: advance
i— current text character doesn’t match pattern start.
Step-by-Step Walkthrough
Match pattern "ABAB" in text "ABABCABABD":
LPS for "ABAB": [0, 0, 1, 2]
Step 1: i=0, j=0 text[0]='A' == pattern[0]='A' → i=1, j=1
Step 2: i=1, j=1 text[1]='B' == pattern[1]='B' → i=2, j=2
Step 3: i=2, j=2 text[2]='A' == pattern[2]='A' → i=3, j=3
Step 4: i=3, j=3 text[3]='B' == pattern[3]='B' → i=4, j=4
→ j == m: MATCH at position 0! Set j = lps[3] = 2
Step 5: i=4, j=2 text[4]='C' ≠ pattern[2]='A' → j = lps[1] = 0
Step 6: i=4, j=0 text[4]='C' ≠ pattern[0]='A' → i=5
Step 7: i=5, j=0 text[5]='A' == pattern[0]='A' → i=6, j=1
Step 8: i=6, j=1 text[6]='B' == pattern[1]='B' → i=7, j=2
Step 9: i=7, j=2 text[7]='A' == pattern[2]='A' → i=8, j=3
Step 10: i=8, j=3 text[8]='B' == pattern[3]='B' → i=9, j=4
→ j == m: MATCH at position 5! Set j = lps[3] = 2
Step 11: i=9, j=2 text[9]='D' ≠ pattern[2]='A' → j = lps[1] = 0
Step 12: i=9, j=0 text[9]='D' ≠ pattern[0]='A' → i=10
→ i == n: done
Result: matches at positions [0, 5]
Notice that i never decreases. After the first match, j jumps to lps[3] = 2 — the algorithm knows that "AB" at the end of the match is also a prefix of the pattern, so it picks up from there.
Why KMP Runs in O(n + m)
LPS construction: O(m). The pointer i moves forward at most m times. The pointer len decreases via lps[len - 1] at most as many times as it increased (it can’t go below 0). Total pointer movements: at most 2m. So the LPS construction is O(m).
Matching: O(n). The text pointer i advances with every step except when j falls back via lps[j - 1]. But j can only fall back as many times as it has advanced. Since i makes at most n advances and j makes at most n advances (matching i), the total work is at most 2n. So matching is O(n).
Total: O(n + m) — linear in the combined input size.
Applications
Finding All Occurrences
The kmpSearch method above already returns all match positions. Unlike String.indexOf() which finds only the first occurrence, KMP finds every one in a single pass.
Repeated Substring Pattern
Problem: determine if a string s consists entirely of a repeated pattern (e.g., "abcabc" → "abc" repeated twice).
KMP solves this in O(n) using a beautiful property of the LPS array:
boolean repeatedSubstringPattern(String s) {
int n = s.length();
int[] lps = buildLPS(s);
int longestPrefixSuffix = lps[n - 1];
// If the longest prefix-suffix length > 0 and
// n is divisible by (n - longestPrefixSuffix),
// then the string is a repeated pattern
int patternLength = n - longestPrefixSuffix;
return longestPrefixSuffix > 0 && n % patternLength == 0;
}
The logic: if lps[n - 1] = k, then s[0..k-1] == s[n-k..n-1]. The repeating unit has length n - k. If n divides evenly by n - k, the entire string is built from repeating that unit.
Counting Pattern Occurrences
Need the count rather than positions? Modify kmpSearch to increment a counter instead of adding to a list — same O(n + m) performance.
KMP vs Other String Algorithms
| Algorithm | Time (avg) | Time (worst) | Preprocessing | Best For |
|---|---|---|---|---|
| Naive | O(n×m) | O(n×m) | None | Short patterns, quick prototyping |
| KMP | O(n+m) | O(n+m) | O(m) | Guaranteed linear time, interviews |
| Rabin-Karp | O(n+m) | O(n×m) | O(m) | Multiple pattern search |
| Boyer-Moore | O(n/m) | O(n×m) | O(m+σ) | Practical single-pattern search |
| Z-Algorithm | O(n+m) | O(n+m) | O(n+m) | Sometimes simpler implementation |
Rabin-Karp uses rolling hash values to compare pattern and text windows. Average O(n + m) but degrades to O(n × m) with hash collisions. Its real strength is searching for multiple patterns simultaneously.
Boyer-Moore scans the pattern right-to-left and uses two heuristics (bad character, good suffix) to skip large sections of text. In practice, it’s often faster than KMP for natural-language text because it can skip characters entirely. But the implementation is complex and the worst case is still O(n × m) without the Galil rule.
Z-Algorithm computes a Z-array where Z[i] = length of the longest substring starting at position i that matches a prefix of the string. Conceptually similar to KMP, and some find it easier to implement.
Interviewer Tips
Full KMP implementation is rarely demanded in an interview — the algorithm is long and the failure function construction has subtle edge cases. What interviewers do expect:
- Understanding the concept: “We precompute a table from the pattern so we never backtrack in the text.”
- Knowing the complexity: O(n + m) time, O(m) space for the LPS array.
- Recognizing when to apply it: any problem involving substring search, repeated patterns, or string periodicity.
- The failure function idea: “LPS[i] tells us the longest proper prefix of pattern[0..i] that is also a suffix, so on mismatch we jump to the next valid alignment.”
If asked to code it, start with the LPS construction — it’s the heart of KMP. Get that right and the matching loop follows naturally.