String matching is a fundamental problem in computer science. Given a text T and a pattern P, how can we efficiently find all occurrences of P in T? While naive approaches take O(n * m) time, the Z-algorithm allows us to solve this in O(n + m) time.
The Problem: Matching Two Strings
Imagine we have:
- Pattern (P):
aabcaab - Text (T):
baabaabcaab
We want to find where aabcaab occurs in the text. The Z-algorithm approaches this by constructing a new string S = P + $ + T (where $ is a separator not present in P or T) and computing the Z-array for S.
What is the Z-Array?
For a string S of length n, the Z-array is an array of length n where Z[i] is the length of the longest substring starting at S[i] which is also a prefix of S.
Example for S = aabcaab:
|
|
Z[0]is usually defined as 0 orn.Z[4] = 3because “aab” at index 4 matches the prefix “aab”.
Why the Z-Algorithm is Fast: O(m + n)
The naive computation of the Z-array still takes O(n^2). The Z-algorithm achieves O(n) by maintaining a Z-box [L, R], which is the rightmost interval [L, R] such that S[L..R] is a prefix of S.
By reusing previously computed Z values, we avoid redundant character comparisons. How?
The Mirroring Strategy: The Heart of the Algorithm
The most critical part of the algorithm is when we compute Z[i] and i falls within the current Z-box [L, R] (L <= i <= R).
The Setup
|
|
Since S[L..R] matches S[0..R-L], any substring starting at i inside the Z-box has a “mirror” image starting at k = i - L in the prefix.
Case 1: Z[k] is contained within the Z-box (Z[k] < R - i + 1)
If the match at the mirror position k doesn’t reach the boundary R, we know Z[i] must be exactly Z[k]. No new comparisons are needed!
|
|
Case 2: Z[k] reaches or exceeds the boundary (Z[k] >= R - i + 1)
If the match at k goes beyond R, we only know that S[i..R] matches the prefix. We must manually check characters starting from R+1 to see if the match extends further.
|
|
Once we find the new boundary, we update [L, R] to [i, new R].
Python Implementation
Here is a concise Python implementation of the Z-algorithm:
|
|
How to Find Matches: P in T
To find all occurrences of a pattern P in a text T, we use the following steps:
- Construct a combined string: Create a new string by joining the pattern and the text with a unique separator that appears in neither:
Combined = P + "#" + T(here we use#as the separator). - Compute the Z-array: Run the Z-algorithm on this combined string.
- Identify matches: Iterate through the Z-array values corresponding to the text portion (indices after the
#separator). - Check for pattern length: If any
Z[i]is equal to the length ofP, it means a match of the patternPstarts at that position in the text.
The separator is crucial because it ensures that the “match” does not extend beyond the length of the pattern itself, preventing false positives that might span across the boundary between P and T.
Conclusion
The Z-algorithm’s mirroring strategy is a brilliant example of using space (the Z-array) to save time. By keeping track of the furthest reach (R), we ensure that each character is effectively compared only a constant number of times.
For a more detailed breakdown and implementation details, check out the following links:
- CP-Algorithms: https://cp-algorithms.com/string/z-function.html
- GeeksforGeeks: https://www.geeksforgeeks.org/z-algorithm-linear-time-pattern-searching-algorithm/
Last modified on 2026-05-22