Boyer-Moore String Search: Efficient Substring Search Algorithm

Jul 7, 2024

The Boyer-Moore String Search algorithm is an efficient approach to searching for substrings within larger strings. Utilizing a heuristic-driven skipping mechanism, it reduces the number of comparisons made when finding a substring in a text, enhancing performance especially for long strings and extensive text searches.

On this page

The Boyer-Moore String Search algorithm is one of the most powerful and widely used algorithms for searching a substring within a string. Developed by Robert S. Boyer and J Strother Moore in 1977, this algorithm introduces a more efficient approach than the naive method by skipping sections of the text, rather than examining every single character sequentially.

Understanding Boyer-Moore String Search

Key Concepts

The Boyer-Moore algorithm relies on two preprocessing functions to improve efficiency:

Bad Character Rule: When a mismatch occurs, this rule determines the next position to align by comparing portions of the text to a precomputed table of occurrences for each character.
Good Suffix Rule: If a mismatch occurs, this rule suggests the next alignment position based on the portion of the substring already matched. It shifts the pattern in a manner that maximizes the subsequent comparison window, skipping over sections where matches previously occurred.

Advantages

Efficiency: Boyer-Moore is generally more efficient because it skips over portions of the text based on mismatch information.
Best for Large Alphabets: This algorithm particularly shines when dealing with large alphabets and long substrings, as it significantly reduces the number of comparisons needed.

Computational Complexity

Preprocessing time: O(m + |Σ|), where m is the length of the pattern and |Σ| is the size of the alphabet.
Search time: O(n/m) in the best case and O(nm) in the worst case, where n is the length of the text.

Clojure Implementation

Let’s implement the Boyer-Moore algorithm in Clojure:

 1(ns boyer-moore
 2  (:require [clojure.string :as str]))
 3
 4(defn bad-character-table [pattern]
 5  (into {}
 6        (map-indexed (fn [i c] [c (- (count pattern) i 1)])
 7                     pattern)))
 8
 9(defn boyer-moore-search [text pattern]
10  (let [m (count pattern)
11        n (count text)
12        bad-char-table (bad-character-table pattern)
13        idx (loop [i (- m 1)
14                   j (- m 1)]
15              (if (< j 0)
16                (- i j)
17                (if (= (nth pattern j) (nth text i))
18                  (recur (dec i) (dec j))
19                  (recur (+ i (max 1 (- (get bad-char-table (nth text i) -1) (-(count pattern) 1))))
20                         (- m 1)))))]
21    (if (>= (count text) idx (+ idx m))
22      (if (str/includes? (subs text idx (+ idx m)) pattern)
23        idx -1)
24      -1)))
25
26;; Example usage
27(boyer-moore-search "HERE IS A SIMPLE EXAMPLE" "EXAMPLE") ;; Output: 17

Explanation

Bad Character Table: We generate a table that contains the last occurrence of each character in the pattern.
Search Function: The function iteratively examines characters from rear to front, shifting based on the bad character table when mismatches occur.

Mermaid Diagram

    sequenceDiagram
	  participant Text
	  participant Pattern
	
	  Text->>Pattern: Compare last character
	  activate Pattern
	
	  alt Match
	    Pattern->>Text: Move to the left
	    deactivate Pattern
	    activate Pattern
	  else Mismatch
	    Pattern->xText: Apply bad character rule
	    deactivate Pattern
	  end
	
	  Note over Pattern: Repeat until matched or no more positions

Diagram Explanation

The sequence diagram above illustrates the core interaction of the Boyer-Moore algorithm between the text and pattern as it slides the pattern over the text, utilizing the Bad Character Rule to skip sections of the text.

Knuth-Morris-Pratt (KMP) Algorithm: Another popular string search algorithm that preprocesses the pattern to determine where mismatches occur, offering linear time complexity.
Rabin-Karp Algorithm: Utilizes hashing to find any one of a set of pattern strings in a text.

Additional Resources

[Gusfield, D. (1997). “Algorithms on Strings, Trees, and Sequences”. Cambridge University Press.]
[Cormen, T. H., et al. (2022). “Introduction to Algorithms”. MIT Press.]

Summary

The Boyer-Moore String Search algorithm presents an efficient means of finding substrings by leveraging heuristics for skipping non-productive comparisons. With its ability to handle large search spaces and alphabets efficiently, it stands as a cornerstone in performance-critical string processing applications.

Understanding the Boyer-Moore algorithm equips developers with sophisticated strategies for text searching and is integral for developing efficient text analysis tools, enhancing the performance of applications that require rapid search capabilities.

Bloom Filter

Branch and Bound

Browse Performance and Optimization Patterns