跳转至

11 Tries

The trie, pronounced “try”, is a tree that specializes in storing data that can be represented as a collection, such as English words:

rootCAT.T.E.UTO.A.A trie containing the words CAT, CUT, CUTE, TO and A

Each string character maps to a node where the last node is terminating. These are marked in the diagram above with a dot. The benefits of a trie are best illustrated by looking at it in the context of prefix matching.

In this chapter, you’ll first compare the performance of a trie to a list. Then you’ll implement the trie from scratch!

List vs. Trie

You’re given a collection of strings. How would you build a component that handles prefix matching? Here’s one way:

class EnglishDictionary {
  final List<String> words = [];

  List<String> lookup(String prefix) {
    return words.where((word) {
      return word.startsWith(prefix);
    }).toList();
  }
}

lookup will go through the collection of strings and return those that match the prefix.

This algorithm is reasonable if the number of elements in the words list is small. But if you’re dealing with more than a few thousand words, the time it takes to go through the words list will be unacceptable. The time complexity of lookup is O(k × n), where k is the longest string in the collection, and n is the number of words you need to check.

Imagine the number of words Google needs to parseImagine the number of words Google needs to parse

The trie data structure has excellent performance characteristics for this problem. Since it’s a tree with nodes that support multiple children, each node can represent a single character.

You form a word by tracing the collection of characters from the root to a node with a special indicator — a terminator — represented by a black dot. An interesting characteristic of the trie is that multiple words can share the same characters.

To illustrate the performance benefits of the trie, consider the following example in which you need to find the words with the prefix CU. First, you travel to the node containing C. That quickly excludes other branches of the trie from the search operation:

rootCAT.T.E.UTO.A.

Next, you need to find the words that have the next letter, U. You traverse to the U node:

rootCAT.T.E.UTO.A.

Since that’s the end of your prefix, the trie would return all collections formed by the chain of nodes from the U node. In this case, the words CUT and CUTE would be returned.

Imagine if this trie contained hundreds of thousands of words. The number of comparisons you can avoid by employing a trie is substantial.

rootCADGO.ART.T.T.E.UTO.S.D.D.A.ZOO.

Implementation

As always, open up the starter project for this chapter.

TrieNode

You’ll begin by creating the node for the trie. Create a lib folder in the root of your project and add a file to it named trie_node.dart. Add the following to the file:

class TrieNode<T> {
  TrieNode({this.key, this.parent});

  // 1
  T? key;

  // 2
  TrieNode<T>? parent;

  // 3
  Map<T, TrieNode<T>?> children = {};

  // 4
  bool isTerminating = false;
}

This interface is slightly different compared to the other nodes you’ve encountered:

  1. key holds the data for the node. This is nullable because the root node of the trie has no key. The reason it’s called a key is because you use it in a map of key-value pairs to store children nodes.
  2. TrieNode holds a reference to its parent. This reference simplifies the remove method later on.
  3. In binary search trees, nodes have a left and right child. In a trie, a node needs to hold multiple different elements. The children map accomplishes that.
  4. isTerminating acts as a marker for the end of a collection.

Note

A parent TrieNode holds a reference to its children and the children hold a reference to the parent. You might wonder if this creates a circular reference problem where the memory is never released. Languages like Swift that use reference counting for memory management need to be especially careful about this. Dart, on the other hand, frees up the memory from old unused objects with a garbage collector, which is able to handle the parent-children circular references in the code above. Garbage collection works not by counting references to individual objects but by checking if objects are reachable from certain root objects.

Trie

Next, you’ll create the trie itself, which will manage the nodes. Since strings are one of the most common uses for tries, this chapter will walk you through building a String-based trie. In Challenge 2 at the end of the chapter, you’ll create a generic trie that can handle any iterable collection.

In the lib folder, create a new file named string_trie.dart. Add the following to the file:

import 'trie_node.dart';

class StringTrie {
  TrieNode<int> root = TrieNode(key: null, parent: null);
}

Here are a couple of points to note:

  • In Dart a String is a collection of UTF-16 code units, so that’s why the type for TrieNode is int rather than String.
  • The key and parent of a trie’s root node are always null.

Next, you’ll implement four operations for the trie: insert, contains, remove and matchPrefix.

Insert

Tries work on collections internally, so you’ll need to take whatever string is inserted and convert its code units into TrieNode keys.

Add the following method to StringTrie:

void insert(String text) {
  // 1
  var current = root;

  // 2
  for (var codeUnit in text.codeUnits) {
    current.children[codeUnit] ??= TrieNode(
      key: codeUnit,
      parent: current,
    );
    current = current.children[codeUnit]!;
  }

  // 3
  current.isTerminating = true;
}

Here’s what’s going on:

  1. current keeps track of your traversal progress, which starts with the root node.
  2. The trie stores each code unit in a separate node. You first check if the node exists in the children map. If it doesn’t, you create a new node. During each loop, you move current to the next node.
  3. After the for loop completes, current is referencing the node at the end of the collection, that is, the last code unit in the string. You mark that node as the terminating node.

The time complexity for this algorithm is O(k), where k is the number of code units you’re trying to insert. This cost is because you need to traverse through or create a new node for each code unit.

Contains

contains is very similar to insert. Add the following method to StringTrie:

bool contains(String text) {
  var current = root;
  for (var codeUnit in text.codeUnits) {
    final child = current.children[codeUnit];
    if (child == null) {
      return false;
    }
    current = child;
  }
  return current.isTerminating;
}

You check every code unit to see if it’s in the tree. When you reach the last one, it must be terminating. If not, the collection wasn’t added, and what you’ve found is a subset of a larger collection.

Like insert, the time complexity of contains is O(k), where k is the length of the string that you’re using for the search. This time complexity comes from traversing through k nodes to determine whether the code unit collection is in the trie.

To test out insert and contains, head over to bin/starter.dart and replace the contents of the file with the following code:

import 'package:starter/string_trie.dart';

void main() {
  final trie = StringTrie();
  trie.insert("cute");
  if (trie.contains("cute")) {
    print("cute is in the trie");
  }
}

Run that and you should see the following console output:

cute is in the trie

Remove

Removing a node from the trie is a bit more tricky. You need to be particularly careful since multiple collections can share nodes.

Go back to lib/string_trie.dart and write the following method just below contains:

void remove(String text) {
  // 1
  var current = root;
  for (final codeUnit in text.codeUnits) {
    final child = current.children[codeUnit];
    if (child == null) {
      return;
    }
    current = child;
  }
  if (!current.isTerminating) {
    return;
  }
  // 2
  current.isTerminating = false;
  // 3
  while (current.parent != null &&
        current.children.isEmpty &&
        !current.isTerminating) {

    current.parent!.children[current.key!] = null;
    current = current.parent!;
  }
}

Taking it comment-by-comment:

  1. You check if the code unit collection that you want to remove is part of the trie and point current to the last node of the collection. If you don’t find your search string or the final node isn’t marked as terminating, that means the collection isn’t in the trie and you can abort.
  2. You set isTerminating to false so the current node can be removed by the loop in the next step.
  3. This is the tricky part. Since nodes can be shared, you don’t want to remove code units that belong to another collection. If there are no other children in the current node, it means that other collections don’t depend on the current node. You also check to see if the current node is terminating. If it is, then it belongs to another collection. As long as current satisfies these conditions, you continually backtrack through the parent property and remove the nodes.

The time complexity of this algorithm is O(k), where k represents the number of code units in the string that you’re trying to remove.

Head back to bin/starter.dart and replace the contents of main with the following:

final trie = StringTrie();
trie.insert('cut');
trie.insert('cute');

assert(trie.contains('cut'));
print('"cut" is in the trie');
assert(trie.contains('cute'));
print('"cute" is in the trie');

print('\n--- Removing "cut" ---');
trie.remove('cut');
assert(!trie.contains('cut'));
assert(trie.contains('cute'));
print('"cute" is still in the trie');

Run that and you should see the following output added to the console:

"cut" is in the trie
"cute" is in the trie

--- Removing "cut" ---
"cute" is still in the trie

Prefix Matching

The most iconic algorithm for a trie is the prefix-matching algorithm. Write the following at the bottom of StringTrie:

List<String> matchPrefix(String prefix) {
  // 1
  var current = root;
  for (final codeUnit in prefix.codeUnits) {
    final child = current.children[codeUnit];
    if (child == null) {
      return [];
    }
    current = child;
  }

  // 2 (to be implemented shortly)
  return _moreMatches(prefix, current);
}
  1. You start by verifying that the trie contains the prefix. If not, you return an empty list.
  2. After you’ve found the node that marks the end of the prefix, you call a recursive helper method named _moreMatches to find all the sequences after the current node.

Next, add the code for the helper method after the matchPrefix method:

List<String> _moreMatches(String prefix, TrieNode<int> node) {
  // 1
  List<String> results = [];
  if (node.isTerminating) {
    results.add(prefix);
  }
  // 2
  for (final child in node.children.values) {
    final codeUnit = child!.key!;
    results.addAll(
      _moreMatches(
        '$prefix${String.fromCharCode(codeUnit)}',
        child,
      ),
    );
  }
  return results;
}
  1. You create a list to hold the results. If the current node is a terminating one, you add what you’ve got to the results.
  2. Next, you need to check the current node’s children. For every child node, you recursively call _moreMatches to seek out other terminating nodes.

matchPrefix has a time complexity of O(k × m), where k represents the longest collection matching the prefix and m represents the number of collections that match the prefix. Recall that lists have a time complexity of O(k × n), where n is the number of elements in the entire collection. For large sets of data in which each collection is uniformly distributed, tries have far better performance than using lists for prefix matching.

Time to take the method for a spin. Navigate back to main and run the following:

final trie = StringTrie();
trie.insert('car');
trie.insert('card');
trie.insert('care');
trie.insert('cared');
trie.insert('cars');
trie.insert('carbs');
trie.insert('carapace');
trie.insert('cargo');

print('Collections starting with "car"');
final prefixedWithCar = trie.matchPrefix('car');
print(prefixedWithCar);

print('\nCollections starting with "care"');
final prefixedWithCare = trie.matchPrefix('care');
print(prefixedWithCare);

You should see the output below in the console:

Collections starting with "car"
[car, card, care, cared, cars, carbs, carapace, cargo]

Collections starting with "care"
[care, cared]

Challenges

How was this chapter for you? Are you ready to take it a bit further? The following challenges will ask you to add functionality to and generalize what you’ve already accomplished. Check out the Challenge Solutions section or the supplemental materials that come with the book if you need any help.

Challenge 1: Additional Properties

The current implementation of StringTrie is missing some notable operations. Your task for this challenge is to augment the current implementation of the trie by adding the following:

  1. An allStrings property that returns all the collections in the trie.
  2. A count property that tells you how many strings are currently in the trie.
  3. An isEmpty property that returns true if the trie is empty, false otherwise.

Challenge 2: Generic Trie

The trie data structure can be used beyond strings. Make a new class named Trie that handles any iterable collection. Implement the insert, containsand remove methods.

Key Points

  • Tries provide great performance metrics for prefix matching.
  • Tries are relatively memory efficient since individual nodes can be shared between many different values. For example, “car,” “carbs,” and “care” can share the first three letters of the word.