11 Tries¶
The trie, pronounced “try”, is a tree that specializes in storing data that can be represented as a collection, such as English words:
rootCAT.T.E.UTO.A.A trie containing the words CAT, CUT, CUTE, TO and A
Each string character maps to a node where the last node is terminating. These are marked in the diagram above with a dot. The benefits of a trie are best illustrated by looking at it in the context of prefix matching.
In this chapter, you’ll first compare the performance of a trie to a list. Then you’ll implement the trie from scratch!
List vs. Trie¶
You’re given a collection of strings. How would you build a component that handles prefix matching? Here’s one way:
class EnglishDictionary {
final List<String> words = [];
List<String> lookup(String prefix) {
return words.where((word) {
return word.startsWith(prefix);
}).toList();
}
}
lookup
will go through the collection of strings and return those that match the prefix.
This algorithm is reasonable if the number of elements in the words
list is small. But if you’re dealing with more than a few thousand words, the time it takes to go through the words
list will be unacceptable. The time complexity of lookup
is O(k × n), where k is the longest string in the collection, and n is the number of words you need to check.
Imagine the number of words Google needs to parse
The trie data structure has excellent performance characteristics for this problem. Since it’s a tree with nodes that support multiple children, each node can represent a single character.
You form a word by tracing the collection of characters from the root to a node with a special indicator — a terminator — represented by a black dot. An interesting characteristic of the trie is that multiple words can share the same characters.
To illustrate the performance benefits of the trie, consider the following example in which you need to find the words with the prefix CU
. First, you travel to the node containing C
. That quickly excludes other branches of the trie from the search operation:
rootCAT.T.E.UTO.A.
Next, you need to find the words that have the next letter, U
. You traverse to the U
node:
rootCAT.T.E.UTO.A.
Since that’s the end of your prefix, the trie would return all collections formed by the chain of nodes from the U
node. In this case, the words CUT
and CUTE
would be returned.
Imagine if this trie contained hundreds of thousands of words. The number of comparisons you can avoid by employing a trie is substantial.
rootCADGO.ART.T.T.E.UTO.S.D.D.A.ZOO.
Implementation¶
As always, open up the starter project for this chapter.
TrieNode¶
You’ll begin by creating the node for the trie. Create a lib folder in the root of your project and add a file to it named trie_node.dart. Add the following to the file:
class TrieNode<T> {
TrieNode({this.key, this.parent});
// 1
T? key;
// 2
TrieNode<T>? parent;
// 3
Map<T, TrieNode<T>?> children = {};
// 4
bool isTerminating = false;
}
This interface is slightly different compared to the other nodes you’ve encountered:
key
holds the data for the node. This is nullable because the root node of the trie has no key. The reason it’s called a key is because you use it in a map of key-value pairs to store children nodes.TrieNode
holds a reference to its parent. This reference simplifies theremove
method later on.- In binary search trees, nodes have a left and right child. In a trie, a node needs to hold multiple different elements. The
children
map accomplishes that. isTerminating
acts as a marker for the end of a collection.
Note
A parent TrieNode
holds a reference to its children and the children hold a reference to the parent. You might wonder if this creates a circular reference problem where the memory is never released. Languages like Swift that use reference counting for memory management need to be especially careful about this. Dart, on the other hand, frees up the memory from old unused objects with a garbage collector, which is able to handle the parent
-children
circular references in the code above. Garbage collection works not by counting references to individual objects but by checking if objects are reachable from certain root objects.
Trie¶
Next, you’ll create the trie itself, which will manage the nodes. Since strings are one of the most common uses for tries, this chapter will walk you through building a String
-based trie. In Challenge 2 at the end of the chapter, you’ll create a generic trie that can handle any iterable collection.
In the lib folder, create a new file named string_trie.dart. Add the following to the file:
import 'trie_node.dart';
class StringTrie {
TrieNode<int> root = TrieNode(key: null, parent: null);
}
Here are a couple of points to note:
- In Dart a
String
is a collection of UTF-16 code units, so that’s why the type forTrieNode
isint
rather thanString
. - The
key
andparent
of a trie’sroot
node are alwaysnull
.
Next, you’ll implement four operations for the trie: insert
, contains
, remove
and matchPrefix
.
Insert¶
Tries work on collections internally, so you’ll need to take whatever string is inserted and convert its code units into TrieNode
keys.
Add the following method to StringTrie
:
void insert(String text) {
// 1
var current = root;
// 2
for (var codeUnit in text.codeUnits) {
current.children[codeUnit] ??= TrieNode(
key: codeUnit,
parent: current,
);
current = current.children[codeUnit]!;
}
// 3
current.isTerminating = true;
}
Here’s what’s going on:
current
keeps track of your traversal progress, which starts with theroot
node.- The trie stores each code unit in a separate node. You first check if the node exists in the
children
map. If it doesn’t, you create a new node. During each loop, you movecurrent
to the next node. - After the
for
loop completes,current
is referencing the node at the end of the collection, that is, the last code unit in the string. You mark that node as the terminating node.
The time complexity for this algorithm is O(k), where k is the number of code units you’re trying to insert. This cost is because you need to traverse through or create a new node for each code unit.
Contains¶
contains
is very similar to insert
. Add the following method to StringTrie
:
bool contains(String text) {
var current = root;
for (var codeUnit in text.codeUnits) {
final child = current.children[codeUnit];
if (child == null) {
return false;
}
current = child;
}
return current.isTerminating;
}
You check every code unit to see if it’s in the tree. When you reach the last one, it must be terminating. If not, the collection wasn’t added, and what you’ve found is a subset of a larger collection.
Like insert
, the time complexity of contains
is O(k), where k is the length of the string that you’re using for the search. This time complexity comes from traversing through k nodes to determine whether the code unit collection is in the trie.
To test out insert
and contains
, head over to bin/starter.dart and replace the contents of the file with the following code:
import 'package:starter/string_trie.dart';
void main() {
final trie = StringTrie();
trie.insert("cute");
if (trie.contains("cute")) {
print("cute is in the trie");
}
}
Run that and you should see the following console output:
cute is in the trie
Remove¶
Removing a node from the trie is a bit more tricky. You need to be particularly careful since multiple collections can share nodes.
Go back to lib/string_trie.dart and write the following method just below contains
:
void remove(String text) {
// 1
var current = root;
for (final codeUnit in text.codeUnits) {
final child = current.children[codeUnit];
if (child == null) {
return;
}
current = child;
}
if (!current.isTerminating) {
return;
}
// 2
current.isTerminating = false;
// 3
while (current.parent != null &&
current.children.isEmpty &&
!current.isTerminating) {
current.parent!.children[current.key!] = null;
current = current.parent!;
}
}
Taking it comment-by-comment:
- You check if the code unit collection that you want to remove is part of the trie and point
current
to the last node of the collection. If you don’t find your search string or the final node isn’t marked as terminating, that means the collection isn’t in the trie and you can abort. - You set
isTerminating
tofalse
so the current node can be removed by the loop in the next step. - This is the tricky part. Since nodes can be shared, you don’t want to remove code units that belong to another collection. If there are no other children in the current node, it means that other collections don’t depend on the current node. You also check to see if the current node is terminating. If it is, then it belongs to another collection. As long as
current
satisfies these conditions, you continually backtrack through theparent
property and remove the nodes.
The time complexity of this algorithm is O(k), where k represents the number of code units in the string that you’re trying to remove.
Head back to bin/starter.dart and replace the contents of main
with the following:
final trie = StringTrie();
trie.insert('cut');
trie.insert('cute');
assert(trie.contains('cut'));
print('"cut" is in the trie');
assert(trie.contains('cute'));
print('"cute" is in the trie');
print('\n--- Removing "cut" ---');
trie.remove('cut');
assert(!trie.contains('cut'));
assert(trie.contains('cute'));
print('"cute" is still in the trie');
Run that and you should see the following output added to the console:
"cut" is in the trie
"cute" is in the trie
--- Removing "cut" ---
"cute" is still in the trie
Prefix Matching¶
The most iconic algorithm for a trie is the prefix-matching algorithm. Write the following at the bottom of StringTrie
:
List<String> matchPrefix(String prefix) {
// 1
var current = root;
for (final codeUnit in prefix.codeUnits) {
final child = current.children[codeUnit];
if (child == null) {
return [];
}
current = child;
}
// 2 (to be implemented shortly)
return _moreMatches(prefix, current);
}
- You start by verifying that the trie contains the prefix. If not, you return an empty list.
- After you’ve found the node that marks the end of the prefix, you call a recursive helper method named
_moreMatches
to find all the sequences after thecurrent
node.
Next, add the code for the helper method after the matchPrefix
method:
List<String> _moreMatches(String prefix, TrieNode<int> node) {
// 1
List<String> results = [];
if (node.isTerminating) {
results.add(prefix);
}
// 2
for (final child in node.children.values) {
final codeUnit = child!.key!;
results.addAll(
_moreMatches(
'$prefix${String.fromCharCode(codeUnit)}',
child,
),
);
}
return results;
}
- You create a list to hold the results. If the current node is a terminating one, you add what you’ve got to the results.
- Next, you need to check the current node’s children. For every child node, you recursively call
_moreMatches
to seek out other terminating nodes.
matchPrefix
has a time complexity of O(k × m), where k represents the longest collection matching the prefix and m represents the number of collections that match the prefix. Recall that lists have a time complexity of O(k × n), where n is the number of elements in the entire collection. For large sets of data in which each collection is uniformly distributed, tries have far better performance than using lists for prefix matching.
Time to take the method for a spin. Navigate back to main
and run the following:
final trie = StringTrie();
trie.insert('car');
trie.insert('card');
trie.insert('care');
trie.insert('cared');
trie.insert('cars');
trie.insert('carbs');
trie.insert('carapace');
trie.insert('cargo');
print('Collections starting with "car"');
final prefixedWithCar = trie.matchPrefix('car');
print(prefixedWithCar);
print('\nCollections starting with "care"');
final prefixedWithCare = trie.matchPrefix('care');
print(prefixedWithCare);
You should see the output below in the console:
Collections starting with "car"
[car, card, care, cared, cars, carbs, carapace, cargo]
Collections starting with "care"
[care, cared]
Challenges¶
How was this chapter for you? Are you ready to take it a bit further? The following challenges will ask you to add functionality to and generalize what you’ve already accomplished. Check out the Challenge Solutions section or the supplemental materials that come with the book if you need any help.
Challenge 1: Additional Properties¶
The current implementation of StringTrie
is missing some notable operations. Your task for this challenge is to augment the current implementation of the trie by adding the following:
- An
allStrings
property that returns all the collections in the trie. - A
count
property that tells you how many strings are currently in the trie. - An
isEmpty
property that returnstrue
if the trie is empty,false
otherwise.
Challenge 2: Generic Trie¶
The trie data structure can be used beyond strings. Make a new class named Trie
that handles any iterable collection. Implement the insert
, contains
and remove
methods.
Key Points¶
- Tries provide great performance metrics for prefix matching.
- Tries are relatively memory efficient since individual nodes can be shared between many different values. For example, “car,” “carbs,” and “care” can share the first three letters of the word.