How to Build a Hash Table for Spell Checking in C++

Introduction

Hash tables are one of the most fundamental data structures in computer science, powering everything from database indexing to the spell checkers in your favorite word processor. In this tutorial, we'll build a custom hash table class in C++ and use it to create a spell checker that reads a dictionary and a document, then outputs unrecognized words with line numbers. This project mirrors the classic ECE365 assignment and will strengthen your understanding of hashing, collision resolution, and file I/O.

With the rise of AI-powered writing assistants like Grammarly and ChatGPT, understanding the underlying mechanics of spell checking is more relevant than ever. While these tools use advanced models, the core concept of fast lookup via hashing remains essential. By the end of this guide, you'll have a working hash table that you can reuse for other projects.

Understanding the Requirements

Before diving into code, let's break down the assignment specifications:

Hash Table Class: You must implement a hash table in separate files (hash.h and hash.cpp). The class should support insertion, lookup, and removal (though removal isn't used here). Use linear probing or double hashing for collision resolution.
Dictionary Loading: Read a dictionary file where each line contains one word. Ignore words with invalid characters (anything not a letter, digit, dash, or apostrophe) and words longer than 20 characters. Convert all letters to lowercase.
Document Spell Checking: Read a document file. Split it into words using any non-valid character as a separator. For each word, if it contains digits, skip it (do not spell check). Otherwise, look it up in the hash table. Report unrecognized words with the line number. If a word exceeds 20 characters, report the first 20 characters and the line number.
Performance Measurement: Measure CPU time for loading the dictionary and for spell checking the document, outputting these times to the console.
Output Format: Write results to an output file with exact formatting as specified (e.g., "Unrecognized word: <word> at line <number>").

This project is a classic assignment from ECE365, and mastering it will prepare you for more advanced data structures and algorithms.

Designing the Hash Table Class

We'll create a hash table that stores key-value pairs, where the key is a string (the word) and the value is a pointer to additional data (unused here but required for future assignments). The table will use an array of HashEntry objects, each containing a key, a value pointer, and a flag indicating if the slot is empty, occupied, or deleted.

Hash Function

A good hash function distributes keys uniformly. We'll use a simple polynomial hash with base 31, which is common for strings. For example:

int hash(const string &key, int tableSize) {
    int hashVal = 0;
    for (char ch : key) {
        hashVal = 31 * hashVal + ch;
    }
    return hashVal % tableSize;
}

This function works well for lowercase strings. Since we convert everything to lowercase, it's case-insensitive.

Collision Resolution: Linear Probing

We'll use linear probing for simplicity. When a collision occurs, we probe the next slot: (hash(key) + i) % tableSize for i = 0, 1, 2, ... until we find an empty slot or the key. The class must handle insertion, search, and removal (lazy deletion with a deleted flag).

Rehashing

When the load factor exceeds a threshold (e.g., 0.5), we double the table size and reinsert all entries. This ensures O(1) average time per operation.

Implementing the Hash Table

Let's outline the class interface as specified in the assignment's hash.h:

class HashTable {
public:
    HashTable(int size = 101);
    ~HashTable();
    void insert(const string &key, void *ptr = nullptr);
    bool contains(const string &key) const;
    void *getPointer(const string &key) const;
    void setPointer(const string &key, void *ptr);
    bool remove(const string &key);
private:
    struct HashEntry {
        string key;
        void *ptr;
        bool isOccupied;
        bool isDeleted;
    };
    vector<HashEntry> array;
    int currentSize;
    void rehash();
    int findPos(const string &key) const;
    bool isActive(int pos) const;
};

Note: getPointer, setPointer, and remove are not used in this assignment but must be declared for future use. You can leave them with stub implementations.

Spell Checker Program

The main program will:

Prompt the user for dictionary, document, and output file names.
Load the dictionary into the hash table, ignoring invalid words and those over 20 characters.
Open the document and read it line by line.
For each line, extract words using a state machine that recognizes valid characters.
For each word, convert to lowercase, check length, and if it contains digits, skip. Otherwise, look up in hash table. If not found, write to output file with line number.
Measure and print CPU times to console.

Step-by-Step Implementation

1. Setting Up the Project

Create files: hash.h, hash.cpp, spellcheck.cpp, and a Makefile. Use the provided sample Makefile from the assignment.

2. Hash Table Implementation

Start with the constructor that initializes the array with a prime size (e.g., 101). Implement insert using findPos to locate the correct slot. If the key already exists, update the pointer. If the slot is empty or deleted, insert there. After insertion, check load factor and rehash if needed.

findPos uses linear probing: start at hash(key) % array.size(), then while the slot is occupied and not the key, increment index mod size. Return the first empty or matching slot.

contains calls findPos and checks if the slot is active and the key matches.

rehash creates a new array twice the size, then reinserts all active entries.

3. Dictionary Loading

Open the dictionary file. For each line, read the word, convert to lowercase, check validity (all characters must be letters, digits, dash, or apostrophe) and length ≤ 20. If valid, insert into hash table. Count invalid words and report them (optional).

4. Document Parsing

Open the document file. Read line by line. For each line, iterate through characters. Use a variable to build the current word. When encountering a separator (any character not in the valid set), process the completed word: if it contains digits, skip; else if length > 20, write the first 20 characters to output; else look up in hash table and report if not found. Reset the word buffer.

Remember to handle the last word on each line.

5. Output Formatting

Follow the exact format:

If word not recognized: Unrecognized word: <word> at line <number>
If word too long: Long word: <first20chars> at line <number>

Use ofstream to write to the output file.

6. Measuring CPU Time

Use clock() from <ctime> to measure CPU time in seconds:

clock_t start = clock();
// ... do work ...
clock_t end = clock();
double timeSec = double(end - start) / CLOCKS_PER_SEC;

Measure dictionary loading and spell checking separately.

Testing and Debugging

Test with the provided sample dictionary and document. Compare your output with the expected output using diff. Common issues include: not handling line numbers correctly (first line is 1), missing words at end of lines, or incorrect handling of digits in words.

Performance Considerations

With a dictionary of 50,000 words, a load factor of 0.5 means a table size of 100,000. Linear probing can suffer from clustering, but for this scale it's acceptable. If you choose double hashing, you'll get better distribution but more code. The assignment requires at least linear probing or double hashing.

Common Pitfalls

Case Insensitivity: Convert everything to lowercase immediately upon reading.
Invalid Dictionary Words: Ignore them; do not insert.
Words with Digits in Document: Do not spell check them; just skip.
Long Words: Report first 20 characters, not the whole word.
Separator Characters: Anything not a letter, digit, dash, or apostrophe is a separator. This includes spaces, punctuation, newlines, etc.

Conclusion

By completing this project, you've built a core data structure from scratch and applied it to a real-world problem. Hash tables are everywhere—from database indexing to caching in web servers. Understanding how they work under the hood gives you a deeper appreciation for performance optimization. As you move on to more advanced topics like graph algorithms or machine learning, the ability to implement efficient data structures will serve you well.

If you're preparing for technical interviews, this project is excellent practice for questions on hash tables, string manipulation, and file I/O. Plus, you'll have a reusable hash table class for future assignments. Happy coding!