
rust-word-analyzer


A CLI tool that reads .txt and .docx files, counts word frequencies, and displays results sorted alphabetically and by count — built with custom linked list data structures in Rust.

This is an educational project that solves a real problem (word frequency analysis) using deliberately low-level data structures to explore Rust's ownership model, unsafe code, and pointer semantics.

Why Linked Lists?

A HashMap<String, usize> counts words in 5 lines. This project uses hand-rolled linked lists instead — not because it's practical, but because linked lists are the canonical hard problem in Rust.

Rust's ownership system makes linked lists genuinely difficult: every node owns the next, you can't have cycles without Rc/RefCell, and mutation requires careful management of borrows. This project tackles that head-on with real, working implementations.
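Even the simplest operation shows the friction. A minimal sketch (the type and field names here are illustrative, not the project's actual code): prepending a node forces you to move the old head out with Option::take before the new node can own it.

```rust
struct Node {
    word: String,
    next: Option<Box<Node>>,
}

struct List {
    head: Option<Box<Node>>,
}

impl List {
    fn new() -> Self {
        List { head: None }
    }

    // Even a simple prepend needs Option::take: the old head must be
    // moved out of self before the new node can own it, otherwise the
    // borrow checker rejects the code.
    fn push_front(&mut self, word: &str) {
        let node = Box::new(Node {
            word: word.to_string(),
            next: self.head.take(),
        });
        self.head = Some(node);
    }
}

fn main() {
    let mut list = List::new();
    for w in ["three", "two", "one"] {
        list.push_front(w);
    }
    // Walk the list by shared reference to print it.
    let mut cur = &list.head;
    let mut words = Vec::new();
    while let Some(node) = cur {
        words.push(node.word.clone());
        cur = &node.next;
    }
    assert_eq!(words, ["one", "two", "three"]);
}
```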

Learning Goals

Concept                          Where it appears
Box<T> heap allocation           Node storage — each node owns the next via Box<WordNode>
Raw pointer manipulation         Tail pointer as *mut WordNode for O(1) append
unsafe blocks                    Dereferencing raw pointers for tail updates
Ownership transfer               Moving nodes between positions during sort
Insertion sort on a linked list  Alphabetical ordering during initial word collection
Merge sort on a linked list      Re-sorting by word count after collection
Error handling with Result       File I/O, XML parsing, ZIP extraction
Custom macros                    gprintln! and rprintln! for colored output
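As a rough illustration of the alphabetical insertion sort during collection, here is a self-contained sketch (type and field names are assumptions, not the project's actual code) that inserts each word at its sorted position and bumps the count on a repeat:

```rust
struct WordNode {
    word: String,
    count: usize,
    next: Option<Box<WordNode>>,
}

struct WordList {
    head: Option<Box<WordNode>>,
}

impl WordList {
    fn new() -> Self {
        WordList { head: None }
    }

    fn add(&mut self, word: &str) {
        let mut cur = &mut self.head;
        // Walk forward while the current node's word sorts before `word`.
        while cur.as_ref().map_or(false, |n| n.word.as_str() < word) {
            cur = &mut cur.as_mut().unwrap().next;
        }
        // Repeat word: just increment the count.
        if let Some(node) = cur.as_mut() {
            if node.word == word {
                node.count += 1;
                return;
            }
        }
        // New word: splice a node in at the sorted position.
        let next = cur.take();
        *cur = Some(Box::new(WordNode {
            word: word.to_string(),
            count: 1,
            next,
        }));
    }
}

fn main() {
    let mut list = WordList::new();
    for w in "the quick fox the quick the".split_whitespace() {
        list.add(w);
    }
    let mut cur = &list.head;
    while let Some(node) = cur {
        println!("{} x{}", node.word, node.count); // fox x1, quick x2, the x3
        cur = &node.next;
    }
}
```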

What makes this tricky in Rust

// This pattern — keeping a raw tail pointer alongside an owned head — is
// the core tension. Box gives you ownership, but the tail needs to mutate
// a node that Box already owns. You end up in unsafe territory:

struct WordList {
    head: Option<Box<WordNode>>,   // Owns the list
    tail: *mut WordNode,           // Points into it (unsafe)
}

Other languages let you do this trivially with garbage collection. Rust forces you to reason about who owns what, and this project is a worked example of navigating that.
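Building on the struct above, a sketch of how the owned-head/raw-tail combination enables O(1) append (illustrative types; the project's real implementation may differ):

```rust
use std::ptr;

struct WordNode {
    word: String,
    next: Option<Box<WordNode>>,
}

struct WordList {
    head: Option<Box<WordNode>>, // Owns the list
    tail: *mut WordNode,         // Points into it (unsafe)
}

impl WordList {
    fn new() -> Self {
        WordList { head: None, tail: ptr::null_mut() }
    }

    // O(1) append: the raw tail pointer lets us mutate the last node
    // without walking the list, at the cost of an unsafe block.
    fn push_back(&mut self, word: &str) {
        let mut node = Box::new(WordNode { word: word.to_string(), next: None });
        // Take a raw pointer to the heap allocation before moving the Box;
        // the allocation itself does not move, so the pointer stays valid.
        let raw: *mut WordNode = &mut *node;
        if self.tail.is_null() {
            self.head = Some(node);
        } else {
            // SAFETY: tail points at the last node, which is still owned
            // by the list (via head) and outlives this call.
            unsafe { (*self.tail).next = Some(node); }
        }
        self.tail = raw;
    }
}

fn main() {
    let mut list = WordList::new();
    for w in ["one", "two", "three"] {
        list.push_back(w);
    }
    let mut cur = &list.head;
    while let Some(node) = cur {
        println!("{}", node.word); // one, two, three in insertion order
        cur = &node.next;
    }
}
```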

Features

  • Reads .txt files (line-by-line word extraction)
  • Reads .docx files (ZIP archive extraction + XML content parsing)
  • Singly linked list with alphabetical insertion sort
  • Doubly linked list variant (experimental)
  • Merge sort to re-order by word count
  • Colored terminal output
  • Integration tests with known expected output

Usage

cargo run -- path/to/file.txt
cargo run -- path/to/document.docx

Example Output

Printing list sorted alphabetically:
Node 1: Count: 5 Word: eight
Node 2: Count: 1 Word: five
Node 3: Count: 1 Word: four
...

Printing list sorted by word count:
Node 1: Count: 13 Word: one
Node 2: Count: 9 Word: two
Node 3: Count: 5 Word: ten
...

Total non-unique words: 50
Total unique words: 10

Project Structure

src/
├── main.rs                            # CLI entry point, file type routing
├── readfile.rs                        # .txt and .docx file parsers
├── word_tracker_single_linkedlist.rs  # Singly linked list (primary)
├── word_tracker_double_linkedlist.rs  # Doubly linked list (experimental)
└── utilities/
    ├── mod.rs                         # Module declarations
    └── print_utils.rs                 # Colored terminal output macros

tests/
├── sort_by_word_count_txt_test.rs     # Integration test for .txt files
├── sort_by_word_count_docx_test.rs    # Integration test for .docx files
└── data/
    ├── input.txt                      # Test fixture
    └── input.docx                     # Test fixture (same content as .txt)
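The gprintln! and rprintln! macros might be built roughly like this. This is a hypothetical, dependency-free reconstruction using raw ANSI escape codes; the real project delegates to the colored crate:

```rust
// Hypothetical stand-ins for the project's gprintln!/rprintln! macros,
// written with raw ANSI escapes so the sketch needs no external crates.
fn green(s: &str) -> String {
    format!("\x1b[32m{}\x1b[0m", s)
}

fn red(s: &str) -> String {
    format!("\x1b[31m{}\x1b[0m", s)
}

// Forward any println!-style arguments, then wrap the result in color.
macro_rules! gprintln {
    ($($arg:tt)*) => {
        println!("{}", green(&format!($($arg)*)))
    };
}

macro_rules! rprintln {
    ($($arg:tt)*) => {
        println!("{}", red(&format!($($arg)*)))
    };
}

fn main() {
    gprintln!("Total unique words: {}", 10);     // green output
    rprintln!("error: unsupported file type");   // red output
}
```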

Testing

cargo test

Tests run the full binary against known input files and validate:

  • Alphabetical sort order and word counts
  • Word-count sort order (descending)
  • Correct handling of runs of whitespace and line breaks

Dependencies

Crate      Purpose
zip        Extract .docx ZIP archives
quick-xml  Parse Word document XML
colored    Colored terminal output

License

MIT License. See LICENSE for details.