# pter Architecture

## Overview

pter converts HTML email bodies into readable markdown. It takes an HTML string and returns a markdown string. It does not handle MIME parsing, content extraction, or markdown rendering.

## Pipeline

```
html: &str
  → scraper::Html::parse_document()   # html5ever DOM tree
  → walk_children(root)               # depth-first traversal
      → handle_text()                 # whitespace collapsing, entity decoding
      → handle_element()              # classify → skip / transparent / block / inline
          → handle_block()            # paragraphs, headings, lists, blockquotes, pre, hr
          → handle_inline()           # bold, italic, links, images, code, br
  → whitespace::normalize()           # collapse blank lines, trim
  → String
```
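
The classify step in the pipeline above can be sketched as a small enum plus a lookup on the tag name. This is an illustrative sketch only: `ElementKind`, `classify`, and the tag lists are assumptions made for the example, not pter's actual types or tables, and the tracking pixel / hidden-element detection is left out.

```rust
/// The four outcomes named in the pipeline: skip / transparent / block / inline.
/// Hypothetical type, introduced only for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ElementKind {
    /// Drop the element together with its whole subtree.
    Skip,
    /// Emit nothing for the tag itself, but keep walking its children.
    Transparent,
    /// Block-level output: paragraphs, headings, lists, blockquotes, pre, hr.
    Block,
    /// Inline output: bold, italic, links, images, code, br.
    Inline,
}

/// Classify an element by tag name (case-insensitive).
/// The tag lists below are assumptions chosen for illustration.
pub fn classify(tag: &str) -> ElementKind {
    match tag.to_ascii_lowercase().as_str() {
        "script" | "style" | "head" | "title" => ElementKind::Skip,
        "p" | "h1" | "h2" | "h3" | "h4" | "h5" | "h6" | "ul" | "ol" | "li"
        | "blockquote" | "pre" | "hr" => ElementKind::Block,
        "b" | "strong" | "i" | "em" | "a" | "img" | "code" | "br" => ElementKind::Inline,
        // Unknown wrappers such as div or span contribute only their children.
        _ => ElementKind::Transparent,
    }
}
```

Treating unknown tags as transparent means unrecognized wrappers never swallow their content, which fits the faithful-conversion goal. A real classifier may well choose a different default.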

## Module Responsibilities

| Module | Responsibility |
| --- | --- |
| `lib.rs` | Public API (`convert`), re-exports |
| `convert.rs` | DOM walker, `Context` state, element dispatch |
| `elements.rs` | Element classification, tracking pixel / hidden detection |
| `whitespace.rs` | Output normalization |
| `tables.rs` | Table layout detection and unwrapping (Phase 2) |
| `replies.rs` | Reply chain detection and quoting (Phase 3) |
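
As an illustration of the output normalization in `whitespace.rs`, here is a minimal sketch of collapsing blank lines and trimming. The function name `normalize` comes from the pipeline, but the signature and the exact rules are assumptions: runs of blank lines collapse to one, trailing whitespace is trimmed per line, and leading and trailing blank lines are dropped.

```rust
/// Hypothetical normalization pass, not pter's actual implementation.
pub fn normalize(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut pending_blank = false;

    for line in input.lines() {
        let line = line.trim_end();
        if line.is_empty() {
            // Remember the gap only if content was already emitted,
            // so leading blank lines are dropped.
            pending_blank = !out.is_empty();
            continue;
        }
        if !out.is_empty() {
            out.push('\n');
            if pending_blank {
                // Any run of blank lines becomes exactly one blank line.
                out.push('\n');
            }
        }
        out.push_str(line);
        pending_blank = false;
    }
    // Trailing blank lines never get flushed, so the result ends on content.
    out
}
```

With these rules, `normalize("\n\nHello  \n\n\n\nWorld\n\n")` yields `"Hello\n\nWorld"`. The sketch treats every line the same way, so a real pass would need to leave the contents of `pre` blocks alone.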

## Design Decisions

**scraper over html5ever directly**: We need tree traversal (parent/child/sibling access) for layout table unwrapping and reply chain detection. scraper provides this via ego-tree on top of html5ever's spec-compliant parsing.

**Markdown output**: Markdown is readable as plain text and renderable by any Markdown toolchain. It preserves structural information (headings, links, lists) that plain text loses.

**Faithful conversion**: pter converts what's there. Content extraction (stripping marketing wrappers) and post-processing (trimming signatures) are separate concerns, composable before or after pter.

**Blockquote rendering**: Blockquotes render their children into a temporary buffer, then prefix each line with `> `. This handles nested blockquotes naturally: the inner quote produces `> ` lines, and the outer quote prefixes them again to yield `> > `.
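
The buffer-then-prefix approach can be sketched in a few lines. The helper name `quote` is hypothetical, and emitting a bare `>` for empty lines (to avoid trailing whitespace) is an assumption of this sketch rather than something the design above specifies.

```rust
/// Prefix every line of already-rendered markdown with a blockquote marker.
/// Hypothetical helper, not pter's actual function.
pub fn quote(rendered: &str) -> String {
    rendered
        .lines()
        .map(|line| {
            if line.is_empty() {
                // Assumption: use a bare marker so no trailing space is emitted.
                ">".to_string()
            } else {
                format!("> {}", line)
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Because the inner blockquote is rendered first, its lines already start with `> ` when the outer quote runs: `quote("outer\n> inner")` gives `"> outer\n> > inner"`, with no explicit depth tracking.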

## Dependencies

| Crate | Purpose |
| --- | --- |
| `scraper` | HTML parsing + DOM tree + CSS selectors |
| `proptest` (dev) | Property-based testing |