# pter Architecture

## Overview

pter converts HTML email bodies into readable Markdown. It takes an HTML string and returns a Markdown string. It does not handle MIME parsing, content extraction, or Markdown rendering.

## Pipeline

```
html: &str
  → scraper::Html::parse_document()   # html5ever DOM tree
  → walk_children(root)               # depth-first traversal
      → handle_text()                 # whitespace collapsing, entity decoding
      → handle_element()              # classify → skip / transparent / block / inline
          → handle_block()            # paragraphs, headings, lists, blockquotes, pre, hr
          → handle_inline()           # bold, italic, links, images, code, br
  → whitespace::normalize()           # collapse blank lines, trim
  → String
```

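The classify-then-dispatch step can be illustrated with a small standalone sketch. It is not pter's code: the `Node` enum is a toy stand-in for scraper's tree, the tag lists in `classify` are illustrative guesses, and block and inline handling are reduced to paragraph breaks and bold/italic markers.

```rust
/// Toy stand-in for a parsed DOM node (assumption: the real walker uses scraper's tree).
enum Node {
    Text(String),
    Element { tag: &'static str, children: Vec<Node> },
}

/// The four outcomes of element classification named in the pipeline.
#[derive(Debug, PartialEq)]
enum Kind {
    Skip,
    Transparent,
    Block,
    Inline,
}

/// Hypothetical classification table; the tag lists are illustrative only.
fn classify(tag: &str) -> Kind {
    match tag {
        "script" | "style" | "head" => Kind::Skip,
        "p" | "blockquote" | "pre" | "hr" | "ul" | "ol" | "li" => Kind::Block,
        "strong" | "b" | "em" | "i" | "a" | "img" | "code" | "br" => Kind::Inline,
        // Unknown wrappers (div, span, table, ...) contribute only their children.
        _ => Kind::Transparent,
    }
}

/// Depth-first traversal that appends Markdown to `out`.
fn walk_children(children: &[Node], out: &mut String) {
    for child in children {
        match child {
            Node::Text(text) => out.push_str(text),
            Node::Element { tag, children } => match classify(tag) {
                Kind::Skip => {}
                Kind::Transparent => walk_children(children, out),
                Kind::Block => {
                    // Simplified: every block is followed by one blank line.
                    walk_children(children, out);
                    out.push_str("\n\n");
                }
                Kind::Inline => {
                    // Simplified: only bold and italic get markers.
                    let marker = match *tag {
                        "strong" | "b" => "**",
                        "em" | "i" => "*",
                        _ => "",
                    };
                    out.push_str(marker);
                    walk_children(children, out);
                    out.push_str(marker);
                }
            },
        }
    }
}
```

In this sketch a `<div>` wrapping a `<style>` and a `<p>` yields only the paragraph text: the `div` is transparent, the `style` subtree is skipped, and the `p` is emitted as a block.
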
## Module Responsibilities

| Module | Responsibility |
|--------|---------------|
| `lib.rs` | Public API (`convert`), re-exports |
| `convert.rs` | DOM walker, `Context` state, element dispatch |
| `elements.rs` | Element classification, tracking pixel / hidden detection |
| `whitespace.rs` | Output normalization |
| `tables.rs` | Table layout detection and unwrapping (Phase 2) |
| `replies.rs` | Reply chain detection and quoting (Phase 3) |

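As a rough illustration of what output normalization involves, the sketch below collapses runs of blank lines into one and trims the result. The function name mirrors the pipeline's `whitespace::normalize()`, but the body is an assumption about the behavior, not the actual implementation; a real version would likely need to leave `pre` content alone.

```rust
/// Sketch of output normalization (assumed behavior, not pter's code):
/// strips trailing whitespace per line, collapses consecutive blank lines
/// into a single blank line, and drops leading and trailing blank lines.
fn normalize(input: &str) -> String {
    let mut out = String::new();
    let mut pending_blank = false;

    for line in input.lines() {
        let line = line.trim_end();
        if line.is_empty() {
            // Remember the gap, but only once some content has been emitted.
            pending_blank = !out.is_empty();
            continue;
        }
        if !out.is_empty() {
            out.push('\n');
            if pending_blank {
                // Emit exactly one blank line, however many were seen.
                out.push('\n');
            }
        }
        out.push_str(line);
        pending_blank = false;
    }
    out
}
```

Deferring the blank line until the next non-empty line arrives is what removes trailing blank lines without a second pass.
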
## Design Decisions

**scraper over html5ever directly**: We need tree traversal (parent/child/sibling access) for layout table unwrapping and reply chain detection. scraper provides this via ego-tree on top of html5ever's spec-compliant parsing.

**Markdown output**: Markdown is readable as plain text and renderable by standard Markdown tooling. It preserves structural information (headings, links, lists) that plain text loses.

**Faithful conversion**: pter converts what's there. Content extraction (stripping marketing wrappers) and post-processing (trimming signatures) are separate concerns, composable before or after pter.

**Blockquote rendering**: Blockquotes render their children into a temporary buffer, then prefix each line with `> `. This handles nested blockquotes naturally: the inner quote produces `> ` lines, and the outer quote prefixes them again to get `> > `.

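The buffer-then-prefix approach to blockquotes can be sketched in a few lines. The helper name `quote` is hypothetical, and the sketch assumes the children have already been rendered into a string.

```rust
/// Prefixes every line of already-rendered blockquote content with "> ".
/// `quote` is a hypothetical helper name used for illustration only.
fn quote(inner: &str) -> String {
    inner
        .lines()
        // Every line gets the prefix, including empty ones (which become "> ").
        .map(|line| format!("> {}", line))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Because the function only sees text, nesting needs no special case: `quote(&quote("inner"))` prefixes the already-quoted line a second time and returns `> > inner`.
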
## Dependencies

| Crate | Purpose |
|-------|---------|
| `scraper` | HTML parsing + DOM tree + CSS selectors |
| `proptest` (dev) | Property-based testing |