# audiofiles -- ML Classification System

Two-layer system that classifies audio samples into 16 categories. Layer 1 uses rule-based heuristics for broad classification. Layer 2 uses a 200-tree Random Forest for fine-grained drum sub-classification.

## Architecture

```
Audio file
  ↓ decode (Symphonia → mono f32)
  ↓ feature extraction (9 scalar + 26 MFCC = 35 features)
  ↓
Layer 1: classify_broad()  ← rule-based heuristics
  ├─ Drum → Layer 2: predict_layer2()  ← Random Forest (200 trees)
  │    └─ Kick / Snare / HiHat / Cymbal / Percussion
  └─ Non-drum → return directly
       └─ Bass / Vocal / Synth / Pad / Noise / Music / Ambience / Impact / Foley / Texture / Misc
```

### Layer 1: Rule-Based Broad Classifier

`classify_broad()` in `crates/audiofiles-core/src/analysis/classify.rs`

Routes samples into broad categories using spectral and waveform features:

| Rule | Condition | Category |
|------|-----------|----------|
| Noise | flatness > 0.7 | Noise |
| Drum | duration < 2.0 AND (attack < 0.05 OR crest > 2.5) | Drum → Layer 2 |
| Bass | centroid < 400 AND flatness < 0.15 | Bass |
| Ambience | duration > 5.0 AND low centroid_variance AND 0.15 < flatness < 0.5 | Ambience |
| Impact | crest > 10.0 AND attack < 0.005 | Impact |
| Texture | duration > 2.0 AND centroid_variance > 500,000 | Texture |

Rules are evaluated in priority order. Confidence values range 0.75--0.95 depending on how strongly the sample matches.
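
The rule table can be read as a first-match-wins chain. The following is a minimal sketch, not the actual `classify_broad()`: it assumes the table order is the priority order, omits the confidence values, and uses a hypothetical cutoff for "low centroid_variance" because no number is given. The names `BroadFeatures`, `Broad`, and `classify_broad_sketch` are illustrative.

```rust
/// Subset of the feature vector used by the Layer 1 rules.
#[derive(Debug, Clone, Copy)]
pub struct BroadFeatures {
    pub duration: f32,          // seconds
    pub centroid: f32,          // Hz
    pub flatness: f32,          // 0.0..=1.0
    pub centroid_variance: f32, // variance of per-frame centroids
    pub crest: f32,             // peak / RMS
    pub attack: f32,            // seconds
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Broad {
    Noise,
    Drum,
    Bass,
    Ambience,
    Impact,
    Texture,
    Unmatched,
}

/// Hypothetical cutoff for "low centroid_variance"; the table gives no number.
const LOW_CENTROID_VARIANCE: f32 = 100_000.0;

/// Applies the rules top to bottom; the first matching rule wins.
pub fn classify_broad_sketch(f: &BroadFeatures) -> Broad {
    if f.flatness > 0.7 {
        Broad::Noise
    } else if f.duration < 2.0 && (f.attack < 0.05 || f.crest > 2.5) {
        Broad::Drum
    } else if f.centroid < 400.0 && f.flatness < 0.15 {
        Broad::Bass
    } else if f.duration > 5.0
        && f.centroid_variance < LOW_CENTROID_VARIANCE
        && f.flatness > 0.15
        && f.flatness < 0.5
    {
        Broad::Ambience
    } else if f.crest > 10.0 && f.attack < 0.005 {
        Broad::Impact
    } else if f.duration > 2.0 && f.centroid_variance > 500_000.0 {
        Broad::Texture
    } else {
        // No rule matched; the real classifier presumably picks another category here.
        Broad::Unmatched
    }
}
```

Because of the ordering, a short low-pitched hit matches the Drum rule before the Bass rule is ever checked.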

### Layer 2: Random Forest Drum Classifier

`predict_layer2()` in `crates/audiofiles-core/src/analysis/classify.rs`

- Model: 200 decision trees, majority vote
- Classes: Kick (0), Snare (1), HiHat (2), Cymbal (3), Percussion (4)
- Confidence: fraction of trees voting for the majority class (e.g., 0.85 means 170 of 200 trees agreed)
- Fallback: if the model file contains no trees, falls back to `classify_full()` (16-class rule-based)
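
The vote and confidence computation described above might look roughly like this. It is a sketch under assumptions: each tree has already produced a class index, ties go to the lowest index, and `None` is used as a hypothetical signal for "no trees, use the fallback". `majority_vote` is an illustrative name, not a function from the crate.

```rust
/// Tallies one predicted class index per tree and returns the winning class
/// together with the fraction of trees that voted for it.
/// Returns `None` when there are no votes, which a caller could treat as
/// the cue to fall back to the rule-based classifier.
pub fn majority_vote(tree_votes: &[usize], num_classes: usize) -> Option<(usize, f32)> {
    if tree_votes.is_empty() || num_classes == 0 {
        return None;
    }
    let mut counts = vec![0usize; num_classes];
    for &class in tree_votes {
        // Skip out-of-range indices instead of panicking on a malformed model.
        if let Some(slot) = counts.get_mut(class) {
            *slot += 1;
        }
    }
    // Strict `>` keeps the lowest class index on ties (an assumption).
    let mut best = 0;
    for (class, &count) in counts.iter().enumerate() {
        if count > counts[best] {
            best = class;
        }
    }
    let confidence = counts[best] as f32 / tree_votes.len() as f32;
    Some((best, confidence))
}
```

With 170 votes for Kick (0) and 30 for Snare (1), this yields class 0 with a confidence of 0.85.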

### Graceful Degradation

If `layer2_drum.json` contains an empty trees vector, the system falls back to `classify_full()`, a comprehensive 16-class rule-based classifier covering all categories. An untrained (empty) model therefore never crashes classification.

---

## Feature Vector

35 features total: 9 scalar + 13 MFCC means + 13 MFCC variances.

### Scalar Features (indices 0--8)

| Index | Feature | Source | Description |
|-------|---------|--------|-------------|
| 0 | duration | basic.rs | Total length in seconds |
| 1 | centroid | spectral.rs | Spectral center of mass in Hz |
| 2 | flatness | spectral.rs | Ratio of geometric to arithmetic mean of the magnitudes; 0.0 (pure tone) to 1.0 (white noise) |
| 3 | zcr | spectral.rs | Zero-crossing rate (fraction of sign changes per sample) |
| 4 | onset_strength | spectral.rs | Sum of positive spectral flux across STFT frames |
| 5 | bandwidth | spectral.rs | Spectral standard deviation around centroid in Hz |
| 6 | centroid_variance | spectral.rs | Variance of per-frame centroids (high = evolving spectrum) |
| 7 | crest_factor | basic.rs | Peak / RMS in linear domain (values above 8 suggest impacts) |
| 8 | attack_time | basic.rs | Time to reach 90% of peak amplitude in seconds |
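
A few of these features are simple enough to sketch directly from their descriptions. The functions below are illustrative and are not taken from `basic.rs` or `spectral.rs`; edge-case handling (empty input, silence) and the small epsilon inside the logarithm are assumptions.

```rust
/// Peak / RMS in the linear domain. Returns 0.0 for empty or silent input.
pub fn crest_factor(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let peak = samples.iter().fold(0.0f32, |m, s| m.max(s.abs()));
    let mean_sq = samples.iter().map(|s| s * s).sum::<f32>() / samples.len() as f32;
    let rms = mean_sq.sqrt();
    if rms > 0.0 { peak / rms } else { 0.0 }
}

/// Seconds until |sample| first reaches 90% of the peak amplitude.
pub fn attack_time(samples: &[f32], sample_rate: f32) -> f32 {
    let peak = samples.iter().fold(0.0f32, |m, s| m.max(s.abs()));
    if peak == 0.0 || sample_rate <= 0.0 {
        return 0.0;
    }
    let idx = samples
        .iter()
        .position(|s| s.abs() >= 0.9 * peak)
        .unwrap_or(0);
    idx as f32 / sample_rate
}

/// Fraction of adjacent sample pairs whose signs differ.
pub fn zero_crossing_rate(samples: &[f32]) -> f32 {
    if samples.len() < 2 {
        return 0.0;
    }
    let crossings = samples
        .windows(2)
        .filter(|w| (w[0] >= 0.0) != (w[1] >= 0.0))
        .count();
    crossings as f32 / (samples.len() - 1) as f32
}

/// Geometric mean divided by arithmetic mean of the spectral magnitudes.
pub fn spectral_flatness(magnitudes: &[f32]) -> f32 {
    if magnitudes.is_empty() {
        return 0.0;
    }
    let eps = 1e-10f32; // keeps ln() finite for zero-magnitude bins
    let n = magnitudes.len() as f32;
    let log_mean = magnitudes.iter().map(|m| (m + eps).ln()).sum::<f32>() / n;
    let arith = magnitudes.iter().sum::<f32>() / n;
    if arith > 0.0 { log_mean.exp() / arith } else { 0.0 }
}
```

A flat spectrum gives a flatness near 1.0, while a spectrum with a single non-zero bin gives a value near 0.0, matching the range in the table.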

### MFCC Features (indices 9--34)

| Indices | Feature | Description |
|---------|---------|-------------|
| 9--21 | MFCC means | Mean of first 13 MFCCs across all STFT frames |
| 22--34 | MFCC variances | Variance of first 13 MFCCs across all STFT frames |

MFCC computation: 26-bin mel filterbank applied to STFT magnitude frames, log energy transform, DCT-II, keep first 13 coefficients.
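
The last two stages of that chain (log transform, then DCT-II) and the per-coefficient statistics can be sketched as follows. This assumes the 26 mel-band energies for a frame are already available, uses an unnormalized DCT-II and population variance, and adds a small floor before the logarithm; the actual `mfcc.rs` may scale or normalize differently. All names here are illustrative.

```rust
use std::f32::consts::PI;

pub const NUM_MEL: usize = 26;
pub const NUM_MFCC: usize = 13;

/// Log-compresses the mel-band energies of one frame and applies an
/// unnormalized DCT-II, keeping the first 13 coefficients.
pub fn mfcc_from_mel(mel_energies: &[f32; NUM_MEL]) -> [f32; NUM_MFCC] {
    let mut log_e = [0.0f32; NUM_MEL];
    for (dst, &e) in log_e.iter_mut().zip(mel_energies.iter()) {
        *dst = (e + 1e-10).ln(); // small floor avoids ln(0)
    }
    let mut out = [0.0f32; NUM_MFCC];
    for (k, coeff) in out.iter_mut().enumerate() {
        let mut acc = 0.0f32;
        for (n, &x) in log_e.iter().enumerate() {
            acc += x * (PI / NUM_MEL as f32 * (n as f32 + 0.5) * k as f32).cos();
        }
        *coeff = acc;
    }
    out
}

/// Per-coefficient mean and population variance across frames:
/// 13 means followed by 13 variances (feature indices 9--34).
pub fn mfcc_stats(frames: &[[f32; NUM_MFCC]]) -> [f32; 2 * NUM_MFCC] {
    let mut stats = [0.0f32; 2 * NUM_MFCC];
    if frames.is_empty() {
        return stats;
    }
    let n = frames.len() as f32;
    for k in 0..NUM_MFCC {
        let mean = frames.iter().map(|f| f[k]).sum::<f32>() / n;
        let var = frames.iter().map(|f| (f[k] - mean).powi(2)).sum::<f32>() / n;
        stats[k] = mean;
        stats[NUM_MFCC + k] = var;
    }
    stats
}
```

When every band has the same energy, only coefficient 0 is non-zero, which is a quick sanity check for any DCT-II implementation.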

### STFT Parameters

- FFT size: 2048 points with Hann window
- Hop size: 512 samples
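
To make these parameters concrete, here is a sketch of the framing arithmetic they imply. It assumes a periodic Hann window and no padding at the signal edges; the real analysis code may pad or use the symmetric window variant. Function names are illustrative.

```rust
use std::f32::consts::PI;

pub const FFT_SIZE: usize = 2048;
pub const HOP_SIZE: usize = 512;

/// Periodic Hann window of length `n` (the symmetric variant divides by n - 1).
pub fn hann_window(n: usize) -> Vec<f32> {
    (0..n)
        .map(|i| 0.5 - 0.5 * (2.0 * PI * i as f32 / n as f32).cos())
        .collect()
}

/// Number of full frames when no padding is applied (assumption).
pub fn frame_count(num_samples: usize) -> usize {
    if num_samples < FFT_SIZE {
        0
    } else {
        1 + (num_samples - FFT_SIZE) / HOP_SIZE
    }
}

/// Center frequency in Hz of FFT bin `k`.
pub fn bin_frequency(k: usize, sample_rate: f32) -> f32 {
    k as f32 * sample_rate / FFT_SIZE as f32
}

/// Copies frame `index` out of `samples` and applies the window.
/// Returns `None` if the frame would run past the end of the signal.
pub fn windowed_frame(samples: &[f32], index: usize, window: &[f32]) -> Option<Vec<f32>> {
    let start = index * HOP_SIZE;
    let frame = samples.get(start..start + window.len())?;
    Some(frame.iter().zip(window).map(|(s, w)| s * w).collect())
}
```

With a 512-sample hop and a 2048-sample window, consecutive frames overlap by 75%.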

---

## Training Pipeline

Binary: `crates/audiofiles-train/src/main.rs` (not built by default).

### Data

- Source: `~/Git/Drums/test_data/` with subdirectories per class
- Classes: `kick/`, `snare/`, `hihat/`, `cymbal/`, `clap/`, `tom/`, `percussion/`
- Class mapping: kick→0, snare→1, hihat→2, cymbal→3, clap/tom/percussion→4
- Dataset: 4,343 labeled drum samples
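
The directory-to-class mapping above is small enough to express as a single match. This is an illustrative sketch; `class_index` and `CLASS_NAMES` are hypothetical names, and returning `None` for unknown directories is an assumption about how unexpected folders would be handled.

```rust
/// Display names for the five Layer 2 classes, indexed by class id.
pub const CLASS_NAMES: [&str; 5] = ["Kick", "Snare", "HiHat", "Cymbal", "Percussion"];

/// Maps a training subdirectory name to its Layer 2 class index.
/// `clap`, `tom`, and `percussion` all collapse into class 4.
pub fn class_index(dir_name: &str) -> Option<usize> {
    match dir_name {
        "kick" => Some(0),
        "snare" => Some(1),
        "hihat" => Some(2),
        "cymbal" => Some(3),
        "clap" | "tom" | "percussion" => Some(4),
        _ => None, // unknown directory: skip rather than guess (assumption)
    }
}
```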

### Algorithm

- 200 decision trees, each trained on a bootstrap sample (drawn at random with replacement)
- Max depth: 25 levels per tree
- Min leaf: at least 3 samples per leaf node
- Features per split: sqrt(35) ≈ 6 features sampled at random for each split decision
- Split criterion: Gini impurity
- Parallelism: trees trained in parallel via rayon
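
The split-selection arithmetic behind these settings can be sketched as below. It shows the standard Gini impurity and the size-weighted score of a candidate split, and it assumes sqrt(35) is rounded to the nearest integer (which gives 6). Bootstrap sampling and the rayon parallelism are left out, and all function names are illustrative rather than taken from the training binary.

```rust
/// Gini impurity of a node from its per-class sample counts: 1 - sum(p_c^2).
pub fn gini(counts: &[usize]) -> f64 {
    let total: usize = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    let sum_sq: f64 = counts
        .iter()
        .map(|&c| {
            let p = c as f64 / total as f64;
            p * p
        })
        .sum();
    1.0 - sum_sq
}

/// Size-weighted impurity of a candidate split; lower is better.
pub fn split_impurity(left: &[usize], right: &[usize]) -> f64 {
    let nl: usize = left.iter().sum();
    let nr: usize = right.iter().sum();
    let n = (nl + nr) as f64;
    if n == 0.0 {
        return 0.0;
    }
    (nl as f64 / n) * gini(left) + (nr as f64 / n) * gini(right)
}

/// True when both children keep at least `min_leaf` samples.
pub fn respects_min_leaf(left: &[usize], right: &[usize], min_leaf: usize) -> bool {
    left.iter().sum::<usize>() >= min_leaf && right.iter().sum::<usize>() >= min_leaf
}

/// Number of features examined per split: sqrt(total), rounded (assumption).
pub fn features_per_split(total_features: usize) -> usize {
    ((total_features as f64).sqrt().round() as usize).max(1)
}
```

A perfectly mixed two-class node has impurity 0.5 and a pure node has 0.0, so a split that separates the classes completely scores 0.0.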

### Evaluation

- 5-fold stratified cross-validation (preserves class distribution)
- 94.4% strict accuracy on 4,343 samples
- Per-class precision, recall, and F1 computed across all folds
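
One way to realize stratified folds and the per-class metrics is sketched below. Fold assignment here deals the members of each class round-robin without shuffling, which is a simplification; the confusion matrix is assumed to be indexed as `confusion[actual][predicted]` and accumulated over all folds. None of these names come from the training binary.

```rust
/// Assigns each sample a fold in 0..k, dealing the members of each class
/// round-robin so every fold keeps roughly the overall class mix.
pub fn stratified_folds(labels: &[usize], num_classes: usize, k: usize) -> Vec<usize> {
    assert!(k > 0, "need at least one fold");
    let mut next = vec![0usize; num_classes];
    labels
        .iter()
        .map(|&c| {
            let fold = next[c] % k;
            next[c] += 1;
            fold
        })
        .collect()
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ClassMetrics {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

/// Precision, recall, and F1 for one class from `confusion[actual][predicted]`.
pub fn class_metrics(confusion: &[Vec<usize>], class: usize) -> ClassMetrics {
    let tp = confusion[class][class] as f64;
    let predicted: f64 = confusion.iter().map(|row| row[class] as f64).sum();
    let actual: f64 = confusion[class].iter().map(|&c| c as f64).sum();
    let precision = if predicted > 0.0 { tp / predicted } else { 0.0 };
    let recall = if actual > 0.0 { tp / actual } else { 0.0 };
    let f1 = if precision + recall > 0.0 {
        2.0 * precision * recall / (precision + recall)
    } else {
        0.0
    };
    ClassMetrics { precision, recall, f1 }
}

/// Overall accuracy: correct predictions (the diagonal) over all samples.
pub fn accuracy(confusion: &[Vec<usize>]) -> f64 {
    let correct: usize = (0..confusion.len()).map(|i| confusion[i][i]).sum();
    let total: usize = confusion.iter().flatten().sum();
    if total == 0 { 0.0 } else { correct as f64 / total as f64 }
}
```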

### Output

- Model file: `crates/audiofiles-core/models/layer2_drum.json` (4.0 MB)
- Format: JSON array of 200 trees + class metadata
- Each tree node is either a `Split { feature, threshold, left, right }` or `Leaf { class }`
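
Given the two node shapes above, evaluating a tree is a walk from the root to a leaf. The sketch below makes two assumptions that the format description does not settle: children are stored as boxed nodes (the real JSON might reference them by index), and a sample goes left when its feature value is less than or equal to the threshold.

```rust
/// Mirrors the two node shapes named above. Boxed children are an assumption.
#[derive(Debug)]
pub enum Node {
    Split {
        feature: usize,
        threshold: f32,
        left: Box<Node>,
        right: Box<Node>,
    },
    Leaf {
        class: usize,
    },
}

/// Walks from the root to a leaf and returns the leaf's class.
/// Going left on `value <= threshold` is an assumption.
pub fn predict_tree(root: &Node, features: &[f32]) -> usize {
    let mut node = root;
    loop {
        match node {
            Node::Leaf { class } => return *class,
            Node::Split { feature, threshold, left, right } => {
                node = if features[*feature] <= *threshold {
                    &**left
                } else {
                    &**right
                };
            }
        }
    }
}

/// Collects one vote per tree for a single feature vector.
pub fn forest_votes(trees: &[Node], features: &[f32]) -> Vec<usize> {
    trees.iter().map(|t| predict_tree(t, features)).collect()
}
```

The votes returned by `forest_votes` are what a majority-vote step would tally to produce the final class and confidence.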

---

## Model Loading

The model is embedded at compile time and deserialized lazily on first use:

```rust
use std::sync::OnceLock;

// LAYER2_MODEL_BYTES holds the `include_bytes!` embedding of layer2_drum.json.
static LAYER2_MODEL: OnceLock<RandomForestModel> = OnceLock::new();

/// Returns the shared model, parsing the embedded JSON on the first call only.
fn layer2_model() -> &'static RandomForestModel {
    LAYER2_MODEL.get_or_init(|| {
        serde_json::from_slice(LAYER2_MODEL_BYTES)
            .expect("embedded Layer 2 model is invalid JSON")
    })
}
```

- `include_bytes!` embeds `layer2_drum.json` into the binary
- `OnceLock` ensures deserialization happens exactly once
- After initialization, every call returns the same `&'static` reference without re-parsing
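
The once-only behavior can be demonstrated with the standard library alone. In this sketch a counter stands in for the expensive JSON parse; `parse_model_stub`, `tree_count`, and `parse_count` are hypothetical names used only for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the initializer actually ran.
static PARSE_COUNT: AtomicUsize = AtomicUsize::new(0);
static TREE_COUNT: OnceLock<usize> = OnceLock::new();

/// Stand-in for the expensive JSON parse; returns a tree count.
fn parse_model_stub() -> usize {
    PARSE_COUNT.fetch_add(1, Ordering::SeqCst);
    200
}

/// Lazily initialized accessor: the stub runs on the first call only.
pub fn tree_count() -> &'static usize {
    TREE_COUNT.get_or_init(parse_model_stub)
}

/// Number of times the initializer has run so far.
pub fn parse_count() -> usize {
    PARSE_COUNT.load(Ordering::SeqCst)
}
```

Calling `tree_count()` any number of times leaves `parse_count()` at 1, and `get_or_init` also guarantees this when several threads race on the first call.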

---

## Database Integration

Classification results are stored in the `audio_analysis` table:

| Column | Type | Description |
|--------|------|-------------|
| classification | TEXT | SampleClass as lowercase string (e.g., "kick") |
| classification_confidence | REAL | 0.0--1.0; RF vote fraction for drums, heuristic confidence for non-drums |
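
The mapping from a classification result to these two columns might be sketched as follows. The variant list follows the 16 categories in the architecture diagram; the exact lowercase spellings other than "kick", the method name `as_db_str`, and the clamping of confidence into 0.0--1.0 are assumptions, and no database library is involved here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SampleClass {
    Kick, Snare, HiHat, Cymbal, Percussion,
    Bass, Vocal, Synth, Pad, Noise, Music,
    Ambience, Impact, Foley, Texture, Misc,
}

impl SampleClass {
    /// Lowercase name as it would appear in the `classification` column.
    pub fn as_db_str(self) -> &'static str {
        match self {
            SampleClass::Kick => "kick",
            SampleClass::Snare => "snare",
            SampleClass::HiHat => "hihat",
            SampleClass::Cymbal => "cymbal",
            SampleClass::Percussion => "percussion",
            SampleClass::Bass => "bass",
            SampleClass::Vocal => "vocal",
            SampleClass::Synth => "synth",
            SampleClass::Pad => "pad",
            SampleClass::Noise => "noise",
            SampleClass::Music => "music",
            SampleClass::Ambience => "ambience",
            SampleClass::Impact => "impact",
            SampleClass::Foley => "foley",
            SampleClass::Texture => "texture",
            SampleClass::Misc => "misc",
        }
    }
}

/// Values for (`classification`, `classification_confidence`).
/// Confidence is clamped into the documented 0.0--1.0 range.
pub fn to_row(class: SampleClass, confidence: f32) -> (String, f64) {
    (
        class.as_db_str().to_string(),
        f64::from(confidence.clamp(0.0, 1.0)),
    )
}
```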

---

## Retraining

To retrain the model with new or updated training data:

1. Organize labeled samples in `~/Git/Drums/test_data/{class}/`
2. Run `cargo run -p audiofiles-train`
3. The binary outputs cross-validation metrics and writes `layer2_drum.json`
4. Rebuild audiofiles to embed the updated model

---

## Key Files

| What | Where |
|------|-------|
| Two-layer classifier | `crates/audiofiles-core/src/analysis/classify.rs` |
| Spectral features | `crates/audiofiles-core/src/analysis/spectral.rs` |
| MFCC computation | `crates/audiofiles-core/src/analysis/mfcc.rs` |
| Crest factor, attack time | `crates/audiofiles-core/src/analysis/basic.rs` |
| Training pipeline | `crates/audiofiles-train/src/main.rs` |
| Embedded model | `crates/audiofiles-core/models/layer2_drum.json` |
| Analysis orchestrator | `crates/audiofiles-core/src/analysis/mod.rs` |