# audiofiles -- ML Classification System

Two-layer system that classifies audio samples into 16 categories (5 drum sub-classes plus 11 non-drum categories). Layer 1 uses rule-based heuristics for broad classification. Layer 2 uses a 200-tree Random Forest to sub-classify the samples Layer 1 labels as drums.

## Architecture

```
Audio file
  ↓ decode (Symphonia → mono f32)
  ↓ feature extraction (9 scalar + 13 MFCC means + 13 MFCC variances = 35 features)

Layer 1: classify_broad()                ← rule-based heuristics
  ├─ Drum → Layer 2: predict_layer2()    ← Random Forest (200 trees)
  │    └─ Kick / Snare / HiHat / Cymbal / Percussion
  └─ Non-drum → return directly
       └─ Bass / Vocal / Synth / Pad / Noise / Music / Ambience / Impact / Foley / Texture / Misc
```

### Layer 1: Rule-Based Broad Classifier

`classify_broad()` in `crates/audiofiles-core/src/analysis/classify.rs`

Routes samples into broad categories using spectral and waveform features. Durations and attack times are in seconds, centroid in Hz, and centroid variance in Hz²; flatness and crest factor are unitless ratios.

| Rule | Condition | Category |
|------|-----------|----------|
| Noise | flatness > 0.7 | Noise |
| Drum | duration < 2.0 AND (attack < 0.05 OR crest > 2.5) | Drum → Layer 2 |
| Bass | centroid < 400 AND flatness < 0.15 | Bass |
| Ambience | duration > 5.0 AND low centroid_variance AND 0.15 < flatness < 0.5 | Ambience |
| Impact | crest > 10.0 AND attack < 0.005 | Impact |
| Texture | duration > 2.0 AND centroid_variance > 500,000 | Texture |

Rules are evaluated in priority order, and the first rule that matches determines the category. Confidence values range from 0.75 to 0.95 depending on how strongly the sample matches.
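
A minimal sketch of the rule cascade in the table, assuming the table order is the priority order and that the first match wins. The `BroadFeatures` struct, the `Broad` enum, and the `low_cv_max` parameter (a stand-in for the unspecified "low centroid_variance" cutoff) are hypothetical, and confidence scoring is left out because its formula is not documented here.

```rust
/// Hypothetical subset of the scalar features used by the Layer 1 rules.
#[derive(Debug, Clone, Copy)]
pub struct BroadFeatures {
    pub duration: f32,          // seconds
    pub centroid: f32,          // Hz
    pub flatness: f32,          // 0.0..=1.0
    pub centroid_variance: f32, // Hz²
    pub crest_factor: f32,      // peak / RMS
    pub attack_time: f32,       // seconds
}

/// Categories the table can emit; `Unmatched` means "no rule in this table fired".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Broad { Noise, Drum, Bass, Ambience, Impact, Texture, Unmatched }

/// Evaluates the rules top to bottom; the first match wins (assumed ordering).
/// `low_cv_max` is a hypothetical placeholder for the undocumented cutoff.
pub fn classify_broad_sketch(f: &BroadFeatures, low_cv_max: f32) -> Broad {
    if f.flatness > 0.7 {
        Broad::Noise
    } else if f.duration < 2.0 && (f.attack_time < 0.05 || f.crest_factor > 2.5) {
        Broad::Drum
    } else if f.centroid < 400.0 && f.flatness < 0.15 {
        Broad::Bass
    } else if f.duration > 5.0
        && f.centroid_variance < low_cv_max
        && f.flatness > 0.15
        && f.flatness < 0.5
    {
        Broad::Ambience
    } else if f.crest_factor > 10.0 && f.attack_time < 0.005 {
        Broad::Impact
    } else if f.duration > 2.0 && f.centroid_variance > 500_000.0 {
        Broad::Texture
    } else {
        Broad::Unmatched
    }
}
```

Under this assumed ordering, any sample shorter than 2 s with an attack under 0.005 s already satisfies the Drum rule, so the Impact rule can only fire for longer samples.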

### Layer 2: Random Forest Drum Classifier

`predict_layer2()` in `crates/audiofiles-core/src/analysis/classify.rs`

- **Model**: 200 decision trees, majority vote
- **Classes**: Kick (0), Snare (1), HiHat (2), Cymbal (3), Percussion (4)
- **Confidence**: fraction of trees voting for the majority class (e.g., 0.85 means 170 of 200 trees agreed)
- **Fallback**: if the embedded model's trees vector is empty, reverts to `classify_full()` (16-class rule-based)
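
The vote-and-confidence rule can be sketched as follows, assuming each tree has already produced a class index from 0 to 4. `majority_vote` and `N_DRUM_CLASSES` are hypothetical names, and the tie-breaking rule (lowest class index wins) is an assumption.

```rust
/// Number of drum classes: Kick, Snare, HiHat, Cymbal, Percussion.
pub const N_DRUM_CLASSES: usize = 5;

/// Hypothetical helper, not the real predict_layer2().
/// Returns (winning class index, fraction of trees that voted for it),
/// or None when there are no votes (the empty-model case).
/// Every vote must be < N_DRUM_CLASSES.
pub fn majority_vote(tree_votes: &[usize]) -> Option<(usize, f32)> {
    if tree_votes.is_empty() {
        return None;
    }
    let mut counts = [0usize; N_DRUM_CLASSES];
    for &class in tree_votes {
        counts[class] += 1;
    }
    // Strict `>` keeps the lowest-index class on ties (assumed tie rule).
    let mut best = 0;
    for class in 1..N_DRUM_CLASSES {
        if counts[class] > counts[best] {
            best = class;
        }
    }
    Some((best, counts[best] as f32 / tree_votes.len() as f32))
}
```

With 170 votes for Kick and 30 for Snare, this returns `Some((0, 0.85))`, matching the example in the list.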
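
### Graceful Degradation

If `layer2_drum.json` contains an empty trees vector, the system falls back to `classify_full()`, a comprehensive 16-class rule-based classifier covering all categories, so classification still returns a result without a trained model. Malformed JSON is a different case: `layer2_model()` (see Model Loading) calls `expect` on the deserialization result and panics on first use.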
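
A sketch of the two-layer dispatch with this fallback, under loudly stated assumptions: trees are modelled as boxed closures, Layer 1 is passed in as a function that returns `None` when it detects a drum, and `Label`, `Tree`, `Forest`, and `classify_sketch` are hypothetical names rather than the real types in `classify.rs`.

```rust
/// Hypothetical stand-ins for the real types in classify.rs.
#[derive(Debug, Clone, PartialEq)]
pub enum Label { Drum(usize), Other(&'static str) }

/// A "tree" here is any function from a feature vector to a class index (0..5).
pub type Tree = Box<dyn Fn(&[f32]) -> usize>;

pub struct Forest {
    pub trees: Vec<Tree>,
}

/// Two-layer dispatch with graceful degradation.
/// `layer1` returns Some(result) for non-drums and None for drums;
/// `classify_full` plays the role of the 16-class rule-based fallback.
pub fn classify_sketch(
    x: &[f32],
    forest: &Forest,
    layer1: impl Fn(&[f32]) -> Option<(Label, f32)>,
    classify_full: impl Fn(&[f32]) -> (Label, f32),
) -> (Label, f32) {
    // Layer 1: non-drum categories are returned directly.
    if let Some(result) = layer1(x) {
        return result;
    }
    // Empty forest means "no trained model": fall back to the rules.
    if forest.trees.is_empty() {
        return classify_full(x);
    }
    // Layer 2: count votes; confidence = winning votes / total trees.
    let mut counts = [0usize; 5];
    for tree in &forest.trees {
        counts[tree(x)] += 1;
    }
    let (best, &n) = counts.iter().enumerate().max_by_key(|&(_, c)| c).unwrap();
    (Label::Drum(best), n as f32 / forest.trees.len() as f32)
}
```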

---

## Feature Vector

35 features total: 9 scalar + 13 MFCC means + 13 MFCC variances.

### Scalar Features (indices 0--8)

| Index | Feature | Source | Description |
|-------|---------|--------|-------------|
| 0 | duration | basic.rs | Total length in seconds |
| 1 | centroid | spectral.rs | Spectral center of mass in Hz |
| 2 | flatness | spectral.rs | Ratio of the geometric mean to the arithmetic mean of magnitudes: 0.0 (pure tone) to 1.0 (white noise) |
| 3 | zcr | spectral.rs | Zero-crossing rate (fraction of adjacent sample pairs that change sign) |
| 4 | onset_strength | spectral.rs | Sum of positive spectral flux across STFT frames |
| 5 | bandwidth | spectral.rs | Spectral standard deviation around the centroid in Hz |
| 6 | centroid_variance | spectral.rs | Variance of per-frame centroids in Hz² (high = evolving spectrum) |
| 7 | crest_factor | basic.rs | Peak / RMS in the linear domain (values above ~8 suggest impacts; the Impact rule uses > 10.0) |
| 8 | attack_time | basic.rs | Time to reach 90% of peak amplitude in seconds |
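
Minimal sketches of several scalar features, written from the table's one-line definitions. Conventions the table does not pin down are assumptions: zero counts as non-negative for ZCR, attack time is measured from the first sample, and a small epsilon guards silent frames. All function names are hypothetical; the code in `basic.rs` and `spectral.rs` may differ.

```rust
// Hypothetical helpers sketching the table's definitions.

/// Zero-crossing rate: fraction of adjacent sample pairs whose signs differ.
pub fn zero_crossing_rate(x: &[f32]) -> f32 {
    if x.len() < 2 {
        return 0.0;
    }
    let crossings = x.windows(2).filter(|w| (w[0] >= 0.0) != (w[1] >= 0.0)).count();
    crossings as f32 / (x.len() - 1) as f32
}

/// Crest factor: peak absolute amplitude divided by RMS (linear, not dB).
pub fn crest_factor(x: &[f32]) -> f32 {
    let peak = x.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let rms = (x.iter().map(|v| v * v).sum::<f32>() / x.len().max(1) as f32).sqrt();
    if rms > 0.0 { peak / rms } else { 0.0 }
}

/// Attack time: seconds from the first sample until |x| first reaches 90% of the peak.
pub fn attack_time(x: &[f32], sample_rate: f32) -> f32 {
    let peak = x.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let idx = x.iter().position(|v| v.abs() >= 0.9 * peak).unwrap_or(0);
    idx as f32 / sample_rate
}

/// Centroid (Hz), bandwidth (Hz) and flatness of one magnitude frame.
/// `freqs[k]` is the frequency of bin k in Hz.
pub fn frame_spectral_stats(mags: &[f32], freqs: &[f32]) -> (f32, f32, f32) {
    const EPS: f32 = 1e-10; // avoids ln(0) and division by zero on silence
    let total = mags.iter().sum::<f32>() + EPS;
    let centroid = mags.iter().zip(freqs).map(|(m, f)| m * f).sum::<f32>() / total;
    let variance = mags
        .iter()
        .zip(freqs)
        .map(|(m, f)| m * (f - centroid).powi(2))
        .sum::<f32>()
        / total;
    let n = mags.len().max(1) as f32;
    let geo_mean = (mags.iter().map(|m| (m + EPS).ln()).sum::<f32>() / n).exp();
    let arith_mean = mags.iter().map(|m| m + EPS).sum::<f32>() / n;
    (centroid, variance.sqrt(), geo_mean / arith_mean)
}
```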

### MFCC Features (indices 9--34)

| Indices | Feature | Description |
|---------|---------|-------------|
| 9--21 | MFCC means | Mean of first 13 MFCCs across all STFT frames |
| 22--34 | MFCC variances | Variance of first 13 MFCCs across all STFT frames |
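
The index layout amounts to a simple packing step. A minimal sketch, assuming per-frame MFCCs are already computed and that "variance" means population variance (dividing by the frame count); the function and constant names are hypothetical.

```rust
/// Number of MFCC coefficients kept per frame.
pub const N_MFCC: usize = 13;
/// 9 scalar features + 13 MFCC means + 13 MFCC variances.
pub const N_FEATURES: usize = 9 + 2 * N_MFCC;

/// Hypothetical packer for the documented index order:
/// 0..=8 scalars, 9..=21 per-coefficient means, 22..=34 per-coefficient variances.
pub fn assemble_features(scalars: [f32; 9], mfcc_frames: &[[f32; N_MFCC]]) -> [f32; N_FEATURES] {
    let mut out = [0.0f32; N_FEATURES];
    out[..9].copy_from_slice(&scalars);
    let n = mfcc_frames.len();
    if n == 0 {
        return out; // no frames: MFCC slots stay at zero (assumption)
    }
    for c in 0..N_MFCC {
        let mean = mfcc_frames.iter().map(|f| f[c]).sum::<f32>() / n as f32;
        // Population variance (divide by n), an assumption.
        let var = mfcc_frames.iter().map(|f| (f[c] - mean).powi(2)).sum::<f32>() / n as f32;
        out[9 + c] = mean;
        out[9 + N_MFCC + c] = var;
    }
    out
}
```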

MFCC computation: 26-bin mel filterbank applied to STFT magnitude frames, log energy transform, DCT-II, keep first 13 coefficients.
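
The four MFCC steps can be sketched end to end. This is a minimal, unoptimised version under stated assumptions: the HTK mel formula, filters spanning 0 Hz to Nyquist, natural log with a small epsilon, and an unnormalised DCT-II. The function names are hypothetical, and `mfcc.rs` may make different choices (for example, a different mel formula or DCT normalisation).

```rust
use std::f32::consts::PI;

// Hypothetical helpers; HTK-style mel scale is an assumption.
fn hz_to_mel(hz: f32) -> f32 { 2595.0 * (1.0 + hz / 700.0).log10() }
fn mel_to_hz(mel: f32) -> f32 { 700.0 * (10f32.powf(mel / 2595.0) - 1.0) }

/// Triangular filters over the `n_fft / 2 + 1` STFT bins, evenly spaced in mel.
pub fn mel_filterbank(n_mels: usize, n_fft: usize, sample_rate: f32) -> Vec<Vec<f32>> {
    let n_bins = n_fft / 2 + 1;
    let max_mel = hz_to_mel(sample_rate / 2.0);
    // n_mels + 2 edge points: filter m spans edges[m]..edges[m + 2].
    let edges: Vec<f32> = (0..n_mels + 2)
        .map(|i| mel_to_hz(max_mel * i as f32 / (n_mels + 1) as f32))
        .collect();
    (0..n_mels)
        .map(|m| {
            let (lo, mid, hi) = (edges[m], edges[m + 1], edges[m + 2]);
            (0..n_bins)
                .map(|k| {
                    let f = k as f32 * sample_rate / n_fft as f32;
                    let rising = (f - lo) / (mid - lo);
                    let falling = (hi - f) / (hi - mid);
                    rising.min(falling).max(0.0) // zero outside [lo, hi]
                })
                .collect::<Vec<f32>>()
        })
        .collect()
}

/// Unnormalised DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k).
pub fn dct2(x: &[f32], n_out: usize) -> Vec<f32> {
    let n = x.len() as f32;
    (0..n_out)
        .map(|k| {
            x.iter()
                .enumerate()
                .map(|(i, v)| v * (PI / n * (i as f32 + 0.5) * k as f32).cos())
                .sum::<f32>()
        })
        .collect()
}

/// MFCCs for one magnitude frame: filterbank -> log energy -> DCT-II -> first 13.
pub fn mfcc_frame(mags: &[f32], filters: &[Vec<f32>]) -> Vec<f32> {
    let log_energies: Vec<f32> = filters
        .iter()
        .map(|w| {
            let e: f32 = w.iter().zip(mags).map(|(a, b)| a * b).sum();
            (e + 1e-10).ln() // epsilon avoids ln(0) on silent bands
        })
        .collect();
    dct2(&log_energies, 13)
}
```

A quick sanity check on the DCT: when all 26 log-energies are equal, the energy lands entirely in coefficient 0, so `dct2(&[1.0; 26], 13)` returns 26.0 followed by twelve values that are zero up to rounding.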

### STFT Parameters

- FFT size: 2048 points with Hann window
- Hop size: 512 samples (consecutive frames overlap by 75%)
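
The framing and windowing implied by these parameters can be sketched as follows. Two details are assumptions not settled here: a periodic Hann window and no padding at the signal ends. The sketch uses a naive DFT so it runs with only the standard library; all names are hypothetical, and a real implementation would use an FFT.

```rust
use std::f32::consts::PI;

pub const FFT_SIZE: usize = 2048;
pub const HOP_SIZE: usize = 512;

/// Periodic Hann window of length n (periodic vs symmetric is an assumption).
pub fn hann(n: usize) -> Vec<f32> {
    (0..n).map(|i| 0.5 - 0.5 * (2.0 * PI * i as f32 / n as f32).cos()).collect()
}

/// Number of full frames when the signal is not padded (an assumption).
pub fn frame_count(len: usize) -> usize {
    if len < FFT_SIZE { 0 } else { 1 + (len - FFT_SIZE) / HOP_SIZE }
}

/// Hypothetical STFT: FFT_SIZE / 2 + 1 magnitude bins per frame via an O(N^2) DFT.
pub fn stft_magnitudes(x: &[f32]) -> Vec<Vec<f32>> {
    let window = hann(FFT_SIZE);
    (0..frame_count(x.len()))
        .map(|f| {
            let start = f * HOP_SIZE;
            let frame: Vec<f32> = x[start..start + FFT_SIZE]
                .iter()
                .zip(&window)
                .map(|(s, w)| s * w)
                .collect();
            (0..=FFT_SIZE / 2)
                .map(|k| {
                    let (mut re, mut im) = (0.0f32, 0.0f32);
                    for (n, s) in frame.iter().enumerate() {
                        // Reduce k*n modulo N first to keep the phase small and accurate.
                        let phase = -2.0 * PI * ((k * n) % FFT_SIZE) as f32 / FFT_SIZE as f32;
                        re += s * phase.cos();
                        im += s * phase.sin();
                    }
                    (re * re + im * im).sqrt()
                })
                .collect::<Vec<f32>>()
        })
        .collect()
}
```

For example, a 48,000-sample signal yields 1 + (48000 - 2048) / 512 = 90 full frames under integer division.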

---

## Training Pipeline

Binary: `crates/audiofiles-train/src/main.rs` (not built by default).

### Data

- Source: `~/Git/Drums/test_data/` with subdirectories per class
- Classes: `kick/`, `snare/`, `hihat/`, `cymbal/`, `clap/`, `tom/`, `percussion/`
- Class mapping: kick→0, snare→1, hihat→2, cymbal→3, clap/tom/percussion→4
- Dataset: 4,343 labeled drum samples
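
The directory-to-label mapping is small enough to express as a single `match`. `class_index` is a hypothetical name, directory names are assumed to be lowercase as listed, and how the real trainer treats unknown directories is not documented, so this sketch returns `None`.

```rust
/// Hypothetical mapping from a training subdirectory name to its Layer 2 class
/// index; clap, tom and percussion merge into class 4 as documented.
pub fn class_index(dir_name: &str) -> Option<usize> {
    match dir_name {
        "kick" => Some(0),
        "snare" => Some(1),
        "hihat" => Some(2),
        "cymbal" => Some(3),
        "clap" | "tom" | "percussion" => Some(4),
        _ => None, // undocumented directories (assumption: skipped)
    }
}
```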

### Algorithm

- **200 decision trees**, each trained on a bootstrap sample (drawn from the training set at random with replacement)
- **Max depth**: 25 levels per tree
- **Min leaf**: 3 samples minimum per leaf node
- **Features per split**: about 6 random features (√35 ≈ 5.9) sampled for each split decision
- **Split criterion**: Gini impurity
- **Parallelism**: Trees trained in parallel via rayon
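
The randomised pieces of the algorithm (bootstrap rows, a random feature subset per split, Gini scoring) can be sketched with the standard library alone. This is not the trainer's code: the xorshift generator, the `<=` split direction, and every name below are assumptions, and tree growth itself (recursing until depth 25 or until a leaf would hold fewer than 3 samples) is left out for brevity.

```rust
/// Hypothetical xorshift PRNG so the sketch needs no external crates.
/// The seed must be non-zero.
pub struct XorShift(pub u64);

impl XorShift {
    pub fn next_below(&mut self, n: usize) -> usize {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 % n as u64) as usize
    }
}

/// Bootstrap sample: n row indices drawn uniformly with replacement.
pub fn bootstrap(n: usize, rng: &mut XorShift) -> Vec<usize> {
    (0..n).map(|_| rng.next_below(n)).collect()
}

/// k distinct feature indices out of n_features (partial Fisher-Yates shuffle).
pub fn sample_features(n_features: usize, k: usize, rng: &mut XorShift) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..n_features).collect();
    for i in 0..k.min(n_features) {
        let j = i + rng.next_below(n_features - i);
        idx.swap(i, j);
    }
    idx.truncate(k);
    idx
}

/// Gini impurity 1 - sum_c p_c^2 over a set of class labels.
pub fn gini(labels: &[usize], n_classes: usize) -> f32 {
    if labels.is_empty() {
        return 0.0;
    }
    let mut counts = vec![0usize; n_classes];
    for &c in labels {
        counts[c] += 1;
    }
    let n = labels.len() as f32;
    1.0 - counts.iter().map(|&c| (c as f32 / n).powi(2)).sum::<f32>()
}

/// Size-weighted Gini of splitting `rows` on x[row][feature] <= threshold.
/// A trainer would pick the sampled (feature, threshold) that minimises this.
pub fn split_gini(x: &[Vec<f32>], y: &[usize], rows: &[usize], feature: usize, threshold: f32, n_classes: usize) -> f32 {
    let mut left = Vec::new();
    let mut right = Vec::new();
    for &r in rows {
        if x[r][feature] <= threshold { left.push(y[r]) } else { right.push(y[r]) }
    }
    let n = rows.len() as f32;
    (left.len() as f32 / n) * gini(&left, n_classes)
        + (right.len() as f32 / n) * gini(&right, n_classes)
}
```

Because each tree draws its own bootstrap sample and feature subsets independently, trees can be built in parallel, which is what makes the rayon-based training described above straightforward.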

### Evaluation

- **5-fold stratified cross-validation** (preserves class distribution)
- **94.4% strict accuracy** on 4,343 samples
- Per-class precision, recall, and F1 computed across all folds
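
Stratification and the per-class metrics can be sketched as follows. Dealing each class's samples round-robin across folds is one simple way to keep every class's share within one sample per fold; whether the trainer shuffles first and how it pools metrics across folds are assumptions here, as are the function names.

```rust
/// Hypothetical stratified fold assignment: samples of each class are dealt
/// round-robin across k folds. (Shuffling within each class first is omitted.)
pub fn stratified_folds(labels: &[usize], n_classes: usize, k: usize) -> Vec<usize> {
    let mut fold_of = vec![0; labels.len()];
    let mut next = vec![0usize; n_classes]; // per-class round-robin counter
    for (i, &c) in labels.iter().enumerate() {
        fold_of[i] = next[c] % k;
        next[c] += 1;
    }
    fold_of
}

/// Hypothetical per-class (precision, recall, F1) from pooled (true, predicted) pairs.
pub fn per_class_prf(pairs: &[(usize, usize)], n_classes: usize) -> Vec<(f32, f32, f32)> {
    (0..n_classes)
        .map(|c| {
            let tp = pairs.iter().filter(|&&(t, p)| t == c && p == c).count() as f32;
            let predicted = pairs.iter().filter(|&&(_, p)| p == c).count() as f32;
            let actual = pairs.iter().filter(|&&(t, _)| t == c).count() as f32;
            let precision = if predicted > 0.0 { tp / predicted } else { 0.0 };
            let recall = if actual > 0.0 { tp / actual } else { 0.0 };
            let f1 = if precision + recall > 0.0 {
                2.0 * precision * recall / (precision + recall)
            } else {
                0.0
            };
            (precision, recall, f1)
        })
        .collect()
}
```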

### Output

- Model file: `crates/audiofiles-core/models/layer2_drum.json` (4.0 MB)
- Format: JSON containing an array of 200 trees plus class metadata
- Each tree node is either a `Split { feature, threshold, left, right }` or `Leaf { class }`
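
A per-tree prediction over this node format might look like the sketch below. Treating `left`/`right` as indices into the tree's node list with the root at index 0, and sending values at or below the threshold to the left, are assumptions about the JSON layout; `predict_tree` is a hypothetical name.

```rust
/// Mirrors the documented Split/Leaf variants. Index-based children
/// (root at 0) are an assumption about the serialized format.
#[derive(Debug, Clone)]
pub enum Node {
    Split { feature: usize, threshold: f32, left: usize, right: usize },
    Leaf { class: usize },
}

/// Hypothetical walk of one tree from the root; `<=` goes left (assumption).
pub fn predict_tree(nodes: &[Node], x: &[f32]) -> usize {
    let mut i = 0;
    loop {
        match &nodes[i] {
            Node::Leaf { class } => return *class,
            Node::Split { feature, threshold, left, right } => {
                i = if x[*feature] <= *threshold { *left } else { *right };
            }
        }
    }
}
```

Running all 200 trees this way and feeding their predicted classes into a majority vote gives the Layer 2 label and its vote-fraction confidence.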
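
---

## Model Loading

The model is embedded at compile time and deserialized lazily on first use:

```rust
use std::sync::OnceLock;

// Raw JSON bytes baked into the binary at compile time.
// (Path is relative to classify.rs; shown here for illustration.)
static LAYER2_MODEL_BYTES: &[u8] = include_bytes!("../../models/layer2_drum.json");

static LAYER2_MODEL: OnceLock<RandomForestModel> = OnceLock::new();

/// Returns the shared model, parsing the embedded JSON on the first call only.
fn layer2_model() -> &'static RandomForestModel {
    LAYER2_MODEL.get_or_init(|| {
        // Panics if the embedded JSON is malformed; an empty `trees` vector
        // still parses and is handled by the rule-based fallback.
        serde_json::from_slice(LAYER2_MODEL_BYTES)
            .expect("embedded Layer 2 model is invalid JSON")
    })
}
```

- `include_bytes!` embeds `layer2_drum.json` into the binary
- `OnceLock` ensures deserialization happens exactly once
- After init, each call only checks the already-initialized `OnceLock` and returns a `&'static` reference, with no re-parsing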

---

## Database Integration

Classification results are stored in the `audio_analysis` table:

| Column | Type | Description |
|--------|------|-------------|
| classification | TEXT | SampleClass as lowercase string (e.g., "kick") |
| classification_confidence | REAL | 0.0--1.0; RF vote fraction for drums, heuristic confidence for non-drums |
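
The `classification` column's text form can be sketched as a lowercase rendering of the enum variant. The enum below is rebuilt from the 16 names in the architecture diagram, and deriving the string from the `Debug` name is an assumption made for brevity; the real `SampleClass` may do this differently, for example with an explicit `match`.

```rust
/// Hypothetical reconstruction of the 16 categories from the architecture
/// diagram; the real SampleClass may differ in naming or order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SampleClass {
    Kick, Snare, HiHat, Cymbal, Percussion,
    Bass, Vocal, Synth, Pad, Noise, Music, Ambience, Impact, Foley, Texture, Misc,
}

impl SampleClass {
    /// Lowercase text for the `classification` column, e.g. HiHat -> "hihat".
    /// Using the Debug name is an assumption, not the documented mechanism.
    pub fn as_db_text(self) -> String {
        format!("{self:?}").to_lowercase()
    }
}
```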

---

## Retraining

To retrain the model with new or updated training data:

1. Organize labeled samples in `~/Git/Drums/test_data/{class}/`
2. Run `cargo run -p audiofiles-train`
3. The binary outputs cross-validation metrics and writes `layer2_drum.json`
4. Rebuild audiofiles to embed the updated model

---

## Key Files

| What | Where |
|------|-------|
| Two-layer classifier | `crates/audiofiles-core/src/analysis/classify.rs` |
| Spectral features | `crates/audiofiles-core/src/analysis/spectral.rs` |
| MFCC computation | `crates/audiofiles-core/src/analysis/mfcc.rs` |
| Crest factor, attack time | `crates/audiofiles-core/src/analysis/basic.rs` |
| Training pipeline | `crates/audiofiles-train/src/main.rs` |
| Embedded model | `crates/audiofiles-core/models/layer2_drum.json` |
| Analysis orchestrator | `crates/audiofiles-core/src/analysis/mod.rs` |