# audiofiles -- ML Classification System

Two-layer system that classifies audio samples into 16 categories. Layer 1 uses rule-based heuristics for broad classification. Layer 2 uses a 200-tree Random Forest for fine-grained drum sub-classification.

## Architecture

```
Audio file
  ↓ decode (Symphonia → mono f32)
  ↓ feature extraction (9 scalar + 26 MFCC = 35 features)
  ↓
Layer 1: classify_broad()  ← rule-based heuristics
  ├─ Drum → Layer 2: predict_layer2()  ← Random Forest (200 trees)
  │    └─ Kick / Snare / HiHat / Cymbal / Percussion
  └─ Non-drum → return directly
       └─ Bass / Vocal / Synth / Pad / Noise / Music / Ambience / Impact / Foley / Texture / Misc
```

### Layer 1: Rule-Based Broad Classifier

`classify_broad()` in `crates/audiofiles-core/src/analysis/classify.rs`

Routes samples into broad categories using spectral and waveform features:

| Category | Rule | Result |
|----------|------|--------|
| Noise | flatness > 0.7 | Noise |
| Drum | duration < 2.0 AND (attack < 0.05 OR crest > 2.5) | Drum → Layer 2 |
| Bass | centroid < 400 AND flatness < 0.15 | Bass |
| Ambience | duration > 5.0 AND low centroid_variance AND 0.15 < flatness < 0.5 | Ambience |
| Impact | crest > 10.0 AND attack < 0.005 | Impact |
| Texture | duration > 2.0 AND centroid_variance > 500,000 | Texture |

Durations and attack times are in seconds; centroid is in Hz. Rules are evaluated in priority order. Confidence values range 0.75--0.95 depending on how strongly the sample matches.

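The rule table can be read as a first-match-wins chain. The sketch below is a minimal illustration, not the code in `classify.rs`: it assumes the table rows are listed in priority order, and the `Features` struct, the `Broad` enum, the `classify_broad_sketch` name, and the `LOW_CENTROID_VARIANCE` cut-off are all hypothetical (the table does not give a number for "low centroid_variance"). Confidence scoring is left out.

```rust
/// Broad outcomes named in the rule table; `Unmatched` stands in for the
/// remaining categories, whose rules are not listed in the table.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Broad { Noise, Drum, Bass, Ambience, Impact, Texture, Unmatched }

/// Hypothetical input struct; the real function's signature may differ.
struct Features {
    duration: f32,          // seconds
    centroid: f32,          // Hz
    flatness: f32,          // 0.0..=1.0
    centroid_variance: f32, // variance of per-frame centroids
    crest: f32,             // peak / RMS
    attack: f32,            // seconds
}

/// Assumed cut-off for "low centroid_variance"; the table gives no number.
const LOW_CENTROID_VARIANCE: f32 = 50_000.0;

/// Evaluate the rules top to bottom and return the first match.
fn classify_broad_sketch(f: &Features) -> Broad {
    if f.flatness > 0.7 {
        Broad::Noise
    } else if f.duration < 2.0 && (f.attack < 0.05 || f.crest > 2.5) {
        Broad::Drum // would be handed to Layer 2
    } else if f.centroid < 400.0 && f.flatness < 0.15 {
        Broad::Bass
    } else if f.duration > 5.0
        && f.centroid_variance < LOW_CENTROID_VARIANCE
        && f.flatness > 0.15
        && f.flatness < 0.5
    {
        Broad::Ambience
    } else if f.crest > 10.0 && f.attack < 0.005 {
        Broad::Impact
    } else if f.duration > 2.0 && f.centroid_variance > 500_000.0 {
        Broad::Texture
    } else {
        Broad::Unmatched
    }
}
```

Under these assumptions, a 0.4 s sample with a 3 ms attack matches the Drum rule before any later rule is considered.
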
### Layer 2: Random Forest Drum Classifier

`predict_layer2()` in `crates/audiofiles-core/src/analysis/classify.rs`

- **Model**: 200 decision trees, majority vote
- **Classes**: Kick (0), Snare (1), HiHat (2), Cymbal (3), Percussion (4)
- **Confidence**: fraction of trees voting for the majority class (e.g., 0.85 means 170 of 200 trees agreed)
- **Fallback**: if the model file contains no trees, reverts to `classify_full()` (16-class rule-based)

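The vote and confidence computation can be sketched as follows. This is an illustration under assumptions, not the body of `predict_layer2()`: the `majority_vote` name is hypothetical, each tree is assumed to have already produced a class index, and ties are resolved toward the lowest class index, which the source does not specify.

```rust
/// Tally per-tree votes over the five drum classes and return
/// (winning class index, fraction of trees that voted for it).
/// Returns None when there are no votes, e.g. for an empty forest,
/// which is where a caller could fall back to a rule-based path.
fn majority_vote(votes: &[usize]) -> Option<(usize, f32)> {
    const NUM_CLASSES: usize = 5;
    if votes.is_empty() {
        return None;
    }
    let mut counts = [0usize; NUM_CLASSES];
    for &v in votes {
        // Out-of-range votes are skipped but still count toward the total.
        if v < NUM_CLASSES {
            counts[v] += 1;
        }
    }
    // Assumption: ties go to the lowest class index.
    let mut best = 0;
    for c in 1..NUM_CLASSES {
        if counts[c] > counts[best] {
            best = c;
        }
    }
    Some((best, counts[best] as f32 / votes.len() as f32))
}
```

With 170 votes for class 0 and 30 for class 1, this returns class 0 with a confidence of 0.85, matching the example in the list above.
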
### Graceful Degradation

If `layer2_drum.json` contains an empty trees vector, the system falls back to `classify_full()` -- a comprehensive 16-class rule-based classifier covering all categories. The app never crashes on classification.

---

## Feature Vector

35 features total: 9 scalar + 13 MFCC means + 13 MFCC variances.

### Scalar Features (indices 0--8)

| Index | Feature | Source | Description |
|-------|---------|--------|-------------|
| 0 | duration | basic.rs | Total length in seconds |
| 1 | centroid | spectral.rs | Spectral center of mass in Hz |
| 2 | flatness | spectral.rs | Ratio of geometric to arithmetic mean of magnitudes; 0.0 (pure tone) to 1.0 (white noise) |
| 3 | zcr | spectral.rs | Zero-crossing rate (fraction of sign changes per sample) |
| 4 | onset_strength | spectral.rs | Sum of positive spectral flux across STFT frames |
| 5 | bandwidth | spectral.rs | Spectral standard deviation around centroid in Hz |
| 6 | centroid_variance | spectral.rs | Variance of per-frame centroids (high = evolving spectrum) |
| 7 | crest_factor | basic.rs | Peak / RMS in the linear domain (values above 8 suggest impacts) |
| 8 | attack_time | basic.rs | Time to reach 90% of peak amplitude in seconds |

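Three of the time-domain features in the table are simple enough to sketch directly from their descriptions. The functions below are a hedged illustration and are not copied from `basic.rs` or `spectral.rs`; the function names and signatures are hypothetical, and the real implementations may differ in detail (for example by measuring attack on a smoothed envelope rather than on raw samples).

```rust
/// Largest absolute sample value (0.0 for empty input).
fn peak_abs(samples: &[f32]) -> f32 {
    samples.iter().fold(0.0f32, |m, &s| m.max(s.abs()))
}

/// Peak / RMS in the linear domain. Returns 0.0 for empty or silent input.
fn crest_factor(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let mean_square = samples.iter().map(|&s| s * s).sum::<f32>() / samples.len() as f32;
    let rms = mean_square.sqrt();
    if rms > 0.0 { peak_abs(samples) / rms } else { 0.0 }
}

/// Seconds until the absolute sample value first reaches 90% of the peak.
fn attack_time(samples: &[f32], sample_rate: u32) -> f32 {
    let peak = peak_abs(samples);
    if peak == 0.0 {
        return 0.0;
    }
    let idx = samples
        .iter()
        .position(|&s| s.abs() >= 0.9 * peak)
        .unwrap_or(0);
    idx as f32 / sample_rate as f32
}

/// Fraction of adjacent sample pairs whose signs differ.
fn zero_crossing_rate(samples: &[f32]) -> f32 {
    if samples.len() < 2 {
        return 0.0;
    }
    let crossings = samples
        .windows(2)
        .filter(|w| (w[0] >= 0.0) != (w[1] >= 0.0))
        .count();
    crossings as f32 / (samples.len() - 1) as f32
}
```
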
### MFCC Features (indices 9--34)

| Indices | Feature | Description |
|---------|---------|-------------|
| 9--21 | MFCC means | Mean of first 13 MFCCs across all STFT frames |
| 22--34 | MFCC variances | Variance of first 13 MFCCs across all STFT frames |

MFCC computation: 26-bin mel filterbank applied to STFT magnitude frames, log energy transform, DCT-II, keep first 13 coefficients.

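The index layout above (scalars at 0--8, MFCC means at 9--21, MFCC variances at 22--34) can be sketched as a small assembly step. This is an illustration only: `assemble_features` is a hypothetical name, the per-frame MFCCs are assumed to be computed already, and the variance here divides by the frame count N, since the source does not say whether N or N-1 is used.

```rust
const NUM_SCALAR: usize = 9;
const NUM_MFCC: usize = 13;
const NUM_FEATURES: usize = NUM_SCALAR + 2 * NUM_MFCC; // 35

/// Lay out the feature vector as
/// [0..9) scalars, [9..22) MFCC means, [22..35) MFCC variances.
fn assemble_features(
    scalars: [f32; NUM_SCALAR],
    mfcc_frames: &[[f32; NUM_MFCC]],
) -> [f32; NUM_FEATURES] {
    let mut out = [0.0f32; NUM_FEATURES];
    out[..NUM_SCALAR].copy_from_slice(&scalars);
    if mfcc_frames.is_empty() {
        return out; // no frames: leave the MFCC slots at zero
    }
    let n = mfcc_frames.len() as f32;
    for k in 0..NUM_MFCC {
        let mean = mfcc_frames.iter().map(|f| f[k]).sum::<f32>() / n;
        // Population variance (divide by N); an assumption, see above.
        let var = mfcc_frames.iter().map(|f| (f[k] - mean).powi(2)).sum::<f32>() / n;
        out[NUM_SCALAR + k] = mean;
        out[NUM_SCALAR + NUM_MFCC + k] = var;
    }
    out
}
```
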
### STFT Parameters

- FFT size: 2048 points with Hann window
- Hop size: 512 samples

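To make these two parameters concrete, the sketch below builds a Hann window and counts how many frames a signal yields. It is an assumption-laden illustration: the helper names are hypothetical, the window uses the symmetric form `w[i] = 0.5 - 0.5 * cos(2πi / (N - 1))`, and frames are counted without padding, none of which the source specifies.

```rust
const FFT_SIZE: usize = 2048;
const HOP_SIZE: usize = 512;

/// Symmetric Hann window of length `n` (assumed variant).
fn hann_window(n: usize) -> Vec<f32> {
    if n < 2 {
        return vec![1.0; n];
    }
    (0..n)
        .map(|i| {
            let phase = 2.0 * std::f32::consts::PI * i as f32 / (n - 1) as f32;
            0.5 - 0.5 * phase.cos()
        })
        .collect()
}

/// Number of complete frames when no padding is applied (assumption).
fn frame_count(num_samples: usize) -> usize {
    if num_samples < FFT_SIZE {
        0
    } else {
        1 + (num_samples - FFT_SIZE) / HOP_SIZE
    }
}
```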

---

## Training Pipeline

Binary: `crates/audiofiles-train/src/main.rs` (not built by default).

### Data

- Source: `~/Git/Drums/test_data/` with one subdirectory per class
- Classes: `kick/`, `snare/`, `hihat/`, `cymbal/`, `clap/`, `tom/`, `percussion/`
- Class mapping: kick→0, snare→1, hihat→2, cymbal→3, clap/tom/percussion→4
- Dataset: 4,343 labeled drum samples

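The directory-to-class mapping listed above can be written as a small lookup. This is a sketch rather than the training binary's code: `class_index` and `label_for_path` are hypothetical names, and the label is assumed to come from the name of the file's parent directory.

```rust
use std::path::Path;

/// Map a training subdirectory name to its class index.
/// clap, tom, and percussion all collapse into class 4 (Percussion).
fn class_index(dir_name: &str) -> Option<usize> {
    match dir_name {
        "kick" => Some(0),
        "snare" => Some(1),
        "hihat" => Some(2),
        "cymbal" => Some(3),
        "clap" | "tom" | "percussion" => Some(4),
        _ => None, // unknown directories are skipped
    }
}

/// Derive the label from the parent directory of a sample file.
fn label_for_path(path: &Path) -> Option<usize> {
    let dir = path.parent()?.file_name()?.to_str()?;
    class_index(dir)
}
```

For example, a file under `test_data/clap/` would be labeled 4, while a file under an unlisted directory yields `None`.
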
### Algorithm

- **200 decision trees**, each trained on a bootstrap sample (drawn at random with replacement)
- **Max depth**: 25 levels per tree
- **Min leaf**: at least 3 samples per leaf node
- **Features per split**: sqrt(35) ≈ 6 random features sampled per split decision
- **Split criterion**: Gini impurity
- **Parallelism**: Trees trained in parallel via rayon

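The split criterion and two of the hyperparameters above can be sketched in a few lines. The helpers below are illustrative and not taken from the training binary: the names are hypothetical, labels are assumed to lie in 0..5, and rounding sqrt(35) to the nearest integer is an assumption about how "~6" is obtained.

```rust
const NUM_CLASSES: usize = 5;
const MIN_LEAF: usize = 3;

/// Gini impurity 1 - sum(p_c^2) of a set of class labels (0.0 if empty).
/// Labels are assumed to be in 0..NUM_CLASSES.
fn gini(labels: &[usize]) -> f64 {
    if labels.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; NUM_CLASSES];
    for &l in labels {
        counts[l] += 1;
    }
    let n = labels.len() as f64;
    1.0 - counts.iter().map(|&c| (c as f64 / n).powi(2)).sum::<f64>()
}

/// Size-weighted impurity of a candidate split, or None if either
/// child would hold fewer than MIN_LEAF samples.
fn split_impurity(left: &[usize], right: &[usize]) -> Option<f64> {
    if left.len() < MIN_LEAF || right.len() < MIN_LEAF {
        return None;
    }
    let n = (left.len() + right.len()) as f64;
    Some((left.len() as f64 * gini(left) + right.len() as f64 * gini(right)) / n)
}

/// Features sampled per split: sqrt(total), rounded (assumed rounding).
fn features_per_split(total: usize) -> usize {
    (total as f64).sqrt().round() as usize
}
```

A tree builder would pick, among the sampled features and thresholds, the split with the lowest weighted impurity.
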
### Evaluation

- **5-fold stratified cross-validation** (preserves class distribution)
- **94.4% strict accuracy** on 4,343 samples
- Per-class precision, recall, and F1 computed across all folds

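The two ideas in this list, stratified fold assignment and per-class metrics, can be sketched as below. This is a simplified illustration with hypothetical names: folds are assigned round-robin within each class without shuffling (a real pipeline would normally shuffle first), and `k` is assumed to be greater than zero.

```rust
/// Assign each sample to one of `k` folds so that every fold receives a
/// near-equal share of every class. Deterministic round-robin per class.
fn stratified_folds(labels: &[usize], k: usize) -> Vec<usize> {
    let num_classes = labels.iter().copied().max().map_or(0, |m| m + 1);
    let mut next = vec![0usize; num_classes];
    labels
        .iter()
        .map(|&c| {
            let fold = next[c] % k;
            next[c] += 1;
            fold
        })
        .collect()
}

/// Precision, recall, and F1 for one class from (truth, predicted) pairs.
fn prf1(pairs: &[(usize, usize)], class: usize) -> (f64, f64, f64) {
    let tp = pairs.iter().filter(|&&(t, p)| t == class && p == class).count() as f64;
    let fp = pairs.iter().filter(|&&(t, p)| t != class && p == class).count() as f64;
    let fneg = pairs.iter().filter(|&&(t, p)| t == class && p != class).count() as f64;
    let precision = if tp + fp > 0.0 { tp / (tp + fp) } else { 0.0 };
    let recall = if tp + fneg > 0.0 { tp / (tp + fneg) } else { 0.0 };
    let f1 = if precision + recall > 0.0 {
        2.0 * precision * recall / (precision + recall)
    } else {
        0.0
    };
    (precision, recall, f1)
}
```

Pooling the held-out predictions from all five folds and calling `prf1` once per class gives metrics "computed across all folds" in the sense used above.
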
### Output

- Model file: `crates/audiofiles-core/models/layer2_drum.json` (4.0 MB)
- Format: JSON array of 200 trees + class metadata
- Each tree node is either a `Split { feature, threshold, left, right }` or `Leaf { class }`

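The two node variants suggest a recursive tree that is walked from the root to a leaf. The sketch below shows one way this could look in memory; it is not the crate's type definition. The field types, the `predict_tree` name, and the convention of going left when the feature value is less than or equal to the threshold are assumptions, and the serialized JSON layout is not reproduced here.

```rust
/// In-memory shape mirroring the documented variants (field types assumed).
enum Node {
    Split {
        feature: usize,  // index into the 35-element feature vector
        threshold: f32,
        left: Box<Node>,
        right: Box<Node>,
    },
    Leaf {
        class: usize,    // 0..=4, see the Layer 2 class list
    },
}

/// Walk one tree from the root to a leaf and return that leaf's class.
/// Assumption: values <= threshold go left, everything else goes right.
fn predict_tree(node: &Node, features: &[f32]) -> usize {
    match node {
        Node::Leaf { class } => *class,
        Node::Split { feature, threshold, left, right } => {
            if features[*feature] <= *threshold {
                predict_tree(left, features)
            } else {
                predict_tree(right, features)
            }
        }
    }
}
```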

---

## Model Loading

The model is embedded at compile time and deserialized lazily on first use:

```rust
use std::sync::OnceLock;

// Parsed forest, populated on the first call to `layer2_model()`.
static LAYER2_MODEL: OnceLock<RandomForestModel> = OnceLock::new();

fn layer2_model() -> &'static RandomForestModel {
    LAYER2_MODEL.get_or_init(|| {
        // `LAYER2_MODEL_BYTES` holds the JSON embedded via `include_bytes!`.
        serde_json::from_slice(LAYER2_MODEL_BYTES)
            .expect("embedded Layer 2 model is invalid JSON")
    })
}
```

- `include_bytes!` embeds `layer2_drum.json` into the binary
- `OnceLock` ensures deserialization happens exactly once
- After init, all subsequent calls return a static reference without re-parsing

---

## Database Integration

Classification results are stored in the `audio_analysis` table:

| Column | Type | Description |
|--------|------|-------------|
| `classification` | TEXT | `SampleClass` as lowercase string (e.g., "kick") |
| `classification_confidence` | REAL | 0.0--1.0; RF vote fraction for drums, heuristic confidence for non-drums |

---

## Retraining

To retrain the model with new or updated training data:

1. Organize labeled samples in `~/Git/Drums/test_data/{class}/`
2. Run `cargo run -p audiofiles-train`
3. The binary outputs cross-validation metrics and writes `layer2_drum.json`
4. Rebuild audiofiles to embed the updated model

---

## Key Files

| Component | Path |
|-----------|------|
| Two-layer classifier | `crates/audiofiles-core/src/analysis/classify.rs` |
| Spectral features | `crates/audiofiles-core/src/analysis/spectral.rs` |
| MFCC computation | `crates/audiofiles-core/src/analysis/mfcc.rs` |
| Crest factor, attack time | `crates/audiofiles-core/src/analysis/basic.rs` |
| Training pipeline | `crates/audiofiles-train/src/main.rs` |
| Embedded model | `crates/audiofiles-core/models/layer2_drum.json` |
| Analysis orchestrator | `crates/audiofiles-core/src/analysis/mod.rs` |