Deep Math of Voice AI
Signal Processing Pipeline
1️⃣ Framing & Windowing
The discrete waveform x[n] is split into overlapping frames of 25 ms with a hop of 10 ms. A Hann window w[n] = 0.5 · (1 − cos(2πn / (N−1))) mitigates spectral leakage before the short-time Fourier transform (STFT).
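The framing and windowing step can be sketched in NumPy as follows; the function name and 16 kHz sample rate are illustrative assumptions, not taken from the codebase:

```python
import numpy as np

def frame_signal(x, sr=16000, frame_ms=25, hop_ms=10):
    """Split a waveform into overlapping Hann-windowed frames.

    Frame/hop sizes follow the 25 ms / 10 ms convention above.
    """
    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sr * hop_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + (len(x) - frame_len) // hop_len
    # Hann window: w[n] = 0.5 * (1 - cos(2*pi*n / (N-1)))
    n = np.arange(frame_len)
    window = 0.5 * (1.0 - np.cos(2.0 * np.pi * n / (frame_len - 1)))
    frames = np.stack([
        x[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    return frames  # shape: (n_frames, frame_len)
```

Each windowed frame would then be passed to an FFT to form the STFT.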
2️⃣ Mel-Frequency Cepstral Coefficients (MFCC)
The magnitude spectrum is filtered by M triangular Mel filters. Log energies are decorrelated using the Discrete Cosine Transform (DCT) to yield cepstral coefficients c_m:
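The original equation appears to have been lost in extraction; the standard DCT-II form of this mapping, using the filterbank energies E_k and filter count M defined above, is:

```latex
c_m = \sum_{k=1}^{M} \log(E_k)\,
      \cos\!\left[\frac{\pi m}{M}\left(k - \tfrac{1}{2}\right)\right],
\qquad m = 0, 1, \dots, M-1
```

In practice only the first dozen or so coefficients are usually retained.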
3️⃣ Spectral Descriptors
Centroid μ measures the "center of mass" of the spectrum, roll-off f_r encloses 85 % of the spectral energy, and the zero-crossing rate (ZCR) captures temporal sharpness:
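The defining formulas were lost in extraction; the standard textbook definitions, with |X(k)| the magnitude spectrum at bin frequency f_k, are reconstructed below (implementations vary between magnitude and power weighting for the centroid):

```latex
\mu = \frac{\sum_{k} f_k\,|X(k)|}{\sum_{k} |X(k)|},
\qquad
f_r = \min\Big\{ f_j : \sum_{k \le j} |X(k)|^2 \;\ge\; 0.85 \sum_{k} |X(k)|^2 \Big\},
\qquad
\mathrm{ZCR} = \frac{1}{N-1} \sum_{n=1}^{N-1} \mathbb{1}\{\, x[n]\,x[n-1] < 0 \,\}
```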
4️⃣ Embedding via wav2vec 2.0
Frames are fed into a self-supervised wav2vec 2.0 encoder producing contextual embeddings h_t ∈ ℝ^768. A transformer with 24 layers and multi-head self-attention learns latent speech representations:
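The core operation inside each transformer layer is scaled dot-product attention; a single-head NumPy sketch (illustrative only, not the wav2vec 2.0 implementation) is:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: (T, d) matrices of per-frame query/key/value vectors.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (T, T) frame-to-frame affinities
    # Row-wise softmax, shifted by the row max for numerical stability
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V               # (T, d) context vectors
```

Multi-head attention runs several such maps in parallel on learned projections of h_t and concatenates the results.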
5️⃣ Emotion Classification
A linear head projects h_t to logits z ∈ ℝ^C where C = 8 emotions. Softmax yields posterior probabilities p = softmax(z). The predicted label is ŷ = argmax p.
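A minimal sketch of the classification head's decision rule; the eight emotion names below are an assumption (the RAVDESS label set) and may not match the codebase's ordering:

```python
import numpy as np

# Assumed 8-class label set; ordering is illustrative, not from the codebase
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

def classify(z):
    """Softmax over logits z, then argmax -> predicted emotion label."""
    z = z - z.max()                    # shift by max for numerical stability
    p = np.exp(z) / np.exp(z).sum()    # posterior probabilities, sums to 1
    return EMOTIONS[int(np.argmax(p))], p
```

The softmax shift by max(z) changes nothing mathematically but avoids overflow for large logits.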
6️⃣ Explainability with SHAP
SHAP approximates Shapley values φ_i explaining the marginal contribution of feature i to the model output:
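The equation appears to have been dropped in extraction; the standard Shapley value definition, with F the full feature set and f(S) the model output on feature subset S, is:

```latex
\phi_i \;=\; \sum_{S \subseteq F \setminus \{i\}}
\frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}
\left[\, f\bigl(S \cup \{i\}\bigr) - f(S) \,\right]
```

SHAP approximates this sum because enumerating all 2^|F| subsets is intractable for realistic feature counts.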
Reinforcement Learning (Optional Training)
The codebase contains a Proximal Policy Optimization (PPO) pipeline for fine-tuning the emotion classifier on domain-specific data. The agent maximises expected reward J(θ) = E_{τ∼π_θ}[R(τ)] using the clipped objective:
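The clipped objective itself was lost in extraction; the standard PPO surrogate from Schulman et al., with Â_t the advantage estimate and ε the clipping range, is:

```latex
L^{\text{CLIP}}(\theta) \;=\;
\mathbb{E}_t\!\left[\,
\min\bigl( r_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}\bigl(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat{A}_t \bigr)
\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
```

Clipping the probability ratio r_t(θ) keeps each policy update close to the old policy, which stabilises training.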
View the annotated Jupyter notebooks for derivations in the repository.