
📝 Deep Math of Voice AI

Signal Processing Pipeline

1️⃣ Framing & Windowing

The discrete waveform x[n] is split into overlapping frames of 25 ms with a hop of 10 ms. A Hann window w[n] = 0.5 · (1 − cos(2πn / (N−1))) mitigates spectral leakage before the short-time Fourier transform (STFT):

X_m(k) = \sum_{n=0}^{N-1} x[n + mH]\, w[n]\, e^{-j\,2\pi k n / N}
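A minimal NumPy sketch of this framing and windowed DFT; the 16 kHz sample rate and the helper name stft_frames are illustrative, not the repository's exact code:

```python
import numpy as np

def stft_frames(x, sr=16000, frame_ms=25, hop_ms=10):
    """Split x into overlapping frames, apply a Hann window, take the DFT of each."""
    N = int(sr * frame_ms / 1000)                      # frame length in samples (25 ms)
    H = int(sr * hop_ms / 1000)                        # hop size in samples (10 ms)
    n = np.arange(N)
    w = 0.5 * (1 - np.cos(2 * np.pi * n / (N - 1)))    # Hann window w[n]
    n_frames = 1 + (len(x) - N) // H
    X = np.empty((n_frames, N // 2 + 1), dtype=complex)
    for m in range(n_frames):
        frame = x[m * H : m * H + N] * w               # x[n + mH] · w[n]
        X[m] = np.fft.rfft(frame)                      # X_m(k) for k = 0 … N/2
    return X
```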

2️⃣ Mel-Frequency Cepstral Coefficients (MFCC)

The magnitude spectrum is filtered by M triangular Mel filters. Log energies are decorrelated using the Discrete Cosine Transform (DCT) to yield cepstral coefficients c_m:

c_m = \sum_{k=1}^{K} \ln|X(k)|\, \cos\!\left( m\,\frac{(k - 0.5)\,\pi}{K} \right), \quad 0 \le m < M
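In practice the Mel filterbank, log, and DCT are usually delegated to a library. A hedged sketch using librosa; the file name, 16 kHz rate, and 13 coefficients are illustrative choices, not necessarily the app's configuration:

```python
import librosa

# Placeholder file; window/hop lengths match the 25 ms / 10 ms framing above.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)
print(mfcc.shape)  # (13, n_frames): one cepstral vector per frame
```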

3️⃣ Spectral Descriptors

The spectral centroid μ measures the "center of mass" of the spectrum, the roll-off frequency f_r encloses 85 % of the spectral energy, and the zero-crossing rate (ZCR) captures temporal sharpness:

\mu = \frac{\sum_k f_k\,|X(k)|}{\sum_k |X(k)|}, \qquad \mathrm{ZCR} = \frac{1}{N-1} \sum_{n=1}^{N-1} \mathbf{1}\bigl[\, x[n]\, x[n-1] < 0 \,\bigr]
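A minimal NumPy sketch of these three descriptors for a single frame; frame holds the time-domain samples, X_mag the magnitude spectrum, and freqs the bin frequencies (all names are illustrative):

```python
import numpy as np

def spectral_descriptors(frame, X_mag, freqs, rolloff_pct=0.85):
    """Spectral centroid, 85 % roll-off frequency, and zero-crossing rate."""
    centroid = np.sum(freqs * X_mag) / np.sum(X_mag)
    cum_energy = np.cumsum(X_mag)                 # magnitude-based; some definitions use |X|^2
    rolloff = freqs[np.searchsorted(cum_energy, rolloff_pct * cum_energy[-1])]
    zcr = np.mean(frame[1:] * frame[:-1] < 0)     # fraction of sign changes
    return centroid, rolloff, zcr
```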

4️⃣ Embedding via wav2vec 2.0

Frames are fed into a self-supervised wav2vec 2.0 encoder producing contextual embeddings h_t ∈ ℝ^768. A transformer with 24 layers and multi-head self-attention learns latent speech representations:

\mathrm{SelfAttn}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V
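A hedged sketch of extracting the contextual embeddings with the Hugging Face transformers library; the facebook/wav2vec2-base-960h checkpoint and the synthetic waveform are placeholders, since the repository may use a different or fine-tuned encoder:

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

ckpt = "facebook/wav2vec2-base-960h"                    # illustrative checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
encoder = Wav2Vec2Model.from_pretrained(ckpt).eval()

waveform = np.random.randn(16000).astype(np.float32)   # stand-in for 1 s of 16 kHz audio
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    h = encoder(**inputs).last_hidden_state             # (1, T, 768): contextual embeddings h_t
```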

5️⃣ Emotion Classification

A linear head projects h_t to logits z ∈ ℝ^C where C = 8 emotions. Softmax yields posterior probabilities p = softmax(z). The predicted label is ŷ = argmax p.
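A minimal PyTorch sketch of such a head; the mean-pooling over time and the dummy embeddings are assumptions made for illustration:

```python
import torch
import torch.nn as nn

h = torch.randn(1, 49, 768)        # stand-in for wav2vec 2.0 embeddings of shape (1, T, 768)
head = nn.Linear(768, 8)           # C = 8 emotion classes

h_pooled = h.mean(dim=1)           # pool over time -> (1, 768); pooling choice is an assumption
z = head(h_pooled)                 # logits z
p = torch.softmax(z, dim=-1)       # posterior probabilities
y_hat = p.argmax(dim=-1)           # predicted emotion index ŷ
```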

6️⃣ Explainability with SHAP

SHAP approximates Shapley values φ_i explaining the marginal contribution of feature i to the model output:

\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}\, \bigl( f(S \cup \{i\}) - f(S) \bigr)
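A brute-force sketch that evaluates this formula exactly for a small feature set; real SHAP explainers approximate it by sampling, and the value function f here is a user-supplied placeholder:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, features):
    """Exact Shapley values; f(S) returns the model output with only features S present."""
    F = list(features)
    n = len(F)
    phi = {}
    for i in F:
        rest = [j for j in F if j != i]
        total = 0.0
        for r in range(len(rest) + 1):
            for S in combinations(rest, r):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (f(set(S) | {i}) - f(set(S)))
        phi[i] = total
    return phi

# Toy additive model: each feature's Shapley value recovers its own weight.
weights = {"centroid": 1.0, "rolloff": 2.0, "zcr": 4.0}
print(shapley_values(lambda S: sum(weights[j] for j in S), weights))
```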

Reinforcement Learning (Optional Training)

The codebase contains a Proximal Policy Optimization (PPO) pipeline for fine-tuning the emotion classifier on domain-specific data. The agent maximises the expected reward J(θ) = E_{τ∼π_θ}[R(τ)] using the clipped objective, where r_t(θ) is the ratio of the new to the old policy's probability for the sampled action and Â_t is the advantage estimate:

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\bigl( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\bigl(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\bigr)\,\hat{A}_t \bigr) \right]
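A minimal PyTorch sketch of that clipped surrogate (negated so it can be minimised); the function name, ε = 0.2, and the input tensors are illustrative rather than the repository's exact training loop:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Clipped PPO objective, negated for use with a gradient-descent optimiser."""
    ratio = torch.exp(log_probs_new - log_probs_old)               # r_t(θ)
    unclipped = ratio * advantages                                 # r_t(θ) · Â_t
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```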

View the annotated Jupyter notebooks for derivations in the repository.