
📝 Deep Math of Voice AI

Signal Processing Pipeline

1️⃣ Framing & Windowing

The discrete waveform x[n] is split into overlapping frames of 25 ms with a hop of 10 ms. A Hann window w[n] = 0.5 · (1 − cos(2πn / (N−1))) mitigates spectral leakage before the short-time Fourier transform (STFT):

X_m(k) = \sum_{n=0}^{N-1} x[n + mH]\, w[n]\, e^{-j\,2\pi k n / N}
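A minimal NumPy sketch of this framing and windowed DFT; the 16 kHz sample rate and the helper name stft_frames are illustrative, not the repository's exact code:

```python
import numpy as np

def stft_frames(x, sr=16000, frame_ms=25, hop_ms=10):
    """Split x into overlapping frames, apply a Hann window, take the DFT of each."""
    N = int(sr * frame_ms / 1000)                      # frame length in samples (25 ms)
    H = int(sr * hop_ms / 1000)                        # hop size in samples (10 ms)
    n = np.arange(N)
    w = 0.5 * (1 - np.cos(2 * np.pi * n / (N - 1)))    # Hann window w[n]
    n_frames = 1 + (len(x) - N) // H
    X = np.empty((n_frames, N // 2 + 1), dtype=complex)
    for m in range(n_frames):
        frame = x[m * H : m * H + N] * w               # x[n + mH] · w[n]
        X[m] = np.fft.rfft(frame)                      # X_m(k) for k = 0 … N/2
    return X
```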

2️⃣ Mel-Frequency Cepstral Coefficients (MFCC)

The magnitude spectrum is filtered by M triangular Mel filters. Log energies are decorrelated using the Discrete Cosine Transform (DCT) to yield cepstral coefficients c_m:

c_m = \sum_{k=1}^{K} \ln|X(k)|\, \cos\!\left( m\,\frac{(k - 0.5)\,\pi}{K} \right), \quad 0 \le m < M
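In practice the Mel filterbank, log, and DCT are usually delegated to a library. A hedged sketch using librosa; the file name, 16 kHz rate, and 13 coefficients are illustrative choices, not necessarily the app's configuration:

```python
import librosa

# Placeholder file; window/hop lengths match the 25 ms / 10 ms framing above.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)
print(mfcc.shape)  # (13, n_frames): one cepstral vector per frame
```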

3️⃣ Spectral Descriptors

The spectral centroid μ measures the "center of mass" of the spectrum, the roll-off frequency f_r encloses 85 % of the spectral energy, and the zero-crossing rate (ZCR) captures temporal sharpness:

\mu = \frac{\sum_k f_k\,|X(k)|}{\sum_k |X(k)|}, \qquad \mathrm{ZCR} = \frac{1}{N-1} \sum_{n=1}^{N-1} \mathbf{1}\bigl[\, x[n]\, x[n-1] < 0 \,\bigr]
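A minimal NumPy sketch of these three descriptors for a single frame; frame holds the time-domain samples, X_mag the magnitude spectrum, and freqs the bin frequencies (all names are illustrative):

```python
import numpy as np

def spectral_descriptors(frame, X_mag, freqs, rolloff_pct=0.85):
    """Spectral centroid, 85 % roll-off frequency, and zero-crossing rate."""
    centroid = np.sum(freqs * X_mag) / np.sum(X_mag)
    cum_energy = np.cumsum(X_mag)                 # magnitude-based; some definitions use |X|^2
    rolloff = freqs[np.searchsorted(cum_energy, rolloff_pct * cum_energy[-1])]
    zcr = np.mean(frame[1:] * frame[:-1] < 0)     # fraction of sign changes
    return centroid, rolloff, zcr
```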

4️⃣ Embedding via wav2vec 2.0

Frames are fed into a self-supervised wav2vec 2.0 encoder producing contextual embeddings h_t ∈ ℝ^768. A transformer with 24 layers and multi-head self-attention learns latent speech representations:

\mathrm{SelfAttn}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V
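A hedged sketch of extracting the contextual embeddings with the Hugging Face transformers library; the facebook/wav2vec2-base-960h checkpoint and the synthetic waveform are placeholders, since the repository may use a different or fine-tuned encoder:

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

ckpt = "facebook/wav2vec2-base-960h"                    # illustrative checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
encoder = Wav2Vec2Model.from_pretrained(ckpt).eval()

waveform = np.random.randn(16000).astype(np.float32)   # stand-in for 1 s of 16 kHz audio
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    h = encoder(**inputs).last_hidden_state             # (1, T, 768): contextual embeddings h_t
```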

5️⃣ Emotion Classification

A linear head projects h_t to logits z ∈ ℝ^C where C = 8 emotions. Softmax yields posterior probabilities p = softmax(z). The predicted label is ŷ = argmax p.
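A minimal PyTorch sketch of such a head; the mean-pooling over time and the dummy embeddings are assumptions made for illustration:

```python
import torch
import torch.nn as nn

h = torch.randn(1, 49, 768)        # stand-in for wav2vec 2.0 embeddings of shape (1, T, 768)
head = nn.Linear(768, 8)           # C = 8 emotion classes

h_pooled = h.mean(dim=1)           # pool over time -> (1, 768); pooling choice is an assumption
z = head(h_pooled)                 # logits z
p = torch.softmax(z, dim=-1)       # posterior probabilities
y_hat = p.argmax(dim=-1)           # predicted emotion index ŷ
```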

6️⃣ Explainability with SHAP

SHAP approximates Shapley values φ_i explaining the marginal contribution of feature i to the model output:

\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}\, \bigl( f(S \cup \{i\}) - f(S) \bigr)
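A brute-force sketch that evaluates this formula exactly for a small feature set; real SHAP explainers approximate it by sampling, and the value function f here is a user-supplied placeholder:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, features):
    """Exact Shapley values; f(S) returns the model output with only features S present."""
    F = list(features)
    n = len(F)
    phi = {}
    for i in F:
        rest = [j for j in F if j != i]
        total = 0.0
        for r in range(len(rest) + 1):
            for S in combinations(rest, r):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (f(set(S) | {i}) - f(set(S)))
        phi[i] = total
    return phi

# Toy additive model: each feature's Shapley value recovers its own weight.
weights = {"centroid": 1.0, "rolloff": 2.0, "zcr": 4.0}
print(shapley_values(lambda S: sum(weights[j] for j in S), weights))
```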

Reinforcement Learning (Optional Training)

The codebase contains a Proximal Policy Optimization (PPO) pipeline for fine-tuning the emotion classifier on domain-specific data. The agent maximises the expected reward J(θ) = E_{τ∼π_θ}[R(τ)] using the clipped objective, where r_t(θ) is the ratio of the new to the old policy's probability for the sampled action and Â_t is the advantage estimate:

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\bigl( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\bigl(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\bigr)\,\hat{A}_t \bigr) \right]
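A minimal PyTorch sketch of that clipped surrogate (negated so it can be minimised); the function name, ε = 0.2, and the input tensors are illustrative rather than the repository's exact training loop:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Clipped PPO objective, negated for use with a gradient-descent optimiser."""
    ratio = torch.exp(log_probs_new - log_probs_old)               # r_t(θ)
    unclipped = ratio * advantages                                 # r_t(θ) · Â_t
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```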

View the annotated Jupyter notebooks for derivations in the repository.