The VVAD-LRS3 Dataset for Visual Voice Activity Detection
Thousands of sentences from TED/TEDx videos
Extract 38 Frames Per Video (1.52s Window)
Positive Sample (Speaking)
Negative Sample (Not Speaking)
Full Face Region
Mouth Region Only
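A minimal sketch of the 38-frame extraction, assuming 25 fps source video (38 frames ≈ 1.52 s); the crop box, output size, and file name are illustrative placeholders, not the dataset's actual preprocessing.

```python
import cv2
import numpy as np

def extract_window(video_path, start_frame=0, num_frames=38, crop_box=None):
    """Read a 38-frame window (~1.52 s at 25 fps), optionally cropping a face or mouth region."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)
    frames = []
    for _ in range(num_frames):
        ok, frame = cap.read()
        if not ok:
            break
        if crop_box is not None:          # (x, y, w, h) from a face/lip detector
            x, y, w, h = crop_box
            frame = frame[y:y + h, x:x + w]
        frames.append(cv2.resize(frame, (96, 96)))
    cap.release()
    return np.stack(frames) if frames else None

# Example: full-face window for one sample (coordinates are hypothetical)
clip = extract_window("ted_clip_0001.mp4", start_frame=100, crop_box=(120, 60, 200, 200))
```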
Generate feature embeddings from the face/lip sequences using the backbone models listed below (embedding sketch after the list).
ResNet50
VideoMAE
ViViT
VGG16
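As an example of this stage, a hedged sketch using the VideoMAE backbone via Hugging Face transformers; the `MCG-NJU/videomae-base` checkpoint, the uniform subsampling to 16 frames, and mean-pooling of patch tokens are assumptions, and ResNet50/ViViT/VGG16 would be swapped in analogously.

```python
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEModel

# Checkpoint name is an assumption; any VideoMAE checkpoint with a matching processor works.
processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base").eval()

def embed_clip(frames):
    """frames: (38, H, W, 3) RGB uint8 array -> one clip-level embedding vector."""
    # VideoMAE-base expects 16 frames, so subsample the 38-frame window uniformly.
    idx = np.linspace(0, len(frames) - 1, 16).astype(int)
    inputs = processor(list(frames[idx]), return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Mean-pool the patch tokens into a single vector (pooling choice is an assumption).
    return out.last_hidden_state.mean(dim=1).squeeze(0).numpy()
```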
Train classifiers on the generated embeddings to detect speech activity (classifier sketch after the list).
Feedforward CNN
FCN
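A hedged sketch of a small feedforward classification head on the clip embeddings (PyTorch); the 768-dim input matches the VideoMAE-base example above, and the layer sizes, dropout, and optimizer are assumptions, not the original training setup.

```python
import torch
import torch.nn as nn

class EmbeddingClassifier(nn.Module):
    """Small feedforward head: clip embedding -> speaking / not-speaking logit."""
    def __init__(self, dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train_step(model, optimizer, emb, labels):
    """One optimization step with binary cross-entropy on the logits."""
    optimizer.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(model(emb), labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()

clf = EmbeddingClassifier()
opt = torch.optim.Adam(clf.parameters(), lr=1e-4)
```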
Binary classification and performance measurement.
Metrics:
Accuracy, Precision, Recall, F1-Score
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| ResNet50 | 73.80% | 0.72 | 0.72 | 0.72 |
| VideoMAE | 84.50% | 0.82 | 0.83 | 0.84 |
| ViViT | 83.00% | 0.82 | 0.82 | 0.83 |
| VGG16 | 81.20% | 0.78 | 0.81 | 0.80 |
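The four metrics above are standard and can be computed from held-out predictions, e.g. with scikit-learn; the arrays below are illustrative placeholders, not the actual evaluation data.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0]   # illustrative ground-truth labels (1 = speaking)
y_pred = [1, 0, 1, 0, 0]   # illustrative thresholded classifier outputs

print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Recall   : {recall_score(y_true, y_pred):.2f}")
print(f"F1-Score : {f1_score(y_true, y_pred):.2f}")
```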
Recompile 38-frame image sequences back into MP4 video clips using OpenCV.
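A minimal sketch of this re-encoding step with OpenCV's VideoWriter; the 25 fps rate, mp4v codec, and file name are assumptions.

```python
import cv2

def frames_to_mp4(frames, out_path, fps=25):
    """Write a (38, H, W, 3) BGR frame stack back out as an MP4 clip."""
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()

# `clip` is the 38-frame array from the extraction sketch above.
frames_to_mp4(clip, "sample_0001.mp4")
```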
Process video clips and a text prompt to generate a descriptive response.
Model: Video-LLaVA-7B-hf
Prompt: "Analyze this video... Is the person speaking or silent? Describe the visual cues."
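A hedged sketch of this step using the transformers Video-LLaVA integration; the 8-frame sampling follows the model's default, while the generation settings and the shortened prompt are assumptions.

```python
import av
import numpy as np
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

def read_video(path, num_frames=8):
    """Decode `num_frames` uniformly spaced RGB frames from an MP4 clip."""
    container = av.open(path)
    total = container.streams.video[0].frames
    keep = set(np.linspace(0, total - 1, num_frames).astype(int))
    frames = [f.to_ndarray(format="rgb24")
              for i, f in enumerate(container.decode(video=0)) if i in keep]
    return np.stack(frames)

prompt = "USER: <video>\nIs the person speaking or silent? Describe the visual cues. ASSISTANT:"
inputs = processor(text=prompt, videos=read_video("sample_0001.mp4"), return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=120)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```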
Qualitative, descriptive output in natural language.
Speaking: "The person... appears to be speaking. Their mouth is open and moving..."
Not Speaking: "The person... is silent. Their lips are closed and still..."