Visual Voice Activity Detection

The VVAD-LRS3 Dataset for Visual Voice Activity Detection

Data Source: LRS3 Dataset

Thousands of spoken sentences from TED and TEDx videos

Preprocessing

Extract 38 Frames Per Video (a 1.52 s Window at 25 fps)
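
A minimal sketch of this step, assuming clips are stored as MP4 files and that the first 38 frames of each clip are taken (file names are illustrative):

```python
import cv2

NUM_FRAMES = 38  # 1.52 s at 25 fps

def extract_frames(video_path, num_frames=NUM_FRAMES):
    """Read the first `num_frames` frames of a clip as BGR arrays."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:  # clip is shorter than the window
            break
        frames.append(frame)
    cap.release()
    return frames

frames = extract_frames("clip_0001.mp4")  # hypothetical file name
```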

Data Samples

Positive Sample (Speaking)

Negative Sample (Not Speaking)

Face Crops

Full Face Region

Lip Crops

Mouth Region Only
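
The source does not say which landmark detector produces the lip crops; as one possible sketch, dlib's 68-point model can isolate the mouth region (landmarks 48-67). The predictor file path and crop margin are assumptions:

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# 68-point landmark model from dlib.net; the path is an assumption
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_lips(frame, margin=10):
    """Return the mouth region (landmarks 48-67) of the first detected face."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = [(shape.part(i).x, shape.part(i).y) for i in range(48, 68)]
    xs, ys = zip(*pts)
    x0, x1 = min(xs) - margin, max(xs) + margin
    y0, y1 = min(ys) - margin, max(ys) + margin
    return frame[max(y0, 0):y1, max(x0, 0):x1]
```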

Approach 1: Deep Learning Classification

Embedding Generation

Generate feature embeddings from the face/lip frame sequences using one of several pretrained backbones; a sketch follows the list below.

ResNet50

VideoMAE

ViViT

VGG16
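
To illustrate one route through this step, here is a sketch using torchvision's pretrained ResNet50 as a per-frame feature extractor, with the 38 per-frame vectors mean-pooled into a single clip embedding. The mean-pooling choice is an assumption; VideoMAE and ViViT are clip-level video transformers and would embed the whole sequence at once.

```python
import torch
import torch.nn as nn
from torchvision import models
from torchvision.models import ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
backbone = models.resnet50(weights=weights)
backbone.fc = nn.Identity()        # drop the classifier head -> 2048-d features
backbone.eval()
preprocess = weights.transforms()  # resize/normalize as the weights expect

@torch.no_grad()
def embed_clip(frames):
    """frames: list of HxWx3 uint8 RGB arrays -> one 2048-d clip embedding."""
    batch = torch.stack(
        [preprocess(torch.from_numpy(f).permute(2, 0, 1)) for f in frames]
    )
    feats = backbone(batch)        # (38, 2048) per-frame embeddings
    return feats.mean(dim=0)       # average-pool over time
```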

Classifier Training

Train classifiers on the generated embeddings to detect speech activity; a minimal training sketch follows the list below.

Feedforward CNN

FCN
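
A minimal sketch of such a classifier head in PyTorch, assuming 2048-d clip embeddings (as produced by the ResNet50 sketch above) and binary labels; layer sizes and hyperparameters are illustrative, not the project's actual settings:

```python
import torch
import torch.nn as nn

class EmbeddingClassifier(nn.Module):
    """Small fully connected head on top of frozen clip embeddings."""
    def __init__(self, in_dim=2048, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden, 1),  # one logit: speaking vs. not speaking
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = EmbeddingClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(embeddings, labels):
    """embeddings: (B, 2048) float tensor; labels: (B,) float tensor of 0/1."""
    opt.zero_grad()
    loss = loss_fn(model(embeddings), labels)
    loss.backward()
    opt.step()
    return loss.item()
```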

Output & Evaluation

Binary classification and performance measurement.

Output classes: Speaking / Not Speaking

Metrics: Accuracy, Precision, Recall, F1-Score
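
These metrics can be computed directly with scikit-learn; a minimal sketch, assuming arrays of 0/1 ground-truth and predicted labels (1 = speaking):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """y_true, y_pred: iterables of 0/1 labels (1 = speaking)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

print(evaluate([1, 0, 1, 1], [1, 0, 0, 1]))  # toy example
```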

Quantitative Results (Face Modality)

Model      Accuracy   Precision   Recall   F1-Score
--------   --------   ---------   ------   --------
ResNet50   73.80%     0.72        0.72     0.72
VideoMAE   84.50%     0.82        0.83     0.84
ViViT      83.00%     0.82        0.82     0.83
VGG16      81.20%     0.78        0.81     0.80

Approach 2: Generative (Video-LLaVA)

Data Conversion

Recompile 38-frame image sequences back into MP4 video clips using OpenCV.
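
The source names OpenCV for this step; a minimal sketch, assuming 25 fps playback and the "mp4v" codec (both assumptions):

```python
import cv2

def frames_to_mp4(frames, out_path, fps=25):
    """Write a list of HxWx3 BGR frames to an MP4 clip."""
    h, w = frames[0].shape[:2]
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(out_path, fourcc, fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()

frames_to_mp4(frames, "clip_0001.mp4")  # hypothetical names
```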

Video-LLM Processing

Process video clips and a text prompt to generate a descriptive response.

Model: Video-LLaVA-7B-hf

Prompt: "Analyze this video... Is the person speaking or silent? Describe the visual cues."

Output & Analysis

Qualitative, descriptive output in natural language.

Speaking: "The person... appears to be speaking. Their mouth is open and moving..."

Not Speaking: "The person... is silent. Their lips are closed and still..."