Voice-to-text services and APIs are OK. They're like using a microwave rather than a stove-top. Technically, this method is far too easy and super lazy, and it produces errors. For example, on YouTube, do you even bother to correct the automatic captions? Most people don't!
"WebVTT spec as I understand is just for showing captions in the standard HTML5 video element, which is handy for standardized playback but doesn't help with recognition in any way."
Recognition... you're still obsessing over hearing/speaking technology. Again, the emphasis is on "talking", which is clearly not for everyone. According to the World Health Organisation, 360 million people worldwide have deafness or hearing loss.
Subtitles - The transcription or translation of the dialogue.
Captions - Similar to subtitles, but also include sound effects and other audio information.
(for the deaf & hard of hearing)
Descriptions - A separate text file that describes what happens in the video, intended to be read out by a "screen reader": a software application that provides its output via a text-to-speech synthesizer or a refreshable Braille display.
(for the blind or vision impaired)
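To make the three kinds above a bit more concrete, here is a rough Python sketch of how a WebVTT text file could be generated. It is only an illustration under my own assumptions: the names Cue, format_timestamp and build_webvtt are made up for this example and are not part of any existing service or API. The file layout (a "WEBVTT" header line, then cues with "start --> end" timings) follows the WebVTT format, and the same layout can hold subtitles, captions or descriptions depending on what text you put in the cues.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One timed piece of text (hypothetical helper type for this sketch)."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # dialogue, sound information, or a scene description


def format_timestamp(seconds: float) -> str:
    """Convert seconds to the WebVTT timestamp form hh:mm:ss.ttt."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def build_webvtt(cues: list[Cue]) -> str:
    """Build the text of a WebVTT file from a list of cues."""
    lines = ["WEBVTT", ""]  # header line, then a blank line
    for cue in cues:
        lines.append(f"{format_timestamp(cue.start)} --> {format_timestamp(cue.end)}")
        lines.append(cue.text)
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)


if __name__ == "__main__":
    # Captions: dialogue plus sound information, for deaf and hard of hearing viewers.
    captions = [
        Cue(0.0, 2.5, "[door slams]"),
        Cue(2.5, 5.0, "Who's there?"),
    ]
    # Descriptions: what is happening on screen, for blind or vision impaired viewers.
    descriptions = [
        Cue(0.0, 2.5, "A woman walks into a dark hallway."),
    ]
    print(build_webvtt(captions))
    print(build_webvtt(descriptions))
```

In an HTML5 page, each generated file would then be attached to the video with its own track element, whose kind attribute can be "subtitles", "captions" or "descriptions".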
A bit more insight on WebVTT: http://html5doctor.com/video-subtitling-and-webvtt/
Also, check out the outdated Captionator JS API: https://github.com/cgiffard/Captionator
This idea is good... seems reasonable and I approve!
I learned today... https://amara.org/en/
Neat implementation; however, I see that two kinds (Subtitle or Caption, and Description) are missing!
There doesn't seem to be one solid, complete service or API to date. Perhaps UNA could develop a "WebVTT" module?