My Virtual Personal Assistant using Web Speech API and Microsoft SAPI

I want to start a vlog on my YouTube channel, so I decided to build a virtual personal assistant that helps me record self-service video tutorials. My hope is that the assistant lets me produce videos quickly and upload them to YouTube directly, without much editing. Okay, here is my ‘recipe’ 🙂.

Web Speech API + Microsoft Speech Engine = Personal Assistant

Yes, the formula above is the key point. My personal assistant recognizes my voice through the Web Speech API, interprets the result, and then executes a command according to a script. The Google Web Speech API gives you an online speech recognition engine that converts your voice into text (speech to text), which you can then process as needed. The recognized text is passed to the script. If it matches one of the conditions defined there, the corresponding action is carried out, for example showing a presentation, activating the camera, or turning the lights on or off.
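The matching step described above could look roughly like the following sketch. The command table, trigger phrases, and action names are made up for illustration and are not my actual script; the matching rule (a simple substring check on normalized text) is also just one possible choice.

```javascript
// Hypothetical command table: each entry pairs trigger phrases with an action name.
// Phrases and action names are illustrative assumptions.
const commands = [
  { phrases: ["show presentation", "open presentation"], action: "showPresentation" },
  { phrases: ["activate camera", "turn on the camera"], action: "activateCamera" },
  { phrases: ["turn on the lights"], action: "lightsOn" },
  { phrases: ["turn off the lights"], action: "lightsOff" },
];

// Lowercase the text, replace punctuation with spaces, and collapse whitespace.
function normalize(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Return the action of the first command whose phrase occurs in the
// transcript, or null when nothing matches.
function matchCommand(transcript, table = commands) {
  const spoken = normalize(transcript);
  for (const { phrases, action } of table) {
    if (phrases.some((phrase) => spoken.includes(normalize(phrase)))) {
      return action;
    }
  }
  return null;
}
```

The returned action name can then be looked up in a map of functions that actually open the presentation, start the camera, or switch the lights.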

To make it feel more human, I use Microsoft SAPI as the speech engine (because I use a Windows laptop) to speak the responses. I chose Microsoft Andika as my default voice. You can download a voice for your language from the official Microsoft website.
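This post does not show how SAPI itself is called, so the sketch below uses a different route: the browser's SpeechSynthesis interface, which on Windows typically lists the installed Microsoft voices. The helper names `pickVoice` and `speak`, and the fallback to any voice with the language prefix "id", are assumptions for illustration.

```javascript
// Pick a voice whose name contains `nameFragment` (case-insensitive).
// Fall back to the first voice whose language starts with `langPrefix`,
// and to null when neither is available.
function pickVoice(voices, nameFragment, langPrefix) {
  const wanted = nameFragment.toLowerCase();
  const prefix = langPrefix.toLowerCase();
  return (
    voices.find((v) => v.name.toLowerCase().includes(wanted)) ||
    voices.find((v) => v.lang.toLowerCase().startsWith(prefix)) ||
    null
  );
}

// Speak a reply in the browser. Returns false when speech synthesis
// is not available (for example outside a browser).
function speak(text) {
  if (typeof window === "undefined" || !("speechSynthesis" in window)) {
    return false;
  }
  const utterance = new SpeechSynthesisUtterance(text);
  const voice = pickVoice(window.speechSynthesis.getVoices(), "Andika", "id");
  if (voice) {
    utterance.voice = voice;
    utterance.lang = voice.lang;
  }
  window.speechSynthesis.speak(utterance);
  return true;
}
```

Note that `getVoices()` can return an empty list until the browser fires its `voiceschanged` event, so the first call right after page load may fall back to the default voice.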

Personal Assistant Topology

Consider the personal assistant topology design above. Your voice enters the laptop through a microphone and is converted to a WAV file. This file is sent to the Google server via the Web Speech API, which returns the recognized text after a moment. You can then process this text in your script (I use JavaScript) to execute commands. You can watch everything I have done in the following video (in Indonesian).
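As a rough sketch of that flow in the browser, the code below wires a recognizer to two callbacks: one for recognized text and one for problems. The function name `createListener`, the callback names, and the `id-ID` language code are assumptions for illustration. Because recognition depends on a remote server, the sketch also reports the `network` error separately.

```javascript
// Build a listener around a SpeechRecognition-like constructor.
// `Recognition` is passed in so it can be the browser's constructor
// or a stand-in object when experimenting outside the browser.
function createListener(Recognition, { lang = "id-ID", onText, onProblem }) {
  const recognition = new Recognition();
  recognition.lang = lang; // assumed language code
  recognition.continuous = true; // keep listening after each phrase
  recognition.interimResults = false; // only deliver final results

  // Forward every final transcript to the command script.
  recognition.onresult = (event) => {
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      if (result.isFinal) {
        onText(result[0].transcript.trim());
      }
    }
  };

  // Report failures; "network" means the speech service was unreachable.
  recognition.onerror = (event) => {
    const message =
      event.error === "network"
        ? "Speech service unreachable; check the internet connection."
        : `Recognition error: ${event.error}`;
    onProblem(message);
  };

  return recognition;
}

// Browser-only wiring: some browsers expose the prefixed constructor.
if (typeof window !== "undefined") {
  const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
  if (Recognition) {
    createListener(Recognition, {
      onText: (text) => console.log("Heard:", text),
      onProblem: (message) => console.warn(message),
    }).start();
  }
}
```

In a real page, `onText` would hand the transcript to the command script instead of logging it.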

In summary, a personal assistant built on the Google Web Speech API has one weakness: it depends on the internet connection. If you use this technology, make sure the connection is good so that there is no delay during speech recognition. I plan to look into local speech recognition using Python + CMU Sphinx for my next project. Stay tuned.
