SphinxTiny - Sphinx3.x for mobile devices






Berkeley Institute of Design


 

SphinxTiny-0.7

Installation

SphinxTiny-0.7 requires SphinxBase-0.3, which is available at the CMU SourceForge site.

Download and untar the SphinxBase tarball, then configure and build it (you must enable the fixed-point front end):

tar -zxvf sphinxbase-0.3.tar.gz
cd sphinxbase-0.3
./configure --enable-fixed --without-lapack
make

Download the source code for SphinxTiny-0.7, then untar and build it. If you are building for an ARM device, you can enable the assembly optimization by adding --enable-armv6 to the configure script options.

tar -zxvf sphinxtiny-0.7.tar.gz
cd sphinxtiny-0.7
./configure --with-sphinxbase=`pwd`/../sphinxbase-0.3 --enable-fixed
make
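
For an ARM target, the only change to the build step above is the extra configure flag. The sketch below just assembles and prints the command line so it can be checked before it is run. The function name sphinxtiny_configure_cmd and its two arguments are illustrative and are not part of the SphinxTiny build system; any cross-compilation settings your toolchain needs are not covered here.

```shell
#!/bin/sh
# sphinxtiny_configure_cmd TARGET SPHINXBASE_DIR
# Prints the configure command for SphinxTiny-0.7 without executing it.
# (Hypothetical helper for illustration only.)
sphinxtiny_configure_cmd() {
    target="$1"          # "arm" adds the assembly optimization flag
    sphinxbase_dir="$2"  # path to the built sphinxbase-0.3 tree
    cmd="./configure --with-sphinxbase=$sphinxbase_dir --enable-fixed"
    if [ "$target" = "arm" ]; then
        cmd="$cmd --enable-armv6"
    fi
    printf '%s\n' "$cmd"
}

# Dry run: show the command for an ARM build from inside sphinxtiny-0.7.
sphinxtiny_configure_cmd arm "$(pwd)/../sphinxbase-0.3"
```

Run the printed command from the sphinxtiny-0.7 directory, followed by make, exactly as in the non-ARM build.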


Details

We have run a series of informal benchmarks comparing PocketSphinx (based on the Sphinx2 engine) to SphinxTiny (based on the Sphinx3.x engine). For more details about the differences, visit the CMU Sphinx Documentation Wiki.

We evaluated the accuracy and speed of PocketSphinx, Sphinx3-0.7 and SphinxTiny-0.7 on two platforms: a 2 GHz Intel x86 processor running Ubuntu Linux under VMWare, and a Nokia N800 Internet Tablet with a 400 MHz OMAP2420 ARM processor. As the benchmarks show, on a small-vocabulary task PocketSphinx outperforms SphinxTiny on both accuracy and speed; however, as the complexity of the acoustic and language models increases, SphinxTiny's accuracy becomes clearly better than PocketSphinx's. Unfortunately, with complex models the engine cannot reach good accuracy while decoding speech in real time. Nonetheless, SphinxTiny provides a first step toward using fully continuous acoustic models to achieve higher accuracy on mobile devices. We are currently examining which tuning parameters give the best accuracy in the shortest decoding time.
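
The speed figures in the benchmarks are reported as xRT. This page does not define the term; the sketch below assumes the usual meaning, decoding time divided by the duration of the audio, so that values above 1.0 mean slower than real time. The function name xrt and the sample timings are made up for illustration.

```shell
#!/bin/sh
# xrt DECODE_SECONDS AUDIO_SECONDS
# Prints decoding time as a multiple of audio duration (assumed definition of xRT).
xrt() {
    awk -v d="$1" -v a="$2" 'BEGIN {
        if (a <= 0) exit 1          # audio duration must be positive
        printf "%.2f\n", d / a
    }'
}

# Hypothetical timings: 35 s of processing to decode 10 s of speech.
xrt 35 10
```

With these sample timings the helper prints 3.50, which would mean decoding runs 3.5 times slower than real time.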


AN4 - 130 decoding utterances, 948 training utterances
                      VMWare                   N800
System                xRT     SER     WER      xRT     SER     WER
PocketSphinx          0.03    49.2%   14.7%    0.38    48.5%   14.9%
Sphinx3-0.7           0.10    55.4%   16.2%    19.88   53.8%   16.7%
SphinxTiny-0.7        0.22    55.4%   16.2%    2.54    53.8%   16.7%
SphinxTiny-0.7 ASM*   N/A     N/A     N/A      1.75    54.6%   16.9%
*SphinxTiny-0.7 with ARM assembly optimizations

RM1 - 365 decoding utterances, 1600 training utterances
                      VMWare                   N800
System                xRT     SER     WER      xRT     SER     WER
PocketSphinx          0.04    30%     5.8%     0.58    27.7%   5.6%
Sphinx3-0.7           0.22    36%     6.8%     24.34   32.9%   7.1%
SphinxTiny-0.7        0.38    35.5%   6.7%     3.50    33.7%   7.2%
SphinxTiny-0.7 ASM*   N/A     N/A     N/A      2.57    33.7%   7.2%
*SphinxTiny-0.7 with ARM assembly optimizations

Accuracy was determined with NIST's sclite tool, which compares the decoded speech against a human-typed transcription. PocketSphinx used 1000 senones, and the others used 1000 senones with 8 Gaussians. The beam was set to 1e-80. Increasing the number of Gaussians used by Sphinx3.x dramatically improves the accuracy, but at the cost of speed. Likewise, the other parameters may be tuned to improve speed or accuracy, but there is usually a trade-off between the two.
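
To show what the WER column measures, the sketch below computes a word-level edit distance between a reference transcript and a decoded hypothesis. It is not sclite and ignores sclite's text normalization and scoring options; it only assumes the common definition of WER as (substitutions + deletions + insertions) divided by the number of reference words. The function name wer and the example sentences are illustrative.

```shell
#!/bin/sh
# wer "REFERENCE WORDS" "HYPOTHESIS WORDS"
# Prints the word error rate as a percentage with one decimal place.
wer() {
    awk -v ref="$1" -v hyp="$2" 'BEGIN {
        n = split(ref, r, " ")      # reference words
        m = split(hyp, h, " ")      # hypothesis words
        if (n == 0) exit 1          # WER is undefined for an empty reference
        for (i = 0; i <= n; i++) d[i, 0] = i
        for (j = 0; j <= m; j++) d[0, j] = j
        for (i = 1; i <= n; i++) {
            for (j = 1; j <= m; j++) {
                # Force string comparison so numeric-looking words are not coerced.
                cost = ((r[i] "") == (h[j] "")) ? 0 : 1
                best = d[i-1, j-1] + cost                          # match or substitution
                if (d[i-1, j] + 1 < best) best = d[i-1, j] + 1     # deletion
                if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1     # insertion
                d[i, j] = best
            }
        }
        printf "%.1f\n", 100 * d[n, m] / n
    }'
}

# One deleted word out of four reference words.
wer "go to the store" "go to store"
```

The sample call prints 25.0. Sentence error rate (SER) can be read in the same spirit as the percentage of utterances with at least one such error.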

After tuning some parameters, and using 16 Gaussians instead of 8, we get the following performance decoding 250 random utterances from native English speakers in the ICSI meetings data set. Note that the numbers appear poorer than they truly are due to the transcripts containing "words" such as "<writingonwhiteboard>", which the acoustic model cannot distinguish from, say, "<mikenoise>" or "<writingonwhiteboardaudibleinbackground>".

ICSI - 250 decoding utterances (from English speakers), 90314 training utterances
                                   VMWare                   N800
System                             xRT     SER     WER      xRT     SER     WER
PocketSphinx 1000 senones          0.63    80%     57.0%    14.73   81.3%   57.1%
PocketSphinx 2000 senones          0.48    80%     54.3%    11.77   80.4%   54.4%
Sphinx3-0.7                        0.67    80.4%   42.8%    66.35   80.8%   42.3%
SphinxTiny-0.7                     0.94    79.2%   42.4%    9.62    79.6%   43.0%
SphinxTiny-0.7 ASM*                N/A     N/A     N/A      7.84    80.0%   42.8%
SphinxTiny-0.7 ASM* (4 Gaussians)  N/A     N/A     N/A      4.72    85.2%   52.6%
*SphinxTiny-0.7 with ARM assembly optimizations

With larger acoustic and language models, SphinxTiny provides a lower word error rate than PocketSphinx and, on the N800, faster decoding. Thus, from our experimentation, PocketSphinx is clearly superior when using small acoustic and language models for real-time recognition, but for tasks that allow larger delays in exchange for better accuracy, SphinxTiny is the better choice. Unsurprisingly, this reflects the general philosophy of the Sphinx2 and Sphinx3 recognition engines on which PocketSphinx and SphinxTiny are based.
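
To make the N800 speed comparison concrete, the xRT values from the tables above can be divided directly. The helper below is only arithmetic on the published figures; the function name speedup is illustrative.

```shell
#!/bin/sh
# speedup BASELINE_XRT CANDIDATE_XRT
# Prints how many times faster the candidate decodes than the baseline.
speedup() {
    awk -v b="$1" -v c="$2" 'BEGIN {
        if (c <= 0) exit 1          # candidate xRT must be positive
        printf "%.2f\n", b / c
    }'
}

# N800 xRT values taken from the ICSI, AN4, and RM1 tables.
speedup 14.73 9.62   # PocketSphinx (1000 senones) vs SphinxTiny-0.7 on ICSI
speedup 2.54 1.75    # SphinxTiny-0.7 vs SphinxTiny-0.7 ASM on AN4
speedup 3.50 2.57    # SphinxTiny-0.7 vs SphinxTiny-0.7 ASM on RM1
```

On these figures the ARM assembly build decodes roughly 1.4 times faster than the plain build on AN4 and RM1, and SphinxTiny-0.7 decodes about 1.5 times faster than PocketSphinx with 1000 senones on the ICSI task.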




Seth Horrigan
eomer At cs.berkeley.edu

John Canny
jfc At eecs.berkeley.edu

last updated 16.07.08