SphinxTiny-0.7
Installation
SphinxTiny-0.7 requires SphinxBase-0.3, which is available from the CMU SourceForge site.
Download and untar the SphinxBase tarball, then configure and build it (you must enable the fixed-point front end):
tar -zxvf sphinxbase-0.3.tar.gz
cd sphinxbase-0.3
./configure --enable-fixed --without-lapack
make
Download the source code for SphinxTiny-0.7, then untar and build it.
If you are building for an ARM device, you can enable the assembly optimization by adding --enable-armv6
to the configure script options.
tar -zxvf sphinxtiny-0.7.tar.gz
cd sphinxtiny-0.7
./configure --with-sphinxbase=`pwd`/../sphinxbase-0.3 --enable-fixed
make
Details
We have run a series of informal benchmarks comparing PocketSphinx
(based on the Sphinx2 engine) to SphinxTiny (based on the Sphinx3.x engine). For more details about the differences,
visit the CMU Sphinx Documentation Wiki.
We evaluated the accuracy and speed of PocketSphinx, Sphinx3-0.7 and SphinxTiny-0.7 on both a 2 GHz Intel x86 processor running
Ubuntu Linux under VMWare and a Nokia N800 Internet Tablet
using the 400 MHz OMAP2420 ARM processor. As the benchmarks show, on a small-vocabulary task PocketSphinx
outperforms SphinxTiny on both accuracy and speed; however, as the complexity of the acoustic and language models increases,
SphinxTiny's accuracy becomes clearly better than PocketSphinx's. Unfortunately, with complex models the engine cannot reach
good accuracy while decoding speech in real time. Nonetheless, SphinxTiny is a first step toward using fully continuous
acoustic models to achieve higher accuracy on mobile devices. We are currently examining which tuning parameters give the
best accuracy in the least time.
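The speed figures in the tables below are reported as xRT. A minimal sketch of how such a figure is conventionally computed, assuming xRT means decoding time divided by audio duration (so values at or below 1.0 mean real-time decoding). The function names and the 10-second utterance are illustrative only and are not part of SphinxTiny:

```python
def real_time_factor(decode_seconds: float, audio_seconds: float) -> float:
    """Return xRT: time spent decoding divided by the duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return decode_seconds / audio_seconds


def is_real_time(xrt: float) -> bool:
    """A decoder keeps up with live speech when it needs at most 1x real time."""
    return xrt <= 1.0


# Hypothetical example: a 10-second utterance that takes 25.4 seconds to decode.
xrt = real_time_factor(25.4, 10.0)
print(f"{xrt:.2f} xRT, real time: {is_real_time(xrt)}")
```

Under this reading, an xRT of 2.54 means the device needs about two and a half seconds of processing for every second of speech.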
AN4 - 130 decoding utterances, 948 training utterances
| System | VMWare xRT | VMWare SER | VMWare WER | N800 xRT | N800 SER | N800 WER |
|---|---|---|---|---|---|---|
| PocketSphinx | 0.03 | 49.2% | 14.7% | 0.38 | 48.5% | 14.9% |
| Sphinx3-0.7 | 0.10 | 55.4% | 16.2% | 19.88 | 53.8% | 16.7% |
| SphinxTiny-0.7 | 0.22 | 55.4% | 16.2% | 2.54 | 53.8% | 16.7% |
| SphinxTiny-0.7 ASM* | N/A | N/A | N/A | 1.75 | 54.6% | 16.9% |
*SphinxTiny-0.7 with ARM assembly optimizations
RM1 - 365 decoding utterances, 1600 training utterances
| System | VMWare xRT | VMWare SER | VMWare WER | N800 xRT | N800 SER | N800 WER |
|---|---|---|---|---|---|---|
| PocketSphinx | 0.04 | 30% | 5.8% | 0.58 | 27.7% | 5.6% |
| Sphinx3-0.7 | 0.22 | 36% | 6.8% | 24.34 | 32.9% | 7.1% |
| SphinxTiny-0.7 | 0.38 | 35.5% | 6.7% | 3.50 | 33.7% | 7.2% |
| SphinxTiny-0.7 ASM* | N/A | N/A | N/A | 2.57 | 33.7% | 7.2% |
*SphinxTiny-0.7 with ARM assembly optimizations
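The N800 speed figures in the AN4 and RM1 tables can be turned into speedup ratios with simple arithmetic. The sketch below copies the xRT values from those tables; the helper names are illustrative and the ratios are only as precise as the two-decimal figures they come from:

```python
# N800 real-time factors (xRT) copied from the AN4 and RM1 tables.
N800_XRT = {
    "AN4": {"Sphinx3-0.7": 19.88, "SphinxTiny-0.7": 2.54, "SphinxTiny-0.7 ASM": 1.75},
    "RM1": {"Sphinx3-0.7": 24.34, "SphinxTiny-0.7": 3.50, "SphinxTiny-0.7 ASM": 2.57},
}


def speedup(baseline_xrt: float, new_xrt: float) -> float:
    """How many times faster the new configuration is than the baseline."""
    return baseline_xrt / new_xrt


def speedup_report(xrt_by_task=N800_XRT):
    """Build one summary line per task from the table values."""
    lines = []
    for task, xrt in xrt_by_task.items():
        tiny = speedup(xrt["Sphinx3-0.7"], xrt["SphinxTiny-0.7"])
        asm = speedup(xrt["SphinxTiny-0.7"], xrt["SphinxTiny-0.7 ASM"])
        lines.append(
            f"{task}: SphinxTiny is {tiny:.1f}x faster than Sphinx3; "
            f"the ARM assembly build is a further {asm:.2f}x faster"
        )
    return lines


for line in speedup_report():
    print(line)
```

On these numbers, SphinxTiny runs roughly 7 to 8 times faster than Sphinx3 on the N800, and the ARM assembly build gains roughly another 1.4x, with error rates that stay within a point of the Sphinx3 figures.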
Accuracy was determined with NIST's sclite tool, comparing a human-typed transcription to the decoded speech. PocketSphinx
used 1000 senones, and the other engines used 1000 senones with 8 Gaussians. The beam was set to 1e-80. Increasing the number of
Gaussians used by Sphinx3.x dramatically improves accuracy, but at the cost of speed. Other parameters can likewise be
tuned to improve speed or accuracy, but there is usually a trade-off between the two.
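For readers unfamiliar with the SER and WER columns, the following is a simplified sketch of how sentence error rate and word error rate are conventionally defined. It is not sclite's implementation (sclite also handles transcript normalization and detailed alignment reports); it only computes a word-level edit distance. The two transcript pairs are invented for illustration:

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum number of word substitutions, deletions and insertions
    needed to turn the hypothesis into the reference (Levenshtein distance)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def score(pairs):
    """Return (WER, SER) as fractions for (reference, hypothesis) pairs."""
    total_words = sum(len(ref.split()) for ref, _ in pairs)
    total_errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    wrong_sentences = sum(1 for ref, hyp in pairs if ref.split() != hyp.split())
    return total_errors / total_words, wrong_sentences / len(pairs)


# Invented transcripts: the first is decoded perfectly, the second has two word errors.
pairs = [
    ("go forward ten meters", "go forward ten meters"),
    ("rubout g m e f three nine", "rubout g m e three nine nine"),
]
wer, ser = score(pairs)
print(f"WER {wer:.1%}, SER {ser:.1%}")
```

Because a single wrong word makes the whole sentence count as an error, SER is typically much higher than WER, which matches the pattern in the tables.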
After tuning some parameters and using 16 Gaussians instead of 8, we obtained the following performance when decoding 250
random utterances from native English speakers in the ICSI meetings data set. Note that the error rates look worse
than they truly are because the transcripts contain "words" such as "<writingonwhiteboard>", which the acoustic
model cannot distinguish from, say, "<mikenoise>" or "<writingonwhiteboardaudibleinbackground>".
ICSI - 250 decoding utterances (from English speakers), 90314 training utterances
| System | VMWare xRT | VMWare SER | VMWare WER | N800 xRT | N800 SER | N800 WER |
|---|---|---|---|---|---|---|
| PocketSphinx 1000 senones | 0.63 | 80% | 57.0% | 14.73 | 81.3% | 57.1% |
| PocketSphinx 2000 senones | 0.48 | 80% | 54.3% | 11.77 | 80.4% | 54.4% |
| Sphinx3-0.7 | 0.67 | 80.4% | 42.8% | 66.35 | 80.8% | 42.3% |
| SphinxTiny-0.7 | 0.94 | 79.2% | 42.4% | 9.62 | 79.6% | 43.0% |
| SphinxTiny-0.7 ASM* | N/A | N/A | N/A | 7.84 | 80.0% | 42.8% |
| SphinxTiny-0.7 ASM* (4 Gaussians) | N/A | N/A | N/A | 4.72 | 85.2% | 52.6% |
*SphinxTiny-0.7 with ARM assembly optimizations
With larger acoustic and language models, SphinxTiny provides better accuracy than PocketSphinx and, on the N800,
better speed as well. Thus, from our experimentation, PocketSphinx is clearly superior when using small acoustic and language
models for real-time recognition, but for tasks that allow larger delays in exchange for better accuracy,
SphinxTiny is the better choice. Unsurprisingly, this reflects the general philosophy of the Sphinx2 and
Sphinx3 recognition engines on which PocketSphinx and SphinxTiny are based.
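One way to make this speed-versus-accuracy choice concrete is to pick the most accurate configuration that fits a given delay budget. The sketch below uses the N800 xRT and WER figures from the ICSI table; the `Result` type, the `best_under_budget` helper and the budget values are illustrative assumptions, not part of any Sphinx tool:

```python
from typing import NamedTuple, Optional, Sequence


class Result(NamedTuple):
    system: str
    xrt: float  # decoding time as a multiple of real time
    wer: float  # word error rate, in percent


# N800 columns of the ICSI table.
ICSI_N800 = [
    Result("PocketSphinx 1000 senones", 14.73, 57.1),
    Result("PocketSphinx 2000 senones", 11.77, 54.4),
    Result("Sphinx3-0.7", 66.35, 42.3),
    Result("SphinxTiny-0.7", 9.62, 43.0),
    Result("SphinxTiny-0.7 ASM", 7.84, 42.8),
    Result("SphinxTiny-0.7 ASM (4 Gaussians)", 4.72, 52.6),
]


def best_under_budget(results: Sequence[Result], max_xrt: float) -> Optional[Result]:
    """Return the lowest-WER configuration whose xRT fits the budget, if any."""
    eligible = [r for r in results if r.xrt <= max_xrt]
    return min(eligible, key=lambda r: r.wer) if eligible else None


for budget in (1.0, 5.0, 10.0, 100.0):
    choice = best_under_budget(ICSI_N800, budget)
    name = choice.system if choice else "no configuration fits"
    print(f"budget {budget:>5.1f} xRT -> {name}")
```

With these figures no configuration decodes ICSI in real time on the N800, and once the budget reaches about 8 xRT the assembly-optimized SphinxTiny build is the most accurate option until the budget is large enough for Sphinx3 itself.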