Shore, J. “Software Tools for Speech Research and Development”
Digital Signal Processing Handbook
Ed. Vijay K. Madisetti and Douglas B. Williams
Boca Raton: CRC Press LLC, 1999
© 1999 by CRC Press LLC
50
Software Tools for Speech Research
and Development
John Shore
Entropic Research
Laboratory, Inc.
50.1 Introduction
50.2 Historical Highlights
50.3 The User’s Environment (OS-Based vs. Workspace-Based)
Operating-System-Based Environment • Workspace-Based Environment
50.4 Compute-Oriented vs. Display-Oriented
Compute-Oriented Software • Display-Oriented Software • Hybrid Compute/Display-Oriented Software
50.5 Compiled vs. Interpreted
Interpreted Software • Compiled Software • Degree of Specialization • Support for Speech Input and Output
50.10 File Formats (Data Import/Export)
50.11 Speech Databases
50.12 Summary of Characteristics and Uses
50.13 Sources for Finding Out What is Currently Available
50.14 Future Trends
References
50.1 Introduction
Experts in every field of study depend on specialized tools. In the case of speech research and
development, the dominant tools today are computer programs. In this article, we present an
overview of key technical approaches and features that are prevalent today.
We restrict the discussion to software intended to support R&D, as opposed to software for commercial
applications of speech processing. For example, we ignore DSP programming (which is
discussed in the previous article). Also, we concentrate on software intended to support the specialities
of speech analysis, coding, synthesis, and recognition, since these are the main subjects of this
chapter. However, much of what we have to say applies as well to the needs of those in such closely
related areas as psycho-acoustics, clinical voice analysis, sound and vibration, etc.
We do not attempt to survey available software packages, as the result would likely be obsolete by
the time this book is printed. The examples mentioned are illustrative, and not intended to provide a
thorough or balanced review. Our aim is to provide sufficient background so that readers can assess
their needs and understand the differences among available tools. Up-to-date surveys are readily
available online (see Section 50.13).
In general, there are three common uses of speech R&D software:
• Teaching, e.g., homework assignments for a basic course in speech processing
(from The Mathworks).
50.3.1 Operating-System-Based Environment
In this approach, signals are represented as files under the native operating system (e.g., Unix, DOS),
and the software consists of a set of programs that can be invoked separately to process or display
signals in various ways. Thus, the user sees the software as an extension of an already-familiar oper-
ating system. Because signals are represented as files, the speech software inherits file manipulation
capabilities from the operating system. Under Unix, for example, signals can be copied and moved
respectively using the cp and mv programs, and they can be organized as directory trees in the Unix
hierarchical file system (including NFS).
Similarly, the speech software inherits extension capabilities inherent in the operating system.
Under Unix, for example, extensions can be created using shell scripts in various languages (sh, csh, Tcl,
perl, etc.), as well as such facilities as pipes and remote execution. OS-based speech software packages
are often called command-line packages because usage typically involves providing a sequence of
commands to some type of shell.
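As a minimal sketch of this style (not taken from any particular package), the following Python code behaves like a small filter program: it reads raw samples from an input stream, scales them, and writes them to an output stream, so that several such programs can be chained through pipes. The function names, the `gain` argument, and the choice of raw 16-bit little-endian samples are assumptions made for illustration.

```python
import io
import struct


def scale_samples(raw: bytes, gain: float) -> bytes:
    """Scale raw 16-bit little-endian samples by gain, clipping to the int16 range."""
    count = len(raw) // 2
    samples = struct.unpack("<%dh" % count, raw[: 2 * count])
    scaled = [max(-32768, min(32767, int(round(s * gain)))) for s in samples]
    return struct.pack("<%dh" % count, *scaled)


def run_filter(stdin, stdout, gain: float) -> None:
    """Filter one binary stream into another, as a command-line program would."""
    # As a command this would be called with sys.stdin.buffer and
    # sys.stdout.buffer, e.g. (hypothetical):  scale 0.5 < in.raw > out.raw
    stdout.write(scale_samples(stdin.read(), gain))


# Two stages chained through in-memory streams, standing in for a shell pipe.
source = io.BytesIO(struct.pack("<3h", 100, -200, 30000))
middle = io.BytesIO()
run_filter(source, middle, 2.0)
middle.seek(0)
result = io.BytesIO()
run_filter(middle, result, 0.5)
```

Because each stage only reads and writes a byte stream, the operating system's own facilities (redirection, pipes, scripts) supply the composition mechanism.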
50.3.2 Workspace-Based Environment
In this approach, the user interacts with a single application program that takes over from the
operating system. Signals, which may or may not correspond to files, are typically represented as
variables in some kind of virtual space. Various commands are available to process or display the
signals. Such a workspace is often analogous to a personal blackboard.
Workspace-based systems usually offer means for saving the current workspace contents and for
loading previously saved workspaces.
An extension mechanism is typically provided by a command interpreter for a simple language that
includes the available operations and a means for encapsulating and invoking command sequences
(e.g., in a function or procedure definition). In effect, the speech software provides its own shell to
the user.
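The following toy workspace is a sketch of these ideas, under our own assumptions rather than a model of any existing product: signals are named variables, a tiny command language operates on them, command sequences can be encapsulated as new commands, and the whole workspace can be saved and reloaded. The command names (`scale`, `add`) and the JSON storage are hypothetical choices.

```python
import json


class Workspace:
    """Toy workspace holding named signals (hypothetical, for illustration)."""

    def __init__(self):
        self.signals = {}       # variable name -> list of samples
        self.procedures = {}    # procedure name -> list of command lines

    def execute(self, line):
        """Run one command line such as 'scale y x 0.5'."""
        name, *args = line.split()
        if name == "scale":                 # scale DST SRC GAIN
            dst, src, gain = args
            self.signals[dst] = [v * float(gain) for v in self.signals[src]]
        elif name == "add":                 # add DST A B
            dst, a, b = args
            self.signals[dst] = [p + q for p, q in
                                 zip(self.signals[a], self.signals[b])]
        elif name in self.procedures:       # user-defined command sequence
            for step in self.procedures[name]:
                self.execute(step)
        else:
            raise ValueError("unknown command: " + name)

    def define(self, name, lines):
        """Encapsulate a command sequence under a new command name."""
        self.procedures[name] = list(lines)

    def save(self, path):
        """Write the current workspace contents to a file."""
        with open(path, "w") as f:
            json.dump({"signals": self.signals,
                       "procedures": self.procedures}, f)

    @classmethod
    def load(cls, path):
        """Restore a previously saved workspace."""
        ws = cls()
        with open(path) as f:
            state = json.load(f)
        ws.signals, ws.procedures = state["signals"], state["procedures"]
        return ws
```

Here the procedures take no arguments, which keeps the sketch short; a practical command language would at least add parameters and local variables.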
50.4 Compute-Oriented vs. Display-Oriented
This distinction concerns whether the speech software emphasizes computation or visualization or
Here we distinguish according to whether the bulk of the signal processing or display code (whether
written by developers or users) is interpreted or compiled.
50.5.1 Interpreted Software
The interpreter language may be specially designed for the software (e.g., S-PLUS from Statistical
Sciences, Inc., and MATLAB™), or may be an existing, general purpose language (e.g., LISP is used
in N!Power from Signal Technology, Inc.).
Compared to compiled languages, interpreted languages tend to be simpler and easier to learn.
Furthermore, it is usually easier and faster to write and test programs under an interpreter. The
disadvantage, relative to compiled languages, is that the resulting programs can be quite slow to run.
As a result, interpreted speech software is usually better suited for teaching and interactive exploration
than for batch experiments.
50.5.2 Compiled Software
Compared to interpreted languages, compiled languages (e.g., FORTRAN, C, C++) tend to be more
complicated and harder to learn. Compared to interpreted programs, compiled programs are slower
to write and test, but considerably faster to run. As a result, compiled speech software is usually
better suited for batch experiments than for teaching.
50.5.3 Hybrid Interpreted/Compiled Software
Some interpreters make it possible to create new language commands with an underlying implementation
that is compiled. This allows a hybrid approach that can combine the best of both.
Some languages provide a hybrid approach in which the source code is pre-compiled quickly into
intermediate code that is then (usually!) interpreted. Java is a good example.
If compiled speech software is OS-based, signal processing scripts can typically be written in an
interpretive language (e.g., a sh script containing a sequence of calls to ESPS programs). Thus, hybrid
systems can also be based on compiled software.
50.5.4 Computation vs. Display
The distinction between compiled and interpreted languages is relevant mostly to the computational
aspects of the speech software. However, the distinction can apply as well to display software, since
mouse-click operations). The operation to be performed is then specified by mouse click operations
on screen buttons, pull-down menus, or pop-up menus.
This style works very well for unary operations (e.g., compute and display the spectrogram of a
given signal segment), and moderately well for binary operations (e.g., add two signals). But it is
awkward for operations that have more than two inputs. It is also awkward for specifying chained
calculations, especially if you want to repeat the calculations for a new set of signals.
One solution to these problems is provided by a “calculator-style” interface that looks and acts like
a familiar arithmetic calculator (except the operands are signal names and the operations are signal
processing operations).
Another solution is the “spreadsheet-style” interface. The analogy with spreadsheets is tight.
Imagine a spreadsheet in which the cells are replaced by images (waveforms, spectrograms, etc.)
connected logically by formulas. For example, one cell might show a test signal, a second might show
the results of filtering it, and a third might show a spectrogram of a portion of the filtered signal.
This exemplifies a spreadsheet-style interface for speech software.
A spreadsheet-style interface provides some means for specifying the “formulas” that relate the
various “cells”. This formula interface might itself be implemented in a point-and-click fashion,
or it might permit direct entry of formulas in some interpretive language. Speech software with
a spreadsheet-style interface will maintain consistency among the visible signals. Thus, if one of
the signals is edited or replaced, the other signal graphics change correspondingly, according to the
underlying formulas.
DADisp (from DSP Development Corporation) is an example of a spreadsheet-style interface.
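The consistency property can be sketched in a few lines of Python. This is an illustration under our own assumptions, not a description of how any product works: each cell holds either a source signal or a formula over other cells, and formula cells are evaluated on demand so that they always reflect the current sources. The class `Sheet` and the helper functions `smooth` and `energy` are hypothetical.

```python
class Sheet:
    """Toy spreadsheet of signals: each cell is a source signal or a formula."""

    def __init__(self):
        self.values = {}     # cell name -> source signal (list of samples)
        self.formulas = {}   # cell name -> (function, names of input cells)

    def set_signal(self, name, samples):
        """Store or replace a source signal in a cell."""
        self.values[name] = list(samples)
        self.formulas.pop(name, None)

    def set_formula(self, name, func, *inputs):
        """Define a cell as a function of other cells."""
        self.formulas[name] = (func, inputs)
        self.values.pop(name, None)

    def get(self, name):
        """Evaluate a cell on demand, so results follow the current sources."""
        if name in self.formulas:
            func, inputs = self.formulas[name]
            return func(*(self.get(i) for i in inputs))
        return self.values[name]


def smooth(x):
    """Two-point moving average, standing in for a filtering operation."""
    return [(a + b) / 2 for a, b in zip(x, x[1:])]


def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


sheet = Sheet()
sheet.set_signal("test", [0, 2, 4])
sheet.set_formula("filtered", smooth, "test")
sheet.set_formula("energy", energy, "filtered")
```

Replacing the contents of the `test` cell changes what `filtered` and `energy` return on the next request. Recomputing on every request is the simplest way to stay consistent; an interactive system would more likely cache results and invalidate them when a source changes.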
Visual Interfaces for Compute-Oriented Software
In a visual interface for display-oriented software, the focus is on the signals themselves. In
a visual interface for compute-oriented software, on the other hand, the focus is on the operations.
Operations among signals typically are represented as icons with one or more input and output lines
that interconnect the operations. In effect, the representation of a signal is reduced to a straight
line indicating its relationship (input or output) with respect to operations. Such visual interfaces
• default values
• values from a global parameter file read by all programs
• values from a program-specific parameter file
• values from the command line
• values from the user in response to run-time prompts
In some situations, it is helpful if a current default value is replaced by the most recent input from
a given parameter source. We refer to this property as “parameter persistence”.
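One way to combine such parameter sources is sketched below. The sketch assumes that sources later in the list above override earlier ones (prompts over command line over program-specific file over global file over defaults); that ordering, the function names, and the parameter names `order` and `window` are our assumptions for illustration.

```python
from collections import ChainMap


def resolve_parameters(defaults, global_file=None, program_file=None,
                       command_line=None, prompted=None):
    """Merge parameter sources into one dict of effective values."""
    # ChainMap searches its maps left to right, so the source assumed to
    # have the highest priority is listed first.
    sources = [prompted, command_line, program_file, global_file, defaults]
    return dict(ChainMap(*[s for s in sources if s]))


def remember(defaults, source):
    """Parameter persistence: the most recent input becomes the new default."""
    updated = dict(defaults)
    updated.update(source or {})
    return updated


defaults = {"order": 10, "window": "hamming"}
params = resolve_parameters(defaults,
                            global_file={"order": 12},
                            command_line={"order": 16})
next_defaults = remember(defaults, {"order": 16})
```

In this usage the command-line value of `order` wins over both the global file and the built-in default, while `window` falls through to its default; `remember` then carries the command-line value forward as the default for the next run.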
50.7 Extensibility (Closed vs. Open Systems)
Speech software is “closed” if there is no provision for the user to extend it. There is a fixed set of
operations available to process and display signals. What you get is all you get.
OS-based systems are always extensible to a degree because they inherit scripting capabilities from
the OS, which permits the creation of new commands. They may also provide programming libraries
so that the user can write and compile new programs and use them as commands.
Workspace-based systems may be extensible if they are based on an interpreter whose programming
language includes the concept of an encapsulated procedure. If so, then users can write scripts that
define new commands. Some systems also allow the interpreter to be extended with commands that
are implemented by underlying code in C or some other compiled language.
In general, for speech software to be extensible, it must be possible to specify operations (see
Section 50.6) and also to re-use the resulting specifications in other contexts. A block-diagram
interface is extensible, for example, if a given diagram can be reduced to an icon that is available for
use as a single block in another diagram.
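The block-diagram case can be illustrated with a short sketch. It assumes, for simplicity, that every block has one input and one output and that a diagram is a simple chain; the names `Block` and `collapse` are hypothetical.

```python
class Block:
    """A processing block with one input signal and one output signal."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def __call__(self, signal):
        return self.func(signal)


def collapse(name, *blocks):
    """Reduce a chain of blocks to a single block usable in other diagrams."""
    def run(signal):
        for block in blocks:
            signal = block(signal)
        return signal
    return Block(name, run)


gain = Block("gain", lambda x: [2 * v for v in x])
rectify = Block("rectify", lambda x: [abs(v) for v in x])

# A two-block diagram becomes one block, which is then reused in a larger chain.
front_end = collapse("front_end", gain, rectify)
pipeline = collapse("pipeline", front_end, gain)
```

Because `collapse` returns an ordinary `Block`, the result of one diagram can appear inside another, which is the re-use property that makes the interface extensible.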
For speech software with visual interfaces, extensibility considerations also include the ability to
specify new GUI controls (visible menus and buttons), the ability to tie arbitrary internal and external
computations to GUI controls, and the ability to define new display methods for new signal types.
In general, extended commands may behave differently from the built-in commands provided with
the speech software. For example, built-in commands may share a common user interface that is
difficult to implement in an independent script or program (such a common interface might provide
produced. Speech software can help here by creating appropriate records as signal and parameter
files are processed.
The most common method for recording this information about a given signal is to put it in the
same file as the signal. Most modern speech software uses a file format that includes a “file header”
that is used for this purpose. Most systems store at least some information in the header, e.g., the
sampling rate of the signal. Others, such as ESPS, attempt to store all relevant information. In this
approach, the header of a signal file produced by any program includes the program name, values
of processing parameters, and the names and headers of all source files. The header is a recursive
structure, so that the headers of the source files themselves contain the names and headers of files that
were prior sources. Thus, a signal file header contains the headers of all source files in the processing
chain. It follows that files contain a complete history of the origin of the data in the file and all
the intermediate processing steps. The importance of record keeping grows with the complexity of
computation chains and the extent of available parametric control.
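A recursive header of this kind can be sketched as a small data structure. The sketch below illustrates the idea only and is not the ESPS file format; the field names, file names, and program names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Header:
    """Record of how one signal file was produced."""
    file_name: str
    program: str                                            # producing program
    parameters: Dict[str, object] = field(default_factory=dict)
    sources: List["Header"] = field(default_factory=list)   # source file headers


def history(header, depth=0):
    """Return the processing chain as indented lines, most recent step first."""
    lines = ["  " * depth + "%s <- %s %s"
             % (header.file_name, header.program, header.parameters)]
    for src in header.sources:
        lines.extend(history(src, depth + 1))
    return lines


# Hypothetical chain: a recording is filtered, then a spectrum is computed.
recorded = Header("speech.sd", "record", {"sampling_rate": 16000})
filtered = Header("filtered.sd", "filter", {"cutoff_hz": 4000}, [recorded])
spectrum = Header("spec.fspec", "spectrum", {}, [filtered])
lines = history(spectrum)
```

Each header embeds the headers of its sources, so walking the structure from the final file recovers every intermediate step and its parameters.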
50.9.3 Personalization
There is considerable variation in the extent to which speech software can be customized to suit
personal requirements and tastes. Some systems cannot be personalized at all; they start out the
same way, every time. But most systems store personal preferences and use them again next time.
Savable preferences may include color selections, button layout, button semantics, menu contents,
currently loaded signals, visible windows, window arrangement, and default parameter sets for speech
processing operations.
At the extreme, some systems can save a complete “snapshot” that permits exact resumption. This
is particularly important for the interactive study of complicated signal configurations across repeated
software sessions.
50.9.4 Real-Time Performance
Software is generally described as “real-time” if it is able to keep up with relevant, changing inputs.
In the case of speech software, this usually means that the software can keep up with input speech.
Even this definition is not particularly meaningful unless the input speech is itself coming from a
If you intend to run the speech software on several platforms that have different underlying numeric
representations (a byte order difference being most likely), then it is important to know whether the
file formats and signal I/O software support transparent data exchange.
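The byte order issue is easy to demonstrate. In the sketch below, 16-bit samples are written with an explicit byte order and then read back both correctly and incorrectly. The function names are hypothetical, and passing the byte order as an argument is a simplification: a format that supports transparent exchange would record it in the file itself.

```python
import struct


def write_samples(samples, byte_order="<"):
    """Pack 16-bit samples with an explicit byte order ('<' little, '>' big)."""
    return struct.pack("%s%dh" % (byte_order, len(samples)), *samples)


def read_samples(data, byte_order="<"):
    """Unpack 16-bit samples, interpreting the bytes in the given order."""
    count = len(data) // 2
    return list(struct.unpack("%s%dh" % (byte_order, count), data[: 2 * count]))


data = write_samples([1, -2, 256], ">")   # written on a big-endian convention
correct = read_samples(data, ">")         # [1, -2, 256]
swapped = read_samples(data, "<")         # [256, -257, 1]
```

Reading with the wrong byte order raises no error; it silently yields different sample values, which is why the file format and the signal I/O software need to handle the conversion.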
50.9.8 Degree of Specialization
Some speech software is intended for general purpose work in speech (e.g., ESPS/waves, MATLAB™).
Other software is intended for more specialized usage. Some of the areas where specialized software
tools may be relevant include linguistics, recognition, synthesis, coding, psycho-acoustics, clinical
voice, music, multi-media, sound and vibration, etc. Two examples are HTK for recognition, and
Delta (from Eloquent Technology) for synthesis.
50.9.9 Support for Speech Input and Output
In the past, built-in speech I/O hardware was uncommon in workstations and PCs, so speech software
typically supported speech I/O by means of add-on hardware supplied with the software or available
from other third parties. This provided the desired capability, albeit with the disadvantages mentioned
earlier (see Section 50.9.6).
Today most workstations and PCs have built-in audio support that can be used directly by the speech
software. This avoids the disadvantages of add-on hardware, but the resulting A/D-D/A quality can
be too noisy or otherwise inadequate for use in speech R&D (the built-in audio is typically designed
for more mundane requirements). There are various reasons why special-purpose hardware may still
be needed, including:
• need for more than two channels
• need for very high sampling rates
• compatibility with special hardware (e.g., DAT tape)
50.10 File Formats (Data Import/Export)
Signal file formats are fundamentally important because they determine how easy it is for independent
programs to read and write the files (interoperability). Furthermore, the format determines whether
context of intended use.
TABLE 50.1 Relative Importance of Software Characteristics
Teaching    Interactive exploration    Batch experiments
OS-based (50.3.1) •
Workspace-based (50.3.2) •
Compute-oriented (50.4.1) •
Display-oriented (50.4.2) ••
Compiled (50.5.2) •
Interpreted (50.5.1) •
Text-based interface (50.6.1) ••
Visual interface (50.6.2) ••
Memory-based (50.9.1) ••
File-based (50.9.1) •
Parametric control (50.6.3) •• •
Consistency maintenance (50.8.0) ••
History documentation (50.9.2) ••
Extensibility (50.7.0) ••
Personalization (50.9.3) ••
Real-time performance (50.9.4) ••
Source availability (50.9.5) ••
Cross-platform compatibility (50.9.7) •• •
Support for speech I/O (50.9.9) ••
50.13 Sources for Finding Out What is Currently Available
The best single online source of general information is the Internet news group comp.speech, and in
particular its FAQ (see http://svr-www.eng.cam.ac.uk/comp.speech/). Use this as a starting
point.
Here are some other WWW sites that (at this writing) contain speech software information or
pointers to other sites:
http://svr-www.eng.cam.ac.uk