
Brief history of speech recognition over the telephone at IRST

We began to work on telephone speech recognition in 1995. Our objective was to derive a telephone speech recognizer from the existing IRST technology developed for speech dictation.

The first set of phone HMMs was derived from the APASCI high quality speech material, as suggested in (M. Weintraub and L. Neumeyer - "Constructing Telephone Acoustic Models from a High Quality Speech Corpus", ICASSP '94). Speech signals of APASCI were filtered and downsampled according to the nominal telephone bandwidth.
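The band-limiting and downsampling step can be illustrated with a minimal sketch. It is not the processing chain actually used at IRST: the 16 kHz input rate, the 300-3400 Hz passband, the 8 kHz output rate, the Hamming-windowed FIR design and the function names `bandpass_fir` and `to_telephone_band` are all assumptions chosen for illustration.

```python
import math

def bandpass_fir(low_hz, high_hz, fs, num_taps=101):
    """Hamming-windowed sinc band-pass filter (illustrative design, odd length)."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        if k == 0:
            ideal = 2.0 * (high_hz - low_hz) / fs
        else:
            # Ideal band-pass = difference of two ideal low-pass responses.
            ideal = (math.sin(2 * math.pi * high_hz * k / fs)
                     - math.sin(2 * math.pi * low_hz * k / fs)) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def to_telephone_band(samples, fs_in=16000, fs_out=8000,
                      low_hz=300.0, high_hz=3400.0):
    """Band-limit to an assumed telephone band, then keep every (fs_in/fs_out)-th sample."""
    if fs_in % fs_out != 0:
        raise ValueError("fs_in must be an integer multiple of fs_out")
    step = fs_in // fs_out
    taps = bandpass_fir(low_hz, high_hz, fs_in)
    half = len(taps) // 2
    out = []
    # Compute the filter output only at the samples that survive decimation.
    for center in range(0, len(samples), step):
        acc = 0.0
        for j, tap in enumerate(taps):
            i = center + half - j  # filter centred on the sample (no delay)
            if 0 <= i < len(samples):
                acc += tap * samples[i]
        out.append(acc)
    return out

# Usage: a 1 kHz tone passes, a 5 kHz tone is suppressed.
tone_1k = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(1600)]
tone_5k = [math.sin(2 * math.pi * 5000 * n / 16000) for n in range(1600)]
narrow_1k = to_telephone_band(tone_1k)
narrow_5k = to_telephone_band(tone_5k)
```

Filtering before discarding samples matters: energy above half the new sampling rate would otherwise alias into the telephone band.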

This step provided a starting point that allowed us to develop a first prototype for querying a train timetable database. A vocabulary of about 140 words, including the major Italian cities (capoluoghi), was used. The computer drives the interaction, first asking the user to specify some data (stations, departure time), then letting the user obtain information about the trains that satisfy the constraints. The synthesizer was provided by CSELT. To improve system robustness, both explicit confirmation of the input data and a rejection module were included in the prototype.

In order both to train better acoustic models and to model "weak spontaneous speech phenomena" such as breaths, hesitations and coughs, a telephone speech database named PHONE was collected. A system was designed that automatically places a call to a speaker who has been notified in advance, asks for some information, and collects some sentences. These basically include digit sequences, acoustic sentences and confirmations. A start-end point algorithm detects the boundaries of each input signal, which is stored on disk together with its presumed transcription. A simple "yes/no" recognizer allows the user to control the call flow. All the speech material (more than 250 speakers so far) has been manually checked. The recorded speakers are well distributed geographically across Italy.
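The text does not say how the start-end point algorithm works. As an illustration only, a common simple approach is short-time energy thresholding, sketched below. The function name `detect_endpoints`, the 80-sample frame (10 ms at an assumed 8 kHz rate), the threshold relative to the loudest frame and the minimum run length are all assumptions, not details of the PHONE collection system.

```python
def detect_endpoints(samples, frame_len=80, threshold_ratio=0.1, min_frames=3):
    """Return (start, end) sample indices of the speech region, or None.

    Hypothetical energy-based detector: a frame is 'active' when its mean
    energy exceeds threshold_ratio times the energy of the loudest frame.
    """
    energies = [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    if not energies:
        return None
    threshold = threshold_ratio * max(energies)
    active = [e > threshold for e in energies]

    # Start: first run of min_frames consecutive active frames.
    start = None
    for i in range(len(active) - min_frames + 1):
        if all(active[i:i + min_frames]):
            start = i
            break
    if start is None:
        return None

    # End: last run of min_frames consecutive active frames.
    end = start + min_frames
    for i in range(len(active) - min_frames, -1, -1):
        if all(active[i:i + min_frames]):
            end = i + min_frames
            break
    return start * frame_len, end * frame_len

# Usage: 400 silent samples, 800 samples of signal, 400 silent samples.
signal = [0.0] * 400 + [0.5 if n % 2 == 0 else -0.5 for n in range(800)] + [0.0] * 400
boundaries = detect_endpoints(signal)
```

Requiring several consecutive active frames keeps isolated clicks from being taken as speech. A threshold relative to the loudest frame is a simplification, and on noisy lines an estimate of the background noise level would be a more robust reference.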

This material allowed us to improve our technology considerably. Our work during 1996, which has been partially described in a paper presented at the 1997 ESCA workshop on Robust Speech Recognition for Unknown Communication Channels, focused on the following points:

study of acoustic parameters, including spectral noise subtraction, in order to better model telephone speech

explicit modelling of weak spontaneous speech phenomena (hesitations, breaths, coughs, etc.)

rejection, at least as far as confirmations are concerned

realization of a system (170 - Italy Direct) for Telecom Italia, which basically handles collect call redirection. This system needs to recognize both digit sequences and confirmations. The main difficulty is that most of the calls are very noisy, as they come from public phone boxes. The digit sequences from the training part of SIRVA, collected by CSELT, were also used for this system. In order to evaluate (and possibly improve) the system, a database of incoming calls, named FIELD, is being collected. This system is described in a paper presented at the 1997 EUROSPEECH Conference.

In 1997 we mainly worked on the development of systems with a user-programmable vocabulary (basically menu-driven inquiry systems). Our recognizer has been integrated into the Infovox system by Alceo S.R.L., which allows computer telephony applications to be developed through a graphical interface. One example application is the voice dialer by name built for Caritro, activated in December 1997.

In 1998 we worked on the development of a dialogue prototype that allows fairly natural interaction in telephone applications. A commercial project, which aims to integrate this dialogue technology into a call center development tool, is funded by VOX and started in September 1998. Part of our research activity during 1998 is outlined in a paper presented at the IVTTA workshop.

