
GB2399931A - Assistive technology - Google Patents


Info

Publication number
GB2399931A
GB2399931A (Application GB0307201A)
Authority
GB
United Kingdom
Prior art keywords
utterance
speech
speaker
word
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0307201A
Other versions
GB0307201D0 (en)
Inventor
Pamela Mary Enderby
Philip Duncan Green
Mark S Hawley
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Sheffield
Barnsley District General Hospital NHS Trust
Original Assignee
University of Sheffield
Barnsley District General Hospital NHS Trust
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Sheffield, Barnsley District General Hospital NHS Trust filed Critical University of Sheffield
Priority to GB0307201A: GB2399931A (en)
Publication of GB0307201D0 (en)
Priority to GB0406932A: GB2399932A (en)
Publication of GB2399931A (en)


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 Hidden Markov Models [HMMs]
    • G10L 15/144 Training of HMMs
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Assistive technology supports the control of an environment by a dysarthric speaker. The accuracy of the control exerted by the dysarthric speaker is improved by the use of consistency and confusability measures within the speech processing engine. These measures increase both the accuracy with which the speaker's utterances are recognised and the accuracy with which the dysarthric speaker articulates them.

Description

Assistive Technology System and Method
Field of the Invention
The present invention relates to assistive technology and, more particularly, to technology to assist dysarthric speakers with communication and to assist in the control of their environment.
Background to the Invention
Dysarthria is a neurogenic motor speech disorder that impairs motor function and interferes with the process of speech production. This results in, at best, imprecise articulation of words or parts of words and, at worst, speech that is unintelligible to all but the most skilled listeners.
In severe cases, dysarthric speakers might be highly dependent upon the presence and skill of a care-worker to act, effectively, as a translator in a communication with a third party.
Furthermore, the dysarthric individual might also rely upon the care-worker to perform basic tasks such as, for example, switching the television or lights on and off on their behalf.
Speech produced by dysarthric speakers can be very difficult for listeners unfamiliar with the speaker to understand. Since motor-neurone disease or trauma often affects the cognitive and physical processes responsible for speech production, dysarthric symptoms often accompany neurological conditions such as cerebral palsy, head injury and multiple sclerosis. Many people with dysarthria are physically incapacitated to the extent that spoken commands become an attractive alternative to normal controls for equipment. However, it is acknowledged that achieving robust automatic speech recognition of the speech of dysarthric speakers is variable for mild to moderate dysarthria and extremely difficult for severely dysarthric speech. For severely dysarthric speech, recognisers trained on a normal speech corpus cannot be expected to work well. Conventional automatic speech recognition systems are insufficient to deal with the abnormalities and word-level variances of severely dysarthric speech, since the vocal articulations or vocalisations to be recognised are greatly variable, that is, less consistent, as compared with non-dysarthric speech.
The inability of commercially available automatic speech recognition systems to deal with severely dysarthric speech often results in frustration for the dysarthric speaker since, in the event of the system failing to recognise an utterance, the dysarthric speaker may be invited to repeat the utterance. Repeated invitations to articulate a particular word may result in the dysarthric speaker becoming both fatigued and frustrated. Conventionally, automatic speech recognition systems improve their accuracy of recognition as the underlying model is refined.
However, this refinement may require a relatively large body of training material and significant time and effort on the part of the person whose speech is to be recognised. It will be appreciated that the need to articulate an utterance too many times might lead to a dysarthric speaker becoming, again, both fatigued and frustrated.
It is an object of embodiments of the present invention at least to mitigate some of the problems of the prior art.
Summary of Invention
Accordingly, a first aspect of embodiments of the present invention provides an assistive technology system comprising a speech processor operable to process an input utterance to identify that utterance; means to output a control signal corresponding to the identified utterance for influencing the operation of respective equipment; the system being characterised by the speech processor comprising means to calculate a confusability measure that reflects a degree of correlation between the input utterance and at least a further utterance; each of the input utterance and the further utterance corresponding to respective words of a vocabulary of words; means to replace at least one of the respective words with at least a further word having a corresponding utterance with a different degree of correlation with at least one of the input utterance and the further utterance; and means to associate the further word with the control signal.
Advantageously, dysarthric speakers can communicate more effectively and control their environment more effectively than previously.
Embodiments provide an assistive technology system in which the means to calculate the confusability measure comprises means to subject the utterance, having a corresponding speech model, to a speech model for the further utterance to determine the response of the speech model for the further utterance, and means to provide the confusability measure according to that response.
Preferred embodiments provide an assistive technology system in which the means to calculate the confusability measure between the words, W_i and W_j, of the input utterance and the further utterance respectively comprises means to calculate

C_ij = (Σ_k L_ijk) / n_j

where L_ijk is a per-frame likelihood of the speech model for the word W_i generating the kth example of the word W_j on a Viterbi path, the training set comprises N words W_1 to W_N, and n_j is the number of examples of W_j in the training set.
A dysarthric speaker may often attempt to improve their speech by practice. Suitably, embodiments provide an assistive technology system further comprising means to calculate a consistency measure for the input utterance and means to output a visual indication of the consistency measure.
Preferred embodiments provide an assistive technology system in which the means to calculate the consistency measure comprises means to calculate

S_i = (Σ_k L_iik) / n_i

where L_iik is a per-frame likelihood of the speech model for the word W_i generating the kth example of W_i on a Viterbi path, the training set comprises N words W_1 to W_N, and n_i is the number of examples of W_i in the training set.
Embodiments provide an assistive technology system in which the means to calculate the consistency measure comprises means to calculate S̄ = (Σ_i S_i) / N. Embodiments can be realised in which the means to output a visual indication of the consistency measure comprises means to present a bar chart comparison of the consistency measure with an average consistency measure for that word for a given speaker.
Preferred embodiments provide a system in which the assistive technology system is a dysarthric speech assistive technology system.
A second aspect of embodiments of the present invention provides a method of training or treating a speech-impaired speaker comprising the steps of: processing an input utterance of the speech-impaired speaker using a corresponding speech model; and providing a visual indication of the degree of correlation between the input utterance and a predetermined utterance of the speech-impaired speaker for the corresponding speech model.
Preferred embodiments provide a method of training or treating a speech-impaired speaker in which determining the degree of correlation between the input utterance and the predetermined utterance of the speech-impaired speaker for the corresponding speech model comprises the step of calculating

S_i = (Σ_k L_iik) / n_i

where L_iik is a per-frame likelihood of the speech model for the word W_i generating the kth example of W_i on a Viterbi path, the training set comprises N words W_1 to W_N, and n_i is the number of examples of W_i in the training set.
Embodiments provide a method of training or treating a speech-impaired speaker further comprising the step of establishing the predetermined utterance of the speech-impaired speaker for the speech model.
Embodiments provide a method of training or treating a speech-impaired speaker further comprising the step of processing a plurality of utterances corresponding to the same word and calculating a measure of the average of the plurality of utterances.
Preferred embodiments provide a method of training or treating a speech-impaired speaker further comprising the steps of: establishing a plurality of utterances corresponding to respective words of a plurality of words; calculating a measure of confusability between utterances corresponding to at least a selected pair of words; and selecting an alternative word to replace one of the selected pair of words in the vocabulary, the alternative word having an improved measure of confusability between a respective utterance for the alternative word and the utterance corresponding to the remaining word of the selected pair. Embodiments also provide such a method in which the speech-impaired speaker is a dysarthric speaker.
Brief Description of the Drawings
Embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings in which: figure 1 illustrates an assistive technology system according to an embodiment; figure 2 illustrates a confusability matrix according to an embodiment; figure 3 shows a flowchart of processing performed by the first embodiment; and figure 4 shows a further flowchart of further processing performed by the embodiment.
Detailed Description of the Preferred Embodiments
Referring to figure 1, there is shown an assistive technology system 100 for assisting a dysarthric speaker (not shown). The system 100 comprises a computer system 102 having a speech processing engine or recogniser 104 that uses a number of speech models 106 to recognise speech detected by a microphone 108. The computer system 102 is also provided with an input device 110 that is adapted to the needs of the dysarthric speaker. The input device may be, for example, a relatively easy-to-activate switch. The computer system 102 also comprises device control software 112 which, in response to outputs of at least one of the speech processing engine 104 and the switch 110, produces control signals that are used to control respective items of equipment 114 to 118 such as, for example, a television, a radio or a satellite receiver. Although the embodiment illustrated is shown as having a hardware interface 120, which may be any type of hardware interface, preferred embodiments are realised in which the computer system 102 communicates with the equipment 114 to 118 wirelessly, using, for example, infrared communication, Bluetooth, IEEE 802.11b or the like, according to the capabilities of the equipment and the interface 120.
The computer system 102 is provided with access to non-volatile storage 122 in the form of, for example, an HDD. The non-volatile storage 122 is used to store speech models 124 for respective words that form a vocabulary that the speech processing engine 104 is expected to recognise. It can be seen that a number of individual speech models 126 and 128 are illustrated. Also illustrated are the training sets, or training corpora, 124' for each of the speech models. Again, it can be appreciated that two training corpora 126' and 128' are illustrated that correspond to respective speech models 126 and 128.
In general terms, the computer system 102 provides a voice interface via which the dysarthric speaker can control the various items of equipment 114 to 118. The dysarthric speaker, using either the microphone alone or the microphone 108 in conjunction with a switch 110, utters a word such as, for example, "TV". This aspect of embodiments of the present invention will be described in greater detail with reference to figure 4.
The speech models may be constructed using the well-known HTK toolkit, available from Cambridge University Engineering Department under licence from Microsoft Corporation, that produces Continuous Density Hidden Markov Models (CDHMMs). The models have the following characteristics: they are whole-word based rather than phone-level based; they typically have 11 HMM states, with a mixture of 3 Gaussian distributions per state; they are "straight through" models that allow only self-transitions and transitions to the next state; the acoustic vectors comprise Mel Frequency Cepstral Coefficients, typically with differences but without overall energy (dysarthric speakers often have difficulty maintaining a steady volume); training data is labelled at the word level using "silence | word | silence"; and the sampling rate for audio data is 16 kHz, with a 10 ms frame rate.
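The "straight through" topology can be made concrete with a short sketch. This is illustrative only: the document builds its models with HTK, and the function name and probability values below are hypothetical.

```python
def straight_through_transmat(n_states, p_stay=0.6):
    """Left-to-right ("straight through") HMM transition matrix:
    each state allows only a self-transition and a transition to the
    next state; the final state is absorbing. p_stay is illustrative."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        A[i][i] = p_stay            # self-transition
        A[i][i + 1] = 1.0 - p_stay  # transition to the next state
    A[-1][-1] = 1.0                 # final state absorbs
    return A

A = straight_through_transmat(11)  # 11 states, as in the models described
```

Each row of the resulting matrix sums to one and no state-skipping transitions are permitted, matching the topology described above.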
Preferably, the speech processing engine or recogniser 104 is configured to be able to modify a dysarthric speaker's vocabulary. It is usual for a dysarthric speaker to produce some words more consistently than others. For example, "TV" might be an easier proposition than "television". While clinical assessment might help in identifying such words that may be articulated more consistently, the speech processing engine is arranged to provide a quantitative measure of word-level consistency.
Furthermore, preferred embodiments provide a measure of the overall consistency of the speech in any given training corpus across all or selected words of that corpus. Such a measure of overall consistency might be used to assess the severity of the dysarthria and to record a dysarthric speaker's progress as any therapy proceeds. Still further embodiments can be realised that track utterance-level consistency to provide an indication of the correlation between the probability scores returned by a dysarthric speaker's individual pronunciations of a given word and their norm for that word. Utterance-level consistency might be used by a clinician to identify outlier utterances and, if warranted, to remove such utterances from the training corpus. The utterance-level consistency might also be used to identify cases where a dysarthric speaker shows two production or articulation styles of the same word, in which case two different speech models for that word might be provided.
Having some means of predicting confusion errors might allow more robust recognition to be realised. Therefore, the speech processing engine 104, in preferred embodiments, provides a measure of confusability. The confusability measure is arranged to allow words or groups of words that might be confused with one another to be modified or removed from the speech models 124. Preferred embodiments approach or provide a measure of confusability using forced alignment based upon probability scores derived from automatic speech recognition, rather than the more conventional phonetically-based approach of the prior art. "Alignment" in this context means that utterances are processed to identify word boundaries and the speech unit or units defined by such word boundaries are subjected to the speech models.
These consistency and confusability measures should be based on the training set, that is, the training corpus, and the trained models. The training set or corpus is stored together with the trained models. Further embodiments use rules to implement forced alignment of training-set utterances against the models under the following assumptions: a training set for a vocabulary has N words, W_1..W_N; a CDHMM, M_i, is provided for each word W_i; and w_jk is the kth repetition of the jth word of the training set.
A per-frame log likelihood L_ijk is calculated for each model M_i generating each example of each word W_j on a Viterbi path. The consistency, S_i, of a word, W_i, is obtained by:

S_i = (Σ_k L_iik) / n_i (1)

where n_i is the number of examples of a given word, W_i, in the training set. An average score for a word is obtained by aligning all examples of that word against the model for that word.
Conventionally, the more variation there is in the training data for each speech unit, the larger the variances will be in that speech unit's HMM score distributions. The forced-alignment likelihoods will be lower for an inconsistently spoken word than for a consistently spoken word, since its distributions will be flatter.
The overall consistency of the training corpus, S̄, is the average of the consistencies of all N words within that corpus. Therefore, the overall consistency is given by:

S̄ = (Σ_i S_i) / N (2)

As indicated above, the measure of overall consistency of a training corpus might be used to assess the severity of dysarthria and/or to record a dysarthric speaker's progress as any therapy proceeds.
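Equations (1) and (2) amount to simple averaging of forced-alignment scores. The sketch below assumes a mapping from (model word, spoken word) pairs to lists of per-frame log likelihoods; the words and score values are hypothetical.

```python
def consistency(per_frame_ll, word):
    """S_i (equation 1): mean per-frame log likelihood of word W_i's own
    model generating each of the n_i training examples of W_i."""
    scores = per_frame_ll[(word, word)]
    return sum(scores) / len(scores)

def overall_consistency(per_frame_ll, vocab):
    """Overall corpus consistency (equation 2): mean of S_i over the N words."""
    return sum(consistency(per_frame_ll, w) for w in vocab) / len(vocab)

# Hypothetical alignment scores for a two-word corpus.
ll = {("TV", "TV"): [-5.0, -5.2, -4.8], ("on", "on"): [-6.1, -5.9]}
s_tv = consistency(ll, "TV")                       # -5.0
s_corpus = overall_consistency(ll, ["TV", "on"])   # -5.5
```

A more consistently spoken word yields a higher (less negative) S_i, reflecting the sharper score distributions described above.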
The confusability between two given words, W_i and W_j, is defined by:

C_ij = (Σ_k L_ijk) / n_j (3)

which is a measure of the average score obtained by aligning examples of a given word, W_j, against the CDHMM, M_i, for a different word, W_i. A higher value of C_ij implies a greater likelihood that W_j will be misrecognised as W_i.
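Equation (3) reuses the same alignment scores with mismatched model/word pairs. A minimal sketch, again with hypothetical words and score values:

```python
def confusability(per_frame_ll, model_word, test_word):
    """C_ij (equation 3): mean per-frame log likelihood obtained by
    aligning the n_j examples of word W_j against the model M_i for a
    different word W_i. A higher C_ij means W_j is more likely to be
    misrecognised as W_i."""
    scores = per_frame_ll[(model_word, test_word)]
    return sum(scores) / len(scores)

# Hypothetical scores: "down" fits a "town" model almost as well as its
# own model, so that pair would appear dark in a confusability matrix.
ll = {
    ("down", "down"): [-5.0, -5.2],
    ("town", "down"): [-5.1, -5.3],
    ("on", "down"): [-9.0, -8.8],
}
row = {model: confusability(ll, model, "down") for model, word in ll}
```

The dictionary `row` corresponds to one column of the matrix of figure 2: the "town" entry sits close to the diagonal score while the "on" entry is far below it.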
Training aid software 130 can be used by a clinician in analysing the confusability results.
Figure 2 shows an example 200 of the output produced by the speech processing engine 104 for a severely dysarthric speaker. The output 200 is provided with a confusability variance calibration gray-scale 202, which provides an indication of the degree of confusability of any two words forming part of a confusability matrix 204. The confusability variance calibration gray-scale 202 is arranged such that the darker the shade, the greater the confusability. From the example of the confusability matrix 204, it can be appreciated that there is a higher risk of confusion between the word "TV" and the set of words {alarm, lamp, channel, down, radio, volume} as compared with the risk of confusion between the word "TV" and the set of words {on, off, up}. The confusability matrix 204 can be used by a clinician to tailor the vocabulary recognised by the speech processing engine 104 to reduce the probability or risk of confusion between any two words. This should, in turn, improve the response of the speech processing engine 104 to the dysarthric speaker's utterances. It will be appreciated that in some cases the risk of confusion can exist between several words, in which case a number of alternative words might be considered.

The words in the first column or first row of the confusability matrix represent the whole or part of a dysarthric speaker's vocabulary. A clinician will construct such a confusability matrix using either a test or training set of words or an intended vocabulary of words for that speaker only. Such an initial vocabulary of words might be refined to remove words and to introduce alternative words if it is noted that there is significant confusion between selected words.

In a preferred embodiment, the training aid software 130 has a further mode of operation, which allows a dysarthric speaker, using the microphone 108 and switch 110, to practise their articulation of selected words. The dysarthric speaker can select a word to be practised from the whole, or part, of their intended vocabulary.
Alternatively, a clinician might make that selection on behalf of the dysarthric speaker as part of an interactive therapy session. Using the switch 110, the dysarthric speaker can arrange for the speech processing engine 104 to record and process their utterances. The utterance is compared with the speech model that corresponds to the norm of that word for that dysarthric speaker. The speech processing engine 104 returns a probability score to the training aid software 130 that reflects the closeness of match of the utterance with the speech model representing the norm, or average, of the utterance for that dysarthric speaker.

This measure of consistency is preferably presented to the dysarthric speaker visually as a bar chart comprising two bars. The first bar represents the probability score of the utterance in the training corpus with a score closest to the norm, and the second bar represents the probability score of the most recent utterance; that is, it represents a means by which the dysarthric speaker can compare their most recent utterance with the norm for that utterance. Preferably, the dysarthric speaker can use the switch 110 to play the utterance corresponding to their norm for that utterance in advance of practising that utterance.

The training aid software 130 preferably records all utterances of such practice sessions to allow the recorded data to be analysed by a clinician or speech therapist. In preferred embodiments, utterances that deviate by more than a predetermined value such as, for example, 20%, from the norm of that utterance for that dysarthric speaker are highlighted or drawn to the attention of the clinician or speech therapist. Identifying such anomalous utterances allows them to be removed from any data that might be used to influence the performance of the corresponding CDHMM.
Providing a measure of the closeness of fit of the recent utterance with the dysarthric speaker's norm for that utterance allows the dysarthric speaker to practise that utterance so that they might be able to articulate the utterance in a manner more consistent with their norm. It should be noted that more consistent articulation of any given utterance does not necessarily imply that the utterance will be intelligible to a person unfamiliar with the dysarthric speaker.
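The deviation rule described above (flagging utterances that differ from the speaker's norm by more than, for example, 20%) can be sketched as follows; the function name and tolerance handling are illustrative assumptions, not the document's implementation:

```python
def flag_outliers(norm_score, utterance_scores, tolerance=0.20):
    """Return the indices of practice utterances whose score deviates
    from the speaker's norm for that word by more than the tolerance
    (20% in the text), so a clinician can review them and, if warranted,
    exclude them from CDHMM training data."""
    return [
        k for k, score in enumerate(utterance_scores)
        if abs(score - norm_score) > tolerance * abs(norm_score)
    ]

# With a norm score of -5.0, only the third utterance (40% away) is flagged.
flagged = flag_outliers(-5.0, [-5.1, -4.9, -7.0])  # [2]
```

Flagged utterances could then be highlighted in the training aid software for the clinician or speech therapist to inspect.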
It will be appreciated that as the accuracy of the speech processing engine or recogniser 104 increases, the degree of control exerted by a dysarthric speaker over the equipment 114 to 118 also increases. This increased degree of control will lead to an improved quality of life for a dysarthric speaker.
Referring to figure 3, there is shown a flowchart 300 that illustrates the basic steps undertaken by the training aid software 130, in conjunction with the recogniser 104, in allowing a dysarthric speaker to practise utterances of a selected or target word. At step 302, the target word is selected from a number of displayed words. An articulated utterance corresponding to the selected word is recorded by the recogniser 104 at step 304. The recogniser 104 compares the utterance with an appropriate speech model corresponding to the selected word at step 306. An analysis of the closeness of fit of the most recently recorded and processed utterance with the norm of the utterance for that speaker is performed at step 308 and, at step 310, feedback is provided to the dysarthric speaker on the closeness of fit of their most recent utterance with their norm for that utterance.
In an alternative embodiment, or additionally, rather than a dysarthric speaker or clinician selecting a target word at step 302, the system can be arranged so that the dysarthric speaker may merely actuate the switch 110 to provide an indication to the speech processing engine that the next utterance is intended to be a control or communication command. In such embodiments, the recognition step 306 compares the data produced by the signal processing step 304 with all of the speech models 124 for that dysarthric speaker to identify the best match. Once a match has been identified, the device control software 112 determines whether or not a control signal should be output in response to that match. It will be appreciated by one skilled in the art that all possible command phrases in the recogniser's vocabulary are associated with either speech output or a control signal by the clinician when the system is trained and configured. For example, appropriate codes are stored together with command speech models, which are used to produce corresponding infrared signals via the hardware interface. These codes are tailored to the particular types of hardware and hardware interface present. Therefore, it will be appreciated that if the hardware interface or an item of hardware were changed, the information stored would also be changed accordingly.
If appropriate, the device control software 112 produces a control signal via the hardware interface 120 that is suitable for controlling a corresponding item of equipment 114 to 118.
In alternative embodiments, or additionally, rather than outputting a control signal, a speech synthesis engine 132 can be arranged to output intelligible speech. In such an embodiment, it will be appreciated that the computer system 102 is acting, effectively, as a translation aid that translates between dysarthric speech and conventional speech that is intelligible to those unfamiliar with the dysarthric speaker. For example, this would allow a dysarthric speaker to greet a friend using the word "hello". It would also support a greater verbal interaction between a dysarthric speaker and a person unfamiliar with the dysarthric speaker. This is illustrated in figure 4.
Figure 4 shows a flowchart 400 of the processing performed by the computer system 102 to assist the dysarthric speaker in communicating with people or interacting with their environment. The computer system 102 is arranged to operate in a command/control and communication mode. In this mode, the dysarthric speaker provides an indication to the speech processing engine that it should enter a record mode of operation to record an utterance of the dysarthric speaker. Such an indication can be provided using a mouse or keyboard of the computer system, a specially adapted input device according to the physical capabilities of the dysarthric speaker, or via the automatic detection of the speaker's voice close to the microphone without using an additional input device. The latter might be achieved using, for example, a volume threshold such that the system infers that an utterance intended for processing by the system has been made if the volume of that utterance is above that volume threshold, with all other detected utterances below the threshold being ignored.
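The volume-threshold inference can be sketched as a simple RMS energy test on an audio frame; the threshold value and sample format below are assumptions for illustration:

```python
def is_intended_utterance(samples, threshold=0.1):
    """Treat an audio frame as speech intended for the system only if
    its RMS energy exceeds a volume threshold; quieter speech, such as
    conversation away from the microphone, is ignored."""
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
    return rms >= threshold

assert is_intended_utterance([0.3, -0.4, 0.5])         # close to the mic
assert not is_intended_utterance([0.01, -0.02, 0.01])  # background speech
```

In practice the threshold would be calibrated to the individual speaker, since dysarthric speakers often have difficulty maintaining a steady volume.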
Therefore, at step 402, the speech processing engine receives a signal that is indicative of the input device having been actuated. Any speech uttered following actuation of the input device is recorded by the speech processing engine 104 and converted into a suitable form for processing by the CDHMM recogniser at step 404. The speech processing engine 104 performs speech recognition at step 406 to determine the best or most appropriate correlation between the utterance and one of the speech models 124.
Processing is performed, in light of the recognition at step 406, to determine whether the dysarthric speaker has issued a command that needs to be parsed. This processing is performed at step 408. For example, the dysarthric speaker may have uttered "TV on", which is a command to switch on the television. The speech processing engine 104 is arranged, when identifying a first word of an utterance, to check the utterance, or any part of the utterance, against all speech models 124. Having matched at least part of the utterance with one of the speech models 124, the speech processing engine 104 is sufficiently sophisticated to compare the next, or the last, part of the utterance with a limited set of speech models selected from the whole set of speech models 124. This will reduce the processing burden imposed upon the speech processing engine 104. For example, if the utterance is "TV on" and the speech processing engine has identified the first part of the utterance as "TV", the speech processing engine might then expect the second part of the utterance to be a command such as "on", "off", "volume", "up", "channel" or "down".
Hence, the second part of the utterance would be processed by speech models corresponding to those words; that is, a limited set of the speech models is used in identifying subsequent parts of an overall utterance.
A further example of comparing a later part of an utterance with a very limited set of the speech models, having identified an earlier part of an utterance, would be the commands that control a lamp. The lamp has two states, namely on and off, and a speech processing engine, having identified part of the utterance as "lamp", would then only expect a subsequent part of the utterance to be "on" or "off" and would, therefore, only need to use two further speech models in fully processing the utterances "lamp on" and "lamp off".
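This narrowing amounts to a small grammar mapping each first word to the words that may legally follow it. The vocabulary fragment below is hypothetical, chosen to mirror the "TV" and "lamp" examples above:

```python
# Words that may follow each recognised first word (hypothetical grammar).
FOLLOW = {
    "TV": ["on", "off", "volume", "up", "channel", "down"],
    "lamp": ["on", "off"],  # a lamp has only two states
}

def candidate_models(first_word, all_models):
    """Select the reduced set of speech models to score against the next
    part of the utterance, given the recognised first word; unknown
    first words fall back to the full model set."""
    allowed = set(FOLLOW.get(first_word, all_models))
    return [m for m in all_models if m in allowed]

models = ["on", "off", "up", "down", "volume", "channel", "TV", "lamp"]
assert candidate_models("lamp", models) == ["on", "off"]  # just two models
```

Scoring two models instead of the full set is the reduction in processing burden that the text describes.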
The filtering or narrowing, that is, the selection of a reduced set of speech models for use in subsequent processing, may be further reduced if the speech processing engine 104 also stores data relating to the current state of an item of equipment. For example, once a television has been switched on, it is unlikely that a subsequent part of an utterance following "TV" will be "on", since the television is already in the on-state. Therefore, the speech model for recognising "on" need not be considered in such circumstances. It will be appreciated that such embodiments might assume that a given item of equipment is in a desired state following a command, or that some form of feedback from the equipment to the system is provided to confirm that it has assumed the desired state. Having parsed an utterance at step 408, a determination is made, at step 409, as to whether or not the parsed utterance is a command. Therefore, if appropriate, respective control signals for controlling the equipment 114 to 118 are generated and output at step 410. The device control software 112 has access to a table 134 of parsed commands 136 that are mapped to corresponding control signals 138.
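The state-dependent narrowing and the command table can be illustrated as follows; the device names, states and control-signal identifiers are invented for the sketch and do not correspond to the table 134 itself.

```python
# Invented current-state record for each controlled device.
device_state = {"TV": "off", "lamp": "off", "radio": "on"}

# Invented mapping from parsed commands to control signals, in the
# spirit of the table of parsed commands and corresponding signals.
CONTROL_SIGNALS = {
    ("TV", "on"): "IR_TV_POWER",
    ("TV", "off"): "IR_TV_POWER",
    ("lamp", "on"): "X10_LAMP_ON",
    ("lamp", "off"): "X10_LAMP_OFF",
}

def allowed_second_words(device, grammar):
    # Drop the word that would merely re-assert the device's current
    # state, e.g. "on" when the device is already on.
    return [w for w in grammar[device] if w != device_state[device]]

def control_signal(device, action):
    # Look up the control signal for a parsed command, if any exists.
    return CONTROL_SIGNALS.get((device, action))
```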
A further example in which knowledge of the current state of the computer equipment can be used to improve the ease with which a dysarthric speaker can interact with their environment arises when, for example, only the radio is switched on. Assume that the command "volume up" is issued by the dysarthric speaker. Knowing that only the radio is currently switched on, the speech processing engine 104 is sufficiently sophisticated to be able to ignore all speech models but those relating to the radio.
If the parsed utterance is not recognised at step 409 as a command, it is assumed to be speech intended to be communicated to a listener. That speech communication is converted into a more intelligible form, that is, into a form that may be understood by someone unfamiliar with the dysarthric speaker, and is subsequently output. The subsequent output might be in the form of text on the screen of the computer system 102 or in the form of synthesised speech. This processing takes place at step 412.
Embodiments of the present invention may use a table comprising a mapping between dysarthric utterances and conventional speech or text. The speech may be output by the speech synthesiser, or the text may be output for display. Alternatively, or additionally, the text may be converted to speech by the speech synthesiser.
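Such a table might be sketched as below; the utterance labels and output sentences are invented examples, not entries from any actual embodiment.

```python
# Invented mapping from recognised (possibly idiosyncratic) utterance
# labels to conventional text for display or speech synthesis.
UTTERANCE_TO_TEXT = {
    "dnk": "I would like a drink, please.",
    "hlo": "Hello, how are you?",
}

def communicate(recognised_label):
    # Return conventional text for a recognised utterance, or a
    # placeholder when no mapping exists for the label.
    return UTTERANCE_TO_TEXT.get(recognised_label, "[unrecognised utterance]")
```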
The reader's attention is directed to all papers and documents which are filed concurrently with or previous to this specification in connection with this application and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference.
All of the features disclosed in this specification (including any accompanying claims, abstract and drawings) and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.
Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
The invention is not restricted to the details of any foregoing embodiments. The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.

Claims (19)

1. An assistive technology system comprising a speech processor operable to process an input utterance to identify that utterance; means to output a control signal corresponding to the identified utterance for influencing the operation of respective equipment; the system being characterised by the speech processor comprising means to calculate a confusability measure that reflects a degree of correlation between the input utterance and at least a further utterance, each of the input utterance and the further utterance corresponding to respective words of a vocabulary of words; means to replace at least one of the respective words with at least a further word having a corresponding utterance with a different degree of correlation with at least one of the input utterance and the further utterance; and means to associate the further word with the control signal.
2. An assistive technology system as claimed in claim 1 in which the means to calculate the confusability measure comprises means to subject the utterance, having a corresponding speech model, to a speech model for the further utterance to determine the response of the speech model for the further utterance and means to provide the confusability measure according to that response.
3. An assistive technology system as claimed in claim 2 in which the means to calculate the confusability measure between the words, W_i and W_j, of the input utterance and the further utterance respectively comprises means to calculate

C_ij = (Σ_k L_ijk) / n_j,

where L_ijk is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, j and k represent the kth repetition of the jth word in a training set comprising N words W_1 to W_N, and n_j is the number of examples of W_j in the training set.
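The confusability measure of claim 3 can be computed as sketched below, assuming a function loglik(i, j, k) that returns the per-frame log-likelihood L_ijk of word W_i's speech model generating the kth training example of word W_j along a Viterbi path; such a function is an assumed interface, not something defined by the claim, and in practice the values would come from a CDHMM recogniser.

```python
def confusability(loglik, i, j, n_j):
    # C_ij = (sum over k of L_ijk) / n_j: the average per-frame
    # likelihood of word i's model generating word j's n_j examples.
    return sum(loglik(i, j, k) for k in range(n_j)) / n_j
```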
4. An assistive technology system as claimed in any preceding claim, further comprising means to calculate a consistency measure for the input utterance and means to output a visual indication of the consistency measure.
5. An assistive technology system as claimed in claim 4 in which the means to calculate the consistency measure comprises means to calculate

S_i = (Σ_k L_iik) / n_i,

where L_iik is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, i and k represent the kth repetition of the ith word in a training set comprising N words W_1 to W_N, and n_i is the number of examples of W_i in the training set.
6. An assistive technology system as claimed in claim 5 in which the means to calculate the consistency measure further comprises means to calculate the average consistency measure

S̄ = (Σ_i S_i) / N.
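Under the same assumed loglik(i, j, k) interface used above for confusability, the consistency measure of claim 5 and the average of claim 6 can be sketched as:

```python
def consistency(loglik, i, n_i):
    # S_i = (sum over k of L_iik) / n_i: how well word i's own model
    # fits the n_i training examples of word i.
    return sum(loglik(i, i, k) for k in range(n_i)) / n_i

def average_consistency(loglik, counts):
    # S-bar = (sum over i of S_i) / N, where counts[i] = n_i and the
    # vocabulary comprises N words.
    N = len(counts)
    return sum(consistency(loglik, i, counts[i]) for i in range(N)) / N
```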
7. An assistive technology system as claimed in either of claims 5 and 6 in which the means to output a visual indication of the consistency measure comprises means to present a bar chart comparison of the consistency measure with an average consistency measure for that word for a given speaker.
8. A system as claimed in any preceding claim in which the assistive technology system is a dysarthric speech assistive technology system.
9. A method of training or treating a speech impaired speaker comprising the steps of processing an input utterance of the speech impaired speaker using a corresponding speech model; and providing a visual indication of the degree of correlation between the input utterance and a predetermined utterance of the speech impaired speaker for the corresponding speech model.
10. A method of training or treating a speech impaired speaker as claimed in claim 9 in which determining the degree of correlation between the input utterance and the predetermined utterance of the speech impaired speaker for the corresponding speech model comprises the step of calculating

S_i = (Σ_k L_iik) / n_i,

where L_iik is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, i and k represent the kth repetition of the ith word in a training set comprising N words W_1 to W_N, and n_i is the number of examples of W_i in the training set.
11. A method of training or treating a speech impaired speaker as claimed in either of claims 9 and 10 further comprising the step of establishing the predetermined utterance of the speech impaired speaker for the speech model.
12. A method of training or treating a speech impaired speaker as claimed in any of claims 9 to 11 further comprising the step of processing a plurality of utterances corresponding to the same word and calculating a measure of the average of the plurality of utterances.
13. A method of training or treating a speech impaired speaker as claimed in any of claims 9 to 12, further comprising the steps of establishing a plurality of utterances corresponding to respective words of a plurality of words; calculating a measure of confusability between utterances corresponding to at least a selected pair of words; and selecting an alternative word to replace one of the selected pair of words in the vocabulary; the alternative word having an improved measure of confusability between a respective utterance for the alternative word and the utterance corresponding to the remaining word of the selected pair of words.
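The vocabulary-revision step of claim 13 might be sketched as follows; the confusability values and candidate words are invented for the example, and higher (less negative) values are taken to mean greater confusability.

```python
def most_confusable_pair(conf):
    # conf[(a, b)] holds the confusability of word a's utterance with
    # word b's speech model; find the worst off-diagonal pair.
    return max((p for p in conf if p[0] != p[1]), key=conf.get)

def pick_replacement(other_word, candidates, conf):
    # Choose the candidate whose utterance is least confusable with
    # the remaining word of the selected pair.
    return min(candidates, key=lambda c: conf[(c, other_word)])
```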
14. A method of training or treating as claimed in any of claims 9 to 13 in which the speech impaired speaker is a dysarthric speaker.
15. An assistive technology system comprising means to implement a method as claimed in any of claims 9 to 14.
16. An assistive technology system substantially as described herein with reference to and/or as illustrated in the accompanying drawings.
17. A method of training or treating a speech impaired person substantially as described herein with reference to and/or as illustrated in the accompanying drawings.
18. A computer program element comprising computer program code means to implement a method or system as claimed in any preceding claim.
19. A computer program product comprising computer readable storage media storing a computer program element as claimed in claim 18.