GB2399931A - Assistive technology - Google Patents
Assistive technology
- Publication number
- GB2399931A (Application GB0307201A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- utterance
- speech
- speaker
- word
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/144—Training of HMMs
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
Abstract
Assistive technology supports the control of an environment by a dysarthric speaker. The accuracy of the control exerted by the dysarthric speaker is improved by the use of consistency and confusability measures within the speech processing engine. These measures increase the accuracy of the recognition of the speaker's utterances and the accuracy of articulation of the dysarthric speaker's utterances.
Description
Assistive Technology System and Method
Field of the Invention
The present invention relates to assistive technology and, more particularly, to technology to assist dysarthric speakers with communication and to assist in the control of their environment.
Background to the Invention
Dysarthria is a neurogenic motor speech disorder that impairs motor function and interferes with the process of speech production. This results in, at best, imprecise articulation of words or parts of words and, at worst, speech that is unintelligible to all but the most skilled listeners.
In severe cases, dysarthric speakers might be highly dependent upon the presence and skill of a care-worker to act, effectively, as a translator in a communication with a third party.
Furthermore, the dysarthric individual might also rely upon the care-worker to perform basic tasks such as, for example, switching the television or lights on and off on their behalf.
Speech produced by dysarthric speakers can be very difficult for listeners unfamiliar with the speaker to understand. Since motor-neurone disease or trauma often affects the cognitive and physical processes responsible for speech production, dysarthric symptoms often accompany neurological conditions such as cerebral palsy, head injury and multiple sclerosis. Many people with dysarthria are physically incapacitated to the extent that spoken commands become an attractive alternative to normal controls for equipment. However, it is acknowledged that achieving robust automatic speech recognition of the speech of dysarthric speakers is variable for mild to moderate dysarthria and extremely difficult for severely dysarthric speech. For severely dysarthric speech, recognisers trained on a normal speech corpus cannot be expected to work well. Conventional automatic speech recognition systems are insufficient to deal with the abnormalities and word-level variances of severely dysarthric speech, since the vocal articulations or vocalisations to be recognised are greatly variable, that is, less consistent, as compared to non-dysarthric speech.
The inability of commercially available automatic speech recognition systems to deal with severely dysarthric speech often results in frustration of the dysarthric speaker since, in the event of the system failing to recognise an utterance, the dysarthric speaker may be invited to repeat the utterance. Repeated invitations to articulate a particular word may result in the dysarthric speaker becoming both fatigued and frustrated. Conventionally, automatic speech recognition systems improve their accuracy of recognition as the underlying model is refined.
However, this refinement may require a relatively large body of training material and significant time and effort on the part of the person whose speech is to be recognised. It will be appreciated that the need to articulate an utterance too many times might lead to a dysarthric speaker becoming, again, both fatigued and frustrated.
It is an object of embodiments of the present invention at least to mitigate some of the problems of the prior art.
Summary of Invention
Accordingly, a first aspect of embodiments of the present invention provides an assistive technology system comprising a speech processor operable to process an input utterance to identify that utterance; means to output a control signal corresponding to the identified utterance for influencing the operation of respective equipment; the system being characterised by the speech processor comprising means to calculate a confusability measure that reflects a degree of correlation between the input utterance and at least a further utterance; each of the input utterance and the further utterance corresponding to respective words of a vocabulary of words, and means to replace at least one of the respective words with at least a further word having a corresponding utterance having a different degree of correlation with the at least one of the input utterance and the further utterance and means to associate the further word with the control signal.
Advantageously, dysarthric speakers can communicate more effectively and control their environment more effectively than previously.
Embodiments provide an assistive technology system in which the means to calculate the confusability measure comprises means to subject the utterance, having a corresponding speech model, to a speech model for the further utterance to determine the response of the speech model for the further utterance and means to provide the confusability measure according to that response.
Preferred embodiments provide an assistive technology system in which the means to calculate the confusability measure between the words, W_i and W_j, of the input utterance and the further utterance respectively comprises means to calculate

C_ij = (Σ_k L_ijk) / n_j,

where L_ijk is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, j and k represent the kth repetition of the jth word in a training set comprising N words W_1 to W_N, and n_j is the number of examples of W_j in the training set.
A dysarthric speaker may often attempt to improve their speech by practice. Suitably, embodiments provide an assistive technology system further comprising means to calculate a consistency measure for the input utterance and means to output a visual indication of the consistency measure.
Preferred embodiments provide an assistive technology system in which the means to calculate the consistency measure comprises means to calculate

δ_i = (Σ_k L_iik) / n_i,

where L_iik is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, i and k represent the kth repetition of the ith word in a training set comprising N words W_1 to W_N, and n_i is the number of examples of W_i in the training set.
Embodiments provide an assistive technology system in which the means to calculate the consistency measure comprises means to calculate

Δ = (Σ_i δ_i) / N.

Embodiments can be realised in which the means to output a visual indication of the consistency measure comprises means to present a bar chart comparison of the consistency measure with an average consistency measure for that word for a given speaker.
Preferred embodiments provide a system in which the assistive technology system is a dysarthric speech assistive technology system.
A second aspect of embodiments of the present invention provides a method of training or treating a speech impaired speaker comprising the steps of processing an input utterance of the speech impaired speaker using a corresponding speech model; providing a visual indication of the degree of correlation between the input utterance and a predetermined utterance of the speech impaired speaker for the corresponding speech model.
Preferred embodiments provide a method of training or treating a speech impaired speaker in which the degree of correlation between the input utterance and the predetermined utterance of the speech impaired speaker for the corresponding speech model comprises the step of calculating

δ_i = (Σ_k L_iik) / n_i,

where L_iik is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, i and k represent the kth repetition of the ith word in a training set comprising N words W_1 to W_N, and n_i is the number of examples of W_i in the training set.
Embodiments provide a method of training or treating a speech impaired speaker further comprising the step of establishing the predetermined utterance of the speech impaired speaker for the speech model.
Embodiments provide a method of training or treating a speech impaired speaker further comprising the step of processing a plurality of utterances corresponding to the same word and calculating a measure of the average of the plurality of utterances.
Preferred embodiments provide a method of training or treating a speech impaired speaker further comprising the steps of establishing a plurality of utterances corresponding to respective words of a plurality of words; calculating a measure of confusability between utterances corresponding to at least a selected pair of words; and selecting an alternative word to replace one of the selected pair of words in the vocabulary; the alternative word having an improved measure of confusability between a respective utterance for the alternative word and the utterance corresponding to the remaining word of the selected pair of words. Embodiments also provide a method of training or treating a speech impaired speaker in which the speech impaired speaker is a dysarthric speaker.
Brief Description of the Drawings
Embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings in which: figure 1 illustrates an assistive technology system according to an embodiment; figure 2 illustrates a confusability matrix according to an embodiment; figure 3 shows a flowchart of processing performed by the first embodiment; and figure 4 shows a further flowchart of further processing performed by the embodiment.
Detailed Description of the Preferred Embodiments
Referring to Figure 1, there is shown an assistive technology system 100 for assisting a dysarthric speaker (not shown). The system 100 comprises a computer system 102 having a speech processing engine or recogniser 104 that uses a number of speech models 106 to recognise speech detected by a microphone 108. The computer system 102 is also provided with an input device 110 that is adapted to the needs of the dysarthric speaker. The input device may be, for example, a relatively easy to activate switch. The computer system 102 also comprises device control software 112 which, in response to outputs of at least one of the speech processing engine 104 and the switch 110, produces control signals that are used to control respective items of equipment 114 to 118 such as, for example, a television, a radio or a satellite receiver. Although the embodiment illustrated is shown as having a hardware interface 120, which may be any type of hardware interface, preferred embodiments are realised in which the computer system 102 communicates with the equipment 114 to 118 wirelessly, using, for example, infrared communication, Bluetooth, IEEE 802.11b or the like according to the capabilities of the equipment and the interface 120.
The computer system 102 is provided with access to non-volatile storage 122 in the form of, for example, an HDD. The non-volatile storage 122 is used to store speech models 124 for respective words that form a vocabulary that the speech processing engine 104 is expected to recognise. It can be seen that a number of individual speech models 126 and 128 are illustrated. Also illustrated are the training sets or training corpuses 124' for each of the speech models. Again, it can be appreciated that two training corpuses 126' and 128' are illustrated that correspond to respective speech models 126 and 128.
In general terms, the computer system 102 provides a voice interface via which the dysarthric speaker can control the various items of equipment 114 to 118. The dysarthric speaker, using either the microphone alone or the microphone 108 in conjunction with a switch 110, utters a word such as, for example, "TV". This aspect of embodiments of the present invention will be described in greater detail with reference to figure 4.
The speech models may be constructed using the well known HTK toolkit, available from Cambridge University Engineering Department, under licence from Microsoft Corporation, that produces Continuous Density Hidden Markov Models (CDHMMs). The models have the following characteristics: they are whole-word based rather than phone-level based; they typically have 11 HMM states, with a mixture of 3 Gaussian distributions per state; they are "straight-through" models that allow only self-transitions and transitions to the next state; the acoustic vectors comprise Mel Frequency Cepstral Coefficients, typically with differences but without overall energy (dysarthric speakers often have difficulty maintaining a steady volume); training data is labelled at the word level using "silence | word | silence"; and the sampling rate for audio data is 16 kHz, with a 10 ms frame rate.
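By way of illustration only, a minimal sketch of the "straight-through" topology described above is given below in Python. The patent specifies only the topology, not any probabilities, so the 0.6 self-loop value is an assumption for illustration; training would re-estimate such values.

```python
import numpy as np

def straight_through_transitions(n_states: int = 11,
                                 self_loop: float = 0.6) -> np.ndarray:
    """Transition matrix for a whole-word "straight-through" HMM:
    each state permits only a self-transition and a transition to
    the next state, as described above."""
    A = np.zeros((n_states, n_states))
    for s in range(n_states - 1):
        A[s, s] = self_loop            # remain in the current state
        A[s, s + 1] = 1.0 - self_loop  # or advance to the next state
    A[-1, -1] = 1.0                    # final (absorbing) state
    return A

A = straight_through_transitions()
assert np.allclose(A.sum(axis=1), 1.0)  # each row is a valid distribution
```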
Preferably, the speech processing engine or recogniser 104 is configured to be able to modify a dysarthric speaker's vocabulary. It is usual for a dysarthric speaker to produce some words more consistently than others. For example, "TV" might be an easier proposition than "television". While clinical assessment might help in identifying such words that may be articulated more consistently, the speech processing engine is arranged to provide a quantitative measure of word-level consistency.
Furthermore, preferred embodiments provide a measure of the overall consistency of the speech in any given training corpus across all or selected words of that corpus. Such a measure of overall consistency might be used to assess the severity of the dysarthria and to record a dysarthric speaker's progress as any therapy proceeds. Still further embodiments can be realised that track utterance-level consistency to provide an indication of the correlation between the probability scores returned by a dysarthric speaker's individual pronunciations of a given word and their norm for that word. Utterance-level consistency might be used by a clinician to identify outlier utterances and, if warranted, to remove such utterances from the training corpus. The utterance-level consistency might also be used to identify cases where a dysarthric speaker shows two production or articulation styles of the same word, in which case two different speech models for that word might be provided.
Having some means of predicting confusion errors might allow more robust recognition to be realised. Therefore, the speech processing engine 104, in preferred embodiments, provides a measure of confusability. The confusability measure is arranged to allow words or groups of words that might be confused with one another to be modified or removed from the speech models 124. Preferred embodiments approach or provide a measure of confusability using forced alignment based upon automatic speech recognition derived probability scores rather than the more conventional phonetically-based approach of the prior art. "Alignment" in this context means that utterances are processed to identify word boundaries and the speech unit or units defined by such word boundaries are subjected to the speech models.
These consistency and confusability measures should be based on the training set, that is, the training corpus, and the trained models. The training set or corpus is stored together with the trained models. Further embodiments use rules to implement forced alignment of training set utterances against the models under the following assumptions: a training set for a vocabulary has N words, W_1..W_N; a CDHMM, M_i, is provided for each word W_i; and w_jk is the kth repetition of the jth word of the training set.
A per-frame log likelihood L_ijk is calculated for each model generating each example of each word on a Viterbi path. The consistency, δ_i, of a word, W_i, is obtained by:

δ_i = (Σ_k L_iik) / n_i, (1)

where n_i is the number of examples of a given word, W_i, in a training set. An average score for a word is obtained by aligning all examples of that word against the model for that word.
Conventionally, the more variation there is in the training data for each speech unit, the larger the variances will be in that speech unit's CDHMM score distributions. The forced alignment likelihoods will be lower for an inconsistently spoken word than for a consistently spoken word since its distributions will be flatter.
The overall consistency of the training corpus, Δ, is the average of all consistencies for all words within that corpus. Therefore, the overall consistency is given by:

Δ = (Σ_i δ_i) / N. (2)

As indicated above, the measure of overall consistency of a training corpus might be used to assess the severity of dysarthria and/or to record a dysarthric speaker's progress as any therapy proceeds.
The confusability between two given words, W_i and W_j, is defined by:

C_ij = (Σ_k L_ijk) / n_j, (3)

which is a measure of the average score obtained by aligning examples of a given word, W_j, against the CDHMM, M_i, for a different word, W_i. A higher value of C_ij implies a greater likelihood that W_j will be misrecognised as W_i.
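A minimal sketch of equations (1) to (3) follows, assuming the forced-alignment scores have already been produced (for example by HTK's HVite tool) and collected into nested lists; that data layout is an assumption for illustration.

```python
import numpy as np

def consistency_and_confusability(loglik):
    """loglik[i][j] holds the per-frame log likelihoods L_ijk: the
    k-th training example of word W_j force-aligned against the model
    M_i for word W_i (see equations (1)-(3) above)."""
    N = len(loglik)
    # Equation (1): consistency of W_i, examples scored by their own model M_i.
    delta = np.array([np.mean(loglik[i][i]) for i in range(N)])
    # Equation (2): overall consistency of the training corpus.
    overall = delta.mean()
    # Equation (3): confusability, examples of W_j scored by a different M_i.
    C = np.array([[np.mean(loglik[i][j]) for j in range(N)]
                  for i in range(N)])
    return delta, overall, C
```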
Training aid software 130 can be used by a clinician in analysing the confusability results.
Figure 2 shows an example 200 of the output produced by the speech processing engine 104 for a severely dysarthric speaker. The output 200 is provided with a confusability variance calibration gray-scale 202, which provides an indication of the degree of confusability of any two words forming part of a confusability matrix 204. The confusability variance calibration gray-scale 202 is arranged such that the darker the shade, the greater the confusability. From the example of the confusability matrix 204, it can be appreciated that there is a higher risk of confusion between the word TV and the set of words {alarm, lamp, channel, down, radio, volume} as compared to a risk of confusion between the word TV and the set of words {on, off, up}. The confusability matrix 204 can be used by a clinician to tailor the vocabulary recognised by the speech processing engine 104 to reduce the probability or risk of confusion between any two words. This should, in turn, improve the response of the speech processing engine 104 to the dysarthric speaker's utterances. It will be appreciated that in some cases the risk of confusion can exist between several words, in which case a number of alternative words might need to be considered.

The words in the first column or first row of the confusability matrix represent the whole or part of a dysarthric speaker's vocabulary. A clinician will construct such a confusability matrix using either a test or training set of words or an intended vocabulary of words for that speaker only. Such an initial vocabulary of words might be refined to remove words and to introduce alternative words if it is noted that there is significant confusion between selected words.

In a preferred embodiment, the training aid software 130 has a further mode of operation, which allows a dysarthric speaker, using the microphone 108 and switch 110, to practise their articulation of selected words. The dysarthric speaker can select a word to be practised from the whole, or part, of their intended vocabulary. Alternatively, a clinician might make that selection on behalf of the dysarthric speaker as part of an interactive therapy session. Using the switch 110, the dysarthric speaker can arrange for the speech processing engine 104 to record and process their utterances. The utterance is compared to the speech model that corresponds to the norm of that word for that dysarthric speaker. The speech processing engine 104 returns a probability score to the training aid software 130 that reflects the closeness of match of the utterance with the speech model representing the norm or average of the utterance for that dysarthric speaker. This measure of consistency is preferably presented to the dysarthric speaker visually as a bar chart comprising two bars. The first bar represents the probability score of the utterance in the training corpus with a score closest to the norm, and the second bar represents the probability score of the most recent utterance, that is, it represents a means by which the dysarthric speaker can compare their most recent utterance with the norm for that utterance. Preferably, the dysarthric speaker can use the switch 110 to play the utterance corresponding to their norm for that utterance in advance of practising that utterance. The training aid software 130 preferably records all utterances of such practice sessions to allow the recorded data to be analysed by a clinician or speech therapist.
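A minimal sketch of the two-bar comparison described above, rendered as text rather than graphics; it assumes the scores are negative per-frame log likelihoods, with values closer to zero indicating a closer match.

```python
def practice_feedback(utterance_score: float, norm_score: float,
                      width: int = 40) -> str:
    """Compare the speaker's latest utterance score against the norm
    (the training-corpus utterance whose score is closest to average).
    Assumes both scores are negative log likelihoods."""
    ratio = min(norm_score / utterance_score, 1.0) if utterance_score else 1.0
    return ("norm      |" + "#" * width + f"| {norm_score:8.2f}\n"
            "utterance |" + "#" * int(width * ratio) + f"| {utterance_score:8.2f}")

print(practice_feedback(utterance_score=-62.4, norm_score=-48.1))
```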
In preferred embodiments, utterances that deviate by more than a predetermined value such as, for example, 20%, from the norm of that utterance for that dysarthric speaker are highlighted or drawn to the attention of the clinician or speech therapist. Identifying such anomalous utterances allows them to be removed from any data that might be used to influence the performance of the corresponding CDHMM. Providing a measure of the closeness of the fit of the recent utterance with the dysarthric speaker's norm for that utterance allows the dysarthric speaker to practise that utterance so that they might be able to articulate the utterance in a manner more consistent with their norm. It should be noted that more consistent articulation of any given utterance does not necessarily imply that the utterance will be intelligible to a person unfamiliar with the dysarthric speaker.
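A sketch of that outlier check, assuming per-utterance probability scores and the example 20% threshold given above:

```python
def flag_outliers(scores, norm_score, tolerance=0.20):
    """Return the indices of utterances whose score deviates from the
    speaker's norm by more than `tolerance`, so a clinician can review
    them and, if warranted, exclude them from the training corpus."""
    return [k for k, s in enumerate(scores)
            if abs(s - norm_score) > tolerance * abs(norm_score)]
```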
It will be appreciated that as the accuracy of the speech processing engine or recogniser 104 increases, the degree of control exerted by a dysarthric speaker over the equipment 114 to 118 also increases. This increased degree of control will lead to an improved quality of life for a dysarthric speaker.
Referring to figure 3, there is shown a flowchart 300 that illustrates the basic steps undertaken by the training aid software 130 in conjunction with the recogniser 104 in allowing a dysarthric speaker to practise utterances of a selected or target word. At step 302, the target word is selected from a number of displayed words. An articulated utterance corresponding to the selected word is recorded by the recogniser 104 at step 304. The recogniser 104 compares the utterance with an appropriate speech model corresponding to the selected word at step 306. An analysis of the closeness of fit of the most recently recorded and processed utterance with the norm of the utterance for that speaker is performed at step 308 and, at step 310, feedback is provided to the dysarthric speaker on the closeness of fit of their most recent utterance with their norm for that utterance.
In an alternative embodiment, or additionally, rather than a dysarthric speaker or clinician selecting a target word at step 302, the system can be arranged so that the dysarthric speaker may merely actuate the switch 110 to provide an indication to the speech processing engine that the next utterance is intended to be a control or communication command. In such embodiments, the recognition step 306 compares the data produced by the signal processing step 304 with all of the speech models 124 for that dysarthric speaker to identify the best match. Once a match has been identified, the device control software 112 determines whether or not a control signal should be output in response to that match. It will be appreciated by one skilled in the art that all possible command phrases in the recogniser's vocabulary are associated with either speech output or a control signal by the clinician when the system is trained and configured. For example, appropriate codes are stored together with command speech models, which are used to produce corresponding infrared signals via the hardware interface. These codes are tailored to the particular types of hardware and hardware interface present. Therefore, it will be appreciated that if the hardware interface or an item of hardware were changed, the information stored would also be changed accordingly.
If appropriate, the device control software 112 produces a control signal via the hardware interface 120 that is suitable for controlling a corresponding item of equipment 114 to 118.
In alternative embodiments, or additionally, rather than outputting a control signal, a speech synthesis engine 132 can be arranged to output intelligible speech. In such an embodiment, it will be appreciated that the computer system 102 is acting, effectively, as a translation aid that translates between dysarthric speech and conventional speech that is intelligible to those unfamiliar with the dysarthric speaker. For example, this would allow a dysarthric speaker to greet a friend using the word "hello". It would also support a greater verbal interaction between a dysarthric speaker and a person unfamiliar with the dysarthric speaker. This is illustrated in figure 4.
Figure 4 shows a flowchart 400 of the processing performed by the computer system 102 to assist the dysarthric speaker in communicating with people or interacting with their environment. The computer system 102 is arranged to operate in a command/control and communication mode. In this mode, the dysarthric speaker provides an indication to the speech processing engine that it should enter a record mode of operation to record an utterance of the dysarthric speaker. Such an indication can be provided using a mouse or keyboard of the computer system, a specially adapted input device according to the physical capabilities of the dysarthric speaker or via the automatic detection of the speaker's voice close to the microphone without using an additional input device. The latter might be achieved using, for example, a volume threshold such that the system infers that an utterance intended for processing by the system has been made if the volume of that utterance is above that volume threshold, with all other detected utterances below the threshold being ignored.
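A minimal sketch of such a volume gate; the RMS computation is standard, but the threshold value is an assumption and presumes audio samples normalised to the range [-1, 1].

```python
import numpy as np

def is_intended_utterance(samples: np.ndarray,
                          volume_threshold: float = 0.05) -> bool:
    """Infer that an utterance is intended for the system only if its
    RMS level exceeds the threshold; quieter audio is ignored."""
    rms = np.sqrt(np.mean(np.square(samples.astype(np.float64))))
    return rms > volume_threshold
```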
Therefore, at step 402, the speech processing engine receives a signal that is indicative of the input device having been actuated. Any speech uttered following actuation of the input device is recorded by the speech processing engine 104 and converted into a suitable form for processing by the CDHMM recogniser at step 404. The speech processing engine 104 performs speech recognition at step 406 to determine the best or most appropriate correlation between the utterance and one of the speech models 124.
Processing is performed, in light of the recognition at step 406, to determine whether the dysarthric speaker has issued a command that needs to be parsed. This processing is performed at step 408. For example, the dysarthric speaker may have uttered "TV on", which is a command to switch on the television. The speech processing engine 104 is arranged, when identifying a first word of an utterance, to check the utterance, or any part of the utterance, against all speech models 124. Having matched at least part of the utterance with one of the speech models 124, the speech processing engine 104 is sufficiently sophisticated to compare the next, or the last, part of the utterance with a limited set of speech models selected from the whole set of speech models 124. This will reduce the processing burden imposed upon the speech processing engine 104. For example, if the utterance is "TV on" and the speech processing engine has identified the first part of the utterance as "TV", the speech processing engine might then expect the second part of the utterance to be a command such as "on", "off", "volume", "channel", "up" or "down".
Hence, the second part of the utterance would be processed by speech models corresponding to those words, that is, a limited set of the speech models is used in identifying subsequent parts of an overall utterance.
A further example of comparing a later part of an utterance with a very limited set of the speech models, having identified an earlier part of an utterance, would be the commands that control a lamp. The lamp has two states, namely on and off, and a speech processing engine, having identified part of the utterance as "lamp", would then only expect a subsequent part of the utterance to be "on" or "off" and would, therefore, only need to use two further speech models in fully processing the utterances "lamp on" and "lamp off".
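A minimal sketch of this narrowing, using the "TV" and "lamp" follow-word sets from the examples above; the dictionary layout is an assumption for illustration.

```python
# Follow-word sets licensed by the first word of an utterance; only the
# corresponding speech models are scored for the next utterance part.
FOLLOW_WORDS = {
    "TV":   {"on", "off", "volume", "channel", "up", "down"},
    "lamp": {"on", "off"},
}

def candidate_models(first_word: str, all_models: dict) -> dict:
    """Restrict recognition of the next part of an utterance to the
    models licensed by the first word, else fall back to the full set."""
    allowed = FOLLOW_WORDS.get(first_word)
    if allowed is None:
        return all_models
    return {word: model for word, model in all_models.items()
            if word in allowed}
```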
The filtering or narrowing, that is, the selection of a reduced set of speech models for use in subsequent processing, may be further reduced if the speech processing engine 104 also stores data relating to the current state of an item of equipment. For example, once a television has been switched on, it is unlikely that a subsequent part of an utterance following "TV" will be "on" since the television is already in the on-state. Therefore, the speech model for recognising "on" need not be considered in such circumstances. It will be appreciated that such embodiments might assume that a given item of equipment is in a desired state following a command or that some form of feedback from the equipment to the system is provided to confirm that it has assumed the desired state. Having parsed an utterance at step 408, a determination is made, at step 409, as to whether or not the parsed utterance is a command. Therefore, if appropriate, respective control signals for controlling the equipment 114 to 118 are generated and output at step 410. The device control software 112 has access to a table 134 of parsed commands 136 that are mapped to corresponding control signals 138.
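A sketch of table 134 as a simple lookup; the byte codes below are placeholders, since the real codes depend on the equipment and hardware interface installed.

```python
# Parsed commands mapped to control codes which the hardware interface
# turns into, for example, infrared signals.  The codes are placeholders.
COMMAND_TABLE = {
    ("TV", "on"):    b"\x01\x10",
    ("TV", "off"):   b"\x01\x11",
    ("lamp", "on"):  b"\x02\x10",
    ("lamp", "off"): b"\x02\x11",
}

def control_signal(parsed_command: tuple):
    """Return the control code for a parsed command, or None when the
    utterance is communication rather than a command."""
    return COMMAND_TABLE.get(parsed_command)
```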
A further example in which knowledge of the current state of the controlled equipment can be used to improve the ease with which a dysarthric speaker can interact with their environment is when, for example, only the radio is switched on. Assume that the command "volume up" is issued by the dysarthric speaker. Knowing that only the radio is currently switched on, the speech processing engine 104 is sufficiently sophisticated to be able to ignore all speech models except those relating to the radio.
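State-aware filtering of the candidate set can be sketched as below, building on the `candidate_models` helper above; the simple on/off state model is an assumption covering the two examples in the text.

```python
def state_filtered(candidates: set, device: str, state: dict) -> set:
    """Drop follow-words ruled out by the device's current state: once
    the TV is on, the model for "on" need not be scored, and vice versa."""
    if state.get(device) == "on":
        return candidates - {"on"}
    if state.get(device) == "off":
        return candidates - {"off"}
    return candidates

# Example: the television is already on, so only the remaining commands
# are considered for the second part of the utterance.
remaining = state_filtered({"on", "off", "volume", "channel", "up", "down"},
                           "TV", {"TV": "on"})
```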
If the parsed utterance is not recognised at step 409 as a command, it is assumed to be speech intended to be communicated to a listener. That speech communication is converted into a more intelligible form, that is, it is converted into a form that may be understood by someone unfamiliar with the dysarthric speaker, and subsequently output. The subsequent output might be in the form of text on the screen of the computer system 102 or in the form of synthesised speech. This processing takes place at step 412.
Embodiments of the present invention may use a table comprising a mapping between dysarthric utterances and conventional speech or text. The speech may be output by the speech synthesiser or the text may be output for display. Alternatively, or additionally, the text may be converted to speech by the speech synthesiser.
The reader's attention is directed to all papers and documents which are filed concurrently with or previous to this specification in connection with this application and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference.
All of the features disclosed in this specification (including any accompanying claims, abstract and drawings) and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.
Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
The invention is not restricted to the details of any foregoing embodiments. The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.
Claims (19)
1. An assistive technology system comprising a speech processor operable to process an input utterance to identify that utterance; means to output a control signal corresponding to the identified utterance for influencing the operation of respective equipment; the system being characterised by the speech processor comprising means to calculate a confusability measure that reflects a degree of correlation between the input utterance and at least a further utterance; each of the input utterance and the further utterance corresponding to respective words of a vocabulary of words, and means to replace at least one of the respective words with at least a further word having a corresponding utterance having a different degree of correlation with the at least one of the input utterance and the further utterance and means to associate the further word with the control signal.
2. An assistive technology system as claimed in claim 1 in which the means to calculate the confusability measure comprises means to subject the utterance, having a corresponding speech model, to a speech model for the further utterance to determine the response of the speech model for the further utterance and means to provide the confusability measure according to that response.
3. An assistive technology system as claimed in claim 2 in which the means to calculate the confusability measure between the words, W_i and W_j, of the input utterance and the further utterance respectively comprises means to calculate

C_ij = (Σ_k L_ijk) / n_j,

where L_ijk is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, j and k represent the kth repetition of the jth word in a training set comprising N words W_1 to W_N, and n_j is the number of examples of W_j in the training set.
4. An assistive technology system as claimed in any preceding claim, further comprising means to calculate a consistency measure for the input utterance and means to output a visual indication of the consistency measure.
5. An assistive technology system as claimed in claim 4 in which the means to calculate the consistency measure comprises means to calculate

δ_i = (Σ_k L_iik) / n_i,

where L_iik is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, i and k represent the kth repetition of the ith word in a training set comprising N words W_1 to W_N, and n_i is the number of examples of W_i in the training set.
6. An assistive technology system as claimed in claim 5 in which the means to calculate the consistency measure comprises means to calculate

Δ = (Σ_i δ_i) / N.
7. An assistive technology system as claimed in either of claims 5 and 6 in which the means to output a visual indication of the consistency measure comprises means to present a bar chart comparison of the consistency measure with an average consistency measure for that word for a given speaker.
8. A system as claimed in any preceding claim in which the assistive technology system is a dysarthric speech assistive technology system.
9. A method of training or treating a speech impaired speaker comprising the steps of processing an input utterance of the speech impaired speaker using a corresponding speech model; providing a visual indication of the degree of correlation between the input utterance and a predetermined utterance of the speech impaired speaker for the corresponding speech model.
10. A method of training or treating a speech impaired speaker as claimed in claim 9 in which the degree of correlation between the input utterance and the predetermined utterance of the speech impaired speaker for the corresponding speech model comprises the step of calculating

δ_i = (Σ_k L_iik) / n_i,

where L_iik is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, i and k represent the kth repetition of the ith word in a training set comprising N words W_1 to W_N, and n_i is the number of examples of W_i in the training set.
11. A method of training or treating a speech impaired speaker as claimed in either of claims 9 and 10 further comprising the step of establishing the predetermined utterance of the speech impaired speaker for the speech model.
12. A method of training or treating a speech impaired speaker as claimed in any of claims 9 to 11 further comprising the step of processing a plurality of utterances corresponding to the same word and calculating a measure of the average of the plurality of utterances.
13. A method of training or treating a speech impaired speaker as claimed in any of claims 9 to 12, further comprising the steps of establishing a plurality of utterances corresponding to respective words of a plurality of words; calculating a measure of confusability between utterances corresponding to at least a selected pair of words; and selecting an alternative word to replace one of the selected pair of words in the vocabulary; the alternative word having an improved measure of confusability between a respective utterance for the alternative word and the utterance corresponding to the remaining word of the selected pair of words.
14. A method of training or treating as claimed in any of claims 9 to 13 in which the speech impaired speaker is a dysarthric speaker.
15. An assistive technology system comprising means to implement a method as claimed in any of claims 9 to 14.
16. An assistive technology system substantially as described herein with reference to and/or as illustrated in the accompanying drawings.
17. A method of training or treating a speech impaired person substantially as described herein with reference to and/or as illustrated in the accompanying drawings.
18. A computer program element comprising computer program code means to implement a method or system as claimed in any preceding claim.
19. A computer program product comprising computer readable storage media storing a computer program element as claimed in claim 18.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB0307201A GB2399931A (en) | 2003-03-28 | 2003-03-28 | Assistive technology |
| GB0406932A GB2399932A (en) | 2003-03-28 | 2004-03-29 | Speech recognition |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB0307201A GB2399931A (en) | 2003-03-28 | 2003-03-28 | Assistive technology |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| GB0307201D0 GB0307201D0 (en) | 2003-04-30 |
| GB2399931A true GB2399931A (en) | 2004-09-29 |
Family
ID=9955748
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| GB0307201A Withdrawn GB2399931A (en) | 2003-03-28 | 2003-03-28 | Assistive technology |
| GB0406932A Withdrawn GB2399932A (en) | 2003-03-28 | 2004-03-29 | Speech recognition |
Family Applications After (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| GB0406932A Withdrawn GB2399932A (en) | 2003-03-28 | 2004-03-29 | Speech recognition |
Country Status (1)
| Country | Link |
|---|---|
| GB (2) | GB2399931A (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103405217A (en) * | 2013-07-08 | 2013-11-27 | 上海昭鸣投资管理有限责任公司 | System and method for multi-dimensional measurement of dysarthria based on real-time articulation modeling technology |
| CN105719662A (en) * | 2016-04-25 | 2016-06-29 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Dysarthrosis detection method and dysarthrosis detection system |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP2027572B1 (en) * | 2006-05-22 | 2009-10-21 | Philips Intellectual Property & Standards GmbH | System and method of training a dysarthric speaker |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| GB2187586A (en) * | 1986-02-06 | 1987-09-09 | Reginald Alfred King | Acoustic recognition |
| US6185530B1 (en) * | 1998-08-14 | 2001-02-06 | International Business Machines Corporation | Apparatus and methods for identifying potential acoustic confusibility among words in a speech recognition system |
| EP1217609A2 (en) * | 2000-12-22 | 2002-06-26 | Hewlett-Packard Company | Speech recognition |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6253181B1 (en) * | 1999-01-22 | 2001-06-26 | Matsushita Electric Industrial Co., Ltd. | Speech recognition and teaching apparatus able to rapidly adapt to difficult speech of children and foreign speakers |
| GB9920257D0 (en) * | 1999-08-26 | 1999-10-27 | Canon Kk | Signal processing system |
| US6754625B2 (en) * | 2000-12-26 | 2004-06-22 | International Business Machines Corporation | Augmentation of alternate word lists by acoustic confusability criterion |
| GB2385698B (en) * | 2002-02-26 | 2005-06-15 | Canon Kk | Speech processing apparatus and method |
-
2003
- 2003-03-28 GB GB0307201A patent/GB2399931A/en not_active Withdrawn
-
2004
- 2004-03-29 GB GB0406932A patent/GB2399932A/en not_active Withdrawn
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| GB2187586A (en) * | 1986-02-06 | 1987-09-09 | Reginald Alfred King | Acoustic recognition |
| US6185530B1 (en) * | 1998-08-14 | 2001-02-06 | International Business Machines Corporation | Apparatus and methods for identifying potential acoustic confusibility among words in a speech recognition system |
| EP1217609A2 (en) * | 2000-12-22 | 2002-06-26 | Hewlett-Packard Company | Speech recognition |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103405217A (en) * | 2013-07-08 | 2013-11-27 | 上海昭鸣投资管理有限责任公司 | System and method for multi-dimensional measurement of dysarthria based on real-time articulation modeling technology |
| CN103405217B (en) * | 2013-07-08 | 2015-01-14 | 泰亿格电子(上海)有限公司 | System and method for multi-dimensional measurement of dysarthria based on real-time articulation modeling technology |
| CN105719662A (en) * | 2016-04-25 | 2016-06-29 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Dysarthrosis detection method and dysarthrosis detection system |
| CN105719662B (en) * | 2016-04-25 | 2019-10-25 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Dysarthria detection method and system |
Also Published As
| Publication number | Publication date |
|---|---|
| GB0307201D0 (en) | 2003-04-30 |
| GB2399932A (en) | 2004-09-29 |
| GB0406932D0 (en) | 2004-04-28 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) | |