US20170075652A1 - Electronic device and method
- Publication number: US20170075652A1 (application US15/056,942)
- Authority: US (United States)
- Prior art keywords: user, speaker, registered, specific utterance, utterances
- Legal status: Abandoned (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
- G06F3/165—Management of the audio stream, e.g. setting of volume, audio stream path
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/22—Interactive procedures; Man-machine interfaces
- G10L21/12—Transforming into visible information by displaying time domain information
- G10L25/87—Detection of discrete points within a voice signal
- G10L17/00—Speaker identification or verification techniques
Definitions
- Embodiments described herein relate generally to an electronic device and a method.
- Conventionally, an electronic device records an audio signal and displays a waveform of the audio signal. If the electronic device records the audio signals of a plurality of members in a meeting, the waveforms do not identify the speakers.
- FIG. 1 is a plan view showing an example of an external appearance of embodiments.
- FIG. 2 is a block diagram showing an example of a system configuration of embodiments.
- FIG. 3 is a block diagram showing an example of a functional configuration of a voice recorder application of embodiments.
- FIG. 4 is a diagram showing an example of a home view of embodiments.
- FIG. 5 is a diagram showing an example of a recording view of embodiments.
- FIG. 6 is a diagram showing an example of a reproduction view of embodiments.
- FIG. 7 is a block diagram showing an example of a functional configuration of a visualization engine of embodiments.
- FIG. 8 is a flowchart showing an example of a series of procedures of analysis processing by the voice recorder application of embodiments.
- FIG. 9 is a diagram for explaining a status related to a speaker name.
- FIG. 10 shows an example of a pop-up displayed when a user corrects the speaker name.
- FIG. 11 shows an example of a tutorial displayed in the reproduction view.
- an electronic device includes a microphone, a memory, and a hardware processor.
- the microphone is configured to obtain audio and convert the audio into an audio signal, the audio including utterances from a first user and utterances from a second user, wherein one of the first user or the second user is a registered user and the other of the first user or second user is an unregistered user.
- the memory stores an identifier associated with the registered user.
- The hardware processor is in communication with the memory and is configured to: record the audio signal; determine a plurality of user-specific utterance features within the audio signal, the plurality of user-specific utterance features including a first set of user-specific utterance features associated with the registered user and a second set of user-specific utterance features associated with the unregistered user; and display the identifier of the registered user differently than an identifier of the unregistered user.
- FIG. 1 shows a plan view of an example of an electronic device according to certain embodiments.
- An electronic device 1 may include, for example, a tablet personal computer (portable personal computer [PC]), a smartphone (multifunctional mobile phone), or a personal digital assistant (PDA).
- In the following description, it is assumed that the electronic device 1 is a tablet personal computer.
- However, the disclosure is not limited as such, and the electronic device 1 may include one or more of the previously described systems.
- Each element and each structure described hereinafter may be implemented by hardware, by software using a microcomputer (processor, central processing unit [CPU]), or by a combination of hardware and software.
- the tablet personal computer (hereinafter, referred to as a tablet PC) 1 may include a main body 10 and a touchscreen display 20 .
- a camera 11 may be disposed at a particular position of the main body 10 , for example, a central position on an upper end of a surface of the main body 10 .
- microphones 12 R and 12 L are disposed at two predetermined positions of the main body 10 , for example, two positions separated from each other on the upper end of the surface of the main body 10 .
- the camera 11 may be disposed between the two microphones 12 R and 12 L.
- the number of microphones may be optionally set.
- Loudspeakers 13 R and 13 L are disposed at other two predetermined positions of the main body 10 , for example, a left side and a right side of the main body 10 .
- Although not shown in the figures, a power switch (power button), a lock mechanism, an authentication unit, etc., may be disposed at other predetermined positions of the main body 10 .
- the power switch turns on or off a power supply that supplies power to one or more elements of the tablet PC 1 , enabling a user to use the tablet PC 1 (activating the tablet PC 1 ).
- the lock mechanism for example, locks the operation of the power switch at the time of conveyance.
- the authentication unit for example, reads (biometric) data associated with a user's finger or palm to authenticate the user.
- the touchscreen display 20 includes a flat panel display 21 , such as a liquid crystal display (LCD), and a touchpanel 22 .
- the flat panel display 21 may include a plasma display or an organic LED (OLED) display.
- the touchpanel 22 is attached to the surface of the main body 10 so as to cover a screen of the LCD 21 .
- the touchscreen display 20 detects a touch position of an external object (stylus or finger) on a display screen.
- the touchscreen display 20 may support a multi-touch function by which a plurality of touch positions can be simultaneously detected.
- the touchscreen display 20 can display several icons for activating various application programs on the screen.
- the icons may include an icon 290 for activating a voice recorder program.
- the voice recorder program has a function of visualizing the content of a recording in a meeting, etc.
- FIG. 2 shows an example of a system configuration of the tablet PC 1 .
- the tablet PC 1 may comprise a CPU 101 , a system controller 102 , a main memory 103 which is volatile memory, such as RAM, a graphics controller 104 , a sound controller 105 , a BIOS-ROM 106 , a nonvolatile memory 107 , an EEPROM 108 , a LAN controller 109 , a wireless LAN controller 110 , a vibrator 111 , an acceleration sensor 112 , an audio capture unit 113 , an embedded controller (EC) 114 , etc.
- the CPU 101 is a processor circuit configured to control operation of each element in the tablet PC 1 .
- the CPU 101 executes various programs loaded from the nonvolatile memory 107 to the main memory 103 .
- the programs may include an operating system (OS) 201 and various application programs.
- the application programs include a voice recorder application 202 .
- the voice recorder application 202 can record an audio data item corresponding to a sound input via (collected by) the microphones 12 R and 12 L.
- the voice recorder application 202 can extract voice sections from the audio data item, and classify the respective voice sections into clusters corresponding to speakers in the audio data item.
- the voice recorder application 202 has a visualization function of displaying the voice sections speaker by speaker, using a result of cluster classification. By the visualization function, which speaker spoke (uttered) can be visibly presented to the user.
- the voice sections include utterances, that is, sounds produced by a user, such as humming, whistling, moans, grunts, singing, speech, and any other sounds a user may make.
- the voice recorder application 202 supports a speaker selection reproduction function of continuously reproducing only voice sections of a selected speaker.
- The recording function and the reproducing function of the voice recorder application 202 can each be carried out by a circuit such as a processor. Alternatively, these functions can be carried out by dedicated circuits such as a recording circuit 121 and a reproduction circuit 122 .
- the recording circuit 121 and the reproduction circuit 122 have the recording function and the reproducing function that are carried out by the processor executing the voice recorder application 202 .
- the CPU 101 also executes a Basic Input/Output System (BIOS), which is a program for hardware control stored in the BIOS-ROM 106 .
- the system controller 102 is a device which connects a local bus of the CPU 101 and various components.
- the system controller 102 also contains a memory controller which exerts access control over the main memory 103 .
- the system controller 102 also has a function of communicating with the graphics controller 104 over a serial bus conforming to the PCI EXPRESS standard, etc.
- the system controller 102 also contains an ATA controller for controlling the nonvolatile memory 107 .
- the system controller 102 further contains a USB controller for controlling various USB devices.
- the system controller 102 also has a function of communicating with the sound controller 105 and the audio capture unit 113 .
- the graphics controller 104 is a display controller configured to control the LCD 21 of the touchscreen display 20 .
- a display signal generated by the graphics controller 104 is transmitted to the LCD 21 .
- the LCD 21 displays a screen image based on the display signal.
- the touchpanel 22 covering the LCD 21 functions as a sensor configured to detect a touch position of an external object on the screen of the LCD 21 .
- the sound controller 105 converts an audio data item to be reproduced into an analog signal, and supplies the analog signal to the loudspeakers 13 R and 13 L.
- the LAN controller 109 is a wired communication device configured to perform wired communication conforming to, for example, the IEEE 802.3 standard.
- the LAN controller 109 includes a transmission circuit configured to transmit a signal and a reception circuit configured to receive a signal.
- the wireless LAN controller 110 is a wireless communication device configured to perform wireless communication conforming to, for example, the IEEE 802.11 standard, and includes a transmission circuit configured to wirelessly transmit a signal and a reception circuit configured to wirelessly receive a signal.
- the acceleration sensor 112 is used to detect the current direction (portrait direction/landscape direction) of the main body 10 .
- the audio capture unit 113 carries out analog-to-digital conversion of sound input via the microphones 12 R and 12 L, and outputs digital signals corresponding to the sound.
- the audio capture unit 113 can transmit data indicating which of the sound inputs from the microphones 12 R and 12 L is greater in level to the voice recorder application 202 .
- the EC 114 is a single-chip microcontroller for power management. The EC 114 powers the tablet PC 1 on or off in response to the user's operation of the power switch.
- FIG. 3 shows an example of a functional configuration of the voice recorder application 202 .
- the voice recorder application 202 includes an input interface module 310 , a controller 320 , a reproduction processor 330 , a display processor 340 , etc., as functional modules of the same program.
- the input interface module 310 receives various events from the touchpanel 22 via a touchpanel driver 201 A.
- the events include a touch event, a move event, and a release event.
- the touch event indicates that an external object has touched the screen of the LCD 21 .
- the touch event includes coordinates of a touch position of the external object on the screen.
- the move event indicates that the touch position has been moved with the external object touching the screen.
- the move event includes coordinates of the touch position that has been moved.
- the release event indicates that the touch of the external object on the screen has been released.
- the release event includes coordinates of the touch position where the touch has been released.
- Tap: touching the user's finger at a position on the screen, and then separating it in a direction orthogonal to the screen ("tap" may be synonymous with "touch").
- Swipe: touching the user's finger at an arbitrary position on the screen, and then moving it in an arbitrary direction.
- Flick: touching the user's finger at an arbitrary position on the screen, then sweeping it in an arbitrary direction, and separating it from the screen.
- Pinch: touching the user's two fingers at an arbitrary position on the screen, and then changing the distance between the fingers on the screen.
- widening the distance between the fingers may be referred to as pinch-out (pinch-open)
- narrowing the distance between the fingers may be referred to as pinch-in (pinch-close).
- the controller 320 can detect on which part of the screen, which finger gesture (tap, swipe, flick, pinch, etc.) was made, based on various events received from the input interface module 310 .
- the controller 320 includes a recording engine 321 , a visualization engine 322 , etc.
- the recording engine 321 records an audio data item 401 corresponding to sound input via the microphones 12 L and 12 R and the audio capture unit 113 in the nonvolatile memory 107 .
- the recording engine 321 can perform recording in various scenes such as recording in a meeting, recording in a telephone conversation, and recording in a presentation.
- the recording engine 321 also can perform recording of other kinds of audio source input via means other than the microphones 12 L and 12 R and the audio capture unit 113 , such as broadcast and music.
- the recording engine 321 performs a voice section detection process of analyzing the recorded audio data item 401 and determining whether it is a voice section or a non-voice section (noise section, silent section) other than the voice section.
- the voice section detection process is performed, for example, for each voice data sample having a length of time of 0.5 seconds.
- a sequence of an audio data item (recording data item), that is, a series of digital audio signals, is divided into audio data units each having a length of 0.5 seconds (a set of audio data samples of 0.5 seconds).
- the recording engine 321 performs a voice section detection process for each audio data unit.
- An audio data unit of 0.5 seconds is an identification unit for identifying a speaker through a speaker identification process, which will be described later.
- In the voice section detection process, it is determined whether an audio data unit is a voice section or a non-voice section (noise section, silent section).
- For the determination of a voice section/non-voice section, any well-known technique can be used; for example, voice activity detection (VAD) may be used.
- the determination of a voice section/a non-voice section may be made in real time during recording.
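- As an illustration only, the per-unit determination could look like the following sketch, which substitutes a plain energy threshold for the unspecified VAD technique; the 16 kHz sample rate and the threshold value are assumptions, not taken from this document.

```python
import numpy as np

SAMPLE_RATE = 16_000               # assumed sample rate (not specified here)
UNIT_SAMPLES = SAMPLE_RATE // 2    # one 0.5-second audio data unit

def split_into_units(signal: np.ndarray) -> list:
    """Divide a recorded signal into consecutive 0.5-second units."""
    return [signal[i:i + UNIT_SAMPLES]
            for i in range(0, len(signal) - UNIT_SAMPLES + 1, UNIT_SAMPLES)]

def is_voice_section(unit: np.ndarray, threshold: float = 1e-3) -> bool:
    """Energy-based stand-in for VAD: mean power above a fixed threshold."""
    return float(np.mean(unit ** 2)) > threshold

# Example: label each 0.5-second unit of a 5-second signal.
signal = 0.01 * np.random.randn(SAMPLE_RATE * 5)
flags = [is_voice_section(u) for u in split_into_units(signal)]
```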
- the recording engine 321 extracts a feature amount (sound feature amount), such as a mel frequency cepstrum coefficient (MFCC), from an audio data unit identified as a voice section.
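- The MFCC computation itself is not spelled out here; as a hedged sketch, an open-source library such as librosa (an assumption, not a library this document names) computes it in one call.

```python
import numpy as np
import librosa

def unit_mfcc(unit: np.ndarray, sr: int = 16_000, n_mfcc: int = 13) -> np.ndarray:
    """Return an (n_mfcc x frames) MFCC matrix for one 0.5-second voice unit."""
    return librosa.feature.mfcc(y=unit.astype(np.float32), sr=sr, n_mfcc=n_mfcc)

# A per-unit sound feature amount can then be summarized, for example,
# by averaging over frames:
# feature = unit_mfcc(unit).mean(axis=1)
```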
- the visualization engine 322 performs a process of visualizing an outline of a whole sequence of the audio data item 401 in cooperation with the display processor 340 . Specifically, the visualization engine 322 performs a speaker identification process, and performs a process of distinguishably displaying when and which speaker spoke in a display area using a result of the speaker identification process.
- the speaker identification process may include speaker clustering.
- In speaker clustering, it is identified which speaker spoke in the voice sections included in a sequence from the start point to the end point of an audio data item.
- the respective voice sections are classified into clusters corresponding to speakers in the audio data item.
- a cluster is a set of audio data units of the same speaker.
- As a method of performing speaker clustering, various already-existing methods can be used. For example, in the present embodiment, both a method of performing speaker clustering using a speaker position and a method of performing speaker clustering using a feature amount (sound feature amount) of an audio data item may be used.
- the speaker position indicates the position of each speaker with respect to the tablet PC 1 .
- the speaker position can be estimated based on the difference between two audio signals input via the two microphones 12 L and 12 R. Sounds input at the same speaker position are estimated to be those made by the same speaker.
- For estimating the speaker position, any already-existing method can be used; for example, a method disclosed in JP 2011-191824 A (JP 5174068 B) may be used.
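- One simplified reading of position-based clustering is sketched below: estimate each voice section's inter-channel delay by cross-correlation of the left and right microphone signals, then group sections with similar delays. The actual method (e.g., that of JP 2011-191824 A) is more elaborate, and the tolerance value here is an assumption.

```python
import numpy as np

def estimate_delay(left: np.ndarray, right: np.ndarray) -> int:
    """Lag (in samples) at the peak of the full cross-correlation,
    used as a crude proxy for the speaker position."""
    corr = np.correlate(left, right, mode="full")
    return int(np.argmax(corr)) - (len(right) - 1)

def cluster_by_position(sections, tolerance: int = 3):
    """Assign a cluster id to each (left, right) voice section pair;
    sections whose delays agree within the tolerance share a cluster."""
    centers, labels = [], []
    for left, right in sections:
        d = estimate_delay(left, right)
        for i, c in enumerate(centers):
            if abs(d - c) <= tolerance:
                labels.append(i)
                break
        else:
            centers.append(d)
            labels.append(len(centers) - 1)
    return labels
```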
- Data indicating a result of speaker clustering is saved on the nonvolatile memory 107 as an index data item 402 .
- the visualization engine 322 displays individual voice sections in the display area. If there are a plurality of speakers, the voice sections are displayed in a form in which the speakers of the individual voice sections are distinguishable. That is, the visualization engine 322 can visualize the voice sections speaker by speaker, using the index data item 402 .
- the reproduction processor 330 reproduces the audio data item 401 .
- the reproduction processor 330 can continuously reproduce only voice sections while skipping a silent section.
- the reproduction processor 330 can also perform a selected-speaker reproduction process of continuously reproducing only voice sections of a specific speaker selected by the user while skipping voice sections of the other speakers.
- FIG. 4 shows an example of a home view 210 - 1 . If the voice recorder application 202 is activated, the voice recorder application 202 displays the home view 210 - 1 .
- the home view 210 - 1 displays a recording button 50 , an audio waveform 52 of a predetermined time (for example, thirty seconds), and a record list 53 .
- the recording button 50 is a button for giving instructions to start recording.
- the audio waveform 52 indicates a waveform of audio signals currently input via the microphones 12 L and 12 R.
- the waveform of audio signals appears continuously in real time at the position of a vertical bar 51 indicating the present time. Then, with the passage of time, the waveform of audio signals moves from the vertical bar 51 to the left.
- successive vertical bars have lengths according to power of respective successive audio signal samples.
- the record list 53 includes records stored in the nonvolatile memory 107 as audio data items 401 . It is herein assumed that there are three records: a record of the title “AAA meeting”, a record of the title “BBB meeting”, and a record of the title “Sample”. In the record list 53 , a recording date, a recording start time, and a recording end time of each record are also displayed. In the record list 53 , recordings (records) can be sorted in reverse order of creation date, in order of creation date, or in order of title.
- If a record in the record list 53 is selected (tapped) by the user, the voice recorder application 202 starts reproducing the selected record. If the recording button 50 of the home view 210 - 1 is tapped by the user, the voice recorder application 202 starts recording.
- FIG. 5 shows an example of a recording view 210 - 2 . If the recording button 50 is tapped by the user, the voice recorder application 202 starts recording and switches the display screen from the home view 210 - 1 of FIG. 4 to the recording view 210 - 2 of FIG. 5 .
- the recording view 210 - 2 displays a stop button 500 A, a pause button 500 B, a voice section bar 502 , an audio waveform 503 , and a speaker icon 512 .
- the stop button 500 A is a button for stopping the current recording.
- the pause button 500 B is a button for pausing the current recording.
- the audio waveform 503 indicates a waveform of audio signals currently input via the microphones 12 L and 12 R.
- the audio waveform 503 continuously appears at the position of a vertical bar 501 , and moves to the left with the passage of time, like the audio waveform 52 of the home view 210 - 1 .
- successive vertical bars have lengths according to power of respective successive audio signal samples.
- During recording, the above-described voice section detection process is performed. If one or more audio data units in an audio signal are detected to be a voice section (human voice), the voice section corresponding to the one or more audio data units is visualized by the voice section bar 502 as an object indicating the voice section.
- the length of the voice section bar 502 varies according to the length of time of the corresponding voice section.
- the voice section bar 502 can be displayed only after an input voice is analyzed by the visualization engine 322 and a speaker identification process is performed. Because the voice section bar 502 thus cannot be displayed right after recording, the audio waveform 503 is displayed as in the home view 210 - 1 . The audio waveform 503 is displayed in real time at the right end, and flows to the left side of the screen with the passage of time. After a certain time has passed, the audio waveform 503 switches to the voice section bar 502 . Although the audio waveform 503 alone cannot show whether power is due to voice or noise, the voice section bar 502 confirms that a human voice was recorded. Since the real-time audio waveform 503 and the delayed voice section bar 502 are displayed in the same row, the user's eyes can stay on that row, and useful information with good visibility is obtained.
- the audio waveform 503 does not switch to the voice section bar 502 at once, but gradually switches.
- the amplitude of the waveform display decreases with time so that the waveform display converges into the bar display.
- the current power is thereby displayed at the right end as the audio waveform 503 , the display flows from right to left, and in the process of updating the display, the waveform changes continuously or seamlessly to converge into a bar.
- a record name (“New record” in an initial state) and the date and time are displayed.
- a recording time (which may be the absolute time, but is herein an elapsed time from the start of recording) (for example, 00:50:02) is displayed.
- the speaker icon 512 is displayed. If a currently speaking speaker is identified, a speaking mark 514 is displayed under an icon of the speaker.
- a time axis with a scale per ten seconds is displayed.
- FIG. 5 visualizes voices for a certain time, for example thirty seconds, until the present time (right end), and shows earlier times on the left side. The time of thirty seconds can be changed.
- While the scale of the time axis of the home view 210 - 1 is fixed, the scale of the time axis of the recording view 210 - 2 is variable.
- the scale can be changed, and a display time (thirty seconds in the example of FIG. 5 ) can be changed.
- By flicking the time axis from side to side, the time axis moves from side to side, and a voice that was recorded a predetermined time before (at a certain time in the past) can also be visualized, although the display time is not changed.
- tags 504 A, 504 B, 504 C, and 504 D are displayed above the voice section bars 502 A, 502 B, 502 C, and 502 D.
- the tags 504 A, 504 B, 504 C, and 504 D are provided to select a voice section, and if a tag is selected, a display form of the tag changes.
- the change of the display form of the tag means that the tag has been selected. For example, the color, size, and contrast of the selected tag change.
- the selection of a voice section by a tag is made, for example, to designate a voice section which is reproduced with priority at the time of reproduction.
- FIG. 6 shows an example of a reproduction view 210 - 3 in a state in which the reproduction of the record of the title “AAA meeting” is paused while being performed.
- the reproduction view 210 - 3 displays a speaker identification result view area 601 , a seek bar area 602 , a reproduction view area 603 , and a control panel 604 .
- the speaker identification result view area 601 is a display area displaying the whole sequence of the record of the title “AAA meeting”.
- the speaker identification result view area 601 may display time axes 701 corresponding to respective speakers in the sequence of the record.
- five speakers are arranged in order of decreasing amount of utterance in the whole sequence of the record of the title “AAA meeting”.
- a speaker whose amount of utterance is the greatest in the whole sequence is displayed on the top of the speaker identification result view area 601 .
- the user can also listen to each voice section of a specific speaker by tapping the voice sections (voice section marks) of the specific speaker in order.
- the left ends of the time axes 701 correspond to a start time of the sequence of the record, and the right ends of the time axes 701 correspond to an end time of the sequence of the record. That is, the total time from the start to the end of the sequence of the record is allocated to the time axes 701 . However, if the total time is long and all the total time is allocated to the time axes 701 , the scale of the time axes may become too small and difficult to see. Thus, the size of the time axes 701 may be variable as in the case of the recording view.
- On the time axis 701 of each speaker, a voice section mark indicating the position and the length of time of a voice section of the speaker is displayed. Different colors may be allocated to speakers. In this case, voice section marks in colors different from speaker to speaker may be displayed. For example, on the time axis of the speaker “Mr. A”, voice section marks 702 may be displayed in a color (for example, red) allocated to the speaker “Mr. A”.
- the seek bar area 602 displays a seek bar 711 and a movable slider (also referred to as a locater) 712 .
- To the seek bar 711 , the total time from the start to the end of the sequence of the record is allocated.
- the position of the slider 712 on the seek bar 711 indicates the current reproduction position.
- From the slider 712 , a vertical bar 713 extends upward. Because the vertical bar 713 traverses the speaker identification result view area 601 , the user can easily understand in which speaker's (main speaker's) voice section the current reproduction position is.
- the position of the slider 712 on the seek bar 711 moves to the right with the progress of reproduction.
- the user can move the slider 712 to the right or to the left by a drag operation.
- the user can thereby change the current reproduction position to an arbitrary position.
- the reproduction view area 603 is an enlarged view of a period (for example, a period of approximately twenty seconds) in the vicinity of the current reproduction position.
- the reproduction view area 603 includes a display area long in the direction of the time axis (here, horizontally).
- several voice sections included in the period in the vicinity of the current reproduction position are displayed in chronological order.
- a vertical bar 720 indicates the current reproduction position. If the user flicks the reproduction view area 603 , displayed content in the reproduction view area 603 is scrolled to the left or to the right in a state in which the position of the vertical bar 720 is fixed. Consequently, the current reproduction position is also changed.
- FIG. 7 is a block diagram showing an example of functional configurations of the recording engine 321 and the visualization engine 322 shown in FIG. 3 .
- the recording engine 321 includes a voice section detection module 321 A, a sound feature extraction module 321 B, etc.
- the visualization engine 322 includes a clustering module 322 A, a speaker feature extraction module 322 B, a speaker registration module 322 C, a speaker identification module 322 D, a speaker provisional registration module 322 E, etc.
- the voice section detection module 321 A receives input of an audio data item from the audio capture unit 113 . In addition, the voice section detection module 321 A performs the above-described voice activity detection (VAD) for the received audio data item.
- the sound feature extraction module 321 B extracts a sound feature amount from a voice section detected by the voice section detection module 321 A as described above.
- the clustering module 322 A performs the above-described speaker clustering. Specifically, the clustering module 322 A classifies respective voice sections into clusters corresponding to speakers included in the audio data item (that is a set of the respective voice sections) on the basis of a speaker position and the sound feature amount as described above. Data indicating a result of the speaker clustering is stored in the nonvolatile memory 107 as an index data item 402 .
- the speaker feature extraction module (speaker learning module) 322 B performs a process of extracting a speaker-specific feature (speaker feature amount) from a (each) sound feature amount included in one or more voice sections classified into the same cluster by the clustering module 322 A.
- As a method of extracting a speaker feature amount from a sound feature amount, any already-existing method may be used; for example, a technique such as a code mapping method, a neural network method, or a Gaussian mixture model (GMM) is used.
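- As one concrete instance of the GMM option, a small mixture can be fit to the MFCC frames of a cluster and its stacked component means used as the speaker feature amount; scikit-learn is an assumed choice here, not one this document names.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def speaker_feature(mfcc_frames: np.ndarray, n_components: int = 4) -> np.ndarray:
    """Fit a diagonal-covariance GMM to MFCC frames (rows = frames) from one
    cluster and return the concatenated component means as the feature."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=0)
    gmm.fit(mfcc_frames)
    return gmm.means_.ravel()
```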
- the speaker registration module 322 C performs a process of registering (automatically registering) the speaker feature amount extracted by the speaker feature extraction module 322 B in the nonvolatile memory 107 as a speaker feature data item 403 .
- Let us assume that a speaker feature provisional data item 404 , including a speaker feature amount and a speaker name provisionally registered by the speaker provisional registration module 322 E (described later), is stored in the nonvolatile memory 107 .
- In this case, the speaker registration module 322 C associates the speaker feature data item 403 including a speaker feature amount according with the speaker feature amount included in the speaker feature provisional data item 404 with the speaker name included in the speaker feature provisional data item 404 .
- The speaker registration module 322 C then performs a process of reregistering (formally registering) it in the nonvolatile memory 107 as a new speaker feature data item 403 .
- Next, let us assume that a speaker feature provisional data item 404 provisionally registered by the speaker provisional registration module 322 E is stored in the nonvolatile memory 107 , but a speaker feature data item 403 including a speaker feature amount according with a speaker feature amount included in the speaker feature provisional data item 404 is not stored in the nonvolatile memory 107 . That is, let us assume that the speaker feature data item 403 on the speaker feature amount has been deleted from the nonvolatile memory 107 . In this case, the speaker registration module 322 C performs a process of reregistering the speaker feature provisional data item 404 in the nonvolatile memory 107 as a new speaker feature data item 403 .
- Conversely, if a speaker feature data item 403 including a speaker feature amount according with a speaker feature amount included in a speaker feature provisional data item 404 is stored in the nonvolatile memory 107 , the speaker feature amount included in the speaker feature data item 403 originally registered in the nonvolatile memory 107 is associated with the speaker name included in the speaker feature provisional data item 404 , and a new speaker feature data item 403 is registered.
- a new speaker feature data item 403 may be registered by overwriting the originally registered speaker feature data item 403 with the speaker feature provisional data item 404 .
- Let us assume that when the speaker registration module 322 C tries to register a speaker feature amount extracted by the speaker feature extraction module 322 B in the nonvolatile memory 107 as a speaker feature data item 403 , a speaker feature data item 403 including the speaker feature amount is already registered in the nonvolatile memory 107 .
- In this case, the speaker registration module 322 C does not perform the above-described registration for the speaker feature amount, and only the update of the importance, which will be described later, is performed.
- the speaker registration module 322 C performs a process of determining whether the number of speaker feature data items 403 registered in the nonvolatile memory 107 is greater than or equal to the predetermined number. If a result that the number of speaker feature data items 403 registered in the nonvolatile memory 107 is greater than or equal to the predetermined number is obtained as a result of the determination, the speaker registration module 322 C performs a process of deleting a speaker feature data item 403 .
- the speaker registration module 322 C performs a process of deleting a speaker feature data item 403 in accordance with the importance added to the speaker feature data items 403 , such that the number of speaker feature data items 403 registered in the nonvolatile memory 107 becomes less than the predetermined number, which will be described later in detail. That is, a speaker feature data item 403 of small importance is deleted. Accordingly, even if the number of speaker feature data items 403 which can be registered in the nonvolatile memory 107 is limited to enhance the precision of a speaker identification process, which will be described later, a speaker feature data item 403 important to the user can be left. That is, the precision of the speaker identification process can be enhanced without spoiling convenience. Because the details of the importance will be described later, a detailed explanation thereof is omitted herein.
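- The pruning rule reduces to a sort: when the registry reaches the limit, drop items in order of increasing importance until fewer than the limit remain. The limit of 100 and the dictionary layout below are assumptions for illustration.

```python
def prune_registry(items: list, limit: int = 100) -> list:
    """Keep registered speaker feature data items below the limit,
    deleting the least important first (per the importance value)."""
    if len(items) < limit:
        return items
    survivors = sorted(items, key=lambda it: it["importance"], reverse=True)
    return survivors[:limit - 1]   # strictly below the limit afterwards
```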
- the speaker identification module 322 D performs a process of comparing (speaker identification process) a speaker feature amount extracted by the speaker feature extraction module 322 B and a speaker feature amount included in a speaker feature data item 403 stored (registered) in the nonvolatile memory 107 .
- As a technique of comparing the extracted speaker feature amount and the speaker feature amount included in the registered speaker feature data item 403 , any already-existing technique may be used; for example, a technique such as i-vector is used.
- i-vector is a technique of extracting a speaker feature amount by reducing the number of dimensions of certain input using factor analysis. By this technique, speakers can be efficiently distinguished (compared) even from a small quantity of data.
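- Full i-vector extraction is outside the scope of a short example, but the comparison step amounts to scoring two fixed-length vectors; cosine scoring, the usual companion to i-vectors, is sketched below with an assumed (uncalibrated) decision threshold.

```python
import numpy as np

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(a: np.ndarray, b: np.ndarray, threshold: float = 0.7) -> bool:
    # The threshold is an assumption; a deployed system would calibrate it.
    return cosine_score(a, b) >= threshold
```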
- If a speaker name is included in a speaker feature data item 403 including a speaker feature amount according with the extracted speaker feature amount, the speaker identification module 322 D determines that one or more voice sections corresponding to the speaker feature amount (specifically, one or more voice sections including a sound feature amount used to extract the speaker feature amount) belong to the utterance of the speaker (person) indicated by the speaker name.
- If a speaker name is not included, the speaker identification module 322 D acquires the number of times the speaker feature amount was identified until the present (the number of times of speaker identification) as data on the importance added to the speaker feature data item 403 including the speaker feature amount. If the number of times of speaker identification is greater than or equal to two, the speaker identification module 322 D determines that the one or more voice sections corresponding to the speaker feature amount belong to the utterance of a speaker (person) whose speaker name has not been registered yet (a person who appeared in the past, but whose speaker name has not been registered yet). If the acquired number of times of speaker identification is one, the speaker identification module 322 D determines that the one or more voice sections corresponding to the speaker feature amount belong to the utterance of a new speaker (person) (a person who did not appear in the past).
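- The three-way decision just described can be condensed as follows; the field names are hypothetical stand-ins for the contents of a speaker feature data item 403.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeakerFeatureItem:           # stand-in for speaker feature data item 403
    speaker_name: Optional[str]     # None when no name has been registered
    identification_count: int       # times this feature was identified so far

def utterance_status(item: SpeakerFeatureItem) -> str:
    """Classify a matched feature into the three statuses described above."""
    if item.speaker_name is not None:
        return f"registered speaker: {item.speaker_name}"
    if item.identification_count >= 2:
        return "speaker whose name has not been registered yet"
    return "new speaker"
```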
- Data indicating a result of the speaker identification process is stored in the nonvolatile memory 107 as one data item included in an index data item 402 . That is, the index data item 402 includes data indicating a result of the speaker clustering and a result of the speaker identification regarding an audio data item corresponding to the index data item 402 .
- the speaker identification module 322 D also updates the importance added to a speaker feature data item 403 .
- the importance is, for example, a value calculated by equation (1) below.
- Importance = α × (number of times of speaker identification) + β × (last time and date of appearance) + γ × (presence or absence of user registration)   (1)
- α, β, and γ are weighting factors.
- the number of times of speaker identification included in equation (1) above represents the number of times a predetermined speaker feature amount was identified in the above-described speaker identification process until the present.
- the last time and date of appearance included in equation (1) above represents how many days ago the last (most recent) recording data item of one or more recording data items including one or more voice sections corresponding to the predetermined speaker feature amount was recorded.
- the presence or absence of user registration included in equation (1) above represents a value determined based on whether a speaker name is included (registered) in a speaker feature data item 403 including the predetermined speaker feature amount. Specifically, if a speaker name is registered, the value of the presence or absence of user registration in equation (1) above is one. And, if a speaker name is not registered, the value of the presence or absence of user registration in equation (1) above is zero.
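- Equation (1) translates directly into code; the worked examples that follow reproduce the values 1.1499 and 0.047 quoted below.

```python
def importance(identification_count: int,
               days_since_last_recording: int,
               name_registered: bool,
               alpha: float = 0.01,
               beta: float = -0.0001,
               gamma: float = 1.0) -> float:
    """Equation (1): a weighted sum of identification count, recency of the
    last recording, and presence of a user-registered speaker name."""
    return (alpha * identification_count
            + beta * days_since_last_recording
            + gamma * (1 if name_registered else 0))

print(importance(15, 1, True))    # ~1.1499 (first example below)
print(importance(5, 30, False))   # ~0.047  (second example below)
```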
- Let us assume that α is 0.01, β is −0.0001, γ is 1.0, a predetermined speaker feature amount was identified fifteen times until the present, a recording data item including one or more voice sections corresponding to the speaker feature amount was recorded one day ago, and a speaker name is included in a speaker feature data item 403 including the speaker feature amount.
- In this case, the speaker identification module 322 D calculates the importance by equation (1) above as Importance = 0.01 × 15 + (−0.0001) × 1 + 1.0 × 1 = 1.1499, and the importance added to the speaker feature data item 403 including the predetermined speaker feature amount is updated to 1.1499.
- A case where α is 0.01, β is −0.0001, and γ is 1.0 as in the above description, a predetermined speaker feature amount was identified five times until the present, a recording data item including one or more voice sections corresponding to the speaker feature amount was recorded thirty days ago, and a speaker name is not included in a speaker feature data item 403 including the speaker feature amount will also be described.
- In this case, the speaker identification module 322 D calculates the importance by equation (1) above as Importance = 0.01 × 5 + (−0.0001) × 30 + 1.0 × 0 = 0.047, and the importance included in the speaker feature data item 403 including the predetermined speaker feature amount is updated to 0.047.
- If the user performs an operation of inputting a speaker name for a predetermined cluster, the speaker provisional registration module 322 E acquires a speaker feature amount corresponding to the one or more voice sections included in the predetermined cluster from the nonvolatile memory 107 . Then, the speaker provisional registration module 322 E generates a speaker feature provisional data item 404 including the acquired speaker feature amount and the speaker name input by the above operation. In addition, the speaker provisional registration module 322 E writes the generated speaker feature provisional data item 404 to the nonvolatile memory 107 . That is, the speaker provisional registration module 322 E provisionally registers the speaker feature amount included in the speaker feature provisional data item 404 .
- Thereby, the formal registration of the speaker feature amount can be performed later. That is, the registration of the speaker feature amount can be reserved.
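- The two-stage (provisional, then formal) registration can be pictured as below; the structures standing in for data items 403 and 404 are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ProvisionalItem:             # speaker feature provisional data item 404
    feature: list
    speaker_name: str

formal_registry: list = []         # speaker feature data items 403
provisional_registry: list = []    # items 404 awaiting formal registration

def provisional_register(feature: list, name: str) -> None:
    """Fast path run on the user's naming operation: just store the pair."""
    provisional_registry.append(ProvisionalItem(feature, name))

def formal_register_pending() -> None:
    """Run during the next analysis processing: promote each item 404
    to a speaker feature data item 403."""
    while provisional_registry:
        p = provisional_registry.pop()
        formal_registry.append({"feature": p.feature,
                                "speaker_name": p.speaker_name,
                                "importance": 0.0})
```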
- If the recording button 50 in the home view 210 - 1 shown in FIG. 4 is operated, the recording engine 321 starts recording, and the screen of the terminal switches from the home view 210 - 1 shown in FIG. 4 to the recording view 210 - 2 shown in FIG. 5 .
- the voice section detection module 321 A analyzes a recorded audio data item (or an audio data item from the audio capture unit 113 ), and determines whether an audio data unit of a predetermined length of time is a voice section or a non-voice section other than the voice section (block B 1 ). If it is determined that the audio data unit of the predetermined length of time is a non-voice section (NO in block B 1 ), the flow returns to the process of block B 1 , and the voice section detection module 321 A performs a process of determining whether the next audio data unit is a voice section or a non-voice section.
- If it is determined that the audio data unit is a voice section (YES in block B 1 ), the sound feature extraction module 321 B extracts a sound feature amount, for example, a mel frequency cepstrum coefficient (block B 2 ).
- the recording engine 321 determines whether the stop button 500 A in the recording view 210 - 2 has been operated (tapped) by the user. That is, it is determined whether recording has been completed (block B 3 ). If it is determined that the stop button 500 A in the recording view 210 - 2 has not been operated, that is if it is determined that recording is continuously being performed (NO in block B 3 ), the flow returns to the process of block B 1 . Then, the voice section detection module 321 A performs a process of determining whether the next audio data unit is a voice section or a non-voice section.
- If it is determined that recording has been completed (YES in block B 3 ), the clustering module 322 A classifies one or more voice sections included in a sequence from the start point to the end point of a recorded audio data item (a set of audio data units) into clusters corresponding to speakers included in the audio data item (block B 4 ). For example, if five speakers are included in the audio data item, the one or more voice sections included in the audio data item are each classified into one of five clusters.
- Data indicating a result of the process of block B 4 , that is, data indicating which voice section is included in (belongs to) which cluster, is stored in the nonvolatile memory 107 as an index data item 402 .
- the speaker feature extraction module 322 B extracts a speaker feature amount, which is a speaker-specific feature, from a sound feature amount included in one or more voice sections classified into the same cluster (block B 5 ). For example, if the one or more voice sections included in the audio data item are classified into five clusters as described above, five speaker feature amounts are herein extracted by the speaker feature extraction module 322 B.
- the speaker registration module 322 C registers each of the extracted speaker feature amounts in the nonvolatile memory 107 as a speaker feature data item 403 (block B 6 ).
- the speaker registration module 322 C refers to the nonvolatile memory 107 , and determines whether a speaker feature provisional data item 404 provisionally registered by the speaker provisional registration module 322 E is stored (registered) therein (block B 7 ). If it is determined that the speaker feature provisional data item 404 is not stored (NO in block B 7 ), the flow proceeds to the process of block B 9 , which will be described later.
- If it is determined that the speaker feature provisional data item 404 is stored (YES in block B 7 ), the speaker registration module 322 C reregisters a speaker feature amount and a speaker name included in the speaker feature provisional data item 404 provisionally registered in the nonvolatile memory 107 as a speaker feature data item 403 (block B 8 ).
- the speaker registration module 322 C determines whether the number of speaker feature data items 403 registered in the nonvolatile memory 107 is greater than or equal to a predetermined number. That is, the speaker registration module 322 C determines whether the number of registered speaker feature data items 403 exceeds the upper limit (block B 9 ). If it is determined that the number of registered speaker feature data items 403 is not greater than or equal to the predetermined number, that is, the number of registered speaker feature data items 403 is less than the predetermined number (NO in block B 9 ), the flow proceeds to the process of block B 11 , which will be described later.
- If it is determined that the number of registered speaker feature data items 403 is greater than or equal to the predetermined number (YES in block B 9 ), the speaker registration module 322 C deletes speaker feature data items 403 in order of increasing importance added to the speaker feature data items 403 registered in the nonvolatile memory 107 , until the number of speaker feature data items 403 becomes less than the predetermined number (block B 10 ).
- a speaker feature data item 403 registered this time in a series of procedures is not deleted.
- the speaker identification module 322 D compares a speaker feature amount extracted by performing the process of block B 5 by the speaker feature extraction module 322 B and a speaker feature amount included in a speaker feature data item 403 stored in the nonvolatile memory 107 . Let us assume that as a result of the comparison, a speaker name is included in a speaker feature data item 403 including a speaker feature amount according with the extracted speaker feature amount. In this case, the speaker identification module 322 D determines that one or more voice sections corresponding to the speaker feature amount belong to the utterance of a speaker (person) indicated by the speaker name.
- Next, let us assume that a speaker name is not included in the speaker feature data item 403 including the speaker feature amount according with the extracted speaker feature amount, and the number of times of speaker identification (data on the importance added to the speaker feature data item 403 ) is greater than or equal to two.
- In this case, the speaker identification module 322 D determines that the one or more voice sections corresponding to the speaker feature amount belong to the utterance of a speaker whose speaker name has not been registered yet.
- Finally, let us assume that a speaker name is not included in the speaker feature data item 403 including the speaker feature amount according with the extracted speaker feature amount, and the number of times of speaker identification (data on the importance added to the speaker feature data item 403 ) is one.
- In this case, the speaker identification module 322 D determines that the one or more voice sections corresponding to the speaker feature amount belong to the utterance of a new speaker (block B 11 ). It should be noted that the process of block B 11 is repeated for each of the speaker feature amounts extracted in the process of block B 5 . Data indicating a result of the process of block B 11 is stored in the nonvolatile memory 107 as an index data item 402 .
- the speaker identification module 322 D updates the importance added to the speaker feature data item 403 including the speaker feature amount according with the speaker feature amount extracted in the process of block B 5 (block B 12 ), and ends a series of procedures of analysis processing herein.
- speaker feature amounts can be registered dispersedly at the time of the analysis processing and at the time of the provisional registration performed by the speaker provisional registration module 322 E, and thus, a time required for speaker learning can be reduced. Especially, a time required for speaker learning at the time of the provisional registration performed by the speaker provisional registration module 322 E in response to the user's operation can be greatly reduced.
- FIG. 9 shows an example of the reproduction view 210 - 3 displayed if a predetermined recording data item is reproduced after the analysis processing shown in FIG. 8 is performed for the predetermined recording data item. Since the analysis processing shown in FIG. 8 was performed, with respect to speaker names, three types of status can be displayed in a distinguishable form in the speaker identification result view area 601 of the reproduction view 210 - 3 . Specifically, a speaker whose speaker name has been registered, a speaker whose speaker name has not been registered yet and a new speaker can be displayed in a distinguishable form. The speaker whose speaker name has been registered is a speaker whose speaker name is included in a speaker feature data item 403 .
- the speaker whose speaker name has not been registered yet is a speaker whose speaker name is not included in a speaker feature data item 403 and whose number of times of speaker identification regarding the importance added to the speaker feature data item 403 is greater than or equal to two.
- the new speaker is a speaker whose speaker name is not included in a speaker feature data item 403 and whose number of times of speaker identification regarding the importance added to the speaker feature data item 403 is one.
- After the analysis processing shown in FIG. 8 , at a left end of one or more voice sections corresponding to a speaker feature amount which the speaker identification module 322 D identifies as belonging to the utterance of a speaker whose speaker name has been registered, the speaker name (for example, “Mr. A”) is displayed.
- At a left end of one or more voice sections corresponding to a speaker feature amount which the speaker identification module 322 D identifies as belonging to the utterance of a speaker whose speaker name has not been registered yet, nothing is displayed, indicating that the speaker name has not been registered yet.
- At a left end of one or more voice sections corresponding to a speaker feature amount which the speaker identification module 322 D identifies as belonging to the utterance of a new speaker, the text “NEW” is displayed to indicate the new speaker.
- FIG. 10 shows an example of a pop-up displayed if a speaker name displayed in the reproduction view 210 - 3 is erroneous, and the user corrects the speaker name.
- If the user performs an operation for correcting a speaker name displayed in the reproduction view 210 - 3 , the voice recorder application 202 displays a pop-up as shown in FIG. 10 .
- the voice recorder application 202 acquires all of one or more speaker feature data items 403 stored in the nonvolatile memory 107 , and displays a pop-up on which a speaker name included in the one or more speaker feature data items 403 can be selected as a correction candidate. Accordingly, the user can easily correct the speaker name.
- FIG. 11 shows an example of a tutorial displayed in the reproduction view 210 - 3 .
- the tutorial shown in FIG. 11 is displayed by the voice recorder application 202 , if all the statuses of speaker names displayed in the speaker identification result view area 601 of the reproduction view 210 - 3 are new speakers.
- the tutorial is displayed by the voice recorder application 202 , if the statuses of the speaker names displayed in the speaker identification result view area 601 of the reproduction view 210 - 3 include a combination of an unregistered speaker and a new speaker, and the number of times the tutorial was displayed is less than a predetermined number.
- the content of the tutorial is a message prompting entry of a speaker name, for example, the message “Please enter speaker name. Same speaker will be automatically displayed from next time.” Accordingly, the registration (provisional registration) of a speaker name can be prompted without imposing stress on the user.
- As described above, the electronic device 1 has the following structure: at the time of speaker learning performed in response to the user's operation, only a speaker feature provisional data item 404 including a speaker feature amount and a speaker name is provisionally registered.
- Then, at the time of the next analysis processing, the speaker feature amount and the speaker name included in the speaker feature provisional data item 404 are reregistered as a speaker feature data item 403 . That is, the electronic device 1 has a structure in which speaker learning is performed dispersedly. Accordingly, a time required for speaker learning can be greatly reduced, whereby a speaker learning function which does not impose stress on the user can be realized.
Description
- This application claims the benefit of U.S. Provisional Application No. 62/218,417, filed Sep. 14, 2015, the entire contents of which are incorporated herein by reference.
- A general architecture that implements the various features of the embodiments will now be described with reference to the drawings. The drawings and the associated descriptions are provided to illustrate the embodiments and not to limit the scope of the invention. Throughout the drawings, reference numbers are re-used to indicate correspondence between referenced elements.
- FIG. 1 is a plan view showing an example of an external appearance of embodiments.
- FIG. 2 is a block diagram showing an example of a system configuration of embodiments.
- FIG. 3 is a block diagram showing an example of a functional configuration of a voice recorder application of embodiments.
- FIG. 4 is a diagram showing an example of a home view of embodiments.
- FIG. 5 is a diagram showing an example of a recording view of embodiments.
- FIG. 6 is a diagram showing an example of a reproduction view of embodiments.
- FIG. 7 is a block diagram showing an example of a functional configuration of a visualization engine of embodiments.
- FIG. 8 is a flowchart showing an example of a series of procedures of analysis processing by the voice recorder application of embodiments.
- FIG. 9 is a diagram for explaining a status related to a speaker name.
- FIG. 10 shows an example of a pop-up displayed when a user corrects the speaker name.
- FIG. 11 shows an example of a tutorial displayed in the reproduction view.
- Various embodiments will be described hereinafter with reference to the accompanying drawings.
- In general, according to one embodiment, an electronic device includes a microphone, a memory, and a hardware processor. The microphone is configured to obtain audio and convert the audio into an audio signal, the audio including utterances from a first user and utterances from a second user, wherein one of the first user or the second user is a registered user and the other of the first user or the second user is an unregistered user. The memory stores an identifier associated with the registered user. The hardware processor is in communication with the memory and is configured to: record the audio signal; determine a plurality of user-specific utterance features within the audio signal, the plurality of user-specific utterance features including a first set of user-specific utterance features associated with the registered user and a second set of user-specific utterance features associated with the unregistered user; and display the identifier of the registered user differently than an identifier of the unregistered user.
- <Plan View of Device>
-
FIG. 1 shows a plan view of an example of an electronic device according to certain embodiments. An electronic device 1 may include, for example, a tablet personal computer (portable personal computer [PC]), a smartphone (multifunctional mobile phone), or a personal digital assistant (PDA). In the following description, it is assumed that the electronic device 1 is a tablet personal computer. However, the disclosure is not limited as such, and the electronic device 1 may include one or more of the previously described systems. Each element and each structure described hereinafter may be implemented by hardware, by software using a microcomputer (processor, central processing unit [CPU]), or by a combination of hardware and software.
- The tablet personal computer (hereinafter referred to as a tablet PC) 1 may include a main body 10 and a touchscreen display 20.
- A camera 11 may be disposed at a particular position of the main body 10, for example, a central position on an upper end of a surface of the main body 10. Moreover, microphones may be disposed at predetermined positions of the main body 10, for example, two positions separated from each other on the upper end of the surface of the main body 10. The camera 11 may be disposed between the two microphones. Loudspeakers may be disposed at other predetermined positions of the main body 10, for example, a left side and a right side of the main body 10. Although not shown in the figures, a power switch (power button), a lock mechanism, an authentication unit, etc., may be disposed at other predetermined positions of the main body 10. The power switch turns on or off a power supply that supplies power to one or more elements of the tablet PC 1, enabling a user to use the tablet PC 1 (activating the tablet PC 1). The lock mechanism, for example, locks the operation of the power switch at the time of conveyance. The authentication unit, for example, reads (biometric) data associated with a user's finger or palm to authenticate the user.
- The touchscreen display 20 includes a flat panel display 21, such as a liquid crystal display (LCD), and a touchpanel 22. The flat panel display 21 may include a plasma display or an organic LED (OLED) display. The touchpanel 22 is attached to the surface of the main body 10 so as to cover the screen of the LCD 21. The touchscreen display 20 detects a touch position of an external object (stylus or finger) on the display screen. The touchscreen display 20 may support a multi-touch function by which a plurality of touch positions can be detected simultaneously. The touchscreen display 20 can display several icons for activating various application programs on the screen. The icons may include an icon 290 for activating a voice recorder program. The voice recorder program has a function of visualizing the content of a recording in a meeting, etc.
- <System Configuration>
-
FIG. 2 shows an example of a system configuration of the tablet PC 1. As well as the elements shown in FIG. 1, the tablet PC 1 may comprise a CPU 101, a system controller 102, a main memory 103 which is volatile memory such as RAM, a graphics controller 104, a sound controller 105, a BIOS-ROM 106, a nonvolatile memory 107, an EEPROM 108, a LAN controller 109, a wireless LAN controller 110, a vibrator 111, an acceleration sensor 112, an audio capture unit 113, an embedded controller (EC) 114, etc.
- The CPU 101 is a processor circuit configured to control the operation of each element in the tablet PC 1. The CPU 101 executes various programs loaded from the nonvolatile memory 107 into the main memory 103. The programs include an operating system (OS) 201 and various application programs. The application programs include a voice recorder application 202.
- Various features of the voice recorder application 202 will be explained. The voice recorder application 202 can record an audio data item corresponding to the sound input via the microphones. The voice recorder application 202 can extract voice sections from the audio data item, and classify the respective voice sections into clusters corresponding to the speakers in the audio data item. The voice recorder application 202 has a visualization function of displaying the voice sections speaker by speaker, using the result of the cluster classification. By the visualization function, which speaker spoke (uttered) can be visibly presented to the user. The voice sections include utterances such as sounds produced by a user, including humming, whistling, moans, grunts, singing, and any other sounds a user may make, including speech. The voice recorder application 202 also supports a speaker selection reproduction function of continuously reproducing only the voice sections of a selected speaker.
- These functions of the voice recorder application 202 can each be carried out by a circuit such as a processor. Alternatively, these functions can be carried out by dedicated circuits such as a recording circuit 121 and a reproduction circuit 122. The recording circuit 121 and the reproduction circuit 122 have the recording function and the reproducing function that are otherwise carried out by the processor executing the voice recorder application 202.
- The CPU 101 also executes a Basic Input/Output System (BIOS), which is a program for hardware control stored in the BIOS-ROM 106.
- The system controller 102 is a device which connects a local bus of the CPU 101 and various components. The system controller 102 also contains a memory controller which exerts access control over the main memory 103. The system controller 102 also has a function of communicating with the graphics controller 104 over a serial bus conforming to the PCI EXPRESS standard, etc. The system controller 102 also contains an ATA controller for controlling the nonvolatile memory 107. The system controller 102 further contains a USB controller for controlling various USB devices. The system controller 102 also has a function of communicating with the sound controller 105 and the audio capture unit 113.
- The graphics controller 104 is a display controller configured to control the LCD 21 of the touchscreen display 20. A display signal generated by the graphics controller 104 is transmitted to the LCD 21. The LCD 21 displays a screen image based on the display signal. The touchpanel 22 covering the LCD 21 functions as a sensor configured to detect the touch position of an external object on the screen of the LCD 21. The sound controller 105 converts an audio data item to be reproduced into an analog signal, and supplies the analog signal to the loudspeakers.
- The LAN controller 109 is a wired communication device configured to perform wired communication conforming to, for example, the IEEE 802.3 standard. The LAN controller 109 includes a transmission circuit configured to transmit a signal and a reception circuit configured to receive a signal. The wireless LAN controller 110 is a wireless communication device configured to perform wireless communication conforming to, for example, the IEEE 802.11 standard, and includes a transmission circuit configured to wirelessly transmit a signal and a reception circuit configured to wirelessly receive a signal.
- The acceleration sensor 112 is used to detect the current orientation (portrait/landscape) of the main body 10. The audio capture unit 113 carries out analog-to-digital conversion of the sound input via the microphones and outputs digital audio signals. The audio capture unit 113 can transmit data on the sound inputs from the microphones to the voice recorder application 202. The EC 114 is a single-chip microcontroller for power management. The EC 114 powers the tablet PC 1 on or off in response to the user's operation of the power switch.
- <Functional Configuration>
-
FIG. 3 shows an example of a functional configuration of the voice recorder application 202. The voice recorder application 202 includes an input interface module 310, a controller 320, a reproduction processor 330, a display processor 340, etc., as functional modules of the same program.
- The input interface module 310 receives various events from the touchpanel 22 via a touchpanel driver 201A. The events include a touch event, a move event, and a release event. The touch event indicates that an external object has touched the screen of the LCD 21; it includes the coordinates of the touch position of the external object on the screen. The move event indicates that the touch position has been moved with the external object still touching the screen; it includes the coordinates of the moved touch position. The release event indicates that the touch of the external object on the screen has been released; it includes the coordinates of the touch position where the touch was released.
- Based on these events, the following finger gestures are defined.
- Tap: touching the user's finger at a position on the screen, and then separating it in an orthogonal direction to the screen (“tap” may be synonymous with “touch”).
- Swipe: touching the user's finger at an arbitrary position on the screen, and then moving it in an arbitrary direction.
- Flick: touching the user's finger at an arbitrary position on the screen, then sweeping it in an arbitrary direction, and separating it from the screen.
- Pinch: touching the user's two fingers at an arbitrary position on the screen, and then, changing the distance between the fingers on the screen. In particular, widening the distance between the fingers (opening the fingers) may be referred to as pinch-out (pinch-open), and narrowing the distance between the fingers (closing the fingers) may be referred to as pinch-in (pinch-close).
- The controller 320 can detect which finger gesture (tap, swipe, flick, pinch, etc.) was made on which part of the screen, based on the various events received from the input interface module 310. The controller 320 includes a recording engine 321, a visualization engine 322, etc.
- The recording engine 321 records an audio data item 401, corresponding to the sound input via the microphones and captured by the audio capture unit 113, in the nonvolatile memory 107. The recording engine 321 can perform recording in various scenes, such as recording in a meeting, recording of a telephone conversation, and recording in a presentation. The recording engine 321 can also record other kinds of audio sources input via means other than the microphones and the audio capture unit 113, such as broadcast and music.
- The recording engine 321 performs a voice section detection process of analyzing the recorded audio data item 401 and determining whether each portion of it is a voice section or a non-voice section (noise section, silent section). The voice section detection process is performed, for example, for each voice data sample having a length of 0.5 seconds. In other words, a sequence of an audio data item (recording data item), that is, a signal series of digital audio signals, is divided into audio data units each having a length of 0.5 seconds (a set of audio data samples of 0.5 seconds). The recording engine 321 performs the voice section detection process for each audio data unit. An audio data unit of 0.5 seconds is the identification unit for identifying a speaker through the speaker identification process, which will be described later.
- In the voice section detection process, it is determined whether an audio data unit is a voice section or a non-voice section (noise section, silent section). In this determination, any well-known technique can be used; for example, voice activity detection (VAD) may be used. The determination may be made in real time during recording.
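- As an illustration of the unit-based detection described above, the following is a minimal Python sketch. The 0.5-second unit length is taken from the description; the energy threshold and the helper names (split_into_units, is_voice_section) are illustrative assumptions, standing in for whatever VAD technique the device actually uses.

```python
import numpy as np

def split_into_units(samples: np.ndarray, sample_rate: int,
                     unit_seconds: float = 0.5) -> list:
    """Divide a mono signal into the 0.5-second audio data units used
    as the identification unit for speaker identification."""
    unit_len = int(sample_rate * unit_seconds)
    n_units = len(samples) // unit_len
    return [samples[i * unit_len:(i + 1) * unit_len] for i in range(n_units)]

def is_voice_section(unit: np.ndarray, energy_threshold: float = 1e-4) -> bool:
    """Crude stand-in for VAD: a unit counts as a voice section when its
    mean energy exceeds a threshold (assumed value; a real VAD would be
    spectral or model-based)."""
    unit = unit.astype(np.float64)
    return float(np.mean(unit ** 2)) > energy_threshold
```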
- The recording engine 321 extracts a feature amount (sound feature amount), such as a mel-frequency cepstral coefficient (MFCC), from each audio data unit identified as a voice section.
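- A unit-level sound feature amount of this kind could be computed as follows. This is a sketch using librosa's MFCC implementation; 13 coefficients and frame-averaging are conventional, assumed choices rather than values given by the text.

```python
import numpy as np
import librosa

def sound_feature(unit: np.ndarray, sample_rate: int) -> np.ndarray:
    """Return an MFCC-based sound feature amount for one audio data unit
    (mean over analysis frames, so each unit yields a single vector)."""
    mfcc = librosa.feature.mfcc(y=unit.astype(np.float32), sr=sample_rate,
                                n_mfcc=13)
    return mfcc.mean(axis=1)
```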
- The visualization engine 322 performs a process of visualizing an outline of the whole sequence of the audio data item 401 in cooperation with the display processor 340. Specifically, the visualization engine 322 performs a speaker identification process, and then distinguishably displays when and which speaker spoke in a display area, using the result of the speaker identification process.
- In the speaker identification process, which speaker spoke (uttered) is detected. The speaker identification process may include speaker clustering. In speaker clustering, it is identified which speaker spoke in the voice sections included in the sequence from the start point to the end point of an audio data item. In other words, in speaker clustering, the respective voice sections are classified into clusters corresponding to the speakers in the audio data item. A cluster is a set of audio data units of the same speaker. As a method of performing speaker clustering, various already-existing methods can be used. For example, in the present embodiment, both a method of performing speaker clustering using a speaker position and a method of performing speaker clustering using a feature amount (sound feature amount) of an audio data item may be used.
- The speaker position indicates the position of each speaker with respect to the tablet PC 1. The speaker position can be estimated based on the difference between the two audio signals input via the two microphones.
- In the method of performing speaker clustering using a sound feature amount, audio data units having feature amounts similar to each other are classified into the same cluster (the same speaker). As a method of performing speaker clustering using a feature amount, any already-existing method can be used; for example, a method disclosed in JP 2011-191824 A (JP 5174068 B) may be used. Data indicating the result of the speaker clustering is saved in the nonvolatile memory 107 as an index data item 402.
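- The position cue could be derived from the inter-microphone time difference, for example by cross-correlation, as in the sketch below. The function name and the interpretation of the lag sign are assumptions, since the text does not fix a particular estimation method.

```python
import numpy as np

def estimate_lag(left: np.ndarray, right: np.ndarray, sample_rate: int) -> float:
    """Estimate the arrival-time difference (in seconds) between the two
    microphone signals; its sign hints at which side the speaker is on."""
    corr = np.correlate(left.astype(np.float64), right.astype(np.float64),
                        mode="full")
    lag_samples = int(np.argmax(corr)) - (len(right) - 1)
    return lag_samples / sample_rate
```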
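- The feature-based clustering could look like the following sketch, which uses agglomerative clustering on cosine distance via SciPy. The specific distance and linkage are illustrative choices, not the method of the cited reference; the text only requires that similar feature amounts end up in the same cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_voice_sections(features: np.ndarray, n_speakers: int) -> np.ndarray:
    """Group unit-level sound feature vectors (n_units x n_dims) into
    n_speakers clusters, one cluster id per audio data unit."""
    tree = linkage(features, method="average", metric="cosine")
    return fcluster(tree, t=n_speakers, criterion="maxclust")
```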
- The visualization engine 322 displays the individual voice sections in the display area. If there are a plurality of speakers, the voice sections are displayed in a form in which the speakers of the individual voice sections are distinguishable. That is, the visualization engine 322 can visualize the voice sections speaker by speaker, using the index data item 402.
- The reproduction processor 330 reproduces the audio data item 401. The reproduction processor 330 can continuously reproduce only the voice sections while skipping silent sections. Moreover, the reproduction processor 330 can also perform a selected-speaker reproduction process of continuously reproducing only the voice sections of a specific speaker selected by the user while skipping the voice sections of the other speakers, as sketched below.
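- The selected-speaker reproduction process amounts to filtering the index data for one speaker's sections and concatenating them. The sketch below assumes the index data item can be read as (start, end, speaker) tuples in seconds, which is an assumption about the data layout.

```python
import numpy as np

def selected_speaker_audio(samples: np.ndarray, sample_rate: int,
                           index_items: list, speaker_id: int) -> np.ndarray:
    """Concatenate only the selected speaker's voice sections, skipping
    silent sections and the voice sections of the other speakers."""
    pieces = [samples[int(start * sample_rate):int(end * sample_rate)]
              for (start, end, spk) in index_items if spk == speaker_id]
    if not pieces:
        return np.empty(0, dtype=samples.dtype)
    return np.concatenate(pieces)
```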
- An example of several views (home view, recording view, and reproduction view) displayed on the screen by the voice recorder application 202 will be next described.
- <Home View>
-
FIG. 4 shows an example of a home view 210-1. If the voice recorder application 202 is activated, it displays the home view 210-1. The home view 210-1 displays a recording button 50, an audio waveform 52 of a predetermined time (for example, thirty seconds), and a record list 53. The recording button 50 is a button for giving instructions to start recording.
- The audio waveform 52 indicates the waveform of the audio signals currently input via the microphones, and appears at the position of a vertical bar 51 indicating the present time. Then, with the passage of time, the waveform of the audio signals moves from the vertical bar 51 to the left. In the audio waveform 52, successive vertical bars have lengths according to the power of the respective successive audio signal samples. By means of the display of the audio waveform 52, the user can ascertain whether sounds are being input normally, before starting recording.
- The record list 53 includes the records stored in the nonvolatile memory 107 as audio data items 401. It is herein assumed that there are three records: a record of the title "AAA meeting", a record of the title "BBB meeting", and a record of the title "Sample". In the record list 53, the recording date, the recording start time, and the recording end time of each record are also displayed. In the record list 53, recordings (records) can be sorted in reverse order of creation date, in order of creation date, or in order of title.
- If a certain record in the record list 53 is selected by the user's tap operation, the voice recorder application 202 starts reproducing the selected record. If the recording button 50 of the home view 210-1 is tapped by the user, the voice recorder application 202 starts recording.
- <Recording View>
-
FIG. 5 shows an example of a recording view 210-2. If the recording button 50 is tapped by the user, the voice recorder application 202 starts recording and switches the display screen from the home view 210-1 of FIG. 4 to the recording view 210-2 of FIG. 5.
- The recording view 210-2 displays a stop button 500A, a pause button 500B, a voice section bar 502, an audio waveform 503, and a speaker icon 512. The stop button 500A is a button for stopping the current recording. The pause button 500B is a button for pausing the current recording.
- The audio waveform 503 indicates the waveform of the audio signals currently input via the microphones. The audio waveform 503 continuously appears at the position of a vertical bar 501, and moves to the left with the passage of time, like the audio waveform 52 of the home view 210-1. Also in the audio waveform 503, successive vertical bars have lengths according to the power of the respective successive audio signal samples.
- During recording, the above-described voice section detection process is performed. If it is detected that one or more audio data units in an audio signal are a voice section (human voice), the voice section corresponding to the one or more audio data units is visualized by the voice section bar 502 as an object indicating the voice section. The length of the voice section bar 502 varies according to the length of time of the corresponding voice section.
- The voice section bar 502 can be displayed only after an input voice is analyzed by the visualization engine 322 and a speaker identification process is performed. Because the voice section bar 502 thus cannot be displayed right after recording, the audio waveform 503 is displayed as in the home view 210-1. The audio waveform 503 is displayed in real time at the right end, and flows to the left side of the screen with the passage of time. If a certain time has passed, the audio waveform 503 switches to the voice section bar 502. Although whether the power is due to voice or noise cannot be determined from the audio waveform 503 alone, the recording of a human voice can be confirmed by the voice section bar 502. The real-time audio waveform 503 and the voice section bar 502, which starts with a delay, are displayed in the same row, whereby the user's eyes can be kept in the same row without roving, and useful information with good visibility can be obtained.
- As shown in FIG. 5, the audio waveform 503 does not switch to the voice section bar 502 at once, but gradually. The amplitude of the waveform display is decreased as time goes on, so that the waveform display converges into the bar display. The current power is thereby displayed at the right end as the audio waveform 503, the display flows from right to left, and in the process of updating the display, the waveform changes continuously and seamlessly to converge into a bar.
- At the upper left side of the screen, a record name ("New record" in an initial state) and the date and time are displayed. At the upper center of the screen, a recording time (which may be the absolute time, but is herein an elapsed time from the start of recording) (for example, 00:50:02) is displayed. At the upper right side of the screen, the speaker icon 512 is displayed. If a currently speaking speaker is identified, a speaking mark 514 is displayed under the icon of the speaker. Below the voice section bar 502, a time axis with a scale per ten seconds is displayed. FIG. 5 visualizes voices for a certain time, for example thirty seconds, until the present time (right end), and shows earlier times on the left side. The time of thirty seconds can be changed.
- While the scale of the time axis of the home view 210-1 is fixed, the scale of the time axis of the recording view 210-2 is variable. By swiping from side to side, or pinching in or out on the time axis, the scale can be changed, and the display time (thirty seconds in the example of FIG. 5) can be changed. In addition, by flicking the time axis from side to side, the time axis moves accordingly, and a voice that was recorded a predetermined time earlier can also be visualized, although the display time is not changed.
- Above the voice section bars 502A, 502B, 502C, and 502D, tags 504A, 504B, 504C, and 504D are displayed.
- <Reproduction View>
-
FIG. 6 shows an example of a reproduction view 210-3 in a state in which the reproduction of the record of the title "AAA meeting" is paused. The reproduction view 210-3 displays a speaker identification result view area 601, a seek bar area 602, a reproduction view area 603, and a control panel 604.
- The speaker identification result view area 601 is a display area displaying the whole sequence of the record of the title "AAA meeting". The speaker identification result view area 601 may display time axes 701 corresponding to the respective speakers in the sequence of the record. In the speaker identification result view area 601, five speakers are arranged in order of decreasing amount of utterance in the whole sequence of the record of the title "AAA meeting". The speaker whose amount of utterance is the greatest in the whole sequence is displayed at the top of the speaker identification result view area 601. The user can also listen to each voice section of a specific speaker by tapping the voice sections (voice section marks) of the specific speaker in order.
- The left ends of the time axes 701 correspond to the start time of the sequence of the record, and the right ends of the time axes 701 correspond to the end time of the sequence of the record. That is, the total time from the start to the end of the sequence of the record is allocated to the time axes 701. However, if the total time is long and all of it is allocated to the time axes 701, the scale of the time axes may become too small and difficult to see. Thus, the scale of the time axes 701 may be variable, as in the case of the recording view.
- On the time axis 701 of a certain speaker, a voice section mark indicating the position and the length of time of a voice section of the speaker is displayed. Different colors may be allocated to the speakers. In this case, voice section marks in colors differing from speaker to speaker may be displayed. For example, on the time axis of a speaker "Mr. A", voice section marks 702 may be displayed in a color (for example, red) allocated to the speaker "Mr. A".
- The seek bar area 602 displays a seek bar 711 and a movable slider (also referred to as a locater) 712. To the seek bar 711, the total time from the start to the end of the sequence of the record is allocated. The position of the slider 712 on the seek bar 711 indicates the current reproduction position. From the slider 712, a vertical bar 713 extends upward. Because the vertical bar 713 traverses the speaker identification result view area 601, the user can easily understand in which speaker's (main speaker's) voice section the current reproduction position is.
- The position of the slider 712 on the seek bar 711 moves to the right with the progress of reproduction. The user can move the slider 712 to the right or to the left by a drag operation. The user can thereby change the current reproduction position to an arbitrary position.
- The reproduction view area 603 is an enlarged view of a period (for example, a period of approximately twenty seconds) in the vicinity of the current reproduction position. The reproduction view area 603 includes a display area long in the direction of the time axis (here, horizontally). In the reproduction view area 603, the several voice sections (detected actual voice sections) included in the period in the vicinity of the current reproduction position are displayed in chronological order. A vertical bar 720 indicates the current reproduction position. If the user flicks the reproduction view area 603, the displayed content is scrolled to the left or to the right while the position of the vertical bar 720 remains fixed. Consequently, the current reproduction position is also changed.
- <Recording Engine>
-
FIG. 7 is a block diagram showing an example of the functional configurations of the recording engine 321 and the visualization engine 322 shown in FIG. 3. As shown in FIG. 7, the recording engine 321 includes a voice section detection module 321A, a sound feature extraction module 321B, etc. As shown in FIG. 7, the visualization engine 322 includes a clustering module 322A, a speaker feature extraction module 322B, a speaker registration module 322C, a speaker identification module 322D, a speaker provisional registration module 322E, etc.
- The voice section detection module 321A receives an audio data item input from the audio capture unit 113. In addition, the voice section detection module 321A performs the above-described voice activity detection (VAD) on the received audio data item.
- The sound feature extraction module 321B extracts a sound feature amount from each voice section detected by the voice section detection module 321A, as described above.
- The clustering module 322A performs the above-described speaker clustering. Specifically, the clustering module 322A classifies the respective voice sections into clusters corresponding to the speakers included in the audio data item (that is, the set of the respective voice sections), on the basis of the speaker position and the sound feature amount as described above. Data indicating the result of the speaker clustering is stored in the nonvolatile memory 107 as an index data item 402.
- The speaker feature extraction module (speaker learning module) 322B performs a process of extracting a speaker-specific feature (speaker feature amount) from each sound feature amount included in the one or more voice sections classified into the same cluster by the clustering module 322A. As a method of extracting a speaker feature amount from a sound feature amount, any already-existing method may be used; for example, a technique such as a code mapping method, a neural network method, or a Gaussian mixture model (GMM) is used.
- The speaker registration module 322C performs a process of registering (automatically registering) the speaker feature amount extracted by the speaker feature extraction module 322B in the nonvolatile memory 107 as a speaker feature data item 403. In addition, let us assume that a speaker feature provisional data item 404, including a speaker feature amount and a speaker name provisionally registered by the speaker provisional registration module 322E (described later), is stored in the nonvolatile memory 107. In this case, the speaker registration module 322C associates the speaker feature data item 403 whose speaker feature amount accords with the speaker feature amount included in the speaker feature provisional data item 404 with the speaker name included in the speaker feature provisional data item 404, and reregisters (formally registers) it in the nonvolatile memory 107 as a new speaker feature data item 403. Moreover, let us assume that a speaker feature provisional data item 404 provisionally registered by the speaker provisional registration module 322E is stored in the nonvolatile memory 107, but no speaker feature data item 403 including a speaker feature amount according with the speaker feature amount of the speaker feature provisional data item 404 is stored in the nonvolatile memory 107; that is, the speaker feature data item 403 for that speaker feature amount has been deleted from the nonvolatile memory 107. In this case, the speaker registration module 322C reregisters the speaker feature provisional data item 404 in the nonvolatile memory 107 as a new speaker feature data item 403.
- In the above description, if a speaker feature data item 403 including a speaker feature amount according with the speaker feature amount included in a speaker feature provisional data item 404 is stored in the nonvolatile memory 107, the speaker feature amount of the originally registered speaker feature data item 403 is associated with the speaker name included in the speaker feature provisional data item 404, and a new speaker feature data item 403 is registered. However, a new speaker feature data item 403 may instead be registered by overwriting the originally registered speaker feature data item 403 with the speaker feature provisional data item 404.
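- The registration and reregistration logic described above can be summarized by the following sketch. The dictionary layout, the features_match helper, and its similarity threshold are assumptions made for illustration; the text does not specify how "according with" is tested.

```python
import numpy as np

def features_match(a: np.ndarray, b: np.ndarray, threshold: float = 0.9) -> bool:
    """Hypothetical 'according with' test: cosine similarity over a threshold."""
    sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sim >= threshold

def register_speakers(extracted: list, provisional: list, registered: list) -> list:
    """Formal registration performed right after recording: register new
    feature amounts, then fold provisionally registered (feature, name)
    pairs into the registered speaker feature data items."""
    for feat in extracted:
        if not any(features_match(feat, item["feature"]) for item in registered):
            registered.append({"feature": feat, "name": None, "importance": 0.0})
    for prov in provisional:
        match = next((item for item in registered
                      if features_match(prov["feature"], item["feature"])), None)
        if match is not None:
            # Associate the name with the originally registered feature.
            match["name"] = prov["name"]
        else:
            # The matching item was deleted, so reregister the provisional item.
            registered.append({"feature": prov["feature"],
                               "name": prov["name"], "importance": 0.0})
    provisional.clear()  # provisional items are now formally registered
    return registered
```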
- In addition, assume that when the speaker registration module 322C tries to register a speaker feature amount extracted by the speaker feature extraction module 322B in the nonvolatile memory 107 as a speaker feature data item 403, a speaker feature data item 403 including that speaker feature amount is already registered in the nonvolatile memory 107. In this case, the speaker registration module 322C does not perform the above-described registration for the speaker feature amount; only the update of the importance, which will be described later, is performed.
- To enhance the precision of the speaker identification process performed by the speaker identification module 322D, which will be described later, no more than a predetermined number of speaker feature data items 403 are registered in the nonvolatile memory 107. Therefore, when a speaker feature data item 403 is registered, the speaker registration module 322C determines whether the number of speaker feature data items 403 registered in the nonvolatile memory 107 is greater than or equal to the predetermined number. If the number of speaker feature data items 403 registered in the nonvolatile memory 107 is greater than or equal to the predetermined number, the speaker registration module 322C deletes speaker feature data items 403 in accordance with the importance added to them, such that the number of registered speaker feature data items 403 becomes less than the predetermined number, as will be described later in detail. That is, a speaker feature data item 403 of small importance is deleted. Accordingly, even though the number of speaker feature data items 403 which can be registered in the nonvolatile memory 107 is limited to enhance the precision of the speaker identification process, a speaker feature data item 403 important to the user can be kept. That is, the precision of the speaker identification process can be enhanced without spoiling convenience. Because the details of the importance will be described later, a detailed explanation thereof is omitted here.
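- The importance-based deletion could be sketched as follows. Treating just-registered items as protected is inferred from the flowchart description (block B10, later); the list layout is an assumed data representation.

```python
def prune_registered(registered: list, limit: int,
                     protected: frozenset = frozenset()) -> list:
    """Delete the lowest-importance speaker feature data items until fewer
    than `limit` remain, never deleting the items registered this time
    (their indices are passed in `protected`)."""
    deletable = sorted((i for i in range(len(registered)) if i not in protected),
                       key=lambda i: registered[i]["importance"])
    keep = set(range(len(registered)))
    for i in deletable:
        if len(keep) < limit:
            break
        keep.discard(i)
    return [registered[i] for i in sorted(keep)]
```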
- The speaker identification module 322D performs a process of comparing (the speaker identification process) the speaker feature amount extracted by the speaker feature extraction module 322B with the speaker feature amounts included in the speaker feature data items 403 stored (registered) in the nonvolatile memory 107. As a technique of comparing the extracted speaker feature amount with a speaker feature amount included in a registered speaker feature data item 403, any already-existing technique may be used; for example, a technique such as i-vector is used. I-vector is a technique of extracting a speaker feature amount by reducing the number of dimensions of a certain input using factor analysis. By this technique, speakers can be efficiently distinguished (compared) even from a small quantity of data.
- Let us assume that, as a result of the above-described comparison, a speaker name is included (registered) in the speaker feature data item 403 whose speaker feature amount accords with the extracted speaker feature amount. In this case, the speaker identification module 322D determines that the one or more voice sections corresponding to the speaker feature amount (specifically, the one or more voice sections including the sound feature amounts used to extract the speaker feature amount) belong to the utterance of the speaker (person) indicated by the speaker name.
- On the other hand, let us assume that a speaker name is not included (not registered) in the speaker feature data item 403 whose speaker feature amount accords with the extracted speaker feature amount. In this case, the speaker identification module 322D acquires the number of times the speaker feature amount has been identified until the present (the number of times of speaker identification) as data on the importance added to the speaker feature data item 403 including the speaker feature amount. If the number of times of speaker identification is greater than or equal to two, the speaker identification module 322D determines that the one or more voice sections corresponding to the speaker feature amount belong to the utterance of a speaker (person) whose speaker name has not been registered yet (a person who appeared in the past, but whose speaker name has not been registered yet). In addition, if the acquired number of times of speaker identification is one, the speaker identification module 322D determines that the one or more voice sections corresponding to the speaker feature amount belong to the utterance of a new speaker (person) (a person who did not appear in the past).
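- This three-way decision reduces to a small function; the field names below are assumed names for the registered speaker name and the identification count carried with the importance data.

```python
def speaker_status(item: dict) -> str:
    """Classify an according speaker feature data item as a registered
    speaker, a not-yet-registered speaker, or a new speaker."""
    if item["name"]:
        return item["name"]            # registered: the stored speaker name
    if item["times_identified"] >= 2:
        return "unregistered"          # appeared before, but no name yet
    return "new"                       # identified for the first time
```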
- Data indicating the result of the speaker identification process is stored in the nonvolatile memory 107 as one data item included in an index data item 402. That is, the index data item 402 includes data indicating the result of the speaker clustering and the result of the speaker identification regarding the audio data item corresponding to the index data item 402.
- The speaker identification module 322D also updates the importance added to a speaker feature data item 403. The importance is, for example, a value calculated by equation (1) below.
Importance = α × (number of times of speaker identification) + β × (last date of appearance) + γ × (presence or absence of user registration)   (1)
- The terms α, β, and γ are weighting factors. The number of times of speaker identification in equation (1) represents the number of times the predetermined speaker feature amount has been identified in the above-described speaker identification process until the present. The last date of appearance in equation (1) represents how many days ago the last (most recent) recording data item, among the one or more recording data items including one or more voice sections corresponding to the predetermined speaker feature amount, was recorded. The presence or absence of user registration in equation (1) is a value determined based on whether a speaker name is included (registered) in the speaker feature data item 403 including the predetermined speaker feature amount: if a speaker name is registered, the value is one; if a speaker name is not registered, the value is zero.
- Here, the update of the importance will be described using specific values. It is assumed that in equation (1), α is 0.01, β is −0.0001, and γ is 1.0, that a predetermined speaker feature amount was identified fifteen times until the present, that a recording data item including one or more voice sections corresponding to the speaker feature amount was recorded one day ago, and that a speaker name is included in the speaker feature data item 403 including the speaker feature amount.
- In this case, the speaker identification module 322D calculates the importance by equation (1) as follows:
Importance = 0.01 × 15 + (−0.0001) × 1 + 1.0 × 1 = 1.1499
- Accordingly, the importance added to the speaker feature data item 403 including the predetermined speaker feature amount is updated to 1.1499.
- Next, consider the case where α is 0.01, β is −0.0001, and γ is 1.0 as above, but the predetermined speaker feature amount was identified five times until the present, the recording data item including one or more voice sections corresponding to the speaker feature amount was recorded thirty days ago, and a speaker name is not included in the speaker feature data item 403 including the speaker feature amount.
- In this case, the speaker identification module 322D calculates the importance by equation (1) as follows:
Importance = 0.01 × 5 + (−0.0001) × 30 + 1.0 × 0 = 0.047
- Accordingly, the importance of the speaker feature data item 403 including the predetermined speaker feature amount is updated to 0.047.
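- Equation (1) and the two worked examples above can be checked directly; the weighting factors below are the example values from the text, not fixed constants of the device.

```python
ALPHA, BETA, GAMMA = 0.01, -0.0001, 1.0   # example weighting factors

def importance(times_identified: int, days_since_last_appearance: int,
               name_registered: bool) -> float:
    """Equation (1): weighted sum of the identification count, the last
    date of appearance (in days ago), and the user-registration flag."""
    return (ALPHA * times_identified
            + BETA * days_since_last_appearance
            + GAMMA * (1.0 if name_registered else 0.0))

assert round(importance(15, 1, True), 4) == 1.1499   # first worked example
assert round(importance(5, 30, False), 4) == 0.047   # second worked example
```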
- Let us assume that the user performs an operation of adding a speaker name corresponding to one or more voice sections classified into a predetermined cluster, for example, in the reproduction view 210-3 shown in FIG. 6. In this case, the speaker provisional registration module 322E acquires the speaker feature amount corresponding to the one or more voice sections included in the predetermined cluster from the nonvolatile memory 107. Then, the speaker provisional registration module 322E generates a speaker feature provisional data item 404 including the acquired speaker feature amount and the speaker name input by the above operation. In addition, the speaker provisional registration module 322E writes the generated speaker feature provisional data item 404 to the nonvolatile memory 107. That is, the speaker provisional registration module 322E provisionally registers the speaker feature amount included in the speaker feature provisional data item 404, as sketched below.
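- The provisional registration itself is deliberately lightweight, which is what keeps the user-facing step fast. A minimal sketch, under an assumed storage layout:

```python
def provisionally_register(storage: dict, cluster_id: int, speaker_name: str) -> None:
    """Reserve a (feature amount, name) pair for a labeled cluster; formal
    registration is deferred until the next recording is analyzed."""
    feature = storage["cluster_features"][cluster_id]
    storage["provisional"].append({"feature": feature, "name": speaker_name})
```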
- Accordingly, when a speaker feature amount is next registered by the speaker registration module 322C, the formal registration of the provisionally registered speaker feature amount can be performed. That is, the registration of the speaker feature amount can be reserved.
- <Analysis Processing>
- An example of a series of procedures of the analysis processing performed by the voice recorder application 202 will be next described with reference to the flowchart of FIG. 8.
- If the user activates the voice recorder application 202 and operates (taps) the recording button 50 in the home view 210-1 shown in FIG. 4, the recording engine 321 starts recording, and the screen switches from the home view 210-1 shown in FIG. 4 to the recording view 210-2 shown in FIG. 5.
- If recording is started, the voice section detection module 321A analyzes the recorded audio data item (or an audio data item from the audio capture unit 113), and determines whether an audio data unit of a predetermined length of time is a voice section or a non-voice section (block B1). If it is determined that the audio data unit of the predetermined length of time is a non-voice section (NO in block B1), the flow returns to the process of block B1, and the voice section detection module 321A determines whether the next audio data unit is a voice section or a non-voice section.
- On the other hand, if it is determined that the audio data unit of the predetermined length of time is a voice section (YES in block B1), the sound feature extraction module 321B extracts a sound feature amount, for example, a mel-frequency cepstral coefficient (block B2).
- Next, the recording engine 321 determines whether the stop button 500A in the recording view 210-2 has been operated (tapped) by the user, that is, whether recording has been completed (block B3). If it is determined that the stop button 500A in the recording view 210-2 has not been operated, that is, if recording is still being performed (NO in block B3), the flow returns to the process of block B1. Then, the voice section detection module 321A determines whether the next audio data unit is a voice section or a non-voice section.
- On the other hand, let us assume that it is determined that the stop button 500A in the recording view 210-2 has been operated, that is, that recording has been completed (YES in block B3). In this case, the clustering module 322A classifies the one or more voice sections included in the sequence from the start point to the end point of the recorded audio data item (a set of audio data units) into clusters corresponding to the speakers included in the audio data item (block B4). For example, if five speakers are included in the audio data item, each of the one or more voice sections included in the audio data item is classified into one of five clusters. Data indicating the result of the process of block B4, that is, data indicating which voice section belongs to which cluster, is stored in the nonvolatile memory 107 as an index data item 402.
- Then, the speaker feature extraction module 322B extracts a speaker feature amount, which is a speaker-specific feature, from the sound feature amounts included in the one or more voice sections classified into the same cluster (block B5). For example, if the one or more voice sections included in the audio data item are classified into five clusters as described above, five speaker feature amounts are extracted by the speaker feature extraction module 322B.
- Next, the speaker registration module 322C registers each of the extracted speaker feature amounts in the nonvolatile memory 107 as a speaker feature data item 403 (block B6).
- In addition, the speaker registration module 322C refers to the nonvolatile memory 107, and determines whether a speaker feature provisional data item 404 provisionally registered by the speaker provisional registration module 322E is stored (registered) therein (block B7). If it is determined that no speaker feature provisional data item 404 is stored (NO in block B7), the flow proceeds to the process of block B9, which will be described later.
- On the other hand, if it is determined that a speaker feature provisional data item 404 is stored (YES in block B7), the speaker registration module 322C reregisters the speaker feature amount and the speaker name included in the provisionally registered speaker feature provisional data item 404 in the nonvolatile memory 107 as a speaker feature data item 403 (block B8).
- Then, the speaker registration module 322C determines whether the number of speaker feature data items 403 registered in the nonvolatile memory 107 is greater than or equal to a predetermined number, that is, whether the number of registered speaker feature data items 403 exceeds the upper limit (block B9). If it is determined that the number of registered speaker feature data items 403 is less than the predetermined number (NO in block B9), the flow proceeds to the process of block B11, which will be described later.
- On the other hand, if it is determined that the number of registered speaker feature data items 403 is greater than or equal to the predetermined number (YES in block B9), the speaker registration module 322C deletes speaker feature data items 403 in order of increasing importance, until the number of speaker feature data items 403 becomes less than the predetermined number (block B10). However, it should be noted that a speaker feature data item 403 registered during the current series of procedures is not deleted.
- Next, the speaker identification module 322D compares the speaker feature amounts extracted in the process of block B5 by the speaker feature extraction module 322B with the speaker feature amounts included in the speaker feature data items 403 stored in the nonvolatile memory 107. Let us assume that, as a result of the comparison, a speaker name is included in the speaker feature data item 403 whose speaker feature amount accords with an extracted speaker feature amount. In this case, the speaker identification module 322D determines that the one or more voice sections corresponding to the speaker feature amount belong to the utterance of the speaker (person) indicated by the speaker name. In addition, let us assume that, as a result of the comparison, a speaker name is not included in the according speaker feature data item 403, and the number of times of speaker identification, which is data on the importance added to the speaker feature data item 403, is greater than or equal to two. In this case, the speaker identification module 322D determines that the one or more voice sections corresponding to the speaker feature amount belong to the utterance of a speaker whose speaker name has not been registered yet. Moreover, let us assume that, as a result of the comparison, a speaker name is not included in the according speaker feature data item 403, and the number of times of speaker identification is one. In this case, the speaker identification module 322D determines that the one or more voice sections corresponding to the speaker feature amount belong to the utterance of a new speaker (block B11). The process of block B11 is repeated for each of the speaker feature amounts extracted in the process of block B5. Data indicating the result of the process of block B11 is stored in the nonvolatile memory 107 as an index data item 402.
- Then, the speaker identification module 322D updates the importance added to the speaker feature data items 403 whose speaker feature amounts accord with the speaker feature amounts extracted in the process of block B5 (block B12), and the series of procedures of the analysis processing ends here.
- By means of the above-described analysis processing, speaker feature amounts can be registered dispersedly, at the time of the analysis processing and at the time of the provisional registration performed by the speaker provisional registration module 322E; thus, the time required for speaker learning can be reduced. In particular, the time required for speaker learning at the time of the provisional registration performed by the speaker provisional registration module 322E in response to the user's operation can be greatly reduced.
- <Reproduction View>
-
FIG. 9 shows an example of the reproduction view 210-3 displayed when a predetermined recording data item is reproduced after the analysis processing shown in FIG. 8 has been performed for it. Since the analysis processing shown in FIG. 8 was performed, three types of status can be displayed in a distinguishable form for the speaker names in the speaker identification result view area 601 of the reproduction view 210-3. Specifically, a speaker whose speaker name has been registered, a speaker whose speaker name has not been registered yet, and a new speaker can be displayed in a distinguishable form. A speaker whose speaker name has been registered is a speaker whose speaker name is included in a speaker feature data item 403. A speaker whose speaker name has not been registered yet is a speaker whose speaker name is not included in a speaker feature data item 403 and whose number of times of speaker identification, regarding the importance added to the speaker feature data item 403, is greater than or equal to two. A new speaker is a speaker whose speaker name is not included in a speaker feature data item 403 and whose number of times of speaker identification, regarding the importance added to the speaker feature data item 403, is one.
- For example, when the analysis processing shown in FIG. 8 has been performed, the speaker name (for example, "Mr. A") is displayed at the left end of the one or more voice sections corresponding to a speaker feature amount whose speaker name is identified by the speaker identification module 322D. At the left end of one or more voice sections corresponding to a speaker feature amount which the speaker identification module 322D identifies as belonging to the utterance of a speaker whose speaker name has not been registered yet, nothing is displayed, to indicate that the speaker name has not been registered yet. Moreover, at the left end of one or more voice sections corresponding to a speaker feature amount which the speaker identification module 322D identifies as belonging to the utterance of a new speaker, the text "NEW" is displayed, to indicate the new speaker.
- <Pop-Up Window>
-
FIG. 10 shows an example of a pop-up displayed when a speaker name displayed in the reproduction view 210-3 is erroneous and the user corrects it. If the user performs an operation of correcting the speaker name, for example, tapping or long-pressing the speaker name displayed in the reproduction view 210-3, the voice recorder application 202 displays a pop-up as shown in FIG. 10. Specifically, the voice recorder application 202 acquires all of the one or more speaker feature data items 403 stored in the nonvolatile memory 107, and displays a pop-up on which a speaker name included in the one or more speaker feature data items 403 can be selected as a correction candidate. Accordingly, the user can easily correct the speaker name.
- <Tutorial Window>
-
FIG. 11 shows an example of a tutorial displayed in the reproduction view 210-3. The tutorial shown in FIG. 11 is displayed by the voice recorder application 202 if all the statuses of the speaker names displayed in the speaker identification result view area 601 of the reproduction view 210-3 are new speakers. In addition, the tutorial is displayed by the voice recorder application 202 if the statuses of the speaker names displayed in the speaker identification result view area 601 of the reproduction view 210-3 include a combination of an unregistered speaker and a new speaker, and the tutorial has been displayed fewer than a predetermined number of times. The content of the tutorial is a message prompting entry of a speaker name, for example, the message "Please enter speaker name. Same speaker will be automatically displayed from next time." Accordingly, the registration (provisional registration) of a speaker name can be prompted without imposing stress on the user.
- According to the above-described embodiment, the electronic device 1 has the following structure: at the time of speaker learning performed in response to the user's operation, only a speaker feature provisional data item 404 including a speaker feature amount and a speaker name is provisionally registered. In this structure, right after an audio data item is recorded, the speaker feature amount and the speaker name included in the speaker feature provisional data item 404 are reregistered as a speaker feature data item 403. That is, the electronic device 1 has a structure in which speaker learning is performed dispersedly. Accordingly, the time required for speaker learning can be greatly reduced, whereby a speaker learning function which does not impose stress on the user can be realized.
- While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Claims (12)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/056,942 US20170075652A1 (en) | 2015-09-14 | 2016-02-29 | Electronic device and method |
US16/298,889 US10770077B2 (en) | 2015-09-14 | 2019-03-11 | Electronic device and method |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201562218417P | 2015-09-14 | 2015-09-14 | |
US15/056,942 US20170075652A1 (en) | 2015-09-14 | 2016-02-29 | Electronic device and method |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/298,889 Continuation US10770077B2 (en) | 2015-09-14 | 2019-03-11 | Electronic device and method |
Publications (1)
Publication Number | Publication Date |
---|---|
US20170075652A1 true US20170075652A1 (en) | 2017-03-16 |
Family
ID=58257417
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/056,942 Abandoned US20170075652A1 (en) | 2015-09-14 | 2016-02-29 | Electronic device and method |
US16/298,889 Active 2036-03-02 US10770077B2 (en) | 2015-09-14 | 2019-03-11 | Electronic device and method |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/298,889 Active 2036-03-02 US10770077B2 (en) | 2015-09-14 | 2019-03-11 | Electronic device and method |
Country Status (1)
Country | Link |
---|---|
US (2) | US20170075652A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109147770B (en) * | 2017-06-16 | 2023-07-28 | Alibaba Group Holding Limited | Voice recognition feature optimization and dynamic registration method, client and server |
Family Cites Families (60)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6490562B1 (en) | 1997-04-09 | 2002-12-03 | Matsushita Electric Industrial Co., Ltd. | Method and system for analyzing voices |
JP3879786B2 (en) | 1997-08-05 | 2007-02-14 | Fuji Xerox Co., Ltd. | Conference information recording/reproducing device and conference information recording/reproducing method |
JP2000112490A (en) | 1998-10-06 | 2000-04-21 | Seiko Epson Corp | Speech recognition method, speech recognition device, and recording medium recording speech recognition processing program |
US6477491B1 (en) | 1999-05-27 | 2002-11-05 | Mark Chandler | System and method for providing speaker-specific records of statements of speakers |
KR100346264B1 (en) | 1999-12-02 | 2002-07-26 | LG Electronics Inc. | Multimedia Feature Description System Using Weight And Reliability |
JP2002007014A (en) * | 2000-06-19 | 2002-01-11 | Yamaha Corp | Information processor and musical instrument provided with the information processor |
US20030050777A1 (en) | 2001-09-07 | 2003-03-13 | Walker William Donald | System and method for automatic transcription of conversations |
JP3962904B2 (en) | 2002-01-24 | 2007-08-22 | NEC Corporation | Speech recognition system |
US7047200B2 (en) | 2002-05-24 | 2006-05-16 | Microsoft Corporation | Voice recognition status display |
US20040176946A1 (en) | 2002-10-17 | 2004-09-09 | Jayadev Billa | Pronunciation symbols based on the orthographic lexicon of a language |
US20040117186A1 (en) | 2002-12-13 | 2004-06-17 | Bhiksha Ramakrishnan | Multi-channel transcription-based speaker separation |
WO2004100429A2 (en) | 2003-05-01 | 2004-11-18 | James, Long | Network download system |
US7567908B2 (en) | 2004-01-13 | 2009-07-28 | International Business Machines Corporation | Differential dynamic content delivery with text display in dependence upon simultaneous speech |
JP2005202014A (en) | 2004-01-14 | 2005-07-28 | Sony Corp | Audio signal processor, audio signal processing method, and audio signal processing program |
JP2005267279A (en) | 2004-03-18 | 2005-09-29 | Fuji Xerox Co Ltd | Information processing system and information processing method, and computer program |
US8102973B2 (en) | 2005-02-22 | 2012-01-24 | Raytheon BBN Technologies Corp. | Systems and methods for presenting end to end calls and associated information |
JP2007233075A (en) | 2006-03-01 | 2007-09-13 | Murata Mach Ltd | Minutes preparation device |
JP5052449B2 (en) | 2008-07-29 | 2012-10-17 | Nippon Telegraph and Telephone Corporation | Speech section speaker classification apparatus and method, speech recognition apparatus and method using the apparatus, program, and recording medium |
JP2010054991A (en) | 2008-08-29 | 2010-03-11 | Yamaha Corp | Recording device |
NO333026B1 (en) | 2008-09-17 | 2013-02-18 | Cisco Systems Int Sarl | Control system for a local telepresence video conferencing system and method for establishing a video conferencing call. |
JP5201050B2 (en) | 2009-03-27 | 2013-06-05 | Brother Industries, Ltd. | Conference support device, conference support method, conference system, conference support program |
JP5533854B2 (en) | 2009-03-31 | 2014-06-25 | NEC Corporation | Speech recognition processing system and speech recognition processing method |
US20110154192A1 (en) | 2009-06-30 | 2011-06-23 | Jinyu Yang | Multimedia Collaboration System |
US8370142B2 (en) | 2009-10-30 | 2013-02-05 | Zipdx, Llc | Real-time transcription of conference calls |
US8438131B2 (en) | 2009-11-06 | 2013-05-07 | Altus365, Inc. | Synchronization of media resources in a media archive |
JP5685702B2 (en) | 2009-11-10 | 2015-03-18 | Advanced Media, Inc. | Speech recognition result management apparatus and speech recognition result display method |
JP5174068B2 (en) | 2010-03-11 | 2013-04-03 | Kabushiki Kaisha Toshiba | Signal classification device |
WO2012051712A1 (en) | 2010-10-21 | 2012-04-26 | Marc Reddy Gingras | Methods and apparatus for the management and viewing of calendar data |
US20130060592A1 (en) | 2011-09-06 | 2013-03-07 | Tetsuro Motoyama | Meeting arrangement with key participants and with remote participation capability |
US9792955B2 (en) | 2011-11-14 | 2017-10-17 | Apple Inc. | Automatic generation of multi-camera media clips |
US8797900B2 (en) | 2012-01-16 | 2014-08-05 | International Business Machines Corporation | Automatic web conference presentation synchronizer |
US9058806B2 (en) | 2012-09-10 | 2015-06-16 | Cisco Technology, Inc. | Speaker segmentation and recognition based on list of speakers |
WO2014043555A2 (en) | 2012-09-14 | 2014-03-20 | Google Inc. | Handling concurrent speech |
US9256860B2 (en) | 2012-12-07 | 2016-02-09 | International Business Machines Corporation | Tracking participation in a shared media session |
KR102196671B1 (en) | 2013-01-11 | 2020-12-30 | LG Electronics Inc. | Electronic Device And Method Of Controlling The Same |
US9451048B2 (en) | 2013-03-12 | 2016-09-20 | Shazam Investments Ltd. | Methods and systems for identifying information of a broadcast station and information of broadcasted content |
JP6167615B2 (en) | 2013-03-29 | 2017-07-26 | Fujitsu Limited | Blood flow index calculation program, terminal device, and blood flow index calculation method |
JP6198432B2 (en) | 2013-04-09 | 2017-09-20 | Kojima Press Industry Co., Ltd. | Voice recognition control device |
KR102045281B1 (en) | 2013-06-04 | 2019-11-15 | Samsung Electronics Co., Ltd. | Method for processing data and an electronic device thereof |
JP6534926B2 (en) | 2013-06-10 | 2019-06-26 | Panasonic Intellectual Property Corporation of America | Speaker identification method, speaker identification device and speaker identification system |
JP6450312B2 (en) | 2013-07-10 | 2019-01-09 | Panasonic Intellectual Property Corporation of America | Speaker identification method and speaker identification system |
US9460722B2 (en) | 2013-07-17 | 2016-10-04 | Verint Systems Ltd. | Blind diarization of recorded calls with arbitrary number of speakers |
US9336781B2 (en) | 2013-10-17 | 2016-05-10 | Sri International | Content-aware speaker recognition |
JP2015094811A (en) | 2013-11-11 | 2015-05-18 | Hitachi, Ltd. | System and method for visualizing speech recording |
US20150142434A1 (en) | 2013-11-20 | 2015-05-21 | David Wittich | Illustrated Story Creation System and Device |
US10742709B2 (en) | 2014-03-14 | 2020-08-11 | Avaya Inc. | Providing and using quality indicators in conferences for mitigation activities |
US10141011B2 (en) | 2014-04-21 | 2018-11-27 | Avaya Inc. | Conversation quality analysis |
US20150310863A1 (en) | 2014-04-24 | 2015-10-29 | Nuance Communications, Inc. | Method and apparatus for speaker diarization |
US20150365725A1 (en) | 2014-06-11 | 2015-12-17 | Rawllin International Inc. | Extract partition segments of personalized video channel |
US10354654B2 (en) | 2014-06-11 | 2019-07-16 | Avaya Inc. | Conversation structure analysis |
JP5959771B2 (en) | 2014-06-27 | 2016-08-02 | Kabushiki Kaisha Toshiba | Electronic device, method and program |
US9246694B1 (en) | 2014-07-07 | 2016-01-26 | Twilio, Inc. | System and method for managing conferencing in a distributed communication network |
US9489598B2 (en) | 2014-08-26 | 2016-11-08 | Qualcomm Incorporated | Systems and methods for object classification, object detection and memory management |
JP6509516B2 (en) | 2014-09-29 | 2019-05-08 | Dynabook Inc. | Electronic device, method and program |
US9596230B2 (en) | 2014-10-23 | 2017-03-14 | Level 3 Communications, Llc | Conferencing intelligence engine in a collaboration conferencing system |
US10257240B2 (en) | 2014-11-18 | 2019-04-09 | Cisco Technology, Inc. | Online meeting computer with improved noise management logic |
JP6464411B6 (en) | 2015-02-25 | 2019-03-13 | Dynabook Inc. | Electronic device, method and program |
US10133538B2 (en) * | 2015-03-27 | 2018-11-20 | Sri International | Semi-supervised speaker diarization |
US10089061B2 (en) | 2015-08-28 | 2018-10-02 | Kabushiki Kaisha Toshiba | Electronic device and method |
US20170075652A1 (en) | 2015-09-14 | 2017-03-16 | Kabushiki Kaisha Toshiba | Electronic device and method |
- 2016-02-29 US US15/056,942 patent/US20170075652A1/en not_active Abandoned
- 2019-03-11 US US16/298,889 patent/US10770077B2/en active Active
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10770077B2 (en) | 2015-09-14 | 2020-09-08 | Toshiba Client Solutions Co., Ltd. | Electronic device and method |
US20220270606A1 (en) * | 2017-03-10 | 2022-08-25 | Amazon Technologies, Inc. | Voice-based parameter assignment for voice-capturing devices |
WO2020188622A1 (en) * | 2019-03-15 | 2020-09-24 | Fujitsu Limited | Editing support program, editing support method, and editing support device |
JPWO2020188622A1 (en) * | 2019-03-15 | 2021-10-14 | Fujitsu Limited | Editing support program, editing support method, and editing support device |
CN113544772A (en) * | 2019-03-15 | 2021-10-22 | Fujitsu Limited | Editing support program, editing support method, and editing support device |
US20210383813A1 (en) * | 2019-03-15 | 2021-12-09 | Fujitsu Limited | Storage medium, editing support method, and editing support device |
JP7180747B2 (en) | 2019-03-15 | 2022-11-30 | Fujitsu Limited | Editing support program, editing support method, and editing support device |
US11468900B2 (en) * | 2020-10-15 | 2022-10-11 | Google Llc | Speaker identification accuracy |
US20220398276A1 (en) * | 2020-12-17 | 2022-12-15 | Google Llc | Automatically enhancing streaming media using content transformation |
Also Published As
Publication number | Publication date |
---|---|
US20190206413A1 (en) | 2019-07-04 |
US10770077B2 (en) | 2020-09-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10770077B2 (en) | Electronic device and method | |
JP6464411B6 (en) | Electronic device, method and program | |
US10592198B2 (en) | Audio recording/playback device | |
US10089061B2 (en) | Electronic device and method | |
US9720644B2 (en) | Information processing apparatus, information processing method, and computer program | |
WO2016103988A1 (en) | Information processing device, information processing method, and program | |
KR102339657B1 (en) | Electronic device and control method thereof | |
US20160163331A1 (en) | Electronic device and method for visualizing audio data | |
US10528249B2 (en) | Method and device for reproducing partial handwritten content | |
JP5963584B2 (en) | Electronic device and control method thereof | |
US20170249519A1 (en) | Method and device for reproducing content | |
US20110295596A1 (en) | Digital voice recording device with marking function and method thereof | |
US20140304606A1 (en) | Information processing apparatus, information processing method and computer program | |
US10216472B2 (en) | Electronic device and method for processing audio data | |
CN104375702B (en) | Method and apparatus for touch control operation | |
US20140303975A1 (en) | Information processing apparatus, information processing method and computer program | |
CN110211589B (en) | Awakening method and device of vehicle-mounted system, vehicle and machine readable medium | |
CN104471522A (en) | User interface device and method for user terminal | |
US20160093315A1 (en) | Electronic device, method and storage medium | |
WO2016103809A1 (en) | Information processing device, information processing method, and program | |
CN111158487A (en) | Human-computer interaction method using wireless headset to interact with smart terminal | |
KR102347068B1 (en) | Method and device for replaying content | |
US20190129517A1 (en) | Remote control by way of sequences of keyboard codes | |
WO2016045468A1 (en) | Voice input control method and apparatus, and terminal | |
JP6392051B2 (en) | Electronic device, method and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: KABUSHIKI KAISHA TOSHIBA, JAPAN. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:KIKUGAWA, YUSAKU;REEL/FRAME:038127/0686. Effective date: 20160307 |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| AS | Assignment | Owner name: TOSHIBA CLIENT SOLUTIONS CO., LTD., JAPAN. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:KABUSHIKI KAISHA TOSHIBA;REEL/FRAME:048720/0635. Effective date: 20181228 |
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |