KR102768071B1 - Electronic apparatus for performing actual speaker indication on spoken text and processing method thereof

Info

Publication number: KR102768071B1
Application number: KR1020240052012A
Authority: KR
Inventors: 최대식; 도가영
Original assignee: 주식회사 리턴제로
Priority date: 2024-04-18
Filing date: 2024-04-18
Publication date: 2025-02-19
Anticipated expiration: 2044-04-18

Abstract

An electronic device for performing an actual speaker indication operation on spoken text according to the present disclosure includes: an input/output module that receives user input or provides output to the user based on a user interface; a memory storing at least one process for performing the actual speaker indication operation on spoken text; and at least one processor that performs the actual speaker indication operation on the spoken text according to the process. The at least one processor may be configured to convert multi-speaker voice data into text to obtain a plurality of spoken texts each matched with an arbitrary utterance identifier, provide a question for confirming the actual speaker of the plurality of spoken texts on the user interface as a questionnaire UI or UX in a CAPTCHA format, receive a label for the actual speaker through the user interface and change the arbitrary utterance identifier to the labeled actual speaker, and change the content and configuration of the questionnaire UI or UX based on feedback received through the user interface.

Description

{ELECTRONIC APPARATUS FOR PERFORMING ACTUAL SPEAKER INDICATION ON SPOKEN TEXT AND PROCESSING METHOD THEREOF}

The present disclosure relates to an electronic device for distinguishing spoken text and a processing method thereof, and more specifically, to an electronic device for performing an actual speaker indication operation on spoken text and a processing method thereof.

A real-time STT (Speech to Text) model converts a speaker's utterances into text in real time. Implementing a real-time STT service requires technology for accurately transcribing the speaker's utterances into text and, in addition, technology for distinguishing who the speaker of each transcribed utterance is.

However, the recognition rate for accurately matching utterances to speakers is relatively low, and in particular there is a technical limitation in that it is difficult to identify the speaker when the speech of multiple speakers is extracted as text.

If the separated utterance data is marked only with simple utterance identifiers, a single utterance identifier shown in the list may actually correspond to multiple speakers, and conversely a single speaker may be marked with multiple utterance identifiers.
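
The two failure modes described above can be illustrated with a minimal sketch. The speaker names, the sample data, and the helper `find_mismatches` are hypothetical and do not appear in the disclosure; the "actual speaker" column stands for ground truth that the device does not have until a user supplies it.

```python
from collections import defaultdict

# Hypothetical diarization output: (arbitrary utterance identifier, actual speaker).
# The actual speaker is ground truth that the system does not know in advance.
utterances = [
    ("Speaker 1", "Kim"),
    ("Speaker 1", "Lee"),   # one identifier covering two people
    ("Speaker 2", "Park"),
    ("Speaker 3", "Park"),  # one person split across two identifiers
]


def find_mismatches(pairs):
    """Return identifiers shared by several speakers, and speakers split over several identifiers."""
    speakers_by_id = defaultdict(set)
    ids_by_speaker = defaultdict(set)
    for identifier, speaker in pairs:
        speakers_by_id[identifier].add(speaker)
        ids_by_speaker[speaker].add(identifier)
    merged = {i: sorted(s) for i, s in speakers_by_id.items() if len(s) > 1}
    split = {s: sorted(i) for s, i in ids_by_speaker.items() if len(i) > 1}
    return merged, split


merged, split = find_mismatches(utterances)
```

Here `merged` holds the identifiers that hide more than one speaker, and `split` holds the speakers scattered across more than one identifier.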

Therefore, if utterance data is expressed only with utterance identifiers, it may be difficult to tell at a glance who spoke, and readability may suffer.

A conventional patent document extracts entity names from the spoken text and, if an extracted entity name contains an error, labels the utterance data with a corrected entity name. However, with this approach it is difficult to distinguish the actual speakers when three or more speakers are talking, and it is physically difficult to operate in conjunction with various interfaces.

Korean Registered Patent Publication No. 10-2610360 B1 (2023.12.01)

To solve the conventional problems, an object of the embodiments disclosed herein is to provide an electronic device, and a processing method thereof, that performs an actual speaker indication operation on spoken text by providing a question for confirming the actual speaker of a plurality of spoken texts on a user interface as a questionnaire UI or UX in a CAPTCHA format.

The problems to be solved by the present disclosure are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the description below.

To achieve the above-described technical object, an electronic device for performing an actual speaker indication operation on spoken text according to the present disclosure includes: an input/output module that receives user input or provides output to the user based on a user interface; a memory storing at least one process for performing the actual speaker indication operation on spoken text; and at least one processor that performs the actual speaker indication operation on the spoken text according to the process. The at least one processor may be configured to convert multi-speaker voice data into text to obtain a plurality of spoken texts each matched with an arbitrary utterance identifier, provide a question for confirming the actual speaker of the plurality of spoken texts on the user interface as a questionnaire UI or UX in a CAPTCHA format, receive a label for the actual speaker through the user interface and change the arbitrary utterance identifier to the labeled actual speaker, and change the content and configuration of the questionnaire UI or UX based on feedback received through the user interface.

To achieve the above-described technical object, a method according to the present disclosure for processing actual speaker indication on spoken text is performed by a computing device that includes a user-interface-based input/output module, a memory, and at least one processor that performs an actual speaker indication operation on spoken text. The method may include the steps of: converting multi-speaker voice data into text to obtain a plurality of spoken texts each matched with an arbitrary utterance identifier; providing a question for confirming the actual speaker of the plurality of spoken texts on a user interface as a questionnaire UI or UX in a CAPTCHA format; receiving a label for the actual speaker through the user interface and changing the arbitrary utterance identifier to the labeled actual speaker; and changing the content and configuration of the questionnaire UI or UX based on feedback received through the user interface.
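
The question and relabeling steps of the method can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `Utterance` class, the functions `build_questions` and `apply_labels`, the prompt text, and the sample transcript are all hypothetical. The speech-to-text conversion and the feedback-driven change to the questionnaire are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str  # arbitrary utterance identifier until a user labels it
    text: str


def build_questions(utterances):
    """Build one questionnaire item per distinct identifier, using its first text as the sample."""
    samples = {}
    for u in utterances:
        samples.setdefault(u.speaker, u.text)
    return [
        {"identifier": ident, "sample": text, "prompt": "Who said this?"}
        for ident, text in samples.items()
    ]


def apply_labels(utterances, labels):
    """Replace each arbitrary identifier with the user's label; unlabeled identifiers are kept."""
    return [Utterance(labels.get(u.speaker, u.speaker), u.text) for u in utterances]


# Hypothetical transcript as it might come out of the speech-to-text step.
transcript = [
    Utterance("Speaker 1", "Shall we start?"),
    Utterance("Speaker 2", "Yes, go ahead."),
    Utterance("Speaker 1", "First item."),
]
questions = build_questions(transcript)
labeled = apply_labels(transcript, {"Speaker 1": "Alice"})
```

In this sketch a single answer for "Speaker 1" relabels every utterance that carries that identifier, while "Speaker 2" stays unchanged until the user answers its question.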

In addition, a computer program stored in a computer-readable recording medium for implementing the present disclosure may be further provided.

In addition, a computer-readable recording medium on which a computer program for implementing the present disclosure is recorded may be further provided.

According to the above-described means for solving the problem, a question for confirming the actual speaker of a plurality of spoken texts is provided on a user interface as a questionnaire UI or UX in a CAPTCHA format, which gives the user an interface for easily matching utterance identifiers with actual speakers.

In addition, according to the above-described means for solving the problem, readability is improved because utterances in the list are distinguished by the names of the actual speakers, data processing costs are reduced because users perform the data labeling themselves, and branding is strengthened through user feedback.

The effects of the present disclosure are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a block diagram schematically illustrating the configuration of an electronic device for performing an actual speaker indication operation on spoken text according to one embodiment of the present disclosure.
FIG. 2 is a block diagram schematically illustrating a process of an electronic device for performing an actual speaker indication operation on spoken text according to one embodiment of the present disclosure.
FIG. 3 is a screen diagram schematically illustrating a user interface for performing an actual speaker indication operation on spoken text according to one embodiment of the present disclosure.
FIGS. 4 and 5 are screen diagrams schematically illustrating a user interface of an electronic device for performing an actual speaker indication operation on spoken text according to one embodiment of the present disclosure.
FIG. 6 is a block diagram schematically illustrating a feedback process of an electronic device for performing an actual speaker indication operation on spoken text according to one embodiment of the present disclosure.
FIG. 7 is a graph showing clusters of embedding vectors of an electronic device for performing an actual speaker indication operation on spoken text according to one embodiment of the present disclosure.
FIG. 8 is a flow chart illustrating a processing method of an electronic device for performing an actual speaker indication operation on spoken text according to one embodiment of the present disclosure.

Throughout the present disclosure, the same reference numerals refer to the same components. The present disclosure does not describe every element of the embodiments, and content that is general knowledge in the technical field to which the present disclosure belongs, or that is duplicated between embodiments, is omitted. The terms 'part, module, member, block' used in the specification may be implemented in software or hardware, and depending on the embodiment, a plurality of 'parts, modules, members, blocks' may be implemented as a single component, or a single 'part, module, member, block' may include a plurality of components.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only a direct connection but also an indirect connection, and an indirect connection includes a connection via a wireless communication network.

In addition, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them.

Throughout the specification, when a member is said to be located "on" another member, this includes not only the case where the member is in contact with the other member but also the case where a further member is present between the two.

Terms such as first and second are used to distinguish one component from another, and the components are not limited by these terms.

Singular expressions include plural expressions unless the context clearly indicates otherwise.

The identification codes attached to the steps are used for convenience of explanation and do not describe the order of the steps. The steps may be performed in an order different from the one stated unless the context clearly specifies a particular order.

The operating principle and embodiments of the present disclosure are described below with reference to the attached drawings.

In this specification, the 'device according to the present disclosure' includes all of the various devices that can perform computational processing and provide results to a user. For example, the device according to the present disclosure may include all of a computer, a server device, and a portable terminal, or may take the form of any one of them.

Here, the computer may include, for example, a notebook, desktop, laptop, tablet PC, or slate PC equipped with a web browser.

The server device is a server that processes information by communicating with external devices, and may include an application server, a computing server, a database server, a file server, a game server, a mail server, a proxy server, a web server, and the like.

The portable terminal is, for example, a wireless communication device that ensures portability and mobility, and may include all kinds of handheld wireless communication devices such as PCS (Personal Communication System), GSM (Global System for Mobile communications), PDC (Personal Digital Cellular), PHS (Personal Handyphone System), PDA (Personal Digital Assistant), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (W-Code Division Multiple Access), and WiBro (Wireless Broadband Internet) terminals and smartphones, as well as wearable devices such as a watch, a ring, a bracelet, an anklet, a necklace, glasses, contact lenses, or a head-mounted device (HMD).

The functions related to artificial intelligence according to the present disclosure operate through a processor and a memory. The processor may consist of one or more processors. The one or more processors may be a general-purpose processor such as a CPU, an AP, or a DSP (Digital Signal Processor); a graphics-dedicated processor such as a GPU or a VPU (Vision Processing Unit); or an artificial-intelligence-dedicated processor such as an NPU. The one or more processors control the processing of input data according to predefined operation rules or an artificial intelligence model stored in the memory. Alternatively, when the one or more processors are artificial-intelligence-dedicated processors, they may be designed with a hardware structure specialized for processing a specific artificial intelligence model.

The predefined operation rules or artificial intelligence model are characterized by being created through learning. Here, being created through learning means that a basic artificial intelligence model is trained by a learning algorithm on a large amount of training data, thereby producing predefined operation rules or an artificial intelligence model set to perform a desired characteristic (or purpose). Such learning may take place on the device itself on which the artificial intelligence according to the present disclosure runs, or through a separate server and/or system. Examples of learning algorithms include supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning, but are not limited to these examples.

The artificial intelligence model may consist of a plurality of neural network layers. Each of the neural network layers has a plurality of weight values and performs its neural network operation by combining the operation result of the previous layer with those weights. The weights of the neural network layers may be optimized through the training of the artificial intelligence model. For example, the weights may be updated so that a loss value or cost value obtained from the artificial intelligence model during training is reduced or minimized. The artificial neural network may include a deep neural network (DNN), examples of which include, but are not limited to, a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), an RBM (Restricted Boltzmann Machine), a DBN (Deep Belief Network), a BRDNN (Bidirectional Recurrent Deep Neural Network), and Deep Q-Networks.
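
As a minimal sketch of weights being updated so that a loss value decreases, the following uses plain gradient descent on a single linear neuron with a squared-error loss. The disclosure does not specify an optimizer or loss function, so the choice of gradient descent, the learning rate, the input values, and the function names are all assumptions made for illustration.

```python
def layer_forward(inputs, weights):
    """Combine the previous layer's outputs with this layer's weights (one linear neuron)."""
    return sum(x * w for x, w in zip(inputs, weights))


def squared_loss(prediction, target):
    return (prediction - target) ** 2


def gradient_step(inputs, weights, target, lr=0.05):
    """Apply one gradient-descent update of the weights for the squared loss."""
    error = layer_forward(inputs, weights) - target
    # d/dw_i (prediction - target)^2 = 2 * error * x_i
    return [w - lr * 2 * error * x for w, x in zip(weights, inputs)]


# Assumed toy data: one training example and zero-initialized weights.
inputs, target = [1.0, 2.0], 3.0
weights = [0.0, 0.0]
loss_before = squared_loss(layer_forward(inputs, weights), target)
for _ in range(20):
    weights = gradient_step(inputs, weights, target)
loss_after = squared_loss(layer_forward(inputs, weights), target)
```

With these values the prediction error halves on every step, so `loss_after` ends far below `loss_before`.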

According to an exemplary embodiment of the present disclosure, the processor may implement artificial intelligence. Artificial intelligence here refers to a machine learning method based on an artificial neural network, which lets a machine learn by imitating human biological neurons. Depending on the learning method, artificial intelligence methodologies can be divided into supervised learning, in which input data and output data are provided together as training data so that the answer (output data) to a problem (input data) is given; unsupervised learning, in which only input data is provided without output data so that the answer to the problem is not given; and reinforcement learning, in which a reward is given from the external environment whenever an action is taken in the current state and learning proceeds in the direction that maximizes this reward. Artificial intelligence methodologies can also be classified by architecture, which is the structure of the learning model. Widely used deep learning architectures include the convolutional neural network (CNN), the recurrent neural network (RNN), the Transformer, and the generative adversarial network (GAN).

The device and system may include an artificial intelligence model. The artificial intelligence model may be a single model or may be implemented as a plurality of models. The artificial intelligence model may consist of a neural network (or artificial neural network) and may include a statistical learning algorithm that, in machine learning and cognitive science, mimics biological neurons. A neural network may refer to any model in which artificial neurons (nodes), forming a network through synaptic connections, acquire problem-solving ability by changing the strength of those connections through learning. The neurons of a neural network may include a combination of weights or biases. A neural network may include one or more layers, each consisting of one or more neurons or nodes. For example, the device may include an input layer, a hidden layer, and an output layer. The neural network constituting the device can infer the output to be predicted from an arbitrary input by changing the weights of its neurons through learning.
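
The input layer, hidden layer, and output layer arrangement described above can be sketched as a small forward pass. The layer sizes, the sigmoid activation, the parameter values, and the function names are illustrative assumptions and are not taken from the disclosure.

```python
import math


def dense(inputs, weights, biases):
    """Compute each neuron's output as a weighted sum of the inputs plus a bias."""
    return [sum(x * w for x, w in zip(inputs, row)) + b for row, b in zip(weights, biases)]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def forward(inputs, params):
    """Run input layer -> hidden layer (sigmoid) -> output layer (linear)."""
    hidden = sigmoid(dense(inputs, params["w_hidden"], params["b_hidden"]))
    return dense(hidden, params["w_out"], params["b_out"])


# Assumed parameters: 2 inputs, 2 hidden neurons, 1 output neuron.
params = {
    "w_hidden": [[0.5, -0.5], [0.25, 0.75]],
    "b_hidden": [0.0, 0.0],
    "w_out": [[1.0, -1.0]],
    "b_out": [0.5],
}
output = forward([0.0, 0.0], params)
```

Training would adjust the entries of `params` so that `forward` produces the desired output for each input; only inference is shown here.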

The processor may generate a neural network, train (or learn) a neural network, perform an operation based on received input data and generate an information signal based on the result, or retrain the neural network. Neural network models may include various types of models, such as a CNN (Convolutional Neural Network) like GoogleNet, AlexNet, or VGG Network, R-CNN (Region with Convolutional Neural Network), RPN (Region Proposal Network), RNN (Recurrent Neural Network), S-DNN (Stacking-based Deep Neural Network), S-SDNN (State-Space Dynamic Neural Network), Deconvolution Network, DBN (Deep Belief Network), RBM (Restricted Boltzmann Machine), Fully Convolutional Network, LSTM (Long Short-Term Memory) Network, and Classification Network, but are not limited thereto. The processor may include one or more processors for performing operations according to the neural network models. For example, the neural network may include a deep neural network.

Those skilled in the art will understand that the neural network may include, without being limited to, any neural network such as a CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), perceptron, multilayer perceptron, FF (Feed Forward), RBF (Radial Basis Function) network, DFF (Deep Feed Forward), LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit), AE (Auto Encoder), VAE (Variational Auto Encoder), DAE (Denoising Auto Encoder), SAE (Sparse Auto Encoder), MC (Markov Chain), HN (Hopfield Network), BM (Boltzmann Machine), RBM (Restricted Boltzmann Machine), DBN (Deep Belief Network), DCN (Deep Convolutional Network), DN (Deconvolutional Network), DCIGN (Deep Convolutional Inverse Graphics Network), GAN (Generative Adversarial Network), LSM (Liquid State Machine), ELM (Extreme Learning Machine), ESN (Echo State Network), DRN (Deep Residual Network), DNC (Differentiable Neural Computer), NTM (Neural Turing Machine), CN (Capsule Network), KN (Kohonen Network), and AN (Attention Network).

According to an exemplary embodiment of the present disclosure, the processor may use various artificial intelligence structures and algorithms, including but not limited to: a CNN (Convolutional Neural Network) such as GoogleNet, AlexNet, or VGG Network, R-CNN (Region with Convolutional Neural Network), RPN (Region Proposal Network), RNN (Recurrent Neural Network), S-DNN (Stacking-based Deep Neural Network), S-SDNN (State-Space Dynamic Neural Network), Deconvolution Network, DBN (Deep Belief Network), RBM (Restricted Boltzmann Machine), Fully Convolutional Network, LSTM (Long Short-Term Memory) Network, Classification Network, Generative Modeling, eXplainable AI, Continual AI, Representation Learning, and AI for Material Design; BERT, SP-BERT, MRC/QA, Text Analysis, Dialog System, GPT-3, and GPT-4 for natural language processing; Visual Analytics, Visual Understanding, Video Synthesis, and ResNet for vision processing; and Anomaly Detection, Prediction, Time-Series Forecasting, Optimization, Recommendation, and Data Creation for data intelligence.
Hereinafter, embodiments of the present disclosure will be described in detail with reference to the attached drawings.

FIG. 1 is a block diagram schematically illustrating the configuration of an electronic device (100) for performing an actual speaker indication operation on spoken text according to one embodiment of the present disclosure.

Referring to FIG. 1, the electronic device (100) according to the present disclosure may include an input/output module (110), a communication module (120), a memory (130), and a processor (140). Hereinafter, the electronic device (100) is an electronic device that performs the actual speaker indication operation on spoken text, and the operations described are assumed to be implemented through this electronic device (100).

The input/output module (110) may be any of various interfaces or connection ports that receive user input or output information to the user. The input/output module (110) may be divided into an input module and an output module.

The input module receives user input from the user. It is used to input image information (or signals), audio information (or signals), data, or information entered by the user, and may include at least one of: at least one camera, at least one microphone, and a user input unit. Voice data or image data collected by the input unit may be analyzed and processed as a control command of the user.

User input can take various forms, including key input, touch input, and voice input. The input module is a comprehensive concept covering every means of detecting or receiving such input: traditional keypads, keyboards, and mice; touch sensors that detect the user's touch; microphones that receive voice signals; cameras that recognize gestures through image recognition; proximity sensors, such as illuminance or infrared sensors, that detect the user's approach; motion sensors that recognize user movement through acceleration or gyro sensors; and various other input means.

Here, the touch sensor may be implemented as a piezoelectric or capacitive touch sensor that detects touch through a touch panel or touch film attached to the display panel, an optical touch sensor that detects touch optically, or the like. The input module may also be implemented as an input interface (a USB port, a PS/2 port, etc.) that connects an external input device that receives user input, instead of a device that detects user input itself.

The output module can output various information and provide it to the user. The output module is a comprehensive concept that includes a display that outputs images, a speaker (and/or an amplifier connected to it) that outputs sound, a haptic device that generates vibration, and various other output means. The output module may also be implemented as a port-type output interface that connects the individual output means described above.

For example, an output module in the form of a display can show text, still images, and video. The term display refers broadly to any image display device capable of image output, including a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a flat panel display (FPD), a transparent display, a curved display, a flexible display, a 3D display, a holographic display, a projector, and other such devices. The display may also be a touch display integrated with the touch sensor of the input module.

In other words, the input/output module (110) can receive user input or provide output to the user based on a user interface.

The communication module (120) can communicate with an external device, so the device can transmit information to and receive information from the external device through the communication module. For example, the device can use the communication module to communicate with the external device so that information stored and generated within the electric vehicle charging management system is shared. The communication module (120) may include, for example, at least one of a wired communication module, a wireless communication module, a short-range communication module, and a location information module.

Here, communication, that is, the transmission and reception of data, may be wired or wireless. To this end, the communication module may consist of a wired communication module that connects to the Internet or the like via a LAN (Local Area Network); a mobile communication module that connects to a mobile communication network via a mobile communication base station to transmit and receive data; a short-range communication module that uses a WLAN (Wireless Local Area Network) method such as Wi-Fi or a WPAN (Wireless Personal Area Network) method such as Bluetooth or Zigbee; a satellite communication module that uses a GNSS (Global Navigation Satellite System) such as GPS (Global Positioning System); or a combination of these.

The wireless communication technology used may include NB-IoT (Narrowband Internet of Things) for low-power communication. NB-IoT is one example of LPWAN (Low Power Wide Area Network) technology and may be implemented in standards such as LTE Cat (category) NB1 and/or LTE Cat NB2, without being limited to these names. Additionally or alternatively, the wireless communication technology implemented in a wireless device according to various embodiments may communicate based on LTE-M technology. LTE-M is another example of LPWAN technology and may be referred to by various names such as eMTC (enhanced Machine Type Communication). For example, LTE-M may be implemented in at least one of various standards such as 1) LTE Cat 0, 2) LTE Cat M1, 3) LTE Cat M2, 4) LTE non-BL (non-Bandwidth Limited), 5) LTE-MTC, 6) LTE Machine Type Communication, and/or 7) LTE M, without being limited to these names.

Additionally or alternatively, the wireless communication technology implemented in a wireless device according to various embodiments may include at least one of ZigBee, Bluetooth, and LPWAN, chosen with low-power communication in mind, without being limited to these names. For example, ZigBee technology can create PANs (personal area networks) for small, low-power digital communication based on various standards such as IEEE 802.15.4, and may be referred to by various names.

The wired communication module may include not only various wired communication modules such as a Local Area Network (LAN) module, a Wide Area Network (WAN) module, or a Value Added Network (VAN) module, but also various cable communication modules such as USB (Universal Serial Bus), HDMI (High Definition Multimedia Interface), DVI (Digital Visual Interface), RS-232 (Recommended Standard 232), power line communication, or POTS (plain old telephone service).

In addition to a Wi-Fi module and a WiBro (Wireless Broadband) module, the wireless communication module may include modules that support various wireless communication methods such as GSM (Global System for Mobile Communication), CDMA (Code Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), UMTS (Universal Mobile Telecommunications System), TDMA (Time Division Multiple Access), LTE (Long Term Evolution), 4G, 5G, and 6G.

The wireless communication module may include a wireless communication interface comprising an antenna and a transmitter for transmitting signals. It may further include a signal conversion module that, under the control of the control unit, modulates a digital control signal output from the control unit into an analog wireless signal for transmission through the wireless communication interface.

The wireless communication module may include a wireless communication interface comprising an antenna and a receiver for receiving signals. It may further include a signal conversion module that demodulates an analog wireless signal received through the wireless communication interface into a digital control signal.

The short-range communication module is for short-range communication and can support it using at least one of Bluetooth™, RFID (Radio Frequency Identification), IrDA (Infrared Data Association), UWB (Ultra Wideband), ZigBee, NFC (Near Field Communication), Wi-Fi (Wireless Fidelity), Wi-Fi Direct, and Wireless USB (Wireless Universal Serial Bus) technologies.

The location information module is a module for obtaining the location (or current location) of the electronic device (100) according to the present disclosure; representative examples are a GPS (Global Positioning System) module and a Wi-Fi (Wireless Fidelity) module. With a GPS module, the location of the electronic device can be obtained from signals sent by GPS satellites. With a Wi-Fi module, the location of the electronic device (100) can be obtained from information about the wireless AP (Wireless Access Point) that exchanges wireless signals with the Wi-Fi module. Where necessary, the location information module may, alternatively or additionally, perform a function of another module of the communication module to obtain data on the location of the device. The location information module is a module used to obtain the location (or current location) of the device and is not limited to a module that directly calculates or obtains that location.

The memory (130) can store various types of information, either temporarily or semi-permanently. For example, the memory may store an operating system (OS) for running the first device and/or the second device, data for hosting a website, and data on a program or application (for example, a web application) for generating Braille. The memory may also store the modules in the form of computer code, as described above.

Examples of the memory (130) include a hard disk drive (HDD), a solid state drive (SSD), flash memory, read-only memory (ROM), and random access memory (RAM). Such memory may be built in or removable.

The processor (140) controls the overall operation of the electronic device (100). To this end, the processor (140) can compute and process various types of information and control the operation of the components of the first device and/or the second device.

The processor (140) may be implemented as a computer or a similar device using hardware, software, or a combination of the two. In hardware, the processor (140) may be provided as an electronic circuit that processes electrical signals to perform control functions; in software, it may be provided as a program that drives the hardware processor. Unless otherwise stated in the following description, the operations of the first device and/or the second device may be interpreted as being performed under the control of the processor (140). That is, the modules may be interpreted as the processor (140) controlling the first device and/or the second device to perform the operations described below.

The processor (140) may be implemented as a memory that stores data for an algorithm for controlling the operation of the components in the device, or for a program that reproduces the algorithm, together with at least one sub-processor (not shown) that performs the operations described above using the data stored in that memory. The memory and the processor may be implemented as separate chips or as a single chip.

In addition, the processor (140) may control any one of the components described above, or a combination of several of them, to implement on the device the various embodiments of the present disclosure described below with reference to FIGS. 2 to 8.

The electronic device (100) for performing an actual speaker display operation for spoken text according to the present disclosure may include an input/output module (110) that receives user input or provides output to the user based on a user interface, a memory (130) storing at least one process for performing the actual speaker display operation for spoken text, and at least one processor (140) that performs the actual speaker display operation for the spoken text according to the process.

In one embodiment, the at least one processor (140) may be configured to: convert multi-speaker voice data into text to obtain a plurality of spoken texts, each matched with an arbitrary utterance identifier; provide a question for confirming the actual speaker of the spoken texts on the user interface as a questionnaire UI or UX in a CAPTCHA format; receive a label for the actual speaker through the user interface and change the arbitrary utterance identifier to the labeled actual speaker; and change the content and configuration of the questionnaire UI or UX based on feedback received through the user interface.
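The relabeling part of this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `Utterance` record, the `relabel` function, and the sample identifiers and names are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Utterance:
    speaker: str  # arbitrary utterance identifier, e.g. "Participant 1"
    text: str     # transcribed spoken text


def relabel(utterances, labels):
    """Replace arbitrary identifiers with user-labeled actual speakers.

    `labels` maps an arbitrary identifier to the actual speaker name
    entered through the questionnaire; unlabeled identifiers are kept.
    """
    return [replace(u, speaker=labels.get(u.speaker, u.speaker))
            for u in utterances]


# Hypothetical transcript produced by speech-to-text with diarization.
transcript = [
    Utterance("Participant 1", "Shall we start?"),
    Utterance("Participant 2", "Yes, go ahead."),
    Utterance("Participant 1", "Here is the first item."),
]
labeled = relabel(transcript, {"Participant 1": "Alice"})
```

Returning new records instead of mutating the transcript keeps the original identifiers available in case a label is later found to be wrong.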

As illustrated in FIG. 2, the electronic device (100) according to an embodiment of the present disclosure can perform the actual speaker display operation for spoken text by performing a UI/UX providing step (S210), an input step (S220), a feedback step (S230), and a UI/UX re-providing step (S240).

In the UI/UX providing step (S210), the at least one processor (140) can output a UI/UX on the user interface so that the user can easily match each utterance identifier with an actual speaker.

Before the questionnaire UI/UX is provided, a UI for viewing the extracted spoken text is provided. For example, a chat room can be provided that displays the utterance identifier, the utterance content, the interval over which the utterance lasted, and the audio of that interval. In the chat room, an utterance identifier is shown for each item of utterance content, and the audio can be played back separately for each item of utterance content.

When the user enters a chat room, a questionnaire UI/UX in a CAPTCHA format can be output to guide the user to match each utterance identifier to the person it represents, which improves the readability of the utterance data.

The CAPTCHA-format questionnaire UI/UX can be implemented with different interfaces for different question types and provided on the user interface. Specific questionnaire UIs/UXs are described later.

In the input step (S220), the user can answer, through the input/output module (110), the CAPTCHA-format questionnaire UI/UX output on the user interface.

Through the input/output module (110), the user can enter a correct/incorrect answer such as O/X into the CAPTCHA-format questionnaire UI/UX displayed on the user interface, or enter an answer corresponding to the question type, such as the number of speakers in the audio or the actual speaker who matches the audio.

In the feedback step (S230), the content, configuration, or question type of the CAPTCHA-format questionnaire UI/UX to be displayed on the user interface can be adjusted based on the user's actions in the user-interface-based input/output module (110).

In one embodiment, the at least one processor (140) may lower the appearance weight of a question, so as to reduce the probability that the question appears, in consideration of at least one of the dwell time on the user interface, the non-response rate, and the incorrect-answer rate measured for each type of question that confirms the actual speaker.

For example, if the dwell time taken to enter an answer to a questionnaire UI/UX through the input/output module (110) exceeds a preset threshold, a downward weight can be applied to lower the appearance probability of that questionnaire UI/UX content, configuration, or question type.

Alternatively, if the non-response rate, meaning the rate at which users enter the chat room directly without answering the questionnaire UI/UX through the input/output module (110), exceeds a preset threshold, a downward weight can be applied to lower the appearance probability of that questionnaire UI/UX content, configuration, or question type.

Alternatively, an answer to a questionnaire UI/UX may be entered through the input/output module (110) but judged incorrect when compared with answers to other questionnaire UIs/UXs, or an incorrect answer may be entered during re-examination of a pre-matched voice. If the resulting incorrect-answer rate exceeds a preset threshold, a downward weight can be applied to lower the appearance probability of that questionnaire UI/UX content, configuration, or question type.

Accordingly, based on the dwell time, non-response rate, and incorrect-answer rate for a questionnaire UI/UX, the content or configuration of that questionnaire UI/UX, or the question type it presents, can be judged to be too difficult or impossible to answer, and its output probability can be lowered so that the questions stay easy. The purpose is to encourage and guide user participation.
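The weight adjustment described above can be sketched as follows. The disclosure does not specify threshold values, the size of the downward adjustment, or how a question type is drawn, so the thresholds, the halving penalty, the floor, the question-type names, and the weighted draw below are all assumptions made for illustration.

```python
import random


def adjust_weight(weight, dwell_s, non_response_rate, error_rate,
                  max_dwell_s=20.0, max_non_response=0.5, max_error=0.4,
                  penalty=0.5, floor=0.05):
    """Apply one downward step for each metric that exceeds its threshold.

    All threshold values, the penalty factor, and the floor are hypothetical.
    """
    for value, limit in ((dwell_s, max_dwell_s),
                         (non_response_rate, max_non_response),
                         (error_rate, max_error)):
        if value > limit:
            weight *= penalty
    return max(weight, floor)  # keep every type selectable in principle


def pick_question_type(weights, rng=random):
    """Draw a question type with probability proportional to its weight."""
    types = list(weights)
    return rng.choices(types, weights=[weights[t] for t in types], k=1)[0]


# Hypothetical question types, all starting with equal weight.
weights = {"pick_my_voice": 1.0, "name_the_speaker": 1.0, "confirm_ox": 1.0}
# Users linger on and skip "name_the_speaker", so its weight is lowered.
weights["name_the_speaker"] = adjust_weight(
    weights["name_the_speaker"],
    dwell_s=35.0, non_response_rate=0.6, error_rate=0.1)
next_type = pick_question_type(weights)
```

The floor is a design choice rather than something the source requires: it prevents a question type from disappearing permanently after a few bad measurements.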

In one embodiment, the at least one processor (140) may collect audio fragments with a high incorrect-answer rate as retraining data. So that they can be used as incorrect-answer data, such fragments may be organized into a training data set that reflects the drop in recognition rate in poor environments, such as noisy recordings, or for speakers with particular speaking habits.
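A minimal sketch of this collection step is shown below. The source only says that fragments with a high incorrect-answer rate are collected; the minimum number of answers, the rate threshold, and the `(wrong, total)` bookkeeping format are assumptions.

```python
def collect_retraining_fragments(stats, min_answers=5, error_threshold=0.5):
    """Return ids of fragments whose incorrect-answer rate is high.

    `stats` maps a fragment id to a (wrong, total) pair. Fragments with
    fewer than `min_answers` answers are skipped because their rate is
    not yet meaningful. Both parameters are hypothetical.
    """
    selected = []
    for fragment_id, (wrong, total) in stats.items():
        if total >= min_answers and wrong / total > error_threshold:
            selected.append(fragment_id)
    return sorted(selected)


# Hypothetical answer statistics per audio fragment.
stats = {"frag-a": (1, 10), "frag-b": (8, 10), "frag-c": (2, 2)}
retrain_ids = collect_retraining_fragments(stats)
```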

In the feedback step (S230), the at least one processor (140) can adjust, according to the reliability of a voice, how often that voice is extracted to be asked about in the questionnaire UI/UX.

If the inverse of the distribution entropy within the embedding cluster that contains the voice embedding vectors of a pre-matched actual speaker is greater than or equal to a preset threshold, the weight with which audio fragments of that actual speaker are extracted can be adjusted downward.

A voice can be represented as an n-dimensional embedding vector, and clusters can be identified by distinguishing vectors that are close to one another from vectors that are far apart.

Each time a new voice is input, the cluster to which it belongs can be determined. For a cluster whose vectors lie close together and are densely packed, a high reliability is calculated from the inverse of the distribution entropy.

Conversely, for a cluster whose vectors are not close together and are sparsely distributed, a low reliability is calculated from the inverse of the distribution entropy.

Accordingly, multiple embedding vectors of a speaker are extracted and the distances between them are calculated. If the voice is found to be highly reliable, no feedback is needed, so there is no need to receive input about that voice through the questionnaire UI/UX. A downward weight can therefore be applied to the probability that a questionnaire UI/UX for the pre-matched voice appears on the user interface.

If the voice is found to have low reliability, feedback is needed. To receive additional input about that voice through the questionnaire UI/UX, an upward weight can be applied to the probability that a questionnaire UI/UX for the pre-matched voice appears on the user interface.
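The reliability calculation can be sketched as follows. The source does not define how the distribution entropy is computed, so this example assumes one possible definition: the Shannon entropy of a fixed-bin-width histogram of each embedding's Euclidean distance to the cluster centroid. The bin width, the threshold, the adjustment step, and the function names are likewise hypothetical.

```python
import math
from collections import Counter


def distribution_entropy(vectors, bin_width=0.1):
    """Shannon entropy (in nats) of the histogram of centroid distances.

    Assumed definition: a tight cluster puts every distance in the same
    bin (entropy 0); a sparse cluster spreads distances over many bins.
    """
    n = len(vectors)
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / n for i in range(dim)]
    bins = Counter(int(math.dist(v, centroid) / bin_width) for v in vectors)
    return -sum((c / n) * math.log(c / n) for c in bins.values())


def reliability(vectors, eps=1e-6):
    """Inverse of the distribution entropy; eps avoids division by zero."""
    return 1.0 / (distribution_entropy(vectors) + eps)


def adjust_extraction_weight(weight, vectors, threshold=2.0, step=0.5):
    """Lower the weight for reliable voices and raise it for unreliable ones."""
    if reliability(vectors) >= threshold:
        return weight * step   # reliable: ask about this voice less often
    return weight / step       # unreliable: ask about this voice more often
```

With this definition, four nearly identical two-dimensional embeddings fall into one bin and yield a very high reliability, while four widely scattered embeddings yield a reliability below the assumed threshold.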

In the UI/UX re-providing step (S240), the questionnaire UI/UX output can be adjusted according to the weight adjustments made in the feedback step (S230) and provided on the user interface.

In this way, user participation can be encouraged by keeping the difficulty of answering low while maintaining the existing UI/UX.

FIGS. 3 to 5 show the UI/UX output on the user interface of the electronic device (100) according to an embodiment of the present disclosure.

In one embodiment, the plurality of spoken texts include an utterance identifier (210), utterance content (220), the interval over which the utterance lasted, and the audio (230) of that interval, and the at least one processor (140) may randomly extract, from the audio of that interval, at least one audio fragment that fits the question type and trim it to within a preset time interval.
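Random extraction and trimming can be sketched as follows. The `Segment` record, the five-second limit, and the choice of a random window inside a long segment are assumptions; filtering segments by question type is omitted for brevity.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    speaker: str  # arbitrary utterance identifier
    start: float  # start of the utterance interval, in seconds
    end: float    # end of the utterance interval, in seconds


def pick_and_trim(segments, max_len=5.0, rng=random):
    """Pick one segment at random and trim it to at most max_len seconds."""
    seg = rng.choice(segments)
    duration = seg.end - seg.start
    if duration <= max_len:
        return seg  # already short enough
    # Choose a random window of max_len seconds inside the segment.
    offset = rng.uniform(0.0, duration - max_len)
    start = seg.start + offset
    return Segment(seg.speaker, start, start + max_len)


# Hypothetical utterance intervals from one chat room.
segments = [
    Segment("Participant 1", 10.0, 42.0),
    Segment("Participant 2", 43.5, 46.0),
]
clip = pick_and_trim(segments, max_len=5.0)
```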

Referring to FIG. 3, arbitrary labels such as Participant 1, Participant 2, and Participant 3 can be assigned as utterance identifiers (210). The utterance content (220) for each utterance identifier (210) can be displayed below that identifier, and an interface for playing the audio (230) of the utterance interval corresponding to each item of utterance content (220) can be placed alongside it.

Here, Participant 1, Participant 2, and Participant 3 are extracted, but this is only an example. If there are five or more participants, only three speakers may be extracted at random and the audio (230) of their utterance intervals played so that the user can select the voice that is "me".

In other words, the at least one processor (140) can output a user interface that plays the plurality of trimmed audio fragments and lets the user select the fragment that corresponds to the user's own voice.
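Building such a "select my voice" question can be sketched as follows. The dictionary layout of the question, the prompt wording, the clip file names, and the function names are hypothetical; the rule of sampling three speakers when five or more are present follows the example in the text.

```python
import random


def sample_candidates(speaker_ids, k=3, min_participants=5, rng=random):
    """Return all speakers, or k random ones if there are many participants."""
    ids = list(speaker_ids)
    if len(ids) < min_participants:
        return ids
    return rng.sample(ids, k)


def build_pick_my_voice_question(clips_by_speaker, k=3, rng=random):
    """Build a question listing one trimmed clip per sampled speaker."""
    speakers = sample_candidates(clips_by_speaker, k=k, rng=rng)
    return {
        "prompt": "Which clip is your voice?",
        "options": [{"speaker": s, "clip": clips_by_speaker[s]}
                    for s in speakers],
    }


# Hypothetical trimmed clips, one per arbitrary utterance identifier.
clips = {f"Participant {i}": f"clip-{i}.wav" for i in range(1, 6)}
question = build_pick_my_voice_question(clips)
```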

In another embodiment, the at least one processor (140) may output a user interface that plays a single trimmed audio fragment and receives the name of the corresponding actual speaker.

For example, an input interface can be provided that plays only one item, either the audio (230) of an utterance interval or an audio fragment trimmed from that audio (230), and receives the actual speaker of the played fragment.

In another embodiment, the at least one processor (140) may output a user interface that plays a plurality of audio fragments belonging to the same voice cluster as the user's own pre-stored voice and receives, for each fragment, an input indicating whether it is correct.

For example, when a voice fragment matched to the name of a pre-registered user exists, audio fragments highly similar to that voice may be extracted through voice clustering and played, and a user interface may be provided that receives an O or X input indicating whether each audio fragment corresponds to the user's own voice.

In other words, the question type of the survey UI/UX may be general matching, in which the speaker is selected by asking whose voice it is, by selecting one's own voice, or by confirming whether a voice is one's own.

Here, as one embodiment, the at least one processor (140) may randomly extract and provide audio fragments belonging to a chat room to which the user interface is connected. Rather than presenting the voice of a completely unknown third party, the user interface may ask the user to match the speakers of audio fragments drawn from the chat rooms in which the user has participated.

Meanwhile, as one embodiment, the at least one processor (140) may re-examine information entered in another question type of the survey UI/UX, or re-examine pre-matched audio fragments, to verify that the actual speaker is correct.

As one embodiment, the at least one processor (140) may output a user interface that displays an audio fragment lying within a preset distance of the voice embedding vector of a pre-matched actual speaker together with the predicted actual speaker, and receives an input indicating whether the prediction is correct.

Referring to FIG. 4, the pre-matched my-voice embedding vector (310) may be output together with audio fragments of voices similar to the my-voice embedding vector (310). An answer interface (320) for entering O or X may be placed beside each of the similar audio fragments so that the user can indicate whether each audio fragment matches the my-voice embedding vector (310).

Although FIG. 4 shows the my-voice embedding vector (310), the name previously matched to the my-voice embedding vector (310) may instead be output as the predicted name.
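The distance-based selection of fragments for O/X review can be sketched as follows. The disclosure does not name a distance metric, so cosine distance, the threshold value, and all identifiers here are assumptions made for illustration; embeddings are assumed to be non-zero vectors.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def fragments_for_review(speaker_embedding, fragment_embeddings, max_distance=0.2):
    """Return ids of fragments within `max_distance` of the pre-matched voice.

    `fragment_embeddings` maps a hypothetical fragment id to its embedding.
    Each returned fragment would be shown next to an O/X answer control.
    """
    return [
        fragment_id
        for fragment_id, embedding in fragment_embeddings.items()
        if cosine_distance(speaker_embedding, embedding) <= max_distance
    ]

my_voice = [1.0, 0.0]
fragments = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [1.0, 0.0]}
to_review = fragments_for_review(my_voice, fragments)
```

In this toy data, fragment "b" is orthogonal to the stored voice and is therefore left out of the review list.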

Alternatively, as another embodiment, the at least one processor (140) may output a user interface that randomly selects a plurality of pre-matched actual speakers and voices, shuffles and lists them, and lets the user match the corresponding pairs.

Rather than directly matching the predicted name of an actual speaker with an audio fragment of a similar voice, a UI may be provided for review purposes that shuffles the predicted names of a plurality of actual speakers and the audio fragments of their voices, lists them in two side-by-side columns, and asks the user to find the corresponding pairs. The UI may be implemented in various forms, such as an interface in which pairs are matched according to the order of selection, an interface in which the user draws lines directly, or an interface in which clicking causes the matched pair to be displayed separately.

For example, to re-examine previously matched names and voices, three audio fragments of voices may be randomly selected under the condition that their names do not overlap. The names and audio fragments may then be shuffled and listed side by side, and an interface provided for finding the matching pairs.
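A matching question of this kind might be assembled as in the sketch below. The data layout (a list of name and fragment-id pairs), the function names, and the sample names are hypothetical; the sketch only illustrates picking three non-overlapping names, shuffling both columns, and grading the user's pairs.

```python
import random

def build_matching_question(pre_matched, pair_count=3, rng=None):
    """Build shuffled name and fragment columns from pre-matched pairs.

    `pre_matched` is a list of (speaker_name, fragment_id) tuples.
    Raises ValueError if fewer than `pair_count` distinct names exist.
    """
    rng = rng or random.Random()
    by_name = {}
    for name, fragment_id in pre_matched:
        by_name.setdefault(name, []).append(fragment_id)
    # Sampling distinct names guarantees that names do not overlap.
    names = rng.sample(sorted(by_name), pair_count)
    answer_key = {name: rng.choice(by_name[name]) for name in names}
    left = list(answer_key)            # column of names
    right = list(answer_key.values())  # column of audio fragments
    rng.shuffle(left)
    rng.shuffle(right)
    return left, right, answer_key

def grade(answer_key, user_pairs):
    """Map each name the user paired to True/False."""
    return {name: answer_key.get(name) == frag for name, frag in user_pairs.items()}

pairs = [("Kim", "f1"), ("Kim", "f2"), ("Lee", "f3"), ("Park", "f4"), ("Choi", "f5")]
left, right, key = build_matching_question(pairs, rng=random.Random(1))
```

A wrong pairing reported by `grade` could feed the incorrect-answer rate that the feedback stage uses.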

Meanwhile, as one embodiment, the at least one processor (140) may provide a user interface for inspecting the separation quality of the utterance data, such as whether the number of utterance identifiers equals the number of actual speakers and whether utterance content mapped to different utterance identifiers actually comes from different speakers.

As one embodiment, the at least one processor (140) may extract the top-ranked audio fragments with the highest overlap count according to an overlapped speech detector (OSD), or the top-ranked audio fragments with the highest non-response rate or incorrect-answer rate in other question types, and output a user interface that receives the number of people appearing in the audio fragment.

For example, referring to FIG. 5, the question window asking 'How many speakers appear in this audio?' is an actual speaker selection interface (410), and the window for answering 'N people' is an answer interface (420).

Overlapping voices may be present in the output. The at least one processor (140) may use the overlapped speech detector (OSD) to extract the audio fragment with the highest overlap count, or the top-ranked audio fragments by overlap count, and play the audio fragment in the actual speaker selection interface (410).

Alternatively, an audio fragment with a high incorrect-answer rate or non-response rate in other question types of the survey UI/UX may be regarded as containing noise, overlapping voices, or similar-sounding voices. The audio fragment with the highest incorrect-answer or non-response rate, or the top-ranked fragments by those rates, may be extracted and played in the actual speaker selection interface (410).

The number of actual speakers in the played audio fragment may be entered in the answer interface (420).
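The two ways of choosing fragments for the head-count question can be sketched as follows. The record fields and function names are hypothetical, the overlap count is assumed to come from an OSD that is not modeled here, and taking the larger of the two rates as a difficulty score is an illustrative assumption.

```python
from dataclasses import dataclass

@dataclass
class FragmentStats:
    fragment_id: str
    overlap_count: int     # assumed output of an overlapped speech detector
    wrong_rate: float      # incorrect-answer rate in other question types
    no_answer_rate: float  # non-response rate in other question types

def top_by_overlap(stats, k=1):
    """Ids of the k fragments with the highest overlap count."""
    ranked = sorted(stats, key=lambda s: s.overlap_count, reverse=True)
    return [s.fragment_id for s in ranked[:k]]

def top_by_difficulty(stats, k=1):
    """Ids of the k fragments with the highest wrong or no-answer rate."""
    ranked = sorted(stats, key=lambda s: max(s.wrong_rate, s.no_answer_rate), reverse=True)
    return [s.fragment_id for s in ranked[:k]]

stats = [
    FragmentStats("f1", overlap_count=0, wrong_rate=0.10, no_answer_rate=0.70),
    FragmentStats("f2", overlap_count=4, wrong_rate=0.20, no_answer_rate=0.10),
    FragmentStats("f3", overlap_count=2, wrong_rate=0.55, no_answer_rate=0.05),
]
```

Either list could then be played in the selection interface (410) with a numeric answer collected in the answer interface (420).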

As one embodiment, the at least one processor (140) may randomly extract a plurality of audio fragments that have different utterance identifiers but are assigned the same actual speaker, concatenate them, and output a user interface that receives an answer as to whether the concatenated audio is the voice of one person.

Referring to FIG. 5, the question window asking 'Is this the voice of one person?' is an actual speaker selection interface (430), and the window for answering 'O/X' is an answer interface (440).

Three audio fragments that are separated by different utterance identifiers in the spoken text, yet have been assigned the same actual speaker name through the user interface, may be randomly selected. In other words, audio fragments that were separated as different speakers but assigned the same name may be extracted, concatenated, and played in the actual speaker selection interface (430).

The answer interface (440) may then receive O if one person's voice is heard and X if several people's voices are heard.
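The one-voice check might be prepared as in the sketch below. Audio is represented as plain lists of samples, and the dictionary keys, function name, and the rule of taking one fragment per utterance identifier are assumptions made for illustration.

```python
import random
from collections import defaultdict

def build_same_speaker_check(fragments, sample_count=3, rng=None):
    """Pick fragments with different identifiers but one assigned name.

    `fragments` is a list of dicts with hypothetical keys 'identifier',
    'speaker_name', and 'samples'. Returns (name, identifiers, joined
    samples), or None when no name spans two or more identifiers.
    """
    rng = rng or random.Random()
    by_name = defaultdict(dict)
    for frag in fragments:
        # Keep one fragment per identifier so the chosen ones always differ.
        by_name[frag["speaker_name"]].setdefault(frag["identifier"], frag)
    eligible = {n: list(d.values()) for n, d in by_name.items() if len(d) >= 2}
    if not eligible:
        return None
    name = rng.choice(sorted(eligible))
    pool = eligible[name]
    chosen = rng.sample(pool, min(sample_count, len(pool)))
    joined = [sample for frag in chosen for sample in frag["samples"]]
    return name, [frag["identifier"] for frag in chosen], joined

clips = [
    {"identifier": "Participant 1", "speaker_name": "Kim", "samples": [1, 2]},
    {"identifier": "Participant 3", "speaker_name": "Kim", "samples": [3]},
    {"identifier": "Participant 2", "speaker_name": "Lee", "samples": [9]},
]
check = build_same_speaker_check(clips, rng=random.Random(0))
```

An X answer on the joined clip would suggest that the shared name was assigned to more than one real voice.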

Through the CAPTCHA-format survey UI or UX described above, the arbitrarily assigned utterance identifiers are replaced with the actual speakers for display, which improves the readability of the utterance data.

After an input stage in which the actual speaker is received through the CAPTCHA-format survey UI or UX, feedback may be performed to adjust the CAPTCHA-format survey UI or UX.

As shown in FIG. 6, user responsiveness to each question type of the survey UI or UX may be collected based on the screen dwell time (51), the question non-response rate (52), and the question incorrect-answer rate (53) within the user interface.

For questions with poor user responsiveness, that is, questions of the survey UI or UX with a long screen dwell time (51), a high question non-response rate (52), and a high question incorrect-answer rate (53), a downward adjustment of the question appearance weight (510) may be performed. This favors survey UI or UX question types that users can answer more quickly, with a short screen dwell time (51), a low non-response rate (52), and a low incorrect-answer rate (53), thereby encouraging more user participation.
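The downward adjustment of question appearance weights can be sketched as follows. The limit values, the halving factor, the weight floor, and the question-type names are all assumptions; the disclosure states only that the weight is lowered in consideration of dwell time, non-response rate, and incorrect-answer rate.

```python
import random

def adjust_question_weights(weights, metrics, limits, factor=0.5, floor=0.05):
    """Lower the appearance weight of poorly received question types.

    A type counts as poorly received when at least one metric exceeds
    its limit. Returns a new dict and leaves `weights` untouched.
    """
    updated = dict(weights)
    for question_type, m in metrics.items():
        poorly_received = (
            m["dwell_seconds"] > limits["dwell_seconds"]
            or m["no_answer_rate"] > limits["no_answer_rate"]
            or m["wrong_rate"] > limits["wrong_rate"]
        )
        if poorly_received and question_type in updated:
            updated[question_type] = max(floor, updated[question_type] * factor)
    return updated

def pick_question_type(weights, rng=None):
    """Draw the next question type in proportion to its weight."""
    rng = rng or random.Random()
    types = list(weights)
    return rng.choices(types, weights=[weights[t] for t in types], k=1)[0]

weights = {"pick_my_voice": 1.0, "name_the_speaker": 1.0}
metrics = {
    "pick_my_voice": {"dwell_seconds": 5, "no_answer_rate": 0.1, "wrong_rate": 0.1},
    "name_the_speaker": {"dwell_seconds": 40, "no_answer_rate": 0.1, "wrong_rate": 0.1},
}
limits = {"dwell_seconds": 30, "no_answer_rate": 0.5, "wrong_rate": 0.5}
new_weights = adjust_question_weights(weights, metrics, limits)
```

The floor keeps a down-weighted question type from disappearing entirely, which is a design choice of this sketch rather than something the disclosure requires.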

In addition, feedback may be performed to adjust the probability of extracting an audio fragment according to the reliability of the embedding vectors.

As shown in FIG. 7, a plurality of embedding vectors for actual speakers may be plotted on a graph. Some embedding vector clusters may be sparse and widely spread, like cluster (71), while others may be tightly packed, like cluster (72). The density of cluster (71) and cluster (72) may be calculated as the reciprocal of the distribution entropy, and the calculated density may be regarded as the reliability of the cluster.
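One possible reading of "reciprocal of the distribution entropy" is sketched below. The disclosure does not define how the entropy is computed, so this example assumes a Shannon entropy over the occupancy of fixed-size grid cells; the cell size, the epsilon guard, and the function names are likewise assumptions.

```python
import math
from collections import Counter

def distribution_entropy(vectors, cell_size=0.5):
    """Shannon entropy (in nats) of how the vectors spread over grid cells."""
    cells = Counter(tuple(math.floor(x / cell_size) for x in v) for v in vectors)
    total = sum(cells.values())
    return -sum((c / total) * math.log(c / total) for c in cells.values())

def cluster_reliability(vectors, cell_size=0.5, eps=1e-9):
    """Density taken as the reciprocal of the entropy; eps avoids 1/0."""
    return 1.0 / (distribution_entropy(vectors, cell_size) + eps)

sparse_cluster = [[0, 0], [2, 2], [4, -3], [-5, 1]]                  # like cluster (71)
dense_cluster = [[0.1, 0.1], [0.2, 0.1], [0.1, 0.3], [0.6, 0.1]]     # like cluster (72)
```

Under this assumption a tightly packed cluster occupies few cells, has low entropy, and therefore receives a higher reliability than a widely spread one.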

Accordingly, the embedding information of an audio fragment may be extracted, its distance from the embedding cluster information calculated, and the results sorted from nearest to farthest.

The cluster list is traversed in random order, and when a cluster whose reliability is at or above a preset value appears, the audio fragment may be matched to that cluster and thereby pre-matched.

For a high-reliability audio fragment, no additional action is needed, so a downward weight may be applied to its extraction probability; for a low-reliability audio fragment, additional feedback is needed, so an upward weight may be applied to its extraction probability.
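The reliability-based weighting of extraction probabilities can be sketched as follows. The threshold and the two multipliers are illustrative assumptions, and the normalization into probabilities is one simple way to make the weights usable for random sampling.

```python
def extraction_probabilities(reliability_by_fragment, threshold=1.0, down=0.5, up=2.0):
    """Turn per-fragment reliability into normalized extraction probabilities.

    Fragments at or above `threshold` get the `down` weight (less review
    needed); the rest get the `up` weight (more feedback needed).
    """
    weights = {
        fragment_id: (down if reliability >= threshold else up)
        for fragment_id, reliability in reliability_by_fragment.items()
    }
    total = sum(weights.values())
    return {fragment_id: w / total for fragment_id, w in weights.items()}

probabilities = extraction_probabilities({"a": 2.0, "b": 0.5})
```

With these example values the low-reliability fragment "b" is four times as likely to be drawn as fragment "a".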

Accordingly, the electronic device (100) according to an embodiment of the present disclosure presents questions for confirming the actual speaker of a plurality of spoken texts on the user interface as a CAPTCHA-format survey UI or UX, providing an interface through which the user can easily match utterance identifiers with actual speakers.

In addition, distinguishing utterances by the actual speaker's name in the list improves readability, and having users label the data directly reduces data processing costs and strengthens branding through user feedback.

Meanwhile, FIG. 8 is a flowchart illustrating a processing method of an electronic device for performing an actual speaker display operation for spoken text according to an embodiment of the present disclosure.

As illustrated in FIG. 8, the processing method may include converting multi-speaker voice data into text (S810), providing a CAPTCHA-format survey UI/UX (S820), receiving a labeling input for each speaker through the user interface (S830), and changing the content and configuration of the survey UI/UX based on feedback through the user interface (S840).

In the step of converting multi-speaker voice data into text (S810), the multi-speaker voice data may be converted into text to obtain a plurality of spoken texts to which arbitrary utterance identifiers are matched.

In the step of providing a CAPTCHA-format survey UI/UX (S820), a question for confirming the actual speaker of the plurality of spoken texts may be provided on the user interface as a CAPTCHA-format survey UI or UX.

In the step of receiving a labeling input for each speaker through the user interface (S830), a label for the actual speaker may be received through the user interface and the arbitrary utterance identifier changed to the labeled actual speaker.
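The replacement of arbitrary identifiers with labeled names in step S830 can be sketched as follows. The transcript layout (a list of dicts with 'speaker' and 'text' keys) and the function name are assumptions for illustration.

```python
def apply_speaker_labels(utterances, labels):
    """Return a copy of the transcript with labeled identifiers renamed.

    `labels` maps an arbitrary utterance identifier to the actual speaker
    name received from the user; unlabeled identifiers are kept as-is.
    """
    return [
        {**utterance, "speaker": labels.get(utterance["speaker"], utterance["speaker"])}
        for utterance in utterances
    ]

transcript = [
    {"speaker": "Participant 1", "text": "Shall we start?"},
    {"speaker": "Participant 2", "text": "Yes, go ahead."},
    {"speaker": "Participant 1", "text": "First item."},
]
labeled = apply_speaker_labels(transcript, {"Participant 1": "Kim"})
```

Returning a new list keeps the original transcript available in case a label is later corrected.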

In the step of changing the content and configuration of the survey UI/UX based on feedback through the user interface (S840), the question types of the survey UI/UX and the randomly extracted audio fragments may be adjusted according to the feedback.
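The overall flow of steps S810 to S840 might be wired together as in the sketch below. The three callables stand in for the speech-to-text engine, the survey UI, and the feedback logic; their names and signatures are hypothetical and are not part of the disclosure.

```python
def run_labeling_round(voice_data, transcribe, ask_user, update_survey):
    """One pass through S810-S840 using injected, hypothetical callables."""
    utterances = transcribe(voice_data)                      # S810
    labels, feedback = {}, []
    for identifier in sorted({u["speaker"] for u in utterances}):
        answer = ask_user(identifier, utterances)            # S820
        feedback.append(answer)
        if answer.get("name"):                               # S830
            labels[identifier] = answer["name"]
    labeled = [
        {**u, "speaker": labels.get(u["speaker"], u["speaker"])}
        for u in utterances
    ]
    update_survey(feedback)                                  # S840
    return labeled

feedback_log = []
result = run_labeling_round(
    b"",
    transcribe=lambda _: [
        {"speaker": "Participant 1", "text": "hi"},
        {"speaker": "Participant 2", "text": "hello"},
    ],
    ask_user=lambda ident, utts: {"name": "Kim"} if ident == "Participant 1" else {},
    update_survey=feedback_log.append,
)
```

Injecting the callables keeps the sketch independent of any particular speech recognizer or UI framework.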

Content that overlaps with the description above is omitted for brevity.

Meanwhile, the disclosed embodiments may be implemented in the form of a recording medium storing computer-executable instructions. The instructions may be stored in the form of program code and, when executed by a processor, may generate program modules to perform the operations of the disclosed embodiments. The recording medium may be implemented as a computer-readable recording medium.

Computer-readable recording media include all types of recording media that store instructions readable by a computer, for example ROM (Read Only Memory), RAM (Random Access Memory), magnetic tape, magnetic disks, flash memory, and optical data storage devices.

The disclosed embodiments have been described above with reference to the accompanying drawings. A person of ordinary skill in the art to which the present disclosure pertains will understand that the disclosure may be practiced in forms other than the disclosed embodiments without changing its technical idea or essential features. The disclosed embodiments are illustrative and should not be construed as limiting.

100: Electronic device
110: Input/output module
120: Communication module
130: Memory
140: Processor
210: Utterance identifier
220: Utterance content
230: Audio player
240: Selection interface
310: My-voice embedding vector
320: Answer interface
410, 430: Actual speaker selection interface
420, 440: Answer interface

Claims

An electronic device for performing an actual speaker display operation for spoken text, comprising:
an input/output module that receives user input or provides output to the user based on a user interface;
a memory storing at least one process for performing the actual speaker display operation for the spoken text; and
at least one processor that performs the actual speaker display operation for the spoken text according to the process,
wherein the at least one processor is configured to:
convert multi-speaker voice data into text to obtain a plurality of spoken texts to which arbitrary utterance identifiers are matched;
provide a question for confirming the actual speaker of the plurality of spoken texts on the user interface as a survey UI or UX in a CAPTCHA format;
receive a label for the actual speaker through the user interface and change the arbitrary utterance identifier to the labeled actual speaker;
change the content and configuration of the survey UI or UX based on feedback through the user interface;
adjust an appearance weight downward, so as to lower the appearance probability of a question, in consideration of at least one of a dwell time, a non-response rate, and an incorrect-answer rate measured in the user interface for each type of question confirming the actual speaker; and
collect audio fragments with a high incorrect-answer rate as retraining data.

(Deleted)

The electronic device of claim 1,
wherein the plurality of spoken texts include an utterance identifier, utterance content, an utterance duration segment, and audio of the utterance duration segment, and
wherein the at least one processor is configured to randomly extract, from the audio of the utterance duration segment, at least one audio fragment matching the type of the question and trim it to within a preset time interval.

The electronic device of claim 3,
wherein the at least one processor is configured to perform at least one of:
outputting a user interface that plays a single trimmed audio fragment and receives the name of the corresponding actual speaker;
outputting a user interface that plays a plurality of the trimmed audio fragments and lets the user select the audio fragment corresponding to the user's own voice; and
outputting a user interface that plays a plurality of the audio fragments belonging to the same voice cluster as a pre-stored voice of the user and receives, for each audio fragment, an input indicating whether it is correct.

The electronic device of claim 4,
wherein the at least one processor is configured to randomly extract and provide audio fragments belonging to a chat room to which the user interface is connected.

The electronic device of claim 3,
wherein the at least one processor is configured to output a user interface that displays an audio fragment lying within a preset distance of the voice embedding vector of a pre-matched actual speaker together with the predicted actual speaker, and receives an input indicating whether the prediction is correct.

The electronic device of claim 3,
wherein the at least one processor is configured to output a user interface that randomly selects a plurality of pre-matched actual speakers and voices, shuffles and lists them, and lets the user match the corresponding pairs.

The electronic device of claim 3,
wherein the at least one processor is configured to:
extract the top-ranked audio fragments with the highest overlap count according to an overlapped speech detector (OSD), or the top-ranked audio fragments with the highest non-response rate or incorrect-answer rate in other question types, and output a user interface that receives the number of people appearing in the audio fragment; or
randomly extract a plurality of audio fragments that have different utterance identifiers but are assigned the same actual speaker, concatenate them, and output a user interface that receives an answer as to whether the concatenated audio is the voice of one person.

The electronic device of claim 8,
wherein the at least one processor is configured to adjust downward the weight with which audio fragments of the actual speaker are extracted when the reciprocal of the distribution entropy within the embedding cluster containing the voice embedding vector of the pre-matched actual speaker is greater than a preset threshold.

A method for processing actual speaker display for spoken text, performed by a computing device including a user interface-based input/output module, a memory, and at least one processor performing an actual speaker display operation for spoken text, the method comprising:
converting multi-speaker voice data into text to obtain a plurality of spoken texts to which arbitrary utterance identifiers are matched;
providing a question for confirming the actual speaker of the plurality of spoken texts on a user interface as a survey UI or UX in a CAPTCHA format;
receiving a label for the actual speaker through the user interface and changing the arbitrary utterance identifier to the labeled actual speaker; and
changing the content and configuration of the survey UI or UX based on feedback through the user interface,
the method further comprising:
adjusting an appearance weight downward, so as to lower the appearance probability of a question, in consideration of at least one of a dwell time, a non-response rate, and an incorrect-answer rate measured in the user interface for each type of question confirming the actual speaker; and
collecting audio fragments with a high incorrect-answer rate as retraining data.