KR102800075B1

KR102800075B1 - Multilingual AI DOC System Building Method for Hindi English And Korean

Info

Publication number: KR102800075B1
Application number: KR1020240172918A
Authority: KR
Inventors: 유승재
Original assignee: (주)페르소나에이아이
Priority date: 2024-11-27
Filing date: 2024-11-27
Publication date: 2025-04-29
Anticipated expiration: 2044-11-27

Abstract

The present invention discloses a method for constructing a multilingual AI DOC system covering Hindi, English, and Korean, and a method for operating the same. The system constructs Hindi- and English-based RAG data, trains an LLM through fine-tuning, and designs a natural-language search question-answering system with the LangChain framework. Through this RAG-based question-answering system, commercial documents of public institutions and industry are vectorized so that, when a document is uploaded, the system can summarize its main contents, answer questions about related contents, and search for specific contents, thereby providing user-customized information.
The method for constructing a multilingual AI DOC system including Hindi, English, and Korean of the present invention comprises: a first step in which Hindi-based, English-based, and Korean-based training data are collected and processed to train an sLLM (smaller Large Language Model), an on-premise language model that understands Hindi, English, and Korean; a second step in which a Hindi training dataset is constructed from Hindi commercial documents of public institutions and industry, and the sLLM of the first step is built on a RAG (Retrieval-Augmented Generation) framework in which the Hindi training dataset is vector-embedded and stored in a vector database; and a third step in which the sLLM of the second step is fine-tuned.

Description

{Multilingual AI DOC System Building Method for Hindi English And Korean}

The present invention relates to a method for constructing a multilingual AI DOC system including Hindi, English, and Korean, and a method for operating the same. More specifically, it relates to a system that constructs Hindi- and English-based RAG data, trains an LLM through fine-tuning, designs a natural-language search question-answering system with the LangChain framework, and vectorizes commercial documents of public institutions and industry through the RAG-based question-answering system so that, when a document is uploaded, it can summarize the main contents, answer questions about related contents, and search for specific contents, thereby providing user-customized information.

India, with the world's largest population and annual GDP growth of around 7%, has recently been regarded as a promising next-generation market that views the entry of Korean small and medium-sized enterprises favorably and is building an environment for it. The Indian market is vast, with 650 million Hindi speakers and 1.3 billion English speakers. About 50% of India's total population uses Hindi, but because administration is conducted in the official languages, Hindi and English, Indians who do not use Hindi have difficulty with government administrative procedures.

To overcome this problem, more research is needed on multilingual AI that can reduce administrative inconvenience for Indians who use local languages and, through standardization, increase the efficiency of the Indian market and economy.

Republic of Korea Registered Patent Publication No. 10-2431383
Republic of Korea Registered Patent Publication No. 10-2384641

The present invention is intended to solve the above problems. Its object is to provide a method for constructing a multilingual AI DOC system including Hindi, English, and Korean, and a method for operating the same, in which Hindi- and English-based RAG data are constructed, an LLM is trained through fine-tuning, a natural-language search question-answering system is designed with the LangChain framework, and commercial documents of public institutions and industry are vectorized through the RAG-based question-answering system so that, when a document is uploaded, the main contents are summarized, questions about related contents are answered, and specific contents can be searched, thereby providing user-customized information.

To achieve the above object, the method for constructing a multilingual AI DOC system including Hindi, English, and Korean of the present invention comprises: a first step in which Hindi-based, English-based, and Korean-based training data are collected and processed to train an sLLM (smaller Large Language Model), an on-premise language model that understands Hindi, English, and Korean; a second step in which a Hindi training dataset is constructed from Hindi commercial documents of public institutions and industry, and the sLLM of the first step is built on a RAG (Retrieval-Augmented Generation) framework in which the Hindi training dataset is vector-embedded and stored in a vector database; and a third step in which the sLLM of the second step is fine-tuned.

The second step includes: step 2-1, in which a Hindi document QA dataset is constructed in a query-context-answer structure from Hindi commercial documents of public institutions and industry; step 2-2, in which a search model is trained using the queries and contexts and is then evaluated; step 2-3, in which the search model is retrained with hard negatives if the evaluation of step 2-2 falls short of the target performance; step 2-4, in which the contexts are vectorized using the search model trained in step 2-2 or retrained in step 2-3, and a vector database is built; step 2-5, in which a dataset for training a generation model is constructed using the vector database of step 2-4, and the generation model is trained and evaluated; step 2-6, in which, if the evaluation of step 2-5 falls short of the target performance, expert feedback data is collected and the model is trained as a conversational language model reflecting the preferences of Hindi users and experts; step 2-7, in which a RAG framework is built and evaluated based on the search model trained in step 2-2 or retrained in step 2-3 and the generation model trained in step 2-5 or retrained in step 2-6; and step 2-8, in which, if the evaluation of step 2-7 falls short of the target performance, error cases are analyzed and additional training is performed.
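The hard-negative retraining of step 2-3 can be illustrated with a minimal sketch. The text does not say how hard negatives are selected, so the approach below is an assumption: for each query, the non-answer contexts that the current search model scores highest are taken as hard negatives. The function name `mine_hard_negatives`, the toy two-dimensional embeddings, and the use of an inner product on roughly unit-length vectors are all hypothetical.

```python
def dot(a, b):
    """Inner product; used as the similarity score (assumes comparable vector norms)."""
    return sum(x * y for x, y in zip(a, b))


def mine_hard_negatives(query_vec, context_vecs, positive_id, n=2):
    """Return the ids of the n non-positive contexts most similar to the query.

    These are the contexts the search model is most likely to confuse with the
    correct one, so they are informative negatives for retraining.
    """
    scored = [
        (dot(query_vec, vec), cid)
        for cid, vec in context_vecs.items()
        if cid != positive_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:n]]


# Toy embeddings (hypothetical): c1 is the gold context for the query.
contexts = {
    "c1": [1.0, 0.0],
    "c2": [0.9, 0.1],
    "c3": [0.0, 1.0],
    "c4": [0.7, 0.7],
}
hard = mine_hard_negatives([1.0, 0.0], contexts, positive_id="c1", n=2)
```

In this toy case `c2` and `c4` are selected because they lie closest to the query without being the gold context; `c3` is an easy negative and is left out.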

The commercial documents of public institutions and industry in step 2-1 are selected from any one of a first field (environment), a second field (education), and a third field (energy). When the commercial documents are selected from the first field, the second and third fields are excluded. If a selected commercial document contains both the first field and the second field, and the first field accounts for less than 80% of that document, the document is excluded from the Hindi document QA dataset. In other words, a document is adopted into the dataset when the first field makes up a large share of the whole document and is not adopted when that share is small, which raises the terminological specialization of the dataset.
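The 80% rule above can be sketched as a simple filter. The text does not state how a field's share of a document is measured, so the sketch assumes each document already carries per-field amounts (for example, counts of paragraphs labeled by field); the names `primary_field_share` and `keep_document` and the dictionary layout are hypothetical.

```python
def primary_field_share(field_amounts, primary):
    """Fraction of the document that belongs to the primary field."""
    total = sum(field_amounts.values())
    if total == 0:
        return 0.0
    return field_amounts.get(primary, 0) / total


def keep_document(field_amounts, primary="environment", threshold=0.80):
    """Adopt a document only if the primary field covers at least `threshold` of it.

    Documents where the primary field is below 80% are excluded, as described
    in the text above.
    """
    return primary_field_share(field_amounts, primary) >= threshold


# Hypothetical paragraph counts per field for three documents.
docs = {
    "doc_a": {"environment": 85, "education": 15},
    "doc_b": {"environment": 60, "education": 40},
    "doc_c": {"environment": 80, "education": 20},
}
adopted = [name for name, amounts in docs.items() if keep_document(amounts)]
```

Here `doc_b` is dropped because only 60% of it concerns the environment field, while `doc_c` sits exactly on the threshold and is kept.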

The evaluations in steps 2-2, 2-5, and 2-7 are performed by measuring the input data recognition rate, the malfunction rate, the response speed, and the answer accuracy.

The malfunction rate is expressed by Formula 1 below.

(Formula 1)

Malfunction rate = Σ((number of malfunctions / number of questions) / number of repetitions) × 100

The response speed is expressed by Formula 2 below.

(Formula 2)

Response speed = Σ(time at which the answer is output - time at which the question is input) / number of questions
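Formulas 1 and 2 can be computed as in the following minimal sketch. It assumes that the sum in Formula 1 runs over repeated test runs and that the sum in Formula 2 runs over individual questions; the text does not spell this out. The function names and the input layout (a list of per-run malfunction counts, a list of input/output timestamp pairs in seconds) are hypothetical.

```python
def malfunction_rate(malfunctions_per_run, num_questions):
    """Formula 1: average per-run malfunction ratio, as a percentage."""
    runs = len(malfunctions_per_run)
    return sum((m / num_questions) / runs for m in malfunctions_per_run) * 100


def response_speed(timestamps):
    """Formula 2: mean seconds between question input and answer output.

    `timestamps` is a list of (question_input_time, answer_output_time) pairs.
    """
    return sum(out - inp for inp, out in timestamps) / len(timestamps)


# Two repeated runs over 100 questions with 2 and 4 malfunctions respectively.
rate = malfunction_rate([2, 4], num_questions=100)

# Two questions answered after 2 s and 3 s.
speed = response_speed([(0.0, 2.0), (10.0, 13.0)])
```

With these inputs the malfunction rate is about 3% and the response speed is 2.5 seconds.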

Here, the evaluation criteria are an input data recognition rate of 95% or higher, a malfunction rate of 5% or lower, a response speed of 3 seconds or less, and an answer accuracy of 88% or lower. A result that satisfies the criteria is evaluated as good; if the criteria are not satisfied, retraining is performed.

The evaluation criteria vary with the number of questions: a first evaluation criterion applies when the number of questions is 50 or more, and a second evaluation criterion applies when the number of questions is less than 50. The first evaluation criterion is an input data recognition rate of 93% or higher, a malfunction rate of 7% or lower, a response speed of 5 seconds or less, and an answer accuracy of 85% or lower. The second evaluation criterion is an input data recognition rate of 95% or higher, a malfunction rate of 5% or lower, a response speed of 3 seconds or less, and an answer accuracy of 88% or lower. In either case, a result that satisfies the criterion is evaluated as good, and retraining is performed if it does not. Applying the stricter criterion when there are fewer than 50 questions reduces errors caused by the variance of a small sample.
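The switch between the two evaluation criteria can be sketched as follows. One point is an assumption and should be read with care: the text states the answer accuracy thresholds as "or lower", whereas this sketch treats accuracy as a minimum (higher is better), in line with the other metrics. All class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    min_recognition: float   # input data recognition rate, percent
    max_malfunction: float   # malfunction rate, percent
    max_latency_s: float     # response speed, seconds
    accuracy_target: float   # answer accuracy, percent (ASSUMED to be a minimum)


FIRST = Criteria(93.0, 7.0, 5.0, 85.0)    # 50 or more questions
SECOND = Criteria(95.0, 5.0, 3.0, 88.0)   # fewer than 50 questions (stricter)


def select_criteria(num_questions):
    """Pick the criterion set according to the number of questions."""
    return FIRST if num_questions >= 50 else SECOND


def evaluate(recognition, malfunction, latency_s, accuracy, num_questions):
    """Return 'good' if every threshold is met, otherwise 'retrain'."""
    c = select_criteria(num_questions)
    ok = (
        recognition >= c.min_recognition
        and malfunction <= c.max_malfunction
        and latency_s <= c.max_latency_s
        and accuracy >= c.accuracy_target  # assumption: accuracy is a lower bound
    )
    return "good" if ok else "retrain"


large_sample = evaluate(94.0, 6.0, 4.0, 86.0, num_questions=60)
small_sample = evaluate(94.0, 6.0, 4.0, 86.0, num_questions=40)
```

The same measurements pass with 60 questions but trigger retraining with 40, which reflects the stricter treatment of small samples described above.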

The third step includes: step 3-1, in which the query of a question is input into the search model and a question embedding and document embeddings are computed; step 3-2, in which relevance is computed with Top-K using the question embedding and the document embeddings, and related documents are retrieved; and step 3-3, in which the RAG-framework-based sLLM is fine-tuned so that an appropriate answer and summary are generated when the question and the related documents are input into the generation model.
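The Top-K retrieval of step 3-2 can be sketched as below. The text does not name a similarity measure, so cosine similarity is an assumption, as are the function names and the toy two-dimensional embeddings; a real search model would produce much higher-dimensional vectors.

```python
import heapq
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(question_vec, doc_vecs, k=3):
    """Return the ids of the k documents most similar to the question."""
    scored = ((cosine(question_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


# Toy document embeddings (hypothetical).
doc_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
related = top_k([1.0, 0.2], doc_vecs, k=2)
```

The retrieved ids would then be resolved to their text and passed, together with the question, to the generation model in step 3-3.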

In addition, to achieve the above object, the method for operating a multilingual AI DOC system including Hindi, English, and Korean of the present invention comprises: a first step in which a question is input and a document file to be queried is uploaded; a second step in which the question and the text of the document file are chunked and vectorized to compute a question embedding and document vector embeddings; a third step in which the question embedding and the document embeddings are input into the search model to retrieve related documents; a fourth step in which the question and the related documents are input into the generation model to generate an appropriate answer and summary; and a fifth step in which the output of the generation model is input into the sLLM to produce the final result. Any one, or two or more, of the first to fifth steps use artificial intelligence and the cloud by means of an AICC (AI Contact Center).
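The chunking in the second step of the operating method can be sketched as follows. The text does not specify a chunking strategy, so fixed-size character windows with overlap are an assumption, and `chunk_text` with its `size` and `overlap` parameters is a hypothetical helper. For Hindi text, a production system would more likely split on sentence or grapheme boundaries so that Devanagari combining marks are not separated from their base characters.

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into overlapping character windows.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


pieces = chunk_text("abcdefghij", size=4, overlap=2)
```

Each chunk would then be embedded and compared against the question embedding in the third step.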

As described above, the present invention increases the efficiency of work processes through AI-based knowledge retrieval, document summarization, and drafting.

In addition, the present invention uses artificial intelligence to improve work productivity in the public and private sectors, strengthen knowledge sharing and collaboration, and support better decision-making and service delivery, thereby improving nationwide satisfaction and quality of life.

FIG. 1 is a conceptual diagram of a multilingual AI DOC system including Hindi, English, and Korean of the present invention.
FIG. 2 is a flowchart of a method for constructing a multilingual AI DOC system including Hindi, English, and Korean of the present invention.
FIG. 3 is a structural diagram of the RAG framework of the present invention.
FIG. 4 is a detailed flowchart of the second step of the method for constructing a multilingual AI DOC system including Hindi, English, and Korean of the present invention.
FIG. 5 is a detailed flowchart of the third step of the method for constructing a multilingual AI DOC system including Hindi, English, and Korean of the present invention.
FIG. 6 is a flowchart of a method for operating a multilingual AI DOC system including Hindi, English, and Korean of the present invention.

Expressions such as "includes" or "may include" that may be used in various embodiments of the present disclosure indicate the presence of the disclosed function, operation, or component, and do not limit one or more additional functions, operations, or components. In addition, in various embodiments of the present disclosure, terms such as "include" or "have" are intended to specify the presence of a feature, number, step, operation, component, part, or combination thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In various embodiments of the present disclosure, expressions such as "or" include any and all combinations of the words listed together. For example, "A or B" may include A, may include B, or may include both A and B.

Expressions such as "1st," "2nd," "first," or "second" used in various embodiments of the present disclosure may modify various components of the embodiments but do not limit those components. For example, the expressions do not limit the order and/or importance of the components. The expressions may be used to distinguish one component from another. For example, a first user device and a second user device are both user devices and denote different user devices. For example, without departing from the scope of the various embodiments of the present disclosure, a first component may be called a second component, and similarly a second component may be called a first component.

When a component is said to be "connected" or "coupled" to another component, it should be understood that the component may be directly connected or coupled to the other component, but that another component may also exist between them. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no other component exists between them.

In the embodiments of the present disclosure, terms such as "module," "unit," and "part" refer to components that perform at least one function or operation, and such components may be implemented in hardware, in software, or in a combination of hardware and software. In addition, a plurality of "modules," "units," "parts," and the like may be integrated into at least one module or chip and implemented by at least one processor, except where each needs to be implemented as separate, specific hardware.

The terms used in the various embodiments of the present disclosure are used only to describe particular embodiments and are not intended to limit the various embodiments of the present disclosure. A singular expression includes the plural unless the context clearly indicates otherwise.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the various embodiments of the present disclosure belong.

Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless explicitly so defined in the various embodiments of the present disclosure.

Hereinafter, the multilingual AI DOC system (10) including Hindi, English, and Korean of the present invention is described with reference to the drawings. FIG. 1 is a conceptual diagram of the multilingual AI DOC system (10) including Hindi, English, and Korean of the present invention. Referring to FIG. 1, the multilingual AI DOC system (10) according to an embodiment of the present invention may include a RAG (Retrieval-Augmented Generation) module and an sLLM (smaller Large Language Model) module. The system (10) may include one or more processors, one or more memories, one or more storages, and one or more communication interfaces, which may be connected to one another through a bus. The system (10) may also include hardware such as an input device and an output device, and may be loaded with various software, including an operating system capable of running programs. For a Hindi document (DOC, Document) entered for a question, the system (10) provides an explanation of the document, guidance on how to fill it out, a summary, and translation into English and Korean.

The RAG module (100) collects Hindi document data, vector-embeds it, and stores it. The RAG module (100) includes a document collection unit that collects documents written in Hindi, a vector embedding unit that chunks the collected documents and embeds the text as vectors, a vector database in which the vector-embedded data is stored, an ETL unit that extracts, transforms, and loads the internal information of a public institution or company, and an internal data storage unit in which the loaded internal information is stored.

The sLLM module (200) is trained on Hindi-based, English-based, and Korean-based training data so that it understands Hindi, English, and Korean. The sLLM module (200) includes a Hindi learning unit that learns the Hindi-based training data, an English learning unit that learns the English-based training data, a Korean learning unit that learns the Korean-based training data, and an answer generation unit that generates the final answer to a question from the data produced by the generation model of the RAG module (100).

Hereinafter, the method for constructing the multilingual AI DOC system (10) including Hindi, English, and Korean of the present invention is described with reference to the drawings. FIG. 2 is a flowchart of the construction method, and FIG. 3 is a structural diagram of the RAG framework of the present invention. Referring to FIG. 2, the construction method may comprise the following three steps. In summary, the first step (S10) trains the sLLM on basic language training data so that it can understand Hindi, English, and Korean. The second step (S20) supplements the sLLM of the first step (S10) on the basis of the RAG framework by vector-embedding a Hindi training dataset built from Hindi commercial documents of public institutions and industry. An sLLM that has learned this Hindi training dataset through the RAG framework can be regarded as an sLLM specialized for Hindi documents. The third step (S30) raises the accuracy of the sLLM of the second step (S20), which is built on the RAG framework, through fine-tuning that evaluates the results of the RAG search model and generation model and adjusts the parameters accordingly.

The first step (S10) is a step in which Hindi-based, English-based, and Korean-based training data are collected and processed to train an sLLM (smaller Large Language Model), an on-premise language model that understands Hindi, English, and Korean. Here, an sLLM performs essentially the same functions as an LLM, but the model is relatively small compared with an LLM: it is a language model whose parameter count is reduced and whose accuracy can be improved through fine-tuning. Because the data scale of an sLLM is relatively small, it is generally preferable for security reasons to install it on-premise, that is, in-house. Operating the sLLM on-premise has the advantage of protecting it from external intrusion. The present invention requires a large language model that understands Hindi, English, and Korean. To this end, Hindi-based, English-based, and Korean-based training data are collected and processed to train the large language model.

In the present invention, the sLLM can be built with the LangChain framework. LangChain is an open-source framework that supports the development of applications using generative AI language models. LangChain consists of the Model I/O, Retrieval, Chains, Agents, Memory, and Callbacks modules, and allows various language processing tasks to be performed with an LLM. LangChain supports various LLMs, such as those from OpenAI and Hugging Face, and users can select a model that suits their needs. The core concept of LangChain is chaining, which links the execution of LLM prompts with the execution of external sources, so that an LLM can be used for tasks such as translation, summarization, question answering, text generation, and natural language inference. First, as a translation service, translation models can be deployed and managed, and models supporting various languages are provided. Second, as a summarization service, summarization models can be deployed and managed, and models for various topics are provided. Third, as a question-answering service, question-answering models can be deployed and managed, and models for various fields are provided. Fourth, as a text generation service, text generation models can be deployed and managed, and models that generate text in various formats are provided. Fifth, as a natural language inference service, natural language inference models can be deployed and managed, and models that perform various natural language inference tasks are provided.

Step 2 (S20) is the step in which a Hindi training dataset is constructed from commercial documents of public institutions and industry written in Hindi, and the sLLM of Step 1 (S10) is built on a RAG (Retrieval-Augmented Generation) framework in which the Hindi training dataset is vector-embedded and stored in a vector database. Because the present invention implements summarization of main content, Q&A on related content, search for specific content, and similar functions when a Hindi document is uploaded, a Hindi training dataset is constructed from commercial documents of public institutions and industry written in Hindi, and this dataset is vectorized and embedded to build an sLLM based on the RAG framework. FIG. 4 is a detailed flowchart of the second step of the method for constructing the multilingual AI DOC system (10) including Hindi, English, and Korean of the present invention. Referring to FIG. 4, the second step (S20) may be configured to include the following eight steps. In the present invention, the RAG framework is used to compensate for the shortcomings of an LLM. A RAG model performs text generation: it retrieves information from given source data and uses that information to generate the desired text. To prepare data for RAG, the original data is divided into small pieces called chunks, the text is converted into numeric vectors through embedding, and the vectors are stored in a vector database. Here, a vector database is a database that can store, index, and search data such as text and images in vector form. Using algorithms based on hashing, quantization, or graph search, it supports approximate nearest-neighbor search for similar items, so it operates efficiently and at scale on large datasets of high-dimensional vectors.
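
The chunk, embed, and store sequence described above can be sketched in a few lines of Python. This is a minimal illustration only: the chunk sizes, the hashed bag-of-words `embed` function, and the `InMemoryVectorStore` class are assumptions made for the example and stand in for a trained embedding model and a real vector database, neither of which the text specifies.

```python
import hashlib
import math


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (sizes are hypothetical)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks, start = [], 0
    while start < len(text):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


def embed(text: str, dim: int = 64) -> list[float]:
    """Toy embedding (assumption): hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryVectorStore:
    """Stand-in for a vector database: brute-force cosine search."""

    def __init__(self) -> None:
        self.rows: list[tuple[list[float], str]] = []

    def add(self, chunks: list[str]) -> None:
        for chunk in chunks:
            self.rows.append((embed(chunk), chunk))

    def search(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        q = embed(query)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [(sum(a * b for a, b in zip(q, v)), c) for v, c in self.rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

A production vector database would replace the brute-force scan with an approximate nearest-neighbor index of the kind the paragraph mentions (hashing, quantization, or graph-based).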

Step 2-1 (S21) is the step of constructing a Hindi document QA dataset with a query, context, and answer structure. The RAG model is built from documents in Hindi. To this end, a Hindi training dataset is constructed from commercial documents of public institutions and industry written in Hindi, and it is organized into a Hindi document QA dataset with the structure of query, context, and answer.
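
One way to represent the query, context, and answer structure is a small record type serialized as JSON Lines. The sketch below is an assumption about storage format, not something the text prescribes; the class name `QARecord` and the sample Hindi strings are made up for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class QARecord:
    """One training example: a question, the passage it refers to, and the answer."""
    query: str
    context: str
    answer: str


def to_jsonl(records: list[QARecord]) -> str:
    # ensure_ascii=False keeps Devanagari text readable in the output file.
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


def from_jsonl(text: str) -> list[QARecord]:
    return [QARecord(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Hypothetical sample record (not taken from any real dataset).
sample = QARecord(
    query="रिपोर्ट का विषय क्या है?",
    context="यह रिपोर्ट वार्षिक बजट के बारे में है।",
    answer="वार्षिक बजट",
)
```

The query and context fields feed search-model training in Step 2-2, while all three fields are needed for the generative model in Step 2-5.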

Step 2-2 (S22) is the step in which the search model is trained and evaluated using the queries and contexts. The RAG framework uses a search model and a generative model. The search model embeds questions and documents and is trained so that the embeddings of documents related to a question lie close to it and the embeddings of unrelated documents lie far from it; documents related to a question are then retrieved from the vector database on the basis of the search model. The generative model is a decoder-based or encoder-decoder-based language model that generates text fitting the given context; starting from a pre-trained generative model, it is further trained on domain-specific data to generate an appropriate answer to the question. In Step 2-2 (S22), the search model is trained using the queries and contexts of the Hindi training dataset, and it is evaluated to check whether training succeeded. The evaluations in Steps 2-2 (S22), 2-5 (S25), and 2-7 (S27) use the input data recognition rate, malfunction rate, response speed, and answer accuracy. The input data recognition rate evaluates how accurately the input is recognized. The malfunction rate counts errors that occur when questions are asked; errors include hallucination, failure to mask sensitive information, no answer, and unresponsiveness. The response speed measures how quickly an answer is output after a question is entered. The answer accuracy is checked by asking the questions in a fixed question/answer set. For the input data recognition rate, questions are entered in a conversation test, the output intent is checked, and the result is compared with the correct value to compute Precision and Recall; a list of 25 questions is run 4 times each, for a total of 100 trials, to compute the average response rate, and the results are used to check whether the intent inference rate is 90% or higher. For the malfunction rate, questions are entered and each response is checked for malfunction; the list of 25 questions is run 4 times each, for a total of 100 trials. The malfunction criteria are hallucination, failure to mask sensitive information, forced termination, unresponsiveness, and the like. The malfunction rate is calculated by Formula 1 below, and it is checked whether the malfunction rate is 5% or less.
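
The Precision and Recall comparison of output intents against correct values can be sketched as follows. The per-label definition used here and the intent names in the example are assumptions; the text does not say whether the scores are computed per intent or averaged.

```python
def precision_recall(predicted: list[str], expected: list[str], label: str) -> tuple[float, float]:
    """Precision and recall for one intent label over paired trial results."""
    pairs = list(zip(predicted, expected))
    tp = sum(1 for p, e in pairs if p == label and e == label)
    fp = sum(1 for p, e in pairs if p == label and e != label)
    fn = sum(1 for p, e in pairs if p != label and e == label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def recognition_rate(predicted: list[str], expected: list[str]) -> float:
    """Share of trials (in percent) whose output intent matches the correct value."""
    if not expected:
        raise ValueError("no trials to score")
    hits = sum(1 for p, e in zip(predicted, expected) if p == e)
    return 100.0 * hits / len(expected)
```

In the procedure described above, `predicted` and `expected` would each hold 100 entries (25 questions, 4 runs each).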

(Formula 1)
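
The formula itself is not reproduced in this text. A plausible reconstruction from the surrounding description, offered as an assumption rather than the patent's exact expression, is shown below; the symbols $N_{\text{malfunction}}$ (trials showing hallucination, masking failure, forced termination, or unresponsiveness) and $N_{\text{total}}$ (all trials, here $25 \times 4 = 100$) are introduced only for illustration.

```latex
\text{Malfunction rate (\%)} = \frac{N_{\text{malfunction}}}{N_{\text{total}}} \times 100,
\qquad N_{\text{total}} = 25 \times 4 = 100
```

Under this reading, the 5% criterion corresponds to at most 5 malfunctioning trials out of 100.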

The response speed is verified from the source code, which contains code that outputs the time at which a question is entered and code that outputs the time at which the answer is output. A question is entered, and the elapsed time from question input to answer output is calculated from the resulting log. The list of 25 questions is run 4 times each, for a total of 100 trials, to compute the average response time. The response speed is calculated by Formula 2 below, and it is checked whether the response speed is 3 seconds or less.
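
The timestamp-based measurement described above can be sketched as a small harness. The function name `average_response_time` and the `ask` callable are hypothetical; `ask` stands for whatever call submits a question to the system and returns its answer.

```python
import time
from statistics import mean
from typing import Callable


def average_response_time(
    ask: Callable[[str], str], questions: list[str], repeats: int = 4
) -> tuple[float, int]:
    """Run every question `repeats` times and return (mean seconds, trial count)."""
    durations = []
    for _ in range(repeats):
        for question in questions:
            started = time.perf_counter()  # time at question input
            ask(question)
            durations.append(time.perf_counter() - started)  # time at answer output
    return mean(durations), len(durations)
```

With a list of 25 questions and the default `repeats=4`, the harness performs the 100 trials described in the text, and the returned mean can be compared against the 3-second criterion.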

(Formula 2)
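
The formula itself is not reproduced in this text. A plausible reconstruction from the description of the timing log, offered as an assumption rather than the patent's exact expression, averages the elapsed time over all trials; $t_i^{\text{question}}$ and $t_i^{\text{answer}}$ denote the logged input and output times of trial $i$, and $N = 100$ is the number of trials.

```latex
\bar{T} = \frac{1}{N} \sum_{i=1}^{N} \left( t_i^{\text{answer}} - t_i^{\text{question}} \right),
\qquad N = 25 \times 4 = 100
```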

The answer accuracy is verified from the question content and the answer to the question: a Benchmark LLM Evaluation is performed, in which an LLM is used to cross-check whether the intent of the generated sentence matches. That is, a question is entered into the provided solution, and GPT-4 checks whether the output matches the reference answer to the question, which yields the answer accuracy. This is repeated 4 times for each of 25 sample questions to compute the average accuracy, and it is checked whether the answer accuracy is 88% or higher. The evaluation criteria are an input data recognition rate of 95% or higher, a malfunction rate of 5% or lower, a response speed of 3 seconds or less, and an answer accuracy of 88% or higher. If these criteria are met, the result is rated good; if they are not met, retraining is performed.
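
The pass or retrain decision can be expressed as a simple threshold check. The thresholds below follow the stated criteria, reading the answer-accuracy criterion as "88% or higher", which is consistent with the retraining conditions given for Steps 2-3, 2-6, and 2-8. The names `EvalResult` and `failed_criteria` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    recognition_rate: float   # percent
    malfunction_rate: float   # percent
    response_seconds: float   # average seconds per answer
    answer_accuracy: float    # percent


def failed_criteria(result: EvalResult) -> list[str]:
    """Return the names of the criteria that are not met (empty list means 'good')."""
    failures = []
    if result.recognition_rate < 95.0:
        failures.append("recognition_rate")
    if result.malfunction_rate > 5.0:
        failures.append("malfunction_rate")
    if result.response_seconds > 3.0:
        failures.append("response_seconds")
    if result.answer_accuracy < 88.0:
        failures.append("answer_accuracy")
    return failures
```

An empty list corresponds to a rating of good; a non-empty list identifies which metrics would trigger retraining.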

Step 2-3 (S23) is the step in which the search model is retrained with hard negatives when the target performance is not achieved in the evaluation of Step 2-2 (S22). A hard negative is a data sample that looks similar to a positive example but actually belongs to a different category. Such samples are selected to help the search model learn more accurate features. In particular, a hard negative sample is encoded similarly to existing samples while belonging to a different class, which pushes the search model to develop stronger discriminative power. One way to use hard negative samples is hard negative mining (HNM). This method actively finds hard negative samples during training and uses them as training data, which, especially in combination with a contrastive loss function, makes the model's decision boundary clearer. In Step 2-3 (S23), if the evaluation of the search model shows an input data recognition rate below 95%, a malfunction rate above 5%, a response speed above 3 seconds, or an answer accuracy below 88%, the search model is retrained using additional methods such as the hard negatives described above.
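
Hard negative mining can be sketched as "pick the non-matching documents that the current model scores closest to the query". The example below pairs it with a triplet-style margin loss on cosine similarity, which is one common contrastive objective; the text does not name a specific loss, so that choice, the margin value, and the function names are assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def mine_hard_negatives(
    query_vec: list[float],
    doc_vecs: dict[str, list[float]],
    positive_id: str,
    n: int = 2,
) -> list[str]:
    """Return the n documents most similar to the query, excluding the true match."""
    scored = [
        (cosine(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
        if doc_id != positive_id
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:n]]


def triplet_loss(query, positive, negative, margin: float = 0.2) -> float:
    """Zero once the positive beats the negative by at least `margin` in similarity."""
    return max(0.0, margin - cosine(query, positive) + cosine(query, negative))
```

An easy negative already yields zero loss and contributes nothing to training, whereas a mined hard negative produces a positive loss, which is why such samples sharpen the decision boundary.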

Step 2-4 (S24) is the step in which the contexts are vectorized with the search model trained in Step 2-2 (S22) or retrained in Step 2-3 (S23), and the vector database is built. After the search model has been trained and evaluated in Step 2-2 (S22) using the queries and contexts, there are two cases. If the evaluation criteria are met, Step 2-2 (S22) ends and the contexts are vectorized with the trained search model to build the vector database. If the evaluation criteria are not met, the contexts are vectorized with the search model retrained through hard negatives in Step 2-3 (S23) to build the vector database. In other words, the contexts are vectorized and embedded by the trained or retrained search model and stored in the vector database.

Step 2-5 (S25) is the step in which a dataset for training the generative model is constructed from the queries, the contexts retrieved by the search model trained in Step 2-2 (S22) or retrained in Step 2-3 (S23), and the answers, and the generative model is trained and evaluated. Once Steps 2-1 (S21) through 2-4 (S24) have trained and evaluated the search model on the queries and contexts of the Hindi document QA dataset, retrained it where necessary, and built the database for the search model, a training dataset for the generative model is constructed from the queries, the contexts retrieved by the trained or retrained search model, and the answers, and the generative model is trained and evaluated. That is, the queries and retrieved contexts are input to the generative model to generate answers; a dataset of queries, retrieved contexts, and answers is built from this, and the generative model is trained on it. As with the search model, the trained generative model is evaluated by input data recognition rate, malfunction rate, response speed, and answer accuracy. The input data recognition rate evaluates how accurately the input is recognized. The malfunction rate counts errors that occur when questions are asked; errors include hallucination, failure to mask sensitive information, no answer, and unresponsiveness. The response speed measures how quickly an answer is output after a question is entered. The answer accuracy is checked by asking the questions in a fixed question/answer set. The evaluation is performed by the same method as in Step 2-2 (S22). In addition, the queries, retrieved contexts, and answers used for the generative model are vector-embedded and stored in the vector database.

Step 2-6 (S26) is the step in which, when the target performance is not achieved in the evaluation of Step 2-5 (S25), expert feedback data is collected and the model is trained as a conversational language model that reflects the preferences of Hindi users and experts. In Step 2-6 (S26), if the evaluation of the generative model shows an input data recognition rate below 95%, a malfunction rate above 5%, a response speed above 3 seconds, or an answer accuracy below 88%, the generative model is retrained as a conversational language model that reflects the preferences of Hindi users and experts.

Step 2-7 (S27) is the step in which a RAG framework is built and evaluated on the basis of the search model trained in Step 2-2 (S22) or retrained in Step 2-3 (S23) and the generative model trained in Step 2-5 (S25) or retrained in Step 2-6 (S26). If the generative model trained and evaluated in Step 2-5 (S25) meets the evaluation criteria, the RAG framework is built from the search model trained in Step 2-2 (S22) or retrained in Step 2-3 (S23) and the generative model of Step 2-5 (S25). If the generative model trained and evaluated in Step 2-5 (S25) does not meet the evaluation criteria, the RAG framework is built from the search model trained in Step 2-2 (S22) or retrained in Step 2-3 (S23) and the generative model retrained in Step 2-6 (S26) as a conversational language model reflecting the preferences of Hindi users and experts. In other words, once the search model and the generative model have been trained, the RAG framework based on them is built and then evaluated. The evaluation uses the input data recognition rate, malfunction rate, response speed, and answer accuracy. The input data recognition rate evaluates how accurately the input is recognized. The malfunction rate counts errors that occur when questions are asked; errors include hallucination, failure to mask sensitive information, no answer, and unresponsiveness. The response speed measures how quickly an answer is output after a question is entered. The answer accuracy is checked by asking the questions in a fixed question/answer set.

Step 2-8 (S28) is the step in which error cases are analyzed and additional training is performed when the target performance is not achieved in the evaluation of Step 2-7 (S27). In Step 2-8 (S28), if the evaluation of the search model and the generative model shows an input data recognition rate below 95%, a malfunction rate above 5%, a response speed above 3 seconds, or an answer accuracy below 88%, the error cases are analyzed in depth, and the search model and the generative model are retrained with data that corrects those error cases.

The third step is the step in which fine-tuning is performed on the sLLM. FIG. 5 is a detailed flowchart of the third step of the method for constructing the multilingual AI DOC system (10) including Hindi, English, and Korean of the present invention. Referring to FIG. 5, the third step may be configured to include the following three steps. Fine-tuning refers to performing additional training on an already trained large language model, using a specific dataset, in order to achieve a high degree of fit to a specific task or domain. Fine-tuning is classified into full fine-tuning and repurposing, and the present invention can use full fine-tuning. Full fine-tuning means fine-tuning the entire pre-trained model, including all model parameters. In this method, all layers and parameters of the pre-trained model are updated and optimized to fit the requirements of the target task. This method is generally suitable when there is a large gap between the task and the pre-trained model, or when the task requires a high degree of flexibility and adaptability from the model.

Step 3-1 (S31) is the step in which the query of a question is input to the search model and the question embedding and document embeddings are calculated. To fine-tune the sLLM based on the RAG framework, it is necessary to examine whether the question embedding and document embeddings produced by the search model are highly related. To this end, the query is input to the search model, and the vector embeddings of the question and the documents are calculated.

Step 3-2 (S32) is the step in which relevance is calculated with Top-K using the question embedding and document embeddings, and related documents are retrieved. The parameters of a large language model discussed here are settings used to control and adjust the model's output; they play an important role in determining how the model generates results in response to user requests and what characteristics those results have. The main parameters include Temperature, Top-P, and Top-K, and results with the desired characteristics can be produced by adjusting them appropriately. Temperature controls the diversity of the model's output and takes a value between 0 and 1; lower values produce more certain and predictable output, and higher values produce more diverse and creative output. It may therefore be appropriate to use a low Temperature when looking for results in fact-based, informational data, and a high Temperature when diverse, more random results are wanted. Top-P, also called nucleus sampling, selects the next token on the basis of the cumulative probability distribution. With Top-P, the next word is chosen only from the most probable words whose cumulative probability reaches P%, which filters out low-probability words and improves answer quality. Top-K is a parameter that limits the candidate words when the model generates a result. It limits the number of most likely next tokens considered at each step: the model considers only the K tokens with the highest probability and selects among them. A low K therefore produces more focused and predictable output but can limit the model's creativity, while a high K allows more diverse output but may occasionally produce less relevant results. Top-K helps maintain a certain level of diversity while preventing the model from becoming too unpredictable. Like Temperature, Top-K is a parameter that controls how diverse the model's results will be. In the present invention, Top-K is applied to the results that the search model produces from the question embedding and document embeddings: the K value is set so that the most relevant documents are retrieved. In the present invention, K can be set to 10 so that the 10 closest results are returned.
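
How Temperature, Top-K, and Top-P shape next-token selection can be shown with a small sampler over raw scores (logits). This is a generic illustration, not the sLLM's actual decoding code; the function names and the order in which the filters are applied (Top-K, then Top-P) are assumptions.

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def filter_top_k_top_p(probs, top_k=None, top_p=None) -> dict[int, float]:
    """Keep the K most likely tokens, then the smallest prefix reaching mass top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}  # renormalize the survivors


def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=None) -> int:
    """Draw one token index from the filtered, renormalized distribution."""
    rng = rng or random.Random()
    dist = filter_top_k_top_p(softmax_with_temperature(logits, temperature), top_k, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

Note that this sampler concerns token selection during generation, whereas the K = 10 setting in Step 3-2 limits how many retrieved documents are returned; the two uses of "Top-K" share the same idea of keeping only the K highest-ranked candidates.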

Step 3-3 (S33) is the step in which the sLLM based on the RAG framework is fine-tuned so that appropriate answers and summaries are generated when a question and its related documents are input to the generative model. The parameters of the sLLM can be adjusted so that the generative model produces appropriate answers and summaries from the question and related documents; one or more of Temperature, Top-P, and Top-K described above can be selected and used.

Hereinafter, the operation of the multilingual AI DOC system (10) including Hindi, English, and Korean of the present invention is described with reference to the drawings. FIG. 6 is a flowchart of the method of operating the multilingual AI DOC system (10) including Hindi, English, and Korean of the present invention. Referring to FIG. 6, the method of operating the system may be configured to include the following five steps.

Step 1 (R10) is the step in which a question is entered and the document file to be queried is uploaded. Because the present invention receives a question about a document written in Hindi and provides the answer in Hindi, English, and Korean, the document to be queried is uploaded as a document file, and the question can be entered in Hindi, English, or Korean. A Hindi document file with an extension such as txt, pdf, or doc is input to the system through an input device. The document may be a commercial document of a public institution or an industrial company, such as a report, internal regulation, proposal, or plan. The user can ask questions about the Hindi document and receive answers on how the document is written, its summary, and its important points.

Step 2 (R20) is the step in which the text of the question and the document file is chunked and vectorized, and the question embedding and document embeddings are calculated. Once the document and question have been input, their text is chunked, each word is vectorized, and the question embedding and document embeddings are calculated from the result.

Step 3 (R30) is the step in which the question embedding and document embeddings are input to the search model and related documents are retrieved. Once the question embedding and document embeddings have been calculated, they are input to the search model to retrieve related documents. Through the vector embeddings, related documents are retrieved in descending order of relevance.
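
Retrieval in descending order of relevance can be sketched as a top-K selection by cosine similarity between the question embedding and each document embedding. The default `k=10` follows the K value mentioned for Step 3-2; the use of cosine similarity and the function names are assumptions, since the text does not fix the similarity measure.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def top_k_documents(
    question_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 10
) -> list[tuple[float, str]]:
    """Return up to k (similarity, doc_id) pairs, most similar first."""
    scored = ((cosine(question_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(k, scored)
```

`heapq.nlargest` avoids sorting the full candidate list, which matters little here but mirrors the fact that only the K best matches are ever needed.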

Step 4 (R40) is the step in which the question and related documents are input to the generative model and an appropriate answer and summary are generated. Once the related documents have been retrieved by the search model, the question and the related documents are input to the generative model so that the answer and summary are generated.
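
Feeding the question and the retrieved documents to the generative model usually means assembling them into a single prompt. The sketch below shows one possible layout; the wording of the instruction, the `max_chars` budget, and the function name `build_prompt` are all hypothetical and not taken from the text.

```python
def build_prompt(
    question: str,
    documents: list[str],
    answer_language: str = "Hindi",
    max_chars: int = 4000,
) -> str:
    """Concatenate retrieved documents (most relevant first) and the question."""
    parts, used = [], 0
    for index, doc in enumerate(documents, start=1):
        entry = f"[Document {index}]\n{doc.strip()}"
        if used + len(entry) > max_chars:
            break  # stop before exceeding the context budget
        parts.append(entry)
        used += len(entry)
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the documents below. "
        f"Reply in {answer_language}.\n\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Because documents are added in retrieval order, the budget check drops the least relevant documents first when the context is too long.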

Step 5 (R50) is the step in which the output of the generative model is input to the sLLM prompt and the final result is produced. Once the answer and summary have been generated by the generative model, they are input to the sLLM prompt and combined with the content already in the database to produce the final result. By using the RAG framework, the present invention can overcome hallucination and lack of recency, which are shortcomings of conventional large language models. The final result can be provided in the language the user wants, such as Hindi, English, or Korean. In addition, the top 10 items by similarity can be extracted and recommended. For questions about a Hindi document, the present invention performs business document summarization and report writing; specifically, it implements summarization of main content, Q&A on related content, and search for specific content. In addition, in the present invention, translation between Hindi and English can be performed through LangChain's Indic-english encoder and English-indic decoder, and translation between English and Korean can be performed through LangChain's translation model.

The system described above may be implemented using hardware components, software components, and/or a combination of hardware components and software components. For example, the system and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that execute on the OS. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the processing device is sometimes described as a single device, but those skilled in the art will appreciate that the processing device may include multiple processing elements and/or multiple types of processing elements. For example, the processing device may include multiple processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may also be distributed over network-connected computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known to and usable by those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiment, and vice versa.

Although the embodiments have been described above with reference to limited embodiments and drawings, those skilled in the art can make various modifications and variations based on the above description. For example, appropriate results can be achieved even if the described techniques are performed in a different order from the described method, and/or if components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims set out below.

10 : Multilingual AI DOC system
100 : RAG module
200 : sLLM module

Claims

delete

A method for building a multilingual AI DOC system including Hindi, English, and Korean, the method comprising:
a first step in which Hindi-based training data, English-based training data, and Korean-based training data are collected and processed to train an sLLM (smaller Large Language Model), an on-premise language model that understands Hindi, English, and Korean;
a second step in which a Hindi training dataset is constructed from commercial documents of public institutions and industry written in Hindi, and the sLLM trained in the first step is built into a RAG (Retrieval-Augmented Generation) framework in which the Hindi training dataset is vector-embedded and stored in a vector database; and
a third step in which fine-tuning is performed on the sLLM of the second step,
wherein the second step comprises:
step 2-1, in which a Hindi document QA dataset is constructed with a query, context, and answer structure;
step 2-2, in which a search model is trained using the query and context, and the search model is evaluated;
step 2-3, in which the search model is retrained using hard negatives if the target performance is not achieved in the evaluation of step 2-2;
step 2-4, in which the contexts are vectorized based on the search model trained in step 2-2 or retrained in step 2-3, and a vector database is constructed;
step 2-5, in which a dataset for training a generative model is constructed from the query, the retrieved context, and the answer, based on the search model trained in step 2-2 or retrained in step 2-3, and the generative model is trained and evaluated;
step 2-6, in which, if the target performance is not achieved in the evaluation of step 2-5, expert feedback data is collected and the model is trained as an interactive language model reflecting the preferences of Hindi users and experts;
step 2-7, in which a RAG framework based on the search model trained in step 2-2 or retrained in step 2-3 and on the generative model trained in step 2-5 or retrained in step 2-6 is built and evaluated; and
step 2-8, in which error cases are analyzed and additional training is performed if the target performance is not achieved in the evaluation of step 2-7.
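The claim names retraining with hard negatives but does not say how they are chosen. One common approach, shown here purely as an assumption, is to take the non-answer contexts that the current search model scores as most similar to the query. The function names, the use of cosine similarity, and the toy two-dimensional vectors are all illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_hard_negatives(query_vec, context_vecs, gold_index, n=2):
    """Return indices of the n non-gold contexts most similar to the query."""
    scored = [
        (cosine(query_vec, vec), idx)
        for idx, vec in enumerate(context_vecs)
        if idx != gold_index  # the correct context is never a negative
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:n]]


# Toy embeddings (assumed): index 0 is the correct context for the query.
contexts = [
    [1.0, 0.0],  # gold context
    [0.9, 0.1],  # close to the query but wrong: a hard negative
    [0.0, 1.0],  # unrelated: an easy negative
    [0.7, 0.3],  # moderately close
]
hard = mine_hard_negatives([1.0, 0.0], contexts, gold_index=0, n=2)
print(hard)
```

The selected indices would be paired with the query as negative examples in the next training round of the search model.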

The method of claim 2,
wherein the evaluations in steps 2-2, 2-5, and 2-7 are conducted by measuring input data recognition rate, malfunction rate, response speed, and answer accuracy, the result being evaluated as good if the evaluation criteria are met, and retraining being performed if the evaluation criteria are not met.

The method of claim 3,
wherein the evaluation criteria differ according to the number of questions,
a first evaluation criterion being applied if the number of questions is 50 or more, and a second evaluation criterion being applied if the number of questions is fewer than 50,
the first evaluation criterion being an input data recognition rate of 93% or higher, a malfunction rate of 7% or lower, a response speed of 5 seconds or less, and an answer accuracy of 85% or lower, the result being evaluated as good if the criterion is met and retraining being performed if it is not, and
the second evaluation criterion being an input data recognition rate of 95% or higher, a malfunction rate of 5% or lower, a response speed of 3 seconds or less, and an answer accuracy of 88% or lower, the result being evaluated as good if the criterion is met and retraining being performed if it is not.
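The two-tier criteria above can be expressed as a small check. This is a sketch under stated assumptions: the translated claim text reads "or lower" for answer accuracy, and the code below treats the accuracy figure as a minimum instead, which is an interpretation and not something the source confirms. The names `Metrics`, `select_criteria`, and `evaluate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    recognition_rate: float  # input data recognition rate, percent
    malfunction_rate: float  # percent
    response_seconds: float  # response speed, seconds
    answer_accuracy: float   # percent


def select_criteria(num_questions):
    """Pick the first criterion for 50+ questions, otherwise the second."""
    if num_questions >= 50:
        return {"recognition_min": 93.0, "malfunction_max": 7.0,
                "response_max": 5.0, "accuracy_min": 85.0}
    return {"recognition_min": 95.0, "malfunction_max": 5.0,
            "response_max": 3.0, "accuracy_min": 88.0}


def evaluate(metrics, num_questions):
    """Return 'good' if every threshold is met, otherwise 'retrain'."""
    c = select_criteria(num_questions)
    ok = (
        metrics.recognition_rate >= c["recognition_min"]
        and metrics.malfunction_rate <= c["malfunction_max"]
        and metrics.response_seconds <= c["response_max"]
        # ASSUMPTION: accuracy is treated as a floor, not a ceiling.
        and metrics.answer_accuracy >= c["accuracy_min"]
    )
    return "good" if ok else "retrain"


run = Metrics(recognition_rate=94.0, malfunction_rate=6.0,
              response_seconds=4.0, answer_accuracy=86.0)
print(evaluate(run, num_questions=60))  # judged against the first criterion
print(evaluate(run, num_questions=20))  # judged against the second criterion
```

The same measurements can pass with 60 questions and fail with 20, because the second criterion is stricter on every axis.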

The method of claim 4,
wherein the third step comprises:
step 3-1, in which a question query is input into the search model to calculate a question embedding and document embeddings;
step 3-2, in which relevance is calculated from the question embedding and the document embeddings, and related documents are retrieved using Top-K; and
step 3-3, in which the RAG framework-based sLLM is fine-tuned so that an appropriate answer and summary are generated when the question and the related documents are input into the generative model.
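Steps 3-1 and 3-2 amount to embedding-based Top-K retrieval. The sketch below assumes cosine similarity as the relevance measure and uses hand-written toy vectors in place of real embeddings; the claim specifies neither, and `top_k_documents` is a hypothetical name.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_documents(question_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most relevant documents."""
    scored = [(i, cosine_similarity(question_vec, v)) for i, v in enumerate(doc_vecs)]
    # nlargest keeps only the k best-scoring pairs, highest score first.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings (assumed); in practice these would come from the search model.
question_embedding = [1.0, 0.0]
document_embeddings = [
    [0.1, 0.9],  # document 0: mostly unrelated
    [0.8, 0.2],  # document 1: closest to the question
    [0.6, 0.4],  # document 2: second closest
]
top = top_k_documents(question_embedding, document_embeddings, k=2)
print([index for index, _ in top])
```

The retrieved documents, together with the question, are what step 3-3 feeds to the generative model.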

delete