KR102707242B1

KR102707242B1 - System for de-identifying images containing personal information in documents and method thereof

Info

Publication number: KR102707242B1
Application number: KR1020240072756A
Authority: KR
Inventors: 양진서; 나길성
Original assignee: (주)에이씨엔에스
Priority date: 2024-06-04
Filing date: 2024-06-04
Publication date: 2024-09-19
Anticipated expiration: 2044-06-04

Abstract

The present invention relates to a system and method for de-identifying images containing personal information in a document. Images in various formats (i.e., TIFF, JPEG, PNG, and BMP) are extracted from a PDF file, a de-identified image is generated in each image's own format, and the de-identified images are written back into the original PDF file, thereby preventing exposure of the personal information contained in the images in the PDF file.

Description

{SYSTEM FOR DE-IDENTIFYING IMAGES CONTAINING PERSONAL INFORMATION IN DOCUMENTS AND METHOD THEREOF}

The present invention relates to a system and method for de-identifying images containing personal information in a document, and more particularly to a system and method that extract images in various formats (i.e., TIFF, JPEG, PNG, and BMP) from a PDF file, generate a de-identified image in each image's own format, and update the original PDF file with the de-identified images, thereby preventing exposure of the personal information contained in the images in the PDF file.

As the need to use personal information grows in cutting-edge industries such as artificial intelligence and self-driving cars, as well as in online shopping malls that provide customized services, demand is rapidly increasing for a framework of de-identification measures that protects personal information while allowing it to be used smoothly.

In addition, the government allows personal information to be collected, used, and provided as a priority in situations where this is deemed urgently necessary for public safety and well-being, such as protecting the lives, bodies, and property of citizens or public health, while requiring compliance with regulations on safety measures, destruction, and the rights of data subjects.

For example, de-identification of personal information under the Personal Information Protection Act means deleting or modifying part or all of the personal information so that a specific individual cannot be identified. As concrete de-identification measures, the domestic guideline presents 17 detailed techniques grouped into five processing methods: pseudonymization, including heuristic pseudonymization, encryption, and swapping; aggregation, including total aggregation, partial aggregation, rounding, and rearrangement; data reduction, including identifier deletion, partial identifier deletion, record deletion, and deletion of all identifying elements; data categorization (data suppression), including hiding, random rounding, the range method, and controlled rounding; and data masking, including adding random noise and blanking and substitution.

The personal information filtering solutions currently in use filter the text contained in a document and report whether specific personal or sensitive information is present. In addition, the PDF file format, an international standard document format, allows a new version to be created by removing or modifying text when that text contains personal or sensitive information.

However, the masking-based de-identification applied to images of various formats contained in PDF files has amounted to no more than covering part of the image with shape objects such as circles and rectangles.

In addition, existing PDF black-masking or de-identification products adopt the same approach as the 17 detailed techniques, grouped into five processing methods, described above. Information hidden by inserting another shape object over an image object can therefore be easily uncovered with commonly used PDF editing tools, which creates a serious risk of personal information exposure.

Therefore, the present invention proposes a way to prevent exposure of personal information contained in images within a PDF file by de-identifying that information in a manner suited to each of the image formats (TIFF, JPEG, PNG, and BMP) found in the PDF file.

Next, prior inventions in the technical field of the present invention are briefly described, followed by the technical matters in which the present invention differs from them.

First, Korean Patent No. 10-2319492 (October 29, 2021) relates to a personal information processing system using AI deep learning, and a processing method using the same, which analyzes the text and image information contained in an image file with an artificial neural network and then detects and extracts the personal information it contains (such as personal identification numbers or fingerprint information), thereby effectively detecting and de-identifying personal information held by individuals or organizations.

In addition, Korean Patent No. 10-2585793 (October 6, 2023) relates to a medical image processing method that de-identifies risk information by changing the color of at least part of it to the color surrounding the sensitive information, thereby generating from a medical image a safe image in which the risk information cannot be identified.

However, the present invention de-identifies personal information contained in images of various formats within a PDF file in a manner suited to each image format. It therefore differs markedly in configuration from Korean Patent No. 10-2319492, which processes personal information within an image by one or more of mosaic processing, hiding, encryption, pseudonymization, aggregation, data deletion, data categorization, or masking, and from Korean Patent No. 10-2585793, which crops or deletes only the sensitive information extracted from a medical image.

The present invention was devised to solve the above problems, and its object is to provide a system and method that prevent exposure of personal information contained in images within a PDF file by de-identifying the personal information in images of various formats (i.e., TIFF, JPEG, PNG, and BMP) in a manner suited to each image format.

In particular, another object of the present invention is to provide a system and method that extract an image area contained in a PDF file, check its file format and coordinates, delete the personal information within the image area or overwrite it with another object, save the result using the generator for that file format, and then insert the de-identified image into the original PDF file as an update, thereby de-identifying personal information contained in images of various formats within the PDF file.

However, the technical problems addressed by this embodiment are not limited to those described above, and other technical problems may exist.

A system for de-identifying images containing personal information in a document according to one embodiment of the present invention comprises: an image extraction unit that extracts at least one image containing personal information from a PDF file to be de-identified; a de-identification area determination unit that determines a de-identification area in the at least one extracted image; a de-identification processing unit that performs de-identification processing on the determined de-identification area; and a de-identification application unit that saves the at least one de-identified image through the file generator for that image and updates the corresponding image portion of the PDF file to be de-identified with the saved image.

Here, the personal information is information that can identify an individual, including name, address, phone number, resident registration number, email address, credit card number, date of birth, gender, occupation, income level, religion, political orientation, medical information, photograph, financial information, purchase information, and location information.

In addition, the image extraction unit includes: an image file format verification unit that determines which of the TIFF, JPEG, PNG, and BMP file formats the at least one extracted image is in; and a coordinate verification unit that determines the coordinate information of the at least one extracted image.
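One way the format check could work is by inspecting the leading signature bytes of the extracted image data. The following is a minimal sketch, assuming the extracted image is available as a standalone byte string; the function name `detect_image_format` is hypothetical and is not taken from the patent, which does not specify how the format is determined.

```python
from typing import Optional

# Leading signature ("magic") bytes of the four formats named in the text.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
    (b"BM", "BMP"),
)


def detect_image_format(data: bytes) -> Optional[str]:
    """Return 'TIFF', 'JPEG', 'PNG' or 'BMP', or None if the data matches none."""
    for magic, name in _SIGNATURES:
        if data.startswith(magic):
            return name
    return None
```

A caller would pass the first few bytes of each extracted image and route the image to the matching handler, treating `None` as an unsupported format.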

In addition, the de-identification application unit saves the image processed by the de-identification processing unit through the file generator for the corresponding file format, according to the TIFF, JPEG, PNG, or BMP file format determined by the image file format verification unit.

In addition, the de-identification processing unit performs de-identification processing on the de-identification area of each of the at least one image based on the coordinate information determined by the coordinate verification unit.
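Coordinate-based processing of this kind can be illustrated on a decoded pixel grid. The sketch below assumes the image has already been decoded into rows of RGB tuples and that the de-identification area is an axis-aligned rectangle; the names `mask_region`, `Pixel`, and `Box` are hypothetical, and solid-color filling stands in for whichever of the processing options is actually chosen.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]     # (R, G, B)
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def mask_region(
    pixels: List[List[Pixel]], box: Box, fill: Pixel = (0, 0, 0)
) -> List[List[Pixel]]:
    """Return a copy of `pixels` with every pixel inside `box` replaced by `fill`."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    left, top, right, bottom = box
    # Clamp the box to the image bounds so out-of-range coordinates are harmless.
    left, top = max(0, left), max(0, top)
    right, bottom = min(width, right), min(height, bottom)
    out = [list(row) for row in pixels]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = fill
    return out
```

Because the function writes new pixel values rather than drawing over the image, the original values inside the box no longer exist in the returned grid.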

In addition, the image extraction unit inputs the PDF file to be de-identified into an artificial intelligence model for personal information extraction to extract at least one image containing personal information, and the artificial intelligence model for personal information extraction is a deep learning model trained on the features of images containing personal information.

In addition, the de-identification processing is performed by removing, replacing, modifying, adding to, or masking the de-identification area of each of the at least one image determined by the de-identification area determination unit.

Furthermore, a method for de-identifying images containing personal information in a document according to one embodiment of the present invention is performed in the system for de-identifying images containing personal information in a document, and comprises: an image extraction step of extracting at least one image containing personal information from a PDF file to be de-identified; a de-identification area determination step of determining a de-identification area in the at least one extracted image; a de-identification processing step of performing de-identification processing on the determined de-identification area; and a de-identification application step of saving the at least one de-identified image through the file generator for that image and updating the corresponding image portion of the PDF file to be de-identified with the saved image.

In addition, the image extraction step includes: an image file format verification step of determining which of the TIFF, JPEG, PNG, and BMP file formats the at least one extracted image is in; and a coordinate verification step of determining the coordinate information of the at least one extracted image. The de-identification application step includes saving the image processed in the de-identification processing step through the file generator for the corresponding file format, according to the TIFF, JPEG, PNG, or BMP file format determined in the image file format verification step. The de-identification processing step includes performing de-identification processing on the de-identification area of each of the at least one image based on the coordinate information determined in the coordinate verification step.

In addition, the image extraction step inputs the PDF file to be de-identified into an artificial intelligence model for personal information extraction to extract at least one image containing personal information, and the artificial intelligence model for personal information extraction is a deep learning model trained on the features of images containing personal information.

In addition, the de-identification processing is performed by removing, replacing, modifying, adding to, or masking the de-identification area of each of the at least one image determined in the de-identification area determination step.

As described above, according to the system and method of the present invention for de-identifying images containing personal information in a document, personal information contained in images within a PDF file is de-identified in accordance with the corresponding TIFF, JPEG, PNG, or BMP image format. This fundamentally blocks the conventional weakness in which information hidden by an opaque object inserted over an image can be easily uncovered with an editing tool, and thereby prevents exposure of personal information.

In addition, the present invention can greatly reduce the time and cost required to de-identify personal information contained in images within a PDF file. When correction or editing is required, only the image portions concerned can be selected and processed without repeating the work from the beginning, which improves work efficiency.

However, the effects of the present invention are not limited to those described above, and effects not mentioned will be clearly understood from this specification and the accompanying drawings by a person having ordinary skill in the art to which the present invention pertains.

FIG. 1 is a diagram schematically showing the overall configuration including a system for de-identifying images containing personal information in a document according to one embodiment of the present invention.
FIG. 2 is a diagram showing in more detail the configuration of an image de-identification system according to one embodiment of the present invention.
FIG. 3 is a diagram showing the hardware structure of an image de-identification system according to one embodiment of the present invention.
FIG. 4 is a flowchart showing in detail the operation of a method for de-identifying images containing personal information in a document according to one embodiment of the present invention.

Hereinafter, specific embodiments of the present invention are described in detail with reference to the drawings. However, the spirit of the present invention is not limited to the embodiments presented. A person skilled in the art who understands the spirit of the present invention could readily propose, by adding, changing, or deleting components within the scope of the same spirit, other regressive inventions or other embodiments falling within the scope of the spirit of the present invention, and these too are to be regarded as falling within that scope.

In addition, components with the same function within the scope of the same idea that appear in the drawings of the embodiments are described using the same reference numerals.

FIG. 1 is a diagram schematically showing the overall configuration including a system for de-identifying images containing personal information in a document according to one embodiment of the present invention.

As shown in FIG. 1, the present invention comprises a system (100) for de-identifying images containing personal information in a document (hereinafter, the image de-identification system), a plurality of external servers (200), a database (300), a plurality of user terminals (400), and the like.

The image de-identification system (100) is a platform or server computer, operated by a user, that performs de-identification processing so that personal information contained in images within documents used by the public, in particular PDF files, is not exposed.

The image de-identification system (100) de-identifies, in accordance with each image's file format, the personal information contained in images within PDF files on various subjects (theses, reports, articles, and the like) that are held on the external servers (200) or in its own database (300), so that the personal information is not exposed when the documents are made available to the public.

To this end, the image de-identification system (100) collects PDF files requiring de-identification over a network from the external servers (200), from its own database (300), and from individual user terminals (400), or receives de-identification requests from external server operators or users.

Next, the image de-identification system (100) extracts the image portions containing personal information from the PDF file to be de-identified, de-identifies the extracted image portions, and then writes the de-identified images back into the original PDF file to complete the de-identification.

The images extracted from the PDF file are in the TIFF, JPEG, PNG, or BMP image file format.

For example, TIFF (Tagged Image File Format) supports lossless compression and so preserves image quality, which makes it usable in fields where high quality matters, such as photography, medical imaging, and printing. JPEG (Joint Photographic Experts Group) uses lossy compression to keep file sizes small, so it is suitable for web images and general photo storage. PNG (Portable Network Graphics) preserves image quality with lossless compression and supports transparency, making it suitable for web design. BMP (Windows Bitmap) has a simple structure and is supported by most operating systems and software, giving it high compatibility.

In the present invention, when an image in the TIFF, JPEG, PNG, or BMP file format within a PDF file contains personal information, a de-identification operation such as deleting the image or overwriting it with another object is performed, and the file generator for that image is then used to generate a de-identified image in the matching image format.
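The idea of a per-format file generator can be sketched as a lookup table from format name to encoder. The example below is an illustration only: it implements uncompressed 24-bit BMP and 8-bit RGB PNG encoders with the Python standard library and leaves JPEG and TIFF unregistered, since those would normally come from an imaging library. The names `encode_png`, `encode_bmp`, `GENERATORS`, and `save_in_original_format` are hypothetical and do not come from the patent.

```python
import struct
import zlib
from typing import Callable, Dict, List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B)
Grid = List[List[Pixel]]      # non-empty rows of equal length


def _png_chunk(tag: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(tag + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + tag + payload + struct.pack(">I", crc)


def encode_png(pixels: Grid) -> bytes:
    """Encode as 8-bit RGB PNG, using filter type 0 on every scanline."""
    height, width = len(pixels), len(pixels[0])
    raw = b"".join(b"\x00" + bytes(c for px in row for c in px) for row in pixels)
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _png_chunk(b"IHDR", ihdr)
        + _png_chunk(b"IDAT", zlib.compress(raw))
        + _png_chunk(b"IEND", b"")
    )


def encode_bmp(pixels: Grid) -> bytes:
    """Encode as uncompressed 24-bit BMP (bottom-up rows, BGR order, 4-byte padding)."""
    height, width = len(pixels), len(pixels[0])
    pad = b"\x00" * ((-width * 3) % 4)
    body = b"".join(
        b"".join(bytes((b, g, r)) for r, g, b in row) + pad
        for row in reversed(pixels)
    )
    offset = 14 + 40  # file header + BITMAPINFOHEADER
    file_header = struct.pack("<2sIHHI", b"BM", offset + len(body), 0, 0, offset)
    info_header = struct.pack(
        "<IiiHHIIiiII", 40, width, height, 1, 24, 0, len(body), 2835, 2835, 0, 0
    )
    return file_header + info_header + body


# Registry of per-format generators; JPEG and TIFF are intentionally absent here.
GENERATORS: Dict[str, Callable[[Grid], bytes]] = {"PNG": encode_png, "BMP": encode_bmp}


def save_in_original_format(pixels: Grid, fmt: str) -> bytes:
    """Re-encode the de-identified pixels in the format the image originally had."""
    if fmt not in GENERATORS:
        raise ValueError(f"no generator registered for {fmt!r}")
    return GENERATORS[fmt](pixels)
```

Keeping the encoders behind a registry means a further format can be supported by adding one entry, without touching the de-identification step itself.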

The de-identified image generated in each image format is then written back into the original PDF file (that is, the original image portion is removed and the newly generated de-identified image is put in its place). De-identifying the personal information contained in the images within the PDF file in this way prevents specific personal information from being exposed when the general public uses the PDF file.

When a de-identified image with the personal information removed is saved in this way using the dedicated file generator for each image file format, the processing does not merely cover the image object containing personal information. Instead, the image pixel values, feature vectors, and other numerically expressed information are computed according to the image file format, so that the image at the coordinates for which de-identification was requested is itself de-identified. This fundamentally blocks the conventional weakness in which information hidden by an opaque object inserted over an image object can be easily uncovered with an editing tool.

That is, the image de-identification technique applied in the present invention does not simply cover the area requiring de-identification with an arbitrary object. It uses image encoding to generate a de-identified image for each image file format and updates the document with it.
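The difference between covering an image and rewriting it can be shown with a toy model. The sketch below is an illustration under simplifying assumptions, not the patented implementation: a page is modeled as a grayscale pixel grid plus a list of overlay rectangles, and the names `Page`, `render`, `mask_with_overlay`, and `mask_in_pixels` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


@dataclass
class Page:
    image: List[List[int]]                             # grayscale pixel rows
    overlays: List[Box] = field(default_factory=list)  # shapes drawn over the image


def render(page: Page) -> List[List[int]]:
    """What a viewer shows: the image with every overlay painted black on top."""
    out = [row[:] for row in page.image]
    for left, top, right, bottom in page.overlays:
        for y in range(top, bottom):
            for x in range(left, right):
                out[y][x] = 0
    return out


def mask_with_overlay(page: Page, box: Box) -> None:
    """Conventional approach: add a covering object and leave the pixels untouched."""
    page.overlays.append(box)


def mask_in_pixels(page: Page, box: Box) -> None:
    """Approach described in the text: overwrite the image data itself."""
    left, top, right, bottom = box
    for y in range(top, bottom):
        for x in range(left, right):
            page.image[y][x] = 0


secret = [[7, 7, 7], [7, 9, 7]]
box = (1, 1, 2, 2)

a = Page([row[:] for row in secret])
mask_with_overlay(a, box)
a.overlays.clear()             # an editing tool deletes the covering shape
leaked = render(a)[1][1]       # the original value 9 is visible again

b = Page([row[:] for row in secret])
mask_in_pixels(b, box)
b.overlays.clear()             # there is no covering shape to delete
kept_hidden = render(b)[1][1]  # the value stays 0
```

In the overlay case the sensitive value survives in the image data and reappears once the shape is removed, while in the pixel-rewrite case there is nothing left to recover.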

In addition, after de-identifying the images containing personal information in a PDF file, if some images were not de-identified or need correction, the image de-identification system (100) can select only those images and perform de-identification editing on them. Unlike the conventional approach, the de-identification work does not have to be repeated from the beginning when correction or editing is required, so the time and cost of de-identifying images containing personal information are greatly reduced and work efficiency is maximized.

The external server (200) is a server operated by a national or local government, company, research institute, hospital, or the like, and holds a plurality of PDF files on various subjects such as theses, reports, and articles.

In addition, the external server (200) provides PDF files to the image de-identification system (100) over a network to request de-identification of the personal information contained in images within the documents, and receives and stores the PDF files returned after de-identification.

Accordingly, when a PDF file is provided at the request of an administrator or of the general public (i.e., a user terminal), a document can be provided in which images containing personal information are not exposed.

The database (300) stores and manages PDF files that have not yet been de-identified, and stores and updates the operating programs used by the image de-identification system (100) (e.g., the program for extracting images from PDF files, the de-identification processing program, and the file generator for each image file format).

In addition, the database (300) stores and manages the PDF files for which image de-identification by the image de-identification system (100) has been completed.

The user terminal (400) is a communication terminal, such as a smartphone or tablet PC, used by members of the general public who work with PDF files, and can search for and use PDF files whose image de-identification has been completed by the image de-identification system (100). PDF files can be searched and viewed by connecting to the image de-identification system (100) or to any of the external servers (200).

또한, 상기 사용자 단말(400)은 사용자가 보유한 PDF 파일을 상기 이미지 비식별화 시스템(100)에 제공하여 이미지 비식별화 작업을 요청하고, 이미지 비식별화 처리가 완료된 PDF 파일을 제공받을 수 있다.In addition, the user terminal (400) can request an image de-identification task by providing a PDF file held by the user to the image de-identification system (100), and can receive a PDF file on which the image de-identification process has been completed.

FIG. 2 is a diagram showing the configuration of the image de-identification system according to an embodiment of the present invention in more detail.

As illustrated in FIG. 2, the image de-identification system (100) includes a processing target file collection unit (110), an image extraction unit (120), a de-identification area determination unit (130), a de-identification processing unit (140), a de-identification application unit (150), a de-identification editing unit (160), and the like.

The processing target file collection unit (110) collects, over a network, PDF files held by the plurality of external servers (200), PDF files held in the system's own database (300), PDF files held by individual users, and the like. In other words, the PDF files to be subjected to image de-identification processing are collected through a variety of routes.

The processing target file collection unit (110) may also receive a PDF file to be de-identified directly from an administrator of an external server or from a user, together with a request for image de-identification processing.

The image extraction unit (120) extracts at least one image containing personal information from the PDF file to be de-identified that was collected through the processing target file collection unit (110).

Here, personal information means information that can be used to identify an individual as well as information that can be used to track an individual.

For example, personal information includes name, address, phone number, resident registration number, email address, credit card number, date of birth, gender, occupation, income level, religion, political orientation, medical information, photographs, financial information, purchase information, and location information.

In addition, the image extraction unit (120) includes an image file format verification unit (121) and a coordinate verification unit (122).

The image file format verification unit (121) checks whether the file format of an image containing personal information extracted from the PDF file to be de-identified is TIFF, JPEG, PNG, or BMP.
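
The format check described above can be sketched as a comparison against each format's leading signature bytes. This is only an illustrative sketch, not the implementation of the image file format verification unit (121): the function name `detect_image_format` is hypothetical, and the sketch assumes the extracted image is available as the bytes of a complete image file (images inside a PDF are often stored as filtered streams, in which case the format would have to be inferred differently).

```python
from typing import Optional

# Leading signature ("magic") bytes of the four formats named in the text.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF header
    (b"MM\x00*", "TIFF"),  # big-endian TIFF header
    (b"BM", "BMP"),
)


def detect_image_format(data: bytes) -> Optional[str]:
    """Return "TIFF", "JPEG", "PNG" or "BMP", or None if no signature matches.

    Hypothetical helper: assumes `data` holds a complete image file.
    """
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

Returning `None` for anything outside the four formats lets the caller decide whether to skip the image or flag it for manual review.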

The coordinate verification unit (122) checks the coordinate information of an image containing personal information extracted from the PDF file to be de-identified.

For example, the image extraction unit (120) can input the PDF file to be de-identified into a previously prepared artificial intelligence model for extracting personal information, and thereby extract at least one image containing personal information.

Here, the artificial intelligence model for extracting personal information is a deep learning model that has learned the features of images containing various kinds of personal information.

The features of an image containing personal information may include not only the shapes of faces, credit cards, business cards, and the like, but also text whose pattern matches a type of personal information.

The image extraction unit (120) may also perform OCR (Optical Character Recognition) on the PDF file to be de-identified to extract the text in each image of the PDF document, and then identify images containing personal information by comparing the extracted text with previously prepared patterns for each type of personal information, such as resident registration numbers, ID numbers, card numbers, names, and dates of birth.
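
The pattern comparison applied to OCR output might look like the following minimal sketch. The OCR step itself is not shown, and the regular expressions, the `PATTERNS` table, and the function `find_personal_info` are illustrative assumptions rather than the patterns actually prepared by the system; real patterns would need to cover more types and more formatting variants.

```python
import re
from typing import List, Tuple

# Hypothetical patterns for a few personal information types.
PATTERNS = {
    # 6 digits, hyphen, 7 digits (resident registration number layout)
    "resident_registration_number": re.compile(r"\b\d{6}-\d{7}\b"),
    # four groups of 4 digits separated by hyphens or spaces
    "card_number": re.compile(r"\b\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_personal_info(text: str) -> List[Tuple[str, str]]:
    """Return (type, matched text) pairs found in OCR-extracted text."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group()))
    return hits
```

An image whose OCR text produces at least one hit would then be treated as an image containing personal information.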

The de-identification area determination unit (130) determines the de-identification area in the at least one image extracted by the image extraction unit (120).

For example, the de-identification area determination unit (130) can determine whether to de-identify the whole of each image extracted from the PDF file, or to select and de-identify only a partial area of each image.

That is, if an image extracted from the PDF file consists entirely of personal information, the whole image must be de-identified, whereas if only part of the image is judged to be personal information, only that partial area may need to be selected and de-identified. The area to be de-identified is therefore finally determined by the de-identification area determination unit (130).
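
The text does not say how the choice between whole-image and partial de-identification is made. As one possible sketch, the rule below de-identifies the whole image when the detected personal-information boxes cover most of it, and otherwise keeps the individual boxes. The coverage rule, the `full_threshold` parameter, and the function `decide_regions` are all assumptions introduced for illustration.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def decide_regions(width: int, height: int, detections: List[Box],
                   full_threshold: float = 0.8) -> List[Box]:
    """Return the areas to de-identify for one image (hypothetical rule)."""
    clipped = []
    for left, top, right, bottom in detections:
        # Clip each detection to the image bounds and drop empty boxes.
        l, t = max(left, 0), max(top, 0)
        r, b = min(right, width), min(bottom, height)
        if r > l and b > t:
            clipped.append((l, t, r, b))
    if not clipped:
        return []
    # Overlapping boxes are counted twice, which only pushes the decision
    # toward de-identifying the whole image (the more conservative outcome).
    covered = sum((r - l) * (b - t) for l, t, r, b in clipped)
    if covered >= full_threshold * width * height:
        return [(0, 0, width, height)]
    return clipped
```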

The de-identification processing unit (140) performs de-identification processing on the de-identification area determined by the de-identification area determination unit (130).

That is, the de-identification processing unit (140) de-identifies the de-identification area of each image extracted from the PDF file, based on the coordinate information confirmed by the coordinate verification unit (122) of the image extraction unit (120).

The de-identification processing can be performed by removing, replacing, modifying, adding to, or masking the de-identification area of each of the at least one image, as determined by the de-identification area determination unit (130). Removal means deleting the image concerned, while replacement, modification, addition, and masking mean operations such as covering the image containing personal information with another object (e.g., a black shape) or pixelating it.
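
Masking and pixelation of a rectangular area can be sketched on a plain grayscale pixel grid as follows. The representation (a list of rows of integer intensities) and the function names `mask_region` and `pixelate_region` are assumptions chosen to keep the example self-contained; a real system would operate on decoded image data from an imaging library.

```python
from typing import List, Tuple

Image = List[List[int]]           # rows of grayscale pixel values
Box = Tuple[int, int, int, int]   # (left, top, right, bottom), exclusive end


def mask_region(image: Image, box: Box, fill: int = 0) -> Image:
    """Return a copy of the image with the box overwritten by `fill`."""
    left, top, right, bottom = box
    out = [row[:] for row in image]
    for y in range(max(top, 0), min(bottom, len(out))):
        for x in range(max(left, 0), min(right, len(out[y]))):
            out[y][x] = fill
    return out


def pixelate_region(image: Image, box: Box, block: int = 2) -> Image:
    """Return a copy with each block-sized tile in the box set to its mean.

    Assumes the box lies entirely inside the image.
    """
    left, top, right, bottom = box
    out = [row[:] for row in image]
    for by in range(top, bottom, block):
        for bx in range(left, right, block):
            ys = range(by, min(by + block, bottom))
            xs = range(bx, min(bx + block, right))
            values = [image[y][x] for y in ys for x in xs]
            mean = sum(values) // len(values)
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out
```

Both functions overwrite the pixel values themselves rather than placing a separate object on top, so the original content in the area is no longer present in the returned image.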

The de-identification application unit (150) saves each image that has been de-identified by the de-identification processing unit (140) through the file generator (not shown) for that image's file format, and then updates the corresponding image portion of the PDF file to be de-identified with each de-identified image and saves the file.

That is, the de-identification application unit (150) saves the images processed by the de-identification processing unit (140) through the file generator for the relevant file format, according to whichever of the TIFF, JPEG, PNG, and BMP file formats was confirmed by the image file format verification unit (121) of the image extraction unit (120).

For example, if the file format of the image to be processed, as confirmed by the image file format verification unit (121), is TIFF, the de-identified image is saved through the TIFF file generator, and if the file format of the image to be processed is JPEG, the de-identified image is saved through the JPEG file generator.
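
Selecting a file generator by format amounts to a lookup keyed on the confirmed format. The sketch below shows only that dispatch; the encoders are placeholders that tag the data instead of producing real TIFF, JPEG, PNG, or BMP files, and the names `FILE_GENERATORS` and `save_deidentified` are hypothetical.

```python
from typing import Callable, Dict

Encoder = Callable[[bytes], bytes]


def _stub_encoder(tag: str) -> Encoder:
    """Build a placeholder encoder; a real generator would encode the
    pixels in the named format instead of prefixing a tag."""
    def encode(pixels: bytes) -> bytes:
        return tag.encode("ascii") + b":" + pixels
    return encode


# One generator per format named in the text.
FILE_GENERATORS: Dict[str, Encoder] = {
    fmt: _stub_encoder(fmt) for fmt in ("TIFF", "JPEG", "PNG", "BMP")
}


def save_deidentified(pixels: bytes, original_format: str) -> bytes:
    """Encode a de-identified image with the generator for its original format."""
    try:
        generator = FILE_GENERATORS[original_format.upper()]
    except KeyError:
        raise ValueError(f"unsupported image format: {original_format!r}") from None
    return generator(pixels)
```

Keeping the original format means the de-identified image can replace the original image in the PDF without changing how that image is stored.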

The de-identification editing unit (160) provides an editing function for cases where correction, deletion, or addition is needed after the at least one image de-identified through the de-identification application unit (150) has been updated in the corresponding image portion of the PDF file to be de-identified. Through administrator operation, only the desired images among the de-identified images can be selected and edited.

This improves work efficiency, because the administrator can select and process only the image portions that need it, without returning the PDF file to its original state and repeating the work from the beginning.

FIG. 3 is a diagram showing the hardware structure of the image de-identification system according to an embodiment of the present invention.

As illustrated in FIG. 3, the hardware structure of the image de-identification system (100) includes a central processing unit (1000), a memory (2000), a user interface (3000), a database interface (4000), a network interface (5000), a web server (6000), and the like.

The user interface (3000) provides the user with an input and output interface by means of a graphical user interface (GUI).

The database interface (4000) provides an interface between the database and the hardware structure. The network interface (5000) provides a network connection between the devices held by users.

The web server (6000) provides a means for users to access the hardware structure over a network. Most users can use the image de-identification system (100) by connecting to the web server remotely.

Each step of the configuration or method described above may be implemented as computer-readable code on a computer-readable recording medium, or transmitted via a transmission medium. A computer-readable recording medium is a data storage device capable of storing data that can be read by a computer system.

Examples of computer-readable recording media include, but are not limited to, databases, ROM, RAM, CD-ROMs, DVDs, magnetic tape, floppy disks, and optical data storage devices. Transmission media may include carrier waves transmitted over the Internet or over various types of communication channels. A computer-readable recording medium may also be distributed across network-coupled computer systems so that the computer-readable code is stored and executed in a distributed manner.

In addition, at least one of the components applied to the present invention may include, or be implemented by, a processor such as a central processing unit (CPU) or a microprocessor that performs the respective function, and two or more of the components may be combined into a single component that performs all of the operations or functions of the combined components. Part of the operation of at least one of the components applied to the present invention may also be performed by another of the components. Communication between the components may be performed through a bus (not shown).

Next, an embodiment of the method for de-identifying images containing personal information in a document according to the present invention, configured as described above, is described in detail with reference to FIG. 4. The order of the steps of the method may be changed depending on the usage environment or by a person skilled in the art.

FIG. 4 is a flowchart showing in detail the operation of the method for de-identifying images containing personal information in a document according to an embodiment of the present invention.

As illustrated in FIG. 4, the image de-identification system (100) collects, over a network, PDF files on various topics that are to be subjected to image de-identification processing from the plurality of external servers (200), its own database (300), and individual users (S100).

In addition to collecting PDF files, the image de-identification system (100) may be given PDF files together with a request for image de-identification processing.

Having collected the PDF files to be de-identified, or having been requested to process them, in step S100, the image de-identification system (100) extracts at least one image containing personal information from each PDF file, and checks which of the TIFF, JPEG, PNG, and BMP file formats each image is in and what the coordinate information of each image is (S200).

As described above, an image containing personal information is extracted by inputting the PDF file to be de-identified into the artificial intelligence model for extracting personal information, which has learned the features of images containing personal information.

Next, the image de-identification system (100) determines the de-identification area in the at least one image extracted in step S200 (S300). That is, all or part of each image containing personal information extracted from the PDF file is determined to be the de-identification area.

The image de-identification system (100) then performs de-identification processing on the de-identification area determined in step S300 (S400). That is, based on the coordinate information confirmed in step S200, the de-identification area determined in step S300 is removed, replaced, modified, added to, or masked.

The image de-identification system (100) then determines whether the de-identification processing of the images containing personal information extracted from the PDF file in step S400 has finished (S500), and when it has finished, saves each de-identified image through the file generator for that image's format (S600).

Next, the image de-identification system (100) updates the corresponding image portion of the PDF file to be de-identified with each de-identified image saved through the file generator for its format, and saves the file (S700).

After the de-identified images have been updated and saved in the original PDF file in step S700, if editing is needed for correction, deletion, or addition, the image de-identification system (100) selects and edits only the desired images among the de-identified images in accordance with the administrator's operation (S800).
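
The flow from S200 to S700 can be summarized in a short sketch. Everything here is a simplification under stated assumptions: `PdfDocument` is a hypothetical stand-in that maps image identifiers to image bytes, and the extraction, de-identification, and saving steps are passed in as functions rather than implemented, since the text leaves their internals open.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class ExtractedImage:
    image_id: str      # identifies the image's position in the PDF
    file_format: str   # "TIFF", "JPEG", "PNG" or "BMP", confirmed in S200
    data: bytes


@dataclass
class PdfDocument:
    # Hypothetical stand-in for a PDF: image id -> stored image bytes.
    images: Dict[str, bytes] = field(default_factory=dict)


def run_pipeline(
    pdf: PdfDocument,
    extract: Callable[[PdfDocument], Iterable[ExtractedImage]],
    deidentify: Callable[[ExtractedImage], bytes],
    save: Callable[[bytes, str], bytes],
) -> List[str]:
    """Run S200 to S700 on one PDF and return the ids of the updated images."""
    updated = []
    for image in extract(pdf):                        # S200: extract, check format
        processed = deidentify(image)                 # S300 to S500: decide area, process
        stored = save(processed, image.file_format)   # S600: format-specific save
        pdf.images[image.image_id] = stored           # S700: update the PDF in place
        updated.append(image.image_id)
    return updated
```

Returning the identifiers of the updated images gives a later editing step (S800) a way to target only selected images instead of reprocessing the whole file.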

As described above, the present invention can prevent exposure of personal information by fundamentally ruling out the weakness of the conventional approach, in which information hidden by inserting an opaque object over an existing image can easily be uncovered by removing that object with an editing tool. It also reduces the time and cost required to de-identify the personal information contained in images within a PDF file, and improves work efficiency because, when correction or editing is needed, only the necessary image portions can be selected and processed without repeating the work from the beginning.

In the attached drawings, components that are unrelated or only loosely related to the technical idea of the present invention are shown in simplified form or omitted, so that the technical idea can be expressed more clearly.

The configuration and features of the present invention have been described above on the basis of embodiments of the invention, but the invention is not limited to them. It will be apparent to those skilled in the art to which the invention pertains that various changes or modifications can be made within the spirit and scope of the invention, and such changes or modifications therefore fall within the scope of the appended claims.

100: System for de-identifying images containing personal information in documents
110: Processing target file collection unit
120: Image extraction unit
121: Image file format verification unit
122: Coordinate verification unit
130: De-identification area determination unit
140: De-identification processing unit
150: De-identification application unit
160: De-identification editing unit
200: External server
300: Database
400: User terminal

Claims

An image extraction unit that inputs a PDF file to be de-identified into an artificial intelligence model for extracting personal information, which is a deep learning model that has learned the characteristics of images containing personal information, and extracts at least one image containing personal information, and at the same time verifies coordinate information of at least one image extracted, and verifies which file format it is among TIFF, JPEG, PNG, and BMP;
A de-identification area determination unit that determines a de-identification area in the at least one extracted image;
A de-identification processing unit that performs de-identification processing by removing, replacing, modifying, adding to, or masking the de-identification area of each of the at least one image, based on the coordinate information confirmed by the image extraction unit;
A de-identification application unit that saves the at least one image that has undergone the de-identification processing through the file generator for the corresponding file format, according to whichever of the TIFF, JPEG, PNG, and BMP file formats was confirmed by the image extraction unit, and updates the corresponding image portion of the PDF file to be de-identified with the at least one image that has undergone the de-identification processing; and
A system for de-identifying images containing personal information in a document, characterized in that it includes a de-identification editing unit which, when an image was not de-identified and an addition, modification, or deletion is therefore required after the at least one de-identified image has been updated in the corresponding image portion of the PDF file to be de-identified, selects and edits only some of the images for addition, modification, or deletion, thereby reducing the time required for the de-identification task and increasing its speed, because only the necessary image portions are selected and edited without returning the PDF file to its original state and repeating the task from the beginning.

The system for de-identifying images containing personal information in a document according to claim 1, characterized in that the personal information is information that can identify an individual, including name, address, phone number, resident registration number, email address, credit card number, date of birth, gender, occupation, income level, religion, political orientation, medical information, photographs, financial information, purchase information, and location information.

(Deleted)

As performed in an image de-identification system containing personal information in a document as described in claim 1,
An image extraction step of inputting a PDF file to be de-identified into an artificial intelligence model for extracting personal information, which is a deep learning model that has learned the characteristics of images containing personal information, to extract at least one image containing personal information, and at the same time confirming the coordinate information of the at least one image extracted, and confirming which file format it is among TIFF, JPEG, PNG, and BMP;
A de-identification area determination step of determining a de-identification area in the at least one extracted image;
A de-identification processing step of performing de-identification processing by removing, replacing, modifying, adding to, or masking the de-identification area of each of the at least one image, based on the coordinate information confirmed in the image extraction step;
A de-identification application step of saving the at least one image that has undergone the de-identification processing through the file generator for the corresponding file format, according to whichever of the TIFF, JPEG, PNG, and BMP file formats was confirmed in the image extraction step, and updating the corresponding image portion of the PDF file to be de-identified with the at least one image that has undergone the de-identification processing; and
A method for de-identifying images containing personal information in a document, characterized in that it includes a de-identification editing step in which, when an image was not de-identified and an addition, modification, or deletion is therefore required after the at least one de-identified image has been updated in the corresponding image portion of the PDF file to be de-identified, only some of the images are selected and edited for addition, modification, or deletion, thereby reducing the time required for the de-identification task and increasing its speed, because only the necessary image portions are selected and edited without returning the PDF file to its original state and repeating the task from the beginning.

(Deleted)