KR102811004B1

KR102811004B1 - Apparatus for generating target gas detection model based on CNN learning and method thereof

Info

Publication number: KR102811004B1
Application number: KR1020210135617A
Authority: KR
Inventors: 이정혜; 임치현; 김성일; 김예진; 김세원
Original assignee: 울산과학기술원
Priority date: 2021-10-13
Filing date: 2021-10-13
Publication date: 2025-05-22
Anticipated expiration: 2041-10-13
Also published as: KR20230052490A

Abstract

The present invention relates to an apparatus and method for generating a target gas detection model based on CNN learning. The invention provides a mixed gas detection method comprising: collecting, for a plurality of gas samples in each of which at least one gas is mixed, time series data of different time lengths observed from a multi-channel sensor array; finding the shortest of the time lengths of the collected time series data and generating a plurality of candidate length values whose maximum is that shortest time length; applying an L×M time window (L: the candidate length value, M: the number of channels) corresponding to each candidate length value to the time series data and sliding it by a set time unit to produce a plurality of L×M matrices from the time series data; training a CNN-based classification model using the matrices produced from each gas sample's time series data as input values and a label indicating the presence or absence of a target gas in that gas sample as the output value; and determining an optimal candidate length value from among the plurality of candidate length values based on the results of training the classification model for each candidate length value.
According to the present invention, a sufficient amount of training data can be secured from a small amount of time series data, which improves the performance of a classification model for target gas detection.

Description

{Apparatus for generating target gas detection model based on CNN learning and method thereof}

The present invention relates to an apparatus and method for generating a target gas detection model based on CNN learning, and more specifically to a mixed gas detection apparatus and method that can improve the performance of a classification model by securing a sufficient amount of sample data from a small amount of time series data.

Predicting the types of gas present in a mixed gas sample with a learning model requires a large amount of time series data of many different kinds.

However, it is not feasible to secure a large amount of varied time series data in a short period of time. With only a small amount of time series data, not enough training data is available, so a learning model built with the conventional, commonly used data construction schemes performs poorly.

FIG. 1 illustrates a conventional scheme for constructing training data using PCA.

FIG. 1 corresponds to a scheme in which the rows of the time series data are concatenated (CONCAT) to form a vector that represents that time series. Each sample data set shown in FIG. 1 is assumed to be sensor data observed from a gas sample using a multi-channel sensor array.

Each row of a sample data set represents a time slot, and each column represents one sensor channel of the sensor array. When the sensor array has four channels, as in FIG. 1, each sample data set yields four sensing values per time slot. Note that the sample data sets differ in time length. This reflects the fact that the collection time of sample data cannot always be the same, because it depends on the observation conditions, the field environment, the configuration settings, and so on.

Configuration scheme 1 shown in FIG. 1 takes a time interval of the same size from each of several sample data sets of different time lengths, concatenates (CONCAT) the data in that interval into a single multidimensional vector, and then reduces its dimensionality using principal component analysis (PCA) to reconstruct the data. The reconstructed data is used as input data (training data) for training the classification model.

For example, a time interval of two rows (two time slots of four channels each) is taken from sample data 1 and flattened into a single line to form an 8-dimensional vector x1, whose dimensionality is then reduced to 4 using PCA. In this way, vectors x1, x2, and x3 of the same dimensionality are obtained for sample data 1, 2, and 3, respectively.

As a result, under configuration scheme 1, sample data 1, 2, and 3 of different time lengths are ultimately converted into vectors x1, x2, and x3 of the same dimensionality, and the converted data is used as input data for building and training the learning model.

Because configuration scheme 1 obtains training data from time blocks, it has the advantage that sensor characteristics that vary over time are reflected in model training. However, the concatenation produces high-dimensional data, so an additional dimensionality reduction step is required. In addition, only as many training items are produced as there are collected sample data sets, so only a very small amount of training data can be obtained.
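The concatenate-then-reduce idea of configuration scheme 1 can be sketched as follows. This is only an illustrative sketch under stated assumptions: the data is random, the helper names `concat_block` and `pca_reduce` are hypothetical, PCA is done with a plain SVD, and six synthetic samples are used (rather than the three in FIG. 1) so that four principal components are well defined.

```python
import numpy as np

def concat_block(sample, n_rows):
    """Flatten the first n_rows time slots of a (T, M) sample into one vector."""
    return np.asarray(sample, dtype=float)[:n_rows].reshape(-1)

def pca_reduce(vectors, n_components):
    """Project row vectors onto their top principal components via SVD."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T

rng = np.random.default_rng(0)
# Hypothetical data: six samples of different lengths, M = 4 channels each.
lengths = [3, 4, 2, 5, 6, 3]
samples = [rng.random((t, 4)) for t in lengths]

vectors = np.stack([concat_block(s, n_rows=2) for s in samples])  # shape (6, 8)
reduced = pca_reduce(vectors, n_components=4)                     # shape (6, 4)
print(vectors.shape, reduced.shape)
```

Note that the number of training vectors equals the number of samples (six here), which is the limitation described above.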

FIG. 2 schematically illustrates a conventional scheme for constructing training data using differencing (scheme 2).

In FIG. 2, the time series data is differenced and each row is treated as an independent sample. Unlike FIG. 1, several training items are obtained from each sample data set, so the amount of training data can be increased considerably.

Configuration scheme 2 can secure a large amount of data, but because the rows within the same sensor data are in fact very highly correlated, it is difficult to regard each row as an independent input even after differencing.
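The differencing scheme and its weakness can be illustrated with a small sketch. The sensor response used here (a smooth first-order rise with hypothetical per-channel time constants) is an assumption chosen for illustration, not data from the source; the helper names are likewise hypothetical.

```python
import numpy as np

def difference_rows(sample):
    """First-order difference along time: row t becomes x[t+1] - x[t]."""
    return np.diff(np.asarray(sample, dtype=float), axis=0)

def lag1_autocorr(x):
    """Lag-1 autocorrelation of a 1-D series."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-1], x[1:])[0, 1]

t = np.arange(20)[:, None]               # 20 time slots
tau = np.array([3.0, 5.0, 8.0, 12.0])    # hypothetical per-channel time constants
sample = 1.0 - np.exp(-t / tau)          # synthetic (20, 4) sensor response

diffs = difference_rows(sample)          # 19 rows, each treated as one sample
print(diffs.shape)
print(lag1_autocorr(diffs[:, 0]))        # close to 1: rows remain strongly correlated
```

One recording yields 19 rows instead of a single vector, but for a smooth response like this one, consecutive differenced rows are still almost perfectly correlated.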

Therefore, a new time series classification methodology is needed that overcomes the shortcomings of the schemes of FIGS. 1 and 2 while delivering good performance from only a small amount of time series data.

The background art of the present invention is disclosed in Korean Patent No. 10-1852074 (published April 25, 2018).

An object of the present invention is to provide an apparatus and method for generating a target gas detection model based on CNN learning that can secure a sufficient amount of training data from a small amount of time series data and thereby improve the performance of a classification model for target gas detection.

The present invention provides a method for generating a target gas detection model using an apparatus for generating a target gas detection model based on CNN learning, the mixed gas detection method comprising: collecting, for a plurality of gas samples in each of which at least one gas is mixed, time series data of different time lengths observed from a multi-channel sensor array; finding the shortest time length among the time lengths of the collected time series data and generating a plurality of candidate length values whose maximum is the shortest time length; applying an L×M time window (L: the candidate length value, M: the number of channels) corresponding to the candidate length value to the time series data and sliding it by a set time unit to produce a plurality of L×M matrices from the time series data; training a CNN-based classification model using the plurality of matrices produced from the time series data of each gas sample as input values and a label indicating the presence or absence of a target gas in the gas sample as the output value; and determining an optimal candidate length value from among the plurality of candidate length values based on the results of training the classification model for each of the candidate length values.

The sensor array may be implemented with a plurality of sensors selected from a metal-oxide (MOX) sensor, an electrochemical sensor, a photoionization detector (PID), a temperature sensor, and a humidity sensor.

The step of generating the plurality of candidate length values may generate a list of candidate length values (L = 1,2,…,l) running from 1 to the shortest time length (l).

The target gas may include at least one of toluene, dimethyl sulfide (DMS), and butyl acetate, and the classification model may include at least one of a first classification model trained to classify whether toluene is present in a gas sample, a second classification model trained to classify whether dimethyl sulfide is present, and a third classification model trained to classify whether butyl acetate is present.

The step of producing the plurality of matrices from the time series data may apply the L×M time window to the time series data and slide it one time slot at a time while producing the matrices.

The mixed gas detection method may further comprise: acquiring time series data for an arbitrary gas sample from the sensor array; producing, from the acquired time series data, a plurality of L'×M matrices corresponding to the optimal candidate length value (L'); and sequentially inputting the produced matrices into the classification model to classify and report, at each time step, whether the target gas to be detected is present.
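The detection step described above might be sketched as follows. Here `classify` is a hypothetical callable standing in for the trained classification model (it takes one L'×M window and returns a presence score in [0, 1]), and the 0.5 threshold and the stand-in scoring rule are assumptions made only for this illustration.

```python
import numpy as np

def detect_over_time(series, L_opt, classify, threshold=0.5):
    """Slide an L_opt-row window over a (T, M) series and classify each window."""
    series = np.asarray(series, dtype=float)
    decisions = []
    for start in range(series.shape[0] - L_opt + 1):
        window = series[start:start + L_opt]            # one L_opt x M matrix
        decisions.append(bool(classify(window) >= threshold))
    return decisions

# Stand-in for a trained model (assumption): the score grows with the window mean.
def stand_in_classifier(window):
    return float(window.mean()) / 6.0

# Synthetic 6 x 4 series whose readings rise over time.
series = np.arange(6)[:, None] * np.ones((1, 4))
decisions = detect_over_time(series, L_opt=2, classify=stand_in_classifier)
print(decisions)
```

Each entry of `decisions` corresponds to one window position, so the presence decision is refreshed every time the window advances by one time slot.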

The present invention also provides a mixed gas detection apparatus that utilizes a small amount of sensor data, comprising: a data collection unit that collects, for a plurality of gas samples in each of which at least one gas is mixed, time series data of different time lengths observed from a multi-channel sensor array; a candidate value generation unit that finds the shortest time length among the time lengths of the collected time series data and generates a plurality of candidate length values whose maximum is the shortest time length; a data processing unit that applies an L×M time window (L: the candidate length value, M: the number of channels) corresponding to the candidate length value to the time series data and slides it by a set time unit to produce a plurality of L×M matrices from the time series data; a learning unit that trains a CNN-based classification model using the plurality of matrices produced from the time series data of each gas sample as input values and a label indicating the presence or absence of a target gas in the gas sample as the output value; and a control unit that determines an optimal candidate length value from among the plurality of candidate length values based on the results of training the classification model for each of the candidate length values.

The mixed gas detection apparatus may further comprise a data acquisition unit that acquires time series data for an arbitrary gas sample from the sensor array, and a classification unit that sequentially inputs into the classification model the plurality of L'×M matrices produced by the data processing unit from the acquired time series data according to the optimal candidate length value (L'), and classifies and reports, at each time step, whether the target gas to be detected is present.

According to the present invention, a sufficient amount of training data can be secured from a small amount of time series data, which improves the performance of a classification model for target gas detection. In addition, because the input data can be organized as matrices rather than vectors, CNN training becomes possible.

Furthermore, the present invention applies time windows of several candidate lengths to time series data of different time lengths observed from a sensor array, slides each window to obtain a large amount of training data, and trains a model separately for each candidate length, so that the candidate length giving the highest classification performance can be determined.

In addition, a time window of the determined optimal candidate length is applied to the time series data of the gas sample under inspection and slid along it to generate input data, which is then fed to the trained classification model to accurately detect whether the target gas is present.

FIG. 1 illustrates a conventional scheme for constructing training data using PCA.
FIG. 2 illustrates a conventional scheme for constructing training data using differencing.
FIG. 3 shows the configuration of an apparatus for generating a target gas detection model according to an embodiment of the present invention.
FIG. 4 illustrates a model generation method using the apparatus of FIG. 3.
FIG. 5 illustrates a data processing scheme according to an embodiment of the present invention.
FIG. 6 illustrates the concept of detecting a target gas using a classification model generated according to an embodiment of the present invention.
FIG. 7 shows an example of time series sensor data collected according to an embodiment of the present invention.
FIG. 8 shows the label values for each sample data set in an embodiment of the present invention.
FIGS. 9 and 10 summarize the conventional data configuration schemes 1 and 2 shown in FIGS. 1 and 2.
FIG. 11 illustrates a data configuration scheme according to an embodiment of the present invention.
FIG. 12 shows the number of samples derived in each experiment.
FIGS. 13 to 15 show the classification performance of classification models 1, 2, and 3 in each experiment.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice it. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described herein. To describe the invention clearly, parts unrelated to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between. Likewise, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

The present invention relates to a technique for generating a target gas detection model based on CNN learning. A mixed gas prepared by mixing at least one gas (hereinafter, a "gas sample") is observed through a multi-channel sensor array to collect time series data. A time window of a set length is applied to the time series data and slid along it to produce a plurality of sample data items, and these are used as input data to generate a CNN-based classification model that detects whether a target gas is present.

FIG. 3 shows the configuration of an apparatus for generating a target gas detection model according to an embodiment of the present invention, and FIG. 4 illustrates a model generation method using the apparatus of FIG. 3.

Referring to FIGS. 3 and 4, the apparatus (100) for generating a target gas detection model according to an embodiment of the present invention includes a data collection unit (110), a candidate value generation unit (120), a data processing unit (130), a learning unit (140), and a control unit (150), and may further include a data acquisition unit (160), a classification unit (170), and an output unit (180).

The operation of each unit (110, 120, 130, 140, 160, 170, 180) and the data flow between the units may be controlled by the control unit (150).

First, the data collection unit (110) collects raw data for training the classification model. Specifically, the data collection unit (110) collects, for a plurality of gas samples in each of which at least one gas is mixed, time series data of different time lengths observed from a multi-channel sensor array (S410).

The data collection unit (110) may also collect information on the at least one gas actually present in each gas sample (the label value corresponding to the ground truth). A collected gas sample may contain no target gas at all, or it may contain at least one target gas.

The data collection unit (110) can obtain time series data from each of many kinds of gas samples mixed in different combinations, which improves the performance of detecting the target gas in an unknown mixed gas sample under inspection.

In an embodiment of the present invention, the target gas may include at least one of toluene, dimethyl sulfide (DMS), and butyl acetate. The target gas is of course not limited to these examples, and the apparatus and technique according to embodiments of the present invention may be used to detect a wider variety of target gases.

The sensor array may be implemented with a plurality of sensors selected from a metal-oxide (MOX) sensor, an electrochemical sensor, a photoionization detector (PID), a temperature sensor, and a humidity sensor.

Table 1 below shows an example configuration of the sensor array.

Information on array sensors

Sensor number   Target gases                               Sensor type
1               VOCs, air contaminants                     MOX
2               Amine, sulfurous gas                       MOX
3               VOCs with ionization potential < 10.6 eV   PID
4               Formaldehyde                               Electrochemical
5               H₂S                                        Electrochemical
6               SO₂                                        Electrochemical
7               NO₂                                        Electrochemical

In addition to the sensor types listed, various other sensors used to detect hazardous substances, odors, accidents, and the like may of course be added to the array. All sensor values may be converted (normalized) through an ADC (analog-to-digital converter) into values of 0-4000 corresponding to 0-5 V.
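The 0-5 V to 0-4000 conversion can be pictured with a short sketch. The source gives only the two ranges, so the linear mapping, the clipping of out-of-range voltages, and the function name `volts_to_counts` are all assumptions made for illustration.

```python
def volts_to_counts(volts, v_max=5.0, count_max=4000):
    """Linearly map a voltage in [0, v_max] to an integer in [0, count_max].

    Assumption: the mapping is linear and out-of-range inputs are clipped.
    """
    clipped = min(max(volts, 0.0), v_max)
    return round(clipped / v_max * count_max)

print(volts_to_counts(0.0), volts_to_counts(2.5), volts_to_counts(5.0))
```

Under these assumptions one count corresponds to 1.25 mV, so every channel ends up on the same numeric scale regardless of sensor type.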

FIG. 5 shows, by way of example, three sample data sets (time series data) observed from three gas samples whose ground truth is known. Each sample data set contains sensing values along a time axis and a channel axis.

Each row of a sample data set represents a time slot (the unit of measurement time), and each column represents one sensor channel of the sensor array. When the sensor array has four channels, each sample data set yields four sensing values per time slot. Sample data 1 is time series data observed from a 4-channel sensor array over three time slots.

That is, sample data 1 is time series data obtained over three time slots, sample data 2 over four time slots, and sample data 3 over two time slots.

The sample data sets thus differ in time length. As explained in the background section, this reflects the fact that the data collection length varies with the field environment, observation conditions, configuration settings, and so on.

For convenience of explanation, FIG. 5 shows sample data of very short time length, but sample data observed in practice may consist of tens to a hundred or more time slots.

Next, the candidate value generation unit (120) finds the shortest time length among the time lengths of the collected time series data and generates a plurality of candidate length values whose maximum is the shortest time length (l) (S330). The candidate value generation unit (120) may generate a list of candidate length values (L = 1,2,…,l) running from 1 to the shortest time length (l).

In FIG. 5, the time lengths of sample data 1, 2, and 3 are 3, 4, and 2, respectively, so sample data 3 has the shortest time length. Since the time length of sample data 3 is 2, candidate length values from 1 to 2 are generated, defined as L∈{1,2}.

As another example, if the time lengths of sample data 1, 2, and 3 were 100, 25, and 40, respectively, candidate length values from 1 to 25 would be generated, defined as L∈{1,2,…,25}.
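The candidate generation rule can be written in a few lines. This is a minimal sketch; the function name `candidate_lengths` is hypothetical, and the inputs are the time lengths quoted in the two examples above.

```python
def candidate_lengths(series_lengths):
    """Return the candidate window lengths 1, 2, ..., l, where l is the shortest series."""
    shortest = min(series_lengths)
    return list(range(1, shortest + 1))

print(candidate_lengths([3, 4, 2]))           # the FIG. 5 example
print(len(candidate_lengths([100, 25, 40])))  # 25 candidates, from 1 to 25
```

Capping the candidates at the shortest length guarantees that every collected series yields at least one window for every candidate.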

Thereafter, the data processing unit (130) applies an L×M time window (L: the candidate length value, M: the number of channels) corresponding to the candidate length value to the time series data and slides it by a set time unit, producing a plurality of L×M matrices from the time series data (S440).

FIG. 5 shows a 2×4 window applied to each sample data set and slid one time slot at a time, producing a plurality of matrices from each sample data set.

Applying a 2×4 window to sample data 1 and sliding it one time slot at a time produces two 2×4 matrices (X1, X2). In the same way, sample data 2 and 3 produce three 2×4 matrices (X3, X4, X5) and one 2×4 matrix (X6), respectively.

Thus, in the data configuration scheme proposed in the present invention, each matrix obtained by cutting the time series data with the specified time window is regarded as one sample.

Because the sample data sets differ in collected time length, the number of matrices produced from each also differs. FIG. 5 shows the case where, of the candidate length values (L∈{1,2}), L=2 is used, giving a 2×4 window. Applying L=1 in the same way produces 3, 4, and 2 matrices from sample data 1, 2, and 3, respectively.
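The windowing of FIG. 5 can be reproduced with a short sketch. The array contents are random placeholders and the helper name `sliding_windows` is hypothetical; only the shapes (3, 4, and 2 time slots by 4 channels) come from the example above.

```python
import numpy as np

def sliding_windows(series, L, stride=1):
    """Cut a (T, M) series into L x M matrices, moving `stride` time slots per step."""
    series = np.asarray(series)
    T = series.shape[0]
    return [series[s:s + L] for s in range(0, T - L + 1, stride)]

rng = np.random.default_rng(0)
# Placeholder sample data 1, 2, 3 with 3, 4, and 2 time slots and M = 4 channels.
samples = [rng.random((t, 4)) for t in (3, 4, 2)]

counts_L2 = [len(sliding_windows(s, L=2)) for s in samples]
counts_L1 = [len(sliding_windows(s, L=1)) for s in samples]
print(counts_L2, counts_L1)
```

With a stride of one time slot, a series of T slots yields T − L + 1 windows, which matches the counts given in the text for L=2 and L=1.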

If the window is too small, a single window cannot capture data from multiple time points, but the number of matrices produced increases. If the window is too large, a single window captures data from several time points at once, but the number of matrices produced decreases.

However, producing as many samples as possible from limited sample data and using them to train the model can improve training performance and classification accuracy compared with training the model on a small number of samples.

Therefore, in an embodiment of the present invention, to determine the optimal window size, the performance of the classification model is evaluated for each candidate value, and the candidate length value that yields the highest classification performance is selected as the optimal candidate length value.

Specifically, the learning unit (140) trains a CNN-based classification model using the plurality of matrices produced from the time series data of each gas sample as input values and a label value indicating the presence or absence of the target gas in that gas sample as the output value.

A CNN is a model that can learn from matrix-structured input data. An embodiment of the present invention inputs the matrix data into a CNN-based classification model and trains the model so that its output follows the label value corresponding to each input matrix.

That is, when matrices X1 and X2 are the input, the model is trained to follow the ground-truth label of sample data 1; when matrices X3, X4, and X5 are the input, it is trained to follow the ground-truth label of sample data 2; and when matrix X6 is the input, it is trained to follow the ground-truth label of sample data 3.
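The label assignment just described, where every matrix inherits the label of the gas sample it was cut from, might be sketched as follows. The function names, the toy sensor values, and the label values are hypothetical; only the inheritance rule comes from the description.

```python
def sliding_windows(series, window_len, step=1):
    """Cut a T x M time series (a list of T rows) into window_len x M matrices."""
    last_start = len(series) - window_len
    return [series[s:s + window_len] for s in range(0, last_start + 1, step)]


def build_training_set(samples, labels, window_len):
    """Return (matrices, targets, sample_ids); each window inherits its sample's label."""
    matrices, targets, sample_ids = [], [], []
    for idx, (series, label) in enumerate(zip(samples, labels)):
        for window in sliding_windows(series, window_len):
            matrices.append(window)
            targets.append(label)
            sample_ids.append(idx)
    return matrices, targets, sample_ids


# Toy samples with 3, 4 and 2 time slots and 4 channels (values are placeholders)
samples = [[[0.0] * 4 for _ in range(n)] for n in (3, 4, 2)]
labels = [1, 0, 1]  # hypothetical presence labels for samples 1, 2 and 3

X, y, ids = build_training_set(samples, labels, window_len=2)
print(y)    # [1, 1, 0, 0, 0, 1]  -> labels for X1..X6
print(ids)  # [0, 0, 1, 1, 1, 2]
```

Keeping `sample_ids` is a design choice of this sketch rather than something the specification requires; it makes it possible to keep all windows of one gas sample on the same side of a train/test split.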

Then, the control unit (150) determines the optimal candidate length (L') from the results of training the classification model for each of the plurality of candidate length values (S450).

Specifically, the control unit (150) determines the optimal candidate length value from among the plurality of candidate length values (L=1,2) based on the results of training the classification model with the matrix data generated for each candidate length value. For example, if analysis of the classification performance for each candidate length value shows that L=2 performs best, L' is set to 2.
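The selection of L' can be sketched as a simple search over the candidate list. The sketch assumes a caller-supplied `train_and_score` function that trains a model for a given window length and returns a single performance score; that function, its signature, and the scores used in the demonstration are placeholders, not results from the specification.

```python
def candidate_lengths(samples):
    """Candidate window lengths 1..l, where l is the shortest sample length."""
    shortest = min(len(series) for series in samples)
    return list(range(1, shortest + 1))


def select_window_length(samples, labels, train_and_score):
    """Score every candidate length and return (best_length, all_scores)."""
    scores = {}
    for length in candidate_lengths(samples):
        scores[length] = train_and_score(samples, labels, length)
    # On a tie, max() keeps the first (smallest) candidate length
    best = max(scores, key=scores.get)
    return best, scores


# Toy samples with 3, 4 and 2 time slots, so the candidates are [1, 2]
samples = [[[0.0] * 4 for _ in range(n)] for n in (3, 4, 2)]
labels = [1, 0, 1]  # hypothetical labels

# Stand-in for real training: returns made-up scores, for illustration only
placeholder_scores = {1: 0.81, 2: 0.88}

def fake_train_and_score(samples, labels, length):
    return placeholder_scores[length]

best_length, scores = select_window_length(samples, labels, fake_train_and_score)
print(candidate_lengths(samples))  # [1, 2]
print(best_length)                 # 2
```

In practice `train_and_score` would train the CNN on the matrices for that length and evaluate it on held-out data; which metric to maximise is left open here.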

Afterwards, a window corresponding to L=2 is applied to time series data observed from an actual, arbitrary gas sample to obtain a plurality of matrices, and these are input into the trained classification model, so that the presence or absence of the target gas in that gas sample can be accurately detected.

Specifically, after step S450, the data acquisition unit (160) acquires time series data for an arbitrary gas sample from the sensor array (S460). Here, the data acquisition unit (160) can receive the sensor data of each channel in real time while connected to the sensor array over a network.

Next, the data processing unit (130) applies a sliding window of size L'×M (2×4), corresponding to the optimal candidate value (L'=2), to the time series data acquired in step S460 and produces a plurality of L'×M (2×4) matrices (S470).

Next, the classification unit (170) can sequentially input the resulting matrices into the classification model and provide, in real time at each time step, a classification of whether the target gas to be detected is present (S480). For example, the classification model can output 0 if the target gas (e.g., toluene gas) is detected and 1 if it is not.
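Steps S460 to S480 could be sketched as a rolling buffer that emits one classification per full window as sensor rows arrive. The `classify_stream` helper, the stand-in model, and the readings are all hypothetical; a trained CNN would take the place of `dummy_model`, and the meaning of its 0/1 output depends on how the labels were encoded during training.

```python
from collections import deque


def classify_stream(readings, window_len, model):
    """Yield (time_index, prediction) for each full window as readings arrive."""
    buffer = deque(maxlen=window_len)  # keeps only the latest window_len rows
    for t, row in enumerate(readings):
        buffer.append(row)
        if len(buffer) == window_len:
            yield t, model([list(r) for r in buffer])


def dummy_model(window):
    """Stand-in for the trained classifier: thresholds the mean of channel 0."""
    mean_ch0 = sum(row[0] for row in window) / len(window)
    return int(mean_ch0 > 0.5)


# Made-up readings: 4 time slots x 4 channels
readings = [
    [0.1, 0.0, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]

results = list(classify_stream(readings, window_len=2, model=dummy_model))
print(results)  # [(1, 0), (2, 1), (3, 1)]
```

The first output appears only once L' rows have been received; after that, one classification is produced per new time slot.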

Thereafter, the output unit (180) can provide the data output by the classification model in real time to a display or to a network-connected administrator terminal, user terminal, or the like.

FIG. 6 is a diagram illustrating the concept of detecting a target gas using a classification model generated according to an embodiment of the present invention. As shown in FIG. 6, a classification model can be generated for each type of target gas.

For example, the classification model may include at least one of a first classification model (classification model 1) trained to classify the presence or absence of toluene (T) in a gas sample, a second classification model (classification model 2) trained to classify the presence or absence of dimethyl sulfide (DMS), and a third classification model (classification model 3) trained to classify the presence or absence of butyl acetate (BA).

Accordingly, the matrix data produced from the time series data of an arbitrary gas sample under test can be input into classification model 1 to check whether toluene is present in the sample, and the same matrix data can be input into classification models 2 and 3 to check whether dimethyl sulfide and butyl acetate, respectively, are present.

With the approach described above, the present invention can obtain a large amount of input data from sample data of limited length and can structure the input data as matrices rather than vectors, which has the advantage of making CNN training possible.

The following describes the results of training classification models with the data configuration technique according to an embodiment of the present invention and with the existing data configuration techniques shown in FIGS. 1 and 2 (configuration 1, configuration 2), and of comparing their classification performance.

FIG. 7 is a diagram illustrating time series sensor data collected according to an embodiment of the present invention. FIG. 7 shows sensor data acquired over time, for an arbitrary sample, from a sensor array having a total of 11 sensor channels. The time series data serves as the input data in the problem of determining whether a specific gas is contained in a mixed sample.

For the experiment, sensor data were collected for a total of 100 samples, and FIG. 7 shows the data of one of them. Each of the 100 items of sample data has between 40 and 100 rows (time slots). FIG. 7 shows only three columns, that is, the data for just three of the time slots.

FIG. 8 is a diagram showing the label values for each item of sample data in an embodiment of the present invention.

The left side of FIG. 8 expresses as a number the concentration ratio of each target gas actually present in each sample, with 0 meaning that the gas is not present at all. The right side is the result of converting each value on the left into a binary value of '1' or '0' according to whether the corresponding target gas is present or absent.

The label value serves as the output data in the problem of determining whether a specific target gas is contained in a mixed sample. Because the embodiment of the present invention determines whether a target gas is contained in a sample, each value is converted to a binary value. Here, the per-sample label values for toluene are used to train classification model 1, and the label values for DMS and BA can be used to train classification models 2 and 3, respectively.
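The conversion from concentration ratios to binary labels described for FIG. 8 might look like the sketch below. It assumes that '1' marks presence (any ratio above zero) and '0' marks absence; the concentration values are made up and are not taken from FIG. 8.

```python
def binarize_labels(concentrations):
    """Map each concentration ratio to 1 (gas present) or 0 (gas absent)."""
    return {gas: [1 if value > 0 else 0 for value in values]
            for gas, values in concentrations.items()}


# Made-up concentration ratios for four samples (not taken from FIG. 8)
concentrations = {
    "toluene": [0.0, 0.3, 0.5, 0.0],
    "DMS": [0.2, 0.0, 0.5, 0.0],
    "BA": [0.8, 0.7, 0.0, 0.0],
}

labels = binarize_labels(concentrations)
print(labels["toluene"])  # [0, 1, 1, 0] -> targets for classification model 1
print(labels["DMS"])      # [1, 0, 1, 0] -> targets for classification model 2
print(labels["BA"])       # [1, 1, 0, 0] -> targets for classification model 3
```

Each gas gets its own label column, so one set of sensor matrices can be reused to train the three per-gas classifiers.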

FIGS. 9 and 10 are diagrams summarizing the conventional data configuration methods 1 and 2 shown in FIGS. 1 and 2.

FIG. 9 shows the methodologies of the various cases belonging to data configuration method 1 of FIG. 1 (First experiment), of which there are eight in total. Classification models based on logistic regression and on random forest were each used.

FIG. 10 shows the methodologies of the various cases belonging to data configuration method 2 of FIG. 2 (Second experiment), of which there are four in total. Here too, classification models based on logistic regression and on random forest are each used.

FIG. 11 is a diagram explaining the data configuration method according to an embodiment of the present invention.

As shown in FIG. 11, the data configuration method proposed in the present invention (Third experiment) uses classification models with a CNN structure, and there are two cases in total. In the present invention, a baseline CNN and a weighted CNN were used as the classification models. The weighted CNN is a CNN technique that trains with a different weight applied to each class.
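One common way to apply per-class weights during training is a class-weighted binary cross-entropy loss, sketched below. The specification does not state how the weights are chosen or where they enter the loss, so the inverse-frequency weighting and both function names here are assumptions made purely for illustration.

```python
import math


def inverse_frequency_weights(targets):
    """Weight each class by n_samples / (n_classes * class_count) (an assumed scheme)."""
    n = len(targets)
    counts = {c: targets.count(c) for c in set(targets)}
    return {c: n / (len(counts) * k) for c, k in counts.items()}


def weighted_bce(targets, probs, class_weights, eps=1e-12):
    """Class-weighted binary cross-entropy, normalised by the total weight."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        w = class_weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


targets = [1, 0, 0, 0]             # imbalanced toy labels
weights = inverse_frequency_weights(targets)
print(weights[1])                  # 2.0 (the rarer class gets the larger weight)

loss = weighted_bce(targets, [0.5, 0.5, 0.5, 0.5], weights)
```

With these weights, an error on the rarer class contributes more to the loss than an error on the more frequent class, which is the usual motivation for class weighting on imbalanced labels.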

FIG. 12 is a diagram showing the number of samples obtained in each experiment. The results in FIG. 12 show that different numbers of samples are obtained from the same amount of gas-sample data. That is, the proposed data configuration method (Third experiment) can produce far more samples than existing method 1 (First experiment) and secures a number of samples nearly comparable to existing method 2 (Second experiment).

FIGS. 13 to 15 are diagrams showing the classification performance of classification models 1, 2, and 3 in each experiment. FIG. 13 shows the classification performance obtained by applying the samples from each experiment to classification model 1 (toluene detection model), and FIGS. 14 and 15 show the classification performance obtained by applying them to classification model 2 (DMS detection model) and classification model 3 (BA detection model), respectively.

Based on the methodologies described above, a total of 14 models were trained for each target gas, and the performance of each model was measured using AUC (Area Under the Curve) and F1 score as evaluation metrics. In each of the tables in FIGS. 13 to 15, of the 14 columns in total, the first 8 columns show the performance of the 8 methodologies of data configuration method 1 (First experiment), the next 4 columns show the performance of the 4 methodologies of data configuration method 2 (Second experiment), and the last 2 columns show the performance of the 2 methodologies of the data configuration method of the present invention (Third experiment).
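For reference, the two evaluation metrics can be computed as in the sketch below. It assumes that AUC refers to the area under the ROC curve, computed here through its pairwise-ranking equivalent, and that class 1 is treated as the positive class for F1; the labels and scores are toy values, not experimental results.

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall, with class 1 as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A tie between a positive and a negative score counts as half a win
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 0]            # toy ground-truth labels
scores = [0.9, 0.4, 0.35, 0.1]   # toy model scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]

auc = roc_auc(y_true, scores)
f1 = f1_score(y_true, y_pred)
print(auc)  # 0.75
```

AUC is threshold-free while F1 depends on the chosen decision threshold, which is one reason the two metrics can rank methods differently.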

For toluene, the CNN-based classification model showed the best performance in terms of both AUC and F1 score (FIG. 13). For DMS, logistic regression in the second experiment showed the highest performance in terms of AUC, but the CNN-based classification model of the present invention also showed meaningful performance (FIG. 14). For butyl acetate, the CNN-based classification model of the present invention showed the highest performance in terms of F1 score (FIG. 15).

According to the present invention as described above, a sufficient amount of training data can be obtained from a small amount of time series data, improving the performance of the classification model for target gas detection, and because the input data can be structured as matrices rather than vectors, CNN training becomes possible.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and those skilled in the art will understand that various modifications and other equivalent embodiments are possible. Accordingly, the true scope of technical protection of the present invention should be determined by the technical idea of the appended claims.

100: Target gas detection model generation device
110: Data collection unit 120: Candidate value generation unit
130: Data processing unit 140: Learning unit
150: Control unit 160: Data acquisition unit
170: Classification unit 180: Output unit

Claims

A method for generating a target gas detection model using a target gas detection model generation device based on CNN learning, the method comprising:
collecting time series data of different time lengths observed from a sensor array of a multi-channel structure for a plurality of gas samples in which at least one gas is mixed;
searching for the shortest time length among the time lengths of the plurality of collected time series data and generating a plurality of candidate length values having the shortest time length as the maximum value, wherein a list of candidate length values (L = 1,2,…,l) from 1 to the shortest time length (l) is generated;
producing a plurality of matrices of size L×M from the time series data by applying a time window of size L×M (L: the candidate length value, M: the number of channels) corresponding to the candidate length value to the time series data and sliding it by a set time unit;
training a CNN-based classification model using the plurality of matrices produced from the time series data of each gas sample as input values and a label value indicating the presence or absence of a target gas in the gas sample as an output value; and
determining an optimal candidate length value from among the plurality of candidate length values based on the results of training the classification model for each of the plurality of candidate length values.

The method of claim 1, wherein the sensor array includes a plurality of sensors selected from among a metal oxide (MOX) sensor, an electrochemical sensor, a photoionization detector (PID), a temperature sensor, and a humidity sensor.

(Deleted)

The method of claim 1, wherein the target gas includes at least one of toluene, dimethyl sulfide (DMS), and butyl acetate, and
the classification model includes at least one of a first classification model trained to classify the presence or absence of toluene in a gas sample, a second classification model trained to classify the presence or absence of dimethyl sulfide, and a third classification model trained to classify the presence or absence of butyl acetate.

The method of claim 1, wherein producing the plurality of matrices from the time series data comprises applying the time window of size L×M to the time series data and sliding it one time slot at a time.

The method of claim 1, further comprising:
acquiring time series data for an arbitrary gas sample from the sensor array;
producing, from the acquired time series data, a plurality of matrices of size L'×M corresponding to the optimal candidate length value (L'); and
sequentially inputting the plurality of produced matrices into the classification model to provide, at each time step, a classification of whether the target gas to be detected is present.

A device for generating a target gas detection model based on CNN learning, the device comprising:
a data collection unit that collects time series data of different time lengths observed from a sensor array of a multi-channel structure for a plurality of gas samples in which at least one gas is mixed;
a candidate value generation unit that searches for the shortest time length among the time lengths of the plurality of collected time series data and generates a plurality of candidate length values having the shortest time length as the maximum value, wherein a list of candidate length values (L = 1,2,…,l) from 1 to the shortest time length (l) is generated;
a data processing unit that produces a plurality of matrices of size L×M from the time series data by applying a time window of size L×M (L: the candidate length value, M: the number of channels) corresponding to the candidate length value to the time series data and sliding it by a set time unit;
a learning unit that trains a CNN-based classification model using the plurality of matrices produced from the time series data of each gas sample as input values and a label value indicating the presence or absence of a target gas in the gas sample as an output value; and
a control unit that determines an optimal candidate length value from among the plurality of candidate length values based on the results of training the classification model for each of the plurality of candidate length values.

The device of claim 7, wherein the sensor array includes a plurality of sensors selected from among a metal oxide (MOX) sensor, an electrochemical sensor, a photoionization detector (PID), a temperature sensor, and a humidity sensor.

(Deleted)

The device of claim 7, wherein the target gas includes at least one of toluene, dimethyl sulfide (DMS), and butyl acetate, and
the classification model includes at least one of a first classification model trained to classify the presence or absence of toluene in a gas sample, a second classification model trained to classify the presence or absence of dimethyl sulfide, and a third classification model trained to classify the presence or absence of butyl acetate.

The device of claim 7, wherein the data processing unit produces the plurality of matrices by applying the time window of size L×M to the time series data and sliding it one time slot at a time.

The device of claim 7, further comprising:
a data acquisition unit that acquires time series data for an arbitrary gas sample from the sensor array; and
a classification unit that sequentially inputs a plurality of matrices of size L'×M, produced by the data processing unit from the acquired time series data in accordance with the optimal candidate length value (L'), into the classification model to provide, at each time step, a classification of whether the target gas to be detected is present.