KR102198459B1

KR102198459B1 - Clustering method and system for financial time series with co-movement relationship

Info

Publication number: KR102198459B1
Application number: KR1020180174052A
Authority: KR
Inventors: 이주홍; 안준규
Original assignee: 인하대학교 산학협력단
Priority date: 2018-12-31
Filing date: 2018-12-31
Publication date: 2021-01-05
Anticipated expiration: 2038-12-31
Also published as: KR20200082948A

Abstract

The present invention relates to a clustering method and system for financial time series having a co-movement relationship, which removes noise and finds time series clusters with higher co-movement. The disclosed configuration comprises: preprocessing financial data; clustering using the preprocessed data; and interpreting the clustering result.

Description

Clustering method and system for financial time series with co-movement relationship {CLUSTERING METHOD AND SYSTEM FOR FINANCIAL TIME SERIES WITH CO-MOVEMENT RELATIONSHIP}

The present invention relates to a clustering method and system for financial time series having a co-movement relationship, and more particularly to a clustering system and method that removes noise and finds time series clusters with higher co-movement.

A time series is a set of observations recorded sequentially over time. With advances in information devices, time series observed in real time are used in applications across many fields, including finance, communications, medicine, health care, and transportation. Finance is one of the application areas in which time series matter most. Financial time series include exchange rates, money supply, stock indices, stock prices, commodity prices, and derivatives. In developing efficient analysis methods for financial applications, it is therefore very important to take the key characteristics of financial time series into account.

Financial time series are non-stationary because they have random-walk characteristics. Making investment decisions by analyzing financial time series is therefore very difficult. However, although individual financial time series are non-stationary, the spread, which is a linear combination of two financial time series, has been found to be stationary in many situations. In that case the two financial time series are said to be cointegrated and are regarded as sharing the same trend. Cointegration is used as a measure of the degree of co-movement of two non-stationary time series. Because clusters of financial time series with high co-movement can be widely applied in financial time series applications, clustering algorithms that generate such clusters are an important research topic in financial applications. The resulting clusters can be useful for predicting future prices, selecting financial portfolios, and formulating financial policy.

The main problems of existing time series clustering algorithms and existing financial time series clustering algorithms are as follows. First, they use Euclidean distance and Dynamic Time Warping as distance measures, but these measures do not take the time weight of time series data into account. Second, in analyzing the trend component of a financial time series, they ignore the importance of the ratio by which a future value increases or decreases relative to past values. Third, they allow too much information loss in the dimension reduction process. Fourth, they generate clusters with a low degree of co-movement that contain too much noise. Fifth, they cannot assign a time series to more than one cluster, so information useful for financial analysis may be lost.

For example, Korean Patent Laid-Open Publication No. 102017-0078256 discloses a method and apparatus for predicting time series data. Specifically, it discloses a time series data prediction method comprising: clustering measured time series data, in units of a predetermined period during a training period, into a plurality of clusters; collecting a plurality of environmental data during the training period; selecting at least some of the environmental data as factors; generating a classification model that optimally classifies the clusters of the measured time series data in a space or plane formed by axes representing the factors; determining a performance index value of the generated classification model; repeating the selecting, generating, and determining steps while changing the selection of factors, and choosing an optimal classification model from among the generated classification models based on the performance index value; and predicting the cluster of the measured time series data for a prediction period using the optimal classification model. However, this clustering method has the problem that considerable noise may arise when clustering time series that exhibit co-movement.

Therefore, there is a need for a technology capable of solving the above problems.

Korean Patent Laid-Open Publication No. 102017-0078256

Accordingly, to overcome the above problems and limitations, the present invention aims to provide a clustering method and a clustering system that, compared with conventional clustering methods, remove noise to improve accuracy and can identify data exhibiting higher co-movement.

To solve the above problem, the present invention provides a clustering method for financial time series having a co-movement relationship, comprising: preprocessing financial data; clustering using the preprocessed data; and interpreting the clustering result.

The present invention also provides a clustering system for financial time series having a co-movement relationship, comprising: a preprocessing unit that preprocesses financial data; a clustering unit that performs clustering using the preprocessed data; and a result interpretation unit that interprets the clustering result.

Compared with existing methods, the present invention can reduce noise and perform financial time series clustering with high co-movement.

The present invention provides a clustering method and system that compensate for the shortcomings of existing time series clustering algorithms and generate clusters having high co-movement.

The effects of the present invention are not limited to those mentioned above, and may include various effects within a range that is apparent to a person skilled in the art from the description below.

FIG. 1 illustrates the problem of density-based clustering.
FIG. 2 illustrates the problem of partition-based clustering.
FIG. 3 is a graph illustrating the problem of the Z-transform as a normalization method.
FIG. 4 is a graph showing a band-limited signal, its sub-intervals, and the sampled points.
FIG. 5 shows the DRE results for each dimension reduction method.
FIG. 6 shows the DRaE results (α=50) for each dimension reduction method.
FIG. 7 is a graph showing the change in cluster diameter.
FIG. 8 illustrates the diameter limit estimation step.
FIG. 9 shows the RMSE results for each clustering method.
FIG. 10 shows the diameter results for each clustering method.
FIG. 11 shows the MSD results for each clustering method.
FIG. 12 shows the ADF test results for each clustering method.
FIG. 13 shows clusters generated by YADING.
FIG. 14 shows clusters generated by 3PTC.
FIG. 15 shows clusters generated by a method according to an embodiment of the present invention.
FIG. 16 is a block diagram of a clustering system for financial time series having a co-movement relationship according to an embodiment of the present invention.
FIG. 17 is a flowchart of a clustering method for financial time series having a co-movement relationship according to an embodiment of the present invention.

Hereinafter, the "clustering method and system for financial time series having a co-movement relationship" according to the present invention will be described in detail with reference to the accompanying drawings. The described embodiments are provided so that those skilled in the art can easily understand the technical idea of the present invention, and the present invention is not limited by them. In addition, the matters shown in the accompanying drawings are schematized to explain the embodiments easily and may differ from the form actually implemented.

Each component described below is only an example for implementing the present invention. Accordingly, other components may be used in other implementations without departing from the spirit and scope of the present invention.

In addition, each component may be implemented purely in hardware or purely in software, or as a combination of various hardware and software components that perform the same function. Two or more components may also be implemented together by a single piece of hardware or software.

In addition, the expression "including" certain elements is open-ended: it simply indicates that those elements are present and should not be understood as excluding additional elements.

FIG. 1 illustrates the problem of density-based clustering.

Referring to FIG. 1, time series are high-dimensional, so using raw time series data directly for analysis can lead to expensive computation and the curse of dimensionality. The dimension reduction methods used in time series analysis can be divided into data-adaptive methods and non-data-adaptive methods.

Data-adaptive methods are used when the size of the data is variable. They may include PCA (Piecewise Constant Approximation), APCA (Adaptive Piecewise Constant Approximation), and SAX (Symbolic Aggregate Approximation). Non-data-adaptive methods can be used when the size of the data is fixed. They may include Random Mapping, PAA (Piecewise Aggregate Approximation), DFT (Discrete Fourier Transform), and DCT (Discrete Cosine Transform).

One existing technique is YADING, a density-based clustering algorithm that uses DBSCAN. It reduces the running time of the algorithm by using a sampling method, and it estimates the density radius using an inflection point.

As another existing technique, multi-step clustering algorithms have been developed to improve the performance of time series clustering. In a two-step clustering method, the time series data are first transformed by SAX, and the CAST algorithm then generates clusters. In the second step, each cluster is divided into sub-clusters using subsequence information of the time series data in that cluster.

Yet another existing technique is a three-step clustering method called 3PTC. In the first step, the time series data are transformed by SAX, and the K-modes algorithm then generates pre-clusters. In the second step, each pre-cluster is refined into sub-clusters by the PCS algorithm. In the third step, the similarity between sub-clusters is computed using a shape-based similarity method. Finally, the final clusters are generated by merging sub-clusters.

Given the characteristics of financial time series, the key factors to consider in financial time series clustering are cluster size, data normalization, distance calculation, and the assignment of a time series to multiple clusters.

Existing time series clustering methods perform clustering without considering the appropriate size of a cluster. If the appropriate size is not considered, a large amount of noisy data may be included in a cluster. When financial data analysis is performed using clusters that contain such noisy data, its results are unreliable.

Density-based clustering, an existing technique, can generate clusters of arbitrary shape within continuously dense regions of the data space. A cluster generated in such a region can be very small or very large, whatever its shape. If the cluster is very large, the two most distant data items in the cluster may have very different degrees of co-movement. FIG. 1 shows this problem of density-based clustering.

FIG. 2 illustrates the problem of partition-based clustering.

Referring to FIG. 2, partition-based clustering generates clusters by dividing the data space into K partitions. The number of partitions is chosen by the user. If the user does not choose the number of partitions (K) appropriately, the two most distant data items in a cluster may have very different degrees of co-movement. FIG. 2 shows this problem of partition-based clustering.

FIG. 3 is a graph illustrating the problem of the Z-transform as a normalization method.

Referring to FIG. 3, in trend analysis of financial time series, the ratio by which a future value increases or decreases relative to past values is more important than the distribution and variance of the whole time series. Financial time series also differ greatly in scale from one series to another. A normalization step is therefore essential in financial time series analysis. However, most time series analysis methods either do not normalize the time series data or normalize them using the Z-transform. The Z-transform is defined as in Equation 1 below.

[Equation 1]

\[ Y_t = \frac{X_t - \mu_X}{\sigma_X} \]

Here, Y_t is the normalized time series at time t, X_t is the unnormalized time series at time t, μ_X is the mean of X, and σ_X is the standard deviation of X.

Because the Z-transform expresses each value as a multiple of the standard deviation of the time series, it is not suitable as a normalization method for financial time series. FIG. 3 shows that two time series appear to change at similar rates after the Z-transform even though the ratios by which their future values increase or decrease relative to past values are different.
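
The following minimal sketch illustrates the point above with two made-up price series; the series values and the function name z_transform are assumptions chosen for illustration and do not come from the patent. It applies the standard z-score and shows that two series with very different growth ratios become indistinguishable.

```python
import numpy as np

def z_transform(x):
    """Z-transform (z-score) of a 1-D series: (X_t - mean) / std."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Two hypothetical price series with very different growth ratios.
slow = np.array([100.0, 101.0, 102.0, 103.0])   # ends 3% above its start
fast = np.array([100.0, 150.0, 200.0, 250.0])   # ends 150% above its start

z_slow = z_transform(slow)
z_fast = z_transform(fast)

# After the Z-transform the two series coincide, so the difference in
# growth ratio is no longer visible.
identical_after_z = bool(np.allclose(z_slow, z_fast))
growth_slow = slow[-1] / slow[0]
growth_fast = fast[-1] / fast[0]
```

Both series are linear ramps, so subtracting the mean and dividing by the standard deviation maps them onto the same shape, which is the loss of ratio information the description refers to.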

Time series data are observed over a long period. Because the state of the system that generates a time series can change over time, a time series analysis model may stop working correctly as time passes. It is therefore more reasonable to give greater weight to recent data than to give equal weight to all past data. This phenomenon occurs very frequently in financial time series, so time weights must be applied when computing co-movement similarity.

Existing multi-step clustering algorithms use dimension-reduced time series to obtain pre-clusters in the pre-clustering step, with SAX, PAA, DCT, or DFT as the dimension reduction method. However, these methods can lose an excessive amount of information.

Existing time series clustering algorithms cannot assign a time series to more than one cluster. However, a time series may be similar to time series contained in other clusters. If duplicate assignment is not allowed, important information may therefore be lost in time series analysis.

In view of these problems, the present invention uses an efficient clustering method suited to the characteristics of financial time series applications.

FIG. 4 is a graph showing a band-limited signal, its sub-intervals, and the sampled points.

Referring to FIG. 4, in data preprocessing, average scaling is used for data normalization. It expresses the ratio by which a future value increases or decreases relative to past values and is defined as in Equation 2 below.

[Equation 2]

Here, Y_t is the normalized time series at time t, X_t is the time series at time t, and μ_X is the mean of X.
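
The body of Equation 2 is not reproduced in this text. The sketch below assumes the simplest form consistent with the symbols defined above, namely Y_t = X_t / μ_X; that form, the function name average_scaling, and the sample values are assumptions, not the patent's stated formula.

```python
import numpy as np

def average_scaling(x):
    """Assumed form of average scaling: Y_t = X_t / mean(X)."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    if mean == 0:
        raise ValueError("average scaling is undefined for a zero-mean series")
    return x / mean

# Two hypothetical series on very different price scales but with the same
# relative movement (+10%, then roughly +9%).
cheap = np.array([10.0, 11.0, 12.0])
expensive = np.array([1000.0, 1100.0, 1200.0])

scaled_cheap = average_scaling(cheap)
scaled_expensive = average_scaling(expensive)
```

Under this assumed form, scale differences between series are removed while ratios between values of the same series are preserved, which is the property the description asks of the normalization step.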

In financial time series analysis, more recent data have a greater influence on the results of future analysis. A weighted time series is used to reflect this characteristic. The weighted time series transformation generates the weighted time series and may comprise two steps: sub-interval calculation and weight assignment.

The sub-intervals of the time span are found efficiently using the Nyquist sampling rate condition and Parseval's theorem. Under the Nyquist condition, if the sampling frequency is at least twice the maximum frequency, a band-limited signal can be reconstructed perfectly from the sampled points without aliasing. Some time series are not perfectly band-limited, but their high-frequency components can be regarded as noise, so most time series can be treated as band-limited signals. According to Parseval's theorem, the sum of the squared values of a time series in the time domain equals the sum of the squared coefficients of its frequency components in the frequency domain. Parseval's theorem is therefore used to find the maximum frequency Ω_max that preserves most of the energy of the original time series (about 80-90%), where Ω_max = 2πk/N, N is the total time span of the time series, and k is the integer multiple of the fundamental frequency 1/N. The sampling period T_s is then obtained from this maximum frequency and the Nyquist sampling rate condition: 2π/T_s ≥ 2Ω_max, hence 2π/T_s ≥ 2 · 2πk/N, and therefore T_s ≤ N/(2k). Each sampling period is then interpreted as a sub-interval of the time series. FIG. 4 shows the sub-intervals of a band-limited signal.
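
The sub-interval computation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name sub_interval_length, the default energy ratio of 0.9 (picked from the stated 80-90% range), and the removal of the mean before the transform are all assumptions.

```python
import numpy as np

def sub_interval_length(x, energy_ratio=0.9):
    """Return T_s = N // (2k), where k is the smallest frequency index whose
    cumulative spectral energy reaches `energy_ratio` of the total."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Assumption: remove the mean so the constant (DC) term does not
    # dominate the energy.
    spectrum = np.fft.rfft(x - x.mean())
    energy = np.abs(spectrum) ** 2
    # rfft keeps only non-negative frequencies; double the bins that have a
    # mirrored negative-frequency partner so the sum follows Parseval's theorem.
    energy[1:] *= 2.0
    if n % 2 == 0:
        energy[-1] /= 2.0      # the Nyquist bin has no mirrored partner
    total = energy.sum()
    if total == 0.0:
        return n               # constant series: a single sub-interval
    cumulative = np.cumsum(energy) / total
    k = int(np.searchsorted(cumulative, energy_ratio))
    k = min(max(k, 1), len(energy) - 1)
    return max(n // (2 * k), 1)

t = np.arange(128)
trend = np.sin(2 * np.pi * 4 * t / 128)                    # 4 cycles over the span
noisy = trend + 0.1 * np.sin(2 * np.pi * 20 * t / 128)     # small high-frequency part
length_trend = sub_interval_length(trend)
length_noisy = sub_interval_length(noisy)
```

In this example the small high-frequency component carries about 1% of the energy, so it falls outside the retained band and both signals get the same sub-interval length, N/(2k) = 128/8 = 16.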

Weights are assigned linearly to the sub-intervals obtained from the Nyquist sampling rate condition and Parseval's theorem. For example, if there are four sub-intervals, the weights 1, 2, 3, and 4 are assigned to them in chronological order from the oldest to the most recent.
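
A minimal sketch of the linear weight assignment follows. The text does not say how a weight is combined with the values of its sub-interval; multiplying each value by its weight is an assumption of this sketch, as are the function name weighted_series and the sample input.

```python
import numpy as np

def weighted_series(x, interval_length):
    """Assign weights 1, 2, 3, ... to consecutive sub-intervals (oldest
    first) and, as an assumption, multiply each value by its weight."""
    x = np.asarray(x, dtype=float)
    weights = np.arange(len(x)) // interval_length + 1
    return x * weights, weights

# Eight time steps split into four sub-intervals of length 2.
values, weights = weighted_series(np.ones(8), interval_length=2)
```

With four sub-intervals the weights are 1, 1, 2, 2, 3, 3, 4, 4, so differences in the most recent sub-interval count four times as much as differences in the oldest one when distances are later computed on the weighted series.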

Dimension reduction can be performed in the preprocessing step. The dimension reduction method was chosen by comparing DCT, PAA, PCA, and SAX, and PCA (Principal Component Analysis) was selected. PCA reduces the dimension of the data using the eigenvectors that best represent the variability of the whole data set. The number of dimensions retained by PCA is determined experimentally using a scree plot.
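
As a rough illustration of this step, the sketch below performs PCA with a singular value decomposition and returns the explained-variance ratios that a scree plot would display. The function name pca_reduce, the random demo data, and the choice r = 3 are assumptions for illustration only.

```python
import numpy as np

def pca_reduce(series_matrix, r):
    """Project each row (one time series) onto the first r principal
    components; also return the explained-variance ratio per component."""
    data = np.asarray(series_matrix, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of `components` are the principal directions, ordered by
    # decreasing singular value.
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    reduced = centered @ components[:r].T
    return reduced, explained

# Hypothetical data: 20 random-walk series of length 50.
rng = np.random.default_rng(0)
demo = rng.normal(size=(20, 50)).cumsum(axis=1)
reduced, explained = pca_reduce(demo, r=3)
```

Plotting `explained` against the component index gives the scree plot; the number of retained dimensions is typically read off where the curve flattens.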

After preprocessing is complete, the clustering step can be performed.

The important terms used in the clustering step are defined as follows.

Weighted Metric Space (WMS)

The WMS is defined as the set X = { \tilde{f}(t) | \tilde{f}(t) is the weighted time series of f(t), t = 1, ..., T }, equipped with the SES distance (the sum of the Euclidean distances of the sub-interval time series).

\tilde{f}(t) is the result of applying the weighted time series transformation to f(t).

SES distance

\[ \mathrm{SES\_Distance} = \sum_i \mathrm{SIDist}_i \]

Here, SIDist_i is the Euclidean distance between the i-th sub-interval time series of the weighted time series.
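
A minimal sketch of the SES distance, read literally as the sum over sub-intervals of the Euclidean distance between the corresponding segments of two weighted series. The function name ses_distance and the fixed sub-interval length are assumptions.

```python
import numpy as np

def ses_distance(a, b, interval_length):
    """Sum of the per-sub-interval Euclidean distances between two
    (already weighted) series of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("series must have the same length")
    total = 0.0
    for start in range(0, len(a), interval_length):
        stop = start + interval_length
        total += float(np.linalg.norm(a[start:stop] - b[start:stop]))
    return total

first = np.zeros(4)
second = np.array([3.0, 4.0, 6.0, 8.0])
distance = ses_distance(first, second, interval_length=2)
```

For the example, the two sub-intervals contribute 5 and 10, so the SES distance is 15, whereas the plain Euclidean distance over the whole series would be about 11.18.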

RDS (Reduced Dimensional Space)

X = { g(z) | g(z) is the reduced representation of \tilde{f}(t), t = 1, ..., T, z = 1, ..., r, r << T }

g(z) is the result of applying the dimension reduction method to the weighted time series \tilde{f}(t).

The diameter upper bound of clusters in the weighted metric space and the diameter limit of clusters in the reduced dimensional space are estimated. A time series can be regarded as a combination of a trend component and a noise component. The goal is to find time series whose trend components are similar and whose noise is sufficiently small, and the cluster diameter is restricted for this purpose. For computational efficiency in the pre-clustering step, the financial time series are converted into reduced-dimensional data, and clustering is then performed on the reduced data. However, because of the dimension reduction, the pre-clusters may contain noise. In the refinement step, each pre-cluster is therefore refined into refined clusters.

Diameter upper bound estimation

The diameter upper bound is estimated in order to limit the size of clusters in the refinement step. It represents the maximum SES_Distance allowed between time series within a cluster and is estimated as follows. One sample is drawn from the WMS. Time series are then added to its neighbor set one by one, in order of increasing distance from the sample. If the neighbor set contains a time series that deviates greatly from the trend, that time series is removed from the set, and the maximum SES_Distance within the set is taken as a diameter candidate. This is repeated for several samples, and the average of the resulting diameter candidates is taken as the estimated diameter upper bound.

Diameter limit estimation

The diameter limit used to restrict the size of clusters in the pre-clustering step is estimated as follows. Let S be the neighbor set used to estimate the diameter upper bound (D) in the WMS. S contains two time series whose distance equals the diameter upper bound. The Euclidean distance between these two time series is computed in the reduced dimensional space, and the resulting distance is taken as the diameter limit candidate (d'). Information is lost in the reduced dimensional space. If the maximum distance within a cluster is restricted to the candidate d', noisy data may be included, and similar time series may fail to be placed in the same cluster. Therefore, a value slightly larger than the candidate, d = γd' with γ > 1, is used as the diameter limit.
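
The estimation of d from d' can be sketched as below. The function name diameter_limit, the input layout (a pairwise WMS distance matrix for the neighbor set S plus the reduced representations of the same series), and the default γ = 1.1 are assumptions; the text only requires γ > 1.

```python
import numpy as np

def diameter_limit(weighted_distances, reduced_points, gamma=1.1):
    """Find the pair of series in S that is farthest apart in the WMS,
    measure that pair in the reduced space (d'), and return d = gamma * d'."""
    if gamma <= 1:
        raise ValueError("gamma must be greater than 1")
    dist = np.asarray(weighted_distances, dtype=float)
    reduced = np.asarray(reduced_points, dtype=float)
    # Indices of the two series that realise the diameter in the WMS.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    d_candidate = float(np.linalg.norm(reduced[i] - reduced[j]))
    return gamma * d_candidate

# Hypothetical neighbor set of three series.
wms_dist = [[0.0, 1.0, 5.0],
            [1.0, 0.0, 2.0],
            [5.0, 2.0, 0.0]]
reduced = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
d = diameter_limit(wms_dist, reduced, gamma=1.2)
```

Here series 0 and 2 realise the WMS diameter, their reduced-space distance is d' = 5, and the resulting limit is d = 1.2 · 5 = 6.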

In the pre-clustering step, the ADupClustering algorithm is run on the data in the reduced dimensional space as follows. An agglomerative clustering algorithm first generates clusters. For each cluster R generated by the agglomerative clustering algorithm, ADupClustering then looks among the time series contained in other clusters for those that should also be included in R. Their distance from cluster R must be smaller than the diameter limit (d). In other words, the ADupClustering algorithm assigns a single time series to several clusters, that is, it performs duplicate assignment.
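
The duplicate-assignment pass can be sketched as follows, starting from clusters that some agglomerative method has already produced. This is an illustration under assumptions, not the patent's algorithm: the function name add_duplicates is hypothetical, plain Euclidean distance is used, a candidate is accepted when its distance to every current member is at most d, and candidates are visited greedily in index order.

```python
import numpy as np

def add_duplicates(points, clusters, d):
    """For each cluster, also admit series from other clusters as long as
    the cluster diameter stays within d (greedy, in index order)."""
    pts = np.asarray(points, dtype=float)
    result = []
    for members in clusters:
        extended = list(members)
        for x in range(len(pts)):
            if x in members:
                continue
            # Largest distance from the candidate to any current member.
            farthest = max(float(np.linalg.norm(pts[x] - pts[m])) for m in extended)
            if farthest <= d:
                extended.append(x)
        result.append(sorted(extended))
    return result

# Four hypothetical series reduced to one dimension, pre-clustered in pairs.
points = [[0.0], [1.0], [2.0], [3.0]]
initial = [[0, 1], [2, 3]]
overlapping = add_duplicates(points, initial, d=2.0)
```

In the example, series 1 and 2 end up in both clusters, which is the overlap that a strict partition would not allow.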

Because the pre-clusters are generated in the reduced dimensional space, they may contain time series with different trends. In the refinement step, each pre-cluster is refined into several refined clusters. Each refined cluster should contain only time series with similar trends (a high degree of co-movement). To this end, the ADupClustering algorithm is run in the weighted metric space using the diameter upper bound (D).

In the ADupClustering algorithm, let R be a cluster generated by agglomerative clustering and let ρ be the diameter of R. Let σ be the diameter bound, which may be either the diameter upper bound (D) or the diameter limit (d). Then there exist α, β in R such that ρ = dist(α, β), and ρ ≤ σ. Let c be the midpoint of the diameter segment αβ. Let x be a time series not in R, and let l = dist(x, c). Let R' = R ∪ {x}, and let τ be the diameter of R'.

Lemma. If l > (√3 σ)/2, then τ > σ.

Proof. If l > (√3 σ)/2 and the angle between [vector missing in source] and [vector missing in source] is greater than or equal to π/2, then dist(x, α) > [expression missing in source], and the claim follows because τ ≥ dist(x, α). The case of β is the same as the case of α.

Therefore, only when l ≤ (√3 σ)/2 is it necessary to test whether τ is smaller than the diameter bound σ.

To demonstrate the performance of the present invention, four experiments were carried out: an experiment on the efficiency of dimensionality reduction, an experiment on determining the diameter upper bound in the WMS (Weighted Metric Space), an experiment on determining the diameter limit in the RDS (Reduced Dimensional Space), and an experiment comparing our method with other algorithms.

Stock data from the Korean stock markets (KOSPI, KOSDAQ) and data from the UCR time series archive were used as experimental data. The Korean stock market data consists of 1,483 stocks as of 2016, each with 4,000 time points. From the UCR time series archive, ShapesAll (512 dimensions), WordSynonyms (270 dimensions) and 50words (270 dimensions) were used. The experiments were run on an Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz with 4.00 GB of memory.

DCT, PAA and PCA are compared in order to select the dimensionality reduction method used in the present invention. Dimension reduction efficiency (DRE) and dimension rank efficiency (DRaE) are used as the comparison metrics. DRE is defined by Equation 3 below.

[Equation 3]

SSEW (the sum of squared errors in the WMS) is given by Equation 4.

[Equation 4]

Here, μ denotes a reference time series chosen at random in the weighted metric space, and SetW is the set of the n time series closest to the reference time series.

FIG. 5 shows the DRE results for each dimensionality reduction method.

Referring to FIG. 5, SSER (the sum of squared errors in the RDS) is given by Equation 5.

[Equation 5]

Here, SetR is the set of the n time series closest to the reference time series μ in the reduced dimensional space. f_k and μ themselves are expressed in the weighted metric space.

The meaning of DRE is as follows. SSER is the sum of squared errors obtained from the n time series that are closest to the reference time series in the reduced dimensional space. SSER can therefore never be smaller than SSEW, which is obtained from the n time series closest to the reference time series in the weighted metric space. As a result, the dimensionality reduction method that loses the least information has the largest DRE value. FIG. 5 shows the DRE values of PCA, DCT and PAA for 2, 4 and 6 dimensions.
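The following sketch illustrates how a DRE-style score could be computed. Equations 3 to 5 are not reproduced in this text, so the formula used here, DRE = 100 × SSEW / SSER with both sums measured in the weighted metric space, is an assumption chosen only to agree with the prose above (SSER is never smaller than SSEW, and less information loss gives a larger DRE). All function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(ref, points, k, dist):
    """Indices of the k points closest to points[ref], excluding ref itself."""
    others = [i for i in range(len(points)) if i != ref]
    return sorted(others, key=lambda i: dist(points[i], points[ref]))[:k]


def sse(ref, members, wms, dist):
    """Sum of squared WMS distances from the reference series to `members`."""
    return sum(dist(wms[i], wms[ref]) ** 2 for i in members)


def dre(ref, wms, rds, n, dist):
    """Assumed DRE: 100 * SSEW / SSER (100 means no neighbor information lost)."""
    set_w = k_nearest(ref, wms, n, dist)  # neighbors chosen in the WMS
    set_r = k_nearest(ref, rds, n, dist)  # neighbors chosen in the RDS
    ssew = sse(ref, set_w, wms, dist)
    sser = sse(ref, set_r, wms, dist)     # still measured in the WMS
    return 100.0 * ssew / sser
```

For instance, with `wms = [[0, 0], [1, 0], [0, 2], [3, 3]]` and a reduction that drops the second coordinate (`rds = [[0], [1], [0], [3]]`), the nearest neighbor of series 0 changes from series 1 to series 2, so SSEW = 1, SSER = 4 and the score for n = 1 is 25.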

FIG. 6 shows the DRaE results (α = 50) for each dimensionality reduction method.

Referring to FIG. 6, dimension rank efficiency (DRaE) is defined by Equation 6 below.

[Equation 6]

Here, N denotes the number of time series in SetW, and d_k denotes the k-th nearest neighbor of μ in the weighted metric space. r(d_k) is the rank of d_k with respect to the reference time series in the weighted metric space, so r(d_k) = k. rr(d_k) is the rank of d_k with respect to the reference time series in the reduced dimensional space.

The meaning of DRaE is as follows. If r(d_k) = rr(d_k), the dimensionality reduction method has lost no rank information and no rank error occurs, so the DRaE value is 100%. If r(d_k) ≠ rr(d_k), rank information has been lost and a rank error occurs, so the DRaE value is lower than 100%. FIG. 6 shows the DRaE results of PCA, DCT and PAA for 2, 4 and 6 dimensions.

FIG. 7 is a graph showing the change in cluster diameter.

The experiments confirmed that the DRE and DRaE values of PCA are larger than those of DCT and PAA. Consequently, PCA was selected as the dimensionality reduction method in the present invention.

FIG. 7 shows the experiment on estimating the diameter upper bound. N neighbors of the reference time series μ are added to a neighbor set one at a time, and the change in the maximum distance (diameter) between the data in the neighbor set is tracked. At a certain maximum distance, a time series appears whose trend differs markedly from that of most time series in the neighbor set. That maximum distance is a candidate for the diameter upper bound.
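The diameter curve described above can be sketched as follows. The text does not state how a time series with a "very different trend" is detected, so the `first_jump` rule below (diameter growing by more than a fixed ratio in one step) is a purely hypothetical stand-in, as are the function names and the generic `dist` argument.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def diameter_curve(ref, points, dist, n_neighbors):
    """Diameter of the neighbor set after each neighbor of `ref` is added.

    Neighbors are added in order of increasing distance from the reference.
    """
    order = sorted((i for i in range(len(points)) if i != ref),
                   key=lambda i: dist(points[i], points[ref]))[:n_neighbors]
    members = [ref]
    diam = 0.0
    curve = []
    for i in order:
        # The new diameter is the old one or a distance involving the newcomer.
        diam = max([diam] + [dist(points[i], points[m]) for m in members])
        members.append(i)
        curve.append(diam)
    return curve


def first_jump(curve, ratio=2.0):
    """Hypothetical outlier rule: first step where the diameter grows by more
    than `ratio` times its previous value. Returns None if there is no jump."""
    for k in range(1, len(curve)):
        if curve[k - 1] > 0 and curve[k] > ratio * curve[k - 1]:
            return k
    return None
```

If `first_jump` returns k, the diameter just before the jump, `curve[k - 1]`, would play the role of the diameter candidate for that reference series.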

FIG. 8 shows the steps of estimating the diameter limit.

FIG. 8 illustrates the process of estimating the diameter limit used in the reduced dimensional space. The right graph shows that, in the weighted metric space, the diameter of the neighbor set grows as the neighbors of the reference time series are added to the set one at a time. The left graph shows the same for the reduced dimensional space: as the neighbors of the reference time series are added one at a time, the diameter of the neighbor set grows. The diameter limit is estimated as follows. First, a particular diameter upper bound is located on the right graph, and the k-th time series corresponding to that diameter upper bound is identified. Next, the diameter corresponding to the k-th time series is read from the left graph. This diameter is taken as the diameter limit candidate. The diameter limit itself is then the diameter limit candidate × γ.

FIG. 9 shows the RMSE results for each clustering method, FIG. 10 shows the diameter results for each clustering method, and FIG. 11 shows the MSD results for each clustering method.

Referring to FIGS. 9 to 11, the most recently published techniques, 3PTC and YADING, were selected for comparison with the method according to an embodiment of the present invention. 3PTC is a recent technique for financial time series clustering, and YADING is a recent technique for density-based time series clustering. The degree of co-movement of the time series in a cluster is assessed with the evaluation measures below.

RMSE (root mean square error) is defined by Equation 7 below.

[Equation 7]

Here, |cluster| denotes the number of time series in the cluster, f_i(t) denotes the i-th time series of the cluster, and P(t) denotes the prototype time series of the cluster.

The prototype time series P(t) is defined by Equation 8 below.

[Equation 8]

MSD (maximum standard deviation) is defined by Equation 9 below.

[Equation 9]

RMSE measures the coherence of a cluster: the smaller the value, the higher the co-movement within the cluster. Because RMSE is an average, however, it cannot detect the case in which a cluster contains time series with a markedly different degree of co-movement. MSD and diameter are maximum-value indicators. They do not measure the degree of coherence within a cluster, but they do measure the largest deviation among the time series in the cluster. To assess the degree of co-movement of a cluster, therefore, all three values should be small at the same time.
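The three measures can be illustrated with a short sketch. Equations 7 to 9 are not reproduced in this text, so the definitions below are assumptions chosen to match the prose: the prototype P(t) is taken as the pointwise mean of the cluster, RMSE as the root of the mean squared deviation from P(t) over all series and time points, and MSD as the largest per-time-point population standard deviation. The function names are hypothetical.

```python
import math
import statistics


def prototype(cluster):
    """Assumed prototype: pointwise mean of the series in the cluster."""
    return [statistics.fmean(values) for values in zip(*cluster)]


def rmse(cluster):
    """Assumed RMSE: root mean squared deviation from the prototype,
    averaged over every series and every time point."""
    p = prototype(cluster)
    total = sum((v - p[t]) ** 2 for series in cluster for t, v in enumerate(series))
    return math.sqrt(total / (len(cluster) * len(p)))


def msd(cluster):
    """Assumed MSD: largest standard deviation across series at any time point."""
    return max(statistics.pstdev(values) for values in zip(*cluster))
```

For the two-series cluster `[[1.0, 2.0], [3.0, 2.0]]` the prototype is `[2.0, 2.0]`, the RMSE is √0.5 and the MSD is 1.0, which shows how MSD picks up the single time point at which the series disagree.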

As shown in FIGS. 9 to 11, all three values are smaller for the present method than for the other methods. The clusters generated by the method of the present invention therefore have the highest degree of co-movement.

FIG. 12 shows the ADF test results for each clustering method.

Referring to FIG. 12, cointegration can be used as an indicator of co-movement. The present invention therefore also uses a cointegration test (the ADF test) to assess the degree of co-movement of the time series in a cluster.

To check whether a cointegration relationship exists between the time series of a cluster generated by an algorithm, the ADF test is run on every pair of time series in the cluster, and a p-value is obtained for each pair. If the p-value is below the threshold, the null hypothesis is rejected, which means the two time series are likely to be cointegrated. From the p-values obtained by the ADF test, the proportion of pairs for which the null hypothesis is rejected is computed. The higher this proportion, the more of the cluster's time series pairs satisfy the cointegration relationship, so it indicates the degree of co-movement of the time series in the cluster. The results in FIG. 12 are interpreted as follows.
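The pairwise rejection rate can be sketched as below. The test itself is passed in as `pvalue_fn`, a hypothetical hook that returns a p-value for a pair of series; the threshold of 0.05 is likewise an assumed default, since the text does not state the value used.

```python
from itertools import combinations


def rejection_rate(cluster, pvalue_fn, alpha=0.05):
    """Fraction of series pairs in `cluster` whose null hypothesis is rejected.

    cluster:   list of time series
    pvalue_fn: callable (series_a, series_b) -> p-value of the chosen test
    alpha:     significance threshold (assumed value)
    """
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0  # a cluster with fewer than two series has no pairs
    rejected = sum(1 for a, b in pairs if pvalue_fn(a, b) < alpha)
    return rejected / len(pairs)
```

One possible choice for `pvalue_fn`, if statsmodels is available, is a wrapper around `statsmodels.tsa.stattools.coint`, which runs an augmented Engle-Granger cointegration test and returns the test statistic, the p-value and critical values; whether this matches the exact ADF procedure used in the experiments is not stated in the text.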

The rejection rate of the null hypothesis for clusters generated by the method according to an embodiment of the present invention was found to be higher than the rejection rates for YADING and 3PTC. This means that the time series pairs in clusters generated by the method of the present invention have a higher degree of co-movement than those in clusters generated by the other methods.

FIG. 13 shows a cluster generated by YADING, FIG. 14 shows a cluster generated by 3PTC, and FIG. 15 shows a cluster generated by the method according to an embodiment of the present invention.

FIGS. 13 to 15 show a representative cluster for each algorithm used in the evaluation of the clustering algorithms.

In FIGS. 13 and 14, sub-clusters with differing trends can be seen within the cluster. The time series pairs in clusters generated by YADING and 3PTC do not achieve a high degree of co-movement.

FIG. 15 shows the result of the method of the present invention. The time series pairs in the clusters generated by our method achieve a high degree of co-movement.

The present invention discloses a method of finding financial time series clusters with a high degree of co-movement. In particular, the present invention takes the characteristics of financial time series into account. The clustering algorithm proposed in the present invention can be applied to various financial investment tasks such as portfolio selection, risk management, asset allocation and time series prediction. Because it is also likely to be usable for macroeconomic purposes, we are confident that our method is highly valuable.

FIG. 16 is a block diagram of a clustering system for financial time series having a co-movement relationship according to an embodiment of the present invention.

Referring to FIG. 16, the clustering system for financial time series having a co-movement relationship according to an embodiment of the present invention may include a preprocessor 1610, a clustering unit 1620 and a result analysis unit 1630.

For data normalization, the preprocessor 1610 may perform average scaling, which expresses as a ratio how much a future value increases or decreases relative to a past value. Average scaling is defined by Equation 2 below.

[Equation 2]

In financial time series analysis, more recent data has a greater influence on future analysis results. To reflect this property, the preprocessor 1610 uses a weighted time series. The weighted time series transform produces a weighted time series and may include two steps: sub-interval calculation and weight assignment.

The preprocessor 1610 uses a method that efficiently finds the sub-intervals of the time range based on the Nyquist sampling rate condition and Parseval's theorem. Under the Nyquist sampling rate condition, if the sampling frequency is at least twice the maximum frequency, a band-limited signal can be perfectly reconstructed from its sampling points without frequency aliasing. Some time series are not perfectly band-limited signals, but their high-frequency components can be regarded as noise, so most time series can be treated as band-limited signals. According to Parseval's theorem, the sum of the squares of the time series values in the time domain equals the sum of the squares of the coefficients of the frequency components in the frequency domain. Parseval's theorem is therefore used to find the maximum frequency (Ω_max) that preserves most (about 80-90%) of the energy of the original time series. Here Ω_max = 2πk / N, where N is the total time length of the time series and k is the integer multiple of the base frequency 1/N. From this maximum frequency and the Nyquist sampling rate condition, the sampling period T_s is obtained: 2π/T_s ≥ 2Ω_max, that is, 2π/T_s ≥ 2 · 2πk/N, and hence T_s ≤ N/2k. Each sampling period is then interpreted as a sub-interval of the time series. FIG. 4 shows the sub-intervals of a band-limited signal.
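The sub-interval calculation can be sketched with a plain discrete Fourier transform. This is a minimal illustration under stated assumptions: the 85% energy threshold is one value from the 80-90% range mentioned above, the mean is removed before the energy is measured (an assumption, so that the constant component does not dominate), k is forced to be at least 1, and the function names are hypothetical. The O(N²) transform is used only to keep the example dependency-free.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * m * t / n) for t in range(n))
            for m in range(n)]


def max_frequency_index(x, energy_fraction=0.85):
    """Smallest k >= 1 such that frequencies up to k hold the requested
    fraction of the (mean-removed) signal energy, as given by Parseval."""
    n = len(x)
    mean = sum(x) / n
    power = [abs(c) ** 2 for c in dft([v - mean for v in x])]
    total = sum(power)
    if total == 0:
        return 1  # constant series: any k preserves all the energy
    kept = power[0]
    for k in range(1, n // 2 + 1):
        kept += power[k]
        if n - k != k:
            kept += power[n - k]  # mirrored component of a real signal
        if kept >= energy_fraction * total:
            return k
    return n // 2


def sampling_period(x, energy_fraction=0.85):
    """Largest sampling period allowed by the Nyquist condition, T_s = N / 2k."""
    return len(x) / (2 * max_frequency_index(x, energy_fraction))
```

For a 16-point sine wave with two full cycles, all the energy sits at k = 2, so the sampling period is 16 / 4 = 4 and the series splits into four sub-intervals.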

The preprocessor 1610 assigns weights linearly to the sub-intervals obtained from the Nyquist sampling rate condition and Parseval's theorem. For example, if there are four sub-intervals, the weights 1, 2, 3 and 4 are assigned to them in chronological order from the oldest to the most recent.
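The linear weighting can be sketched as follows. The function names are hypothetical, and multiplying each value by its weight in `apply_weights` is only one assumed way of using the weights; the text does not state whether the weights scale the values or enter the distance computation.

```python
def linear_weights(n_points, period):
    """Weight of each time point: 1 for the oldest sub-interval, 2 for the
    next, and so on, where each sub-interval spans `period` time points."""
    return [int(t // period) + 1 for t in range(n_points)]


def apply_weights(series, period):
    """Assumed use of the weights: scale each value by its sub-interval weight."""
    weights = linear_weights(len(series), period)
    return [w * v for w, v in zip(weights, series)]
```

With eight time points and a sub-interval length of 2, `linear_weights(8, 2)` gives `[1, 1, 2, 2, 3, 3, 4, 4]`, which matches the four-sub-interval example above.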

The preprocessor 1610 may perform dimensionality reduction in the preprocessing step. DCT, PAA, PCA and SAX were compared in order to select the dimensionality reduction method, and PCA (Principal Component Analysis) was selected. PCA reduces the dimension of the data by using the eigenvectors that best represent the variability of the whole data set. The number of dimensions retained by PCA is determined experimentally using a scree plot.

The clustering unit 1620 estimates the diameter upper bound of the clusters used in the weighted metric space and the diameter limit of the clusters used in the reduced dimensional space. A time series can be regarded as the combination of a trend component and a noise component. The goal is to find time series whose trend components are similar and whose noise is sufficiently small, and the cluster diameter is bounded for this purpose. For computational efficiency in the pre-clustering step, the financial time series are converted to reduced-dimensional data, and clustering is then performed on that data. Because of the dimensionality reduction, however, the pre-clusters may contain noise, so they are refined into refined clusters in the refinement step.

The clustering unit 1620 may estimate the diameter upper bound.

The clustering unit 1620 estimates the diameter upper bound in order to bound cluster size in the refinement step. The diameter upper bound is the maximum allowed SES_Distance between time series within a cluster. It is estimated as follows. One sample is drawn from the WMS. Time series are then added to a neighbor set one at a time, in order of increasing distance from the sample. When the neighbor set comes to contain a time series that deviates substantially from the trend, that time series is removed from the set and the maximum SES_Distance within the set is taken as a diameter candidate. Repeating this for several samples yields several diameter candidates, and their average is used as the estimate of the diameter upper bound.

The clustering unit 1620 may estimate the diameter limit.

The clustering unit 1620 estimates the diameter limit, which bounds cluster size in the pre-clustering step, as follows. Let S be the neighbor set used to estimate the diameter upper bound (D) in the WMS. Two time series in S realize that diameter upper bound. The Euclidean distance between these two time series is computed in the reduced dimensional space, and the result is taken as the diameter limit candidate (d'). Because information is lost in the reduced dimensional space, bounding the maximum distance within a cluster by d' could both admit noisy data and keep similar time series out of the same cluster. A value slightly larger than the candidate is therefore used as the diameter limit: d = γd', γ > 1.

In the pre-clustering step, the clustering unit 1620 applies the ADupClustering algorithm to the data in the reduced dimensional space as follows. An agglomerative clustering algorithm first generates clusters. For each cluster R generated this way, ADupClustering searches the time series belonging to other clusters for those that should also be included in R; the distance between such a time series and cluster R must be smaller than the diameter limit (d). In other words, ADupClustering may assign one time series to several clusters (duplicate assignment).

Because the pre-clusters are generated in the reduced dimensional space, they may contain time series with different trends. In the refinement step, the clustering unit 1620 refines each pre-cluster into several refined clusters, each of which should contain only time series with similar trends (a high degree of co-movement). To this end, the ADupClustering algorithm is run in the weighted metric space using the diameter upper bound (D).

The result analysis unit 1630 may use the clustering result to predict future prices of financial time series, to select portfolios and to establish financial policy.

FIG. 17 is a flowchart of a clustering method for financial time series having a co-movement relationship according to an embodiment of the present invention.

Referring to FIG. 17, the clustering method for financial time series having a co-movement relationship according to an embodiment of the present invention may include a step of preprocessing financial data (S1710).

In step S1710, for data normalization, the preprocessor 1610 may perform average scaling, which expresses as a ratio how much a future value increases or decreases relative to a past value. Average scaling is defined by Equation 2 below.

[Equation 2]

In financial time series analysis, more recent data has a greater influence on future analysis results. To reflect this property, in step S1710 the preprocessor 1610 uses a weighted time series. The weighted time series transform produces a weighted time series and may include two steps: sub-interval calculation and weight assignment.

In step S1710, the preprocessor 1610 uses a method that efficiently finds the sub-intervals of the time range based on the Nyquist sampling rate condition and Parseval's theorem. Under the Nyquist sampling rate condition, if the sampling frequency is at least twice the maximum frequency, a band-limited signal can be perfectly reconstructed from its sampling points without frequency aliasing. Some time series are not perfectly band-limited signals, but their high-frequency components can be regarded as noise, so most time series can be treated as band-limited signals. According to Parseval's theorem, the sum of the squares of the time series values in the time domain equals the sum of the squares of the coefficients of the frequency components in the frequency domain. Parseval's theorem is therefore used to find the maximum frequency (Ω_max) that preserves most (about 80-90%) of the energy of the original time series. Here Ω_max = 2πk / N, where N is the total time length of the time series and k is the integer multiple of the base frequency 1/N. From this maximum frequency and the Nyquist sampling rate condition, the sampling period T_s is obtained: 2π/T_s ≥ 2Ω_max, that is, 2π/T_s ≥ 2 · 2πk/N, and hence T_s ≤ N/2k. Each sampling period is then interpreted as a sub-interval of the time series. FIG. 4 shows the sub-intervals of a band-limited signal.

In step S1710, the preprocessor 1610 assigns weights linearly to the sub-intervals obtained from the Nyquist sampling rate condition and Parseval's theorem. For example, if there are four sub-intervals, the weights 1, 2, 3 and 4 are assigned to them in chronological order from the oldest to the most recent.

In step S1710, the preprocessor 1610 may perform dimensionality reduction. DCT, PAA, PCA and SAX were compared in order to select the dimensionality reduction method, and PCA (Principal Component Analysis) was selected. PCA reduces the dimension of the data by using the eigenvectors that best represent the variability of the whole data set. The number of dimensions retained by PCA is determined experimentally using a scree plot.

본 발명의 일 실시 예에 따른 공동성 관계가 있는 금융 시계열에 대한 클러스터링 방법은 전처리된 상기 데이터를 이용해 클러스터링 하는 단계(S1720)를 포함할 수 있다. A clustering method for a financial time series having a synergistic relationship according to an embodiment of the present invention may include clustering using the preprocessed data (S1720).

S1720 단계에서, 상기 클러스터링부(1620)는 가중치 공간에 사용된 클러스터의 직경 상한과 축소된 차원 공간에 사용된 클러스터의 직경 한계를 추정한다. 시계열은 추세 성분과 잡음 성분의 조합으로 간주할 수 있다. 트렌드 구성 요소가 유사하고 노이즈가 충분히 적은 시계열을 찾아야 한다. 이를 위해 클러스터 직경을 제한한다. 사전 클러스터링 단계에서의 계산 효율성을 위해 금융 시간 시리즈를 축소된 차원 데이터로 변환한다. 그런 다음 축소된 차원 데이터에 대해 클러스터링이 수행된다. 그러나 치수 감소로 인해 사전 클러스터에 노이즈가 포함될 수 있다. 따라서 정제 단계에서 사전 클러스터를 정제 된 클러스터로 정제한다.In step S1720, the clustering unit 1620 estimates the upper limit of the diameter of the cluster used in the weight space and the limit of the diameter of the cluster used in the reduced dimensional space. Time series can be regarded as a combination of a trend component and a noise component. You should find a time series with similar trend components and low enough noise. To do this, limit the cluster diameter. For computational efficiency in the pre-clustering step, the financial time series is converted to reduced dimensional data. Then, clustering is performed on the reduced dimensional data. However, due to dimensional reduction, the pre-cluster can contain noise. Therefore, in the purification step, the pre-cluster is purified into a purified cluster.

S1720 단계에서, 상기 클러스터링부(1620)는 직경 상한 추정을 수행할 수 있다.In step S1720, the clustering unit 1620 may perform a diameter upper limit estimation.

상기 클러스터링부(1620)는 정제 단계에서 클러스터의 크기를 제한하기 위해 직경 상한을 추정한다. 직경 상한은 클러스터 내의 시계열간의 SES_Distance의 최대 한도를 나타낸다. 직경 상한은 다음과 같이 추정된다. WMS에서 하나의 샘플을 추출한다. 그런 다음이 샘플과의 최단 거리 순서대로 이웃 집합에 시계열이 순차적으로 포함된다. 이웃 세트가 트렌드에서 크게 벗어나는 시계열을 가지고 있다면, 본 발명은 세트로부터 그 시계열을 제거하고, 최대 SES_Distance를 세트 내의 직경 후보로 고려한다. 여러 샘플을 반복적으로 작업하여 얻은 여러 직경 후보 값의 평균은 직경 상한으로 추정된다.The clustering unit 1620 estimates the upper limit of the diameter in order to limit the size of the cluster in the refining step. The upper limit of diameter represents the maximum limit of SES_Distance between time series within a cluster. The upper diameter limit is estimated as follows. One sample is extracted from WMS. Then, the time series is sequentially included in the neighboring set in the order of the shortest distance from this sample. If the neighboring set has a time series that deviates significantly from the trend, the present invention removes the time series from the set and considers the maximum SES_Distance as a diameter candidate in the set. The average of several diameter candidate values obtained by iteratively working with several samples is estimated as the upper diameter limit.

In step S1720, the clustering unit 1620 may estimate the diameter limit.

In step S1720, in the pre-clustering step, the clustering unit 1620 runs the ADupClustering algorithm on the time series in the reduced dimensional space as follows. An agglomerative clustering algorithm first creates clusters. For each cluster R created by the agglomerative clustering algorithm, ADupClustering then finds, among the time series belonging to other clusters, those that should also be included in cluster R. The distance between such a time series and cluster R must be smaller than the diameter limit (d). In other words, the ADupClustering algorithm may assign one time series to several clusters, that is, it performs duplicate assignment.
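The two phases described above (agglomerative clustering, then duplicate assignment) might look roughly like the following sketch. It rests on assumptions that the text does not confirm: the agglomerative phase uses complete linkage and stops merging when a merge would reach the diameter limit, and the "distance from cluster R" of a time series is taken to be its distance to the farthest original member of R. The function names and the use of Euclidean distance (`math.dist`) are illustrative choices.

```python
import math


def agglomerative(points, limit):
    """Complete-linkage merging; stops when no merge stays below `limit` (assumed rule)."""
    clusters = [[i] for i in range(len(points))]
    while True:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < limit and (best is None or d < best[0]):
                    best = (d, a, b)
        if best is None:
            return clusters
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]


def adup_clustering(points, limit):
    """Agglomerative clustering followed by duplicate assignment."""
    base = agglomerative(points, limit)
    result = []
    for members in base:
        extended = list(members)
        for i in range(len(points)):
            if i in members:
                continue
            # Assumed cluster distance: farthest original member of the cluster.
            if max(math.dist(points[i], points[j]) for j in members) < limit:
                extended.append(i)
        result.append(sorted(extended))
    return result
```

With the one-dimensional points 0, 1, 2 and a limit of 1.5, the agglomerative phase yields the clusters {0, 1} and {2}; the duplicate-assignment phase then also places point 1 in the second cluster, giving `[[0, 1], [1, 2]]`. Candidates are compared only against the original members of R, so the result does not depend on the order in which duplicates are added.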

In step S1720, because the pre-clusters are generated in the reduced dimensional space, they may contain time series with different trends. In the refinement step, each pre-cluster is refined into several refined clusters. Each refined cluster should contain only time series with similar trends (a high degree of co-movement). To this end, the ADupClustering algorithm is run in the weighted metric space using the upper diameter limit (D).
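The refinement step can be pictured as re-running a clustering routine inside each pre-cluster, this time on the full weighted time series rather than on the reduced representation. The sketch below shows only that bookkeeping, under stated assumptions: `refine` and its parameters are hypothetical names, the clustering routine is passed in as `cluster_fn`, and `greedy_groups` is a deliberately simple placeholder used to make the example runnable. It is not the ADupClustering algorithm, and Euclidean distance stands in for SES_Distance.

```python
import math


def refine(pre_clusters, weighted_series, cluster_fn, upper_limit):
    """Re-cluster each pre-cluster in the weighted space; return global indices."""
    refined = []
    for members in pre_clusters:
        subset = [weighted_series[i] for i in members]
        for local in cluster_fn(subset, upper_limit):
            # Map positions inside the subset back to indices of the full data set.
            cluster = sorted(members[j] for j in local)
            if cluster not in refined:  # overlapping pre-clusters can repeat a result
                refined.append(cluster)
    return refined


def greedy_groups(points, limit):
    """Placeholder clusterer: put each point in the first group it fits entirely."""
    groups = []
    for i, p in enumerate(points):
        for g in groups:
            if all(math.dist(p, points[j]) < limit for j in g):
                g.append(i)
                break
        else:
            groups.append([i])
    return groups
```

For instance, with weighted series `(0.0,)`, `(0.5,)`, `(10.0,)`, `(10.2,)`, pre-clusters `[[0, 1, 2], [2, 3]]`, and an upper limit of 1.0, the first pre-cluster is split into `[0, 1]` and `[2]`, while the second stays together as `[2, 3]`.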

The clustering method for financial time series having a co-movement relationship according to an embodiment of the present invention may include a step (S1730) of interpreting the clustering result.

In step S1730, the result analysis unit 1630 may use the clustering result to predict future prices of financial time series, to select portfolios, and to establish financial policy.

The present invention has been described above with a focus on its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered illustrative rather than limiting. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within a scope equivalent to the claims should be construed as being included in the present invention.

Claims

A clustering method for financial time series, performed by a clustering system for financial time series that includes a data preprocessing unit and a clustering unit, the method comprising:
a data preprocessing step performed by the data preprocessing unit; and a clustering step performed by the clustering unit,
wherein the data preprocessing step comprises:
normalizing the financial time series by performing the mean scaling defined by Equation 2 below;
performing a weighted time series transformation on the normalized financial time series to generate a weighted time series in which greater weights are assigned toward the most recent data; and
reducing the dimension of the weighted time series,
and wherein the clustering step comprises:
a pre-clustering step of generating pre-clusters from the dimension-reduced weighted time series; and
a refinement step of generating refined clusters from each of the generated pre-clusters:
<Equation 2>

(where Y_t is the normalized financial time series at time t, X_t is the financial time series at time t, and μ_X is the mean of X).

The method of claim 1,
wherein the clustering step further comprises, before the pre-clustering step is performed,
estimating an upper diameter limit (D) that limits the size of the pre-clusters and a diameter limit (d) that limits the size of the refined clusters,
wherein estimating the upper diameter limit (D) comprises:
selecting, in the weighted metric space, the maximum value of the SES_Distance given by Equation 1 below from an arbitrary time series to the other time series constituting its neighbor set as an upper diameter limit candidate; and estimating the average of the upper diameter limit candidates as the upper diameter limit (D),
and wherein estimating the diameter limit (d) comprises:
calculating the Euclidean distance in the dimension reduction space between the two time series used to estimate the upper diameter limit (D); selecting the Euclidean distance as a diameter limit candidate (d'); and estimating the diameter limit (d) from the diameter limit candidates (d'):
<Equation 1>

(where SIDist_i is the Euclidean distance over the i-th sub-interval of the weighted time series).

The method of claim 2,
wherein the weighted metric space is a space consisting of weighted time series, and
the dimension reduction space is a space consisting of dimension-reduced weighted time series.

The method of claim 1,
wherein the clustering system further includes a result analysis unit, and the clustering method
further comprises analyzing the clustering result, performed by the result analysis unit.

A clustering system for financial time series, comprising a data preprocessing unit and a clustering unit,
wherein the data preprocessing unit
normalizes the financial time series by performing the mean scaling defined by Equation 2 below,
performs a weighted time series transformation on the normalized financial time series to generate a weighted time series in which greater weights are assigned toward the most recent data, and
reduces the dimension of the weighted time series,
and wherein the clustering unit
pre-clusters the dimension-reduced weighted time series to generate pre-clusters, and
refines each of the pre-clusters to generate refined clusters:
<Equation 2>

The system of claim 5,
wherein the clustering unit,
before performing the pre-clustering,
estimates an upper diameter limit (D) that limits the size of the pre-clusters and a diameter limit (d) that limits the size of the refined clusters,
wherein the upper diameter limit (D) is estimated by
selecting, in the weighted metric space, the maximum value of the SES_Distance given by Equation 1 below from an arbitrary time series to the other time series forming its neighbor set as an upper diameter limit candidate, and estimating the average of the upper diameter limit candidates as the upper diameter limit (D),
and wherein the diameter limit (d) is estimated by
calculating the Euclidean distance in the dimension reduction space between the two time series used to estimate the upper diameter limit (D), selecting the Euclidean distance as a diameter limit candidate (d'), and estimating the diameter limit (d) from the diameter limit candidates (d'):
<Equation 1>

The system of claim 6,
wherein the weighted metric space is a space consisting of weighted time series, and
the dimension reduction space is a space consisting of dimension-reduced weighted time series.

The system of claim 5,
wherein the clustering system for financial time series
further comprises a result analysis unit that analyzes the clustering result.

delete