KR20180059203A

KR20180059203A - Method and program for predicting chargeback fraud user

Info

Publication number: KR20180059203A
Application number: KR1020160158491A
Authority: KR
Inventors: 서재현; 최대선
Original assignee: 공주대학교 산학협력단
Priority date: 2016-11-25
Filing date: 2016-11-25
Publication date: 2018-06-04
Also published as: WO2018097653A1

Abstract

The present invention relates to a method of predicting chargeback fraud users. The method comprises: a data processing step of processing existing transaction history data of normal users and chargeback fraud users into one record per user; a data classification step of dividing the processed transaction history data into training data and test data; a data adjustment step of oversampling the chargeback fraud user data in the training data to adjust the number of records; a prediction classification step of training a specific machine learning technique on the adjusted training data and using the trained machine learning model to predict and classify whether each test data record corresponds to a chargeback fraud user; a performance measurement step of measuring the performance of the prediction classification; an iteration step of repeating the prediction classification step and the performance measurement step, while oversampling or undersampling the chargeback fraud user data in the training data, until the performance of the prediction classification reaches a target value; and a prediction step of predicting chargeback fraud for the transaction history data of a new user by using the machine learning model that has reached the target prediction classification performance.

Description

The present invention relates to a method and program for predicting chargeback fraud users. More particularly, it relates to a method and program that process the transaction history data of existing users to build a machine learning model satisfying a target prediction classification performance, and that use the model to predict chargeback fraud for the transaction history data of a new user.

Recently, as electronic payment methods have become widespread, cases of chargeback fraud, a type of fraud that uses electronic payment methods, have been increasing rapidly.

FIG. 1 is a flow diagram of chargeback fraud committed by a game user.

As an example of chargeback fraud, chargeback fraud by an online game user is committed, as shown in FIG. 1, when a game user purchases game money or the like from a game company with a credit card, uses it up, and then requests a chargeback from the bank. Because this can inflict heavy losses on game companies and others, it has become a serious problem.

Accordingly, there is a need for technology that predicts chargeback fraud in advance so that the resulting losses can be prevented.

KR 10-2016-0017629 A

To prevent the losses from chargeback fraud described above, an object of the present invention is to provide a method and program for predicting chargeback fraud users that process the transaction history data of existing users to build a machine learning model satisfying a target prediction classification performance, and that use the model to predict chargeback fraud for the transaction history data of a new user.

To achieve this object, a method of predicting chargeback fraud users according to an embodiment of the present invention includes: (1) a data processing step of processing existing transaction history data of normal users and chargeback fraud users into one record per user; (2) a data classification step of dividing the processed transaction history data into training data and test data; (3) a data adjustment step of oversampling the chargeback fraud user data in the training data to adjust the number of records; (4) a prediction classification step of training a specific machine learning technique on the adjusted training data and using the trained machine learning model to predict and classify whether each test data record corresponds to a chargeback fraud user; (5) a performance measurement step of measuring the performance of the prediction classification; (6) an iteration step of repeating the prediction classification step and the performance measurement step, while oversampling or undersampling the chargeback fraud user data in the training data, until the performance of the prediction classification reaches a target value; and (7) a prediction step of predicting chargeback fraud for the transaction history data of a new user by using the machine learning model that has reached the target prediction classification performance.

The data processing step may include: (1) a primary feature deletion step of evaluating each feature and each set of multiple features of the transaction history data and deleting the features that fall short of the criterion; (2) a feature generation step of generating new features while processing the transaction history data that has passed through the primary feature deletion step into one record per user by statistical methods; and (3) a secondary feature deletion step of evaluating the generated features and deleting those that fall short of the evaluation criterion.

The primary feature deletion step may include: (1) evaluating each feature of the transaction history data using the information gain technique; and (2) evaluating sets of multiple features of the transaction history data using principal component analysis.

The performance measurement step may measure the performance of the prediction classification using a confusion matrix.

In addition, a program for predicting chargeback fraud users according to an embodiment of the present invention may be stored in a medium in order to predict chargeback fraud users according to the prediction method described above.

The method and program for predicting chargeback fraud users according to an embodiment of the present invention, configured as described above, build a machine learning model that satisfies a target prediction classification performance and can therefore predict chargeback fraud for the transaction history data of a new user, which has the advantage of preventing losses from chargeback fraud before they occur.

FIG. 1 is a flow diagram of chargeback fraud.
FIG. 2 shows a method of predicting chargeback fraud users according to an embodiment of the present invention.
FIG. 3 shows the data processing step (S10) of the method of predicting chargeback fraud users according to an embodiment of the present invention.

The above objects and means of the present invention, and the resulting effects, will become clearer from the following detailed description taken together with the accompanying drawings, so that a person having ordinary skill in the art to which the invention pertains can readily practice its technical idea. In describing the invention, detailed descriptions of related known technology are omitted where they would unnecessarily obscure the gist of the invention.

The terms used in this specification are for describing the embodiments and are not intended to limit the invention. In this specification, singular forms also include plural forms where appropriate, unless the text specifically states otherwise. "Comprises" and/or "comprising," as used in the specification, do not exclude the presence or addition of one or more elements other than those mentioned. Unless otherwise defined, all terms used in this specification have the meaning commonly understood by a person having ordinary skill in the art to which the invention pertains. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are explicitly and specifically defined.

Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 2 shows a method of predicting chargeback fraud users according to an embodiment of the present invention.

The method of predicting chargeback fraud users according to an embodiment of the present invention may be performed by a computer. The computer may be, for example, a desktop personal computer, a laptop personal computer, a netbook computer, or a tablet personal computer, but is not limited to these.

Specifically, as shown in FIG. 2, the method of predicting chargeback fraud users according to an embodiment of the present invention includes a data processing step (S10), a data classification step (S20), a data adjustment step (S30), a prediction classification step (S40), a performance measurement step (S50), an iteration step (S60), and a prediction step (S70).
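The way steps S30 to S60 repeat until the target performance is reached can be sketched as a simple loop. This is only an illustrative outline, not the implementation of the invention: `resample_fn`, `train_fn`, `score_fn`, the list of candidate sampling factors, and the target value are all hypothetical placeholders supplied by the caller.

```python
def fit_until_target(train, test, resample_fn, train_fn, score_fn, factors, target):
    """Try each sampling factor in turn and stop once the score reaches the target.

    All callables are hypothetical placeholders:
      resample_fn(train, factor) -> adjusted training data   (S30 / S60)
      train_fn(adjusted)         -> trained model            (S40, training)
      score_fn(model, test)      -> performance value        (S40 prediction + S50)
    """
    best_model, best_score, best_factor = None, float("-inf"), None
    for factor in factors:
        adjusted = resample_fn(train, factor)
        model = train_fn(adjusted)
        score = score_fn(model, test)
        if score > best_score:
            best_model, best_score, best_factor = model, score, factor
        if score >= target:  # S60: target reached, stop iterating
            break
    # The returned model would then be used for new users (S70).
    return best_model, best_score, best_factor
```

If no factor reaches the target, the sketch returns the best model seen so far; the source does not say what should happen in that case, so this fallback is an assumption.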

The data processing step (S10) processes the transaction history data of existing users. This existing transaction history data contains the transaction histories of both normal users and chargeback fraud users, and can be obtained from the database of a game company or other entity that stores and manages it. Because the number of transactions differs from user to user, the data processing step (S10) processes the transaction history data into one record per user. The transaction history data contains a plurality of attributes that differ in the nature of the data and in its physical form (record format, record length, and so on). These data attributes are referred to below as "features."

Table 1 shows the features of actual transaction history data stored in the database of a game company. The actual transaction history data provided by the game company contained transaction history data (hundreds of thousands of records) for 62,092 normal users and transaction history data (thousands of records) for 372 chargeback fraud users.

For example, the transaction history data may contain the features listed in Table 1. That is, each transaction record may contain the features user_no, standard_country_code, charge_status, charge_no, payment_method_no, charge_amount, bonus_amount, datetime, charge_product_name, hash_ip, and ip_addr.

No.  Feature                 Description
1    user_no                 User identifier
2    standard_country_code   User's country code
3    charge_status           User's charge stage
4    charge_no               Charge identifier
5    payment_method_no       Payment method identifier
6    charge_amount           Charge amount
7    bonus_amount            Bonus amount
8    datetime                Date and time of transaction
9    charge_product_name     Name of the payment gateway
10   hash_ip                 User's IP address converted by a hash function
11   ip_addr                 User's IP address

FIG. 3 shows the data processing step (S10) of the method of predicting chargeback fraud users according to an embodiment of the present invention.

Specifically, as shown in FIG. 3, the data processing step (S10) may include a primary feature deletion step (S11), a feature generation step (S12), and a secondary feature deletion step (S13).

The primary feature deletion step (S11) evaluates each feature and each set of multiple features of the transaction history data and deletes the features that fall short of the evaluation criterion. In this step, each feature of the transaction history data (for example, user_no, standard_country_code, charge_status, charge_no, payment_method_no, charge_amount, bonus_amount, datetime, charge_product_name, hash_ip, and ip_addr) is evaluated using the information gain technique.

Information gain is the expected reduction in entropy when a given feature is selected; the higher its value, the better the feature separates the data. In other words, the primary feature deletion step (S11) uses the information gain technique to obtain, for each feature, a value indicating how well selecting that feature distinguishes chargeback fraud users.

The primary feature deletion step (S11) then deletes any feature whose information gain falls below a certain threshold, because such a feature is not needed to distinguish chargeback fraud users.
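The information gain filter described above can be illustrated with a short sketch. It assumes categorical feature values (continuous features such as charge_amount would need to be discretized first), and the function names and the threshold are hypothetical; the source does not state the threshold actually used.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Expected entropy reduction from splitting the labels by a categorical feature."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def drop_low_gain(columns, labels, threshold):
    """Keep only the features whose information gain is at least the threshold.

    columns maps a feature name to its list of values (one per record).
    """
    return {name: col for name, col in columns.items()
            if information_gain(col, labels) >= threshold}
```

For labels `[0, 0, 1, 1]`, a feature with values `["a", "a", "b", "b"]` separates the classes perfectly and has a gain of 1 bit, while `["x", "y", "x", "y"]` has a gain of 0 and would be dropped by any positive threshold.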

The primary feature deletion step (S11) also evaluates sets of multiple features of the transaction history data using principal component analysis.

Principal component analysis reduces high-dimensional data to low-dimensional data and finds the principal components of the distributed data. That is, the primary feature deletion step (S11) uses principal component analysis to extract the multi-feature sets of the principal components that better distinguish chargeback fraud users. Here, a multi-feature set of the transaction history data is a combination of two or more features, for example {user_no, standard_country_code}, {user_no, charge_status}, ..., {user_no, standard_country_code, charge_status}, {user_no, standard_country_code, charge_no}, and so on.

The primary feature deletion step (S11) then deletes the features contained in multi-feature sets that fall below a certain threshold, that is, sets that do not correspond to a principal component, because such features are not needed to distinguish chargeback fraud users.
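As a rough illustration of what principal component analysis computes, the sketch below finds the first principal component of numeric data by power iteration on the covariance matrix. The source does not specify how components are mapped back to the features to delete; inspecting the component's per-feature weights (features with near-zero weight contribute little) is one common heuristic and is an assumption here, as are all function names.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix; each row is one observation of d numeric features."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_principal_component(rows, iterations=200):
    """Leading eigenvector and eigenvalue of the covariance matrix (power iteration).

    The uniform start vector must not be orthogonal to the leading eigenvector.
    """
    cov = covariance_matrix(rows)
    d = len(cov)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # all features constant: no direction of variance
            break
        vec = [x / norm for x in nxt]
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(d))
                     for i in range(d))
    return vec, eigenvalue


def explained_variance_ratio(rows):
    """Share of the total variance captured by the first principal component."""
    cov = covariance_matrix(rows)
    _, eigenvalue = first_principal_component(rows)
    return eigenvalue / sum(cov[i][i] for i in range(len(cov)))
```

With rows such as `[1, 2, 0], [2, 4, 0], [3, 6, 0], [4, 8, 0]`, the third feature is constant and receives zero weight in the component, which is the kind of feature this heuristic would flag for deletion.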

The feature generation step (S12) generates new features while processing the transaction history data that has passed through the primary feature deletion step into one record per user by statistical methods. The statistical methods may include, but are not limited to, count, sum, difference, mean, standard deviation, maximum, minimum, date statistics, and time statistics.
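A minimal sketch of this per-user aggregation follows. The input field names come from Table 1 and the output names from Table 2, but the exact definitions are assumptions: the source does not say whether the standard deviation is the population or sample variant, and reading the `_kind` suffix as "number of distinct values" is a guess. The function name is hypothetical.

```python
import statistics
from collections import defaultdict


def aggregate_per_user(transactions):
    """Collapse many transaction records into one feature record per user_no."""
    by_user = defaultdict(list)
    for tx in transactions:
        by_user[tx["user_no"]].append(tx)

    rows = {}
    for user_no, txs in by_user.items():
        amounts = [tx["charge_amount"] for tx in txs]
        rows[user_no] = {
            "transaction_cnt_sum": len(txs),
            "charge_amount_sum": sum(amounts),
            "charge_amount_avg": statistics.fmean(amounts),
            # Population standard deviation (assumption); it is 0.0 for one transaction.
            "charge_amount_stddev": statistics.pstdev(amounts),
            # Assumed meaning of "_kind": number of distinct values seen for the user.
            "payment_method_no_kind": len({tx["payment_method_no"] for tx in txs}),
        }
    return rows
```

A user with two charges of 10 and 30 made with two different payment methods yields a count of 2, a sum of 40, a mean of 20.0, a standard deviation of 10.0, and 2 payment method kinds.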

The secondary feature deletion step (S13) evaluates the generated features and deletes those that fall short of the evaluation criterion. In this step (S13), each feature of the transaction history data is evaluated using the information gain technique.

The secondary feature deletion step (S13) then deletes any feature whose information gain falls below a certain threshold, because such a feature is not needed to distinguish chargeback fraud users.

Table 2 shows the features obtained by processing the features of the transaction history data listed in Table 1 through the data processing step (S10), including the primary feature deletion step (S11) and the secondary feature deletion step (S13).

No.  Feature                      Description
1    user_no                      User identifier
2    standard_country_code        User's country code
3    standard_country_code_kind   Kinds of the user's country codes
4    charge_stat10                User's charge count is 10 or fewer
5    charge_stat20                User's charge count is 20 or fewer
6    charge_stat30                User's charge count is 30 or fewer
7    payment_method_no            Most recent payment method
8    payment_method_no_kind       Kinds of payment methods
9    charge_amount_sum            Total charge amount
10   charge_amount_avg            Average charge amount
11   charge_amount_stddev         Standard deviation of charge amount
12   bonus_amount_sum             Total bonus amount
13   bonus_amount_avg             Average bonus amount
14   bonus_amount_stddev          Standard deviation of bonus amount
15   transaction_recent_monthday  Date of last transaction
16   transaction_recent_hour      Time of last transaction
17   transaction_cnt_sum          Total number of transactions
18   transaction_cnt_1_month      Number of transactions during 1 month
19   transaction_cnt_2_month      Number of transactions during 2 months, excluding the most recent 1 month
20   transaction_cnt_3_month      Number of transactions during 3 months, excluding the most recent 2 months
21   transaction_cnt_6_month      Number of transactions during 6 months, excluding the most recent 3 months
22   transaction_cnt_else         Total number of transactions, excluding the most recent 6 months
23   charge_product_name          Payment gateway
24   charge_product_name_kind     Kinds of payment gateways
25   ip_addr                      IP address
26   ip_addr_kind                 Kinds of IP addresses
27   class                        0: normal user, 1: chargeback fraud user

That is, standard_country_code_kind was generated from standard_country_code; charge_stat10, charge_stat20, and charge_stat30 were generated from charge_status; payment_method_no_kind was generated from payment_method_no; and charge_amount_sum, charge_amount_avg, charge_amount_stddev, bonus_amount_sum, bonus_amount_avg, and bonus_amount_stddev were generated from charge_amount and bonus_amount. In addition, transaction_recent_monthday, transaction_recent_hour, transaction_cnt_sum, transaction_cnt_1_month, transaction_cnt_2_month, transaction_cnt_3_month, transaction_cnt_6_month, and transaction_cnt_else were generated from datetime; charge_product_name_kind was generated from charge_product_name; and ip_addr_kind was generated from ip_addr. The features charge_no and hash_ip were deleted, and class was added to label each user. Alternatively, class may be included in Table 1 from the start.

For reference, when the information gain of the features in Table 2 was computed using the decision tree (DT) based ClassifierSubsetEval attribute evaluator together with a genetic algorithm, the features numbered 4, 5, 7, 8, 10, 11, 12, 17, 18, 19, and 20, namely charge_stat10, charge_stat20, payment_method_no, payment_method_no_kind, charge_amount_avg, charge_amount_stddev, bonus_amount_sum, transaction_cnt_sum, transaction_cnt_1_month, transaction_cnt_2_month, and transaction_cnt_3_month, were extracted as features with relatively high information gain.

Next, the data classification step (S20) divides the processed transaction history data into training data and test data. The training data is used to train the specific machine learning technique applied later, and the test data is used to test the performance of the trained machine learning model.

Table 3 shows various data set types for dividing the processed transaction history data into training data and test data.

                        66% split   10-fold   50% split
Normal user             21,113      62,092    31,046
Chargeback fraud user   125         372       186

For example, the 66% split uses 66% of the transaction history data as training data and the remaining 34% as test data. The 10-fold type uses 9/10 of the transaction history data as training data and the remaining 1/10 as test data, and applies cross-validation. The 50% split divides the transaction history data equally into training data and test data, with each part produced by StratifiedFolds preprocessing.
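The split types above can be sketched in plain Python. This is not the StratifiedFolds preprocessing named in the text, only a simplified stand-in: `stratified_split` keeps the class ratio in both parts, and `k_fold_indices` yields the train/test index pairs used in k-fold cross-validation. The function names, the `class` key, and the seeds are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, train_fraction, seed=0):
    """Split rows into (train, test) while keeping the class ratio in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row["class"]].append(row)

    train, test = [], []
    for members in by_class.values():
        members = members[:]          # copy so the caller's order is untouched
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every row is in the test part exactly once."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Calling `stratified_split(rows, 0.5)` corresponds to the 50% split, and iterating over `k_fold_indices(len(rows), k=10)` corresponds to the 10-fold type, with each fold using 9/10 of the rows for training.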

Next, the data adjustment step (S30) oversamples the chargeback fraud user data in the training data to adjust the number of records. Because there are far fewer records for chargeback fraud users than for normal users, a machine learning model trained on the training data as is may perform poorly. Oversampling the chargeback fraud user data in the training data through the data adjustment step (S30) can therefore improve the performance of the machine learning model. Specific experimental examples of this improvement are described later.
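One simple way to oversample the minority class is to duplicate randomly chosen fraud records, as sketched below. This passage does not name a specific oversampling technique, so random duplication is an assumption, as are the function name, the `class` key, and the `factor` parameter.

```python
import random


def oversample_fraud(rows, factor, seed=0):
    """Return the rows with the fraud class (class == 1) grown to about factor times its size.

    Normal rows are left untouched; a factor of 1 or less adds nothing.
    """
    rng = random.Random(seed)
    fraud = [r for r in rows if r["class"] == 1]
    extra = max(0, round(len(fraud) * factor) - len(fraud))
    # Duplicate randomly chosen fraud rows (sampling with replacement).
    return list(rows) + [rng.choice(fraud) for _ in range(extra)]
```

With 20 normal rows and 2 fraud rows, a factor of 5 yields 20 normal rows and 10 fraud rows. Only the training data should be resampled this way, so that the test data keeps the original class ratio.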

Next, the prediction classification step (S40) trains a preselected machine learning technique using the adjusted training data as learning data. Machine learning encompasses various algorithms, including supervised learning, unsupervised learning, and semi-supervised learning, and is not particularly limited. Supervised learning may include the support vector machine (SVM), hidden Markov model, regression, neural network, naive Bayes classification, and the like.

The prediction classification step (S40) then uses the machine learning model trained on the training data to predict and classify whether each test data record corresponds to a chargeback fraud user.

Next, the performance measurement step (S50) measures the performance of the prediction classification. That is, it measures how accurately the machine learning model classified the test data in the prediction classification step (S40). The performance measurement step (S50) may use a confusion matrix for this measurement.

Table 4 shows the confusion matrix.

                 Predicted: True        Predicted: False
Actual: True     True Positives (TP)    False Negatives (FN)
Actual: False    False Positives (FP)   True Negatives (TN)

TP denotes the case where the machine learning model classified a test data item as a chargeback fraud user and the user actually is a chargeback fraud user, and TN denotes the case where the model classified a test data item as a normal user and the user actually is a normal user. FP denotes the case where the model classified a test data item as a chargeback fraud user but the user is actually a normal user, and FN denotes the case where the model classified a test data item as a normal user but the user is actually a chargeback fraud user.
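The four cells described above can be tallied directly from paired actual and predicted labels. The following is a minimal Python sketch, assuming labels are encoded as 1 for a chargeback fraud user and 0 for a normal user; the function name `confusion_counts` and the toy labels are hypothetical and are not taken from the patent.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count TP, FN, FP, TN for binary labels (1 = fraud user, 0 = normal user)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = Counter(zip(actual, predicted))  # key: (actual, predicted)
    return {
        "TP": pairs[(1, 1)],  # predicted fraud, actually fraud
        "FN": pairs[(1, 0)],  # predicted normal, actually fraud
        "FP": pairs[(0, 1)],  # predicted fraud, actually normal
        "TN": pairs[(0, 0)],  # predicted normal, actually normal
    }

# Toy labels for illustration only (not data from the patent).
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0]
counts = confusion_counts(actual, predicted)
print(counts)
```

A `Counter` keyed by (actual, predicted) pairs returns zero for combinations that never occur, so no cell needs special handling.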

That is, in the performance measurement step (S50), the number of results that the machine learning model produced for each item of test data is tallied according to the confusion matrix of Table 4. The value of a performance indicator is then calculated. The performance indicator measures the classification accuracy of the machine learning model used in the predictive classification step (S40) and may include Recall (= TPR), Precision, F-measure, ROC curve, and the like as listed in Table 5, but is not limited thereto; any indicator that measures data classification accuracy may be used without limitation.

Table 5 shows the performance indicators used to measure the performance of the machine learning model that performed the predictive classification in step S40.

| Performance indicator | Calculation method |
| --- | --- |
| Recall = TPR (True Positive Rate) | TP / (TP + FN) |
| FPR (False Positive Rate) | FP / (FP + TN) |
| Precision | TP / (TP + FP) |
| F-measure | 2 × Precision × Recall / (Precision + Recall) |
| ROC curve Area (Receiver operating characteristic) | Area under the curve whose X axis is the FPR and whose Y axis is the TPR |
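The count-based indicators can be computed from the four confusion-matrix counts using their standard definitions. The sketch below is illustrative only: the function name `performance_indicators`, the zero-denominator convention, and the toy counts are assumptions, and the ROC curve area is omitted because it needs ranked scores rather than counts.

```python
def performance_indicators(tp, fn, fp, tn):
    """Return Recall (TPR), FPR, Precision and F-measure from confusion-matrix counts."""
    def ratio(num, den):
        # Assumed convention: report 0.0 when the denominator is zero.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)        # TP / (TP + FN)
    fpr = ratio(fp, fp + tn)           # FP / (FP + TN)
    precision = ratio(tp, tp + fp)     # TP / (TP + FP)
    f_measure = ratio(2 * precision * recall, precision + recall)
    return {"recall": recall, "fpr": fpr, "precision": precision, "f_measure": f_measure}

# Toy counts for illustration only (not data from the patent).
m = performance_indicators(tp=2, fn=1, fp=1, tn=4)
print({name: round(value, 3) for name, value in m.items()})

# Consistency check against one reported row (DT, 66% split, Class 1):
# Precision 0.841 and Recall 0.552 should give F-measure 0.667.
f_check = round(2 * 0.841 * 0.552 / (0.841 + 0.552), 3)
print(f_check)
```

The final check shows that the reported F-measure values agree with the harmonic mean of Precision and Recall.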

Tables 6 and 7 show the measured predictive classification performance when the predictive classification step (S40) and the performance measurement step (S50) are performed with the decision tree (DT) and support vector machine (SVM) techniques for each data set type in Table 3. To compare the effect of the data adjustment step (S30) directly, Table 6 shows the results of performing steps S40 and S50 with the data adjustment step (S30) omitted, and Table 7 shows the results of performing the data adjustment step (S30), the predictive classification step (S40), and the performance measurement step (S50).

Table 6

| Algorithm | Dataset type | TPR | FPR | Precision | Recall | F-measure | ROC Area | Class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DT | 66% split | 0.552 | 0.001 | 0.841 | 0.552 | 0.667 | 0.828 | 1 |
| DT | 66% split | 0.999 | 0.448 | 0.997 | 0.999 | 0.998 | 0.828 | 0 |
| DT | 10-fold | 0.530 | 0.001 | 0.853 | 0.530 | 0.653 | 0.877 | 1 |
| DT | 10-fold | 0.999 | 0.470 | 0.997 | 0.999 | 0.998 | 0.877 | 0 |
| DT | 50% split | 0.516 | 0.001 | 0.787 | 0.516 | 0.623 | 0.808 | 1 |
| DT | 50% split | 0.999 | 0.484 | 0.997 | 0.999 | 0.998 | 0.808 | 0 |
| SVM | 66% split | 0.544 | 0.000 | 0.883 | 0.544 | 0.673 | 0.772 | 1 |
| SVM | 66% split | 1.000 | 0.456 | 0.997 | 1.000 | 0.998 | 0.772 | 0 |
| SVM | 10-fold | 0.573 | 0.000 | 0.914 | 0.573 | 0.704 | 0.786 | 1 |
| SVM | 10-fold | 1.000 | 0.427 | 0.997 | 1.000 | 0.999 | 0.786 | 0 |
| SVM | 50% split | 0.570 | 0.001 | 0.869 | 0.570 | 0.688 | 0.785 | 1 |
| SVM | 50% split | 0.999 | 0.430 | 0.997 | 0.999 | 0.998 | 0.785 | 0 |

Table 7

| Oversampling ratio (%) | Algorithm | TPR | FPR | Precision | Recall | F-measure | ROC Area | Class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | DT | 0.661 | 0.001 | 0.909 | 0.661 | 0.766 | 0.896 | 1 |
| 100 | DT | 0.999 | 0.339 | 0.996 | 0.999 | 0.998 | 0.896 | 0 |
| 100 | SVM | 0.820 | 0.001 | 0.941 | 0.820 | 0.876 | 0.910 | 1 |
| 100 | SVM | 0.999 | 0.180 | 0.998 | 0.999 | 0.999 | 0.910 | 0 |
| 200 | DT | 0.677 | 0.001 | 0.941 | 0.677 | 0.788 | 0.924 | 1 |
| 200 | DT | 0.999 | 0.323 | 0.994 | 0.999 | 0.997 | 0.924 | 0 |
| 200 | SVM | 0.887 | 0.001 | 0.964 | 0.887 | 0.924 | 0.943 | 1 |
| 200 | SVM | 0.999 | 0.113 | 0.998 | 0.999 | 0.999 | 0.943 | 0 |
| 300 | DT | 0.792 | 0.002 | 0.944 | 0.792 | 0.861 | 0.920 | 1 |
| 300 | DT | 0.998 | 0.208 | 0.993 | 0.998 | 0.995 | 0.920 | 0 |
| 300 | SVM | 0.948 | 0.001 | 0.980 | 0.948 | 0.964 | 0.974 | 1 |
| 300 | SVM | 0.999 | 0.052 | 0.998 | 0.999 | 0.999 | 0.974 | 0 |

Referring to the Recall values for chargeback fraud users (Class = 1) in Tables 6 and 7, the support vector machine (SVM) shows better predictive classification performance than the decision tree (DT) (for example, 0.948 versus 0.792 at an oversampling ratio of 300%), and performing the data adjustment step (S30) further improves the predictive classification performance.

The iteration step (S60) repeats the predictive classification step (S40) and the performance measurement step (S50), oversampling or undersampling the chargeback fraud user data in the training data, until the performance of the predictive classification reaches a target value. If the oversampling ratio is too high and the amount of training data grows excessively, training in the predictive classification step (S40) may become overloaded, so the iteration step (S60) may apply undersampling to the training data in addition to oversampling. The iteration step (S60) may also undersample in order to reach the target value more precisely.
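This passage does not restate how the additional fraud-user records are produced, so the following Python sketch should be read as one possible interpretation only. It assumes plain random duplication of fraud-user records and assumes that a ratio of r% means adding r% as many fraud records as were originally present; the function name `oversample_minority` and the toy data are hypothetical.

```python
import random

def oversample_minority(records, labels, rate_percent, minority=1, seed=0):
    """Append randomly duplicated minority records.

    Assumed meaning of rate_percent: 100 adds as many copies as there are
    original minority records, 200 adds twice as many, and so on.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    minority_records = [r for r, y in zip(records, labels) if y == minority]
    n_extra = round(len(minority_records) * rate_percent / 100)
    extra = [rng.choice(minority_records) for _ in range(n_extra)]
    return list(records) + extra, list(labels) + [minority] * n_extra

# Toy data for illustration only: 2 fraud users among 8 records.
records = [[0.1], [0.9], [0.2], [0.8], [0.3], [0.25], [0.15], [0.05]]
labels = [0, 1, 0, 1, 0, 0, 0, 0]
new_records, new_labels = oversample_minority(records, labels, rate_percent=200)
print(sum(new_labels), len(new_labels))
```

Lowering `rate_percent` on a later pass reduces the number of added records, which is how this sketch would model the undersampling adjustment described above.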

The oversampling or undersampling ratio used in the iteration step (S60) may be regular or arbitrary and is not particularly limited. For example, the oversampling ratio at the n-th iteration may be set to A×Bⁿ (where A and B are natural numbers and n is an integer). After the n-th iteration, the undersampling ratio at the m-th iteration may be set to A×Bⁿ−(C×m) (where A, B, and C are natural numbers, n and m are integers, and n<m).
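The two ratio formulas can be written as small functions. This is a sketch under stated assumptions: the function names and the constants A = 100, B = 2, C = 10 are chosen only for illustration and do not appear in the patent.

```python
def oversampling_ratio(a, b, n):
    """Oversampling ratio at the n-th iteration: A x B^n."""
    return a * b ** n

def undersampling_ratio(a, b, c, n, m):
    """Ratio at the m-th iteration, after n oversampling iterations: A x B^n - (C x m)."""
    if not n < m:
        raise ValueError("m must be greater than n")
    return a * b ** n - c * m

# Hypothetical constants for illustration: A = 100, B = 2, C = 10.
schedule = [oversampling_ratio(100, 2, n) for n in range(3)]
reduced = undersampling_ratio(100, 2, 10, n=2, m=3)
print(schedule, reduced)
```

With these constants the ratio grows geometrically during oversampling and is then stepped back linearly, which gives a coarse search followed by a finer one.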

As a concrete example, when "Recall of 0.940 or more" is set as the performance target for the support vector machine (SVM), the predictive classification step (S40) and the performance measurement step (S50) are first performed with oversampling at 100%. The resulting Recall is 0.820, below the target, so as the first iteration the oversampling is raised to 200% and steps S40 and S50 are performed again. The resulting Recall is 0.887, still below the target, so as the second iteration the oversampling is raised to 300% and steps S40 and S50 are performed again. The resulting Recall is 0.948, which meets the target. The process may then proceed directly to the prediction step (S70), but the iteration step (S60) may instead undersample as a third iteration, that is, set the oversampling to a value smaller than 300%, for example 280%, and perform steps S40 and S50 again.
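The stopping logic of this example can be sketched as a loop over candidate ratios. The function name `tune_oversampling` is hypothetical, and the evaluator passed to it is a stand-in for the resample, train, predict, and measure sequence; here it simply replays the SVM Recall values quoted in the paragraph above rather than training a model.

```python
def tune_oversampling(evaluate_recall, ratios, target):
    """Try each oversampling ratio in order; stop at the first whose Recall meets the target.

    evaluate_recall(ratio) stands in for steps S30 to S50.
    Returns (chosen_ratio or None, list of (ratio, recall) pairs tried).
    """
    history = []
    for ratio in ratios:
        recall = evaluate_recall(ratio)
        history.append((ratio, recall))
        if recall >= target:
            return ratio, history
    return None, history  # target never reached with the given ratios

# Stand-in evaluator: Recall values for SVM quoted in the text above.
reported_recall = {100: 0.820, 200: 0.887, 300: 0.948}
best, history = tune_oversampling(reported_recall.__getitem__, [100, 200, 300], target=0.940)
print(best, history)
```

In practice the evaluator would retrain the model at each ratio, and a follow-up call with a smaller ratio such as 280 would correspond to the optional undersampling pass.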

The prediction step (S70) predicts chargeback fraud for the transaction history data of a new user using the machine learning model that has reached the target predictive classification performance.

Meanwhile, a program for predicting a chargeback fraud user according to an embodiment of the present invention is a program stored in a medium to perform the prediction of chargeback fraud users according to the above-described prediction method according to an embodiment of the present invention. For example, the program may be recorded in a recording medium readable by a computer or a similar device.

For example, the recording medium may be a buffer, a main memory, or an auxiliary memory made up of a hard disk type, a magnetic media type, a CD-ROM (compact disc read only memory), an optical media type, a magneto-optical media type, a multimedia card micro type, a card-type memory (for example, SD or XD memory), a flash memory type, a ROM (read only memory), a RAM (random access memory), or a memory combining these, but is not limited thereto.

In addition, the program may be stored in an attachable storage device that can access the input device through a communication network such as the Internet, an intranet, a LAN (Local Area Network), a WLAN (Wide LAN), or a SAN (Storage Area Network), or through a communication network combining these.

Although specific embodiments have been described in the detailed description of the present invention, various modifications are of course possible without departing from the scope of the present invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be determined by the claims below and their equivalents.

S10: Data processing step S11: First feature deletion step
S12: Feature generation step S13: Second feature deletion step
S20: Data classification step S30: Data adjustment step
S40: Predictive classification step S50: Performance measurement step
S60: Iteration step S70: Prediction step

Claims

1. A method of predicting a chargeback fraud user, the method comprising:
a data processing step of processing existing transaction history data for normal users and chargeback fraud users into one record per user;
a data classification step of dividing the processed transaction history data into training data and test data;
a data adjustment step of adjusting the amount of data by oversampling the data for chargeback fraud users among the training data;
a predictive classification step of training a specific machine learning technique on the training data whose amount has been adjusted, and using the trained machine learning model to predict and classify whether each item of test data corresponds to a chargeback fraud user;
a performance measurement step of measuring the performance of the predictive classification;
an iteration step of repeating the predictive classification step and the performance measurement step while oversampling or undersampling the data for chargeback fraud users among the training data until the performance of the predictive classification reaches a target value; and
a prediction step of predicting chargeback fraud for transaction history data of a new user using the machine learning model that has reached the target predictive classification performance.

2. The method according to claim 1,
wherein the data processing step includes:
a first feature deletion step of evaluating each feature and a plurality of feature sets of the transaction history data and deleting features that do not meet an evaluation criterion;
a feature generation step of generating new features by processing, using a statistical method, the transaction history data that has undergone the first feature deletion step into one record per user; and
a second feature deletion step of evaluating the generated features and deleting features that do not meet the evaluation criterion.

3. The method according to claim 2,
wherein the first feature deletion step includes:
evaluating each feature of the transaction history data using an information gain technique; and
evaluating a plurality of feature sets of the transaction history data using a principal component analysis technique.

4. The method according to claim 1,
wherein the performance measurement step measures the performance of the predictive classification using a confusion matrix.

5. A program for predicting a chargeback fraud user, stored in a medium, for performing prediction of a chargeback fraud user according to the method of any one of claims 1 to 4.