TWI887894B - Non-transitory computer-readable storage medium and method and apparatus for processing order data - Google Patents
- Publication number
- TWI887894B (application TW112147800A)
- Authority
- TW
- Taiwan
- Prior art keywords
- data
- order
- pipeline
- electronic device
- statistical
- Prior art date
Classifications
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
- G06F16/23—Updating
- G06F16/2365—Ensuring data consistency and integrity
- G06F16/2379—Updates performed during online database operations; commit processing
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3869—Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
Abstract
The present disclosure relates to a data verification method for an electronic device, comprising the steps of: generating statistical data for an order based on order data; generating, based on the order data, baseline data for verifying the statistical data; comparing the incremental values of the baseline data and the statistical data; and verifying the statistical data based on the comparison of the incremental values.
Description
The embodiments of this specification relate to a method and an apparatus for processing order data.
As Internet use has become widespread, the e-commerce market has continued to expand. In particular, with the spread of the pandemic, the share of consumers buying goods at physical stores has declined, while the share buying goods through e-commerce on a computer or smartphone has grown rapidly.
E-commerce companies need to collect and generate statistics such as order volume, page views, and product clicks in order to perform analyses and make decisions about their e-commerce services for entities such as users. Generating statistics from order data requires analyzing all of a user's orders in real time, which is quite difficult: a user's order history occupies a large amount of data, and individual orders can be created and canceled at any time. A method is therefore needed that can analyze large volumes of data in real time to produce statistics.
In addition, validation of the statistics produced in real time by the data pipeline is itself an important problem. The statistics are updated in real time, and users can use the updated statistics as soon as each update occurs. However, if any part of the pipeline fails, the statistics it produces may suffer data quality problems; for example, newly released code implementing the pipeline may contain bugs. Discovering such problems late can cause further operational problems for the e-commerce service, so it is important to run regular data quality checks on the statistics to avoid such situations. However, dumping and processing aggregated data such as metrics (e.g., SDP visits, clicks, the sum of order amounts) is very time-consuming, making it hard to verify the latest data at any given moment, and data such as user order volumes can amount to 8 TB. A method is therefore needed that can perform data quality checks on massive data efficiently.
As a prior document on such data processing methods, there is Korean Laid-open Patent Publication No. 10-2009-0121754.
The embodiments of the present disclosure are proposed to solve the problems described above. Their purpose is to generate statistics from real-time order data and to verify those statistics by comparing the incremental values of the statistics with those of baseline data likewise generated from the order data, thereby enabling efficient statistical analysis and verification of large volumes of data that change in real time.
To achieve the above purpose, a data verification method for an electronic device according to an embodiment of the present disclosure includes the steps of: generating statistical data for an order based on order data; generating, based on the order data, baseline data for verifying the statistical data; comparing the incremental values of the baseline data and the statistical data; and verifying the statistical data based on the comparison of the incremental values.
According to an embodiment, the order data includes order identifiers and order quantities of one or more orders, and the statistical data includes the sum of the order quantities.
According to an embodiment, generating the statistical data further includes the steps of: receiving an order cancellation event; determining the order quantity canceled by the order cancellation event; and accumulating into the statistical data the value obtained by multiplying the canceled order quantity by -1.
According to an embodiment, the statistical data is generated by a data pipeline that ensures exactly-once processing of the order data.
According to an embodiment, the data pipeline includes a first pipeline that ensures at-least-once processing and a second pipeline that ensures exactly-once processing, and the output of the first pipeline is input to the second pipeline.
According to an embodiment, when R(n+1), representing the (n+1)-th change of the order data, is input to the first pipeline, generating the statistical data includes the steps of: retrieving from a cache R(n), representing the n-th change of the order data; multiplying the value of R(n) by -1; transmitting -R(n) and R(n+1) to the second pipeline; and caching R(n+1).
According to an embodiment, generating the statistical data further includes generating the statistical data by aggregating -R(n) and R(n+1) in the second pipeline.
According to an embodiment, R(n) includes an order ID and an order version.
According to an embodiment, generating the baseline data further includes verifying the baseline data based on the order data.
According to an embodiment, the incremental values are computed at a period set based on the statistical data.
An electronic device for verifying data according to an embodiment of the present disclosure includes: a memory storing at least one instruction; and a processor, which, by executing the at least one instruction: generates statistical data for an order based on order data; generates, based on the order data, baseline data for verifying the statistical data; compares the incremental values of the baseline data and the statistical data; and verifies the statistical data based on the comparison of the incremental values.
A non-transitory computer-readable storage medium according to an embodiment of the present disclosure stores computer-readable instructions that, when executed by a processor, cause the processor to perform a data verification method for an electronic device, the method including the steps of: generating statistical data for an order based on order data; generating, based on the order data, baseline data for verifying the statistical data; comparing the incremental values of the baseline data and the statistical data; and verifying the statistical data based on the comparison of the incremental values.
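The cache-and-negate steps described above (retrieve R(n), emit -R(n) together with R(n+1), then cache R(n+1)) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record fields, the in-memory dictionary standing in for the cache, and the function names are all assumptions.

```python
# Sketch of the first-pipeline step: on each new revision R(n+1) of an
# order, emit -R(n) and R(n+1) downstream, then cache R(n+1).

cache = {}  # order_id -> last seen revision; stands in for an external cache


def ingest(record):
    """record: dict with 'order_id', 'version', 'quantity' (assumed fields)."""
    emitted = []
    prev = cache.get(record["order_id"])            # retrieve R(n), if any
    if prev is not None:
        emitted.append(dict(prev, quantity=-prev["quantity"]))  # -R(n)
    emitted.append(record)                          # R(n+1)
    cache[record["order_id"]] = record              # cache R(n+1)
    return emitted


# The second pipeline then simply aggregates (sums) quantities, so -R(n)
# cancels the value that was counted for the previous revision.
out = ingest({"order_id": 1002, "version": 1, "quantity": 200})
out += ingest({"order_id": 1002, "version": 2, "quantity": 150})
total = sum(r["quantity"] for r in out)  # 200 - 200 + 150 = 150
```

Because each revision carries an order ID and version, the downstream aggregation never needs to re-read the full order history; it only consumes the pair (-R(n), R(n+1)).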
According to an embodiment of the present disclosure, the electronic device stores in a first database, such as a counteraction record, only the data that should be reflected in the order statistics, and updates the statistics with the value obtained by multiplying the canceled order quantity in the first database by -1. Statistics such as metrics can thus be computed and updated in real time without retrieving all the data in the existing order history.
Furthermore, according to an embodiment of the present disclosure, even when order data streamed through the message queue arrives out of order, or a conflict occurs while the data pipeline is processing it, the electronic device can perform exception handling so that the changed order state is correctly reflected in the final statistics regardless of the problem, thereby effectively maintaining the integrity of real-time data processing.
Furthermore, according to an embodiment of the present disclosure, if the electronic device has performed a data quality check at time T, the accuracy of the data up to time T is established; to perform a data quality check at time T+1, the accuracy of the data up to time T+1 can be established by checking only the increment between time T and time T+1. The electronic device can thus verify the integrity of the data over every time period without unnecessary additional computation.
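The incremental check described above can be sketched as a comparison of deltas between cumulative snapshots. This is an illustrative sketch only; the cumulative-dictionary representation and the function name are assumptions, not details from this patent.

```python
# Incremental data quality check: once the data up to time T is verified,
# only the [T, T+1] increment of baseline vs. statistics needs comparing.

def incremental_check(baseline, stats, t_prev, t_now):
    """baseline/stats: dicts mapping a time index to a cumulative value."""
    baseline_delta = baseline[t_now] - baseline[t_prev]
    stats_delta = stats[t_now] - stats[t_prev]
    return baseline_delta == stats_delta


baseline = {0: 0, 1: 300, 2: 250}  # cumulative order quantity from raw orders
stats = {0: 0, 1: 300, 2: 250}     # cumulative order quantity from the pipeline

# If time 1 was already verified, time 2 needs only a delta comparison
# (-50 on both sides here), not a re-scan of the full history.
ok = incremental_check(baseline, stats, 1, 2)
```

The design choice this illustrates is that verification cost tracks the size of the increment rather than the total dataset size, which is what makes regular checks feasible on multi-terabyte order data.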
100: electronic device
110: processor
120: memory
FIG. 1 is a schematic diagram illustrating the components of an electronic device according to an embodiment of the present disclosure.
FIG. 2 is an exemplary diagram illustrating a method of computing order statistics according to an embodiment of the present disclosure.
FIGS. 3a to 3c are exemplary diagrams illustrating a real-time data pipeline design for generating statistics of an electronic device according to an embodiment of the present disclosure.
FIG. 4 is an exemplary architecture illustrating data flows related to a baseline generator and a data quality verifier according to an embodiment of the present disclosure.
FIGS. 5a and 5b are exemplary diagrams illustrating the operation of a data quality verifier according to an embodiment of the present disclosure.
FIG. 6 is a flowchart illustrating the flow of a data verification method of an electronic device according to an embodiment of the present disclosure.
FIGS. 7a to 7c are exemplary diagrams illustrating designs for generating statistics according to several embodiments of the present disclosure.
Hereinafter, embodiments of the present disclosure are described in detail with reference to the accompanying drawings.
In describing the embodiments, descriptions of technical content that is well known in the technical field to which the present disclosure belongs and is not directly related to the present disclosure are omitted. This is to convey the gist of the present disclosure clearly, without obscuring it with redundant description.
For the same reason, some components are exaggerated, omitted, or shown schematically in the drawings, and the size of each component does not fully reflect its actual size. In each drawing, identical or corresponding components are given the same reference numerals.
The advantages and features of the present disclosure, and the methods of achieving them, will become clear from the embodiments described in detail below in conjunction with the accompanying drawings. However, the present disclosure is not limited to the embodiments disclosed below and may be implemented in various forms; these embodiments are provided only to make the disclosure complete and to fully inform those skilled in the art of the scope of the invention, which is defined only by the claims. Throughout the specification, the same reference numerals refer to the same components.
It should be understood that each block of the flowcharts, and combinations of flowchart blocks, can be implemented by computer program instructions. These instructions can be loaded onto a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, so that the instructions executed by that processor create means for implementing the functions described in the flowchart block(s). The instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing apparatus, so that the instructions stored in that memory produce an article of manufacture containing instruction means that implement the functions described in the flowchart block(s). The instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on it, producing a computer-executed process, so that the instructions executed on the computer or other programmable apparatus provide steps for implementing the functions described in the flowchart block(s).
Each block may represent a module, a segment, or a portion of code that includes one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations the functions noted in the blocks may occur out of order; for example, two blocks shown in succession may in fact be executed substantially concurrently, or in the reverse order, depending on the functionality involved.
The term "~unit" used in this embodiment refers to software or a hardware element such as an FPGA or ASIC, and a "~unit" performs certain roles. However, "~unit" is not limited to software or hardware. A "~unit" may be configured to reside in an addressable storage medium or to run on one or more processors. Thus, as an example, a "~unit" includes elements such as software elements, object-oriented software elements, class elements, and task elements, as well as processes, functions, attributes, procedures, subroutines, code segments, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functionality provided by the elements and "~units" may be combined into a smaller number of elements and "~units" or further separated into additional elements and "~units". In addition, the elements and "~units" may be implemented to run one or more CPUs in a device or a secure multimedia card.
FIG. 1 is a schematic diagram illustrating the components of an electronic device according to an embodiment of the present disclosure.
Referring to FIG. 1, the electronic device 100 may include a processor 110 and a memory 120, and may perform the method of processing order information. The electronic device 100 shown in FIG. 1 illustrates only the components related to this embodiment; it will be apparent to those of ordinary skill in the art that the electronic device 100 may include other general-purpose components in addition to those shown in FIG. 1.
The processor 110 controls the overall functions of the electronic device 100 for processing order information. For example, the processor 110 may control the electronic device 100 as a whole by executing programs stored in the memory 120 of the electronic device 100. The processor 110 may be implemented as a central processing unit (CPU), a graphics processing unit (GPU), an application processor (AP), or the like provided in the electronic device 100, but is not limited thereto.
The memory 120 is hardware that stores the various data processed in the electronic device 100; it may store data that has been processed and data to be processed, as well as applications, drivers, and the like executable by the electronic device 100. The memory 120 may include random-access memory (RAM) such as dynamic random-access memory (DRAM) or static random-access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), CD-ROM, Blu-ray or other optical disc storage, a hard disk drive (HDD), a solid-state drive (SSD), or flash memory.
In an embodiment, the electronic device 100 may generate statistics based on order data. The statistics may include profiles or metrics representing characteristics associated with an e-commerce service entity (such as a user, a category, or a promotion), for example directly usable entity attributes (such as the membership sign-up date), aggregated data for specific event types, and predictive indicators. Aggregated data may include behavioral statistics such as logins, clicks, visit duration, and single detail page (SDP) visits, as well as transaction statistics such as order counts, gross merchandise volume (GMV), order amounts, and the sum of order quantities. A predictive indicator may include, for example, a prediction score for a user's interest in "iPhone". E-commerce companies need to collect and generate such statistics in order to analyze entities and make decisions related to their e-commerce services.
To generate statistics from order data, all of a user's orders must be analyzed in real time, which is quite difficult: a user's order history occupies a large amount of data, and individual orders can be created and canceled at any time.
Specifically, when a user orders one or more items through the e-commerce service, a service server may generate order data representing the user's order information. In an embodiment, the order data may include order identifiers and order quantities of one or more orders. After the order data is generated, the user may cancel all or part of one or more items included in an order, and such cancellations should be reflected in the user's order profile together with the order creation date. An example of order creation and cancellation is described below with reference to FIG. 2.
FIG. 2 is an exemplary diagram illustrating a method of computing order statistics according to an embodiment of the present disclosure. Referring to FIG. 2, an example of order creation and cancellation is represented by the tables of the order details 210 and the order quantity metric 220.
For example, user 1 placed two orders, 1001 and 1002, on October 1, 2022; if their order quantities are 100 and 200, the "user order quantity" metric is 300. User 1 then canceled items amounting to an order quantity of 50 from order 1002 on October 2, 2022, so the metric should change to 250. However, because real-time order information is only the most recent snapshot, one can see that the order quantity of order 1002 is now 150 but cannot know that it was previously 200, which makes computing the real-time change in order quantity difficult. Computing the "user order quantity" metric would normally require accessing all of the user's past order history and summing the order quantities, and the order history is a very large dataset, so computing the metric in real time is not easy. For example, with the current order dataset at 8 TB, it is hard to compute metrics such as a user's order count or order quantity in real time.
為了解決該難題,於一實施例中,電子裝置100可使用第1資料庫儲存一些記錄,以處理訂單之狀態變化,並可增量地(incrementally)執行與訂單相關之指標計算。例如,於計算“用戶訂單量”指標時,用戶取消了之前訂單且取消之訂單量為200之情形時,電子裝置100可確定其抵消值(counteraction value)為-200。電子裝置100僅需將該抵消值-200累加至用戶之訂單量中,而無需將用戶之所有訂單相加,從而可有效減少計算所需之資料量。電子裝置100僅將串流之實時資料中應反映至訂單之統計資料中之資料儲存於如抵消記錄(counteraction record)等第1資料庫中,並使用第1資料庫中取消之訂單量乘以-1所得之值更新統計資料,從而可實時計算指標,而無需檢索現有訂單歷史中的所有資料,同樣,於圖2之訂單量指標220所示之示例中,“用戶訂單量”指標於2022年10月1日產生為300,而由於用戶取消了50的訂單量,故指標於2022年10月2日處理為-50。電子裝置100可接收訂單取消事件,從訂單取消事件中確認取消之訂單量並將取消之訂單量乘 以-1所得的值累加至統計資料,以產生統計資料。 To solve this problem, in one embodiment, the electronic device 100 can use the first database to store some records to handle the state change of the order, and can incrementally perform the calculation of the indicators related to the order. For example, when calculating the "user order quantity" indicator, if the user cancels the previous order and the canceled order quantity is 200, the electronic device 100 can determine its counteraction value as -200. The electronic device 100 only needs to add the counteraction value -200 to the user's order quantity, without adding all the user's orders, thereby effectively reducing the amount of data required for calculation. The electronic device 100 stores only the data that should be reflected in the order statistics in the streaming real-time data in the first database such as the counteraction record, and uses the value obtained by multiplying the canceled order quantity in the first database by -1 to update the statistics, so that the index can be calculated in real time without searching all the data in the existing order history. Similarly, in the example shown in the order quantity index 220 of FIG. 2, the "user order quantity" index is generated as 300 on October 1, 2022, and because the user cancels 50 orders, the index is processed as -50 on October 2, 2022. 
The electronic device 100 can receive the order cancellation event, confirm the canceled order quantity from the order cancellation event, and add the value obtained by multiplying the canceled order quantity by -1 to the statistics to generate the statistics.
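The counteraction-based incremental update described above can be illustrated with a minimal Python sketch. This is not the patent's implementation: a plain dict stands in for the first database (the counteraction-record store), and all names are illustrative.

```python
# Minimal sketch of counteraction-based incremental metric updates.
# An in-memory dict stands in for the first database (the
# counteraction-record store); names here are illustrative only.

last_order_state = {}   # order_id -> last known order quantity
user_order_qty = {}     # user_id -> running "user order quantity" metric

def apply_order_event(user_id, order_id, new_qty):
    """Reflect an order creation/change without rescanning order history."""
    prev_qty = last_order_state.get(order_id, 0)
    counteraction = -prev_qty           # cancel out the previously counted value
    delta = counteraction + new_qty     # net change to accumulate
    user_order_qty[user_id] = user_order_qty.get(user_id, 0) + delta
    last_order_state[order_id] = new_qty

# 2022-10-01: user 1 places orders 1001 (qty 100) and 1002 (qty 200).
apply_order_event("user1", 1001, 100)
apply_order_event("user1", 1002, 200)
# 2022-10-02: qty 50 of order 1002 is cancelled, leaving qty 150.
apply_order_event("user1", 1002, 150)
print(user_order_qty["user1"])  # 250
```

Only the single cached record for order 1002 is consulted on cancellation, matching the idea that no scan over the full 8TB order history is needed.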
圖3a至圖3c係表示根據本發明一實施例之用以產生電子裝置之統計資料之實時資料管線設計之示例圖。 Figures 3a to 3c are exemplary diagrams showing a real-time data pipeline design for generating statistical data of an electronic device according to an embodiment of the present invention.
參閱圖3a,示出了用於電子裝置100從訂單資料310產生如指標340等統計資料之端到端(end-to-end)實時資料管線架構300。電子裝置100可建立架構300等資料管線,以自動化提取、變更、結合、驗證及加載實時串流資料之過程。 Referring to FIG. 3a, an end-to-end real-time data pipeline architecture 300 is shown for the electronic device 100 to generate statistical data such as indicators 340 from order data 310. The electronic device 100 can establish a data pipeline such as the architecture 300 to automate the process of extracting, changing, combining, validating, and loading real-time streaming data.
於一實施例中,統計資料可藉由確保訂單資料310之精確一次處理(exactly once processing)之資料管線而產生。於一實施例中,資料管線可包括確保至少一次處理(at least once processing)之第1管線320(例如,串流攝取管線(streaming ingestion pipeline))及確保精確一次處理之第2管線330(例如,實時聚集管線(real time aggregation pipeline)),第1管線之輸出可輸入至第2管線。於實時資料處理中,確保計算結果之準確性非常重要,而用於準確計算之資料管線可能需要嵌入精確一次處理語義(semantics)。精確一次處理可意味著所有資料均應於最終結果中精確反映一次。例如,即便用戶僅取消了一次訂單,實時串流之取消資料日誌亦可能重複輸入10次,電子裝置100需要刪除該等重複資料,僅反映一次取消訂單。另一方面,第1管線320由於僅確保至少處理一次,而不確保精確處理一次,故可接受重複資料,重複資料可於第2管線330中去除重複(deduplication)。 In one embodiment, the statistical data may be generated by a data pipeline that ensures exactly once processing of the order data 310. In one embodiment, the data pipeline may include a first pipeline 320 that ensures at least once processing (e.g., a streaming ingestion pipeline) and a second pipeline 330 that ensures exactly once processing (e.g., a real time aggregation pipeline), and the output of the first pipeline may be input to the second pipeline. In real-time data processing, it is very important to ensure the accuracy of the calculation results, and a data pipeline used for accurate calculation may need to embed exactly-once processing semantics. Exactly-once processing means that all data should be reflected exactly once in the final result. For example, even if a user cancels an order only once, the cancellation log in the real-time stream may be entered 10 times in duplicate; the electronic device 100 needs to delete the duplicate data and reflect the cancellation only once. On the other hand, since the first pipeline 320 only ensures at-least-once processing, not exactly-once processing, it can accept duplicate data, and the duplicate data can be deduplicated in the second pipeline 330.
於一實施例中,第2管線330可確保至少處理一次、去除重複及冪等寫入(idempotent write)語義,以確保精確處理一次。至少處理一次語義可確保處理所有資料,去除重複可確保刪除輸入中之重複資料,冪等寫入可確保多次重新處理資料後不會影響最終輸出結果,即,即便多次計算,結果亦不會改變。 In one embodiment, the second pipeline 330 can ensure at-least-once processing, deduplication, and idempotent-write semantics, so as to guarantee exactly-once processing. At-least-once semantics ensure that all data are processed, deduplication ensures that duplicate data in the input are deleted, and idempotent writes ensure that reprocessing the data multiple times does not affect the final output, that is, even if it is computed multiple times, the result does not change.
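How deduplication turns at-least-once input into an exactly-once result can be sketched as follows. The event format and ID field are illustrative assumptions, not the patent's schema; a set of processed record IDs stands in for the second pipeline's deduplication state.

```python
# Sketch: deduplication in the second pipeline. Even if the first
# (at-least-once) pipeline emits the same cancellation log 10 times,
# tracking processed record IDs ensures it is reflected exactly once.
# The event layout and "event_id" field are illustrative assumptions.

seen_event_ids = set()
order_quantity_metric = 0

def process_event(event):
    global order_quantity_metric
    if event["event_id"] in seen_event_ids:
        return                           # duplicate: drop it
    seen_event_ids.add(event["event_id"])
    order_quantity_metric += event["delta"]

cancel = {"event_id": "cancel-1002-v2", "delta": -50}
for _ in range(10):                      # duplicated 10 times upstream
    process_event(cancel)
print(order_quantity_metric)  # -50, not -500
```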
於一實施例中,電子裝置100可產生唯一標識符,該標識符可唯一地表示實體資料集之微批量(micro batch),以確保第2管線330中的冪等寫入。於第2管線330中,由於使用不同之計算邏輯對資料進行聚集,因此不再使用原始輸入資料標識符。唯一標識符可設置為實體ID、微批量ID及指標ID之組合。微批量ID可使用所有微批量之起始偏移量(starting offset)之摘要(digest)而產生,指標ID可包括不同計算邏輯之唯一標識符。如此,於系統重啟或從衝突中恢復後,電子裝置100從最後一個失敗之微批量之起始偏移量開始重新處理,微批量ID與最後一次處理相同,資料將以相同之標識符覆蓋,從而確保冪等寫入。 In one embodiment, the electronic device 100 may generate a unique identifier that uniquely represents a micro batch of the entity data set to ensure idempotent writes in the second pipeline 330. In the second pipeline 330, the original input data identifier is no longer used because different computation logics are used to aggregate the data. The unique identifier may be set as a combination of the entity ID, micro batch ID, and metric ID. The micro batch ID may be generated using a digest of all micro batch starting offsets, and the metric ID may include unique identifiers of the different computation logics. In this way, after the system restarts or recovers from a crash, the electronic device 100 reprocesses from the starting offset of the last failed micro batch; the micro batch ID is the same as in the last processing, so the data will be overwritten under the same identifier, thereby ensuring idempotent writes.
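The identifier scheme above can be sketched in a few lines. The patent does not fix a concrete encoding, so the separator, digest algorithm, and field names below are assumptions for illustration only.

```python
import hashlib

# Sketch of the unique-identifier scheme: the write key combines the
# entity ID, a micro-batch ID derived from a digest of all partition
# starting offsets, and the metric ID. Encoding details are assumptions.

def micro_batch_id(starting_offsets):
    """Digest of every partition's starting offset for this micro-batch."""
    canonical = ",".join(f"{p}:{o}" for p, o in sorted(starting_offsets.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def write_key(entity_id, starting_offsets, metric_id):
    return f"{entity_id}|{micro_batch_id(starting_offsets)}|{metric_id}"

offsets = {"orders-0": 1200, "orders-1": 987}
k1 = write_key("user1", offsets, "user_order_qty")
# A retry after a crash restarts from the same starting offsets, so the
# key is identical and the write overwrites, rather than duplicates,
# the previous row -- which is what makes the write idempotent.
k2 = write_key("user1", offsets, "user_order_qty")
print(k1 == k2)  # True
```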
於一實施例中,訊息隊列350可包括能夠發佈、訂閱、儲存及處理實時串流之訂單資料310之Apache Kafka。Apache Kafka係大容量實時日誌處理系統,且係使用發佈/訂閱(publish/subscribe)範式處理訊息之訊息隊列(queueing)系統。Apache Kafka係擴展性好、吞吐量高之分佈式訊息系統,能夠確保可靠、持續之訊息傳遞性能。Kafka應用生產者-消費者模式,使用生產者(producer)/消費者(consumer)/經紀人(broker)三個系統組件,設置大量資料之主題(topic)並按主題組織分區,有序儲存。Kafka將該儲存之資料依次傳遞給消費者,以便進行高效處理。Kafka係專門處理大量實時日誌之解決方案,可於主要目標係安全無損地傳輸資料之訊息系統中,以容錯(fault-tolerant)、可靠之架構及快速性能來處理資料。 In one embodiment, the message queue 350 may include Apache Kafka, which can publish, subscribe to, store, and process the real-time streaming order data 310. Apache Kafka is a large-capacity real-time log processing system and a message queueing system that uses a publish/subscribe paradigm to process messages. Apache Kafka is a distributed messaging system with good scalability and high throughput that can ensure reliable and continuous message delivery performance. Kafka follows the producer/consumer pattern, using three system components (producer, consumer, and broker); it sets up topics for large amounts of data, organizes partitions by topic, and stores them in order. Kafka delivers the stored data to consumers in sequence for efficient processing. Kafka is a solution specifically designed to handle large amounts of real-time logs. It can process data with a fault-tolerant, reliable architecture and fast performance in a messaging system whose main goal is to transmit data safely and without loss.
於一實施例中,電子裝置100基於訂單資料來產生統計資料可包括:當向第1管線320輸入表示訂單資料之第n+1變化之R(n+1)時,從快取中檢索表示訂單資料之第n變化之R(n)(retrieve),R(n)之值乘以-1,將-R(n)及R(n+1)傳輸至第2管線330,並快取R(n+1)。電子裝置100可聚集從第1管線320傳輸至第2管線330之-R(n)及R(n+1),以產生統計資料。即,第2管線330中,統計資料可計算為R(0)+[R(1)-R(0)]+...+[R(n)-R(n-1)]+[R(n+1)-R(n)]=R(n+1)。 In one embodiment, the electronic device 100 may generate statistics based on the order data by: when R(n+1) representing the n+1th change of the order data is input to the first pipeline 320, R(n) representing the nth change of the order data is retrieved from the cache, the value of R(n) is multiplied by -1, -R(n) and R(n+1) are transmitted to the second pipeline 330, and R(n+1) is cached. The electronic device 100 may aggregate -R(n) and R(n+1) transmitted from the first pipeline 320 to the second pipeline 330 to generate statistics. That is, in the second pipeline 330, the statistics can be calculated as R(0)+[R(1)-R(0)]+...+[R(n)-R(n-1)]+[R(n+1)-R(n)]=R(n+1).
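The -R(n)/R(n+1) mechanism above can be sketched directly; plain Python lists and dicts stand in for the cache and the downstream queue, which in the described architecture would be a key-value store and a message topic.

```python
# Sketch of the -R(n)/R(n+1) scheme: the first pipeline looks up the
# cached previous state R(n), emits both -R(n) and R(n+1) downstream,
# and caches R(n+1); the second pipeline simply sums what it receives,
# so the total telescopes to the latest value R(n+1).

cache = {}        # order_id -> last emitted value R(n)
downstream = []   # records handed to the second pipeline

def first_pipeline(order_id, new_value):
    prev = cache.get(order_id)
    if prev is not None:
        downstream.append(-prev)    # -R(n)
    downstream.append(new_value)    # R(n+1)
    cache[order_id] = new_value

first_pipeline(1002, 200)   # R(0) = 200
first_pipeline(1002, 150)   # partial cancel: R(1) = 150
first_pipeline(1002, 0)     # full cancel:    R(2) = 0

# Second pipeline: aggregate everything received.
# R(0) + [R(1)-R(0)] + [R(2)-R(1)] = R(2)
print(sum(downstream))  # 0
```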
於一實施例中,R(n)可包括訂單ID及訂單版本之資訊。電子裝置100可確認R(n)中包括之訂單ID及訂單版本之資訊,並確認與欲反映至指標中之訂單ID或訂單版本相對應之訂單狀態(產生或取消)以產生指標。 In one embodiment, R(n) may include information of an order ID and an order version. The electronic device 100 may confirm the information of the order ID and order version included in R(n), and confirm the order status (generated or cancelled) corresponding to the order ID or order version to be reflected in the indicator to generate the indicator.
下面,對異常處理方法進行說明,該異常處理方法係於i)串流於訊息隊列之訂單資料順序出現異常(disorder);ii)資料管線中之資料處理過程中發生衝突(crash)之情形時,可不受所發生問題之影響,將已更改之訂單狀態正常反映至最終統計資料中。 The following is an explanation of the exception handling method. When i) the order data sequence flowing in the message queue is abnormal (disorder); ii) a conflict (crash) occurs during the data processing process in the data pipeline, the changed order status can be normally reflected in the final statistical data without being affected by the problem.
於一實施例中,電子裝置100可執行異常處理,以確保於訂單源訊息隊列中出現順序異常時,僅反映最新版本之訂單資料。例如,若R(n+k+m)先於作為先前訂單之R(n+k)至R(n+k+m-1)被接收,則當於第2管線330中聚集R(n+k+m)時,統計資料可處理為R(0)+[R(1)-R(0)]+...+[R(n+k-1)-R(n+k-2)]+[R(n+k+m)-R(n+k-1)]=R(n+k+m)。即,藉由第2管線330之該等聚集方法,電子裝置100可僅將最近訂單資料R(n+k+m)反映至統計資料,而捨棄作為先前訂單資料之R(n+k)至R(n+k+m-1)之訊息,從而不受所發生問題之影響,將已更改之訂單狀態最終反映至統計資料中。 In one embodiment, the electronic device 100 may perform exception processing to ensure that only the latest version of the order data is reflected when a sequence exception occurs in the order source message queue. For example, if R(n+k+m) is received before R(n+k) to R(n+k+m-1) as previous orders, when R(n+k+m) is aggregated in the second pipeline 330, the statistics may be processed as R(0)+[R(1)-R(0)]+...+[R(n+k-1)-R(n+k-2)]+[R(n+k+m)-R(n+k-1)]=R(n+k+m). That is, through the aggregation methods of the second pipeline 330, the electronic device 100 can only reflect the latest order data R(n+k+m) to the statistical data, and discard the information R(n+k) to R(n+k+m-1) as the previous order data, so as not to be affected by the problem, and finally reflect the changed order status in the statistical data.
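One way to realize this "latest version wins" safeguard is to carry the order version on each record and discard records whose version is not newer than the one already cached. The record layout below is an assumption for illustration; it combines the version check with the counteraction emission described earlier.

```python
# Sketch of the out-of-order safeguard: a record is reflected only if
# its order version is newer than the cached one, so stale records
# R(n+k)..R(n+k+m-1) arriving late are simply discarded.

cache = {}        # order_id -> (version, value) last reflected
downstream = []   # records handed to the second pipeline

def first_pipeline(order_id, version, value):
    prev = cache.get(order_id)
    if prev is not None and version <= prev[0]:
        return                       # stale out-of-order record: discard
    if prev is not None:
        downstream.append(-prev[1])  # counteract the previously reflected value
    downstream.append(value)
    cache[order_id] = (version, value)

first_pipeline(1002, 0, 200)   # R(0)
first_pipeline(1002, 5, 80)    # R(n+k+m) arrives before the versions between
first_pipeline(1002, 2, 150)   # late R(n+k): discarded

# Only the latest version is reflected in the aggregate.
print(sum(downstream))  # 80
```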
於一實施例中,電子裝置100可執行異常處理,以確保於資料管線中之資料處理過程中發生衝突時,不受衝突之影響,將已更改之訂單狀態反映至統計資料中。於第1管線320中,通常情形之資料處理步驟可執行如下N1至N5: In one embodiment, the electronic device 100 can perform exception processing to ensure that when a conflict occurs during the data processing process in the data pipeline, the conflict is not affected and the changed order status is reflected in the statistical data. In the first pipeline 320, the data processing steps under normal circumstances can be executed as follows N1 to N5:
N1:確認源訊息隊列並提取有用字段。 N1: Confirm the source message queue and extract useful fields.
N2:過濾不合格(disqualified)訊息。 N2: Filter disqualified messages.
N3:將資料轉換為事件。 N3: Convert data into events.
N4:將轉換後之事件傳輸至下游之事件訊息隊列。 N4: Transmit the converted event to the downstream event message queue.
N5:提交(commit)源訊息隊列偏移量。 N5: Commit the source message queue offset.
其中,提交源訊息隊列偏移量,表示已準備好於源中處理下一條訊息。 Among them, submitting the source message queue offset means that it is ready to process the next message in the source.
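The normal-case steps N1 to N5 can be sketched as a single processing loop. Plain lists stand in for the source and downstream message queues, and the field names and filtering rule are illustrative assumptions; a real deployment would use Kafka topics with committed consumer offsets.

```python
# Hedged sketch of the normal-case N1..N5 loop of the first pipeline.
# Queues are plain lists; the message schema is an assumption.

source_queue = [
    {"order_id": 1001, "qty": 100, "status": "created"},
    {"order_id": 1001, "qty": 0,   "status": None},      # disqualified message
]
event_queue = []
committed_offset = 0

def run_once():
    global committed_offset
    for msg in source_queue[committed_offset:]:
        fields = {k: msg[k] for k in ("order_id", "qty", "status")}  # N1: extract fields
        if fields["status"] is None:                                 # N2: filter disqualified
            committed_offset += 1
            continue
        event = {"type": fields["status"],
                 "order_id": fields["order_id"],
                 "delta": fields["qty"]}                             # N3: convert to event
        event_queue.append(event)                                    # N4: send downstream
        committed_offset += 1                                        # N5: commit offset

run_once()
print(len(event_queue), committed_offset)  # 1 2
```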
另一方面,電子裝置100還可對抵消記錄(counteraction record)360執行如下追加步驟C1至C4: On the other hand, the electronic device 100 can also perform the following additional steps C1 to C4 on the counteraction record 360:
C1:從資料庫管理系統(database management system,DBMS)接收先前儲存之事件。 C1: Receive previously stored events from the database management system (DBMS).
C2:基於抵消記錄來計算抵消值(counteraction value)。 C2: Calculate the counteraction value based on the offsetting record.
C3:傳輸計算出的抵消值以及普通事件。 C3: Transmits calculated offset values and common events.
C4:於DBMS中用新事件覆蓋(override)舊事件。 C4: Override old events with new events in DBMS.
其中,追加步驟C1及C2可與N3合併。即便於N3或N3之前步驟中發生衝突,因源訊息隊列中之訊息會再次被讀取及處理,故可不受衝突之影響。追加步驟C3可與N4一起執行。 Among them, additional steps C1 and C2 can be combined with N3. Even if a conflict occurs in N3 or the step before N3, the message in the source message queue will be read and processed again, so it will not be affected by the conflict. Additional step C3 can be executed together with N4.
於一實施例中,追加步驟C4可於N4之前執行。即,電子裝置100可於DBMS中用新事件覆蓋舊事件之後(C4),將轉換後之事件傳輸至下游之事件訊息隊列(N4)。此時,若N4因衝突而失敗,則於第1種情形時將R(n)替換為R(n+1)後,第1管線320中可能會發生衝突。修復後,電子裝置100可再次將R(n+1)傳輸至下游,這可能導致於第2管線330中基於R(n)+R(n+1)進行計算。另一方面,若N4因衝突而失敗,則於第2種情形時,僅R(n+1)或-R(n)中之一者被傳輸至下游,第1管線320中可能會發生衝突。修復後,若發送了-R(n),則沒有問題,但若僅發送了R(n+1),則可能導致於第2管線330中基於R(n)+R(n+1)進行計算。最終,第1種及第2種情形均可能導致電子裝置100於第2管線330中基於R(n)+R(n+1)執行計算,從而發生問題。 In one embodiment, the additional step C4 may be executed before N4. That is, the electronic device 100 may transmit the converted event to the downstream event message queue (N4) after overwriting the old event with the new event in the DBMS (C4). At this time, if N4 fails due to a conflict, a conflict may occur in the first pipeline 320 after R(n) is replaced with R(n+1) in the first case. After repair, the electronic device 100 may transmit R(n+1) to the downstream again, which may cause calculation based on R(n)+R(n+1) in the second pipeline 330. On the other hand, if N4 fails due to a conflict, then in the second case, only one of R(n+1) or -R(n) is transmitted downstream, and a conflict may occur in the first pipeline 320. After the repair, if -R(n) is sent, there is no problem, but if only R(n+1) is sent, it may cause calculations based on R(n)+R(n+1) in the second pipeline 330. Ultimately, both the first and second cases may cause the electronic device 100 to perform calculations based on R(n)+R(n+1) in the second pipeline 330, resulting in problems.
於一實施例中,追加步驟C4可於N4之後執行。即電子裝置100可將轉換後之事件傳輸至下游之事件訊息隊列之後(N4),於DBMS中用新事件覆蓋舊事件(C4)。此時,若N4部分失敗(第1種情形),則僅R(n+1)或-R(n)中之一者被傳輸至下游,第1管線320中可能會發生衝突。修復後,將傳輸R(n+1)及-R(n),第2管線330可對先前接收之記錄去除重複。此外,若執行了N4,但C4失敗(第2種情形),則僅重複記錄R(n+1)及-R(n)被再次傳輸至下游。另一方面,若執行了N4及C4,但N5失敗(第3種情形),則僅重複記錄R(n+1)被再次傳輸至下游,即便傳輸重複資料,由於確保精確處理一次之第2管線330中將被去除重複,因此不會發生問題。因此,較佳為電子裝置100於N4之後執行追加步驟C4。 In one embodiment, the additional step C4 may be executed after N4. That is, after the electronic device 100 transmits the converted event to the downstream event message queue (N4), the old event is overwritten with the new event in the DBMS (C4). At this time, if N4 partially fails (the first case), only one of R(n+1) or -R(n) is transmitted downstream, and a crash may occur in the first pipeline 320. After repair, R(n+1) and -R(n) will be transmitted, and the second pipeline 330 can deduplicate the previously received records. In addition, if N4 is executed but C4 fails (the second case), only the duplicate records R(n+1) and -R(n) are transmitted downstream again. On the other hand, if N4 and C4 are executed but N5 fails (the third case), only the duplicate record R(n+1) is transmitted downstream again. Even if duplicate data is transmitted, it will not cause any problem, because the second pipeline 330, which ensures exactly-once processing, will remove the duplicates. Therefore, it is preferable for the electronic device 100 to execute the additional step C4 after N4.
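Why the C4-after-N4 ordering only produces harmless duplicates can be simulated in a few lines. Dicts stand in for the DBMS and the downstream topic; keying downstream records by a stable record ID models the second pipeline's deduplication, and all names are illustrative assumptions.

```python
# Sketch of why C4 (overwrite the old event in the DBMS) should run
# after N4 (send downstream): a crash between N4 and C4 only causes
# duplicates, which the exactly-once second pipeline removes because
# retried records carry the same record ID.

dbms = {}           # order_id -> last stored event (the C4 target)
downstream = {}     # record_id -> value; dict keys model deduplication

def process(order_id, version, value, crash_before_c4=False):
    prev = dbms.get(order_id)
    if prev is not None:                                   # C1/C2: counteraction
        downstream[f"{order_id}:v{prev['version']}:neg"] = -prev["value"]
    downstream[f"{order_id}:v{version}"] = value           # N4 (with C3)
    if crash_before_c4:
        return                                             # crash: C4/N5 never run
    dbms[order_id] = {"version": version, "value": value}  # C4

process(1002, 0, 200)
process(1002, 1, 150, crash_before_c4=True)   # crash after N4
process(1002, 1, 150)                         # recovery reprocesses the message
print(sum(downstream.values()))  # 150 -- the duplicates were absorbed
```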
於一實施例中,DBMS可包括作為開源分佈式NoSQL DBMS之Apache Cassandra。Cassandra可於無單點故障(single point of failure,SPOF)提供高性能的同時,管理眾多伺服器之間之大容量資料,可支持跨複數個資料中心之集群,並允許所有客戶端之低延時運行。 In one embodiment, the DBMS may include Apache Cassandra, an open source distributed NoSQL DBMS. Cassandra can manage large volumes of data across many servers while providing high performance without a single point of failure (SPOF), can support clusters across multiple data centers, and allows low latency operation for all clients.
於一實施例中,DBMS可包括Redis,其係一種具有鍵值(key-value)結構之開源非關係型DBMS。 In one embodiment, the DBMS may include Redis, which is an open source non-relational DBMS with a key-value structure.
如此,電子裝置100即便於串流於訊息隊列之訂單資料順序出現異常或資料管線中之資料處理過程中發生衝突之情形時,亦可藉由執行異常處理,將已更改之訂單狀態正常反映至最終統計資料中,而不受所發生問題之影響,從而有效維護實時資料處理之完整性。 In this way, even if the order data sequence flowing in the message queue is abnormal or a conflict occurs in the data processing process in the data pipeline, the electronic device 100 can still reflect the changed order status normally in the final statistical data by executing abnormal processing without being affected by the problem, thereby effectively maintaining the integrity of real-time data processing.
參閱圖3b,其圖示實時資料管線架構300、可輸入架構300之資料源元資料371及指標元資料372。於一實施例中,電子裝置100可藉由資料源元資料371註冊資料源,並表明可從該資料源提取哪些業務字段。於一實施例中,電子裝置100可藉由指標元資料372指定所需指標,並設置對指標之運營及對條件之資訊(例如,收集統計之服務類型、收集統計之開始及結束日期等)。 Refer to FIG. 3b, which illustrates a real-time data pipeline architecture 300, data source metadata 371 that can be input into the architecture 300, and indicator metadata 372. In one embodiment, the electronic device 100 can register a data source through the data source metadata 371 and indicate which business fields can be extracted from the data source. In one embodiment, the electronic device 100 can specify the required indicator through the indicator metadata 372 and set the information on the operation and conditions of the indicator (for example, the service type for collecting statistics, the start and end dates for collecting statistics, etc.).
參閱圖3c,其圖示實時資料管線架構300、表示產生指標之過程381至385之示例性資訊。於381中,當源訂單事件新進入訊息隊列350時,第1管線320可讀取事件並基於資料源元資料371之配置進行發送。於382中,可對所需訂單事件進行處理,提取所需業務字段並基於指標元資料372之配置寫入相關業務訊息隊列主題中。於383及384中,第2管線330可根據指標元資料372按細粒度(fine-grained)級別聚集所需指標。於385中,發佈者可獲取最終指標340並將其推送 (push)至訂單配置文件訊息隊列主題。 Referring to FIG. 3c, which illustrates the real-time data pipeline architecture 300, exemplary information of the process 381 to 385 for generating indicators is shown. In 381, when a source order event newly enters the message queue 350, the first pipeline 320 can read the event and send it based on the configuration of the data source metadata 371. In 382, the required order event can be processed, the required business fields can be extracted and written to the relevant business message queue topic based on the configuration of the indicator metadata 372. In 383 and 384, the second pipeline 330 can aggregate the required indicators at a fine-grained level according to the indicator metadata 372. In 385, the publisher can obtain the final indicator 340 and push it to the order profile message queue topic.
另一方面,驗證(validation)藉由資料管線實時產生之統計資料亦係非常重要之課題。統計資料實時更新,每當更新統計資料時,用戶便可使用更新之後之統計資料。然而,若整個資料管線之任何一部分出現故障,藉由資料管線產生之統計資料就可能存在資料質量問題,例如,為了實現資料管線而新發佈之代碼中可能存在錯誤。較晚發現該等問題可能會給電子商務服務帶來更多運營問題,因此,定期對統計資料進行資料質量檢查以避免出現上述情況非常重要。然而,如指標(例如,SDP訪問量、點擊量、訂單金額之總和等)等聚集之資料之轉儲及處理非常耗時,很難隨時確認最新資料,而且用戶訂單量等資料可能包含8TB之資料,因此需要一種可有效進行海量資料之資料質量檢查之方法。下面,將結合圖4及圖5來說明與統計資料驗證相關之一個實施例。 On the other hand, validation of the statistics generated in real time by the data pipeline is also a very important topic. Statistics are updated in real time, and whenever the statistics are updated, users can use the updated statistics. However, if any part of the entire data pipeline fails, the statistics generated by the data pipeline may have data quality problems. For example, there may be errors in the newly released code to implement the data pipeline. Late discovery of such problems may bring more operational problems to e-commerce services. Therefore, it is very important to regularly check the data quality of the statistics to avoid the above situation. However, the dumping and processing of aggregated data such as indicators (e.g., SDP visit volume, click volume, total order amount, etc.) is very time-consuming, and it is difficult to confirm the latest data at any time. In addition, data such as user order volume may contain 8TB of data, so a method that can efficiently perform data quality checks on massive data is needed. Below, an embodiment related to statistical data verification will be described in conjunction with Figures 4 and 5.
圖4係表示與根據本發明一實施例之基線產生器及資料質量驗證器相關之資料流之示例架構。參閱圖4,其圖示包括用以產生圖3a所述之統計資料之資料管線之架構410及用以驗證統計資料之資料質量之架構420。 FIG. 4 shows an example architecture of data flows associated with a baseline generator and a data quality verifier according to an embodiment of the present invention. Referring to FIG. 4 , the diagram includes an architecture 410 of a data pipeline for generating the statistical data described in FIG. 3a and an architecture 420 for verifying the data quality of the statistical data.
於一實施例中,電子裝置100可產生用以基於訂單資料來驗證統計資料之基線資料。電子裝置100可從訂單資料之原始資料(golden truth data)產生基線資料,原始資料可從批量管線(batch pipeline)使用之DBMS轉儲(dump)及實時管線使用之訊息隊列轉儲中提取。於一實施例中,可基於用戶之標識符對源資料進行採樣,以降低計算成本。 In one embodiment, the electronic device 100 can generate baseline data for verifying statistical data based on order data. The electronic device 100 can generate baseline data from the original data (golden truth data) of the order data, and the original data can be extracted from the DBMS dump used by the batch pipeline and the message queue dump used by the real-time pipeline. In one embodiment, the source data can be sampled based on the user's identifier to reduce the computational cost.
於一實施例中,產生基線資料之步驟可進而包括:基於訂單資料驗證基線資料。例如,於實時管線使用之訊息隊列轉儲中可能存在資料重複,或者於從訊息隊列轉儲中提取原始資料之過程中可能發生衝突,因此無法確保原始資料係100%準確之資料,因此,電子裝置100還可對基於訂單資料產生之基線資料執行資料質量驗證。 In one embodiment, the step of generating baseline data may further include: verifying the baseline data based on the order data. For example, there may be data duplication in the message queue dump used by the real-time pipeline, or conflicts may occur in the process of extracting the original data from the message queue dump, so it is impossible to ensure that the original data is 100% accurate data. Therefore, the electronic device 100 can also perform data quality verification on the baseline data generated based on the order data.
於一實施例中,電子裝置100可對基線資料及統計資料各者之增量值進行比較,並基於比較結果驗證統計資料。例如,基線資料及統計資料可包括訂單量,基線資料及統計資料之增量值可計算為“value(till today)”(即到今天為止之訂單量)減去“value(till yesterday)”(即到昨天為止之訂單量),即“value(till today)-value(till yesterday)”。如此,電子裝置100可基於T時刻驗證之資料值為準,以T~T+1之間之時間單位跟蹤資料之增量值,並驗證跟蹤之增量值之準確性,從而判斷T+1時刻確認之資料值是否準確,藉此有效驗證大容量資料,而無需對從0時刻至T+1時刻之所有資料進行驗證。 In one embodiment, the electronic device 100 may compare the incremental values of each of the baseline data and the statistical data, and verify the statistical data based on the comparison result. For example, the baseline data and the statistical data may include order quantity, and the incremental value of the baseline data and the statistical data may be calculated as "value (till today)" (i.e., the order quantity until today) minus "value (till yesterday)" (i.e., the order quantity until yesterday), i.e., "value (till today) - value (till yesterday)". In this way, the electronic device 100 can track the incremental value of the data in time units between T and T+1 based on the data value verified at time T, and verify the accuracy of the tracked incremental value, thereby determining whether the data value confirmed at time T+1 is accurate, thereby effectively verifying large-capacity data without having to verify all data from time 0 to time T+1.
具體而言,V(n)表示時間n時之計算之指標值,B(n)表示時間n時之基準指標值,P(n)表示V(n)之準確性(correctness),可具有真或假中的一個值。即,若P(n)為真,則V(n)=B(n)。n之時間間隔可為小時(hour)。此時,電子裝置100可按以下順序證明P(n)於所有時刻n均為真。 Specifically, V(n) represents the calculated index value at time n, B(n) represents the reference index value at time n, and P(n) represents the accuracy of V(n), which can have a value of true or false. That is, if P(n) is true, then V(n)=B(n). The time interval of n can be one hour. At this time, the electronic device 100 can prove that P(n) is true at all times n in the following order.
1)首先,P(0)=真,其中0表示資料驗證開始之時間,其表示將驗證從一開始計算之指標之所有歷史。 1) First, P(0) = true, where 0 represents the time when data validation starts, which means that all the history of the indicator calculated from the beginning will be validated.
2)對於所有k>=0,可證明若P(k)為真,則P(k+1)亦為真。對於特定k,假設當n=k時P(k)為真,則對於k<j<=k+1,比較B(j)及V(j),於該區間內之所有值都相同之情形時,P(k+1)亦為真。 2) For all k>=0, it can be shown that if P(k) is true, then P(k+1) is also true. For a particular k, assuming that P(k) is true when n=k, then for k<j<=k+1, comparing B(j) and V(j), when all values in the interval are the same, P(k+1) is also true.
3)若證明上述1)及2)為真,則根據數學歸納法,P(n)於所有時刻n均為真。 3) If 1) and 2) above are proven to be true, then according to mathematical induction, P(n) is true at all times n.
藉由上述數學歸納法,若電子裝置於時間T執行了資料質量檢查,則證明時間T之前資料之準確性,為了於時間T+1執行資料質量檢查,可藉由僅對時間T及時間T+1之間之增量執行資料質量檢查來證明時間T+1為止資料之準確性,從而使電子裝置100能夠驗證所有時間段之資料完整性,而不增加不必要的計算量。 By using the above mathematical induction method, if the electronic device performs a data quality check at time T, the accuracy of the data before time T is proved. In order to perform a data quality check at time T+1, the accuracy of the data up to time T+1 can be proved by performing a data quality check only on the increment between time T and time T+1, so that the electronic device 100 can verify the data integrity of all time periods without increasing unnecessary calculations.
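The induction-based incremental validation above can be sketched with two toy cumulative series. The hourly values below are illustrative; in the described system the baseline would come from the baseline generator's golden-truth dump and the computed series from the metric dump.

```python
# Sketch of incremental validation: once values up to time T are
# verified, only the (T, T+1] delta needs checking. The two series
# are illustrative hourly cumulative order totals.

baseline = [0, 100, 300, 250, 250]   # B(0)..B(4), trusted golden values
computed = [0, 100, 300, 250, 250]   # V(0)..V(4), pipeline output

def verify_increment(t):
    """Check P(t+1) given P(t): compare only the newly added delta."""
    return (computed[t + 1] - computed[t]) == (baseline[t + 1] - baseline[t])

# P(0) holds by construction (both series start verified at time 0);
# induction then extends correctness one interval at a time, without
# ever re-validating the full history.
assert computed[0] == baseline[0]
all_ok = all(verify_increment(t) for t in range(len(baseline) - 1))
print(all_ok)  # True
```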
於一實施例中,增量值可以基於統計資料而設定之週期來計算。例如,計算增量值之週期(如一小時、一天、一週)可基於統計資料(如訂購商品之類型、用戶訂購頻率等)自適應地確定。於關於用戶之訂購量之統計資料之情形時,根據商品之特性,對於用戶經常訂購之商品之訂單量,計算其增量值之週期設定較短(例如一天),而對於用戶不經常訂購之商品之訂單量,計算其增量值之週期設定較長(例如一週),以有效減少統計資料之資料驗證所需之計算量。 In one embodiment, the incremental value can be calculated based on a period set based on statistical data. For example, the period for calculating the incremental value (such as one hour, one day, one week) can be adaptively determined based on statistical data (such as the type of ordered goods, the user's order frequency, etc.). In the case of statistical data on the user's order quantity, according to the characteristics of the goods, for the order quantity of goods that the user frequently orders, the period for calculating the incremental value is set to be shorter (such as one day), and for the order quantity of goods that the user does not frequently order, the period for calculating the incremental value is set to be longer (such as one week), so as to effectively reduce the amount of calculation required for data verification of the statistical data.
圖5a及圖5b係表示根據本發明一實施例之資料質量驗證器之運行之示例圖。 Figures 5a and 5b are example diagrams showing the operation of a data quality verifier according to an embodiment of the present invention.
參閱圖5a,其圖示包括資料質量驗證器(data quality validator)510之架構500。資料質量驗證器510可包括圖4之架構420所包括之資料質量驗證器(DQ Validator)。資料質量驗證器510可藉由對基線產生器產生之基線資料及從指標轉儲獲得之統計資料進行比較來測量資料之準確性。於一實施例中,資料質量驗證器510可由Spark Job驅動。 Referring to FIG. 5a, a diagram is shown of an architecture 500 including a data quality validator 510. The data quality validator 510 may include the data quality validator (DQ Validator) included in the architecture 420 of FIG. 4. The data quality validator 510 may measure the accuracy of the data by comparing the baseline data generated by the baseline generator with the statistical data obtained from the indicator dump. In one embodiment, the data quality validator 510 may be driven by a Spark Job.
參閱圖5b,其圖示可檢查是否滿足用戶設置之約束(constraint)之代碼520。於圖5b所示之示例中,資料質量驗證器510可藉由代碼520將作為檢查統計資料之一天之新訂單量設置為鍵值(“order-fresh-day-amount”),並檢查值是否於約束條件設定之範圍內(例如,0至100000)、與前一日訂單量比較之大小是否存在顯著差異(例如,是否於前一日訂單量之0.9至1.1倍之範圍內)及一致度(例如,一致度99%)。若統計資料及基線資料(或其增量值)之比較結果表明至少有一個上述約束條件未得到滿足,則資料質量驗證器510可向管理員終端發送通知訊息。 Referring to FIG. 5b, a code 520 is shown that can check whether the constraints set by the user are satisfied. In the example shown in FIG. 5b, the data quality validator 510 can set the new order quantity of one day as the check statistic under the key "order-fresh-day-amount" through the code 520, and check whether the value is within the range set by the constraint condition (e.g., 0 to 100,000), whether the size is significantly different from the previous day's order quantity (e.g., whether it is within the range of 0.9 to 1.1 times the previous day's order quantity), and the consistency (e.g., consistency 99%). If the comparison result between the statistical data and the baseline data (or its incremental value) indicates that at least one of the above constraints is not met, the data quality verifier 510 may send a notification message to the administrator terminal.
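The constraint checks of code 520 can be sketched as plain predicates. The key name and thresholds mirror the example ("order-fresh-day-amount", range 0 to 100000, 0.9 to 1.1 times yesterday), but the check function itself is an illustrative assumption, not the patent's actual validator code.

```python
# Hedged sketch of the constraints described for code 520. The
# function name and return convention are illustrative assumptions.

def check_order_fresh_day_amount(today_amount, yesterday_amount):
    """Return a list of violated constraints for the daily order total."""
    failures = []
    if not (0 <= today_amount <= 100000):
        failures.append("out of allowed range 0..100000")
    if yesterday_amount > 0 and not (
        0.9 * yesterday_amount <= today_amount <= 1.1 * yesterday_amount
    ):
        failures.append("deviates more than 10% from yesterday")
    return failures

print(check_order_fresh_day_amount(10500, 10000))   # []  -> all constraints pass
print(check_order_fresh_day_amount(200000, 10000))  # both constraints fail
```

A non-empty result would correspond to the validator notifying the administrator terminal.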
圖6係表示根據本發明一實施例之電子裝置之資料驗證方法之流程的順序圖。圖6所示之步驟可由圖1所示之電子裝置100執行,此處省略與前述內容重複之說明。 FIG6 is a sequence diagram showing the process of a data verification method for an electronic device according to an embodiment of the present invention. The steps shown in FIG6 can be executed by the electronic device 100 shown in FIG1, and the description repeated with the above content is omitted here.
於S610步驟中,電子裝置可基於訂單資料產生訂單之統計資料。於一實施例中,統計資料可包括代表與實體(如用戶、類別、促銷)之電子商務服務相關聯之特徵之配置文件(profile)或指標(metric),例如,可包括能夠直接使用之實體屬性(如加入會員日期)、特定事件類型之聚集資料、預測指標等。聚集資料可包括行為統計,如登錄、點擊量、訪問時間、單個詳情頁(single detail page,SDP)訪問次數,及交易統計,如訂購次數、GMV(gross merchandise volume,總交易額)、訂單金額、訂單量之總和等。例如,預測指標可包括與用戶對“iPhone”之興趣相關之預測分數。電子商務企業需要收集及產生該等統計資料,以便對實體進行與電子商務服務相關之分析及決策。 In step S610, the electronic device may generate statistical data of the order based on the order data. In one embodiment, the statistical data may include a profile or metric representing characteristics associated with an e-commerce service of an entity (such as a user, a category, a promotion), for example, it may include entity attributes that can be used directly (such as a membership date), aggregate data of a specific event type, and predictive indicators. Aggregate data may include behavioral statistics, such as logins, clicks, visit time, single detail page (SDP) visits, and transaction statistics, such as the number of orders, GMV (gross merchandise volume), order amount, and the sum of order quantities. For example, a predictive indicator may include a predictive score related to a user's interest in "iPhone". E-commerce companies need to collect and generate such statistics in order to analyze and make decisions related to e-commerce services.
於S620步驟中,電子裝置可產生用以基於訂單資料來驗證統計資料之基線資料。電子裝置可從訂單資料之原始資料產生基線資料,原始資料可從批量管線使用之DBMS轉儲及實時管線使用之訊息隊列轉儲中提取。於一實施例中,可基於用戶之標識符對源資料進行採樣,以降低計算成本。 In step S620, the electronic device may generate baseline data for verifying statistical data based on order data. The electronic device may generate baseline data from raw data of the order data, and the raw data may be extracted from a DBMS dump used by a batch pipeline and a message queue dump used by a real-time pipeline. In one embodiment, the source data may be sampled based on a user's identifier to reduce computational costs.
於S630步驟中,電子裝置可對基線資料及統計資料各者之增量值進行比較。於S640步驟中,電子裝置可基於增量值之比較來驗證統計資料。例如,基線資料及統計資料可包括訂單量,基線資料及統計資料之增量值可計算為“value(till today)”(即到今天為止之訂單量)減去“value(till yesterday)”(即到昨天為止之訂單量),即“value(till today)-value(till yesterday)”。如此,電子裝置100可基於T時刻驗證之資料值為準,以T~T+1之間之時間單位跟蹤資料之增量值,並驗證跟蹤之增量值之準確性,從而判斷T+1時刻確認之資料值是否準確,藉此有效驗證大容量資料,而無需對從0時刻至T+1時刻之所有資料進行驗證。 In step S630, the electronic device may compare the incremental values of the baseline data and the statistical data. In step S640, the electronic device may verify the statistical data based on the comparison of the incremental values. For example, the baseline data and the statistical data may include order quantities, and the incremental values of the baseline data and the statistical data may be calculated as "value (till today)" (i.e., the order quantity until today) minus "value (till yesterday)" (i.e., the order quantity until yesterday), i.e., "value (till today) - value (till yesterday)". In this way, the electronic device 100 can track the incremental value of the data in time units between T and T+1 based on the data value verified at time T, and verify the accuracy of the tracked incremental value, thereby determining whether the data value confirmed at time T+1 is accurate, thereby effectively verifying large-capacity data without having to verify all data from time 0 to time T+1.
圖7a至圖7c係用以說明根據本發明複數個實施例之用以產生統計資料之設計之示例圖。 Figures 7a to 7c are exemplary diagrams used to illustrate designs for generating statistical data according to multiple embodiments of the present invention.
參閱圖7a,其圖示根據本發明一實施例之電子裝置100用以從資料源產生指標之資料管線架構710。指標可分為今天之指標及今天之前之指標,其中今天之指標可直接藉由流處理引擎(例如,Spark Streaming應用程式)計算,而今天之前之指標可批量(batch)計算並儲存於鍵值(key-value)儲存中。最終指標可藉由組合實時指標及批量指標而產生。對於如訂單資料等狀態可變之源資料,電子裝置100可使用抵消處理記錄之狀態變化。如圖7a所示,可添加用以記錄最新狀態記錄之資料庫(Database)711,於一實施例中,資料庫可包括由Cassandra資料庫管理系統管理之資料庫,其能夠支持鍵值讀寫。以訂單為例,資料管線可將所有新處理之源訂單事件儲存於資料庫711中並替換舊事件(若有)。對於接收之所有源訂單事件,若未新產生訂單,則管線獲取該訂單之先前記錄以計算抵消值。例如,若訂單被取消,則管線可將之前訂單量乘以-1所得的值傳輸至下游。 Referring to FIG. 7a, it illustrates a data pipeline architecture 710 for generating indicators from data sources by an electronic device 100 according to an embodiment of the present invention. Indicators can be divided into today's indicators and indicators before today, where today's indicators can be directly calculated by a stream processing engine (e.g., Spark Streaming application), and indicators before today can be calculated in batches and stored in key-value storage. The final indicator can be generated by combining real-time indicators and batch indicators. For state-variable source data such as order data, the electronic device 100 can use offset processing to record state changes. As shown in FIG. 7a, a database 711 for recording the latest status record may be added. In one embodiment, the database may include a database managed by a Cassandra database management system that supports key value reading and writing. Taking orders as an example, the data pipeline may store all newly processed source order events in the database 711 and replace old events (if any). For all received source order events, if no new order is generated, the pipeline obtains the previous record of the order to calculate the offset value. For example, if an order is canceled, the pipeline may transmit the value obtained by multiplying the previous order amount by -1 to the downstream.
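The split between today's streaming metrics and pre-computed batch metrics can be sketched in a few lines. The two dicts stand in for the key-value store and the stream engine's state; the key format is an illustrative assumption.

```python
# Sketch of combining "today" streaming metrics with "before today"
# batch metrics from a key-value store, as in architecture 710.
# Plain dicts stand in for both stores; keys are assumptions.

batch_kv_store = {"user1:order_qty": 1000}   # up to yesterday, batch-computed
streaming_today = {"user1:order_qty": 250}   # today, stream-computed

def final_metric(key):
    """Final indicator = batch portion + real-time portion."""
    return batch_kv_store.get(key, 0) + streaming_today.get(key, 0)

print(final_metric("user1:order_qty"))  # 1250
```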
Referring to FIG. 7b, it illustrates a data pipeline architecture 720 used by the electronic device 100 to generate indicators from a data source according to an embodiment of the present invention. As shown in FIG. 7b, a metric calculation module 721 may first store the details of business events and then generate the required indicators based on the stored detailed events. Architecture 720 is characterized in that the detailed events are aggregated during recalculation within the metric calculation module; its advantages are faster indicator calculation and lower storage cost, but its disadvantages are increased implementation complexity and difficulty in adding new indicators beyond the already-defined aggregation scope.
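The store-details-first, aggregate-on-recalculation flow of architecture 720 can be sketched as follows. This is a simplified assumption-laden sketch: a Python list stands in for the detailed-event store, and the class and method names are hypothetical.

```python
class MetricCalculator:
    """Stores detailed business events first; aggregates at calculation time."""

    def __init__(self):
        self.events = []  # detailed business events, stored before any aggregation

    def ingest(self, event):
        self.events.append(event)

    def calculate(self, group_key):
        # Aggregation happens during (re)calculation over the stored detailed
        # events, so any indicator derivable from the details can be computed,
        # but each recalculation must rescan the event store.
        totals = {}
        for e in self.events:
            k = e[group_key]
            totals[k] = totals.get(k, 0) + e["amount"]
        return totals

calc = MetricCalculator()
calc.ingest({"region": "TW", "amount": 100})
calc.ingest({"region": "TW", "amount": 50})
calc.ingest({"region": "KR", "amount": 70})
print(calc.calculate("region"))  # {'TW': 150, 'KR': 70}
```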
Referring to FIG. 7c, it illustrates a data pipeline architecture 730 used by the electronic device 100 to generate indicators from a data source according to an embodiment of the present invention. As shown in FIG. 7c, an indicator calculation module 731 may aggregate business events into a pre-aggregated format at a fine-grained level, and this pre-aggregated data is stored first. The indicator calculation module 731 may then send a notification to a real-time publisher 732, allowing the real-time publisher 732 to aggregate the pre-aggregated data into the final indicators. For source data with mutable state, such as order data, the electronic device 100 may use offsets to handle state changes of source events. Architecture 730 stores the detailed business events in the database 733 of the data conversion module rather than in the indicator calculation module, because the generated offset events can be processed as ordinary business events.
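The two-stage flow of architecture 730 can be sketched as follows: fine-grained pre-aggregation in the calculation stage, then a notified publisher that folds the pre-aggregates into the final indicator. This is a minimal sketch under assumed names (`IndicatorCalculator`, `RealTimePublisher`, per-minute buckets); the actual modules 731 and 732 are not specified at this level of detail in the source.

```python
class RealTimePublisher:
    """Stands in for real-time publisher 732: folds pre-aggregates into a final indicator."""

    def __init__(self):
        self.final = 0

    def notify(self, pre_aggregates):
        # Aggregate the fine-grained pre-aggregated data into the final indicator.
        self.final = sum(pre_aggregates.values())


class IndicatorCalculator:
    """Stands in for indicator calculation module 731."""

    def __init__(self, publisher):
        self.pre = {}  # fine-grained pre-aggregated storage (here: per-minute buckets)
        self.publisher = publisher

    def ingest(self, event):
        # Pre-aggregate at a fine-grained level and store first. Offset events
        # simply carry negative amounts, so they fold in like ordinary events.
        bucket = event["minute"]
        self.pre[bucket] = self.pre.get(bucket, 0) + event["amount"]
        # Then notify the publisher so it can produce the final indicator.
        self.publisher.notify(self.pre)


pub = RealTimePublisher()
calc = IndicatorCalculator(pub)
calc.ingest({"minute": "12:00", "amount": 100})
calc.ingest({"minute": "12:01", "amount": 50})
calc.ingest({"minute": "12:00", "amount": -100})  # offset event for a cancellation
print(pub.final)  # 50
```

Note how the cancellation requires no special path: the offset event flows through the same `ingest` call as any business event, which is the stated reason architecture 730 can keep detailed events out of the indicator calculation module.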
On the other hand, preferred embodiments of the present invention are disclosed in this specification and the accompanying drawings. Although specific terms are used, they are used only in a general sense to facilitate the description of the technical content of the invention and to aid understanding of it, and are not intended to limit the scope of the invention. It will be apparent to those of ordinary skill in the art to which the invention pertains that, in addition to the embodiments disclosed herein, other modifications based on the technical ideas of the invention may be implemented.
S610~S640: Steps S610~S640
Claims (10)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| KR1020220185940A KR102620080B1 (en) | 2022-12-27 | 2022-12-27 | Method and apparatus for processing order data |
| KR10-2022-0185940 | 2022-12-27 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW202431168A TW202431168A (en) | 2024-08-01 |
| TWI887894B true TWI887894B (en) | 2025-06-21 |
Family
ID=89511917
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW112147800A TWI887894B (en) | 2022-12-27 | 2023-12-08 | Non-transitory computer-readable storage medium and method and apparatus for processing order data |
Country Status (3)
| Country | Link |
|---|---|
| KR (2) | KR102620080B1 (en) |
| TW (1) | TWI887894B (en) |
| WO (1) | WO2024143671A1 (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR20090121754A (en) * | 2008-05-23 | 2009-11-26 | 주식회사 인포스트림 | Shopping settlement verification system and its operation method in e-commerce |
| TW201723939A (en) * | 2015-12-28 | 2017-07-01 | 英業達股份有限公司 | Accurate warehouse management system based on order status and method thereof |
| US20200242640A1 (en) * | 2015-02-24 | 2020-07-30 | Thinkcx Technologies, Inc. | System and method of analyzing social media to predict the churn propensity of an individual or community of customers |
| CN113781133A (en) * | 2020-06-22 | 2021-12-10 | 北京沃东天骏信息技术有限公司 | Method and device for processing order data |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110135882B (en) * | 2019-03-26 | 2021-04-13 | 口口相传(北京)网络技术有限公司 | Integration method and device of service verification and cancellation data, storage medium and terminal |
| CN114037155A (en) * | 2021-11-09 | 2022-02-11 | 首约科技(北京)有限公司 | Order statistics optimization method |
2022

- 2022-12-27 KR KR1020220185940A patent/KR102620080B1/en active Active

2023

- 2023-02-06 WO PCT/KR2023/001676 patent/WO2024143671A1/en not_active Ceased
- 2023-12-08 TW TW112147800A patent/TWI887894B/en active
- 2023-12-27 KR KR1020230192779A patent/KR20240104055A/en active Pending
Also Published As
| Publication number | Publication date |
|---|---|
| TW202431168A (en) | 2024-08-01 |
| TW202536744A (en) | 2025-09-16 |
| WO2024143671A1 (en) | 2024-07-04 |
| KR20240104055A (en) | 2024-07-04 |
| KR102620080B1 (en) | 2024-01-02 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10121169B2 (en) | Table level distributed database system for big data storage and query | |
| US12130776B2 (en) | Analysis of streaming data using deltas and snapshots | |
| US11664974B2 (en) | Summary chains in distributed systems | |
| CN107992492B (en) | Data block storage method, data block reading method, data block storage device, data block reading device and block chain | |
| CN114996240B (en) | Data table verification method, device, equipment, storage medium and program | |
| JP7754589B2 (en) | Method and computer program for tracking change data capture log history or CDC log history (tracking change data capture log history) | |
| CN115878707A (en) | Foreign exchange market data processing method and device, storage medium and equipment | |
| CN113326232B (en) | Data updating method and device | |
| CN117312375A (en) | Real-time business data flow analysis and processing method and system based on clickhouse | |
| TWI887894B (en) | Non-transitory computer-readable storage medium and method and apparatus for processing order data | |
| CN116150198A (en) | Data processing method and device, electronic device, storage medium | |
| TWI914254B (en) | Method, apparatus, and storage medium for data verification | |
| CN111563812A (en) | Method, device, computing equipment and medium for determining actual interest rate | |
| US11768855B1 (en) | Replicating data across databases by utilizing validation functions for data completeness and sequencing | |
| US10956369B1 (en) | Data aggregations in a distributed environment | |
| CN115878563B (en) | Method for realizing directory-level snapshot of distributed file system and electronic equipment | |
| CN116842106A (en) | Resource clue generation method and device | |
| CN115686869A (en) | Resource processing method, system, electronic device and storage medium | |
| CN113760870B (en) | Business data processing method, device and equipment | |
| CN116955302B (en) | Method and device for storing business file, storage medium and electronic device | |
| CN120892258B (en) | Methods, equipment, media, and products for static backup of web system data | |
| CN120045625A (en) | Data output method, device, computer equipment and storage medium | |
| CN120578533A (en) | Data processing method, computer equipment and storage medium based on TA system | |
| CN119493798A (en) | Data aggregation processing method, device, computer equipment and storage medium | |
| CN119415314A (en) | A data backup and cleaning method and device based on data change perception |