TWI486798B

TWI486798B - Method and device for classifying data

Info

Publication number: TWI486798B
Application number: TW099112965A
Authority: TW
Original assignee: Alibaba Group Holding Ltd
Priority date: 2010-04-23
Filing date: 2010-04-23
Publication date: 2015-06-01
Also published as: TW201137644A

Description

Method and device for data classification

The present application relates to the field of data processing, and in particular to a method and apparatus for classifying data.

On an e-commerce website, product data is usually stored as text, data tables, and similar forms. A single e-commerce website may need to manage tens of millions of product records. How to classify product data according to the information it describes, and how to manage similar product data together so as to reduce the management complexity and the operating load of the system, is therefore the first problem to consider when operating an e-commerce website.

At present, e-commerce websites usually classify product data with clustering algorithms, that is, they divide the product data into multiple categories through similarity analysis according to a preset series of rules and conditions. In the prior art, the most commonly used clustering algorithm is the hierarchical clustering algorithm.

The hierarchical clustering algorithm is a bottom-up strategy: each object to be classified first forms its own atomic cluster, and these atomic clusters are then merged into higher-level clusters until all objects are gathered in a single cluster or a termination condition is reached.
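The bottom-up strategy described above can be sketched as follows. This is a minimal illustration rather than code from the application: the function name `agglomerative`, the single-linkage distance, and the `stop_at` termination condition are assumptions chosen for brevity.

```python
def agglomerative(points, distance, stop_at=1):
    """Bottom-up clustering sketch (hypothetical helper).

    Every point starts as its own atomic cluster; the two closest clusters
    are merged repeatedly until only `stop_at` clusters remain.
    """
    clusters = [[p] for p in points]
    while len(clusters) > stop_at:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(distance(x, y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


groups = agglomerative([1, 2, 10, 11], lambda x, y: abs(x - y), stop_at=2)
```

Every merge step compares all remaining cluster pairs, which gives a feel for why the computation grows quickly as the number of objects increases.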

On an e-commerce website, the data related to a product usually includes several kinds of information, such as the product's identifier, category, and attributes. The number of products on an e-commerce website runs to tens of thousands, and the related data can correspondingly reach hundreds of thousands of kinds. For example, a product may belong to the category "mobile phone", one of its attributes may be "mobile phone brand", and the value of that attribute may be "brand A", "brand B", and so on. Classifying the product data of an e-commerce website with a hierarchical clustering algorithm therefore requires so much computation that a single machine cannot complete it, and a server cluster is needed to perform the computation. This wastes system resources and consumes a large amount of computing time, so the product data cannot be classified in a timely and effective manner, which lowers the execution efficiency of the classification process.

The embodiments of the present application provide a method and an apparatus for classifying data, so as to improve the execution efficiency of the product data classification process.

The specific implementations provided by the embodiments of the present application are as follows. A method for classifying data comprises: obtaining the related data of each product to be classified and extracting the product title from it; segmenting each product title into word segments and determining the weight of each word segment, where the weight of a word segment represents its historical frequency of occurrence; for each product, selecting the word segments whose weights satisfy a preset condition to form a word segment sequence; and comparing the word segment sequences selected for the products and merging the related data of products whose word segment sequences are identical.

An apparatus for classifying products comprises: an extracting unit, configured to obtain the related data of each product to be classified and extract the product title from it; a dividing unit, configured to segment each product title into word segments and determine the weight of each word segment, where the weight of a word segment represents its historical frequency of occurrence; a selecting unit, configured to select, for each product, the word segments whose weights satisfy a preset condition to form a word segment sequence; and a merging unit, configured to compare the word segment sequences selected for the products and merge the related data of products whose word segment sequences are identical.

In the embodiments of the present application, a class of products is identified by the word segment sequence segmented and extracted from the product title and attribute information, and products with identical word segment sequences are merged. This greatly reduces the amount of product data to be processed and allows products to be classified quickly and accurately in a short time, which improves the execution efficiency of the classification process, reduces the management complexity of the product data, and lightens the computational load of the system.

To improve the execution efficiency of the product data classification process and reduce the operating load of the system, the embodiments of the present application classify product data as follows: the related data of each product to be classified is obtained and the product title is extracted from it; each product title is segmented into word segments and the weight of each word segment is determined, where the weight of a word segment represents its historical frequency of occurrence; for each product, the word segments whose weights satisfy a preset condition are selected to form a word segment sequence; and the word segment sequences selected for the products are compared, and the related data of products whose word segment sequences are identical is merged.

Merging products with identical word segment sequences can be done in either of two ways: the related data of the products with identical word segment sequences is merged directly; or specified attribute values of those products are obtained, and only the related data of products whose specified attribute values are also identical is merged. The following embodiments take the second case as an example.

Preferred embodiments of the present application are described in detail below with reference to the accompanying drawings.

Referring to FIG. 1, in the embodiments of the present application the management apparatus for classifying products includes an extracting unit 10, a dividing unit 11, a selecting unit 12, and a merging unit 13. The extracting unit 10 is configured to obtain the related data of each product to be classified and extract the product title from it. The dividing unit 11 is configured to segment each product title into word segments and determine the weight of each word segment, where the weight of a word segment represents its historical frequency of occurrence. The selecting unit 12 is configured to select, for each product, the word segments whose weights satisfy a preset condition to form a word segment sequence. The merging unit 13 is configured to compare the word segment sequences selected for the products and merge the related data of products whose word segment sequences are identical.

Referring to FIG. 1, in this embodiment the management apparatus further includes a processing unit 14, configured to set and store a corresponding product identification (ID) for each class of products obtained after merging.

Based on the above principle and referring to FIG. 2, in the embodiments of the present application the detailed process by which the management apparatus classifies all product data on the e-commerce website is as follows.

Step 200: Obtain the related data of each product to be classified, and extract the product title and the corresponding attribute information from it.

Usually, when a user publishes product information on an e-commerce website, the user fills in various information, such as the title and attributes, on the product publishing page. The completed page is uploaded from the user's client to the website server. After receiving the page, the website server extracts the title information it contains and segments the title into words. For example, the title of a hair dryer may be "** brand D3506 model hair dryer". A product title often contains keywords that distinguish the product, so extracting the product title is necessary.

The attribute information of a product often contains a detailed description of the product. For example, the attribute information of a hair dryer may include its launch date, color type, air outlet shape, market price, popularity index, and so on. In the embodiments of the present application, attributes and attribute values are both stored as IDs. For example, a product whose color attribute is green can be represented as "attribute A: 2000", where A is the ID of the color attribute and 2000 is the ID of the value green. In this embodiment, merging takes into account how closely both the product titles and the attribute information match, so the product title and the attribute information are extracted together in step 200. In practice, the attribute information may instead be extracted during the merging step; step 200 is only an example.
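As a small illustration of the ID-based representation, a product's attributes could be held as a mapping from attribute ID to value ID. The lookup tables and the `describe` helper below are hypothetical; only the pair "A: 2000" (color: green) comes from the text.

```python
# Hypothetical lookup tables; only "A" -> color and 2000 -> green are from the text.
ATTRIBUTE_NAMES = {"A": "color"}
VALUE_NAMES = {2000: "green"}

# One product's attribute information, stored purely as IDs.
product_attrs = {"A": 2000}


def describe(attrs):
    """Translate attribute/value IDs back into readable names."""
    return {ATTRIBUTE_NAMES[a]: VALUE_NAMES[v] for a, v in attrs.items()}


readable = describe(product_attrs)
```

Storing IDs rather than free text means two products can be compared on an attribute with a simple equality check.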

Step 210: Segment each product title into word segments and determine the weight of each word segment, where the weight of a word segment represents information related to its historical frequency of occurrence, such as the number of times users have searched for it and/or the number of times merchants have used it and its distribution probability.
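One simple way to turn historical occurrences into weights is a normalized frequency count. The sketch below assumes the history is available as a flat list of word segments (for instance, taken from past search queries); the function name `token_weights` and the normalization are illustrative assumptions, not the weighting scheme defined by the application.

```python
from collections import Counter


def token_weights(historical_tokens):
    """Weight each word segment by its share of all historical occurrences."""
    counts = Counter(historical_tokens)
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}


# Hypothetical history: "hair dryer" appears twice, the others once.
weights = token_weights(["hair dryer", "D3506", "hair dryer", "brand X"])
```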

In this embodiment, the Hadoop distributed computing system (Hadoop is a framework for distributed computing) is used to segment the product titles and attribute information. For example, the product title "** brand D3506 model hair dryer" is divided into the word segments "** brand", "D3506 model", and "hair dryer". Preferably, Hadoop is run in distributed mode, that is, the Hadoop program is executed on a computing cluster made up of multiple machines (for example, 50 to 300).
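The text does not say which segmentation algorithm is run on Hadoop. As a single-machine stand-in, the sketch below uses forward maximum matching against a small vocabulary; both the technique and the vocabulary are assumptions made for illustration. The sample title is the original-language form of the hair dryer example, since segmentation matters most for text written without spaces.

```python
def segment(title, vocabulary):
    """Forward maximum matching: at each position take the longest known word.

    Characters that start no known word are emitted as single-character segments.
    """
    tokens, i = [], 0
    max_len = max(map(len, vocabulary), default=1)
    while i < len(title):
        for size in range(min(max_len, len(title) - i), 0, -1):
            piece = title[i:i + size]
            if piece in vocabulary or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary: "** brand", "D3506 model", "hair dryer".
vocabulary = {"**品牌", "D3506型號", "吹風機"}
tokens = segment("**品牌D3506型號吹風機", vocabulary)
```

In a distributed setting, each machine could apply the same function to its own share of the titles, since titles are segmented independently of one another.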

To improve the precision of the segmentation, in this embodiment the management apparatus preferably uses historical reference information in the database, after the segmentation has been performed, to retain the core word segments that reflect the product brand and product type, such as "** brand" and "** style". Correspondingly, superfluous word segments that have no reference value for classification, such as "authentic", "promotion", and "special offer", are deleted.
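The clean-up step can be pictured as a filter against a list of uninformative word segments. The set `NOISE_TOKENS` below holds the three examples named in the text; in practice such a list would be derived from the historical reference information, and the helper name is hypothetical.

```python
# The three examples of uninformative word segments named in the text.
NOISE_TOKENS = {"authentic", "promotion", "special offer"}


def keep_core_tokens(tokens, noise=NOISE_TOKENS):
    """Drop word segments that carry no value for classification."""
    return [t for t in tokens if t not in noise]


core = keep_core_tokens(["authentic", "** brand", "D3506 model", "promotion"])
```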

Step 220: For each product, select the word segments whose weights satisfy the preset condition to form a word segment sequence.

In this embodiment, the preset condition is: select the two word segments with the highest weights from the product title, and the five word segments with the highest weights from the attribute information. This condition is only an example; how the word segments are selected and how many are selected can be set according to the actual application environment, and is not described further here.
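A sketch of this selection rule, assuming the weights are available as a dictionary. The function name, the alphabetical tie-break, and the treatment of unknown word segments as weight 0 are assumptions; the defaults of two title segments and five attribute segments follow the example condition above.

```python
def select_sequence(title_tokens, attr_tokens, weights, n_title=2, n_attr=5):
    """Form a word segment sequence from the highest-weighted segments."""

    def top(tokens, n):
        # Highest weight first; ties broken alphabetically for determinism.
        ranked = sorted(set(tokens), key=lambda t: (-weights.get(t, 0.0), t))
        return ranked[:n]

    return tuple(top(title_tokens, n_title) + top(attr_tokens, n_attr))


weights = {"brand X": 0.9, "D3506": 0.8, "hair dryer": 0.5, "red": 0.3}
sequence = select_sequence(["brand X", "D3506", "hair dryer"], ["red"], weights)
```

Returning a tuple makes the sequence hashable, so it can serve directly as a grouping key when products are compared.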

Step 230: Compare the word segment sequences selected for the products, obtain the specified attribute values of the products whose word segment sequences are identical, and merge the related data of the products whose specified attribute values are also identical.

In this embodiment, merging the related data of products means assigning those products to the same class. For example, the related data of the products is stored as one group of text or in one data table, and in subsequent management the products are presented, published, modified, and otherwise handled as a single product.
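Because merging relies on exact equality of sequences (and optionally of specified attribute values), it can be sketched as a single grouping pass rather than a pairwise comparison. The record layout (`id`, `sequence`, `attrs`) and the function name below are hypothetical.

```python
from collections import defaultdict


def merge_products(products, key_attrs=()):
    """Group product IDs by word segment sequence and specified attribute values.

    With an empty `key_attrs`, products are merged on the sequence alone.
    """
    groups = defaultdict(list)
    for p in products:
        key = (tuple(p["sequence"]), tuple(p["attrs"].get(a) for a in key_attrs))
        groups[key].append(p["id"])
    return list(groups.values())


products = [
    {"id": 1, "sequence": ("brand X", "D3506"), "attrs": {"A": 2000}},
    {"id": 2, "sequence": ("brand X", "D3506"), "attrs": {"A": 2000}},
    {"id": 3, "sequence": ("brand X", "D3506"), "attrs": {"A": 2001}},
]
by_sequence = merge_products(products)
by_sequence_and_attr = merge_products(products, key_attrs=("A",))
```

The first call corresponds to merging directly on the sequence; the second also requires the specified attribute "A" to match, so product 3 stays in its own class.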

In this embodiment, after the related data of the products has been classified according to steps 200 to 230, a product ID is set for each class of products to identify that class uniquely. Actual test data show that, with this method, an e-commerce website that covers several hundred million products can be reduced to product classes numbering in the tens of millions. This greatly reduces the number of objects the website has to manage, lowers the management complexity of the product data, and lightens the computational load of the website.

After the above steps, the weights of the word segments need to be readjusted according to the segmentation result obtained in step 210. The adjustment may be performed immediately after step 210, or after steps 200 to 230 have all been completed. Preferably, particular attention is paid to the weights of word segments that contain a product model. A product model is made up of symbols such as digits and letters and has the greatest reference value in the classification process, so the weights of word segments of this type should be set relatively high.
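One way to give model-like word segments a higher weight is to detect them with a pattern and scale their weights. The regular expression (letters and digits mixed, hyphens allowed) and the boost factor of 2.0 are assumptions made for this sketch; the text only states that such weights should be set relatively high.

```python
import re

# Assumed pattern: at least one letter and one digit, e.g. "D3506" or "DF0753".
MODEL_PATTERN = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9-]+$")


def adjust_weights(weights, boost=2.0):
    """Scale up the weight of word segments that look like product models."""
    return {
        token: weight * boost if MODEL_PATTERN.match(token) else weight
        for token, weight in weights.items()
    }


adjusted = adjust_weights({"D3506": 0.4, "hair dryer": 0.5})
```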

Based on the above embodiment, and to further improve the accuracy of the classification result, in this embodiment the classification result is further optimized after steps 200 to 230 have been performed and before the corresponding product IDs are set for the classes. Referring to FIG. 3, the detailed optimization process is as follows.

Step 300: Determine, from the classification result, the word segment sequence that distinguishes each class of products.

The word segment sequence here is the characteristic combination of word segments obtained for each class of products after steps 200 to 230. For example, if segmentation assigns to the same class the products whose titles and attribute information contain the word segments "** brand", "** style", "red", "DF0753", and "size L", then the word segment sequence of that class is "** brand ** style red DF0753 size L".

Step 310: Calculate the similarity between the word segment sequences of every pair of product classes.

In this embodiment, the similarity between the word segment sequences of any two classes of products is calculated using the following formula (the formula image is not reproduced in this text):

where TD1 and TD2 are the word segment sequences of the two classes of products being compared, for example:

TD1 = (word11, score11), (word12, score12), (word13, score13)

TD2 = (word21, score21), (word22, score22), (word23, score23)

Here word denotes a word segment and score denotes its weight.

prop1 and prop2 are the main attribute values of the two classes of products being compared. A main attribute is an important attribute: for a mobile phone, for example, the most important attributes are brand and model, whereas color and weight are ordinary attributes. A main attribute value is the specific value of such an attribute; brand, for example, is a main attribute. The degree of similarity is computed by cosine. The greater the similarity, the more similar the two products are.
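The text mentions a cosine computation without spelling it out. The sketch below shows ordinary cosine similarity applied to two lists of (word, score) pairs, treating each list as a sparse vector indexed by word. How this value enters the overall similarity is not specified in the text, so the function is an illustration of the cosine measure only.

```python
import math


def cosine(td1, td2):
    """Cosine similarity between two lists of (word, score) pairs."""
    v1, v2 = dict(td1), dict(td2)
    dot = sum(score * v2.get(word, 0.0) for word, score in v1.items())
    norm1 = math.sqrt(sum(s * s for s in v1.values()))
    norm2 = math.sqrt(sum(s * s for s in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an empty or all-zero sequence is similar to nothing
    return dot / (norm1 * norm2)


same = cosine([("brand X", 1.0), ("D3506", 1.0)], [("brand X", 1.0), ("D3506", 1.0)])
disjoint = cosine([("brand X", 1.0)], [("brand Y", 1.0)])
```

Identical sequences give a value of 1 (up to floating-point rounding), and sequences with no word in common give 0.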

λ is a coefficient that controls weighting. λ₁ and λ₂ are two different coefficients which, when the similarity is calculated, indicate whether the title or the attributes matter more. For example, λ₁ = 2 and λ₂ = 1 means that the title is twice as important as the attributes.

a and b are preset parameters, and n1 and n2 are the numbers of products contained in each of the two classes being compared. a and b control the value of the similarity and thereby indirectly control how likely the two classes are to be merged. For example, when both classes contain many products, the values of a and b can be chosen so that the calculated similarity becomes smaller, which lowers the chance that the two classes are merged.

For example, with a = 50, b = 20, n1 = 100, and n2 = 10:

$$\text{similarity} = e^{-\lambda_1 |TD1 - TD2|} \cdot e^{-\lambda_2 |prop1 - prop2|} \cdot \frac{1}{1 + e^{50/20}}$$

When both exponential factors equal 1, this reduces to $1/(1 + e^{2.5}) \approx 0.07585818$, or roughly 7%.
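The expression in the worked example can be evaluated directly. The sketch below follows only that example: it takes the two distances |TD1 - TD2| and |prop1 - prop2| as already-computed numbers and uses the third factor exactly as written there, 1/(1 + e^(a/b)). The text does not show how n1 and n2 enter the formula, so they are deliberately left out; the function and parameter names are assumptions.

```python
import math


def similarity(td_distance, prop_distance, lam1=2.0, lam2=1.0, a=50.0, b=20.0):
    """Evaluate the product of the three factors from the worked example."""
    title_factor = math.exp(-lam1 * td_distance)    # e^(-lambda1 * |TD1 - TD2|)
    prop_factor = math.exp(-lam2 * prop_distance)   # e^(-lambda2 * |prop1 - prop2|)
    size_factor = 1.0 / (1.0 + math.exp(a / b))     # 1 / (1 + e^(a/b))
    return title_factor * prop_factor * size_factor


# With both distances at zero, only the third factor remains.
value = similarity(0.0, 0.0)
```

Any non-zero distance shrinks the result further, because both exponential factors are at most 1.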

Step 320: Compare each similarity obtained for a pair of product classes with a set threshold, and merge the two classes whose word segment sequence similarity reaches the set threshold.

For example, suppose the similarity calculated in step 310 for the word segment sequences of two classes of products is 7%. If the set threshold is 5, the similarity is far below the threshold, which means that the two classes cannot be merged.

Steps 300 to 320 are performed because two classes of products with different word segment sequences may in fact be the same product, with the merchants merely having entered titles and attribute information that are not exactly the same. Performing steps 300 to 320 therefore refines the classification result obtained in steps 200 to 230 and makes it more precise. In practice, to optimize the classification result further, steps 300 to 320 may be iterated a set number of times, so that the number of product classes in the final classification result is reduced further.
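The threshold test and the repeated passes can be sketched as a greedy merge loop. The function name, the greedy "merge into the first sufficiently similar class" policy, and the early exit when a pass merges nothing are assumptions; the similarity function is passed in as a parameter so that any measure can be plugged in.

```python
def merge_similar(classes, similarity, threshold, rounds=1):
    """Merge classes whose similarity reaches `threshold`, for up to `rounds` passes."""
    classes = [list(c) for c in classes]
    for _ in range(rounds):
        merged_any = False
        result = []
        for candidate in classes:
            for existing in result:
                if similarity(existing, candidate) >= threshold:
                    existing.extend(candidate)
                    merged_any = True
                    break
            else:
                # No sufficiently similar class found: keep it separate.
                result.append(candidate)
        classes = result
        if not merged_any:
            break
    return classes


# Toy similarity: classes are alike when their first items share a first letter.
def toy_similarity(x, y):
    return 1.0 if x[0][0] == y[0][0] else 0.0


merged = merge_similar([["a"], ["a2"], ["b"]], toy_similarity, threshold=0.5, rounds=3)
```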

With this method, the product classes numbering in the tens of millions can be reduced further to a few million, and the whole process takes only a few hours. This again greatly reduces the number of objects the e-commerce website has to manage, further lowers the complexity of managing the product data, and lightens the computational load.

In summary, in the embodiments of the present application a class of products is identified by the word segment sequence segmented and extracted from the product title and attribute information, and the related data of products with identical word segment sequences is merged. This greatly reduces the amount of product data to be processed and allows products to be classified quickly and accurately in a short time, which improves the execution efficiency of the classification process, reduces the management complexity of the product data, and lightens the computational load of the system.

Building on this scheme, the embodiments of the present application go on to optimize the classification result using the similarity between word segment sequences. This further improves the accuracy of the classification result, further reduces the amount of product data to be processed, and improves the execution efficiency of the classification process.

Those skilled in the art can make various changes and variations to the embodiments of the present application without departing from its spirit and scope. If such modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the embodiments of the present application are intended to cover them as well.

10: Extracting unit

11: Dividing unit

12: Selecting unit

13: Merging unit

14: Processing unit

FIG. 1 is a functional structure diagram of the management apparatus in an embodiment of the present application;

FIG. 2 is a flowchart of classifying product data in an embodiment of the present application;

FIG. 3 is a flowchart of optimizing the classification result in an embodiment of the present application.

Claims

A method for classifying data, comprising: obtaining the related data of each product to be classified, and extracting the product title from it; segmenting each product title into word segments, and determining the weight of each word segment, wherein the weight of a word segment represents its historical frequency of occurrence; for each product, selecting the word segments whose weights satisfy a preset condition to form a word segment sequence; and comparing the word segment sequences selected for the products, and merging the related data of products whose word segment sequences are identical.

The method according to claim 1, wherein, after each product title has been segmented into word segments, the weight of each word segment is adjusted according to the segmentation result.

The method of claim 1, wherein merging the related data of products whose word segment sequences are identical comprises: merging the related data of the products with identical word segment sequences directly; or obtaining specified attribute values of the products with identical word segment sequences, and merging the related data of the products whose specified attribute values are identical.

The method of claim 1, further comprising, after merging the related data of products whose word segment sequences are identical: calculating the similarity between the word segment sequences of any two classes of products; and comparing each obtained similarity with a set threshold, and merging the related data of the two classes of products whose word segment sequence similarity reaches the set threshold.

The method of claim 4, wherein the similarity between the word segment sequences of two classes of products is calculated using the following formula (the formula image is not reproduced in this text), where TD1 and TD2 are the word segment sequences of the two classes of products being compared, prop1 and prop2 are the main attribute values of the two classes of products being compared, λ₁ and λ₂ are preset control coefficients, a and b are preset parameters, and n1 and n2 are the numbers of products contained in each of the two classes being compared.

The method of claim 4, wherein, after the related data of the two classes of products whose word segment sequence similarity reaches the set threshold has been merged, the operation is iterated a preset number of times.

The method of claim 1, wherein a corresponding product identification (ID) is set and stored for each class of products obtained after merging.

An apparatus for classifying products, comprising: an extracting unit, configured to obtain the related data of each product to be classified, and extract the product title from it; a dividing unit, configured to segment each product title into word segments, and determine the weight of each word segment, wherein the weight of a word segment represents its historical frequency of occurrence; a selecting unit, configured to select, for each product, the word segments whose weights satisfy a preset condition to form a word segment sequence; and a merging unit, configured to compare the word segment sequences selected for the products, and merge the related data of products whose word segment sequences are identical.

The apparatus according to claim 8, wherein, after segmenting each product title into word segments, the dividing unit adjusts the weight of each word segment according to the segmentation result.

The apparatus of claim 8, wherein, when merging the related data of products whose word segment sequences are identical, the merging unit either merges the related data of the products with identical word segment sequences directly, or obtains specified attribute values of the products with identical word segment sequences and merges the related data of the products whose specified attribute values are identical.

The apparatus of claim 8, wherein, after merging the related data of products whose word segment sequences are identical, the merging unit calculates the similarity between the word segment sequences of any two classes of products, compares each obtained similarity with a set threshold, and merges the related data of the two classes of products whose word segment sequence similarity reaches the set threshold.

The apparatus according to claim 11, wherein, after merging the related data of the two classes of products whose word segment sequence similarity reaches the set threshold, the merging unit iterates the operation a preset number of times.

The apparatus according to claim 8, 9, or 10, further comprising: a processing unit, configured to set and store a corresponding product identification (ID) for each class of products obtained after merging.