WO2024165134A1

WO2024165134A1 - A system and method for key attributes identification in generating product knowledge graphs

Info

Publication number: WO2024165134A1
Application number: PCT/EP2023/052843
Authority: WO
Inventors: Btissam ER-RAHMADI; Felix Arturo ONCEVAY MARCOS; Jeff Pan
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2023-02-06
Filing date: 2023-02-06
Publication date: 2024-08-15

Abstract

A method of identifying synonyms among attributes of a product for use in building a product knowledge graph. The method comprises determining a similarity score between a first attribute and a second attribute by: determining a similarity of the label of the first attribute relative to the second attribute; determining a second similarity of the label of the first attribute relative to the second attribute; calculating an aggregate of the first and second similarity and multiplying the aggregate value by a third similarity quality of a set of values of the first attribute relative to a set of values of the second attribute. An accompanying training method and data processing system are also provided. A method of identifying key attributes of a product category using a machine learning model comprising a first pre-trained language model for a classification task and a second pre-trained language model for a regression task.

Description

A SYSTEM AND METHOD FOR KEY ATTRIBUTES IDENTIFICATION IN GENERATING PRODUCT KNOWLEDGE GRAPHS

FIELD OF THE INVENTION

This invention relates to product knowledge graphs as used for determining optimal products in response to a search or query. Part of this process may include determining important product features or search terms and their synonyms.

BACKGROUND

Global web sales have recently manifested an unequalled growth, which was particularly boosted by the covid-19 crisis. For instance, global ecommerce sales grew by 24% to $4.29 trillion in 2020. Moreover, it is predicted that the online shopping market will continue to grow at a remarkable rate. For example, it’s predicted that 95% of all purchases will be carried out online by the year 2040. Among the global ecommerce markets, the Chinese market is predicted to account for more than 50% of retail sales. However, the technical field of ecommerce platforms and solutions is currently dominated by giant tech companies like Amazon in the international market. Whereas Alibaba and JD dominate in the Chinese market with a share of 59% and 17% of the market respectively.

The increasing interest in e-shopping and e-retail is driving innovative solutions for information retrieval in the ecommerce domain. This typically takes place by adopting general search methods in computing and then adapting these to the shopping domain. For instance, Knowledge Graphs (KGs) have been largely used in leading general search Engines such as Google, Microsoft, Bing, etc. Since their introduction and development, KGs are becoming more crucial in supporting other applications like recommendation, personalized search, question answering and natural language generation to mention but a few.

However, it is challenging to build KGs, especially in closed domains like ecommerce, which is a highly specific, dynamic, and complex domain. Leading tech companies have already taken on the challenge of constructing their own ecommerce KGs and leveraging them in their services. Amazon, Alibaba, and JD have developed their own respective Product Knowledge Graphs. Walmart has its Retail Graph, and more recently eBay has also developed its own approach. There are many existing approaches which have tried to address the problem of identifying the most important attributes, both in general domains and more specific domains like ecommerce. Most of the existing approaches focus only on identifying the importance of an attribute, with few considering the applicability of an attribute to a product category. Existing methods about key attribute identification can be classified into two main classes.

The first class comprises frequency-based solutions. These solutions mainly collect statistics about the frequency of occurrence of attributes in different sources of textual data, then leverage this information to make the decision about the applicability and the importance of those attributes for a product’s categories. The occurrence of an attribute in an input text is determined by string matching of its values only, or of both its values and label. For example, string matching the attribute values ‘blue’, ‘red’, ‘yellow’ etc, or also string matching the attribute label ‘colour’.

The second class comprises vectorisation-based solutions. These solutions use embeddings of the attributes textual data obtained from a pre-trained language model and apply similarity measures to make the decision about the applicability and importance.

The following are brief descriptions of specific methods, including which of the above classifications they fall within.

One existing approach is to build personal KGs (PKGs). This can be an automatic way of building a web-sale PKG, thus comprising thousands of products’ categories. One of the vital steps of building a KG is to construct the ontology and maintain it by enriching it regularly.

Another existing approach is the important properties selection used for comparison tables- based recommendation. This approach uses structured attributes extracted from products catalogues in order to determine which attributes are relevant to shoppers when they are making a purchase decision.

Another existing approach is that used is identifying which existing attributes in an opendomain KG are important to entities.

It is desirable to develop a method of producing a knowledge graph for ecommerce in order to meet the growing expectations of customers which not only optimises the speed with which results can be obtained but also provides product suggestions which are accurate and relevant to the customer’s needs. SUMMARY OF THE INVENTION

According to one aspect there is provided a method of identifying synonyms among attributes of a product for use in building a product knowledge graph, the method comprises determining a similarity score between a first attribute and a second attribute by: determining a first similarity quality of the label of the first attribute relative to the label of the second attribute; determining a second similarity quality of the label of the first attribute relative to the label of the second attribute; calculating a value representing an aggregate of the first similarity quality and the second similarity quality of the labels; and multiplying the aggregate value by a third similarity quality of a set of values of the first attribute relative to a set of values of the second attribute. By including a similarity quality of the value sets attributes can be provided with increased context for identifying attribute synonyms.

The third similarity quality may be an aggregate of a semantic similarity between a set of values of the first attribute and a set of values of the second attribute. Considering the similarity between values of different attributes increases the likelihood of identifying synonymous attributes with dissimilar labels.

The method may comprise calculating the semantic similarity of attribute values as a cosine similarity for pairs of values, one from each of the different attribute value sets, where a sentence-transformer model is used to determine the embedding of each of the values in the sets.

The method may comprise aggregating the semantic similarities of the attribute value set pairs by summing them and dividing by the number of pairs according to the equation:

where Va is the ‘i’th value of attribute a, v_b ^J is the ‘j’th value of attribute b, and n and m represent the numerical size of each value set.

The first similarity quality may be a semantic similarity of the label of the first attribute relative to the label of the second attribute. The first similarity quality may be obtained as a cosine similarity applied to an embedding vector of the labels obtained from a pre-trained language model, where the embedding vectors are inferred from a pre-trained language model which maps text fragments and paragraphs to a multi-dimensional dense vector, according to the equation:

e_Lb), where L_a and L_b represent the label of attribute a and attribute b respectively. By using a pre-trained language model existing knowledge of language patterns can be leveraged to increase efficiency.

The second similarity quality may be an edit-distance similarity of the label of the first attribute relative to the label of the second attribute. The edit-distance similarity may consider the erroneous transposition of two adjacent characters, insertion of characters, deletion of characters, and substitution of characters. Considering edit-distance allows for identification of synonyms including human errors in data entry.

The edit-distance similarity may be calculated according to the equation:

Sim₂ L_a, L_b) = 1 - DLD L_a,L_b , where L_a and L_b represent the label of attribute a and attribute b respectively, and DLD is an edit-distance similarity algorithm called the Damerau-Levenshtein Distance. The DLD algorithm specifically includes consideration of transposition, increasing the accuracy of synonym identification in light of human errors.

The method may comprise: determining one or more synonyms among a plurality of attributes based on the value of the similarity score for each respective attribute pair; grouping synonymous attributes together; and defining a single representative attribute for each group of synonymous attributes. By identifying a single representative of a group of synonyms the processing becomes more efficient.

According to another aspect there is provided a method of training a machine learning model for identifying key attributes of a product category from a query, the machine learning model comprising two pre-trained language models, the method comprises: inputting training data to each pre-trained language model comprising two text fragments, a first text fragment comprising a full path of a product category in a product hierarchy and a second text fragment comprising an attribute label, where the training data has been refined according to the synonym identification method of claims 1 to 10; fine-tuning a first pre-trained language model for use in a classification task using said training data, where the classification task is classifying the applicability of the attribute to the product category to determine an applicability class; and fine-tuning a second pre-trained language model for use in a regression task using said training data, where the regression task is to determine the importance of the attribute to the product category. This improves the efficiency of key attribute identification.

The method may comprise, after applying the synonym identification to the training data, sampling the training data and manually annotating the data for each of the fine-tuning tasks. The method may comprise manually annotating the training data with additional contextual information, where for product category the additional information may include any combination of a description, corpuses from related products, synonyms, characterising bags of words, and summarized reviews from related products and for attribute the additional information may include any combination of a description, synonyms, the value set, the full path in an attributes hierarchy, corpuses from related products, characterising bags of words, and summarized reviews from related products. This can improve the accuracy of key attribute identification.

The output of the classification task may be an applicability class of T or ‘O’, where T represents that the attribute does apply to the product category and ‘0’ represents that the attribute does not apply to the product category. The output of the regression task may be an importance score of any value between 0 and 1, where T indicates that the attribute is very important to that product category and ‘0’ indicates that the attribute is not important to the product category. The pre-trained language model may be a BERT or DistilBERT model. By refining the training data for these processes, the key attribute identification accuracy is improved.

According to another aspect there is provided a data processing system comprising one or more processors configured to: identify synonyms among attributes of a product category for use in building a product knowledge graph, by determining a similarity score between a first attribute and a second attribute according to the method of claims 1 to 10.

The data processing system may be configured to identify key attributes of a product category for use in building a product knowledge graph, by determining an applicability class and an importance score of each attribute to a product category according to the method of claims 11 to 16. BRIEF DESCRIPTION OF THE FIGURES

The present invention will now be described by way of example with reference to the accompanying drawings. In the drawings:

Fig. 1 shows a graphic representation of an example of a typical knowledge graph.

Fig. 2 shows a schematic diagram of the synonyms identification component.

Figs. 3a and 3b show schematic diagrams comprising an attribute applicability classifier and importance prediction of the key attribute identification component respectively.

Fig. 4 shows a schematic diagram of the overall system architecture.

DETAILED DESCRIPTION OF THE INVENTION

A Product Knowledge Graph (PKG) is made up of triples such as ‘subject’, ‘predicate’, and ‘object’. The subject is an entity characterised by an ID, for example a product with an ID. The subject can belong to at least one type, for example, product category. The object can be an entity or an atomic value in the form of a string or a number. The predicate represents the relationship between the subject and object. An example of a triple between two entities is (prod id, isShippedFrom, country id), which indicates that the product identified is shipped from the country identified. An example of a triple between an entity and a string atomic value is (prod id, hasColor, “blue”), which indicates that the product identified has a colour which can be blue. Finally, an example of a triple between an entity and a numeric atomic value is (prod id, hasWeight, 100). Consequently, entities and atomic values represent the nodes of the PKG, whereas predicates are the edges or connections.

As mentioned above, building a Product Knowledge Graph (PKG) can be very challenging. Generally, product catalogues are used as the primary data source from which to construct a PKG. Other data sources reflecting customer interaction with the products can also be used. For example, reviews, query logs, Q&A, etc. However, meaningful data within these sources can be sparse and noisy in structure. For example, diversity is introduced to the data sources by different edition styles from the retailers and sellers. Even within the same marketplace, the format and information in product specifications can be different from one seller to another. This can result in a variety of ways product attributes are presented and introduced throughout a product catalogue. Which means even products from the same or similar categories could be described using different patterns and layouts. In addition to this, attributes may be represented by different names depending on the seller or marketplace providing the textual content. For example, “material” and “fabric” refer to the same attribute in clothing related products’ categories.

Another difficulty in producing PKGs is that the ecommerce domain is very complex and deals with millions of products classified into thousands of types. All of which have different sets of attributes, for example, smartphones versus dresses. These attributes can easily expand or diminish over time. For example, a foldable smartphone didn’t exist few years ago. Moreover, a product’s attributes are not all equally important in the decision making related to a product. For example, the “hair type” attribute is more important for shampoo than the attribute “color”. Though in some cases the distinction may be subtle in closely related products. For example, the “color” attribute is more important for hair dye than the attribute “hair type”. Thus, semi to fully automatic solutions are highly sought after in this field.

Figure 1 shows a graphic representation of an example knowledge graph 100. In general, there are two aspects of the knowledge graph: class level 118 which represents the taxonomy and the object level 120 which represents the data graph or the data instances of the classes in the class level. Specifically, the products which are sofas can be represented by the product category class 102. For instance, the product (i.e. a sofa) 122 is linked to the product category 102 using the relation 110 (i.e. “IsInstanceOf’). This is to associate products with their respective categories.

A product generally has many properties that characterize it. These properties are represented by product attribute classes 104 in the taxonomy and by product attribute value classes 106 in the data graph. Consequently, a product category can be associated with many product attribute classes 104. This relationship is represented by the relation 112, e.g. “Has Attribute”. For example, the product category 102 “Sofa” has product attribute 104 “Color”.

A product attribute value 106 represents a possible value of its corresponding product attribute 104. This relationship is represented by the relation 114, e.g. “IsValueOf’. For example, the product attribute value 106 “Beige” is a value of product attribute 104 “Color”.

In addition, the product 122 is linked to attributes values that characterize it from other products belonging to the same product category 102. This is ensured by linking the product 122 to product attribute values 106 using the relation 116, e.g. “Has Attribute Value”. For example, the product 122 has product attribute value 106 “Beige”, for the product attribute 104 “Color”.

The approach proposed herein addresses the problem, within the task of building a PKG, of how to automatically decide if an attribute is applicable to a product category, and if is, how important is it?

There are two main challenges addressed by the proposed approach.

The first challenge is attribute synonym identification in the ecommerce domain. In the general domain KGs synonym identification or detection can be classified as an entity resolution problem. It is necessary to identify which entities represent a unique entity regardless of how a content creator refers to it. This can be seen as reducing the gap between a human’s common knowledge and the extracted knowledge in the KG. That is, something which might seem obvious to a person can require a significant leap from a computer program only presented with the data in a KG. This is particularly applicable to the ecommerce domain, where content creators are human beings coming from different cultural and linguistic backgrounds and the same entity may be represented in very different words and formats. Moreover, spelling and grammar mistakes can happen easily, and exact string matching is then not an efficient way to determine synonyms. Consequently, the widely heterogenous contents characterizing products is one of the major challenges of building PKGs, which applies particularly to product attributes as fundamental entities of the PKG.

Although entity resolution is a well-researched topic in open-domain KG, there is very little existing research directed to addressing product attribute identification. Especially research that considers how attributes, as entities in the PKG taxonomies, can be characterized by their respective value sets. That is, for the same product category, attributes sharing high similarity between their value sets are most likely to represent the same attribute, even if the semantic similarity of their labels alone is not high enough to come to the same conclusion.

The second challenge is identification of key attributes for product categories with the minimal gathering of data. Most intuitive solutions for deciding the applicability and the importance of an attribute to a product category require access to diverse types of textual data in sufficiently large amounts. Such an approach is not reliable as it is generally difficult and expensive to collect and process required data. Recently, development of large pre-trained language models has been demonstrated, for example, BERT, GPT3, etc., which have achieved at least state of the art performance in completely different tasks. These models have already been trained on large natural language corpuses and can be refined for specific tasks by fine-tuning them on specific smaller datasets. However, the “knowledge” in these pre-trained models has not been leveraged to find quantities such as the relevance of attributes to product categories.

Thus, it can be seen that one important step in identifying attributes in textual data is to consider these varying types of reference which refer to the same semantic attribute. For example, “size” and “dimensions” may refer to the same attribute. This can be solved by discovering attribute synonyms and considering them before applying any identification method. None of prior work mentioned above addresses the problem of attribute synonym identification, especially in the ecommerce domain. Although AutoKnow uses a synonym discovery method, it is only used to find synonyms of attribute values, not to find synonymous attributes themselves. That is, attribute labels which are in fact referring to the same product feature or quality.

The AutoKnow method also does not allow for the addition of contextualized information about the attributes. In practice, two attributes may share the same label but have different sets of values depending on the product category to which they apply. For example, the attribute “size” in adult clothing-related categories can have values like { ..., S, M, L, XL, .} or {... 8, 10, 12, 14, ...} or {..., 36, 38, 40, 42, ...}, etc. Whereas “size” for the category of smartphones might have values like {..., <4.5 inches, [4.5 to 6.5 inches], >6.5 inches, ...}. Thus, it can be seen that the meaning of an attribute may be better understood or even change when provided with knowledge of the relevant value set, and this information should therefore be considered in such attribute synonym identification.

Although frequency-based solutions represent a satisfactory starting point from which to address the key attribute identification challenge, they generally require the availability of large and diverse textual data sources in order to compute meaningful measures about the frequency of occurrence of attributes in product category related data. It cannot always be guaranteed that such sources will be available as such data is expensive to gather, clean, and process, especially if the data changes regularly.

Leveraging learnt information from pre-trained natural language models can help reduce the dependency on collecting large amounts of diverse training data. Existing methods using vectorization to make use of prior knowledge have only applied word-level embedding, and only to the attribute values themselves. This method does not require or make use of any information related to the product category itself, as in this case it is more convenient to identify the importance of an attribute for a product category as an abstraction of product type. However, this existing method only identifies the importance of an attribute for a customer inquiry.

There is provided herein a method of identifying synonyms among attributes of a product for use in building a product knowledge graph. The method comprises determining a similarity score between a first attribute and a second attribute by determining a plurality of similarity qualities or scores. A first similarity quality of the label of the first attribute is determined relative to the label of the second attribute. A second similarity quality of the label of the first attribute is determined relative to the label of the second attribute. A value representing an aggregate of the first similarity quality and the second similarity quality of the labels is then calculated. The aggregate value is then multiplied by a third similarity quality of a set of values of the first attribute relative to a set of values of the second attribute.

There are two main aspects of the proposed approach which will be discussed in more detail below.

The first aspect is a synonym identification component which takes into account the attributes’ value sets. For each of two attributes which are being assessed, three similarity qualities or scores are considered. The first similarity considers the semantic similarity between the attribute labels and is obtained as a cosine similarity applied to their embedding. The embedding vectors are obtained from a pre-trained language model. The second similarity may be an editdistance similarity and is also applied to the attribute labels. The edit-distance considers the similarity in view of possible edit errors such as misspellings and typos etc. The third similarity may be an aggregated semantic similarity between the values of the value sets for each of the two attributes. That is, an aggregate of a semantic similarity between a set of values of the first attribute and a set of values of the second attribute.

The second aspect is a component for key attribute identification which based on fine-tuning a pre-trained language model. This aspect consists of fine-tuning two machine learning models. The first model is fine-tuned for applicability classification and the second model is fine-tuned for importance regression. After applying the synonym identification on available data, a training dataset is sampled and manually annotated for each of the two fine-tuning tasks. The input data for the fine-tuning consists of sentences or text fragments which are made up of the full path of a product category in a specific hierarchy (e.g. Huawei), and the label of the attribute. Additional contextual information can be added to product category. For example, a description, corpuses from related products, etc. Additional contextual information can also be added to the attribute. For example, a description, synonyms, a value set, etc.

Figure 2 shows the synonyms identification component 200 in more detail. Synonym identification allows the key attribute identification process (also referred herein as KATIE) to build a dictionary of attribute synonyms in an offline fashion. This is performed in order to build a more representative training dataset for fine-tuning the classifier and regressor steps of PKG generation. Specifically, it enables KATIE to skip processing redundant information that is carried by synonyms. In turn this has the advantage of making the inference of the KATIE classifier and regressor faster as the only a representative synonym is given to the key identification part, and the decision might be cashed in local inference system. That is, only an umbrella term representing all of the synonyms in a synonym group and their associated attribute values need to be considered.

It is assumed that for each product category, there is a set of collected attributes. The task to achieve is to identify which attributes are synonyms to each other within this set of attributes. Each attribute is characterized by its label, and a set of values.

For each pair of attributes, it is determined if they are synonyms or not if their similarity score is higher than a certain threshold. Two such attributes, attribute A_a, with label L_a and value set V_a, and attribute A_b, with label L_b and value set V_b, are shown in figure 2 as inputs. The similarity score is a combined score of three different similarity qualities calculated from the labels and value sets of the two attributes.

The first similarity quality may be a semantic similarity of the label of the first attribute relative to the label of the second attribute. As mentioned above, the semantic similarity, Sim₁ , represents the semantic similarity between the labels of the attributes. It may be determined by computing the cosine similarity on the vectors of the labels. These vectors are the embedding vectors inferred from a Pre-trained Language model. One example of such a pre-trained language model is called “all-MiniLM-L6-v2”. The model “all-MiniLM-L6-v2” is a sentencetransformer model that maps sentences and paragraphs to a 384 dimensional dense vector.

Figure 2 shows attribute labels L_a and L_b converted to vectors e_La and e_Lb and input into the above cosine similarity equation Sim 202.

Thus, there is provided a first similarity quality is obtained as a cosine similarity applied to an embedding vector of the labels obtained from a pre-trained language model, where the embedding vectors are inferred from a pre-trained language model which maps text fragments and paragraphs to a multi-dimensional dense vector.

The second similarity quality may be an edit-distance similarity of the label of the first attribute relative to the label of the second attribute. The edit-distance similarity, Sim₂, represents the string similarity between the labels of the attributes. A particularly useful example of an editdistance similarity calculation is the Damerau-Levenshtein distance as it considers the transposition of two adjacent characters in addition to insertion, deletion, and substitution. That is, the edit-distance similarity may consider the erroneous transposition of two adjacent characters, insertion of characters, deletion of characters, and substitution of characters. This is particularly useful because swapping characters is a common human mistake in editing and entering strings of text. The edit-distance similarity may be calculated according to the equation:

Sim₂ L_a, L_b) = 1 - DLD^L_a, L_b~) where L_a and L_b represent the label of attribute a and attribute b respectively, and DLD is an edit-distance similarity algorithm called the Damerau-Levenshtein Distance. Figure 2 shows attribute labels L_a and L_b being input into the Damerau-Levenshtein distance, DLD, of equation Sim₂ 204.

The value sets similarity, V_sim, represents the aggregated similarity of the semantic similarities of attribute’s value sets. The semantic similarity of attribute values may be a cosine similarity for pairs of values, one from each of the different attribute value sets, where a sentencetransformer model may be used to determine the embedding of each of the values in the sets. This is particularly important in the domain of e-commerce, as the values of an attribute are an integral part of its meaning. Again, a pre-trained language model, such as “all-MiniLM-L6-v2”, is used to get the embedding of each of the values in both value sets. Then the same cosine similarity as determined for the attribute labels is calculated for each of the pairs,

v_b) E V_a x V_b. These value pair similarities are then averaged by summing them and dividing the sum by the number of value pairs. For example, given two sets of values for the attributes L_a and L_b of length n and m respectively, the number of value pairs is n x m.

Figure 2 shows attribute value sets V_a and V_b being converted to vector sets

and e_vj and input into the above aggregated cosine similarity equation V_sim 206.

is the ‘i’th value of attribute a, v_b is the ‘j ’th value of attribute b, and n and m represent the numerical size of each value set.

In order to compute the final similarity of an attribute pair, first the two label’s similarities are combined by averaging their values. Finally, this averaged value is multiplied by the value sets similarity, V_sim 206. The attribute label (L_a, L_b ) based and attribute value (V_a, V_b ) based similarity values are deliberately multiplied and not averaged in order to give a strong contribution to the similarity information extracted from the attribute value sets. That is, instead of merely combining the similarities derived from the different types of attribute traits equally, the averaged similarity value based on the attribute labels is weighted by the similarity value derived from the attribute values.

Figure 2 shows how the separate similarity equations for comparing the two attribute labels and value sets are combined to provide a single similarity measure for the two attributes from the equation Sim(A_a,A_b) 208.

After the final similarity Sim(A_a,A_b is computed for each pair in the attributes set, synonym groups are determined by clustering attributes having similar values for Sim(A_a, A _b . First, a threshold is applied to identify pairs of synonyms. Then all synonyms of the same attribute are grouped by applying a transition law. For example, if ‘a’ is a synonym for ‘b’ and ‘b’ is a synonym for ‘c’ then ‘a’ is a synonym for ‘c’. That is, the proposed method comprises determining one or more synonyms among a plurality of attributes based on the value of the similarity score for each respective attribute pair and grouping synonymous attributes together.

Once all the similarity checked pairs are allocated to their synonym groups, the one-to-one similarity of each attribute in the synonym group is calculated. The attribute with the highest similarity to the remaining of the group members is selected as the group main synonym or representative attribute for that group. Thus, there is defined a single representative attribute for each group of synonymous attributes. This representative attribute will be considered in following steps in KATIE like training data sampling and inference.

Figures 3a and 3b show the key attribute identification 300 component in more detail, comprising attribute applicability classifier 302 and importance prediction 304 respectively.

The proposed approach allows for the prediction of the applicability and the importance of an attribute to a category based on two fine-tuned Bidirectional Encoder Representations from Transformers (BERT) models.

The first BERT model 302, which may be a Distilled BERT model (DistilBERT), is fine-tuned for a classification task to provide an applicability classifier. The second BERT model 304 is fine-tuned on a regression task and provides an importance regressor. The importance regressor component and applicability classifier take advantage of the paradigm shift introduced in natural language processing (NLP) thanks to pre-trained language models. That is, the proposed approach comprises fine-tuning a PLM. PLMs are generally trained in an unsupervised way on a massive amount of text data. In the pre-training, PLMs learn the general linguistic patterns. Once pre-trained, PLMs can be fine-tuned for a specific task with relatively few labels or additional data, otherwise known as supervised training.

KATIE takes advantage of this paradigm, by considering a PLM, for example BERT or DistilBERT, and fine-tuning it separately for two different tasks: classification for the applicability prediction and regression for the importance prediction. The fine-tuning is performed by stacking a feed forward neural network for each of the tasks after PLM weights are loaded for the inputs. However, the architectures of the neural networks are different for each of the tasks. Figure 3a) shows the architecture for the applicability prediction, where inputs 306, 308 are provided to the BERT model 302 and a feed forward neural network 310 to provide a classification of 0 or 1.

Figure 3b) shows the architecture for the importance regressor, where inputs 312, 314 are provided to the BERT model 304 and a feed forward neural network 316 to provide an importance score between 0 and 1.

For both fine-tuned models, the inputs are two sentences or text fragments 306, 308, 312, 314, each of which represent the full path of the product category in the product hierarchy 306, 312 (for example, from the Huawei hierarchy) and the attribute label 308, 314. Additional contextual information can be added to either or both of the inputs. For example, descriptions, synonyms values sets, etc, to make the tuning even more refined.

The annotation applied for the fine-tuned applicability classifier is one of two classes: “0” which means the attribute does not apply to the product category, or “1” which mean the attribute does apply to the product category. This is also the output of the fine-tuned model in the inference mode.

The annotation applied for the fine-tuned importance regressor is a real number between and including 0 and 1, where 0 means the attribute is not important at all and 1 means the attribute is very important. Any value of this annotation different from 0 and 1 represents the likelihood of which the attribute is important to the product category. In the inference mode, the model thus outputs an importance score of the same characteristics and meaning as those provided annotations.

Figure 4 shows the overall system architecture or scenario 400 to which the present approach is applicable. As discussed above, this idea is presented within an environment of PKG construction in which it is desirable to build a PKG from scratch. One of the main components needed in achieving this objective is accurately identifying and applying attributes to product categories. The proposed approach is directed towards key attributes identification, referred to herein as KATIE, for such PKG construction. Key attributes identification in the proposed approach comprises the two elements described above of identifying applicable attributes to product categories and also inferring the importance score of attributes to product categories 402. An important aspect of achieving this objective efficiently is the pre-processing step 404 of identifying synonyms for attributes of products which have been pulled from the raw data used to create the PKG. It is thus possible to create and maintain a synonym dictionary 406 from which the processing of KATIE can be optimised. Thus, given the product details shown of product category name 408 and product attribute 410, these can be checked against the synonym dictionary 406 in order to refine the selection of attributes for which the applicability and importance are assessed by KATIE. The resulting PKG is therefore able to be created in a more efficient way, which is important for web-scale PKG in e-commerce due to the typically large quantity of products needing to be considered.

KATIE may make use of products catalogues, but can also be trained on data crawled from products HTML pages (e.g. from an online retailer such as AliExpress or Amazon) in order to collect attributes and their values. If a list of attributes and their associated values was already available, this step of collecting data for training KATIE could be skipped.

In an example embodiment, the synonym identification component 404 can be implemented as an integrated module within KATIE as shown in figure 4, or as a separate module, and thus applicable to other applications and not only for KATIE. The module may be implemented as a function that regularly feeds a synonym dictionary, which may then be used by KATIE and other applications in the PKG.

Such a synonym identification module requires access to a set of attributes, each of which is characterised by its label and values set. The data should preferably be pre-processed in a way which eliminates any ill-formatted, ambiguous and incomplete text strings, etc. Nonmeaningful values may also be discarded. For example, "no", "yes", "none", "other", etc. as they do not meaningfully characterise the attribute itself. Attributes with empty values sets should also be discarded.

The synonyms identification is run per product category, so the set of attributes is associated with this particular product category. Ideally, the set of attributes should be extracted from semistructured to structured data, e.g. from products webpages such as AliExpress, Amazon, eBay, etc., in order to ensure reliable association. As mentioned above, the synonym identification function may use a pre-trained language model such as “all-MiniLM-L6-v2” to compute the required semantic similarities and edit-distance similarity equations such as the Damerau- Levenshtein distance to compute the associated similarity. There is proposed herein a method of training a machine learning model for identifying key attributes of a product category from a query. The machine learning model comprises two pretrained language models. The method of training comprises inputting training data to each pretrained language model comprising two text fragments. The first text fragment comprises a full path of a product category in a product hierarchy. The second text fragment comprises an attribute label. The training data is refined according to the synonym identification method described herein. As described above, the first pre-trained language model is fine-tuned for use in a classification task using said training data, where the classification task is classifying the applicability of the attribute to the product category to determine an applicability class. The second pre-trained language model is fine-tuned for use in a regression task using said training data, where the regression task is to determine the importance of the attribute to the product category.

In an example embodiment the KATIE prediction component may be implemented as a neural network trained on PLM weights obtained from text string inputs. In this example data that has been crawled from an online retailer which has been pre-processed as described above for synonyms identification. Investing human effort in annotating data for the applicability classifier can be justified as it has binary values. However, the importance regressor requires a real number between 0 and 1, which should reflect a probabilistic measure, and which should be collected from at least three different annotators to have a realistic annotation. For this reason, automatic annotation based on frequency counts can be used.

After applying the synonym identification to the training data, the training data may be sampled and manually annotating for each of the fine-tuning tasks. The proposed method comprises manually annotating the training data with additional contextual information. For a product category the additional information may include any combination of a description, corpuses from related products, synonyms, characterising bags of words, and summarized reviews from related products. For an attribute the additional information may include any combination of a description, synonyms, the value set, the full path in an attribute’s hierarchy, corpuses from related products, characterising bags of words, and summarized reviews from related products.

The training data may consist of pairs of product categories and attributes, which have been annotated either manually or automatically. The training dataset may be sampled based on two sampling principles, leading to two samples that may be called DI and D2. For example, DI may consist of top-k most popular attributes for all available categories and randomly selected categories from categories space. D2 may be made of manually selected categories, such as those categories which have large a number of attributes, and a set of randomly selected attributes. The applicability classifier dataset may therefore be obtained by merging all pairs from DI and D2. The importance regressor dataset may only be made of pairs of DI. It is worth highlighting that the full path of the categories in the hierarchy may be used as an input.

As described above, a BERT Model (or a DistilBERT model such as “distilbert-base-uncased”), may be fine-tuned using a feed forward neural network. The applicability classifier and importance regressor may have different neural networks with different architectures. It is also possible to fine-tune the hyper-parameters for optimal performance. The fine-tuning may consist of inputting each pair as a new sequence of two sentences or text fragments. For example, these may be formatted as: [CLS]<category_full_path>[SEP]<attribute_label> (As seen in the inputs of Figure 4). If additional contextual information is available for any of the inputted sentences, it may be appended to this and separated by a space. For example, if an attribute has synonyms, the format may become:

[CLS]<category_full_path>[SEP]<attribute_label> <synonym_l> <synonym_2>. . .

The proposed approach combines both semantic and edit distance (DPD) similarities of attributes labels and additionally combines these aggregated similarities with the aggregated similarities from respective values sets. This has the advantage of generating a training dataset that does not include redundant pairs due to synonymous attributes. This also results in the faster execution of KATIE as predictions can be pre-computed on a representative synonym and cashed locally.

KATIE provides an automatically contextualised applicability classifier and importance regressor which take advantage of learnt linguistic patterns from PLM and refine them for ecommerce specific tasks. There is the ability to provide domain specific annotations and more flexible training by adding more contextual information to the inputs. This has the advantage of providing better performance for key attribute identification with less dependency on huge data collection and processing, Further, the proposed approach provides a classification and importance measure more easily generalisable to other domains by replacing input data with that of a new domain.

The synonym identification module described above may be part of a data processing system comprising one or more processors, one or more of said processors being configured to identify synonyms among attributes of a product category for use in building a product knowledge graph by determining a similarity score between a first attribute and a second attribute according to the approach proposed herein. The data processing system comprising synonym identification may be configured to identify key attributes of a product category for use in building a product knowledge graph by determining an applicability class and an importance score of each attribute to a product category according to the method proposed herein.

It should be understood that a tool that identifies the applicable and key attributes for a product category is not only beneficial for PKG construction, which is used for product-related information retrieval (search, recommendation, etc.), but it can also be used separately.

For example, in product search filters by allowing shoppers to filter search results based on the most important attributes of the product type they are searching for. This is becoming more crucial as shoppers generally deal with tremendous amounts of information. It can be seen that they would prefer accessing relevant classifications depending on the searched product type.

Another example is in a conversational shopping assistant. In the conversational assistant scenario, users tend to interact the minimum amount possible and want to have the fewest back and forth exchanges with the “bof ’ in order to achieve the purpose, e.g. searching for a product or a recommendation of a product with specific sets of attributes. In order to minimize this discussion and get to the point quickly, a conversational assistant could make use of the key attribute identification tool to assess the most relevant attributes for the product type in question before making the requested product suggestions.

Another example is in product substitute identification. A product substitute can be used to replace one product with another interchangeable product. Identifying product substitutes is important, especially in ecommerce recommendation services, and also in product searches where the searched product is not available anymore (e.g., out of stock, discontinued, etc.). In this case, identifying the key attributes that characterise a product is essential in order to pinpoint its applicable substitutes.

Another example is for ecommerce concepts (ECC) generation. Ecommerce concepts are short phrases or statements that depict a specific shopping scenario or interest. For example, “baby shower”, or “outdoor barbecue”, or “Eid party”. Ecommerce concepts allow a product search engine to efficiently answer shoppers’ queries. In order to obtain relevant ECC candidates for ECC generation, key attributes of product types need to be identified. The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims

1. A method of identifying synonyms among attributes of a product for use in building a product knowledge graph (100), the method comprises determining a similarity score (208) between a first attribute (Aa) and a second attribute (Ab) by: determining a first similarity quality (202) of the label of the first attribute (La) relative to the label of the second attribute (Lb); determining a second similarity quality (204) of the label of the first attribute relative to the label of the second attribute; calculating a value representing an aggregate of the first similarity quality and the second similarity quality of the labels; and multiplying the aggregate value by a third similarity quality (206) of a set of values (Va) of the first attribute relative to a set of values (Vb) of the second attribute.

2. The method according to claim 1, wherein the third similarity quality is an aggregate of a semantic similarity between a set of values of the first attribute and a set of values of the second attribute.

3. The method according to claim 2, wherein the method comprises calculating the semantic similarity of attribute values as a cosine similarity for pairs of values, one from each of the different attribute value sets, where a sentence-transformer model is used to determine the embedding (eva, evb) of each of the values in the sets.

4. The method according to claim 2 or 3, wherein the method comprises aggregating the semantic similarities of the attribute value set pairs by summing them and dividing by the number of pairs according to the equation:

where va is the ‘i’th value of attribute a, vb is the ‘j ’th value of attribute b, and n and m represent the numerical size of each value set.

5. The method according to any preceding claim, wherein the first similarity quality is a semantic similarity of the label of the first attribute relative to the label of the second attribute.

6. The method according to any preceding claim, wherein the first similarity quality is obtained as cosine similarity applied to an embedding vector of the labels (eLa, eLb) obtained from a pre-trained language model, where the embedding vectors are inferred from a pre-trained language model which maps text fragments and paragraphs to a multi-dimensional dense vector, according to the equation:

where La and Lb represent the label of attribute a and attribute b respectively.

7. The method according to any preceding claim, wherein the second similarity quality is an edit-distance similarity of the label of the first attribute relative to the label of the second attribute.

8. The method of claim 7, wherein the edit-distance similarity considers the erroneous transposition of two adjacent characters, insertion of characters, deletion of characters, and substitution of characters.

9. The method according to claim 7 or 8, where the edit-distance similarity is calculated according to the equation:

Sim₂ L_a, L_b) = 1 - DLD L_a,L_b , where La and Lb represent the label of attribute a and attribute b respectively, and DLD is an edit-distance similarity algorithm called the Damerau-Levenshtein Distance.

10. The method according to any preceding claim, wherein the method comprises: determining one or more synonyms among a plurality of attributes based on the value of the similarity score for each respective attribute pair; grouping synonymous attributes together; and defining a single representative attribute for each group of synonymous attributes.

11. A method of training a machine learning model for identifying key attributes of a product category from a query, the machine learning model comprising two pre-trained language models (302, 304), the method comprises: inputting training data (306, 308, 312, 314) to each pre-trained language model comprising two text fragments, a first text fragment (306, 312) comprising a full path of a product category in a product hierarchy and a second text fragment (308, 314) comprising an attribute label, where the training data has been refined according to the synonym identification method of claims 1 to 10; fine-tuning a first pre-trained language model (302) for use in a classification task using said training data, where the classification task is classifying the applicability of the attribute to the product category to determine an applicability class; and fine-tuning a second pre-trained language model (304) for use in a regression task using said training data, where the regression task is to determine the importance of the attribute to the product category.

12. The method according to claim 11, wherein the method comprises, after applying the synonym identification to the training data, sampling the training data and manually annotating the data for each of the fine-tuning tasks.

13. The method according to claim 12, wherein the method comprises manually annotating the training data with additional contextual information, where for product category the additional information may include any combination of a description, corpuses from related products, synonyms, characterising bags of words, and summarized reviews from related products and for attribute the additional information may include any combination of a description, synonyms, the value set, the full path in an attributes hierarchy, corpuses from related products, characterising bags of words, and summarized reviews from related products.

14. The method according to any of claims 10 to 13, wherein the output of the classification task is an applicability class of T or ‘O’, where T represents that the attribute does apply to the product category and ‘0’ represents that the attribute does not apply to the product category.

15. The method according to any of claims 10 to 14, wherein the output of the regression task is an importance score of any value between 0 and 1, where T indicates that the attribute is very important to that product category and ‘0’ indicates that the attribute is not important to the product category.

16. The method according to any of claims 10 to 15, wherein the pre-trained language model is a BERT or DistilBERT model.

17. A data processing system 400 comprising one or more processors configured to identify synonyms (404) among attributes (410) of a product category (408) for use in building a product knowledge graph, by determining a similarity score between a first attribute (Aa) and a second attribute (Ab) according to the method of claims 1 to 10.

18. The data processing system according to claim 17, wherein the system is configured to identify key attributes (402) of a product category for use in building a product knowledge graph, by determining an applicability class and an importance score of each attribute to a product category according to the method of claims 11 to 16.