WO2019232645A1 - Unsupervised classification of documents using a labeled data set of other documents - Google Patents
Unsupervised classification of documents using a labeled data set of other documents Download PDFInfo
- Publication number
- WO2019232645A1 WO2019232645A1 PCT/CA2019/050806 CA2019050806W WO2019232645A1 WO 2019232645 A1 WO2019232645 A1 WO 2019232645A1 CA 2019050806 W CA2019050806 W CA 2019050806W WO 2019232645 A1 WO2019232645 A1 WO 2019232645A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- grouping
- subject document
- documents
- matching
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/55—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
Definitions
- the present invention relates to the classification of documents. More specifically, the present invention relates to systems and methods for unsupervised classification of documents using other previously classified documents.
- unsupervised classification referring to document classification that is not overseen by a human.
- Such techniques are typically computationally expensive and time-intensive.
- Current state-of-the-art techniques for unsupervised classification require access to the entire dataset at all times, meaning that enormous datasets may need to be searched for each operation.
- the present invention provides systems and methods for associating an unknown subject document with other documents based on known features of the other documents.
- the subject document is passed through a feature extrachon module, which represents the features of the subject document as a numeric vector having n dimensions.
- a matching module receives that vector and reference data.
- the reference data is pre-divided into n groupings, with each grouping corresponding to at least one specific feature.
- the matching module compares the features of the subject document to features of the reference data and determines a matching grouping for the subject document.
- the subject document is then associated with that matching grouping.
- the present invention provides a method for determining other documents to be associated with a subject document, the method comprising: (a) passing said subject document through a feature extraction module to thereby produce a numeric vector representation of features of said subject document, said vector representation having n dimensions;
- said n-dimensional space contains a plurality of reference points, wherein each of said other documents corresponds to a single one of said plurality of reference points, and wherein said plurality of reference points is divided into a plurality of groupings, each grouping corresponding to at least one specific feature of said other documents;
- the present invention provides a system for determining other documents to be associated with a subject document, the system comprising: a feature extraction module for producing a numeric vector representation of features of said subject document;
- reference data comprising numeric vectors, wherein each of said other documents corresponds to a single one of said numeric vectors, and wherein said reference data is grouped into a plurality of groupings, each grouping corresponding to at least one specific feature of said other documents; a matching module for determining a matching grouping from said plurality of groupings for said subject document, said matching grouping being determined based on at least one predetermined criterion,
- the present invention provides non-transitory computer-readable media having stored thereon computer-readable and computer-executable instructions that, when executed, implements a method for determining other documents to be associated with a subject document, the method comprising:
- said n-dimensional space contains a plurality of reference points, wherein each of said other documents corresponds to a single one of said plurality of reference points, and wherein said plurality of reference points is divided into a plurality of groupings, each grouping corresponding to at least one specific feature of said other documents;
- Figure 1 is a schematic diagram detailing one aspect of the present invention
- Figure 2A is a representative image that may be used by the present invention, in some embodiments.
- Figure 2B is another representative image that may be used by the present invention, in some embodiments.
- Figure 2C is another representative image that may be used by the present invention, in some embodiments; and Figure 3 is a flowchart detailing the steps in a method according to another aspect of the present invention.
- FIG. 1 a schematic diagram illustrating one aspect of the present invention is presented.
- a subject document 20 is passed through a feature extraction module 30.
- the feature extraction module 30 is associated with a matching module 50.
- Reference data 40 is also associated with the matching module 50.
- the feature extrachon module 30 extracts features of the subject document 20 and produces a numeric vector representation of the subject document 20 based on those features, with the vector having n dimensions. The feature extraction module 30 then passes that vector representation to a matching module 50.
- the matching module 50 also receives reference data 40, representing other
- the reference data 40 is previously divided into groupings, with each grouping corresponding to at least one specific feature of the documents within that grouping.
- the matching module 50 compares the vector representation of the features of the subject document 20 to features of the reference data 40 to determine a matching grouping for the subject document 20.
- the subject document 20 is then associated with that matching grouping.
- a neural network can be used as the feature extraction module
- Such a neural network would be trained to extract specific features that the user wishes to identify. Note therefore the features to be extracted may vary depending on context. For instance, a large set of news articles broadly grouped as“sports news” may be classified using keywords such as“hockey”,“basketball”, and “soccer”. As another example, where the present invention is used to classify images, important features and themes may include“face” or“tree”.
- the identified features may be thought of as“keywords”, “subjects”,“topics”,“themes”,“subthemes”,“aspects”, or any equivalent term suitable for the context. References herein to any of such terms should be taken to include all such terms.
- the subject document 20 can be any kind of document with any number of dimensions.
- one-dimensional “documents” may include text, time series data, and/or sounds
- two-dimensional documents may comprise natural images, spectrograms, satellite images.
- Subject documents in three-dimensions may include videos and/or medical imaging volumes
- four-dimensional subject documents may include videos of medical imaging volumes, as well as video game data.
- the neural network or other feature extraction module 30, outputs a numeric vector representation of the subject document.
- Each coordinate within that numeric vector representation corresponds to one of the possible features.
- each coordinate can be a numeric value indicating the probability that the subject document has that specific feature.
- the coordinates may be bounded between, for instance, 0 and 1, or 0 and 100. In other implementations, however, each coordinate may reflect a non-probabilistic correspondence to its associated feature.
- a subject article discussing a hockey game might be represented as a numeric vector such as [0.8, 0.1, 0.1], in a three- dimensional space defined by the coordinate system [hockey, basketball, soccer ]. This vector suggests that there is an 80% chance that the article relates to hockey and only a 10% chance that the article relates to either basketball or soccer.
- outlier documents in the data set are sent to a human
- outlier documents are documents that do not match well with any known features.
- the system will provide an outlier document and its closest feature matches to the human reviewer, who can select a better feature match if necessary.
- the results of this human review can be fed back to the system and incorporated into later classifications.
- the feature extrachon module 30 can be initially trained on a similar or higher-level classification problem than the classification problem to be solved.
- the reference data points 40 are already populated and grouped when they are passed to the feature extraction module 30.
- the numeric vector is then passed to a matching module 50.
- the matching module
- the matching module 50 compares the vector to a pre-existing set of reference data 40.
- the reference data points are based on reference documents with known features.
- the reference data points 40, received by the matching module 50 are divided into a plurality of groupings, such that each grouping corresponds to at least one specific feature. In the“sports articles” example above, reference data would be divided into three or more groupings, including“hockey”,“basketball”, and“soccer”.
- “approximate nearest neighbour algorithms” can be used to divide the reference data into groupings.
- Various approximate nearest neighbour algorithms are well-known in computer science and data analysis (see, for instance, Indyk & Motwani,“Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality”, Proceedings of the thirtieth annual ACM symposium on Theory of Computing (1998), the complete text of which is hereby incorporated by reference).
- the matching module 50 may consider multiple factors when determining a
- the matching module 50 can consider the distance between the subject document’s numeric vector representation and the centroids of reference groupings.
- the matching module 50 may also consider factors beyond the distance to grouping centroids. Such other factors include, for example, publication dates or date ranges: more recent news articles dealing with sports or politics may be grouped separately from older articles on the same topics. Other factors (such as, for instance, author, region, publication, etc.) may also be used by the matching module 50 in determining a matching grouping for the subject grouping, according to the context. Any variable that is continuously available in the data set (i.e., separately available for each document in the data set) can be used to modify or weight the results of the feature extraction module.
- the present invention preferably uses a reference data set containing around 1,000 reference data points.
- the present invention results in significant time-savings compared to manual and/or typical text-analysis document classification methods, and additionally provides greater classification accuracy than the prior methods.
- each“theme”,“subtheme”, and“region” was a separate extracted feature. It is predicted that the present invention could classify a set of 10,000,000 subject documents within 100 milliseconds.
- a further advantage of the present invention over prior art methods arises when the present invention is used to classify images.
- Artificially intelligent image classification is typically performed in a pixel-space. That is, typical machine classifiers for images produce a numeric vector representation wherein the vector coordinates correspond to pixels, or pixel regions, of each image.
- pixel-space representations can mis associate images that are substantively similar but positionally distinct.
- images can be generalized to any kind of multi-dimensional input data.
- Figure 2A shows a stylized face in the top left of the image, and nothing in the bottom right.
- Figure 2B shows a circle in the same location as Figure 2A’s stylized face, but shows a pair of triangles within the circle, rather than that face.
- Figure 2C lastly, has the same stylized face as Figure 2A, but here the face is shown in the bottom right of the image, and the top left of the image is empty. Because typical pixel-space classifiers merely consider positional information, a typical classifier would conclude that Figures 2A and 2B are more similar to each other than are Figures 2A and 2C, notwithstanding the visibly distinct subject matter.
- the present invention compares images based on substance rather than pixel density.
- the vector representing Figure 2A would then have comparatively high values in all three coordinates, as Figure 2A has a circle, (stylized) eyes and a (stylized) mouth.
- Figure 2C the new subject document in this example, would then be passed through the feature extrachon module.
- the vector representation of Figure 2C like that of Figure 2A, would have comparatively high values for all three features.
- the matching module receives that vector and the reference vectors, the matching module would determine that the vector representation of Figure 2C is more similar to that of Figure 2A than to Figure 2b, and would thus associate Figure 2C with Figure 2A.
- Both images containing a face would be grouped together in the“face” grouping, and only Figure 2B would remain in the“not face” grouping.
- Other applications of the present invention include reverse searches.
- the system may return a high-level grouping or a more granular grouping, or even, in some implementations, a specific document.
- FIG. 3 is a flowchart detailing the steps in a method according to one aspect of the invention.
- the features of a subject document are extracted by a feature extraction module, resulting in a numeric vector representation of the subject document. That numeric vector representation and the grouped reference data 40 is passed to the matching module.
- the matching module determines the matching grouping for the subject document, and at step 330, the subject document is associated with that matching grouping. As discussed above, the matching module typically determines the matching grouping based on a distance between the new vector representation and a centroid of each grouping. This matching process is performed for every new subject document.
- the present invention can automatically classify large groups of subject documents without human intervention.
- the present invention can be seen as the use of a neural encoder with a proxy task related to the task the one seeks to complete.
- the result is a fast unsupervised classification technique that takes into account the entire past and which uses post processing methods to refine the results.
- one aspect of the invention uses an already existing fast nearest-neighbours retrieval technique to perform classification. The results are then refined by weighting the contribution of each neighbour using non-parametric methods. As one example, the contribution of neighbours is weighted with respect to the recency of the document. In one variant, a neural network may be used to output a weight for the examples. [0041] In another aspect, the present invention uses supervised training to learn
- a closely related problem which is higher-level than the problem sought to be solved, is used to build the neural encoder that will yield a suitable feature space.
- any references herein to 'image' or to 'images' refer to a digital image or to digital images, comprising pixels or picture cells.
- any references to an 'audio file' or to 'audio files' refer to digital audio files, unless otherwise specified.
- 'Video', 'video files', 'data objects', 'data files' and all other such terms should be taken to mean digital files and/or data objects, unless otherwise specified.
- the present invention may thus take the form of computer executable instructions that, when executed, implements various software modules with predefined functions.
- the embodiments of the invention may be executed by a computer processor or similar device programmed in the manner of method steps, or may be executed by an electronic system which is provided with means for executing these steps.
- an electronic memory means such as computer diskettes, CD-ROMs, Random Access Memory (RAM), Read Only Memory (ROM) or similar computer software storage media known in the art, may be programmed to execute such method steps.
- electronic signals representing these method steps may also be transmitted via a communication network.
- Embodiments of the in vend on may be implemented in any conventional computer programming language.
- preferred embodiments may be implemented in a procedural programming language (e.g., "C") or an object-oriented language (e.g., "C++",“java”,“PHP”,“PYTHON” or“C#”).
- object-oriented language e.g., "C++",“java”,“PHP”,“PYTHON” or“C#”.
- Alternative embodiments of the invention may be implemented as pre-programmed hardware elements, other related components, or as a combination of hardware and software components.
- Embodiments can be implemented as a computer program product for use with a computer system.
- Such implementations may include a series of computer instructions fixed either on a tangible medium, such as a computer readable medium (e.g., a diskette, CD-ROM, ROM, or fixed disk) or transmittable to a computer system, via a modem or other interface device, such as a communications adapter connected to a network over a medium.
- the medium may be either a tangible medium (e.g., optical or electrical communications lines) or a medium implemented with wireless techniques (e.g., microwave, infrared or other transmission techniques).
- the series of computer instructions embodies all or part of the functionality previously described herein. Those skilled in the art should appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems.
- Such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies.
- a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink-wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server over a network (e.g., the Internet or World Wide Web).
- some embodiments of the invention may be implemented as a combination of both software (e.g., a computer program product) and hardware. Still other embodiments of the invention may be implemented as entirely hardware, or entirely software (e.g., a computer program product).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Systems and methods for associating an unknown subject document with other documents based on known features of the other documents. The subject document is passed through a feature extraction module, which represents the features of the subject document as a numeric vector having n dimensions. A matching module receives that vector and reference data. The reference data is pre-divided into n groupings, with each grouping corresponding to at least one specific feature. The matching module compares the features of the subject document to features of the reference data and determines a matching grouping for the subject document. The subject document is then associated with that matching grouping.
Description
UNSUPERVISED CLASSIFICATION OF DOCUMENTS USING A LABELED DATA
SET OF OTHER DOCUMENTS
TECHNICAL FIELD
[0001] The present invention relates to the classification of documents. More specifically, the present invention relates to systems and methods for unsupervised classification of documents using other previously classified documents.
BACKGROUND
[0002] Accurate document classification is a long-standing problem in information science.
Traditional document classification has been performed by librarians and consensus: the Dewey Decimal system, for instance, is an example of a well-established classification scheme for library materials. However, manual classification requires a significant amount of human effort and time, which may be better spent on more critical tasks, particularly as technological advancements have increased the volume of documents that require classification.
[0003] An average human classifier, reading at 250 words per minute, would require 16,000 hours (roughly 1.8 years) simply to read a set of 400,000 news articles. Classifying those articles would require even more time, and becomes unwieldy when there are many possible features/topics/sub-topics. Moreover, the human-generated results may contain substantial inaccuracies, as these human classifiers would struggle to sustain their focus for the needed time, and to analyze each subject document for more than a few features at once.
[0004] As a result, many techniques for unsupervised classification have been developed
(“unsupervised classification” referring to document classification that is not overseen by a human). However, such techniques are typically computationally expensive and time-intensive. Current state-of-the-art techniques for unsupervised
classification require access to the entire dataset at all times, meaning that enormous datasets may need to be searched for each operation.
[0005] Further, conventional techniques for unsupervised classification are often based on text documents only, and do not necessarily generalize to non-textual input documents. Additionally, as many of these techniques revolve around word frequency and what can be called“next- word probability” (i.e., the likelihood that one word will follow another known word), they can miss important contextual factors.
[0006] There is a need for less costly and more scalable systems and methods for document classification. Preferably, these systems and methods would operate without supervision and, rather than extracting individual terms, extract higher-level features and topics from each document.
SUMMARY
[0007] The present invention provides systems and methods for associating an unknown subject document with other documents based on known features of the other documents. The subject document is passed through a feature extrachon module, which represents the features of the subject document as a numeric vector having n dimensions. A matching module receives that vector and reference data. The reference data is pre-divided into n groupings, with each grouping corresponding to at least one specific feature. The matching module compares the features of the subject document to features of the reference data and determines a matching grouping for the subject document. The subject document is then associated with that matching grouping.
[0008] In a first aspect, the present invention provides a method for determining other documents to be associated with a subject document, the method comprising:
(a) passing said subject document through a feature extraction module to thereby produce a numeric vector representation of features of said subject document, said vector representation having n dimensions;
(b) positioning a new point in an n-dimensional space based on said vector
representation, wherein said n-dimensional space contains a plurality of reference points, wherein each of said other documents corresponds to a single one of said plurality of reference points, and wherein said plurality of reference points is divided into a plurality of groupings, each grouping corresponding to at least one specific feature of said other documents;
(c) determining a matching grouping from said plurality of groupings for said subject document based on at least one predetermined criterion; and
(d) associating said subject document with said matching grouping.
[0009] In a second aspect, the present invention provides a system for determining other documents to be associated with a subject document, the system comprising: a feature extraction module for producing a numeric vector representation of features of said subject document;
- reference data, said reference data comprising numeric vectors, wherein each of said other documents corresponds to a single one of said numeric vectors, and wherein said reference data is grouped into a plurality of groupings, each grouping corresponding to at least one specific feature of said other documents; a matching module for determining a matching grouping from said plurality of groupings for said subject document, said matching grouping being determined based on at least one predetermined criterion,
wherein said system associates said subject document with said matching grouping.
[0010] In a third aspect, the present invention provides non-transitory computer-readable media having stored thereon computer-readable and computer-executable
instructions that, when executed, implements a method for determining other documents to be associated with a subject document, the method comprising:
(a) passing said subject document through a feature extraction module to thereby produce a numeric vector representation of features of said subject document, said vector representation having n dimensions;
(b) positioning a new point in an n-dimensional space based on said vector
representation, wherein said n-dimensional space contains a plurality of reference points, wherein each of said other documents corresponds to a single one of said plurality of reference points, and wherein said plurality of reference points is divided into a plurality of groupings, each grouping corresponding to at least one specific feature of said other documents;
(c) determining a matching grouping from said plurality of groupings for said subject document based on at least one predetermined criterion; and
(d) associating said subject document with said matching grouping.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] The present invention will now be described by reference to the following figures, in which identical reference numerals refer to identical elements and in which:
Figure 1 is a schematic diagram detailing one aspect of the present invention;
Figure 2A is a representative image that may be used by the present invention, in some embodiments;
Figure 2B is another representative image that may be used by the present invention, in some embodiments;
Figure 2C is another representative image that may be used by the present invention, in some embodiments; and
Figure 3 is a flowchart detailing the steps in a method according to another aspect of the present invention.
DETAILED DESCRIPTION
[0012] Referring to Figure 1, a schematic diagram illustrating one aspect of the present invention is presented. In the system 10, a subject document 20 is passed through a feature extraction module 30. The feature extraction module 30 is associated with a matching module 50. Reference data 40 is also associated with the matching module 50.
[0013] The feature extrachon module 30 extracts features of the subject document 20 and produces a numeric vector representation of the subject document 20 based on those features, with the vector having n dimensions. The feature extraction module 30 then passes that vector representation to a matching module 50.
[0014] The matching module 50 also receives reference data 40, representing other
previously classified documents with known features. The reference data 40 is previously divided into groupings, with each grouping corresponding to at least one specific feature of the documents within that grouping.
[0015] The matching module 50 then compares the vector representation of the features of the subject document 20 to features of the reference data 40 to determine a matching grouping for the subject document 20. The subject document 20 is then associated with that matching grouping.
[0016] In one embodiment, a neural network can be used as the feature extraction module
30. Such a neural network would be trained to extract specific features that the user wishes to identify. Note therefore the features to be extracted may vary depending on context. For instance, a large set of news articles broadly grouped as“sports news” may be classified using keywords such as“hockey”,“basketball”, and
“soccer”. As another example, where the present invention is used to classify images, important features and themes may include“face” or“tree”.
[0017] It should be noted that the identified features may be thought of as“keywords”, “subjects”,“topics”,“themes”,“subthemes”,“aspects”, or any equivalent term suitable for the context. References herein to any of such terms should be taken to include all such terms. Similarly, the subject document 20 can be any kind of document with any number of dimensions. For instance, one-dimensional “documents” may include text, time series data, and/or sounds, and two-dimensional documents may comprise natural images, spectrograms, satellite images. Subject documents in three-dimensions may include videos and/or medical imaging volumes, and four-dimensional subject documents may include videos of medical imaging volumes, as well as video game data. Thus, the terms“article” and“image”, as used in the examples herein, should not be construed as limiting the term“document”. It should be noted, however, that as the dimensions of the input documents increase, and/or as the size of the set of input documents increases, extracting appropriate high-level features for each document set may become more difficult.
[0018] Additionally, it should be evident that the feature lists described above are merely exemplary and that these are simplified for ease of explanation. The present invention is capable of handling far more than two or three broad features at one time. Current implementations of the present invention can deal with 512 features simultaneously and the present invention is in no way restricted by the current implementations. Any restrictions on the number of features (number of dimensions) should not be construed as limiting the scope of the invention.
[0019] The neural network, or other feature extraction module 30, outputs a numeric vector representation of the subject document. Each coordinate within that numeric vector representation corresponds to one of the possible features. In some implementations, each coordinate can be a numeric value indicating the probability that the subject document has that specific feature. In such an implementation, the coordinates may be bounded between, for instance, 0 and 1, or 0 and 100. In other implementations,
however, each coordinate may reflect a non-probabilistic correspondence to its associated feature.
[0020] To re-use the“sports articles” example mentioned above (again, noting that this is a simplification for exemplary purposes), a subject article discussing a hockey game might be represented as a numeric vector such as [0.8, 0.1, 0.1], in a three- dimensional space defined by the coordinate system [hockey, basketball, soccer ]. This vector suggests that there is an 80% chance that the article relates to hockey and only a 10% chance that the article relates to either basketball or soccer.
[0021] In some implementations, outlier documents in the data set are sent to a human
reviewer. Such outlier documents are documents that do not match well with any known features. The system will provide an outlier document and its closest feature matches to the human reviewer, who can select a better feature match if necessary. The results of this human review can be fed back to the system and incorporated into later classifications.
[0022] Note that a separate feature extraction module 30 is preferred for each classification problem, so that appropriate features may be determined in context. It would be impractical to attempt, for instance, to classify images of a forest using a feature extraction module previously trained to classify business-news articles.
[0023] It should additionally be noted that the feature extrachon module 30 can be initially trained on a similar or higher-level classification problem than the classification problem to be solved. Thus, the reference data points 40 are already populated and grouped when they are passed to the feature extraction module 30.
[0024] The numeric vector is then passed to a matching module 50. The matching module
50 compares the vector to a pre-existing set of reference data 40. The reference data points (numeric vectors each having n dimensions) are based on reference documents with known features.
[0025] The reference data points 40, received by the matching module 50, are divided into a plurality of groupings, such that each grouping corresponds to at least one specific feature. In the“sports articles” example above, reference data would be divided into three or more groupings, including“hockey”,“basketball”, and“soccer”.
[0026] In some implementations,“approximate nearest neighbour algorithms” can be used to divide the reference data into groupings. Various approximate nearest neighbour algorithms are well-known in computer science and data analysis (see, for instance, Indyk & Motwani,“Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality”, Proceedings of the thirtieth annual ACM symposium on Theory of Computing (1998), the complete text of which is hereby incorporated by reference).
[0027] It should be clear that none of the data points are moved or transformed during the grouping process. All of the grouping information is a layer of metadata that has no direct connection to or impact on the original data set. Other models for dividing data sets into groupings are known in the art, including hierarchical and distributional models.
[0028] The matching module 50 may consider multiple factors when determining a
matching grouping for a subject document, based on the grouped reference data. In particular, in some embodiments, the matching module 50 can consider the distance between the subject document’s numeric vector representation and the centroids of reference groupings.
[0029] The matching module 50 may also consider factors beyond the distance to grouping centroids. Such other factors include, for example, publication dates or date ranges: more recent news articles dealing with sports or politics may be grouped separately from older articles on the same topics. Other factors (such as, for instance, author, region, publication, etc.) may also be used by the matching module 50 in determining a matching grouping for the subject grouping, according to the context. Any variable that is continuously available in the data set (i.e., separately available
for each document in the data set) can be used to modify or weight the results of the feature extraction module.
[0030] For clarity, these other factors are present in the original data set and are not treated as features within the numeric vectors. Any modification of the results of the feature extraction module occurs post-feature extraction and is performed by the matching module 50.
[0031] To increase comparison accuracy and to reduce overfitting, the present invention preferably uses a reference data set containing around 1,000 reference data points. The present invention results in significant time-savings compared to manual and/or typical text-analysis document classification methods, and additionally provides greater classification accuracy than the prior methods.
[0032] In testing, one implementation of the present invention classified a text input set of
400,000 subject articles using 512 dimensions, classifying each document into: one theme from a possible 25 themes; multiple subthemes from a possible 200 subthemes; and multiple regions from a possible 24 regions. These results were achieved within 10 ms ± 3 ms. It should be clear that, in the testing implementation, each“theme”,“subtheme”, and“region” was a separate extracted feature. It is predicted that the present invention could classify a set of 10,000,000 subject documents within 100 milliseconds.
[0033] A further advantage of the present invention over prior art methods arises when the present invention is used to classify images. Artificially intelligent image classification is typically performed in a pixel-space. That is, typical machine classifiers for images produce a numeric vector representation wherein the vector coordinates correspond to pixels, or pixel regions, of each image. Although such an approach allows for accurate classification of images that are positionally similar, classification based on pixel-space representations can mis associate images that are substantively similar but positionally distinct. (Again, of course, the term“images” as used here can be generalized to any kind of multi-dimensional input data.)
[0034] As an example, consider Figures 2A, 2B, and 2C. Figure 2A shows a stylized face in the top left of the image, and nothing in the bottom right. Figure 2B shows a circle in the same location as Figure 2A’s stylized face, but shows a pair of triangles within the circle, rather than that face. Figure 2C, lastly, has the same stylized face as Figure 2A, but here the face is shown in the bottom right of the image, and the top left of the image is empty. Because typical pixel-space classifiers merely consider positional information, a typical classifier would conclude that Figures 2A and 2B are more similar to each other than are Figures 2A and 2C, notwithstanding the visibly distinct subject matter.
[0035] The present invention, on the other hand, compares images based on substance rather than pixel density. Examining Figures 2 A, 2B, and 2C, and supposing the classification problem to be“separate all images containing a face from all images not containing a face”, a feature extraction module may be trained to identify the features“eyes”,“mouth”, and“circle”. (Again, it should be evident that this is a simplification for exemplary purposes.) The vector representing Figure 2A would then have comparatively high values in all three coordinates, as Figure 2A has a circle, (stylized) eyes and a (stylized) mouth. The vector representing Figure 2B, on the other hand, would have comparatively low values in the“eyes” and“mouth” coordinates but a higher value in“circle”. Then, taking Figures 2A and 2B to be the reference data for this classification problem, groupings called“face” and“not face” could be defined: the“face” grouping containing the representation of Figure 2A, and the“not face” grouping containing the representation of Figure 2B.
[0036] Figure 2C, the new subject document in this example, would then be passed through the feature extrachon module. The vector representation of Figure 2C, like that of Figure 2A, would have comparatively high values for all three features. Thus, when the matching module receives that vector and the reference vectors, the matching module would determine that the vector representation of Figure 2C is more similar to that of Figure 2A than to Figure 2b, and would thus associate Figure 2C with Figure 2A. Both images containing a face would be grouped together in the“face” grouping, and only Figure 2B would remain in the“not face” grouping.
[0037] Other applications of the present invention include reverse searches. That is, for instance, if a user knows that a certain phrase is contained in a reference data set, but does not know precisely which document that phrase comes from, they can enter the phrase into the system. Depending on the granularity of the grouping model used and the number of features, the system may return a high-level grouping or a more granular grouping, or even, in some implementations, a specific document.
[0038] Figure 3 is a flowchart detailing the steps in a method according to one aspect of the invention. At step 310, the features of a subject document are extracted by a feature extraction module, resulting in a numeric vector representation of the subject document. That numeric vector representation and the grouped reference data 40 is passed to the matching module. At step 320, the matching module determines the matching grouping for the subject document, and at step 330, the subject document is associated with that matching grouping. As discussed above, the matching module typically determines the matching grouping based on a distance between the new vector representation and a centroid of each grouping. This matching process is performed for every new subject document. Thus, the present invention can automatically classify large groups of subject documents without human intervention.
[0039] In one aspect, the present invention can be seen as the use of a neural encoder with a proxy task related to the task the one seeks to complete. Thus, the result is a fast unsupervised classification technique that takes into account the entire past and which uses post processing methods to refine the results.
[0040] Since unsupervised classification is, most of the time, computationally intensive, one aspect of the invention uses an already existing fast nearest-neighbours retrieval technique to perform classification. The results are then refined by weighting the contribution of each neighbour using non-parametric methods. As one example, the contribution of neighbours is weighted with respect to the recency of the document. In one variant, a neural network may be used to output a weight for the examples.
[0041] In another aspect, the present invention uses supervised training to learn
a meaningful space as a proxy to a problem that one seeks to solve. A closely related problem, which is higher-level than the problem sought to be solved, is used to build the neural encoder that will yield a suitable feature space.
[0042] Additionally, it should be clear that, unless otherwise specified, any references herein to 'image' or to 'images' refer to a digital image or to digital images, comprising pixels or picture cells. Likewise, any references to an 'audio file' or to 'audio files' refer to digital audio files, unless otherwise specified. 'Video', 'video files', 'data objects', 'data files' and all other such terms should be taken to mean digital files and/or data objects, unless otherwise specified.
[0043] It should be clear that the various aspects of the present invention may be
implemented as software modules in an overall software system. As such, the present invention may thus take the form of computer executable instructions that, when executed, implements various software modules with predefined functions.
[0044] The embodiments of the invention may be executed by a computer processor or similar device programmed in the manner of method steps, or may be executed by an electronic system which is provided with means for executing these steps. Similarly, an electronic memory means such as computer diskettes, CD-ROMs, Random Access Memory (RAM), Read Only Memory (ROM) or similar computer software storage media known in the art, may be programmed to execute such method steps. As well, electronic signals representing these method steps may also be transmitted via a communication network.
[0045] Embodiments of the in vend on may be implemented in any conventional computer programming language. For example, preferred embodiments may be implemented in a procedural programming language (e.g., "C") or an object-oriented language (e.g., "C++",“java”,“PHP”,“PYTHON” or“C#”). Alternative embodiments of the invention may be implemented as pre-programmed hardware elements, other related components, or as a combination of hardware and software components.
[0046] Embodiments can be implemented as a computer program product for use with a computer system. Such implementations may include a series of computer instructions fixed either on a tangible medium, such as a computer readable medium (e.g., a diskette, CD-ROM, ROM, or fixed disk) or transmittable to a computer system, via a modem or other interface device, such as a communications adapter connected to a network over a medium. The medium may be either a tangible medium (e.g., optical or electrical communications lines) or a medium implemented with wireless techniques (e.g., microwave, infrared or other transmission techniques). The series of computer instructions embodies all or part of the functionality previously described herein. Those skilled in the art should appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Furthermore, such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies. It is expected that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink-wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server over a network (e.g., the Internet or World Wide Web). Of course, some embodiments of the invention may be implemented as a combination of both software (e.g., a computer program product) and hardware. Still other embodiments of the invention may be implemented as entirely hardware, or entirely software (e.g., a computer program product).
[0047] A person understanding this invention may now conceive of alternative structures and embodiments or variations of the above all of which are intended to fall within the scope of the invention as defined in the claims that follow.
Claims
1. A method for determining other documents to be associated with a subject document, the method comprising:
(a) passing said subject document through a feature extraction module to thereby produce a numeric vector representation of features of said subject document, said vector representation having n dimensions;
(b) positioning a new point in an n-dimensional space based on said vector representation, wherein said n-dimensional space contains a plurality of reference points, wherein each of said other documents corresponds to a single one of said plurality of reference points, and wherein said plurality of reference points is divided into a plurality of groupings, each grouping corresponding to at least one specific feature of said other documents;
(c) determining a matching grouping from said plurality of groupings for said subject document based on at least one predetermined criterion; and
(d) associating said subject document with said matching grouping.
2. The method of claim 1, wherein said feature extraction module is a trained neural network for extracting features.
3. The method of claim 1, wherein each grouping is based on a distance between each of said plurality of reference points within said each grouping and a centroid of each grouping.
4. The method of claim 1, wherein said at least one predetermined criterion includes a
maximum distance, such that a distance between said new point and a centroid of said matching cluster is smaller than said maximum distance.
5. The method of claim 1, wherein said at least one predetermined criterion includes a date range, such that a date of said subject document is within said date range.
6. The method of claim 1, wherein said at least one predetermined criterion includes both: a maximum distance, such that a distance between said new point and a centroid of said matching cluster is smaller than said maximum distance; and a date range, such that a date of said subject document is within said date range.
7. The method of claim 1, wherein said subject document comprises at least one of:
- text;
- image;
- text and at least one image;
- video data;
- audio data;
- medical imaging data;
- unidimensional data; and
- multi-dimensional data.
8. A system for determining other documents to be associated with a subject document, the system comprising: a feature extraction module for producing a numeric vector representation of features of said subject document;
- reference data, said reference data comprising numeric vectors, wherein each of said other documents corresponds to a single one of said numeric vectors, and wherein said reference data is grouped into a plurality of groupings, each grouping corresponding to at least one specific feature of said other documents;
a matching module for determining a matching grouping from said plurality of groupings for said subject document, said matching grouping being determined based on at least one predetermined criterion, wherein said system associates said subject document with said matching grouping.
9. The system of claim 8, wherein said feature extraction module is a trained neural network for extracting features.
10. The system of claim 8, wherein each grouping in said plurality of groupings is determined based on a distance between each of said numeric vectors within said each grouping and a centroid of each grouping.
11. The system of claim 8, wherein said at least one predetermined criterion is a maximum
distance, such that a distance between said numeric vector representation and a centroid of said matching cluster is smaller than said maximum distance.
12. The system of claim 8, wherein said at least one predetermined criterion is a date range, such that a date of said subject document is within said date range.
13. The system of claim 8, wherein said at least one predetermined criterion includes both:
a maximum distance, such that a distance between said numeric vector representation and a centroid of said matching cluster is smaller than said maximum distance; and a date range, such that a date of said subject document is within said date range.
14. The system of claim 8, wherein said subject document comprises at least one of:
- text;
- image;
- text and at least one image;
- video data;
- audio data;
- medical imaging data;
- unidimensional data; and
multi-dimensional data.
15. Non-transitory computer-readable media having stored thereon computer-readable and computer-executable instructions that, when executed, implements a method for determining other documents to be associated with a subject document, the method comprising:
(a) passing said subject document through a feature extraction module to thereby produce a numeric vector representation of features of said subject document, said vector representation having n dimensions;
(b) positioning a new point in an n-dimensional space based on said vector
representation, wherein said n-dimensional space contains a plurality of reference points, wherein each of said other documents corresponds to a single one of said plurality of reference points, and wherein said plurality of reference points is divided into a plurality of groupings, each grouping corresponding to at least one specific feature of said other documents;
(c) determining a matching grouping from said plurality of groupings for said subject document based on at least one predetermined criterion; and
(d) associating said subject document with said matching grouping.
16. The computer-readable media of claim 15, wherein said feature extraction module is a trained neural network for extracting features.
17. The computer-readable media of claim 15, wherein each grouping is based on a distance between each of said plurality of reference points within said each grouping and a centroid of each grouping.
18. The computer-readable media of claim 15, wherein said at least one predetermined criterion includes at least one of:
- a maximum distance, such that a distance between said new point and a centroid of said matching cluster is smaller than said maximum distance; and
- a date range, such that a date of said subject document is within said date range.
19. The computer-readable media of claim 15, wherein said at least one predetermined criterion includes both:
a maximum distance, such that a distance between said new point and a centroid of said matching cluster is smaller than said maximum distance; and a date range, such that a date of said subject document is within said date range.
20. The computer-readable media of claim 15, wherein said subject document comprises at least one of:
- text;
- image;
- text and at least one image;
- video data;
- audio data;
- medical imaging data;
- unidimensional data; and
multi-dimensional data.
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CA3,007,547 | 2018-06-07 | ||
| US16/002,811 US20190377823A1 (en) | 2018-06-07 | 2018-06-07 | Unsupervised classification of documents using a labeled data set of other documents |
| CA3007547A CA3007547A1 (en) | 2018-06-07 | 2018-06-07 | Unsupervised classification of documents using a labeled data set of other documents |
| US16/002,811 | 2018-06-07 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019232645A1 true WO2019232645A1 (en) | 2019-12-12 |
Family
ID=68769684
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CA2019/050806 Ceased WO2019232645A1 (en) | 2018-06-07 | 2019-06-07 | Unsupervised classification of documents using a labeled data set of other documents |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2019232645A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111079442A (en) * | 2019-12-20 | 2020-04-28 | 北京百度网讯科技有限公司 | Vectorization representation method and device of document and computer equipment |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20050234952A1 (en) * | 2004-04-15 | 2005-10-20 | Microsoft Corporation | Content propagation for enhanced document retrieval |
| US20180046649A1 (en) * | 2016-08-12 | 2018-02-15 | Aquifi, Inc. | Systems and methods for automatically generating metadata for media documents |
-
2019
- 2019-06-07 WO PCT/CA2019/050806 patent/WO2019232645A1/en not_active Ceased
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20050234952A1 (en) * | 2004-04-15 | 2005-10-20 | Microsoft Corporation | Content propagation for enhanced document retrieval |
| US20180046649A1 (en) * | 2016-08-12 | 2018-02-15 | Aquifi, Inc. | Systems and methods for automatically generating metadata for media documents |
Non-Patent Citations (2)
| Title |
|---|
| INDYK, P. ET AL.: "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality", PROCEEDINGS OF THE THIRTIETH ANNUAL ACM SYMPOSIUM ON THEORY OF COMPUTING, 23 May 1998 (1998-05-23), pages 604 - 613, XP058190106 * |
| PAULOVICH, F.V. ET AL.: "HiPP: A Novel Hierarchical Point Placement Strategy and its Application to the Exploration of Document Collections", IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, vol. 14, no. 6, 19 October 2008 (2008-10-19), pages 1229 - 1236, XP011324177, DOI: 10.1109/TVCG.2008.138 * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111079442A (en) * | 2019-12-20 | 2020-04-28 | 北京百度网讯科技有限公司 | Vectorization representation method and device of document and computer equipment |
| CN111079442B (en) * | 2019-12-20 | 2021-05-18 | 北京百度网讯科技有限公司 | Vectorized representation method, apparatus and computer equipment for documents |
| US11403468B2 (en) | 2019-12-20 | 2022-08-02 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for generating vector representation of text, and related computer device |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20200334486A1 (en) | System and a method for semantic level image retrieval | |
| US10482146B2 (en) | Systems and methods for automatic customization of content filtering | |
| CN108491817A (en) | An event detection model training method, device and event detection method | |
| Karthikeyan et al. | Probability based document clustering and image clustering using content-based image retrieval | |
| Gorokhov et al. | Convolutional neural networks for unsupervised anomaly detection in text data | |
| US12499150B2 (en) | Image encoder training method and apparatus, device, and medium | |
| US9881023B2 (en) | Retrieving/storing images associated with events | |
| CN115098690B (en) | Multi-data document classification method and system based on cluster analysis | |
| Etezadifar et al. | Scalable video summarization via sparse dictionary learning and selection simultaneously | |
| US20190377823A1 (en) | Unsupervised classification of documents using a labeled data set of other documents | |
| CN113408282B (en) | Method, device, equipment and storage medium for topic model training and topic prediction | |
| CN114329004B (en) | Digital fingerprint generation and data pushing method and device and storage medium | |
| Banerjee et al. | A novel centroid based sentence classification approach for extractive summarization of COVID-19 news reports | |
| WO2019232645A1 (en) | Unsupervised classification of documents using a labeled data set of other documents | |
| Shamsi et al. | A short-term learning approach based on similarity refinement in content-based image retrieval | |
| Naidu et al. | Violent Human Behaviour Detection in Videos Using ResNet18 3D Deep Learning | |
| Inoue et al. | Few-shot adaptation for multimedia semantic indexing | |
| CA3007547A1 (en) | Unsupervised classification of documents using a labeled data set of other documents | |
| Lakshmi et al. | An efficient telugu word image retrieval system using deep cluster | |
| Ramya et al. | XML based approach for object oriented medical video retrieval using neural networks | |
| Dai et al. | Local matching networks for engineering diagram search | |
| Bommisetty et al. | Content based video retrieval—methods, techniques and applications | |
| Gabryel | The bag-of-words methods with pareto-fronts for similar image retrieval | |
| CN113393556B (en) | Image processing method, apparatus, computer device, and readable storage medium | |
| Ugale et al. | A review paper on video retrieval in spatial and temporal domain |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19815352 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 19815352 Country of ref document: EP Kind code of ref document: A1 |