HK1139215B - System, method, and computer program for a consumer defined information architecture
- Publication number
- HK1139215B (HK10105347.7A)
- Authority
- HK
- Hong Kong
- Prior art keywords
- concept
- facet
- faceted
- attribute
- relationships
Abstract
A system, computer system, and method for organizing and managing data structures based on input from a feedback agent are provided, the method including (a) a method for faceted classification that is applicable to a domain of information, said method of faceted classification including (i) a facet analysis of said domain or receiving the results of facet analysis of the domain, and (ii) applying a faceted classification synthesis of said domain, and (b) a complex-adaptive method for selecting and returning information, on one or more iterations, from said faceted classification synthesis, said complex-adaptive method varying the organizing and managing of data structures in response to said returned information.
Description
Priority
This application claims priority from the following patent applications: U.S. patent application No. 11/469,258 filed on 31/8/2006, U.S. patent application No. 11/550,457 filed on 18/10/2006 and U.S. patent application No. 11/625,452 filed on 22/1/2007.
Technical Field
The present invention generally relates to classification systems. In particular, the present invention relates to a system, method and computer program to classify information. The invention also relates to a system, a method and a computer program for synthesizing a classification structure for a specific information domain.
Background
Faceted classification is based on the principle that information has multidimensional properties and can be classified in many different ways. The subjects of an information domain are subdivided into facets to represent this multidimensionality. The attributes of the domain are related in facet hierarchies. The material within the domain is then identified and classified based on these attributes.
Fig. 1 illustrates the general manner of faceted classification in the prior art, where this manner is applied, for example, to the classification of wine.
Faceted classification is called an analytico-synthetic approach because it involves both analysis and synthesis processes. To devise a scheme for faceted classification, information domains are analyzed to determine their basic facets. The classification may then be synthesized (or constructed) by applying the attributes of these facets to the domain.
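The analysis and synthesis steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: the facets (`color`, `region`, `varietal`), the wine labels, and the helper names `synthesize` and `select` are hypothetical and are not taken from Fig. 1 or from the prior art.

```python
from itertools import product

# Analysis step: the basic facets of the domain and their attributes.
# (Hypothetical wine facets, chosen only for illustration.)
FACETS = {
    "color": ["red", "white"],
    "region": ["France", "Chile"],
    "varietal": ["Merlot", "Chardonnay"],
}

# Objects of the domain, each described by one attribute per facet.
WINES = {
    "Wine A": {"color": "red", "region": "France", "varietal": "Merlot"},
    "Wine B": {"color": "white", "region": "Chile", "varietal": "Chardonnay"},
    "Wine C": {"color": "red", "region": "Chile", "varietal": "Merlot"},
}


def synthesize(facets, objects):
    """Synthesis step: build one class per combination of facet
    attributes and file each object under the class that matches it."""
    names = sorted(facets)
    classes = {combo: [] for combo in product(*(facets[n] for n in names))}
    for label, attrs in objects.items():
        classes[tuple(attrs[n] for n in names)].append(label)
    return names, classes


def select(objects, **criteria):
    """Access the same objects through any facet or combination of facets."""
    return sorted(
        label
        for label, attrs in objects.items()
        if all(attrs.get(facet) == value for facet, value in criteria.items())
    )


names, classes = synthesize(FACETS, WINES)
reds = select(WINES, color="red")
chilean = select(WINES, region="Chile")
```

The same three wines can be reached by color, by region, or by any combination of facets, which is the multiplicity of access paths that faceted classification is credited with.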
Many scholars have recognized faceted classification as an ideal method for organizing massive information stores, such as those on the Internet. Faceted classification is amenable to rapidly changing and dynamic information. In addition, by subdividing subjects into facets, it provides multiple and varied ways to access information.
Although faceted classification has the potential to address these classification needs, its adoption has been slow. Relative to the vast amount of information on the Internet, few domains use faceted classification. Instead, its use has been confined to specific vertical applications such as e-commerce stores and libraries. It generally remains the preserve of scholars, professional taxonomists, and information architects.
The obstacle to using faceted classification is its complexity. Faceted classification is a labor-intensive and intellectually challenging task. This complexity increases with the size of the information. As scale increases, the number of dimensions (or facets) within a domain compounds, which makes the domain more and more difficult to organize.
To help address this complexity, scholars have devised rules and guidelines for faceted classification. This body of scholarship dates back decades, to at least as early as the advent of modern computing and data analysis.
More recently, technology has been enlisted in the service of faceted classification. In general, this technology has been applied within the historical classification methods and organizing principles. Constrained by these conventional methods, attempts to provide fully automated faceted classification methods have generally been frustrated.
To illustrate the prior art, one example of an automated classification and faceted navigation system is ENDECA™. ENDECA is considered an industry leader in information classification and access systems (http://www.usatoday.com/tech/products/cnet/2007-06-29-end-google_N.htm).
ENDECA's technology uses guided navigation and a meta-relationship index that encompasses the dimensions of data and documents and the relationships between those dimensions; see, for example, U.S. Pat. No. 7,035,864, April 25, 2006: "structural data-driven search and navigation system and method for information retrieval".
The ENDECA system includes what the company describes as taxonomy definition and classification; see U.S. Pat. No. 7,062,483, June 13, 2006: "Hierarchical data-driven search and navigation system and method for information retrieval".
Existing automated classification techniques are used predominantly with, and are most useful for, what industry experts refer to as "structured data warehouses" and "managed content warehouses".
Another limitation of existing automated classification techniques is that they lack the human-based feedback that the cognitive demands of classification call for. For example, although ENDECA has a feedback loop for faceted navigation, which uses popularity of use to drive the presentation and prioritization of search results, it does not have a usage-based feedback loop to improve the semantic definition and semantic relationships of content.
Another broad category of hybrid classification systems can be described as large-scale collaborative classification. This approach attempts to combine the cognitive advantages of manual classification with automated systems. The collaborative classification systems in this emerging field are known by various names: "Web 2.0", "collaborative categorization", "folksonomy", "social indexing", "social tagging", "wisdom of crowds", and other designations. FLICKR™ (a photo-sharing community), DEL.ICIO.US™ (a social bookmark manager) and WIKIPEDIA™ (a wiki-based collaborative encyclopedia) are examples of this emerging class of collaborative taxonomy.
To varying degrees, these systems use technology to provide a framework for wide-ranging and distributed collaboration while allowing collaborators to make decisions on categories, concepts, and relationships. One challenge with this approach is that it creates a conflict between the guidance of subject and taxonomy experts and the input of layperson end users, who often hold very different views of, and taxonomies for, the content. These systems may help people collaborate by identifying areas of ambiguity and inconsistency and by highlighting competing claims between collaborators. But ultimately, in the case of collaborative systems, people should preferably resolve their differences and reach a broad consensus on the most intractable terms. This process is therefore difficult to extend and scale across large and diverse information domains.
A prime example of a collaborative taxonomy is Metaweb Technologies, Inc., which focuses on categorizing a wide range of open information domains by using collaborative taxonomy to create searchable databases within the Web and other complex and diverse information environments.
Metaweb Technologies is gaining interest because of its pioneering collaborative approach to creating a semantic web. Metaweb Technologies has filed 2 patent applications with the U.S. Patent and Trademark Office (U.S. patent application 20050086188, "Knowledge web", 21/4/2005; U.S. patent application 20030196094, "Method and apparatus for authenticating the content of a distributed database", 16/10/2003).
Metaweb Technologies' collaborative ontology construction relies on "crowd-sourcing" for its collaborative classification. End users are enlisted to define and extend multiple schemas that everyone can use. In the view of the well-known industry observer Esther Dyson: "the creator of MetaWeb has 'smartly designed' the syntax of how to specify relationships, but they rely on the wisdom (or specific knowledge) and effort of the masses to create actual content, not just specific data, but specific kinds of relationships between specific things" (Release 0.9: Metaweb - Emergent Structure vs. Intelligent Design, 3.11.2007, http://www.huffingtonpost.com/ester-dyson/release-09-met_b_43167.html). A limitation of this approach is that the scope and quality of the database are limited by the semantic content that its users contribute. It also relies on the ability of experts and laypeople to agree on specific data elements and to specify relationships between content so as to eliminate redundancy, so that the database contains unambiguous information.
Thus, the prior art suffers from a number of drawbacks in faceted classification, automated classification, and large-scale collaborative classification. The techniques are applied within, or based on, conventional methods. There is a need for enhanced classification methods that implement fundamental changes to the information structure.
Facet analysis generally requires human cognitive input, as there is no general pattern or heuristic for facet analysis that works for all domains of information. Currently, only humans possess the full breadth of pattern recognition skills. Unfortunately, structural patterns (such as semantic or syntactic structures) generally need to be identified throughout the domain of information to be classified, and there are many different patterns for identifying facets and attributes. Although people can be trained to recognize these patterns on small (local) data sets, the difficulty of this task becomes prohibitive as the size of the domain increases.
Human involvement also introduces limitations when the computational requirements of the analysis and synthesis processes exceed human cognitive abilities. Humans are skilled at evaluating relationships between information elements on a small scale, but are unable to manage the complexity of an entire domain in the aggregate. There is a need for a system that can aggregate small, localized human input across an entire domain of information.
Faceted classification schemes support multiple perspectives, which is a commonly cited benefit. Unfortunately, these perspectives are not intuitive when they are split across multiple hierarchies. This causes serious problems for visualization, integration and overall perspective. As the number of facets (or dimensions) in a structure increases, visualization becomes increasingly difficult. Thus, the visualization of a faceted classification scheme is often reduced to a "flat" one-dimensional result set that navigates the structure across only one facet at a time. This type of reduction masks the rich complexity of the underlying structure.
There is a need for methods and techniques that combine the expressiveness and flexibility of faceted approaches within an integrated, descriptively rich hierarchy. In addition, this flexibility is optimally extended down to the basic level of the classification scheme itself, so that the facets serving as the organizational basis can be constructed dynamically.
Once selected, the facets themselves are static and difficult to correct. This represents a considerable risk in the development of faceted schemes. Classifiers often lack complete knowledge of the information domain, and therefore the selection of these organizational bases is prone to error. Under a dynamic classification system, these risks would be mitigated by the ability to easily add or modify underlying facets. Conventional classification methods and the techniques derived from them lack flexibility at this basic level.
Any classification system may also need to take into account maintenance requirements in a dynamic environment. As the material in the domain changes, the classification can be adjusted accordingly. Maintenance is often an even more daunting challenge than the initial development of a faceted classification scheme. Terminology must be updated as it emerges and changes; new material in the domain typically needs to be evaluated and annotated; and the arrangement of facets and attributes generally needs to be adjusted to encompass evolving structures. Often, an existing faceted classification is simply abandoned in favor of an entirely new classification.
Hybrid systems involve humans at critical stages of analysis, synthesis and maintenance. Because they are involved early in the process, humans often become the bottleneck for the classification work. Thus, the process has been slow and costly. There is a need for a system that accepts classification data from people in a more decentralized, ad hoc manner that does not require centralized control and authority. Such systems may support implicit feedback mechanisms in which the real-world activities of information access and information consumption provide ongoing support for the maintenance and growth of classification schemes.
To guide this process, hybrid systems are often based on existing general faceted classification schemes. However, these general schemes are not always applicable to the massive and rapidly evolving modern world of information. Specialized solutions tailored to the needs of individual domains are needed.
Since general faceted classification schemes cannot be applied universally, it is also necessary to connect different information domains together. However, while providing an opportunity to integrate domains, the solution should respect the privacy and security of individual domain owners.
Foremost among these classification needs is a system that can be managed in a massively decentralized environment involving large groups of collaborators. However, classification deals with complex concepts that carry deep and ambiguous meaning. Resolving these ambiguities and conflicts often involves intense negotiation and interpersonal conflict that derail collaboration even in small groups.
Disclosure of Invention
In a first aspect of the invention, there is provided a method for organizing and managing data structures, including on the basis of input from a feedback agent, the method comprising: (a) a method for faceted classification applicable to an information domain, the faceted classification method comprising (i) performing a facet analysis on the domain or receiving a facet analysis result of the domain, and (ii) applying a faceted classification synthesis of the domain; and (b) a complex-adaptive method for selecting and returning information, on one or more iterations, from the faceted classification synthesis, the complex-adaptive method changing the organization and management of data structures in response to the returned information.
In another aspect of the invention, a method for faceted classification of an information domain includes: (a) providing a faceted data set including facet attributes for classifying information, such facet attributes optionally including facet attribute hierarchies for the facet attributes; (b) providing a dimensional concept taxonomy in which facet attributes are assigned to objects according to concepts associating meanings with the objects of the domain to be classified, the concepts being represented by concept definitions defined in the dimensional concept taxonomy using the facet attributes and associated with the objects, the dimensional concept taxonomy expressing dimensional concept relationships between concept definitions according to the faceted data set; and (c) providing or implementing a complex-adaptive system for selecting and returning dimensional concept taxonomy information to change the faceted data set and the dimensional concept taxonomy in response to the dimensional concept taxonomy information.
In yet another aspect of the present invention, the method for faceted classification of information domains further comprises: performing a faceted classification synthesis to relate a set of concepts represented by a concept definition, the concept definition being defined from a faceted data set including faceted attributes and optionally a hierarchy of faceted attributes, the faceted classification synthesis including: expressing a dimensional conceptual relationship between concept definitions, wherein two concept definitions are determined to be related in a particular dimensional conceptual relationship by examining whether at least one of an explicit relationship and an implicit relationship exists between respective facet attributes of the two concept definitions in a facet dataset.
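The relatedness test described above can be illustrated with a small sketch. It rests on assumptions that are not stated in this passage: an "explicit" relationship is taken to mean that one facet attribute is an ancestor of another in the facet attribute hierarchy, an "implicit" relationship is taken to mean that the two concept definitions share the same attribute, and the attribute names and the function `is_broader` are hypothetical.

```python
# Hypothetical facet attribute hierarchy, stored as child -> parent.
PARENT = {
    "merlot": "red wine",
    "red wine": "wine",
    "chardonnay": "white wine",
    "white wine": "wine",
}


def ancestors(attr):
    """Return every attribute above `attr` in the hierarchy."""
    found = set()
    while attr in PARENT:
        attr = PARENT[attr]
        found.add(attr)
    return found


def explicit(a, b):
    """Assumed meaning of 'explicit': a is an ancestor of b."""
    return a in ancestors(b)


def implicit(a, b):
    """Assumed meaning of 'implicit': both definitions use the same attribute."""
    return a == b


def is_broader(parent_def, child_def):
    """Treat parent_def as a broader concept than child_def when every
    attribute of parent_def is matched, explicitly or implicitly, by
    some attribute of child_def, and the two definitions differ."""
    if set(parent_def) == set(child_def):
        return False
    return all(
        any(explicit(p, c) or implicit(p, c) for c in child_def)
        for p in parent_def
    )


red = {"red wine"}
french_merlot = {"merlot", "france"}
related = is_broader(red, french_merlot)
```

Under these assumptions, "red wine" is found to be broader than "merlot, france" through the explicit link between "red wine" and "merlot", while the reverse test fails.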
In yet another aspect of the present invention, a computer system for performing facet analysis of input information selected from a domain of information according to a source data structure is provided, the computer system: (a) operable to derive faceted attributes of the input information and optionally a faceted attribute hierarchy of the input information using pattern augmentation and statistical analysis to identify a faceted attribute relationship pattern in the input information.
In another aspect of the present invention, there is provided a computer system for enabling a user to manipulate dimensional conceptual relationships, the computer system comprising: (a) a processor; (b) a computer-readable medium in data communication with the processor, wherein the computer-readable medium includes processor-executable instructions and a plurality of data elements determined to be related in a particular dimensional conceptual relationship thereon; (c) an input tool configured to allow an external entity to interface with the processor; (d) a display operative to provide a visual depiction of at least the selected data element; and (e) an editor that allows an outside entity to modify the data elements and the particular dimensional conceptual relationships.
In yet another aspect of the invention, a system for organizing and managing data structures is provided, including on the basis of input from a feedback agent, wherein: (a) the system includes or is linked to a complex-adaptive system for selecting and returning dimensional concept taxonomy information to change the faceted data set and the dimensional concept taxonomy in response to the dimensional concept taxonomy information; (b) the system is operable to process a faceted data set including facets, facet attributes, and optionally a facet attribute hierarchy for the facet attributes used to classify information; and (c) the system is further operable to define a dimensional concept taxonomy in which facet attributes are assigned to objects according to concepts associating meanings with objects of the domain to be classified, the concepts being represented by concept definitions defined using the facet attributes in the dimensional concept taxonomy and associated with the objects, the dimensional concept taxonomy expressing dimensional concept relationships between the concept definitions according to the faceted data set.
Drawings
The invention will be better understood with reference to the accompanying drawings. Note that for the description contained herein, triangular shapes are used to represent relatively simple data structures, while conical shapes are used to represent relatively complex data structures that embody higher dimensions. The variable size of the triangles and cones represents the compression and expansion transform, but in no way indicates or shows the exact scale of compression or expansion.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate various example systems, methods, and so on, that describe various example embodiments of aspects of the invention. It should be appreciated that the illustrated boundaries of elements (e.g., boxes, groups of boxes, or other shapes) in the figures represent one example of boundaries. One of ordinary skill in the art will recognize that one element may be designed as multiple elements or that multiple elements may be designed as one element. An element shown as an internal component of another element may be implemented as an external component and vice versa. In addition, elements may not be drawn to scale.
FIG. 1 is a schematic diagram illustrating a prior art faceted classification method;
FIG. 2 illustrates an overview of operations showing data structure transformations to create a dimensional concept taxonomy for a domain;
FIG. 3 illustrates a knowledge representation model for the operation of FIG. 2;
FIG. 4 illustrates in more detail the operational overview of FIG. 2;
FIG. 5 illustrates a method of extracting input data;
FIG. 6 illustrates a method of source structure analysis;
FIG. 7 illustrates a process of extracting preliminary concept-keyword definitions;
FIG. 8 illustrates a method of extracting morphemes;
FIGS. 9-10 illustrate a process of computing potential morpheme relationships based on conceptual relationships;
FIGS. 11A-11B, 12 and 13 illustrate a process for assembling a multi-hierarchy of morpheme relationships from a set of potential morpheme relationships;
FIGS. 14A, 14B, and 15 illustrate reordering a morpheme multi-hierarchy into a strict hierarchy using an attribution method;
FIGS. 16A and 16B illustrate sample segments from a morpheme hierarchy and a keyword hierarchy;
FIG. 17 illustrates a method of preparing output data for use in constructing a dimensional concept taxonomy;
FIG. 18 illustrates a manner in which operations generate dimensional concepts from element constructs;
FIG. 19 illustrates how operations combine dimensional concept relationships to generate a dimensional concept taxonomy;
FIGS. 20, 21 and 22 illustrate how a dimension concept taxonomy is constructed using faceted output data;
FIG. 23 illustrates a dimensional concept taxonomy constructed for a localized set of domains;
FIG. 24 illustrates a mode of dynamic synthesis;
FIG. 25 illustrates a method for candidate set assembly for dynamic synthesis;
FIG. 26 illustrates a process of user interaction to edit a content container within a dimension concept taxonomy;
FIG. 27 illustrates a series of user interactions and feedback loops in a complex-adaptive system;
FIG. 28 illustrates a personalization operation;
FIG. 29 illustrates the operation of a machine-based complex-adaptive system;
FIG. 30 illustrates a computing environment and architectural components of a system for performing operations according to one embodiment;
FIG. 31 illustrates a simplified data pattern in one embodiment;
FIG. 32 illustrates an overview of a system to perform data structure transformation operations, according to one embodiment;
FIG. 33 illustrates a faceted data structure and a multi-level architecture supporting these structures used in one embodiment;
FIG. 34 illustrates a view of a dimension concept taxonomy in a browser-based user interface;
FIG. 35 illustrates a browser-based user interface that facilitates a dynamic synthesis mode;
FIG. 36 illustrates an environment for user interaction in an outliner-based user interface; and
FIG. 37 illustrates a representative implementation of a computer system that allows for manipulation of various aspects of faceted classification information in accordance with the present invention.
Detailed Description
System operation
The detailed description sets out one or more embodiments of some aspects of the present invention.
The description of the specific embodiments is divided into the following headings and subheadings.
(1) "Summary of the invention": this section generally describes the field of information classification to which the present invention pertains, and also generally describes the objects and some advantages of the present invention.
(2) "System operation": this section generally describes the steps involved in practicing the invention. The subheading "operational overview" generally describes some of the components of the system. The subheading "faceted analytical method" generally describes the facet analysis component of the present invention. The subheading "faceted classification synthesis method" generally describes the faceted synthesis components of the present invention, including its static and dynamic synthesis components. The subheading "Complex-adaptive feedback mechanism" generally describes the response of the present invention to various user interactions.
(3) "Implementation": this section generally describes representative embodiments in which the invention may operate. The subheading "system architecture components" generally describes possible embodiments of the present invention. The subheading "data model and schema" generally describes the method by which the present invention transforms data. The subheading "dimension transformation system" generally describes the operation of the inventive system as it would appear in only one possible embodiment of the invention. The following subheadings relate to representative implementations of the invention: "Multi-level data structures", "distributed computing environments", "XML schemas and client-side transformations", and "user interfaces".
Summary of the invention
In view of the limitations and deficiencies of the prior art, a specific need can be recognized for a structural and collaborative system of information architecture that addresses the challenges and problems noted herein. Several objects and advantages of the present invention are therefore summarized below. These objects and advantages are not exhaustive but merely serve to illustrate some aspects of the invention and its possible advantages and benefits.
In one aspect of the invention, the system operates at the fundamental level of constructing an optimal information structure. Most existing classification, search and visualization solutions are patches applied to a flawed structural foundation and are therefore inherently limited. The system of the present invention provides an ontological and taxonomic framework for complex information structures, together with a practical path to implementation. In one aspect, the system supports a complex architecture, unlike the simple flat architecture of the prior art that dominates the information landscape today.
The system of the present invention supports conceptual hierarchies as the most familiar and robust model for relating information (the term "multi-hierarchy" describes a structural model that combines the core requirements of dimensionality and conceptual hierarchies). However, the system of the present invention, in one aspect, reduces the personal and collaborative negotiations that make the construction of concept hierarchies, taxonomies, and ontologies cumbersome. A reliable mechanism for linking hierarchies from different information domains should also be provided.
The system of the present invention, in one aspect thereof, provides structural integrity at the various intersections within a dimensional space. This can be achieved by eliminating the gaps (blanks) in information that exist at nodes and in the links and connections between nodes.
The system of the present invention, in one aspect thereof, involves humans to provide an important context-aware component. While machines provide useful tools for discovery and collaboration, machines do not possess the artificial intelligence necessary to "understand" complex knowledge. As such, the system of the present invention in one aspect thereof involves humans in a manner that is familiar to them and in which they can intervene.
The system of the present invention, in one aspect thereof, involves machines to manage the extreme complexity of dimensional structures and concept multi-hierarchies in huge information domains, and to facilitate consensus among collaborators on concept descriptions and relationships.
The system of the present invention, in one aspect thereof, includes non-technical laypeople in the collaboration. The shortage of professional architects and the scope of the problem require a solution that is generally accessible. The invention can shield people from the complexity of the dimensional structure without compromising its technical advantages.
The system of the present invention is operable to support massively distributed parallel processing ("many hands make light work"). The size and complexity of the information domain typically impose physical limits on processing that currently appear to be practically immutable. In many cases, massive and decentralized parallelism is the preferred means of challenging these limits.
The system of the present invention is operable in one aspect thereof to support synthesis operations capable of avoiding the physical limitations of unbounded information and knowledge. The system of the present invention provides in one aspect thereof the following capability: the possibilities for a virtually unlimited number of data connections are encoded without the need to actually generate them before the consumers of the information request them. In addition, the system of the present invention provides, in one aspect thereof, various synthesis modes such that only data connections matching the consumer's stated interests and perspectives are presented.
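The idea of encoding potential connections without generating them until they are requested can be loosely illustrated with a Python generator. This is a sketch under assumptions only: the concept definitions, the `interest` parameter, and the rule that two concepts are connected when they share an attribute of interest are hypothetical, and the invention's actual synthesis modes are described later in the specification.

```python
from itertools import combinations

# Hypothetical concept definitions: concept label -> set of facet attributes.
CONCEPTS = {
    "merlot from france": {"merlot", "france"},
    "merlot from chile": {"merlot", "chile"},
    "chardonnay from chile": {"chardonnay", "chile"},
}


def connections(concepts, interest):
    """Lazily yield connections between concepts that share at least one
    attribute the consumer has expressed interest in. No connection is
    computed until the caller iterates over the generator."""
    for a, b in combinations(sorted(concepts), 2):
        shared = concepts[a] & concepts[b] & interest
        if shared:
            yield a, b, sorted(shared)


# Nothing is computed here; the generator only encodes the possibilities.
pending = connections(CONCEPTS, interest={"chile"})

# The connections are materialized only when the consumer asks for them.
chile_links = list(pending)
```

The number of potential pairs grows quadratically with the number of concepts, so deferring the work and filtering by the consumer's interests keeps the amount actually generated small.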
The system of the present invention in one aspect thereof supports and encompasses the dynamics of the information domain. It provides a structure that can adapt to and evolve with information, rather than a static snapshot of information as at some point in time.
The system of the present invention is cost effective. While the cost of search provides a strong incentive to find solutions to information overload and information sprawl, organizing projects do not come with a blank check. One impediment to a more structured Internet is the enormous cost of organizing it using existing technologies and methods. These organizational costs are not only financial; they are also inherent in the limits of human and computer processing.
The system of the present invention, in one aspect thereof, provides an opportunity for domain owners and system end-users to maintain distinct, private, and highly personalized knowledge repositories while sharing the benefits of collective talent and collective knowledge assets.
The present invention, in one of its aspects, provides methods and systems that are capable of managing multiple forms of information, including structural relationships, digital media such as text and multimedia, messaging and email, electronic commerce, and many forms of human interaction and collaboration, and provide end users with a decentralized system to output structural information across a variety of media, including websites and software clients.
Further objects and advantages will become apparent from a consideration of the ensuing description and drawings.
System operation
Overview of the operation
FIGS. 2, 3, 4, 18, 19, 32, and 33 provide an overview of operations and systems for constructing and managing complex dimensional information structures, for example, to create dimensional concept taxonomies for domains. In particular, FIGS. 2, 3, 4, 18, 19, 32, and 33 illustrate knowledge representation models and certain dimensional data structures and constructs for such operations. Also shown are data structure transformation methods including complex-adaptive systems and enhanced faceted classification methods. This description starts with a brief overview of the complex dimensional structures that are particularly applicable to knowledge representation.
Knowledge representation in complex dimensional structures
There are levels of abstraction that can be used to represent information and knowledge. The concept of "dimension" is often used to express the degree of complexity. A simple list, such as a shopping list or a buddy list, may be described as a one-dimensional array. Tables and spreadsheets (two-dimensional arrays) are more complex than simple lists. Some diagrams may describe information in three-dimensional space, and so on.
Each dimension within the structure may establish an organizational basis for the contained information. The dimensionality can thus establish the complexity scale for the information structure. Complex structures can involve many of these bases and are often identified as n-dimensional structures.
It is also important to note that the technical properties of the dimensions themselves can provide a great deal of diversity between structures. For example, dimensions may exist as variables, and the structure thus establishes a multivariate space. Under these types of models, a node may take on specific values or data points within the variables represented by the dimensions. Alternatively, the nodes may be less restrictive, merely providing containers for information rather than discrete variables. The distances between nodes may be relative rather than strictly quantified. By varying these types of technical attributes, the associated structure may strike a balance between organizational rigidity and descriptive flexibility.
Some information structures may contain nodes at each intersection; other information structures may be incomplete, missing nodes at the intersections of some dimensions. This is particularly relevant when the information structure is constructed manually. When the complexity of the structure exceeds the cognitive abilities of human designers, errors and gaps in the information structure may result.
As an example, when people create hyperlinks in a network structure (e.g., the World Wide Web), the links they provide are rarely comprehensive within a given domain. If there is an appropriate target for a link in the domain, but the link is absent, then this may be considered a gap in the information structure. In addition, if the information structure provides a classification for information that does not currently exist, this is also a gap in the structure.
Structural integrity may be described in part by the gaps in the information structure. Without an explicit ontology or underlying classification system to manage relationships, the structure may begin to degrade as the number of nodes and dimensions increases. Information gaps are one sign of this degradation.
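The notion of a missing node at an intersection of dimensions can be illustrated with a minimal sketch. The dimension names, values, and the function find_gaps are hypothetical and chosen only for illustration; the description does not prescribe any particular procedure for detecting such gaps.

```python
from itertools import product


def find_gaps(dimensions, nodes):
    """Return every intersection of dimension values that has no node."""
    all_intersections = product(*dimensions.values())
    return [combo for combo in all_intersections if combo not in nodes]


# Hypothetical two-dimensional structure: cuisine x meal.
dimensions = {
    "cuisine": ["thai", "italian"],
    "meal": ["lunch", "dinner"],
}

# Nodes that a human designer actually created by hand.
nodes = {("thai", "lunch"), ("thai", "dinner"), ("italian", "dinner")}

gaps = find_gaps(dimensions, nodes)
print(gaps)
```

With only two dimensions the single missing intersection is easy to spot by eye; the number of intersections to check grows multiplicatively with each added dimension, which is why manually built structures tend to accumulate gaps.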
Complex structures have far more information-carrying capacity than simple structures. Just as adding floors increases the volume of a building, adding dimensions increases the amount of information that can be contained in a structure. Without multi-dimensional support, the structure will eventually collapse under load as the volume of information exceeds its capacity.
Another attractive feature of complex dimensional structures is their accessibility. Flat structures sprawl as information is added, just as suburbs of small buildings cause a city to sprawl.
Clearly, the dimensionality of complex structures points to a compelling remedy for information overload and information sprawl. Because of their inherent advantages, one would expect their use to proliferate. Unfortunately, this is not the case. The adoption of complex structures, particularly among the general public who most require them, has been challenging.
The reason for the limited adoption of complex structures is self-evident: their inherent complexity. Despite these prominent fundamental and structural problems, a solution has been proposed that makes creating and managing complex structures simple enough for large-scale market adoption.
Overview of the System methods
Analysis and compression
FIG. 2 illustrates operations to construct a dimension concept taxonomy 210 for a domain 200, where the domain 200 includes a corpus of information that is the subject of the taxonomy. The domain 200 may be represented by a source data structure 202 for input to an analysis and compression process 204, the source data structure 202 including a source structure schema and a set of source data entities derived from the domain 200. The analysis and compression process 204 may derive a morpheme dictionary 206, which morpheme dictionary 206 is an elementary data structure that includes an elementary construction set that provides the basis for a new faceted classification scheme.
The information in the domain 200 may relate to virtual or physical objects, processes, and relationships between such information. As one example, the operations described herein may relate to content classification accessible through a Web page. Alternative embodiments of the domain 200 may include: document repositories, recommendation systems for music, software code repositories, workflow models, business processes, and the like.
Elementary constructs within morpheme dictionary 206 may be a minimal set of basic information building blocks and information relationships, where the aggregate provides information-bearing capacity for classifying source data structures 202.
Synthesis and expansion
The morpheme dictionary 206 may be an input to the synthesis and expansion method 208. The synthesize and expand operations may transform the source data structure 202 into a third data structure, referred to herein as a dimension concept taxonomy 210. The term "taxonomy" refers to a structure that organizes categories into a hierarchical tree and associates the categories with related objects (such as documents or other digital content). The dimension concept taxonomy 210 can categorize source data entities from the domain 200 in a complex dimension structure derived from the source data structure 202. In this way, source data entities (objects) can be related across many different organizational bases, which allows them to be found from many different perspectives.
Complex adaptive system
Advantageously, the classification system and operation is adaptive to changes in a dynamic environment. In one embodiment, this requirement is met by a complex-adaptive system 212. A feedback loop back to the source data structure 202 may be established through user interaction with the dimension concept taxonomy 210. The transformation process (204 and 208) may be repeated and the resulting structures 206 and 210 may be refined over time.
In one embodiment, the complex-adaptive system 212 may manage the interactions of end users that use the output structure (i.e., the dimensional concept taxonomy 210) to leverage human cognitive abilities in the classification process.
The operations described herein seek to transform a relatively simple source data structure into a more complex dimensional structure so that source data objects can be organized and accessed in a variety of ways. Many types of information systems can be enhanced by extending the dimensionality and complexity of their underlying data structures. Just as higher resolution improves image quality, higher dimensionality can improve the resolution and detail of the data structure. This increased dimensionality, in turn, may enhance the utility of the data structure. Enhanced utility may be achieved through improved, more flexible content discovery (e.g., through searching), improvements in information retrieval, and content aggregation.
Since the transformation can be implemented by a complex system, the increase in dimensionality is not necessarily linear or predictable. The transformation may also depend in part on the amount of information contained in the source data structure.
For a system implemented at large, Internet scale, where the number of nodes and connections grows exponentially, a key distinction is that the dimensional information structure provides the possibility of these connections without incurring the prohibitive cost of actually building them until and unless they are needed.
Dimension knowledge representation model
FIG. 3 illustrates one embodiment of a knowledge representation model including knowledge representation entities, relationships, and transformation methods that may be used in the operations of FIG. 2. More details of the knowledge representation model and its transformation method will be described in the following description with reference to fig. 3, fig. 18, fig. 19, fig. 32, fig. 33, and fig. 4.
In one embodiment of the invention, the knowledge representation entities are a set of content nodes 302, a set of content containers 304, a set of concepts 306 (only one concept is presented in FIG. 3 for simplicity of illustration), a set of keywords 308, and a set of morphemes 310.
The objects of the domain to be classified are referred to as content nodes 302. The content nodes may include any object subject to classification. For example, the content node 302 may be a file, a document block (e.g., an annotation), an image, or a stored string of characters. Content node 302 may reference a physical object or a virtual object.
The content node 302 may be contained in a collection of content containers 304. Content container 304 may provide addressable (or locatable) information that may be used to retrieve content node 302. For example, a content container 304 of a Web page accessible via a URL may contain content nodes 302 in the form of text and images. Content container 304 may contain one or more content nodes 302.
Concepts 306 may be associated with the content nodes 302 to abstract some meaning (such as a description, purpose, use, or intent of the content nodes 302). An individual content node 302 may be assigned a number of concepts 306; individual concepts 306 may be shared across many content nodes 302.
Concepts 306 may be defined by their relationship to other entities at a compound level of abstraction, and structurally by other more basic knowledge representation entities (e.g., keywords 308 and morphemes 310). Such a structure is referred to herein as a conceptual definition.
Morphemes 310 represent the smallest meaningful knowledge representation entities that exist across the domains known to the system (i.e., that have been analyzed for the purpose of constructing the morpheme dictionary 206). A single morpheme 310 may be associated with a number of keywords 308; a single keyword 308 may include one or more morphemes 310.
In addition, the meaning of the term "morpheme" in the context of this specification is to be distinguished from its conventional definition in the field of linguistics. In linguistics, a morpheme is the "smallest meaningful unit of language". In the context of this specification, a morpheme refers to the "smallest meaningful knowledge representation entity that exists in any domain known to the system".
The keywords 308 include a collection (or group) of morphemes 310. A single keyword 308 may be associated with many concepts 306; a single concept 306 may include one or more keywords 308. Keywords 308 may thus represent an additional level of data structure between concepts 306 and morphemes 310. They serve as "atomic concepts", the lowest level of knowledge representation recognizable to the user.
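The entities described above and their containment relationships can be summarized in a minimal data-model sketch. The class names mirror the entity names in the description, but the fields, the helper morphemes_of, and the sample labels are hypothetical and are not taken from the figures.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Morpheme:
    label: str


@dataclass(frozen=True)
class Keyword:
    label: str
    morphemes: tuple  # one or more Morpheme objects


@dataclass(frozen=True)
class Concept:
    label: str
    keywords: tuple  # one or more Keyword objects


@dataclass
class ContentNode:
    label: str
    concepts: list = field(default_factory=list)  # many concepts per node


@dataclass
class ContentContainer:
    address: str  # addressable location, e.g. a URL
    nodes: list = field(default_factory=list)  # one or more content nodes


def morphemes_of(concept):
    """Flatten a concept definition down to its morpheme labels."""
    return {m.label for k in concept.keywords for m in k.morphemes}


# Hypothetical sample data.
snow, board, boot = Morpheme("snow"), Morpheme("board"), Morpheme("boot")
snowboard = Keyword("snowboard", (snow, board))
boots = Keyword("boots", (boot,))
concept = Concept("snowboard boots", (snowboard, boots))

page = ContentContainer("http://example.com/gear")
page.nodes.append(ContentNode("product photo", [concept]))

print(sorted(morphemes_of(concept)))
```

Because morphemes and keywords are shared objects rather than copies, the same morpheme can take part in many keywords and the same keyword in many concepts, matching the many-to-many associations described above.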
Since the concepts 306 may be abstracted from the content nodes 302, the concept signatures 305 may be used to represent the concepts 306 within the content nodes 302. Concept signatures 305 are features of content nodes 302 that represent organizational topics present in the content.
In one embodiment of the invention, as with the elementary constructions, the content nodes 302 approach their least reducible form. The content container 304 may be reduced to as many content nodes 302 as practical. These elementary content nodes 302 may extend the options for content aggregation and filtering when combined with the very fine classification scheme in the present invention. Content nodes 302 can be identified and recombined along any dimension in a dimensional concept taxonomy.
A special category of content nodes 302, namely labels (often referred to as "terms" in the classification art), may be attached to each knowledge representation entity. As with the content nodes 302, labels can be abstracted from the respective entities that they describe in the knowledge representation model. The following types of labels are therefore identified in fig. 3: a content container label 304a to describe the content container 304; a content node label 302a to describe the content node 302; concept labels 306a to describe the concepts 306; a set of keyword labels 308a to describe the set of keywords 308; and a set of morpheme labels 310a to describe the set of morphemes 310.
A sample of morphemes 310 is presented in fig. 18. Morphemes 310 may be among the elementary constructs derived from the source data. Other elementary constructs may include a set of morpheme relationships. Just as morphemes represent the primary building blocks of concept definitions and are derived from concepts, morpheme relationships represent the primary building blocks of relationships between concepts and are derived from such concept relationships. The morpheme relationships illustrated in FIGS. 9-10 will be discussed in more detail below.
Labels make the knowledge representation entities discernible to humans. In one embodiment, each label is derived from the unique vocabulary of the source domain. In other words, the labels assigned to the data elements are drawn from the language and terms presented in the domain.
Concept, keyword, and morpheme abstraction is described below and illustrated in FIGS. 7-8. Concept signatures, content nodes, and label abstraction are discussed in more detail below with reference to input data abstraction (FIG. 5).
One embodiment of the invention uses a multi-level knowledge representation model across entities and their relationships. This distinguishes it from the two-level model of concepts and atomic concepts in traditional faceted classification and its flat (single-level) relationship structure, as shown in fig. 1 (prior art).
While certain aspects of the operations and systems are described with reference to one knowledge representation model, one of ordinary skill in the art will recognize that other models may be used by adapting the operations and systems accordingly. For example, concepts may be combined together to create higher-order knowledge representation entities (such as memes, defined as collections of concepts). The structure of the representation model can also be contracted. For example, the keyword abstraction layer may be removed, so that concepts are defined only in terms of morphemes 310.
Overview of System transformation method
FIG. 4 illustrates a generalized overview of one embodiment of the transform operation 800 introduced in FIG. 2.
Input data extraction
Operation 800 may begin with the manual identification, by a domain owner, of the domain 200 to be classified. The source data structure 202 may be defined from a domain training set 802. The training set 802 may be a representative subset of the larger domain 200 and may be used as a surrogate for it. That is, the training set may include the source data structure 202 for the entire domain 200 or for a representative portion of it. Training sets are well known in the art.
An input data set may be extracted (804) from the domain training set 802. The input data may be analyzed to discover and extract elementary structures (this process shown in fig. 5 is discussed in more detail below).
Domain facet analysis and data compression
In the present embodiment, the analysis engine 204a introduced above and described in fig. 33 may follow the methods 806-814 as indicated by the brackets in fig. 4. The input data is analyzed and processed (806) to provide a set of meta-structure analyses. The source data analysis may provide information about structural characteristics of the source data structure 202. This process shown in fig. 6 is discussed in more detail below.
A set of preliminary concept definitions may be generated (808) (this process shown in fig. 7 is discussed in more detail below). The preliminary concept definition may be structurally represented as a collection of keywords 308.
Morphemes 310 may be extracted 810 from the keywords 308 in the preliminary concept definition, thereby extending the structure of the concept definition to another level of abstraction (this process shown in FIG. 8 is discussed in more detail below).
To begin the process of constructing the morpheme hierarchy 402, a set of potential morpheme relationships may be computed 812. The potential morpheme relationships may be derived from an analysis of conceptual relationships in the input data.
Morpheme structural analysis may be applied to the potential morpheme relationships to identify the morpheme relationships that will be used to create the morpheme hierarchy.
The morpheme relationships selected 814 for inclusion in the morpheme hierarchy may be assembled to form the morpheme hierarchy 402 (this process shown in FIGS. 9-15 is discussed in more detail below).
Dimension structure synthesis and data expansion
In this embodiment, the build engine introduced above and described in FIG. 32 may follow methods 818 through 820 as indicated by the bracket in FIG. 4. Enhanced faceted classification methods can be used to synthesize the complex dimensional structure 210a and the dimensional concept taxonomy 210 (this process shown in fig. 20-22 is discussed in more detail below).
The output data 210a for the new dimensional structure may be prepared (818). The output data is a structural representation of the classification scheme for the domain, which can be used as faceted data to create the dimensional concept taxonomy 210. As described above, the output data may include concept definitions 708 associated with the content nodes 302 and the keyword hierarchy 710. In particular, the faceted data may include the structure of the keyword hierarchy 710 and the keywords 308 in the concept definitions 708, where the keywords 308 are defined in terms of morphemes 310 of the morpheme dictionary 206 (this process shown in FIG. 17 is discussed in more detail below).
A set of dimensional concept relationships (which in aggregate form a polyhierarchy) can be constructed (820). The dimensional concept relationships represent the concept relationships in the dimensional concept taxonomy 210. The dimensional concept relationships may be computed based on the organizing principles of the enhanced faceted classification method. The dimensional concept relationships may be merged to form the dimensional concept taxonomy 210, within which the concepts 306 (as encoded in the concept definitions) are classified (this process shown in fig. 20-22 is discussed in more detail below).
For the enhanced faceted classification method, various modes of synthesis operation are possible. In one embodiment, a system for "limited scope" faceted classification synthesis is disclosed in which concept relationships are synthesized based on domains that have been processed only partially, or not at all, by the methods of the analysis engine. In another embodiment, a system for "dynamic" faceted classification synthesis is disclosed in which dimensional concept hierarchies are processed in near real-time based directly on synthesis parameters provided by end users of the information (the synthesis modes of operation are discussed in more detail below).
Complex-adaptive system and user interaction
In the present embodiment, the operation of the complex-adaptive system 212 introduced above and described in fig. 2 may follow the methods 212a, 212b, and 804 associated with the concept taxonomy 210 as indicated by the parenthesis in fig. 4.
As discussed, the dimensional concept taxonomy 210 may be expressed to the user through the presentation layer 608. In one embodiment, the presentation layer 608 is a website (the presentation layers shown in FIGS. 23-27 and 34-36 are discussed in more detail below). Via the presentation layer 608, the content nodes 302 in the domain 200 can be presented as classified under the concept definitions associated with each content node 302.
This presentation layer 608 may provide an environment for gathering a set of user interactions 212a as dimensional concept taxonomy information. User interactions 212a may include the various ways in which end users and domain owners may interact with the dimensional concept taxonomy 210 of the domain. The user interactions 212a may be coupled to the analysis engine via a feedback loop, through the step 804 of extracting input data, to implement a complex-adaptive system (this process shown in fig. 27 is discussed in more detail below).
In one embodiment, the user interactions 212a returned in the explicit feedback loop may be queued for processing when resources become available. Accordingly, an implicit feedback loop may also be provided. The implicit feedback loop may compute implicit concept relationships 212b based on a subset of the organizing principles of the enhanced faceted classification method. User interactions 212a with the dimensional concept taxonomy 210 can thus be processed in near real-time through the implicit feedback loop.
The classification scheme for the derived dimensional concept taxonomy 210 can be continuously refined and extended through the complex-adaptive system 212.
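The division of labor between the two feedback loops can be sketched as a deferred queue alongside an immediate, lightweight update. The class FeedbackLoops, its method names, the representation of an interaction as a pair of concept labels, and the budget parameter standing in for "available resources" are all assumptions made for illustration.

```python
from collections import deque


class FeedbackLoops:
    """Hypothetical holder for the explicit and implicit feedback paths."""

    def __init__(self):
        self.explicit_queue = deque()        # waits for full analysis
        self.implicit_relationships = set()  # updated immediately

    def record(self, interaction):
        """Record one user interaction, given as (concept, related_concept)."""
        # Explicit loop: defer the expensive re-analysis.
        self.explicit_queue.append(interaction)
        # Implicit loop: apply a cheap update in near real-time.
        self.implicit_relationships.add(interaction)

    def process_explicit(self, analyze, budget):
        """Run the full analysis on queued interactions while budget allows."""
        processed = []
        while self.explicit_queue and budget > 0:
            processed.append(analyze(self.explicit_queue.popleft()))
            budget -= 1
        return processed


loops = FeedbackLoops()
loops.record(("skiing", "snowboarding"))
loops.record(("skiing", "alps"))

# Only one unit of processing is available right now.
done = loops.process_explicit(lambda pair: pair[1].upper(), budget=1)
print(done, len(loops.explicit_queue), len(loops.implicit_relationships))
```

Both interactions are visible through the implicit path at once, while one of them is still waiting in the explicit queue for the full analysis.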
Facet analysis method
Extracting input data
FIG. 5 illustrates operations 900 in one particular aspect of the invention, including operations to extract 804 input data and some preliminary steps thereof, as discussed generally with reference to FIG. 4.
Identifying structure markers
Structure markers may be identified (902) within the training set 802 to indicate where input data may be extracted from the training set. The structure markers may include a source structure schema. Structure markers may exist in the content container 304 and may include, but are not limited to, the title of the document, descriptive meta tags associated with the content, hyperlinks, relationships between tables in a database, or the prevalence of keywords 308 in the content container. The markers may be identified by the domain owner or others.
The operation 900 may be configured with default structure markers that are applicable across domains. For example, the URL of a Web page may be a common structure marker for the content node 302. As such, operation 902 may be configured with a number of default structure patterns that apply in the absence of explicit references to those areas in the source structure schema.
The structure markers may be located explicitly in the input data or in surrogates for the input data. For example, relationships among the content nodes 302 may be used as a surrogate structure marker for concept relationships.
In one embodiment, the structure markers may be combined to generate logical inferences about the source structure schema. If the concept relationships are not explicit in the source structure schema, they can be inferred from structure markers (e.g., concept signatures associated with the content nodes 302) and the set of content node relationships. For example, as further described, the concept signature may be a title in a document that is mapped as a surrogate for the concept to be defined. Content node relationships may be derived from structural links (e.g., hyperlinks connecting web pages) between content nodes 302.
The connection of concept signatures to content nodes 302, together with the connection of content nodes 302 to other content nodes 302, may be used to infer concept relationships between the intersecting concepts. These relationships may form additional (explicit) input data.
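The inference just described can be sketched as follows, treating page titles as concept signatures and hyperlinks as content node relationships. The function name, the sample URLs, and the choice to treat the linking page as the first member of each inferred pair are assumptions for illustration; the description does not fix how direction is assigned.

```python
def infer_concept_relationships(signatures, links):
    """Map content-node links onto the concepts carried by those nodes.

    signatures: dict of content node address -> concept label (or None)
    links: iterable of (source address, target address) pairs
    """
    relationships = set()
    for source, target in links:
        first = signatures.get(source)
        second = signatures.get(target)
        # Only nodes that carry a signature can contribute a relationship.
        if first and second and first != second:
            relationships.add((first, second))
    return relationships


# Hypothetical training set: three pages, one without a usable title.
signatures = {
    "/sports": "Sports",
    "/sports/skiing": "Skiing",
    "/about": None,
}
links = [("/sports", "/sports/skiing"), ("/sports", "/about")]

inferred = infer_concept_relationships(signatures, links)
print(inferred)
```

The link to the page without a signature produces no concept relationship, which shows why both kinds of structure marker are needed for the inference to work.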
There are many different ways known to those of ordinary skill in the art to identify a structural marker.
Mapping source structure patterns to system input patterns
The source structure schema can be mapped (904) to an input schema. In one embodiment, the input schema may include a set of concept signatures 906, a set of concept relationships 908, and a set of content nodes 302.
This schema design is representative of the transformation process and is not intended to be limiting. The input operation does not require source data for every data element in the system input schema, so that very simple structures can be accommodated.
The system input schema may also be extended to map to each element in the system data transformation schema. The system data transformation schema may correspond to each data entity present in the transformation process. That is, the system input schema may be extended to map to each data entity in the system. In other words, the source structure patterns may comprise a subset of the system input patterns.
Furthermore, the domain owner can map source data patterns from very complex structures. As one example, tables and attributes of a relational database may be modeled as a faceted hierarchy at various levels of abstraction and mapped to a multi-level structure of system data transformation schemas.
As such, the operations of analysis engine 204a and build engine 208a provide a data structure transformation engine and can achieve significant new utility in transforming one type of complex data structure (such as that modeled in a relational database) into another type of complex data structure (the complex dimensional structure produced by the methods and systems described herein). Product catalogs provide one example of complex data structures that benefit from this kind of complex-to-complex transformation. More information regarding the example data transformation schema shown in FIG. 30 is provided below.
Extracting input data
An input data mapping may be applied to the training set to map its source structure pattern to the input pattern, thereby extracting (804) the input data. As is known in the art, one embodiment of the present invention uses XSLT to encode a data map used to extract data from a source XML file.
The extraction method varies depending on many factors, including the parameters of the source structure schema and the location of the structure markers. For example, if the concept signature is precise (as with a document title, keyword-based meta tags, or database key fields), the signature may be used directly to represent the concept label. For more complex signatures (such as the prevalence of keywords in the document itself), common text mining approaches can be used. A simple method bases keyword extraction on a count of the most prevalent keywords in the document. Within the broad field of information extraction and text mining, there are many other extraction methods known to those of ordinary skill.
Once extracted, the input data may be stored in one or more storage devices coupled to the analysis engine 204a. For convenience, the figures and descriptions contained herein refer to the data store 910 as a storage device, but other stores may be used.
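The simple counting method mentioned above can be sketched in a few lines. The stop-word list, the tokenizing pattern, and the function name top_keywords are assumptions made for this example; a practical text mining pipeline would typically add stemming and a fuller stop-word list.

```python
import re
from collections import Counter

# A deliberately small, hypothetical stop-word list.
STOP_WORDS = frozenset(
    {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}
)


def top_keywords(text, n=3):
    """Return the n most frequent non-stop-words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]


document = (
    "Skiing in the Alps. Alpine skiing is popular, "
    "and skiing lessons in the Alps fill quickly."
)

print(top_keywords(document, 2))
```

The resulting keywords could then stand in as the concept signature for a document that has no precise title or meta tags.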
For example, the domain data store 706 can be used, particularly where the computing environment is a controlled environment.
The system input data may be split into their constituent sets and passed to subsequent processes in the transformation engine.
The conceptual relationships are the inputs (A) for the source structure analysis described below and illustrated in FIG. 6.
The concept signatures may be processed to extract a preliminary concept definition (B) described below and illustrated in fig. 7.
The content node may be processed as system output data (C) described below and illustrated in fig. 17.
Extracting input data from a source data structure as described above is one of many embodiments that may be used to extract input data. In one embodiment, another primary input channel to the analysis engine 204a is the feedback loop of the complex-adaptive system. Through it, user interactions 212a (O) are returned to provide more input data. Details of this input data channel shown in fig. 27 and the feedback loop including the complex-adaptive system are described below.
Processing source data structure
FIG. 6 illustrates source data structure processing to extract source structure analysis in a particular aspect of the invention. The source data structure analysis may provide data relating to the topology of the source data structure. The topology of the source data refers to the technical feature set of the source data structure (features such as the number of nodes contained in the structure and the dispersion pattern of the relationships between nodes in the source data structure) that describes its shape.
One primary purpose of this analysis method is to measure the degree to which concepts 306 are generic or specific (with respect to other concepts 306 in training set 802). A measure of the relative generality or specificity of a concept is referred to herein as "commonality". The source data characteristics analyzed in one embodiment are described below. The details regarding the analysis and features vary with the source data structure.
The concept relationships 908 may be assembled for analysis. Cyclic relationships between concepts 306 (which indicate that non-hierarchical relationships are present) may be identified (1002) and resolved.
All concept relationships identified by the system as non-hierarchical may be pruned (1004) from the set. The pruned concept relationships are not involved in subsequent processing, but may be made available for processing based on different transformation rules.
The unpruned concept relationships may be treated as hierarchical relationships. The system may assemble (1006) these concept relationships into an input concept hierarchy 1008, in which all hierarchical concept relationships are ordered into extended sets of indirect relationships. Assembling the input concept hierarchy 1008 may involve ordering the nodes in the aggregate and removing any redundant relationships that can be inferred from other sets of relationships. The input concept hierarchy 1008 may include a polyhierarchy structure in which an entity may have multiple direct parents.
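The pruning and assembly steps can be sketched on a small set of parent-child pairs. This is one possible reading under stated assumptions, not the patented procedure: every relationship that lies on a cycle is treated as non-hierarchical and pruned, and a direct relationship is treated as redundant when the child remains reachable from the parent without it. The function names and sample concepts are hypothetical.

```python
def reachable(edges, start):
    """Return all nodes reachable from start along parent -> child edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in children.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def build_hierarchy(relationships):
    """Return (hierarchy edges, pruned cyclic edges)."""
    edges = set(relationships)
    # A relationship lies on a cycle if its parent is reachable from its child.
    cyclic = {(p, c) for p, c in edges if p == c or p in reachable(edges, c)}
    kept = edges - cyclic
    # A direct relationship is redundant if an indirect path already implies it.
    redundant = {
        (p, c) for p, c in kept if c in reachable(kept - {(p, c)}, p)
    }
    return kept - redundant, cyclic


relationships = {
    ("sports", "winter sports"),
    ("winter sports", "skiing"),
    ("sports", "skiing"),   # implied by the two relationships above
    ("skiing", "snow"),     # forms a cycle with the next pair
    ("snow", "skiing"),
}

hierarchy, pruned = build_hierarchy(relationships)
print(sorted(hierarchy))
print(sorted(pruned))
```

The pruned pairs are returned rather than discarded, in keeping with the statement that pruned relationships may be made available for processing under different transformation rules.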
Once assembled, the input concept hierarchy 1008 may provide the structure for measuring the commonality of the concepts 306 in the set of concept relationships, and may be used by other methods in the transformation process, as described in the following steps. As described below and shown in fig. 9-10, the concept relationships in the input concept hierarchy 1008 may be used to calculate potential morpheme relationships (D). As described below and shown in FIG. 17, the concept relationships in the input concept hierarchy may also be used to process the system output data (E).
The analysis of the input concept hierarchy 1008 can proceed to measure (1010) the commonality of each concept. Again, commonality refers to how general or specific any given node is with respect to the other nodes in the hierarchy 1008. Each concept 306 can be assigned a commonality measure based on its position in the input concept hierarchy 1008.
A weighted average degree of separation of each concept 306 from each root of the trees that intersect the concept 306 may be calculated. The weighted average degree of separation refers to the distance of each concept 306 from the concepts 306 at the root nodes. The concept 306 that is explicitly the root node is given a unique commonality measure. The commonality measure increases for more specific concepts 306, reflecting their increased degree of separation from the most general concepts 306, which reside at the root nodes. Those skilled in the art will recognize that many other commonality measures are possible.
The commonality measures for each concept 306 may be stored in a concept commonality index 1012 (e.g., in the data store 910). As described below and shown in fig. 12-13, the concept commonality index 1012 can be used to infer a set of commonality measures (F) for the morphemes.
The method described in one embodiment may be applied to hierarchical relationships, also referred to as parent-child relationships. Parent-child relationships encompass a wide variety of relationship types. Examples include: whole-part, genus-species, type-instance, and type-subclass. In other words, by supporting hierarchical relationships, the present invention is applicable to a wide range of classification tasks.
Processing preliminary concept definitions
FIG. 7 illustrates a keyword extraction method to generate a preliminary concept definition. One primary purpose of this process is to generate structural definitions for concepts 306 in terms of keywords 308. In one embodiment, the concept definitions may be described as "preliminary" at this stage because they will be modified at a later stage.
One of ordinary skill in the art will recognize that there are many methods and techniques for this goal that involve extracting keywords 308 as structural representations of concepts 306.
In one embodiment, the level of abstraction applicable to keyword extraction may be limited. These constraints can be designed to derive keywords with the following property: keywords are delineated at the level of atomic concepts, based on the independence of words within the direct relationship set, where those concepts also reside in other regions of the training set.
Concept signatures 906 and concept relationships 908 may be aggregated for analysis. In one embodiment, this process is based on the extraction of textual entities. Thus, in the following description, concept signature 906 may be assumed to map directly to the concept label assigned to concept 306.
When a label is identified in concept signature 906, the relevant portion of the text string may be extracted and used as concept label 306a. In subsequent methods, when keywords 308 and morphemes 310 are identified in concepts 306, keyword labels 308a and morpheme labels 310a may be extracted from the relevant portions of concept labels 306a.
Eventually, these domain-specific tags can be written to the output data. If operation 800 transforms a data structure that has been previously analyzed and classified, entity tags may be available directly in the source data structure.
Note that this juncture between concept signature and concept label extraction represents an integration point for a wide variety of entity extraction tools, covering many types of content nodes 302, such as images, multimedia, and physical objects.
A series of keyword descriptors may be identified in the concept label. Preliminary keyword ranges may be parsed 1102 from the concept labels 306a based on common structural descriptors of the keywords 308 (such as parentheses, quotation marks, and commas). Whole words may then be parsed 1104 from the preliminary keyword ranges, again using common word descriptors such as spaces and grammatical symbols. These pattern-based approaches for text entity parsing are well known in the art.
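The two parsing passes can be sketched with regular expressions. This is an illustration only: the particular delimiter sets and the function names are assumptions, not taken from the disclosure.

```python
import re

# Structural descriptors that bound a preliminary keyword range (assumed set).
RANGE_DELIMITERS = re.compile(r'[()\[\]",;]')
# Anything that is not a word character, apostrophe, or hyphen separates words.
WORD_DELIMITERS = re.compile(r"[^\w'-]+")


def preliminary_keyword_ranges(label):
    """Split a concept label into preliminary keyword ranges."""
    parts = RANGE_DELIMITERS.split(label)
    return [p.strip() for p in parts if p.strip()]


def words_in_range(keyword_range):
    """Split one preliminary keyword range into whole words."""
    return [w for w in WORD_DELIMITERS.split(keyword_range) if w]
```

The words returned by the second pass would then feed the word independence check described next.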
The words parsed from the preliminary keyword ranges 1102 may form one set of inputs for the next stage in the keyword extraction process. The other input may be the set of direct concept relationships 1106. The set of direct concept relationships 1106 can be derived from the set of concept relationships 908. The set of direct concept relationships 1106 may include all direct relationships (all direct parent relationships and all direct child relationships) for each concept 306.
These inputs are used to check the independence of words in the preliminary keyword ranges 1108. The independence of individual words within the direct relationship set 1106 may serve as a descriptor for the keywords 308. After the keyword ranges are delineated, a check may be performed to ensure that all portions of the derived keywords 308 are valid. In particular, all portions of the concept label 306a delineated as keywords 308 should pass the word independence test.
In one embodiment, the check for word independence may be performed based on a stem (or root) matching method, hereinafter referred to as "stemming". Many stemming methods are known in the art. As described below in the morpheme extraction method shown in FIG. 8, stemming provides a very fine basis for classification.
Based on the word independence in the preliminary keyword range, additional sets of potential keyword descriptors 1110 may be identified. In short, a word may delineate a keyword if the word is present with other words in one concept label 306a, but these same words are absent in related concept labels 306a.
However, before parsing the concept label 306a into keyword labels 308a based on these keyword descriptors, candidate keyword labels may be validated (1112). All candidate keyword tags are generally required to pass the above-mentioned word independence test. This check prevents the keyword extraction process from splitting concepts outside the target abstraction level (i.e., atomic concepts).
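One reading of the word independence test can be sketched as follows. The suffix-stripping stemmer is a deliberately crude stand-in for a real stemming method, and the function names and the exact independence criterion are assumptions of this illustration.

```python
def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemming method."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def independent_words(label_words, related_labels):
    """Return the words of a label that occur in a directly related label
    without any of the label's other words."""
    stems = [crude_stem(w) for w in label_words]
    result = []
    for word, stem in zip(label_words, stems):
        others = set(stems) - {stem}
        for related in related_labels:
            related_stems = {crude_stem(w) for w in related}
            if stem in related_stems and not (others & related_stems):
                result.append(word)
                break
    return result
```

Here `related_labels` is assumed to hold the already-parsed words of each directly related concept label.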
Once the preliminary set of keyword labels is generated, the system may examine all preliminary keyword labels in the aggregate. The intent here is to identify compound keywords 1114. Compound keywords may exist as multiple valid keyword labels within a single concept label 306a. This test can be directly based on the atomic concept target as the scope of concept-keyword abstraction.
In one embodiment, recursion may be used to exhaustively split the set of compound keywords into the most elementary set of keywords 308 supported by the training set 802.
If compound keywords remain in the evolving collection of keyword labels, an additional set of potential keyword descriptors can be generated (1110), where matching keywords are used to locate the descriptors. Again, the delineated keyword ranges can be checked as valid keywords, the keywords extracted, and the process repeated until no more compound keywords can be found.
A final round of consolidation can be used to disambiguate keyword labels across the entire domain. Disambiguation is a well-known requirement in the art, and many approaches to it exist. In general, disambiguation is used to resolve the ambiguities that appear when entities share the same labels.
In one embodiment, disambiguation may be provided by consolidating keywords that share the same label into a single structural entity. In particular, if keywords share labels and have intersecting sets of direct concept relationships, there may be a basis for associating the keyword labels with a single keyword entity.
Alternatively, the constraints of this approach to disambiguation may be relaxed. In particular, by removing the criterion of intersecting sets of direct concept relationships, all shared keyword labels in a domain can be joined to the same keyword entity. This approach is useful when the domain is relatively small or very focused in its subject matter. Alternatively, the set of concept relationships used in this disambiguation approach may be broadened to cover a wider range of direct and indirect concept relationships. Many methods of disambiguation are known in the art.
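Both the strict and the relaxed disambiguation criteria can be sketched in a few lines. The data layout (a keyword label paired with the identifiers of its directly related concepts) and the `require_intersection` flag are assumptions of this illustration.

```python
def consolidate_keywords(occurrences, require_intersection=True):
    """Merge keyword occurrences that share a label into keyword entities.

    occurrences: iterable of (label, related_concept_ids).
    With require_intersection=True, occurrences are merged only when their
    direct concept relationship sets intersect; with False, every shared
    label is joined to a single entity (the relaxed criterion).
    """
    entities = []  # each entry is [label, set_of_related_concepts]
    for label, concepts in occurrences:
        concepts = set(concepts)
        merged = [
            e for e in entities
            if e[0] == label and (not require_intersection or e[1] & concepts)
        ]
        for e in merged:
            concepts |= e[1]
            entities.remove(e)
        entities.append([label, concepts])
    return [(label, frozenset(c)) for label, c in entities]
```

The relaxed setting corresponds to the small or narrowly focused domains mentioned above, where a shared label is unlikely to denote different things.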
The result of this keyword extraction method may be a set of keywords 1118 abstracted to the "atomic concept" level. Keywords are associated 1120 with the concepts 306 from which they were derived as preliminary concept definitions 708a. These preliminary concept definitions 708a may later be expanded to include morpheme entities in their structure, thereby expanding to deeper and more basic levels of abstraction. As described further below, these preliminary concept definitions can be further expanded to take advantage of the implicit properties of keywords and morphemes that are manifested by concept relationships in the input data.
The entity 708a derived from this process may pass on to subsequent processes in the transformation engine described in this disclosure. Preliminary concept definition 708a is an input to the morpheme extraction process (G) described below and illustrated in fig. 8, and the output data process (H) described below and illustrated in fig. 17.
Extraction of morphemes
In traditional faceted classification, attributes for facets may generally be limited to concepts that may be identified using human cognition and associated with other concepts. As a result, the attributes can be viewed as atomic concepts, since the attributes constitute concepts without requiring deeper context.
The methods described herein can use statistical tools across large datasets to identify elementary (linguistic) irreducible conceptual attributes and their relationships. At this level of abstraction, many attributes will not be recognizable by human classifiers as concepts.
FIG. 8 illustrates a method that may be used to parse morphemes 310 and associate the morphemes with keywords 308 to extend preliminary concept definition 708a. The morpheme extraction method may continue from the method of generating a preliminary concept definition described above and illustrated in FIG. 7.
Note that in one embodiment, the morpheme extraction method may have elements in common with the keyword extraction method. Where these methods overlap, morpheme extraction is described here in less detail.
A pool of keywords 1118 and a set of direct concept relationships 1106 may be input to this method.
Patterns may be defined to serve as criteria for identifying morpheme candidates 1202. These patterns may establish parameters for stemming, as is well known in the art, and may include patterns for whole word and partial word matching.
As with keyword extraction, the set of direct concept relationships 1106 can provide context for pattern matching. The patterns may be applied (1204) to the pool of keywords 1118 within the sets of direct concept relationships in which the keywords appear. A shared root set based on a stemming pattern can be identified (1206). The shared root set may include a set of morpheme root candidates 1208 for each keyword.
The morpheme candidates for each keyword may be compared to ensure that they are consistent with each other (1210). It may be assumed that the roots residing within the context of the same keyword and the set of direct concept relationships in which the keyword occurs have overlapping roots. In addition, it is assumed that the elementary roots derived from the intersection of these overlapping roots will remain within the parameters used to identify valid morphemes.
This validation check may provide a method for correcting existing errors (a common fault with stemming methods) when applying pattern matching to identify potential morphemes. More importantly, verification can constrain excessive morpheme splitting and can provide a level of abstraction that is contextually meaningful but still basic.
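The identification and verification of shared roots can be sketched using prefix matching as a simple stand-in for a stemming pattern. The minimum root length, the function names, and the use of common prefixes are all assumptions of this illustration; real stemming patterns would be more elaborate.

```python
from os.path import commonprefix

MIN_ROOT_LENGTH = 3  # hypothetical pattern parameter


def root_candidates(keyword, context_keywords):
    """Collect shared roots between a keyword and the keywords found in its
    direct concept relationship context."""
    candidates = set()
    for other in context_keywords:
        prefix = commonprefix([keyword.lower(), other.lower()])
        if len(prefix) >= MIN_ROOT_LENGTH:
            candidates.add(prefix)
    return candidates


def consistent_root(candidates):
    """Return the elementary root shared by all candidates, or None when the
    candidates do not overlap within the pattern parameters."""
    if not candidates:
        return None
    root = commonprefix(sorted(candidates))
    return root if len(root) >= MIN_ROOT_LENGTH else None
```

A `None` result marks a candidate set as inconsistent, so its members would be removed before the next round of pattern matching.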
The series of constraints on the extraction of morphemes and keywords designed in one embodiment may also provide a negative feedback mechanism within the context of a complex-adaptive system. In particular, these constraints may serve to reduce complexity and manage it within set parameters for classification.
Through this morpheme verification process, any inconsistent morpheme root candidates may be removed from the keyword set (1212). The process of pattern matching to identify morpheme candidates may be repeated until all inconsistent candidates are removed.
The consistent morpheme candidate set may be used to derive morphemes associated with a keyword. As with the keyword extraction method, the descriptors may be used to extract morphemes (1214). By examining the group of potential roots, one or more morpheme descriptors may be identified for each keyword.
Morphemes may be extracted based on the position of the descriptor within each keyword tag (810). More important is the process of deriving one or more morpheme entities to provide structural definitions to keywords. Keyword definitions may be constructed by correlating (or mapping) morphemes with the keywords from which they are derived (1216). These keyword definitions may be stored in domain data store 706.
Morphemes may be classified by morpheme type (e.g., free, bound, inflectional, or derivational) (1218). In later stages of the construction process, the rules for constructing concepts may vary based on the type of morphemes involved and whether those morphemes are bound to other morphemes.
Once typed, the extracted morphemes may include a pool of all morphemes in the domain 1220. These entities may be stored in a morpheme dictionary 206 of the system.
A persistent inventory of each morpheme label may be maintained to inform future rounds of morpheme parsing (for more information, see overview of data structure transformation shown in fig. 33 above).
As described below and shown in fig. 9-10, the morphemes derived from this process may be passed to subsequent processes in the transformation engine to process the morpheme relationships (I).
Those skilled in the art will recognize that there are many algorithms that may be used to find and extract keyword definitions that include morphemes.
Calculating morpheme relationships
Morphemes may provide one elementary set of constructs that anchor the multi-level faceted data structure of the system. The other elementary constructs may be morpheme relationships. As discussed above and illustrated in fig. 3, 18-19, morpheme relationships provide a strong basis for creating dimensional concept relationships.
However, the challenge is to identify relationships that are truly morphemic amid the ambiguity and noise present in classification data. The multi-level architecture of the present invention provides a solution to this challenge. By validating relationships across multiple sets of extractions, ambiguity is successively pruned.
The following section addresses finding morpheme relationships. In particular, in this particular aspect of the invention, a pattern expansion method is used to remove noise to enhance the statistical identification of elementary structures.
Overview of potential morpheme relationships
FIG. 9 illustrates a method for inferring potential morpheme relationships from conceptual relationships of a training collection.
The potential morpheme relationships can be computed to examine the popularity of the individual potential morpheme relationships in the aggregate of all concept relationships. Based on this examination, a statistical test may be applied to identify morpheme-candidate relationships that are highly likely to hold in the context in which all of their conceptual relationships exist.
In one embodiment of the system of the present invention, potential morpheme relationships may be constructed as an arrangement of all relationships that may exist between morphemes in related concepts, with the parent-child directionality of the relationships preserved.
In the example in FIG. 9, a portion of the input concept hierarchy 1008 shows a relationship between two concepts. The parent concept and its related children concepts may contain the morphemes { A, B } and { C, D }, respectively.
Likewise, concepts may be defined in terms of one or more morphemes (grouped via keywords in one embodiment). As a result, any relationship between two concepts will mean at least one (and often a plurality) of relationships between the morphemes that define the concepts.
In this example, the process of computing potential morpheme relationships is illustrated. Four potential morpheme relationships 812a may be inferred from a single conceptual relationship. Maintaining the parent-child directionality established by the conceptual relationship, and without allowing any duplication, four potential morpheme relationships can be derived: A→C, A→D, B→C, B→D.
In general, if a parent concept contains x morphemes and a child concept contains y morphemes, there will be x × y potential morpheme relationships: the number of potential morpheme relationships is the product of the number of morphemes in the parent and child concepts.
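This basic calculation amounts to a Cartesian product of the parent and child morpheme sets. The sketch below assumes morphemes are represented as strings; the function name is hypothetical.

```python
from itertools import product


def potential_morpheme_relationships(parent_morphemes, child_morphemes):
    """Pair every parent morpheme with every child morpheme, keeping the
    parent-child direction and dropping duplicate pairs."""
    pairs = product(parent_morphemes, child_morphemes)
    return list(dict.fromkeys(pairs))  # preserves order, removes duplicates
```

For a parent with morphemes {A, B} and a child with morphemes {C, D}, the sketch yields the four directed pairs listed above.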
In one embodiment, this simple description of computing morpheme relationships may be refined to improve the statistical indicators generated. These refinements (i.e., aligning morphemes) are set out below in the description of the potential morpheme relationship calculation method illustrated in FIG. 10.
These refinements to the basic method of identifying potential morpheme relationships may be used to reduce the number of potential morpheme relationships. This reduction, in turn, may reduce the amount of noise, thereby extending the pattern of identifying morpheme relationships and making the statistical identification of morpheme relationships more reliable.
Likewise, one of ordinary skill in the art will recognize that there are many algorithms that may be used to derive potential morpheme relationships from a given set of conceptual relationships.
Method for calculating potential morpheme relationship
FIG. 10 presents one embodiment of a process for computing potential morpheme relationships in greater detail.
The intent here is to generate a set of potential morpheme relationships that can later be analyzed to assess the likelihood that they are truly morphemic in nature (that is, they hold in every context in which they occur).
The current method of computing potential morpheme relationships continues from the source structure analysis method (D) described above and illustrated in FIG. 6.
The method also continues from the morpheme extraction method (I) described above and illustrated in fig. 8.
The input to this method of determining potential morpheme relationships may be a morpheme pool 1220 extracted from a domain and an input concept hierarchy 1008 containing a verification set of concept relationships from the domain.
The morphemes within each conceptual relationship pair may be aligned 1404 to reduce the number of potential morpheme relationships that may be inferred. In particular, if two morphemes are aligned, they may not be combined with any other elements within the same conceptual relationship pair. Through alignment, the number of candidate morpheme relationships may be reduced.
In one embodiment, axes may be aligned based on shared morphemes, including all morphemes bound to the shared morphemes. For example, if one concept is "Politics in Canada" and another concept is "International Politics," the shared morphemes in the keyword "politics" may be used as a basis for alignment.
Axes may also be aligned based on existing morpheme relationships within the morpheme dictionary. In particular, if any given potential morpheme relationship can be represented by a morpheme relationship in the morpheme dictionary (constructed directly or indirectly using a set of morpheme relationships), the potential morpheme relationships can be aligned on this basis.
An external dictionary (not shown in FIG. 10) may also be used to guide the alignment of the potential morpheme relationships. WordNet™ is one dictionary that may be suitable for alignment. Various information contained in the external dictionary may be used as a basis for guidance. In one embodiment, keywords may first be grouped by part of speech; the potential morpheme relationships are then constrained to combine only within these grammatical groupings. In other words, the alignment may be based on grammatical parts of speech, as directed by an external dictionary. An external dictionary may also be used to infer direct morpheme relationships as a basis for alignment.
The potential morpheme relationships can be computed 812 as all combinations of the morphemes that are not involved in the aligned sets. This calculation is described above and illustrated in fig. 9.
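The effect of alignment on the number of potential relationships can be sketched for the simplest case, alignment on shared morphemes. This illustration assumes that aligned morphemes are simply excluded from the combinations; alignment via the morpheme dictionary or an external dictionary is not modeled, and the function name is hypothetical.

```python
from itertools import product


def aligned_potential_relationships(parent_morphemes, child_morphemes):
    """Combine only the morphemes that are not aligned.

    Morphemes shared by the parent and child concepts are treated as aligned
    with each other and are excluded from the combinations.
    """
    shared = set(parent_morphemes) & set(child_morphemes)
    parents = [m for m in parent_morphemes if m not in shared]
    children = [m for m in child_morphemes if m not in shared]
    return list(product(parents, children))
```

Using the morphemes of the example concepts above, aligning on the shared morpheme leaves a single potential relationship instead of four.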
The resulting set of potential morpheme relationships 1406 may be maintained in the domain data store 910. Here, the inventory of potential morpheme relationships may be tracked as they exist in the training collection and pruned through subsequent analysis stages.
The potential morpheme relationships derived from this process may be passed to a process for pruning and morpheme relationship assembly (J) as described below and illustrated in fig. 11-13.
Pruning potential morpheme relationships
The pool of potential morpheme relationships generated by the methods described above and illustrated in fig. 9-10 may be pruned into a set of candidate morpheme relationships.
The potential morpheme relationships may be pruned based on an evaluation of their popularity in the training set. Those potential morpheme relationships that are highly popular are more likely to be truly morphemic (i.e., the relationship holds in each context).
Furthermore, morpheme relationships may be assumed to be unambiguous in their relationship to more general related morphemes. The structural marker for ambiguity may be a multi-hierarchy, in which a morpheme has more than one parent. Morphemic relationships embody fewer attributes and so provide a more definite basis for relating morphemes. In this way, potential morpheme relationships may also be pruned when they exist in multiple hierarchies.
A morpheme-relational hierarchy may be constructed from a collection of morpheme-relational pairs that are also hierarchical. In this way, a pool of potential morpheme relationships may be analyzed in the aggregate to identify relationships that contradict this hierarchical assumption.
Candidate morpheme relationships retained after this pruning process may be assembled into morpheme hierarchies. Although each candidate morpheme relationship is a parent-child pair, the morpheme hierarchy may extend to multi-generation parent-child relationships.
FIGS. 11A and 11B illustrate the difference between a set of potential morpheme relationships and a pruned set of candidate morpheme relationships.
There are four potential morpheme relationship pairs (parent-child) in the hierarchy in FIG. 11A. The first three of these relationships are relatively popular in the domain, but the fourth relationship is relatively rare. Thus, the fourth pair is pruned from the set of potential morpheme relationships.
The first three relationship pairs in the set of potential morpheme relationships 1406 are also consistent with the hierarchical assumption. However, the fifth relationship 1502, which is bi-directional, conflicts with this assumption: the direction of the relationship D→C conflicts with the direction of the relationship C→D. This morpheme pair is reclassified as being related by an associative relationship and is removed from the set of candidate morpheme relationships 1504. FIG. 11B shows the pruned set of candidate morpheme relationships.
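The two pruning filters can be sketched together. This illustration assumes that popularity is a simple occurrence count compared against a fixed threshold; the threshold value and the function name are hypothetical, and a real embodiment might use a more elaborate statistical test.

```python
from collections import Counter


def prune_potential_relationships(observed_pairs, min_count=2):
    """Split observed potential relationships into hierarchical candidates and
    associative pairs, discarding rare relationships.

    observed_pairs: one (parent, child) tuple per occurrence in the training set.
    """
    counts = Counter(observed_pairs)
    prevalent = {pair for pair, n in counts.items() if n >= min_count}
    candidates, associative = set(), set()
    for parent, child in prevalent:
        if (child, parent) in prevalent:
            # Bi-directional pairs contradict the hierarchical assumption.
            associative.add(frozenset((parent, child)))
        else:
            candidates.add((parent, child))
    return candidates, associative
```

The bi-directional pair is kept in a separate set rather than discarded, reflecting its reclassification as an associative relationship.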
Assembling morpheme relationships
Merging morpheme relationships
FIG. 12 illustrates the union of morpheme-candidate relationships into a holistic morpheme multi-hierarchy. All pairs of morpheme-candidate relationships may be incorporated into a full set that connects logically consistent generation trees (described in more detail below).
This data structure may be described as "multi-hierarchical" in that it may result in singular morphemes that are involved in more than one direct relationship to more general morphemes (parent nodes). This multi-hierarchy can be transformed into a strict hierarchy (only a single parent node) in subsequent stages of the process.
The potential morpheme relationships retained after the pruning process (described above and illustrated in fig. 11B) may be aggregated into the set of candidate morpheme relationships 1504. The set of candidate morpheme relationships may be merged into a whole morpheme multi-hierarchy 1602.
In one embodiment, the constraints on the process of constructing the overall multi-hierarchy may be: 1) the set of morpheme-candidate relationships in the multi-hierarchy are logically consistent in the aggregate; 2) multi-hierarchy uses the minimum number of multi-hierarchy relationships required to create a logically consistent structure.
A recursive ordering algorithm may be used to assemble the tree and highlight conflicts and proposed resolution approaches. The logic of this algorithm is illustrated by the reasoning applied to the following example.
Based on relationship hierarchy #1, A is superior to (i.e., more general than) C. Based on hierarchy #2, B is superior to C. Based on hierarchy #3, A is superior to D. The four morphemes may be logically combined with A and B above C and A above D.
When there may be more than one logical ordering, the concept commonality index 1012 may be used to resolve the ambiguity (the concept commonality index is created by the source structure analysis method described above and illustrated in FIG. 6). This index may be used to compare morphemes to assess whether a morpheme is relatively more general or more specific than other morphemes (commonality is measured in terms of degree of separation from the root node).
In this example, both A and B are the highest nodes that are logically consistent based on the morpheme-candidate relationship. A and B are also both parents of C. Thus, a multi-tiered set of relationships may be generated at C. Since there is no information in the sample set that conflicts with the multi-tiered set of relationships, it can be assumed that the relationships are valid. Processing may continue to decompose multiple hierarchies in subsequent stages.
If new data indicates that A and B are instead related via indirect relationships, the system can immediately decompose the multi-hierarchy and order A and B in the same tree. The priority of A and B may be determined by the commonality index. Here, A has a lower commonality ranking than B and is therefore given a higher (more general) position in the resulting multi-hierarchy 1602.
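The merging step for the example above can be sketched as follows. This illustration only identifies the root morphemes, orders them by a supplied commonality index, and reports the morphemes left with multiple parents; it is not the recursive ordering algorithm itself, and the function name and data layout are assumptions.

```python
from collections import defaultdict


def merge_into_multi_hierarchy(pairs, commonality):
    """Merge candidate (parent, child) pairs and report roots and conflicts.

    commonality: dict mapping a morpheme to its commonality measure, where a
    lower value is assumed to mean a more general morpheme.
    """
    parents = defaultdict(set)
    nodes = set()
    for parent, child in pairs:
        parents[child].add(parent)
        nodes.update((parent, child))
    roots = sorted(
        (n for n in nodes if n not in parents),
        key=lambda n: (commonality.get(n, 0), n),
    )
    multi_parent = {n: sorted(ps) for n, ps in parents.items() if len(ps) > 1}
    return roots, multi_parent
```

With the three relationships of the example, A and B are both roots and C is reported as the morpheme with multiple parents.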
Morpheme multi-hierarchy assembly
FIG. 13 illustrates a methodology that may be used to assemble a morpheme multi-hierarchy from candidate morpheme relationships.
The morpheme multi-hierarchy can be assembled by analyzing the candidate morpheme relationships in the aggregate. As in the assembly of the input concept hierarchy, the goal is to join independent pairs of relationships into a unified whole.
The morpheme relationship assembly method may continue from the method J of computing potential morpheme relationships described above and illustrated in fig. 9-10.
A set 1406 of potential morpheme relationships may be an input to this method. The morpheme-candidate relationships may be ranked based on an analysis of conceptual relationships containing morphemes (1702). The conceptual relationships may be ranked (lowest to highest) based on the total number of morphemes in each conceptual relationship pair.
As the number of morphemes contained in a conceptual relationship pair decreases, the likelihood of a morpheme relationship may increase (because the probability for any given morpheme relationship candidate is divided by the number of potential candidates in the pair). Thus, in one embodiment, operations may prioritize the analysis of conceptual relationships with lower morpheme counts. Reducing the number of morphemes in the pair increases the chances of finding a truly morphemic relationship.
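The ranking and the intuition behind it can be sketched briefly. The uniform prior over the candidates in a pair is an assumption made for this illustration, as are the function names; both morpheme lists are assumed to be non-empty.

```python
def rank_concept_relationships(relationships):
    """Order (parent_morphemes, child_morphemes) pairs from the lowest to the
    highest total morpheme count."""
    def morpheme_count(relationship):
        parent_morphemes, child_morphemes = relationship
        return len(parent_morphemes) + len(child_morphemes)

    return sorted(relationships, key=morpheme_count)


def candidate_prior(parent_morphemes, child_morphemes):
    """Chance that any one potential relationship in the pair is the morphemic
    one, assuming all candidates are equally likely."""
    return 1.0 / (len(parent_morphemes) * len(child_morphemes))
```

A pair with one morpheme on each side yields a single candidate with a prior of 1.0, which is why such pairs are analyzed first in this sketch.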
Parameters defining statistically relevant boundaries for morpheme relationships may be set (1704). These parameters may be based on the popularity of morpheme relationships in the aggregate. The aim is to identify morpheme relationships of high popularity in a domain. These constraints on morpheme relationships may also act as a negative feedback mechanism within a complex-adaptive system. The set of relationships can be analyzed in the aggregate (1706) to determine the overall popularity of the relationships. This analysis may incorporate statistical tools that operate within sensitivity parameters controlled by a system administrator. The exact parameters may be customized for each domain and may be changed by the domain owner and system administrator.
As with the conceptual relationship analysis, circular relationships (1708) may be used as a structural marker that negates the assumption of a hierarchical relationship. The potential morpheme relationships may be pruned if they do not pass the filters for popularity and hierarchy (1710).
The pruned set of potential morpheme relationships may comprise a set of candidate morpheme relationships 1504. As embodied in the concept commonality index 1012, the commonality of morphemes may be inferred from the commonality of the source structure concept (1010 a).
A concept embodying the lowest number of morphemes may be used as a proxy for the commonality of each morpheme. To illustrate the basis of this assumption, consider a concept that includes only one morpheme. Given the high degree of correlation between a concept and the individual morphemes that comprise it, it is likely that the commonality of the morpheme will be closely related to the commonality of the concept.
This inference guides the computation of morpheme generality in one embodiment. In particular, the system may collect a set of concepts that materialize a lowest number of morphemes in the aggregate. That is, the system may select a set of concepts that represents all of the morphemes in the collection.
The concept commonality index 1012 can be used to prioritize dimensional conceptual relationships and can be stored (not shown) in the domain data store 706.
Using the method described above and illustrated in fig. 12, morpheme hierarchies may be assembled into an entire multi-hierarchy (1712). This may include ordering nodes in the aggregate and removing any redundant relationships that may be inferred from other sets of indirect relationships. The concept commonality index created may be used to order the morphemes from most general to most specific.
One of ordinary skill in the art will recognize that there are many algorithms known in the art that can be used to consolidate hierarchical morpheme relationships into multiple hierarchies.
Assembling the morpheme hierarchy
FIGS. 14-16 illustrate the transformation of a morpheme multi-hierarchy into a morpheme hierarchy.
Morpheme multi-hierarchy attribution
Fig. 14A-14B illustrate morpheme attribution processes and example results. Attribution in this context refers to the manner in which facet attributes are sorted and assigned to data elements. Just as operations set constraints on entity extraction (such as keyword and morpheme extraction), explicit constraints on morpheme relationships can be used to build morpheme hierarchies.
The morpheme relationships linking morphemes into hierarchies are, by definition, morphemic. Morpheme entities are elementary and unambiguous. It is generally required that a morpheme have only one parent node. In a morpheme relationship set (morpheme hierarchy), a morpheme may exist in only one location.
Based on these definitions in a knowledge representation model, morphemes may be presented as attributes within a hierarchy of morpheme data. The knowledge representation model may thus provide faceted data and a multi-level enhanced faceted classification approach.
In the foregoing approach, the aggregation of candidate morpheme relationships may produce a morpheme multi-hierarchy 1802, which conflicts with these requirements. Attribution can therefore be used to assess these conflicts in the knowledge representation model and provide a solution 1804.
The attribution method in one embodiment may include finding a location for each morpheme in the hierarchy that does not conflict with morpheme hierarchy requirements.
Morphemes in a multi-level hierarchy may be raised to new locations within their original tree or moved to an entirely new tree. This attribution process ultimately defines the highest root morpheme node in the facet hierarchy. Thus, the root morpheme node in a morpheme hierarchy may be defined as a morpheme facet, where each morpheme is contained in a morpheme facet attribute tree.
The following discussion illustrates a method for removing multiple parent nodes using the concept of attributes.
Again, the structural marker for conflicts may be the presence of multiple parent nodes in morpheme multi-hierarchy 1802. To remove conflicts, morphemes having multiple parent nodes may be reorganized as attributes of an ancestor shared by those parent nodes.
An attribute class may be created to maintain the grouping of parent nodes originally shared by the reorganized morphemes, and to keep the morphemes in the attribute class separate from these parent nodes (where there is no unique ancestor, the method promotes the morpheme to the root level of the hierarchy as a new morpheme facet).
Relationships may be reorganized into attribute classes from the root node to the leaf nodes. Multiple parent nodes may first be reorganized into attributes so that single parent attributes may be identified. That is, a top-down traversal of morpheme relationships provides attribution that can be resolved into solution set 1804.
In general, if two morphemes share at least one parent attribute, they are siblings (associations) in the context of the shared parent node. Sibling children may be grouped under a single attribute class (note that children only need to share one parent; they need not share all parents). If the morphemes do not share at least one parent, they may be grouped into individual attributes that share an ancestor.
To select between the two alternatives, the relevance of the source relationships may be weighted. The measure of relational relevance is introduced above in the discussion of the source structure analysis shown in FIG. 6.
Starting from top to bottom, the transformation step can be decomposed as follows:
1. Sibling group { B, C, D, F, H } shares a single parent node. The individual nodes are checked for multiple parent nodes. In this case, none of the nodes has multiple parents, and thus there is no need to reorganize the relationships.
2. Morpheme E has multiple parents. The nearest single-parent ancestor of E is A. E needs to be reorganized as an attribute of A.
3. The parent attributes of E, { B, C, D, F, H }, are grouped under attribute class A1. E then becomes a sibling of A1 as an attribute of A.
4. Morpheme G also has multiple parents. It needs to be reorganized as an attribute of A as in steps 2-3. Furthermore, since E and G share at least one parent node, they may be grouped under a single attribute class A2.
5. Morpheme J has a single parent, H. There is no need to reorganize this parent-child relationship.
6. Morpheme K has multiple parents, E and G. The unique ancestor of E and G is now A2. K needs to be reorganized as an attribute of A2.
7. The parent attributes of K, { E, G }, are grouped under attribute class A2-1. K then becomes a sibling of A2-1 as an attribute of A2.
The end result is a morpheme hierarchy consistent with the assumptions of true morpheme attributes and morpheme relationships defined by the knowledge representation model of the present invention.
Morpheme hierarchy reorganization
FIG. 15 illustrates a recursive algorithm that may provide an attribution method in one embodiment. The core logic of this morpheme hierarchy reorganization may be the attribution method described above and illustrated in FIGS. 14A and 14B.
The input to this method may be the morpheme multi-hierarchy 1602 (K) described above and illustrated in FIGS. 11-13. The relationships may be sorted from the root node to the leaf nodes (1902). Each morpheme in the morpheme multi-hierarchy may be checked for multiple parent nodes. Here, the morpheme that is the focus of analysis is referred to as the active morpheme.
If there are multiple parent nodes, the set of parent nodes for the active morpheme may be grouped into a set of morpheme attribute classes (1906). Morpheme attribute classes may be used to guide how the morphemes in the reorganized tree should be ordered.
For each morpheme attribute class, a unique ancestor may be located without multiple parents (1908). An ancestor may be uniquely associated only with an attribute class (a parent group of nodes shared by morphemes).
If an ancestor exists, the system may create one or more virtual attributes (1910) to contain all morphemes within the morpheme attribute class. This node in the tree is called a "virtual attribute" because it is not directly associated with any morpheme and therefore is not included in any concept definition. It is a virtual attribute and not an actual attribute.
If an ancestor exists and one or more attributes are created, the active morphemes may be reorganized into attributes of the ancestor (1912), which are either directly related to the ancestor or grouped with other morphemes in the morpheme attribute class.
If there is no unique ancestor, the morpheme may be relocated to the root node (facet) in the tree (1914).
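The root-to-leaf traversal, ancestor lookup, and relocation steps (1902, 1908, 1912, 1914) can be sketched as follows. This is a deliberately simplified illustration and not the claimed method: it omits the creation of virtual attribute classes (1906, 1910), assumes that redundant relationships have already been removed, and uses a hypothetical function name (`reorganize`) and a hypothetical input format (a dictionary mapping each morpheme to the set of its parents).

```python
from collections import deque


def reorganize(parents):
    """Turn a multi-parent graph into a single-parent tree.

    `parents` maps each morpheme to the set of its parents (empty for roots).
    Returns a dict mapping each morpheme to one parent, or None for a facet root.
    """
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    children = {n: set() for n in nodes}
    pending = {n: len(parents.get(n, ())) for n in nodes}
    for child, ps in parents.items():
        for p in ps:
            children[p].add(child)

    # Visit morphemes from the roots down to the leaves (1902).
    queue = deque(sorted(n for n in nodes if pending[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in sorted(children[node]):
            pending[child] -= 1
            if pending[child] == 0:
                queue.append(child)

    tree = {}

    def lineage(node):
        # The node followed by its ancestors in the tree built so far.
        chain = [node]
        while tree[chain[-1]] is not None:
            chain.append(tree[chain[-1]])
        return chain

    for node in order:
        ps = sorted(parents.get(node, ()))
        if len(ps) <= 1:
            tree[node] = ps[0] if ps else None
            continue
        # Nearest ancestor shared by every parent (1908).
        others = [set(lineage(p)) for p in ps[1:]]
        shared = [a for a in lineage(ps[0]) if all(a in o for o in others)]
        # Reattach under that ancestor (1912) or promote to a root (1914).
        tree[node] = shared[0] if shared else None
    return tree


# Hypothetical multi-hierarchy, loosely modeled on the lettered example above.
example = {
    "A": set(), "X": set(),
    "B": {"A"}, "C": {"A"}, "D": {"A"},
    "E": {"B", "C"}, "G": {"C", "D"},
    "K": {"E", "G"},
    "Z": {"A", "X"},
}
tree = reorganize(example)
```

In this sketch K is attached directly to A, because no attribute class equivalent to A2 is created; a fuller implementation would insert the virtual attribute nodes described above between the ancestor and the regrouped morphemes.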
The system may also allow an administrator to manually modify (1916) the morpheme relationship pool and resulting morpheme rankings to refine or replace automatically generated results.
The end result of this process may be a morpheme hierarchy 402 that includes an initial morpheme hierarchy arrangement. As one of the elementary constructs of a data structure of a system, morpheme hierarchies can be used to classify and arrange entities into increasingly complex levels of abstraction.
Morpheme relationships in the morpheme hierarchy may be entered in morpheme dictionary 206. Morpheme tags may be assigned to morphemes based on tag popularity stored in the system. The most popular morpheme tag in the system can be used as the single representative tag for that morpheme.
The output of the method may be processed as system output data (L), as described below and shown in fig. 17.
Alternative ways of transforming multiple hierarchies into strict hierarchies may be used. A single parent node may be selected based on any of a plurality of weighting factors to remove multiple parent node scenarios. In a simple solution, the relationships of multiple parent nodes may be deleted.
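A minimal sketch of the weighted alternative follows. The weights, the function name `pick_single_parent`, and the alphabetical tie-breaking rule are all assumptions made for illustration; the text does not specify which weighting factors would be used.

```python
def pick_single_parent(weighted_parents):
    """Keep only the highest-weighted parent of each morpheme.

    `weighted_parents` maps child -> {parent: weight}. Ties are broken
    alphabetically so that the result is deterministic.
    """
    return {
        child: max(sorted(candidates), key=candidates.get)
        for child, candidates in weighted_parents.items()
        if candidates
    }


# Hypothetical weights for two morphemes that each have two parents.
chosen = pick_single_parent({
    "E": {"B": 0.2, "C": 0.7},
    "G": {"C": 0.5, "D": 0.5},
})
```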
FIG. 16A illustrates sample tree fragments from an assembled morpheme hierarchy. Each node in the tree (e.g., 2002a) may represent a morpheme in a morpheme hierarchy. Folder icons are used to indicate morphemes that are parents of the underlying nested related morphemes (morpheme relationships). The text (e.g., 2002b) next to each node is an associated morpheme label (in many cases, a partial word).
Faceted classification synthesis method
The process of constructing (or synthesizing) a dimension concept taxonomy 210 based on the enhanced facet taxonomy begins here. Such a classification may generate a dimensional conceptual relationship by examining a morpheme hierarchy with a set of concept definitions (specifically defined in terms of morphemes, with zero or more morphemes as morpheme attributes within the morpheme hierarchy).
The faceted classification method of the present invention can be applied at multiple data extraction stages. In this way, multiple domains may share the same elementary structure for classification while maintaining boundaries specific to the domain.
Processing a faceted data set
The following points summarize the steps involved in one aspect of preparing output data for synthesizing a hierarchical classification data structure (as described further below) according to an analysis operation:
For each domain to be classified, the data structure may be output as a domain-specific keyword hierarchy and a domain-specific set of concept definitions (specifically, defined in terms of domain-specific keywords, with zero or more domain-specific keywords as keyword attributes within the domain-specific keyword hierarchy).
The faceted data specific to the domains may be derived from an elementary construct shared across domains. The preliminary concept definition can be revised and significantly extended with new information. This is accomplished by comparing the information in the morpheme hierarchy to the original conceptual relationships in the training set.
In particular, the synthesis operation may assign concept definitions to content nodes based not only on analysis of explicit definitions provided by the domain owner, but also on analysis of all intersecting concepts and concept relationships in the aggregate. A preliminary definition of "explicit" attributes may be assigned, which is later supplemented with a richer set of attributes "implied" by the concept relationships that intersect the content nodes.
The candidate morpheme relationships may be assembled into an overall morpheme hierarchy that will serve as the data kernel for the faceted classification. A separate facet hierarchy for each domain may be created based on the unique intersection of keywords and their morphemes in each domain. This data structure may express morpheme hierarchies bounded by domain boundaries.
The faceted decomposition may be expressed in the vocabulary of the domain (its unique set of keywords) and may include only those morpheme relationships factored into the domain. The faceted taxonomy for each domain may be output as a set of concept definitions for that domain and facet hierarchy.
Thus, in one embodiment, domain-specific facet hierarchies may be inferred from the convergent morpheme hierarchy. It may provide a richer set of facets for smaller domains. It may build on a shared experience of multiple domains (errors that may be present in smaller domains may be corrected) and it may facilitate faster processing of domains.
In another embodiment, the system may create a unique facet hierarchy for a domain directly based on the above-described methods shown in FIGS. 14-15. In this embodiment, the attribute hierarchical assembly process may be directly applied to domain-specific keywords extracted from each domain.
In yet another embodiment, the synthesis operation may be based on data assembled from other conventional classification means. Such classification means may include faceted data prepared for traditional faceted classification synthesis as well as concepts defined by a strict set of attributes as used in formal concept analysis. These and other supplemental classification methods are well known to those skilled in the art.
FIGS. 16A-16B illustrate a tree fragment from an assembled morpheme hierarchy 2002 (as described above) and a tree fragment from a domain-specific keyword hierarchy 2004 as derived in one embodiment. Note that in the tree fragment for keyword hierarchy 2004, the text next to each node (e.g., 2004b), representing the associated keyword label, is a complete word as it would appear in the domain. In addition, the tree fragment for keyword hierarchy 2004 may be a subset of the tree fragment for morpheme hierarchy 2002, reduced to include only those nodes that are relevant to the domain from which the keyword hierarchy is derived.
FIG. 17 illustrates operations to prepare output data for the enhanced faceted classification method.
The output data may include revised concept definitions and keyword hierarchies for the domains. A keyword hierarchy may be based on the morpheme hierarchy.
The inputs to the process may be a set of content nodes 302 to be classified, an input concept hierarchy 1008, a morpheme hierarchy 402, and preliminary concept definitions 708a. The corresponding operations C, E, L and H to generate or otherwise obtain these inputs are described above.
The intersection of the input concept relationships and the morpheme attributes within the preliminary concept definitions 708a may be used (2102) to revise the preliminary concept definitions 708a into revised concept definitions 708b. In particular, if a concept relationship in the source data cannot be inferred from the morpheme hierarchy, the concept definition may be extended to provide attributes "implied" by the concept relationship. The result is a revised set of concept definitions 708b.
A set of related morpheme relationships in the morpheme hierarchy from the set of all morphemes participating in the domain may be identified (2106).
The morphemes in the reduced, domain-specific version of the morpheme hierarchy may be labeled using keywords from the domain (2108). For each morpheme, the signature keyword that uses the morpheme most frequently may be selected, so that each morpheme is assigned its most popular keyword label. An individual keyword may be limited to appearing once in the facet hierarchy. Once a keyword is used as a signature keyword, it may not be used as a substitute label for other morphemes.
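The labeling step (2108) can be sketched as a greedy assignment. The usage counts, the function name `assign_signature_keywords`, and the processing order (dictionary insertion order) are assumptions made for this illustration; the text does not state how competing morphemes are ordered.

```python
def assign_signature_keywords(usage):
    """Label each morpheme with its most frequent keyword, using each keyword once.

    `usage` maps morpheme -> {keyword: number of times the keyword uses it}.
    """
    taken = set()
    labels = {}
    for morpheme, counts in usage.items():
        # Most frequent first; alphabetical order breaks ties.
        candidates = sorted(
            (kw for kw in counts if kw not in taken),
            key=lambda kw: (-counts[kw], kw),
        )
        if candidates:
            labels[morpheme] = candidates[0]
            taken.add(candidates[0])
    return labels


# Hypothetical counts: "business" is the top keyword for both morphemes,
# so the second morpheme falls back to its next most frequent keyword.
labels = assign_signature_keywords({
    "bus": {"business": 5, "busy": 2},
    "ness": {"business": 4, "fitness": 3},
})
```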
The morpheme hierarchy may be consolidated into a morpheme relationship set that includes only the morphemes participating in the domain, and keyword hierarchy 2112 may be inferred (2110) from the consolidated morpheme hierarchy.
Output data 210a representing a faceted taxonomy may include revised concept definitions 708b, keyword hierarchy 2112, and content nodes 302. The output data may be transferred to the domain data store 706.
The concept relationships in the input concept hierarchy may also directly affect the output data in the domain data store 706. In particular, the input concept hierarchy can be used to prioritize relationships inferred from the synthesis portion of the operation. The pool of concept relationships drawn directly from the source data may represent "explicit" data, as opposed to inferred dimensional concept relationships. Inferred relationships that are explicit in the input concept hierarchy (either directly or indirectly) may be prioritized over relationships that are not present in the source data. That is, explicit relationships can be considered more meaningful than additional relationships inferred from the process.
The output data can now be used as a complex dimensional data structure to represent a dimensional concept taxonomy (M).
Using a faceted classification method
The organizing principles of the enhanced faceted classification method, by which elementary constructs can be synthesized to create complex dimensional structures, are introduced above in FIG. 3, illustrated in FIGS. 18-19, and described in more detail below with reference to FIGS. 20-22.
This enhanced faceted classification approach combines the flexibility of a faceted classification scheme with the simplicity, ease of visualization, and holistic perspective provided by a single (unfragmented) complex concept hierarchy.
These benefits are illustrated by contrasting faceted hierarchies with simple (unified) hierarchies. Simple hierarchies are intuitive and easy to visualize. They often integrate many organizing bases (or facets) simultaneously, which provides a more holistic perspective on all relevant attributes. Attributes are linked across facet boundaries and can be navigated in parallel. By consolidating the attributes rather than splitting them, simple hierarchies provide a more economical and robust explanatory framework.
Those skilled in the art will recognize that many other simpler and conventional classification methods may also benefit from the various components and modes of the invention as outlined below. These conventional processes of faceted classification and set-based classification construction (e.g., formal concept analysis) illustrate two such alternative classification approaches that would benefit from the system described herein.
Dimensional concept synthesis
Referring to FIG. 18, morphemes 310 comprising concept definitions may be related in a morpheme hierarchy 402. Morpheme hierarchy 402 may be the aggregate set of all morpheme relationships known in morpheme dictionary 206, pruned of redundant morpheme relationships. Morpheme relationships may be considered redundant if they can be logically constructed using a collection of other morpheme relationships (i.e., through indirect relationships).
Individual morphemes 310a and 310b may be grouped by keyword to define a specific concept 306b. Note that these morphemes 310a and 310b may thus be associated both with concept 306b (via keyword grouping) and with other morphemes 310 in morpheme hierarchy 402.
Through these interconnections, morpheme hierarchies 402 may be used to create new and expanded sets of conceptual relationships. In particular, any two concepts 306 that contain morphemes 310 that are related by morpheme relationships may themselves be related concepts.
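The inference that two concepts are related when their definitions contain related morphemes can be sketched as below. The data layout (a single-parent dictionary for the morpheme hierarchy and sets of morphemes for concept definitions) and the name `related_concepts` are assumptions made for illustration. The sketch also treats a shared morpheme as a relationship, which is one possible reading of the text.

```python
from itertools import combinations


def related_concepts(concept_defs, morpheme_parent):
    """Return pairs of concepts whose definitions contain related morphemes.

    `concept_defs` maps concept -> set of morphemes.
    `morpheme_parent` maps morpheme -> parent morpheme (or None for a root).
    """
    def lineage(morpheme):
        # The morpheme together with all of its ancestors.
        found = {morpheme}
        while morpheme_parent.get(morpheme) is not None:
            morpheme = morpheme_parent[morpheme]
            found.add(morpheme)
        return found

    def linked(m1, m2):
        # Related if identical, or if one is an ancestor of the other.
        return m1 in lineage(m2) or m2 in lineage(m1)

    pairs = set()
    for (c1, d1), (c2, d2) in combinations(sorted(concept_defs.items()), 2):
        if any(linked(a, b) for a in d1 for b in d2):
            pairs.add((c1, c2))
    return pairs


# Hypothetical data: morpheme E is a child of A, and Z is unrelated.
pairs = related_concepts(
    {"c1": {"A"}, "c2": {"E"}, "c3": {"Z"}},
    {"A": None, "E": "A", "Z": None},
)
```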
The co-occurrence of morphemes within a concept definition may be used as a basis for creating a hierarchy of concept relationships. Each intersection 406a and 406b (FIG. 18) at concept 306b represents a dimensional axis connecting concept 306b to other related concepts (not shown). A set of dimensional axes, each representing a separate hierarchy of concept relationships filtered by the set of morphemes (or facet attributes) defining the axis, may be the structural basis of a complex dimensional structure. A simplified overview of the construction method continues in FIG. 19.
Dimension concept classification method
FIG. 19 illustrates the construction of a complex dimensional structure for defining a dimensional concept taxonomy 210 based on the intersection of dimensional axes.
A set of four concepts 306c, 306d, 306e, and 306f is illustrated, with concepts 306c, 306d, and 306e defined by morphemes 310c, 310d, and 310e, respectively, and concept 306f defined by the set of morphemes 310c, 310d, and 310e. Concepts 306c, 306d, 306e, and 306f may share a concept relationship via the intersection of morphemes 310c, 310d, and 310e. The synthesis operation (described below) may create the dimensional axes 406c, 406d, and 406e as different hierarchies of concept relationships based on the morphemes 310c, 310d, and 310e in the concept definitions.
This operation of synthesizing dimensional concept relationships may be applied to all or a portion of the content nodes 302 in the domain 200 (the range-limited and dynamic processing modes of operation shown in FIGS. 22-23 are described below). Content nodes 302 can thus be classified into a fully redesigned complex dimensional structure, such as the dimensional concept taxonomy 210.
As described above, multiple concepts may be assigned to a single content container or content node (such as a web page). Thus, a single content container or content node may reside on many discrete levels in a dimensional conceptual taxonomy.
Again, any two concepts 306 that contain morphemes 310 that are related by morpheme relationships may themselves be related concepts. In one embodiment, explicit and implicit morpheme relationships may be combined with contextual surveys of domains to infer complex dimensional relationships in dimensional concept taxonomies.
A concept definition may be described using morphemes as facet attributes. As described above, it may not matter whether the facet attribute (morpheme) is explicit ("registered" or "known") or implicit ("unregistered" or "unknown") in the dictionary. There simply needs to be a valid description associated with a concept definition to carry its meaning in the dimensional concept taxonomy. A valid concept definition may provide the source material to describe the meaning of the content node in the dimensional concept taxonomy. In this way, objects in the domain can be classified in a dimensional concept taxonomy regardless of whether they were previously analyzed as part of a training set. As is well known in the art, there are many methods and techniques by which concept definitions may be assigned to objects to be classified.
In one embodiment of the invention, the interactions of the structural entities of the knowledge representation model (described above) may establish the following logical links between morphemes, morpheme relationships, concept definitions, concept nodes, and concept relationships:
If a concept within an active content node contains facet attributes (hereinafter morphemes) that have the same lineage as concepts in other content nodes (hereinafter "related nodes"), there may be a relationship between the concepts of the active node and the related nodes. In other words, the concepts may inherit all relationships inferred from the relationships that exist between the morphemes in the content nodes.
Dimensional conceptual relationships inferred directly from a faceted hierarchy are referred to herein as explicit relationships. A dimensional conceptual relationship that is inferred from the intersection of faceted attributes within the concept definition assigned to the content node to be classified is referred to herein as an implicit relationship.
Synthesizing (building) rules
Explicit relationships between concepts may be computed by examining relationships between attributes in the concept definitions of the concepts. If a concept definition contains attributes that are directly or indirectly related (hereinafter referred to as "lineage") in a facet hierarchy to the attributes of the content node being classified (hereinafter referred to as the "active node"), there may be an explicit relationship between the concepts along the dimensional axis represented by the attributes involved.
Subject to restrictive constraints (described below), implicit relationships can be inferred between any concepts that share a subset of the attributes in their concept definition. The intersection of attributes establishes a parent-child relationship.
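One way to read this rule is that a concept whose definition is a proper subset of another concept's definition is an implicit parent of that concept. The sketch below applies that reading and keeps only the nearest parents. The subset interpretation, the node names, and the function name `implicit_edges` are assumptions made for this illustration.

```python
def implicit_edges(definitions):
    """Infer nearest parent-child links from shared attribute subsets.

    `definitions` maps a node name to the set of attributes in its concept
    definition. A node is a parent of another node when its attributes are a
    proper subset of the other's and no third definition lies between them.
    """
    edges = set()
    for parent, p_def in definitions.items():
        for child, c_def in definitions.items():
            if not p_def < c_def:
                continue
            has_intermediate = any(
                p_def < other < c_def for other in definitions.values()
            )
            if not has_intermediate:
                edges.add((parent, child))
    return edges


# Hypothetical concept definitions that grow by one attribute at a time.
edges = implicit_edges({
    "n1": {"business"},
    "n2": {"business", "model"},
    "n3": {"business", "model", "advertisement"},
})
```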
The axes may be defined in terms of a set of facet attributes. In one embodiment, an axis may be defined by a set of facets (root nodes) in a facet hierarchy. These attribute sets can then be used to filter the concepts into a combined hierarchy of dimensional concept relationships. Alternatively, for a dynamically constructed (customized) hierarchy derived from a complex dimensional structure, any set of attributes can be used as the basis for a dimensional axis.
Dimensional concept relationships exist if explicit and/or implicit relationships can be derived for all axes in the parent concept definition. Thus, the dimensional conceptual relationship is structurally intact across all dimensions of the attribute definition.
Priority and directionality
A facet hierarchy (as expressed by a morpheme hierarchy) may be used to prioritize content nodes. In particular, each content node may embody attributes that exist in at most one location in the faceted hierarchy. The attribute priority in the hierarchy may determine the priority of the node.
The priority within a conceptual relationship may be determined by first examining the overall priority of any registered morpheme within the collection in question. The highest registered morpheme may establish a priority for the collection.
For example, if the first set includes three registered morphemes with priorities of {3, 37, 303}, the second set includes two registered morphemes with priorities of {5, 490}, and the third set includes three registered morphemes with priorities of {5, 296, 1002}, then the sets may be ordered: {3, 37, 303}, {5, 296, 1002}, and {5, 490}. The first set in the ordering is prioritized because it contains the morpheme with the highest overall priority, 3. The latter two sets both have a highest morpheme priority of 5. Thus, the next highest morpheme priority in each set may be examined, which reveals that the set containing the morpheme with a priority of 296 should be the higher priority set.
When the registered morphemes do not distinguish between content nodes in a concept relationship, the system may use the number of implicit morphemes as a basis for prioritization. It can be assumed that the set with the fewest morphemes has the higher priority in the hierarchy. When content nodes contain the same explicit morphemes and the same number of unregistered implicit morphemes, the content nodes may be considered to be of equal standing. When content nodes are of equal standing, the priority may be established by the order in which the system discovers each of the content nodes.
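These ordering rules can be sketched as a single sort key: compare the registered priorities in ascending order, then the number of implicit morphemes, and rely on a stable sort to preserve discovery order for ties. The dictionary layout and the function name `prioritize` are assumptions made for this illustration. Note that a set whose priorities form a prefix of another set's priorities sorts first under this key, a case the text does not address.

```python
def prioritize(nodes):
    """Order content nodes by registered priorities, then implicit count.

    `nodes` is a list in discovery order; each item has a set of registered
    morpheme priorities (lower number = higher priority) and a set of
    implicit morphemes. Python's sort is stable, so full ties keep their
    discovery order.
    """
    return sorted(
        nodes,
        key=lambda node: (sorted(node["registered"]), len(node["implicit"])),
    )


# The three sets from the example above, in the order they are introduced.
nodes = [
    {"name": "first", "registered": {3, 37, 303}, "implicit": set()},
    {"name": "second", "registered": {5, 490}, "implicit": set()},
    {"name": "third", "registered": {5, 296, 1002}, "implicit": set()},
]
ordered_names = [node["name"] for node in prioritize(nodes)]
```

With this input the resulting order is first, third, second, matching the ordering {3, 37, 303}, {5, 296, 1002}, {5, 490} given in the example.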
FIG. 20 provides a simple illustration of an embodiment for constructing implicit relationships and determining the priority of nodes in the resulting hierarchy.
In this example, morpheme "business" 2201 is registered in the morpheme dictionary. Assume that through user interaction, a content node is constructed with a concept definition that contains this morpheme plus a new morpheme "model" 2202 that is not recognized in the morpheme dictionary.
Continuing with the above example, the morpheme "business" has the highest priority 2203. The set "business, model" is an implicit child 2204 of "business". Any additional morpheme added to this set, such as "advertisement" 2205, will create an additional layer 2206 in the hierarchy.
Any morpheme, whether explicit or implicit in the system, may be used as the basis (or axis) for a hierarchy of content. Continuing with the above example, the implicit morpheme "advertisement" 2207 is the parent 2208 of the hierarchy based on this morpheme. The set "business, model, advertisement" 2205 is a child node 2209 in this hierarchy. Any additional set that includes "advertisement" will also be a member of this hierarchy. In this example, the set "advertisement, method" 2210 is also a child node of "advertisement" 2211. Since the morpheme "business" is registered, the set "business, model, advertisement" is given a higher priority in the advertisement hierarchy than the set "advertisement, method", which contains only implicit morphemes.
An alternative embodiment for node prioritization involves "signature" nodes. These are defined as the content nodes that best describe (or give meaning to) their associated concepts. For example, a domain owner may use a photograph associated with a particular concept as a signature identifier for that concept. Signature nodes can be prioritized.
There are many ways to implement a signature node. For example, one way is as a label for a particular class of content nodes. Signature nodes may be assigned a special attribute, and that attribute may be given the highest priority in the facet hierarchy. Alternatively, a field may be used in a table of content nodes to specify this attribute.
Prioritization based on the facet hierarchy can be supplemented by automatic bases such as alphabetical, numerical, and chronological ordering. In traditional faceted classification, prioritization and ordering are matters of notation and citation order. Systems typically provide dynamic attribute reordering for prioritization and ordering. These operations are not discussed further herein.
Axis definition and structural integrity
In one embodiment of the system, another rule for building a dimensional concept taxonomy relates to the structural integrity of the dimensional axes. Each set of morphemes (attributes) serving as a concept definition (axis definition) may establish a dimensional axis. The dimensional concept relationships inferred from these morphemes must be structurally intact across all dimensions as determined by the parent node. In other words, all dimensions that intersect a parent concept must also intersect all child concepts of that node. The following example illustrates this:
Consider an active content node whose concept definition is { A, B, C },
where A, B, C are the three morphemes in the concept definition and morphemes E, F, G are the child morphemes of A, B, C in the morpheme hierarchy, respectively;
{ A, B, C } refers to the concept definition described with morphemes A and B and C;
{ A, * } refers to a combination of the explicit morpheme A and one or more implicit morphemes (*), which establishes a node as an implicit child of A;
{ A | B } refers to the morpheme { A } or { B }.
The three morphemes A, B, C in the active node may be used in this example to establish three dimensions (or intersecting axes) in a dimensional conceptual hierarchy. For any other content node to be a child of this node, the candidate must be a child with respect to all three axes. The following notation is a solution set of explicit and implicit relationships as defined by one embodiment of the present invention:
{(A|E|A,*|E,*),(B|F|B,*|F,*),(C|G|C,*|G,*)},
wherein the morpheme of the first dimension is A or E, or an implicit morpheme of A, or an implicit morpheme of E;
wherein the morpheme of the second dimension is B or F, or the implicit morpheme of B, or the implicit morpheme of F;
and the morpheme of the third dimension is C or G, or the implicit morpheme of C, or the implicit morpheme of G.
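The structural integrity test can be sketched as a check that every axis of the parent definition is covered by the candidate, either by the same morpheme or by one of its descendants, with any extra unregistered morphemes tolerated as implicit additions. The function name `covers_all_axes` and the single-parent dictionary used for the morpheme hierarchy are assumptions made for this illustration.

```python
def covers_all_axes(parent_def, candidate_def, morpheme_parent):
    """Check that a candidate definition lies on every axis of the parent.

    `morpheme_parent` maps a registered morpheme to its parent (or None).
    Morphemes missing from the mapping are treated as implicit (unregistered).
    """
    def lineage(morpheme):
        # The morpheme together with all of its ancestors.
        found = {morpheme}
        while morpheme_parent.get(morpheme) is not None:
            morpheme = morpheme_parent[morpheme]
            found.add(morpheme)
        return found

    return all(
        any(axis in lineage(m) for m in candidate_def)
        for axis in parent_def
    )


# Morpheme hierarchy from the example: E, F, G are children of A, B, C.
hierarchy = {"A": None, "B": None, "C": None, "E": "A", "F": "B", "G": "C"}
active = {"A", "B", "C"}

on_all_axes = covers_all_axes(active, {"E", "F", "G"}, hierarchy)
missing_axis = covers_all_axes(active, {"A", "F"}, hierarchy)
with_implicit = covers_all_axes(active, {"A", "x", "B", "G"}, hierarchy)
```

Here { E, F, G } and { A, x, B, G } satisfy all three axes, while { A, F } is rejected because nothing in it lies on the axis defined by C.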
The processing range can be further limited by constraining the concept definition of the dimensional axes. An individual axis (hereinafter the "active axis") may be established by referencing a subset of morphemes from the parent node, whereby the constraint may limit the set of parent nodes (ancestors) of the active node. Effectively, the concept definition associated with the active axis may establish a virtual parent node that constrains the multi-hierarchy extending from the active node to only those content nodes that reside on the hierarchy defined by the concept definition of the active axis.
The following example illustrates this constraint using the example introduced above with the concept definition { A, B, C }. In this example, the derived dimensional concept relationships are constrained to the active axis with the concept definition { A, B }. Under this constraint, the set of possible parent nodes (ancestors) of the active node is limited to the set { (A, B) | A | B }. In other words, a matching concept definition will include only combinations of A and B, but not C (again, in this example it is assumed that A and B have no parent nodes in the morpheme hierarchy).
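Under the same assumption stated in the example (A and B have no parents in the morpheme hierarchy), the permitted ancestor definitions are simply the non-empty subsets of the active axis. The following sketch enumerates them; the function name `candidate_ancestor_definitions` is hypothetical.

```python
from itertools import combinations


def candidate_ancestor_definitions(active_axis):
    """List the non-empty subsets of the active axis, largest first.

    Assumes that the axis morphemes have no parents of their own in the
    morpheme hierarchy, as in the example above.
    """
    morphemes = sorted(active_axis)
    return [
        frozenset(subset)
        for size in range(len(morphemes), 0, -1)
        for subset in combinations(morphemes, size)
    ]


# Active axis { A, B } drawn from the concept definition { A, B, C }.
ancestors = candidate_ancestor_definitions({"A", "B"})
```

For the active axis { A, B } this yields (A, B), A, and B, and no definition containing C.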
The combination of explicit and implicit relationships in a morpheme can thus establish rules for building hierarchical relationships between concepts.
As is known in the art, there are many ways to optimize these types of filtering and sorting functions. They include data management tools such as indexing and caching. These refinements are well known in the art and will not be discussed further herein.
Synthesis modes of operation
Various synthesis modes of operation are possible for the faceted classification method of the present invention. The synthesis can be varied to accommodate the individual requirements of different domains and end users. These modes can be defined as follows:
Static and dynamic synthesis
In one embodiment, a "static" faceted classification synthesis is provided in which the axes defining the dimensional concept hierarchy may be predefined. The resulting dimensional concept taxonomy can then be utilized as a static structure.
The static mode of faceted classification synthesis has the advantage that domain owners can organize the dimensional concept taxonomies according to their exact specifications. End users who access and consume the information contained within these static structures may therefore benefit from the domain owner's organizational knowledge. Static synthesis is therefore particularly useful, for example, when the end user of the information has little knowledge of the information contained within the domain.
In another embodiment, a "dynamic" faceted classification synthesis is provided in which dimensional concept hierarchies can be processed in near real-time based directly on synthesis parameters provided by the end users of the information. This dynamic mode of operation facilitates incremental and complete "on-demand" assembly of information structures.
Dynamic processing can provide substantial information economy and storage benefits because it eliminates the need to pre-create and store end-user structures. More importantly, dynamic processing can allow end users to precisely tailor the output to their requirements, which provides the benefits of personalization (the synthesis modes of operation are discussed in more detail below).
Yet another embodiment combines the static and dynamic synthesis modes introduced above. Under this mixed-synthesis model, the domain owner may provide a selection of axis definitions that provide a static "global" structure for the dimensional concept taxonomy. Within this global structure, dynamic synthesis can then be used to enable individual end users to further customize the structure to their needs. This hybrid mode thus combines the advantages of both static and dynamic synthesis.
Limits on concept hierarchies and content nodes
As the scale of domains and facet hierarchies increases, the number of dimensional concept relationships that can be inferred also grows rapidly. A limit may be set on the number of relationships generated.
A limit may be entered by the user to set the maximum number of related concepts or associated content nodes in the resulting output hierarchy. For example, an administrator may configure the synthesis operation to stop processing after the system has assembled the ten most closely related concepts into a hierarchy.
Variable abstraction level
As described above in the description of the knowledge representation model and the analysis operations, the attributes that comprise a concept definition may be defined at variable levels of abstraction. One embodiment described herein provides entities at several levels of abstraction, such as concepts, keywords, and morphemes. Varying the level of abstraction of the attributes that define the concepts used in synthesis can produce significantly different output from the synthesis operation.
In particular, as attributes tend toward the more elemental morpheme entities within a domain, there may be more connections between the complex concepts defined using these attributes. Defining attributes in these elemental terms may thus provide greater connectivity and more ways to organize the resulting synthesis output.
Conversely, as attributes tend toward more abstract, complex entities (e.g., keywords or complex concepts), the resulting synthesized structure may be more precise, typically with fewer connections but higher overall quality. Thus, changing the level of abstraction in the synthesis operation may allow an administrator, domain owner, or end user to customize the information according to their individual requirements.
Scope of domain processing
In one embodiment, all content nodes in a domain may be examined and compared before generating a complete view of the dimensional concept taxonomy. In other words, the system may discover all content nodes in the domain that may be relevant before any inference is made regarding the direct hierarchical relationship between these relevant nodes.
The benefit of a complete examination of all nodes in the domain is that it can provide exhaustive exploration and discovery of the information within the domain. This synthesis mode may be suitable where high accuracy and recall are required. It is often preferred for relatively small, clearly defined domains.
In another embodiment, instead of analyzing the entire domain, a localized region of the domain may be analyzed based on the active focus of the user. This localized analysis can be applied to the material, whether it was previously analyzed as part of the training set or not. Parameters may be set by an administrator to balance analysis depth with processing time (latency).
For material that is not analyzed as part of the training set, the system can use the operation of localized analysis to classify the material under enhanced faceted classification derived from the training set material.
Note that the operation of classifying a partial subset of material from a domain may also be used to classify a new domain, as described in more detail below. In other words, a training set from one domain can be used as a basis for a construction scheme to classify material from a new domain, thereby supporting a multi-domain classification environment.
FIG. 21 illustrates various synthesis modes in more detail. Without limiting the scope of the invention, these examples demonstrate the broad range of synthesis options offered by the various modes. The benefit of this flexibility in synthesis is a system that can accommodate a wide range of domains and user requirements.
Static (pre-indexed) synthesis
FIG. 21 illustrates, in one embodiment of the invention, a method by which the output data of the enhanced faceted classification method may be used to produce a dimensional concept taxonomy 210 that reorganizes the domain. The output data (M) may be generated as described above and shown in FIG. 17. The inputs to this method may be the revised concept definitions 2104, the keyword hierarchy 2112, and the content nodes 302 from the domain.
Each concept definition 708b may map to a keyword 2302 in a keyword ranking 2112. New dimension concept relationships for concepts may be generated (820) by rules of the enhanced faceted classification method as described above and shown in FIGS. 3, 18-20.
An administrator of the information structure may prefer to manually adjust (2304) the automatically generated dimensional concept taxonomy. The operations may support this kind of manual intervention, but fully automated operation does not require any user interaction.
Analysis (2306) can be used to evaluate parameters of the resulting dimensional concept taxonomy. The administrator may likewise set (2308) statistical parameters as scaling factors for the dimensional concept taxonomy. These parameters can also limit complexity, acting as negative feedback in the complex-adaptive system, by reducing the scope of processing and thus scaling back the number of stages incorporated.
As described below and shown in FIG. 27, the dimension concept taxonomy 210 may be used for user interaction (N).
Domain subset (range limited) synthesis
FIG. 22 illustrates selecting content nodes from a domain and ordering the content nodes into a dimensional concept hierarchy. A constrained view of the domain with respect to the active node 2402 may be employed. Instead of processing the entire domain, the operation may perform a directed investigation of all content nodes (e.g., 2406) in the immediate vicinity 2404 of the active node 2402.
Recursive concept hierarchy assembly
In one embodiment, a recursive algorithm may be used to subdivide the set of undifferentiated related content nodes into specific structural groups. The "candidate set" describes the set of concepts and associated content nodes that are related to the active concept definition, regardless of how precisely they are related. Groups may be described with respect to the active concept or content node as parent and child nodes (hierarchical relationships) and sibling nodes (associative relationships). The structural relationships described by these groups are well known in the art. These neighboring concepts and associated content nodes may then be ordered into hierarchical relationships relative to the active concept, based on the underlying morphemes and morpheme relationships involved.
This ordering is illustrated in FIG. 22 as a subset of relationships between content nodes (e.g., 2406) within the candidate set of content nodes 2404. In hierarchy 2408, the content nodes that are directly related to the active node 2402 (its direct children) have no other parent nodes within the candidate set 2404. The remaining content nodes in the candidate set may be positioned deeper in the hierarchy as indirect children (descendants).
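The recursive ordering just described can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not the implementation of the invention: concepts are represented only by their attribute sets, only implicit relationships are used (a proper superset of attributes is treated as a descendant), and the function name is hypothetical.

```python
def build_hierarchy(active, candidates):
    """Nest the descendants of `active` so each sits under its nearest ancestors."""
    # Descendants carry every attribute of the active set plus at least one more.
    descendants = [c for c in candidates if active < c]
    # Direct children have no other candidate lying between them and the active set.
    direct = [
        c for c in descendants
        if not any(other < c for other in descendants)
    ]
    # Recurse: each direct child becomes the active node for the remaining candidates.
    return {child: build_hierarchy(child, descendants) for child in direct}


AB, AC, ABC, XY = (frozenset(s) for s in ("AB", "AC", "ABC", "XY"))
tree = build_hierarchy(frozenset("A"), [AB, AC, ABC, XY])
```

In this example { A, B } and { A, C } are direct children of { A }, { A, B, C } appears one level deeper under both of them, and { X, Y } is left out because it is unrelated to the active node.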
Applying a classification scheme for one domain to a second domain
FIG. 23 illustrates the operation of classifying a local subset of material from a domain that is not part of the training set used to develop the faceted classification scheme.
A partial subset of the domain material 2404a may be selected from the domain 200 for processing. The material may be selected based on selection criteria (2502) established by the domain owner. The selection may be made with respect to active node 2504 that is the basis for the localized region. The selection process may generate parameters for the partial subset 2506, such as a list of search terms that describe the boundaries of the partial subset.
There are many possible selection criteria for the local set. In one embodiment, the material may be selected by passing the concept definition associated with the active node to a full-text information retrieval (search) component to return a set of related material. Such full text information retrieval tools are well known in the art. In an alternative embodiment, an expanded search query may be derived from concept definitions in active nodes by examining keyword rankings to derive a set of relevant keywords. These related keywords, in turn, may be used to expand the search query to include terms that are related to the concept definition of the active node.
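The expanded-query alternative can be sketched briefly. In the Python fragment below, the mapping of each keyword to its related keywords stands in for a lookup in the keyword hierarchy; the mapping, the sample terms, and the function name are hypothetical, and the full-text retrieval component itself is not shown.

```python
def expand_query(concept_keywords, related_keywords):
    """Append keywords related to the active concept definition to the query terms."""
    terms = list(concept_keywords)
    for keyword in concept_keywords:
        # Look up the related keywords (e.g., parents or children in the hierarchy).
        for related in related_keywords.get(keyword, []):
            if related not in terms:
                terms.append(related)
    return terms


related_keywords = {"jazz": ["bebop", "swing"], "piano": ["keyboard"]}
query_terms = expand_query(["jazz", "piano"], related_keywords)
```

The resulting term list could then be passed to the search component to select the local subset of material.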
The local subset of the domain derived from the selection process 2404a may include candidate content nodes to be classified. For each candidate content node in the local subset, a concept signature may be extracted (2508). Concept signatures may be identified by the domain owner and may be mapped (2302) to keywords in the domain-specific keyword hierarchy 2112 to provide a concept definition for each candidate content node. Again, the build component does not require that all keywords derived from the concept signature be known to the system (registered in the keyword hierarchy).
Concept hierarchies can be computed for the candidate content nodes using the construction rules for implicit and explicit relationships described above (820). The end result may be a local concept taxonomy 210c in which content nodes from the local subset of the domain are organized under the construction scheme derived for the domain from the training set. The local concept taxonomy may then be used as an environment for user interaction to further refine the classification.
Dynamic (real-time) synthesis
An alternative embodiment of the present invention uses a dynamic synthesis mode that incorporates user preferences into the synthesis operation in real time. FIGS. 24-25 and the following description provide more specific details regarding the operation of this dynamic synthesis mode.
FIG. 24 illustrates one embodiment of the dynamic synthesis mode in a generalized overview. The dynamic synthesis process may follow a request-response model of operation. The dynamic synthesis operation is initiated by a user request (2402). Users may specify their requirements (e.g., their domain of interest, their topic of interest as encoded by an active concept definition, their perspective on the topic as encoded by axis definitions, and the scope of their interest as constrained by a set of limiting synthesis parameters). In FIG. 24, these user parameters are represented schematically in simplified form as an active concept definition (the box) containing more elemental attributes (the four points) 2404.
Using this dynamic input from the user, the system may then return an associated concept hierarchy (the output concept hierarchy) 2406. This output concept hierarchy may then be the focus of further exploration by the user, or it may serve as a bridge to yet another round of synthesis.
To process such a request, the set of attributes associated with the active concept definition may be the basis for locating, within the designated domain 2408, a set of concepts that will serve as the candidate set 2410 for the synthesized concept hierarchy. The "derivation" method 2412 that relates these concepts to the active concept definition is described below. The derivations can be ordered dynamically and used as the basis for constructing the hierarchy of related concepts.
Next, more details regarding the main steps and components of the domain dynamic synthesis mode are provided.
User-initiated synthesis requests
The dynamic synthesis operation is initiated by a user request (3502). To initiate the dynamic synthesis process, a user may provide a domain, an active concept definition, and an axis definition. The user may also constrain the size and shape of the concept hierarchy through the other synthesis parameters discussed below. As discussed below in connection with user interface system implementations, there are many technical means of obtaining this type of user input.
Dynamic synthesis inputs and synthesis parameters
Thus, the inputs to the dynamic synthesis mode may include user-specific synthesis parameters and domain-specific faceted data sets. These inputs may constrain the synthesis operation to a narrow field or subject area that matches the precise requirements of the user. Details regarding domain-specific faceted data sets are provided above.
Runtime synthesis parameters
As discussed above, one embodiment of dynamic synthesis may accept user input of an active domain, an active concept definition, and an active axis definition. In addition, users can further describe their requirements by providing a degrees-of-separation parameter and parameters that limit the output of the synthesis operation by number of concepts and content nodes.
The degrees-of-separation parameter specifies the maximum number of direct hierarchical steps from the active concept definition to related concept definitions in the output concept hierarchy.
For example, based on the building rules of the enhanced faceted classification method, and given a representative active set of attributes { A, B, C }, the following set of attributes would be one degree of separation removed:
{ A, B, C, ? }: all supersets with one additional element, where "?" represents one other attribute;
{ A, B }, { A, C }, { B, C }: all subsets, based on implicit attribute relationships;
{ D, B, C }: based on an explicit attribute relationship, given A → D.
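The three cases listed above can be enumerated with a short Python sketch. It rests on several assumptions that are not part of the original description: the "other" attributes are drawn from a known attribute universe, explicit relationships are given as a parent-to-children mapping, and only the child direction of explicit relationships is shown (replacement with a parent attribute would be analogous). All names are hypothetical.

```python
def one_degree_neighbors(active, explicit_children, known_attributes):
    """Attribute sets exactly one derivation step away from the active set."""
    active = frozenset(active)
    neighbors = set()
    # Supersets with one additional element (implicit descendants).
    for extra in set(known_attributes) - active:
        neighbors.add(active | {extra})
    # Subsets with one element removed (implicit ancestors).
    if len(active) > 1:
        for attr in active:
            neighbors.add(active - {attr})
    # One attribute replaced by an explicit child attribute (e.g., A -> D).
    for attr in active:
        for child in explicit_children.get(attr, ()):
            neighbors.add((active - {attr}) | {child})
    return neighbors


neighbors = one_degree_neighbors(
    {"A", "B", "C"},
    explicit_children={"A": ["D"]},
    known_attributes={"A", "B", "C", "D", "E"},
)
```

With A → D as the only explicit relationship, the result contains the supersets { A, B, C, D } and { A, B, C, E }, the subsets { A, B }, { A, C }, and { B, C }, and the explicit replacement { D, B, C }.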
Latency
Latency is another synthesis parameter that can be manipulated by the end user. In one implementation, a response-time ceiling may be applied to the system so that the synthesis operation is limited to a maximum time between the user's synthesis request and the build engine's response with the output that satisfies the request. Another embodiment of this latency control would allow end users to increase or decrease the request-response time, tuning performance to match their individual information access and discovery requirements.
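One way to picture a response-time ceiling is a processing loop that stops when a deadline passes and returns whatever has been assembled so far. The Python sketch below is only an illustration of that idea; the function name, the injectable clock, and the stand-in processing step are assumptions, and the fake clock exists solely to make the example deterministic.

```python
import time


def synthesize_with_deadline(candidates, process, max_seconds, clock=time.monotonic):
    """Process candidates in order until the response-time ceiling is reached."""
    deadline = clock() + max_seconds
    results = []
    for candidate in candidates:
        if clock() >= deadline:
            break  # return a partial result rather than exceed the ceiling
        results.append(process(candidate))
    return results


# Deterministic demonstration: a fake clock that advances 0.1 s on every reading.
ticks = iter([0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
limited = synthesize_with_deadline(
    [1, 2, 3, 4], lambda c: c * 10, max_seconds=0.25, clock=lambda: next(ticks)
)
```

Here only the first two candidates are processed before the 0.25 second ceiling is reached; raising `max_seconds` trades a longer wait for a more complete result.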
Candidate set for dynamic synthesis
One embodiment of candidate set assembly for dynamic synthesis is illustrated in FIG. 25.
In dynamic synthesis, the attribute set of the active concept can be examined against the attribute hierarchy to discover explicitly related sets of ancestor and descendant attributes. More information about these examinations is provided above in the description of the synthesis (building) rules. Again, the entire domain need not be examined under this real-time dynamic synthesis mode. The system examines only the subset of the domain defined by the candidate set. The candidate set may be found as follows:
Sets of related attributes that are subsets of the active attribute set, or that have elements that are explicit ancestors of elements in the active attribute set, or both, may be considered (these represent possible ancestor concepts). Within each of these sets of related attributes 2502a, 2502b, and 2502c, each attribute may have its own set of concepts whose definitions match it. The intersection of these concept sets 2504a, 2504b, and 2504c for a given set of related attributes may contain the concepts that match the whole attribute set (matching concepts are illustrated as solid points; non-matching concepts are illustrated as hollow points).
A similar process may be carried out separately for the sets of related attributes that are supersets of the active attribute set, or that have elements that are explicit descendants of elements in the active attribute set, or both (these represent candidate descendant concepts). Here again, the intersection of the concept sets for a set of related attributes may contain the concepts that match that attribute set.
The union of the intersections from all the sets of related attributes may be the candidate set. The sets of related attributes may be constrained to a specified axis definition. Their number may also be subject to a specified maximum limit and to the degrees-of-separation distance.
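The intersection-then-union procedure can be sketched with ordinary set operations. In the Python fragment below, the index from each attribute to the concepts whose definitions contain it is a hypothetical stand-in for the faceted data set, and the concept identifiers and function names are invented for illustration.

```python
def matching_concepts(attribute_set, concepts_by_attribute):
    """Concepts whose definitions contain every attribute in `attribute_set`."""
    concept_sets = [concepts_by_attribute.get(a, set()) for a in attribute_set]
    # Intersect the per-attribute concept sets; an empty attribute set matches nothing.
    return set.intersection(*concept_sets) if concept_sets else set()


def candidate_set(related_attribute_sets, concepts_by_attribute):
    """Union of the matching concepts over all sets of related attributes."""
    candidates = set()
    for attribute_set in related_attribute_sets:
        candidates |= matching_concepts(attribute_set, concepts_by_attribute)
    return candidates


concepts_by_attribute = {
    "A": {"c1", "c2", "c3"},
    "B": {"c2", "c3"},
    "C": {"c3", "c4"},
}
candidates = candidate_set([{"A", "B"}, {"C"}], concepts_by_attribute)
```

For the related attribute sets { A, B } and { C }, the intersections are { c2, c3 } and { c3, c4 }, so the candidate set is { c2, c3, c4 }.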
Derivations for concept hierarchy assembly
Under the real-time dynamic synthesis mode, latency may be a major limiting factor. In particular, there is little time to process even relatively small candidate sets in an exhaustive manner. The recursive concept hierarchy assembly used in static synthesis, discussed above, is often unsuitable in this dynamic environment because of the delay it may introduce for larger domains.
Thus, one embodiment of dynamic synthesis uses a derivation-based approach to assemble concept hierarchies dynamically in real time. Derivations are sets of operations that describe how candidate concepts relate to the active concept.
In addition to the performance and latency benefits introduced above, derivations also introduce a novel benefit for concept synthesis, namely the inference of new concept definitions, the "virtual concepts" discussed below. These virtual concepts greatly extend the discovery benefits of the system by inferring new concepts even though they have not yet been associated with content nodes. Derivations also provide a powerful means of ordering and filtering, serving as a user-configurable clustering mechanism.
The candidate set may be found from the attribute sets related to the attribute set of the active concept. Explicitly related elements may be found from the attribute hierarchy in the faceted data set. Implicitly related attribute sets may be implied by set intersections (i.e., subsets and supersets of these attribute sets). The additional attributes used to discover implicit descendant attribute sets may or may not be known to the system, even though they occur in the domain.
The active set of attributes may be paired with sets of attributes associated with concepts in the candidate set. For each pair, a sequence of set operations may be derived that transform the active attribute set into its pairing set.
In an attempt to discover a set of related attributes, four derivation operations may be performed on the set of attributes. The operation type may be abbreviated as shown in table 1.
TABLE 1 Derivation operation types
| | To derive implicit relationships | To derive explicit relationships |
| To ancestor | d: delete attribute | p: replace attribute with parent attribute |
| To descendant | a: add attribute | c: replace attribute with child attribute |
Note that the directionality of all attribute relationships must be consistent with the potential concept relationship pairs. An attribute set pair may have an ancestral relationship or a descendent relationship between their elements, but not both.
The synthesis process preserves this directionality by applying only ancestor operations (p, d) or only descendant operations (c, a), but not both, to establish a relationship between concepts. This prevents a concept from having all of its attributes replaced with those of an unrelated concept.
For example, given an active concept with attributes { A, B, C } and a candidate concept with attributes { D, B, G, F }, three axes run through the definition of the active concept, corresponding to its three attributes. To determine whether a relationship exists between the concepts, explicit relationships may be applied first, such as an explicit relationship from A to D and another from C to G (both are c operations: replacing an attribute with a child attribute). Finally, an implicit a operation, adding a descendant attribute (here F), transforms the attribute set of the active concept into the attribute set of the candidate. The candidate may therefore be considered a descendant of the active concept.
To illustrate, when pairing an active attribute set with a candidate attribute set, the attributes fall into three possible groups:
attributes associated only with the candidate set (the "candidate only" attributes);
attributes associated with both the candidate set and the active set (the "both" attributes);
attributes associated only with the active set (the "active only" attributes).
If converting the active set to the candidate set requires deleting "active only" attributes, then the candidate set is an ancestor of the active set.
If the active set is the same as the candidate set, the candidate set is a sibling of the active set.
If converting the active set to the candidate set requires adding "candidate only" attributes, the candidate set is a descendant of the active set.
Converting the active set to the candidate set by both deleting "active only" attributes and adding "candidate only" attributes is not valid, whether or not the two original sets have attributes in common. Such a pair of sets is considered unrelated. The only exception is when attributes in the two "only" groups are related in the attribute hierarchy. In such a case, one of two operations may be performed:
replacing an active-set attribute with its parent attribute (in which case the candidate set is an ancestor of the active set);
replacing an active-set attribute with one of its child attributes (in which case the candidate set is a descendant of the active set).
The resulting attributes are then members of the "both" group.
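The pairing rules above can be sketched as a small Python function that tries the descendant operations (c, a) and then the ancestor operations (p, d), never mixing the two directions. This is a simplified illustration under stated assumptions: explicit relationships are given as a hypothetical child-to-parent mapping, each attribute has at most one parent, only direct parent and child links are followed, and the function name and tuple encoding of operations are invented here.

```python
def derive(active, candidate, parent_of):
    """Return (relation, operations) transforming `active` into `candidate`."""
    active, candidate = set(active), set(candidate)
    if active == candidate:
        return "sibling", []

    # Descendant direction: replace with child attributes (c), then add (a).
    active_only, candidate_only = active - candidate, candidate - active
    ops = []
    for attr in sorted(active_only):
        child = next((c for c in sorted(candidate_only) if parent_of.get(c) == attr), None)
        if child is not None:
            ops.append(("c", attr, child))
            active_only.discard(attr)
            candidate_only.discard(child)
    if not active_only:
        return "descendant", ops + [("a", attr) for attr in sorted(candidate_only)]

    # Ancestor direction: replace with parent attributes (p), then delete (d).
    active_only, candidate_only = active - candidate, candidate - active
    ops = []
    for attr in sorted(active_only):
        parent = parent_of.get(attr)
        if parent in candidate_only:
            ops.append(("p", attr, parent))
            active_only.discard(attr)
            candidate_only.discard(parent)
    if not candidate_only:
        return "ancestor", ops + [("d", attr) for attr in sorted(active_only)]

    # Both deletion and addition would be needed: the pair is unrelated.
    return "unrelated", []


parent_of = {"D": "A", "G": "C"}  # D is a child of A, G is a child of C
relation, operations = derive({"A", "B", "C"}, {"D", "B", "G", "F"}, parent_of)
```

For the example in the text, the candidate { D, B, G, F } is classified as a descendant of { A, B, C } through two c operations (A to D, C to G) followed by one a operation (adding F).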
At a given level, the order in which siblings are presented may be critical. Those concepts that may be more important to the user should have a higher priority.
Each concept in the candidate set may have a unique series of derivations connecting it to the active concept. The ordering of concepts in the resulting hierarchy is affected by the order in which the derivations are arranged and processed during synthesis. The candidate concepts in the hierarchy are prioritized according to Table 2.
TABLE 2 Priority of derivations in determining result ordering
| | Popularity in candidate set | Popularity in domain |
| Explicit operations (p, c) | 1 | 2 |
| Implicit operations (a, d) | 3 | 4 |
Response
In response to the requirements specified in the user's request, the application may return a concept hierarchy that is built from concepts associated with objects within the domain, that is related to the active concept, and that lies along the specified axis. Users can consult this concept hierarchy to discover concepts related to the active concept they specified.
The derivations can be assembled into a hierarchical result set. Each node in the hierarchy represents a concept with a set of attributes as its concept definition. Each edge in the hierarchy represents a single derivation operation.
Virtual concepts
In some cases, there is no matching concept for the attribute set of a node in the concept hierarchy. A virtual concept can be used as a placeholder to indicate this.
For example, given an attribute set { A, B, C }, if:
there is an explicit relationship A → D,
there is an explicit relationship D → F, and
there is no concept with the attribute set { D, B, C },
then { F, B, C } will be in the candidate set with one degree of separation from { A, B, C }. If the attribute set { D, B, C } has no corresponding concept, the node at this position in the hierarchy is a virtual concept.
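A virtual concept can be pictured as a hierarchy node whose attribute set has no registered concept. The following Python sketch labels the nodes along one derivation path accordingly; the lookup table of concepts, the concept names, and the dictionary-based node representation are assumptions made for illustration.

```python
def label_path(attribute_sets, concepts):
    """Label each node on a derivation path as a real or a virtual concept."""
    nodes = []
    for attrs in attribute_sets:
        key = frozenset(attrs)
        name = concepts.get(key)  # None when no concept has this attribute set
        nodes.append({"attributes": key, "concept": name, "virtual": name is None})
    return nodes


# Hypothetical domain: concepts exist for { A, B, C } and { F, B, C } only.
concepts = {
    frozenset("ABC"): "active concept",
    frozenset("FBC"): "candidate concept",
}
path = label_path([{"A", "B", "C"}, {"D", "B", "C"}, {"F", "B", "C"}], concepts)
```

In this path the middle node { D, B, C } is marked as virtual, serving as a placeholder between the two concepts that do exist in the domain.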
From within the active domain, the dynamic synthesis process may isolate and return concept hierarchies related to active concepts. Related concepts may branch in the direction of ancestors (more general) and descendants (more specific) along a specified axis and as specified from the active concept.
Note that the data structure from which the dimension concept taxonomy 210 is derived can be represented in a variety of ways for a variety of purposes. The purpose of the end-user interaction is illustrated in the following description. However, these structures may also be used in the service of other data manipulation techniques, for example as input to another information retrieval or data mining tool (not shown).
Complex-adaptive feedback mechanism
FIG. 27 illustrates a method for handling user interactions in a complex-adaptive system. The method builds on the dimensional concept taxonomy process (N). The user interactions may establish a series of feedback loops to the system. The adaptive refinement of the complex dimensional structures can be achieved through feedback initiated by end users.
FIG. 37 illustrates one possible implementation of a computer system 4000 that allows aspects of a faceted classification of information to be manipulated in the form of one or more dimensional concept taxonomies 4010. The system 4000 may include a computer-readable medium 4020, such as a disk drive or other form of computer memory, containing a computer program, software, or firmware 4080 for implementing the dimensional concept taxonomy and its various aspects, such as, for example, the concept definitions 4090, hierarchical data 4100, content nodes 4110, definitions 4120 corresponding to the content nodes, or classifications 4130. The system 4000 may also include a processor 4030, a user interface 4040 (e.g., a keyboard or mouse), and a display 4050. In this implementation, the computer processor 4030 can access the computer-readable medium 4020, retrieve at least a portion of the dimensional concept taxonomy 4010 generated from the source data, and present that portion of the taxonomy 4010 on the display 4050. The processor 4030 can also receive input from an outside entity (user or machine) through an interface 4040 (optionally a user interface), where the input reflects manipulation of aspects of the dimensional concept taxonomy 4010. The processor 4030 can incorporate the received outside-entity manipulation of any of the plurality of possible relationships found in the first dimensional concept taxonomy 4010 into a second dimensional concept taxonomy.
For example, the outside-entity manipulations may take the form of changes or additions to the first dimensional concept taxonomy 4010, such as editing concept definitions or hierarchical data, changing the position of content nodes associated with a concept relative to other content nodes associated with a concept, changing the definitions that describe the subject matter of content nodes, or other changes to the classification. The second dimensional concept taxonomy may completely replace the first dimensional concept taxonomy 4010, exist entirely in parallel with or separately from it, exist as a special table of the first dimensional concept taxonomy 4010, or the like. In addition, access to the second dimensional concept taxonomy may be limited to certain classes of outside entities, such as domain owners and administrators, users, dedicated remote computer devices, and so forth.
The display 4050 may present aspects of the dimension concept taxonomy 4010 in the form of a display window or editor 4070 controllable by the processor that may be responsive to the interface 4040. The editor 4070 may also take the form of a web page and may present content nodes and facet classifications derived from the dimensional concept taxonomy 4010 or a variation thereof. The content nodes and faceted classes shown by the editor may correspond to active nodes selected by outside entities and may take the form of tree fragments, for example. The editor 4070 may also present editing functionality that outside entities may utilize to manipulate aspects of the dimensional concept taxonomy 4010 or to introduce new elements, relationships, and content. The editing functions may also include a review interface that allows outside entities to alter one or more morpheme groups associated with the content of the node, as well as the position of the node within the dimensional concept taxonomy, to make them consistent with the content of the node.
Thus, the method of the complex-adaptive procedure can be summarized as follows:
The dimensional concept taxonomy is provided as an environment for user interaction 212a. Once the dimensional concept taxonomy 210 has been presented to the user, it can become an environment for revising existing data as well as a source of new data (dimensional concept taxonomy information). Input data 804a includes user edits to existing data and inputs of new data. This also allows the classification to evolve and adapt to a dynamic domain.
The user interaction may include feedback to the system. A notation system can be used to assign unique identifiers to the data elements in the dimensional concept taxonomy information, based on the morpheme elements stored in the centralized system. Thus, each data element in the dimensional concept taxonomy generated by the system can be identified in such a way that it can be merged back into the centralized (shared) morpheme dictionary.
Thus, as the user manipulates these elements, the resulting effects on the relevant morpheme elements may be tracked. These changes may reflect new explicit data in the system that refines any inferred data automatically generated by the system. In other words, data originally inferred by the system may be reinforced or rejected by the explicit interactions of end users.
The user interaction may include new data sources and modifications to known data sources. Manipulations of known elements can be converted back to their morpheme ancestors. Any data elements not recognized by the system may represent new data. However, this new data can be placed in the context of the known data, as changes are made in the context of the existing dimensional concept taxonomy generated by the system. Thus, any new data elements added by the user may be provided in the context of known data. The relationship between known and unknown may extend the amount of dimensional conceptual classification information that can be inferred from user interaction to a large extent.
"Shortcut" feedback 212c in the system may provide a real-time interactive environment for the end user. User-initiated taxonomy and container edits 2902 can be queued in the system and formally processed when system resources become available. However, users may need (or prefer) real-time feedback on their changes to the dimensional concept taxonomy. The time required to process the changes through the formal feedback path of the system may delay this real-time feedback to the user. As a result, one embodiment of the system provides shortcut feedback.
The shortcut feedback may begin with processing user edits for the then-existing domain data store 706. Since changes by the user may include dimension concept taxonomy information that does not currently exist in the domain data store, the system must use a process that approximates the effects of these changes.
The rules for creating implicit relationships 212b (described above) may be applied to the new data as a short-term substitute for full processing. This approach allows the user to insert new data and interact with it immediately.
Unlike the dimensional conceptual relationships computed through the formal process of the system, this approximation process may use system-unknown morphemes that exist in a set of known morphemes to define and adjust the dimensional conceptual relationships of the known morphemes in the set. These adjusted relationships are described as "implicit relationships" 216, described in more detail above.
For new data elements, short-term concept definitions may be assigned based on implicit relationships (described above), facilitating real-time interactive processing. The short term implied concept definition may be replaced with the complete concept definition revised by the system upon completion of the next full processing cycle for the domain.
Those skilled in the art will recognize that there may be many algorithms to approximate the effect of unknown morphemes on the relationship of known morphemes in the system.
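By way of illustration only, the shortcut feedback described above might be sketched in Python as follows. The class `DomainStore`, its fields, and the rule that a full processing cycle registers every queued keyword are hypothetical assumptions made for this sketch; the disclosure does not prescribe a particular algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class DomainStore:
    """Hypothetical stand-in for the domain data store 706."""
    known_keywords: set[str]
    definitions: dict[str, dict] = field(default_factory=dict)
    pending: list[tuple[str, list[str]]] = field(default_factory=list)

    def shortcut_edit(self, node_id: str, keywords: list[str]) -> dict:
        """Apply an edit immediately and queue it for formal processing."""
        provisional = {
            # Keywords already registered are treated as explicit.
            "explicit": [k for k in keywords if k in self.known_keywords],
            # Unregistered keywords are kept as implicit, short-term entries.
            "implicit": [k for k in keywords if k not in self.known_keywords],
            "provisional": True,
        }
        self.definitions[node_id] = provisional
        self.pending.append((node_id, keywords))
        return provisional

    def full_processing_cycle(self) -> None:
        """Replace provisional definitions with complete ones."""
        for node_id, keywords in self.pending:
            # Assumption: full processing registers every keyword it analyzes.
            self.known_keywords.update(keywords)
            self.definitions[node_id] = {
                "explicit": list(keywords),
                "implicit": [],
                "provisional": False,
            }
        self.pending.clear()
```

In this sketch the user sees the provisional definition at once, and the same queued edit later yields the complete definition when the next full cycle runs.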
Providing user interaction
The dimension concept taxonomy provides an environment for user interaction. In one embodiment of the invention, two primary user interfaces may be provided. A navigational "viewer" interface may provide for browsing the faceted classification. This interface belongs to a class known as "faceted navigation". The other interface may be referred to as a "summarizer", which may allow end users to change the relationship structure, concept definitions, and content node assignments.
The general features of a faceted navigation and summarizer interface are well known in the art. The novel aspects described below, and in particular the complex-adaptive system 212 involved, will be clear to those skilled in the art.
View concept taxonomy
The dimension concept taxonomy may be expressed by a presentation layer. In one embodiment, the presentation layer is a web site. The website may include a web page that represents a set of dimension concept taxonomy views. The view is part of a dimensional conceptual taxonomy (e.g., a multi-hierarchical subset filtered by one or more axes) within the active node scope. An active node in this context is a node within the dimension concept taxonomy that is currently the focus of the end user or domain owner. In one embodiment, "tree fragments" are used to represent these relationships.
Users can provide text queries to the system to move directly to their general area of search and information retrieval. As is well known in the art, views may be filtered and arranged by facets and attributes that intersect concepts.
Content nodes may be classified according to various concepts. That is, for any given active concept, all content nodes that match the attributes of that concept filtered by the user may be presented.
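As a minimal sketch of this filtering, and assuming (hypothetically) that each concept and each content node is represented as a set of attribute strings, the content nodes presented for an active concept might be selected as follows. The function name `matching_nodes` and the set representation are illustrative assumptions only.

```python
def matching_nodes(active_concept, user_filters, content_nodes):
    """Return ids of content nodes carrying every required attribute.

    active_concept: attributes of the concept currently in focus
    user_filters:   additional attributes selected by the user
    content_nodes:  mapping of node id -> attributes of that node
    """
    required = set(active_concept) | set(user_filters)
    return [
        node_id
        for node_id, attrs in content_nodes.items()
        if required <= set(attrs)  # subset test: all required attributes present
    ]
```

A subset test is only one possible matching rule; an embodiment could equally rank partial matches rather than exclude them.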
The "resolution" of each view may be varied around each node. Resolution refers to the breadth and exhaustiveness of the displayed relationships. The resolution of the views may also be considered in the context of the size and selection of the analyzed portion of the domain. There is also a trade-off between the depth of the analysis and the amount of time (delay) required for processing. The presentation layer may be operable to select a portion of the domain to analyze based on the location of the active node, the resolution of the view, and parameters configured by the administrator.
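One way to picture this selection, offered only as a sketch, is a breadth-first traversal that stops a fixed number of relationship steps from the active node. Treating resolution as a hop count, and the names `view_around` and `neighbors`, are assumptions of this example rather than features stated in the disclosure.

```python
from collections import deque


def view_around(active, neighbors, resolution):
    """Return {node: distance} for nodes within `resolution` steps of `active`.

    neighbors: mapping of node -> iterable of directly related nodes
    """
    seen = {active: 0}
    queue = deque([active])
    while queue:
        node = queue.popleft()
        if seen[node] == resolution:
            continue  # do not expand beyond the requested resolution
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen
```

A larger resolution returns a wider portion of the taxonomy at the cost of more processing, which mirrors the trade-off between depth of analysis and delay noted above.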
In one embodiment operating under the dynamic synthesis mode (as discussed above), the interactions involved in viewing the dimensional concept taxonomy can generate feedback for the complex-adaptive system of the present invention. Under these conditions, the implicit feedback generated by viewing interactions will be substantially transparent from the end user's perspective. In other words, the end user will create valuable feedback for the system merely by viewing the dimensional concept taxonomy.
There are many benefits to this transparent user-generated feedback. The end user will not need to expend the effort required to edit the dimension concept taxonomy directly (as discussed in detail below). In addition, under this dynamic synthesis mode, only the portions of the dimension concept taxonomy requested by the user are returned as feedback for subsequent analysis operations. Constraining the feedback set to this narrower range of only the information actually requested by the end user has the effect of improving the quality of the feedback data generated by the system.
Editing concept taxonomy
The presentation layer refines the dimension structure into a simplified view necessary for human interaction (such as a web page that includes links to related pages in the dimension concept taxonomy). Thus, the presentation layer can also double as an editing environment for the information structures from which these views are derived. In one embodiment, the user can switch from within the presentation layer to an edit mode to immediately edit the structure.
The summarizer provides the user with a means to manipulate the hierarchical data. The summarizer also allows the user to manipulate the content nodes associated with each concept in the structure.
User interaction may alter the context and/or concepts assigned to nodes in the dimension concept taxonomy. Context refers to the position of a node relative to other nodes in the structure (i.e., the dimensional concept relationships that establish the structure). Concepts describe the content or topic of a node, expressed as a collection of morphemes.
In one embodiment, the user may be presented with a review process so that the user can confirm the parameters of such user edits. The following dimensional concept taxonomy information can be presented to the user for this review: 1) the contents of the node; 2) morpheme groups (expressed as keywords) associated with the content; and 3) the location of the node in the taxonomy structure. The user can alter the latter two parameters (morphemes and relative position) to make them consistent with the first (the content at that node).
Thus, the interactions in one embodiment of the invention can be summarized as some combination of two broad types: a) container editing; and b) taxonomy editing.
Container editing changes the assignment of content containers (such as URL addresses) to content nodes classified within a dimensional concept taxonomy. Container editing also changes the description of content nodes within the dimension concept taxonomy.
Taxonomy editing is a contextual change to node locations in the dimension concept taxonomy. These changes include adding new nodes to the structure and relocating existing nodes. This dimension concept taxonomy information can be fed back into the system as changes to morpheme relationships associated with concepts affected by user interaction.
With taxonomy editing, new relationships between concepts in taxonomies can be created. These conceptual relationships may be constructed through user interaction. Since these concepts are morpheme-based, the new concept relationships may be associated with a new set of morpheme relationships. This dimension concept taxonomy information can be fed back into the system to recalculate these implied morpheme relationships.
User interaction may also be provided at a more rudimentary level of abstraction, such as keywords and morphemes.
FIG. 26 illustrates one embodiment of a container editing process. The container edits change the concept definition and underlying morphemes that describe each content node. With these changes, the user alters the underlying conceptual description of the content node. In doing so, they may alter the morphemes mapped to the concept definitions at these content nodes.
The user interaction may construct a concept definition assigned to the content node expressed as a keyword set. In this configuration, the user may interact with the morpheme dictionary and domain data store of the system. Any new keywords created here may be sent to the morpheme extraction process of the system as described above.
In this example, document 2801 is an active container. In the user interface, a collection of content-describing keywords 2802 may be presented to the user along with the document (for simplicity of example, the relative position of this node in the dimensional concept taxonomy is not shown here).
In this example, the user may determine, upon reviewing the content, that the keywords associated with the page are not optimal. New keywords may be selected by the user to replace the set accompanying the page (2803). The user updates the keyword list 2804 as the new concept definition associated with the document.
These changes are communicated to the domain data store 706. The data store may be searched to identify all keywords registered in the system.
In this example, the list includes all keywords identified by the user except "dogs". As a result, "dogs" will be treated as an implicit keyword that modifies the explicit keywords registered in the system (2806).
When the domain is reviewed by the centralized transformation engine, the implicit keyword can be fully analyzed. It can then be replaced by an explicit keyword (either an existing keyword or a new keyword) and associated with one or more morphemes.
Personalization
FIG. 28 illustrates an alternative embodiment of the present invention that provides personalized features in which personalized versions of a dimensional concept taxonomy may be maintained for individual users of a domain.
One embodiment of personalization provides a means to maintain a common concept taxonomy 210e alongside personalized concept taxonomies 210f for individual users. End users may engage the common concept taxonomy 210e when they first interact with the system. Subsequent interactions may engage the user's personalized view of the taxonomy 210f.
The data structure is "personalized" by assembling a unique representation of the data structure in response to user interactions 212a that represent the preferences of the respective end user. The results of the editing may be stored as personalization data from the user interaction (3004). In one embodiment, these edits are stored as "special cases" of the common concept taxonomy 210e. When processing the personal concept taxonomy 210f, the system may substitute any changes it finds in the user's special case table.
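The special-case mechanism described above can be pictured, purely as a sketch, as a per-user table of overrides laid over a shared structure. The dictionary layout and the function name `personal_view` are hypothetical; the disclosure does not specify how the special case table is stored.

```python
def personal_view(common, special_cases, user):
    """Build a user's personal taxonomy from the common one plus overrides.

    common:        node id -> dict of node properties (shared by all users)
    special_cases: user id -> {node id -> dict of overriding properties}
    """
    # Copy each entry so the common taxonomy is never modified.
    view = {node: dict(entry) for node, entry in common.items()}
    for node, changes in special_cases.get(user, {}).items():
        # Substitute any changes recorded for this user.
        view.setdefault(node, {}).update(changes)
    return view
```

Because only the differences are stored per user, a user with no recorded special cases simply receives the common taxonomy.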
The illustrated elements may identify collaborators in a complex-adaptive process of the system. It provides a means to associate a unique identifier with each user and store their interactions.
In another embodiment, the system may assign a unique identifier to each user that interacts with the dimension concept taxonomy 210e through the presentation layer. These identifiers may be considered morphemes. Each user is assigned a Globally Unique Identifier (GUID), preferably a 128-bit integer (16 bytes) that can be used across all computers and networks. The user GUID exists as a morpheme in the system.
Similar to any other element in the system, the user identifier may be registered in the morpheme hierarchy (explicit morphemes) or may be unknown to the system (implicit morphemes).
The distinction between the two types of identifiers is similar to the distinction between registered visitors and anonymous visitors, as those terms are well known in the art. Various ways in which an identifier may be generated and associated with a user ("tracker") are also known in the art and will not be discussed here.
When a user interacts with the system (e.g., by editing a content container), the system may add the user's identifier to the morpheme set that describes the concept definition. The system may also add one or more morphemes associated with the various types of interaction supported by the system. For example, the user "Bob" may wish to edit a container with the concept definition "record, studio" to include a geographic reference. The system can thus create the following concept definition for this container that is specific to Bob: { Bob, Washington, (record, studio) }.
Using this dimensional concept taxonomy information, the system can present the container in a manner specific to the user Bob by applying the same explicit and implicit relationship rules of the enhanced faceted classification described above. The container may appear on a personal web page for Bob. In his personal concept taxonomy, the page would be related to resources about Washington.
The dimension concept taxonomy information will also be globally available to other users and, as a negative feedback mechanism, subject to the statistical analysis and threshold rates established by administrators. For example, if enough users associate the Washington location with a record studio, the relationship will eventually be presented to all users as valid.
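A minimal sketch of this promotion from personal to global follows, assuming (hypothetically) that the administrator's criterion is a plain count of distinct users. The class `RelationshipTally` and its methods are illustrative names; `uuid.uuid4()` from the Python standard library is used only because it yields a random 128-bit identifier of the kind described above.

```python
import uuid
from collections import defaultdict


class RelationshipTally:
    """Counts distinct users who attach a morpheme to a concept definition."""

    def __init__(self, threshold: int):
        self.threshold = threshold  # assumption: a plain count set by an administrator
        self.supporters = defaultdict(set)

    def record(self, user_id: uuid.UUID, added: str, base: tuple) -> tuple:
        """Return the user-specific definition, e.g. (user, 'Washington', ('record', 'studio'))."""
        self.supporters[(added, base)].add(user_id)
        return (user_id, added, base)

    def is_global(self, added: str, base: tuple) -> bool:
        """True once enough distinct users have asserted the same association."""
        return len(self.supporters.get((added, base), ())) >= self.threshold


bob = uuid.uuid4()  # a random 128-bit identifier standing in for the user GUID
```

Storing supporters in a set means that repeated edits by the same user count once, so a single user cannot push an association past the threshold alone.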
This type of modification to concept definitions associated with content containers essentially adds new layers of dimension to the dimension concept taxonomy information representing the various layers of user interactivity. It provides a multi-functional mechanism for personalization using existing structuring processes applicable to other forms of information and content.
As is well known in the art, there are many techniques and architectures that can be used to add a personalized and customized presentation layer. The methods discussed herein utilize the core structural logic of the system to organize collaborators. They essentially treat user interaction as just another type of information element, illustrating the flexibility and scalability of the system. However, this does not limit the scope of the invention in terms of the various methods for adding customization and personalization to the system.
Machine-based complex-adaptive system
FIG. 29 illustrates an alternative embodiment that provides a machine-based complex-adaptive system, in which the dimensional concept relationships comprising the dimensional concept taxonomy 210 are returned directly to the transformation engine process as system input data 804b (3102).
Note that in this regard, the present invention provides the ability for end users to create and manage data structures as described in the present disclosure. In certain aspects of the present invention, the end user provides feedback that further informs the creation and management of data structures as described herein. This feedback may be provided not only by the end user, but also, for example, by a machine such as a computer that aggregates feedback from end users, or even by a machine such as a computer that is completely unattended. The role of an end user or machine is referred to in this disclosure as a "feedback agent". It should also be noted that many of the examples provided in this disclosure relate to end users for purposes of illustration, but it should be understood that in many, if not all, of these cases, machines such as computers may take over the role of end users. This section illustrates such an implementation. Thus, the present disclosure should be read as follows: in many, if not all, cases, a reference to an "end user" may be interpreted as referring to a "feedback agent".
Note that there is an important distinction between the original concept relationships derived from the source data structure and the dimensional concept relationships that emerge from the construction engine process of the system. The former are explicit in the source data structure; the latter are derived from (or emerge from) the construction methods applied to the elementary constructs within the morpheme dictionary. Thus, similar to a complex-adaptive system based on user interaction, a machine-based approach can provide a means to introduce variation into system operation 800 by synthesizing (complex) dimensional concept relationships from elementary constructs, and then selecting among the variations in the source structure analysis component.
Under this machine-based mode of operation, the selection requirements for the complex-adaptive system can be undertaken by the source structure analysis component (described above and shown in FIG. 6). In particular, dimensional concept relationships can be selected based on the identification (1002) of cyclic relationships and on the various patterns and parameters that can be used to resolve the cyclic relationships. As is well known in the art, there are many alternative means, selection criteria and analysis tools to provide a machine-based complex-adaptive system.
Dimensional concept relationships that contradict the hierarchical assumption, as identified in the aggregate by the existence of a cyclic relationship, may be pruned from the data set (1004). This pruned data set may be reassembled (1006) into an input concept taxonomy 1008, from which the operations 800 may derive a new set of elementary constructs through the remaining operations of the analysis engine.
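The identification and pruning of cyclic relationships might be sketched as below. This example uses a standard depth-first search and removes each relationship that closes a cycle (a "back edge"); that particular choice of which relationship to prune is an assumption of the sketch, since the disclosure leaves the selection criteria open. The function name `prune_cycles` is hypothetical.

```python
def prune_cycles(relationships):
    """Split (parent, child) pairs into (kept, pruned) so that `kept` is acyclic."""
    children = {}
    for parent, child in relationships:
        children.setdefault(parent, []).append(child)
        children.setdefault(child, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {node: WHITE for node in children}
    kept, pruned = [], []

    def visit(node):
        state[node] = GREY
        for child in children[node]:
            if state[child] == GREY:
                # The child is an ancestor on the current path: this
                # relationship contradicts the hierarchical assumption.
                pruned.append((node, child))
            else:
                kept.append((node, child))
                if state[child] == WHITE:
                    visit(child)
        state[node] = BLACK

    for node in children:
        if state[node] == WHITE:
            visit(node)
    return kept, pruned
```

The kept relationships can then be reassembled into an input concept taxonomy. A recursive search is used here for brevity and would need an explicit stack for very deep structures.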
This class of machine-based complex-adaptive systems may be used in conjunction with other complex-adaptive systems, such as the user-interaction based system 212 described above with reference to FIGS. 4 and 27. For example, the machine-based complex-adaptive system of FIG. 29 may be used to refine the dimensional concept taxonomy by iterating the process several times. The resulting dimensional concept taxonomy can then be introduced to users in a user-based complex-adaptive system for further refinement and evolution.
Implementation
As emphasized throughout this description of system architecture, there is a great deal of variability in the methods and techniques for designing many embodiments of the present invention, including data storage devices. Many applications of the present invention can be explained and modified by various forms of architectural design also well known in the art.
System architecture components
Computing environment
FIG. 30 illustrates one embodiment of a computing environment for the present invention.
In one embodiment, the invention may be implemented as a computer software program operating under a four-level architecture. The server application software and databases may be executed on both centralized computers and distributed, decentralized systems. The Internet can be used as a network to communicate between a centralized server and the various computing devices and distributed systems with which it interacts.
Variability and methods for establishing this type of computing environment are well known in the art. As such, no further discussion of the computing environment is included herein. Common to all applicable environments is that a user, through his or her computer or computing device, accesses a public or private network, such as the Internet or a company's intranet, thereby accessing computer software embodying the invention.
Service levels
The various levels may be responsible for providing the services. The first level 3202 and second level 3204 operate under a centralized processing model. The third level 3206 and fourth level 3208 operate under a model of distributed (decentralized) processing.
This four-level model separates decentralized private domain data from the shared centralized data used by the system to analyze the domains. This demarcation between shared data and private data is shown in FIG. 33, discussed below.
At a first level, a centralized data store represents the various data and content sources managed by the system. In one embodiment, database server 3210 may provide data services and means to access and maintain data.
Although the distributed content is described herein as being contained within a "database," the data may be stored in a plurality of linked physical locations or data sources.
The metadata may also be distributed and stored outside of the system database. For example, the HTML code segment contains metadata that can be manipulated by the system. Elements from external schemas may be mapped to elements used in schemas of the present system. Other formats for presenting metadata are known in the art. The information domain may thus provide a wealth of distributed content sources and a means for end users to manage information in a decentralized manner.
Techniques and methods for managing data across multiple linked physical locations or data sources are well known in the art and will not be discussed exhaustively further herein.
An XML data feed and Application Programming Interface (API) 3212 may be used to connect the data store 3210 to the application server 3214.
Also, as will be appreciated by those skilled in the art, XML may conform to a wide variety of proprietary and open schemas. A range of data exchange technologies provide an infrastructure to incorporate various distributed content formats into a system. This and all following discussion of the connectors used in one embodiment do not limit the scope of the invention.
At a second level 3204, applications residing on the centralized server 3214 may contain the core programming logic for the present invention. The application server may provide processing rules for implementing various aspects of the method of the invention, as well as connectivity to the database server. This programming logic, shown in FIGS. 4-17 and 20-23, is described in detail above.
In one embodiment, the structural information processed by the application server may be exported as XML 3216. XML may be used to connect external data storage devices and websites with application servers.
Likewise, XML 3216 may be used to communicate this interactivity back to the application server for further processing in the process of optimization and refinement.
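As a hedged illustration of such an XML exchange, the following sketch serializes a small set of concepts and relationships and reads the relationships back, using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`taxonomy`, `concept`, `keyword`, `relationship`, `parent`, `child`) are invented for this example; the disclosure notes that the XML may conform to a wide variety of schemas.

```python
import xml.etree.ElementTree as ET


def export_taxonomy(concepts, relationships) -> str:
    """Serialize concepts (id -> keywords) and (parent, child) pairs as XML text."""
    root = ET.Element("taxonomy")
    for concept_id, keywords in concepts.items():
        concept = ET.SubElement(root, "concept", id=concept_id)
        for keyword in keywords:
            ET.SubElement(concept, "keyword").text = keyword
    for parent, child in relationships:
        ET.SubElement(root, "relationship", parent=parent, child=child)
    return ET.tostring(root, encoding="unicode")


def import_relationships(xml_text: str):
    """Read (parent, child) pairs back out of the exported XML."""
    root = ET.fromstring(xml_text)
    return [(r.get("parent"), r.get("child")) for r in root.iter("relationship")]
```

The same document format can carry data in both directions, which is the property relied on when edits made at the presentation layer are returned for further processing.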
At a third level, distributed data stores 3218 may be used to store domain data. In one embodiment, this data may be stored on the web server in the form of an XML file. There are many alternative modes of storing domain data, such as external databases. The distributed data storage device may be used to distribute output data to the rendering devices of end users.
In one embodiment, the output data may be distributed as an XML data feed that is represented using an XSL transformation file (XSLT) 3220. These techniques may represent output data at a fourth level through a presentation layer.
The presentation layer may be any decentralized website, client software, or other medium that can present taxonomies in a form that can be utilized by humans or machines. The presentation layer may represent an outward presentation of the environment and taxonomies used by the end user to interact with the taxonomies. In one embodiment, the data may be presented as a website and displayed in a browser.
This structured information may provide a platform for user collaboration and input. Those skilled in the art will recognize that XML and XSLT can be used to represent information across a diverse range of computing platforms and media. This flexibility allows the system to be used as a process within a wide range of information processing tasks.
For example, a morpheme may be expressed using keywords in a data feed. By including morpheme references in the data feed, the system can provide additional processing on the presentation layer in response to a particular morpheme identifier. This flexible application is described above in the discussion of personalization (fig. 28).
Using web-based forms and controls 3224, a user can add and modify information in the system. This input can then be returned to the centralized processing system as XML data feeds 3226 and 3216 via the distributed data storage devices.
In addition, an open XML format such as RSS may also be incorporated from the Internet as input to the system.
Modifications to the structure information may be handled by the application server 3214. Shared morpheme data from this process may be returned via the XML and API connector 3212 and stored in the centralized data store 3210.
There are many possible designs, modes and products known in the art of broad system architecture. These include centralized, decentralized, and open system architecture access models. The technical operation of these implementations and various alternative implementations covered by the present invention will not be further discussed herein.
Data models and schemas
FIG. 31 provides a simplified overview of the core data structures within the system in one embodiment of the invention. This simplified schema illustrates the manner in which data may be transformed by the application programming logic of the system. It also illustrates how morpheme data may be deconstructed and stored.
The data architecture of the system is designed to centralize the morpheme dictionary while providing temporary data storage for processing domain-specific entities.
Note that domain data may flow through the system without being stored in it. The tables mapped to the domain entities may be temporary data stores, which are then transformed into the output data and data store for the domain. The domain data store can be stored along with other centralized assets or distributed to storage resources maintained by the domain owner.
In one embodiment, the application and data servers (described above and shown in FIG. 30) may primarily handle data. Data can be organized within three generalized data abstraction regions in the system:
The entity abstraction layer 3302: where the entities are the primary knowledge representation building blocks in the system. Entities may include: morphemes 3304, keywords 3306, concepts 3308, content nodes 3310, and content containers 3312 (represented by URLs).
The relationship abstraction layer 3314: where the entity definitions are represented by relationships between the various entities used in the system. Entity relationships may include: morpheme relationships 3316, concept relationships 3318, keyword-morpheme relationships 3320, concept-keyword relationships 3322, node-concept relationships 3324, and node-content container (URL) relationships 3326.
The tag abstraction layer 3328: where the terms used to describe the entities are separated from the structural definitions of the entities themselves. Tags 3330 may include: morpheme tags 3332, keyword tags 3334, concept tags 3336, and node tags 3338. Tags can be shared across the various entities. Alternatively, tags may be partitioned by entity type.
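The three abstraction regions can be sketched, under stated assumptions, as a small in-memory structure in which entities are bare identifiers, relationships are pairs of identifiers, and descriptive terms are held in a separate mapping. The class `Schema`, its field names, and the use of integer identifiers are hypothetical and cover only a subset of the entities listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    # Entity region: identifiers only, no descriptive text.
    morphemes: set[int] = field(default_factory=set)
    keywords: set[int] = field(default_factory=set)
    concepts: set[int] = field(default_factory=set)
    # Relationship region: entity definitions expressed as links between ids.
    keyword_morphemes: set[tuple[int, int]] = field(default_factory=set)   # (keyword, morpheme)
    concept_keywords: set[tuple[int, int]] = field(default_factory=set)    # (concept, keyword)
    morpheme_relationships: set[tuple[int, int]] = field(default_factory=set)  # (parent, child)
    # Third region (3328): the terms describing entities, kept apart from structure.
    tags: dict[tuple[str, int], str] = field(default_factory=dict)  # (entity type, id) -> term

    def concept_morphemes(self, concept_id: int) -> set[int]:
        """Resolve a concept to its morphemes by following the relationship links."""
        kws = {k for c, k in self.concept_keywords if c == concept_id}
        return {m for k, m in self.keyword_morphemes if k in kws}
```

Keeping the descriptive terms in their own mapping means that a term can be changed, or translated, without touching the structural relationships.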
Note that this simplified schema in no way limits the database schema used in one embodiment. The actual schema is determined mainly by considerations of system performance, storage and optimization. As known to those skilled in the art, there are many ways to design a database system that reflects the design elements described herein. As such, the various methods, techniques and designs that may be used as embodiments of the present invention will not be discussed further herein.
Dimension transformation system
FIG. 32 illustrates a system overview to perform the data structure transformation operations described above and further described below, according to one embodiment.
The three generalized transformation processes introduced above may be restated in more detail for one embodiment: 1) domain analysis and compression, which discovers the structural facets of the domain 200, defined in terms of elementary constructs, in a complex dimensional structure; 2) synthesis and expansion, which synthesizes and expands the complex dimensional structure of the domain into a dimensional concept taxonomy 210 by means of an enhanced faceted classification; and 3) management, which manages user interactions within the dimension concept taxonomy 210 through a faceted navigation and editing environment, thereby enabling a complex-adaptive system that refines the structures (e.g., 206 and 210) over time.
Analyzing elementary constructs
In one embodiment, a distributed computing environment 600 is schematically illustrated. One computing system 601 for centralized processing may operate as a transformation engine 602 for data structures. The transformation engine may obtain as its input the source data structures 202 from one or more domains 200. The transformation engine 602 may include: an analysis engine 204a, a morpheme dictionary 206, and a construction engine 208a. These system components may provide the analysis and synthesis functions described above and illustrated in FIG. 2.
In one particular embodiment, the complex dimensional structure may be encoded into XML files 604 that may be distributed via a web service (or API or other distribution channel) over the Internet 606 to one or more computing systems for decentralized processing (e.g., 603). With this and/or other modes of distribution and decentralization, a wide range of developers and publishers can use the transformation engine 602 to create complex dimensional structures. Applications include websites, knowledge bases, e-commerce stores, search services, client software, associated information systems, analytics, and the like.
It is noted here that these descriptions of centralized and decentralized processing should not be confused with the various centralized and decentralized physical systems that may be used to provide these modes of processing. Herein, "centralized processing" refers to shared, common, and/or aggregated data and services for transformation processing. "decentralized processing" refers to domain-specific data and services. As is known in the art, there are a number of physical systems and architectures that can be implemented to achieve this mix of centralized and decentralized processing.
Synthesis with enhanced faceted classification
The complex dimensional structure embodied in the XML file 604 may be used as a basis for reorganizing domain content. In one embodiment, an enhanced faceted classification method may use the complex dimensional structure embodied in the XML file 604 to reorganize the material in the domain, deriving the dimension concept taxonomy 210 at the second computing system 603. In general, a second computing system (e.g., system 603) may be maintained by the domain owner, who is also responsible for the domain being reorganized by the concept taxonomy 210. Specific information regarding the multi-level data structure used by the system is provided below and shown in FIG. 33.
In one embodiment of the system 603, a presentation layer 608 or Graphical User Interface (GUI) for the dimension concept taxonomy 210 may be provided. Client-side tools 610 (e.g., browsers, web-based forms, and software components) may allow domain end users and domain owners/administrators to interact with the dimension concept taxonomy 210.
Complex-adaptive processing via user interaction
The dimension concept taxonomy 210 can be customized and delimited by individual end users and domain owners. These user interactions may be used by a second computing system (e.g., 603) to provide human awareness and additional processing resources to the classification system.
Dimension taxonomy information, e.g., encoded in XML 212a, that embodies the user interaction may be returned to the transformation engine 602 by being distributed via a web service or other means. This allows the data structures (e.g., 206 and 210) to evolve and improve over time.
The feedback from the second system 603 to the transformation engine 602 creates a complex-adaptive processing system. Although end users and domain owners interact at a high level of abstraction through the dimensional concept taxonomy 210, user interactions can be translated into the elementary constructs (e.g., morphemes and morpheme relationships) that underlie the dimensional concept taxonomy information. By coupling end-user and domain owner interactions to elementary constructs and feeding them back to the transformation engine 602, the system can evaluate interactions in the aggregate.
Using this mechanism, ambiguities and conflicts that have historically appeared in collaborative classification can be removed. This collaborative classification approach therefore seeks to avoid, at a conceptual level, personal and collaborative negotiations that may occur with other such systems.
By allowing users to share content nodes 302 and classification data (dimensional concept taxonomy information) through their interactions, user interactions also extend the available source data 202, which enhances the overall quality of classification and increases the available processing resources.
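The translation of concept-level interactions into elementary constructs, and their evaluation in the aggregate, might be sketched as follows. Pairing every morpheme of the parent concept with every morpheme of the child concept is an assumption made for this example only, as are the function names; the disclosure does not fix the mapping.

```python
from collections import Counter
from itertools import product


def to_morpheme_relationships(edit, concept_morphemes):
    """Translate one (parent concept, child concept) edit into morpheme pairs."""
    parent, child = edit
    return set(product(concept_morphemes[parent], concept_morphemes[child]))


def aggregate(edits, concept_morphemes):
    """Tally how often each candidate morpheme relationship is implied across all edits."""
    tally = Counter()
    for edit in edits:
        tally.update(to_morpheme_relationships(edit, concept_morphemes))
    return tally
```

Because the tally is kept at the morpheme level, edits made by different users to different concepts can reinforce the same underlying relationship without those users negotiating at the concept level.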
Multi-level data structure
FIG. 33 illustrates the means by which elementary constructs harvested from each source data structure 202 are composited through successive levels of abstraction and dimensionality to create a dimensional concept taxonomy 210 for each domain 200. Also illustrated is the delineation between the decentralized private data (708, 710, and 302) embodied in each domain 200 and the shared elementary constructs (morpheme dictionary 206) used by the centralized system to inform the classification patterns generated for each domain.
Elementary constructs
The elementary constructs, namely morphemes 310 and morpheme relationships, may be stored as centralized data in the morpheme dictionary 206. The centralized data can be centralized across the distributed computing environment 600 (e.g., via the transformation engine system 601) and made available to all domain owners and end users to assist in domain classification. Since the centralized data is elementary (linguistic) and is not associated with any specific and private knowledge context represented by concepts 306 and concept relationships, it can be shared among the second, decentralized computing systems 603. The system 601 need not persistently store the unique expressions and combinations of these elementary constructs that make up the unique information contained in each domain.
The morpheme dictionary 206 may store the attributes of each morpheme 310 in a set of tables of morpheme attributes 702. The morpheme attributes 702 may reference structural parameters and statistics used by the analysis process of the transformation engine 602 (as described further below). Morpheme relationships may be ordered in aggregates into morpheme hierarchies 402.
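As a minimal sketch of the storage just described, the following models a morpheme dictionary that keeps per-morpheme attributes and parent-child morpheme relationships. The class names, the `frequency` statistic, and the sample morphemes are hypothetical; the specification says only that attributes may reference structural parameters and statistics used during analysis.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class MorphemeAttributes:
    # Hypothetical statistic: how often the morpheme has been registered.
    frequency: int = 0


@dataclass
class MorphemeDictionary:
    # morpheme -> MorphemeAttributes
    attributes: dict = field(default_factory=dict)
    # parent morpheme -> set of child morphemes
    children: dict = field(default_factory=lambda: defaultdict(set))

    def add_morpheme(self, morpheme):
        attrs = self.attributes.setdefault(morpheme, MorphemeAttributes())
        attrs.frequency += 1

    def add_relationship(self, parent, child):
        self.add_morpheme(parent)
        self.add_morpheme(child)
        self.children[parent].add(child)

    def descendants(self, morpheme):
        """Return every morpheme reachable below `morpheme` in the hierarchy."""
        seen, stack = set(), [morpheme]
        while stack:
            for child in self.children.get(stack.pop(), ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen


dictionary = MorphemeDictionary()
dictionary.add_relationship("sport", "ball")
dictionary.add_relationship("ball", "foot")
```

Here `dictionary.descendants("sport")` walks the aggregated relationships as a hierarchy, while the attribute table is kept separate from the relationships.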
Dimension-based hierarchical output data
The domain data store 706 can store domain-specific data (complex dimensional structure 210a) derived by the transformation engine system 601 from the source data structure 202 and using the morpheme dictionary 206. In one embodiment, the domain-specific data may be stored in XML form.
The XML-based complex dimensional structure 210a in each domain data store 706 can include: a domain-specific keyword ranking 710, a set of content nodes 302, and a set of concept definitions 708. The keyword ranking 710 may include a ranked set of keyword relationships. The XML output itself may be encoded as faceted data. The faceted data represents the dimensionality of the source data structure 202 as its structural facets, and represents the content nodes 302 of the source data structure 202 in terms of the facet attributes. This approach allows domain-specific resources (e.g., system 603) to process the complex dimensional structure 210a into higher levels of abstraction (e.g., the dimension concept taxonomy 210).
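One way such an XML encoding might look is sketched below using the Python standard library. The element and attribute names (`domain`, `keywordRanking`, `contentNode`, and so on) are assumptions made for illustration; the specification does not define a schema here.

```python
import xml.etree.ElementTree as ET


def build_domain_xml(keyword_relationships, content_nodes, concept_definitions):
    """Encode the three parts of a domain data store as one XML tree."""
    root = ET.Element("domain")

    ranking = ET.SubElement(root, "keywordRanking")
    for parent, child in keyword_relationships:
        ET.SubElement(ranking, "relationship", {"parent": parent, "child": child})

    nodes = ET.SubElement(root, "contentNodes")
    for node_id, concept_id in content_nodes:
        ET.SubElement(nodes, "contentNode", {"id": node_id, "concept": concept_id})

    concepts = ET.SubElement(root, "conceptDefinitions")
    for concept_id, keywords in concept_definitions.items():
        concept = ET.SubElement(concepts, "concept", {"id": concept_id})
        for keyword in keywords:
            ET.SubElement(concept, "keyword").text = keyword

    return root


root = build_domain_xml(
    keyword_relationships=[("sport", "football")],
    content_nodes=[("n1", "c1")],
    concept_definitions={"c1": ["sport", "football"]},
)
xml_text = ET.tostring(root, encoding="unicode")
```

Because content nodes refer to concept definitions by identifier, and concept definitions are expressed in keywords, a domain-specific consumer can rebuild higher-level structures from this one document.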
The complex dimensional structure 210a may be used as an organizational basis to manage the relationships between the content nodes 302. The new set of organization rules may then be applied to the elementary constructs for classification. The organization rules may include enhanced facet classification methods as described in detail below and shown in FIGS. 20-22.
The enhanced faceted classification method may be applied to the complex dimensional structure 210a. Other simpler classification methods may also be applied, and other data structures (simple or complex) may be created from the complex dimensional structure 210a as desired. In one embodiment, an output mode that explicitly represents a faceted classification may be used. Other output modes may be used. Various data models can be used to represent the generated faceted classifications for the domains. The available classification methods are closely related to the type of data structure being classified. Thus, these alternative embodiments for classification may be directly linked to the alternative dimensional embodiments discussed above.
The data entities (e.g., 708, 710) contained in the domain data store 706 include references to elementary constructs stored in the morpheme dictionary 206. In this manner, after the dimensional concept taxonomies 210 for each domain 200 are created, they can be re-analyzed to accommodate the changes. When the domain owner wishes to update its classification, domain-specific data may be reloaded into the analysis engine 204a for processing. The analysis of the domain 200 may be done in real-time (e.g., by end-user interaction via XML 212a) or by periodic updates (in a queue fashion).
Shared data and private data
One advantage of the dimensional knowledge representation model is the clear separation between private domain data and the shared data used by the system to process domains into complex dimensional structures 210a. This separation provides the benefits of distributed computing, such as a hosted Application Service Provider (ASP) processing model, the opportunity to use a utility computing environment such as the one described herein, or a software as a service (SaaS) application delivery model. Under these models, a third party can offer transformation engine services to domain owners. The domain owner can thus take advantage of the economies of scale provided by these types of models.
Domain-specific data of the domain owner may be securely hosted (e.g., via an ASP) under various storage models because it may be separate from the shared data (i.e., morpheme dictionary 206) and other domain owner's private data. Alternatively, domain-specific data may be hosted by the domain owner, physically removed from the shared data.
Under this distributed knowledge representation model, domain owners can benefit from the economic advantages and specialization of centralized knowledge transformation services and the "intelligence of aggregation" from centralized classification data. However, by keeping the necessary domain-specific data separate from these centralized services and data assets, domain owners can build on shared knowledge (e.g., a morpheme dictionary) across the entire user population without compromising their unique knowledge.
Knowledge warehouses and intranets within an enterprise setting provide one example of this application of shared aggregated knowledge within the context of a private knowledge domain. Currently, where private knowledge needs to be maintained for competitive advantage, companies face a tradeoff between the economic advantage of pooling knowledge and open collaboration. The system described herein allows for a class of closed information domains that benefit not only from the centralized knowledge representation and transformation services described herein, but also from the shared data assets in the morpheme dictionary, while keeping their comprehensive knowledge and domain-specific data assets private.
Distributed computing environment
In one embodiment, the build engine may be distributed as a software application running on an open source platform. One such open source platform is the "LAMP" technology stack, a web platform comprising LINUX™, APACHE™, MySQL™, and a programming language such as Perl, PHP, or Python. With such an application, multiple copies of the build engine's synthesis rules can be run directly on the distributed physical systems of domain owners. Under this model, a distributed physical system running centralized processing rules is obtained (since each copy of the build engine has the same instructions).
In this manner, the upgrade costs for synthesizing the complex dimensional structure of each domain are distributed among the resources of each domain owner. In a similar manner, the build engines can be distributed as lightweight client-side applications that synthesize complex dimensional structures as desired by the end users of these applications.
In addition to the opportunity to run these decentralized systems directly on the systems of domain owners and end users, utility computing systems such as AMAZON WEB SERVICES™ (AWS) provide an economical distribution mechanism for the centralized build engine rules (the direct cost of running a virtualized instance of the build engine may be more than offset by the indirect cost of distributing and supporting build engines across the heterogeneous environments of domain owners). Instead of physically distributing copies of the build engine, a virtualized build engine application may be provided within the utility computing environment.
Within AWS, for example, an image of the build engine may be created and uploaded to the virtualized environment of the AWS Elastic Compute Cloud service (EC2). EC2 may provide one or more virtual server environments. An AWS "image" is essentially a disk image of a virtual server; an "instance" is an operational virtual server based on that disk image. New instances of the build engine running on virtual servers may be provisioned as needed to handle domains and adapt to user activity.
In this decentralized environment (and many others), domain-specific data may be decoupled from the build engine. Within AWS, EC2 may be used for processing, the Simple Storage Service (S3) may be used for data storage, and the Simple Queue Service (SQS) may be used to coordinate messaging across EC2, S3, and the other centralized services for analysis and complex-adaptive feedback introduced above and discussed in more detail below.
The AWS S3 service can store and distribute the faceted data sets that encode the complex dimensional structure for a domain. These domain-specific faceted data sets can be shared among the multiple virtual servers that process the build engine rules.
Synthesized concept relationships may be stored in this distributed environment. Build engine requests may be synthesized and sent in parallel to the end user system and to S3. Subsequent synthesis requests whose parameters match those of previous requests for the domain may then be served from the concept relationship cache in S3, or may be regenerated directly by the build engine if updates are needed. Importantly, the synthesized relationships are also available as feedback for the next analysis cycle in the centralized analysis engine service described above.
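The caching behavior described above can be sketched as a parameter-keyed lookup. In this sketch an in-memory dictionary stands in for remote storage such as S3 and a plain callable stands in for the build engine; the class name, the choice of key fields, and the `refresh` flag are all assumptions made for illustration.

```python
class ConceptRelationshipCache:
    def __init__(self, synthesize):
        self._synthesize = synthesize  # callable standing in for the build engine
        self._store = {}               # stands in for remote storage such as S3
        self.feedback = []             # results exposed to the next analysis cycle

    @staticmethod
    def _key(domain, active_concept, axis, max_degrees):
        # Sort the axis attributes so equivalent requests share one key.
        return (domain, active_concept, tuple(sorted(axis)), max_degrees)

    def get(self, domain, active_concept, axis, max_degrees, refresh=False):
        key = self._key(domain, active_concept, axis, max_degrees)
        if refresh or key not in self._store:
            result = self._synthesize(domain, active_concept, axis, max_degrees)
            self._store[key] = result
            self.feedback.append((key, result))
        return self._store[key]


calls = []


def fake_build_engine(domain, active_concept, axis, max_degrees):
    """Placeholder synthesis that records each invocation."""
    calls.append(active_concept)
    return [(active_concept, "related-to", attribute) for attribute in sorted(axis)]


cache = ConceptRelationshipCache(fake_build_engine)
first = cache.get("sports", "football", {"ball", "team"}, 2)
second = cache.get("sports", "football", {"team", "ball"}, 2)  # served from the cache
third = cache.get("sports", "football", {"team", "ball"}, 2, refresh=True)
```

Only the first and third requests reach the placeholder build engine; the second is answered from the cache because its parameters match an earlier request.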
Those skilled in the art will recognize that many architectural improvements and developments are possible in the distributed computing arts. Examples of such improvements include parallelization across multiple virtual machines and load balancing across domains and user activity.
XML schema and client-side transformation
The faceted output data may be encoded as XML and rendered by XSLT. The faceted output may be reorganized and represented in many different ways (e.g., with reference to the published XFML schema). Alternative outputs for representing hierarchies are also available.
In one embodiment, XML transformation code (XSLT) is used to render the presentation layer. All information elements managed by the system (including distributed content, if it is channeled through the system) can be rendered by XSLT.
Client-side processing is the means by which an embodiment connects data feeds to the presentation layer of the system. These types of connectors may be used to output information from an application server to the various media that use the configuration information. XML data from an application server may be processed by XSLT for presentation on a web page.
Those skilled in the art will recognize the current and future functionality that XML technology and similar rendering technologies can provide in service of the present invention. In addition to basic publishing and data rendering, XSLT and similar technologies may provide a range of programming opportunities. Complex information structures (e.g., information structures created by the system) may provide actionable information (e.g., data models). Software programs and agents can act on information in the presentation layer to provide sophisticated interactivity and automation. Thus, the scope of the invention provided by the core architectural advantages of the system can extend far beyond simple publishing.
Those skilled in the art will also recognize the variability available in designing where these XML and XSLT files are located. For example, the files may be stored locally on the end user's computer or generated using a web service. ASP code (or similar technology) may be used to insert information managed by the system into distributed presentation layers (such as the web pages of third-party publishers or software clients).
As another example, an XML data feed containing core structure information from the system may be combined with distributed content organized by the system. Those skilled in the art will recognize the opportunity to decouple these two types of data into separate data feeds.
These and other architectural opportunities for storing and distributing these presentation files and data feeds are well known in the art and will not be further discussed herein.
User interface
The following sections provide implementation details related to various user interfaces for the operations of the system discussed above. These operations are: viewing dimension concept taxonomies; providing synthesis parameters in a dynamic synthesis mode; and editing dimension concept taxonomies. Those skilled in the art will recognize the variety of possible user interfaces that may be implemented in service of the system operations discussed above. As such, the illustration and description of the user interface implementations in no way limit the scope of the present invention.
Dimension concept taxonomy viewer
FIG. 34 provides a screen shot of an example of the main components of a dimension concept taxonomy presentation UI for end user viewing and browsing.
Content container 2600 can hold the various types of content in the domain, as well as the structural links and concept definitions that form a presentation layer for the dimension concept taxonomy. One or more concept definitions may be associated with a content node in a container. As described herein, the system can manage any type of information element registered in the system, together with the URIs and concept definitions used to compute the dimensional concept relationships.
In one embodiment, user interface devices typically associated with traditional linear (or planar) information structures may be composited or stacked to represent dimensionality in a complex dimensional structure.
Composite traditional Web UI devices (e.g., navigation bars, directory trees 2604, and breadcrumb paths 2602) may be used to show the intersection of dimensions at various nodes in the information architecture. The dimensional axes (or hierarchies) that intersect with active content node 2606 can be represented as separate hierarchies, one for each axis of intersection.
The structural relationship may be defined by a pointer (or link) from the active content container to the related content container in the domain. This may provide a plurality of structural links between the active container and the related containers as specified by the dimension concept taxonomy. The structural links may be presented in a variety of ways including a full context presentation of concepts, a filtered presentation of concepts that only show key words on the active axis, a presentation of content node labels, and the like.
Structural links can provide context for content nodes 2608 within a dimensional concept taxonomy that are organized into prioritized content node groups within one or more relationship types (e.g., parent, child, or sibling).
XSLT can be used to present the structural information as a navigation path on a website that allows a user to navigate the structural hierarchy to a container associated with an active container. This type of presentation of structural information as a navigation device on a web site can exist in most basic applications of the system.
These and other navigation conventions are well known in the art.
Dynamic synthesis user interface
A user interface incorporating user interface controls to provide dynamic synthesis operations (as described above) is shown in FIG. 35.
The user interface may include user interface controls with which the user may specify: active concept definition 3602, active axis definition 3604, and active domain 3606. Controls for specifying active concept definitions and active axis definitions may include: links (shown) for specifying the concept definitions as keywords and initiating editing operations and text-based searches (not shown).
In one embodiment, the user may select an active concept definition from a set of concept definitions arranged within an existing concept hierarchy 3608. This selection of active concept definitions may be based on previously performed dynamic synthesis operations to provide a global navigation structure for the dimension concept taxonomy.
In another embodiment, to specify an active concept definition, a user may type a query into a text box (not shown). The query may be processed against the set of entity tags associated with the domain. As the user types, a list of suggestions may be offered based on comparisons with the tags associated with other entities, concepts, keywords, and morphemes in the domain (extraction methods are discussed in more detail above). Using these tools, a user can select a concept definition from the suggestions offered, based on a customized vocabulary of domain-specific tags.
The axis definition may be specified using a list of one or more attributes of the active concept definition or any combination of attributes that the user may wish to assemble (as described above under the discussion of the synthesis operation). "tag cloud" 3610, based on analysis of attributes from within the candidate set for dynamic synthesis operations, may be a means for providing an overview of possible axis definitions. For example, a count of the most common keywords in the candidate set may be used as a basis for both selecting a subset of keywords for presentation and changing the font size of the keyword tags based on the total keyword count.
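The keyword-count example above can be sketched as follows. The function name, the default point sizes, and the linear scaling between the least and most frequent displayed keyword are assumptions; the specification says only that counts may drive both keyword selection and font size.

```python
from collections import Counter


def tag_cloud(candidate_set, max_tags=3, min_size=10, max_size=24):
    """Pick the most common keywords and scale a font size to each count."""
    counts = Counter(kw for keywords in candidate_set for kw in keywords)
    top = counts.most_common(max_tags)
    if not top:
        return {}
    highest, lowest = top[0][1], top[-1][1]
    span = highest - lowest
    sizes = {}
    for keyword, count in top:
        # Linear interpolation; a single distinct count maps to max_size.
        scale = (count - lowest) / span if span else 1.0
        sizes[keyword] = round(min_size + scale * (max_size - min_size))
    return sizes


# Each inner list holds the keywords of one concept in the candidate set.
candidates = [
    ["football", "team", "ball"],
    ["football", "team"],
    ["football", "stadium"],
]
cloud = tag_cloud(candidates, max_tags=2)
```

With these sample candidates, "football" occurs three times and "team" twice, so the two displayed tags land at the largest and smallest font sizes respectively.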
In this implementation, the user may select the active domain by selecting from a set of markers located across the top of the screen.
To control the scope of processing and of the resulting synthesized output, the controls for defining the synthesis parameters described above may include: the degrees of separation, shown as slider 3610, and a limit on the number of content nodes returned, shown as links 3612 (in this embodiment, the limit on the number of concepts returned is linked to the limit on the number of content nodes displayed). A means by which virtual concepts can be displayed or hidden is illustrated as check box toggle control 3614.
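The effect of these two limits can be sketched as a bounded breadth-first walk outward from the active concept. This is a simplified illustration, not the synthesis operation itself: relationships are assumed to be stored as a directed mapping from each concept to its related concepts, and the function and parameter names are hypothetical.

```python
from collections import deque


def synthesize(relationships, active_concept, max_degrees, max_nodes):
    """Collect related concepts, bounded by degrees of separation and a count limit."""
    found = []
    seen = {active_concept}
    queue = deque([(active_concept, 0)])
    while queue and len(found) < max_nodes:
        concept, degree = queue.popleft()
        if degree == max_degrees:
            continue  # do not expand beyond the allowed degrees of separation
        for related in relationships.get(concept, []):
            if related not in seen and len(found) < max_nodes:
                seen.add(related)
                found.append((related, degree + 1))
                queue.append((related, degree + 1))
    return found


relationships = {
    "sport": ["football", "tennis"],
    "football": ["goalkeeper"],
    "goalkeeper": ["gloves"],
}
within_two = synthesize(relationships, "sport", max_degrees=2, max_nodes=10)
capped = synthesize(relationships, "sport", max_degrees=2, max_nodes=2)
```

Raising the degrees of separation widens the walk, while the node limit truncates the result regardless of how far the walk could otherwise reach.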
Dimension concept taxonomy summarizer
A view of the dimensional concept taxonomy may be presented to the user through the user interface described above. For purposes of illustration, assume that after reviewing the classification, the user wishes to reorganize it. From a system perspective, these interactions will generate explicit user feedback within the complex-adaptive system.
FIG. 36 illustrates a summarizer user interface that may provide these interactions in one embodiment. It shows devices for changing the location of nodes 2702 in the structure 2704 and for editing the container and concept definition assignments at each node 2076.
In one embodiment, using client-side controls, a user can move nodes in the hierarchy to reorganize the dimensional concept taxonomy. In doing so, the user may establish a new parent-child relationship between the nodes.
When the position of a node is edited, new sets of relationships between the underlying morphemes may be implied. This may therefore require recalculation to determine a new set of inferred dimensional concept relationships. These changes can be queued in order to compute the new morpheme relationships inferred from the concept relationships.
The changes may be stored as special cases of shared dimension concept taxonomies (hereinafter referred to as group concept taxonomies) to meet the personalization needs of the user (see below for more details on personalization).
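The node-move and queuing behavior described above can be sketched as follows. The class name, the child-to-parent mapping, and the shape of the queued change records are assumptions for illustration; the recalculation that would consume the queue is not shown.

```python
from collections import deque


class TaxonomyEditor:
    def __init__(self, parent_of):
        self.parent_of = dict(parent_of)  # child node -> parent node
        self.pending = deque()            # edits awaiting recalculation

    def _is_descendant(self, node, ancestor):
        """Return True if `node` is `ancestor` or lies beneath it."""
        while node is not None:
            if node == ancestor:
                return True
            node = self.parent_of.get(node)
        return False

    def move(self, node, new_parent):
        """Re-parent `node` and queue the change for later recalculation."""
        if self._is_descendant(new_parent, node):
            raise ValueError("cannot move a node beneath itself")
        old_parent = self.parent_of.get(node)
        self.parent_of[node] = new_parent
        self.pending.append((node, old_parent, new_parent))


editor = TaxonomyEditor({
    "football": "sport",
    "goalkeeper": "football",
    "gloves": "equipment",
})
editor.move("gloves", "goalkeeper")
```

Each queued record names the old and new parent, which is the information a later pass would need in order to work out which underlying relationships the edit implies.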
Those skilled in the art will recognize that there are many methods and techniques that can present a multi-dimensional information structure to an end user and provide interactivity. For example, a multivariate form can be used to allow a user to query the information architecture along many different dimensions simultaneously. Techniques such as "pivot table" may be used to keep one dimension (or variable) constant while changing other variables in the information structure. Software components such as ActiveX and Ajax based components may be embedded in web pages to provide interactivity with the underlying structure. Visualization techniques may provide a three-dimensional view of the data. These and other variations will be clear to the skilled person and do not limit the scope of the invention.
Those skilled in the art will recognize that the present invention may take many forms and that such forms are within the scope of the invention as claimed. Therefore, the spirit and scope of the appended claims should not be limited to the description of the specific versions contained herein.
Claims (72)
1. A computer-implemented method for faceted classification of information domains, characterized by:
a) providing a faceted data set including faceted attributes for classifying information, such faceted attributes optionally including faceted attribute rankings for said faceted attributes;
b) providing a dimension concept taxonomy, wherein the facet attributes are assigned to objects of a domain to be classified according to concepts that associate meaning with the objects, the concepts being represented by concept definitions defined in the dimension concept taxonomy using the facet attributes and associated with the objects, the dimension concept taxonomy expressing dimension concept relationships between the concept definitions according to the faceted data set; and
c) providing or implementing a complex-adaptive system for selecting and returning dimensional concept taxonomy information to change the faceted data set and dimensional concept taxonomy in response to the dimensional concept taxonomy information;
wherein the facet analysis of the input information is selected from information domains based on a source data structure, the method characterized by:
using pattern augmentation and/or statistical analysis to discover at least one of the faceted attributes of the input information to identify a faceted attribute relationship pattern in the input information; and
the facet attribute is a morpheme, and the facet attribute relationship is a morpheme relationship.
2. The method of claim 1, wherein:
inferring the faceted attribute relationships from concept definitions and concept relationships obtained from the input information, thereby enabling construction of a faceted attribute hierarchy, the concept definitions including attributes defining the faceted attributes.
3. The method of claim 2, wherein:
and establishing a potential facet attribute relationship according to the concept definition in the input information and the facet attribute relationship arrangement in the concept relationship.
4. The method of claim 3, wherein:
facet attribute relationships are established according to the potential facet attribute relationships to reduce the number of potential facet attribute relationships to augment the schema for statistical analysis.
5. The method of claim 4, wherein:
establishing a potential facet attribute relationship based on at least one of: (a) there are facet properties shared across pairs defined by related concepts; (b) the facet attribute relationship exists directly or indirectly in the facet attribute hierarchy; and (c) an external dictionary.
6. The method of claim 5, wherein:
for each potential facet attribute relationship, assessing a likelihood that the respective potential facet attribute relationship remains substantially applicable to the conceptual relationship in which the respective potential facet attribute relationship exists; and responsive to the evaluation, cause the respective potential facet attribute relationships to constitute candidate facet attribute relationships for the facet attribute hierarchy.
7. The method of claim 6, wherein:
the evaluating includes determining a popularity of the respective potential facet attribute relationship in an aggregate context of all concept relationships, and wherein the composing is responsive to the popularity.
8. The method of claim 6, wherein:
the evaluating includes identifying that there is a recurrent relationship among the potential facet attribute relationships and identifying a respective potential facet attribute relationship that contradicts a hierarchical assumption between the facet attribute relationships, and wherein the forming is responsive to the identifying.
9. The method of claim 6, wherein:
assembling the candidate faceted attribute relationships into a faceted attribute multi-hierarchy such that a set of candidate faceted attribute relationships in the multi-hierarchy are substantially logically consistent across an aggregate.
10. The method of claim 2, wherein:
the faceted attribute hierarchy is defined as a strict hierarchy by reconsidering the level of a faceted attribute having a plurality of parent attributes as an attribute of an ancestor of the plurality of parent attributes.
11. The method of claim 10, wherein:
determining the facet attribute from a root node within the facet attribute hierarchy.
12. The method of claim 1, wherein:
performing a facet analysis on further input information selected from the second information domain using the facet attributes and optionally the facet attribute hierarchy.
13. The method of claim 1, wherein:
a faceted data set including the faceted properties and optionally a hierarchy of faceted properties is provided.
14. The method of claim 13, wherein:
and using the facet data set for facet classification synthesis.
15. The method of claim 1, wherein:
accessing a data storage device comprising a plurality of statistical analyses, and using the statistical analyses to alter the faceted data set and the dimensional concept taxonomy by aggregating selected dimensional concept taxonomy information through operation of the complex-adaptive system.
16. The method of claim 1, wherein:
performing a faceted classification synthesis to relate a set of concepts, the set of concepts being represented by concept definitions defined in terms of a faceted data set including faceted attributes and optionally a hierarchy of faceted attributes, the faceted classification synthesis including: expressing a dimensional conceptual relationship between the concept definitions, wherein two concept definitions are determined to be related in a particular dimensional conceptual relationship by examining whether at least one of an explicit relationship and an implicit relationship exists in the faceted data set between respective facet attributes of the two concept definitions.
17. The method of claim 16, wherein:
assembling the dimension concept relationship in a dimension concept hierarchy.
18. The method of claim 16, wherein:
defining a dimension axis from one or more facet attributes, the dimension axis to generate a dimension concept hierarchy from the dimension concept relationship.
19. The method of claim 18, wherein:
defining one or more of: (i) a domain to be classified, and (ii) an active concept from among the set of concepts, the active concept serving as an ancestor concept or a descendant concept for generating the hierarchy of dimensional concepts.
20. The method of claim 16, wherein:
the respective facet attributes of the two concept definitions define an ancestor or descendant relationship between the two concept relationships.
21. The method of claim 19, wherein:
defining the active concepts from one or more concepts in the domain.
22. The method of claim 19, wherein:
if the active concept is defined, the dimension axis is defined according to one or more facet attributes of the active concept.
23. The method of claim 19, wherein:
defining one or more of the domain, the active concepts, and the dimension axis with human input.
24. The method of claim 19, wherein:
defining one or more of the domain, the active concepts, and the dimension axes with machine input.
25. The method of claim 18, wherein:
a limit on the number of hierarchical steps in the dimension concept hierarchy is defined.
26. The method of claim 18, wherein:
a limit on the number of concepts to be related is defined.
27. The method of claim 25 or 26, wherein:
the limits are defined by means of human input.
28. The method of claim 25 or 26, wherein:
the limits are defined by machine input.
29. The method of claim 16, wherein:
a plurality of dimensional axes are defined, each axis being defined according to a respective set of one or more facet attributes, so as to define a plurality of dimensional concept hierarchies.
30. The method of claim 18, wherein:
in a dimension-specific conceptual relationship, two concept definitions are associated on a dimension-specific axis if the facet properties of one of the two concept definitions are related to all or a subset of the facet properties of the other of the two concept definitions.
31. The method of claim 18, wherein:
in a particular dimensional concept relationship, two concept definitions are associated on a particular dimensional axis if a subset of the facet properties of one of the two concept definitions is associated with all or a subset of the facet properties of the other of the two concept definitions.
32. The method of claim 18, wherein:
selecting one or more facet attributes for defining the dimensional axis, the selecting being in accordance with at least one of: (i) a respective priority of the facet attribute in the facet attribute hierarchy; and (ii) a particular concept definition including the one or more facet attributes to associate a particular meaning to the dimension axis.
33. The method of claim 16, wherein:
evaluating a set of dimensional concept relationships for the presence of indirect relationships, and assembling the dimensional concept hierarchy without the indirect relationships.
34. The method of claim 16, wherein:
establishing a priority for the concept definition in the dimension concept hierarchy based on at least one of: (i) a facet attribute priority in the facet attribute hierarchy; and (ii) analysis of the facet attributes in the respective concept definitions of related concepts.
35. The method of claim 16, wherein:
a dimension concept taxonomy is defined by assigning facet attributes to content nodes of a domain to be classified according to concepts associated with the content nodes, the concepts being represented by concept definitions defined in the dimension concept hierarchy using the facet attributes.
36. The method of claim 16, wherein:
inferring a conceptual relationship between two or more concept definitions, wherein there is no explicit conceptual relationship between the two concept definitions.
37. The method of claim 35, wherein:
inferring a concept as a concept definition defined by one or more facet attributes, wherein there is no associated content node in the domain that is associated with the other of the two concept definitions.
38. The method of claim 16, wherein:
the dimensional concept relationship is defined for a domain to be classified, the domain including one or more content nodes each associated with one or more concepts of the concept definition representation.
39. The method of claim 38, wherein:
the concept nodes are examined for respective one or more concept definitions of the concept nodes.
40. The method of claim 38, wherein:
the dimensional conceptual relationships are defined for the domain to be classified such that the definition is limited to processing localized regions of the domain using content nodes adjacent to the selected content node.
41. The method of claim 1, wherein:
the concepts are related by defining a set of candidate concept definitions that are determined to be adjacent to each other by analyzing the facet attributes or a subset of the facet attributes associated with the concepts.
42. The method of claim 16, wherein:
based on the facet attributes associated with the two concepts, building concept definitions that are adjacent to each other by defining the facet attribute hierarchy, thereby deriving a set of candidate concept definitions associated with the two concepts.
43. A computer system for organizing and managing data structures based on input from a feedback agent, characterized in that:
the system includes or is linked to a complex-adaptive system for selecting and returning dimension concept taxonomy information to change the faceted data set and the dimension concept taxonomy in response to the dimension concept taxonomy information, wherein:
the complex-adaptive system is operable to process a faceted data set including facets, facet attributes, and optionally a facet attribute hierarchy for the facet attributes used to classify information; and
the complex-adaptive system is further operable to define the dimension concept taxonomy, wherein the facet attributes are assigned to objects of a domain to be classified according to concepts that associate meanings with the objects, the concepts being represented by concept definitions defined in the dimension concept taxonomy using the facet attributes and associated with the objects, the dimension concept taxonomy expressing dimension concept relationships between the concept definitions according to the faceted data set;
the complex-adaptive system is operable to (i) process input information selected from a domain of information according to a source data structure; and (ii) use pattern augmentation and/or statistical analysis to discover at least one of the facet attributes of the input information, or optionally one of the facet attribute hierarchies of the input information, to identify a facet attribute relationship pattern in the input information; and
the facet attribute is a morpheme, and the facet attribute relationship is a morpheme relationship.
44. The system of claim 43, wherein:
the complex-adaptive system is operable to return further input information for enhancing the facet attributes or optionally the facet attribute hierarchy, the facet attributes and the facet attribute hierarchy forming a basis for the returned further input information, the further input information being derived from one or more of machine-based or user-based return paths.
45. The system of claim 44, wherein:
(a) the further input information is obtained by means of a request for further input information generated by the complex-adaptive system; and
(b) such requests are associated with time limits for responding to such requests.
46. The system of claim 43, wherein:
the complex-adaptive system includes or is linked to a statistical analysis data storage device; and
the complex-adaptive system is operable to change the faceted data set and the dimensional concept taxonomy based on the statistical analysis by aggregating selected dimensional concept taxonomy information.
47. The system of claim 45, wherein:
the complex-adaptive system is operable to control a variation of the faceted data set and the dimensional concept taxonomy in response to the dimensional concept taxonomy information.
48. The system of claim 46, wherein:
the complex-adaptive system is operable to apply, for the facet attributes derived from the dimension concept taxonomy information, one or more of the following for the facet, the facet attributes, and the facet attribute hierarchy: (i) counting the obstacles; and (ii) pattern matching constraints.
49. The system of claim 43, wherein:
the complex-adaptive system includes a machine-based complex adaptive system operable to use statistical analysis to analyze the dimensional concept taxonomies and select dimensional concept taxonomy information to be returned.
50. A computer system for performing facet analysis of input information selected from a domain of information according to a source data structure, the system comprising:
a processor;
a computer readable medium in data communication with the processor, wherein the computer readable medium includes processor-executable instructions thereon;
when executed by the processor, the instructions cause the computer system to be operable to derive facet attributes of the input information, and optionally a facet attribute hierarchy of the input information, using pattern augmentation and statistical analysis to identify a facet attribute relationship pattern in the input information; and
the facet attribute is a morpheme, and the facet attribute relationship is a morpheme relationship.
51. The computer system of claim 50, wherein:
the computer system is operable to infer the facet attribute relationship from a concept definition and a concept relationship obtained from the input information to thereby construct the facet attribute hierarchy, the concept definition including attributes for defining the facet attributes.
52. The computer system of claim 51, wherein:
the computer system is operable to build the facet attribute hierarchy by incrementally adding facet attribute relationships.
53. The computer system of claim 51, wherein:
the computer system is operable to establish potential facet attribute relationships based on the concept definitions in the input information and the facet attribute relationship permutations in the concept relationships.
54. The computer system of claim 53, wherein:
the computer system is operable to rank the processing of the potential facet attribute relationships according to a count of facet attributes in the concept definition of a concept relationship pair.
55. The computer system of claim 53, wherein:
the computer system is operable to calibrate facet attribute relationships from the potential facet attribute relationships to reduce a number of potential facet attribute relationships to augment the schema for systematic analysis.
56. The computer system of claim 53, wherein:
the computer system is operable to calibrate the potential facet attribute relationship according to at least one of: (a) there are facet attributes shared across related concept definition pairs; (b) the facet attribute relationship exists directly or indirectly in the facet attribute hierarchy; and (c) an external dictionary.
57. The computer system of claim 53, wherein:
the computer system is operable to: for each potential facet attribute relationship, evaluate a likelihood that the respective potential facet attribute relationship remains substantially applicable to the conceptual relationship in which the respective potential facet attribute relationship exists; and, responsive to the evaluating, cause the respective potential facet attribute relationship to constitute a candidate facet attribute relationship for the facet attribute hierarchy.
58. The computer system of claim 57, wherein:
the evaluating includes determining a popularity of the respective potential facet attribute relationship in an aggregate context of all concept relationships, and wherein the composing is responsive to the popularity.
59. The computer system of claim 58, wherein:
the evaluating includes determining that there is a cyclic relationship among the potential facet attribute relationships and identifying respective potential facet attribute relationships that contradict a hierarchical assumption between the facet attribute relationships, and wherein the forming is responsive to the identifying.
60. The computer system of claim 58, wherein:
the computer system is operable to prune one of the candidate facet attribute relationships if the one candidate facet attribute relationship can be logically constructed using a set of other candidate facet attribute relationships.
61. The computer system of claim 58, wherein:
the computer system is operable to assemble the candidate facet attribute relationships into a facet attribute multi-hierarchy such that a set of candidate facet attribute relationships in the multi-hierarchy is substantially logically consistent across the aggregate.
62. The computer system of claim 61, wherein:
the computer system is operable such that the assembling uses a minimum number of multi-tiered relationships necessary to create the logical one-to-many tiers.
63. The computer system of claim 61, wherein:
the computer system is operable such that the assembling orders respective facet attributes in the multi-hierarchy in response to a measure of relative commonality of the respective facet attributes.
64. The computer system of claim 63, wherein:
the computer system is operable such that the commonality of a respective facet attribute is determined using the commonality of one or more concept definitions associated with the respective facet attribute.
65. The computer system of claim 50, wherein:
the computer system is operable to define the facet attribute hierarchy as a strict hierarchy by reconsidering the order of facet attributes having a plurality of parent attributes as attributes of ancestors of the plurality of parent attributes.
66. The computer system of claim 50, wherein:
the computer system is operable to determine the facet attribute from a root node within the facet attribute hierarchy.
67. The computer system of claim 51, wherein:
the computer system is operable to derive a concept definition and the concept relationship from the input information, wherein the concept relationship definition is a facet attribute.
68. The computer system of claim 50, wherein:
the computer system is operable to identify a structural marker in the input information, and extract an attribute from the input information in response to the structural marker.
69. The computer system of claim 50, wherein:
the computer system is operable to use the facet attributes to perform facet analysis on further input information selected from a second information domain.
70. The computer system of claim 50, wherein:
the computer system is operable to provide a faceted data set including the facet attributes and optionally the facet attribute hierarchy.
71. The computer system of claim 70, wherein:
the computer system is operable to provide the faceted data set for sharing among a plurality of domains, the faceted data set to derive a respective dimensional conceptual taxonomy for each domain.
72. The computer system of claim 70, wherein:
the computer system is operable to use the faceted data set for faceted classification synthesis.
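Illustrative sketch (not part of the claims). Claims 58 to 60 describe selecting candidate facet attribute relationships by their popularity across all concept relationships, rejecting relationships that close a cycle and so contradict the hierarchical assumption, and pruning a candidate that can be logically constructed from other candidates. The claims do not specify an algorithm, so the Python below is only one possible reading under stated assumptions: popularity is taken as a raw observation count, cycles are detected by a reachability search, and pruning is treated as removal of edges implied by a longer path. The names `reachable`, `build_hierarchy`, `min_count`, and the sample attributes are hypothetical.

```python
from collections import Counter


def reachable(graph, start, goal, skip_edge=None):
    """Return True if `goal` can be reached from `start`, ignoring `skip_edge`."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        for child in graph.get(node, ()):
            if (node, child) != skip_edge:
                stack.append(child)
    return False


def build_hierarchy(observed, min_count=2):
    """Turn observed (parent, child) attribute pairs into a pruned hierarchy.

    Assumption: each item in `observed` is one potential facet attribute
    relationship seen in one concept relationship.
    """
    # Popularity (claim 58, as read here): how often each pair was observed.
    popularity = Counter(observed)

    graph = {}
    # Most popular relationships are considered first.
    for (parent, child), count in popularity.most_common():
        if count < min_count:
            continue  # too rare to become a candidate
        # Cycle check (claim 59, as read here): adding parent -> child closes
        # a cycle if parent is already reachable from child.
        if reachable(graph, child, parent):
            continue
        graph.setdefault(parent, set()).add(child)

    # Pruning (claim 60, as read here): drop an edge when the child is still
    # reachable from the parent through other candidate relationships.
    kept = []
    for parent, children in graph.items():
        for child in children:
            if not reachable(graph, parent, child, skip_edge=(parent, child)):
                kept.append((parent, child))
    return sorted(kept)


# Hypothetical observations; the attribute names are illustrative only.
observed = (
    [("animal", "mammal")] * 3
    + [("mammal", "dog")] * 3
    + [("animal", "dog")] * 2   # implied by animal -> mammal -> dog
    + [("dog", "mammal")] * 2   # contradicts mammal -> dog
    + [("mammal", "cat")]       # seen only once
)
print(build_hierarchy(observed, min_count=2))
# prints [('animal', 'mammal'), ('mammal', 'dog')]
```

In this sketch, considering the most popular relationships first means that a contradicting relationship is the one rejected when a cycle would form; this ordering is a design choice of the example, not something stated in the claims.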
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US11/469,258 | 2006-08-31 | | |
| US11/550,457 | 2006-10-18 | | |
| US11/625,452 | 2007-01-22 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1139215A (en) | 2010-09-10 |
| HK1139215B (en) | 2018-06-22 |