
HK1115720B - Method for selecting, analyzing and visualizing related database records as a network - Google Patents


Info

Publication number
HK1115720B
Authority
HK
Hong Kong
Prior art keywords
network
data
database
records
nodes
Prior art date
Application number
HK08106269.3A
Other languages
Chinese (zh)
Other versions
HK1115720A1 (en)
Inventor
Ralph W. Eckardt
Robert G. Wolf, Jr.
Alexander Shapiro
Kevin G. Rivette
Mark F. Blaxill
Original Assignee
The Boston Consulting Group, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Boston Consulting Group, Inc.
Priority claimed from PCT/US2005/015346 external-priority patent/WO2005107405A2/en
Publication of HK1115720A1 publication Critical patent/HK1115720A1/en
Publication of HK1115720B publication Critical patent/HK1115720B/en


Description

Method for selecting, analyzing and visualizing related database records as a network
Cross Reference to Related Applications
The present invention claims priority from U.S. Provisional Patent Application Serial No. 60/567,997, filed on 5/4/2004, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates generally to the field of data mining and analysis. More particularly, the present invention relates to a method and system for representing related database records in a network graphical representation.
Background
The "information age" and the "knowledge economy" are but two of the terms commonly used to describe the explosion of digital information in our age. Whatever it is called, there is no doubt that the amount of information being created is growing at an unprecedented rate. Many attempts to quantify the rate at which new knowledge develops have produced a variety of estimates of this explosive growth. Several frequently cited statistics are:
● total human knowledge is typically doubled every 5-10 years;
● scientific knowledge is doubled every 3-5 years;
● medical knowledge is usually doubled every 2-8 years;
● has increased approximately by a factor of two over the last 7 years;
● there are approximately 1,500,000 web pages added to the world wide web each day;
● the worldwide volume of original content stored digitally in 1999 required approximately 635,000-2,100,000 megabytes to store.
Regardless of the reliability of these estimates, they point to undeniable explosive growth of new information. Computer technology makes it easy to create and store new information. The number and size of databases used to store this information is growing explosively.
Despite the rapid growth in available information, the cognitive ability of humans to digest and assimilate information has not improved significantly. The rapid growth of available information, combined with our inability to assimilate it, results in information overload. The vast stores of information make it increasingly difficult to find the right information and even more difficult to make sense of the large amount of new knowledge available.
Workers in the knowledge economy operate in an environment that is full of information yet from which they are unable to extract understanding. These workers often need to find and understand information about specific topics or areas of interest in order to improve their actions and/or decisions. However, although the available information could inform them and improve their decision making, they have no practical way to find it or to digest and assimilate it.
Numerous companies have invested enormous amounts of capital to help information workers find the informational "needle" in the vast sea of data they are searching. The primary paradigm for information retrieval may be referred to as "search and filter." The "search and filter" approach always begins with a logical search that returns a large number of matching search results. The searcher then sifts through the results to find the information sought. Users of the internet and other large databases will be very familiar with this approach.
Most of the investment in the field of information retrieval has focused on improving the "search and filter" process. Examples of improvements include:
● query refinement: query refinement attempts to determine the intent behind the searcher's query and to refine the query in order to capture more documents relevant to the search or to exclude more irrelevant documents from the result set. One example of query refinement is "synonym expansion," in which the query is expanded to include synonyms of the search terms in an effort to capture more relevant documents.
● result ranking: the second approach to improving the "search and filter" method is result ranking. Result ranking attempts to rank the search results based on their relevance to the intent of the searcher. Relevance has been evaluated in various ways, including the frequency of use of the search term, the location of the search term within the document, and the perceived "importance/usefulness" of the document within the result set. Perhaps the best-known example of result ranking is Google's PageRank metric, which is based on the number of other web pages that link to the search result page.
● result filtering: the last approach to improving the "search and filter" method is result filtering. Result filtering attempts to classify the documents within a result set according to some classification scheme. The hope is that this will allow the searcher to narrow his/her "filtering" to the subset of the result set that most closely relates to the area of interest. Examples of result filtering include Northern Light's "results folders" (see, e.g., FIG. 1), which are based on a fixed taxonomy of document classification; Vivisimo's document clustering tool, which classifies documents into a hierarchical tree structure based on their semantic content (see, e.g., FIG. 2); and Grokker, which classifies documents into a dynamic hierarchical structure similar to Vivisimo's and also provides a visual display of the relative size of each classification with its "bubble display" (see, e.g., FIG. 3).
All of these methods are useful improvements to the "search and filter" method. However, they all assume a particular type of information need: that the searcher is looking for a particular piece of information, and that the information sought can be found within the documents in the result set. This type of information retrieval aims at finding answers to questions such as:
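The "synonym expansion" form of query refinement mentioned in the list above can be illustrated with a minimal sketch. The synonym table and the function names `expand_query` and `matches` are hypothetical and chosen only for illustration; they do not describe any particular search product.

```python
# Hypothetical synonym table; a real system would draw on a thesaurus.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "engine": ["motor"],
}


def expand_query(query):
    """Return one group of alternatives (term plus synonyms) per query term."""
    return [[term] + SYNONYMS.get(term, []) for term in query.lower().split()]


def matches(document, groups):
    """A document matches if it contains at least one alternative from every group."""
    words = set(document.lower().split())
    return all(any(alt in words for alt in group) for group in groups)


groups = expand_query("car engine")
hit = matches("an automobile with an electric motor", groups)
miss = matches("a bicycle with a motor", groups)
```

Without expansion the first document would be missed, because it contains neither "car" nor "engine".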
● who killed Bobby Kennedy?
● what is the second highest mountain in the world?
● what is the weather forecast for Palo Alto, CA tomorrow?
● what is IBM's current stock price?
While the embodiments described herein represent further improvements to the "search and filter" approach, their main contribution is directed to meeting a different type of information need. The primary purpose of these embodiments is to help information users understand search results, or large collections of documents, by providing a method for digesting and assimilating the patterns of information among the documents within a collection. This type of information is referred to herein as "metadata" because it represents a higher level of information than any particular document or record contained within a database or search result. This type of information retrieval is intended to answer questions such as:
● how many documents are relevant to the domain in which I am interested, and how fast is this number growing?
● who are the main authors of information about this topic?
● which companies are producing information about this topic?
● what are the relationships between the companies/authors working in this area?
The described embodiments utilize advanced visualization techniques to reveal metadata related to a document set or search results. To understand the novel contributions of the present invention, it is useful to review other systems and techniques in the field, particularly in two areas of research: 1) existing methods of representing metadata, 2) visualization methodologies for understanding large data sets.
Existing methods of representing metadata
Previous efforts to analyze and represent metadata related to large data sets may be divided into a variety of categories. For the purpose of distinguishing the present invention, the following provides a brief description of each category and examples of the state of the art.
Statistical analysis
One of the simplest and most widely used methods of analyzing a set of documents is statistical analysis. Statistical analysis may be as simple as counting the number of documents by date, author/inventor affiliation, country, classification, or other attributes. It may also include statistical calculations specific to the type of data being examined. For example, in the patent data field, statistics such as the number of citations, citations per patent per year, the time from filing an application to grant, the median year of cited references, the median year of academic citations, and other statistics are sometimes calculated. These statistical methods are widely adopted and in some cases automated in commercial applications such as those provided by Delphion, Micropatent and CHI Research in the patent field and by many others in other fields.
Statistical analysis may provide some useful understanding of the set of documents being evaluated, but it is clearly limited in the degree of understanding that can be obtained. Most known tools of this type provide a textual report or simple bar graph that shows the number of documents per attribute value (e.g., how many documents there are for company A, company B, company C) or statistics related to the entire set of documents (e.g., average time from filing an application to grant). They do not provide information about how the various documents are related to each other, and they do not provide a way to interact with the metadata that allows a user to explore what the various attributes of the documents reveal about the entire document set. It is an object of one or more embodiments of the present invention to provide a means for users to understand relationships among groups of documents and to provide a means for deep exploration of metadata related to a set of documents or search results.
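The kind of per-attribute counting and whole-set statistic described here can be sketched in a few lines. The records, field names and helper functions below are hypothetical illustrations, not the format of any of the commercial tools named above.

```python
from collections import Counter

# Hypothetical patent records; the field names are illustrative only.
records = [
    {"assignee": "Company A", "filed": 1998, "granted": 2000, "citations": 4},
    {"assignee": "Company B", "filed": 1999, "granted": 2002, "citations": 0},
    {"assignee": "Company A", "filed": 2000, "granted": 2003, "citations": 7},
]


def count_by(records, attribute):
    """Number of documents per value of one attribute."""
    return Counter(r[attribute] for r in records)


def mean_years_to_grant(records):
    """Average time from filing to grant over the whole set, in years."""
    return sum(r["granted"] - r["filed"] for r in records) / len(records)


docs_per_assignee = count_by(records, "assignee")
avg_years_to_grant = mean_years_to_grant(records)
```

Both results describe the set as a whole; neither says anything about how individual documents relate to one another.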
Clustering
Another method for revealing metadata about large document sets is clustering. A variety of tools have been developed to group documents into clusters. Some of these tools separate documents into clusters based on a fixed taxonomy of categories, while others cluster them into dynamic category groups using semantic information within the documents. Two examples of fixed taxonomy clustering tools are the Northern Light search engine and The Brain (http://www.thebrain.com/) web search tool. The fixed taxonomy clustering approach is implemented in one of two ways. First, each category may be based on explicit properties of the documents. For example, internet search results may be divided into categories based on their domain extensions, such as ".com", ".net", ".edu", or on their country domains, such as ".sp", ".ge", ".jp", etc. Second, the categories may be based on a taxonomy to which the documents in the data repository have previously been assigned. This is typically accomplished by manually reviewing the documents, or the domains into which those documents fall, and assigning them to one or more categories within a fixed taxonomy.
The second method of clustering documents or search results is based on the creation of dynamic taxonomies. These clustering techniques use semantic data within the documents to cluster the document set into smaller groups and then "name" those groups based on their common words or phrases. The clustering approach essentially creates an automated classification scheme that can provide an understanding of the nature of the documents in a collection. This technology has been applied to a wide variety of document types, and many commercial software applications that perform this function are available. Examples of clustering tools used in the patent field include the Vivisimo and Themescape tools (http://www.micropat.com/static/advanced.htm), which are incorporated into Micropatent's Aureka (http://www.micropat.com/static/index.htm) tool set, and the Text Cluster tool (http://www.delphion.com/products/research/products-cluster) available in Delphion's tool set. The Vivisimo tools can be configured to run on top of any set of text documents, as can the semantic analysis tools developed by Inxight (http://www.inxight.com/products/smartdivorce/).
With these clustering tools, basic metadata about a document set or search results can be represented. The tools mentioned above can automatically display the number of documents in the collection or search results that fall into each category, making it possible to "filter" within the results more quickly to find the piece of information being sought. They also provide some valuable information about the content of the document set or search results.
The value of most known clustering tools is limited in two important respects. First, the only metadata they provide about the content of a document set is the taxonomy into which it is clustered. This is an inherent limitation of both fixed and dynamic taxonomy clustering techniques.
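The first kind of fixed taxonomy, in which categories are derived from an explicit property of each document, can be sketched using the domain-extension example. The function name and the example URLs are hypothetical; only the idea of grouping by extension comes from the text.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_domain_extension(urls):
    """Assign each search result to a fixed category: its domain extension."""
    groups = defaultdict(list)
    for url in urls:
        host = urlparse(url).hostname or ""
        # The extension is whatever follows the last dot of the host name.
        extension = "." + host.rsplit(".", 1)[-1] if "." in host else "(none)"
        groups[extension].append(url)
    return dict(groups)


results = [
    "http://www.example.com/a",
    "http://www.example.edu/",
    "http://shop.example.com/b",
    "http://www.example.jp/x",
]
categories = group_by_domain_extension(results)
```

The categories exist before any search is run, which is what makes the taxonomy "fixed".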
Fixed taxonomies are limited in their usefulness due to factors such as:
● the taxonomy is based on the priorities of its creator, not those of the searcher. Creating a taxonomy requires choosing which attributes of the information are the most important. For example, the first branch in a taxonomy of birds may be established in a number of possible ways: migratory versus non-migratory, water birds versus land birds, etc. Typically, the taxonomist's priorities are not consistent with the needs of the information user, which limits the value of the cluster metadata provided.
● fixed taxonomies cannot easily be adjusted as the contents of the database evolve. Once a taxonomy has been established and users have begun to use it, it becomes rigid and difficult to change. Yet as the content evolves, there is an inevitable need to add new categories, subdivide categories, and recombine categories. This makes it difficult to compare results over time. As an example, consider the taxonomy known as the International Patent Classification (IPC) system, created by WIPO. The IPC is now in its seventh version. In each version, categories have been added, moved, subdivided, and deleted. However, the millions of patent documents filed before a given revision remain classified under the classification scheme that existed at the time they were issued. This makes the representation of cluster metadata problematic when it is based on a fixed taxonomy.
● another problem associated with fixed taxonomies is that the documents in a dataset typically do not fall into a single category. This creates a classification problem that has typically been solved by assigning documents to multiple categories within the taxonomy. Multiple assignment in turn raises the challenge of how to display the clustering results when many documents fall into multiple categories. The typical solutions are either to count each document only in a single (primary) category, or to count it once in every category to which it is assigned, so that the document is counted multiple times. Both solutions have problems. The first ignores important information about secondary classifications, and the second represents multiple instances of each document.
● another major limitation of the fixed taxonomy is the difficulty in assigning documents to various categories. Typically, this is a manual process done by either the author of the document or by a specially trained individual or multiple individuals responsible for the classification. Again, both of these options are problematic. Classification by author suffers from lack of consistency, whereas centralized classification is very time consuming when a large number of documents must be classified.
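The two counting conventions for multiply classified documents described in the list above can be compared directly. The documents and category names below are hypothetical, and treating the first listed category as the primary one is an assumption made for this sketch.

```python
from collections import Counter

# Hypothetical documents; the first listed category is treated as primary.
documents = {
    "doc1": ["optics", "lasers"],
    "doc2": ["lasers"],
    "doc3": ["optics", "sensors", "lasers"],
}


def count_primary_only(documents):
    """Count each document once, in its primary category only."""
    return Counter(categories[0] for categories in documents.values())


def count_every_category(documents):
    """Count each document once in every category it is assigned to."""
    return Counter(c for categories in documents.values() for c in categories)


primary = count_primary_only(documents)
every = count_every_category(documents)
```

With primary-only counting the "sensors" category disappears entirely, while counting every category yields six entries for three documents.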
Dynamic taxonomies have been created in order to overcome some of the limitations of fixed taxonomies. However, they have their own limitations that reduce their usefulness in providing metadata about large sets of documents. The challenges associated with dynamic classification are described below:
● all dynamic taxonomy systems known to the inventors are based on semantic data. Briefly, the classification of documents is based on the similarity of the words contained within the documents. The problem with this approach is that all languages are extremely imprecise when it comes to representing concepts. Any classification of documents based on semantic similarity will suffer from the problems of synonymy (multiple words representing the same meaning) and polysemy (a single word having multiple meanings). While semantic clustering is certainly valuable, the inventors' experience shows that the clusters created are suggestive of the content but far from precise.
● the second linguistic problem associated with semantic clustering is multilingualism. Semantic clustering tools fail completely when the dataset contains documents in different languages. As the trend toward globalization continues, the importance of this problem continues to grow. Some efforts have been made to use multilingual thesauri to allow linguistic comparison of multilingual document sets, but such research is still in its infancy.
● the last limitation of dynamic taxonomies is the lack of comparability from one set of documents or search results to another. Because each taxonomy is created specifically for one set of documents, two taxonomies created for different document sets or different search results cannot be compared.
● dynamic classification also suffers from the multiple classification problem described above.
A second limitation of clustering techniques is that any taxonomy describes a set of documents or search results with respect to only a single attribute. Most taxonomies are intended to describe the topic or theme of the documents they categorize. While such information is useful, there is no system known to the inventors that allows a user to utilize the clustering information together with the various other available sources of metadata describing the document set or search results. It is an object of one or more embodiments of the present invention to provide a user with a way to iteratively or simultaneously utilize information contained in fixed and dynamic taxonomies, as well as a wide variety of other sources of metadata, in order to provide a deep understanding of the set of documents or search results that meets the user's particular information needs.
Visualization methodology for understanding large data sets
The most advanced way to gain an understanding of metadata about large document sets or search results is visualization technology. The field of data visualization has grown rapidly over the past few years as computer processors have become powerful enough to perform the millions of computations required to display complex data relationships. There are many data visualization tools relevant to consider with respect to the present invention. They can be divided into several categories as will be described below. A related instance will also be provided for each category.
Hierarchical display: one visualization method that has been employed is hierarchical display. In its simplest form, documents or search results are represented in a tree structure similar to a directory structure, the well-known metaphor for displaying sorted data. One example of a hierarchical display designed to expose metadata is that of the clustering tools, including Vivisimo, described above. Because very large hierarchies are difficult to display and understand, several alternative approaches to displaying them have been developed. One example is the fisheye lens, which is used in Micropatent's Aureka toolset to show large hierarchies of patent citations. The fisheye display allows the user to zoom in on a portion of the hierarchy while still knowing where they are in the overall hierarchy.
Another, more complex example of a hierarchical display is the Grokker tool developed by Groxis, Inc. and described in U.S. Patent 6,879,332 B2. Much like the Vivisimo tool, the Grokker tool clusters documents into a hierarchical structure based on semantic algorithms. Unlike Vivisimo, the Grokker tool presents the information to the user in a stylized Marimekko diagram. The Grokker visualization represents folders in a two-dimensional space, with each document cluster sized according to the number of documents within the cluster. The space on the screen represents the entire search result. Within this space, clusters of documents are displayed (represented by circles or squares) and labeled based on common words found within those documents. Within each cluster there are further "sub-clusters," again visually represented and labeled with keywords. The hierarchy can be followed from the top down to its lowest level, where the documents themselves are finally found.
Each of these main instances of hierarchical data visualizations is based on the underlying information contained within the document and as such suffers from the limitations of semantic analysis as described above in the section describing fixed and dynamic taxonomies.
Spatial visualization: the second type of visualization used to reveal metadata within a large set of documents is spatial visualization. Spatial visualization uses a map metaphor to arrange document records in two- or three-dimensional space. Although the various spatial visualization tools differ slightly, those known to the inventors all follow a similar methodology for creating maps. This method requires four steps: 1) computing a semantic vector for each document: for each document in the dataset, a vector is computed to represent the semantic content of the document (typically based on a histogram of word or concept usage); 2) creating a similarity matrix: using the semantic vector of each document, the similarity of each document pair is computed, thereby creating a document similarity matrix; 3) creating a two- or three-dimensional projection based on the similarity matrix: using principal component analysis or a similar method (e.g., multi-dimensional scaling), the position of each document within the document set is calculated so that the distance between documents best reflects the similarity between documents as described by the similarity matrix; and 4) rendering a visual information space: using the two- or three-dimensional projection, each document is rendered as a point in the document space.
Some spatial visualization tools take the further step of overlaying a terrain map on the information space to reveal the extent of clustering. Some even recognize and label groupings of clusters based on common words within the clusters.
An example of a spatial visualization tool is the Themescape map, which was part of the patent analysis toolkit developed by Aurigin Systems and is now part of the offering of its assignee, The Thomson Corporation, through its affiliate Micropatent. The Themescape visualization tool uses semantic analysis of patent titles, abstracts, or full text (at the user's discretion) to create a two-dimensional projection of the information space based on the method described above. As shown in FIG. 4, Themescape uses a map metaphor and overlays terrain on the information space, with mountains representing the most highly clustered portions of the information space. Users of the Themescape map explore the landscape by searching the information space for company names and other keywords, or by selecting clusters of documents to read or to export as a list of documents for further review or analysis.
The underlying technology of the Themescape tool comes from research conducted at the Pacific Northwest National Laboratory, which also has a spatial visualization tool known as SPIRE (Spatial Paradigm for Information Retrieval and Exploration). As shown in FIG. 5, SPIRE has two visualizations. The first, the "star field," displays a document map in three dimensions in a view that looks very much like a sky filled with stars. The second, the "topic view," is a terrain metaphor very similar to Aurigin's implementation of the Themescape map.
While certainly useful in developing a general understanding of the information contained within a large dataset, the spatial visualization tools known to the inventors base their visualizations solely on the underlying semantic information contained within the documents, and they therefore also suffer from the limitations of semantic analysis described above in the section on dynamic taxonomies.
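The first two of the four steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the "semantic vector" is a plain word histogram, and cosine similarity is chosen as the similarity measure, which the text does not specify. Steps 3 and 4 (projection and rendering) are omitted; a projection method such as multi-dimensional scaling would take the resulting matrix as input.

```python
import math
from collections import Counter


def semantic_vector(text):
    """Step 1: a word-usage histogram standing in for a semantic vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Similarity of two histograms; cosine is an assumption of this sketch."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(texts):
    """Step 2: the similarity of every document pair."""
    vectors = [semantic_vector(t) for t in texts]
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


texts = [
    "laser optics system",
    "laser optics device",
    "protein folding assay",
]
matrix = similarity_matrix(texts)
```

Because the matrix depends only on shared words, two documents that express the same concept with different vocabulary score as dissimilar, which is the limitation of semantic analysis noted elsewhere in this section.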
Network visualization: the last visualization technique sometimes applied to improve understanding of metadata related to large data sets is network visualization. In its simplest form, a network diagram (which mathematicians would call a graph) is simply a set of nodes (typically represented as dots) connected by links (also referred to as edges or ties). Network diagrams are not new, and some network concepts go back at least to the ancient Greeks. Social network analysis developed significantly in the 1930s. The development of modern computers with powerful processors has made it possible to create computerized network visualization tools.
The network paradigm is a very valuable method for analyzing large data sets. There are two particular reasons why the network lens is so valuable. First, most visualization tools are designed to focus on the entities being analyzed (typically documents, people, or organizations). While network visualizations display information about individual entities, they also emphasize the relationships between and among those entities. The network display shows not only the entities but also the systems in which they operate. In recent years, scientific and academic researchers in many fields have recognized that reductionist analysis (i.e., analysis that focuses on breaking a problem down into its constituent parts and thoroughly analyzing each component) is limited. Fields such as biology, genetics, ecology, sociology, physics, astronomy, information science, and many other disciplines have seen advances based on systems analysis. The systems analysis method focuses not on the smallest elements (e.g., genes, atoms, or possibly quarks and bits), but on the interactions between or among these elements. The network tool is in its essence a system visualization tool. It may therefore lead to an entirely different type of understanding and conclusion than the other visualization tools of the prior art.
A second reason that network visualization tools are suitable for analyzing large data sets is that networks offer the potential to view the same set of information from a variety of viewpoints. Prior art network visualization systems do not take significant advantage of this fact, but networks make it possible to move from one perspective to another, with each perspective providing a different understanding of the data being analyzed. The description of a network visualization system that follows will explain how this potential is realized in order to dramatically improve the understanding that can be obtained of large and complex data sets.
However, it is first necessary to understand the current state of the art for network visualization and to identify the key limitations of existing tools. There are a variety of computerized network visualization tools, including:
● aiSee (www.aisee.com)
● Cyram NetMiner (www.netminer.com)
● GraphVis (www.graphvis.org)
● IKNOW (http://www.spcomm.uiuc.edu/projects/TECLAB/IKNOW/index.html)
● InFlow (www.orgnet.com/inflow3.html)
● Krackplot (www.andrew.cmu.edu/user/krack/krackplot/krackindex.html)
● Otter (www.caida.org/tools/visualization/otter/)
● Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/)
● UCINET & NetDraw (www.analytictech.com)
● Visone (www.visone.de/)
Each of these tools is capable of creating a network graph. More advanced packages (e.g., UCINET/NetDraw, NetMiner) provide a range of visualization capabilities, such as:
● selecting among alternative layout algorithms;
● displaying multiple node types;
● sorting nodes, or changing their color or shape, based on attribute values;
● displaying multiple link types;
● sorting links, or changing their color or line style, based on link type.
All of these tools are generic network visualization tools. In other words, they are designed to display a network graph of any data that has been structured so that the nodes and links of the network are defined. Each of these tools uses a specific (and often unique) file format to capture information about nodes and node properties, as well as links and link properties. Node information is captured in a list of nodes, with each node represented in the list by a node record. A node record contains at least one field holding an identifier unique to the node, but may also contain other attribute fields that provide information about the node. Link information is captured in a link list (or link matrix) that at least identifies which two nodes are linked, but may also capture information such as link strength, link direction, and link type.
Although the various tools differ in their details, the process of working with them follows a common pattern, as shown in FIG. 6. Users of any known prior art system first collect data from whatever source is being used. The user then selects which entities within the data will define the nodes and which information will be used to create links between the nodes. The data must then be formatted to conform to the particular file structure of the network visualization tool. In all cases, this requires the user to create a list of nodes and a list of links or a link matrix. Once properly formatted, the network data file may be imported into the network visualization system and analyzed and visualized. The user can then work with the data within the tool, selecting different layout algorithms or display attributes and analyzing the network structure using whatever analysis tools are provided.
If the user wants to develop an alternative visualization of the data using a different definition of nodes and/or links, he must start over: redefine the nodes and links, reformat the data into a list of nodes and links, and import the new file into the visualization system. The system may then display a network graph based on the new definition of nodes and links. Some of the inherent limitations of these prior art systems include:
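The generic node list and link list described above can be modeled as two simple record types. The class and field names below are hypothetical and do not correspond to the file format of any tool listed earlier; the sketch only mirrors the minimum content the text attributes to each record.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRecord:
    node_id: str                                     # identifier unique to the node
    attributes: dict = field(default_factory=dict)   # optional attribute fields


@dataclass
class LinkRecord:
    source: str                  # identifier of the first linked node
    target: str                  # identifier of the second linked node
    strength: float = 1.0        # optional link strength
    directed: bool = False       # optional link direction
    link_type: str = "default"   # optional link type


def validate(nodes, links):
    """Check that node ids are unique and every link joins two known nodes."""
    ids = [n.node_id for n in nodes]
    if len(ids) != len(set(ids)):
        raise ValueError("node identifiers must be unique")
    known = set(ids)
    for link in links:
        if link.source not in known or link.target not in known:
            raise ValueError(f"link references an unknown node: {link}")
    return True


nodes = [NodeRecord("n1", {"label": "Company A"}), NodeRecord("n2")]
links = [LinkRecord("n1", "n2", strength=2.0, link_type="co-citation")]
is_valid = validate(nodes, links)
```

A user of such a tool has to produce both lists before anything can be drawn, which is the formatting burden described in the surrounding text.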
● database records from an arbitrary data source cannot be visualized, because the records do not contain node and link information in a form the system can use;
● the process of accessing and formatting data is not integrated into the network visualization tool;
● the user must format the data into a node/link list to suit the system;
● the user must choose a fixed definition of what constitutes a node and what constitutes a link before formatting the data for use in the system;
● there is no way to change the definition of nodes and links while working within the network visualization system;
● if a new node/link definition is selected, there is no way to join or connect the network based on the first definition to the network based on the second definition, even if both networks are based on the same underlying data;
● there is no means for specifying particularly useful node and link definitions for reuse with data from a particular source. Each time data from that source is to be visualized, the user must specify every node and link definition from scratch and manipulate the data to fit the visualization system.
Disclosure of Invention
In one aspect, a method of providing a network graphical representation of two or more database records includes selecting the two or more database records according to one or more descriptive criteria. Each of the two or more database records is a member of a common record class. The method also includes identifying one or more attributes of the record class and associating a network node with an instance of the one or more attributes from the database record. The method also includes connecting the network nodes with a network link that specifies network nodes having common instances of the one or more attributes.
The common record class may include patent records from a database, such as the LexisNexis database, Thomson database, USPTO database, EPO database, or Derwent database.
The common record class may include academic journal articles from a database, such as a PubMed database.
The descriptive criteria may include, for example, (i) one or more keywords within a body field of each of the patent records; (ii) one or more keywords within a title field of each of the patent records; (iii) one or more inventors in an inventor field of each of the patent records; (iv) one or more assignees in an assignee field of each of the patent records; (v) one or more keywords within an abstract field; and combinations thereof.
The attributes may include, for example, the inventor, assignee, filing date, issue date, IPC code, USPC code, or search fields.
The network link may include a characteristic describing the number of common instances that occur between the connected nodes. The characteristics may include, for example, link thickness, link color, or link structure.
The network node may comprise a meta-node that describes characteristics of two or more database records.
The method further includes iteratively performing the identifying and connecting steps while modifying the one or more descriptive criteria to change the selected two or more database records. The one or more descriptive criteria may include, for example, a date range.
The method further includes selecting an additional database record from a record class other than the common record class of patent records and associating a network node, a network link, or both with an instance of one or more attributes from the additional database record. The other record classes may describe, for example, licensing history associated with the patent records, litigation history associated with the patent records, or maintenance fee history associated with the patent records.
In another aspect, a method of providing a network graphical representation of two or more database records includes selecting the two or more database records based on one or more descriptive criteria. The method also includes identifying two or more attributes of the database records and associating a network node with an instance of a first one of the common attributes from the database records. The method also includes connecting the network nodes with a network link specifying a network node having a common instance of one of the two or more common attributes to form a first network graphical representation. The method further includes transforming the first network graph representation into a second network graph representation by associating the network node with a second instance of the common attribute from the database record, and connecting the network nodes with a network link specifying the network node having the common instance of the second attribute.
In another aspect, a method of providing a network graphical representation of two or more database records includes selecting the two or more database records based on one or more descriptive criteria. The method also includes identifying two or more common attributes of the database records, associating a first set of network nodes with a first instance of the common attribute from the database records, and associating a second set of network nodes with a second set of instances of the common attribute from the database records. The method also includes connecting one or more members of the first set of network nodes to one or more members of the second set of network nodes with network links specifying associations between the network nodes to form a first network graphical representation.
In another aspect, a method for providing a network graphical representation of two or more database records includes selecting the two or more database records according to one or more descriptive criteria. The method also includes identifying two or more common attributes of the database records, associating a first set of network nodes with a first instance of the common attribute from the database records, and associating a second set of network nodes with a second set of instances of the common attribute from the database records. The method also includes including the second set of network nodes represented in the network configuration within one or more network nodes of the first set of network nodes represented in the network configuration, wherein each of the second set of network nodes shares a common attribute instance with the network node of the first attribute within which the second set of network nodes are included. The method further comprises associating a third set of network nodes with a third one of said common attributes of said database records, said third set of network nodes represented in the network configuration being contained within one or more network nodes of said second set of network nodes represented in the network configuration. Each of the third set of network nodes shares a common attribute instance with the network node of the second attribute within which it is contained. The method also includes associating an additional set or sets of network nodes with the other plurality of common attributes from the database records and grouping one or more members of the additional sets of network nodes within the other network nodes such that each set of network node members is characterized by the attribute associated with the grouped network node.
In another aspect, a network visualization system includes a computer-readable medium having instructions stored thereon adapted to provide a network graphical representation of two or more database records. The stored instructions implement the steps of one or more methods described herein.
Drawings
FIG. 1 illustrates one prior art example of a graphical representation generated using result filtering;
FIG. 2 illustrates another prior art example of a graphical representation generated using result filtering;
FIG. 3 illustrates yet another prior art example of a graphical representation generated using result filtering;
FIG. 4 illustrates a prior art example of a space visualization tool;
FIG. 5 illustrates another prior art example of a space visualization tool;
FIG. 6 illustrates a common process used with prior art visualization tools;
FIG. 7 illustrates how database records can be converted to link data in the described embodiments;
FIG. 8 shows a simple network diagram based on a small number of database records;
FIG. 9 illustrates a network link representation in one described embodiment;
FIG. 10 illustrates a simple network of patent documents in which documents having the same assignee are clustered together;
FIG. 11 illustrates an example of a chronologically ordered network diagram of one described embodiment;
FIG. 12 illustrates a network, generated by one described embodiment, representing a single article from the PubMed database;
FIG. 13 illustrates a network generated by one described embodiment without the use of metanodes;
FIG. 14 illustrates a network generated by one described embodiment using metanodes;
FIG. 15 illustrates how the records of FIG. 7 are converted to metanodes;
FIG. 16 illustrates an example of a network in which links between assignee nodes and inventor nodes are based on whether the inventor is named on a patent held by the assignee;
FIG. 17 shows a diagram with metanodes representing IPC code;
FIG. 18 illustrates a fractal network graph resulting from one described embodiment;
FIG. 19 shows a user interface resulting from one described embodiment, and FIG. 19A shows the result of grouping multiple assignee names under a single assignee representing the grouping;
FIG. 20 shows the relationship between the attributes of patent data and the attributes of academic literature;
FIG. 21 illustrates an exemplary network based on PubMed data and visualized with the MNVS described herein;
FIGS. 22-25 illustrate alternative views of the network shown in FIG. 21;
FIGS. 26-27 illustrate networks of different cooperative clusters resulting from one described embodiment;
FIG. 28 illustrates a network of geographically limited searches produced by one described embodiment;
FIG. 29 illustrates a network of organization-limited searches resulting from one described embodiment;
FIG. 30 illustrates a network produced by one described embodiment illustrating research collaboration or replacement across an organization;
FIG. 31 illustrates a network generated by one described embodiment illustrating a regional basis for study intensity;
FIG. 32 illustrates a computer implementation of the described embodiments.
Detailed Description
As used in embodiments described herein, a Network Visualization System (NVS) is a system and/or method for understanding a collection of related database records or documents by providing a network graphical representation of the database records. These systems and/or methods may be applied to databases having records, where relationships between and among the records may be established. Examples of some of the fields in which NVS may be applied include, but are not limited to, patent documents, academic articles/papers/journals, medical/scientific articles/papers/journals, literature, web pages, corporate databases of consumers/products/suppliers/sales, corporate knowledge management databases, retail databases, official databases of census information/economic data/etc., institutional databases of membership/contracted user/institutional human relations, and many others. In fact, any information that is or can be structured as an information table having two or more information fields can be visualized as a network using the present invention.
An important insight is that database records/documents are related to each other through various attributes, and that these relationships can be represented as a network. Types of attributes that may be used to create links among records/documents include, but are not limited to, citation links (e.g., linking documents A and B because document A cites document B), co-citation links (e.g., linking documents A and B because document C cites both documents A and B), bibliographic coupling (e.g., linking documents A and B because both documents A and B cite document C), common originators, common affiliations (assignee, company, journal, etc.), common classifications within some static or dynamic classification scheme, common keywords, semantic similarity, and many other possible links. These connections make it possible to represent a collection of database records or documents as a network, which enables the use of various network statistics and visualization tools to assist a user in understanding the selected information.
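The three reference-based link types listed above (A refers to B; a third document refers to both A and B; A and B both refer to a third document) can be derived from the same reference data. The following is a minimal sketch under the assumption that each document is stored with the set of documents it refers to; the data and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical data: each document maps to the set of documents it refers to.
cites = {"A": {"C"}, "B": {"C"}, "C": set(), "D": {"A", "B"}}


def direct_links(cites):
    """Directed links (referring document, referenced document)."""
    return {(src, dst) for src, targets in cites.items() for dst in targets}


def shared_citer_links(cites):
    """Link two documents when some third document refers to both of them."""
    pairs = set()
    for targets in cites.values():
        pairs.update(combinations(sorted(targets), 2))
    return pairs


def shared_reference_links(cites):
    """Link two documents when both refer to at least one common document."""
    return {
        (a, b)
        for a, b in combinations(sorted(cites), 2)
        if cites[a] & cites[b]
    }
```

With the sample data, documents A and B are linked in two different ways: D refers to both of them, and both of them refer to C.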
Each database record is characterized by specific instances of its attributes. For example, for the "inventor" attribute in a patent database record, an "instance" of the attribute may be "John Smith", i.e., a particular inventor. Two patents that share the same instance "John Smith" of the attribute "inventor" may then be considered linked and thus visualized as part of the network.
A detailed description of NVS is set forth below. It describes a method for converting database information or documents into network information, followed by a method for creating a multi-network visualization. Two selected example applications of NVS to particular databases, a patent database and a medical journal database, are then disclosed. It should be understood that these are merely exemplary embodiments, and one skilled in the art will appreciate that the particular methods described in the various embodiments are applicable to other embodiments and also to any other database having records/documents in which relationships between documents may be established via various connection types as described below.
The primary visualization paradigm within NVS is the network. A network is a collection of objects that are interconnected in some way. It is common to represent a network visually as a "graph". A network graph is a visual representation in which each object in the network is represented by an icon or symbol called a node, and each connection between objects is represented as a link (also called an edge or a tie) that visually connects the nodes. The nodes and links may be laid out in such a way as to provide a visual representation of the relationships among the objects in the network.
Finding an appropriate way to lay out the network graph so as to reveal relationships among objects is not a trivial task. Graph theory and network layout algorithms are well-established areas of research. A variety of layout algorithms have been developed to create a useful visual representation of the network. The goal of the network visualization system is not to improve upon existing graph layout methods. According to different embodiments of NVS, any network layout method may be used to visualize the important attributes of a large collection of related patent documents or database records. NVS leverages existing layout algorithms to display a network graph of a collection of database records or related documents.
Since it will be appreciated that one of the key attributes about a large collection of documents/database records is the existence of relationships among or between them, the network paradigm has been chosen as the basis for visualization. Network visualization is in its nature designed to reveal interrelationships and is therefore a very useful tool for understanding large document collections.
Collection of data
The first step in using NVS is to collect the documents/database records to be examined. There are several methods available for obtaining a set of documents/database records for analysis. Data stored in an electronic data repository (either within the same computer system or within one or more remote on-site or off-site servers) can be accessed by the NVS in its entirety or as a subset of records. This may be accomplished via a computer-implemented or computer-assisted search by electronically submitting a user query based on one or more descriptive criteria. The query may be submitted using a conventional, well-established logical syntax and may be executed by searching for user-specified terms within one or more fields of the database records or throughout the "full text" of the database records. The entire data set or the query results may then be analyzed and visualized by the network visualization system. In one embodiment, the database records within the electronic data repository are all members of a common record class, such as patents from the USPTO, EPO, Aureka, Micropatent, Thomson, LexisNexis, or Derwent databases, or academic journal articles contained within one of many academic, scientific, engineering, or medical document databases such as the PubMed database. In other embodiments, the database records are members of two or more record classes.
Data not stored within the electronic data warehouse may also be analyzed by NVS, however, the data must first be converted to electronic format via data entry, OCR (optical character recognition), or other suitable technique. Once the data is converted to electronic format, it can be analyzed as any other database using NVS.
Transforming database records into network data
The data extracted from the data warehouse as described above is only a set of records. No known prior art network visualization tool would treat this data as "network" data, since it is not structured as a node list and a link list (or matrix). The data is simply a collection of records, each record having two or more fields representing attributes of the record. For example, a corporate user database may have user ID, name, street address, city, state, zip code, country, telephone number, e-mail address fields, and many other fields. Although such data can serve as a node list in which each record is treated as a node, there is no link list, and thus the data cannot be represented as a network graph.
The network visualization system converts this data into network data by creating a link list for each linking attribute of the records. This is done by creating a link between each pair of records that share the same attribute value. The chart shown in FIG. 7 provides a simplified example of how database records can be converted into link data.
In this example, we use a very simplified set of patent database records. For attributes such as the assignee or inventor, links are created for each pair of records that share the same attribute value. These links are not directional because they are based only on the co-occurrence of attribute values. Citation links, if provided as a list, must be parsed (in this case based on comma delimiters), and directional links are then created between the citing and the cited patents, as shown in this example.
This method may be used to convert any set of database records into network data, where the original records form the node list and a link list is created based on common instances of one or more attributes as described above. Once the database records have been converted to network data, the NVS allows the user to visualize the network.
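The conversion just described can be sketched in a few lines. The records below are invented for illustration and are not the records of FIG. 7; the field names and the functions `co_occurrence_links` and `citation_links` are hypothetical, and each record is assumed to hold a single value per attribute.

```python
from itertools import combinations

# Hypothetical, highly simplified patent records. The records themselves
# serve as the node list; the link lists are derived below.
records = [
    {"id": "P1", "assignee": "Acme", "inventor": "Smith", "cites": ""},
    {"id": "P2", "assignee": "Acme", "inventor": "Jones", "cites": "P1"},
    {"id": "P3", "assignee": "Beta", "inventor": "Smith", "cites": "P1, P2"},
]


def co_occurrence_links(records, attribute):
    """Undirected links between every pair of records sharing an attribute value."""
    return [
        (a["id"], b["id"], a[attribute])
        for a, b in combinations(records, 2)
        if a[attribute] == b[attribute]
    ]


def citation_links(records):
    """Directed links (citing, cited) parsed from a comma-delimited field."""
    links = []
    for rec in records:
        for cited in rec["cites"].split(","):
            cited = cited.strip()
            if cited:  # skip the empty string produced by an empty field
                links.append((rec["id"], cited))
    return links


assignee_links = co_occurrence_links(records, "assignee")
inventor_links = co_occurrence_links(records, "inventor")
cite_links = citation_links(records)
```

Each linking attribute yields its own link list over the same node list, which is what later allows several link types to be shown on one network.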
Basic database network
In a basic database network, each record may be represented by a node, and the nodes may be interconnected by links representing one or more connections of various types as described above. Fig. 8 shows a very simple network diagram based on a small number of database records.
Representing nodes: Nodes may be displayed as basic shapes (e.g., circles, ellipses, rectangles, etc.) or icons (e.g., document pictures). The color of a node may be varied to represent some property of the node. Nodes may also be labeled with text identifying the record or displaying one or more attributes of the database record. As a practical matter, long node labels tend to make the network display difficult to read. NVS addresses this problem in several ways. First, the user is given several options for how much of the attribute value is displayed within the node. The options include: all (whole value), short (first character or first "n" characters), key point (two-digit year only), and none (not labeled). Second, the user can see the complete node label whenever the user points to or selects a node (using an electronic pointing device such as a mouse or trackball).
Representing links: Links may be represented by lines or arrows. See, for example, FIG. 9. The display of a link may reveal the direction of the connection by adding an arrowhead, by drawing the link as a triangular shape whose vertex points to the cited document, or by representing forward and backward citations with different colors or line types (e.g., dotted line, solid line). Furthermore, the strength of the connection between nodes can be shown visually by varying the thickness of the line or by changing its color or line type. The strength of a link may also be shown by displaying a value for the strength next to the link on the network graph. Various types of links between nodes may be established as described below.
As a practical matter, when multiple links connect the same two nodes, it is difficult for the user to distinguish the various ways in which the two nodes are related. NVS addresses this problem in several ways. First, different link types are displayed so that they are visually distinct. This is accomplished by displaying different link types in different colors, line types (e.g., dotted, solid, dashed), line thicknesses, and the like. Second, multiple links between the same nodes are arranged side by side so that they can all be displayed without overlapping.
Another technique for addressing the problem of multiple link types is to collapse these links into a single "compound link" and attach icons to the link that represent the different link types and their strengths. FIG. 9 shows an example of how multiple link types can be collapsed into a single link with icons.
The strength of such a compound link between two nodes can be calculated in a number of ways: it can be based simply on the number of links, on the sum of the strengths of the combined links, or on a weighted average of the strengths of the combined links. If a weighted average is used, the weighting factors may be selected based on an estimate of the relative importance of each link type.
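The three ways of computing the strength of a combined link can be illustrated as follows. This is a sketch only; the function name `compound_strength`, the link type names, and the weights are assumptions chosen for the example.

```python
def compound_strength(links, method="count", weights=None):
    """Combine several links between the same two nodes into one strength.

    links   -- list of (link_type, strength) pairs
    method  -- "count", "sum", or "weighted"
    weights -- per-type weighting factors (types not listed default to 1.0)
    """
    if not links:
        return 0.0
    if method == "count":
        return float(len(links))
    if method == "sum":
        return float(sum(strength for _, strength in links))
    if method == "weighted":
        weights = weights or {}
        total_weight = sum(weights.get(t, 1.0) for t, _ in links)
        weighted_sum = sum(weights.get(t, 1.0) * s for t, s in links)
        return weighted_sum / total_weight  # assumes weights are not all zero
    raise ValueError(f"unknown method: {method}")


# Two links of different types between the same pair of nodes.
links = [("citation", 2.0), ("co_inventor", 4.0)]
weights = {"citation": 3.0, "co_inventor": 1.0}  # citation judged more important
```

Here the count is 2, the sum is 6.0, and the weighted average is (3 x 2.0 + 1 x 4.0) / 4 = 2.5.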
Another feature of NVS is that the user has tools for selecting which link types are active or inactive and which are visible or invisible. By definition, "active" links are those that affect the layout of the network graph. In other words, they exert a force (similar to a spring or elastic band) that pulls the linked nodes together. However, not all active links need to be displayed. When the network is highly clustered (i.e., densely linked) or when multiple link types are used simultaneously, the network graph may become cluttered with links. Allowing the user to make links invisible removes them from the display while they continue to influence the layout of the network graph.
Navigating the network: NVS provides a variety of means for navigating a network, including but not limited to:
● Radius: one way to navigate the network is to select a radius around a selected node. In this context, the radius is the number of links between the selected node and another node. For example, if the radius is set to 3, all documents reachable through fewer than 3 links from the selected document will be displayed in the network graph;
● Expand: the network may be expanded to add nodes to the network graph. The user may, for example, select one or more nodes (which represent documents or database records) and select "expand"; all nodes that are reachable via a single link from the selected nodes and are not yet visible are then added to the network graph;
● Shrink: the network may be shrunk to remove nodes from the network graph. The user may, for example, select one or more nodes (which represent documents or database records) and select "shrink"; all nodes that are reachable via a single link from the selected nodes and are not linked to the network in any other way are then removed from the network graph;
● Hide: nodes may be hidden from the network graph. By selecting one or more nodes and selecting "hide," the user removes the selected nodes from the network graph.
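The navigation operations listed above can be sketched over a simple adjacency map. The function names and sample network are hypothetical, and the radius bound is passed explicitly as `max_links` so that the caller decides how a given radius setting maps to a link count.

```python
from collections import deque

# Hypothetical undirected network: node -> set of directly linked nodes.
adjacency = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "E"}, "E": {"D"},
}


def within_links(adjacency, start, max_links):
    """Nodes reachable from start through at most max_links links (breadth-first)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_links:
            continue  # do not search beyond the bound
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)


def expand(adjacency, visible, selected):
    """Add every node that is a single link away from a selected node."""
    return visible | set().union(*(adjacency[n] for n in selected))


def shrink(adjacency, visible, selected):
    """Remove visible neighbors of the selection that have no other visible link."""
    selected = set(selected)
    removable = {
        n
        for s in selected
        for n in adjacency[s]
        if n in visible and n not in selected
        and not ((adjacency[n] & visible) - selected)
    }
    return visible - removable


def hide(visible, selected):
    """Remove the selected nodes from the displayed set."""
    return visible - set(selected)
```

For example, expanding node C in a display that shows only C adds B and D, while shrinking node B in a display of A, B, C, and D removes A but keeps C, because C is still linked to D.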
Filtering the network: Another useful feature of NVS is the ability to filter the nodes represented in the network. This can be achieved in several ways.
1) The filter may be applied by specifying a minimum, maximum, or range of attribute values for the documents to be displayed. For example, the documents may be filtered to show only those records that meet particular criteria, such as:
o a date before or after a particular date;
o a particular attribute value that occurs more or fewer than a specified number of times within the record set (e.g., only documents whose author has at least 5 documents in the data set are displayed).
2) The filter may be applied by specifying the values of the node attribute to be displayed. For example, the documents may be filtered to show only documents that are:
o related to one or more companies;
o written by a specific group of one or more authors;
o classified within a particular set of one or more topics based on some fixed or dynamic classification.
3) The filter may be applied by providing the user with a list of attribute values and allowing the user to select or deselect the attribute values to be displayed.
4) The filter may be applied by providing the user with a means to select one or more nodes from the network visualization using a computer pointing device (such as a mouse or trackball) and to select a command from a menu indicating that the selected nodes should be filtered.
By utilizing these filtering methods, alone or in combination, a user can dynamically filter the data set to display only documents of interest. For example, the user may specify that she only wants to see documents from companies A, B, and C, published between 1999 and 2005, and classified in categories A1, C3, and D5.
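Combining filters in this way amounts to applying several predicates to each record. The sketch below uses invented records and hypothetical helper names (`in_range`, `one_of`, `min_occurrences`, `apply_filters`) to show range filters, value filters, and occurrence-count filters used together.

```python
from collections import Counter

# Hypothetical document records.
records = [
    {"id": 1, "company": "A", "year": 2001, "category": "A1"},
    {"id": 2, "company": "B", "year": 1997, "category": "C3"},
    {"id": 3, "company": "D", "year": 2003, "category": "D5"},
    {"id": 4, "company": "C", "year": 2005, "category": "B2"},
    {"id": 5, "company": "C", "year": 1999, "category": "D5"},
]


def in_range(field, low, high):
    """Keep records whose field value lies in the inclusive range [low, high]."""
    return lambda rec: low <= rec[field] <= high


def one_of(field, allowed):
    """Keep records whose field value is one of the allowed values."""
    allowed = set(allowed)
    return lambda rec: rec[field] in allowed


def min_occurrences(records, field, minimum):
    """Keep records whose field value occurs at least `minimum` times in the set."""
    counts = Counter(rec[field] for rec in records)
    return lambda rec: counts[rec[field]] >= minimum


def apply_filters(records, *filters):
    """Return the records that satisfy every filter."""
    return [rec for rec in records if all(f(rec) for f in filters)]


shown = apply_filters(
    records,
    one_of("company", {"A", "B", "C"}),
    in_range("year", 1999, 2005),
    one_of("category", {"A1", "C3", "D5"}),
)
```

Because each filter is an independent predicate, the user can add or remove one without recomputing the others, which suits the back-and-forth exploration described here.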
This capability is very important for two reasons: 1) it allows users to move back and forth between different subsets of documents within a data set, and 2) it enables users to refine their queries to remove documents that are not of interest to them.
Clustering according to attributes: Another approach to revealing different types of relationships among documents in a network is to cluster them together based on their attributes. One way to implement this approach is to place additional links between nodes that share a particular attribute value. For example, all patents having the same assignee may be linked to each other by additional links so that they attract each other and form a cluster. FIG. 10 illustrates a simple network of patent documents in which documents having the same assignee are clustered together. Alternatively, an additional node representing the attribute value may be introduced into the graph, with each node having that value linked to the new node. This effectively pulls all of these nodes into one cluster. Note that it is not necessary to display such a new "attribute node", or the links between the attribute node and other nodes, within the visualization.
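The second approach, introducing one extra node per attribute value, can be sketched as below. The record fields, the `visible` and `active` flags, and the function name `add_attribute_nodes` are all assumptions made for illustration; the layout step that would use the links is not shown.

```python
def add_attribute_nodes(records, attribute):
    """Add one hidden node per attribute value and link each record to it.

    The added links are marked active (they would pull nodes together in a
    force-based layout) but not visible (they are not drawn).
    """
    nodes = [{"id": rec["id"], "visible": True} for rec in records]
    links = []
    seen = set()
    for rec in records:
        hub = f"{attribute}:{rec[attribute]}"
        if hub not in seen:
            seen.add(hub)
            nodes.append({"id": hub, "visible": False})
        links.append(
            {"source": rec["id"], "target": hub, "active": True, "visible": False}
        )
    return nodes, links


# Hypothetical records: P1 and P2 share an assignee and would cluster together.
patents = [
    {"id": "P1", "assignee": "Acme"},
    {"id": "P2", "assignee": "Acme"},
    {"id": "P3", "assignee": "Beta"},
]
cluster_nodes, cluster_links = add_attribute_nodes(patents, "assignee")
```

Compared with linking every pair of records that share a value, this adds only one link per record, which keeps the link count manageable for large clusters.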
Identifying natural clusters within the network: A network of linked documents naturally has some regions that are more highly clustered or tightly packed than others. "Cluster" is a term from the field of social network analysis, and there are well-known statistical methods for determining the degree of clustering within a portion of a network or the entire network. These clusters can be identified using techniques developed in that field, and various such techniques exist. Improving known clustering techniques is not the goal of NVS; rather, NVS utilizes various clustering techniques to identify groups of related patents in order to provide an understanding of the nature of large document collections.
Once these clusters have been identified, they can be labeled. One method of labeling is to identify the words found in all or many of the titles and abstracts of the documents that fall into a cluster. A label can be created for each cluster by concatenating the top few (typically 1-5) of these words. This label gives the user an indication of the content of each cluster. Since a list of the most frequently used words in the cluster is unlikely to be an ideal cluster label, it is practical to provide the user with a tool to change the cluster label to a more meaningful set of words.
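A simple version of this labeling step might look like the following. It is a sketch under several assumptions that are not in the text: each word is counted once per document, a small hypothetical stop-word list is removed, and ties are broken alphabetically.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would likely use a longer one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def cluster_label(texts, max_words=3):
    """Build a label from the words that appear in the most documents."""
    document_counts = Counter()
    for text in texts:
        words = set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS
        document_counts.update(words)  # each word counted once per document
    ranked = sorted(document_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return " ".join(word for word, _ in ranked[:max_words])


# Titles of the documents in one hypothetical cluster.
titles = [
    "Digital camera with image sensor",
    "Image sensor for a digital camera",
    "Camera lens",
]
label = cluster_label(titles)
```

For these titles the label is "camera digital image", which illustrates why a user-editable label is useful: the words are relevant but do not read as a natural phrase.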
Chronological network graph: Another way to reveal information about a set of documents is to lay out the network graph with nodes sorted by date. The date used may be any date associated with the document or database record. For example, a patent document may have many dates associated with it: priority date, application date, publication date, grant date, expiration date, and other dates. The network graph may, for example, place the oldest documents on the left side and the newest documents on the right side (or vice versa). A timeline may be placed next to the network graph to show the progress of the technology development over time. Alternatively, the background of the network graph may be divided into time periods by year, decade, or some other interval, with each document placed within the period into which its date falls. An example of such a time-aligned network graph is shown in FIG. 11.
Other gradients: The network may also be displayed along any number of gradients other than time. Any property of the nodes or metanodes that is quantifiable (or can be made quantifiable) can be used as a gradient along which to display the network visualization. A simple example of an alternative gradient is a customer data network that is sorted according to each customer's annual spending.
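Placing nodes along a gradient, whether a date or another quantifiable property, reduces to mapping each node's value onto one display axis. The sketch below shows a linear mapping; the function name `gradient_positions`, the display width, and the sample values are assumptions, and the other axis is left to whatever layout method is in use.

```python
from datetime import date


def gradient_positions(values, width=1000.0):
    """Map each node's value linearly onto the range [0, width].

    Works for numbers and for dates: subtracting two dates gives a timedelta,
    and dividing one timedelta by another gives a plain float.
    """
    low, high = min(values.values()), max(values.values())
    if high == low:
        return {node: width / 2 for node in values}  # all values equal: center
    return {
        node: width * ((value - low) / (high - low))
        for node, value in values.items()
    }


# Chronological layout: oldest document on the left, newest on the right.
filing_dates = {
    "P1": date(2000, 1, 1),
    "P2": date(2001, 1, 1),
    "P3": date(2002, 1, 1),
}
x_by_date = gradient_positions(filing_dates)

# Alternative gradient: hypothetical annual spending per customer.
spending = {"c1": 100, "c2": 300, "c3": 200}
x_by_spending = gradient_positions(spending)
```

Reversing the direction (newest on the left) only requires subtracting each position from the width.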
Transforming networks using metanodes
One central novel feature of the embodiments described herein is the ability to transform network representations. Prior art network visualization tools maintain a fixed definition of what the nodes are and what the links are. For example, if data on a set of patent documents is loaded into one of the prior art network visualization tools, the nodes and links must be defined precisely in advance. If, for example, you choose to have each patent represented as a node and common inventors represented as links, the visualization tool will keep that node/link definition unchanged throughout the analysis. NVS differs fundamentally in that it enables users to transform the network by redefining what the nodes are and what the links are as they work with the data.
The network visualization system works on the following principle: any attribute of a database record may be represented as a node, a link, or both. As a simple example, if a meeting organizer has a list of thematic groups and a list of participants for each thematic group, he may visualize them as a network of thematic groups linked according to common participants, but he may just as easily view them as a network of participants linked according to the thematic groups in which they participate together.
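The two views in this example can both be derived from the same underlying data, as the following sketch shows. The group and participant names are invented, and the function names `group_network` and `participant_network` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical data: each thematic group maps to its set of participants.
groups = {"G1": {"Ann", "Bob"}, "G2": {"Bob", "Cid"}, "G3": {"Dee"}}


def group_network(groups):
    """Groups as nodes; link strength is the number of shared participants."""
    return {
        (a, b): len(groups[a] & groups[b])
        for a, b in combinations(sorted(groups), 2)
        if groups[a] & groups[b]
    }


def participant_network(groups):
    """Participants as nodes; link strength is the number of shared groups."""
    links = defaultdict(int)
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            links[(a, b)] += 1
    return dict(links)


by_group = group_network(groups)
by_participant = participant_network(groups)
```

In the first view Bob appears only as the reason G1 and G2 are linked; in the second view he is a node linked to both Ann and Cid.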
In the extreme, even a single database record may be viewed as a network, with each attribute represented by a node and with the various attributes linked based on other common attributes. FIG. 12 shows a fairly complex network representing a single article from the PubMed database. The central node represents the article itself, while various attributes of the article are represented by other nodes and are interconnected by links such as co-author links and other co-occurrence links, since these attributes all appear in the selected article.
The network visualization system not only transforms database information into network information, but also allows a user to create his own node and link definitions, combine any number of nodes on a single network, and change their definitions at will during analysis. It is not possible to transform the network visualization in this way by means of previously known methods.
The ability to create alternative node definitions is a powerful tool to simplify network display and develop understanding about data sets. By redefining the nodes and links in the network display, the user can focus his attention on the entities he is interested in. For example, a researcher analyzing patent data may be interested in a company, business, or inventor rather than a patent. These nodes represent higher level entities than the nodes of a single document or database record. We refer to these higher level nodes as "metanodes" because they represent groups of documents or database records rather than individual records. Links between these metanodes are referred to as "metalinks" because they represent an aggregation of the links between the collections of documents or database records represented by the metanodes. This ability to abstract the network to a "meta-level" enables users to answer questions and inform decisions at a higher level than is possible using any other known visualization method.
The power of the network transformation method can be demonstrated by way of example. Consider a complex network of more than 1,000 patent documents, in which the nodes are patent documents and the links are citation connections. The network map may look somewhat like the picture in Fig. 13.
It is difficult to determine what can be learned from this network map. However, if you transform the network by redefining the nodes so that each node is a company, you get a network map similar to that shown in FIG. 14.
Such a network map of patent documents relating to a particular photographic technology makes it easy to identify the companies leading in the technical field and to understand the connections between them. Transforming the network diagram greatly simplifies it, and thus a better understanding can be achieved.
Previously, we described embodiments that transform database records into network data. Those embodiments rely on a stable node definition, i.e., one node per database record. Another embodiment creates metanode data and metalink data from database records. The example shown in FIG. 15 demonstrates how this can be implemented, using the same simple set of database records shown earlier in FIG. 7 to create assignee metanodes and metalinks.
The first step in this process is to create a metanode list by simply listing each unique value for a particular record attribute and recording the number of times that value appears in the data set. One or more metalink lists are then created for this attribute (in this example, the assignee) based on co-occurring values in other attribute fields (e.g., inventor, IPC classification, and citations). This method is consistent with the method used to create the link list described above, except for two differences. First, the "record" in this example is not an actual record from the database but a record from the metanode list just created. Second, each metalink has a link strength value that indicates the number of co-occurrences (or citations) that are aggregated in the link.
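The two steps just described can be sketched as follows. This is a minimal illustration and not the implementation of FIG. 15: the records, the field names (`assignee`, `inventors`), and the functions `metanode_list` and `metalink_list` are assumptions made for the example. Link strength here counts the distinct inventor values that two assignees share.

```python
from collections import Counter
from itertools import combinations

# Hypothetical database records; field names are illustrative only.
records = [
    {"id": "P1", "assignee": "Acme", "inventors": ["Lee", "Kim"]},
    {"id": "P2", "assignee": "Acme", "inventors": ["Lee"]},
    {"id": "P3", "assignee": "Bolt", "inventors": ["Kim", "Roy"]},
    {"id": "P4", "assignee": "Core", "inventors": ["Roy"]},
]

def metanode_list(records, attribute):
    """List each unique attribute value with the number of records it appears in."""
    return Counter(r[attribute] for r in records)

def metalink_list(records, node_attr, link_attr):
    """Link metanodes whose records share link_attr values; strength = shared count."""
    values = {}
    for r in records:
        values.setdefault(r[node_attr], set()).update(r[link_attr])
    links = {}
    for a, b in combinations(sorted(values), 2):
        strength = len(values[a] & values[b])
        if strength:
            links[(a, b)] = strength
    return links

assignee_metanodes = metanode_list(records, "assignee")
assignee_metalinks = metalink_list(records, "assignee", "inventors")
```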
Creating a link list, a metanode list, and a metalink list from database records makes it possible to literally see information from almost any database as a network using the network visualization system described herein. As a practical matter, the described embodiments of NVS do not actually convert every possible attribute into a link list, nor do they convert every attribute into a metanode list or metalink list. Only those attributes that are most useful for the user's purposes are converted into network data.
It will be apparent to those skilled in the art that there are alternative methods for selecting which attributes to convert and at what step of the analysis to make the selection. For example, it is sometimes desirable to define in advance, as part of a computer program, which attributes of a particular database of interest to convert into link and metanode data. This gives the user a standard set of nodes, metanodes, links, and metalinks to work with during his study with the tool. The network may be filtered, transformed, and displayed with multiple nodes, metanodes, and links, but only within the boundaries of the attributes established for the particular data set under analysis.
Alternatively, the user may be given the ability to select attributes from the database records for conversion into links, metanodes, and metalinks during their analysis. This can be accomplished by simply allowing the user to select attributes (fields) from the list to convert to network data. Once the attributes are selected, links, metanodes, and metalinks may be generated and added to the set of network visualization resources available to the user according to the methods described above.
There are other ways to create metanode and metalink information from database records. The following examples show two alternative ways of creating such information, although other ways may be used in addition to these two ways.
Example 1: metanodes and metalinks may be created based on a range of attribute values. For example, if the database record attribute is a numeric value (e.g., a field recording annual sales in a customer database), a metanode may be created based on ranges of values within the attribute field (e.g., < $200 = low consumer, $200-$1,000 = medium consumer, > $1,000 = high consumer).
Example 2: metanodes and/or metalinks may be based on a combination of multiple record attributes. For example, a database of market research results may be converted into network data in which particular customer categories are grouped together based on their common answers to a set of questions. In this way it is possible to define a metanode for "Soccer Moms" based on income > $50,000/year, number of children > 2, and vehicle type SUV or minivan.
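Both alternatives amount to assigning each record to a metanode by means of a classification function. The sketch below shows one way this might look, using the thresholds quoted in the two examples; the record layout and the names `spending_tier`, `is_soccer_mom`, and `group_by` are hypothetical.

```python
def spending_tier(annual_sales):
    """Range-based metanode: map a numeric value to one of three bands."""
    if annual_sales < 200:
        return "low"
    if annual_sales <= 1000:
        return "medium"
    return "high"

def is_soccer_mom(respondent):
    """Combination-based metanode: all three conditions must hold."""
    return (
        respondent["income"] > 50_000
        and respondent["children"] > 2
        and respondent["vehicle"] in {"SUV", "minivan"}
    )

def group_by(records, classify):
    """Collect record ids under the metanode label returned by classify."""
    groups = {}
    for r in records:
        groups.setdefault(classify(r), []).append(r["id"])
    return groups

# Hypothetical customer and survey records.
customers = [
    {"id": "c1", "annual_sales": 150},
    {"id": "c2", "annual_sales": 200},
    {"id": "c3", "annual_sales": 1000},
    {"id": "c4", "annual_sales": 2500},
]
respondents = [
    {"id": "r1", "income": 80_000, "children": 3, "vehicle": "SUV"},
    {"id": "r2", "income": 80_000, "children": 2, "vehicle": "minivan"},
    {"id": "r3", "income": 40_000, "children": 4, "vehicle": "SUV"},
]

tiers = group_by(customers, lambda r: spending_tier(r["annual_sales"]))
segments = group_by(
    respondents, lambda r: "Soccer Moms" if is_soccer_mom(r) else "other"
)
```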
The network diagram in fig. 14 also illustrates two additional attributes of the network visualization system: meta-node size estimation and meta-link aggregation.
Metanode size estimation: in the network diagram of Fig. 14, each metanode represents not a single patent but all patents that share a common value of the assignee attribute. In other words, each node represents all patents assigned to the same assignee. In this figure, the size of each node is based on the number of patents assigned to that particular assignee, and that number is displayed on the metanode to indicate its size.
Another feature of NVS is to provide users with the ability to size metanodes based on various attributes of the documents they represent. For example, in a customer database with annual consumption, a metanode representing a group of customers may be sized based on the sum (or average) of their annual consumption. A metanode may also be sized based on any number of network statistics calculated for the nodes it represents, such as the sum of their centrality, eigenvector centrality, or betweenness centrality. The ability to size nodes based on these various metrics enables a user to draw conclusions about things like node value and other important metrics of interest to the user.
Many of the attributes available for metanode size estimation apply to single document nodes as well as to metanodes. Of particular interest are citation counts (forward, backward, total) and social network centrality statistics (centrality, eigenvector centrality, betweenness centrality). Sizing nodes based on these and other statistics may provide a signal of the value of a node within the network.
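As a rough sketch of attribute-based sizing, the function below groups records into metanodes and reduces each group to a single size value by count, sum, or mean. The customer records, the field names, and the function `metanode_sizes` are hypothetical; network statistics such as centrality could be supplied in the same way as just another numeric field.

```python
from statistics import mean

# Hypothetical customer records.
customers = [
    {"name": "A", "region": "East", "annual_spend": 300.0},
    {"name": "B", "region": "East", "annual_spend": 900.0},
    {"name": "C", "region": "West", "annual_spend": 500.0},
]

def metanode_sizes(records, group_attr, value_attr=None, how="count"):
    """Return one size per metanode: record count, or sum/mean of value_attr."""
    groups = {}
    for r in records:
        groups.setdefault(r[group_attr], []).append(r)
    reducers = {
        "count": lambda rs: len(rs),
        "sum": lambda rs: sum(r[value_attr] for r in rs),
        "mean": lambda rs: mean(r[value_attr] for r in rs),
    }
    reduce = reducers[how]
    return {key: reduce(rs) for key, rs in groups.items()}

by_count = metanode_sizes(customers, "region")
by_total = metanode_sizes(customers, "region", "annual_spend", how="sum")
by_average = metanode_sizes(customers, "region", "annual_spend", how="mean")
```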
Metalink aggregation: another feature of NVS is the ability to transform the network from a network of binary (off/on) links into a network of metalinks (combined links with varying degrees of strength). This aggregation of links into metalinks also provides further insight to the user by revealing the strength and nature of the relationships between metanodes.
In the example above, the links represent citations between assignees. Multiple links are shown because citations can flow in either direction between companies. The value of a link in this example is based on the total number of citations from the patents of one company to the patents of another company. This reveals who is the leader and who is the follower in this "technological innovation network". An arrow is attached to each link to indicate its direction, and a number is attached to indicate the value associated with the strength of the link. Further, in a preferred embodiment, when the user points (using an electronic pointing device such as a mouse) at a particular node, the incoming and outgoing links are highlighted in different colors to provide an intuitive clue as to whether the selected company is a leader (highly cited) or a follower (citing others).
As with node size estimation, the link strength may also be based on a variety of different connection attributes. Some examples include the number of references, the number of unique documents referenced, the number of documents referenced, the average year of reference, the most recent year of reference, and other attributes. Note also that metanodes may be connected by the more diverse link types described above. These links may also be aggregated, and the strength of the association between metanodes may be determined based on metrics similar to those described herein.
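The aggregation of individual citation links into directed metalinks can be sketched as below. The patent identifiers, the assignee mapping, and the function `aggregate_metalinks` are hypothetical, and dropping citations within a single assignee is a choice made for this illustration rather than something the description requires.

```python
from collections import Counter

# Hypothetical data: patent -> assignee, and (citing, cited) patent pairs.
assignee_of = {"P1": "Acme", "P2": "Acme", "P3": "Bolt", "P4": "Bolt"}
citations = [("P3", "P1"), ("P4", "P1"), ("P4", "P2"), ("P2", "P3"), ("P2", "P1")]

def aggregate_metalinks(citations, assignee_of):
    """Count citations per (citing assignee, cited assignee) pair."""
    strengths = Counter()
    for citing, cited in citations:
        source, target = assignee_of[citing], assignee_of[cited]
        if source != target:  # assumption: self-citations are not drawn as links
            strengths[(source, target)] += 1
    return dict(strengths)

metalinks = aggregate_metalinks(citations, assignee_of)
```

Because direction is kept in the key, the two opposing metalinks between a pair of companies can be compared directly, which is what distinguishes a leader from a follower in the sense described above.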
Simultaneous display of multiple nodes, metanodes, and link types
The next extension of the metanode concept is to place multiple node and link types on the same graph at the same time. For example, in the patent context it is particularly revealing to view a graph containing nodes representing both assignees and inventors. Fig. 16 shows an example of a network of assignees and inventors, where the links between the assignee nodes and the inventor nodes are based on whether the assignee holds a patent by that inventor. In the network shown in Fig. 16, it is possible to see which inventors work for which company and which inventors work for multiple companies within the technical field under examination.
As another example, nodes representing patents and metanodes representing IPC (International Patent Classification), USPC (US Patent Classification), and/or Derwent classifications may be displayed on the same graph. FIG. 17 shows a graph where metanodes represent IPC codes and patents are grouped as members of a particular IPC code. If the filter is set to view only patents from a particular assignee, this embodiment allows the user to intuitively determine what technologies the assignee has invested in in the past, and how these priorities have changed.
Nodes and links representing different attributes or types of connections can be visually distinguished from each other in order to improve the usability of the system. Nodes may be distinguished by shape, color, border type, or fill pattern, or by representing each type with a particular icon, such as a picture of a person for an inventor and a picture of a document for a patent. Links may be distinguished by shape, color, line type (e.g., solid, dashed), or other means.
NVS allows a user to select what nodes and metanodes are displayed on the graph, and which link attributes to use as the link basis. This provides a powerful tool for clarifying large patent sets and understanding their content and relationships among them.
Fractal network
Another extension of the metanode concept is the concept of a fractal network. A fractal network is defined herein as a network of metanodes in which other nodes or metanodes are contained within each metanode, as shown in Fig. 18. Such a nested node representation may have as many layers as desired.
One example of using fractal nodes is a metanode network representing assignees, where each node is sized by the number of patents it represents. Within each assignee metanode, a metanode network representing IPC classifications may be displayed. This representation shows in which IPC classifications each assignee company in the patent set is developing technology. Further, within each IPC metanode, a network representing inventors may be displayed, and within each inventor metanode, a network representing patents may be displayed.
Such network representations allow users to ask and answer a wider range of questions about patent documents in a technical field. They enable the study of the attributes of patents and the relationships among them in document collections in ways that would otherwise not be possible. One ideal implementation of fractal nodes provides the user with the ability to select the node attributes and link attributes represented by the network at each level of the fractal network graph.
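The assignee / IPC / inventor / patent nesting described here can be modeled as a recursive grouping. The sketch below is only an illustration: it assumes one value per attribute per record, and the records, the IPC codes chosen, and the functions `nest` and `size` are hypothetical.

```python
# Hypothetical patent records with a single value per attribute.
records = [
    {"id": "P1", "assignee": "Acme", "ipc": "G03B", "inventor": "Lee"},
    {"id": "P2", "assignee": "Acme", "ipc": "G03B", "inventor": "Kim"},
    {"id": "P3", "assignee": "Acme", "ipc": "H04N", "inventor": "Lee"},
    {"id": "P4", "assignee": "Bolt", "ipc": "H04N", "inventor": "Roy"},
]

def nest(records, levels):
    """Group records by each attribute in turn; leaves are lists of record ids."""
    if not levels:
        return [r["id"] for r in records]
    head, rest = levels[0], levels[1:]
    groups = {}
    for r in records:
        groups.setdefault(r[head], []).append(r)
    return {key: nest(members, rest) for key, members in groups.items()}

def size(tree):
    """Number of records represented by a metanode at any level of the tree."""
    if isinstance(tree, list):
        return len(tree)
    return sum(size(child) for child in tree.values())

fractal = nest(records, ["assignee", "ipc", "inventor"])
```

Passing the list of levels as a parameter mirrors the idea that the user chooses which attribute is represented at each level of the fractal network.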
Fractal nodes are also particularly useful for displaying hierarchical attributes such as classification schemes, including the IPC and US patent classifications in the patent domain, Medical Subject Headings (MeSH) in the medical data domain, and classifications such as Vivisimo classifications. The lower levels of the hierarchy may be represented within nodes representing the higher levels of the hierarchy. This representation provides the user with an intuitive way to understand the relative size of each category and subcategory and the relationships between them.
Until now, fractal networks have been described by assuming that only a single node or metanode type can be displayed at each level of the hierarchy. Additional insight can be gained by providing the user with a means to place multiple node and link types at each level of the hierarchy. The user can then investigate in more depth how the various attributes relate to each other. As an example of this capability, a user may display a network of assignee metanodes and, within each metanode, metanodes representing inventors and IPC classifications. By doing so, the user can quickly understand what technical areas each company is currently working on and who the key inventors in those technical areas are.
One challenge with using fractal networks is the fact that the "sub-network" within each metanode may be very small within the network display. To address this problem, the system allows users to zoom in and out of the network to display the fractal network they have selected in any degree of detail. This is accomplished in one of two ways. First, the user may select a magnification level from a toolbar button or menu. Second, the user can zoom in on the fractal network within a particular metanode simply by selecting the metanode from the network display. By selecting a metanode (using a mouse or other electronic pointing device), the system can automatically center on the metanode and zoom in so that the next level of the fractal network can be clearly seen. To zoom back out again, the user may either select a new magnification level from a toolbar button or menu selection or may click on the outside of the metanode to return to the previous magnification level.
Implicit filtering of metanodes and metalinks
As described above, there are various means available to the user for filtering the database records under consideration. This filtering has important implications for the use of metanodes and metalinks: the metanode list, metanode sizes, metalink list, and metalink strengths are all subject to change each time a filter is applied. Accordingly, each time a filter is applied to the data, the metanode and metalink information must be updated in order to maintain consistency between the set of records under consideration and the values associated with the metanodes and metalinks in the network display.
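One simple way to keep metanode values consistent with a filter is to derive them from the filtered records on every change instead of caching them. The sketch below assumes hypothetical records with a `year` field and a hypothetical function `metanode_view`; the same pattern would apply to metalink lists and strengths.

```python
from collections import Counter

# Hypothetical records.
records = [
    {"id": "P1", "assignee": "Acme", "year": 1998},
    {"id": "P2", "assignee": "Acme", "year": 2003},
    {"id": "P3", "assignee": "Bolt", "year": 1999},
    {"id": "P4", "assignee": "Core", "year": 2004},
]

def metanode_view(records, attribute, keep=lambda r: True):
    """Recompute the metanode list and sizes from the records that pass the filter."""
    return Counter(r[attribute] for r in records if keep(r))

unfiltered = metanode_view(records, "assignee")
since_2000 = metanode_view(records, "assignee", keep=lambda r: r["year"] >= 2000)
```

Under the filter, the "Bolt" metanode disappears and the "Acme" metanode shrinks, which is the consistency requirement stated above.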
Providing statistical information about the network
Another element of NVS that helps users understand large document sets is the clear presentation of statistical information about the document set under consideration. The user interface described previously allows a user to interact with a network, expanding and contracting it to create a network representing an area of interest to the user. In NVS, an interface is provided that dynamically updates statistics about the network as the network under consideration changes.
Various statistical information about the network is provided by the system, including but not limited to:
● document/record count;
● metanode count (e.g., number of assignees in the network shown);
● sum of node attribute values (e.g., total sales of all customers in the network);
● document count by metanode category (e.g., number of articles by each author);
● document count by year (e.g., number of articles by publication year);
● other network statistics, including but not limited to: statistics about the network as a whole (e.g., density, diameter, centralization, robustness, transitivity), metrics about clusters within the network (e.g., cliques, ego networks, density), and metrics about nodes (e.g., centrality measures such as betweenness centrality and eigenvector centrality, and equivalence).
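Two of the simpler statistics in this list, network density and degree centrality, can be computed directly from a node list and an undirected link list, as sketched below. The function names and the four-node example are assumptions made for illustration; the formulas are the standard ones for simple undirected graphs, and the other listed metrics would typically come from a graph analysis library.

```python
def density(nodes, edges):
    """Fraction of possible undirected links that are present."""
    n = len(nodes)
    if n < 2:
        return 0.0
    return 2 * len(edges) / (n * (n - 1))

def degree_centrality(nodes, edges):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    degree = {v: 0 for v in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    scale = max(len(nodes) - 1, 1)
    return {v: d / scale for v, d in degree.items()}

# Hypothetical star-shaped network: one hub linked to three others.
nodes = ["Hub", "A", "B", "C"]
edges = [("Hub", "A"), ("Hub", "B"), ("Hub", "C")]

network_density = density(nodes, edges)
centrality = degree_centrality(nodes, edges)
```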
The statistical information described above may be provided on user request through menu and toolbar selections, or may be provided in a separate window or pane within the interface. In a preferred embodiment, a separate pane is provided with tabs to allow the user to access desired information about the current network. As shown in Fig. 19, such a pane may be expanded, contracted, or closed as desired by the user.
This interface also facilitates further understanding by allowing a user to select one or more categories and highlight relevant nodes in a network visualization.
Additional information may be provided on a context-sensitive basis as the user uses the system. In particular, pop-up windows may be used to provide additional information about individual nodes, metanodes, links, and metalinks in the network graph. The information provided in each pop-up window relates to the objects selected from the graph; note that more than one object of the same type (nodes, metanodes based on the same attribute, links of the same link type) may be selected at a time.
Resolving ambiguous attribute values
One problem encountered in using the above-described network transformation method is the need to resolve ambiguous terms. Database administrators and users will recognize that the data contained in database records is often messy and inaccurate. Because of small variations in text, attribute values that represent the same entity often differ. We have found that inventor names and assignee names (as well as other attributes) often appear in patent databases in different forms. For example, the assignee "IBM" may appear as IBM, IBM Limited, International Business Machines Corporation, and other variations.
This creates a problem when using the above-described metanode and metalink methods, because these small differences in form cause the system to generate multiple metanodes/metalinks where there should in fact be a single combined metanode/metalink. Thus, the tool provides the user with a variety of means to combine attribute values into a single value.
The system provides a means for users to resolve ambiguous attribute values by allowing them to combine attribute values together under a single value. Fig. 19A shows the result of grouping a plurality of assignee names under a single assignee name representing the group. Various means may be provided to implement such a method. The first method is to allow the user to select attribute values from a list and combine them under a new name or attribute value. For example, the user is provided with an alphabetical list of assignees and selects IBM, IBM Limited, and International Business Machines Corporation from this list. The user may then group the selected values together using a toolbar button or menu selection and either accept a system suggestion for the group name (e.g., IBM) or type in his own group name. The system then combines all of these names under the new group name and displays them as a single assignee for the purposes of the entire analysis.
The second approach provides the user with suggested sets of attribute values to be combined into groups. The system compares the similarity of the attribute values and suggests grouping similar values together under a single attribute value. In addition to using the attribute under consideration (e.g., the assignee), the tool examines other attribute values for clues that the attribute values should be combined. For example, if IBM and IBM Corporation are both located in Armonk, NY, or they share the same inventors, the tool suggests that they be combined. The user may review each suggested group and add or remove values from the list before choosing to accept the group.
A final approach to resolving ambiguous attribute values is to use the network graph itself. The user may select metanodes directly on the network map (using an electronic pointing device such as a mouse) and then combine multiple metanodes into a single group. This is done by selecting multiple metanodes and then choosing a toolbar button or menu selection to combine their values. Alternatively, the user may "drag and drop" one metanode onto another to indicate to the system that they should be combined. The system prompts the user to confirm that combining those items is really intended and then combines the attribute values into a single group for analysis purposes.
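A suggestion step of this kind might be sketched as below, using the standard library's `difflib.SequenceMatcher` for name similarity and the location and inventor clues mentioned above. Everything specific here is an assumption made for illustration: the suffix list, the 0.8 threshold, the entry layout, and the functions `normalize` and `suggest_merges`. The output is only a list of candidate pairs for the user to review.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

# Assumed list of corporate suffixes to ignore when comparing names.
SUFFIXES = {"corp", "corporation", "inc", "limited", "ltd", "co"}

def normalize(name):
    """Lowercase, drop punctuation, and remove common corporate suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in SUFFIXES)

def suggest_merges(entries, threshold=0.8):
    """Suggest pairs with similar names, a shared city, or shared inventors."""
    suggestions = []
    for a, b in combinations(entries, 2):
        ratio = SequenceMatcher(
            None, normalize(a["name"]), normalize(b["name"])
        ).ratio()
        same_city = a["city"] == b["city"]
        shared_inventors = bool(set(a["inventors"]) & set(b["inventors"]))
        if ratio >= threshold or same_city or shared_inventors:
            suggestions.append((a["name"], b["name"]))
    return suggestions

# Hypothetical assignee entries.
entries = [
    {"name": "IBM", "city": "Armonk", "inventors": ["Lee"]},
    {"name": "IBM Corp.", "city": "Armonk", "inventors": ["Kim"]},
    {"name": "International Business Machines Corporation",
     "city": "Armonk", "inventors": ["Lee", "Kim"]},
    {"name": "Acme Inc", "city": "Boston", "inventors": ["Roy"]},
]

candidates = suggest_merges(entries)
```

Name similarity alone would miss the spelled-out company name, which is why the additional clues matter; a shared city by itself is a weak signal, so a real tool would likely weight the clues rather than accept any one of them.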
The system makes it possible to undo combinations of attribute values once they have been created. The user can select a metanode or select a group name from the list and review which attribute values have been combined under it. The user may then select the particular attribute values to be separated and use a toolbar button or menu selection to undo the selected combination of values.
In addition to the above methods, it is also possible to remove ambiguity in attribute values by comparison with an external data source. For example, when considering assignees, reference may be made to external lists of company name equivalents. These lists may include subsidiaries and acquired companies, which may be suggested to the user as groups that could be combined. In the medical field, a doctor's DEA number can be used to resolve ambiguity in doctors' names.
This process of combining multiple attribute values into a single attribute value may also be beneficial in another respect. By combining the values into groups, a hierarchical structure of values is created. This information can then be used to display relationships between data at different levels of the hierarchy according to the methods described above. In particular, the attribute values at each level of the hierarchy may be represented as metanodes and may be displayed as part of a network display, either as individual nodes within a network graph or as a hierarchical network using the fractal network approach described above. This approach is particularly valuable for hierarchical information such as parent/subsidiary information about assignees.
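Grouping and ungrouping of attribute values can be modeled as a reversible mapping from raw values to group names, leaving the underlying records untouched. The class below is a minimal sketch under that assumption; the name `ValueGroups` and its methods are hypothetical.

```python
class ValueGroups:
    """Reversible mapping from raw attribute values to user-defined group names."""

    def __init__(self):
        self._group_of = {}

    def merge(self, values, group_name):
        """Combine the given raw values under one group name."""
        for value in values:
            self._group_of[value] = group_name

    def unmerge(self, values):
        """Undo the combination for the given raw values only."""
        for value in values:
            self._group_of.pop(value, None)

    def resolve(self, value):
        """Return the group name for a value, or the value itself if ungrouped."""
        return self._group_of.get(value, value)

    def members(self, group_name):
        """List the raw values currently combined under a group name."""
        return sorted(v for v, g in self._group_of.items() if g == group_name)

groups = ValueGroups()
groups.merge(["IBM Limited", "International Business Machines Corporation"], "IBM")
before = groups.members("IBM")
groups.unmerge(["IBM Limited"])
after = groups.members("IBM")
```

Because records are resolved through the mapping at display time, undoing a combination requires no change to the stored data.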
Network animation
The tools we have described so far allow a user to transform his view of the network in various ways. However, the description so far assumes that each network graph is a snapshot at a particular point in time. In that sense, the visualizations described up to this point are static.
Another important element of the network visualization system is the ability to animate network graphs to reveal how they change over time. Several different capabilities of the system enable users to examine network dynamics.
A first method for revealing network dynamics is the ability to limit the data displayed within the graph based on the time period of interest. The user may establish minimum and maximum dates for the date range of the data to be displayed. The actual date used may be based on any date information related to the underlying database record. In the case of patent data, a variety of dates can be selected including, but not limited to, priority date, application date, publication date, and grant date. Once the user selects the date type and date range, the system filters the data and displays a network map based only on the data that meets the specified parameters.
The second method builds on this capability. The system provides the user with the ability to change the date range in a very simple manner. The user may select a "step size" (e.g., one month, one year) by which to change the date range and may then click a single button to move the date range forward or backward by that increment. In addition, separate toolbar buttons are provided so that the minimum date, the maximum date, or both can be adjusted with a single click. Once the user clicks to change the date range, the system quickly adjusts the data set to reflect the newly selected range and redraws the network map. This allows the user to step through the time period of the data in specified increments. Effectively, the system makes it possible to visualize how the network has evolved over time.
The third method is the creation of a true animation of the network's evolution. The system provides a way for the user to enter a total date range, an initial date range (which may have no extent at all, e.g., if the minimum and maximum dates are set to the same value), which date (minimum, maximum, or both) will be changed, and a date increment size for the animation. The system uses these inputs to automatically step through the specified date range in the given increments and display an animation of the network's evolution over time.
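The date-range filter and the single-click stepping might be sketched as follows. To keep the example short it works in whole years rather than calendar dates, which is a simplification; the field name `priority_year` and the functions `in_window` and `step` are hypothetical.

```python
# Hypothetical records carrying one date field, reduced to a year.
records = [
    {"id": "P1", "priority_year": 1998},
    {"id": "P2", "priority_year": 2000},
    {"id": "P3", "priority_year": 2001},
    {"id": "P4", "priority_year": 2003},
]

def in_window(records, window, date_field="priority_year"):
    """Return ids of the records whose date falls within the inclusive range."""
    start, end = window
    return [r["id"] for r in records if start <= r[date_field] <= end]

def step(window, size, move="both"):
    """Shift the minimum date, the maximum date, or both by one increment.

    A negative size moves the range backward.
    """
    start, end = window
    if move in ("min", "both"):
        start += size
    if move in ("max", "both"):
        end += size
    return (start, end)

window = (1998, 2000)
visible_now = in_window(records, window)
visible_next = in_window(records, step(window, 1))
widened = step(window, 1, move="max")
```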
These animation methods are extremely useful in revealing network evolution; however, they pose some challenges that must be overcome in order to make the system practical. First, when the network being displayed is large (with many nodes and/or links, and/or a large number of underlying records), the high degree of computational complexity makes the animation slow and jerky on all but the most powerful computer systems. To overcome this limitation, a means is provided for the system to batch process the sequence of network graphs and then save them as a series of snapshots or as a video clip. The snapshots or video clip can then be played back at a speed selected by the user without the system having to recalculate the underlying data at every point in the animation. This makes it possible for the user to review the animation repeatedly and also to pause, rewind, and fast forward the animation as desired.
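The batch approach can be sketched as a function that computes every frame once and stores the results for later playback. This is an illustration only: each "snapshot" here is just the metanode counts for one date window, years stand in for dates, and the function `build_frames` and its parameters are hypothetical.

```python
from collections import Counter

# Hypothetical records.
records = [
    {"assignee": "Acme", "year": 1998},
    {"assignee": "Bolt", "year": 1999},
    {"assignee": "Acme", "year": 2000},
]

def build_frames(records, first_year, last_year, step=1, grow=True):
    """Precompute one snapshot per date window.

    grow=True moves only the maximum date (a cumulative view);
    grow=False moves both dates (a sliding window).
    """
    if step <= 0:
        raise ValueError("step must be positive")
    frames = []
    start = end = first_year
    while end <= last_year:
        visible = [r for r in records if start <= r["year"] <= end]
        counts = dict(Counter(r["assignee"] for r in visible))
        frames.append({"window": (start, end), "counts": counts})
        end += step
        if not grow:
            start += step
    return frames

cumulative = build_frames(records, 1998, 2000)
sliding = build_frames(records, 1998, 2000, grow=False)
```

Once the list of frames exists, playback, pausing, and rewinding are only a matter of indexing into it, with no recalculation of the underlying data.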
A second challenge associated with network graph animation is the difficulty in completely understanding what happens within the graph. When a graph is animated, new nodes and links appear, meta-nodes grow and shrink, and nodes change position within the graph as the attraction between individual nodes and meta-nodes changes over time. All of these simultaneous changes make it difficult for the user to understand what happens when the animation unfolds. To make things easier for the user, the system provides a tool to reduce the number of parameters that change during animation. In particular, the system allows the user to keep various parameters constant during animation. Parameters that can remain constant during animation include, but are not limited to:
● Node position: the most difficult part of the animation to follow is the change in the location of the nodes within the network display as the animation runs. Thus, the system provides the user with the option of keeping node locations unchanged during the animation. To accomplish this, the system first calculates the final position that each node will hold at the end of the animation; then, as nodes appear and change size and as links appear, grow, and disappear, each node remains fixed at that position.
● Present all nodes: another option offered by the system is the ability to keep all nodes present on the graph throughout the animation. With this option selected, the system keeps each node visible throughout the animation, but provides a visual signal to distinguish nodes that do not represent data within the date range in effect at each point in the animation. The visual signal may be a difference in color (e.g., nodes that would not normally be visible are gray), size, shape, border, or other visual attribute. This allows the user to continuously track the path of each node throughout the animation.
● Other parameters: the presence of links, the size of nodes, and the size of links may also be held constant.
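Holding node positions fixed and keeping all nodes present can be sketched together: positions are computed once from the final set of nodes, and every frame reuses them. The circular layout below merely stands in for whatever layout algorithm the system actually uses, and the names `circular_layout` and `render_frame` and the color labels are hypothetical.

```python
import math

def circular_layout(nodes):
    """Place nodes evenly on a unit circle (a stand-in for the real layout)."""
    ordered = sorted(nodes)
    count = len(ordered)
    return {
        node: (math.cos(2 * math.pi * i / count), math.sin(2 * math.pi * i / count))
        for i, node in enumerate(ordered)
    }

def render_frame(active_nodes, final_positions, show_all=False):
    """Describe what to draw for one frame, using the precomputed positions."""
    drawn = []
    for node, position in final_positions.items():
        active = node in active_nodes
        if active or show_all:
            drawn.append({
                "node": node,
                "position": position,
                # Gray marks nodes with no data in the current date range.
                "color": "normal" if active else "gray",
            })
    return drawn

final_positions = circular_layout(["Acme", "Bolt"])
early_frame = render_frame({"Acme"}, final_positions)
early_frame_all = render_frame({"Acme"}, final_positions, show_all=True)
```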
The ability to keep any combination of these parameters constant during animation gives users great control over the animation displayed and increases their ability to fully understand how the network evolves.
Another capability provided by the system is the ability to provide visual information about the rate of change of various parameters during animation. While animation of the network display provides powerful visual information about how networks evolve, it is difficult to accurately compare and evaluate the changes as they occur. For example, while it is readily seen that a variety of technologies and company portfolios are growing, it is difficult or impossible for a user to assess which company portfolio is growing fastest at any point in time.
To this end, the system provides the user with a tool to visualize the rate of change of various parameters during animation. Some of the parameters that a user may be interested in during animation are the rates of change of node growth, link attachment, and node centrality in the network. The system provides the user with the ability to track these rates of change during animation and display information about these values in a table, in a graph, or directly in the network graph. The user can select which variables to track and for which nodes and links (including all nodes and links if desired). This data may be displayed in tables and graphs (bar graphs or line graphs) adjacent to the network graph and updated as the animation is displayed. Additionally (or alternatively), the data may be used to change the appearance of nodes and links in the network display as the animation is displayed. We find it useful to change the color of the nodes or links based on the rate of change over time or on centrality statistics (as measured by the parameters described above) in order to show which parts of the network graph are "hottest". Alternatively, the data may be used to change the size of a node or link or some other visible characteristic as the network animation is displayed.
One other useful capability is provided in connection with network animation. Often, a user is interested in a particular portion of a network graph and wants an in-depth understanding of how that portion of the network evolves throughout the animation. To this end, it is beneficial to provide a means whereby a user can zoom in on, and/or remain focused on, a particular node or nodes during animation. For example, companies may be particularly interested in how their own patent portfolios evolve over time. A means is provided so that the user can select a particular node to zoom in on during the animation. It is also beneficial to provide one or more "picture-in-picture" displays so that a user can observe the evolution of the entire network as well as see how one or more "zoomed-in" portions of the network evolve. This is particularly useful when the network being animated is a fractal network and the user is interested in observing how "sub-networks" within the larger network evolve.
Network animation is a powerful tool for revealing emerging patterns within a network. In the context of patent data, animations reveal how technology has evolved over time, how the positions of companies have changed, how inventors' professional experience and collaborators have changed, and many other features. This ability contributes significantly to the user's ability to understand large data sets.
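Node growth rates of this kind can be derived by differencing consecutive animation frames. In the sketch below each frame is assumed to be a mapping from metanode name to size at that step; the frame values and the functions `growth_rates` and `hottest` are hypothetical.

```python
# Hypothetical metanode sizes at three consecutive animation steps.
frames = [
    {"Acme": 1},
    {"Acme": 2, "Bolt": 3},
    {"Acme": 5, "Bolt": 4},
]

def growth_rates(frames):
    """Change in each metanode's size between consecutive frames."""
    rates = []
    for previous, current in zip(frames, frames[1:]):
        names = set(previous) | set(current)
        rates.append({n: current.get(n, 0) - previous.get(n, 0) for n in names})
    return rates

def hottest(rate):
    """Name of the fastest-growing metanode in one step (ties broken by name)."""
    return max(sorted(rate), key=rate.get)

rates = growth_rates(frames)
leaders = [hottest(rate) for rate in rates]
```

The per-step values could feed a table or line graph beside the network display, or be mapped to node color to mark the "hottest" parts of the graph.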
Linking to exogenous data
Up to this point, the network visualization system has been described in terms of the network analysis and understanding that can be generated from a single data source, in this case a single set of database records. It will be apparent from the foregoing that a great deal of understanding can be generated simply by utilizing such endogenous data. However, additional understanding can be generated when other, exogenous data sources are used in conjunction with the patent data.
By linking to an exogenous data source, additional information about the entities represented by the database attributes may be obtained. The choice of which external data to link to, and the value of linking to that data, depends on the context of the data source being reviewed. Each attribute of a database record creates a potential link to additional external data sources that extend the information available about the topic of interest. Such additional data may be used to create entirely new metanode classifications, to attach additional attributes to one or more database or metanode records, to provide information about a particular node or metanode, and to provide additional link information between nodes or metanodes. Specific examples of this type of useful exogenous data are described later in this application in certain preferred embodiments.
Advantages of network visualization systems
The combination of these different tools and techniques provides dramatic improvements in enabling users to quickly understand the documents or records contained within a large data set. First, the ability to quickly identify and refine document collections through dynamic filtering makes it possible for users to identify the document collections relevant to their areas of interest more quickly, efficiently, and accurately, and with less knowledge of the specific technology.
Second, users have the ability to explore large collections of documents and database records so that an understanding of the nature of the activity within the domain can be quickly generated. This is accomplished by providing summary information about the field from a variety of different perspectives. The network lens, combined with metanodes that represent various attributes of the documents, provides an intuitive way to understand not only the groupings inherent in the field, but also the relationships between those groupings.
Users can advantageously explore the areas of interest to any degree of detail desired, and seamlessly move back and forth between summary-level information and detailed information.
Example 1: Method and apparatus for understanding a patent database
One or more embodiments of the invention relate to an improved method for understanding the patents in a patent database. One attribute of patent databases is that the degree of relatedness between documents is easy to establish from their citation relationships. The following discussion of using citations as a basis for establishing relatedness between patents applies equally to other databases that have citations, including but not limited to academic/scientific/medical literature and the hyperlinks embedded in web pages, which can also be considered citations.
Various embodiments of the present invention may also be used to help business people, engineers, scientists, attorneys, patent examiners, and other interested parties make sense of a vast collection of patents. The challenge is to take a large collection of patent documents and find ways to understand the technological developments they describe without having to read them. To accomplish this, a method is provided by which a user can visualize various attributes of the documents and the relationships between them.
Some of the questions that various embodiments of the present invention may help answer include, but are not limited to:
● What technologies does this group of patents cover?
● How fast are the various technologies developing?
● What are the hottest areas of technology in this field?
● What are the most recent developments?
● Which companies are most active in developing these technologies?
● Which inventors are most active in developing these technologies?
● Which patents are the most important?
● Which companies participate in this field of technology, and which hold the most important patent portfolios?
● Which companies developed the technology first, and which followed?
● What other areas are relevant to this technology?
● How much are these companies investing in this area of technology?
● How important are these patents to the companies that filed them?
● Which technology areas have these companies abandoned, and in which do they continue to invest?
● Which companies/inventors are working to develop this technology?
● Which inventors have changed companies?
● Which technology areas are being bridged, and which companies hold the patents that bridge them?
● Which patents should I cite as prior art in my current patent application?
● Which patents could potentially be used to invalidate my patents or my competitors' patents?
● Which companies are most likely to infringe my patents?
● Which companies are most likely to be interested in licensing my patents?
● How quickly is academic research being translated into patentable technology?
● will become a patent?
● How aggressive are the companies/inventors behind these patents in building on the technology and extending their patent protection?
● In which technologies are companies increasing their investment, and in which are they abandoning it?
● Which technologies are receiving investment in my industry?
● Which industries are utilizing this technology?
Visualization of patent networks
The network visualization system described above can readily be applied to patent databases to great effect. Patent databases around the world are particularly amenable to this type of analysis because they contain citation information that provides natural linking information between patents. The NVS is particularly valuable in a patent setting because it enables interested parties other than patent attorneys and R&D engineers to use patent data.
Patent data is available from a variety of sources, including the various patent offices around the world (USPTO, EPO, etc.) and patent data providers such as Thomson (including its subsidiaries and acquisitions Aureka, Micropatent, IHI and Delphion) and LexisNexis. This patent data is rich in information that can be converted into network data. Some examples of data that can be converted into nodes and links include, but are not limited to:
● nodes/metanodes - patent, inventor, assignee, IPC classification, US classification, Derwent classification, priority/filing/publication/grant/expiration year, semantic cluster, status (filed/granted/expired/abandoned), examiner, inventor city/state/country, assignee city/state/country, filing jurisdiction (US/EPO/WIPO etc.), priority number, and other information;
● links/metalinks - citation, co-citation, bibliographic coupling, common IPC/US/Derwent classification, common priority/filing/publication/grant/expiration year, common semantic cluster, common status (filed/granted/expired/abandoned), common examiner, common inventor city/state/country, common assignee city/state/country, common filing jurisdiction, common priority number, and other information.
These node/metanode and link/metalink definitions, in any of the scopes and combinations described above, may be used within the NVS to review a collection of patent data.
In addition, these nodes and links may be sized to give the user additional information, as described above. Some attributes that are particularly useful for sizing nodes and links in a patent setting include:
● node/metanode sizing - in the patent context, several particular metrics are relevant to sizing nodes and metanodes. For example, a metanode may be sized based on: the number of patents, the number of priority numbers (e.g., the number of unique patent families), the number of times the patents are cited (forward citations), the number of patents cited by the represented patents (backward citations), total citations (forward plus backward), citations per year since publication/grant, remaining patent years (e.g., the sum of the years remaining for the represented patents), average citations per patent, average patent age, average/total number of IPC/US/Derwent classifications, number of inventors, and many other attribute metrics.
As mentioned previously, node and metanode sizes may also be calculated from any number of network statistics, such as the sum of the centrality, eigenvector centrality, or betweenness centrality of the represented patents. Sizing nodes by these various metrics enables users to make judgments about such things as patent value, the diversity of technological innovation, the concentration of inventors, and other metrics of interest in the patent data.
Metrics of patent value are of particular interest, and within the patent data (or other exogenous data that may be linked to it) there are a variety of attributes that may signal patent value. Some specific examples include, but are not limited to: citations, number of academic citations, year of most recent citation, centrality/eigenvector centrality/betweenness centrality, length of the patent specification, number of claims, number of independent claims, length of the shortest independent claim, breadth of coverage (countries of filing), maintenance payments, post-grant oppositions (Europe), licensing, patent litigation, the assignee's R&D spending per patent, and the average R&D spending per patent within the industry. Some or all of these metrics may be aggregated using a weighted average to provide a signal of patent value in the network. These values may in turn be aggregated to provide an estimate of the value of a patent portfolio, and this value may be used to size the patent nodes and the metanodes representing the portfolio. This can give the user great insight into which patents and portfolios are the most important in a particular area of interest.
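The weighted-average aggregation of value signals described above could be sketched roughly as follows. The choice of signals, the weights, and the min-max normalization are all assumptions made for illustration; the text does not prescribe any of them.

```python
def normalize(values):
    """Scale raw values to the 0..1 range (all zeros if every value is equal)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def value_scores(patents, weights):
    """Weighted average of min-max normalized value signals for each patent."""
    ids = list(patents)
    total_weight = sum(weights.values())
    scores = {pid: 0.0 for pid in ids}
    for signal, weight in weights.items():
        scaled = normalize([patents[pid][signal] for pid in ids])
        for pid, s in zip(ids, scaled):
            scores[pid] += weight * s / total_weight
    return scores


# Hypothetical patents and illustrative weights.
patents = {
    "US-1": {"forward_citations": 40, "independent_claims": 3, "filing_countries": 10},
    "US-2": {"forward_citations": 10, "independent_claims": 5, "filing_countries": 2},
    "US-3": {"forward_citations": 0, "independent_claims": 1, "filing_countries": 1},
}
weights = {"forward_citations": 0.5, "independent_claims": 0.2, "filing_countries": 0.3}

scores = value_scores(patents, weights)

# A simple portfolio-level estimate, usable for sizing a metanode.
portfolio_value = sum(scores.values())
```

Normalizing each signal first keeps a signal with a large numeric range (such as citation counts) from dominating the average purely because of its units.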
● link/metalink sizing - links and metalinks may also be sized based on various attributes, in the same manner as nodes. Link strength may be based on a variety of different connection attributes. Some examples include the number of citations, the number of unique patents cited, the number of patent citations, the average year of citation, the most recent year of citation, and other attributes.
Links between patents and patent portfolios can signal two important things: relevance and similarity. Relevance is a measure of how much one patent or patent portfolio depends on, or builds on, another patent or patent portfolio. Metrics that signal relevance provide an important indication of potential infringement and are therefore of critical importance in patent analysis. Some metrics that signal relevance include the number of citations, the number of citations minus the number of applications, the number of citations by the same patents that cite your citations, and so on.
Similarity is another important link attribute in patent analysis. Similarity between two patent portfolios implies near parallelism, and perhaps redundancy, between the R&D programs of the two companies. Strategically, a higher degree of similarity points to a potential joint venture or some other form of potential cost sharing. Metrics that convey similarity between two patents include total mutual citations, structural equivalence (a network analysis term meaning that they hold the same structural position within the network), co-citation, bibliographic coupling, academic similarity, and the like.
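Two of the similarity metrics named above, co-citation and bibliographic coupling, can be computed directly from a citation list. The sketch below assumes the citation data is available as a mapping from each citing patent to the set of patents it cites; the patent identifiers are placeholders.

```python
from collections import Counter
from itertools import combinations

# Hypothetical citation data: citing patent -> set of patents it cites.
cites = {
    "P1": {"A", "B"},
    "P2": {"A", "B", "C"},
    "P3": {"C"},
}


def bibliographic_coupling(cites):
    """Count the references shared by each pair of citing patents."""
    counts = {}
    for p, q in combinations(sorted(cites), 2):
        shared = len(cites[p] & cites[q])
        if shared:
            counts[(p, q)] = shared
    return counts


def co_citation(cites):
    """Count how often each pair of patents is cited together by one patent."""
    counts = Counter()
    for refs in cites.values():
        for a, b in combinations(sorted(refs), 2):
            counts[(a, b)] += 1
    return dict(counts)


coupling = bibliographic_coupling(cites)
cocited = co_citation(cites)
```

Either count could serve as a link strength: coupling links the citing patents to each other, while co-citation links the cited patents.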
All of the NVS features described above, including network transformation, the use of multiple node and link types, fractal networks, network animation, statistics, and linking to external data sources, are relevant to the analysis of patent data. Certain elements of a preferred embodiment of the NVS that are particularly useful for patent data analysis are described below.
Identifying unassigned patents
One challenge specific to using and linking patent data in the NVS is that patent applications typically have no assignee associated with them. This is unfortunate because it means that the newest patents in the database (the most cutting-edge ones) cannot easily be identified by company. These patents typically contain no citations yet and have not been cited, because they are entirely new. One aspect of the NVS implementation for patent data solves this problem by creating optional links that properly connect them to the network. This is done by comparing the attributes of the unassigned patents and patent applications with those of other patents to make a "best guess" as to which company filed the patent application. Attributes that make this comparison possible include inventor name, inventor address, IPC/USPC classification, cited patents, prosecuting law firm, semantic data, and other attributes.
By comparing these fields between the unassigned patents and the other patents in the search results, it is possible to create links that show which other patents in the database are most likely from the same company. As an example, consider an unassigned patent that has the same three co-inventors, with the same addresses, as another patent in the database, and that was filed in the same IPC classification and prosecuted by the same law firm. These patents were most likely filed by the same assignee.
The system reviews all of the unassigned applications and creates links between each of them and the most relevant other patents in the database. Each attribute comparison may be given a score, and a weighted average used to determine the overall relevance of the two documents. The user may then choose to "assign" each patent whose similarity exceeds a selected threshold to the assignee of the highly relevant document. Alternatively, the user may review each link and choose to accept or reject the proposed "assignment". These assignments are labeled "computer assigned" within the NVS so that the user can tell that there is some uncertainty as to whether those patents are actually assigned to that particular assignee. The links created between the unassigned patents and the most highly relevant patents are a distinct type of link that can be turned on and off at the discretion of the user. One particularly useful way to employ these links is to visualize the assignee network map in conjunction with the network of unassigned patents. This allows the user to review all of a company's "computer assigned" patents in a single network view.
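The scoring-and-threshold procedure described above might look roughly like the following. The attribute names, the per-attribute weights, the use of Jaccard overlap for set-valued fields, and the 0.6 threshold are all illustrative assumptions; the text only says that each comparison gets a score and that a weighted average is compared against a user-selected threshold.

```python
# Illustrative weights per compared attribute; the source does not specify values.
WEIGHTS = {"inventors": 0.4, "address": 0.2, "ipc": 0.2, "law_firm": 0.2}


def jaccard(a, b):
    """Overlap of two collections as |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def relevance(unassigned, candidate):
    """Weighted average of per-attribute match scores between two records."""
    scores = {
        "inventors": jaccard(unassigned["inventors"], candidate["inventors"]),
        "address": 1.0 if unassigned["address"] == candidate["address"] else 0.0,
        "ipc": jaccard(unassigned["ipc"], candidate["ipc"]),
        "law_firm": 1.0 if unassigned["law_firm"] == candidate["law_firm"] else 0.0,
    }
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS) / sum(WEIGHTS.values())


def propose_assignee(unassigned, assigned_patents, threshold=0.6):
    """Return a proposed 'computer assigned' link, or None if below threshold."""
    best = max(assigned_patents, key=lambda p: relevance(unassigned, p), default=None)
    if best is None:
        return None
    score = relevance(unassigned, best)
    if score < threshold:
        return None
    return {"assignee": best["assignee"], "score": score, "label": "computer assigned"}


# Hypothetical records.
assigned = [
    {"assignee": "Acme", "inventors": ["Lee", "Park", "Kim"], "address": "Boston, MA",
     "ipc": ["G06F"], "law_firm": "Firm X"},
    {"assignee": "Beta", "inventors": ["Ruiz"], "address": "Austin, TX",
     "ipc": ["H04L"], "law_firm": "Firm Y"},
]
application = {"inventors": ["Lee", "Park", "Kim"], "address": "Boston, MA",
               "ipc": ["G06F"], "law_firm": "Firm X"}
unrelated = {"inventors": ["Chen"], "address": "Reno, NV",
             "ipc": ["A61K"], "law_firm": "Firm Z"}

proposal = propose_assignee(application, assigned)
no_match = propose_assignee(unrelated, assigned)
```

Returning the score and the label alongside the assignee lets a reviewer accept or reject each proposal individually, as the text describes.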
Statistical information
There are various types of statistical information associated with the analysis of patent data, including:
● assignee - the number of patents in the selected network for each assignee, ranked from highest to lowest.
● inventor - the number of patents in the selected network for each inventor, ranked from highest to lowest.
● classification - the number of patents in the selected network for each classification code or classification category, ranked from highest to lowest. Data may be provided for each of several classification schemes, including IPC, USPC, Derwent classification, and others. Since many classification schemes are hierarchical, the data can be displayed using a tree structure, with the number of patents within each class and subclass displayed alongside each branch of the tree.
● word usage - the number of patents containing particular key words, phrases, or word groupings. There are several tools that identify common word usage within a document set. They include Micropatent's Themescape product, Vivisimo's clustering tool, Grokker's clustering tool, and others. These word clustering tools can be readily incorporated into the system to provide additional understanding of the patent data set under consideration.
● citations - various types of information about citations may be provided. They include, but are not limited to, the following:
the most frequently cited patents, assignees, inventors, or other patent groupings;
the highest number of citations per year since issuance for a patent, assignee, inventor, or other patent grouping.
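The ranked counts listed above are straightforward tallies. The sketch below assumes each patent record is a dictionary with `assignee`, `inventors`, and `ipc` fields (hypothetical names) and shows one way to roll IPC counts up a hierarchy using the section (one character), class (three characters), and subclass (four characters) prefixes of the code.

```python
from collections import Counter

# Hypothetical patent records.
patents = [
    {"number": "1", "assignee": "Acme", "inventors": ["Lee", "Kim"], "ipc": "G06F 17/30"},
    {"number": "2", "assignee": "Acme", "inventors": ["Lee"], "ipc": "G06F 3/048"},
    {"number": "3", "assignee": "Beta", "inventors": ["Park"], "ipc": "G06Q 10/00"},
    {"number": "4", "assignee": "Acme", "inventors": ["Kim", "Park"], "ipc": "H04L 29/06"},
]


def rank_by(patents, field):
    """Number of patents per value of a field, ranked from highest to lowest."""
    counts = Counter()
    for p in patents:
        values = p[field] if isinstance(p[field], list) else [p[field]]
        counts.update(values)
    return counts.most_common()


def ipc_tree_counts(patents):
    """Number of patents at each level of the IPC hierarchy."""
    counts = Counter()
    for p in patents:
        code = p["ipc"]
        # Section, class, subclass, and the full code; a set avoids double counting.
        for level in {code[:1], code[:3], code[:4], code}:
            counts[level] += 1
    return counts


by_assignee = rank_by(patents, "assignee")
by_inventor = rank_by(patents, "inventors")
tree = ipc_tree_counts(patents)
```

The per-level counts in `tree` are what a tree display would show beside each branch.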
Statistics are also provided in the pop-up window. Several examples of such information provided in pop-up windows are particularly relevant to patent data, as described below:
● patent node pop-up - when a patent node is selected, a pop-up window may be opened that displays information about the patent represented by the selected node. The information provided includes all of the basic information found on the front page of a typical patent, including patent number, title, inventors, assignee, application number, priority/filing/publication/grant dates, IPC/USPC classification, field of search, citations (both patent and non-patent), examiner and attorney, as well as other data from the patent such as the number of pages, number of claims (independent and dependent), number of figures, number of words in the shortest independent claim, etc. In addition, many of the fields in the pop-up window are hyperlinked, allowing the user to bring up additional information. For example, patent numbers are hyperlinks to the full text of the patent (or a pdf file), citations are hyperlinks (links to non-patent references invoke an internet search for the referenced document), and so on. The pop-up window may also include various statistics about the patent (e.g., centrality) as well as other information from external sources, including legal status, litigation status, licensing status, other patents within the patent family, post-grant oppositions, file wrapper information, and the like.
● assignee metanode - when an assignee metanode is selected, a pop-up window is opened that displays a menu of the different types of data that can be displayed about the assignee and the patents represented by the metanode. The menu options include tables showing the patents represented by the metanode, patents sorted by IPC, patents sorted by USPC, a list of patents by inventor, and a graph showing patents by year. Additional menu options include network statistics about the assignee metanode, including total citations, average citations per year (since issue year), and the sum of the eigenvector centralities of the assignee's portfolio divided by the sum of the eigenvector centralities of the entire network (a measure of portfolio value). Another menu option provides information about the assignee. This menu option links to basic company and financial information about that company. Sources of this type of information that may be used include Hoovers (www.hoovers.com), Bloomberg (www.bloomberg.com), Yahoo Finance (http://finance.yahoo.com/) and many others, including public and private sites containing company profile information.
● inventor metanode - when an inventor metanode is selected, a pop-up window is opened that displays a menu of the different types of data that can be displayed about the inventor and the patents represented by the metanode. The menu options include tables showing a list of the patents represented by the metanode, patents by co-inventor, patents sorted by IPC/USPC, and a graph showing patents by year. Another menu option provides information about the inventor. This menu option links to two different types of information: one is a basic web search on the inventor's name, and the other is "people finder" information from the world wide web. People-finder sites such as people.yahoo.com/, www.zabasearch.com, www.intelius.com, www.peolplefinders.com, and many others can, starting from the individual's name, city, and state as given in the patent information, directly find address history, birthday, marriage/divorce/death information, real estate records, liens and mortgages, bankruptcies, military service, relatives, neighbors, credit reports, and background checks. This information is useful for locating the inventor when the need arises. It may also be used to identify inventor names within the database that may represent the same person.
● IPC/USPC metanode - when an IPC/USPC metanode is selected, a pop-up window is opened that displays a menu of the different types of data that can be displayed about the IPC/USPC class and the patents represented by the metanode. The menu options include tables showing the patents represented by the metanode, patents sorted by assignee, a list of patents by inventor, and a graph showing patents by year. Another menu option provides information about the class itself. This menu option provides detailed information about the IPC/USPC class, including a complete description of the class and its location in the IPC/USPC classification hierarchy, concordance information showing how the selected IPC class is associated with USPC classes (or vice versa), and concordance information showing the links between the selected IPC/USPC class and the SIC/NAICS codes for the industry of use and the industry of manufacture. (The linking of IPC/USPC to SIC/NAICS is discussed in detail later in the description of this embodiment.)
● metalink - when a metalink is selected, a pop-up window is opened that displays information about the connection represented by the metalink. A table listing the patent-to-patent links represented by the metalink may be displayed, along with a plot of the number of individual links over time. If, for example, the metalink is a co-inventor relationship link, the metalink pop-up will display the history of collaboration between the two inventors. If the metalink is an assignee-to-assignee citation link, the pop-up window will display the history of citations between the two assignees.
All of this pop-up information makes it possible for a user to explore patents from high level metadata to the deepest level of detail about companies, inventors, technology, and patents, with any level of detail desired. This makes the patent network visualization tool a powerful tool for understanding large collections of patent documents.
Linking to external data sources
Some examples of useful exogenous data sources, particularly relating to patent data, and their use in a patent network visualization system are described below:
● industry data - one key observation regarding the use of patent data is that for most decision makers, other than patent attorneys and R&D engineers, the entity of interest is not the patent. Rather, users are typically interested in learning about a company, a technology, an inventor, or some other entity. One entity that many users would like to understand better is the industry. Users often want answers to questions like:
Which technologies are critical to this industry?
Which companies are leading technical development in this industry?
Which industries utilize a particular technology?
Unfortunately, industry data is not directly attached to patents in patent databases. However, it is possible to link the data in a patent database to industry data in two ways. First, the assignees/companies within the patent database may be linked to the industries in which they participate. Governments around the world have made various attempts to develop standardized industry taxonomies for their economies. The results include the SIC (Standard Industrial Classification) codes and the NAICS (North American Industry Classification System) codes. These codes classify companies according to the industries in which they participate.
Various databases contain company directories with information about each company's SIC/NAICS industry. One example is the global commerce directory (www.siccode.com), which houses a database of companies and their industries.
By linking the assignee in the patent database to their SIC/NAICS code, it is possible to create meta-nodes within the network visualization that display the various industries represented in the patent data under examination. With the above features, it is then possible to examine the relationships between and among industries, as well as the relationships between industries, companies, technologies, inventors, countries, and other entities represented within the patent data.
A second way in which patent data can be linked to industry data is through the US patent classification (USPC) code or International Patent Classification (IPC) code. This is made possible by various "technology-industry concordances". From 1990 to 1993, the Canadian patent office collaborated with Statistics Canada to assign every new patent application to both an SIC of use and an SIC of manufacture. This assignment was made for a total of approximately 148,000 applications. This information has been used by various government entities and academic researchers to map the correspondence between the IPC/USPC technology taxonomy and the SIC/NAICS industry taxonomy.
Publicly available tables have been built that show the connections between various industries and technologies. Versions of these tables are available from the following sources on the world wide web:
IPC-US-SIC concordance
http:// www.rotman.utoronto.ca/. Silverman/ipcsic/documentation _ IPC-SIC _ documentation
OECD technology concordance
http://www.olis.oecd.org/olis/2002doc.nsf/linkto/dsti-doc(2002)5
Yale technology concordance
http://faculty1.coloradocollege.edu/~dhohnson/jeps.html
USPC to SIC concordance
http://www.uspto.gov/web/offices/ac/ido/oeip/catalog/products/tafresh1.htm#USPC-SIC
With these tables, it is possible to link technology classification information to industry classifications. This makes it possible for users of patent network visualizations to analyze information about the industry as part of their patent data analysis.
The two industry data sources may also be used simultaneously. For example, the system may create SIC or NAICS metanodes for the assignees in the database and display them within the same graph (or as fractal nodes). At the same time, the technology-industry concordance data can be used to create IPC or USPC networks linked to those industry metanodes. By doing so, it is possible to determine with a reasonable degree of confidence which companies are applying a particular technology in which industry. This ability addresses a common question in patent data research, namely, "Of the different companies that hold patents related to this technology, which are applying the technology in my industry, and which are applying it in other industries?"
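The two linking routes described above, assignee to SIC code and IPC class to SIC code via a concordance, could be combined along these lines. Both lookup tables below are hypothetical stand-ins: a real system would load the first from a company directory and the second from one of the published concordance tables, and the share values shown are invented for illustration.

```python
from collections import defaultdict

# Hypothetical company directory: assignee -> SIC codes it participates in.
ASSIGNEE_SIC = {"Acme": {"3674"}, "Beta": {"3674", "3571"}}

# Hypothetical concordance: IPC subclass -> {SIC of use: share of patents}.
IPC_TO_SIC_USE = {
    "H01L": {"3674": 0.7, "3571": 0.3},
    "G06F": {"3571": 0.8, "7372": 0.2},
}


def industry_metanodes(patents):
    """Group patent numbers under the SIC codes of their assignees."""
    nodes = defaultdict(set)
    for p in patents:
        for sic in ASSIGNEE_SIC.get(p["assignee"], ()):
            nodes[sic].add(p["number"])
    return dict(nodes)


def likely_industries(patent):
    """Concordance shares restricted to industries the assignee participates in."""
    shares = IPC_TO_SIC_USE.get(patent["ipc"], {})
    own = ASSIGNEE_SIC.get(patent["assignee"], set())
    return {sic: share for sic, share in shares.items() if sic in own}


patents = [
    {"number": "1", "assignee": "Acme", "ipc": "H01L"},
    {"number": "2", "assignee": "Beta", "ipc": "G06F"},
]

metanodes = industry_metanodes(patents)
beta_use = likely_industries(patents[1])
```

Intersecting the two sources is one simple way to express the "reasonable degree of confidence" idea: an industry supported by both the assignee's codes and the technology concordance is a stronger candidate than one supported by either alone.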
● legal status data - a second source of valuable exogenous data related to patent data is a legal status database such as INPADOC. Such databases contain, among other things, data regarding maintenance fees, reassignments, post-grant oppositions, and the like. Linking to such data is technically simple, because the patent number or priority number can be used to link directly to the database. The value of linking to such data is very high. The patent network visualization system may use this data to identify patents that have been abandoned (through non-payment of maintenance fees) or reassigned. This provides a strong signal of where a company's priorities lie. This is accomplished by changing the appearance of the patent node, assignee metanode, IPC/USPC metanode, inventor metanode, or other node to show which patents have been abandoned. The change in appearance may include changing the color, shape, size, type of border, or type of fill.
In addition, legal status data can be used to show which patents have been reassigned, which provides evidence of the acquisition or divestiture of a business or business unit and signals the value of the patents. For example, a large number of patents reassigned to a new company is likely to signal a change in corporate organization. As another example, patents that have been opposed in legal proceedings are more likely to be valuable, since a party is unlikely to pursue an opposition unless there is a significant economic incentive to do so. Again, this information can be used to alter the appearance of the nodes or metanodes to give the user important clues about the patent data network.
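A minimal sketch of driving node appearance from legal status might look like this. The status flag names (`lapsed`, `reassigned`, `opposed`) and the specific colors, shapes, and borders are assumptions chosen for illustration; the text only says that color, shape, size, border, or fill may change.

```python
def node_style(status):
    """Choose display attributes for a patent node from its legal status flags."""
    style = {"fill": "steelblue", "border": "solid", "shape": "circle"}
    if status.get("lapsed"):        # maintenance fee not paid
        style["fill"] = "lightgray"
    if status.get("reassigned"):    # ownership transferred
        style["shape"] = "square"
    if status.get("opposed"):       # challenged in a legal proceeding
        style["border"] = "double"
    return style


def lapsed_share(statuses):
    """Fraction of a metanode's patents that have lapsed (0.0 for an empty group)."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if s.get("lapsed")) / len(statuses)


# Hypothetical legal status records for one assignee's patents.
portfolio = [
    {"number": "1", "lapsed": True},
    {"number": "2", "reassigned": True, "opposed": True},
    {"number": "3"},
    {"number": "4", "lapsed": True},
]

styles = {p["number"]: node_style(p) for p in portfolio}
share = lapsed_share(portfolio)
```

Using a separate visual channel for each flag lets a single node show several status signals at once, and `share` gives a metanode-level value that could drive the metanode's own fill or size.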
● file wrapper data - another valuable source of data to link to is patent office file wrapper data. In the United States, this data can be found online at www.USPTO.gov. Such data is technically easy to link, as it contains a priority date or patent number that can be linked directly to the patent database. This information is useful in several ways. First, it is useful for users to be able to "click through" to the file wrapper data of patents they are particularly interested in. Second, file wrapper data provides clues as to patent value and validity. The number of office actions, the claim rejections, the change in the number of claims from application to issued patent, the time taken to reply to office actions, and other information found in the file wrapper all provide information useful to the patent network visualization user. The appearance of the nodes and metanodes may be changed to signal the presence or value of any of these file wrapper parameters to the user.
● litigation data - another important source of information about patents is the status of associated legal proceedings. Information about the existence and status of patent litigation is critical to understanding the patent landscape. By utilizing this information in conjunction with the capabilities of the above-described patent network visualization system, important questions can be answered, such as:
Which patents have been found valid after trial?
Which patents are currently in litigation?
Which companies have been accused of infringing patents, and how many lawsuits are they facing?
Who is actively asserting patents in my industry?
Which technologies are most actively entangled in patent litigation?
● license data - another important source of information about patents is license data. Information about patent licenses provides a signal as to the value of individual patents and patent portfolios. Existing license databases include www.yet2.com, www.royaltystat.com, www.royaltysource.com, IP transaction database (www.fvgi.com), IP research federation database (www.ipresearch.com), and license royalty rates (www.aspenpublishers.com). The links may be constructed by patent number, by company, by industry (SIC/NAICS), or by other means. By utilizing this information in conjunction with the capabilities of the above-described patent network visualization system, important questions can be answered, such as:
Which patents are available for licensing?
What is the typical royalty rate associated with patents in this industry?
Has this company licensed out its patents?
What technology are my competitors licensing in or out?
● corporate data - corporate data is another exogenous data source that may be incorporated into a patent data visualization system. Links to the company data may be generated by way of the assignee field in the patent database. Countless sources exist for various types of corporate data. Data types that are particularly useful to link to include financial data and product data.
Various sources of corporate financial data exist, ranging from government systems like EDGAR by the SEC (http://www.sec.gov/EDGAR.shtml) to data aggregators like Hoovers (http://www.hoovers.com) and Bloomberg (www.bloomberg.com). Data such as sales, R&D spending, market capitalization (market cap), and many other financial figures may be used to give further insight into patent data. For example, annual R&D spending may be divided by the annual number of patent applications (with a time lag) to compare relative R&D efficiency. Sales divided by the number of patents can be used as a signal of legal investment in future revenue streams. Market capitalization divided by the number of patents (or by an estimate of the portfolio value) signals how expensive or inexpensive it would be to acquire a patent portfolio. Comparison of these and many other metrics can provide insight as to the relative performance of companies and the importance, value and strength of their patent portfolios. These metrics may be incorporated into the patent network visualization system as attributes for sizing the relevant nodes and metanodes, or otherwise for changing the appearance of the nodes or links in order to inform the user of important information about their research.
Product information is another source of important corporate information that can provide further insight to users of the patent network visualization system. Many companies have online product catalogs. These catalogs often contain technical information that can be linked to patent data by searching the product database for key terms found in the patent specification. The system may use the assignee information along with keywords from the patent specification to create links to product data in the company's product catalog. These external links may be displayed as nodes in a network graph and may allow users to learn whether and how companies apply the technologies they have patented.
● academic data - another exogenous data source that may inform a user's study of a patent database is academic data. Academic data includes information from academic and industry journals, conference proceedings, scientific grants, and other sources. Such information typically exists in the public domain before a patent is issued. It therefore acts as an indicator of cutting-edge research, which is important to many users of patent data. Links may be established between patent data and academic data in various ways. First, patents often cite academic or industry journals as prior art. Second, the inventors listed on a patent have often first published their research in the academic literature, and thus links can be established by matching inventor names to author names. Finally, academic documents may also be linked to patents by means of the organization or company to which the patent is assigned and with which a publication is affiliated. Information about the academic literature surrounding a patented subject can be used to understand the source of basic research ongoing in the field, identify collaboration between industry and academia, identify potential breakthrough technologies before they emerge in patent data, identify companies that have emerged from an academic environment, and gain other insights.
By linking to academic data in the manner described above, it is possible to create a combined network of academic documents and patent data using all of the capabilities of the network visualization system. A second embodiment of a network visualization system, described in detail below, addresses the visualization of academic documents. The combination of the two provides a tremendous breakthrough in the user's ability to understand how people, companies, industries and geographies advance technology and innovate through networks, beyond what was previously possible.
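The corporate financial ratios mentioned above for the corporate data source (R&D spending per patent application with a time lag, and sales or market capitalization per patent) can be sketched as follows. This is only an illustrative sketch: the function names, the default two-year lag, and the sample figures are assumptions and do not come from the embodiment.

```python
def rd_efficiency(rd_spend_by_year, applications_by_year, lag_years=2):
    """R&D spending per patent application, with a time lag.

    Spending in year Y is paired with applications filed in year Y + lag_years.
    The default lag of two years is an assumption made for illustration.
    """
    result = {}
    for year, spend in rd_spend_by_year.items():
        applications = applications_by_year.get(year + lag_years)
        if applications:  # skip years with no (or zero) matching applications
            result[year] = spend / applications
    return result


def per_patent(value, patent_count):
    """Sales or market capitalization divided by the number of patents."""
    return value / patent_count if patent_count else None


# Hypothetical figures (for example, in millions of dollars).
rd_spend = {2000: 100.0, 2001: 120.0}
applications = {2002: 50, 2003: 40}
ratios = rd_efficiency(rd_spend, applications)
print(ratios)
```

A ratio computed this way could then be used as a sizing attribute for the assignee metanodes.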
Implementation by patent offices or commercial patent data vendors
The present inventors have found this embodiment of the NVS to be extremely useful in understanding patent data. The USPTO, EPO and other patent offices, and commercial patent data vendors such as Aureka, Micropatent, Delphion (all now owned by Thomson), Lexis Nexis, and others, have vast patent information databases. Apart from some of the simple analysis tools described above in the prior art section, none of the patent offices or commercial patent data vendors provide their customers with sophisticated tools for patent data analysis. The embodiments described herein, either in their entirety or, more likely, in a simplified version offering the basic network visualization capabilities, would provide their customers with a very powerful front end for accessing their patent databases.
Thomson or Lexis Nexis could choose to implement a very simplified NVS as the user interface for their patent database. This simplified implementation may allow users to search the database using a Boolean search and then return a list of documents, as these systems do today. The NVS can then be used to convert those search results into network data and allow the user to select from a limited set of network visualizations of the search results. The system may provide the user with an option to select one or more of the following:
● citation network - the network of patents in the result set, linked by citation;
● assignee network - the network of assignees in the result set, linked by citation;
● inventor network - the network of inventors, linked by co-inventor relationships;
● IPC or USPC network - the network of IPC or USPC classes, linked by patents assigned to multiple classes;
● assignee/inventor network - the network of assignees linked by citation, with inventors linked to the assignee nodes based on the company to which they have assigned their inventions;
● assignee/IPC or USPC network - the network of assignees linked by citation, with IPC or USPC classes linked to the assignee nodes based on the number of patents the company has filed within each IPC or USPC class.
An additional feature that they may want to include is the ability to filter the result set to limit the records in the visualization. The filtering options should include the ability to filter records out of the visualization by date range, inventor, assignee, or IPC or USPC classification.
While this implementation of the NVS would lack many of the features described in this embodiment, it could still be a huge breakthrough for users in their ability to understand the results of their patent searches. It would allow them to review their search results from many different perspectives, refine their search through the NVS filtering capabilities, and ultimately review patent lists or patent documents directly through the NVS.
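A minimal sketch of such a simplified front end is shown here: a result set is filtered by date range and assignee, then converted into a citation network and an assignee network linked by citation. The record layout (`id`, `assignee`, `year`, `cites`) and the function names are hypothetical and chosen only for illustration; a vendor's actual data model would differ.

```python
from collections import Counter


def filter_records(records, start_year=None, end_year=None, assignees=None):
    """Keep only records inside the date range and assignee set (if given)."""
    kept = []
    for rec in records:
        if start_year is not None and rec["year"] < start_year:
            continue
        if end_year is not None and rec["year"] > end_year:
            continue
        if assignees is not None and rec["assignee"] not in assignees:
            continue
        kept.append(rec)
    return kept


def citation_links(records):
    """Patent-to-patent links, restricted to patents inside the result set."""
    ids = {rec["id"] for rec in records}
    return [(rec["id"], cited) for rec in records for cited in rec["cites"] if cited in ids]


def assignee_links(records):
    """Assignee-to-assignee meta-links, weighted by the number of citations."""
    owner = {rec["id"]: rec["assignee"] for rec in records}
    return Counter((owner[a], owner[b]) for a, b in citation_links(records))


# Hypothetical search results.
results = [
    {"id": "P1", "assignee": "A", "year": 1998, "cites": []},
    {"id": "P2", "assignee": "B", "year": 2001, "cites": ["P1"]},
    {"id": "P3", "assignee": "B", "year": 2003, "cites": ["P1", "P2", "P9"]},
]
print(citation_links(results))
print(assignee_links(results))
print(citation_links(filter_records(results, start_year=2000)))
```

Citations to patents outside the result set (here the made-up "P9") are dropped, so the network always stays within the records the search returned.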
It is also worth noting that the two major commercial patent data vendors are parts of large organizations that house many other data sources. The NVS described in this embodiment, and in the more general description above, can provide a front end for all of their various data types. In addition, some of the exogenous data sources described in this embodiment are actually owned by, or licensed to, the two main patent data companies, Thomson and Lexis Nexis. Implementing the NVS as a front end for accessing their databases, or as a way to link the disparate data sources within their systems, would allow these companies to offer their customers a highly differentiated value proposition. These and other data vendors need to find new ways for users of their databases to extract more value from their data in order to grow and to support higher prices and profits. The NVS can make a huge contribution to those goals.
Example 2: method and apparatus for searching and analyzing medical publication databases
In a second example, one or more embodiments of the invention are applied to searching and analyzing documents in an academic literature database. An example of such a database is the medical publication database known as PubMed. Applications to other academic databases are equally possible.
PubMed is a large database of medical research papers that appear in nearly 200 medical journals. It is a rich repository of information about research in the medical world. PubMed is most frequently used by physicians or other medical professionals who seek information about particular diseases, treatment regimens, or other medical subjects of interest. Their research always begins with a Boolean search on keywords, authors or journals, after which they are presented with a list of papers that match their chosen criteria. The next step for the researcher is to scan the list of search results and read titles and abstracts until she finds an article of interest. She then proceeds to read some or all of the article, and may then return to her search results and continue scanning and reading until she finds the information she is looking for.
This method is useful and perfectly adequate for medical experts if the aim of the researcher is to find the information contained in one or a few papers. However, there is a class of questions that cannot be easily answered in this manner. We call these questions meta-questions. They are questions about the meta-entities represented by the articles, not about the content of the articles. Rather than asking about individual articles and what they contain, researchers are often interested in questions such as the following:
● What is happening in the field of gene therapy?
● Who are the leading researchers in Alzheimer's treatment?
● Which institutions are working together to study osteoarthritis?
● How are the various areas of cancer research related to each other?
● Who are the most influential researchers in nanotechnology?
● What research is being conducted in a specific area of medical science, such as a disease group (e.g., Alzheimer's disease), a therapy (e.g., immunotherapy), or a specific mechanism of action (e.g., anti-plaque formation)? How is the work progressing? How has it changed over time?
● What is the pattern of collaboration within a given domain? Who is involved? How do they work together? Where is there a tightly collaborating research community? Where is it fragmented? Where are the best opportunities to connect the various research groups in order to, for example, convert scientific findings into practice for MRSA? Is the pattern of collaboration strengthening or weakening over time?
● What is the intellectual structure of a given domain, i.e., which topics tend to be studied intensively and which do not? Is this for scientific reasons or because of a failure of institutional and social relationships? What are the strong topic relationships that recur frequently? What relationships will appear in the future?
● How does the research in a domain (i.e., the intellectual structure and how it evolves over time) influence the emerging patterns of collaboration, and vice versa?
● How do people within a particular company or university work together on a particular topic? Among themselves? With others outside the institution?
● Who are the most influential medical scientists in a given therapeutic field? Who is most central to the network? Which groups of medical scientists, taken together, best span the network through their individual patterns of influence?
These and many other questions can be answered using the NVS as described above. Academic data in general, and PubMed data in particular, can be analyzed and visualized in the same manner as patent data. The attributes that make patent data analyzable by the methods described herein are quite similar to the attributes of data found in academic databases, including PubMed. Fig. 20 shows the relationship between the attributes of patent data and the attributes of academic literature.
As is apparent, there is a direct correspondence between the two data sources. This makes it possible to analyze PubMed data (or any academic literature database) using the same methods described in the previous embodiment. However, not every patent attribute has a direct analog in academic data. Therefore, the methods described for the patent network visualization system need to be slightly modified for use with academic documents.
Just as with the patent visualization system, the medical network visualization system allows users to create and review database records from an academic database as a network of nodes and links. As with the patent database, academic data is not initially structured as network information. In other words, it does not contain a node list and a link list. Before it can be visualized as a network, the academic data must first be structured in order to convert it into network information. This is achieved in the same way as described in the previous embodiment.
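The structuring step described above, which turns flat database records into a node list and a link list, can be sketched as follows. The record fields (`pmid`, `authors`, `journal`, `mesh`) and the function name are assumptions made for illustration and are not PubMed's actual export format.

```python
def structure_records(records, fields=("authors", "journal", "mesh")):
    """Convert flat article records into a node list and a link list.

    Each article becomes a node, each distinct attribute value (author,
    journal, MeSH term) becomes a node, and a link joins every article
    to each of its attribute values.
    """
    nodes = {}   # node id -> node type
    links = []   # (article node id, attribute node id)
    for rec in records:
        article = ("article", rec["pmid"])
        nodes[article] = "article"
        for field in fields:
            values = rec.get(field, [])
            if isinstance(values, str):   # single-valued fields such as journal
                values = [values]
            for value in values:
                node = (field, value)
                nodes[node] = field
                links.append((article, node))
    return nodes, links


# Hypothetical records.
records = [
    {"pmid": 1, "authors": ["Smith J", "Lee K"], "journal": "Diabetes", "mesh": ["Insulin"]},
    {"pmid": 2, "authors": ["Lee K"], "journal": "Diabetes", "mesh": ["Insulin", "Obesity"]},
]
nodes, links = structure_records(records)
print(len(nodes), len(links))
```

Keeping the node type inside the node identifier means that an author and a journal sharing the same text never collapse into a single node.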
Once the data is structured, it is possible for a user to view various network visualizations based on database records from PubMed or another academic database. Unlike prior art systems, the medical network visualization system does not require a fixed definition of network nodes and links. Instead, the researcher may dynamically change the definition of nodes and links according to her interests. This ability to transform a network from one node/link definition to another, and to view multiple connected network views of the same data simultaneously, makes it possible for users to quickly and easily understand large sets of database records and to answer meta-level questions that otherwise cannot be answered.
As with patent data, the nodes and links of a medical data network can be defined in various ways. Some examples of the various nodes/metanodes and links/metalinks that may be created from PubMed data are described below:
● node/metanode - node/metanode definitions in an academic data environment include, but are not limited to, articles, papers, grants, reviews, authors, journals, year of publication, reviewers, author city/state/country, institution city/state/country, journal country, MeSH category, and other information.
● link/metalink - citation, co-citation, bibliographic coupling, common MeSH category, common year of publication, common semantic cluster, common reviewer, common author city/state/country, common institution city/state/country, common journal country, and other information.
Any combination of these nodes/metanodes and links/metalinks, as well as ranges and combinations of the above, may also be used within the NVS to examine a PubMed data set.
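As an illustration of switching between node/link definitions, the following sketch rebuilds a metanode network from the same records simply by changing the attribute that defines the metanodes. Here a metanode is one value of the chosen attribute and a metalink counts the articles that two metanodes share. The record layout and function name are hypothetical, and only this one kind of link (a shared article) is shown.

```python
from collections import Counter
from itertools import combinations


def metanode_network(records, field):
    """Build a metanode network for one attribute of the records.

    Returns (sizes, links): sizes counts the articles behind each metanode,
    links counts the articles shared by each pair of metanodes.
    """
    sizes = Counter()
    links = Counter()
    for rec in records:
        values = rec.get(field, [])
        if isinstance(values, str):
            values = [values]
        values = sorted(set(values))          # one vote per article, stable pair order
        sizes.update(values)
        links.update(combinations(values, 2))
    return sizes, links


# Hypothetical records.
records = [
    {"pmid": 1, "authors": ["Smith J", "Lee K"], "journal": "Diabetes", "mesh": ["Insulin"]},
    {"pmid": 2, "authors": ["Lee K"], "journal": "Diabetes", "mesh": ["Insulin", "Obesity"]},
]
author_sizes, coauthor_links = metanode_network(records, "authors")   # social network
mesh_sizes, mesh_links = metanode_network(records, "mesh")            # topic network
print(coauthor_links, mesh_links)
```

The same call with `"authors"` gives a co-authorship view and with `"mesh"` gives a topic co-occurrence view, which is the kind of transformation described above.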
In addition, these nodes and links may be sized to provide additional information to the user, as described above. Some particularly useful attributes that may be used to size nodes and links in a medical data environment include:
● node/metanode sizing - in the context of medical data, there are several specific metrics relevant to node/metanode sizing. For example, a metanode may be sized based on the number of articles, the number of times the articles are cited (forward citations), the number of citations made by the represented articles (backward citations), the total number of citations (forward plus backward), citations per year since publication, average citations per article, average or total number of MeSH classifications, number of authors, and many other attribute metrics.
As mentioned previously, node and metanode sizes may also be calculated from any number of network statistics, such as the sum of the centrality, eigenvector centrality, or betweenness centrality of the represented articles. In a medical research environment, these metrics indicate how important a piece of research is based on citation by peers. The ability to discover significant research is crucial as biotechnology, pharmaceutical and other life science companies attempt to remain at or near the frontline of research that will help them introduce the next blockbuster drug or highly profitable medical device.
● link/metalink sizing - examples of attributes used for link/metalink sizing in the medical field include the number of citations, the number of unique articles cited, the average year of citation, the most recent year of citation, and other attributes.
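A few of the node sizing metrics listed above can be sketched as follows. The metric names, the record fields (`year`, `forward`, `backward`), and the choice to count at least one elapsed year are assumptions made for illustration only.

```python
def citations_per_year(forward_citations, pub_year, current_year):
    """Forward citations divided by the years elapsed since publication."""
    years = max(current_year - pub_year, 1)   # assumption: count at least one year
    return forward_citations / years


def metanode_size(articles, metric, current_year):
    """Size of a metanode computed from the articles it represents."""
    if metric == "article_count":
        return float(len(articles))
    if metric == "forward_citations":
        return float(sum(a["forward"] for a in articles))
    if metric == "total_citations":
        return float(sum(a["forward"] + a["backward"] for a in articles))
    if metric == "avg_citations_per_article":
        return sum(a["forward"] for a in articles) / len(articles) if articles else 0.0
    if metric == "citations_per_year":
        return sum(citations_per_year(a["forward"], a["year"], current_year) for a in articles)
    raise ValueError("unknown metric: " + metric)


# Hypothetical articles represented by a single author metanode.
articles = [
    {"year": 2000, "forward": 20, "backward": 5},
    {"year": 2003, "forward": 4, "backward": 10},
]
for name in ("article_count", "total_citations", "citations_per_year"):
    print(name, metanode_size(articles, name, current_year=2005))
```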
Examples of the medical network visualization system
Shown below are various exemplary screen shots illustrating network graphs generated with a Medical Network Visualization System (MNVS) for searching and analyzing specific medical documents. These figures, while based on a limited set of node and link combinations, still reveal the capabilities available to users of the network visualization system. For these simple examples, nodes and metanodes are kept at a constant size, and links and metalinks are kept at a constant width. However, as with the patent network visualization system, these parameters may also be varied to provide further insight to the user.
Fig. 21 shows a typical network based on PubMed data and visualized with the MNVS described herein. It is the result of a search for documents with the Medical Subject Heading (MeSH) "Diabetes Mellitus, Type I" that were written in "Boston" from January 2000 to date. The tool retrieved 124 documents to create the network graph. The links in the graph can be interpreted according to the key described above.
The network diagram shows three different metanode types (author metanodes, journal metanodes, and MeSH metanodes). In a preferred embodiment of the medical network visualization system, the node types are distinguished by color (author - black on yellow, journal - white on green, and MeSH - blue on white). Although these colors are difficult to see in black and white reproductions, they appear as follows: authors, dark text on a light background; journals, white text on a dark background; MeSH, dark text on a white background.
The MNVS provides the user with the ability to select node and link definitions to suit her task. This capability is demonstrated in Fig. 22, which shows a detail of the same network as Fig. 21 but displays only author-author links, revealing the social network of the scientific community. From these types of network maps, it is possible to learn who the leading researchers are in a particular research area, with whom they are working, and which scientists are most influential.
One unique element of the medical journal database is the importance of the order of author names on an article. Based on interviews and our experience with this type of analysis, we have learned that the first author of a medical journal article is the principal investigator (PI) of the study. If a second PI is involved in the study, which is often the case, her name will appear second. The last place in the list of authors goes to the head of the laboratory in which the study was conducted. This lab head may be only slightly involved in the actual research project, but is likely to be an important figure in the field. The names between the second and last positions in the author list are typically laboratory assistants and other people who contributed less to the article.
For this reason, the MNVS allows the user to select which author names to include in the network. We have found that one useful setting is to include the first, second and last author names in the network and exclude all others.
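The author selection setting described above (first, second and last author only) can be sketched as follows. The function name and the position labels are hypothetical, and the handling of short author lists is an assumption.

```python
def select_authors(authors, positions=("first", "second", "last")):
    """Return the authors at the chosen positions of an ordered author list.

    Short lists are handled without duplicates: a single-author paper yields
    that author once, even though she is both first and last.
    """
    if not authors:
        return []
    index_for = {"first": 0, "second": 1, "last": len(authors) - 1}
    chosen = []
    for position in positions:
        i = index_for[position]
        if i < len(authors) and authors[i] not in chosen:
            chosen.append(authors[i])
    return chosen


print(select_authors(["A", "B", "C", "D", "E"]))
print(select_authors(["A"]))
```

Only the selected names would then be turned into author metanodes, which keeps laboratory assistants and minor contributors out of the social network view.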
Figs. 23 and 24 further demonstrate the capability of the MNVS. Here the same network is transformed into an author and journal network (Fig. 23) and an author and MeSH classification network (Fig. 24). These networks enable users to quickly learn which areas of research are of interest to the researchers within the network.
Finally, Fig. 25 again shows the same network, this time as a network of links between the medical topic areas named according to the MeSH classification. Using the MNVS in this manner, medical professionals are often surprised to find unexpected associations between two medical areas that appear unrelated. Unexpected connections between medical subjects may lead to new ways of thinking about medical problems and suggest new research paths, because findings in one research area may be applicable to challenges in another. Medical experts tend to be siloed by specialty, and they have few ways of finding such connections, since experts in domain A and experts in domain B will not attend the same meetings, participate in the same advanced training programs, read the same journals, or otherwise interact. It is of great value to bring together the appropriate people from different specialties, as completely new research paths are often proposed as a result. The medical network visualization system makes it possible to find these unexpectedly connected areas, from which new medical insights may arise.
Network visualization applications for medical/academic data
The capabilities of the medical network visualization system allow researchers to gain insight when analyzing and exploring vast collections of medical database records. These insights come in various forms, and the network visualization system can therefore be used in various settings to analyze topics such as:
● collaboration in general;
● collaboration within an organization;
● who should be considered Key Opinion Leaders (KOLs) or key researchers to target for clinical trials or market influence in a given region;
● clusters of topics within a particular domain (using the accompanying MeSH classifications);
● research synergy or replacement across organizations;
● the regional basis of research strength within a broader geographic area.
Collaboration in general
At a basic level, the network of Fig. 26 shows different "clusters" of collaboration. The user can easily identify the groups of authors who have published together in different journals. Another feature of the system is that author metanodes are distinguished by color based on the organizational affiliation of the author. This provides a deeper understanding of the patterns of collaboration.
Collaboration within an organization
The network in Fig. 27 is the result of a search for the MeSH classification Diabetes Mellitus and the institution Joslin (the institution is found as part of the address field in PubMed). Joslin is shorthand for the Joslin Diabetes Center, the world's leading diabetes center. The chart identifies the collaboration clusters, that is, the people within the organization who co-author work on a particular topic. The figure shows the full names of the journals and MeSH terms that appear in five or more documents in the search. This enables users to see popular research topics such as diabetic retinopathy and islets, as well as the journals, including Diabetes, Transplantation, Diabetes Care, and others, in which this organization has published since 2000.
Key Opinion Leaders (KOLs) or key researchers to target for clinical trials or market influence in a region
In Fig. 28, the Diabetes Mellitus search is restricted by territory (Australia) rather than by institution. Here, the network diagram is limited to showing only those authors who have written 15 or more documents. Cooper ME is the author whose name appears in 49 documents, which makes this author a candidate for further investigation, for example by a pharmaceutical or biotech company that sells diabetes drugs.
Clusters of topics within a particular domain (using the accompanying MeSH classifications)
Fig. 29 shows the network resulting from a search that is limited by organization, with the results highlighted so that the user can also see the relevant MeSH classifications. All other nodes and links are removed here, leaving the documents that share a common MeSH classification. For example, documents having the MeSH classification Diabetes Mellitus, Type II in this example are also coded as Obesity, Islets of Langerhans, Blood Glucose, Insulin, and others.
Research synergy or replacement across organizations
The network shown in Fig. 30 was generated by a search for the MeSH classification Cardiovascular Agents and three specific organizations. The following organizations, or combinations thereof, have been highlighted using a color query feature: Pharmacia (yellow), Pfizer (green), Warner Lambert (blue), and the combination of Pfizer and Pharmacia (purple). This figure enables the user to understand which MeSH topics each organization's research falls under within a larger domain. This may help organizations think about strategies for investing resources in a given research program, competition within a particular research area, and/or emerging areas in which they are not yet involved.
Regional basis of research strength within a broader area
The MNVS can also help reveal research "hubs" across regions. In Fig. 31, a search was conducted for the MeSH classification Diabetes Mellitus, Type I, designed to highlight research from Massachusetts (color coded green), California (color coded pink), and North Carolina (color coded blue). The user can run a similar search without regional restrictions and explore the data to see which areas emerge as "hubs". In addition, the user may be able to identify a region that concentrates on a small niche within a broad field (e.g., Autoantigens in this diabetes research example).
As demonstrated in these examples, all of the NVS features described above are relevant to the analysis of PubMed data, including network transformation, the use of multiple nodes and links, fractal networks, network animation, statistics, and linking to external data sources. Some elements of a preferred embodiment of NVS, particularly for PubMed data analysis, are described further below.
Statistical information
Various types of statistical information related to PubMed data analysis include:
● organization - the number of articles in the selected network for each organization, ranked from high to low.
● author - the number of articles in the selected network for each author, ranked from high to low.
● classification - the number of articles in the selected network for each MeSH classification, ranked from high to low or sorted by classification category. Since the MeSH classification scheme is hierarchical, the data is displayed using a tree structure, with the number of articles within each category and subcategory displayed next to each branch of the tree.
● word usage - the number of articles containing given keywords, phrases, or groupings of words.
● citations - various types of information about citations can be provided, including but not limited to the following:
the most frequently cited articles, organizations, authors, or other groupings;
the articles, organizations, authors, or other groupings with the highest number of citations per year since publication.
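The hierarchical classification statistic described above, in which article counts are shown next to each branch of the MeSH tree, can be sketched as follows. The sketch assumes that each article carries MeSH tree numbers in dotted notation; the tree numbers used here are illustrative placeholders rather than verified MeSH codes, and the function name is hypothetical.

```python
from collections import defaultdict


def mesh_tree_counts(articles):
    """Count distinct articles under every branch of a dotted category tree.

    `articles` maps an article id to a list of tree numbers such as
    "C19.246.267". An article is counted once for each ancestor category,
    even if several of its tree numbers fall under the same branch.
    """
    members = defaultdict(set)
    for article_id, tree_numbers in articles.items():
        for tree_number in tree_numbers:
            parts = tree_number.split(".")
            for depth in range(1, len(parts) + 1):
                members[".".join(parts[:depth])].add(article_id)
    return {category: len(ids) for category, ids in members.items()}


# Hypothetical articles with illustrative tree numbers.
articles = {
    1: ["C19.246.267"],
    2: ["C19.246.300"],
    3: ["C19.246.267", "C18.654"],
}
counts = mesh_tree_counts(articles)
for category in sorted(counts):
    # Indent by depth so the output reads like a tree.
    print("  " * category.count(".") + category, counts[category])
```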
Statistical information is also provided in pop-up windows. Several examples of the types of information provided in the pop-up windows, particularly with respect to PubMed data, are described below:
● article node pop-up - when an article node is selected, a pop-up window may be opened that displays information about the article represented by the selected node. The information provided includes the basic information found on the summary page of a typical article, including PubMed ID number, title, authors, organization, publication date, MeSH categories, citations, and other data from the article, such as number of pages, number of figures, number of words, etc. In addition, many of the fields in the pop-up window are hyperlinked, allowing the user to bring up additional information. For example, the article number is a hyperlink to the full text (or PDF) of the article, each citation is a hyperlink, and so on. The pop-up window may also include network statistics about the article, such as its centrality.
● organization metanode - when an organization metanode is selected, a pop-up window may be opened that displays a menu of the different types of data that may be displayed about the organization and the articles represented by the metanode. The menu options include tables showing the articles represented by the metanode, the articles by MeSH category, a list of articles by author, and a graph showing the articles by year. Additional menu options display network statistics about the organization's metanode, including total citations, average citations per year (since the year of publication), and the sum of the eigenvector centrality of the organization's portfolio divided by the sum of the eigenvector centrality of the entire network (a measure of portfolio value). Another menu option provides information about the institution. This menu option links to the website of the organization or to basic company and financial information about it.
● author metanode - when an author metanode is selected, a pop-up window may be opened that displays a menu of the different types of data that may be displayed about the author and the articles represented by the metanode. The menu options include tables showing the articles represented by the metanode, the articles by co-author, and the articles by MeSH category, as well as a graph showing the articles by year.
● MeSH metanode - when a MeSH metanode is selected, a pop-up window may be opened that displays a menu of the different types of data that may be displayed about the MeSH category and the articles represented by the metanode. The menu options include tables showing the articles represented by the metanode, the articles by organization, a list of articles by author, and a graph showing the articles by year. Another menu option provides detailed information about the MeSH category, including a complete description of the category and its location in the MeSH hierarchy.
● metalink - when a metalink is selected, a pop-up window may be opened that displays information about the connection that the metalink represents. A table may be displayed showing a list of the article-to-article links represented by the metalink, together with a plot over time of the number of individual links it represents. If, for example, the metalink is a co-author link, the metalink pop-up window will display the collaboration history between the two authors. If the metalink is an organization-to-organization citation link, the pop-up window will display the history of citations between the two organizations.
All of these pop-ups make it possible for a user to explore the article network at any desired level of detail, from high-level metadata down to the deepest level of detail about organizations, authors, research areas, and articles. This makes the MNVS a powerful tool for understanding large collections of PubMed documents.
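The portfolio value statistic mentioned above for the organization metanode (the sum of the eigenvector centrality of an organization's articles divided by the sum for the whole network) can be sketched with a simple power iteration. This is only a sketch under stated assumptions: citation links are treated as undirected, the network is assumed to be connected, and the function and variable names are hypothetical. A production system would more likely rely on an established graph library.

```python
def eigenvector_centrality(links, iterations=200, tol=1e-10):
    """Approximate eigenvector centrality by power iteration.

    Links are treated as undirected (an assumption). Each node also keeps
    its own score at every step, which is equivalent to iterating with
    A + I and avoids oscillation on bipartite graphs.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    score = {node: 1.0 / len(neighbors) for node in neighbors}
    for _ in range(iterations):
        new = {n: score[n] + sum(score[m] for m in neighbors[n]) for n in neighbors}
        total = sum(new.values())
        new = {n: value / total for n, value in new.items()}
        converged = max(abs(new[n] - score[n]) for n in new) < tol
        score = new
        if converged:
            break
    return score


def portfolio_share(score, organization_of, organization):
    """Organization's summed centrality divided by the network total."""
    total = sum(score.values())
    own = sum(v for node, v in score.items() if organization_of.get(node) == organization)
    return own / total


# Hypothetical citation chain a-b-c-d split between two organizations.
links = [("a", "b"), ("b", "c"), ("c", "d")]
organization_of = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
score = eigenvector_centrality(links)
print(portfolio_share(score, organization_of, "X"))
```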
Linking to external data sources
Some examples of useful exogenous data sources, particularly related to medical data, and their use in MNVS are described below:
● one valuable external data source that doctors contact data-links to associate with the PubMed database is information about the researcher's affiliation. Most authors in PubMed are physicians and therefore it is possible to link information about these physicians in public and proprietary databases. These databases contain information similar to medical specialties, hospital privileges, DEA #, medical colleges on, completed living plans, medical community members, and the like. Linking to this data makes it possible to create all new classes of metanodes and new types of links that cannot be constructed solely from PubMed data.
● script data-another high-value exogenous data source is script data. Proprietary databases such as IMS maintain information about doctors' prescribing patterns, counting the number of prescriptions that each physician writes for each drug he or she prescribes. This data is extremely valuable to biotech and pharmaceutical companies for determining which physicians are most important from a market perspective. When combined with MNVS, it enables biotechnology and pharmaceutical companies to identify the key opinion leaders (KOLs) who are most closely connected to the largest number of prescribers within the therapeutic field of interest. By targeting these KOLs, these companies can influence doctors' prescribing patterns and gain market share.
● reference data-proprietary databases such as LRX also provide a valuable external data source to link to. The LRX database captures physician reference information that can be used to create a social network of medical relationships within or across specialties.
● survey data-companies such as Alpha Detail have surveyed thousands of doctors to determine what information they read and which other doctors influence them. This data is valuable as another source of exogenous data, especially for addressing the marketing problems of life science companies.
● grant data-a step that often precedes the publication of medical literature is the submission and approval of a grant. In the United States, many of these grants go through the National Institutes of Health (NIH). NIH maintains a database called Computer Retrieval of Information on Scientific Projects (CRISP). The database has information about all research projects that NIH funds. Linking to this data enables innovative research to be tracked even before the first medical article is published. Similar databases are maintained in other countries.
● FDA test data-the FDA maintains a public database of information about candidate drugs in various stages of the FDA approval process. By linking to this data, it is possible to analyze how much medical research feeds into the drug pipeline and to assess the position of various drugs in drug companies' pipelines.
● FDA production data-at the other end of the timeline are the FDA databases. The goal of most medical research is to develop treatments for certain diseases, which in most cases must pass FDA approval (in the United States). The FDA maintains a drug database and many other databases that provide information about both over-the-counter and prescription drugs, food additives, and many other health-related products. By linking to this data, it is possible to track the output of the research contained in the PubMed database.
● institution data-institution data is another exogenous data source that may be incorporated into the MNVS. The link to institution or company data may be made through the institution fields in the PubMed database or directly through a database of the institutional affiliations held by the doctor/author. Various types of institution/company data are available from various sources. Linking to this data makes it possible to analyze more deeply the roles that companies, universities, government entities and research institutes play in the research area of interest.
● patent data-linking to patent data is also of considerable importance. Patent data represents those portions of medical research that have been converted into protectable intellectual property.
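As a rough illustration of how linking to an external source can yield new classes of metanodes, the following sketch joins hypothetical authorship pairs to a hypothetical physician table and groups articles by an attribute that PubMed alone does not supply. All names, fields, and values here are assumptions made for illustration; real author matching would also need name disambiguation, which this sketch omits.

```python
from collections import defaultdict

# Hypothetical (article id, author) pairs derived from article records.
AUTHORSHIPS = [
    ("A1", "Smith J"),
    ("A2", "Smith J"),
    ("A2", "Lee K"),
    ("A3", "Wong P"),
]

# Hypothetical external physician records keyed by author name.
PHYSICIAN_DATA = {
    "Smith J": {"specialty": "Cardiology"},
    "Lee K": {"specialty": "Oncology"},
}


def metanodes_by_external_attribute(authorships, external, attribute):
    """Group article ids into metanodes keyed by an externally supplied attribute."""
    groups = defaultdict(set)
    for article_id, author in authorships:
        record = external.get(author)
        # Authors with no external match fall into an explicit "Unknown" group.
        value = record[attribute] if record else "Unknown"
        groups[value].add(article_id)
    return dict(groups)


specialty_nodes = metanodes_by_external_attribute(
    AUTHORSHIPS, PHYSICIAN_DATA, "specialty"
)
```

Each key of `specialty_nodes` would correspond to one metanode, and the set of article ids is the collection of records it represents.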
Combining patent and medical NVS
Although the patent and PubMed embodiments have been described separately, NVS can also incorporate both data sources. Links between these two data sources exist in a variety of forms, including citations from patents to academic documents, links between article authors and inventors, and links between companies/institutions. By linking medical research data with patent data, as well as with grant, FDA and script data, it is possible to obtain a picture of the entire life cycle of an idea, from its inception through product certification and marketing.
NVS enables a deeper understanding of the nature of scientific and technical development than was previously possible. It can answer many different types of questions that cannot be answered by any means known in the art. Many of these questions are of very high value, not only economically but also for the social good.
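A minimal sketch of two of the cross-source link types mentioned above (patent-to-article citations, and inventors who are also article authors) might look as follows. The record layouts and the `cross_source_links` helper are hypothetical, and exact name matching stands in for whatever person-matching a real system would use.

```python
# Hypothetical patent and article records; field names are assumptions.
PATENTS = [
    {"id": "P1", "inventors": ["Smith J"], "cited_articles": ["A2"]},
    {"id": "P2", "inventors": ["Wong P"], "cited_articles": []},
]
ARTICLES = [
    {"id": "A1", "authors": ["Smith J", "Lee K"]},
    {"id": "A2", "authors": ["Lee K"]},
]


def cross_source_links(patents, articles):
    """Return (patent id, article id, link type) tuples joining the two sources."""
    article_ids = {article["id"] for article in articles}
    links = []
    for patent in patents:
        # Link type 1: the patent cites an article present in the collection.
        for cited in patent["cited_articles"]:
            if cited in article_ids:
                links.append((patent["id"], cited, "citation"))
        # Link type 2: an inventor also appears as an article author.
        for article in articles:
            if set(patent["inventors"]) & set(article["authors"]):
                links.append((patent["id"], article["id"], "shared_person"))
    return links


links = cross_source_links(PATENTS, ARTICLES)
```

Keeping the link type on each tuple lets a combined network show citation links and person links differently, or filter on either.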
Other embodiments
It will be apparent to those skilled in the art that NVS may be applied in a wide variety of environments; the described embodiments demonstrate the applicability of NVS to two different data sources. Many other data sources may be used in a manner similar to that described in these embodiments.
Computer implementation
The method of analyzing database records according to various embodiments of the present invention is preferably implemented in a general purpose computer 300 shown in FIG. 32. Representative computer 300 is a personal computer or workstation platform that is, e.g., Intel- or RISC-based, and that includes an operating system such as Unix or the like. As is well known, these machines include a display interface 302 (graphical user interface or GUI) and an associated input device 304 (e.g., keyboard or mouse).
The database record analysis method is preferably implemented in software, and thus one embodiment is a set of instructions 306 (e.g., program code) in a code module residing in a computer-readable medium, such as random access memory 308 of computer 300. Until required by the computer 300, the set of instructions 306 may be stored in another computer-readable medium 310, for example, a hard disk drive or a removable memory such as an optical disk (for eventual use in a CD-ROM drive) or floppy disk (for eventual use in a floppy disk drive), or may be downloaded via the Internet or some other computer network. In addition, although the various methods described are conveniently implemented in a general purpose computer 300 selectively activated or reconfigured by software, one of ordinary skill in the art would also recognize that such methods may be carried out in hardware, in firmware, or in more specialized apparatus constructed to perform the specified method steps.
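As one possible software sketch of the record analysis method (not the actual program code of any embodiment), the following selects records by a descriptive criterion, associates one group of nodes with inventors and another with assignees, connects them with links weighted by the number of records they share, and repeats the process as the criterion is revised. The records, field names, and functions are all hypothetical.

```python
from collections import Counter

# Hypothetical patent records; field names and values are assumptions.
RECORDS = [
    {"title": "Stent coating", "inventors": ["Smith J"], "assignee": "Acme"},
    {"title": "Stent delivery", "inventors": ["Smith J", "Lee K"], "assignee": "Acme"},
    {"title": "Laser welding", "inventors": ["Wong P"], "assignee": "Beta"},
]


def select_records(records, keyword):
    """Select records whose title contains the keyword (the descriptive criterion)."""
    return [r for r in records if keyword.lower() in r["title"].lower()]


def build_network(records):
    """Build inventor nodes, assignee nodes, and weighted inventor-assignee links."""
    inventor_nodes, assignee_nodes = set(), set()
    link_weights = Counter()
    for record in records:
        assignee_nodes.add(record["assignee"])
        for inventor in record["inventors"]:
            inventor_nodes.add(inventor)
            # The weight counts the records shared by the two connected nodes.
            link_weights[(inventor, record["assignee"])] += 1
    return inventor_nodes, assignee_nodes, link_weights


# Revising the criterion and rebuilding gives a different network each time.
networks = {kw: build_network(select_records(RECORDS, kw)) for kw in ("stent", "laser")}
```

A display layer could map each link weight to a visual characteristic such as link thickness.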
Other aspects, modifications, and embodiments are within the scope of the following claims.

Claims (33)

1. A method for providing a network graphical representation of two or more database records, comprising:
selecting the two or more database records according to one or more descriptive criteria, wherein each of the two or more database records is a member of a common record class;
identifying two or more common attributes of the database records, associating a first group of network nodes with a first instance of the common attribute from the database records, and associating a second group of network nodes with a second instance of the common attribute from the database records;
connecting one or more members of the first set of network nodes to one or more members of the second set of network nodes using network links specifying associations between the network nodes to form a first network graph representation; and
iteratively performing the identifying and connecting steps while revising the one or more descriptive criteria so as to change the selected two or more database records.
2. The method of claim 1, wherein the common record class comprises patent records.
3. The method of claim 2, wherein the patent records are extracted from a LexisNexis database.
4. The method of claim 2, wherein the patent records are extracted from a Thomson database.
5. The method of claim 2, wherein the patent records are extracted from a USPTO database.
6. The method of claim 2, wherein the patent records are extracted from an EPO database.
7. The method of claim 2, wherein the patent records are extracted from a Derwent database.
8. The method of claim 1, wherein the common record class comprises academic journal articles.
9. The method of claim 8, wherein the academic journal article is extracted from a PubMed database.
10. The method of claim 2, wherein the one or more descriptive criteria are selected from the group consisting of: (i) one or more keywords within a body field of each of the patent records; (ii) one or more keywords within a title field of each of the patent records; (iii) one or more inventors in the inventor field of each of the patent records; (iv) one or more assignees in an assignee field of each of the patent records; (v) one or more keywords within the summary field; and combinations thereof.
11. The method of claim 2, wherein the common attributes include inventors.
12. The method of claim 2, wherein the common attribute comprises an assignee.
13. The method of claim 2, wherein the common attribute comprises an application date.
14. The method of claim 2, wherein the common attribute comprises a date of issuance.
15. The method of claim 2, wherein the common attribute comprises an IPC code.
16. The method of claim 2, wherein the common attribute comprises a USPC code.
17. The method of claim 2, wherein the common attribute comprises a search field.
18. The method of claim 1, wherein the network link includes a characteristic describing a number of common instances occurring between the connected nodes.
19. The method of claim 18, wherein the characteristic comprises link thickness.
20. The method of claim 18, wherein the characteristic comprises a link color.
21. The method of claim 18, wherein the characteristics include a link structure.
22. The method of claim 1, wherein the at least one group of network nodes is a metanode group.
23. The method of claim 22, wherein the set of meta-nodes describes characteristics of two or more database records.
24. The method of claim 23, wherein the one or more descriptive criteria include a date range.
25. The method of claim 2, further comprising selecting an additional database record from a record class other than the common record class of patent records, and associating a network node, a network link, or both with an instance of one or more attributes from the additional database record.
26. The method of claim 25, wherein the record class other than the common record class of patent records describes licensing history associated with the patent records.
27. The method of claim 25, wherein the record class other than the common record class of patent records describes litigation history associated with the patent records.
28. The method of claim 25, wherein the record class other than the common record class of patent records describes maintenance fee history associated with the patent records.
29. The method of claim 8, further comprising selecting an additional database record from a record class other than the common record class of academic journal articles, and associating a network node with an instance of one or more attributes from the additional database record.
30. The method of claim 29, wherein the record class other than the common record class of academic journal articles describes doctor contact data associated with the academic journal articles.
31. The method of claim 29, wherein the record class other than the common record class of academic journal articles describes script data associated with the academic journal articles.
32. The method of claim 29, wherein the record class other than the common record class of academic journal articles describes reference data associated with the academic journal articles.
33. The method of claim 1, further comprising identifying one or more attributes of the record class based on user-provided requirements.
HK08106269.3A 2004-05-04 2005-05-03 Method for selecting, analyzing and visualizing related database records as a network HK1115720B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US56799704P 2004-05-04 2004-05-04
US60/567,997 2004-05-04
PCT/US2005/015346 WO2005107405A2 (en) 2004-05-04 2005-05-03 Method and apparatus for selecting, analyzing and visualizing related database records as a network

Publications (2)

Publication Number Publication Date
HK1115720A1 HK1115720A1 (en) 2008-12-05
HK1115720B HK1115720B (en) 2013-06-21
