Disclosure of Invention
In view of the problems existing in the prior art, the invention aims to provide a visual retrieval method for open source projects, which improves the accuracy of retrieval and of the retrieved results.
In order to achieve the above purpose, the present invention provides a visual retrieval method for open source projects, which comprises the following specific steps:
Step 1, acquiring a plurality of open source projects from an open source community:
Extracting key information for each open source project, wherein the key information comprises the project name, author information, number of watchers, number of stars, number of forks, project description, project size, functional images, license information, contributor information, programming languages and number of lines of code, as well as the page link and source code link of the project;
Step 2, parsing and organizing library dependencies:
For each open source project, parsing the package.json file and the lib/libs folder corresponding to the project to extract its dependency information, wherein the dependency information comprises the dependency names, the dependencies and their hierarchical structure; the hierarchical structure with parent-child node relationships is extracted by splitting the prefixes of the dependency names. The extracted dependencies and their hierarchy are written into a predefined techs field for subsequent index building;
Step 3, extracting and classifying README files:
The README file provided by each open source project describes the functions of the project. The content of each README file is classified into eight categories, namely What, Why, How, When, Who, References, Contribution and Other, using the multi-label classifier READMEClassifier, and the text classified into the What and Why categories is extracted as functional text;
Step 4, extracting functional images and adding an availability indicator:
The functional images of each open source project are extracted using the BeautifulSoup library, and the links of the extracted functional images are written into a predefined {images} field. The availability of an open source project is defined by whether a How category exists in its README file: if so, a new field {hasHow: true} is added, otherwise {hasHow: false} is added. Both the {images} and {hasHow} fields are used for the creation of the subsequent indexes;
Step 5, extracting topics from project information using BERTopic:
Combining the key information of each project to obtain combined information, and training a BERTopic model on the combined information to extract, for each open source project, its topic, topic number, the corresponding topic keywords and the weight of each keyword, wherein the extracted topic keywords are recorded in a predefined {topics} field for subsequent index building and hierarchical topic modeling.
Step 6, building an index:
All the information extracted in steps 1 to 5 is fed into Elasticsearch in a unified manner, and an index is built for each open source project.
Step 7, hierarchical topic modeling: classifying open source projects with the same topic into the same topic class, clustering the topic classes with a clustering method, merging two topic classes whose distance is smaller than a preset threshold into one cluster, and considering the clustering complete when the topic classes in each cluster no longer change.
Step 8, matching the query statement with the projects:
The query statement input by the user is searched over each cluster obtained in step 7 using Elasticsearch. The returned result comprises all hit open source projects, every field in a hit project that matches the query statement, and the relevance score corresponding to each matched field; each field comprises a field name and the corresponding field value, and the content in the field value that matches the query terms is highlighted.
For each hit open source project, the field with the highest relevance score among all matched fields is selected as the optimal field; the query statement input by the user is decomposed into a plurality of independent query terms, and the highlighted content in the optimal field is matched against these query terms one by one to obtain the semantic order of the optimal field.
Step 9, analyzing the source code of all hit open source projects:
For each hit open source project, Esprima is first used to parse the source code files and generate the corresponding abstract syntax tree (AST).
Step 10, recommending open source projects related and similar to the hit open source projects:
All the hit open source projects are taken as a recommendation list, and other open source projects that are related and similar to the hit projects are added to the recommendation list to obtain the final recommendation list.
Step 11, visually displaying all the results of steps 1 to 10.
Further, in step 8, obtaining the semantic order of the optimal field comprises:
Step 8-1, labeling all query terms in the query statement from left to right, and labeling the highlighted content in the optimal field corresponding to each query term with the same label as that query term.
Step 8-2, the optimal field comprises N sentences, N ≥ 1; the labels of all query terms in each sentence form an array, and repeated labels in each array are removed.
Step 8-3, concatenating the N arrays to obtain a one-dimensional array of labels; when the elements in the one-dimensional array repeat periodically from left to right, the elements of the first period are retained and the elements of the other periods are removed, yielding the semantic order of the optimal field of the query statement in the hit open source project.
Further, in step 10, obtaining the final recommendation list comprises:
Step 10-1, searching with Elasticsearch for other open source projects of the owner or of the contributors of a hit open source project. If the number of code lines contributed by a contributor to the hit project exceeds a set threshold, that contributor is included in the contributor search list; if the contribution line counts of the contributors are below the threshold, the top three contributors are included in the contributor search list by default. Elasticsearch is then used to search for other open source projects of the contributors in the list, and the retrieved other open source projects of the owner or of the contributors are defined as related open source projects.
Step 10-2, the project name, description and README text of each related open source project are processed and merged into a string list, wherein the index of the string list represents the number of each project and the list element represents the key information of that project.
The project name, description and README text of each hit open source project are processed and combined into a string list in the same way.
Step 10-3, computing the cosine similarity between the hit open source project and the related open source projects, sorting in descending order of cosine similarity, and adding the top ten related open source projects to the recommendation list to obtain the final recommendation list.
Further, in step 11, the visual presentation comprises five components: a basic interface component, a topic overview view component, a topic cluster view component, a project detail view component and a global control panel.
The basic interface component comprises three sub-components, namely a search bar, a result filtering and sorting component and a view switching component, wherein the search bar is used for inputting keywords or query statements and starting the search operation;
The topic overview component is used for displaying the topic tree of all open source projects in the final recommended project list, and further comprises a control module for adjusting parameters.
The topic cluster view component is used for graphically displaying the clustering results and showing the usage frequency of open source projects in the clustering results.
The project detail view component comprises three parts, namely basic project information, a visualization of the source code analysis results, and a project recommendation table. The basic project information shows the usage frequency of folder names, file names, function names and variable names; the visualization of the source code analysis results shows the source code of the hit open source projects; and the project recommendation table shows the final recommendation list;
The global control panel component is used for displaying the popularity indices, update times and matching order of the open source projects in the final recommendation list.
Compared with the prior art, the invention has at least the following advantages:
1. BERTopic extracts topics and keywords from text such as project descriptions and README files based on semantic understanding. Compared with traditional machine learning methods, BERTopic not only divides the projects into different topics but also provides the keywords of each topic and the corresponding keyword weights. This effectively solves the problem that traditional tools cannot generate suitable query keywords, and reduces the risk that the user misses related projects due to keyword mismatch.
2. Scattered project information is aggregated into clear topic classes based on semantic relevance through BERTopic topic extraction, and the topic classes are further arranged from general to specific levels through hierarchical modeling, constructing a multi-level structure progressing from upper-level topics to lower-level sub-topics. This effectively solves the problem of information dispersion and avoids the tedious operation in which the user explores projects and their links one by one to identify the relationships between projects during retrieval. Through the hierarchical display of semantic associations, the transparency of the search results is improved, a more efficient and intuitive result presentation is provided, the information acquisition efficiency of the user is significantly improved, and the overall user experience is optimized.
3. The order of the terms in the query statement is compared with the matching order of those terms in the project field with the highest relevance score, and related projects conforming to the specified matching order are further screened, which effectively solves the technical problem that users have to explore each project one by one to identify the relationships between projects due to information dispersion.
4. The source code is parsed with Esprima to generate an abstract syntax tree (AST), key identifiers such as folder, file, function and variable names are extracted accurately, and their frequencies are counted. Based on these identifiers and their frequency information, a word cloud is generated; the user can click an identifier in the word cloud, view the file location and further access the source code. This effectively solves the difficulty of locating the target file within a complex file hierarchy, significantly reduces the comprehension burden and improves retrieval efficiency.
5. The recommended project list is generated by computing the relevance and similarity of projects, providing the user with more projects of potential interest, which effectively alleviates the information overload caused by the huge number of open source projects and helps the user quickly find associated projects that meet their needs.
Detailed Description
The present invention will be described in further detail below.
The invention provides a visual retrieval method for open source projects, which comprises the following specific steps:
Step 1, acquiring a plurality of open source projects from an open source community:
Extracting key information for each open source project, wherein the key information comprises the project name, author information, number of watchers, number of stars, number of forks, project description, project size, functional images, license information, contributor information, programming languages and number of lines of code, as well as the page link and source code link of the project;
Step 2, parsing and organizing library dependencies:
For each open source project, the package.json file and the lib/libs folder corresponding to the project are parsed to extract its dependency information, wherein the dependency information comprises the dependency names, the dependencies and their hierarchical structure; the hierarchical structure with parent-child node relationships is extracted by splitting the prefixes of the dependency names. The extracted dependencies and their hierarchy are written into a predefined techs field for subsequent index building;
The library dependencies reflect the way the project was developed. Dependency items are extracted from the package.json file and the lib/libs folders. To reduce data redundancy, version numbers in the extracted library names are removed. Dependencies with parent-child node relationships are extracted by splitting the dependency names on their prefixes;
Step 3, extracting and classifying README files:
The README file provided by each open source project describes the functions of the project. The content of each README file is classified into eight categories, namely What, Why, How, When, Who, References, Contribution and Other, using the multi-label classifier READMEClassifier, and the text classified into the What and Why categories is extracted as functional text;
Step 4, extracting functional images and adding an availability indicator:
The pictures in the functional text typically show the functions and execution results of the project, so the BeautifulSoup library is used to extract the functional images of each open source project, and the links of the extracted functional images are written into a predefined {images} field. The availability of an open source project is defined by whether a How category exists in its README file: if the How category exists, indicating that the project has higher availability, a new field {hasHow: true} is added, otherwise {hasHow: false} is added. Both the {images} and {hasHow} fields are used for the creation of the subsequent indexes;
Step 5, extracting topics from project information using BERTopic:
Combining the key information of each project to obtain combined information, and training a BERTopic model on the combined information to extract, for each open source project, its topic, topic number, the corresponding topic keywords and the weight of each keyword, wherein the extracted topic keywords are recorded in a predefined {topics} field for subsequent index building and hierarchical topic modeling.
Step 6, building an index:
All the information extracted in steps 1 to 5 is fed into Elasticsearch in a unified manner, and an index is built for each open source project.
Step 7, hierarchical topic modeling: classifying open source projects with the same topic into the same topic class, clustering the topic classes with a clustering method, merging two topic classes whose distance is smaller than a preset threshold into one cluster, and considering the clustering complete when the topic classes in each cluster no longer change.
Because the generated topics may be related, a hierarchical relationship structure among all topics is further constructed based on the topic-word matrix. The hierarchy is built by computing the distance between topics, for example using a measure such as cosine similarity. Starting from each individual topic, the two most similar topics are merged into a new cluster, and then the most similar topics or clusters are merged repeatedly until all topics are integrated into a single hierarchical structure. This merging can be implemented with a variety of linkage functions, including but not limited to single linkage, complete linkage and average linkage. The generated topic hierarchy is used to construct a relationship tree, which is used to present the topic cluster view in step 11. By organizing the topics hierarchically, the semantic relationships between topics can be displayed more intuitively. This structure not only helps the user quickly identify the core topics and sub-topics related to the keywords, but also makes the associations among complex information clearer. In this way, the user can screen projects more efficiently at the semantic level, and the transparency and accuracy of the search results are improved.
Step 8, matching the query statement with the projects:
The query statement input by the user is searched over each cluster obtained in step 7 using Elasticsearch. The returned result comprises all hit open source projects, every field in a hit project that matches the query statement, and the relevance score corresponding to each matched field; each field comprises a field name and the corresponding field value, and the content in the field value that matches the query terms is highlighted.
For each hit open source project, the field with the highest relevance score among all matched fields is selected as the optimal field. Because different orderings of the query terms may convey different semantic information, the query statement input by the user is decomposed into a plurality of independent query terms, and the highlighted content in the optimal field is matched against these query terms one by one to obtain the semantic order of the optimal field.
Step 9, analyzing the source code of all hit open source projects:
For each hit open source project, Esprima is first used to parse the source code files and generate the corresponding abstract syntax tree (AST). By traversing the abstract syntax tree, key identifiers including folder names, file names, function names and variable names are extracted, and the occurrence frequency of these identifiers is counted. In addition, the method may further compute similarities between key identifiers in order to merge some of them; however, given the complexity and time consumption of the similarity computation, this merging is not performed by default. A word cloud is generated based on the parsed key identifiers and their frequency information and is displayed in the project detail view of step 11. The generated word cloud helps the user quickly identify key identifiers and focus on high-frequency content, so that the user can intuitively understand the important information of the retrieved project, significantly improving the efficiency of project understanding and analysis.
Step 10, recommending open source projects related and similar to the hit open source projects:
All the hit open source projects are taken as a recommendation list, and other open source projects that are related and similar to the hit projects are added to the recommendation list to obtain the final recommendation list.
The recommendation list is used in the project detail view presentation of step 11, providing the user with more projects of potential interest.
Step 11, visually displaying all the results of steps 1 to 10.
The retrieved data are visualized from multiple perspectives, wherein the project data are displayed through visualization techniques; the data involved include, but are not limited to, the basic information, topics and source code of the project. The visualization adopts multiple views to display the data, and realizes dynamic interaction between the user and the search results through an interactive interface to support in-depth exploration and analysis of the search results by the user.
Further, in step 8, obtaining the semantic order of the optimal field comprises:
Step 8-1, labeling all query terms in the query statement from left to right, and labeling the highlighted content in the optimal field corresponding to each query term with the same label as that query term.
Step 8-2, the optimal field comprises N sentences, N ≥ 1; the labels of all query terms in each sentence form an array, and repeated labels in each array are removed.
Step 8-3, concatenating the N arrays to obtain a one-dimensional array of labels; when the elements in the one-dimensional array repeat periodically from left to right, the elements of the first period are retained and the elements of the other periods are removed, yielding the semantic order of the optimal field of the query statement in the hit open source project.
The relevance evaluation of the retrieved projects is thereby further refined, and the accuracy of the search results is significantly improved.
For example, suppose the query statement is "machine learning ai"; the three query terms are labeled 0, 1 and 2 respectively. If the optimal field of a hit open source project is readme and contains 4 sentences, the highlighting process is as shown in fig. 4. curIndex = query_arr.index(curItem) is used to find the index position of a term in the query term array (query_arr = ["machine", "learning", "ai"]).
Assuming the query term array is query_arr = ["machine", "learning", "ai"], then:
curItem = "machine", curIndex = 0
curItem = "learning", curIndex = 1
curItem = "ai", curIndex = 2
For the 4 sentences of the optimal field of the hit open source project, the labels of all query terms in each sentence form an array, giving four arrays, namely [0,1], [2,2], [0,1,2] and [0,1,2]; removing repeated labels within each array gives [0,1], [2], [0,1,2] and [0,1,2]. Concatenating the four arrays yields the one-dimensional array [0,1,2,0,1,2,0,1,2]; periodic repetition appears from left to right starting at the fourth element, so the repeated second period [0,1,2] and third period [0,1,2] are removed and only the first period [0,1,2] is retained.
Further, in step 10, obtaining the final recommendation list comprises:
Step 10-1, searching with Elasticsearch for other open source projects of the owner or of the contributors of a hit open source project. If the number of code lines contributed by a contributor to the hit project exceeds a set threshold, that contributor is included in the contributor search list; if the contribution line counts of the contributors are below the threshold, the top three contributors are included in the contributor search list by default. Elasticsearch is then used to search for other open source projects of the contributors in the list, and the retrieved other open source projects of the owner or of the contributors are defined as related open source projects.
Step 10-2, the project name, description and README text of each related open source project are processed (the processing here includes tokenization, stop-word removal, abbreviation substitution, stemming and lemmatization) and merged into a string list, wherein the index of the string list represents the number of each project and the list element represents the key information of that project.
The project name, description and README text of each hit open source project are processed and combined into a string list in the same way.
Step 10-3, computing the cosine similarity between the hit open source project and the related open source projects, sorting in descending order of cosine similarity, and adding the top ten related open source projects to the recommendation list to obtain the final recommendation list.
Further, in step 11, the visual presentation comprises five components: a basic interface component, a topic overview view component, a topic cluster view component, a project detail view component and a global control panel.
The basic interface component comprises three sub-components, namely a search bar, a result filtering and sorting component and a view switching component, wherein the search bar is used for inputting keywords or query statements and starting the search operation;
The topic overview component is used for displaying the topic tree of all open source projects in the final recommended project list, and further comprises a control module for adjusting parameters.
The topic cluster view component is used for graphically displaying the clustering results and showing the usage frequency of open source projects in the clustering results.
The project detail view component comprises three parts, namely basic project information, a visualization of the source code analysis results, and a project recommendation table. The basic project information shows the usage frequency of folder names, file names, function names and variable names; the visualization of the source code analysis results shows the source code of the hit open source projects; and the project recommendation table shows the final recommendation list;
The global control panel component is used for displaying the popularity indices, update times and matching order of the open source projects in the final recommendation list.
Embodiment 1. This embodiment discloses a visual retrieval method for open source projects, which comprises the following steps:
Step 1, acquiring open source projects from an open source community:
Key information is extracted for each open source project, including the project name, author information, number of watchers, number of stars, number of forks, project description, project size, functional images, license information, contributor information, programming languages and number of lines of code, as well as the page link and source code link of the project. For example, this embodiment selected GitHub as the data source and cloned JavaScript projects created between January 2010 and December 2023, a total of 74,103 projects, from which the key information was extracted.
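As an illustration only, the following minimal sketch shows how such key information could be collected through the public GitHub REST API with the Python requests library; the endpoints and response fields follow the GitHub API, while the layout of the returned dictionary and the helper name are assumptions of this sketch rather than part of the claimed method.

```python
# Minimal sketch (assumption): collecting key project information via the GitHub REST API.
import requests

API = "https://api.github.com"

def fetch_key_info(owner: str, repo: str, token: str = "") -> dict:
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"

    meta = requests.get(f"{API}/repos/{owner}/{repo}", headers=headers).json()
    languages = requests.get(f"{API}/repos/{owner}/{repo}/languages", headers=headers).json()
    contributors = requests.get(f"{API}/repos/{owner}/{repo}/contributors", headers=headers).json()

    return {
        "name": meta["name"],
        "author": meta["owner"]["login"],
        "watchers": meta.get("subscribers_count"),   # users watching the repository
        "stars": meta["stargazers_count"],
        "forks": meta["forks_count"],
        "description": meta.get("description"),
        "size": meta["size"],                        # repository size in KB
        "license": (meta.get("license") or {}).get("spdx_id"),
        "contributors": [c["login"] for c in contributors],
        "languages": languages,                      # mapping language -> bytes of code
        "page_link": meta["html_url"],
        "source_link": meta["clone_url"],
    }
```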
Step 2, parsing and organizing library dependencies:
For each open source project, the corresponding package.json file and lib/libs folder are parsed to extract the project's dependency information. By splitting the prefixes of the dependency names, a hierarchy with parent-child node relationships is extracted. For example, "@eslint/plugin-react-hooks" and "@eslint/plugin-react-router" share the prefix @eslint and are therefore children of the @eslint node. The extracted dependencies and their hierarchy are written into a predefined techs field for subsequent index building.
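A minimal sketch of this dependency extraction is given below, assuming dependencies are read from package.json, version numbers are stripped, and names are grouped by a shared scope or prefix to form parent-child nodes; the function and regular expression are illustrative, not part of the claimed method.

```python
# Minimal sketch (assumption): extracting dependencies from package.json and
# building a parent-child hierarchy by splitting name prefixes.
import json
import re
from collections import defaultdict
from pathlib import Path

def extract_dependency_tree(project_dir: str) -> dict:
    pkg = json.loads(Path(project_dir, "package.json").read_text(encoding="utf-8"))
    names = list(pkg.get("dependencies", {})) + list(pkg.get("devDependencies", {}))

    # Remove version numbers that sometimes appear in extracted library names,
    # e.g. "lodash-4.17.21" -> "lodash", to reduce data redundancy.
    names = [re.sub(r"[-@]?\d+(\.\d+)*$", "", n) for n in names]

    tree = defaultdict(list)
    for name in names:
        # "@eslint/plugin-x" -> parent "@eslint"; "eslint-plugin-x" -> parent "eslint".
        prefix = name.split("/")[0] if "/" in name else name.split("-")[0]
        tree[prefix].append(name)

    # The mapping {parent: [children]} is what gets written into the "techs" field.
    return dict(tree)
```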
Step 3, extracting and classifying README files:
The README file provided by each open source project describes the functions of the project. The content of each README file is classified into eight categories, namely What, Why, How, When, Who, References, Contribution and Other, using the multi-label classifier READMEClassifier, and the text classified into the What and Why categories is extracted as functional text;
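The READMEClassifier interface itself is not reproduced here; the sketch below is a hypothetical stand-in that splits a README into header-delimited sections and hands each one to a multi-label classifier, keeping only the What and Why text as functional text. The classify_sections callable is a placeholder standing in for the actual classifier.

```python
# Hypothetical sketch: splitting a README into sections and keeping the text
# labelled "What" or "Why" as functional text. classify_sections() is a
# placeholder for the multi-label README classifier used by the method.
import re
from typing import Callable

CATEGORIES = ["What", "Why", "How", "When", "Who", "References", "Contribution", "Other"]

def split_sections(readme: str) -> list[str]:
    # Split on Markdown headers; each section is classified independently.
    return [s.strip() for s in re.split(r"^#{1,6}\s+", readme, flags=re.M) if s.strip()]

def extract_functional_text(readme: str,
                            classify_sections: Callable[[list[str]], list[list[str]]]) -> str:
    sections = split_sections(readme)
    labels = classify_sections(sections)   # multi-label: each section -> subset of CATEGORIES
    functional = [sec for sec, labs in zip(sections, labels)
                  if "What" in labs or "Why" in labs]
    return "\n\n".join(functional)
```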
step 4, extracting a functional image and adding an availability index:
The pictures in the functional text typically show the function and execution results of the items, so the links of the functional images of each open source item are extracted using BeautifulSoup libraries, the extracted image links will be written into predefined { images } fields, the availability of the open source item is defined as whether a How category exists in the README file, if a How category exists, indicating that the item has a higher availability, a new field { hasHow: true }, otherwise { hasHow: false }, is added. Both the { images } and { hasHow } fields will be used for the creation of subsequent indexes.
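A minimal sketch of this step is shown below; it assumes the README (or its functional text) is available as HTML so that BeautifulSoup can collect the <img> links, and it packages the result into the {images} and {hasHow} fields described above.

```python
# Minimal sketch (assumption): extracting functional image links with BeautifulSoup
# and adding the availability indicator hasHow.
from bs4 import BeautifulSoup

def image_and_availability_fields(readme_html: str, categories: list[str]) -> dict:
    soup = BeautifulSoup(readme_html, "html.parser")
    image_links = [img["src"] for img in soup.find_all("img") if img.get("src")]
    return {
        "images": image_links,          # written into the predefined {images} field
        "hasHow": "How" in categories,  # True if a How category exists in the README
    }
```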
Step 5, extracting topics from project information using BERTopic:
The key information of each project (including the project name, description and README text) is combined and processed, and a BERTopic model is trained on the combined information to extract the topic of each project. The topics extracted by this process are recorded in a predefined {topics} field for use in subsequent index building and hierarchical topic modeling.
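A minimal sketch of the topic extraction is given below, using the public BERTopic API (fit_transform and get_topic); the shape of the per-project topic record is an assumption of this sketch.

```python
# Minimal sketch (assumption): training BERTopic on the combined project texts and
# recording each project's topic keywords and weights for the {topics} field.
from bertopic import BERTopic

def extract_topics(combined_texts: list[str]):
    topic_model = BERTopic()                          # default embedding/clustering settings
    topic_ids, _ = topic_model.fit_transform(combined_texts)

    project_topics = []
    for topic_id in topic_ids:
        # get_topic() returns a list of (keyword, weight) pairs for the topic.
        keywords = topic_model.get_topic(topic_id) or []
        project_topics.append({
            "topic_id": topic_id,
            "keywords": [w for w, _ in keywords],
            "weights": [float(s) for _, s in keywords],
        })
    return topic_model, project_topics
```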
Step 6, building an index:
All the information extracted in the foregoing steps 1 to 5 is fed into Elasticsearch in a unified manner, and an index is built for each project.
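A minimal sketch of the index building with the official elasticsearch-py client (8.x style) follows; the index name and connection URL are placeholders.

```python
# Minimal sketch (assumption): writing one document per project into Elasticsearch.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")          # placeholder connection

def index_projects(projects: list[dict], index_name: str = "oss_projects") -> None:
    for i, project in enumerate(projects):
        # `project` merges the fields from steps 1-5: key info, techs, images, hasHow, topics.
        es.index(index=index_name, id=i, document=project)
```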
Step 7, hierarchical topic modeling: classifying open source projects with the same topic into the same topic class, clustering the topic classes with a clustering method, merging two topic classes whose distance is smaller than a preset threshold into one cluster, and considering the clustering complete when the topic classes in each cluster no longer change.
Because the generated topics may be related, a hierarchical relationship structure among all topics is further constructed based on the topic-word matrix. The hierarchy is built by computing the distance between topics, for example using a measure such as cosine similarity. Starting from each individual topic, the two most similar topics are merged into a new cluster, and then the most similar topics or clusters are merged repeatedly until all topics are integrated into a single hierarchical structure. This merging can be implemented with a variety of linkage functions, including but not limited to single linkage, complete linkage and average linkage. The generated topic hierarchy is used to construct a relationship tree, which is used to present the topic cluster view in step 11. By organizing the topics hierarchically, the semantic relationships between topics can be displayed more intuitively. This structure not only helps the user quickly identify the core topics and sub-topics related to the keywords, but also makes the associations among complex information clearer. In this way, the user can screen projects more efficiently at the semantic level, and the transparency and accuracy of the search results are improved.
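A minimal sketch of this hierarchy construction with SciPy's agglomerative linkage is shown below, using cosine distance between rows of the topic-word matrix and average linkage as one of the linkage functions named above; it assumes the topic-word weights are available as a dense matrix.

```python
# Minimal sketch (assumption): building the topic hierarchy by agglomerative
# clustering of the topic-word matrix with cosine distance.
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import pdist

def build_topic_hierarchy(topic_word_matrix: np.ndarray, method: str = "average"):
    # Pairwise cosine distances between topics; method can be "single",
    # "complete" or "average" (the linkage functions mentioned above).
    distances = pdist(topic_word_matrix, metric="cosine")
    Z = linkage(distances, method=method)
    # to_tree() converts the linkage matrix into a binary relationship tree whose
    # leaves are topics; this tree backs the topic view presented in step 11.
    return to_tree(Z)
```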
Step 8, matching the query statement with the projects:
The query statement input by the user is searched over each cluster obtained in step 7 using Elasticsearch. The returned result comprises all hit open source projects, every field in a hit project that matches the query statement, and the relevance score corresponding to each matched field; each field comprises a field name and the corresponding field value, and the content in the field value that matches the query terms is highlighted.
For each hit open source project, the field with the highest relevance score among all matched fields is selected as the optimal field. Because different orderings of the query terms may convey different semantic information, the query statement input by the user is decomposed into a plurality of independent query terms, and the highlighted content in the optimal field is matched against these query terms one by one to obtain the semantic order of the optimal field.
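A minimal sketch of this query step with elasticsearch-py follows: a multi_match query over all fields with highlighting enabled. The index name and field wildcard are placeholders, and the per-field relevance ranking described above is approximated here by the number of highlighted fragments per field, which is an assumption of this sketch.

```python
# Minimal sketch (assumption): searching the query statement with highlighting and
# selecting a best-matching field as the optimal field.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")           # placeholder connection

def search_with_highlight(query: str, index_name: str = "oss_projects"):
    resp = es.search(
        index=index_name,
        query={"multi_match": {"query": query, "fields": ["*"]}},
        highlight={"fields": {"*": {}}},              # highlight matched terms in every field
    )
    hits = []
    for hit in resp["hits"]["hits"]:
        highlights = hit.get("highlight", {})
        if not highlights:
            continue
        # Stand-in for the per-field relevance score: prefer the field with the
        # most highlighted fragments.
        optimal_field = max(highlights, key=lambda f: len(highlights[f]))
        hits.append({"id": hit["_id"], "score": hit["_score"],
                     "optimal_field": optimal_field,
                     "highlighted": highlights[optimal_field]})
    return hits
```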
Further, in step 8, obtaining the semantic order of the optimal field comprises:
Step 8-1, labeling all query terms in the query statement from left to right, and labeling the highlighted content in the optimal field corresponding to each query term with the same label as that query term.
Step 8-2, the optimal field comprises N sentences, N ≥ 1; the labels of all query terms in each sentence form an array, and repeated labels in each array are removed.
Step 8-3, concatenating the N arrays to obtain a one-dimensional array of labels; when the elements in the one-dimensional array repeat periodically from left to right, the elements of the first period are retained and the elements of the other periods are removed, yielding the semantic order of the optimal field of the query statement in the hit open source project.
The relevance evaluation of the retrieved projects is thereby further refined, and the accuracy of the search results is significantly improved.
For example, suppose the query statement is "machine learning ai"; the three query terms are labeled 0, 1 and 2 respectively. If the optimal field of a hit open source project is readme and contains 4 sentences, the highlighting process is as shown in fig. 4. curIndex = query_arr.index(curItem) is used to find the index position of a term in the query term array (query_arr = ["machine", "learning", "ai"]).
Assuming the query term array is query_arr = ["machine", "learning", "ai"], then:
curItem = "machine", curIndex = 0
curItem = "learning", curIndex = 1
curItem = "ai", curIndex = 2
For the 4 sentences of the optimal field of the hit open source project, the labels of all query terms in each sentence form an array, giving four arrays, namely [0,1], [2,2], [0,1,2] and [0,1,2]; removing repeated labels within each array gives [0,1], [2], [0,1,2] and [0,1,2]. Concatenating the four arrays yields the one-dimensional array [0,1,2,0,1,2,0,1,2]; periodic repetition appears from left to right starting at the fourth element, so the repeated second period [0,1,2] and third period [0,1,2] are removed and only the first period [0,1,2] is retained.
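The example above can be reproduced with the short sketch below, which labels the highlighted terms sentence by sentence, removes repeated labels within a sentence, concatenates the arrays and keeps only the first period; it assumes the highlighted terms of each sentence are already available as lists of words.

```python
# Minimal sketch: computing the semantic order of the optimal field (steps 8-1 to 8-3).
def semantic_order(query_terms: list[str], highlighted_sentences: list[list[str]]) -> list[int]:
    query_arr = [t.lower() for t in query_terms]

    # Steps 8-1/8-2: label highlighted terms per sentence, dropping repeats within a sentence.
    per_sentence = []
    for sentence_terms in highlighted_sentences:
        labels = []
        for term in sentence_terms:
            cur_index = query_arr.index(term.lower())
            if cur_index not in labels:
                labels.append(cur_index)
        per_sentence.append(labels)

    # Step 8-3: concatenate the arrays, then keep only the first period if the
    # sequence repeats periodically from left to right.
    flat = [label for labels in per_sentence for label in labels]
    for period in range(1, len(flat)):
        if len(flat) % period == 0 and flat == flat[:period] * (len(flat) // period):
            return flat[:period]
    return flat

# Example from the text: query "machine learning ai" over four readme sentences.
print(semantic_order(["machine", "learning", "ai"],
                     [["machine", "learning"], ["ai", "ai"],
                      ["machine", "learning", "ai"], ["machine", "learning", "ai"]]))
# -> [0, 1, 2]
```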
Step 9, analyzing the source code of all hit open source projects:
For each hit open source project, Esprima is first used to parse the source code files and generate the corresponding abstract syntax tree (AST). By traversing the abstract syntax tree, key identifiers including folder names, file names, function names and variable names are extracted, and the occurrence frequency of these identifiers is counted. In addition, the method may further compute similarities between key identifiers in order to merge some of them; however, given the complexity and time consumption of the similarity computation, this merging is not performed by default. A word cloud is generated based on the parsed key identifiers and their frequency information and is displayed in the project detail view of step 11. The generated word cloud helps the user quickly identify key identifiers and focus on high-frequency content, so that the user can intuitively understand the important information of the retrieved project, significantly improving the efficiency of project understanding and analysis.
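A minimal sketch of the source code analysis is shown below, assuming the Python port of Esprima (pip install esprima) and the wordcloud package; the AST is walked recursively and only function-declaration and variable-declarator names are counted, which is a simplification of the full identifier extraction described above.

```python
# Minimal sketch (assumption): parsing JavaScript with the Python esprima port,
# counting function/variable identifiers, and rendering them as a word cloud.
from collections import Counter

import esprima
from wordcloud import WordCloud

def count_identifiers(source: str) -> Counter:
    counts = Counter()

    def walk(node):
        if isinstance(node, list):
            for child in node:
                walk(child)
            return
        if not hasattr(node, "type"):
            return
        if node.type in ("FunctionDeclaration", "VariableDeclarator"):
            name = getattr(getattr(node, "id", None), "name", None)
            if name:
                counts[name] += 1
        # Recurse into every child attribute of the AST node.
        for value in vars(node).values():
            if isinstance(value, list) or hasattr(value, "type"):
                walk(value)

    walk(esprima.parseModule(source))
    return counts

def identifier_word_cloud(counts: Counter) -> WordCloud:
    # Frequencies drive the font size of each identifier in the word cloud of step 11.
    return WordCloud(width=600, height=400).generate_from_frequencies(dict(counts))
```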
Step 10, recommending open source projects related and similar to the hit open source projects:
All the hit open source projects are taken as a recommendation list, and other open source projects that are related and similar to the hit projects are added to the recommendation list to obtain the final recommendation list.
The recommendation list is used in the project detail view presentation of step 11, providing the user with more projects of potential interest.
Further, in step 10, obtaining the final recommendation list comprises:
Step 10-1, searching with Elasticsearch for other open source projects of the owner or of the contributors of a hit open source project. If the number of code lines contributed by a contributor to the hit project exceeds a set threshold, that contributor is included in the contributor search list; if the contribution line counts of the contributors are below the threshold, the top three contributors are included in the contributor search list by default. Elasticsearch is then used to search for other open source projects of the contributors in the list, and the retrieved other open source projects of the owner or of the contributors are defined as related open source projects.
Step 10-2, the project name, description and README text of each related open source project are processed (the processing here includes tokenization, stop-word removal, abbreviation substitution, stemming and lemmatization) and merged into a string list, wherein the index of the string list represents the number of each project and the list element represents the key information of that project.
The project name, description and README text of each hit open source project are processed and combined into a string list in the same way.
Step 10-3, computing the cosine similarity between the hit open source project and the related open source projects, sorting in descending order of cosine similarity, and adding the top ten related open source projects to the recommendation list to obtain the final recommendation list.
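A minimal sketch of steps 10-2 and 10-3 with scikit-learn follows: the processed strings of the hit project and the related projects are vectorized with TF-IDF and ranked by cosine similarity, and the ten most similar related projects are returned. The preprocessing itself (tokenization, stop-word removal, stemming, lemmatization) is assumed to happen upstream.

```python
# Minimal sketch (assumption): ranking related projects by cosine similarity of
# their TF-IDF vectors and keeping the top ten for the final recommendation list.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_ten_similar(hit_text: str, related_texts: list[str], k: int = 10) -> list[int]:
    # related_texts[i] is the merged name + description + README string of related project i.
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([hit_text] + related_texts)

    # Similarity of the hit project (row 0) to every related project (rows 1..n).
    sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
    ranked = sims.argsort()[::-1][:k]
    return ranked.tolist()               # indices into related_texts, most similar first
```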
Step 11, visualizing the retrieved project data from multiple perspectives:
The project data are presented through visualization techniques; the data involved include, but are not limited to, the basic information, topic information, dependency information and source code of the project. The visualization adopts multiple views to display the data, and realizes dynamic interaction between the user and the search results through an interactive interface to support in-depth exploration and analysis of the search results by the user.
As shown in fig. 1, the system of the present invention consists of two main parts comprising five components, and provides the user with a complete retrieval paradigm consisting of four stages: search, explore, check and recommend. The first part (portions A1, A2, A3, B1, B2, C3, C4 and E8 in fig. 1) supports exploring the search results from multiple angles, while the second part (fig. 1D) provides a detailed analysis of a single project.
Basic interface: this component is the core entry of the system and mainly comprises three sub-components: the search bar (part A1 in fig. 1), result filtering and sorting (part A2 in fig. 1), and view switching (part A3 in fig. 1). The user may enter keywords or query statements in the search bar to initiate a retrieval operation. A "Search in result" option is provided, allowing the user to perform a secondary search within the current search results. Through result filtering and sorting (part A2 in fig. 1), the user can select and display all or some of the projects under a child node of the topic tree and sort the results by indices such as star and fork counts. The basic information presented for each project includes the project name, description, labels and related indices (such as stars, forks and watchers). When the user clicks a project, its detailed information is displayed and the corresponding project detail view is expanded. When the user clicks a project in the A2 sub-component, the project detail view is expanded automatically, and the user can collapse and expand the project detail view by clicking the A3 button, improving browsing efficiency;
Topic overview view: this component displays the topics mined from the retrieved projects, including the topic tree (part B1 in fig. 1) and the control module (part B2 in fig. 1). The topic tree shows all topics and their hierarchical relationships. The user may click a node to select a topic of interest. The control module provides options for adjusting the topic modeling layout, style and parameters. The user can dynamically adjust these settings according to the data size to optimize the topics and make the view clearer and easier to understand. This view improves the transparency of the results from a semantic perspective and enhances the user's understanding of the search results.
Topic cluster view: after selecting a topic, the user can view the projects under the selected topic. The cluster chart (part C4 in fig. 1) shows the popularity indices (stars, forks, watchers) of each project. The user can quickly identify the more important projects based on color intensity. The view also provides a variety of filtering options, allowing dynamic selection of the presented data. The technology statistics chart (part C3 in fig. 1) shows the usage frequency of all technologies used by the projects in the current cluster. The user may specify a particular technology or exclude unrelated technologies, and the cluster chart adds or removes projects according to the filtering criteria.
Project detail view: this component aims to improve the efficiency with which the user understands a project's source code and locates important files, and comprises three parts: basic project information (part D1 in fig. 1), a visualization of the source code analysis results (part D2 in fig. 1) and a project recommendation table (part D3 in fig. 1). In part D2 of fig. 1, the chart on the left visualizes the usage frequency of the four identifier types: folder names, file names, function names and variable names. After clicking a name, the file structure tree in the middle shows the location of the corresponding file. After further selecting the file, the user may view its source code.
Global control panel: to provide more flexible and diverse interactions for the user, the system of the invention supports a variety of globally controlled filtering options, including the three popularity metrics (stars, forks, watchers), project update time, matching order, and technology-related information such as dependencies, availability and licenses. The logical relationship between these filtering options is "AND".
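As an illustration only, the "AND" relationship between the global filters could be expressed as an Elasticsearch bool query combining range filters on popularity and update time with term filters on technology-related fields; the field names ("stars", "updated_at", "techs", "hasHow") and thresholds in this sketch are placeholders, not fields fixed by the method.

```python
# Minimal sketch (assumption): combining the global control panel filters with
# AND semantics using an Elasticsearch bool query.
def global_filter_query(min_stars: int, updated_after: str,
                        required_tech: str, require_how: bool) -> dict:
    return {
        "bool": {
            "filter": [                                  # every clause must hold (AND)
                {"range": {"stars": {"gte": min_stars}}},
                {"range": {"updated_at": {"gte": updated_after}}},
                {"term": {"techs": required_tech}},      # dependency/technology filter
                {"term": {"hasHow": require_how}},       # availability filter
            ]
        }
    }

# Example: stars >= 30, updated since 2023-01-01, depends on "react", README has a How section.
query = global_filter_query(30, "2023-01-01", "react", True)
```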
To verify the effectiveness of the method, a user study was conducted with 12 developers experienced in JavaScript programming and project retrieval. They were first divided evenly into two groups (G1 and G2), ensuring that the experience of the two groups was comparable. G1 used the method of the invention and G2 used GitHub's search engine. To suppress the impact of subjective bias, each participant was required to complete the same project retrieval task (satellite visualization). Each participant was required to record the completion time of 8 objective questions measuring the effectiveness of the system of the invention in terms of result transparency, source code understanding and result expansion. Finally, the participants were interviewed one by one to collect feedback.
Q1: How many projects have a number of stars or watchers exceeding 30?
Q2: How many projects visualize weather data?
Q3: Find five projects based on the React framework.
Q4: Find five projects based on Three.js.
Q5: Find five projects that provide examples in their README files.
Q6: Which satellite-related functions are implemented by the project with the highest number of stars?
Q7: Which metrics are calculated by the satellite performance component of the project with the highest number of stars?
Q8: Find two visualization projects related to satellite tracking.
Figure 3 shows the average performance of G1 and G2 on each question. Both groups achieved high accuracy on the explicit questions; however, G1 performed slightly better on the questions that required further exploration (e.g., Q2, Q5 and Q8). Meanwhile, the time consumption of G1 was significantly lower than that of G2: G1 outperformed G2 on all questions, and on complex questions in particular the time consumption of G2 increased markedly while that of G1 remained stable. On Q2, Q3 and Q4, G2 took nearly 200 seconds while G1 took only about 50 seconds, which fully demonstrates the clear advantage of the method on high-complexity tasks: G1 was more efficient and better able to handle complex scenarios. This indicates that the system effectively helps users retrieve projects. In the interviews, the participants agreed that the system enhances understanding of the results by visualizing topics, making the exploration process more targeted and easier to use. The library dependency relationships and the matching order eliminate unnecessary project inspection and improve retrieval efficiency.