Disclosure of Invention
The invention aims to provide an archive classification management system based on big data that solves the problems identified in the background art.
To achieve this aim, the invention provides the following technical scheme: an archive classification management system based on big data, comprising an archive data acquisition module, a data storage module, an archive classification module, a retrieval recommendation module and a visualization module;
the archive data acquisition module is used for acquiring archive data from a data source;
the data storage module is used for storing the data acquired by the archive data acquisition module, and comprises a layered storage unit which divides the archive data into cold archives and hot archives and stores them on different storage media;
the archive classification module acquires metadata of the archive data by using a metadata collection tool, generates a label for each item of archive data, the label comprising time, keywords, data source and category, and classifies the archive data according to the labels;
the retrieval recommendation module comprises a result recommendation unit and a correlation analysis unit, wherein the result recommendation unit provides specific archives according to the content input by a user, and the correlation analysis unit compares the archives provided by the result recommendation unit with other archives and provides related archives to the user according to the comparison results.
Optionally, the archive data includes paper archives and electronic archives, and the electronic archives include audio and video data. The archive data acquisition module includes an optical recognition unit and an audio conversion unit. The optical recognition unit converts paper archives into an image format and extracts the text content from the images. The audio conversion unit converts the audio and video data, extracts text data from it, and digitizes it in combination with video transcoding technology.
Optionally, the layered storage unit divides the archive data as follows:
;
wherein A is the heat score;
F is the archive access frequency, that is, the number of accesses per unit time in the past;
α is the access frequency influence coefficient, with a value range of 0 to 1;
t is the time interval since the last access;
β is the access time interval influence coefficient, with a value range of 0 to 1;
Z is the importance weight of the archive, with a value range of 0 to 1;
A higher heat score A indicates that the archive data is used more frequently, and a lower score indicates that it is used less. The classification threshold of the heat score A is set to Y1: when A is greater than Y1 the archive is a hot archive, and when A is less than Y1 it is a cold archive. Hot archives are moved to a high-performance storage medium to improve access speed and response time, and cold archives are moved to a low-cost storage medium to reduce storage cost.
Optionally, the result recommendation unit computes a candidate score as follows:
$$S(D_1, G) = w_1 R(D_1, G) + w_2 H(D_1) + w_3 C(D_1)$$
wherein G represents the user query content and S represents the candidate score;
$S(D_1, G)$ represents the candidate score of archive one for the user query content G;
$R(D_1, G)$ represents the similarity score between the user query content and archive one;
$w_1$ represents the influence coefficient of $R(D_1, G)$, with a value range of 0 to 1;
$H(D_1)$ represents the historical access frequency of archive one;
$w_2$ represents the influence coefficient of $H(D_1)$, with a value range of 0 to 1;
$C(D_1)$ represents the timeliness of archive one;
$w_3$ represents the influence coefficient of $C(D_1)$, with a value range of 0 to 1;
The historical access frequency $H(D_1)$ of archive one is derived as follows:
$$H(D_1) = \sum_{p \in U} m_p \, M_p(D_1)$$
wherein U represents the user set;
$M_p(D_1)$ represents the number of times user p has accessed archive one;
$m_p$ represents the influence coefficient of user p, with a value range of 0 to 1, and reflects the importance of different users;
The larger $S(D_1, G)$ is, the stronger the relevance between archive one and the user query content. When a user queries the content G while using the system, the search result column automatically recommends the archive with the highest score, which ensures query precision. The user can adjust $w_1$, $w_2$ and $w_3$ as required, which improves query flexibility.
Optionally, the correlation analysis unit performs its analysis as follows:
$$R(D_1, D_2) = \frac{\sum_{i=1}^{n} \min\left(F_i(D_1), F_i(D_2)\right)}{\max\left(\sum_{i=1}^{n} F_i(D_1), \sum_{i=1}^{n} F_i(D_2)\right)}$$
wherein $R(D_1, D_2)$ represents the similarity score of archive one and archive two;
$F_i(D_1)$ represents the occurrence frequency of the i-th keyword in archive one;
$F_i(D_2)$ represents the occurrence frequency of the i-th keyword in archive two;
n represents the number of keywords;
$\sum_{i=1}^{n} \min\left(F_i(D_1), F_i(D_2)\right)$ represents the sum of the minimum shared feature frequencies between archive one and archive two; it measures how much the two archives overlap on each keyword by taking the smaller of the two frequencies for that keyword;
$\max\left(\sum_{i=1}^{n} F_i(D_1), \sum_{i=1}^{n} F_i(D_2)\right)$ represents the larger of the total feature counts of archive one and archive two;
The similarity score is obtained by calculating the degree of overlap of the shared features of the two archives and normalizing it by the larger of their total feature counts. The correlation analysis unit thereby determines how similar two archives are in their keywords, which helps to find similar archives and to compare archive contents. After the result recommendation unit recommends the archive with the highest candidate score S to the user, the correlation analysis unit scores the similarity between that archive and the other archives. The scoring threshold of $R(D_1, D_2)$ is set to Y2: when $R(D_1, D_2)$ is greater than Y2, the two archives are highly similar and the compared archive is added to the search result column; when $R(D_1, D_2)$ is less than Y2, it is not added.
Optionally, the visualization module provides an administrator with monitoring data on system operation, the monitoring data comprising archive storage volume, access frequency and system hardware operating parameters.
Optionally, the high performance storage medium comprises a solid state disk and the low cost storage medium comprises a mechanical hard disk.
Optionally, the metadata collection tool is Apache Tika.
Compared with the prior art, the invention has the following beneficial effects:
1. After the archive data is classified, when a user searches, the result recommendation unit in the retrieval recommendation module recommends archives according to the content input by the user. The result recommendation unit takes into account influence factors such as historical access frequency, timeliness and similarity to the user input, so the archive data is analyzed comprehensively, and the user can adjust these factors as needed in actual operation. This improves retrieval precision and the quality of use of the archive classification management system. The correlation analysis unit then compares the archive recommended by the result recommendation unit with other archives and provides related archives to the user according to the comparison results, so that the user can view related archives without repeated searching. This improves the intelligence level and the user experience of the archive classification management system.
2. When the archive data is stored, the layered storage unit in the data storage module divides it into hot archives and cold archives according to factors such as access frequency and importance. Hot archives are moved to a high-performance storage medium to improve access speed and response time, and cold archives are moved to a low-cost storage medium to reduce storage cost. Storage devices are thus allocated according to the actual state of the archive data, which maintains system efficiency while reducing the cost of storage devices and the construction cost of the archive classification management system.
Detailed Description
The technical schemes in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The embodiments described are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
Referring to Fig. 1, the present embodiment provides an archive classification management system based on big data, comprising an archive data acquisition module, a data storage module, an archive classification module, a retrieval recommendation module and a visualization module;
the archive data acquisition module is used for acquiring archive data from a data source;
the data storage module is used for storing the data acquired by the archive data acquisition module, and comprises a layered storage unit which divides the archive data into cold archives and hot archives and stores them on different storage media;
the archive classification module acquires metadata of the archive data by using a metadata collection tool, generates a label for each item of archive data, the label comprising time, keywords, data source and category, and classifies the archive data according to the labels;
the retrieval recommendation module comprises a result recommendation unit and a correlation analysis unit, wherein the result recommendation unit provides specific archives according to the content input by the user, and the correlation analysis unit compares the archives provided by the result recommendation unit with other archives and provides related archives to the user according to the comparison results.
More specifically, in this embodiment the archive data acquisition module first acquires archive data from a data source. When the data storage module stores the archive data, the layered storage unit divides it into hot archives and cold archives according to factors such as access frequency and importance, and stores them on different storage media. Storage devices are thus allocated according to the actual state of the archive data, which maintains system efficiency while reducing the cost of storage devices and the construction cost of the archive classification management system.
Next, the archive classification module acquires metadata of the archive data by using the metadata collection tool, generates a label for each item of archive data, and classifies the archive data according to the labels. When searching, the user can enter free-form content in the search column or enter archive label information directly. The result recommendation unit in the retrieval recommendation module recommends the archive that best matches the user input. It takes into account influence factors such as historical access frequency, timeliness and similarity to the user input, so the archive data is analyzed comprehensively, and the user can adjust these factors as needed in actual operation. This improves retrieval precision, meets user needs and improves the quality of use of the archive classification management system. The correlation analysis unit then compares the archive recommended by the result recommendation unit with other archives and provides related archives to the user according to the comparison results, so that the user can view related archives without repeated searching. This improves the intelligence level and the user experience of the archive classification management system.
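The label-based classification and label search described above can be sketched as follows. This is a minimal illustration only: the names `ArchiveLabel`, `classify_by_category` and `search_by_keyword` are hypothetical, the label fields follow the four items named in the text (time, keywords, data source, category), and the date format is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveLabel:
    """Hypothetical label record with the four fields named in the text."""
    time: str                    # assumed here to be an ISO date string
    keywords: tuple[str, ...]
    data_source: str
    category: str


def classify_by_category(labels: dict[str, ArchiveLabel]) -> dict[str, list[str]]:
    """Group archive IDs by the category field of their labels."""
    groups: dict[str, list[str]] = defaultdict(list)
    for archive_id, label in labels.items():
        groups[label.category].append(archive_id)
    return dict(groups)


def search_by_keyword(labels: dict[str, ArchiveLabel], keyword: str) -> list[str]:
    """Return IDs of archives whose label keywords contain the given keyword."""
    return [aid for aid, label in labels.items() if keyword in label.keywords]


labels = {
    "a1": ArchiveLabel("2023-01-05", ("budget", "audit"), "finance-dept", "finance"),
    "a2": ArchiveLabel("2023-03-10", ("hiring",), "hr-dept", "personnel"),
    "a3": ArchiveLabel("2024-02-01", ("audit",), "finance-dept", "finance"),
}
print(classify_by_category(labels))
print(search_by_keyword(labels, "audit"))
```

Grouping on a single label field keeps the sketch short; a real system could equally index on time, data source or any combination of fields.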
Further, the archive data comprises paper archives and electronic archives, and the electronic archives comprise audio and video data. The archive data acquisition module comprises an optical recognition unit and an audio conversion unit. The optical recognition unit converts paper archives into an image format and extracts the text content from the images. The audio conversion unit converts the audio and video data, extracts text data from it, and digitizes it in combination with video transcoding technology.
Specifically, the electronic archives further comprise PDF, Word and Excel files. The audio archives comprise lectures, conference recordings, interviews and the like, and the video archives comprise conference videos, lecture videos, surveillance videos and the like. Speech in audio files can be converted into text by using ASR (automatic speech recognition) technology. Video files are converted by video transcoding into a data format suitable for analysis, and speech recognition is then applied to the audio track of the video. All audio and video data is thus stored in text form, which facilitates the subsequent archive classification management work.
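One way to picture the acquisition step is as a routing table from file type to processing route. The sketch below is illustrative only: the extension list, the route names and the function `extraction_route` are assumptions, and the "text" route for PDF, Word and Excel files is a guess, since the text does not say how those formats are processed.

```python
from pathlib import Path

# Hypothetical mapping from file extension to processing route.
ROUTES = {
    ".png": "ocr", ".jpg": "ocr", ".tif": "ocr",       # scanned paper archives
    ".mp3": "asr", ".wav": "asr",                      # audio: speech recognition
    ".mp4": "transcode+asr", ".avi": "transcode+asr",  # video: transcode, then ASR
    ".pdf": "text", ".docx": "text", ".xlsx": "text",  # assumed direct text extraction
}


def extraction_route(filename: str) -> str:
    """Return the processing route for a file, or 'unsupported' if unknown."""
    return ROUTES.get(Path(filename).suffix.lower(), "unsupported")


for name in ("scan_001.TIF", "board_meeting.mp4", "interview.wav", "notes.xyz"):
    print(name, "->", extraction_route(name))
```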
Further, the layered storage unit divides the archive data as follows:
;
wherein A is the heat score;
F is the archive access frequency, that is, the number of accesses per unit time in the past;
α is the access frequency influence coefficient, with a value range of 0 to 1;
t is the time interval since the last access;
β is the access time interval influence coefficient, with a value range of 0 to 1;
Z is the importance weight of the archive, with a value range of 0 to 1;
Specifically, a higher heat score A indicates that the archive data is used more frequently, and a lower score indicates that it is used less. The classification threshold of the heat score A is set to Y1: when A is greater than Y1 the archive is a hot archive, and when A is less than Y1 it is a cold archive. Hot archives are moved to a high-performance storage medium to improve access speed and response time. Cold archives are moved to a low-cost storage medium to reduce storage cost, at the price of slower access and longer response time. The high-performance storage medium comprises a solid state disk and the low-cost storage medium comprises a mechanical hard disk. In actual operation, an administrator can adjust the importance weight, the access frequency influence coefficient and the access time interval influence coefficient according to the actual condition of each archive. This prevents misclassification, for example a long-term important archive being classified as cold because it has seldom been accessed recently, and improves the flexibility of the archive classification management system.
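The threshold step above can be sketched as follows. The sketch takes an already computed heat score A as input and does not reproduce the heat score formula itself. The function names are hypothetical, and treating A equal to Y1 as cold is an assumption, since the text only covers "greater than" and "less than".

```python
# Storage media named in the text for each tier.
STORAGE_MEDIUM = {"hot": "solid state disk", "cold": "mechanical hard disk"}


def assign_tier(heat_score: float, y1: float) -> str:
    """Return 'hot' when A > Y1, otherwise 'cold'.

    Assumption: A == Y1 is treated as cold; the text leaves this case open.
    """
    return "hot" if heat_score > y1 else "cold"


def plan_storage(heat_scores: dict[str, float], y1: float) -> dict[str, str]:
    """Map each archive ID to the storage medium for its tier."""
    return {aid: STORAGE_MEDIUM[assign_tier(a, y1)] for aid, a in heat_scores.items()}


scores = {"contract_2024": 0.82, "minutes_2009": 0.12, "policy_core": 0.50}
print(plan_storage(scores, y1=0.5))
```

Because the tier depends only on A and Y1, an administrator who raises the importance weight Z of a long-term archive changes its tier without any change to this step.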
Further, the result recommendation unit computes a candidate score as follows:
$$S(D_1, G) = w_1 R(D_1, G) + w_2 H(D_1) + w_3 C(D_1)$$
wherein G represents the user query content and S represents the candidate score;
$S(D_1, G)$ represents the candidate score of archive one for the user query content G;
$R(D_1, G)$ represents the similarity score between the user query content and archive one;
$w_1$ represents the influence coefficient of $R(D_1, G)$, with a value range of 0 to 1;
$H(D_1)$ represents the historical access frequency of archive one;
$w_2$ represents the influence coefficient of $H(D_1)$, with a value range of 0 to 1;
$C(D_1)$ represents the timeliness of archive one;
$w_3$ represents the influence coefficient of $C(D_1)$, with a value range of 0 to 1;
The historical access frequency $H(D_1)$ of archive one is derived as follows:
$$H(D_1) = \sum_{p \in U} m_p \, M_p(D_1)$$
wherein U represents the user set;
$M_p(D_1)$ represents the number of times user p has accessed archive one;
$m_p$ represents the influence coefficient of user p, with a value range of 0 to 1, and reflects the importance of different users;
Specifically, the larger $S(D_1, G)$ is, the stronger the relevance between archive one and the user query content. When the user enters the query content G while using the system, the search result column automatically recommends the archive with the highest score, which ensures query precision. In actual operation, the system provides adjustment options for the influence coefficients in the search column settings, and the user can adjust $w_1$, $w_2$ and $w_3$ as needed so that the search results match the user's requirements. This improves both the flexibility and the accuracy of archive queries.
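The scoring and recommendation step can be sketched as follows. The sketch assumes the weighted-sum forms that the symbol definitions suggest, namely S = w1·R + w2·H + w3·C and H as the sum over users of m_p times M_p. The function names are hypothetical, and the R and C values are supplied as inputs because the text does not define how they are computed.

```python
def historical_access_frequency(access_counts: dict[str, int],
                                user_weights: dict[str, float]) -> float:
    """H(D): sum over users p of m_p * M_p(D) (assumed weighted-sum form)."""
    return sum(user_weights[p] * count for p, count in access_counts.items())


def candidate_score(r: float, h: float, c: float,
                    w1: float, w2: float, w3: float) -> float:
    """S(D, G) = w1*R + w2*H + w3*C (assumed weighted-sum form)."""
    return w1 * r + w2 * h + w3 * c


def recommend(candidates: dict[str, tuple[float, float, float]],
              w1: float, w2: float, w3: float) -> str:
    """Return the ID of the archive with the highest candidate score.

    `candidates` maps an archive ID to its (R, H, C) values.
    """
    return max(candidates,
               key=lambda aid: candidate_score(*candidates[aid], w1, w2, w3))


h1 = historical_access_frequency({"u1": 3, "u2": 2}, {"u1": 1.0, "u2": 0.5})
candidates = {"D1": (0.5, h1, 1.0), "D2": (1.0, 0.0, 0.5)}
print(h1)
print(recommend(candidates, w1=1.0, w2=0.5, w3=0.5))   # history counts
print(recommend(candidates, w1=1.0, w2=0.0, w3=0.0))   # similarity only
```

The two calls to `recommend` show the effect of user-adjustable coefficients: with the history weight set to zero, the archive that matches the query text most closely is recommended instead.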
Further, the correlation analysis unit performs its analysis as follows:
$$R(D_1, D_2) = \frac{\sum_{i=1}^{n} \min\left(F_i(D_1), F_i(D_2)\right)}{\max\left(\sum_{i=1}^{n} F_i(D_1), \sum_{i=1}^{n} F_i(D_2)\right)}$$
wherein $R(D_1, D_2)$ represents the similarity score of archive one and archive two;
$F_i(D_1)$ represents the occurrence frequency of the i-th keyword in archive one;
$F_i(D_2)$ represents the occurrence frequency of the i-th keyword in archive two;
n represents the number of keywords;
$\sum_{i=1}^{n} \min\left(F_i(D_1), F_i(D_2)\right)$ represents the sum of the minimum shared feature frequencies between archive one and archive two; it measures how much the two archives overlap on each keyword by taking the smaller of the two frequencies for that keyword;
$\max\left(\sum_{i=1}^{n} F_i(D_1), \sum_{i=1}^{n} F_i(D_2)\right)$ represents the larger of the total feature counts of archive one and archive two;
Specifically, the similarity score $R(D_1, D_2)$ of archive one and archive two is obtained by calculating the degree of overlap of their shared features and normalizing it by the larger of their total feature counts. $R(D_1, D_2)$ indicates how similar the two archives are in their keywords or other features, which helps to find similar archives and to compare archive contents. After the result recommendation unit recommends the archive with the highest score to the user, the correlation analysis unit calculates the similarity score between that archive and each of the other archives. The scoring threshold of $R(D_1, D_2)$ is set to Y2: when $R(D_1, D_2)$ is greater than Y2, the compared archive is added to the search result column; when $R(D_1, D_2)$ is less than Y2, it is not added. In practical applications, a display limit can be set for the search result column so that it does not contain too many archives, which improves the experience of using the system.
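The similarity score and the Y2 filter can be sketched as follows, assuming R is the sum of per-keyword minimum frequencies divided by the larger of the two total keyword counts, as the description states in words. The function names, the `limit` parameter for the display limit and the sample keyword counts are hypothetical.

```python
def similarity(freq1: dict[str, int], freq2: dict[str, int]) -> float:
    """R(D1, D2): shared keyword frequency normalized by the larger total."""
    keywords = set(freq1) | set(freq2)
    overlap = sum(min(freq1.get(k, 0), freq2.get(k, 0)) for k in keywords)
    total = max(sum(freq1.values()), sum(freq2.values()))
    return overlap / total if total else 0.0


def related_archives(top: dict[str, int], others: dict[str, dict[str, int]],
                     y2: float, limit: int) -> list[str]:
    """Return up to `limit` archive IDs with R > Y2, most similar first."""
    scored = [(similarity(top, freq), aid) for aid, freq in others.items()]
    kept = sorted(((s, aid) for s, aid in scored if s > y2), reverse=True)
    return [aid for _, aid in kept[:limit]]


top = {"budget": 4, "audit": 2}
others = {
    "a": {"budget": 2, "audit": 2, "tax": 4},   # overlap 4, larger total 8
    "b": {"travel": 5},                         # no shared keywords
    "c": {"budget": 4, "audit": 2},             # identical keyword profile
}
print(similarity(top, others["a"]))
print(related_archives(top, others, y2=0.4, limit=5))
```

Normalizing by the larger total keeps the score between 0 and 1 and penalizes a long archive that merely contains a short one's keywords among many others.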
Further, the visualization module provides an administrator with monitoring data on system operation, the monitoring data comprising archive storage volume, access frequency and system hardware operating parameters.
Specifically, while the system is running, the administrator can check its usage in real time, including the storage status and the utilization of the graphics card and the CPU, so that system overload is avoided. The visualization module converts this data into charts, so that the administrator can grasp the state of the system quickly and clearly, which improves the quality of system management.
Further, the metadata collection tool is Apache Tika.
Specifically, Apache Tika is an open-source content analysis tool that automatically detects and extracts the metadata and content of a file. It supports multiple file formats, including documents, PDFs, images, audio and video, can generate structured metadata for archive data, and supports batch processing of large numbers of files, so it is suitable for large-scale archive data classification scenarios.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.