CN119847998B - File classification management system based on big data

Info

Publication number: CN119847998B
Application number: CN202510315301.1A
Authority: CN
Inventors: 王颖; 袁芳; 宋媛媛
Original assignee: Qingdao University
Current assignee: Qingdao University
Priority date: 2025-03-18
Filing date: 2025-03-18
Publication date: 2025-06-10
Anticipated expiration: 2045-03-18
Also published as: CN119847998A

Abstract

The present invention discloses an archive classification management system based on big data, relates to the technical field of archive management, and comprises an archive data acquisition module, a data storage module, an archive classification module, a retrieval recommendation module and a visualization module. The data storage module comprises a hierarchical storage unit, and the retrieval recommendation module comprises a result recommendation unit and a correlation analysis unit. The result recommendation unit is used to provide a specific archive according to the content input by a user, and the correlation analysis unit compares the archive provided by the result recommendation unit with other archives. The retrieval recommendation module comprehensively analyzes the archive data, recommends the archive that best meets the needs to the user based on factors such as historical access frequency, timeliness and similarity, and provides relevant archives for reference, thereby improving the retrieval accuracy and the efficiency of the archive classification management system. Through the hierarchical storage technology, the access speed and response time are effectively improved, and the storage cost is reduced, thereby reducing the system construction cost.

Description

File classification management system based on big data

Technical Field

The invention relates to the technical field of archive management, in particular to an archive classification management system based on big data.

Background

In modern society, archives have long been a familiar part of everyday life, study and work, and they are used throughout scientific research, medical treatment, litigation and other fields. Archive classification management was developed to ensure the integrity and originality of archives. With the development of big data technology, many technical fields have gradually begun to integrate and share information resources over the Internet. Archive information is characterized by large data volumes, complex content and many types. Traditional archive information consists mainly of paper archives, and collecting, classifying and storing large quantities of it imposes a heavy workload and considerable pressure on staff.

As technology develops, archive digitization has become a trend. In current digitization processes, however, archives are usually all stored together after being scanned and recognized. Because the volume of archive data is huge and high-performance storage devices are expensive, this raises the system construction cost. In addition, archive retrieval lacks comprehensive analysis of the archive data and the search results cannot be adjusted, so the accuracy of the results is low and the archive classification management system is less effective in use.

Disclosure of Invention

The invention aims to provide an archive classification management system based on big data that solves the problems described in the Background section.

To achieve this aim, the invention provides the following technical solution: an archive classification management system based on big data, comprising an archive data acquisition module, a data storage module, an archive classification module, a retrieval recommendation module and a visualization module;

the archive data acquisition module is used for acquiring archive data from a data source;

the data storage module is used for storing the data acquired by the archive data acquisition module, and comprises a hierarchical storage unit which divides the archive data into cold archives and hot archives and stores them on different storage media;

the archive classification module acquires metadata of the archive data by using a metadata collection tool and generates a label for each item of archive data, wherein the label comprises time, keywords, data source and category, and classifies the archive data according to the labels;

the retrieval recommendation module comprises a result recommendation unit and a correlation analysis unit, wherein the result recommendation unit provides a specific archive according to the content input by a user, and the correlation analysis unit compares the archive provided by the result recommendation unit with other archives and provides related archives to the user according to the comparison results.

Optionally, the archive data includes paper archives and electronic archives, and the electronic archives include audio and video data. The archive data acquisition module includes an optical recognition unit and an audio conversion unit. The optical recognition unit converts paper archives into an image format and extracts the text content from the images; the audio conversion unit converts audio and video data, extracts text data, and performs digital processing in combination with video transcoding technology.

Optionally, the hierarchical storage unit partitions the archive data as follows:

;

wherein A is the heat score;

F is the archive access frequency, i.e. the number of accesses per unit time in the past;

α is the access frequency influence coefficient, with a value range of 0 to 1;

T is the time interval since the last access;

β is the access time interval influence coefficient, with a value range of 0 to 1;

Z is the importance weight of the archive, with a value range of 0 to 1;

A higher heat score A indicates that the archive data is used more frequently, and a lower score indicates that it is used less. The classification threshold of the heat score A is set to Y1: when A is greater than Y1 the archive is a hot archive, and when A is less than Y1 it is a cold archive. Hot archives are moved to a high-performance storage medium to improve access speed and response time, and cold archives are moved to a low-cost storage medium to reduce storage cost.

Optionally, the result recommendation unit proceeds as follows:

$$S(D_1, G) = W1 \cdot R(D_1, G) + W2 \cdot H(D_1) + W3 \cdot C(D_1);$$

wherein G represents the user query content and S represents the candidate score;

S(D₁, G) represents the candidate score of archive one for the user query content;

R(D₁, G) represents the similarity score between the user query content and archive one;

W1 represents the influence coefficient of R(D₁, G), with a value range of 0 to 1;

H(D₁) represents the historical access frequency of archive one;

W2 represents the influence coefficient of H(D₁), with a value range of 0 to 1;

C(D₁) represents the timeliness of archive one;

W3 represents the influence coefficient of C(D₁), with a value range of 0 to 1;

The historical access frequency H(D₁) of archive one is obtained as follows:

$$H(D_1) = \sum_{p \in U} m_p \cdot M_p(D_1);$$

wherein U represents the user set;

M_p(D₁) represents the number of times user p has accessed archive one;

m_p represents the influence coefficient of user p, with a value range of 0 to 1, and reflects the importance of different users;

The larger S(D₁, G) is, the stronger the relevance between archive one and the user query content. When the user enters query content G while using the system, the search result column automatically recommends the archive with the highest score, which ensures the precision of the archive query. The user can adjust W1, W2 and W3 as required, which improves the flexibility of the archive query.

Optionally, the correlation analysis unit performs its analysis as follows:

$$R(D_1, D_2) = \frac{\sum_{i=1}^{n} \min\bigl(f_i(D_1), f_i(D_2)\bigr)}{\max\bigl(\sum_{i=1}^{n} f_i(D_1), \sum_{i=1}^{n} f_i(D_2)\bigr)};$$

wherein R(D₁, D₂) represents the similarity score between archive one and archive two;

f_i(D₁) represents the occurrence frequency of the i-th keyword in archive one;

f_i(D₂) represents the occurrence frequency of the i-th keyword in archive two;

n represents the number of keywords;

the numerator, $\sum_{i=1}^{n} \min(f_i(D_1), f_i(D_2))$, is the sum of the minimum values of the shared feature frequencies between archive one and archive two; it takes the smaller of the two frequencies for each keyword and so measures how much the two archives overlap on that keyword;

the denominator, $\max(\sum_{i=1}^{n} f_i(D_1), \sum_{i=1}^{n} f_i(D_2))$, is the larger of the total feature counts of archive one and archive two;

The degree of overlap of the shared features between the two archives is calculated and normalized by the larger of their total feature counts, which measures the correlation of the two archives and yields a similarity score. The correlation analysis unit thus determines how similar two archives are in their keywords, which helps in finding similar archives and comparing archive contents. After the result recommendation unit recommends the archive with the highest candidate score S to the user, the correlation analysis unit scores the similarity between that archive and the other archives. The scoring threshold of R(D₁, D₂) is set to Y2: when R(D₁, D₂) is greater than Y2, the two archives are highly similar and the compared archive is added to the search result column; when R(D₁, D₂) is less than Y2, it is not added.
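As a worked illustration, the verbal definitions above can be read as the overlap sum divided by the larger of the two totals. That reading is an assumption, and the keyword counts below are hypothetical rather than taken from the patent. Suppose three keywords occur (3, 0, 2) times in archive one and (1, 4, 2) times in archive two:

```latex
\begin{aligned}
\sum_{i=1}^{3} \min\bigl(f_i(D_1), f_i(D_2)\bigr) &= \min(3,1) + \min(0,4) + \min(2,2) = 1 + 0 + 2 = 3,\\
\max\Bigl(\sum_{i=1}^{3} f_i(D_1), \sum_{i=1}^{3} f_i(D_2)\Bigr) &= \max(5, 7) = 7,\\
R(D_1, D_2) &= 3/7 \approx 0.43.
\end{aligned}
```

With an assumed threshold Y2 = 0.4, archive two would be added to the search result column; with Y2 = 0.5 it would not.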

Optionally, the visualization module provides monitoring data of system operation for an administrator, wherein the monitoring data comprises file storage amount, access frequency and system hardware operation parameters.

Optionally, the high performance storage medium comprises a solid state disk and the low cost storage medium comprises a mechanical hard disk.

Optionally, the metadata collection tool is Apache Tika.

Compared with the prior art, the invention has the following beneficial effects:

1. After the archive data has been classified, when a user performs a search, the result recommendation unit in the retrieval recommendation module recommends an archive to the user according to the content the user has entered. The result recommendation unit takes into account influence factors such as historical access frequency, timeliness and similarity to the content entered by the user, which provides a comprehensive analysis of the archive data, and the user can adjust these factors as needed in actual operation. This improves archive search precision and the quality of use of the archive classification management system. The correlation analysis unit then compares the recommended archive with other archives and provides related archives to the user according to the comparison results, so that the user can view related archives without repeated searching, which improves the intelligence level and user experience of the system.

2. When the archive data is stored, the hierarchical storage unit in the data storage module divides it into hot archives and cold archives according to factors such as access frequency and importance. Hot archives are moved to a high-performance storage medium to improve access speed and response time, and cold archives are moved to a low-cost storage medium to reduce storage cost. Storage devices are therefore used according to the actual condition of the archive data, which maintains the efficiency of the system while lowering the cost of storage devices and hence the construction cost of the archive classification management system.

Drawings

FIG. 1 is a block diagram of a system according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to FIG. 1, the present embodiment provides an archive classification management system based on big data, including an archive data acquisition module, a data storage module, an archive classification module, a retrieval recommendation module and a visualization module;

the archive classification module acquires metadata of the archive data by using a metadata collection tool and generates a label for each item of archive data, wherein the label comprises time, keywords, data source and category, and classifies the archive data according to the labels;

the retrieval recommendation module comprises a result recommendation unit and a correlation analysis unit, wherein the result recommendation unit provides a specific archive according to the content input by the user, and the correlation analysis unit compares the archive provided by the result recommendation unit with other archives and provides related archives to the user according to the comparison results.

More specifically, in this embodiment the archive data acquisition module acquires archive data from a data source. When the data storage module stores the archive data, the hierarchical storage unit divides it into hot archives and cold archives according to factors such as access frequency and importance, and stores them on different storage media. Storage devices are therefore used according to the actual condition of the archive data, which maintains the efficiency of the system while lowering the cost of storage devices and hence the construction cost of the archive classification management system.

The archive classification module then acquires metadata of the archive data by using a metadata collection tool, generates a label for each item of archive data, and classifies the archive data according to the labels. When searching, the user can enter free-text content in the search column or enter archive label information directly. The result recommendation unit in the retrieval recommendation module recommends the archive that best matches the content entered by the user. It takes into account influence factors such as historical access frequency, timeliness and similarity to the entered content, which provides a comprehensive analysis of the archive data, and in actual operation the user can adjust these factors as needed. This improves archive search precision, meets the user's needs and improves the quality of use of the archive classification management system. The correlation analysis unit then compares the recommended archive with other archives and provides related archives to the user according to the comparison results, so that the user can view related archives without repeated searching, which improves the intelligence level and user experience of the system.

Further, the archive data comprises paper archives and electronic archives, and the electronic archives comprise audio and video data. The archive data acquisition module comprises an optical recognition unit and an audio conversion unit. The optical recognition unit converts paper archives into an image format and extracts the text content from the images; the audio conversion unit converts audio and video data, extracts text data, and performs digital processing in combination with video transcoding technology.

Specifically, the electronic archives further comprise PDF, Word and Excel files. Audio archives include lectures, conference recordings, interviews and the like, and video archives include conference videos, lecture videos, surveillance videos and the like. Speech in audio files can be converted into text by using automatic speech recognition (ASR). Video files are converted by video transcoding into a data format suitable for analysis, and speech recognition is applied to the audio track of the video. All audio and video data are thus stored as text, which facilitates the subsequent archive classification management work.

Further, the hierarchical storage unit partitions the archive data as follows:

;

wherein A is the heat score;

t is the time interval from the last access;

Z is the importance weight of the file, and the value range is 0 to 1;

Specifically, a higher heat score A indicates that the archive data is used more frequently, and a lower score indicates that it is used less. The classification threshold of the heat score A is set to Y1: when A is greater than Y1 the archive is a hot archive, and when A is less than Y1 it is a cold archive. Hot archives are moved to a high-performance storage medium to improve access speed and response time; cold archives are moved to a low-cost storage medium to reduce storage expense, at the price of correspondingly slower access and response. The high-performance storage medium comprises a solid state disk, and the low-cost storage medium comprises a mechanical hard disk. In actual operation, an administrator can adjust the importance weight, the access frequency influence coefficient and the access time interval influence coefficient according to the actual condition of an archive. This prevents misclassification, for example a long-term important archive being classified as cold because it has had too little short-term access, and improves the flexibility of the archive classification management system.
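The thresholding step can be sketched in a few lines. The heat-score formula itself is not reproduced in this text, so `heat_score` below uses an assumed, purely illustrative form that only respects the stated tendencies (higher for large F and Z, lower for large T). The names `ArchiveStats`, `heat_score` and `assign_tier`, the tier labels, and the choice to treat A equal to Y1 as cold are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass
class ArchiveStats:
    name: str
    access_frequency: float  # F: accesses per unit time
    idle_time: float         # T: time since the last access
    importance: float        # Z: importance weight, 0 to 1


def heat_score(a: ArchiveStats, alpha: float, beta: float) -> float:
    """ASSUMED form, not the patent's formula: grows with F and Z, shrinks with T."""
    return alpha * a.access_frequency + beta / (1.0 + a.idle_time) + a.importance


def assign_tier(score: float, y1: float) -> str:
    """A > Y1 -> hot archive on the SSD tier; otherwise cold archive on the HDD tier."""
    return "ssd" if score > y1 else "hdd"


archives = [
    ArchiveStats("active-contract", access_frequency=12.0, idle_time=1.0, importance=0.9),
    ArchiveStats("old-minutes", access_frequency=0.1, idle_time=400.0, importance=0.2),
]
placement = {
    a.name: assign_tier(heat_score(a, alpha=0.5, beta=0.5), y1=1.0) for a in archives
}
print(placement)  # {'active-contract': 'ssd', 'old-minutes': 'hdd'}
```

Under this assumed form, raising the importance weight of a rarely accessed archive is enough to keep it on the fast tier, which mirrors the administrator adjustment described above.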

Further, the result recommendation unit proceeds as follows:

$$S(D_1, G) = W1 \cdot R(D_1, G) + W2 \cdot H(D_1) + W3 \cdot C(D_1);$$

wherein G represents the user query content and S represents the candidate score;

H(D₁) represents the historical access frequency of archive one;

C(D₁) represents the timeliness of archive one;

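The historical access frequency H(D₁) of archive one is obtained as follows:

$$H(D_1) = \sum_{p \in U} m_p \cdot M_p(D_1);$$

wherein U represents the user set;

M_p(D₁) represents the number of times user p has accessed archive one;

m_p represents the influence coefficient of user p;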

Specifically, the larger S(D₁, G) is, the stronger the relevance between archive one and the user query content. When the user enters query content G while using the system, the search result column automatically recommends the archive with the highest score, which ensures the precision of the archive query. In actual operation, the system provides adjustment options for the different influence coefficients in the search column settings, and the user can adjust W1, W2 and W3 as needed so that the search results meet the user's needs, which improves the flexibility and the accuracy of the archive query.
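A minimal sketch of this ranking step follows. It assumes the candidate score is a weighted sum of the three factors and that H is a weighted sum of per-user access counts, as the symbol definitions suggest. The function names and sample values are hypothetical, and the sample H values are assumed to have been scaled to the range 0 to 1 so that the three factors are comparable.

```python
from typing import Dict, Tuple


def historical_access(counts: Dict[str, int], user_weights: Dict[str, float]) -> float:
    """H(D): sum over users p of m_p * M_p(D); users without a weight contribute 0."""
    return sum(user_weights.get(p, 0.0) * n for p, n in counts.items())


def candidate_score(r: float, h: float, c: float, w1: float, w2: float, w3: float) -> float:
    """S(D, G) = W1*R(D, G) + W2*H(D) + W3*C(D), assuming a weighted sum."""
    return w1 * r + w2 * h + w3 * c


def recommend(candidates: Dict[str, Tuple[float, float, float]],
              w1: float, w2: float, w3: float) -> str:
    """Return the id of the archive with the highest candidate score."""
    return max(candidates, key=lambda d: candidate_score(*candidates[d], w1, w2, w3))


# Hypothetical (R, H, C) triples for two archives.
candidates = {"A": (0.9, 0.2, 0.1), "B": (0.6, 0.8, 0.9)}

similarity_first = recommend(candidates, w1=1.0, w2=0.1, w3=0.1)
history_first = recommend(candidates, w1=0.3, w2=1.0, w3=1.0)
print(similarity_first, history_first)  # A B

raw_h = historical_access({"alice": 4, "bob": 10}, {"alice": 1.0, "bob": 0.2})
print(raw_h)  # 6.0
```

The two calls to `recommend` show how changing W1, W2 and W3 changes which archive is placed first for the same query.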

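Further, the correlation analysis unit performs its analysis as follows:

$$R(D_1, D_2) = \frac{\sum_{i=1}^{n} \min\bigl(f_i(D_1), f_i(D_2)\bigr)}{\max\bigl(\sum_{i=1}^{n} f_i(D_1), \sum_{i=1}^{n} f_i(D_2)\bigr)};$$

wherein f_i(D₁) and f_i(D₂) represent the occurrence frequency of the i-th keyword in archive one and archive two respectively;

n represents the number of keywords;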

Specifically, the degree of overlap of the shared features between the two archives is calculated and normalized by the larger of their total feature counts, which measures the correlation of the two archives and yields the similarity score R(D₁, D₂) of archive one and archive two. R(D₁, D₂) indicates how similar the two archives are in their keywords or other features, which helps in finding similar archives and comparing archive contents. After the result recommendation unit recommends the archive with the highest score to the user, the correlation analysis unit calculates the similarity score between that archive and the other archives. The scoring threshold of R(D₁, D₂) is set to Y2: when R(D₁, D₂) is greater than Y2, the compared archive is added to the search result column; when R(D₁, D₂) is less than Y2, it is not added. In practical applications, the number of entries displayed in the search result column can be limited to avoid an excessive number of archives in it, which improves the experience of using the system.
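The following sketch illustrates the related-archive step under the assumption that the similarity score is the sum of per-keyword minimum frequencies divided by the larger of the two total keyword counts. The function names, the sample keyword counts, the threshold value and the `limit` parameter (standing in for the configurable display quantity) are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def similarity(f1: Counter, f2: Counter) -> float:
    """R(D1, D2): shared keyword frequency normalized by the larger total count."""
    shared = sum(min(f1[k], f2[k]) for k in f1.keys() | f2.keys())
    total = max(sum(f1.values()), sum(f2.values()))
    return shared / total if total else 0.0


def related_archives(top: Counter, others: Dict[str, Counter],
                     y2: float, limit: int) -> List[str]:
    """Names of archives whose similarity to `top` exceeds Y2, best first, capped at `limit`."""
    hits = [(similarity(top, kw), name) for name, kw in others.items()]
    hits = [h for h in hits if h[0] > y2]
    hits.sort(key=lambda h: h[0], reverse=True)
    return [name for _, name in hits[:limit]]


top_archive = Counter(budget=3, audit=2)
others = {
    "D2": Counter(budget=1, tax=4, audit=2),  # shares 'budget' and 'audit'
    "D3": Counter(sports=5),                  # shares nothing
}
print(round(similarity(top_archive, others["D2"]), 2))        # 0.43
print(related_archives(top_archive, others, y2=0.4, limit=5))  # ['D2']
```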

Further, the visualization module provides monitoring data of system operation for an administrator, wherein the monitoring data comprises file storage capacity, access frequency and system hardware operation parameters.

Specifically, while the system is running, the administrator can check its usage in real time, including the storage status and the graphics card and CPU utilization, so that system overload is avoided. The visualization module converts the data into charts for display, so that the administrator can quickly and clearly understand how the system is being used, which improves the management quality of the system.

Further, the metadata collection tool is Apache Tika.

Specifically, Apache Tika is an open-source content analysis tool that automatically detects and extracts the metadata and content of files. It supports many file formats, including documents, PDFs, images, audio and video, can generate structured metadata for archive data, and supports batch processing of large numbers of files, which makes it suitable for large-scale archive data classification scenarios.
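To illustrate how extracted metadata might be turned into the four-part label (time, keywords, data source, category), here is a small standard-library sketch. It does not call Apache Tika; it assumes the metadata dictionary and plain text have already been extracted. The metadata key `dcterms:created`, the keyword heuristic (most frequent words of four or more letters) and the rule table are illustrative assumptions, not the patent's method.

```python
import re
from collections import Counter
from typing import Dict, List


def build_label(metadata: Dict[str, str], text: str, source: str,
                rules: Dict[str, List[str]], top_k: int = 3) -> Dict[str, object]:
    """Build a {time, keywords, source, category} label for one archive."""
    words = re.findall(r"[a-z]{4,}", text.lower())
    keywords = [w for w, _ in Counter(words).most_common(top_k)]
    # First category whose rule terms intersect the keywords; fall back to a default.
    category = next(
        (c for c, terms in rules.items() if set(terms) & set(keywords)),
        "uncategorized",
    )
    return {
        "time": metadata.get("dcterms:created", "unknown"),  # assumed metadata key
        "keywords": keywords,
        "source": source,
        "category": category,
    }


rules = {"finance": ["budget", "invoice"], "hr": ["payroll"]}
label = build_label(
    metadata={"dcterms:created": "2024-05-01T09:00:00Z"},
    text="Audit report: the audit found budget errors; audit closed, budget approved.",
    source="scanner-01",
    rules=rules,
    top_k=2,
)
print(label["keywords"], label["category"])  # ['audit', 'budget'] finance
```

Archives can then be grouped by `label["category"]`, and the other label fields remain available for label-based search.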

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A file classification management system based on big data, characterized by comprising a file data collection module, a data storage module, a file classification module, a retrieval recommendation module and a visualization module;

The archive data acquisition module is used to obtain archive data from a data source;

The data storage module is used to store the data collected by the archive data collection module, and the data storage module includes a hierarchical storage unit, and the hierarchical storage unit is used to divide the archive data into cold archives and hot archives, and use different storage media for storage;

The archive classification module uses a metadata collection tool to obtain metadata of the archive data, generates a label for each item of archive data, the label including time, keywords, data source and category, and classifies the archive data according to the labels;

The retrieval recommendation module includes a result recommendation unit and a correlation analysis unit. The result recommendation unit is used to provide a specific archive according to the content input by the user. The correlation analysis unit compares the archive provided by the result recommendation unit with other archives and provides the user with related archives according to the comparison results.

The hierarchical storage unit divides the archive data as follows:

;

Where A is the heat score;

F is the file access frequency, which indicates the number of accesses per unit time in the past;

α is the access frequency influence coefficient, ranging from 0 to 1;

T is the time interval since the last visit;

β is the access time interval influence coefficient, ranging from 0 to 1;

Z is the archive importance weight, ranging from 0 to 1;

The higher the heat score A is, the more frequently the archive data is used. Conversely, the lower the heat score A is, the less frequently the archive data is used. The classification threshold of the heat score A is set to Y1. When A is greater than Y1, it is a hot archive. When A is less than Y1, it is a cold archive. For hot archives, move them to high-performance storage media to improve access speed and response time. For cold archives, move them to low-cost storage media to reduce storage costs.

The result recommendation unit proceeds as follows:

$$S(D_1, G) = W1 \cdot R(D_1, G) + W2 \cdot H(D_1) + W3 \cdot C(D_1);$$

Where G represents the user query content, and S represents the candidate score;

S(D₁, G) represents the candidate score of archive 1 for the user query content;

R(D₁, G) represents the similarity score between the user query content and archive 1;

W1 represents the influence coefficient of R(D₁, G), ranging from 0 to 1;

H(D₁) represents the historical access frequency of archive 1;

W2 represents the influence coefficient of H(D₁), ranging from 0 to 1;

C(D₁) represents the timeliness of archive 1;

W3 represents the influence coefficient of C(D₁), ranging from 0 to 1;

The historical access frequency H(D₁) of archive 1 is obtained as follows:

$$H(D_1) = \sum_{p \in U} m_p \cdot M_p(D_1);$$

Where U represents the user set;

M_p(D₁) represents the number of times user p accesses archive 1;

m_p represents the influence coefficient of user p, ranging from 0 to 1, representing the importance of different users;

The larger the value of S(D ₁ ,G), the stronger the relevance between file 1 and the user's query content. When the user uses the system, after entering the user's query content G, the search result bar will automatically recommend a file with the highest score to ensure the accuracy of the file query. The user can adjust W1, W2 and W3 according to needs to improve the flexibility of file query;

The analysis process of the correlation analysis unit is as follows:

$$R(D_1, D_2) = \frac{\sum_{i=1}^{n} \min\bigl(f_i(D_1), f_i(D_2)\bigr)}{\max\bigl(\sum_{i=1}^{n} f_i(D_1), \sum_{i=1}^{n} f_i(D_2)\bigr)};$$

Where R(D₁, D₂) represents the similarity score between archive 1 and archive 2;

f_i(D₁) represents the frequency of occurrence of the i-th keyword in archive 1;

f_i(D₂) represents the frequency of occurrence of the i-th keyword in archive 2;

n represents the number of keywords;

The numerator, $\sum_{i=1}^{n} \min(f_i(D_1), f_i(D_2))$, represents the sum of the minimum values of the shared feature frequencies between archive 1 and archive 2, which is used to calculate how much overlap the two archives have on each keyword by taking the minimum value for each keyword;

The denominator, $\max(\sum_{i=1}^{n} f_i(D_1), \sum_{i=1}^{n} f_i(D_2))$, represents the maximum value of the total number of features in archive 1 and archive 2;

By calculating the degree of overlap of shared features between two archives and normalizing them with the maximum value of their total number of features, the correlation between the two archives is measured to obtain a similarity score. The correlation analysis unit can determine the similarity of the two archives in keywords to help find similar archives and compare archive contents. After the result recommendation unit recommends the archive with the highest candidate score S to the user, the correlation analysis unit will calculate the similarity score between the archive with the highest candidate score S and other archives, and set the score threshold of R(D ₁ , D ₂ ) to Y2. When R(D ₁ , D ₂ ) is greater than Y2, it means that the similarity between the two archives is high, and the archive compared with the archive with the highest score is added to the search result column. When R(D ₁ , D ₂ ) is less than Y2, it is not added.

2. The archive classification management system based on big data according to claim 1, characterized in that the archive data includes paper archives and electronic archives, the electronic archives include audio and video data, and the archive data acquisition module includes an optical recognition unit and an audio conversion unit; the optical recognition unit is used to convert paper archives into an image format and extract the text content from the images; the audio conversion unit is used to convert audio and video data, extract text data, and perform digital processing in combination with video transcoding technology.

3. The archive classification management system based on big data according to claim 1, characterized in that the visualization module provides the administrator with monitoring data of system operation, and the monitoring data includes archive storage capacity, access frequency and system hardware operation parameters.

4. The archive classification management system based on big data according to claim 3 is characterized in that the high-performance storage medium includes a solid-state hard disk, and the low-cost storage medium includes a mechanical hard disk.

5. The archive classification management system based on big data according to claim 1 is characterized in that the metadata collection tool is Apache Tika.