
CN119847998B - File classification management system based on big data - Google Patents

File classification management system based on big data

Info

Publication number
CN119847998B
CN119847998B CN202510315301.1A CN202510315301A CN119847998B CN 119847998 B CN119847998 B CN 119847998B CN 202510315301 A CN202510315301 A CN 202510315301A CN 119847998 B CN119847998 B CN 119847998B
Authority
CN
China
Prior art keywords
archive
data
file
user
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510315301.1A
Other languages
Chinese (zh)
Other versions
CN119847998A (en)
Inventor
王颖
袁芳
宋媛媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao University
Original Assignee
Qingdao University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao University
Priority to CN202510315301.1A
Publication of CN119847998A
Application granted
Publication of CN119847998B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/16: File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/14: Details of searching files based on file metadata
    • G06F16/148: File search processing
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses an archive classification management system based on big data, relates to the technical field of archive management, and comprises an archive data acquisition module, a data storage module, an archive classification module, a retrieval recommendation module and a visualization module. The data storage module comprises a hierarchical storage unit, and the retrieval recommendation module comprises a result recommendation unit and a correlation analysis unit. The result recommendation unit is used to provide a specific archive according to the content input by a user, and the correlation analysis unit compares the archive provided by the result recommendation unit with other archives. The retrieval recommendation module comprehensively analyzes the archive data, recommends the archive that best meets the needs to the user based on factors such as historical access frequency, timeliness and similarity, and provides relevant archives for reference, thereby improving the retrieval accuracy and the efficiency of the archive classification management system. Through the hierarchical storage technology, the access speed and response time are effectively improved, and the storage cost is reduced, thereby reducing the system construction cost.

Description

File classification management system based on big data
Technical Field
The invention relates to the technical field of archive management, in particular to an archive classification management system based on big data.
Background
As modern society has developed, archives have become a familiar part of daily life, study and work, and they appear in scientific research, medical treatment, litigation and many other fields. Archive classification management arose to preserve the integrity and originality of archives. With the development of big data technology, many technical fields now integrate information over the internet and share it as a resource. Archive information is large in volume, complex in content and varied in type. Traditional archives are mainly paper archives, and collecting, classifying and storing large quantities of them places a heavy workload and considerable pressure on staff.
As technology develops, archive digitization has become the trend. In current digitization processes, however, archives are usually all stored together after being scanned and recognized. Because the volume of archive data is huge and high-performance storage devices are expensive, this raises the construction cost of the system. In addition, archive data is not comprehensively analyzed at search time and search results cannot be adjusted, so the accuracy of the results is low and the archive classification management system is less effective in use.
Disclosure of Invention
The invention aims to provide an archive classification management system based on big data that solves the problems described in the background.
To achieve this aim, the invention provides the following technical solution: an archive classification management system based on big data, comprising an archive data acquisition module, a data storage module, an archive classification module, a retrieval recommendation module and a visualization module;
the archive data acquisition module is used for acquiring archive data from a data source;
the data storage module is used for storing the data acquired by the archive data acquisition module, and comprises a hierarchical storage unit which divides archive data into cold archives and hot archives and stores them on different storage media;
the archive classification module acquires metadata of the archive data by using a metadata collection tool, generates a label for each item of archive data, the label comprising time, keywords, data source and category, and classifies the archive data according to the labels;
the retrieval recommendation module comprises a result recommendation unit and a correlation analysis unit, wherein the result recommendation unit provides a specific archive according to the content input by a user, and the correlation analysis unit compares the archive provided by the result recommendation unit with other archives and provides related archives to the user according to the comparison results.
Optionally, the archive data includes paper archives and electronic archives, and the electronic archives include audio and video data. The archive data acquisition module includes an optical recognition unit and an audio conversion unit. The optical recognition unit converts paper archives into image format and extracts the text content from the images; the audio conversion unit converts audio and video data, extracts text data, and performs digital processing in combination with video transcoding technology.
Optionally, the hierarchical storage unit divides the archive data as follows:
[formula for the heat score A not reproduced in this text];
wherein A is the heat score;
F is the archive access frequency, i.e. the number of accesses per unit time in the past;
α is the access frequency influence coefficient, with a value range of 0 to 1;
T is the time interval since the most recent access;
β is the access time interval influence coefficient, with a value range of 0 to 1;
Z is the archive importance weight, with a value range of 0 to 1;
A higher heat score A indicates that the archive data is used more frequently, and a lower score indicates that it is used less. The classification threshold of the heat score A is set to Y1: when A is greater than Y1 the archive is a hot archive, and when A is less than Y1 it is a cold archive. Hot archives are moved to a high-performance storage medium to improve access speed and response time, and cold archives are moved to a low-cost storage medium to reduce storage cost.
Optionally, the result recommendation unit operates as follows:
$$S(D_1, G) = W_1 \cdot R(D_1, G) + W_2 \cdot H(D_1) + W_3 \cdot C(D_1);$$
wherein G represents the user query content and S represents the candidate score;
S(D1, G) represents the candidate score of archive one for the user query content;
R(D1, G) represents the similarity score between the user query content and archive one;
W1 represents the influence coefficient of R(D1, G), with a value range of 0 to 1;
H(D1) represents the historical access frequency of archive one;
W2 represents the influence coefficient of H(D1), with a value range of 0 to 1;
C(D1) represents the timeliness of archive one;
W3 represents the influence coefficient of C(D1), with a value range of 0 to 1;
The historical access frequency H(D1) of archive one is derived as follows:
$$H(D_1) = \sum_{p \in U} m_p \cdot M_p(D_1);$$
wherein U represents the user set;
M_p(D1) represents the number of times user p has accessed archive one;
m_p represents the influence coefficient of user p, with a value range of 0 to 1, and reflects the importance of different users;
A larger S(D1, G) indicates a stronger relevance between archive one and the user query content. After the user enters query content G, the search result column automatically recommends the archive with the highest score, which ensures query precision. The user can adjust W1, W2 and W3 as required, which improves the flexibility of archive queries.
Optionally, the correlation analysis unit performs its analysis as follows:
$$R(D_1, D_2) = \frac{\sum_{i=1}^{n} \min\bigl(f_i(D_1), f_i(D_2)\bigr)}{\max\bigl(\sum_{i=1}^{n} f_i(D_1), \sum_{i=1}^{n} f_i(D_2)\bigr)};$$
wherein R(D1, D2) represents the similarity score of archive one and archive two;
f_i(D1) represents the occurrence frequency of the i-th keyword in archive one;
f_i(D2) represents the occurrence frequency of the i-th keyword in archive two;
n represents the number of keywords;
the numerator is the sum, over all keywords, of the minimum of the two frequencies; taking the minimum for each keyword measures how much the two archives overlap on that keyword;
the denominator is the maximum of the total feature counts of archive one and archive two;
The correlation of the two archives is measured by calculating the degree of overlap of their shared features and normalizing it by the maximum of their total feature counts, which gives the similarity score. The correlation analysis unit can thus determine how similar two archives are in keywords, which helps in finding similar archives and comparing archive contents. After the result recommendation unit recommends the archive with the highest candidate score S to the user, the correlation analysis unit calculates the similarity score between that archive and each other archive. The scoring threshold of R(D1, D2) is set to Y2: when R(D1, D2) is greater than Y2, the two archives are highly similar and the compared archive is added to the search result column; when R(D1, D2) is less than Y2, it is not added.
Optionally, the visualization module provides monitoring data of system operation for an administrator, wherein the monitoring data comprises file storage amount, access frequency and system hardware operation parameters.
Optionally, the high performance storage medium comprises a solid state disk and the low cost storage medium comprises a mechanical hard disk.
Optionally, the metadata collection tool is Apache Tika.
Compared with the prior art, the invention has the following beneficial effects:
1. After the archive data is classified, when a user searches, the result recommendation unit in the retrieval recommendation module recommends the archive that best matches the content input by the user. The result recommendation unit takes into account influence factors such as historical access frequency, timeliness and similarity to the user's input, so the archive data is analyzed comprehensively, and the user can adjust these factors as needed in actual operation. This improves retrieval precision and the quality of use of the archive classification management system. The correlation analysis unit then compares the recommended archive with other archives and provides related archives to the user according to the comparison results, so the user can view related archives without searching repeatedly, which improves the intelligence level and user experience of the system.
2. When archive data is stored, the hierarchical storage unit in the data storage module classifies it into hot archives and cold archives according to factors such as access frequency and importance. Hot archives are moved to a high-performance storage medium to improve access speed and response time, and cold archives are moved to a low-cost storage medium to reduce storage cost. Storage devices are thus used according to the actual state of the archive data, which maintains the efficiency of the system while lowering storage device cost and hence the construction cost of the archive classification management system.
Drawings
FIG. 1 is a block diagram of a system according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the invention.
Referring to FIG. 1, the present embodiment provides an archive classification management system based on big data, including an archive data acquisition module, a data storage module, an archive classification module, a retrieval recommendation module and a visualization module;
the archive data acquisition module is used for acquiring archive data from a data source;
the data storage module is used for storing the data acquired by the archive data acquisition module, and comprises a hierarchical storage unit which divides archive data into cold archives and hot archives and stores them on different storage media;
the archive classification module acquires metadata of the archive data by using a metadata collection tool, generates a label for each item of archive data, the label comprising time, keywords, data source and category, and classifies the archive data according to the labels;
the retrieval recommendation module comprises a result recommendation unit and a correlation analysis unit, wherein the result recommendation unit provides a specific archive according to the content input by the user, and the correlation analysis unit compares the archive provided by the result recommendation unit with other archives and provides related archives to the user according to the comparison results.
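As an illustration of the label-based classification performed by the archive classification module, the following minimal Python sketch groups archives by the category field of their labels. The `ArchiveLabel` class, its field names and the sample records are hypothetical; the text names the four label fields (time, keywords, data source, category) but does not specify a data structure or a grouping rule.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ArchiveLabel:
    """Hypothetical label holding the four fields named in the text."""
    time: str                                      # e.g. creation date as an ISO string
    keywords: list = field(default_factory=list)   # keywords describing the archive
    source: str = ""                               # data source the archive came from
    category: str = "uncategorized"                # category used for classification


def classify_by_label(labels):
    """Group archive ids by the category recorded in each label."""
    groups = defaultdict(list)
    for archive_id, label in labels.items():
        groups[label.category].append(archive_id)
    return dict(groups)


# Made-up sample labels, for illustration only.
labels = {
    "a1": ArchiveLabel("2024-03-01", ["contract"], "scanner", "legal"),
    "a2": ArchiveLabel("2023-11-15", ["lecture"], "upload", "teaching"),
    "a3": ArchiveLabel("2024-05-20", ["lawsuit"], "scanner", "legal"),
}
groups = classify_by_label(labels)
print(groups)
```

Grouping on a single field keeps the sketch short; the same pattern could index archives by keyword or data source.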
More specifically, in this embodiment the archive data acquisition module acquires archive data from a data source. When the data storage module stores the archive data, the hierarchical storage unit classifies it into hot archives and cold archives according to factors such as access frequency and importance, and stores the two classes on different storage media. Storage devices are thus used according to the actual state of the archive data, which maintains the efficiency of the system while lowering storage device cost and hence the construction cost of the archive classification management system.
The archive classification module then uses a metadata collection tool to acquire metadata of the archive data, generates a label for each item of archive data, and classifies the archive data according to the labels. When searching, the user can enter free-text content in the search column or enter archive label information directly. The result recommendation unit in the retrieval recommendation module recommends the archive that best matches the user's input. It takes into account influence factors such as historical access frequency, timeliness and similarity to the input, so the archive data is analyzed comprehensively, and in actual operation the user can adjust these factors as needed. This improves retrieval precision, meets the user's needs and improves the quality of use of the system. The correlation analysis unit then compares the recommended archive with other archives and provides related archives according to the comparison results, so the user can view related archives without searching repeatedly, which improves the intelligence level and user experience of the archive classification management system.
Further, the archive data comprises paper archives and electronic archives, and the electronic archives comprise audio and video data. The archive data acquisition module comprises an optical recognition unit and an audio conversion unit. The optical recognition unit converts paper archives into image format and extracts the text content from the images; the audio conversion unit converts audio and video data, extracts text data, and performs digital processing in combination with video transcoding technology.
Specifically, the electronic archives also include PDF, Word and Excel files. Audio archives include lectures, conference recordings, interviews and the like, and video archives include conference videos, lecture videos, surveillance videos and the like. Speech in audio files can be converted into text using automatic speech recognition (ASR). Video files are converted by video transcoding into a data format suitable for analysis, and speech recognition is then applied to their audio tracks. Storing all audio and video data as text facilitates the subsequent archive classification management work.
Further, the hierarchical storage unit divides the archive data as follows:
[formula for the heat score A not reproduced in this text];
wherein A is the heat score;
F is the archive access frequency, i.e. the number of accesses per unit time in the past;
α is the access frequency influence coefficient, with a value range of 0 to 1;
T is the time interval since the most recent access;
β is the access time interval influence coefficient, with a value range of 0 to 1;
Z is the archive importance weight, with a value range of 0 to 1;
Specifically, a higher heat score A indicates that the archive data is used more frequently, and a lower score indicates that it is used less. The classification threshold of the heat score A is set to Y1: when A is greater than Y1 the archive is a hot archive, and when A is less than Y1 it is a cold archive. Hot archives are moved to a high-performance storage medium to improve access speed and response time; cold archives are moved to a low-cost storage medium to reduce storage cost, at the price of slower access and response. The high-performance storage medium comprises a solid state disk, and the low-cost storage medium comprises a mechanical hard disk. In actual operation, an administrator can adjust the archive importance weight, the access frequency influence coefficient and the access time interval influence coefficient according to the actual situation of the archives. This prevents misclassification, for example a long-term important archive being classified as cold because it has had too few recent accesses, and improves the flexibility of the archive classification management system.
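The threshold rule above can be sketched in Python as follows. The formula for A is not reproduced in this text, so `heat_score` below is a hypothetical stand-in chosen only to respect the stated behaviour of the variables (the score rises with access frequency F and importance Z and falls as the interval T grows); it should not be read as the patented formula. The tier names, the sample archives and the value of Y1 are likewise assumptions.

```python
def heat_score(freq, interval, importance, alpha=0.5, beta=0.5):
    """Hypothetical stand-in for the heat score A (NOT the patent's formula).

    freq       -- F, accesses per unit time in the past
    interval   -- T, time since the most recent access
    importance -- Z, importance weight in [0, 1]
    alpha/beta -- influence coefficients in [0, 1]
    """
    return alpha * freq + beta / (1.0 + interval) + importance


def storage_tier(score, threshold):
    """Hot archives (A > Y1) go to the SSD tier, cold archives to the HDD tier.

    The text does not define the case A == Y1; it is treated as cold here,
    which is an assumption.
    """
    return "ssd" if score > threshold else "hdd"


# Made-up sample archives, for illustration only.
archives = {
    "contract-2024": {"freq": 12, "interval": 2, "importance": 0.8},
    "minutes-1998": {"freq": 0.1, "interval": 90, "importance": 0.2},
}
Y1 = 3.0  # classification threshold, chosen arbitrarily for the example
tiers = {name: storage_tier(heat_score(**a), Y1) for name, a in archives.items()}
print(tiers)
```

Raising Z for an archive that is important but rarely accessed increases its score under this stand-in, which mirrors the administrator adjustment described in the text.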
Further, the result recommendation unit operates as follows:
$$S(D_1, G) = W_1 \cdot R(D_1, G) + W_2 \cdot H(D_1) + W_3 \cdot C(D_1);$$
wherein G represents the user query content and S represents the candidate score;
S(D1, G) represents the candidate score of archive one for the user query content;
R(D1, G) represents the similarity score between the user query content and archive one;
W1 represents the influence coefficient of R(D1, G), with a value range of 0 to 1;
H(D1) represents the historical access frequency of archive one;
W2 represents the influence coefficient of H(D1), with a value range of 0 to 1;
C(D1) represents the timeliness of archive one;
W3 represents the influence coefficient of C(D1), with a value range of 0 to 1;
The historical access frequency H(D1) of archive one is derived as follows:
$$H(D_1) = \sum_{p \in U} m_p \cdot M_p(D_1);$$
wherein U represents the user set;
M_p(D1) represents the number of times user p has accessed archive one;
m_p represents the influence coefficient of user p, with a value range of 0 to 1, and reflects the importance of different users;
Specifically, a larger S(D1, G) indicates a stronger relevance between archive one and the user query content. After the user enters query content G, the search result column automatically recommends the archive with the highest score, which ensures query precision. In actual operation, the system provides adjustment options for the influence coefficients in the search column settings, and the user can adjust W1, W2 and W3 as needed so that the search results meet the user's needs, which improves both the flexibility of archive queries and the accuracy of the results.
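The scoring and recommendation step can be sketched in Python as below. The sketch assumes that S is a weighted sum of R, H and C and that H is a user-weighted sum of access counts, which is how the coefficient definitions read; the original formula images are not available, so both forms are assumptions. The function names, the record layout and the sample values are hypothetical, and R and C are taken as precomputed numbers because the text does not say how they are calculated.

```python
def historical_frequency(access_counts, user_weights):
    """H(D) = sum over users p in U of m_p * M_p(D).

    access_counts maps user -> number of accesses M_p(D);
    user_weights maps user -> influence coefficient m_p in [0, 1].
    Users without a weight contribute nothing (an assumption).
    """
    return sum(user_weights.get(p, 0.0) * n for p, n in access_counts.items())


def candidate_score(archive, w1, w2, w3):
    """S(D, G) = W1*R(D, G) + W2*H(D) + W3*C(D) for one archive record."""
    return w1 * archive["R"] + w2 * archive["H"] + w3 * archive["C"]


def recommend(candidates, w1, w2, w3):
    """Return the id of the archive with the highest candidate score."""
    return max(candidates, key=lambda k: candidate_score(candidates[k], w1, w2, w3))


# Made-up sample data, for illustration only.
user_weights = {"alice": 1.0, "bob": 0.5}
candidates = {
    "D1": {"R": 0.9, "C": 0.5,
           "H": historical_frequency({"alice": 2, "bob": 4}, user_weights)},
    "D2": {"R": 0.4, "C": 0.9,
           "H": historical_frequency({"alice": 10}, user_weights)},
}
default_pick = recommend(candidates, w1=0.6, w2=0.1, w3=0.3)
no_history_pick = recommend(candidates, w1=0.6, w2=0.0, w3=0.3)
print(default_pick, no_history_pick)
```

In this sample, the heavily accessed D2 wins while access history carries weight, and the more query-similar D1 wins once W2 is set to zero, which illustrates how adjusting the coefficients changes the recommendation.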
Further, the correlation analysis unit performs its analysis as follows:
$$R(D_1, D_2) = \frac{\sum_{i=1}^{n} \min\bigl(f_i(D_1), f_i(D_2)\bigr)}{\max\bigl(\sum_{i=1}^{n} f_i(D_1), \sum_{i=1}^{n} f_i(D_2)\bigr)};$$
wherein R(D1, D2) represents the similarity score of archive one and archive two;
f_i(D1) represents the occurrence frequency of the i-th keyword in archive one;
f_i(D2) represents the occurrence frequency of the i-th keyword in archive two;
n represents the number of keywords;
the numerator is the sum, over all keywords, of the minimum of the two frequencies; taking the minimum for each keyword measures how much the two archives overlap on that keyword;
the denominator is the maximum of the total feature counts of archive one and archive two;
Specifically, the degree of overlap of the shared features between the two archives is calculated and normalized by the maximum of their total feature counts to measure their correlation, giving the similarity score R(D1, D2) of archive one and archive two. R(D1, D2) determines how similar the two archives are in keywords or other features, which helps in finding similar archives and comparing archive contents. After the result recommendation unit recommends the highest-scoring archive to the user, the correlation analysis unit calculates the similarity score between that archive and each other archive. The scoring threshold of R(D1, D2) is set to Y2: when R(D1, D2) is greater than Y2, the compared archive is added to the search result column; when R(D1, D2) is less than Y2, it is not added. In practical applications, the number of entries displayed in the search result column can be limited to avoid listing too many archives, which improves the user experience of the system.
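A minimal Python sketch of this similarity check follows. It assumes the score is the sum of per-keyword minimum frequencies divided by the larger of the two total keyword counts, which is one reading of the verbal description; the original formula image is not available. The function names, the `limit` parameter (standing in for the configurable display quantity) and the sample keyword counts are hypothetical.

```python
from collections import Counter


def similarity(freq_a, freq_b):
    """R(D1, D2): shared keyword frequency normalized by the larger total."""
    shared = sum(min(freq_a[k], freq_b[k]) for k in freq_a.keys() & freq_b.keys())
    largest_total = max(sum(freq_a.values()), sum(freq_b.values()))
    return shared / largest_total if largest_total else 0.0


def related_archives(top_id, keyword_freqs, threshold, limit=5):
    """Ids of archives whose similarity to the top archive exceeds Y2.

    Results are sorted by descending score and capped at `limit` entries.
    """
    top = keyword_freqs[top_id]
    scored = [
        (similarity(top, freqs), archive_id)
        for archive_id, freqs in keyword_freqs.items()
        if archive_id != top_id
    ]
    kept = sorted((s, a) for s, a in scored if s > threshold)
    return [a for s, a in reversed(kept)][:limit]


# Made-up keyword counts, for illustration only.
keyword_freqs = {
    "top": Counter(contract=3, payment=2, lease=1),
    "a": Counter(contract=2, payment=2, tenant=2),
    "b": Counter(lecture=5, physics=3),
}
related = related_archives("top", keyword_freqs, threshold=0.5)
print(related)
```

With these counts, archive "a" shares four keyword occurrences with the top archive out of a largest total of six, so it passes a threshold of 0.5, while "b" shares none.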
Further, the visualization module provides monitoring data of system operation for an administrator, wherein the monitoring data comprises file storage capacity, access frequency and system hardware operation parameters.
Specifically, while the system is running, the administrator can check its usage in real time, including storage usage and graphics card and CPU utilization, to avoid overloading the system. The visualization module converts this data into charts so that the administrator can quickly and clearly understand the state of the system, which improves the quality of system management.
Further, the metadata collection tool is Apache Tika.
Specifically, Apache Tika is an open-source content analysis tool that automatically detects and extracts metadata and content from files. It supports many file formats, including documents, PDFs, images, audio and video, can generate structured metadata for archive data, and supports batch processing of large numbers of files, which makes it suitable for large-scale archive data classification.
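One way to call Apache Tika from Python is the third-party `tika` package (tika-python), sketched below under several assumptions. The package requires Java and starts or contacts a Tika server, so the import is kept inside `extract` and the sample at the end uses a hand-written metadata dictionary instead of a real file. The metadata key names (`dcterms:created`, `Content-Type`) vary by file type and Tika version, and mapping them onto the label fields this way is a hypothetical choice, not something the text specifies.

```python
def extract(path):
    """Run Apache Tika on one file and return (metadata, text content).

    Uses the third-party `tika` package, imported lazily so the rest of
    this sketch runs on machines without Tika or Java installed.
    """
    from tika import parser
    parsed = parser.from_file(path)
    return parsed.get("metadata") or {}, parsed.get("content") or ""


def label_from_metadata(metadata, source):
    """Map a Tika-style metadata dict onto the four label fields.

    The key names used here are assumptions and differ between file types.
    """
    return {
        "time": metadata.get("dcterms:created", "unknown"),
        "keywords": [],  # keyword extraction is outside this sketch
        "source": source,
        "category": metadata.get("Content-Type", "unknown"),
    }


# Hand-written metadata standing in for a real Tika result.
sample_metadata = {
    "Content-Type": "application/pdf",
    "dcterms:created": "2024-03-01T09:00:00Z",
}
label = label_from_metadata(sample_metadata, source="scanner")
print(label)
```

For batch processing, `extract` could be applied to each file in a directory and the resulting labels passed to the classification step.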
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (5)

1.一种基于大数据的档案分类管理系统,其特征在于,包括档案数据采集模块、数据存储模块、档案分类模块、检索推荐模块和可视化模块;1. A file classification management system based on big data, characterized by comprising a file data collection module, a data storage module, a file classification module, a retrieval recommendation module and a visualization module; 所述档案数据采集模块用于从数据源中获取档案数据;The archive data acquisition module is used to obtain archive data from a data source; 所述数据存储模块用于对档案数据采集模块采集到的数据进行存储,所述数据存储模块包括分层存储单元,所述分层存储单元用于将档案数据分为冷档案和热档案,并使用不同的存储介质进行存储;The data storage module is used to store the data collected by the archive data collection module, and the data storage module includes a hierarchical storage unit, and the hierarchical storage unit is used to divide the archive data into cold archives and hot archives, and use different storage media for storage; 所述档案分类模块利用元数据收集工具获取档案数据的元数据,并为每个档案数据生成标签,标签包括时间、关键词、数据来源和类别,根据标签对档案数据进行分类;The archive classification module uses a metadata collection tool to obtain metadata of archive data and generates a label for each archive data. The label includes time, keyword, data source and category, and classifies the archive data according to the label; 所述检索推荐模块包括结果推荐单元和相关性分析单元,所述结果推荐单元用于根据用户输入的内容提供具体的档案,所述相关性分析单元根据所述结果推荐单元提供的档案与其他档案进行对比,根据对比结果为用户提供相关档案;The search recommendation module includes a result recommendation unit and a correlation analysis unit. The result recommendation unit is used to provide a specific profile according to the content input by the user. The correlation analysis unit compares the profile provided by the result recommendation unit with other profiles and provides the user with relevant profiles according to the comparison results. 
所述分层存储单元档案数据划分过程如下:The hierarchical storage unit archive data division process is as follows: ; 其中A是热度评分;Where A is the heat score; F是档案访问频率,表示过去单位时间内的访问次数;F is the file access frequency, which indicates the number of accesses per unit time in the past; α是访问频率影响系数,取值范围为0至1;α is the access frequency influence coefficient, ranging from 0 to 1; T是距最近一次访问的时间间隔;T is the time interval since the last visit; β是访问时间间隔影响系数,取值范围为0至1;β is the access time interval influence coefficient, ranging from 0 to 1; Z是档案重要性权重,取值范围为0至1;Z is the archive importance weight, ranging from 0 to 1; 热度评分A越高表示档案数据使用更加频繁,反之则表示档案数据使用越少,设定热度评分A的分类阈值为Y1,当A大于Y1时,为热档案,当A小于Y1时,为冷档案,对于热档案,将其移动至高性能存储介质中,以提高访问速度和响应时间,对于冷档案,将其移动至低成本的存储介质中,以减少存储费用;The higher the heat score A is, the more frequently the archive data is used. Conversely, the lower the heat score A is, the less frequently the archive data is used. The classification threshold of the heat score A is set to Y1. When A is greater than Y1, it is a hot archive. When A is less than Y1, it is a cold archive. For hot archives, move them to high-performance storage media to improve access speed and response time. For cold archives, move them to low-cost storage media to reduce storage costs. 
所述结果推荐单元过程如下:The result recommendation unit process is as follows: ; 其中G表示用户查询内容,S表示候选分;Where G represents the user query content, and S represents the candidate score; S(D1,G)表示根据用户查询内容档案一的候选分;S(D 1 ,G) represents the candidate score of content profile 1 according to the user query; R(D1,G)表示用户查询内容与档案一的相似度评分;R(D 1 ,G) represents the similarity score between the user query content and profile 1; W1表示R(D1,G)的影响系数,取值范围为0至1;W1 represents the influence coefficient of R(D 1 ,G), ranging from 0 to 1; H(D1)表示档案一的历史访问频率;H(D 1 ) represents the historical access frequency of file 1; W2表示H(D1)的影响系数,取值范围为0至1;W2 represents the influence coefficient of H(D 1 ), ranging from 0 to 1; C(D1)表示档案一的时效性;C(D 1 ) represents the timeliness of file 1; W3表示C(D1)的影响系数,取值范围为0至1;W3 represents the influence coefficient of C(D 1 ), ranging from 0 to 1; 档案一的历史访问频率H(D1)得出过程如下:The historical access frequency H(D 1 ) of file 1 is obtained as follows: ; 其中U表示使用者集合;Where U represents the user set; Mp(D1)表示使用者p访问档案一的次数;M p (D 1 ) represents the number of times user p accesses file 1; Mp表示使用者p的影响系数,取值范围为0至1,代表不同用户的重要性;M p represents the influence coefficient of user p, ranging from 0 to 1, representing the importance of different users; S(D1,G)越大表示档案一与用户查询内容的关联性越强,用户在使用系统过程中,输入用户查询内容G后,搜索结果栏会自动推荐得分最高的一个档案,以保证档案查询精度,用户可以根据需求调整W1、W2和W3,提高档案查询的灵活性;The larger the value of S(D 1 ,G), the stronger the relevance between file 1 and the user's query content. When the user uses the system, after entering the user's query content G, the search result bar will automatically recommend a file with the highest score to ensure the accuracy of the file query. 
The user can adjust W1, W2 and W3 according to needs to improve the flexibility of file query; 所述相关性分析单元分析过程如下:The analysis process of the correlation analysis unit is as follows: ; 其中R(D1,D2)表示档案一和档案二的相似度评分;Where R(D 1 ,D 2 ) represents the similarity score between file 1 and file 2; fi(D1)表示在档案一中第i个关键词出现频次; fi (D 1 ) represents the frequency of occurrence of the i-th keyword in file 1; fi(D2)表示在档案二中第i个关键词出现频次; fi (D 2 ) represents the frequency of occurrence of the i-th keyword in file 2; n表示关键词的数量;n represents the number of keywords; 表示档案一和档案二之间共享特征频次的最小值之和,用于计算两个档案在每个关键词上有多少重叠,并找出每个关键词的最小值; It represents the sum of the minimum values of the shared feature frequencies between profile 1 and profile 2, which is used to calculate how much overlap the two profiles have on each keyword and find the minimum value for each keyword; 表示档案一和档案二中特征总数的最大值; It represents the maximum value of the total number of features in file 1 and file 2; 通过计算两个档案之间共享特征的重叠程度,并用它们的特征总数的最大值进行标准化,来衡量两个档案的相关性,得到相似度评分,相关性分析单元可以确定两个档案在关键词的相似度,帮助查找相似档案和进行档案内容对比,在所述结果推荐单元为用户推荐候选分S最高的档案后,相关性分析单元将计算候选分S最高的档案与其他档案的相似度评分,并设定R(D1,D2)的评分阈值为Y2,当R(D1,D2)大于Y2时,表示两个档案的相似度高,将与得分最高的档案对比的档案加入搜索结果栏,当R(D1,D2)小于Y2时,则不加入。By calculating the degree of overlap of shared features between two archives and normalizing them with the maximum value of their total number of features, the correlation between the two archives is measured to obtain a similarity score. The correlation analysis unit can determine the similarity of the two archives in keywords to help find similar archives and compare archive contents. After the result recommendation unit recommends the archive with the highest candidate score S to the user, the correlation analysis unit will calculate the similarity score between the archive with the highest candidate score S and other archives, and set the score threshold of R(D 1 , D 2 ) to Y2. 
When $R(D_1, D_2)$ is greater than Y2, the two archives are highly similar, and the archive being compared with the highest-scoring archive is added to the search result bar; when $R(D_1, D_2)$ is less than Y2, it is not added.

2. The archive classification management system based on big data according to claim 1, characterized in that the archive data includes paper archives and electronic archives, and the electronic archives include audio and video data; the archive data acquisition module includes an optical recognition unit and an audio conversion unit; the optical recognition unit converts paper archives into image format and extracts the text content from the images; the audio conversion unit converts audio and video data, extracts text data, and performs digitization in combination with video transcoding technology.

3. The archive classification management system based on big data according to claim 1, characterized in that the visualization module provides the administrator with monitoring data on system operation, the monitoring data including archive storage volume, access frequency and system hardware operating parameters.

4. The archive classification management system based on big data according to claim 3, characterized in that the high-performance storage medium includes a solid-state drive, and the low-cost storage medium includes a mechanical hard disk drive.

5. The archive classification management system based on big data according to claim 1, characterized in that the metadata collection tool is Apache Tika.
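
The scoring and filtering steps of claim 1 can be illustrated with a short Python sketch. It is not the patented implementation. It assumes that the candidate score is a weighted sum of similarity, access frequency and timeliness, that the access frequency is a user-weighted sum of visit counts, and that the similarity score is the min-overlap of keyword frequencies divided by the larger total. All function and parameter names are hypothetical, and how R(D, G) and C(D) are themselves computed is left open because the claim text here does not specify it.

```python
from typing import List, Mapping, Sequence


def access_frequency(visits: Mapping[str, int],
                     user_weight: Mapping[str, float]) -> float:
    """H(D): visit counts per user, each scaled by that user's coefficient in [0, 1]."""
    # Users without a configured coefficient contribute nothing (an assumption).
    return sum(user_weight.get(user, 0.0) * count for user, count in visits.items())


def candidate_score(r: float, h: float, c: float,
                    w1: float, w2: float, w3: float) -> float:
    """S(D, G): weighted sum of similarity r, access frequency h and timeliness c."""
    return w1 * r + w2 * h + w3 * c


def similarity(f1: Sequence[int], f2: Sequence[int]) -> float:
    """R(D1, D2): summed per-keyword minimum over the larger total frequency."""
    if len(f1) != len(f2):
        raise ValueError("both archives must be described over the same n keywords")
    larger_total = max(sum(f1), sum(f2))
    if larger_total == 0:
        return 0.0  # neither archive contains any keyword
    return sum(min(a, b) for a, b in zip(f1, f2)) / larger_total


def search_results(scores: Mapping[str, float],
                   keyword_freqs: Mapping[str, Sequence[int]],
                   y2: float) -> List[str]:
    """Return the top-scoring archive, then every archive whose similarity to it exceeds y2."""
    best = max(scores, key=scores.get)
    related = [name for name in keyword_freqs
               if name != best
               and similarity(keyword_freqs[best], keyword_freqs[name]) > y2]
    return [best] + related


if __name__ == "__main__":
    freqs = {"A": [2, 0, 3], "B": [1, 1, 3], "C": [0, 5, 0]}
    scores = {"A": 0.9, "B": 0.4, "C": 0.7}
    print(search_results(scores, freqs, y2=0.5))
```

Because H(D) as sketched here is an unnormalized count, it can dominate the other two terms unless W2 is small or the counts are rescaled; the claim text does not say which of these the system does.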
CN202510315301.1A 2025-03-18 2025-03-18 File classification management system based on big data Active CN119847998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510315301.1A CN119847998B (en) 2025-03-18 2025-03-18 File classification management system based on big data

Publications (2)

Publication Number Publication Date
CN119847998A CN119847998A (en) 2025-04-18
CN119847998B true CN119847998B (en) 2025-06-10

Family

ID=95363202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510315301.1A Active CN119847998B (en) 2025-03-18 2025-03-18 File classification management system based on big data

Country Status (1)

Country Link
CN (1) CN119847998B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113570118A (en) * 2021-07-06 2021-10-29 浙江工业大学 A Workshop Scheduling and Analysis Method Based on Scheduling Rules
CN117725283A (en) * 2023-12-20 2024-03-19 山东东方飞扬软件技术有限公司 Archival data storage system based on big data

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9063938B2 (en) * 2012-03-30 2015-06-23 Commvault Systems, Inc. Search filtered file system using secondary storage, including multi-dimensional indexing and searching of archived files
US11212363B2 (en) * 2016-02-08 2021-12-28 Microstrategy Incorporated Dossier interface and distribution
CN116910362B (en) * 2023-07-18 2024-04-16 中国电子科技集团公司第五十四研究所 Intelligent recommendation method for perceived data, computer equipment and storage medium
CN118365431B (en) * 2024-06-19 2024-10-11 广州大事件网络科技有限公司 Big data-based commodity recommendation method and system for electronic commerce platform
CN118427158B (en) * 2024-07-04 2024-10-11 广州劲源科技发展股份有限公司 File development and utilization management system based on artificial intelligence technology
CN119557419B (en) * 2024-11-14 2025-09-26 广州晨雅档案管理咨询服务有限公司 File management method and system

Also Published As

Publication number Publication date
CN119847998A (en) 2025-04-18

Similar Documents

Publication Publication Date Title
US8909563B1 (en) Methods, systems, and programming for annotating an image including scoring using a plurality of trained classifiers corresponding to a plurality of clustered image groups associated with a set of weighted labels
US20190340194A1 (en) Associating still images and videos
US9305084B1 (en) Tag selection, clustering, and recommendation for content hosting services
JP5192475B2 (en) Object classification method and object classification system
CN112035658B (en) Enterprise public opinion monitoring method based on deep learning
EP1835419A1 (en) Information processing device, method, and program
CN111723256A (en) A method and system for constructing government user portrait based on information resource database
CN107463616B (en) Enterprise information analysis method and system
CN112686043B (en) Word vector-based classification method for emerging industries of enterprises
CN117688250B (en) Unified data dynamic service management system and method suitable for electric power full scene
JP6104209B2 (en) Hash function generation method, hash value generation method, apparatus, and program
US20200257724A1 (en) Methods, devices, and storage media for content retrieval
CN111813898A (en) Expert recommendation method, device, device and storage medium based on semantic search
JP6397378B2 (en) Feature value generation method, feature value generation device, and feature value generation program
JP6368677B2 (en) Mapping learning method, information compression method, apparatus, and program
CN113032573A (en) Large-scale text classification method and system combining theme semantics and TF-IDF algorithm
CN119847998B (en) File classification management system based on big data
JP6152032B2 (en) Hash function generation method, hash value generation method, hash function generation device, hash value generation device, hash function generation program, and hash value generation program
CN118331502A (en) Cloud resource management method and device and electronic equipment
CA3017999A1 (en) Audio search user interface
KR102732683B1 (en) Apparatus for searching video
Yan et al. A multimodal retrieval and ranking method for scientific documents based on HFS and XLNet
Mallek et al. An unsupervised approach for precise context identification from unstructured text documents
CN112784171B (en) A movie recommendation method based on context typicality
CN119396843B (en) A method for intelligent analysis of massive information based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant