
CN119377410B - Data classification method, system and related device - Google Patents

Data classification method, system and related device

Info

Publication number
CN119377410B
Authority
CN
China
Prior art keywords
data
subject
department
topic
target data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411960482.5A
Other languages
Chinese (zh)
Other versions
CN119377410A (en)
Inventor
汪洋舟
董厚泽
孙路阳
程建润
谢红韬
胡建
袁公萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiji Computer Corp Ltd
CETC Big Data Research Institute Co Ltd
Original Assignee
Taiji Computer Corp Ltd
CETC Big Data Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiji Computer Corp Ltd, CETC Big Data Research Institute Co Ltd
Priority to CN202411960482.5A
Publication of CN119377410A
Application granted
Publication of CN119377410B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a data classification method, system and related device for classifying data quickly and accurately. The method includes: acquiring data source information and carrying out data extraction on it to obtain target data; acquiring a full-quantity topic word stock and a department topic word stock; judging, based on the full-quantity topic word stock, whether the target data has a topic; if so, extracting the topic from the full-quantity topic word stock; if not, classifying the target data through a preset learning model to obtain a data classification result; when the target data has a topic, judging whether the topic is unique; if so, taking the topic as the data classification result; if not, determining the department topic of the target data based on the department topic word stock; when the topic is not unique, judging whether the topic overlaps with the department topic; if so, taking the topic or the department topic as the data classification result; if not, classifying the topic and the department topic through the preset learning model to obtain the data classification result.

Description

Data classification method, system and related device
Technical Field
The present application relates to the field of big data information technologies, and in particular, to a data classification method, system and related device.
Background
With the development of information technology, the scale and application of big data in government departments have grown explosively. In government affairs systems, data sources are broad and complex: the business systems of government departments at all levels, public service platforms, Internet of Things devices and the like cover many sensitive and important fields, such as citizens' personal information, urban planning and construction, and policy implementation. Different types of data differ greatly in importance, sensitivity and usage rights. Leakage of personal privacy data such as resident identification numbers and home addresses seriously damages citizens' rights and interests, and policy planning data that government departments have not yet released must be kept strictly confidential. Standardized classification of government affairs data is therefore urgently needed: it helps government departments manage data assets, clarifies the management strategies for different types of data, improves data security, and guards against the risk of data leakage.
In the prior art, data classification is mostly performed manually. Data from multiple sources, for example from different government systems, are collected by hand, preliminarily distinguished and labeled according to experience or simple predefined rules, and then classified in detail to obtain the final data classification result.
However, manually driven data classification has many defects. The whole classification flow depends heavily on manual work, collecting and sorting the data takes a great deal of effort, the classification process is highly subjective, and there is no unified classification standard. When facing massive data, the process is therefore time-consuming, laborious and error-prone, an accurate data classification result cannot be obtained, and the requirement of government departments for rapid and accurate data classification cannot be met.
Disclosure of Invention
In order to solve the technical problems, the application provides a data classification method, a data classification system and a related device.
The following describes the technical scheme provided in the present application:
The first aspect of the application provides a data classification method, which comprises the following steps:
acquiring data source information, and carrying out data extraction on the data source information to obtain target data;
acquiring a full-quantity topic word stock and a department topic word stock;
judging whether the target data has a topic based on the full-quantity topic word stock;
if yes, extracting the topic from the full-quantity topic word stock;
if not, classifying the target data through a preset learning model to obtain a data classification result;
when it is determined that the target data has a topic, judging whether the extracted topic is unique;
if yes, taking the topic as the data classification result;
if not, determining the department topic of the target data based on the department topic word stock;
when it is determined that the topic is not unique, judging whether the topic overlaps with the department topic;
if yes, taking the topic or the department topic as the data classification result;
if not, classifying the topic and the department topic through the preset learning model to obtain the data classification result.
Optionally, acquiring the data source information and carrying out data extraction on the data source information to obtain the target data includes:
acquiring the data source information, performing data processing on the data source information, and extracting the processed data source information to obtain the target data.
Optionally, acquiring the data source information, performing data processing on it, and extracting the processed data source information to obtain the target data includes:
acquiring the data source information, and identifying and removing repeated data, redundant information and missing values in the data source information;
unifying the data format of the data source information from which the repeated data, the redundant information and the missing values have been removed;
extracting the data source information with the unified data format, and splicing the extracted data using a space character to obtain the target data.
Optionally, judging whether the target data has a topic based on the full-quantity topic word stock includes:
segmenting the target data into words, and matching the segmented target data against the full-quantity topic word stock to judge whether a topic exists.
Optionally, determining the department topic of the target data based on the department topic word stock includes:
judging whether the target data has a department topic based on the department topic word stock;
if yes, determining the department topic of the target data;
if not, classifying the topic through the preset learning model to obtain the data classification result.
Optionally, the full-quantity topic word stock includes a primary category, a secondary category, and a tertiary category.
Optionally, the primary category covers the e-government knowledge fields, the division of department functions, and industry categories;
the secondary category and the tertiary category are the knowledge branches to which the topic words in the full-quantity topic word stock belong.
A second aspect of the present application provides a data classification system, the system comprising:
a first extraction unit, configured to acquire data source information and carry out data extraction on the data source information to obtain target data;
an acquisition unit, configured to acquire the full-quantity topic word stock and the department topic word stock;
a first judging unit, configured to judge whether the target data has a topic based on the full-quantity topic word stock;
a second extraction unit, configured to extract the topic from the full-quantity topic word stock if yes;
a first classification unit, configured to classify the target data through a preset learning model if not, so as to obtain a data classification result;
a second judging unit, configured to judge whether the extracted topic is unique when it is determined that the target data has a topic;
a second classification unit, configured to take the topic as the data classification result if yes;
a second determining unit, configured to determine the department topic of the target data based on the department topic word stock if not;
a third judging unit, configured to judge whether the topic overlaps with the department topic when it is determined that the topic is not unique;
a third classification unit, configured to take the topic or the department topic as the data classification result if yes;
a fourth classification unit, configured to classify the topic and the department topic through the preset learning model if not, so as to obtain the data classification result.
A third aspect of the present application provides a data classification device, the device comprising:
a processor, a memory, an input-output unit, and a bus;
the processor is connected with the memory, the input-output unit and the bus;
the memory holds a program that the processor invokes to perform the method of the first aspect or any optional implementation of the first aspect.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon a program which, when executed on a computer, performs the method of the first aspect or any optional implementation of the first aspect.
From the above technical scheme, the application has the following advantages:
The application automates the process of data collection and key information extraction, reduces manual intervention and improves efficiency. The data is subjected to multiple judgment through the full-quantity topic word stock and the department topic word stock, unified classification standards and bases are provided for the whole data classification flow, and the consistency of classification results is ensured. The learning model is introduced, and the data is classified by using the learning model, so that the accuracy of classification is improved, and the data is ensured to be correctly classified into the corresponding categories. The whole data classification method has the characteristics of automation and intelligence, realizes rapid and efficient data classification, and meets the requirement of government departments on rapid data classification.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an embodiment of a data classification method according to the present application;
FIG. 2 is a flow chart of another embodiment of the data classification method according to the present application;
FIG. 3 is a schematic diagram of a data classification system according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an embodiment of a data classification device according to the present application.
Detailed Description
The application provides a data classification method which can be used for rapidly and accurately classifying data. It should be noted that the data classification method of the present application is applied to a terminal.
Referring to fig. 1, the present application first provides an embodiment of a data classification method, which includes:
S101, acquiring data source information, and carrying out data extraction on the data source information to obtain target data;
In this embodiment, the data source information is configured by the user according to the data classification requirement, and includes information such as the data source database, the data source department, the data source system, the data table name, and all fields. After the data source information is obtained, the key structure and key information in the data source are screened according to a preset extraction rule, and accurate and representative data are extracted from them; these data are the target data.
In this embodiment, acquiring the target data from the user-configured data source information avoids unnecessary resource waste and data redundancy, allows the target data to be extracted accurately, and improves the reliability of data acquisition. The extracted target data meet the requirements of subsequent data classification and lay a good foundation for the whole data classification flow.
S102, acquiring a full-quantity topic word stock and a department topic word stock;
In this embodiment, the source of the full-quantity topic word stock is the "comprehensive electronic government topic word list", which contains 20252 topic words, of which 17421 are formal topic words and 2831 are informal topic words. The full-quantity topic word stock obtained from this list comprises a primary category, a secondary category and a tertiary category: the primary category covers the e-government knowledge fields, the division of department functions, and industry categories, and the secondary and tertiary categories are the knowledge branches to which the topic words in the full-quantity topic word stock belong. The source of the department topic word stock is the corresponding government department.
In this embodiment, the full-quantity topic word stock provides a comprehensive and general semantic framework for the whole data processing flow. It covers a wide range of concepts and terms, so that subsequent data classification can operate accurately on this framework. The department topic word stock is more targeted and reflects the particular business needs and semantic habits of the corresponding government department. During data classification, when the full-quantity topic word stock cannot adequately meet the needs of a specific government department, the department topic word stock serves as a supplement and improves the accuracy and flexibility of data processing. Combining the two word stocks better meets the data classification needs of different levels and different departments.
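The word stocks described above can be pictured as a lookup from topic word to its three category levels. The following is only a minimal sketch under stated assumptions: the class name `TopicEntry`, the helper `build_library`, and the sample entries are hypothetical and do not come from the source, which does not specify a storage layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicEntry:
    """One topic word and the three category levels it belongs to."""
    word: str
    primary: str    # knowledge field, department function division, or industry category
    secondary: str  # knowledge branch the topic word belongs to
    tertiary: str   # finer knowledge branch under the secondary category


def build_library(entries):
    """Index entries by topic word for fast lookup during matching."""
    return {entry.word: entry for entry in entries}


# Hypothetical sample entries; the real word lists are not reproduced in the source.
full_library = build_library([
    TopicEntry("education policy", "education", "education administration", "policy"),
    TopicEntry("urban planning", "urban construction", "planning", "master plan"),
])
department_library = build_library([
    TopicEntry("education policy", "education", "education administration", "policy"),
])
```

In this sketch the department word stock is simply a smaller index of the same shape, which lets the same matching code run against either word stock.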
S103, judging whether the target data has a topic based on the full-quantity topic word stock;
In this embodiment, topic words are taken from the full-quantity topic word stock in sequence and matched against the target data. The matching can be exact or fuzzy: exact matching requires a word in the target data to be identical to the topic word, whereas fuzzy matching only requires a certain degree of semantic similarity.
It should be noted that, while traversing the full-quantity topic word stock, if at least one topic word is successfully matched with the target data, it is determined that the target data has a topic and step S104 is executed; if no matching topic word is found in the entire word stock, it is determined that the target data has no topic and step S105 is executed.
By comparing the target data with the full-quantity topic word stock to judge whether the target data has a topic, this embodiment helps improve the accuracy of data processing. Matching by exact or fuzzy matching takes the various possible topic situations into account and reduces omissions and misjudgments.
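The matching in S103 can be sketched as follows. This is an illustration only: the function names `match_topics` and `has_topic` are hypothetical, the target data is assumed to be already segmented into tokens, and character-level similarity from the Python standard library `difflib` is used as a plain stand-in for the semantic similarity the embodiment mentions.

```python
from difflib import SequenceMatcher


def match_topics(tokens, library_words, fuzzy_threshold=None):
    """Return the library words that match any token, in library order.

    With fuzzy_threshold=None only exact matches count. Otherwise a word also
    matches when its character-level similarity ratio to some token reaches
    the threshold (a stand-in for semantic similarity, not the real thing).
    """
    matched = []
    for word in library_words:
        for token in tokens:
            if token == word:
                matched.append(word)
                break
            if fuzzy_threshold is not None:
                if SequenceMatcher(None, token, word).ratio() >= fuzzy_threshold:
                    matched.append(word)
                    break
    return matched


def has_topic(tokens, library_words, fuzzy_threshold=None):
    """S103: the target data has a topic if at least one topic word matches."""
    return bool(match_topics(tokens, library_words, fuzzy_threshold))
```

Calling `has_topic` with `fuzzy_threshold=None` gives the exact-matching branch; passing a value such as 0.8 enables the fuzzy branch.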
S104, extracting topics from the full-quantity topic word stock;
In this embodiment, after it is determined that the target data has a topic, the topic is extracted from the full-quantity topic word stock. Before extraction, the topics successfully matched with the target data need to be marked. Because the topics in the full-quantity topic word stock were already compared with the target data when judging whether a topic exists, the successfully matched topics can be located directly from the matching records and then extracted.
It should be noted that if multiple topics match successfully, all of them need to be extracted. For example, after it is determined that a topic exists, all other topics in the full-quantity topic word stock that match the target data are searched for: the entries related to the existing topic and their associated entries are searched comprehensively, matching degree analysis is performed on all related topics found, and a matching degree threshold is set; a related topic is extracted if its matching degree is greater than or equal to the threshold. For example, with the matching degree threshold set to 80%, if the target data contains content related to "education policy", the "education policy" topic can be matched in the full-quantity topic word stock, and a second, closely related policy topic can be matched at the same time. If matching degree analysis gives a matching degree of 90% between the two topics, which is above the threshold, both are extracted as related topics.
By extracting the topics from the full-quantity topic word stock, this embodiment makes the association between the target data and the topic words explicit, so that the target data has clear topic attributes. The extracted topics provide an important basis for the subsequent steps, and clear topics help determine the direction and focus of analysis.
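The threshold rule in S104 can be sketched as below. The source does not say how the matching degree is computed, so the sketch takes it as a caller-supplied function; `extract_topics`, `matching_degree`, and the topic name "education guideline" used in the usage note are all hypothetical.

```python
def extract_topics(matched, library_words, matching_degree, threshold=0.8):
    """S104: keep every matched topic plus related topics at or above threshold.

    matching_degree(a, b) is a caller-supplied function returning a value in
    [0, 1]; how the matching degree is computed is an assumption left open here.
    """
    extracted = list(matched)
    for candidate in library_words:
        if candidate in extracted:
            continue
        # A candidate is related if it is close enough to any matched topic.
        if any(matching_degree(topic, candidate) >= threshold for topic in matched):
            extracted.append(candidate)
    return extracted
```

For instance, with a made-up score table in which "education policy" and "education guideline" score 0.9, both topics are returned at the 80% threshold, while a topic scoring 0.1 is left out.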
S105, classifying the target data through a preset learning model to obtain a data classification result;
In this embodiment, when it is determined that the target data has no topic, the preset learning model is called to classify the target data and obtain the data classification result. First, a suitable preset learning model is selected, such as a convolutional neural network model or a recurrent neural network model. The target data are then input into the preset learning model, which extracts features of the target data, such as lexical features and semantic features. After feature extraction, the preset learning model performs the classification calculation on the extracted features: using its weight and bias parameters, it calculates the probability that the target data belong to each preset category and selects the category with the highest probability as the data classification result.
In this embodiment, classifying topic-less target data with the preset learning model makes full use of the advantages of machine learning and allows complex and varied data to be handled. The preset learning model can learn latent patterns and regularities in the data, so that even when no topic is explicitly identified, the target data can be classified reasonably according to their features, improving the accuracy and efficiency of data classification.
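The final classification calculation in S105 (weights, biases, per-category probability, highest probability wins) can be illustrated with a single linear layer followed by softmax. This is a deliberately simplified sketch, not the convolutional or recurrent model the embodiment names; the function names and the example weights are assumptions.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities, subtracting the max for stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, biases, categories):
    """Score each category as w.x + b, apply softmax, return the most probable."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    probabilities = softmax(scores)
    best = max(range(len(categories)), key=probabilities.__getitem__)
    return categories[best], probabilities
```

In a real system the feature vector would come from the feature-extraction layers of the preset learning model rather than being supplied by hand.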
S106, when it is determined that the target data has a topic, judging whether the extracted topic is unique;
In this embodiment, after it is determined that a topic exists, all topics extracted from the full-quantity topic word stock are obtained and counted. If the count is 1, that is, only one topic matching the target data was found, the topic is judged to be unique. If the count is greater than 1, multiple matching topics were found and the topic is judged to be not unique. For example, if the extracted topics are the "education policy" topic and one related policy topic, the number of topics is 2, which is greater than 1, so the topic is not unique. If the topic is unique, step S107 is executed; if not, step S108 is executed.
By judging whether the topic is unique, this embodiment helps improve the accuracy of data classification: it avoids fixing the classification wrongly when the topic is not unique and ensures that the data classification result matches the actual meaning of the data, making the whole data classification flow more rigorous.
S107, taking the topic as the data classification result;
In this embodiment, when it is determined that the extracted topic is unique, the topic is taken as the data classification result.
It should be noted that the unique topic needs to be normalized to ensure that its format and representation meet the required standard.
By taking the unique topic as the data classification result, this embodiment obtains the data classification result accurately, reduces the computing resources consumed, and improves data classification efficiency.
S108, determining the department topic of the target data based on the department topic word stock;
In this embodiment, when the topic is not unique, the department topic word stock is used to determine the department topic of the target data. Department topic words are taken from the department topic word stock one by one and matched against the target data, using exact or fuzzy matching. Because the department topic word stock is more department-specific than the full-quantity topic word stock, the matching focuses more on vocabulary and concepts characteristic of the department. When a matching department topic word is found, it is determined to be the department topic of the target data.
In this embodiment, when the topic found in the full-quantity topic word stock is not unique, determining the department topic through the department topic word stock identifies the topic of the target data at department level more accurately, improves classification accuracy, and associates the target data more closely with a specific department.
S109, when it is determined that the topic is not unique, judging whether the topic overlaps with the department topic;
In this embodiment, after the topics from the full-quantity topic word stock and the department topic from the department topic word stock have been determined, whether they overlap needs to be judged. The topics determined from the full-quantity topic word stock and the department topic determined from the department topic word stock are extracted and compared. If the topics and the department topic overlap completely, step S110 is executed; if they do not overlap completely, step S111 is executed.
In this embodiment, whether the topics overlap with the department topic is judged. If they overlap, there is commonality between the full-quantity topics and the department-specific topic, which helps integrate data resources: overlapping data can be processed and output in a unified way to avoid repetition. If they do not overlap, the boundary between data under the full-quantity topics and data under the department topic can be drawn, which supports more accurate classified storage and query operations.
S110, taking the topic or the department topic as the data classification result;
In this embodiment, when it is determined that the topic overlaps with the department topic, the topic or the department topic is taken as the data classification result; which of the two becomes the final result is decided according to actual requirements. For example, if the topic "market research data analysis" overlaps with the department topic "market data statistics analysis" and the department topic is chosen as the classification result according to actual requirements, the data classification result of the target data is "market data statistics analysis".
By taking the overlapping topic or department topic as the data classification result, this embodiment streamlines the data classification system, improves the availability and management efficiency of the data, reduces unnecessary calculation steps, and yields an accurate data classification result quickly.
S111, classifying the topics and department topics through a preset learning model to obtain a data classification result.
In this embodiment, when it is determined that the topic and the department topic do not overlap, the preset learning model is called to classify the topic and the department topic and obtain the data classification result. First, the topic and the department topic are converted into vector form; a word embedding technique may be used to map the topic words into a low-dimensional vector space. The resulting topic vector and department topic vector are then input into the preset learning model, which may be a classification model based on a neural network and is not specifically limited here. Finally, the preset learning model analyzes and classifies the input vectors and outputs the data classification result.
In this embodiment, when the topic does not overlap with the department topic, the preset learning model classifies the data more accurately. The topic and the department topic are converted into vector form, and the learning model mines the implicit semantic relations and features between them to achieve a finer and more reasonable classification. A classification based on a learning model is also less affected by human factors and is therefore more objective and stable.
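The branch structure of S103 to S111 can be summarized in one routing function. This is a sketch under stated assumptions: `classify_target` and its parameters are hypothetical names, the preset learning model is represented by an arbitrary callable, and "overlap" is simplified to a topic name appearing in both lists, which is narrower than the overlap judgment the embodiment describes.

```python
def classify_target(target_data, full_topics, department_topics, model_classify):
    """Route target data through the S103-S111 decision branches.

    full_topics: topics matched in the full-quantity topic word stock (S103/S104).
    department_topics: topics matched in the department topic word stock (S108).
    model_classify: callable standing in for the preset learning model.
    """
    if not full_topics:                      # S103 "no" -> S105
        return model_classify(target_data)
    if len(full_topics) == 1:                # S106 "yes" -> S107
        return full_topics[0]
    if not department_topics:                # optional branch: no department topic found
        return model_classify(full_topics)
    # Assumption: overlap means the same topic name occurs in both lists.
    overlap = [t for t in department_topics if t in full_topics]
    if overlap:                              # S109 "yes" -> S110
        return overlap[0]
    # S109 "no" -> S111: let the model decide between topics and department topics.
    return model_classify(full_topics + department_topics)
```

Keeping the model behind a callable means the same routing can be exercised with a stub during testing and with a trained classifier in production.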
Referring to fig. 2, fig. 2 is a schematic diagram of another embodiment of a data classification method according to the present application, which includes:
S201, acquiring data source information, and identifying and removing duplicate data, redundant information and missing values in the data source information;
In this embodiment, the data structure of the acquired data source information is first analyzed to clarify information such as the storage format of the data and the meaning of each field. Key identifiers are then marked on the data in the data source information, and duplicate data are identified through the marked key identifiers; for example, if the key identifier fields of two or more records are found to be identical, the records are determined to be duplicate data and the duplicates are removed. For redundant information, each field in the data is analyzed according to predefined service rules; if a field contains a large number of meaningless placeholders, default values and the like, the field is determined to be redundant information and is removed. For missing values, each field in the data source information is traversed, and the number of null values, specific missing-value identifiers and similar missing values in each field is counted. A field with an excessively high proportion of missing values is removed, and a field with a low proportion of missing values can be filled by mean filling, mode filling, predictive filling or the like.
By removing duplicate data, this embodiment avoids repeated unnecessary operations on the same data in subsequent data processing, reduces the waste of computing resources and improves data processing efficiency. Removing redundant information from the data source information simplifies the data set, makes the data more concise, reduces the storage space occupied, highlights the key information in the data, avoids interference from irrelevant information during data analysis and improves the accuracy of the analysis. Processing the missing values of the data source information ensures the integrity of the data and avoids deviations in the data analysis results caused by missing values.
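The cleaning in step S201 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `clean_records`, the marker sets `MISSING` and `PLACEHOLDERS`, and the `drop_ratio` threshold are all assumptions introduced here, and mode filling is used as one of the filling options the description names.

```python
from collections import Counter

# Assumed markers; a real deployment would take these from its service rules.
MISSING = {None, "", "N/A"}
PLACEHOLDERS = {"-", "TBD", "default"}


def clean_records(records, key_fields, drop_ratio=0.5):
    """Remove duplicates, redundant fields and missing values from dict records."""
    # 1. Duplicate data: records with identical key identifier fields are duplicates.
    seen, unique = set(), []
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))
    if not unique:
        return []

    fields = list(dict.fromkeys(f for rec in unique for f in rec))
    n = len(unique)
    for field in fields:
        values = [rec.get(field) for rec in unique]
        n_placeholder = sum(v in PLACEHOLDERS for v in values)
        n_missing = sum(v in MISSING for v in values)
        # 2. Redundant information or too many missing values: drop the field.
        if n_placeholder / n > drop_ratio or n_missing / n > drop_ratio:
            for rec in unique:
                rec.pop(field, None)
            continue
        # 3. Low proportion of missing values: fill with the mode of the field.
        present = [v for v in values if v not in MISSING]
        if n_missing and present:
            mode = Counter(present).most_common(1)[0][0]
            for rec in unique:
                if rec.get(field) in MISSING:
                    rec[field] = mode
    return unique
```

Called with `key_fields=["id"]`, the sketch keeps the first record for each `id`, drops a field that is mostly placeholders, and fills isolated gaps in the remaining fields.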
S202, unifying data formats of the data source information from which the duplicate data, the redundant information and the missing values are removed;
In this embodiment, the data source information from which the duplicate data, the redundant information and the missing values have been removed is first obtained, and a target data format is then determined for it; the target format can be set according to subsequent data processing requirements. After the target format is determined, the data source information is traversed and checked. For numerical data, the number of decimal places and whether scientific notation is used are checked, and the values are adjusted uniformly to the target format, for example by adjusting all numerical values to a format that retains two decimal places. For date data, the different formats are adjusted uniformly to a standard format, for example by converting 'XXXX year-XX month-XX day' to 'YYYY-MM-DD'. For text data, the character encoding is unified, for example by converting texts in different encodings to UTF-8.
By unifying the data formats, this embodiment reduces the storage complexity and wasted space caused by format differences, avoids errors caused by format problems, reduces the time cost of data conversion caused by format incompatibility, and improves the circulation and usability of the data.
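The three kinds of format unification named in step S202 can be sketched as follows. The helper names `unify_number`, `unify_date` and `unify_text` are hypothetical, and the sketch assumes that dates always appear in year, month, day order and that the source encoding of each text is known in advance.

```python
import re

# Year, month and day separated by any non-digit characters (assumed ordering).
DATE_RE = re.compile(r"(\d{4})\D+(\d{1,2})\D+(\d{1,2})")


def unify_number(text):
    """Return a numeric string with two decimal places; float() also parses scientific notation."""
    return f"{float(text):.2f}"


def unify_date(text):
    """Convert a year/month/day string with arbitrary separators to YYYY-MM-DD."""
    match = DATE_RE.search(text)
    if match is None:
        raise ValueError(f"unrecognized date: {text!r}")
    year, month, day = (int(part) for part in match.groups())
    return f"{year:04d}-{month:02d}-{day:02d}"


def unify_text(raw, source_encoding):
    """Re-encode bytes from a known source encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")
```

For instance, `unify_number("1.5e3")` yields `"1500.00"` and `unify_date("2024/1/5")` yields `"2024-01-05"`.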
S203, extracting the data source information whose data format has been unified, and splicing the extracted data source information with space characters to obtain target data.
In this embodiment, the data source information whose data format has been unified is first obtained and an extraction operation is performed on it, and the extracted data source information is then spliced with space characters. Specifically, starting from the first element of the data, the elements are typically joined in sequence with a space character between them. For example, if the processed data source information contains three elements, "element 1", "element 2" and "element 3", the spliced result is "element 1 element 2 element 3". The target data is obtained once splicing is complete.
By extracting and splicing the data source information whose data format has been unified, this embodiment integrates the scattered, processed data source information into one piece of coherent target data and simplifies the data structure. Compared with scattered data elements, the spliced target data is more convenient to transmit, which reduces data fragmentation during transmission and improves transmission efficiency. The integrated target data is also more convenient to analyze and read as a whole, better reflects the relationships among the data, and allows potential information in the data to be mined more accurately.
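The splicing in step S203 amounts to a join on a space character. The short sketch below uses a hypothetical helper `splice`; skipping empty elements is an added assumption, not something the description states.

```python
def splice(elements):
    """Join the extracted elements in order, separated by single spaces."""
    cleaned = (str(element).strip() for element in elements)
    # Assumption: empty elements are skipped so that no double spaces appear.
    return " ".join(text for text in cleaned if text)
```

With the three elements from the example in the description, `splice(["element 1", "element 2", "element 3"])` returns `"element 1 element 2 element 3"`.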
S204, acquiring a full-scale topic word stock and a department topic word stock;
In this embodiment, step S204 is similar to step S102 of the previous embodiment and is not described again here.
S205, segmenting the target data, and inputting the segmented target data into a full-scale topic word stock to judge whether a topic exists.
In this embodiment, after the full-scale topic word stock and the department topic word stock are obtained, the target data is segmented. If the target data is text data, the segmentation rule may be based on punctuation marks and specific word boundaries; for example, a piece of text-type target data containing multiple sentences may be segmented into multiple clauses at punctuation marks such as periods and commas. After segmentation is complete, the segmented target data is input into the full-scale topic word stock clause by clause for judgment.
In the judging process, each clause can be compared with the topics in the full-scale topic word stock by exact matching or fuzzy matching. Exact matching requires that a word in the clause be completely identical to a topic word. Fuzzy matching can be performed by methods such as calculating semantic similarity: when the semantic similarity between the clause and a topic word reaches a certain threshold, a topic match is considered to exist. For example, if the topic "data analysis" exists in the full-scale topic word stock and a segmented clause is "data analysis work is performed", it can be determined by exact matching or fuzzy matching that the clause contains the topic "data analysis". Step S206 is performed when a topic is determined to exist, and step S207 is performed when no topic is determined to exist.
By judging the topic after the target data is segmented, this embodiment improves the accuracy of topic judgment and avoids misjudgment caused by overlooking part of the target data, and by combining exact matching with fuzzy matching, the topics in the target data can be identified more comprehensively.
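A minimal sketch of the segmentation and matching in step S205 is given below. The function names `segment` and `find_topics` and the threshold of 0.8 are assumptions. The description calls for semantic similarity in fuzzy matching; the sketch substitutes the character-level ratio from Python's `difflib.SequenceMatcher` as a stand-in, and approximates exact matching by case-insensitive substring containment.

```python
import re
from difflib import SequenceMatcher


def segment(text):
    """Split text into clauses at common Western and Chinese punctuation marks."""
    parts = re.split(r"[.,;!?。，；！？]+", text)
    return [part.strip() for part in parts if part.strip()]


def find_topics(target, word_stock, threshold=0.8):
    """Return the topics of word_stock that match any clause of the target data."""
    found = []
    for clause in segment(target):
        lowered = clause.lower()
        for topic in word_stock:
            # Exact matching, approximated here by substring containment.
            exact = topic.lower() in lowered
            # Fuzzy matching: string similarity as a stand-in for semantic similarity.
            fuzzy = SequenceMatcher(None, lowered, topic.lower()).ratio() >= threshold
            if (exact or fuzzy) and topic not in found:
                found.append(topic)
    return found
```

An empty result corresponds to the branch leading to step S207, and a non-empty result to the branch leading to step S206.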
S206, extracting the topic from the full-scale topic word stock;
In this embodiment, step S206 is similar to step S104 of the previous embodiment and is not described again here.
S207, classifying the target data through a preset learning model to obtain a data classification result;
In this embodiment, step S207 is similar to step S105 of the previous embodiment and is not described again here.
S208, when it is determined that a topic exists in the target data, judging whether the extracted topic is unique;
In this embodiment, step S208 is similar to step S106 of the previous embodiment and is not described again here.
S209, taking the topic as the data classification result;
In this embodiment, step S209 is similar to step S107 of the previous embodiment and is not described again here.
S210, when it is determined that the topic is not unique, judging whether the target data has a department topic based on the department topic word stock;
In this embodiment, when it is determined that the topic is not unique, whether the target data has a department topic is judged based on the department topic word stock. The segmented target data is first obtained and compared one by one with the topics in the department topic word stock, and exact matching or fuzzy matching is used in the comparison to determine whether a department topic exists. Exact matching means that the segmented target data is completely identical to a topic in the department topic word stock; fuzzy matching means that semantic similarity is calculated, and a department topic match is determined when the semantic similarity between the segmented target data and a topic word in the department topic word stock reaches a set threshold. Step S211 is executed when a department topic is determined to exist, and step S212 is executed when it is determined not to exist.
Judging, when the topic is not unique, whether the target data has a department topic based on the department topic word stock is a more targeted way of processing. The department topic word stock is constructed according to the business characteristics and requirements of a specific department and can more accurately reflect the data topics relevant to that department. This judgment further clarifies the association between the target data and the department topics, which helps classify the target data more accurately.
S211, extracting the department topic from the department topic word stock;
In this embodiment, when it is determined that the target data has a department topic, the department topic record in the department topic word stock that successfully matches the target data is first found, and the department topic is extracted in full from the department topic word stock according to that record. For example, if the department topic "sales data statistics" in the department topic word stock matches the target data, the department topic "sales data statistics" is extracted.
By extracting the department topic from the department topic word stock, this embodiment provides a clear direction and more targeted topic resources for subsequent data analysis, and further improves the accuracy of the whole data classification flow.
S212, classifying the topics through a preset learning model to obtain a data classification result.
In this embodiment, when it is determined that the target data does not have a department topic, the topics are classified directly through the preset learning model. Before classification, a suitable preset learning model needs to be selected; the preset learning model may be a classification model based on a neural network. After the preset learning model is selected, the extracted topics, which are not unique and do not match any department topic, are provided to it as input. The preset learning model performs feature extraction on the input topics and then analyzes the extracted features and makes a prediction to obtain the data classification result.
According to this embodiment, when the target data cannot be classified through department topics, the classification method based on the preset learning model can handle complex topic relationships and is not limited to the predefined department topic word stock. It can classify according to features and mine deeper relationships in the data, which helps improve the comprehensiveness of data classification and ensures that all data can be reasonably classified.
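The description leaves the preset learning model open, noting only that it may be a neural-network classifier. Purely to illustrate the sequence of feature extraction followed by prediction, the sketch below swaps in a much simpler technique: bag-of-words vectors compared by cosine similarity against one labelled example text per class. The names `vectorize`, `cosine` and `classify_topics`, and the labelled examples, are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Feature extraction: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_topics(topics, labelled_examples):
    """Prediction: return the label whose example text is closest to the topics."""
    query = vectorize(" ".join(topics))
    return max(
        labelled_examples,
        key=lambda label: cosine(query, vectorize(labelled_examples[label])),
    )
```

A trained neural classifier would replace `classify_topics` in practice; the stand-in only shows where the non-unique topics enter and where the classification result comes out.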
S213, when it is determined that the topic is not unique, judging whether the topic overlaps with the department topic;
In this embodiment, step S213 is similar to step S109 of the previous embodiment, and will not be described here again.
S214, taking the topic or the department topic as the data classification result;
In this embodiment, step S214 is similar to step S110 of the previous embodiment and is not described again here.
S215, classifying the topics and the department topics through a preset learning model to obtain a data classification result.
In this embodiment, step S215 is similar to step S111 of the previous embodiment, and will not be described here again.
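The branching among steps S205 to S215 can be summarized in a short sketch. The function `classify` is hypothetical; `model` is a placeholder callable standing in for the preset learning model, matching is reduced to case-insensitive substring containment, and "overlap" is read here as a topic that appears in both matched lists, which is one possible interpretation.

```python
def classify(target, full_stock, dept_stock, model):
    """Follow the decision flow of the method; `model` maps a list of strings to a label."""
    text = target.lower()
    topics = [t for t in full_stock if t.lower() in text]
    if not topics:
        return model([target])            # S207: no topic found, classify the target data
    if len(topics) == 1:
        return topics[0]                  # S209: the unique topic is the result
    dept_topics = [t for t in dept_stock if t.lower() in text]
    if not dept_topics:
        return model(topics)              # S212: no department topic, classify the topics
    overlap = [t for t in topics if t in dept_topics]
    if overlap:
        return overlap[0]                 # S214: overlapping topic is the result
    return model(topics + dept_topics)    # S215: classify topics and department topics
```

Passing the model in as a callable keeps the rule-based branches testable without any trained model.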
The data classification system provided by the present application is described in detail below with reference to fig. 3. Fig. 3 is a schematic diagram of an embodiment of the data classification system provided by the present application, and the system includes:
a first extraction unit 301, configured to acquire data source information and perform data extraction on the data source information to obtain target data;
an obtaining unit 302, configured to obtain a full-scale topic word stock and a department topic word stock;
a first judging unit 303, configured to judge whether a topic exists in the target data based on the full-scale topic word stock;
a second extraction unit 304, configured to extract the topic from the full-scale topic word stock if a topic exists;
a first classification unit 305, configured to classify the target data through a preset learning model if no topic exists, so as to obtain a data classification result;
a second judging unit 306, configured to judge whether the extracted topic is unique when it is determined that a topic exists in the target data;
a second classification unit 307, configured to take the topic as the data classification result if the topic is unique;
a second determining unit 308, configured to determine a department topic of the target data based on the department topic word stock if the topic is not unique;
a third judging unit 309, configured to judge whether the topic overlaps with the department topic when it is determined that the topic is not unique;
a third classification unit 310, configured to take the topic or the department topic as the data classification result if the topic overlaps with the department topic;
a fourth classification unit 311, configured to classify the topic and the department topic through the preset learning model if the topic does not overlap with the department topic, so as to obtain the data classification result.
Optionally, the first extraction unit 301 is specifically configured to:
acquire data source information, perform data processing on the data source information, and extract the data source information after the data processing is completed to obtain target data.
Optionally, the first extraction unit 301 is specifically configured to:
acquire data source information, and identify and remove duplicate data, redundant information and missing values in the data source information;
unify the data format of the data source information from which the duplicate data, the redundant information and the missing values have been removed;
extract the data source information whose data format has been unified, and splice it with space characters to obtain target data.
Optionally, the first judging unit 303 is specifically configured to:
segment the target data, and input the segmented target data into the full-scale topic word stock to judge whether a topic exists.
Optionally, the second determining unit 308 is specifically configured to:
judge whether the target data has a department topic based on the department topic word stock;
if yes, determine the department topic of the target data;
if not, classify the topics through a preset learning model to obtain a data classification result.
Optionally, the full-scale topic word stock includes a primary category, a secondary category and a tertiary category.
Optionally, the primary category covers the e-government knowledge field, the division of department functions and the industry category;
the secondary category and the tertiary category are the knowledge branches to which the topic words in the full-scale topic word stock belong.
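One way to picture a topic word stock with three category levels is a flat list of entries, each carrying its primary, secondary and tertiary category. The sketch below is only an assumed data layout: the class `TopicEntry`, the helpers `build_index` and `category_path`, and the sample category values are all hypothetical and do not come from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicEntry:
    """A topic word together with its three category levels."""
    word: str
    primary: str    # e.g. a knowledge field, department function or industry category
    secondary: str  # knowledge branch under the primary category
    tertiary: str   # finer knowledge branch under the secondary category


def build_index(entries):
    """Index entries by lower-cased topic word for direct lookup."""
    return {entry.word.lower(): entry for entry in entries}


def category_path(index, word):
    """Return (primary, secondary, tertiary) for a topic word, or None if absent."""
    entry = index.get(word.lower())
    if entry is None:
        return None
    return (entry.primary, entry.secondary, entry.tertiary)
```

Storing the category path with each topic word lets a matched topic be reported at whichever level of granularity a downstream consumer needs.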
The present application also provides a data classifying device, referring to fig. 4, fig. 4 is an embodiment of the data classifying device provided by the present application, the device includes:
A processor 401, a memory 402, an input/output unit 403, and a bus 404;
The processor 401 is connected to the memory 402, the input/output unit 403, and the bus 404;
The memory 402 holds a program, and the processor 401 calls the program to execute any of the methods as described above.
The application also relates to a computer readable storage medium having a program stored thereon, characterized in that the program, when run on a computer, causes the computer to perform any of the methods as described above.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.

Claims (9)

1. A method of classifying data, the method comprising:
acquiring data source information, and performing data extraction on the data source information to obtain target data;
acquiring a full-scale topic word stock and a department topic word stock;
judging whether the target data has a topic based on the full-scale topic word stock;
if yes, extracting the topics from the full-scale topic word stock;
if not, classifying the target data through a preset learning model to obtain a data classification result;
when it is determined that the topic exists in the target data, judging whether the extracted topic is unique;
if yes, taking the topic as the data classification result;
if not, determining the department topic of the target data based on the department topic word stock;
when it is determined that the topic is not unique, judging whether the topic is identical to the department topic;
if yes, taking the topic or the department topic as the data classification result;
if not, classifying the topics and the department topics through the preset learning model to obtain the data classification result;
wherein the determining the department topic of the target data based on the department topic word stock comprises:
judging whether the target data has a department topic based on the department topic word stock;
if yes, determining the department topic of the target data;
if not, classifying the topic through the preset learning model to obtain the data classification result.
2. The method of claim 1, wherein the acquiring data source information and performing data extraction on the data source information to obtain target data comprises:
acquiring data source information, performing data processing on the data source information, and extracting the data source information after the data processing is completed to obtain target data.
3. The method of claim 2, wherein the acquiring data source information, performing data processing on the data source information, and extracting the data source information after the data processing is completed to obtain target data comprises:
acquiring data source information, and identifying and removing duplicate data, redundant information and missing values in the data source information;
unifying the data format of the data source information from which the duplicate data, the redundant information and the missing values have been removed;
extracting the data source information whose data format has been unified, and splicing it with space characters to obtain target data.
4. The method of claim 1, wherein the judging whether the target data has a topic based on the full-scale topic word stock comprises:
segmenting the target data, and inputting the segmented target data into the full-scale topic word stock to judge whether a topic exists.
5. The method of any one of claims 1 to 4, wherein the full-scale topic word stock includes a primary category, a secondary category and a tertiary category.
6. The method of claim 5, wherein the primary category covers the e-government knowledge field, the division of department functions and the industry category;
the secondary category and the tertiary category are the knowledge branches to which the topic words in the full-scale topic word stock belong.
7. A data classification system, the system comprising:
a first extraction unit, configured to acquire data source information and perform data extraction on the data source information to obtain target data;
an acquisition unit, configured to acquire a full-scale topic word stock and a department topic word stock;
a first judging unit, configured to judge whether the target data has a topic based on the full-scale topic word stock;
a second extraction unit, configured to extract the topics from the full-scale topic word stock when a topic exists;
a first classification unit, configured to classify the target data through a preset learning model when no topic exists, so as to obtain a data classification result;
a second judging unit, configured to judge whether the extracted topic is unique when it is determined that the topic exists in the target data;
a second classification unit, configured to take the topic as the data classification result when the topic is unique;
a second determining unit, configured to determine the department topic of the target data based on the department topic word stock when the topic is not unique;
a third judging unit, configured to judge whether the topic is identical to the department topic when it is determined that the topic is not unique;
a third classification unit, configured to take the topic or the department topic as the data classification result when the topic is determined to be identical to the department topic;
a fourth classification unit, configured to classify the topic and the department topic through the preset learning model when the topic is determined not to be identical to the department topic, so as to obtain the data classification result;
wherein the second determining unit is specifically configured to:
judge whether the target data has a department topic based on the department topic word stock;
if yes, determine the department topic of the target data;
if not, classify the topic through the preset learning model to obtain the data classification result.
8. A data classification device, the device comprising:
a processor, a memory, an input/output unit, and a bus;
the processor is connected with the memory, the input/output unit and the bus;
the memory holds a program, and the processor invokes the program to perform the method of any one of claims 1 to 6.
9. A computer readable storage medium having a program stored thereon, wherein the program, when executed on a computer, causes the computer to perform the method of any one of claims 1 to 6.
CN202411960482.5A 2024-12-30 2024-12-30 Data classification method, system and related device Active CN119377410B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411960482.5A CN119377410B (en) 2024-12-30 2024-12-30 Data classification method, system and related device

Publications (2)

Publication Number Publication Date
CN119377410A CN119377410A (en) 2025-01-28
CN119377410B true CN119377410B (en) 2025-04-11

Family

ID=94329122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411960482.5A Active CN119377410B (en) 2024-12-30 2024-12-30 Data classification method, system and related device

Country Status (1)

Country Link
CN (1) CN119377410B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268560A (en) * 2017-01-03 2018-07-10 中国移动通信有限公司研究院 A kind of file classification method and device
CN113191123A (en) * 2021-04-08 2021-07-30 中广核工程有限公司 Indexing method and device for engineering design archive information and computer equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB9821787D0 (en) * 1998-10-06 1998-12-02 Data Limited Apparatus for classifying or processing data
CN111241282B (en) * 2020-01-14 2023-09-08 北京百度网讯科技有限公司 Text theme generation method and device and electronic equipment
CN114116955A (en) * 2021-11-22 2022-03-01 武汉大学深圳研究院 A detection system for government service data platform based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant