CN118364112A - Data processing method and system based on large model - Google Patents
- Publication number
- CN118364112A (application CN202410788204.XA)
- Authority
- CN
- China
- Prior art keywords
- file
- files
- feature
- features
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention is applicable to the technical field of data processing and provides a data processing method and system based on a large model. The method comprises the following steps: acquiring uploaded file content and the uploading channel; preprocessing the file, converting the preprocessed file into a feature vector, extracting first features from the feature vector, wherein the first features comprise word occurrence frequency, and classifying the file according to the first features; analyzing the classified file content through a large model to judge whether malicious code or scripts exist and to detect abnormal files deviating from normal characteristics, wherein an abnormal file is a file that does not conform to specifications; and extracting second features from the feature vector, wherein the second features comprise keywords, document structure and language complexity, and rating the file by the second features. This addresses the problems that manual handling of files is error-prone, that files cannot be rated to determine processing priority, and that security detection cannot effectively identify complex malicious content and abnormal behavior.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a data processing method and system based on a large model.
Background
In the digital age, the generation and storage of various types of file data (such as text documents, PDFs and images) are growing rapidly, creating demand for more efficient, accurate, secure and user-friendly file data processing solutions. In the judicial field, the construction of electronic file systems has become necessary: it improves the efficiency and quality of judicial case handling and plays a key role in enhancing judicial fairness and credibility. In recent years, with the continuous improvement of computer hardware performance and the continuous optimization of deep learning algorithms, large models have developed ever faster. Such models, with parameters at the hundred-million scale, are produced by extracting knowledge from corpora or image sets at the hundred-million scale; building industry-specific models on top of these large models is an essential step.
However, in the existing community correction electronic file process, human errors occur easily, files cannot be rated to determine processing priority, and security detection cannot effectively identify complex malicious content and abnormal behavior.
Disclosure of Invention
The embodiment of the invention aims to provide a data processing method and system based on a large model, so as to solve the problems described in the background section above.
The embodiment of the invention is realized in such a way that a data processing method based on a large model comprises the following steps:
acquiring uploaded file content and an uploading channel, wherein the file content comprises legal documents, investigation forms, conversation transcripts, medical condition reports, approval sheets and evidence materials, and the uploading channel is the means by which the file is uploaded;
Preprocessing the file, wherein the preprocessing comprises word segmentation, stop word removal and stemming; converting the preprocessed file into feature vectors; extracting first features from the feature vectors, wherein the first features comprise word occurrence frequency; and classifying the file according to the first features;
Analyzing the classified file content through a large model, judging whether malicious codes or scripts exist or not, and detecting abnormal files deviating from normal characteristics, wherein the abnormal files are files which do not accord with specifications;
and extracting second features in the feature vector, wherein the second features comprise keywords, document structures and language complexity, and grading the documents through the second features.
Preferably, the step of preprocessing the file, wherein the preprocessing comprises word segmentation, stop word removal and stemming, converting the preprocessed file into feature vectors, extracting first features comprising word occurrence frequency, and classifying the file according to the first features, specifically comprises:
preprocessing a file, calculating the occurrence frequency of words in the file, and judging the importance of the words according to the occurrence frequency;
Converting the document into a feature vector through TF-IDF, and extracting a first feature in the feature vector, wherein the dimension of the feature vector is the number of all words;
classifying the files according to the first characteristics to obtain class labels of the files, wherein the class labels are used for determining the types of the files.
Preferably, the step of analyzing the classified file content by a large model, judging whether malicious codes or scripts exist, and detecting an abnormal file deviating from normal characteristics, wherein the abnormal file is a file which does not conform to the specification, specifically includes:
collecting data containing normal files and known malicious files, establishing a first data set, and marking the files of the first data set, wherein the marking comprises marking the normal files and marking the malicious files;
Analyzing the classified file content through a large model, comparing and labeling malicious files, judging whether malicious codes or scripts exist or not, and if so, carrying out labeling;
And comparing the normal files, detecting abnormal files deviating from the normal characteristics, and marking the detected abnormal files.
Preferably, the step of extracting a second feature in the feature vector, where the second feature includes a keyword, a document structure, and a language complexity, and grading the document by the second feature specifically includes:
collecting documents with different importance or sensitivity levels, establishing a second data set, and grading and marking the documents in the second data set;
extracting a second feature in the feature vector, comparing the second feature with labels in a second data set, and judging the rating of the second feature;
And grading the file through the second characteristic, sending a grading result to the terminal, and receiving confirmation information of the terminal.
Preferably, the labels comprise low, medium and high ratings.
It is a further object of embodiments of the present invention to provide a large model based data processing system, the system comprising:
The content acquisition module acquires the uploaded file content and the uploading channel, wherein the file content comprises legal documents, investigation forms, conversation transcripts, medical condition reports, approval sheets and evidence materials, and the uploading channel is the means by which the file is uploaded;
The preprocessing module preprocesses the file, wherein the preprocessing comprises word segmentation, stop word removal and stemming, converts the preprocessed file into a feature vector, extracts first features comprising word occurrence frequency from the feature vector, and classifies the file according to the first features;
The security detection module analyzes the classified file content through the large model, judges whether malicious code or scripts exist, and detects abnormal files deviating from normal characteristics, wherein an abnormal file is a file that does not conform to specifications;
And the rating module extracts second features comprising keywords, document structure and language complexity from the feature vector, and rates the file by the second features.
Preferably, the preprocessing module includes:
the preprocessing unit is used for preprocessing the file, calculating the occurrence frequency of words in the file, and judging the importance of the words according to the occurrence frequency;
the first feature unit converts the document into feature vectors through TF-IDF, and extracts first features in the feature vectors, wherein the dimension of the feature vectors is the number of all words;
and the classification unit classifies the files according to the first characteristics to obtain class labels of the files, wherein the class labels are used for determining the types of the files.
Preferably, the security detection module includes:
The first data set unit is used for collecting data containing normal files and known malicious files, establishing a first data set and marking the files of the first data set, wherein the marking comprises marking the normal files and marking the malicious files;
the malicious file detection unit analyzes the classified file content through the large model, compares and marks malicious files, judges whether malicious codes or scripts exist or not, and performs marking processing if the malicious codes or scripts exist;
An abnormal file detection unit compares the normal files, detects abnormal files deviating from the normal characteristics, and marks the detected abnormal files.
Preferably, the rating module includes:
A second dataset unit for collecting documents of different importance or sensitivity levels, creating a second dataset, and ranking the documents in the second dataset, the ranking comprising a low, medium and high ranking;
the second feature unit is used for extracting second features in the feature vectors, comparing the second features with labels in the second data set and judging the rating of the second features;
and the rating unit is used for rating the file through the second characteristic, sending a rating result to the terminal and receiving the confirmation information of the terminal.
Preferably, the labels comprise low, medium and high ratings.
According to the data processing method based on a large model, the uploaded file content and uploading channel are acquired: the file content is used to determine the file's category, and the uploading channel is used to judge whether the uploaded file is safe. The file is then preprocessed (word segmentation, stop word removal, stemming, etc.) to facilitate feature extraction, and the preprocessed file is converted into feature vectors using TF-IDF. First features are extracted from the feature vectors: the frequency of each word in the file and its inverse document frequency across the whole file set are calculated to measure the word's importance, and the file is classified according to the first features. The classified file content is then analyzed through a large model to identify malicious code, scripts or other harmful content possibly contained in the uploaded file, and to identify and mark abnormal files or behaviors that do not match normal behavior patterns. Finally, second features are extracted from the feature vectors and analyzed to automatically rate the file by indexes such as importance and sensitivity. This addresses the problems that human errors occur easily, that files cannot be rated to determine processing priority, and that complex malicious content and abnormal behavior cannot be effectively detected.
Drawings
FIG. 1 is a flow chart of a data processing method based on a large model according to an embodiment of the present invention;
FIG. 2 is a flowchart of the steps of preprocessing a file (word segmentation, stop word removal and stemming), converting the preprocessed file into feature vectors, extracting first features comprising word occurrence frequency, and classifying the file according to the first features, according to an embodiment of the present invention;
FIG. 3 is a flowchart showing steps for analyzing the content of classified files through a large model, judging whether malicious codes or scripts exist or not, and detecting abnormal files deviating from normal characteristics, wherein the abnormal files are files which do not accord with specifications;
FIG. 4 is a flowchart of a step of extracting second features in feature vectors, the second features including keywords, document structures, and language complexity, and rating a document by the second features according to an embodiment of the present invention;
FIG. 5 is a block diagram of a data processing system based on a large model according to an embodiment of the present invention;
FIG. 6 is a block diagram of a preprocessing module according to an embodiment of the present invention;
FIG. 7 is a block diagram of a security detection module according to an embodiment of the present invention;
fig. 8 is a schematic diagram of a rating module according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It will be understood that the terms "first," "second," and the like, as used herein, may be used to describe various elements, but these elements are not limited by these terms unless otherwise specified. These terms are only used to distinguish one element from another element. For example, a first xx script may be referred to as a second xx script, and similarly, a second xx script may be referred to as a first xx script, without departing from the scope of this disclosure.
As shown in fig. 1, a data processing method based on a large model according to an embodiment of the present invention includes:
s100, acquiring uploaded file content and an uploading channel, wherein the file content comprises legal documents, surveys, talking notes, illness reports, approval sheets and evidence materials, and the uploading channel is an uploading file mode.
In the step, the uploaded file content and uploading channels are obtained, and the file content submitted by the user through different uploading channels is obtained and processed. These file contents include legal documents, surveys, talking points, medical reports, approval sheets, evidence materials, and the upload channel may be a community correction business management system or other upload interface. The method can realize automatic grading and classification of the files and perform security detection, so that the efficiency and the security of management of the electronic files for community correction in the judicial field are improved;
For example, a judicial staff member has uploaded a survey form through the community correction service management system. After the system receives the file, the content of the file is preprocessed, wherein the preprocessing comprises word segmentation, word stopping and word drying. Then, feature extraction is performed on the transcript content using the TF-IDF and TF-IDF models. Next, the system classifies the strokes using the trained classification model and ranks the strokes according to their importance. Finally, the system identifies whether potential security threats exist in the records through a security detection module. If a threat is detected, the system will send an alarm and quarantine the file;
In another example, a judicial staff member receives a legal document through the political-legal integrated collaboration platform. The system extracts the document content, performs preprocessing and feature extraction, and classifies and rates the document using a pre-trained model (e.g., one using TF-IDF features); the document is identified as highly important. The system then performs security checks to ensure the content contains no malicious code or abnormal behavior. After processing, the system archives the document and notifies the relevant departments for further handling.
S200, preprocessing the file, wherein the preprocessing comprises word segmentation, stop word removal and stemming; converting the preprocessed file into feature vectors; extracting first features comprising word occurrence frequency from the feature vectors; and classifying the file according to the first features.
In this step, the uploaded file is preprocessed: the file content is decomposed into words or phrases for subsequent processing, common words useless for the classification task (such as "is" and "at") are removed, and words are converted to their stems or base forms (e.g., "running" becomes "run"). The frequency of each word in the file and its inverse document frequency across the whole file set are calculated via TF-IDF to measure each word's importance. TF-IDF (term frequency-inverse document frequency) is a common weighting technique used in information retrieval and data mining; TF is the term frequency, and IDF is the inverse document frequency index.
The files are converted into feature vectors using TF-IDF, where the dimension of each file vector is the number of words in the vocabulary. First features comprising word occurrence frequency are extracted from the feature vectors, and the files are classified by the large model to obtain category labels, dividing files into six categories: legal documents, investigation forms, conversation transcripts, medical condition reports, approval sheets and evidence materials.
S300, analyzing the classified file content through a large model, judging whether malicious codes or scripts exist or not, and detecting abnormal files deviating from normal characteristics, wherein the abnormal files are files which do not meet the specification.
In this step, the classified file content is analyzed through a large model to identify malicious code, scripts or other harmful content possibly contained in the uploaded file. The large model automatically identifies potential threats by learning the features of a large number of normal and malicious files. A dataset containing normal files and known malicious files is collected; the malicious files may include viruses, Trojans, malicious scripts and the like. Preprocessed malicious scripts are converted into vector representations to capture their potentially malicious features, and files are detected by the large model to judge whether they contain malicious content.
The goal of anomaly detection is to identify abnormal files or behaviors that do not conform to normal behavior patterns. The large model detects abnormal files by learning the features and behaviors of normal files: data from a large number of normal files is collected (ensuring the files represent daily judicial community correction operations), preprocessed files are converted into vector representations to capture potential abnormal features, and files are detected by the large model and predicted as normal or abnormal.
S400, extracting second features in the feature vectors, wherein the second features comprise keywords, document structures and language complexity, and grading the files through the second features.
In this step, second features are extracted from the feature vector. A large number of documents with different importance or sensitivity levels are collected from the judicial community correction service system, and a labeled dataset is established in which documents are marked as low, medium or high grade. The documents are converted via TF-IDF and the second features are extracted; the documents are then rated by the large model through comparison against the labeled dataset and classified into low, medium and high grades, so as to determine processing priority.
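The rating step (S400) can be sketched with a small supervised classifier over TF-IDF features. The training texts and low/medium/high labels below are purely illustrative stand-ins for the labeled dataset described above, not the real corpus:

```python
# Hypothetical rating sketch: a classifier over TF-IDF features assigns
# low/medium/high grades. All sample texts and labels are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = [
    "routine meeting schedule notice",
    "internal approval sheet pending review",
    "evidence material for a criminal case",
]
train_labels = ["low", "medium", "high"]

rating_vectorizer = TfidfVectorizer()
X_train = rating_vectorizer.fit_transform(train_texts)
rating_model = LogisticRegression().fit(X_train, train_labels)

def rate_document(text):
    """Predict a low/medium/high processing-priority grade for a document."""
    return rating_model.predict(rating_vectorizer.transform([text]))[0]

print(rate_document("evidence material for a criminal case"))
```

In practice the comparison against the labeled dataset would be performed by the large model described in the text; this sketch only shows the feature-to-grade mapping.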
As shown in fig. 2, as a preferred embodiment of the present invention, the step of preprocessing the file (word segmentation, stop word removal and stemming), converting the preprocessed file into feature vectors, extracting first features comprising word occurrence frequency, and classifying the file according to the first features specifically includes:
S201, preprocessing the file, calculating the occurrence frequency of words in the file, and judging the importance of the words according to the occurrence frequency.
In this step, the file content is preprocessed: it is decomposed into words or phrases for subsequent processing, common words useless for the classification task (such as "this" and "is") are removed, and words are converted to their stems or base forms (e.g., "running" becomes "run"). For example, a judgment document such as "This is a judgment containing criminal provisions, citing the Criminal Law of the People's Republic of China" might, after preprocessing, be reduced to key tokens such as "judgment", "criminal" and "law".
The frequency of each word in the document and its inverse document frequency across the whole document set are calculated via TF-IDF to measure each word's importance;
Assume a document with the content: "This is an entrustment letter containing a community correction investigation assessment."
After word segmentation it may become: "this is", "a", "contains", "community", "correction", "investigation", "assessment", "entrustment";
After stop word removal it may become: "contains", "community", "correction", "investigation", "assessment", "entrustment";
After stemming it may become: "contain", "community", "correct", "investigate", "assess", "entrust";
The text preprocessing code is as follows:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
# Text preprocessing function
def preprocess(text):
    words = word_tokenize(text)
    words = [word.lower() for word in words]
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word.isalnum() and word not in stop_words]
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in words]
    return ' '.join(words)

# Example documents (illustrative)
documents = ["This is an entrustment letter containing a community correction investigation assessment."]
processed_documents = [preprocess(doc) for doc in documents]
print(processed_documents)
S202, converting the document into a feature vector through TF-IDF, and extracting first features in the feature vector, wherein the dimension of the feature vector is the number of all words.
In this step, the document is converted into a feature vector through TF-IDF: words are mapped into a high-dimensional vector, semantic relations among words are captured, and embedded representations of words in context are obtained, allowing the document's semantics to be better understood;
For example, for three documents:
"this is a survey entrusting function. "
This is a condition review report. "
"This is a transcript of admonish conversations. "
After pretreatment:
[ "investigation", "entrusted function ]
[ "Disease", "review", "report" ] and
[ "Admonish", "talking", "writing" ]
By calculating TF and IDF values, the weight of each word can be obtained, such as:
"entrusting": the TF-IDF value may be higher because it appears in document 1 but is not common in other documents.
"This is": the TF-IDF value is low because it occurs frequently in all documents.
For the three documents preprocessed as described above, assuming the vocabulary includes [ "investigation", "delegate function", "illness", "review", "report", "admonish", "talk", "stroke" ], the TF-IDF vector for each document may be as follows:
document 1 [0.75, 0.25, 0,0, 0,0, 0, 0]
Document 2 [0, 0, 0.5, 0.25, 0.25, 0, 0]
Document 3 [0, 0, 0, 0, 0.5, 0.25, 0.25]
The feature extraction code is:
from sklearn.feature_extraction.text import TfidfVectorizer
# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(processed_documents)
print(X.toarray())
And S203, classifying the files according to the first characteristics to obtain category labels of the files, wherein the category labels are used for determining the types of the files.
In this step, the files are classified according to the first features using a classification model trained on TF-IDF features. The training data comprises the preprocessed document vectors and their corresponding category labels. Through training, the model learns to accurately recognize legal documents, investigation forms, conversation transcripts, medical condition reports, approval sheets and evidence materials, obtaining the category label of each file and judging which of these categories the file belongs to. This facilitates subsequent rating and, by routing each category to a corresponding model, helps ensure rating accuracy.
Suppose we collect 1000 documents: 300 entrustment letters, 400 reports and 300 transcripts, labeled "authentication", "report" and "record" respectively. A newly uploaded file, a transcript of an admonishing conversation, is identified as "record" by the classification model after preprocessing and feature extraction.
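The classification code below relies on a trained `vectorizer` and `model`. A minimal, hypothetical training sketch follows; the three sample documents and the "authentication"/"report"/"record" labels are illustrative stand-ins for the real 1000-document corpus:

```python
# Illustrative training of the document classifier; a real system would
# train on the labeled corpus of entrustment letters, reports and records.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "entrustment letter requesting a community correction investigation",
    "medical condition report for the offender",
    "transcript of an admonishing conversation",
]
train_labels = ["authentication", "report", "record"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs)
model = MultinomialNB().fit(X, train_labels)

new_doc = "transcript of admonishing conversation"
print(model.predict(vectorizer.transform([new_doc]))[0])
```

A multinomial Naive Bayes classifier is used here only because it is a common baseline for TF-IDF text classification; the patent itself does not name a specific classifier.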
The file classification code is:
def classify_file(file_content):
    # vectorizer and model are the trained TF-IDF vectorizer and
    # classification model from the previous steps
    processed_content = preprocess(file_content)
    vectorized_content = vectorizer.transform([processed_content])
    prediction = model.predict(vectorized_content)
    return prediction[0]

# Example file classification
new_file = "This document outlines the terms and conditions of the contract."
file_category = classify_file(new_file)
print("File category:", file_category)
As shown in fig. 3, as a preferred embodiment of the present invention, the step of analyzing the classified file content by a large model, judging whether there is malicious code or script, and detecting an abnormal file deviating from the normal characteristics, wherein the abnormal file is a file not conforming to the specification, specifically includes:
S301, collecting data containing normal files and known malicious files, establishing a first data set, and marking the files of the first data set, wherein marking comprises marking the normal files and marking the malicious files.
In this step, data containing normal files and known malicious files is collected; the malicious files may include viruses, Trojans, malicious scripts and the like. A first dataset is established and the collected files are labeled, marking which files are normal and which are malicious. Suppose 5000 normal files and 1000 malicious files are collected, each labeled with the category "normal" or "malicious".
For example, a malicious script file "<script>alert('Hacked!');</script>" may become "script alert hacked" after preprocessing. The preprocessed malicious script is converted into a vector representation to capture its potentially malicious features.
The first data set preparation code:
# Example malicious file content
malicious_files = [
    "This is a malicious script <script>alert('Hacked!');</script>",
    "Normal document with financial data.",
    "Another normal document with strategic plans."
]
labels = [1, 0, 0]  # 1 = malicious, 0 = normal
S302, analyzing the classified file content through the large model, comparing and labeling malicious files, judging whether malicious codes or scripts exist or not, and if so, performing marking processing.
In this step, the classified file content is analyzed by the large model: a classifier is trained on TF-IDF features so that the model learns to distinguish normal files from malicious files. If, for example, the token sequence "script alert hacked" is detected in a file, the file is judged malicious and marked accordingly.
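The identification function below assumes `vectorizer` and `model` have already been fitted on labeled normal/malicious samples. A hedged training sketch with toy data (the file strings and label convention are illustrative, not from the patent's dataset):

```python
# Illustrative training of the malicious-file detector on preprocessed
# toy samples; labels follow the convention 1 = malicious, 0 = normal.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_files = [
    "script alert hacked",                         # preprocessed malicious script
    "normal document with financial data",
    "another normal document with strategic plans",
    "script alert intrusion",                      # another malicious sample
    "regular business report",
]
train_labels = [1, 0, 0, 1, 0]

vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(train_files), train_labels)

suspect = "new script alert hacked"
print(model.predict(vectorizer.transform([suspect]))[0])
```

Logistic regression is chosen here only as a simple baseline; the source speaks generically of a large model.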
Malicious file identification code:
def detect_malicious_content(file_content):
    processed_content = preprocess(file_content)
    vectorized_content = vectorizer.transform([processed_content])
    prediction = model.predict(vectorized_content)
    return prediction[0]  # 1 = malicious, 0 = normal

# Example file detection
new_file = "This is a new script <script>alert('Hacked!');</script>"
is_malicious = detect_malicious_content(new_file)
print("Malicious" if is_malicious else "Normal")
S303, comparing the normal files, detecting abnormal files deviating from the normal characteristics, and marking the detected abnormal files.
In this step, the normal files are compared: feature data from a large number of normal files is collected, and a portion of abnormal files is labeled;
Abnormal file marking code:
# Example normal files and abnormal file
normal_files = [
    "Normal financial report data.",
    "Another normal document.",
    "Regular business report."
]
anomalous_files = [
    "Suspicious script with strange behavior <script>alert('Intrusion');</script>"
]
labels = [0, 0, 0, 1]  # 0 indicates normal, 1 indicates abnormal
all_files = normal_files + anomalous_files
Assuming that 10,000 normal files are collected and 100 abnormal files are labeled: for an abnormal document "Unauthorized access attempt detected", preprocessing may produce "unauthor access attempt detect". The preprocessed abnormal document is converted into a vector representation to capture its potential abnormal features, and the trained model is used to detect whether a newly uploaded file is abnormal.
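The detection code that follows relies on an `oc_svm` estimator. A minimal training sketch, fitting scikit-learn's OneClassSVM on TF-IDF features of normal files only (the estimator choice, the `nu` value, and the sample data are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Illustrative normal-file corpus (assumed data): the one-class model
# learns the region of feature space occupied by normal files
normal_files = [
    "Normal financial report data.",
    "Another normal document.",
    "Regular business report.",
]

vectorizer = TfidfVectorizer()
X_normal = vectorizer.fit_transform(normal_files)

# nu bounds the fraction of training samples treated as outliers
oc_svm = OneClassSVM(kernel="rbf", gamma="auto", nu=0.1)
oc_svm.fit(X_normal)
```

Once fitted, `oc_svm.predict` returns 1 for files resembling the normal training data and -1 for outliers, which is the convention the detection function below assumes.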
Abnormal file detection code:
def detect_anomalous_content(file_content):
    # oc_svm is assumed to be a One-Class SVM fitted on normal-file features
    processed_content = preprocess(file_content)
    vectorized_content = vectorizer.transform([processed_content])
    prediction = oc_svm.predict(vectorized_content)
    return prediction[0]  # 1 denotes normal, -1 denotes abnormal
# Example file detection
new_file = "This is a new suspicious script <script>alert('Intrusion');</script>"
is_anomalous = detect_anomalous_content(new_file)
print("Anomalous" if is_anomalous == -1 else "Normal")
In this way, judicial departments can achieve efficient malicious-content detection and anomaly detection, improving the security of file management and preventing potential security threats.
As shown in fig. 4, as a preferred embodiment of the present invention, the step of extracting the second feature in the feature vector, where the second feature includes keywords, document structure and language complexity, and rating the file by the second feature specifically includes:
S401, collecting documents with different importance or sensitivity levels, establishing a second data set, and grading and marking the documents in the second data set.
In this step, documents with different importance or sensitivity levels are collected: a large number of such documents are gathered from the community correction service system, rated and labeled, and a second data set is established. For example, a batch of documents including common reports, confidential documents and sensitive personal data is collected and labeled with low, medium and high ratings respectively;
a second data set code:
import numpy as np
# Suppose we already have the TF-IDF features of the documents and the corresponding rating labels
# Feature matrix
X = tfidf_matrix.toarray()
# Rating label (1 for high importance, 0 for low importance)
y = np.array([1, 1, 0])
print(X)
print(y)
And S402, extracting a second feature in the feature vector, comparing the second feature with labels in a second data set, and judging the rating of the second feature.
In this step, the second feature in the feature vector is extracted, obtaining the keywords, document structure and language complexity of the document. For example, for a document "confidential document", the TF-IDF values across all documents may be: "confidential": high; "document": medium. A pre-trained TF-IDF model is used to convert words into high-dimensional vectors and capture the semantic relations among them. After "confidential" and "document" are each converted into word vectors, a vector representation of the document can be obtained by a weighted average of these vectors or in other ways; encoding "confidential document" with TF-IDF generates an embedded vector for the whole document.
For example, for a newly uploaded document whose second feature is "this is a report containing a major condition review", preprocessing may yield "contain major condition review report". The trained TF-IDF model converts this into a vector representation, and the model predicts that the second feature is rated "high", because "major condition review" carries high-weight features.
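The rating step described above can be sketched end-to-end as follows; the sample documents, their labels, and the naive Bayes classifier standing in for the rating model are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative rated corpus (assumed data)
documents = [
    "confidential document with sensitive personal data",
    "report containing major condition review",
    "ordinary weekly report",
]
ratings = ["high", "high", "low"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

rating_model = MultinomialNB()
rating_model.fit(X, ratings)

# Rate a newly uploaded document from its TF-IDF features
new_doc = "this is a report containing major condition review"
rating = rating_model.predict(vectorizer.transform([new_doc]))[0]
```

Here the new document shares its distinctive terms with the "high"-rated training sample, so the model assigns it the "high" rating.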
S403, grading the files through the second characteristics, sending grading results to the terminal, and receiving confirmation information of the terminal.
In this step, the files are rated through the second feature, the rating result is sent to the terminal, and after the terminal confirms that the rating is correct, its confirmation information is received. This implements an automatic rating function and determines the processing priority.
The system monitors the file uploading process in real time and provides rating results promptly, helping judicial community-correction staff manage and classify files. In this way, efficient file rating is achieved, and the automation and accuracy of file management are improved.
As shown in FIG. 5, a data processing system based on a large model according to an embodiment of the present invention includes:
The content acquisition module 100 is configured to acquire uploaded file content and an uploading channel, where the file content includes legal documents, surveys, talking notes, illness reports, approval sheets, and evidence materials, and the uploading channel is a file uploading mode.
In the system, the content acquisition module 100 acquires uploaded file content and uploading channels, and acquires and processes file content submitted by staff through different uploading channels. These file contents include legal documents, surveys, talking points, medical reports, approval sheets, evidence materials, and the upload channel may be a community correction business management system or other upload interface. The method can realize automatic grading and classification of the files and perform security detection, thereby improving the efficiency and security of judicial file management;
A judicial staff member uploads a survey form through the community correction service management system. After the system receives the file, the file content is preprocessed, where the preprocessing includes word segmentation, stop-word removal and stemming. Then, feature extraction is performed on the transcript content using the TF-IDF model. Next, the system classifies the transcript using the trained classification model and rates it according to its importance. Finally, the system identifies whether a potential security threat exists in the record through the security detection module. If a threat is detected, the system sends an alarm and quarantines the file;
A judicial staff member receives a legal document through the political-legal integrated collaboration platform. The system extracts the content from the document and performs preprocessing and feature extraction. The document is classified and rated using a pre-trained large model (e.g., TF-IDF) and is identified as a high-importance document. The system then performs security checks to ensure that the document content does not contain any malicious code or abnormal behavior. After the process is completed, the system archives the document and notifies the relevant departments for further processing.
The preprocessing module 200 is configured to preprocess a file, where the preprocessing includes word segmentation, stop-word removal and stemming, convert the preprocessed file into feature vectors, extract first features in the feature vectors, where the first features include word occurrence frequencies, and classify the file according to the first features.
In the present system, the preprocessing module 200 preprocesses the uploaded file: it decomposes the file content into words or phrases for subsequent processing, removes common words useless for the classification task, such as "is", "the", etc., and converts words into their stems or base forms, for example converting "running" into "run". It then calculates, through TF-IDF, the frequency of each word in the file and its inverse document frequency over the whole file set to measure the word's importance. TF-IDF (term frequency-inverse document frequency) is a common weighting technique for information retrieval and data mining: TF is the term frequency and IDF is the inverse document frequency;
The documents are converted into feature vectors using TF-IDF, where the dimension of each document vector is the number of words in the vocabulary. First features, including word occurrence frequencies, are extracted from the feature vectors and classified through the large model to obtain category labels for the documents, according to which the documents are divided into the categories of legal documents, surveys, conversation transcripts, illness reports, approval tables and evidence materials.
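The classification by first features can be sketched as follows; the per-category sample texts, the label names, and the naive Bayes classifier standing in for the large model are illustrative assumptions (real training data would contain many documents per category):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# One illustrative training sample per category (assumed data)
training_docs = [
    "court judgment ruling legal document",
    "community correction investigation survey assessment",
    "admonition talk transcript record",
    "illness condition review report",
    "examination approval form table",
    "evidence material exhibit list",
]
categories = [
    "legal document", "survey", "transcript",
    "illness report", "approval table", "evidence material",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(training_docs)

category_model = MultinomialNB()
category_model.fit(X, categories)

# Classify a newly uploaded file by its TF-IDF first features
new_file = "transcript of an admonition talk"
category = category_model.predict(vectorizer.transform([new_file]))[0]
```

The new file shares its distinctive terms with the transcript sample, so the model assigns it the "transcript" category label.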
The security detection module 300 is configured to analyze the classified file content through a large model, determine whether malicious codes or scripts exist, and detect an abnormal file deviating from normal characteristics, where the abnormal file is a file that does not conform to a specification.
In the present system, the security detection module 300 analyzes the classified file contents through a large model, and identifies malicious codes, scripts, or other harmful contents that may be contained in the uploaded file. The large model automatically identifies potential threats by learning a large number of normal and malicious file features. A dataset is collected that contains normal files and known malicious files. The malicious files can comprise viruses, trojans, malicious scripts and the like, the preprocessed malicious scripts are converted into vector representations, potential malicious features of the malicious scripts are captured, and the files are detected through a large model to judge whether the files contain malicious content or not;
The goal of anomaly detection is to identify an anomaly file or behavior that does not conform to the normal behavior pattern. The large model can detect abnormal files by learning the characteristics and behaviors of normal files, collect data of a large number of normal files, ensure that the files represent daily operations corrected by judicial communities, convert preprocessed abnormal files into vector representations, capture potential abnormal characteristics of the abnormal files, detect the files through the large model, and predict the files as normal or abnormal.
And the rating module 400 is used for extracting second features in the feature vector, wherein the second features comprise keywords, document structures and language complexity, and the files are rated through the second features.
In the system, the rating module 400 extracts the second feature in the feature vector. A large number of documents with different importance or sensitivity levels are collected from the judicial community correction service system to establish a labeled data set, with the documents marked as low, medium and high ratings respectively. The second feature is extracted by converting the documents into TF-IDF vectors; the large model then rates each document against the labeled data set, dividing the documents into low, medium and high ratings and thereby determining the processing priority.
As shown in fig. 6, as a preferred embodiment of the present invention, the preprocessing module 200 includes:
a preprocessing unit 201, configured to preprocess a file, calculate a frequency of occurrence of a word in the file, and determine importance of the word according to the frequency of occurrence.
In this module, the preprocessing unit 201 preprocesses the file: it decomposes the file content into words or phrases for subsequent processing, removes common words useless for the classification task, such as "is", etc., and converts words into their stems or base forms, for example converting "running" into "run". For example, a document "this is a decision book containing the 'Criminal Law of the People's Republic of China' and the 'Criminal Procedure Law of the People's Republic of China'" may, after preprocessing, be reduced to key terms such as "contain criminal law criminal procedure law decision".
Calculating the frequency of each word in the document and the inverse document frequency of each word in the whole document set through TF-IDF to measure the importance of each word;
Assume there is a document with the content: "This is a commission letter containing a community correction investigation and evaluation."
After word segmentation, this may become: "this", "is", "a", "containing", "community", "correction", "investigation", "evaluation", "commission";
After stop-word removal: "containing", "community", "correction", "investigation", "evaluation", "commission";
After stemming, the words may become: "contain", "communiti", "correct", "investig", "evalu", "commiss";
The text preprocessing code is as follows:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
# Text preprocessing function
def preprocess(text):
    words = word_tokenize(text)
    words = [word.lower() for word in words]
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word.isalnum() and word not in stop_words]
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in words]
    return ' '.join(words)
# documents is assumed to be the list of raw document strings
processed_documents = [preprocess(doc) for doc in documents]
print(processed_documents)
A first feature unit 202, configured to convert the document into a feature vector through TF-IDF, and extract a first feature in the feature vector, where the dimension of the feature vector is the number of all words.
In the module, the first feature unit 202 converts the document into a feature vector through TF-IDF, converts words into a high-dimensional vector, captures semantic relations among words, acquires embedded representations of the words in context, and better understands the semantics of the document;
For example, for three documents:
"This is an investigation commission letter."
"This is an illness review report."
"This is a transcript of an admonition talk."
After preprocessing:
["investigation", "commission"]
["illness", "review", "report"]
["admonish", "talk", "transcript"]
By calculating TF and IDF values, the weight of each word can be obtained, such as:
"entrusting": the TF-IDF value may be higher because it appears in document 1 but is not common in other documents.
"This is": the TF-IDF value is low because it occurs frequently in all documents.
For the three documents preprocessed as described above, assuming the vocabulary is ["investigation", "commission", "illness", "review", "report", "admonish", "talk", "transcript"], the TF-IDF vector for each document may be as follows:
Document 1: [0.75, 0.25, 0, 0, 0, 0, 0, 0]
Document 2: [0, 0, 0.5, 0.25, 0.25, 0, 0, 0]
Document 3: [0, 0, 0, 0, 0, 0.5, 0.25, 0.25]
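As a cross-check on how such weights arise, the raw TF-IDF definition can be computed directly. The vector values above are illustrative; the textbook formula below gives an exact weight for a term, and scikit-learn's TfidfVectorizer additionally applies smoothing and L2 normalization, so its values differ:

```python
import math

# The three preprocessed documents as token lists
docs = [
    ["investigation", "commission"],
    ["illness", "review", "report"],
    ["admonish", "talk", "transcript"],
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)           # term frequency in this document
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    idf = math.log(len(corpus) / df)          # inverse document frequency
    return tf * idf

weight = tf_idf("commission", docs[0], docs)  # 0.5 * ln(3)
```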
The feature extraction code is:
from sklearn.feature_extraction.text import TfidfVectorizer
# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(processed_documents)
print(X.toarray())
The classifying unit 203 is configured to classify the file according to the first feature, and obtain a class label of the file, where the class label is used to determine a type of the file.
In this module, the classification unit 203 classifies the documents according to the first feature, using a large model, which may be a TF-IDF-based model; the training data includes the preprocessed document vectors and the corresponding class labels. Through training, the model can accurately identify legal documents, surveys, conversation transcripts, illness reports, approval tables and evidence materials, obtain the class label of a file, and judge which of these categories the file belongs to. This facilitates subsequent rating, whose accuracy is ensured by the corresponding large model;
Suppose we collect 1,000 documents: 300 commission letters, 400 reports and 300 transcripts, labeled "commission", "report" and "transcript" respectively. A newly uploaded file, a transcript of an admonition talk, is identified as "transcript" by the classification model after preprocessing and feature extraction.
The file classification codes are:
def classify_file(file_content):
    processed_content = preprocess(file_content)
    vectorized_content = vectorizer.transform([processed_content])
    prediction = model.predict(vectorized_content)
    return prediction[0]
# Example file classification
new_file = "This document outlines the terms and conditions of the contract."
file_category = classify_file(new_file)
print("File category:", file_category)
As shown in fig. 7, as a preferred embodiment of the present invention, the security detection module 300 includes:
The first data set unit 301 is configured to collect data including normal files and known malicious files, establish a first data set, and label the files of the first data set, where the labeling includes labeling the normal files and labeling the malicious files.
In this module, the first data set unit 301 collects data including normal files and known malicious files, where the malicious files may include viruses, Trojans, malicious scripts, etc. A first data set is established and the collected files are labeled, indicating which are normal files and which are malicious files. Assuming 5,000 normal files and 1,000 malicious files are collected, each file is labeled with the category "normal" or "malicious";
After preprocessing, a malicious script file "<script>alert('Hacked!');</script>" may become "script alert hacked". The preprocessed malicious script is then converted into a vector representation, capturing its potential malicious features.
The first data set preparation code:
# Example file contents (one malicious, two normal)
files = [
    "This is a malicious script <script>alert('Hacked!');</script>",
    "Normal document with financial data.",
    "Another normal document with strategic plans."
]
labels = [1, 0, 0]  # 1 indicates malicious, 0 indicates normal
The malicious file detection unit 302 is configured to analyze the classified file content through the large model, compare it against the labeled malicious files, determine whether malicious code or scripts exist, and if so, perform marking processing.
In this module, the malicious file detection unit 302 analyzes the classified file content through the large model. Training is performed on TF-IDF features, so that the model learns the distinction between normal files and malicious files. If content such as "script alert hacked" is detected in a file, the file is determined to be malicious and is marked accordingly.
Malicious file identification code:
def detect_malicious_content(file_content):
    # vectorizer and model are assumed to be a fitted TfidfVectorizer and trained classifier
    processed_content = preprocess(file_content)
    vectorized_content = vectorizer.transform([processed_content])
    prediction = model.predict(vectorized_content)
    return prediction[0]  # 1 indicates malicious, 0 indicates normal
# Example file detection
new_file = "This is a new script <script>alert('Hacked!');</script>"
is_malicious = detect_malicious_content(new_file)
print("Malicious" if is_malicious else "Normal")
An abnormal file detecting unit 303 for comparing the normal files, detecting the abnormal files deviating from the normal characteristics, and performing a marking process on the detected abnormal files.
In this module, the abnormal file detection unit 303 compares normal files: feature data from a large number of normal files is collected, and a portion of abnormal files is labeled;
Abnormal file marking code:
# Example normal files and abnormal file
normal_files = [
    "Normal financial report data.",
    "Another normal document.",
    "Regular business report."
]
anomalous_files = [
    "Suspicious script with strange behavior <script>alert('Intrusion');</script>"
]
labels = [0, 0, 0, 1]  # 0 indicates normal, 1 indicates abnormal
all_files = normal_files + anomalous_files
Assuming that 10,000 normal files are collected and 100 abnormal files are labeled: for an abnormal document "Unauthorized access attempt detected", preprocessing may produce "unauthor access attempt detect". The preprocessed abnormal document is converted into a vector representation to capture its potential abnormal features, and the trained model is used to detect whether a newly uploaded file is abnormal.
Abnormal file detection code:
def detect_anomalous_content(file_content):
    # oc_svm is assumed to be a One-Class SVM fitted on normal-file features
    processed_content = preprocess(file_content)
    vectorized_content = vectorizer.transform([processed_content])
    prediction = oc_svm.predict(vectorized_content)
    return prediction[0]  # 1 denotes normal, -1 denotes abnormal
# Example file detection
new_file = "This is a new suspicious script <script>alert('Intrusion');</script>"
is_anomalous = detect_anomalous_content(new_file)
print("Anomalous" if is_anomalous == -1 else "Normal")
In this way, judicial departments can achieve efficient malicious-content detection and anomaly detection, improving the security of file management and preventing potential security threats.
As shown in fig. 8, as a preferred embodiment of the present invention, the evaluation module 400 includes:
A second data set unit 401, configured to collect documents with different importance or sensitivity levels, establish a second data set, and rank the documents in the second data set.
In the present module, the second data set unit 401 collects documents with different importance or sensitivity levels: a large number of such documents are gathered from the business system, rated and labeled, and a second data set is established. For example, a batch of documents including general reports, confidential documents and sensitive financial data is collected and labeled with low, medium and high ratings respectively;
a second data set code:
import numpy as np
# Suppose we already have the TF-IDF features of the documents and the corresponding rating labels
# Feature matrix
X = tfidf_matrix.toarray()
# Rating label (1 for high importance, 0 for low importance)
y = np.array([1, 1, 0])
print(X)
print(y)
And a second feature unit 402, configured to extract a second feature in the feature vector, compare the second feature with labels in the second dataset, and determine a rating of the second feature.
In this module, the second feature unit 402 extracts the second feature in the feature vector, obtaining the keywords, document structure and language complexity of the document. For example, for a document "confidential document", the TF-IDF values across all documents may be: "confidential": high; "document": medium. A pre-trained TF-IDF model is used to convert words into high-dimensional vectors and capture the semantic relations among them. After "confidential" and "document" are each converted into word vectors, a vector representation of the document can be obtained by a weighted average of these vectors or in other ways; encoding "confidential document" with TF-IDF generates an embedded vector for the whole document.
For example, for a newly uploaded document whose second feature is "this is a report containing a major condition review", preprocessing may yield "contain major condition review report". The trained TF-IDF model converts this into a vector representation, and the model predicts that the second feature is rated "high", because "major condition review" carries high-weight features.
And the rating unit 403 is configured to rate the file according to the second characteristic, send a rating result to the terminal, and receive confirmation information of the terminal.
In this module, the rating unit 403 rates the file through the second feature, sends the rating result to the terminal, and after the terminal confirms that the rating is correct, receives the terminal's confirmation information. This implements an automatic rating function and determines the processing priority.
The system monitors the file uploading process in real time and provides rating results promptly, helping judicial staff manage and classify files. In this way, efficient file rating is achieved, and the automation and accuracy of file management are improved.
In one embodiment, a computer device is presented, the computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring uploaded file content and an uploading channel, wherein the file content comprises legal documents, surveys, talking notes, illness reports, approval sheets and evidence materials, and the uploading channel is an uploading file mode;
Preprocessing a file, wherein the preprocessing comprises word segmentation, stop-word removal and stemming, converting the preprocessed file into feature vectors, extracting first features in the feature vectors, wherein the first features comprise word occurrence frequencies, and classifying the file according to the first features;
Analyzing the classified file content through a large model, judging whether malicious codes or scripts exist or not, and detecting abnormal files deviating from normal characteristics, wherein the abnormal files are files which do not accord with specifications;
and extracting second features in the feature vector, wherein the second features comprise keywords, document structures and language complexity, and grading the documents through the second features.
In one embodiment, a computer readable storage medium is provided, having a computer program stored thereon, which when executed by a processor causes the processor to perform the steps of:
acquiring uploaded file content and an uploading channel, wherein the file content comprises legal documents, surveys, talking notes, illness reports, approval sheets and evidence materials, and the uploading channel is an uploading file mode;
Preprocessing a file, wherein the preprocessing comprises word segmentation, stop-word removal and stemming, converting the preprocessed file into feature vectors, extracting first features in the feature vectors, wherein the first features comprise word occurrence frequencies, and classifying the file according to the first features;
Analyzing the classified file content through a large model, judging whether malicious codes or scripts exist or not, and detecting abnormal files deviating from normal characteristics, wherein the abnormal files are files which do not accord with specifications;
and extracting second features in the feature vector, wherein the second features comprise keywords, document structures and language complexity, and grading the documents through the second features.
It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in various embodiments may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages need not be performed in sequence, but may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a non-volatile computer readable storage medium, and where the program, when executed, may include processes in the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link (SYNCHLINK) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.
Claims (10)
1. A method of data processing based on a large model, the method comprising:
acquiring uploaded file content and an uploading channel, wherein the file content comprises legal documents, surveys, talking notes, illness reports, approval sheets and evidence materials, and the uploading channel is an uploading file mode;
Preprocessing a file, wherein the preprocessing comprises word segmentation, stop-word removal and stemming, converting the preprocessed file into feature vectors, extracting first features in the feature vectors, wherein the first features comprise word occurrence frequencies, and classifying the file according to the first features;
Analyzing the classified file content through a large model, judging whether malicious codes or scripts exist or not, and detecting abnormal files deviating from normal characteristics, wherein the abnormal files are files which do not accord with specifications;
and extracting second features in the feature vector, wherein the second features comprise keywords, document structures and language complexity, and grading the documents through the second features.
2. The large-model-based data processing method of claim 1, wherein the step of preprocessing the file, the preprocessing comprising word segmentation, stop-word removal, and stemming, converting the preprocessed file into a feature vector, extracting first features comprising word occurrence frequencies, and classifying the file according to the first features specifically comprises:
preprocessing the file, calculating the occurrence frequency of each word in the file, and judging the importance of a word according to its occurrence frequency;
converting the file into a feature vector through TF-IDF and extracting the first features from the feature vector, wherein the dimension of the feature vector equals the number of distinct words; and
classifying the files according to the first features to obtain a class label for each file, wherein the class label determines the type of the file.
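The TF-IDF vectorization recited in claim 2 can be sketched in Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the whitespace tokenizer, the smoothed IDF term, and the example corpus are all assumptions, and the claimed stop-word removal and stemming steps are omitted for brevity.

```python
import math
from collections import Counter

def tokenize(text):
    # Stand-in for the claimed word segmentation; stop-word removal and
    # stemming would follow here in a fuller pipeline.
    return text.lower().split()

def tf_idf_vectors(docs):
    """Convert each document into a TF-IDF feature vector whose dimension
    equals the number of distinct words in the corpus, as claim 2 recites."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([
            (tf[w] / len(toks)) * math.log((1 + n) / (1 + df[w]))
            for w in vocab
        ])
    return vocab, vectors
```

A downstream classifier would then assign a class label to each vector; with the smoothed IDF above, a word that appears in every document receives weight zero, so only discriminative words contribute to classification.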
3. The large-model-based data processing method of claim 2, wherein the step of analyzing the classified file content through the large model, determining whether malicious code or scripts are present, and detecting abnormal files that deviate from normal characteristics, an abnormal file being a file that does not conform to the specification, specifically comprises:
collecting data comprising normal files and known malicious files, establishing a first data set, and labeling the files in the first data set, the labeling comprising labeling the normal files and labeling the malicious files;
analyzing the classified file content through the large model, comparing it against the labeled malicious files, determining whether malicious code or scripts are present, and if so, labeling the file; and
comparing against the normal files, detecting abnormal files that deviate from the normal characteristics, and labeling the detected abnormal files.
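As a rough stand-in for the anomaly check in claim 3, the labeled normal files of the first data set can define a statistical profile against which new files are scored. The single numeric feature and the z-score threshold below are illustrative assumptions; the claim itself performs this analysis with a large model rather than a fixed statistic.

```python
import statistics

def fit_normal_profile(normal_feature_values):
    """Learn mean/stdev of a numeric file feature from the labeled
    normal files of the first data set."""
    return (statistics.mean(normal_feature_values),
            statistics.stdev(normal_feature_values))

def flag_abnormal(feature_values, profile, threshold=3.0):
    """Return indices of files whose feature deviates from the normal
    profile by more than `threshold` standard deviations."""
    mean, stdev = profile
    return [i for i, v in enumerate(feature_values)
            if stdev and abs(v - mean) / stdev > threshold]
```

Files flagged this way would then be marked as abnormal, mirroring the labeling step of the claim.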
4. The large-model-based data processing method of claim 3, wherein the step of extracting the second features from the feature vector, the second features comprising keywords, document structure, and language complexity, and rating the file according to the second features specifically comprises:
collecting files of different importance or sensitivity levels, establishing a second data set, and assigning rating labels to the files in the second data set;
extracting the second features from the feature vector, comparing them against the labels in the second data set, and determining the rating indicated by the second features; and
rating the file according to the second features, sending the rating result to a terminal, and receiving confirmation information from the terminal.
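The rating step of claim 4 can be caricatured with a rule-based scorer: keyword hits stand in for the keyword feature, and average word length is a crude proxy for language complexity. The thresholds are assumptions, and the document-structure feature is omitted; the patent determines the rating by comparison with the labeled second data set rather than by fixed rules.

```python
def grade_document(text, sensitive_keywords):
    """Assign a low/medium/high rating from simple second features."""
    words = text.lower().split()
    # Keyword feature: count of sensitive-keyword occurrences.
    hits = sum(1 for w in words if w in sensitive_keywords)
    # Crude language-complexity proxy: average word length.
    avg_len = sum(len(w) for w in words) / max(len(words), 1)
    score = hits + (1 if avg_len > 6 else 0)
    if score >= 3:
        return "high"
    if score >= 1:
        return "medium"
    return "low"
```

The result would then be sent to the terminal for confirmation, as the claim recites.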
5. The large-model-based data processing method of claim 4, wherein the rating labels comprise three ratings: low, medium, and high.
6. A large model-based data processing system, the system comprising:
a content acquisition module, which acquires uploaded file content and an upload channel, wherein the file content comprises legal documents, investigation reports, interview transcripts, medical reports, approval forms, and evidence materials, and the upload channel is the mode by which the file is uploaded;
a preprocessing module, which preprocesses the file, the preprocessing comprising word segmentation, stop-word removal, and stemming, converts the preprocessed file into a feature vector, extracts first features comprising word occurrence frequencies from the feature vector, and classifies the file according to the first features;
a security detection module, which analyzes the classified file content through the large model, determines whether malicious code or scripts are present, and detects abnormal files that deviate from normal characteristics, wherein an abnormal file is a file that does not conform to the specification; and
a rating module, which extracts second features comprising keywords, document structure, and language complexity from the feature vector, and rates the file according to the second features.
7. The large model based data processing system of claim 6, wherein the preprocessing module comprises:
a preprocessing unit, which preprocesses the file, calculates the occurrence frequency of each word in the file, and judges the importance of a word according to its occurrence frequency;
a first feature unit, which converts the file into a feature vector through TF-IDF and extracts the first features from the feature vector, wherein the dimension of the feature vector equals the number of distinct words; and
a classification unit, which classifies the files according to the first features to obtain a class label for each file, wherein the class label determines the type of the file.
8. The large model based data processing system of claim 7, wherein the security detection module comprises:
a first data set unit, which collects data comprising normal files and known malicious files, establishes a first data set, and labels the files in the first data set, the labeling comprising labeling the normal files and labeling the malicious files;
a malicious file detection unit, which analyzes the classified file content through the large model, compares it against the labeled malicious files, determines whether malicious code or scripts are present, and if so, labels the file; and
an abnormal file detection unit, which compares against the normal files, detects abnormal files that deviate from the normal characteristics, and labels the detected abnormal files.
9. The large model based data processing system of claim 8, wherein the ranking module comprises:
a second data set unit, which collects files of different importance or sensitivity levels, establishes a second data set, and assigns rating labels to the files in the second data set, the ratings comprising low, medium, and high;
a second feature unit, which extracts the second features from the feature vector, compares them against the labels in the second data set, and determines the rating indicated by the second features; and
a rating unit, which rates the file according to the second features, sends the rating result to a terminal, and receives confirmation information from the terminal.
10. The large-model-based data processing system of claim 9, wherein the rating labels comprise three ratings: low, medium, and high.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410788204.XA CN118364112A (en) | 2024-06-19 | 2024-06-19 | Data processing method and system based on large model |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118364112A true CN118364112A (en) | 2024-07-19 |
Family
ID=91884985
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410788204.XA Withdrawn CN118364112A (en) | 2024-06-19 | 2024-06-19 | Data processing method and system based on large model |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118364112A (en) |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102819604A (en) * | 2012-08-20 | 2012-12-12 | 徐亮 | Method for retrieving confidential information of file and judging and marking security classification based on content correlation |
| WO2019035765A1 (en) * | 2017-08-14 | 2019-02-21 | Dathena Science Pte. Ltd. | Methods, machine learning engines and file management platform systems for content and context aware data classification and security anomaly detection |
| KR20210013433A (en) * | 2019-07-25 | 2021-02-04 | 호서대학교 산학협력단 | TF-IDF-based Vector Conversion and Data Analysis Apparatus and Method |
| CN116720504A (en) * | 2023-04-21 | 2023-09-08 | 华北理工大学 | A system and method for statistical analysis of text data based on natural language processing |
| CN116975863A (en) * | 2023-07-10 | 2023-10-31 | 福州大学 | Malicious code detection method based on convolutional neural network |
| CN117993393A (en) * | 2024-03-06 | 2024-05-07 | 上海商保通健康科技有限公司 | Method, device and system for checking online labeling policy terms based on word and sentence vectors |
Non-Patent Citations (1)
| Title |
|---|
| HUANG Cheng; LIU Jiayong; LIU Liang; HE Xiang; TANG Dianhua: "Research on a Context-Semantics-Based Corpus Extraction Model for Malicious Domain Names", Computer Engineering and Applications, no. 09, 29 August 2017 (2017-08-29) * |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN117707922B (en) | Method and device for generating test case, terminal equipment and readable storage medium | |
| CN116663549B (en) | Digitized management method, system and storage medium based on enterprise files | |
| CN112819003B (en) | Method and device for improving OCR recognition accuracy of physical examination report | |
| CN117454426A (en) | Method, device and system for desensitizing and collecting information of claim settlement data | |
| Vanamala et al. | Topic modeling and classification of common vulnerabilities and exposures database | |
| US20250106242A1 (en) | Predicting security vulnerability exploitability based on natural language processing and source code analysis | |
| CN115618085B (en) | Interface data exposure detection method based on dynamic tag | |
| Fuglsby et al. | Elucidating the relationships between two automated handwriting feature quantification systems for multiple pairwise comparisons | |
| CN120277678B (en) | Sensitive information detection method and system based on multi-mode and steganography detection | |
| Brown et al. | Simple and efficient identification of personally identifiable information on a public website | |
| Aires et al. | An information theory approach to detect media bias in news websites | |
| CN117786121B (en) | File identification method and system based on artificial intelligence | |
| CN118364112A (en) | Data processing method and system based on large model | |
| US20240062570A1 (en) | Detecting unicode injection in text | |
| Al-Ghamdi et al. | Digital forensics and machine learning to fraudulent email prediction | |
| CN116719919B (en) | Text processing method and device | |
| CN118964950B (en) | A sensitive information extraction method and system based on natural language processing | |
| Prakash et al. | Recognizing Fake Documents by Instance-Based ML Algorithm Tuning with Neighborhood Size | |
| Makhambet et al. | A comparative analysis of machine learning methods for personal information recognition (PII) in unstructured texts | |
| Bafna et al. | An Intelligent Learning Approach to Case Analysis for Predicting Legal Outcomes Using Random Forest and K-Nearest Neighbors | |
| Ntwali et al. | Detection of Personal Data in Structured Datasets Using a Large Language Model | |
| Sindhu et al. | End To End Comments Filtering Feature Using Sentimental Analysis | |
| Sahi et al. | Unmasking Fake News: A Naïve Bayes Classifier Approach to Combat Misinformation | |
| Jin | Bayesian Classification Algorithm in Recognition of Insurance Tax | |
| Gonzales | User Behavior Anomaly Detection Approaches to Mitigate Insider Threats |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WW01 | Invention patent application withdrawn after publication | Application publication date: 20240719 |