Disclosure of Invention
Aiming at the above problems, the invention discloses a geotechnical engineering semantic embedding and retrieval method based on domain-guided BERT, which uses deep learning to process multi-source text data in the geotechnical engineering field and achieves efficient semantic understanding and information extraction.
The technical scheme of the invention is as follows:
A geotechnical engineering semantic embedding retrieval method based on domain-guided BERT, comprising the following steps:
Step A, acquiring data sources: various text data in the geotechnical field are collected, including project reports, engineering case analyses, experimental data, standard specification documents, papers and technical guidelines, and knowledge extraction and cleaning are performed;
Step B, constructing a structured training data set, which comprises segmentation processing, manual labeling and sample equalization; the segmentation processing cuts long documents at natural semantic paragraphs or logical segments to generate short text segments;
Step C, data preprocessing, comprising domain key factor extraction and segment length adjustment: a geotechnical engineering knowledge table is constructed, terms in the text are matched against the knowledge table to generate domain key factor marks for guiding the weight allocation of the attention mechanism, and the influence of different text segment lengths on model performance is tested;
Step D, constructing a geotechnical engineering semantic embedding model: the model is based on BERT, dynamically models domain key terms by introducing a Domain-Guided Attention mechanism into the attention mechanism of BERT, and achieves efficient semantic embedding learning and optimization through a Siamese architecture; the model receives an input sentence pair, which the BERT tokenizer preprocesses to convert the natural language into the input format required by BERT, comprising a word index sequence, an attention mask and key term marks; two weight-sharing BERT models encode the two sentences respectively to generate corresponding feature vector representations; the domain-guided attention mechanism dynamically adjusts the attention scores of key terms to generate an enhanced attention distribution that optimizes the feature representations of the sentences; an average pooling operation processes the feature vectors of each sentence to generate fixed-size sentence embedding vectors; and cosine similarity is used to compute similarity scores between the embedded representations of the sentence pair;
Step E, model training, testing and prediction: a total loss function combining contrastive loss and attention regularization loss is designed; the model is trained with an AdamW optimizer and a learning rate scheduler; model performance is evaluated on the test set using the cosine similarity mean, the accuracy and recall of the semantic retrieval task, and the F1 score; and model performance is tested on data sets of different segment-length versions to analyze the influence of the segmentation strategy and determine the optimal one.
Furthermore, the geotechnical engineering semantic embedding and retrieval method based on domain-guided BERT further comprises performing semantic analysis on the collected text data and removing redundant and erroneous information to ensure the accuracy and consistency of the text data.
Furthermore, the geotechnical engineering semantic embedding and retrieval method based on domain-guided BERT further comprises performing semantic segmentation on documents using natural language processing technology to ensure the semantic integrity and independence of each short text segment.
Furthermore, the manual labeling in step B further comprises professional review of the question-answer pairs based on the knowledge system of the geotechnical engineering field to ensure the accuracy and professionalism of the labeled content.
Furthermore, the data preprocessing in step C further comprises weight allocation for the domain key factor marks, the weights being dynamically adjusted according to the importance and occurrence frequency of key terms in the geotechnical engineering domain so as to optimize the distribution of the attention mechanism.
Furthermore, the geotechnical engineering semantic embedding and retrieval method based on domain-guided BERT further comprises adopting the double-tower design of the Siamese architecture to ensure that the two sentences are embedded consistently in the same semantic space, improving the accuracy and efficiency of semantic similarity calculation.
Furthermore, the geotechnical engineering semantic embedding and retrieval method based on domain-guided BERT further comprises adjusting the attention distribution with the domain weights generated from the knowledge table, and optimizing the semantic understanding and information extraction capability of the model through the total loss function of contrastive loss and attention regularization loss.
Furthermore, the geotechnical engineering semantic embedding and retrieval method based on domain-guided BERT further comprises evaluating model performance on different types of geotechnical engineering problems to verify the effectiveness and reliability of the model in practical application.
The invention also discloses a memory or server for storing and processing the data required by the method, the memory or server being configured to:
store the geotechnical field text data collected and cleaned in step A;
store the structured training data set constructed in step B, comprising the segmented short text segments, the manually labeled question-answer pairs and the data sets expanded by data enhancement technology;
store the geotechnical engineering knowledge table constructed in step C, the domain key factor marks and the test results for different text segment lengths;
store the geotechnical engineering semantic embedding model constructed in step D and its parameters, comprising the weights of the BERT model, the configuration of the domain-guided attention mechanism and the details of the Siamese architecture;
provide efficient data read and write capability during the model training of step E, supporting rapid processing of large-scale data sets;
serve as the model deployment platform in the model testing and prediction stage, receiving user input, performing model inference and returning retrieval results.
Further, the memory or the server for storing and processing the data required by the method further comprises:
a data encryption module, configured to encrypt stored sensitive data and ensure data security and privacy;
a data backup and recovery mechanism, configured to regularly back up stored data to prevent data loss and to quickly recover the data when needed;
load balancing and failover functions, ensuring that under high-concurrency access or server failure, requests are automatically distributed to other servers so that service continuity and stability are maintained;
and a performance monitoring and optimization tool, which monitors the running states of the memory and server in real time, including CPU utilization, memory occupation and disk I/O, so as to discover and resolve performance bottlenecks in time and optimize system performance.
Compared with the prior art, the invention has the following beneficial effects:
The method achieves dynamic optimization of the model's attention distribution by innovatively integrating a Domain-Guided Attention (DGA) mechanism into the attention mechanism of the BERT (Bidirectional Encoder Representations from Transformers) model. The mechanism is combined with a Siamese (twin) neural network architecture and a specially constructed Geotechnical Knowledge Base (GKB), so that the model focuses more accurately on key terms and core concepts of the geotechnical engineering field, remarkably enhancing its professional semantic modeling capability.
Specifically, the domain-guided attention mechanism uses the rich information in the geotechnical engineering knowledge table to dynamically adjust the attention weights of the BERT model. In this process, the model not only considers the contextual information of the text but also fully integrates the knowledge and experience of domain experts, so that when understanding complex, professional geotechnical engineering text it can capture key information more accurately and reduce noise interference.
The introduction of the Siamese architecture further improves the performance of the model in processing similar or related text. By sharing parameters, the Siamese network can learn the similarity measure between text pairs, which is important for improving the accuracy of the retrieval system. In the field of geotechnical engineering, many documents may relate to similar or related geological conditions, construction methods and the like, and the similarity can be more effectively identified by utilizing a Siamese architecture, so that the accuracy of a search result is improved.
In addition, the method makes full use of the technical terms, concept definitions and relations among them in the geotechnical engineering knowledge table, providing rich background knowledge for the model. This knowledge is effectively integrated during model training, so that the model can learn deeper semantic representations from professional text in the geotechnical engineering field. The improved deep semantic understanding capability is significant for achieving efficient and accurate text retrieval.
In conclusion, by combining the domain-guided attention mechanism, the Siamese architecture and the geotechnical engineering knowledge table, the method not only optimizes the model's ability to represent professional semantics but also remarkably improves the accuracy and efficiency of retrieval systems in the geotechnical engineering field. This innovative method provides a new approach to the difficult problem of professional text retrieval and is significant for promoting informatization and intelligent development in the geotechnical engineering field.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Example 1
A geotechnical engineering semantic embedding retrieval method based on field-guided BERT comprises the following steps:
1. Data source acquisition
Various text data in the geotechnical field are collected, including project reports, engineering case analyses, experimental data, standard specification documents, papers and technical guidelines. These data cover typical terms and application scenarios of geotechnical engineering and carry rich expertise and semantic depth. Because the data have a low degree of structure and dispersed semantics, further knowledge extraction and cleaning are required to ensure semantic consistency.
2. Creation of data sets
Based on the collected text data, a structured training data set is constructed, which comprises the following steps:
2.1 segmentation processing
The long documents are cut at natural semantic paragraphs or logical segments to generate short text segments, ensuring the semantic focus and logical continuity of the segmented content.
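As a minimal sketch of this segmentation step (the function name and the length cap are illustrative assumptions, not from the source), long text could be cut at blank-line paragraph boundaries and merged up to a maximum segment length:

```python
# Hypothetical sketch: split a long document into short segments at natural
# paragraph (blank-line) boundaries, merging paragraphs up to a length cap.
def segment_document(text, max_chars=300):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    segments, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 1 > max_chars:
            # current segment is full: flush it and start a new one
            segments.append(current)
            current = p
        else:
            current = (current + " " + p).strip() if current else p
    if current:
        segments.append(current)
    return segments
```

In practice the cap would be chosen to match the BERT input length and the short/medium/long variants described in section 3.2.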
2.2 Manual labeling
Question-answer pairs are generated for the segmented text segments: each question extracts the key semantic point of the paragraph content, and the answer is a refined summary of the paragraph.
Example:
Text segment:
"The effect of the composition of soil particles on shear strength is mainly reflected in the contact force between particles and the drainage conditions."
Question:
"How does the composition of soil particles affect shear strength?"
Answer:
"It is affected by inter-particle contact forces and drainage conditions."
2.3 Sample equalization
For rare problem types (such as content involving specific soil properties or complex analyses), the data volume is expanded through data enhancement techniques (such as similar substitution and semantic restatement) to balance the sample distribution and ensure the diversity and comprehensiveness of model training.
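A hedged sketch of the "similar substitution" enhancement mentioned above (the synonym dictionary entries and the function are purely illustrative assumptions, not taken from the source):

```python
import random

# Illustrative synonym dictionary; real entries would come from the
# geotechnical engineering knowledge table, not this hypothetical sample.
SYNONYMS = {
    "clay": ["cohesive soil"],
    "shear strength": ["shearing resistance"],
}

def augment(sentence, rng=random.Random(0)):
    # Replace each known domain term with a randomly chosen synonym.
    out = sentence
    for term, alts in SYNONYMS.items():
        if term in out:
            out = out.replace(term, rng.choice(alts))
    return out
```

Each augmented sentence keeps the original label, so rare question types gain additional training samples without new manual annotation.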
3. Data preprocessing
Further standardization is carried out on the basis of the labeled data:
3.1 Domain key factor extraction
A geotechnical engineering knowledge table is constructed, including important terms, concepts and their attributes, for example:
Soil types: sand, clay, silt and gravel.
Mechanical properties: shear strength, permeability coefficient, compression coefficient, etc.
Environmental conditions: water content, saturation, temperature, pressure, etc.
The knowledge table is used to match terms in the text and generate a domain key factor mark (Keyword Mask) for guiding the weight allocation of the attention mechanism.
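The term-matching step could be sketched as follows, assuming a small illustrative knowledge table (entries taken from the examples above) and whitespace tokens; the actual method would operate on BERT subword tokens:

```python
# Illustrative knowledge-table entries drawn from the examples in section 3.1.
KNOWLEDGE_TABLE = {"sand", "clay", "silt", "gravel",
                   "shear strength", "permeability coefficient", "water content"}

def keyword_mask(tokens):
    # Build a Keyword Mask: 1 for tokens inside a matched domain term, else 0.
    mask = [0] * len(tokens)
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            # Check every contiguous token span against the knowledge table,
            # so multi-word terms such as "shear strength" are also matched.
            if " ".join(tokens[i:j]).lower() in KNOWLEDGE_TABLE:
                for k in range(i, j):
                    mask[k] = 1
    return mask
```

The resulting 0/1 vector is aligned with the token sequence and later drives the attention reweighting of step D.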
3.2 Segment Length adjustment
The influence of different text segment lengths on model performance is tested, and data sets in three segment-length versions (short, medium and long) are generated to explore the optimal segmentation strategy.
3.3 Data partitioning
The data set is divided into a training set and a test set at a ratio of 80%:20%, ensuring that the samples of the training and testing phases do not overlap.
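A minimal sketch of the 80%:20% split with non-overlapping samples (the seed and function name are illustrative assumptions):

```python
import random

def split_dataset(samples, test_ratio=0.2, seed=42):
    # Shuffle indices deterministically, then take the first 20% as the test set.
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(samples) * test_ratio)
    test_idx = set(idx[:n_test])
    train = [s for i, s in enumerate(samples) if i not in test_idx]
    test = [s for i, s in enumerate(samples) if i in test_idx]
    return train, test
```

Splitting by index set guarantees that no sample appears in both phases, as the text requires.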
4. Construction of geotechnical engineering semantic embedded model
As shown in FIGS. 1 to 3, the model is based on BERT, dynamically models domain key terms by introducing a Domain-Guided Attention mechanism into the attention mechanism of BERT, and achieves efficient semantic embedding learning and optimization through a Siamese architecture.
First, the input sentence pair (S1 and S2) is preprocessed by the BERT tokenizer, which converts the natural language into the input format required by BERT, including the word index sequence (input_ids), the attention mask (attention_mask) and the key term marks (Keyword Mask). These input vectors are then fed separately into two weight-sharing encoders for processing.
In the coding layer, the two weight-sharing BERT models independently encode sentences S1 and S2 to generate the corresponding feature vector representations H1 and H2, each of shape n×768, where n is the number of tokens in the sentence and 768 is the hidden-layer dimension. During encoding, the attention scores of key terms are dynamically adjusted by the Domain-Guided Attention mechanism. Specifically, when computing the attention distribution, each ordinary attention weight α_ij is scaled along the key dimension and renormalized:

α'_ij = (w_j · α_ij) / Σ_k (w_k · α_ik), where w_j = 1 + Keyword Mask_j · (γ − 1)

so that domain key terms are assigned higher attention weights, γ > 1 being the weight amplification factor. The resulting enhanced attention distribution is used to further optimize the feature representation of the sentence.
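The reweighting described above can be sketched as follows, assuming the raw attention logits are already computed; the scale-then-renormalize form and γ = 1.5 are illustrative choices consistent with w_j = 1 + Keyword Mask_j · (γ − 1):

```python
import numpy as np

def domain_guided_attention(scores, keyword_mask, gamma=1.5):
    # scores: (n, n) raw attention logits; keyword_mask: (n,) 0/1 vector.
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)          # ordinary softmax
    w = 1.0 + keyword_mask * (gamma - 1.0)            # amplify domain-term keys
    attn = attn * w                                   # reweight key positions
    return attn / attn.sum(axis=-1, keepdims=True)    # renormalize rows to 1
```

With gamma = 1, the mask has no effect and standard BERT attention is recovered, which is one way to sanity-check the mechanism.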
Next, the pooling layer processes the feature vectors of each sentence using an average pooling (Mean Pooling) operation: the feature vectors of all tokens of a sentence are averaged over each dimension, generating fixed-size sentence embedding vectors V1 and V2, each of shape 1×768.
Through the weight-sharing double-tower architecture, the model ensures that the two sentences are embedded consistently in the same semantic space. For the embedded representations V1 and V2 of the sentence pair, cosine similarity (Cosine Similarity) is used to compute the similarity score:

sim(S1, S2) = (V1 · V2) / (‖V1‖ · ‖V2‖)
The similarity score is finally output to evaluate the semantic similarity between the two sentences. Through the double-tower design of the Siamese architecture, the model can efficiently handle similarity tasks and is suitable for various semantic retrieval and matching scenarios.
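A minimal sketch of the pooling and similarity steps (the hidden dimension is reduced for illustration; in the method it is 768):

```python
import numpy as np

def mean_pool(H):
    # H: (n, hidden) token features -> (hidden,) sentence embedding,
    # averaging over the token axis as described for the pooling layer.
    return H.mean(axis=0)

def cosine_similarity(v1, v2):
    # sim = (V1 . V2) / (|V1| * |V2|)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```

The two sentence vectors produced by the shared encoders are pooled this way and compared with a single cosine score.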
5. Model training, testing, and prediction
5.1 Training
Loss function design:
Contrastive loss (Contrastive Loss) is used: by optimizing the sentence embeddings, semantically similar sentence pairs are drawn closer together and dissimilar pairs are pushed farther apart.
Loss function definition:
L_contrastive = y · d² + (1 − y) · max(0, margin − d)²
where d is the Euclidean distance between the sentence embeddings and y is the similarity label (1 for similar pairs, 0 for dissimilar pairs).
Attention regularization:
An attention regularization loss (Attention Regularization Loss) is introduced: a mean square error (MSE) term constrains the model's attention distribution toward a target distribution, making the model focus more on domain terms.
The total loss function is L = L_contrastive + α · L_attention.
Training strategies:
An AdamW optimizer is used in conjunction with a learning rate scheduler (e.g., linear decay) to achieve stable convergence.
The attention distribution is adjusted through the domain weight generated by the knowledge table, so that the model can capture domain semantics more accurately.
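One possible linear-decay schedule with warmup, as commonly paired with AdamW, can be sketched as follows (the warmup length, base rate and total step count are illustrative assumptions, not specified in the source):

```python
def linear_decay_lr(step, base_lr=2e-5, warmup_steps=100, total_steps=1000):
    # Linear warmup from 0 to base_lr, then linear decay back to 0.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

At each optimizer step the returned value would be written into the AdamW parameter groups; warmup avoids destabilizing the pretrained BERT weights early in training.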
5.2 Testing
Model performance was evaluated on the test set using the following criteria:
Cosine similarity mean: measures the semantic similarity of the sentence-pair embeddings.
Accuracy and recall: verify the actual retrieval capability of the model.
F1 score: comprehensively measures the accuracy and coverage of the retrieval results.
Model performance is also tested on the data sets of different segment-length versions to analyze the influence of the segmentation strategy and determine the optimal one.
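A minimal sketch of the retrieval metrics named above, computed from retrieved and relevant document id sets (the function name is an illustrative assumption):

```python
def precision_recall_f1(retrieved, relevant):
    # Precision: fraction of retrieved items that are relevant.
    # Recall: fraction of relevant items that were retrieved.
    # F1: harmonic mean of precision and recall.
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

These per-query values would typically be averaged over the whole test set when reporting model performance.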
This scheme has the beneficial effect that a semantic embedding model for the geotechnical engineering field is successfully constructed through refined data processing and model design. The model not only effectively addresses the low degree of structure and dispersed semantics of geotechnical text data, but also remarkably improves the capture of key terms and complex semantics through the domain-guided attention mechanism. In training and testing, the model performs excellently in semantic similarity evaluation and retrieval tasks, providing powerful support for intelligent information processing in geotechnical engineering and greatly promoting the efficient application and sharing of domain knowledge.
Example 2
A memory or server dedicated to storing and processing the data required by the domain-guided BERT-based geotechnical engineering semantic embedding retrieval method described in Example 1.
The memory or server is carefully designed to ensure efficient storage of data, fast processing, and smooth operation of the model training and prediction phases.
1. Configuration of memory or server
(1) Data storage module
Text database: stores the geotechnical field text data collected and cleaned in step A, including project reports, engineering case analyses, experimental data, standard specification documents, papers and technical guidelines. These data are organized in a structured manner to facilitate subsequent processing and retrieval.
Training data set storage: stores the structured training data set constructed in step B, including the segmented short text segments, the manually labeled question-answer pairs and the data sets expanded by data enhancement technology. These data sets provide rich samples for model training.
Knowledge table storage: stores the geotechnical engineering knowledge table constructed in step C, the domain key factor marks and the test results for different text segment lengths. These knowledge tables and marks provide the model with domain-specific knowledge that helps optimize the distribution of the attention mechanism.
Model storage: stores the geotechnical engineering semantic embedding model constructed in step D and its parameters, including the weights of the BERT model, the configuration of the domain-guided attention mechanism and the details of the Siamese architecture. These parameters are the core of model operation, ensuring its efficiency and accuracy.
(2) Data processing module
Efficient data reading and writing: during the model training of step E, efficient data read and write capability is provided, supporting rapid processing of large-scale data sets. By optimizing the data storage structure and access strategy, the real-time performance and efficiency of data processing are ensured.
Data preprocessing: implements the data preprocessing functions of step C, including domain key factor extraction, segment length adjustment and data division. These preprocessing operations help improve the effectiveness and efficiency of model training.
(3) Model deployment and prediction module
Model deployment: in the model testing and prediction stage, the server serves as the model deployment platform, receiving user input, performing model inference and returning retrieval results. The real-time performance and accuracy of prediction are ensured by optimizing the model loading and inference process.
Result display: a friendly user interface is provided to display the retrieval results and related information, facilitating user understanding and use.
2. Additional functions of memory or server
(1) Data encryption and security
A data encryption module is implemented to encrypt stored sensitive data and ensure data security and privacy. Advanced encryption algorithms and key management policies are employed to protect data from unauthorized access and disclosure.
(2) Data backup and recovery
Stored data are backed up regularly to prevent data loss. When needed, the data can be recovered quickly, ensuring service continuity and stability.
(3) Load balancing and failover
Load balancing and failover functions are implemented to ensure that requests are automatically distributed to other servers in the event of high-concurrency access or server failure. The scalability and reliability of the system are improved through cluster deployment and load balancing strategies.
(4) Performance monitoring and optimization
The running states of the memory and server are monitored in real time, including CPU utilization, memory occupation and disk I/O. Through the performance monitoring and optimization tool, performance bottlenecks are discovered and resolved in time, optimizing system performance and ensuring smooth data processing and model operation.
3. Effect of the invention
By adopting the memory or server provided in this embodiment, the data required by the domain-guided BERT-based geotechnical engineering semantic embedding retrieval method can be efficiently stored and processed. The memory or server not only meets the requirements of data storage, processing, model training and prediction, but also provides additional functions such as data encryption and security, data backup and recovery, load balancing and failover, and performance monitoring and optimization. These functions ensure the security, reliability and efficiency of the data and provide strong support for the practical application of the geotechnical engineering semantic embedding retrieval method.
In summary, the memory or server provided in Example 2 is specially designed for the domain-guided BERT-based geotechnical engineering semantic embedding retrieval method; it is efficient, secure, reliable and scalable, and can meet the needs of the geotechnical engineering field for semantic retrieval technology.
The above embodiments are only preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; modifications and variations made within the scope of the claims and the description fall within the protection scope of the present invention.