Disclosure of Invention
Aiming at the above problems, the invention discloses a geotechnical engineering semantic embedding and retrieval method based on domain-guided BERT, which uses deep learning to process multi-source text data in the geotechnical engineering field and achieves efficient semantic understanding and information extraction.
The technical scheme of the invention is as follows:
A geotechnical engineering semantic embedding retrieval method based on domain-guided BERT, comprising the following steps:
Step A, acquiring data sources: various text data in the geotechnical field are collected, including project reports, engineering case analyses, experimental data, standard specification documents, papers and technical guidelines, and knowledge extraction and cleaning are performed;
Step B, constructing a structured training data set, which comprises segmentation processing, manual labeling and sample equalization; the segmentation processing cuts long documents at natural semantic paragraphs or logical segments to generate short text segments;
Step C, data preprocessing, comprising domain key factor extraction and segment length adjustment: a geotechnical engineering knowledge table is constructed, terms in the text are matched against the knowledge table to generate domain key factor marks for guiding the weight allocation of the attention mechanism, and the influence of different text segment lengths on model performance is tested;
Step D, constructing a geotechnical engineering semantic embedding model: the model is based on BERT, dynamically models domain key terms by introducing a Domain-Guided Attention mechanism into the attention mechanism of BERT, and achieves efficient semantic embedding learning and optimization through a Siamese architecture; the model receives an input sentence pair, which the BERT tokenizer preprocesses to convert the natural language into the input format required by BERT, comprising a word index sequence, an attention mask and key term marks; two weight-sharing BERT models encode the two sentences respectively to generate corresponding feature vector representations; the domain-guided attention mechanism dynamically adjusts the attention scores of key terms to generate an enhanced attention distribution that optimizes the feature representations of the sentences; an average pooling operation processes the feature vectors of each sentence to generate fixed-size sentence embedding vectors; and cosine similarity is used to compute similarity scores between the embedded representations of the sentence pair;
Step E, model training, testing and prediction: a total loss function combining contrastive loss and attention regularization loss is designed; the model is trained with an AdamW optimizer and a learning rate scheduler; model performance is evaluated on the test set using the cosine similarity mean, the accuracy and recall of the semantic retrieval task, and the F1 score; and model performance is tested on data sets of different segment-length versions to analyze the influence of the segmentation strategy and determine the optimal one.
Furthermore, the geotechnical engineering semantic embedding and retrieval method based on domain-guided BERT further comprises performing semantic analysis on the collected text data and removing redundant and erroneous information to ensure the accuracy and consistency of the text data.
Furthermore, the geotechnical engineering semantic embedding and retrieval method based on domain-guided BERT further comprises performing semantic segmentation on documents using natural language processing technology to ensure the semantic integrity and independence of each short text segment.
Furthermore, the manual labeling in step B further comprises professional review of the question-answer pairs based on the knowledge system of the geotechnical engineering field to ensure the accuracy and professionalism of the labeled content.
Furthermore, the data preprocessing in step C further comprises weight allocation for the domain key factor marks, the weights being dynamically adjusted according to the importance and occurrence frequency of key terms in the geotechnical engineering domain so as to optimize the distribution of the attention mechanism.
Furthermore, the geotechnical engineering semantic embedding and retrieval method based on domain-guided BERT further comprises adopting the double-tower design of the Siamese architecture to ensure that the two sentences are embedded consistently in the same semantic space, improving the accuracy and efficiency of semantic similarity calculation.
Furthermore, the geotechnical engineering semantic embedding and retrieval method based on domain-guided BERT further comprises adjusting the attention distribution with the domain weights generated from the knowledge table, and optimizing the semantic understanding and information extraction capability of the model through the total loss function of contrastive loss and attention regularization loss.
Furthermore, the geotechnical engineering semantic embedding and retrieval method based on domain-guided BERT further comprises evaluating model performance on different types of geotechnical engineering problems to verify the effectiveness and reliability of the model in practical application.
The invention also discloses a memory or server for storing and processing the data required by the method, the memory or server being configured to:
store the geotechnical field text data collected and cleaned in step A;
store the structured training data set constructed in step B, comprising the segmented short text segments, the manually labeled question-answer pairs and the data sets expanded by data enhancement technology;
store the geotechnical engineering knowledge table constructed in step C, the domain key factor marks and the test results for different text segment lengths;
store the geotechnical engineering semantic embedding model constructed in step D and its parameters, comprising the weights of the BERT model, the configuration of the domain-guided attention mechanism and the details of the Siamese architecture;
provide efficient data read and write capability during the model training of step E, supporting rapid processing of large-scale data sets;
serve as the model deployment platform in the model testing and prediction stage, receiving user input, performing model inference and returning retrieval results.
Further, the memory or the server for storing and processing the data required by the method further comprises:
a data encryption module, configured to encrypt stored sensitive data and ensure data security and privacy;
a data backup and recovery mechanism, configured to regularly back up stored data to prevent data loss and to quickly recover the data when needed;
load balancing and failover functions, ensuring that under high-concurrency access or server failure, requests are automatically distributed to other servers so that service continuity and stability are maintained;
and a performance monitoring and optimization tool, which monitors the running states of the memory and server in real time, including CPU utilization, memory occupation and disk I/O, so as to discover and resolve performance bottlenecks in time and optimize system performance.
Compared with the prior art, the invention has the following beneficial effects:
The method achieves dynamic optimization of the model's attention distribution by innovatively integrating a Domain-Guided Attention (DGA) mechanism into the attention mechanism of the BERT (Bidirectional Encoder Representations from Transformers) model. The mechanism is combined with a Siamese (twin) neural network architecture and a specially constructed Geotechnical Knowledge Base (GKB), so that the model focuses more accurately on key terms and core concepts of the geotechnical engineering field, remarkably enhancing its professional semantic modeling capability.
Specifically, the domain-guided attention mechanism uses the rich information in the geotechnical engineering knowledge table to dynamically adjust the attention weights of the BERT model. In this process, the model not only considers the contextual information of the text but also fully integrates the knowledge and experience of domain experts, so that when understanding complex, professional geotechnical engineering text it can capture key information more accurately and reduce noise interference.
The introduction of the Siamese architecture further improves the performance of the model in processing similar or related text. By sharing parameters, the Siamese network can learn the similarity measure between text pairs, which is important for improving the accuracy of the retrieval system. In the field of geotechnical engineering, many documents may relate to similar or related geological conditions, construction methods and the like, and the similarity can be more effectively identified by utilizing a Siamese architecture, so that the accuracy of a search result is improved.
In addition, the method makes full use of the technical terms, concept definitions and relations among them in the geotechnical engineering knowledge table, providing rich background knowledge for the model. This knowledge is effectively integrated during model training, so that the model can learn deeper semantic representations from professional text in the geotechnical engineering field. The improved deep semantic understanding capability is significant for achieving efficient and accurate text retrieval.
In conclusion, by combining the domain-guided attention mechanism, the Siamese architecture and the geotechnical engineering knowledge table, the method not only optimizes the model's ability to represent professional semantics but also remarkably improves the accuracy and efficiency of retrieval systems in the geotechnical engineering field. This innovative method provides a new approach to the difficult problem of professional text retrieval and is significant for promoting informatization and intelligent development in the geotechnical engineering field.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Example 1
A geotechnical engineering semantic embedding retrieval method based on field-guided BERT comprises the following steps:
1. Data source acquisition
Various text data in the geotechnical field are collected, including project reports, engineering case analyses, experimental data, standard specification documents, papers and technical guidelines. These data cover typical terms and application scenarios of geotechnical engineering and carry rich expertise and semantic depth. Because the data have a low degree of structure and dispersed semantics, further knowledge extraction and cleaning are required to ensure semantic consistency.
2. Creation of data sets
Based on the collected text data, a structured training data set is constructed, which comprises the following steps:
2.1 segmentation processing
The long documents are cut at natural semantic paragraphs or logical segments to generate short text segments, ensuring the semantic focus and logical continuity of the segmented content.
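As a minimal sketch of this segmentation step (the function name and the length cap are illustrative assumptions, not from the source), long text could be cut at blank-line paragraph boundaries and merged up to a maximum segment length:

```python
# Hypothetical sketch: split a long document into short segments at natural
# paragraph (blank-line) boundaries, merging paragraphs up to a length cap.
def segment_document(text, max_chars=300):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    segments, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 1 > max_chars:
            # current segment is full: flush it and start a new one
            segments.append(current)
            current = p
        else:
            current = (current + " " + p).strip() if current else p
    if current:
        segments.append(current)
    return segments
```

In practice the cap would be chosen to match the BERT input length and the short/medium/long variants described in section 3.2.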
2.2 Manual labeling
Question-answer pairs are generated for the segmented text segments: each question extracts the key semantic point of the paragraph content, and the answer is a refined summary of the paragraph.
Example:
Text segment:
"The effect of the composition of soil particles on shear strength is mainly reflected in the contact force between particles and the drainage conditions."
Question:
"How does the composition of soil particles affect shear strength?"
Answer:
"It is affected by inter-particle contact forces and drainage conditions."
2.3 Sample equalization
For rare problem types (such as content involving specific soil properties or complex analyses), the data volume is expanded through data enhancement techniques (such as similar substitution and semantic restatement) to balance the sample distribution and ensure the diversity and comprehensiveness of model training.
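A hedged sketch of the "similar substitution" enhancement mentioned above (the synonym dictionary entries and the function are purely illustrative assumptions, not taken from the source):

```python
import random

# Illustrative synonym dictionary; real entries would come from the
# geotechnical engineering knowledge table, not this hypothetical sample.
SYNONYMS = {
    "clay": ["cohesive soil"],
    "shear strength": ["shearing resistance"],
}

def augment(sentence, rng=random.Random(0)):
    # Replace each known domain term with a randomly chosen synonym.
    out = sentence
    for term, alts in SYNONYMS.items():
        if term in out:
            out = out.replace(term, rng.choice(alts))
    return out
```

Each augmented sentence keeps the original label, so rare question types gain additional training samples without new manual annotation.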
3. Data preprocessing
Further standardization is carried out on the basis of the labeled data:
3.1 Domain key factor extraction
A geotechnical engineering knowledge table is constructed, including important terms, concepts and their attributes, for example:
Soil types: sand, clay, silt and gravel.
Mechanical properties: shear strength, permeability coefficient, compression coefficient, etc.
Environmental conditions: water content, saturation, temperature, pressure, etc.
The knowledge table is used to match terms in the text and generate a domain key factor mark (Keyword Mask) for guiding the weight allocation of the attention mechanism.
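The term-matching step could be sketched as follows, assuming a small illustrative knowledge table (entries taken from the examples above) and whitespace tokens; the actual method would operate on BERT subword tokens:

```python
# Illustrative knowledge-table entries drawn from the examples in section 3.1.
KNOWLEDGE_TABLE = {"sand", "clay", "silt", "gravel",
                   "shear strength", "permeability coefficient", "water content"}

def keyword_mask(tokens):
    # Build a Keyword Mask: 1 for tokens inside a matched domain term, else 0.
    mask = [0] * len(tokens)
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            # Check every contiguous token span against the knowledge table,
            # so multi-word terms such as "shear strength" are also matched.
            if " ".join(tokens[i:j]).lower() in KNOWLEDGE_TABLE:
                for k in range(i, j):
                    mask[k] = 1
    return mask
```

The resulting 0/1 vector is aligned with the token sequence and later drives the attention reweighting of step D.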
3.2 Segment Length adjustment
The influence of different text segment lengths on model performance is tested, and data sets in three segment-length versions (short, medium and long) are generated to explore the optimal segmentation strategy.
3.3 Data partitioning
The data set is divided into a training set and a test set at a ratio of 80%:20%, ensuring that the samples of the training and testing phases do not overlap.
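A minimal sketch of the 80%:20% split with non-overlapping samples (the seed and function name are illustrative assumptions):

```python
import random

def split_dataset(samples, test_ratio=0.2, seed=42):
    # Shuffle indices deterministically, then take the first 20% as the test set.
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(samples) * test_ratio)
    test_idx = set(idx[:n_test])
    train = [s for i, s in enumerate(samples) if i not in test_idx]
    test = [s for i, s in enumerate(samples) if i in test_idx]
    return train, test
```

Splitting by index set guarantees that no sample appears in both phases, as the text requires.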
4. Construction of geotechnical engineering semantic embedded model
As shown in FIGS. 1 to 3, the model is based on BERT, dynamically models domain key terms by introducing a Domain-Guided Attention mechanism into the attention mechanism of BERT, and achieves efficient semantic embedding learning and optimization through a Siamese architecture.
First, the input sentence pair (S1 and S2) is preprocessed by the BERT tokenizer, which converts the natural language into the input format required by BERT, including the word index sequence (input_ids), the attention mask (attention_mask) and the key term marks (Keyword Mask). These input vectors are then fed separately into two weight-sharing encoders for processing.
In the coding layer, the two weight-sharing BERT models independently encode sentences S1 and S2 to generate the corresponding feature vector representations H1 and H2, each of shape n×768, where n is the number of tokens in the sentence and 768 is the hidden-layer dimension. During encoding, the attention scores of key terms are dynamically adjusted by the Domain-Guided Attention mechanism. Specifically, when computing the attention distribution, each ordinary attention weight α_ij is scaled along the key dimension and renormalized:

α'_ij = (w_j · α_ij) / Σ_k (w_k · α_ik), where w_j = 1 + Keyword Mask_j · (γ − 1)

so that domain key terms are assigned higher attention weights, γ > 1 being the weight amplification factor. The resulting enhanced attention distribution is used to further optimize the feature representation of the sentence.
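The reweighting described above can be sketched as follows, assuming the raw attention logits are already computed; the scale-then-renormalize form and γ = 1.5 are illustrative choices consistent with w_j = 1 + Keyword Mask_j · (γ − 1):

```python
import numpy as np

def domain_guided_attention(scores, keyword_mask, gamma=1.5):
    # scores: (n, n) raw attention logits; keyword_mask: (n,) 0/1 vector.
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)          # ordinary softmax
    w = 1.0 + keyword_mask * (gamma - 1.0)            # amplify domain-term keys
    attn = attn * w                                   # reweight key positions
    return attn / attn.sum(axis=-1, keepdims=True)    # renormalize rows to 1
```

With gamma = 1, the mask has no effect and standard BERT attention is recovered, which is one way to sanity-check the mechanism.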
Next, the pooling layer processes the feature vectors of each sentence using an average pooling (Mean Pooling) operation: the feature vectors of all tokens of a sentence are averaged over each dimension, generating fixed-size sentence embedding vectors V1 and V2, each of shape 1×768.
Through the weight-sharing double-tower architecture, the model ensures that the two sentences are embedded consistently in the same semantic space. For the embedded representations V1 and V2 of the sentence pair, cosine similarity (Cosine Similarity) is used to compute the similarity score:

sim(S1, S2) = (V1 · V2) / (‖V1‖ · ‖V2‖)
The similarity score is finally output to evaluate the semantic similarity between the two sentences. Through the double-tower design of the Siamese architecture, the model can efficiently handle similarity tasks and is suitable for various semantic retrieval and matching scenarios.
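A minimal sketch of the pooling and similarity steps (the hidden dimension is reduced for illustration; in the method it is 768):

```python
import numpy as np

def mean_pool(H):
    # H: (n, hidden) token features -> (hidden,) sentence embedding,
    # averaging over the token axis as described for the pooling layer.
    return H.mean(axis=0)

def cosine_similarity(v1, v2):
    # sim = (V1 . V2) / (|V1| * |V2|)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```

The two sentence vectors produced by the shared encoders are pooled this way and compared with a single cosine score.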
5. Model training, testing, and prediction
5.1 Training
Loss function design:
Contrastive loss (Contrastive Loss) is used: by optimizing the sentence embeddings, semantically similar sentence pairs are drawn closer together and dissimilar pairs are pushed farther apart.
Loss function definition:
L_contrastive = y · d² + (1 − y) · max(0, margin − d)²
where d is the Euclidean distance between the sentence embeddings and y is the similarity label (1 for similar pairs, 0 for dissimilar pairs).
Attention regularization:
An attention regularization loss (Attention Regularization Loss) is introduced: a mean square error (MSE) term constrains the model's attention distribution toward a target distribution, making the model focus more on domain terms.
The total loss function is L = L_contrastive + α · L_attention.
Training strategies:
An AdamW optimizer is used in conjunction with a learning rate scheduler (e.g., linear decay) to achieve stable convergence.
The attention distribution is adjusted through the domain weight generated by the knowledge table, so that the model can capture domain semantics more accurately.
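One possible linear-decay schedule with warmup, as commonly paired with AdamW, can be sketched as follows (the warmup length, base rate and total step count are illustrative assumptions, not specified in the source):

```python
def linear_decay_lr(step, base_lr=2e-5, warmup_steps=100, total_steps=1000):
    # Linear warmup from 0 to base_lr, then linear decay back to 0.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

At each optimizer step the returned value would be written into the AdamW parameter groups; warmup avoids destabilizing the pretrained BERT weights early in training.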
5.2 Testing
Model performance was evaluated on the test set using the following criteria:
Cosine similarity mean: measures the semantic similarity of the sentence-pair embeddings.
Accuracy and recall: verify the actual retrieval capability of the model.
F1 score: comprehensively measures the accuracy and coverage of the retrieval results.
Model performance is also tested on the data sets of different segment-length versions to analyze the influence of the segmentation strategy and determine the optimal one.
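A minimal sketch of the retrieval metrics named above, computed from retrieved and relevant document id sets (the function name is an illustrative assumption):

```python
def precision_recall_f1(retrieved, relevant):
    # Precision: fraction of retrieved items that are relevant.
    # Recall: fraction of relevant items that were retrieved.
    # F1: harmonic mean of precision and recall.
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

These per-query values would typically be averaged over the whole test set when reporting model performance.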
This scheme has the beneficial effect that a semantic embedding model for the geotechnical engineering field is successfully constructed through refined data processing and model design. The model not only effectively addresses the low degree of structure and dispersed semantics of geotechnical text data, but also remarkably improves the capture of key terms and complex semantics through the domain-guided attention mechanism. In training and testing, the model performs excellently in semantic similarity evaluation and retrieval tasks, providing powerful support for intelligent information processing in geotechnical engineering and greatly promoting the efficient application and sharing of domain knowledge.
Example 2
A memory or server dedicated to storing and processing the data required by the domain-guided BERT-based geotechnical engineering semantic embedding retrieval method described in Example 1.
The memory or server is carefully designed to ensure efficient storage of data, fast processing, and smooth operation of the model training and prediction phases.
1. Configuration of memory or server
(1) Data storage module
Text database: stores the geotechnical field text data collected and cleaned in step A, including project reports, engineering case analyses, experimental data, standard specification documents, papers and technical guidelines. These data are organized in a structured manner to facilitate subsequent processing and retrieval.
Training data set storage: stores the structured training data set constructed in step B, including the segmented short text segments, the manually labeled question-answer pairs and the data sets expanded by data enhancement technology. These data sets provide rich samples for model training.
Knowledge table storage: stores the geotechnical engineering knowledge table constructed in step C, the domain key factor marks and the test results for different text segment lengths. These knowledge tables and marks provide the model with domain-specific knowledge that helps optimize the distribution of the attention mechanism.
Model storage: stores the geotechnical engineering semantic embedding model constructed in step D and its parameters, including the weights of the BERT model, the configuration of the domain-guided attention mechanism and the details of the Siamese architecture. These parameters are the core of model operation, ensuring its efficiency and accuracy.
(2) Data processing module
Efficient data reading and writing: during the model training of step E, efficient data read and write capability is provided, supporting rapid processing of large-scale data sets. By optimizing the data storage structure and access strategy, the real-time performance and efficiency of data processing are ensured.
Data preprocessing: implements the data preprocessing functions of step C, including domain key factor extraction, segment length adjustment and data division. These preprocessing operations help improve the effectiveness and efficiency of model training.
(3) Model deployment and prediction module
Model deployment: in the model testing and prediction stage, the server serves as the model deployment platform, receiving user input, performing model inference and returning retrieval results. The real-time performance and accuracy of prediction are ensured by optimizing the model loading and inference process.
Result display: a friendly user interface is provided to display the retrieval results and related information, facilitating user understanding and use.
2. Additional functions of memory or server
(1) Data encryption and security
A data encryption module is implemented to encrypt stored sensitive data and ensure data security and privacy. Advanced encryption algorithms and key management policies are employed to protect data from unauthorized access and disclosure.
(2) Data backup and recovery
Stored data are backed up regularly to prevent data loss. When needed, the data can be recovered quickly, ensuring service continuity and stability.
(3) Load balancing and failover
Load balancing and failover functions are implemented to ensure that requests are automatically distributed to other servers in the event of high-concurrency access or server failure. The scalability and reliability of the system are improved through cluster deployment and load balancing strategies.
(4) Performance monitoring and optimization
The running states of the memory and server are monitored in real time, including CPU utilization, memory occupation and disk I/O. Through the performance monitoring and optimization tool, performance bottlenecks are discovered and resolved in time, optimizing system performance and ensuring smooth data processing and model operation.
3. Effect of the invention
By adopting the memory or server provided in this embodiment, the data required by the domain-guided BERT-based geotechnical engineering semantic embedding retrieval method can be efficiently stored and processed. The memory or server not only meets the requirements of data storage, processing, model training and prediction, but also provides additional functions such as data encryption and security, data backup and recovery, load balancing and failover, and performance monitoring and optimization. These functions ensure the security, reliability and efficiency of the data and provide strong support for the practical application of the geotechnical engineering semantic embedding retrieval method.
In summary, the memory or server provided in Example 2 is specially designed for the domain-guided BERT-based geotechnical engineering semantic embedding retrieval method; it is efficient, secure, reliable and scalable, and can meet the needs of the geotechnical engineering field for semantic retrieval technology.
The above embodiments are only preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; modifications and variations made within the scope of the claims and the description fall within the protection scope of the present invention.