
CN119903168A - Semantic embedding retrieval method for geotechnical engineering based on domain-guided BERT - Google Patents


Info

Publication number
CN119903168A
CN119903168A (application CN202411869841.6A)
Authority
CN
China
Prior art keywords
model
data
semantic
geotechnical engineering
domain
Prior art date
Legal status
Granted
Application number
CN202411869841.6A
Other languages
Chinese (zh)
Other versions
CN119903168B (en
Inventor
苏辉
李元昊
王维
李鸣洲
杨石飞
罗永康
李蕊
Current Assignee
Shanghai Survey Design And Research Institute Group Co ltd
Original Assignee
Shanghai Survey Design And Research Institute Group Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Survey Design And Research Institute Group Co ltd
Priority to CN202411869841.6A
Publication of CN119903168A
Application granted
Publication of CN119903168B
Legal status: Active

Classifications

    • G06F16/3344: Information retrieval; query execution using natural language analysis
    • G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06N5/02: Knowledge representation; symbolic representation


Abstract


The present invention belongs to the field of artificial-intelligence geotechnical engineering and proposes a geotechnical engineering semantic embedding retrieval method based on domain-guided BERT, aiming to solve the problems of information dispersion, semantic inconsistency, and low retrieval efficiency in geotechnical engineering information retrieval. The method collects a variety of text materials from the geotechnical field, constructs a structured training data set, and introduces a domain-guided attention mechanism into BERT to dynamically model key domain terms. A Siamese architecture enables efficient semantic embedding learning and optimization and improves the accuracy of semantic similarity computation. A total loss function combining contrastive loss and attention regularization loss is designed, and the model is trained with the AdamW optimizer. The invention significantly improves the accuracy and efficiency of geotechnical engineering information retrieval and provides strong support for knowledge management and information retrieval in the field. The invention also relates to a memory or server configuration for storing and processing the related data.

Description

Geotechnical engineering semantic embedding retrieval method based on domain-guided BERT
Technical Field
The invention belongs to the technical field of artificial-intelligence geotechnical engineering and provides a geotechnical engineering semantic embedding retrieval method based on domain-guided BERT.
Background
In traditional geotechnical engineering information retrieval, technicians face challenges such as information dispersion, semantic inconsistency, and low retrieval efficiency. Geotechnical engineering materials are widely scattered and take many document forms, including academic papers, engineering reports, experimental data, and standard specifications; these text data often lack a uniform structure and consistent terminology, making information difficult to integrate effectively. In addition, terms and specialized concepts in the geotechnical engineering field are highly professional and complex, and traditional keyword retrieval struggles to capture the deep semantics of documents, so the relevance and accuracy of retrieval results are low.
Existing text semantic retrieval typically uses the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers): sentences are context-encoded to produce dynamic word embeddings, and semantic similarity is measured by the cosine similarity of sentence vectors, which improves retrieval effectiveness to a certain extent. Chinese patent CN114840645A discloses a method that applies a linear transformation to BERT outputs to map text vectors and query keyword vectors onto a common orthonormal basis, further standardizing the vector-space representation and improving the relevance of search results. However, both vanilla BERT and CN114840645A mainly target general-purpose text, lack targeted optimization for specialized fields such as geotechnical engineering, and do not fully exploit domain-specific knowledge such as explicit modeling of key terms and semantic weight distribution. In addition, the vector distributions may be nonlinear or singular, causing computed semantic similarity to diverge from actual semantic relevance and limiting practical application in highly specialized settings such as geotechnical semantic retrieval.
Disclosure of Invention
To address these problems, the invention discloses a geotechnical engineering semantic embedding retrieval method based on domain-guided BERT, which uses deep learning to process multi-source text data in the geotechnical engineering field and achieves efficient semantic understanding and information extraction.
The technical scheme of the invention is as follows:
A geotechnical engineering semantic embedding retrieval method based on domain-guided BERT comprises the following steps:
Step A: data source acquisition: collect various text data in the geotechnical field, including project reports, engineering case analyses, experimental data, standard specification documents, papers, and technical guidelines, and perform knowledge extraction and cleaning;
Step B: construct a structured training data set, including segmentation, manual annotation, and sample balancing; segmentation splits long documents into short text fragments along semantically natural paragraphs or logical sections;
Step C: data preprocessing, including domain key factor extraction and segment length adjustment: build a geotechnical engineering knowledge table, match terms in the text against it, and generate domain key factor tags to guide the weight allocation of the attention mechanism; also test the influence of different text segment lengths on model performance;
Step D: construct a geotechnical engineering semantic embedding model. The model is based on BERT; a Domain-Guided Attention mechanism is introduced into BERT's attention mechanism to dynamically model domain key terms, and a Siamese architecture is adopted for efficient semantic embedding learning and optimization. The model receives input sentence pairs, which are preprocessed by BERT's tokenizer to convert natural language into BERT's input format, including the word index sequence, the attention mask, and key term tags. Two weight-sharing BERT models encode the sentences separately to generate their feature vector representations; the domain-guided attention mechanism dynamically adjusts the attention scores of key terms to produce an enhanced attention distribution that refines the sentence feature representations. Mean pooling is applied to each sentence's feature vectors to generate fixed-size sentence embedding vectors, and cosine similarity is used to compute a similarity score between the embeddings of the sentence pair;
Step E: model training, testing, and prediction: design a total loss function combining contrastive loss and attention regularization loss; train the model with the AdamW optimizer and a learning rate scheduler; evaluate performance on the test set using the cosine similarity mean and the precision, recall, and F1 score of the semantic retrieval task; and test the model on data sets of different segment-length versions to analyze the effect of the segmentation strategy on performance.
Furthermore, the method performs semantic analysis on the collected text data and removes redundant and erroneous information to ensure the accuracy and consistency of the text data.
Furthermore, the method semantically segments documents using natural language processing techniques to ensure the semantic integrity and independence of each short text fragment.
Furthermore, the manual annotation in step B includes professional review of the question-answer pairs against the geotechnical engineering knowledge system to ensure the accuracy and professionalism of the annotated content.
Furthermore, the data preprocessing in step C includes weight allocation for the domain key factor tags, with weights dynamically adjusted according to the importance and frequency of key terms in the geotechnical engineering domain to optimize the attention distribution.
Furthermore, the method adopts the dual-tower design of the Siamese architecture to ensure that the two sentences are embedded consistently in the same semantic space, improving the accuracy and efficiency of semantic similarity computation.
Furthermore, the method adjusts the attention distribution using domain weights generated from the knowledge table, and optimizes the model's semantic understanding and information extraction through the total loss function of contrastive loss and attention regularization loss.
Furthermore, the method evaluates the model on different types of geotechnical engineering problems to verify its effectiveness and reliability in practical applications.
The invention also discloses a memory or server for storing and processing the data required by the method, configured to:
store the geotechnical text data collected and cleaned in step A;
store the structured training data set constructed in step B, including the segmented short text fragments, the manually annotated question-answer pairs, and the data sets expanded by data augmentation;
store the geotechnical engineering knowledge table constructed in step C, the domain key factor tags, and the test results for different text segment lengths;
store the geotechnical engineering semantic embedding model constructed in step D and its parameters, including the BERT model weights, the configuration of the domain-guided attention mechanism, and the details of the Siamese architecture;
provide efficient data read/write capability during the model training of step E, supporting rapid processing of large-scale data sets;
and, in the model testing and prediction stage, serve as the model deployment platform: receive user input, execute model inference, and return retrieval results.
Furthermore, the memory or server further comprises:
a data encryption module that encrypts stored sensitive data to ensure data security and privacy;
a data backup and recovery mechanism that regularly backs up stored data to prevent loss and can restore it quickly when needed;
load balancing and failover functions that automatically distribute requests to other servers under high concurrent access or server failure, ensuring service continuity and stability;
and a performance monitoring and optimization tool that monitors the running state of the memory and server in real time, including CPU utilization, memory usage, and disk I/O, to detect and resolve performance bottlenecks promptly.
Compared with the prior art, the invention has the following beneficial effects:
The method achieves dynamic optimization of the model's attention distribution by integrating a Domain-Guided Attention (DGA) mechanism into the attention mechanism of the BERT (Bidirectional Encoder Representations from Transformers) model. The mechanism combines a Siamese (twin) neural network architecture with a purpose-built geotechnical engineering knowledge table (Geotechnical Knowledge Base, GKB), enabling the model to focus more accurately on key terms and core concepts of the geotechnical engineering field and significantly enhancing its professional semantic modeling capability.
Specifically, the domain-guided attention mechanism uses the rich information in the geotechnical engineering knowledge table to dynamically adjust the attention weights of the BERT model. In this process, the model considers not only the textual context but also fully integrates domain-expert knowledge and experience, so that it captures key information more accurately and suffers less noise interference when understanding complex, specialized geotechnical engineering texts.
The introduction of the Siamese architecture further improves the performance of the model in processing similar or related text. By sharing parameters, the Siamese network can learn the similarity measure between text pairs, which is important for improving the accuracy of the retrieval system. In the field of geotechnical engineering, many documents may relate to similar or related geological conditions, construction methods and the like, and the similarity can be more effectively identified by utilizing a Siamese architecture, so that the accuracy of a search result is improved.
In addition, the method makes full use of the technical terms, concept definitions, and the relations among them in the geotechnical engineering knowledge table, providing rich background knowledge for the model. This knowledge is effectively integrated during model training, so the model learns deeper semantic representations from professional geotechnical texts. This improved deep semantic understanding is essential for efficient and accurate text retrieval.
In conclusion, by combining the domain-guided attention mechanism, the Siamese architecture, and the geotechnical engineering knowledge table, the method not only optimizes the model's ability to represent professional semantics but also markedly improves the accuracy and efficiency of retrieval systems in the geotechnical engineering field. This approach offers a new way to tackle difficult problems in professional text retrieval and supports the informatization and intelligent development of the field.
Drawings
FIG. 1 shows the domain-guided attention mechanism;
FIG. 2 shows the BERT encoder with the domain-guided attention mechanism;
FIG. 3 is a block diagram of the domain-guided BERT semantic embedding model for geotechnical engineering.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Example 1
A geotechnical engineering semantic embedding retrieval method based on domain-guided BERT comprises the following steps:
1. Data source acquisition
Various text data in the geotechnical field are collected, including project reports, engineering case analyses, experimental data, standard specification documents, papers, and technical guidelines. These data cover the typical terms and application scenarios of geotechnical engineering and carry rich expertise and semantic depth. Because the data have a low degree of structuring and scattered semantics, further knowledge extraction and cleaning are required to ensure semantic consistency.
2. Creation of data sets
Based on the collected text data, a structured training data set is constructed, which comprises the following steps:
2.1 segmentation processing
Long documents are split along semantically natural paragraphs or logical sections to generate short text fragments, ensuring that each fragment is semantically focused and logically continuous.
2.2 Manual labeling
Question-answer pairs are generated for the segmented text fragments: the question extracts the key semantic point of the paragraph, and the answer is a refined summary of it.
Example:
Text fragment:
"The effect of the composition of soil particles on shear strength is mainly reflected in the contact force between particles and the drainage condition."
Question:
How does the composition of soil particles affect shear strength?
Answer:
"It is affected by inter-particle contact forces and drainage conditions."
2.3 Sample equalization
For rare question types (such as those involving specific soil properties or complex analyses), data augmentation techniques (such as synonym substitution and semantic restatement) are used to expand the data volume and balance the sample distribution, ensuring the diversity and comprehensiveness of model training.
3. Data preprocessing
The annotated data are further standardized:
3.1 Extraction of domain key factors
A geotechnical engineering knowledge table is built, containing important terms, concepts, and their attributes, for example:
Soil types: sand, clay, silt, gravel.
Mechanical properties: shear strength, permeability coefficient, compression coefficient, etc.
Environmental conditions: water content, saturation, temperature, pressure, etc.
The knowledge table is used to match terms in the text and generate a domain key factor tag (Keyword Mask) that guides the weight allocation of the attention mechanism.
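As a rough illustration, this keyword-mask step could be sketched as follows. The toy knowledge table, whitespace tokenization, and greedy multi-word matching are all simplifying assumptions; the patent matches terms from a full geotechnical knowledge table over BERT tokenizer output.

```python
# Minimal sketch: build a Keyword Mask from a tiny geotechnical knowledge table.
# Table entries and the whitespace tokenizer are illustrative assumptions.
KNOWLEDGE_TABLE = {
    "shear strength", "permeability coefficient", "clay", "sand",
    "water content", "saturation",
}

def keyword_mask(tokens):
    """Return a 0/1 mask marking tokens that belong to a knowledge-table term."""
    mask = [0] * len(tokens)
    for i in range(len(tokens)):
        # Try the longest span starting at i first, so multi-word terms win.
        for j in range(len(tokens), i, -1):
            if " ".join(tokens[i:j]).lower() in KNOWLEDGE_TABLE:
                for k in range(i, j):
                    mask[k] = 1
                break
    return mask

tokens = "the shear strength of saturated clay decreases".split()
print(keyword_mask(tokens))  # marks "shear", "strength", and "clay"
```

The mask positions with value 1 are the ones the domain-guided attention mechanism later amplifies.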
3.2 Segment Length adjustment
The influence of different text segment lengths on model performance is tested; data sets in three segmentation versions (short, medium, and long) are generated to explore the optimal segmentation strategy.
3.3 Data partitioning
The data set is partitioned into training and test sets at a ratio of 80% to 20%, ensuring that training and test samples do not overlap.
4. Construction of geotechnical engineering semantic embedded model
As shown in FIGS. 1-3, the model is based on BERT: a Domain-Guided Attention mechanism is introduced into BERT's attention mechanism to dynamically model domain key terms, and a Siamese architecture is adopted for efficient semantic embedding learning and optimization.
First, the input sentences S1 and S2 are preprocessed by BERT's tokenizer, which converts natural language into the input format BERT requires: word index sequences (input_ids), attention masks (attention_mask), and key term tags (Keyword Mask). These input vectors are then fed separately into two weight-sharing encoders for processing.
In the encoding layer, the two weight-sharing BERT models independently encode sentences S1 and S2 to generate feature vector representations H1 and H2, each of shape n × 768, where n is the number of tokens in the sentence and 768 is the hidden layer dimension. During encoding, the attention scores of key terms are dynamically adjusted by the Domain-Guided Attention mechanism: each token position j is assigned a guidance weight
w_j = 1 + KeywordMask_j · (γ - 1)
where γ > 1 is the weight amplification factor, so domain key terms receive higher attention weight. The resulting enhanced attention distribution is used to further refine the sentence feature representations.
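One plausible reading of this weighting scheme, sketched in plain Python: the attention scores are softmax-normalized, scaled by w_j, and renormalized into a distribution. Whether the scaling is applied before or after the softmax is not specified in the text, so the post-softmax placement here is an assumption.

```python
import math

def domain_guided_attention(scores, kw_mask, gamma=2.0):
    """Bias an attention distribution toward knowledge-table positions.
    w_j = 1 + KeywordMask_j * (gamma - 1); gamma > 1 amplifies key terms.
    Applying the scaling after the softmax is an implementation assumption."""
    attn = [math.exp(s) for s in scores]
    z = sum(attn)
    attn = [a / z for a in attn]                  # standard softmax attention
    w = [1 + m * (gamma - 1) for m in kw_mask]    # domain guidance weights
    scaled = [a * wj for a, wj in zip(attn, w)]
    z = sum(scaled)
    return [s / z for s in scaled]                # renormalize to a distribution
```

With uniform scores and gamma = 3, a single keyword position ends up with three times the attention mass of each non-keyword position.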
Next, a pooling layer processes each sentence's feature vectors using mean pooling: the feature vectors of all tokens in a sentence are averaged along each dimension, generating fixed-size sentence embedding vectors V1 and V2, each of shape 1 × 768.
Through the weight-sharing dual-tower architecture, the model ensures that the two sentences are embedded consistently in the same semantic space. For the sentence-pair embeddings V1 and V2, cosine similarity is used to compute the similarity score:
sim(V1, V2) = (V1 · V2) / (||V1|| · ||V2||)
The final output is a similarity score that evaluates the semantic similarity between the two sentences. With the Siamese dual-tower design, the model handles similarity tasks efficiently and suits a variety of semantic retrieval and matching scenarios.
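The pooling and similarity steps above translate directly into code; this toy version uses tiny 2-dimensional vectors in place of the n × 768 BERT outputs.

```python
import math

def mean_pool(token_vecs):
    """Average token feature vectors (n x d) into one sentence vector (d)."""
    n = len(token_vecs)
    return [sum(v[i] for v in token_vecs) / n for i in range(len(token_vecs[0]))]

def cosine_sim(v1, v2):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)

H1 = [[1.0, 0.0], [0.0, 1.0]]      # toy 2-token, 2-dim "encoder output"
V1 = mean_pool(H1)                 # [0.5, 0.5]
print(cosine_sim(V1, [1.0, 1.0]))  # same direction, so approximately 1.0
```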
5. Model training, testing, and prediction
5.1 Training
Loss function design:
Contrastive loss is used: by optimizing the sentence embeddings, semantically similar sentence pairs are pulled closer together and dissimilar pairs are pushed farther apart.
The loss function is defined as:
L_contrastive = y · d^2 + (1 - y) · max(0, margin - d)^2
where d is the Euclidean distance between the sentence embeddings, y is the similarity label (1 for similar pairs, 0 for dissimilar), and margin is the minimum desired distance between dissimilar pairs.
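The contrastive loss can be sketched directly from its definition; the margin value of 1.0 is a common default and an assumption here.

```python
def contrastive_loss(d, y, margin=1.0):
    """L = y*d^2 + (1-y)*max(0, margin - d)^2
    d: Euclidean distance between sentence embeddings.
    y: similarity label, 1 for similar pairs and 0 for dissimilar pairs."""
    return y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2
```

Similar pairs are penalized by their squared distance; dissimilar pairs are penalized only while they remain inside the margin.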
Attention regularization:
An attention regularization loss is introduced: a mean squared error (MSE) term constrains the model's attention distribution toward a target distribution, making the model focus more on domain terms.
The total loss function is L = L_contrastive + α · L_attention, where α weights the regularization term.
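A minimal sketch of the regularization term and the total loss. The target attention distribution and the weighting factor α are illustrative assumptions, since the patent does not specify their values.

```python
def attention_reg_loss(attn, target):
    """MSE between the model's attention distribution and a target
    distribution emphasizing domain terms (target shown here is assumed)."""
    return sum((a - t) ** 2 for a, t in zip(attn, target)) / len(attn)

def total_loss(l_contrastive, l_attention, alpha=0.1):
    """L = L_contrastive + alpha * L_attention; alpha = 0.1 is hypothetical."""
    return l_contrastive + alpha * l_attention

l_att = attention_reg_loss([0.2, 0.6, 0.2], [0.0, 1.0, 0.0])
print(total_loss(1.0, l_att))
```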
Training strategies:
The AdamW optimizer is used together with a learning rate scheduler (e.g., linear decay) to achieve stable convergence.
The attention distribution is adjusted using the domain weights generated from the knowledge table, so the model captures domain semantics more accurately.
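A linear-decay schedule like the one mentioned above can be sketched as follows; the base learning rate of 2e-5 is a typical BERT fine-tuning value assumed here, not stated in the patent.

```python
def linear_decay_lr(step, total_steps, base_lr=2e-5):
    """Linearly decay the learning rate from base_lr down to 0 over training.
    base_lr = 2e-5 is an assumed, typical BERT fine-tuning value."""
    return base_lr * max(0.0, 1.0 - step / total_steps)
```

An optimizer such as AdamW would read this value at every step, e.g. `lr = linear_decay_lr(step, total_steps)` inside the training loop.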
5.2 Testing
Model performance is evaluated on the test set using the following criteria:
Cosine similarity mean: measures how well the sentence embeddings capture semantic similarity.
Precision and recall: verify the model's actual retrieval capability.
F1 score: jointly measures the accuracy and coverage of the retrieval results.
The model is also tested on data sets of the different segment-length versions to analyze the influence of the segmentation strategy on model performance.
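The retrieval metrics used in testing can be sketched for a single query as:

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall, and F1 for one query's retrieved document set."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical query: 3 documents retrieved, 2 actually relevant.
print(precision_recall_f1(["d1", "d2", "d3"], ["d1", "d4"]))
```

In practice these would be averaged over all test queries.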
The beneficial effect of this scheme is that a semantic embedding model for the geotechnical engineering field is successfully constructed through refined data processing and model design. The model not only addresses the low structuring and semantic dispersion of geotechnical text data but also markedly improves the capture of key terms and complex semantics through the domain-guided attention mechanism. In training and testing, the model performs well on semantic similarity evaluation and retrieval tasks, providing strong support for intelligent information processing in geotechnical engineering and greatly promoting the efficient application and sharing of domain knowledge.
Example 2
A memory or server dedicated to storing and processing the data required by the domain-guided BERT-based geotechnical engineering semantic embedding retrieval method of Example 1.
The memory or server is carefully designed to ensure efficient storage of data, fast processing, and smooth operation of the model training and prediction phases.
1. Configuration of memory or server
1.1 Data storage module
Text database: stores the geotechnical text data collected and cleaned in step A, including project reports, engineering case analyses, experimental data, standard specification documents, papers, and technical guidelines. These data are organized in a structured manner to facilitate subsequent processing and retrieval.
Training data set storage: stores the structured training data set constructed in step B, including the segmented short text fragments, the manually annotated question-answer pairs, and the data sets expanded by data augmentation. These data sets provide rich samples for model training.
Knowledge table storage: stores the geotechnical engineering knowledge table constructed in step C, the domain key factor tags, and the test results for different text segment lengths. These tables and tags provide the model with domain knowledge that helps optimize the attention distribution.
Model storage: stores the geotechnical engineering semantic embedding model constructed in step D and its parameters, including the BERT model weights, the configuration of the domain-guided attention mechanism, and the details of the Siamese architecture. These parameters are the core of model operation, ensuring its efficiency and accuracy.
1.2 Data processing module
Training support: during the model training of step E, provides efficient data read/write capability and supports rapid processing of large-scale data sets. Optimized storage structures and access strategies ensure real-time, efficient data processing.
Data preprocessing: implements the preprocessing functions of step C, including domain key factor extraction, segment length adjustment, and data partitioning. These operations improve the effectiveness and efficiency of model training.
1.3 Model deployment and prediction module
Model deployment: in the model testing and prediction stage, serves as the deployment platform: receives user input, executes model inference, and returns retrieval results. Optimized model loading and inference ensure real-time, accurate prediction.
Result display: provides a friendly user interface that displays retrieval results and related information for easy understanding and use.
2. Additional functions of memory or server
2.1 Data encryption and security
The data encryption module is realized, the stored sensitive data is encrypted, and the safety and privacy of the data are ensured. Advanced encryption algorithms and key management policies are employed to protect data from unauthorized access and disclosure.
2.2 Data backup and recovery
And regularly backing up the stored data to prevent the data from being lost. When needed, the data can be quickly recovered, and the continuity and stability of the service are ensured.
2.3 Load balancing and failover
Load balancing and failover functions are implemented to ensure that requests can be automatically distributed to other servers in the event of high concurrent access or server failure. And the expandability and reliability of the system are improved through cluster deployment and load balancing strategies.
4 Performance monitoring and optimization
The running state of the memory and server is monitored in real time, including CPU usage, memory usage, and disk I/O. Performance monitoring and optimization tools detect and resolve performance bottlenecks promptly, optimizing system performance and keeping data processing and model operation smooth.
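A minimal status snapshot of the kind described can be taken with the Python standard library alone; a real monitoring stack (or a library such as psutil) would add per-process memory and disk I/O counters. The function name and returned keys are illustrative:

```python
import os
import shutil

def snapshot(path="/"):
    """Collect a minimal status snapshot: CPU count, disk usage, and load
    average where available. A toy stand-in for a full monitoring tool."""
    du = shutil.disk_usage(path)
    stats = {
        "cpu_count": os.cpu_count(),
        "disk_total_gb": du.total / 1e9,
        "disk_used_pct": 100.0 * du.used / du.total,
    }
    if hasattr(os, "getloadavg"):  # POSIX only
        stats["load_1min"] = os.getloadavg()[0]
    return stats
```

A monitoring loop would call this periodically and alert when, for example, disk_used_pct crosses a threshold.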
3. Effect of the invention
With the memory or server provided in this embodiment, the data required by the geotechnical engineering semantic embedding retrieval method based on domain-guided BERT can be stored and processed efficiently. Beyond data storage, processing, and model training and prediction, it adds data encryption and security, data backup and recovery, load balancing and failover, and performance monitoring and optimization. These functions ensure the security, reliability, and efficiency of the data and provide strong support for practical application of the retrieval method.
In summary, the memory or server provided in embodiment 2 is designed specifically for the geotechnical engineering semantic embedding retrieval method based on domain-guided BERT; it is efficient, secure, reliable, and scalable, and meets the demands of the geotechnical engineering field for semantic retrieval technology.
The above embodiments are only preferred embodiments of the present invention; the scope of the present invention is not limited thereto, and modifications and variations made according to the appended claims and the description fall within its scope of protection.

Claims (10)

1. A geotechnical engineering semantic embedding retrieval method based on domain-guided BERT, characterized by comprising the following steps:
Step A: data source acquisition: collect various text materials in the geotechnical field, including project reports, engineering case analyses, experimental data, standard and specification documents, papers, and technical guides, and perform knowledge extraction and cleaning;
Step B: construct a structured training dataset through segmentation, manual annotation, and sample balancing; segmentation splits long documents into short text fragments along semantic paragraphs or logical sections; manual annotation generates question-answer pairs for the segmented text fragments; sample balancing expands the data volume of rare question types through data augmentation;
Step C: data preprocessing, comprising domain key factor extraction and segment length adjustment: construct a geotechnical engineering knowledge table, use the knowledge table to match terms in the text and generate domain key factor tags that guide the weight allocation of the attention mechanism, and test the impact of different text segment lengths on model performance;
Step D: construct a geotechnical engineering semantic embedding model based on BERT, which dynamically models key domain terms by introducing a Domain-Guided Attention mechanism into BERT's attention mechanism and adopts a Siamese architecture for efficient semantic embedding learning and optimization; the model receives an input sentence pair and preprocesses it with BERT's tokenizer, converting natural language into the input format required by BERT, including the word index sequence, attention mask, and key term tags; two weight-sharing BERT models encode the two sentences independently to generate corresponding feature vector representations; the domain-guided attention mechanism dynamically adjusts the attention scores of key terms, producing an enhanced attention distribution that optimizes the sentence feature representations; mean pooling is applied to each sentence's feature vectors to generate a fixed-size sentence embedding vector; cosine similarity is used to compute the similarity score between the embedding representations of the sentence pair;
Step E: model training, testing, and prediction: design a total loss function combining contrastive loss and attention regularization loss, and train the model with the AdamW optimizer and a learning-rate scheduler; evaluate model performance on the test set using mean cosine similarity, the precision and recall of the semantic retrieval task, and the F1 score; test the model on dataset versions with different segment lengths and analyze the impact of the optimal segmentation strategy on model performance.
2. The geotechnical engineering semantic embedding retrieval method based on domain-guided BERT according to claim 1, characterized in that the knowledge extraction and cleaning in step A further comprises: performing semantic analysis on the collected text materials and removing redundant and erroneous information to ensure their accuracy and consistency.
3. The geotechnical engineering semantic embedding retrieval method based on domain-guided BERT according to claim 1, characterized in that the segmentation in step B further comprises: semantically segmenting documents with natural language processing techniques to ensure the semantic integrity and independence of each short text fragment.
4. The geotechnical engineering semantic embedding retrieval method based on domain-guided BERT according to claim 1, characterized in that the manual annotation in step B further comprises: professionally reviewing the question-answer pairs against the knowledge system of the geotechnical engineering field to ensure the accuracy and professionalism of the annotated content.
5. The geotechnical engineering semantic embedding retrieval method based on domain-guided BERT according to claim 1, characterized in that the data preprocessing in step C further comprises: assigning weights to the domain key factor tags and dynamically adjusting them according to the importance and occurrence frequency of the key terms in the geotechnical engineering field, so as to optimize the distribution of the attention mechanism.
6. The geotechnical engineering semantic embedding retrieval method based on domain-guided BERT according to claim 1, characterized in that the model construction in step D further comprises: adopting the dual-tower design of the Siamese architecture to ensure that the two sentences are embedded consistently in the same semantic space, improving the accuracy and efficiency of semantic similarity computation.
7. The geotechnical engineering semantic embedding retrieval method based on domain-guided BERT according to claim 1, characterized in that the model training in step E further comprises: adjusting the attention distribution with the domain weights generated from the knowledge table, and optimizing the model's semantic understanding and information extraction capability through the total loss function of contrastive loss and attention regularization loss.
8. The geotechnical engineering semantic embedding retrieval method based on domain-guided BERT according to claim 1, characterized in that the model testing and prediction in step E further comprises: evaluating the model's performance on different types of geotechnical engineering problems to verify its effectiveness and reliability in practical applications.
9. A memory or server for storing and processing the data required by the method according to any one of claims 1 to 8, wherein the memory or server is configured to:
store the geotechnical text materials collected and cleaned in step A;
store the structured training dataset constructed in step B, including the segmented short text fragments, the manually annotated question-answer pairs, and the dataset expanded by data augmentation;
store the geotechnical engineering knowledge table constructed in step C, the domain key factor tags, and the test results for different text segment lengths;
store the geotechnical engineering semantic embedding model constructed in step D and its parameters, including the weights of the BERT model, the configuration of the domain-guided attention mechanism, and the details of the Siamese architecture;
provide efficient data read and write capability during the model training of step E, supporting rapid processing of large-scale datasets;
and serve as the model deployment platform in the model testing and prediction stage, receiving user input, performing model inference, and returning retrieval results.
10. The memory or server according to claim 9, characterized in that it further comprises:
a data encryption module for encrypting stored sensitive data to ensure data security and privacy;
a data backup and recovery mechanism that regularly backs up stored data to prevent loss and can quickly restore it when needed;
load balancing and failover functions that automatically distribute requests to other servers during high concurrent access or server failure, guaranteeing service continuity and stability;
and performance monitoring and optimization tools that monitor the running state of the memory and server in real time, including CPU usage, memory usage, and disk I/O, so as to promptly identify and resolve performance bottlenecks and optimize system performance.
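Step E of claim 1 combines a contrastive loss with an attention regularization loss, but the document gives no explicit formulas. The sketch below uses one common cosine-based contrastive formulation and a simple L2 penalty pulling key-term attention toward domain-derived target weights; the margin value, the regularization weight, and all function names are assumptions, not the patented form:

```python
def contrastive_loss(cos_sim, label, margin=0.5):
    """One common cosine-similarity contrastive loss (assumed form):
    similar pairs (label 1) are pushed toward cosine 1; dissimilar pairs
    (label 0) are penalized only when cosine exceeds the margin."""
    if label == 1:
        return 1.0 - cos_sim
    return max(0.0, cos_sim - margin)

def attention_regularization(attn, target, lam=0.1):
    """Simple L2 penalty steering attention scores toward domain target
    weights for key terms (illustrative, not the patented mechanism)."""
    return lam * sum((a - t) ** 2 for a, t in zip(attn, target))

def total_loss(cos_sim, label, attn, target):
    """Total loss of step E: contrastive term plus attention regularization."""
    return contrastive_loss(cos_sim, label) + attention_regularization(attn, target)
```

In training, cos_sim would come from the Siamese encoder's sentence embeddings and attn/target from the domain-guided attention mechanism and the knowledge-table weights.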
CN202411869841.6A 2024-12-18 2024-12-18 A Geotechnical Engineering Semantic Embedding Retrieval Method Based on Domain-Guided BERT Active CN119903168B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411869841.6A CN119903168B (en) 2024-12-18 2024-12-18 A Geotechnical Engineering Semantic Embedding Retrieval Method Based on Domain-Guided BERT


Publications (2)

Publication Number Publication Date
CN119903168A true CN119903168A (en) 2025-04-29
CN119903168B CN119903168B (en) 2025-12-02

Family

ID=95471401


Country Status (1)

Country Link
CN (1) CN119903168B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120470950A (en) * 2025-07-15 2025-08-12 吉林大学 A deep learning prediction method for rock freeze-thaw damage based on text embedding

Citations (4)

Publication number Priority date Publication date Assignee Title
CN114817494A (en) * 2022-04-02 2022-07-29 华南理工大学 Knowledge type retrieval type dialogue method based on pre-training and attention interaction network
CN117633159A (en) * 2023-12-15 2024-03-01 福州大学 An intelligent retrieval method based on large language model and domain ontology
CN117993393A (en) * 2024-03-06 2024-05-07 上海商保通健康科技有限公司 Method, device and system for checking online labeling policy terms based on word and sentence vectors
US20240394291A1 (en) * 2023-05-23 2024-11-28 Accenture Global Solutions Limited Automated domain adaptation for semantic search using embedding vectors


Non-Patent Citations (2)

Title
XUE HAN, YI-TONG WANG, JUN-LAN FENG, CHAO DENG, ZHAN-HENG CHEN, YU-AN HUANG, HUI SU, LUN HU, PENG-WEI HU: "A survey of transformer-based multimodal pre-trained modals", NEUROCOMPUTING, vol. 515, 1 January 2023 (2023-01-01), pages 89 - 106 *
李剑龙: "基于领域知识库的军事文本侦测与分析", 中国优秀硕士学位论文全文数据库工程科技Ⅱ辑, no. 05, 15 May 2024 (2024-05-15), pages 032 - 21 *


Also Published As

Publication number Publication date
CN119903168B (en) 2025-12-02

Similar Documents

Publication Publication Date Title
US20210342404A1 (en) System and method for indexing electronic discovery data
Tang et al. LogSig: Generating system events from raw textual logs
CN115759092A (en) Network threat information named entity identification method based on ALBERT
Verma et al. A novel approach for text summarization using optimal combination of sentence scoring methods
CN118211595A (en) Audit data intelligent analysis system and method
CN118279925B (en) An image-text matching algorithm integrating local and global semantics
CN118467595A (en) Search method, device, equipment, and medium for target domain based on large language model
CN119903168B (en) A Geotechnical Engineering Semantic Embedding Retrieval Method Based on Domain-Guided BERT
CN116166792A (en) Template-based Chinese privacy policy abstract generation method and device
Li et al. CCAH: A CLIP‐Based Cycle Alignment Hashing Method for Unsupervised Vision‐Text Retrieval
CN113901813A (en) An event extraction method based on topic features and implicit sentence structure
CN120873029B (en) Sliding window-based electric power multi-mode corpus construction query method and system
CN117474003A (en) Text semantic mining method and system based on lightweight Transformer
Xiang et al. Aggregating local and global text features for linguistic steganalysis
Soman et al. A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents
CN120994758A (en) A multimodal AI knowledge base construction system for private deployment
CN119739874A (en) A local construction method and device for a vertical large model in the field of environmental monitoring and evaluation
CN120407780A (en) Chapter-level pre-trained scientific literature representation and query method based on contrastive learning
CN119719387A (en) Threat information processing-oriented knowledge graph construction method, system and computer readable storage medium
CN119576987A (en) Multimodal retrieval and enhanced generation method and system for construction scheme based on large model
CN118093900A (en) Cross-modal hash retrieval method for modality-missing image text based on self-supervised learning
CN114580398B (en) Text information extraction model generation method, text information extraction method and device
Bhawna et al. Natural Language Processing Based Two-Stage Machine Learning Model for Automatic Mapping of Activity Codes Using Drilling Descriptions
Wang et al. A new evaluation measure using compression dissimilarity on text summarization
Pepper et al. Metadata verification: a workflow for computational archival science

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant