Detailed Description
The method for analyzing the composition of atmospheric aerosol microbial communities in the embodiments of the application is executed by an electronic device, including but not limited to a server, a personal computer, a notebook computer, a tablet computer, a smartphone and the like. As shown in fig. 1, the method includes an algorithm determination phase and a data analysis phase, wherein:
The algorithm determination phase comprises the following steps:
Step S100: acquiring a microorganism composition analysis algorithm and a high-throughput sequencing sequence template, wherein the training samples in the high-throughput sequencing sequence template comprise supervised training samples and unsupervised training samples, and each supervised training sample comprises a corresponding microorganism species mark.
In step S100, the electronic device acquires a microorganism composition analysis algorithm and collects a high-throughput sequencing sequence template that includes supervised and unsupervised training samples. The microorganism composition analysis algorithm is designed on machine learning or deep learning principles and aims to extract information about microbial community composition from high-throughput sequencing data. In particular, the algorithm may employ architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or attention-based Transformer models, which are suited to processing complex, high-dimensional DNA sequence data.
Taking a convolutional neural network as an example, the algorithm comprises a plurality of convolutional layers for automatically extracting local features (e.g., specific k-mer patterns) from a DNA sequence, pooling layers for reducing the dimensionality and the amount of computation, and fully connected layers for integrating the features and outputting the final microorganism species classification result. During training and optimization of the algorithm (which takes place after step S100), the network weights are adjusted continuously to minimize the prediction error.
Next, the electronic device constructs a high-throughput sequencing sequence template comprising the supervised training samples and the unsupervised training samples. High-throughput sequencing technology sequences large numbers of DNA fragments in parallel in a short time, generating massive amounts of sequence data that form the basis for analyzing microbial community composition.
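As a minimal sketch of how a convolutional filter responds to a k-mer pattern, the following NumPy code one-hot encodes a DNA sequence, slides a single filter over it, and applies global max pooling. The helper names (`one_hot`, `conv1d`) and the hand-set filter for the 3-mer "ACG" are assumptions made for illustration; in a trained CNN the filters would be learned rather than set by hand.

```python
import numpy as np

ALPHABET = "ACGT"  # assumes sequences contain only these four bases


def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    idx = [ALPHABET.index(base) for base in seq]
    encoded = np.zeros((len(seq), len(ALPHABET)))
    encoded[np.arange(len(seq)), idx] = 1.0
    return encoded


def conv1d(x, kernel):
    """Valid 1-D convolution (cross-correlation) of x with one filter."""
    k = kernel.shape[0]
    return np.array([np.sum(x[i:i + k] * kernel) for i in range(len(x) - k + 1)])


# Hypothetical filter that responds most strongly to the 3-mer "ACG".
motif_filter = one_hot("ACG")
activation = conv1d(one_hot("TTACGTT"), motif_filter)
pooled = activation.max()  # global max pooling keeps the strongest response

print(activation.tolist())  # [0.0, 0.0, 3.0, 0.0, 0.0]
print(pooled)               # 3.0
```

The activation peaks at the position where the motif starts, and pooling reduces the whole sequence to one number per filter, which is the kind of feature a fully connected layer would then combine.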
The supervised training samples are sequencing datasets that have been explicitly labeled with microorganism species information. For example, a supervised training sample may contain DNA from a particular environment (e.g., urban atmospheric aerosols) whose microorganism species and relative abundances have already been determined by conventional identification methods (e.g., sequencing after PCR amplification, mass spectrometry, etc.). In the dataset, each sample corresponds to one or more microorganism species marks in the form of labels or classification vectors; e.g., over five candidate microorganisms, [0,1,0,1,0] indicates that the sample contains only the second and fourth microorganisms.
In the context of atmospheric aerosol microbial community analysis, the supervised training samples may originate from previous research projects or from public databases, such as the SRA (Sequence Read Archive) database of NCBI. These samples provide not only DNA sequence data but also detailed microbial classification information, which makes them an indispensable resource in the early stage of algorithm training.
Unlike the supervised training samples, the unsupervised training samples contain only DNA sequence data from high-throughput sequencing, without explicit microorganism species marks. These samples typically originate from a wider collection of environmental samples and may far outnumber the supervised training samples, but they cannot be used directly for algorithm training because labeling them is costly. However, through the cluster analysis and virtual mark generation in step S400, the electronic device can use a small amount of supervision information to mark these unsupervised samples indirectly, thereby greatly expanding the scale of the training data.
For example, for a microbial community analysis of urban atmospheric aerosols, the electronic device downloads 100 annotated samples from the SRA database of NCBI as supervised training samples, each sample containing about 100,000 DNA sequence segments of about 150 bp each. These samples cover common urban atmospheric microorganisms such as bacteria, fungi and viruses, and each microorganism is assigned a unique species mark. Meanwhile, to increase the diversity of the training data, the electronic device also downloads another 1,000 unlabeled atmospheric aerosol samples from the same source as unsupervised training samples. These samples also contain large amounts of DNA sequence data, but cannot be used directly for algorithm training because they lack explicit species information.
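The classification-vector form of a microorganism species mark can be sketched as follows. The function name `encode_species_label` and the placeholder species names are assumptions for illustration only; the point is that the position of each 1 corresponds to a species in a fixed catalogue.

```python
def encode_species_label(present, catalogue):
    """Return a multi-hot vector marking which catalogue species are present."""
    present_set = set(present)
    unknown = present_set - set(catalogue)
    if unknown:
        raise ValueError(f"species not in catalogue: {sorted(unknown)}")
    return [1 if species in present_set else 0 for species in catalogue]


# Hypothetical catalogue of five candidate microorganisms.
catalogue = ["species_1", "species_2", "species_3", "species_4", "species_5"]
label = encode_species_label(["species_2", "species_4"], catalogue)
print(label)  # [0, 1, 0, 1, 0]
```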
In step S100, the electronic device integrates these data into a high-throughput sequencing sequence template, which contains both the annotated supervised training samples and the unsupervised training samples to be annotated.
Step S200: for each training sample, segmenting the training sample to obtain a plurality of DNA information sequences of the training sample.
Step S200 pre-processes the training sample data obtained by high-throughput sequencing so that the subsequent steps can perform implicit characterization extraction and microorganism species analysis more effectively. In this step, the electronic device performs a segmentation operation on each training sample, splitting it into a plurality of smaller DNA information sequences. High-throughput sequencing generates large amounts of DNA sequence data, typically in the form of long reads or short reads. In the analysis of atmospheric aerosol microbial community composition, a sample may contain many microorganisms whose DNA sequences differ in length and composition, so analyzing the whole training sample directly may involve high computational complexity and difficult feature extraction. By segmenting the training samples, the electronic device reduces the complex data set to a series of smaller, more tractable DNA information sequences, thereby improving the efficiency and accuracy of the subsequent analysis.
In step S200, the electronic device segments the training sample as follows. First, the electronic device reads the training sample data to be processed from the storage medium. These data are typically stored in FASTA or FASTQ format and contain a large amount of DNA sequence information. Next, the electronic device determines the length of each segment (i.e., each DNA information sequence). This length may be set according to the requirements of the project and the characteristics of the environment. For example, in atmospheric aerosol microbial community analysis, given the large differences in genome size between microorganisms, a moderate segment length (e.g., 100 bp, 200 bp, or more) may be selected so that each segment contains sufficient genetic information for the subsequent analysis. Once the segment length is determined, the electronic device performs the segmentation operation: it traverses the DNA sequence of the entire training sample and cuts it at the set length. If the sequence length is not an integer multiple of the segment length, the last segment is shorter than the set length. To keep the data consistent, the electronic device may either discard this shorter segment or retain it and handle it appropriately in the subsequent analysis. After the segmentation operation, the electronic device obtains a set of DNA information sequences, which serve as input data for subsequent steps such as implicit characterization extraction and cluster analysis.
For example, assume that a training sample taken from urban atmospheric aerosol has a total DNA sequence length of 1,000,000 bp, and that for the analysis of microbial community composition the electronic device decides to split the training sample into DNA information sequences of 200 bp each.
The electronic device first reads the FASTA file of the training sample to obtain the complete DNA sequence data and sets the segment length to 200 bp. Starting from the first nucleotide of the DNA sequence, the electronic device reads 200 consecutive nucleotides as one DNA information sequence. It then moves to the next position (i.e., the 201st nucleotide) and reads another 200 consecutive nucleotides as the next DNA information sequence. This process is repeated until the entire DNA sequence has been traversed. If the last segment is shorter than 200 bp, the electronic device may discard it (in this example the total length of 1,000,000 bp is an exact multiple of 200 bp, so the last segment is complete and nothing is discarded). After the segmentation operation, the electronic device obtains a set of 5,000 DNA information sequences (1,000,000 bp ÷ 200 bp = 5,000), which are used in the subsequent implicit characterization extraction and microorganism species analysis steps. Through the segmentation operation of step S200, the electronic device reduces the complex high-throughput sequencing data to a series of smaller, more tractable DNA information sequences. This reduces the complexity of the subsequent analysis and makes data processing more flexible. More importantly, the segmented DNA information sequences retain the genetic information of the original sequence, so that the electronic device can extract features of the microbial community composition in the subsequent steps and thereby perform species classification and abundance estimation.
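The fixed-length segmentation described above can be sketched in a few lines. The function name `segment_sequence` and the `keep_remainder` flag are assumptions introduced for illustration; the flag models the choice between discarding and retaining a shorter final segment. Reading the FASTA file itself is omitted, and a synthetic sequence stands in for the sample.

```python
def segment_sequence(sequence, segment_length=200, keep_remainder=False):
    """Cut a DNA string into consecutive, non-overlapping segments."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    segments = [
        sequence[start:start + segment_length]
        for start in range(0, len(sequence), segment_length)
    ]
    # The final segment is shorter when the length is not an exact multiple.
    if segments and len(segments[-1]) < segment_length and not keep_remainder:
        segments.pop()
    return segments


# Synthetic 1,000,000 bp sequence standing in for a training sample.
sample = "ACGT" * 250_000
segments = segment_sequence(sample, segment_length=200)
print(len(segments))  # 5000
```

With `keep_remainder=True`, a 10 bp sequence cut at 4 bp yields three segments, the last of length 2, which a later step would need to pad or otherwise handle.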
Step S300: based on a microorganism composition analysis algorithm, implicit characterization extraction is carried out on the training sample and each DNA information sequence of the training sample, so that an integral implicit characterization array of the training sample and a sequence implicit characterization array corresponding to each DNA information sequence are obtained.
Step S300 performs implicit characterization extraction on the training sample and each of its DNA information sequences by means of the microorganism composition analysis algorithm, so as to obtain an overall implicit characterization array and sequence implicit characterization arrays that reflect the composition characteristics of the microbial community. An implicit characterization (also known as an embedding or feature vector) represents raw data (e.g., a DNA sequence) as a point in a high-dimensional space in a way that captures the inherent relationships and structure of the data. In the analysis of atmospheric aerosol microbial community composition, implicit characterizations can reveal similarities and differences between microorganism species, which supports the subsequent species classification and abundance estimation.
In step S300, the electronic device selects a microorganism composition analysis algorithm for implicit characterization extraction. The algorithm may be based on a traditional machine learning method (such as a support vector machine or a random forest) or on a deep learning method (such as a convolutional neural network (CNN), a recurrent neural network (RNN) or a Transformer model). Given the complexity and high dimensionality of DNA sequence data, deep learning algorithms are generally preferred for their feature extraction capability. Taking a Transformer model as an example, the model captures dependencies between different positions in the DNA sequence through a self-attention mechanism, and thereby generates a richer implicit characterization. During training, the model learns to map the original DNA sequence data into a high-dimensional space such that similar sequences lie closer together and dissimilar sequences lie farther apart.
After the microbial composition analysis algorithm is determined, the electronic device performs implicit characterization extraction on the training sample and each DNA information sequence thereof according to the following steps:
First, the training samples and their DNA information sequences undergo the necessary preprocessing, such as removal of low-quality sequences, removal of adapter sequences and quality trimming, to ensure the quality of the input data. A pre-trained microorganism composition analysis model (e.g., a Transformer model) is then loaded and initialized or fine-tuned as needed. Each training sample (i.e., a data set consisting of multiple DNA information sequences) is input into the model as a whole. Through its multi-layer self-attention mechanism and feedforward neural network, the model outputs a fixed-length vector as the overall implicit characterization array of the training sample. This vector captures the combined features of all DNA information sequences in the training sample.
For example, assume that a training sample contains 100 DNA information sequences, each 200 bp in length. After processing by the Transformer model, the training sample is mapped into a 128-dimensional implicit space, giving a 128-dimensional overall implicit characterization array. Next, the electronic device performs implicit characterization extraction separately for each DNA information sequence, typically by passing the single sequence to the model as input. The model again outputs a fixed-length vector, which is the sequence implicit characterization array of that sequence. This process is repeated until all DNA information sequences have been processed. Continuing the example, each 200 bp DNA information sequence is processed by the Transformer model and mapped into the same 128-dimensional implicit space, giving a corresponding 128-dimensional sequence implicit characterization array. Finally, the electronic device collects the overall implicit characterization arrays of all training samples and the sequence implicit characterization arrays of all DNA information sequences into two separate sets, which serve as input data for the subsequent steps.
After obtaining the overall implicit characterization arrays and the sequence implicit characterization arrays, the electronic device can use them for a variety of analyses. For example, by calculating a distance or similarity (such as Euclidean distance or cosine similarity) between the overall implicit characterization arrays of different training samples, similar training samples can be grouped together, revealing the structural relationships among different microbial communities. By comparing the similarity (e.g., cosine similarity) between the sequence implicit characterization array of a DNA information sequence and the overall implicit characterization array of the training sample to which it belongs, the degree of matching between the sequence and the whole can be assessed, which is critical for the subsequent species classification and abundance estimation. During algorithm training, the electronic device may adjust the parameters of the model according to the quality of the implicit characterizations (e.g., as assessed by a loss function) to optimize the performance of the microorganism composition analysis algorithm.
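To make the data flow of step S300 concrete, the sketch below produces one fixed-length vector per DNA information sequence and one per training sample. It does not implement a Transformer encoder: a normalized 3-mer frequency vector (64 dimensions rather than 128) is swapped in as a stand-in encoder, and the sample-level vector is taken as the mean of the sequence vectors. Both choices, and the names `embed_sequence` and `embed_sample`, are assumptions made purely for illustration.

```python
from itertools import product

import numpy as np

K = 3  # k-mer size of the stand-in encoder (assumption)
KMER_INDEX = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}


def embed_sequence(seq):
    """Stand-in encoder: normalized k-mer frequencies as a 64-dim vector."""
    counts = np.zeros(len(KMER_INDEX))
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])  # k-mers with other symbols are skipped
        if idx is not None:
            counts[idx] += 1
    total = counts.sum()
    return counts / total if total else counts


def embed_sample(sequences):
    """Stand-in sample-level vector: mean of the per-sequence vectors."""
    return np.mean(np.stack([embed_sequence(s) for s in sequences]), axis=0)


sequences = ["ACGTACGTAC", "GGGGCCCCAA"]
sequence_arrays = np.stack([embed_sequence(s) for s in sequences])
overall_array = embed_sample(sequences)
print(sequence_arrays.shape, overall_array.shape)  # (2, 64) (64,)
```

Whatever encoder is used in practice, the property that matters for the later steps is the one shown here: sequence-level and sample-level arrays share the same dimensionality, so they can be compared directly.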
Step S400: performing cluster analysis on the overall implicit characterization array of each training sample in the high-throughput sequencing sequence template according to the microorganism species marks of the supervised training samples, to obtain the microorganism species marks corresponding to the unsupervised training samples.
Step S400 uses the microorganism species marks of the supervised training samples to assign corresponding microorganism species marks to the unsupervised training samples by cluster analysis. This enlarges the set of marked data, which in turn improves the accuracy and reliability of the subsequent analysis.
Cluster analysis is an unsupervised learning method that groups data points into clusters according to their similarity or the distance between them. In atmospheric aerosol microbial community analysis, each data point is the overall implicit characterization array of a training sample, obtained in step S300, which reflects the characteristics of the microbial community in that training sample. The goal of the cluster analysis is to group similar training samples together, each group representing a particular microbial community type.
In step S400, the electronic device uses the microorganism species marks of the supervised training samples as "anchor points", i.e., known, accurate microbial community classification information. Through cluster analysis, the electronic device identifies unsupervised training samples that are similar to these supervised samples and assigns them to the corresponding microbial community types. In this way, microorganism species marks can be obtained for the unsupervised training samples even though they carry no marks of their own.
Specifically, the electronic device first initializes a cluster center for each known microbial community type based on the microorganism species marks of the supervised training samples. The cluster center may be the mean or median of the overall implicit characterization arrays of the supervised training samples of that type, or a representative sample chosen by a seeding procedure (e.g., K-means++).
Next, the electronic device traverses all unsupervised training samples in the high-throughput sequencing sequence template and calculates their distance to each cluster center (e.g., Euclidean distance or cosine distance). Each unsupervised training sample is assigned to the microbial community type of its nearest cluster center.
After the assignment is completed, the electronic device recalculates the center of each cluster from the overall implicit characterization arrays of all training samples (supervised and unsupervised) currently in the cluster. The assignment and update steps are iterated until the cluster centers no longer change significantly or a preset number of iterations is reached. During the iteration, the electronic device may evaluate the quality of the clusters, for example by computing the silhouette coefficient, which measures the compactness and separation of the clusters; the closer the silhouette coefficient is to 1, the better the cluster quality.
Once the cluster analysis has converged, the electronic device outputs the microorganism species mark corresponding to each unsupervised training sample. These marks are assigned automatically from the clustering result, and their reliability depends on how well the clusters are separated. For example, assume the following overall implicit characterization arrays (simplified to two-dimensional vectors) of supervised and unsupervised training samples:
Supervised training samples (known microorganism species marks): sample A: [0.5, 0.8] (labeled "community X"); sample B: [-0.2, 1.1] (labeled "community Y").
Unsupervised training samples (no label): sample C: [0.4,0.9]; sample D: [ -0.1,1.0].
In the initialization phase, the electronic device initializes the cluster centers of community X and community Y from the overall implicit characterization arrays of supervised training samples A and B. The initial centers are therefore [0.5, 0.8] and [-0.2, 1.1], respectively.
In the assignment phase, the electronic device calculates the distance of samples C and D from the two cluster centers. Using Euclidean distance, sample C is about 0.14 from the center of "community X" and about 0.63 from the center of "community Y", so it is assigned to "community X"; likewise, sample D is about 0.14 from the center of "community Y" and about 0.63 from the center of "community X", so it is assigned to "community Y".
In the update phase, the electronic device recalculates the cluster centers from all training samples currently in each cluster. In this simplified example there are only two supervised and two unsupervised training samples, so the centers shift only slightly (taking the mean, to [0.45, 0.85] and [-0.15, 1.05]) and the assignments do not change. In practical applications, with many more unsupervised training samples and several iterations, the cluster centers gradually converge to stable positions.
Finally, in the output phase, the electronic device outputs the microbial species markers of samples C and D as "community X" and "community Y", respectively. The labels are automatically assigned based on the results of the cluster analysis, providing reliable data support for subsequent analysis of the composition of the atmospheric aerosol microbiota.
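The worked example with samples A to D can be reproduced with a short seeded clustering sketch. The function name `propagate_labels` and its parameters are assumptions for illustration; the sketch uses Euclidean distance and mean centers, and keeps the supervised samples fixed in their own clusters as anchors, which is one possible reading of the procedure described above rather than the only one.

```python
import numpy as np


def propagate_labels(labeled_vecs, labeled_tags, unlabeled_vecs, max_iter=100, tol=1e-6):
    """Assign each unlabeled vector the tag of its nearest cluster center."""
    labeled_vecs = np.asarray(labeled_vecs, dtype=float)
    unlabeled_vecs = np.asarray(unlabeled_vecs, dtype=float)
    tags = sorted(set(labeled_tags))
    tag_ids = np.array([tags.index(t) for t in labeled_tags])

    def nearest(centres):
        # Euclidean distance from every unlabeled vector to every center.
        dists = np.linalg.norm(unlabeled_vecs[:, None, :] - centres[None, :, :], axis=2)
        return dists.argmin(axis=1)

    # Initialization: one center per known tag, the mean of its labeled vectors.
    centres = np.stack([labeled_vecs[tag_ids == k].mean(axis=0) for k in range(len(tags))])
    for _ in range(max_iter):
        assigned = nearest(centres)  # assignment phase
        # Update phase: mean over labeled anchors plus currently assigned samples.
        new_centres = np.stack([
            np.concatenate([labeled_vecs[tag_ids == k], unlabeled_vecs[assigned == k]]).mean(axis=0)
            for k in range(len(tags))
        ])
        shift = np.linalg.norm(new_centres - centres)
        centres = new_centres
        if shift < tol:  # centers no longer move: converged
            break
    return [tags[k] for k in nearest(centres)], centres


labels, centres = propagate_labels(
    labeled_vecs=[[0.5, 0.8], [-0.2, 1.1]],    # samples A and B
    labeled_tags=["community X", "community Y"],
    unlabeled_vecs=[[0.4, 0.9], [-0.1, 1.0]],  # samples C and D
)
print(labels)  # ['community X', 'community Y']
```

After convergence the centers are the means of A and C and of B and D, that is [0.45, 0.85] and [-0.15, 1.05].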
Step S500: for each DNA information sequence of the training sample, determining the matching degree between the DNA information sequence and the training sample to which it belongs according to the sequence implicit characterization array corresponding to the DNA information sequence and the overall implicit characterization array of the training sample to which it belongs.
Step S500 evaluates the matching degree between each DNA information sequence in a training sample and the training sample as a whole. This helps to verify the validity of the implicit characterization extraction and provides key information for the subsequent algorithm optimization.
In atmospheric aerosol microbial community composition analysis, each training sample typically contains hundreds or thousands of DNA information sequences that together reflect the genetic characteristics of the microbial community in that sample. However, because of sequencing noise, sequence assembly errors, or the diversity of microorganism species, some DNA information sequences may not be fully consistent with the features of the overall sample. Evaluating the matching degree between a DNA information sequence and the training sample to which it belongs is therefore important for the accuracy and reliability of the analysis. In step S500, the electronic device calculates this matching degree by comparing the sequence implicit characterization array of the DNA information sequence with the overall implicit characterization array of the training sample to which it belongs. The process involves constructing the feature space, selecting a similarity measure, and evaluating the matching degree comprehensively.
First, the electronic device ensures that the implicit characterizations of all DNA information sequences and of the training samples lie in the same feature space. This generally means that their implicit characterization arrays have the same dimensions and consistent scales. In step S300, a corresponding implicit characterization array has already been generated for each DNA information sequence and each training sample by the microorganism composition analysis algorithm (e.g., a Transformer model). These arrays are feature vectors in a high-dimensional feature space.
Next, the electronic device selects an appropriate similarity measure to compare the sequence implicit characterization array of the DNA information sequence with the overall implicit characterization array of the training sample. Common similarity measures include Euclidean distance, cosine similarity and the Pearson correlation coefficient. In atmospheric aerosol microbial community analysis, cosine similarity is a reasonable choice because it normalizes for vector length and compares only the direction of the implicit characterization arrays, even though the arrays themselves may encode complex nonlinear relationships.
The calculation formula of cosine similarity is as follows:
$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
wherein $A$ and $B$ represent the sequence implicit characterization array of the DNA information sequence and the overall implicit characterization array of the training sample respectively, $A \cdot B$ represents their dot product, and $\|A\|$ and $\|B\|$ represent their Euclidean lengths respectively.
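A direct implementation of cosine similarity, as the dot product divided by the product of the Euclidean lengths, might look like the following sketch. The function name and the choice to raise an error for zero-length vectors are assumptions; the source does not say how that case is handled.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of Euclidean norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(a @ b / denom)


# Same direction, different length: similarity is 1 regardless of magnitude.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Orthogonal vectors: similarity is 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The first example shows the length normalization at work: scaling one vector does not change the result.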
A single similarity value may not fully reflect the degree of matching between a DNA information sequence and its training sample. The electronic device may therefore adopt a more elaborate evaluation strategy that integrates several factors, for example as follows.
First, verify whether the sequence implicit characterization of the DNA information sequence agrees with the overall implicit characterization of the training sample to which it belongs on key features. This may be done by comparing the values of the two in particular dimensions or by checking whether their projections onto a predefined subspace are similar.
Second, assign a confidence score to the sequence implicit characterization of each DNA information sequence, reflecting how much that characterization contributes to the overall sample features. Confidence may be determined by factors such as sequence length, sequencing depth and sequencing quality. The confidence score is then incorporated into the matching degree as a weight.
Third, analyze the matching degree between the DNA information sequence and the training sample at different scales. For example, comparisons can be made at the level of individual nucleotides, k-mer fragments, and genes or operons, to obtain a more comprehensive assessment.
For example, assume that the overall implicit characterization array of a training sample is $V_{sample} = [v_1, v_2, \ldots, v_n]$ and that the sequence implicit characterization array of one DNA information sequence is $V_{seq} = [s_1, s_2, \ldots, s_n]$. The matching degree between them can be calculated as follows. First, the similarity value $\cos(V_{seq}, V_{sample})$ is calculated using the cosine similarity formula. Then, assuming that the DNA information sequence is assigned a confidence score $c$ (determined from factors such as its sequencing quality), the weighted matching degree may be expressed as $c \times \cos(V_{seq}, V_{sample})$. These steps can be repeated at different scales, for example by first comparing single-nucleotide characterizations and then k-mer fragment characterizations, and the matching degree results of the individual scales are finally integrated.
Through the matching degree evaluation of step S500, the electronic device can identify DNA information sequences that are inconsistent with the overall sample features; these may be outliers caused by sequencing errors, assembly errors, or microbial diversity. In subsequent steps, this information can be used for further data cleaning, anomaly detection or algorithm optimization, to improve the accuracy and reliability of the analysis of atmospheric aerosol microbial community composition.
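The confidence-weighted matching degree and its use for spotting inconsistent sequences can be sketched as below. The function names, the assumption that confidence scores lie in [0, 1], and the threshold of 0.5 are all hypothetical choices made for illustration; the source does not fix any of them.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def weighted_match(v_seq, v_sample, confidence):
    """Matching degree c * cos(V_seq, V_sample), with c assumed to lie in [0, 1]."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence is assumed to lie in [0, 1]")
    return confidence * cosine(v_seq, v_sample)


def flag_inconsistent(seq_vectors, v_sample, confidences, threshold=0.5):
    """Return indices whose matching degree falls below a hypothetical threshold."""
    scores = [weighted_match(v, v_sample, c) for v, c in zip(seq_vectors, confidences)]
    flagged = [i for i, s in enumerate(scores) if s < threshold]
    return flagged, scores


v_sample = [1.0, 0.0]
seq_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
confidences = [1.0, 1.0, 0.5]
flagged, scores = flag_inconsistent(seq_vectors, v_sample, confidences)
print(flagged)  # [1, 2]
```

Here the second sequence is flagged because it is orthogonal to the sample vector, and the third because its moderate similarity is further reduced by a low confidence score.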
Step S600: optimizing the algorithm parameters of the microorganism composition analysis algorithm according to the matching degree and the microorganism species marks corresponding to the unsupervised training samples to obtain a target microorganism composition analysis algorithm, wherein the target microorganism composition analysis algorithm is used for analyzing the microbial community species of a target high-throughput sequencing sequence to obtain the microorganism species mark of the target high-throughput sequencing sequence.
Step S600 optimizes the parameters of the microorganism composition analysis algorithm according to the matching degree calculated in the previous step and the virtual microorganism species mark of the unsupervised training sample, thereby obtaining a more accurate and efficient target algorithm.
In step S600, the electronic device optimizes the parameters of the microorganism composition analysis algorithm so that the algorithm assigns microorganism species marks to high-throughput sequencing sequences more accurately. To achieve this, the electronic device considers several factors together, including the real marks of the supervised training samples, the virtual marks of the unsupervised training samples, and the matching degree between the DNA information sequences and the overall samples.
The optimization procedure comprises, for example, the following steps: defining a loss function, calculating the loss value, and updating the parameters by gradient descent with back propagation. The loss function is the key to the optimization process; it quantifies the inconsistency between the results predicted by the algorithm and the actual results. In atmospheric aerosol microbial community analysis, the loss function can be designed as a weighted sum of classification errors and matching degree errors.
Let $Y$ be the set of real microorganism species marks of the supervised training samples, $\hat{Y}$ the set of microorganism species marks predicted by the algorithm, $M$ the set of virtual microorganism species marks of the unsupervised training samples, and $P$ the set of matching degrees between the DNA information sequences and the training samples to which they belong. The loss function $L$ may be defined as:
$$L = \alpha \, L_{CE}(Y, \hat{Y}) + \beta \, L_{MSE}(M, \hat{M}) + \gamma \, L_{MSE}(P, P^{*})$$
wherein $L_{CE}(Y, \hat{Y})$ is the cross-entropy loss on the supervised training samples, used for measuring the classification error; $L_{MSE}(M, \hat{M})$ is the mean square error loss on the unsupervised training samples, used for measuring the prediction error of the virtual marks; $\hat{M}$ is the prediction result of the algorithm on the unsupervised samples; $L_{MSE}(P, P^{*})$ is the mean square error loss of the matching degree, used for measuring the consistency of the matching degree between the DNA information sequences and the training samples to which they belong; $P^{*}$ may be a target or desired value of the matching degree; and $\alpha$, $\beta$, $\gamma$ are weight coefficients used to balance the importance of the different loss terms. After defining the loss function, the electronic device optimizes the parameters of the microorganism composition analysis algorithm iteratively to minimize the loss function value. This typically involves gradient descent with back propagation.
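A weighted sum of a cross-entropy term on the supervised samples and two mean-square-error terms (virtual marks and matching degree) can be sketched as follows. The function names, argument names and default weights (1.0, 0.5, 0.1) are assumptions for illustration and are not values given in the source.

```python
import numpy as np


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross-entropy between one-hot marks and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))


def mse(target, predicted):
    """Mean square error between two arrays of equal shape."""
    diff = np.asarray(target, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(diff ** 2))


def total_loss(y_true, y_prob, m_virtual, m_pred, p_match, p_target,
               alpha=1.0, beta=0.5, gamma=0.1):
    """Weighted sum: alpha * CE + beta * MSE(virtual marks) + gamma * MSE(matching)."""
    return (alpha * cross_entropy(y_true, y_prob)
            + beta * mse(m_virtual, m_pred)
            + gamma * mse(p_match, p_target))


loss = total_loss(
    y_true=[[1, 0]], y_prob=[[0.5, 0.5]],     # supervised term: ln 2
    m_virtual=[[1, 0]], m_pred=[[0.0, 0.0]],  # virtual-mark term: 0.5
    p_match=[0.8], p_target=[1.0],            # matching-degree term: 0.04
)
```

In this toy input the three terms are ln 2, 0.5 and 0.04, so the weighted total is ln 2 + 0.25 + 0.004.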
Taking the gradient descent algorithm as an example, the optimization process is as follows:
1. Initializing parameter values: all parameters of the algorithm (such as the weights and biases of the neural network) are randomly initialized.
2. Forward propagation: the training samples (both supervised and unsupervised) are input into the microorganism composition analysis algorithm under the current parameters, and forward propagation is performed to obtain the predicted microorganism species marks and matching degrees.
3. Calculating the loss: the loss function L is calculated from the forward propagation result and the real marks, virtual marks and matching degree target values.
4. Back propagation: the gradient of the loss function with respect to each parameter is calculated by the chain rule. These gradients indicate how each parameter should be adjusted to reduce the loss value.
5. Updating parameter values: the value of each parameter is updated according to the calculated gradient and a preset learning rate. Common update rules include SGD (stochastic gradient descent), Adam, etc.
6. Iterative optimization: steps 2 to 5 are repeated until the loss function value converges to a sufficiently small range or a preset number of iterations is reached.
For example, assume that during the optimization process the electronic device finds that a particular weight $w$ has a large impact on the virtual marker prediction for the unsupervised training samples. In a certain iteration, the loss function value obtained by forward propagation is $L_{current}$, and the gradient with respect to $w$ obtained by back propagation is $\partial L_{current} / \partial w$. According to this gradient, the electronic device updates the value of $w$ with the preset learning rate $\eta$:
$$w_{new} = w - \eta \, \frac{\partial L_{current}}{\partial w}$$
The updated weight $w_{new}$ is used in the forward propagation of the next iteration, with the aim of obtaining a smaller loss function value.
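By way of non-limiting illustration, the iterative update of a single weight can be sketched as follows. The one-parameter squared-error loss, the function names, and the values of the learning rate and stopping tolerance are assumptions chosen only so that the sketch is self-contained; an actual embodiment would obtain the gradient by back-propagation through the whole network.

```python
def toy_loss(w, x, y):
    """Squared error of a one-weight linear model (stand-in for the real loss)."""
    return (w * x - y) ** 2


def toy_gradient(w, x, y):
    """Analytic derivative of toy_loss with respect to w."""
    return 2.0 * x * (w * x - y)


def gradient_step(w, x, y, eta):
    """One update: w_new = w - eta * dL/dw."""
    return w - eta * toy_gradient(w, x, y)


def gradient_descent(w, x, y, eta=0.1, max_iters=100, tol=1e-10):
    """Repeat the update until the loss is small enough or max_iters is reached."""
    for _ in range(max_iters):
        if toy_loss(w, x, y) < tol:
            break
        w = gradient_step(w, x, y, eta)
    return w


w_first = gradient_step(0.0, x=1.0, y=2.0, eta=0.1)   # one update starting from w = 0
w_final = gradient_descent(0.0, x=1.0, y=2.0)         # approaches the minimizer w = 2
```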
Through the optimization process of step S600, the electronic device gradually adjusts the parameters of the microorganism composition analysis algorithm, so that the algorithm predicts microorganism species markers more accurately and efficiently. The optimization process considers not only the real markers of the supervised training samples but also the virtual markers of the unsupervised training samples and the matching degrees between the DNA information sequences and the training samples to which they belong, thereby improving the generalization capability and robustness of the algorithm.
Through the above steps S100 to S600, training of the target microorganism composition analysis algorithm is completed. The following describes the data analysis phase, specifically including the steps of:
Step S700: obtaining a target high-throughput sequencing sequence of a target atmospheric aerosol sample of a target area; the target high-throughput sequencing sequence is obtained by extracting DNA from the target atmospheric aerosol sample, amplifying the DNA by polymerase chain reaction, and performing high-throughput sequencing.
In step S700, the electronic device is responsible for obtaining the target atmospheric aerosol sample of the target area and for having the sample processed through a series of molecular biology techniques, finally obtaining a target high-throughput sequencing sequence for subsequent analysis. First, the electronic device determines the target area to be analyzed, typically according to the research objective, environmental factors, or specific health concerns. For example, the target area may be an urban area, an industrial area, a farm, or a nature reserve. Once the target area is determined, an atmospheric aerosol sample of that area needs to be collected. Atmospheric aerosol samples are typically collected with specialized sampling equipment, such as an atmospheric particulate sampler, which is capable of collecting airborne particulate matter. During sampling, the sampler automatically collects the aerosol particles in a certain volume of air according to a preset flow rate and sampling time. The collected samples are typically stored in sterile containers and immediately refrigerated or cryopreserved to prevent loss of microbial activity.
Next, the electronic device directs the experimenter to extract the DNA of the microorganisms from the collected atmospheric aerosol sample. This step is the basis for the subsequent molecular biological analysis. DNA extraction typically involves physical and chemical methods aimed at disrupting the microbial cell walls and releasing and purifying the DNA within the cells.
In practice, common DNA extraction methods include chemical lysis, physical disruption, and commercial DNA extraction kits. For example, a DNA extraction kit containing a cell lysis buffer, proteinase K, and a wash buffer may be used. The experimenter adds the sample to the lysis buffer and ruptures the microbial cells by heating and shaking, thereby releasing the DNA. Impurities are then removed through centrifugation and washing steps, finally yielding a relatively pure microbial DNA solution.
Since the content of microorganisms in an atmospheric aerosol sample is often low, direct high-throughput sequencing may not obtain sufficient sequence information. Therefore, it is often necessary to perform Polymerase Chain Reaction (PCR) amplification of the extracted DNA to increase the number of target DNA fragments prior to sequencing.
PCR is a method for rapid in vitro amplification of specific DNA fragments. Under specific temperature conditions, a DNA polymerase uses a single DNA strand as a template and a primer as the starting point to synthesize a DNA strand complementary to the template. In the analysis of atmospheric aerosol microbial communities, commonly used PCR primers are designed for the conserved sequences of the 16S rRNA gene or the ITS (internal transcribed spacer) region of microorganisms. These sequences exist in most microorganisms and differ only slightly between them, which makes them suitable as amplification targets.
For example, for bacterial community analysis, universal primers directed to the V3-V4 region of the 16S rRNA gene can be selected for PCR amplification. The amplified product contains a large number of copies of 16S rRNA gene fragments, which reflect the different bacterial species present in the sample.
The DNA fragments amplified by PCR require high throughput sequencing to obtain detailed sequence information. The high-throughput sequencing technology can perform parallel sequencing on millions or even billions of DNA fragments in a short time, and greatly improves sequencing efficiency and data volume.
In the context of atmospheric aerosol microbial community analysis, common high-throughput sequencing platforms include the Illumina MiSeq, HiSeq, and NovaSeq. These platforms use the sequencing-by-synthesis principle: DNA fragments are immobilized on a chip, and sequencing proceeds in cycles by adding fluorescently labelled dNTPs (deoxyribonucleoside triphosphates). In each cycle, only the dNTP complementary to the template base can bind to the end of the growing DNA strand and emit a fluorescent signal, and the base type at that position is determined by capturing these signals with an optical detection device.
After a number of sequencing cycles, the electronic device collects and analyzes all the fluorescent signals and assembles them into complete DNA sequences. These sequences contain the genetic information of the microbial community in the sample and are the basis for the subsequent bioinformatic analysis.
Finally, the electronic device performs quality control and filtering on the raw sequencing data (usually files in FASTQ format) output by the high-throughput sequencing platform, removing low-quality sequences, adapter sequences, duplicate sequences, and the like. The cleaned data is used to generate the set of target high-throughput sequencing sequences, which serves as the input data of the subsequent bioinformatics analysis process.
For example, assuming that the target area is an urban industrial area, the electronic device instructs the experimenter to collect an atmospheric aerosol sample in that area. Subsequently, the experimenter extracts the microbial DNA from the sample using a commercial DNA extraction kit and designs specific primers for the V3-V4 region of the 16S rRNA gene for PCR amplification. The amplified products are purified, quantified, and sent to a high-throughput sequencing platform for sequencing. Finally, the electronic device collects and cleans the sequencing data, producing a set of millions of high-throughput sequencing sequences that represents the genetic diversity of the microbial community in the atmospheric aerosol sample of the industrial area.
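By way of non-limiting illustration, the quality control and duplicate removal described above can be sketched as follows for reads that are already held in memory as FASTQ text lines. The four-line record layout and the Phred+33 quality encoding are common conventions that are assumed here; the function names and the thresholds are hypothetical, and adapter trimming is omitted from the sketch.

```python
def parse_fastq(lines):
    """Group FASTQ text lines into (header, sequence, quality) records of four lines each."""
    records = []
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = (lines[i + k].strip() for k in range(4))
        records.append((header, seq, qual))
    return records


def mean_phred(qual, offset=33):
    """Mean Phred quality score, assuming Phred+33 ASCII encoding."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def clean_reads(records, min_mean_q=20, min_len=4):
    """Drop short, malformed, low-quality and exactly duplicated reads."""
    seen = set()
    kept = []
    for header, seq, qual in records:
        if len(seq) < min_len or len(seq) != len(qual):
            continue
        if mean_phred(qual) < min_mean_q:
            continue
        if seq in seen:
            continue
        seen.add(seq)
        kept.append((header, seq, qual))
    return kept


raw_lines = [
    "@r1", "ACGTACGT", "+", "IIIIIIII",   # high quality, kept
    "@r2", "ACGTACGT", "+", "IIIIIIII",   # exact duplicate of r1, removed
    "@r3", "TTTTGGGG", "+", "########",   # low quality, removed
]
cleaned = clean_reads(parse_fastq(raw_lines))
```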
Through implementation of step S700, the electronic device converts the atmospheric aerosol sample of the target area into high-throughput sequencing sequence data that can be used for subsequent bioinformatics analysis, laying a data foundation for analysis of subsequent microbial community composition.
Step S800: loading the target high-throughput sequencing sequence into the target microorganism composition analysis algorithm to obtain the microorganism species markers of the target atmospheric aerosol sample.
In step S800, the electronic device loads the target high-throughput sequencing sequence acquired and processed in the target area into the target microorganism composition analysis algorithm after optimization training, so as to identify and mark the species of the microorganism community in the target atmospheric aerosol sample.
First, the electronic device retrieves and loads the target high-throughput sequencing sequence from a storage medium (such as a hard disk, a server, or cloud storage). These sequences are obtained in step S700 by DNA extraction, PCR amplification, and high-throughput sequencing of the target atmospheric aerosol sample, and contain the genetic information of the microbial community in the sample.
Next, the electronic device inputs the loaded target high-throughput sequencing sequence into the target microorganism composition analysis algorithm. This algorithm is obtained by the optimization training in step S600 and can accurately map high-throughput sequencing sequences onto specific microorganism species.
After the model has completed forward propagation, the electronic device assigns one or more microorganism species markers to each target high-throughput sequencing sequence based on the prediction result of the output layer. This typically involves comparing the predicted probability or score with a preset threshold, or selecting the species with the highest probability as the marker.
For example, assume that the target microorganism composition analysis algorithm is based on a trained Transformer model that generates, at the output layer, a predicted probability for each possible microorganism species. The electronic device may set a threshold (e.g., 0.5), and a species is assigned as a marker to the corresponding sequencing sequence only if its predicted probability exceeds the threshold. If the predicted probabilities of multiple species all exceed the threshold, the species with the highest probability may be selected as the primary marker, or all species exceeding the threshold may be listed together as multiple markers.
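By way of non-limiting illustration, the threshold-based assignment can be sketched as follows. The function name and the dictionary representation of the output layer are hypothetical. The multi-marker branch assumes that the output scores of the species are independent (for example, sigmoid outputs), since probabilities that sum to 1 cannot exceed a threshold of 0.5 for more than one species.

```python
def assign_markers(probabilities, threshold=0.5, multi_label=False):
    """Return the species markers for one sequence.

    probabilities maps species name -> predicted probability or score.
    An empty list means that no species exceeded the threshold.
    """
    above = {species: p for species, p in probabilities.items() if p > threshold}
    if not above:
        return []
    if multi_label:
        # All species above the threshold, highest score first.
        return sorted(above, key=above.get, reverse=True)
    # Only the single most probable species as the primary marker.
    return [max(above, key=above.get)]


primary = assign_markers({"A": 0.7, "B": 0.2, "C": 0.1})
unassigned = assign_markers({"A": 0.3, "B": 0.3, "C": 0.4})
multiple = assign_markers({"A": 0.8, "B": 0.6, "C": 0.1}, multi_label=True)
```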
After the microorganism species markers of all target high-throughput sequencing sequences are obtained, the electronic device can further perform a comprehensive analysis of these markers to reveal the composition and structure of the microbial community in the target atmospheric aerosol sample. This includes calculating the relative abundance of the different species, analyzing the co-occurrence relationships between species, identifying dominant or indicator species in the community, and the like.
For example, assume that the target atmospheric aerosol sample comes from an urban park area and that a series of processing steps yields a dataset containing 100,000 high-throughput sequencing sequences. The electronic device loads these sequences into a Transformer-based target microorganism composition analysis algorithm. The algorithm outputs the predicted probability of the microorganism species corresponding to each sequence, and a microorganism species marker is assigned to each sequence according to a preset threshold (e.g., 0.5). Finally, the electronic device comprehensively analyzes these markers and finds that three microorganism species are mainly present in the sample: bacteria of type A (40% relative abundance), fungi of type B (30% relative abundance), and actinomycetes of type C (20% relative abundance), with the remaining species having relatively low abundance. This result reveals the major components of the microbial community in the atmospheric aerosol of the urban park area.
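By way of non-limiting illustration, the relative abundance of each species can be computed from the assigned markers as sketched below. The function name and the ten-sequence toy input are hypothetical; the proportions merely mirror the 40%, 30%, and 20% figures of the example above.

```python
from collections import Counter


def relative_abundance(markers):
    """Map each species marker to its share of all marked sequences, most abundant first."""
    counts = Counter(markers)
    total = sum(counts.values())
    return {species: n / total for species, n in counts.most_common()}


# One marker per sequence; ten sequences stand in for the full dataset.
markers = ["A"] * 4 + ["B"] * 3 + ["C"] * 2 + ["D"]
abundance = relative_abundance(markers)
dominant_species = next(iter(abundance))
```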
In a possible implementation, step S400, performing cluster analysis on the overall implicit characterization array of each training sample in the high-throughput sequencing sequence template according to the microorganism species markers of the supervised training samples to obtain the microorganism species markers corresponding to the unsupervised training samples, may include:
Step S400A: setting a plurality of cluster centroids, and determining the commonality metric value of the integral implicit characterization array of each training sample in the high-throughput sequencing sequence template and each cluster centroid;
Step S400B: for each training sample, determining a target cluster centroid matched with the training sample in each cluster centroid according to the commonality metric value and the microorganism species mark of the supervision training sample, and classifying the training sample into a high-throughput sequencing sequence cluster corresponding to the target cluster centroid;
Step S400C: for the high-throughput sequencing sequence cluster corresponding to each cluster centroid, determining the training sample in the high-throughput sequencing sequence cluster that meets a preset centroid condition, and using that training sample as the updated cluster centroid;
Step S400D: returning to the step of determining the commonality metric value between the overall implicit characterization array of each training sample in the high-throughput sequencing sequence template and each cluster centroid, and iterating until the obtained cluster centroids meet the cluster analysis stopping requirement; and, for the target high-throughput sequencing sequence cluster corresponding to each cluster centroid that meets the cluster analysis stopping requirement, determining the microorganism species markers of the unsupervised training samples in that cluster according to the microorganism species markers of the supervised training samples in that cluster.
Step S400 uses the microorganism species markers of the supervised training samples to assign corresponding microorganism species markers to the unsupervised training samples by means of cluster analysis. In this embodiment, step S400 is subdivided into four sub-steps (S400A to S400D) to achieve this goal.
In step S400A, the electronic device sets a plurality of cluster centroids, the number of which generally matches the number of microorganism species expected in the analysis. For example, if 5 major microorganism species are expected, 5 cluster centroids are set. Initially, these centroids may be selected randomly, or their initial positions may be optimized with a heuristic algorithm (e.g., k-means++).
Next, the electronic device calculates, for each training sample in the high-throughput sequencing sequence template, a commonality metric value between its overall implicit characterization array and each cluster centroid. The commonality metric value is used to evaluate the similarity or distance between the training sample and the cluster centroid. Common commonality metrics include Euclidean distance, Manhattan distance, cosine similarity, and the like. In the context of atmospheric aerosol microbial community analysis, cosine similarity is a reasonable choice because the implicit characterization arrays are typically high-dimensional and cosine similarity is normalized with respect to vector length.
For example, assume that there are 3 cluster centroids, denoted $C_1, C_2, C_3$, each of which is a vector of the same dimension as the overall implicit characterization array of a training sample. For example, $C_1 = [0.5, 0.3, -0.2, \ldots]$, where the ellipsis represents the values of the other dimensions. Meanwhile, assume that the overall implicit characterization array of one training sample is $X = [0.6, 0.25, -0.15, \ldots]$. With cosine similarity as the commonality metric value, the calculation formula is:
$$\mathrm{sim}(C_i, X) = \frac{C_i \cdot X}{\|C_i\| \, \|X\|}$$
Where $C_i \cdot X$ represents the dot product of the vectors $C_i$ and $X$, and $\|C_i\|$ and $\|X\|$ represent their Euclidean lengths, respectively. The cosine similarity with $X$ is calculated for each cluster centroid $C_1, C_2, C_3$, yielding three commonality metric values. Example calculation (taking $C_1$ as an example):
$$\mathrm{sim}(C_1, X) = \frac{0.5 \times 0.6 + 0.3 \times 0.25 + (-0.2) \times (-0.15) + \cdots}{\sqrt{0.5^2 + 0.3^2 + (-0.2)^2 + \cdots} \; \sqrt{0.6^2 + 0.25^2 + (-0.15)^2 + \cdots}}$$
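By way of non-limiting illustration, the cosine similarity can be computed as sketched below. Because the remaining dimensions of the example vectors are elided, the sketch uses only the three components that are shown, so the resulting number illustrates the calculation and is not the value for the full arrays. The function name is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


# Assumption: only the three displayed components of C1 and X are used.
c1_shown = [0.5, 0.3, -0.2]
x_shown = [0.6, 0.25, -0.15]
similarity = cosine_similarity(c1_shown, x_shown)  # approximately 0.985
```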
In step S400B, the electronic device determines a target cluster centroid for each training sample according to the commonality metric values calculated in step S400A and the microorganism species markers of the supervised training samples, and classifies the training sample into the high-throughput sequencing sequence cluster corresponding to that centroid. The allocation principle is to select, as the target centroid, the cluster centroid with the highest commonality metric value (for a similarity metric) or the lowest value (for a distance metric). If the training sample is a supervised training sample, its microorganism species marker is used to verify the accuracy of the allocation result and may be used for the centroid update in a subsequent step. Unsupervised training samples have no direct species marker, so their allocation is based only on the commonality metric values with the cluster centroids. For example, continuing the example of the previous step, assume that the calculated cosine similarities of $X$ are 0.9 with $C_1$, 0.7 with $C_2$, and 0.8 with $C_3$. The training sample $X$ is then assigned to the sequence cluster corresponding to cluster centroid $C_1$, which has the highest commonality metric value. If $X$ is a supervised training sample and its known microorganism species marker is consistent with the species represented by $C_1$, the allocation is successful; if not, the cluster centroid position needs to be adjusted or the commonality metric needs to be reconsidered.
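By way of non-limiting illustration, the allocation rule for a similarity metric can be sketched as follows, using the commonality metric values 0.9, 0.7, and 0.8 of the example. The function names and the mapping from centroids to species are hypothetical.

```python
def assign_to_centroid(similarities):
    """Return the centroid with the highest similarity (for a distance metric, use min instead)."""
    return max(similarities, key=similarities.get)


def allocation_is_consistent(assigned_centroid, known_marker, centroid_species):
    """For a supervised sample, check the known marker against the species the centroid represents."""
    return centroid_species[assigned_centroid] == known_marker


similarities_of_x = {"C1": 0.9, "C2": 0.7, "C3": 0.8}
target = assign_to_centroid(similarities_of_x)

# Hypothetical mapping of centroids to the species they currently represent.
centroid_species = {"C1": "species A", "C2": "species B", "C3": "species C"}
consistent = allocation_is_consistent(target, "species A", centroid_species)
```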
In step S400C, the electronic device determines a new centroid position for the high-throughput sequencing sequence cluster corresponding to each cluster centroid. The new centroid position is typically calculated as the mean or median of the overall implicit characterization arrays of all training samples within the cluster. In practical applications, however, more complex centroid update strategies, such as weighted averages or methods based on robust statistics, may be employed to speed up convergence and reduce the effect of outliers. For supervised training samples, their microorganism species markers may be used to guide the centroid update, for example by giving a higher weight to supervised samples whose species is consistent with the species currently represented by the centroid. Unsupervised training samples have no direct species marker, so they typically take part in the centroid update on an equal footing with the supervised samples. For example, assume that the sequence cluster corresponding to cluster centroid $C_1$ includes a plurality of training samples (both supervised and unsupervised), whose overall implicit characterization arrays are $X_1, X_2, \ldots, X_n$, respectively. The new centroid position $C_1'$ can be obtained as the mean of these characterization arrays:
$$C_1' = \frac{1}{n} \sum_{i=1}^{n} X_i$$
Where $n$ is the number of samples within the cluster.
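By way of non-limiting illustration, the mean-based centroid update and a weighted variant can be sketched as follows. The function names and the example weights are hypothetical; how much extra weight a marker-consistent supervised sample receives is a design choice that the embodiment leaves open.

```python
def mean_centroid(vectors):
    """Component-wise mean of the characterization arrays in one cluster."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def weighted_centroid(vectors, weights):
    """Component-wise weighted mean; larger weights pull the centroid towards those samples."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total for d in range(dim)]


plain = mean_centroid([[1.0, 2.0], [3.0, 4.0]])
# Assumption: the second sample is a marker-consistent supervised sample with triple weight.
weighted = weighted_centroid([[0.0, 0.0], [4.0, 4.0]], weights=[1.0, 3.0])
```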
In step S400D, the electronic device repeatedly executes steps S400A to S400C until the stopping requirement of the cluster analysis is satisfied. The stopping requirement may be based on a number of conditions, such as the change in centroid position being less than a certain threshold, the number of iterations reaching a preset upper limit, or the intra-cluster variance decreasing to an acceptable level. Once the cluster analysis has converged, the electronic device assigns species markers to the unsupervised training samples based on the cluster centroids that meet the stopping requirement and the microorganism species markers of the supervised training samples in the corresponding clusters. Specifically, each unsupervised training sample is marked with the microorganism species that occurs most frequently among the supervised training samples in the sequence cluster to which it belongs (or with a species determined according to some weighting strategy). For example, assume that after multiple iterations the cluster analysis converges and yields stable cluster centroid positions. For each cluster centroid, the electronic device examines the supervised training samples in the corresponding sequence cluster and counts the frequency of occurrence of each microorganism species. For example, in the sequence cluster corresponding to $C_1$, supervised samples of species A occur 80 times and supervised samples of species B occur 20 times. The unsupervised training samples in this cluster are therefore marked as species A. Finally, with this embodiment, the electronic device assigns microorganism species markers to all unsupervised training samples in the high-throughput sequencing sequence template, and these markers are used for algorithm optimization and data analysis in subsequent steps.
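By way of non-limiting illustration, the majority-based marking of the unsupervised training samples within one converged cluster can be sketched as follows, using the counts of 80 and 20 from the example. The function name and the representation of a cluster as a list of (sample identifier, marker) pairs, with None for an unsupervised sample, are hypothetical.

```python
from collections import Counter


def mark_unsupervised(cluster_members):
    """Assign the most frequent supervised marker in the cluster to every unmarked member.

    cluster_members is a list of (sample_id, marker) pairs; marker is None for unsupervised samples.
    """
    counts = Counter(marker for _, marker in cluster_members if marker is not None)
    if not counts:
        return {}  # no supervised sample in this cluster, so nothing can be assigned
    majority = counts.most_common(1)[0][0]
    return {sample_id: majority for sample_id, marker in cluster_members if marker is None}


cluster_c1 = (
    [(f"s{i}", "species A") for i in range(80)]
    + [(f"s{i}", "species B") for i in range(80, 100)]
    + [("u1", None), ("u2", None)]
)
assigned = mark_unsupervised(cluster_c1)
```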
As another embodiment, step S400, performing cluster analysis on the overall implicit characterization array of each training sample in the high-throughput sequencing-sequence template according to the microorganism species markers of the supervised training samples, to obtain microorganism species markers corresponding to the unsupervised training samples, may include:
Step S410: determining a sample commonality metric value between every two training samples according to the overall implicit characterization arrays of the two training samples;
Step S420: generating a high-throughput sequencing sequence association map according to the sample commonality metric values, wherein the high-throughput sequencing sequence association map represents the correlation among the training samples;
Step S430: performing marker diffusion on the microorganism species markers of the supervised training samples according to the high-throughput sequencing sequence association map to obtain the microorganism species markers corresponding to the unsupervised training samples.
The other embodiment of step S400 adopts a cluster analysis method based on graph theory, and uses the microbial species markers of the supervised training samples to perform marker diffusion by constructing a correlation map of the high-throughput sequencing sequence, so as to assign corresponding species markers to the unsupervised training samples.
In step S410, the electronic device calculates a sample commonality metric between each pair of training samples in the high throughput sequencing sequence template. The commonality metric value is used for quantifying the similarity or distance between two training samples, and is the basis for constructing a high-throughput sequencing sequence association map. Common commonality measures include euclidean distance, manhattan distance, cosine similarity, and the like. In the context of atmospheric aerosol microbiota analysis, cosine similarity may be a more suitable choice considering the high-dimensional and non-linear nature of the implicit characterization array.
For example, suppose that the high-throughput sequencing sequence template contains N training samples, and the overall implicit characterization array of each training sample is a D-dimensional vector. For any two training samples i and j, their overall implicit characterization arrays are denoted $x_i = [x_{i1}, x_{i2}, \ldots, x_{iD}]$ and $x_j = [x_{j1}, x_{j2}, \ldots, x_{jD}]$, respectively. With cosine similarity as the commonality metric value, the calculation formula is:
$$\mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|} = \frac{\sum_{d=1}^{D} x_{id} \, x_{jd}}{\sqrt{\sum_{d=1}^{D} x_{id}^2} \; \sqrt{\sum_{d=1}^{D} x_{jd}^2}}$$
Where $x_i \cdot x_j$ represents the dot product of the vectors $x_i$ and $x_j$, and $\|x_i\|$ and $\|x_j\|$ represent the Euclidean lengths of the vectors $x_i$ and $x_j$, respectively.
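By way of non-limiting illustration, the pairwise sample commonality metric values can be collected in a symmetric matrix as sketched below. The function names and the two-dimensional toy vectors are hypothetical; real implicit characterization arrays would have many more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def similarity_matrix(samples):
    """N x N matrix whose entry [i][j] is the cosine similarity of samples i and j."""
    n = len(samples)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = cosine(samples[i], samples[j])
            sim[i][j] = value
            sim[j][i] = value  # the metric is symmetric, so each pair is computed once
    return sim


toy_samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matrix = similarity_matrix(toy_samples)
```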
In step S420, the electronic device constructs the high-throughput sequencing sequence association map according to the sample commonality metric values calculated in step S410. The association map is an undirected graph in which nodes represent training samples and edges represent the correlation between training samples (i.e., the commonality metric value between them). The weight of an edge can be determined from the commonality metric value: the higher the commonality metric value, the larger the edge weight, indicating a closer correlation between the two training samples. For example, assume that the high-throughput sequencing sequence template contains 5 training samples (A, B, C, D, E) and that step S410 has calculated the commonality metric values (expressed as cosine similarity) between them. The electronic device may construct an association map from these values, in which the node set is {A, B, C, D, E} and the edge set is determined by the commonality metric values. For example, if the cosine similarity between A and B is 0.9, an edge with a weight of 0.9 connects nodes A and B. A threshold can be set to filter out edges with low commonality metric values, for example by retaining only edges whose weight is greater than a specific value (such as 0.7). The resulting association map shows the correlation between the training samples and provides the basis for the subsequent marker diffusion step.
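By way of non-limiting illustration, the thresholded association map can be held as an adjacency dictionary as sketched below. The function name is hypothetical, and apart from the similarity of 0.9 between A and B and the threshold of 0.7, the similarity values are invented for the sketch.

```python
def build_association_graph(nodes, similarities, threshold=0.7):
    """Undirected weighted graph keeping only edges whose weight exceeds the threshold.

    similarities maps an unordered pair (u, v) to its commonality metric value.
    """
    graph = {node: {} for node in nodes}
    for (u, v), weight in similarities.items():
        if weight > threshold:
            graph[u][v] = weight
            graph[v][u] = weight  # undirected: store the edge in both directions
    return graph


pair_similarities = {
    ("A", "B"): 0.9,   # value from the example
    ("A", "C"): 0.75,  # assumed value
    ("B", "C"): 0.4,   # assumed value, filtered out by the threshold
    ("D", "E"): 0.8,   # assumed value
}
graph = build_association_graph(["A", "B", "C", "D", "E"], pair_similarities)
```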
In step S430, the electronic device assigns species markers to the unsupervised training samples by applying a marker diffusion algorithm to the high-throughput sequencing sequence association map and the microorganism species markers of the supervised training samples. The basic idea of the marker diffusion algorithm is to use the species markers of the supervised training samples as "sources" and to propagate these markers to the unsupervised training samples along the edges of the association map (i.e., the correlations between the training samples). Marker diffusion algorithms include, for example, heat-conduction-based algorithms and random-walk-based algorithms. In the context of atmospheric aerosol microbial community analysis, a suitable algorithm may be selected as the case requires. A heat-conduction-based marker diffusion algorithm is described below as an example. For example, assume that the association map includes supervised training samples A (species marker $M_1$) and B (species marker $M_2$) and unsupervised training samples C, D, E. The goal of the marker diffusion algorithm is to assign species markers to C, D, E based on the species markers of A and B and their correlation with C, D, E.
The thermal conduction-based marker diffusion algorithm generally includes the steps of:
1. Initialization: each training sample (node) is assigned an initial marker distribution vector. For a supervised training sample, the element of the marker distribution vector corresponding to its species is 1 and the remaining elements are 0; for an unsupervised training sample, all elements of its marker distribution vector are initialized to the same non-zero value (e.g., 1/total number of species).
2. Iterative diffusion: the marker distribution vector of each training sample is iteratively updated according to the edges of the association map (i.e., the commonality metric values or weights). The update rule typically takes an average of the marker distributions of the neighboring nodes weighted by the edge weights, and may include an attenuation factor to prevent over-diffusion.
3. Convergence check: the iterative diffusion is repeated until a convergence condition is satisfied (e.g., the change in the marker distribution vectors is less than a certain threshold, or a preset number of iterations is reached).
4. Marker assignment: after the iteration ends, species markers are assigned to the unsupervised training samples according to their marker distribution vectors. The species corresponding to the element with the largest value in the marker distribution vector is typically selected as the species marker of the training sample.
For example, in a simple case there are three training samples A ($M_1$), B ($M_2$), and C (unsupervised), and the edge weights between A and C and between B and C are $w_{AC}$ and $w_{BC}$, respectively. Initially, the marker distribution vector of A is [1, 0] ($M_1$ is 1, $M_2$ is 0), the marker distribution vector of B is [0, 1], and the marker distribution vector of C is [0.5, 0.5] (assuming two possible species). In the first iteration, the marker distribution vector of C may be updated, for example by the edge-weighted average of its neighbors' vectors, to:
$$f_C = \frac{w_{AC} \, [1, 0] + w_{BC} \, [0, 1]}{w_{AC} + w_{BC}} = \left[ \frac{w_{AC}}{w_{AC} + w_{BC}}, \; \frac{w_{BC}}{w_{AC} + w_{BC}} \right]$$
As the iteration proceeds, the marker distribution vector of C gradually converges to a steady state, from which its final species marker is assigned. Through this embodiment, the electronic device can assign reasonable species markers to the unsupervised training samples by using the microorganism species markers of the supervised training samples and the correlation between the high-throughput sequencing sequences.
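By way of non-limiting illustration, the four steps of the marker diffusion can be sketched as follows for the example with A, B, and C. The update used here is a plain edge-weighted average of the neighbors' marker distribution vectors, with the supervised samples held fixed and no attenuation factor; this is one simple variant assumed for the sketch, not necessarily the update rule of the embodiment. The function name and the edge weights 0.8 and 0.2 are hypothetical.

```python
def diffuse_markers(graph, seeds, num_species, max_iters=100, tol=1e-6):
    """Iteratively average neighbour marker distributions over a weighted undirected graph.

    graph maps node -> {neighbour: positive edge weight}; seeds maps a supervised node
    to its species index. Supervised nodes keep their one-hot vector throughout.
    """
    dist = {}
    for node in graph:
        if node in seeds:
            dist[node] = [1.0 if k == seeds[node] else 0.0 for k in range(num_species)]
        else:
            dist[node] = [1.0 / num_species] * num_species
    for _ in range(max_iters):
        max_change = 0.0
        new_dist = dict(dist)
        for node, neighbours in graph.items():
            if node in seeds or not neighbours:
                continue
            total = sum(neighbours.values())
            updated = [
                sum(w * dist[nb][k] for nb, w in neighbours.items()) / total
                for k in range(num_species)
            ]
            change = max(abs(a - b) for a, b in zip(updated, dist[node]))
            max_change = max(max_change, change)
            new_dist[node] = updated
        dist = new_dist
        if max_change < tol:
            break
    return dist


# A carries species index 0 (M1), B carries species index 1 (M2), C is unsupervised.
example_graph = {"A": {"C": 0.8}, "B": {"C": 0.2}, "C": {"A": 0.8, "B": 0.2}}
result = diffuse_markers(example_graph, seeds={"A": 0, "B": 1}, num_species=2)
marker_of_c = max(range(2), key=lambda k: result["C"][k])  # index of the largest element
```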
As one embodiment, the high-throughput sequencing sequence association map includes sample vertices corresponding to the training samples and connecting lines between the sample vertices, wherein each connecting line represents the correlation between the two sample vertices it connects; based on this, step S430, performing marker diffusion on the microorganism species markers of the supervised training samples according to the high-throughput sequencing sequence association map to obtain the microorganism species markers corresponding to the unsupervised training samples, may include:
Step S431: performing marker diffusion on the microorganism species markers of the supervised training samples along the connecting lines between the sample vertices in the high-throughput sequencing sequence association map, so as to determine preliminary microorganism species markers of the unsupervised training samples;
Step S432: repeatedly correcting the preliminary microorganism species marker of each unsupervised training sample according to the sample vertices adjacent to that unsupervised training sample in the high-throughput sequencing sequence association map, until the markers of the sample vertices in the high-throughput sequencing sequence association map are stable, so as to obtain the microorganism species markers corresponding to the unsupervised training samples.
In step S431, the electronic device starts the marker diffusion process according to the structure of the high throughput sequencing sequence association graph, that is, the sample vertices corresponding to each training sample and the connection lines (representing the correlation) between them. The supervised training samples already have well-defined microbial species markers that will propagate as seeds in the correlation map, affecting the marker assignment of the unsupervised training samples connected thereto. The specific implementation of tag diffusion may be based on a variety of algorithms, such as heat conduction-based algorithms, random walk-based algorithms, or graph-based tag propagation algorithms, etc. A graph-based label propagation algorithm (Label Propagation Algorithm, LPA) is illustrated here as an example.
For example, assume that the high-throughput sequencing sequence association graph has 5 sample vertices A, B, C, D, E, where A and B are supervised training samples with microorganism species markers $M_1$ and $M_2$, respectively, and C, D, E are unsupervised training samples. The connecting lines in the association graph represent the similarity or correlation between the sample vertices; the higher the weight, the closer the relationship.
In LPA, each sample vertex maintains a label set (i.e., a set of microorganism species markers). Initially, the label sets of the supervised samples A and B are $\{M_1\}$ and $\{M_2\}$, respectively, and the label sets of the unsupervised samples C, D, E are empty or contain all possible species markers (depending on the implementation).
The diffusion process starts from the supervised samples: A and B propagate their labels to the unsupervised samples connected to them. Specifically, each unsupervised sample updates its own label set according to the label distribution of its neighboring samples. For example, if C is connected to both A and B, and the weights of the connecting lines from A and B to C are $w_{AC}$ and $w_{BC}$, respectively, the label set of C may be updated based on a weighted average using these weights. In a simplified version of LPA, however, C may directly adopt the label held by the majority of its neighbors.
In a simple example, assume that the only neighbors of C are A and B and that $w_{AC} > w_{BC}$. The preliminary label (microorganism species marker) of C may then be set directly to the label $M_1$ of A. Some form of weighted voting could also be used; here the process is simplified by directly adopting the label of the neighbor A with the greater weight.
After one round of diffusion, the unsupervised training samples obtained a preliminary microbial species signature. However, these markers may be unstable because they are only propagated based on one round of neighbor information.
After step S431, the unsupervised training sample, although obtaining preliminary microbial species markers, may not be accurate or stable. Thus, in step S432, the electronic device repeatedly corrects these preliminary markers until the markers of all sample vertices in the correlation map reach a steady state. The correction process typically involves multiple iterations, where each sample vertex updates its own label based on the current labels of its neighbors. If the label of the sample vertex remains unchanged over successive iterations, it is considered to have reached a steady state.
Continuing the previous example, assume that after the preliminary marker assignment C has been given the marker M1 of A, while D is connected to both B and C and the connection between D and B has the higher weight. In the first correction iteration, D may update its own label according to the labels of its neighbors (mainly the M2 label of B), thereby changing the result of the preliminary assignment. As the iteration proceeds, the label of each sample vertex is dynamically adjusted according to the labels of its neighbors. For example, if the label of C changes in a subsequent iteration due to other factors (e.g., interactions with other unsupervised samples), then the label of D may also be updated again.
In practical applications, the specific implementation of the correction process may vary from algorithm to algorithm. In an extended version of LPA, each sample vertex updates its own set of labels according to the label distribution of its neighbors in each iteration. Updating rules may involve weighted averaging, majority voting, or other complex decision mechanisms. In some cases, some regularization terms or constraints may be introduced in order to prevent excessive smoothing or to preserve the diversity of the markers. For example, a threshold may be set to limit the amount of variation of the marker in each iteration; or introducing a persistence parameter to control the tendency of the sample vertices to retain their original labels.
The correction process iterates until a convergence condition is met. Convergence conditions include, for example: the labels of all sample vertices remain unchanged over successive iterations; or the change in the overall marker distribution is less than a preset threshold. After convergence, the unsupervised training samples obtain stable microbial species markers. These markers are assigned based on the structure of the whole correlation map and the known information of the supervised training samples, which helps ensure their accuracy and rationality.
Through the tag diffusion and iterative correction process in step S430, the electronic device can effectively assign microorganism species tags to the unsupervised training samples, providing powerful support for algorithm optimization and data analysis in subsequent steps. The process not only fully utilizes the known information of the supervision training sample, but also considers the complex correlation relationship between the high-throughput sequencing sequences, thereby improving the accuracy and the robustness of the label distribution.
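The diffusion and correction steps can be illustrated with a minimal sketch. It assumes the simplified weighted-vote variant of label propagation with the supervised labels held fixed; the function name propagate_labels, the edge-dictionary format, the concrete weights and the D-E edge are hypothetical choices made only for this illustration.

```python
from collections import defaultdict


def propagate_labels(edges, seeds, max_iters=100):
    """Weighted-vote label propagation with fixed (clamped) seed labels.

    edges: dict mapping an undirected pair (u, v) to its weight (assumed format).
    seeds: dict mapping each supervised vertex to its species marker.
    """
    neighbors = defaultdict(dict)
    for (u, v), weight in edges.items():
        neighbors[u][v] = weight
        neighbors[v][u] = weight

    labels = dict(seeds)
    for _ in range(max_iters):
        updated = dict(labels)
        for node in sorted(neighbors):
            if node in seeds:
                continue  # supervised markers are never overwritten
            votes = defaultdict(float)
            for nb, weight in neighbors[node].items():
                if nb in labels:  # only already-labelled neighbors vote
                    votes[labels[nb]] += weight
            if votes:
                # sorted() makes tie-breaking deterministic
                updated[node] = max(sorted(votes), key=votes.get)
        if updated == labels:  # no marker changed: steady state reached
            break
        labels = updated
    return labels


# Hypothetical weights: w_AC > w_BC, and D is tied more strongly to B than to C.
edges = {
    ("A", "C"): 0.9,
    ("B", "C"): 0.4,
    ("B", "D"): 0.8,
    ("C", "D"): 0.3,
    ("D", "E"): 0.7,
}
seeds = {"A": "M1", "B": "M2"}
result = propagate_labels(edges, seeds)
print(result)
```

With these assumed weights, C settles on M1 while D and E settle on M2, and the loop stops once a full pass changes no label.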
As an embodiment, step S500, for each DNA information sequence of the training sample, determining, according to the sequence implicit token array corresponding to the DNA information sequence and the overall implicit token array of the training sample to which the DNA information sequence belongs, a matching degree between the DNA information sequence and the training sample to which the DNA information sequence belongs may include:
Step S510: generating an overall implicit representation array set according to the overall implicit representation arrays of each training sample, and generating a sequence implicit representation array set according to the sequence implicit representation arrays of the DNA information sequences of each training sample;
Step S520: for each training sample, querying the overall implicit characterization array set for one or more matching overall implicit characterization arrays that match the overall implicit characterization array of the training sample, and determining a pending marker library of the training sample according to the microorganism species markers of the training samples to which the matching overall implicit characterization arrays belong;
Step S530: for each DNA information sequence of the training sample, querying the sequence implicit characterization array set for one or more matching sequence implicit characterization arrays that match the sequence implicit characterization array of the DNA information sequence, and determining a pending marker library of the DNA information sequence according to the microorganism species markers of the training samples to which the matching sequence implicit characterization arrays belong;
Step S540: determining the matching degree between the DNA information sequence and the training sample to which the DNA information sequence belongs according to the pending marker library of the DNA information sequence and the pending marker library of the training sample to which the DNA information sequence belongs.
In step S510, the electronic device generates two sets according to the overall implicit token array of each training sample obtained in step S300 and the implicit token array of the sequence of each DNA information sequence in each training sample, respectively: an overall implicit representation array set and a sequence implicit representation array set. These two sets are used to store the global feature representation of all training samples and the local feature representation of all DNA information sequences, respectively.
For example, assume that there are three training examples A, B, C, each of which is implicitly characterized by a 128-dimensional global implicit characterization array. Meanwhile, each training sample comprises a plurality of DNA information sequences, and each sequence also corresponds to a 128-dimensional sequence implicit characterization array.
The overall implicit characterization array of training sample A: V_A = [v_A1, v_A2, …, v_A128];
The overall implicit characterization array of training sample B: V_B = [v_B1, v_B2, …, v_B128];
The overall implicit characterization array of training sample C: V_C = [v_C1, v_C2, …, v_C128];
Placing these overall implicit token arrays into one set yields the overall implicit token array set {V_A, V_B, V_C}; similarly, for the DNA information sequences in each training sample, a corresponding set of sequence implicit characterization arrays is also generated.
In step S520, the electronic device traverses the overall implicit token array of each training sample in the overall implicit token array set and, for each training sample, searches the set for one or more overall implicit token arrays that match it. A match may be determined with a similarity measure (e.g., cosine similarity), i.e., by finding all other overall implicit token arrays whose similarity to that of the current training sample exceeds a certain threshold. A pending marker library of the current training sample is then constructed from the microbial species markers of the training samples to which the matching arrays belong. The pending marker library contains all markers of microorganism species to which the current training sample may belong. For example, suppose that the overall implicit token array V_A of training sample A is compared with the other overall implicit token arrays in the set, and V_B is found to have a high similarity (e.g., cosine similarity greater than 0.9). Suppose also that the microorganism species marker of training sample B, to which V_B belongs, is known to be "community X". The pending marker library of training sample A can then be initially determined as {"community X"}. If other overall implicit token arrays also match V_A and the training samples to which they belong carry different microorganism species markers, these markers are also added to the pending marker library.
Step S530 is similar to step S520, but is performed for each DNA information sequence in each training sample. The electronic device traverses each sequence implicit token array in the sequence implicit token array set and, for each DNA information sequence, searches the set for one or more sequence implicit token arrays that match it. A match is again determined with a similarity measure. A pending marker library of the current DNA information sequence is then constructed from the microbial species markers of the training samples to which the matching sequence implicit token arrays belong. The pending marker library contains all markers of microorganism species to which the current DNA information sequence may belong. For example, suppose that training sample A has a DNA information sequence S1 whose sequence implicit characterization array is s_1 = [s_1,1, s_1,2, …, s_1,128]. The electronic device compares s_1 with the other arrays in the sequence implicit token array set and finds that its similarity to the sequence implicit token array of a certain DNA information sequence from training sample B is high. Since the microorganism species marker of training sample B is known to be "community X", the pending marker library of DNA information sequence S1 can also be initially determined as {"community X"}.
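The lookup used in steps S520 and S530 can be sketched as follows. This is only an illustrative sketch: it assumes cosine similarity with a fixed threshold of 0.9, uses 3-dimensional toy arrays instead of 128-dimensional ones, and the names cosine and pending_marker_library are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pending_marker_library(query_id, arrays, markers, threshold=0.9):
    """Collect the markers of every other entry whose array matches the query."""
    query = arrays[query_id]
    return {
        markers[other]
        for other, vec in arrays.items()
        if other != query_id and cosine(query, vec) > threshold
    }


# Toy arrays (assumed values); the marker of A is what is being determined.
arrays = {
    "A": [1.0, 0.0, 0.2],
    "B": [0.9, 0.1, 0.2],
    "C": [0.0, 1.0, 0.0],
}
markers = {"B": "community X", "C": "community Y"}
library_a = pending_marker_library("A", arrays, markers)
print(library_a)
```

The same function applies to sequence implicit characterization arrays if the keys identify DNA information sequences and the markers dictionary maps each sequence to the marker of its training sample.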
In step S540, the electronic device jointly considers the pending marker library of the DNA information sequence and the pending marker library of the training sample to which the DNA information sequence belongs, and determines the matching degree between the DNA information sequence and that training sample by calculating the similarity or consistency between the two libraries. The matching degree quantifies how consistent the DNA information sequence is, in terms of microorganism species classification, with the training sample as a whole. Particular implementations may include calculating the proportion of common markers in the two pending marker libraries, using a set similarity measure (e.g., Jaccard similarity), or evaluating the matching degree with a more complex probabilistic model. For example, assume that the pending marker library of DNA information sequence S1 is {"community X"} and the pending marker library of training sample A to which it belongs is also {"community X"} (only one marker is considered here for simplicity of explanation). One way of calculating the matching degree is to directly compare whether the markers in the two pending marker libraries are consistent. In this example, since the two libraries are identical, the matching degree between DNA information sequence S1 and training sample A can be taken as 1 (or 100%). In practical applications, a pending marker library may contain multiple markers, and different markers may differ in priority or weight. A more complex matching degree calculation method may therefore involve the following steps:
1. Calculating the confidence of the pending marker library: for each pending marker library, the confidence of each marker is calculated based on the source and number of the markers therein (e.g., markers of supervised training samples, or virtual markers of unsupervised training samples obtained by cluster analysis). The confidence may reflect the reliability and importance of the marker.
2. Comparing the pending marker libraries: the pending marker library of the DNA information sequence is compared with the pending marker library of the training sample to which the DNA information sequence belongs, and the common markers and differing markers are identified.
3. Calculating the matching degree: the matching degree is calculated based on the confidence of the markers and the proportion of common markers. For example, a weighted average approach may be used, where the weights are based on the confidence of the markers.
For example, assume that the pending marker library of DNA information sequence S1 is M_seq = {"community X": c_X, "community Y": c_Y}, where c_X and c_Y are the confidence levels of the markers "community X" and "community Y", respectively, and that the pending marker library of training sample A is M_sample = {"community X": c_X', "community Z": c_Z'}. One possible way of calculating the matching degree Degree is:
$$ \mathrm{Degree} = \frac{\sum_{l \in M_{seq} \cap M_{sample}} \min(c_l, c_l')}{\sum_{l \in M_{seq}} c_l} $$
Wherein the numerator is the sum, over the common markers, of the smaller of the two confidence values (to account for the difference in confidence between the two marker libraries), and the denominator is the sum of the confidences of all markers in the pending marker library of the DNA information sequence. This calculation tends to underestimate the matching degree, since for each common marker it only counts the smaller confidence value. In practical applications, the calculation method can be adjusted according to specific requirements, for example by using a geometric mean instead of an arithmetic mean, or by introducing extra penalty terms to handle differences between the marker libraries.
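As a minimal sketch of the ratio just described (smaller confidence of each common marker in the numerator, total confidence of the sequence library in the denominator), the following uses hypothetical confidence values and a hypothetical function name matching_degree.

```python
def matching_degree(seq_lib, sample_lib):
    """Sum of min-confidences over common markers / total confidence in seq_lib."""
    common = seq_lib.keys() & sample_lib.keys()
    numerator = sum(min(seq_lib[m], sample_lib[m]) for m in common)
    denominator = sum(seq_lib.values())
    return numerator / denominator if denominator else 0.0


# Assumed confidences, chosen only for illustration.
m_seq = {"community X": 0.6, "community Y": 0.2}
m_sample = {"community X": 0.9, "community Z": 0.1}
degree = matching_degree(m_seq, m_sample)
print(round(degree, 3))
```

Here only "community X" is common, so the result is approximately 0.6 / (0.6 + 0.2) = 0.75; two libraries with no common marker give 0.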
In one embodiment, step S540, determining the matching degree between the DNA information sequence and the training sample to which the DNA information sequence belongs according to the pending flag library of the DNA information sequence and the pending flag library of the training sample to which the DNA information sequence belongs may include:
Step S541: summarizing and analyzing the microorganism species markers in the pending marker library of the DNA information sequence to determine the marker confidence distribution of the DNA information sequence;
Step S542: summarizing and analyzing the microorganism species markers in the pending marker library of the training sample to which the DNA information sequence belongs to determine the marker confidence distribution of the training sample to which the DNA information sequence belongs;
Step S543: determining the matching degree between the DNA information sequence and the training sample to which the DNA information sequence belongs according to the marker confidence distribution of the DNA information sequence and the marker confidence distribution of the training sample to which the DNA information sequence belongs.
Step S540 aims to determine the matching degree between the DNA information sequence and the training sample to which it belongs by comparing their pending marker libraries.
In step S541, the electronic device performs a summary analysis of the microorganism species markers in the pending marker library of the DNA information sequence. This pending marker library is obtained by the process described in step S530 and contains the microorganism species markers of all training samples that have a sequence implicit characterization array matching that of the current DNA information sequence. Since a DNA information sequence may be similar to sequences in multiple training samples, the pending marker library may contain multiple different microorganism species markers.
The main purpose of the summary analysis is to determine the confidence distribution of these markers. Confidence may be calculated based on a variety of factors, such as the number of matched sequences, the similarity of the matches, and the source of the markers (markers of supervised training samples, or virtual markers of unsupervised training samples obtained by cluster analysis). Here, it is assumed that the confidence is calculated based on the similarity of the matches and the source of the markers.
For example, assume that the pending marker library of DNA information sequence S comprises the following microorganism species markers and their sources:
Marker A (from a supervised training sample, similarity 0.9);
Marker B (a virtual marker obtained by cluster analysis, from an unsupervised training sample, similarity 0.8);
Marker A (again from a supervised training sample, but belonging to another match, similarity 0.7).
For labels of supervised training examples, a higher initial confidence (e.g., 0.9 or 1) may be assigned, while for virtual labels of unsupervised training examples, which are obtained by cluster analysis, a lower initial confidence (e.g., 0.5 or dynamically adjusted according to the quality of the cluster analysis) may be assigned. The initial confidence level is then adjusted based on the similarity. For example, the higher the similarity, the more the confidence increases.
Through calculation, the following tag confidence distributions may be obtained:
Marker A: c_A = 0.9 × w_1 + 0.7 × w_2;
Marker B: c_B = 0.8 × w_3;
where the weights w_1, w_2 and w_3 represent the contribution of the different matching terms to the confidence; they can be set according to actual conditions or obtained through learning.
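The aggregation of matches into a marker confidence distribution can be sketched as below. The sketch assumes, purely for illustration, that the weight of each match depends only on its source (1.0 for supervised markers, 0.5 for virtual markers) and that the result is normalized to sum to 1; the function name marker_confidences and these weight values are hypothetical.

```python
from collections import defaultdict


def marker_confidences(matches, source_weight):
    """Accumulate similarity * weight per marker, then normalize to sum to 1."""
    totals = defaultdict(float)
    for marker, source, similarity in matches:
        totals[marker] += similarity * source_weight[source]
    norm = sum(totals.values())
    return {m: c / norm for m, c in totals.items()} if norm else dict(totals)


# The three matches from the example: (marker, source, similarity).
matches = [
    ("A", "supervised", 0.9),
    ("B", "unsupervised", 0.8),
    ("A", "supervised", 0.7),
]
source_weight = {"supervised": 1.0, "unsupervised": 0.5}  # assumed weights
conf = marker_confidences(matches, source_weight)
print({m: round(c, 3) for m, c in conf.items()})
```

Under these assumed weights the raw scores are about 1.6 for marker A and 0.4 for marker B, which normalize to roughly 0.8 and 0.2.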
Step S542 is similar to step S541: the pending marker library of the training sample to which the DNA information sequence belongs is subjected to summary analysis to determine the marker confidence distribution of the training sample. This pending marker library also contains a plurality of possible microorganism species markers and their confidence information, derived from the microorganism species markers of the training samples whose overall implicit characterization arrays match that of the training sample.
Since training examples typically contain more DNA information sequences and more comprehensive microbial community characteristics, their pool of pending labels may contain more information than a single DNA information sequence. However, this may also introduce some noise or inconsistencies, so the reliability and importance of different markers needs to be quantified by a confidence distribution.
In step S543, after determining the marker confidence distribution of the DNA information sequence and that of the training sample to which it belongs, the electronic device compares the two distributions to calculate the matching degree between them. The matching degree quantifies how consistent the DNA information sequence is, in terms of microorganism species classification, with the training sample as a whole.
The specific method for calculating the matching degree can be flexibly selected according to actual conditions, and the following aspects need to be considered:
1. The confidence consistency of the common markers, i.e., for markers that appear in both distributions, whether their confidences are similar.
2. The handling of unique markers, i.e., how markers that appear in only one of the two distributions should affect the matching degree.
3. The overall similarity of the confidence distributions, which may be assessed using statistics or distance measures (e.g., KL divergence, JS divergence, Wasserstein distance, etc.).
For example, continuing with the previous example, assume that the marker confidence distribution of DNA information sequence S is P_seq = {A: 0.8, B: 0.2} and the marker confidence distribution of the training sample to which it belongs is P_sample = {A: 0.95, C: 0.05}.
In order to calculate the degree of matching, various methods may be employed. A simple method is to calculate the confidence similarity of the common signature (signature a here) in both distributions and take into account the overall differences in the distributions. But more commonly statistics or distance measures are used to comprehensively evaluate the similarity of the two distributions.
For example, the JS divergence (Jensen-Shannon divergence) can be used to measure the difference between the two distributions, and the matching degree can then be obtained by subtracting the JS divergence from 1 (since a smaller JS divergence indicates more similar distributions, whereas a larger matching degree is required to indicate more similar distributions). However, since the JS divergence is defined for probability distributions, the confidence distributions need to be normalized first.
The normalized distributions (both already sum to 1 in this example, so normalization leaves them unchanged) are P'_seq = {A: 0.8, B: 0.2} and P'_sample = {A: 0.95, C: 0.05}.
Computing the JS divergence requires both distributions to be expressed over the same set of markers, but here the marker sets of P'_seq and P'_sample are not identical (one contains B and the other contains C). The distributions therefore need to be adjusted, for example by merging similar or related markers, ignoring markers with very low confidence, or introducing smoothing terms.
For simplicity of explanation, it is assumed that the low-confidence markers (e.g., C) are ignored and only marker A is considered. The confidence difference of the two distributions on marker A can then be compared directly; a more complex metric could be used to account for differences across the entire distribution, but this is often impractical because a single overall metric would have to evaluate all possible markers. As a concrete example, a simplified method may be employed to calculate the matching degree, which considers the confidence difference between the two distributions on marker A and introduces a penalty factor to amplify that difference:
Degree = max(0, 1 - |C_seq,A - C_sample,A| × d);
Wherein C_seq,A is the confidence of marker A in the marker confidence distribution of the DNA information sequence (e.g., the probability that the sequence belongs to that microorganism species); C_sample,A is the confidence of marker A in the marker confidence distribution of the training sample, which contains multiple DNA information sequences (e.g., reflecting the composition of the microbial community in the sample); and d is a penalty factor, a constant greater than 1, used to amplify the effect of the confidence difference on the matching degree.
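Both options discussed above can be sketched in a few lines. The sketch makes two assumptions that are not stated in the text: the JS divergence is computed with base-2 logarithms (so it lies in [0, 1]) and markers missing from one distribution are treated as having zero confidence, which is a simple alternative to merging, dropping or smoothing markers; the penalty factor d = 2.0 and the function names are likewise hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) of two dict distributions.

    Markers absent from one distribution are treated as probability 0 (assumption).
    """
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || mix); terms with a[k] == 0 contribute nothing
        return sum(
            a[k] * math.log2(a[k] / mix[k]) for k in a if a[k] > 0.0
        )

    return 0.5 * kl(p) + 0.5 * kl(q)


def simplified_degree(c_seq, c_sample, d=2.0):
    """Degree = max(0, 1 - |c_seq - c_sample| * d) for a single common marker."""
    return max(0.0, 1.0 - abs(c_seq - c_sample) * d)


p_seq = {"A": 0.8, "B": 0.2}
p_sample = {"A": 0.95, "C": 0.05}

js_match = 1.0 - js_divergence(p_seq, p_sample)
simple_match = simplified_degree(p_seq["A"], p_sample["A"])
print(round(js_match, 3), round(simple_match, 3))
```

For the example distributions the simplified formula gives approximately 1 - 0.15 × 2 = 0.7, while the JS-based value is somewhat higher because it weighs the whole distribution rather than the amplified difference on marker A alone.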
As an embodiment, step S600, optimizing algorithm parameters of the microorganism composition analysis algorithm according to the matching degree and the microorganism species mark corresponding to the unsupervised training sample, to obtain the target microorganism composition analysis algorithm may include:
Step S610: according to the integral implicit characterization array of the training sample, analyzing the microbial community species of the training sample to obtain integral species confidence distribution corresponding to the training sample;
Step S620: according to the sequence implicit characterization array corresponding to the DNA information sequence, analyzing the microbial community species of the DNA information sequence to obtain the sequence species confidence distribution corresponding to the DNA information sequence;
Step S630: optimizing algorithm parameters of the microorganism composition analysis algorithm according to the whole species confidence distribution, the sequence species confidence distribution, the matching degree and the microorganism species marker corresponding to the unsupervised training sample, so as to obtain the target microorganism composition analysis algorithm.
In step S610, the electronic device performs a microbial community species analysis using the global implicit characterization array of the training examples. For example, the global implicit characterization array is input into a pre-trained classifier or regression model that predicts the corresponding microbial species distribution based on the input feature vectors. Due to the complexity of microbial communities, such a distribution is often not a single species signature, but rather a probability distribution that contains multiple species and their confidence levels.
For example, assume that the overall implicit token array for training sample A is V_A = [v_A1, v_A2, …, v_An], where n is the dimension of the feature vector. The electronic device inputs the vector into a species classifier based on machine learning (e.g., support vector machine, random forest, or deep learning model). The output of the classifier is not a single species marker but a species confidence distribution vector P_A = [p_A1, p_A2, …, p_Am], where m is the number of known microbial species in the database and p_Ai represents the confidence that training sample A belongs to the i-th species.
In a deep learning model, for example, the overall implicit representation array V_A is processed by a plurality of hidden layers, and a confidence vector whose length equals the number of species is finally output. In this process, the parameters of the model (e.g., weights and biases) are continually adjusted based on the training data to minimize the difference between the predicted and actual distributions (typically measured by a loss function).
Step S620 is similar to step S610, but step S620 is performed for the sequence implicit token array for each DNA information sequence. The electronic equipment inputs the sequence implicit characterization array of each DNA information sequence into the same species classifier to obtain the species confidence distribution corresponding to the sequence. Since the DNA information sequence is a component of the training sample, the species confidence distribution should be consistent with the overall species confidence distribution of the training sample, but may be biased by factors such as sequencing noise, sequence splice errors, etc.
For example, suppose training sample A contains a DNA information sequence S1 whose sequence implicit characterization array is s_1 = [s_11, s_12, …, s_1n]. The electronic device likewise inputs this vector into the species classifier, obtaining the species confidence distribution vector p_1 = [p_11, p_12, …, p_1m] of sequence S1.
The calculation process is the same as that in step S610, but the input data is changed from the overall implicit representation array of the training examples to the sequence implicit representation array of the DNA information sequence.
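As a rough sketch of how one classifier can serve both steps S610 and S620, the following applies a single linear layer followed by a softmax to an overall array and to a sequence array. The linear-plus-softmax model, the toy weights and the names softmax and species_confidence are assumptions for illustration; the text leaves the classifier type open.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def species_confidence(array, weights, biases):
    """One linear layer + softmax: returns a confidence per species."""
    scores = [
        sum(w * x for w, x in zip(row, array)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Toy parameters for n = 2 features and m = 3 species (assumed values).
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
biases = [0.0, 0.0, 0.0]

p_sample = species_confidence([2.0, 0.5], weights, biases)  # overall array V_A
p_seq = species_confidence([1.5, 1.0], weights, biases)     # sequence array s_1
print([round(p, 3) for p in p_sample])
print([round(p, 3) for p in p_seq])
```

Because the same parameters are used for both inputs, the sequence-level and sample-level distributions are directly comparable, which is what the later matching and optimization steps rely on.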
In step S630, the electronic device optimizes the parameters of the microbial composition analysis algorithm according to the whole species confidence distribution, the sequence species confidence distribution, the matching degree between the DNA information sequence and the training sample, and the microbial species mark of the unsupervised training sample obtained in the previous step. The goal of the optimization is to enable the algorithm to more accurately predict the microbial species distribution while reducing the effects of noise and errors on the predicted results. The optimization method refers to similar descriptions in the aforementioned step S600, and will not be described herein.
Assume a loss function L that jointly considers the JS divergence of the species confidence distributions, the matching degree, and the consistency of the unsupervised training sample markers, for example of the form:
$$ L = \alpha \cdot \mathrm{JS}(P_{true}, P_{pred}) + \beta \cdot (1 - \mathrm{Degree}) + \gamma \cdot \mathrm{CE}(\hat{y}, y) $$
Wherein: α, β, γ are weight coefficients for balancing the importance of the different penalty terms. JS(P_true, P_pred) is the JS divergence between the true species distribution P_true and the predicted species distribution P_pred. Degree is the matching degree between the DNA information sequence and the training sample to which it belongs (calculated as described above); it enters the loss as 1 - Degree in this form so that a lower matching degree yields a larger loss. CE(ŷ, y) is the cross-entropy loss between the virtual marker ŷ of the unsupervised training sample, obtained by cluster analysis, and the predicted marker y.
For a supervised training sample, P_true can be obtained by experimental determination; for an unsupervised training sample, P_true does not exist, but its species distribution can be approximated by the virtual markers obtained by cluster analysis (although such an estimate may not be accurate enough).
In each iteration, the electronic device calculates the gradient of the loss function L with respect to the algorithm parameters (e.g., weights and biases in the neural network) and updates the values of these parameters according to the gradient descent algorithm. Specifically, for each parameter w, the update rule may be as follows:
$$ w \leftarrow w - \eta \frac{\partial L}{\partial w} $$
where η is the learning rate, which controls the step size of parameter updates. Through this continuous iterative optimization process, the electronic device gradually adjusts the values of the algorithm parameters so that the loss function L is gradually reduced, improving the accuracy and robustness with which the algorithm predicts the microbial species distribution. Finally, when the loss function value converges below a certain threshold, the optimization process ends and the target microorganism composition analysis algorithm is obtained. This algorithm is used in the subsequent microbial community species analysis of the high throughput sequencing sequence of interest.
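The weighted loss and the gradient descent update can be sketched as follows. The exact way the matching degree enters the loss is an assumption here (it is used as 1 - Degree so that poor matching increases the loss), and the coefficient values, input numbers and function names are hypothetical.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


def combined_loss(js, degree, ce, alpha=1.0, beta=0.5, gamma=0.5):
    """Assumed form: alpha * JS + beta * (1 - Degree) + gamma * CE."""
    return alpha * js + beta * (1.0 - degree) + gamma * ce


def gradient_step(params, grads, lr=0.1):
    """Plain gradient descent: w <- w - lr * dL/dw for every parameter."""
    return [w - lr * g for w, g in zip(params, grads)]


# Illustrative numbers only: a JS term, a matching degree, and a virtual marker
# (one-hot for species 0) compared with a predicted distribution.
ce = cross_entropy([1.0, 0.0], [0.8, 0.2])
loss = combined_loss(js=0.13, degree=0.7, ce=ce)

# One update with hand-picked gradients; real gradients would come from
# backpropagation through the analysis algorithm.
new_params = gradient_step([0.5, -0.2], [0.1, -0.4])
print(round(loss, 4), [round(w, 4) for w in new_params])
```

In practice the gradients would be produced by an automatic differentiation framework rather than supplied by hand; the sketch only shows how the three terms are combined and how a single update moves each parameter against its gradient.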
As an embodiment, step S630, optimizing the algorithm parameter of the microorganism composition analysis algorithm according to the whole species confidence distribution, the sequence species confidence distribution, the matching degree and the microorganism species mark corresponding to the unsupervised training sample, to obtain the target microorganism composition analysis algorithm may include:
Step S631: for each training sample, according to the sequence species confidence distribution of each DNA information sequence in the training sample and the matching degree corresponding to each DNA information sequence, adjusting the microorganism species mark of the training sample to obtain an adjusted microorganism species mark of the training sample;
Step S632: combining the adjusted microorganism species marks of each training sample according to the confidence distribution of the whole species corresponding to each training sample to obtain a whole error;
Step S633: optimizing algorithm parameters of the microorganism composition analysis algorithm according to the overall error to obtain the target microorganism composition analysis algorithm.
In step S631, the electronic device adjusts the initial microorganism species mark of the training sample by using the sequence species confidence distribution of each DNA information sequence contained in each training sample and the matching degree of the sequences and the whole training sample. The purpose of the adjustment is to make the labels of the training examples more accurately reflect the composition of the microbiota they contain, while taking into account the detailed information at the DNA information sequence level.
For example, assume that training sample A contains three DNA information sequences $S_1$, $S_2$, $S_3$, whose sequence species confidence distributions are $p_1$, $p_2$, $p_3$, and whose matching degrees with training sample A as a whole are known to be $m_1$, $m_2$, $m_3$, respectively. The initial microorganism species mark of training sample A may be a set of one or more species, but for simplicity of illustration it is assumed here to be a single species mark $L_A$.
The adjustment process may be based on a weighted average method, in which the species confidence distributions of the DNA information sequences are weighted by their matching degrees and summed, resulting in a weighted species confidence distribution for training sample A. The species with the highest confidence is then selected from this weighted distribution as the adjusted microorganism species mark $L_A'$ of training sample A. More often, however, because microbial communities usually contain multiple species, the adjusted mark is a set containing multiple species and their confidence levels.
Specific calculation processes include, for example: the species confidence distribution of each sequence is multiplied by the matching degree of that sequence to obtain a weighted species confidence matrix W, where each element $W_{ij}$ represents the weighted confidence of the ith sequence for the jth species ($W_{ij} = p_{ij} \times m_i$).
Each column of the weighted species confidence matrix W (i.e., each species) is summed to obtain the overall weighted confidence $c = [c_1, c_2, \ldots, c_m]$ of training sample A for each species.
Based on the overall weighted confidence c, species with a confidence above a certain threshold are selected as part of the adjusted microorganism species mark, or the one or several species with the highest confidence are directly selected as the mark (only the single highest-confidence species if a single mark is required). In practice, because of the complexity of the microbial community, a training sample is usually not marked with just one or a few species. Instead, it is more common to keep a set of species and their confidence levels as the mark. Thus, the output of step S631 may be an adjusted species confidence distribution vector $L_A' = [l_{A1}', l_{A2}', \ldots, l_{Am}']$, where $l_{Ai}'$ represents the adjusted confidence that training sample A belongs to the ith species.
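The weighting, column summation and threshold selection of step S631 can be sketched as follows. This is a hedged illustration: the function name, the example confidences and matching degrees, and the final normalization step are assumptions and are not prescribed by the embodiment.

```python
def adjust_species_mark(seq_confidences, matching_degrees, threshold=None):
    """Return the adjusted species confidence vector of one training sample.

    seq_confidences: one row per DNA information sequence, one column per species.
    matching_degrees: one matching degree m_i per DNA information sequence.
    """
    # Weighted confidence matrix: W_ij = p_ij * m_i.
    weighted = [[p_ij * m_i for p_ij in row]
                for row, m_i in zip(seq_confidences, matching_degrees)]
    n_species = len(seq_confidences[0])
    # Overall weighted confidence per species: c_j = sum_i W_ij.
    c = [sum(row[j] for row in weighted) for j in range(n_species)]
    # Assumption: rescale c so that it sums to 1 (the text leaves this optional).
    total = sum(c)
    adjusted = [c_j / total for c_j in c] if total > 0 else c
    if threshold is None:
        return adjusted
    # Keep only species whose adjusted confidence reaches the threshold.
    return {j: conf for j, conf in enumerate(adjusted) if conf >= threshold}


# Hypothetical sample A: three sequences, three candidate species.
p = [[0.8, 0.1, 0.1],
     [0.2, 0.7, 0.1],
     [0.6, 0.3, 0.1]]
m = [0.9, 0.5, 0.6]
adjusted_mark = adjust_species_mark(p, m)
selected_species = adjust_species_mark(p, m, threshold=0.3)
print(adjusted_mark)
print(selected_species)
```

With these illustrative inputs, the first two species pass the 0.3 threshold, so the adjusted mark keeps a set of species with their confidences rather than a single species.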
In step S632, the electronic device calculates the overall error using the overall species confidence distribution corresponding to each training sample and the adjusted microorganism species mark (or confidence distribution) obtained in step S631. The overall error is an important indicator of the difference between the predicted result of the algorithm and the actual situation (or supervisory information) that will be used for optimization of the algorithm parameters in the subsequent steps.
For example, assume that the training set comprises a plurality of training samples (e.g., A, B, C, etc.), each having an overall species confidence distribution (e.g., $P_A$, $P_B$, $P_C$) and an adjusted microorganism species mark (or confidence distribution, e.g., $L_A'$, $L_B'$, $L_C'$).
The calculation of the overall error is typically based on some kind of loss function, such as a cross entropy loss function. Cross entropy loss is a common method of measuring the difference between two probability distributions, which is particularly suited for comparison between the predicted probability distribution and the actual distribution in classification problems.
Specific calculation processes include, for example: for each training sample (e.g., A), a logarithmic operation is first performed on the overall species confidence distribution $P_A$; the result is then combined with the adjusted microorganism species mark confidence distribution $L_A'$, and the overall sub-error $E_A$ of the training sample is calculated through the cross entropy loss formula, for example $E_A = -\sum_i l_{Ai}' \log P_{Ai}$, where $P_{Ai}$ and $l_{Ai}'$ are the ith elements of $P_A$ and $L_A'$, respectively. In practical applications, since the overall species confidence distribution and the adjusted mark confidence distribution may not be strict probability distributions (i.e., their elements may not sum to 1), appropriate normalization may be performed, or the calculation of the loss function may be adapted, according to actual needs. The overall sub-errors of all training samples are then summed (or weighted-averaged) to obtain the overall error E. If all training samples are equally important, direct summation suffices; if different training samples have different importance (e.g., some samples may contain more information or more reliable supervision information), a weighted average may be used to combine the sub-errors.
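The calculation of the overall error described above can be sketched as follows. This is a minimal illustration under stated assumptions: both inputs are rescaled to sum to 1 before the cross entropy is taken, a small `eps` guards the logarithm, and all function names and numeric values are hypothetical.

```python
import math


def normalize(values):
    """Rescale non-negative confidences so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def sample_cross_entropy(overall_conf, adjusted_mark, eps=1e-12):
    """Sub-error E_A = -sum_i l'_Ai * log(P_Ai) for one training sample."""
    p = normalize(overall_conf)
    l = normalize(adjusted_mark)
    return -sum(l_i * math.log(p_i + eps) for l_i, p_i in zip(l, p))


def overall_error(overall_confs, adjusted_marks, weights=None):
    """Sum the sub-errors, or take their weighted average if weights are given."""
    errors = [sample_cross_entropy(p, l)
              for p, l in zip(overall_confs, adjusted_marks)]
    if weights is None:
        return sum(errors)
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)


# Hypothetical samples A and B with three candidate species each.
confs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
marks = [[0.59, 0.31, 0.10], [0.1, 0.7, 0.2]]
print(overall_error(confs, marks))
print(overall_error(confs, marks, weights=[2.0, 1.0]))
```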
In step S633, the electronic device updates the parameter values of the microorganism composition analysis algorithm with an optimization algorithm such as gradient descent (or a variant thereof, e.g., Adam, RMSprop, etc.) according to the overall error calculated in step S632. The goal of the optimization is to minimize the overall error, thereby enabling the algorithm to more accurately predict the species composition of the microbial community.
After the overall error E is obtained, the electronic device calculates the gradient of the error with respect to each parameter of the algorithm (e.g., weights and biases in the neural network). This typically involves the application of a back-propagation algorithm, which can efficiently calculate gradient information.
For each parameter w in the algorithm, the partial derivative of the error E with respect to w, i.e., the gradient $\frac{\partial E}{\partial w}$, is calculated using the back-propagation algorithm. This process may require multiple passes through the training set (each pass is referred to as an "epoch") and multiple updates of the parameter values in each pass (each update is referred to as an "iteration" and is typically computed on one batch).
Based on the gradient information, the values of the parameters are updated using a gradient descent algorithm (or a variant thereof). The gradient calculation and parameter update process is repeated until a certain stop condition is met (e.g., the error falls below a certain threshold, the parameter change falls below a certain threshold, or a preset number of iterations is reached). In each iteration, the overall error and the gradient are recalculated and the parameters are updated accordingly.
During the optimization process, the electronic device periodically checks whether the error has converged. If the change in error over successive iterations is very small (or the error no longer changes), the algorithm can be considered to have converged to a locally optimal solution (or a globally optimal solution, although this is often difficult to guarantee). At this time, the optimization process may be stopped, and the current parameter values may be used as the parameters of the target microorganism composition analysis algorithm.
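The iterate-until-converged procedure described above can be sketched with plain gradient descent on a toy error function. This is an illustration only: in the embodiment the gradient would come from back-propagation through the network, whereas here an analytic gradient of a hypothetical quadratic error stands in for it, and the learning rate, tolerance and iteration cap are assumed values.

```python
def gradient_descent(error_fn, grad_fn, w, lr=0.1, tol=1e-10, max_iters=1000):
    """Repeat w <- w - lr * grad until the error change falls below tol
    or the preset number of iterations is reached."""
    error = error_fn(w)
    iterations = 0
    for iterations in range(1, max_iters + 1):
        grad = grad_fn(w)
        w = [w_i - lr * g_i for w_i, g_i in zip(w, grad)]
        new_error = error_fn(w)
        converged = abs(error - new_error) < tol
        error = new_error
        if converged:
            break
    return w, error, iterations


# Toy stand-in for the overall error E: squared distance to a hypothetical optimum.
target = [1.0, -2.0]


def toy_error(w):
    return sum((w_i - t_i) ** 2 for w_i, t_i in zip(w, target))


def toy_grad(w):
    return [2.0 * (w_i - t_i) for w_i, t_i in zip(w, target)]


final_w, final_error, used_iters = gradient_descent(toy_error, toy_grad, [0.0, 0.0])
print(final_w, final_error, used_iters)
```

Variants such as Adam or RMSprop change only how the step is computed from the gradient; the stop conditions stay the same.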
Through this series of optimization processes in step S630, the electronic device can gradually adjust the parameters of the microorganism composition analysis algorithm so that it can better predict the species composition of the atmospheric aerosol microorganism community. The process not only depends on rich information sources such as the whole species confidence distribution of the training sample and the sequence species confidence distribution of the DNA information sequence, but also ensures the accuracy of the algorithm on the detail level through the matching degree. The finally obtained target algorithm has higher prediction precision and robustness.
As an embodiment, step S633, optimizing the algorithm parameter of the microorganism composition analysis algorithm according to the overall error, to obtain the target microorganism composition analysis algorithm may include:
step S6331: determining a sequence analysis error of the DNA information sequence according to the sequence species confidence distribution of the DNA information sequence and the microorganism species mark of the training sample to which the DNA information sequence belongs;
step S6332: and optimizing algorithm parameters of the microorganism composition analysis algorithm according to the overall error and the sequence analysis error to obtain the target microorganism composition analysis algorithm.
In step S6331, the electronic device calculates, for each DNA information sequence, a sequence analysis error based on its sequence species confidence distribution and the microorganism species mark of the training sample to which it belongs. The sequence analysis error measures the prediction accuracy of the algorithm at the level of a single DNA information sequence, and is an important index for evaluating the performance of the algorithm at the detail level.
For example, suppose training sample A contains a DNA information sequence $S_1$ with a sequence species confidence distribution $p_1 = [p_{11}, p_{12}, \ldots, p_{1m}]$, where m is the number of known microorganism species in the database and $p_{1i}$ indicates the confidence that sequence $S_1$ belongs to the ith species. Meanwhile, it is assumed that the microorganism species mark of training sample A, to which sequence $S_1$ belongs, is $L_A$ ($L_A$ may be, for example, a set containing a plurality of species and their weights; for simplicity of explanation it is assumed here to be a single species mark).
To calculate the sequence analysis error, the electronic device defines a loss function to quantify the difference between the predicted distribution (i.e., the sequence species confidence distribution $p_1$) and the actual mark (i.e., the microorganism species mark $L_A$ of training sample A). Common loss functions include cross entropy loss, mean squared error (MSE), and the like. Here, since the actual mark is a single species and the predicted distribution is a probability distribution, the cross entropy loss can be used. The actual mark is first converted into a one-hot encoded vector $y_A = [y_{A1}, y_{A2}, \ldots, y_{Am}]$, where $y_{Ai} = 1$ if $L_A$ is the ith species and $y_{Ai} = 0$ otherwise. Then, the sequence analysis error $E_{seq,1}$ of sequence $S_1$ is calculated using the cross entropy loss function, for example $E_{seq,1} = -\sum_{i=1}^{m} y_{Ai} \log p_{1i}$.
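The one-hot conversion and cross entropy calculation of step S6331 can be sketched as follows. The function names, the species index convention (0-based) and the `eps` guard inside the logarithm are assumptions made for this example.

```python
import math


def one_hot(index, n_species):
    """Encode a single species mark as a one-hot vector y_A."""
    return [1.0 if i == index else 0.0 for i in range(n_species)]


def sequence_analysis_error(p_seq, mark_index, eps=1e-12):
    """Cross entropy between the one-hot mark and the sequence species
    confidence distribution: E_seq = -sum_i y_Ai * log(p_1i)."""
    y = one_hot(mark_index, len(p_seq))
    return -sum(y_i * math.log(p_i + eps) for y_i, p_i in zip(y, p_seq))


# Hypothetical sequence S_1 whose sample A is marked with species 0.
p_1 = [0.7, 0.2, 0.1]
e_seq_1 = sequence_analysis_error(p_1, mark_index=0)
print(f"E_seq,1 = {e_seq_1:.6f}")
```

Because the mark is one-hot, the sum reduces to the negative log confidence of the marked species, so a confident correct prediction yields an error close to zero.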
After determining the overall error (as in step S632) and the sequence analysis error, the electronic device uses these error information to optimize the parameters of the microorganism composition analysis algorithm. The goal of the optimization is to find a set of parameter values such that the prediction error of the algorithm at both the overall and sequence level is as small as possible.
For example, assume that the overall error E has been calculated in step S632, and that a sequence analysis error (e.g., $E_{seq,1}$) has been calculated for each DNA information sequence in the training set in step S6331. The electronic device now uses this information to update the parameters of the algorithm.
First, the electronic device defines a joint error function $E_{total}$ that combines the overall error and the sequence analysis errors, e.g., as a weighted sum. Next, the electronic device calculates the gradient of the joint error function $E_{total}$ with respect to each parameter of the algorithm (e.g., weights and biases in the neural network). This typically involves the application of a back-propagation algorithm, which can efficiently calculate the partial derivatives of the error with respect to each parameter in the network. Once the gradient is calculated, the electronic device may update the parameter values using a gradient descent algorithm (or a variant thereof, such as Adam, RMSprop, etc.). The update moves the parameter values in the direction opposite to the gradient so as to reduce the value of the error function.
The above process (calculating gradient, updating parameter) will be repeated a number of times until a certain stop condition is met (e.g. error below a certain threshold, parameter variation below a certain threshold, preset number of iterations is reached, etc.). In each iteration, the electronic device recalculates the joint error function and the gradient and updates the parameter values based on the new gradient information. During the optimization process, the electronic device periodically checks whether the error converges. If the error variation for successive iterations is very small (or no longer varies), then the algorithm can be considered to have converged to some locally optimal solution (or globally optimal solution, but this is often difficult to guarantee). At this time, the optimization process may be stopped and the value of the current parameter may be used as a parameter of the target microorganism composition analysis algorithm.
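One possible form of the joint error function and of the corresponding update is written out below as a worked equation. The balancing weights $\lambda_1$ and $\lambda_2$ and the sequence index k are assumptions introduced for illustration; the embodiment only states that the overall error and the sequence analysis errors may be combined by weighted summation. $\eta$ denotes the learning rate.

```latex
E_{total} = \lambda_1 E + \lambda_2 \sum_{k} E_{seq,k},
\qquad
w \leftarrow w - \eta \, \frac{\partial E_{total}}{\partial w}
```

Setting $\lambda_2 = 0$ recovers the optimization of step S633 based on the overall error alone.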
As an embodiment, step S6332, optimizing the algorithm parameter of the microorganism composition analysis algorithm according to the overall error and the sequence analysis error, to obtain the target microorganism composition analysis algorithm may include:
Step S63321: performing error calculation between sequence species confidence distribution and equal confidence distribution of the DNA information sequence to obtain distribution difference corresponding to the DNA information sequence;
Step S63322: combining the sequence analysis error and the distribution difference according to the matching degree corresponding to the DNA information sequence to obtain the sequence error of the DNA information sequence;
step S63323: and optimizing algorithm parameters of the microorganism composition analysis algorithm according to the overall error and sequence errors corresponding to the DNA information sequences of the training samples to obtain the target microorganism composition analysis algorithm.
In step S63321, the electronic device first defines an equal confidence distribution (i.e., a uniform confidence distribution), which assumes that all microorganism species have the same confidence. Then, for each DNA information sequence, the difference between its sequence species confidence distribution and the equal confidence distribution is calculated. This difference reflects how far the species confidence distribution predicted by the algorithm deviates from a uniform distribution, and can to some extent measure the concentration and reliability of the prediction. For example, suppose there is a DNA information sequence $S_1$ with a sequence species confidence distribution $p_1 = [p_{11}, p_{12}, \ldots, p_{1m}]$, where m is the number of known microorganism species in the database and $p_{1i}$ represents the confidence that sequence $S_1$ belongs to the ith species. The equal confidence distribution $u = [u_1, u_2, \ldots, u_m]$ is a vector whose elements are all equal and sum to 1 (i.e., $u_i = \frac{1}{m}$ for all i).
To quantify the difference between $p_1$ and u, a variety of distance measures can be used, such as the KL divergence (Kullback-Leibler divergence), the JS divergence, or the simple Euclidean distance. However, KL and JS divergences are intended for comparing proper probability distributions, and the KL divergence in particular is sensitive to zero-valued elements. Since the confidence distributions compared here may not strictly satisfy all conditions of a probability distribution (e.g., summing to 1), a more general distance measure, such as the squared Euclidean distance $\sum_{i=1}^{m} (p_{1i} - u_i)^2$, is chosen.
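The squared Euclidean distance between a sequence species confidence distribution and the equal confidence distribution can be sketched as follows; the function name and example vectors are hypothetical.

```python
def distribution_difference(p_seq):
    """Squared Euclidean distance between p_seq and the equal confidence
    distribution u, where u_i = 1/m for each of the m species."""
    m = len(p_seq)
    u_i = 1.0 / m
    return sum((p_i - u_i) ** 2 for p_i in p_seq)


# A uniform prediction has zero difference; a fully concentrated one is maximal.
uniform_prediction = [0.25, 0.25, 0.25, 0.25]
concentrated_prediction = [1.0, 0.0, 0.0, 0.0]
print(distribution_difference(uniform_prediction))
print(distribution_difference(concentrated_prediction))
```

For a normalized input over m species the value lies between 0 (uniform) and 1 - 1/m (all confidence on one species), which makes it a simple indicator of how concentrated the prediction is.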
In step S63322, the electronic device combines the distribution difference calculated in step S63321 with the sequence analysis error calculated in step S6331 to obtain a comprehensive sequence error for each DNA information sequence. This combination takes into account both uniformity and accuracy of the algorithmic predictions over species confidence distributions, helping to more fully evaluate the sequence-level predictive performance. The merging process can be implemented by means of weighted summation, wherein the weights can be adjusted according to practical situations to balance the distribution differences and the importance of the sequence analysis errors. By calculating the integrated sequence error for each DNA information sequence, the electronic device obtains an error index that can comprehensively reflect the sequence level prediction performance.
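The embodiment does not fix the exact merging formula. One plausible reading, offered purely as an assumption, is a convex combination in which the matching degree sets the weights: a sequence that matches its training sample well is judged mainly against the sample's species mark, while a poorly matching sequence is judged mainly by its deviation from the equal confidence distribution. The function name and the requirement that the matching degree lie in [0, 1] are hypothetical.

```python
def sequence_error(analysis_error, distribution_diff, matching_degree):
    """Hypothetical merge: m * E_seq + (1 - m) * D, with m in [0, 1]."""
    if not 0.0 <= matching_degree <= 1.0:
        raise ValueError("matching_degree is assumed to lie in [0, 1]")
    return (matching_degree * analysis_error
            + (1.0 - matching_degree) * distribution_diff)


# Illustrative values for one DNA information sequence.
print(sequence_error(analysis_error=0.4, distribution_diff=0.2, matching_degree=0.8))
```

Other weighting schemes (for example, fixed coefficients scaled by the matching degree) would fit the description equally well.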
In step S63323, the electronic device optimizes the parameters of the microorganism composition analysis algorithm using the overall error and the sequence errors obtained through steps S63321 and S63322. The goal of the optimization is to find a set of parameter values such that the overall performance (as measured by the overall error) and the sequence-level performance (as measured by the sequence errors) of the algorithm on the training set are as good as possible. The optimization process uses, for example, a gradient descent algorithm (or a variant thereof), which determines the update direction of each parameter by calculating the gradient of the error function with respect to that parameter. The error function may be defined here as a weighted sum of the overall error and all sequence errors.
As an implementation manner, step S300, based on a microorganism composition analysis algorithm, performs implicit characterization extraction on the training sample and each DNA information sequence of the training sample, to obtain an overall implicit characterization array of the training sample and a sequence implicit characterization array corresponding to each DNA information sequence, and may include:
Step S310: and respectively carrying out weight distribution embedding on the training sample and each DNA information sequence of the training sample based on a microorganism composition analysis algorithm to obtain an integral implicit characterization array of the training sample and a sequence implicit characterization array of the DNA information sequence.
Specifically, the process may include: obtaining, based on the microorganism composition analysis algorithm, a DNA characterization vector and a classification characterization vector corresponding to each piece of DNA information in a DNA information sequence of the training sample, wherein the classification characterization vector characterizes the information type to which the DNA information belongs; encoding the position information of each piece of DNA information in the DNA information sequence to obtain a distribution position characterization vector corresponding to each piece of DNA information; combining the DNA characterization vector, the classification characterization vector and the distribution position characterization vector to obtain an implicit representation of each piece of DNA information in the DNA information sequence of the training sample; and performing weight distribution embedding on the implicit representations of the pieces of DNA information in the DNA information sequence to obtain the overall implicit characterization array of the training sample.
In step S310, the electronic device processes the training samples and their DNA information sequences using the microorganism composition analysis algorithm (which may be based on deep learning, machine learning or other algorithm models) and generates implicit characterizations by means of weight distribution embedding. Weight distribution embedding is a technique that maps raw data into a high-dimensional feature space and assigns different weights to different features so as to emphasize or suppress their importance in the characterization. Specifically, for each DNA information sequence in the training samples, the electronic device extracts a DNA characterization vector for each piece of DNA information (e.g., a k-mer fragment, a gene fragment, etc.).
The DNA characterization vector is a numerical representation of DNA information in a certain feature space, and may be learned by a model such as a convolutional neural network (CNN), a recurrent neural network (RNN), or a Transformer. These vectors capture the sequence specificity, base composition, structural features and other information of the DNA information. Meanwhile, the electronic device also assigns a classification characterization vector to each piece of DNA information, which characterizes the information type of the DNA information (such as a coding region, a non-coding region, a promoter region, and the like). The classification characterization vector may be obtained by one-hot encoding, an embedding layer, or another categorical feature encoding method.
For example, assume that one DNA information sequence comprises three DNA information fragments: fragment A, fragment B and fragment C. The DNA characterization vector of fragment A is $v_A = [v_{A1}, v_{A2}, \ldots, v_{An}]$, where n is the dimension of the vector; the classification characterization vector of fragment A is $c_A = [0, 1, 0, 0]$ (assuming four information types, with fragment A belonging to the second type). Similarly, the DNA characterization vectors and classification characterization vectors of fragment B and fragment C can be obtained. In order to preserve the positional information of the DNA information in the sequence, the electronic device encodes the position of each piece of DNA information in the DNA information sequence. The position code may be a simple integer index, a one-hot code, or a more complex position embedding vector. Position embedding vectors can capture the relative and absolute positional relationships between pieces of DNA information, which is important for understanding the structure and function of DNA sequences.
For example, assume that the DNA information sequence is arranged in the 5' to 3' direction, with fragment A at the first position, fragment B at the second position, and fragment C at the third position. A position embedding matrix $P \in \mathbb{R}^{L \times d}$ can be used to generate an embedding vector for each position, where L is the maximum number of positions possible in the sequence and d is the dimension of the embedding vector. Then, according to the actual position of each piece of DNA information in the sequence, the corresponding embedding vector is selected from the position embedding matrix as its distribution position characterization vector. For example, the distribution position characterization vector of fragment A may be $p_1$, that of fragment B $p_2$, and that of fragment C $p_3$. Next, the electronic device combines the DNA characterization vector, the classification characterization vector, and the distribution position characterization vector of each piece of DNA information to generate a comprehensive implicit representation of that DNA information. The merge operation may be simple concatenation, weighted summation, or another more complex feature fusion method. For example, assume that the dimension of the DNA characterization vector is n, the dimension of the classification characterization vector is m, and the dimension of the distribution position characterization vector is d. For fragment A, the implicit representation can be obtained by concatenation: $h_A = [v_A; c_A; p_1]$, where the semicolon (;) denotes vector concatenation, so that $h_A$ has dimension n + m + d. Similarly, the implicit representations $h_B$ and $h_C$ of fragment B and fragment C can be obtained.
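The concatenation of the three vectors into an implicit representation can be sketched as follows. The position embedding table is filled with seeded random numbers purely as a stand-in for learned embeddings, and the dimensions, function names and fragment values are hypothetical.

```python
import random


def build_position_table(max_len, dim, seed=0):
    """Stand-in for a learned position embedding matrix P of shape (L, d)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(max_len)]


def implicit_representation(dna_vec, class_vec, position, position_table):
    """Concatenate h = [v; c; p_position] for one piece of DNA information."""
    return list(dna_vec) + list(class_vec) + list(position_table[position])


# Hypothetical fragment A: n = 3, four information types (m = 4), d = 2.
position_table = build_position_table(max_len=8, dim=2)
v_A = [0.12, -0.40, 0.93]
c_A = [0, 1, 0, 0]          # fragment A belongs to the second information type
h_A = implicit_representation(v_A, c_A, position=0, position_table=position_table)
print(len(h_A), h_A)
```

Concatenation keeps the three information sources in separate coordinates; weighted summation would instead require the three vectors to share one dimension.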
Finally, the electronic device performs weight distribution embedding on the implicit representation of each piece of DNA information in the DNA information sequence to generate an overall implicit representation array of the training sample. The weight distribution can be calculated dynamically according to different characteristics of DNA information (such as length, GC content, conservation, etc.) or based on the mechanism of attention. The purpose of weight distribution embedding is to emphasize DNA information that contributes significantly to overall characterization, while suppressing noise or redundant information.
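One way to realize attention-based weight distribution embedding is a softmax-weighted sum of the implicit representations. The sketch below assumes dot-product scoring against a query vector; the query, the function names and the example vectors are hypothetical, and in practice the query would be learned during training.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_embedding(representations, query):
    """Pool implicit representations h_k with attention weights
    softmax(h_k . query), returning (pooled vector, weights)."""
    scores = [sum(h_j * q_j for h_j, q_j in zip(h, query)) for h in representations]
    weights = softmax(scores)
    dim = len(representations[0])
    pooled = [sum(w * h[j] for w, h in zip(weights, representations))
              for j in range(dim)]
    return pooled, weights


# Hypothetical implicit representations of fragments A, B and C.
h_list = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = weighted_embedding(h_list, query=[2.0, 0.5])
print(pooled, weights)
```

Fragments whose representations align with the query receive larger weights, which corresponds to emphasizing DNA information that contributes more to the overall characterization.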
Through this series of operations in step S310, the electronic device can convert the complex DNA sequence data into an implicit representation form that is easy to process. These implicit characterizations not only preserve the sequence specificity, structural features, positional information, etc. of the DNA information, but also emphasize DNA information that contributes significantly to the overall characterization by weight distribution embedding. These implicit characterizations will serve as input data for subsequent steps (e.g., cluster analysis, matching calculation, algorithm optimization, etc.) to provide a data base for the analysis of the microorganism composition.
Based on the same principle as the method shown in fig. 1, an analysis device 10 for analyzing the composition of an atmospheric aerosol microbial community is also provided in the embodiment of the present application, and as shown in fig. 2, the device 10 includes an algorithm training module 11 and a data analysis module 12. The algorithm training module 11 includes:
The template acquisition module 111 is used for acquiring a microorganism composition analysis algorithm and a high-throughput sequencing sequence template, wherein the training samples in the high-throughput sequencing sequence template comprise a supervision training sample and an unsupervised training sample, and the supervision training sample comprises a corresponding microorganism species mark;
the data segmentation module 112 is configured to segment the training samples for each training sample, and obtain a plurality of DNA information sequences of the training samples;
The feature extraction module 113 is configured to perform implicit characterization extraction on the training sample and each DNA information sequence of the training sample based on a microbial composition analysis algorithm, so as to obtain an overall implicit characterization array of the training sample and a sequence implicit characterization array corresponding to each DNA information sequence;
The cluster analysis module 114 is configured to perform cluster analysis on the overall implicit characterization array of each training sample in the high-throughput sequencing sequence template according to the microorganism species markers of the supervised training samples, so as to obtain microorganism species markers corresponding to the unsupervised training samples;
The commonality determining module 115 is configured to determine, for each DNA information sequence of the training samples, a matching degree between the DNA information sequence and the training sample to which the DNA information sequence belongs according to a sequence implicit characterization array corresponding to the DNA information sequence and an overall implicit characterization array of the training sample to which the DNA information sequence belongs;
The algorithm optimization module 116 is configured to optimize an algorithm parameter of a microorganism composition analysis algorithm according to the matching degree and the microorganism species mark corresponding to the unsupervised training sample, so as to obtain a target microorganism composition analysis algorithm, where the target microorganism composition analysis algorithm is used to perform microorganism community species analysis on the target high-throughput sequencing sequence, so as to obtain a microorganism species mark of the target high-throughput sequencing sequence;
the data analysis module 12 includes:
A data acquisition module 121 for acquiring a target high throughput sequencing sequence of a target atmospheric aerosol sample of a target region; the target high-throughput sequencing sequence is a sequencing sequence obtained by extracting DNA of a target atmospheric aerosol sample, amplifying the DNA by polymerase chain reaction and performing high-throughput sequencing;
The tag determination module 122 is configured to load the target high-throughput sequencing sequence into a target microorganism composition analysis algorithm to obtain a microorganism species tag of the target atmospheric aerosol sample.
The above embodiment describes the analysis device 10 for the composition of the atmospheric aerosol microbial community from the perspective of virtual modules; the following describes an electronic device from the perspective of physical modules, specifically as follows:
An embodiment of the present application provides an electronic device. As shown in fig. 3, the electronic device 100 includes a processor 101 and a memory 103, wherein the processor 101 is coupled to the memory 103, for example via a bus 102. Optionally, the electronic device 100 may also include a transceiver 104. It should be noted that, in practical applications, the number of transceivers 104 is not limited to one, and the structure of the electronic device 100 does not constitute a limitation on the embodiments of the present application. In other words, the electronic device in the embodiment of the present application includes: one or more processors; a memory; and one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and the one or more computer programs, when executed by the one or more processors, implement the methods described above.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are intended to fall within the scope of protection of the present application.