CN117147813A - A method for detecting polycystic ovary syndrome using electronic nose - Google Patents
- Publication number
- CN117147813A (application CN202311111199.0A)
- Authority
- CN
- China
- Prior art keywords
- sample set
- training
- polycystic ovary
- ovary syndrome
- electronic nose
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N33/00—Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
- G01N33/48—Biological material, e.g. blood, urine; Haemocytometers
- G01N33/483—Physical analysis of biological material
- G01N33/497—Physical analysis of biological material of gaseous biological material, e.g. breath
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01D—MEASURING NOT SPECIALLY ADAPTED FOR A SPECIFIC VARIABLE; ARRANGEMENTS FOR MEASURING TWO OR MORE VARIABLES NOT COVERED IN A SINGLE OTHER SUBCLASS; TARIFF METERING APPARATUS; MEASURING OR TESTING NOT OTHERWISE PROVIDED FOR
- G01D21/00—Measuring or testing not otherwise provided for
- G01D21/02—Measuring two or more variables by means not covered by a single other subclass
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Chemical & Material Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Analytical Chemistry (AREA)
- Biochemistry (AREA)
- Food Science & Technology (AREA)
- Urology & Nephrology (AREA)
- Immunology (AREA)
- Pathology (AREA)
- Artificial Intelligence (AREA)
- Medicinal Chemistry (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Hematology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
The invention discloses a method for detecting polycystic ovary syndrome (PCOS) using an electronic nose, and relates to the technical field of human exhaled-gas detection. The electronic nose comprises a sensor array, a circuit and a host computer. The sensor array acquires the original response signals of the breathing gas of PCOS patients and healthy people, and the circuit transmits these signals to the host computer for analysis. The host computer is provided with programs for data preprocessing, PCA dimension reduction, leave-one-out cross-validation and support vector machine classification, so that the data of PCOS patients and healthy people can be distinguished. The electronic nose system provided by the invention can compare the volatile organic compounds exhaled by PCOS patients and healthy people at an early stage and complete the predictive classification.
Description
Technical Field
The invention relates to the technical field of detection of human expiration gas, in particular to a method for detecting polycystic ovary syndrome by using an electronic nose.
Background
Polycystic ovary syndrome (PCOS) is a common endocrine and metabolic disease affecting approximately 5%-20% of women of childbearing age. Its pathogenesis is not clear, but interactions among environmental, genetic and lifestyle factors may be involved. According to the 2003 Rotterdam criteria (the gold standard for PCOS diagnosis), a woman has polycystic ovary syndrome if she exhibits two of the following three characteristics: 1. clinical and biochemical indicators of hyperandrogenism; 2. ovulatory dysfunction; 3. polycystic ovaries revealed by ovarian ultrasound examination.
Although the Rotterdam criteria have been accepted worldwide, they still have some drawbacks: (a) most young PCOS patients are obese, which makes transabdominal ultrasound examination difficult; furthermore, because most of them are adolescents, transvaginal ultrasound examination usually cannot be used either; (b) the estimation of the antral follicle count is subjective and not standardized, which leads to differences between observers; (c) the menstrual cycle phase and the use of oral contraceptives can alter the morphology of polycystic ovaries; (d) hyperandrogenism is difficult to diagnose clinically, because the Ferriman-Gallwey score is subjective and biochemical assessment is hampered by the cumbersome laboratory testing of androgens. Accordingly, scientific and clinical research is directed to developing new, reliable, simple and cost-effective methods for the detection and diagnosis of polycystic ovary syndrome.
As a novel noninvasive examination method, "breath detection" is used for the early diagnosis of diseases and is becoming increasingly important in medical systems because it is rapid, noninvasive and easily accepted by patients. Biomarkers generated by the metabolism of the organs of the human body reach the alveoli with the blood circulation and are discharged from the body through exhalation; as a result, human exhaled air contains more than 1000 volatile organic compounds. Specific volatile metabolites can be quantitatively analyzed and compared using an electronic nose or a gas chromatography-mass spectrometer (GC-MS), but GC-MS has certain limitations: it is costly, complex and time-consuming, and it requires expert operators.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a method for detecting polycystic ovary syndrome using an electronic nose, which comprises the following steps:
S1, collecting breathing gas through the sensors in the sensor array of an electronic nose system, and obtaining an original response data set X of the breathing gas of polycystic ovary syndrome patients and healthy people;
S2, preprocessing the original response data set X to obtain a sample set X_s;
S3, performing feature extraction on the sample set X_s to obtain a feature-extracted sample set X_m;
S4, performing feature dimension reduction on the feature-extracted sample set X_m to generate a new feature sample set X_m_pca;
S5, performing leave-one-out cross-validation on the new feature sample set X_m_pca, and randomly dividing it into a training sample set X_train_pca and a test sample set X_test_pca;
S6, training on the training sample set X_train_pca using a support vector machine algorithm, and performing hyperparameter tuning using a grid search method to obtain an optimal parameter combination;
S7, drawing an ROC curve and a box plot, determining the model performance by measuring the AUC of the ROC curve, calculating the ROC curve cut-off value of the training sample set according to the Youden index, and then classifying and identifying the polycystic ovary syndrome patients and healthy people in the test sample set X_test_pca to verify the prediction performance of the model.
Further technical solutions of the invention are as follows:
Further, in step S1, the electronic nose system comprises a mouthpiece, a micropump, a flowmeter, a sensor array, a control circuit and a host computer connected in series in sequence;
the mouthpiece is used for collecting breathing gas; the flowmeter is used for monitoring the gas flow in real time; the micropump is feedback-controlled by the flowmeter and is used for keeping the gas flow rate constant; the sensor array is used for collecting the original response signals of the breathing gas; the control circuit is used for transmitting the original response signals to the host computer; the host computer receives and displays the original response signals, judges whether the exhalation sample meets the standard, and classifies the breathing gas of polycystic ovary syndrome patients and healthy people by means of a support vector machine algorithm program.
In the aforementioned method for detecting polycystic ovary syndrome using an electronic nose, the sensor array comprises 12 gas sensors, 1 temperature sensor and 1 humidity sensor; the sensor array is preheated for 10 min before starting to work, its working voltage is set to 5 V, and its sampling frequency is set to 1 Hz.
In the aforementioned method for detecting polycystic ovary syndrome using an electronic nose, in step S1, the original response data set X contains, for the k-th (k = 1, 2, 3, …, m) breath sample, two kinds of values: the sensor responses to ambient air at each moment of the baseline phase, and the sensor responses to the exhalation signal at each moment of the exhalation phase.
In the aforementioned method for detecting polycystic ovary syndrome using an electronic nose, in step S2, the baseline value of each sensor is subtracted from its detection value, and the result is divided by the baseline value;
let the data comprise N_F samples in total, each sample having N_K sensors, each sensor having a detection dimension N_T and a stable baseline stage of dimension N_G (N_G < N_T); the baseline-processed response of the K-th (K = 1, 2, 3, …, N_K) sensor of the F-th (F = 1, 2, 3, …, N_F) sample at time T (T = 1, 2, 3, …, N_T) is:
$$X_s(F,K,T) = \frac{R_{(F,K,T)} - \frac{1}{N_G}\sum_{t=1}^{N_G} R_{(F,K,t)}}{\frac{1}{N_G}\sum_{t=1}^{N_G} R_{(F,K,t)}}$$
where R_(F,K,T) and R_(F,K,t) are the actual responses of the K-th sensor of the F-th sample at time T and at time t, respectively.
In the aforementioned method for detecting polycystic ovary syndrome using an electronic nose, in step S3, the feature extraction on the sample set X_s comprises the following steps:
S3.1, calculating the mean value μ of the features in the sample set X_s;
S3.2, calculating the standard deviation σ of the features in the sample set X_s;
S3.3, performing feature extraction on the sample set X_s by the following formula:
$$X_m = \frac{X_s - \mu}{\sigma}$$
where X_m is the feature-extracted data of the sample set.
In the aforementioned method for detecting polycystic ovary syndrome using an electronic nose, in step S4, principal component analysis (PCA) is used to perform feature dimension reduction on the feature-extracted sample set X_m, comprising the following steps:
S4.1, forming the corresponding matrix from the sample set X_m obtained by the feature extraction in step S3, and solving the covariance matrix C of this matrix;
S4.2, calculating the eigenvalues of the covariance matrix C and arranging them in descending order to obtain λ_1 ≥ λ_2 ≥ … ≥ λ_m, with corresponding eigenvectors β_1, β_2, …, β_m; these eigenvectors are called the 1st, 2nd, …, m-th principal components in sequence;
S4.3, calculating the contribution rate E_j of each principal component, as shown in the following formula:
$$E_j = \frac{\lambda_j}{\sum_{i=1}^{m} \lambda_i}$$
where λ_j is the j-th eigenvalue and the denominator is the sum of all eigenvalues;
S4.4, taking the first p (p ≤ m) principal components such that their cumulative contribution rate is greater than 95%, and calculating the scores Z of the standardized measured values in the directions of the first p principal components, as shown in the following formula:
$$Z = (\beta_1, \beta_2, \ldots, \beta_p)\, X_m$$
The first 2 principal components, whose cumulative variance contribution rate is greater than 95%, are taken as the new feature sample set.
In the aforementioned method for detecting polycystic ovary syndrome using an electronic nose, in step S5, the leave-one-out cross-validation of the new feature sample set X_m_pca comprises the following steps:
S5.1, dividing the new feature sample set X_m_pca into k independent subsets X_1, X_2, …, X_k, where k is the number of data items in the feature sample set;
S5.2, taking the subset X_1 as the test set and the remaining k-1 subsets as the training set to complete the first training;
S5.3, selecting the next subset X_2 as the test set and the remaining k-1 subsets as the training set to complete the second training;
S5.4, selecting each subset in turn as the test set with the remaining k-1 subsets as the training set, so that training is repeated k times;
S5.5, summing and averaging the k recognition rates to obtain the result of the leave-one-out cross-validation;
S5.6, randomly dividing the new feature sample set X_m_pca in a ratio of 7:3 into a training sample set X_train_pca and a test sample set X_test_pca.
In the aforementioned method for detecting polycystic ovary syndrome using an electronic nose, step S6 specifically comprises the following sub-steps:
S6.1, selecting the radial basis function as the classification kernel function of the support vector machine algorithm;
S6.2, letting the feature data be n-dimensional with l groups of data in total, i.e. (x_1, y_1), …, (x_l, y_l) with x_i ∈ R^n; the decision plane is expressed as
$$\omega \cdot g(x) + b = 0$$
where ω represents the weight coefficient of the decision plane, g(x) represents a nonlinear mapping function, and b represents a threshold;
S6.3, in order to minimize the structural risk, the optimal classification hyperplane satisfies the following condition
$$\min_{\omega, b}\ \frac{1}{2}\|\omega\|^2 \quad \text{s.t.}\quad y_i\,(\omega \cdot g(x_i) + b) \ge 1,\quad i = 1, \ldots, l$$
Introducing a non-negative slack variable ξ_i converts the optimization problem into
$$\min_{\omega, b, \xi}\ \frac{1}{2}\|\omega\|^2 + c\sum_{i=1}^{l}\xi_i \quad \text{s.t.}\quad y_i\,(\omega \cdot g(x_i) + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \ldots, l$$
where c represents a penalty factor used for controlling the complexity and generalization ability of the model;
by introducing Lagrange multipliers α_i, the optimization problem is converted into its dual form
$$\max_{\alpha}\ \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.}\quad \sum_{i=1}^{l}\alpha_i y_i = 0,\quad 0 \le \alpha_i \le c$$
where K(x_i, x_j) = (g(x_i)·g(x_j));
S6.4, introducing the RBF kernel function:
$$K(x_i, x_j) = \exp\left(-g\,\|x_i - x_j\|^2\right)$$
where g represents the kernel parameter used for controlling the range of the input space;
the optimization problem is then converted into:
$$\max_{\alpha}\ \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j \exp\left(-g\,\|x_i - x_j\|^2\right) \quad \text{s.t.}\quad \sum_{i=1}^{l}\alpha_i y_i = 0,\quad 0 \le \alpha_i \le c$$
S6.5, introducing a grid search method to optimize the parameters c and g during model building; meanwhile, 5-fold cross-validation is adopted and the minimum root mean square error of the training set is used as the fitness function for parameter optimization; the parameters c and g obtained when the minimum root mean square error is reached are the optimal parameters;
sample training is carried out using the training sample set, and hyperparameter tuning is performed by the grid search method to obtain the optimal parameter combination;
the support vector machine model with the optimal parameter combination is taken as the trained machine learning classification model, and the test sample set is then analyzed and predicted with this trained model.
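The RBF kernel of step S6.4 can be illustrated with a short numerical sketch. It only evaluates the kernel matrix for a few made-up points and is not the patent's implementation; the function name `rbf_kernel_matrix` and the sample values are assumptions introduced for illustration.

```python
import numpy as np

def rbf_kernel_matrix(x, g):
    """Return K with K[i, j] = exp(-g * ||x_i - x_j||^2) for the rows of x."""
    x = np.asarray(x, dtype=float)
    # Pairwise differences between all rows, then squared Euclidean distances.
    diff = x[:, None, :] - x[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-g * sq_dist)

# Hypothetical 2-D feature vectors (not data from the patent).
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
k_narrow = rbf_kernel_matrix(points, g=2.0)  # large g: similarity decays quickly
k_wide = rbf_kernel_matrix(points, g=0.1)    # small g: distant points stay similar
```

A larger g makes each training point influence only its close neighbourhood, which is why c and g are tuned together in step S6.5.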
In the aforementioned method for detecting polycystic ovary syndrome using an electronic nose, step S7 specifically comprises the following sub-steps:
S7.1, according to the probability scores with which the training samples are predicted to be positive and their true labels, taking each score in turn from high to low as the threshold, and drawing the ROC curve;
S7.2, based on the classification of the training sample set into polycystic ovary syndrome patients and healthy people, dividing the test set results into four types: correctly predicted as PCOS patients, correctly predicted as healthy people, incorrectly predicted as PCOS patients, and incorrectly predicted as healthy people; and drawing the corresponding box plot;
S7.3, calculating the area under the ROC curve, i.e. the AUC, which lies between 0 and 1 and is used for evaluating the quality of the classifier;
S7.4, finding the threshold corresponding to the point on the ROC curve with the largest difference between the ordinate and the abscissa, wherein the Youden index is calculated as follows:
index=1+TPR-FPR
where TPR is the proportion of all positive samples that are correctly identified as positive, and FPR is the proportion of all negative samples that are incorrectly predicted as positive; the probability value corresponding to the maximum of the Youden index is the optimal threshold.
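The TPR and FPR definitions above can be sketched in a few lines of plain Python. The labels below are made up, the function name `confusion_rates` is hypothetical, and the sketch assumes that both classes are present (otherwise a denominator would be zero). The index is computed in the form given in the text; the commonly cited Youden index is TPR - FPR, which differs only by the constant 1 and therefore reaches its maximum at the same threshold.

```python
def confusion_rates(y_true, y_pred):
    """Return (TPR, FPR, index) for binary labels, 1 = PCOS patient, 0 = healthy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted PCOS
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # PCOS predicted healthy
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy predicted PCOS
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly predicted healthy
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    index = 1 + tpr - fpr  # form used in the text above
    return tpr, fpr, index

# Hypothetical labels and predictions at one fixed threshold.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
tpr, fpr, index = confusion_rates(y_true, y_pred)
```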
The beneficial effects of the invention are as follows:
(1) In the invention, the original response signals of the breathing gas of PCOS patients and healthy people are collected by the sensor array of the electronic nose system and transmitted through the circuit to the host computer for analysis; the host computer of the electronic nose system is provided with programs for data preprocessing, PCA dimension reduction, leave-one-out cross-validation and support vector machine classification, so that the data of PCOS patients and healthy people can be distinguished;
(2) The invention can compare the volatile organic compounds exhaled by PCOS patients and healthy people at an early stage and complete the predictive classification, so that polycystic ovary syndrome patients and healthy people can be preliminarily screened rapidly and accurately. The method is a screening method for non-diagnostic purposes; on the one hand it reduces the hardware cost of the system, and on the other hand it provides a basis for the judgment of clinicians.
Drawings
FIG. 1 is a schematic overall flow chart of an embodiment of the present invention;
FIG. 2 is a block diagram of an electronic nose system according to an embodiment of the present invention;
FIG. 3 is a schematic view of the ROC curve and the area under the curve in an embodiment of the invention;
FIG. 4 is a schematic diagram of the blind classification results according to an embodiment of the present invention.
Detailed Description
In the method for detecting polycystic ovary syndrome using an electronic nose provided by this embodiment, breath samples are collected by the electronic nose system. A total of 170 breath samples are collected, of which 90 are from polycystic ovary syndrome patients and 80 are from healthy people; the PCOS patients and healthy people are from the Reproductive Center of the First Affiliated Hospital of Anhui Medical University.
This embodiment is written and run on a Dell 3681 computer with Windows 11 (64-bit operating system), an Intel i5 processor and 8 GB of memory, using PyCharm 2022, Python 3.9 and scikit-learn 1.0.2.
As shown in fig. 1, a method for detecting polycystic ovary syndrome using an electronic nose includes the steps of:
S1, collecting breathing gas through the sensors in the sensor array of the electronic nose system, and obtaining an original response data set X of the breathing gas of polycystic ovary syndrome patients and healthy people.
As shown in FIG. 2, the electronic nose system comprises a mouthpiece, a micropump, a flowmeter, a sensor array, a control circuit and a host computer connected in series in sequence; the sensor array comprises 12 gas sensors, 1 temperature sensor and 1 humidity sensor; the sensor array is preheated for 10 min before starting to work, its working voltage is set to 5 V, and its sampling frequency is set to 1 Hz.
The mouthpiece is used for collecting breathing gas; the flowmeter is used for monitoring the gas flow in real time; the micropump is feedback-controlled by the flowmeter and is used for keeping the gas flow rate constant; the sensor array is used for collecting the original response signals of the breathing gas; the control circuit is used for transmitting the original response signals to the host computer; the host computer receives and displays the original response signals, judges whether the exhalation sample meets the standard according to the values of the original response signals and the original humidity curve, and classifies the breathing gas of polycystic ovary syndrome patients and healthy people by means of the support vector machine algorithm program.
In step S1, the sensors in the sensor array of the electronic nose system are used to obtain the original response data set X of the breathing gas of the different groups. For the k-th (k = 1, 2, 3, …, m) breath sample, X contains two kinds of values: the sensor responses to ambient air at each moment of the baseline phase, and the sensor responses to the exhalation signal at each moment of the exhalation phase.
S2, preprocessing the original response data set X to obtain a sample set X_s: the baseline value of each sensor is subtracted from its detection value, and the result is divided by the baseline value;
let the data comprise N_F samples in total, each sample having N_K sensors, each sensor having a detection dimension N_T and a stable baseline stage of dimension N_G (N_G < N_T); the baseline-processed response of the K-th (K = 1, 2, 3, …, N_K) sensor of the F-th (F = 1, 2, 3, …, N_F) sample at time T (T = 1, 2, 3, …, N_T) is:
$$X_s(F,K,T) = \frac{R_{(F,K,T)} - \frac{1}{N_G}\sum_{t=1}^{N_G} R_{(F,K,t)}}{\frac{1}{N_G}\sum_{t=1}^{N_G} R_{(F,K,t)}}$$
where R_(F,K,T) and R_(F,K,t) are the actual responses of the K-th sensor of the F-th sample at time T and at time t, respectively.
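The baseline processing of step S2 can be sketched with NumPy as follows. This is a minimal illustration rather than the embodiment's code: it assumes that the baseline value of a sensor is the mean of its first N_G readings, the function name `baseline_correct` is hypothetical, and the input values are synthetic.

```python
import numpy as np

def baseline_correct(raw, n_baseline):
    """Relative baseline correction, (R - baseline) / baseline.

    raw        : array of shape (N_F, N_K, N_T) = samples x sensors x time points
    n_baseline : N_G, number of leading time points in the stable baseline stage
    """
    raw = np.asarray(raw, dtype=float)
    if not 0 < n_baseline < raw.shape[-1]:
        raise ValueError("n_baseline must satisfy 0 < N_G < N_T")
    # Assumption: the baseline of each sensor is the mean of its first N_G readings.
    baseline = raw[..., :n_baseline].mean(axis=-1, keepdims=True)
    return (raw - baseline) / baseline

# Synthetic example: 2 samples, 3 sensors, 6 time points (made-up values).
rng = np.random.default_rng(0)
raw = 1.0 + rng.random((2, 3, 6))
corrected = baseline_correct(raw, n_baseline=3)
```

Dividing by the baseline makes the responses of sensors with different resting levels comparable, at the cost of requiring a non-zero baseline.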
S3, performing feature extraction on the sample set X_s to obtain a feature-extracted sample set X_m; the feature extraction on the sample set X_s comprises the following steps:
S3.1, calculating the mean value μ of the features in the sample set X_s;
S3.2, calculating the standard deviation σ of the features in the sample set X_s;
S3.3, performing feature extraction on the sample set X_s by the following formula:
$$X_m = \frac{X_s - \mu}{\sigma}$$
where X_m is the feature-extracted data of the sample set.
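Step S3 amounts to standardizing each feature to zero mean and unit standard deviation. A minimal sketch is shown below, assuming that rows are samples and columns are features; the function name `standardize`, the handling of constant features and the numbers are assumptions for illustration only.

```python
import numpy as np

def standardize(x_s):
    """Z-score each feature column: (x - mean) / standard deviation."""
    x_s = np.asarray(x_s, dtype=float)
    mu = x_s.mean(axis=0)
    sigma = x_s.std(axis=0)
    # Assumption: a constant feature (sigma == 0) is only centred, not divided.
    sigma = np.where(sigma == 0, 1.0, sigma)
    return (x_s - mu) / sigma

# Made-up feature matrix: 3 samples, 2 features on very different scales.
x_s = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_m = standardize(x_s)
```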
S4, performing feature dimension reduction on the feature-extracted sample set X_m to generate a new feature sample set X_m_pca; principal component analysis (PCA) is used to perform the feature dimension reduction on X_m, comprising the following steps:
S4.1, forming the corresponding matrix from the sample set X_m obtained by the feature extraction in step S3, and solving the covariance matrix C of this matrix.
S4.2, calculating the eigenvalues of the covariance matrix C and arranging them in descending order to obtain λ_1 ≥ λ_2 ≥ … ≥ λ_m, with corresponding eigenvectors β_1, β_2, …, β_m; these eigenvectors are called the 1st, 2nd, …, m-th principal components in sequence.
S4.3, calculating the contribution rate E_j of each principal component, as shown in the following formula:
$$E_j = \frac{\lambda_j}{\sum_{i=1}^{m} \lambda_i}$$
where λ_j is the j-th eigenvalue and the denominator is the sum of all eigenvalues.
S4.4, taking the first p (p ≤ m) principal components such that their cumulative contribution rate is greater than 95%, and calculating the scores Z of the standardized measured values in the directions of the first p principal components, as shown in the following formula:
$$Z = (\beta_1, \beta_2, \ldots, \beta_p)\, X_m$$
The first 2 principal components, whose cumulative variance contribution rate is greater than 95%, are taken as the new feature sample set.
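Steps S4.1 to S4.4 can be sketched directly from the covariance eigendecomposition. The code below is an illustration on synthetic data, not the embodiment's program; the function name `pca_scores` and the row-per-sample layout are assumptions, and the scores are computed as the centred data multiplied by the first p eigenvectors.

```python
import numpy as np

def pca_scores(x_m, threshold=0.95):
    """Project x_m (samples x features) onto the first p principal components,
    where p is the smallest number whose cumulative contribution exceeds threshold."""
    x_m = np.asarray(x_m, dtype=float)
    centred = x_m - x_m.mean(axis=0)
    cov = np.cov(centred, rowvar=False)             # S4.1: covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(cov)          # S4.2: eigen-decomposition
    order = np.argsort(eigvals)[::-1]               # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    contribution = eigvals / eigvals.sum()          # S4.3: contribution rate E_j
    cumulative = np.cumsum(contribution)
    # S4.4: first index at which the cumulative contribution is strictly > threshold.
    p = int(np.searchsorted(cumulative, threshold, side="right")) + 1
    p = min(p, len(eigvals))
    scores = centred @ eigvecs[:, :p]
    return scores, contribution, p

# Synthetic data: 6 correlated features driven by 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 2))
mixing = rng.normal(size=(2, 6))
x_m = latent @ mixing + 0.01 * rng.normal(size=(40, 6))
scores, contribution, p = pca_scores(x_m)
```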
S5, performing leave-one-out cross-validation on the new feature sample set X_m_pca, and randomly dividing it into a training sample set X_train_pca and a test sample set X_test_pca; the leave-one-out cross-validation of X_m_pca comprises the following steps:
S5.1, dividing the new feature sample set X_m_pca into k independent subsets X_1, X_2, …, X_k, where k is the number of data items in the feature sample set;
S5.2, taking the subset X_1 as the test set and the remaining k-1 subsets as the training set to complete the first training;
S5.3, selecting the next subset X_2 as the test set and the remaining k-1 subsets as the training set to complete the second training;
S5.4, selecting each subset in turn as the test set with the remaining k-1 subsets as the training set, so that training is repeated k times;
S5.5, summing and averaging the k recognition rates to obtain the result of the leave-one-out cross-validation, where the recognition rates in this embodiment refer to accuracy, sensitivity and specificity;
S5.6, randomly dividing the new feature sample set X_m_pca in a ratio of 7:3 into a training sample set X_train_pca and a test sample set X_test_pca.
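Step S5 can be sketched with scikit-learn, which this embodiment names as its library. The data below are two synthetic Gaussian clusters, not breath data, and the function name `leave_one_out_rates` is hypothetical. Because each held-out fold contains a single sample, sensitivity and specificity are computed here from the pooled held-out predictions, which is an assumption about how the k recognition rates are averaged.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, train_test_split
from sklearn.svm import SVC

def leave_one_out_rates(x, y):
    """Leave-one-out CV with an RBF SVM; returns accuracy, sensitivity, specificity."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    preds = np.empty_like(y)
    for train_idx, test_idx in LeaveOneOut().split(x):
        model = SVC(kernel="rbf").fit(x[train_idx], y[train_idx])
        preds[test_idx] = model.predict(x[test_idx])
    tp = np.sum((preds == 1) & (y == 1))
    tn = np.sum((preds == 0) & (y == 0))
    fp = np.sum((preds == 1) & (y == 0))
    fn = np.sum((preds == 0) & (y == 1))
    return (tp + tn) / len(y), tp / (tp + fn), tn / (tn + fp)

# Synthetic stand-in for X_m_pca: 30 "patients" (label 1) and 30 "healthy" (label 0).
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(2.0, 1.0, (30, 2)), rng.normal(-2.0, 1.0, (30, 2))])
y = np.array([1] * 30 + [0] * 30)
accuracy, sensitivity, specificity = leave_one_out_rates(x, y)

# S5.6: random 7:3 split into training and test sets.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=9
)
```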
S6, training on the training sample set X_train_pca using a support vector machine (SVM) algorithm, and performing hyperparameter tuning using a grid search method to improve the accuracy of the model and obtain an optimal parameter combination; this specifically comprises the following sub-steps:
S6.1, selecting the radial basis function as the classification kernel function of the support vector machine algorithm;
According to the theory of the support vector machine (SVM) algorithm, the SVM is a supervised pattern recognition method whose main idea is to establish a classification decision surface. The SVM uses a kernel function to map the data into a high-dimensional space so that they become as linearly separable as possible. Common kernel functions include the linear kernel, the polynomial kernel, the radial basis function (RBF) kernel, the Fourier kernel, the spline kernel and the sigmoid kernel. Comparing the data characteristics to which these kernel functions are suited, the RBF kernel exhibits good classification performance whether the sample features are high-dimensional or low-dimensional and whether the data size is large or small; RBF is therefore selected as the classification kernel function of the SVM.
S6.2, letting the feature data be n-dimensional with l groups of data in total, i.e. (x_1, y_1), …, (x_l, y_l) with x_i ∈ R^n; the decision plane is expressed as
$$\omega \cdot g(x) + b = 0$$
where ω represents the weight coefficient of the decision plane, g(x) represents a nonlinear mapping function, and b represents a threshold.
S6.3, in order to minimize the structural risk, the optimal classification hyperplane satisfies the following condition
$$\min_{\omega, b}\ \frac{1}{2}\|\omega\|^2 \quad \text{s.t.}\quad y_i\,(\omega \cdot g(x_i) + b) \ge 1,\quad i = 1, \ldots, l$$
A non-negative slack variable ξ_i is introduced so that the classification error stays within a specified range, and the optimization problem is thus converted into
$$\min_{\omega, b, \xi}\ \frac{1}{2}\|\omega\|^2 + c\sum_{i=1}^{l}\xi_i \quad \text{s.t.}\quad y_i\,(\omega \cdot g(x_i) + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \ldots, l$$
where c represents a penalty factor used for controlling the complexity and generalization ability of the model;
by introducing Lagrange multipliers α_i, the optimization problem is converted into its dual form
$$\max_{\alpha}\ \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.}\quad \sum_{i=1}^{l}\alpha_i y_i = 0,\quad 0 \le \alpha_i \le c$$
where K(x_i, x_j) = (g(x_i)·g(x_j)).
S6.4, introducing the RBF kernel function:
$$K(x_i, x_j) = \exp\left(-g\,\|x_i - x_j\|^2\right)$$
where g represents the kernel parameter used for controlling the range of the input space;
the optimization problem is then converted into:
$$\max_{\alpha}\ \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j \exp\left(-g\,\|x_i - x_j\|^2\right) \quad \text{s.t.}\quad \sum_{i=1}^{l}\alpha_i y_i = 0,\quad 0 \le \alpha_i \le c$$
From the above formula it can be seen that the optimization problem depends on two important parameters, c and g, which affect the prediction performance of the SVM.
S6.5, in order to improve the prediction performance of the model, a grid search method (GS) is introduced to optimize the parameters c and g during model building; meanwhile, in order to avoid over-fitting and under-fitting of the model, 5-fold cross-validation is adopted and the minimum root mean square error of the training set is used as the fitness function for parameter optimization; the parameters c and g obtained when the minimum root mean square error is reached are the optimal parameters. In GS, a global search is performed at intervals of 0.5, and the range of the parameters c and g is (2^-10, 2^10).
Sample training is carried out using the training sample set, and hyperparameter tuning is performed by the grid search method to obtain the optimal parameter combination: kernel=rbf, c=2, probability=True, test_size=0.3, random_state=9;
the support vector machine model with the optimal parameter combination is taken as the trained machine learning classification model, and the test sample set is then analyzed and predicted with this trained model.
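The grid search of step S6.5 can be sketched with scikit-learn's `GridSearchCV`. Several points are assumptions made for illustration: the interval of 0.5 is interpreted as a step on the exponent of 2, cross-validated accuracy is swapped in for the root-mean-square-error fitness function (for 0/1 labels the RMSE of predicted labels is the square root of the error rate, so the two are closely related), the helper name `tune_rbf_svm` is hypothetical, and the data are synthetic clusters. scikit-learn's `C` and `gamma` play the roles of c and g in the text.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

def tune_rbf_svm(x_train, y_train, exponent_step=0.5):
    """Grid-search C and gamma over 2^-10 .. 2^10 with 5-fold cross-validation."""
    n_points = int(round(20.0 / exponent_step)) + 1
    exponents = np.linspace(-10.0, 10.0, n_points)
    grid = {"C": 2.0 ** exponents, "gamma": 2.0 ** exponents}
    # Assumption: accuracy is used as the score instead of the RMSE fitness function.
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5, scoring="accuracy")
    search.fit(x_train, y_train)
    return search

# Synthetic stand-in for X_m_pca (not breath data).
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(2.0, 1.0, (30, 2)), rng.normal(-2.0, 1.0, (30, 2))])
y = np.array([1] * 30 + [0] * 30)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=9
)

# A coarse exponent step keeps this demonstration fast; 0.5 gives the finer grid.
search = tune_rbf_svm(x_train, y_train, exponent_step=2.0)

# Refit with probability estimates enabled so that ROC analysis is possible later.
model = SVC(kernel="rbf", probability=True, random_state=9, **search.best_params_)
model.fit(x_train, y_train)
test_accuracy = model.score(x_test, y_test)
pcos_probability = model.predict_proba(x_test)[:, 1]  # column for class label 1
```

Searching on a logarithmic grid is a common choice for c and g because useful values can span several orders of magnitude.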
S7, drawing a receiver operating characteristic (ROC) curve and a box plot, determining the model performance by measuring the area under the curve (AUC) of the ROC curve, calculating the ROC curve cut-off value of the training sample set according to the Youden index, and then classifying and identifying the polycystic ovary syndrome patients and healthy people in the test sample set X_test_pca to verify the prediction performance of the model.
Step S7 specifically comprises the following sub-steps:
S7.1, according to the probability scores with which the training samples are predicted to be the positive class (P) and their true labels (P/N), taking each score in turn from high to low as the threshold, and drawing the ROC curve.
S7.2, based on the classification of the training sample set into polycystic ovary syndrome patients and healthy people, dividing the test set results into four types: correctly predicted as PCOS patients (TP), correctly predicted as healthy people (TN), incorrectly predicted as PCOS patients (FP), and incorrectly predicted as healthy people (FN); and drawing the corresponding box plot.
S7.3, the area under the ROC curve is the AUC, which lies in the range 0 to 1 and can be used as a single value to intuitively evaluate the quality of the classifier; as shown in FIG. 3, the ROC curve is drawn and the area under it is calculated, giving AUC = 0.912.
S7.4, finding the threshold corresponding to the point on the ROC curve with the largest difference between the ordinate and the abscissa, wherein the Youden index is calculated as follows:
index=1+TPR-FPR
where TPR is the proportion of all positive samples that are correctly identified as positive, and FPR is the proportion of all negative samples that are incorrectly predicted as positive; the probability value corresponding to the maximum of the Youden index is the optimal threshold. The ROC threshold of the training sample set is calculated to be 0.441, and the blind classification of the test sample set is completed according to this optimal threshold; the blind classification result is shown in FIG. 4.
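The cut-off selection of step S7 can be sketched with scikit-learn's `roc_curve` and `roc_auc_score`. The scores below are made-up numbers chosen so that the result is easy to check by hand; they are not the patent's data, and the function name `youden_cutoff` is hypothetical. The index is computed in the form given above, which has its maximum at the same threshold as the more common TPR - FPR.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def youden_cutoff(y_true, scores):
    """Return the score threshold that maximises 1 + TPR - FPR, and that maximum."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    index = 1 + tpr - fpr
    best = int(np.argmax(index))
    return thresholds[best], index[best]

# Hypothetical training labels (1 = PCOS) and predicted positive-class probabilities.
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
train_scores = np.array([0.1, 0.2, 0.3, 0.75, 0.4, 0.7, 0.8, 0.9])

auc = roc_auc_score(y_train, train_scores)                 # 14 of 16 pairs ranked correctly: 0.875
cutoff, best_index = youden_cutoff(y_train, train_scores)  # cutoff 0.4, index 1.75

# Blind classification of unseen scores with the cut-off learned on the training set.
test_scores = np.array([0.35, 0.55, 0.95])
test_pred = (test_scores >= cutoff).astype(int)            # [0, 1, 1]
```

Choosing the cut-off on the training set and applying it unchanged to the test set keeps the test classification blind, as the text describes.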
In the experiment, 90 breath samples from polycystic ovary syndrome patients and 80 breath samples from healthy people were collected. The age distributions of the two groups are comparable. The PCOS patients were either newly diagnosed or had been treated previously but still showed ovulatory abnormality, hormonal disorder and dyslipidemia, and had received no continuous treatment in the preceding three months. The subjects had no history of other diseases such as primary type 1 diabetes or liver or kidney dysfunction. The experiment uses the leave-one-out method to complete the cross-validation, analyzes the prediction probabilities given by the trained support vector machine model, and performs blind classification using the optimal threshold of the ROC curve. The leave-one-out cross-validation results and the blind classification results are summarized in Table 1 below:
TABLE 1
The results show that the method for detecting polycystic ovary syndrome using an electronic nose can rapidly and accurately perform preliminary screening to distinguish polycystic ovary syndrome patients from healthy people. It is a screening method for non-diagnostic purposes that, on the one hand, reduces the hardware cost of the system and, on the other hand, provides a basis for the clinician's judgment.
In addition to the embodiments described above, other embodiments of the invention are possible. All technical schemes formed by equivalent substitution or equivalent transformation fall within the protection scope of the invention.
Claims (10)
1. A method for detecting polycystic ovary syndrome using an electronic nose, characterized in that the method comprises the following steps:
S1, collecting respiratory gas through the sensors in the sensor array of an electronic nose system to obtain an original response data set $X$ of the respiratory gas of polycystic ovary syndrome patients and healthy people;
S2, preprocessing the original response data set $X$ to obtain a sample set $X_s$;
S3, performing feature extraction on the sample set $X_s$ to obtain a feature-extracted sample set $X_m$;
S4, performing feature dimension reduction on the feature-extracted sample set $X_m$ to generate a new feature sample set $X_{m\_pca}$;
S5, performing cross-validation on the new feature sample set $X_{m\_pca}$ by the leave-one-out method, and randomly dividing it into a training sample set $X_{train\_pca}$ and a test sample set $X_{test\_pca}$;
S6, training on the training sample set $X_{train\_pca}$ with a support vector machine algorithm, and performing hyperparameter tuning by a grid search method to obtain an optimal parameter combination;
S7, drawing an ROC curve and a box plot, determining model performance by measuring the AUC of the ROC curve, calculating the ROC curve cut-off value of the training sample set according to the Youden index, and then classifying and identifying the polycystic ovary syndrome patients and healthy people in the test sample set $X_{test\_pca}$ to verify the prediction performance of the model.
2. The method for detecting polycystic ovary syndrome by electronic nose according to claim 1, wherein: in the step S1, the electronic nose system comprises a blowing nozzle, a micropump, a flowmeter, a sensor array, a control circuit and an upper computer which are sequentially connected in series;
the blowing nozzle is used for collecting respiratory gas; the flowmeter is used for monitoring the gas flow in real time; the micropump is feedback-controlled by the flowmeter and is used for keeping the gas flow rate constant; the sensor array is used for collecting the original response signals of the respiratory gas; the control circuit is used for transmitting the original response signals to the upper computer; the upper computer receives and displays the original response signals, judges whether the exhaled sample meets the standard, and classifies the respiratory gas of polycystic ovary syndrome patients and healthy people through a support vector machine algorithm program.
3. The method for detecting polycystic ovary syndrome by electronic nose according to claim 2, wherein: the sensor array comprises 12 gas sensors, 1 temperature sensor and 1 humidity sensor; the sensor array is preheated for 10 min before starting to work, the working voltage of the sensor array is set to 5 V, and the sampling frequency is set to 1 Hz.
4. The method for detecting polycystic ovary syndrome by electronic nose according to claim 1, wherein: in said step S1, the original response data set X is as follows:
wherein, for the k-th (k = 1, 2, 3, …, m) breath sample, the first quantity is the sensor's response to ambient air at a given moment during the baseline phase, and the second quantity is the sensor's response to the exhaled gas at a given moment during the exhalation phase.
5. The method for detecting polycystic ovary syndrome by electronic nose according to claim 1, wherein: in the step S2, the baseline value is subtracted from the detected value of each sensor, and the result is divided by the baseline value;
let the data comprise $N_F$ samples in total, each sample having $N_K$ sensors, the detection dimension of each sensor being $N_T$ and the dimension of the stable baseline phase being $N_G$ ($N_G < N_T$); for the $F$-th ($F = 1, 2, 3, \dots, N_F$) sample, the baseline-processed response of the $K$-th ($K = 1, 2, 3, \dots, N_K$) sensor at time $T$ ($T = 1, 2, 3, \dots, N_T$) is:
wherein $R_{(F,K,T)}$ and $R_{(F,K,t)}$ are the actual responses of the $K$-th sensor at time $T$ and time $t$, respectively, for the $F$-th sample.
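By way of illustration only, the baseline processing of step S2 for one sensor trace can be sketched as follows. The sketch assumes that the baseline value is the mean of the first $N_G$ readings (the stable baseline phase); the claim text itself only states that the baseline value is subtracted and then divided out. The function name and the sample readings are hypothetical.

```python
def baseline_correct(response, n_g):
    """Return (R - baseline) / baseline for every reading of one sensor.

    Assumption: the baseline is the mean of the first n_g readings.
    """
    baseline = sum(response[:n_g]) / n_g
    return [(r - baseline) / baseline for r in response]


# Hypothetical raw trace: 4 baseline readings followed by the exhalation phase.
raw = [2.0, 2.0, 2.0, 2.0, 3.0, 4.0, 5.0]
corrected = baseline_correct(raw, n_g=4)
print(corrected)
```

Applying the same function to each sensor of each sample yields the sample set $X_s$.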
6. The method for detecting polycystic ovary syndrome by electronic nose according to claim 1, wherein: in the step S3, the method of performing feature extraction on the sample set $X_s$ comprises the following steps:
S3.1, calculating the mean of the features in the sample set $X_s$;
S3.2, calculating the standard deviation $\sigma$ of the features in the sample set $X_s$;
S3.3, performing feature extraction on the sample set $X_s$ by the following formula:
wherein $X_m$ is the feature-extracted data of the sample set.
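By way of illustration only, and assuming that step S3 is the usual standardization (each feature minus its mean, divided by its standard deviation), the computation can be sketched as follows. Whether the population or the sample standard deviation is intended is not stated, so the population form is an assumption; the function name and data are hypothetical.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Standardize each feature column: (value - mean) / standard deviation.

    Assumptions: population standard deviation, and no constant column
    (a constant column would give a zero divisor).
    """
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    return [[(v - mu) / s for v, mu, s in zip(row, mus, sigmas)]
            for row in rows]


# Hypothetical sample set X_s: 3 samples, 2 features.
X_s = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_m = standardize_columns(X_s)
print(X_m)
```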
7. The method for detecting polycystic ovary syndrome by electronic nose according to claim 1, wherein: in the step S4, principal component analysis (PCA) is adopted to perform feature dimension reduction on the feature-extracted sample set $X_m$, comprising the following steps:
S4.1, forming the feature-extracted sample set $X_m$ obtained in step S3 into a corresponding matrix, and solving the covariance matrix $C$ of the matrix;
S4.2, calculating the eigenvalues of the covariance matrix $C$ and arranging them in descending order to obtain $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$, with corresponding eigenvectors $\beta_1, \beta_2, \dots, \beta_m$, which are called the 1st, 2nd, …, m-th principal components in sequence;
S4.3, calculating the contribution rate $R_j$ of each principal component, as shown in the following formula:
$$R_j = \frac{\lambda_j}{\sum_{i=1}^{m} \lambda_i}$$
wherein $\lambda_j$ is the j-th eigenvalue and $\sum_{i=1}^{m} \lambda_i$ is the sum of all eigenvalues;
S4.4, taking the first p (p ≤ m) principal components such that their cumulative contribution rate is greater than 95%, and calculating the scores $Z$ of the standardized measured values in the directions of the first p principal components, as shown in the following formula:
$Z = (\beta_1, \beta_2, \dots, \beta_p) X_m$
The first 2 principal components, whose cumulative variance contribution rate is greater than 95%, are taken as the new feature sample set.
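By way of illustration only, the selection rule of steps S4.3 and S4.4 can be sketched as follows: each eigenvalue is divided by the sum of all eigenvalues to give its contribution rate, and the smallest p whose cumulative rate exceeds 95% is kept. The eigenvalues below are hypothetical and were chosen so that p = 2, mirroring the claim; the function names are likewise hypothetical.

```python
def contribution_rates(eigenvalues):
    """Contribution rate of each component: eigenvalue / sum of eigenvalues."""
    total = sum(eigenvalues)
    return [lam / total for lam in sorted(eigenvalues, reverse=True)]


def components_needed(rates, target=0.95):
    """Smallest p such that the first p rates sum to more than target."""
    cumulative = 0.0
    for p, rate in enumerate(rates, start=1):
        cumulative += rate
        if cumulative > target:
            return p
    return len(rates)


# Hypothetical eigenvalues of the covariance matrix C.
eigenvalues = [6.2, 3.4, 0.3, 0.1]
rates = contribution_rates(eigenvalues)
p = components_needed(rates)
print(rates, p)
```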
8. The method for detecting polycystic ovary syndrome by electronic nose according to claim 1, wherein: in the step S5, the method of performing cross-validation on the new feature sample set $X_{m\_pca}$ by the leave-one-out method comprises the following steps:
S5.1, dividing the new feature sample set $X_{m\_pca}$ into k independent subsets $X_1, X_2, \dots, X_k$, where k is the number of data items in the feature sample set;
S5.2, taking subset $X_1$ as the test set and the remaining k-1 subsets as the training set to complete the first training;
S5.3, selecting the next subset $X_2$ as the test set and the remaining k-1 subsets as the training set to complete the second training;
S5.4, selecting each subset in turn as the test set with the remaining k-1 subsets as the training set, and repeating the training k times;
S5.5, averaging the k recognition rates to obtain the leave-one-out cross-validation result;
S5.6, dividing the new feature sample set $X_{m\_pca}$ into a training sample set $X_{train\_pca}$ and a test sample set $X_{test\_pca}$ in a ratio of 7:3.
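By way of illustration only, the leave-one-out loop of steps S5.1 to S5.5 and the 7:3 random split of step S5.6 can be sketched as follows. To keep the sketch self-contained, a nearest-centroid classifier is used as a stand-in; it is not the support vector machine of step S6. The data, the seed and all function names are hypothetical.

```python
import random


def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier: assign x to the class with the closest mean vector."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [v for v, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(xs, ys):
    """Hold out each sample once, train on the rest, average the hit rate."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)


def split_7_3(xs, ys, seed=0):
    """Randomly split the samples into 70% training and 30% test."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = (7 * len(idx)) // 10
    train, test = idx[:cut], idx[cut:]
    return ([xs[i] for i in train], [ys[i] for i in train],
            [xs[i] for i in test], [ys[i] for i in test])


# Hypothetical 2-D features after PCA (0 = healthy, 1 = PCOS).
xs = [[-1.0, -0.8], [-1.2, -1.1], [-0.9, -1.0], [-1.1, -0.7], [-0.8, -1.2],
      [1.0, 0.9], [1.1, 1.2], [0.8, 1.0], [1.2, 0.8], [0.9, 1.1]]
ys = [0] * 5 + [1] * 5

loo_accuracy = leave_one_out_accuracy(xs, ys)
x_train, y_train, x_test, y_test = split_7_3(xs, ys)
print(loo_accuracy, len(x_train), len(x_test))
```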
9. The method for detecting polycystic ovary syndrome by electronic nose according to claim 1, wherein: the step S6 specifically comprises the following sub-steps:
S6.1, selecting a radial basis function as the classification kernel function of the support vector machine algorithm;
S6.2, letting the feature data be n-dimensional with l groups of data in total, i.e. $(x_1, y_1), \dots, (x_l, y_l) \in R^n$; the decision plane is expressed as
wherein the decision plane has a weight coefficient, $g(x)$ represents a nonlinear mapping function, and $b$ represents a threshold value;
S6.3, in order to minimize the structural risk, the optimal classification hyperplane satisfies the following condition
introducing non-negative slack variables $\zeta_i$, the optimization problem is converted into
wherein $c$ represents a penalty factor used to control the complexity and generalization ability of the model;
by introducing the Lagrange multiplier method, the optimization problem is converted into its dual form
wherein $K(x_i, x_j) = g(x_i) \cdot g(x_j)$;
S6.4, introducing the RBF kernel function:
$K(x_i, x_j) = \exp(-g \|x_i - x_j\|^2)$
wherein $g$ represents the kernel parameter, used to control the range of the input space;
the optimization problem is converted into:
S6.5, optimizing the parameters $c$ and $g$ by a grid search method during model building; meanwhile, 5-fold cross-validation is adopted, with the minimum root mean square error on the training set as the fitness function for parameter optimization, and the parameters $c$ and $g$ obtained when the root mean square error reaches its minimum are the optimal parameters;
the training sample set is used for sample training, and hyperparameter tuning is performed by the grid search method to obtain the optimal parameter combination;
the support vector machine model with the optimal parameter combination is taken as the trained machine learning classification model, and the test sample set is then analyzed and predicted under the trained machine learning classification model.
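By way of illustration only, and assuming NumPy and scikit-learn are available, the grid search of step S6.5 can be sketched with `GridSearchCV` and `SVC(kernel="rbf")`, whose parameters `C` and `gamma` play the roles of the penalty factor $c$ and the kernel parameter $g$. The training data are synthetic and the grid values are hypothetical. With 0/1 labels, the root mean square error of the predicted labels is the square root of the misclassification rate, so ranking parameter pairs by accuracy (as done here) gives the same order as ranking by minimum root mean square error.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the training sample set: two 2-D Gaussian clusters
# (0 = healthy, 1 = PCOS). These are not the patent's data.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-1.0, 0.5, (30, 2)),
                     rng.normal(1.0, 0.5, (30, 2))])
y_train = np.array([0] * 30 + [1] * 30)

# Hypothetical search grid for the penalty factor c and kernel parameter g.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}

# 5-fold cross-validation over every (C, gamma) pair.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_c = search.best_params_["C"]
best_g = search.best_params_["gamma"]
print(best_c, best_g, search.best_score_)
```

After fitting, `search.best_estimator_` is the model refitted on the whole training set with the selected parameters and can be applied to the test sample set.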
10. The method for detecting polycystic ovary syndrome by electronic nose according to claim 1, wherein: the step S7 specifically comprises the following sub-steps:
S7.1, drawing an ROC curve from the predicted positive-class probability scores and the true labels of the training sample set, taking the scores in order from high to low as thresholds;
S7.2, according to the classification into polycystic ovary syndrome patients and healthy people obtained from the training sample set, dividing the test set into four categories: correctly predicted as PCOS patients, correctly predicted as healthy people, incorrectly predicted as PCOS patients, and incorrectly predicted as healthy people; and drawing the corresponding box plot;
S7.3, calculating the area under the ROC curve, denoted AUC, which lies between 0 and 1 and is used for evaluating the quality of the classifier;
S7.4, finding the threshold corresponding to the point on the ROC curve with the largest difference between the ordinate (TPR) and the abscissa (FPR), wherein the Youden index is calculated as follows:
index = 1 + TPR - FPR
wherein TPR is the proportion of all positive samples that are correctly identified as positive, FPR is the proportion of all negative samples that are incorrectly predicted as positive, and the probability value corresponding to the maximum of the Youden index is the optimal threshold.
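By way of illustration only, the four-way grouping of step S7.2 can be sketched as follows: the optimal threshold is applied to the predicted probabilities of the test set, and each sample's probability is placed into one of the four groups from which a box plot could then be drawn. The threshold 0.441 is the value reported in the description; the probabilities and labels are hypothetical, and treating a probability equal to the threshold as positive is an assumption.

```python
def blind_classify(probabilities, labels, threshold):
    """Group predicted probabilities into TP, TN, FP and FN.

    TP: correctly predicted as PCOS; TN: correctly predicted as healthy;
    FP: incorrectly predicted as PCOS; FN: incorrectly predicted as healthy.
    """
    groups = {"TP": [], "TN": [], "FP": [], "FN": []}
    for prob, label in zip(probabilities, labels):
        predicted = 1 if prob >= threshold else 0  # assumption: >= is positive
        key = ("T" if predicted == label else "F") + ("P" if predicted else "N")
        groups[key].append(prob)
    return groups


# Hypothetical test-set probabilities (1 = PCOS, 0 = healthy).
probabilities = [0.91, 0.62, 0.30, 0.48, 0.12, 0.44]
labels = [1, 1, 1, 0, 0, 0]

groups = blind_classify(probabilities, labels, threshold=0.441)
accuracy = (len(groups["TP"]) + len(groups["TN"])) / len(labels)
print(groups, accuracy)
```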
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311111199.0A CN117147813A (en) | 2023-08-29 | 2023-08-29 | A method for detecting polycystic ovary syndrome using electronic nose |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117147813A true CN117147813A (en) | 2023-12-01 |
Family
ID=88886263
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311111199.0A Pending CN117147813A (en) | 2023-08-29 | 2023-08-29 | A method for detecting polycystic ovary syndrome using electronic nose |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117147813A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119310139A (en) * | 2024-08-29 | 2025-01-14 | 合肥工业大学 | A method for distinguishing tea leaves treated with different ultraviolet light using an electronic nose |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||